using regular expression to find polysyllabic words - javascript

I'm trying to use a regular expression to count the polysyllabic words in a piece of text. My code works most of the time but misses some of the polysyllabic words:
polySyllableCount = lWords2.replace(/(?:[^laeiouy\s]es|ed|[^laeiouy\s]e)$/, '');
is what I use to count the syllables, and
polySyllableCount = lWords2.replace(/^y/, '');
to replace the leading Y's so they are not counted,
and finally:
try
{
polySyllables = polySyllableCount.match(/[aeiouy]\S[aeiouy]\S[aeiouy]/g).length;
}
catch(err)
{
console.log("No Poly Words")
}
to count the number of polysyllabic words.
My thought process is that it will find any 3 vowels in a (modified) word, separated by anything except a whitespace, to give me the number of polysyllabic words

Please note that \S also matches punctuation marks such as . and , which can cause some mis-detection. For example:
'ame.na mana miu' //'ame.na' will be treated like one word with your regexp
You can replace \S with \w for better results. Of course \w also matches digits (and the underscore), so if you want to be really accurate, use [a-z]. Also, you are using only the g flag. Add the i flag as well so that uppercase AEIOUY are matched too, giving:
/...regexp.../gi
You can learn more here: javascriptkit.com/javatutors/redev2.shtml
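To make the difference concrete, here is a minimal sketch comparing the original pattern with a letters-only, case-insensitive variant. The sample strings and the helper name countMatches are made up for illustration; the helper also replaces the try/catch, since match() returns null when nothing matches.

```javascript
// Asker's pattern: \S lets punctuation sit between the vowels.
const loose = /[aeiouy]\S[aeiouy]\S[aeiouy]/g;
// Suggested variant: only letters between the vowels, case-insensitive.
const strict = /[aeiouy][a-z][aeiouy][a-z][aeiouy]/gi;

// Hypothetical helper: match() returns null when there is no match,
// so fall back to an empty array instead of catching an exception.
function countMatches(text, pattern) {
  return (text.match(pattern) || []).length;
}

console.log(countMatches('a.e.i', loose));     // 1: the dots satisfy \S
console.log(countMatches('a.e.i', strict));    // 0: dots are not letters
console.log(countMatches('ELEVATOR', loose));  // 0: uppercase vowels are missed
console.log(countMatches('ELEVATOR', strict)); // 1: the i flag picks up "ELEVA"
```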


Regex replace not removing characters properly

I have the regular expression:
const regex = /^\d*\.?\d{0,2}$/
and its inverse (I believe) of
const inverse = /^(?!\d*\.?\d{0,2}$)/
The first regex is validating the string fits any positive number, allowing a decimal and two decimal digits (e.g. 150, 14., 7.4, 12.68). The second regex is the inverse of the first, and doing some testing I'm fairly confident it's giving the expected result, as it only validates when the string is anything but a number that may have a decimal and two digits after (e.g. 12..05, a5, 54.357).
My goal is to remove any characters from the string that do not fit the first regex. I thought I could do that this way:
let myString = '123M.45';
let fixed = myString.replace(inverse, '');
But this does not work as intended. To debug, I tried having the replace character changed to something I would be able to see:
let fixed = myString.replace(inverse, 'ZZZ');
When I do this, fixed becomes: ZZZ123M.45
Any help would be greatly appreciated.
I think I understand your logic here trying to find a regex that is the inverse of the regex that matches your valid string, in the hopes that it will allow you to remove any characters that make your string invalid and leave only the valid string. However, I don't think replace() will allow you to solve your problem in this way. From the MDN docs:
The replace() method returns a new string with some or all matches of a pattern replaced by a replacement.
In your inverse pattern you are using a negative lookahead. If we take a simple example of X(?!Y) we can think of this as "match X if not followed by Y". In your pattern your "X" is ^ and your "Y" is \d*\.?\d{0,2}$. From my understanding, the reason you are getting ZZZ123M.45 is that it is finding the first ^ (i.e., the start of the string) that is not followed by your pattern \d*\.?\d{0,2}$, and since 123M.45 doesn't match your "Y" pattern, your negative lookahead is satisfied and the beginning of your string is matched and "replaced" with ZZZ.
That (I think) is an explanation of what you are seeing.
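A short sketch that makes this visible: the inverse pattern matches an empty string at index 0, so replace() has nothing to remove and can only insert. The variable names other than inverse are just for illustration.

```javascript
const inverse = /^(?!\d*\.?\d{0,2}$)/;

// The pattern consumes no characters: it matches the empty string at index 0.
const m = inverse.exec('123M.45');
console.log(m.index, JSON.stringify(m[0])); // 0 ""

// Replacing an empty match inserts text rather than removing anything.
const replacedInvalid = '123M.45'.replace(inverse, 'ZZZ');
console.log(replacedInvalid); // ZZZ123M.45

// A valid string fails the negative lookahead, so nothing is replaced.
const replacedValid = '12.68'.replace(inverse, 'ZZZ');
console.log(replacedValid); // 12.68
```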
I would propose an alternative solution to your problem that better fits with how I understand the .replace() method. Instead of your inverse pattern, try this one:
const invalidChars = /[^\d\.]|\.(?=\.)|(?<=\.\d\d)\d*/g
const myString = '123M..456444';
const fixed = myString.replace(invalidChars, '');
Here I am using a pattern that I think will match the individual characters that you want to remove. Let's break down what this one is doing:
[^\d\.]: match characters that are neither digits nor a . character
\.(?=\.): match . character if it is followed by another . character.
(?<=\.\d\d)\d*: match digits that are preceded by a decimal and 2 digits
Then I join all these with ORs (|) so it will match any one of the above patterns, and I use the g flag so that it will replace all the matches, not just the first one.
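As a quick check of the three branches, the sketch below runs the pattern on the sample string and on one input it does not repair. The function name stripInvalid and the second input are assumptions added for illustration; note that the lookbehind requires a JS engine that supports it.

```javascript
const invalidChars = /[^\d\.]|\.(?=\.)|(?<=\.\d\d)\d*/g;

// Hypothetical wrapper around the replace call from the answer.
function stripInvalid(input) {
  return input.replace(invalidChars, '');
}

// "M" is dropped, the first of the two dots is dropped,
// and the digits after the second decimal place are dropped.
const cleaned = stripInvalid('123M..456444');
console.log(cleaned); // 123.45

// Dots separated by digits are not covered by any branch, so they survive.
const untouched = stripInvalid('1.2.3');
console.log(untouched); // 1.2.3
```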
I am not sure if this will cover all your use cases, but I thought I would give it a shot. Here's a link to a breakdown that might be more helpful than mine, and you can use this tool to tweak the pattern if necessary.
I don't think you can do this
remove any characters from the string that do not fit the first regex
Because regex matching is meant for the entire string, and replace is used to replace just a PART inside that string. So the Regex inside replace must be a Regex to match unwanted characters only, not inverted Regex.
What you could do is to validate the string with your original regex, then if it's not valid, replace and validate again.
//if (notValid), replace unwanted character
// replace everything that's not a dot or digit
const replaceRegex = /[^\d.]/g; // notice g flag here to match every occurrence
const myString = '123M.45';
const fixed = myString.replace(replaceRegex, '');
console.log(fixed)
// validate again
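Put together, the validate, replace, validate-again flow could look like the sketch below. The function name sanitize and the choice to return null when the cleaned string is still invalid are assumptions, not something the answer prescribes.

```javascript
const valid = /^\d*\.?\d{0,2}$/;
const unwanted = /[^\d.]/g; // everything that is not a digit or a dot

// Returns a valid string, or null when stripping characters was not enough.
function sanitize(input) {
  if (valid.test(input)) return input;        // already valid, nothing to do
  const fixed = input.replace(unwanted, '');  // remove unwanted characters
  return valid.test(fixed) ? fixed : null;    // validate again
}

console.log(sanitize('123M.45')); // 123.45
console.log(sanitize('7.4'));     // 7.4
console.log(sanitize('12a..5'));  // null, because "12..5" is still invalid
```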

RegEx: Understanding Syllable Counter Code

I have used Dylan's question on here regarding JavaScript syllable counting, and more specifically artfulhacker's answer, in my own code and, regardless of which single or multi word string I feed it, the function is always able to correctly count the number of syllables.
I have limited experience with RegEx and not enough prior knowledge to decipher what exactly is happening in the following code without some help. I'm not someone who is ever happy with having some code I pulled from somewhere just work without me knowing how it works. Is someone able to please articulate what is happening in the new_count(word) function below and help me decipher the use of RegEx and how it is that the function is able to correctly count syllables? Many thanks.
function new_count(word) {
  word = word.toLowerCase(); //word.downcase!
  if(word.length <= 3) { return 1; } //return 1 if word.length <= 3
  word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, ''); //word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
  word = word.replace(/^y/, ''); //word.sub!(/^y/, '')
  return word.match(/[aeiouy]{1,2}/g).length; //word.scan(/[aeiouy]{1,2}/).size
}
As far as I see it, we basically want to count the vowels, or vowel pairs, with some special cases. Let's start by the last line, which does that, i.e. count vowels and pairs:
return word.match(/[aeiouy]{1,2}/g).length;
This will match any vowel, or vowel pair. [...] denotes a character class: as we go through the string character by character, we have a match if the current character is one of those listed. {1,2} is the repetition count (written without a space): it means that we should match one or two such characters in a row.
The other two lines are for special cases.
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '');
This line will remove 'syllables' from the end of the word, which are either:
Xes (where X is anything but any of 'laeiouy', e.g. 'zes')
ed
Xe (where X is anything but any of 'laeiouy', e.g. 'xe')
(I'm not really sure what the grammatical meaning behind this is, but I guess, that 'syllables' at the end of the word, like '-ed', '-ded', '-xed' etc. don't really count as such.)
As for the regexp part: (?:...) is a non-capturing group. I guess it's not really important in this case that this group be non-capturing; this just means that we would like to group the whole expression, but then we do not need to refer back to it. However, we could have used a capturing group too (i.e. (...) )
The [^...] is a negated character class. It means, match any character, which is none of those listed here. (Compare to the (non-negated) character-class mentioned above.)
The pipe symbol, i.e. |, is the alternation operator, which means, that any of the expressions can match.
Finally, the $ anchor matches the end of the line, or string (depending on the context).
word = word.replace(/^y/, '');
This line removes 'y'-s from the beginning of words (probably 'y' at the beginning does not count as a syllable -- which makes sense in my opinion).
^ is the anchor for matching the beginning of the line, or string (cf. $ mentioned above).
Note: the algorithm only works if word really contains one single word.
/(?:[^laeiouy]es|ed|[^laeiouy]e)$/
That matches three possible substrings: a letter other than 'l' or a vowel followed by 'es' (like "res" or "tes"); 'ed'; or a non-vowel, non-'l' followed by just an 'e'. Those patterns must appear at the end of the word to match because of the $ at the end of the pattern. The grouping (?: ) is just a grouping; the leading ?: makes that distinction. The pattern could have been a little shorter:
/(?:[^laeiouy]es?|ed)$/
would do the same thing. In any case, if the pattern matches the characters involved are removed from the word.
Then,
/^y/
matches a 'y' at the beginning of a word. If a 'y' is found, it's removed.
Finally,
/[aeiouy]{1,2}/g
matches any one- or two-character stretch of vowels (including 'y'). The g suffix makes it a global match, so that the return value is an array consisting of all such spans of vowels. The length of that returned array is the number of syllables (according to this technique).
Note that the words "poem" and "lion" would be reported as one-syllable words, which may be correct for some English variants but not all.
Here is a pretty good reference for JavaScript regular expression operators.
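To see the three steps at work, the following sketch returns the intermediate values for a few words. The function name traceSyllables and the "|| []" fallback (which avoids an exception when a word has no vowels) are additions for illustration and not part of the original function.

```javascript
// Same three steps as new_count, but exposing the intermediate results.
function traceSyllables(word) {
  const lower = word.toLowerCase();
  if (lower.length <= 3) return { stripped: lower, groups: [], count: 1 };
  const stripped = lower
    .replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '') // drop silent endings
    .replace(/^y/, '');                              // drop a leading y
  // Assumption: treat "no vowel groups" as 0 instead of throwing on null.
  const groups = stripped.match(/[aeiouy]{1,2}/g) || [];
  return { stripped, groups, count: groups.length };
}

for (const w of ['jumped', 'syllable', 'yellow', 'poem']) {
  console.log(w, JSON.stringify(traceSyllables(w)));
}
// jumped   -> stripped "jump",     groups ["u"],           count 1
// syllable -> stripped "syllable", groups ["y","a","e"],   count 3
// yellow   -> stripped "ellow",    groups ["e","o"],       count 2
// poem     -> stripped "poem",     groups ["oe"],          count 1
```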

Splitting a string into words and keeping delimiter

I want to split up a string (sentence) in an array of words and keep the delimiters.
I have found and I am currently using this regex for this:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
An explanation can be found here: http://regex101.com/
This works exactly as I want it to and effectively makes a string like
This is a sentence.
To an array of
["This", "is", "a", "sentence."]
The problem here is that it does not include spaces nor newlines. I want the string to be parsed as words as it already does but I also want the corresponding space and or newline character to belong to the previous word.
I have read about positive lookahead that should look for future characters (space and or newline) but still take them into account when extracting the word. Although this might be the solution I have failed to implement it.
If it makes any difference I am using JavaScript and the following code:
//save the regex -- g modifier to get all matches
var reg = /[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)/g;
//define variable for holding matches
var matches;
//loop through each match
while(matches = reg.exec(STRING_HERE)){
//the word without spaces or newlines
console.log(matches[0]);
}
The code works but as I said, it does not include spaces and newline characters.
You can try something simpler:
str.split(/\b(?!\s)/);
However, note that non-word characters (e.g. a full stop) will be treated as a separate word:
"This is a sentence.".split(/\b(?!\s)/);
// [ "This ", "is ", "a ", "sentence", "." ]
To fix that, you can use a character class with the characters that shouldn't begin another word:
str.split(/\b(?![\s.])/);
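A small runnable comparison of the two patterns above, using the sentence from the question plus a made-up two-line string to show that newlines stay attached to the preceding word as well:

```javascript
const sentence = 'This is a sentence.';

// Split at every word boundary that is not followed by whitespace.
const withDot = sentence.split(/\b(?!\s)/);
console.log(withDot); // [ 'This ', 'is ', 'a ', 'sentence', '.' ]

// Also refuse to split right before a full stop.
const dotKept = sentence.split(/\b(?![\s.])/);
console.log(dotKept); // [ 'This ', 'is ', 'a ', 'sentence.' ]

// Newlines are whitespace too, so they stay with the previous word.
const lines = 'one\ntwo'.split(/\b(?![\s.])/);
console.log(lines); // [ 'one\n', 'two' ]
```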
function split_string(str){
  var arr = str.split(" ");
  var last_i = arr.length - 1;
  for(var i=0; i<last_i; i++){
    arr[i]+=" ";
  }
  return arr;
}
It may be as simple as this:
var sentence = 'This is a sentence.';
sentence = sentence.split(' ').join(' ||');
sentence = sentence.split('\n').join('\n||');
var matches = sentence.split('||');
Note that I use 2 pipes as a delimiter, but of course you can use anything as long as it's unique.
Also note that I only split \n as a newline, but you may add \r\n or whatever you want to split as well.
General Solution
To keep the delimiters conjoined in the results, the regex needs to be a zero-width match. In other words, the regex can be thought of as matching the point between a delimiter and non-delimiter, rather than matching the delimiters themselves. This can be achieved with zero-width matching expressions, matching before, at, or after the split point (at most one each); let's call these A, B, and C. Sometimes a single sub-expression will do it, others you'll need two; offhand, I can't think of a case where you'd need three.
Not only lookaheads but lookarounds in general are the perfect candidates for this purpose: lookbehinds ((?<=...)) to match before the split point, and lookaheads ((?=...)) after. That's the essence of this approach. Positive or negative lookarounds can be used. The one pitfall is that lookbehinds are relatively new to JS regexes, so not all browsers or other JS engines support them (at the time of writing, current versions of Firefox, Chrome, Opera, Edge, and node.js do; Safari does not). If you need to support a JS engine without lookbehinds, you might still be able to write and use a regex that matches at-and-after (BC).
To have the delimiters appear at the end of each match, put them in A. To have them at the start, in C. Fortunately, JS regexes do not place restrictions on lookbehinds, so simply wrapping the delimiter regex in the positive lookaround markers should be all that's required for delimiters. If the delimiters aren't so simple (i.e. context-sensitive), it might take a little more work to write the regex, which doesn't need to match the entire delimiter.
Paired with the delimiter pattern, you'll need to write a pattern that matches the start (for C) or end (for A) of the non-delimiter. This step is likely the one that will require the most additional work.
The at-split-point match, B, will often (always?) be a simple boundary, such as \b.
Specific Solution
If spaces are the only delimiters, and they're to appear at the end of each match, the delimiter pattern would be (?<=\s), in A. However, there are some cases not covered in the problem description. For example, should words separated by only punctuation (e.g. "x.y") be split? On which side of a split point should quotation marks and hyphens appear, if any? Should they count as punctuation? Another option for the delimiter is to match (after) all non-word characters, in which case A would be (?<=\W).
Since the split-point is at a word boundary, B could be \b.
Since the start of a match is a word character, (?=\w) will suffice for C.
Any two of those three should suffice. One that is perhaps clearest in meaning (and splits at the most points) is /(?<=\W)(?=\w)/, which can be translated as "split at the start of each word". \b could be added, if you find it more understandable, though it has no functional effect: /(?<=\W)\b(?=\w)/.
Note Oriol's excellent solutions are given by B=\b and (C=(?!\s) or C=(?![\s.])).
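A minimal sketch of the A-plus-C combination, assuming a JS engine with lookbehind support. The sample text is made up; the point is that every space or newline stays attached to the word before it and that joining the pieces restores the input.

```javascript
const text = 'This is a sentence.\nThis is another sentence.';

// A = (?<=\W): a non-word character sits just before the split point.
// C = (?=\w):  a word character sits just after it.
const startOfWord = /(?<=\W)(?=\w)/;
const parts = text.split(startOfWord);
console.log(parts);
// [ 'This ', 'is ', 'a ', 'sentence.\n', 'This ', 'is ', 'another ', 'sentence.' ]

// Adding B = \b in the middle does not change the result.
const withBoundary = text.split(/(?<=\W)\b(?=\w)/);
console.log(withBoundary.length === parts.length); // true
```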
Additional
As a point of interest, there would be a simpler solution for this particular case if JS regexes supported TCL word boundaries: \m matches only at the start of a word, so str.split(/\m/) would split exactly at the start of each word. (\m is equivalent to (?<=\W)(?=\w).)
If you want to include the whitespace after the word, the regex \S+\s* should work.
const s = `This is a sentence.
This is another sentence.`;
console.log(s.match(/\S+\s*/g))

Regex with dynamic length

I have string with 2 or 3 words:
'apple grape lemon'
'apple grape'
I need to get first char from all words.
my regex:
/^(\w).*?\ (\w).*?\ ?(\w?).*?$/
For every string, this regex gets the first char of only two words.
How to fix?
You cannot do this with one regex (unless you are using .NET). But you can use a regex that matches one first character of a word, then get all the matches, and join them together:
var firstLetters = '';
var match = str.match(/\b\w/g);
if (match)
  firstLetters = match.join('');
Of course if you just want to get the letters on their own, there is no need for the join, since the match will simply be an array containing all those letters.
You should note that \w matches not only letters but digits and underscores, too.
If you work with javascript, you don't need to regex the hell out of a simple problem.
To get the first letter, just do that:
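var aString = 'apple bee plant';
var anArray = aString.split(' ');
// for...in would iterate over the indices ("0", "1", ...), so loop by index instead
for (var i = 0; i < anArray.length; i++) {
  var firstLetter = anArray[i].charAt(0);
}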
Regular expressions describe a regular language, so you cannot have this kind of repetition in them. What you want is to cut the string into individual tokens (which can be done via a regular expression that matches the separator) and then apply a regular expression to each token. To get the first char from each word it is faster to use a substring operation instead of a regular expression.
The problem with your regex is that everything after the second captured character is optional: the lazy .*? matches nothing, the optional space is skipped, and (\w?) then simply takes the next character of the second word, while the final .*? consumes the rest. This could be solved, but I personally think it makes things more complicated than required.
The simplest way would be:
var m = str.match(/\b\w/g);
var firstLetters = m ? m.join('') : '';
In regexp, "words" don't mean only letters. In JavaScript \w is equivalent to [A-Za-z0-9_]. So if you want only letters in your result, you can use [A-Za-z].
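The sketch below shows what the original pattern actually captures and then a letters-only version of the match-and-join approach. The function name firstLetters and the extra test strings are assumptions for illustration.

```javascript
// The asker's pattern: the third group ends up holding the second
// character of the second word, not the first character of the third.
const original = /^(\w).*?\ (\w).*?\ ?(\w?).*?$/;
const groups = 'apple grape lemon'.match(original).slice(1);
console.log(groups); // [ 'a', 'g', 'r' ]

// Match the first letter of every word instead, however many words there are.
function firstLetters(str) {
  const m = str.match(/\b[A-Za-z]/g); // null when there is no match
  return m ? m.join('') : '';
}

console.log(firstLetters('apple grape lemon')); // agl
console.log(firstLetters('apple grape'));       // ag
console.log(firstLetters(''));                  // (empty string)
```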

jquery REGEX for longstring with azAZ-09-specialchars and |

Hello, can someone help me with a jQuery regex?
I've been stuck here since last night and have finally decided to ask for help :)
Anyway, here's my regex, and the string is in the exg variable. I want to split the string into matches like these:
matches[0] = 'eNortjI0sLBScgQDz3yTfK98XCdH59RKc4M8&+SSXFzzXFz3UE9H9yzfYMfCYtPiDLes0NSAXCL3nIj0osJcIvNCjwxLv6z8YhPTXFxv8&KSMNekjIrgqqzQvOJyy0zXNPMoZ4vS0PQS4+S0&IIgU7OssPIolygXJWtcMMMFXCch|eNortjI0sLRScgQDz3yTfK98XCfHXDBDjzx3X4&cXCLXygKn4tzsNCNcJ+NMk+xEM6Ok&OIq1+DcXFxLw8AwjyhHb480lyxTg&LkKv8sXw&zpCSnJE+XYo&EVH&3yKyAsMjEtKxSi4CIqlwigwL&giinXDC3wCiXKBcla1wwpPEmEA==|';
matches[1] = 'eNortjI0NLJScgQDz3yTfK98XCdH9yCPZJ&CiCpD36xcMMuSsLwox6qAwMqkUAPTlChHI8ugvDQL9zzjbBMfT8u8RIOgMgvnHJ9SpzynvFDfQAugijBLv6CgXDBT&0LzKMdI06BIf9OyKGd&U58kN19fV8colygXJWtcMNaqJP8=|';
var regex = /[a-zA-Z]+[0-9]+[/-=&_]+|/g;
var exg = 'eNortjI0sLBScgQDz3yTfK98XCdH59RKc4M8&+SSXFzzXFz3UE9H9yzfYMfCYtPiDLes0NSAXCL3nIj0osJcIvNCjwxLv6z8YhPTXFxv8&KSMNekjIrgqqzQvOJyy0zXNPMoZ4vS0PQS4+S0&IIgU7OssPIolygXJWtcMMMFXCch|eNortjI0sLRScgQDz3yTfK98XCfHXDBDjzx3X4&cXCLXygKn4tzsNCNcJ+NMk+xEM6Ok&OIq1+DcXFxLw8AwjyhHb480lyxTg&LkKv8sXw&zpCSnJE+XYo&EVH&3yKyAsMjEtKxSi4CIqlwigwL&giinXDC3wCiXKBcla1wwpPEmEA==|eNortjI0NLJScgQDz3yTfK98XCdH9yCPZJ&CiCpD36xcMMuSsLwox6qAwMqkUAPTlChHI8ugvDQL9zzjbBMfT8u8RIOgMgvnHJ9SpzynvFDfQAugijBLv6CgXDBT&0LzKMdI06BIf9OyKGd&U58kN19fV8colygXJWtcMNaqJP8=|eNodwdEKgjAUXDDQf&ELnLk57GnXJaiQq4do923YSuXqQKNgXx90zl4yxspE&TUhD21cMNVmqzQkKbdYGVQ6rfzrIy9+nEThYPBvhLU2bpezFs&YSw4H2xdEj+t4mzoVz8Rhuy&i1KTL4BCIx5mcd1tt7Bc16uT4A7goJkI=|eNodyN0KwiAUXDDgd9kb2CzGujoqyeaJFjmKc9fYMCGxkLWfpy&6Lr9UMsbLDP6qyGMdBZzOFRsmq3RYXYN8tW8kUQRdyJ7k4SNq2fntbJwP7QWQ5HOcSeQEKcd0TGzB3XXhY2&wV&6h7a27b17KGQImm9ZOpEhl+y9eOlwnaQ==|eNortjI0NLBScgQDz3yTfK98XCdHv3J3b9P80rwo54pwTy+DgkR3M3eXTMdAY6ekjChHD4PcFJ+KNIMg54Bko9IKxxC&XFyTQl+niqLc7NAkV6PIXCKjgFCD0FQTc4vSCuf0JP+SJIPcspKsKEdj48ys9CiXKBcla1wwr5gmBA==|eNortjI0sLBScgQDz3yTfK98XCdHp1KniMLwiNKglFxc71L3wqrAPEe3UF&PHMuq9BDPXDCflKTkkJDQCIM0s&B8rwxLR5co5ySXbJ9SY7dSH&+SyAqL7JQkV3cPX&NKc9OilKoU51ST&CQ&k6Ki0vIolygXJWtcMKhcMCZa|eNortjI0MLJScgQDz3yTfK98XCdHP5MoxySjJMfIFL&CEsuSUKNQv4LiSre8KKco56IUT3&L0CyfgMIox&IoXCevKFwns&R011THgMKsfLcoZ++MQo&0cNeyIPOqkPTIqrCqCstcMM+C8ChHz8goFyVrXDCmkyQg|eNortjI0NLJScgQDz3yTfK98XCdH&4z8yCin0szUgNwo54zkkBDTJB&TrIBA06S8XFxvr7SKyuDsKOeIZOPcsFA&&4woZ49UU&ekIKMq&7DURCOLimRjX9eKSmdDxxwXZ4&wjDT&lLDsJI&iqpB01yhcJ4v0KJcoFyVrXDA8ZVwnTQ==|eNortjI0MrBScgQDz3yTfK98XCdHLy&LSLfMyOxcIvfIioJit9Lc9NC04NK8NEvncHe&9Byv5MRwy8pC79ywgLT0SJ+qjORCC5PK0oh8i8yU0gpPR9PK7BJXn9SCxDznsEQXb7e8LLMCp+L8tMTyKJcoF7XUioLMotTi+Mw8WwMla1ww8SosXw==|eNortjI0NLRScgQDz3yTfK98XCdHr0pcJ+PAdP9sc+NcIvcyY4PEoNywyuyKzOQsC8twDx9Hg7TUkHyv7BDjMtPg3LwcjyinqlwiNxcXXCeXKFwn7wggEWmck59YHJhZ5Z4ZmF4R5eSfaFnkWuzpVFRglpxcXB7lEuWiZA1cMO+XJv4=|eNortjI0MLdScgQDz3yTfK98XCdHr7LgEpekEmO&xIBgo5CMoOSSKu&QDKPK5OziojS&&Bwvk&BK4yqLCF
OjTNNcXC9Tz8C0jLzSkuCMYI88v8KIwORQi&TMQLM8i&B8Sy9Pl7QqXwt&gxQvrzLHKJcoFyVrXDCNviXB|eNortjI0NLRScgQDz3yTfK98XCdHn4pCp5yQxNIkM8MQ97SI0PLkoIqy9DJvZ3PXgPDwgKSIKKfwPKfi&CqvtAqTTNf0YMdwS8cM&yjH7Mr0omTTsLQqg&RcXLOK9Oz00LDgwijHCOeqyuC8KFwnf2fvSMcolygXJWtcMD&GXCfp|eNortjI0NLBScgQDz3yTfK98XCdHF6&gcEdTr5SiNO8KY1wnk1KLIE&&yowox+JcMB&XlGyzQH8&gwKvyIjyEMuyNI+ktMSQ7DDzdAsTA&fyKMckl&KS9HDDgNRKL0+jkCintLCStFwin+LcRC+PXFxcJ7+qwCiXKBcla1wwqjUmMg=='
if(regex.test(exg)) {
var matches = exg.match(regex);
for(var match in matches) {
alert(matches[match]);
}
} else {
alert("No matches found!");
}
But my regex doesn't work. Can someone give me the right regex for it? :) Please help.
Elias's answer is probably the easiest way to do this, but if you insist on regex then how about this:
var regex = /[a-zA-Z0-9\/-=&_+]+\|{0,1}/g
Explanation of your regex and why it doesn't work:
[a-zA-Z]+ // Match one or more a-z upper or lower case
[0-9]+ // *THEN* match one or more 0-9
[/-=&_]+ // *THEN* match one or more of these characters
| // unescaped, so this is alternation with an empty right-hand side, not a literal pipe
The problem here is that the letters, numbers and symbols in your search string are mixed together. Therefore they all need to go inside square brackets together so you match one or more of all of them together in any order. Yours puts them in a specific order, letters first, then numbers, then symbols.
The {0,1} on the end matches either zero or one pipe and will therefore catch the last match which does not have a pipe at the end.
Incidentally, there's no such thing as a jQuery regex. The regex functions are JavaScript.
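As an illustration of the "everything in one character class" idea, here is a sketch on a short made-up string rather than the real data. The pattern is a tidied variant, not the exact one above: the hyphen is placed last so it is a literal instead of forming a range, and \|? is the same as \|{0,1}.

```javascript
// One class for every allowed character, then an optional escaped pipe.
const chunk = /[A-Za-z0-9+\/=&_-]+\|?/g;

// Hypothetical short stand-in for the long encoded string.
const sample = 'eNort+jI0&x==|Zz9/_Q=|tail';
const chunks = sample.match(chunk) || [];
console.log(chunks); // [ 'eNort+jI0&x==|', 'Zz9/_Q=|', 'tail' ]
```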
Erm... how about just using split, like so: matches = yourString.split('|');
This will return an array of strings. The pipe characters will not be included, but you can just concatenate them to each substring if you need them.
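A minimal sketch of the split approach with the pipes put back, using a short made-up string. Dropping the empty trailing piece and appending a pipe to every piece are assumptions about the desired output, so a final chunk that had no pipe in the input would gain one here.

```javascript
// Hypothetical short stand-in for the long encoded string.
const sample = 'first|second|third|';

const pieces = sample
  .split('|')                     // [ 'first', 'second', 'third', '' ]
  .filter(piece => piece !== '')  // drop the empty piece after the last pipe
  .map(piece => piece + '|');     // re-attach the delimiter

console.log(pieces); // [ 'first|', 'second|', 'third|' ]
```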
You've missed a backslash before |, so this may be what you want?
var regex = /[a-zA-Z0-9\/-=&_]+\|/g;
