RegEx: Understanding Syllable Counter Code


I have used Dylan's question on here regarding JavaScript syllable counting, and more specifically artfulhacker's answer, in my own code and, regardless of which single or multi word string I feed it, the function is always able to correctly count the number of syllables.
I have limited experience with RegEx and not enough prior knowledge to decipher what exactly is happening in the following code without some help. I'm not someone who is ever happy with having some code I pulled from somewhere just work without me knowing how it works. Is someone able to please articulate what is happening in the new_count(word) function below and help me decipher the use of RegEx and how it is that the function is able to correctly count syllables? Many thanks.
function new_count(word) {
word = word.toLowerCase(); //word.downcase!
if(word.length <= 3) { return 1; } //return 1 if word.length <= 3
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, ''); //word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
word = word.replace(/^y/, ''); //word.sub!(/^y/, '')
return word.match(/[aeiouy]{1,2}/g).length; //word.scan(/[aeiouy]{1,2}/).size
}

As far as I see it, we basically want to count the vowels, or vowel pairs, with some special cases. Let's start by the last line, which does that, i.e. count vowels and pairs:
return word.match(/[aeiouy]{1,2}/g).length;
This will match any vowel, or vowel pair. [...] means a character class, i.e. as we go through the string character by character, we have a match if the current character is one of those listed. {1,2} is the number of repetitions, i.e. it means that we should match one or two such characters in a row. (There must be no space inside the braces: in JavaScript, {1, 2} would not be treated as a quantifier.)
The other two lines are for special cases.
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '');
This line will remove 'syllables' from the end of the word, which are either:
Xes (where X is anything but any of 'laeiouy', e.g. 'zes')
ed
Xe (where X is anything but any of 'laeiouy', e.g. 'xe')
(I'm not really sure what the grammatical meaning behind this is, but I guess, that 'syllables' at the end of the word, like '-ed', '-ded', '-xed' etc. don't really count as such.)
As for the regexp part: (?:...) is a non-capturing group. I guess it's not really important in this case that this group be non-capturing; this just means that we would like to group the whole expression, but then we do not need to refer back to it. However, we could have used a capturing group too (i.e. (...) )
The [^...] is a negated character class. It means, match any character, which is none of those listed here. (Compare to the (non-negated) character-class mentioned above.)
The pipe symbol, i.e. |, is the alternation operator, which means, that any of the expressions can match.
Finally, the $ anchor matches the end of the line, or string (depending on the context).
word = word.replace(/^y/, '');
This line removes 'y'-s from the beginning of words (probably 'y' at the beginning does not count as a syllable -- which makes sense in my opinion).
^ is the anchor for matching the beginning of the line, or string (c.f. $ mentioned above).
Note: the algorithm only works if word really contains one single word.
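To make the three steps above concrete, here is a minimal sketch that runs the same regexes as new_count but keeps the intermediate values. The helper name traceSyllables and the sample words are my own, not part of the original code, and the `|| []` guard is an added assumption for words that have no vowel groups left.

```javascript
// Hypothetical helper (not from the original answer): same steps as new_count,
// but each intermediate value is kept so the effect of every regex is visible.
function traceSyllables(word) {
  const lower = word.toLowerCase();
  if (lower.length <= 3) {
    // Short words are counted as one syllable without any regex work.
    return { lower, trimmed: lower, groups: [], count: 1 };
  }
  const trimmed = lower
    .replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '') // drop silent endings
    .replace(/^y/, '');                              // drop a leading consonant y
  // match() returns null when nothing matches, so fall back to an empty array.
  const groups = trimmed.match(/[aeiouy]{1,2}/g) || [];
  return { lower, trimmed, groups, count: groups.length };
}

console.log(traceSyllables('syllable').count); // 3 ("le" is kept because of the l in the class)
console.log(traceSyllables('jumped').count);   // 1 ("ed" is stripped, leaving "jump")
console.log(traceSyllables('yellow').count);   // 2 (leading y is stripped, leaving "ellow")
```

Looking at the trimmed and groups fields for a few words is probably the quickest way to see why 'l' is excluded from the negated class: without it, "table" would lose its final "le" and be counted as one syllable.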

/(?:[^laeiouy]es|ed|[^laeiouy]e)$/
That matches three possible substrings: a letter other than 'l' or a vowel followed by 'es' (like "res" or "tes"); 'ed'; or a non-vowel, non-'l' followed by just an 'e'. Those patterns must appear at the end of the word to match because of the $ at the end of the pattern. The grouping (?: ) is just a grouping; the leading ?: makes that distinction. The pattern could have been a little shorter:
/(?:[^laeiouy]es?|ed)$/
would do the same thing. In any case, if the pattern matches the characters involved are removed from the word.
Then,
/^y/
matches a 'y' at the beginning of a word. If a 'y' is found, it's removed.
Finally,
/[aeiouy]{1,2}/g
matches any one- or two-character stretch of vowels (including 'y'). The g suffix makes it a global match, so that the return value is an array consisting of all such spans of vowels. The length of that returned array is the number of syllables (according to this technique).
Note that the words "poem" and "lion" would be reported as one-syllable words, which may be correct for some English variants but not all.
Here is a pretty good reference for JavaScript regular expression operators.
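As a quick check of the two claims above (that the shorter pattern behaves the same, and that "poem" and "lion" come out as one syllable), here is a small sketch. The word list is an arbitrary sample of my own, so it illustrates the equivalence rather than proving it.

```javascript
// Both suffix patterns from the answer above.
const original = /(?:[^laeiouy]es|ed|[^laeiouy]e)$/;
const shorter = /(?:[^laeiouy]es?|ed)$/;

// Sample words chosen for illustration only.
const words = ['makes', 'jumped', 'table', 'horse', 'tiles', 'poem', 'lion'];

// Compare what each pattern strips from the end of every word.
const sameResult = words.every(
  (w) => w.replace(original, '') === w.replace(shorter, '')
);
console.log(sameResult); // true for this sample

// Counts vowel groups the way the last line of new_count does.
const vowelGroups = (w) => (w.match(/[aeiouy]{1,2}/g) || []).length;
console.log(vowelGroups('poem'), vowelGroups('lion')); // 1 1
```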

Related

Javascript regular expression: Want to exclude function words in all caps

I have the following regular expression to parse google script formula to get precedents
([A-z]{2,}!)?:?\$?[A-Z]\$?[A-Z]?(\$?[1-9]\$?[0-9]?)?
I needed to make the numbers optional to cater to ranges that are entire columns- see image. Because the numbers are optional I am also matching items that are functions -all caps words- that I want to exclude. I suppose I could do this after the fact but I would like to modify the regex to exclude them. How do I do that?
Example:
=IFERROR(VLOOKUP($AA16,Account_List_S!$AA:$AC,3,0),0)
IFERROR(IF(AD3=1,INDEX(CapEx!$AB$15:$AE$15,1,YEAR(AD$13)-
YEAR($Z$13)-1)*IF(Import_CapEx!AD$15>=0,Import_CapEx!AD$15,0),0),0)";
The words I want to match refer to cells with an optional sheet name, and optional $ before the row or column identifier. They can be ranges or single cells.
Examples of words I want to match:
$AA16
$AB$15
AD$15
$Z$13
Account_List_S!$AA:$AC
CapEx!$AB$15:$AE$15
Import_CapEx!AD$15
The words I want to exclude are the functions:
IFERROR
VLOOKUP
IF
YEAR

Try this regex:
/[\(,+\-\*/><=]((\w+!)?\$?[A-Z]{1,2}(\$?[\d]{0,3})?(:\$?[A-Z]{1,2}(\$?\d{0,3})?)?(?=[\),+\-\*/><=]))/g
While a little long, this has the advantage that it will reject these when found in the formula:
Anything that has [A-Z] and [0-9] but not a column, e.g. ZIP50210
Anything that has [A-Z] and [0-9] but in the wrong order, e.g. 25E
Any variables like "AR" or 'JOHN'
Any constants in the formula like TRUE, FALSE or other argument values
Explanation:
[\(,+\-\*/><=] look for a starting literal ( or , or an operator like +,-,/,*,>,<,=. We expect column identifiers to start after one of these characters.
( now we start our matching group
(\w+!)? allow for optional sheet names like 'Account_List_S!'
\$?[A-Z]{1,2}(\$?[\d]{0,3})? will match columns like A or $B1 or $AB$12 or AB123
(:\$?[A-Z]{1,2}(\$?\d{0,3})?)? adds optional match for a range of columns, e.g. trailing :DD or :$C1 or :AC$1 or :AC123 or some such
(?=[\),+\-\*/><=]) lookahead for ending literal ) or , or operators like +,-,/,*,>,<,=. We expect column identifiers to end with these characters.
) close matching group
g global match (more than one instance)
Demo:
let regex = /[\(,+\-\*/><=]((\w+!)?\$?[A-Z]{1,2}(\$?[\d]{0,3})?(:\$?[A-Z]{1,2}(\$?\d{0,3})?)?(?=[\),+\-\*/><=]))/g;
let str = '=IFERROR(VLOOKUP($AA16,Account_List_S!$AA:$AC,3,0),0)IFERROR(IF(AD3=1,INDEX(CapEx!$AB$15:$AE$15,1,YEAR(AD$13)-YEAR($Z$13)-1)*IF(Import_CapEx!AD$15>=0,Import_CapEx!AD$15,0),0),0)";';
let arr = [];
let match; // declared explicitly so the loop does not create an implicit global
while ((match = regex.exec(str))) {
arr.push(match[1]); // we only want the first capturing group
}
console.log(arr);
/*
[ '$AA16',
'Account_List_S!$AA:$AC',
'AD3',
'CapEx!$AB$15:$AE$15',
'AD$13',
'$Z$13',
'Import_CapEx!AD$15',
'Import_CapEx!AD$15' ] */

This feels like a bad fit for a regular expression, but I can't pass up a good regex challenge.
My solution involves a lot of conditional checks:
(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$\w+)*(?!\()\b
Breakdown
(
\w+\! Words followed by an !
)? which might exist.
\$? A $ which might exist
[A-Z]{1,} At least 1 capitalized letter maybe more
(?:
\d+ A non capturing group of digits after our letters
)? but they might not exist
(
\:? A : which might exist
\$\w+ A $ followed by characters
)* With none or many of them
(?!\() All of this, ONLY IF we DONT have a ( after it
\b All of this, ONLY IF we have a word break
The magic really happens at the end with the conditional checks (the negative lookahead and the word boundary); without them you capture a lot of other stuff.
Sample
let text = `=IFERROR(VLOOKUP($AA165,Account_List_S!$AA:$AC,3,0),0)
IFERROR(IF(AD3=1,INDEX(CapEx!$AB$15:$AE$15,1,YEAR(AD$13)-
YEAR($Z$13)-1)*IF(Import_CapEx!AD$15>=0,Import_CapEx!AD$15,0),0),0)";`
let exp = /(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$\w+)*(?!\()\b/gm
let match;
while((match=exp.exec(text))) {
console.log(match[0]);
}
Output:
$AA165
Account_List_S!$AA:$AC
AD3
CapEx!$AB$15:$AE$15
AD$13
$Z$13
Import_CapEx!AD$15
Import_CapEx!AD$15
A simple change to the expression, making the $ after the : optional, makes it work for your added use case:
(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$?\w+)*(?!\()\b
let text = `$X74,Calc_Named_HC!AE$32:AE$103)-Calc_General_HC!AE74";`
let exp = /(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$?\w+)*(?!\()\b/gm
let match;
while((match=exp.exec(text))) {
console.log(match[0]);
}

First shot: filter out full upper case words
This answer is not perfect yet, but using a negative look-ahead at the beginning of the expression can allow you to filter out IF and any sequence of 3+ upper case letters:
(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?:?\$?\b[A-Z]\$?[A-Z]?(\$?[1-9]\$?[0-9]?)?\b
The \b in several places is to make sure the positive and negative matches go from the beginning of the letter sequence to the end.
The problem that remains is that it matches Account_List_S!$AA:$AC in two matches, Account_List_S!$AA and :$AC. So...
Second shot: fix the positive matching part of the regex
Here is a more complicated version that matches the ranges correctly:
EDIT: fixed to handle the examples given by OP in the comments.
(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?\$?\b[A-Z]{1,3}(\$?[1-9]{1,3})?(:\$?[A-Z]{1,3}(\$?\d{1,3})?)?\b
With this version, Account_List_S!$AA:$AC is matched as a whole, as I believe you want, and so is Calc_Named_HC!AE$32:AE$103 added in the comments below.
Third shot: accepts some spurious patterns but easier to read
If you are willing to accept matching a superfluous : before the first address, this simpler expression would work:
EDIT: fixed to handle the examples given in the comments.
(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?(:?\$?\b[A-Z]{1,3}(\$?\d{1,3})?){1,2}\b
Note that I kept your [A-z] range as is, but [A-Za-z_] might be more appropriate, as pointed out by @sp00m in his comment.
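To see the "second shot" expression in action, here is a minimal sketch that runs it over the first formula from the question. The g flag and the variable names are my additions; the pattern itself is copied unchanged from above.

```javascript
// "Second shot" pattern from the answer, with the g flag added so matchAll can be used.
const cellRef = /(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?\$?\b[A-Z]{1,3}(\$?[1-9]{1,3})?(:\$?[A-Z]{1,3}(\$?\d{1,3})?)?\b/g;

// First example formula from the question.
const formula = '=IFERROR(VLOOKUP($AA16,Account_List_S!$AA:$AC,3,0),0)';

// Collect the full text of every match; function names are rejected by the lookahead.
const refs = [...formula.matchAll(cellRef)].map((m) => m[0]);
console.log(refs); // [ '$AA16', 'Account_List_S!$AA:$AC' ]
```

One thing that stands out when experimenting: the first row part is written as [1-9]{1,3}, so a reference whose row contains a zero before the colon (for example $AA10) is not matched by this version. If that matters for your sheets, \d{1,3} is probably what you want there.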

Regex Expressions For Emoji

http://jsfiddle.net/bxeLyneu/1/
function custom() {
var str = document.getElementById('original').innerHTML;
var replacement = str.replace(/\B:poop:\B/g,'REPLACED');
document.getElementById('replaced').innerHTML = replacement;
}
custom()
Yes = :poop: should be replaced with "REPLACED"
No = :poop: should not be replaced. In other words, remain untouched.
Numbers 4, 5 and 6 don't seem to follow the rule provided. I do know why, but I don't have much idea how to combine multiple expressions into one.
I have tried many others but I just can't get them to work the way I want them to. The odds aren't in my favor.
And yes, this is very similar to how Facebook emoji in chat box works.
New issue:
http://jsfiddle.net/xaekh8op/13/
/(^|\s):bin:(\s|$)/gm
It is unable to scan and replace the one in the middle.
How can I fix that?

\B means "Any location not at a word boundary" whereas \s means "Whitespace". Based upon your given examples, the following code works perfectly.
function custom() {
var str = document.getElementById('original').innerHTML;
var replacement = str.replace(/([\s>]|^):poop:(?=[\s<]|$)/gm,'$1REPLACED');
document.getElementById('replaced').innerHTML = replacement;
}
custom()
http://jsfiddle.net/xaekh8op/15/
Explanation:
The regular expression ([\s>]|^):poop:(?=[\s<]|$) works as follows (the original answer illustrated it here with a diagram created in Debuggex):
By picking one of \s and > at the start (or using ^ meaning start of line), and grouping it as group 1, we can use it later. Similarly for after the :poop: (\s or < or end-of-line $). However, the second time, it is done using a look-ahead ((?= ...) is the syntax), which checks whether the [\s<]|$ portion is there after, but it doesn't consume it in the replacement. The < and > take care of any HTML tags that might be just beside the :poop:. The $1 in the replacement string $1REPLACED places the first group back, thereby rendering only the :poop: being replaced with REPLACED. The second "group" was just a look-ahead, and thus does not need to be replaced back.
For further information on word boundaries, you can refer to http://www.regular-expressions.info/wordboundaries.html which says:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
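The same replacement can be tried outside the browser. The sketch below wraps the regex in a plain function (replaceEmoji is a name I made up, and the sample strings are mine) and contrasts it with the consuming pattern from the "New issue" part of the question.

```javascript
// DOM-free version of the replacement above; replaceEmoji is a hypothetical name.
function replaceEmoji(text) {
  return text.replace(/([\s>]|^):poop:(?=[\s<]|$)/gm, '$1REPLACED');
}

console.log(replaceEmoji(':poop: :poop: :poop:')); // "REPLACED REPLACED REPLACED"
console.log(replaceEmoji('<b>:poop:</b>'));        // "<b>REPLACED</b>"
console.log(replaceEmoji('text:poop:'));           // "text:poop:" (unchanged)

// For contrast: when the trailing whitespace is consumed instead of looked ahead,
// the middle occurrence has no leading whitespace left to match.
const consuming = ':bin: :bin: :bin:'.replace(/(^|\s):bin:(\s|$)/gm, '$1X$2');
console.log(consuming); // "X :bin: X"
```

The last line shows why the lookahead matters: a space that has been consumed as the end of one match cannot also serve as the start of the next one.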

Splitting a string into words and keeping delimiter

I want to split up a string (sentence) in an array of words and keep the delimiters.
I have found and I am currently using this regex for this:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
An explanation can be found here: http://regex101.com/
This works exactly as I want it to and effectively makes a string like
This is a sentence.
To an array of
["This", "is", "a", "sentence."]
The problem here is that it does not include spaces nor newlines. I want the string to be parsed as words as it already does but I also want the corresponding space and or newline character to belong to the previous word.
I have read about positive lookahead that should look for future characters (space and or newline) but still take them into account when extracting the word. Although this might be the solution I have failed to implement it.
If it makes any difference I am using JavaScript and the following code:
//save the regex -- g modifier to get all matches
var reg = /[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)/g;
//define variable for holding matches
var matches;
//loop through each match
while(matches = reg.exec(STRING_HERE)){
//the word without spaces or newlines
console.log(matches[0]);
}
The code works but as I said, it does not include spaces and newline characters.

You can try something simpler:
str.split(/\b(?!\s)/);
However, note that non-word characters (e.g. a full stop) will be considered another word:
"This is a sentence.".split(/\b(?!\s)/);
// [ "This ", "is ", "a ", "sentence", "." ]
To fix that, you can use a character class with the characters that shouldn't begin another word:
str.split(/\b(?![\s.])/);

function split_string(str){
var arr = str.split(" ");
var last_i = arr.length - 1;
for(var i=0; i<last_i; i++){
arr[i]+=" ";
}
return arr;
}

It may be as simple as this:
var sentence = 'This is a sentence.';
sentence = sentence.split(' ').join(' ||');
sentence = sentence.split('\n').join('\n||');
var matches = sentence.split('||');
Note that I use 2 pipes as a delimiter, but of course you can use anything as long as it's unique.
Also note that I only split \n as a newline, but you may add \r\n or whatever you want to split as well.

General Solution
To keep the delimiters conjoined in the results, the regex needs to be a zero-width match. In other words, the regex can be thought of as matching the point between a delimiter and non-delimiter, rather than matching the delimiters themselves. This can be achieved with zero-width matching expressions, matching before, at, or after the split point (at most one each); let's call these A, B, and C. Sometimes a single sub-expression will do it, others you'll need two; offhand, I can't think of a case where you'd need three.
Not only look-aheads but lookarounds in general are the perfect candidates for this purpose: lookbehinds ((?<=...)) to match before the split point, and lookaheads ((?=...)) after. That's the essence of this approach. Positive or negative lookarounds can be used. The one pitfall is that lookbehinds are relatively new to JS regexes, so not all browsers or other JS engines will support them (when this was written, current versions of Firefox, Chrome, Opera, Edge, and node.js did and Safari did not; Safari has since added support). If you need to support a JS engine that doesn't support lookbehinds, you might still be able to write & use a regex that matches at-and-after (BC).
To have the delimiters appear at the end of each match, put them in A. To have them at the start, in C. Fortunately, JS regexes do not place restrictions on lookbehinds, so simply wrapping the delimiter regex in the positive lookaround markers should be all that's required for delimiters. If the delimiters aren't so simple (i.e. context-sensitive), it might take a little more work to write the regex, which doesn't need to match the entire delimiter.
Paired with the delimiter pattern, you'll need to write a pattern that matches the start (for C) or end (for A) of the non-delimiter. This step is likely the one that will require the most additional work.
The at-split-point match, B, will often (always?) be a simple boundary, such as \b.
Specific Solution
If spaces are the only delimiters, and they're to appear at the end of each match, the delimiter pattern would be (?<=\s), in A. However, there are some cases not covered in the problem description. For example, should words separated by only punctuation (e.g. "x.y") be split? Which side of a split point should quotation marks and hyphens appear, if any? Should they count as punctuation? Another option for the delimiter is to match (after) all non-word characters, in which case A would be (?<=\W).
Since the split-point is at a word boundary, B could be \b.
Since the start of a match is a word character, (?=\w) will suffice for C.
Any two of those three should suffice. One that is perhaps clearest in meaning (and splits at the most points) is /(?<=\W)(?=\w)/, which can be translated as "split at the start of each word". \b could be added, if you find it more understandable, though it has no functional effect: /(?<=\W)\b(?=\w)/.
Note Oriol's excellent solutions are given by B=\b and (C=(?!\s) or C=(?![\s.])).
Additional
As a point of interest, there would be a simpler solution for this particular case if JS regexes supported Tcl word boundaries: \m matches only at the start of a word, so str.split(/\m/) would split exactly at the start of each word. (\m is equivalent to (?<=\W)(?=\w).)
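Here is a short sketch of the zero-width split described above, using the A+C form (lookbehind plus lookahead) and, for engines without lookbehind support, the B+C form. The sample text and variable names are mine; the first variant assumes a JS engine that supports lookbehind assertions.

```javascript
const text = 'This is a sentence.\nThis is another sentence.';

// A + C: split where a non-word character is behind and a word character is ahead.
const withLookbehind = text.split(/(?<=\W)(?=\w)/);

// B + C: a word boundary that is followed by a word character, no lookbehind needed.
const withoutLookbehind = text.split(/\b(?=\w)/);

console.log(withLookbehind);
// [ 'This ', 'is ', 'a ', 'sentence.\n', 'This ', 'is ', 'another ', 'sentence.' ]
console.log(JSON.stringify(withLookbehind) === JSON.stringify(withoutLookbehind)); // true
```

Because nothing is consumed at the split point, every space, full stop and newline stays attached to the word before it.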

If you want to include the whitespace after the word, the regex \S+\s* should work.
const s = `This is a sentence.
This is another sentence.`;
console.log(s.match(/\S+\s*/g))

Since "a+?" is Lazy, Why does "a+?b" Match "aaab"?

While learning regular expressions in javascript using JavaScript: The Definitive Guide, I was confused by this passage:
But /a+?/ matches one or more occurrences of the letter a, matching as
few characters as necessary. When applied to the same string, this
pattern matches only the first letter a.
…
Now let’s use the nongreedy version: /a+?b/. This should match the
letter b preceded by the fewest number of a’s possible. When applied
to the same string “aaab”, you might expect it to match only one a and
the last letter b. In fact, however, this pattern matches the entire
string, just like the greedy version of the pattern.
Why is this so?
This is the explanation from the book:
This is because regular-expression pattern matching is done by finding
the first position in the string at which a match is possible. Since a
match is possible starting at the first character of the
string,shorter matches starting at subsequent characters are never
even considered.
I don't understand. Can anyone give me a more detailed explanation?

Okay, so you have your search space, "aaabc", and your pattern, /a+?b/
Does /a+?b/ match "a"? No.
Does /a+?b/ match "aa"? No.
Does /a+?b/ match "aaa"? No.
Does /a+?b/ match "aaab"? Yes.

Since you're matching literal characters and not any sort of wildcard, the regular expression a+?b is effectively the same as a+b anyway. The only type of sequence either one will match is a string of one or more a characters followed by a single b character. The non-greedy modifier makes no difference here, as the only thing an a can possibly match is an a.
The non-greedy qualifier becomes interesting when it's applied to something that can take on lots of different values, like the . wildcard. (edit: or in cases where there's interesting stuff to the left of something like a+?)
edit — if you're expecting a+?b to match just the last a before the b in aaab, well that's not how it works. Searching for a pattern in a string implicitly means to search for the earliest occurrence of the pattern. Thus, though starting from the last a does give a substring that matches the pattern, it's not the first substring that matches.
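A few one-liners make the difference visible. The '<a><b>' string is just an example I picked to show a case where laziness does change the result.

```javascript
const s = 'aaab';

// With only literal characters involved, greedy and lazy give the same match.
console.log(s.match(/a+b/)[0]);  // "aaab"
console.log(s.match(/a+?b/)[0]); // "aaab"

// Without the trailing b, the lazy version can stop after a single a.
console.log(s.match(/a+?/)[0]);  // "a"

// With a wildcard, lazy and greedy really do differ.
const tags = '<a><b>';
console.log(tags.match(/<.+>/)[0]);  // "<a><b>"
console.log(tags.match(/<.+?>/)[0]); // "<a>"
```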

The Engine Attempts a Match at the Beginning of the String
Can anyone give me a more detailed explanation?
Yes.
In short: .+? does not look for a shortest match globally, at the level of the entire string, but locally, from the position in the string where the engine is currently positioned.
How the Engine Works
When you try a regex against the string aaab, the engine first tries to find a match starting at the very first position in the string. That position is the position before the first a. If the engine cannot find a match at the first position, it moves on and tries again starting from the second position (between the first and second a)
So can a match be found by the regex a+?b at the first position? Yes.
a matches the first a
The +? quantifier tells the engine to match the fewest number of a chars necessary. Since we are looking to return a match, necessary means that the following tokens (in this case) have to be allowed to match. In this case, the fewest number of a chars needed to allow the b to match is all the remaining a chars.
b matches
In the details the second point is a bit more complex (the engine tries to match b against the second a, fails, backtracks...) but you don't need to worry about that.
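One way to watch the "try each starting position in turn" behaviour is the sticky flag (y), which only allows a match exactly at lastIndex. The loop below is my own illustration of the idea, not how the engine is implemented internally.

```javascript
const input = 'aaab';
const attempts = [];

for (let start = 0; start < input.length; start++) {
  const sticky = /a+?b/y;   // y: the match must begin exactly at lastIndex
  sticky.lastIndex = start;
  const m = sticky.exec(input);
  attempts.push(m ? m[0] : null);
}

// A match exists at positions 0, 1 and 2, but not at 3.
console.log(attempts); // [ 'aaab', 'aab', 'ab', null ]

// A normal search stops at the first position that works, which is position 0.
console.log(/a+?b/.exec(input).index); // 0
```

So the shorter candidates 'aab' and 'ab' do exist, but the engine never gets to them because position 0 already succeeds.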

'?' after a+ means the minimum number of characters needed to satisfy the expression. /a+/ means one 'a' or as many as you can encounter before some other character. In order to satisfy /a+?/ (since it's non-greedy) it only needs a single 'a'.
In order to satisfy /a+?b/, since we have 'b' at the end, it needs to match one or more 'a' before it hits 'b'. It has to hit that 'b'. /a+/ doesn't have to hit b because the regex doesn't ask for that. /a+?b/ has to hit that 'b'.
Just think about it. What other meaning could /a+?b/ have?
Hope this helps

Javascript regex

I was trying to do a regex for someone else when I ran into this problem. The requirement was that the regex should return results from a set of strings that has, let's say, "apple" in it. For example, consider the following strings:
"I have an apple"
"You have two Apples"
"I give you one more orange"
The result set should have the first two strings.
The regex(es) I tried are:
/[aA]pple/ and /[^a-zA-Z0-9][aA]pple/
The problem with the first one is that words like "aapple", "bapple", etc (ok, so they are meaningless, but still...) test positive with it, and the problem with the second one is that when a string actually starts with the word "apple", "Apples and oranges", for example, it tests negative. Can someone explain why the second regex behaves this way and what the correct regex would be?

/(^.*?\bapples?\b.*$)/i
Edit: The above will match the entire string containing the word "apples", which I thought is what you were asking for. If you are just trying to see if the string contains the word, the following will work.
/\bapples?\b/i
The regex(es) I tried are:
/[aA]pple/ and /[^a-zA-Z0-9][aA]pple/
The first one just checks for the existence of the following characters, in order: a-p-p-l-e, regardless of what context they are used in. The \b, or word-boundary assertion, matches any spot where a non-word character and a word character meet, à la \W\w.
The second one is trying to match other characters before the occurrence of a-p-p-l-e, and is essentially the same as the first, except it requires another character in front of it.
The one I answered with works as follows. From the beginning of the string, it matches any characters (if they exist) non-greedily until it encounters a word boundary. If the string starts with apple, the beginning of the string is a word boundary, so it still matches. It then matches the letters a-p-p-l-e, and s if it exists, followed by another word boundary. It then matches all characters to the end of the string. The /i at the end means it's case-insensitive, so 'Apple', 'APPLE', and 'apple' are all valid.
If you have the time, I would highly recommend walking through the tutorial at http://regular-expressions.info. It really goes in-depth and talks about how the regular expression engines match different expressions, it helped me a ton.

To build on @tj111, the reason your second regex fails is that [^a-zA-Z0-9] requires that a character matches; that is, there is some character in that position, and its value is not contained in the set [a-zA-Z0-9]. Markers like \b are called "zero-width assertions". \b, in particular, matches against boundaries between characters or at the beginning or end of a string. Because it is not matching against any character, its "width" is zero.
In sum, [^a-zA-Z0-9] requires a character that does not take a particular value be present, while \b requires only that a boundary be present.
Edit: @tj111 has added most of this to his response. I'm in too late, again :)
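The difference between consuming a character and asserting a boundary shows up clearly when the word sits at the very start of the string. In this sketch the sample strings and variable names are mine, and the third pattern (an explicit "start of string or non-alphanumeric" alternative) is an extra variation for comparison, not something proposed above.

```javascript
const samples = ['I have an apple', 'Apples and oranges', 'bapple pie'];

const consuming = /[^a-zA-Z0-9][aA]pple/;    // needs a real character before "apple"
const boundary = /\b[aA]pple/;               // needs only a word boundary
const either = /(?:^|[^a-zA-Z0-9])[aA]pple/; // start of string OR a non-alphanumeric

const results = samples.map((s) => ({
  s,
  consuming: consuming.test(s),
  boundary: boundary.test(s),
  either: either.test(s),
}));

console.log(results);
// 'I have an apple'    -> consuming: true,  boundary: true,  either: true
// 'Apples and oranges' -> consuming: false, boundary: true,  either: true
// 'bapple pie'         -> consuming: false, boundary: false, either: false
```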

This works for apple and apples and its case-insensitive spellings:
var strings = ["I have an apple", "You have two Apples", "I give you one more orange"];
var result = [];
var pattern = /\bapples?\b/i;
for (var i=0; i<strings.length; i++) {
if (pattern.test(strings[i])) {
result.push(strings[i]);
}
}

Your second regex requires a nonalphanumeric character before the first a in apple. "apple" doesn't satisfy this. As others note, "\b" matches not a character, but a word boundary position.

/\bapple/i
\b is a word boundary.
To explain why your attempts do not work, the first one does not check that it is the beginning of the word, so it can have something before it. The second regex you gave says that something must be before the word "apple", but it can't be alphanumeric.
