Javascript Regex for all words not between certain characters

Javascript Regex for all words not between certain characters - javascript

I'm trying to return a count of all words NOT between square brackets. So given ..
[don't match these words] but do match these
I get a count of 4 for the last four words.
This works in .net:
\b(?<!\[)[\w']+(?!\])\b
but it won't work in Javascript because it doesn't support lookbehind
Any ideas for a pure js regex solution?

Ok, I think this should work:
\[[^\]]+\](?:^|\s)([\w']+)(?!\])\b|(?:^|\s)([\w']+)(?!\])\b
You can test it here:
http://regexpal.com/
If you need an alternative with text in square brackets coming after the main text, it could be added as a second alternative and the current second one would become third.
It's a bit complicated but I can't think of a better solution right now.
If you need to do something with the actual matches you will find them in the capturing groups.
UPDATE:
Explanation:
So, we've got two options here:
\[[^\]]+\](?:^|\s)([\w']+)(?!\])\b
This is saying:
\[[^\]]+\] - match everything in square brackets (don't capture)
(?:^|\s) - followed by line start or a space - when I look at it now take the caret out as it doesn't make sense so this will become just \s
([\w']+) - match all following word characters as long as (?!\])the next character is not the closing bracket - well this is probably also unnecessary now, so let's try and remove the lookahead
\b - and match word boundary
2 (?:^|\s)([\w']+)(?!\])\b
If you cannot find the option 1 - do just the word matching, without looking for square brackets as we ensured with the first part that they are not here.
Ok, so I removed all the things that we don't need (they stayed there because I tried quite a few options before it worked:-) and the revised regex is the one below:
\[[^\]]+\]\s([\w']+)(?!\])\b|(?:^|\s)([\w']+)\b

I would use something like \[[^\]]*\] to remove the words between square brackets, and then explode by spaces the returned string to count the remaining words.

Chris, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a general question about how to exclude patterns in regex.)
Here's our simple regex (see it at work on regex101, looking at the Group captures in the bottom right panel):
\[[^\]]*\]|(\b\w+\b)
The left side of the alternation matches complete [bracketed groups]. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right words because they were not matched by the expression on the left.
This program shows how to use the regex (see the count result in the online demo):
<script>
var subject = '[match ye not these words] but do match these';
var regex = /\[[^\]]*\]|(\b\w+\b)/g;
var group1Caps = [];
var match = regex.exec(subject);
// put Group 1 captures in an array
while (match != null) {
if( match[1] != null ) group1Caps.push(match[1]);
match = regex.exec(subject);
}
document.write("<br>*** Number of Matches ***<br>");
document.write(group1Caps.length);
</script>
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...

Related

Javascript regular expression: Want to exclude function words in all caps

I have the following regular expression to parse google script formula to get precedents
([A-z]{2,}!)?:?\$?[A-Z]\$?[A-Z]?(\$?[1-9]\$?[0-9]?)?
I needed to make the numbers optional to cater to ranges that are entire columns- see image. Because the numbers are optional I am also matching items that are functions -all caps words- that I want to exclude. I suppose I could do this after the fact but I would like to modify the regex to exclude them. How do I do that?
Example:
=IFERROR(VLOOKUP($AA16,Account_List_S!$AA:$AC,3,0),0)
IFERROR(IF(AD3=1,INDEX(CapEx!$AB$15:$AE$15,1,YEAR(AD$13)-
YEAR($Z$13)-1)*IF(Import_CapEx!AD$15>=0,Import_CapEx!AD$15,0),0),0)";
The words I want to match refer to cells with an optional sheet name, and optional $ before the row or column identifier. They can be ranges or single cells.
Examples of words I want to match:
$AA16
$AB$15
AD$15
$Z$13
Account_List_S!$AA:$AC
CapEx!$AB$15:$AE$15
Import_CapEx!AD$15
The words I want to exclude are the functions:
IFERROR
VLOOKUP
IF
YEAR

Try this regex:
/[\(,+\-\*/><=]((\w+!)?\$?[A-Z]{1,2}(\$?[\d]{0,3})?(:\$?[A-Z]{1,2}(\$?\d{0,3})?)?(?=[\),+\-\*/><=]))/g
While a little long, this has the advantage that it will reject these when found in the formula:
Anything that has [A-Z] and [0-9] but not a column, e.g. ZIP50210
Anything that has [A-Z] and [0-9] but in the wrong order, e.g. 25E
Any variables like "AR" or 'JOHN'
Any constants in the formula like TRUE, FALSE or other argument values
Explanation:
[\(,+\-\*/><=] look for starting literal ( or , or operands like +,-,/,*,>,<,=. We expect column identifiers to start with these characters.
( now we start our matching group
(\w+!)? allow for optional sheet names like 'Account_List_S!'
\$?[A-Z]{1,2}(\$?[\d]{0,3})? will match columns like A or $B1 or $AB$12 or AB123
(:\$?[A-Z_$]{1,2}(\$?[\d]{0,3}))? adds optional match for a range of columns, e.g. trailing :DD or :$C1 or :AC$1 or :AC123 or some such
(?=[,\)=:><]) lookahead for ending literal ) or , or operands like +,-,/,*,>,<,=. We expect column identifiers to end with these characters.
) close matching group
g global match (more than one instance)
Demo:
let regex = /[\(,+\-\*/><=]((\w+!)?\$?[A-Z]{1,2}(\$?[\d]{0,3})?(:\$?[A-Z]{1,2}(\$?\d{0,3})?)?(?=[\),+\-\*/><=]))/g;
let str = '=IFERROR(VLOOKUP($AA16,Account_List_S!$AA:$AC,3,0),0)IFERROR(IF(AD3=1,INDEX(CapEx!$AB$15:$AE$15,1,YEAR(AD$13)-YEAR($Z$13)-1)*IF(Import_CapEx!AD$15>=0,Import_CapEx!AD$15,0),0),0)";';
let arr = []
while(match = regex.exec(str)) {
arr.push(match[1]); //we only want the first matching group
}
console.log(arr);
/*
[ '$AA16',
'Account_List_S!$AA:$AC',
'AD3',
'CapEx!$AB$15:$AE$15',
'AD$13',
'$Z$13',
'Import_CapEx!AD$15',
'Import_CapEx!AD$15' ] */

This feels like a bad fit for a regular expression, but I cant pass up a good regex challenge.
My solution involves alot of conditional checks
(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$\w+)*(?!\()\b
Breakdown
(
\w+\! Words followed by an !
)? which might exist.
\$? A $ which might exist
[A-Z]{1,} At least 1 capitalized letter maybe more
(?:
\d+ A non capturing group of digits after our letters
)? but they might not exist
(
\:? A : which might exist
\$\w+ A $ followed by characters
)* With none or many of them
(?!\() All of this, ONLY IF we DONT have a ( after it
\b All of this, ONLY IF we have a word break
The magic really happens at the end with the conditional breaks, without them you capture alot of other stuff.
Sample
let text = `=IFERROR(VLOOKUP($AA165,Account_List_S!$AA:$AC,3,0),0)
IFERROR(IF(AD3=1,INDEX(CapEx!$AB$15:$AE$15,1,YEAR(AD$13)-
YEAR($Z$13)-1)*IF(Import_CapEx!AD$15>=0,Import_CapEx!AD$15,0),0),0)";`
let exp = /(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$\w+)*(?!\()\b/gm
let match;
while((match=exp.exec(text))) {
console.log(match[0]);
}
Ouput:
$AA165
Account_List_S!$AA:$AC
AD3
CapEx!$AB$15:$AE$15
AD$13
$Z$13
Import_CapEx!AD$15
Import_CapEx!AD$15
Simple change to the expression making the $ after the : optional makes it work for your added use case
(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$?\w+)*(?!\()\b
let text = `$X74,Calc_Named_HC!AE$32:AE$103)-Calc_General_HC!AE74";`
let exp = /(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$\w+)*(?!\()\b/gm
let match;
while((match=exp.exec(text))) {
console.log(match[0]);
}

First shot: filter out full upper case words
This answer is not perfect yet, but using a negative look-ahead at the beginning of the expression can allow you to filter out IF and any sequence of 3+ upper case letters:
(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?:?\$?\b[A-Z]\$?[A-Z]?(\$?[1-9]\$?[0-9]?)?\b
The \b in several places is to make sure the positive and negative matches go from the beginning of the letter sequence to the end.
The problem that remains is that it matches Account_List_S!$AA:$AC in two matches, Account_List_S!$AA and :$AC. So...
Second shot: fix the positive matching part of the regex
Here is a more complicated version that matches the ranges correctly:
EDIT: fixed to handle the examples given by OP in the comments.
(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?\$?\b[A-Z]{1,3}(\$?[1-9]{1,3})?(:\$?[A-Z]{1,3}(\$?\d{1,3})?)?\b
With this version, Account_List_S!$AA:$AC is matched as a whole, as I believe you want, and so is Calc_Named_HC!AE$32:AE$103 added in the comments below.
Third shot: accepts some spurious patterns but easier to read
If you are willing to accept matching a superfluous : before the first address, this simpler expression would work:
EDIT: fixed to handle the examples given in the comments.
(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?(:?\$?\b[A-Z]{1,3}(\$?\d{1,3})?){1,2}\b
Note that I kept your [A-z] range as is, but [A-Za-z_] might be more appropriate, as pointed out by #sp00m in his comment.

Regular expression match specific key words

I am trying to use regexp to match some specific key words.
For those codes as below, I'd like to only match those IFs at first and second line, which have no prefix and postfix. The regexp I am using now is \b(IF|ELSE)\b, and it will give me all the IFs back.
IF A > B THEN STOP
IF B < C THEN STOP
LOL.IF
IF.LOL
IF.ELSE
Thanks for any help in advance.
And I am using http://regexr.com/ for test.
Need to work with JS.

I'm guessing this is what you're looking for, assuming you've added the m flag for multiline:
(?:^|\s)(IF|ELSE)(?:$|\s)
It's comprised of three groups:
(?:^|\s) - Matches either the beginning of the line, or a single space character
(IF|ELSE) - Matches one of your keywords
(?:$|\s) - Matches either the end of the line, or a single space character.
Regexr

you can do it with lookaround (lookahead + lookbehind). this is what you really want as it explicitly matches what you are searching. you don't want to check for other characters like string start or whitespaces around the match but exactly match "IF or ELSE not surrounded by dots"
/(?<!\.)(IF|ELSE)(?!\.)/g
explanation:
use the g-flag to find all occurrences
(?<!X)Y is a negative lookbehind which matches a Y not preceeded by an X
Y(?!X) is a negative lookahead which matches a Y not followed by an X
working example: https://regex101.com/r/oS2dZ6/1
PS: if you don't have to write regex for JS better use a tool which supports the posix standard like regex101.com

Matching multiple optional characters depending on each other

I want to match all valid prefixes of substitute followed by other characters, so that
sub/abc/def matches the sub part.
substitute/abc/def matches the substitute part.
subt/abc/def either doesn't match or only matches the sub part, not the t.
My current Regex is /^s(u(b(s(t(i(t(u(te?)?)?)?)?)?)?)?)?/, which works, however this seems a bit verbose.
Is there any better (as in, less verbose) way to do this?

This would do like the same as you mentioned in your question.
^s(?:ubstitute|ubstitut|ubstitu|ubstit|ubsti|ubst|ubs|ub|u)?
The above regex will always try to match the large possible word. So at first it checks for substitute, if it finds any then it will do matching else it jumps to next pattern ie, substitut , likewise it goes on upto u.
DEMO 1 DEMO 2

you could use a two-step regex
find first word of subject by using this simple pattern ^(\w+)
use the extracted word from step 1 as your regex pattern e.g. ^subs against the word substitute

RegEx: Understanding Syllable Counter Code

I have used Dylan's question on here regarding JavaScript syllable counting, and more specifically artfulhacker's answer, in my own code and, regardless of which single or multi word string I feed it, the function is always able to correctly count the number of syllables.
I have a limited experience with RegEx and not enough prior knowledge to decipher what exactly is happening in the following code without some help. I'm not someone who is ever happy with having some code I pulled from somewhere just work without me knowing how it works. Is someone able to please articulate what is happening in the new_count(word) function below and help me decipher the use of RegEx and how it is that the function is able to correctly count syllables? Many
function new_count(word) {
word = word.toLowerCase(); //word.downcase!
if(word.length <= 3) { return 1; } //return 1 if word.length <= 3
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, ''); //word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
word = word.replace(/^y/, ''); //word.sub!(/^y/, '')
return word.match(/[aeiouy]{1,2}/g).length; //word.scan(/[aeiouy]{1,2}/).size
}

As far as I see it, we basically want to count the vowels, or vowel pairs, with some special cases. Let's start by the last line, which does that, i.e. count vowels and pairs:
return word.match(/[aeiouy]{1,2}/g).length;
This will match any vowel, or vowel pair. [...] means a character class, i.e. that if we go through the string character-by-character, we have a match, if the actual character is one of those. {1, 2} is the number of repetitions, i.e. it means that we should match exactly one or two such characters.
The other two lines are for special cases.
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '');
This line will remove 'syllables' from the end of the word, which are either:
Xes (where X is anything but any of 'laeiouy', e.g. 'zes')
ed
Xe (where X is anything but any of 'laeiouy', e.g. 'xe')
(I'm not really sure what the grammatical meaning behind this is, but I guess, that 'syllables' at the end of the word, like '-ed', '-ded', '-xed' etc. don't really count as such.)
As for the regexp part: (?:...) is a non-capturing group. I guess it's not really important in this case that this group be non-capturing; this just means that we would like to group the whole expression, but then we do not need to refer back to it. However, we could have used a capturing group too (i.e. (...) )
The [^...] is a negated character class. It means, match any character, which is none of those listed here. (Compare to the (non-negated) character-class mentioned above.)
The pipe symbol, i.e. |, is the alternation operator, which means, that any of the expressions can match.
Finally, the $ anchor matches the end of the line, or string (depending on the context).
word = word.replace(/^y/, '');
This line removes 'y'-s from the beginning of words (probably 'y' at the beginning does not count as a syllable -- which makes sense in my opinion).
^ is the anchor for matching the beginning of the line, or string (c.f. $ mentioned above).
Note: the algorithm only works if word really contains one single word.

/(?:[^laeiouy]es|ed|[^laeiouy]e)$/
That matches three possible substrings: a letter other than 'l' or a vowel followed by 'es' (like "res" or "tes"); 'ed'; or a non-vowel, non-'l' followed by just an 'e'. Those patterns must appear at the end of the word to match because of the $ at the end of the pattern. The grouping (?: ) is just a grouping; the leading ?: makes that distinction. The pattern could have been a little shorter:
/(?:[^laeiouy]es?|ed)$/
would do the same thing. In any case, if the pattern matches the characters involved are removed from the word.
Then,
/^y/
matches a 'y' at the beginning of a word. If a 'y' is found, it's removed.
Finally,
/[aeiouy]{1,2}/g
matches any one- or two-character stretch of vowels (including 'y'). The g suffix makes it a global match, so that the return value is an array consisting of all such spans of vowels. The length of that returned array is the number of syllables (according to this technique).
Note that the words "poem" and "lion" would be reported as one-syllable words, which may be correct for some English variants but not all.
Here is a pretty good reference for JavaScript regular expression operators.

Splitting a string into words and keeping delimiter

I want to split up a string (sentence) in an array of words and keep the delimiters.
I have found and I am currently using this regex for this:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
An explanation can be found here: http://regex101.com/
This works exactly as I want it to and effectively makes a string like
This is a sentence.
To an array of
["This", "is", "a", "sentence."]
The problem here is that it does not include spaces nor newlines. I want the string to be parsed as words as it already does but I also want the corresponding space and or newline character to belong to the previous word.
I have read about positive lookahead that should look for future characters (space and or newline) but still take them into account when extracting the word. Although this might be the solution I have failed to implement it.
If it makes any difference I am using JavaScript and the following code:
//save the regex -- g modifier to get all matches
var reg = /[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)/g;
//define variable for holding matches
var matches;
//loop through each match
while(matches = reg.exec(STRING_HERE)){
//the word without spaces or newlines
console.log(matches[0]);
}
The code works but as I said, it does not include spaces and newline characters.

Yo can try something simpler:
str.split(/\b(?!\s)/);
However, note non word characters (e.g. full stop) will be considered another word:
"This is a sentence.".split(/\b(?!\s)/);
// [ "This ", "is ", "a ", "sentence", "." ]
To fix that, you can use a character class with the characters that shouldn't begin another word:
str.split(/\b(?![\s.])/);

function split_string(str){
var arr = str.split(" ");
var last_i = arr.length - 1;
for(var i=0; i<last_i; i++){
arr[i]+=" ";
}
return arr;
}

It may be as simple as this:
var sentence = 'This is a sentence.';
sentence = sentence.split(' ').join(' ||');
sentence = sentence.split('\n').join('\n||');
var matches = sentence.split('||');
Note that I use 2 pipes as a delimiter, but ofcourse you can use anything as long as it's unique.
Also note that I only split \n as a newline, but you may add \r\n or whatever you want to split as well.

General Solution
To keep the delimiters conjoined in the results, the regex needs to be a zero-width match. In other words, the regex can be thought of as matching the point between a delimiter and non-delimiter, rather than matching the delimiters themselves. This can be achieved with zero-width matching expressions, matching before, at, or after the split point (at most one each); let's call these A, B, and C. Sometimes a single sub-expression will do it, others you'll need two; offhand, I can't think of a case where you'd need three.
Not only look-aheads but lookarounds in general are the perfect candidates for this purpose: lookbehinds ((?<=...)) to match before the split point, and lookaheads ((?=...)) after. That's the essence of this approach. Positive or negative lookarounds can be used. The one pitfall is that lookbehinds are relatively new to JS regexes, so not all browsers or other JS engines will support them (current versions of Firefox, Chrome, Opera, Edge, and node.js do; Safari does not). If you need to support a JS engine that doesn't support lookbehinds, you might still be able to write & use a regex that matches at-and-before (BC).
To have the delimiters appear at the end of each match, put them in A. To have them at the start, in C. Fortunately, JS regexes do not place restrictions on lookbehinds, so simply wrapping the delimiter regex in the positive lookaround markers should be all that's required for delimiters. If the delimiters aren't so simple (i.e. context-sensitive), it might take a little more work to write the regex, which doesn't need to match the entire delimiter.
Paired with the delimiter pattern, you'll need to write a pattern that matches the start (for C) or end (for A) of the non-delimiter. This step is likely the one that will require the most additional work.
The at-split-point match, B
will often (always?) be a simple boundary, such as \b.
Specific Solution
If spaces are the only delimiters, and they're to appear at the end of each match, the delimiter pattern would be (?<=\s), in A. However, there are some cases not covered in the problem description. For example, should words separated by only punctuation (e.g. "x.y") be split? Which side of a split point should quotation marks and hyphens appear, if any? Should they count as punctuation? Another option for the delimiter is to match (after) all non-word characters, in which case A would be (<?=\W).
Since the split-point is at a word boundary, B could be \b.
Since the start of a match is a word character, (?=\w) will suffice for C.
Any two of those three should suffice. One that is perhaps clearest in meaning (and splits at the most points) is /(<?=\W)(?=\w)/, which can be translated as "split at the start of each word". \b could be added, if you find it more understandable, though it has no functional affect: /(<?=\W)\b(?=\w)/.
Note Oriol's excellent solutions are given by B=\b and (C=(?!\s) or C=(?![\s.])).
Additional
As a point of interest, there would be a simpler solution for this particular case if JS regexes supported TCL word boundaries: \m matches only at the start of a word, so str.split(/\m/) would split exactly at the start of each word. (\m is equivalent to (<?=\W)(?=\w).)

If you want to include the whitespace after the word, the regex \S+\s* should work.
const s = `This is a sentence.
This is another sentence.`;
console.log(s.match(/\S+\s*/g))

We Keep Coding

JavaScript is the programming language of the Web.

Javascript Regex for all words not between certain characters - javascript

I would use something like \[[^\]]*\] to remove the words between square brackets, and then explode by spaces the returned string to count the remaining words.

Related

Javascript regular expression: Want to exclude function words in all caps

Regular expression match specific key words

Matching multiple optional characters depending on each other

RegEx: Understanding Syllable Counter Code

Splitting a string into words and keeping delimiter

Categories

Resources