Javascript regular expression: Want to exclude function words in all caps - javascript

I have the following regular expression to parse google script formula to get precedents
([A-z]{2,}!)?:?\$?[A-Z]\$?[A-Z]?(\$?[1-9]\$?[0-9]?)?
I needed to make the numbers optional to cater to ranges that are entire columns. Because the numbers are optional, I am also matching items that are functions (all-caps words) that I want to exclude. I suppose I could filter them out after the fact, but I would like to modify the regex to exclude them. How do I do that?
Example:
=IFERROR(VLOOKUP($AA16,Account_List_S!$AA:$AC,3,0),0)
IFERROR(IF(AD3=1,INDEX(CapEx!$AB$15:$AE$15,1,YEAR(AD$13)-
YEAR($Z$13)-1)*IF(Import_CapEx!AD$15>=0,Import_CapEx!AD$15,0),0),0)";
The words I want to match refer to cells with an optional sheet name, and optional $ before the row or column identifier. They can be ranges or single cells.
Examples of words I want to match:
$AA16
$AB$15
AD$15
$Z$13
Account_List_S!$AA:$AC
CapEx!$AB$15:$AE$15
Import_CapEx!AD$15
The words I want to exclude are the functions:
IFERROR
VLOOKUP
IF
YEAR

Try this regex:
/[\(,+\-\*/><=]((\w+!)?\$?[A-Z]{1,2}(\$?[\d]{0,3})?(:\$?[A-Z]{1,2}(\$?\d{0,3})?)?(?=[\),+\-\*/><=]))/g
While a little long, this has the advantage that it will reject these when found in the formula:
Anything that has [A-Z] and [0-9] but not a column, e.g. ZIP50210
Anything that has [A-Z] and [0-9] but in the wrong order, e.g. 25E
Any variables like "AR" or 'JOHN'
Any constants in the formula like TRUE, FALSE or other argument values
Explanation:
[\(,+\-\*/><=] look for a leading literal ( or , or an operator such as +, -, /, *, >, <, =. We expect cell references to start right after one of these characters.
( now we start our matching group
(\w+!)? allow for optional sheet names like 'Account_List_S!'
\$?[A-Z]{1,2}(\$?[\d]{0,3})? will match columns like A or $B1 or $AB$12 or AB123
(:\$?[A-Z]{1,2}(\$?\d{0,3})?)? adds an optional match for the end of a range, e.g. a trailing :DD or :$C1 or :AC$1 or :AC123
(?=[\),+\-\*/><=]) lookahead for a closing literal ) or , or an operator such as +, -, /, *, >, <, =. We expect cell references to be followed by one of these characters.
) close matching group
g global match (more than one instance)
Demo:
let regex = /[\(,+\-\*/><=]((\w+!)?\$?[A-Z]{1,2}(\$?[\d]{0,3})?(:\$?[A-Z]{1,2}(\$?\d{0,3})?)?(?=[\),+\-\*/><=]))/g;
let str = '=IFERROR(VLOOKUP($AA16,Account_List_S!$AA:$AC,3,0),0)IFERROR(IF(AD3=1,INDEX(CapEx!$AB$15:$AE$15,1,YEAR(AD$13)-YEAR($Z$13)-1)*IF(Import_CapEx!AD$15>=0,Import_CapEx!AD$15,0),0),0)";';
let arr = [];
let match; // declared explicitly so the loop does not create an implicit global
while ((match = regex.exec(str))) {
arr.push(match[1]); //we only want the first matching group
}
console.log(arr);
/*
[ '$AA16',
'Account_List_S!$AA:$AC',
'AD3',
'CapEx!$AB$15:$AE$15',
'AD$13',
'$Z$13',
'Import_CapEx!AD$15',
'Import_CapEx!AD$15' ] */

This feels like a bad fit for a regular expression, but I can't pass up a good regex challenge.
My solution involves a lot of conditional checks:
(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$\w+)*(?!\()\b
Breakdown
(
\w+\! Words followed by an !
)? which might exist.
\$? A $ which might exist
[A-Z]{1,} At least 1 capitalized letter maybe more
(?:
\d+ A non capturing group of digits after our letters
)? but they might not exist
(
\:? A : which might exist
\$\w+ A $ followed by characters
)* With none or many of them
(?!\() All of this, ONLY IF we DONT have a ( after it
\b All of this, ONLY IF we have a word break
The magic really happens at the end with the conditional checks; without them you capture a lot of other stuff.
Sample
let text = `=IFERROR(VLOOKUP($AA165,Account_List_S!$AA:$AC,3,0),0)
IFERROR(IF(AD3=1,INDEX(CapEx!$AB$15:$AE$15,1,YEAR(AD$13)-
YEAR($Z$13)-1)*IF(Import_CapEx!AD$15>=0,Import_CapEx!AD$15,0),0),0)";`
let exp = /(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$\w+)*(?!\()\b/gm
let match;
while((match=exp.exec(text))) {
console.log(match[0]);
}
Output:
$AA165
Account_List_S!$AA:$AC
AD3
CapEx!$AB$15:$AE$15
AD$13
$Z$13
Import_CapEx!AD$15
Import_CapEx!AD$15
A simple change to the expression, making the $ after the : optional, makes it work for your added use case:
(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$?\w+)*(?!\()\b
let text = `$X74,Calc_Named_HC!AE$32:AE$103)-Calc_General_HC!AE74";`
let exp = /(\w+\!)?\$?[A-Z]{1,}(?:\d+)?(\:?\$?\w+)*(?!\()\b/gm
let match;
while((match=exp.exec(text))) {
console.log(match[0]);
}

First shot: filter out full upper case words
This answer is not perfect yet, but using a negative look-ahead at the beginning of the expression can allow you to filter out IF and any sequence of 3+ upper case letters:
(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?:?\$?\b[A-Z]\$?[A-Z]?(\$?[1-9]\$?[0-9]?)?\b
The \b in several places is to make sure the positive and negative matches go from the beginning of the letter sequence to the end.
The problem that remains is that it matches Account_List_S!$AA:$AC in two matches, Account_List_S!$AA and :$AC. So...
Second shot: fix the positive matching part of the regex
Here is a more complicated version that matches the ranges correctly:
EDIT: fixed to handle the examples given by OP in the comments.
(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?\$?\b[A-Z]{1,3}(\$?\d{1,3})?(:\$?[A-Z]{1,3}(\$?\d{1,3})?)?\b
With this version, Account_List_S!$AA:$AC is matched as a whole, as I believe you want, and so is Calc_Named_HC!AE$32:AE$103 added in the comments below.
Third shot: accepts some spurious patterns but easier to read
If you are willing to accept matching a superfluous : before the first address, this simpler expression would work:
EDIT: fixed to handle the examples given in the comments.
(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?(:?\$?\b[A-Z]{1,3}(\$?\d{1,3})?){1,2}\b
Note that I kept your [A-z] range as is, but [A-Za-z_] might be more appropriate, as pointed out by @sp00m in his comment.
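As a quick check, the third expression can be run against the first example formula from the question. This is only a minimal sketch: the helper name findReferences is made up for illustration, and the g flag is added so that all matches are returned.

```javascript
// Third-shot pattern from above, with the g flag added to collect every match.
const cellRef = /(?!\b[A-Z]{3,}\b|\bIF\b)(\b[A-z]{2,}!)?(:?\$?\b[A-Z]{1,3}(\$?\d{1,3})?){1,2}\b/g;

// Hypothetical helper: returns every cell or range reference found in a formula.
function findReferences(formula) {
  // With the g flag, match() returns the full matches, or null if there are none.
  return formula.match(cellRef) || [];
}

console.log(findReferences('=IFERROR(VLOOKUP($AA16,Account_List_S!$AA:$AC,3,0),0)'));
// [ '$AA16', 'Account_List_S!$AA:$AC' ]
```

Keep in mind that the negative look-ahead only knows about IF and words of three or more capitals, so a two-letter function name such as OR would still be reported as a column.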

Related

Adding an additional letter matching group to an existing regex

I have the following regex: (?:\/us)?\/[a-z]{2}[_-][a-z]{2}(?:\/?$|(?=\/))|\/[a-z]{2}(?:\/?$|(?=\/))^([a-z]{2}\/retail)
As you can see, it's not particularly easy on the eyes. You can see it in action here: https://regex101.com/r/4AZwuP/1 (enable substitutions to see the desired result - the removal of matches)
Here's a few entries it's supposed to match:
/us/en_us/retail/en (matches /us/ and /us/en_us/)
/us/en_us/retail (matches /us/ and /en_us/)
/gb/en_gb/retail/en-uk (matches /en_gb and /en-uk)
Note that, these are just prefixes and the full url might look something like:
/de/de_de/retail/de_de/products/catalog
The goal is to run the regex and delete matches so that this line becomes:
de/retail/products/catalog
The above regex accomplishes this with one exception: in the first example, I need it to match not only /us/en_us but also /en (or /de or /mx; in other words, there's an additional country code there). Unfortunately, it does not.
What I do know for a fact is that if those two characters are present, it'll be one of these two:
.../retail/en
.../retail/en/something/or/other
In either case it's always two characters either alone or followed by a forward slash.
How can I modify the original regex to deal with this annoying edge case?
Bonus: how does the original work?
If a lookbehind is supported you might use:
(?:\/[a-z]{2})?\/[a-z]{2}[-_][a-z]{2}\b|(?<=\/retail)\/[a-z]{2}\b
(?:\/[a-z]{2})? Optionally match / and 2 chars a-z
\/[a-z]{2}[-_][a-z]{2}\b Match / 2 chars a-z. Then either - or _ and 2 char a-z
| Or
(?<=\/retail)\/[a-z]{2}\b Match / and 2 chars a-z, asserting /retail directly to the left
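A small sketch of how the lookbehind pattern could be used to delete the matches, assuming a JavaScript engine with lookbehind support. The helper name stripLocale is hypothetical, and the g flag is added so every prefix is removed.

```javascript
// Lookbehind pattern from above, with the g flag so every match is removed.
const localePrefix = /(?:\/[a-z]{2})?\/[a-z]{2}[-_][a-z]{2}\b|(?<=\/retail)\/[a-z]{2}\b/g;

// Hypothetical helper: strips the matched locale segments from a URL path.
function stripLocale(path) {
  return path.replace(localePrefix, '');
}

console.log(stripLocale('/us/en_us/retail/en'));                     // /retail
console.log(stripLocale('/gb/en_gb/retail/en-uk'));                  // /retail
console.log(stripLocale('/de/de_de/retail/de_de/products/catalog')); // /retail/products/catalog
```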
Or use a capture group, and in the callback of replace check if group 1 exists. If it does, use it in the replacement to keep it.
(?:\/[a-z]{2})?\/[a-z]{2}[-_][a-z]{2}\b|\/(retail)\/[a-z]{2}\b
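The capture-group variant might be wired up roughly like this. It is a sketch under the assumption that matches are deleted with replace; the helper name stripLocaleKeepingRetail is invented for illustration.

```javascript
// Capture-group pattern from above, with the g flag added.
const localeOrRetail = /(?:\/[a-z]{2})?\/[a-z]{2}[-_][a-z]{2}\b|\/(retail)\/[a-z]{2}\b/g;

// Hypothetical helper: removes locale segments but keeps the "/retail" part.
function stripLocaleKeepingRetail(path) {
  // group1 is "retail" only when the second alternative matched; put it back.
  // For the first alternative group1 is undefined, so the match is deleted.
  return path.replace(localeOrRetail, (match, group1) => (group1 ? '/' + group1 : ''));
}

console.log(stripLocaleKeepingRetail('/us/en_us/retail/en'));
// /retail
console.log(stripLocaleKeepingRetail('/us/en_us/retail/en/something/or/other'));
// /retail/something/or/other
```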
I suppose you want to remove the country code, in which case the leading /gb is a country code as well.
My regex is this: (\/\w{2}(?=\/|$))|(\/\w{2}(-|_)\w{2}(?=\/|$))
Let's break it into two parts:
(\/\w{2}(?=\/|$)) matches / plus two word characters, followed by / or the end of the string
(\/\w{2}(-|_)\w{2}(?=\/|$)) matches / plus two word characters, then - or _, then two more word characters, again followed by / or the end of the string
It matches all the examples in your regex101 link, but it will give wrong results if your URL contains any other two-letter segments.

JQuery match with RegEx not working

I have a filename that will be something along the lines of this:
Annual-GDS-Valuation-30th-Dec-2016-082564K.docx
It will contain 5 numbers followed by a single letter, but it may be in a different position in the file name. The leading zero may or may not be there, but it is not required.
This is the code I came up with after checking examples; however, SelectedFileClientID is always null.
var SelectedFileClientID = files.match(/^d{5}\[a-zA-Z]{1}$/);
I'm not sure what it is I am doing wrong.
Edit:
The 0 has nothing to do with the code I am trying to extract. It may or may not be there, and it could even be a completely different character, or more than one, but has nothing to do with it at all. The client has decided they want to put additional characters there.
There are at least 3 issues with your regex: 1) the pattern is enclosed with anchors, and thus requires a full string match, 2) the d matches a letter d, not a digit, you need \d to match a digit, 3) a \[ matches a literal [, so the character class is ruined.
Use
/\d{5}[a-zA-Z]/
Details:
\d{5} - 5 digits
[a-zA-Z] - an ASCII letter
JS demo:
var s = 'Annual-GDS-Valuation-30th-Dec-2016-082564K.docx';
var m = s.match(/\d{5}[a-zA-Z]/);
console.log(m[0]);
All right, there are a few things wrong...
var matches = files.match(/\-0?(\d{5}[a-zA-Z])\.[a-z]{3,}$/);
var SelectedFileClientID = matches ? matches[1] : '';
So:
First, I get the matches on your string -- .match()
Then, your file name will not start with the digits - so drop the ^
You had forgotten the backslash for digits: \d
Do not backslash your square bracket - it's here used as a regular expression token
no need for the {1} for your letters: the square bracket content is enough as it will match one, and only one letter.
Hope this helps!
Try this pattern: \d{5}[a-zA-Z]
Try: 0?\d{5}[a-zA-Z]
As you mentioned, the 0 may or may not be there, so 0? takes that into account.
Alternatively, it can be done like this, which can match any random character:
(\w+|\W+|\d+)?\d{5}[a-zA-Z]

RegEx: Understanding Syllable Counter Code

I have used Dylan's question on here regarding JavaScript syllable counting, and more specifically artfulhacker's answer, in my own code and, regardless of which single or multi word string I feed it, the function is always able to correctly count the number of syllables.
I have limited experience with RegEx and not enough prior knowledge to decipher what exactly is happening in the following code without some help. I'm not someone who is ever happy with having some code I pulled from somewhere just work without me knowing how it works. Is someone able to please articulate what is happening in the new_count(word) function below and help me decipher the use of RegEx and how it is that the function is able to correctly count syllables?
function new_count(word) {
word = word.toLowerCase(); //word.downcase!
if(word.length <= 3) { return 1; } //return 1 if word.length <= 3
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, ''); //word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
word = word.replace(/^y/, ''); //word.sub!(/^y/, '')
return word.match(/[aeiouy]{1,2}/g).length; //word.scan(/[aeiouy]{1,2}/).size
}
As far as I see it, we basically want to count the vowels, or vowel pairs, with some special cases. Let's start by the last line, which does that, i.e. count vowels and pairs:
return word.match(/[aeiouy]{1,2}/g).length;
This will match any vowel, or vowel pair. [...] means a character class, i.e. as we go through the string character by character, we have a match if the current character is one of those listed. {1,2} is the number of repetitions, i.e. it means that we should match one or two such characters.
The other two lines are for special cases.
word = word.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '');
This line will remove 'syllables' from the end of the word, which are either:
Xes (where X is anything but any of 'laeiouy', e.g. 'zes')
ed
Xe (where X is anything but any of 'laeiouy', e.g. 'xe')
(I'm not really sure what the grammatical meaning behind this is, but I guess, that 'syllables' at the end of the word, like '-ed', '-ded', '-xed' etc. don't really count as such.)
As for the regexp part: (?:...) is a non-capturing group. I guess it's not really important in this case that this group be non-capturing; this just means that we would like to group the whole expression, but then we do not need to refer back to it. However, we could have used a capturing group too (i.e. (...) )
The [^...] is a negated character class. It means, match any character, which is none of those listed here. (Compare to the (non-negated) character-class mentioned above.)
The pipe symbol, i.e. |, is the alternation operator, which means, that any of the expressions can match.
Finally, the $ anchor matches the end of the line, or string (depending on the context).
word = word.replace(/^y/, '');
This line removes 'y'-s from the beginning of words (probably 'y' at the beginning does not count as a syllable -- which makes sense in my opinion).
^ is the anchor for matching the beginning of the line, or string (c.f. $ mentioned above).
Note: the algorithm only works if word really contains one single word.
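To see the three steps described above in action, here is a traced variant of the function. It is a sketch for illustration only: the name traceSyllables and the shape of the returned object are made up, and unlike the original it falls back to an empty array when no vowel is found.

```javascript
// Hypothetical tracing variant of new_count: returns the intermediate strings too.
function traceSyllables(word) {
  const lower = word.toLowerCase();
  if (lower.length <= 3) {
    // Short words are counted as one syllable without any further processing.
    return { afterSuffix: lower, afterLeadingY: lower, groups: [], count: 1 };
  }
  // Step 1: drop a trailing silent ending such as "-ed" or consonant + "e(s)".
  const afterSuffix = lower.replace(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '');
  // Step 2: drop a leading "y".
  const afterLeadingY = afterSuffix.replace(/^y/, '');
  // Step 3: every run of one or two vowels counts as one syllable.
  const groups = afterLeadingY.match(/[aeiouy]{1,2}/g) || [];
  return { afterSuffix, afterLeadingY, groups, count: groups.length };
}

for (const w of ['jumped', 'yellow', 'table', 'poem']) {
  console.log(w, traceSyllables(w));
}
```

For "jumped" the ending "ed" is removed, leaving "jump" with one vowel group; "table" keeps its final "e" because "l" is in the excluded set, so it counts two groups.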
/(?:[^laeiouy]es|ed|[^laeiouy]e)$/
That matches three possible substrings: a letter other than 'l' or a vowel followed by 'es' (like "res" or "tes"); 'ed'; or a non-vowel, non-'l' followed by just an 'e'. Those patterns must appear at the end of the word to match because of the $ at the end of the pattern. The grouping (?: ) is just a grouping; the leading ?: makes that distinction. The pattern could have been a little shorter:
/(?:[^laeiouy]es?|ed)$/
would do the same thing. In any case, if the pattern matches the characters involved are removed from the word.
Then,
/^y/
matches a 'y' at the beginning of a word. If a 'y' is found, it's removed.
Finally,
/[aeiouy]{1,2}/g
matches any one- or two-character stretch of vowels (including 'y'). The g suffix makes it a global match, so that the return value is an array consisting of all such spans of vowels. The length of that returned array is the number of syllables (according to this technique).
Note that the words "poem" and "lion" would be reported as one-syllable words, which may be correct for some English variants but not all.
Here is a pretty good reference for JavaScript regular expression operators.

How to find occurrence of multiple strings in a given string using javascript RegExp()

I wanted to check the availability of multiple strings in a given string ( without using a loop ).
like
my_string = "How to find occurence of multiple sting in a given string using javascript RegExp";
// search operated on this string
// having a-z (lower case) , '_' , 0-9 , and space
// following are the strings wanted to search .( having a-z , 0-9 , and '_')
search_str[0]="How";
search_str[1]="javascript";
search_str[2]="sting";
search_str[3]="multiple";
I don't need their position.
I just need to know that all the search_str entries are present in my_string.
The order of search_str never affects the result.
Is there any regular expression available for this?
UPDATE: WHAT AM I MISSING
In the answers I found this one, which works for the problem above:
if (/^(?=.*\bHow\b)(?=.*\bjavascript\b)(?=.*\bsting\b)(?=.*\bmultiple\b)/.test(subject)) {
// Successful match
}
But in this case it is not working.
m_str="_3_5_1_13_10_11_";
search_str[0]='3';
search_str[1]='1';
tst=new RegExp("^(?=.*\\b_"+search_str[0]+"_\\b)(?=.*\\b_"+search_str[1]+"_\\b)");
if(tst.test(m_str)) alert('fooooo'); else alert('wrong');
if (/^(?=.*\bHow\b)(?=.*\bjavascript\b)(?=.*\bsting\b)(?=.*\bmultiple\b)/.test(subject)) {
// Successful match
}
This assumes that your string doesn't contain newlines. If it does, you need to change all the .s to [\s\S].
I have used word boundary anchors to make sure that Howard or resting don't accidentally provide a match. If you do want to allow that, remove the \bs.
Explanation:
(?=...) is a lookahead assertion: It looks ahead in the string to check whether the enclosed regex could match at the current position without actually consuming characters for the match. Therefore, a succession of lookaheads works like a sequence of regexes (anchored to the start of the string by ^) that are combined with a logical && operator.
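The same idea can be built from an array of search terms instead of being written out by hand. The following is a minimal sketch, not a definitive implementation: the helper names escapeRegExp and containsAll and the wholeWord parameter are assumptions introduced here, and [\s\S] is used in place of . so that newlines are allowed.

```javascript
// Escape regex metacharacters so a search term is treated literally.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: true when every term occurs somewhere in subject.
// wholeWord wraps each term in \b...\b, as in the expression above.
function containsAll(subject, terms, wholeWord = true) {
  const b = wholeWord ? '\\b' : '';
  // One lookahead per term; all of them are tested from the start of the string.
  const lookaheads = terms.map(t => `(?=[\\s\\S]*${b}${escapeRegExp(t)}${b})`);
  return new RegExp('^' + lookaheads.join('')).test(subject);
}

const myString = 'How to find occurence of multiple sting in a given string using javascript RegExp';
console.log(containsAll(myString, ['How', 'javascript', 'sting', 'multiple'])); // true
console.log(containsAll('_3_5_1_13_10_11_', ['_3_', '_1_']));                   // false
console.log(containsAll('_3_5_1_13_10_11_', ['_3_', '_1_'], false));            // true
```

The second call fails because _ counts as a word character, so \b never matches between an underscore and a digit. When the terms are already delimited by underscores, as in the question's update, leaving out the word boundaries is enough.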

Javascript Regex for all words not between certain characters

I'm trying to return a count of all words NOT between square brackets. So given ..
[don't match these words] but do match these
I get a count of 4 for the last four words.
This works in .net:
\b(?<!\[)[\w']+(?!\])\b
but it won't work in Javascript because it doesn't support lookbehind
Any ideas for a pure js regex solution?
Ok, I think this should work:
\[[^\]]+\](?:^|\s)([\w']+)(?!\])\b|(?:^|\s)([\w']+)(?!\])\b
You can test it here:
http://regexpal.com/
If you need an alternative with text in square brackets coming after the main text, it could be added as a second alternative and the current second one would become third.
It's a bit complicated but I can't think of a better solution right now.
If you need to do something with the actual matches you will find them in the capturing groups.
UPDATE:
Explanation:
So, we've got two options here:
\[[^\]]+\](?:^|\s)([\w']+)(?!\])\b
This is saying:
\[[^\]]+\] - match everything in square brackets (don't capture)
(?:^|\s) - followed by line start or a space. Looking at it now, the caret doesn't make sense here, so this will become just \s
([\w']+) - match all following word characters, as long as the next character is not the closing bracket (that is what (?!\]) checks). This is probably also unnecessary now, so let's try to remove the lookahead
\b - and match word boundary
2 (?:^|\s)([\w']+)(?!\])\b
If option 1 cannot be found, just do the word matching without looking for square brackets, since the first part already ensured that they are not here.
Ok, so I removed all the things that we don't need (they stayed there because I tried quite a few options before it worked:-) and the revised regex is the one below:
\[[^\]]+\]\s([\w']+)(?!\])\b|(?:^|\s)([\w']+)\b
I would use something like \[[^\]]*\] to remove the words between square brackets, and then explode by spaces the returned string to count the remaining words.
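A minimal sketch of that strip-then-split approach; the function name countWordsOutsideBrackets is made up for illustration.

```javascript
// Hypothetical helper: counts the words that are not inside [square brackets].
function countWordsOutsideBrackets(text) {
  // Replace each [bracketed group] with a space so neighbouring words stay separate.
  const stripped = text.replace(/\[[^\]]*\]/g, ' ');
  // Split on whitespace and drop the empty strings left by leading or trailing spaces.
  return stripped.split(/\s+/).filter(Boolean).length;
}

console.log(countWordsOutsideBrackets("[don't match these words] but do match these")); // 4
```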
Chris, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a general question about how to exclude patterns in regex.)
Here's our simple regex (see it at work on regex101, looking at the Group captures in the bottom right panel):
\[[^\]]*\]|(\b\w+\b)
The left side of the alternation matches complete [bracketed groups]. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right words because they were not matched by the expression on the left.
This program shows how to use the regex (see the count result in the online demo):
<script>
var subject = '[match ye not these words] but do match these';
var regex = /\[[^\]]*\]|(\b\w+\b)/g;
var group1Caps = [];
var match = regex.exec(subject);
// put Group 1 captures in an array
while (match != null) {
if( match[1] != null ) group1Caps.push(match[1]);
match = regex.exec(subject);
}
document.write("<br>*** Number of Matches ***<br>");
document.write(group1Caps.length);
</script>
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
