Why are positive lookbehinds captured as part of regex matches in javascript? - javascript

The following javascript expression yields ["myParameter=foobar", "foobar"].
'XXXXmyParameter=foobar'.match(/(?:myParameter=)([A-Za-z]+)/);
Why is "myParameter=foobar" even a match? I thought that positive lookbehinds are excluded from matches?
Is there a way to only capture the ([A-Za-z]+) portion of my regex in javascript? I could just take the second item in the list, but is there a way to explicitly exclude myParameter= from matches in the regex?

(?:myParameter=) is a non-capturing group, not a lookbehind. At the time of writing, JavaScript did not support lookbehinds; they were added to the language in ES2018.
The first element of the result is always the complete match. The value of your capture group is the second element of the array, "foobar".
If you used a capture group, i.e. (myParameter=), the result would be:
["myParameter=foobar", "myParameter=", "foobar"]
So again, the first element is the complete match, every other element corresponds to a capture group.
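A minimal sketch of the two cases described above, using the question's input. The variable names are only for illustration:

```javascript
const input = 'XXXXmyParameter=foobar';

// Non-capturing prefix: element 0 is the full match, element 1 is the only capture group.
const withNonCapturing = Array.from(input.match(/(?:myParameter=)([A-Za-z]+)/));

// Capturing prefix: the prefix now shows up as its own element.
const withCapturing = Array.from(input.match(/(myParameter=)([A-Za-z]+)/));

console.log(withNonCapturing); // [ 'myParameter=foobar', 'foobar' ]
console.log(withCapturing);    // [ 'myParameter=foobar', 'myParameter=', 'foobar' ]
```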

You are not using a positive lookbehind. The (?:...) syntax is a non-capturing group, which groups an expression without capturing it (typically used for alternation across a set of different patterns).
You can simply reference the captured group for the match result.
var m = 'XXXXmyParameter=foobar'.match(/myParameter=([A-Za-z]+)/);
if (m) // match() returns null when nothing matches, so check before indexing
  console.log(m[1]); //=> "foobar"
Note: Lookbehind assertions are not supported in JavaScript ...
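Engines that implement ES2018 (for example current versions of Node.js) do accept lookbehind, so there the prefix can be excluded from the match itself. A minimal sketch, assuming such an engine; on older engines this regex literal is a syntax error, and reading capture group 1 as shown above remains the portable option:

```javascript
const query = 'XXXXmyParameter=foobar';

// (?<=myParameter=) asserts that the prefix precedes the match without consuming it.
const lookbehindMatch = query.match(/(?<=myParameter=)[A-Za-z]+/);
const value = lookbehindMatch ? lookbehindMatch[0] : null;

console.log(value); // "foobar"
```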

myParameter= is part of the match because you are using a non-capturing group.
A non-capturing group still matches (and consumes) the text; it just cannot be backreferenced.
Non Capturing Group:
(?:myParameter=)
Positive Lookahead:
(?=myParameter=)
Negative Lookahead:
(?!myParameter=)
The regex you need is:
(?!myParameter=)[A-Za-z]+$

Related

How to not match given prefix in RegEx without negative lookbehind?

Goal
The goal is matching a string in JavaScript without certain delimiters, i.e. a string between two characters (the characters can be included in the match).
For example, this string should be fully matched: $ test string $. This can appear anywhere in a string. That would be trivial, however, we want to allow escaping the syntax, e.g. The price is 5\$ to 10\$.
Summarized:
Match any string that is enclosed by two $ signs.
Do not match it if the dollar signs are escaped using \$.
Solution using negative lookbehind
A solution that achieves this goal perfectly is: (?<!\\)\$(.*?)(?<!\\)\$.
Problem
This solution uses negative lookbehind, which is not supported on Safari. How can the same matches be achieved without using negative lookbehind (i.e. on Safari)?
A solution that partially works replaces each lookbehind with a character class that consumes the preceding character, such as [^\\]\$(.*?)[^\\]\$. However, this will also match the character in front of the $ sign if it is not a \.
You can rule out what you don't want by matching it first, and capture what you want to keep in group 1:
\\\$.*?\$|\$.*?\\\$|(\$.*?\$)
You may use this regex and grab your inner text using capture group #1 as you are already doing in your current regex using lookbehind:
(?:^|[^\\])\$((?:\\.|[^$])*)\$
RegEx Details:
(?:^|[^\\]): Match start position or a non-backslash character in a non-capturing group
\$: Match starting $
(: Start capturing group
(?:\\.|[^$])*: Match any escaped character or a non-$ character. Repeat this group 0 or more times
): End capturing group
\$: Match closing $
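PS: In capture group 1, this regex should give the same results as your current regex (?<!\\)\$(.*?)(?<!\\)\$, as long as two matches are not directly adjacent, because the leading (?:^|[^\\]) consumes one character.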
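A short sketch of how that lookbehind-free pattern could be used from JavaScript. The sample text and variable names are assumptions made for illustration, and the g flag is added here so that matchAll can be used:

```javascript
// Regex from the answer above, with the g flag added.
const delimited = /(?:^|[^\\])\$((?:\\.|[^$])*)\$/g;

// In source form, '\\$' is the two characters \$ (an escaped dollar sign).
const text = 'Pay $ test string $ now. The price is 5\\$ to 10\\$.';

// Read capture group 1 of every match; the full match also contains the prefix character.
const captures = Array.from(text.matchAll(delimited), (m) => m[1]);

console.log(captures); // [ ' test string ' ]
```

The full match m[0] includes the character in front of the opening $, which is why group 1 is read instead.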

String replace with regexp overwrites non matching character

The idea is to prefix every decimal number that has no digit before the decimal point with a zero, so .03 sqrt(.02) would become 0.03 sqrt(0.02).
See the code below for a sample, the problem is that the replacement overwrites the opening parenthesis when there's one preceding the decimal point. I think that the parenthesis does not pertain to the matching string, does it?
let s='.05 sqrt(.005) another(.33) thisShouldntChange(a.b) neither(3.4)'
s=s.replace(/(?:^|\D)\.(\d+)/g , "0.$1");
console.log(s)
Make your initial group capturing, not non-capturing, and use it in the replacement:
s=s.replace(/(^|[^\d])\.(\d+)/g , "$10.$2");
//           ^---- capturing, not non-capturing
Example:
let s = '.05 sqrt(.005) another(.33) thisShouldntChange(a.b) neither(3.4)'
s=s.replace(/(^|[^\d])\.(\d+)/g , "$10.$2");
console.log(s)
I think that the parenthesis does not pertain to the matching string, does it?
It does, because it matches [^\d].
Side note: As Wiktor points out, you can use \D instead of [^\d].
Side note 2: JavaScript regexes are finally getting lookbehind (in the living specification, and will be in the ES2018 spec snapshot), so an alternate way to do this with modern JavaScript environments would be a negative lookbehind:
s=s.replace(/(?<!\d)\.(\d+)/g , "0.$1");
//           ^^^^^^^--- negative lookbehind for a digit
That means basically "If there's a digit immediately before this position, don't match." (There's also positive lookbehind, (?<=...).)
Example:
let s = '.05 sqrt(.005) another(.33) thisShouldntChange(a.b) neither(3.4)'
s=s.replace(/(?<!\d)\.(\d+)/g , "0.$1");
console.log(s)
A parenthesis is a non-digit, thus it is matched with [^\d] and removed.
The solution is to match and capture the part before a dot and then insert back using a replacement backreference.
Use
.replace(/(^|\D)\.(\d+)/g , "$10.$2")
Pattern details
(^|\D) - Capturing group 1 (later referred to with $1 from the replacement pattern): a start of string or any non-digit ([^\d] = \D)
\. - a dot
(\d+) - Capturing group 2 (later referred to with $2 from the replacement pattern): 1+ digits.
See the JS demo:
let s='.05 sqrt(.005) another(.33) thisShouldnt(a.b) neither(3.4)'
s=s.replace(/(^|\D)\.(\d+)/g , "$10.$2");
console.log(s)
Note that $10.$2 is parsed as the $1 backreference, then the literal text 0., and then the $2 backreference. The pattern has only 2 capturing groups, so there is no group 10 and $10 is not treated as a valid token in the replacement pattern.
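If relying on how $10 is parsed feels fragile, a replacement function sidesteps the question, since the groups arrive as arguments. A minimal sketch comparing both forms on the question's input; the names source, pattern, viaString and viaFunction are only illustrative:

```javascript
const source = '.05 sqrt(.005) another(.33) thisShouldntChange(a.b) neither(3.4)';
const pattern = /(^|\D)\.(\d+)/g;

// String form: with only two groups, "$10" is read as "$1" followed by a literal "0".
const viaString = source.replace(pattern, '$10.$2');

// Function form: group 1 and group 2 are passed in, so no replacement tokens are parsed.
const viaFunction = source.replace(pattern, (match, prefix, digits) => `${prefix}0.${digits}`);

console.log(viaString);
console.log(viaFunction);
```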

Regular expression match specific key words

I am trying to use regexp to match some specific key words.
In the code below, I'd like to match only the IFs on the first and second lines, which have no prefix or postfix. The regexp I am using now is \b(IF|ELSE)\b, and it gives me all the IFs back.
IF A > B THEN STOP
IF B < C THEN STOP
LOL.IF
IF.LOL
IF.ELSE
Thanks for any help in advance.
And I am using http://regexr.com/ for test.
Need to work with JS.
I'm guessing this is what you're looking for, assuming you've added the m flag for multiline:
(?:^|\s)(IF|ELSE)(?:$|\s)
It's comprised of three groups:
(?:^|\s) - Matches either the beginning of the line, or a single whitespace character
(IF|ELSE) - Matches one of your keywords
(?:$|\s) - Matches either the end of the line, or a single whitespace character.
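A small sketch of that pattern applied to the sample lines from the question. The g flag is added here to collect every occurrence, and the variable names are hypothetical:

```javascript
const program = [
  'IF A > B THEN STOP',
  'IF B < C THEN STOP',
  'LOL.IF',
  'IF.LOL',
  'IF.ELSE',
].join('\n');

// g: find every occurrence; m: let ^ and $ match at line boundaries.
const keyword = /(?:^|\s)(IF|ELSE)(?:$|\s)/gm;

// The full match may include surrounding whitespace, so read group 1.
const found = Array.from(program.matchAll(keyword), (m) => m[1]);

console.log(found); // [ 'IF', 'IF' ]
```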
You can do it with lookaround (lookahead + lookbehind). This is what you really want, as it explicitly matches what you are searching for. You don't want to check for other characters, like the string start or whitespace around the match, but to match exactly "IF or ELSE not surrounded by dots".
/(?<!\.)(IF|ELSE)(?!\.)/g
explanation:
use the g-flag to find all occurrences
(?<!X)Y is a negative lookbehind which matches a Y not preceded by an X
Y(?!X) is a negative lookahead which matches a Y not followed by an X
working example: https://regex101.com/r/oS2dZ6/1
PS: if you don't have to write the regex for JS, it is better to use a tool that supports a flavor with full lookaround support, such as PCRE on regex101.com
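A sketch of the lookaround version on the question's sample lines, assuming an engine with ES2018 lookbehind support. The names listing, notDotted and hits are illustrative:

```javascript
const listing = 'IF A > B THEN STOP\nIF B < C THEN STOP\nLOL.IF\nIF.LOL\nIF.ELSE';

// Keyword not preceded by a dot and not followed by a dot.
const notDotted = /(?<!\.)(IF|ELSE)(?!\.)/g;

// Lookarounds consume nothing, so each full match is exactly the keyword.
const hits = Array.from(listing.matchAll(notDotted), (m) => ({ keyword: m[0], index: m.index }));

console.log(hits); // [ { keyword: 'IF', index: 0 }, { keyword: 'IF', index: 19 } ]
```

One caveat with this pattern as written: it only excludes dots, so it would also find IF inside a longer word such as LIFT unless word boundaries (\b) are added.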

What does ?=^ mean in a regexp?

I want to write regexp which allows some special characters like #-. and it should contain at least one letter. I want to understand below things also:
/(?=^[A-Z0-9. '-]{1,45}$)/i
In this regexp what is the meaning of ?=^ ? What is a subexpression in regexp?
(?=) is a lookahead: it looks ahead in the string to see whether the pattern matches, without actually consuming it
^ means it matches at the BEGINNING of the input (for example with the string a test, ^test would not match as it doesn't start with "test" even though it contains it)
Overall, your expression is saying it has to ^ start and $ end with 1-45 {1,45} items that exist in your character group [A-Z0-9. '-] (case insensitive /i). The fact it is within a lookahead in this case just means it's not going to consume anything (zero-length match).
?= is a positive lookahead
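A minimal sketch of the zero-length behaviour, using the regex from the question. The second pattern is an assumption: it is just one possible way to add the "at least one letter" requirement, not something taken from the answers:

```javascript
const validator = /(?=^[A-Z0-9. '-]{1,45}$)/i;

// The lookahead checks the whole string but consumes none of it.
const hit = validator.exec("O'Neil Jr.");
console.log(hit[0] === '', hit.index);  // true 0 (a zero-length match at the start)
console.log(validator.test('hello!'));  // false ('!' is not in the character class)

// Assumption: a second lookahead in front of an ordinary (consuming) pattern
// requires at least one letter somewhere in the string.
const withLetter = /^(?=.*[A-Z])[A-Z0-9. '-]{1,45}$/i;
console.log(withLetter.test('12345'));  // false
console.log(withLetter.test("O'Neil")); // true
```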

Difference between ?:, ?! and ?=

I searched for the meaning of these expressions but couldn't understand the exact difference between them.
This is what they say:
?: Match expression but do not capture it.
?= Match a suffix but exclude it from capture.
?! Match if the suffix is absent.
I tried using these in simple RegEx and got similar results for all.
For example: the following 3 expressions give very similar results.
[a-zA-Z0-9._-]+#[a-zA-Z0-9-]+(?!\.[a-zA-Z0-9]+)*
[a-zA-Z0-9._-]+#[a-zA-Z0-9-]+(?=\.[a-zA-Z0-9]+)*
[a-zA-Z0-9._-]+#[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9]+)*
The difference between ?= and ?! is that the former requires the given expression to match and the latter requires it to not match. For example a(?=b) will match the "a" in "ab", but not the "a" in "ac". Whereas a(?!b) will match the "a" in "ac", but not the "a" in "ab".
The difference between ?: and ?= is that ?= excludes the expression from the entire match while ?: just doesn't create a capturing group. So for example a(?:b) will match the "ab" in "abc", while a(?=b) will only match the "a" in "abc". a(b) would match the "ab" in "abc" and create a capture containing the "b".
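The cases above can be checked directly. A short sketch in JavaScript, with illustrative variable names:

```javascript
const viaLookahead = Array.from('abc'.match(/a(?=b)/));    // b is checked but not consumed
const viaNonCapturing = Array.from('abc'.match(/a(?:b)/)); // b is consumed, not captured
const viaCapturing = Array.from('abc'.match(/a(b)/));      // b is consumed and captured
const negativeOnAb = 'ab'.match(/a(?!b)/);                 // a is followed by b, so no match
const negativeOnAc = Array.from('ac'.match(/a(?!b)/));     // a is not followed by b

console.log(viaLookahead);    // [ 'a' ]
console.log(viaNonCapturing); // [ 'ab' ]
console.log(viaCapturing);    // [ 'ab', 'b' ]
console.log(negativeOnAb);    // null
console.log(negativeOnAc);    // [ 'a' ]
```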
?: is for non capturing group
?= is for positive look ahead
?! is for negative look ahead
?<= is for positive look behind
?<! is for negative look behind
Please check Lookahead and Lookbehind Zero-Length Assertions for very good tutorial and examples on lookahead in regular expressions.
To better understand let's apply the three expressions plus a capturing group and analyse each behaviour.
() capturing group - the regex inside the parenthesis must be matched and the match create a capturing group
(?:) non-capturing group - the regex inside the parenthesis must be matched but does not create the capturing group
(?=) positive lookahead - asserts that the regex must be matched
(?!) negative lookahead - asserts that it is impossible to match the regex
Let's apply q(u)i to quit. q matches q and the capturing group u matches u. The match inside the capturing group is taken and a capturing group is created. So the engine continues with i. And i will match i. This last match attempt is successful. qui is matched and a capturing group with u is created.
Let's apply q(?:u)i to quit. Again, q matches q and the non-capturing group u matches u. The match from the non-capturing group is taken, but the capturing group is not created. So the engine continues with i. And i will match i. This last match attempt is successful. qui is matched.
Let's apply q(?=u)i to quit. The lookahead is positive and is followed by another token. Again, q matches q and u matches u. But the match from the lookahead must be discarded, so the engine steps back from i in the string to u. Given that the lookahead was successful the engine continues with i. But i cannot match u. So this match attempt fails.
Let's apply q(?=u)u to quit. The lookahead is positive and is followed by another token. Again, q matches q and u matches u. But the match from the lookahead must be discarded, so the engine steps back from i in the string to u. Given that the lookahead was successful the engine continues with u. And u will match u. So this match attempt is successful. qu is matched.
Let's apply q(?!i)u to quit. Here the lookahead is negative, but it still succeeds (because i does not match) and is followed by another token. Again, q matches q and, inside the lookahead, i doesn't match u. The lookahead consumed nothing, so the engine is still at u in the string. Given that the lookahead was successful the engine continues with u. And u will match u. So this match attempt is successful. qu is matched.
So, in conclusion, the real difference between lookaheads and non-capturing groups is whether you only want to test that something is present (lookahead, which consumes nothing) or to test it and also include it in the match (non-capturing group).
But capturing groups are expensive, so use them judiciously.
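The five walkthroughs above can be reproduced with a few lines of JavaScript. This is only a sketch; patterns and results are illustrative names:

```javascript
const patterns = [/q(u)i/, /q(?:u)i/, /q(?=u)i/, /q(?=u)u/, /q(?!i)u/];

// Run each pattern against "quit" and keep the full match plus any captures.
const results = patterns.map((re) => {
  const m = re.exec('quit');
  return m ? Array.from(m) : null;
});

results.forEach((r, i) => console.log(String(patterns[i]), r));
// /q(u)i/   [ 'qui', 'u' ]
// /q(?:u)i/ [ 'qui' ]
// /q(?=u)i/ null
// /q(?=u)u/ [ 'qu' ]
// /q(?!i)u/ [ 'qu' ]
```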
Try matching foobar against these:
/foo(?=b)(.*)/
/foo(?!b)(.*)/
The first regex will match and will return "bar" as first submatch — (?=b) matches the 'b', but does not consume it, leaving it for the following parentheses.
The second regex will NOT match, because it expects "foo" to be followed by something different from 'b'.
(?:...) has exactly the same effect as simple (...), but it does not return that portion as a submatch.
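A quick sketch of those three cases on foobar, with the non-capturing variant added for comparison (the variable names are illustrative):

```javascript
const positive = /foo(?=b)(.*)/.exec('foobar'); // b is asserted, then captured by (.*)
const negative = /foo(?!b)(.*)/.exec('foobar'); // foo is followed by b, so no match
const grouped = /foo(?:b)(.*)/.exec('foobar');  // b is consumed by the group, not captured

console.log(positive[1]); // "bar"
console.log(negative);    // null
console.log(grouped[1]);  // "ar"
```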
The simplest way to understand assertions is to treat them as commands inserted into a regular expression.
When the engine reaches an assertion, it immediately checks the condition the assertion describes.
If the result is true, it continues running the regular expression; otherwise the match attempt at that position fails.
This is the real difference:
>>> re.match('a(?=b)bc', 'abc')
<Match...>
>>> re.match('a(?:b)c', 'abc')
<Match...>
# note:
>>> re.match('a(?=b)c', 'abc')
None
If you don't care whether the content after "?:" or "?=" is consumed, "?:" and "?=" can be used interchangeably.
But if the rest of the pattern still needs to process that content (not just match the whole thing; in that case you can simply use "a(b)"), you have to use "?=" instead, because "?:" consumes it.
