Javascript Regexp Question Mark Cross-Browser Behavior - javascript

Today I was messing around with Javascript regexp's and found out this:
//Suppose
var one = 'HELLOxBYE';
var two = 'HELLOBYE';
You could create a regex that tries to capture the 'x' in both of these ways:
/^HELLO(x?)BYE$/ //(A)
//or
/^HELLO(x)?BYE$/ //(B)
I've found out that when you use (A) on var two, the regexp returns an empty string ''; while when you use (B) the regexp returns null.
You have to be careful with that.
Does anyone know if this is cross-browser behavior?
I've tested this on Google Chrome (Webkit) build 15.
UPDATE: Whoa, just did some tests on Internet Explorer 8, and it returns an empty string '' for both cases. So my conclusion is that the best alternative is to use (A) and then test for an empty string.

Technically (A) should return '' on HELLOBYE because the capturing brackets can capture both an 'x' and an empty string, since the ? is inside the capturing group.
Whereas in (B), the capturing brackets can only ever capture the string x. If the x is not present, then the group is never captured at all, because the entire group is optional, as opposed to the regex within the group.
Subtle difference!
So a browser or regex engine will always return '' for (A), but what it returns for (B) isn't all that well defined in practice, so it may differ depending on the implementation. Chrome distinguishes between "the group matched an empty string" and "the group didn't match at all", whereas IE doesn't make this distinction (or, if it does, it coerces the return value for the second case into an empty string).
Summary -- use (A) because you know that if there is no x then the capturing group definitely matches ''. Using (B) depends on whether a browser distinguishes between "zero-length match" and "no match at all".
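A minimal sketch of the difference, assuming an engine that follows the ECMAScript specification, where a group that did not take part in the match is reported as undefined (older IE, as noted above, reports '' instead). The helper name hasX is only for illustration.

```javascript
const two = 'HELLOBYE';

// (A): the ? is inside the group, so the group always participates
// and captures '' when there is no x.
const a = /^HELLO(x?)BYE$/.exec(two);

// (B): the whole group is optional, so it is skipped when there is no x.
const b = /^HELLO(x)?BYE$/.exec(two);

console.log(a[1] === '');        // true
console.log(b[1] === undefined); // true in spec-compliant engines

// Hypothetical helper: a truthiness check treats '', undefined and null alike,
// so it works with either pattern and either engine behaviour.
const hasX = (m) => Boolean(m && m[1]);

console.log(hasX(a), hasX(b));                          // false false
console.log(hasX(/^HELLO(x)?BYE$/.exec('HELLOxBYE')));  // true
```

Testing the capture for truthiness rather than comparing against one specific value sidesteps the engine difference entirely.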

Related

str.match(reg) equivalent for str.split(".") in javascript

Is there a regular expression reg, so that for any string str the results of str.split(".") and str.match(reg) are equivalent? If multiline should somehow matter, a solution for a single line would be sufficient.
As an example: Considering the RegExp /[^\.]+/g: for the string "nice.sentance", "nice.sentance".split(".") gives the same result as "nice.sentance".match(/[^\.]+/g) - ["nice", "sentance"]. However, this is not the case for any string. E.g. for the empty string "" they would give different results, "".split(".") returning [""] and "".match(/[^\.]+/g) returning null, meaning /[^\.]+/g is not a solution, as it would need to work for any possible string.
The question comes from a misinterpretation of another question here and left me wondering. I do not have a practical application for it at the moment and am interested because I could not find an answer; it looks like an interesting RegExp problem. It may, however, be impossible.
Things I have considered:
Imho it is fairly clear that reg needs the global flag, removing capture groups as a possibility
/[^\.]+/g does not match empty parts, e.g. for "", ".a" or "a..a"
/[^\.]*/g produces additional empty strings after non-empty matches, because when iteration starts for the next match, it can fit in an empty match. E.g. for "a"
With features not available on javascript currently (but on other languages), one could repair the previous flaw: /(?<=^|\.)[^\.]*/g
My conclusion here would be that real empty matches need to be considered, but without "looking behind" they cannot be told apart from the empty matches that occur between a non-empty match and the following dot or EOL. This seems a bit vague to count as a proper argument for it being impossible, but maybe it is already enough. There might, however, be a RegExp feature I don't know about, e.g. one that advances the index after a match without including the symbol, or something similar that could be used as a trick.
Allowing some correction step on the array resulting from match makes the problem trivial.
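To illustrate the "correction step" remark, here is a rough sketch, not a pure-regex answer to the question. It assumes the observation above that /[^.]*/g yields exactly one extra empty match right after every non-empty match, and simply filters those out. The function name splitViaMatch is made up for this example.

```javascript
// Hypothetical helper: emulate str.split(".") with match() plus a fix-up pass.
function splitViaMatch(str) {
  // [^.]* can match the empty string, so match() always finds at least
  // one match here and never returns null.
  const raw = str.match(/[^.]*/g);
  // Drop the spurious empty match that directly follows a non-empty match.
  return raw.filter((part, i) => !(part === '' && i > 0 && raw[i - 1] !== ''));
}

// Compare against split(".") for the tricky cases mentioned above.
for (const s of ['', '.', '.a', 'a.', 'a', 'a.a', 'a..a', '..a..']) {
  const same = JSON.stringify(splitViaMatch(s)) === JSON.stringify(s.split('.'));
  console.log(JSON.stringify(s), splitViaMatch(s), same);
}
```

The filter looks at the unfiltered array (raw), so two genuinely empty parts in a row, as in "a..a", are kept.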
I found some related questions, which as expected utilize look-behind or capture groups though:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
characters between two delimiters
lots more similar to the above
I do not see the point but assume you have to apply this in an environment where .split is not available.
Crafting a matching regex that does the same as .split(".") or .split(/\./) requires accounting for several cases:
no input => empty split
single . => two empty splits
. at the beginning => empty split at position 0
. at the end => empty split at the end
. in the middle
multiple consecutive .s => one empty split per ..
Following this, I came up with the following solution:
^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)
Code Sample*:
const regex = /^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)/gm;
const test = `
.
.a
a.
a.a
a..a
.a.
..a..
.a.z
..`;
// One test string per line; the first line is the empty string
var a = test.split("\n");
a.forEach(str => {
  console.log(`"${str}"`);
  console.log(str.split("."));
  let m; let matches = [];
  while ((m = regex.exec(str)) !== null) {
    // Step past zero-length matches to avoid an infinite loop
    if (m.index === regex.lastIndex) {
      regex.lastIndex++;
    }
    matches.push(m[0]);
  }
  console.log(matches);
});
The output should be read in triple blocks: input/split/regex-match.
The output on each 2nd and 3rd line should be the same.
Have fun!
*Caveat: This requires RegExp Lookbehind Assertions: JavaScript Lookbehind assertions are approved by TC39 and are now part of the ES2018 standard.
RegExp Lookbehind Assertions have been implemented in V8 and shipped without flags with Google Chrome v62 and in Node.js v6 behind a flag and v9 without a flag. The Firefox team is working on it, and for Microsoft Edge, it's an implementation suggestion.

Reusing a `RegExp` object multiple times results in weird behaviours [duplicate]

I've been trying to evaluate a string against a regular expression, but I noticed some weird behaviour when I test with the same regular expression more than once: the test method alternates between true and false.
Check this codepen --> http://codepen.io/gpincheiraa/pen/BoXrEz?editors=001
var emailRegex = /^([a-zA-Z0-9_\.\-]){0,100}\#(([a-zA-Z0-9\-]){0,100}\.)+([a-zA-Z0-9]{2,4})+$/,
    phoneChileanRegex = /^\+56\S*\s*9\S*\s*\d{8}(?!\#\w+)/g,
    number = "+56982249953";
if (!number.match(phoneChileanRegex) && !number.match(emailRegex)) {
  console.log("it's not a phone number or email address");
}
//Weird Behaviour
console.log(phoneChileanRegex.test(number)); // --> true
console.log(phoneChileanRegex.test(number)); // --> false
From the MDN documentation:
As with exec() (or in combination with it), test() called multiple times on the same global regular expression instance will advance past the previous match.
So, the second time you call the method, it will look for a match after the first match, i.e. after +56982249953. There is no match because the pattern is anchored to the start of the string (^) (and because there are no characters left), so it returns false.
To make this work you have to remove the g modifier.
That's because the test method, like many RegExp methods, tries to find multiple matches in the same string when used with the g flag. To do this, it keeps track of the position it should search from in the RegExp's lastIndex property. If you don't need to find multiple matches, just leave out the g flag. And if you ever do want to use it, remember to set regex.lastIndex to 0 before you test a new string.
Read more about lastIndex on MDN
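A small sketch of the lastIndex bookkeeping described above. The pattern is a simplified, hypothetical stand-in for the phone regex in the question, and the second phone number is made up for the example.

```javascript
// Simplified, hypothetical pattern: "+56" followed by nine digits, with the g flag.
const phone = /^\+56\d{9}$/g;
const number = '+56982249953';

const first = phone.test(number);   // true, lastIndex is now 12 (end of the match)
const second = phone.test(number);  // false: the search started at index 12,
                                    // and lastIndex is reset to 0 on failure
console.log(first, second);

// Reusing the same global regex on a new string: reset lastIndex first.
phone.test(number);                 // true again, lastIndex is 12 again
phone.lastIndex = 0;
const third = phone.test('+56912345678');
console.log(third);                 // true only because of the reset

// Without the g flag, test() always starts at index 0.
const phoneNoG = /^\+56\d{9}$/;
console.log(phoneNoG.test(number), phoneNoG.test(number)); // true true
```

For a plain yes/no check, dropping the g flag is usually the simpler of the two options.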

javascript regex to match double word gives different result everytime

Here is the output from my browser's console. This regex checks for doubled words: it looks for occurrences of a word (a string containing 1 or more letters) followed by whitespace followed by the same word.
var reg=/([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/gi;
undefined
reg.test("sdfs sdsdf")
true
reg.test("sdfs sdsdf")
false
The result is true only every other time. Why this weird behavior?
That behavior is due to the use of the global flag. Remove it:
var reg=/([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/;
Using g causes the regex state (the lastIndex value) to be remembered across multiple calls to the test or exec methods.
Check the official reference and read the Description section.

It seems that JavaScript RegExp isn't finding "leftmost longest"

I observe these results:
// Test 1:
var re = /a|ab/;
"ab".match(re); // returns ["a"] <--- Unexpected
// Test 2:
re = /ab|a/;
"ab".match(re); // returns ["ab"]
I would expect tests 1 and 2 to both return ["ab"], due to the principle of "leftmost longest". I don't understand why the order of the 2 alternatives in the regex should change the results.
Find the reason below:
Note that alternatives are considered left to right until a match is
found. If the left alternative matches, the right alternative is
ignored, even if it would have produced a “better” match. Thus, when
the pattern /a|ab/ is applied to the string “ab,” it matches only the
first letter.
(source: Oreilly - Javascript Pocket Reference - Chapter 9 Regular Expressions)
Thanks.
This is because JavaScript doesn't implement the POSIX engine.
POSIX NFA Engines work similarly to Traditional NFAs with one
exception: a POSIX engine always picks the longest of the leftmost
matches. For example, the alternation cat|category would match the full word "category" whenever possible, even if the first alternative ("cat") matched and appeared earlier in the alternation. (SEE MRE 153-154)
Source: Oreilly - Javascript Pocket Reference, p.4
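A short sketch of the ordered-alternation behaviour quoted above, plus one possible workaround. The helper longestFirst is hypothetical and only approximates "longest wins" for plain literal alternatives by sorting them by length; it is not a general POSIX emulation.

```javascript
// Alternatives are tried left to right; the first one that matches wins.
console.log('ab'.match(/a|ab/)[0]); // "a"
console.log('ab'.match(/ab|a/)[0]); // "ab"

// Rewriting the alternation as an optional suffix avoids the ordering issue.
console.log('ab'.match(/ab?/)[0]);  // "ab"

// Hypothetical helper: build an alternation from literal words, longest first.
function longestFirst(words) {
  const escaped = [...words]
    .sort((x, y) => y.length - x.length)                   // longer words first
    .map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')); // escape metacharacters
  return new RegExp(escaped.join('|'));
}

console.log('category'.match(/cat|category/)[0]);                      // "cat"
console.log('category'.match(longestFirst(['cat', 'category']))[0]);   // "category"
```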

Negating in /,?(([1-9]-[1-9])|([1-9]))/g

I am trying to match a string containing a mix of digits and hyphenated digits, like a crossword answer specification, for example 1,2-2 or 1-1,3,4,2-2
/,?(([1-9]-[1-9])|([1-9]))/g is what I've come up to match the string
value = value.replace(/,?(([1-9]-[1-9])|([1-9]))/g, '');
replaces ok, and I've checked it out in an online tester.
What I really need is to negate this, so I can use it on a keyup event, examine the contents of a textarea and remove characters that don't fit, so it only allows through characters as in the example.
I've tried ^ where I expected it to work, but it's not doing what I expect. How should I negate the regex so that I remove everything that doesn't match?
If there is a better way of doing this I'm open to suggestions too.
var value = 'hello,1,2,3,4-6,1-1,3,test,4,2-2';
var pattern = /,?(([1-9]-[1-9])|([1-9]))/g;
value.replace(pattern, ''); // "hello,test"
You can use String#match. With /g flag, it returns an array of all the matches, then you can use Array#join to join them.
The problem is that String#match returns null when there is no match, so you have to handle that case and use an empty array so that it can join:
(value.match(pattern) || []).join(''); // ",1,2,3,4-6,1-1,3,4,2-2"
Note: It may be better to check the input on blur rather than on keyup. Changing the text while the user is still typing is annoying; better to wait until the user has finished typing.
Didn't test it in JS, but this should return the valid string beginning from the left and as long as valid values are encountered (note that I used \d - if you'd like 1-9 only, then use your brackets).
(?:\d(?:-\d)?,)*\d(?:-\d)?
E.g. matching this regular expression with the string "0-1,1,2,3,4-4,2,,1,3--4" will return "0-1,1,2,3,4-4,2" as the first match.
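Since the pattern above was not tested in JS, here is a quick check of it in JavaScript. The helper name keepValidRun is made up for this sketch; note that the pattern is unanchored, so it returns the first valid run found anywhere in the string, not necessarily one starting at index 0.

```javascript
// The pattern from the answer above, unchanged.
const validRun = /(?:\d(?:-\d)?,)*\d(?:-\d)?/;

// Hypothetical helper: return the first valid run, or '' if there is none.
function keepValidRun(value) {
  const m = value.match(validRun);
  return m ? m[0] : '';
}

console.log(keepValidRun('0-1,1,2,3,4-4,2,,1,3--4')); // "0-1,1,2,3,4-4,2"
console.log(keepValidRun('hello'));                   // ""
console.log(keepValidRun('x1-1,3,'));                 // "1-1,3"
```

Prefixing the pattern with ^ would restrict it to runs that start at the very beginning of the input.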
