JavaScript regex to match a doubled word gives a different result every time

Here is the output from my browser's console. This regex checks for doubled words: it looks for occurrences of a word (a string containing 1 or more letters) followed by whitespace followed by the same word.
var reg=/([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/gi;
undefined
reg.test("sdfs sdsdf")
true
reg.test("sdfs sdsdf")
false
The result is true only on alternate calls. Why does this happen?

That behavior is due to the use of the global flag. Remove it:
var reg=/([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/;
Using g causes the regex state (its lastIndex value) to be remembered across multiple calls to the test or exec methods.
See the Description section of the official reference.
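A minimal sketch of how lastIndex moves between calls, using the pattern and input from the question (the variable names trace, globalReg and plainReg are only illustrative):

```javascript
// Same pattern as in the question, with the global flag.
const globalReg = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/gi;
const input = "sdfs sdsdf";

const trace = [];
for (let i = 0; i < 4; i++) {
  const before = globalReg.lastIndex; // position this call starts searching from
  const result = globalReg.test(input);
  trace.push({ before, result, after: globalReg.lastIndex });
}
console.log(trace);

// Without the g flag, lastIndex is ignored and every call gives the same answer.
const plainReg = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/i;
const plainResults = [plainReg.test(input), plainReg.test(input)];
console.log(plainResults);
```

A successful call leaves lastIndex just past the match, so the next call starts there and fails, which resets lastIndex to 0. Note also that the pattern has no word boundaries, so on this input it matches "s s" (the last letter of the first word and the first letter of the second) rather than a whole doubled word.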

Related

str.match(reg) equivalent for str.split(".") in javascript

Is there a regular expression reg, so that for any string str the results of str.split(".") and str.match(reg) are equivalent? If multiline should somehow matter, a solution for a single line would be sufficient.
As an example: Considering the RegExp /[^\.]+/g: for the string "nice.sentance", "nice.sentance".split(".") gives the same result as "nice.sentance".match(/[^\.]+/g) - ["nice", "sentance"]. However, this is not the case for any string. E.g. for the empty string "" they would give different results, "".split(".") returning [""] and "".match(/[^\.]+/g) returning null, meaning /[^\.]+/g is not a solution, as it would need to work for any possible string.
The question comes from a misinterpretation of another question here and left me wondering. I do not have a practical application for it at the moment and am interested because I could not find an answer; it looks like an interesting RegExp problem. It may, however, be impossible.
Things I have considered:
IMHO it is fairly clear that reg needs the global flag, which rules out capture groups as a possibility
/[^\.]+/g does not match empty parts, e.g. for "", ".a" or "a..a"
/[^\.]*/g produces additional empty strings after non-empty matches, because when iteration starts for the next match, it can fit in an empty match. E.g. for "a"
With features not currently available in JavaScript (but available in other languages), one could repair the previous flaw: /(?<=^|\.)[^\.]*/g
My conclusion here would be that real empty matches need to be considered but cannot be differentiated from empty matches between a non-empty match and the following dot or EOL, without "looking behind". This seems a bit vague to count as a proper argument for it being impossible, but maybe is already enough. There might however be a RegExp feature I don't know about, e.g. to advance the index after a match without including the symbol, or something similar to be used as a trick.
Allowing some correction step on the array resulting from match makes the problem trivial.
I found some related questions, which as expected utilize look-behind or capture groups though:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
characters between two delimiters
lots more similar to the above
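The "correction step" variant mentioned in the list above could look roughly like this. It is only a sketch: splitViaMatch is a hypothetical helper name, it relies on String.prototype.matchAll and capture groups, and it therefore does not answer the stricter question of a single str.match(reg) call.

```javascript
// Hypothetical helper: emulate str.split(".") with a global regex plus a small
// post-processing step, without lookbehind.
function splitViaMatch(str) {
  const parts = [];
  // Group 1 is the piece, group 2 is what ended it: "." or the end of the string.
  for (const m of str.matchAll(/([^.]*)(\.|$)/g)) {
    parts.push(m[1]);
    if (m[2] === "") break; // piece ended at end of string: skip the trailing empty match
  }
  return parts;
}

const samples = ["", ".", ".a", "a.", "a.a", "a..a", ".a.", "..a..", "nice.sentance"];
const mismatches = samples.filter(
  (s) => JSON.stringify(splitViaMatch(s)) !== JSON.stringify(s.split("."))
);
console.log(mismatches);
```

The break is the correction: without it, an extra empty match at the end of the string would follow every final non-empty piece.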

I do not see the point, but I assume you have to apply this in an environment where .split is not available.
Crafting a matching regex that does the same as .split(".") or .split(/\./) requires accounting for several cases:
no input => empty split
single . => two empty splits
. at the beginning => empty split at position 0
. at the end => empty split at the end
. in the middle
multiple consecutive .s => one empty split per ..
Following this, I came up with the following solution:
^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)
Code Sample*:
const regex = /^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)/gm;
const test = `
.
.a
a.
a.a
a..a
.a.
..a..
.a.z
..`;
var a = test.split("\n");
a.forEach(str => {
    console.log(`"${str}"`);
    console.log(str.split("."));
    let m; let matches = [];
    while ((m = regex.exec(str)) !== null) {
        // avoid an infinite loop on zero-length matches
        if (m.index === regex.lastIndex) {
            regex.lastIndex++;
        }
        matches.push(m[0]);
    }
    console.log(matches);
});
The output should be read in triple blocks: input/split/regex-match.
The output on each 2nd and 3rd line should be the same.
Have fun!
*Caveat: This requires RegExp Lookbehind Assertions: JavaScript Lookbehind assertions are approved by TC39 and are now part of the ES2018 standard.
RegExp Lookbehind Assertions have been implemented in V8 and shipped without flags with Google Chrome v62 and in Node.js v6 behind a flag and v9 without a flag. The Firefox team is working on it, and for Microsoft Edge, it's an implementation suggestion.
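Because support depends on the engine, one option is to probe for lookbehind at run time before using the pattern above. A minimal sketch, where supportsLookbehind is a hypothetical helper name:

```javascript
// Hypothetical helper: returns true if this engine can compile a lookbehind.
function supportsLookbehind() {
  try {
    // Built from a string so that an engine without lookbehind throws here at
    // run time instead of failing to parse the whole script.
    new RegExp("(?<=a)b");
    return true;
  } catch (e) {
    return false;
  }
}

console.log(supportsLookbehind());
```

A regex literal containing (?<=...) would be a syntax error for the entire script on an engine that lacks the feature, which is why the probe uses the RegExp constructor.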

Reusing a `RegExp` object multiple times results in weird behaviours [duplicate]

I've been trying to evaluate a string against a regular expression, but I noticed a weird behaviour when I test with the same regular expression more than once: the test method alternates between true and false.
Check this codepen --> http://codepen.io/gpincheiraa/pen/BoXrEz?editors=001
var emailRegex = /^([a-zA-Z0-9_\.\-]){0,100}\@(([a-zA-Z0-9\-]){0,100}\.)+([a-zA-Z0-9]{2,4})+$/,
    phoneChileanRegex = /^\+56\S*\s*9\S*\s*\d{8}(?!\@\w+)/g,
    number = "+56982249953";
if (!number.match(phoneChileanRegex) && !number.match(emailRegex)) {
    console.log("it's not a phone number or email address");
}
// Weird behaviour
console.log(phoneChileanRegex.test(number)); // --> true
console.log(phoneChileanRegex.test(number)); // --> false

From the MDN documentation:
As with exec() (or in combination with it), test() called multiple times on the same global regular expression instance will advance past the previous match.
So, the second time you call the method, it will look for a match after the first match, i.e. after +56982249953. There is no match because the pattern is anchored to the start of the string (^) (and because there are no characters left), so it returns false.
To make this work you have to remove the g modifier.

That's because the test method, like many RegExp methods, tries to find multiple matches in the same string when used with the g flag. To do this, it keeps track of the position it should search from in the RegExp's lastIndex property. If you don't need to find multiple matches, just leave out the g flag. And if you do need it, remember to set regex.lastIndex to 0 before you test a new string.
Read more about lastIndex on MDN
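If the g flag has to stay, resetting lastIndex before each test could be wrapped in a small helper. This is a sketch only: testFromStart is a hypothetical name, and the pattern is a simplified version of the phone regex from the question (the trailing lookahead is left out).

```javascript
// Simplified from the question; still global, so it still carries lastIndex state.
const phoneRegex = /^\+56\S*\s*9\S*\s*\d{8}/g;

// Hypothetical helper: reset the search position before every test.
function testFromStart(regex, str) {
  regex.lastIndex = 0;
  return regex.test(str);
}

const number = "+56982249953";
const results = [
  testFromStart(phoneRegex, number),
  testFromStart(phoneRegex, number),
  testFromStart(phoneRegex, number),
];
console.log(results);
```

With the reset in place, repeated calls on the same string agree with each other instead of alternating.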

It seems that JavaScript RegExp isn't finding "leftmost longest"

I observe these results:
// Test 1:
var re = /a|ab/;
"ab".match(re); // returns ["a"] <--- Unexpected
// Test 2:
re = /ab|a/;
"ab".match(re); // returns ["ab"]
I would expect tests 1 and 2 to both return ["ab"], due to the principle of "leftmost longest". I don't understand why the order of the 2 alternatives in the regex should change the results.

Find the reason below:
Note that alternatives are considered left to right until a match is
found. If the left alternative matches, the right alternative is
ignored, even if it would have produced a “better” match. Thus, when
the pattern /a|ab/ is applied to the string “ab,” it matches only the
first letter.
(source: Oreilly - Javascript Pocket Reference - Chapter 9 Regular Expressions)
Thanks.

This is because JavaScript doesn't implement the POSIX engine.
POSIX NFA Engines work similarly to Traditional NFAs with one
exception: a POSIX engine always picks the longest of the leftmost
matches. For example, the alternation cat|category would match the full word "category" whenever possible, even if the first alternative ("cat") matched and appeared earlier in the alternation. (SEE MRE 153-154)
Source: Oreilly - Javascript Pocket Reference, p.4
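The ordering effect can be checked directly. For plain literal alternatives, one workaround is to list longer alternatives first; longestFirst below is a hypothetical helper sketching that idea, and it does not give POSIX semantics for general patterns.

```javascript
const text = "ab";
const shortFirst = text.match(/a|ab/)[0]; // the first alternative that succeeds wins
const longFirst = text.match(/ab|a/)[0];

// Hypothetical helper: build an alternation with longer literals first, which
// approximates "longest match" for plain literal alternatives only.
function longestFirst(words) {
  const escaped = [...words]
    .sort((x, y) => y.length - x.length)
    .map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")); // escape regex metacharacters
  return new RegExp(escaped.join("|"));
}

const catMatch = "category".match(longestFirst(["cat", "category"]))[0];
console.log(shortFirst, longFirst, catMatch);
```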

Javascript Regexp Question Mark Cross-Browser Behavior

Today I was messing around with Javascript regexp's and found out this:
//Suppose
var one = 'HELLOxBYE';
var two = 'HELLOBYE';
You could create a regex that tries to capture the 'x' in both of these ways:
/^HELLO(x?)BYE$/ //(A)
//or
/^HELLO(x)?BYE$/ //(B)
I've found out that when you use (A) on var two, the regexp returns an empty string ''; while when you use (B) the regexp returns null.
You have to be careful with that.
Does anyone know if this is a cross-browser behavior?
I've tested this on Google Chrome (Webkit) build 15.
UPDATE: Whoa, just did some tests on Internet Explorer 8, and it returns an empty string '' for both cases. So my conclusion is that the best alternative is to use (A) and then test for an empty string.

Technically (A) should return '' on HELLOBYE because the capturing brackets can capture both an 'x' and an empty string, since the ? is inside the capturing group.
Whereas in (B), the capturing brackets can only ever capture the string x. If the x is not present, then the group is never captured at all, because the entire group is optional, as opposed to the regex within the group.
Subtle difference!
So a browser or regex engine will always return '' for (A), but what it returns for (B) isn't all that well defined, so may differ depending on implementation - Chrome distinguishes between "the group matched an empty string" and "the group didn't match at all". Whereas IE doesn't make this distinction (or if it does, it coerces the return type for the second case into an empty string).
Summary -- use (A) because you know that if there is no x then the capturing group definitely matches ''. Using (B) depends on whether a browser distinguishes between "zero-length match" and "no match at all".
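For comparison, this is how the two patterns can be inspected in an engine that follows the ECMAScript specification, where a group that did not take part in the match appears as undefined in the match array. The older Internet Explorer behavior described above is taken from the post and not reproduced here; the variable names are illustrative.

```javascript
const withQuantifierInside = "HELLOBYE".match(/^HELLO(x?)BYE$/);  // (A)
const withQuantifierOutside = "HELLOBYE".match(/^HELLO(x)?BYE$/); // (B)

console.log(withQuantifierInside[1]);  // group took part in the match and matched nothing
console.log(withQuantifierOutside[1]); // group did not take part in the match

// Normalizing the result makes both patterns look the same to the caller.
const captured =
  withQuantifierOutside[1] === undefined ? "" : withQuantifierOutside[1];
console.log(captured === withQuantifierInside[1]);
```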

Regular expression to remove a file's extension

I am in need of a regular expression that can remove the extension of a filename, returning only the name of the file.
Here are some examples of inputs and outputs:
myfile.png -> myfile
myfile.png.jpg -> myfile.png
I can obviously do this manually (ie removing everything from the last dot) but I'm sure that there is a regular expression that can do this by itself.
Just for the record, I am doing this in JavaScript

Just for completeness: How could this be achieved without Regular Expressions?
var input = 'myfile.png';
var output = input.substr(0, input.lastIndexOf('.')) || input;
The || input takes care of the case where lastIndexOf() returns -1. As you can see, it's still a one-liner.

/(.*)\.[^.]+$/
Result will be in that first capture group. However, it's probably more efficient to just find the position of the rightmost period and then take everything before it, without using regex.

The regular expression to match the pattern is:
/\.[^.]*$/
It finds a period character (\.), followed by 0 or more characters that are not periods ([^.]*), followed by the end of the string ($).
console.log(
"aaa.bbb.ccc".replace(/\.[^.]*$/,'')
)
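Wrapped in a function, the replace() call above can be tried on the inputs from the question and on a few edge cases. stripExtension is a hypothetical helper name used only for this sketch.

```javascript
// Hypothetical helper wrapping the replace() call above.
function stripExtension(filename) {
  return filename.replace(/\.[^.]*$/, "");
}

const inputs = ["myfile.png", "myfile.png.jpg", "file", ".htaccess"];
const stripped = inputs.map(stripExtension);
console.log(stripped);
```

A name without a period is returned unchanged because nothing matches. A dotfile such as ".htaccess" is reduced to an empty string, which may or may not be what a given project wants.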

/^(.+)(\.[^ .]+)?$/
Test cases where this works and others fail:
".htaccess" (leading period)
"file" (no file extension)
"send to mrs." (no extension, but ends in abbr.)
"version 1.2 of project" (no extension, yet still contains a period)
The common thread above is, of course, "malformed" file extensions. But you always have to think about those corner cases. :P
Test cases where this fails:
"version 1.2" (no file extension, but "appears" to have one)
"name.tar.gz" (if you view this as a "compound extension" and wanted it split into "name" and ".tar.gz")
How to handle these is problematic and best decided on a project-specific basis.

/^(.+)(\.[^ .]+)?$/
The pattern above is wrong: it will always include the extension too, because of how the JavaScript regex engine works. The (\.[^ .]+) group is optional, so the engine successfully matches the entire string with the greedy (.+) alone.
http://cl.ly/image/3G1I3h3M2Q0M
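The greedy behavior is easy to confirm. The lazy variant below is an assumption added for comparison (it is not from either answer): making the first group non-greedy lets the optional extension group take part.

```javascript
const greedy = /^(.+)(\.[^ .]+)?$/; // pattern from the earlier answer
const lazy = /^(.+?)(\.[^ .]+)?$/;  // assumption: same pattern with a lazy first group

const names = ["myfile.png", "myfile.png.jpg", ".htaccess", "file", "send to mrs."];
const greedyNames = names.map((n) => n.match(greedy)[1]);
const lazyNames = names.map((n) => n.match(lazy)[1]);
console.log(greedyNames);
console.log(lazyNames);
```

With the greedy group, capture group 1 is always the whole input. With the lazy group, the extension is split off for the first two names while the remaining three are left intact.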
Here's my tested regexp solution.
The pattern will match filenameNoExt with/without extension in the path, respecting both slash and backslash separators
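var path = "c:\\some.path/subfolder/file.ext" // backslash escaped inside the string literal
var m = path.match(/([^:\\/]*?)(?:\.([^ :\\/.]*))?$/)
var fileName = (m === null) ? "" : m[1] // capture group 1: file name without extension
var fileExt = (m === null || m[2] === undefined) ? "" : m[2] // capture group 2: extension, if present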
dissection of the above pattern:
([^:\\/]*?) // match any character, except slashes and colon, 0-or-more times,
// make the token non-greedy so that the regex engine
// will try to match the next token (the file extension)
// capture the file name token to subpattern \1
(?:\. // match the '.' but don't capture it
([^ :\\/.]*) // match file extension
// ensure that the last element of the path is matched by prohibiting slashes
// capture the file extension token to subpattern \2
)?$ // the whole file extension is optional
http://cl.ly/image/3t3N413g3K09
http://www.gethifi.com/tools/regex
This will cover all the cases that were mentioned by @RogerPate, and full paths too.

Another no-regex way of doing it (the "opposite" of @Rahul's version, not using pop() to remove the extension).
It doesn't need to refer to the variable twice, so it's easier to inline:
filename.split('.').slice(0,-1).join('.')

This will do it as well :)
'myfile.png.jpg'.split('.').reverse().slice(1).reverse().join('.');
I'd stick to the regexp though... =P

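var parts = filename.split('.');
parts.pop(); // drop the last segment (the extension)
return parts.join('.');
This will make your wish come true, though not the regular expression way. Note that filename.split('.').pop() on its own returns the extension rather than the name.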

In JavaScript you can call the replace() method, which replaces based on a regular expression.
This regular expression matches everything from the beginning of the line to the end and captures the part before the last period, so everything from the last period onward (including the period) can be removed.
/^(.*)\..*$/
How to implement the replace is covered in the Stack Overflow question "Javascript regex question".
