RegEx match() in Javascript does not produce result as expected

RegEx match() in Javascript does not produce result as expected - javascript

I'm having trouble working out why a regex in Javascript is not working how I would expect it to.
The pattern is as follows:
\[(.+)\]\((.+)\)
trying to match text in the following format:
[Learn more](https://www.example.com)
const text = 'Lorem ipsum etc [Learn more](https://www.google.com), and produce costly [test link](https://www.google.com). [another test link](https://www.google.com).'
const regex = /\[(.+)\]\((.+)\)/
const found = text.match(regex)
console.log(found)
I am expecting the value of found to be the following:
[
"[Learn more](https://www.google.com)",
"[test link](https://www.google.com)",
"[another test link](https://www.google.com)"
]
But the value seems to be as follows:
[
"[Learn more](https://www.google.com), and produce costly [test link](https://www.google.com). [another test link](https://www.google.com)",
"Learn more](https://www.google.com), and produce costly [test link](https://www.google.com). [another test link",
"https://www.google.com"
]
I've tried the /ig flags but this doesn't seem to work. I'm trying in a different application (RegExRX) and getting the expected result but in Javascript, I can't get it to produce the same result.

The + quantifier is greedy and will "eat" as much of the source string as possible. You can use .+? instead:
const regex = /\[(.+?)\]\((.+?)\)/
Better yet, instead of . match "not ]":
const regex = /\[([^\]]+)\]\(([^)]+)\)/
Explicitly excluding the boundary characters can perform better anyway.

TL;DR: The regex \[(.+?)\]\((.+?)\) should do.
The reason the original pattern doesn't work is because the + quantifier is "greedy" by default—it will try to match as many characters as possible. Therefore, .+ means "as much of anything except new line character as possible". You can already tell that closing bracket fits the definition just fine.
To make it work properly, you have to say "as much of anything as possible, until the first closing bracket." To do that, you should either substitute .+ by [^\]]+ ([^\)]+ for the second group), or simply make the aforementioned quantifier not so greedy by appending it with ?, which turns both capturing groups into (.+?).

Related

Splitting a string at question mark, exclamation mark, or period in javascript and retain those marks?

I was a bit surprised, that actually no one had the exact same issue in javascript...
I tried several different solutions none of them parse the content correctly.
The closest one I tried : (I stole its regex query from a PHP solution)
const test = `abc?aaa.abcd?.aabbccc!`;
const sentencesList = test.split("/(\?|\.|!)/");
But result just going to be
["abc?aaa.abcd?.aabbccc!"]
What I want to get is
['abc?', 'aaa.', 'abcd?','.', 'aabbccc!']
I am so confused.. what exactly is wrong?

/[a-z]*[?!.]/g) will do what you want:
const test = `abc?aaa.abcd?.aabbccc!`;
console.log(test.match(/[a-z]*[?!.]/g))

To help you out, what you write is not a regex. test.split("/(\?|\.|!)/"); is simply an 11 character string. A regex would be, for example, test.split(/(\?|\.|!)/);. This still would not be the regex you're looking for.
The problem with this regex is that it's looking for a ?, ., or ! character only, and capturing that lone character. What you want to do is find any number of characters, followed by one of those three characters.
Next, String.split does not accept regexes as arguments. You'll want to use a function that does accept them (such as String.match).
Putting this all together, you'll want to start out your regex with something like this: /.*?/. The dot means any character matches, the asterisk means 0 or more, and the questionmark means "non-greedy", or try to match as few characters as possible, while keeping a valid match.
To search for your three characters, you would follow this up with /[?!.]/ to indicate you want one of these three characters (so far we have /.*?[?!.]/). Lastly, you want to add the g flag so it searches for every instance, rather than only the first. /.*?[?!.]/g. Now we can use it in match:
const rawText = `abc?aaa.abcd?.aabbccc!`;
const matchedArray = rawText.match(/.*?[?!.]/g);
console.log(matchedArray);

The following code works, I do not think we need pattern match. I take that back, I have been answering in Java.
final String S = "An sentence may end with period. Does it end any other way? Ofcourse!";
final String[] simpleSentences = S.split("[?!.]");
//now simpleSentences array has three elements in it.

regex lookbehind alternative for parser (js)

Good morning
(I saw this topic has a LOT of answers but I couldn't find one that fits)
I am writing a little parser in javascript that would cut the text into sections like this :
var tex = "hello this :word is apart"
var parsed = [
"hello",
" ",
"this",
" ",
// ":word" should not be there, neither "word"
" ",
"is",
"apart"
]
the perfect regex for this is :
/((?!:[a-z]+)([ ]+|(?<= |^)[a-z]*(?= |$)))/g
But it has a positive lookbehind that, as I read, was only implemented in javascript in 2018, so I guess many browser compatibility conflicts... and I would like it to have at least a little compatibility...
I considered :
trying capturing groups (?:) but it consumes the space before...
just removing the spaces-check, but ":word" comes in as "word"
parsing the text 2 times, one for words, the other for spaces, but i fear putting them in the right order would be a pain
Understand, I NEED words AND ALL spaces, and to exclude some words.
I am open in other methods, like not using regex.
my last option :
removing the spaces-check and organising my whole regex in the right order, praying that ":word" would be kept in the "special words" group before anything else.
my question :
would that work in javascript, and be reliable ?
I tried
/(((:[a-z]+)|([ ]+)|([a-z]*))/g
in https://regexr.com/ seems to work, will it work in every case ?

You said you're open to non-regex solutions, but I can give you one that includes both. Since you can't rely on lookbehind being supported, then just capture everything and filter out what you don't want, words followed by a colon.
const text = 'hello this :word is apart';
const regex = /(\w+)|(:\w+)|(\s+)/g;
const parsed = text.match(regex).filter(word => !word.includes(':'));
console.log(parsed);

I would use 2 regexes, first one matches the Words, you DON'T want and then replace them with an empty string, this is the simple regex:
/:\w+/g
Then replace with an empty string. Now you have a string, that can be parsed with this regex:
/([ ]+)|([a-z]*)/g
which is a simplified version of your second regex, since forbidden Words are already gone.

str.match(reg) equivalent for str.split(".") in javascript

Is there a regular expression reg, so that for any string str the results of str.split(".") and str.match(reg) are equivalent? If multiline should somehow matter, a solution for a single line would be sufficient.
As an example: Considering the RegExp /[^\.]+/g: for the string "nice.sentance", "nice.sentance".split(".") gives the same result as "nice.sentance".match(/[^\.]+/g) - ["nice", "sentance"]. However, this is not the case for any string. E.g. for the empty string "" they would give different results, "".split(".") returning [""] and "".match(/[^\.]+/g) returning null, meaning /[^\.]+/g is not a solution, as it would need to work for any possible string.
The question comes from a misinterpretation of another question here and left me wondering. I do not have a practical application for it at the moment and am interested because i could not find an answer - it looks like an interesting RegExp problem. It may however be impossible.
Things i have considered:
Imho it is fairly clear that reg needs the global flag, removing capture groups as a possibility
/[^\.]+/g does not match empty parts, e.g. for "", ".a" or "a..a"
/[^\.]*/g produces additional empty strings after non-empty matches, because when iteration starts for the next match, it can fit in an empty match. E.g. for "a"
With features not available on javascript currently (but on other languages), one could repair the previous flaw: /(?<=^|\.)[^\.]*/g
My conclusion here would be that real empty matches need to be considered but cannot be differentiated from empty matches between a non-empty match and the following dot or EOL, without "looking behind". This seems a bit vague to count as a proper argument for it being impossible, but maybe is already enough. There might however be a RegExp feature i don't know about, e.g. to advance the index after a match without including the symbol, or something similar to be used as a trick.
Allowing some correction step on the array resulting from match makes the problem trivial.
I found some related questions, which as expected utilize look-behind or capture groups though:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
characters between two delimiters
lots more similar to the above

I do not see the point but assume you have to apply this in an environment where .split is not available.
Crafting a matching regex that does the same as .split(".") or /\./ requires to account for several cases:
no input => empty split
single . => two empty splits
. at the beginning => empty split at position 0
. at the end => empty split at the end
. in the middle
multiple consecutive .s => one empty split per ..
Following this, I came up with the following solution:
^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)
Code Sample*:
const regex = /^(?=\.)[^.]*|[^.]+(?=\.)|(?<=\.)[^.]*$|^$|[^.]+|(?=\.)(?<=\.)/gm;
const test = `
.
.a
a.
a.a
a..a
.a.
..a..
.a.z
..`;
var a = test.split("\n");
a.forEach(str => {
console.log(`"${str}"`);
console.log(str.split("."));
let m; let matches = [];
while ((m = regex.exec(str)) !== null) {
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
matches.push(m[0]);
}
console.log(matches);
});
The output should be read in triple blocks: input/split/regex-match.
The output on each 2nd and 3rd line should be the same.
Have fun!
*Caveat: This requires RegExp Lookbehind Assertions: JavaScript Lookbehind assertions are approved by TC39 and are now part of the ES2018 standard.
RegExp Lookbehind Assertions have been implemented in V8 and shipped without flags with Google Chrome v62 and in Node.js v6 behind a flag and v9 without a flag. The Firefox team is working on it, and for Microsoft Edge, it's an implementation suggestion.

Get all the WORDS except one specific word

I want to get all the words, except one, from a string using JS regex match function. For example, for a string testhello123worldtestWTF, excluding the word test, the result would be helloworldWTF.
I realize that I have to do it using look-ahead functions, but I can't figiure out how exactly. I came up with the following regex (?!test)[a-zA-Z]+(?=.*test), however, it work only partially.
http://refiddle.com/refiddles/59511c2075622d324c090000

IMHO, I would try to replace the incriminated word with an empty string, no?

Lookarounds seem to be an overkill for it, you can just replace the test with nothing:
var str = 'testhello123worldtestWTF';
var res = str.replace(/test/g, '');

Plugging this into your refiddle produces the results you're looking for:
/(test)/g
It matches all occurrences of the word "test" without picking up unwanted words/letters. You can set this to whatever variable you need to hold these.

WORDS OF CAUTION
Seeing that you have no set delimiters in your inputted string, I must say that you cannot reliably exclude a specific word - to a certain extent.
For example, if you want to exclude test, this might create a problem if the input was protester or rotatestreet. You don't have clear demarcations of what a word is, thus leading you to exclude test when you might not have meant to.
On the other hand, if you just want to ignore the string test regardless, just replace test with an empty string and you are good to go.

javascript regex replace multiline strings [duplicate]

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr); // null
I'd want the PRE block be picked up, even though it spans over newline characters. I thought the 'm' flag does it. Does not.
Found the answer here before posting. SInce I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution at SO, I'll dare to post anyways. throw stones here
So the solution is:
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr); // <pre>...</pre> :)
Does anyone have a less cryptic way?
Edit: this is a duplicate but since it's harder to find than mine, I don't remove.
It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. Guess this is one of the sad parts of JavaScript..

DON'T use (.|[\r\n]) instead of . for multiline matching.
DO use [\s\S] instead of . for multiline matching
Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
See the benchmark I have made: https://jsben.ch/R4Hxu
Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower
NB: You can also use [^] but it is deprecated in the below comment.

[.\n] does not work because . has no special meaning inside of [], it just means a literal .. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).
That turns out to be somewhat cumbersome, as well as slow, (see KrisWebDev's answer for details), so a better approach would be to match all whitespace characters and all non-whitespace characters, with [\s\S], which will match everything, and is faster and simpler.
In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.
Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.

You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:
With the release of ECMA2018 we can now use the s flag to cause . to match \n (see https://stackoverflow.com/a/36006948/141801).
Thus:
let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');
let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true
This is a recent addition and will not work in many current environments, for example Node v8.7.0 does not seem to recognise it, but it works in Chromium, and I'm using it in a Typescript test I'm writing and presumably it will become more mainstream as time goes by.

Now there's the s (single line) modifier, that lets the dot matches new lines as well :)
\s will also match new lines :D
Just add the s behind the slash
/<pre>.*?<\/pre>/gms

[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.

I have tested it (Chrome) and it's working for me (both [^] and [^\0]), by changing the dot (.) with either [^\0] or [^] , because dot doesn't match line break (See here: http://www.regular-expressions.info/dot.html).
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr); //Working

In addition to above-said examples, it is an alternate.
^[\\w\\s]*$
Where \w is for words and \s is for white spaces

[\\w\\s]*
This one was beyond helpful for me, especially for matching multiple things that include new lines, every single other answer ended up just grouping all of the matches together.

We Keep Coding

JavaScript is the programming language of the Web.

RegEx match() in Javascript does not produce result as expected - javascript

The + quantifier is greedy and will "eat" as much of the source string as possible. You can use .+? instead: const regex = /\[(.+?)\]\((.+?)\)/ Better yet, instead of . match "not ]": const regex = /\[([^\]]+)\]\(([^)]+)\)/ Explicitly excluding the boundary characters can perform better anyway.

Related

Splitting a string at question mark, exclamation mark, or period in javascript and retain those marks?

regex lookbehind alternative for parser (js)

str.match(reg) equivalent for str.split(".") in javascript

Get all the WORDS except one specific word

javascript regex replace multiline strings [duplicate]

Categories

Resources