regex lookbehind alternative for parser (js)

Good morning
(I saw that this topic has a LOT of answers, but I couldn't find one that fits.)
I am writing a little parser in JavaScript that cuts the text into sections like this:
var tex = "hello this :word is apart"
var parsed = [
"hello",
" ",
"this",
" ",
// neither ":word" nor "word" should appear here
" ",
"is",
"apart"
]
The perfect regex for this is:
/((?!:[a-z]+)([ ]+|(?<= |^)[a-z]*(?= |$)))/g
But it uses a positive lookbehind which, as far as I have read, was only added to JavaScript in 2018 (ES2018), so I expect browser compatibility problems, and I would like to keep at least some compatibility.
I considered:
using a non-capturing group (?:) instead of the lookbehind, but it consumes the space before the word
just removing the space checks, but then ":word" comes in as "word"
parsing the text twice, once for words and once for spaces, but I fear putting the results back in the right order would be a pain
To be clear, I NEED the words AND ALL the spaces, and I need to exclude some words.
I am open to other methods, including ones that don't use regex.
My last option:
removing the space checks and ordering the alternatives in my regex so that ":word" is matched by the "special words" group before anything else.
My question:
would that work in JavaScript, and would it be reliable?
I tried
/((:[a-z]+)|([ ]+)|([a-z]*))/g
on https://regexr.com/ and it seems to work. Will it work in every case?

You said you're open to non-regex solutions, but I can give you one that combines both. Since you can't rely on lookbehind being supported, just capture everything and filter out what you don't want: the words preceded by a colon.
const text = 'hello this :word is apart';
const regex = /(\w+)|(:\w+)|(\s+)/g;
const parsed = text.match(regex).filter(word => !word.includes(':'));
console.log(parsed);
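On the "will it work in every case?" part of the question, here is a minimal sketch of the single-pass idea that uses only features older than ES2018. The function name tokenize is hypothetical, and + is used instead of * so that no empty matches are produced. The ":word" alternative comes first and is recognised through its capture group rather than by searching for a colon afterwards.

```javascript
// Hypothetical helper: splits text into lowercase words and runs of spaces,
// dropping any ":word" token. No lookbehind is involved.
function tokenize(text) {
  // Group 1 = ":word" (skipped), group 2 = spaces, group 3 = plain word.
  const pattern = /(:[a-z]+)|([ ]+)|([a-z]+)/g;
  const tokens = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    // Keep the token only when the ":word" group did not participate.
    if (match[1] === undefined) {
      tokens.push(match[0]);
    }
  }
  return tokens;
}

const singlePassTokens = tokenize("hello this :word is apart");
console.log(singlePassTokens);
```

Anything that matches none of the three alternatives (uppercase letters, digits, punctuation) is silently skipped by exec, so real input would probably need extra alternatives.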

I would use two regexes. The first one matches the words you DON'T want, so that you can replace them with an empty string. It is a simple regex:
/:\w+/g
After the replacement you have a string that can be parsed with this regex:
/([ ]+)|([a-z]*)/g
which is a simplified version of your second regex, since the forbidden words are already gone. Note that [a-z]* can also match the empty string, so use [a-z]+ if you don't want empty entries in the result.
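A minimal sketch of this two-pass idea, with a hypothetical function name and with + in place of * to avoid empty matches. One side effect to be aware of: once ":word" is removed, the spaces on either side of it become a single run, so they come back as one two-space token rather than two one-space tokens.

```javascript
// Hypothetical helper implementing the two-pass approach.
function tokenizeTwoPass(text) {
  // Pass 1: delete every ":word".
  const withoutTagged = text.replace(/:\w+/g, "");
  // Pass 2: collect runs of spaces and lowercase words.
  // match returns null when nothing matches, hence the fallback.
  return withoutTagged.match(/[ ]+|[a-z]+/g) || [];
}

const twoPassTokens = tokenizeTwoPass("hello this :word is apart");
console.log(twoPassTokens);
```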

Related

Regex for finding element tagname and attributes "skips" attributes

I'm trying to make a regular expression that finds the tagnames and attributes of elements. For example, if I have this:
<div id="anId" class="aClass">
I want to be able to get an array that looks like this:
["(full match)", "div", "id", "anId", "class", "aClass"]
Currently, I have the regex /<(\S*?)(?: ?(.*?)="(.*?)")*>/, but for some reason it skips over every attribute except for the last one.
var str = '<div id="anId" class="aClass">'
console.log(str.match(/<(\S*)(?: ?(.*?)="(.*?)")*>/));
Regex101: https://regex101.com/r/G0ncwF/2
Another odd thing: if I remove the * after the non-capture group, the capture group in quotes seems to somehow "forget" that it's lazy. (Regex101: https://regex101.com/r/C0UwI8/2)
Why does this happen, and how can I avoid it? I couldn't find any questions/answers that helped me ("Python re.finditer match.groups() does not contain all groups from match" looked promising, but didn't seem to help me at all).
(note: I know there are better ways to get the attributes, I'm just experimenting with regex)
UPDATE:
I've figured out at least why the quantifiers seem to "forget" that they're lazy. It's actually just that the regex is trying to match all the way to the angle brackets. I suppose I must have been thinking that the non-capturing group was "insulating" everything and preventing that from happening, and I didn't see it was still lazy because there was only one angle bracket for it to find.
var str = '"foo" "bar"> "baz>"'
console.log("/\".*?\"/ produces ", str.match(/".*?"/), ", finds first quote, finds text, lazily stops at second quote");
console.log("/\".*?\">/ produces ", str.match(/".*?">/), ", finds first quote, finds text, sees second quote but doesn't see angle bracket, keeps going until it sees \">, lazily stops");
So at least that's solved. But I still don't understand why it skips over every attribute but the last one.
And note: Other regexes using different tricks to find the attributes are nice and all, but I'm mostly looking to learn why my regex skips over the attributes, so I can maybe understand regex a bit better.

Playing along with your experimentation you could do this: Instead of scanning for what you want, you can scan for what you don't want, and then filter it out:
const html = '<div id="anId" class="aClass">';
const regex = /[<> ="]/;
let result = html.split(regex).filter(Boolean);
console.log('result: '+JSON.stringify(result));
Output:
result: ["div","id","anId","class","aClass"]
Explanation:
regex /[<> ="]/ lists all the characters you don't want
.split(regex) splits your text at the unwanted characters
.filter(Boolean) gets rid of the empty strings that the split leaves behind
Mind you, this has flaws. For example, it will split incorrectly for HTML such as <div id="anId" class="aClass anotherClass">, i.e. when an attribute value contains a space. To support that you could preprocess the HTML with another regex to escape spaces inside quotes, then postprocess with another regex to restore the spaces...
Yes, an HTML parser is more reliable for this kind of task.
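As for why the original regex "skips" attributes: in JavaScript a capturing group inside a repeated group keeps only the text from its last repetition, so earlier attributes are matched but their captures are overwritten. A common workaround is to match in two steps. The sketch below is one way to do that under simplifying assumptions (double-quoted values only, no self-closing tags); the name parseTag and both patterns are illustrative, not taken from the question.

```javascript
// Hypothetical helper: returns [fullMatch, tagName, name1, value1, ...]
// or null when the input does not look like a simple opening tag.
function parseTag(tag) {
  // Step 1: capture the tag name and the whole attribute run as ONE group.
  const outer = /^<(\S+?)((?:\s+[^\s=]+="[^"]*")*)\s*>$/.exec(tag);
  if (outer === null) {
    return null;
  }
  const result = [outer[0], outer[1]];
  // Step 2: walk the attribute run with a global regex, one pair per match.
  const attributePattern = /([^\s=]+)="([^"]*)"/g;
  let attribute;
  while ((attribute = attributePattern.exec(outer[2])) !== null) {
    result.push(attribute[1], attribute[2]);
  }
  return result;
}

const tagParts = parseTag('<div id="anId" class="aClass">');
console.log(tagParts);
```

Because values are matched as "anything up to the next double quote", a value containing a space, such as class="aClass anotherClass", stays in one piece.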

RegEx match() in Javascript does not produce result as expected

I'm having trouble working out why a regex in Javascript is not working how I would expect it to.
The pattern is as follows:
\[(.+)\]\((.+)\)
trying to match text in the following format:
[Learn more](https://www.example.com)
const text = 'Lorem ipsum etc [Learn more](https://www.google.com), and produce costly [test link](https://www.google.com). [another test link](https://www.google.com).'
const regex = /\[(.+)\]\((.+)\)/
const found = text.match(regex)
console.log(found)
I am expecting the value of found to be the following:
[
"[Learn more](https://www.google.com)",
"[test link](https://www.google.com)",
"[another test link](https://www.google.com)"
]
But the value seems to be as follows:
[
"[Learn more](https://www.google.com), and produce costly [test link](https://www.google.com). [another test link](https://www.google.com)",
"Learn more](https://www.google.com), and produce costly [test link](https://www.google.com). [another test link",
"https://www.google.com"
]
I've tried the i and g flags, but that doesn't seem to work. The same pattern gives the expected result in a different application (RegExRX), but I can't get JavaScript to produce the same result.

The + quantifier is greedy and will "eat" as much of the source string as possible. You can use .+? instead:
const regex = /\[(.+?)\]\((.+?)\)/
Better yet, instead of . match "not ]":
const regex = /\[([^\]]+)\]\(([^)]+)\)/
Explicitly excluding the boundary characters can perform better anyway.

TL;DR: The regex \[(.+?)\]\((.+?)\) should do.
The original pattern doesn't work because the + quantifier is "greedy" by default: it tries to match as many characters as possible. Therefore, .+ means "as much of anything except a newline character as possible", and a closing bracket fits that definition just fine.
To make it work properly, you have to say "as much of anything as possible, up to the first closing bracket". To do that, either replace .+ with [^\]]+ (and with [^\)]+ in the second group), or simply make the quantifier lazy by appending ? to it, which turns both capturing groups into (.+?).
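One more step is needed to get the array the question expects: without the g flag, match returns only the first match together with its capture groups. A small sketch using the negated-class pattern plus g; the variable names are illustrative.

```javascript
const sampleText =
  "Lorem ipsum etc [Learn more](https://www.google.com), and produce costly " +
  "[test link](https://www.google.com). [another test link](https://www.google.com).";

// Negated character classes stop at the first "]" and ")"; g finds every link.
const linkPattern = /\[([^\]]+)\]\(([^)]+)\)/g;

// With g, match returns the full matches only (or null if there are none).
const allLinks = sampleText.match(linkPattern) || [];
console.log(allLinks);

// If the label and URL are needed too, an exec loop exposes the groups.
const linkParts = [];
let linkMatch;
linkPattern.lastIndex = 0; // make sure the scan starts from the beginning
while ((linkMatch = linkPattern.exec(sampleText)) !== null) {
  linkParts.push({ label: linkMatch[1], url: linkMatch[2] });
}
console.log(linkParts);
```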

Splitting a string at question mark, exclamation mark, or period in javascript and retain those marks?

I was a bit surprised that no one seems to have had the exact same issue in JavaScript...
I tried several different solutions, but none of them parses the content correctly.
The closest one I tried (I took its regex from a PHP solution):
const test = `abc?aaa.abcd?.aabbccc!`;
const sentencesList = test.split("/(\?|\.|!)/");
But the result is just
["abc?aaa.abcd?.aabbccc!"]
What I want to get is
['abc?', 'aaa.', 'abcd?','.', 'aabbccc!']
I am so confused.. what exactly is wrong?

/[a-z]*[?!.]/g will do what you want:
const test = `abc?aaa.abcd?.aabbccc!`;
console.log(test.match(/[a-z]*[?!.]/g))

To help you out: what you wrote is not a regex. test.split("/(\?|\.|!)/"); passes a plain string (11 characters as typed in the source), so split looks for that exact text and never finds it. A regex would be, for example, test.split(/(\?|\.|!)/);. Even so, this is not the regex you're looking for.
The problem with this regex is that it looks for a ?, ., or ! character only, and captures that lone character. What you want to do is find any number of characters, followed by one of those three characters.
Next, String.split does accept a regex, but it is the wrong tool here: with a capturing group it returns the text and the separators as separate array elements (plus some empty strings), not the sentences with their marks attached. A function such as String.match is a better fit.
Putting this all together, you'll want to start your regex with something like /.*?/. The dot matches any character (except a newline), the asterisk means 0 or more, and the question mark makes it "non-greedy": it matches as few characters as possible while still producing a valid match.
To search for your three characters, follow this with [?!.] to indicate that you want one of them (so far we have /.*?[?!.]/). Lastly, add the g flag so that it finds every instance rather than only the first: /.*?[?!.]/g. Now we can use it in match:
const rawText = `abc?aaa.abcd?.aabbccc!`;
const matchedArray = rawText.match(/.*?[?!.]/g);
console.log(matchedArray);
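For comparison, a short sketch of what split returns when it is given a regex with a capturing group, next to a match-based version. The variable names are illustrative, and [^?!.]* is swapped in for .*? here only so that the match also works across newlines.

```javascript
const sentenceText = "abc?aaa.abcd?.aabbccc!";

// split with a capturing group keeps the separators, but as separate
// elements, and leaves empty strings where two marks are adjacent or
// where the string ends with a mark.
const viaSplit = sentenceText.split(/([?!.])/);
console.log(viaSplit);

// match keeps each mark attached to the text before it.
const viaMatch = sentenceText.match(/[^?!.]*[?!.]/g) || [];
console.log(viaMatch);
```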

The following code works without explicit pattern matching. Note, however, that it is Java rather than JavaScript, and that it drops the punctuation marks instead of retaining them.
final String S = "A sentence may end with a period. Does it end any other way? Of course!";
final String[] simpleSentences = S.split("[?!.]");
// simpleSentences now has three elements; the terminating marks are not retained.

javascript regex replace multiline strings [duplicate]

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr); // null
I want the PRE block to be picked up even though it spans newline characters. I thought the 'm' flag would do that. It does not.
I found the answer here before posting. Since I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution on SO, I'll dare to post anyway. Throw stones here.
So the solution is:
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr); // <pre>...</pre> :)
Does anyone have a less cryptic way?
Edit: this is a duplicate, but since the original is harder to find than mine, I won't remove this one.
It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. I guess this is one of the sad parts of JavaScript.

DON'T use (.|[\r\n]) instead of . for multiline matching.
DO use [\s\S] instead of . for multiline matching
Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
See the benchmark I have made: https://jsben.ch/R4Hxu
Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower
NB: You can also use [^], but a comment below advises against it.

[.\n] does not work because . has no special meaning inside [], where it just means a literal dot. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).
That turns out to be somewhat cumbersome, as well as slow (see KrisWebDev's answer for details), so a better approach is to match all whitespace characters and all non-whitespace characters with [\s\S], which matches everything and is faster and simpler.
In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.
Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.
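The difference between the variants discussed above can be checked directly. This is a small sketch on the question's sample string, with illustrative variable names; it also shows that the m flag does not help, because m only changes where ^ and $ match and leaves the dot alone.

```javascript
const preSample = "<pre>aaaa\nbbb\nccc</pre>ddd";

// The dot never crosses a newline, with or without the m flag.
const dotWithM = /<pre.*?<\/pre>/m.test(preSample);

// Inside [], the dot is a literal ".", so [.\n] only matches dots and
// newlines; it already fails on the ">" that follows "<pre".
const literalDotClass = /<pre[.\n]*?<\/pre>/.test(preSample);

// [\s\S] matches whitespace or non-whitespace, i.e. any character.
const anyCharClass = /<pre[\s\S]*?<\/pre>/.test(preSample);

// Direct check of what [.\n] accepts.
const onlyDotsAndNewlines = /^[.\n]+$/.test(".\n.");
const rejectsLetters = !/^[.\n]+$/.test("a");

console.log({ dotWithM, literalDotClass, anyCharClass, onlyDotsAndNewlines, rejectsLetters });
```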

You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:
With the release of ECMAScript 2018 (ES2018) we can now use the s flag to cause . to match \n (see https://stackoverflow.com/a/36006948/141801).
Thus:
let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');
let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true
This is a recent addition and will not work in many current environments; for example, Node v8.7.0 does not seem to recognise it. It works in Chromium, though, and I'm using it in a TypeScript test I'm writing; presumably it will become more mainstream as time goes by.
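Where support is uncertain, one possible approach is to probe for the flag at run time and fall back to [\s\S]. This is a sketch with hypothetical names; it relies on the RegExp constructor throwing a SyntaxError for a flag the engine does not know, and it builds the s-flag pattern from a string so that older engines never have to parse an unsupported regex literal.

```javascript
// Hypothetical helper: true when the engine understands the s (dotAll) flag.
function supportsDotAll() {
  try {
    new RegExp(".", "s");
    return true;
  } catch (error) {
    return false;
  }
}

// Pick the pattern based on the probe; both variants match the same text.
const prePattern = supportsDotAll()
  ? new RegExp("<pre>.*?</pre>", "gs")
  : /<pre>[\s\S]*?<\/pre>/g;

const dotAllSample = "<pre>aaaa\nbbb\nccc</pre>ddd";
const preBlocks = dotAllSample.match(prePattern) || [];
console.log(preBlocks);
```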

Now there's the s (single line) modifier, which lets the dot match newlines as well :)
\s will also match newlines :D
Just add the s after the closing slash
/<pre>.*?<\/pre>/gms

[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.

I have tested it (Chrome) and it works for me (both [^] and [^\0]) when I replace the dot (.) with either [^\0] or [^], because the dot doesn't match line breaks (see here: http://www.regular-expressions.info/dot.html).
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr); //Working

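In addition to the examples above, here is an alternative:
^[\\w\\s]*$
where \w matches word characters and \s matches whitespace (including newlines). The doubled backslashes are for a pattern written as a string, e.g. one passed to new RegExp; in a regex literal write [\w\s] instead.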

[\\w\\s]*
This one was beyond helpful for me, especially for matching multiple things that include newlines; every other answer ended up grouping all of the matches together.
