Lookaheads to delimit text

I'm trying to delimit a huge text with several documents inside. Each document starts with the word 'MINISTÉRIO', so I'm trying to use lookaheads to catch everything from MINISTÉRIO until the next MINISTÉRIO:
(MINISTÉRIO)[\s\S]*?(^(?=\1))
http://regexr.com/3dk6k
I also tried:
(^MINISTÉRIO)[\s\S]*?(?=\1)
http://regexr.com/3dk6h
Neither is working. I have two questions: why is my regex not working (I think it should), and how do I fix it?
Thanks!

Issue Description
The /(MINISTÉRIO)[\s\S]*?(^(?=\1))/gm pattern matches the word MINISTÉRIO anywhere in the text and captures it into Group 1. [\s\S]*? lazily matches any character, 0 or more times, up to the beginning of a line that is followed by the word MINISTÉRIO. Thus, the last "document", which runs from some place in the string up to the end, won't be matched: no further MINISTÉRIO follows it, and you cannot use the $ anchor for the end of the string because the /m flag redefines it to match the end of any line.
Using /(^MINISTÉRIO)[\s\S]*?(?=\1)/g, you match and capture the MINISTÉRIO word at the beginning of the whole string only (there is no /m flag), and then match as few chars as possible up to the first MINISTÉRIO substring at any place in the string, since there is no check for the beginning of a line.
Solution
You may use an unrolled regex like
/^MINISTÉRIO\b.*(?:\n(?!MINISTÉRIO\b).*)*/gm
The regex demo is here
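When the text is very long, lazy matching like in your pattern takes too much time because the lookahead is retried at every character. The unrolled pattern consumes whole lines with .* and checks the lookahead only once per line, which can greatly increase performance.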
In short:
^MINISTÉRIO\b - matches MINISTÉRIO as a whole word at the start of a line:
    ^ - start of a line (due to /m modifier)
    MINISTÉRIO\b - a whole word MINISTÉRIO as \b is a word boundary
.*(?:\n(?!MINISTÉRIO\b).*)* - matches the rest of the document, i.e. any text up to the next line that starts with MINISTÉRIO:
    .* - 0+ chars other than a newline
    (?:\n(?!MINISTÉRIO\b).*)* - 0+ sequences of:
        \n(?!MINISTÉRIO\b) - a newline not followed by MINISTÉRIO as a whole word
        .* - 0+ chars other than a newline
It is basically the same as /^MINISTÉRIO\b(?:(?!^MINISTÉRIO\b)[\s\S])*/gm, but should be much faster as the tempered greedy token ((?:(?!^MINISTÉRIO\b)[\s\S])*) is rather resource consuming.
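As a quick check of the unrolled pattern, here is a minimal sketch in JavaScript. The sample text and the names `text`, `docRe` and `docs` are made up for illustration; only the regex itself comes from the answer above.

```javascript
// Hypothetical sample: two documents; the first one mentions MINISTÉRIO mid-line.
const text = [
  'MINISTÉRIO DA SAÚDE',
  'linha 1',
  'texto que cita o MINISTÉRIO no meio da linha',
  'MINISTÉRIO DA EDUCAÇÃO',
  'linha 2',
].join('\n');

// Unrolled pattern from the answer: a line starting with MINISTÉRIO,
// then every following line that does not start with MINISTÉRIO.
const docRe = /^MINISTÉRIO\b.*(?:\n(?!MINISTÉRIO\b).*)*/gm;

const docs = text.match(docRe) || [];
console.log(docs.length); // 2
console.log(docs);
```

Note that the sketch assumes \n line endings. In JavaScript the dot does not match \r either, so for \r\n input the \n in the pattern would probably need to become \r?\n.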

Related

how does javascript regex lazy match work?

For this string
abc.com/file/some.png?v=123
how do I match .png? I use
/\..*?\?/
but it is matching .com/file/some.png?, so why is the lazy match rule not working here?

There are lots of variants to this answer. I will propose matching the first file suffix after the last / character.
That can be done with this regex
/(?!.*\/)\.\w+\?/
Explanation
(?!.*/)\.\w+\?
Options: Case insensitive; Dot doesn’t match line breaks; ^$ match at line breaks
Assert that it is impossible to match the regex below starting at this position (negative lookahead) (?!.*/)
Match any single character that is NOT a line break character (line feed, carriage return, line separator, paragraph separator) .*
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Match the character “/” literally /
Match the character “.” literally \.
Match a single character that is a “word character” (ASCII letter, digit, or underscore only) \w+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
Match the question mark character \?
Created with RegexBuddy
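As for why the lazy quantifier does not help in the original attempt: a regex engine reports the leftmost match, and laziness only decides how far that match extends once its starting point is fixed. A small sketch comparing the patterns (the lookahead variant `/\.\w+(?=\?)/` is an extra illustration, not taken from the answer, and assumes the extension is always followed by `?`):

```javascript
const url = 'abc.com/file/some.png?v=123';

// The first '.' in the string is the one before "com", so the match starts
// there; *? then extends only as far as the first '?'.
const lazy = url.match(/\..*?\?/)[0];

// The answer's pattern: the '.' must have no '/' anywhere after it.
const afterLastSlash = url.match(/(?!.*\/)\.\w+\?/)[0];

// Hypothetical variant: keep the '?' out of the match with a lookahead.
const extensionOnly = url.match(/\.\w+(?=\?)/)[0];

console.log(lazy);           // .com/file/some.png?
console.log(afterLastSlash); // .png?
console.log(extensionOnly);  // .png
```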

Issue with javascript regex not matching less than 3 characters

I have the following javascript regex:
/^[^\s][a-z0-9 ]+[^\s]$/i
I need to allow any alphanumeric character as well as spaces inside the string but not at the beginning nor at the end.
Oddly enough, the above regex will not accept less than 3 characters, e.g. aa will not match but aaa will.
I am not sure why. Can anyone please help?

You have: [^\s] (requires matching at least one non-whitespace character), [a-z0-9 ]+ (requires matching at least one alphanumeric or space character), and [^\s] again (requires matching at least one non-whitespace character). So, in total, you need at least 3 characters in the string.
Use word boundaries at the beginning and end instead:
/^\b[a-z0-9 ]+\b$/i
https://regex101.com/r/2GhH3N/1
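The minimum-length effect is easy to confirm in a console. A minimal sketch (the names `original`, `bounded` and `results` are just for illustration):

```javascript
const original = /^[^\s][a-z0-9 ]+[^\s]$/i; // needs at least 3 characters
const bounded = /^\b[a-z0-9 ]+\b$/i;        // word boundaries consume nothing

// Collect [input, original result, bounded result] for a few sample strings.
const results = ['a', 'aa', 'aaa', 'a b', ' ab', 'ab '].map((s) => [
  s,
  original.test(s),
  bounded.test(s),
]);

for (const [s, o, b] of results) {
  console.log(JSON.stringify(s), 'original:', o, 'bounded:', b);
}
// "a" and "aa" fail only with the original pattern;
// " ab" and "ab " fail with both.
```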

Try the following regex:
^(?! )[a-z0-9 ]*[a-z0-9]$
Details:
^(?! ) - Start of the string with no space right after it (this excludes an initial space).
[a-z0-9 ]* - A possibly empty sequence of letters, digits and spaces (the content before the last character, see below).
[a-z0-9]$ - The last letter or digit, then the end of the string (this excludes a terminal space).

You should re-write the expression as
/^[a-z0-9]+(?:\s+[a-z0-9]+)*$/i
See the regex demo.
NOTE: If only one whitespace is allowed between the alphanumeric chars use
/^[a-z0-9]+(?:\s[a-z0-9]+)*$/i
              ^^
Details
^ - start of string
[a-z0-9]+ - 1+ letters/digits
(?:\s+[a-z0-9]+)* - 0 or more repetitions of 1+ whitespaces (\s+) and 1+ digit/letters
$ - end of string.
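A short sketch of how the two variants behave, assuming the sample inputs below (`multiSpace` and `singleSpace` are illustrative names):

```javascript
const multiSpace = /^[a-z0-9]+(?:\s+[a-z0-9]+)*$/i;  // 1+ whitespace between groups
const singleSpace = /^[a-z0-9]+(?:\s[a-z0-9]+)*$/i;  // exactly 1 whitespace between groups

console.log(multiSpace.test('a'));       // true  (a single character is enough)
console.log(multiSpace.test('ab  cd'));  // true
console.log(singleSpace.test('ab  cd')); // false (two spaces in a row)
console.log(singleSpace.test('ab cd'));  // true
console.log(multiSpace.test(' ab'));     // false (leading space)
console.log(multiSpace.test('ab '));     // false (trailing space)
```

Keep in mind that \s also matches tabs and line breaks; a literal space in the pattern would restrict it to spaces only.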

(/\s+(\W)/g, '$1') - how are the spaces being removed?

let a = ' lots of     spaces     in this  !  '
console.log(a.replace(/\s+(\W)/g, '$1'))
log shows lots of spaces in this!
The above regex does exactly what I want, but I am trying to understand why?
I understand the following:
\s+ is looking for 1 or more spaces
(\W) is capturing the non-alphanumeric characters
/g - global, search/replace all
$1 returns the prior alphanumeric character
The capture/$1 is what removes the space between the word this and !
I get it, but what I don't get is HOW are all the other spaces being removed?? I don't believe I have asked for them to (although I am happy they are).
I get this one console.log(a.replace(/\s+/g, ' ')); because the replace is replacing 1 or more spaces between alphanumeric characters with a single space ' '.
I'm scratching my head to understand HOW the first RegEx /\s+(\W)/g, '$1' replaces 1 or more spaces with a single space.

What your regex says is "match one or more whitespace characters, followed by one non-word character, and replace that whole result with that one non-word character". The key is that the \s+ is greedy, meaning that it will try to match as many characters as possible. So in any given run of spaces it will try to match all of the spaces it can. However, your regex also requires one non-word character (\W) after them. Because in your case the next character after each run of spaces is a word character (i.e. a letter), \s+ has to give back one space so that \W can match it: a space is itself a non-word character.
Therefore, given the string a   b (three spaces), and using parens to mark the \s+ and \W matches, a(  )( )b is the only way for the regex to match (\s+ matches the first two spaces and \W matches the last space). Now it's just a simple substitution. Since you wrapped the \W in parentheses, that makes it the first and only capturing group, so replacing the match with $1 will replace it with that final space.
As another example, running this replace against a !b will result in the match looking like a( )(!)b (here \s+ keeps all the spaces, because the ! that follows them is a non-word character), so the final replaced result will be a!b.

Let's take this string 'aaa   &bbb' and run it through.
We get 'aaa&bbb'
\s+ grabs the 3 spaces before the ampersand
(\W) grabs the ampersand
$1 is the ampersand and replaces '   &' with '&'
That same principle applies to the spaces. You are forcing one of the spaces to satisfy the (\W) capture group for the replacement. It's also why your exclamation point isn't nuked.

List of matches would be the following. I replaced space with ☹ so it is easier to see
"☹☹☹☹(☹)",
"☹☹☹☹(☹)",
"☹☹(!)",
"☹(☹)"
And the code is saying to replace the match with what is in the capture group.
' lots of☹☹☹☹(☹)spaces☹☹☹☹(☹)in this☹☹(!)☹(☹)'
so when you replace it you get
' lots of☹spaces☹in this!☹'
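The same walk-through can be reproduced with `matchAll`. This is a minimal sketch: the input string is reconstructed from the ☹ listing above (an assumption about the original spacing), and `matches` and `kept` are illustrative names.

```javascript
// Assumed input, rebuilt from the ☹ listing: runs of 5, 5, 2 and 2 spaces.
const a = ' lots of     spaces     in this  !  ';
const re = /\s+(\W)/g;

// For each match, record the full matched text and the captured character
// that '$1' puts back.
const matches = [...a.matchAll(re)].map((m) => ({ match: m[0], kept: m[1] }));
console.log(matches.length); // 4
console.log(matches);

const result = a.replace(re, '$1');
console.log(JSON.stringify(result)); // " lots of spaces in this! "
```

The single spaces in "lots of" and "in this" are never matched at all, since the character after them is a letter and there is no second space for \W to take.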

Javascript regular expression quantifier

I am trying to write a javascript regular expression that matches a min and max number of words based on finding this pattern: any number of characters followed by a space. This matches one word followed by an empty space (for example: one ):
(^[a-zA-Z]+\s$)
Debuggex Demo
When I add in the range quantifier {1,3}, it doesn't match two occurrences of the pattern (for example: one two ). What do I need to change to the regular expression to match a min and max of this pattern?
(^[a-zA-Z]+\s$){1,3}
Debuggex Demo
Any explanation is greatly appreciated.

Take ^ and $ out of the quantified group: inside it, every repetition would have to start at the beginning of the string and end at its end, which is only possible for a single repetition.
^([a-zA-Z]+\s){1,3}$
DEMO

The following will work exactly as specified:
^([a-zA-Z]+ ){1,3}$
Replace the space with \s to match any single whitespace character:
^([a-zA-Z]+\s){1,3}$
Add a quantifier to the \s to set how many whitespace characters are acceptable. The following allows one or more by adding +:
^([a-zA-Z]+\s+){1,3}$
If the whitespace at the end is optional, then the following will work:
^([a-zA-Z]+(\s[a-zA-Z]+){0,2})\s*$
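To compare the variants side by side, here is a small sketch (the constant names and the sample strings are made up for illustration):

```javascript
const anchorsInside = /(^[a-zA-Z]+\s$){1,3}/;                 // pattern from the question
const anchorsOutside = /^([a-zA-Z]+\s){1,3}$/;                // anchors moved out of the group
const optionalTrailing = /^([a-zA-Z]+(\s[a-zA-Z]+){0,2})\s*$/; // trailing whitespace optional

console.log(anchorsInside.test('one '));                  // true
console.log(anchorsInside.test('one two '));              // false
console.log(anchorsOutside.test('one two '));             // true
console.log(anchorsOutside.test('one two three four '));  // false (4 words)
console.log(optionalTrailing.test('one two three'));      // true
console.log(optionalTrailing.test('one two three four')); // false (4 words)
```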

(^[a-zA-Z]+\s$) will start scanning from the start of the line ^, scan for a word [a-zA-Z]+, scan for a space \s, and expect the end of the line $.
When you have two words, it does not find the end of the line after the first one, so it fails. If you take out $, the second word would fail because it is not at the start of the line.
So the start-of-line and end-of-line anchors have to go outside the quantified group.
To make it more generic:
(\S+\s*){1,3}
\S+: At least one Non-whitespace
\s*: Any amount of Whitespace
This will allow scanning of words even if there is no space at the end of the string. If you want to force the whole line, then you can put ^ in the front and $ at the end:
^(\S+\s*){1,3}$

/(\S)\1(\1)+/g matching all occurrences of three equal non-whitespace characters following each other

It's given that /(\S)\1(\1)+/g matches all occurrences of three equal non-whitespace characters following each other.
I don't understand why there are parentheses around \S and the 2nd \1, but not around the 1st \1. Can anyone help explain how the above regex works?
src: http://www.javascriptkit.com/javatutors/redev2.shtml
Thanks in advance.

The \S needs parentheses to capture its value, so you can refer back to the captured value with \1. \1 means "match the same text which capturing group #1 matched".
I believe there is a problem with this regex. You said you want to match "three equal non-whitespace characters". But the + will make this match 3 or more equal, consecutive non-whitespace characters.
The g on the end means "apply this regex over the entire input string, or globally".

The second set of parentheses is not necessary. It needlessly captures the repeated character a second time, while matching the same strings as this regex:
/(\S)\1\1+/g
Also, as @AlexD pointed out, the description should say that it matches at least three characters. If you replaced that regex with BONK in the string fooxxxxxxbar:
'fooxxxxxxbar'.replace(/(\S)\1\1+/g, 'BONK')
...you might expect the result to be fooBONKBONKbar from their description, because there are two sets of three 'x's. But in fact the result would be fooBONKbar; the first \1 matches the second 'x', and the \1+ matches the third 'x' and any 'x's that follow it. If they wanted to match just three characters, they should have left the + off.
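The difference the + makes can be checked directly. A minimal sketch (the names `s`, `withPlus`, `withoutPlus` and `m` are illustrative):

```javascript
const s = 'fooxxxxxxbar';

// With the +, all six x's are consumed by a single match.
const withPlus = s.replace(/(\S)\1\1+/g, 'BONK');
// Without it, each match is exactly three characters long.
const withoutPlus = s.replace(/(\S)\1\1/g, 'BONK');

console.log(withPlus);    // fooBONKbar
console.log(withoutPlus); // fooBONKBONKbar

// The tutorial's original pattern: group 2 only re-captures the same character.
const m = /(\S)\1(\1)+/.exec(s);
console.log(m[0], m[1], m[2]); // xxxxxx x x
```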
I noticed several other sloppy descriptions like that, plus at least one outright error: \B is equivalent to (?!\b) (a position that's not a word boundary), not [^\b] (a character that's not a backspace). For that matter, their description of word boundaries--"the position between a word and a space"--is wrong, too. A word boundary isn't defined by any particular character, like a space--in fact, it can just as well be the absence of any character that creates one. The string:
Word
...starts with a word boundary because 'W' is a word character and, being first, it's not preceded by another word character. Similarly, the 'd' is not followed by another word character, so the end of the string is also a word boundary.
Also, a regex doesn't know from words, only word characters. The definition of a word character can vary depending on the regex flavor and Unicode or locale settings, but it always includes [A-Za-z0-9_] (ASCII letters and digits plus the underscore). A word boundary is simply a position that's between one of those characters and any other character (or no other character, as I explained earlier).
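Since \b and \B match positions rather than characters, one way to see them is to insert a marker at every position they match. A minimal sketch, using '|' as an arbitrary marker:

```javascript
// \b matches at the start and end of 'Word': no space is involved.
const boundaries = 'Word'.replace(/\b/g, '|');
// \B matches every position that is NOT a word boundary.
const nonBoundaries = 'Word'.replace(/\B/g, '|');
// The underscore is a word character, so 'a_b' has no inner boundary.
const withUnderscore = 'a_b c'.replace(/\b/g, '|');

console.log(boundaries);     // |Word|
console.log(nonBoundaries);  // W|o|r|d
console.log(withUnderscore); // |a_b| |c|

// Inside a character class, \b means backspace (U+0008), so [^\b] is
// "any character except backspace", which is unrelated to \B.
console.log(/[^\b]/.test('a'));  // true
console.log(/[^\b]/.test('\b')); // false
```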
If you want to learn about regexes, I suggest you forget that site and start here instead: regular-expressions.info.
