Detecting the end of a multiline regex sample (equivalent of \z) - javascript

I am trying to match each section of a string starting with one or more + and ending whenever it finds another instance of it or the end of the sample.
The sample is the following:
+ Horizontal Rules
Lorem ipsum
+ Lists
Everything about lists
++ Bulleted Lists
Make a list element by starting a line with an asterisk. To increase the indent put extra spaces
before the asterisk.
+ Another title
content
The regex I'm currently using (/^(\++) (.*?)(?=^\++)/gms) will split the sample into different matches. But in this case, the last section isn't matched, since the look-ahead doesn't find anything:
+ Another title
content
I tried using \z, which is supposed to match the end of the sample in a multiline regex :
^(\++) (.*?)(?=^\++|\z)
But unfortunately, the JavaScript regex engines haven't implemented this basic feature, making it look for a literal z instead.
Is there any way to replicate \z behavior ?
I've tried the following regex without any concrete results : ^(\+)+ (.*?)(?=^\++)|(?:$(?!.*))

You may use
/^(\++) (.*?)(?=^\+|$(?!.))/gms
The $(?!.) part matches the end of a line that is not followed with any char (. with s modifier matches any char).
However, it is more efficient to unroll the lazy dot matching part to grab whole lines that do not start with +:
/^(\++) (.*(?:\r?\n(?!\+).*)*)/gm
See the regex demo. Note the absense of the s flag.
Details
^ - start of a line (due to m flag)
(\++) - Capturing group 1: one or more + symbols
- a space
(.*(?:\r?\n(?!\+).*)*) - Capturing group 2:
.* - 0+ chars other than line break chars as few as possible
(?:\r?\n(?!\+).*)* - 0 or more repetitions of
\r?\n(?!\+) - a CRLF or LF line ending that is not immediately followed with + (if you want, you may make it safer if you use (?!\++ ))
.* - 0+ chars other than line break chars as few as possible

Related

JavaScript regex replace last pattern in string?

I have a string which looks like
var std = new Bammer({mode:"deg"}).bam(0, 112).bam(177.58, (line-4)/2).bam(0, -42)
.ramBam(8.1, 0).bam(8.1, (slot_height-thick)/2)
I want to put a tag around the last .bam() or .ramBam().
str.replace(/(\.(ram)?bam\(.*?\))$/i, '<span class="focus">$1</span>');
And I hope to get:
new Bammer({mode:"deg"}).bam(0, 112).bam(177.58, (line-4)/2).bam(0, -42).ramBam(8.1, 0)<span class="focus">.bam(8.1, (slot_height-thick)/2)</span>
But somehow I keep on fighting with the non greedy parameter, it wraps everything after new Bammer with the span tags. Also tried a questionmark after before the $ to make the group non greedy.
I was hoping to do this easy, and with the bam or ramBam I thought that regex would be the easiest solution but I think I'm wrong.
Where do I go wrong?
You can use the following regex:
(?!.*\)\.)(\.(?:bam|ramBam)\(.*\))$
Demo
(?!.*\)\.) # do not match ').' later in the string
( # begin capture group 1
.\ # match '.'
(?:bam|ramBam) # match 'bam' or 'ramBam' in non-cap group
\(.*\) # match '(', 0+ chars, ')'
) # end capture group 1
$ # match end of line
For the example given in the question the negative lookahead (?!.*\)\.) moves an internal pointer to just before the substring:
.bam(8.1, (slot_height-thick)/2)
as that is the first location where there is no substring ). later in the string.
If there were no end-of-line anchor $ and the string ended:
...0).bam(8.1, (slot_height-thick)/2)abc
then the substitution would still be made, resulting in a string that ends:
...0)<span class="focus">.bam(8.1, (slot_height-thick)/2)</span>abc
Including the end-of-line anchor prevents the substitution if the string does not end with the contents of the intended capture group.
Regex to use:
/\.((?:ram)?[bB]am\([^)]*\))(?!.*\.(ram)?[bB]am\()/
\. Matches period.
(?:ram)? Optionally matches ram in a non-capturing group.
[bB]am Matches bam or Bam.
\( Matches (.
[^)]* Matches 0 or more characters as long as they are not a ).
) Matches a ). Items 2. through 6. are placed in Capture Group 1.
(?!.*\.(ram)?[bB]am\() This is a negative lookahead assertion stating that the rest of the string contains no further instance of .ram( or .rambam( or .ramBam( and therefore this is the last instance.
See Regex Demo
let str = 'var std = new Bammer({mode:"deg"}).bam(0, 112).bam(177.58, 0).bam(0, -42).ramBam(8.1, 0).bam(8.1, slot_height)';
console.log(str.replace(/\.((?:ram)?[bB]am\([^)]*\))(?!.*\.(ram)?[bB]am\()/, '<span class="focus">.$1</span>'));
Update
The JavaScript regular expression engine is not powerful enough to handle nested parentheses. The only way I know of solving this is if we can make the assumption that after the final call to bam or ramBam there are no more extraneous right parentheses in the string. Then where I had been scanning the parenthesized expression with \([^)]*\), which would fail to pick up final parentheses, we must now use \(.*\) to scan everything until the final parentheses. At least I know no other way. But that also means that the way that I had been using to determine the final instance of ram or ramBam by using a negative lookahead needs a slight adjustment. I need to make sure that I have the final instance of ram or ramBam before I start doing any greedy matches:
(\.(?:bam|ramBam)(?!.*\.(bam|ramBam)\()\((.*)\))
See Regex Demo
\. Matches ..
(?:bam|ramBam) Matches bam or ramBam.
(?!.*\.(bam|ramBam)\() Asserts that Item 1. was the final instance
\( Matches (.
(.*) Greedily matches everything until ...
\) the final ).
) Items 1. through 6. are placed in Capture Group 1.
let str = 'var std = new Bammer({mode:"deg"}).bam(0, 112).bam(177.58, (line-4)/2).bam(0, -42) .ramBam(8.1, 0).bam(8.1, (slot_height-thick)/2)';
console.log(str.replace(/(\.(?:bam|ramBam)(?!.*\.(bam|ramBam)\()\((.*)\))/, '<span class="focus">$1</span>'));
The non-greedy flag isn't quite right here, as that will just make the regex select the minimal number of characters to fit the pattern. I'd suggest that you do something with a negative lookahead like this:
str.replace(/(\.(?:ram)?[Bb]am\([^)]*\)(?!.*(ram)?[Bb]am))/i, '<span class="focus">$1</span>');
Note that this will only replace the last function name (bam OR ramBam), but not both. You'd need to take a slightly different approach to be able to replace both of them.

Replace last word with asterisk, or last two words

I need to hide surname of persons. For persons with three words in their name, just hide last word, ej:
Laura Torres Bermudez
shoud be
Laura Torres ********
and for
Maria Fernanda Gonzales Lopez
should be
Maria Fernanda ******** *****
I think they are two regex because based on the number of words, regex will be applied.
I know \w+ replaces all word by a single asterisk, and with (?!\s). I can replace chars except spaces. I hope you can help me. Thanks.
This is my example:
https://regex101.com/r/yW4aZ3/942
Try this:
(?<=\w+\s+\w+\s+.*)[^\s]
Explanation:
?<= is a negative lookbehind - match only occurrences preceded by specified pattern
[^\s] / match everything except whitespace (what you used - (?!\s). - is actually weird use of lookahead - "look to next character, if it is not a whitespate; then match any character")
summary: replace any non-whitespace space character preceded by at least two sequences of letters (\w) and spaces (\s).
Just note that it won't hide anything for persons with only two words in their name (which is common in many countries).
Also, the regex has to be slightly modified for that testing tool to match one name per line - see https://regex101.com/r/yW4aZ3/943 (^ was added to match from start of each line and a "multi line" flag was set).
A JavaScript solution that does not rely on the ECMAScript 2018 extended regex features is
s = s.replace(/^(\S+\s+\S+)([\s\S]*)/, function($0, $1, $2) {return $1 + $2.replace(/\S/g, '*');})
Details:
^ - start of string
(\S+\s+\S+) - Group 1: one or more non-whitespaces, 1 or more whitespaces and then 1 or more non-whitespaces
([\s\S]*) - Group 2: any 1 or more chars.
The replacement is Group 1 contents and the contents of Group 2 with each non-whitespace char replaced with an asterisk.
Java solution:
s = s.replaceAll("(\\G(?!^)\\s*|^\\S+\\s+\\S+\\s+)\\S", "$1*");
See the regex demo
Details
(\G(?!^)\s*|^\S+\s+\S+\s+) - Group 1: either then end of the previous match (\G(?!^)) and 0 or more whitespaces or (|) 1+ non-whitespaces, 1+ whitespaces and again 1+ non-whitespaces, 1+ whitespaces at the start of the string
\S - a non-whitespace char.
Interested if this can be done in JavaScript without a callback, I came up with
str = str.replace(/^(\w+\W+\w+\W+\b)\w?|(?!^)(\W*)\w/gy, '$1$2*');
See this demo at regex101
The idea might look a bit confusing but it seems to work fine. It should fail on one or two words but start as soon, as there appears a word character after the first two words. Important to use the sticky flag y which is similar to the \G anchor (continue on last match) but always is bound to start.
To not add an additional asterisk, the ...\b)\w?... part after the first two words is essential. The word boundary will force a third word to start but the first capturing group is closed after \b and the first character of the third word will be consumed but not captured to correctly match the asterisk count.
The second capturing group on the right side of the alternation will capture any optional non word characters appearing between any words after the third one.
var strs = ['Foo', 'Foo Bar B', 'Laura Torres Bermudez', 'Maria Fernanda Gonzales Lopez'];
strs = strs.map(str => str.replace(/^(\w+\W+\w+\W+\b)\w?|(?!^)(\W*)\w/gy, '$1$2*'));
console.log(strs);

Remove Last Instance Of Character From String - Javascript - Revisited

According to the accepted answer from this question, the following is the syntax for removing the last instance of a certain character from a string (In this case I want to remove the last &):
function remove (string) {
string = string.replace(/&([^&]*)$/, '$1');
return string;
}
console.log(remove("height=74&width=12&"));
But I'm trying to fully understand why it works.
According to regex101.com,
/&([^&]*)$/
& matches the character & literally (case sensitive)
1st Capturing Group ([^&]*)
Match a single character not present in the list below [^&]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
& matches the character & literally (case sensitive)
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
So if we're matching the character & literally with the first &:
Then why are we also "matching a single character not present in the following list"?
Seems counter productive.
And then, "$ asserts position at the end of the string" - what does this mean? That it starts searching for matches from the back of the string first?
And finally, what is the $1 doing in the replaceValue? Why is it $1 instead of an empty string? ""
1- The solution for that problem I think is different to the solution you want:
That regex will replace the last "&" no matter where it is, in the middle or in the end of the string.
If you apply this regex to this two examples you will see that the first get "incorrectly" replaced:
height=74&width=12&test=1
height=74&width=12&test=1&
They get replaced as :
height=74&width=12test=1
height=74&width=12&test=1
So to really replace the last "&" the only thing you need to do is :
string.replace(/&$/, '');
Now, if you want to replace the last ocurrence of "&" no matter where it is, I will explain that regex :
$1 Represents a (capturing group), everything inside those ([^&]*) are captured inside that $1. This is a oversimplification.
&([^&]*)$
& Will match a literal "&" then in the following capturing group this regex will look for any amount (0 to infinite) of characters (NOT EQUAL TO "&", explained latter) until the end of the string or line (Depending on the flag you use in the regex, /m for matching lines ). Anything captured in this capturing group will go to $1 when you apply the replacement.
So, If you apply this logic in your mind you will see that it will always match the last & and replace it with anything on its right that does not contain a single "&""
&(<nothing-like-a-&>*)<until-we-reach-the-end> replaced by anything found inside (<nothing-like-a-&>*) == $1. In this case because of the use of * , it means 0 or more times, sometimes the capturing group $1 will be empty.
NOT EQUAL TO part:
The regex uses a [^], in simple terms [] represents a group of independent characters, example: [ab] or [ba] represents the same, it will always look for "a" or "b". Inside this you can also look for ranges like 0 to 9 like this [0-9ba], it will always match anything from 0 to 9, a or b.
The "^" here [^] represents a negation of the content, so, it will match anything not in this group, like [^0-9] will always match anything that is not a number. In your regex [^&] it was used for looking for anything that is not a "&"

Lookaheads to delimit text

I'm trying to delimit a huge text with several documents inside. Each document starts with the word 'MINISTÉRIO', so i'm trying to use lookaheads to catch everything from MINISTÉRIO until the next MINISTÉRIO:
(MINISTÉRIO)[\s\S]*?(^(?=\1))
http://regexr.com/3dk6k
I also was trying to:
(^MINISTÉRIO)[\s\S]*?(?=\1)
http://regexr.com/3dk6h
Nether is working. I have two questions: Why my regex is not working? Should be i think... And, how to fix?
Thanks!
Issue Description
The /(MINISTÉRIO)[\s\S]*?(^(?=\1))/gm matches the word MINISTÉRIO at any place in the text capturing it into Group 1. [\s\S]*? matches lazily any character, 0 or more repetitions up to a beginning of a line that is followed with the word MINISTÉRIO. Thus, if you have a "document" from some place in the string up to the end, that match won't be found as you cannot specify the $ anchor since it is redefined to match the end of a line.
Using /(^MINISTÉRIO)[\s\S]*?(?=\1)/g, you match and capture the MINISTÉRIO word at the beginning of the whole string only, and match any char as few as possible up to the first MINISTÉRIO substring in the string, at any place in the string, and there is no check for the beginning of a line.
Solution
You may use an unrolled regex like
/^MINISTÉRIO\b.*(?:\n(?!MINISTÉRIO\b).*)*/gm
The regex demo is here
When the text is too long, lazy matching like in your pattern takes too much time, and using negated character classes can greatly increase performance.
In short:
^MINISTÉRIO\b - matches MINISTÉRIO as a whole word at the start of a line:
^ - start of a line (due to /m modifier)
MINISTÉRIO\b - a whole word MINISTÉRIO as \b is a word boundary
.*(?:\n(?!MINISTÉRIO\b).*)* - matches any text that is not MINISTÉRIO at the start of a line:
.* - 0+ chars other than a newline
(?:\n(?!MINISTÉRIO\b).*)* - 0+ sequences of:
\n(?!MINISTÉRIO\b) - a newline not followed with MINISTÉRIO as a whole word
.* - 0+ chars other than a newline
It is basically the same as /^MINISTÉRIO\b(?:(?!^MINISTÉRIO\b)[\s\S])*/gm, but should be much faster as the tempered greedy token ((?:(?!^MINISTÉRIO\b)[\s\S])*) is rather resource consuming.

Javascript regular expression quantifier

I am trying to write a javascript regular expression that matches a min and max number of words based on finding this pattern: any number of characters followed by a space. This matches one word followed by an empty space (for example: one ):
(^[a-zA-Z]+\s$)
Debuggex Demo
When I add in the range quantifier {1,3}, it doesn't match two occurrences of the pattern (for example: one two ). What do I need to change to the regular expression to match a min and max of this pattern?
(^[a-zA-Z]+\s$){1,3}
Debuggex Demo
Any explanation is greatly appreciated.
Take ^ and $ out of the quantified group, because you can't match the beginning and end of the string multiple times in one line.
^([a-zA-Z]+\s){1,3}$
DEMO
The following will work exactly as specified:
^([a-zA-Z]+ ){1,3}$
Replace the space with \s to match any single whitespace character:
^([a-zA-Z]+\s){1,3}$
Add a quantifier to the \s to set how many whitespace characters are acceptable. The following allows one or more by adding +:
^([a-zA-Z]+\s+){1,3}$
If the whitespace at the end is optional, then the following will work:
^([a-zA-Z]+(\s[a-zA-Z]+){0,2})\s*$
(^[a-zA-Z]+\s$) will start scanning from the start of the line ^, scan for a word [a-zA-Z]+, scan for a space \s, and expect the end of the line $
When you have two words, it does not find the end of the line, so it fails. If you take out $, the second word would fail because it is not the start of the line.
So the start line and end line have to go around the limit scan.
To make it more generic:
(\S+\s*){1,3}
\S+: At least one Non-whitespace
\s*: Any amount of Whitespace
This will allow scanning of words even if there is no space at the end of the string. If you want to force the whole line, then you can put ^ in the front and $ at the end:
^(\S+\s*){1,3}$

Categories