Chaining multiple positive lookaheads in JavaScript regex - javascript

I'm new to learning Regular Expressions, and I came across this answer which uses positive lookahead to validate passwords.
The regular expression is - (/^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[0-9a-zA-Z]{8,}$/) and the breakdown provided by the user is -
(/^
(?=.*\d) //should contain at least one digit
(?=.*[a-z]) //should contain at least one lower case
(?=.*[A-Z]) //should contain at least one upper case
[a-zA-Z0-9]{8,} //should contain at least 8 from the mentioned characters
$/)
However, I'm not very clear on chaining multiple lookaheads together. From what I have learned, a positive lookahead checks if the expression is followed by what is specified in the lookahead. As an example, this answer says -
The regex is(?= all) matches the letters is, but only if they are immediately followed by the letters all
So, my question is how do the individual lookaheads work? If I break it down -
The first part is ^(?=.*\d). Does this indicate that at the starting of the string, look for zero or more occurrences of any character, followed by 1 digit (thereby checking the presence of 1 digit)?
If the first part is correct, then with the second part (?=.*[a-z]), does it check that after checking for Step 1 at the start of the string, look for zero or more occurrences of any character, followed by a lowercase letter? Or are the two lookaheads completely unrelated to each other?
Also, what is the use of the ( ) around every lookahead? Does it create a capturing group?
I have also looked at the Rexegg article on lookaheads, but it didn't help much.
Would appreciate any help.

As mentioned in the comments, the key point here is not the lookaheads but backtracking:
(?=.*\d) looks for a complete line (.*), then backtracks to find at least one number (\d).
This is repeated throughout the different lookaheads and could be optimized like so:
(/^
(?=\D*\d) // should contain at least one digit
(?=[^a-z]*[a-z]) // should contain at least one lower case
(?=[^A-Z]*[A-Z]) // should contain at least one upper case
[a-zA-Z0-9]{8,} // should contain at least 8 from the mentioned characters
$/)
Here, the principle of contrast applies.
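As a quick sanity check, the two patterns above can be run side by side. This is only a sketch: the sample passwords are made up, and it shows that both versions accept and reject the same inputs, not how much backtracking each one does.

```javascript
// Both patterns are copied from the text above; the sample passwords are invented.
const original  = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[0-9a-zA-Z]{8,}$/;
const optimized = /^(?=\D*\d)(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])[a-zA-Z0-9]{8,}$/;

const samples = [
  "Passw0rdX", // digit, lower, upper, 9 chars
  "password1", // no upper case letter
  "PASSWORD1", // no lower case letter
  "Passwordx", // no digit
  "Pw0rd",     // too short
  "Passw0rd!", // "!" is outside [a-zA-Z0-9]
];

// Each row: [password, result of original, result of optimized]
const results = samples.map((s) => [s, original.test(s), optimized.test(s)]);
console.log(results);
```

Every lookahead starts from the same position (right after ^), which is why their order does not matter.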

Assertions are atomic, independent expressions with separate context
from the rest of the regex.
It is best visualized as: They exist between characters.
Yes, there is such a place.
Being independent though, they receive the current search position,
then they start moving through the string trying to match something.
They literally advance their private (local) copy of the search position
to do this.
They return a true or false, depending on if they matched something.
The caller of this assertion maintains its own copy of the search position.
So, when the assertion returns, the caller's search position has not changed.
Thus, you can weave in and out of places without having to worry about
the search position.
You can see this a little more dramatically, when assertions are nested:
Target1: Boy1 has a dog and a train
Target2: Boy2 has a dog
Regex: Boy\d(?= has a dog(?! and a train))
Objective: Find the Boy# that matches the regex.
Other noteworthy things about assertions:
They are atomic (ie: independent) in that they are immune to backtracking
from external forces.
Internally, they can backtrack just like anywhere else.
But, when it comes to the position they were given, that cannot change.
Also, inside assertions, it is possible to capture just like anywhere else.
Example: ^(?=.*\b(\w+)\b) captures the last word in the string; however, the search position does not change.
Also, assertions are like a red light. The immediate expression that follows the assertion
must wait until it gets the green light.
This is the result the assertion passes back, true or false.
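The points above can be tried directly. The following sketch uses the nested-assertion regex and the two targets from this answer, plus the capture-inside-a-lookahead example; the sentence "one two three" is a made-up input.

```javascript
// Nested assertions: the regex and both targets are taken from the text above.
const boy = /Boy\d(?= has a dog(?! and a train))/;
const target1 = boy.exec("Boy1 has a dog and a train"); // inner negative lookahead rejects this
const target2 = boy.exec("Boy2 has a dog");

// Capturing inside an assertion: the overall match is empty, yet group 1 is set.
const lastWord = /^(?=.*\b(\w+)\b)/.exec("one two three");

console.log(target1);            // null
console.log(target2[0]);         // "Boy2" (the text checked by the lookahead is not part of the match)
console.log(lastWord[0].length); // 0
console.log(lastWord[1]);        // "three"
```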

Related

Javascript -- Regex -- Blacklist of multiple words to END with a partial match

I've read many Questions on StackOverflow, including this one, this one, and even read Rexegg's Best Trick, which is also in a question here. I found this one, which works on entire lines, but not "everything up to the bad word". None of these have helped me, so here I go:
In Javascript, I have a long regex pattern. I'm trying to match a sequence in similar sentence structures, like follows:
1 UniquePrefixA [some-token] and [some-token] want to take [some-token] to see some monkeys.
2 UniqueC [some-token] wants to take [some-token] to the store. UniqueB, [some-token] is in the pattern once more.
3 UniquePrefixA [some-token] is using [some-token] to [some-token].
Notice that each pattern starts with a unique prefix. Encountering that prefix signals the start of a pattern. If I encounter a prefix again during capture, I should not capture the second occurrence, and STOP THERE. I'll have captured everything up to that prefix.
If I don't encounter the prefix later in the pattern, I need to continue matching that pattern.
I'm also using capture groups (not repeating, since Capture Groups only return the last matched of that group). The capture group contents need to be returned, so I'm using match, non-greedy.
Here's my pattern and a working example
/(?:UniquePrefixA|UniqueB|UniqueC)\s*(\[some-token\])(?:and|\s)*(\[some-token\])?(\s|[^\[\]])*(\[some-token\])? --->(\s|[^\[\]])*<--- (\[some-token\])?(\s|[^\[\]])*/i
It's basically 2 repeating patterns in a specific order:
(\s|[^\[\]])* // Basically .*, but excluding brackets
(\[some-token\]) // A token [some-token]
How I can prevent the match from continuing past a black list of words?
I want this to happen where I drew three arrows, for context. The equivalent of Any character, but not the contents of this list: (UniquePrefixA|UniqueB|UniqueC) (as seen in capture group 1).
It's possible I need a better understanding of negative lookahead, or of whether it can work with a group of things. Most importantly, I'm looking to know if a negative lookahead approach can support a list of options. Or is there a better way altogether? If the answer is "you can't do that," that's cool too.
I think, an easier to maintain solution is to divide your task into 2 parts:
Find each chunk of text starting from any of your unique prefixes,
up to the next or to the end of string.
Process each such chunk, looking for your some tokens and maybe
also the content between them.
The regex performing the first task should include 3 parts:
(?:UniquePrefixA|UniqueB|UniqueC) - A non-capturing group looking
for any unique prefix.
((?:.|\n)+?) - A capturing group - the fragment to catch for further
processing (see the note below).
(?=UniquePrefixA|UniqueB|UniqueC|$) - A positive lookahead, looking
for either any unique prefix or the end of the string (a stop criterion
you are looking for).
To sum up, the whole regex looks like below:
/(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi
Note: When this answer was written, the JavaScript flavour of regex did not implement
the single-line (s) option. ES2018 added the s (dotAll) flag; without it, instead of just . in the capturing group
above, you must use (?:.|\n), meaning:
either any char other than \n (.),
or just \n.
Both these variants are "enveloped" into a non-capturing group,
to put limits of variants (both sides of |), because the repetition
marker (+?) pertains to both variants.
Note ? after +, meaning the reluctant version.
So this part of regex (the capturing group) will match any sequence of chars
including \n, ending before the next unique prefix (if any),
just as you expect.
The second task is to apply another regex to the captured chunk (group 1),
looking for [some-token]s and possibly the content between them.
You didn't specify what exactly you want to do with each chunk,
so I'm not sure what this second regex should include.
Maybe it will be enough just to match [some-token]?
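A minimal sketch of the two-step approach, using the regex given above. The input is sentence 2 from the question; since the question does not say what should happen with each chunk, the second step here only counts the [some-token] occurrences, which is an assumption.

```javascript
// Step 1: split the text into chunks, each starting at a unique prefix.
const chunkRe = /(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi;

const text =
  "UniqueC [some-token] wants to take [some-token] to the store. " +
  "UniqueB, [some-token] is in the pattern once more.";

// Group 1 of every match is the chunk that follows the prefix.
const chunks = [...text.matchAll(chunkRe)].map((m) => m[1]);

// Step 2 (assumed): count the tokens inside each chunk.
const tokenCounts = chunks.map(
  (chunk) => (chunk.match(/\[some-token\]/g) || []).length
);

console.log(chunks.length); // 2
console.log(tokenCounts);   // [ 2, 1 ]
```

Because the stop condition is a lookahead, the next prefix is not consumed and is still available as the start of the next match.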
To ensure a pattern does not occur inside a repeated character sequence such as (\s|[^\[\]])* (note that \s is already included in [^\[\]], so this can be just [^\[\]]*), prepend a negative lookahead (a zero-length assertion, like ^) inside the repeated group, so that it is checked before every character:
((?!UniquePrefixA)(\s|[^\[\]]))*
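The same idea extends to a list of words by putting an alternation inside the negative lookahead. The sketch below assumes the three prefixes from the question as the blacklist; the variable names and the sample sentence are made up for illustration.

```javascript
// Blacklist shared by the prefix group and the negative lookahead.
const stop = "UniquePrefixA|UniqueB|UniqueC";

// After a prefix, consume non-bracket characters one at a time,
// checking before each one that no blacklisted word starts here.
const tempered = new RegExp(`(?:${stop})((?:(?!${stop})[^\\[\\]])*)`, "i");

const m = tempered.exec("UniqueC wants to go to the store. UniqueB is here again.");

console.log(m[1]); // " wants to go to the store. "
```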

Since "a+?" is Lazy, Why does "a+?b" Match "aaab"?

While learning regular expressions in javascript using JavaScript: The Definitive Guide, I was confused by this passage:
But /a+?/ matches one or more occurrences of the letter a, matching as
few characters as necessary. When applied to the same string, this
pattern matches only the first letter a.
…
Now let’s use the nongreedy version: /a+?b/. This should match the
letter b preceded by the fewest number of a’s possible. When applied
to the same string “aaab”, you might expect it to match only one a and
the last letter b. In fact, however, this pattern matches the entire
string, just like the greedy version of the pattern.
Why is this so?
This is the explanation from the book:
This is because regular-expression pattern matching is done by finding
the first position in the string at which a match is possible. Since a
match is possible starting at the first character of the
string,shorter matches starting at subsequent characters are never
even considered.
I don't understand. Can anyone give me a more detailed explanation?
Okay, so you have your search space, "aaabc", and your pattern, /a+?b/
Does /a+?b/ match "a"? No.
Does /a+?b/ match "aa"? No.
Does /a+?b/ match "aaa"? No.
Does /a+?b/ match "aaab"? Yes.
Since you're matching literal characters and not any sort of wildcard, the regular expression a+?b is effectively the same as a+b anyway. The only type of sequence either one will match is a string of one or more a characters followed by a single b character. The non-greedy modifier makes no difference here, as the only thing an a can possibly match is an a.
The non-greedy qualifier becomes interesting when it's applied to something that can take on lots of different values, like .. (edit or cases where there's interesting stuff to the left of something like a+?)
edit — if you're expecting a+?b to match just the last a before the b in aaab, well that's not how it works. Searching for a pattern in a string implicitly means to search for the earliest occurrence of the pattern. Thus, though starting from the last a does give a substring that matches the pattern, it's not the first substring that matches.
The Engine Attempts a Match at the Beginning of the String
Can anyone give me a more detailed explanation?
Yes.
In short: .+? does not look for a shortest match globally, at the level of the entire string, but locally, from the position in the string where the engine is currently positioned.
How the Engine Works
When you try a regex against the string aaab, the engine first tries to find a match starting at the very first position in the string. That position is the position before the first a. If the engine cannot find a match at the first position, it moves on and tries again starting from the second position (between the first and second a)
So can a match be found by the regex a+?b at the first position? Yes.
a matches the first a
The +? quantifier tells the engine to match the fewest number of a chars necessary. Since we are looking to return a match, necessary means that the following tokens (in this case, b) have to be allowed to match. Here, the fewest number of a chars needed to allow the b to match is all the remaining a chars.
b matches
In the details the second point is a bit more complex (the engine tries to match b against the second a, fails, backtracks...) but you don't need to worry about that.
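The behaviour described above is easy to observe. In this sketch, the "aaab" cases come from the question; the "<a><b>" input is a made-up example of a case where laziness does change the result.

```javascript
const lazyAlone   = "aaab".match(/a+?/);  // nothing follows, so one "a" is enough
const lazyThenB   = "aaab".match(/a+?b/); // must keep extending until "b" can match
const greedyThenB = "aaab".match(/a+b/);

console.log(lazyAlone[0]);   // "a"
console.log(lazyThenB[0]);   // "aaab"
console.log(greedyThenB[0]); // "aaab"
console.log(lazyThenB.index); // 0 (the match starts at the earliest possible position)

// Laziness matters when the quantified token can match many different characters.
const lazyDot   = "<a><b>".match(/<.+?>/)[0];
const greedyDot = "<a><b>".match(/<.+>/)[0];

console.log(lazyDot);   // "<a>"
console.log(greedyDot); // "<a><b>"
```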
'?' after a+ means minimum number of characters to satisfy expression. /a+/ means one 'a' or as many as you can encounter before some other character. In order to satisfy /a+?/ (since it's non-greedy) it only needs a single 'a'.
In order to satisfy /a+?b/, since we have 'b' at the end, in order to satisfy this expression it needs to match one or more 'a' before it hits 'b'. It has to hit that 'b'. /a+/ doesn't have to hit b because RegEx doesn't ask for that. /a+?b/ has to hit that 'b'.
Just think about it. What other meaning could /a+?b/ have?
Hope this helps

Find out the position where a regular expression failed

I'm trying to write a lexer in JavaScript for finding tokens of a simple domain-specific language. I started with a simple implementation which just tries to match subsequent regexps from the current position in a line to find out whether it matches some token format and accept it then.
The problem is that when something doesn't match inside such regexp, the whole regexp fails, so I don't know which character exactly caused it to fail.
Is there any way to find out the position in the string which caused the regular expression to fail?
INB4: I'm not asking about debugging my regexp and verifying its correctness. It is correct already, matches correct strings and drops incorrect ones. I just want to know programmatically where exactly the regexp stopped matching, to find out the position of the character which was incorrect in the user input, and how many of the characters before it were OK.
Is there some way to do it with just simple regexps instead of going on with implementing a full-blown finite state automaton?
Short answer
There is no such thing as a "position in the string that causes the
regular expression to fail".
However, I will show you an approach to answer the reverse question:
At which token in the regex did the engine become unable to match the
string?
Discussion
In my view, the question of the position in the string which caused the regular expression to fail is upside-down. As the engine moves down the string with the left hand and the pattern with the right hand, a regex token that matches six characters one moment can later, because of quantifiers and backtracking, be reduced to matching zero characters the next—or expanded to match ten.
In my view, a more proper question would be:
At which token in the regex did the engine become unable to match the
string?
For instance, consider the regex ^\w+\d+$ and the string abc132z.
The \w+ can actually match the entire string. Yet, the entire regex fails. Does it make sense to say that the regex fails at the end of the string? I don't think so. Consider this.
Initially, \w+ will match abc132z. Then the engine advances to the next token: \d+. At this stage, the engine backtracks in the string, gradually letting the \w+ give up the 2z (so that the \w+ now only corresponds to abc13), allowing the \d+ to match 2.
At this stage, the $ assertion fails as the z is left. The engine backtracks, letting the \w+ give up the 3 character, then the 1 (so that the \w+ now only corresponds to abc), eventually allowing the \d+ to match 132. At each step, the engine tries the $ assertion and fails. Depending on engine internals, more backtracking may occur: the \d+ will give up the 2 and the 3 once again, then the \w+ will give up the c and the b. When the engine finally gives up, the \w+ only matches the initial a. Can you say that the regex "fails" on the "3"? On the "b"?
No. If you're looking at the regex pattern from left to right, you can argue that it fails on the $, because it's the first token we were not able to add to the match. Bear in mind that there are other ways to argue this.
Lower, I'll give you a screenshot to visualize this. But first, let's see if we can answer the other question.
The Other Question
Are there techniques that allow us to answer the other question:
At which token in the regex did the engine become unable to match the
string?
It depends on your regex. If you are able to slice your regex into clean components, then you can devise an expression with a series of optional lookaheads inside capture groups, allowing the match to always succeed. The first unset capture group is the one that caused the failure.
Javascript is a bit stingy on optional lookaheads, but you can write something like this:
^(?:(?=(\w+)))?(?:(?=(\w+\d+)))?(?:(?=(\w+\d+$)))?.
In PCRE, .NET, Python... you could write this more compactly:
^(?=(\w+))?(?=(\w+\d+))?(?=(\w+\d+$))?.
What happens here? Each lookahead builds incrementally on the last one, adding one token at a time. Therefore we can test each token separately. The dot at the end is an optional flourish for visual feedback: we can see in a debugger that at least one character is matched, but we don't care about that character, we only care about the capture groups.
Group 1 tests the \w+ token
Group 2 seems to test \w+\d+, therefore, incrementally, it tests the \d+ token
Group 3 seems to test \w+\d+$, therefore, incrementally, it tests the $ token
There are three capture groups. If all three are set, the match is a full success. If only Group 3 is not set (as with abc123a), you can say that the $ caused the failure. If Group 1 is set but not Group 2 (as with abc), you can say that the \d+ caused the failure.
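A variation on the same incremental idea, sketched below, tests each cumulative sub-pattern as a separate regex instead of packing them into optional lookaheads. This is a swapped-in technique, not the single-regex form shown above; it avoids depending on how a particular engine treats an optional group that matches the empty string. The helper name firstFailingToken is hypothetical.

```javascript
// Cumulative sub-patterns of ^\w+\d+$, labelled with the token each one adds.
const steps = [
  { token: "\\w+", re: /^\w+/ },
  { token: "\\d+", re: /^\w+\d+/ },
  { token: "$",    re: /^\w+\d+$/ },
];

// Hypothetical helper: returns the first token that could not be added
// to the match, or null when the full pattern matches.
function firstFailingToken(input) {
  const failed = steps.find((step) => !step.re.test(input));
  return failed ? failed.token : null;
}

console.log(firstFailingToken("abc132z")); // "$"
console.log(firstFailingToken("abc"));     // "\d+"
console.log(firstFailingToken("abc123"));  // null
```

As in the discussion above, the result names a token in the pattern, not a position in the input string.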
For reference: Inside View of a Failure Path
For what it's worth, here is a view of the failure path from the RegexBuddy debugger.
You can use a negated character set RegExp,
[^xyz]
[^a-c]
A negated or complemented character set. That is, it matches anything
that is not enclosed in the brackets. You can specify a range of
characters by using a hyphen, but if the hyphen appears as the first
or last character enclosed in the square brackets it is taken as a
literal hyphen to be included in the character set as a normal
character.
index property of String.prototype.match()
The returned Array has an extra input property, which contains the
original string that was parsed. In addition, it has an index
property, which represents the zero-based index of the match in the
string.
For example, to log the index where the digit is matched by the RegExp /[^a-zA-Z]/ in the string aBcD7zYx:
var re = /[^a-zA-Z]/;
var str = "aBcD7zYx";
var i = str.match(re).index;
console.log(i); // 4
Is there any way to find out the position in the string which caused the regular expression to fail?
No, there isn't. A regex either matches or it doesn't; there is nothing in between.
Partial expressions can match while the whole pattern doesn't, so the engine always needs to evaluate the whole expression:
Take the string Hello my World and the pattern /Hello World/. While each word matches individually, the whole expression fails. You cannot tell whether Hello or World matched; independently, both do. The whitespace between them is also present.

Understanding regular expression

There is a method in jQuery datatables library file which constructs a regular expression. Can anyone tell me what does the following regular expression mean -
^(?=.*?il)(?=.*?oh).*$
^
Matches the beginning of the input. This matches a position, rather than a character (think of it as the space in between characters).
(?=)
This is called a lookahead. Again, this matches a position. The position it matches is where the text immediately in front of the current position equals the given text, but the "pointer" doesn't move forward. Think of it like peeking ahead without popping.
.*?il
Matches any number of any character (except newlines, by default), followed by the characters "il".
.*?oh
Same as above, except for the characters "oh".
$
Matches the end of the input.
Basically, this regex is checking to see if the input string contains the characters "il" and "oh".
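A short sketch to confirm that reading. The test strings are made up, and containsAll is a hypothetical helper showing how such a pattern could be assembled from a list of words; it is not claimed to be the library's actual code.

```javascript
const re = /^(?=.*?il)(?=.*?oh).*$/;

console.log(re.test("ohil"));       // true  (order in the input does not matter)
console.log(re.test("il then oh")); // true
console.log(re.test("oil"));        // false ("oh" is missing)

// Hypothetical helper: build the same kind of pattern for any word list.
function containsAll(words) {
  // Escape regex metacharacters so each word is matched literally.
  const escape = (w) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const lookaheads = words.map((w) => `(?=.*?${escape(w)})`).join("");
  return new RegExp("^" + lookaheads + ".*$");
}

console.log(containsAll(["il", "oh"]).source); // "^(?=.*?il)(?=.*?oh).*$"
```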
Analogy:
Think of it like this. You have a lineup of people and you step up to the first person (^). You then look ahead one person at a time until you find someone with a red hat, immediately followed by a yellow hat. ((?=.*?il)). Your eyes dart back to the first person in the lineup and you repeat the search, except this time you are looking for a person wearing a purple hat immediately followed by a green hat ((?=.*?oh)). Finally, you walk past all of the people, pulling each person out of the lineup, until you come to the end of the line (.*$). If, at any point, you couldn't find what you were looking for, you would have turned around and left the room (equivalent to returning false). Otherwise, after coming to the end of the lineup, you shout "candy!" (equivalent to returning true).
Point of Interest:
The lookaheads use what's called "non-greedy" quantifiers (*?). This basically says "match as many as you must, but no more". A greedy quantifier (*) says "match as many as you can". If greedy quantifiers had been used, it would be equivalent to moving your eyes to the back of the lineup and then scanning toward the front, stopping at the first match (which would be the last in the lineup, if counting from the front).
If you were to remove the beginning of input anchor (^) then this expression would be vulnerable to catastrophic backtracking. Since the lookahead matches based on a position, if it doesn't match, then it will try to step forward one character and try again. The ^ keeps the lookaheads anchored to the first position in the input. If they can't find what they're looking for from that position, then they'll just fail.
The .*$ part is fluff. You could remove it without affecting the expression (EDIT: Well, actually, that's true if you are simply testing the input. If you are using the resulting match, then you need the .* to produce a non-zero-length string). If, however, you want to make sure that the input is a certain length, use .{5,10}$ instead. This would be like walking through the lineup, counting the number of people you've pulled out, and only yelling "candy!" if you've found at least 5 people but no more than 10 (alternatives: {5,} - at least 5 characters with no upper bound; {0,10} - no more than 10 characters with 0 as lower bound value). Given that you are looking for the characters "il" and "oh" already, there is already an implicit requirement that the input be at least 4 characters (with no upper bound).
You can use http://gskinner.com/RegExr/ to help analyse most regular expressions and test them against input data. There are a few tools like this around the Internet. This one requires Flash. (That's not a selling point, just information.)
Note that the URL I'm providing is mentioned in the tag wiki page for regex.

Looking for another regex explanation

In my regex expression, I was trying to match a password between 8 and 16 characters, with at least 2 of each of the following: lowercase letters, capital letters, and digits.
In my expression I have:
^((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,16})$
But I don't understand why it wouldn't work like this:
^((?=\d)(?=[a-z])(?=[A-Z])(?=\d)(?=[a-z])(?=[A-Z]){8,16})$
Doesn't ".*" just mean "zero or more of any character"? So why would I need that if I'm just checking for specific conditions?
And why did I need the period before the curly braces defining the limit of the password?
And one more thing, I don't understand what it means to "not consume any of the string" in reference to "?=".
Your last two questions are related. The ?= (which is called a lookahead, by the way) doesn't consume any of the string, meaning that it tests a condition of the string but itself is zero-characters long. If the lookahead is true, then the matching continues, but the next part of the expression starts from where you were before you checked the lookahead.
Because all your stuff is made up of lookaheads, they all add up to zero characters in length. So, for {8,16} to match something, you need to supply the . first. .{8,16} means "8 to 16 characters, I don't care what those characters are." {8,16} without anything before it isn't a valid expression (or at least won't mean what .{8,16} means).
Regarding your first question, you need .* in each of your lookaheads because your expression starts with ^. That means "starting at the very beginning of the string" rather than "matching anywhere within the string". Since you're not trying to match only at the beginning of the string, .* allows you to have the lookaheads affect anywhere in the string.
Lastly, I'm afraid your regexp doesn't work. Because the lookaheads are zero-length, putting the same lookahead in twice as you have done will match the same thing twice. So this expression only checks if you have a single instance of each of the types of characters that you want to enforce there being two instances of. The expression you want is more like this:
^((?=.*\d.*\d)(?=.*[a-z].*[a-z])(?=.*[A-Z].*[A-Z]).{8,16})$
And that expression is equivalent to the more elegant:
^((?=(.*\d){2})(?=(.*[a-z]){2})(?=(.*[A-Z]){2}).{8,16})$
(And, giving credit where it's due, Dennis beat me to that last expression. Well done, sir.)
The problem is that this character ^ means something like 'Right on start'. It means that these specific characters SHOULD BE strictly at the start of text you're searching in, which is not what you want.
Your expression will not work as you want it to.
Because of the lookaheads, both instances of (?=.*\d) will actually match the same digit, thus validating passwords with only one digit.
This should work:
^(?=(.*\d){2})(?=(.*[a-z]){2})(?=(.*[A-Z]){2}).{8,16}$
While (?=.*\d) and (?=\d) are both zero-width lookaheads, the former matches if there is a digit anywhere in the string (after the current location), but the latter matches only if a digit is immediately after the current location. So, that first regex looks for 8-16 characters, including one digit, lowercase, and uppercase each. The second regex requires the first character to be a digit, and a lowercase, and an uppercase, which is absurd. If you want to match two digits, then instead of (?=.*\d)(?=.*\d), do (?=.*\d.*\d).
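The difference shows up on concrete inputs. In the sketch below, the regexes are taken from the question and answers; the two sample passwords are made up.

```javascript
// The question's original regex: repeating a lookahead re-checks the same character.
const flawed = /^((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,16})$/;
// The corrected regex from the answers: each lookahead requires two occurrences.
const fixed = /^(?=(.*\d){2})(?=(.*[a-z]){2})(?=(.*[A-Z]){2}).{8,16}$/;

const oneOfEach = "Abcdefg1"; // only one digit and one upper case letter
const twoOfEach = "ABcdef12";

console.log(flawed.test(oneOfEach)); // true  (not what was intended)
console.log(fixed.test(oneOfEach));  // false
console.log(fixed.test(twoOfEach));  // true

// Without .*, both lookaheads inspect the very same next character,
// which cannot be a digit and a lower case letter at once.
console.log(/^(?=\d)(?=[a-z])/.test("1a")); // false
```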
