What does ?=^ mean in a regexp? - javascript

I want to write a regexp that allows some special characters such as #-. and requires at least one letter. I would also like to understand the following:
/(?=^[A-Z0-9. '-]{1,45}$)/i
In this regexp, what does ?=^ mean? And what is a subexpression in a regexp?

(?=) is a lookahead: it looks ahead in the string to check whether the pattern inside it matches, without consuming any characters.
^ matches at the BEGINNING of the input (for example, with the string "a test", ^test would not match because the string doesn't start with "test", even though it contains it).
Overall, your expression says the input has to start (^) and end ($) with 1 to 45 ({1,45}) characters from your character class [A-Z0-9. '-] (case-insensitive because of /i). Because the whole pattern sits inside a lookahead, the match itself consumes nothing (it is a zero-length match).
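A minimal JavaScript sketch of that behaviour, using the pattern from the question. The sample strings and the atLeastOneLetter variant (an extra lookahead, plus # in the class, for the "at least one letter" requirement) are assumptions for illustration, not part of the original answer.

```javascript
// Pattern from the question: the whole check lives inside a lookahead.
const lookaheadOnly = /(?=^[A-Z0-9. '-]{1,45}$)/i;

const m = lookaheadOnly.exec("O'Brien-Smith Jr.");
const matchedText = m[0];  // "" because the lookahead consumed nothing
const matchedAt = m.index; // 0 because the assertion held at the start of the input

// "_" is not in the character class, so the assertion fails everywhere.
const acceptsUnderscore = lookaheadOnly.test("bad_name");

// Hypothetical variant: also allow "#" and require at least one letter.
const atLeastOneLetter = /^(?=.*[A-Z])[A-Z0-9. '#-]{1,45}$/i;
const digitsOnly = atLeastOneLetter.test("12345");      // no letter present
const withLetter = atLeastOneLetter.test("Apt #12-B."); // letters present

console.log({ matchedText, matchedAt, acceptsUnderscore, digitsOnly, withLetter });
```

Moving the character class outside the lookahead, as in the variant, makes the regex consume the input, which is usually what you want if you need the matched text afterwards.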

?= is a positive lookahead
Read more on regex

Related

Regex expression only matching the first occurrence

I'm trying to match all @mentions and #hashtags in a String using this RegEx expression:
(^|\s)([@#][a-z\d-]+)
According to regex101.com, since the + is there it should match all occurrences:
"+" Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed
But when I run it against a String with more than one occurrence, it only matches the first.
What's going on?
Thanks for your attention.
Add the g (global) flag at the end for multiple matches.
/(^|\s)([@#][a-z\d-]+)/g
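The ^ symbol matches only at the beginning of the string, so that branch can only ever succeed for the first word. Later occurrences have to go through the \s branch, and without the g flag the search stops after the first match anyway.
A simpler pattern is /[@#]\w+/g.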
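A short JavaScript sketch of the effect of the g flag. It assumes the character class is meant to be [@#], and the sample sentence is made up for illustration.

```javascript
const text = "hi @alice and @bob-2, see #regex #js";

// Without g, match() stops at the first occurrence.
const first = text.match(/(^|\s)([@#][a-z\d-]+)/);
const firstTag = first[2];

// With g, matchAll() walks through every occurrence; group 2 holds the tag itself.
const allTags = Array.from(
  text.matchAll(/(^|\s)([@#][a-z\d-]+)/g),
  (m) => m[2]
);

console.log(firstTag); // "@alice"
console.log(allTags);  // [ "@alice", "@bob-2", "#regex", "#js" ]
```

Reading group 2 rather than the whole match avoids picking up the leading whitespace that (^|\s) consumes.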

Regular expression match specific key words

I am trying to use regexp to match some specific key words.
In the code below, I'd like to match only the IFs on the first and second lines, which have no prefix or suffix attached. The regexp I am using now is \b(IF|ELSE)\b, and it returns all of the IFs.
IF A > B THEN STOP
IF B < C THEN STOP
LOL.IF
IF.LOL
IF.ELSE
Thanks for any help in advance.
I am using http://regexr.com/ for testing.
It needs to work with JS.
I'm guessing this is what you're looking for, assuming you've added the m flag for multiline:
(?:^|\s)(IF|ELSE)(?:$|\s)
It consists of three groups:
(?:^|\s) - Matches either the beginning of the line or a single whitespace character
(IF|ELSE) - Matches one of your keywords
(?:$|\s) - Matches either the end of the line or a single whitespace character
Regexr
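As a rough check in JavaScript, the pattern can be run against the five sample lines from the question. The g flag is added here (an assumption on top of the m flag the answer mentions) so that every match is reported.

```javascript
const source = [
  "IF A > B THEN STOP",
  "IF B < C THEN STOP",
  "LOL.IF",
  "IF.LOL",
  "IF.ELSE",
].join("\n");

// m makes ^ and $ work per line; g lets matchAll() report every match.
const keywordRe = /(?:^|\s)(IF|ELSE)(?:$|\s)/gm;

// Group 1 is the keyword without the surrounding whitespace.
const keywords = Array.from(source.matchAll(keywordRe), (m) => m[1]);

console.log(keywords); // [ "IF", "IF" ]
```

Note that the non-capturing groups still consume the whitespace around the keyword, so two keywords separated by a single space would not both be found.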
You can do it with lookaround (lookahead + lookbehind). This is closer to what you really want, because it states the condition explicitly: rather than checking for other characters such as the start of the string or whitespace around the match, it matches exactly "IF or ELSE not surrounded by dots".
/(?<!\.)(IF|ELSE)(?!\.)/g
Explanation:
Use the g flag to find all occurrences.
(?<!X)Y is a negative lookbehind, which matches a Y not preceded by an X.
Y(?!X) is a negative lookahead, which matches a Y not followed by an X.
Working example: https://regex101.com/r/oS2dZ6/1
PS: lookbehind is only available in JavaScript engines that implement ES2018 or later. If you don't have to write the regex for JS, a tool that supports PCRE, such as regex101.com, is the better choice.
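A sketch of the lookaround version in JavaScript, assuming an engine with lookbehind support (ES2018 or later). The "GIFT" check and the \b variant are additions for illustration: the pattern as written only excludes dots, so it also finds IF inside a longer word.

```javascript
const source = [
  "IF A > B THEN STOP",
  "IF B < C THEN STOP",
  "LOL.IF",
  "IF.LOL",
  "IF.ELSE",
].join("\n");

// Keyword not preceded by a dot and not followed by a dot.
const notDotted = /(?<!\.)(IF|ELSE)(?!\.)/g;
const keywords = Array.from(source.matchAll(notDotted), (m) => m[1]);

// Caveat: only dots are excluded, so "IF" inside another word still matches.
const matchesInsideWord = /(?<!\.)(IF|ELSE)(?!\.)/.test("GIFT");

// Hypothetical stricter variant that also requires word boundaries.
const strictMatchesInsideWord = /(?<!\.)\b(IF|ELSE)\b(?!\.)/.test("GIFT");

console.log(keywords, matchesInsideWord, strictMatchesInsideWord);
```

Unlike the (?:^|\s) approach, the lookarounds consume nothing, so the match is exactly the keyword.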

Regex returns with incorrect value

I'm trying to create a function with a regex that decides whether my string value is valid. It should return true if the string begins with a lowercase or uppercase letter or an underscore. If it begins with anything else, the function must return false.
My test input is something like this: ".dasfh"
The expressions I tried were [_a-zA-Z]... and [:alpha:]..., but both of them returned true.
I also tried a simpler task:
"Hadfg" where the expression is [a-z]...: returns true
BUT
"hadfg" where the expression is [A-Z]...: returns false
Could anybody help me to understand this behaviour?
You want the first character in the string to be something in particular, which means you have to tell the regex engine that the match must be at the first character of the string.
By default, the regex engine just tries to find a match anywhere in the string.
All you're telling it with [a-z] is "find me a lowercase character anywhere in the string". This means that:
"Hadfg" returns true because it can find a, d, f or g as a match.
"HADFG" returns false because there are no lowercase letters.
The same happens for "hADFG" when matched with [A-Z]: it can find an A, D, F or G as a match, whereas "hadfg" returns false because there is no uppercase character.
What you are looking for here is ^ in your regex. It is an anchor that means "start of the string" (or "start of a line" when the m flag is set).
So when you apply this to your regex it will look like this: /^[a-z]/.
That regex basically says "at the start of the string, is the first character a lowercase a-z?"
Try it out and you'll see.
For your solution you'd need /^[_a-zA-Z]/ to check if the first character is an _, a-z or A-Z character.
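A small JavaScript sketch of the difference the anchor makes, using the strings from the question. The function name startsWithLetterOrUnderscore and the "_name" sample are made up for illustration.

```javascript
// Unanchored: true if a lowercase letter appears anywhere in the string.
const anywhere = /[a-z]/.test("Hadfg");

// Anchored: true only if the very first character is a lowercase letter.
const atStart = /^[a-z]/.test("Hadfg");

// Hypothetical helper wrapping the regex suggested in the answer.
function startsWithLetterOrUnderscore(value) {
  return /^[_a-zA-Z]/.test(value);
}

const dotFirst = startsWithLetterOrUnderscore(".dasfh");
const underscoreFirst = startsWithLetterOrUnderscore("_name");
const letterFirst = startsWithLetterOrUnderscore("Hadfg");

console.log({ anywhere, atStart, dotFirst, underscoreFirst, letterFirst });
```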
For reference, you can find cheat sheets within these tools (and test your regexes with them, of course!):
Regexr - My personal favorite (uses your browser's JS regex engine)
Rubular - A Ruby regex tester
Regex101 - A Python / PCRE / PHP / JavaScript regex tester
And for a reference or tutorial (I'd recommend reading it from start to finish if you want to understand regexes and how they work), there's regular-expressions.info.
Regex is never easy, so be careful with what you do with it; it's a powerful but sometimes ugly beast to deal with :)
PS
I see you tagged your question as email-validation, so I'll add a little bonus regex that checks the minimum requirements for something to look like an email address. I use this one personally:
.+@.+\..{2,}
which, when broken up, looks like this:
.+ - one or more of any character
@ - followed by a literal @ character
.+ - one or more of any character
\. - followed by a literal . character
.{2,} - two or more of any character
Optionally you could replace {2,} with a + to make it one or more, but this would allow a TLD with 1 character.
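A quick JavaScript sketch of how this minimal pattern behaves, assuming the separator is @ as in a real address. The sample addresses are invented for illustration.

```javascript
// Minimal "looks like an email" check: something, @, something, a dot, 2+ characters.
const looksLikeEmail = /.+@.+\..{2,}/;

const typical = looksLikeEmail.test("user@example.com");
const shortTld = looksLikeEmail.test("a@b.c");         // only 1 character after the dot
const noAtSign = looksLikeEmail.test("no-at-sign.com"); // no @ at all

// The pattern is unanchored, so surrounding text is accepted as well.
const withExtraText = looksLikeEmail.test("mail me at x@y.zz please");

console.log({ typical, shortTld, noAtSign, withExtraText });
```

If surrounding text should be rejected, one option is to anchor the pattern with ^ and $.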
To see a RFC email-regex at work check this link.
When I look at that regex I basically just want to cry in a corner somewhere. There are definitely things you cannot do in an email address that my regex doesn't catch, but at least it makes sure the input looks like something e-mailable. If a new user decides to fill in nonsense, that's not my problem anymore, and I wouldn't want to force them to change a single character just because a huge regex doesn't agree with it.

Difference between ?:, ?! and ?=

I searched for the meaning of these expressions but couldn't understand the exact difference between them.
This is what they say:
?: Match expression but do not capture it.
?= Match a suffix but exclude it from capture.
?! Match if the suffix is absent.
I tried using these in simple RegEx and got similar results for all.
For example: the following 3 expressions give very similar results.
[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+(?!\.[a-zA-Z0-9]+)*
[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+(?=\.[a-zA-Z0-9]+)*
[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9]+)*
The difference between ?= and ?! is that the former requires the given expression to match and the latter requires it to not match. For example a(?=b) will match the "a" in "ab", but not the "a" in "ac". Whereas a(?!b) will match the "a" in "ac", but not the "a" in "ab".
The difference between ?: and ?= is that ?= excludes the expression from the entire match while ?: just doesn't create a capturing group. So for example a(?:b) will match the "ab" in "abc", while a(?=b) will only match the "a" in "abc". a(b) would match the "ab" in "abc" and create a capture containing the "b".
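The examples in this answer can be checked with a few lines of JavaScript. This is only a sketch; the variable names are made up.

```javascript
// Non-capturing group: "b" is part of the match, but no capture is created.
const nonCapturing = "abc".match(/a(?:b)/);

// Positive lookahead: "b" must follow, but it is not part of the match.
const lookahead = "abc".match(/a(?=b)/);

// Capturing group: "b" is part of the match and is stored as group 1.
const capturing = "abc".match(/a(b)/);

// Negative lookahead: "a" matches only when it is NOT followed by "b".
const negativeOnAb = /a(?!b)/.test("ab");
const negativeOnAc = "ac".match(/a(?!b)/);

console.log(nonCapturing[0], lookahead[0], capturing[0], capturing[1]); // ab a ab b
console.log(negativeOnAb, negativeOnAc[0]);                             // false a
```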
?: is for non capturing group
?= is for positive look ahead
?! is for negative look ahead
?<= is for positive look behind
?<! is for negative look behind
Please check Lookahead and Lookbehind Zero-Length Assertions for very good tutorial and examples on lookahead in regular expressions.
To understand this better, let's apply the three expressions plus a capturing group and analyse each behaviour.
() capturing group - the regex inside the parentheses must match, and the match creates a capturing group
(?:) non-capturing group - the regex inside the parentheses must match, but no capturing group is created
(?=) positive lookahead - asserts that the regex matches at this position
(?!) negative lookahead - asserts that the regex cannot match at this position
Let's apply q(u)i to quit. q matches q and the capturing group u matches u. The match inside the capturing group is taken and a capturing group is created. The engine continues with i, and i matches i. This match attempt is successful: qui is matched and a capturing group containing u is created.
Let's apply q(?:u)i to quit. Again, q matches q and the non-capturing group u matches u. The match from the non-capturing group is taken, but no capturing group is created. The engine continues with i, and i matches i. This match attempt is successful: qui is matched.
Let's apply q(?=u)i to quit. The lookahead is positive and is followed by another token. Again, q matches q and, inside the lookahead, u matches u. But the match from the lookahead must be discarded, so the engine steps back from i in the string to u. Given that the lookahead was successful, the engine continues with i. But i cannot match u, so this match attempt fails.
Let's apply q(?=u)u to quit. The lookahead is positive and is followed by another token. Again, q matches q and, inside the lookahead, u matches u. But the match from the lookahead must be discarded, so the engine steps back from i in the string to u. Given that the lookahead was successful, the engine continues with u, and u matches u. This match attempt is successful: qu is matched.
Let's apply q(?!i)u to quit. The lookahead is negative and is followed by another token. Again, q matches q; inside the lookahead, i does not match u, so the negative lookahead succeeds. Nothing was consumed, so the engine is still at u in the string. It continues with u, and u matches u. This match attempt is successful: qu is matched.
So, in conclusion, the real difference between lookaheads and non-capturing groups is whether you only want to test for the pattern (a lookahead, which consumes nothing) or to match it as part of the result (a non-capturing group).
Capturing groups also have a cost, so use them judiciously.
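The five walk-throughs above can be reproduced in JavaScript with exec(), which returns the match array or null. This is a sketch; the helper name run is hypothetical.

```javascript
// Hypothetical helper: returns the match as a plain array, or null if there is none.
function run(re) {
  const m = re.exec("quit");
  return m === null ? null : Array.from(m);
}

const withCapture = run(/q(u)i/);          // matches "qui" and captures "u"
const withoutCapture = run(/q(?:u)i/);     // matches "qui", nothing captured
const lookaheadThenI = run(/q(?=u)i/);     // fails: after q the next character is u, not i
const lookaheadThenU = run(/q(?=u)u/);     // matches "qu"
const negativeLookahead = run(/q(?!i)u/);  // matches "qu"

console.log(withCapture, withoutCapture, lookaheadThenI, lookaheadThenU, negativeLookahead);
```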
Try matching foobar against these:
/foo(?=b)(.*)/
/foo(?!b)(.*)/
The first regex will match and will return "bar" as first submatch — (?=b) matches the 'b', but does not consume it, leaving it for the following parentheses.
The second regex will NOT match, because it expects "foo" to be followed by something different from 'b'.
(?:...) has exactly the same effect as simple (...), but it does not return that portion as a submatch.
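Trying this in JavaScript, as a minimal sketch with made-up variable names:

```javascript
// Positive lookahead: "b" is checked but left for (.*) to consume.
const positive = /foo(?=b)(.*)/.exec("foobar");

// Negative lookahead: "foo" is followed by "b", so there is no match at all.
const negative = /foo(?!b)(.*)/.exec("foobar");

// Non-capturing group: "b" is consumed and not returned as a submatch.
const nonCapturing = /foo(?:b)(.*)/.exec("foobar");

console.log(positive[1]);     // "bar"
console.log(negative);        // null
console.log(nonCapturing[1]); // "ar"
```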
The simplest way to understand assertions is to treat them as commands inserted into a regular expression.
When the engine reaches an assertion, it immediately checks the condition described by the assertion.
If the result is true, it continues running the regular expression.
This is the real difference:
>>> import re
>>> re.match('a(?=b)bc', 'abc')
<Match...>
>>> re.match('a(?:b)c', 'abc')
<Match...>
# note:
>>> print(re.match('a(?=b)c', 'abc'))
None
If nothing later in the pattern needs to match the text covered by the group, "?:" and "?=" succeed on the same inputs, and either is OK to use.
But if a later part of the pattern still needs to match that text (as bc does in the first example above), you have to use "?=", because "?:" consumes it. (If you just want to keep the text for further processing, a plain capturing group such as "a(b)" is enough.)

Looking for another regex explanation

In my regex, I was trying to match a password between 8 and 16 characters long, with at least 2 of each of the following: lowercase letters, capital letters, and digits.
In my expression I have:
^((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,16})$
But I don't understand why it wouldn't work like this:
^((?=\d)(?=[a-z])(?=[A-Z])(?=\d)(?=[a-z])(?=[A-Z]){8,16})$
Doesn't ".*" just mean "zero or more of any character"? So why would I need that if I'm just checking for specific conditions?
And why did I need the period before the curly braces defining the limit of the password?
And one more thing, I don't understand what it means to "not consume any of the string" in reference to "?=".
Your last two questions are related. The ?= (which is called a lookahead, by the way) doesn't consume any of the string, meaning that it tests a condition of the string but itself is zero-characters long. If the lookahead is true, then the matching continues, but the next part of the expression starts from where you were before you checked the lookahead.
Because all your stuff is made up of lookaheads, they all add up to zero characters in length. So, for {8,16} to match something, you need to supply the . first. .{8,16} means "8 to 16 characters, I don't care what those characters are." {8,16} without anything before it isn't a valid expression (or at least won't mean what .{8,16} means).
Regarding your first question, you need .* in each of your lookaheads because your expression starts with ^. That means "starting at the very beginning of the string" rather than "matching anywhere within the string". Since you're not trying to match only at the beginning of the string, .* allows you to have the lookaheads affect anywhere in the string.
Lastly, I'm afraid your regexp doesn't work. Because the lookaheads are zero-length, putting the same lookahead in twice as you have done will match the same thing twice. So this expression only checks if you have a single instance of each of the types of characters that you want to enforce there being two instances of. The expression you want is more like this:
^((?=.*\d.*\d)(?=.*[a-z].*[a-z])(?=.*[A-Z].*[A-Z]).{8,16})$
And that expression is equivalent to the more elegant:
^((?=(.*\d){2})(?=(.*[a-z]){2})(?=(.*[A-Z]){2}).{8,16})$
(And, giving credit where it's due, Dennis beat me to that last expression. Well done, sir.)
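To see the difference between the question's expression and the corrected one, here is a minimal JavaScript sketch. The sample passwords are invented for illustration.

```javascript
// Expression from the question: each lookahead only demands ONE character of its kind.
const original = /^((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,16})$/;

// Corrected expression from the answer: each lookahead demands TWO.
const corrected = /^((?=(.*\d){2})(?=(.*[a-z]){2})(?=(.*[A-Z]){2}).{8,16})$/;

const oneDigitOneUpper = "Abcdefg1"; // 8 characters, but only one digit and one capital
const twoOfEach = "ABcdefg12";       // 9 characters, two or more of each kind
const tooShort = "ABcd12";           // two of each kind, but only 6 characters

const originalAcceptsWeak = original.test(oneDigitOneUpper);
const correctedAcceptsWeak = corrected.test(oneDigitOneUpper);
const correctedAcceptsStrong = corrected.test(twoOfEach);
const correctedAcceptsShort = corrected.test(tooShort);

console.log({ originalAcceptsWeak, correctedAcceptsWeak, correctedAcceptsStrong, correctedAcceptsShort });
```

The length check comes from .{8,16} together with ^ and $; the lookaheads only inspect the string and contribute no characters of their own.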
The problem is that the ^ character means something like "right at the start". It means that these specific characters MUST be exactly at the start of the text you're searching in, which is not what you want.
Your expression will not work as you want it to.
Because of the lookaheads, both instances of (?=.*\d) will actually match the same digit, thus validating passwords with only one digit.
This should work:
^(?=(.*\d){2})(?=(.*[a-z]){2})(?=(.*[A-Z]){2}).{8,16}$
The difference between (?=.*\d) and (?=\d) is that, while both are zero-width lookaheads, the former matches if there is a digit anywhere in the string (after the current location), whereas the latter matches only if a digit is immediately after the current location. So the first regex looks for 8-16 characters, including at least one digit, one lowercase letter, and one uppercase letter. The second regex requires the first character to be a digit, and a lowercase letter, and an uppercase letter, which is impossible. If you want to match two digits, then instead of (?=.*\d)(?=.*\d), use (?=.*\d.*\d).
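A small JavaScript sketch of that distinction; the sample strings are assumptions chosen for illustration.

```javascript
// (?=.*\d): a digit may appear anywhere after the current position.
const digitAnywhere = /^(?=.*\d)/.test("abc1");

// (?=\d): the very next character must be a digit.
const digitNext = /^(?=\d)/.test("abc1");

// Stacked lookaheads all inspect the SAME position, so these three can never
// hold together: one character cannot be a digit, a lowercase letter and an
// uppercase letter at the same time.
const impossible = /^(?=\d)(?=[a-z])(?=[A-Z])/;
const impossibleResults = ["1aA", "a1A", "Aa1"].map((s) => impossible.test(s));

// Two digits: repeating (?=.*\d) re-checks the same digit, (?=.*\d.*\d) does not.
const repeatedAcceptsOneDigit = /^(?=.*\d)(?=.*\d)/.test("abc1");
const chainedAcceptsOneDigit = /^(?=.*\d.*\d)/.test("abc1");

console.log({ digitAnywhere, digitNext, impossibleResults, repeatedAcceptsOneDigit, chainedAcceptsOneDigit });
```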
