What is the difference between "\\w+#\\w+[.]\\w+" and "^\\w+#\\w+[.]\\w+$"? I have tried to google for it but no luck.
^ means "Match the start of the string" (more exactly, the position before the first character in the string, so it does not match an actual character).
$ means "Match the end of the string" (the position after the last character in the string).
Both are called anchors and ensure that the entire string is matched instead of just a substring.
So in your example, the first regex will report a match on email#address.com.uk, but the matched text will be email#address.com, probably not what you expected. The second regex will simply fail.
Be careful, as some regex implementations implicitly anchor the regex at the start/end of the string (for example Java's .matches(), if you're using that).
If the multiline option is set (using the (?m) flag, for example, or by doing Pattern.compile("^\\w+#\\w+[.]\\w+$", Pattern.MULTILINE)), then ^ and $ also match at the start and end of a line.
Try the Javadoc:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
^ and $ match the beginnings/endings of a line (without consuming them)
Related
What is the difference between "\\w+#\\w+[.]\\w+" and "^\\w+#\\w+[.]\\w+$"? I have tried to google for it but no luck.
^ means "Match the start of the string" (more exactly, the position before the first character in the string, so it does not match an actual character).
$ means "Match the end of the string" (the position after the last character in the string).
Both are called anchors and ensure that the entire string is matched instead of just a substring.
So in your example, the first regex will report a match on email#address.com.uk, but the matched text will be email#address.com, probably not what you expected. The second regex will simply fail.
Be careful, as some regex implementations implicitly anchor the regex at the start/end of the string (for example Java's .matches(), if you're using that).
If the multiline option is set (using the (?m) flag, for example, or by doing Pattern.compile("^\\w+#\\w+[.]\\w+$", Pattern.MULTILINE)), then ^ and $ also match at the start and end of a line.
Try the Javadoc:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
^ and $ match the beginnings/endings of a line (without consuming them)
The following returns true
Regex.IsMatch("FooBar\n", "^([A-Z]([a-z][A-Z]?)+)$");
so does
Regex.IsMatch("FooBar\n", "^[A-Z]([a-z][A-Z]?)+$");
The RegEx is in SingleLine mode by default, so $ should not match \n. \n is not an allowed character.
This is to match a single ASCII PascalCaseWord (yes, it will match a trailing Cap)
Doesn't work with any combinations of RegexOptions.Multiline | RegexOptions.Singleline
What am I doing wrong?
In .NET regex, the $ anchor (as in PCRE, Python, PCRE, Perl, but not JavaScript) matches the end of line, or the position before the final newline ("\n") character in the string.
See this documentation:
$ The match must occur at the end of the string or line, or before \n at the end of the string or line. For more information, see End of String or Line.
No modifier can redefine this in .NET regex (in PCRE, you can use D PCRE_DOLLAR_ENDONLY modifier).
You must be looking for \z anchor: it matches only at the very end of the string:
\z The match must occur at the end of the string only. For more information, see End of String Only.
A short test in C#:
Console.WriteLine(Regex.IsMatch("FooBar\n", #"^[A-Z]([a-z][A-Z]?)+$")); // => True
Console.WriteLine(Regex.IsMatch("FooBar\n", #"^[A-Z]([a-z][A-Z]?)+\z")); // => False
From wikipedia:
$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
So you are asking whether there is a capital letter after the start of the beginning of the string, followed by any number of times (zero or one letter), followed by the end of the string, or the position just before the newline.
That all seems true.
And yes, there seems to be some mismatch between different documentation sources about what is regarded as newline and how $ works or should work exactly. It always brings to mind the wisdom:
Sometimes a man has a problem and he figures he will use a regex to solve it.
Now the man has two problems.
I'm trying to match all #mentions and #hashtags on a String using this RegEx expression:
(^|\s)([##][a-z\d-]+)
According to regex101.com, since the + is there it should match all occurances
"+" Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed
But when I run it through a String with more than one occurrance, it only matches the first.
What's going on?
Thanks for your attention.
Add the g (global) flag at the end for multiple matches.
/(^|\s)([##][a-z\d-]+)/g
^ this symbol defines beginning of string. That is why it only match with first string.
Use /[##]\w+/ regex.
I'm trying to write a lexer in JavaScript for finding tokens of a simple domain-specific language. I started with a simple implementation which just tries to match subsequent regexps from the current position in a line to find out whether it matches some token format and accept it then.
The problem is that when something doesn't match inside such regexp, the whole regexp fails, so I don't know which character exactly caused it to fail.
Is there any way to find out the position in the string which caused the regular expression to fail?
INB4: I'm not asking about debugging my regexp and verifying its correctness. It is correct already, matches correct strings and drops incorrect ones. I just want to know programmatically where exactly the regexp stopped matching, to find out the position of a character which was incorrect in the user input, and how much of them were OK.
Is there some way to do it with just simple regexps instead of going on with implementing a full-blown finite state automaton?
Short answer
There is no such thing as a "position in the string that causes the
regular expression to fail".
However, I will show you an approach to answer the reverse question:
At which token in the regex did the engine become unable to match the
string?
Discussion
In my view, the question of the position in the string which caused the regular expression to fail is upside-down. As the engine moves down the string with the left hand and the pattern with the right hand, a regex token that matches six characters one moment can later, because of quantifiers and backtracking, be reduced to matching zero characters the next—or expanded to match ten.
In my view, a more proper question would be:
At which token in the regex did the engine become unable to match the
string?
For instance, consider the regex ^\w+\d+$ and the string abc132z.
The \w+ can actually match the entire string. Yet, the entire regex fails. Does it make sense to say that the regex fails at the end of the string? I don't think so. Consider this.
Initially, \w+ will match abc132z. Then the engine advances to the next token: \d+. At this stage, the engine backtracks in the string, gradually letting the \w+ give up the 2z (so that the \w+ now only corresponds to abc13), allowing the \d+ to match 2.
At this stage, the $ assertion fails as the z is left. The engine backtracks, letting the \w+, give up the 3 character, then the 1 (so that the \w+ now only corresponds to abc), eventually allowing the \d+ to match 132. At each step, the engine tries the $ assertion and fails. Depending on engine internals, more backtracking may occur: the \d+ will give up the 2 and the 3 once again, then the \w+ will give up the c and the b. When the engine finally gives up, the \w+ only matches the initial a. Can you say that the regex "fails on the "3"? On the "b"?
No. If you're looking at the regex pattern from left to right, you can argue that it fails on the $, because it's the first token we were not able to add to the match. Bear in mind that there are other ways to argue this.
Lower, I'll give you a screenshot to visualize this. But first, let's see if we can answer the other question.
The Other Question
Are there techniques that allow us to answer the other question:
At which token in the regex did the engine become unable to match the
string?
It depends on your regex. If you are able to slice your regex into clean components, then you can devise an expression with a series of optional lookaheads inside capture groups, allowing the match to always succeed. The first unset capture group is the one that caused the failure.
Javascript is a bit stingy on optional lookaheads, but you can write something like this:
^(?:(?=(\w+)))?(?:(?=(\w+\d+)))?(?:(?=(\w+\d+$)))?.
In PCRE, .NET, Python... you could write this more compactly:
^(?=(\w+))?(?=(\w+\d+))?(?=(\w+\d+$))?.
What happens here? Each lookahead builds incrementally on the last one, adding one token at a time. Therefore we can test each token separately. The dot at the end is an optional flourish for visual feedback: we can see in a debugger that at least one character is matched, but we don't care about that character, we only care about the capture groups.
Group 1 tests the \w+ token
Group 2 seems to test \w+\d+, therefore, incrementally, it tests the \d+ token
Group 3 seems to test \w+\d+$, therefore, incrementally, it tests the $ token
There are three capture groups. If all three are set, the match is a full success. If only Group 3 is not set (as with abc123a), you can say that the $ caused the failure. If Group 1 is set but not Group 2 (as with abc), you can say that the \d+ caused the failure.
For reference: Inside View of a Failure Path
For what it's worth, here is a view of the failure path from the RegexBuddy debugger.
You can use a negated character set RegExp,
[^xyz]
[^a-c]
A negated or complemented character set. That is, it matches anything
that is not enclosed in the brackets. You can specify a range of
characters by using a hyphen, but if the hyphen appears as the first
or last character enclosed in the square brackets it is taken as a
literal hyphen to be included in the character set as a normal
character.
index property of String.prototype.match()
The returned Array has an extra input property, which contains the
original string that was parsed. In addition, it has an index
property, which represents the zero-based index of the match in the
string.
For example to log index where digit is matched for RegExp /[^a-zA-z]/ in string aBcD7zYx
var re = /[^a-zA-Z]/;
var str = "aBcD7zYx";
var i = str.match(re).index;
console.log(i); // 4
Is there any way to find out the position in the string which caused the regular expression to fail?
No, there isn't. A Regex either matches or doesn't. Nothing in between.
Partial Expressions can match, but the whole pattern doesnt. So the engine always needs to evaluates the whole expression:
Take the String Hello my World and the Pattern /Hello World/. While each word will match individually, the whole Expression fails. You cannot tell whether Hello or World matched - independent, both do. Also the whitespace between them is available.
I want to write regexp which allows some special characters like #-. and it should contain at least one letter. I want to understand below things also:
/(?=^[A-Z0-9. '-]{1,45}$)/i
In this regexp what is the meaning of ?=^ ? What is a subexpression in regexp?
(?=) is a lookahead, it's looking ahead in the string to see if it matches without actually capturing it
^ means it matches at the BEGINNING of the input (for example with the string a test, ^test would not match as it doesn't start with "test" even though it contains it)
Overall, your expression is saying it has to ^ start and $ end with 1-45 {1,45} items that exist in your character group [A-Z0-9. '-] (case insensitive /i). The fact it is within a lookahead in this case just means it's not going to capture anything (zero-length match).
?= is a positive lookahead
Read more on regex