Regular expressions and regex special characters in javascript

I have the following code:
var html = "<div class='test'><b>Hello</b> <i>world!</i></div>";
var results = html.match(/<(\/?) (\w+) ([^>]*?)>/);
About the three sets of parentheses:
The first means: a forward slash or nothing.
The second means: one or more alphanumeric characters.
The third means: anything but '>', but then I don't understand the '*?'!
Also, how do I interpret the fact that the three sets of parentheses are separated by white space?
Regards,

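* means "match as many as possible (possibly zero) of the preceding item". A ? after it means: match just enough for the RegExp to return a match.
Example:
String: Tester>
[^>]*    matches  Tester
[^>]*?   matches  <empty string>
[^>]*e   matches  Teste
[^>]*?e  matches  Te (including the T is required to produce a valid match)
In your case:
String: <input value=">"> junk
[^>]*>   matches  <input value=">
[^>]*?>  matches  <input value=">
.*>      matches  <input value=">">
.*?>     matches  <input value=">
Because [^>] can never match a >, the greedy and lazy versions stop at the same place here. The difference only shows up with something like the dot, which can match a > itself.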

An asterisk (*) means match the preceding bit zero or more times. The preceding bit is [^>], meaning anything but a >. As #user278064 says, the ? is redundant. It's meant to make the * non-greedy, but there's no need as the [^>] already specifies what the * should refer to. (You could replace [^>] with a . (full-stop/period) which would match any character, then the ? would make sure it matches anything until >.)
As for the spaces, they shouldn't be there... they literally match spaces, which I don't think you want.
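To see the effect of those spaces, here is a small sketch that runs the pattern from the question as posted and again with the spaces removed. The variable names withSpaces and withoutSpaces are only for illustration.

```javascript
const html = "<div class='test'><b>Hello</b> <i>world!</i></div>";

// Pattern as posted: the spaces between the groups must match literal spaces.
const withSpaces = html.match(/<(\/?) (\w+) ([^>]*?)>/);

// Same pattern with the spaces removed.
const withoutSpaces = html.match(/<(\/?)(\w+)([^>]*?)>/);

console.log(withSpaces);       // null, because no "<" in html is followed by a space
console.log(withoutSpaces[0]); // "<div class='test'>"
console.log(withoutSpaces[1]); // "" (no leading slash)
console.log(withoutSpaces[2]); // "div"
console.log(withoutSpaces[3]); // " class='test'"
```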

*? in regex is a "lazy star".
A star means "repeat the previous item zero or more times". The previous item in this case is a character class that defines "any character except >".
By default a star on its own is "greedy", which means that it will match as many characters as possible while still meeting the criteria for the rest of the expression around it.
Changing it to a lazy star by adding the question mark means that it will instead match as few characters as possible while still meeting the rest of the criteria.
In the case of your expression, this will in fact make no difference at all to the actual results, because the character to match immediately after the star is a >, which is exactly what the preceding character class excludes. This means that the expression will always match the same result for the [^>]* regardless of whether it is lazy or greedy.
In other regular expressions, the difference is more important because greedy expressions can swallow parts of the string that would have otherwise matched later in the expression.
However, although there may be no difference to the result, there may still be a difference between greedy and lazy expressions, because the different ways in which they are processed can result in the expressions running at different speeds. Again, I don't think it will make much difference in your case, but in some cases it can have a big impact.
I recommend reading up on regex at http://www.regular-expressions.info/ -- it's got an excellent reference table for all the regex syntax you're likely to need, and articles on many of the difficult topics.
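A quick way to check the greedy/lazy behaviour described above is to run both forms against the same string. This is only a sketch; the sample string is made up for the demonstration.

```javascript
const sample = "<b>Hello</b> <i>world!</i>";

// With the dot, greedy and lazy give different results.
const greedyDot = sample.match(/<.*>/)[0];
const lazyDot = sample.match(/<.*?>/)[0];

// With [^>], the class itself stops at the first ">", so both agree.
const greedyClass = sample.match(/<[^>]*>/)[0];
const lazyClass = sample.match(/<[^>]*?>/)[0];

console.log(greedyDot);   // "<b>Hello</b> <i>world!</i>"
console.log(lazyDot);     // "<b>"
console.log(greedyClass); // "<b>"
console.log(lazyClass);   // "<b>"
```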

Related

Matching variable-term equations

I am trying to develop a regular expression to match the following equations:
(Price+10%+100+200)
(Price+20%+200)
(Price+30%)
(Price+100)
(Price-10%-100-200)
(Price-20%-200)
(Price-30%)
(Price-100)
My regex so far is...
/([(])+([P])+([r])+([i])+([c])+([e])+([+]|[-]){1}([\d])+([+]|[-])?([\d])+([%])?([)])/g
..., but it only matches the following equations:
(Price+100+10%)
(Price+100+100)
(Price+200)
(Price-100-10%)
(Price-100-100)
(Price-200)
Can someone help me understand how to make my pattern match the full set of equations provided?
Note: Parentheses and 'Price' are musts in the equations that the pattern must match.

Try this, which matches all the input strings provided in the question:
/\(Price([+-]\d+%?){1,3}\)/g
You can test it in a regex fiddle.
Things to note:
Only use parentheses where you want to group. Parentheses around single-possibility, fixed-quantity matches (e.g. ([P])) provide no value.
Use character classes (opened with [ and closed with ]) for multiple characters that can match at a position in the pattern (e.g. [+-]). Single-possibility character classes (e.g. [P]) similarly provide no value.
Yes, character classes (generally) implicitly escape regex special characters within them (e.g. ( in [(] is equivalent to \( outside a character class). But if all you want is to match a special character literally, you are better off escaping it (e.g. \() rather than wrapping it in a character class, unless multiple characters should match at that position in the pattern (per the previous point).
The quantifier {1} is (almost) always useless: drop it.
The quantifier + means "one or more" as you probably know. However, in a series of cases where you used it (i.e. ([(])+([P])+([r])+([i])+([c])+([e])+), it would match many values that I doubt you expect (e.g. ((((((PPPrriiiicccceeeeee): basically, don't overuse it. Stop to consider whether you really want to match one or more of the character (class) or group to which + applies in the pattern.
To match a literal string without any regex special characters like Price, just use the literal string at the appropriate position in the pattern – e.g. Price in \(Price.
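If you'd rather check it locally than in a fiddle, something like the following sketch should do. The g flag is left off here on purpose, because test() on a global regex remembers lastIndex between calls; the names pricePattern, equations and unmatched are just for illustration.

```javascript
// Same pattern as above, without the g flag so that test() is stateless.
const pricePattern = /\(Price([+-]\d+%?){1,3}\)/;

const equations = [
  "(Price+10%+100+200)",
  "(Price+20%+200)",
  "(Price+30%)",
  "(Price+100)",
  "(Price-10%-100-200)",
  "(Price-20%-200)",
  "(Price-30%)",
  "(Price-100)",
];

// Collect every equation the pattern fails to match.
const unmatched = equations.filter((eq) => !pricePattern.test(eq));
console.log(unmatched); // []

// The repeated-letter input that the original pattern would accept is rejected.
console.log(pricePattern.test("((PPrriice+100)")); // false
```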

/\(Price[+-](\d)+(%)?([+-]\d+%?)?([+-]\d+%?)?\)/g
works on http://www.regexr.com/

/^[(Price]+\d+\d+([%]|[)])&/i
try at your own risk!

Since "a+?" is Lazy, Why does "a+?b" Match "aaab"?

While learning regular expressions in javascript using JavaScript: The Definitive Guide, I was confused by this passage:
But /a+?/ matches one or more occurrences of the letter a, matching as
few characters as necessary. When applied to the same string, this
pattern matches only the first letter a.
…
Now let’s use the nongreedy version: /a+?b/. This should match the
letter b preceded by the fewest number of a’s possible. When applied
to the same string “aaab”, you might expect it to match only one a and
the last letter b. In fact, however, this pattern matches the entire
string, just like the greedy version of the pattern.
Why is this so?
This is the explanation from the book:
This is because regular-expression pattern matching is done by finding
the first position in the string at which a match is possible. Since a
match is possible starting at the first character of the
string, shorter matches starting at subsequent characters are never
even considered.
I don't understand. Can anyone give me a more detailed explanation?

Okay, so you have your search space, "aaabc", and your pattern, /a+?b/
Does /a+?b/ match "a"? No.
Does /a+?b/ match "aa"? No.
Does /a+?b/ match "aaa"? No.
Does /a+?b/ match "aaab"? Yes.

Since you're matching literal characters and not any sort of wildcard, the regular expression a+?b is effectively the same as a+b anyway. The only type of sequence either one will match is a string of one or more a characters followed by a single b character. The non-greedy modifier makes no difference here, as the only thing an a can possibly match is an a.
The non-greedy qualifier becomes interesting when it's applied to something that can take on lots of different values, like the dot (.). (edit: or in cases where there's interesting stuff to the left of something like a+?)
edit — if you're expecting a+?b to match just the last a before the b in aaab, well that's not how it works. Searching for a pattern in a string implicitly means to search for the earliest occurrence of the pattern. Thus, though starting from the last a does give a substring that matches the pattern, it's not the first substring that matches.
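You can confirm this in a console. A minimal sketch using the string from the question:

```javascript
const text = "aaab";

const lazyA = text.match(/a+?/);    // lazy, nothing required after it
const lazyAB = text.match(/a+?b/);  // lazy, but a b must follow
const greedyAB = text.match(/a+b/); // greedy version for comparison

console.log(lazyA[0], lazyA.index);       // "a" 0
console.log(lazyAB[0], lazyAB.index);     // "aaab" 0
console.log(greedyAB[0], greedyAB.index); // "aaab" 0
```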

The Engine Attempts a Match at the Beginning of the String
Can anyone give me a more detailed explanation?
Yes.
In short: a lazy quantifier such as +? does not look for the shortest match globally, across the entire string, but locally, from the position in the string where the engine currently stands.
How the Engine Works
When you try a regex against the string aaab, the engine first tries to find a match starting at the very first position in the string. That position is the position before the first a. If the engine cannot find a match at the first position, it moves on and tries again starting from the second position (between the first and second a).
So can a match be found by the regex a+?b at the first position? Yes.
a matches the first a
The +? quantifier tells the engine to match the fewest number of a chars necessary. Since we are looking to return a match, necessary means that the following tokens (in this case, the b) have to be allowed to match. Here, the fewest number of a chars needed to allow the b to match is all the remaining a chars.
b matches
In the details the second point is a bit more complex (the engine tries to match b against the second a, fails, backtracks...) but you don't need to worry about that.
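The "try each start position in turn" behaviour can be imitated with the sticky (y) flag, which only allows a match exactly at lastIndex. This is a rough emulation of the outer loop for illustration, not how a real engine is implemented; matchAt and leftmostMatch are hypothetical helper names.

```javascript
// Hypothetical helper: try the pattern at one fixed start position only.
function matchAt(source, input, start) {
  const sticky = new RegExp(source, "y");
  sticky.lastIndex = start;
  const m = sticky.exec(input);
  return m ? m[0] : null;
}

// Hypothetical helper: walk the start positions left to right and
// stop at the first one where a match is possible.
function leftmostMatch(source, input) {
  for (let start = 0; start <= input.length; start++) {
    const text = matchAt(source, input, start);
    if (text !== null) return { start, text };
  }
  return null;
}

// A shorter match exists if we force the start to position 2...
console.log(matchAt("a+?b", "aaab", 2)); // "ab"

// ...but position 0 already succeeds, so later positions are never tried.
console.log(leftmostMatch("a+?b", "aaab")); // { start: 0, text: "aaab" }
```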

A '?' after a+ means: match the minimum number of characters needed to satisfy the expression. /a+/ means one 'a', or as many as you can encounter before some other character. To satisfy /a+?/ (since it's non-greedy), only a single 'a' is needed.
To satisfy /a+?b/, since there is a 'b' at the end, it needs to match one or more 'a' characters until it hits the 'b'. It has to hit that 'b'. /a+/ doesn't have to hit the b because the RegEx doesn't ask for that. /a+?b/ has to hit that 'b'.
Just think about it. What other meaning could /a+?b/ have?
Hope this helps

Find out the position where a regular expression failed

I'm trying to write a lexer in JavaScript for finding tokens of a simple domain-specific language. I started with a simple implementation which just tries to match subsequent regexps from the current position in a line to find out whether it matches some token format and accept it then.
The problem is that when something doesn't match inside such regexp, the whole regexp fails, so I don't know which character exactly caused it to fail.
Is there any way to find out the position in the string which caused the regular expression to fail?
INB4: I'm not asking about debugging my regexp and verifying its correctness. It is correct already: it matches correct strings and rejects incorrect ones. I just want to know programmatically where exactly the regexp stopped matching, to find the position of the incorrect character in the user input and how many characters before it were OK.
Is there some way to do it with just simple regexps instead of going on with implementing a full-blown finite state automaton?

Short answer
There is no such thing as a "position in the string that causes the
regular expression to fail".
However, I will show you an approach to answer the reverse question:
At which token in the regex did the engine become unable to match the
string?
Discussion
In my view, the question of the position in the string which caused the regular expression to fail is upside-down. As the engine moves down the string with the left hand and the pattern with the right hand, a regex token that matches six characters one moment can later, because of quantifiers and backtracking, be reduced to matching zero characters the next—or expanded to match ten.
In my view, a more proper question would be:
At which token in the regex did the engine become unable to match the
string?
For instance, consider the regex ^\w+\d+$ and the string abc132z.
The \w+ can actually match the entire string. Yet, the entire regex fails. Does it make sense to say that the regex fails at the end of the string? I don't think so. Consider this.
Initially, \w+ will match abc132z. Then the engine advances to the next token: \d+. At this stage, the engine backtracks in the string, gradually letting the \w+ give up the 2z (so that the \w+ now only corresponds to abc13), allowing the \d+ to match 2.
At this stage, the $ assertion fails as the z is left. The engine backtracks, letting the \w+ give up the 3 character, then the 1 (so that the \w+ now only corresponds to abc), eventually allowing the \d+ to match 132. At each step, the engine tries the $ assertion and fails. Depending on engine internals, more backtracking may occur: the \d+ will give up the 2 and the 3 once again, then the \w+ will give up the c and the b. When the engine finally gives up, the \w+ only matches the initial a. Can you say that the regex fails on the "3"? On the "b"?
No. If you're looking at the regex pattern from left to right, you can argue that it fails on the $, because it's the first token we were not able to add to the match. Bear in mind that there are other ways to argue this.
Lower, I'll give you a screenshot to visualize this. But first, let's see if we can answer the other question.
The Other Question
Are there techniques that allow us to answer the other question:
At which token in the regex did the engine become unable to match the
string?
It depends on your regex. If you are able to slice your regex into clean components, then you can devise an expression with a series of optional lookaheads inside capture groups, allowing the match to always succeed. The first unset capture group is the one that caused the failure.
Javascript is a bit stingy on optional lookaheads, but you can write something like this:
^(?:(?=(\w+)))?(?:(?=(\w+\d+)))?(?:(?=(\w+\d+$)))?.
In PCRE, .NET, Python... you could write this more compactly:
^(?=(\w+))?(?=(\w+\d+))?(?=(\w+\d+$))?.
What happens here? Each lookahead builds incrementally on the last one, adding one token at a time. Therefore we can test each token separately. The dot at the end is an optional flourish for visual feedback: we can see in a debugger that at least one character is matched, but we don't care about that character, we only care about the capture groups.
Group 1 tests the \w+ token
Group 2 seems to test \w+\d+, therefore, incrementally, it tests the \d+ token
Group 3 seems to test \w+\d+$, therefore, incrementally, it tests the $ token
There are three capture groups. If all three are set, the match is a full success. If only Group 3 is not set (as with abc123a), you can say that the $ caused the failure. If Group 1 is set but not Group 2 (as with abc), you can say that the \d+ caused the failure.
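The same incremental idea can also be written as a series of separate anchored tests, one per added token, instead of a single expression. The sketch below takes that route, so it doesn't depend on how a particular engine treats optional groups that match the empty string. The steps table and the firstFailingToken name are assumptions made for this example.

```javascript
// Each step adds one token to the previous prefix of ^\w+\d+$.
const steps = [
  { token: "\\w+", prefix: /^\w+/ },
  { token: "\\d+", prefix: /^\w+\d+/ },
  { token: "$", prefix: /^\w+\d+$/ },
];

// Returns the first token whose prefix pattern no longer matches,
// or null when the full pattern matches.
function firstFailingToken(input) {
  const failed = steps.find((step) => !step.prefix.test(input));
  return failed ? failed.token : null;
}

console.log(firstFailingToken("abc123"));  // null (full match)
console.log(firstFailingToken("abc123a")); // "$"
console.log(firstFailingToken("abc"));     // "\d+"
```

This only works when the pattern can be sliced into clean, left-to-right components, as discussed above.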
For reference: Inside View of a Failure Path
For what it's worth, here is a view of the failure path from the RegexBuddy debugger.

You can use a negated character set RegExp,
[^xyz]
[^a-c]
A negated or complemented character set. That is, it matches anything
that is not enclosed in the brackets. You can specify a range of
characters by using a hyphen, but if the hyphen appears as the first
or last character enclosed in the square brackets it is taken as a
literal hyphen to be included in the character set as a normal
character.
index property of String.prototype.match()
The returned Array has an extra input property, which contains the
original string that was parsed. In addition, it has an index
property, which represents the zero-based index of the match in the
string.
For example, to log the index where the digit is matched by the RegExp /[^a-zA-Z]/ in the string aBcD7zYx:
var re = /[^a-zA-Z]/;
var str = "aBcD7zYx";
var i = str.match(re).index;
console.log(i); // 4
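Note that match() returns null when nothing matches, so reading .index directly would throw on fully valid input. A variant using String.prototype.search(), which returns -1 in that case, might look like the sketch below. It assumes the allowed characters of a token can be described by a single character set, which won't hold for every token format; firstInvalidIndex is a hypothetical helper name.

```javascript
// Hypothetical helper: index of the first character matched by the
// "invalid character" pattern, or -1 if every character is acceptable.
function firstInvalidIndex(input, invalidChar) {
  return input.search(invalidChar);
}

console.log(firstInvalidIndex("aBcD7zYx", /[^a-zA-Z]/)); // 4
console.log(firstInvalidIndex("aBcD", /[^a-zA-Z]/));     // -1
```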

Is there any way to find out the position in the string which caused the regular expression to fail?
No, there isn't. A regex either matches or it doesn't. Nothing in between.
Partial expressions can match while the whole pattern doesn't, so the engine always needs to evaluate the whole expression:
Take the string Hello my World and the pattern /Hello World/. While each word will match individually, the whole expression fails. You cannot tell whether Hello or World was the one that failed: independently, both match. The whitespace between them is present in the string as well.

Javascript Regex for Javascript Regex and Digits

The title might seem a bit recursive, and indeed it is.
I am working on a Javascript which can highlight/color Javascript code displayed in HTML. Thus, in the Internet Browser, comments will be turned green, definitions (for, if, while, etc.) will be turned a dark blue and italic, numbers will be red, and so on for other elements. However, the coloring is not all that important.
I am trying to figure out two different regular expressions which have started to cause a minor headache.
1. Finding a regular expression using a regular expression
I want to find regular expressions within the script-tags of HTML using a Javascript, such as:
match(/findthis/i);
, where the regex part of course is "/findthis/i".
The rules are as follows:
1. Finding multiple occurrences (/g) is not important.
2. It must be on the same line (not /m).
3. Case-insensitive (/i).
4. If a backslash (escape character) is followed directly by a forward slash, "/", the forward slash is part of the expression, not its closing delimiter. E.g.: /itdoesntstop\/untilnow:/
5. Two forward slashes right next to each other (//) are: (A) At the beginning: not a regex; it's a comment. (B) Later on: the first slash is the end of the regex and the second slash is nothing but a character.
6. The regex continues until a line break or end of input (\n|$), or until the closing delimiter (a second forward slash that is not escaped per rule 4) is encountered. However, as long as only alphabetic characters follow the second forward slash, they are considered part of the regex. E.g.: /aregex/allthisispartoftheregex
So far what I've got is this:
'\\/(?:[^\\/\\\\]|\\/\\*)*\\/([a-zA-Z]*)?'
However, it isn't consistent. Any suggestions?
2. Find digits (alphanumeric, floating) using a regular expression
Finding digits on their own is simple. However, finding floating numbers (with multiple periods) and letters including underscore is more of a challenge.
All of the below are considered numbers (a new number starts after each space):
3 3.1 3.1.4 3a 3.A 3.a1 3_.1
The rules:
1. Finding multiple occurrences (/g) is not important.
2. It must be on the same line (not /m).
3. Case-insensitive (/i).
4. A number must begin with a digit. However, the number can be preceded or followed by a non-word (\W) character. E.g.: "=9.9;" where "9.9" is the actual number. "a9" is not a number. A period before the number, ".9", is not considered part of the number and thus the actual number is "9".
5. Allowed characters: [a-zA-Z0-9_.]
What I've got:
'(^|\\W)\\d([a-zA-Z0-9_.]*?)(?=([^a-zA-Z0-9_.]|$))'
It doesn't work quite the way I want it.

For the first part, I think you are quite close. Here is what I would use (as a regex literal, to avoid all the double escapes):
/\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i
I don't know what you intended with your second alternative after the character class. But here the second alternative is used to consume backslashes and anything that follows them. The last part is important, so that you can recognize the regex ending in something like this: /backslash\\/. And the ? at the end of your regex was redundant. Otherwise this should be fine.
Test it here.
Your second regex is just fine for your specification. There are a few redundant elements though. The main thing you might want to do is capture everything but the possible first character:
/(?:^|\W)(\d[\w.]*)/i
Now the actual number (without the first character) will be in capturing group 1. Note that I removed the ungreediness and the lookahead, because greediness alone does exactly the same.
Test it here.
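For a local check of both expressions, here is a small sketch run against the examples from the question. The helper extractNumbers and the variable names are made up for this example; the g flag is added to the second pattern only so that matchAll() can walk through all the numbers on a line.

```javascript
const regexLiteral = /\/(?:[^\/\\\n\r]|\\.)+\/([a-z]*)/i;

const found = "match(/findthis/i);".match(regexLiteral);
console.log(found[0]); // "/findthis/i"
console.log(found[1]); // "i" (the flags)

// String.raw keeps the backslash exactly as it would appear in source code.
const escaped = String.raw`/itdoesntstop\/untilnow:/`;
console.log(escaped.match(regexLiteral)[0] === escaped); // true

// Hypothetical helper: collect capturing group 1 of every number on a line.
function extractNumbers(line) {
  const numberToken = /(?:^|\W)(\d[\w.]*)/gi;
  return [...line.matchAll(numberToken)].map((m) => m[1]);
}

console.log(extractNumbers("3 3.1 3.1.4 3a 3.A 3.a1 3_.1"));
// [ "3", "3.1", "3.1.4", "3a", "3.A", "3.a1", "3_.1" ]
console.log(extractNumbers("=9.9;")); // [ "9.9" ]
console.log(extractNumbers("a9"));    // []
console.log(extractNumbers(".9"));    // [ "9" ]
```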

regular expression for ends with some word

I want to build a regular expression that matches a series of names such as
cd1_inputchk,rd_inputchk,optinputchk where inputchk is common (the ending characters).
Please advise.

Very simply, it's:
/inputchk$/
On a per-word basis, you can test with /inputchk$/.test(word) ? 'matches' : 'doesn\'t match';. The reason this works is that it matches "inputchk" only where it comes at the end of a string (hence the $).
As for a list of words, it starts becoming more complicated.
Are there spaces in the list?
Are they needed?
I'm going to assume no is the answer to both questions, and also assume that the list is comma-separated.
There are then a couple of ways you could proceed. You could use list.split(',') to get an array of the words and test each one to see whether it ends in inputchk, or you could use a modified regular expression:
/[^,]*inputchk(?:,|$)/g
This one's much more complicated.
[^,] says to match non-, characters
* then says to match 0 or more of those non-, chars. (it will be greedy)
inputchk matches inputchk
(?:...) is a non-capturing group. It says to match the characters, but not to store them as a separate capture group in the result.
, matches the , character
| says match one side or the other
$ says to match the end of the string
Hopefully all of this together will select the strings that you're looking for, but it's very easy to make a mistake, so I'd suggest doing some rigorous testing to make sure there aren't any edge-conditions that are being missed.
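As a starting point for that testing, here are both approaches side by side. This is only a sketch: the list is the one from the question with an extra non-matching entry, rd_other, added for illustration.

```javascript
// "rd_other" is a made-up entry that should not be selected.
const list = "cd1_inputchk,rd_inputchk,optinputchk,rd_other";

// Approach 1: split on commas, then test each word on its own.
const bySplit = list.split(",").filter((word) => /inputchk$/.test(word));
console.log(bySplit); // [ "cd1_inputchk", "rd_inputchk", "optinputchk" ]

// Approach 2: one global regex over the whole list.
// The separator is part of each match when a comma follows.
const byRegex = list.match(/[^,]*inputchk(?:,|$)/g);
console.log(byRegex); // [ "cd1_inputchk,", "rd_inputchk,", "optinputchk," ]
```

The split version returns clean words, while the single regex keeps the trailing comma in each match, which is one of the edge conditions worth checking.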

This one should work (dollar sign basically means "end of string"):
/inputchk$/
