JavaScript regex: why is alternation not ordered? [duplicate]


This question already has answers here:
Why order matters in this RegEx with alternation?
(3 answers)
Order of regular expression operator (..|.. ... ..|..)
(1 answer)
Closed 2 years ago.
Given this code:
const regex = /graph|photograph/;
'A photograph'.match(regex);
// Output: [ 'photograph', index: 2, input: 'A photograph', groups: undefined ]
Why is the engine not finding graph first? After looking at similar SO questions and the ECMAScript docs, I can see that
The | regular expression operator separates two alternatives. The pattern first tries to match the left Alternative (followed by the sequel of the regular expression); if it fails, it tries to match the right Disjunction (followed by the sequel of the regular expression).
Now, the above quote covers the case /photo|photograph/ where the alternatives share a common beginning, but the case where they share a common ending appears to be governed by a different rule.
I am content with the result I am getting, as in my use case I prefer to get the longest match, not the earliest one, but I would like to know why this happens, so I can be sure this isn't just a coincidence that is bound to change in the future.

The alternative graph does not match starting at the third character, but the alternative photograph does. The engine proceeds through the string from left to right.
The ordering you refer to in the question applies when alternatives match from a common starting point in the string. Otherwise, while proceeding through the "haystack" string, every alternative is tried at each starting position before the engine moves on to the next character. If there's a single match starting from a particular character, then the rest of the regex will proceed with that (and may of course backtrack later).
When several alternatives match from the same character in the source, the engine does not prefer the longer one. As the spec passage quoted in the question says, it tries the left alternative first and only falls back to the right one if the left alternative, or the rest of the pattern after it, fails. So /photo|photograph/ matches photo in 'A photograph', and if you want the longest match from a common starting point you have to list the longer alternative first.
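A minimal sketch of the two rules at work, in line with the spec passage quoted in the question. The helper name firstMatch is made up for illustration:

```javascript
// Hypothetical helper: returns the matched text and the index where it starts.
function firstMatch(regex, text) {
  const m = text.match(regex);
  return m ? { text: m[0], index: m.index } : null;
}

// Start positions are tried left to right. At index 2 "graph" fails but
// "photograph" succeeds, so the search never gets as far as index 7.
const leftmost = firstMatch(/graph|photograph/, 'A photograph');

// Here both alternatives can match at index 2; the left one is tried first.
const ordered = firstMatch(/photo|photograph/, 'A photograph');

console.log(leftmost); // { text: 'photograph', index: 2 }
console.log(ordered);  // { text: 'photo', index: 2 }
```

So the longer result in the question comes from the leftmost-start rule, not from any preference for long alternatives.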

Related

Solve Catastrophic Backtracking in my regex detecting Email [duplicate]

This question already has an answer here:
Email validation Regular expression is causing catastrophic backtracking
(1 answer)
Closed 7 months ago.
I have regex
/^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,4})+$/
for checking valid Email.
It works, but GitHub's code scanner shows this error
This Part of the Regular Expression May Cause Exponential Backtracking on Strings Starting With 'A@a' and Containing Many Repetitions of 'A'.
I got the error, however, I'm not sure how to solve it.

A good place to start is this: How can I recognize an evil regex?
As one of the answers there says, the key is to avoid "repetition of a repetition". For instance, given (\w+)* and the input aaa, it could match as (aaa), or (a)(aa), or (aa)(a), or (a)(a)(a); and as the input gets longer, the number of possibilities goes up exponentially. If instead you just write (\w*), it will match all the same strings, but only in one way.
In your case, you have two places where you write ([.-]?\w+)* and because you've made the [.-] optional, it can match in all the ways that (\w+)* can. But text without a dot or dash is already matched by the \w+ just before, so you can have ([.-]\w+)* instead.
The string .aaa can now only match one way, because (.a)(aa) doesn't have a dot or dash at the start of the second group. Other strings like aaa or ..a can be ruled out because you need exactly one dot or dash, and at least one character matching \w (which doesn't include . or -).
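As a rough check of the rewritten pattern, here is a small sketch. It assumes the separator in the pattern is @, as in an email address, and the sample addresses are invented for illustration. It only shows that the nested-repetition ambiguity described above is gone, not that the pattern is a complete email validator:

```javascript
// The separator [.-] is now mandatory inside each repeated group, so a run
// of word characters can be split into groups in only one way.
const emailPattern = /^\w+([.-]\w+)*@\w+([.-]\w+)*(\.\w{2,4})+$/;

const accepted = ['a@b.com', 'first.last@mail-host.example.org']
  .map((s) => emailPattern.test(s));
const rejected = ['a@b', 'a..b@c.com', '@b.com']
  .map((s) => emailPattern.test(s));

// The kind of input the scanner warns about: "a@" followed by many "a"s
// and no top-level domain. With the mandatory separator it fails quickly.
const hostile = 'a@' + 'a'.repeat(5000);
const hostileResult = emailPattern.test(hostile);

console.log(accepted, rejected, hostileResult);
```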

Exclude list of string in validation - regex [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 2 years ago.
I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?

A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].

Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.

Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (rejected, as desired)
xxxSTRING2xxx KO (rejected, as desired)
xxxSTRING3xxx KO (rejected, as desired)
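When the forbidden strings come from a list, the same pattern can be generated. The sketch below is one way to do it in JavaScript; the helper names escapeRegExp and buildExclusionRegex are hypothetical, and escaping is added on the assumption that the list holds literal strings rather than regex fragments:

```javascript
// Hypothetical helper: escape regex metacharacters so each entry is literal.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Builds ^(?!.*(?:A|B|C)).*$ from a list of forbidden literal strings.
function buildExclusionRegex(forbidden) {
  const alternatives = forbidden.map(escapeRegExp).join('|');
  return new RegExp(`^(?!.*(?:${alternatives})).*$`);
}

const noStrings = buildExclusionRegex(['STRING1', 'STRING2', 'STRING3']);

console.log(noStrings.test('xxxxxx'));        // true  (OK)
console.log(noStrings.test('xxxSTRING1xxx')); // false (KO)
console.log(noStrings.test('xxxSTRING3xxx')); // false (KO)
```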

You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
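The first two match any string that does not contain bar. The basic version is only an approximation: it wrongly accepts strings such as bbar or babar, because [^a] and [^r] can consume the b that begins bar.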
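A quick way to compare the three patterns is to run them over the same inputs. This sketch assumes a JavaScript engine with lookbehind support; the sample strings are made up for illustration:

```javascript
const withLookahead = /^(?!.*?bar).*/;
const withLookbehind = /^(.(?<!bar))*?$/;
const basicsOnly = /^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$/;

// Returns the verdict of each pattern for one input string.
function check(s) {
  return {
    lookahead: withLookahead.test(s),
    lookbehind: withLookbehind.test(s),
    basics: basicsOnly.test(s),
  };
}

for (const s of ['foo', 'bar', 'crowbar', 'ba r', 'bbar']) {
  console.log(JSON.stringify(s), check(s));
}
```

The lookaround versions agree on all five inputs. The basic version differs on bbar, which it accepts even though the string contains bar, so it is worth testing against your own data before relying on it.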

The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]

I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'".

The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exists. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.

If it's truly the word bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar on a word boundary, that is to say, delimited by non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
(?s)^(?!.*\bbar\b).*$
Alternatively:
^(?!.*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.

Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*

I wish to complement the accepted answer and contribute to the discussion with my late answer.
@ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}

Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
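For comparison, the same idea sketched in JavaScript, whose String.prototype.split also keeps the empty strings between adjacent separators when given a string separator:

```javascript
const text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf';

// Splitting on the literal word keeps an empty string wherever two "bar"s
// are adjacent or the text starts with "bar".
const pieces = text.split('bar');

// Drop the empty strings if only the surviving text matters.
const nonEmpty = pieces.filter((piece) => piece !== '');

console.log(pieces);   // [ '', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf' ]
console.log(nonEmpty); // [ 'asd', ' 1234egb ar ', '32 sdf', 'aadf' ]
```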

I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

Match all Inside Parenthesis but not Outside [duplicate]

This question already has answers here:
Recursive matching with regular expressions in Javascript
(6 answers)
Closed 8 years ago.
I'm trying to use regular expressions to match certain groups of strings which correspond to functions. Right now it looks like this:
(Spreadsheet.[^)\)]+\))
Where it finds the variable Spreadsheet which has the function as an attribute. The expression keeps going until it gets to the end parenthesis. For simple functions such as
Spreadsheet.ADD(1,2)
the regular expression will work fine.
However, if I try to do any sort of nesting, the expression does not work because it will stop at the inside parenthesis instead of going to the last parenthesis.
Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)
Thus, the ", 3)" isn't identified and ends being ignored. Of course, due to the way my code processes it, this unusual string ends up causing an error.
Does anyone with more knowledge of regular expressions know how it could be changed such that it will stop only when it is at the last parenthesis and not the first?
Thanks.

This assumes that you only want to match functions in the form that you state in the question. If you want to match any type of function (including operators, nested comments, etc.), then what you want is going to be difficult with regex. Anyway, to match the last bracket you can use:
(Spreadsheet\..+\))
This will match
Spreadsheet.ADD(1,2)
Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)
Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)foo
(foo not part of the match)
The reason that your regex did not match the full string is that [^)\)]+ stops at the first ) it meets, which is the closing parenthesis of the inner call. Also, as an aside, Spreadsheet. will match Spreadsheeta, Spreadsheetb, Spreadsheetc; to match a literal dot you need \.
In my regex, .+\) will include the last bracket because + is greedy, so it will get the longest match it can. As an aside, you would specify a non-greedy match using +?.
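The difference between the three variants can be seen side by side in a short sketch. The input line is made up for illustration:

```javascript
const source = 'x = Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)foo';

// Pattern from the question: stops at the first closing parenthesis.
const original = source.match(/(Spreadsheet.[^)\)]+\))/)[0];

// Greedy .+ runs to the end of the line and backs up to the last ")".
const greedy = source.match(/(Spreadsheet\..+\))/)[0];

// Lazy .+? stops at the first ")" again, like the original pattern.
const lazy = source.match(/(Spreadsheet\..+?\))/)[0];

console.log(original); // Spreadsheet.ADD(Spreadsheet.ADD(1, 2)
console.log(greedy);   // Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)
console.log(lazy);     // Spreadsheet.ADD(Spreadsheet.ADD(1, 2)
```

Note that the greedy version takes everything up to the last ) on the line, so two separate calls on one line would be captured as a single match.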

In a regular expression, match one thing or another, or both

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.

The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
All the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
The pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/
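The factorisation can be sanity-checked by brute force: enumerate every short string over a small alphabet and confirm the general and the simplified patterns agree. This is a sketch, not a proof; the generator name allStrings and the choice of alphabet and length are arbitrary:

```javascript
const general = /^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/;
const factored = /^([0-9]*\.)?[0-9]+$/;

// Yields every string of length 0..maxLength over the given alphabet.
function* allStrings(alphabet, maxLength) {
  let layer = [''];
  for (let len = 0; len <= maxLength; len++) {
    yield* layer;
    layer = layer.flatMap((s) => alphabet.map((c) => s + c));
  }
}

// Strings on which the two patterns give different verdicts.
const disagreements = [...allStrings(['1', '2', '.'], 5)]
  .filter((s) => general.test(s) !== factored.test(s));

console.log(disagreements.length); // 0
```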

Nice answer by huon (and a bit of brain-twister to follow it along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A
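To make the ordering issue concrete, here is a sketch that takes A to be \d+ and B to be \.\d+ (the pieces from the question) and searches inside a longer string. The input text is invented for illustration:

```javascript
const aThenOptional = /\d+|(?:\d+)?\.\d+/; // A|A?B
const longerFirst = /(?:\d+)?\.\d+|\d+/;   // A?B|A

const input = 'total: 234.56';

// Unanchored, A wins as soon as it matches, so the fraction is left behind.
const shortHit = input.match(aThenOptional)[0];

// With A?B first, the combined form is tried before falling back to A alone.
const fullHit = input.match(longerFirst)[0];

// Anchored, the order stops mattering: $ forces backtracking into A?B.
const anchored = /^(?:\d+|(?:\d+)?\.\d+)$/.test('234.56');

console.log(shortHit, fullHit, anchored); // 234 234.56 true
```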

Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).

Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on spaces and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will; JavaScript had no lookbehind support when this was written, it only arrived with ES2018, which made it possible to break any pattern I could think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and (a space).
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.
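A quick check of the final pattern against the examples listed in the question, as a sketch. The strings in shouldFail are extra cases added here for illustration and are not from the question:

```javascript
const numberPattern =
  /^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i;

// Examples from the question, including .56, which the original pattern missed.
const shouldMatch = [
  '45', '45.988', '45,689', '34,569,098,233', '567,900.90',
  '-9', '-34 banana fries', '0.56 points', '.56',
];

// Illustrative negatives: empty, blank, words only, malformed thousands group.
const shouldFail = ['', '   ', 'banana', '1,23'];

// Collect every input whose verdict differs from the expectation.
const unexpected = [
  ...shouldMatch.filter((s) => !numberPattern.test(s)),
  ...shouldFail.filter((s) => numberPattern.test(s)),
];

console.log(unexpected);
```

An empty list means every sample behaved as expected. The empty string is rejected because \b has no word character to anchor on.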

Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "a character followed by line start"; it is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex, your version of regex must support (?(named group)true|false).

Division/RegExp conflict while tokenizing Javascript [duplicate]

This question already has answers here:
When parsing Javascript, what determines the meaning of a slash?
(5 answers)
Closed 8 years ago.
I'm writing a simple javascript tokenizer which detects basic types: Word, Number, String, RegExp, Operator, Comment and Newline. Everything is going fine but I can't understand how to detect if the current character is RegExp delimiter or division operator. I'm not using regular expressions because they are too slow. Does anybody know the mechanism of detecting it? Thanks.

You can tell by what the preceding token is in the stream. Go through each token that your lexer emits and ask whether it can reasonably be followed by a division sign or a regexp; you'll find that the two resulting sets of tokens are disjoint. For example, (, [, {, ;, and all of the binary operators can only be followed by a regexp. Likewise, ), ], }, identifiers, and string/number literals can only be followed by a division sign.
See Section 7 of the ECMAScript spec for more details.
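The previous-token rule can be sketched as a small predicate. Everything here is an assumption made for illustration: the token shape { type, value }, the function name slashStartsRegex, and the (deliberately incomplete) set of tokens. The type names follow the list in the question:

```javascript
// Token values after which a "/" is treated as the start of a regexp literal.
// Illustrative subset only; a real lexer would list every operator and keyword.
const REGEX_PRECEDERS = new Set([
  '(', '[', '{', ';', ',', '=', '+', '-', '*', '!', '?', ':',
  '&&', '||', 'return', 'typeof',
]);

// prev is the last significant token emitted, or null at the start of input.
function slashStartsRegex(prev) {
  if (prev === null) return true;
  // A value has just ended, so "/" can only be division.
  if (['Number', 'String', 'RegExp'].includes(prev.type)) return false;
  // Operators, punctuation, and keyword-like words: look the value up.
  // Plain identifiers, ")", "]" and "}" are not in the set, so they mean division.
  return REGEX_PRECEDERS.has(prev.value);
}

console.log(slashStartsRegex({ type: 'Operator', value: '(' })); // true
console.log(slashStartsRegex({ type: 'Word', value: 'x' }));     // false
```

This simple rule treats every ) and } as ending an expression, which is the approximation described above; cases where a closing brace ends a statement instead need more context from a parser.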

You have to check the context when you encounter the slash. If the slash comes after an expression, it must be a division; otherwise it is the start of a regexp.
To recognize the context, you may have to write a syntax parser.
For example:
function f() {}
/1/g
// in this case, the slash is after a function definition, so it's a regexp start
var a = {}
/1/g;
// in this case, the slash is after an object expression, so it's a division
