I want to know if there is a way to see if part of a regular expression is deterministic or not. For example, the regular expression 0{3} is deterministic, i.e. there is only one string that matches it: "000". So, for example, if we had the regular expression \d0{3} along with the string "1", is there a way to get the string "1000" from that? It seems technically possible, since once you have the first digit you know that the rest of the digits are all 0s and that there can only be 3 of them. I don't know if I am missing something, though.
A sufficient condition for a regular expression to be deterministic is that it does not contain:
The | operator.
Any quantifier (+, *, ?, {n,m}) other than fixed repetition ({n}).
Any character class matching more than one character (\w, [a-z]).
These conditions are not necessary because of zero-width assertions. For example, the expression (?!x)(x|y) only matches y. So this simple approach will not cover all cases, though it may suffice for your application.
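If you want to go with this simple approach, a rough checker over the pattern source could look something like the sketch below (purely heuristic: it ignores lookarounds, groups, anchors and plenty of other syntax, and looksDeterministic is just a made-up name for illustration):

// Heuristic sketch of the sufficient condition above: reject any pattern source
// containing alternation, variable quantifiers, or multi-character classes.
// Escapes are handled naively; lookarounds, groups, etc. are not considered.
function looksDeterministic(source) {
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      // Shorthand classes like \d, \w, \s match more than one character.
      if ("dwsDWS".includes(source[i + 1])) return false;
      i++; // skip the escaped character
      continue;
    }
    if ("|*+?[.".includes(ch)) return false;   // alternation, variable quantifiers, classes, dot
    if (ch === "{") {
      const end = source.indexOf("}", i);
      if (end === -1) return false;            // malformed; play it safe
      if (source.slice(i + 1, end).includes(",")) return false; // {n,m} or {n,} is variable
      i = end;
    }
  }
  return true;
}

console.log(looksDeterministic("0{3}"));     // true  -> only "000" matches
console.log(looksDeterministic("\\d0{3}"));  // false -> \d can match ten different digits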
At least for the case of true regular expressions without backreferences, it should be possible to determine whether an expression is singular (matched by exactly one string). Use the standard construction to turn the expression into a nondeterministic finite automaton, convert that into a deterministic finite automaton, then minimize it. The minimal DFA is singular if and only if it has exactly one accepting state, that accepting state has no outgoing edges, and every non-accepting state has exactly one outgoing edge.
To handle lookahead assertions, you might need to turn the expression into an alternating finite automaton, then use an approach similar to Thompson's construction to get the NFA, then proceed from there. Note that the worst-case here could have doubly exponential blowup. You can take \b and ^ and similar and translate them to one-character lookbehind assertions, then do some fiddly stuff to get those to work.
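Once you have the minimized DFA, the singularity test itself is tiny. Here is a sketch assuming a hand-rolled DFA representation (start state, set of accepting states, transition map) with dead states removed; the NFA construction, subset construction and minimization are not shown:

// Sketch: check whether a minimized, trimmed DFA accepts exactly one string.
// dfa = { start, accepting: Set of states, transitions: Map<state, Map<symbol, state>> }
function isSingular(dfa) {
  if (dfa.accepting.size !== 1) return false;
  for (const [state, edges] of dfa.transitions) {
    if (dfa.accepting.has(state)) {
      if (edges.size !== 0) return false;  // the accepting state must be a dead end
    } else {
      if (edges.size !== 1) return false;  // every other state has exactly one way forward
    }
  }
  return true;
}

// Example: the minimal DFA for 0{3} is a chain of four states, so it is singular.
const dfaFor000 = {
  start: 0,
  accepting: new Set([3]),
  transitions: new Map([
    [0, new Map([["0", 1]])],
    [1, new Map([["0", 2]])],
    [2, new Map([["0", 3]])],
    [3, new Map()],
  ]),
};
console.log(isSingular(dfaFor000)); // true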
I am working on optimizing a regex (regular expression) that will match the following URL schema:
protocol://anything1/folder/index.html[?param=anything2]
where items in brackets are optional, and anything1 and anything2 can each be sequences of any characters. Everything else is static literals.
If it matters, anything1 will range in length from 36 to 48 characters, and anything2 will range in length from 5 to 40 characters. The regex does not need to validate any of this.
Importantly, both anything1 and anything2 can include forward slashes.
There are no issues if the regex requires anything1 or anything2 to be at least 1 character, as it always will be, but as performance is most important, I'm fine if it matches 0 or 1+ characters for anything1 and/or anything2.
The regex is only used for matching, and not for parsing. Captured groups are not used elsewhere in the code.
Most importantly, I would like the regex to be as efficient (in regards to speed) as possible.
So far, I have:
^protocol://.+/folder/index\.html($|\?param=.+)
The regex must match the entire string, and not just part of it.
The regex engine is the one used internally by Firefox for its CSS engine (which I believe is the same as their JavaScript regex engine).
My regex works as expected, and I'm asking if it can be further optimized for performance.
Instead of using ^protocol://.+/folder/index\.html($|\?param=.+), you could make the pattern a bit more specific so there is less work for the engine:
^protocol://.{36,48}/folder/index\.html(?:\?param=.+)?$
A few recommendations:
Change the first .+ to the known limit of 36-48 characters according to the comments.
The .+ will first match until the end of the string, then it starts backtracking to match / and the rest of the pattern.
Knowing the range upfront might cut down on the backtracking.
If you don't need the last group, you can turn it into an optional non-capturing group so there is no group data to store.
For background on why capturing groups are slower than non-capturing groups, see capturing group VS non-capturing group and Why is regex search slower with capturing groups in Python?.
As an indication, you can compare the number of steps with less backtracking on regex101 (with PCRE selected; the JavaScript option does not report a step count):
Original pattern and updated pattern
I don't know if Firefox does optimization beforehand, so if you really want to be sure if there is a performance win, you would have to benchmark it in Firefox.
Note that . matches any character except a newline (including spaces), so it does not take the structure of the URL into account.
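For what it's worth, a rough benchmark along the following lines could be run in the Firefox console (a sketch only: the sample URLs are made up, absolute numbers are meaningless, and only the relative difference matters):

// Rough benchmark sketch comparing the original and the more specific pattern.
const original = /^protocol:\/\/.+\/folder\/index\.html($|\?param=.+)/;
const updated  = /^protocol:\/\/.{36,48}\/folder\/index\.html(?:\?param=.+)?$/;

const urls = [];
for (let i = 0; i < 100000; i++) {
  urls.push(`protocol://${"x".repeat(40)}/folder/index.html?param=${"y".repeat(20)}`);
  urls.push(`protocol://${"x".repeat(40)}/folder/other.html`); // non-matching case
}

for (const [name, re] of [["original", original], ["updated", updated]]) {
  console.time(name);
  for (const url of urls) re.test(url);
  console.timeEnd(name);
}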
Try something along these lines:
^protocol:\/\/[a-z\d-]+([^?]+)?\/folder\/index\.html(\?[^=&]+=[^&]+)?$
Note the use of early-failing negated character classes to limit backtracking (you may want to add more characters to those [^...] classes).
https://regex101.com/r/h3x5vH/2.
I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit              Shorthand
(something){0,1}      (something)?
(something){1}        (something)
(something){0,}       (something)*
(something){1,}       (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax, but I'm not sure which. I vaguely recall a question on SO a few days ago where the explicit notation wouldn't work in Notepad++.
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to review their pattern, and that's when I would suggest changing those occurrences to the shorthand notation to save space and, IMO, improve readability.
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
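For a quick sanity check in one engine, you can verify that each pair accepts the same sample strings (a sketch; agreement on a handful of samples is obviously not a proof of equivalence, and other engines may not even support both spellings):

// Sanity-check sketch: the explicit and shorthand forms agree on some sample strings.
const pairs = [
  [/^a{0,1}$/, /^a?$/],
  [/^a{1}$/,   /^a$/ ],
  [/^a{0,}$/,  /^a*$/],
  [/^a{1,}$/,  /^a+$/],
];
const samples = ["", "a", "aa", "aaa", "b", "ab"];
for (const [explicit, shorthand] of pairs) {
  for (const s of samples) {
    console.assert(
      explicit.test(s) === shorthand.test(s),
      `${explicit} and ${shorthand} disagree on ${JSON.stringify(s)}`
    );
  }
}
console.log("All pairs agree on the sample strings.");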
They're equivalent (and you'll find out whether they're available by testing in your context).
The problem I'd anticipate is when you may not be the only person who ever needs to work with your code. Regexes are difficult enough for most people. Any time someone uses an unusual syntax, the question arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"
I'm trying to understand unrolling loops in regex. What is the big difference between:
MINISTÉRIO[\s\S]*?PÁG
and
MINISTÉRIO(?:[^P]*(?:P(?!ÁG\s:\s\d+\/\d+)[^P]*)(?:[\s\S]*?))PÁG
In this context:
http://regexr.com/3dmlr
Why should I use the second, if the first does the SAME thing?
Thanks.
What is Unroll-the-loop
See this Unroll the loop technique source:
This optimisation technique is used to optimize repeated alternations of the form (expr1|expr2|...)*. These expressions are not uncommon, and the use of another repetition inside an alternation may also lead to super-linear matching. Super-linear matching arises from the nondeterministic expression (a*)*.
The unrolling-the-loop technique is based on the hypothesis that in most cases you know, in a repeated alternation, which case is the most usual and which one is exceptional. We will call the first one the normal case and the second one the special case. The general syntax of the unrolling-the-loop technique can then be written as:
normal* ( special normal* )*
So, this is an optimization technique where alternations are turned into linearly matching atoms.
This makes these unrolled patterns very efficient since they involve less backtracking.
Current Scenario
Your MINISTÉRIO[\s\S]*?PÁG is a non-unrolled pattern, while MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG is unrolled. See the demos (both saved with the PCRE option so the number of steps is shown in the box above; regex performance differs across engines, but the step count gives a good indication of the difference). Add more text after the subject text: the first regex will start requiring more steps to finish, while the second one will only show more steps after you add a P. So, in texts where the character used in the known part of the pattern is not common, unrolled patterns are very efficient.
See the Difference between .*?, .* and [^"]*+ quantifiers section in my answer to understand how lazy matching works (your [\s\S]*? is the same as .*? with a DOTALL modifier in languages that allow a . to match a newline, too).
Performance Question
Is a lazy matching pattern always slow and inefficient? Not always. With very short strings (1-10 symbols), lazy dot matching is usually better. With long inputs, where the leading delimiter may be present but the trailing one missing, lazy matching can cause excessive backtracking and even time-out issues.
Use unrolled patterns when you have arbitrary inputs of potentially long length and where there may be no match.
Use lazy matching when your input is controlled: you know there will always be a match, the input follows a known set of log formats, or the like.
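If you want to get a feel for the difference in your own engine, a measurement sketch like the one below can help (the filler text is made up, and results vary wildly between engines; some engines optimize the lazy form well, so always measure rather than assume):

// Measurement sketch: lazy vs. unrolled pattern against a long input that contains
// the leading MINISTÉRIO but no trailing PÁG (the worst case for backtracking).
const lazy     = /MINISTÉRIO[\s\S]*?PÁG/;
const unrolled = /MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG/;
const input    = "MINISTÉRIO " + "texto sem o marcador final ".repeat(100000);

for (const [name, re] of [["lazy", lazy], ["unrolled", unrolled]]) {
  console.time(name);
  re.test(input); // both return false here; we only care about the time taken
  console.timeEnd(name);
}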
Bonus: Commonly Unrolled patterns
Tempered greedy tokens
Regular string literals ("String\u0020:\"text\""): "[^"\\]*(?:\\.[^"\\]*)*"
Multiline comment regex (/* Comments */): /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
#<...># comment regex: #<[^>]*(?:>[^#]*)*#
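As a small usage example, the string-literal pattern from the list above can be dropped into JavaScript as-is:

// The unrolled string-literal pattern: quote, then non-quote/non-backslash runs
// interleaved with escape sequences, then the closing quote.
const stringLiteral = /"[^"\\]*(?:\\.[^"\\]*)*"/;
const line = 'printf("String :\\"text\\"");';
console.log(line.match(stringLiteral)[0]); // "String :\"text\"" (escaped quotes included)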
I'd like to know how is it possible to get mutual recursion with Regex in Javascript.
For instance, a regex recognising the language {a^n b^n} would be /a$0b/ where $0 stands for the current regex. It would be basic recursion.
Mutual recursion is the same, but with different regexes.
How can we get it done with JavaScript?
Thanks for your answers!
Unfortunately, JavaScript doesn't have support for recursive regular expressions.
You can implement a basic recursive function with checks for the first and last letter (to ensure they're a and b respectively) and a recursive call for the interior string (the $0 in your example).
For one thing, this language only produces strings with an even number of letters, so you can implement a check to break out early.
Secondly, your only other failure case is if the letters at each end don't match.
You can map any other regular expression to such a tester function. To implement a mutually recursive one, simply call one function from the other.
To structure the check like an automaton, go token by token: for the example above, first check that you have an a at the beginning, then recurse on the interior, then check for the b at the end.
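Here is a minimal sketch of such a tester in plain JavaScript (the mutually recursive pair uses a made-up toy grammar purely to show the shape; it is not from the question):

// Recursive check for the language { a^n b^n : n >= 0 } - this is not a regular
// language, so it is plain JavaScript rather than a regex.
function matchesAnBn(s) {
  if (s.length === 0) return true;              // n = 0
  if (s.length % 2 !== 0) return false;         // early exit: odd length can never match
  if (s[0] !== "a" || s[s.length - 1] !== "b") return false;
  return matchesAnBn(s.slice(1, -1));           // recurse on the interior
}
console.log(matchesAnBn("aabb")); // true
console.log(matchesAnBn("aab"));  // false

// Mutual recursion is just two such testers calling each other.
// Toy grammar: X -> "a" Y "a" | "", Y -> "b" X "b" | ""
function matchesX(s) {
  if (s === "") return true;
  return s[0] === "a" && s[s.length - 1] === "a" && matchesY(s.slice(1, -1));
}
function matchesY(s) {
  if (s === "") return true;
  return s[0] === "b" && s[s.length - 1] === "b" && matchesX(s.slice(1, -1));
}
console.log(matchesX("abba")); // true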
The a^n b^n language you are describing isn't even a regular language, so a set of mutually recursive regexes would be even further from regular. You cannot use JavaScript's regular expression engine to recognize such a language, which is why you need to write your own recognizer (a recursive function, or equivalently a pushdown automaton) that consumes a token at a time and only proceeds when it encounters a match.
Is there any way to get all the possible outcomes of a regular expression pattern? Everything I've seen refers to a pattern that is evaluated against a string, but what I need is to have a pattern like this:
^EM1650S(B{1,2}|L{1,2})?$
generate all possible matches:
EM1650S
EM1650SB
EM1650SBB
EM1650SL
EM1650SLL
In the general case, no. In this particular case, though, the solution space is tiny.
There's a section covering this in Higher Order Perl (PDF) and a Perl module. I never re-implemented it in anything else, but I had a similar problem and this solution was adequate for similarly-limited needs.
There are tools that can display all possible matches of a regex.
Here is one written in Haskell: https://github.com/audreyt/regex-genex
and here is a Perl module: http://metacpan.org/pod/Regexp::Genex
Unfortunately, I couldn't find anything for JavaScript.
In this particular case, yes. The regex generates a finite number of valid strings, so they can be enumerated.
You'll just have to parse the regex. Part of it (EM1650S) is mandatory, so focus on the rest. Split on the | (or) symbol, enumerate the strings for each alternative, and then build all possible combinations.
Some regexes (those containing * or + quantifiers) represent an infinite number of strings, so they cannot be enumerated.
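For this specific pattern the enumeration can be done by hand, as in the sketch below (tailored to ^EM1650S(B{1,2}|L{1,2})?$, not a general regex enumerator):

// The mandatory prefix is "EM1650S"; the optional group (B{1,2}|L{1,2})? contributes
// one of "", "B", "BB", "L", "LL" - enumerate those and glue them together.
const prefix = "EM1650S";
const suffixes = [""];
for (const ch of ["B", "L"]) {
  for (let n = 1; n <= 2; n++) suffixes.push(ch.repeat(n));
}
const matches = suffixes.map(s => prefix + s);
console.log(matches); // ["EM1650S", "EM1650SB", "EM1650SBB", "EM1650SL", "EM1650SLL"]

// Sanity check: every enumerated string really matches the original pattern.
const re = /^EM1650S(B{1,2}|L{1,2})?$/;
console.assert(matches.every(m => re.test(m)));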
From a computational theoretic standpoint, regular expressions are equivalent to finite state machines. This is part of "automata theory." You could create a finite state machine that is equivalent to a regular expression and then use graph traversal algorithms to traverse all paths of the FSM. In the general case a countably infinite number of strings may match a regular expression, so your program may never terminate depending on the input regular expression.
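A sketch of that traversal idea is below, assuming the regex has already been converted into a DFA (the conversion itself is not shown, and the tiny example automaton covers only the B-branch of the pattern above). A depth cap guards against patterns with * or +, which accept infinitely many strings:

// Breadth-first traversal of a DFA, collecting every string that reaches an
// accepting state, up to a maximum length.
// dfa = { start, accepting: Set of states, transitions: Map<state, Map<symbol, state>> }
function enumerateStrings(dfa, maxLength = 20) {
  const results = [];
  let frontier = [{ state: dfa.start, text: "" }];
  for (let depth = 0; depth <= maxLength && frontier.length > 0; depth++) {
    const next = [];
    for (const { state, text } of frontier) {
      if (dfa.accepting.has(state)) results.push(text);
      for (const [symbol, target] of dfa.transitions.get(state) ?? new Map()) {
        next.push({ state: target, text: text + symbol });
      }
    }
    frontier = next;
  }
  return results;
}

// Example: a DFA accepting "", "B" and "BB" (the B-branch of (B{1,2}|L{1,2})?).
const dfa = {
  start: 0,
  accepting: new Set([0, 1, 2]),
  transitions: new Map([
    [0, new Map([["B", 1]])],
    [1, new Map([["B", 2]])],
    [2, new Map()],
  ]),
};
console.log(enumerateStrings(dfa)); // ["", "B", "BB"]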