I'm trying to understand unroll loops in regex. What is the big difference between:
MINISTÉRIO[\s\S]*?PÁG
and
MINISTÉRIO(?:[^P]*(?:P(?!ÁG\s:\s\d+\/\d+)[^P]*)(?:[\s\S]*?))PÁG
In this context:
http://regexr.com/3dmlr
Why should I use the second if the first does the same thing?
Thanks.
What is Unroll-the-loop
See this Unroll the loop technique source:
This optimisation technique is used to optimize repeated alternations of the form (expr1|expr2|...)*. These expressions are not uncommon, and the use of another repetition inside an alternation may also lead to super-linear matching. Super-linear matching arises from nondeterministic expressions such as (a*)*.
The unrolling-the-loop technique is based on the hypothesis that, in most cases, you know which case in a repeated alternation should be the most usual and which one is exceptional. We will call the first one the normal case and the second one the special case. The general syntax of the unrolling-the-loop technique could then be written as:
normal* ( special normal* )*
So, this is an optimization technique where alternations are turned into linearly matching atoms.
This makes these unrolled patterns very efficient since they involve less backtracking.
Current Scenario
Your MINISTÉRIO[\s\S]*?PÁG is a non-unrolled pattern, while MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG is an unrolled one. See the demos (both saved with the PCRE option to show the number of steps in the box above; regex performance differs across regex engines, but this will show you the performance difference for PCRE). Add more text after text: the first regex will start requiring more steps to finish, while the second one will only show more steps after you add a P. So, in texts where the character you used in the known part is not common, unrolled patterns are very efficient.
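A minimal Python sketch of the comparison above. The sample text is invented for illustration; only the two patterns come from the discussion. Python's re module does not report step counts, so this only checks that both forms find the same match and that both fail when the trailing PÁG is missing.

```python
import re

# The two patterns discussed above.
lazy = re.compile(r"MINISTÉRIO[\s\S]*?PÁG")
unrolled = re.compile(r"MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG")

# Hypothetical sample text, made up for this example.
text = "MINISTÉRIO DA SAÚDE\nPortaria No 1 - Publicada\nPÁG : 1/2\nresto PÁG"

lazy_match = lazy.search(text).group(0)
unrolled_match = unrolled.search(text).group(0)
print(lazy_match == unrolled_match)

# With a leading delimiter but no trailing one, neither pattern matches.
no_end = "MINISTÉRIO DA SAÚDE\nPortaria sem fim"
print(lazy.search(no_end), unrolled.search(no_end))
```

Both patterns stop at the first PÁG; the difference lies in how much work the engine does to get there, which has to be measured in the engine you actually use.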
See the Difference between .*?, .* and [^"]*+ quantifiers section in my answer to understand how lazy matching works (your [\s\S]*? is the same as .*? with a DOTALL modifier in languages that allow a . to match a newline, too).
Performance Question
Is the lazy matching pattern always slow and inefficient? It is not always so. With very short strings (1-10 symbols), lazy dot matching is usually better. With long inputs that may contain the leading delimiter but no trailing one, lazy matching may lead to excessive backtracking and time-out issues.
Use unrolled patterns when you have arbitrary inputs of potentially long length and where there may be no match.
Use lazy matching when your input is controlled and you know there will always be a match, e.g. a known set of log formats or the like.
Bonus: Commonly Unrolled patterns
Tempered greedy tokens
Regular string literals ("String\u0020:\"text\""): "[^"\\]*(?:\\.[^"\\]*)*"
Multiline comment regex (/* Comments */): /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
#<...># comment regex: #<[^>]*(?:>[^#]*)*#
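As a rough illustration, the string-literal and multiline-comment patterns from the list above can be tried in Python as follows. The source line is a made-up example, and the variable names are only for this sketch.

```python
import re

# Unrolled patterns taken from the list above.
STRING_LITERAL = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')
BLOCK_COMMENT = re.compile(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')

# Hypothetical input: one string literal with escapes, one tricky comment.
source = r'x = "String\u0020:\"text\""; /* a * tricky ** comment */ y = 1;'

strings = STRING_LITERAL.findall(source)
comments = BLOCK_COMMENT.findall(source)
print(strings)
print(comments)
```

In both patterns the "normal" part is a negated character class and the "special" part handles the one character that needs a closer look (a backslash or an asterisk).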
I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?
The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.
If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.
Start simple
Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.
Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)
Order from the menu
Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.
The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].
Think of character classes as menus: pick just one.
Helpful shortcuts
Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).
The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
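A short sketch of character classes and shortcuts using Python's re module (the sample strings are arbitrary):

```python
import re

# A character class is a menu: [Nn] picks exactly one of 'N' or 'n'.
names = re.findall(r"[Nn]ick", "Nick, nick and Rick")

# \d, \s and \w are shortcuts; the uppercase forms are their complements.
digits = re.findall(r"\d", "a1 b22")
chunks = re.findall(r"\S+", "two  words")
words = re.findall(r"\w+", "snake_case, x9!")

print(names, digits, chunks, words)
```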
Once is not enough
From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are
* (zero or more times)
+ (one or more times)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
Putting some of these blocks together, the pattern [Nn]*ick matches all of
ick
Nick
nick
Nnick
nNick
nnick
(and so on)
The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
A few other useful examples:
[0-9]+ (and its equivalent \d+) matches any non-negative integer
\d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
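The quantifier examples above can be checked with a few lines of Python. This is only a sketch; re.fullmatch is used so that the whole string has to match.

```python
import re

candidates = ["ick", "Nick", "nick", "Nnick", "nNick", "Rick"]
# [Nn]* may match zero times, so plain "ick" is accepted; "Rick" is not.
matched = [s for s in candidates if re.fullmatch(r"[Nn]*ick", s)]

# ? makes the preceding 'b' optional.
optional = [s for s in ["abc", "ac", "abbc"] if re.fullmatch(r"ab?c", s)]

# {n} repeats exactly n times.
date_ok = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2019-01-01") is not None

print(matched, optional, date_ok)
```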
Grouping
A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.
To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
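A small Python sketch of the difference grouping makes, plus one capture example (the date string is an arbitrary sample):

```python
import re

# Without parentheses, + applies only to the 'c'.
plus_on_c = re.fullmatch(r"0abc+0", "0abccc0") is not None
no_group = re.fullmatch(r"0abc+0", "0abcabc0") is not None

# With parentheses, + applies to the whole 'abc' subpattern.
grouped = re.fullmatch(r"0(abc)+0", "0abcabc0") is not None

# Parenthesized groups also capture the text they matched.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released on 2019-01-01")
year, month, day = m.groups()

print(plus_on_c, no_group, grouped, year, month, day)
```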
Alternation
Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).
For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
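To see how far alternation reaches, here is a sketch in Python. The "Hi " prefix is an invented example to make the scope visible.

```python
import re

# Without parentheses, | splits the whole pattern: "Hi Nick" OR "nick".
loose = re.compile(r"Hi Nick|nick")
# With parentheses, only the name alternates.
scoped = re.compile(r"Hi (Nick|nick)")

loose_bare = loose.fullmatch("nick") is not None
loose_greeting = loose.fullmatch("Hi nick") is not None
scoped_bare = scoped.fullmatch("nick") is not None
scoped_greeting = scoped.fullmatch("Hi nick") is not None

print(loose_bare, loose_greeting, scoped_bare, scoped_greeting)
```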
Escaping
Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
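In Python this looks roughly as follows. Raw strings (r"...") are used so that the language's own string escaping does not add a second layer of backslashes; re.escape is the standard-library helper that escapes a literal string for you.

```python
import re

text = r"digits 123 and the literal \d+"

# \d+ is a pattern: one or more digits.
as_pattern = re.findall(r"\d+", text)

# \\d\+ matches the three literal characters backslash, 'd', '+'.
as_literal = re.findall(r"\\d\+", text)

# re.escape builds the escaped form from the literal text.
escaped = re.escape(r"\d+")

print(as_pattern, as_literal, escaped)
```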
Greediness
Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.
For example, say the input is
"Hello," she said, "How are you?"
You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.
To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question works. It matches the sequence of a literal left-parenthesis, followed by one or more characters, and terminated by a right-parenthesis.
If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)
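The greedy and non-greedy behaviour described above can be reproduced with a short Python sketch:

```python
import re

line = '"Hello," she said, "How are you?"'

# Greedy: .+ runs to the last quote that still lets the pattern match.
greedy = re.search(r'".+"', line).group(0)

# Non-greedy: .+? stops at the first closing quote.
lazy = re.search(r'".+?"', line).group(0)

# The parenthesis example: the first capture is '123'.
first = re.search(r"\((.+?)\)", "(123) (456)").group(1)
greedy_first = re.search(r"\((.+)\)", "(123) (456)").group(1)

print(greedy)
print(lazy)
print(first, "|", greedy_first)
```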
Anchors
Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.
Say you want to match comments of the form
-- This is a comment --
you'd write ^--\s+(.+)\s+--$.
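A quick check of the comment pattern in Python (the second input is an invented counterexample):

```python
import re

comment = re.compile(r"^--\s+(.+)\s+--$")

m = comment.search("-- This is a comment --")
body = m.group(1)

# The anchors reject input with extra text before the opening dashes.
rejected = comment.search("x -- This is a comment --")

print(repr(body), rejected)
```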
Build your own
Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.
Tools for writing and debugging regexes:
RegExr (for JavaScript)
Perl: YAPE: Regex Explain
Regex Coach (engine backed by CL-PPCRE)
RegexPal (for JavaScript)
Regular Expressions Online Tester
Regex Buddy
Regex 101 (for PCRE, JavaScript, Python, Golang, Java 8)
I Hate Regex
Visual RegExp
Expresso (for .NET)
Rubular (for Ruby)
Regular Expression Library (Predefined Regexes for common scenarios)
Txt2RE
Regex Tester (for JavaScript)
Regex Storm (for .NET)
Debuggex (visual regex tester and helper)
Books
Mastering Regular Expressions, the 2nd Edition, and the 3rd edition.
Regular Expressions Cheat Sheet
Regex Cookbook
Teach Yourself Regular Expressions
Free resources
RegexOne - Learn with simple, interactive exercises.
Regular Expressions - Everything you should know (PDF Series)
Regex Syntax Summary
How Regexes Work
JavaScript Regular Expressions
Footnote
†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.
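In Python, the equivalent of the switches mentioned in the footnote is the re.DOTALL flag. A minimal sketch:

```python
import re

text = "line one\nline two"

# By default, . stops at the newline.
default_dot = re.search(r".+", text).group(0)

# With re.DOTALL, . matches the newline too.
dotall_dot = re.search(r".+", text, re.DOTALL).group(0)

# [\s\S] matches anything regardless of flags.
any_char = re.search(r"[\s\S]+", text).group(0)

print(repr(default_dot), repr(dotall_dot), repr(any_char))
```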
I am working on optimizing a regex (regular expression) that will match the following URL schema:
protocol://anything1/folder/index.html[?param=anything2]
where items in brackets are optional, and anything1 and anything2 can each be sequences of any characters. Everything else is static literals.
If it matters, anything1 will range in length from 36 to 48 characters, and anything2 will range in length from 5 to 40 characters. The regex does not need to validate any of this.
Importantly, both anything1 and anything2 can include forward slashes.
There are no issues if the regex requires anything1 or anything2 to be at least 1 character, as it always will be, but as performance is most important, I'm fine if it matches 0 or 1+ characters for anything1 and/or anything2.
The regex is only used for matching, and not for parsing. Captured groups are not used elsewhere in the code.
Most importantly, I would like the regex to be as efficient (in regards to speed) as possible.
So far, I have:
^protocol://.+/folder/index\.html($|\?param=.+)
The regex must match the entire string, and not just part of it.
The regex engine is the one used internally by Firefox for its CSS engine (which I believe is the same as their JavaScript regex engine).
My regex works as expected, and I'm asking if it can be further optimized for performance.
Instead of using ^protocol://.+/folder/index\.html($|\?param=.+), you could make the pattern a bit more specific so there is less work for the engine:
^protocol://.{36,48}/folder/index\.html(?:\?param=.+)?$
A few recommendations could be
Change the first .+ to the known limit of 36-48 characters according to the comments.
The .+ will first match until the end of the string, then it starts backtracking to match / and the rest of the pattern.
Knowing the range upfront might cut down on the backtracking.
If you don't need the last group, you can turn it into an optional non capturing group so there is no group data to store.
For background on why capturing groups can be slower than non-capturing groups, see "capturing group VS non-capturing group" and "Why is regex search slower with capturing groups in Python?".
As an indication, you can compare the number of steps, which drops when there is less backtracking (select PCRE on regex101; the JavaScript option does not report any steps):
Original pattern and updated pattern
I don't know if Firefox does optimization beforehand, so if you really want to be sure if there is a performance win, you would have to benchmark it in Firefox.
Note that . matches any character except a newline (including a space), so the pattern does not take the structure of the URL into account.
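A sketch in Python comparing the original and updated patterns. Python's re engine is not the Firefox engine, so this says nothing about speed there; it only shows which inputs each pattern accepts. The URLs and the anything1 value are made up for the example.

```python
import re

original = re.compile(r"^protocol://.+/folder/index\.html($|\?param=.+)")
updated = re.compile(r"^protocol://.{36,48}/folder/index\.html(?:\?param=.+)?$")

# Hypothetical 40-character value for anything1, including a slash.
anything1 = "x" * 20 + "/" + "y" * 19
urls = [
    f"protocol://{anything1}/folder/index.html",
    f"protocol://{anything1}/folder/index.html?param=abc/def",
]
both_match = all(original.search(u) and updated.search(u) for u in urls)

# The length bound makes the updated pattern stricter for out-of-range input.
short_url = "protocol://short/folder/index.html"
original_short = original.search(short_url) is not None
updated_short = updated.search(short_url) is not None

print(both_match, original_short, updated_short)
```

Note the trade-off: the {36,48} bound is only safe if the stated length range really holds for every URL that should match.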
Try something along these lines:
^protocol:\/\/[a-z\d-]+([^?]+)?\/folder\/index\.html(\?[^=&]+=[^&]+)?$
Note the usage of early-failing character classes to prevent backtracking (you may want to add more characters to those [^]'s).
https://regex101.com/r/h3x5vH/2.
I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit            Shorthand
(something){0,1}    (something)?
(something){1}      (something)
(something){0,}     (something)*
(something){1,}     (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax, but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
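One way to spot-check the equivalence in a given engine is to run both forms over the same inputs. The sketch below does this for Python's re module only; the patterns and samples are arbitrary, and agreement on a handful of strings is evidence, not proof.

```python
import re

# (explicit, shorthand) pairs from the table in the question.
pairs = [
    (r"(ab){0,1}c", r"(ab)?c"),
    (r"(ab){1}c", r"(ab)c"),
    (r"(ab){0,}c", r"(ab)*c"),
    (r"(ab){1,}c", r"(ab)+c"),
]
samples = ["c", "abc", "ababc", "abab", "xabcx"]

def span(pattern, text):
    """Return the (start, end) of the first match, or None."""
    m = re.search(pattern, text)
    return m.span() if m else None

all_equal = all(span(e, s) == span(sh, s) for e, sh in pairs for s in samples)

# The reluctant forms also agree on this sample.
lazy_equal = span(r"a{0,}?b", "aaab") == span(r"a*?b", "aaab")

print(all_equal, lazy_equal)
```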
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern, and that's when I would suggest changing those occurrences to shorthand notation to save space and, IMO, improve readability.
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
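The backreference case can be illustrated in Python's re module, where \10 is read as a reference to group 10. Other engines may resolve the ambiguity differently, so treat this as a sketch for one engine only.

```python
import re

# \1{1}0: backreference to group 1, then a literal 0.
isolated = re.compile(r"(a)\1{1}0")
isolated_ok = isolated.fullmatch("aa0") is not None

# Without a separator, \10 refers to group 10, which does not exist here,
# so Python refuses to compile the pattern.
try:
    re.compile(r"(a)\10")
    ambiguous_compiles = True
except re.error:
    ambiguous_compiles = False

# Two of the "other ways": a one-character class, or a named group.
with_class = re.fullmatch(r"(a)\1[0]", "aa0") is not None
with_name = re.fullmatch(r"(?P<x>a)(?P=x)0", "aa0") is not None

print(isolated_ok, ambiguous_compiles, with_class, with_name)
```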
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"
I want to know if there is a way to see if part of a regular expression is deterministic or not. For example, the regular expression 0{3} is deterministic, i.e., there is only one string that matches it: "000". So, for example, if we had the regular expression \d0{3} along with the string "1", is there a way to get the string "1000" from that? It seems technically possible, since once you have the first digit, you know that the rest of the digits are all 0s and there can only be 3 of them. I don't know if I am missing something or not, though.
A sufficient condition for a regular expression to be deterministic is that it does not contain:
The | operator.
Any quantifiers (+,*,?,{n,m}) other than fixed repetition ({n}).
Any character class matching more than one character (\w, [a-z])
These conditions are not necessary because of zero-width assertions. For example, the expression (?!x)(x|y) only matches y. So this simple approach will not cover all cases, though it may suffice for your application.
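The sufficient condition above can be approximated with a deliberately conservative checker. This is only a sketch: the function name is hypothetical, it rejects groups, anchors, lookarounds and every escaped letter or digit (even harmless ones such as \n), and it inspects the pattern text rather than a real parse tree.

```python
import re

# Allowed atoms: a plain character that is not a regex metacharacter, or an
# escaped punctuation character such as \. or \+. Each atom may be followed
# by a fixed repetition {n}. Everything else is rejected.
_SAFE = re.compile(r"(?:(?:[^\\|+*?{}()\[\].^$]|\\\W)(?:\{\d+\})?)*")

def is_obviously_deterministic(pattern: str) -> bool:
    """True if the pattern trivially matches exactly one string (conservative)."""
    return _SAFE.fullmatch(pattern) is not None

for p in [r"0{3}", r"a\.b", r"\d0{3}", r"a|b", r"0{2,3}", r"(?!x)(x|y)"]:
    print(p, is_obviously_deterministic(p))
```

As the last example shows, (?!x)(x|y) is rejected even though it matches only y, which is exactly the gap between "sufficient" and "necessary" noted above.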
At least for the case of true regular expressions without backreferences, it should be possible to determine whether they are singular, i.e., whether they match exactly one string. Simply use the standard construction to turn the expression into a nondeterministic finite automaton, then a deterministic finite automaton, then minimize it. The minimal DFA is singular if and only if there is exactly one accepting state, the accepting state has no edges coming from it, and every nonaccepting state has one edge coming from it.
To handle lookahead assertions, you might need to turn the expression into an alternating finite automaton, then use an approach similar to Thompson's construction to get the NFA, then proceed from there. Note that the worst-case here could have doubly exponential blowup. You can take \b and ^ and similar and translate them to one-character lookbehind assertions, then do some fiddly stuff to get those to work.