I'm refactoring a rather large RegExp into a function that returns a RegExp. As a backward-compatibility test, I compared the .source of the returned RegExp with the .source of the old RegExp:
getRegExp(/* in the case requiring backward compatibility there's no arguments */)
.source == oldRegExp.source
However, I've noticed that the old RegExp contains various unnecessary backslashes, like [\.\w] instead of [.\w]. I'd like to refactor such bits, but there are a number of them and it would be nice to have a similar check (that backward compatibility is not broken). The problem is, /[\.\w]/.source != /[.\w]/.source. And identifying automatically which backslashes may be removed is not trivial (\. and . are not the same outside [...], and may differ in some other cases).
Are you aware of any reasonably simple way to do so? It seems this can only be done by actually parsing the .source (compare the example above with /\[\.\w]\/ and /\[.\w]\/), but maybe I'm missing some trick that uses the browser's built-in properties/methods. The point is, '\"' == '"' is true, so strings defined with these different syntaxes are stored as "normalized" values ("); I wonder if such a "normalized" pattern is available for a RegExp.
Sadly, comparing two regular expressions to see if they're the same is exactly the same as comparing any other two pieces of code - i.e., hard.
The only real way I know of to do this is to create a suite of tests, each one targeting a specific aspect of the regular expression and verifying that it works properly. This is not an easy process: regular expressions are subtle and complex, with a lot of potential for unrealized side effects. I recently had to fix some defects in a regex-based address parser and it took about a thousand unit tests before I was satisfied with my coverage... but then as soon as I started to change the regex MY TESTS CAUGHT STUFF CONSTANTLY!!
Unit testing sucks and it's just tiring and not fun, but for almost any piece of logic it has real value, and when using powerful tools like regex, I would say it's absolutely crucial.
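As a minimal sketch of that test-suite idea, one could run the old and new expressions over a list of sample strings and report where they disagree. The helper name findDifferences and the sample strings are hypothetical, not from the question, and a clean result only shows agreement on the samples you thought of, not true equivalence:

```javascript
// Hypothetical helper: returns the samples on which two regexes behave differently.
function findDifferences(oldRe, newRe, samples) {
  // Summarize a match as its matched text and captures plus its start index.
  const summarize = (re, text) => {
    const m = text.match(re);
    return JSON.stringify(m && [Array.from(m), m.index]);
  };
  return samples.filter((text) => summarize(oldRe, text) !== summarize(newRe, text));
}

const samples = ["a.b", "...", "x y", ""];

// Inside a character class the backslash is redundant, so nothing should differ.
const insideClass = findDifferences(/[\.\w]+/, /[.\w]+/, samples);

// Outside a class, \. and . mean different things, so some samples should differ.
const outsideClass = findDifferences(/\./, /./, samples);
```

The more the samples target individual pieces of the pattern, the more a clean result is worth.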
I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit            Shorthand
(something){0,1}    (something)?
(something){1}      (something)
(something){0,}     (something)*
(something){1,}     (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax, but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
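A quick way to check this in one particular engine is to run both forms over the same inputs and compare the results. The sketch below does that for JavaScript only. The patterns and samples are made up for illustration, and since JavaScript has no possessive quantifiers, only the reluctant (lazy) modifier is covered:

```javascript
// Pairs of [explicit, shorthand] pattern sources; the last two rows add the lazy modifier.
const pairs = [
  ["^x(ab){0,1}(.*)$", "^x(ab)?(.*)$"],
  ["^x(ab){1}(.*)$", "^x(ab)(.*)$"],
  ["^x(ab){0,}(.*)$", "^x(ab)*(.*)$"],
  ["^x(ab){1,}(.*)$", "^x(ab)+(.*)$"],
  ["^x(ab){0,}?(.*)$", "^x(ab)*?(.*)$"],
  ["^x(ab){1,}?(.*)$", "^x(ab)+?(.*)$"],
];
const samples = ["", "x", "xab", "xabab", "xababc", "ab"];

// True when both sources produce the same match and captures on every sample.
function sameBehaviour(explicitSrc, shorthandSrc) {
  return samples.every((text) => {
    const a = new RegExp(explicitSrc).exec(text);
    const b = new RegExp(shorthandSrc).exec(text);
    return JSON.stringify(a) === JSON.stringify(b);
  });
}

const results = pairs.map(([explicit, shorthand]) => sameBehaviour(explicit, shorthand));
```

The trailing (.*) group is there so that a difference in greediness would show up in the captures, not just in whether the pattern matches.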
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern, and that's when I would suggest changing those occurrences to shorthand notation to save space and, IMO, improve readability.
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
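To illustrate the backreference case in JavaScript specifically (other engines may parse \10 differently), here is a small sketch with {1} and two of the "other ways". The variable names are made up, and the behaviour of the unseparated \10 relies on the legacy, non-Unicode parsing rules, where a number larger than the group count is read as an octal escape:

```javascript
// One capture group, then the backreference \1, then a literal "0".
const withQuantifier = /(a)\1{1}0/;  // {1} separates \1 from the 0
const withCharClass = /(a)\1[0]/;    // a one-character class does the same
const withEmptyGroup = /(a)\1(?:)0/; // so does an empty non-capturing group

// Without a separator, \10 is no longer read as "\1 followed by 0".
const ambiguous = /(a)\10/;

const text = "aa0";
const outcomes = [withQuantifier, withCharClass, withEmptyGroup, ambiguous].map((re) =>
  re.test(text)
);
```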
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question
arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"
I am writing a JavaScript-to-Python translator, and "\8" and "\9" are causing me lots of problems. According to the documentation, something like "\8" or "\9" is illegal since they are not valid octal escapes. The Esprima parser throws an exception on such a literal. However, JS engines seem to allow them, and they evaluate to "8" and "9" respectively.
Therefore:
/\8/.exec("\8")
RegExp('\\8').exec('\8')
/\8/.exec("8")
RegExp('\\8').exec('8')
should all return a match, since /\8/ should be the same as /8/. However, the results are inconsistent across JS engines: some return a match while others don't (for example, Safari's).
What's the reason for all these differences? And what is the right way to handle other cases involving these literals?
You're right that the spec does not allow for it, but no one ever said that JS engines are perfect.
The "right" way to handle those cases is to report them as a syntax error, given that this is valid in neither JS nor Python*.
*As far as I know. I don't write a lot of Python but a quick Googling seems to indicate it isn't.
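For the regex half of the problem, one possible way for a translator to flag such escapes is to try compiling the pattern with the "u" (Unicode) flag, which switches off the legacy leniency and rejects escapes like \8 with a SyntaxError. This is only a sketch: the helper name is hypothetical, it assumes an engine that supports the "u" flag, and it says nothing about string literals such as "\8":

```javascript
// Hypothetical helper: true if `source` compiles under the stricter "u" (Unicode) flag.
function isStrictlyValidPattern(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (err) {
    // Only a SyntaxError means "bad pattern"; anything else is rethrown.
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

const checks = ["\\8", "\\9", "8", "\\d"].map((source) => [
  source,
  isStrictlyValidPattern(source),
]);
```

A translator could then report a syntax error for any pattern that fails the check, instead of guessing what a lenient engine would have done.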
Is there any way to get all the possible outcomes of a regular expression pattern? Everything I've seen refers to a pattern that is evaluated against a string, but what I need is to have a pattern like this:
^EM1650S(B{1,2}|L{1,2})?$
generate all possible matches:
EM1650S
EM1650SB
EM1650SBB
EM1650SL
EM1650SLL
In the general case, no. In this case, you have almost no solution space.
There's a section covering this in Higher Order Perl (PDF) and a Perl module. I never re-implemented it in anything else, but I had a similar problem and this solution was adequate for similarly-limited needs.
There are tools that can display all possible matches of a regex.
Here is one written in Haskell: https://github.com/audreyt/regex-genex
and here is a Perl module: http://metacpan.org/pod/Regexp::Genex
Unfortunately I couldn't find anything for JavaScript
In this particular case, yes. The regex generates a finite number of valid strings, so they can be enumerated.
You'll just have to parse the regex. Part of it (EM1650S) is mandatory, so only the rest needs any thought. Split on the | (or) symbol, enumerate the strings for each side of it, and then combine them with the mandatory part to get all possible strings.
Some regexes (those containing * or + symbols) can represent an infinite number of strings, so they cannot be fully enumerated.
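A minimal sketch of that enumeration, with the pattern split up by hand instead of by a real regex parser (the helper names expand and repeats are hypothetical):

```javascript
// Hypothetical helper: each part is a list of alternatives; returns every combination.
function expand(parts) {
  return parts.reduce(
    (prefixes, alternatives) =>
      prefixes.flatMap((prefix) => alternatives.map((alt) => prefix + alt)),
    [""]
  );
}

// Hypothetical helper: `ch` repeated min..max times, mirroring a {min,max} quantifier.
function repeats(ch, min, max) {
  const out = [];
  for (let n = min; n <= max; n++) out.push(ch.repeat(n));
  return out;
}

const pattern = /^EM1650S(B{1,2}|L{1,2})?$/;

// Mandatory prefix, then an optional group: nothing, B{1,2}, or L{1,2}.
const candidates = expand([
  ["EM1650S"],
  ["", ...repeats("B", 1, 2), ...repeats("L", 1, 2)],
]);

// Sanity check: every generated string should match the original pattern.
const allMatch = candidates.every((s) => pattern.test(s));
```

Checking the generated strings against the original pattern catches mistakes in the hand-made decomposition, though it cannot show that nothing was left out.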
From a computational theoretic standpoint, regular expressions are equivalent to finite state machines. This is part of "automata theory." You could create a finite state machine that is equivalent to a regular expression and then use graph traversal algorithms to traverse all paths of the FSM. In the general case a countably infinite number of strings may match a regular expression, so your program may never terminate depending on the input regular expression.
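The traversal idea can be sketched as follows, assuming the state machine has already been built by hand (converting a regex into one is not shown). The machine layout and all names are illustrative. The maxSteps bound is what keeps the walk from running forever on machines with cycles:

```javascript
// Hand-built machine for EM1650S(B{1,2}|L{1,2})? (state names are illustrative).
const productCodes = {
  start: "q0",
  accepting: new Set(["q1", "b1", "l1", "done"]),
  transitions: {
    q0: { EM1650S: "q1" },
    q1: { B: "b1", L: "l1" },
    b1: { B: "done" },
    l1: { L: "done" },
    done: {},
  },
};

// A machine for a* has a cycle, so it accepts infinitely many strings.
const anyNumberOfAs = {
  start: "q",
  accepting: new Set(["q"]),
  transitions: { q: { a: "q" } },
};

// Breadth-first walk collecting accepted strings; follows at most maxSteps transitions.
function enumerate(fsm, maxSteps) {
  const results = [];
  let frontier = [{ state: fsm.start, text: "" }];
  for (let step = 0; step <= maxSteps && frontier.length > 0; step++) {
    const next = [];
    for (const { state, text } of frontier) {
      if (fsm.accepting.has(state)) results.push(text);
      for (const [label, target] of Object.entries(fsm.transitions[state] || {})) {
        next.push({ state: target, text: text + label });
      }
    }
    frontier = next;
  }
  return results;
}

const finiteCase = enumerate(productCodes, 10); // stops by itself once no paths remain
const boundedCase = enumerate(anyNumberOfAs, 3); // stops only because of the bound
```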
I'm trying to get a JavaScript regex that matches x opening braces, then x closing braces, while allowing them to be nested in between each other.
For example, it would match:
"{ a { q } }"
but not
"{ a { q } { }"
or
"{ } } { } {"
That being said, I have no idea how to do it with regexes, or if it's even possible.
The short answer to this is no. Regular expressions describe regular languages, which are less powerful than context-free grammars, and balanced nesting of arbitrary depth is not a regular language, so it cannot be done with a true regex. You can, however, look for specific (non-arbitrary) nesting patterns.
http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
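As a sketch of what a specific, non-arbitrary nesting pattern can look like, the function below builds a regex that accepts balanced braces only up to a fixed depth chosen in advance. The helper name balancedUpTo is hypothetical; anything nested deeper than the chosen limit is rejected:

```javascript
// Hypothetical helper: builds a regex accepting balanced braces nested at most maxDepth deep.
function balancedUpTo(maxDepth) {
  let body = "[^{}]*"; // depth 0: no braces at all
  for (let depth = 1; depth <= maxDepth; depth++) {
    // Either a non-brace character, or a brace pair wrapping the shallower pattern.
    body = `(?:[^{}]|\\{${body}\\})*`;
  }
  return new RegExp(`^${body}$`);
}

const depthTwo = balancedUpTo(2);
const outcomes = ["{ a { q } }", "{ a { q } { }", "{ } } { } {"].map((text) =>
  depthTwo.test(text)
);
```

The pattern grows with every extra level, which is the practical face of the limitation: the depth has to be a constant baked into the expression.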
The recursion problem here is, at its heart, the same reason you can't correctly parse HTML with regex. Like XML, the construct you've described is a context-free grammar; note its close similarity with the first example from the Wikipedia article.
I've heard there are engines out there that extend regex to offer support for arbitrarily nested elements, but this would make them something other than true regex. Anyway, I don't know of any such libraries for JavaScript. I think what you want is some kind of string-manipulation-based parser.
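A string-manipulation-based check for this particular problem can be very small. The sketch below (the function name is made up) keeps a running depth counter instead of using a regex:

```javascript
// Returns true when every "{" has a matching later "}" and nothing closes too early.
function isBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth < 0) return false; // a closing brace with nothing open
    }
  }
  return depth === 0; // anything still open means unbalanced
}

const outcomes = ["{ a { q } }", "{ a { q } { }", "{ } } { } {"].map(isBalanced);
```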
AFAIK, you can’t really do this with regular expressions only.
However, JavaScript’s String.replace method does have a nice feature that could allow you some level of recursion. If you pass a function as the second parameter, that function will be called for each match encountered. You could then perform the same replace on that match, passing along the same function, which would be called for each match inside that match, etc.
I’m too tired right now to write up an example that fits what you’re asking for (or even to tell whether it’s actually possible), so I’ll leave it at this possible hint, with the further working out as an exercise for the reader.
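One related way to make the replace idea work is sketched below. Note that it is a simpler variant, not the callback recursion described above: it strips the innermost brace pairs in a loop until nothing changes, then checks whether any brace is left over. The function name is hypothetical:

```javascript
// Repeatedly removes innermost "{...}" pairs; balanced input ends up with no braces.
function isBalancedByReplace(input) {
  let previous;
  let current = input;
  do {
    previous = current;
    // A pair with no braces inside it is an innermost pair.
    current = current.replace(/\{[^{}]*\}/g, "");
  } while (current !== previous);
  return !/[{}]/.test(current);
}

const outcomes = ["{ a { q } }", "{ a { q } { }", "{ } } { } {"].map(isBalancedByReplace);
```

Each pass peels off one level of nesting, so the regex itself never has to count; the loop does.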
That is not possible with a real regular expression: the "counting problem" that you're describing is an example of something a regular language just can't do. (PCRE-style engines get around this with recursion extensions, but JavaScript's RegExp has nothing comparable.)
An old textbook I had in school said, "regular expressions can't count." That's not true of modern "supercharged" regular expression implementations with the "{n,m}" quantifiers, but note that the values in curly braces there are constants.
To do that, you need a more complicated automaton. Context-free grammars can represent languages like the one you describe, as can parsing expression grammars.
Yes, it's probably possible with regexes. No, it isn't possible in JavaScript regexes. Yes, it's probably possible in .NET regexes, for example (Balancing Groups http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.71).aspx ). No, I don't know how to do them. They give me migraines (and I'm not kidding here). They are quite extreme voodoo.