Why are lookbehind assertions not supported in Javascript? - javascript

Recently I realized (with some embarrassment) that regex lookbehind assertions were not possible in JavaScript.
What is the (factual) reason for the absence of such a seemingly common assertion?
I realize there may be alternate ways to achieve the same thing, but is it the basic semantics that forbid the functionality, or what exactly?
It also seems that some regex testing tools that generate JavaScript code from regex patterns ignore this fact, which strikes me as a bit odd.

Today
Lookbehind is now an official part of the ES 2018 specification. Axel Rauschmayer gives a good introduction in his blog post.
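For a quick feel of the syntax, here is a minimal sketch, assuming an engine that implements ES2018 lookbehind (for example a recent Node.js or Chrome). The sample string and variable names are made up for illustration.

```javascript
// Positive lookbehind: a number that is immediately preceded by a dollar sign.
const priceAfterDollar = /(?<=\$)\d+(?:\.\d+)?/;
// Negative lookbehind: digits NOT preceded by "$", another digit, or a dot.
const plainNumber = /(?<![$\d.])\d+/;

const sample = "Cost: $42.50 for 3 items";
const price = sample.match(priceAfterDollar)[0];
const count = sample.match(plainNumber)[0];

console.log(price); // "42.50"
console.log(count); // "3"
```

The lookbehind itself consumes nothing, so the dollar sign is not part of the match.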
History
It looks like at the time, Brendan Eich wasn't aware of its existence (because Netscape's regex support was based on an older version of Perl):
This was 1998, Netscape 4 work I did in '97 was based on Perl 4(!), but we proposed to ECMA TC39 TG1 (the JS group -- things were different then, including capitalization) something based on Perl 5. We didn't get everything, and we had to rationalize some obvious quirks.
I don't remember lookbehind (which emerged in Perl 5.005 in July '98) being left out on purpose. Waldemar may recall more, I'd handed him the JS keys inside netscape.com to go do mozilla.org.
If you are game to write a proposal or mini-spec (in the style of ES5 even), let me know. I'll chat with other TC39'ers next week about this.
/be
There have been a number of different threads on the mailing list with attempts to include it, but it still seems to be a pretty complex feature performance-wise, because ECMAScript regular expressions are backtracking-based and backtracking is needed in lookbehind when working with capturing groups. This can lead to problems such as catastrophic backtracking when used incorrectly.
At some point it was suggested for ES6/ES 2015, but it never made the draft, let alone the specification. In the last post in the discussion, it seems that nobody took up the task of implementing it. If anybody feels called to write an implementation, they can sign up for the ES Discuss list and propose it.
Update May 2015:
In May 2015, Nozomu Katō proposed an ES7 look-behind implementation.
Update September 2015:
Regex Look-behind was added as a stage 0 proposal.
Update May 2017:
The proposal is now at stage 3. This means that at least two browsers now need to implement it for it to become a part of the next ECMAScript standard. As @martixy mentioned in the comments, Chrome has implemented it behind the JS experimental flag.

To start with the conclusion, I think look-behind is not implemented in JavaScript because no one has a clear idea of how it should behave, and existing implementations show that adding support for look-behind is rather complex.
JavaScript/ECMAScript differs from other languages in that its specification includes an abstract implementation of the regex engine, while most other languages stop at describing the behavior of each piece of regex syntax, with scant description of how different tokens interact with each other.
Look-ahead? Easy to implement
The implementation of look-ahead is quite straightforward. You only need to treat the pattern inside the look-ahead in the same manner as the pattern outside it, and perform a left-to-right match as usual, except that after the look-ahead succeeds 1) the current position is restored to where it was before entering the look-ahead, and 2) choice points inside the look-ahead are discarded after it is matched.
There is no limit to what can be included inside look-ahead, since it is a very simple extension to the existing natural left-to-right matching facilities.
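Both properties can be observed from JavaScript. The following is a small sketch; the second pattern is the well-known example from the ECMAScript specification's notes on look-ahead, and the variable names are only for illustration.

```javascript
// 1) The position is restored: "bar" is checked but not consumed.
const re = /foo(?=bar)/g;
const first = re.exec("foobar");
const matchedText = first[0];       // only "foo" is part of the match
const positionAfter = re.lastIndex; // matching resumes right after "foo"

// 2) Choice points inside the look-ahead are discarded: once (a+) has
// captured, the engine cannot backtrack into it to try a shorter capture.
const noBacktrack = /(?=(a+))a*b\1/.exec("baaabac");

console.log(matchedText, positionAfter); // "foo" 3
console.log(noBacktrack[0], noBacktrack[1]); // "aba" "a"
```

If backtracking into the look-ahead were allowed, the second match could have started earlier with a shorter capture; instead the engine has to advance the start position.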
Look-behind? Not so easy
On the other hand, the implementation of look-behind is not as straightforward.
Imagine how you would implement the following look-behind constructs:
(?<=fixed-string)
(?<=a|fixed|string)
(?<=t[abc]{1,3})
(?<=(abc){2,6})
(?<=^.*abc.*)
(?<=\G"[^"]+");
(?<=^(.....|.......)+)
\b(\w+)\b(?<!\b\1\b.*\1)
Apart from the basic case (?<=fixed-string), which any look-behind implementation must support, (?<=a|fixed|string) is a highly desirable case to support.
Different regex engines have varying levels of support for the regexes above.
Let us look at how they are implemented in .NET and Java. (These are the two flavors whose look-behind behavior I have studied.)
.NET implementation
In Microsoft's .NET implementation, all of the regexes above are valid, since .NET implements look-behind by using right-to-left mode, with the starting offset at the current position. The look-behind construct doesn't generate any choice point by itself.
However, if you use capturing groups inside the look-behind, it starts to get confusing, since the atoms in the pattern are interpreted from right to left, as demonstrated in this post. This is the disadvantage of this method: you need to wrap your mind around thinking right-to-left when writing a look-behind.
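The same right-to-left effect can be seen in JavaScript engines that implement ES2018 lookbehind, which also matches the inner pattern backward. This is an observation about JavaScript, not about .NET, so treat it as an illustration of the idea rather than of the .NET engine itself.

```javascript
// Look-behind: the inner pattern is matched backward from the end of the
// input, so the right-hand group is greedy first and takes "053".
const behind = /(?<=(\d+)(\d+))$/.exec("1053");

// Ordinary left-to-right matching: the left-hand group is greedy first.
const forward = /^(\d+)(\d+)$/.exec("1053");

console.log(behind.slice(1));  // [ '1', '053' ]
console.log(forward.slice(1)); // [ '105', '3' ]
```

The group numbers still follow the textual order of the parentheses; only the order in which they grab characters is reversed.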
Java implementation
In contrast, Java's regex implementation handles look-behind by reusing the left-to-right matching facilities.
It first analyzes the pattern inside the look-behind for the minimum and maximum length of the pattern. Then, look-behind is implemented by trying to match the pattern inside from left-to-right, starting from (current position - minimum length) to (current position - maximum length).
Is there anything missing? Yes! Since we are matching from left-to-right, we need to make sure that the match ends right at the position before entering the look-behind (current position). In Java, this is implemented by appending a node at the end of the pattern inside look-behind.
This implementation is very inefficient, as there are maximum - minimum + 1 choice points created in the look-behind itself, before we even talk about choice points created by the pattern inside the look-behind.
The look-behind bound check is also inefficient, since it is placed at the end of the pattern, and can't prune choice points that are clearly hopeless (those already far exceeding the current position in the middle of pattern).
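To make the strategy concrete, here is a rough sketch of the idea in JavaScript. It is not Java's actual code: lookBehindMatches is a hypothetical helper, and the minimum and maximum lengths are passed in by hand instead of being derived by pattern analysis. Slicing the input also changes what ^ and \b inside the pattern would see, so this only works for simple inner patterns.

```javascript
// Hypothetical helper: emulates a bounded look-behind at `pos` by trying each
// allowed start offset and matching the inner pattern left-to-right.
function lookBehindMatches(input, pos, innerSource, minLen, maxLen) {
  // The trailing "$" plays the role of the appended node that forces the
  // match to end exactly at the current position.
  const anchored = new RegExp("^(?:" + innerSource + ")$");
  for (let len = minLen; len <= maxLen; len++) {
    const start = pos - len;
    if (start < 0) break; // ran past the beginning of the input
    // One choice point per candidate length: maxLen - minLen + 1 at most.
    if (anchored.test(input.slice(start, pos))) return true;
  }
  return false;
}

// Emulates (?<=t[abc]{1,3}) at position 5 of "xtabc!" (lengths 2 to 4).
console.log(lookBehindMatches("xtabc!", 5, "t[abc]{1,3}", 2, 4)); // true
console.log(lookBehindMatches("xyz", 3, "t[abc]{1,3}", 2, 4));    // false
```

The loop makes the cost visible: every candidate start offset is a separate attempt, before counting any backtracking inside the inner pattern.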
Summary
As you can see, adding support for look-behind is not easy:
The right-to-left approach seems reasonably efficient. However, it requires additional specification on the right-to-left matching behavior on other existing constructs.
The approach of reusing left-to-right matching facilities is complex to specify, and is very inefficient. It also requires the introduction of pattern analysis into the specification, lest performance be thrown out of the window.
(Note that I have not yet covered the behavior when look-behind is used inside look-ahead, and vice-versa. This should also be taken into consideration when defining semantics for look-behind construct).
These technical hurdles are also mentioned by Waldemar Horwat (who wrote the ES3 regex spec) in the mail cited in nils' answer:
No one has yet submitted a well-defined proposal for lookbehinds on the table. Lookbehinds are difficult to translate into the language used by the spec and get quite fuzzy when the order of evaluation of parts of the regexp matters, which is what happens if capturing parentheses are involved. Where do you start looking for the lookbehind? Shortest first, longest first, or reverse string match? Greedy or not? Backtrack into capturing results?

The reason to use "-" instead of "_" in the HTML class and id attributes [duplicate]

In the past I've always used underscores for defining class and id attributes in HTML. Over the last few years I changed over to dashes, mostly to align myself with the trend in the community, not necessarily because it made sense to me.
I've always thought dashes have more drawbacks, and I don't see the benefits:
Code completion & Editing
Most editors treat dashes as word separators, so I can't tab through to the symbol I want. Say the class is "featured-product", I have to auto-complete "featured", enter a hyphen, and complete "product".
With underscores "featured_product" is treated as one word, so it can be filled in one step.
The same applies to navigating through the document. Jumping by words or double-clicking on class names is broken by hyphens.
(More generally, I think of classes and ids as tokens, so it doesn't make sense to me that a token should be so easily splittable on hyphens.)
Ambiguity with arithmetic operator
Using dashes breaks object-property access to form elements in JavaScript. This is only possible with underscores:
form.first_name.value='Stormageddon';
(Admittedly I don't access form elements this way myself, but when deciding on dashes vs underscores as a universal rule, consider that someone might.)
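A minimal sketch of the difference, using a plain object as a stand-in for the form (the control names are hypothetical). With a hyphen, dot access would be parsed as a subtraction, so bracket notation is needed.

```javascript
// Plain object standing in for a form with two named controls.
const form = {
  first_name: { value: "" },
  "first-name": { value: "" },
};

// Underscore name: dot notation works.
form.first_name.value = "Stormageddon";

// Hyphenated name: `form.first-name.value` would parse as
// `form.first - name.value`, so bracket notation is required.
form["first-name"].value = "Stormageddon";

console.log(form.first_name.value, form["first-name"].value);
```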
Languages like Sass (especially throughout the Compass framework) have settled on dashes as a standard, even for variable names. They originally used underscores too. The fact that these are parsed differently strikes me as odd:
$list-item-10
$list-item - 10
Inconsistency with variable naming across languages
Back in the day, I used to write underscored_names for variables in PHP, ruby, HTML/CSS, and JavaScript. This was convenient and consistent, but again in order to "fit in" I now use:
dash-case in HTML/CSS
camelCase in JavaScript
underscore_case in PHP and ruby
This doesn't really bother me too much, but I wonder why these became so misaligned, seemingly on purpose. At least with underscores it was possible to maintain consistency:
var featured_product = $('#featured_product'); // instead of
var featuredProduct = $('#featured-product');
The differences create situations where we have to translate strings unnecessarily, along with the potential for bugs.
So I ask: Why did the community almost universally settle on dashes, and are there any reasons that outweigh underscores?
There is a related question from back around the time this started, but I'm of the opinion that it's not (or shouldn't have been) just a matter of taste. I'd like to understand why we all settled on this convention if it really was just a matter of taste.
Code completion
Whether dash is interpreted as punctuation or as an opaque identifier depends on the editor of choice, I guess. However, as a personal preference, I favor being able to tab between each word in a CSS file and would find it annoying if they were separated with underscore and there were no stops.
Also, using hyphens allows you to take advantage of the |= attribute selector, which matches any element whose attribute value is exactly the given text, or begins with that text immediately followed by a dash:
span[class|="em"] { font-style: italic; }
This would make the following HTML elements have italic font-style:
<span class="em">I'm italic</span>
<span class="em-strong">I'm italic too</span>
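The matching rule can be sketched as a small JavaScript function. matchesDashPrefix is a hypothetical name, and this only mimics the rule on a raw attribute string; it is not how a browser implements selectors.

```javascript
// Mimics [attr|="prefix"]: the whole attribute value must equal the prefix
// or start with the prefix immediately followed by a hyphen.
function matchesDashPrefix(attributeValue, prefix) {
  return attributeValue === prefix || attributeValue.startsWith(prefix + "-");
}

console.log(matchesDashPrefix("em", "em"));             // true
console.log(matchesDashPrefix("em-strong", "em"));      // true
console.log(matchesDashPrefix("emphasis", "em"));       // false
console.log(matchesDashPrefix("note em-strong", "em")); // false
```

Note the last case: the rule applies to the whole attribute value, not to each class in a space-separated list.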
Ambiguity with arithmetic operator
I'd say that access to HTML elements via dot notation in JavaScript is a bug rather than a feature. It's a terrible construct from the early days of terrible JavaScript implementations and isn't really a great practice. For most of the stuff you do with JavaScript these days, you'd want to use CSS Selectors for fetching elements from the DOM anyway, which makes the whole dot notation rather useless. Which one would you prefer?
var firstName = $('#first-name');
var firstName = document.querySelector('#first-name');
var firstName = document.forms[0].first_name;
I find the first two options much more preferable, especially since '#first-name' can be replaced with a JavaScript variable and built dynamically. I also find them more pleasant on the eyes.
The fact that Sass enables arithmetic in its extensions to CSS doesn't really apply to CSS itself, but I do understand (and embrace) the fact that Sass follows the language style of CSS (except for the $ prefix of variables, which of course should have been #). If Sass documents are to look and feel like CSS documents, they need to follow the same style as CSS, which uses dash as a delimiter. In CSS3, arithmetic is limited to the calc function, which goes to show that in CSS itself, this isn't an issue.
Inconsistency with variable naming across languages
All languages, whether markup languages, programming languages, styling languages or scripting languages, have their own style. You will find this within sub-languages of language groups like XML, where e.g. XSLT uses lower-case with hyphen delimiters and XML Schema uses camel-casing.
In general, you will find that adopting the style that feels and looks most "native" to the language you're writing in is better than trying to shoe-horn your own style into every different language. Since you can't avoid having to use native libraries and language constructs, your style will be "polluted" by the native style whether you like it or not, so it's pretty much futile to even try.
My advice is to not find a favorite style across languages, but instead make yourself at home within each language and learn to love all of its quirks. One of CSS' quirks is that keywords and identifiers are written in lowercase and separated by hyphens. Personally, I find this very visually appealing and think it fits in with the all-lowercase (although no-hyphen) HTML.
Perhaps a key reason why the HTML/CSS community aligned itself with dashes instead of underscores is due to historical deficiencies in specs and browser implementations.
From a Mozilla doc published March 2001 (https://developer.mozilla.org/en-US/docs/Underscores_in_class_and_ID_Names):
The CSS1 specification, published in its final form in 1996, did not
allow for the use of underscores in class and ID names unless they
were "escaped." An escaped underscore would look something like this:
p.urgent\_note {color: maroon;}
This was not well supported by browsers at the time, however, and the
practice has never caught on. CSS2, published in 1998, also forbade
the use of underscores in class and ID names. However, errata to the
specification published in early 2001 made underscores legal for the
first time. This unfortunately complicated an already complex
landscape.
I generally like underscores but the backslash just makes it ugly beyond hope, not to mention the scarce support at the time. I can understand why developers avoided it like the plague. Of course, we don't need the backslash nowadays, but the dash-etiquette has already been firmly established.
I don't think anyone can answer this definitively, but here are my educated guesses:
Underscores require hitting the Shift key, and are therefore harder to type.
CSS selectors which are part of the official CSS specifications use dashes (such as pseudo-classes like :first-child and pseudo-elements :first-line), not underscores. Same thing for properties, e.g. text-decoration, background-color, etc. Programmers are creatures of habit. It makes sense that they would follow the standard's style if there's no good reason not to.
This one is further out on the ledge, but... Whether it's myth or fact, there is a longstanding idea that Google treats words separated by underscores as a single word, and words separated by dashes as separate words. (Matt Cutts on Underscores vs. Dashes.) For this reason, I know that my preference now for creating page URLs is to use-words-with-dashes, and for me at least, this has bled into my naming conventions for other things, like CSS selectors.
There are many reasons, but one of the most important is maintaining consistency.
I think this article explains it comprehensively.
CSS is a hyphen-delimited syntax. By this I mean we write things like font-size, line-height, border-bottom etc.
So:
You just shouldn’t mix syntaxes: it’s inconsistent.
There's been a clear uptick in hyphen-separated, whole-word segments of URLs over recent years. This is encouraged by SEO best practices. Google explicitly "recommend that you use hyphens (-) instead of underscores (_) in your URLs": http://www.google.com/support/webmasters/bin/answer.py?answer=76329.
As noted, different conventions have prevailed at different times in different contexts, but they typically are not a formal part of any protocol or framework.
My hypothesis, then, is that Google's position anchors this pattern within one key context (SEO), and the trend to use this pattern in class, id, and attribute names is simply the herd moving slowly in this general direction.
I think it's a programmer-dependent thing. Some people like to use dashes, others use underscores.
I personally use underscores (_) because I use it in other places too. Such as:
- JavaScript variables (var my_name);
- My controller actions (public function view_detail)
Another reason that I use underscores is that in most IDEs two words separated by an underscore are considered one word (and can be selected with a double-click).
From a refactoring point of view, suppose you want to rename only btn to bt:
Case: btn_pink
Whole-word search for btn
Result: btn
Case: btn-pink
Whole-word search for btn
Result: btn | btn-pink
Case: btn-pink
Regexp search for btn
\bbtn\b(?!-) (too hard to type)
Result: btn
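The cases above can be checked with JavaScript regexes, where \b behaves like a typical editor's whole-word search: an underscore is a word character, a hyphen is not. A small sketch:

```javascript
const wholeWord = /\bbtn\b/;      // plain whole-word search
const strict = /\bbtn\b(?!-)/;    // whole word, not followed by a hyphen

console.log(wholeWord.test("btn"));      // true
console.log(wholeWord.test("btn_pink")); // false: "_" is a word character
console.log(wholeWord.test("btn-pink")); // true: "-" ends the word
console.log(strict.test("btn-pink"));    // false
console.log(strict.test("btn"));         // true
```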

Searching all emails over large document [duplicate]

I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit            Shorthand
(something){0,1}    (something)?
(something){1}      (something)
(something){0,}     (something)*
(something){1,}     (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
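A quick spot check in JavaScript, as a sketch rather than a proof: it only compares the two forms on a handful of sample inputs chosen for illustration, plus one reluctant (lazy) pair. JavaScript has no possessive quantifiers, so those are not covered.

```javascript
// Each pair: [explicit form, shorthand form].
const pairs = [
  [/^(?:ab){0,1}$/, /^(?:ab)?$/],
  [/^(?:ab){1}$/, /^(?:ab)$/],
  [/^(?:ab){0,}$/, /^(?:ab)*$/],
  [/^(?:ab){1,}$/, /^(?:ab)+$/],
];
const inputs = ["", "ab", "abab", "aba", "ababab"];

// True when both forms accept and reject exactly the same sample inputs.
const allAgree = pairs.every(([explicit, shorthand]) =>
  inputs.every((s) => explicit.test(s) === shorthand.test(s))
);

// Reluctant modifiers behave the same way on both forms.
const lazyExplicit = "aaa".match(/a{1,}?/)[0];
const lazyShorthand = "aaa".match(/a+?/)[0];

console.log(allAgree, lazyExplicit, lazyShorthand); // true "a" "a"
```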
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern and that's when I would suggest changing those occurrences to shorthand notation to save space and, IMO, improve readability.
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
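Here is how that backreference case plays out in JavaScript, as a sketch assuming a non-Unicode-mode regex (no u flag) with a single capturing group. The non-capturing group is one of the "other ways" to do it.

```javascript
// Intended meaning: group 1, the same text again, then a literal "0".
const isolated = /(a)\1{1}0/;   // {1} separates \1 from the digit
const grouped = /(a)(?:\1)0/;   // a non-capturing group does the same job
// Without a separator, "\10" is not read as \1 followed by "0".
const ambiguous = /(a)\10/;

console.log(isolated.test("aa0"));  // true
console.log(grouped.test("aa0"));   // true
console.log(ambiguous.test("aa0")); // false
```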
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"

How to check whether 2 regexp are the same accounting syntax differences?

I'm refactoring a rather large RegExp into a function that returns a RegExp. As a backward-compatibility test, I compared the .source of the returned RegExp with the .source of the old RegExp:
getRegExp(/* in the case requiring backward compatibility there's no arguments */)
.source == oldRegExp.source
However, I've noticed that the old RegExp contains various excessive backslashes like [\.\w] instead of [.\w]. I'd like to refactor such bits, but there are a number of them and it would be nice to have a similar check (that backward compatibility is not broken). The problem is, /[\.\w]/.source != /[.\w]/.source. And identifying which backslashes may be removed automatically is not trivial (\. and . are not the same outside [...] and may differ in some other cases).
Are you aware of a reasonably simple way to do so? It seems this can only be done by actually parsing the .source (compare the example above with /\[\.\w]\/ and /\[.\w]\/), but maybe I'm missing some trick that uses the browser's built-in properties/methods. The point is, '\"' == '"' is true, so strings defined with these different syntaxes are stored as "normalized" values ("); I wonder if such a "normalized" pattern is available for a RegExp.
Sadly, comparing two regular expressions to see if they're the same is exactly the same as comparing any other two pieces of code, i.e. hard.
The only real way I know of to do this is to create a suite of tests, each one targeting a specific aspect of the regular expression and verifying that it works properly. This is not an easy process: regular expressions are subtle and complex, with a lot of potential for unrealized side effects. I recently had to fix some defects in a regex-based address parser and it took about a thousand unit tests before I was satisfied with my coverage... but then as soon as I started to change the regex MY TESTS CAUGHT STUFF CONSTANTLY!!
Unit testing sucks and it's just tiring and not fun, but for almost any piece of logic it has real value, and when using powerful tools like regex, I would say it's absolutely crucial.
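A minimal sketch of such a check in JavaScript. sameBehavior is a hypothetical helper and the sample inputs are made up; it can only show that two patterns agree on the samples you give it, never that they are equivalent in general.

```javascript
// Returns true when both regexes produce the same match (position, text and
// capture groups) on every sample string.
function sameBehavior(a, b, samples) {
  return samples.every((s) => {
    // Fresh copies so a stale lastIndex on g/y regexes cannot skew the result.
    const ra = new RegExp(a.source, a.flags).exec(s);
    const rb = new RegExp(b.source, b.flags).exec(s);
    if (ra === null || rb === null) return ra === rb;
    return (
      ra.index === rb.index &&
      ra.length === rb.length &&
      ra.every((group, i) => group === rb[i])
    );
  });
}

// The sources differ, but the behavior on these samples is the same.
console.log(/[\.\w]+/.source === /[.\w]+/.source);                   // false
console.log(sameBehavior(/[\.\w]+/, /[.\w]+/, ["a.b", "...", "x y", ""])); // true
// Outside a character class the backslash matters.
console.log(sameBehavior(/\./, /./, ["a"]));                         // false
```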

need a JavaScript Regex that requires upper or lowercase letters

I have a regex that right now only allows lowercase letters, I need one that requires either lowercase or uppercase letters:
/(?=.*[a-z])/
You Can’t Get There from Here
I have a regex that right now only allows lowercase letters, I need one that requires either lowercase or uppercase letters: /(?=.*[a-z])/
Unfortunately, it is utterly impossible to do this correctly using Javascript! Read this flavor comparison’s ECMA column for all of what Javascript cannot do.
Theory vs Practice
The proper pattern for lowercase is the standard Unicode derived binary property \p{Lowercase}, and the proper pattern for uppercase is similarly \p{Uppercase}. These are normative properties that sometimes include non-letters in them under certain exotic circumstances.
Using just General Category properties, you can have \p{Ll} for Lowercase_Letter, \p{Lu} for Uppercase_Letter, and \p{Lt} for titlecase letter. (Remember, there are three cases in Unicode, not two.) There is a standard alias \p{LC} which means [\p{Lu}\p{Lt}\p{Ll}].
If you want a letter that is not a lowercase letter, you could use (?=\P{Ll})\pL. Written in longhand that’s (?=\P{Lowercase_Letter})\p{Letter}. Again, these miss some of the Other_Lowercase code points that \p{Lowercase} recognizes. I must again stress that the Lowercase property is a superset of the Lowercase_Letter property.
Remember the previous paragraph, swapping in upper everywhere I have written lower, and you get the same thing for the capitals.
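For readers who want to try these patterns: the sketch below assumes an engine that supports Unicode property escapes with the u flag (added to the language in ES2018, after this answer was written). Note that JavaScript requires the braces, so \pL has to be written \p{L}.

```javascript
const lower = /\p{Lowercase}/u;             // derived binary property
const upper = /\p{Uppercase}/u;
const lowerLetter = /\p{Ll}/u;              // General Category only
const letterNotLower = /(?=\P{Ll})\p{L}/u;  // a letter that is not Ll

console.log(lower.test("\u00E9"));          // "é": true
console.log(upper.test("\u00C9"));          // "É": true
console.log(letterNotLower.test("\u01C5")); // titlecase "ǅ": true
console.log(letterNotLower.test("abc"));    // false

// Lowercase is a superset of Lowercase_Letter: U+2170 (small roman numeral
// one) is a number, not a letter, yet it has the Lowercase property.
console.log(lower.test("\u2170"), lowerLetter.test("\u2170")); // true false
```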
Possible Platforms
Because access to these essential properties is the minimal level of critical functionality necessary for Unicode regular expressions, some versions of Javascript implement them in just the way I have written them above. However, the standard for Javascript still does not require them, so you cannot in general count on them. This means that it is impossible to do this correctly under all implementations of Javascript.
Languages in which it is possible to do what you want done minimally include:
C♯ and Java (both only General Categories)
Ruby if and only if v1.9 or better (only binary properties, including General Categories)
PHP and PCRE (only General Category and Script properties plus a couple extras)
ICU’s C++ library and Perl, which both support all Unicode properties
Of those listed above, only the last line’s (ICU and Perl) strictly and completely meet all Level 1 compliance requirements (plus some Levels 2 and 3) for the proper handling of Unicode in regexes. However, all of those I’ve listed in the previous paragraph’s bullets can easily handle most, and quite probably all, of what you need.
Javascript is not amongst those, however. Your version might, though, if you are very lucky and never have to run on a standard-only Javascript platform.
Summary
So very sadly, you cannot really use Javascript regexes for Unicode work unless you have a non-standard extension. Some people do, but most do not. If you do not, you may have to use a different platform until the relevant ECMA standard catches up with the 21st century (Unicode 3.1 came out a decade ago!!).
If anyone knows of a Javascript library that implements the Level 1 requirements of UTS#18 on Unicode Regular Expressions including both RL1.2 “Properties” and RL1.2a “Annex C: Compatibility Properties”, please chime in.
Not sure if you mean mixed-case, or strictly lowercase plus strictly uppercase.
Here's the mixed-case version:
/^[a-zA-Z]+$/
And the strictly one-or-the-other version:
/^([a-z]+|[A-Z]+)$/
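A short sketch showing how the two patterns differ on a few made-up inputs (ASCII letters only, as in the patterns themselves):

```javascript
const mixed = /^[a-zA-Z]+$/;          // any mix of upper and lower case
const oneCase = /^([a-z]+|[A-Z]+)$/;  // all lower OR all upper

console.log(mixed.test("HelloWorld"));   // true
console.log(oneCase.test("HelloWorld")); // false
console.log(oneCase.test("HELLO"));      // true
console.log(mixed.test("abc1"));         // false: digits are rejected
```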
Try /(?=.*[a-z])/i
Note the i at the end, this makes the expression case insensitive.
Or add an uppercase range to your regex:
/(?=.*[a-zA-Z])/
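Both variants can be compared against the original pattern like this (a small sketch with made-up inputs; the lookahead only requires that at least one ASCII letter appears somewhere):

```javascript
const original = /(?=.*[a-z])/;
const caseInsensitive = /(?=.*[a-z])/i;
const explicitRange = /(?=.*[a-zA-Z])/;

console.log(original.test("ABC"));        // false: no lowercase letter
console.log(caseInsensitive.test("ABC")); // true
console.log(explicitRange.test("ABC"));   // true
console.log(explicitRange.test("123"));   // false: no letter at all
```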
