I have a regex that right now only allows lowercase letters, I need one that requires either lowercase or uppercase letters:
/(?=.*[a-z])/
You Can’t Get There from Here
I have a regex that right now only allows lowercase letters, I need one that requires either lowercase or uppercase letters: /(?=.*[a-z])/
Unfortunately, it is utterly impossible to do this correctly using Javascript! Read this flavor comparison’s ECMA column for all of what Javascript cannot do.
Theory vs Practice
The proper pattern for lowercase is the standard Unicode derived binary property \p{Lowercase}, and the proper pattern for uppercase is similarly \p{Uppercase}. These are normative properties that sometimes include non-letters in them under certain exotic circumstances.
Using just General Category properties, you can have \p{Ll} for Lowercase_Letter, \p{Lu} for Uppercase_Letter, and \p{Lt} for Titlecase_Letter. (Remember, there are three cases in Unicode, not two.) There is a standard alias \p{LC} which means [\p{Lu}\p{Lt}\p{Ll}].
If you want a letter that is not a lowercase letter, you could use (?=\P{Ll})\pL. Written in longhand that's (?=\P{Lowercase_Letter})\p{Letter}. Again, these miss some of the Other_Lowercase code points that \p{Lowercase} recognizes. I must again stress that the Lowercase property is a superset of the Lowercase_Letter property.
Reread the previous paragraph, swapping in upper everywhere I have written lower, and you get the same thing for the capitals.
Possible Platforms
Because access to these essential properties is the minimal level of critical functionality necessary for Unicode regular expressions, some versions of Javascript implement them in just the way I have written them above. However, the standard for Javascript still does not require them, so you cannot in general count on them. This means that it is impossible to do this correctly under all implementations of Javascript.
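For what it is worth, here is a sketch of what the patterns above would look like on an engine that does expose Unicode property escapes; whether yours does is exactly the problem, and the test string is my own illustration:
// Assumes an engine with Unicode property escapes behind the /u flag.
const hasLowercase = /\p{Lowercase}/u;        // binary property, a superset of \p{Ll}
const hasCased     = /[\p{Lu}\p{Lt}\p{Ll}]/u; // i.e. \p{LC}: upper, title, or lower
console.log(hasLowercase.test("İSTANBUL"));   // false: no lowercase code points
console.log(hasCased.test("İSTANBUL"));       // true: uppercase letters are cased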
Languages in which it is possible to do what you want done minimally include:
C♯ and Java (both only General Categories)
Ruby if and only if v1.9 or better (only binary properties, including General Categories)
PHP and PCRE (only General Category and Script properties plus a couple extras)
ICU’s C++ library and Perl, which both support all Unicode properties
Of those listed above, only the entries on the last line, ICU and Perl, strictly and completely meet all Level 1 compliance requirements (plus some of Levels 2 and 3) for the proper handling of Unicode in regexes. However, all of those listed in the bullets above can easily handle most, and quite probably all, of what you need.
Javascript is not amongst those, however. Your version might, though, if you are very lucky and never have to run on a standard-only Javascript platform.
Summary
So very sadly, you cannot really use Javascript regexes for Unicode work unless you have a non-standard extension. Some people do, but most do not. If you do not, you may have to use a different platform until the relevant ECMA standard catches up with the 21st century (Unicode 3.1 came out a decade ago!!).
If anyone knows of a Javascript library that implements the Level 1 requirements of UTS#18 on Unicode Regular Expressions including both RL1.2 “Properties” and RL1.2a “Annex C: Compatibility Properties”, please chime in.
Not sure if you mean mixed-case, or strictly lowercase plus strictly uppercase.
Here's the mixed-case version:
/^[a-zA-Z]+$/
And the strictly one-or-the-other version:
/^([a-z]+|[A-Z]+)$/
Try /(?=.*[a-z])/i
Note the i at the end; it makes the expression case-insensitive.
Or add an uppercase range to your regex:
/(?=.*[a-zA-Z])/
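For reference, a quick check of the three suggestions against a couple of sample strings (the values are only illustrative):
console.log(/^[a-zA-Z]+$/.test("AbCdeF"));        // true: mixed case allowed
console.log(/^([a-z]+|[A-Z]+)$/.test("AbCdeF"));  // false: must be all one case
console.log(/(?=.*[a-z])/i.test("ABC"));          // true: with /i, [a-z] matches uppercase too
console.log(/(?=.*[a-zA-Z])/.test("123"));        // false: no letter anywhere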
Related
In the past I've always used underscores for defining class and id attributes in HTML. Over the last few years I changed over to dashes, mostly to align myself with the trend in the community, not necessarily because it made sense to me.
I've always thought dashes have more drawbacks, and I don't see the benefits:
Code completion & Editing
Most editors treat dashes as word separators, so I can't tab through to the symbol I want. Say the class is "featured-product": I have to auto-complete "featured", enter a hyphen, and then complete "product".
With underscores "featured_product" is treated as one word, so it can be filled in one step.
The same applies to navigating through the document. Jumping by words or double-clicking on class names is broken by hyphens.
(More generally, I think of classes and ids as tokens, so it doesn't make sense to me that a token should be so easily splittable on hyphens.)
Ambiguity with arithmetic operator
Using dashes breaks object-property access to form elements in JavaScript. This is only possible with underscores:
form.first_name.value='Stormageddon';
(Admittedly I don't access form elements this way myself, but when deciding on dashes vs underscores as a universal rule, consider that someone might.)
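A hedged illustration of that breakage (the field name is my own example): with a dash in the name, dot access no longer parses as a single property lookup and you have to fall back to bracket notation.
const form = document.forms[0];             // assumes the page has a form
form['first-name'].value = 'Stormageddon';  // bracket notation still works with a dash
// form.first-name.value = 'Stormageddon';  // parsed as (form.first) - (name.value): not a valid assignment target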
Languages like Sass (especially throughout the Compass framework) have settled on dashes as a standard, even for variable names, although they used underscores in the beginning too. The fact that the following two are parsed differently strikes me as odd:
$list-item-10
$list-item - 10
Inconsistency with variable naming across languages
Back in the day, I used to write underscored_names for variables in PHP, ruby, HTML/CSS, and JavaScript. This was convenient and consistent, but again in order to "fit in" I now use:
dash-case in HTML/CSS
camelCase in JavaScript
underscore_case in PHP and ruby
This doesn't really bother me too much, but I wonder why these became so misaligned, seemingly on purpose. At least with underscores it was possible to maintain consistency:
var featured_product = $('#featured_product'); // instead of
var featuredProduct = $('#featured-product');
The differences create situations where we have to translate strings unnecessarily, along with the potential for bugs.
So I ask: Why did the community almost universally settle on dashes, and are there any reasons that outweigh underscores?
There is a related question from back around the time this started, but I'm of the opinion that it's not (or shouldn't have been) just a matter of taste. I'd like to understand why we all settled on this convention if it really was just a matter of taste.
Code completion
Whether dash is interpreted as punctuation or as an opaque identifier depends on the editor of choice, I guess. However, as a personal preference, I favor being able to tab between each word in a CSS file and would find it annoying if they were separated with underscore and there were no stops.
Also, using hyphens allows you to take advantage of the |= attribute selector, which matches an attribute value that is exactly the given text or that begins with it followed by a dash:
span[class|="em"] { font-style: italic; }
This would make the following HTML elements have italic font-style:
<span class="em">I'm italic</span>
<span class="em-strong">I'm italic too</span>
Ambiguity with arithmetic operator
I'd say that access to HTML elements via dot notation in JavaScript is a bug rather than a feature. It's a terrible construct from the early days of terrible JavaScript implementations and isn't really a great practice. For most of the stuff you do with JavaScript these days, you'd want to use CSS Selectors for fetching elements from the DOM anyway, which makes the whole dot notation rather useless. Which one would you prefer?
var firstName = $('#first-name');
var firstName = document.querySelector('#first-name');
var firstName = document.forms[0].first_name;
I find the two first options much more preferable, especially since '#first-name' can be replaced with a JavaScript variable and built dynamically. I also find them more pleasant on the eyes.
The fact that Sass enables arithmetic in its extensions to CSS doesn't really apply to CSS itself, but I do understand (and embrace) the fact that Sass follows the language style of CSS (except for the $ prefix of variables, which of course should have been #). If Sass documents are to look and feel like CSS documents, they need to follow the same style as CSS, which uses dash as a delimiter. In CSS3, arithmetic is limited to the calc function, which goes to show that in CSS itself, this isn't an issue.
Inconsistency with variable naming across languages
All languages, whether markup languages, programming languages, styling languages or scripting languages, have their own style. You will find this within sub-languages of language groups like XML, where e.g. XSLT uses lower-case with hyphen delimiters and XML Schema uses camel-casing.
In general, you will find that adopting the style that feels and looks most "native" to the language you're writing in is better than trying to shoe-horn your own style into every different language. Since you can't avoid having to use native libraries and language constructs, your style will be "polluted" by the native style whether you like it or not, so it's pretty much futile to even try.
My advice is to not find a favorite style across languages, but instead make yourself at home within each language and learn to love all of its quirks. One of CSS' quirks is that keywords and identifiers are written in lowercase and separated by hyphens. Personally, I find this very visually appealing and think it fits in with the all-lowercase (although no-hyphen) HTML.
Perhaps a key reason why the HTML/CSS community aligned itself with dashes instead of underscores is due to historical deficiencies in specs and browser implementations.
From a Mozilla doc published in March 2001: https://developer.mozilla.org/en-US/docs/Underscores_in_class_and_ID_Names
The CSS1 specification, published in its final form in 1996, did not allow for the use of underscores in class and ID names unless they were "escaped." An escaped underscore would look something like this:
p.urgent\_note {color: maroon;}
This was not well supported by browsers at the time, however, and the practice has never caught on. CSS2, published in 1998, also forbade the use of underscores in class and ID names. However, errata to the specification published in early 2001 made underscores legal for the first time. This unfortunately complicated an already complex landscape.
I generally like underscores but the backslash just makes it ugly beyond hope, not to mention the scarce support at the time. I can understand why developers avoided it like the plague. Of course, we don't need the backslash nowadays, but the dash-etiquette has already been firmly established.
I don't think anyone can answer this definitively, but here are my educated guesses:
Underscores require hitting the Shift key, and are therefore harder to type.
CSS selectors which are part of the official CSS specifications use dashes (such as pseudo-classes like :first-child and pseudo-elements :first-line), not underscores. Same thing for properties, e.g. text-decoration, background-color, etc. Programmers are creatures of habit. It makes sense that they would follow the standard's style if there's no good reason not to.
This one is further out on the ledge, but... Whether it's myth or fact, there is a longstanding idea that Google treats words separated by underscores as a single word, and words separated by dashes as separate words. (Matt Cutts on Underscores vs. Dashes.) For this reason, I know that my preference now for creating page URLs is to use-words-with-dashes, and for me at least, this has bled into my naming conventions for other things, like CSS selectors.
There are many reasons, but one of the most important is maintaining consistency.
I think this article explains it comprehensively.
CSS is a hyphen-delimited syntax. By this I mean we write things like font-size, line-height, border-bottom etc.
So:
You just shouldn’t mix syntaxes: it’s inconsistent.
There's been a clear uptick in hyphen-separated, whole-word segments of URLs over recent years. This is encouraged by SEO best practices. Google explicitly "recommend that you use hyphens (-) instead of underscores (_) in your URLs": http://www.google.com/support/webmasters/bin/answer.py?answer=76329.
As noted, different conventions have prevailed at different times in different contexts, but they typically are not a formal part of any protocol or framework.
My hypothesis, then, is that Google's position anchors this pattern within one key context (SEO), and the trend to use this pattern in class, id, and attribute names is simply the herd moving slowly in this general direction.
I think it's a programmer-dependent thing. Some like to use dashes, others use underscores.
I personally use underscores (_) because I use them in other places too, such as:
- JavaScript variables (var my_name);
- My controller actions (public function view_detail)
Another reason I use underscores is that in most IDEs two words separated by an underscore are treated as one word (and can be selected with a double-click).
Consider the case of refactoring only btn to bt:
With btn_pink, a whole-word search for btn finds just the standalone btn, because btn_pink counts as a single word.
With btn-pink, the same whole-word search finds both btn and the btn inside btn-pink; excluding the latter means reaching for a regex like \bbtn\b(?!-), which is harder to type. (See the sketch below.)
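A hedged JavaScript illustration of that whole-word search (the class names are just examples):
const names = ["btn", "btn_pink", "btn-pink"];
console.log(names.filter(n => /\bbtn\b/.test(n)));      // ["btn", "btn-pink"]: the dash creates a word boundary
console.log(names.filter(n => /\bbtn\b(?!-)/.test(n))); // ["btn"]: the lookahead excludes btn-pink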
I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit            Shorthand
(something){0,1}    (something)?
(something){1}      (something)
(something){0,}     (something)*
(something){1,}     (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax, but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
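A minimal sketch of that point (the sample input is only illustrative):
// Both require exactly three digits; {3} states the count explicitly,
// so changing it later means editing one number rather than re-counting atoms.
const spelledOut = /^\d\d\d$/;
const counted    = /^\d{3}$/;
console.log(spelledOut.test("042"));  // true
console.log(counted.test("042"));     // true
console.log(counted.test("0420"));    // false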
If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern, and that's when I would suggest changing those occurrences to shorthand notation to save space and, IMO, improve readability.
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
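That isolation trick looks like this (a contrived example; how a bare \10 is read varies by flavor):
// Group 1 captures "a"; \1{1} repeats it exactly once, then a literal "0" follows.
console.log(/(a)\1{1}0/.test("aa0")); // true
// Without the {1}, "\10" might be taken as a reference to group 10
// or as an octal escape, depending on the regex flavor.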
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"
Recently I realized (by some embarrassment) that regex lookbehind assertions were not possible in Javascript.
What is the (factual) reason for the absence for this assertion so seemingly common?
I realize there are perhaps alternate ways to achieve the same thing, but is it the basic semantics at work that forbid the functionality, or what exactly?
It also seems that some regex testing tools out there that generate Javascript code from regex patterns ignore this fact, which strikes me as a bit odd.
Today
Lookbehind is now an official part of the ES 2018 specification. Axel Rauschmayer gives a good introduction in his blog post.
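For instance, on an engine that has shipped the feature, a positive look-behind now works as you would expect (the string and pattern are just an illustration):
// Match the amount after a dollar sign without including the sign itself.
const m = "$42.00".match(/(?<=\$)\d+(?:\.\d+)?/);
console.log(m && m[0]); // "42.00"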
History
It looks like at the time, Brendan Eich wasn't aware of its existence (because Netscape's regex support was based on an older version of Perl):
This was 1998, Netscape 4 work I did in '97 was based on Perl 4(!), but we proposed to ECMA TC39 TG1 (the JS group -- things were different then, including capitalization) something based on Perl 5. We didn't get everything, and we had to rationalize some obvious quirks.
I don't remember lookbehind (which emerged in Perl 5.005 in July '98) being left out on purpose. Waldemar may recall more, I'd handed him the JS keys inside netscape.com to go do mozilla.org.
If you are game to write a proposal or mini-spec (in the style of ES5 even), let me know. I'll chat with other TC39'ers next week about this.
/be
There have been a bunch of different discussions on the mailing list with attempts to include it, but it still seems to be a pretty complex feature performance-wise, because EcmaScript Regular Expressions are backtracking-based and backtracking is needed in lookbehind when working with capturing groups. This can lead to problems such as catastrophic backtracking when used incorrectly.
At some point it was suggested for ES6/ES 2015, but it never made the draft, let alone the specification. In the last post in the discussion, it seems that nobody took up the task of implementing it. If anybody feels called to write an implementation, they can sign up for the ES Discuss list and propose it.
Update May 2015:
In May 2015, Nozomu Katō proposed an ES7 look-behind implementation.
Update September 2015:
Regex Look-behind was added as a stage 0 proposal.
Update May 2017:
The proposal is now at stage 3. This means that at least two browsers now need to implement it for it to become part of the next EcmaScript standard. As martixy mentioned in the comments, Chrome has implemented it behind the JS experimental flag.
To state the conclusion up front: I think look-behind is not implemented in JavaScript because no one has pinned down how it should behave, and existing implementations show that adding support for look-behind is rather complex.
JavaScript/ECMAScript is different from other languages in the sense that the specification includes an abstract implementation of the regex engine, while most other languages stop short at describing the behavior of each piece of regex syntax, with scant description of how the different tokens interact with each other.
Look-ahead? Easy to implement
The implementation of look-ahead is quite straightforward. You only need to treat the pattern inside the look-ahead in the same manner as the pattern outside it and perform a left-to-right match as usual, except that after the look-ahead succeeds 1) the current position is restored to what it was before entering the look-ahead, and 2) choice points created inside the look-ahead are discarded.
There is no limit to what can be included inside look-ahead, since it is a very simple extension to the existing natural left-to-right matching facilities.
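A quick illustration of point 1), using an ordinary JavaScript regex (the string is my own example):
// The lookahead checks for "bar" without consuming it, so "bar" can still be matched afterwards.
console.log("foobar".match(/foo(?=bar)bar/)[0]); // "foobar"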
Look-behind? Not so easy
On the other hand, the implementation of look-behind is not as straightforward.
Imagine how you would implement the following look-behind construct:
(?<=fixed-string)
(?<=a|fixed|string)
(?<=t[abc]{1,3})
(?<=(abc){2,6})
(?<=^.*abc.*)
(?<=\G"[^"]+");
(?<=^(.....|.......)+)
\b(\w+)\b(?<!\b\1\b.*\1)
Apart from the basic case (?<=fixed-string), which any look-behind implementation must support, (?<=a|fixed|string) is a highly desirable case to support.
Different regex engines have varying levels of support for the regexes above.
Let us look at how they are implemented in .NET and Java. (These are the two flavors whose look-behind behavior I have studied.)
.NET implementation
In Microsoft's .NET implementation, all of the regexes above are valid, since .NET implements look-behind by using right-to-left mode, with the starting offset at the current position. The look-behind construct doesn't generate any choice point by itself.
However, if you use capturing groups inside the look-behind, it starts to get confusing, since the atoms in the patterns are interpreted from right-to-left, as demonstrated in this post. This is the disadvantage of this method: you would need to wrap your mind to think right-to-left when writing a look-behind.
Java implementation
In contrast, Java regex implementation implements look-behind by reusing the left-to-right matching facilities.
It first analyzes the pattern inside the look-behind for the minimum and maximum length of the pattern. Then, look-behind is implemented by trying to match the pattern inside from left-to-right, starting from (current position - minimum length) to (current position - maximum length).
Is there anything missing? Yes! Since we are matching from left-to-right, we need to make sure that the match ends exactly at the position where we entered the look-behind (the current position). In Java, this is implemented by appending a node to the end of the pattern inside the look-behind which checks that condition.
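A rough sketch of that strategy in plain JavaScript (a hypothetical helper, purely illustrative; no engine is actually written this way):
// Emulate a look-behind by retrying the inner pattern left-to-right at every
// candidate start, requiring the match to end exactly at pos.
function lookbehindMatches(inner, text, pos, minLen, maxLen) {
  const anchored = new RegExp(inner.source + "$"); // the appended end-of-slice check
  for (let len = minLen; len <= maxLen && len <= pos; len++) {
    if (anchored.test(text.slice(pos - len, pos))) return true; // one choice point per length
  }
  return false;
}
// Does "(?<=foo)" hold at position 3 of "foobar"?
console.log(lookbehindMatches(/foo/, "foobar", 3, 3, 3)); // true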
This implementation is very inefficient, as there are maximum - minimum + 1 choice points created in the look-behind itself, before we even talk about choice points created by the pattern inside the look-behind.
The look-behind bound check is also inefficient, since it is placed at the end of the pattern, and can't prune choice points that are clearly hopeless (those already far exceeding the current position in the middle of pattern).
Summary
As you can see, adding support for look-behind is not easy:
The right-to-left approach seems reasonably efficient. However, it requires additional specification on the right-to-left matching behavior on other existing constructs.
The approach to reuse left-to-right matching facilities is complex to specify, and is very inefficient. It also requires the introduction of pattern analysis into the specification, lest the performance is thrown out of the window.
(Note that I have not yet covered the behavior when look-behind is used inside look-ahead, and vice-versa. This should also be taken into consideration when defining semantics for look-behind construct).
These technical hurdles are also mentioned by Waldemar Horwat (who wrote the ES3 regex spec) in the mail cited in nils' answer:
No one has yet submitted a well-defined proposal for lookbehinds on the table. Lookbehinds are difficult to translate into the language used by the spec and get quite fuzzy when the order of evaluation of parts of the regexp matters, which is what happens if capturing parentheses are involved. Where do you start looking for the lookbehind? Shortest first, longest first, or reverse string match? Greedy or not? Backtrack into capturing results?
Is there any way to disable all symbols, punctuation, block elements, geometric shapes and dingbats such as these:
✁ ✂ ✃ ✄ ✆ ✇ ✈ ✉ ✌ ✍ ✎ ✏ ✐ ✑ ✒ ✓ ✔ ✕ ⟻ ⟼ ⟽ ⟾ ⟿ ⟻ ⟼ ⟽ ⟾ ⟿ ▚ ▛ ▜ ▝ ▞ ▟
without writing all of them down in the Regular Expression pattern, while enabling all other normal language characters such as Chinese, Arabic, etc., like these:
文化中国 الجزيرة نت
?
I'm building a javascript validation function and my real problem is that I can't use:
[a-zA-Z0-9]
Because this ignores a lot of languages too, not just the symbols.
The Unicode standard divides up all the possible characters into code charts. Each code chart contains related characters. If you want to exclude (or include) only certain classes of characters, you will have to make a suitable list of exclusions (or inclusions). Unicode is big, so this might be a lot of work.
Not really.
JavaScript doesn't support Unicode Character Properties. The closest you'll get is excluding ranges by Unicode code point as Greg Hewgill suggested.
For example, to match all of the characters under Mathematical Symbols:
/[\u2190-\u259F]/
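Used as a test it might look like this (the sample characters are my own illustration; "▚" is U+259A, inside that range, while "文" is not):
const symbolish = /[\u2190-\u259F]/;
console.log(symbolish.test("▚"));  // true: falls inside the excluded range
console.log(symbolish.test("文")); // false: an ordinary CJK character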
This depends on your regex dialect. Unfortunately, probably most existing JavaScript engines don't support Unicode character classes.
In regex engines such as the one in (recent) Perl or .Net, Unicode character classes can be referenced.
\p{L}: any kind of letter from any language.
\p{N}: any number symbol from any language (including, as I recall, the Indian and Arabic and CJK number glyphs).
Because Unicode supports composed and decomposed glyphs, you may run into certain complexities: namely, if only decomposed forms exist, it's possible that you might accidentally exclude some diacritic marks in your matching pattern, and you may need to explicitly allow glyphs of the type Mark. You can mitigate this somewhat by using, if I recall correctly, a string that has been normalized using kC normalization (only for characters that have a composed form). In environments that support Unicode well, there's usually a function that allows you to normalize Unicode strings fairly easily (true in Java and .Net, at least).
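Where a normalization function is available (newer JavaScript engines expose String.prototype.normalize), the idea looks roughly like this; the allowed range below is a made-up example, not a recommendation:
// Hypothetical validator: ASCII letters plus precomposed Latin letters, no combining marks.
const allowed = /^[A-Za-z\u00C0-\u024F]+$/;
const decomposed = "Caf\u0065\u0301"; // "Café" written with e + combining acute accent
console.log(allowed.test(decomposed));                   // false: U+0301 is not in the class
console.log(allowed.test(decomposed.normalize("NFKC"))); // true: é is composed into U+00E9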
Edited to add: If you've started down this path, or have considered it, in order to regain some sanity, you may want to experiment with the Unicode Plugin for XRegExp (which will require you to take a dependency on XRegExp).
JavaScript regular expressions do not have native Unicode support. An alternative is to validate (or sanitize) the string at the server side, or to use a non-native regex library. While I've never used it, XRegExp is such a library, and it has a Unicode Plugin.
Take a look at the Unicode Planes. You probably want to exclude everything but planes 0 and 2. After that, it gets ugly as you'll have to exclude a lot of plane 0 on a case-by-case basis.