How to escape a character out of Basic Multilingual Plane? - javascript

For characters in the Basic Multilingual Plane (BMP), we can use the '\uxxxx' escape. For example, /[\u4e00-\u9fff]/ matches a common Chinese character (0x4e00-0x9fff is the range of CJK Unified Ideographs).
But characters outside the BMP have code points greater than 0xffff, so the '\uxxxx' format cannot escape them: '\u20000' means the character '\u2000' followed by the character '0', not the character whose code point is 0x20000.
How can I escape characters outside the BMP? Using those characters directly is not a good idea, because most fonts cannot display them.

Characters outside the BMP are not recognized directly by JavaScript -- they're represented internally as UTF-16 surrogate pairs. For instance, the character you mentioned, U+20000 (currently allocated to "CJK Unified Ideographs Ext. B") is represented as the surrogate pair U+D840 U+DC00. As a JavaScript string, this would simply be "\uD840\uDC00". (Note that s.length is 2 for this string, even though it displays as a single character.)
Wikipedia has details on the encoding scheme used.
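The surrogate pair for any code point above 0xFFFF can be computed with a little arithmetic. The following is a minimal sketch of the UTF-16 encoding rule; the helper names toSurrogatePair and toEscape are made up for this illustration and are not part of any library.

```javascript
// Convert a code point above 0xFFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in 0x10000..0x10FFFF');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Format the pair as a \uXXXX\uXXXX escape sequence (hypothetical helper).
function toEscape(codePoint) {
  return toSurrogatePair(codePoint)
    .map(unit => '\\u' + unit.toString(16).toUpperCase())
    .join('');
}

console.log(toEscape(0x20000)); // \uD840\uDC00
```

The printed text can be pasted into a string literal or a regular expression as is.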

You can use a pair of escaped surrogate code points, as described in @duskwuff’s answer. You can use my Full Unicode input utility to get the notations (button “Show \u”), or use the Fileformat.info character search to find them out (item “C/C++/Java source code”, because JavaScript uses the same notation here).
Alternatively, you can enter the characters directly: “You can enter non-BMP characters as such into string literals in your JavaScript code, whether in a separate file or as embedded in HTML. Naturally, you need suitable Unicode support in the editor you use. But JavaScript implementations need not support non-BMP characters in program source. They may, and modern browser implementations generally do.” (Going Global with JavaScript and Globalize.js, p. 177) There are some caveats like properly declaring the character encoding.
Font support is a different issue, but when working with characters, you generally want to see them at some point anyway, at least in testing. So you more or less need some font(s) that cover the characters. The Fileformat.info pages also contain links to browser support info, such as (U+20000) Font Support – a good starting point, though not quite complete. For example, U+20000 '𠀀' is also supported in SimSun-ExtB.

Interesting problem.
Now that we have ES6, we can do this:
let newSpeak = '\u{1F4A9}'
Note that internally it's still UTF-16 with surrogate pairs:
newSpeak.length === 2 // "wrong"
[...newSpeak].length === 1
newSpeak === '\uD83D\uDCA9'
Unicode is huge.
Also, it's not just the literals:
newSpeak.charCodeAt(0) === 0xD83D // "wrong"
newSpeak.codePointAt(0) === 0x1F4A9
String.fromCharCode(0x1F4A9) !== newSpeak
String.fromCodePoint(0x1F4A9) === newSpeak
for (let i = 0; i < newSpeak.length; i++) console.log(newSpeak[i]) // "wrong"
for (let c of newSpeak) console.log(c)
[...'🏃🚚'].map(c => `__${c}`).join('') === "__🏃__🚚"
I � handling Unicode.

Related

How to detect WORDS regardless of character set, using a JS regex [duplicate]

There should be something akin to \w that can match any code-point in Letters or Marks category (not just the ASCII ones), and hopefully have filters like [[P*]] for punctuation, etc.
Situation for ES 6
The ECMAScript language specification, edition 6 (also commonly known as ES2015), includes Unicode-aware regular expressions. Support must be enabled with the u modifier on the regex. See Unicode-aware regular expressions in ES6 for a break-down of the feature and some caveats.
ES6 is widely adopted in both browsers and stand-alone Javascript runtimes such as Node.js, so using this feature won't require extra effort in most cases. Full compatibility list: https://kangax.github.io/compat-table/es6/
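As a small illustration of what the u modifier changes, here is a sketch that assumes an ES6-capable engine. The variable names are made up for the example.

```javascript
const pile = '\u{1F4A9}';  // one code point, two UTF-16 code units

// Without u, "." matches a single code unit, so the surrogate pair does not fit.
console.log(/^.$/.test(pile));   // false
// With u, "." matches a whole code point.
console.log(/^.$/u.test(pile));  // true

// \u{...} escapes and ranges beyond the BMP need the u flag.
const cjkExtB = /^[\u{20000}-\u{2A6DF}]$/u;
console.log(cjkExtB.test('\u{20000}')); // true
console.log(cjkExtB.test('\u4E00'));    // false
```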
Situation for ES 5 and below (legacy browsers)
There is a transpiler named regexpu that translates ES6 Unicode regular expressions into equivalent ES5. It can be used as part of your build process. Try it out online.
Even though JavaScript operates on Unicode strings, it does not implement Unicode-aware character classes and has no concept of POSIX character classes or Unicode blocks/sub-ranges.
Issues with Unicode in JavaScript regular expressions
Check your expectations here: Javascript RegExp Unicode Character Class tester (Edit: the original page is down, the Internet Archive still has a copy.)
Flagrant Badassery has an article on JavaScript, Regex, and Unicode that sheds some light on the matter.
Also read Regex and Unicode here on SO. Probably you have to build your own "punctuation character class".
Check out the Regular Expression: Match Unicode Block Range builder (archived copy), which lets you build a JavaScript regular expression that matches characters that fall in any number of specified Unicode blocks.
I just did it for the "General Punctuation" and "Supplemental Punctuation" sub-ranges, and the result is as simple and straight-forward as I would have expected it:
[\u2000-\u206F\u2E00-\u2E7F]
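A quick way to check what such a block-based class does and does not cover; this is only a sketch, and the sample string is invented for the example.

```javascript
// The two blocks from above: General Punctuation and Supplemental Punctuation.
const blockPunctuation = /[\u2000-\u206F\u2E00-\u2E7F]/g;

// U+2014 (em dash), U+2026 (ellipsis) and U+2E3A (two-em dash) fall in those blocks.
const sample = 'one\u2014two\u2026three\u2E3Afour';
console.log(sample.replace(blockPunctuation, ' ')); // one two three four

// ASCII punctuation lives in a different block and is left alone.
console.log('a,b;c'.replace(blockPunctuation, ' ')); // a,b;c
```

Blocks are not the same thing as categories, so a complete "punctuation" class needs more ranges than these two.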
There also is XRegExp, a project that brings Unicode support to JavaScript by offering an alternative regex engine with extended capabilities.
And of course, required reading: mathiasbynens.be - JavaScript has a Unicode problem.
Personally, I would rather not install another library just to get this functionality. My answer does not require any external libraries, and it may also work with little modification for regex flavors besides JavaScript.
Unicode's website provides a way to translate Unicode categories into a set of code points. Since it's Unicode's website, the information from it should be accurate.
Note that you will need to exclude the high-end characters, because a \uXXXX escape can only express values up to FFFF (hex). I suggest checking the Abbreviate, Collate, and Escape check boxes, which strike a balance between avoiding unprintable characters and minimizing the size of the regex.
Here are some common expansions of different Unicode properties:
\p{L} (Letters):
[A-Za-z\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u037F\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u052F\u0531-\u0556\u0559\u0561-\u0587\u05D0-\u05EA\u05F0-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u08A0-\u08B4\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0980\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0AF9\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C39\u0C3D\u0C58-\u0C5A\u0C60\u0C61\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D5F-\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-
\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F5\u13F8-\u13FD\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16F1-\u16F8\u1700-\u170C\u170E-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1877\u1880-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191E\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4B\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1CE9-\u1CEC\u1CEE-\u1CF1\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2183\u2184\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2E2F\u3005\u3006\u3031-\u3035\u303B\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FD5\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA67F-\uA69D\uA6A0-\uA6E5\uA717-\uA71F\uA722-\uA788\uA78B-\uA7AD\uA7B0-\uA7B7\uA7F7-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA822\uA840-\uA873\uA882-\uA8B3\uA8F2-\uA8F7\uA8FB\uA8FD\uA90A-\uA925\uA930-\uA946\uA960-\uA97C\uA984-\uA9B2\uA9CF\uA9E0-\uA9E4\uA9E6-\uA9EF\uA9FA-\uA9FE\uAA00-\uAA28\uAA40-\uAA42\uAA44-\uAA4B\uAA60-\uAA76\uAA7A\uAA7E-\uAAAF\uAAB1\uAAB5\uAAB6\uAAB9-\uAABD\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEA\uAAF2-\uAAF4\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uAB30-\uAB5A\uAB5C-\uAB65\uAB70-\uABE2\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\
uFB1D\uFB1F-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]
\p{Nd} (Number decimal digits):
[0-9\u0660-\u0669\u06F0-\u06F9\u07C0-\u07C9\u0966-\u096F\u09E6-\u09EF\u0A66-\u0A6F\u0AE6-\u0AEF\u0B66-\u0B6F\u0BE6-\u0BEF\u0C66-\u0C6F\u0CE6-\u0CEF\u0D66-\u0D6F\u0DE6-\u0DEF\u0E50-\u0E59\u0ED0-\u0ED9\u0F20-\u0F29\u1040-\u1049\u1090-\u1099\u17E0-\u17E9\u1810-\u1819\u1946-\u194F\u19D0-\u19D9\u1A80-\u1A89\u1A90-\u1A99\u1B50-\u1B59\u1BB0-\u1BB9\u1C40-\u1C49\u1C50-\u1C59\uA620-\uA629\uA8D0-\uA8D9\uA900-\uA909\uA9D0-\uA9D9\uA9F0-\uA9F9\uAA50-\uAA59\uABF0-\uABF9\uFF10-\uFF19]
\p{P} (Punctuation):
[!-#%-*,-/\:;?@\[-\]_\{\}\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u0AF0\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2010-\u2027\u2030-\u2043\u2045-\u2051\u2053-\u205E\u207D\u207E\u208D\u208E\u2308-\u230B\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E42\u3001-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA8FC\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65]
The page also recognizes a number of obscure character classes, such as \p{Hira}, which is just the (Japanese) Hiragana characters:
[\u3041-\u3096\u309D-\u309F]
Lastly, it's possible to plug a char class with more than one Unicode property to get a shorter regex than you would get by just combining them (as long as certain settings are checked).
Having also not found a good solution, I wrote a small script a long time ago, by downloading data from the unicode specification (v.5.0.0) and generating intervals for each unicode category and subcategory in the BMP (lately replaced by a small Java program that uses its own native Unicode support).
Basically it converts \p{...} to a range of values, much like the output of the tool mentioned by Tomalak, but the intervals can end up quite large (since it's not dealing with blocks, but with characters scattered through many different places).
For instance, a Regex written like this:
var regex = unicode_hack(/\p{L}(\p{L}|\p{Nd})*/g);
Will be converted to something like this:
/[\u0041-\u005a\u0061-\u007a...]([...]|[\u0030-\u0039\u0660-\u0669...])*/g
Haven't used it a lot in practice, but it seems to work fine from my tests, so I'm posting here in case someone finds it useful. Despite the length of the resulting regexes (the example above has 3591 characters when expanded), the performance seems to be acceptable (see the tests at jsFiddle; thanks to @modiX and @Lwangaman for the improvements).
Here's the source (raw, 27.5KB; minified, 24.9KB, not much better...). It might be made smaller by unescaping the unicode characters, but OTOH will run the risk of encoding issues, so I'm leaving as it is. Hopefully with ES6 this kind of thing won't be necessary anymore.
Update: this looks like the same strategy adopted in the XRegExp Unicode plug-in mentioned by Tim Down, except that in this case regular JavaScript regexes are being used.
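The expansion idea can be sketched in a few lines. This is not the source of the script described above: the function name expandUnicodeClasses and the CATEGORY_RANGES table are hypothetical, and the table is heavily truncated (real data would have to come from the Unicode Character Database).

```javascript
// Hypothetical, heavily truncated table of category -> class contents.
const CATEGORY_RANGES = {
  Nd: '0-9\\u0660-\\u0669',  // ASCII and Arabic-Indic digits only
  L: 'A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02C1',  // start of \p{L} only
};

// Replace each \p{Name} in the pattern source with a bracketed class.
// Limitation of this sketch: \p{...} inside an existing [...] is not handled.
function expandUnicodeClasses(regex) {
  const source = regex.source.replace(/\\p\{(\w+)\}/g, (match, name) => {
    if (!Object.prototype.hasOwnProperty.call(CATEGORY_RANGES, name)) {
      throw new Error('no range data for category ' + name);
    }
    return '[' + CATEGORY_RANGES[name] + ']';
  });
  return new RegExp(source, regex.flags);
}

const identifier = expandUnicodeClasses(/\p{L}(\p{L}|\p{Nd})*/g);
console.log('x1 \u00E9t\u0663'.match(identifier)); // [ 'x1', 'ét٣' ]
```

The input regex is written without the u flag, so the engine treats \p{L} as literal text and the rewrite happens purely on the pattern source.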
September 2018 (updated February 2019)
It seems that the regexp /\p{L}/u, which matches letters (via Unicode categories),
works on Chrome 68.0.3440.106 and Safari 11.1.2 (13605.3.8)
but does NOT work on Firefox 65.0 :(
Here is a working example
In the field below you should be able to type letters but not numbers<br>
<input type="text" name="field" onkeydown="return /\p{L}/u.test(event.key)" >
I reported this bug here.
Update
After over 2 years, according to 1500035 > 1361876 > 1634135, this bug is finally fixed and the fix will be available in Firefox v.78+
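Because support arrived at different times in different engines, it can help to feature-detect before relying on \p{...}. The following is a rough sketch; the function name and the fallback class are assumptions made for this example, and the fallback covers only a small slice of the letters.

```javascript
function supportsUnicodePropertyEscapes() {
  try {
    // Built from a string so that older parsers do not reject the whole script.
    return new RegExp('^\\p{L}$', 'u').test('\u00E9');
  } catch (e) {
    return false; // SyntaxError: the engine rejects \p{...} or the u flag
  }
}

// Use the property escape where available, else a truncated explicit class.
const letterTest = supportsUnicodePropertyEscapes()
  ? new RegExp('\\p{L}', 'u')
  : /[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02C1]/;

console.log(letterTest.test('a'), letterTest.test('5'));
```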
[^\u0000-\u007F]+ matches any run of characters outside the ASCII range.
For example:
function isNonLatinCharacters(s) {
return /[^\u0000-\u007F]/.test(s);
}
console.log(isNonLatinCharacters("身分"));// Japanese
console.log(isNonLatinCharacters("测试"));// Chinese
console.log(isNonLatinCharacters("حمید"));// Persian
console.log(isNonLatinCharacters("테스트"));// Korean
console.log(isNonLatinCharacters("परीक्षण"));// Hindi
console.log(isNonLatinCharacters("מִבְחָן"));// Hebrew
Here are some perfect references:
Unicode range RegExp generator
Unicode Regular Expressions
Unicode 10.0 Character Code Charts
Match Unicode Block Range
As mentioned in other answers, JavaScript regexes have no support for Unicode character classes. However, there is a library that does provide this: Steven Levithan's excellent XRegExp and its Unicode plug-in.
In JavaScript, \w and \d are ASCII, while \s is Unicode. Don't ask me why. JavaScript does support \p with Unicode categories, which you can use to emulate a Unicode-aware \w and \d.
For \d use \p{N} (numbers)
For \w use [\p{L}\p{N}\p{Pc}\p{M}] (letters, numbers, underscores, marks)
Update: Unfortunately, I was wrong about this. JavaScript does not officially support \p either, though some implementations may still support it. The only Unicode support in JavaScript regexes is matching specific code points with \uFFFF. You can use those in ranges in character classes.
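In engines that do accept \p{...} together with the u flag (standardized after this answer was written, so treat availability as an assumption about your target environment), the emulation suggested above can be sketched like this. The sample text is invented for the example.

```javascript
// Unicode-aware stand-in for \w+: letters, numbers, connector punctuation, marks.
const unicodeWord = /[\p{L}\p{N}\p{Pc}\p{M}]+/gu;

const text = 'na\u00EFve caf\u00E9_au_lait \u6E2C\u8A66 42';
console.log(text.match(unicodeWord));
// [ 'naïve', 'café_au_lait', '測試', '42' ]

// The built-in \w is ASCII-only and splits the first word at the ï.
console.log('na\u00EFve'.match(/\w+/g)); // [ 'na', 've' ]
```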
This will do it:
/[A-Za-z\u00C0-\u00FF ]+/.exec('hipopótamo maçã pólen ñ poção água língüa')
It explicitly selects a range of unicode characters.
It will work for latin characters, but other strange characters may be out of this range.
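To make the limits of that range concrete, here is a short sketch; the anchored variant of the regex and the test words are assumptions added for the example.

```javascript
// Same class as above, anchored so the whole string must match.
const latin1Words = /^[A-Za-z\u00C0-\u00FF ]+$/;

console.log(latin1Words.test('ma\u00E7\u00E3 p\u00F3len')); // true
// U+0141 (Ł) and U+017A (ź) lie beyond U+00FF, so this Polish word fails.
console.log(latin1Words.test('\u0141\u00F3d\u017A'));       // false
// U+00D7 (multiplication sign) is inside the range although it is not a letter.
console.log(latin1Words.test('a\u00D7b'));                  // true
```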
If you are using Babel then Unicode support is already available.
I also released a plugin which transforms your source code such that you can write regular expressions like /^\p{L}+$/. These will then be transformed into something that browsers understand.
Here is the project page of the plugin:
babel-plugin-utf-8-regex
I'm answering this question
What would be the equivalent for \p{Lu} or \p{Ll} in regExp for js?
since it was marked as an exact duplicate of the current old question.
Querying the UCD Database of Unicode 12, \p{Lu} generates 1,788 code points.
Converting to UTF-16 yields the class construct equivalency.
It's only a 4k character string and is easily doable in any regex engines.
(?:[\u0041-\u005A\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C\u011E\u0120\u0122\u0124\u0126\u0128\u012A\u012C\u012E\u0130\u0132\u0134\u0136\u0139\u013B\u013D\u013F\u0141\u0143\u0145\u0147\u014A\u014C\u014E\u0150\u0152\u0154\u0156\u0158\u015A\u015C\u015E\u0160\u0162\u0164\u0166\u0168\u016A\u016C\u016E\u0170\u0172\u0174\u0176\u0178-\u0179\u017B\u017D\u0181-\u0182\u0184\u0186-\u0187\u0189-\u018B\u018E-\u0191\u0193-\u0194\u0196-\u0198\u019C-\u019D\u019F-\u01A0\u01A2\u01A4\u01A6-\u01A7\u01A9\u01AC\u01AE-\u01AF\u01B1-\u01B3\u01B5\u01B7-\u01B8\u01BC\u01C4\u01C7\u01CA\u01CD\u01CF\u01D1\u01D3\u01D5\u01D7\u01D9\u01DB\u01DE\u01E0\u01E2\u01E4\u01E6\u01E8\u01EA\u01EC\u01EE\u01F1\u01F4\u01F6-\u01F8\u01FA\u01FC\u01FE\u0200\u0202\u0204\u0206\u0208\u020A\u020C\u020E\u0210\u0212\u0214\u0216\u0218\u021A\u021C\u021E\u0220\u0222\u0224\u0226\u0228\u022A\u022C\u022E\u0230\u0232\u023A-\u023B\u023D-\u023E\u0241\u0243-\u0246\u0248\u024A\u024C\u024E\u0370\u0372\u0376\u037F\u0386\u0388-\u038A\u038C\u038E-\u038F\u0391-\u03A1\u03A3-\u03AB\u03CF\u03D2-\u03D4\u03D8\u03DA\u03DC\u03DE\u03E0\u03E2\u03E4\u03E6\u03E8\u03EA\u03EC\u03EE\u03F4\u03F7\u03F9-\u03FA\u03FD-\u042F\u0460\u0462\u0464\u0466\u0468\u046A\u046C\u046E\u0470\u0472\u0474\u0476\u0478\u047A\u047C\u047E\u0480\u048A\u048C\u048E\u0490\u0492\u0494\u0496\u0498\u049A\u049C\u049E\u04A0\u04A2\u04A4\u04A6\u04A8\u04AA\u04AC\u04AE\u04B0\u04B2\u04B4\u04B6\u04B8\u04BA\u04BC\u04BE\u04C0-\u04C1\u04C3\u04C5\u04C7\u04C9\u04CB\u04CD\u04D0\u04D2\u04D4\u04D6\u04D8\u04DA\u04DC\u04DE\u04E0\u04E2\u04E4\u04E6\u04E8\u04EA\u04EC\u04EE\u04F0\u04F2\u04F4\u04F6\u04F8\u04FA\u04FC\u04FE\u0500\u0502\u0504\u0506\u0508\u050A\u050C\u050E\u0510\u0512\u0514\u0516\u0518\u051A\u051C\u051E\u0520\u0522\u0524\u0526\u0528\u052A\u052C\u052E\u0531-\u0556\u10A0-\u10C5\u10C7\u10CD\u13A0-\u13F5\u1C90-\u1CBA\u1CBD-\u1CBF\u1E00\u1E02\u1E04\u1E06\u1E08\u1E0A\u1E0C\u1E0E\u1E10\u1E12\u1E14\u1E16\u1E18\u1E1A\u1E1C\u1E1E\
u1E20\u1E22\u1E24\u1E26\u1E28\u1E2A\u1E2C\u1E2E\u1E30\u1E32\u1E34\u1E36\u1E38\u1E3A\u1E3C\u1E3E\u1E40\u1E42\u1E44\u1E46\u1E48\u1E4A\u1E4C\u1E4E\u1E50\u1E52\u1E54\u1E56\u1E58\u1E5A\u1E5C\u1E5E\u1E60\u1E62\u1E64\u1E66\u1E68\u1E6A\u1E6C\u1E6E\u1E70\u1E72\u1E74\u1E76\u1E78\u1E7A\u1E7C\u1E7E\u1E80\u1E82\u1E84\u1E86\u1E88\u1E8A\u1E8C\u1E8E\u1E90\u1E92\u1E94\u1E9E\u1EA0\u1EA2\u1EA4\u1EA6\u1EA8\u1EAA\u1EAC\u1EAE\u1EB0\u1EB2\u1EB4\u1EB6\u1EB8\u1EBA\u1EBC\u1EBE\u1EC0\u1EC2\u1EC4\u1EC6\u1EC8\u1ECA\u1ECC\u1ECE\u1ED0\u1ED2\u1ED4\u1ED6\u1ED8\u1EDA\u1EDC\u1EDE\u1EE0\u1EE2\u1EE4\u1EE6\u1EE8\u1EEA\u1EEC\u1EEE\u1EF0\u1EF2\u1EF4\u1EF6\u1EF8\u1EFA\u1EFC\u1EFE\u1F08-\u1F0F\u1F18-\u1F1D\u1F28-\u1F2F\u1F38-\u1F3F\u1F48-\u1F4D\u1F59\u1F5B\u1F5D\u1F5F\u1F68-\u1F6F\u1FB8-\u1FBB\u1FC8-\u1FCB\u1FD8-\u1FDB\u1FE8-\u1FEC\u1FF8-\u1FFB\u2102\u2107\u210B-\u210D\u2110-\u2112\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u2130-\u2133\u213E-\u213F\u2145\u2183\u2C00-\u2C2E\u2C60\u2C62-\u2C64\u2C67\u2C69\u2C6B\u2C6D-\u2C70\u2C72\u2C75\u2C7E-\u2C80\u2C82\u2C84\u2C86\u2C88\u2C8A\u2C8C\u2C8E\u2C90\u2C92\u2C94\u2C96\u2C98\u2C9A\u2C9C\u2C9E\u2CA0\u2CA2\u2CA4\u2CA6\u2CA8\u2CAA\u2CAC\u2CAE\u2CB0\u2CB2\u2CB4\u2CB6\u2CB8\u2CBA\u2CBC\u2CBE\u2CC0\u2CC2\u2CC4\u2CC6\u2CC8\u2CCA\u2CCC\u2CCE\u2CD0\u2CD2\u2CD4\u2CD6\u2CD8\u2CDA\u2CDC\u2CDE\u2CE0\u2CE2\u2CEB\u2CED\u2CF2\uA640\uA642\uA644\uA646\uA648\uA64A\uA64C\uA64E\uA650\uA652\uA654\uA656\uA658\uA65A\uA65C\uA65E\uA660\uA662\uA664\uA666\uA668\uA66A\uA66C\uA680\uA682\uA684\uA686\uA688\uA68A\uA68C\uA68E\uA690\uA692\uA694\uA696\uA698\uA69A\uA722\uA724\uA726\uA728\uA72A\uA72C\uA72E\uA732\uA734\uA736\uA738\uA73A\uA73C\uA73E\uA740\uA742\uA744\uA746\uA748\uA74A\uA74C\uA74E\uA750\uA752\uA754\uA756\uA758\uA75A\uA75C\uA75E\uA760\uA762\uA764\uA766\uA768\uA76A\uA76C\uA76E\uA779\uA77B\uA77D-\uA77E\uA780\uA782\uA784\uA786\uA78B\uA78D\uA790\uA792\uA796\uA798\uA79A\uA79C\uA79E\uA7A0\uA7A2\uA7A4\uA7A6\uA7A8\uA7AA-\uA7AE\uA7B0-\uA7B4\uA7B6\uA7B8\uA7BA\uA7BC\uA7BE\uA7C2\uA7C4-\uA7C6\u
FF21-\uFF3A]|(?:\uD801[\uDC00-\uDC27\uDCB0-\uDCD3]|\uD803[\uDC80-\uDCB2]|\uD806[\uDCA0-\uDCBF]|\uD81B[\uDE40-\uDE5F]|\uD835[\uDC00-\uDC19\uDC34-\uDC4D\uDC68-\uDC81\uDC9C\uDC9E-\uDC9F\uDCA2\uDCA5-\uDCA6\uDCA9-\uDCAC\uDCAE-\uDCB5\uDCD0-\uDCE9\uDD04-\uDD05\uDD07-\uDD0A\uDD0D-\uDD14\uDD16-\uDD1C\uDD38-\uDD39\uDD3B-\uDD3E\uDD40-\uDD44\uDD46\uDD4A-\uDD50\uDD6C-\uDD85\uDDA0-\uDDB9\uDDD4-\uDDED\uDE08-\uDE21\uDE3C-\uDE55\uDE70-\uDE89\uDEA8-\uDEC0\uDEE2-\uDEFA\uDF1C-\uDF34\uDF56-\uDF6E\uDF90-\uDFA8\uDFCA]|\uD83A[\uDD00-\uDD21]))
Querying the UCD database of Unicode 12, \p{Ll} generates 2,151 code points.
Converting to UTF-16 yields the class construct equivalency.
(?:[\u0061-\u007A\u00B5\u00DF-\u00F6\u00F8-\u00FF\u0101\u0103\u0105\u0107\u0109\u010B\u010D\u010F\u0111\u0113\u0115\u0117\u0119\u011B\u011D\u011F\u0121\u0123\u0125\u0127\u0129\u012B\u012D\u012F\u0131\u0133\u0135\u0137-\u0138\u013A\u013C\u013E\u0140\u0142\u0144\u0146\u0148-\u0149\u014B\u014D\u014F\u0151\u0153\u0155\u0157\u0159\u015B\u015D\u015F\u0161\u0163\u0165\u0167\u0169\u016B\u016D\u016F\u0171\u0173\u0175\u0177\u017A\u017C\u017E-\u0180\u0183\u0185\u0188\u018C-\u018D\u0192\u0195\u0199-\u019B\u019E\u01A1\u01A3\u01A5\u01A8\u01AA-\u01AB\u01AD\u01B0\u01B4\u01B6\u01B9-\u01BA\u01BD-\u01BF\u01C6\u01C9\u01CC\u01CE\u01D0\u01D2\u01D4\u01D6\u01D8\u01DA\u01DC-\u01DD\u01DF\u01E1\u01E3\u01E5\u01E7\u01E9\u01EB\u01ED\u01EF-\u01F0\u01F3\u01F5\u01F9\u01FB\u01FD\u01FF\u0201\u0203\u0205\u0207\u0209\u020B\u020D\u020F\u0211\u0213\u0215\u0217\u0219\u021B\u021D\u021F\u0221\u0223\u0225\u0227\u0229\u022B\u022D\u022F\u0231\u0233-\u0239\u023C\u023F-\u0240\u0242\u0247\u0249\u024B\u024D\u024F-\u0293\u0295-\u02AF\u0371\u0373\u0377\u037B-\u037D\u0390\u03AC-\u03CE\u03D0-\u03D1\u03D5-\u03D7\u03D9\u03DB\u03DD\u03DF\u03E1\u03E3\u03E5\u03E7\u03E9\u03EB\u03ED\u03EF-\u03F3\u03F5\u03F8\u03FB-\u03FC\u0430-\u045F\u0461\u0463\u0465\u0467\u0469\u046B\u046D\u046F\u0471\u0473\u0475\u0477\u0479\u047B\u047D\u047F\u0481\u048B\u048D\u048F\u0491\u0493\u0495\u0497\u0499\u049B\u049D\u049F\u04A1\u04A3\u04A5\u04A7\u04A9\u04AB\u04AD\u04AF\u04B1\u04B3\u04B5\u04B7\u04B9\u04BB\u04BD\u04BF\u04C2\u04C4\u04C6\u04C8\u04CA\u04CC\u04CE-\u04CF\u04D1\u04D3\u04D5\u04D7\u04D9\u04DB\u04DD\u04DF\u04E1\u04E3\u04E5\u04E7\u04E9\u04EB\u04ED\u04EF\u04F1\u04F3\u04F5\u04F7\u04F9\u04FB\u04FD\u04FF\u0501\u0503\u0505\u0507\u0509\u050B\u050D\u050F\u0511\u0513\u0515\u0517\u0519\u051B\u051D\u051F\u0521\u0523\u0525\u0527\u0529\u052B\u052D\u052F\u0560-\u0588\u10D0-\u10FA\u10FD-\u10FF\u13F8-\u13FD\u1C80-\u1C88\u1D00-\u1D2B\u1D6B-\u1D77\u1D79-\u1D9A\u1E01\u1E03\u1E05\u1E07\u1E09\u1E0B\u1E0D\u1E0F\u1E11\u1E13\u1E15\u1E17\u1E19\u1E1B\u1E1D\u1E1F\u1E21\
u1E23\u1E25\u1E27\u1E29\u1E2B\u1E2D\u1E2F\u1E31\u1E33\u1E35\u1E37\u1E39\u1E3B\u1E3D\u1E3F\u1E41\u1E43\u1E45\u1E47\u1E49\u1E4B\u1E4D\u1E4F\u1E51\u1E53\u1E55\u1E57\u1E59\u1E5B\u1E5D\u1E5F\u1E61\u1E63\u1E65\u1E67\u1E69\u1E6B\u1E6D\u1E6F\u1E71\u1E73\u1E75\u1E77\u1E79\u1E7B\u1E7D\u1E7F\u1E81\u1E83\u1E85\u1E87\u1E89\u1E8B\u1E8D\u1E8F\u1E91\u1E93\u1E95-\u1E9D\u1E9F\u1EA1\u1EA3\u1EA5\u1EA7\u1EA9\u1EAB\u1EAD\u1EAF\u1EB1\u1EB3\u1EB5\u1EB7\u1EB9\u1EBB\u1EBD\u1EBF\u1EC1\u1EC3\u1EC5\u1EC7\u1EC9\u1ECB\u1ECD\u1ECF\u1ED1\u1ED3\u1ED5\u1ED7\u1ED9\u1EDB\u1EDD\u1EDF\u1EE1\u1EE3\u1EE5\u1EE7\u1EE9\u1EEB\u1EED\u1EEF\u1EF1\u1EF3\u1EF5\u1EF7\u1EF9\u1EFB\u1EFD\u1EFF-\u1F07\u1F10-\u1F15\u1F20-\u1F27\u1F30-\u1F37\u1F40-\u1F45\u1F50-\u1F57\u1F60-\u1F67\u1F70-\u1F7D\u1F80-\u1F87\u1F90-\u1F97\u1FA0-\u1FA7\u1FB0-\u1FB4\u1FB6-\u1FB7\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FC7\u1FD0-\u1FD3\u1FD6-\u1FD7\u1FE0-\u1FE7\u1FF2-\u1FF4\u1FF6-\u1FF7\u210A\u210E-\u210F\u2113\u212F\u2134\u2139\u213C-\u213D\u2146-\u2149\u214E\u2184\u2C30-\u2C5E\u2C61\u2C65-\u2C66\u2C68\u2C6A\u2C6C\u2C71\u2C73-\u2C74\u2C76-\u2C7B\u2C81\u2C83\u2C85\u2C87\u2C89\u2C8B\u2C8D\u2C8F\u2C91\u2C93\u2C95\u2C97\u2C99\u2C9B\u2C9D\u2C9F\u2CA1\u2CA3\u2CA5\u2CA7\u2CA9\u2CAB\u2CAD\u2CAF\u2CB1\u2CB3\u2CB5\u2CB7\u2CB9\u2CBB\u2CBD\u2CBF\u2CC1\u2CC3\u2CC5\u2CC7\u2CC9\u2CCB\u2CCD\u2CCF\u2CD1\u2CD3\u2CD5\u2CD7\u2CD9\u2CDB\u2CDD\u2CDF\u2CE1\u2CE3-\u2CE4\u2CEC\u2CEE\u2CF3\u2D00-\u2D25\u2D27\u2D2D\uA641\uA643\uA645\uA647\uA649\uA64B\uA64D\uA64F\uA651\uA653\uA655\uA657\uA659\uA65B\uA65D\uA65F\uA661\uA663\uA665\uA667\uA669\uA66B\uA66D\uA681\uA683\uA685\uA687\uA689\uA68B\uA68D\uA68F\uA691\uA693\uA695\uA697\uA699\uA69B\uA723\uA725\uA727\uA729\uA72B\uA72D\uA72F-\uA731\uA733\uA735\uA737\uA739\uA73B\uA73D\uA73F\uA741\uA743\uA745\uA747\uA749\uA74B\uA74D\uA74F\uA751\uA753\uA755\uA757\uA759\uA75B\uA75D\uA75F\uA761\uA763\uA765\uA767\uA769\uA76B\uA76D\uA76F\uA771-\uA778\uA77A\uA77C\uA77F\uA781\uA783\uA785\uA787\uA78C\uA78E\uA791\uA793-\uA795\uA797\uA799\uA79B\uA79D\uA79F\uA7A1
\uA7A3\uA7A5\uA7A7\uA7A9\uA7AF\uA7B5\uA7B7\uA7B9\uA7BB\uA7BD\uA7BF\uA7C3\uA7FA\uAB30-\uAB5A\uAB60-\uAB67\uAB70-\uABBF\uFB00-\uFB06\uFB13-\uFB17\uFF41-\uFF5A]|(?:\uD801[\uDC28-\uDC4F\uDCD8-\uDCFB]|\uD803[\uDCC0-\uDCF2]|\uD806[\uDCC0-\uDCDF]|\uD81B[\uDE60-\uDE7F]|\uD835[\uDC1A-\uDC33\uDC4E-\uDC54\uDC56-\uDC67\uDC82-\uDC9B\uDCB6-\uDCB9\uDCBB\uDCBD-\uDCC3\uDCC5-\uDCCF\uDCEA-\uDD03\uDD1E-\uDD37\uDD52-\uDD6B\uDD86-\uDD9F\uDDBA-\uDDD3\uDDEE-\uDE07\uDE22-\uDE3B\uDE56-\uDE6F\uDE8A-\uDEA5\uDEC2-\uDEDA\uDEDC-\uDEE1\uDEFC-\uDF14\uDF16-\uDF1B\uDF36-\uDF4E\uDF50-\uDF55\uDF70-\uDF88\uDF8A-\uDF8F\uDFAA-\uDFC2\uDFC4-\uDFC9\uDFCB]|\uD83A[\uDD22-\uDD43]))
Note that a regex implementation of \p{Lu} or \p{Ll} actually calls a
non-standard function to test the value.
The character classes shown here work differently: they are linear, standard,
and pretty slow when jammed into mostly a single class.
Some insight on how a Regex engine (in general) implements Unicode Property Classes:
Examine these performance characteristics between the property
and the class block (like above)
Regex1: LONG CLASS
< none >
Completed iterations: 50 / 50 ( x 1 )
Matches found per iteration: 1788
Elapsed Time: 0.73 s, 727.58 ms, 727584 µs
Matches per sec: 122,872
Regex2: \p{Lu}
Options: < ICU - none >
Completed iterations: 50 / 50 ( x 1 )
Matches found per iteration: 1788
Elapsed Time: 0.07 s, 65.32 ms, 65323 µs
Matches per sec: 1,368,583
Wow what a difference !!
Let's see how Properties might be implemented
Array of Pointers [ 10FFFF ] where each index is a Code Point
Each pointer in the Array is to a structure of classification.
A Classification structure contains fixed field elements.
Some are NULL and do not pertain.
Some contain category classifications.
Example : General Category
This is a bitmapped element that uses 17 out of 64 bits.
Whatever this Code Point supports has bit(s) set as a mask.
-Close_Punctuation
-Connector_Punctuation
-Control
-Currency_Symbol
-Dash_Punctuation
-Decimal_Number
-Enclosing_Mark
-Final_Punctuation
-Format
-Initial_Punctuation
-Letter_Number
-Line_Separator
-Lowercase_Letter
-Math_Symbol
-Modifier_Letter
-Modifier_Symbol
-Nonspacing_Mark
-Open_Punctuation
-Other_Letter
-Other_Number
-Other_Punctuation
-Other_Symbol
-Paragraph_Separator
-Private_Use
-Space_Separator
-Spacing_Mark
-Surrogate
-Titlecase_Letter
-Unassigned
-Uppercase_Letter
When a regex is parsed with something like this \p{Lu} it
is translated directly into
Classification Structure element offset : General Category
A check of that element for bit item : Uppercase_Letter
Another example, when a regex is parsed with punctuation property \p{P} it
is translated into
Classification Structure element offset : General Category
A check of that element for any of these items bits, which are joined into a mask :
-Close_Punctuation
-Connector_Punctuation
-Dash_Punctuation
-Final_Punctuation
-Initial_Punctuation
-Open_Punctuation
-Other_Punctuation
The offset and bit or bit(mask) are stored as a regex step for that property.
The lookup table is created once for all Unicode Code Points using this array.
When a character is checked, it is as simple as using the CP as an index into
this array and checking the Classification Structure's specific element for that bit(mask).
This structure is expandable and indirect to provide much more complex look ups. This is just a simple example.
Compare that direct lookup with a character class search :
All classes are a linear list of items searched from left to right.
In this comparison, given that our target string contains only the complete
set of uppercase Unicode letters, the law of averages would predict that
half of the items in the class would have to be range-checked
to find a match.
This is a huge disadvantage in performance.
However, if the lookup tables are not there or are not up to date
with the latest Unicode release (12 as of this date)
then this would be the only way.
In fact, it is mostly the only way to get the complete Emoji
characters as there is no specific property (or reasoning) to their assignment.
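The difference between the two strategies can be sketched with toy data. Nothing here is taken from a real regex engine: the bit assignment, the table size, and the three ranges (uppercase letters in ASCII and Latin-1 only) are assumptions chosen to keep the example small.

```javascript
// Toy data: uppercase letters in ASCII and Latin-1 only, not the full \p{Lu} set.
const UPPER_RANGES = [[0x41, 0x5A], [0xC0, 0xD6], [0xD8, 0xDE]];
const UPPERCASE_LETTER = 1 << 0;  // hypothetical bit in the category mask

// Linear scan, comparable to walking a long character class left to right.
function isUpperLinear(cp) {
  for (const [lo, hi] of UPPER_RANGES) {
    if (cp >= lo && cp <= hi) return true;
  }
  return false;
}

// Build the table once; each entry holds a bitmask of categories.
const TABLE_SIZE = 0x100;  // a real table would span 0..0x10FFFF
const categoryTable = new Uint32Array(TABLE_SIZE);
for (const [lo, hi] of UPPER_RANGES) {
  for (let cp = lo; cp <= hi; cp++) categoryTable[cp] |= UPPERCASE_LETTER;
}

// Direct lookup: one index operation and one mask test.
function isUpperTable(cp) {
  return cp < TABLE_SIZE && (categoryTable[cp] & UPPERCASE_LETTER) !== 0;
}

console.log(isUpperLinear(0xC9), isUpperTable(0xC9)); // true true
```

With three ranges the scan is cheap, but its cost grows with the number of ranges while the table lookup stays constant.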
You can also use:
function myFunction() {
  var str = "xq234";
  // Letters only: ASCII plus the Latin-1 range À-ÿ (which also contains × and ÷)
  var allowChars = /^[a-zA-ZÀ-ÿ]+$/;
  // res is "true" when str contains a character outside the allowed set
  var res = allowChars.test(str) ? "false" : "true";
  document.getElementById("demo").innerHTML = res;
}

Printing all ASCII characters in Javascript

I need to do something like this:
Have a variable of some type.
Run in a loop and assign all the possible ASCII characters to this variable and print them, one by one.
Is something similar possible for UNICODE also?
I'm not sure how exactly you want to print, but this will console.log printable ascii
for(var i=32;i<127;++i) console.log(String.fromCharCode(i));
You can document.write them if that's your intention. And if the environment is unicode, it should work for unicode as well, I believe.
Others have shown how to print the printable ASCII characters. It is possible to print all other ASCII characters, too, though they are control characters with system-dependent effect (often no effect). To create a string containing all ASCII characters, you could do this:
var s = '';
for (var i = 0; i <= 127; i++) s += String.fromCharCode(i);
Unicode is much more tricky, because the Unicode coding space, from 0 to 0x10FFFF, contains a large number of unassigned code points as well as code points designated as noncharacters. There are also Private Use code points, which may be used to denote characters by “private agreement” but have no generally assigned meaning. Moreover, many Unicode characters are nonspacing, i.e. meant to combine with the preceding character (e.g., turning “a” to “â”), so you can’t visually print them in a row. There is no simple way in JavaScript to determine, from an integer, the class of the corresponding code point – you might need to read the UnicodeData.txt file, parse it, and use the information there to classify code points.
Finally, there is the programming issue that the JavaScript concept of character corresponds to a 16-bit code unit (not code point), and any Unicode code point larger than 0xFFFF needs to be represented using two code units (so-called surrogates). If you are using JavaScript in the context of an HTML document and you want to print characters in the HTML content, then the simplest way is to use character references like &#x10400; (which denotes the Unicode character at code point 10400 hexadecimal) and assign the string to the innerHTML property of an element.
If you need to write ranges of Unicode characters, you might take a look at the Full Unicode Input utility that I recently wrote. Its source code illustrates some ways of dealing with Unicode characters in JavaScript.
Some of the ASCII characters are non-printable, but to get, for example, the characters from 32 (space) to 126 (~), you would use:
var s = '';
for (var i = 32; i <= 126; i++) s += String.fromCharCode(i);
The unicode character set has more than 110,000 different characters (see Unicode), but a normal font doesn't contain all of them, so you can't display them anyway. You would have to specify what parts of the character space you are interested in.
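Once you have picked a range, building the string is straightforward. This is only a sketch: charsInRange is a name invented here, it assumes String.fromCodePoint (ES2015) is available, and it makes no attempt to skip unassigned or combining code points inside the range.

```javascript
// Hypothetical helper: build a string from an inclusive range of code points.
// String.fromCodePoint also accepts values above 0xFFFF.
function charsInRange(first, last) {
  var out = '';
  for (var cp = first; cp <= last; cp++) {
    out += String.fromCodePoint(cp);
  }
  return out;
}

// Cyrillic capital letters U+0410..U+042F
var cyrillic = charsInRange(0x0410, 0x042F);
console.log(cyrillic.length); // 32

// Four code points outside the BMP take two code units each
console.log(charsInRange(0x10400, 0x10403).length); // 8
```

Whether the result is actually visible still depends on the fonts installed on the machine.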

Javascript string comparison fails when comparing unicode characters

I want to compare two strings in JavaScript that are the same, and yet the equality operator == returns false. One string contains a special character (eg. the danish å).
JavaScript code:
var filenameFromJS = "Designhåndbog.pdf";
var filenameFromServer = "Designhåndbog.pdf";
print(filenameFromJS == filenameFromServer); // This prints false why?
The solution
What worked for me is unicode normalization as slevithan pointed out.
I forked my original jsfiddle to make a version using the normalization lib suggested by slevithan. Link: http://jsfiddle.net/GWZ8j/1/.
Unlike what some other people here have said, this has nothing to do with encodings. Rather, your two strings use different code points to render the same visual characters.
To solve this correctly, you need to perform Unicode normalization on the two strings before comparing them. Unfortunately, JavaScript didn't have this functionality built in when this was written (ES2015 later added String.prototype.normalize). Here is a JavaScript library that can perform the normalization for you: https://github.com/walling/unorm
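For engines that support ES2015, here is a minimal sketch of the same idea using the built-in String.prototype.normalize. It assumes the two file names differ only in how the å is encoded (precomposed versus letter plus combining mark), which the question does not actually confirm; sameText is a name made up for the example.

```javascript
// Two ways to spell "å": precomposed U+00E5, or "a" followed by
// U+030A COMBINING RING ABOVE. They render the same but are different strings.
var composed = 'Designh\u00E5ndbog.pdf';
var decomposed = 'Designha\u030Andbog.pdf';

console.log(composed === decomposed); // false
console.log(composed.normalize('NFC') === decomposed.normalize('NFC')); // true

// Hypothetical helper: compare two strings regardless of composition.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(sameText(composed, decomposed)); // true
```

NFC is used here so both sides end up precomposed; NFD would work equally well for an equality check as long as both strings get the same form.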
The JavaScript equality operator == will appear to be failing under the following circumstances. In all cases it is programmer error. Not a bug in JavaScript.
The two strings do not contain the same number and sequence of characters.
There is whitespace or newlines before, within or after one string. Use a trim() operator on both and look closely at both strings.
Surprise typecasting. The programmer is comparing datatypes that are incompatible.
There are unicode characters which look identical to other unicode characters but in fact are different unicode characters.
Unicode is a complex thing. It has two different representations for characters such as á and é: a single precomposed code point, or a base letter followed by a combining accent. As you already see in the URL encoded version, the hex bytes of which the character is made differ for both versions.
See this answer for more information.
I had this same problem.
Adding
<meta charset="UTF-8">
to the HTML file fixed the issue.
In my case the templating engine was baking a json string into the HTML file. This string was in unicode.
While the template was also a unicode file, the JS engine was treating the string I wrote into the template as a latin-1 encoded string, until I added the meta tag.
I was comparing the typed in string to one of the JSON objects items (location.title == "Mühle")
Let the browser normalize unicode for you. This approach worked for me:
function normalizeUnicode(s) {
let div = $('<div style="display: none"></div>').html(s).appendTo('body');
let res = div.html();
div.remove();
return res;
}
normalizeUnicode(unicodeVal1) == normalizeUnicode(unicodeVal2)

Why is my RegExp construction not accepted by JavaScript?

I'm using a RegExp to validate some user input on an ASP.NET web page. It's meant to enforce the construction of a password (i.e. between 8 and 20 long, at least one upper case character, at least one lower case character, at least one number, at least one of the characters ##!$% and no use of letters L or O (upper or lower) or numbers 0 and 1). This RegExp works fine in my tester (Expresso) and in my C# code.
This is how it looks:
(?-i)^(?=.{8,20})(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])
(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]*$
(Line break added for formatting)
However, when I run the code it lives in in IE6 or IE7 (haven't tried other browsers as this is an internal app and we're a Microsoft shop), I get a runtime error saying 'Syntax error in regular expression'. That's it - no further information in the error message aside from the line number.
What is it about this that JavaScript doesn't like?
Well, there are two ways of defining a Regex in Javascript:
a. Through a Regexp object constructor:
var re = new RegExp("pattern","flags");
re.test(myTestString);
b. Using a regular expression literal:
var re = /pattern/flags;
You should also note that JS does not support some of the tenets of Regular Expressions. For a non-comprehensive list of features unsupported in JS, check out the regular-expressions.info site.
Specifically speaking, you appear to be setting some flags on the expression (for example, the case insensitive flag). I would suggest that you use the /i flag (as indicated by the syntax above) instead of using (?-i)
That would make your Regex as follows (Positive Lookahead appears to be supported):
/^(?=.{8,20})(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]*$/i;
For a very good article on the subject, check out Regular Expressions in JavaScript.
Edit (after Howard's comment)
If you are simply assigning this Regex pattern to a RegularExpressionValidator control, then you will not have the ability to set Regex options (such as ignore case). Also, you will not be able to use the Regex literal syntax supported by Javascript. Therefore, the only option that remains is to make your pattern intrinsically case insensitive. For example, [a-h] would have to be written as [A-Ha-h]. This would make your Regex quite long-winded, I'm sorry to say.
Here is a solution to this problem, though I cannot vouch for its legitimacy. Some other options that come to mind may be to turn off client-side validation altogether and validate exclusively on the server. This will give you access to the full Regex flavour implemented by the System.Text.RegularExpressions.Regex object. Alternatively, use a CustomValidator and create your own JS function which applies the Regex match using the patterns that I (and others) have suggested.
I'm not familiar with C#'s regular expression syntax, but is this (at the start)
(?-i)
meant to turn the case insensitivity pattern modifier on? If so, that's your problem. JavaScript doesn't support specifying the pattern modifiers in the expression. There are two ways to do this in JavaScript:
var re = /pattern/i
var re = new RegExp('pattern','i');
Give one of those a try, and your expression should be happy.
As Cerberus mentions, (?-i) is not supported in JavaScript regexps. So, you need to get rid of that and use /i. Something to keep in mind is that there is no standard for regular expression syntax; it is different in each language, so testing in something that uses the .NET regular expression engine is not a valid test of how it will work in JavaScript. Instead, try and look for a reference on JavaScript regular expressions, such as this one.
Your lookahead for 8-20 characters also does not do what you want. It ensures that there are at least 8 characters, but it does not limit the string to 20, since the character class with the Kleene star (the * operator) at the end can match as many characters as provided. What you want instead is to replace the * at the end with {8,20}, and remove the length lookahead from the beginning.
var re = /^(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]{8,20}$/i;
On the other hand, I'm not really sure why you would want to restrict the length of passwords, unless there's a hard database limit (which there shouldn't be, since you shouldn't be storing passwords in plain text in the database, but instead hashing them down to something fixed size using a secure hash algorithm with a salt). And as mentioned, I don't see a reason to be so restrictive on the set of characters you allow. I'd recommend something more like this:
var re = /^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##!$%])[a-zA-Z0-9##!$%]{8,}$/i;
Also, why would you forbid 1, 0, L and O from your passwords (and it looks like you're trying to forbid I as well, which you forgot to mention)? This will make it very hard for people to construct good passwords, and since you never see a password as you type it, there's no reason to worry about letters which look confusingly similar. If you want to have a more permissive regexp:
var re = /^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##!$%]).{8,}$/i;
Are you enclosing the regexp in / / characters?
var regexp = /[]/;
return regexp.test();
(?-i)
Doesn't exist in JS Regexp. Flags can be specified as “new RegExp('pattern', 'i')”, or literal syntax “/pattern/i”.
(?=
Exists in modern implementations of JS Regexp, but is dangerously buggy in IE. Lookahead assertions should be avoided in JS for this reason.
between 8 and 20 long, at least one upper case character, at least one lower case character, at least one number, at least one of the characters ##!$% and no use of letters L or O (upper or lower) or numbers 0 and 1.
Do you have to do this in RegExp, and do you have to put all the conditions in one RegExp? Because those are easy conditions to match using multiple RegExps, or even simple string matching:
if (
    s.length<8 || s.length>20 ||
    s==s.toLowerCase() || s==s.toUpperCase() ||  // must contain both cases
    s.indexOf('0')!=-1 || s.indexOf('1')!=-1 ||
    s.toLowerCase().indexOf('l')!=-1 || s.toLowerCase().indexOf('o')!=-1 ||
    // none of the special characters # ! $ % present
    (s.indexOf('#')==-1 && s.indexOf('!')==-1 && s.indexOf('$')==-1 && s.indexOf('%')==-1)
)
    alert('Bad password!');
(These are really cruel and unhelpful password rules if meant for end-users BTW!)
I would use this regular expression:
/(?=[^2-9]*[2-9])(?=[^a-hj-km-np-z]*[a-hj-km-np-z])(?=[^A-HJ-KM-NP-Z]*[A-HJ-KM-NP-Z])(?=[^##!$%]*[##!$%])^[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]{8,}$/
The [^a-z]*[a-z] will make sure that the match is made as early as possible instead of expanding the .* and doing backtracking.
(?-i) is supposed to turn case-insensitivity off. Everybody seems to be assuming you're trying to turn it on, but that would be (?i). Anyway, you don't want it to be case-insensitive, since you need to ensure that there are both uppercase and lowercase letters. Since case-sensitive matching is the default, prefacing a regex with (?-i) is pointless even in those flavors (like .NET) that support inline modifiers.
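To see why the /i flag is harmful here, consider this cut-down sketch that keeps only the two letter-case lookaheads (the simplified patterns are mine, not the asker's full expression):

```javascript
// Same pattern with and without the case-insensitive flag.
var strict = /^(?=.*[a-z])(?=.*[A-Z]).+$/;
var loose = /^(?=.*[a-z])(?=.*[A-Z]).+$/i;

console.log(strict.test('abcdefgh')); // false: no upper case letter
console.log(loose.test('abcdefgh'));  // true: with /i, [A-Z] also matches lower case
console.log(strict.test('abcDefgh')); // true
```

With /i, the "at least one upper case" and "at least one lower case" checks both collapse into "at least one letter", so the flag should simply be left off.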

How can I use Unicode-aware regular expressions in JavaScript?

There should be something akin to \w that can match any code-point in Letters or Marks category (not just the ASCII ones), and hopefully have filters like [[P*]] for punctuation, etc.
Situation for ES 6
The ECMAScript language specification, edition 6 (also commonly known as ES2015), includes Unicode-aware regular expressions. Support must be enabled with the u modifier on the regex. See Unicode-aware regular expressions in ES6 for a break-down of the feature and some caveats.
ES6 is widely adopted in both browsers and stand-alone Javascript runtimes such as Node.js, so using this feature won't require extra effort in most cases. Full compatibility list: https://kangax.github.io/compat-table/es6/
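A small sketch of what the u modifier changes, assuming an engine with ES2015 support. The test string is U+20000 written as a surrogate pair:

```javascript
var ideograph = '\uD840\uDC00'; // U+20000, one character, two code units

console.log(ideograph.length);                // 2
console.log(/^.$/.test(ideograph));           // false: "." matches a single code unit
console.log(/^.$/u.test(ideograph));          // true: "." matches a whole code point
console.log(/^\u{20000}$/u.test(ideograph));  // true: \u{...} escapes need the u flag
console.log(Array.from(ideograph).length);    // 1: string iteration is by code point
```

Without the u flag the regex engine works on 16-bit code units, which is why the first pattern fails on a character outside the BMP.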
Situation for ES 5 and below (legacy browsers)
There is a transpiler named regexpu that translates ES6 Unicode regular expressions into equivalent ES5. It can be used as part of your build process. Try it out online.
Even though JavaScript operates on Unicode strings, it does not implement Unicode-aware character classes and has no concept of POSIX character classes or Unicode blocks/sub-ranges.
Issues with Unicode in JavaScript regular expressions
Check your expectations here: Javascript RegExp Unicode Character Class tester (Edit: the original page is down, the Internet Archive still has a copy.)
Flagrant Badassery has an article on JavaScript, Regex, and Unicode that sheds some light on the matter.
Also read Regex and Unicode here on SO. Probably you have to build your own "punctuation character class".
Check out the Regular Expression: Match Unicode Block Range builder (archived copy), which lets you build a JavaScript regular expression that matches characters that fall in any number of specified Unicode blocks.
I just did it for the "General Punctuation" and "Supplemental Punctuation" sub-ranges, and the result is as simple and straight-forward as I would have expected it:
[\u2000-\u206F\u2E00-\u2E7F]
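Used in code, that class might look like the sketch below. Keep in mind that a block is not the same thing as a category: the General Punctuation block also contains various spaces and formatting characters, so this is only an approximation of "punctuation", and ASCII punctuation is not covered at all.

```javascript
var blockPunctuation = /[\u2000-\u206F\u2E00-\u2E7F]/;

console.log(blockPunctuation.test('wait\u2026')); // true: U+2026 HORIZONTAL ELLIPSIS
console.log(blockPunctuation.test('wait...'));    // false: ASCII full stops are outside both blocks

// Strip everything in the two blocks (U+2013 is EN DASH)
var cleaned = 'a\u2013b\u2026'.replace(/[\u2000-\u206F\u2E00-\u2E7F]/g, '');
console.log(cleaned); // ab
```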
There also is XRegExp, a project that brings Unicode support to JavaScript by offering an alternative regex engine with extended capabilities.
And of course, required reading: mathiasbynens.be - JavaScript has a Unicode problem:
Personally, I would rather not install another library just to get this functionality. My answer does not require any external libraries, and it may also work with little modification for regex flavors besides JavaScript.
Unicode's website provides a way to translate Unicode categories into a set of code points. Since it's Unicode's website, the information from it should be accurate.
Note that you will need to exclude the high-end characters, as a JavaScript \uXXXX escape can only express code points up to FFFF (hex). I suggest checking the Abbreviate, Collate, and Escape check boxes, which strike a balance between avoiding unprintable characters and minimizing the size of the regex.
Here are some common expansions of different Unicode properties:
\p{L} (Letters):
[A-Za-z\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u037F\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u052F\u0531-\u0556\u0559\u0561-\u0587\u05D0-\u05EA\u05F0-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u08A0-\u08B4\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0980\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0AF9\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C39\u0C3D\u0C58-\u0C5A\u0C60\u0C61\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D5F-\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-
\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F5\u13F8-\u13FD\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16F1-\u16F8\u1700-\u170C\u170E-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1877\u1880-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191E\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4B\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1CE9-\u1CEC\u1CEE-\u1CF1\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2183\u2184\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2E2F\u3005\u3006\u3031-\u3035\u303B\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FD5\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA67F-\uA69D\uA6A0-\uA6E5\uA717-\uA71F\uA722-\uA788\uA78B-\uA7AD\uA7B0-\uA7B7\uA7F7-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA822\uA840-\uA873\uA882-\uA8B3\uA8F2-\uA8F7\uA8FB\uA8FD\uA90A-\uA925\uA930-\uA946\uA960-\uA97C\uA984-\uA9B2\uA9CF\uA9E0-\uA9E4\uA9E6-\uA9EF\uA9FA-\uA9FE\uAA00-\uAA28\uAA40-\uAA42\uAA44-\uAA4B\uAA60-\uAA76\uAA7A\uAA7E-\uAAAF\uAAB1\uAAB5\uAAB6\uAAB9-\uAABD\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEA\uAAF2-\uAAF4\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uAB30-\uAB5A\uAB5C-\uAB65\uAB70-\uABE2\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\
uFB1D\uFB1F-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]
\p{Nd} (Number decimal digits):
[0-9\u0660-\u0669\u06F0-\u06F9\u07C0-\u07C9\u0966-\u096F\u09E6-\u09EF\u0A66-\u0A6F\u0AE6-\u0AEF\u0B66-\u0B6F\u0BE6-\u0BEF\u0C66-\u0C6F\u0CE6-\u0CEF\u0D66-\u0D6F\u0DE6-\u0DEF\u0E50-\u0E59\u0ED0-\u0ED9\u0F20-\u0F29\u1040-\u1049\u1090-\u1099\u17E0-\u17E9\u1810-\u1819\u1946-\u194F\u19D0-\u19D9\u1A80-\u1A89\u1A90-\u1A99\u1B50-\u1B59\u1BB0-\u1BB9\u1C40-\u1C49\u1C50-\u1C59\uA620-\uA629\uA8D0-\uA8D9\uA900-\uA909\uA9D0-\uA9D9\uA9F0-\uA9F9\uAA50-\uAA59\uABF0-\uABF9\uFF10-\uFF19]
\p{P} (Punctuation):
[!-#%-*,-/\:;?#\[-\]_\{\}\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u0AF0\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2010-\u2027\u2030-\u2043\u2045-\u2051\u2053-\u205E\u207D\u207E\u208D\u208E\u2308-\u230B\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E42\u3001-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA8FC\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65]
The page also recognizes a number of obscure character classes, such as \p{Hira}, which is just the (Japanese) Hiragana characters:
[\u3041-\u3096\u309D-\u309F]
Lastly, it's possible to plug a char class with more than one Unicode property to get a shorter regex than you would get by just combining them (as long as certain settings are checked).
Having also not found a good solution, I wrote a small script a long time ago, by downloading data from the unicode specification (v.5.0.0) and generating intervals for each unicode category and subcategory in the BMP (lately replaced by a small Java program that uses its own native Unicode support).
Basically it converts \p{...} to a range of values, much like the output of the tool mentioned by Tomalak, but the intervals can end up quite large (since it's not dealing with blocks, but with characters scattered through many different places).
For instance, a Regex written like this:
var regex = unicode_hack(/\p{L}(\p{L}|\p{Nd})*/g);
Will be converted to something like this:
/[\u0041-\u005a\u0061-\u007a...]([...]|[\u0030-\u0039\u0660-\u0669...])*/g
Haven't used it a lot in practice, but it seems to work fine from my tests, so I'm posting here in case someone finds it useful. Despite the length of the resulting regexes (the example above has 3591 characters when expanded), the performance seems to be acceptable (see the tests at jsFiddle; thanks to @modiX and @Lwangaman for the improvements).
Here's the source (raw, 27.5KB; minified, 24.9KB, not much better...). It might be made smaller by unescaping the unicode characters, but OTOH will run the risk of encoding issues, so I'm leaving as it is. Hopefully with ES6 this kind of thing won't be necessary anymore.
Update: this looks like the same strategy adopted in the XRegExp Unicode plug-in mentioned by Tim Down, except that in this case regular JavaScript regexes are being used.
September 2018 (updated February 2019)
It seems that the regexp /\p{L}/u for matching letters (as Unicode categories)
works on Chrome 68.0.3440.106 and Safari 11.1.2 (13605.3.8)
NOT working on Firefox 65.0 :(
Here is a working example
In the field below you should be able to type letters but not numbers<br>
<input type="text" name="field" onkeydown="return /\p{L}/u.test(event.key)" >
I reported this bug here.
Update
After over 2 years according to: 1500035 > 1361876 > 1634135 finally this bug is fixed and will be available in Firefox v.78+
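Outside of an input handler, the same property escapes can be tried directly. This sketch assumes an engine that supports \p{...} with the u flag (the feature was standardized in ES2018); in older engines these literals are a syntax error.

```javascript
var lettersOnly = /^\p{L}+$/u;

console.log(lettersOnly.test('M\u00FChle')); // true: "Mühle", ü is a letter
console.log(lettersOnly.test('abc123'));     // false: digits are not in \p{L}
console.log(/\p{Lu}/u.test('\u00C5'));       // true: "Å" is an upper case letter
console.log(/\p{Nd}/u.test('\u0663'));       // true: ARABIC-INDIC DIGIT THREE
```

Note that the u flag is required; without it, \p is just an escaped "p".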
[^\u0000-\u007F]+ matches any run of characters that are not ASCII characters.
For example:
function isNonLatinCharacters(s) {
return /[^\u0000-\u007F]/.test(s);
}
console.log(isNonLatinCharacters("身分"));// Japanese
console.log(isNonLatinCharacters("测试"));// Chinese
console.log(isNonLatinCharacters("حمید"));// Persian
console.log(isNonLatinCharacters("테스트"));// Korean
console.log(isNonLatinCharacters("परीक्षण"));// Hindi
console.log(isNonLatinCharacters("מִבְחָן"));// Hebrew
Here are some perfect references:
Unicode range RegExp generator
Unicode Regular Expressions
Unicode 10.0 Character Code Charts
Match Unicode Block Range
As mentioned in other answers, JavaScript regexes have no support for Unicode character classes. However, there is a library that does provide this: Steven Levithan's excellent XRegExp and its Unicode plug-in.
In JavaScript, \w and \d are ASCII, while \s is Unicode. Don't ask me why. JavaScript does support \p with Unicode categories, which you can use to emulate a Unicode-aware \w and \d.
For \d use \p{N} (numbers)
For \w use [\p{L}\p{N}\p{Pc}\p{M}] (letters, numbers, underscores, marks)
Update: Unfortunately, I was wrong about this. JavaScript does not officially support \p either, though some implementations may still support this. The only Unicode support in JavaScript regexes is matching specific code points with \uFFFF. You can use those in ranges in character classes.
This will do it:
/[A-Za-z\u00C0-\u00FF ]+/.exec('hipopótamo maçã pólen ñ poção água língüa')
It explicitly selects a range of unicode characters.
It will work for latin characters, but other strange characters may be out of this range.
If you are using Babel then Unicode support is already available.
I also released a plugin which transforms your source code such that you can write regular expressions like /^\p{L}+$/. These will then be transformed into something that browsers understand.
Here is the project page of the plugin:
babel-plugin-utf-8-regex
I'm answering this question
What would be the equivalent for \p{Lu} or \p{Ll} in regExp for js?
since it was marked as an exact duplicate of the current old question.
Querying the UCD Database of Unicode 12, \p{Lu} generates 1,788 code points.
Converting to UTF-16 yields the class construct equivalency.
It's only a 4k character string and is easily doable in any regex engine.
(?:[\u0041-\u005A\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C\u011E\u0120\u0122\u0124\u0126\u0128\u012A\u012C\u012E\u0130\u0132\u0134\u0136\u0139\u013B\u013D\u013F\u0141\u0143\u0145\u0147\u014A\u014C\u014E\u0150\u0152\u0154\u0156\u0158\u015A\u015C\u015E\u0160\u0162\u0164\u0166\u0168\u016A\u016C\u016E\u0170\u0172\u0174\u0176\u0178-\u0179\u017B\u017D\u0181-\u0182\u0184\u0186-\u0187\u0189-\u018B\u018E-\u0191\u0193-\u0194\u0196-\u0198\u019C-\u019D\u019F-\u01A0\u01A2\u01A4\u01A6-\u01A7\u01A9\u01AC\u01AE-\u01AF\u01B1-\u01B3\u01B5\u01B7-\u01B8\u01BC\u01C4\u01C7\u01CA\u01CD\u01CF\u01D1\u01D3\u01D5\u01D7\u01D9\u01DB\u01DE\u01E0\u01E2\u01E4\u01E6\u01E8\u01EA\u01EC\u01EE\u01F1\u01F4\u01F6-\u01F8\u01FA\u01FC\u01FE\u0200\u0202\u0204\u0206\u0208\u020A\u020C\u020E\u0210\u0212\u0214\u0216\u0218\u021A\u021C\u021E\u0220\u0222\u0224\u0226\u0228\u022A\u022C\u022E\u0230\u0232\u023A-\u023B\u023D-\u023E\u0241\u0243-\u0246\u0248\u024A\u024C\u024E\u0370\u0372\u0376\u037F\u0386\u0388-\u038A\u038C\u038E-\u038F\u0391-\u03A1\u03A3-\u03AB\u03CF\u03D2-\u03D4\u03D8\u03DA\u03DC\u03DE\u03E0\u03E2\u03E4\u03E6\u03E8\u03EA\u03EC\u03EE\u03F4\u03F7\u03F9-\u03FA\u03FD-\u042F\u0460\u0462\u0464\u0466\u0468\u046A\u046C\u046E\u0470\u0472\u0474\u0476\u0478\u047A\u047C\u047E\u0480\u048A\u048C\u048E\u0490\u0492\u0494\u0496\u0498\u049A\u049C\u049E\u04A0\u04A2\u04A4\u04A6\u04A8\u04AA\u04AC\u04AE\u04B0\u04B2\u04B4\u04B6\u04B8\u04BA\u04BC\u04BE\u04C0-\u04C1\u04C3\u04C5\u04C7\u04C9\u04CB\u04CD\u04D0\u04D2\u04D4\u04D6\u04D8\u04DA\u04DC\u04DE\u04E0\u04E2\u04E4\u04E6\u04E8\u04EA\u04EC\u04EE\u04F0\u04F2\u04F4\u04F6\u04F8\u04FA\u04FC\u04FE\u0500\u0502\u0504\u0506\u0508\u050A\u050C\u050E\u0510\u0512\u0514\u0516\u0518\u051A\u051C\u051E\u0520\u0522\u0524\u0526\u0528\u052A\u052C\u052E\u0531-\u0556\u10A0-\u10C5\u10C7\u10CD\u13A0-\u13F5\u1C90-\u1CBA\u1CBD-\u1CBF\u1E00\u1E02\u1E04\u1E06\u1E08\u1E0A\u1E0C\u1E0E\u1E10\u1E12\u1E14\u1E16\u1E18\u1E1A\u1E1C\u1E1E\
u1E20\u1E22\u1E24\u1E26\u1E28\u1E2A\u1E2C\u1E2E\u1E30\u1E32\u1E34\u1E36\u1E38\u1E3A\u1E3C\u1E3E\u1E40\u1E42\u1E44\u1E46\u1E48\u1E4A\u1E4C\u1E4E\u1E50\u1E52\u1E54\u1E56\u1E58\u1E5A\u1E5C\u1E5E\u1E60\u1E62\u1E64\u1E66\u1E68\u1E6A\u1E6C\u1E6E\u1E70\u1E72\u1E74\u1E76\u1E78\u1E7A\u1E7C\u1E7E\u1E80\u1E82\u1E84\u1E86\u1E88\u1E8A\u1E8C\u1E8E\u1E90\u1E92\u1E94\u1E9E\u1EA0\u1EA2\u1EA4\u1EA6\u1EA8\u1EAA\u1EAC\u1EAE\u1EB0\u1EB2\u1EB4\u1EB6\u1EB8\u1EBA\u1EBC\u1EBE\u1EC0\u1EC2\u1EC4\u1EC6\u1EC8\u1ECA\u1ECC\u1ECE\u1ED0\u1ED2\u1ED4\u1ED6\u1ED8\u1EDA\u1EDC\u1EDE\u1EE0\u1EE2\u1EE4\u1EE6\u1EE8\u1EEA\u1EEC\u1EEE\u1EF0\u1EF2\u1EF4\u1EF6\u1EF8\u1EFA\u1EFC\u1EFE\u1F08-\u1F0F\u1F18-\u1F1D\u1F28-\u1F2F\u1F38-\u1F3F\u1F48-\u1F4D\u1F59\u1F5B\u1F5D\u1F5F\u1F68-\u1F6F\u1FB8-\u1FBB\u1FC8-\u1FCB\u1FD8-\u1FDB\u1FE8-\u1FEC\u1FF8-\u1FFB\u2102\u2107\u210B-\u210D\u2110-\u2112\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u2130-\u2133\u213E-\u213F\u2145\u2183\u2C00-\u2C2E\u2C60\u2C62-\u2C64\u2C67\u2C69\u2C6B\u2C6D-\u2C70\u2C72\u2C75\u2C7E-\u2C80\u2C82\u2C84\u2C86\u2C88\u2C8A\u2C8C\u2C8E\u2C90\u2C92\u2C94\u2C96\u2C98\u2C9A\u2C9C\u2C9E\u2CA0\u2CA2\u2CA4\u2CA6\u2CA8\u2CAA\u2CAC\u2CAE\u2CB0\u2CB2\u2CB4\u2CB6\u2CB8\u2CBA\u2CBC\u2CBE\u2CC0\u2CC2\u2CC4\u2CC6\u2CC8\u2CCA\u2CCC\u2CCE\u2CD0\u2CD2\u2CD4\u2CD6\u2CD8\u2CDA\u2CDC\u2CDE\u2CE0\u2CE2\u2CEB\u2CED\u2CF2\uA640\uA642\uA644\uA646\uA648\uA64A\uA64C\uA64E\uA650\uA652\uA654\uA656\uA658\uA65A\uA65C\uA65E\uA660\uA662\uA664\uA666\uA668\uA66A\uA66C\uA680\uA682\uA684\uA686\uA688\uA68A\uA68C\uA68E\uA690\uA692\uA694\uA696\uA698\uA69A\uA722\uA724\uA726\uA728\uA72A\uA72C\uA72E\uA732\uA734\uA736\uA738\uA73A\uA73C\uA73E\uA740\uA742\uA744\uA746\uA748\uA74A\uA74C\uA74E\uA750\uA752\uA754\uA756\uA758\uA75A\uA75C\uA75E\uA760\uA762\uA764\uA766\uA768\uA76A\uA76C\uA76E\uA779\uA77B\uA77D-\uA77E\uA780\uA782\uA784\uA786\uA78B\uA78D\uA790\uA792\uA796\uA798\uA79A\uA79C\uA79E\uA7A0\uA7A2\uA7A4\uA7A6\uA7A8\uA7AA-\uA7AE\uA7B0-\uA7B4\uA7B6\uA7B8\uA7BA\uA7BC\uA7BE\uA7C2\uA7C4-\uA7C6\u
FF21-\uFF3A]|(?:\uD801[\uDC00-\uDC27\uDCB0-\uDCD3]|\uD803[\uDC80-\uDCB2]|\uD806[\uDCA0-\uDCBF]|\uD81B[\uDE40-\uDE5F]|\uD835[\uDC00-\uDC19\uDC34-\uDC4D\uDC68-\uDC81\uDC9C\uDC9E-\uDC9F\uDCA2\uDCA5-\uDCA6\uDCA9-\uDCAC\uDCAE-\uDCB5\uDCD0-\uDCE9\uDD04-\uDD05\uDD07-\uDD0A\uDD0D-\uDD14\uDD16-\uDD1C\uDD38-\uDD39\uDD3B-\uDD3E\uDD40-\uDD44\uDD46\uDD4A-\uDD50\uDD6C-\uDD85\uDDA0-\uDDB9\uDDD4-\uDDED\uDE08-\uDE21\uDE3C-\uDE55\uDE70-\uDE89\uDEA8-\uDEC0\uDEE2-\uDEFA\uDF1C-\uDF34\uDF56-\uDF6E\uDF90-\uDFA8\uDFCA]|\uD83A[\uDD00-\uDD21]))
Querying the UCD database of Unicode 12, \p{Ll} generates 2,151 code points.
Converting to UTF-16 yields the class construct equivalency.
(?:[\u0061-\u007A\u00B5\u00DF-\u00F6\u00F8-\u00FF\u0101\u0103\u0105\u0107\u0109\u010B\u010D\u010F\u0111\u0113\u0115\u0117\u0119\u011B\u011D\u011F\u0121\u0123\u0125\u0127\u0129\u012B\u012D\u012F\u0131\u0133\u0135\u0137-\u0138\u013A\u013C\u013E\u0140\u0142\u0144\u0146\u0148-\u0149\u014B\u014D\u014F\u0151\u0153\u0155\u0157\u0159\u015B\u015D\u015F\u0161\u0163\u0165\u0167\u0169\u016B\u016D\u016F\u0171\u0173\u0175\u0177\u017A\u017C\u017E-\u0180\u0183\u0185\u0188\u018C-\u018D\u0192\u0195\u0199-\u019B\u019E\u01A1\u01A3\u01A5\u01A8\u01AA-\u01AB\u01AD\u01B0\u01B4\u01B6\u01B9-\u01BA\u01BD-\u01BF\u01C6\u01C9\u01CC\u01CE\u01D0\u01D2\u01D4\u01D6\u01D8\u01DA\u01DC-\u01DD\u01DF\u01E1\u01E3\u01E5\u01E7\u01E9\u01EB\u01ED\u01EF-\u01F0\u01F3\u01F5\u01F9\u01FB\u01FD\u01FF\u0201\u0203\u0205\u0207\u0209\u020B\u020D\u020F\u0211\u0213\u0215\u0217\u0219\u021B\u021D\u021F\u0221\u0223\u0225\u0227\u0229\u022B\u022D\u022F\u0231\u0233-\u0239\u023C\u023F-\u0240\u0242\u0247\u0249\u024B\u024D\u024F-\u0293\u0295-\u02AF\u0371\u0373\u0377\u037B-\u037D\u0390\u03AC-\u03CE\u03D0-\u03D1\u03D5-\u03D7\u03D9\u03DB\u03DD\u03DF\u03E1\u03E3\u03E5\u03E7\u03E9\u03EB\u03ED\u03EF-\u03F3\u03F5\u03F8\u03FB-\u03FC\u0430-\u045F\u0461\u0463\u0465\u0467\u0469\u046B\u046D\u046F\u0471\u0473\u0475\u0477\u0479\u047B\u047D\u047F\u0481\u048B\u048D\u048F\u0491\u0493\u0495\u0497\u0499\u049B\u049D\u049F\u04A1\u04A3\u04A5\u04A7\u04A9\u04AB\u04AD\u04AF\u04B1\u04B3\u04B5\u04B7\u04B9\u04BB\u04BD\u04BF\u04C2\u04C4\u04C6\u04C8\u04CA\u04CC\u04CE-\u04CF\u04D1\u04D3\u04D5\u04D7\u04D9\u04DB\u04DD\u04DF\u04E1\u04E3\u04E5\u04E7\u04E9\u04EB\u04ED\u04EF\u04F1\u04F3\u04F5\u04F7\u04F9\u04FB\u04FD\u04FF\u0501\u0503\u0505\u0507\u0509\u050B\u050D\u050F\u0511\u0513\u0515\u0517\u0519\u051B\u051D\u051F\u0521\u0523\u0525\u0527\u0529\u052B\u052D\u052F\u0560-\u0588\u10D0-\u10FA\u10FD-\u10FF\u13F8-\u13FD\u1C80-\u1C88\u1D00-\u1D2B\u1D6B-\u1D77\u1D79-\u1D9A\u1E01\u1E03\u1E05\u1E07\u1E09\u1E0B\u1E0D\u1E0F\u1E11\u1E13\u1E15\u1E17\u1E19\u1E1B\u1E1D\u1E1F\u1E21\
u1E23\u1E25\u1E27\u1E29\u1E2B\u1E2D\u1E2F\u1E31\u1E33\u1E35\u1E37\u1E39\u1E3B\u1E3D\u1E3F\u1E41\u1E43\u1E45\u1E47\u1E49\u1E4B\u1E4D\u1E4F\u1E51\u1E53\u1E55\u1E57\u1E59\u1E5B\u1E5D\u1E5F\u1E61\u1E63\u1E65\u1E67\u1E69\u1E6B\u1E6D\u1E6F\u1E71\u1E73\u1E75\u1E77\u1E79\u1E7B\u1E7D\u1E7F\u1E81\u1E83\u1E85\u1E87\u1E89\u1E8B\u1E8D\u1E8F\u1E91\u1E93\u1E95-\u1E9D\u1E9F\u1EA1\u1EA3\u1EA5\u1EA7\u1EA9\u1EAB\u1EAD\u1EAF\u1EB1\u1EB3\u1EB5\u1EB7\u1EB9\u1EBB\u1EBD\u1EBF\u1EC1\u1EC3\u1EC5\u1EC7\u1EC9\u1ECB\u1ECD\u1ECF\u1ED1\u1ED3\u1ED5\u1ED7\u1ED9\u1EDB\u1EDD\u1EDF\u1EE1\u1EE3\u1EE5\u1EE7\u1EE9\u1EEB\u1EED\u1EEF\u1EF1\u1EF3\u1EF5\u1EF7\u1EF9\u1EFB\u1EFD\u1EFF-\u1F07\u1F10-\u1F15\u1F20-\u1F27\u1F30-\u1F37\u1F40-\u1F45\u1F50-\u1F57\u1F60-\u1F67\u1F70-\u1F7D\u1F80-\u1F87\u1F90-\u1F97\u1FA0-\u1FA7\u1FB0-\u1FB4\u1FB6-\u1FB7\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FC7\u1FD0-\u1FD3\u1FD6-\u1FD7\u1FE0-\u1FE7\u1FF2-\u1FF4\u1FF6-\u1FF7\u210A\u210E-\u210F\u2113\u212F\u2134\u2139\u213C-\u213D\u2146-\u2149\u214E\u2184\u2C30-\u2C5E\u2C61\u2C65-\u2C66\u2C68\u2C6A\u2C6C\u2C71\u2C73-\u2C74\u2C76-\u2C7B\u2C81\u2C83\u2C85\u2C87\u2C89\u2C8B\u2C8D\u2C8F\u2C91\u2C93\u2C95\u2C97\u2C99\u2C9B\u2C9D\u2C9F\u2CA1\u2CA3\u2CA5\u2CA7\u2CA9\u2CAB\u2CAD\u2CAF\u2CB1\u2CB3\u2CB5\u2CB7\u2CB9\u2CBB\u2CBD\u2CBF\u2CC1\u2CC3\u2CC5\u2CC7\u2CC9\u2CCB\u2CCD\u2CCF\u2CD1\u2CD3\u2CD5\u2CD7\u2CD9\u2CDB\u2CDD\u2CDF\u2CE1\u2CE3-\u2CE4\u2CEC\u2CEE\u2CF3\u2D00-\u2D25\u2D27\u2D2D\uA641\uA643\uA645\uA647\uA649\uA64B\uA64D\uA64F\uA651\uA653\uA655\uA657\uA659\uA65B\uA65D\uA65F\uA661\uA663\uA665\uA667\uA669\uA66B\uA66D\uA681\uA683\uA685\uA687\uA689\uA68B\uA68D\uA68F\uA691\uA693\uA695\uA697\uA699\uA69B\uA723\uA725\uA727\uA729\uA72B\uA72D\uA72F-\uA731\uA733\uA735\uA737\uA739\uA73B\uA73D\uA73F\uA741\uA743\uA745\uA747\uA749\uA74B\uA74D\uA74F\uA751\uA753\uA755\uA757\uA759\uA75B\uA75D\uA75F\uA761\uA763\uA765\uA767\uA769\uA76B\uA76D\uA76F\uA771-\uA778\uA77A\uA77C\uA77F\uA781\uA783\uA785\uA787\uA78C\uA78E\uA791\uA793-\uA795\uA797\uA799\uA79B\uA79D\uA79F\uA7A1
\uA7A3\uA7A5\uA7A7\uA7A9\uA7AF\uA7B5\uA7B7\uA7B9\uA7BB\uA7BD\uA7BF\uA7C3\uA7FA\uAB30-\uAB5A\uAB60-\uAB67\uAB70-\uABBF\uFB00-\uFB06\uFB13-\uFB17\uFF41-\uFF5A]|(?:\uD801[\uDC28-\uDC4F\uDCD8-\uDCFB]|\uD803[\uDCC0-\uDCF2]|\uD806[\uDCC0-\uDCDF]|\uD81B[\uDE60-\uDE7F]|\uD835[\uDC1A-\uDC33\uDC4E-\uDC54\uDC56-\uDC67\uDC82-\uDC9B\uDCB6-\uDCB9\uDCBB\uDCBD-\uDCC3\uDCC5-\uDCCF\uDCEA-\uDD03\uDD1E-\uDD37\uDD52-\uDD6B\uDD86-\uDD9F\uDDBA-\uDDD3\uDDEE-\uDE07\uDE22-\uDE3B\uDE56-\uDE6F\uDE8A-\uDEA5\uDEC2-\uDEDA\uDEDC-\uDEE1\uDEFC-\uDF14\uDF16-\uDF1B\uDF36-\uDF4E\uDF50-\uDF55\uDF70-\uDF88\uDF8A-\uDF8F\uDFAA-\uDFC2\uDFC4-\uDFC9\uDFCB]|\uD83A[\uDD22-\uDD43]))
Note that a regex engine's implementation of \p{Lu} or \p{Ll} actually calls a
non-standard function to test the value.
The character classes shown here work differently: they are standard, but they
are searched linearly and are pretty slow when jammed into mostly a single class.
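For comparison, here is a minimal sketch of the property form in JavaScript itself. It assumes an engine that implements ES2018 property escapes, which are only recognized when the regex carries the `u` flag; the helper names `isUpper` and `isAllLower` are made up for this example.

```javascript
// Assumes an ES2018+ engine. Without the u flag, \p{...} is not a property escape.
const UPPER = /^\p{Lu}$/u;  // exactly one uppercase letter, BMP or not
const LOWER = /^\p{Ll}+$/u; // one or more lowercase letters

// Hypothetical helpers, named here only for illustration.
function isUpper(ch) {
  return UPPER.test(ch);
}

function isAllLower(str) {
  return LOWER.test(str);
}

console.log(isUpper("\u03A3"));     // GREEK CAPITAL LETTER SIGMA
console.log(isUpper("\u{1D400}"));  // MATHEMATICAL BOLD CAPITAL A, outside the BMP
console.log(isAllLower("stra\u00DFe"));
```

With the `u` flag the engine matches by code point, so the non-BMP letter is tested as one character rather than as two surrogates.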
Some insight on how a regex engine (in general) implements Unicode property classes:
Compare the performance characteristics of the property
and the class block (like the one above):
Regex1: LONG CLASS
Options: < none >
Completed iterations: 50 / 50 ( x 1 )
Matches found per iteration: 1788
Elapsed Time: 0.73 s, 727.58 ms, 727584 µs
Matches per sec: 122,872
Regex2: \p{Lu}
Options: < ICU - none >
Completed iterations: 50 / 50 ( x 1 )
Matches found per iteration: 1788
Elapsed Time: 0.07 s, 65.32 ms, 65323 µs
Matches per sec: 1,368,583
Wow, what a difference!
Let's see how properties might be implemented.
Array of Pointers [ 10FFFF ] where each index is a Code Point.
Each pointer in the Array points to a Classification structure.
A Classification structure contains fixed field elements.
Some are NULL and do not pertain.
Some contain category classifications.
Example : General Category
This is a bitmapped element that uses 30 out of 64 bits, one for each item listed below.
Whatever this Code Point supports has bit(s) set as a mask.
-Close_Punctuation
-Connector_Punctuation
-Control
-Currency_Symbol
-Dash_Punctuation
-Decimal_Number
-Enclosing_Mark
-Final_Punctuation
-Format
-Initial_Punctuation
-Letter_Number
-Line_Separator
-Lowercase_Letter
-Math_Symbol
-Modifier_Letter
-Modifier_Symbol
-Nonspacing_Mark
-Open_Punctuation
-Other_Letter
-Other_Number
-Other_Punctuation
-Other_Symbol
-Paragraph_Separator
-Private_Use
-Space_Separator
-Spacing_Mark
-Surrogate
-Titlecase_Letter
-Unassigned
-Uppercase_Letter
When a regex containing something like \p{Lu} is parsed, the property
is translated directly into:
Classification Structure element offset : General Category
A check of that element for bit item : Uppercase_Letter
Another example: when a regex containing the punctuation property \p{P} is parsed, it
is translated into:
Classification Structure element offset : General Category
A check of that element for any of these items' bits, which are joined into a mask :
-Close_Punctuation
-Connector_Punctuation
-Dash_Punctuation
-Final_Punctuation
-Initial_Punctuation
-Open_Punctuation
-Other_Punctuation
The offset and bit or bit(mask) are stored as a regex step for that property.
The lookup table is created once for all Unicode Code Points using this array.
When a character is checked, it is as simple as using the CP as an index into
this array and checking the Classification Structure's specific element for that bit(mask).
This structure is expandable and indirect, so it can support much more complex lookups. This is just a simple example.
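The lookup described above can be sketched in JavaScript as follows. This is only an illustration under stated assumptions, not how any particular engine does it: the bit assignments in `GC`, the names `table` and `hasCategory`, and the `LIMIT` cutoff are all invented here, the table holds one byte per code point instead of a pointer to a structure, and it is filled by borrowing the host engine's own \p{...} data (ES2018+) for the first 0x3000 code points only.

```javascript
// Hypothetical bit assignments: one bit per General Category value tracked here.
const GC = {
  Uppercase_Letter:      1 << 0,
  Close_Punctuation:     1 << 1,
  Connector_Punctuation: 1 << 2,
  Dash_Punctuation:      1 << 3,
  Final_Punctuation:     1 << 4,
  Initial_Punctuation:   1 << 5,
  Open_Punctuation:      1 << 6,
  Other_Punctuation:     1 << 7,
};

// \p{P} becomes a mask joined from the seven punctuation bits.
const PUNCT_MASK =
  GC.Close_Punctuation | GC.Connector_Punctuation | GC.Dash_Punctuation |
  GC.Final_Punctuation | GC.Initial_Punctuation | GC.Open_Punctuation |
  GC.Other_Punctuation;

// One slot per code point, U+0000..U+10FFFF.
const table = new Uint8Array(0x110000);

// Built once. Assumption: only the first LIMIT code points are filled,
// using the engine's own property escapes as the data source.
const LIMIT = 0x3000;
const probes = Object.entries(GC).map(
  ([name, bit]) => [new RegExp("^\\p{" + name + "}$", "u"), bit]
);
for (let cp = 0; cp < LIMIT; cp++) {
  const ch = String.fromCodePoint(cp);
  for (const [re, bit] of probes) {
    if (re.test(ch)) table[cp] |= bit;
  }
}

// The per-character check: index by code point, then test the bit or mask.
function hasCategory(cp, mask) {
  return (table[cp] & mask) !== 0;
}

console.log(hasCategory(0x41, GC.Uppercase_Letter)); // "A" against the \p{Lu} bit
console.log(hasCategory(0x28, PUNCT_MASK));          // "(" against the \p{P} mask
```

The cost of each check is one array read and one bitwise AND, regardless of how many code points carry the property.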
Compare that direct lookup with a character class search:
All classes are a linear list of items searched from left to right.
In this comparison, given that our target string contains only the
complete set of uppercase Unicode letters, the law of averages would
predict that half of the items in the class would have to be range-checked
to find a match.
This is a huge disadvantage in performance.
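A rough sketch of that linear search, counting how many items are range-checked before a match. The range list is a small hand-picked subset of uppercase ranges chosen for this example, and `classMatch` is a hypothetical name; a real engine's class representation may differ.

```javascript
// A class modeled as an ordered list of [low, high] code point ranges.
// Illustrative subset only: Basic Latin, Latin-1, Greek and Cyrillic capitals.
const upperRanges = [
  [0x0041, 0x005A],
  [0x00C0, 0x00D6],
  [0x00D8, 0x00DE],
  [0x0391, 0x03A1],
  [0x03A3, 0x03AB],
  [0x0410, 0x042F],
];

// Scan left to right, as a class is searched, and report the number of checks.
function classMatch(ranges, cp) {
  let checks = 0;
  for (const [lo, hi] of ranges) {
    checks++;
    if (cp >= lo && cp <= hi) {
      return { matched: true, checks };
    }
  }
  return { matched: false, checks };
}

console.log(classMatch(upperRanges, 0x41));  // found in the first range
console.log(classMatch(upperRanges, 0x416)); // found only in the last range
```

A character near the end of the list, or one that does not match at all, pays for every range before it, which is where the cost grows with a class of several hundred items.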
However, if the lookup tables are not there, or are not up to date
with the latest Unicode release (12 as of this date),
then this would be the only way.
In fact, it is mostly the only way to get the complete set of Emoji
characters, as there is no specific property (or reasoning) to their assignment.
You can also use:
function myFunction() {
  var str = "xq234";
  // Letters only: ASCII a-z and A-Z plus the Latin-1 range À-ÿ.
  var allowChars = /^[a-zA-ZÀ-ÿ]+$/;
  // "true" when str contains anything outside the allowed set, else "false".
  var res = allowChars.test(str) ? "false" : "true";
  document.getElementById("demo").innerHTML = res;
}
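One caveat with the À-ÿ range: it spans U+00C0 to U+00FF, which also contains the symbols × (U+00D7) and ÷ (U+00F7). A possible stricter variant, assuming an ES2018+ engine, uses the \p{L} property instead; the names `legacyLetters` and `unicodeLetters` are made up for this sketch.

```javascript
// The range from the snippet above, written with escapes: U+00C0..U+00FF.
const legacyLetters = /^[a-zA-Z\u00C0-\u00FF]+$/;

// Assumption: ES2018 property escapes are available. \p{L} is any letter.
const unicodeLetters = /^\p{L}+$/u;

console.log(legacyLetters.test("\u00D7"));    // multiplication sign passes the range check
console.log(unicodeLetters.test("\u00D7"));   // but it is not a letter
console.log(unicodeLetters.test("caf\u00E9"));
console.log(unicodeLetters.test("xq234"));    // digits are rejected by both patterns
```

Note that \p{L} accepts letters from every script, so it is broader than the original range as well as stricter about symbols.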
