How to use Unicode character groups in JavaScript's regexes?

Is there a way to use patterns like "\p{L}" in JavaScript natively?
(I suppose that is a Perl-compatible syntax.)
I'm interested firstly in Firefox support, and possibly WebKit.

No, \p{..} is not supported natively by any of the big browsers. However, it does work in JavaScript if you use the XRegExp library and its Unicode plugins.
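This answer predates ECMAScript 2018, which added Unicode property escapes to the language. A minimal sketch, assuming an engine that implements ES2018 (the variable name is only illustrative); the u flag is required:

```javascript
// Assumption: the engine implements ES2018 Unicode property escapes.
const lettersOnly = /^\p{L}+$/u;

console.log(lettersOnly.test("παν語")); // true
console.log(lettersOnly.test("abc1"));  // false

// Without the u flag, \p is only an escaped "p", so the pattern
// matches the literal text "p{L}" instead of a letter.
console.log(/\p{L}/.test("p{L}")); // true
console.log(/\p{L}/.test("a"));    // false
```

On older engines the pattern with the u flag throws a SyntaxError, so a library such as XRegExp is still the fallback there.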

Unfortunately, no. You can only specify a set of characters in the usual syntax, writing characters and ranges in brackets, but this becomes awkward since e.g. letters are scattered all around the Unicode space, with other characters between them.
There’s an inefficient workaround: fetch the UnicodeData.txt file from the Unicode site, put its content inside your JavaScript code as data, and parse it. And then you could have the data e.g. in an array of objects containing the Unicode properties, such as gc (General Category), which tells you whether the character is a letter or not. But even then, you would just have the data handy for simple testing, not as something you can use as a constituent of a regexp.
In theory, you could use the data to construct a regexp... but it would be rather large.
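To give an idea of what "constructing a regexp from the data" could look like, here is a minimal sketch. The range table is a hypothetical, heavily truncated sample (a real one derived from UnicodeData.txt would have hundreds of entries), and the function names are made up for illustration:

```javascript
// Hypothetical sample of letter ranges, NOT a complete table.
const sampleLetterRanges = [
  [0x0041, 0x005a], // A-Z
  [0x0061, 0x007a], // a-z
  [0x0391, 0x03a1], // Greek capital Alpha to Rho
  [0x03b1, 0x03c9], // Greek small alpha to omega
  [0x0410, 0x044f], // Cyrillic А to я
];

// Format a BMP code point as a \uXXXX escape for use inside a pattern.
function toEscape(codePoint) {
  return "\\u" + codePoint.toString(16).padStart(4, "0");
}

// Join [start, end] pairs into one anchored character class.
function rangesToRegExp(ranges) {
  const body = ranges
    .map(([start, end]) => toEscape(start) + "-" + toEscape(end))
    .join("");
  return new RegExp("^[" + body + "]+$");
}

const lettersOnly = rangesToRegExp(sampleLetterRanges);
console.log(lettersOnly.test("παν"));    // true
console.log(lettersOnly.test("Привет")); // true
console.log(lettersOnly.test("abc1"));   // false
```

This only covers the Basic Multilingual Plane; characters beyond U+FFFF would need surrogate-pair handling or the u flag.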

No, JavaScript has a slightly different syntax. To match Unicode characters you have to use escapes like \uXXXX. However, in practice, if your page and files are in UTF-8, putting non-ASCII characters directly in a class such as [абвг] works too.
http://www.javascriptkit.com/jsref/regexp.shtml
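A small sketch of both spellings, assuming the script file is saved and served as UTF-8 (the variable names are illustrative):

```javascript
// The same four Cyrillic letters (U+0430 to U+0433), once as an
// escaped range and once as literal characters.
const escapedClass = /[\u0430-\u0433]/;
const literalClass = /[абвг]/;

console.log(escapedClass.test("в")); // true
console.log(literalClass.test("в")); // true
console.log(escapedClass.test("д")); // false, U+0434 is outside the range
console.log(literalClass.test("д")); // false
```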

The library found here:
http://inimino.org/~inimino/blog/javascript_cset
seems to work for me and is fairly small and independent of other libraries.

Related

JavaScript Unicode standard format

Is there any standard on how to write Unicode characters in JavaScript/JSON?
For instance, is there any difference between \u011b and \u011B? Most of the web examples use the second format. Also, there is an option to write some characters in a short format like \xe1. Which format is preferable (standard)? Is it good practice to mix these formats together, and what about performance?

For the first question: both versions are valid. It is more a matter of coding convention: prefer whatever convention is already used in your files or project. Failing that, check your community (the conventions used by other programs you rely on heavily), and as a last option just choose one. In any case, stay consistent.
Personally I prefer neither for code: UTF-8 is so widely used, and browsers understand it, that I would put the actual character in directly (as a character, not as an escape sequence). If the code point is important, I would add it in a comment. It is expected that all developers and tools have UTF-8 editors.
JavaScript strings are sequences of 16-bit code units (UCS-2, the precursor of UTF-16), so a character outside the 16-bit range, such as some emoji, takes two code units.
The byte format should not be used for text: it hides the meaning. There are exceptions, e.g. checking which encoding you received from the user, or whether there is a BOM (but those are just signatures). For other binary cases it is OK to use escapes like \x1e, e.g. for key identification.
Note: you should really follow one set of coding guidelines. Search for them and you will find many, e.g. this one from Google (which is maybe too much): https://google.github.io/styleguide/jsguide.html
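A quick check of the equivalence, as a sketch (variable names are illustrative). Note that the \x short form exists in JavaScript string literals but not in JSON:

```javascript
// Hex digits in escapes are case-insensitive.
const lowerHex = "\u011b";
const upperHex = "\u011B";
console.log(lowerHex === upperHex); // true

// \xNN is a two-digit shorthand for code points up to U+00FF.
console.log("\xe1" === "\u00e1"); // true

// JSON accepts \uXXXX in either case, but rejects \x escapes.
console.log(JSON.parse('"\\u011b"') === JSON.parse('"\\u011B"')); // true
let jsonRejectsHex = false;
try {
  JSON.parse('"\\xe1"');
} catch (e) {
  jsonRejectsHex = e instanceof SyntaxError;
}
console.log(jsonRejectsHex); // true
```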
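To illustrate the two-code-unit point, a minimal sketch using U+1F600 as the sample emoji (the \u{...} escape and codePointAt assume ES2015 or later):

```javascript
const emoji = "\u{1F600}"; // one code point outside the 16-bit range

console.log(emoji.length);                        // 2 (code units)
console.log(emoji.charCodeAt(0).toString(16));    // d83d (high surrogate)
console.log(emoji.charCodeAt(1).toString(16));    // de00 (low surrogate)
console.log(emoji.codePointAt(0).toString(16));   // 1f600
console.log([...emoji].length);                   // 1 (iteration is by code point)
```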

Is it safe to use UTF-8 character literals in JavaScript source code?

Is it safe to write JavaScript source code (to be executed in the browser) which includes UTF-8 character literals?
For example, I would like to use an ellipsis literal in a string like this:
var foo = "Oops… Something went wrong";
Do "modern" browsers support this? Is there a published browser support matrix somewhere?

JavaScript is by specification a Unicode language, so Unicode characters in strings should be safe. You can use hex escapes (\u8E24) as an alternative. Make sure your script files are served with proper content type headers.
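As a sketch of the two options, assuming the file is saved and served as UTF-8 (variable names are illustrative), the literal and the escaped form produce the same string:

```javascript
const withLiteral = "Oops… Something went wrong";
const withEscape = "Oops\u2026 Something went wrong"; // U+2026 HORIZONTAL ELLIPSIS

console.log(withLiteral === withEscape); // true
console.log(withLiteral.charCodeAt(4).toString(16)); // 2026
```

If the comparison ever came out false, that would point at the file being decoded with the wrong encoding rather than at the language itself.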
Note that characters beyond one- and two-byte sequences are problematic, and that JavaScript regular expressions are terrible with characters beyond the first codepage. (Well maybe not "terrible", but primitive at best.)
You can also use Unicode letters, Unicode combining marks, and Unicode connector punctuation characters in identifiers, in case you want to impress your friends. Thus
var wavy﹏line = "wow";
is perfectly good JavaScript (but good luck with your bug report if you find a browser where it doesn't work).
Read all about it in the spec, or use it to fall asleep at night :)

Determine all ISO 15924 script codes in JavaScript string

I'm looking for an efficient way to take a JavaScript string and return all of the scripts which occur in that string.
Full UTF-16 including the "astral" plane / non-BMP characters which require surrogate pairs must be correctly handled. This is possibly the main problem since JavaScript is not UTF-16 aware.
It only has to deal with codepoints so no fancy awareness of complex scripts or grapheme clusters is necessary. (This will be obvious to some of you anyway.)
Example:
stringToIso15924("παν語");
would return something like:
[ "Grek", "Hani" ]
I'm using node.js and some Unicode libraries such as XRegExp and unorm already so I don't mind adding other libraries that might already handle or ease such a feature.
I'm not aware of a JavaScript library that can look up character properties such as script codes, so this is probably the second part of the problem.
The third part of the problem is just to avoid inefficiencies.

I answered a similar, or at least related, question. In this pastebin you will find a (looooong) function that returns the script name for a character. It should be easy to modify it to accommodate a string.
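As an alternative to a hand-written lookup function, here is a minimal sketch that assumes an engine with ES2018 Unicode property escapes. The SCRIPT_TESTS table is hypothetical and deliberately incomplete; a real version would list every ISO 15924 code you care about:

```javascript
// Hypothetical, incomplete mapping from ISO 15924 codes to script tests.
const SCRIPT_TESTS = [
  ["Latn", /\p{Script=Latin}/u],
  ["Grek", /\p{Script=Greek}/u],
  ["Cyrl", /\p{Script=Cyrillic}/u],
  ["Arab", /\p{Script=Arabic}/u],
  ["Hani", /\p{Script=Han}/u],
];

function stringToIso15924(text) {
  const found = new Set();
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of text) {
    for (const [code, pattern] of SCRIPT_TESTS) {
      if (pattern.test(ch)) {
        found.add(code);
        break;
      }
    }
  }
  return [...found]; // in order of first appearance
}

console.log(stringToIso15924("παν語")); // [ 'Grek', 'Hani' ]
```

Characters whose script is not in the table (or that are Common or Inherited, such as spaces and digits) are simply skipped in this sketch.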

need a JavaScript Regex that requires upper or lowercase letters

I have a regex that right now only allows lowercase letters, I need one that requires either lowercase or uppercase letters:
/(?=.*[a-z])/

You Can’t Get There from Here
I have a regex that right now only allows lowercase letters, I need one that requires either lowercase or uppercase letters: /(?=.*[a-z])/
Unfortunately, it is utterly impossible to do this correctly using Javascript! Read this flavor comparison’s ECMA column for all of what Javascript cannot do.
Theory vs Practice
The proper pattern for lowercase is the standard Unicode derived binary property \p{Lowercase}, and the proper pattern for uppercase is similarly \p{Uppercase}. These are normative properties that sometimes include non-letters in them under certain exotic circumstances.
Using just General Category properties, you can have \p{Ll} for Lowercase_Letter, \p{Lu} for Uppercase_Letter, and \p{Lt} for Titlecase_Letter. (Remember, there are three cases in Unicode, not two.) There is a standard alias \p{LC} which means [\p{Lu}\p{Lt}\p{Ll}].
If you want a letter that is not a lowercase letter, you could use (?=\P{Ll})\pL. Written in longhand that’s (?=\P{Lowercase_Letter})\p{Letter}. Again, these miss some of the Other_Lowercase code points that \p{Lowercase} recognizes. I must again stress that the Lowercase property is a superset of the Lowercase_Letter property.
Remember the previous paragraph, swapping in upper everywhere I have written lower, and you get the same thing for the capitals.
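For what it is worth, these properties can be tried out in engines that implement ECMAScript 2018, which postdates this answer. A minimal sketch under that assumption; the u flag is required, and the short form \pL has to be written \p{L}:

```javascript
// Assumption: ES2018 Unicode property escapes are available.
console.log(/\p{Lu}/u.test("\u00C9"));      // true, É is an uppercase letter
console.log(/\p{Ll}/u.test("\u00E9"));      // true, é is a lowercase letter
console.log(/\p{Lt}/u.test("\u01C5"));      // true, U+01C5 is a titlecase letter
console.log(/^\p{LC}+$/u.test("Aa\u01C5")); // true, LC covers Lu, Ll and Lt

// U+2170 SMALL ROMAN NUMERAL ONE has the Lowercase property
// but its General Category is Nl, not Ll.
console.log(/\p{Lowercase}/u.test("\u2170")); // true
console.log(/\p{Ll}/u.test("\u2170"));        // false

// A letter that is not a lowercase letter.
const notLowercaseLetter = /(?=\P{Ll})\p{L}/u;
console.log(notLowercaseLetter.test("A")); // true
console.log(notLowercaseLetter.test("a")); // false
```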
Possible Platforms
Because access to these essential properties is the minimal level of critical functionality necessary for Unicode regular expressions, some versions of Javascript implement them in just the way I have written them above. However, the standard for Javascript still does not require them, so you cannot in general count on them. This means that it is impossible to do this correctly under all implementations of Javascript.
Languages in which it is possible to do what you want done minimally include:
C♯ and Java (both only General Categories)
Ruby if and only if v1.9 or better (only binary properties, including General Categories)
PHP and PCRE (only General Category and Script properties plus a couple extras)
ICU’s C++ library and Perl, which both support all Unicode properties
Of those listed above, only the last line’s entries, ICU and Perl, strictly and completely meet all Level 1 compliance requirements (plus some of Levels 2 and 3) for the proper handling of Unicode in regexes. However, all of those I’ve listed in the previous paragraph’s bullets can easily handle most, and quite probably all, of what you need.
Javascript is not amongst those, however. Your version might, though, if you are very lucky and never have to run on a standard-only Javascript platform.
Summary
So very sadly, you cannot really use Javascript regexes for Unicode work unless you have a non-standard extension. Some people do, but most do not. If you do not, you may have to use a different platform until the relevant ECMA standard catches up with the 21st century (Unicode 3.1 came out a decade ago!!).
If anyone knows of a Javascript library that implements the Level 1 requirements of UTS#18 on Unicode Regular Expressions including both RL1.2 “Properties” and RL1.2a “Annex C: Compatibility Properties”, please chime in.

Not sure if you mean mixed-case, or strictly lowercase plus strictly uppercase.
Here's the mixed-case version:
/^[a-zA-Z]+$/
And the strictly one-or-the-other version:
/^([a-z]+|[A-Z]+)$/
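A quick sketch of how the two differ (variable names are illustrative). Both are ASCII-only, so accented letters fail either one:

```javascript
const mixedCase = /^[a-zA-Z]+$/;
const oneCaseOnly = /^([a-z]+|[A-Z]+)$/;

console.log(mixedCase.test("Hello"));   // true
console.log(oneCaseOnly.test("Hello")); // false, mixes both cases
console.log(oneCaseOnly.test("HELLO")); // true
console.log(oneCaseOnly.test("hello")); // true
console.log(mixedCase.test("h\u00E9llo")); // false, é is not in a-z or A-Z
```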

Try /(?=.*[a-z])/i
Note the i at the end, this makes the expression case insensitive.

Or add an uppercase range to your regex:
/(?=.*[a-zA-Z])/
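For ASCII input, the explicit range and the case-insensitive flag give the same result. A minimal sketch with made-up sample strings:

```javascript
const withRange = /(?=.*[a-zA-Z])/;
const withFlag = /(?=.*[a-z])/i;

for (const sample of ["123A", "123a", "123"]) {
  // Each line prints the sample followed by both results.
  console.log(sample, withRange.test(sample), withFlag.test(sample));
}
// 123A true true
// 123a true true
// 123 false false
```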

Regex: Disable Symbols

Is there any way to disallow all symbols, punctuation, block elements, geometric shapes and dingbats such as these:
✁ ✂ ✃ ✄ ✆ ✇ ✈ ✉ ✌ ✍ ✎ ✏ ✐ ✑ ✒ ✓ ✔ ✕ ⟻ ⟼ ⟽ ⟾ ⟿ ⟻ ⟼ ⟽ ⟾ ⟿ ▚ ▛ ▜ ▝ ▞ ▟
without writing all of them out in the regular expression pattern, while allowing all other normal language characters such as Chinese, Arabic, etc., like these:
文化中国 الجزيرة نت
?
I'm building a javascript validation function and my real problem is that I can't use:
[a-zA-Z0-9]
Because this rejects a lot of languages too, not just the symbols.

The Unicode standard divides up all the possible characters into code charts. Each code chart contains related characters. If you want to exclude (or include) only certain classes of characters, you will have to make a suitable list of exclusions (or inclusions). Unicode is big, so this might be a lot of work.

Not really.
JavaScript doesn't support Unicode Character Properties. The closest you'll get is excluding ranges by Unicode code point as Greg Hewgill suggested.
For example, to match all of the characters from the Arrows block through the Block Elements block (a range that includes the mathematical operators):
/[\u2190-\u259F]/
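A quick sketch of what that range does and does not cover. U+2190 to U+259F runs from the Arrows block to the end of Block Elements, so dingbats such as ✂ (U+2702) fall outside it and would need their own range:

```javascript
const arrowsThroughBlocks = /[\u2190-\u259F]/;

console.log(arrowsThroughBlocks.test("\u2192")); // true, → RIGHTWARDS ARROW
console.log(arrowsThroughBlocks.test("\u259A")); // true, ▚ from Block Elements
console.log(arrowsThroughBlocks.test("\u2702")); // false, ✂ is in Dingbats
console.log(arrowsThroughBlocks.test("\u6587")); // false, 文 is a CJK ideograph
```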

This depends on your regex dialect. Unfortunately, probably most existing JavaScript engines don't support Unicode character classes.
In regex engines such as the one in (recent) Perl or .Net, Unicode character classes can be referenced.
\p{L}: any kind of letter from any language.
\p{N}: any number symbol from any language (including, as I recall, the Indian and Arabic and CJK number glyphs).
Because Unicode supports composed and decomposed glyphs, you may run into certain complexities: namely, if only decomposed forms exist, it's possible that you might accidentally exclude some diacritic marks in your matching pattern, and you may need to explicitly allow glyphs of the type Mark. You can mitigate this somewhat by using, if I recall correctly, a string that has been normalized using NFKC normalization (this only helps for characters that have a composed form). In environments that support Unicode well, there's usually a function that allows you to normalize Unicode strings fairly easily (true in Java and .Net, at least).
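A minimal sketch of the Mark and normalization issue, assuming an engine with String.prototype.normalize (ES2015) and Unicode property escapes (ES2018). It uses NFC; NFKC composes in the same way but additionally folds compatibility characters. The variable names are illustrative:

```javascript
const decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT
const composed = decomposed.normalize("NFC");
console.log(composed === "\u00E9"); // true, a single precomposed é

console.log(/^\p{L}+$/u.test(decomposed));        // false, U+0301 is a Mark
console.log(/^[\p{L}\p{M}]+$/u.test(decomposed)); // true once marks are allowed
console.log(/^\p{L}+$/u.test(composed));          // true after normalization

// Letters, marks, numbers and whitespace from any language; no symbols.
const allowedText = /^[\p{L}\p{M}\p{N}\s]+$/u;
console.log(allowedText.test("文化中国 الجزيرة نت")); // true
console.log(allowedText.test("\u2702"));             // false, ✂ is a symbol
```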
Edited to add: If you've started down this path, or have considered it, in order to regain some sanity, you may want to experiment with the Unicode Plugin for XRegExp (which will require you to take a dependency on XRegExp).

JavaScript regular expressions do not have native Unicode support. An alternative is to validate (or sanitize) the string on the server side, or to use a non-native regex library. While I've never used it, XRegExp is such a library, and it has a Unicode Plugin.

Take a look at the Unicode Planes. You probably want to exclude everything but planes 0 and 2. After that, it gets ugly as you'll have to exclude a lot of plane 0 on a case-by-case basis.
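A rough sketch of the plane check, assuming ES2015 (for...of and codePointAt). The function names are made up, and this only does the coarse first step; the case-by-case exclusions inside plane 0 are not attempted here:

```javascript
// A plane is a block of 0x10000 code points, so the plane number
// is the code point shifted right by 16 bits.
function planeOf(codePoint) {
  return codePoint >> 16;
}

function onlyPlanes0And2(text) {
  // for...of walks the string by code point, not by 16-bit code unit.
  for (const ch of text) {
    const plane = planeOf(ch.codePointAt(0));
    if (plane !== 0 && plane !== 2) return false;
  }
  return true;
}

console.log(onlyPlanes0And2("\u6587\u5316")); // true, both in plane 0
console.log(onlyPlanes0And2("\u{20000}"));    // true, plane 2 ideograph
console.log(onlyPlanes0And2("\u{1F600}"));    // false, plane 1 emoji
```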
