Regex: Disable Symbols - javascript

Is there any way to disallow all symbols, punctuation, block elements, geometric shapes and dingbats such as these:
✁ ✂ ✃ ✄ ✆ ✇ ✈ ✉ ✌ ✍ ✎ ✏ ✐ ✑ ✒ ✓ ✔ ✕ ⟻ ⟼ ⟽ ⟾ ⟿ ⟻ ⟼ ⟽ ⟾ ⟿ ▚ ▛ ▜ ▝ ▞ ▟
without writing them all down in the regular expression pattern, while still allowing all other normal language characters (Chinese, Arabic, etc.) such as these:
文化中国 الجزيرة نت
?
I'm building a javascript validation function and my real problem is that I can't use:
[a-zA-Z0-9]
Because this rejects a lot of languages too, not just the symbols.

The Unicode standard divides up all the possible characters into code charts. Each code chart contains related characters. If you want to exclude (or include) only certain classes of characters, you will have to make a suitable list of exclusions (or inclusions). Unicode is big, so this might be a lot of work.

Not really.
JavaScript doesn't support Unicode Character Properties. The closest you'll get is excluding ranges by Unicode code point as Greg Hewgill suggested.
For example, to match all of the characters in the blocks from Arrows through Block Elements (a range that includes the Mathematical Operators):
/[\u2190-\u259F]/
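A minimal sketch of the same range-exclusion idea, extended to cover the sample characters from the question. The choice of blocks (Dingbats at U+2700 to U+27BF and Supplemental Arrows-A at U+27F0 to U+27FF, added to the range above) and the helper name are assumptions made for illustration, not a complete list of symbols.

```javascript
// Assumed block choices: Arrows through Block Elements, Dingbats,
// and Supplemental Arrows-A. A real exclusion list would be longer.
const excludedSymbols = /[\u2190-\u259F\u2700-\u27BF\u27F0-\u27FF]/;

// Hypothetical helper: true when the text contains any excluded symbol.
function containsExcludedSymbol(text) {
  return excludedSymbols.test(text);
}

console.log(containsExcludedSymbol("✂ ⟼ ▚")); // true
console.log(containsExcludedSymbol("文化中国"));  // false
```

All of these ranges sit in the Basic Multilingual Plane, so plain \uXXXX escapes are enough and no regex flag is needed.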

This depends on your regex dialect. Unfortunately, probably most existing JavaScript engines don't support Unicode character classes.
In regex engines such as the one in (recent) Perl or .Net, Unicode character classes can be referenced.
\p{L}: any kind of letter from any language.
\p{N}: any number symbol from any language (including, as I recall, the Indian and Arabic and CJK number glyphs).
Because Unicode supports composed and decomposed glyphs, you may run into certain complexities: namely, if only decomposed forms exist, it's possible that you might accidentally exclude some diacritic marks in your matching pattern, and you may need to explicitly allow glyphs of the type Mark (\p{M}). You can mitigate this somewhat by using, if I recall correctly, a string that has been normalized to Normalization Form KC (this only helps for characters that have a composed form). In environments that support Unicode well, there's usually a function that allows you to normalize Unicode strings fairly easily (true in Java and .Net, at least).
Edited to add: If you've started down this path, or have considered it, in order to regain some sanity, you may want to experiment with the Unicode Plugin for XRegExp (which will require you to take a dependency on XRegExp).
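For engines that implement ES2018 Unicode property escapes (available only with the u flag), the letter, number and mark classes described above can be sketched as follows. The helper name and the decision to allow space separators are assumptions for illustration; String.prototype.normalize is the built-in normalization function.

```javascript
// Hypothetical helper: accepts only letters, numbers, combining marks
// and space separators. Needs ES2018 property escapes (the `u` flag).
function isPlainText(input) {
  // NFKC composes base letter + combining mark where a composed form exists.
  const normalized = input.normalize("NFKC");
  return /^[\p{L}\p{N}\p{M}\p{Zs}]+$/u.test(normalized);
}

console.log(isPlainText("文化中国"));     // true
console.log(isPlainText("الجزيرة نت"));  // true
console.log(isPlainText("✂ ▚"));        // false
```

Keeping \p{M} in the class means that text whose marks have no composed form is still accepted after normalization.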

JavaScript regular expressions do not have native Unicode support. An alternative is to validate (or sanitize) the string on the server side, or to use a non-native regex library. While I've never used it, XRegExp is such a library, and it has a Unicode Plugin.

Take a look at the Unicode Planes. You probably want to exclude everything but planes 0 and 2. After that, it gets ugly as you'll have to exclude a lot of plane 0 on a case-by-case basis.

Related

Which encoding to use for many international languages

I am setting up a little website and would like to make it international. All the content will be stored in an external xml in different languages and parsed into the html via javascript.
Now the problem is, there are also German umlauts, Russian, Chinese and Japanese characters, and also right-to-left languages like Arabic and Farsi.
What would be the best way/solution? Is there an "international encoding" which can display all languages properly? Or is there any other solution you would suggest?
Thanks in advance!
All of the Unicode transformations (UTF-8, UTF-16, UTF-32) can encode all Unicode characters. You pick which you want to use based on the size: If most of your text is in western scripts, probably UTF-8, as it will use only one byte for most of the characters, but 2, 3, or 4 if needed. If you're encoding far east scripts, you'll probably want one of the other transformations.
The fundamental thing here is that it's all Unicode; the transformations are just different ways of representing the same characters.
The co-founder of Stack Overflow had a good article on this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Regardless of what encoding you use for your document, note that if you're processing these strings in JavaScript, JavaScript strings are UTF-16 (except that invalid values are tolerated), even if the document is in UTF-8 or UTF-32. This means that, for instance, each of those emojis people are so excited about these days looks like two "characters" to JavaScript, because it takes two 16-bit code units of UTF-16 to represent. Like 😎, for instance:
console.log("😎".length); // 2
So you'll need to be careful not to split up the two halves of characters that are encoded in two words of UTF-16.
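One way to avoid the problem, sketched here with an arbitrary sample string, is to use the string iterator (spread, Array.from, for...of), which walks code points rather than 16-bit code units:

```javascript
const sample = "a😎b";

console.log(sample.length);       // 4 (UTF-16 code units)
console.log([...sample].length);  // 3 (code points)

// codePointAt reads a whole surrogate pair when given the index of its first half.
console.log(sample.codePointAt(1).toString(16)); // "1f60e"

// for...of also iterates by code point, so the emoji arrives in one piece.
const pieces = [];
for (const ch of sample) {
  pieces.push(ch);
}
console.log(pieces); // [ 'a', '😎', 'b' ]
```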
The normal (and recommended) solution for multi-lingual sites is to use UTF-8. It can deal with any characters that have been assigned Unicode code points, with a couple of caveats:
Unicode is a versioned standard, and different JavaScript implementations may support different Unicode versions.
If your text includes characters outside of the Unicode Basic Multilingual Plane (BMP), then you need to do your text processing (in Javascript) in a way that is Unicode aware. For instance, if you use the Javascript String class you need to take proper account of surrogate pairs when doing text manipulation.
(A JavaScript String is actually encoded as UTF-16. It has methods that allow you to manipulate it as Unicode code points, but methods and attributes such as substring and length use code-unit rather than code-point indexing. If you are not careful, you can end up splitting a string between the high and low parts of a surrogate pair. The result will be something that cannot be displayed properly. This only affects code points in higher planes ... but that includes the new emoji code points.)
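To illustrate the splitting hazard, here is a small sketch that compares substring with a code-point-aware alternative. The helper name takeCodePoints is hypothetical; it relies only on Array.from, which iterates a string by code point.

```javascript
// Hypothetical helper: return the first `count` code points of a string.
function takeCodePoints(str, count) {
  return Array.from(str).slice(0, count).join("");
}

const text = "ok😎!";
const naive = text.substring(0, 3);   // cuts between the two surrogate halves
const safe = takeCodePoints(text, 3); // keeps the emoji intact

console.log(naive.length, safe.length); // 3 4

// A string that ends in a lone high surrogate has been split mid-character.
const endsInLoneHighSurrogate = /[\uD800-\uDBFF]$/;
console.log(endsInLoneHighSurrogate.test(naive)); // true
console.log(endsInLoneHighSurrogate.test(safe));  // false
```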

Is it safe to use UTF-8 character literals in JavaScript source code?

Is it safe to write JavaScript source code (to be executed in the browser) which includes UTF-8 character literals?
For example, I would like to use an ellipsis literal in a string, like this:
var foo = "Oops… Something went wrong";
Do "modern" browsers support this? Is there a published browser support matrix somewhere?
JavaScript is by specification a Unicode language, so Unicode characters in strings should be safe. You can use hex escapes (\u8E24) as an alternative. Make sure your script files are served with proper content type headers.
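As a quick check of the escape alternative, the ellipsis from the question is U+2026, so the literal and escaped spellings below should produce the same string, assuming the source file is actually decoded as UTF-8:

```javascript
const literal = "Oops… Something went wrong";
const escaped = "Oops\u2026 Something went wrong";

// The ES2015 code point escape is a third spelling of the same character.
const braced = "Oops\u{2026} Something went wrong";

console.log(literal === escaped); // true
console.log(literal === braced);  // true
```

The escaped forms are pure ASCII, so they survive even if the file is served with the wrong charset.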
Note that characters beyond one- and two-byte sequences are problematic, and that JavaScript regular expressions are terrible with characters beyond the first codepage. (Well maybe not "terrible", but primitive at best.)
You can also use Unicode letters, Unicode combining marks, and Unicode connector punctuation characters in identifiers, in case you want to impress your friends. Thus
var wavy﹏line = "wow";
is perfectly good JavaScript (but good luck with your bug report if you find a browser where it doesn't work).
Read all about it in the spec, or use it to fall asleep at night :)

Validate field for all language characters through REGEX

I need to validate that a field is not empty. It should allow English and foreign-language characters (UTF-8) but not special characters. I'm not good at regex, so any help on this would be great...
If you want to support a wide range of languages, you'll have to work by excluding only the characters you don't want, since specifying all of the ranges you do want will be difficult.
You'll need to look at the list of Unicode blocks and/or the character database to identify the blocks you want to exclude (for instance, U+0000 through U+001F). This Wikipedia article may also help.
Then use a regular expression with character classes to look for what you want to exclude.
For example, this will check for the U+0000 through U+001F and the U+007F characters (obviously you'll be excluding more than just these):
if (/[\u0000-\u001F\u007F]/.exec(theString)) {
// Contains at least one invalid character
}
The [] identify a "character class" (list and/or range of characters to look for). That particular one says look for \u0000 through \u001F (inclusive) as well as \u007F.
It would have been nice if I could say "Just do /^\w+$/.test(word)", but...
See this answer for the current state of unicode support (or rather lack of) in JavaScript regular expressions.
You can either use the library he suggests, which might be slow or enlist the help of the server for this (which might be slower).
You can test for a unicode letter like this:
str.match(/\p{L}/u)
Or for the existence of a non-letter like this:
str.match(/[^\p{L}]/u)
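Building on the non-letter test, a validator can also report which characters it rejected. This is only a sketch: the helper name is hypothetical, and allowing marks, numbers and whitespace alongside letters is an assumption about what the field should accept. It needs an engine with ES2018 property escapes.

```javascript
// Hypothetical helper: list every character that is not a letter,
// combining mark, number or whitespace.
function findDisallowed(str) {
  // With the `g` flag, match returns all matches, or null when there are none.
  return str.match(/[^\p{L}\p{M}\p{N}\s]/gu) || [];
}

console.log(findDisallowed("héllo ✓ wörld!")); // [ '✓', '!' ]
console.log(findDisallowed("文化中国"));         // []
```

Because of the u flag, a character outside the BMP is reported as one entry rather than as two surrogate halves.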

how to use unicode character groups in javascript's regexs?

Is there a way to use patterns like "\p{L}" in JavaScript, natively?
(i suppose that is a perl-compatible syntax)
I'm interested firstly in firefox support, and webkit, possibly
No, \p{..} is not supported natively by any of the big browsers. However, it does work in JavaScript if you use the XRegExp library and its Unicode plugins.
Unfortunately, no. You can only specify a set of characters in the usual syntax, writing characters and ranges in brackets, but this becomes awkward since e.g. letters are scattered all around the Unicode space, with other characters between them.
There’s an inefficient workaround: fetch the UnicodeData.txt file from the Unicode site, put its content inside your JavaScript code as data, and parse it. And then you could have the data e.g. in an array of objects containing the Unicode properties, such as gc (General Category), which tells you whether the character is a letter or not. But even then, you would just have the data handy for simple testing, not as something you can use as a constituent of a regexp.
In theory, you could use the data to construct a regexp... but it would be rather large.
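A rough sketch of that construction step, assuming the ranges have already been extracted from the data file. The three sample ranges and the function name are placeholders, not real output from UnicodeData.txt. It uses ES2015 \u{...} escapes with the u flag so that code points above U+FFFF work too.

```javascript
// Hypothetical sample ranges as [first, last] code points; a real table
// would be generated from the Unicode data and be far longer.
const symbolRanges = [
  [0x2190, 0x259f],   // Arrows through Block Elements
  [0x2700, 0x27bf],   // Dingbats
  [0x1f600, 0x1f64f], // Emoticons
];

// Turn a list of ranges into one character-class regular expression.
function rangesToRegExp(ranges) {
  const esc = (cp) => "\\u{" + cp.toString(16) + "}";
  const body = ranges
    .map(([lo, hi]) => (lo === hi ? esc(lo) : esc(lo) + "-" + esc(hi)))
    .join("");
  return new RegExp("[" + body + "]", "u");
}

const symbolPattern = rangesToRegExp(symbolRanges);
console.log(symbolPattern.test("😎"));  // true
console.log(symbolPattern.test("abc")); // false
```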
No, JavaScript has slightly different syntax. To match Unicode characters you have to use an escape like \uXXXX. In practice, however, if your page and files are in UTF-8, putting non-ASCII characters directly in a class such as [абвг] works too.
http://www.javascriptkit.com/jsref/regexp.shtml
The library found here:
http://inimino.org/~inimino/blog/javascript_cset
seems to work for me and is fairly small and independent of other libraries.

need a JavaScript Regex that requires upper or lowercase letters

I have a regex that right now only allows lowercase letters, I need one that requires either lowercase or uppercase letters:
/(?=.*[a-z])/
You Can’t Get There from Here
Unfortunately, it is utterly impossible to do this correctly using Javascript! Read this flavor comparison’s ECMA column for all of what Javascript cannot do.
Theory vs Practice
The proper pattern for lowercase is the standard Unicode derived binary property \p{Lowercase}, and the proper pattern for uppercase is similarly \p{Uppercase}. These are normative properties that sometimes include non-letters in them under certain exotic circumstances.
Using just General Category properties, you can have \p{Ll} for Lowercase_Letter, \p{Lu} for Uppercase_Letter, and \p{Lt} for Titlecase_Letter. (Remember, there are three cases in Unicode, not two.) There is a standard alias \p{LC} which means [\p{Lu}\p{Lt}\p{Ll}].
If you want a letter that is not a lowercase letter, you could use (?=\P{Ll})\pL. Written in longhand that’s (?=\P{Lowercase_Letter})\p{Letter}. Again, these mix some of the Other_Lowercase code points that \p{Lowercase} recognizes. I must again stress that the Lowercase property is a superset of the Lowercase_Letter property.
Remember the previous paragraph, swapping in upper everywhere I have written lower, and you get the same thing for the capitals.
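In engines that implement ES2018 Unicode property escapes (the u flag), the properties named above can be tried directly. The following is a sketch under that assumption. The helper names are made up, and the sample characters were chosen to show the difference between \p{Lowercase} and \p{Ll}.

```javascript
// Hypothetical helpers: require at least one lowercase or uppercase
// character according to the derived binary properties.
const hasLower = (s) => /\p{Lowercase}/u.test(s);
const hasUpper = (s) => /\p{Uppercase}/u.test(s);

console.log(hasLower("ПРИВЕТ"), hasUpper("ПРИВЕТ")); // false true
console.log(hasLower("straße"), hasUpper("straße")); // true false

// U+2170 (small roman numeral one) is a letter number (Nl) that carries
// Other_Lowercase, so it is Lowercase without being a Lowercase_Letter.
console.log(/\p{Lowercase}/u.test("\u2170")); // true
console.log(/\p{Ll}/u.test("\u2170"));        // false

// U+01C5 is a titlecase letter, the third case mentioned above.
console.log(/\p{Lt}/u.test("\u01C5")); // true
console.log(/\p{LC}/u.test("\u01C5")); // true
```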
Possible Platforms
Because access to these essential properties is the minimal level of critical functionality necessary for Unicode regular expressions, some versions of Javascript implement them in just the way I have written them above. However, the standard for Javascript still does not require them, so you cannot in general count on them. This means that it is impossible to do this correctly under all implementations of Javascript.
Languages in which it is possible to do what you want done minimally include:
C♯ and Java (both only General Categories)
Ruby if and only if v1.9 or better (only binary properties, including General Categories)
PHP and PCRE (only General Category and Script properties plus a couple extras)
ICU’s C++ library and Perl, which both support all Unicode properties
Of those listed above, only those on the last line (ICU and Perl) strictly and completely meet all Level 1 compliance requirements (plus some of Levels 2 and 3) for the proper handling of Unicode in regexes. However, all of those I’ve listed above can easily handle most, and quite probably all, of what you need.
Javascript is not amongst those, however. Your version might, though, if you are very lucky and never have to run on a standard-only Javascript platform.
Summary
So very sadly, you cannot really use Javascript regexes for Unicode work unless you have a non-standard extension. Some people do, but most do not. If you do not, you may have to use a different platform until the relevant ECMA standard catches up with the 21st century (Unicode 3.1 came out a decade ago!!).
If anyone knows of a Javascript library that implements the Level 1 requirements of UTS#18 on Unicode Regular Expressions including both RL1.2 “Properties” and RL1.2a “Annex C: Compatibility Properties”, please chime in.
Not sure if you mean mixed-case, or strictly lowercase plus strictly uppercase.
Here's the mixed-case version:
/^[a-zA-Z]+$/
And the strictly one-or-the-other version:
/^([a-z]+|[A-Z]+)$/
Try /(?=.*[a-z])/i
Note the i at the end, this makes the expression case insensitive.
Or add an uppercase range to your regex:
/(?=.*[a-zA-Z])/
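To make the differences between these ASCII-only suggestions concrete, here is a small comparison. The variable names and sample inputs are arbitrary choices for illustration.

```javascript
// Lookahead variants: succeed if an ASCII letter occurs anywhere in the input.
const anyLetter = /(?=.*[a-zA-Z])/;
const anyLetterInsensitive = /(?=.*[a-z])/i;

// Anchored variants: the whole input must consist of ASCII letters.
const onlyLetters = /^[a-zA-Z]+$/;
const singleCase = /^([a-z]+|[A-Z]+)$/;

console.log(anyLetter.test("123A"));            // true
console.log(anyLetter.test("123"));             // false
console.log(anyLetterInsensitive.test("123A")); // true
console.log(onlyLetters.test("abcDEF"));        // true
console.log(onlyLetters.test("abc1"));          // false
console.log(singleCase.test("ABC"));            // true
console.log(singleCase.test("aBc"));            // false
```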
