Need Unicode support for a regular expression - javascript

I have this regular expression /^[A-Z][A-Za-z.'\- ]+$/ for checking a name.
So when I type George or George Harris or George-Harris it's OK. The problem is that it doesn't match names or words in my language (Greek).
How can I add Unicode support to this regular expression?

There is the XRegExp library, which adds support for character classes and other things missing from the JS implementation of regular expressions. I think you'll find its Unicode addon particularly useful.

You can use Unicode escapes to match Greek letters. Here's a map of the Greek characters in Unicode. /[A-Z]/i would translate to /[\u03B1-\u03C9]/i, if I understand right. So for Greek characters (and, in my country, diacritics), you need to know their Unicode code points and write them as \uxxxx in regular expressions.
In my experience, if the javascript file is saved as utf-8 and the webpage that uses it is utf-8, you can use such characters directly. So something like /α/.test('α200β') works in such a setup.
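As a rough sketch of the range approach described above, the original name pattern could be widened with Greek code point ranges. The ranges chosen here and the variable name greekOrLatinName are assumptions made for illustration, not something taken from the question. The Greek and Coptic block also contains a few punctuation and symbol characters, so tighten the ranges if that matters.

```javascript
// Hypothetical extension of /^[A-Z][A-Za-z.'\- ]+$/ with Greek ranges.
// First character: a Latin capital or a Greek capital
//   \u0386, \u0388-\u038F  accented capitals (the span also covers \u0390 and unassigned points)
//   \u0391-\u03AB          plain capitals plus the two capitals with dialytika
// Remaining characters: Latin letters, the Greek and Coptic block (\u0370-\u03FF),
// the Greek Extended block (\u1F00-\u1FFF), and the punctuation from the original pattern.
var greekOrLatinName =
  /^[A-Z\u0386\u0388-\u038F\u0391-\u03AB][A-Za-z\u0370-\u03FF\u1F00-\u1FFF.'\- ]+$/;

console.log(greekOrLatinName.test("George Harris")); // true
// "Giorgos" in Greek, capitalised
console.log(greekOrLatinName.test("\u0393\u03B9\u03CE\u03C1\u03B3\u03BF\u03C2")); // true
// Same name with a lowercase first letter
console.log(greekOrLatinName.test("\u03B3\u03B9\u03CE\u03C1\u03B3\u03BF\u03C2")); // false
```

The test strings are written with \u escapes so the result does not depend on how the file is saved or normalized.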

Related

Is it safe to use UTF-8 character literals in JavaScript source code?

Is it safe to write JavaScript source code (to be executed in the browser) which includes UTF-8 character literals?
For example, I would like to use an ellipsis literal in a string like this:
var foo = "Oops… Something went wrong";
Do "modern" browsers support this? Is there a published browser support matrix somewhere?
JavaScript is by specification a Unicode language, so Unicode characters in strings should be safe. You can use hex escapes (\u8E24) as an alternative. Make sure your script files are served with proper content type headers.
Note that characters outside the Basic Multilingual Plane, which JavaScript strings store as two 16-bit code units (a surrogate pair), are problematic, and that JavaScript regular expressions are terrible with characters beyond the first codepage. (Well, maybe not "terrible", but primitive at best.)
You can also use Unicode letters, Unicode combining marks, and Unicode connector punctuation characters in identifiers, in case you want to impress your friends. Thus
var wavy﹏line = "wow";
is perfectly good JavaScript (but good luck with your bug report if you find a browser where it doesn't work).
Read all about it in the spec, or use it to fall asleep at night :)
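A small sketch of the points in this answer: a literal and its \u escape give the same string, while a character outside the Basic Multilingual Plane behaves as two units unless the regular expression uses the "u" flag. The "u" flag is an ES2015 feature, so this assumes a reasonably recent engine. The variable names are made up for the example.

```javascript
// The literal and the escaped form produce the same string,
// provided the file is decoded with the encoding it was saved in.
var literal = "Oops… Something went wrong";
var escaped = "Oops\u2026 Something went wrong";
console.log(literal === escaped); // true when the source is read as UTF-8

// U+1F600 lies outside the Basic Multilingual Plane,
// so it is stored as two UTF-16 code units (a surrogate pair).
var face = "\uD83D\uDE00";
console.log(face.length);       // 2
console.log(/^.$/.test(face));  // false: without "u", "." matches one code unit
console.log(/^.$/u.test(face)); // true: with "u", "." matches one code point
```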

Regex for all alphabets

I need a regex for all alphabets. I have an input and a target text, and either of them can belong to a different alphabet: Chinese, Latin, Cyrillic, or any other.
I need a regex that works for multi-language input and multi-language target text.
Does anybody have any idea about this? How can I write this regex?
I will use this with JavaScript, but I think there should be a common regex for Java and JavaScript for this problem.
If you are in Java (not in JavaScript!) you can use Unicode properties, e.g.
\p{L} any kind of letter from any language.
See regular-expressions.info/unicode for more information.
For Javascript:
There is the XRegExp library and its Unicode plugins, which extend the JavaScript regex features. They add support for Unicode categories, scripts, and blocks.
With those libs you would be able to use \p{L} with javascript.
See my answer to this question for a small example
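Engines that implement ES2018 Unicode property escapes also accept \p{L} natively, as long as the pattern carries the "u" flag, so the library is not strictly required there. A minimal sketch, assuming such an engine; isAllLetters is a hypothetical helper name:

```javascript
// Assumes an engine with ES2018 Unicode property escapes.
// \p{L} is only recognised when the "u" flag is present.
function isAllLetters(text) {
  return /^\p{L}+$/u.test(text);
}

console.log(isAllLetters("Москва")); // true
console.log(isAllLetters("日本語"));  // true
console.log(isAllLetters("abc123")); // false: digits are not in \p{L}
```

On older engines the pattern literal is a syntax error or matches the wrong thing, so the XRegExp route is still the fallback there.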
Some regex engines support special character for all Unicode letters:
\p{L}
Or you can use \w (letter, digit, underscore), though in JavaScript \w covers only the ASCII ones.
I use the "|" character as a separator, so it is special for me. The key can be any character except "|". This solved my problem; thanks for the answers. It can be used with JavaScript, Java and Groovy. I tested it and it worked.
var keyPrefix ="\\|[\u0000-\u007B\u007D-\uFFEF]*";
var keySuffix = "[\u0000-\u007B\u007D-\uFFEF]*\\|";
var searchkey = keyPrefix + key.toLowerCase() + keySuffix;

Regular Expression for Japanese characters

I am doing internationalization in Struts. I want to write JavaScript validation for Japanese and English users. I know the regular expression for English but not for Japanese users. Is it possible to write one regular expression for both kinds of users that validates on the basis of Unicode?
Please help me.
Here is a regular expression that can be used to match all English alphanumeric characters, Japanese kanji, katakana, hiragana, multibyte alphanumerics (hankaku and zenkaku), and dashes:
/[一-龠]+|[ぁ-ゔ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[ａ-ｚＡ-Ｚ０-９]+|[々〆〤ヶ]+/u
You can edit it to fit your needs, but notice the "u" flag at the end.
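Note that the pattern above is unanchored, so .test() returns true as soon as any one of those characters appears. For validation you would probably want an anchored variant. The sketch below merges the same character sets (including the full-width alphanumerics the description mentions) into one class written with \u escapes; the anchoring, the merge, and the name japaneseOrAscii are my assumptions, not part of the answer.

```javascript
// Hypothetical anchored variant: every character must come from one of the sets.
var japaneseOrAscii = new RegExp(
  "^[" +
    "\\u4E00-\\u9FA0" +                             // kanji range used in the answer
    "\\u3041-\\u3094" +                             // hiragana
    "\\u30A1-\\u30F4\\u30FC" +                      // katakana and the long-vowel mark
    "a-zA-Z0-9" +                                   // half-width alphanumerics
    "\\uFF41-\\uFF5A\\uFF21-\\uFF3A\\uFF10-\\uFF19" + // full-width alphanumerics
    "\\u3005\\u3006\\u3024\\u30F6" +                // iteration mark and similar signs
  "]+$"
);

console.log(japaneseOrAscii.test("日本語"));   // true
console.log(japaneseOrAscii.test("abc123"));  // true
console.log(japaneseOrAscii.test("abc!"));    // false: "!" is not in any set
```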
Provided your text editor and programming language support Unicode, you should be able to enter Japanese characters as literal strings. Things like [A-X] ranges will probably not translate very well in general.
What kind of text are you trying to validate?
What language are the regular expressions in? Perl-compatible, POSIX, or something else?

Regex: Disable Symbols

Is there any way to disable all symbols, punctuation, block elements, geometric shapes and dingbats such as these:
✁ ✂ ✃ ✄ ✆ ✇ ✈ ✉ ✌ ✍ ✎ ✏ ✐ ✑ ✒ ✓ ✔ ✕ ⟻ ⟼ ⟽ ⟾ ⟿ ⟻ ⟼ ⟽ ⟾ ⟿ ▚ ▛ ▜ ▝ ▞ ▟
without writing down all of them in the regular expression pattern, while still allowing all other normal language characters such as Chinese, Arabic, etc., like these:
文化中国 الجزيرة نت
?
I'm building a javascript validation function and my real problem is that I can't use:
[a-zA-Z0-9]
Because this excludes a lot of languages too, not just the symbols.
The Unicode standard divides up all the possible characters into code charts. Each code chart contains related characters. If you want to exclude (or include) only certain classes of characters, you will have to make a suitable list of exclusions (or inclusions). Unicode is big, so this might be a lot of work.
Not really.
JavaScript doesn't support Unicode Character Properties. The closest you'll get is excluding ranges by Unicode code point as Greg Hewgill suggested.
For example, to match all of the characters under Mathematical Symbols:
/[\u2190-\u259F]/
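A sketch of how the exclude-by-range idea could be turned into a validation helper. Only the first range comes from the answer; the other block ranges (Geometric Shapes, Dingbats, Supplemental Arrows-A) are ones I added to cover the characters in the question, and hasBlockedSymbol is a hypothetical name. The list is nowhere near complete.

```javascript
// Hypothetical blocklist built from Unicode block ranges (BMP only).
var blockedSymbols = new RegExp(
  "[" +
    "\\u2190-\\u259F" + // arrows through block elements (the range quoted above)
    "\\u25A0-\\u25FF" + // Geometric Shapes
    "\\u2700-\\u27BF" + // Dingbats
    "\\u27F0-\\u27FF" + // Supplemental Arrows-A
  "]"
);

// Returns true if the text contains at least one blocked symbol.
function hasBlockedSymbol(text) {
  return blockedSymbols.test(text);
}

console.log(hasBlockedSymbol("文化中国"));     // false
console.log(hasBlockedSymbol("cut \u2702"));  // true: U+2702 is in Dingbats
```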
This depends on your regex dialect. Unfortunately, probably most existing JavaScript engines don't support Unicode character classes.
In regex engines such as the one in (recent) Perl or .Net, Unicode character classes can be referenced.
\p{L}: any kind of letter from any language.
\p{N}: any number symbol from any language (including, as I recall, the Indian and Arabic and CJK number glyphs).
Because Unicode supports composed and decomposed glyphs, you may run into certain complexities: namely, if only decomposed forms exist, it's possible that you might accidentally exclude some diacritic marks in your matching pattern, and you may need to explicitly allow glyphs of the type Mark. You can mitigate this somewhat by using, if I recall correctly, a string that has been normalized using NFKC normalization (only for characters that have a composed form). In environments that support Unicode well, there's usually a function that allows you to normalize Unicode strings fairly easily (true in Java and .Net, at least).
Edited to add: If you've started down this path, or have considered it, in order to regain some sanity, you may want to experiment with the Unicode Plugin for XRegExp (which will require you to take a dependency on XRegExp).
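To illustrate the composed versus decomposed issue: in engines that implement ES2015, JavaScript strings have a built-in normalize() method. The sketch below uses NFC, which is enough to compose a base letter with its accent (NFKC additionally folds compatibility characters). The latinWord pattern is a made-up example range, not a recommendation.

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
var decomposed = "e\u0301";
// NFC composes the pair into the single code point U+00E9
var composed = decomposed.normalize("NFC");

console.log(decomposed.length);     // 2
console.log(composed.length);       // 1
console.log(composed === "\u00E9"); // true

// A range-based pattern sees the two forms differently.
var latinWord = /^[A-Za-z\u00C0-\u00FF]+$/;
console.log(latinWord.test(decomposed)); // false: U+0301 is outside the class
console.log(latinWord.test(composed));   // true
```

Normalizing the input before running a range-based pattern avoids rejecting text only because it arrived in decomposed form.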
JavaScript regular expressions do not have native Unicode support. An alternative is to validate (or sanitize) the string on the server side, or to use a non-native regex library. While I've never used it, XRegExp is such a library, and it has a Unicode Plugin.
Take a look at the Unicode Planes. You probably want to exclude everything but planes 0 and 2. After that, it gets ugly as you'll have to exclude a lot of plane 0 on a case-by-case basis.

Why does \w match only English words in javascript regex?

I'm trying to find URLs in some text, using javascript code. The problem is, the regular expression I'm using uses \w to match letters and digits inside the URL, but it doesn't match non-english characters (in my case - Hebrew letters).
So what can I use instead of \w to match all letters in all languages?
Because \w only matches the ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 95 ('_') and 97-122 ('a'-'z'). Hebrew characters and other non-English characters (for example, umlaut-o or tilde-n) are outside of that range.
Instead of matching foreign language characters (there are so many of them, in many different Unicode ranges), you might be better off looking for the characters that delineate your words: spaces, quotation marks, and other punctuation.
The ECMA 262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s on the other hand matches both ASCII and Unicode whitespace, according to the standard.
JavaScript does not support the \p syntax for matching Unicode things either, so there isn't a good way to do this. You could match all Hebrew characters with:
[\u0590-\u05FF]
This simply matches any code point in the Hebrew block.
You can match any ASCII word character or any Hebrew character with:
[\w\u0590-\u05FF]
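A quick sketch of the difference, using a Hebrew word written as \u escapes so the example does not depend on file encoding. The sample text and variable names are made up for illustration.

```javascript
// A four-letter Hebrew word (shin, lamed, vav, final mem)
var hebrewWord = "\u05E9\u05DC\u05D5\u05DD";

console.log(/^\w+$/.test(hebrewWord));                // false: \w is ASCII-only
console.log(/^[\w\u0590-\u05FF]+$/.test(hebrewWord)); // true

// Pull runs of ASCII word characters or Hebrew characters out of mixed text.
var text = "example.com/" + hebrewWord + "/page_1";
var parts = text.match(/[\w\u0590-\u05FF]+/g);
console.log(parts.length); // 4: "example", "com", the Hebrew word, "page_1"
```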
I think you are looking for this regex:
^[אבגדהוזחטיכלמנסעפצקרשתץףןםa-zA-Z0-9\s\.\-_\\\/]+$
I've just found XRegExp which has not been mentioned yet and I'm quite impressed with it. It is an alternative regular expression implementation, has a unicode plugin and is licensed under MIT license.
According to the website, to match unicode chars, you'd use such code:
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true
Try \p{L}, the Unicode regex class for letters.
Have a look at http://www.regular-expressions.info/refunicode.html.
It looks like there is no \w equivalent for unicode, but you can match single unicode letters, so you can create it.
Check this SO Question about JavaScript and Unicode out. Looks like Jan Goyvaerts answer there provides some hope for you.
Edit: But then it seems all browsers don't support \p ... anyway. That question should contain useful info.
Note that URIs (as a superset of URLs) are specified, in the IETF's URI: Generic Syntax document (RFC 3986), to only allow US-ASCII characters.
Normally all other characters should be represented by percent-notation:
"In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI." (URI: Generic Syntax)
Which is what generally happens when you open an URL with non-ASCII characters in browser, they get translated into %AB notation, which, in turn, is US-ASCII.
If it is possible to influence the way the material is created, the best option would be to subject URLs to urlencode() type function during their creation.
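In JavaScript, the built-in encodeURIComponent plays the role of such a function: it encodes a string as UTF-8 and percent-encodes the bytes. A small sketch, with a Hebrew path segment as a made-up example:

```javascript
// A Hebrew word used as a hypothetical URL path segment
var segment = "\u05E9\u05DC\u05D5\u05DD";

// Each Hebrew letter becomes two UTF-8 bytes, each written as %XX
var encoded = encodeURIComponent(segment);
console.log(encoded); // %D7%A9%D7%9C%D7%95%D7%9D

// The encoded form is pure US-ASCII, so \w-style patterns can handle it
console.log(/^[\x00-\x7F]*$/.test(encoded)); // true

// Decoding restores the original text
console.log(decodeURIComponent(encoded) === segment); // true
```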
Perhaps \S (non-whitespace).
If you're the one generating URLs with non-english letters in it, you may want to reconsider.
If I'm interpreting the W3C correctly, URLs may only contain word characters within the latin alphabet.
