Why does \w match only English words in javascript regex? - javascript

I'm trying to find URLs in some text, using javascript code. The problem is, the regular expression I'm using uses \w to match letters and digits inside the URL, but it doesn't match non-english characters (in my case - Hebrew letters).
So what can I use instead of \w to match all letters in all languages?

Because \w only matches the ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 95 ('_') and 97-122 ('a'-'z'). Hebrew characters and other non-English characters (for example, umlaut-o or tilde-n) are outside of those ranges.
Instead of matching foreign language characters (there are so many of them, in many different Unicode ranges), you might be better off looking for the characters that delineate your words: spaces, quotation marks, and other punctuation.
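The delimiter-based idea can be sketched as follows. This is only a minimal example, assuming URLs start with http:// or https:// and end at whitespace, a quote, or an angle bracket; the function name findUrls and the delimiter set are illustrative choices, not something from the original answer.

```javascript
// Hypothetical helper: find URL-like runs by looking for what ends them,
// rather than enumerating every letter that may appear inside them.
function findUrls(text) {
  // Scheme, then everything up to whitespace, a quote, or an angle bracket.
  const urlPattern = /https?:\/\/[^\s"'<>]+/g;
  return text.match(urlPattern) || [];
}

// The Hebrew path segment is kept because it is not a delimiter.
console.log(findUrls('See http://example.com/\u05E9\u05DC\u05D5\u05DD and "https://example.org/a?b=1".'));
```

Punctuation that directly follows a URL (a trailing period, for instance) would be captured too, so a real implementation would probably trim it.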

The ECMA 262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s on the other hand matches both ASCII and Unicode whitespace, according to the standard.
Under that edition of the standard, JavaScript does not support the \p syntax for matching Unicode properties either, so there isn't a good built-in way to do this. You could match all Hebrew characters with:
[\u0590-\u05FF]
This simply matches any code point in the Hebrew block.
You can match any ASCII word character or any Hebrew character with:
[\w\u0590-\u05FF]
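As a quick check of the two classes above, here is a small sketch; the variable names and sample strings are made up for illustration.

```javascript
// \w alone fails as soon as a Hebrew letter appears; the extended class does not.
const asciiWord = /^\w+$/;
const asciiOrHebrewWord = /^[\w\u0590-\u05FF]+$/;

// "abc" followed by the Hebrew letters shin, lamed, vav, final mem.
const sample = "abc\u05E9\u05DC\u05D5\u05DD";

console.log(asciiWord.test(sample));               // false
console.log(asciiOrHebrewWord.test(sample));       // true
console.log(asciiOrHebrewWord.test("\u00E9t\u00E9")); // false: é is neither ASCII nor Hebrew
```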

I think you are looking for this regex:
^[אבגדהוזחטיכלמנסעפצקרשתץףןםךa-zA-Z0-9\s\.\-_\\\/]+$

I've just found XRegExp which has not been mentioned yet and I'm quite impressed with it. It is an alternative regular expression implementation, has a unicode plugin and is licensed under MIT license.
According to the website, to match unicode chars, you'd use such code:
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true

Try \p{L}, the Unicode regex property that matches letters.
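In engines that implement the ECMAScript 2018 Unicode property escapes, \p{L} is accepted by native regular expressions when the u flag is set. A minimal sketch, assuming such an engine (the variable names are illustrative):

```javascript
// The u flag is required; without it, \p is not treated as a property escape.
const unicodeLetters = /^\p{L}+$/u;

console.log(unicodeLetters.test("\u05E9\u05DC\u05D5\u05DD")); // true (Hebrew)
console.log(unicodeLetters.test("Привет"));                   // true (Cyrillic)
console.log(unicodeLetters.test("abc123"));                   // false: digits are not letters

// A rough Unicode-aware stand-in for \w: letters, marks, decimal digits, underscore.
const unicodeWord = /^[\p{L}\p{M}\p{Nd}_]+$/u;
console.log(unicodeWord.test("\u05E9\u05DC\u05D5\u05DD_123")); // true
```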

Have a look at http://www.regular-expressions.info/refunicode.html.
It looks like there is no \w equivalent for unicode, but you can match single unicode letters, so you can create it.

Check this SO Question about JavaScript and Unicode out. Looks like Jan Goyvaerts answer there provides some hope for you.
Edit: But then it seems that browsers don't support \p ... anyway. That question should still contain useful info.

Note that URIs (a superset of URLs) are specified by RFC 3986 to only allow US-ASCII characters.
Normally all other characters should be represented by percent-encoding:
"In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI." (URI: Generic Syntax)
This is what generally happens when you open a URL with non-ASCII characters in a browser: they get translated into %AB notation, which, in turn, is US-ASCII.
If it is possible to influence the way the material is created, the best option would be to pass URLs through a urlencode()-type function during their creation.
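In JavaScript, the built-in encodeURI function plays the role of such a urlencode()-type function. A minimal sketch, with a made-up example URL:

```javascript
// encodeURI percent-encodes non-ASCII characters as UTF-8 octets but leaves
// URL delimiters such as ":" and "/" intact.
const raw = "http://example.com/\u05E9\u05DC\u05D5\u05DD"; // Hebrew path segment
const encoded = encodeURI(raw);

console.log(encoded);                        // http://example.com/%D7%A9%D7%9C%D7%95%D7%9D
console.log(/^[\w:\/.%-]+$/.test(encoded));  // true: only ASCII remains, so \w suffices
console.log(decodeURI(encoded) === raw);     // true
```

For a single path segment or query value, encodeURIComponent is usually the better fit because it also encodes delimiters such as "/" and "?".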

Perhaps \S (non-whitespace).

If you're the one generating URLs with non-english letters in it, you may want to reconsider.
If I'm interpreting the W3C correctly, URLs may only contain word characters within the latin alphabet.

Related

Why is an underscore (_) not regarded as a non-word character?

The regexp \W matches all non-word characters, but not the underscore. Why is the underscore not regarded as a non-word character?
According to Jeffrey Friedl's book on regular expressions, this originated as a change in Perl's regular expressions back in 1988, made so that \w would match the characters allowed in a Perl variable name [page 89]:
Perl 2 was released in June 1988. Larry had replaced the regex code
entirely, this time using a greatly enhanced version of the Henry
Spencer package mentioned in the previous section. You could still
have at most nine sets of parentheses, but now you could use |
inside them. Support for \d and \s was added, and support for \w was
changed to include an underscore, since then it would match what
characters were allowed in a Perl variable name.
\W is defined as [^A-Za-z0-9_].
It is the opposite of \w which is [A-Za-z0-9_] and means "a word character".
It is not about words as you perceive them in a spoken language. The "word" here means an identifier, word that can be used to name a variable or a type in a programming language.
Many programming languages allow only uppercase and lowercase letters, digits and the underscore (_) in identifiers. There are languages that allow other characters, but back when regular expressions were invented, fewer languages were that permissive, and most of them allowed only the characters that match \w in identifiers.
"Word character" definition is based on characters that can be used as a part of identifier in many programming languages, that is [A-Za-z0-9_].
According to regex101: \W matches any non-word character (equal to [^a-zA-Z0-9_]). This seems to be a designers' choice.
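The practical effect is easy to see in a short sketch; the sample string is made up.

```javascript
// The underscore counts as a word character, so \W never matches it.
console.log(/\w/.test("_")); // true
console.log(/\W/.test("_")); // false

// Splitting on \W therefore keeps "foo_bar" together.
console.log("foo_bar-baz".split(/\W/));    // [ 'foo_bar', 'baz' ]

// To treat the underscore as a separator as well, list it explicitly.
console.log("foo_bar-baz".split(/[\W_]/)); // [ 'foo', 'bar', 'baz' ]
```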

How do I match Unicode special alpha characters while NOT matching special characters

I have a dilemma here. I am trying to write a regex pattern that matches all alpha characters for eastern languages as well as western languages. One criterion is that no numbers can match (so "José13" is not a match but "José" is); the other is that special characters cannot match (e.g. !##$%).
I've played around with this in Chrome's console, and I've gotten:
"a".match('[a-zA-Z]');
to come back successfully, when I put in:
"a".match('[\p{L}]');
I get a null response, which I'm not quite understanding why. According to http://www.regular-expressions.info/unicode.html \p{L} is a match for any letter.
EDIT: the \p doesn't seem to work in my chrome console, so I'll try a different route. I have a chart of the unicode from Unifoundry. I'll match up the regex and attempt to make the range of characters invalid.
Any input would be greatly appreciated.
This works in the javascript console, but it seems like a hack:
.match('^[^\u0000-\u0040\u005B-\u0060\u007B-\u00BF\u00D7\u00F7]*');
However it does what I need it to do.
Referenced this post on SO: Javascript + Unicode regexes
Current Javascript implementations don't support such shortcuts, but you can specify a range, for example:
/[\u4E00-\u9FFF]+/g.test("漢字")
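On the null result in the question: part of the cause is that the pattern was written as a string, where "\p" is not a recognized escape and the backslash is silently dropped. The sketch below shows this, and then the same check written as a regex literal with the u flag, assuming an engine that implements the ECMAScript 2018 property escapes (the name lettersOnly is illustrative).

```javascript
// In a string literal, "\p" is not a recognized escape, so the backslash is dropped.
console.log('[\p{L}]' === '[p{L}]');        // true
// The resulting class matches only the characters p, {, L and }.
console.log("a".match('[\p{L}]'));          // null
console.log("p".match('[\p{L}]') !== null); // true

// With a regex literal and the u flag, \p{L} means "any letter".
const lettersOnly = /^\p{L}+$/u;
console.log(lettersOnly.test("Jos\u00E9"));   // true  ("José")
console.log(lettersOnly.test("Jos\u00E913")); // false (digits)
console.log(lettersOnly.test("Jos\u00E9!"));  // false (punctuation)
```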

Need unicode support for a regular expressions

I have this regular expression /^[A-Z][A-Za-z.'\- ]+$/ for checking a name.
So when I type George or George Harris or George-Harris it's OK. The problem is that it doesn't match names or words in my language (Greek).
How can I add unicode support to this regular expression?
There is XRegExp library that adds support for character classes and other things missing from JS implementation of regular expressions. I think you'll find its Unicode addon particularly useful.
You can use unicode to match greek letters. Here's a map of the greek characters in unicode. /[A-Z]/i would translate to /[\u03B1-\u03C9]/i if I understand right. So for greek characters (and, in my country, diacritics), you need to know their unicode equivalent and use them as \uxxxx in regular expressions.
In my experience, if the javascript file is saved as utf-8 and the webpage that uses it is utf-8, you can use such characters directly. So something like /α/.test('α200β') works in such a setup.
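Applied to the name pattern from the question, the range approach might look like the sketch below. The ranges are an assumption on my part: U+0391-U+03A9 for unaccented Greek capitals, and the Greek and Coptic (U+0370-U+03FF) and Greek Extended (U+1F00-U+1FFF) blocks for the remaining letters, which is deliberately loose. The second pattern is an alternative for engines that implement the ECMAScript 2018 property escapes.

```javascript
// Original ASCII-only pattern from the question.
const asciiName = /^[A-Z][A-Za-z.'\- ]+$/;

// Range-based extension (approximate: the blocks also contain a few non-letters).
const greekRangeName = /^[A-Z\u0391-\u03A9][A-Za-z\u0370-\u03FF\u1F00-\u1FFF.'\- ]+$/;

// Property-based alternative: uppercase letter first, then letters and marks.
const anyScriptName = /^\p{Lu}[\p{L}\p{M}.'\- ]+$/u;

const greek = "\u0393\u03B9\u03CE\u03C1\u03B3\u03BF\u03C2"; // the Greek form of "George"

console.log(asciiName.test(greek));                // false
console.log(greekRangeName.test(greek));           // true
console.log(anyScriptName.test(greek));            // true
console.log(anyScriptName.test("George-Harris"));  // true
console.log(anyScriptName.test("george"));         // false: must start uppercase
```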

Regex for all alphabets

I need a regex for all alphabets. I have an input and a target text, and they can belong to different alphabets: Chinese, Latin, Cyrillic or any other.
I need a regex for multi-language input and multi-language target text.
Does anybody have any idea about this? How can I write this regex?
I will use this with JavaScript, but I think there should be a common regex for both Java and JavaScript for this problem.
If you are in Java (not in JavaScript!) you can use Unicode properties, e.g.
\p{L} any kind of letter from any language.
See regular-expressions.info/unicode for more information.
For JavaScript:
There is the XRegExp library and its XRegExp Unicode plugins, which extend the JavaScript regex features by adding support for Unicode categories, scripts, and blocks.
With those libs you would be able to use \p{L} with javascript.
See my answer to this question for a small example
Some regex engines support special character for all Unicode letters:
\p{L}
Or you can use \w (letter, digit, underscore), though in JavaScript it matches ASCII characters only.
I use the "|" character as a separator, so it is special for me. A key can contain any character except "|". This solved my problem; thanks for the answers. It can be used with JavaScript, Java and Groovy. I tested it and it worked.
var keyPrefix ="\\|[\u0000-\u007B\u007D-\uFFEF]*";
var keySuffix = "[\u0000-\u007B\u007D-\uFFEF]*\\|";
var searchkey = keyPrefix + key.toLowerCase() + keySuffix;
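A self-contained variant of the same idea is sketched below. It swaps the explicit code point ranges for the negated class [^|], which is a simplification rather than an exact equivalent, and it escapes the key so that regex metacharacters in it are matched literally. The helper names escapeRegExp and fieldContaining are hypothetical.

```javascript
// Hypothetical helper: escape regex metacharacters so the key is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a pattern matching one "|"-delimited field that contains the key.
function fieldContaining(key) {
  const notPipe = "[^|]*"; // any run of characters other than the separator
  return new RegExp("\\|" + notPipe + escapeRegExp(key.toLowerCase()) + notPipe + "\\|");
}

const target = "|apple pie|中文 text|";
console.log(fieldContaining("中文").test(target));    // true
console.log(fieldContaining("PIE").test(target));     // true: the key is lowercased first
console.log(fieldContaining("a.c").test("|abc|"));    // false: "." is matched literally
```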

Regex: Disable Symbols

Is there any way to disable all symbols, punctuation, block elements, geometric shapes and dingbats such as these:
✁ ✂ ✃ ✄ ✆ ✇ ✈ ✉ ✌ ✍ ✎ ✏ ✐ ✑ ✒ ✓ ✔ ✕ ⟻ ⟼ ⟽ ⟾ ⟿ ⟻ ⟼ ⟽ ⟾ ⟿ ▚ ▛ ▜ ▝ ▞ ▟
without writing down all of them in the regular expression pattern, while enabling all other normal language characters, such as Chinese and Arabic, like these:
文化中国 الجزيرة نت
?
I'm building a javascript validation function and my real problem is that I can't use:
[a-zA-Z0-9]
Because this ignores a lot of languages too, not just the symbols.
The Unicode standard divides up all the possible characters into code charts. Each code chart contains related characters. If you want to exclude (or include) only certain classes of characters, you will have to make a suitable list of exclusions (or inclusions). Unicode is big, so this might be a lot of work.
Not really.
JavaScript doesn't support Unicode Character Properties. The closest you'll get is excluding ranges by Unicode code point as Greg Hewgill suggested.
For example, to match all of the characters from the Arrows block through the Block Elements block (a range that includes the Mathematical Operators):
/[\u2190-\u259F]/
This depends on your regex dialect. Unfortunately, probably most existing JavaScript engines don't support Unicode character classes.
In regex engines such as the one in (recent) Perl or .Net, Unicode character classes can be referenced.
\p{L}: any kind of letter from any language.
\p{N}: any number symbol from any language (including, as I recall, the Indian and Arabic and CJK number glyphs).
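For the validation function in the question, these classes could be combined as in the sketch below. It assumes a JavaScript engine that implements the ECMAScript 2018 property escapes (u flag); the function name isPlainText and the choice of allowed categories are illustrative.

```javascript
// Hypothetical validator: accept letters, combining marks, numbers and whitespace;
// reject everything else (symbols, punctuation, dingbats, box drawing, ...).
function isPlainText(input) {
  return /^[\p{L}\p{M}\p{N}\s]+$/u.test(input);
}

console.log(isPlainText("文化中国 الجزيرة نت")); // true
console.log(isPlainText("✂ ▚ ⟻"));               // false: all three are symbols
console.log(isPlainText("hello!"));               // false: "!" is punctuation
```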
Because Unicode supports composed and decomposed glyphs, you may run into certain complexities: namely, if only decomposed forms exist, it's possible that you might accidentally exclude some diacritic marks in your matching pattern, and you may need to explicitly allow glyphs of the type Mark. You can mitigate this somewhat by using, if I recall correctly, a string that has been normalized using NFC or NFKC normalization (this helps only for characters that have a composed form). In environments that support Unicode well, there's usually a function that allows you to normalize Unicode strings fairly easily (true in Java and .Net, at least).
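In JavaScript, String.prototype.normalize (ES2015) is such a function. The sketch below shows the decomposed-form problem and both mitigations, assuming an engine with ES2018 property escapes; it uses NFC, while NFKC would additionally fold compatibility characters.

```javascript
// "Jose" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form of "José").
const decomposed = "Jose\u0301";
const composed = decomposed.normalize("NFC");

console.log(composed === "Jos\u00E9");             // true
console.log(/^\p{L}+$/u.test(decomposed));         // false: U+0301 is a Mark, not a Letter
console.log(/^\p{L}+$/u.test(composed));           // true
console.log(/^[\p{L}\p{M}]+$/u.test(decomposed));  // true once marks are allowed
```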
Edited to add: If you've started down this path, or have considered it, in order to regain some sanity, you may want to experiment with the Unicode Plugin for XRegExp (which will require you to take a dependency on XRegExp).
JavaScript regular expressions do not have native Unicode support. An alternative is to validate (or sanitize) the string on the server side, or to use a non-native regex library. While I've never used it, XRegExp is such a library, and it has a Unicode Plugin.
Take a look at the Unicode Planes. You probably want to exclude everything but planes 0 and 2. After that, it gets ugly as you'll have to exclude a lot of plane 0 on a case-by-case basis.
