Regex for all alphabets - javascript

i need a regex for all alphabets. I have an input and target text. Both of them can be belong different alphabets. I mean they can be belong chinese, latin, cyrillic and any others alphabet.
I need a regex for multi language input and multi language target text.
Is there anybody has any idea about this? How can i write this regex ?
I will use this with javascript. But i think there should be common regex for java and javascript also for this problem.

If you are in Java (not in javascript!) you can use unicode properties, e.g.
\P{L} any kind of letter from any language.
See regular-expressions.info/unicode for more informations.
For Javascript:
There is a lib from XRegExp and some plugins XRegExp Unicode plugins that extends the javasript regex features. That adds support for Unicode categories, scripts, and blocks.
With those libs you would be able to use \p{L} with javascript.
See my answer to this question for a small example

Some regex engines support special character for all Unicode letters:
\p{L}
Or you can use \w - letter, digit, underscore

i use "|" this character as a separator, so it is speacial for me. Key can be any character except of "|". it solve my problems thanks for answers. And it can be used with javascript, java and groovy. I tested it, worked.
var keyPrefix ="\\|[\u0000-\u007B\u007D-\uFFEF]*";
var keySuffix = "[\u0000-\u007B\u007D-\uFFEF]*\\|";
var searchkey = keyPrefix + key.toLowerCase() + keySuffix;

Related

Need unicode support for a regular expressions

I have this regular expression /^[A-Z][A-Za-z.'\- ]+$/ for checking a name.
So when I type George or George Harris or George-Harris its OK. The problem is that it doesn't match names, words in my language (greek)
How can I add unicode support to this regular expression?
There is XRegExp library that adds support for character classes and other things missing from JS implementation of regular expressions. I think you'll find its Unicode addon particularly useful.
You can use unicode to match greek letters. Here's a map of the greek characters in unicode. /[A-Z]/i would translate /[\u03B1-\u03C9]/i if I understand right. So for greek characters (and, in my country, diacritics), you need to know their unicode equivalent and use them as \uxxxx in regular expressions.
In my experience, if the javascript file is saved as utf-8 and the webpage that uses it is utf-8, you can use such characters directly. So something like /α/.test('α200β') works in such a setup.

How can I write a regex that will match any string?

Yes, I know this sounds counter-intuitive. But I need it for a JavaScript mashup written by someone else. There is a regex value used to select project names to which the mashup will apply. I want it to apply to all projects.
As others have mentioned, the . doesn't match line feeds.
If you must match absolutely every character including \n then you can use this instead...
[\s\S]*
That's whitespace characters, and non-whitespace characters. In other words, that'll match everything.
OK, folks. I'm not strong with regex, but I think I figured this out. I'm just using the following regex:
'.*'
This is working fine for me. The dot allows any character, and the asterisk allows it to be repeated any number of times.
If anybody knows how to limit this to a string that is a single line (ASCII Character set) I'm all ears.
Your answer (.*) works, and by default will match only a single line. If you wanted it to match multiple lines, you could enable multi-line mode in your particular regex implementation, but nobody enables it by default AFAIK.
You're correct that .* will find any character, however, the exception are newline characters ('\n', etc). What you could try is grouping. This: (.*) should work for you. If you need help with accessing matched group more info can be found here: How do you access the matched groups in a JavaScript regular expression?
EDIT: If you are using something other than JavaScript the implementation may differ slightly with multi vs single-line mode. JavaScript itself does not have single-ling mode.

How to parse for a word in text in JavaScript?

In the text page, I would like to examine each word. What is the best way to read each word at the time? It is easy to find words that are surrounded by space, but once you get into parsing out words in text it can get complicated.
Instead of defining my own way of parsing the words from text, is there something already built that parse out the words in regular expression or other methods?
Some example of words in text.
word word. word(word) word's word word' "word" .word. 'word' sub-word
You can use:
text = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";
words = text.match(/[-\w]+/g);
This will give you an array with all your words.
In regular expressions, \w means any character that is either a-z, A-Z, 0-9 or _. [-\w] means any character that is a \w or a -. [-\w]+ means any of these characters that appear 1 ore more times.
If you would like to define a word as being something more than the above expression, add the other characters that compose your words inside the [-\w] character class. For example, if you'd like words to also contain ( and ), make the character class be [-\w()].
For an introduction in regular expressions, check out the great tutorial at regular-expressions.info.
What you're talking about is Tokenisation. It's non-trivial to say the least, and a subject of intense reasearch at the major search engines. There are a number of open source tokenisation libraries in various server-side languages (e.g see the Stanford NLP and Lucene projects) but as far as I am aware there's nothing that would even come close to these in javascript. You may have to roll your own :) or perhaps do the processing server-side, and load the results via AJAX?
I support Richard's answer here - but to add to it - one of the easiest ways of building a tokeniser (imho) is Antlr; and some maniac has built a Javascript target for it; thus allowing you to run and execute a grammar in the web browser (look under 'runtime libraries' section here)
I won't pretend that there's not a learning curve there though.
Take a look at regular expressions - you can define almost any parsing algorithm you want.

Regex: Disable Symbols

Is there any way to disable all symbols, punctuations, block elements, geometric shapes and dingbats such like these:
✁ ✂ ✃ ✄ ✆ ✇ ✈ ✉ ✌ ✍ ✎ ✏ ✐ ✑ ✒ ✓ ✔ ✕ ⟻ ⟼ ⟽ ⟾ ⟿ ⟻ ⟼ ⟽ ⟾ ⟿ ▚ ▛ ▜ ▝ ▞ ▟
without writing down all of them in the Regular Expression Pattern, while enable all other normal language characters such like chinese, arabic etc.. such like these:
文化中国 الجزيرة نت
?
I'm building a javascript validation function and my real problem is that I can't use:
[a-zA-Z0-9]
Because this ignores a lots of languages too not just the symbols.
The Unicode standard divides up all the possible characters into code charts. Each code chart contains related characters. If you want to exclude (or include) only certain classes of characters, you will have to make a suitable list of exclusions (or inclusions). Unicode is big, so this might be a lot of work.
Not really.
JavaScript doesn't support Unicode Character Properties. The closest you'll get is excluding ranges by Unicode code point as Greg Hewgill suggested.
For example, to match all of the characters under Mathematical Symbols:
/[\u2190-\u259F]/
This depends on your regex dialect. Unfortunately, probably most existing JavaScript engines don't support Unicode character classes.
In regex engines such as the one in (recent) Perl or .Net, Unicode character classes can be referenced.
\p{L}: any kind of letter from any language.
\p{N}: any number symbol from any language (including, as I recall, the Indian and Arabic and CJK number glyphs).
Because Unicode supports composed and decomposed glyphs, you may run into certain complexities: namely, if only decomposed forms exist, it's possible that you might accidentally exclude some diacritic marks in your matching pattern, and you may need to explicitly allow glyphs of the type Mark. You can mitigate this somewhat by using, if I recall correctly, a string that has been normalized using kC normalization (only for characters that have a composed form). In environments that support Unicode well, there's usually a function that allows you to normalize Unicode strings fairly easily (true in Java and .Net, at least).
Edited to add: If you've started down this path, or have considered it, in order to regain some sanity, you may want to experiment with the Unicode Plugin for XRegExp (which will require you to take a dependency on XRegExp).
JavaScript regular expressions do not have native Unicode support. An alternative to to validate (or sanitize) the string at server site, or to use a non-native regex library. While I've never used it, XRegExp is such a library, and it has a Unicode Plugin.
Take a look at the Unicode Planes. You probably want to exclude everything but planes 0 and 2. After that, it gets ugly as you'll have to exclude a lot of plane 0 on a case-by-case basis.

Why does \w match only English words in javascript regex?

I'm trying to find URLs in some text, using javascript code. The problem is, the regular expression I'm using uses \w to match letters and digits inside the URL, but it doesn't match non-english characters (in my case - Hebrew letters).
So what can I use instead of \w to match all letters in all languages?
Because \w only matches ASCII characters 48-57 ('0'-'9'), 67-90 ('A'-'Z') and 97-122 ('a'-'z'). Hebrew characters and other special foreign language characters (for example, umlaut-o or tilde-n) are outside of that range.
Instead of matching foreign language characters (there are so many of them, in many different ASCII ranges), you might be better off looking for the characters that delineate your words - spaces, quotation marks, and other punctuation.
The ECMA 262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s on the other hand matches both ASCII and Unicode whitespace, according to the standard.
JavaScript does not support the \p syntax for matching Unicode things either, so there isn't a good way to do this. You could match all Hebrew characters with:
[\u0590-\u05FF]
This simply matches any code point in the Hebrew block.
You can match any ASCII word character or any Hebrew character with:
[\w\u0590-\u05FF]
I think you are looking for this regex:
^[אבגדהוזחטיכלמנסעפצקרשתץףןםa-zA-z0-9\s\.\-_\\\/]+$
I've just found XRegExp which has not been mentioned yet and I'm quite impressed with it. It is an alternative regular expression implementation, has a unicode plugin and is licensed under MIT license.
According to the website, to match unicode chars, you'd use such code:
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true
Try this \p{L}
the unicode regex to Letters
Have a look at http://www.regular-expressions.info/refunicode.html.
It looks like there is no \w equivalent for unicode, but you can match single unicode letters, so you can create it.
Check this SO Question about JavaScript and Unicode out. Looks like Jan Goyvaerts answer there provides some hope for you.
Edit: But then it seems all browsers don't support \p ... anyway. That question should contain useful info.
Note that URIs (as superset of URLs) are specified by W3C to only allow US-ASCII characters.
Normally all other characters should be represented by percent-notation:
In local or regional contexts and with
improving technology, users might
benefit from being able to use a wider
range of characters; such use is not
defined by this specification.
Percent-encoded octets (Section 2.1)
may be used within a URI to represent
characters outside the range of the
US-ASCII coded character set if this
representation is allowed by the
scheme or by the protocol element in
which the URI is referenced. Such a
definition should specify the
character encoding used to map those
characters to octets prior to being
percent-encoded for the URI. // URI: Generic Syntax
Which is what generally happens when you open an URL with non-ASCII characters in browser, they get translated into %AB notation, which, in turn, is US-ASCII.
If it is possible to influence the way the material is created, the best option would be to subject URLs to urlencode() type function during their creation.
Perhaps \S (non-whitespace).
If you're the one generating URLs with non-english letters in it, you may want to reconsider.
If I'm interpreting the W3C correctly, URLs may only contain word characters within the latin alphabet.

Categories