Regular Expression for Japanese characters

I am doing internationalization in Struts. I want to write JavaScript validation for Japanese and English users. I know the regular expression for English input but not for Japanese input. Is it possible to write one regular expression for both sets of users that validates on the basis of Unicode?
Please help me.

Here is a regular expression that matches English alphanumeric characters, kanji, Japanese hiragana and katakana, full-width (zenkaku) alphanumerics alongside the ordinary half-width (hankaku) ones, and the long-vowel dash (ー):
/[一-龠]+|[ぁ-ゔ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[ａ-ｚＡ-Ｚ０-９]+|[々〆〤ヶ]+/u
You can edit it to fit your needs, but notice the "u" flag at the end.
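As a rough sketch of how the pattern above might be used to validate a whole field: the ranges are copied from the expression above, but merging them into one anchored character class and the helper name isJapaneseOrEnglish are assumptions made for illustration. It also assumes an engine that understands the "u" flag (ES2015 or later).

```javascript
// Hypothetical helper: the same ranges as the pattern above, merged into a
// single character class and anchored so the entire string has to match.
const JA_EN_ONLY = /^[一-龠ぁ-ゔァ-ヴーa-zA-Z0-9ａ-ｚＡ-Ｚ０-９々〆〤ヶ]+$/u;

function isJapaneseOrEnglish(text) {
  return JA_EN_ONLY.test(text);
}

console.log(isJapaneseOrEnglish("東京Tower2024")); // true
console.log(isJapaneseOrEnglish("カタカナ"));       // true
console.log(isJapaneseOrEnglish("hello world"));   // false: a space is not in the class
console.log(isJapaneseOrEnglish("Привет"));        // false: Cyrillic is not in the class
```

If spaces or punctuation should be accepted, they would have to be added to the character class explicitly.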

Provided your text editor and programming language support Unicode, you should be able to enter Japanese characters as literal strings. Ranges like [A-X] will probably not translate very well in general.
What kind of text are you trying to validate?
What flavor are the regular expressions? Perl-compatible, POSIX, or something else?

Related

how to validate not to allow Chinese characters

One of my SQL LIKE statements breaks when a user enters Chinese characters. I did some research and couldn't find a solution.
Is there any code available, like a plugin or JavaScript function, that validates input and allows only English alphanumerics and special symbols?
If I validate against Chinese characters there could still be other languages, right? So I thought I would allow only English characters with any combination of numbers and symbols.
Any direction or input will be appreciated, thanks.
I tried something like this, but this allows only alphanumerics:
/[^a-zA-Z 0-9]+/g

How about UTF-8 encoding the text rather than banning languages?

Need unicode support for a regular expressions

I have this regular expression /^[A-Z][A-Za-z.'\- ]+$/ for checking a name.
So when I type George or George Harris or George-Harris it's OK. The problem is that it doesn't match names or words in my language (Greek).
How can I add Unicode support to this regular expression?

There is the XRegExp library, which adds support for character classes and other things missing from the JS implementation of regular expressions. I think you'll find its Unicode addon particularly useful.

You can use Unicode escapes to match Greek letters. Here's a map of the Greek characters in Unicode. /[A-Z]/i would translate to /[\u03B1-\u03C9]/i if I understand right. So for Greek characters (and, in my country, diacritics), you need to know their Unicode code points and write them as \uxxxx in regular expressions.
In my experience, if the JavaScript file is saved as UTF-8 and the webpage that uses it is UTF-8, you can use such characters directly. So something like /α/.test('α200β') works in such a setup.
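A minimal sketch of the name pattern from the question extended with \uxxxx ranges, under some assumptions: \u0391-\u03A9 is taken as the unaccented Greek capitals and \u03AC-\u03CE as the lowercase letters including the accented ones. The constant name GREEK_OR_LATIN_NAME is made up for this example, and accented capitals are deliberately left out to keep it short.

```javascript
// Original pattern: /^[A-Z][A-Za-z.'\- ]+$/
// Assumed extension: allow a Greek capital first, then Greek or Latin letters.
const GREEK_OR_LATIN_NAME =
  /^[A-Z\u0391-\u03A9][A-Za-z\u0391-\u03A9\u03AC-\u03CE.'\- ]+$/;

// "\u0393\u03B9\u03CE\u03C1\u03B3\u03BF\u03C2" spells the name Giorgos in Greek letters.
const giorgos = "\u0393\u03B9\u03CE\u03C1\u03B3\u03BF\u03C2";

console.log(GREEK_OR_LATIN_NAME.test("George-Harris"));       // true
console.log(GREEK_OR_LATIN_NAME.test(giorgos));               // true
console.log(GREEK_OR_LATIN_NAME.test(giorgos.toLowerCase())); // false: first letter must be a capital
console.log(GREEK_OR_LATIN_NAME.test("George1"));             // false: digits are not allowed
```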

Regex for all alphabets

I need a regex for all alphabets. I have an input and a target text. Both of them can belong to different alphabets; I mean they can be Chinese, Latin, Cyrillic, or any other alphabet.
I need a regex for multi-language input and multi-language target text.
Does anybody have any idea about this? How can I write this regex?
I will use this with JavaScript, but I think there should be a common regex for Java and JavaScript for this problem.

If you are in Java (not in JavaScript!) you can use Unicode properties, e.g.
\p{L} any kind of letter from any language.
See regular-expressions.info/unicode for more information.
For Javascript:
There is the XRegExp library and its Unicode plugins, which extend the JavaScript regex features. They add support for Unicode categories, scripts, and blocks.
With those libs you would be able to use \p{L} with JavaScript.
See my answer to this question for a small example.

Some regex engines support a special character class for all Unicode letters:
\p{L}
Or you can use \w (letter, digit, underscore), although in JavaScript \w covers only ASCII characters.
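For comparison, here is a small sketch of \p{L} in plain JavaScript. It assumes a runtime with ES2018 Unicode property escapes, which need the "u" flag and did not exist when these answers were written; on older engines the pattern is a syntax error and a library such as XRegExp is needed instead. The constant name LETTERS_ONLY is made up for this example.

```javascript
// Assumes ES2018+: \p{L} is only recognised when the "u" flag is set.
const LETTERS_ONLY = /^\p{L}+$/u;

console.log(LETTERS_ONLY.test("Русский")); // true
console.log(LETTERS_ONLY.test("日本語"));   // true
console.log(LETTERS_ONLY.test("abc123"));  // false: digits are not letters

// \w stays ASCII-only, so it does not match the Cyrillic word.
console.log(/^\w+$/.test("Русский"));      // false
```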

I use the "|" character as a separator, so it is special for me. A key can contain any character except "|". This solved my problem; thanks for the answers. It can be used with JavaScript, Java, and Groovy. I tested it and it worked.
var keyPrefix ="\\|[\u0000-\u007B\u007D-\uFFEF]*";
var keySuffix = "[\u0000-\u007B\u007D-\uFFEF]*\\|";
var searchkey = keyPrefix + key.toLowerCase() + keySuffix;
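The snippet above only builds the pattern string, so here is a hedged sketch of how it might be used end to end. The helper names escapeRegExp and findEntry and the sample target string are assumptions for illustration; escaping the key is an addition so that regex metacharacters in it are matched literally.

```javascript
// Hypothetical helper: escape regex metacharacters in the key.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Same character class as above: everything up to U+FFEF except "|" (U+007C).
const NOT_PIPE = "[\\u0000-\\u007B\\u007D-\\uFFEF]*";

// Hypothetical helper: return the first "|...|" entry containing the key, or null.
function findEntry(target, key) {
  const pattern = new RegExp(
    "\\|" + NOT_PIPE + escapeRegExp(key.toLowerCase()) + NOT_PIPE + "\\|"
  );
  const match = target.match(pattern);
  return match ? match[0] : null;
}

const target = "|apple pie|сырный пирог|抹茶ケーキ|";
console.log(findEntry(target, "Пирог"));  // "|сырный пирог|"
console.log(findEntry(target, "ケーキ")); // "|抹茶ケーキ|"
console.log(findEntry(target, "banana")); // null
```

Because only the key is lowercased, this sketch assumes the target text is already stored in lowercase.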

How to parse for a word in text in JavaScript?

On a page of text, I would like to examine each word. What is the best way to read the text one word at a time? It is easy to find words that are surrounded by spaces, but once you get into parsing words out of real text it can get complicated.
Instead of defining my own way of parsing the words from text, is there something already built that parses out the words, using regular expressions or other methods?
Some examples of words in text:
word word. word(word) word's word word' "word" .word. 'word' sub-word

You can use:
var text = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";
var words = text.match(/[-\w]+/g);
This will give you an array with all your words.
In regular expressions, \w means any character that is either a-z, A-Z, 0-9 or _. [-\w] means any character that is a \w or a -. [-\w]+ means any of these characters appearing 1 or more times.
If you would like to define a word as being something more than the above expression, add the other characters that compose your words inside the [-\w] character class. For example, if you'd like words to also contain ( and ), make the character class [-\w()].
For an introduction to regular expressions, check out the great tutorial at regular-expressions.info.
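To make the effect of the character class concrete, the sketch below runs [-\w]+ on the sample text and compares it with one possible variant. The variant pattern and the variable names are not from the answer; they are an assumption about what "a word" might mean, namely runs of \w that may be joined by a single inner apostrophe or hyphen.

```javascript
const sampleText =
  "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";

// The pattern from the answer: the apostrophe splits "word's" into "word" and "s".
const basicWords = sampleText.match(/[-\w]+/g);
console.log(basicWords.length); // 12
console.log(basicWords[5]);     // "s"

// Assumed variant: allow ' or - only between two runs of word characters.
const joinedWords = sampleText.match(/\w+(?:['-]\w+)*/g);
console.log(joinedWords.length); // 11
console.log(joinedWords[4]);     // "word's"
console.log(joinedWords[10]);    // "sub-word"
```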

What you're talking about is tokenisation. It's non-trivial to say the least, and a subject of intense research at the major search engines. There are a number of open source tokenisation libraries in various server-side languages (e.g. see the Stanford NLP and Lucene projects), but as far as I am aware there's nothing that would even come close to these in JavaScript. You may have to roll your own :) or perhaps do the processing server-side and load the results via AJAX?

I support Richard's answer here - but to add to it - one of the easiest ways of building a tokeniser (imho) is Antlr; and some maniac has built a Javascript target for it; thus allowing you to run and execute a grammar in the web browser (look under 'runtime libraries' section here)
I won't pretend that there's not a learning curve there though.

Take a look at regular expressions - you can define almost any parsing algorithm you want.

Why does \w match only English words in javascript regex?

I'm trying to find URLs in some text, using javascript code. The problem is, the regular expression I'm using uses \w to match letters and digits inside the URL, but it doesn't match non-english characters (in my case - Hebrew letters).
So what can I use instead of \w to match all letters in all languages?

Because \w only matches ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 95 ('_') and 97-122 ('a'-'z'). Hebrew characters and other non-English characters (for example, umlaut-o or tilde-n) are outside of that range.
Instead of matching non-English characters (there are so many of them, in many different Unicode ranges), you might be better off looking for the characters that delineate your words: spaces, quotation marks, and other punctuation.

The ECMA-262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s, on the other hand, matches both ASCII and Unicode whitespace, according to the standard.
JavaScript engines that predate ES2018 do not support the \p syntax for matching Unicode properties either, so on those engines there isn't a good way to do this. You could match all Hebrew characters with:
[\u0590-\u05FF]
This simply matches any code point in the Hebrew block.
You can match any ASCII word character or any Hebrew character with:
[\w\u0590-\u05FF]
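A short sketch of the difference this makes when pulling word-like runs out of text. The sample string, the example.co.il address, and the variable names are invented for illustration.

```javascript
const sample = "visit http://example.co.il/שלום today";

// Plain \w skips the Hebrew path segment entirely.
const asciiRuns = sample.match(/\w+/g);
console.log(asciiRuns.length); // 6

// Adding the Hebrew block (U+0590 to U+05FF) picks it up as one more run.
const hebrewAwareRuns = sample.match(/[\w\u0590-\u05FF]+/g);
console.log(hebrewAwareRuns.length); // 7
console.log(hebrewAwareRuns[5] === "\u05E9\u05DC\u05D5\u05DD"); // true: the Hebrew word
```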

I think you are looking for this regex:
^[אבגדהוזחטיכלמנסעפצקרשתץףןםa-zA-Z0-9\s\.\-_\\\/]+$

I've just found XRegExp, which has not been mentioned yet, and I'm quite impressed with it. It is an alternative regular expression implementation, has a Unicode plugin, and is licensed under the MIT license.
According to the website, to match Unicode characters you'd use code like this:
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true

Try \p{L},
the Unicode regex token for letters.

Have a look at http://www.regular-expressions.info/refunicode.html.
It looks like there is no \w equivalent for unicode, but you can match single unicode letters, so you can create it.

Check this SO Question about JavaScript and Unicode out. Looks like Jan Goyvaerts answer there provides some hope for you.
Edit: But then it seems all browsers don't support \p ... anyway. That question should contain useful info.

Note that URIs (a superset of URLs) are specified by RFC 3986 ("URI: Generic Syntax") to allow only US-ASCII characters.
Normally all other characters should be represented in percent notation:
"In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI." (URI: Generic Syntax)
This is what generally happens when you open a URL with non-ASCII characters in a browser: they get translated into %AB notation, which, in turn, is US-ASCII.
If it is possible to influence the way the material is created, the best option would be to pass URLs through a urlencode()-type function during their creation.
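In JavaScript, the built-in encodeURIComponent does this kind of percent-encoding using UTF-8. A minimal sketch follows; the example.com address and the variable names are placeholders chosen for illustration.

```javascript
// The Hebrew word "shalom", written with escapes to avoid editor encoding issues.
const hebrewWord = "\u05E9\u05DC\u05D5\u05DD";

// Percent-encode the path segment; each Hebrew letter becomes two UTF-8 bytes.
const encodedUrl = "http://example.com/" + encodeURIComponent(hebrewWord);
console.log(encodedUrl); // http://example.com/%D7%A9%D7%9C%D7%95%D7%9D

// The result is pure US-ASCII, so an ASCII-only pattern such as \w can handle it.
console.log(/^[\x00-\x7F]+$/.test(encodedUrl)); // true

// Decoding the last segment gives back the original word.
console.log(decodeURIComponent(encodedUrl.split("/").pop()) === hebrewWord); // true
```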

Perhaps \S (non-whitespace).

If you're the one generating URLs with non-english letters in it, you may want to reconsider.
If I'm interpreting the W3C correctly, URLs may only contain word characters within the latin alphabet.
