I am implementing JavaScript code that makes hashtags linkable, as follows:
str2 = str.replace(/(^|\s)#([A-Za-z0-9é_ü]+)/gi, '$1#$2');
As you can see, I included special Hungarian characters such as é and ü so that they become part of the linked hashtag, but the code above breaks at those characters. When I test it in the w3schools.com example code editor, it works. In my local script file, however, those characters are not recognized as themselves: é seems to be treated as a plain "e". Why is this happening, and how can I overcome it? Please suggest ideas.
Look here and here. JavaScript has some problems with Unicode in regexps.
If you want to match (roughly) every Unicode letter, you can use this character class: [\u00C0-\u1FFF\u2C00-\uD7FF\w]. It is a broad approximation: the ranges also contain some non-letter characters.
So your code should look like this:
str2 = str.replace(/(^|\s)#([\u00C0-\u1FFF\u2C00-\uD7FF\w]+)/gi, '$1#$2');
var str2 = 'abc #łążaf3234 efg'.replace(/(^|\s)#([\u00C0-\u1FFF\u2C00-\uD7FF\w]+)/gi, '$1#$2');
alert(str2);
You have to list the special characters explicitly, e.g. [A-Za-z0-9éüíóþæöÉÚÍÓÞÆÖ] (these are Icelandic characters), or you could use \S to match any non-whitespace character.
Your best bet is to use Unicode escape sequences (like \u2665) rather than the literal character, so the pattern does not depend on the encoding of the script file.
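A minimal sketch of the escape-sequence suggestion, applied to the Hungarian example from the question. Writing é and ü as \u00E9 and \u00FC keeps the script source pure ASCII, so the pattern no longer depends on how the file is saved or served. The function name findHashtags is hypothetical, and returning the tags (instead of building links) is an assumption made to keep the example short.

```javascript
// Hypothetical helper: collects hashtags made of ASCII word characters
// plus é (\u00E9) and ü (\u00FC), written as escapes so the source stays ASCII.
function findHashtags(str) {
  var re = /(^|\s)#([A-Za-z0-9_\u00E9\u00FC]+)/gi;
  var tags = [];
  var match;
  while ((match = re.exec(str)) !== null) {
    tags.push(match[2]); // group 2 is the tag without the leading '#'
  }
  return tags;
}

// The sample text is also written with escapes: "nézd #kép és #fül"
console.log(findHashtags('n\u00E9zd #k\u00E9p \u00E9s #f\u00FCl'));
```

If the literal characters are preferred, the alternative is to make sure the script file is saved and served as UTF-8.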
Related
I want a regex to match a simple hashtag like that in twitter (e.g. #someword). I want it also to recognize non standard characters (like those in Spanish, Hebrew or Chinese).
This was my initial regex: (^|\s|\b)(#(\w+))\b
--> but it doesn't recognize non standard characters.
Then, I tried using XRegExp.js, which worked, but ran too slowly.
Any suggestions for how to do it?
Eventually I found twitter-text.js, which is basically how Twitter solves this problem.
With native JS regexes that don't support unicode, your only option is to explicitly enumerate characters that can end the tag and match everything else, for example:
> s = "foo #הַתִּקְוָה. bar"
"foo #הַתִּקְוָה. bar"
> s.match(/#(.+?)(?=[\s.,:,]|$)/)
["#הַתִּקְוָה", "הַתִּקְוָה"]
The [\s.,:,] should include spaces, punctuation and whatever else can be considered a terminating symbol.
#([^#]+)[\s,;]*
Explanation: This regular expression will search for a # followed by one or more non-# characters, followed by 0 or more spaces, commas or semicolons.
var input = "#hasta #mañana #babהַ";
var matches = input.match(/#([^#]+)[\s,;]*/g);
Result:
["#hasta ", "#mañana ", "#babהַ"]
EDIT - Replaced \b for word boundary
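The answers above predate Unicode-aware regexes in JavaScript. As a sketch, assuming an engine that implements ES2018 Unicode property escapes (the u flag with \p{...}), a hashtag can be described directly as a run of letters, combining marks, digits and underscores. The function name extractHashtags is hypothetical.

```javascript
// Assumes ES2018 support for \p{...} with the "u" flag.
// \p{L} = letters, \p{M} = combining marks (e.g. Hebrew vowel points),
// \p{N} = digits and other numbers.
function extractHashtags(text) {
  var re = /(?:^|\s)#([\p{L}\p{M}\p{N}_]+)/gu;
  var tags = [];
  var match;
  while ((match = re.exec(text)) !== null) {
    tags.push(match[1]); // the tag without the leading '#'
  }
  return tags;
}

console.log(extractHashtags('foo #הַתִּקְוָה. bar'));
console.log(extractHashtags('#hasta #mañana #日本語!'));
```

Including \p{M} matters for the Hebrew example: without it the match would stop at the first vowel point.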
I'm writing a function that takes a prospective filename and validates it in order to ensure that no system disallowed characters are in the filename. These are the disallowed characters: / \ | * ? " < >
I could obviously just use string.indexOf() to search for each special char one by one, but that's a lot longer than it would be to just use string.search() using a regular expression to find any of those characters in the filename.
The problem is that most of these characters are considered to be part of describing a regular expression, so I'm unsure how to include those characters as actually being part of the regex itself. For example, the / character in a Javascript regex tells Javascript that it is the beginning or end of the regex. How would one write a JS regex that functionally behaves like so: filename.search(\ OR / OR | OR * OR ? OR " OR < OR >)
Put your stuff in a character class like so:
[/\\|*?"<>]
You're gonna have to escape the backslash, but the other characters lose their special meaning. Also, RegExp's test() method is more appropriate than String.search in this case.
filenameIsInvalid = /[/\\|*?"<>]/.test(filename);
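The character-class approach can be wrapped in a small validator, sketched below. The names INVALID_FILENAME_CHARS, isValidFilename and invalidCharsIn are hypothetical, and treating an empty name as invalid is an assumption that the question does not state. The forward slash is escaped here only for the benefit of older engines and syntax highlighters.

```javascript
// Characters the question lists as disallowed: / \ | * ? " < >
var INVALID_FILENAME_CHARS = /[\/\\|*?"<>]/;

// Assumption: an empty filename is rejected as well.
function isValidFilename(filename) {
  return filename.length > 0 && !INVALID_FILENAME_CHARS.test(filename);
}

// Returns every offending character, useful for an error message.
function invalidCharsIn(filename) {
  return filename.match(/[\/\\|*?"<>]/g) || [];
}

console.log(isValidFilename('report.txt'));
console.log(invalidCharsIn('a<b>|c'));
```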
Put a backslash before any of the special characters [ \ ^ $ . | ? * + ( ) { } to match it literally, for instance \$.
You can also search for a character by specified ASCII/ANSI value. Use \xFF where FF are 2 hexadecimal digits. Here is a hex table reference. http://www.asciitable.com/ Here is a regex reference http://www.regular-expressions.info/reference.html
The correct syntax of the regex is:
/^[^\/\\|\*\?"<>]+$/
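The [^...] is a negated character class: it matches any character that is not listed inside it. Because the pattern is anchored with ^ and $, the whole string must consist of allowed characters; if any disallowed character is present, match() returns null. So to validate, compare the result against null.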
Demo: jsFiddle.
Demo #2: Comparing against null.
The first string is valid; the second is invalid, hence null.
But obviously, you need to escape regex characters that are used in the matching. To escape a character that has a special meaning in a regex, put a backslash before it, e.g. \*, \/, \$, \?.
You'll need to escape the special characters. In javascript this is done by using the \ (backslash) character.
I'd recommend however using something like xregexp which will handle the escaping for you if you wish to match a string literal (something that is lacking in javascript's native regex support).
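Without a library, a commonly used idiom is to backslash-escape every regex metacharacter before passing a string to the RegExp constructor. This is a sketch; the helper names escapeRegExp and containsLiteral are hypothetical and not part of any built-in API.

```javascript
// Prefix each regex metacharacter with a backslash so the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Example use: search for an arbitrary string as a literal.
function containsLiteral(haystack, needle) {
  return new RegExp(escapeRegExp(needle)).test(haystack);
}

console.log(escapeRegExp('a*b?'));
console.log(containsLiteral('my file(1).txt', 'file(1).txt'));
```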
I have strings in Spanish and other languages that may contain generic special characters like (), *, etc. that I need to remove. The problem is that they may also contain language-specific characters like ñ, á, ó, í, etc., and those need to remain. So I am trying to do it with a regexp in the following way:
var desired = stringToReplace.replace(/[^\w\s]/gi, '');
Unfortunately, it removes all special characters, including the language-specific ones. I'm not sure how to avoid that. Could someone suggest a way?
I would suggest using Steven Levithan's excellent XRegExp library and its Unicode plug-in.
Here's an example that strips non-Latin word characters from a string: http://jsfiddle.net/b3awZ/1/
var regex = XRegExp("[^\\s\\p{Latin}]+", "g");
var str = "¿Me puedes decir la contraseña de la Wi-Fi?"
var replaced = XRegExp.replace(str, regex, "");
See also this answer by Steven Levithan himself:
Regular expression Spanish and Arabic words
Instead of whitelisting characters you accept, you could try blacklisting illegal characters:
var desired = stringToReplace.replace(/[-'`~!##$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/gi, '')
Note! Works only for 16bit code points. This answer is incomplete.
Short answer
The character class for all arabic digits and latin letters is: [0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06].
To get a regex you can use, prepend /^ and append +$/. This will match strings consisting of only latin letters and digits like "mérito" or "Schönheit".
To match non-digits or non-letter characters to remove them, write a ^ as first character after the opening bracket [ and prepend / and append +/.
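As a sketch of how the class is used from JavaScript, the snippet below builds both the "only Latin letters and digits" test and the "remove everything else" replacement. The LATIN string is deliberately abbreviated (an assumption to keep the example short): it keeps only the first few ranges plus \u1e00-\u1eff, so paste in the full class for real use. The function names are hypothetical.

```javascript
// Abbreviated subset of the character class above (NOT the full list).
var LATIN = '0-9A-Za-z\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u02af\\u1e00-\\u1eff';

var onlyLatin = new RegExp('^[' + LATIN + ']+$');      // whole string is Latin letters/digits
var nonLatinRun = new RegExp('[^' + LATIN + ']+', 'g'); // runs of anything else

function isLatinWord(s) {
  return onlyLatin.test(s);
}

function stripNonLatin(s) {
  return s.replace(nonLatinRun, '');
}

console.log(isLatinWord('mérito'), isLatinWord('Schönheit'));
console.log(stripNonLatin('¡Schönheit, (mérito)!'));
```

Note that whitespace is not part of the class, so stripNonLatin removes spaces too; add \\s to the negated class if they should be kept.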
How did I find that out? Continue reading.
Long answer: use metaprogramming!
Because Javascript does not have Unicode regexes, I wrote a Python program to iterate over the whole of Unicode and filter by Unicode name. It is difficult to get this right manually. Why not let the computer do the dirty and menial work?
import unicodedata
import re
import sys

# Written for Python 3.

def unicodeNameMatch(pattern, codepoint):
    # re.match anchors at the start of the name, so "latin" matches
    # names that begin with LATIN.
    try:
        return re.match(pattern, unicodedata.name(chr(codepoint)), re.I)
    except ValueError:
        # Code points without a name (unassigned, control, surrogates).
        return None

def regexChr(codepoint):
    return chr(codepoint) if 32 <= codepoint < 127 else "\\u%04x" % codepoint

names = sys.argv[1:]  # skip the script name itself
prev = None
js_regex = ""
for codepoint in range(pow(2, 16)):
    if any([unicodeNameMatch(name, codepoint) for name in names]):
        if prev is None: js_regex += regexChr(codepoint)
        prev = codepoint
    else:
        if not prev is None: js_regex += "-" + regexChr(prev)
        prev = None
print("[" + js_regex + "]")
Invoke it like this: python char_class.py latin digit and you get the character class mentioned above. It's an ugly char class, but you know for sure that you caught all characters in the 16-bit range whose names start with LATIN or DIGIT (the patterns are matched at the beginning of the name).
Browse the Unicode Character Database to view the names of all Unicode characters. The name is in uppercase after the first semicolon; for example, for A it's the line
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
Try python char_class.py "latin small" and you get a character class for all latin small letters.
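Edit: There is a small misfeature (aka bug) in that single-character ranges such as \u271d-\u271d occur in the regex. Perhaps this fix helps: remember where each run begins by setting start = codepoint at the point where the run's first character is appended, and replace
if not prev is None: js_regex += "-" + regexChr(prev)
by
if not prev is None and prev != start: js_regex += "-" + regexChr(prev)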
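The range-collapsing step can also be sketched on its own in JavaScript. This is an illustration under assumptions, not a port of the script: the function names are hypothetical, the input is assumed to be a sorted array of distinct 16-bit code points, and only ASCII letters and digits are emitted literally so that no character needs escaping inside the class.

```javascript
// Emit ASCII letters/digits literally, everything else as \uXXXX.
function regexChr(codePoint) {
  var ch = String.fromCharCode(codePoint);
  if (/^[0-9A-Za-z]$/.test(ch)) return ch;
  return '\\u' + ('0000' + codePoint.toString(16)).slice(-4);
}

// Collapse a sorted list of code points into a character class.
// A run of length one is written as a single character, not as "x-x".
function toCharClass(codePoints) {
  var out = '';
  var i = 0;
  while (i < codePoints.length) {
    var start = codePoints[i];
    var end = start;
    while (i + 1 < codePoints.length && codePoints[i + 1] === end + 1) {
      i++;
      end = codePoints[i];
    }
    out += regexChr(start);
    if (end !== start) out += '-' + regexChr(end);
    i++;
  }
  return '[' + out + ']';
}

console.log(toCharClass([0x41, 0x42, 0x43, 0xe9, 0x271d]));
```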
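var desired = stringToReplace.replace(/[^\w\s\u0080-\uFFFF]/g, '');
might do the trick: it removes ASCII characters that are neither word characters nor whitespace and leaves everything above U+007F alone.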
See also this Javascript + Unicode regexes question.
If you must insist on whitelisting here is the rawest way of doing it:
Test if string contains only letters (a-z + é ü ö ê å ø etc..)
It works by keeping track of 'all' unicode letter chars.
Unfortunately, JavaScript (before ES2018) does not support Unicode character properties (which would be just the right regex feature for you). If changing the language is an option for you, PHP (for example) can do this:
preg_replace("/[^\pL0-9_\s]/u", "", $str);
Where \pL matches any Unicode character that represents a letter (lower case, upper case, modified or unmodified), and the u modifier makes PCRE treat the pattern and the subject as UTF-8.
If you have to stick with JavaScript and cannot use the library suggested by Tim Down, the only options are probably either blacklisting or whitelisting. But your bounty mentions that blacklisting is not actually an option in your case. So you will probably simply have to include the special characters from your relevant language manually. So you could simply do this:
var desired = stringToReplace.replace(/[^\w\sñáóí]/gi, '');
Or use their corresponding Unicode sequences:
var desired = stringToReplace.replace(/[^\w\s\u00F1\u00E1\u00F3\u00ED]/gi, '');
Then simply add all the ones you want to take care of. Note that the case-insensitive modifier also works with Unicode sequences.
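For comparison with the PHP \pL approach, here is a sketch that assumes a JavaScript engine implementing ES2018 Unicode property escapes (the u flag). In such an engine no manual list of accented characters is needed. The function name stripSpecialChars is hypothetical.

```javascript
// Assumes ES2018 support for \p{...} with the "u" flag.
// Keeps letters (\p{L}), combining marks (\p{M}), numbers (\p{N}) and
// whitespace; removes everything else, including the underscore.
function stripSpecialChars(text) {
  return text.replace(/[^\p{L}\p{M}\p{N}\s]/gu, '');
}

console.log(stripSpecialChars('¿Me puedes decir la contraseña de la Wi-Fi?'));
console.log(stripSpecialChars('(hola), *mundo*'));
```

In older engines, the explicit character list shown above remains the fallback.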
In JavaScript:
"ab abc cab ab ab".replace(/\bab\b/g, "AB");
correctly gives me:
"AB abc cab AB AB"
When I use utf-8 characters though:
"αβ αβγ γαβ αβ αβ".replace(/\bαβ\b/g, "AB");
the word boundary operator doesn't seem to work:
"αβ αβγ γαβ αβ αβ"
Is there a solution to this?
A word boundary assertion matches only at a position that has a word character on one side and a non-word character (or the start or end of the string) on the other, so .\b. is equivalent to \W\w or \w\W. And \w is defined as [A-Za-z0-9_], so it doesn't match Greek characters. Thus you cannot use \b in this case.
What you could do instead is to use this:
"αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB")
Not all JavaScript regexp implementations support Unicode characters in the source, so you need to write them as escape sequences:
"αβ αβγ γαβ αβ αβ".replace(/\u03b1\u03b2/g, "AB"); // "AB ABγ γAB AB AB"
For mapping the characters you can take a look at http://htmlhelp.com/reference/html40/entities/symbols.html
Of course, this doesn't help with the word boundary issue (as explained in other answers) but should at least enable you to match the characters properly
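Writing the escapes by hand is tedious, so here is a small sketch that produces them. The function name toUnicodeEscapes is hypothetical. It works on UTF-16 code units, so a character outside the 16-bit range comes out as two escapes (a surrogate pair), which is what a regex without the u flag expects.

```javascript
// Convert every UTF-16 code unit of a string to a \uXXXX escape.
// ASCII characters are escaped as well, which also neutralizes
// any regex metacharacters in the input.
function toUnicodeEscapes(text) {
  var out = '';
  for (var i = 0; i < text.length; i++) {
    var hex = text.charCodeAt(i).toString(16);
    out += '\\u' + ('0000' + hex).slice(-4);
  }
  return out;
}

var pattern = new RegExp(toUnicodeEscapes('αβ'), 'g');
console.log(toUnicodeEscapes('αβ'));
console.log('αβ αβγ'.replace(pattern, 'AB'));
```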
I needed something to be programmable and handle punctuation, brackets, etc.
http://jsfiddle.net/AQvyd/
var wordToReplace = '買い手',
    replacementWord = '[[BUYER]]',
    text = 'Mange 買い手 information. The selected Store and Classification will be the default on the สั่งซื้อ.';

function replaceWord(text, wordToReplace, replacementWord) {
    // Capture the delimiter on each side so it can be put back around the replacement.
    var re = new RegExp('(^|\\s|\\(|\'|"|,|;)' + wordToReplace + '($|\\s|\\)|\\.|\'|"|!|,|;|\\?)', 'gi');
    return text.replace(re, function (match, before, after) {
        return before + replacementWord + after;
    });
}
I've written a JavaScript resource editor, which is how I found this page. I answered out of necessity, since I couldn't find a parameterized word-boundary regexp that worked well for Unicode.
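Instead of listing delimiters, the boundary can also be expressed as "not next to a letter, mark, digit or underscore". The sketch below assumes an engine with ES2018 lookbehind and Unicode property escapes, which not every environment provides; the function names are hypothetical.

```javascript
// Escape regex metacharacters so the word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Assumes ES2018: lookbehind (?<!...) and \p{...} with the "u" flag.
// A match counts only if it is not adjacent to a Unicode word character.
function replaceWholeWord(text, word, replacement) {
  var wordChar = '[\\p{L}\\p{M}\\p{N}_]';
  var re = new RegExp(
    '(?<!' + wordChar + ')' + escapeRegExp(word) + '(?!' + wordChar + ')',
    'gu'
  );
  // A function replacer keeps "$" in the replacement from being interpreted.
  return text.replace(re, function () { return replacement; });
}

console.log(replaceWholeWord('αβ αβγ γαβ αβ αβ', 'αβ', 'AB'));
console.log(replaceWholeWord('(αβ), αβ!', 'αβ', 'AB'));
```

Because the lookarounds consume nothing, punctuation and brackets around the word are left in place without having to be captured.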
Not all the regex implementations in JavaScript engines are Unicode aware.
For example, Microsoft's JScript, used in IE, is limited to ANSI.
When you’re dealing with Unicode and natural-language words, you probably want to be more careful with boundaries than just using \b. See this answer for details and directions.