Regex to remove non-letter characters but keep accented letters - javascript

I have strings in Spanish and other languages that may contain generic special characters such as (), *, etc., which I need to remove. The problem is that they may also contain language-specific characters such as ñ, á, ó, í, and those need to remain. So I am trying to do it with a regexp in the following way:
var desired = stringToReplace.replace(/[^\w\s]/gi, '');
Unfortunately it removes all special characters, including the language-specific ones. I am not sure how to avoid that. Could someone suggest a way?

I would suggest using Steven Levithan's excellent XRegExp library and its Unicode plug-in.
Here's an example that strips non-Latin word characters from a string: http://jsfiddle.net/b3awZ/1/
var regex = XRegExp("[^\\s\\p{Latin}]+", "g");
var str = "¿Me puedes decir la contraseña de la Wi-Fi?"
var replaced = XRegExp.replace(str, regex, "");
See also this answer by Steven Levithan himself:
Regular expression Spanish and Arabic words

Instead of whitelisting characters you accept, you could try blacklisting illegal characters:
var desired = stringToReplace.replace(/[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/gi, '');

Note! Works only for 16bit code points. This answer is incomplete.
Short answer
The character class for all arabic digits and latin letters is: [0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06].
To get a regex you can use, prepend /^ and append +$/. This will match strings consisting of only latin letters and digits like "mérito" or "Schönheit".
To match non-digits or non-letter characters to remove them, write a ^ as first character after the opening bracket [ and prepend / and append +/.
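As a minimal sketch of the two steps above, the snippet below builds both regexes from the character class. To keep it short, latinClass holds only the first few ranges of the full class (an abbreviation made for this example); paste in the complete class for real use.

```javascript
// Abbreviated for this sketch: only the first ranges of the full class above.
const latinClass = "0-9A-Za-z\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u02af";

// Prepend ^ and append +$: matches strings made only of Latin letters and digits.
const onlyLatin = new RegExp("^[" + latinClass + "]+$");

// Negate the class: matches runs of everything else, so they can be removed.
const notLatin = new RegExp("[^" + latinClass + "]+", "g");

console.log(onlyLatin.test("mérito"));    // true
console.log(onlyLatin.test("Schönheit")); // true
console.log("¿contraseña (Wi-Fi)?".replace(notLatin, "")); // "contraseñaWiFi"
```

Note that the negated class also removes whitespace; add \s inside the brackets if spaces should survive.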
How did I find that out? Continue reading.
Long answer: use metaprogramming!
Because Javascript does not have Unicode regexes, I wrote a Python program to iterate over the whole of Unicode and filter by Unicode name. It is difficult to get this right manually. Why not let the computer do the dirty and menial work?
import re
import sys
import unicodedata

def unicodeNameMatch(pattern, codepoint):
    # Match the pattern against the start of the character's Unicode name
    try:
        return re.match(pattern, unicodedata.name(chr(codepoint)), re.I)
    except ValueError:
        # Code points without a name (controls, surrogates, unassigned)
        return None

def regexChr(codepoint):
    return chr(codepoint) if 32 <= codepoint < 127 else "\\u%04x" % codepoint

names = sys.argv[1:]  # skip the script name
prev = None
js_regex = ""
for codepoint in range(pow(2, 16)):
    if any([unicodeNameMatch(name, codepoint) for name in names]):
        if prev is None: js_regex += regexChr(codepoint)
        prev = codepoint
    else:
        if not prev is None: js_regex += "-" + regexChr(prev)
        prev = None
print("[" + js_regex + "]")
Invoke it like this: python char_class.py latin digit and you get the character class mentioned above (the exact ranges depend on the Unicode version your Python ships with). It's an ugly char class, but you know for sure that you caught all characters whose names start with latin or digit.
Browse the Unicode Character Database to view the names of all Unicode characters. The name is in uppercase after the first semicolon; for example, for A it's the line
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
Try python char_class.py "latin small" and you get a character class for all latin small letters.
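Edit: There is a small misfeature (aka bug): a run consisting of a single character is written as a range, so \u271d-\u271d occurs in the regex. Comparing prev with codepoint does not help, because in the else branch codepoint is always greater than prev. What is needed is the code point at which the run started: store it in a variable (say start = codepoint) in the if prev is None branch, then replace
if not prev is None: js_regex += "-" + regexChr(prev)
by
if not prev is None and prev != start: js_regex += "-" + regexChr(prev)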

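var desired = stringToReplace.replace(/[^\w\s\u0080-\uFFFF]/g, '');
might do the trick: it removes only ASCII characters that are neither word characters nor whitespace and leaves everything outside ASCII alone.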
See also this Javascript + Unicode regexes question.

If you must insist on whitelisting here is the rawest way of doing it:
Test if string contains only letters (a-z + é ü ö ê å ø etc..)
It works by keeping track of 'all' unicode letter chars.

Unfortunately, Javascript does not support Unicode character properties (which would be just the right regex feature for you). If changing the language is an option for you, PHP (for example) can do this:
preg_replace("/[^\pL0-9_\s]/u", "", $str);
Where \pL matches any Unicode character that represents a letter (lower case, upper case, modified or unmodified), and the u modifier makes PCRE treat the pattern and the subject as UTF-8.
If you have to stick with JavaScript and cannot use the library suggested by Tim Down, the only options are probably either blacklisting or whitelisting. But your bounty mentions that blacklisting is not actually an option in your case. So you will probably simply have to include the special characters from your relevant language manually. So you could simply do this:
var desired = stringToReplace.replace(/[^\w\sñáóí]/gi, '');
Or use their corresponding Unicode sequences:
var desired = stringToReplace.replace(/[^\w\s\u00F1\u00E1\u00F3\u00ED]/gi, '');
Then simply add all the ones you want to take care of. Note that the case-insensitive modifier also works with Unicode sequences.
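Engines that implement ES2018 accept Unicode property escapes when the regex carries the u flag, which removes the need to list letters by hand. A minimal sketch, assuming such an engine (the function name stripSpecial is made up for this example):

```javascript
// Keep letters of any script, digits, underscore and whitespace; drop the rest.
// \p{L} = any Unicode letter, \p{N} = any Unicode number; both require the u flag.
function stripSpecial(input) {
  return input.replace(/[^\p{L}\p{N}_\s]/gu, "");
}

console.log(stripSpecial("¿Me puedes decir la contraseña de la Wi-Fi?"));
// "Me puedes decir la contraseña de la WiFi"
```

Letters written with combining marks (for example n followed by U+0303) lose their accent with this class; normalizing the input with normalize("NFC") first, or adding \p{M} to the class, avoids that.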

Related

RegEx for exact word search starting with Umlaut [duplicate]

I am building a search feature and am going to use JavaScript autocomplete with it. I am from Finland (Finnish language), so I have to deal with some special characters like ä, ö and å.
When the user types text into the search input field, I try to match the text against the data.
Here is a simple example that does not work correctly if the user types, for example, "ää". Same thing with "äl".
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";
// does not work
//var searchterm = "ää";
// Works
//var searchterm = "wi";
if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}
http://jsfiddle.net/7TsxB/
So how can I get those ä,ö and å characters to work with javascript regex?
I think I should use unicode codes but how should I do that? Codes for those characters are:
[\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]
=> äÄåÅöÖ
The problem is the word boundary \b: it does not match before a word whose first character is outside the ASCII range.
Instead of using \b, try using (?:^|\\s)
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Did not work with \b, works now
var searchterm = "äl";
// Did not work with \b, works now
//var searchterm = "ää";
// Works
//var searchterm = "wi";
if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}
Breakdown:
(?: parentheses () form a capturing group in regex. Parentheses that start with a question mark and colon, (?:, form a non-capturing group. They just group the terms together.
^ the caret symbol matches the beginning of a string
| the bar is the "or" operator.
\s matches whitespace (appears as \\s in the string because we have to escape the backslash)
) closes the group
So instead of using \b, which matches word boundaries and doesn't work for non-ASCII characters, we use a non-capturing group which matches the beginning of a string OR whitespace.
The \b assertion in JavaScript regex is really only useful with plain ASCII text. \b is a shortcut for the boundary between the \w and \W sets, or between \w and the beginning or end of the string. These character sets only take into account ASCII "word" characters, where \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.
This makes these shorthand character classes largely useless for dealing with any real language.
\s should work for what you want to do, provided that search terms are only delimited by whitespace.
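A small sketch of the difference, wrapped in a helper. The name startsWord and the escaping step are additions for this example, not part of the answers above; escaping matters as soon as the search term comes from user input.

```javascript
// Escape regex metacharacters so the term is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True if `term` occurs at the start of the string or right after whitespace.
function startsWord(text, term) {
  return new RegExp("(?:^|\\s)" + escapeRegExp(term), "i").test(text);
}

const title = "tämä on ääkköstesti älkää ihmetelkö";
console.log(new RegExp("\\bäl", "i").test(title)); // false: no \w/\W transition before "äl"
console.log(startsWord(title, "äl")); // true
console.log(startsWord(title, "ää")); // true
console.log(startsWord(title, "kö")); // false: "kö" only occurs inside words
```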
This question is old, but I think I found a better solution for word boundaries in regular expressions with Unicode letters.
Using the XRegExp library you can implement a valid \b boundary by expanding this:
XRegExp('(?=^|$|[^\\p{L}])')
The result is 4000+ characters long, but it seems to perform quite well.
Some explanation: (?= ) is a zero-length lookahead that looks for the beginning or end of the string or a non-letter Unicode character. The most important thing is the lookahead, because \b doesn't capture anything: it is simply true or false.
\b is a shortcut for the transition between a letter and a non-letter character, or vice versa.
Updating and improving on max_masseti's answer:
With the /u flag (ES2015) and the Unicode property escapes and lookbehind added in ES2018, you can now use \p{L} to represent any Unicode letter, and \P{L} (notice the uppercase P) to represent anything but.
EDIT: Previous version was incomplete.
As such:
const text = 'A Fé, o Império, e as terras viciosas';
text.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u);
// ['A', ' ', 'Fé', ', ', 'o', ' ', 'Império', ', ', 'e', ' ', 'as', ' ', 'terras', ' ', 'viciosas']
We're using a lookbehind (?<=...) to find a letter and a lookahead (?=...) to find a non-letter, or vice versa.
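If only the words are needed rather than every boundary, matching runs of letters is a simpler variant of the same idea. A sketch, assuming an engine with property-escape support:

```javascript
const text = 'A Fé, o Império, e as terras viciosas';

// \p{L}+ matches each maximal run of Unicode letters; the u flag is required.
const words = text.match(/\p{L}+/gu);

console.log(words);
// ['A', 'Fé', 'o', 'Império', 'e', 'as', 'terras', 'viciosas']
```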
I would recommend using XRegExp when you have to work with a specific set of Unicode characters; the author of this library has mapped all kinds of regional character sets, which makes working with different languages easier.
Although the question seems to be 8 years old, I ran into a similar problem (I had to match Cyrillic letters) not long ago. I spent a whole day on this and could not find an appropriate answer here on StackOverflow. So, to save others the effort, I'd like to share my solution.
Yes, \b word boundary works only with Latin letters (Word boundary: \b):
Word boundary \b doesn’t work for non-Latin alphabets
The word boundary test \b checks that there should be \w on the one side from the position and "not \w" – on the other side.
But \w means a Latin letter a-z (or a digit or an underscore), so the test doesn’t work for other characters, e.g. Cyrillic letters or hieroglyphs.
Yes, without the u flag the JavaScript RegExp implementation has hardly any Unicode support.
So I tried implementing my own word boundary with support for non-Latin characters. To make the word boundary work just with Cyrillic characters, I created this regular expression:
new RegExp(`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,'gi')
Where \u0400-\u04ff is a range of Cyrillic characters provided in the table of codes. It is not an ideal solution, however, it works properly in most cases.
To make it work in your case, you just have to pick up an appropriate range of codes from the list of Unicode characters.
To try out my example run the code snippet below.
function getMatchExpression(cyrillicSearchValue) {
    return new RegExp(
        `(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,
        'gi',
    );
}
const sentence = 'Будь-який текст кирилицею, де необхідно знайти слово з контексту';
console.log(sentence.match(getMatchExpression('текст')));
// expected output: ["текст"]
console.log(sentence.match(getMatchExpression('но')));
// expected output: null
I noticed something really weird with \b when using Unicode:
/\bo/.test("pop"); // false (obviously)
/\bä/.test("päp"); // true (what..?)
/\Bo/.test("pop"); // true
/\Bä/.test("päp"); // false (what..?)
It appears that the meanings of \b and \B are reversed, but only when used with non-ASCII characters. The reason is that ä is not part of \w, so the engine treats it like punctuation: in "päp" there is a \w/\W boundary between p and ä, while a real word start such as a space followed by ä has none.
In any case, it seems that the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/_&-]), as that seems to work correctly. (The hyphen goes last in the class so that it is not read as a range. Make your list of symbols more comprehensive than mine, though.)
My idea is to search with codes representing the Finnish letters
new RegExp("\\b"+asciiOnly(searchterm), "gi").test(asciiOnly(title))
My original idea was to use plain encodeURI but the % sign seemed to interfere with the regexp.
http://jsfiddle.net/7TsxB/5/
I wrote a crude function using encodeURI to encode every character with code over 128 but removing its % and adding 'QQ' in the beginning. It is not the best marker but I couldn't get non alphanumeric to work.
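The asciiOnly function itself is only in the fiddle. A rough reconstruction from the description above might look like this; the implementation details are assumptions, not the fiddle's code:

```javascript
// Replace every non-ASCII character with "QQ" + its UTF-8 bytes in hex,
// e.g. "ä" -> "%C3%A4" -> "QQC3A4", so that \w and \b see only ASCII word characters.
function asciiOnly(s) {
  return s.replace(/[^\x00-\x7F]/gu, function (ch) {
    return "QQ" + encodeURI(ch).replace(/%/g, "");
  });
}

const title = "tämä on ääkköstesti älkää ihmetelkö";
console.log(asciiOnly("äl")); // "QQC3A4l"
console.log(new RegExp("\\b" + asciiOnly("äl"), "i").test(asciiOnly(title))); // true
console.log(new RegExp("\\b" + asciiOnly("kö"), "i").test(asciiOnly(title))); // false
```

The marker can collide with real text that happens to contain QQ followed by hex digits, and encodeURI throws on lone surrogates, so this remains a workaround rather than a robust solution.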
What you are looking for is the Unicode word boundaries standard:
http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries
There is a JavaScript implementation here (unicodejs.wordbreak.js)
https://github.com/wikimedia/unicodejs
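Recent JavaScript engines also ship a built-in word segmenter, Intl.Segmenter, which engines typically base on the same Unicode segmentation rules. A sketch, assuming a runtime that provides it (the helper name wordsOf and the "fi" locale are choices made for this example):

```javascript
// Split text into words using the engine's built-in word segmentation.
function wordsOf(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for spaces and punctuation.
    if (part.isWordLike) words.push(part.segment);
  }
  return words;
}

const words = wordsOf("tämä on ääkköstesti, älkää ihmetelkö", "fi");
console.log(words); // ['tämä', 'on', 'ääkköstesti', 'älkää', 'ihmetelkö']
console.log(words.some(w => w.toLowerCase().startsWith("äl"))); // true
```

For autocomplete, checking whether any segmented word starts with the search term replaces the \b test entirely.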
I had a similar problem, where I was trying to replace all of a particular unicode word with a different unicode word, and I cannot use lookbehind because it's not supported in the JS engine this code will be used in. I ultimately resolved it like this:
const needle = "КАРТОПЛЯ";
const replace = "БАРАБОЛЯ";
const regex = new RegExp(
    String.raw`(^|[^\n\p{L}])`
    + needle
    + String.raw`(?=$|\P{L})`,
    "gimu",
);
const result = (
    'КАРТОПЛЯ сдффКАРТОПЛЯдадф КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ??? !!!КАРТОПЛЯ ;!;!КАРТОПЛЯ/#?#?'
    + '\n\nКАРТОПЛЯ КАРТОПЛЯ - - -КАРТОПЛЯ--'
)
.replace(regex, function (match, ...args) {
    return args[0] + replace;
});
console.log(result)
output:
БАРАБОЛЯ сдффКАРТОПЛЯдадф БАРАБОЛЯ БАРАБОЛЯ БАРАБОЛЯ??? !!!БАРАБОЛЯ ;!;!БАРАБОЛЯ/#?#?
БАРАБОЛЯ БАРАБОЛЯ - - -БАРАБОЛЯ--
Breaking it apart
The first regex: (^|[^\n\p{L}])
^| = Start of the line or
[^\n\p{L}] = Any character which is not a letter or a newline
The second regex: (?=$|\P{L})
?= = Lookahead
$| = End of the line or
\P{L} = Any character which is not a letter
The first regex captures the group and is then used via args[0] to put it back into the string during replacement, thereby avoiding a lookbehind. The second regex utilized lookahead.
Note that the second one MUST be a lookahead because if we capture it then overlapping regex matches will not trigger (e.g. КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ would only match on the 1st and 3rd ones).
Trying to find text "myTest":
/(?<![\p{L}\p{N}_])myTest(?![\p{L}\p{N}_])/gu
This is similar to the whole-word search in NetBeans or Notepad++. It finds the expression only where it is not preceded or followed by a letter, number or underscore (the Unicode counterpart of the \w characters that the word boundary \b relies on).
I have had a similar problem, but I had to replace an array of terms. None of the solutions I found worked when two terms were next to each other in the text (because their boundaries overlapped). So I had to use a slightly modified approach:
var text = "Ještě. že; \"už\" à. Fürs, 'anlässlich' že že že.";
var terms = ["à","anlässlich","Fürs","už","Ještě", "že"];
var replaced = [];
var order = 0;
for (var i = 0; i < terms.length; i++) {
    terms[i] = "(^\|[ \n\r\t.,;'\"\+!?-])(" + terms[i] + ")([ \n\r\t.,;'\"\+!?-]+\|$)";
}
var re = new RegExp(terms.join("|"), "");
while (true) {
    var replacedString = "";
    text = text.replace(re, function replacer(match){
        var beginning = match.match("^[ \n\r\t.,;'\"\+!?-]+");
        if (beginning == null) beginning = "";
        var ending = match.match("[ \n\r\t.,;'\"\+!?-]+$");
        if (ending == null) ending = "";
        replacedString = match.replace(beginning,"");
        replacedString = replacedString.replace(ending,"");
        replaced.push(replacedString);
        return beginning+"{{"+order+"}}"+ending;
    });
    if (replacedString == "") break;
    order += 1;
}
See the code in a fiddle: http://jsfiddle.net/antoninslejska/bvbLpdos/1/
The regular expression is inspired by: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular
I can't say, that I find the solution elegant...
The correct answer to the question is given by andrefs.
I will only rewrite it more clearly, after putting all required things together.
For ASCII text, you can use \b for matching a word boundary both at the start and the end of a pattern. When using Unicode text, you need to use 2 different patterns for doing the same:
Use (?<=^|\P{L}) for matching the start or a word boundary before the main pattern.
Use (?=\P{L}|$) for matching the end or a word boundary after the main pattern.
Additionally, make the whole match case-insensitive: in flavors with inline flags, put (?i) at the beginning; JavaScript does not accept a leading (?i), so pass the i flag instead (plus u, which \P{L} requires).
So the resulting answer is: (?i)(?<=^|\P{L})xxx(?=\P{L}|$), where xxx is your main pattern. This would be the equivalent of (?i)\bxxx\b for ASCII text.
For your code to work, you now need to do the following:
Assign to your variable "searchterm" the pattern or words you want to find.
Escape the variable's contents. For example, replace '\' with '\\', and do the same for any other reserved regex character, such as '^', '$' and '/' (which become '\^', '\$' and '\/'). Check here for a question on how to do this.
Insert the variable's contents into the pattern above, in place of "xxx", for example with the string.replace() method.
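In JavaScript, the three steps could be sketched as follows. The helper names are made up for this example, the pattern is assembled by concatenation rather than string.replace(), and lookbehind support (ES2018) is assumed.

```javascript
// Step 2: escape regex metacharacters in the search term.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Step 3: wrap the escaped term in the Unicode-aware boundaries.
// The i and u flags stand in for the leading (?i) of the generic pattern.
function wholeWord(searchterm) {
  return new RegExp(
    "(?<=^|\\P{L})" + escapeRegExp(searchterm) + "(?=\\P{L}|$)",
    "iu"
  );
}

// Step 1: the term to look for.
const title = "tämä on ääkköstesti älkää ihmetelkö";
console.log(wholeWord("älkää").test(title)); // true
console.log(wholeWord("äl").test(title));    // false: "äl" is only a prefix here
console.log(wholeWord("TÄMÄ").test(title));  // true: case-insensitive
```

For prefix matching as in the autocomplete question, drop the trailing lookahead and keep only the leading lookbehind.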
Bad, but working:
var text = " аб аб АБ абвг ";
var ttt = "(аб)";
var p = "(^|$|[^A-Za-zА-Я-а-я0-9()])"; // add other word boundary symbols here
var exp = new RegExp(p+ttt+p,"gi");
// Replace twice: a match consumes the separator after it, so adjacent words need a second pass
text = text.replace(exp, "$1($2)$3").replace(exp, "$1($2)$3");
console.log(text);
Result (without quotes):
" (аб) (аб) (АБ) абвг "
I struggled hard with this while working with French accented characters, and I managed to find this solution:
const myString = "MyString";
const regex = new RegExp(
    "(?:[^À-ú]|^)\\b(" + myString + ")\\b(?:[^À-ú]|$)",
    "ig"
);
What it does:
It keeps checking word boundaries with \b before and after "MyString".
In addition to that, (?:[^À-ú]|^) and (?:[^À-ú]|$) check that MyString is not surrounded by accented characters.
It will not work with Cyrillic, but it may be possible to find the range of Cyrillic characters and edit [^À-ú] accordingly.
Warning: it captures only the group (MyString), but the total match contains the previous and next characters.
See example : https://regex101.com/r/5P0ZIe/1
Match examples :
MyString
match : "MyString"
group 1 : "MyString"
Lorem ipsum. MyString dolor sit amet
match : " MyString "
group 1 : "MyString"
(MyString)
match : "(MyString)"
group 1 : "MyString"
BetweenCharactersMyStringIsNotFound
match : Nothing
group 1 : Nothing
éMyStringé
match : Nothing
group 1 : Nothing
ùMyString
match : Nothing
group 1 : Nothing
MyStringÖ
match : Nothing
group 1 : Nothing

Removing punctuation from strings?

I'm working on a palindrome function and have come across a formula which removes punctuation from strings.
var punctuation = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;
var spaceRE = /\s+/g;
var str = "randomstringwith*&^%"
var testStr = str.replace(punctuation, '').replace(spaceRE, '')
document.write(testStr);
My question is: if I remove the .replace(spaceRE, ''), nothing seems to change in the result. Is there something I'm missing, or does this formula contain excess code? Also, I'm slightly confused about the use of str.replace(punctuation, '');
punctuation represents any non-letter/number characters and the '' replaces them with an empty string, correct? Thanks!
In situations like yours you have to ask yourself which is easier:
Create a regexp that blocks certain characters
Create a regexp that allows certain characters
The choice should depend on which is less work and more reliable.
Writing a pattern that blocks all symbols depends on you remembering every possible symbol - not just punctuation, but emoji patterns, mathematical symbols and so on.
If all you want is to allow numbers and letters only, you can do:
str.replace(/\W/g, '');
\W is an alias for "non-alphanumeric" characters. Two caveats: alphanumeric here means ASCII only, so accented letters such as é are removed as well; and it includes underscores, so if you want to block those too:
str.replace(/\W|_/g, '');
It turns out that var spaceRE = /\s+/g; removes all whitespace from a string, while punctuation removes punctuation. Replacing both with empty strings produces a string with no punctuation or whitespace, which is saved to testStr. The sample string contains no whitespace, which is why dropping .replace(spaceRE, '') does not change its result.
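For the palindrome use case, the two replacements can be folded into one character class. A sketch using Unicode property escapes (ES2018); the function name isPalindrome is invented here, and whether symbols and underscores should be stripped is an assumption about the intended behaviour:

```javascript
// Remove punctuation (\p{P}), symbols (\p{S}) and whitespace in one pass,
// then compare the lowercased string with its reverse.
function isPalindrome(str) {
  const cleaned = str.replace(/[\p{P}\p{S}\s]/gu, "").toLowerCase();
  return cleaned === Array.from(cleaned).reverse().join("");
}

console.log("randomstringwith*&^%".replace(/[\p{P}\p{S}\s]/gu, "")); // "randomstringwith"
console.log(isPalindrome("A man, a plan, a canal: Panama!")); // true
console.log(isPalindrome("randomstringwith*&^%"));            // false
```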

Regex treating accentuated letters as word boundary [duplicate]

I am building search and I am going to use javascript autocomplete with it. I am from Finland (finnish language) so I have to deal with some special characters like ä, ö and å
When user types text in to the search input field I try to match the text to data.
Here is simple example that is not working correctly if user types for example "ää". Same thing with "äl"
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";
// does not work
//var searchterm = "ää";
// Works
//var searchterm = "wi";
if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
$("#result").html("Match: ("+searchterm+"): "+title);
} else {
$("#result").html("nothing found with term: "+searchterm);
}
http://jsfiddle.net/7TsxB/
So how can I get those ä,ö and å characters to work with javascript regex?
I think I should use unicode codes but how should I do that? Codes for those characters are:
[\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]
=> äÄåÅöÖ
There appears to be a problem with Regex and the word boundary \b matching the beginning of a string with a starting character out of the normal 256 byte range.
Instead of using \b, try using (?:^|\\s)
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";
// does not work
//var searchterm = "ää";
// Works
//var searchterm = "wi";
if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
$("#result").html("Match: ("+searchterm+"): "+title);
} else {
$("#result").html("nothing found with term: "+searchterm);
}
Breakdown:
(?: parenthesis () form a capture group in Regex. Parenthesis started with a question mark and colon ?: form a non-capturing group. They just group the terms together
^ the caret symbol matches the beginning of a string
| the bar is the "or" operator.
\s matches whitespace (appears as \\s in the string because we have to escape the backslash)
) closes the group
So instead of using \b, which matches word boundaries and doesn't work for unicode characters, we use a non-capturing group which matches the beginning of a string OR whitespace.
The \b character class in JavaScript RegEx is really only useful with simple ASCII encoding. \b is a shortcut code for the boundary between \w and \W sets or \w and the beginning or end of the string. These character sets only take into account ASCII "word" characters, where \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.
This makes the RegEx character classes largely useless for dealing with any real language.
\s should work for what you want to do, provided that search terms are only delimited by whitespace.
this question is old, but I think I found a better solution for boundary in regular expressions with unicode letters.
Using XRegExp library you can implement a valid \b boundary expanding this
XRegExp('(?=^|$|[^\\p{L}])')
the result is a 4000+ char long, but it seems to work quite performing.
Some explanation: (?= ) is a zero-length lookahead that looks for a begin or end boundary or a non-letter unicode character. The most important think is the lookahead, because the \b doesn't capture anything: it is simply true or false.
\b is a shortcut for the transition between a letter and a non-letter character, or vice-versa.
Updating and improving on max_masseti's answer:
With the introduction of the /u modifier for RegExs in ES2018, you can now use \p{L} to represent any unicode letter, and \P{L} (notice the uppercase P) to represent anything but.
EDIT: Previous version was incomplete.
As such:
const text = 'A Fé, o Império, e as terras viciosas';
text.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/);
// ['A', ' Fé', ',', ' o', ' Império', ',', ' e', ' as', ' terras', ' viciosas']
We're using a lookbehind (?<=...) to find a letter and a lookahead (?=...) to find a non-letter, or vice versa.
I would recommend you to use XRegExp when you have to work with a specific set of characters from Unicode, the author of this library mapped all kind of regional sets of characters making the work with different languages easier.
Despite the fact the issue seems to be 8 years old, I run into a similar problem (I had to match Cyrillic letters) not so far ago. I spend a whole day on this and could not find any appropriate answer here on StackOverflow. So, to avoid others making lots of effort, I'd like to share my solution.
Yes, \b word boundary works only with Latin letters (Word boundary: \b):
Word boundary \b doesn’t work for non-Latin alphabets
The word boundary test \b checks that there should be \w on the one side from the position and "not \w" – on the other side.
But \w means a Latin letter a-z (or a digit or an underscore), so the test doesn’t work for other characters, e.g. Cyrillic letters or hieroglyphs.
Yes, JavaScript RegExp implementation hardly supports UTF-8 encoding.
So, I tried implementing own word boundary feature with the support of non-Latin characters. To make word boundary work just with Cyrillic characters I created such regular expression:
new RegExp(`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,'gi')
Where \u0400-\u04ff is a range of Cyrillic characters provided in the table of codes. It is not an ideal solution, however, it works properly in most cases.
To make it work in your case, you just have to pick up an appropriate range of codes from the list of Unicode characters.
To try out my example run the code snippet below.
function getMatchExpression(cyrillicSearchValue) {
return new RegExp(
`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,
'gi',
);
}
const sentence = 'Будь-який текст кирилицею, де необхідно знайти слово з контексту';
console.log(sentence.match(getMatchExpression('текст')));
// expected output: ["текст"]
console.log(sentence.match(getMatchExpression('но')));
// expected output: null
I noticed something really weird with \b when using Unicode:
/\bo/.test("pop"); // false (obviously)
/\bä/.test("päp"); // true (what..?)
/\Bo/.test("pop"); // true
/\Bä/.test("päp"); // false (what..?)
It appears that meaning of \b and \B are reversed, but only when used with non-ASCII Unicode? There might be something deeper going on here, but I'm not sure what it is.
In any case, it seems that the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/-_&]), as that seems to work correctly. (Make your list of symbols more comprehensive than mine, though.)
My idea is to search with codes representing the Finnish letters
new RegExp("\\b"+asciiOnly(searchterm), "gi").test(asciiOnly(title))
My original idea was to use plain encodeURI but the % sign seemed to interfere with the regexp.
http://jsfiddle.net/7TsxB/5/
I wrote a crude function using encodeURI to encode every character with code over 128 but removing its % and adding 'QQ' in the beginning. It is not the best marker but I couldn't get non alphanumeric to work.
What you are looking for is the Unicode word boundaries standard:
http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries
There is a JavaScript implementation here (unciodejs.wordbreak.js)
https://github.com/wikimedia/unicodejs
I had a similar problem, where I was trying to replace all of a particular unicode word with a different unicode word, and I cannot use lookbehind because it's not supported in the JS engine this code will be used in. I ultimately resolved it like this:
const needle = "КАРТОПЛЯ";
const replace = "БАРАБОЛЯ";
const regex = new RegExp(
String.raw`(^|[^\n\p{L}])`
+ needle
+ String.raw`(?=$|\P{L})`,
"gimu",
);
const result = (
'КАРТОПЛЯ сдффКАРТОПЛЯдадф КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ??? !!!КАРТОПЛЯ ;!;!КАРТОПЛЯ/#?#?'
+ '\n\nКАРТОПЛЯ КАРТОПЛЯ - - -КАРТОПЛЯ--'
)
.replace(regex, function (match, ...args) {
return args[0] + replace;
});
console.log(result)
output:
БАРАБОЛЯ сдффКАРТОПЛЯдадф БАРАБОЛЯ БАРАБОЛЯ БАРАБОЛЯ??? !!!БАРАБОЛЯ ;!;!БАРАБОЛЯ/#?#?
БАРАБОЛЯ БАРАБОЛЯ - - -БАРАБОЛЯ--
Breaking it apart
The first regex: (^|[^\n\p{L}])
^| = Start of the line or
[^\n\p{L}] = Any character which is not a letter or a newline
The second regex: (?=$|\P{L})
?= = Lookahead
$| = End of the line or
\P{L} = Any character which is not a letter
The first regex captures the group and is then used via args[0] to put it back into the string during replacement, thereby avoiding a lookbehind. The second regex utilized lookahead.
Note that the second one MUST be a lookahead because if we capture it then overlapping regex matches will not trigger (e.g. КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ would only match on the 1st and 3rd ones).
Trying to find text "myTest":
/(?<![\p{L}\p{N}_])myTest(?![\p{L}\p{N}_])/gu
Similar to NetBeans or Notepad++ form. Trying to find the expression without any letter or number or underscore (like \w characters of word boundary \b) in any unicode characters of letter and number before or after the expression.
I have had a similar problem, but I had to replace an array of terms. All solutions, which I have found did not worked, if two terms were in the text next to each other (because their boundaries overlaped). So I had to use a little modified approach:
var text = "Ještě. že; \"už\" à. Fürs, 'anlässlich' že že že.";
var terms = ["à","anlässlich","Fürs","už","Ještě", "že"];
var replaced = [];
var order = 0;
for (i = 0; i < terms.length; i++) {
terms[i] = "(^\|[ \n\r\t.,;'\"\+!?-])(" + terms[i] + ")([ \n\r\t.,;'\"\+!?-]+\|$)";
}
var re = new RegExp(terms.join("|"), "");
while (true) {
var replacedString = "";
text = text.replace(re, function replacer(match){
var beginning = match.match("^[ \n\r\t.,;'\"\+!?-]+");
if (beginning == null) beginning = "";
var ending = match.match("[ \n\r\t.,;'\"\+!?-]+$");
if (ending == null) ending = "";
replacedString = match.replace(beginning,"");
replacedString = replacedString.replace(ending,"");
replaced.push(replacedString);
return beginning+"{{"+order+"}}"+ending;
});
if (replacedString == "") break;
order += 1;
}
See the code in a fiddle: http://jsfiddle.net/antoninslejska/bvbLpdos/1/
The regular expression is inspired by: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular
I can't say, that I find the solution elegant...
The correct answer to the question is given by andrefs.
I will only rewrite it more clearly, after putting all required things together.
For ASCII text, you can use \b for matching a word boundary both at the start and the end of a pattern. When using Unicode text, you need to use 2 different patterns for doing the same:
Use (?<=^|\P{L}) for matching the start or a word boundary before the main pattern.
Use (?=\P{L}|$) for matching the end or a word boundary after the main pattern.
Additionally, use (?i) in the beginning of everything, to make all those matchings case-insensitive.
So the resulting answer is: (?i)(?<=^|\P{L})xxx(?=\P{L}|$), where xxx is your main pattern. This would be the equivalent of (?i)\bxxx\b for ASCII text.
For your code to work, you now need to do the following:
Assign to your variable "searchterm", the pattern or words you want to find.
Escape the variable's contents. For example, replace '\' with '\\' and also do the same for any reserved special character of regex, like '\^', '\$', '\/', etc. Check here for a question on how to do this.
Insert the variable's contents to the pattern above, in the place of "xxx", by simply using the string.replace() method.
bad but working:
var text = " аб аб АБ абвг ";
var ttt = "(аб)"
var p = "(^|$|[^A-Za-zА-Я-а-я0-9()])"; // add other word boundary symbols here
var exp = new RegExp(p+ttt+p,"gi");
text = text.replace(exp, "$1($2)$3").replace(exp, "$1($2)$3");
const t1 = performance.now();
console.log(text);
result (without qutes):
" (аб) (аб) (АБ) абвг "
I struggled hard on this. Working with French accented characters, and I managed to find this solution :
const myString = "MyString";
const regex = new RegExp(
"(?:[^À-ú]|^)\\b(" + myString + ")\\b(?:[^À-ú]|$)",
"ig"
);
What it does:
It keeps checking word boundaries with \b before and after "MyString".
In addition to that, (?:[^À-ú]|^) and (?:[^À-ú]|$) check that MyString is not surrounded by any accented characters.
It will not work with Cyrillic, but it may be possible to find the range of Cyrillic characters and edit [^À-ú] accordingly.
Warning: only group 1 captures (MyString) on its own; the total match also contains the previous and next characters.
See example: https://regex101.com/r/5P0ZIe/1
Match examples:
MyString
match : "MyString"
group 1 : "MyString"
Lorem ipsum. MyString dolor sit amet
match : " MyString "
group 1 : "MyString"
(MyString)
match : "(MyString)"
group 1 : "MyString"
BetweenCharactersMyStringIsNotFound
match : Nothing
group 1 : Nothing
éMyStringé
match : Nothing
group 1 : Nothing
ùMyString
match : Nothing
group 1 : Nothing
MyStringÖ
match : Nothing
group 1 : Nothing
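The match examples above can be reproduced with a small harness. This is a sketch rather than part of the original answer: the function name findMyString and the returned object shape are made up for illustration, and the source file is assumed to be saved as UTF-8 so that the À-ú range is read correctly.

```javascript
const myString = "MyString";

function findMyString(input) {
  // Build a fresh RegExp per call: with the "g" flag, exec() is stateful
  // (it remembers lastIndex), which would leak between different inputs.
  const regex = new RegExp(
    "(?:[^À-ú]|^)\\b(" + myString + ")\\b(?:[^À-ú]|$)",
    "ig"
  );
  const m = regex.exec(input);
  // m[0] is the total match (may include neighbouring characters),
  // m[1] is capture group 1 (the word itself).
  return m ? { match: m[0], group1: m[1] } : null;
}

const samples = [
  "MyString",
  "Lorem ipsum. MyString dolor sit amet",
  "(MyString)",
  "BetweenCharactersMyStringIsNotFound",
  "éMyStringé",
  "ùMyString",
  "MyStringÖ",
];
for (const s of samples) {
  console.log(JSON.stringify(s), "->", JSON.stringify(findMyString(s)));
}
```

The first three samples produce a match whose group 1 is "MyString"; the remaining four return null.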

regex ALLOW foreign characters while filtering out special characters

I saw this post: Regular expression to match non-English characters? It shows how to filter out foreign characters, like so:
str = str.replace(/[^\x00-\x7F]+/g, "");
I am trying to allow these characters while filtering out special characters, but still allowing ' - _ and space (single quote, hyphen, underscore and empty space).
Question: how can I combine the two to allow foreign characters in this JavaScript regex?
str = str.replace(/[^a-zA-Z0-9'-_ ]/g, "");
Let's say I want the ü; this does not work: str = str.replace(/[^a-zA-Z0-9'-_ ü]/g, "");
So this is pretty complex because no matter how you slice it you either have a ton of Unicode letters to include or a ton of Unicode special characters to exclude. What you essentially need here is a regex that will only allow characters from the Unicode general categories for letters (Lu, Ll, Lt, Lm, Lo).
In some regex flavors support for Unicode general categories is built in, and your regex would just be something like the following:
[\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}'\- _]
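Unfortunately JavaScript does not support this natively (at least not before ES2018 added Unicode property escapes under the u flag), but you could do it with the Unicode addon to the XRegExp library. The usage would look something like this (for filtering out all of the characters you do not want); note that the pattern is wrapped in XRegExp(...), because a plain string passed as the search argument is matched literally:
XRegExp.replace(text, XRegExp("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}'\\- _]"), '', 'all');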
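In engines that implement ES2018, the same character class can also be written natively with the u flag, without a library. The following is a minimal sketch, assuming such an engine and precomposed (NFC) input. \p{L} is shorthand for the union of Lu, Ll, Lt, Lm and Lo, and stripSpecial is a name made up for this example.

```javascript
// Matches every character that is NOT a Unicode letter, a single quote,
// a hyphen, a space or an underscore. The "u" flag enables \p{...}.
const disallowed = /[^\p{L}'\- _]/gu;

function stripSpecial(str) {
  return str.replace(disallowed, "");
}

console.log(stripSpecial("¡Olé! Müller's über-cool_café"));
// prints: Olé Müller's über-cool_café
```

Combining marks are not letters, so decomposed input (for example "e" followed by U+0301) would lose its accents; calling str.normalize("NFC") first is one way to avoid that.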
Alternatively, if you want to construct a crazy long JavaScript regex that does the job, the CSET JavaScript library can be used. Here is the regex I came up with:
var regex = /[\u0000-\u001f!-&(-,.-#[-^`{-©«-´¶-¹»-¿×÷˂-˅˒-˟˥-˫˭˯-\u036f͵\u0378-\u0379;-΅·\u038b\u038d\u03a2϶҂-\u0489\u0524-\u0530\u0557-\u0558՚-\u0560\u0588-\u05cf\u05eb-\u05ef׳-\u0620\u064b-٭\u0670۔\u06d6-\u06e4\u06e7-\u06ed۰-۹۽-۾܀-\u070f\u0711\u0730-\u074c\u07a6-\u07b0\u07b2-߉\u07eb-\u07f3߶-߹\u07fb-\u0903\u093a-\u093c\u093e-\u094f\u0951-\u0957\u0962-॰\u0973-\u097a\u0980-\u0984\u098d-\u098e\u0991-\u0992\u09a9\u09b1\u09b3-\u09b5\u09ba-\u09bc\u09be-\u09cd\u09cf-\u09db\u09de\u09e2-৯৲-\u0a04\u0a0b-\u0a0e\u0a11-\u0a12\u0a29\u0a31\u0a34\u0a37\u0a3a-\u0a58\u0a5d\u0a5f-\u0a71\u0a75-\u0a84\u0a8e\u0a92\u0aa9\u0ab1\u0ab4\u0aba-\u0abc\u0abe-\u0acf\u0ad1-\u0adf\u0ae2-\u0b04\u0b0d-\u0b0e\u0b11-\u0b12\u0b29\u0b31\u0b34\u0b3a-\u0b3c\u0b3e-\u0b5b\u0b5e\u0b62-୰\u0b72-\u0b82\u0b84\u0b8b-\u0b8d\u0b91\u0b96-\u0b98\u0b9b\u0b9d\u0ba0-\u0ba2\u0ba5-\u0ba7\u0bab-\u0bad\u0bba-\u0bcf\u0bd1-\u0c04\u0c0d\u0c11\u0c29\u0c34\u0c3a-\u0c3c\u0c3e-\u0c57\u0c5a-\u0c5f\u0c62-\u0c84\u0c8d\u0c91\u0ca9\u0cb4\u0cba-\u0cbc\u0cbe-\u0cdd\u0cdf\u0ce2-\u0d04\u0d0d\u0d11\u0d29\u0d3a-\u0d3c\u0d3e-\u0d5f\u0d62-൹\u0d80-\u0d84\u0d97-\u0d99\u0db2\u0dbc\u0dbe-\u0dbf\u0dc7-\u0e00\u0e31\u0e34-฿\u0e47-\u0e80\u0e83\u0e85-\u0e86\u0e89\u0e8b-\u0e8c\u0e8e-\u0e93\u0e98\u0ea0\u0ea4\u0ea6\u0ea8-\u0ea9\u0eac\u0eb1\u0eb4-\u0ebc\u0ebe-\u0ebf\u0ec5\u0ec7-\u0edb\u0ede-\u0eff༁-\u0f3f\u0f48\u0f6d-\u0f87\u0f8c-\u0fff\u102b-\u103e၀-၏\u1056-\u1059\u105e-\u1060\u1062-\u1064\u1067-\u106d\u1071-\u1074\u1082-\u108d\u108f-႟\u10c6-\u10cf჻\u10fd-\u10ff\u115a-\u115e\u11a3-\u11a7\u11fa-\u11ff\u1249\u124e-\u124f\u1257\u1259\u125e-\u125f\u1289\u128e-\u128f\u12b1\u12b6-\u12b7\u12bf\u12c1\u12c6-\u12c7\u12d7\u1311\u1316-\u1317\u135b-\u137f᎐-\u139f\u13f5-\u1400᙭-᙮\u1677-\u1680᚛-\u169f᛫-\u16ff\u170d\u1712-\u171f\u1732-\u173f\u1752-\u175f\u176d\u1771-\u177f\u17b4-៖៘-៛\u17dd-\u181f\u1878-\u187f\u18a9\u18ab-\u18ff\u191d-᥏\u196e-\u196f\u1975-\u197f\u19aa-\u19c0\u19c8-᧿\u1a17-\u1b04\u1b34-\u1b44\u1b4c-\u1b82\u1ba1-\u1bad᮰-\u1bff\u1c24-\u1c4c᱐-᱙᱾-\u1cff\u1dc0
-\u1dff\u1f16-\u1f17\u1f1e-\u1f1f\u1f46-\u1f47\u1f4e-\u1f4f\u1f58\u1f5a\u1f5c\u1f5e\u1f7e-\u1f7f\u1fb5᾽᾿-῁\u1fc5῍-῏\u1fd4-\u1fd5\u1fdc-῟῭-\u1ff1\u1ff5´-⁰\u2072-⁾₀-\u208f\u2095-℁℃-℆℈-℉℔№-℘℞-℣℥℧℩℮℺-℻⅀-⅄⅊-⅍⅏-\u2182\u2185-\u2bff\u2c2f\u2c5f\u2c70\u2c7e-\u2c7f⳥-⳿\u2d26-\u2d2f\u2d66-\u2d6e\u2d70-\u2d7f\u2d97-\u2d9f\u2da7\u2daf\u2db7\u2dbf\u2dc7\u2dcf\u2dd7\u2ddf-⸮⸰-〄\u3007-〰〶-\u303a〽-\u3040\u3097-゜゠・\u3100-\u3104\u312e-\u3130\u318f-㆟\u31b8-\u31ef㈀-㏿\u4db6-䷿\u9fc4-\u9fff\ua48d-\ua4ff꘍-꘏꘠-꘩\ua62c-\ua63f\ua660-\ua661\ua66f-꙾\ua698-꜖꜠-꜡꞉-꞊\ua78d-\ua7fa\ua802\ua806\ua80b\ua823-\ua83f꡴-\ua881\ua8b4-꤉\ua926-꤯\ua947-\ua9ff\uaa29-\uaa3f\uaa43\uaa4c-\uabff\ud7a4-\ud7ff\ud840-\ud868\udc00-\uf8ff\ufa2e-\ufa2f\ufa6b-\ufa6f\ufada-\ufaff\ufb07-\ufb12\ufb18-\ufb1c\ufb1e﬩\ufb37\ufb3d\ufb3f\ufb42\ufb45\ufbb2-\ufbd2﴾-\ufd4f\ufd90-\ufd91\ufdc8-\ufdef﷼-\ufe6f\ufe75\ufefd-@[-`{-・\uffbf-\uffc1\uffc8-\uffc9\uffd0-\uffd1\uffd8-\uffd9\uffdd-\uffff]|[\ud803-\ud807\ud809-\ud834\ud836-\ud83f\ud86a-\ud87d\ud87f-\udbff][\udc00-\udfff]|\ud800[\udc0c\udc27\udc3b\udc3e\udc4e-\udc4f\udc5e-\udc7f\udcfb-\ude7f\ude9d-\ude9f\uded1-\udeff\udf1f-\udf2f\udf41\udf4a-\udf7f\udf9e-\udf9f\udfc4-\udfc7\udfd0-\udfff]|\ud801[\udc9e-\udfff]|\ud802[\udc06-\udc07\udc09\udc36\udc39-\udc3b\udc3d-\udc3e\udc40-\udcff\udd16-\udd1f\udd3a-\uddff\ude01-\ude0f\ude14\ude18\ude34-\udfff]|\ud808[\udf6f-\udfff]|\ud835[\udc55\udc9d\udca0-\udca1\udca3-\udca4\udca7-\udca8\udcad\udcba\udcbc\udcc4\udd06\udd0b-\udd0c\udd15\udd1d\udd3a\udd3f\udd45\udd47-\udd49\udd51\udea6-\udea7\udec1\udedb\udefb\udf15\udf35\udf4f\udf6f\udf89\udfa9\udfc3\udfcc-\udfff]|\ud869[\uded7-\udfff]|\ud87e[\ude1e-\udfff]|[\ud800-\ud83f\ud869-\udbff]/g;
And the steps to get there (after including the CSET source):
CSET.import();
var allUnicodeLetters = ['Lu', 'Ll', 'Lt', 'Lm', 'Lo'].map(fromUnicodeGeneralCategory).reduce(union);
var allAllowedCharacters = union(allUnicodeLetters, fromString("'- _"));
var regex = new RegExp(toRegex(complement(allAllowedCharacters)), 'g');
Then you could use str = str.replace(regex, '') and it would remove all special characters, including symbols like dingbats, except for the ones you want to allow.
Edit: Just realized you may also want to allow numbers. If so, you could use the following, which was obtained by adding 'Nd' and 'Nl' in the method above:
var regex = /[\u0000-\u001f!-&(-,.-/:-#[-^`{-©«-´¶-¹»-¿×÷˂-˅˒-˟˥-˫˭˯-\u036f͵\u0378-\u0379;-΅·\u038b\u038d\u03a2϶҂-\u0489\u0524-\u0530\u0557-\u0558՚-\u0560\u0588-\u05cf\u05eb-\u05ef׳-\u0620\u064b-\u065f٪-٭\u0670۔\u06d6-\u06e4\u06e7-\u06ed۽-۾܀-\u070f\u0711\u0730-\u074c\u07a6-\u07b0\u07b2-\u07bf\u07eb-\u07f3߶-߹\u07fb-\u0903\u093a-\u093c\u093e-\u094f\u0951-\u0957\u0962-॥॰\u0973-\u097a\u0980-\u0984\u098d-\u098e\u0991-\u0992\u09a9\u09b1\u09b3-\u09b5\u09ba-\u09bc\u09be-\u09cd\u09cf-\u09db\u09de\u09e2-\u09e5৲-\u0a04\u0a0b-\u0a0e\u0a11-\u0a12\u0a29\u0a31\u0a34\u0a37\u0a3a-\u0a58\u0a5d\u0a5f-\u0a65\u0a70-\u0a71\u0a75-\u0a84\u0a8e\u0a92\u0aa9\u0ab1\u0ab4\u0aba-\u0abc\u0abe-\u0acf\u0ad1-\u0adf\u0ae2-\u0ae5\u0af0-\u0b04\u0b0d-\u0b0e\u0b11-\u0b12\u0b29\u0b31\u0b34\u0b3a-\u0b3c\u0b3e-\u0b5b\u0b5e\u0b62-\u0b65୰\u0b72-\u0b82\u0b84\u0b8b-\u0b8d\u0b91\u0b96-\u0b98\u0b9b\u0b9d\u0ba0-\u0ba2\u0ba5-\u0ba7\u0bab-\u0bad\u0bba-\u0bcf\u0bd1-\u0be5௰-\u0c04\u0c0d\u0c11\u0c29\u0c34\u0c3a-\u0c3c\u0c3e-\u0c57\u0c5a-\u0c5f\u0c62-\u0c65\u0c70-\u0c84\u0c8d\u0c91\u0ca9\u0cb4\u0cba-\u0cbc\u0cbe-\u0cdd\u0cdf\u0ce2-\u0ce5\u0cf0-\u0d04\u0d0d\u0d11\u0d29\u0d3a-\u0d3c\u0d3e-\u0d5f\u0d62-\u0d65൰-൹\u0d80-\u0d84\u0d97-\u0d99\u0db2\u0dbc\u0dbe-\u0dbf\u0dc7-\u0e00\u0e31\u0e34-฿\u0e47-๏๚-\u0e80\u0e83\u0e85-\u0e86\u0e89\u0e8b-\u0e8c\u0e8e-\u0e93\u0e98\u0ea0\u0ea4\u0ea6\u0ea8-\u0ea9\u0eac\u0eb1\u0eb4-\u0ebc\u0ebe-\u0ebf\u0ec5\u0ec7-\u0ecf\u0eda-\u0edb\u0ede-\u0eff༁-༟༪-\u0f3f\u0f48\u0f6d-\u0f87\u0f8c-\u0fff\u102b-\u103e၊-၏\u1056-\u1059\u105e-\u1060\u1062-\u1064\u1067-\u106d\u1071-\u1074\u1082-\u108d\u108f\u109a-႟\u10c6-\u10cf჻\u10fd-\u10ff\u115a-\u115e\u11a3-\u11a7\u11fa-\u11ff\u1249\u124e-\u124f\u1257\u1259\u125e-\u125f\u1289\u128e-\u128f\u12b1\u12b6-\u12b7\u12bf\u12c1\u12c6-\u12c7\u12d7\u1311\u1316-\u1317\u135b-\u137f᎐-\u139f\u13f5-\u1400᙭-᙮\u1677-\u1680᚛-\u169f᛫-᛭\u16f1-\u16ff\u170d\u1712-\u171f\u1732-\u173f\u1752-\u175f\u176d\u1771-\u177f\u17b4-៖៘-៛\u17dd-\u17df\u17ea-\u180f\u181a-\u181f\u1878-\u187f\u18a9\u18ab
-\u18ff\u191d-᥅\u196e-\u196f\u1975-\u197f\u19aa-\u19c0\u19c8-\u19cf\u19da-᧿\u1a17-\u1b04\u1b34-\u1b44\u1b4c-\u1b4f᭚-\u1b82\u1ba1-\u1bad\u1bba-\u1bff\u1c24-᰿\u1c4a-\u1c4c᱾-\u1cff\u1dc0-\u1dff\u1f16-\u1f17\u1f1e-\u1f1f\u1f46-\u1f47\u1f4e-\u1f4f\u1f58\u1f5a\u1f5c\u1f5e\u1f7e-\u1f7f\u1fb5᾽᾿-῁\u1fc5῍-῏\u1fd4-\u1fd5\u1fdc-῟῭-\u1ff1\u1ff5´-⁰\u2072-⁾₀-\u208f\u2095-℁℃-℆℈-℉℔№-℘℞-℣℥℧℩℮℺-℻⅀-⅄⅊-⅍⅏-⅟\u2189-\u2bff\u2c2f\u2c5f\u2c70\u2c7e-\u2c7f⳥-⳿\u2d26-\u2d2f\u2d66-\u2d6e\u2d70-\u2d7f\u2d97-\u2d9f\u2da7\u2daf\u2db7\u2dbf\u2dc7\u2dcf\u2dd7\u2ddf-⸮⸰-〄〈-〠\u302a-〰〶-〷〽-\u3040\u3097-゜゠・\u3100-\u3104\u312e-\u3130\u318f-㆟\u31b8-\u31ef㈀-㏿\u4db6-䷿\u9fc4-\u9fff\ua48d-\ua4ff꘍-꘏\ua62c-\ua63f\ua660-\ua661\ua66f-꙾\ua698-꜖꜠-꜡꞉-꞊\ua78d-\ua7fa\ua802\ua806\ua80b\ua823-\ua83f꡴-\ua881\ua8b4-꣏\ua8da-\ua8ff\ua926-꤯\ua947-\ua9ff\uaa29-\uaa3f\uaa43\uaa4c-\uaa4f\uaa5a-\uabff\ud7a4-\ud7ff\ud840-\ud868\udc00-\uf8ff\ufa2e-\ufa2f\ufa6b-\ufa6f\ufada-\ufaff\ufb07-\ufb12\ufb18-\ufb1c\ufb1e﬩\ufb37\ufb3d\ufb3f\ufb42\ufb45\ufbb2-\ufbd2﴾-\ufd4f\ufd90-\ufd91\ufdc8-\ufdef﷼-\ufe6f\ufe75\ufefd-/:-@[-`{-・\uffbf-\uffc1\uffc8-\uffc9\uffd0-\uffd1\uffd8-\uffd9\uffdd-\uffff]|[\ud803-\ud807\ud80a-\ud834\ud836-\ud83f\ud86a-\ud87d\ud87f-\udbff][\udc00-\udfff]|\ud800[\udc0c\udc27\udc3b\udc3e\udc4e-\udc4f\udc5e-\udc7f\udcfb-\udd3f\udd75-\ude7f\ude9d-\ude9f\uded1-\udeff\udf1f-\udf2f\udf4b-\udf7f\udf9e-\udf9f\udfc4-\udfc7\udfd0\udfd6-\udfff]|\ud801[\udc9e-\udc9f\udcaa-\udfff]|\ud802[\udc06-\udc07\udc09\udc36\udc39-\udc3b\udc3d-\udc3e\udc40-\udcff\udd16-\udd1f\udd3a-\uddff\ude01-\ude0f\ude14\ude18\ude34-\udfff]|\ud808[\udf6f-\udfff]|\ud809[\udc63-\udfff]|\ud835[\udc55\udc9d\udca0-\udca1\udca3-\udca4\udca7-\udca8\udcad\udcba\udcbc\udcc4\udd06\udd0b-\udd0c\udd15\udd1d\udd3a\udd3f\udd45\udd47-\udd49\udd51\udea6-\udea7\udec1\udedb\udefb\udf15\udf35\udf4f\udf6f\udf89\udfa9\udfc3\udfcc-\udfcd]|\ud869[\uded7-\udfff]|\ud87e[\ude1e-\udfff]|[\ud800-\ud83f\ud869-\udbff]/g;
JavaScript regular expressions have no built-in Unicode character classes (before ES2018), so you can either list the characters to include or exclude yourself, by doing something like this:
str = str.replace(/[!@#\$%\^&\*\(\)\{\}\?<>\+:;",\.\\]/g, "");
Or use a library like XRegExp.

utf-8 word boundary regex in javascript

In JavaScript:
"ab abc cab ab ab".replace(/\bab\b/g, "AB");
correctly gives me:
"AB abc cab AB AB"
When I use utf-8 characters though:
"αβ αβγ γαβ αβ αβ".replace(/\bαβ\b/g, "AB");
the word boundary operator doesn't seem to work:
"αβ αβγ γαβ αβ αβ"
Is there a solution to this?
The word boundary assertion \b only matches at a position where a word character is not preceded or not followed by another word character (so .\b. is equivalent to \W\w or \w\W). And \w is defined as [A-Za-z0-9_], so \w doesn't match Greek characters, and thus you cannot use \b in this case.
What you could do instead is to use this:
"αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB")
Not all JavaScript regexp implementations have support for Unicode, so you need to escape the characters:
"αβ αβγ γαβ αβ αβ".replace(/\u03b1\u03b2/g, "AB"); // "AB ABγ γAB AB AB"
For mapping the characters you can take a look at http://htmlhelp.com/reference/html40/entities/symbols.html
Of course, this doesn't help with the word boundary issue (as explained in other answers), but it should at least enable you to match the characters properly.
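As a hedged sketch of how escaped characters and a Unicode-aware boundary might be combined: the pattern below spells αβ with \u escapes and, instead of \b, uses a captured leading delimiter plus a lookahead built from \p{L}, \p{N} and _. It assumes an engine that supports the u flag and Unicode property escapes (ES2018), but it does not need lookbehind. The name replaceAlphaBeta is made up for illustration.

```javascript
// "\u03b1\u03b2" is "αβ". The leading group captures either the start of the
// string or one character that is not a letter, digit or underscore; the
// lookahead requires the same kind of character (or the end) after the word.
const boundaryRegex = /(^|[^\p{L}\p{N}_])\u03b1\u03b2(?=[^\p{L}\p{N}_]|$)/gu;

function replaceAlphaBeta(input, replacement) {
  // The replacer function re-inserts the captured delimiter unchanged.
  return input.replace(boundaryRegex, (match, before) => before + replacement);
}

console.log(replaceAlphaBeta("αβ αβγ γαβ αβ, (αβ)", "AB"));
// prints: AB αβγ γαβ AB, (AB)
```

Only the leading delimiter is consumed, and it is put back by the replacer, so consecutive occurrences separated by a single space are all replaced in one pass.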
I needed something to be programmable and handle punctuation, brackets, etc.
http://jsfiddle.net/AQvyd/
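var wordToReplace = '買い手',
    replacementWord = '[[BUYER]]',
    text = 'Mange 買い手 information. The selected Store and Classification will be the default on the สั่งซื้อ.';
function replaceWord(text, wordToReplace, replacementWord) {
    // Group 1 captures the delimiter before the word, group 2 the delimiter after it.
    var re = new RegExp('(^|\\s|\\(|\'|"|,|;)' + wordToReplace + '($|\\s|\\)|\\.|\'|"|!|,|;|\\?)', 'gi');
    // Put both delimiters back so that only the word itself is replaced.
    return text.replace(re, function (match, before, after) {
        return before + replacementWord + after;
    });
}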
I've written a JavaScript resource editor, which is how I found this page; I answered out of necessity, since I couldn't find a parameterized word-boundary regexp that worked well for Unicode.
Not all the RegExp implementations in JavaScript engines are Unicode-aware.
For example, Microsoft's JScript, used in IE, is limited to ANSI.
When you’re dealing with Unicode and natural-language words, you probably want to be more careful with boundaries than just using \b. See this answer for details and directions.
