utf-8 word boundary regex in javascript


In JavaScript:
"ab abc cab ab ab".replace(/\bab\b/g, "AB");
correctly gives me:
"AB abc cab AB AB"
When I use utf-8 characters though:
"αβ αβγ γαβ αβ αβ".replace(/\bαβ\b/g, "AB");
the word boundary operator doesn't seem to work:
"αβ αβγ γαβ αβ αβ"
Is there a solution to this?

The word boundary assertion matches only at a position where a word character is not preceded or not followed by another word character (so .\b. is equivalent to \W\w or \w\W). \w is defined as [A-Za-z0-9_], so it doesn't match Greek characters, and thus you cannot use \b for this case.
What you could do instead is to use this:
"αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB")

Not all JavaScript regexp implementations support Unicode, so you need to escape the characters:
"αβ αβγ γαβ αβ αβ".replace(/\u03b1\u03b2/g, "AB"); // "AB ABγ γAB AB AB"
For mapping the characters you can take a look at http://htmlhelp.com/reference/html40/entities/symbols.html
Of course, this doesn't help with the word boundary issue (as explained in other answers) but should at least enable you to match the characters properly
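For the word boundary part itself, engines that implement the ES2018 regex features (the `u` flag with Unicode property escapes, plus lookbehind) allow a Unicode-aware boundary to be approximated with lookarounds. A minimal sketch, assuming such an engine; `WORD_CHAR` is a name made up for this example, not a built-in:

```javascript
// Assumes an ES2018+ engine (Unicode property escapes and lookbehind).
// WORD_CHAR is a hypothetical stand-in for a Unicode-aware \w.
const WORD_CHAR = String.raw`[\p{L}\p{N}_]`;

// "αβ" not preceded and not followed by a letter, digit or underscore.
const unicodeWord = new RegExp(`(?<!${WORD_CHAR})αβ(?!${WORD_CHAR})`, "gu");

const replaced = "αβ αβγ γαβ αβ αβ".replace(unicodeWord, "AB");
console.log(replaced);
```

Only the standalone occurrences should be replaced; αβγ and γαβ are left alone because a letter sits directly next to the match.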

I needed something to be programmable and handle punctuation, brackets, etc.
http://jsfiddle.net/AQvyd/
var wordToReplace = '買い手',
replacementWord = '[[BUYER]]',
text = 'Mange 買い手 information. The selected Store and Classification will be the default on the สั่งซื้อ.'
function replaceWord(text, wordToReplace, replacementWord) {
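// Capture the leading delimiter and only look ahead at the trailing one,
// so that neither delimiter is lost when the word is replaced.
var re = new RegExp('(^|\\s|\\(|\'|"|,|;)' + wordToReplace + '(?=$|\\s|\\)|\\.|\'|"|!|,|;|\\?)', 'gi');
return text.replace(re, '$1' + replacementWord);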
}
I've written a JavaScript resource editor, which is how I found this page. I answered out of necessity, since I couldn't find a parameterized word-boundary regexp that worked well for Unicode.
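One caveat when building the pattern by string concatenation: if the word contains regex metacharacters (for example C++), it has to be escaped first. A sketch of how that might look; `escapeRegExp` and `replaceWholeWord` are hypothetical names for this example, and the delimiter sets are the same ones used above:

```javascript
// Hypothetical helper: escape regex metacharacters in a literal word.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace `word` only when it is surrounded by the listed delimiters
// (or the start/end of the text).
function replaceWholeWord(text, word, replacement) {
  var delimBefore = '(^|[\\s(\'",;])';        // captured, so it can be put back
  var delimAfter = '(?=$|[\\s).\'"!,;?])';    // lookahead, so it is not consumed
  var re = new RegExp(delimBefore + escapeRegExp(word) + delimAfter, 'gi');
  // A function replacement avoids "$" in `replacement` being interpreted.
  return text.replace(re, function (match, before) {
    return before + replacement;
  });
}

console.log(replaceWholeWord('Mange 買い手 information.', '買い手', '[[BUYER]]'));
console.log(replaceWholeWord('C++ (C++) C++x', 'C++', 'X'));
```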

Not all of the RegEx implementations in JavaScript engines are Unicode-aware.
For example, Microsoft's JScript, used in IE, is limited to ANSI.

When you’re dealing with Unicode and natural-language words, you probably want to be more careful with boundaries than just using \b. See this answer for details and directions.

Related

Highlight specific word which is not a part of another word using regex in javascript [duplicate]

I'm trying to use regexes to match space-separated numbers.
I can't find a precise definition of \b ("word boundary").
I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of .
[I am using Java regexes in Java 1.6]
Example:
Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());
This returns:
true
false
true

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.

While learning regular expressions, I was really stuck on the \b metacharacter. I kept asking myself what it was without understanding it. After some experiments on the website, I noticed the pink vertical dashes at the beginning and at the end of every word, and its meaning finally clicked: it is exactly a word (\w) boundary.
My view is aimed purely at building intuition. The logic behind it is covered in the other answers.

A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are alpha-numeric; a minus sign is not.
Taken from Regex Tutorial.
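The three positions can be made visible by inserting a marker wherever \b matches. A small JavaScript sketch (the helper name `markBoundaries` is made up for this example; JavaScript's \b follows the same ASCII definition of a word character):

```javascript
// Insert "|" at every position where \b matches.
function markBoundaries(s) {
  return s.replace(/\b/g, "|");
}

console.log(markBoundaries("-12"));    // -|12|
console.log(markBoundaries("cat, a")); // |cat|, |a|
```

In "-12" there is no boundary before the minus sign, because neither the start of the string nor the minus sign is a word character.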

I would like to explain Alan Moore's answer
A word boundary is a position that is either preceded by a word character and not followed by one or followed by a word character and not preceded by one.
Suppose I have a string "This is a cat, and she's awesome", and I want to replace all occurrences of the letter 'a' only if this letter ('a') exists at the "Boundary of a word",
In other words: the letter a inside 'cat' should not be replaced.
So I'll perform regex (in Python) as
re.sub(r"\ba", "e", myString.strip())  # replace a with e
Therefore,
Input: This is a cat and she's awesome
Output: This is e cat end she's ewesome

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

I talk about what \b-style regex boundaries actually are here.
The short story is that they’re conditional. Their behavior depends on what they’re next to.
# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )
# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )
Sometimes that isn’t what you want. See my other answer for elaboration.

I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.
Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).
The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.
Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.
Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.
Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:
public static String grep(String regexp, String multiLineStringToSearch) {
String result = "";
String[] lines = multiLineStringToSearch.split("\\n");
Pattern pattern = Pattern.compile(regexp);
for (String line : lines) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
result = result + "\n" + line;
}
}
return result.trim();
}
Then in your test or main function:
String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
String text = "Programming in C, (C++) C#, Java, and .NET.";
System.out.println("text="+text);
// Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));
System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works Ok for this example, but see below
// Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
System.out.println("text="+text);
System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
// Make sure the first and last cases work OK.
text = "C is a language that should have been named differently.";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
text = "One language that should have been named differently is C";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
//Make sure we don't get false positives
text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
System.out.println("text="+text);
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!
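The same effect shows up in JavaScript, whose \b is also based on ASCII word characters. The sketch below first shows \b failing on these names, then uses lookarounds built from the same delimiter set as beforeWord/afterWord above. It assumes an engine with lookbehind support; `hasTerm` is a hypothetical helper, and the term passed to it must already be regex-escaped:

```javascript
const text = "Programming in C, (C++) C#, Java, and .NET.";

// \b needs a word character on exactly one side, so these all fail.
console.log(/\bC\+\+\b/.test(text));  // false
console.log(/\bC#\b/.test(text));     // false
console.log(/\b\.NET\b/.test(text));  // false

// "Not preceded/followed by a non-delimiter" also covers start and end of text.
const before = String.raw`(?<![^\s.,!?()'"])`;
const after = String.raw`(?![^\s.,!?()'"])`;

// Hypothetical helper; escapedTerm must already be regex-escaped.
function hasTerm(escapedTerm, s) {
  return new RegExp(before + escapedTerm + after).test(s);
}

console.log(hasTerm(String.raw`C\+\+`, text)); // true
console.log(hasTerm("C#", text));              // true
console.log(hasTerm(String.raw`\.NET`, text)); // true
console.log(hasTerm("C", "Worked on C&O (Chesapeake and Ohio) Canal")); // false
```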

Check out the documentation on boundary conditions:
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
Check out this sample:
public static void main(final String[] args)
{
String x = "I found the value -12 in my string.";
System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
}
When you print it out, notice that the output is this:
[I found the value -, in my string.]
This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like #brianary kinda beat me to the punch, so he gets an up-vote.

Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl) - O'Reilly
\b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)
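This equivalence can be checked empirically by comparing the positions at which both patterns match. A small JavaScript sketch, assuming an engine with lookbehind support; `positions` is a helper name invented for this example:

```javascript
// The expanded form of \b quoted above.
const expanded = /(?<!\w)(?=\w)|(?<=\w)(?!\w)/g;

// Collect the index of every (zero-width) match of `re` in `s`.
function positions(re, s) {
  return [...s.matchAll(re)].map(m => m.index);
}

const samples = ["-12", "This is a cat, and she's awesome", ""];
for (const s of samples) {
  const same = JSON.stringify(positions(/\b/g, s)) === JSON.stringify(positions(expanded, s));
  console.log(JSON.stringify(s), same);
}
```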

The word boundary \b matches where the character on one side is a word character and the character on the other side is a non-word character.
Regular Expression for negative number should be
--?\b\d+\b
check working DEMO

I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word characters in a string, as well as any place where before it is a word character or non-word character, and after it is the opposite. Also note that word boundary is a zero-width match.
One possible alternative is
(?:(?:^|\s)-?)\d+\b
This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.
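A related option, where lookbehind is available, is to define the boundary in terms of whitespace rather than word characters, so the minus sign stays part of the match. A JavaScript sketch of that idea (an assumption about what "space-separated" should mean here, not the original poster's code):

```javascript
// An optionally negative integer with whitespace (or the string edge) on both sides.
const spaceSeparatedInt = /(?<!\S)-?\d+(?!\S)/g;

const found = " 12 -12 x-3 4y".match(spaceSeparatedInt);
console.log(found); // [ '12', '-12' ]
```

"x-3" and "4y" are rejected because a non-whitespace character touches the number on one side.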

When you use \\b(\\w+)+\\b, that means an exact match of a word containing only word characters ([a-zA-Z0-9_]).
In your case, for example, putting \\b at the beginning of the regex will accept -12 (with a space) but won't accept -12 (without a space).
for reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html

I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.

How to match all words starting with dollar sign but not slash dollar

I want to match all words which are starting with dollar sign but not slash and dollar sign.
I have already tried a few regexes.
(?:(?!\\)\$\w+)
\\(\\?\$\w+)\b
String
$10<i class="">$i01d</i>\$id
Expected result
*$10*
*$i01d*
but not this
*$id*
After finding all the expected matching words, I want to replace them with values from my object.

One option is to eliminate escape sequences first, and then match the cleaned-up string:
s = String.raw`$10<i class="">$i01d</i>\$id`
found = s.replace(/\\./g, '').match(/\$\w+/g)
console.log(found)

The big problem here is that you need a negative lookbehind, however, JavaScript does not support it. It's possible to emulate it crudely, but I will offer an alternative which, while not great, will work:
var input = '$10<i class="">$i01d</i>\\$id';
var regex = /\b\w+\b\$(?!\\)/g;
//sample implementation of a string reversal function. There are better implementations out there
function reverseString(string) {
return string.split("").reverse().join("");
}
var reverseInput = reverseString(input);
var matches = reverseInput
.match(regex)
.map(reverseString);
console.log(matches);
It is not elegant but it will do the job. Here is how it works:
JavaScript does support a lookahead expression ((?=)) and a negative lookahead ((?!)). Since a negative lookahead is the mirror image of a negative lookbehind, you can reverse the string and reverse the regex, which will match exactly what you want. Since all the matches are going to be in reverse, you also need to reverse them back to the original.
It is not elegant, as I said, since it does a lot of string manipulations but it does produce exactly what you want.
See this in action on Regex101
Regex explanation: normally, "match x as long as it's not preceded by y" is expressed as (?<!y)x, so in your case the regex will be
/(?<!\\)\$\b\w+\b/g
demonstration (not JavaScript)
where
(?<!\\) //do not match a preceding "\"
\$ //match literal "$"
\b //word boundary
\w+ //one or more word characters
\b //second word boundary, hence making the match a word
When the input is reversed, all the tokens must be reversed as well in order to match. Furthermore, the negative lookbehind becomes a negative lookahead of the form x(?!y), so the new regular expression is
/\b\w+\b\$(?!\\)/g;

This is more difficult than it appears at first blush. How like Regular Expressions!
If you have look-behind available, you can try:
/(?<!\\)\$\w+/g
This is NOT available in JS. Alternatively, you could specify a boundary that you know exists and use a capture group like:
/\s(\$\w+)/g
Unfortunately, you cannot rely on word boundaries via \b because there's no such boundary before '\'.
Also, this is a cool site for testing your regex expressions. And this explains the word boundary anchor.

If you're using a language that supports negative lookbehind assertions, you can use something like this.
(?<!\\)\$\w+
I think this is the cleanest approach, but unfortunately it's not supported by all languages.
This is a hackier implementation that may work as well.
(?:(^\$\w+)|[^\\](\$\w+))
This matches either
A literal $ at the beginning of a line followed by multiple word characters. Or...
A literal $ that is preceded by any character except a backslash.
Here is a working example.
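Both variants can be tried side by side in JavaScript. The lookbehind form assumes an engine that supports it (ES2018 or later); the capture-group form is a sketch of the fallback for engines that do not:

```javascript
const input = String.raw`$10<i class="">$i01d</i>\$id`;

// Variant 1: negative lookbehind (needs lookbehind support).
const withLookbehind = input.match(/(?<!\\)\$\w+/g);

// Variant 2: consume the preceding non-backslash character (or the start
// of the string) and keep only the captured "$word" part.
const withCapture = [...input.matchAll(/(?:^|[^\\])(\$\w+)/g)].map(m => m[1]);

console.log(withLookbehind); // [ '$10', '$i01d' ]
console.log(withCapture);    // [ '$10', '$i01d' ]
```

Note that the second variant consumes one character before each match, so two variables written back to back could behave differently from the lookbehind version.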

javascript not recognizing special characters

I am writing JavaScript code that makes hashtags linkable, as follows:
str2 = str.replace(/(^|\s)#([A-Za-z0-9é_ü]+)/gi, '$1#$2');
As you can see, I included special Hungarian characters like é and ü in the character class so that they are part of the linked hashtag, but the code above breaks at those characters. When I test it in the w3schools.com example code editor, it works. In my local script file, though, the special characters are not recognized as themselves; é seems to be treated as a plain "e". Why is this happening, and how can I overcome it? Please suggest ideas.

Look here and here. Javascript has some problems with Unicode in regexp.
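If you want to match most Unicode letters without native Unicode support, you can use a range-based class such as [\u00C0-\u1FFF\u2C00-\uD7FF\w] (an approximation: these ranges also contain some non-letter characters).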
So your code should look like this:
str2 = str.replace(/(^|\s)#([\u00C0-\u1FFF\u2C00-\uD7FF\w]+)/gi, '$1#$2');
var str2 = 'abc #łążaf3234 efg'.replace(/(^|\s)#([\u00C0-\u1FFF\u2C00-\uD7FF\w]+)/gi, '$1#$2');
alert(str2);

You have to list the special characters, e.g. [A-Za-z0-9éüíóþæöÉÚÍÓÞÆÖ] (these are Icelandic characters), or you could use \S to match any non-whitespace character.

Your best bet is to use unicode escape sequences (like \u2665) rather than the binary character.
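Where ES2018 Unicode property escapes are available, the character class does not have to list or escape individual letters at all. A sketch under that assumption; `extractTags` is a hypothetical helper and the sample strings are made up:

```javascript
// Letters, combining marks, digits and underscore from any script.
const hashtag = /(^|\s)#([\p{L}\p{M}\p{N}_]+)/gu;

// Hypothetical helper: return the tag text (without "#") of every hashtag.
function extractTags(str) {
  return [...str.matchAll(hashtag)].map(m => m[2]);
}

console.log(extractTags("abc #łążaf3234 efg"));
console.log(extractTags("sz\u00E9p #\u00E9_\u00FC nap"));
```

Using \u escapes in the test strings, as suggested above, keeps the example independent of the encoding in which the script file is saved.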

Splitting a string into words and keeping delimiter

I want to split up a string (sentence) in an array of words and keep the delimiters.
I have found and I am currently using this regex for this:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
An explanation can be found here: http://regex101.com/
This works exactly as I want it to and effectively makes a string like
This is a sentence.
To an array of
["This", "is", "a", "sentence."]
The problem here is that it does not include spaces nor newlines. I want the string to be parsed as words as it already does but I also want the corresponding space and or newline character to belong to the previous word.
I have read about positive lookahead that should look for future characters (space and or newline) but still take them into account when extracting the word. Although this might be the solution I have failed to implement it.
If it makes any difference I am using JavaScript and the following code:
//save the regex -- g modifier to get all matches
var reg = /[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)/g;
//define variable for holding matches
var matches;
//loop through each match
while(matches = reg.exec(STRING_HERE)){
//the word without spaces or newlines
console.log(matches[0]);
}
The code works but as I said, it does not include spaces and newline characters.

You can try something simpler:
str.split(/\b(?!\s)/);
However, note non word characters (e.g. full stop) will be considered another word:
"This is a sentence.".split(/\b(?!\s)/);
// [ "This ", "is ", "a ", "sentence", "." ]
To fix that, you can use a character class with the characters that shouldn't begin another word:
str.split(/\b(?![\s.])/);

function split_string(str){
var arr = str.split(" ");
var last_i = arr.length - 1;
for(var i=0; i<last_i; i++){
arr[i]+=" ";
}
return arr;
}

It may be as simple as this:
var sentence = 'This is a sentence.';
sentence = sentence.split(' ').join(' ||');
sentence = sentence.split('\n').join('\n||');
var matches = sentence.split('||');
Note that I use 2 pipes as a delimiter, but of course you can use anything as long as it's unique.
Also note that I only split \n as a newline, but you may add \r\n or whatever you want to split as well.

General Solution
To keep the delimiters conjoined in the results, the regex needs to be a zero-width match. In other words, the regex can be thought of as matching the point between a delimiter and non-delimiter, rather than matching the delimiters themselves. This can be achieved with zero-width matching expressions, matching before, at, or after the split point (at most one each); let's call these A, B, and C. Sometimes a single sub-expression will do it, others you'll need two; offhand, I can't think of a case where you'd need three.
Not only look-aheads but lookarounds in general are the perfect candidates for this purpose: lookbehinds ((?<=...)) to match before the split point, and lookaheads ((?=...)) after. That's the essence of this approach. Positive or negative lookarounds can be used. The one pitfall is that lookbehinds are relatively new to JS regexes, so not all browsers or other JS engines will support them (current versions of Firefox, Chrome, Opera, Edge, and node.js do; Safari does not). If you need to support a JS engine that doesn't support lookbehinds, you might still be able to write & use a regex that matches at-and-after (BC).
To have the delimiters appear at the end of each match, put them in A. To have them at the start, in C. Fortunately, JS regexes do not place restrictions on lookbehinds, so simply wrapping the delimiter regex in the positive lookaround markers should be all that's required for delimiters. If the delimiters aren't so simple (i.e. context-sensitive), it might take a little more work to write the regex, which doesn't need to match the entire delimiter.
Paired with the delimiter pattern, you'll need to write a pattern that matches the start (for C) or end (for A) of the non-delimiter. This step is likely the one that will require the most additional work.
The at-split-point match, B, will often (always?) be a simple boundary, such as \b.
Specific Solution
If spaces are the only delimiters, and they're to appear at the end of each match, the delimiter pattern would be (?<=\s), in A. However, there are some cases not covered in the problem description. For example, should words separated by only punctuation (e.g. "x.y") be split? Which side of a split point should quotation marks and hyphens appear, if any? Should they count as punctuation? Another option for the delimiter is to match (after) all non-word characters, in which case A would be (?<=\W).
Since the split-point is at a word boundary, B could be \b.
Since the start of a match is a word character, (?=\w) will suffice for C.
Any two of those three should suffice. One that is perhaps clearest in meaning (and splits at the most points) is /(?<=\W)(?=\w)/, which can be translated as "split at the start of each word". \b could be added, if you find it more understandable, though it has no functional effect: /(?<=\W)\b(?=\w)/.
Note Oriol's excellent solutions are given by B=\b and (C=(?!\s) or C=(?![\s.])).
Additional
As a point of interest, there would be a simpler solution for this particular case if JS regexes supported TCL word boundaries: \m matches only at the start of a word, so str.split(/\m/) would split exactly at the start of each word. (\m is equivalent to (?<=\W)(?=\w).)
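To make the A/C combinations concrete, here is a short sketch of two zero-width splits, assuming a JS engine with lookbehind support; the sample strings are only illustrative:

```javascript
const sentence = "This is a sentence.";

// A = (?<=\s), C = (?=\S): split after whitespace, before non-whitespace.
const bySpace = sentence.split(/(?<=\s)(?=\S)/);
console.log(bySpace); // [ 'This ', 'is ', 'a ', 'sentence.' ]

// A = (?<=\W), C = (?=\w): split at the start of each word,
// which also splits words separated only by punctuation.
const byWordStart = "x.y and z".split(/(?<=\W)(?=\w)/);
console.log(byWordStart); // [ 'x.', 'y ', 'and ', 'z' ]
```

Because both patterns are zero-width, no characters are dropped: joining the pieces gives back the original string.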

If you want to include the whitespace after the word, the regex \S+\s* should work.
const s = `This is a sentence.
This is another sentence.`;
console.log(s.match(/\S+\s*/g))

Regex for a (twitter-like) hashtag that allows non-ASCII characters

I want a regex to match a simple hashtag like that in twitter (e.g. #someword). I want it also to recognize non standard characters (like those in Spanish, Hebrew or Chinese).
This was my initial regex: (^|\s|\b)(#(\w+))\b
--> but it doesn't recognize non standard characters.
Then, I tried using XRegExp.js, which worked, but ran too slowly.
Any suggestions for how to do it?

Eventually I found twitter-text.js, which is basically how Twitter solves this problem.

With native JS regexes that don't support unicode, your only option is to explicitly enumerate characters that can end the tag and match everything else, for example:
> s = "foo #הַתִּקְוָה. bar"
"foo #הַתִּקְוָה. bar"
> s.match(/#(.+?)(?=[\s.,:,]|$)/)
["#הַתִּקְוָה", "הַתִּקְוָה"]
The [\s.,:,] should include spaces, punctuation and whatever else can be considered a terminating symbol.
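On engines that do support the `u` flag with Unicode property escapes (ES2018), the tag characters can be described positively instead of enumerating terminators. A sketch under that assumption; including \p{M} keeps combining marks such as Hebrew vowel points inside the tag, and the sample string is built from escapes so it does not depend on file encoding:

```javascript
// "#" followed by letters, combining marks, digits or underscores of any script.
const tagPattern = /#[\p{L}\p{M}\p{N}_]+/gu;

// HE + PATAH (a combining vowel point) + TAV, then a Spanish tag.
const hebrewTag = "#\u05D4\u05B7\u05EA";
const sample = "foo " + hebrewTag + ". bar #ma\u00F1ana";

const tags = sample.match(tagPattern);
console.log(tags);
```

The trailing full stop is left out of the first tag because it is neither a letter, a mark, a digit nor an underscore.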

#([^#]+)[\s,;]*
Explanation: This regular expression will search for a # followed by one or more non-# characters, followed by 0 or more spaces, commas or semicolons.
var input = "#hasta #mañana #babהַ";
var matches = input.match(/#([^#]+)[\s,;]*/g);
Result:
["#hasta ", "#mañana ", "#babהַ"]
EDIT - Replaced \b for word boundary
