replace/replaceAll with regex on unicode issues - javascript

Is there a way to apply the replace method on Unicode text in general (Arabic is of concern here)? In the example below, replacing the entire word works nicely on the English text, but it fails to detect, and therefore to replace, the Arabic word. I added the u flag to enable Unicode parsing, but that didn't help. In the Arabic example below, the word النجوم should be replaced, but not والنجوم; this doesn't happen.
<!DOCTYPE html>
<html>
<body>
<p>Click to replace...</p>
<button onclick="myFunction()">replace</button>
<p id="demo"></p>
<script>
function myFunction() {
var str = "الشمس والقمر والنجوم، ثم النجوم والنهار";
var rep = 'النجوم';
var repWith = 'الليل';
//var str = "the sun and the stars, then the starsz and the day";
//var rep = 'stars';
//var repWith = 'night';
var result = str.replace(new RegExp("\\b"+rep+"\\b", "ug"), repWith);
document.getElementById("demo").innerHTML = result;
}
</script>
</body>
</html>
And, whatever solution you could offer, please keep it with the use of variables as you see in the code above (the variable rep above), as these replace words being sought are passed in through function calls.
UPDATE: To try the above code, replace code in here with the code above.

A \bword\b pattern can be represented as a (^|[^A-Za-z0-9_])word(?![A-Za-z0-9_]) pattern, and when you need to replace the match, you need to add $1 before the replacement pattern.
Since you need to work with Unicode, it makes sense to utilize XRegExp library that supports a "shorthand" \pL notation for any base Unicode letter. You may replace A-Za-z in the above pattern with this \pL:
var str = "الشمس والقمر والنجوم، ثم النجوم والنهار";
var rep = 'النجوم';
var repWith = 'الليل';
var regex = new XRegExp('(^|[^\\pL0-9_])' + rep + '(?![\\pL0-9_])');
var result = XRegExp.replace(str, regex, '$1' + repWith, 'all');
console.log(result);
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
UPDATE by @mohsenmadi:
To integrate in an Angular app, follow these steps:
Issue an npm install xregexp to add the library to package.json
Inside a component, add an import { replace, build } from 'xregexp/xregexp-all.js';
Build the regex with: let regex = build('(^|[^\\pL0-9_])' + rep + '(?![\\pL0-9_])');
Replace with: let result = replace(str, regex, '$1' + repWith, 'all');
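For readers on engines that support ES2018 Unicode property escapes, the same boundary idea can probably be written without a library. The following is only a sketch under that assumption: escapeRegExp and replaceWholeWord are hypothetical helper names, not part of the answer above, and the native \p{L} stands in for XRegExp's \pL.

```javascript
// Hypothetical helper: escape regex metacharacters in a search term.
// "-" is deliberately not escaped, because "\-" is invalid outside a
// character class when the u flag is set.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: replace whole-word occurrences of `rep` in `str`.
function replaceWholeWord(str, rep, repWith) {
  const regex = new RegExp(
    '(^|[^\\p{L}0-9_])' + escapeRegExp(rep) + '(?![\\p{L}0-9_])',
    'gu'
  );
  // A function replacement keeps any "$" in repWith literal and
  // puts back the consumed leading character (group 1).
  return str.replace(regex, (match, lead) => lead + repWith);
}

const str = "الشمس والقمر والنجوم، ثم النجوم والنهار";
console.log(replaceWholeWord(str, 'النجوم', 'الليل'));
```

The search word still comes in through a variable, as the question requires; escaping it first guards against words that happen to contain regex metacharacters.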

In case you change your mind about whitespace boundaries, here is the regex.
var Rx = new RegExp(
"(^|[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u2000-\\u200A\\u2028-\\u2029\\u202F\\u205F\\u3000])"
+ text +
"(?![^\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u2000-\\u200A\\u2028-\\u2029\\u202F\\u205F\\u3000])"
,"ug");
var result = str.replace( Rx, '$1' + repWith );
Regex explanation
( # (1 start), simulated whitespace boundary
^ # BOL
| # or whitespace
[\u0009-\u000D\u0020\u0085\u00A0\u1680\u2000-\u200A\u2028-\u2029\u202F\u205F\u3000]
) # (1 end)
text # To find
(?! # Whitespace boundary
[^\u0009-\u000D\u0020\u0085\u00A0\u1680\u2000-\u200A\u2028-\u2029\u202F\u205F\u3000]
)
In an engine that can use lookbehind assertions, a whitespace boundary
is typically done like this (?<!\S)text(?!\S).
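Since ES2018, JavaScript engines can use lookbehind, so the (?<!\S)text(?!\S) form is likely usable directly. A minimal sketch, assuming such an engine; replaceBetweenWhitespace is a hypothetical name and the sample sentence is adapted from the question.

```javascript
// Hypothetical helper: replace `text` only where it is delimited by
// whitespace (or the string edges) on both sides.
function replaceBetweenWhitespace(str, text, repWith) {
  // Escape regex metacharacters in the search term.
  const escaped = text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const rx = new RegExp('(?<!\\S)' + escaped + '(?!\\S)', 'gu');
  // Nothing is consumed before the match, so no $1 is needed.
  return str.replace(rx, () => repWith);
}

const sample = "the sun and the stars, then the stars and the day";
// "stars," is followed by a comma, so only the second "stars" qualifies.
console.log(replaceBetweenWhitespace(sample, 'stars', 'night'));
```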

Related

JavaScript Regular Expression with special characters [duplicate]

I have this code to highlight words that exist in an array. Everything works fine, except that it doesn't highlight the words that contain '.'
spansR[i].innerHTML = t[i].replace(new RegExp(wordsArray.join("|"),'gi'), function(c) {
return '<span style="color:red">'+c+'</span>';
});
I also tried to escape dot in each word
for(var r=0;r<wordsArray.length;r++){
if(wordsArray[r].includes('.')){
wordsArray[r] = wordsArray[r].replace(".", "\\.");
wordsArray[r] = '\\b'+wordsArray[r]+'\\b';
}
}
I also tried changing the replace call to each of the following, and none of them worked: "replace(".", "\.")" , "replace(".", "\.")" , "replace(".", "/.")" , "replace('.','/.')" , "replace('.','/.')" .
This is a simplified test case (I want to match 'free.' )
<!DOCTYPE html>
<html>
<body>
<button onclick="myFunction()">Try it</button>
<p id="demo"></p>
<script>
function myFunction() {
var re = "\\bfree\\.\\b";
var str = "The best things in life are free.";
var patt = new RegExp(re);
var res = patt.test(str);
document.getElementById("demo").innerHTML = res;
}
</script>
</body>
</html>
Implement an unambiguous word boundary in JavaScript.
Here is a version for JS that does not support ECMAScript 2018 and newer:
var t = "Some text... firas and firas. but not firass ... Also, some shop and not shopping";
var wordsArray = ['firas', 'firas.', 'shop'];
wordsArray.sort(function(a, b){
return b.length - a.length;
});
var regex = new RegExp("(^|\\W)(" + wordsArray.map(function(x) {
return x.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&')
}).join("|") + ")(?!\\w)",'gi');
console.log( t.replace(regex, '$1<span style="color:red">$2</span>') );
Here, the regex will look like /(^|\W)(firas\.|firas|shop)(?!\w)/gi. The (^|\W) captures into Group 1 ($1) the start of the string or a non-word char, then there is a second capturing group that captures the term in question, and the (?!\w) negative lookahead matches a position that is not immediately followed by a word char.
The wordsArray.sort is important, as without it, the shorter words with the same beginning might "win" before the longer ones appear.
The .replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&') is a must to escape special chars in the search terms.
A variation for JS environments that support lookbehinds:
let t = "Some text... firas and firas. but not firass ... Also, some shop and not shopping";
let wordsArray = ['firas', 'firas.', 'shop'];
wordsArray.sort((a, b) => b.length - a.length );
let regex = new RegExp(String.raw`(?<!\w)(?:${wordsArray.map(x => x.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&')).join("|")})(?!\w)`,'gi');
console.log( t.replace(regex, '<span style="color:red">$&</span>') );
The regex will look like /(?<!\w)(?:firas\.|firas|shop)(?!\w)/gi. Here, the (?<!\w) negative lookbehind matches a location that is not immediately preceded by a word char. This also makes the capturing group redundant, so I replaced it with a non-capturing one, (?:...), and the replacement pattern now contains just one placeholder, $&, that inserts the whole match.
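To see why the question's \bfree\.\b test case returns false, it may help to compare it with a lookahead-based right boundary. This is a small illustrative check, not part of the original answer:

```javascript
const str = "The best things in life are free.";

// After the "." comes the end of the string. Neither side of that position
// is a word character, so \b cannot match there.
const withWordBoundary = /\bfree\.\b/.test(str);

// (?!\w) only requires that no word character follows, which holds at the end.
const withLookahead = /\bfree\.(?!\w)/.test(str);

// The lookahead still rejects a term that runs straight into another word char.
const gluedToWord = /\bfree\.(?!\w)/.test("free.x");

console.log(withWordBoundary, withLookahead, gluedToWord);
```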
Here is your solution:
Replace this:
new RegExp(wordsArray.join("|"),'gi')
With this:
new RegExp(wordsArray.join("|").replace(/\./g,'\\.'),'gi')
Example :
['javascript', 'firas.', 'regexp'].join("|").replace(/\./g,'\\.')
Will print
javascript|firas\.|regexp
Which is the regular expression you are looking for, with the escaped dot. It will match firas. but it will not match firas, as you specifically asked in your last comment

How to ban words with diacritics using a blacklist array and regex?

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against words with diacritics from the array:
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new RegExp('\\b' + bannedWords.join("\\b|\\b") + '\\b', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
alert(valid);
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check'>
Now on the word băţ it returns true instead of false for example.
Chiu's comment is right: 'aaáaa'.match(/\b.+?\b/g) yields the quite counter-intuitive [ "aa", "á", "aa" ], because "word character" (\w) in JavaScript regular expressions is just a shorthand for [A-Za-z0-9_] ('case-insensitive-alpha-numeric-and-underscore'), so a word boundary (\b) matches any place between a chunk of alpha-numerics and any other character. This makes extracting "Unicode words" quite hard.
For non-unicase writing systems it is possible to identify "word character" by its dual nature: ch.toUpperCase() != ch.toLowerCase(), so your altered snippet could look like this:
var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');
$(function() {
$("input").on("input", function() {
var invalid = bannedWordsRegex.test(dashPaddedWords(this.value));
$('#log').html(invalid ? 'bad' : 'good');
});
$("input").trigger("input").focus();
function dashPaddedWords(str) {
return '-' + str.replace(/./g, wordCharOrDash) + '-';
};
function wordCharOrDash(ch) {
return isWordChar(ch) ? ch : '-'
};
function isWordChar(ch) {
return ch.toUpperCase() != ch.toLowerCase();
};
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check' value="ba">
<p id="log"></p>
Let's see what's going on:
alert("băţ".match(/\w\b/));
This is [ "b" ] because word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so aä, pπ, and zƶ match \w\b\W since they contain a word character, a word boundary, and a non-word character.
I think the best you can do is something like this:
var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
+ bannedWords.join('|')
+ ')(?=' + bound + '|$)', 'i');
where bound is a reversed list of all ASCII word characters plus most Latin-esque letters, used with start/end of line markers to approximate an internationalized \b. (The second of which is a zero-width lookahead that better mimics \b and therefore works well with the g regex flag.)
Given ["bad", "mad", "testing", "băţ"], this becomes:
/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i
This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).
You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about that order! The dashes indicate ranges between characters.
You need a Unicode aware word boundary. The easiest way is to use XRegExp package.
Although its \b is still ASCII based, there is a \p{L} (or a shorter \pL version) construct that matches any Unicode letter from the BMP plane. Building a custom word boundary using this construct is easy:
\b word \b
---------------------------------------
| | |
([^\pL0-9_]|^) word (?=[^\pL0-9_]|$)
The leading word boundary can be represented with a (non)capturing group ([^\pL0-9_]|^) that matches (and consumes) either a character other than a Unicode letter from the BMP plane, a digit and _ or a start of the string before the word.
The trailing word boundary can be represented with a positive lookahead (?=[^\pL0-9_]|$) that requires a character other than a Unicode letter from the BMP plane, a digit and _ or the end of string after the word.
See the snippet below that will detect băţ as a banned word, and băţy as an allowed word.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new XRegExp('(?:^|[^\\pL0-9_])(?:' + bannedWords.join("|") + ')(?=$|[^\\pL0-9_])', 'i');
$(function () {
$("input").on("change", function () {
var valid = !regex.test(this.value);
//alert(valid);
console.log("The word is", valid ? "allowed" : "banned");
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
<input type='text' name='word_to_check'>
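On engines with ES2018 support, a similar check can probably be done natively with Unicode property escapes and lookbehind. The sketch below assumes such an engine; escapeRegExp, bannedRegex and isAllowed are hypothetical names, and \p{M} (combining marks) is an addition of this sketch so that decomposed diacritics also count as part of a word.

```javascript
const bannedWords = ["bad", "mad", "testing", "băţ"];

// Hypothetical helper: escape regex metacharacters in each banned word.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Characters treated as "word characters" for the custom boundary.
const wordChar = '[\\p{L}\\p{M}0-9_]';

// No g flag, so test() keeps no state between calls.
const bannedRegex = new RegExp(
  '(?<!' + wordChar + ')(?:' + bannedWords.map(escapeRegExp).join('|') + ')(?!' + wordChar + ')',
  'iu'
);

// Returns true when the value contains no banned word as a whole word.
function isAllowed(value) {
  return !bannedRegex.test(value);
}

console.log(isAllowed("băţ"), isAllowed("băţy"));
```

Because the lookbehind does not consume the preceding character, this form also works unchanged with the g flag if the matches need to be replaced rather than just detected.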
Instead of using a word boundary, you could do it with
(?:[^\w\u0080-\u02af]+|^)
to check for start of word, and
(?=[^\w\u0080-\u02af]|$)
to check for the end of it.
The [^\w\u0080-\u02af] matches any character that is not (^) a basic Latin word character (\w) or a character from the Unicode Latin-1 Supplement, Latin Extended-A, Latin Extended-B and IPA Extensions blocks. This range includes some punctuation, but the pattern would get very long if it matched just letters. It may also have to be extended if other character sets have to be included. See for example Wikipedia.
Since JavaScript engines older than ES2018 don't support look-behinds, the start-of-word test consumes any of the aforementioned non-word characters, but I don't think that should be a problem. The important thing is that the end-of-word test doesn't.
Also, putting these tests outside a non-capturing group that alternates the words makes it significantly more efficient.
var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"],
regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join("|") + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');
function myFunction() {
document.getElementById('result').innerHTML = 'Banned = ' + regex.test(document.getElementById('word_to_check').value);
}
<!DOCTYPE html>
<html>
<body>
Enter word: <input type='text' id='word_to_check'>
<button onclick='myFunction()'>Test</button>
<p id='result'></p>
</body>
</html>
When dealing with characters outside my base set (which can show up at any time), I convert them to an appropriate base equivalent (8bit, 16bit, 32bit) before running any character matching over them.
var bannedWords = ["bad", "mad", "testing", "băţ"];
var bannedWordsBits = {};
bannedWords.forEach(function(word){
bannedWordsBits[word] = "";
for (var i = 0; i < word.length; i++){
bannedWordsBits[word] += word.charCodeAt(i).toString(16) + "-";
}
});
var bannedWordsJoin = []
var keys = Object.keys(bannedWordsBits);
keys.forEach(function(key){
bannedWordsJoin.push(bannedWordsBits[key]);
});
var regex = new RegExp(bannedWordsJoin.join("|"), 'i');
function checkword(word) {
var wordBits = "";
for (var i = 0; i < word.length; i++){
wordBits += word.charCodeAt(i).toString(16) + "-";
}
return !regex.test(wordBits);
};
The separator "-" is there to make sure that unique characters don't bleed together creating undesired matches.
This is very useful, as it brings all the characters down to a common base that everything can interact with. It can also be re-encoded back to its original form without having to ship it in a key/value pair.
For me the best thing about it is that I don't have to know all of the rules for all of the character sets that I might intersect with, because I can pull them all into a common playing field.
As a side note:
To speed things up, rather than running the large regex that you probably have, which takes increasingly longer to process as the list of banned words grows, I would pass each separate word in the sentence through the filter, and break the filter up into length-based segments, like:
checkword3Chars();
checkword4Chars();
checkword5chars();
whose functions you can generate systematically and even create on the fly as and when they become required.

Remove punctuation, retain spaces, toLowerCase, add dashes succinctly

I need to do the following to a string:
Remove any punctuation (but retain spaces) (can include removal of foreign chars)
Add dashes instead of spaces
toLowercase
I'd like to be able to do this as succinctly as possible, so on one line for example.
At the moment I have:
const ele = str.replace(/[^\w\s]/, '').replace(/\s+/g, '-').toLowerCase();
Few problems I'm having. Firstly the line above is syntactically incorrect. I think it's a problem with /[^\w\s] but I am not sure what I've done wrong.
Secondly I wonder if it is possible to write a regex statement that removes the punctuation AND converts spaces to dashes?
An example of what I want to change:
Where to? = where-to
Destination(s) = destinations
Travel dates?: = travel-dates
EDIT: I have updated the missing / from the first regex replace. I am finding that Destination(s) is becoming destinations) which is peculiar.
Codepen: http://codepen.io/anon/pen/mAdXJm?editors=0011
You may use the following regex to only match ASCII punctuation and some symbols (source) - maybe we should remove _ from it:
var punct = /[!"#$%&'()*+,.\/:;<=>?@\[\\\]^`{|}~-]+/g;
or a more contracted one since some of these symbols appear in the ASCII table as consecutive chars:
var punct = /[!-\/:-@\[-^`{-~]+/g;
You may chain 2 regex replacements.
var punct = /[!"#$%&'()*+,.\/:;<=>?@\[\\\]^`{|}~-]+/g;
var s = "Where to?"; // = where-to
console.log(s.replace(punct, '').replace(/\s+/g, '-').toLowerCase());
s = "Destination(s)"; // = destinations
console.log(s.replace(punct, '').replace(/\s+/g, '-').toLowerCase());
s = "Travel dates?:"; // = travel-dates
console.log(s.replace(punct, '').replace(/\s+/g, '-').toLowerCase());
Or use an anonymous method inside the replace with arrow functions (less compatibility, but succinct):
var s="Travel dates?:"; // = travel-dates
var o=/([!-\/:-@\[-^`{-~]+)|\s+/g;
console.log(s.replace(o,(m,g)=>g?'':'-').toLowerCase());
Note you may also use XRegExp to match any Unicode punctuation with the \pP construct.
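Where the engine supports ES2018 Unicode property escapes, the native \p{P} (punctuation) and \p{S} (symbols) classes with the u flag may serve the same purpose without XRegExp. A minimal sketch under that assumption; slugify is a hypothetical name, and including \p{S} and the trim() step are choices made for this sketch.

```javascript
// Hypothetical helper: strip punctuation and symbols, turn runs of
// whitespace into single dashes, and lowercase the result.
function slugify(str) {
  return str
    .replace(/[\p{P}\p{S}]+/gu, '') // remove Unicode punctuation and symbols
    .trim()                         // avoid leading or trailing dashes
    .replace(/\s+/g, '-')           // the g flag handles every run of whitespace
    .toLowerCase();
}

['Where to?', 'Destination(s)', 'Travel dates?:'].forEach(function (input) {
  console.log(input + ' = ' + slugify(input));
});
```

Note that \p{P} also covers "_" and "-", so both are removed from the input before the dashes are added.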
Wiktor touched on the subject, but my first thought was an anonymous function using the regex /(\s+)|([\W])/g like this:
var inputs = ['Where to?', 'Destination(s)', 'Travel dates?:'],
res,
idx;
for( idx=0; idx<inputs.length; idx++ ) {
res = inputs[idx].replace(/(\s+)|([\W])/g, function(a, b) {return b ? '-' : '';}).toLowerCase();
document.getElementById('output').innerHTML += '"' + inputs[idx] + '" -> "'
+ res + '"<br/>';
}
<!DOCTYPE html>
<html>
<body>
<p id='output'></p>
</body>
</html>
The regex captures either white space (1+) or a non-word character. If the first is true the anonymous function returns -, otherwise an empty string.

RegEx to match words in comma delimited list

How can you match text that appears between delimiters, but not match the delimiters themselves?
Text
DoNotFindMe('DoNotFindMe')
DoNotFindMe(FindMe)
DoNotFindMe(FindMe,FindMe)
DoNotFindMe(FindMe,FindMe,FindMe)
Script
text = text.replace(/[\(,]([a-zA-Z]*)[,\)]/g, function(item) {
return "'" + item + "'";
});
Expected Result
DoNotFindMe('DoNotFindMe')
DoNotFindMe('FindMe')
DoNotFindMe('FindMe','FindMe')
DoNotFindMe('FindMe','FindMe','FindMe')
https://regex101.com/r/tB1nE2/1
Here's a pretty simple way to do it:
([a-zA-Z]+)(?=,|\))
This looks for any word that is succeeded by either a comma or a close-parenthesis.
var s = "DoNotFindMe('DoNotFindMe')\nDoNotFindMe(FindMe)\nDoNotFindMe(FindMe,FindMe)\nDoNotFindMe(FindMe,FindMe,FindMe)";
var r = s.replace(/([a-zA-Z]+)(?=,|\))/g, "'$1'" );
alert(r);
Used the same test code as the other two answers; thanks!
You can use:
var s = "DoNotFindMe('DoNotFindMe')\nDoNotFindMe(FindMe)\nDoNotFindMe(FindMe,FindMe)\nDoNotFindMe(FindMe,FindMe,FindMe)";
var r = s.replace(/(\([^)]+\))/g, function($0, $1) {
return $1.replace(/(\b[a-z]+(?=[,)]))/gi, "'$1'"); });
DoNotFindMe('DoNotFindMe')
DoNotFindMe('FindMe')
DoNotFindMe('FindMe','FindMe')
DoNotFindMe('FindMe','FindMe','FindMe')
Here's a solution that avoids the function argument. It's a bit wonky, but works. Basically, you explicitly match the left delimiter and include it in the replacement string via backreference so it won't get dropped, but then you have to use a positive look-ahead assertion for the right delimiter, because otherwise the match pointer would be moved ahead of the right delimiter for the next match, and so it then wouldn't be able to match that delimiter as the left delimiter of the following delimited word:
var s = "DoNotFindMe('DoNotFindMe')\nDoNotFindMe(FindMe)\nDoNotFindMe(FindMe,FindMe)\nDoNotFindMe(FindMe,FindMe,FindMe)";
var r = s.replace(/([,(])([a-zA-Z]*)(?=[,)])/g, "$1'$2'" );
alert(r);
results in
DoNotFindMe('DoNotFindMe')
DoNotFindMe('FindMe')
DoNotFindMe('FindMe','FindMe')
DoNotFindMe('FindMe','FindMe','FindMe')
(Thanks anubhava, I stole your code template, cause it was perfect for my testing! I gave you an upvote for it.)
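In engines that support lookbehind (ES2018 and later), both delimiters can probably be asserted without being consumed, so neither a backreference nor a callback is needed. A sketch under that assumption, reusing the test string from the answers above; the variable names are illustrative only.

```javascript
const input = "DoNotFindMe('DoNotFindMe')\nDoNotFindMe(FindMe)\nDoNotFindMe(FindMe,FindMe)\nDoNotFindMe(FindMe,FindMe,FindMe)";

// (?<=[(,]) requires a "(" or "," just before the word,
// (?=[,)]) requires a "," or ")" just after it.
// Neither delimiter is part of the match, so $& is the bare word.
const quoted = input.replace(/(?<=[(,])[a-zA-Z]+(?=[,)])/g, "'$&'");

console.log(quoted);
```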

Remove ALL white spaces from text

$("#topNav" + $("#breadCrumb2nd").text().replace(" ", "")).addClass("current");
This is a snippet from my code. I want to add a class to an ID after getting another ID's text property. The problem with this, is the ID holding the text I need, contains gaps between the letters.
I would like the white spaces removed. I have tried TRIM() and REPLACE(), but this only partially works. The REPLACE() only removes the first space.
You have to tell replace() to repeat the regex:
.replace(/ /g,'')
The g character makes it a "global" match, meaning it repeats the search through the entire string. Read about this, and other RegEx modifiers available in JavaScript here.
If you want to match all whitespace, and not just the literal space character, use \s instead:
.replace(/\s/g,'')
You can also use .replaceAll if you're using a sufficiently recent version of JavaScript, but there's not really any reason to for your specific use case, since catching all whitespace requires a regex, and when using a regex with .replaceAll, it must be global, so you just end up with extra typing:
.replaceAll(/\s/g,'')
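The difference between these variants is easiest to see side by side. The following comparison is only an illustration; the sample string and variable names are made up for this sketch.

```javascript
const text = " a b\tc\nd ";

// A string pattern replaces only the first occurrence.
const firstOnly = text.replace(" ", "");

// / /g removes every literal space but leaves tabs and newlines.
const spacesOnly = text.replace(/ /g, "");

// /\s/g removes every whitespace character.
const allWhitespace = text.replace(/\s/g, "");

console.log(JSON.stringify(firstOnly));
console.log(JSON.stringify(spacesOnly));
console.log(JSON.stringify(allWhitespace));
```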
.replace(/\s+/, "")
Will replace the first whitespace only, this includes spaces, tabs and new lines.
To replace all whitespace in the string you need to use global mode
.replace(/\s/g, "")
Now you can use "replaceAll":
console.log(' a b c d e f g '.replaceAll(' ',''));
will print:
abcdefg
But it does not work in every browser:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replaceAll
Regex to remove white space
\s+
var str = "Visit Microsoft!";
var res = str.replace(/\s+/g, "");
console.log(res);
or
[ ]+
var str = "Visit Microsoft!";
var res = str.replace(/[ ]+/g, "");
console.log(res);
Remove all white space at the beginning of the string
^[ ]+
var str = " Visit Microsoft!";
var res = str.replace(/^[ ]+/g, "");
console.log(res);
Remove all white space at the end of the string
[ ]+$
var str = "Visit Microsoft! ";
var res = str.replace(/[ ]+$/g, "");
console.log(res);
var mystring="fg gg";
console.log(mystring.replaceAll(' ',''))
** 100% working
use replace(/ +/g,'_'):
let text = "I love you"
text = text.replace( / +/g, '_') // replace with underscore ('_')
console.log(text) // I_love_you
Using String.prototype.replace with regex, as mentioned in the other answers, is certainly the best solution.
But, just for fun, you can also remove all whitespaces from a text by using String.prototype.split and String.prototype.join:
const text = ' a b c d e f g ';
const newText = text.split(/\s/).join('');
console.log(newText); // prints abcdefg
I don't understand why we need to use regex here when we can simply use replaceAll
let result = string.replaceAll(' ', '')
result will store string without spaces
let str = 'a big fat hen clock mouse '
console.log(str.split(' ').join(''))
// abigfathenclockmouse
Use string.replace(/\s/g,'')
This will solve the problem.
Happy Coding !!!
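A simple solution could be to just replace the white space in the value. Note that replace() with a string pattern removes only the first space, so replaceAll() is used here:
val = val.replaceAll(' ', '')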
Use replace(/\s+/g,''),
for example:
const stripped = ' My String With A Lot Whitespace '.replace(/\s+/g, '')// 'MyStringWithALotWhitespace'
We can also use [^A-Za-z] with the g flag to remove all the spaces in the text. The ^ negates (complements) the character class, so it matches every character that is not in the ranges inside the brackets, and the g flag makes the search global. Note that this also removes digits and punctuation, not only spaces.
let str = "D S# D2m4a r k 23";
// We are only allowed the character in that range A-Za-z
str = str.replace(/[^A-Za-z]/g,""); // output:- DSDmark
console.log(str)
Using .replace(/\s+/g,'') works fine;
Example:
this.slug = removeAccent(this.slug).replace(/\s+/g,'');
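function RemoveAllSpaces(ToRemove)
{
// Convert to a primitive string (new String() would create a wrapper object)
let str = String(ToRemove);
// replace() with a string pattern removes only the first match, so loop until none remain
while(str.includes(" "))
{
str = str.replace(" ", "");
}
return str;
}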
