Using JavaScript to perform text matches with/without accented characters

I am using an AJAX-based lookup for names that a user searches in a text box.
I am making the assumption that all names in the database will be transliterated to European alphabets (i.e. no Cyrillic, Japanese, Chinese). However, the names will still contain accented characters, such as ç, ê and even č and ć.
A simple search like "Micic" will not match "Mičić" though - and the user expectation is that it will.
The AJAX lookup uses regular expressions to determine a match. I have modified the regular expression comparison using this function in an attempt to match more accented characters. However, it's a little clumsy since it doesn't take into account all characters.
function makeComp (input)
{
input = input.toLowerCase ();
var output = '';
for (var i = 0; i < input.length; i ++)
{
if (input.charAt (i) == 'a')
output = output + '[aàáâãäåæ]'
else if (input.charAt (i) == 'c')
output = output + '[cç]';
else if (input.charAt (i) == 'e')
output = output + '[eèéêëæ]';
else if (input.charAt (i) == 'i')
output = output + '[iìíîï]';
else if (input.charAt (i) == 'n')
output = output + '[nñ]';
else if (input.charAt (i) == 'o')
output = output + '[oòóôõöø]';
else if (input.charAt (i) == 's')
output = output + '[sß]';
else if (input.charAt (i) == 'u')
output = output + '[uùúûü]';
else if (input.charAt (i) == 'y')
output = output + '[yÿ]'
else
output = output + input.charAt (i);
}
return output;
}
Apart from a substitution function like this, is there a better way? Perhaps to "deaccent" the string being compared?

There is a way to "deaccent" the string being compared without a substitution function that lists every accent you want to remove.
Here is the simplest solution I can think of for removing accents (and other diacritics) from a string.
See it in action:
var string = "Ça été Mičić. ÀÉÏÓÛ";
console.log(string);
var string_norm = string.normalize('NFD').replace(/\p{Diacritic}/gu, ""); // Old method: .replace(/[\u0300-\u036f]/g, "");
console.log(string_norm);
.normalize('NFD') decomposes each accented letter into a base letter followed by combining diacritical marks.
.replace(…) then removes those marks.
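As a minimal sketch of how this could plug into the name lookup from the question: the helper names foldAccents and matchesIgnoringAccents below are hypothetical, and the older character-range form is used so that it also runs on engines without \p{…} support. Letters that do not decompose under NFD (for example ß, æ and ø) are left untouched by this approach.

```javascript
// Hypothetical helper: strip combining marks after canonical decomposition.
function foldAccents(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Hypothetical helper: case- and accent-insensitive substring test.
function matchesIgnoringAccents(candidate, query) {
  return foldAccents(candidate).toLowerCase()
    .includes(foldAccents(query).toLowerCase());
}

console.log(foldAccents('Mičić'));                     // "Micic"
console.log(matchesIgnoringAccents('Mičić', 'micic')); // true
console.log(matchesIgnoringAccents('Mičić', 'mikic')); // false
```

Folding both the stored name and the query once, before comparing, keeps the comparison itself a plain substring test.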

Came upon this old thread and thought I'd try my hand at doing a fast function. I'm relying on the ordering of pipe-separated ORs setting variables when they match in the function replace() is calling. My goal was to use the standard regex-implementation javascript's replace() function uses as much as possible, so that the heavy-processing can take place in low-level browser-optimized space, instead of in expensive javascript char-by-char comparisons.
It's not scientific at all, but my old Huawei IDEOS android phone is sluggish when I plug the other functions in this thread in to my autocomplete, while this function zips along:
function accentFold(inStr) {
return inStr.replace(
/([àáâãäå])|([çčć])|([èéêë])|([ìíîï])|([ñ])|([òóôõöø])|([ß])|([ùúûü])|([ÿ])|([æ])/g,
function (str, a, c, e, i, n, o, s, u, y, ae) {
if (a) return 'a';
if (c) return 'c';
if (e) return 'e';
if (i) return 'i';
if (n) return 'n';
if (o) return 'o';
if (s) return 's';
if (u) return 'u';
if (y) return 'y';
if (ae) return 'ae';
}
);
}
If you're a jQuery dev, here's a handy example of using this function; you could use :icontains the same way you'd use :contains in a selector:
jQuery.expr[':'].icontains = function (obj, index, meta, stack) {
return accentFold(
(obj.textContent || obj.innerText || jQuery(obj).text() || '').toLowerCase()
)
.indexOf(accentFold(meta[3].toLowerCase())
) >= 0;
};

I searched and upvoted herostwist's answer, but I kept searching, and there is a modern solution built into JavaScript: the String.prototype.localeCompare function.
var a = 'réservé'; // with accents, lowercase
var b = 'RESERVE'; // no accents, uppercase
console.log(a.localeCompare(b));
// expected output: 1
console.log(a.localeCompare(b, 'en', {sensitivity: 'base'}));
// expected output: 0
Note, however, that at the time of writing some mobile browsers did not fully support these options, so check support on the platforms and environments you target.
Is that all?
No, we can go further and use the String.prototype.toLocaleLowerCase function.
var dotted = 'İstanbul';
console.log('EN-US: ' + dotted.toLocaleLowerCase('en-US'));
// expected output: "i̇stanbul" (an "i" followed by U+0307, the combining dot above)
console.log('TR: ' + dotted.toLocaleLowerCase('tr'));
// expected output: "istanbul"
Thank You !

There is no easier way to "deaccent" that I can think of, but your substitution could be streamlined a little more:
var makeComp = (function(){
var accents = {
a: 'àáâãäåæ',
c: 'ç',
e: 'èéêëæ',
i: 'ìíîï',
n: 'ñ',
o: 'òóôõöø',
s: 'ß',
u: 'ùúûü',
y: 'ÿ'
},
chars = /[aceinosuy]/g;
return function makeComp(input) {
return input.replace(chars, function(c){
return '[' + c + accents[c] + ']';
});
};
}());

I think this is the neatest solution
var nIC = new Intl.Collator(undefined , {sensitivity: 'base'})
var cmp = nIC.compare.bind(nIC)
cmp returns 0 if the two strings are the same when accents (and case) are ignored.
Alternatively, you can try localeCompare:
'être'.localeCompare('etre',undefined,{sensitivity: 'base'})
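A short sketch of how such a collator might be used to filter a list of names. The findExact helper and the sample names are hypothetical, and the 'en' locale is passed explicitly (an assumption) so the result does not depend on the host's default locale. Note that a collator compares whole strings, so this finds exact matches rather than substrings.

```javascript
// Assumption: 'en' is used so results do not vary with the host's default locale.
const collator = new Intl.Collator('en', { sensitivity: 'base' });

// Hypothetical helper: keep the names equal to the query, ignoring accents and case.
function findExact(names, query) {
  return names.filter(name => collator.compare(name, query) === 0);
}

const names = ['Mičić', 'Micic', 'Mikić', 'Müller'];
console.log(findExact(names, 'micic')); // [ 'Mičić', 'Micic' ]
```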

I made a Prototype Version of this:
String.prototype.strip = function() {
var translate_re = /[öäüÖÄÜß ]/g;
var translate = {
"ä":"a", "ö":"o", "ü":"u",
"Ä":"A", "Ö":"O", "Ü":"U",
" ":"_", "ß":"ss" // probably more to come
};
return (this.replace(translate_re, function(match){
return translate[match];})
);
};
Use like:
var teststring = 'ä ö ü Ä Ö Ü ß';
teststring.strip();
This returns the string a_o_u_A_O_U_ss (the original string itself is left unchanged).

You can also use http://fusejs.io, which describes itself as a "Lightweight fuzzy-search library. Zero dependencies", for fuzzy searching.

First, I'd recommend a switch statement instead of a long string of if-else if ...
Then, I am not sure why you don't like your current solution. It certainly is the cleanest one. What do you mean by not taking into account "all characters"?
There is no standard method in JavaScript to map accented letters to ASCII letters outside of using a third-party library, so the one you wrote is as good as any.
Also, "ß" I believe maps to "ss", not a single "s". And beware of "i" with and without dot in Turkish -- I believe they refer to different letters.

Related

Check numeric expression (linear) is valid or not?

There are many ways to check whether a string is valid using a regular expression with match or test.
I want to check whether an expression containing letters (a-z), operators (+, -, /, *), parentheses, and digits (0-9) is valid.
I have already tried the traditional approach of pushing on '(' and popping on ')' to check whether the parentheses are balanced.
The code almost works, even for operators, but there are some cases where it falls short.
The code below may be correct to some extent.
checkBalancedString(text){
let format = /[A-Za-z0-9]/;
let expression = /[+-\/*]/
if(text.length <=2){
if(format.test(text[0])){
return true;
}
return false;
}
for(let i=0;i<text.length;i++){
let stringcheck=[]
if(text[i]== '('){
stringcheck.push(text[i])
}
switch(text[i]){
case ')':
if(!stringcheck.length){
return false;
}
stringcheck.pop();
break;
}
let checkalphaformat = format.test(text[i]);
if(checkalphaformat){
let nextChar = format.test(text[i+1]);
let nexttonextChar = expression.test(text[i+2])
if(nextChar || nexttonextChar){
return false
}
}else{
let nextChar = format.test(text[i+1]);
if(!nextChar){
return false;
}
if(text[i+2]){
let nextChar = format.test(text[i+2])
if(!nextChar){
return false;
}
}
}
if(!stringcheck.length){
return true;
}
}
}
In short, the check should report valid for expressions like:
(a+b), a+b, a/9, b*5, (e-6*(d+e)), (a+b)/(c-d)
and invalid for expressions like:
+, -, -a, +a-, (a+, (a+v, e*)
An expression is complete when every character is followed by an operator or a parenthesis,
every operator is followed only by a character,
and every parenthesis is followed only by a character.
No two operators may be adjacent (before or after the current index), and no two characters may be adjacent.

It looks like what you really want is to check the validity of the formula more than to check a specific kind of formula.
Here's what I use in such a case:
// Return true if the passed string is a valid mathematical expression
// taking as parameter a, b, c, etc.
// Examples:
// 2*a+b*b
// (e-6*(d+e))
// sin(a*PI + b.length) / ( round(d) - log(c) + +("Basse Qualité"===e) )
// for (var i=0, total=0; i<10; i++) total += pow(a,i); return total
check = function(str){
try {
str = str.replace(/(^|[^\."'a-zA-Z])([a-zA-Z]\w+)\b/g, function(s, p, t){
return t in Math ? p + "Math." + t : s;
});
if (!/\breturn\b/.test(str)) str = "return ("+str+")";
var args = "abcdefghijklmnopqrstuvwxyz".split("");
var f = Function.apply(Function, args.concat(str));
f.apply(null, args); // if it works for args it should be ok for numbers...
return true;
} catch (e) {
console.log("error while checking formula", e, str);
return false;
}
}
The basic ideas which should apply to your cases:
try to instantiate a Function with your formula as body and ["a", "b", ...] as argument names
execute this function with a sample input (in my case ["a", "b", ...])
If you don't want as much liberty as in my example, you may also test the character ranges (you don't have to allow ; or `, if you don't want inline JavaScript)
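If evaluating the input as code is not acceptable, a stricter alternative is a single left-to-right scan that tracks what may legally come next. The sketch below is not the approach above but a swapped-in technique, written under stated assumptions taken from the question: an operand is a single letter or digit, whitespace is not allowed, and the only operators are + - * /. The name isValidExpression is hypothetical.

```javascript
// Sketch of a validator for the grammar described in the question.
// Assumptions: single-character operands, no whitespace, operators + - * / only.
function isValidExpression(text) {
  const isOperand = ch => /^[A-Za-z0-9]$/.test(ch);
  const isOperator = ch => '+-*/'.includes(ch);
  let depth = 0;            // number of currently open parentheses
  let expectOperand = true; // true at the start, after an operator, and after '('

  for (const ch of text) {
    if (expectOperand) {
      if (ch === '(') depth++;
      else if (isOperand(ch)) expectOperand = false;
      else return false;    // operator or ')' where an operand was required
    } else {
      if (ch === ')') {
        if (--depth < 0) return false; // ')' without a matching '('
      } else if (isOperator(ch)) {
        expectOperand = true;
      } else {
        return false;       // two operands in a row, or '(' right after an operand
      }
    }
  }
  // Valid only if we ended on an operand or ')' and every '(' was closed.
  return !expectOperand && depth === 0;
}

console.log(isValidExpression('(e-6*(d+e))')); // true
console.log(isValidExpression('(a+v'));        // false
```

Because the scan never executes the input, it cannot run arbitrary code, at the cost of accepting only the small grammar it was written for.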

How to remove specific character surrounding a string?

I have this string:
var str = "? this is a ? test ?";
Now I want to get this:
var newstr = "this is a ? test";
As you can see, I want to remove only the ? characters surrounding the string (at the beginning and end), not those in the middle. How can I do that using JavaScript?
Here is what I have tried:
var str = "? this is a ? test ?";
var result = str.trim("?");
document.write(result);
As you can see, it doesn't work. I'm actually a PHP developer, and trim() works well in PHP. I want to know whether I can use trim() to do this in JS.
Note that I could do this with a regex, but to be honest I dislike regex for this kind of job. Is there a better solution?
Edit: As mentioned in the comments, I need to remove both the ? characters and the whitespace around the string.

Search for the characters of a mask at both ends of the string and return the rest without them.
This proposal uses the bitwise NOT operator ~ for the check.
~ works well with indexOf(), because indexOf returns the index 0 ... n if the value is found and -1 if it is not:
value ~value boolean
-1 => 0 => false
0 => -1 => true
1 => -2 => true
2 => -3 => true
and so on
function trim(s, mask) {
while (~mask.indexOf(s[0])) {
s = s.slice(1);
}
while (~mask.indexOf(s[s.length - 1])) {
s = s.slice(0, -1);
}
return s;
}
console.log(trim('??? this is a ? test ?', '? '));
console.log(trim('abc this is a ? test abc', 'cba '));
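For longer inputs, the same idea can be sketched without re-slicing the string on every step by moving two indices inward and slicing once. This is an illustrative variant rather than part of the answer above; the name trimMask is hypothetical, and String.prototype.includes is used in place of the ~indexOf check.

```javascript
// Hypothetical variant: advance two indices, then slice a single time.
function trimMask(s, mask) {
  let start = 0;
  let end = s.length;
  while (start < end && mask.includes(s[start])) start++;   // skip leading mask chars
  while (end > start && mask.includes(s[end - 1])) end--;   // skip trailing mask chars
  return s.slice(start, end);
}

console.log(trimMask('??? this is a ? test ?', '? '));      // "this is a ? test"
console.log(trimMask('abc this is a ? test abc', 'cba '));  // "this is a ? test"
```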

Simply use:
let text = '?? something ? really ??'
text = text.replace(/^([?]*)/g, '')
text = text.replace(/([?]*)$/g, '')
console.log(text)

A possible solution would be to use recursive functions to remove the unwanted leading and trailing characters. This doesn't use regular expressions.
function ltrim(char, str) {
if (str.slice(0, char.length) === char) {
return ltrim(char, str.slice(char.length));
} else {
return str;
}
}
function rtrim(char, str) {
if (str.slice(str.length - char.length) === char) {
return rtrim(char, str.slice(0, 0 - char.length));
} else {
return str;
}
}
Of course this is only one of many possible solutions. The function trim would use both ltrim and rtrim.
The reason that char is the first argument and the string that needs to be cleaned the second, is to make it easier to change this into a functional programming style function, like so (ES 2015):
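function ltrim(char) {
  // return a function that already knows which prefix to strip
  return function trimLeft(str) {
    return str.slice(0, char.length) === char
      ? trimLeft(str.slice(char.length))
      : str;
  };
}
// No need to specify str here
const ltrimSpaces = ltrim(' ');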

Here is one way to do it which checks for index-out-of-bounds and makes only a single call to substring:
String.prototype.trimChars = function(chars) {
var l = 0;
var r = this.length-1;
while(chars.indexOf(this[l]) >= 0 && l < r) l++;
while(chars.indexOf(this[r]) >= 0 && r >= l) r--;
return this.substring(l, r+1);
};
Example:
var str = "? this is a ? test ?";
str.trimChars(" ?"); // "this is a ? test"

No regex:
uberTrim = s => s.length >= 2 && (s[0] === s[s.length - 1])?
s.slice(1, -1).trim()
: s;
Step-by-step explanation:
Check if the string is at least 2 characters long and if it is surrounded by a specific character;
If it is, then first slice it to remove the surrounding characters then trim it to remove whitespaces;
If not just return it.
In case you're weirded out by that syntax, it's an Arrow Function and a ternary operator.
The parenthesis are superfluous in the ternary by the way.
Example use:
uberTrim(''); // ''
uberTrim(' Plop! '); //'Plop!'
uberTrim('! ...What is Plop?!'); //'...What is Plop?'

Simple approach using the String indexOf, lastIndexOf and slice functions:
Update: (note: the author has requested to trim the surrounding chars)
function trimChars(str, char){
var str = str.trim();
var checkCharCount = function(side) {
var inner_str = (side == "left")? str : str.split("").reverse().join(""),
count = 0;
for (var i = 0, len = inner_str.length; i < len; i++) {
if (inner_str[i] !== char) {
break;
}
count++;
}
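return (side == "left")? count : str.length - count; // left: start index, right: end index
};
// only trim when the string actually starts or ends with the character
if (typeof char === "string"
&& (str.charAt(0) === char || str.charAt(str.length - 1) === char)) {
str = str.slice(checkCharCount("left"), checkCharCount("right")).trim();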
}
return str;
}
var str = "???? this is a ? test ??????";
console.log(trimChars(str, "?")); // "this is a ? test"

To keep this question up to date, here is an ES6 approach:
I liked the bitwise method, but when readability is also a concern, here's another approach.
function trimByChar(string, character) {
const first = [...string].findIndex(char => char !== character);
if (first === -1) return ''; // every character matched (or the string is empty)
const last = [...string].reverse().findIndex(char => char !== character);
return string.substring(first, string.length - last);
}

Using regex
'? this is a ? test ?'.replace(/^[? ]*(.*?)[? ]*$/g, '$1')
You may hate regex but after finding a solution you will feel cool :)

JavaScript's trim method only removes whitespace and takes no parameters. For a custom trim, you will have to write your own function. A regex makes for a quick solution, and you can find an implementation of a custom trim on w3schools in case you don't want the trouble of writing the regex yourself (you'd just have to adjust it to filter ? instead of whitespace).
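One way such a regex-based custom trim could look is sketched below. It is not the w3schools implementation; the name trimCharsRegex is hypothetical, and the set of characters to strip is passed as a plain string that gets escaped for use inside a character class.

```javascript
// Hypothetical helper: strip any of the given characters from both ends.
function trimCharsRegex(str, chars) {
  if (chars === '') return str; // nothing to strip
  // Escape the characters that are special inside a character class: \ ] ^ -
  const cls = '[' + chars.replace(/[\\\]^-]/g, '\\$&') + ']+';
  return str.replace(new RegExp('^' + cls + '|' + cls + '$', 'g'), '');
}

console.log(trimCharsRegex('? this is a ? test ?', '? ')); // "this is a ? test"
console.log(trimCharsRegex('--a-b--', '-'));               // "a-b"
```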

This is one line of code which returns your desired output (it removes exactly one character from each end, then trims the whitespace):
"? this is a ? test ?".slice(1).slice(0,-1).trim();

Javascript search string for numbers regex maybe?

I have a string like
:21::22::24::99:
and I want to find out whether, say, :22: is in that string. Is there a way in JavaScript to search a string like the one above for a value I want to match? If so, does it involve regex magic, or is there something else? Either way, I'm not sure how to do it, especially if regex is involved.

You can build the regular expression you need:
function findNumberInString(num, s) {
var re = new RegExp(':' + num + ':');
return re.test(s);
}
var s = ':21::22::24::99';
var n = '22';
findNumberInString(n, s); // true
or just use match (though test is cleaner to me)
!!s.match(':' + n + ':'); // true
Edit
Both of the above use regular expressions, so a decimal point (.) will match any character: "4.1" will match "461" or even "4z1". It is therefore safer to use a method based on String.prototype.indexOf (unless you want "." to represent any character), so per Blender's comment:
function findNumberInString(num, s) {
return s.indexOf(':' + num + ':') != -1;
}
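If the same string is searched many times, another option is to parse it once into a Set and do exact lookups. This is a sketch under the assumption that values are always wrapped in single colons as in the question; the name parseIds is hypothetical.

```javascript
// Hypothetical helper: split ":21::22:" style strings into a Set of values.
function parseIds(s) {
  return new Set(s.split(':').filter(part => part !== ''));
}

const ids = parseIds(':21::22::24::99:');
console.log(ids.has('22'));  // true
console.log(ids.has('2'));   // false
console.log(ids.has('4.1')); // false
```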

like this:
aStr = ':21::22::24::99:';
if(aStr.indexOf(':22:') != -1){
//':22:' exists in aStr
}
else{
//it doesn't
}

Regex, grab only one instance of each letter

I have a paragraph that's broken up into an array, split at the periods. I'd like to perform a regex on index[i], replacing its contents with one instance of each letter that index[i]'s string value contains.
So index[i]:"This is a sentence" would return --> index[i]:"thisaenc"
I read this thread, but I'm not sure if that's what I'm looking for.

Not sure how to do this in regex, but here's a very simple function to do it without using regex:
function charsInString(input) {
var output='';
for(var pos=0; pos<input.length; pos++) {
var char=input.charAt(pos).toLowerCase(); // declared locally to avoid creating a global
if(output.indexOf(char) == -1 && char != ' ') {output+=char;}
}
return output;
}
alert(charsInString('This is a sentence'));

As I'm pretty sure what you need cannot be achieved using a single regular expression, I offer a more general solution:
// collapseSentences(ary) will collapse each sentence in ary
// into a string containing its constituent chars
// @param {Array} ary the array of strings to collapse
// @return {Array} the collapsed sentences
function collapseSentences(ary){
var result=[];
ary.forEach(function(line){
var tmp={};
line.toLowerCase().split('').forEach(function(c){
if(c >= 'a' && c <= 'z') {
tmp[c] = true; // record that this letter occurs
}
});
result.push(Object.keys(tmp).join(''));
});
return result;
}
which should do what you want except that the order of characters in each sentence cannot be guaranteed to be preserved, though in most cases it is.
Given:
var index=['This is a sentence','This is a test','this is another test'],
result=collapseSentences(index);
result contains:
["thisaenc","thisae", "thisanoer"]
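A shorter sketch of the same idea uses a Set, which keeps values in insertion order, so the first-occurrence order of the letters is preserved. The name uniqueLetters is hypothetical, and only the letters a-z are kept, as in the function above.

```javascript
// Hypothetical helper: one instance of each letter, in order of first appearance.
function uniqueLetters(sentence) {
  const letters = sentence.toLowerCase().replace(/[^a-z]/g, '');
  return [...new Set(letters)].join('');
}

console.log(uniqueLetters('This is a sentence'));   // "thisaenc"
console.log(uniqueLetters('this is another test')); // "thisanoer"
```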

(\w)(?!.*?\1)
This yields a match for each of the right characters, but as if you were reading right-to-left instead.
It finds a word character, then looks ahead for another occurrence of the character just matched, and matches only if there is none.
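To illustrate the right-to-left effect, here is a small sketch that assumes the negative-lookahead form of the pattern, (\w)(?!.*?\1), which keeps the last occurrence of each character rather than the first. The variable names are hypothetical.

```javascript
// Assumption: negative lookahead, so only the LAST occurrence of each character matches.
const lastOccurrences = /(\w)(?!.*?\1)/g;
const matches = 'this is a sentence'.match(lastOccurrences);

console.log(matches.join('')); // "hiastnce"
```

The result contains the same letters as "thisaenc" but ordered by where each letter last appears, which is why it reads as if the string had been scanned from the right.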

Never mind, I managed:
justC = "";
if (color[i+1].match(/A/g)) {justC += " L_A";}
if (color[i+1].match(/B/g)) {justC += " L_B";}
if (color[i+1].match(/C/g)) {justC += " L_C";}
if (color[i+1].match(/D/g)) {justC += " L_D";}
if (color[i+1].match(/E/g)) {justC += " L_E";}
else {color[i+1] = "L_F";}
It's not exactly what my question may have led you to believe I wanted, but the output of this is what I was after, for use in a class: <span class="L_A L_C L_E"></span>

How about:
var re = /(.)((.*?)\1)/g;
var str = 'This is a sentence';
x = str.toLowerCase();
x = x.replace(/ /g, '');
while(x.match(re)) {
x=x.replace(re, '$1$3');
}

I don't think this can be done in one fell regex swoop. You are going to need to use a loop.
While my example was not written in your language of choice, it doesn't seem to use any regex features not present in javascript.
perl -e '$foo="This is a sentence"; while ($foo =~ s/((.).*?)\2/$1/ig) { print "<$1><$2><$foo>\n"; } print "$foo\n";'
Producing:
This aenc

Case insensitive string replacement in JavaScript?

I need to highlight, case insensitively, given keywords in a JavaScript string.
For example:
highlight("foobar Foo bar FOO", "foo") should return "<b>foo</b>bar <b>Foo</b> bar <b>FOO</b>"
I need the code to work for any keyword, and therefore using a hardcoded regular expression like /foo/i is not a sufficient solution.
What is the easiest way to do this?
(This is an instance of a more general problem detailed in the title, but I feel that it's best to tackle it with a concrete, useful example.)

You can use regular expressions if you prepare the search string. In PHP e.g. there is a function preg_quote, which replaces all regex-chars in a string with their escaped versions.
Here is such a function for javascript (source):
function preg_quote (str, delimiter) {
// discuss at: https://locutus.io/php/preg_quote/
// original by: booeyOH
// improved by: Ates Goral (https://magnetiq.com)
// improved by: Kevin van Zonneveld (https://kvz.io)
// improved by: Brett Zamir (https://brett-zamir.me)
// bugfixed by: Onno Marsman (https://twitter.com/onnomarsman)
// example 1: preg_quote("$40")
// returns 1: '\\$40'
// example 2: preg_quote("*RRRING* Hello?")
// returns 2: '\\*RRRING\\* Hello\\?'
// example 3: preg_quote("\\.+*?[^]$(){}=!<>|:")
// returns 3: '\\\\\\.\\+\\*\\?\\[\\^\\]\\$\\(\\)\\{\\}\\=\\!\\<\\>\\|\\:'
return (str + '')
.replace(new RegExp('[.\\\\+*?\\[\\^\\]$(){}=!<>|:\\' + (delimiter || '') + '-]', 'g'), '\\$&')
}
So you could do the following:
function highlight(str, search) {
return str.replace(new RegExp("(" + preg_quote(search) + ")", 'gi'), "<b>$1</b>");
}

function highlightWords( line, word )
{
var regex = new RegExp( '(' + word + ')', 'gi' );
return line.replace( regex, "<b>$1</b>" );
}

You can enhance the RegExp object with a function that does special character escaping for you:
RegExp.escape = function(str)
{
var specials = /[.*+?|()\[\]{}\\$^]/g; // .*+?|()[]{}\$^
return str.replace(specials, "\\$&");
}
Then you would be able to use what the others suggested without any worries:
function highlightWordsNoCase(line, word)
{
var regex = new RegExp("(" + RegExp.escape(word) + ")", "gi");
return line.replace(regex, "<b>$1</b>");
}

Regular expressions are fine as long as keywords are really words, you can just use a RegExp constructor instead of a literal to create one from a variable:
var re= new RegExp('('+word+')', 'gi');
return s.replace(re, '<b>$1</b>');
The difficulty arises if ‘keywords’ can contain punctuation, as punctuation tends to have special meaning in regexps. Unfortunately, unlike most other languages/libraries with regexp support, there is no standard function to escape punctuation for regexps in JavaScript.
And you can't be totally sure exactly what characters need escaping because not every browser's implementation of regexp is guaranteed to be exactly the same. (In particular, newer browsers may add new functionality.) And backslash-escaping characters that are not special is not guaranteed to still work, although in practice it does.
So about the best you can do is one of:
attempting to catch each special character in common browser use today [add: see Sebastian's recipe]
backslash-escape all non-alphanumerics. care: \W will also match non-ASCII Unicode characters, which you don't really want.
just ensure that there are no non-alphanumerics in the keyword before searching
If you are using this to highlight words in HTML which already has markup in it, though, you've got trouble. Your ‘word’ might appear in an element name or attribute value, in which case attempting to wrap a <b> around it will cause brokenness. In more complicated scenarios it could even turn an HTML injection into an XSS security hole. If you have to cope with markup you will need a more complicated approach, splitting out ‘<...>’ markup before attempting to process each stretch of text on its own.
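For the simpler case where the input is plain text (not existing HTML), a sketch that escapes both the keyword for the regex and the text for HTML might look like the following. All three function names are hypothetical, and this deliberately does not attempt the markup-splitting approach described above.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Escape characters that have special meaning in HTML.
function escapeHtml(text) {
  return text.replace(/[&<>"']/g, ch => ({
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'
  }[ch]));
}

// Assumption: `text` is plain text; the return value is HTML that is safe to insert.
function highlightPlainText(text, keyword) {
  if (keyword === '') return escapeHtml(text);
  // Splitting on a capturing group keeps the matches at the odd indices.
  const parts = text.split(new RegExp('(' + escapeRegExp(keyword) + ')', 'gi'));
  return parts
    .map((part, i) => (i % 2 === 1 ? '<b>' + escapeHtml(part) + '</b>' : escapeHtml(part)))
    .join('');
}

console.log(highlightPlainText('foobar Foo bar FOO', 'foo'));
// "<b>foo</b>bar <b>Foo</b> bar <b>FOO</b>"
console.log(highlightPlainText('1 < 2 & 3', '<'));
// "1 <b>&lt;</b> 2 &amp; 3"
```

Escaping each piece separately, after the split, means the inserted <b> tags are the only markup in the output.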

What about something like this:
if(typeof String.prototype.highlight !== 'function') {
String.prototype.highlight = function(match, spanClass) {
var pattern = new RegExp( match, "gi" );
var replacement = "<span class='" + spanClass + "'>$&</span>";
return this.replace(pattern, replacement);
}
}
This could then be called like so:
var result = "The Quick Brown Fox Jumped Over The Lazy Brown Dog".highlight("brown","text-highlight");

For those poor with disregexia or regexophobia:
function replacei(str, sub, f){
let A = str.toLowerCase().split(sub.toLowerCase());
let B = [];
let x = 0;
for (let i = 0; i < A.length; i++) {
let n = A[i].length;
B.push(str.substr(x, n));
if (i < A.length-1)
B.push(f(str.substr(x + n, sub.length)));
x += n + sub.length;
}
return B.join('');
}
s = 'Foo and FOO (and foo) are all -- Foo.'
t = replacei(s, 'Foo', sub=>'<'+sub+'>')
console.log(t)
Output:
<Foo> and <FOO> (and <foo>) are all -- <Foo>.

Why not just create a new regex on each call to your function? You can use:
new RegExp([pat], [flags])
where [pat] is a string for the pattern, and [flags] are the flags.
