Javascript string comparison fails when comparing unicode characters


I want to compare two strings in JavaScript that are the same, and yet the equality operator == returns false. One string contains a special character (e.g. the Danish å).
JavaScript code:
var filenameFromJS = "Designhåndbog.pdf";
var filenameFromServer = "Designhåndbog.pdf";
print(filenameFromJS == filenameFromServer); // This prints false why?
The solution
What worked for me is unicode normalization as slevithan pointed out.
I forked my original jsfiddle to make a version using the normalization lib suggested by slevithan. Link: http://jsfiddle.net/GWZ8j/1/.

Unlike what some other people here have said, this has nothing to do with encodings. Rather, your two strings use different code points to render the same visual characters.
To solve this correctly, you need to perform Unicode normalization on the two strings before comparing them. Unfortunately, JavaScript didn't have this functionality built in when this answer was written (engines that support ES2015 or later provide String.prototype.normalize()). For older environments, here is a JavaScript library that can perform the normalization for you: https://github.com/walling/unorm
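As a minimal sketch of the normalization step, assuming an engine with ES2015's String.prototype.normalize(): the two literals below are written with escapes so the difference is visible, and the assumption that the server sent the decomposed form is illustrative only (the question does not say which side used which form). The helper name equalNormalized is hypothetical.

```javascript
// Two spellings of the same visible file name.
const precomposed = "Designh\u00E5ndbog.pdf"; // å as one code point (U+00E5)
const decomposed = "Designha\u030Andbog.pdf"; // a + COMBINING RING ABOVE (U+030A)

// Hypothetical helper: compare after bringing both strings to the same form (NFC).
function equalNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(precomposed === decomposed);               // false
console.log(equalNormalized(precomposed, decomposed)); // true
```

Either NFC or NFD works for the comparison, as long as both sides use the same form.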

The JavaScript equality operator == will appear to fail under the following circumstances. In all cases it is a programmer error, not a bug in JavaScript.
The two strings do not contain the same number and sequence of characters.
There is whitespace or a newline before, within or after one string. Call trim() on both and look closely at both strings.
Surprise typecasting. The programmer is comparing datatypes that are incompatible.
There are unicode characters which look identical to other unicode characters but in fact are different unicode characters.
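One way to check which of these cases applies is to print the code points of both strings side by side. The following is a small sketch; the helper name codePoints is hypothetical.

```javascript
// Hypothetical helper: list every code point of a string as U+XXXX.
function codePoints(s) {
  return Array.from(s, ch =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(codePoints("\u00E5").join(" "));  // U+00E5
console.log(codePoints("a\u030A").join(" ")); // U+0061 U+030A
console.log(codePoints("a\u00A0").join(" ")); // U+0061 U+00A0 (a no-break space, not a plain space)
```

Strings that print the same but list different code points fall under the last case in the list above.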

Unicode is a complex thing. It has two different representations for characters such as á and é: a single precomposed code point, or a base letter followed by a combining accent. As you can already see in the URL-encoded version, the hex bytes that make up the character differ between the two versions.
See this answer for more information.

I had this same problem.
Adding
<meta charset="UTF-8">
to the HTML file fixed the issue.
In my case the templating engine was baking a json string into the HTML file. This string was in unicode.
While the template was also a unicode file, the JS engine was treating the string I wrote into the template as a latin-1 encoded string, until I added the meta tag.
I was comparing the typed in string to one of the JSON objects items (location.title == "Mühle")

Let the browser normalize unicode for you. This approach worked for me:
function normalizeUnicode(s) {
  let div = $('<div style="display: none"></div>').html(s).appendTo('body');
  let res = div.html();
  div.remove();
  return res;
}
normalizeUnicode(unicodeVal1) == normalizeUnicode(unicodeVal2)

Related

Same Symbol but two UTF-8 in Javascript (.normalize() can not resolve)

I have an issue with a Unicode symbol. As the title says, I previously used the .normalize() function to convert two symbols to one standard UTF-8 code. I thought it would cover all my cases, but it does not.
The two "Đ" symbols in my case have two different UTF-8 codes: \xC3\x90 and \xC4\x90. You can check them at https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&names=-&utf8=string-literal
The JavaScript normalize function cannot convert them to a single UTF-8 code.
I need your suggestions: a block of code, a library, anything. Thank you!

Those are simply two different letters, an uppercase Eth (U+00D0) and an uppercase D with a stroke (U+0110). They're not interchangeable, so they don't normalise to the same characters. Even if they happen to look the same.
The same is true for many other characters. For example, the Russian letter С (U+0421) looks just like a C (U+0043), but it's not the same letter; when transliterating Russian to ASCII you'd get an S.
So you can't convert all lookalikes to each other; not without loss of information.
If you explain what your use case is exactly, maybe someone can come up with a solution for that. But there is no general library that can solve the problem of some characters looking just like others.
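The point can be checked directly: none of the four normalization forms maps one letter to the other. If a particular application really does want to treat specific lookalikes as equal, an explicit and lossy mapping is one option. The sketch below assumes exactly that; the lookalikes table and the foldLookalikes helper are hypothetical and application-specific, not a general solution.

```javascript
const eth = "\u00D0";     // Ð LATIN CAPITAL LETTER ETH
const dStroke = "\u0110"; // Đ LATIN CAPITAL LETTER D WITH STROKE

// No normalization form makes the two letters equal.
const forms = ["NFC", "NFD", "NFKC", "NFKD"];
const anyFormMatches = forms.some(f => eth.normalize(f) === dStroke.normalize(f));
console.log(anyFormMatches); // false

// Assumption: the application decides which lookalikes to merge (this loses information).
const lookalikes = new Map([["\u00D0", "\u0110"]]);

function foldLookalikes(s) {
  return Array.from(s, ch => (lookalikes.has(ch) ? lookalikes.get(ch) : ch)).join("");
}

console.log(foldLookalikes(eth) === foldLookalikes(dStroke)); // true
```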

Javascript encodeURI() vs. PHP rawurldecode() and special characters

Encoding a string with German umlauts like ä,ü,ö,ß with Javascript encodeURI() causes a weird bug after decoding it in PHP with rawurldecode(). Although the string seems to be correctly decoded it isn't. See below example screenshots from my IDE
Also the strlen() of the - with rawurldecode() - decoded string gives more characters than it really has!
Problems occur when I need to process the decoded string, for example if I want to replace the German characters ä,ü,ö with ae, ue and oe. This can be seen in the example provided here.
I have also made a PHP fiddle where this whole weirdness can be seen.
What I've tried so far:
- utf8_decode
- iconv
- and also first two suggestions from here

This is a Unicode equivalence issue, and it looks like your IDE doesn't handle multibyte strings very well.
In unicode you can represent Ü with either:
the single unicode codepoint (U+00DC) or %C3%9C in utf8
or use a capital U (U+0055) with a modifier (U+0308) or %55%CC%88 in utf8
Your GWT string uses the latter method called NFD while your one from PHP uses the first method called NFC. That's why your GWT string is 3 characters longer even though they are both valid encodings of logically identical unicode strings. Your problem is that they are not identical byte for byte in PHP.
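The two forms can be reproduced on the JavaScript side with escapes. This is a small sketch, assuming an engine with String.prototype.normalize(); note that encodeURIComponent leaves the ASCII letter U unescaped, so the decomposed form shows up as U%CC%88 rather than %55%CC%88.

```javascript
const nfc = "\u00DC";  // Ü as a single code point
const nfd = "U\u0308"; // U followed by COMBINING DIAERESIS

console.log(encodeURIComponent(nfc)); // %C3%9C
console.log(encodeURIComponent(nfd)); // U%CC%88

// Normalizing before encoding makes both sides send the same bytes.
console.log(encodeURIComponent(nfd.normalize("NFC"))); // %C3%9C
```

Normalizing to NFC before calling encodeURI() would be one way to avoid the mismatch at the source, in addition to normalizing in PHP.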
More details about utf-8 normalisation.
If you want to do preg replacements on the strings you need to normalise them to the same form first. From your example I can see your IDE is using NFC since it's the PHP string that works. So I suggest normalising to NFC form in PHP (the default), then doing the preg_replace.
http://php.net/manual/en/normalizer.normalize.php
function cleanImageName($name)
{
$name = Normalizer::normalize( $name, Normalizer::FORM_C );
$clean = preg_replace(
Otherwise you have to do something like this which is based on this article.

Escape with Unicode Code Points similar to PHP in Javascript

When using PHP to generate JSON, it encodes higher characters using the \u0123 code-point notation.
(I know this is not necessary, but for unnamed reasons I want that.)
I am trying to achieve the same in JavaScript. I searched high and low and found nothing. The encodeURI function does nothing for me (even though many suggested that it would).
Any helpful hints? I hope that I do not have to use some big external library but rather something built-in or a few lines of nice code - this cannot be that hard, can it...?!
I have an input string in the form of:
var stringVar = 'Hällö Würld.';
My desired conversion would give me something like:
var resultStringVar = 'H\u00e4ll\u00f6 W\u00fcrld.';

I’ve made a library for just that called jsesc. From its README:
This is a JavaScript library for escaping JavaScript strings while generating the shortest possible valid ASCII-only output. Here’s an online demo.
This can be used to avoid mojibake and other encoding issues, or even to avoid errors when passing JSON-formatted data (which may contain U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR, or lone surrogates) to a JavaScript parser or an UTF-8 encoder, respectively.
For your specific example, use it as follows:
var stringVar = 'Hällö Würld.';
var resultStringVar = jsesc(stringVar, { 'json': true, 'wrap': false });
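For the simple case in the question, a few lines without a library may be enough. The following is a rough sketch, not a replacement for jsesc: the helper name escapeNonAscii is hypothetical, and it only rewrites non-ASCII code units (it does not escape quotes, backslashes or control characters).

```javascript
// Hypothetical helper: replace every non-ASCII UTF-16 code unit with \uXXXX.
// Without the u flag the regex matches one code unit at a time, so a character
// outside the BMP becomes two escapes (its surrogate pair).
function escapeNonAscii(s) {
  return s.replace(/[^\x00-\x7F]/g, unit =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0"));
}

console.log(escapeNonAscii("H\u00E4ll\u00F6 W\u00FCrld.")); // H\u00e4ll\u00f6 W\u00fcrld.
console.log(escapeNonAscii("\u{1F300}"));                   // \ud83c\udf00
```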

charCodeAt is not behaving as expected

How can this be possible:
var string1 = "🌀", string2 = "🌀🌂";
//comparing the charCode
console.log(string1.charCodeAt(0) === string2.charCodeAt(0)); //true
//comparing the character
console.log(string1 === string2.substring(0,1)); //false
//This is giving me a headache.
http://jsfiddle.net/DerekL/B9Xdk/
If the char codes are the same in both strings, comparing the characters themselves should return true. It does when I put in a and ab. But when I put in these strings, it simply breaks.
Some said that it might be the encoding that is causing the problem. But since it works perfectly fine when there's only one character in the string literal, I assume encoding has nothing to do with it.
(This question addresses the core problem in my previous questions. Don't worry I deleted them already.)

In JavaScript, strings are sequences of 16-bit UTF-16 code units rather than bytes, so one unit corresponds to one character only for characters that can be expressed in 16 bits.
A majority of characters cause no issues, but the ones in this case don't "fit", so each occupies two units (a surrogate pair) as far as JavaScript is concerned.
In this case you need to do:
string2.substring(0, 2) // "🌀"
For more information on Unicode quirkiness, see UTF-8 Everywhere.
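Hard-coding substring(0, 2) only works when the first character is known to be a surrogate pair. As a sketch for ES2015 engines, iterating by code point avoids that assumption; the strings below are the ones from the question, written with escapes.

```javascript
const string1 = "\u{1F300}";          // 🌀
const string2 = "\u{1F300}\u{1F302}"; // 🌀🌂

console.log(string2.length);                       // 4 (code units, not characters)
console.log(string2.substring(0, 1) === "\uD83C"); // true: only the high surrogate

// Array.from iterates by code point, so the first element is the whole character.
const firstChar = Array.from(string2)[0];
console.log(firstChar === string1); // true

// codePointAt reads the full code point where charCodeAt reads one code unit.
console.log(string2.codePointAt(0) === 0x1F300); // true
```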

The parameters of substring are the start index and the end index, whereas the parameters of substr are the start index and the number of characters to take.
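A short illustration of the difference (note that substr is a legacy method; it is still widely available, but slice or substring is usually preferred in new code):

```javascript
const s = "abcdef";

console.log(s.substring(1, 3)); // "bc"  (from index 1 up to, not including, index 3)
console.log(s.substr(1, 3));    // "bcd" (from index 1, three code units)

// Both count UTF-16 code units, so neither is aware of surrogate pairs.
const emoji = "\u{1F300}\u{1F302}";
console.log(emoji.substring(0, 2) === emoji.substr(0, 2)); // true
```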

You can use the localeCompare method to compare two strings:
string1.localeCompare(string2);
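localeCompare returns a number rather than a boolean, with 0 meaning the strings are considered equal. The ECMAScript specification requires it to return 0 for canonically equivalent strings, so it can mask normalization differences, although it does nothing about code units or substring boundaries. The sketch below assumes an engine with standard collation support; the helper name sameForCollation is hypothetical.

```javascript
// Hypothetical helper: true when locale-aware comparison treats the strings as equal.
function sameForCollation(a, b) {
  return a.localeCompare(b) === 0;
}

const precomposed = "M\u00FChle"; // ü as U+00FC
const decomposed = "Mu\u0308hle"; // u + COMBINING DIAERESIS (U+0308)

console.log(precomposed === decomposed);               // false
console.log(sameForCollation(precomposed, decomposed));
```

Because localeCompare is designed for sorting, an explicit normalize() call is usually the clearer choice when only equality matters.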

How to escape a character out of Basic Multilingual Plane?

For characters in the Basic Multilingual Plane, we can use '\uxxxx' to escape them. For example, you can use /[\u4e00-\u9fff]/ to match a common Chinese character (0x4e00-0x9fff is the range of CJK Unified Ideographs).
But characters outside the Basic Multilingual Plane have code points greater than 0xffff, so the '\uxxxx' format can't escape them: '\u20000' means the character '\u2000' followed by the character '0', not the character whose code point is 0x20000.
How can I escape characters outside the Basic Multilingual Plane? Using those characters directly is not a good idea, because most fonts can't display them.

Characters outside the BMP are not represented directly in JavaScript strings -- they're represented internally as UTF-16 surrogate pairs. For instance, the character you mentioned, U+20000 (currently allocated to "CJK Unified Ideographs Ext. B") is represented as the surrogate pair U+D840 U+DC00. As a JavaScript string, this would simply be "\uD840\uDC00". (Note that s.length is 2 for this string, even though it displays as a single character.)
Wikipedia has details on the encoding scheme used.
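The surrogate pair for any code point above U+FFFF follows from the UTF-16 arithmetic. Here is a minimal sketch; the helper names toSurrogatePair and toEscape are hypothetical.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not outside the BMP");
  }
  const offset = codePoint - 0x10000;     // 20 bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: format the pair as \uXXXX\uXXXX source text.
function toEscape(codePoint) {
  return toSurrogatePair(codePoint)
    .map(unit => "\\u" + unit.toString(16).toUpperCase())
    .join("");
}

console.log(toEscape(0x20000)); // \uD840\uDC00
console.log(String.fromCharCode(...toSurrogatePair(0x20000)) === "\uD840\uDC00"); // true
```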

You can use a pair of escaped surrogate code points, as described in @duskwuff’s answer. You can use my Full Unicode input utility to get the notations (button “Show \u”), or use the Fileformat.info character search to find them out (item “C/C++/Java source code”, because JavaScript uses the same notation here).
Alternatively, you can enter the characters directly: “You can enter non-BMP characters as such into string literals in your JavaScript code, whether in a separate file or as embedded in HTML. Naturally, you need suitable Unicode support in the editor you use. But JavaScript implementations need not support non-BMP characters in program source. They may, and modern browser implementations generally do.” (Going Global with JavaScript and Globalize.js, p. 177) There are some caveats like properly declaring the character encoding.
Font support is a different issue, but when working with characters, you generally want to see them at some point anyway, at least in testing. So you more or less need some font(s) that cover the characters. The Fileformat.info pages also contain links to browser support info, such as (U+20000) Font Support – a good starting point, though not quite complete. For example, U+20000 '𠀀' is also supported in SimSun-ExtB

Interesting problem.
Now that we have ES6, we can do this:
let newSpeak = '\u{1F4A9}'
Note that internally it's still UTF-16 with surrogate pairs:
newSpeak.length === 2 // "wrong"
[...newSpeak].length === 1
newSpeak === '\uD83D\uDCA9'
Unicode is huge.
Also, it's not just the literals:
newSpeak.charCodeAt(0) === 0xD83D // "wrong"
newSpeak.codePointAt(0) === 0x1F4A9
String.fromCharCode(0x1F4A9) !== newSpeak
String.fromCodePoint(0x1F4A9) === newSpeak
for (let i = 0; i < newSpeak.length; i++) console.log(newSpeak[i]) // "wrong"
for (let c of newSpeak) console.log(c)
[...'🏃🚚'].map(c => `__${c}`).join('') === "__🏃__🚚"
I � handling Unicode.
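The same \u{...} notation works inside regular expressions when the u flag is set, which covers the character-class use case from the question. This is a sketch for ES2015 engines; U+20000 to U+2A6DF is the CJK Unified Ideographs Extension B block.

```javascript
// BMP range from the question: no flag needed.
const bmpCjk = /[\u4e00-\u9fff]/;

// With the u flag, \u{...} escapes and ranges work on whole code points.
const extB = /[\u{20000}-\u{2A6DF}]/u;

console.log(bmpCjk.test("\u4E00"));  // true
console.log(extB.test("\u{20000}")); // true
console.log(extB.test("\u4E00"));    // false

// Matches are whole characters, not halves of a surrogate pair.
const matches = "a\u{20000}b".match(/[\u{20000}-\u{2A6DF}]/gu);
console.log(matches.length); // 1
```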
