Character encoding from UTF8 JSON to ISO-8859-1 - javascript

Using getJSON to retrieve some data which I am UTF-8 encoding on the server side...
"title":"new movie \u0091The Tree of Life\u0092 on day 6"
The page that it is displayed on is charset ISO-8859-1, and I am doing this...
$.getJSON('index.php', { q: q }, function (data) {
    for (var i = 0; i < data.length; i++) {
        alert(data[i].title + "\n" + utf8_decode(data[i].title));
    }
});
The utf8_decode function comes from here.
The problem is that I am still seeing the magic squares for both versions...
new movie The Tree of Life on day 6
new movie ᔨe Tree of Life⠯n day 6
This leads me to believe that perhaps the character is of neither encoding. However, it works if I paste the string onto a page and set the charset to either UTF-8 or ISO-8859-1 :-/
Any help would be great!

There is no need to escape or decode any characters in data transmitted in JSON. It's done automatically. It is also independent of the page's encoding. You can easily transmit and display the euro sign (\u20ac) with your code even though ISO-8859-1 does not contain the euro sign.
Your problem is the characters \u0091 and \u0092. They aren't printable Unicode characters; they are C1 control codes (PRIVATE USE ONE and PRIVATE USE TWO), intended for private use only.
It rather looks as if you in fact have data that originally used the Windows-1250 character set but was not properly translated to Unicode/JSON. In Windows-1250, these two characters are typographic single quotes.
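If fixing the conversion on the server isn't an option, a client-side workaround is conceivable. The sketch below (the helper name and exact mapping are illustrative, not from the original answer) remaps the stray Windows-125x code points to the typographic quotes they were meant to be:
// Illustrative workaround: remap stray Windows-125x code points (received as
// U+0091/U+0092 etc.) to the Unicode characters they were intended to be.
function fixSmartQuotes(s) {
    return s
        .replace(/\u0091/g, "\u2018")  // left single quotation mark
        .replace(/\u0092/g, "\u2019")  // right single quotation mark
        .replace(/\u0093/g, "\u201C")  // left double quotation mark
        .replace(/\u0094/g, "\u201D"); // right double quotation mark
}
// fixSmartQuotes("new movie \u0091The Tree of Life\u0092 on day 6")
// -> "new movie ‘The Tree of Life’ on day 6"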

Did you try it without utf8_decode?
If the characters in your string exist in ISO-8859-1, this will just work, as Javascript decodes the \u0091 in the encoding of the page.

Related

How to remove illegal characters from nodejs buffer?

I got a Base64-encoded string of a CSV file from the frontend. In the backend I am converting the Base64 string to binary and then trying to convert it to a JSON object.
var csvDcaData = new Buffer(source, 'base64').toString('binary'); // convert base64 to binary
The problem is, the UI is sending some illegal characters in one of the fields which are not visible to the user in the plain CSV. "ï»¿" are the characters appended to one of the CSV fields.
I want to remove these kinds of characters from the Base64 data, but I am not able to recognize them in the buffer; only after conversion do these characters appear.
Is it possible in any way to detect such characters in the buffer?
The source is sending you a message. The message consists of metadata and text. The first few bytes of the message are identifiable as metadata because they are the Byte-Order Mark (BOM) encoded in UTF-8. That strongly suggests that the text is encoded in UTF-8. Nonetheless, to read the text you should find out from the sender which encoding is used.
Yes, the BOM "characters" should be stripped off when wanting to deal only in the text. They are not characters in the sense that they are not part of the text. (Though, if you decode the bytes as UTF-8, it matches the codepoint U+FEFF.)
So, though perhaps esoteric, the message does not contain illegal characters but actually has useful metadata.
Also, given that you are not stripping off the BOM, the fact that you are seeing "ï»¿" instead of "" (U+FEFF ZERO WIDTH NO-BREAK SPACE) means that you are not using UTF-8 to decode the text. That could result in data loss. There is no text but encoded text. You always have to know and use the correct encoding.
Now, source is a JavaScript string (which, by-the-way, uses the UTF-16 encoding of Unicode). The content of the string is a message encoded in Base64. The message is a sequence of bytes which are the UTF-8 encoding of a BOM and text. You want the text in a JavaScript string. (And the text happens to be some form of CSV. For that, you'll need to know the line ending, delimiter, and text-qualifier.) There is a lot for you and the sender to talk about. Perhaps the sender has documented all this.
const stripBom = require('strip-bom');

const original = "¡You win one million ₹! Now you can get a real 🚲";
const base64String = Buffer.from("\u{FEFF}" + original, "utf-8").toString("base64");
console.log(base64String);
const decodedString = stripBom(Buffer.from(base64String, "base64").toString("utf-8"));
console.log(decodedString);
console.log(original === decodedString);
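If pulling in a package just for this feels heavy, a minimal sketch of the same idea (assuming the payload really is UTF-8, as the BOM suggests) is to decode and drop a leading U+FEFF yourself:
// Minimal sketch: decode the Base64 payload as UTF-8 and strip a leading BOM.
function decodeWithoutBom(base64Source) {
    const text = Buffer.from(base64Source, 'base64').toString('utf-8');
    return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}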

Adding UTF-8 BOM to string/Blob

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.
Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See the discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of UTF-8 or UTF-16 being used.
See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page:
The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
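As a quick sanity check (a sketch assuming an environment with TextEncoder, i.e. modern browsers or Node), you can confirm that a string starting with \ufeff really does serialize to the UTF-8 bytes EF BB BF:
// Sketch: encode a \ufeff-prefixed string as UTF-8 and look at the first bytes.
const bytes = new TextEncoder().encode("\ufeff" + "my data");
console.log(bytes.slice(0, 3)); // Uint8Array(3) [ 239, 187, 191 ]  i.e. EF BB BF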
I had the same issue and this is the solution I came up with:
var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
    "Text",
    ...                                 // Remaining data
], { type: "text/plain;charset=utf-8" });
Using Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.
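For completeness, a small usage sketch (the downloadBlob helper below is illustrative, not part of the original answer) showing how such a blob is typically handed to the user:
// Illustrative helper: save a Blob by creating a temporary object URL
// and clicking a generated link.
function downloadBlob(blob, filename) {
    var url = URL.createObjectURL(blob);
    var a = document.createElement("a");
    a.href = url;
    a.download = filename;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
    URL.revokeObjectURL(url);
}
// downloadBlob(blob, "data.txt");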
I'm editing my original answer. The above answer really demands elaboration as this is a convoluted solution by Node.js.
The short answer is, yes, this code works.
The long answer is, no, \ufeff is not itself the byte order mark for UTF-8. Apparently Node took some sort of shortcut for writing encodings within files. \ufeff is the Unicode BOM code point; serialized as UTF-16 Little Endian it becomes the bytes FF FE, as can be seen in the Byte Order Mark Wikipedia article, and can also be viewed in a binary text editor after having written the file. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.js uses \ufeff to signify any number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the 3rd options parameter of writeFile, which is where you pass in the encoding string. Node.js takes this encoding string and converts the fixed \ufeff marker into the actual encoding's byte order mark.
UTF-8 Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
    /* The actual byte order mark written to the file is EF BB BF */
});
UTF-16 Little Endian Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
    /* The actual byte order mark written to the file is FF FE */
});
So, as you can see, \ufeff is simply a marker standing in for any number of resulting encodings. The actual byte order mark that makes it into the file is directly dependent on the encoding option specified. The exact marker used within the string is really irrelevant to what gets written to the file.
I suspect that the reasoning behind this is that the 3-byte mark for UTF-8 isn't easily encoded into the JavaScript string to be written to disk, so they used the BOM code point \ufeff as a placeholder within the string which gets substituted at write time.
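A minimal sketch of verifying that behaviour (the file name and strings are illustrative) by reading the first bytes back:
// Sketch: write a \ufeff-prefixed string, then inspect the first bytes on disk.
const fs = require('fs');

fs.writeFile('bom-test.txt', '\ufeff' + 'hello', { encoding: 'utf8' }, function (err) {
    if (err) throw err;
    console.log(fs.readFileSync('bom-test.txt').slice(0, 3)); // <Buffer ef bb bf>
});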
This is my solution:
var blob = new Blob(["\uFEFF"+csv], {
type: 'text/csv; charset=utf-18'
});
This works for me:
let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });
A BOM (Byte Order Mark) might be necessary because some programs need it to pick the correct character encoding.
Example:
When opening a CSV file without a BOM in MS Excel on a system whose default character encoding is Shift_JIS instead of UTF-8, Excel will open it in the default encoding. This will result in garbage characters. If you include the UTF-8 BOM, it will fix it.
This fixes it for me. I was getting a BOM with the authorize.net API and Cloudflare Workers:
const data = JSON.parse((await res.text()).trim());

How to get the length of Japanese characters in Javascript?

I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.
if (document.frmPage.txtName.value.length > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}
The problem is, JavaScript is not getting the correct length of a Japanese character encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but JavaScript only recognizes it as one character, probably because of the Unicode encoding that JavaScript uses by default. Some characters like ケ take 2 or 3 characters when in SHIFT_JIS.
If I only depend on the length provided by JavaScript, long Japanese strings would pass the page validation and it would try to save to the database, which would then fail because of the 200-character maximum length of the DB column.
The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?
Thanks for the help!
For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding
Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.
In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:
function getShiftJISByteLength(s) {
    // Anything outside ASCII and the halfwidth katakana block (U+FF61-U+FF9F),
    // the only characters that take a single byte in Shift-JIS, counts as two bytes.
    return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}
However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
Why you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case &#27979;, which is 8 characters long and is where your count of 8 comes from. This is a lossy mangling: you can't tell whether the user typed the literal character 测 or the literal text &#27979;. And if you are displaying the submitted content &#27979; as 测, that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:
function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}
although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
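In modern environments (not the Internet Explorer setup described in the question) the same UTF-8 byte count can be had without the unescape trick; a minimal sketch using TextEncoder:
// Sketch: count UTF-8 bytes with TextEncoder (modern browsers and Node).
function getUTF8ByteLengthModern(s) {
    return new TextEncoder().encode(s).length;
}
// getUTF8ByteLengthModern("测") === 3  (one character, three UTF-8 bytes)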
You are getting confused between characters and bytes. 测 is ONE character, however you look at it. In UTF-16 (which is what JavaScript uses), it's two BYTES. In Shift_JIS, 8 bytes, apparently. But in both cases, it's ONE character. So what you are trying to do is limit the text length to 200 BYTES. Since JavaScript is using UTF-16 (UCS-2, really) you can get its byte length by multiplying the string length by 2, but that doesn't help you with Shift_JIS. Then again, you should probably consider switching to Unicode anyway, if you're working with JavaScript...

Handling unicode in the http response xml

I'm writing a Google Chrome extension that builds upon myanimelist.net REST api. Sometimes the XMLHttpRequest response text contains unicode.
For example:
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
If I create a HTML node from the text it looks like this:
Onegai My Melody Sukkiriâ�ª
The actual title, however, is this:
Onegai My Melody Sukkiri♪
Why is my text not correctly rendered and how can I fix it?
Update
Code: background.html
I think these are the crucial parts:
function htmlDecode(input) {
    var e = document.createElement('div');
    e.innerHTML = input;
    return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
function xmlDecode(input) {
    var result = input;
    result = result.replace(/&lt;/g, "<");
    result = result.replace(/&gt;/g, ">");
    result = result.replace(/\n/g, "<br>");
    return htmlDecode(result);
}
Further:
var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
Oh dear! Not only is that the wrong text, it's not even well-formed XML. &acirc; and &ordf; are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.
The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.
This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8 which is the encoding they're actually using. But it was the wrong thing to begin with because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as HTML.
htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII bytes to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails because it doesn't make character references (&#...;) for the characters that don't have a predefined entity name. Pretty useless.
Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
Also they should be returning it as Content-Type: text/xml not text/html as at the moment. Then you could pick up the responseXML directly from the XMLHttpRequest object instead of messing about with DOMParsers.
So, I've come across something similar to what's going on here at work, and I did a bit more research to confirm my hypothesis.
If you take a look at the returned value you posted above, you'll notice the tell-tale entity "&acirc;". 99% of the time when you see this entity, it means you have a character encoding issue (typically UTF-8 characters being interpreted as ISO-8859-1).
The first thing I would test for is to force a character encoding in the API return. (It's a long shot, but you could look)
Second, I'd try to force a character encoding onto the data returned (I know there's a .htaccess override, but I don't know what's allowed in Chrome extensions so you'll have to research that).
What I believe is going on is that when you create the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So, check to make sure it's not your document that's the problem.
Finally, if you can't find the source of the character encoding problem (or can't prevent it), you'll have to write a conversion table to replace the malformed values you're getting with the ones you want { JS' "replace" should be fine (http://www.w3schools.com/jsref/jsref_replace.asp) }.
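As a rough illustration of that last suggestion (the mapping below covers only a few common UTF-8-read-as-Windows-1252 sequences and is an assumption, not a complete table; it also only helps if the mangled bytes survive into the string at all):
// Illustrative mojibake fix-up table: a few UTF-8-decoded-as-Windows-1252
// sequences mapped back to the intended characters. Extend for your own data.
var mojibakeMap = {
    "â™ª": "♪",   // U+266A EIGHTH NOTE
    "â€™": "’",   // U+2019 RIGHT SINGLE QUOTATION MARK
    "Ã©": "é"     // U+00E9 LATIN SMALL LETTER E WITH ACUTE
};

function fixMojibake(s) {
    Object.keys(mojibakeMap).forEach(function (bad) {
        s = s.split(bad).join(mojibakeMap[bad]);
    });
    return s;
}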
You can't just use a simple search and replace to fix encoding issues, since they are Unicode, not characters typed on a keyboard.
Your data must be stored on the server in UTF-8 format if you are planning on retrieving it via AJAX. This problem is probably due to someone pasting in characters from MS-Word which use a completely different encoding scheme (ISO-8859).
If you can't fix the data, you're kinda screwed.
For more details, see: UTF-8 vs. Unicode

javascript json - problem decoding ajax json array from php

I'm using PHP's json_encode() to convert an array to JSON, which is then echoed and read by a JavaScript AJAX request.
The problem is the echoed text has Unicode characters which the JavaScript JSON.parse() function doesn't convert properly.
Example array value is "2\u00000\u00001\u00000\u0000-\u00001\u00000\u0000-\u00000\u00001" which is "2010-10-01".
JSON.parse() only gives me "2".
Anyone help me with this issue?
Example:
var resArray = JSON.parse(this.responseText);
for (var x = 0; x < resArray.length; x++) {
    var twt = resArray[x];
    alert(twt.date);
    break;
}
You have NUL characters (character code zero) in the string. It's actually "2_0_1_0_-_1_0_-_0_1", where _ represents the NUL characters.
The Unicode character escape is actually part of the JSON standard, so the parser should handle that correctly. However, the result is still a string with NUL characters in it, so when you try to use the string in JavaScript the behaviour will depend on what the browser does with the NUL characters.
You can try this in some different browsers:
alert('as\u0000df');
Internet Explorer will display only "as".
Firefox will display "asdf", but the NUL character doesn't display.
The best solution would be to remove the NUL characters before you convert the data to JSON.
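If the server side can't be changed right away, a client-side stopgap (an assumption, not part of the answer above) is to strip the NULs after parsing:
// Stopgap sketch: remove NUL characters from the parsed value on the client.
var cleanedDate = twt.date.replace(/\u0000/g, ''); // "2010-10-01"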
To add to what Guffa said:
When you have alternating zero bytes, what has almost certainly happened is that you've read a UTF-16 data source without converting it to an ASCII-compatible encoding such as UTF-8. Whilst you can throw away the nulls, this will mangle the string if it contains any characters outside of ASCII range. (Not an issue for date strings of course, but it may affect any other strings you're reading from the same source.)
Check where your PHP code is reading the 2010-10-01 string from, and either convert it on the fly using eg iconv('utf-16le', 'utf-8', $string), or change the source to use a more reasonable encoding. If it's a text file, for example, save it in a text editor using ‘UTF-8 without BOM’, and not ‘Unicode’, which is a highly misleading name Windows text editors use to mean UTF-16LE.
