How to remove illegal characters from nodejs buffer? - javascript

I got a base64-encoded string of a CSV file from the frontend. In the backend I am converting the base64 string to binary and then trying to convert it to a JSON object.
var csvDcaData = new Buffer(source, 'base64').toString('binary'); // convert base64 to binary
The problem is that the UI is sending some illegal characters in one of the fields, which are not visible in the plain CSV. The characters "ï»¿" are appended to one of the CSV fields.
I want to remove these kinds of characters from the base64 data, but I am not able to recognize them in the buffer; they only appear after the conversion.
Is it possible in any way to detect these kinds of characters in the buffer?

The source is sending you a message. The message consists of metadata and text. The first few bytes of the message are identifiable as metadata because they are the Byte-Order Mark (BOM) encoded in UTF-8. That strongly suggests that the text is encoded in UTF-8. Nonetheless, to read the text you should find out from the sender which encoding is used.
Yes, the BOM "characters" should be stripped off when wanting to deal only in the text. They are not characters in the sense that they are not part of the text. (Though, if you decode the bytes as UTF-8, it matches the codepoint U+FEFF.)
So, though perhaps esoteric, the message does not contain illegal characters but actually has useful metadata.
Also, given that you are not stripping off the BOM, the fact that you are seeing "ï»¿" instead of "" (U+FEFF ZERO WIDTH NO-BREAK SPACE, which is invisible) means that you are not using UTF-8 to decode the text. That could result in data loss. There is no text but encoded text; you always have to know and use the correct encoding.
Now, source is a JavaScript string (which, by the way, uses the UTF-16 encoding of Unicode). The content of the string is a message encoded in Base64. The message is a sequence of bytes which are the UTF-8 encoding of a BOM and text. You want the text in a JavaScript string. (And the text happens to be some form of CSV. For that, you'll need to know the line ending, delimiter, and text qualifier.) There is a lot for you and the sender to talk about. Perhaps the sender has documented all this.
const stripBom = require('strip-bom'); // npm package

const original = "¡You win one million ₹! Now you can get a real 🚲";
const base64String = Buffer.from("\u{FEFF}" + original, "utf-8").toString("base64");
console.log(base64String);

const decodedString = stripBom(Buffer.from(base64String, "base64").toString("utf-8"));
console.log(decodedString);
console.log(original === decodedString); // true
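If you'd rather not pull in a package, here is a minimal sketch of the same idea (assuming source is the base64 payload sent by the UI): check the raw buffer for the UTF-8 BOM bytes EF BB BF and skip them before decoding as UTF-8.
// Detect and strip a leading UTF-8 BOM directly on the Buffer.
const raw = Buffer.from(source, 'base64');
const hasBom = raw.length >= 3 && raw[0] === 0xEF && raw[1] === 0xBB && raw[2] === 0xBF;
const csvText = (hasBom ? raw.subarray(3) : raw).toString('utf-8');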

Related

base64 string gets written as an invalid image using Node.JS

I have the following code:
var __dirname = '/home/ubuntu/Site/public/uploads/';
var base64Data = '/9j/2wBDABAICAgICBAICAgQEBAQECAYEBAQECggIBggMCgwMDAoMDAwOEhAMDhIODAwQFhASFBQUFBQMEBYYFhQYEhQUFD/2wBDARAQEBAQECgYGChQODA4UFBQUFBQUFBQUA==UFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUP/AABEIADwAUAMBIgACEQEDEQH/xAAfAAABBQEBAQEBAQAAAAAAAAAAAQIDBAUGBwgJCgv/xAC1EAACAQ==AwMCBAMFBQQEAAABfQECAwAEEQUSITFBBhNRYQcicRQygZGhCCNCscEVUtHwJDNicoIJChYXGBkaJSYnKCkqNDU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eA==eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAw==BAUGBwgJCgv/xAC1EQACAQIEBAMEBwUEBAABAncAAQIDEQQFITEGEkFRB2FxEyIygQgUQpGhscEJIzNS8BVictEKFiQ04SXxFxgZGiYnKCkqNTY3ODk6Q0RFRkdISUpTVFVWVw==WFlaY2RlZmdoaWpzdHV2d3h5eoKDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uLj5OXm5+jp6vLz9PX29/j5+v/aAAwDAQACEQ==AxEAPwDmraLzIyTzkYINUby+njmEKtxH1HqeamkvXs7cxoPnPHPas4lyxZsknrUpGjZNJGsn7+2/4EvcUscjKMMKhViDkZqZJJnXaQSPcUwTAznOAKka2cJukxuPY06C3ELedA==oII+6PepA0074EW0E5x/n8qENu5Hb2zn5RkcGmeQ7zbCCMHmtKKMxuFHPBAPv6/zpJIgX4x+P+f85HpTTFYksFEafLxgdv8AP+fwp5k/doxYfLxz7f5J/P1pbYbUbn8cf59/1g==oYpMxMgP3T1/l/LNMogmt4rsHHUn+p/+tVY6W45DZHfj/PvU43xNkHIB/wA/y/WrEFwrDa/p/TH9Ki5NrlEWnlnLLx/F7f55/Kp7dEbnAwf8/wBTWi1vFOPl55/r/wDX/Wqktg==zQsWA6/eGev+fmoHYfbLHLHiQAsDwD74/wDZiPwBpXRcAx5x/D/T8hg/U1HauXmZF6nt/tf/AK2/SppkJTbGPvfd+nX+QX86ABVV2Mijj7q+4/8Arnim4+bcT+J/n/n1PpU6wA==dm0L7Y/p+v8A49TTCVO4n8f1z/X8TQMVflU9vr/n2x+HvWdJI8E5ZD9Rj/Pt+dX3YKMAdPX/AD+FULv7xYVQE5iOOR/niozCycjt/n+lS22oW8y4Zh0/oP8ACrIiikPB7/1NQA==FWG4kh4z/n/IFXkmiuVKsoOT3Hvj+oqBrX5dw9M/oDUW2WE/Ken/ANf/AAoAnayAkE9vJg+/Uf49f19qvaZZG/Mrx4jaJARHI3Xnt+QH4VRS5BHJ9v5/4ile7bAuIcEo3KH+JQ==PUfy/M0ATyh4hg8jGAPUf/q4+v0qB5SwLMR/j/n/ANmqeS8jZN5YYI6/1/kfxNZ9zdITiPgUDCabPGaqyupB5FJLN71VmmJ71SAhjWVT8oNXrO8uoiMgkA1TYlD8pNPjkc8ZpA==QjTXU224x2x+hFObUVY/Mnf+v/16zS7DvTSzY60Fl57xRjDAdP6f4U2K8BDKH49/pWc8jZojkbJ5osFy1Nc+SxMTkgnlaYb3PIFMVRK3zGnyWsUZUjPLYOTTsK5G1wzUxydhNQ==NNDHGDtFRFR5LH0xTE2f/9k=';
var buffer = new Buffer(base64Data, 'base64');
fs.writeFileSync(__dirname + 'zorro.jpg', buffer, 0, buffer.length);
However the saved image is corrupt and won't open in Finder. What am I doing wrong? Am I missing some header? The base64 string opens perfectly fine as inline data with an img tag.
EDIT: this works in HTML for me:
<img src="data:image/jpg;base64,/9j/2wBDABAICAgICBAICAgQEBAQECAYEBAQECggIBggMCgwMDAoMDAwOEhAMDhIODAwQFhASFBQUFBQMEBYYFhQYEhQUFD/2wBDARAQEBAQECgYGChQODA4UFBQUFBQUFBQUA==UFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUP/AABEIADwAUAMBIgACEQEDEQH/xAAfAAABBQEBAQEBAQAAAAAAAAAAAQIDBAUGBwgJCgv/xAC1EAACAQ==AwMCBAMFBQQEAAABfQECAwAEEQUSITFBBhNRYQcicRQygZGhCCNCscEVUtHwJDNicoIJChYXGBkaJSYnKCkqNDU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eA==eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAw==BAUGBwgJCgv/xAC1EQACAQIEBAMEBwUEBAABAncAAQIDEQQFITEGEkFRB2FxEyIygQgUQpGhscEJIzNS8BVictEKFiQ04SXxFxgZGiYnKCkqNTY3ODk6Q0RFRkdISUpTVFVWVw==WFlaY2RlZmdoaWpzdHV2d3h5eoKDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uLj5OXm5+jp6vLz9PX29/j5+v/aAAwDAQACEQ==AxEAPwDmraLzIyTzkYINUby+njmEKtxH1HqeamkvXs7cxoPnPHPas4lyxZsknrUpGjZNJGsn7+2/4EvcUscjKMMKhViDkZqZJJnXaQSPcUwTAznOAKka2cJukxuPY06C3ELedA==oII+6PepA0074EW0E5x/n8qENu5Hb2zn5RkcGmeQ7zbCCMHmtKKMxuFHPBAPv6/zpJIgX4x+P+f85HpTTFYksFEafLxgdv8AP+fwp5k/doxYfLxz7f5J/P1pbYbUbn8cf59/1g==oYpMxMgP3T1/l/LNMogmt4rsHHUn+p/+tVY6W45DZHfj/PvU43xNkHIB/wA/y/WrEFwrDa/p/TH9Ki5NrlEWnlnLLx/F7f55/Kp7dEbnAwf8/wBTWi1vFOPl55/r/wDX/Wqktg==zQsWA6/eGev+fmoHYfbLHLHiQAsDwD74/wDZiPwBpXRcAx5x/D/T8hg/U1HauXmZF6nt/tf/AK2/SppkJTbGPvfd+nX+QX86ABVV2Mijj7q+4/8Arnim4+bcT+J/n/n1PpU6wA==dm0L7Y/p+v8A49TTCVO4n8f1z/X8TQMVflU9vr/n2x+HvWdJI8E5ZD9Rj/Pt+dX3YKMAdPX/AD+FULv7xYVQE5iOOR/niozCycjt/n+lS22oW8y4Zh0/oP8ACrIiikPB7/1NQA==FWG4kh4z/n/IFXkmiuVKsoOT3Hvj+oqBrX5dw9M/oDUW2WE/Ken/ANf/AAoAnayAkE9vJg+/Uf49f19qvaZZG/Mrx4jaJARHI3Xnt+QH4VRS5BHJ9v5/4ile7bAuIcEo3KH+JQ==PUfy/M0ATyh4hg8jGAPUf/q4+v0qB5SwLMR/j/n/ANmqeS8jZN5YYI6/1/kfxNZ9zdITiPgUDCabPGaqyupB5FJLN71VmmJ71SAhjWVT8oNXrO8uoiMgkA1TYlD8pNPjkc8ZpA==QjTXU224x2x+hFObUVY/Mnf+v/16zS7DvTSzY60Fl57xRjDAdP6f4U2K8BDKH49/pWc8jZojkbJ5osFy1Nc+SxMTkgnlaYb3PIFMVRK3zGnyWsUZUjPLYOTTsK5G1wzUxydhNQ==NNDHGDtFRFR5LH0xTE2f/9k="/>
I re-encoded your string into proper base64
9j/2wBDABAICAgICBAICAgQEBAQECAYEBAQECggIBggMCgwMDAoMDAwOEhAMDhIODAwQFhASFBQ
UFBQMEBYYFhQYEhQUFD/2wBDARAQEBAQECgYGChQODA4UFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQ
UFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFD/wAARCAA8AFADASIAAhEBAxEB/8QAHwAAAQUBAQEB
AQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1Fh
ByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZ
WmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXG
x8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAEC
AwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHB
CSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0
dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX
2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwDmraLzIyTzkYINUby+njmEKtxH1Hqe
amkvXs7cxoPnPHPas4lyxZsknrUpGjZNJGsn7+2/4EvcUscjKMMKhViDkZqZJJnXaQSPcUwTAznO
AKka2cJukxuPY06C3ELedKCCPuj3qQNNO+BFtBOcf5/KhDbuR29s5+UZHBpnkO82wgjB5rSijMbh
RzwQD7+v86SSIF+Mfj/n/OR6U0xWJLBRGny8YHb/AD/n8KeZP3aMWHy8c+3+Sfz9aW2G1G5/HH+f
f9ahikzEyA/dPX+X8s0yiCa3iuwcdSf6n/61VjpbjkNkd+P8+9TjfE2QcgH/AD/L9asQXCsNr+n9
Mf0qLk2uURaeWcsvH8Xt/nn8qnt0RucDB/z/AFNaLW8U4+Xnn+v/ANf9aqS2zQsWA6/eGev+fmoH
YfbLHLHiQAsDwD74/wDZiPwBpXRcAx5x/D/T8hg/U1HauXmZF6nt/tf/AK2/SppkJTbGPvfd+nX+
QX86ABVV2Mijj7q+4/8Arnim4+bcT+J/n/n1PpU6wHZtC+2P6fr/AOPU0wlTuJ/H9c/1/E0DFX5V
Pb6/59sfh71nSSPBOWQ/UY/z7fnV92CjAHT1/wA/hVC7+8WFUBOYjjkf54qMwsnI7f5/pUttqFvM
uGYdP6D/AAqyIopDwe/9TUAVYbiSHjP+f8gVeSaK5Uqyg5Pce+P6ioGtfl3D0z+gNRbZYT8p6f8A
1/8ACgCdrICQT28mD79R/j1/X2q9plkb8yvHiNokBEcjdee35AfhVFLkEcn2/n/iKV7tsC4hwSjc
of4lPUfy/M0ATyh4hg8jGAPUf/q4+v0qB5SwLMR/j/n/ANmqeS8jZN5YYI6/1/kfxNZ9zdITiPgU
DCabPGaqyupB5FJLN71VmmJ71SAhjWVT8oNXrO8uoiMgkA1TYlD8pNPjkc8ZpEI011NtuMdsfoRT
m1FWPzJ3/r/9es0uw700s2OtBZee8UYwwHT+n+FNivAQyh+Pf6VnPI2aI5GyeaLBctTXPksTE5IJ
5WmG9zyBTFUSt8xp8lrFGVIzy2Dk07CuRtcM1McnYTU00McYO0VEVHksfTFMTZ//2Q==
As to why the base64-encoded data works on my end, RFC 4648 for base64 states this:
Furthermore, such specifications MAY ignore the pad character, "=",
treating it as non-alphabet data, if it is present before the end
of the encoded data. If more than the allowed number of pad
characters is found at the end of the string (e.g., a base 64
string terminated with "==="), the excess pad characters MAY also be
ignored.
Some implementations will ignore the extra "=" characters and some will not.
EDIT:
As others have pointed out, your base64 string seems to be many base64 strings concatenated together. Here is your string:
/9j/2wBDABAICAgICBAICAgQEBAQECAYEBAQECggIBggMCgwMDAoMDAwOEhAMDhIODAwQFhASFBQUFBQMEBYYFhQYEhQUFD/2wBDARAQEBAQECgYGChQODA4UFBQUFBQUFBQUA==
UFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUP/AABEIADwAUAMBIgACEQEDEQH/xAAfAAABBQEBAQEBAQAAAAAAAAAAAQIDBAUGBwgJCgv/xAC1EAACAQ==
AwMCBAMFBQQEAAABfQECAwAEEQUSITFBBhNRYQcicRQygZGhCCNCscEVUtHwJDNicoIJChYXGBkaJSYnKCkqNDU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eA==
eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAw==
BAUGBwgJCgv/xAC1EQACAQIEBAMEBwUEBAABAncAAQIDEQQFITEGEkFRB2FxEyIygQgUQpGhscEJIzNS8BVictEKFiQ04SXxFxgZGiYnKCkqNTY3ODk6Q0RFRkdISUpTVFVWVw==
WFlaY2RlZmdoaWpzdHV2d3h5eoKDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uLj5OXm5+jp6vLz9PX29/j5+v/aAAwDAQACEQ==
AxEAPwDmraLzIyTzkYINUby+njmEKtxH1HqeamkvXs7cxoPnPHPas4lyxZsknrUpGjZNJGsn7+2/4EvcUscjKMMKhViDkZqZJJnXaQSPcUwTAznOAKka2cJukxuPY06C3ELedA==
oII+6PepA0074EW0E5x/n8qENu5Hb2zn5RkcGmeQ7zbCCMHmtKKMxuFHPBAPv6/zpJIgX4x+P+f85HpTTFYksFEafLxgdv8AP+fwp5k/doxYfLxz7f5J/P1pbYbUbn8cf59/1g==
oYpMxMgP3T1/l/LNMogmt4rsHHUn+p/+tVY6W45DZHfj/PvU43xNkHIB/wA/y/WrEFwrDa/p/TH9Ki5NrlEWnlnLLx/F7f55/Kp7dEbnAwf8/wBTWi1vFOPl55/r/wDX/Wqktg==
zQsWA6/eGev+fmoHYfbLHLHiQAsDwD74/wDZiPwBpXRcAx5x/D/T8hg/U1HauXmZF6nt/tf/AK2/SppkJTbGPvfd+nX+QX86ABVV2Mijj7q+4/8Arnim4+bcT+J/n/n1PpU6wA==
dm0L7Y/p+v8A49TTCVO4n8f1z/X8TQMVflU9vr/n2x+HvWdJI8E5ZD9Rj/Pt+dX3YKMAdPX/AD+FULv7xYVQE5iOOR/niozCycjt/n+lS22oW8y4Zh0/oP8ACrIiikPB7/1NQA==
FWG4kh4z/n/IFXkmiuVKsoOT3Hvj+oqBrX5dw9M/oDUW2WE/Ken/ANf/AAoAnayAkE9vJg+/Uf49f19qvaZZG/Mrx4jaJARHI3Xnt+QH4VRS5BHJ9v5/4ile7bAuIcEo3KH+JQ==
PUfy/M0ATyh4hg8jGAPUf/q4+v0qB5SwLMR/j/n/ANmqeS8jZN5YYI6/1/kfxNZ9zdITiPgUDCabPGaqyupB5FJLN71VmmJ71SAhjWVT8oNXrO8uoiMgkA1TYlD8pNPjkc8ZpA==
QjTXU224x2x+hFObUVY/Mnf+v/16zS7DvTSzY60Fl57xRjDAdP6f4U2K8BDKH49/pWc8jZojkbJ5osFy1Nc+SxMTkgnlaYb3PIFMVRK3zGnyWsUZUjPLYOTTsK5G1wzUxydhNQ==
NNDHGDtFRFR5LH0xTE2f/9k=
Notice how each base64 string is 136 characters long. If you decode each of these base64 strings and append the decoded results into a single file, you will get your image.
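A hedged sketch of that approach, assuming the individual 136-character chunks are available as an array of strings (for example, collected before they were concatenated on the client):
const fs = require('fs');

// Each entry is one complete, correctly padded base64 chunk.
const chunks = [
  '/9j/2wBDABAICAgICBAICAgQEBAQECAYEBAQ...', // first chunk, truncated here
  // ...remaining chunks...
];

// Decode every chunk on its own, then join the raw bytes into one image file.
const buffer = Buffer.concat(chunks.map(function (chunk) {
  return Buffer.from(chunk, 'base64');
}));
fs.writeFileSync('zorro.jpg', buffer);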

Adding UTF-8 BOM to string/Blob

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.
Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See the discussion between Jeff Fischer and Casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of whether UTF-8 or UTF-16 is being used.
See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page
The endian order entry for UTF-8 in Table 2-4 is marked N/A because
UTF-8 code units are 8 bits in size, and the usual machine issues of
endian order for larger code units do not apply. The serialized order
of the bytes must not depart from the order defined by the UTF-8
encoding form. Use of a BOM is neither required nor recommended for
UTF-8, but may be encountered in contexts where UTF-8 data is
converted from other encoding forms that use a BOM or where the BOM is
used as a UTF-8 signature.
I had the same issue and this is the solution I came up with:
var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
    "Text",
    ... // Remaining data
  ],
  { type: "text/plain;charset=utf-8" });
Using Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.
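For completeness, a small sketch (browser environment assumed; the file name is a made-up example) of handing that BOM-prefixed Blob to the user as a download via a temporary object URL:
const url = URL.createObjectURL(blob);
const link = document.createElement("a");
link.href = url;
link.download = "data.txt"; // hypothetical file name
document.body.appendChild(link);
link.click();
link.remove();
URL.revokeObjectURL(url);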
I'm editing my original answer. The above answer really demands elaboration as this is a convoluted solution by Node.js.
The short answer is, yes, this code works.
The long answer is, no, FEFF is not the byte order mark for UTF-8 (that is the byte sequence EF BB BF). Apparently Node took some sort of shortcut for writing encodings within files. U+FEFF is the BOM code point, which serialized as UTF-16 little-endian becomes the bytes FF FE, as can be seen within the Byte Order Mark Wikipedia article and can also be viewed within a binary text editor after having written the file. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.js uses \ufeff to stand in for the BOM of any target encoding. It takes the \ufeff marker and converts it into the correct byte order mark based on the third options parameter of writeFile. The third parameter is where you pass the encoding string; Node.js takes this encoding and converts the fixed \ufeff code point into the actual byte order mark for that encoding.
UTF-8 Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function (err) {
  /* The actual byte order mark written to the file is EF BB BF */
});
UTF-16 Little Endian Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function (err) {
  /* The actual byte order mark written to the file is FF FE */
});
So, as you can see, the \ufeff is simply a marker that stands for whichever byte order mark results. The actual bytes that make it into the file depend directly on the encoding option specified; the marker used within the string is really irrelevant to what gets written to the file.
I suspect the reasoning behind this is that they chose not to write byte order marks automatically, and the three-byte mark for UTF-8 isn't easily encoded into the JavaScript string to be written to disk. So they used the BOM code point as a placeholder within the string, which gets substituted at write time.
This is my solution:
var blob = new Blob(["\uFEFF" + csv], {
  type: 'text/csv; charset=utf-8'
});
This works for me:
let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });
A BOM (Byte Order Mark) might be necessary because some programs need it to pick the correct character encoding.
Example:
When MS Excel opens a CSV file without a BOM on a system whose default character encoding is Shift_JIS instead of UTF-8, it opens the file in that default encoding, which results in garbage characters. If you include the UTF-8 BOM, Excel detects UTF-8 and the file displays correctly.
This fixes it for me. I was getting a BOM with the Authorize.net API and Cloudflare Workers:
const data = JSON.parse((await res.text()).trim());

Character encoding from UTF8 JSON to ISO-8859-1

Using getJSON to retrieve some data which I am UTF-8 encoding on the server side...
"title":"new movie \u0091The Tree of Life\u0092 on day 6"
The page it is displayed on is charset ISO-8859-1 and I am doing this...
$.getJSON('index.php', { q: q }, function (data) {
  for (var i = 0; i < data.length; i++) {
    alert(data[i].title + "\n" + utf8_decode(data[i].title));
  }
});
The utf8_decode function comes from here.
The problem is that I am still seeing the magic squares for both versions...
new movie The Tree of Life on day 6
new movie ᔨe Tree of Life⠯n day 6
This leads me to believe that perhaps the character is in neither encoding. However, it works if I paste the string onto a page and set the charset to either UTF-8 or ISO-8859-1 :-/
Any help would be great!
There is no need to escape or decode any characters in data transmitted in JSON. It's done automatically. It is also independent of the page's encoding. You can easily transmit and display the euro sign (\u20ac) with your code even though ISO-8859-1 does not contain the euro sign.
Your problem is the characters \u0091 and \u0092. They aren't proper text characters; they are control codes reserved for private use.
It rather looks as if you in fact have data that originally used the Windows-1250 character set but was not properly translated to Unicode/JSON. In Windows-1250, these two characters are typographic single quotes.
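If the upstream data cannot be fixed, a small client-side sketch (the mapping below is an illustrative assumption) that turns those control characters back into the typographic quotes they were meant to be:
// Map the stray C1 control characters U+0091/U+0092 to proper curly quotes.
function fixSmartQuotes(s) {
  return s.replace(/\u0091/g, '\u2018')  // left single quotation mark
          .replace(/\u0092/g, '\u2019'); // right single quotation mark
}

fixSmartQuotes("new movie \u0091The Tree of Life\u0092 on day 6");
// "new movie 'The Tree of Life' on day 6" (with curly quotes)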
Did you try without utf8_decode?
If the characters in your string exist in ISO-8859-1, this will just work, as JavaScript decodes the \u0091 according to the encoding of the page.

Handling unicode in the http response xml

I'm writing a Google Chrome extension that builds upon the myanimelist.net REST API. Sometimes the XMLHttpRequest response text contains Unicode.
For example:
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
If I create a HTML node from the text it looks like this:
Onegai My Melody Sukkiriâ�ª
The actual title, however, is this:
Onegai My Melody Sukkiri♪
Why is my text not correctly rendered and how can I fix it?
Update
Code: background.html
I think these are the crucial parts:
function htmlDecode(input) {
  var e = document.createElement('div');
  e.innerHTML = input;
  return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}

function xmlDecode(input) {
  var result = input;
  result = result.replace(/&lt;/g, "<");
  result = result.replace(/&gt;/g, ">");
  result = result.replace(/\n/g, "<br>"); // replacement string was garbled in the source; "<br>" is assumed
  return htmlDecode(result);
}
Further:
var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");
<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>
Oh dear! Not only is that the wrong text, it's not even well-formed XML. acirc and ordf are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.
The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.
This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8 which is the encoding they're actually using. But it was the wrong thing to begin with because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as HTML.
htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII bytes to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails, because it doesn't produce numeric character references (&#...;) for the characters that don't have predefined entity names. Pretty useless.
Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
Also they should be returning it as Content-Type: text/xml not text/html as at the moment. Then you could pick up the responseXML directly from the XMLHttpRequest object instead of messing about with DOMParsers.
So, I've come across something similar to what's going on here at work, and I did a bit more research to confirm my hypothesis.
If you take a look at the returned value you posted above, you'll notice the tell-tale entity "â". 99% of the time when you see this entity, it means you have a character encoding issue (typically UTF-8 bytes being interpreted as ISO-8859-1).
The first thing I would test for is to force a character encoding in the API return. (It's a long shot, but you could look)
Second, I'd try to force a character encoding onto the data returned (I know there's a .htaccess override, but I don't know what's allowed in Chrome extensions so you'll have to research that).
What I believe is going on is that when you create the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So, check to make sure it's not your document that's the problem.
Finally, if you can't find (or can't prevent) the source of the character encoding problem, you'll have to write a conversion table to replace the malformed values you're getting with the ones you want; JavaScript's "replace" should be fine (http://www.w3schools.com/jsref/jsref_replace.asp).
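A sketch of that conversion-table idea (the single mapping below is an illustrative assumption; extend it as needed):
// Replace the malformed sequence the API ends up yielding for "♪" with the
// intended character: "â" + replacement char + "ª" -> ♪.
var replacements = [
  [/\u00e2\ufffd\u00aa/g, '\u266a']
];

function repairTitle(text) {
  return replacements.reduce(function (s, pair) {
    return s.replace(pair[0], pair[1]);
  }, text);
}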
You can't just use a simple search and replace to fix encoding issues, since the offending characters are Unicode, not characters typed on a keyboard.
Your data must be stored on the server in UTF-8 format if you are planning on retrieving it via AJAX. This problem is probably due to someone pasting in characters from MS Word, which uses a completely different encoding scheme (ISO-8859).
If you can't fix the data, you're kinda screwed.
For more details, see: UTF-8 vs. Unicode

javascript json - problem decoding ajax json array from php

I'm using PHP's json_encode() to convert an array to JSON, which is then echoed and read by a JavaScript AJAX request.
The problem is that the echoed text has Unicode escapes which the JavaScript JSON.parse() function doesn't seem to handle.
Example array value is "2\u00000\u00001\u00000\u0000-\u00001\u00000\u0000-\u00000\u00001" which is "2010-10-01".
JSON.parse() only gives me "2".
Can anyone help me with this issue?
Example:
var resArray = JSON.parse(this.responseText);
for (var x = 0; x < resArray.length; x++) {
  var twt = resArray[x];
  alert(twt.date);
  break;
}
You have NUL characters (character code zero) in the string. It's actually "2_0_1_0_-_1_0_-_0_1", where _ represents the NUL characters.
The Unicode character escape is actually part of the JSON standard, so the parser should handle that correctly. However, the result is still a string with NUL characters in it, so when you try to use the string in JavaScript the behaviour will depend on what the browser does with the NUL characters.
You can try this in some different browsers:
alert('as\u0000df');
Internet Explorer will display only "as".
Firefox will display "asdf", but the NUL character doesn't display.
The best solution would be to remove the NUL characters before you convert the data to JSON.
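If the server output can't be changed, a minimal client-side workaround sketch (not the recommended fix) is to strip the NULs after parsing:
var resArray = JSON.parse(this.responseText);
// Remove the embedded NUL characters from each value before using it.
var date = resArray[0].date.replace(/\u0000/g, ''); // "2010-10-01"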
To add to what Guffa said:
When you have alternating zero bytes, what has almost certainly happened is that you've read a UTF-16 data source without converting it to an ASCII-compatible encoding such as UTF-8. Whilst you can throw away the nulls, this will mangle the string if it contains any characters outside of ASCII range. (Not an issue for date strings of course, but it may affect any other strings you're reading from the same source.)
Check where your PHP code is reading the 2010-10-01 string from, and either convert it on the fly using eg iconv('utf-16le', 'utf-8', $string), or change the source to use a more reasonable encoding. If it's a text file, for example, save it in a text editor using ‘UTF-8 without BOM’, and not ‘Unicode’, which is a highly misleading name Windows text editors use to mean UTF-16LE.
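To see the point in Node terms (a sketch; the string and encodings are just for demonstration): the same bytes read with a single-byte encoding show the alternating NULs, while decoding them as UTF-16LE recovers the text.
// UTF-16LE bytes for "2010-10-01": 32 00 30 00 31 00 30 00 2d 00 ...
const bytes = Buffer.from('2010-10-01', 'utf16le');

console.log(bytes.toString('latin1'));  // "2\u00000\u00001\u00000..." (a NUL after every digit)
console.log(bytes.toString('utf16le')); // "2010-10-01"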
