Parse UTF-8 XML in javascript

I'm trying to load and parse a simple UTF-8-encoded XML file in JavaScript using Node and the xpath and xmldom packages. There are no XML namespaces in use, and the same XML parses fine when converted to ASCII. In the VS Code debugger I can see that the string has embedded spaces between each character (surely due to loading the UTF-8 file incorrectly), but I can't find a way to load and parse the UTF-8 file properly.
Code:
var xpath = require('xpath')
, dom = require('xmldom').DOMParser;
const fs = require('fs');
var myXml = "path_to_my_file.xml";
var xmlContents = fs.readFileSync(myXml, 'utf8').toString();
// this line causes errors parsing every single tag as the tag names have spaces in them from improper utf-8 decoding
var doc = new dom().parseFromString(xmlContents, 'application/xml');
var cvNode = xpath.select1("//MyTag", doc);
console.log(cvNode.textContent);
The code works fine if the file is ASCII (textContent has the proper data), but if it is UTF-8 then there are a number of parsing errors and cvNode is undefined.
Is there a proper way to parse UTF-8 XML in node/javascript? I can't for the life of me find a decent example.

When you see additional white space between each letter, it suggests that the file isn't actually encoded as UTF-8 but uses a 16-bit Unicode encoding (UTF-16).
Try 'utf16le' instead of 'utf8'.
For a list of supported encodings see Buffers and Character Encodings.
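For example, a minimal sketch of the snippet above with the encoding swapped to UTF-16 little-endian (assuming that is what the file actually is) would be:
var xpath = require('xpath')
    , dom = require('xmldom').DOMParser;
const fs = require('fs');

// Read as UTF-16 LE instead of UTF-8 and strip a leading BOM character if present
var xmlContents = fs.readFileSync('path_to_my_file.xml', 'utf16le').replace(/^\uFEFF/, '');
var doc = new dom().parseFromString(xmlContents, 'application/xml');
var cvNode = xpath.select1("//MyTag", doc);
console.log(cvNode.textContent);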

Related

Reading a Windows-1252 file in Node JS

I'm using node to read a text document using readFile and within that document is a character
�
This is a Windows-1252 character, but it is being converted to UTF-8 automatically in JavaScript. The correct character should actually display as Å.
Is there a way I can convert this character from utf-8 to windows-1252 to render the correct character?
The file is being read with Node's readFile method as utf-8, because the necessary encoding isn't supported.
fs.readFile(`${logDirectory}myText.txt`, "utf-8", (err, text) => { ... });
I've tried a few options such as iconv-lite and legacy-decode but neither seem to return the correct result.
Any guidance appreciated.
You can try reading the file with the latin1-encoding as Windows-1252 is based on that:
fs.readFile(`${logDirectory}myText.txt`, 'latin1', (err, text) => { ... });
Also note that in NodeJS the utf-8 encoding is called utf8 instead of utf-8 as described here.
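If latin1 turns out to be not close enough (Windows-1252 and ISO-8859-1 differ in the 0x80-0x9F range), a sketch using iconv-lite to decode the raw bytes explicitly as Windows-1252 could look like this (it reuses the logDirectory variable from the question):
const fs = require('fs');
const iconv = require('iconv-lite');

fs.readFile(`${logDirectory}myText.txt`, (err, buffer) => {
    if (err) throw err;
    // Decode the raw bytes explicitly as Windows-1252
    const text = iconv.decode(buffer, 'win1252');
    console.log(text); // byte 0xC5 now renders as Å
});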

JavaScript atob is different than Notepad++ Base64 Decode

I am receiving the content of a zip file (from an API) as a Base64-encoded string.
If I paste that string into Notepad++ and go
Plugins > MIME Tools > Base64 Decode
and save it as test.zip, it becomes a valid zip file that I can open.
Now, I am trying to achieve the same thing in JavaScript.
I have tried atob(), and probably everything mentioned in the answers here and the code from the Mozilla docs.
atob produces similar content, but some characters are decoded differently (hence it becomes an invalid zip file). The other methods throw an invalid URI error.
How can I reproduce Notepad++ behaviour in JavaScript?
window.atob is only good for decoding data which fits in a UTF-8 string. Anything which cannot be represented in a UTF-8 string will not be equal to its binary form once decoded; at most, JavaScript will try to encode the resulting bytes into a UTF-8 character sequence. This is the reason why your zip archive ends up invalid.
The moment you do the following:
var data = window.atob(encoded_data)
... you get a different representation of your data in a UTF-8 string, referenced by the variable data.
You should decode your binary data directly to an ArrayBuffer. And window.atob is not a good fit for this.
Here is a function which can convert base64 encoded data directly in to an ArrayBuffer.
As mentioned, do not use atob directly for decoding Base64 encoded zip files. You can use this function mentioned in https://stackoverflow.com/a/21797381/3508516 instead.
function _base64ToArrayBuffer(base64) {
    // Decode the Base64 string into a "binary string" (one character per byte)
    var binary_string = window.atob(base64);
    var len = binary_string.length;
    var bytes = new Uint8Array(len);
    for (var i = 0; i < len; i++) {
        // Each char code is a byte value (0-255), so copy it straight into the typed array
        bytes[i] = binary_string.charCodeAt(i);
    }
    return bytes.buffer;
}
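For example, a minimal sketch of turning the decoded bytes back into a downloadable zip (encoded_data stands for the Base64 string from the API and is an assumption here):
var buffer = _base64ToArrayBuffer(encoded_data);
var blob = new Blob([buffer], { type: 'application/zip' });
var a = document.createElement('a');
a.href = URL.createObjectURL(blob);
a.download = 'test.zip';
a.click();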

Javascript export CSV encoding utf-8 issue

I need to export a JavaScript array to a CSV file and download it. It works, but the characters 'ı,ü,ö,ğ,ş' come out as 'ı ü ö ÄŸ ÅŸ' in the CSV file. I have tried many solutions recommended on this site, but none of them worked for me.
I've added my code snippet; can anyone solve this problem?
var csvString = 'ı,ü,ö,ğ,ş';
var a = window.document.createElement('a');
a.setAttribute('href', 'data:text/csv; charset=utf-8,' + encodeURIComponent(csvString));
a.setAttribute('download', 'example.csv');
a.click();
This depends on what program is opening the example.csv file. In a text editor the encoding will be UTF-8 and the characters will not be malformed. But Excel's default encoding for CSV is ANSI, not UTF-8. So without forcing Excel to use UTF-8 as the encoding, the characters will be malformed.
Excel can be forced to use UTF-8 for CSV by putting a BOM (Byte Order Mark) as the first characters in the file. The BOM for UTF-8 is the byte sequence 0xEF,0xBB,0xBF. So one could think that simply putting "\xEF\xBB\xBF" as the first bytes of the string would be the solution. But surely that would be too simple, wouldn't it? ;-) The problem with this is how to stop JavaScript from treating those bytes as characters. The "solution" is to use the "universal BOM" "\uFEFF" as mentioned in Special Characters (JavaScript).
Example:
var csvString = 'ı,ü,ü,ğ,ş,#Hashtag,ä,ö';
var universalBOM = "\uFEFF";
var a = window.document.createElement('a');
a.setAttribute('href', 'data:text/csv; charset=utf-8,' + encodeURIComponent(universalBOM+csvString));
a.setAttribute('download', 'example.csv');
window.document.body.appendChild(a);
a.click();
See also Adding UTF-8 BOM to string/Blob.
Using this, the encoding will be correct. Nevertheless, this only works properly if comma is the default list separator in your Windows locale settings. If instead, for example, semicolon is the default list separator, then all content ends up in the first column instead of being split by comma, and you have to use semicolon as the delimiter in the CSV as well. But that is another problem, and it leads to the conclusion not to use CSV at all but libraries which can directly create Excel files (*.xls or *.xlsx).
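If you do stay with CSV and semicolon happens to be the list separator, a hedged sketch is to join the fields with semicolons instead of commas (the row data below is made up for illustration):
var rows = [
    ['Name', 'City'],
    ['Åsa', 'Göteborg']
];
// Join each row with the locale's list separator (assumed here to be ';')
var csvString = rows.map(function (row) { return row.join(';'); }).join('\r\n');
var universalBOM = "\uFEFF";
var a = window.document.createElement('a');
a.setAttribute('href', 'data:text/csv; charset=utf-8,' + encodeURIComponent(universalBOM + csvString));
a.setAttribute('download', 'example.csv');
window.document.body.appendChild(a);
a.click();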

Javascript FileReader readAsText function not understanding utf-8 encoding characters like ä and ö

I have searched a lot and nothing has helped me. I have an import-from-CSV feature, and the JavaScript code reads the CSV content line by line. Characters like ä and ö are simply not recognized. FileReader readAsText uses UTF-8 as its default encoding, but for some reason that is not working in this case. Here is my code.
reader = new FileReader()
reader.onload = (e) =>
  result = e.target.result
  console.log result
  # file content
  fileContent = result.split("\r")
reader.readAsText(e.target.files.item(0))
I have tried defining encoding like below and whatever I put there couldn't help me.
encoding = "UTF-8"
reader.readAsText(e.target.files.item(0), encoding)
I got this to work by using ISO Latin 4 encoding.
reader.readAsText(e.target.files.item(0), 'ISO-8859-4');
That should work for you, but remember to use this particular encoding just for these Scandinavian characters.
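For context, a minimal sketch wiring this up to a file input (the element id and the splitting logic are assumptions, not from the original code):
// <input type="file" id="csvInput">
document.getElementById('csvInput').addEventListener('change', function (e) {
    var reader = new FileReader();
    reader.onload = function (ev) {
        var lines = ev.target.result.split('\r\n');
        console.log(lines);
    };
    // Explicitly decode as ISO-8859-4 (Latin-4)
    reader.readAsText(e.target.files.item(0), 'ISO-8859-4');
});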

Adding UTF-8 BOM to string/Blob

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields '"my data"', of course.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.
Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See the discussion between @jeff-fischer and @casey for details on UTF-8 vs UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of whether UTF-8 or UTF-16 is used.
See p. 36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page:
The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
I had the same issue and this is the solution I came up with:
var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
    "Text",
    ... // Remaining data
], { type: "text/plain;charset=utf-8" });
Using a Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.
I'm editing my original answer. The answer above really demands elaboration, as this is a convoluted solution by Node.js.
The short answer is, yes, this code works.
The long answer is, no, FEFF is not the byte order mark for UTF-8. Apparently Node took some sort of shortcut for writing encodings within files. FF FE is the UTF-16 little-endian byte order mark, as can be seen in the Byte Order Mark Wikipedia article and also in a hex editor after the file has been written. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.js uses \ufeff to signify the BOM for any encoding. It takes the \ufeff marker and converts it into the correct byte order mark based on the third options parameter of writeFile, which is where you pass the encoding string. Node.js takes that encoding and turns the fixed \ufeff character into the actual byte order mark for that encoding.
UTF-8 Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
    /* The actual byte order mark written to the file is EF BB BF */
});
UTF-16 Little Endian Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
    /* The actual byte order mark written to the file is FF FE */
});
So, as you can see, the \ufeff is simply a marker for the BOM, whatever the resulting encoding. The actual encoding that makes it into the file depends directly on the encoding option specified; the marker used within the string is irrelevant to what gets written to the file.
I suspect the reasoning behind this is that they chose not to write byte order marks by default, and the 3-byte mark for UTF-8 isn't easily encoded into the JavaScript string to be written to disk. So they used the UTF-16 BOM character as a placeholder mark within the string, which gets substituted at write time.
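One way to convince yourself of this is to write the same string with both encodings and inspect the first bytes of each file (a small sketch; the file names are arbitrary):
const fs = require('fs');

fs.writeFileSync('bom-utf8.txt', '\ufeff' + 'hello', { encoding: 'utf8' });
fs.writeFileSync('bom-utf16le.txt', '\ufeff' + 'hello', { encoding: 'utf16le' });

// Prints <Buffer ef bb bf> (the UTF-8 BOM)
console.log(fs.readFileSync('bom-utf8.txt').slice(0, 3));
// Prints <Buffer ff fe> (the UTF-16 LE BOM)
console.log(fs.readFileSync('bom-utf16le.txt').slice(0, 2));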
This is my solution:
var blob = new Blob(["\uFEFF" + csv], {
    type: 'text/csv; charset=utf-8'
});
This works for me:
let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });
A BOM (Byte Order Mark) might be necessary because some programs need it to pick the correct character encoding.
Example:
When a CSV file without a BOM is opened in MS Excel on a system whose default character encoding is Shift_JIS instead of UTF-8, Excel opens it in that default encoding and the result is garbage characters. Specifying the UTF-8 BOM fixes it.
This fixed it for me. I was getting a BOM with the authorize.net API and Cloudflare Workers:
const data = JSON.parse((await res.text()).trim());
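If the trim is not enough, a variant that strips a leading BOM character explicitly before parsing (a small sketch under the same setup) is:
const raw = await res.text();
// Remove a leading U+FEFF (BOM) if the API sends one, then parse
const data = JSON.parse(raw.replace(/^\uFEFF/, ''));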
