Reading a Windows-1252 file in Node JS

I'm using node to read a text document with readFile, and within that document is this character:
�
This is a Windows-1252 character, but it is being decoded in javascript as utf-8 automatically, which mangles it. The correct character should actually display as Å.
Is there a way I can convert this character from utf-8 to windows-1252 to render the correct character?
The file is being read using Node's readFile method as utf-8, due to the lack of built-in support for the necessary encoding.
fs.readFile(`${logDirectory}myText.txt`, "utf-8", (err, text) => { ... });
I've tried a few options such as iconv-lite and legacy-decode but neither seem to return the correct result.
Any guidance appreciated.

You can try reading the file with the latin1 encoding, as Windows-1252 is based on it:
fs.readFile(`${logDirectory}myText.txt`, 'latin1', (err, text) => { ... });
Also note that in Node.js the utf-8 encoding is named utf8 rather than utf-8, as described here.
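That said, latin1 maps the 0x80-0x9F range to control characters, where Windows-1252 has printable characters such as €, so the two can still differ. A minimal sketch using iconv-lite (assuming the package is installed via npm install iconv-lite) that decodes the raw bytes explicitly:

const fs = require('fs');
const iconv = require('iconv-lite');

fs.readFile(`${logDirectory}myText.txt`, (err, buffer) => {
    if (err) throw err;
    // No encoding argument, so readFile yields a raw Buffer;
    // decode it explicitly as Windows-1252
    const text = iconv.decode(buffer, 'win1252');
    console.log(text); // the 0xC5 byte now renders as Å
});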

Related

Parse UTF-8 XML in javascript

I'm trying to load and parse a simple utf-8-encoded XML file in javascript using node and the xpath and xmldom packages. There are no XML namespaces used, and the same XML parses fine when converted to ASCII. I can see in the VS Code debugger that the string has spaces embedded between each character (surely due to loading the utf-8 file incorrectly), but I can't find a way to properly load and parse the utf-8 file.
Code:
var xpath = require('xpath'),
    dom = require('xmldom').DOMParser;
const fs = require('fs');
var myXml = "path_to_my_file.xml";
var xmlContents = fs.readFileSync(myXml, 'utf8').toString();
// this line causes errors parsing every single tag, as the tag names
// have spaces in them from improper utf-8 decoding
var doc = new dom().parseFromString(xmlContents, 'application/xml');
var cvNode = xpath.select1("//MyTag", doc);
console.log(cvNode.textContent);
The code works fine if the file is ASCII (textContent has the proper data), but if it is UTF-8 then there are a number of parsing errors and cvNode is undefined.
Is there a proper way to parse UTF-8 XML in node/javascript? I can't for the life of me find a decent example.
When you see additional white space between each letter, it suggests that the file isn't actually encoded as utf-8 but uses a 16-bit Unicode encoding.
Try 'utf16le'.
For a list of supported encodings see Buffers and Character Encodings.
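A minimal sketch of that fix, assuming the file really is little-endian UTF-16:

var xmlContents = fs.readFileSync(myXml, 'utf16le').toString();
var doc = new dom().parseFromString(xmlContents, 'application/xml');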

Convert ANSI file text into UTF8 in node.js using File System

I'm trying to convert the text from an ANSI-encoded file to UTF8-encoded text in node.js.
I'm reading the info from the file using node's core File System module. Is there any way to 'tell' readFile that the encoding is ANSI?
const fs = require('fs');

fs.readFile('..\\LogSSH\\' + fileName + '.log', 'utf8', function (err, data) {
    if (err) {
        console.log(err);
    }
    // ...
});
If not, how can I convert that text?
Of course, ANSI is not actually an encoding. But no matter what exact encoding we're talking about, I can't see any Microsoft code pages included in the relatively short list documented at Buffers and Character Encodings:
ascii - for 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.
utf8 - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.
utf16le - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.
ucs2 - Alias of 'utf16le'.
base64 - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC4648, Section 5.
latin1 - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).
binary - Alias for 'latin1'.
hex - Encode each byte as two hexadecimal characters.
If you work in Western Europe you may be tempted to use latin1 as a synonym for Windows-1252, but it'll render incorrect results as soon as you print a € symbol.
So the answer is no, you need to install a third-party package like iconv-lite.
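As a sketch of that route, assuming iconv-lite is installed and the file is Windows-1252 (the usual meaning of 'ANSI' on Western systems), you could decode the raw bytes and re-save the text as UTF-8 (the output path here is made up):

const fs = require('fs');
const iconv = require('iconv-lite');

// Omit the encoding so readFile hands back a raw Buffer
const raw = fs.readFileSync('..\\LogSSH\\' + fileName + '.log');
const text = iconv.decode(raw, 'win1252'); // bytes -> JS string
fs.writeFileSync('..\\LogSSH\\' + fileName + '.utf8.log', text, 'utf8');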
In my case the conversion between types was due to the need to use special Latin characters like 'í' or 'ó'. I solved it by changing the encoding from 'utf8' to binary in the fs.readFile() function:
fs.readFile('..\\LogSSH\\' + fileName + '.log', {encoding: "binary"}, function (err, data) {
    if (err) {
        console.log(err);
    }
    // ...
});

base64 encoding in javascript, decoding in php

I am trying to encode a string in javascript and decode it in php.
I use this code to put the string in an inputbox and then send it via form PUT.
document.getElementById('signature').value= b64EncodeUnicode(ab2str(signature));
And this code to decode
$signature=base64_decode($signature);
Here there is a jsfiddle for the encoding page:
https://jsfiddle.net/okaea662/
The problem is that the string I get back is 98% correct, but with some characters changed.
For example (the first string is the string printed in the inputbox):
¦S÷ä½m0×C|u>£áWÅàUù»¥ïs7Dþ1Ji%ýÊ{\ö°(úýýÁñxçO9Ù¡ö}XÇIWçβÆü8ú²ðÑOA¤nì6S+̽ i¼?¼ºNËÒo·a©8»eO|PPþBE=HèÑqaX©$Ì磰©b2(Ðç.$nÈR,ä_OX¾xè¥3éÂòkå¾ N,sáW§ÝáV:ö~Å×à<4)íÇKo¡L¤<Í»äA(!xón#WÙÕGù¾g!)ùC)]Q(*}?­Ìp
¦S÷ ä½m0×C|u>£áWÅàUù»¥ïs7Dþ1Ji%ýÊ{\ö°(úýýÁñxçO9Ù¡ö}XÇIWçβÆü8ú²ðÑOA¤nì6S+̽ i¼?¼ºNËÒo·a©8»eO|PPþBE=HèÑ qaX©$Ì磰©b2(Ðç.$nÈR,ä_OX¾xè¥3éÂòkå¾ N ,sá W§ÝáV:ö~Å×à<4)íÇKo¡L¤<Í»äA(!xón#WÙÕGù¾g!)ùC)]Q(*}?­Ìp
Note that the 4th character is different, and then there are one or two more differences elsewhere.
The string corresponds to a digital signature, so these characters make the signature invalid.
I have no idea what is happening here. Any idea? I use the Chrome browser and utf-8 encoding in the header and metas (Firefox seems to use a different encoding in the inputbox, but I will look at that problem later).
EDIT:
The encoding to base64 apparently is not the problem. The base64-encoded string is the same in the browser as on the server. If I base64-decode it in javascript I get the original string, but if I decode it in PHP I get a slightly different string.
EDIT2:
I still don't know what the problem is, but I have avoided it by sending the data in a blob with AJAX.
Try using this to encode your string with js:
var signature = document.getElementById('signature').value; // note: .value, not the element itself
var base64 = window.btoa(signature);
Now with php, you simply use: base64_decode($signature)
If that doesn't work (I haven't tested it), there may be something wrong with the btoa function. So check out this link:
https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding
There is a function in there that should work (if the above does not)
function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode('0x' + p1);
    }));
}
b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
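If you also need to reverse that on the JavaScript side, a matching decoder in the same style (a sketch, not part of the answer above):

function b64DecodeUnicode(str) {
    // Percent-encode each byte of the atob() output, then let
    // decodeURIComponent reassemble the multi-byte UTF-8 sequences
    return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"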

Convert binary data to string in node

The code below is returning data in binary format; how can I convert this to a string?
fs.readFile('C:/test.prn', function (err, data) {
    bufferString = data.toString();
    bufferStringSplit = bufferString.split('\n');
    console.log(bufferStringSplit);
});
Output of console.log(bufferStringSplit):
&b27WPML ? ?????§201501081339&b16WPML ? ?????? *o5W? ?&l6H&l0S*r4800S&l-1M*o5W
? ??&l0E&l0L&u600D&l74A*o5W?? :*o5W?? :*o-3M&l-2H&l0O*o5W?? *o7 ?*g20W?? ??X?X ???X?X
?,??????????%]?? ?M???/????r????WWW???Y???~???$???///?9???DDD?N??Y???0v0w0v0w0v0w0v145w??T????!??###??????????'''?d??????????EEE?hhh??????????????
?'''?d??????EEE?hhh???=??5???-}???#????%???s?????? ?+???¦??
That is most likely happening because your .prn file is binary, that is, it does not contain plain text such as ASCII, UTF8 or ISO-8859-1. You need to convert it either within your JS code or with an external tool. Alternatively, you can read and handle it as binary, but you won't be operating on "normal" strings then.
A *.prn file is most likely a printer file (http://filext.com/file-extension/PRN), thus it is binary and cannot be shown as a string.
You either need to process the file as binary or convert it to a string in an encoding of your choice.
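A minimal sketch of the binary route: read with no encoding so Node hands back a raw Buffer, then inspect the bytes as hex rather than forcing them into a string:

const fs = require('fs');

fs.readFile('C:/test.prn', function (err, data) {
    if (err) throw err;
    // data is a Buffer here, not a string
    console.log(data.length + ' bytes');
    console.log(data.toString('hex', 0, 16)); // first 16 bytes as hex
});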

Adding UTF-8 BOM to string/Blob

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.
Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See the discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of whether UTF-8 or UTF-16 is used.
See p. 36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page:
The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
I had the same issue and this is the solution I came up with:
var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
    "Text",
    ... // Remaining data
],
{ type: "text/plain;charset=utf-8" });
Using a Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.
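A hypothetical usage sketch, saving the blob from the browser (the filename is made up):

var url = URL.createObjectURL(blob);
var a = document.createElement('a');
a.href = url;
a.download = 'data.txt'; // hypothetical filename
a.click();
URL.revokeObjectURL(url);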
I'm editing my original answer. The above answer really demands elaboration, as this is a convoluted solution by Node.js.
The short answer is: yes, this code works.
The long answer is: no, FEFF is not the byte order mark for UTF-8. Apparently Node took some sort of shortcut for writing encodings within files. U+FEFF is the Unicode code point of the BOM, and it serializes to different bytes depending on the encoding, as can be seen in the Byte Order Mark Wikipedia article and verified in a binary text editor after having written the file. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.js uses \ufeff to signify any of a number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the third options parameter of writeFile, in which you pass the encoding string. Node.js takes this encoding string and converts the \ufeff fixed code point into the actual byte order mark for that encoding.
UTF-8 Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
    /* The actual byte order mark written to the file is EF BB BF */
});
UTF-16 Little Endian Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
    /* The actual byte order mark written to the file is FF FE */
});
So, as you can see, \ufeff is simply a marker standing in for any number of resulting encodings. The actual encoding that makes it into the file depends directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
I suspect the reasoning behind this is that they chose not to write literal byte order marks, and the 3-byte mark for UTF-8 isn't easily encoded into a JavaScript string to be written to disk. So they used U+FEFF as a placeholder mark within the string, which gets substituted at write time.
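A quick sketch to verify which bytes actually landed on disk (someFilename as in the examples above):

fs.readFile(someFilename, function(err, buf) {
    if (err) throw err;
    // prints 'efbbbf' for UTF-8; starts with 'fffe' for UTF-16LE
    console.log(buf.toString('hex', 0, 3));
});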
This is my solution:
var blob = new Blob(["\uFEFF"+csv], {
type: 'text/csv; charset=utf-18'
});
This works for me:
let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });
A BOM (byte order mark) may be necessary because some programs need it to pick the correct character encoding.
Example:
When MS Excel opens a CSV file without a BOM on a system whose default character encoding is Shift_JIS rather than UTF-8, it decodes the file in that default encoding, which results in garbage characters. Specifying the UTF-8 BOM fixes this.
This fixes it for me. I was getting a BOM back from the authorize.net API on Cloudflare Workers:
const data = JSON.parse((await res.text()).trim());
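This works because String.prototype.trim() treats U+FEFF as whitespace. A sketch that strips a leading BOM explicitly instead:

const body = await res.text();
const data = JSON.parse(body.charCodeAt(0) === 0xFEFF ? body.slice(1) : body);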
