I'm trying to convert the text from an ANSI encoded file to UTF-8 encoded text in Node.js.
I'm reading the info from the file using Node's core File System module. Is there any way to 'tell' readFile that the encoding is ANSI?
fs = require('fs');
fs.readFile('..\\LogSSH\\' + fileName + '.log', 'utf8', function (err, data) {
    if (err) {
        console.log(err);
    }
});
If not, how can I convert that text?
Of course, ANSI is not actually an encoding. But no matter what exact encoding we're talking about, I can't see any Microsoft code pages included in the relatively short list documented at Buffers and Character Encodings:
ascii - for 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.
utf8 - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.
utf16le - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.
ucs2 - Alias of 'utf16le'.
base64 - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC4648, Section 5.
latin1 - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).
binary - Alias for 'latin1'.
hex - Encode each byte as two hexadecimal characters.
If you work in Western Europe you may be tempted to use latin1 as a synonym for Windows-1252, but it will produce incorrect results as soon as you print a € symbol: € is byte 0x80 in Windows-1252, while that position is a C1 control code in Latin-1.
So the answer is no, you need to install a third-party package like iconv-lite.
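For example, if the file is actually Windows-1252, a minimal sketch using iconv-lite (assuming the same path as in the question):

const fs = require('fs');
const iconv = require('iconv-lite');

fs.readFile('..\\LogSSH\\' + fileName + '.log', function (err, buffer) {
    if (err) {
        console.log(err);
        return;
    }
    // No encoding argument, so buffer is a raw Buffer we can decode ourselves
    const text = iconv.decode(buffer, 'win1252');
});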
In my case the conversion between types was due to the need to use special Latin characters such as 'í' or 'ó'. I solved it by changing the encoding from 'utf8' to 'binary' in the fs.readFile() function:
fs.readFile('..\\LogSSH\\' + fileName + '.log', {encoding: "binary"}, function (err, data) {
    if (err) {
        console.log(err);
    }
});
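Note that 'binary' is just an alias for latin1, so this works for characters such as 'í' or 'ó' that share the same byte values in Latin-1 and Windows-1252, but not for the 0x80-0x9F range (e.g. €). If you later need the raw bytes back from that string, you can recover them and re-decode; a sketch, assuming data comes from the call above:

// data is the string produced by reading with { encoding: "binary" }
const bytes = Buffer.from(data, 'binary'); // recovers the original file bytes
// hypothetical follow-up: decode them as Windows-1252 with iconv-lite
const text = require('iconv-lite').decode(bytes, 'win1252');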
I'm using Node to read a text document using readFile, and within that document is a character:
�
This is a Windows-1252 character, but it is being converted to UTF-8 automatically in JavaScript. The correct character should actually display as Å.
Is there a way I can convert this character from utf-8 to windows-1252 to render the correct character?
The file is being read using Node's readFile method and is being read as UTF-8, due to the lack of support for the necessary encoding.
fs.readFile(`${logDirectory}myText.txt`, "utf-8", (err, text) => { ... });
I've tried a few options such as iconv-lite and legacy-decode but neither seem to return the correct result.
Any guidance appreciated.
You can try reading the file with the latin1 encoding, as Windows-1252 is based on it:
fs.readFile(`${logDirectory}myText.txt`, 'latin1', (err, text) => { ... });
Also note that in Node.js the UTF-8 encoding is called utf8 instead of utf-8, as described in the Buffer documentation.
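This works here because Å is byte 0xC5 in both Latin-1 and Windows-1252; the two encodings differ only in the 0x80-0x9F range. A quick sketch to illustrate:

const byte = Buffer.from([0xC5]);     // the raw byte from the file
console.log(byte.toString('utf8'));   // '�' (0xC5 alone is not valid UTF-8)
console.log(byte.toString('latin1')); // 'Å'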
I got a Base64 encoded string of a CSV file from the frontend. In the backend I am converting the Base64 string to binary and then trying to convert it to a JSON object.
var csvDcaData = Buffer.from(source, 'base64').toString('binary'); // convert base64 to binary
The problem is, the UI is sending some illegal characters in one of the fields, which are not visible to the user in the plain CSV. The characters "ï»¿" are appended to one of the CSV fields.
I want to remove these kinds of characters from the Base64 data, but I am not able to recognize them in the buffer; they only appear after conversion.
Is it possible in any way to detect these kinds of characters in the buffer?
The source is sending you a message. The message consists of metadata and text. The first few bytes of the message are identifiable as metadata because they are the Byte-Order Mark (BOM) encoded in UTF-8. That strongly suggests that the text is encoded in UTF-8. Nonetheless, to read the text you should find out from the sender which encoding is used.
Yes, the BOM "characters" should be stripped off when wanting to deal only in the text. They are not characters in the sense that they are not part of the text. (Though, if you decode the bytes as UTF-8, it matches the codepoint U+FEFF.)
So, though perhaps esoteric, the message does not contain illegal characters but actually has useful metadata.
Also, given that you are not stripping off the BOM, the fact that you are seeing "ï»¿" instead of "" (U+FEFF ZERO WIDTH NO-BREAK SPACE, which is invisible) means that you are not using UTF-8 to decode the text. That could result in data loss. There is no text but encoded text. You always have to know and use the correct encoding.
Now, source is a JavaScript string (which, by-the-way, uses the UTF-16 encoding of Unicode). The content of the string is a message encoded in Base64. The message is a sequence of bytes which are the UTF-8 encoding of a BOM and text. You want the text in a JavaScript string. (And the text happens to be some form of CSV. For that, you'll need to know the line ending, delimiter, and text-qualifier.) There is a lot for you and the sender to talk about. Perhaps the sender has documented all this.
const stripBom = require('strip-bom');
const original = "¡You win one million ₹! Now you can get a real 🚲";
const base64String = Buffer.from("\u{FEFF}" + original, "utf-8").toString("base64");
console.log(base64String);
const decodedString = stripBom(Buffer.from(base64String, "base64").toString("utf-8"));
console.log(decodedString);
console.log(original === decodedString);
New to Node-RED and JavaScript. I need to use the TCP input to connect to a relay controller for status. I'm using a function node to generate a two-byte request that will flow to the TCP input node and on to the controller, but I don't know how to format this in JavaScript. I can set
msg.payload = "hello";
to send a string, but I need to send 2 bytes: 0xEF 0xAA. In C# I would just create the string
msg.payload = "\xEF\xAA";
or something. How do I do this in JavaScript/Node-RED?
Binary payloads are Node.js Buffer objects, so they can be created like this:
msg.payload = new Buffer([0xEF,0xAA]);
As of today (Node-RED 0.17.5), this can be achieved by doing the following; see the documentation:
msg.payload = Buffer.from("\xEF\xAA", "latin1")
(The "latin1" encoding matters here: with the default "utf8", the character "\xEF" would be serialized as the two bytes C3 AF instead of the single byte EF.)
or
msg.payload = Buffer.from('hello world', 'ascii');
As you can see, you can also specify an encoding parameter:
The character encodings currently supported by Node.js include:
'ascii' - for 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.
'utf8' - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.
'utf16le' - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.
'ucs2' - Alias of 'utf16le'.
'base64' - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC4648, Section 5.
'latin1' - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).
'binary' - Alias for 'latin1'.
'hex' - Encode each byte as two hexadecimal characters.
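For raw bytes like 0xEF 0xAA, the hex encoding is a convenient alternative that avoids escape-sequence pitfalls entirely; a sketch:

msg.payload = Buffer.from('EFAA', 'hex'); // <Buffer ef aa>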
The code below is returning data in binary format; how can I convert this to a string?
fs.readFile('C:/test.prn', function (err, data) {
    const bufferString = data.toString();
    const bufferStringSplit = bufferString.split('\n');
    console.log(bufferStringSplit);
});
The output of console.log(bufferStringSplit):
&b27WPML ? ?????§201501081339&b16WPML ? ?????? *o5W? ?&l6H&l0S*r4800S&l-1M*o5W
? ??&l0E&l0L&u600D&l74A*o5W?? :*o5W?? :*o-3M&l-2H&l0O*o5W?? *o7 ?*g20W?? ??X?X ???X?X
?,??????????%]?? ?M???/????r????WWW???Y???~???$???///?9???DDD?N??Y???0v0w0v0w0v0w0v145w??T????!??###??????????'''?d??????????EEE?hhh??????????????
?'''?d??????EEE?hhh???=??5???-}???#????%???s?????? ?+???¦??
That is most likely happening because your .prn file is binary, that is, it does not contain plain text in an encoding such as ASCII, UTF-8 or ISO-8859-1. You need to convert it either within your JS code or with an external tool. Alternatively, you can read and handle it as binary, but you won't be operating on "normal" strings then.
A *.prn file is most likely a printer file (http://filext.com/file-extension/PRN), thus it is binary and cannot be shown as a string.
You either need to process the file as binary or convert it to a string in an encoding of your choice.
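If you just want to see what is in the file, read it without an encoding so you get a Buffer and inspect the raw bytes; a minimal sketch:

const fs = require('fs');

fs.readFile('C:/test.prn', function (err, data) {
    if (err) throw err;
    // data is a Buffer; dump the first 32 bytes as hex instead of forcing a string
    console.log(data.toString('hex', 0, 32));
});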
I need to add a UTF-8 byte order mark to generated text data on the client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.
Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See the discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of whether UTF-8 or UTF-16 is used.
See p. 36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page:
The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
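Applied to the original question, a minimal sketch (assuming content holds the generated text):

const blob = new Blob(['\ufeff' + content], { type: 'text/plain;charset=utf-8' });
// The Blob constructor encodes the string as UTF-8, so U+FEFF becomes EF BB BF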
I had the same issue and this is the solution I came up with:
var blob = new Blob([
    new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
    "Text",
    ... // Remaining data
],
{ type: "text/plain;charset=utf-8" });
Using Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.
I'm editing my original answer. The answer above really demands elaboration, as this is a convoluted solution by Node.js.
The short answer is, yes, this code works.
The long answer is, no, FEFF is not the byte order mark for UTF-8. Apparently Node took some sort of shortcut for writing encodings within files. U+FEFF is the BOM codepoint, which serialized in UTF-16 little-endian becomes the bytes FF FE, as can be seen in the Byte Order Mark Wikipedia article and can also be viewed in a binary text editor after having written the file. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.js uses \ufeff to signify any number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the 3rd options parameter of writeFile, where you pass the encoding string. Node.js takes this encoding string and converts the fixed \ufeff character into the actual encoding's byte order mark.
UTF-8 Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
    /* The actual byte order mark written to the file is EF BB BF */
});
UTF-16 Little Endian Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
    /* The actual byte order mark written to the file is FF FE */
});
So, as you can see, the \ufeff is simply a marker that can stand for any number of resulting encodings. The actual encoding that makes it into the file depends directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
I suspect that the reasoning behind this is that they chose not to write byte order marks and the 3-byte mark for UTF-8 isn't easily encoded into a JavaScript string to be written to disk. So, they used U+FEFF as a placeholder mark within the string, which gets substituted at write-time.
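A quick way to verify this behaviour (a sketch, assuming a writable working directory):

const fs = require('fs');

fs.writeFile('bom-test.txt', '\ufeffhi', { encoding: 'utf8' }, function (err) {
    if (err) throw err;
    console.log(fs.readFileSync('bom-test.txt')); // <Buffer ef bb bf 68 69>
});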
This is my solution:
var blob = new Blob(["\uFEFF" + csv], {
    type: 'text/csv; charset=utf-8'
});
This works for me:
let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });
A BOM (byte order mark) might be necessary because some programs need it to pick the correct character encoding.
Example:
When opening a CSV file without a BOM in MS Excel on a system whose default character encoding is Shift_JIS instead of UTF-8, Excel will open it in the default encoding. This will result in garbage characters. If you specify the BOM for UTF-8, it will fix it.
This fixed it for me. I was getting a BOM with the Authorize.net API and Cloudflare Workers:
const data = JSON.parse((await res.text()).trim());
This works because String.prototype.trim() strips U+FEFF along with ordinary whitespace.