Adding UTF-8 BOM to string/Blob - javascript

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?
Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿"my data"', of course: the escapes become the characters U+00EF U+00BB U+00BF, not a BOM.
Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).
Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?
Yes, I really do need the UTF-8 BOM in this case.

Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx
See the discussion between @jeff-fischer and @casey for details on UTF-8, UTF-16, and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of whether UTF-8 or UTF-16 is used.
See p.36 in The Unicode Standard 5.0, Chapter 2 for a detailed explanation. A quote from that page:
The endian order entry for UTF-8 in Table 2-4 is marked N/A because
UTF-8 code units are 8 bits in size, and the usual machine issues of
endian order for larger code units do not apply. The serialized order
of the bytes must not depart from the order defined by the UTF-8
encoding form. Use of a BOM is neither required nor recommended for
UTF-8, but may be encountered in contexts where UTF-8 data is
converted from other encoding forms that use a BOM or where the BOM is
used as a UTF-8 signature.

I had the same issue and this is the solution I came up with:
var blob = new Blob([
        new Uint8Array([0xEF, 0xBB, 0xBF]), // UTF-8 BOM
        "Text",
        ... // Remaining data
    ],
    { type: "text/plain;charset=utf-8" });
Using Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).
You should replace text/plain with your desired MIME type.
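If the goal is to hand the generated file to the user, the Blob can then be offered as a download. A minimal sketch, assuming a browser environment (the saveBlob helper and the filename are just illustrative):

function saveBlob(blob, filename) {
    // Create a temporary object URL and click a hidden link to trigger the download.
    var url = URL.createObjectURL(blob);
    var a = document.createElement('a');
    a.href = url;
    a.download = filename;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
    URL.revokeObjectURL(url);
}

saveBlob(blob, 'data.txt');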

I'm editing my original answer. The above answer really demands elaboration, as this is a convoluted solution on Node.js's part.
The short answer is, yes, this code works.
The long answer is, no, FEFF is not the byte order mark for UTF-8. Apparently Node took some sort of shortcut for writing encodings within files. U+FEFF is the BOM code point, and the byte sequence FE FF is its UTF-16 big-endian form, as can be seen in the Byte Order Mark Wikipedia article and as can also be viewed in a hex editor after the file has been written. I've verified this is the case.
http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
Apparently, Node.js uses \ufeff to signify any of these encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the third (options) parameter of writeFile, which carries the encoding string. Node.js takes this encoding string and turns the fixed \ufeff marker into the actual encoding's byte order mark.
UTF-8 Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf8' }, function(err) {
    /* The actual byte order mark written to the file is EF BB BF */
});
UTF-16 Little Endian Example:
fs.writeFile(someFilename, '\ufeff' + html, { encoding: 'utf16le' }, function(err) {
    /* The actual byte order mark written to the file is FF FE */
});
So, as you can see, \ufeff is simply a marker that can stand for any of the resulting encodings. The byte order mark that actually makes it into the file depends directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
I suspect the reasoning behind this is that they chose not to write byte order marks automatically, and the 3-byte mark for UTF-8 isn't easily expressed in the JavaScript string to be written to disk. So they used U+FEFF as a placeholder mark within the string, which gets substituted at write time.
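A quick way to check this yourself (a small sketch, assuming Node.js; the filename is arbitrary):

var fs = require('fs');

// Write a string with the U+FEFF marker as UTF-8, then inspect the raw bytes.
fs.writeFile('bom-test.txt', '\ufeff' + 'hello', { encoding: 'utf8' }, function (err) {
    if (err) throw err;
    var bytes = fs.readFileSync('bom-test.txt');
    console.log(bytes.slice(0, 3)); // <Buffer ef bb bf>
});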

This is my solution:
var blob = new Blob(["\uFEFF" + csv], {
    type: 'text/csv; charset=utf-8'
});

This works for me:
let blob = new Blob(["\ufeff", csv], { type: 'text/csv;charset=utf-8' });
A BOM (Byte Order Mark) might be necessary because some programs need it to pick the correct character encoding.
Example:
When opening a CSV file without a BOM in MS Excel on a system whose default character encoding is Shift_JIS instead of UTF-8, Excel will open it in the default encoding. This results in garbage characters. If you prepend the UTF-8 BOM, Excel decodes the file correctly.

This fixes it for me. I was getting a BOM back from the authorize.net API in Cloudflare Workers:
const data = JSON.parse((await res.text()).trim());

Related

Reconstructing a XORed UTF-8 encoded text in javascript, client side, no Node

I am driving myself nuts trying to achieve this in JavaScript.
First I will describe the scenario, and then I'll put my code, Python version, which I can't seem to translate into JavaScript.
I have a web page running on a server. I have no access to it whatsoever, so the only way I have to achieve basic functionality is using JavaScript.
The web page is used to compare information. The information is stored in CSV format, which I use to create HTML tables on the fly using AJAX calls. To keep that information from being immediately available to users who might view the source and 'steal' it, I came across a range of solutions, like encoding in Base64 (I know this is considered 'security by obscurity' and it's a bad practice, but I have no other choice here).
Base64 is very easy to use in this case, but I lose all the special UTF-8 characters (like á é í ó ú ñ etc.), which are part of my language (Spanish).
So here comes the preferred solution, which works like a charm in Python: using bitwise XOR. What could I achieve using this method:
If someone figures out the url of the CSV file, it wouldn't be so easy to read the text without basic programming knowledge to de-encode it.
I can easily program the source database to export the data, run the XOR function over it, upload those files to the server, and then have them decoded on the fly too.
It is in that last step that I cannot achieve what I want.
Here is my Python script:
To encode:
b = bytearray(open('file.csv', 'rb').read())
for i in range(len(b)):
    b[i] ^= 0x71
open('file.out', 'wb').write(b)
To decode:
b = bytearray(open('file.out', 'rb').read())
for i in range(len(b)):
    b[i] ^= 0x71
I need to achieve that small decoding function in JS.
Thank you all in advance for your time.
Base64
It isn't true that base64 makes you lose non-ASCII characters like ñ or á. Why should it? Base64 can encode any binary data, and encoded text is nothing else than binary data.
So encoding involves two steps:
A text encoding (such as UTF-8) converts your text to bytes, and the base64 encoding turns that into an ASCII string.
Decoding works the same, but backwards (reverse order of the two corresponding decoding functions).
This is how text encoding for UTF-8 works in JavaScript:
function encode_utf8(s) {
    return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
    return decodeURIComponent(escape(s));
}
I got this from here. Please note that I'm not a JS expert at all, and there might be more convenient methods now that I couldn't find.
Let's try this:
s = 'Se bañó todo el día.';
b = encode_utf8(s);        // text encoding
a = btoa(b);               // base64 encoding
console.log(a);            // prints U2UgYmHDscOzIHRvZG8gZWwgZMOtYS4=
d = decode_utf8(atob(a));  // decode base64, then UTF-8
console.log(d);            // prints Se bañó todo el día.
No character lost here.
XOR method
If you still want to do the XOR thing, you can decode as follows:
convert the UTF8-encoded string to an array of code points with Array.from()
XOR-decode with the ^ operator (or ^= assignment)
convert the result to a string with String.fromCodePoint()
decode the string with decode_utf8()
I'm not providing polished code for this, but a rough sketch is shown below.
Especially the third step might be a bit cumbersome, and I'm not sure if it's worth the pain.
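A minimal sketch of those four steps, assuming the encoded data arrives as a "binary string" whose characters are all byte values 0-255, and reusing the decode_utf8() function above (the xor_decode name is just illustrative):

function xor_decode(encoded, key) {
    // Steps 1 and 2: take the code point of each byte-character and XOR it with the key.
    var bytes = Array.from(encoded, function (ch) {
        return ch.codePointAt(0) ^ key;
    });
    // Step 3: turn the byte values back into a string of raw byte-characters.
    var binary = bytes.map(function (b) { return String.fromCodePoint(b); }).join('');
    // Step 4: interpret those bytes as UTF-8.
    return decode_utf8(binary);
}

// Usage, with the 0x71 key from the Python script:
// var csvText = xor_decode(encodedData, 0x71);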
After all, your users can just inspect the JS code to find out how the data are "encrypted", be it base64 or the XOR method.
Note
If you come from a Python background, be aware that there is no distinction like Python's str and bytes types in JavaScript. Both the input and output of the {en,de}code_utf8() functions are always strings of the same type. When you encode a string, you just get back another string in which every code point is below 256, and it may be longer than the input string.
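For instance, as a small illustration of that last point:

var s = 'ñ';
console.log(s.length);               // 1 - one character
console.log(encode_utf8(s).length);  // 2 - two characters, each with a code point below 256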

How to remove illegal characters from nodejs buffer?

I got a Base64-encoded string of a CSV file from the frontend. In the backend I am converting the Base64 string to binary and then trying to convert it to a JSON object.
var csvDcaData = new Buffer(source, 'base64').toString('binary'); // convert base64 to binary
The problem is, the UI is sending some illegal characters in one of the fields, which are not visible to the user in the plain CSV. "ï»¿" — these are the characters appended to one of the CSV fields.
I want to remove these kinds of characters from the Base64 data, but I am not able to recognize them in the buffer; only after conversion do these characters appear.
Is it possible in any way to detect such characters in the buffer?
The source is sending you a message. The message consists of metadata and text. The first few bytes of the message are identifiable as metadata because they are the Byte-Order Mark (BOM) encoded in UTF-8. That strongly suggests that the text is encoded in UTF-8. Nonetheless, to read the text you should find out from the sender which encoding is used.
Yes, the BOM "characters" should be stripped off when wanting to deal only in the text. They are not characters in the sense that they are not part of the text. (Though, if you decode the bytes as UTF-8, it matches the codepoint U+FEFF.)
So, though perhaps esoteric, the message does not contain illegal characters but actually has useful metadata.
Also, given that you are not stripping off the BOM, the fact that you are seeing "ï»¿" instead of "" (U+FEFF ZERO WIDTH NO-BREAK SPACE) means that you are not using UTF-8 to decode the text. That could result in data loss. There is no text but encoded text. You always have to know and use the correct encoding.
Now, source is a JavaScript string (which, by the way, uses the UTF-16 encoding of Unicode). The content of the string is a message encoded in Base64. The message is a sequence of bytes which are the UTF-8 encoding of a BOM and text. You want the text in a JavaScript string. (And the text happens to be some form of CSV. For that, you'll need to know the line ending, delimiter, and text-qualifier.) There is a lot for you and the sender to talk about. Perhaps the sender has documented all this.
const stripBom = require('strip-bom');
const original = "¡You win one million ₹! Now you can get a real 🚲";
const base64String = Buffer.from("\u{FEFF}" + original, "utf-8").toString("base64");
console.log(base64String);
const decodedString =
    stripBom(Buffer.from(base64String, "base64").toString("utf-8"));
console.log(decodedString);
console.log(original === decodedString);
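If you'd rather not add a dependency, a manual strip is also possible (a sketch, not part of the original answer):

// Decode as UTF-8, then drop a leading U+FEFF if one is present.
const decoded = Buffer.from(base64String, "base64").toString("utf-8");
const withoutBom = decoded.charCodeAt(0) === 0xFEFF ? decoded.slice(1) : decoded;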

Convert ANSI file text into UTF-8 in node.js using File System

I'm trying to convert the text from an ANSI-encoded file to UTF-8-encoded text in node.js.
I'm reading the info from the file using node's core File System (fs) module. Is there any way to 'tell' readFile that the encoding is ANSI?
var fs = require('fs');
fs.readFile('..\\LogSSH\\' + fileName + '.log', 'utf8', function (err, data) {
    if (err) {
        console.log(err);
    }
});
If not, how can I convert that text?
Of course, ANSI is not actually an encoding. But no matter what exact encoding we're talking about I can't see any Microsoft code pages included in the relatively short list documented at Buffers and Character Encodings:
ascii - for 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.
utf8 - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.
utf16le - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.
ucs2 - Alias of 'utf16le'.
base64 - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC4648, Section 5.
latin1 - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).
binary - Alias for 'latin1'.
hex - Encode each byte as two hexadecimal characters.
If you work in Western Europe, you may be tempted to use latin1 as a synonym for Windows-1252, but it'll render incorrect results as soon as you print a € symbol.
So the answer is no, you need to install a third-party package like iconv-lite.
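For example, a minimal sketch using iconv-lite, assuming the file is actually Windows-1252 (the usual meaning of "ANSI" on a Western European Windows system):

const fs = require('fs');
const iconv = require('iconv-lite');

// Read the raw bytes (omitting the encoding gives a Buffer), then decode them
// with the real source encoding instead of 'utf8' or 'latin1'.
fs.readFile('..\\LogSSH\\' + fileName + '.log', function (err, buffer) {
    if (err) {
        console.log(err);
        return;
    }
    const data = iconv.decode(buffer, 'win1252');
    console.log(data);
});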
In my case the conversion was needed because of special Latin characters such as 'í' or 'ó'. I solved it by changing the encoding from 'utf8' to binary in the fs.readFile() function:
fs.readFile('..\\LogSSH\\' + fileName + '.log', { encoding: "binary" }, function (err, data) {
    if (err) {
        console.log(err);
    }
});

Using Crockford's base 32 for IDs in URLs?

I'd like to write some IDs for use in URLs in Crockford's base32. I'm using the base32 npm module.
So, for example, if the user types in http://domain/page/4A2A I'd like it to map to the same underlying ID as http://domain/page/4a2a
This is because I want human-friendly URLs, where the user doesn't have to worry about the difference between upper- and lower-case letters, or between "l" and "1" - they just get the page they expect.
But I'm struggling to implement this, basically because I'm too dim to understand how encoding works. First I tried:
var encoded1 = base32.encode('4a2a');
var encoded2 = base32.encode('4A2A');
console.log(encoded1, encoded2);
But they map to different underlying IDs:
6hgk4r8 6h0k4g8
OK, so maybe I need to use decode?
var encoded1 = base32.decode('4a2a');
var encoded2 = base32.decode('4A2A');
console.log(encoded1, encoded2);
No, that just gives me empty strings:
" "
What am I doing wrong, and how can I get 4a2a and 4A2A to map to the same thing?
For an incoming request, you'll want to decode the URL fragment. When you create URLs, you will take your identifier and encode it. So, given a URL http://domain/page/dnwnyub46m50, you will take that fragment and decode it. Example:
#> echo 'dnwnyub46m50'| base32 -d
my_id5
The library you linked to is case-insensitive, so you get the same result this way:
echo 'DNWNYUB46M50'| base32 -d
my_id5
When dealing with any encoding scheme (Base-16/32/64), you have two basic operations: encode, which works on a raw stream of bits/bytes, and decode which takes an encoded set of bytes and returns the original bit/byte stream. The Wikipedia page on Base32 encoding is a great resource.
When you decode a string, you get raw bytes: it may be that those bytes are not compatible with ASCII, UTF-8, or some other encoding which you are trying to work with. This is why your decoded examples look like spaces: the tools you are using do not recognize the resulting bytes as valid characters.
How you go about encoding identifiers depends on how your identifiers are generated. You didn't say how you were generating the underlying identifiers, so I can't make any assumptions about how you should handle the raw bytes that come out of the decoder, nor about the content of the raw bytes being passed into the encoder.
It's also important to mention that the library you linked to is not compatible with Crockford's Base32 encoding. The library excludes I, L, O, S, while Crockford's encoding excludes I, L, O, U. This would be a problem if you were trying to interoperate with another system that used a different library. If no one besides you will ever need to decode your URL fragments, then interoperability doesn't matter.
The source of your confusion is that base64 and base32 are methods of representing numbers, whereas in your examples you are attempting to encode or decode text strings.
Encoding and decoding text strings as base32 is done by first converting the string into a large number. In your first examples, where you are encoding "4a2a" and "4A2A", those are strings with two different numeric values, which consequently translate to base32-encoded numbers with two different values: 6hgk4r8 and 6h0k4g8.
when you "decode" 4a2a and 4A2A you say you get empty strings. However this is not true, the strings are not empty, they contain what the decoded number looks like, when interpreted as a string. Which is to say, it looks like nothing because 4a2a produces an unprintable character. It's invisible. What you want is to feed the encoder numbers, not strings.
JavaScript has
parseInt(num, 32)
and
num.toString(32)
built in, in a way that's compatible with Java and across JavaScript versions.
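As a small illustration (a sketch only; note that parseInt/toString use the 0-9a-v alphabet, not Crockford's, and the example ID is arbitrary):

const id = 123456789;                 // an arbitrary numeric ID
const fragment = id.toString(32);     // "3lnj8l" - put this in the URL

console.log(parseInt('3lnj8l', 32));  // 123456789
console.log(parseInt('3LNJ8L', 32));  // 123456789 - parseInt is case-insensitive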

Character encoding from UTF8 JSON to ISO-8859-1

Using getJSON to retrieve some data which I am UTF-8 encoding on the server side...
"title":"new movie \u0091The Tree of Life\u0092 on day 6"
The page that it is displayed on is charset ISO-8859-1, and I am doing this...
$.getJSON('index.php', { q: q }, function(data){
    for (var i = 0; i < data.length; i++) {
        alert(data[i].title + "\n" + utf8_decode(data[i].title));
    }
});
The utf8_decode function comes from here.
The problem is that I am still seeing the magic squares for both versions...
new movie The Tree of Life on day 6
new movie ᔨe Tree of Life⠯n day 6
This leads me to believe that perhaps the character is of neither encoding. However it works if I paste the string onto a page and set the charset to either UTF8 or ISO-8859-1 :-/
Any help would be great!
There is no need to escape or decode any characters in data transmitted in JSON. It's done automatically. It is also independent of the page's encoding. You can easily transmit and display the euro sign (\u20ac) with your code even though ISO-8859-1 does not contain the euro sign.
Your problem is the characters \u0091 and \u0092. They aren't regular printable Unicode characters; they are control codes reserved for private use.
It rather looks as if you in fact have data that originally used the Windows-1250 character set but was not properly translated to Unicode/JSON. In Windows-1250, these two characters are typographic single quotes.
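If you cannot fix the data on the server side, a possible client-side workaround is to map those stray code points to the typographic quotes they were meant to be (just a sketch; the helper name is illustrative):

// Replace the stray C1 code points with real curly quotes.
function fixSmartQuotes(s) {
    return s.replace(/\u0091/g, '\u2018')   // left single quotation mark
            .replace(/\u0092/g, '\u2019');  // right single quotation mark
}

// fixSmartQuotes("new movie \u0091The Tree of Life\u0092 on day 6")
// -> "new movie 'The Tree of Life' on day 6" (with curly quotes)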
Did you try it without utf8_decode?
If the characters in your string exist in ISO-8859-1, this will just work, as JavaScript decodes the \u0091 in the encoding of the page.
