As an example: when I read the π character (\u03C0) from a File using the FileReader API, FileReader.readAsText(blob) gives me the π character back, which is expected. But when I use FileReader.readAsBinaryString(blob), I get the result \xcf\x80 instead, which doesn't seem to have any visible correlation with the π character. What's going on? (This probably has something to do with the way UTF-8/16 is encoded...)
FileReader.readAsText takes the encoding of the file into account. In particular, since you have the file encoded in UTF-8, there may be multiple bytes per character. Reading it as text, the UTF-8 is read as it is, and you get your string.
FileReader.readAsBinaryString, on the other hand, does exactly what it says. It reads the file byte by byte. It doesn't recognise multi-byte characters, which in particular is good news for binary files (basically anything except a text file). Since π is a two-byte character, you get the two individual bytes that make it up in your string.
This difference shows up in many places, in particular when encoding information is lost and you see a character like é displayed as é.
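If you want to see the two behaviours side by side, here is a minimal sketch (browser environment assumed; note that readAsBinaryString is deprecated in favour of readAsArrayBuffer):

// The same UTF-8 Blob read both ways.
const blob = new Blob(['π'], { type: 'text/plain' }); // stored as the bytes CF 80

const textReader = new FileReader();
textReader.onload = () => console.log(textReader.result); // "π": the bytes are decoded as UTF-8
textReader.readAsText(blob, 'UTF-8');

const binaryReader = new FileReader();
binaryReader.onload = () => console.log(escape(binaryReader.result)); // "%CF%80": one string character per raw byte
binaryReader.readAsBinaryString(blob);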
Oh well, if that's all you needed... :)
CF80 is the UTF-8 encoding for π.
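You can check that byte sequence without FileReader at all; TextEncoder gives you the raw UTF-8 bytes directly:

// The UTF-8 bytes of π, straight from the encoder.
console.log(new TextEncoder().encode('π')); // Uint8Array [207, 128], i.e. 0xCF 0x80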
I am trying to use data from an API. I am using request for the API access, but have also tried axios.
const request = require('request')
request('https://remoteok.io/api', function (error, response, body) {
const data = JSON.parse(body)
console.log(data)
})
When accessing the website remoteok.io/api in a browser, I can see sequences like \u00e2\u0080\u0099. This sequence should be a right single quotation mark (’), but when I log it to the console in JavaScript or use express to render res.json(body), I get the characters â€™ instead.
How can I fix this encoding issue? Shouldn't JSON always just be plain UTF-8?
UPDATE:
Here is a simple glitch project that shows the behavior.
The problem is in the source data: the JSON sequence "\u00e2\u0080\u0099" does not represent a right closing quotation mark. There are three Unicode code points here: the first represents "â", while the other two are control characters.
You can verify this in a dev console, or by running the snippet below:
console.log(JSON.parse('"\u00e2\u0080\u0099"'));
Apparently the author of that JSON mixed up two things:
JSON is encoded in UTF
A \u notation represents a Unicode Code Point
The first means that the file or stream encoding the JSON text into bytes should be UTF encoded (preferably UTF-8). The second has nothing to do with that: JSON syntax allows you to specify 16-bit Unicode code points using the \u syntax. It is not intended that a sequence¹ of \u escapes spells out a UTF-8 byte sequence. One should not be concerned with the lower-level UTF-8 byte stream encoding when writing JSON text.
¹ Surrogate pairs deserve at least a mention here, but they are really unrelated to UTF-8; they have more to do with how Unicode code points beyond the 16-bit range can be encoded in JSON.
So although the right closing quotation mark has the UTF-8 byte sequence E2 80 99, it is not to be encoded with a \u notation for each of those three bytes.
The right closing quotation mark has Unicode code point \u2019. So either the source JSON should contain that, or it should just contain the character ’ literally (which will indeed end up as a UTF-8 sequence in the byte stream, but that is a level below JSON).
See those two possibilities:
console.log(JSON.parse('"’"'));
console.log(JSON.parse('"\u2019"'));
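And if you are curious how the producer most likely ended up with those three code points, here is a small sketch using Node's Buffer (Node because that is what the question runs on): decoding the UTF-8 bytes of ’ as Latin-1 yields exactly the sequence found in the JSON.

// What apparently happened on the producer side: the UTF-8 bytes of ’ (E2 80 99)
// were each treated as a code point of its own and escaped individually.
const utf8Bytes = Buffer.from('’', 'utf8');    // <Buffer e2 80 99>
const mangled = utf8Bytes.toString('latin1');  // '\u00e2\u0080\u0099', three code points
console.log(mangled === '\u00e2\u0080\u0099'); // true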
And now?
I would advise you to contact the service provider of this particular API. They have a bug in their JSON producing service.
Whatever you do, do not try to fix this in the client that consumes the service by recognising such malformed sequences and replacing them as if those characters represented UTF-8 bytes. Such a fix will be hard to maintain and may even produce false positives.
I think this is not an error; you can use this extension to view JSON in the browser:
JSON Viewer
I have found this sentence while reading one of the JavaScript books:
JavaScript programs are written using the Unicode character set
What I don't understand is: how do JavaScript files make sure that whatever I write in a .js file is in the Unicode character set?
Does that mean that whenever I type on my keyboard, my computer always uses Unicode? How does it work?
This means that the language definition uses the Unicode character set. In particular, it usually means that string literals can include Unicode characters, and it may also mean that identifiers can include some Unicode characters too (I don't know JavaScript, but that is allowed in the Haskell language, for example).
Now, a JavaScript implementation can choose any way to map the bytes in a .js file into its internal Unicode representation. It may assume that all .js files are written in UTF-8, or in 7-bit ASCII encoding, or anything else. You need to consult the implementation's manual to find out.
And yeah, you need to know that any file consists of bytes, not characters. How the characters you type in your editor are converted to the bytes stored in the file is up to your editor (it usually offers a choice between local 8-bit encodings, UTF-8 and sometimes UTF-16). How the bytes stored in the file are converted back to characters is up to your language implementation (in this case, the JavaScript one).
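As a small illustration of that last point, here is a Node.js sketch showing how the very same character turns into different byte sequences depending on the encoding chosen when the file is saved:

// One character, three different byte sequences on disk.
const ch = 'é';
console.log(Buffer.from(ch, 'utf8'));    // <Buffer c3 a9>  (UTF-8)
console.log(Buffer.from(ch, 'utf16le')); // <Buffer e9 00>  (UTF-16, little-endian)
console.log(Buffer.from(ch, 'latin1'));  // <Buffer e9>     (a local 8-bit encoding)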
In Javascript, window.atob() method decodes a base64 string and window.btoa() method encodes a string into base64.
Then why weren't they named like base64Decode() and base64Encode()?
atob() and btoa() don't make sense because they're not semantic at all.
I want to know the reason.
The atob() and btoa() methods allow authors to transform content to and from the base64 encoding.
In these APIs, for mnemonic purposes, the "b" can be considered to stand for "binary", and the "a" for "ASCII". In practice, though, for primarily historical reasons, both the input and output of these functions are Unicode strings.
From: http://www.w3.org/TR/html/webappapis.html#atob
I know this is old, but it recently came up on Twitter, and I thought I'd share it as it is authoritative.
Me:
#BrendanEich did you pick those names?
Him:
Old Unix names, hard to find man pages rn but see https://www.unix.com/man-page/minix/1/btoa/ …. The names carried over from Unix into the Netscape codebase. I reflected them into JS in a big hurry in 1995 (after the ten days in May but soon).
In case the Minix link breaks, here's the man page content:
BTOA(1) BTOA(1)
NAME
btoa - binary to ascii conversion
SYNOPSIS
btoa [-adhor] [infile] [outfile]
OPTIONS
-a Decode, rather than encode, the file
-d Extracts repair file from diagnosis file
-h Help menu is displayed giving the options
-o The obsolete algorithm is used for backward compatibility
-r Repair a damaged file
EXAMPLES
btoa <a.out >a.btoa # Convert a.out to ASCII
btoa -a <a.btoa >a.out # Reverse the above
DESCRIPTION
Btoa is a filter that converts a binary file to ascii for transmission over a telephone line. If two file names are provided, the first is used for input and the second for output. If only one is provided, it is used as the input file. The program is a functionally similar alternative to uue/uud, but the encoding is completely different. Since both of these are widely used, both have been provided with MINIX. The file is expanded about 25 percent in the process.
SEE ALSO
uue(1), uud(1).
Source: Brendan Eich, the creator of JavaScript. https://twitter.com/BrendanEich/status/998618208725684224
To sum up the already given answers:
atob stands for ASCII to binary
e.g.: atob("ZXhhbXBsZSELCg==") == "example!^K"
btoa stands for binary to ASCII
e.g.: btoa("\x01\x02\xfe\xff") == "AQL+/w=="
Why ASCII and binary:
ASCII (the a) is the result of base64 encoding: safe text composed only of a subset of ASCII characters(*) that can be correctly represented and transported (e.g. in an email body),
binary (the b) is any stream of 0s and 1s (in JavaScript it must be represented with a string type).
(*) in base64 these are limited to: A-Z, a-z, 0-9, +, / and = (padding, only at the end) https://en.wikipedia.org/wiki/Base64
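Here is a quick sketch of that "binary" constraint in practice: btoa() only accepts strings whose characters fit in one byte, which is exactly why the b reads as binary rather than "any string":

btoa('\x01\x02\xfe\xff'); // "AQL+/w==", every input character is a byte value
atob('AQL+/w==');         // "\x01\x02\xfe\xff", one character per decoded byte
try {
  btoa('π');              // U+03C0 does not fit in a byte...
} catch (e) {
  console.log(e.name);    // ...so this throws "InvalidCharacterError"
}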
P.S. I must admit I myself was initially confused by the naming and thought the names were swapped. I thought that b stands for "base64 encoded string" and a for "any string" :D.
The names come from a unix function with similar functionality, but you can already read that in other answers here.
Here is my mnemonic to remember which one to use. This doesn't really answer the question itself, but might help people figure which one of the functions to use without keeping a tab open on this question on stack overflow all day long.
Beautiful to Awful btoa
Take something Beautiful (aka, beautiful content that would make sense to your application: json, xml, text, binary data) and transform it to something Awful, that cannot be understood as is (aka: encoded).
Awful to Beautiful atob
The exact opposite of btoa
Note
Some may say that binary is not beautiful, but hey, this is only a trick to help you.
I can't locate a source at the moment, but it is common knowledge that in this case, the b stands for 'binary', and the a for 'ASCII'.
Therefore, the functions are actually named:
ASCII to Binary for atob(), and
Binary to ASCII for btoa().
Note that this is a browser implementation, and was left in place for legacy / backwards-compatibility purposes. In Node.js, for example, these don't exist.
I'm struggling to find any resources on this online, which is concerning.
I've been reading about UCS-2 and UTF-16 woes, but I can't find a solution.
I need to get a value from an input:
var val = $('input').val()
and encode it to base64, treating the text as utf-16, so:
this is a test
becomes:
dABoAGkAcwAgAGkAcwAgAGEAIAB0AGUAcwB0AA==
and not the below, which you get treating it as UTF-8:
dGhpcyBpcyBhIHRlc3Q=
Your data, once read into JavaScript, will be in an encodingless numerical format (strictly speaking, it has to be in Unicode Normalised Form C, but Unicode is just a series of identifying numbers for each glyph in the Unicode lexicon. It's encoding-less). So: if you specifically need the data encoded as a UTF-16 byte sequence, do so, then base64 encode that.
But here's the fun part: which UTF-16 do you need? Little or Big Endian? With or without BOM? UTF-16 is a really inconvenient encoding format (we're not even going to touch UCS-2. It's obsolete. Has been for a long time).
What you should really need is to get the text value from your HTML element, Base64 encode that value, and then have whatever receives the data unpack it as UTF-8; don't make JavaScript do more work than it has to. I presume you're sending this data to a server or something, in which case: your server language is far more capable than JavaScript, and can unpack text in about a million different encodings thanks to built-in functions. So just use that. Don't solve Y when your real problem is X.
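That said, if you really do need the UTF-16 form on the client, one way is to serialise the string into UTF-16LE bytes yourself (little-endian, no BOM, which is what the expected output above corresponds to) and base64 the resulting binary string. A sketch, with a helper name of my own choosing:

function toBase64Utf16le(str) {
  let binary = '';
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);                        // one UTF-16 code unit
    binary += String.fromCharCode(unit & 0xff, unit >> 8); // low byte first (little-endian)
  }
  return btoa(binary);
}

console.log(toBase64Utf16le('this is a test'));
// "dABoAGkAcwAgAGkAcwAgAGEAIAB0AGUAcwB0AA=="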
Let me describe my question in a situation-oriented way:
Assume Internet Explorer is still the dominant web browser (Firefox has documentation for binary processing):
XMLHttpRequest.responseText and XMLHttpRequest.responseXML in Internet Explorer expect text or xml/xhtml/html, but what if the server answers the XMLHttpRequest with MIME type application/octet? Would the characters in the response string all be less than 256 (every character of that string < 256)? Thanks very much for a straight answer; I have no web server environment, so I don't know how to test this myself.
Using text or xml raises character-set encoding issues, and I don't know how to process a <![CDATA[ node of an encoded xml document (for example: UTF-8, ASCII, GB18030) with JavaScript. When I get the node text, does the document object return me bytes or decoded characters? If it returns characters decoded according to the charset indicated in the HTTP response header, it would all be wrong.
To avoid messing with charsets, I would like the server to respond with octet data and force the string data inside it to be encoded as UTF-8 rather than some other charset, within that binary format.
If the response is octet data, I guess the browser would not try to decode the response as text.
Does this sound weird? Or am I misunderstanding something fundamental?
EDIT: I believe the question is asking this: Can JavaScript safely process strings that aren't encoded in Unicode? What are the problems with trying to do so?
EDIT: No, no, no. I mean: if the HTTP Content-Type header is "application/octet", would Internet Explorer try to decode it (as 16-bit Unicode, or as Internet Explorer's locally configured charset) when I read the XMLHttpRequest object's responseText from JavaScript? Or does Internet Explorer just wrap every single byte of the response body into a JavaScript string, so that every character in that string is less than 256 (character < 256)?
Am I speaking a Martian language? Sadly, if I were a Martian, I would come as a tourist without fuzzy questions. However, I am in a country that shares at least one property with Mars: RED.
If I understand your question correctly, the short answer is: yes, every single byte will contain a value between 0 and 255 (unsigned, that is). That's just the nature of bytes, consisting of 8 bits.
But why do you want this? What binary data do you want to process using JavaScript?
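For completeness, a sketch of the trick that was commonly used in that era to pull raw bytes through responseText (the URL is a placeholder; note that old Internet Explorer versions lack overrideMimeType, hence the feature check, so IE needed a different route such as responseBody):

var xhr = new XMLHttpRequest();
xhr.open('GET', '/some/binary/resource', true); // placeholder URL
if (xhr.overrideMimeType) {
  // Ask the browser not to apply any character-set decoding to the body.
  xhr.overrideMimeType('text/plain; charset=x-user-defined');
}
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4 && xhr.status === 200) {
    var body = xhr.responseText;
    var bytes = [];
    for (var i = 0; i < body.length; i++) {
      bytes.push(body.charCodeAt(i) & 0xff); // keep only the low byte of each character
    }
    console.log(bytes.length + ' bytes received');
  }
};
xhr.send(null);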
Just FYI, read Mastering Ajax, Part 3: Advanced requests and responses in Ajax:
This allows you to determine […] if the server will try to return binary data instead of HTML, text, or XML (which are all three much easier to process in JavaScript than binary data).
(under Useful HEAD requests).
In case you wondered, I found this article with a simple Google search.