Why were Javascript `atob()` and `btoa()` named like that? - javascript

In Javascript, window.atob() method decodes a base64 string and window.btoa() method encodes a string into base64.
Then why weren't they named like base64Decode() and base64Encode()?
atob() and btoa() don't make sense because they're not semantic at all.
I want to know the reason.

The atob() and btoa() methods allow authors to transform content to and from the base64 encoding.
In these APIs, for mnemonic purposes, the "b" can be considered to
stand for "binary", and the "a" for "ASCII". In practice, though, for
primarily historical reasons, both the input and output of these
functions are Unicode strings.
From : http://www.w3.org/TR/html/webappapis.html#atob

I know this is old, but it recently came up on Twitter, and I thought I'd share it as it is authoritative.
Me:
#BrendanEich did you pick those names?
Him:
Old Unix names, hard to find man pages rn but see
https://www.unix.com/man-page/minix/1/btoa/ …. The names carried over
from Unix into the Netscape codebase. I reflected them into JS in a
big hurry in 1995 (after the ten days in May but soon).
In case the Minix link breaks, here's the man page content:
BTOA(1) BTOA(1)
NAME
btoa - binary to ascii conversion
SYNOPSIS
btoa [-adhor] [infile] [outfile]
OPTIONS
-a Decode, rather than encode, the file
-d Extracts repair file from diagnosis file
-h Help menu is displayed giving the options
-o The obsolete algorithm is used for backward compatibility
-r Repair a damaged file
EXAMPLES
btoa <a.out >a.btoa # Convert a.out to ASCII
btoa -a <a.btoa >a.out
# Reverse the above
DESCRIPTION
Btoa is a filter that converts a binary file to ascii for transmission over a telephone
line. If two file names are provided, the first in used for input and the second for out-
put. If only one is provided, it is used as the input file. The program is a function-
ally similar alternative to uue/uud, but the encoding is completely different. Since both
of these are widely used, both have been provided with MINIX. The file is expanded about
25 percent in the process.
SEE ALSO
uue(1), uud(1).
Source: Brendan Eich, the creator of JavaScript. https://twitter.com/BrendanEich/status/998618208725684224

To sum up the already given answers:
atob stands for ASCII to binary
e.g.: atob("ZXhhbXBsZSELCg==") == "example!^K"
btoa stands for binary to ASCII
e.g.: btoa("\x01\x02\xfe\xff") == "AQL+/w=="
Why ASCII and binary:
ASCII (the a) is the result of base64 encoding. A safe text composed only of a subset of ascii characters(*) that can be correctly represented and transported (e.g. email's body),
binary (the b) is any stream of 0s and 1s (in javascript it must be represented with a string type).
(*) in base64 these are limited to: A-Z, a-z, 0-9, +, / and = (padding, only at the end) https://en.wikipedia.org/wiki/Base64
P.S. I must admit I myself was initially confused by the naming and thought the names were swapped. I thought that b stand for "base64 encoded string" and a for "any string" :D.

The names come from a unix function with similar functionality, but you can already read that in other answers here.
Here is my mnemonic to remember which one to use. This doesn't really answer the question itself, but might help people figure which one of the functions to use without keeping a tab open on this question on stack overflow all day long.
Beautiful to Awful btoa
Take something Beautiful (aka, beautiful content that would make sense to your application: json, xml, text, binary data) and transform it to something Awful, that cannot be understood as is (aka: encoded).
Awful to Beautiful atob
The exact opposite of btoa
Note
Some may say that binary is not beautiful, but hey, this is only a trick to help you.

I can't locate a source at the moment, but it is common knowledge that in this case, the b stands for 'binary', and the a for 'ASCII'.
Therefore, the functions are actually named:
ASCII to Binary for atob(), and
Binary to ASCII for btoa().
Note this is browser implementation, and was left for legacy / backwards-compatibility purposes. In Node.js for example, these don't exist.

Related

How does JavaScript ensure that programs are written using the Unicode character set?

I have found this sentence while reading one of the JavaScript books:
JavaScript programs are written using the Unicode character set
What I don't understand is, how does JavaScript files makes sure, that whatever I write in .js file, would be a Unicode Character Set?
Does that mean whenever I type using keyboard on my computer, it'd always use Unicode? How does it work?
This means that the language definition employs Unicode charset. In particular, this usually means that string literals can include Unicode chars, and also may mean that identifiers can include some Unicode chars too (I don't know JavaScript, but in particular it's allowed in the Haskell language).
Now, the JavaScript implementation can choose any way to map bytes in .js file into internal Unicode representation. It may pretend that all .js files are written in UTF-8, or in 7-bit ASCII encoding, or anything else. You need to consult the implementation manual to reveal that.
And yeah, you need to know that any file consists of bytes, not characters. How characters, that you are typed in editor, converted to bytes stored in the file, is up to your editor (usually it provides a choice between use of local 8-bit encodings, UTF-8 and sometimes UTF-16). How the bytes stored in the file are converted to characters is up to your language implementation (in this case, JavaScript one).

Need to normalize UTF-8 string encoding to composed characters

So I have some characters like í, ñ, etc. that are percent-encoded in a URL string in an XML document. I need to convert them programmatically from the combining form (e.g. i%CC%81) to their composed UTF-8 character equivalent (%C3%AD in that case).
SO was kind enough to point me to the same question about how to do this in iOS (you can't, you have to create your own lookup table) and C# (apparently you can do this in the general case with built-in functionality in C#).
I need to be able to do it in python 3.x and preferably, JavaScript as well. So far I have tried to unquote/decodeURI the string and then re-encode it back, but apparently the characters are not exactly equivalent because the transforms are lossless (I get back the original starting with either form).
Is there anyway to do this in the general case or do I need to build my own lookup table and replacement functions? Also, here's an example URL:
file:///some/file/path/3-05%20Melodi%CC%81a%20de%20la%20montan%CC%83a%20.m4a
(Obviously I'm unescaping the XML part).
UPDATE
Using Christoph's answer below got me the python solution and enabled me to find this for JavaScript (note that it is an ES 2015 function, has mediocre browser support with no IE and Safari 10 only).
In python3 urllib.quote moved to urllib.parse, but you're actual looking for unicodedata.normalize()
Coming from a default python3 string
import urllib.parse
import unicodedata
s = "î"
print (urllib.parse.quote(s))
> %C3%AE
s = unicodedata.normalize("NFC",s)
print (urllib.parse.quote(s))
> %C3%AD
which looks to me pretty much like the result you're looking for.

Storing more info in QR Code

I am trying to develop a hybrid mobile app with QR code functionality. QR Code contains a limited number of character can be stored with it. So, I am thinking is it possible to compress the string to make it shorter so that I can store more info into the QR code?
At lengths that short, most compression algorithms will actually make data longer, not shorter. There are some algorithms which may work well, though… smaz comes to mind. However, it is going to depend heavily on what you are trying to compress, and you haven't really provided any information about that.
Instead of thinking about compression, your best bet may be to find an encoding scheme which makes more sense for your data. For example, if you're encoding a date and time, store it as a single number instead of text. Think about whether you really need seconds. If you are storing numbers, consider using variable-length quantities. If your data is JSON, consider using protobuf instead.
If what you have really is text, it may be worth considering coming up with your own character set. Instead of ASCII where each character 8 bits, can you limit yourself to 64 characters? a-z, A-Z, 0-9, and two punctuation characters is only 64 possible symbols… if that is all you need, you could use a 6-bit encoding. If the strings aren't case-sensitive you have tons of room for punctuation.

Difference between readAsBinaryString and readAsText using FileReader

So as an example, when I read the π character (\u03C0) from a File using the FileReader API, I get the pi character back to me when I read it using FileReader.readAsText(blob) which is expected. But when I use FileReader.readAsBinaryString(blob), I get the result \xcf\x80 instead, which doesn't seem to have any visible correlation with the pi character. What's going on? (This probably has something to do with the way UTF-8/16 is encoded...)
FileReader.readAsText takes the encoding of the file into account. In particular, since you have the file encoded in UTF-8, there may be multiple bytes per character. Reading it as text, the UTF-8 is read as it is, and you get your string.
FileReader.readAsBinaryString, on the other hand, does exactly what it says. It reads the file byte by byte. It doesn't recognise multi-byte characters, which in particular is good news for binary files (basically anything except a text file). Since π is a two-byte character, you get the two individual bytes that make it up in your string.
This difference can be seen in many places. In particular when encoding is lost and you see characters like é displayed as é.
Oh well, if that's all you needed... :)
CF80 is the UTF-8 encoding for π.

Loading EUC-JP and other Japanese text encodings in Node.JS

I'm trying to scrape some Japanese websites for a personal project. Sites with text in UTF-8 work perfectly fine, as you'd expect, but I can't get any text out of sites specifying other international encodings, specifically EUC-JP. Node also seems to be interpreting the text and performing modifications rather than passing it on raw - I've tried setting the response to be interpreted as both ascii and binary, and then set my terminal application to EUC-JP, but after doing a console.log(), neither result in the actual text.
I've had a scan through the Node documentation, and it seems to only support two main text encodings (apart from binary and base64.)
I'm using the inbuilt http client, and specifying the encoding through the response.setEncoding method, e.g. response.setEncoding('utf8');
How are other people working with international text in Node (especially with regard to situations where the original data is not in UTF-8?) Are binary buffers the only way?
While I've done a bit of research, I'm not hugely knowledgeable when it comes to character encoding, so simple answers would be appreciated. Thanks!
There is a module that adds iconv bindings to node.js. If you grab the response as a binary Buffer, you can use Iconv.convert to convert it from EUC-JP to UTF-8 (take a look at the README for an example).

Categories