The question is pretty simple: how much RAM in bytes does each character in an ECMAScript/JavaScript string consume?
I am going to guess two bytes, since the standard says they are stored as 16-bit unsigned integers?
Does this mean each character is always two bytes?
Yes, I believe that is the case. The characters are probably stored as wide strings or UCS-2 strings.
They may be UTF-16, in which case characters outside the BMP (Basic Multilingual Plane) take up two 16-bit code units each, but I believe those characters are not fully supported. Read this blog post about problems in the UTF-16 implementation of ECMAScript.
Most modern languages store their strings with two-byte characters. That way you have full support for all spoken languages. It costs a little extra memory, but that's peanuts for any modern computer with multiple gigabytes of RAM. Storing the string in the more compact UTF-8 would make processing more complex and slower, so UTF-8 is mostly used for transport only. ASCII supports only the Latin alphabet without diacritics; ANSI is still limited and needs a specified code page to make sense.
Section 4.3.16 of ECMA-262 explicitly defines "String value" as a "primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integers". It suggests that programs use these 16-bit values as UTF-16 text, but it is legal simply to use a string to store any immutable array of unsigned shorts.
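For example (a small illustration of my own, not taken from the spec), a string can hold values that are not even valid UTF-16 text, such as a lone surrogate:
const s = String.fromCharCode(0xD800, 0x0041); // a lone high surrogate followed by "A"
console.log(s.length);                         // 2
console.log(s.charCodeAt(0).toString(16));     // "d800"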
Note that character size isn't the only thing that makes up the string size. I don't know the exact implementation (and it might differ between engines), but strings tend to have a 0x00 terminator to make them compatible with C-style strings (PChars). And they probably have a header that contains the string length and maybe some refcounting and even encoding information. A string with one character can easily consume 10 bytes or more (yes, that's 80 bits).
I have a string that I know for sure has only ASCII letters.
JS treats strings as UTF-8 by default,
which means that every character takes up to 4 bytes,
which is 4 times the size of ASCII.
I'm trying to compress / save space / get the shortest string possible,
by having encode and decode functions.
I thought about representing 4 characters of ASCII in a UTF-8 string and by that achieving my goal. Is there anything like that?
If not, what is the best way to compress ASCII strings, so that by encoding and decoding I'll reach the same string?
Actually JavaScript encodes program strings in UTF-16, which uses 2 octets (16 bits) for Unicode characters in the BMP (Basic Multilingual Plane) and 4 octets (32 bits) for characters outside it. So internally at least, ASCII characters use 2 bytes.
There is room to pack two ASCII characters into 16 bits, since they only use 7 bits each. Two packed characters need only 2**14 = 16384 of the 2**16 possible values, leaving 49152 values unused, far more than the 2048 code points reserved for surrogates in UTF-16, so you can devise an encoding scheme that avoids the surrogate range entirely, as sketched below.
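Here is a minimal sketch of that packing idea (the helper names are my own; it assumes the input is pure ASCII and contains no NUL characters, since 0 is used as padding):
function packAscii(str) {
  let out = "";
  for (let i = 0; i < str.length; i += 2) {
    const hi = str.charCodeAt(i);                               // 0-127
    const lo = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;  // pad odd-length input with 0
    out += String.fromCharCode((hi << 7) | lo);                 // 14-bit value, 0x0000-0x3FFF, never a surrogate
  }
  return out;
}
function unpackAscii(packed) {
  let out = "";
  for (let i = 0; i < packed.length; i++) {
    const v = packed.charCodeAt(i);
    out += String.fromCharCode(v >> 7);                         // high 7 bits
    if ((v & 0x7f) !== 0) out += String.fromCharCode(v & 0x7f); // low 7 bits, unless it is the padding
  }
  return out;
}
console.log(unpackAscii(packAscii("Hello, ASCII!")) === "Hello, ASCII!"); // true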
You could also use 8-bit typed arrays to hold ASCII characters while avoiding the complexity of a custom compression algorithm.
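A minimal sketch of that approach, assuming the standard TextEncoder/TextDecoder APIs (available in modern browsers and Node.js):
const bytes = new TextEncoder().encode("only ASCII here"); // Uint8Array: one byte per ASCII character
const text  = new TextDecoder().decode(bytes);             // back to an ordinary JS string
console.log(bytes.length, text === "only ASCII here");     // 15 true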
The purpose of compressing 7 bit ASCII for use within JavaScript is largely (entirely?) academic these days and not something there is a demand for. Note that encoding 7 bit ASCII content into UTF-8 (for transmission or file encoding) only uses one byte for ASCII characters due to the design of UTF-8.
If you want to use 1 byte per character you can simply store the bytes themselves. There is already a function to build a string back from bytes (such as String.fromCharCode).
I am setting up a little website and would like to make it international. All the content will be stored in an external XML file in different languages and parsed into the HTML via JavaScript.
Now the problem is, there are also German umlauts, Russian, Chinese and Japanese characters, and also right-to-left languages like Arabic and Farsi.
What would be the best way/solution? Is there an "international encoding" which can display all languages properly? Or is there any other solution you would suggest?
Thanks in advance!
All of the Unicode transformations (UTF-8, UTF-16, UTF-32) can encode all Unicode characters. You pick which you want to use based on the size: If most of your text is in western scripts, probably UTF-8, as it will use only one byte for most of the characters, but 2, 3, or 4 if needed. If you're encoding far east scripts, you'll probably want one of the other transformations.
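To make the size differences concrete, here is a small illustration of my own using the TextEncoder API (which always produces UTF-8):
console.log(new TextEncoder().encode("hello").length); // 5 - one byte per basic Latin letter
console.log(new TextEncoder().encode("héllo").length); // 6 - "é" needs two bytes
console.log(new TextEncoder().encode("你好").length);   // 6 - three bytes per CJK character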
The fundamental thing here is that it's all Unicode; the transformations are just different ways of representing the same characters.
The co-founder of Stack Overflow had a good article on this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Regardless of what encoding you use for your document, note that if you're processing these strings in JavaScript, JavaScript strings are UTF-16 (except that invalid values are tolerated). (Even if the document is in UTF-8 or UTF-32.) This means that, for instance, each of those emojis people are so excited about these days looks like two "characters" to JavaScript, because it takes two UTF-16 code units to represent. Like 😎, for instance:
console.log("😎".length); // 2
So you'll need to be careful not to split up the two halves of characters that are encoded in two words of UTF-16.
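One way to avoid that (a sketch of my own, not part of the original answer) is to work with code points rather than code-unit indexes; the string iterator and Array.from are surrogate-pair aware:
const s = "I 😎 Unicode";
console.log(s.length);             // 12 - counts UTF-16 code units
console.log(Array.from(s).length); // 11 - counts whole code points
for (const ch of s) {
  console.log(ch);                 // each `ch` is a complete code point; the emoji is never split in half
}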
The normal (and recommended) solution for multi-lingual sites is to use UTF-8. That can deal with any characters that have been assigned Unicode codepoints, with a couple of caveats:
Unicode is a versioned standard, and different Javascript implementations may support different Unicode versions.
If your text includes characters outside of the Unicode Basic Multilingual Plane (BMP), then you need to do your text processing (in Javascript) in a way that is Unicode aware. For instance, if you use the Javascript String class you need to take proper account of surrogate pairs when doing text manipulation.
(A Javascript String is actually encoded as UTF-16. While it has methods that allow you to manipulate it as Unicode codepoints, methods / attributes such as substring and length use code unit rather than codepoint indexing. If you are not careful, you can end up splitting a string between the low and high parts of a surrogate pair. The result will be something that cannot be displayed properly. This only affects codepoints in higher planes ... but that includes the new emoji codepoints.)
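For illustration (my own example, using an emoji outside the BMP):
const s = "a😎b";
console.log(s.length);                      // 4 - the emoji occupies two code units
console.log(s.substring(0, 2));             // "a" plus a lone high surrogate - not displayable
console.log(s.codePointAt(1).toString(16)); // "1f60e" - the full code point, read surrogate-pair aware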
In the JavaScript world,
I learnt that the JavaScript source code charset is usually UTF-8 (but not always).
I learnt that the JavaScript (execution) charset is UTF-16.
How do I interpret these two terms?
Note: the answer can be given language-agnostically, by taking another language like Java
Pretty much all source code is written in utf-8, or should be. As source code is mostly English, using ASCII-compatible characters, and utf-8 is most efficient in that character range, there is a great advantage. In any case, it has become the de facto standard.
JavaScript was developed before the rest of the world settled on utf-8, so it follows the Java practice of using utf-16 for all strings, which was pretty forward-thinking at the time. This means that all strings, whether coded in the source or obtained some other way, will be (re-)encoded in utf-16.
For the most part it's unimportant. Source code is for humans and the execution character set is for machines. However, the fact does have two minor issues:
JavaScript strings may waste a lot of space if your strings are largely in the ASCII range (which they would be in English, or even in other languages, which still use ASCII spaces).
Like utf-8, utf-16 is also variable-width, though most characters in most languages fit within the normal 2 bytes; however, JavaScript may miscalculate the length of a string if some of the characters take 4 bytes.
Apart from questions of which encoding better suits a particular human language, there is no other advantage of one over the other. If JavaScript were developed more recently, it would probably have used utf-8 encoding for strings.
Base64 (2^6) uses a subset of characters, usually
a-z, A-Z, 0-9, /, +
It does not use all 128 characters defined in ASCII because non-printable characters cannot be used.
However, each character still occupies 2^8 (a full 8 bits) of space.
This results in 33% wasted space (the output is 4/3 the size of the input).
Why can't a subset of UTF-8 be used which has 256 printable characters? Then, instead of the limited subset listed above, the richness of UTF-8 could be used to fill all 8 bits.
This way there would be no loss.
Base64 is used to encode arbitrary 8-bit data in systems that do not support 8-bit data, like email and XML. Its use of 7-bit ASCII characters is deliberate, so it can pass through 7-bit systems, like email. It is not the only data encoding format in the world, though. yEnc, for example, has lower overhead than Base64. And if your data is mostly ASCII-compatible, Quoted-Printable is almost 1-to-1.
UTFs are meant for encoding Unicode text, not arbitrary binary data. Period.
Pick an encoding that is appropriate for the data and usage. Don't try to coerce an encoding into doing something it is not meant to do.
Why can't a subset of UTF-8 be used which has 256 printable characters? Then, instead of the limited subset listed above, the richness of UTF-8 could be used to fill all 8 bits.
Suppose you used a subset that contained the 94 non-space printable characters from the ASCII range (encoded in UTF-8 as 1 byte each) and 162 characters from somewhere in the U+0080 to U+07FF range (encoded in UTF-8 as 2 bytes each). Assuming a uniform distribution of values, you'd need an average of 1.6328125 bytes of text per byte of data, which is less efficient than Base64's 1.3333333.
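Spelling out that arithmetic (same numbers as above):
// 94 one-byte symbols and 162 two-byte symbols, each standing for one byte of data:
const mixedUtf8 = (94 * 1 + 162 * 2) / 256; // 1.6328125 bytes of text per byte of data
const base64    = 4 / 3;                    // 1.333... bytes of text per byte of data
console.log(mixedUtf8, base64);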
UTF-8 uses 2 bytes for characters 128-255, so it would use 16 bits to store 8 bits of data (50% efficiency) instead of using 8 bits to store 6 bits (75% efficiency).
Javascript represents all numbers as double-precision floating-point. This means it loses precision when dealing with integers at the very highest end of the 64-bit Java Long datatype -- anything above 2^53, i.e. roughly 16 decimal digits. For example, the number:
714341252076979033
... becomes:
714341252076979100
My database uses long IDs and some happen to be in the danger zone. I could change the offending values in the database, but that'd be difficult in my application. Instead, right now I rather laboriously ensure the server encodes Long IDs as Strings in all ajax responses.
However, I'd prefer to deal with this in the Javascript. My question: is there a best practice for coercing JSON parsing to treat a number as a string?
You do have to send your values as strings (i.e. enclosed in quotes) to ensure that Javascript will treat them as strings instead of numbers.
There's no way I know of to get around that.
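A quick illustration of my own (the ID value is the one from the question):
// The unquoted number is rounded to the nearest double while parsing;
// the quoted one comes through as an exact string.
console.log(JSON.parse('{"id": 714341252076979033}').id);   // 714341252076979100
console.log(JSON.parse('{"id": "714341252076979033"}').id); // "714341252076979033"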
JavaScript represents all numbers as 64-bit IEEE 754 floats.
If your integer can't be represented exactly in the 53 bits of integer precision a double provides, it will be rounded, which is what happened here.
If you do change your database, make it send integers that fit within 53 bits, the range a double can represent exactly.
Otherwise, send them as strings.