I have a string that I know for sure has only ASCII letters.
JS treats strings as UTF-8 by default, which means every character can take up to 4 bytes, 4 times the size of an ASCII character.
I'm trying to compress / save space / get the shortest string possible by writing encode and decode functions.
I thought about representing 4 ASCII characters in a single UTF-8 character and achieving my goal that way. Is there anything like that?
If not, what is the best way to compress ASCII strings so that encoding and then decoding gives back the same string?
Actually JavaScript encodes program strings in UTF-16, which uses 2 octets (16 bits) for Unicode characters in the BMP (Basic Multilingual Plane) and 4 octets (32 bits) for characters outside it. So internally at least, ASCII characters use 2 bytes.
There is room to pack two ASCII characters into 16 bits, since they only use 7 bits each. Two packed characters need only 2**14 = 16384 of the 2**16 = 65536 possible code unit values, leaving 49152 to spare, and surrogate pairs in UTF-16 occupy only 2048 of those values, so you should be able to devise an encoding scheme that avoids the range of code units used by surrogates.
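A minimal sketch of such a scheme, assuming pure 7-bit ASCII input (the function names packAscii/unpackAscii are illustrative, not from the answer); because two packed characters never exceed 0x3FFF, the result stays well below the surrogate range:

    // Pack two 7-bit ASCII characters into one 16-bit code unit.
    // The combined value is at most 0x3FFF, far below the surrogate
    // range 0xD800..0xDFFF, so no special handling is needed.
    function packAscii(str) {
      if (str.length % 2) str += "\0";                 // pad to an even length
      let packed = "";
      for (let i = 0; i < str.length; i += 2) {
        const hi = str.charCodeAt(i);                  // 0..127
        const lo = str.charCodeAt(i + 1);              // 0..127
        packed += String.fromCharCode((hi << 7) | lo); // 14-bit value
      }
      return packed;
    }

    function unpackAscii(packed) {
      let out = "";
      for (let i = 0; i < packed.length; i++) {
        const v = packed.charCodeAt(i);
        out += String.fromCharCode(v >> 7, v & 0x7f);
      }
      return out.replace(/\0$/, "");                   // drop the padding character
    }

    // unpackAscii(packAscii("hello")) === "hello"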
You could also use 8-bit typed arrays to hold ASCII characters while avoiding the complexity of a custom compression algorithm.
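For example, a sketch using a Uint8Array (the helper names are illustrative):

    // Store ASCII text at one byte per character in a Uint8Array.
    function toBytes(str) {
      const bytes = new Uint8Array(str.length);
      for (let i = 0; i < str.length; i++) {
        bytes[i] = str.charCodeAt(i);    // 0..127 for ASCII input
      }
      return bytes;
    }

    // Convert back, one code unit per byte.
    function fromBytes(bytes) {
      let out = "";
      for (const b of bytes) out += String.fromCharCode(b);
      return out;
    }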
The purpose of compressing 7-bit ASCII for use within JavaScript is largely (entirely?) academic these days and not something there is much demand for. Note that encoding 7-bit ASCII content as UTF-8 (for transmission or file encoding) uses only one byte per ASCII character, due to the design of UTF-8.
If you want to use 1 byte per character, you can simply use bytes. There are already built-in functions for converting between bytes and strings.
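For instance, a sketch using the standard TextEncoder/TextDecoder API, which stores pure ASCII at one byte per character because UTF-8 encodes ASCII in a single byte:

    const bytes = new TextEncoder().encode("hello");  // Uint8Array of length 5
    const text  = new TextDecoder().decode(bytes);    // "hello"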
Related
In JavaScript, especially in JSON, we can represent Unicode characters with escape sequences or with literal Unicode text.
How do the two differ?
Are there any practical implications or pitfalls in using one over the other?
Escape sequences use only ASCII characters, so they can be represented without having to worry about the encoding the JS or JSON is transmitted or saved in. They require more bytes per encoded character, and they aren't easy for a human to read when looking at source code.
None of the above are true when using a Unicode encoding (such as UTF-8).
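For illustration, a small sketch assuming the source file itself is saved as UTF-8: both forms parse to the same string, and only the byte count and readability differ.

    JSON.parse('"\\u00e9"') === JSON.parse('"é"');  // true: both parse to "é"
    '"\\u00e9"'.length;                             // 8 characters, all ASCII
    '"é"'.length;                                   // 3 characters, needs a Unicode-aware encoding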
Base64 (2^6) uses a subset of characters, usually
a-z, A-Z, 0-9, /, and +.
It does not use all 128 characters defined in ASCII because non-printable characters cannot be used.
However, each encoded character still takes up a full 8 bits.
This results in 33% wasted space (an expansion factor of 4/3).
Why can't a subset of UTF-8 be used that has 256 printable characters? Instead of the limited subset listed above, the richness of UTF-8 could be used to fill all 8 bits.
This way there would be no loss.
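For reference, the 4/3 expansion is easy to observe with the standard btoa function:

    btoa("abc");   // "YWJj"     (3 bytes in, 4 characters out)
    btoa("abcd");  // "YWJjZA==" (4 bytes in, 8 characters out, with padding)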
Base64 is used to encode arbitrary 8-bit data in systems that do not support 8-bit data, like email and XML. Its use of 7-bit ASCII characters is deliberate, so it can pass through 7-bit systems like email. It is not the only binary-to-text encoding in the world, though. yEnc, for example, tends to have lower overhead than Base64, and if your data is mostly ASCII-compatible, Quoted-Printable is almost 1-to-1.
UTFs are meant for encoding Unicode text, not arbitrary binary data. Period.
Pick an encoding that is appropriate for the data and the usage. Don't try to coerce an encoding into doing something it is not meant to do.
Why can't a subset of UTF-8 be used that has 256 printable characters? Instead of the limited subset listed above, the richness of UTF-8 could be used to fill all 8 bits.
Suppose you used a subset that contained the 94 non-space printable characters from the ASCII range (encoded in UTF-8 as 1 byte each) and 162 characters from somewhere in the U+0080 to U+07FF range (encoded in UTF-8 as 2 bytes each). Assuming a uniform distribution of values, you'd need an average of 1.6328125 bytes of text per byte of data, which is less efficient than Base64's 1.3333333.
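The arithmetic behind those figures, spelled out as a quick check:

    // Average output bytes per input byte, assuming uniform byte values:
    const utf8Subset = (94 * 1 + 162 * 2) / 256;  // 1.6328125
    const base64     = 4 / 3;                     // 1.3333...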
UTF-8 uses 2 bytes for characters 128-255, so it would use 16 bits to store 8 bits of data (50% efficiency) instead of using 8 bits to store 6 bits (75% efficiency).
In JavaScript, what would be the fastest method to encode Unicode characters outside the ASCII range into their respective %uxxxx escapes? I need to use this method to encode hundreds of KBs of data (the number of Unicode characters outside the ASCII range within this data is fairly low). I have been using 'escape' so far, but that's very slow given that it also encodes many characters other than non-ASCII ones.
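The kind of selective encoding being asked about might look like this sketch (escapeNonAscii is an illustrative name, not a built-in):

    // Escape only characters outside the ASCII range as %uXXXX,
    // leaving ASCII untouched (unlike escape(), which also rewrites
    // spaces and most punctuation).
    function escapeNonAscii(str) {
      return str.replace(/[\u0080-\uffff]/g, c =>
        "%u" + c.charCodeAt(0).toString(16).toUpperCase().padStart(4, "0"));
    }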
escape is native code. Nothing you could code in JS could beat that...
I really need a way to convert all characters from the CP1251 table to ASCII codes from 0 to 255.
The only way I have found so far is the charCodeAt() function, which only works for codes up to 128. For the higher codes it returns a Unicode number, which is not good for me.
The first 128 characters in CP1251 are the same as the characters in ASCII. After that, they represent non-ASCII Cyrillic characters, which can't be converted to ASCII.
Consider using Unicode, which was invented to solve this kind of problem.
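If what is actually wanted is the CP1251 byte value (0-255) rather than ASCII, a partial sketch covering only the contiguous Cyrillic block А..я might look like this (cp1251Byte is a hypothetical helper, and it assumes no other non-ASCII characters appear):

    // CP1251 places А..я at 0xC0..0xFF, a fixed offset of 0x350
    // below their Unicode code points U+0410..U+044F.
    function cp1251Byte(ch) {
      const code = ch.charCodeAt(0);
      if (code < 0x80) return code;                              // ASCII passes through
      if (code >= 0x0410 && code <= 0x044F) return code - 0x350; // Cyrillic А..я
      throw new Error("no CP1251 mapping in this sketch for " + ch);
    }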
The question is pretty simple: how much RAM in bytes does each character in an ECMAScript/JavaScript string consume?
I am going to guess two bytes, since the standard says they are stored as 16-bit unsigned integers?
Does this mean each character is always two bytes?
Yes, I believe that is the case. The characters are probably stored as wide strings or UCS-2 strings.
They may be UTF-16, in which case they take up two words (16-bit integers) per character for characters outside the BMP (Basic Multilingual Plane), but I believe these characters are not fully supported. Read this blog post about problems in the UTF-16 handling of ECMAScript.
Most modern languages store their strings with two-byte characters. This way you have full support for all spoken languages. It costs a little extra memory, but that's peanuts for any modern computer with multiple gigabytes of RAM. Storing the string in the more compact UTF-8 would make processing more complex and slower. UTF-8 is therefore mostly used for transport only. ASCII supports only the Latin alphabet without diacritics. ANSI is still limited and needs a specified code page to make sense.
Section 4.13.16 of ECMA-262 explicitly defines "String value" as a "primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integers". It suggests that programs use these 16-bit values as UTF-16 text, but it is legal simply to use a string to store any immutable array of unsigned shorts.
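To illustrate that last point, a small sketch: a string can hold 16-bit values that are not valid UTF-16 text at all, such as a lone surrogate.

    const s = String.fromCharCode(0xD800, 0x0041, 0xFFFF);  // a lone surrogate is allowed
    s.length;                                                // 3 code units
    [s.charCodeAt(0), s.charCodeAt(1), s.charCodeAt(2)];     // [55296, 65, 65535]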
Note that character size isn't the only thing that makes up the string size. I don't know about the exact implementation (and it might differ), but strings tend to have a 0x00 terminator to make them compatible with PChars. And they probably have some header that contains the string size and maybe some refcounting and even encoding information. A string with one character can easily consume 10 bytes or more (yes, that's 80 bits).