Base64 (2^6) uses a subset of characters, usually
a-z, A-Z, 0-9, /, +.
It does not use all 128 characters defined in ASCII because non-printable characters cannot be used.
However, each character still takes up a full 8 bits (2^8) of space.
This results in 33% overhead (a 4/3 expansion).
Why can't a subset of UTF-8 be used which has 256 printable characters? Then, instead of the limited subset listed above, the richness of UTF-8 could be used to fill all 8 bits.
This way there would be no loss.
Base64 is used to encode arbitrary 8-bit data in systems that do not support 8-bit data, like email and XML. Its use of 7-bit ASCII characters is deliberate, so it can pass through 7-bit systems such as email. It is not the only data encoding format in the world, though. yEnc, for example, tends to have lower overhead than Base64. And if your data is mostly ASCII-compatible, Quoted-Printable is almost 1-to-1.
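For illustration, a quick sketch of the 4/3 expansion mentioned above, assuming a browser environment where btoa() is available:

```javascript
// Every 3 input bytes become 4 Base64 characters.
const data = 'abcdef';                     // 6 bytes of ASCII
const encoded = btoa(data);                // "YWJjZGVm" - 8 characters
console.log(encoded.length / data.length); // 1.3333... i.e. 4/3
```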
UTFs are meant for encoding Unicode text, not arbitrary binary data. Period.
Pick an encoding that is appropriate for the data and its usage. Don't try to coerce an encoding into doing something it is not meant to do.
Why can't a subset of UTF-8 be used which has 256 printable characters? Then, instead of the limited subset listed above, the richness of UTF-8 could be used to fill all 8 bits.
Suppose you used a subset that contained the 94 non-space printable characters from the ASCII range (encoded in UTF-8 as 1 byte each) and 162 characters from somewhere in the U+0080 to U+07FF range (encoded in UTF-8 as 2 bytes each). Assuming a uniform distribution of values, you'd need an average of 1.6328125 bytes of text per byte of data, which is less efficient than Base64's 1.3333333.
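Reproducing that arithmetic in code, just to make the comparison explicit:

```javascript
// 94 one-byte characters plus 162 two-byte characters, averaged over all
// 256 possible byte values, versus Base64's fixed 4/3 expansion.
const utf8Subset = (94 * 1 + 162 * 2) / 256; // 1.6328125 bytes per data byte
const base64 = 4 / 3;                        // 1.3333... bytes per data byte
console.log(utf8Subset, base64);
```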
UTF-8 uses 2 bytes for characters 128-255, so it would use 16 bits to store 8 bits (50% efficiency) instead of using 8 bits to store 6 bits (75% efficiency).
I have a string that I know for sure has only ASCII letters.
JS treats strings as UTF-8 by default, so it means that every character takes up to 4 bytes, which is 4 times ASCII.
I'm trying to compress / save space / get as short a string as possible by writing encode and decode functions.
I thought about representing 4 ASCII characters in one UTF-8 character and thereby achieving my goal. Is there anything like that?
If not, what is the best way to compress ASCII strings so that by encoding and decoding I'll get the same string back?
Actually JavaScript encodes program strings in UTF-16, which uses 2 octets (16 bits) for Unicode characters in the BMP (Basic Multilingual Plane) and 4 octets (32 bits) for characters outside it. So internally at least, ASCII characters use 2 bytes.
There is room to pack two ASCII characters into 16 bits since they only use 7 bits each. Furthermore, since the difference between 2**16 and 2**14 is 49152, and only 2048 code points (U+D800-U+DFFF) are reserved for surrogates in UTF-16, you can devise an encoding scheme that avoids the range of code points used by surrogates.
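A minimal sketch of such a scheme, assuming the input really is 7-bit ASCII (the function names are just illustrative); the packed values stay below 0xD800, so the surrogate range is never hit:

```javascript
function packAscii(str) {
  let packed = '';
  for (let i = 0; i < str.length; i += 2) {
    const hi = str.charCodeAt(i);                                  // 0..127
    const lo = i + 1 < str.length ? str.charCodeAt(i + 1) + 1 : 0; // 1..128, 0 = padding
    packed += String.fromCharCode((hi << 8) | lo);                 // max 0x7F80 < 0xD800
  }
  return packed;
}

function unpackAscii(packed) {
  let out = '';
  for (let i = 0; i < packed.length; i++) {
    const code = packed.charCodeAt(i);
    out += String.fromCharCode(code >> 8);                         // first character
    if ((code & 0xFF) !== 0) out += String.fromCharCode((code & 0xFF) - 1); // second, if any
  }
  return out;
}

console.log(unpackAscii(packAscii('Hello!')) === 'Hello!');        // true
```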
You could also use 8 bit typed arrays to hold ASCII characters while avoiding the complexity of a custom compression algorithm.
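For example, assuming TextEncoder/TextDecoder are available (standard in browsers and Node.js), pure ASCII round-trips at one byte per character:

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder();

const bytes = encoder.encode('Hello, ASCII'); // Uint8Array of 12 bytes
const restored = decoder.decode(bytes);       // back to the original string

console.log(bytes.byteLength, restored === 'Hello, ASCII'); // 12 true
```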
The purpose of compressing 7 bit ASCII for use within JavaScript is largely (entirely?) academic these days and not something there is a demand for. Note that encoding 7 bit ASCII content into UTF-8 (for transmission or file encoding) only uses one byte for ASCII characters due to the design of UTF-8.
If you want to use 1 byte per character, you can simply store bytes. There are already built-in functions for converting bytes back to a string.
I have a system in a different language, implemented in Unicode. The condition is that the system must also accept Unicode characters (for digits) and process them accordingly. Is it possible to convert any Unicode characters that represent numbers to a sensible English-number equivalent?
How can I implement that in JavaScript?
EDIT: I searched the web and found a chart on unicode.org. There are codes corresponding to the literals I want there. Now, how do I read the code from the input Unicode string?
The Unicode Database contains, in columns 6-8, information about digit values, decimal digit values, and numeric values (for example, U+216E ROMAN NUMERAL FIVE HUNDRED has a numeric value of 500).
To use this in JavaScript, you might parse that file with some other language and dump the information you need as JSON or similar, and then just look up the value in the JSON from JavaScript.
Documentation of the Unicode Database file format
Either you dump the Unicode code points into your JSON as "\u20ac" for U+20AC, in which case you can just compare the characters directly, or you can use someString.charCodeAt(somePosition).toString(16) to convert a character to a hex string (like 20ac) and compare from there.
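A minimal sketch of that lookup approach; the table here is a tiny hand-written subset purely for illustration, and in practice you would generate it from the Unicode Database dump and load it as JSON:

```javascript
const digitValues = {
  '966': 0, '967': 1, '968': 2, '969': 3, // Devanagari digits zero..three
  '216e': 500                             // U+216E ROMAN NUMERAL FIVE HUNDRED
};

function convertDigits(input) {
  let out = '';
  for (const ch of input) {
    const key = ch.codePointAt(0).toString(16); // hex code point, e.g. "966"
    out += key in digitValues ? String(digitValues[key]) : ch;
  }
  return out;
}

console.log(convertDigits('\u0967\u0968\u0969')); // "123"
```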
What is the best way to store ca. 100 sequences of doubles directly in the JS file? Each sequence will have a length of ca. 10,000 doubles or more.
Requirements
the JavaScript file must be executed as fast as possible
it is enough for me to iterate through the sequence on demand (I do not need to decode all the numbers at JS execution time; they will be decoded on an event)
it shouldn't take too much space
The simplest option is probably to use a string in CSV format, but then the doubles are not stored in the most efficient manner, right?
Another option might be to store the numbers in a Base64 byte array, but then I have no idea how to read the Base64 string back into doubles.
EDIT:
I would like to use the doubles to transform Matrix4x4 of 3d nodes in Adobe 3D annotations. Adobe allows to import external files but it is so complicated that it might be simpler to include all the data in the js file directly.
As I mentioned in my comment, it is probably not worth it to try to encode the values. Here are some rough figures on the amount of data required to store doubles (updated from my comment).
Assuming 1,000,000 values:
Using direct encoding (won't work well in a JS file): 8 B = 8 MB
Using base64: 10.7 B = 10.7 MB
Literals (best case): 1 B + delimiter = 2 MB
Literals (worst case): 21 B + delimiter = 22 MB
Literals (average case assuming evenly distributed values): 19 B + delimiter = 20 MB
Note: A double can take 21 bytes (assuming 15 digits of precision) in the worst case like this: 1.23456789101112e-131
As you can see, even with encoding you still won't be able to cut the size below half of what plain literal values take, and if you plan on doing random-access decoding it will get complicated fast. It may be best to stick to literals. You might get some benefit from using the external file that you mentioned, but that depends on how much effort is needed to do so.
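That said, if the Base64 route from the question is still wanted, a minimal decoding sketch could look like this (assuming atob() is available; Float64Array uses the platform's byte order, so the data must have been written the same way):

```javascript
function base64ToDoubles(b64) {
  const binary = atob(b64);              // Base64 -> "binary string"
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Float64Array(bytes.buffer); // view the bytes as 64-bit doubles
}
```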
Some ideas on how to optimize using literals:
Depending on the precision required, you could approximate the values and limit a value to, say, 5 digits of precision. This would shorten the file considerably (see the sketch after this list).
You could compress the file. Any list of doubles can be expressed with just 14 distinct characters (0123456789.e-,), so theoretically you could compress such a string to about half its size. I don't know how well practical modern compression routines do, though.
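A small sketch of the precision idea from the first bullet, using the standard toPrecision method to round to significant digits:

```javascript
const values = [0.12345678910111213, 123456.78910111213];
const rounded = values.map(v => Number(v.toPrecision(5)));
console.log(JSON.stringify(rounded)); // "[0.12346,123460]"
```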
I really need a way to convert all characters from the CP1251 table to ASCII codes from 0 to 255.
The only way I have found so far is the charCodeAt() function, which only works for codes up to 128. For the upper codes it returns a Unicode number, which is not good for me.
The first 128 characters in CP1251 are the same as the characters in ASCII. After that, they represent non-ASCII Cyrillic characters, which can't be converted to ASCII.
Consider using Unicode, which was invented to solve this kind of problem.
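If CP1251 byte values really are required, the only option in JavaScript is an explicit mapping table; the table below is a tiny hand-written subset just for illustration (CP1251 places А-Я at 0xC0-0xDF):

```javascript
const cp1251 = { '\u0410': 0xC0, '\u0411': 0xC1, '\u0412': 0xC2 }; // А, Б, В

function toCp1251Byte(ch) {
  const code = ch.charCodeAt(0);
  if (code < 128) return code;          // ASCII range is identical in CP1251
  if (ch in cp1251) return cp1251[ch];
  throw new Error('No CP1251 mapping for ' + ch);
}

console.log(toCp1251Byte('A'));       // 65
console.log(toCp1251Byte('\u0410'));  // 192
```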
The question is pretty simple: how much RAM in bytes does each character in an ECMAScript/JavaScript string consume?
I am going to guess two bytes, since the standard says they are stored as 16-bit unsigned integers?
Does this mean each character is always two bytes?
Yes, I believe that is the case. The characters are probably stored as wide strings or UCS-2 strings.
They may be UTF-16, in which case they take up two words (16-bit integers) per character for characters outside the BMP (Basic Multilingual Plane), but I believe these characters are not fully supported. Read this blog post about problems in the UTF-16 implementation of ECMAScript.
Most modern languages store their strings with two-byte characters. This way you have full support for all spoken languages. It costs a little extra memory, but that's peanuts for any modern computer with multi-gigabyte RAM. Storing the string in the more compact UTF-8 would make processing more complex and slower. UTF-8 is therefore mostly used for transport only. ASCII supports only the Latin alphabet without diacritics. ANSI is still limited and needs a specified code page to make sense.
Section 4.3.16 of ECMA-262 explicitly defines "String value" as a "primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integers". It suggests that programs use these 16-bit values as UTF-16 text, but it is legal simply to use a string to store any immutable array of unsigned shorts.
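A tiny illustration of that last point: a string can hold arbitrary 16-bit values, even a lone surrogate, and they round-trip intact:

```javascript
const shorts = [0x0041, 0xD800, 0xFFFF];
const s = String.fromCharCode(...shorts);
console.log(s.length);                     // 3 (one code unit per value)
console.log(s.charCodeAt(1).toString(16)); // "d800" - the value is preserved
```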
Note that character size isn't the only thing that makes up the string size. I don't know about the exact implementation (and it might differ), but strings tend to have a 0x00 terminator to make them compatible with PChars. And they probably have some header that contains the string size and maybe some refcounting and even encoding information. A string with one character can easily consume 10 bytes or more (yes, that's 80 bits).