I would like to transfer a large array of 2-byte short integers (0-65535) via JSON, in the most compact way possible. Theoretically, the most compact would be 2 bytes per integer in binary form, but is that possible with JSON? Encoding them in hexadecimal, for example ff4a, requires 4 bytes per value, which is double the size and matters for large datasets. Is there a more compact/efficient way?
The dataset will be received and parsed by JavaScript.
Example in hexadecimal:
"array": "f1f0a1a2e3e4"
This string will be split into "f1f0", "a1a2", "e3e4" and each chunk parsed into a number. But this is 12 bytes for 3 short integers (6 bytes of actual data).
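The parsing on the receiving side would look roughly like this (just a sketch of the hex scheme described above):

// Split the hex string into 4-character chunks and parse each as a 16-bit value
const payload = { array: "f1f0a1a2e3e4" };
const values = payload.array.match(/.{4}/g).map(h => parseInt(h, 16));
console.log(values); // [61936, 41378, 58340]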
If JavaScript encodes one character as 2 bytes or more, why does the following:
Buffer.from("abc").byteLength
output 3? Shouldn't it be 6?
The default encoding of the Buffer class is 'utf-8'. ASCII characters take 1 byte each in UTF-8, so the result should be 3. Note: UTF-8 can take 1 to 4 bytes for one character.
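For example (Node.js Buffer, default 'utf-8' encoding):

console.log(Buffer.from("abc").byteLength); // 3 (ASCII characters: 1 byte each)
console.log(Buffer.from("Я").byteLength);   // 2 (Cyrillic: 2 bytes in UTF-8)
console.log(Buffer.from("€").byteLength);   // 3 (3 bytes in UTF-8)
console.log(Buffer.from("😀").byteLength);  // 4 (outside the BMP: 4 bytes)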
I'm kind of confused about the conversion of Unicode characters into hexadecimal values.
I'm using this website to get hexadecimal value for characters. (https://www.branah.com/unicode-converter)
If I put "A" and convert then I get something like:
0041 --> UTF-16
00000041 --> UTF-32
41 --> UTF-8
00065 --> Decimal Value
The above output makes sense because all of these hexadecimal values convert to 65.
Now, if I put in "Я" (without quotes) and convert it, then I get values like:
042f --> UTF-16
0000042f --> UTF-32
d0af --> UTF-8
01071 --> Decimal Value
This output doesn't make sense to me because not all these hexadecimal values convert back to 1071.
If you take d0af and try to convert it back to a decimal value, you will get 53423.
This is really confusing for me, and I've been searching online for answers about this conversion, but so far I haven't found a good one.
So, I'm wondering if anyone here can help (that would mean a lot). Thanks in advance.
You can also see the link below for an example of this conversion in binary. (And can you explain why the UTF-8 binary value is different in the last example?)
http://kunststube.net/encoding/
UTF-8 uses a variable-length encoding (it can use 1, 2, 3 or 4 bytes to store a single character).
In this case:
d0af = 11010000 10101111
The 110 at the start tells us to expect 2 bytes when decoding it (this is the byte 1 pattern in the standard UTF-8 layout). When decoding, we use the binary digits that follow the first 0 in the byte. So,
in 110x xxxx the x's are the first bits of the actual Unicode value. Every additional byte follows the pattern 10xx xxxx. So taking the payload bits from bytes 1 & 2 we get:
110[10000] 10[101111]
→ 10000 101111 (concatenated: 10000101111) = 42f hex = 1071 decimal
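A quick sketch of the same bit manipulation in JavaScript (decodeTwoByteUtf8 is just an illustrative name, and it only handles two-byte sequences):

function decodeTwoByteUtf8(byte1, byte2) {
  const high = byte1 & 0b00011111; // payload bits after the 110 prefix
  const low = byte2 & 0b00111111;  // payload bits after the 10 prefix
  return (high << 6) | low;        // concatenate the payload bits
}

console.log(decodeTwoByteUtf8(0xd0, 0xaf).toString(16)); // "42f"
console.log(String.fromCharCode(0x42f));                 // "Я"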
The reason this is done is that common characters need fewer bytes for transmission and storage, but on the odd occasion that an uncommon character is needed it can still be represented as part of UTF-8.
If you have any questions, please comment.
I'm having some trouble converting big binary strings to base 10. parseInt(string, 2) should return the int, but when using big strings (1800 characters) it maxes out the Number type and just returns Infinity. How can I get around that?
An 1800-bit binary number would be well over the maximum value JavaScript can represent (Number.MAX_VALUE is about 1.8e308, roughly 2^1024). A regular Number datatype will not be able to hold that value, so JavaScript just calls it Infinity. If you need arbitrarily large numbers, you will have to use some bignum library and probably write a custom string-to-number function.
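As a sketch of such a custom function (assuming you only need the decimal digits as a string; a bignum library, or native BigInt where available, would do the same job):

// Convert an arbitrarily long binary string to a decimal string without
// ever holding the value in a Number.
function binaryToDecimalString(bin) {
  let digits = [0]; // decimal digits, least-significant first
  for (const bit of bin) {
    let carry = bit === "1" ? 1 : 0;
    for (let i = 0; i < digits.length; i++) { // digits = digits * 2 + bit
      const value = digits[i] * 2 + carry;
      digits[i] = value % 10;
      carry = Math.floor(value / 10);
    }
    if (carry) digits.push(carry);
  }
  return digits.reverse().join("");
}

console.log(binaryToDecimalString("101111")); // "47"
// Works for 1800-character inputs too, where parseInt would return Infinity.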
Does anyone have a good understanding/explanation of how the heap size of strings is determined in JavaScript with Chrome (V8)?
Some examples of what I see in a heap dump:
1) Multiple copies of an identical 2-character string (i.e. "dt") with different object IDs, all designated as OneByteStrings. The heap dump says each copy has a shallow & retained size of 32 bytes. It isn't clear how a two-character string has a retained size of 32, or why the strings don't appear to be interned.
2) A long object path string which is 78 characters long. All characters would be a single byte in UTF-8. It is classified as an InternalizedString and has a retained size of 184 bytes. Even with a 2-byte character encoding that would still not account for the remaining 28 bytes. Why are these path strings taking up so much space? I could imagine another 4 bytes (maybe 8) being used for an address and another 4 for storing the string length, but that still leaves 16 bytes even with a 2-byte character encoding.
Internally, V8 has a number of different representations for strings:
SeqOneByteString: The simplest, contains a few header fields and then the string's bytes (not UTF-8 encoded, can only contain characters in the first 256 unicode code points)
SeqTwoByteString: Same, but uses two bytes for each character (using surrogate pairs to represent unicode characters that can't be represented in two bytes).
SlicedString: A substring of some other string. Contains a pointer to the "parent" string and an offset and length.
ConsString: The result of adding two strings (if over a certain size). Contains pointers to both strings (which may themselves be any of these types of strings).
ExternalString: Used for strings that have been passed in from outside of V8.
"Internalized" is just a flag, the actual string representation could be any of the above.
All of these have a common parent class String, whose parent is Name, whose parent is HeapObject (which is the root of the V8 class hierarchy for objects allocated on the V8 heap).
HeapObject has one field: the pointer to its Map (there's a good explanation of these here).
Name adds one additional field: a hash value.
String adds another field: the length.
On a 32-bit system, each of these is 4 bytes. On a 64-bit system, each one is 8 bytes.
If you're on a 64-bit system then the minimum size of a SeqOneByteString will be 32 bytes: 24 bytes for the header fields described above plus at least one byte for the string data, rounded up to a multiple of 8.
Regarding your second question, it's difficult to say exactly what's going on. It could be that the string is using a 2-byte representation and its header fields are pushing up the size above what you are expecting, or it could be that it's a ConsString or a SlicedString (whose retained sizes would include the strings that it points to).
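As a back-of-the-envelope illustration (assuming the three 8-byte header fields described above and allocation sizes rounded up to a multiple of 8 on a 64-bit build):

function seqStringSize(charCount, bytesPerChar) {
  const headerBytes = 3 * 8; // map pointer + hash + length
  const dataBytes = bytesPerChar * charCount;
  return Math.ceil((headerBytes + dataBytes) / 8) * 8; // round up to 8
}

console.log(seqStringSize(2, 1));  // 32  -> matches the "dt" example
console.log(seqStringSize(78, 1)); // 104 -> one-byte data alone doesn't explain 184
console.log(seqStringSize(78, 2)); // 184 -> a two-byte representation would match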
V8 doesn't internalize strings most of the time - it internalizes string constants and identifier names that it finds during parsing, and strings that are used as object property keys, and probably a few other cases.
In my JS-application I'm uploading documents to a server.
The documents are stored in Uint8Arrays. That means the documents are represented as arrays of integers.
Before uploading, I JSON.stringify the documents.
Here you have an example:
var document = [0,5,5,7,2,234,1,4,2,4]
JSON.stringify(document)
=> "[0,5,5,7,2,234,1,4,2,4]"
Then I send the JSON-representation to the Server.
My problem is that the JSON representation is much bigger than the original integer array. I guess that's because the array is transformed to a JSON string. I send much more data to the server than needed. How can I store the data more compactly in JSON?
I thought the JSON representation might be smaller if I convert the array to Base64 first and then to JSON.
What are your ideas? Thanks
The integer-array JSON representation results in an average output/input ratio of about 3.57 (averaged over the 256 possible byte values):
10 values are represented by 2 bytes (one digit, one comma)
90 values by 3 bytes (2 digits, one comma)
156 values by 4 bytes (3 digits, one comma)
On the other hand, base64 will result in an average 1.333... ratio (3 bytes are encoded as 4).
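A quick comparison sketch (using the browser's btoa; Node's Buffer.from(bytes).toString('base64') is the equivalent):

const bytes = new Uint8Array(256).map((_, i) => i); // one of each byte value 0-255

const asJsonArray = JSON.stringify(Array.from(bytes));
const asBase64 = JSON.stringify(btoa(String.fromCharCode(...bytes)));

console.log(asJsonArray.length); // 915 = 658 digits + 255 commas + 2 brackets (≈ 3.57 per value)
console.log(asBase64.length);    // 346 = 344 base64 characters + 2 quotes (≈ 1.33 per value)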
If you have mostly ASCII-like characters in your array (i.e. in the range 32-126), you would probably be better off just sending them as strings (with a few characters escaped), but not if you have random 8-bit data.
You could use some kind of base94 representation to get a better ratio over base64, but is it really worth the cost?
Also note that you may want to consider compressing the data.
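A minimal sketch of that idea, assuming Node.js (zlib is built in; the sample bytes just stand in for your document data):

const zlib = require("zlib");

// Example data; in practice this would be your document's Uint8Array
const bytes = Buffer.from([0, 5, 5, 7, 2, 234, 1, 4, 2, 4]);

const compressed = zlib.gzipSync(bytes);
const payload = JSON.stringify({ document: compressed.toString("base64") });

// The receiver reverses the steps:
const restored = zlib.gunzipSync(Buffer.from(JSON.parse(payload).document, "base64"));
// restored now holds the original bytes (note: gzip only pays off for
// larger / repetitive data; for 10 bytes the overhead dominates).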