In my JS application I'm uploading documents to a server.
The documents are stored in Uint8Arrays, i.e. each document is represented as an array of integers.
Before uploading the documents I JSON.stringify them.
Here is an example:
var document = [0,5,5,7,2,234,1,4,2,4]
JSON.stringify(document)
=> "[0,5,5,7,2,234,1,4,2,4]"
Then I send the JSON representation to the server.
My problem is that the JSON representation is much bigger than the original integer array. I guess that's because the array is transformed into a JSON string. I send much more data to the server than needed. How can I store the data more compactly in JSON?
I thought the JSON representation might be smaller if I convert the array to Base64 first and then to JSON.
What are your ideas? Thanks
JSON-encoding the integer array results in an average ratio of about 3.57 between output and input size:
the 10 values 0-9 are represented by 2 bytes (one digit, one comma)
the 90 values 10-99 by 3 bytes (2 digits, one comma)
the 156 values 100-255 by 4 bytes (3 digits, one comma)
On the other hand, base64 will result in an average 1.333... ratio (3 bytes are encoded as 4).
If you have mostly ASCII-like characters in your array (i.e. in the range 32-126), you would probably be better off just sending them as strings (with a few characters escaped), but not if you have random 8-bit data.
You could use some kind of base94 representation to get a better ratio over base64, but is it really worth the cost?
Also note that you may want to consider compressing the data.
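For example, in a browser you could Base64-encode the Uint8Array before stringifying it. A minimal sketch, assuming btoa is available (the variable names are just for illustration):
var bytes = new Uint8Array([0, 5, 5, 7, 2, 234, 1, 4, 2, 4]);
var binary = '';
for (var i = 0; i < bytes.length; i++) {
  binary += String.fromCharCode(bytes[i]);   // build a binary string from the byte values
}
var base64 = btoa(binary);                   // "AAUFBwLqAQQCBA==" for the array above
var payload = JSON.stringify({ data: base64 });
// The receiver decodes with atob() and charCodeAt() to get the bytes back.
For this example array that is 16 characters instead of the 23-character integer list, and the gap grows for arrays with many 3-digit values.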
Related
I'm kind of confused about the conversion of Unicode characters into hexadecimal values.
I'm using this website to get the hexadecimal values for characters. (https://www.branah.com/unicode-converter)
If I put in "A" and convert it, then I get something like:
0041 --> UTF-16
00000041 --> UTF-32
41 --> UTF-8
00065 --> Decimal Value
The above output makes sense, because all of these hexadecimal values convert to 65.
Now, if I put in "Я" (without quotes) and convert it, then I get values like:
042f --> UTF-16
0000042f --> UTF-32
d0af --> UTF-8
01071 --> Decimal Value
This output doesn't make sense to me because not all these hexadecimal values convert back to 1071.
If you take d0af and try to convert it back to a decimal value, then you will get 53423.
This is something that is really confusing for me, and I've been searching online for answers about this conversion, but so far I've not been able to find any good answer.
So, I'm wondering if anyone here can help. (that would mean a lot) // Thanks in advance.
You can also see the link below for an example of this conversion in binary. (And can you explain why the UTF-8 binary value is different in the last example?)
http://kunststube.net/encoding/
UTF-8 uses a variable-length encoding (it can use 1, 2, 3 or 4 bytes to store a single character).
In this case:
d0af = 11010000 10101111
The leading 110 tells us to expect a 2-byte sequence when decoding (this is the byte 1 pattern in the standard UTF-8 layout table). When decoding, we use only the binary digits that follow that prefix in each byte. So,
110x xxxx: the x's are the first group of bits of the actual Unicode code point. Every additional byte follows the pattern 10xx xxxx. Taking the payload bits from bytes 1 and 2 we get:
110[10000] 10[101111]
      v          v
    10000     101111  =  10000101111 binary  =  0x42f  =  1071
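The same arithmetic in JavaScript (a minimal sketch; TextDecoder is assumed to be available for the cross-check):
var bytes = [0xd0, 0xaf];
// 2-byte sequence: 5 payload bits from byte 1, 6 payload bits from byte 2
var codePoint = ((bytes[0] & 0x1f) << 6) | (bytes[1] & 0x3f);
console.log(codePoint);                                                // 1071 (0x42f)
console.log(String.fromCharCode(codePoint));                           // "Я"
console.log(new TextDecoder('utf-8').decode(new Uint8Array(bytes)));   // "Я" again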
The reason this is done is that fewer bytes are needed to transmit and store common characters, but on the odd occasion that an uncommon character is needed, it can still be represented as part of UTF-8.
If you have any questions, please comment.
I would like to transfer a large array of 2-byte short integers (0-65535) via JSON, in the most compact way possible. Theoretically, the most compact form would be 2 bytes per integer in binary format, but is that possible with JSON? Encoding them in hexadecimal, for example ff4a, will require 4 bytes, which is double the size and can matter for large datasets. Is there a more compact/efficient way?
The dataset will be received and parsed by JavaScript.
Example in hexadecimal:
"array": "f1f0a1a2e3e4"
This string will be split into "f1f0", "a1a2", "e3e4" and each part parsed into a number,
but this is 12 bytes for 3 short integers (6 bytes).
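For reference, the hexadecimal approach described above could look roughly like this on the JavaScript side (a sketch; the value "f1f0a1a2e3e4" is the example from the question):
var hex = "f1f0a1a2e3e4";
var numbers = hex.match(/.{4}/g).map(function (h) {
  return parseInt(h, 16);                 // each 4-character chunk is one 16-bit value
});
console.log(numbers);                     // [61936, 41378, 58340]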
Overview:
I'm building a Javascript tool inside a web page. Except for loading that page, the tool will run without server communication. A user will select a local file containing multiple binary records, each with an x'F0 start byte and an x'F0 end byte. The data in between is constrained to x'00 - x'7F and consists of:
bit maps
1-byte numbers
2-byte numbers, low order byte first
a smattering of ASCII characters
The records vary in lengths and use different formats.
[It's a set of MIDI Sysex messages, probably not relevant].
The local file is read via reader.readAsArrayBuffer and then processed thus:
var contents = event.target.result;
var bytes = new Uint8Array(contents);
var rawAccum = '';
for (x = 0; x < bytes.length; x++) {
rawAccum += bytes[x];
}
var records = rawAccum.split(/\xF0/g);
I expect this to split the string into an array of its constituent records, deleting the x'F0 start byte in the process.
It actually does very little. records.length is 1 and records[0] contains the entire input stream.
[The actual split code is: var records = rawAccum.split(/\xF0\x00\x00\x26\x02/g); which should remove several identical bytes from the start of each record. When this failed I tried the abbreviated version above, with identical (non)results.]
I've looked at the documentation for split() and at several explanations of \xXX in regex references. Clearly something does not work the way I have deduced. My experience with JavaScript is minimal and sporadic.
How can I split a string of binary data at the occurrence of a specific binary byte?
The splitting appears to work correctly:
var rawAccum = "\xf0a\xf0b\xf0c\xf0"
console.log( rawAccum.length); // 7
var records = rawAccum.split(/\xF0/g);
console.log(records); // "", "a", "b", "c", ""
but the conversion of the array buffer to a string looks suspicious. Try converting the unsigned byte value to a string before appending it to rawAccum:
for (x = 0; x < bytes.length; x++) {
  rawAccum += String.fromCharCode(bytes[x]);
}
Data conversions (update after comment)
The FileReader reads the file into an ArrayBuffer in memory, but JavaScript does not provide direct access to the bytes of an ArrayBuffer. You can either create and initialize a typed array from the buffer (e.g. using the Uint8Array constructor, as in the post), or access bytes in the buffer using a DataView object. Methods of DataView objects can convert sequences of bytes at specified positions into integers of various types, such as the 16-bit integers in the MIDI sysex records.
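For example, a 2-byte number stored low order byte first (as in the records described above) could be read like this (a sketch; the offsets are made up for illustration):
var view = new DataView(contents);    // contents is the ArrayBuffer from the FileReader
var firstByte = view.getUint8(0);     // single byte at offset 0
var value = view.getUint16(1, true);  // 16-bit integer at offset 1, true = little-endian
console.log(firstByte, value);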
JavaScript strings use sequences of 16 bit values to hold characters, where each character uses one or two 16 bit values encoded using UTF-16 character encoding. 8 bit characters use only the lower 8 bits of a single 16 bit value to store their Unicode code point.
It is possible to convert an array buffer of octet values into a "binary string", by storing each byte value from the buffer in the low order bits of a 16-bit character and appending it to an existing string. This is what the post attempts to do. But in JavaScript, strings (and individual characters, which are just strings of length 1) are not a subset of integer numbers; they have their own data type, "string".
So to convert an unsigned 8 bit number to a JavaScript 16 bit character of type "string", use the fromCharCode static method of the global String object, as in
rawAccum += String.fromCharCode( bytes[x]);
Calling String.fromCharCode is also how to convert an ASCII character code located within MIDI data to a character in JavaScript.
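For example:
console.log(String.fromCharCode(0x41)); // "A" (0x41 is the ASCII code for "A")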
To convert a binary string character derived from an 8 bit value back into a number, use the String instance method charCodeAt on a string value and provide the character position:
var byteValue = "\xf0".charCodeAt(0);
returns the number 0xf0, or 240 decimal.
If you append a number to a string, as in the question, the number is implicitly converted to a decimal string representation of its value first:
"" + 0xf0 + 66 // becomes the string "24066"
Note that an array buffer can be inspected using a Uint8Array created from it, sliced into pieces using the buffer's slice method, and have integers of various types extracted from it using data views. Please review whether creating a binary string remains the best way to extract and interpret the MIDI record contents.
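As a sketch of that alternative, the records could be split directly on the typed array, mirroring the behaviour of split() above (subarray is a standard Uint8Array method; the variable names are just for illustration):
var bytes = new Uint8Array(contents);
var records = [];
var start = 0;
for (var i = 0; i < bytes.length; i++) {
  if (bytes[i] === 0xF0) {
    records.push(bytes.subarray(start, i)); // may be empty, like the "" entries from split()
    start = i + 1;
  }
}
records.push(bytes.subarray(start));        // whatever follows the last 0xF0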
Context
I know that in Python you can create a string representing the packed bytes of an array by doing something like this:
import numpy as np
np.array([1, 2], dtype=np.int32).tobytes()
# returns '\x01\x00\x00\x00\x02\x00\x00\x00'
np.array([1, 2], dtype=np.float32).tobytes()
# returns '\x00\x00\x80?\x00\x00\x00@'
And they can be decoded using np.fromstring
Question
Currently, my JavaScript is receiving a string of packed bytes that encodes an array of floats (i.e. '\x00\x00\x80?\x00\x00\x00@') and I need to decode the array -- what is the best way to do this?
(if it were an array of ints I imagine I could use text-encoding to pull the bytes and then just multiply and add appropriately...)
Thanks,
First, you have to convert a string into a buffer, and then create a Float32Array on that buffer. Optionally, spread it to create a normal Array:
str = '\x00\x00\x80?\x00\x00\x00@'
bytes = Uint8Array.from(str, c => c.charCodeAt(0))
floats = new Float32Array(bytes.buffer)
console.log(floats)
console.log([...floats]);
Floating-point representation can vary from platform to platform. It isn't a good idea to use a binary serialized form of a floating-point number to convey a number from one machine to another.
That said, the answers to this question might help you, provided the encoding used by the JavaScript matches that of the source of the numbers:
Read/Write bytes of float in JS
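If you want the byte order to be explicit instead of relying on the platform's native endianness (the concern raised above), a DataView can read the floats with a stated endianness. A minimal sketch:
var str = '\x00\x00\x80?\x00\x00\x00@';
var bytes = Uint8Array.from(str, function (c) { return c.charCodeAt(0); });
var view = new DataView(bytes.buffer);
var floats = [];
for (var offset = 0; offset < bytes.length; offset += 4) {
  floats.push(view.getFloat32(offset, true)); // true = interpret the 4 bytes as little-endian
}
console.log(floats); // [1, 2]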
Does anyone have a good understanding/explanation of how the heap sizes of strings are determined in JavaScript with Chrome (V8)?
Some examples of what I see in a heap dump:
1) Multiple copies of an identical 2-character string (i.e. "dt") with different object IDs, all designated as OneByteStrings. The heap dump says each copy has a shallow & retained size of 32 bytes. It isn't clear why a two-character string has a retained size of 32, and why the strings don't appear to be interned.
2) A long object path string which is 78 characters long. All characters would be a single byte in UTF-8. It is classified as an InternalizedString. It has a 184-byte retained size. Even with a 2-byte character encoding that would still not account for the remaining 28 bytes. Why are these path strings taking up so much space? I could imagine another 4 bytes (maybe 8) being used for an address and another 4 for storing the string length, but that still leaves 16 bytes even with a 2-byte character encoding.
Internally, V8 has a number of different representations for strings:
SeqOneByteString: The simplest, contains a few header fields and then the string's bytes (not UTF-8 encoded, can only contain characters in the first 256 unicode code points)
SeqTwoByteString: Same, but uses two bytes for each character (using surrogate pairs to represent unicode characters that can't be represented in two bytes).
SlicedString: A substring of some other string. Contains a pointer to the "parent" string and an offset and length.
ConsString: The result of adding two strings (if over a certain size). Contains pointers to both strings (which may themselves be any of these types of strings).
ExternalString: Used for strings that have been passed in from outside of V8.
"Internalized" is just a flag, the actual string representation could be any of the above.
All of these have a common parent class String, whose parent is Name, whose parent is HeapObject (which is the root of the V8 class hierarchy for objects allocated on the V8 heap).
HeapObject has one field: the pointer to its Map (there's a good explanation of these here).
Name adds one additional field: a hash value.
String adds another field: the length.
On a 32-bit system, each of these is 4 bytes. On a 64-bit system, each one is 8 bytes.
If you're on a 64-bit system then the minimum size of a SeqOneByteString will be 32 bytes: 24 bytes for the header fields described above plus at least one byte for the string data, rounded up to a multiple of 8. That matches the 32 bytes you saw for the two-character strings: 24 header bytes + 2 bytes of data = 26, rounded up to 32.
Regarding your second question, it's difficult to say exactly what's going on. It could be that the string is using a 2-byte representation and its header fields are pushing up the size above what you are expecting, or it could be that it's a ConsString or a SlicedString (whose retained sizes would include the strings that it points to).
V8 doesn't internalize strings most of the time - it internalizes string constants and identifier names that it finds during parsing, and strings that are used as object property keys, and probably a few other cases.