Convert array of bytes to base128 valid JSON string - javascript

I want to send big array of bytes using JSON (I was inspired by this question), to have small overhead I wana to use base128 encoding (which can in fact produce valid json string). But unfortunately I was unable to find some procedures which do that conversions in JS. I will publish my procedures as answer to this question, however may be someone has shorter procedures or may be better idea to effective sending binary data inside JSON.

ES6:
Encode
let bytesToBase128 = (bytesArr) => {
// 128 characters to encode as json-string
let c= "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz¼½ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
let fbits=[];
let bits = (n,b=8) => [...Array(b)].map((x,i)=>n>>i&1);
bytesArr.map(x=> fbits.push(...bits(x)));
let fout=[];
for(let i =0; i<fbits.length/7; i++) {
fout.push(parseInt(fbits.slice(i*7, i*7+7).reverse().join(''),2))
};
return (fout.map(x => c[x])).join('');
}
// Example
// bytesToBase128([23, 45, 65, 129, 254, 42, 1, 255]) => "NÚ4AèßÊ0ÿ1"
Decode
let base128ToBytes = (base128str) => {
// 128 characters to encode as json-string
let c= "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz¼½ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
dfout = base128str.split('').map(x=>c.indexOf(x));
let dfbits = [];
let bits = (n,b=8) => [...Array(b)].map((x,i)=>n>>i&1);
dfout.map(x=> dfbits.push(...bits(x,7) ));
let dfbytes=[];
let m1 = dfbits.length%8 ? 1 : 0;
for(let i =0; i<dfbits.length/8-m1; i++) {
dfbytes.push(parseInt(dfbits.slice(i*8, i*8+8).reverse().join(''),2))
};
return dfbytes;
}
// Example
// base128ToBytes("NÚ4AèßÊ0ÿ1") => [23, 45, 65, 129, 254, 42, 1, 255]
I embeded here bits function - here. The coversion idea here is to convert bytes array to bit array and then take each 7 bits (the value is from 0 to 127) as character number i charakter list c. In decoding we change each character number 7-bit number and create array, and then take each 8-bit packages of this array and interpret them as byte.
To view characters from ASCI and choose 128 from them (which is arbitrary) I type in console
[...Array(256)].map((x,i) => String.fromCharCode(i)).join('');
I try to avoid characters that has "special meaning" in different contexts like ! # # $ % ' & ...
And here is working example (which convert Float32Array to json).
Tested on Chrome, Firefox and Safari
Conclusion
After conversion bytes array to base128 string (which is valid json) the output string is less than 15% bigger than input array.
Update
A dig a little bit more, and expose that when we send characters which codes are bigger than 128 (¼½ÀÁÂÃÄ...) then chrome in fact send TWO characters (bytes) instead one :( - I made test in this way: type in url bar chrome://net-internals/#events (and send POST request) and in URL_REQUEST> HTTP_STREAM_REQUEST > UPLOAD_DATA_STREAM_INIT > total_size we see that request are two times bigger when body contains charaters witch codes bigger than 128. So in fact we don't have profit with sending this characters :( . For base64 strings we not observe such negative behaviour - However I left this procedures because they may bye used in other purposes than sending (like better alternative to storage binary data in localstorage than base64 - however probably there exists even better ways...?). UPDATE 2019 here.

Related

Javascript Equvivalant to BinaryReader.ReadString() from C#

I am converting some C# code into JavaScript code and while this file has multiple datatypes and I found a matching functionality in Javascrip from across the libraries, I am not able to find one particular function in JS.
That function is https://learn.microsoft.com/en-us/dotnet/api/system.io.binaryreader.readstring?view=net-7.0
There are couple of questions that I have:
First of all what confuses me is that isn't a string inherently a variable length variable? If so, how can this function not take a length argument?
Let's assume that there is some cap on the length of the string. If so, does JS/TS have any similar functionality? Or any package that I can download to mimic the C# functionality?
Thank you in advance.
BinaryReader expects strings to be encoded in specific format - the format BinaryWriter writes them. As stated in documentation:
Reads a string from the current stream. The string is prefixed with
the length, encoded as an integer seven bits at a time
So length of the string is stored right before the string itself, encoded "as integer seven bits at a time". We can get more info about that from BinaryWriter.Write7BitEncodedInt:
The integer of the value parameter is written out seven bits at a
time, starting with the seven least-significant bits. The high bit of
a byte indicates whether there are more bytes to be written after this
one.
If value will fit in seven bits, it takes only one byte of space. If
value will not fit in seven bits, the high bit is set on the first
byte and written out. value is then shifted by seven bits and the next
byte is written. This process is repeated until the entire integer has
been written.
So it's variable-length encoding: unlike the usual approach to always use 4 bytes for Int32 value, this approach uses variable number of bytes. That way the length of short string can take less than 4 bytes (strings with length less than 128 bytes will take just 1 byte for example).
You can reproduce this logic in javascript - just read one byte at a time. Lowest 7-bits represent (part of) the length information, and highest bit indicates whether next byte also represents length information (otherwise it's the start of actual string).
Then when you got the length - use TextDecoder to decode byte array into string of given encoding. Here is the same function in typescript. It accepts buffer (Uint8Array), offset into that buffer and encoding (by default UTF-8, check docs of TextDecoder for other available encodings):
class BinaryReader {
getString(buffer: Uint8Array, offset: number, encoding: string = "utf-8") {
let length = 0; // length of following string
let cursor = 0;
let nextByte: number;
do {
// just grab next byte
nextByte = buffer[offset + cursor];
// grab 7 bits of current byte, then shift them according to this byte position
// that is if that's first byte - do not shift, second byte - shift by 7, etc
// then merge into length with or.
length = length | ((nextByte & 0x7F) << (cursor * 7));
cursor++;
}
while (nextByte >= 0x80); // do this while most significant bit is 1
// get a slice of the length we got
let sliceWithString = buffer.slice(offset + cursor, offset + cursor + length);
let decoder = new TextDecoder(encoding);
return decoder.decode(sliceWithString);
}
}
Worth adding various sanity checks into the above code if will be used in production (that we do not read too much bytes reading length, that calculated length is actually in bounds of buffer etc).
Small test, using binary representation of string "TEST STRING", written by BinaryWriter.Write(string) in C#:
let buffer = new Uint8Array([12, 84, 69, 83, 84, 32, 83, 84, 82, 73, 78, 71, 33]);
let reader = new BinaryReader();
console.log(reader.getString(buffer, 0, "utf-8"));
// outputs TEST STRING
Update. You mention in comments that in your data the length of the string is represented by 4 bytes, so for example length 29 is represented by [0, 0, 0, 29]. That means your data was not written using BinaryWriter, and so cannot be read using BinaryReader, so you don't actually need analog of BinaryReader.GetString, contrary to what your question asks.
Anyway if you need to handle such case - you can do it:
class BinaryReader {
getString(buffer: Uint8Array, offset: number, encoding: string = "utf-8") {
// create a view over first 4 bytes starting at offset
let view = new DataView(buffer.buffer, offset, 4);
// read those 4 bytes as int 32 (big endian, since your example is like that)
let length = view.getInt32(0);
// get a slice of the length we got
let sliceWithString = buffer.slice(offset + 4, offset + 4 + length);
let decoder = new TextDecoder(encoding);
return decoder.decode(sliceWithString);
}
}

How to find the memory of the javascript hasmap? [duplicate]

I have a javascript string which is about 500K when being sent from the server in UTF-8. How can I tell its size in JavaScript?
I know that JavaScript uses UCS-2, so does that mean 2 bytes per character. However, does it depend on the JavaScript implementation? Or on the page encoding or maybe content-type?
You can use the Blob to get the string size in bytes.
Examples:
console.info(
new Blob(['😂']).size, // 4
new Blob(['👍']).size, // 4
new Blob(['😂👍']).size, // 8
new Blob(['👍😂']).size, // 8
new Blob(['I\'m a string']).size, // 12
// from Premasagar correction of Lauri's answer for
// strings containing lone characters in the surrogate pair range:
// https://stackoverflow.com/a/39488643/6225838
new Blob([String.fromCharCode(55555)]).size, // 3
new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);
This function will return the byte size of any UTF-8 string you pass to it.
function byteCount(s) {
return encodeURI(s).split(/%..|./).length - 1;
}
Source
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source
If you're using node.js, there is a simpler solution using buffers :
function getBinarySize(string) {
return Buffer.byteLength(string, 'utf8');
}
There is a npm lib for that : https://www.npmjs.org/package/utf8-binary-cutter (from yours faithfully)
String values are not implementation dependent, according the ECMA-262 3rd Edition Specification, each character represents a single 16-bit unit of UTF-16 text:
4.3.16 String Value
A string value is a member of the type String and is a
finite ordered sequence of zero or
more 16-bit unsigned integer values.
NOTE Although each value usually
represents a single 16-bit unit of
UTF-16 text, the language does not
place any restrictions or requirements
on the values except that they be
16-bit unsigned integers.
These are 3 ways I use:
TextEncoder
new TextEncoder().encode("myString").length
Blob
new Blob(["myString"]).size
Buffer
Buffer.byteLength("myString", 'utf8')
Try this combination with using unescape js function:
const byteAmount = unescape(encodeURIComponent(yourString)).length
Full encode proccess example:
const s = "1 a ф № # ®"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1,a-1,ф-2,№-3,#-1,®-2]
const s4 = escape(s3); // length is 39
const s5 = decodeURIComponent(s4); // length is 11
Note that if you're targeting node.js you can use Buffer.from(string).length:
var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)
The size of a JavaScript string is
Pre-ES6: 2 bytes per character
ES6 and later: 2 bytes per character,
or 5 or more bytes per character
Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 strings can use 3 or 4 byte characters, it would violate 2 byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two byte characters used are valid UTF-16 characters. In other words, Pre-ES6 JavaScript strings support a subset of UTF-16 characters.
ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using a unicode escape looks like this: \u{1D306}
Practical notes
This doesn't relate to the internal implemention of a particular engine. For
example, some engines use data structures and libraries with full
UTF-16 support, but what they provide externally doesn't have to be
full UTF-16 support. Also an engine may provide external UTF-16
support as well but is not mandated to do so.
For ES6, practically speaking characters will never be more than 5
bytes long (2 bytes for the escape point + 3 bytes for the Unicode
code point) because the latest version of Unicode only has 136,755
possible characters, which fits easily into 3 bytes. However this is
technically not limited by the standard so in principal a single
character could use say, 4 bytes for the code point and 6 bytes
total.
Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:
getStringMemorySize = function( _string ) {
"use strict";
var codePoint
, accum = 0
;
for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
codePoint = _string.charCodeAt( stringIndex );
if( codePoint < 0x100 ) {
accum += 1;
continue;
}
if( codePoint < 0x10000 ) {
accum += 2;
continue;
}
if( codePoint < 0x1000000 ) {
accum += 3;
} else {
accum += 4;
}
}
return accum * 2;
}
Examples:
getStringMemorySize( 'I' ); // 2
getStringMemorySize( '❤' ); // 4
getStringMemorySize( '𠀰' ); // 8
getStringMemorySize( 'I❤𠀰' ); // 14
The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.
byteCount(String.fromCharCode(55555))
// URIError: URI malformed
This longer function should handle all strings:
function bytes (str) {
var bytes=0, len=str.length, codePoint, next, i;
for (i=0; i < len; i++) {
codePoint = str.charCodeAt(i);
// Lone surrogates cannot be passed to encodeURI
if (codePoint >= 0xD800 && codePoint < 0xE000) {
if (codePoint < 0xDC00 && i + 1 < len) {
next = str.charCodeAt(i + 1);
if (next >= 0xDC00 && next < 0xE000) {
bytes += 4;
i++;
continue;
}
}
}
bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
}
return bytes;
}
E.g.
bytes(String.fromCharCode(55555))
// 3
It will correctly calculate the size for strings containing surrogate pairs:
bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)
The results can be compared with Node's built-in function Buffer.byteLength:
Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3
Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
A single element in a JavaScript String is considered to be a single UTF-16 code unit. That is to say, Strings characters are stored in 16-bit (1 code unit), and 16-bit is equal to 2 bytes (8-bit = 1 byte).
The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The codePointAt() can be used to return the entire code point value for Unicode characters, e.g. UTF-32.
When a UTF-16 character can't be represented in a single 16-bit code unit, it will have a surrogate pair and therefore use two code units( 2 x 16-bit = 4 bytes)
See Unicode encodings for different encodings and their code ranges.
The Blob interface's size property returns the size of the Blob or File in bytes.
const getStringSize = (s) => new Blob([s]).size;
I'm working with an embedded version of the V8 Engine.
I've tested a single string. Pushing each step 1000 characters. UTF-8.
First test with single byte (8bit, ANSI) Character "A" (hex: 41).
Second test with two byte character (16bit) "Ω" (hex: CE A9) and the
third test with three byte character (24bit) "☺" (hex: E2 98 BA).
In all three cases the device prints out of memory at
888 000 characters and using ca. 26 348 kb in RAM.
Result: The characters are not dynamically stored. And not with only 16bit. - Ok, perhaps only for my case (Embedded 128 MB RAM Device, V8 Engine C++/QT) - The character encoding has nothing to do with the size in ram of the javascript engine. E.g. encodingURI, etc. is only useful for highlevel data transmission and storage.
Embedded or not, fact is that the characters are not only stored in 16bit.
Unfortunally I've no 100% answer, what Javascript do at low level area.
Btw. I've tested the same (first test above) with an array of character "A".
Pushed 1000 items every step. (Exactly the same test. Just replaced string to array) And the system bringt out of memory (wanted) after 10 416 KB using and array length of 1 337 000.
So, the javascript engine is not simple restricted. It's a kind more complex.
You can try this:
var b = str.match(/[^\x00-\xff]/g);
return (str.length + (!b ? 0: b.length));
It worked for me.

How can I split( a binary-containing string at a specific binary value?

Overview:
I'm building a Javascript tool inside a web page. Except for loading that page, the tool will run without server communication. A user will select a local file containing multiple binary records, each with a x'F0 start byte and x'F0 end byte. The data in between is constrained to x'00 - x'7F and consists of:
bit maps
1-byte numbers
2-byte numbers, low order byte first
a smattering of ASCII characters
The records vary in lengths and use different formats.
[It's a set of MIDI Sysex messages, probably not relevant].
The local file is read via reader.readAsArrayBuffer and then processed thus:
var contents = event.target.result;
var bytes = new Uint8Array(contents);
var rawAccum = '';
for (x = 0; x < bytes.length; x++) {
rawAccum += bytes[x];
}
var records = rawAccum.split(/\xF0/g);
I expect this to split the string into an array of its constituent records, deleting the x'F0 start byte in the process.
It actually does very little. records.length is 1 and records[0] contains the entire input stream.
[The actual split code is: var records = rawAccum.split(/\xF0\x00\x00\x26\x02/g); which should remove several identical bytes from the start of each record. When this failed I tried the abbreviated version above, with identical (non)results.]
I've looked at the doc on split( and at several explanations of \xXX among regex references. Clearly something does not work as I have deduced. My experience with JavaScript is minimal and sporadic.
How can I split a string of binary data at the occurrence of a specific binary byte?
The splitting appears to work correctly:
var rawAccum = "\xf0a\xf0b\xf0c\xf0"
console.log( rawAccum.length); // 7
var records = rawAccum.split(/\xF0/g);
console.log(records); // "", "a", "b", "c", ""
but the conversion of the array buffer to a string looks suspicious. Try converting the unsigned byte value to a string before appending it to rawAccum:
for (x = 0; x < bytes.length; x++) {
rawAccum += String.fromCharCode( bytes[x]);
}
Data conversions (update after comment)
The filereader reads the file into an array buffer in memory, but JavaScript does not provide access to array buffers directly. You can either create and initialize a typed array from the buffer (e.g. using the Uint8Array constructor as in the post), or access bytes in the buffer using a DataView object. Methods of DataView objects can convert sequences of bytes at specified positions to integers of varying types, such as the 16 bit integers in the Midi sysex records.
JavaScript strings use sequences of 16 bit values to hold characters, where each character uses one or two 16 bit values encoded using UTF-16 character encoding. 8 bit characters use only the lower 8 bits of a single 16 bit value to store their Unicode code point.
It is possible to convert an array buffer of octet values into a "binary string", by storing each byte value from the buffer in the low order bits of a 16 bit character and appending it to an existing string. This is what the post attempts to do. But in JavaScript strings (and individual characters which have a string length of 1) are not a subset of integer numbers and have their own data type, "string".
So to convert an unsigned 8 bit number to a JavaScript 16 bit character of type "string", use the fromCharCode static method of the global String object, as in
rawAccum += String.fromCharCode( bytes[x]);
Calling String.fromCharCode is also how to convert an ASCII character code located within MIDI data to a character in JavaScript.
To convert a binary string character derived from an 8 bit value back into a number, use the String instance method charCodeAt on a string value and provide the character position:
var byteValue = "\xf0".charCodeAt(0);
returns the number 0xf0 or 250 decimal.
If you append a number to a string, as in the question, the number is implicitly converted to a decimal string representation of its value first:
"" + 0xf0 + 66 // becomes the string "24066"
Note that an array buffer can be inspected using a Uint8Array created from it, sliced into pieces using the buffer's slice method and have integers of various types extracted from the buffer using data views. Please review if creating a binary string remains the best way to extract and interpret Midi record contents.

JSON increases Float32Array buffer size many folds when sending through websocket

I got a strange experience. When I send data of this arraybuffer setting:
var f32s = Float32Array(2048);
for (var i = 0; i < f32s.length; i++) {
f32s[i] = buffer[i]; // fill the array
ws.send(f32s[i]);
}
The buffer size I got at the other end is 8192 bytes.
But when I send chunck of buffer in JSON format like bellow:
var obj = {
buffer_id: 4,
data: f32s[i]
};
var json = JSON.stringify({ type:'buffer', data: obj });
ws.send(json);
The buffer size I got at the other end bloat to 55,xxx bytes with data filled in and 17,xxx bytes with no data filled.
Why this happen and how do I keep the buffer size low?
I want to do this because the stream is choppy when I render it at the other end.
Thank you.
I would expect this is happening because a float 32 array requires exactly 32 bits per number within the data structure, however json being an ascii format represents each number with a serious of 8bit characters, and then another 8bits for the comma and maybe again for the decimal and again for the delimiting whitespace.
Therefore the data [0.1234545, 111.3242, 523.12341] for instance requires 3 * 32 => 96 bits to represent within a float32array but as a json string requires 8bits for each of the 32 characters in this example which comes to 256 bits.

Using JavaScript to truncate text to a certain size (8 KB)

I'm using the Zemanta API, which accepts up to 8 KB of text per call. I'm extracting the text to send to Zemanta from Web pages using JavaScript, so I'm looking for a function that will truncate my text at exactly 8 KB.
Zemanta should do this truncation on its own (i.e., if you send it a larger string), but I need to shuttle this text around a bit before making the API call, so I want to keep the payload as small as possible.
Is it safe to assume that 8 KB of text is 8,192 characters, and to truncate accordingly? (1 byte per character; 1,024 characters per KB; 8 KB = 8,192 bytes/characters) Or, is that inaccurate or only true given certain circumstances?
Is there a more elegant way to truncate a string based on its actual file size?
If you are using a single-byte encoding, yes, 8192 characters=8192 bytes. If you are using UTF-16, 8192 characters(*)=4096 bytes.
(Actually 8192 code-points, which is a slightly different thing in the face of surrogates, but let's not worry about that because JavaScript doesn't.)
If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:
function toBytesUTF8(chars) {
return unescape(encodeURIComponent(chars));
}
function fromBytesUTF8(bytes) {
return decodeURIComponent(escape(bytes));
}
Now you can truncate with:
function truncateByBytesUTF8(chars, n) {
var bytes= toBytesUTF8(chars).substring(0, n);
while (true) {
try {
return fromBytesUTF8(bytes);
} catch(e) {};
bytes= bytes.substring(0, bytes.length-1);
}
}
(The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will complain.)
If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.
No it's not safe to assume that 8KB of text is 8192 characters, since in some character encodings, each character takes up multiple bytes.
If you're reading the data from files, can't you just grab the filesize? Or read it in in chunks of 8KB?
You can do something like this since unescape is partially deprecated
function byteCount( string ) {
// UTF8
return encodeURI(string).split(/%..|./).length - 1;
}
function truncateByBytes(string, byteSize) {
// UTF8
if (byteCount(string) > byteSize) {
const charsArray = string.split('');
let truncatedStringArray = [];
let bytesCounter = 0;
for (let i = 0; i < charsArray.length; i++) {
bytesCounter += byteCount(charsArray[i]);
if (bytesCounter <= byteSize) {
truncatedStringArray.push(charsArray[i]);
} else {
break;
}
}
return truncatedStringArray.join('');
}
return string;
}
As Dominic says, character encoding is the problem - however if you can either really ensure that you'll only deal with 8-bit chars (unlikely but possible) or assume 16-bit chars and limit yourself to half the available space, i.e. 4096 chars then you could attempt this.
It's a bad idea to rely on JS for this though because it can be trivially modified or ignored and you have complications of escape chars and encoding to deal with for example. Better to use JS as a first-chance filter and use whatever server-side language you have available (which will also open up compression).

Categories