I need to serialize moderately complex objects with anywhere from one to a few hundred mixed-type properties.
JSON was used originally; then I switched to BSON, which is marginally faster.
Encoding 10,000 sample objects:
JSON: 1807 ms
BSON: 1687 ms
MessagePack: 2644 ms (JS, modified for BinaryF)
I want an order-of-magnitude improvement; serialization is having a ridiculously bad impact on the rest of the system.
Part of the motivation for moving to BSON is the requirement to encode binary data, so JSON is (now) unsuitable. And because JSON simply skips the binary data present in the objects, it is "cheating" in the benchmarks above.
Profiling the BSON encoder shows these hot-spots:
the (unavoidable?) conversion of UTF-16 V8 JS strings to UTF-8
malloc and string operations inside the BSON library
The BSON encoder is based on the Mongo BSON library.
A native V8 binary serializer might be wonderful, yet since JSON is native and quick to serialize I fear even that might not provide the answer. Perhaps my best bet is to optimize the heck out of the BSON library, or write my own, plus figure out a far more efficient way to pull strings out of V8. One tactic might be to add UTF-16 support to BSON.
So I'm here for ideas, and perhaps a sanity check.
Edit
Added MessagePack benchmark. This was modified from the original JS to use BinaryF.
The C++ MessagePack library may offer further improvements; I may benchmark it in isolation to compare directly with the BSON library.
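For reference, a minimal benchmark sketch along the lines of the numbers above, assuming the bson npm package's serialize export; makeSampleObject and its fields are hypothetical stand-ins for your real objects:

const { serialize } = require('bson'); // assumption: bson npm package exposing serialize()

// Hypothetical sample object; substitute your own mixed-type objects.
function makeSampleObject(i) {
  return { id: i, name: 'item-' + i, flag: i % 2 === 0, payload: Buffer.alloc(64) };
}

const objects = [];
for (let i = 0; i < 10000; i++) objects.push(makeSampleObject(i));

console.time('JSON');
for (const o of objects) JSON.stringify(o); // note: Buffers are not written as raw binary here
console.timeEnd('JSON');

console.time('BSON');
for (const o of objects) serialize(o);
console.timeEnd('BSON');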
I wrote a recent (2020) article and benchmark comparing binary serialization libraries in JavaScript.
The following formats and libraries are compared:
Protocol Buffer: protobuf-js, pbf, protons, google-protobuf
Avro: avsc
BSON: bson
BSER: bser
JSBinary: js-binary
Based on the current benchmark results I would rank the top libraries in the following order (higher values are better, measurements are given as x times faster than JSON):
avsc: 10x encoding, 3-10x decoding
js-binary: 2x encoding, 2-8x decoding
protobuf-js: 0.5-1x encoding, 2-6x decoding
pbf: 1.2x encoding, 1.0x decoding
bser: 0.5x encoding, 0.5x decoding
bson: 0.5x encoding, 0.7x decoding
I did not include msgpack in the benchmark as it is currently slower than the built-in JSON library, according to its NPM description.
For details, see the full article.
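For what it's worth, a minimal avsc encode/decode sketch, assuming the avsc npm package; the record name and fields below are illustrative, not taken from the benchmark:

const avro = require('avsc'); // assumption: avsc npm package

// Illustrative schema; avsc is schema-based, so the object shape must be known up front.
const type = avro.Type.forSchema({
  type: 'record',
  name: 'Sample',
  fields: [
    { name: 'id', type: 'int' },
    { name: 'name', type: 'string' },
    { name: 'payload', type: 'bytes' }
  ]
});

const buf = type.toBuffer({ id: 1, name: 'hello', payload: Buffer.from([1, 2, 3]) });
const obj = type.fromBuffer(buf);

The schema requirement is part of why avsc can be so fast: it knows the shape of every value ahead of time instead of inspecting it at runtime.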
For serialization/deserialization, protobuf is pretty tough to beat. I don't know whether you can switch out the transport protocol, but if you can, protobuf should definitely be considered.
Take a look at all the answers to Protocol Buffers versus JSON or BSON.
The accepted answer chooses Thrift, which is, however, slower than protobuf. I suspect it was chosen for ease of use (with Java), not speed. These Java benchmarks are very telling.
Of note (total time; lower is better):
MongoDB-BSON: 45042
protobuf: 6539
protostuff/protobuf: 3318
The benchmarks are Java, but I'd imagine you can achieve speeds near the protostuff implementation of protobuf, i.e. 13.5 times faster. Worst case (if for some reason Java is just better at serialization), you should do no worse than the plain unoptimized protobuf implementation, which runs 6.8 times faster.
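As a sketch of what protobuf looks like from JavaScript, assuming the protobufjs npm package; sample.proto and the Sample message are illustrative assumptions:

const protobuf = require('protobufjs'); // assumption: protobufjs npm package

// A sample.proto defining a Sample message with id and name fields is assumed to exist.
protobuf.load('sample.proto').then((root) => {
  const Sample = root.lookupType('Sample');

  const payload = { id: 1, name: 'hello' };
  const buffer = Sample.encode(Sample.create(payload)).finish(); // Uint8Array
  const decoded = Sample.decode(buffer);
});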
Take a look at MessagePack. It's compatible with JSON. From the docs:
Fast and Compact Serialization
MessagePack is a binary-based efficient object serialization library. It enables to exchange structured objects between many languages like JSON. But unlike JSON, it is very fast and small. Typical small integer (like flags or error code) is saved only in 1 byte, and typical short string only needs 1 byte except the length of the string itself. [1,2,3] (3 elements array) is serialized in 4 bytes using MessagePack as follows:
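A minimal sketch of using it from Node.js, assuming the msgpack-lite npm package (the 4-byte layout mentioned in the quote is shown in the comment):

const msgpack = require('msgpack-lite'); // assumption: msgpack-lite npm package

const buf = msgpack.encode([1, 2, 3]);
// 4 bytes: 0x93 0x01 0x02 0x03 (fixarray header for 3 elements, then three positive fixints)

const obj = msgpack.decode(buf); // [1, 2, 3]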
If you are more interested in de-serialisation speed, take a look at the JBB (JavaScript Binary Bundles) library. It is faster than BSON or MsgPack.
From the Wiki, page JBB vs BSON vs MsgPack:
...
JBB is about 70% faster than Binary-JSON (BSON) and about 30% faster than MsgPack on decoding speed, even with one negative test-case (#3).
JBB creates files that (even their compressed versions) are about 61% smaller than Binary-JSON (BSON) and about 55% smaller than MsgPack.
...
Unfortunately, it's not a streaming format, meaning that you must pre-process your data offline. However there is a plan for converting it into a streaming format (check the milestones).
Related
What is the fastest way to serialize and compress a JavaScript hashmap (object) in Node.js 12+? I'm trying to find the fastest combination of serialization and compression methods to transform a JavaScript object into binary data.
The number of possible combinations is 100+; my goal is a pre-study to choose the several best options for a final benchmark.
Input: an object having arbitrary keys, so some really fast schema-based serialization methods like AVSC cannot be used. Assume the object has 30 key-value pairs, for example:
{
"key-one": 2,
"key:two-random-text": "English short text, not binary data.",
... 28 more keys
}
No need to support serialization of Date, Regex etc.
Serialization: only schemaless serialization formats can be taken into account, like JSON or BSON. v8.serialize is an interesting option; it is probably fast because it is native. The compress-brotli package added support for it for some reason, but did not provide a benchmark or highlight it as a recommended option.
Compression: probably only the fastest methods should be considered. I'm not sure brotli is a perfect choice, because according to the wiki it is strong at compressing JS, CSS and HTML since it expects such "keywords" in its input. Native Node.js support is preferred.
I've found a helpful study of a similar use case (I'm planning to compress via Lambda and store in S3), but their data originates as JSON, unlike in my case.
I'd recommend lz4 for fast compression.
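A minimal sketch of that pipeline with the native serializer mentioned in the question, assuming the lz4 npm package's encode/decode API (package choice and data shape are illustrative):

const v8 = require('v8');   // built-in: v8.serialize / v8.deserialize
const lz4 = require('lz4'); // assumption: lz4 npm package exposing encode()/decode()

const input = { 'key-one': 2, 'key:two-random-text': 'English short text, not binary data.' };

// Serialize with the native structured-clone serializer, then compress.
const serialized = v8.serialize(input);    // Buffer
const compressed = lz4.encode(serialized); // Buffer

// Reverse the pipeline.
const restored = v8.deserialize(lz4.decode(compressed));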
I am passing data from my client to my server and vice versa. I want to know whether there is any size limit for a protocol buffer message.
Citing the official source:
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are really just a collection of small pieces, where each small piece may be a structured piece of data. Even though Protocol Buffers cannot handle the entire set at once, using Protocol Buffers to encode each piece greatly simplifies your problem: now all you need is to handle a set of byte strings rather than a set of structures.
Protocol Buffers do not include any built-in support for large data sets because different situations call for different solutions. Sometimes a simple list of records will do while other times you may want something more like a database. Each solution should be developed as a separate library, so that only those who need it need to pay the costs.
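A common way to apply that advice is to frame each encoded message with a length prefix and treat the data set as a sequence of byte strings; a minimal sketch (the 4-byte big-endian framing is my own illustration, not part of protobuf):

// Write: prefix each encoded message with a 4-byte big-endian length.
function frameMessages(buffers) {
  const parts = [];
  for (const buf of buffers) {
    const len = Buffer.alloc(4);
    len.writeUInt32BE(buf.length, 0);
    parts.push(len, buf);
  }
  return Buffer.concat(parts);
}

// Read: walk the framed buffer, yielding one encoded message at a time.
function* readFrames(framed) {
  let offset = 0;
  while (offset < framed.length) {
    const len = framed.readUInt32BE(offset);
    yield framed.slice(offset + 4, offset + 4 + len);
    offset += 4 + len;
  }
}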
As far as I understand the protobuf encoding, the following applies:
varints above 64 bits are not specified, but given how their encoding works, varint bit-length is not limited by the wire format (a varint consisting of several 1xxxxxxx groups terminated by a single 0xxxxxxx group is perfectly valid; I suppose there is no actual implementation supporting varints larger than 64 bits, though) -- see the small encoding sketch below
given the above varint encoding property, it should be possible to encode any message length (varints are used internally to encode the length of length-delimited fields, and other field types are either varints or have a fixed length)
you can construct arbitrarily long valid protobuf messages just by repeating a single repeated field ad absurdum -- the parser should be perfectly happy as long as it has enough memory to store the values (there are even parsers which provide callbacks for field values, thus relaxing memory consumption, e.g. nanopb)
(Please do validate my thoughts)
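To make the varint point concrete, a minimal sketch of base-128 varint encoding (the standard protobuf varint scheme, written out by hand rather than taken from any library):

// Encode a non-negative integer as a base-128 varint:
// each byte carries 7 payload bits; the high bit marks "more bytes follow".
function encodeVarint(value) {
  const bytes = [];
  do {
    let byte = value % 128;
    value = Math.floor(value / 128);
    if (value > 0) byte += 128; // set the continuation bit
    bytes.push(byte);
  } while (value > 0);
  return Buffer.from(bytes);
}

// encodeVarint(300) -> <Buffer ac 02>: 300 = 0b100101100, emitted as 0xAC (0b10101100) then 0x02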
I'm looking for a compression algorithm that:
must be lossless
must have a very high compression ratio
must be supported in the browser via JavaScript libs or natively
doesn't need to be fast
Goals:
to compress a dense array of 8 million double-precision floats. There are only 256 unique values. Values are normally distributed. (primary use case)
the same as before, but for sparse arrays (containing a lot of 0 values)
It's OK for me to use 2 different algorithms for these use cases.
I've found Google's Brotli algorithm. But I'm not sure if it is the best.
Coding is pretty much a solved problem: your main task will be modelling (given that you start with float numbers and need losslessness).
[primarily dense arrays] of 256 unique float numbers doesn't sound promising: depending on range, exponent representation may be the only source of exploitable redundancy.
sparse array does sound promising, 16×16 sparse matrix even more so. The more you know about your data, the more you can help compressors - "mainly diagonal matrix", anyone?
"General purpose data compressors" exploit self-similarity:
To get an idea where your data has such, use "the usual suspects" on whatever "machine representation" you chose and on a generic unicode representation.
The latter allows you to use no more resolution than required.
I have a lot of float numbers, but because there are only 256 unique values I can encode each number as 1 byte (an index into a table of the unique values). This alone gives a huge compression ratio.
After that I can run some general-purpose algorithm for further data compression.
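A minimal sketch of that mapping step (the dictionary layout here is my own illustration):

// Map doubles to 1-byte indices into a table of the (at most 256) unique values.
function packFloats(values) {
  const dict = Array.from(new Set(values));         // unique values, <= 256 of them
  const index = new Map(dict.map((v, i) => [v, i]));
  const bytes = new Uint8Array(values.length);
  for (let i = 0; i < values.length; i++) bytes[i] = index.get(values[i]);
  return { dict: new Float64Array(dict), bytes };   // store both; compress `bytes` further
}

function unpackFloats({ dict, bytes }) {
  const out = new Float64Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) out[i] = dict[bytes[i]];
  return out;
}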
I've checked several popular algorithms: gzip, Brotli, bzip2, lzma, Zstandard.
I've found that 2 options suit my needs:
bzip2
Brotli
bzip2:
compresses well even if I don't convert the floats to unsigned bytes
but requires a JS library in the browser
Brotli:
compresses well only if I manually map all floats to unsigned bytes first
is supported natively by nearly all modern browsers
I need to retrieve a large amount of data (coordinates plus an extra value) via AJAX. The format of the data is:
-72.781;;6,-68.811;;8
Note two different delimiters are being used: ;; and ,.
Shall I just return a delimited string and use String.split() (twice), or is it better to return a JSON string and use JSON.parse() to unpack my data? What are the pros and cons of each method?
Even if the data is really quite large, the odds of there being a performance difference noticeable in the real world are quite low (data transfer time will trump the decoding time). So barring a real-world performance problem, it's best to focus on what's best from a code-clarity viewpoint.
If the data is homogeneous (you deal with each coordinate largely the same way), then there's nothing wrong with the String#split approach.
If you need to refer to the coordinates individually in your code, there would be an argument for assigning them proper names, which would suggest using JSON. I tend to lean toward clarity, so I would probably lean toward JSON.
Another thing to consider is size on the wire. If you only need to support nice fat network connections, it probably doesn't matter, but because JSON keys are reiterated for each object, the size could be markedly larger. That might argue for compressed JSON.
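For concreteness, a minimal sketch of both approaches against the sample payload from the question (the JSON key names are illustrative):

const raw = '-72.781;;6,-68.811;;8';

// split approach: outer split on ',', inner split on ';;'
const pairs = raw.split(',').map((item) => {
  const [coord, value] = item.split(';;');
  return { coord: parseFloat(coord), value: Number(value) };
});

// JSON approach: the server would instead send something like
// '[{"coord":-72.781,"value":6},{"coord":-68.811,"value":8}]'
// and the client simply calls JSON.parse(jsonString).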
I've created a performance test that describes your issue.
Although it depends on the browser implementation, in many cases, as the results show, split is much faster, because JSON.parse does a lot of other work in the background. But you need the data served in a form that is easy to parse: in the test I've added a case where split (along with replace) is used to parse an already formatted JSON array, and the result speaks for itself.
All in all, I wouldn't go with a script that's a few milliseconds faster but n seconds harder to read and maintain.
I have raw data in text-file format with a lot of repetitive tokens (~25%). I would like to know if there's any algorithm which will help:
(A) store data in compact form
(B) yet, allow at run time to re-constitute the original file.
Any ideas?
More details:
the raw data is consumed in a pure html+javascript app, for instant search using regex.
data is made of tokens containing (case sensitive) alpha characters, plus few punctuation symbols.
tokens are separated by spaces, new lines.
Most promising algorithm so far: the succinct data structures discussed in the links below, but reconstituting the original looks difficult.
http://stevehanov.ca/blog/index.php?id=120
http://ejohn.org/blog/dictionary-lookups-in-javascript/
http://ejohn.org/blog/revised-javascript-dictionary-search/
PS: server-side gzip is being employed right now, but it's only a transport-layer optimization and doesn't help maximize the use of offline storage, for example. Given the massive 25% repetitiveness, it should be possible to store the data in a more compact way, shouldn't it?
Given that the actual use is pretty unclear I have no idea whether this is helpful or not, but for the smallest total size (HTML + JavaScript + data) some people came up with the idea of storing text data in a greyscale .png file, one byte per pixel. A small loader script can then draw the .png to a canvas, read it pixel by pixel and reassemble the original data this way. This gives you deflate compression without having to implement it in JavaScript. See e.g. here for more detailed information.
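A minimal sketch of that loader, assuming the data was packed into a greyscale PNG at data.png and that the original byte length is known (both assumptions for illustration):

// Draw the PNG to a canvas and read one byte per pixel back out.
const img = new Image();
img.onload = () => {
  const canvas = document.createElement('canvas');
  canvas.width = img.width;
  canvas.height = img.height;
  const ctx = canvas.getContext('2d');
  ctx.drawImage(img, 0, 0);

  const rgba = ctx.getImageData(0, 0, img.width, img.height).data;
  const bytes = new Uint8Array(img.width * img.height);
  for (let i = 0; i < bytes.length; i++) bytes[i] = rgba[i * 4]; // greyscale: take the red channel

  const text = new TextDecoder().decode(bytes); // trim any padding to the known original length
};
img.src = 'data.png';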
Please do not use a technique like that unless you have pretty esoteric requirements, e.g. for a size-constrained programming competition. Your coworkers will thank you :-)
Generally speaking, it's a bad idea to try to implement compression in JavaScript. Compression is the exact type of work that JS is the worst at: CPU-intensive calculations.
Remember that JS is single-threaded [1], so for the entire time spent decompressing data, you block the browser UI. In contrast, HTTP-gzipped content is decompressed by the browser asynchronously.
Given that you have to reconstruct the entire dataset (so as to test every record against a regex), I doubt the Succinct Trie will work for you. To be honest, I doubt you'll get much better compression than the native gzipping.
[1] Web Workers notwithstanding.