What is the fastest way to serialize and compress a javascript hashmap (object) in NodeJS 12+? I'm trying to find the fastest combination of serialization and compression methods to transform a javascript object into binary data.
The number of possible combinations is 100+; my goal is a pre-study to pick a few of the best options for the final benchmark.
Input: an object having arbitrary keys, so some really fast serialization methods like AVSC cannot be used. Assume that the object has 30 key-value pairs, example:
{
"key-one": 2,
"key:two-random-text": "English short text, not binary data.",
... 28 more keys
}
No need to support serialization of Date, Regex etc.
Serialization: only schemaless serialization formats can be considered, like JSON or BSON. v8.serialize is an interesting option; it's probably fast because it's native. The compress-brotli package added support for it for some reason, but did not provide a benchmark or highlight it as a recommended option.
Compression: probably only the fastest methods should be considered. I'm not sure brotli is a perfect choice because, according to the wiki, it is strongest at compressing JS, CSS and HTML since it expects such "keywords" in the input. Native Node.js support is preferred.
I've found helpful research on a similar use case (I'm planning to compress in a Lambda and store in S3), but their data originates as JSON, the opposite of my case.
I'd recommend lz4 for fast compression.
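If native-only is a hard requirement, here is a minimal sketch pairing v8.serialize with the built-in zlib codecs (gzip and Brotli); the sample object is made up and the Brotli quality setting is just a starting point for tuning:

const v8 = require('v8');
const zlib = require('zlib');

// Made-up sample object with arbitrary keys, similar to the one in the question.
const obj = {
  'key-one': 2,
  'key:two-random-text': 'English short text, not binary data.',
};

// Serialize with the native V8 structured-clone serializer (returns a Buffer).
const serialized = v8.serialize(obj);

// Compress with the built-in codecs. Brotli quality is turned down for speed;
// the default (11) is tuned for ratio, not latency.
const gzipped = zlib.gzipSync(serialized);
const brotlied = zlib.brotliCompressSync(serialized, {
  params: { [zlib.constants.BROTLI_PARAM_QUALITY]: 4 },
});

console.log('v8.serialize bytes:', serialized.length);
console.log('gzip bytes:', gzipped.length, 'brotli(q4) bytes:', brotlied.length);

// Round-trip: decompress, then deserialize.
const restored = v8.deserialize(zlib.gunzipSync(gzipped));
console.log(restored['key-one']); // 2

The async variants (zlib.gzip, zlib.brotliCompress) are what you'd actually use in a Lambda handler; the sync calls just keep the sketch short.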
Related
Let's say you use several different programming languages and frameworks in your infrastructure to handle large amounts of traffic etc.
Example Stack:
event driven API servers (using Scala, node.js, Ruby EM)
a standard full stack webapp (e.g. Rails)
(maybe more technologies)
When using different languages and frameworks I usually end up duplicating most of the model validations because every "customer entry point" needs to validate its input. This is of course a pain to keep in sync.
How would you handle this without something like CORBA?
Your best bet would be a framework that allows you to specify model validation in a language-agnostic format, like JSON. You might end up with a validation schema of sorts, say:
{
"name": [
{
"validate": "length",
"minLength": 6,
"maxLength": 10
},
...
],
...
}
You would then have language-specific validators that could parse this format. The validators only need to be written once, and then you maintain a single schema for each model.
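A minimal JavaScript interpreter for that kind of schema could look like the following sketch (only the "length" rule from the example is implemented; everything else is left out):

// Sketch of a validator that interprets the JSON schema above.
const schema = {
  "name": [
    { "validate": "length", "minLength": 6, "maxLength": 10 }
  ]
};

function validate(model, schema) {
  const errors = [];
  for (const [field, rules] of Object.entries(schema)) {
    const value = model[field];
    for (const rule of rules) {
      if (rule.validate === "length") {
        const len = String(value || "").length;
        if (len < rule.minLength || len > rule.maxLength) {
          errors.push(field + ": length must be between " + rule.minLength + " and " + rule.maxLength);
        }
      }
      // Other rule types ("format", "range", ...) would be handled here.
    }
  }
  return errors;
}

console.log(validate({ name: "Bob" }, schema));
// [ 'name: length must be between 6 and 10' ]

Each language in the stack would ship one such interpreter, and the JSON schema stays the single source of truth.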
However, this is probably sounding a lot like CORBA/SOAP/Thrift/ProtocolBuffers/etc. at this point. That's because they were written to solve these types of problems, and you'll end up reinventing a few wheels if you write it yourself.
I would go with a "dictionary" of regular expressions.
Regular expressions are supported by all the languages you listed, and translating their string representation from one language to another can be done by passing the expressions themselves through regular expressions...
In my experience, that's a lot less work than composing a parse-and-execute mechanism for each language...
As advised before, you can save this "dictionary" of regexes in an agnostic format, such as JSON.
This narrows down the duplicated work to:
one source file of validation expressions that you maintain
per programming language:
a converter from the main file to the format of the target language
a thin mechanism that reads the JSON, selects the configured checks, and executes them (see the sketch below)
edge cases (if any)
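In JavaScript, the thin mechanism amounts to a few lines; a sketch with a made-up dictionary:

// Made-up dictionary of validation expressions, e.g. loaded from a shared JSON file.
const checks = {
  "username": "^[a-z0-9_]{6,10}$",
  "zipCode": "^[0-9]{5}$"
};

// Thin mechanism: read the dictionary, select the configured check, execute it.
function check(field, value) {
  const pattern = checks[field];
  if (!pattern) throw new Error('No validation rule for "' + field + '"');
  return new RegExp(pattern).test(value);
}

console.log(check("username", "john_doe")); // true
console.log(check("zipCode", "abcde"));     // false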
Have fun :)
To add to Nathan Ostgard's post, XML, XSD and, when necessary, XSLT might work as well. The advantages would be: a) XSD has simple validation built into it; b) most languages have good support for this; c) you wouldn't have to write validation in each language, since stuff that isn't handled in the schema could be written once in XSLT (with the caveat that XSLT implementations tend to vary :) )
If you want to go the way of full validation, I'd use SOAP. At least, as long as you have SOAP libraries, all you need to feed them is the WSDL for your interfaces.
I need to retrieve a large amount of data (coordinates plus an extra value) via AJAX. The format of the data is:
-72.781;;6,-68.811;;8
Note two different delimiters are being used: ;; and ,.
Shall I just return a delimited string and use String.split() (twice), or is it better to return a JSON string and use JSON.parse() to unpack my data? What are the pros and cons of each method?
Even if the data is really quite large, the odds of there being a performance difference noticeable in the real world are quite low (data transfer time will trump the decoding time). So barring a real-world performance problem, it's best to focus on what's best from a code clarity viewpoint.
If the data is homogenous (you deal with each coordinate largely the same way), then there's nothing wrong with the String#split approach.
If you need to refer to the coordinates individually in your code, there would be an argument for assigning them proper names, which would suggest using JSON. I tend to lean toward clarity, so I would probably lean toward JSON.
Another thing to consider is size on the wire. If you only need to support nice fat network connections, it probably doesn't matter, but because JSON keys are reiterated for each object, the size could be markedly larger. That might argue for compressed JSON.
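For illustration, here's roughly what the two approaches look like for the sample string (the property names in the JSON variant are made up):

const raw = '-72.781;;6,-68.811;;8';

// Delimited-string approach: split on "," for entries, then ";;" for the fields.
const fromSplit = raw.split(',').map(function (entry) {
  const parts = entry.split(';;');
  return { coordinate: Number(parts[0]), extra: Number(parts[1]) };
});

// JSON approach: the server sends named objects instead (key names made up here).
const json = '[{"coordinate":-72.781,"extra":6},{"coordinate":-68.811,"extra":8}]';
const fromJson = JSON.parse(json);

console.log(fromSplit[0].coordinate); // -72.781
console.log(fromJson[1].extra);       // 8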
I've created a performance test that describes your issue.
Although it depends on the browser implementation, in many cases (as the results show) split is much faster, because JSON.parse does a lot of other work in the background. However, you would need to serve the data in a format that is easy to parse: in the test I've added a case that uses split (along with replace) to parse an already formatted JSON array, and the result speaks for itself.
All in all, I wouldn't go with a script that's a few milliseconds faster but n seconds harder to read and maintain.
I have a problem I'd like to solve programmatically, since the alternative is spending a lot of manual work on the analysis.
I have 2 JSON objects (returned from different web service API or HTTP responses). There is intersecting data between the 2 JSON objects, and they share similar JSON structure, but not identical. One JSON (the smaller one) is like a subset of the bigger JSON object.
I want to find all the intersecting data between the two objects. Actually, I'm more interested in the shared parameters/properties within the objects, not really the actual values of the parameters/properties of each object, because I want to eventually use data from one JSON output to construct the other JSON as input to an API call. Unfortunately, I don't have the documentation that defines the JSON for each API. :(
What makes this tougher is that the JSON objects are huge. One spans a page if you print it out in Windows Notepad; the other spans 37 pages. The APIs return the JSON output compressed onto a single line. Normal text compare doesn't do much; I'd have to reformat manually or with a script to break up the objects with newlines, etc. for a text compare to work well. Tried with the Beyond Compare tool.
I could do manual search/grep but that's a pain to cycle through all the parameters inside the smaller JSON. Could write code to do it but I'd also have to spend time to do that, and test if the code works also. Or maybe there's some ready made code already for that...
Or can look for JSON diff type tools. Searched for some. Came across these:
https://github.com/samsonjs/json-diff or https://tlrobinson.net/projects/javascript-fun/jsondiff
https://github.com/andreyvit/json-diff
Both failed to do what I wanted, presumably because the JSON is either too complex or too large to process.
Any thoughts on best solution? Or might the best solution for now be manual analysis w/ grep for each parameter/property?
In terms of a code solution, any language will do. I just need a parser or diff tool that will do what I want.
Sorry, I can't share the JSON data structure with you either; it may be considered confidential.
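For scale, such a script would mainly need to walk both objects and collect the property paths they share. A rough, untested sketch in plain JavaScript (the sample data is made up):

// Recursively collect the property paths two parsed JSON objects have in common.
// Values are deliberately ignored; only the shared structure is reported.
function sharedPaths(a, b, prefix) {
  const paths = [];
  if (a === null || b === null || typeof a !== 'object' || typeof b !== 'object') {
    return paths;
  }
  for (const key of Object.keys(a)) {
    if (key in b) {
      const path = prefix ? prefix + '.' + key : key;
      paths.push(path);
      paths.push.apply(paths, sharedPaths(a[key], b[key], path));
    }
  }
  return paths;
}

const small = JSON.parse('{"id":1,"user":{"name":"x","role":"y"}}');
const big = JSON.parse('{"id":9,"user":{"name":"z","email":"e"},"extra":true}');
console.log(sharedPaths(small, big)); // [ 'id', 'user', 'user.name' ]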
Beyond Compare works well, if you set up a JSON file format in it to use Python to pretty-print the JSON. Sample setup for Windows:
Install Python 2.7.
In Beyond Compare, go under Tools, under File Formats.
Click New. Choose Text Format. Enter "JSON" as a name.
Under the General tab:
Mask: *.json
Under the Conversion tab:
Conversion: External program (Unicode filenames)
Loading: c:\Python27\python.exe -m json.tool %s %t
Note that the second parameter in the command line must be %t; if you enter %s twice you will suffer data loss.
Click Save.
Jeremy Simmons has created a better file format package for Beyond Compare, "JsonFileFormat.bcpkg" (posted on the Scooter Software forum), which does not require Python or anything else to be installed.
Just download the file and open it with BC and you are good to go, so it's much simpler.
JSON File Format
I needed a file format for JSON files.
I wanted to pretty-print & sort my JSON to make comparison easy.
I have attached my bcpackage with my completed JSON File Format.
The formatting is done via jq - http://stedolan.github.io/jq/
Props to Stephen Dolan for the utility https://github.com/stedolan.
I have sent a message to the folks at Scooter Software asking them to include it in the page with additional formats. If you're interested in seeing it on there, I'm sure a quick reply to the thread with an up-vote would help them see the value of posting it.
I have a small GPL project that would do the trick for simple JSON. I have not added support for nested entities, as it is more of a simple ObjectDB solution and not actually JSON (despite the fact that it was clearly inspired by it).
Long and short, the API is pretty simple. Make a new group, populate it, and then pull a subset via whatever logical parameters you need.
https://github.com/danielbchapman/groups
The API is used basically like ->
SubGroup items = group
.notEqual("field", "value")
.lessThan("field2", 50); //...etc...
There's actually support for basic unions and joins which would do pretty much what you want.
Long and short, you probably want a Set as your data type. Since your comparisons are probably complex, you need a richer set of methods.
My only caution is that it is GPL. If your data is confidential, odds are you may not be interested in that license.
I need to serialize moderately complex objects with anywhere from one to hundreds of mixed-type properties.
JSON was used originally, then I switched to BSON which is marginally faster.
Encoding 10000 sample objects
JSON: 1807 ms
BSON: 1687 ms
MessagePack: 2644 ms (JS, modified for BinaryF)
I want an order of magnitude increase; it is having a ridiculously bad impact on the rest of the system.
Part of the motivation to move to BSON is the requirement to encode binary data, so JSON is (now) unsuitable. And because JSON simply skips the binary data present in the objects, it is "cheating" in those benchmarks.
Profiled BSON performance hot-spots
(unavoidable?) conversion of UTF16 V8 JS strings to UTF8.
malloc and string ops inside the BSON library
The BSON encoder is based on the Mongo BSON library.
A native V8 binary serializer might be wonderful, yet as JSON is native and quick to serialize I fear even that might not provide the answer. Perhaps my best bet is to optimize the heck out of the BSON library or write my own, plus figure out a far more efficient way to pull strings out of V8. One tactic might be to add UTF-16 support to BSON.
So I'm here for ideas, and perhaps a sanity check.
Edit
Added MessagePack benchmark. This was modified from the original JS to use BinaryF.
The C++ MessagePack library may offer further improvements; I may benchmark it in isolation to compare directly with the BSON library.
I made a recent (2020) article and benchmark comparing binary serialization libraries in JavaScript.
The following formats and libraries are compared:
Protocol Buffer: protobuf-js, pbf, protons, google-protobuf
Avro: avsc
BSON: bson
BSER: bser
JSBinary: js-binary
Based on the current benchmark results I would rank the top libraries in the following order (higher values are better, measurements are given as x times faster than JSON):
avsc: 10x encoding, 3-10x decoding
js-binary: 2x encoding, 2-8x decoding
protobuf-js: 0.5-1x encoding, 2-6x decoding
pbf: 1.2x encoding, 1.0x decoding
bser: 0.5x encoding, 0.5x decoding
bson: 0.5x encoding, 0.7x decoding
I did not include msgpack in the benchmark as it is currently slower than the built-in JSON library according to its NPM description.
For details, see the full article.
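The measurement itself is nothing exotic; a stripped-down harness in the same spirit (not the article's actual code, and only the built-in JSON baseline is shown) looks like this:

// Stripped-down benchmark harness: encode the same payload many times and
// report operations per second. Swap `encode` for any library's encoder.
function bench(name, encode, payload, iterations) {
  iterations = iterations || 100000;
  // Warm up so the JIT has settled before timing starts.
  for (let i = 0; i < 1000; i++) encode(payload);

  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) encode(payload);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

  console.log(name + ': ' + Math.round(iterations / (elapsedMs / 1000)) + ' ops/s');
}

const payload = { id: 123, name: 'sample', tags: ['a', 'b', 'c'], nested: { ok: true } };

// Baseline; a library under test plugs in here, e.g. a compiled Avro type's
// toBuffer or a protobuf message type's encode(...).finish().
bench('JSON.stringify', function (obj) { return JSON.stringify(obj); }, payload);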
For serialization/deserialization, protobuf is pretty tough to beat. I don't know if you can switch out the transport protocol, but if you can, protobuf should definitely be considered.
Take a look at all the answers to Protocol Buffers versus JSON or BSON.
The accepted answer chooses Thrift. It is, however, slower than protobuf. I suspect it was chosen for ease of use (with Java), not speed. These Java benchmarks are very telling.
Of note
MongoDB-BSON 45042
protobuf 6539
protostuff/protobuf 3318
The benchmarks are Java, but I'd imagine that you can achieve speeds near the protostuff implementation of protobuf, i.e. 13.5 times faster. Worst case (if for some reason Java is just better for serialization) you can do no worse than the plain unoptimized protobuf implementation, which runs 6.8 times faster.
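If you do try it from Node, usage looks roughly like this (assuming the protobufjs npm package; the schema and message names are made up):

// Sketch using the protobufjs package (assumed); schema and names are made up.
const protobuf = require('protobufjs');

// An inline schema keeps the example self-contained; normally this lives in a .proto file.
const root = protobuf.parse(
  'syntax = "proto3"; message Sample { int32 id = 1; string name = 2; }'
).root;
const Sample = root.lookupType('Sample');

const payload = { id: 42, name: 'example' };
const err = Sample.verify(payload);
if (err) throw new Error(err);

// Encode to a Uint8Array and decode it back.
const buffer = Sample.encode(Sample.create(payload)).finish();
const decoded = Sample.toObject(Sample.decode(buffer));

console.log(buffer.length, decoded); // e.g. 11 { id: 42, name: 'example' }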
Take a look at MessagePack. It's compatible with JSON. From the docs:
Fast and Compact Serialization
MessagePack is a binary-based efficient object serialization library. It enables to exchange structured objects between many languages like JSON. But unlike JSON, it is very fast and small. Typical small integer (like flags or error code) is saved only in 1 byte, and typical short string only needs 1 byte except the length of the string itself. [1,2,3] (3 elements array) is serialized in 4 bytes using MessagePack as follows:
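The quote cuts off before the byte layout; a quick way to reproduce it from Node (assuming the msgpack-lite package, which exposes encode/decode):

// Quick check with the msgpack-lite package (assumed).
const msgpack = require('msgpack-lite');

// The 3-element array from the docs really does pack into 4 bytes.
const encoded = msgpack.encode([1, 2, 3]);
console.log(encoded.length, encoded); // 4 <Buffer 93 01 02 03>

// A small object is also noticeably smaller than its JSON string.
const obj = { flag: 1, note: 'short text' };
const packed = msgpack.encode(obj);
console.log(packed.length, JSON.stringify(obj).length); // 23 30

const roundTripped = msgpack.decode(packed);
console.log(roundTripped.note); // 'short text'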
If you are more interested in deserialization speed, take a look at the JBB (JavaScript Binary Bundles) library. It is faster than BSON or MsgPack.
From the Wiki, page JBB vs BSON vs MsgPack:
...
JBB is about 70% faster than Binary-JSON (BSON) and about 30% faster than MsgPack on decoding speed, even with one negative test-case (#3).
JBB creates files that (even their compressed versions) are about 61% smaller than Binary-JSON (BSON) and about 55% smaller than MsgPack.
...
Unfortunately, it's not a streaming format, meaning that you must pre-process your data offline. However, there is a plan for converting it into a streaming format (check the milestones).
I am storing a JSON string in the database that represents a set of properties. In the code behind, I export it and use it for some custom logic. Essentially, I am using it only as a storage mechanism. I understand XML is better suited for this, but I read that JSON is faster and preferred.
Is it a good practice to use JSON if the intention is not to use the string on the client side?
JSON is a perfectly valid way of storing structured data, and it is simpler and more concise than XML. I don't think it is "bad practice" to use it for the same reason someone would use XML, as long as you understand and are OK with its limitations.
Whether it's good practice I can't say, but it strikes me as odd. XML fields in your SQL database are at least queryable (SQL Server 2000 or later, MySQL and others), but more often than not they are a last resort for metadata.
JSON is usually the carrier between JavaScript and your back end, not the storage itself, unless you have a JSON-based document-oriented database such as CouchDB or Solr, since JSON lends itself perfectly to storing documents.
Not to say I don't agree with using JSON as a simple (that is, not serializing references) data serializer over XML, but I won't go into a JSON vs XML rant just for the sake of it :).
If you're not using JSON for its portability between two languages, and you're positive you will never query the data from SQL, you will be better off with the default serialization from .NET.
You're not the only one using JSON for data storage. CouchDB is a document-oriented database which uses JSON for storing information. They initially stored that info as XML, then switched to JSON. Querying is done via good old HTTP.
By the way, CouchDB received some attention lately and is backed by IBM and the Apache Software Foundation. It is written in Erlang and promises a lot (IMHO).
Yeah I think JSON is great. It's an open standard and can be used by any programming language. Most programming languages have libraries to parse and encode JSON as well.
JSON is just a markup language like XML, and due to its more restrictive specification it is smaller in size and faster to generate and parse. So from that aspect it's perfectly acceptable to use in the database if you need ad-hoc structured markup (and you're absolutely sure that using tables isn't possible and/or the best solution).
However, you don't say what database you're using. If it is SQL Server 2005 or later then the database itself has extensive capabilities when dealing with XML, including an XML data type, XQuery support, and the ability to optimise XQuery expressions as part of an execution plan. So if this is the case I'd be much more inclined to use XML and take advantage of the database's facilities.
Just remember that JSON has all of the same problems that XML has as a back-end data store; namely, it in no way replaces a relational database or even a fixed-size binary format. If you have millions of rows and need random access, you will run into all the same fundamental performance issues with JSON that you would with XML. I doubt you're intending to use it for this kind of data store scenario, but you never know... stranger things have been tried :)
JSON is shorter, so it will use less space in your database. I'd probably use it instead of XML or writing my own format.
On the other hand, searching for matches will suck - you're going to have to use "where json like '% somevalue %'" which will be painfully slow. This is the same limitation with any other text storage strategy (xml included).
You shouldn't use either JSON or XML for storing data in a relational database. JSON and XML are serialization formats which are useful for storing data in files or sending over the wire.
If you want to store a set of properties, just create a table for it, with each row being a property. That way you can query the data using ordinary SQL.