I'm looking for a compression algorithm that:
must be lossless
must have very high compression ratio
must be supported in browser via JavaScript libs or natively
doesn't have to be fast (speed is not a concern).
Goals:
to compress a dense array of 8 million double-precision floats. There are only 256 unique values, and the values are normally distributed. (primary use case)
the same as before, but for sparse arrays (containing a lot of 0 values)
It's OK for me to use 2 different algorithms for these use cases.
I've found Google's Brotli algorithm. But I'm not sure if it is the best.
Coding is pretty much a solved problem: your main task will be modelling (given the "float number" and "lossless" requirements).
A [primarily dense] array of 256 unique float numbers doesn't sound promising: depending on the range, the exponent representation may be the only source of exploitable redundancy.
A sparse array does sound promising, a 16×16 sparse matrix even more so. The more you know about your data, the more you can help compressors - "mainly diagonal matrix", anyone?
"General purpose data compressors" exploit self-similarity:
To get an idea of where your data has such self-similarity, use "the usual suspects" on whatever "machine representation" you chose and on a generic Unicode representation.
The latter allows you to use no more resolution than required.
I have a lot of float numbers, but because there are only 256 unique values I can encode each number as 1 byte, which by itself gives a huge compression ratio.
After that I can run some general-purpose algorithm for further compression.
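A minimal sketch of that byte mapping (variable names are just illustrative):

// Collect the (at most 256) unique values and map each float to a 1-byte index.
const dictionary = Array.from(new Set(values));               // `values` is the big Float64Array
const indexOf = new Map(dictionary.map((v, i) => [v, i]));
const codes = Uint8Array.from(values, (v) => indexOf.get(v)); // 1 byte per float
// `codes` plus the small dictionary is what then goes through the general-purpose compressor;
// decoding is just dictionary[code] for each byte.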
I've checked several popular algorithms: gzip, Brotli, bzip2, lzma, Zstandard.
I've found that 2 options suit my needs:
bzip2
Brotli
bzip2:
compresses well even if I don't convert the floats to unsigned bytes.
but requires a JS library in the browser
Brotli:
compresses well only if I manually map all floats to unsigned bytes beforehand
supported natively by nearly all modern browsers (see the decode sketch below)
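For the decode side in the browser, a rough sketch (the URL is a placeholder; the payload has to be served with Content-Encoding: br for the browser's native Brotli decoder to kick in, and `dictionary` is the 256-entry value table from the encoding step):

async function loadValues(dictionary) {
  // The browser decompresses the Brotli stream transparently before fetch() hands us the bytes.
  const response = await fetch("/data/values.bin");        // served with Content-Encoding: br
  const codes = new Uint8Array(await response.arrayBuffer());
  // Map each 1-byte code back to one of the 256 known float values.
  return Float64Array.from(codes, (code) => dictionary[code]);
}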
I'm wondering if someone might be able to explain a specific aspect of the JavaScript BigInt implementation to me.
I understand the implementation at a high level - rather than operating in base 10, it builds an array representing the digits, effectively operating in base 2^32 or 2^64 depending on the build architecture.
What I'm curious about is the display/console.log implementation for this type - it's incredibly fast for most common cases, to the point where if you didn't know anything about the implementation you'd probably assume it was native. But, knowing what I do about the implementation, it's incredible to me that it's able to do the decimal cast/string concatenation math as quickly as it can, and I'm deeply curious how it works.
A moderate look into bigint.cc and bigint.h in the Chromium source has only confused me further, as there are a number of methods whose signatures are defined, but whose implementations I can't seem to find.
I'd appreciate even being pointed to another spot in the Chromium source which contains the decimal cast implementation.
(V8 developer here.)
@Bergi basically provided the relevant links already, so just to sum it up:
Formatting a binary number as a decimal string is a "base conversion", and its basic building block is:
while (number > 0) {
  next_char = "0123456789"[number % 10];
  number = Math.floor(number / 10); // Truncating integer division.
}
(Assuming that next_char is also written into some string backing store; this string is being built up from the right.)
Special-cased for the common situation that the BigInt only had one 64-bit "digit" to begin with, you can find this algorithm in code here.
The generalization for more digits and non-decimal radixes is here; it's the same algorithm.
This algorithm runs sufficiently fast for sufficiently small BigInts; its problem is that it scales quadratically with the length of the BigInt. So for large BigInts (where some initial overhead easily pays for itself due to enabling better scaling), we have a divide-and-conquer implementation that's built on better-scaling division and multiplication algorithms.
When the requested radix is a power of two, then no such heavy machinery is necessary, because a linear-time implementation is easy. That's why some_bigint.toString(16) is and always will be much faster than some_bigint.toString() (at least for large BigInts), so when you need de/serialization rather than human readability, hex strings are preferable for performance.
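For example, a hex round-trip (plain JavaScript, nothing V8-specific):

const big = 2n ** 4096n - 1n;        // some large BigInt
const hex = big.toString(16);        // linear time, fast even for huge values
const restored = BigInt("0x" + hex); // parses back to the same value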
if you didn't know anything about the implementation you'd probably assume it was native
What does that even mean?
I am making a game that will likely be built in JavaScript - but this question is rather platform-agnostic...
The game involves generation of a random campaign; however, to dissuade hacking and reduce the amount of storage space needed to save games (which may potentially be cloud-based), I wanted the campaign generation to be seed-based.
Trying to think of ways to accomplish this, I considered an MD5-based approach. For example, let's say at the start of the game the user is given the random seed "ABC123". When selecting which level template to use for each game level, I could generate MD5 hashes...
MD5("ABC123" + "level1"); // = 3f19bf4df62494495a3f23bedeb82cce
MD5("ABC123" + "level2"); // = b499e3184b3c23d3478da9783089cc5b
MD5("ABC123" + "level3"); // = cf240d23885e6bd0228677f1f3e1e857
Ideally there would be only 16 templates. There will be more, but for the sake of demonstration, if I take the first hex character of each hash I have a random number out of 16 which I can reproduce with the same seed, forever.
Level 1 for this seed is always "3" (#3), Level 2 is always "b" (#11), Level 3 is always "c" (#12)
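In code, picking the template is just parsing that first hex character (using the same MD5 function as in the examples above):

const hash = MD5("ABC123" + "level1");        // "3f19bf4d..."
const templateIndex = parseInt(hash[0], 16);  // 3 -> template #3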
This approach has a few drawbacks I'm sure many will be quick to point out...
MD5 generation is CPU intensive, particularly if used in loops etc...
JavaScript doesn't come with a built-in MD5 function - you'll need to DIY...
That only gives you 16 numbers - or 256 if you use two hex characters. How do you 'round' the number to your required range?
I considered this actually. Divide the number by the potential (16, or 256...), then multiply it by the random range needed. As long as the range remains the same, so too will the result... but that too is a constraint...
Given those drawbacks, is there a better approach which is simple and doesn't require an entire framework? In my case all I really need is an MD5 hash function and my implementation is basically complete.
Any advice is appreciated. I guess the "chosen answer" will be the suggestions or approach which is the most useful or practical given everything I've mentioned.
I think you're overcomplicating the solution.
1) You don't need the MD5 hash. Actually since in your case there is no interest in the statistical quality of the hash, almost any hash function would be satisfactory. You can use any string hash algorithm which is cheaper to evaluate. If you only accept ASCII characters, then the Pearson hash is also an option - it is fast, simple and easy to port to any language.
2) Do you really need string seeds from the user, or is a single integer seed also acceptable? If so, you can use an integer hash function, which is significantly faster than a string hash algorithm and also very simple and easy to port.
3) Any decent pseudo-random number generator (PRNG) will give you a radically different sequence for each seed value. That means that for successive levels you can simply increase the seed by 1 (++seed) and generate random numbers from that. I recommend using a simple, fast custom random number generator rather than JavaScript's Math.random(). You can use some variant of xorshift (a sketch follows below).
With these 3 points all your listed drawbacks are addressed and no framework needed.
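As a concrete illustration of point 3, here is a minimal xorshift32 sketch (the seed-per-level scheme and the names are just an example, not a prescription):

function xorshift32(seed) {
  let state = (seed >>> 0) || 1;      // force a non-zero 32-bit state
  return function next() {
    state ^= state << 13;
    state ^= state >>> 17;
    state ^= state << 5;
    state >>>= 0;                     // keep it an unsigned 32-bit value
    return state / 4294967296;        // map to [0, 1)
  };
}

// Level n gets seed + n, so the same seed always reproduces the same campaign.
const levelNumber = 1;                          // example level
const templateCount = 16;                       // example template pool size
const rng = xorshift32(123456 + levelNumber);   // integer seed + level offset
const templateIndex = Math.floor(rng() * templateCount);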
I wouldn't worry about hacking. As @apokryfos pointed out in the comments, even your original solution with MD5 is not secure, and I think level generation in games is not a case where you really need cryptography. Think about it: even big-title commercial games are hackable.
In a short amount of time, I ran twice into the same problem:
I have a list of coordinates (latitude, longitude in the case of geo-coordinates - or x,y,z in the case of a 3D OBJ file)
the coordinates are stored as numbers written out in ASCII decimals, e.g. 3.14159265
the coordinates have decimals
the coordinates are stored as text in a text file or database
the whole bunch gets too large
Now, we could simply ignore the problem and accept a slow response or a more jagged shape - but it nags. A decimal digit in ASCII uses 8 bits (where only 4 bits are needed to represent the digits 0…9), and many coordinates share the same first couple of digits... It feels like these files could be compressed easily. Zipping obviously reduces the files a bit, although it varies. Base-encoding also seems to help, but it turns out not to be as efficient as I hoped (about 30%).
Using PHP, what would be a pragmatic approach to compressing coordinates stored in text files?
(Pragmatic meaning: reasonably fast, preferably using vanilla PHP.)
You can use a quadkey to presort the geo-coordinates, combined with other presort algorithms such as move-to-front and Burrows-Wheeler. A quadkey is often used in mapping applications, especially for map tiles, but it has interesting features. Just convert the geo-coordinate into binary, concatenate the bits, and treat the result as a base-4 number. There is free source code here: http://msdn.microsoft.com/en-us/library/bb259689.aspx. Then use a statistical compressor like Huffman coding. The same approach is used in Delaunay triangulation.
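For reference, here is a minimal JavaScript sketch of the quadkey construction, adapted from the Bing Maps tile-system page linked above (tileX/tileY are tile coordinates at a given zoom level):

function tileToQuadKey(tileX, tileY, zoom) {
  let quadKey = "";
  for (let i = zoom; i > 0; i--) {
    let digit = 0;
    const mask = 1 << (i - 1);
    if ((tileX & mask) !== 0) digit += 1;   // x bit contributes 1
    if ((tileY & mask) !== 0) digit += 2;   // y bit contributes 2
    quadKey += digit;                       // one base-4 digit per zoom level
  }
  return quadKey;
}
// Sorting points by their quadkey groups spatially close coordinates together,
// which is what helps the later compression stage.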
What is the best way to store ca. 100 sequences of doubles directly in the JS file? Each sequence will have a length of ca. 10,000 doubles or more.
Requirements
the JavaScript file must be executed as fast as possible
it is enough for me to iterate through the sequence on demand (I do not need to decode all the numbers at JS execution time; they will be decoded on an event)
it shouldn't take too much space
The simplest option is probably to use a string in CSV format, but then the doubles are not stored in the most efficient manner, right?
Another option might be to store the numbers in a Base64-encoded byte array, but then I have no idea how to read the base64 string back into doubles.
EDIT:
I would like to use the doubles to transform the Matrix4x4 of 3D nodes in Adobe 3D annotations. Adobe allows importing external files, but it is so complicated that it might be simpler to include all the data in the JS file directly.
As I mentioned in my comment, it is probably not worth it to try to encode the values. Here are some rough numbers, off the top of my head, on the amount of data required to store the doubles (updated from my comment).
Assuming 1,000,000 values:
Using direct encoding (won't work well in a JS file): 8 B = 8 MB
Using base64: 10.7 B = 10.7 MB
Literals (best case): 1 B + delimiter = 2 MB
Literals (worst case): 21 B + delimiter = 22 MB
Literals (average case assuming evenly distributed values): 19 B + delimiter = 20 MB
Note: A double can take 21 bytes (assuming 15 digits of precision) in the worst case like this: 1.23456789101112e-131
As you can see, even with encoding you won't be able to cut the size below about half of what plain literal values take, and if you plan on doing random-access decoding it will get complicated fast. It may be best to stick to literals. You might get some benefit from using the external file that you mentioned, but that depends on how much effort is needed to do so.
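For completeness, if you did go the Base64 route mentioned in the question, decoding back into doubles is short (a sketch; it assumes the bytes were originally produced from a Float64Array with the same endianness):

function base64ToFloat64Array(base64) {
  const binary = atob(base64);                 // base64 -> binary string
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Float64Array(bytes.buffer);       // reinterpret the bytes as doubles
}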
Some ideas on how to optimize using literals:
Depending on the precision required, you could approximate the values and limit them to, say, 5 digits of precision. This would shorten the file considerably (see the sketch after these points).
You could compress the file. I think you can write any double using just 14 distinct characters (0123456789.e-,), so theoretically you could compress such a string to about half its size. I don't know how close practical modern compression routines get, though.
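A tiny sketch of the precision idea (`values` is a placeholder for your array of doubles):

// Round each value to 5 significant digits before writing it into the JS file.
const trimmed = values.map((v) => Number(v.toPrecision(5)));
// e.g. 3.14159265 -> 3.1416, which is far fewer characters per literal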
I need to serialize moderately complex objects with 1 to 100s of mixed-type properties.
JSON was used originally, then I switched to BSON which is marginally faster.
Encoding 10,000 sample objects:
JSON: 1807 ms
BSON: 1687 ms
MessagePack: 2644 ms (JS, modified for BinaryF)
I want an order-of-magnitude improvement; serialization is having a ridiculously bad impact on the rest of the system.
Part of the motivation to move to BSON is the requirement to encode binary data, so JSON is (now) unsuitable. And because JSON simply skips the binary data present in the objects, it is "cheating" in those benchmarks.
Profiled BSON performance hot-spots
(unavoidable?) conversion of UTF-16 V8 JS strings to UTF-8.
malloc and string ops inside the BSON library
The BSON encoder is based on the Mongo BSON library.
A native V8 binary serializer might be wonderful, yet as JSON is native and quick to serialize I fear even that might not provide the answer. Perhaps my best bet is to optimize the heck out of the BSON library, or to write my own and figure out a far more efficient way to pull strings out of V8. One tactic might be to add UTF-16 support to BSON.
So I'm here for ideas, and perhaps a sanity check.
Edit
Added MessagePack benchmark. This was modified from the original JS to use BinaryF.
The C++ MessagePack library may offer further improvements; I may benchmark it in isolation to compare directly with the BSON library.
I wrote a recent (2020) article and benchmark comparing binary serialization libraries in JavaScript.
The following formats and libraries are compared:
Protocol Buffer: protobuf-js, pbf, protons, google-protobuf
Avro: avsc
BSON: bson
BSER: bser
JSBinary: js-binary
Based on the current benchmark results I would rank the top libraries in the following order (higher values are better, measurements are given as x times faster than JSON):
avsc: 10x encoding, 3-10x decoding
js-binary: 2x encoding, 2-8x decoding
protobuf-js: 0.5-1x encoding, 2-6x decoding,
pbf: 1.2x encoding, 1.0x decoding
bser: 0.5x encoding, 0.5x decoding
bson: 0.5x encoding, 0.7x decoding
I did not include msgpack in the benchmark, as it is currently slower than the built-in JSON library according to its NPM description.
For details, see the full article.
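To give an idea of what the top entry looks like in use, here is a minimal avsc sketch (the record schema and field names are made up; see the avsc docs for the exact API):

const avro = require("avsc");

// Define the (hypothetical) record schema once; encoding/decoding then reuses it.
const type = avro.Type.forSchema({
  type: "record",
  name: "Sample",
  fields: [
    { name: "id", type: "int" },
    { name: "payload", type: "bytes" },
  ],
});

const buf = type.toBuffer({ id: 42, payload: Buffer.from([1, 2, 3]) }); // -> Buffer
const obj = type.fromBuffer(buf);                                       // -> original object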
For serialization/deserialization, protobuf is pretty tough to beat. I don't know if you can switch out the transport protocol, but if you can, protobuf should definitely be considered.
Take a look at all the answers to Protocol Buffers versus JSON or BSON.
The accepted answer chooses Thrift. It is, however, slower than protobuf. I suspect it was chosen for ease of use (with Java), not speed. These Java benchmarks are very telling.
Of note:
MongoDB-BSON 45042
protobuf 6539
protostuff/protobuf 3318
The benchmarks are in Java, but I'd imagine that you can achieve speeds near the protostuff implementation of protobuf, i.e. 13.5 times faster. Worst case (if for some reason Java is just better at serialization), you can do no worse than the plain unoptimized protobuf implementation, which runs 6.8 times faster.
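If you can adopt protobuf on the Node side, a minimal protobufjs sketch looks roughly like this (the .proto file and message name are hypothetical):

const protobuf = require("protobufjs");

async function roundTrip(payload) {
  const root = await protobuf.load("sample.proto");       // hypothetical schema file
  const Message = root.lookupType("sample.Envelope");     // hypothetical message type
  const errMsg = Message.verify(payload);
  if (errMsg) throw new Error(errMsg);
  const buffer = Message.encode(Message.create(payload)).finish(); // -> Uint8Array
  return Message.decode(buffer);                                   // -> message instance
}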
Take a look at MessagePack. It's compatible with JSON. From the docs:
Fast and Compact Serialization
MessagePack is a binary-based efficient object serialization library. It enables the exchange of structured objects between many languages, like JSON. But unlike JSON, it is very fast and small. A typical small integer (like a flag or error code) is saved in only 1 byte, and a typical short string needs only 1 byte in addition to the string itself. [1,2,3] (a 3-element array) is serialized in 4 bytes using MessagePack.
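For a sense of the API, a small sketch using the @msgpack/msgpack package (one of several MessagePack implementations for JavaScript; the API is assumed from its README):

const { encode, decode } = require("@msgpack/msgpack");

const bytes = encode([1, 2, 3]);   // Uint8Array: fixarray header + three positive fixints = 4 bytes
const value = decode(bytes);       // [1, 2, 3]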
If you are more interested in de-serialisation speed, take a look at the JBB (JavaScript Binary Bundles) library. It is faster than BSON or MsgPack.
From the Wiki, page JBB vs BSON vs MsgPack:
...
JBB is about 70% faster than Binary-JSON (BSON) and about 30% faster than MsgPack on decoding speed, even with one negative test-case (#3).
JBB creates files that (even their compressed versions) are about 61% smaller than Binary-JSON (BSON) and about 55% smaller than MsgPack.
...
Unfortunately, it's not a streaming format, meaning that you must pre-process your data offline. However, there is a plan to convert it into a streaming format (check the milestones).