Manually "compressing" a very large number of boolean values in JSON - javascript

We have a data model where each entity has 600 boolean values. All of this data needs to travel over the wire from a node.js backend to an Angular frontend, via JSON.
I was thinking about various ways to optimize it (this is an internal API and is not public, so adherence to best practices is less important than performance and saving bandwidth).
I am not a native JavaScript speaker, so I was hoping to get some feedback on the options I'm considering, which are:
Turning it into a bitfield and using a huge (600-bit) BigInt.
Is this a feasible approach? I imagine it would be pretty horrific in terms of performance.
Splitting the 600 bits across an array of integers in the JSON (keeping in mind that JS numbers only represent integers exactly up to 53 bits, so it would take closer to 12 of them than 10).
Base64 encoding a binary blob (which would be decoded into a Uint8Array on the frontend, I assume? A rough sketch follows this list.)
Using something like Protobuf? It might be overkill, since I don't want to spend more than 1-2 hours on this optimization, and I definitely don't want to make major changes to the architecture either.
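For reference, a rough sketch of what the bit-packing plus Base64 idea might look like, assuming the booleans arrive as a plain array (packBools/unpackBools are just illustrative names):

// Node side: pack an array of booleans into bytes, then Base64-encode.
function packBools(bools) {
  const bytes = new Uint8Array(Math.ceil(bools.length / 8));
  bools.forEach((b, i) => {
    if (b) bytes[i >> 3] |= 1 << (i & 7);
  });
  return Buffer.from(bytes).toString('base64');
}

// Browser side: decode the Base64 string back into an array of booleans.
function unpackBools(base64, count) {
  const bytes = Uint8Array.from(atob(base64), c => c.charCodeAt(0));
  return Array.from({ length: count }, (_, i) => (bytes[i >> 3] & (1 << (i & 7))) !== 0);
}

For 600 booleans that comes to 75 bytes, or 100 Base64 characters on the wire.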
Side note: we don't have compression on the server end due to infrastructure reasons, which makes this more complicated and is why we're implementing this at the data level.
Thanks!

As Evan points out, you can transform each boolean into a single character, "t" for true and "f" for false. The 600 booleans then become one 600-character string, which could even be split into smaller chunks (say three strings of 200 characters) and simply concatenated again once received on the frontend. Recovering your boolean values from the string is then easy with a simple regex or a map.
I don't know how your data is produced and consumed, so wiring this transformation into that flow is something you would need to automate yourself.
Once the final string is available on the frontend, here is an example that converts it back into an array of your 600 booleans (a regex version is shown commented out; map works better). You could also produce an object keyed by index instead of an array.
function convert_myBool(str)
{
    /* A regex-based version is possible:
       var reg = new RegExp('.{1}', 'g');
       var tmpTab = str.replace(reg, matched => matched === "t");
       ...but replace() returns a string, so map is the better fit. */
    // Split into single characters and map "t" -> true, everything else -> false.
    var tmpTab = str.split('').map(value => value === "t");
    return tmpTab;
}
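Going the other way, a minimal sketch of the encoding side (assuming the 600 booleans come in as an array; encode_myBool is just an illustrative name):

function encode_myBool(bools) {
    // true -> "t", false -> "f", joined into one string
    return bools.map(b => (b ? 't' : 'f')).join('');
}

// encode_myBool([true, false, true]) === 'tft'
// convert_myBool('tft') gives back [true, false, true]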
I wrote this quickly, so of course it can be refined, improved, or replaced. Hoping to have helped :)

Can the data be sorted in any way? If there are boolean values that always occur in conjunction with a related value, you may be able to group them and simplify.
Depending on what you use that data for, you may be able to cache some of it or memoize based on usage frequency. There would be a space trade-off with caching, however.

Related

Converting JSON objects to unique UTF-8 strings, limited by 750 characters, other ideas are welcomed

I have this issue with Google's Firestore and Google's Realtime DB IDs/duplicates, but I think it is a more general problem that may have multiple solutions without even considering Firebase.
Right now, I create IDs from my JSON object like this:
// Preparing data
let data = {
workouts: [{...}],
movements: [{...}]
}
// Creating ID from data
const id = btoa(JSON.stringify(data))
// Deleting workouts and movements
delete data.workouts
delete data.movements
// Adding multiple other properties to data objects for example:
data.date = new Date()
// ... and much more
// Creating new document in DB here, or
// alerting the user it already exists if data.id already exists
When I load the data object from Firestore, I decode it like this:
const data = JSON.parse(atob(this.workout.id))
My goal is to have only unique workouts + movements combinations in my database and generating id based on data from workouts + movements solves it.
The issue is that the Realtime DB has a limit of 750 bytes per ID (750 UTF-8 characters) and Firestore has a limit of 1500 bytes per ID. I just discovered that by generating an ID of roughly 1000 characters, and I believe user data could push me past even the 1500-character limit.
My ideas:
1) Use some different encoding (supporting UTF-8) that can turn even a long (1000-character) string into something like 100 characters max. It would still need to be decodable. Is that even possible, or is Base64 the shortest it can get?
2) Use autogenerated IDs, save the encoded string as a data.id property in the DB, and when creating a new workout always compare that data.id against the data.id values of the workouts already created.
Is it possible to solve that without looping through all existing workouts?
3) Any other idea? I am still thinking in terms of encoding/decoding, but I suspect there is a simpler solution.
Do not btoa
First off, a Base64 string is going to be longer than the stringified JSON (roughly a third longer for ASCII input), so if you're struggling with a character limit and you can use full UTF-8, do not btoa anything.
IDs
You're looking for a hash. You could (not recommended) try to roll your own by writing hashing functions for the JSON primitives, each returning a number:
{ ... } an object should have its properties sorted by name, then hashed
string a string should build its hash from its individual characters (.charCodeAt())
number a number can probably be kept as-is
[ ... ] I'm not really sure what to do with arrays; probably assume a different order means a different hash and hash them as-is
Then you'd walk the JSON recursively, constructing the value like this:
// hashStringValue and hashObjectValue are the per-type helpers described above
let hash = 0;
hash += hashStringValue("ddd");
hash *= 31;
hash += hashObjectValue({one: 1, test: "text"});
return hash;
The multiplication by a prime before each addition is a cheap trick, but this only works for a limited depth of object.
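A minimal sketch of that recursive approach (illustrative only; it is not collision-resistant and, as noted, not suited to deep or huge objects):

// Hash any JSON-compatible value into a 32-bit integer.
function hashJson(value) {
  if (typeof value === 'number') return value | 0;
  if (typeof value === 'string') {
    let h = 0;
    for (let i = 0; i < value.length; i++) h = (h * 31 + value.charCodeAt(i)) | 0;
    return h;
  }
  if (Array.isArray(value)) {
    // different order -> different hash, as discussed above
    return value.reduce((h, v) => (h * 31 + hashJson(v)) | 0, 7);
  }
  if (value && typeof value === 'object') {
    // sort properties by name so key order does not matter
    return Object.keys(value).sort()
      .reduce((h, k) => ((h * 31 + hashJson(k)) * 31 + hashJson(value[k])) | 0, 17);
  }
  return value === true ? 1 : 0; // booleans, null, undefined
}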
Use library for hash
I googled javascript hash json and found this: https://www.npmjs.com/package/json-hash which looks like what you want:
npm install json-hash
// If you're not on Babel, use:
// require('babel/polyfill')
var assert = require('assert')
var hash = require('json-hash')
// hash.digest(any, options?)
assert.equal(hash.digest({foo:1,bar:1}), hash.digest({bar:1,foo:1}))
Storage
For storing the JSON data itself, if you really need it, use a compression algorithm such as LZString. You could also filter the JSON and keep only the values you really need.
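For example, using the lz-string package (a sketch; how much it actually saves depends on your data):

// npm install lz-string
const LZString = require('lz-string');

const json = JSON.stringify({ workouts: [/* ... */], movements: [/* ... */] });
const packed = LZString.compressToUTF16(json);            // store this string
const restored = JSON.parse(LZString.decompressFromUTF16(packed));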

Dealing with arbitrarily large inbound data in JavaScript

Here's a function to remove specified characters from a string:
function remove(str, chars) {
var set = new Set(chars);
return [...str].filter(i => !set.has(i)).join('');
}
console.log(remove('hello world', 'el') === 'ho word'); // true
But... what if the inbound string is arbitrarily large and possibly continually extended?
Presumably we need a completely different strategy to deal with it in a piecemeal fashion?
Would such an implementation look like constructing a buffer object that is periodically updated as the data is inbound, and then having sampling logic to deal with the "delta", process it and pass it on?
And that this would have to be done asynchronously to avoid blocking everything else on the event loop?
Is this essentially what Node.js streams are?
[...str] converts the string into an array of 1-character strings, which occupies additional memory. Then .filter() produces another array of strings, which could be as big as the previous one, depending on the input data. And then there is the resulting string.
If you are concerned about memory and/or performance, you can implement this with a regular for loop and the charAt function.
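A minimal sketch of that loop-based variant (same behaviour as the original remove, just without the intermediate arrays):

function remove(str, chars) {
  const set = new Set(chars);
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const ch = str.charAt(i);
    if (!set.has(ch)) out += ch; // keep only characters not in the removal set
  }
  return out;
}

console.log(remove('hello world', 'el') === 'ho word'); // true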

MD5 hash is Different

On SQL Server, the output is: 0x5C8C8AAFE7AE37EA4EBDF8BFA01F82B8
SELECT HASHBYTES('MD5', convert(varchar,getdate(),112)+'mytest#+')
In JavaScript, the output is: 5c8c8aafe7ae37ea4ebdf8bfa01f82b8
//to get Md5 Hash bytes
vm.getMd5Hashbytes = function () {
var currentDate = moment().format('YYYYMMDD');
var md5Hash = md5.createHash(currentDate + 'mytest#+');
return md5Hash;
}
angular-md5 module
Q: Can you tell me why this difference? SQL Server shows a 0x prefix. Why?
This is purely a formatting issue. Both versions are producing an identical sequence of bytes. SQL Server and node just have different conventions when it comes to presenting these bytes in a human readable format.
You can get similar formatting by specifically telling SQL Server how to format your binary data
declare @hashAsBinary varbinary(max)
declare @hashAsText char(32)
set @hashAsBinary = HASHBYTES('MD5', '20160818mytest#+')
set @hashAsText = LOWER(CONVERT(varchar(max), @hashAsBinary, 2))
select @hashAsText
Which outputs:
5c8c8aafe7ae37ea4ebdf8bfa01f82b8
See SQL Server converting varbinary to string
I am not sure how else to explain it but it will take more space than a comment allows for so I will post it as an answer.
Look at the source code that you are referencing. At the end (lines 210 and 212) you will see it converts the binary value to a hex string (and then to lower case which does not matter unless you opt for a string comparison at the end). End result = your JavaScript library returns a representation using the type string formatted as hex.
Your Sql function HASHBYTES on the other hand produces a varbinary typed result (which is a different type than string (varchar)).
So you have two different data types (each living in its own space, since you have not converted one to the other). You never mention where you are doing the comparison, i.e. in the database, or after pulling the value from the database into script. Either way, to do a comparison you need to convert one side so that you are comparing either two string types or two binary types. If you do not compare similar types you will get unexpected results or run-time exceptions.
If you are comparing as strings AND in JavaScript, then look at the library you are referencing: it already has a function named wordToHex. Copy, paste, and reuse it to convert your SQL result to a string, then do a string comparison (do not forget to compare case-insensitively, or also make it lower case).
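For example, a small sketch of the string-side comparison (hashesMatch is just an illustrative helper):

// Normalize both representations to lower-case hex with no 0x prefix, then compare.
function hashesMatch(sqlHash, jsHash) {
  const normalizedSql = String(sqlHash).replace(/^0x/i, '').toLowerCase();
  return normalizedSql === String(jsHash).toLowerCase();
}

hashesMatch('0x5C8C8AAFE7AE37EA4EBDF8BFA01F82B8',
            '5c8c8aafe7ae37ea4ebdf8bfa01f82b8'); // true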
Edit
The WebApi is a black box for me. It is a 3rd-party service. I just need to send the security token as mentioned above.
Assuming that the type accepted by that web API is byte[], appending 0x to your string in JavaScript and then sending it to the web API should work, as the web API will then translate the incoming parameter into a byte array and execute the comparison using the correct types. Since this is a black box, there is no way to know for certain unless you either ask them whether the accepted type is indeed a byte array, or test it.
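In other words, something along these lines (a guess, given the black-box API):

// Prefix the JavaScript hex hash so it resembles SQL Server's varbinary rendering.
// (Whether letter casing matters depends on how the web API parses it.)
var token = '0x' + vm.getMd5Hashbytes();
// ...then send `token` as the security token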

MongoDB shell: printing to console without a trailing newline?

Is there a way to write to STDOUT without a trailing newline from the Mongo shell? I can't seem to find anything other than print() available.
This is related to my SO question on reading a line from the console. Per @Stennie's comment, it is not possible in the current (2.0.6) version of the Mongo shell.
There might be ways to work around it. You can accumulate the results in an intermediate variable (an array, a string, or any other data structure), then print the entire thing on a single line. The example below uses an array to capture values from the query results, then converts the array to a string with a comma as separator. In my case I'm interested in just the _id field:
var cursor = db.getCollection('<collection name>').find(<your query goes here>)
let values = []
cursor.forEach((doc) => values.push(doc._id))
print(values.join(','))
Depending on how many results you're expecting, the space consumed by the intermediate data structure might overwhelm memory. If that's the case, you can craft the query to return smaller subsets of data that, added together, comprise the full result set you're going for.
This is quite an old question, but it is still relevant, so I'm answering.
One can use printjsononeline().
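For example, in the mongo shell:

// prints the document as single-line JSON
printjsononeline(db.mycollection.findOne())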

Extract values from a 'binary format' file and create text file with fixed field sizes

I've used JavaScript to get the data through the internet (I'm interfacing with a brokerage firm's API functions), but unlike most of the rest of their API's, this one returns the data in a 'binary' like format. Here is the layout of the file I get back:
Field           Type     Length (8-bit bytes)  Description
Symbol Count    Integer  4                     Number of symbols for which data is being returned; the subsequent sections are repeated this many times
REPEATING SYMBOL DATA
Symbol Length   Short    2                     Length of the Symbol field
Symbol          String   Variable              The symbol for which the historical data is returned
Error Code      Byte     1                     0 = OK, 1 = ERROR
Error Length    Short    2                     Only returned if Error Code = 1; length of the Error Text string
Error Text      String   Variable              Only returned if Error Code = 1; the string describing the error
Bar Count       Integer  4                     Number of chart bars; only returned if Error Code = 0
REPEATING PRICE DATA
close           Float    4
high            Float    4
low             Float    4
open            Float    4
volume          Float    4                     In hundreds
timestamp       Long     8                     Time in milliseconds since 00:00:00 UTC on January 1, 1970
END OF REPEATING PRICE DATA
Terminator      Bytes    2                     0xFF, 0xFF
END OF REPEATING SYMBOL DATA
As you can see, this file is a mixture of different field types. My requirement is to convert it from the way it is into a fixed-field text file (or a CSV file). I'm not very good at JavaScript, but I know enough to get by. My main language is MAPPER from Unisys (it is actually called "Business Information Server"). Currently I get all HTTP responses as text files, but this one is a 'binary' file, and MAPPER cannot process it because it is a text-based language (a 4GL). I've spent days trying to find a snippet of JavaScript code that I could use, but to no avail. I think this is really simple stuff for someone who knows JavaScript.
I'm a fellow UNISYS programmer. 25 years of FORTRAN 77 on a 2200 mainframe. Happily, I rarely had anything to do with MAPPER.
I'd like to help, but you're not providing enough information.
Where is this JavaScript code running? In a browser, or is it an extension to whatever you're using to access MAPPER?
Are you using some kind of terminal emulator? AttachMate?
Is your data really arriving in a file, or is it in memory? How are you receiving it, how are you passing on the contents?
Is it vital that your processing happen in JavaScript? There are dozens of languages that would make very short work of the task if the data were lying around as a file and the output should be a file too.
One problem I see is that, AFAIK, JavaScript doesn't know about file IO. That's why I'm asking where it's running.
EDIT:
OK, somehow you have a browser-like environment and JavaScript running in it.
First, the problem of getting binary data out of your response. Here's a bit of help:
https://developer.mozilla.org/en/using_xmlhttprequest
This is Mozilla documentation, under "Receiving binary data," but I'm hoping there will be enough overlap for it to be useful:
function load_binary_resource(url) {
var req = new XMLHttpRequest();
req.open('GET', url, false);
//XHR binary charset opt by Marcus Granado 2006 [http://mgran.blogspot.com]
req.overrideMimeType('text/plain; charset=x-user-defined');
req.send(null);
if (req.status != 200) return '';
return req.responseText;
}
The above lets you fiddle with the connection a bit to hopefully obtain binary data.
That function is called like so:
var filestream = load_binary_resource(url);
var abyte = filestream.charCodeAt(x) & 0xff;
...and if I understand this correctly, your responseText is a JavaScript string (as usual) but thanks to the fiddling and the binary content, it's not containing printable text but binary data. Heh, as long as you don't try to interpret it, it's just a series of bytes just like any old text.
The second line lets you extract a single byte from any position in the string. charCodeAt() returns a code unit between 0 and 65535, and masking with & 0xff brings it down to an unsigned byte between 0 and 255.
This may look like it's doing you a lot of no good. Let's see how you could get to your data:
Your symbol data starts off with a short called Symbol Length (the file as a whole begins with a 4-byte Symbol Count integer). I'm guessing a short is 2 bytes, and that offsets for charCodeAt() begin at 0. So you'll want the first two bytes of that field, or bytes 0 and 1. I'm not sure whether your data arrives big-endian or little-endian, but you should be able to reconstruct that short from either
var symbolLength = fileStream.charCodeAt(0) + 256 * fileStream.charCodeAt(1);
or
var symbolLength = 256 * fileStream.charCodeAt(0) + fileStream.charCodeAt(1);
In other words, using multiplication to re-assemble bytes into integers.
Integers are presumably 4 bytes, so you'll be multiplying by 4 powers of 256: 16777216, 65536, 256 and 1 - again, either in that order or reversed.
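A small sketch of that, assuming big-endian byte order (reverse the offsets if the data turns out to be little-endian):

// Read an unsigned 32-bit integer out of the "binary string" at a given offset.
function readUint32(stream, offset) {
  return ((stream.charCodeAt(offset)     & 0xff) * 16777216) +
         ((stream.charCodeAt(offset + 1) & 0xff) * 65536) +
         ((stream.charCodeAt(offset + 2) & 0xff) * 256) +
          (stream.charCodeAt(offset + 3) & 0xff);
}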
The String data is, of course, just that, and once you've taken into account the number of bytes taken up by the preceding fields, you should be able to dig it out of your response string simply using substring operators.
Now for the yucky part - conversion of floats. The internal structure of those numbers is defined by IEEE 754. float probably corresponds to binary32 and double (if you have any) to binary64. The links from the Wikipedia article I linked explain these formats well enough that you could write your own conversion routine if you were desperate, but in your shoes I'd go looking for ready-built code. Surely you're not the first person faced with converting a handful of bytes into a floating-point number. Maybe you can find some C or Java code you could hand-convert, or you could even find a routine already written in JavaScript.
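One ready-built route in modern JavaScript engines (not part of the original answer, so treat it as an option) is to copy the four bytes into an ArrayBuffer and let DataView do the IEEE 754 interpretation:

// Decode a binary32 float from the "binary string" at a given offset.
function readFloat32(stream, offset, littleEndian) {
  const view = new DataView(new ArrayBuffer(4));
  for (let i = 0; i < 4; i++) {
    view.setUint8(i, stream.charCodeAt(offset + i) & 0xff);
  }
  return view.getFloat32(0, littleEndian);
}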
Finally, once you have in hand methods to convert all the data types you mentioned, all you need to do is to format that data in whatever format you want to see downstream in MAPPER. Loop through the structures, incrementing the pointers for the offsets... probably nothing new for you.
Admittedly, I've done a lot of guessing and handwaving here. This could be the beginning of a solution but you'll probably want to do a bit of experimenting and hit SO with some detail questions. Don't mention UNISYS, phrase your question as if you wanted to do this in IE :)
As a first step, I'd try dumping out your incoming binary string, byte by byte, preferably in hex, to some medium where you can read it and compare the bytes you see with the bytes you're expecting from the input data.
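Something like this would do for that first dump (hexDump is just an illustrative name):

// Hex dump of the first n bytes of the response string, space-separated.
function hexDump(stream, n) {
  const bytes = [];
  for (let i = 0; i < Math.min(n, stream.length); i++) {
    bytes.push((stream.charCodeAt(i) & 0xff).toString(16).padStart(2, '0'));
  }
  return bytes.join(' ');
}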
