I'm passing a table of up to 1000 rows, consisting of name, ID, latitude and longitude values, to the client.
The list will then be processed by Javascript and converted to markers on a Google map.
I initially planned to do this with JSON, as I want the code to be readable and easy to deal with, and because we may be adding more structure to it over time.
However, my colleague suggested passing it down as a Javascript array, as it would reduce the size greatly.
This made me think, maybe JSON is a bit redundant. After all, for each row defined, the name of each field is also being outputted repetitively. Whereas, for an array, the position of the cells is used to indicate the field.
However, would there really be a performance improvement by using an array?
The site uses GZIP compression. Is this compression effective enough to take care of any redundancy found in a JSON string?
[edit]
I realize JSON is just a notation.
But my real question is - what notation is best, performance-wise?
If I use fully named attributes, then I can have code like this:
var x = resultset.rows[0].name;
Whereas if I don't, it will look less readable, like so:
var x = resultset.rows[0][2];
My question is - would the sacrifice in code readability be worth it for the performance gains? Or not?
Further notes:
According to Wikipedia, the Deflate compression algorithm (used by gzip) performs 'Duplicate string elimination'. http://en.wikipedia.org/wiki/DEFLATE#Duplicate_string_elimination
If this is correct, I have no reason to be concerned about any redundancy in JSON, as it's already been taken care of.
JSON is just a notation (Javascript Object Notation), and includes JS arrays -- even if there is the word "object" in its name.
See its grammar on http://json.org/ which defines an array like this (quoting) :
An array is an ordered collection of
values. An array begins with [ (left
bracket) and ends with ] (right
bracket). Values are separated by ,
(comma).
This means this (taken from JSON Data Set Sample) would be valid JSON :
[ 100, 500, 300, 200, 400 ]
Even if it doesn't include nor declare nor whatever any object at all.
In your case, I suppose you could use some array, storing data by position, and not by name.
If you are worried about size you could want to "compress" that data on the server side by yourself, and de-compress it on the client side -- but I wouldn't do that : it would mean you'd need more processing time/power on the client side...
I'd rather go with gzipping of the page that contains the data : you'll have nothing to do, it's fully automatic, and it works just fine -- and the difference in size will probably not be noticeable.
I suggest to use a simple CSV format. There is a nice article on the Flickr Development Blog where they talked about their experience with such a problem. But the best would be to try it on your own.
Related
I am thinking of a implementing a new project that has import/export feature. First, I will have an array of around 45 objects. The object structure is simple like this.
{"id": "someId", "quantity": 3}
So, in order to make it exportable, I will have to change the whole array of these objects into one single string first. For this part, I think I will use JSON.stringify(). After that, I want to make the string as short as possible for the users to use it (copy the string and paste it to share to other users to import it back to get the original array). I know this part is not necessary but I really want to make it as short as possible. Hence, the question. How to convert array of objects to a shortest possible string?
Any techniques such as Encoding, Encryption, or Hashing are acceptable as long as it is reversible to get the original data.
By "shortest possible", I mean you can answer any solution that is shorter than just pure stringification. I will just accept the one that gives shortest string to import.
I tried text minification but it gives almost the same result as the original text. I also tried encryption but it still gives a relatively long result.
Note: The string for import (that comes from export) can be human-readable or unreadable. It does not matter.
Deleting json optional SPACE after : colon and , comma
is a no-brainer. Let's assume you have already minified
in that way.
xz compression is generally helpful.
Perhaps you know some strings that are very likely
to repeatedly appear in the input doc. That might include:
"id":
"quantity":
Construct a prefix document which mentions such terms.
Sender will compress prefix + doc,
strip the initial unchanging bytes,
and send the rest.
Receiver will accept those bytes via TCP,
prepend the unchanging bytes,
and decompress.
Why does this improve compression ratios?
Lempel-Ziv and related schemes maintain a dictionary,
and transmit integer indexes into that dictionary
in order to indicate common words.
A word can be fairly long, even longer than "quantity".
The longer it is, the greater the savings.
If sender and receiver both know a set of words
that belong in the dictionary, beforehand,
we can avoid sending the raw text of those words.
Your chrome browser
compresses web headers
in this way already, each time you do a google search.
Finally, you might want to base64 encode the compressed output.
Ignore compression, and use a database instead,
in the way that tinyurl.com has been doing for quite a long time.
Set serial to 1.
Accept a new object, or set of objects.
Ask the DB if you've seen this input before.
If not, store it under a brand new serial ID.
Now send the matching ID to the receiving end.
It can query the central database when it gets a new ID,
and it can cache such results to use in future.
You might opt for a simple CSV export. The export string becomes, if you use the pipe separator, something like:
id|quantity\nsomeId|3\notherId|8
which is the equivalent of
[{"id":"someId","quantity":3},{"id":"otherId","quantity":8}]
This approach will remove the redundant id and quantity tags for each record and remove the unnecessary double quotes.
The downside is that your records all should have the same data structure but that is generally the case.
In writing a database to disk as a text file of JSON strings, I've been experimenting with how to most efficiently build the string of text that is ultimately converted to a blob for download to disk.
There a number of questions that state to not concatenate a string with the + operator in a loop, but instead write the component strings to an array and then use the join method to build one large string.
The best explanation I came across explaining why can be found here, by Jeol Mueller:
In JavaScript (and C# for that matter) strings are immutable. They can never be changed, only replaced with other strings. You're probably aware that combined + "hello " doesn't directly modify the combined variable - the operation creates a new string that is the result of concatenating the two strings together, but you must then assign that new string to the combined variable if you want it to be changed.
So what this loop is doing is creating a million different string objects, and throwing away 999,999 of them. Creating that many strings that are continually growing in size is not fast, and now the garbage collector has a lot of work to do to clean up after this."
The thread here, was also helpful.
However, using the join method didn't allow me to build the string I was aiming for without getting the error:
allocation size overflow
I was trying to write 50,000 JSON strings from a database into one text file, which simply may have been too large no matter what. I think it was reaching over 350MB. I was just testing the limit of my application and picked something far larger than a user of the application will likely ever create. So, this test case was likely unreasonable.
Nonetheless, this leaves me with three questions about working with large strings.
For the same amount of data overall, does altering the number of array elements joined in a single join operation affect the efficiency in terms of not hitting an allocation size overflow?
For example, I tried writing the JSON strings to a pseudo 3-D array of 100 (and then 50) elements per dimension; and then looped through the outer two dimensions joining them together. 100^3 = 1,000,000 or 50^3 = 125,000 both provide more than enough entries to hold the 50,000 JSON strings. I know I'm not including the 0 index, here.
So, the 50,000 strings were held in an array from a[1][1][1] to a5[100][100] in the first attempt and of a[1][1][1] to a[20][50][50] in the second attempt. If the dimensions are i, j, k from outer to inner, I joined all the k elements in each a[i][j]; and then joined all of those i x j joins, and lastly all of these i joins into the final text string.
All attemtps still hit the allocation size overflow before completing.
So, is there any difference between joining 50,000 smaller strings in one join versus 50 larger strings, if the total data is the same?
Is there a better, more efficient way to build large strings than the join method?
Does the same principle described by Joel Mueller regarding string concatenation apply to reducing a string through substring, such as string = string.substring(position)?
The context of this third question is that when I read a text file in as a string and break it down into its component JSON strings before writing to the database, I use an array that is map of the file layout; so, I know the length of each JSON string in advance and repeat three statements inside a loop:
l = map[i].l;
str = text.substring(0,l);
text = text.substring(l).
It would appear that since strings are immutable, this sort of reverse of concatenation step is as inefficient as using the + operator to concatenate.
Would it be more efficient to not delete the str from text each iteration, and just keep track of the increasing start and end positions for the substrings as step through the loop reading the entire text string?
Response to message about duplicate question
I got a message, I guess from the stackoverflow system itself, asking me to edit my question explaining why it is different from the proposed duplicate.
Reasons are:
The proposed duplicate asks specifically and exclusively about the maximum size of a single string. None of the three bolded questions, here, asks about the maximum size of a single string, although that is useful to know.
This question asks about the most efficient way of building large strings and that isn't addressed in the answers found in the proposed duplicate, apart from an efficent way of building a large test string. They don't address how to build a realistic string, comprised of actual application data.
This question provides a couple links to some information concerning the efficiency of building large strings that may be helpful to those interested in more than the maximum size alone.
This question also has a specific context of why the large string was being built, which led to some suggestions about how to handle that situation in a more efficient manner. Although, in the strictest sense, they don't specifically address the question by title, they do address the broader context of the question as presented, which is how to deal with the large strings, even if that means ways to work around them. Someone searching on this same topic might find real help in these suggestions that is not provided in the proposed duplicate.
So, although the proposed duplicate is somewhat helpful, it doesn't appear to be anywhere near a genuine duplicate of this question in its full context.
Additional Information
This doesn't answer the question concerning the most efficient way to build a large string, but it refers to the comments about how to get around the string size limit.
Converting each component string to a blob and holding them in an array, and then converting the array of blobs into a single blob, accomplished this. I don't know what the size limit of a single blob is, but did see 800MB in another question.
A process (or starting point) for creating the blob to write the database to disk and then to read it back in again can be found here.
Regarding the idea of writing the blobs or strings to disk as they are generated on the client as opposed to generating one giant string or blob for download, although the most logical and efficient method, may not be possible in the scenario presented here of an offline application.
According to this question, web extensions no longer have access to the privileged javascript code necessary to accomplish this through the File API.
I asked this question related to the Streams API write stream method and something called StreamSaver.
In writing a database to disk as a text file of JSON strings.
I see no reason to store the data in a string or array of strings in this case. Instead you can write the data directly to the file.
In the simplest case you can write each string to the file separately.
To get better performance, you could first write some data to a smaller buffer, and then write that buffer to disk when it's full.
For best performance you could create a file of a certain size and create a memory mapping over that file. Then write/copy the data directly to the mapped memory (which is your file). The trick would be to know or guess the size up front, or you could resize the file when needed and then remap the file.
Joining or growing strings will trigger a lot of memory (re)allocations, which is unnecessary overhead in this case.
I don't want the user to have to download more than one file
If the goal is to let a user download that generated file, you could even do better by streaming those strings directly to the user without even creating a file. This also has the advantage that the user starts receiving data immediately instead of first having to wait till the whole file is generated.
Because the file size is not known up front, you could use chunked transfer encoding.
Consider this JSON response:
[{
Name: 'Saeed',
Age: 31
}, {
Name: 'Maysam',
Age: 32
}, {
Name: 'Mehdi',
Age: 27
}]
This works fine for small amount of data, but when you want to serve larger amounts of data (say many thousand records for example), it seems logical to prevent those repetitions of property names in the response JSON somehow.
I Googled the concept (DRYing JSON) and to my surprise, I didn't find any relevant result. One way of course is to compress JSON using a simple home-made algorithm and decompress it on the client-side before consuming it:
[['Name', 'Age'],
['Saeed', 31],
['Maysam', 32],
['Mehdi', 27]]
However, a best practice would be better than each developer trying to reinvent the wheel. Have you guys seen a well-known widely-accepted solution for this?
First off, JSON is not meant to be the most compact way of representing data. It's meant to be parseable directly into a javascript data structure designed for immediate consumption without further parsing. If you want to optimize for size, then you probably don't want self describing JSON and you need to allow your code to make a bunch of assumptions about how to handle the data and put it to use and do some manual parsing on the receiving end. It's those assumptions and extra coding work that can save you space.
If the property names and format of the server response are already known to the code, you could just return the data as an array of alternating values:
['Saeed', 31, 'Maysam', 32, 'Mehdi', 27]
or if it's safe to assume that names don't include commas, you could even just return a comma delimited string that you could split into it's pieces and stick into your own data structures:
"Saeed, 31, Maysam, 32, Mehdi, 27"
or if you still want it to be valid JSON, you can put that string in an array like this which is only slightly better than my first version where the items themselves are array elements:
["Saeed, 31, Maysam, 32, Mehdi, 27"]
These assumptions and compactness put more of the responsibility for parsing the data on your own javascript, but it is that removal of the self describing nature of the full JSON you started with that leads to its more compact nature.
One solution is known as hpack algorithm
https://github.com/WebReflection/json.hpack/wiki
You might be able to use a CSV format instead of JSON, as you would only specify the property names once. However, this would require a rigid structure like in your example.
JSON isn't really the kind of thing that lends itself to DRY, since it's already quite well-packaged considering what you can do with it. Personally, I've used bare arrays for JSON data that gets stored in a file for later use, but for simple AJAX requests I just leave it as it is.
DRY usually refers to what you write yourself, so if your object is being generated dynamically you shouldn't worry about it anyway.
Use gzip-compression which is usually readily built into most web servers & clients?
It will still take some (extra) time & memory to generate & parse the JSON at each end, but it will not take that much time to send over the network, and will take minimal implementation effort on your behalf.
Might be worth a shot even if you pre-compress your source-data somehow.
It's actually not a problem for JSON that you've often got massive string or "property" duplication (nor is it for XML).
This is exactly what the duplicate string elimination component of the DEFLATE-algorithm addresses (used by GZip).
While most browser clients can accept GZip-compressed responses, traffic back to the server won't be.
Does that warrant using "JSON compression" (i.e. hpack or some other scheme)?
It's unlikely to be much faster than implementing GZip-compression in Javascript (which is not impossible; on a reasonably fast machine you can compress 100 KB in 250 ms).
It's pretty difficult to safely process untrusted JSON input. You need to use stream-based parsing and decide on a maximum complexity threshold, or else your server might be in for a surprise. See for instance Armin Ronacher's Start Writing More Classes:
If your neat little web server is getting 10000 requests a second through gevent but is using json.loads then I can probably make it crawl to a halt by sending it 16MB of well crafted and nested JSON that hog away all your CPU.
I have a problem I'd like to solve to not have to spend a lot of manual work to analyze as an alternative.
I have 2 JSON objects (returned from different web service API or HTTP responses). There is intersecting data between the 2 JSON objects, and they share similar JSON structure, but not identical. One JSON (the smaller one) is like a subset of the bigger JSON object.
I want to find all the interesecting data between the two objects. Actually, I'm more interested in the shared parameters/properties within the object, not really the actual values of the parameters/properties of each object. Because I want to eventually use data from one JSON output to construct the other JSON as input to an API call. Unfortunately, I don't have the documentation that defines the JSON for each API. :(
What makes this tougher is the JSON objects are huge. One spans a page if you print it out via Windows Notepad. The other spans 37 pages. The APIs return the JSON output compressed as a single line. Normal text compare doesn't do much, I'd have to reformat manually or w/ script to break up object w/ newlines, etc. for a text compare to work well. Tried with Beyond Compare tool.
I could do manual search/grep but that's a pain to cycle through all the parameters inside the smaller JSON. Could write code to do it but I'd also have to spend time to do that, and test if the code works also. Or maybe there's some ready made code already for that...
Or can look for JSON diff type tools. Searched for some. Came across these:
https://github.com/samsonjs/json-diff or https://tlrobinson.net/projects/javascript-fun/jsondiff
https://github.com/andreyvit/json-diff
both failed to do what I wanted. Presumably the JSON is either too complex or too large to process.
Any thoughts on best solution? Or might the best solution for now be manual analysis w/ grep for each parameter/property?
In terms of a code solution, any language will do. I just need a parser or diff tool that will do what I want.
Sorry, can't share the JSON data structure with you either, it may be considered confidential.
Beyond Compare works well, if you set up a JSON file format in it to use Python to pretty-print the JSON. Sample setup for Windows:
Install Python 2.7.
In Beyond Compare, go under Tools, under File Formats.
Click New. Choose Text Format. Enter "JSON" as a name.
Under the General tab:
Mask: *.json
Under the Conversion tab:
Conversion: External program (Unicode filenames)
Loading: c:\Python27\python.exe -m json.tool %s %t
Note, that second parameter in the command line must be %t, if you enter two %ss you will suffer data loss.
Click Save.
Jeremy Simmons has created a better File Format package Posted on forum: "JsonFileFormat.bcpkg" for BEYOND COMPARE that does not require python or so to be installed.
Just download the file and open it with BC and you are good to go. So, its much more simpler.
JSON File Format
I needed a file format for JSON files.
I wanted to pretty-print & sort my JSON to make comparison easy.
I have attached my bcpackage with my completed JSON File Format.
The formatting is done via jq - http://stedolan.github.io/jq/
Props to
Stephen Dolan for the utility https://github.com/stedolan.
I have sent a message to the folks at Scooter Software asking them to
include it in the page with additional formats.
If you're interested in seeing it on there, I'm sure a quick reply to
the thread with an up-vote would help them see the value posting it.
Attached Files Attached Files File Type: bcpkg JsonFileFormat.bcpkg
(449.8 KB, 58 views)
I have a small GPL project that would do the trick for simple JSON. I have not added support for nested entities as it is more of a simple ObjectDB solution and not actually JSON (Despite the fact it was clearly inspired by it.
Long and short the API is pretty simple. Make a new group, populate it, and then pull a subset via whatever logical parameters you need.
https://github.com/danielbchapman/groups
The API is used basically like ->
SubGroup items = group
.notEqual("field", "value")
.lessThan("field2", 50); //...etc...
There's actually support for basic unions and joins which would do pretty much what you want.
Long and short you probably want a Set as your data-type. Considering your comparisons are probably complex you need a more complex set of methods.
My only caution is that it is GPL. If your data is confidential, odds are you may not be interested in that license.
I would like to write a JavaScript function that validates a zip code, by checking if the zip code actually exists. Here is a list of all zip codes:
http://www.census.gov/tiger/tms/gazetteer/zips.txt (I only care about the 2nd column)
This is really a compression problem. I would like to do this for fun. OK, now that's out of the way, here is a list of optimizations over a straight hashtable that I can think of, feel free to add anything I have not thought of:
Break zipcode into 2 parts, first 2 digits and last 3 digits.
Make a giant if-else statement first checking the first 2 digits, then checking ranges within the last 3 digits.
Or, covert the zips into hex, and see if I can do the same thing using smaller groups.
Find out if within the range of all valid zip codes there are more valid zip codes vs invalid zip codes. Write the above code targeting the smaller group.
Break up the hash into separate files, and load them via Ajax as user types in the zipcode. So perhaps break into 2 parts, first for first 2 digits, second for last 3.
Lastly, I plan to generate the JavaScript files using another program, not by hand.
Edit: performance matters here. I do want to use this, if it doesn't suck. Performance of the JavaScript code execution + download time.
Edit 2: JavaScript only solutions please. I don't have access to the application server, plus, that would make this into a whole other problem =)
You could do the unthinkable and treat the code as a number (remember that it's not actually a number). Convert your list into a series of ranges, for example:
zips = [10000, 10001, 10002, 10003, 23001, 23002, 23003, 36001]
// becomes
zips = [[10000,10003], [23001,23003], [36001,36001]]
// make sure to keep this sorted
then to test:
myzip = 23002;
for (i = 0, l = zips.length; i < l; ++i) {
if (myzip >= zips[i][0] && myzip <= zips[i][1]) {
return true;
}
}
return false;
this is just using a very naive linear search (O(n)). If you kept the list sorted and used binary searching, you could achieve O(log n).
I would like to write a JavaScript function that validates a zip code
Might be more effort than it's worth, keeping it updated so that at no point someone's real valid ZIP code is rejected. You could also try an external service, or do what everyone else does and just accept any 5-digit number!
here is a list of optimizations over a straight hashtable that I can think of
Sorry to spoil the potential Fun, but you're probably not going to manage much better actual performance than JavaScript's Object gives you when used as a hashtable. Object member access is one of the most common operations in JS and will be super-optimised; building your own data structures is unlikely to beat it even if they are potentially better structures from a computer science point of view. In particular, anything using ‘Array’ is not going to perform as well as you think because Array is actually implemented as an Object (hashtable) itself.
Having said that, a possible space compression tool if you only need to know 'valid or not' would be to use a 100000-bit bitfield, packed into a string. For example for a space of only 100 ZIP codes, where codes 032-043 are ‘valid’:
var zipfield= '\x00\x00\x00\x00\xFF\x0F\x00\x00\x00\x00\x00\x00\x00';
function isvalid(zip) {
if (!zip.match('[0-9]{3}'))
return false;
var z= parseInt(zip, 10);
return !!( zipfield.charCodeAt(Math.floor(z/8)) & (1<<(z%8)) );
}
Now we just have to work out the most efficient way to get the bitfield to the script. The naive '\x00'-filled version above is pretty inefficient. Conventional approaches to reducing that would be eg. to base64-encode it:
var zipfield= atob('AAAAAP8PAAAAAAAAAA==');
That would get the 100000 flags down to 16.6kB. Unfortunately atob is Mozilla-only, so an additional base64 decoder would be needed for other browsers. (It's not too hard, but it's a bit more startup time to decode.) It might also be possible to use an AJAX request to transfer a direct binary string (encoded in ISO-8859-1 text to responseText). That would get it down to 12.5kB.
But in reality probably anything, even the naive version, would do as long as you served the script using mod_deflate, which would compress away a lot of that redundancy, and also the repetition of '\x00' for all the long ranges of ‘invalid’ codes.
I use Google Maps API to check whether a zipcode exists.
It's more accurate.
Assuming you've got the zips in a sorted array (seems fair if you're controlling the generation of the datastructure), see if a simple binary search is fast enough.
So... You're doing client side validation and want to optimize for file size? you probably cannot beat general compression. Fortunately, most browsers support gzip for you, so you can use that much for free.
How about a simple json coded dict or list with the zip codes in sorted order and do a look up on the dict. it'll compress well, since its a predictable sequence, import easily since it's json, using the browsers in-built parser, and lookup will probably be very fast also, since that's a javascript primitive.
This might be useful:
PHP Zip Code Range and Distance Calculation
As well as List of postal codes.