I am thinking of implementing a new project that has an import/export feature. First, I will have an array of around 45 objects. The object structure is simple, like this:
{"id": "someId", "quantity": 3}
So, in order to make it exportable, I will have to turn the whole array of these objects into one single string first. For this part, I think I will use JSON.stringify(). After that, I want to make the string as short as possible for the users (they copy the string and paste it to share it with other users, who import it to get the original array back). I know this part is not necessary, but I really want to make it as short as possible. Hence the question: how do I convert an array of objects to the shortest possible string?
Any techniques such as Encoding, Encryption, or Hashing are acceptable as long as it is reversible to get the original data.
By "shortest possible", I mean you can answer any solution that is shorter than just pure stringification. I will just accept the one that gives shortest string to import.
I tried text minification but it gives almost the same result as the original text. I also tried encryption but it still gives a relatively long result.
Note: The string for import (that comes from export) can be human-readable or unreadable. It does not matter.
Deleting json optional SPACE after : colon and , comma
is a no-brainer. Let's assume you have already minified
in that way.
xz compression is generally helpful.
Perhaps you know some strings that are very likely
to repeatedly appear in the input doc. That might include:
"id":
"quantity":
Construct a prefix document which mentions such terms.
Sender will compress prefix + doc,
strip the initial unchanging bytes,
and send the rest.
Receiver will accept those bytes via TCP,
prepend the unchanging bytes,
and decompress.
Why does this improve compression ratios?
Lempel-Ziv and related schemes maintain a dictionary,
and transmit integer indexes into that dictionary
in order to indicate common words.
A word can be fairly long, even longer than "quantity".
The longer it is, the greater the savings.
If sender and receiver both know a set of words
that belong in the dictionary, beforehand,
we can avoid sending the raw text of those words.
Your chrome browser
compresses web headers
in this way already, each time you do a google search.
Finally, you might want to base64 encode the compressed output.
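A minimal sketch of the shared-dictionary idea, assuming Node.js and its built-in zlib module. Note that it swaps in DEFLATE with a preset dictionary for the xz-plus-stripped-prefix scheme described above; the principle (both sides know the common substrings beforehand) is the same. The dictionary contents and the names exportItems and importItems are hypothetical.

```javascript
const zlib = require('zlib');

// Hypothetical preset dictionary: substrings expected in every export.
// Sender and receiver must use exactly the same bytes.
const DICTIONARY = Buffer.from('{"id":"","quantity":},{"id":"');

// Minified JSON -> deflate with preset dictionary -> base64 text.
function exportItems(items) {
  const json = JSON.stringify(items);
  const compressed = zlib.deflateSync(Buffer.from(json, 'utf8'), {
    dictionary: DICTIONARY,
    level: 9,
  });
  return compressed.toString('base64');
}

// base64 text -> inflate with the same dictionary -> original array.
function importItems(text) {
  const compressed = Buffer.from(text, 'base64');
  const json = zlib.inflateSync(compressed, { dictionary: DICTIONARY });
  return JSON.parse(json.toString('utf8'));
}
```

Whether this beats plain compression for a given payload is something to measure; for very small inputs the fixed overhead of the compressed format can dominate.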
Ignore compression, and use a database instead,
in the way that tinyurl.com has been doing for quite a long time.
Set serial to 1.
Accept a new object, or set of objects.
Ask the DB if you've seen this input before.
If not, store it under a brand new serial ID.
Now send the matching ID to the receiving end.
It can query the central database when it gets a new ID,
and it can cache such results to use in future.
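The steps above could be sketched as follows. This is only an illustration: two in-memory Maps stand in for the central database, and the names storeItems and loadItems are hypothetical.

```javascript
// In-memory stand-in for the central database (an assumption; a real
// deployment would need a store that both sender and receiver can reach).
const payloadById = new Map();
const idByPayload = new Map();
let serial = 1; // "Set serial to 1."

// Returns a short ID for the given array, reusing the ID if the same
// input has been seen before.
function storeItems(items) {
  const payload = JSON.stringify(items);
  if (!idByPayload.has(payload)) {
    const id = (serial++).toString(36); // base 36 keeps the ID short
    idByPayload.set(payload, id);
    payloadById.set(id, payload);
  }
  return idByPayload.get(payload);
}

// Receiving end: look the ID up and rebuild the original array.
function loadItems(id) {
  const payload = payloadById.get(id);
  return payload === undefined ? null : JSON.parse(payload);
}
```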
You might opt for a simple CSV export. The export string becomes, if you use the pipe separator, something like:
id|quantity\nsomeId|3\notherId|8
which is the equivalent of
[{"id":"someId","quantity":3},{"id":"otherId","quantity":8}]
This approach will remove the redundant id and quantity tags for each record and remove the unnecessary double quotes.
The downside is that your records all should have the same data structure but that is generally the case.
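A small sketch of that conversion, assuming every record has exactly the fields id and quantity and that an id never contains the pipe or newline characters. The function names toCsv and fromCsv are hypothetical.

```javascript
// Array of {id, quantity} objects -> pipe-separated text with a header row.
function toCsv(items) {
  const rows = items.map(item => item.id + '|' + item.quantity);
  return ['id|quantity', ...rows].join('\n');
}

// Pipe-separated text -> array of {id, quantity} objects.
function fromCsv(text) {
  const [, ...rows] = text.split('\n'); // skip the header row
  return rows.map(row => {
    const [id, quantity] = row.split('|');
    return { id, quantity: Number(quantity) };
  });
}
```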
In writing a database to disk as a text file of JSON strings, I've been experimenting with how to most efficiently build the string of text that is ultimately converted to a blob for download to disk.
There are a number of questions that state to not concatenate a string with the + operator in a loop, but instead write the component strings to an array and then use the join method to build one large string.
The best explanation I came across of why can be found here, by Joel Mueller:
In JavaScript (and C# for that matter) strings are immutable. They can never be changed, only replaced with other strings. You're probably aware that combined + "hello " doesn't directly modify the combined variable - the operation creates a new string that is the result of concatenating the two strings together, but you must then assign that new string to the combined variable if you want it to be changed.
So what this loop is doing is creating a million different string objects, and throwing away 999,999 of them. Creating that many strings that are continually growing in size is not fast, and now the garbage collector has a lot of work to do to clean up after this.
The thread here, was also helpful.
However, using the join method didn't allow me to build the string I was aiming for without getting the error:
allocation size overflow
I was trying to write 50,000 JSON strings from a database into one text file, which simply may have been too large no matter what. I think it was reaching over 350MB. I was just testing the limit of my application and picked something far larger than a user of the application will likely ever create. So, this test case was likely unreasonable.
Nonetheless, this leaves me with three questions about working with large strings.
For the same amount of data overall, does altering the number of array elements joined in a single join operation affect the efficiency in terms of not hitting an allocation size overflow?
For example, I tried writing the JSON strings to a pseudo 3-D array of 100 (and then 50) elements per dimension; and then looped through the outer two dimensions joining them together. 100^3 = 1,000,000 or 50^3 = 125,000 both provide more than enough entries to hold the 50,000 JSON strings. I know I'm not including the 0 index, here.
So, the 50,000 strings were held in an array from a[1][1][1] to a[5][100][100] in the first attempt and from a[1][1][1] to a[20][50][50] in the second attempt. If the dimensions are i, j, k from outer to inner, I joined all the k elements in each a[i][j], then joined all of those i x j joins, and lastly joined these i results into the final text string.
All attempts still hit the allocation size overflow before completing.
So, is there any difference between joining 50,000 smaller strings in one join versus 50 larger strings, if the total data is the same?
Is there a better, more efficient way to build large strings than the join method?
Does the same principle described by Joel Mueller regarding string concatenation apply to reducing a string through substring, such as string = string.substring(position)?
The context of this third question is that when I read a text file in as a string and break it down into its component JSON strings before writing to the database, I use an array that is a map of the file layout; so, I know the length of each JSON string in advance and repeat three statements inside a loop:
l = map[i].l;
str = text.substring(0,l);
text = text.substring(l);
It would appear that since strings are immutable, this sort of reverse of concatenation step is as inefficient as using the + operator to concatenate.
Would it be more efficient to not delete the str from text each iteration, and just keep track of the increasing start and end positions for the substrings as I step through the loop reading the entire text string?
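For illustration, the offset-tracking variant asked about here might look like the sketch below. It assumes map is an array of objects with a numeric l property, as in the loop above; the function name splitByLengths is hypothetical. It shows only the shape of the loop and makes no claim about measured speed.

```javascript
// Splits text into consecutive pieces whose lengths are given by map[i].l,
// without ever reassigning or shrinking text itself.
function splitByLengths(text, map) {
  const parts = [];
  let start = 0;
  for (let i = 0; i < map.length; i++) {
    const end = start + map[i].l;
    parts.push(text.substring(start, end));
    start = end; // the next piece begins where this one ended
  }
  return parts;
}
```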
Response to message about duplicate question
I got a message, I guess from the stackoverflow system itself, asking me to edit my question explaining why it is different from the proposed duplicate.
Reasons are:
The proposed duplicate asks specifically and exclusively about the maximum size of a single string. None of the three bolded questions, here, asks about the maximum size of a single string, although that is useful to know.
This question asks about the most efficient way of building large strings and that isn't addressed in the answers found in the proposed duplicate, apart from an efficient way of building a large test string. They don't address how to build a realistic string, comprised of actual application data.
This question provides a couple links to some information concerning the efficiency of building large strings that may be helpful to those interested in more than the maximum size alone.
This question also has a specific context of why the large string was being built, which led to some suggestions about how to handle that situation in a more efficient manner. Although, in the strictest sense, they don't specifically address the question by title, they do address the broader context of the question as presented, which is how to deal with the large strings, even if that means ways to work around them. Someone searching on this same topic might find real help in these suggestions that is not provided in the proposed duplicate.
So, although the proposed duplicate is somewhat helpful, it doesn't appear to be anywhere near a genuine duplicate of this question in its full context.
Additional Information
This doesn't answer the question concerning the most efficient way to build a large string, but it refers to the comments about how to get around the string size limit.
Converting each component string to a blob and holding them in an array, and then converting the array of blobs into a single blob, accomplished this. I don't know what the size limit of a single blob is, but did see 800MB in another question.
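A rough sketch of that blob-of-blobs approach, assuming an environment where the standard Blob constructor is available (browsers, or a recent Node.js). The function name buildExportBlob and the newline separator are assumptions made for illustration.

```javascript
// Wraps each JSON string in its own small Blob, then combines them into a
// single Blob, so no single giant JavaScript string is ever built.
function buildExportBlob(jsonStrings) {
  const parts = jsonStrings.map(s => new Blob([s + '\n']));
  return new Blob(parts, { type: 'text/plain' });
}
```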
A process (or starting point) for creating the blob to write the database to disk and then to read it back in again can be found here.
Writing the blobs or strings to disk as they are generated on the client, rather than generating one giant string or blob for download, is the most logical and efficient method, but it may not be possible in the scenario presented here of an offline application.
According to this question, web extensions no longer have access to the privileged javascript code necessary to accomplish this through the File API.
I asked this question related to the Streams API write stream method and something called StreamSaver.
In writing a database to disk as a text file of JSON strings.
I see no reason to store the data in a string or array of strings in this case. Instead you can write the data directly to the file.
In the simplest case you can write each string to the file separately.
To get better performance, you could first write some data to a smaller buffer, and then write that buffer to disk when it's full.
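The buffering idea can be sketched independently of where the data ends up. In the sketch below, writeChunk is a hypothetical callback standing in for whatever actually writes to the file or stream, and maxBufferedChars is an assumed threshold.

```javascript
// Collects small strings and hands them to writeChunk in larger batches.
function createBufferedWriter(writeChunk, maxBufferedChars) {
  let pending = [];
  let pendingChars = 0;

  function flush() {
    if (pending.length > 0) {
      writeChunk(pending.join(''));
      pending = [];
      pendingChars = 0;
    }
  }

  return {
    write(s) {
      pending.push(s);
      pendingChars += s.length;
      if (pendingChars >= maxBufferedChars) flush(); // buffer is full
    },
    close() {
      flush(); // write whatever is left
    },
  };
}
```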
For best performance you could create a file of a certain size and create a memory mapping over that file. Then write/copy the data directly to the mapped memory (which is your file). The trick would be to know or guess the size up front, or you could resize the file when needed and then remap the file.
Joining or growing strings will trigger a lot of memory (re)allocations, which is unnecessary overhead in this case.
I don't want the user to have to download more than one file
If the goal is to let a user download that generated file, you could even do better by streaming those strings directly to the user without even creating a file. This also has the advantage that the user starts receiving data immediately instead of first having to wait till the whole file is generated.
Because the file size is not known up front, you could use chunked transfer encoding.
I frequently see code where people define a populated array using the split method, like this:
var colors = "red,green,blue".split(',');
How does this differ from:
var colors = ["red","green","blue"];
Is it simply to avoid having to quote each value?
Splitting a string is a bad way of creating an array. There are several issues with the approach that include performance, stability and memory consumption. It requires CPU time to parse the string, it is prone to errors (double commas, spaces in the string, etc.) and means your script essentially has to store twice as much data in memory.
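A short illustration of the kind of error meant here, using a made-up input with a stray space and a double comma:

```javascript
// The array literal says exactly what it contains.
const literal = ['red', 'green', 'blue'];

// A stray space and a doubled comma silently end up in the data.
const fromSplit = 'red, green,,blue'.split(',');
// fromSplit is ['red', ' green', '', 'blue']
```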
It's not a good idea and is most likely just a bad habit someone picked up when they first learned about strings and arrays. That or they're trying to be clever for some kind of coding exercise.
As a rule of thumb, the only time you should be parsing strings into arrays is if you're reading that string data from an external source and need to convert it to native types. If you already know the values ahead of time, you should create the array yourself.
The one possible reason someone might do this is to reduce the number of characters in their source code, trading performance for bandwidth. 'a,b,c,d,e,f,g'.split(',') is fewer characters than ['a','b','c','d','e','f','g'].
There is no difference, it's just bad practice and laziness if anything. The only reason I could think of using the first approach is if the data naturally came in string form and using an array literal made it completely unreadable.
Here is something I would like to do. I have a file that looks something like this:
"key"-"content". A "key" is not unique: one key can have zero or more contents.
The file is about 200 KB. I convert it into an array and put it all in the JavaScript. When the user types, I loop through the array once to find the results, but it is slow...
Any suggestions on how to do this? Thank you.
(Only a client-side JavaScript implementation is allowed; I cannot use a server to compute the result and send it back.)
If I understood you right, these links to articles of John Resig may help you. His problem was the poor performance, when looking for valid words in a big text file while typing them.
Part 1: Dictionary Lookups in JavaScript
Part 2: JavaScript Trie Performance Analysis
I assume that the user is typing something that is supposed to match the "key"? Or the "content"?
Assuming it's the key, then sort the keys and use a binary search. Once you get a hit (assuming a partial match, like, say, the first letter), just keep scanning until your matches fail. That's your result set.
If you're querying the content, then it's the same premise, but you need to invert the index and make the content pieces your keys, and sort those.
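For the key case, a minimal sketch might look like this. It assumes the keys are plain strings sorted with the default Array.prototype.sort order; the function names lowerBound and findByPrefix are hypothetical.

```javascript
// Binary search: index of the first key that is >= prefix.
function lowerBound(sortedKeys, prefix) {
  let lo = 0;
  let hi = sortedKeys.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sortedKeys[mid] < prefix) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Start at the first possible hit and keep scanning until matches fail.
function findByPrefix(sortedKeys, prefix) {
  const matches = [];
  for (let i = lowerBound(sortedKeys, prefix);
       i < sortedKeys.length && sortedKeys[i].startsWith(prefix);
       i++) {
    matches.push(sortedKeys[i]);
  }
  return matches;
}
```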
Can you use an associative array, with unique keys that point to arrays of possible values?
{ 'key1': ['value1','value2','value3'],
  'key2': ['value1','value2'],
  'key3': ['value1']
}
It means more overhead to parse the list, but I bet searching the list will be much faster. It should also use less memory, since you aren't duplicating all of the duplicate keys in memory.
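One way to build such a structure from the flat list, sketched with a Map and assuming the parsed file is an array of [key, content] pairs (that input shape and the name buildIndex are assumptions):

```javascript
// Groups all contents under their key so that each key is stored only once.
function buildIndex(pairs) {
  const index = new Map();
  for (const [key, content] of pairs) {
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(content);
  }
  return index;
}
```

A lookup is then a single index.get(key) call instead of a loop over the whole array.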
I'm passing a table of up to 1000 rows, consisting of name, ID, latitude and longitude values, to the client.
The list will then be processed by Javascript and converted to markers on a Google map.
I initially planned to do this with JSON, as I want the code to be readable and easy to deal with, and because we may be adding more structure to it over time.
However, my colleague suggested passing it down as a Javascript array, as it would reduce the size greatly.
This made me think, maybe JSON is a bit redundant. After all, for each row defined, the name of each field is also being outputted repetitively. Whereas, for an array, the position of the cells is used to indicate the field.
However, would there really be a performance improvement by using an array?
The site uses GZIP compression. Is this compression effective enough to take care of any redundancy found in a JSON string?
[edit]
I realize JSON is just a notation.
But my real question is - what notation is best, performance-wise?
If I use fully named attributes, then I can have code like this:
var x = resultset.rows[0].name;
Whereas if I don't, it will look less readable, like so:
var x = resultset.rows[0][2];
My question is - would the sacrifice in code readability be worth it for the performance gains? Or not?
Further notes:
According to Wikipedia, the Deflate compression algorithm (used by gzip) performs 'Duplicate string elimination'. http://en.wikipedia.org/wiki/DEFLATE#Duplicate_string_elimination
If this is correct, I have no reason to be concerned about any redundancy in JSON, as it's already been taken care of.
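Rather than relying on the description alone, the effect can be measured. The sketch below assumes Node.js with its built-in zlib module; the field names name, id, lat and lng and the generated sample rows are made up for illustration, so only measurements on real data are meaningful.

```javascript
const zlib = require('zlib');

// Serialises the same rows as named objects and as positional arrays, and
// reports the raw and gzipped byte counts of each form.
function gzippedSizes(rows) {
  const asObjects = JSON.stringify(rows);
  const asArrays = JSON.stringify(rows.map(r => [r.name, r.id, r.lat, r.lng]));
  return {
    objectsRaw: Buffer.byteLength(asObjects),
    arraysRaw: Buffer.byteLength(asArrays),
    objectsGzip: zlib.gzipSync(asObjects).length,
    arraysGzip: zlib.gzipSync(asArrays).length,
  };
}

// Hypothetical sample of 1000 rows.
const sample = Array.from({ length: 1000 }, (_, i) => ({
  name: 'Place ' + i,
  id: i,
  lat: 51 + i / 1000,
  lng: -0.1 + i / 1000,
}));
console.log(gzippedSizes(sample));
```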
JSON is just a notation (Javascript Object Notation), and includes JS arrays -- even if there is the word "object" in its name.
See its grammar on http://json.org/ which defines an array like this (quoting) :
An array is an ordered collection of
values. An array begins with [ (left
bracket) and ends with ] (right
bracket). Values are separated by ,
(comma).
This means this (taken from JSON Data Set Sample) would be valid JSON :
[ 100, 500, 300, 200, 400 ]
Even if it doesn't include nor declare nor whatever any object at all.
In your case, I suppose you could use some array, storing data by position, and not by name.
If you are worried about size you could want to "compress" that data on the server side by yourself, and de-compress it on the client side -- but I wouldn't do that : it would mean you'd need more processing time/power on the client side...
I'd rather go with gzipping of the page that contains the data : you'll have nothing to do, it's fully automatic, and it works just fine -- and the difference in size will probably not be noticeable.
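If positional arrays are sent anyway, readability on the client can be recovered by sending the field names once and rebuilding objects after parsing. A sketch, in which the payload shape { fields, rows } and the name rowsToObjects are assumptions:

```javascript
// Turns [[v0, v1, ...], ...] plus one list of field names back into
// [{name0: v0, name1: v1, ...}, ...].
function rowsToObjects(fields, rows) {
  return rows.map(row =>
    Object.fromEntries(fields.map((field, i) => [field, row[i]]))
  );
}

// Hypothetical payload: field names appear once, rows are positional.
const payload = {
  fields: ['name', 'id', 'lat', 'lng'],
  rows: [['Depot', 7, 51.5, -0.12]],
};
const objects = rowsToObjects(payload.fields, payload.rows);
// objects[0].name can now be used instead of payload.rows[0][0]
```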
I suggest using a simple CSV format. There is a nice article on the Flickr Development Blog where they talked about their experience with such a problem. But the best would be to try it on your own.
I would like to write a JavaScript function that validates a zip code, by checking if the zip code actually exists. Here is a list of all zip codes:
http://www.census.gov/tiger/tms/gazetteer/zips.txt (I only care about the 2nd column)
This is really a compression problem. I would like to do this for fun. OK, now that's out of the way, here is a list of optimizations over a straight hashtable that I can think of, feel free to add anything I have not thought of:
Break zipcode into 2 parts, first 2 digits and last 3 digits.
Make a giant if-else statement first checking the first 2 digits, then checking ranges within the last 3 digits.
Or, convert the zips into hex, and see if I can do the same thing using smaller groups.
Find out if within the range of all valid zip codes there are more valid zip codes vs invalid zip codes. Write the above code targeting the smaller group.
Break up the hash into separate files, and load them via Ajax as user types in the zipcode. So perhaps break into 2 parts, first for first 2 digits, second for last 3.
Lastly, I plan to generate the JavaScript files using another program, not by hand.
Edit: performance matters here. I do want to use this, if it doesn't suck. Performance of the JavaScript code execution + download time.
Edit 2: JavaScript only solutions please. I don't have access to the application server, plus, that would make this into a whole other problem =)
You could do the unthinkable and treat the code as a number (remember that it's not actually a number). Convert your list into a series of ranges, for example:
zips = [10000, 10001, 10002, 10003, 23001, 23002, 23003, 36001]
// becomes
zips = [[10000,10003], [23001,23003], [36001,36001]]
// make sure to keep this sorted
then to test:
function zipExists(myzip) {
  for (var i = 0, l = zips.length; i < l; ++i) {
    // each entry is an inclusive [low, high] range
    if (myzip >= zips[i][0] && myzip <= zips[i][1]) {
      return true;
    }
  }
  return false;
}
zipExists(23002); // true
This is just using a very naive linear search (O(n)). If you kept the list sorted and used binary searching, you could achieve O(log n).
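A sketch of both steps, building the ranges and searching them with a binary search. It assumes the input list is sorted, free of duplicates, and holds the codes as numbers; the names toRanges and inRanges are hypothetical.

```javascript
// Collapses a sorted list of numeric codes into inclusive [low, high] ranges.
function toRanges(sortedZips) {
  const ranges = [];
  for (const zip of sortedZips) {
    const last = ranges[ranges.length - 1];
    if (last && zip === last[1] + 1) last[1] = zip; // extend the current range
    else ranges.push([zip, zip]);                   // start a new range
  }
  return ranges;
}

// Binary search over the sorted, non-overlapping ranges: O(log n).
function inRanges(ranges, zip) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    if (zip < ranges[mid][0]) hi = mid - 1;
    else if (zip > ranges[mid][1]) lo = mid + 1;
    else return true;
  }
  return false;
}
```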
I would like to write a JavaScript function that validates a zip code
Might be more effort than it's worth, keeping it updated so that at no point someone's real valid ZIP code is rejected. You could also try an external service, or do what everyone else does and just accept any 5-digit number!
here is a list of optimizations over a straight hashtable that I can think of
Sorry to spoil the potential Fun, but you're probably not going to manage much better actual performance than JavaScript's Object gives you when used as a hashtable. Object member access is one of the most common operations in JS and will be super-optimised; building your own data structures is unlikely to beat it even if they are potentially better structures from a computer science point of view. In particular, anything using ‘Array’ is not going to perform as well as you think because Array is actually implemented as an Object (hashtable) itself.
Having said that, a possible space compression tool if you only need to know 'valid or not' would be to use a 100000-bit bitfield, packed into a string. For example for a space of only 100 ZIP codes, where codes 032-043 are ‘valid’:
var zipfield= '\x00\x00\x00\x00\xFF\x0F\x00\x00\x00\x00\x00\x00\x00';
function isvalid(zip) {
if (!zip.match('[0-9]{3}'))
return false;
var z= parseInt(zip, 10);
return !!( zipfield.charCodeAt(Math.floor(z/8)) & (1<<(z%8)) );
}
Now we just have to work out the most efficient way to get the bitfield to the script. The naive '\x00'-filled version above is pretty inefficient. Conventional approaches to reducing that would be eg. to base64-encode it:
var zipfield= atob('AAAAAP8PAAAAAAAAAA==');
That would get the 100000 flags down to 16.6kB. Unfortunately atob is Mozilla-only, so an additional base64 decoder would be needed for other browsers. (It's not too hard, but it's a bit more startup time to decode.) It might also be possible to use an AJAX request to transfer a direct binary string (encoded in ISO-8859-1 text to responseText). That would get it down to 12.5kB.
But in reality probably anything, even the naive version, would do as long as you served the script using mod_deflate, which would compress away a lot of that redundancy, and also the repetition of '\x00' for all the long ranges of ‘invalid’ codes.
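Since the script would be generated by another program anyway, the bitfield string itself could be produced by something like the sketch below. The names buildZipField and isValidCode are hypothetical, and codes are passed as numbers rather than strings.

```javascript
// Packs a list of valid numeric codes into a string, one bit per code,
// with the lowest bit of each character holding the lowest code.
function buildZipField(validCodes, spaceSize) {
  const bytes = new Uint8Array(Math.ceil(spaceSize / 8));
  for (const code of validCodes) {
    bytes[code >> 3] |= 1 << (code & 7);
  }
  return Array.from(bytes, b => String.fromCharCode(b)).join('');
}

// Tests the bit that belongs to the given code.
function isValidCode(field, code) {
  return (field.charCodeAt(code >> 3) & (1 << (code & 7))) !== 0;
}
```

With the example above (a space of 100 codes, of which 32 to 43 are valid), buildZipField yields the same 13-character string as the hand-written zipfield.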
I use Google Maps API to check whether a zipcode exists.
It's more accurate.
Assuming you've got the zips in a sorted array (seems fair if you're controlling the generation of the datastructure), see if a simple binary search is fast enough.
So... You're doing client-side validation and want to optimize for file size? You probably cannot beat general compression. Fortunately, most browsers support gzip for you, so you can use that much for free.
How about a simple JSON-encoded dict or list with the zip codes in sorted order, and do a lookup on the dict? It'll compress well, since it's a predictable sequence; it imports easily since it's JSON, using the browser's built-in parser; and lookup will probably be very fast too, since that's a JavaScript primitive.
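A tiny sketch of that idea, using a Set built from a JSON list. The three codes in zipJson are placeholder data and the name zipExists is hypothetical; the real list would come from the generated file.

```javascript
// Placeholder data: a JSON list of valid codes, kept as strings so that
// leading zeros survive.
const zipJson = '["01001","10000","23002"]';

// Parse once with the built-in parser, then look codes up in a Set.
const validZips = new Set(JSON.parse(zipJson));

function zipExists(zip) {
  return validZips.has(zip);
}
```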
This might be useful:
PHP Zip Code Range and Distance Calculation
As well as List of postal codes.