Building large strings in JavaScript ; Is join method most efficient?


In writing a database to disk as a text file of JSON strings, I've been experimenting with how to most efficiently build the string of text that is ultimately converted to a blob for download to disk.
There are a number of questions that say not to concatenate a string with the + operator in a loop, but instead to write the component strings to an array and then use the join method to build one large string.
The best explanation of why that I came across can be found here, by Joel Mueller:
In JavaScript (and C# for that matter) strings are immutable. They can never be changed, only replaced with other strings. You're probably aware that combined + "hello " doesn't directly modify the combined variable - the operation creates a new string that is the result of concatenating the two strings together, but you must then assign that new string to the combined variable if you want it to be changed.
So what this loop is doing is creating a million different string objects, and throwing away 999,999 of them. Creating that many strings that are continually growing in size is not fast, and now the garbage collector has a lot of work to do to clean up after this.
The thread here was also helpful.
However, using the join method didn't allow me to build the string I was aiming for without getting the error:
allocation size overflow
I was trying to write 50,000 JSON strings from a database into one text file, which simply may have been too large no matter what. I think it was reaching over 350MB. I was just testing the limit of my application and picked something far larger than a user of the application will likely ever create. So, this test case was likely unreasonable.
Nonetheless, this leaves me with three questions about working with large strings.
For the same amount of data overall, does altering the number of array elements joined in a single join operation affect the efficiency in terms of not hitting an allocation size overflow?
For example, I tried writing the JSON strings to a pseudo 3-D array of 100 (and then 50) elements per dimension; and then looped through the outer two dimensions joining them together. 100^3 = 1,000,000 or 50^3 = 125,000 both provide more than enough entries to hold the 50,000 JSON strings. I know I'm not including the 0 index, here.
So, the 50,000 strings were held in an array from a[1][1][1] to a[5][100][100] in the first attempt and from a[1][1][1] to a[20][50][50] in the second attempt. If the dimensions are i, j, k from outer to inner, I joined all the k elements in each a[i][j]; then joined all of those i x j joins, and lastly all of these i joins into the final text string.
All attempts still hit the allocation size overflow before completing.
So, is there any difference between joining 50,000 smaller strings in one join versus 50 larger strings, if the total data is the same?
Is there a better, more efficient way to build large strings than the join method?
Does the same principle described by Joel Mueller regarding string concatenation apply to reducing a string through substring, such as string = string.substring(position)?
The context of this third question is that when I read a text file in as a string and break it down into its component JSON strings before writing to the database, I use an array that is a map of the file layout; so, I know the length of each JSON string in advance and repeat three statements inside a loop:
l = map[i].l;
str = text.substring(0,l);
text = text.substring(l);
It would appear that since strings are immutable, this sort of reverse of concatenation step is as inefficient as using the + operator to concatenate.
Would it be more efficient to not delete the str from text each iteration, and just keep track of the increasing start and end positions for the substrings as I step through the loop reading the entire text string?
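A minimal sketch of that offset-tracking idea, assuming map is an array of objects with a length property l as in the loop above. The name splitByMap is hypothetical. It reads the source string once and never builds a shortened remainder string; whether that is measurably faster depends on the engine.

```javascript
// Hypothetical helper: walk the text once, slicing each record by offset
// instead of repeatedly shortening the source string.
function splitByMap(text, map) {
  const parts = [];
  let start = 0;
  for (let i = 0; i < map.length; i++) {
    const end = start + map[i].l;
    parts.push(text.substring(start, end));
    start = end; // the next record begins where this one ended
  }
  return parts;
}

const sample = '{"a":1}{"b":22}';
const pieces = splitByMap(sample, [{ l: 7 }, { l: 8 }]);
```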
Response to message about duplicate question
I got a message, I guess from the stackoverflow system itself, asking me to edit my question explaining why it is different from the proposed duplicate.
Reasons are:
The proposed duplicate asks specifically and exclusively about the maximum size of a single string. None of the three bolded questions, here, asks about the maximum size of a single string, although that is useful to know.
This question asks about the most efficient way of building large strings, and that isn't addressed in the answers found in the proposed duplicate, apart from an efficient way of building a large test string. They don't address how to build a realistic string, comprised of actual application data.
This question provides a couple links to some information concerning the efficiency of building large strings that may be helpful to those interested in more than the maximum size alone.
This question also has a specific context of why the large string was being built, which led to some suggestions about how to handle that situation in a more efficient manner. Although, in the strictest sense, they don't specifically address the question by title, they do address the broader context of the question as presented, which is how to deal with the large strings, even if that means ways to work around them. Someone searching on this same topic might find real help in these suggestions that is not provided in the proposed duplicate.
So, although the proposed duplicate is somewhat helpful, it doesn't appear to be anywhere near a genuine duplicate of this question in its full context.
Additional Information
This doesn't answer the question concerning the most efficient way to build a large string, but it refers to the comments about how to get around the string size limit.
Converting each component string to a blob and holding them in an array, and then converting the array of blobs into a single blob, accomplished this. I don't know what the size limit of a single blob is, but did see 800MB in another question.
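As a rough sketch of the blob approach just described (the function name buildExportBlob and the MIME type are assumptions, and a global Blob constructor is assumed to be available, as in browsers):

```javascript
// Assumes a global Blob (browsers, and recent Node.js versions).
function buildExportBlob(jsonStrings) {
  // One small Blob per record, so no single giant string is ever created.
  const parts = jsonStrings.map((s) => new Blob([s]));
  // The Blob constructor accepts an array of Blobs and concatenates them.
  return new Blob(parts, { type: 'text/plain' });
}

const exportBlob = buildExportBlob(['{"id":1}', '{"id":2}']);
```

The resulting blob can then be offered for download in the usual way, for example through URL.createObjectURL.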
A process (or starting point) for creating the blob to write the database to disk and then to read it back in again can be found here.
Writing the blobs or strings to disk as they are generated on the client, as opposed to generating one giant string or blob for download, is the most logical and efficient method, but it may not be possible in the scenario presented here of an offline application.
According to this question, web extensions no longer have access to the privileged javascript code necessary to accomplish this through the File API.
I asked this question related to the Streams API write stream method and something called StreamSaver.

In writing a database to disk as a text file of JSON strings.
I see no reason to store the data in a string or array of strings in this case. Instead you can write the data directly to the file.
In the simplest case you can write each string to the file separately.
To get better performance, you could first write some data to a smaller buffer, and then write that buffer to disk when it's full.
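The buffering idea can be sketched roughly as follows. Everything here is hypothetical: createBufferedWriter is an illustrative name, and sink stands in for whatever actually writes a chunk to the file or stream.

```javascript
// Hypothetical buffered writer: collects strings and hands them to `sink`
// in batches once roughly `limit` characters have accumulated.
function createBufferedWriter(sink, limit) {
  let pending = [];
  let size = 0;
  function flush() {
    if (pending.length > 0) {
      sink(pending.join(''));
      pending = [];
      size = 0;
    }
  }
  function write(str) {
    pending.push(str);
    size += str.length;
    if (size >= limit) flush();
  }
  return { write, flush };
}

const written = [];
const writer = createBufferedWriter((chunk) => written.push(chunk), 10);
['aaaa', 'bbbb', 'cccc', 'dd'].forEach((s) => writer.write(s));
writer.flush(); // push out whatever is left in the buffer
```

Only the small buffer is ever joined, so the size of any single string stays bounded by the limit plus one record.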
For best performance you could create a file of a certain size and create a memory mapping over that file. Then write/copy the data directly to the mapped memory (which is your file). The trick would be to know or guess the size up front, or you could resize the file when needed and then remap the file.
Joining or growing strings will trigger a lot of memory (re)allocations, which is unnecessary overhead in this case.
I don't want the user to have to download more than one file
If the goal is to let a user download that generated file, you could even do better by streaming those strings directly to the user without even creating a file. This also has the advantage that the user starts receiving data immediately instead of first having to wait till the whole file is generated.
Because the file size is not known up front, you could use chunked transfer encoding.

Related

How to convert array of objects to a shortest possible string?

I am thinking of implementing a new project that has an import/export feature. First, I will have an array of around 45 objects. The object structure is simple, like this:
{"id": "someId", "quantity": 3}
So, in order to make it exportable, I will have to change the whole array of these objects into one single string first. For this part, I think I will use JSON.stringify(). After that, I want to make the string as short as possible for the users to use it (copy the string and paste it to share to other users to import it back to get the original array). I know this part is not necessary but I really want to make it as short as possible. Hence, the question. How to convert array of objects to a shortest possible string?
Any techniques such as Encoding, Encryption, or Hashing are acceptable as long as it is reversible to get the original data.
By "shortest possible", I mean you can answer any solution that is shorter than just pure stringification. I will just accept the one that gives shortest string to import.
I tried text minification but it gives almost the same result as the original text. I also tried encryption but it still gives a relatively long result.
Note: The string for import (that comes from export) can be human-readable or unreadable. It does not matter.

Deleting json optional SPACE after : colon and , comma
is a no-brainer. Let's assume you have already minified
in that way.
xz compression is generally helpful.
Perhaps you know some strings that are very likely
to repeatedly appear in the input doc. That might include:
"id":
"quantity":
Construct a prefix document which mentions such terms.
Sender will compress prefix + doc,
strip the initial unchanging bytes,
and send the rest.
Receiver will accept those bytes via TCP,
prepend the unchanging bytes,
and decompress.
Why does this improve compression ratios?
Lempel-Ziv and related schemes maintain a dictionary,
and transmit integer indexes into that dictionary
in order to indicate common words.
A word can be fairly long, even longer than "quantity".
The longer it is, the greater the savings.
If sender and receiver both know a set of words
that belong in the dictionary, beforehand,
we can avoid sending the raw text of those words.
Your chrome browser
compresses web headers
in this way already, each time you do a google search.
Finally, you might want to base64 encode the compressed output.
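A sketch of the shared-dictionary idea, with two substitutions that should be stated plainly: it uses Node's zlib (deflate) rather than xz, and it uses zlib's preset dictionary option instead of compressing prefix + doc and stripping the unchanging bytes by hand. The dictionary contents and the function names are assumptions for illustration.

```javascript
const zlib = require('zlib');

// Terms both sides agree on in advance (an assumption for this sketch).
const dictionary = Buffer.from('{"id":"","quantity":},');

function exportString(objects) {
  const json = JSON.stringify(objects);
  // The preset dictionary lets the compressor refer to the shared terms
  // without ever transmitting them.
  return zlib.deflateSync(Buffer.from(json), { dictionary }).toString('base64');
}

function importString(encoded) {
  // The receiver must supply exactly the same dictionary to decompress.
  const raw = zlib.inflateSync(Buffer.from(encoded, 'base64'), { dictionary });
  return JSON.parse(raw.toString());
}

const items = [{ id: 'someId', quantity: 3 }, { id: 'otherId', quantity: 8 }];
const shared = exportString(items);
const restored = importString(shared);
```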
Ignore compression, and use a database instead,
in the way that tinyurl.com has been doing for quite a long time.
Set serial to 1.
Accept a new object, or set of objects.
Ask the DB if you've seen this input before.
If not, store it under a brand new serial ID.
Now send the matching ID to the receiving end.
It can query the central database when it gets a new ID,
and it can cache such results to use in future.

You might opt for a simple CSV export. The export string becomes, if you use the pipe separator, something like:
id|quantity\nsomeId|3\notherId|8
which is the equivalent of
[{"id":"someId","quantity":3},{"id":"otherId","quantity":8}]
This approach will remove the redundant id and quantity tags for each record and remove the unnecessary double quotes.
The downside is that your records all should have the same data structure but that is generally the case.
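A minimal sketch of such an export and import, assuming every record has the same keys and that values contain neither the pipe character nor newlines (no escaping is attempted). The function names are hypothetical.

```javascript
// Assumes every record has the same keys and that values contain
// neither "|" nor newlines (no escaping is done in this sketch).
function toPipeCsv(records) {
  const keys = Object.keys(records[0]);
  const rows = records.map((r) => keys.map((k) => String(r[k])).join('|'));
  return [keys.join('|')].concat(rows).join('\n');
}

function fromPipeCsv(text) {
  const [header, ...rows] = text.split('\n');
  const keys = header.split('|');
  return rows.map((row) => {
    const values = row.split('|');
    const record = {};
    keys.forEach((k, i) => { record[k] = values[i]; });
    return record;
  });
}

const csv = toPipeCsv([{ id: 'someId', quantity: 3 }, { id: 'otherId', quantity: 8 }]);
const parsed = fromPipeCsv(csv);
```

Note that every value comes back as a string, so numeric fields such as quantity would need converting with Number on import.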

In Javascript, why define an array with split?

I frequently see code where people define a populated array using the split method, like this:
var colors = "red,green,blue".split(',');
How does this differ from:
var colors = ["red","green","blue"];
Is it simply to avoid having to quote each value?

Splitting a string is a bad way of creating an array. There are several issues with the approach, including performance, stability and memory consumption. It requires CPU time to parse the string, it is prone to errors (double commas, spaces in the string, etc.) and it means your script essentially has to store twice as much data in memory.
It's not a good idea and is most likely just a bad habit someone picked up when they first learned about strings and arrays. That or they're trying to be clever for some kind of coding exercise.
As a rule of thumb, the only time you should be parsing strings into arrays is if you're reading that string data from an external source and need to convert it to native types. If you already know the values ahead of time, you should create the array yourself.
The one possible reason someone might do this is to reduce the number of characters in their source code, trading performance for bandwidth. 'a,b,c,d,e,f,g'.split(',') is fewer characters than ['a','b','c','d','e','f','g'].

There is no difference, it's just bad practice and laziness if anything. The only reason I could think of using the first approach is if the data naturally came in string form and using an array literal made it completely unreadable.

JavaScript performance when handling large arrays

I'm currently writing an image editing program in JavaScript. I've chosen JS because I wanted to learn more about it. The average image I'm handling is about 3000 x 4000 pixels big. When converted into imageData (for editing the pixels), that adds up to 48000000 values I have to deal with. That's why I decided to introduce webworkers and let them edit only the n-th-part of the array. Pretending that I have ten webworkers, each worker will have to deal with 4800000 values.
To be able to use webworkers I'm dividing the big array through the amount of threads I've chosen. The piece of code I use looks like this:
while (pixelArray.length > 0) {
    cD.pixelsSliced.push(pixelArray.splice(0, chunks)); // Chop off a chunk from the picture array
}
Later, after the workers have done something with the array, they save it into another array. Each worker has an ID and saves its part in the mentioned array at the index of its ID (to make sure the arrays stay in the correct order). I use $.map to concat that array (looking like [[1231][123213123][213123123]]) into one big array [231231231413431], from which I will later create the imageData I need. It looks like this:
cD.newPixels = jQuery.map(pixelsnew, function (n) {
    return n;
});
After this array (cD.pixelsSliced) is created I create imageData and copy this image into the imageData-Object like so:
cD.imageData = cD.context.createImageData(cD.width, cD.height);
for (var i = 0; i < cD.imageData.data.length; i += 4) { // Build imageData
    cD.imageData.data[i + eD.offset["r"]] = cD.newPixels[i + eD.offset["r"]];
    cD.imageData.data[i + eD.offset["g"]] = cD.newPixels[i + eD.offset["g"]];
    cD.imageData.data[i + eD.offset["b"]] = cD.newPixels[i + eD.offset["b"]];
    cD.imageData.data[i + eD.offset["a"]] = cD.newPixels[i + eD.offset["a"]];
}
Now I do realize that I'm dealing with a huge amount of data here and that I probably shouldn't use the browser for image editing, but a different language (I'm using Java in uni). However, I was wondering if you have any tips regarding the performance, because frankly I was pretty surprised when I tried a big image for the first time. I didn't expect that it would take "that" long to load the image (first piece of code). Firefox actually thinks that my script is broken. The other two pieces of code are the ones I found to slow down the script (which is normal). So yeah, I would be thankful for any tips.
Thank you

I would recommend looking into Transferable Objects instead of Structured Cloning when using Web Workers. Web Workers normally use structured cloning to pass objects, in other words a copy is made. This can take loads of time for large objects such as large images.
When using Transferable Objects data is transferred from one context to another. In other words, zero-copy, which should improve the performance of sending data to a Worker.
For more info check:
http://www.w3.org/html/wg/drafts/html/master/infrastructure.html#transferable-objects
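A rough sketch of how the pixel data might be prepared for transfer, assuming it is held in a typed array such as ImageData.data. The helper name splitForWorkers is hypothetical, and the chunk size is rounded up to a multiple of 4 so that no RGBA pixel is split across two workers.

```javascript
// Hypothetical helper: copy `pixels` (e.g. ImageData.data) into roughly
// `count` independent typed arrays, one per worker.
function splitForWorkers(pixels, count) {
  // Round up to a multiple of 4 so that no RGBA pixel is split.
  const chunkSize = Math.ceil(pixels.length / count / 4) * 4;
  const chunks = [];
  for (let start = 0; start < pixels.length; start += chunkSize) {
    // slice() copies into a fresh ArrayBuffer that can be transferred.
    chunks.push(pixels.slice(start, start + chunkSize));
  }
  return chunks;
}

const demoPixels = new Uint8ClampedArray(24).map((_, i) => i);
const demoChunks = splitForWorkers(demoPixels, 3);
// In a page, each chunk could then be handed over without a second copy:
//   worker.postMessage(chunk.buffer, [chunk.buffer]);
```

The second argument to postMessage is the transfer list; after the call the buffer is no longer usable on the sending side. Slicing a typed array this way is also likely to be cheaper than repeated splice calls on a plain array, which shift the remaining elements each time.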
Also, another idea perhaps would be to move the task of splitting and putting back the large array to a web worker.
Just brainstorming here, but perhaps you could first spawn a Web Worker, let's call it Mother Worker. This worker could split the array and then spawn 10 other child workers that perform the heavy-duty task and send the results back to their mother.
The mother finally puts it all back together and sends it back to the main application.

StarDict support for JavaScript and a Firefox OS App

I wrote a dictionary app in the spirit of GoldenDict (www.goldendict.org, also see Google Play Store for more information) for Firefox OS: http://tuxor1337.github.io/firedict and https://marketplace.firefox.com/app/firedict
Since apps for ffos are based on HTML, CSS and JavaScript (WebAPI etc.), I had to write everything from scratch. At first, I wrote a basic library for synchronous and asynchronous access to StarDict dictionaries in JavaScript: https://github.com/tuxor1337/stardict.js
Although the app can be called stable by now, overall performance is still a bit sluggish. For some dictionaries, I have a list of words of almost 1,000,000 entries! That's huge. Indexing takes a really long time (up to several minutes per dictionary) and lookup as well. At the moment, the words are stored in an IndexedDB object store. Is there another alternative? With the current solution (words accessed and inserted using binary search) the overall experience is pretty slow. Maybe it would become faster, if there was some locale sort support by IndexedDB... Actually, I'm not even storing the terms themselves in the DB but only their offsets in the *.syn/*.idx file. I hope to save some memory doing that. But of course I'm not able to use any IDB sorting functionality with this configuration...
Maybe it's not the best idea to do the sorting in memory, because now the app is killed by the kernel due to an OOM on some devices (e.g. ZTE Open). A dictionary with more than 500,000 entries will definitely exceed 100 MB in memory. (That's only 200 Byte per entry and if you suppose the keyword strings are UTF-8, you'll exceed 100 MB immediately...)
Feel free to contribute directly to the project on GitHub. Otherwise, I would be glad to hear your advice concerning the above issues.

I am working on a pure JavaScript implementation of an MDict parser (https://github.com/fengdh/mdict-js), similar to your StarDict project. MDict is another popular dictionary format with rich content (embedded image/audio/CSS etc.), which is widely supported on Windows/Linux/iOS/Android/Windows Phone. I have some ideas to share, and hope you can apply them to improve stardict.js in the future.
An MDict dictionary file (mdx/mdd) divides keywords and records into (optionally compressed) blocks, each containing around 2000 entries, and also provides a keyword block index table and a record block index table to help quick look-up. Because of this compact data structure, my MDict parser can scan the dictionary file directly, with only a small pre-loaded index table and no need for IndexedDB.
Each keyword block index looks like:
{num_entries: ..,
first_word: ..,
last_word: ..,
comp_size: .., // size in compression
decomp_size: .., // size after decompression
offset: .., // offset in mdx file
index: ..
}
In a keyword block, each entry is a pair of [keyword, offset].
Each record block index looks like:
{comp_size: .., // size in compression
decomp_size: .., // size after decompression
}
Given a word, use binary search to locate the keyword block that may contain it.
Slice the keyword block and load all keys in it, filter out the matching one and get its record offset.
Use binary search to locate the record block containing the word's record.
Slice the record block and retrieve its record (a definition in text or resource in ArrayBuffer) directly.
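The first of the steps above can be sketched as a binary search over the keyword block index. This is not taken from mdict-js; findBlock and the sample index are hypothetical, and plain string comparison is assumed, whereas a real dictionary may use its own collation.

```javascript
// Hypothetical index entries, using the first_word / last_word fields
// from the keyword block index described above; sorted by first_word.
function findBlock(blockIndex, word) {
  let lo = 0;
  let hi = blockIndex.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const block = blockIndex[mid];
    if (word < block.first_word) hi = mid - 1;
    else if (word > block.last_word) lo = mid + 1;
    else return block; // first_word <= word <= last_word
  }
  return null; // the word falls outside every block
}

const demoIndex = [
  { first_word: 'aardvark', last_word: 'cat', offset: 0 },
  { first_word: 'cattle', last_word: 'mouse', offset: 4096 },
  { first_word: 'mousse', last_word: 'zebra', offset: 8192 },
];
const hit = findBlock(demoIndex, 'dog');
```

Only the block found this way has to be read and decompressed, which is what keeps memory use low.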
Since each block contains only around 2000 entries, it is fast enough to lookup word among 100K~1M dictionary entries within 100ms, quite decent value for human interaction. mdict-js parses file head only, it is super fast and of low memory usage.
In the same way, it is possible to retrieve a list of neighboring words for given phrase, even with wild card.
Please take a look on my online demo here: http://fengdh.github.io/mdict-js/
(You have to choose a local MDict dictionary: a mdx + optional mdd file)

Writing a JavaScript zip code validation function

I would like to write a JavaScript function that validates a zip code, by checking if the zip code actually exists. Here is a list of all zip codes:
http://www.census.gov/tiger/tms/gazetteer/zips.txt (I only care about the 2nd column)
This is really a compression problem. I would like to do this for fun. OK, now that's out of the way, here is a list of optimizations over a straight hashtable that I can think of, feel free to add anything I have not thought of:
Break zipcode into 2 parts, first 2 digits and last 3 digits.
Make a giant if-else statement first checking the first 2 digits, then checking ranges within the last 3 digits.
Or, convert the zips into hex, and see if I can do the same thing using smaller groups.
Find out if within the range of all valid zip codes there are more valid zip codes vs invalid zip codes. Write the above code targeting the smaller group.
Break up the hash into separate files, and load them via Ajax as user types in the zipcode. So perhaps break into 2 parts, first for first 2 digits, second for last 3.
Lastly, I plan to generate the JavaScript files using another program, not by hand.
Edit: performance matters here. I do want to use this, if it doesn't suck. Performance of the JavaScript code execution + download time.
Edit 2: JavaScript only solutions please. I don't have access to the application server, plus, that would make this into a whole other problem =)

You could do the unthinkable and treat the code as a number (remember that it's not actually a number). Convert your list into a series of ranges, for example:
zips = [10000, 10001, 10002, 10003, 23001, 23002, 23003, 36001]
// becomes
zips = [[10000,10003], [23001,23003], [36001,36001]]
// make sure to keep this sorted
then to test:
function isValidZip(myzip) {
    for (var i = 0, l = zips.length; i < l; ++i) {
        if (myzip >= zips[i][0] && myzip <= zips[i][1]) {
            return true;
        }
    }
    return false;
}
isValidZip(23002); // true
This is just using a very naive linear search (O(n)). If you kept the list sorted and used binary searching, you could achieve O(log n).
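A sketch of the binary search variant, assuming the ranges are sorted by their start value and do not overlap. The names zipInRanges and zipRanges are illustrative.

```javascript
// Assumes `ranges` is sorted by start and the ranges do not overlap.
function zipInRanges(ranges, zip) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (zip < ranges[mid][0]) hi = mid - 1;      // look in the lower half
    else if (zip > ranges[mid][1]) lo = mid + 1; // look in the upper half
    else return true;                            // inside ranges[mid]
  }
  return false;
}

const zipRanges = [[10000, 10003], [23001, 23003], [36001, 36001]];
```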

I would like to write a JavaScript function that validates a zip code
Might be more effort than it's worth, keeping it updated so that at no point someone's real valid ZIP code is rejected. You could also try an external service, or do what everyone else does and just accept any 5-digit number!
here is a list of optimizations over a straight hashtable that I can think of
Sorry to spoil the potential Fun, but you're probably not going to manage much better actual performance than JavaScript's Object gives you when used as a hashtable. Object member access is one of the most common operations in JS and will be super-optimised; building your own data structures is unlikely to beat it even if they are potentially better structures from a computer science point of view. In particular, anything using ‘Array’ is not going to perform as well as you think because Array is actually implemented as an Object (hashtable) itself.
Having said that, a possible space compression tool if you only need to know 'valid or not' would be to use a 100000-bit bitfield, packed into a string. For example for a space of only 100 ZIP codes, where codes 032-043 are ‘valid’:
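var zipfield = '\x00\x00\x00\x00\xFF\x0F\x00\x00\x00\x00\x00\x00\x00';
function isvalid(zip) {
    // Anchored so that only strings of exactly three digits pass
    if (!/^[0-9]{3}$/.test(zip))
        return false;
    var z = parseInt(zip, 10);
    // Byte z/8 holds the flag for code z in bit z%8
    return !!(zipfield.charCodeAt(Math.floor(z / 8)) & (1 << (z % 8)));
}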
Now we just have to work out the most efficient way to get the bitfield to the script. The naive '\x00'-filled version above is pretty inefficient. Conventional approaches to reducing that would be eg. to base64-encode it:
var zipfield= atob('AAAAAP8PAAAAAAAAAA==');
That would get the 100000 flags down to 16.6kB. When this was written atob was Mozilla-only, so an additional base64 decoder was needed for other browsers (it's not too hard, but it's a bit more startup time to decode); atob is now available in all major browsers. It might also be possible to use an AJAX request to transfer a direct binary string (encoded in ISO-8859-1 text to responseText). That would get it down to 12.5kB.
But in reality probably anything, even the naive version, would do as long as you served the script using mod_deflate, which would compress away a lot of that redundancy, and also the repetition of '\x00' for all the long ranges of ‘invalid’ codes.
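Since the data file is meant to be generated by a program, here is a sketch of how such a bitfield string could be built from a list of valid codes. buildZipField is a hypothetical helper that uses the same layout as the bitfield above: one bit per code, with code z stored in bit z%8 of byte z/8.

```javascript
// Hypothetical builder: packs a list of valid codes into a binary string,
// one bit per code, in the layout the bitfield lookup above expects.
function buildZipField(validCodes, spaceSize) {
  const bytes = new Array(Math.ceil(spaceSize / 8)).fill(0);
  validCodes.forEach((z) => {
    bytes[Math.floor(z / 8)] |= 1 << (z % 8);
  });
  return String.fromCharCode.apply(null, bytes);
}

// The 100-code example: codes 032-043 are valid.
const demoCodes = [];
for (let z = 32; z <= 43; z++) demoCodes.push(z);
const demoField = buildZipField(demoCodes, 100);
```

For the real case spaceSize would be 100000, giving a 12500-character string that can then be base64-encoded or served compressed.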

I use Google Maps API to check whether a zipcode exists.
It's more accurate.

Assuming you've got the zips in a sorted array (seems fair if you're controlling the generation of the datastructure), see if a simple binary search is fast enough.

So... You're doing client-side validation and want to optimize for file size? You probably cannot beat general compression. Fortunately, most browsers support gzip for you, so you can use that for free.
How about a simple JSON-coded dict or list with the zip codes in sorted order, and do a lookup on the dict? It'll compress well, since it's a predictable sequence; it will import easily, since it's JSON and uses the browser's built-in parser; and lookup will probably be very fast too, since that's a JavaScript primitive.

This might be useful:
PHP Zip Code Range and Distance Calculation
As well as List of postal codes.
