Are promises and closures consuming all of my memory? - javascript

I am currently working on a project that uses node.js as a controlling system to do some relatively large-scale machine learning with images. I am running out of memory pretty quickly, even though I am trying to optimize memory usage as much as possible and my data should not take up an excessive amount of space. My code relies heavily on promises and anonymous functions to manage the pipeline, and I am wondering if that is why I'm seeing such crazy high usage in my test case.
Just for context, I am using the MNIST dataset for my testing, which can be found here. The training set consists of 60,000 20x20 images. From these I am extracting overfeat features; a description of this can be found here. What this boils down to is a 4,096-element array for each image, so 60,000 of them. I am caching all the image and feature data in redis.
A quick computation tells me that the full feature set here should be 4096 * 8 * 60000 = 1966080000 bytes, or approximately 1.83GB of memory, assuming that each element of the array is a 64-bit javascript number. The images themselves should only take up a very small amount of space and I am not storing them in memory. However, when I run my code I am seeing more like 8-10GB of memory used after extraction/loading. When trying to do more operations on this data (like printing it all out to a JSON file so I can make sure the extraction worked right) I quickly consume the 16GB of available memory on my computer, crashing the node script.
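For reference, a quick sketch of that arithmetic (not from the original post; the Float64Array line is only there to illustrate what the raw backing store for a single feature vector costs):

var BYTES_PER_ELEMENT = 8;      // 64-bit float
var FEATURES_PER_IMAGE = 4096;
var IMAGES = 60000;

var totalBytes = BYTES_PER_ELEMENT * FEATURES_PER_IMAGE * IMAGES;
console.log(totalBytes);                                         // 1966080000
console.log((totalBytes / Math.pow(2, 30)).toFixed(2) + ' GiB'); // ~1.83

// A single feature vector held as a typed array costs roughly its raw size;
// plain Arrays, wrapper objects and anything retained by closures add overhead on top.
var vector = new Float64Array(FEATURES_PER_IMAGE); // 4096 * 8 = 32768 bytes of data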
So my general question is: why am I seeing such high memory usage? Is it because of my heavy use of promises/closures? Can I refactor my code to use less memory and allow more variables to be garbage collected?
The code is available here for review. Be aware that it is a little rough as far as organization goes.

Your code uses the "promise" library which to be fair is very memory hoggy and was not really built for raw performance. If you switch to Bluebird promises you can get considerably more items in RAM as it will drastically reduce your memory usage.
Here are benchmark results for doxbee-sequential:
results for 10000 parallel executions, 1 ms per I/O op

file                       time(ms)  memory(MB)
promises-bluebird.js            280       26.64
promises-then-promise.js       1775      134.73

And under bench parallel (--p 25):

file                       time(ms)  memory(MB)
promises-bluebird.js            483       63.32
promises-then-promise.js       2553      338.36
You can see the full benchmark here.
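For illustration, the switch itself is usually minimal. A sketch, assuming the project currently does var Promise = require('promise'); the redis part is just an example of Bluebird's promisification helpers, since the question caches data in redis:

var Promise = require('bluebird'); // drop-in for most .then/.all style usage

// Bluebird can also promisify callback-style modules, e.g. the redis client:
var redis = require('redis');
Promise.promisifyAll(redis); // adds *Async variants such as client.getAsync()

var client = redis.createClient();
client.getAsync('feature:0').then(function (value) {
  // value is the cached string, or null if the key is missing
});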

Related

MongoDB - slow concurrent responses and huge difference in term of speed for different response sizes

I was trying to mimic a real use case by creating hundreds of promises that query MongoDB with the same query (not entirely realistic, since in production the concurrent queries would all be different) and then running a Promise.all. I noticed two interesting behaviors:
The first query can return much faster than the last one (the last usually takes about double the time). Does MongoDB lock the collection, so the queries have to wait in a queue?
If I reduce the returned document from about 40 fields to 1 field, the queries run about 10 times faster.
Can anyone explain these two behaviors? Thanks.
Mongo does not queue and block queries.
Try looking at your machine's performance. Promise.all/map are supposed to make a process more efficient, but you are obviously capped by your CPU/RAM availability; I suspect the answer to both your questions lies there.
You should try to find the sweet spot with Promise.map and {concurrency: x} if the queries are too heavy. I found that my machine could not deal with hundreds of complex queries queued up at the same time, but when I pipelined them with a smaller concurrency I could use my RAM/CPU efficiently without overloading the machine, as sketched below.
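A minimal sketch of that approach, assuming Bluebird and the official MongoDB Node driver; the collection and filters here are placeholders:

var Promise = require('bluebird');

function runQueries(collection, filters) {
  // Promise.all(filters.map(...)) would fire every query at once;
  // concurrency: 10 keeps at most 10 queries in flight at a time.
  return Promise.map(filters, function (filter) {
    return collection.find(filter).toArray();
  }, { concurrency: 10 });
}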
1) I guess the most time-consuming task is the transfer of data between MongoDB and Node.js, and as that goes through a limited number of sockets, queries will queue up in the network.
2) Because then less data has to be transferred.
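To illustrate point 2, a projection limits what actually gets transferred. A sketch assuming a 3.x+ Node MongoDB driver, where the projection is passed as an option (older drivers use a different option name); field names are made up:

// Returns full 40-field documents:
collection.find({ status: 'active' }).toArray();

// Returns only one field (plus _id), so far less data crosses the socket:
collection.find({ status: 'active' }, { projection: { name: 1 } }).toArray();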

how to generate 15 million+ data points and insert from node server to mysql with no error

I wrote a script that generates 5,443,200 data records in batches of 544,320 and inserts them into a mysql db using the mysqljs/mysql module. When I try to bump up the volume of data records, I get a node heap out of memory error. I tried to refactor this using a generator and iterator, but doing so seemed to introduce a lot of buggy behavior and ultimately crashed my mysql server. I also threw in the cluster node module to see if that would help, but that caused my mysql server to refuse connections and sometimes crashed my computer altogether.
My question is, how would I be able to scale the script such that I can generate 30 times the 5 million I've generated at ideally the same generation and insertion rate? I reckon I can still work with generators and iterators as it's most likely my particular implementation that's buggy.
https://gist.github.com/anonymous/bed4a311fb746ba04c65d331d23bd0a8
Batch-inserting more than 1000 rows at a time is very much into "diminishing returns". 1000 gives you about 99% of the theoretical maximum.
When you go much beyond 1000, you first get into inefficiencies, then you get (as you found out) into limitations. Your attempt to get to 100% has backfired.
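A rough sketch of that advice with mysqljs/mysql (not the original gist; the table and column names are placeholders), inserting in chunks of about 1,000 rows and waiting for each INSERT before issuing the next:

var mysql = require('mysql');
var connection = mysql.createConnection({ /* host, user, password, database */ });

var BATCH_SIZE = 1000; // beyond ~1000 rows per INSERT the gains are negligible

function insertInBatches(rows, done) {
  if (rows.length === 0) return done();
  var batch = rows.slice(0, BATCH_SIZE); // each row is an array, e.g. [x, y]
  // mysqljs expands the nested array into (v1, v2), (v1, v2), ...
  connection.query('INSERT INTO data_points (x, y) VALUES ?', [batch], function (err) {
    if (err) return done(err);
    insertInBatches(rows.slice(BATCH_SIZE), done); // next batch only after this one completes
  });
}

At 150M+ records, generating all rows up front is itself a memory problem, so in practice each batch would also be generated lazily just before its INSERT rather than held in one giant array.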

Node.js JSON.parse on object creation vs. with getter property

This is largely an 'am I doing it right / how can I do this better' kind of topic, with some concrete questions at the end. If you have other advice / remarks on the text below, even if I didn't specifically ask those questions, feel free to comment.
I have a MySQL table for users of my app that, along with a set of fixed columns, also has a text column containing a JSON config object. This is to store variable configuration data that cannot be stored in separate columns because it has different properties per user. There doesn't need to be any lookup / ordering / anything on the configuration data, so we decided this would be the best way to go.
When querying the database from my Node.JS app (running on Node 0.12.4), I assign the JSON text to an object and then use Object.defineProperty to create a getter property that parses the JSON string data when it is needed and adds it to the object.
The code looks like this:
user =
  uid: results[0].uid
  _c: results[0].user_config # JSON config data as string

Object.defineProperty user, 'config',
  get: ->
    @c = JSON.parse @_c if not @c?
    return @c
Edit: above code is Coffeescript, here's the (approximate) Javascript equivalent for those of you who don't use Coffeescript:
var user = {
  uid: results[0].uid,
  _c: results[0].user_config // JSON config data as string
};

Object.defineProperty(user, 'config', {
  get: function() {
    if (this.c === undefined) {
      this.c = JSON.parse(this._c);
    }
    return this.c;
  }
});
I implemented it this way because parsing JSON blocks the Node event loop, and the config property is only needed about half the time (this is in a middleware function for an Express server), so the JSON is only parsed when it is actually needed. The config data itself can range from 5 to around 50 different properties organised in a couple of nested objects; not a huge amount of data, but still more than just a few lines of JSON.
Additionally, there are three of these JSON objects (I only showed one since they're all basically the same, just with different data in them). Each one is needed in different scenarios but all of the scenarios depend on variables (some of which come from external sources) so at the point of this function it's impossible to know which ones will be necessary.
So I had a couple of questions about this approach that I hope you guys can answer.
Is there a negative performance impact when using Object.defineProperty, and if yes, is it possible that it could negate the benefit from not parsing the JSON data right away?
Am I correct in assuming that not parsing the JSON right away will actually improve performance? We're looking at a continuously high number of requests and we need to process these quickly and efficiently.
Right now the three JSON data sets come from two different tables JOINed in an SQL query, so that there is only one query per request instead of up to four. Keeping in mind that there are scenarios where none of the JSON data is needed, but also scenarios where all three data sets are needed (and of course scenarios in between), could it be an improvement to only fetch the required JSON data from its table at the point when one of the data sets is actually needed? I avoided this because I feel like waiting for four separate SELECT queries would take longer than waiting for one query with two JOINed tables.
Are there other ways to approach this that would improve the general performance even more? (I know, this one's a bit of a subjective question, but ideas / suggestions of things I should check out are welcome). I'm not looking to spin off parsing the JSON data into a separate thread though, because as our service runs on a cluster of virtualised single-core servers, creating a child process would only increase overall CPU usage, which at high loads would have even more negative impact on performance.
Note: when I say performance it mainly means fast and efficient throughput rates. We prefer a somewhat larger memory footprint over heavier CPU usage.
We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil
- Donald Knuth
What do I get from that article? Too much time is spent in optimizing with dubious results instead of focusing on design and clarity.
It's true that JSON.parse blocks the event loop, but every synchronous call does - this is just code execution and is not a bad thing.
The root concern is not that it is blocking, but how long it is blocking. I remember a Strongloop instructor saying 10ms was a good rule of thumb for max execution time for a call in an app at cloud scale. >10ms is time to start optimizing - for apps at huge scale. Each app has to define that threshold.
So, how much execution time will your lazy init save? This article says it takes 1.5s to parse a 15MB json string - about 10,000 B/ms. 3 configs, 50 properties each, 30 bytes/k-v pair = 4500 bytes - about half a millisecond.
When the time comes to optimize, I would look at having your lazy init do the MySQL call itself: a config is needed only 50% of the time, the call won't block the event loop, and an external round trip to the db absolutely dwarfs a JSON.parse().
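A rough sketch of that idea (not from the answer): queryAsync and the table/column names are invented here, and the getter now hands back a cached promise rather than a parsed object:

function attachLazyConfig(user, db) {
  var configPromise = null;
  Object.defineProperty(user, 'config', {
    get: function () {
      if (!configPromise) {
        // Only hit MySQL the first time config is actually accessed
        configPromise = db
          .queryAsync('SELECT user_config FROM user_configs WHERE uid = ?', [user.uid])
          .then(function (rows) { return JSON.parse(rows[0].user_config); });
      }
      return configPromise; // consumers do: user.config.then(function (cfg) { ... })
    }
  });
}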
All of this to say: What you are doing is not necessarily bad or wrong, but if the whole app is littered with these types of dubious optimizations, how does that impact feature addition and maintenance? The biggest problems I see revolve around time to market, not speed. Code complexity increases time to market.
Q1: Is there a negative performance impact when using Object.defineProperty...
Check out this site for a hint.
Q2: ...not parsing the JSON right away will actually improve performance...
IMHO: inconsequentially
Q3: Right now the three JSON data sets come from two different tables...
The majority of db query cost is usually the out-of-process call and the network data transport (unless you have a really bad schema or config). Getting all the data in one call is the right move.
Q4: Are there other ways to approach this that would improve the general performance
Impossible to tell. The place to start is with an observed behavior, then profiler tools to identify the culprit, then code optimization.

Recommended Riak mapreduce Javascript VM pool size for map and reduce phases? (mapred timeout error)

I was wondering if anyone can recommend app.config settings for map and reduce Javascript VM pools?
My current setup consists of two (2) Amazon EC2 m1.medium instances in the cluster. Each server has a single CPU with ~4GB of RAM. My ring size is set to 64 partitions, with 8 JS VMs for map phases, 16 JS VMs for reduce, and 2 for hooks. I am planning on adding another instance to the cluster, to make it 3, but I'm trying to stretch as much as possible until then.
I recently encountered high wait times for queries on a set of a few thousand records (the query was to fetch the most recent 25 news feeds from a bucket of articles), resulting in timeouts. As a workaround, I passed "reduce_phase_only_1" as an argument. My query was structured as follows (sketched roughly further down):
1) 2i index search
2) map phase to filter out deleted articles
3) reduce phase to sort on creation time (this is where I added the reduce_phase_only_1 arg)
4) reduce phase to slice the top of the results
Anyone know how to alleviate the bottleneck?
Cheers,
-Victor
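For reference, a rough JavaScript sketch of the phases described above (the field names are invented; Riak passes map functions the stored object and reduce functions the accumulated list, and also ships built-ins such as Riak.mapValuesJson that cover part of this):

// 1) the 2i index search feeds keys into the pipeline, then:
function mapFilterDeleted(value, keyData, arg) {
  var article = JSON.parse(value.values[0].data);
  return article.deleted ? [] : [article];
}

function reduceSortByCreated(values, arg) {
  return values.sort(function (a, b) { return b.created - a.created; });
}

function reduceSliceTop(values, arg) {
  return values.slice(0, 25); // keep the 25 most recent
}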
Your map phase functions execute in parallel, close to the data, while the reduce phase generally runs iteratively on a single node using a single VM. You should therefore increase the number of VMs in the pool for map phases and reduce the pool size for reduce phases. This has been described in greater detail here.
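A purely illustrative app.config fragment along those lines, flipping the ratio in the current setup (setting names as documented for Riak 1.x under riak_kv; the exact counts depend on your RAM and ring size and should be verified against your version's docs):

{riak_kv, [
  %% more VMs for map phases, which run in parallel on every node
  {map_js_vm_count, 16},
  %% reduce runs mostly on one node, so a small pool is enough
  {reduce_js_vm_count, 4},
  {hook_js_vm_count, 2}
]}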
I would also recommend not using the reduce_phase_only_1 flag: leaving it out allows Riak to pre-reduce if volumes grow, although this will result in a number of reduce phase functions running in parallel, which requires a larger pool size. You could also merge your two reduce phase functions into one and, at each stage, sort before cutting away the excess results.
MapReduce is a flexible way to query your data, but also quite expensive, especially compared to direct key access. It is therefore best suited for batch type jobs where you can control the level of concurrency and the amount of load you put on the system through MapReduce. It is generally not recommended to use it to serve user driven queries as it can overload the cluster if there is a spike in traffic.
Instead of generating the appropriate data for every request, it is very common to de-normalise and pre-compute data when using Riak. In your case you might be able to keep lists of news in separate summary objects and update these as news items are inserted, deleted or updated. This adds a bit more work when inserting, but makes reads much more efficient and scalable, as they can be served through a single GET request rather than a MapReduce job. If you have a read-heavy application this is often a very good design.
If inserts and updates are too frequent, making it difficult to update these summary objects efficiently, it may be possible to have a batch job do this at specific time intervals instead, if it is acceptable that the view may not be 100% up to date.

Chrome thinks 99,999 is drastically different than 100,000

I just ran into a very interesting issue when someone posted a jsperf benchmark that conflicted with a previous, nearly identical, benchmark I ran.
Chrome does something drastically different between these two lines:
new Array(99999); // jsperf ~50,000 ops/sec
new Array(100000); // jsperf ~1,700,000 ops/sec
benchmarks: http://jsperf.com/newarrayassign/2
I was wondering if anyone has any clue as to what's going on here!
(To clarify, I'm looking for some low-level details on the V8 internals, such as whether it uses a different data structure for one versus the other, and what those structures are.)
Just because this sounded pretty interesting, I searched the V8 codebase for a constant defined as 100000 and found kInitialMaxFastElementArray, which is subsequently used in the built-in ArrayConstructInitializeElements function. While I'm not a C++ programmer and don't know the nitty-gritty here, you can see that it uses an if statement to check whether the requested size is smaller than 100,000, and returns at different points based on that.
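If you want to reproduce the gap locally rather than on jsperf, a rough sketch (the absolute numbers and the exact threshold depend on the V8 version):

function bench(n, iterations) {
  var label = 'new Array(' + n + ')';
  console.time(label);
  for (var i = 0; i < iterations; i++) {
    var a = new Array(n); // the allocation itself is what we are timing
  }
  console.timeEnd(label);
}

bench(99999, 1e5);   // below kInitialMaxFastElementArray
bench(100000, 1e5);  // at the threshold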
Well, there is always some threshold number when you design algorithms that adapt to the size of the data (for example, SharePoint changes the way it works when you add 1000 items to a list). So the guess would be that you have found that actual number, and performance differs because different data structures or algorithms are used.
I don't know what operating system you're using, but if this is Linux, I'd suspect that Chrome (i.e. malloc) is allocating memory from a program-managed heap (size determined using the sbrk system call, and the free lists are managed by the C standard library), but when you reach a certain size threshold, it switches to using mmap to ask the kernel to allocate large chunks of memory that don't interfere with the sbrk-managed heap.
Doug Lea describes how malloc works in the GNU C Library, better than I could. He wrote it.
Or maybe 100000 hits some kind of magic threshold for the amount of space needed, such that it triggers the garbage collector more frequently when trying to allocate memory.
