In a recent SO question, I outlined an OOM condition that I'm running into while processing a large number of CSV files with millions of records in each.
The more I look into the problem and the more I read up on Node.js, the more convinced I become that the OOM isn't happening because of a memory leak, but because I'm not throttling the data input into the system.
The code just blindly sucks in all data, creating a single callback event for each line. The events keep getting added to the main event loop, which eventually becomes so large that it exhausts all available memory.
What are Node's idiomatic patterns for dealing with this scenario? Should I be tying the reading of CSV files to a blocking queue of some sort that, once full, blocks the file reader from parsing more of the data? Are there any good examples dealing with processing of large data sets?
Update: To put this differently and more simply: Node can process input faster than it can process output, and the slack is being stored in memory (queued as events for the event queue). Because there is a lot of slack, the memory eventually gets exhausted. So the question is: what's the idiomatic way of throttling input down to the output's rate?
Your best bet is to set things up as streams, and rely on the built-in backpressure semantics to do so. The Streams Handbook has a really good overview of this.
Similar to Unix pipes, the node stream module's primary composition operator is called .pipe(), and you get a backpressure mechanism for free to throttle writes for slow consumers.
Update
I've not used the readline module for anything other than terminal input before, but reading the docs it looks like it accepts an input stream and an output stream. If you frame your DB writer as a writable stream, you should be able to let readline pipe into it for you internally.
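For example, a minimal sketch of the pipe-based approach; the csvParser() and db.insert names below are stand-ins for whatever streaming parser and output you actually use:

const fs = require('fs');
const { Writable } = require('stream');

// A writable sink that persists one parsed row at a time. Calling the
// callback only after the (hypothetical) db.insert finishes is what
// propagates backpressure up the pipe.
const dbWriter = new Writable({
  objectMode: true,
  write(row, _encoding, callback) {
    db.insert(row, callback); // db.insert is a stand-in for your real output
  }
});

fs.createReadStream('big.csv')
  .pipe(csvParser())   // any streaming CSV parser in object mode (stand-in name)
  .pipe(dbWriter);     // .pipe() pauses the reader whenever dbWriter falls behind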
It helps me understand things by using a real-world comparison, in this case fast food.
In Java, for synchronous blocking, I understand that each request processed by a thread can only be completed one at a time. It's like ordering through a drive-through: if I'm tenth in line, I have to wait for the 9 cars ahead of me. But I can open up more threads so that multiple orders are completed simultaneously.
In JavaScript you can have asynchronous, non-blocking, but single-threaded execution. As I understand it, multiple requests are made and immediately accepted, but each request is processed by some background process at some later time before returning. I don't understand how this would be faster. If you order 10 burgers at the same time, the 10 requests are put in immediately, but since there is only one cook (single thread) it still takes the same time to make the 10 burgers.
I mean, I understand the reasoning for why non-blocking async single-threaded "should" be faster for some things, but the more questions I ask myself, the less I understand it.
I really don't understand how non-blocking async single-threaded can be faster than sync blocking multithreaded for any type of application, including IO.
Non-blocking async single threaded is sometimes faster
That's unlikely. Where are you getting this from?
In multi-threaded synchronous I/O, this is roughly how it works:
The OS and appserver platform (e.g. a JVM) work together to create 10 threads. These are data structures represented in memory, and a scheduler running at the kernel/OS level will use these data structures to tell one of your CPU cores to 'jump to' some point in the code to run the commands it finds there.
The data structure that represents a thread contains more or less the following items:
The location in memory of the instruction we were running.
The entire 'stack'. If some function invokes a second function, then we need to remember all local variables and the point we were at in that original method, so that when the second method 'returns', it knows how to do that. E.g. your average Java program is probably ~20 methods deep, so that's 20x the local vars and 20 places in the code to track. This is all done on stacks. Each thread has one. They tend to be fixed size for the entire app.
Which cache page(s) were spun up in the local cache of the core running this code.
The code in the thread is written as follows: all commands that interact with 'resources' (which are orders of magnitude slower than your CPU; think network packets, disk access, etc.) return the requested data immediately if they can, which is only possible if everything you asked for is already available in memory. If that is impossible, because the data you want just isn't there yet (let's say the packet carrying the data you want is still on the wire, heading to your network card), there's only one thing the code that powers the 'get me network data' function can do: wait until that packet arrives and makes its way into memory.
So as not to just do nothing at all, the OS and CPU will work together to take that data structure that represents the thread, freeze it, find another such frozen data structure, unfreeze it, and jump to the 'where did we leave things' point in the code.
That's a 'thread switch': Core A was running thread 1. Now core A is running thread 2.
The thread switch involves moving a bunch of memory around: all those 'live' cached pages, and that stack, need to be near that core for the CPU to do the job, so that's the CPU loading in a bunch of pages from main memory, which does take some time. Not a lot (nanoseconds), but not zero either. Modern CPUs can only operate on data loaded into a nearby cache page (which are ~64k to 1MB in size, no more than that; a thousand+ times less than what your RAM sticks can store).
In single-threaded asynchronous I/O, this is roughly how it works:
There's still a thread of course (all things run in one), but this time the app in question doesn't multithread at all. Instead, it, itself, creates the data structures required to track multiple incoming connections, and, crucially, the primitives used to ask for data work differently. Remember that in the synchronous case, if the code asks for the next bunch of bytes from the network connection then the thread will end up 'freezing' (telling the kernel to find some other work to do) until the data is there. In asynchronous modes, instead the data is returned if available, but if not available, the function 'give me some data!' still returns, but it just says: Sorry bud. I have 0 new bytes for you.
The app itself will then decide to go work on some other connection, and in that way, a single thread can manage a bunch of connections: Is there data for connection #1? Yes, great, I shall process this. No? Oh, okay. Is there data for connection #2? and so on and so forth.
Note that, if data arrives on, say, connection #5, then this one thread, to do the job of handling this incoming data, will presumably need to load, from memory, a bunch of state info, and may need to write it.
For example, let's say you are processing an image, and part of the PNG data arrives on the wire. There's not a lot you can do with it, so this one thread will create a buffer and store that part of the PNG inside it. As it then hops to another connection and back, it needs to load the ~15% of the image it already got, and add onto that buffer the 10% of the image that just arrived in a network packet.
This app is also causing a bunch of memory to be moved around into and out of cache pages just the same, so in that sense it's not all that different, and if you want to handle 100k things at once, you're inevitably going to end up having to move stuff into and out of cache pages.
So what is the difference? Can you put it in fry cook terms?
Not really, no. It's all just data structures.
The key difference is in what gets moved into and out of those cache pages.
In the case of async it is exactly what the code you wrote wants to buffer. No more, no less.
In the case of synchronous, it's that 'data structure representing a thread'.
Take java, for example: That means at the very least the entire stack for that thread. That's, depending on the -Xss parameter, about 128k worth of data. So, if you have 100k connections to be handled simultaneously, that's 12.8GB of RAM just for those stacks!
If those incoming images really are all only about 4k in size, you could have done it with 4k buffers, for only 0.4GB of memory needed at most, if you handrolled that by going async.
That is where the gain lies for async: by handrolling your buffers, you can't avoid moving memory into and out of cache pages, but you can ensure it's smaller chunks, and that will be faster.
Of course, to really make it faster, the buffer for storing state in the async model needs to be small (not much point to this if you need to save 128k into memory before you can operate on it, that's how large those stacks were already), and you need to handle so many things at once (10k+ simultaneous).
There's a reason we don't write all code in assembler or why memory managed languages are popular: Handrolling such concerns is tedious and error-prone. You shouldn't do it unless the benefits are clear.
That's why synchronous is usually the better option and, in practice, often actually faster (those OS thread schedulers are written by expert coders and tuned extremely well; you don't stand a chance of replicating their work). The whole 'by handrolling my buffers I can reduce the number of bytes that need to be moved around a ton!' gain needs to outweigh those losses.
In addition, async is complicated as a programming model.
In async mode, you can never block. Wanna do a quick DB query? That could block, so you can't do that, you have to write your code as: Okay, fire off this job, and here's some code to run when it gets back. You can't 'wait for an answer', because in async land, waiting is not allowed.
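In JavaScript terms, that looks roughly like this (db, handleError, and renderPage are made-up names, purely for illustration):

// db here is a hypothetical async-style database client
db.query('SELECT ...', (err, rows) => {
  if (err) return handleError(err);   // handleError / renderPage are also stand-ins
  renderPage(rows);                   // runs later, whenever the result actually arrives
});
// execution continues here immediately, before the query has completed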
In async mode, any time you ask for data, you need to be capable of dealing with getting half of what you wanted. In synchronous mode, if you ask for 4k, you get 4k. The fact that your thread may freeze during this task until the 4k is available is not something you need to worry about; you write your code as if the data just arrives, complete, as you ask for it.
Bbbuutt... fry cooks!
Look, CPU design just isn't simple enough to put in terms of a restaurant like this.
You are mentally moving the bottleneck from your process (the burger orderer) to the other process (the burger maker).
This will not make your application faster.
When considering the single-threaded async model, the real benefit is that your process is not blocked while waiting for the other process.
In other words, do not associate async with the word fast but with the word free. Free to do other work.
I need help understanding how Bull Queue (bull.js) processes concurrent jobs.
Suppose I have 10 Node.js instances that each instantiate a Bull Queue connected to the same Redis instance:
const Queue = require('bull');
const queue = new Queue('taskqueue', {...});
const concurrency = 5;
queue.process('jobTypeA', concurrency, job => {...do something...});
Does this mean that globally across all 10 node instances there will be a maximum of 5 (concurrency) concurrently running jobs of type jobTypeA? Or am I misunderstanding and the concurrency setting is per-Node instance?
What happens if one Node instance specifies a different concurrency value?
Can I be certain that jobs will not be processed by more than one Node instance?
The TL;DR is: under normal conditions, jobs are processed only once. If things go wrong (say, the Node.js process crashes), jobs may be double processed.
Quoting from Bull's official README.md:
Important Notes
The queue aims for an "at least once" working strategy. This means that in some situations, a job could be processed more than once. This mostly happens when a worker fails to keep a lock for a given job during the total duration of the processing.
When a worker is processing a job it will keep the job "locked" so other workers can't process it.
It's important to understand how locking works to prevent your jobs from losing their lock - becoming stalled - and being restarted as a result. Locking is implemented internally by creating a lock for lockDuration on interval lockRenewTime (which is usually half lockDuration). If lockDuration elapses before the lock can be renewed, the job will be considered stalled and is automatically restarted; it will be double processed. This can happen when:
The Node process running your job processor unexpectedly terminates.
Your job processor was too CPU-intensive and stalled the Node event loop, and as a result, Bull couldn't renew the job lock (see #488 for how we might better detect this). You can fix this by breaking your job processor into smaller parts so that no single part can block the Node event loop. Alternatively, you can pass a larger value for the lockDuration setting (with the tradeoff being that it will take longer to recognize a real stalled job).
As such, you should always listen for the stalled event and log this to your error monitoring system, as this means your jobs are likely getting double-processed.
As a safeguard so problematic jobs won't get restarted indefinitely (e.g. if the job processor always crashes its Node process), jobs will be recovered from a stalled state a maximum of maxStalledCount times (default: 1).
Bull is designed for processing jobs concurrently with "at least once" semantics, although if the processors are working correctly, i.e. not stalling or crashing, it is in fact delivering "exactly once". However, you can set the maximum stalled retries to 0 (maxStalledCount, see https://github.com/OptimalBits/bull/blob/develop/REFERENCE.md#queue) and then the semantics will be "at most once".
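As a rough sketch of that setting (the queue name is illustrative; maxStalledCount lives in the advanced settings object described in the reference linked above):

const Queue = require('bull');

// With maxStalledCount set to 0, a job that loses its lock is failed
// rather than restarted, which gives "at most once" processing.
const queue = new Queue('taskqueue', {
  settings: { maxStalledCount: 0 }
});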
Having said that, I will try to answer the two questions asked by the poster:
What happens if one Node instance specifies a different concurrency value?
I will assume you mean "queue instance". If so, the concurrency is specified in the processor. If the concurrency is X, what happens is that at most X jobs will be processed concurrently by that given processor.
Can I be certain that jobs will not be processed by more than one Node instance?
Yes, as long as your job does not crash or your max stalled jobs setting is 0.
I spent a bunch of time digging into it as a result of facing a problem with too many processor threads.
The short story is that bull's concurrency is at a queue object level, not a queue level.
If you dig into the code, the concurrency setting is invoked at the point at which you call .process on your queue object. This means that, even within the same Node application, if you create multiple queue objects and call .process multiple times, they will add to the number of concurrent jobs that can be processed.
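For example, a rough sketch of what that implies (assuming both objects point at the same Redis and the same queue name):

const Queue = require('bull');

const queueA = new Queue('taskqueue');   // same underlying Redis-backed queue
const queueB = new Queue('taskqueue');

queueA.process(5, job => { /* ... */ }); // up to 5 jobs concurrently here
queueB.process(5, job => { /* ... */ }); // plus up to 5 more here

// Within this single Node application, up to 10 jobs from this queue can now
// be in flight at once, because concurrency is tracked per process() call.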
One contributor posted the following:
Yes, it was a little surprising for me too when I used Bull the first time. Queue options are never persisted in Redis. You can have as many Queue instances per application as you want; each can have different settings. The concurrency setting is set when you're registering a processor; it is in fact specific to each process() function call, not the Queue. If you use named processors, you can call process() multiple times. Each call will register N event loop handlers (with Node's process.nextTick()), by the amount of concurrency (default is 1).
So the answer to your question is: yes, your jobs WILL be processed by multiple node instances if you register process handlers in multiple node instances.
Ah, welcome! This is a meta answer and probably not what you were hoping for, but here is a general process for solving this:
Read the documentation ultra carefully to identify which guarantees your solution aims to provide:
You can specify a concurrency argument. Bull will then call your handler in parallel respecting this maximum value.
I personally don't really understand this or the guarantees that bull provides. Since it's not super clear:
Dive into source to better understand what is actually happening. I usually just trace the path to understand:
https://github.com/OptimalBits/bull/blob/f05e67724cc2e3845ed929e72fcf7fb6a0f92626/lib/queue.js#L629
https://github.com/OptimalBits/bull/blob/f05e67724cc2e3845ed929e72fcf7fb6a0f92626/lib/queue.js#L651
https://github.com/OptimalBits/bull/blob/f05e67724cc2e3845ed929e72fcf7fb6a0f92626/lib/queue.js#L658
... more this is pretty big :p
If the implementation and guarantees offered are still not clear, then create test cases to try to invalidate your assumptions. It sounds like:
Initialize processors for the same queue with 2 different concurrency values
Create a queue and two workers, set a concurrency level of 1 on each, give each a callback that logs a message and then sleeps, enqueue 2 jobs, and observe whether both are processed concurrently or whether processing is limited to 1 at a time (a rough sketch follows)
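Something along these lines (assumes a local Redis on the default port; names and timings are made up):

const Queue = require('bull');

const workerA = new Queue('concurrency-test');
const workerB = new Queue('concurrency-test');

// Each "worker" registers a processor with concurrency 1 that takes ~5 seconds.
for (const [name, q] of [['A', workerA], ['B', workerB]]) {
  q.process(1, async job => {
    console.log(`${name} started job ${job.id} at ${new Date().toISOString()}`);
    await new Promise(resolve => setTimeout(resolve, 5000));
    console.log(`${name} finished job ${job.id}`);
  });
}

// Enqueue two jobs and watch the timestamps: if they overlap, both queue
// objects are pulling jobs; if not, processing is limited to one at a time.
workerA.add({ n: 1 });
workerA.add({ n: 2 });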
IMO the biggest thing is:
Can I be certain that jobs will not be processed by more than one Node instance?
If exclusive message processing is an invariant and violating it would result in incorrectness for your application, then even with great documentation I would highly recommend performing due diligence on the library :p
Looking into it more, I think Bull doesn't handle being distributed across multiple Node instances at all, so the behavior is at best undefined.
I am writing a multi-threaded program. The main thread is constantly receiving network data, and the amount of data is relatively large, so sub-threads are used to process the data.
The received data is a 100-byte packet. Each time I receive a packet, I create a 100-byte SharedArrayBuffer and send it to the child thread via postMessage(). But the main thread receives the data very fast, so it has to call postMessage very frequently to notify the sub-thread, which leads to high CPU usage and affects the response speed of the main thread.
So I was thinking: if the SharedArrayBuffer could grow dynamically, the received data could simply be appended at its end, and I would only need to notify the child thread once for it to be able to access the data.
I would like to ask how to dynamically increase the length of SharedArrayBuffer. I have tried to implement it in a chained way, storing a SharedArrayBuffer object in another SharedArrayBuffer object, but the browser does not allow this.
I would like to ask how to dynamically increase the length of SharedArrayBuffer.
From MDN web docs (emphasis mine).
"The SharedArrayBuffer object is used to represent a generic, fixed-length raw binary data buffer, similar to the ArrayBuffer object, but in a way that they can be used to create views on shared memory."
Fixed-length means you can't resize it...so it's not surprising it doesn't have a resize() method.
(Note: One thing that does cross my mind though is I believe there is a very new ability for SharedArrayBuffer to be used in WebAssembly as "linear memory" which has a grow_memory operator. I would imagine taking advantage of this would be very difficult, if it is possible at all, and likely would not be supported in many browsers if it was.)
I have tried to implement it in a chained way, storing a SharedArrayBuffer object in another SharedArrayBuffer object, but the browser does not allow this.
Nope. You can only write numbers.
It might seem that you could use a number to index into a table of SharedArrayBuffers, and link them that way. But then you have to worry about how to share that table between threads--same problem.
So no matter what you do, whichever thread makes the decision to update the shared buffering structure will need to notify the others of the update somehow. For that notification to be able to transfer SharedArrayBuffers, it will have to use postMessage to do it.
Have you considered experimenting with allocating a larger SharedArrayBuffer to start with, and treating it like a circular buffer, so that one thread reads out of what the other thread is writing, in a "producer/consumer" pattern?
If you insist on implementing resizes, you might consider having some portion of the buffer hold an indicator that it is "stale" and a new one must be requested from the thread that resized it. You'll have to control that with synchronization. If you make a small sample that does this, it would probably make a good technical article...and if you have trouble with the small sample, it would be a good basis for further questions here.
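As a very rough sketch of the circular-buffer idea (single producer, single consumer; the slot sizes, names, and lack of overrun checking are all illustrative assumptions):

const SLOT_SIZE = 100;              // one packet per slot
const SLOT_COUNT = 1024;
const HEADER_BYTES = 2 * Int32Array.BYTES_PER_ELEMENT; // [writeIndex, readIndex]

const sab = new SharedArrayBuffer(HEADER_BYTES + SLOT_SIZE * SLOT_COUNT);
const indices = new Int32Array(sab, 0, 2);
const slots = new Uint8Array(sab, HEADER_BYTES);

// Main thread (producer): copy the packet into the next slot, bump writeIndex.
function push(packet /* Uint8Array of length SLOT_SIZE */) {
  const w = Atomics.load(indices, 0);
  slots.set(packet, (w % SLOT_COUNT) * SLOT_SIZE);
  Atomics.store(indices, 0, w + 1);
  Atomics.notify(indices, 0);       // wake the worker if it is waiting
}

// Worker thread (consumer): sleep until writeIndex moves, then copy the slot out.
function pop() {
  const r = Atomics.load(indices, 1);
  while (Atomics.load(indices, 0) === r) {
    Atomics.wait(indices, 0, r);    // allowed in workers, not on the main thread
  }
  const offset = (r % SLOT_COUNT) * SLOT_SIZE;
  const packet = slots.slice(offset, offset + SLOT_SIZE);
  Atomics.store(indices, 1, r + 1);
  return packet;
}

Note this sketch does not guard against the producer lapping the consumer, or against the 32-bit indices wrapping; a real implementation would need both, which is exactly the kind of synchronization bookkeeping mentioned above.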
There is no way to resize, only copy via a typed array.
But no RAM is actually allocated until it is actually used. Under Node.js (v14.14.0) you can see how the RAM usage gradually increases as the buffer is filled, or how it is used basically instantly if array.fill is used.
const sharedBuffer = new SharedArrayBuffer(512 * 1024 * 1024)
const array = new Uint8Array(sharedBuffer)
// array.fill(1) // Causes ram to be allocated right away
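One rough way to observe this (purely illustrative):

console.log(process.memoryUsage().rss);  // resident set size before touching the buffer
array.fill(1);                           // touch every page, committing all 512 MiB
console.log(process.memoryUsage().rss);  // roughly 512 MiB larger now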
We are trying to pre-cache a large amount of data into IndexedDB when our web application loads. From my performance testing, the speed is decent in a desktop browser (e.g. Internet Explorer), where I can insert 10,000 records in around 2 seconds. But comparing the exact same functionality on the iPad, it drops to 30 seconds. That comparison just blew my mind.
Does anyone know of any hints or tricks for inserting large data sets into IndexedDB? I don't know if it is possible at all, but could we build up a copy of an IndexedDB database server side with all the data prepopulated, then just ship it over to the client to store in the browser? Is anything along these lines doable?
Thanks
I had problems with massive bulk inserts (100,000 - 200,000 records). I've solved all my IndexedDB performance problems using the Dexie library. It has this important feature:
Dexie has a kick-ass performance. Its bulk methods take advantage of a not well known feature in indexedDB that makes it possible to store stuff without listening to every onsuccess event. This speeds up the performance to a maximum.
Dexie: https://github.com/dfahlander/Dexie.js
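For instance, a minimal sketch of a bulk insert with Dexie (the database, table, and key names are made up; it assumes Dexie is available via a script tag or your bundler):

const db = new Dexie('cacheDB');
db.version(1).stores({ records: 'id' });   // 'id' becomes the primary key

// bulkPut writes the whole array in one transaction without a per-record
// success callback, which is where the speed-up comes from.
db.records.bulkPut(largeArrayOfObjects)
  .then(() => console.log('bulk insert done'))
  .catch(err => console.error(err));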
Some pretty bad IndexedDB performance problems can be caused by a prolonged period of the browser just calling onsuccess callbacks after the actual work is done, running into event loop overhead. The performance pattern I observed in my app, which was doing exactly this, was that it did a bunch of work and then spent a long time answering thousands of callbacks very inefficiently.
The tail end of that profile is the browser working through the callbacks on every request. The solution, of course, is to not put a callback on every request, but it was previously unclear to me how to do this.
The way that Dexie.js accomplishes this (for details, see src/dbcore/dbcore-indexeddb.ts) is that it saves the last request (e.g. IDBObjectStore.put, etc) sent and sets an onsuccess callback on that one, which then collects the results from the rest of the requests. Thus, it avoids the callback hell.
Another approach is to use the IDBTransaction.oncomplete event, and not worry about the callbacks on the individual requests at all.
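A rough sketch of that with the raw IndexedDB API (the db, store, and record names are illustrative):

const tx = db.transaction('records', 'readwrite');
const store = tx.objectStore('records');

for (const record of records) {
  store.put(record);              // no onsuccess handler on each request
}

tx.oncomplete = () => console.log('bulk insert finished');
tx.onerror = event => console.error(event.target.error);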
(note: yes, I know how old this question is, I had this problem today and wanted to put something more useful for this question which is high in Google results)
How is your data stored in IndexedDB? Is everything in a single object store, or do you use multiple object stores? Do you need all the cached data immediately?
If you only have a single object store, you can start by storing all the data you initially need, commit that transaction, and start a new one for the rest. This way you can start retrieving the initial data while inserting the rest. IndexedDB is async, so it shouldn't block you.
If you have multiple object stores, you can use the same strategy. First fill up the object store you need immediately and delay the others.
Or maybe consider using the AppCache API instead of the IndexedDB API. With it you can just cache a JavaScript file containing all the JSON objects you want to cache. This is more for the case where you don't need a lot of querying on the data.
I have some tasks I want to do in JS that are resource intensive. For this question, let's assume they are some heavy calculations, rather than system access. Now I want to run tasks A, B and C at the same time, and execute some function D when this is done.
The async library provides a nice scaffolding for this:
async.parallel([A, B, C], D);
If what I am doing is just calculations, then this will still run synchronously (unless the library is putting the tasks on different threads itself, which I expect is not the case). How do I make this actually parallel? What is the thing typically done by async code to not block the caller (when working with NodeJS)? Is it starting a child process?
2022 notice: this answer predates the introduction of worker threads in Node.js
How do I make this actually parallel?
First, you won't really be running in parallel while in a single node application. A node application runs on a single thread and only one event at a time is processed by node's event loop. Even when running on a multi-core box you won't get parallelism of processing within a node application.
That said, you can get processing parallelism on a multicore machine by forking the code into separate node processes or by spawning child processes. This, in effect, allows you to create multiple instances of node itself and to communicate with those processes in different ways (e.g. stdout, the process fork IPC mechanism). Additionally, you could choose to separate the functions (by responsibility) into their own node app/server and call them via RPC.
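For instance, a rough sketch using the fork IPC mechanism (worker.js, heavyComputation, and someInput are hypothetical names):

// parent.js
const { fork } = require('child_process');

const child = fork('./worker.js');              // separate Node process with an IPC channel
child.send({ task: 'A', payload: someInput });  // someInput is whatever data the task needs
child.on('message', result => {
  console.log('worker finished:', result);
});

// worker.js
process.on('message', msg => {
  const result = heavyComputation(msg.payload); // CPU-bound work, off the main process
  process.send(result);
});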
What is the thing done typically by async code to not block the caller (when working with NodeJS)? Is it starting a child process?
It is not starting a new process. Underneath, when async.parallel is used in node.js, it is using process.nextTick(). And nextTick() allows you to avoid blocking the caller by deferring work onto a new stack so you can interleave CPU-intensive tasks, etc.
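A rough illustration of that interleaving idea, chunking a long computation so the event loop can breathe in between (doWork and items are stand-ins; setImmediate is used here rather than nextTick so pending I/O gets a turn):

function processChunked(items, index = 0) {
  const end = Math.min(index + 1000, items.length);
  for (let i = index; i < end; i++) {
    doWork(items[i]);                                // hypothetical per-item computation
  }
  if (end < items.length) {
    setImmediate(() => processChunked(items, end));  // defer the rest to a later turn of the loop
  }
}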
Long story short
Node doesn't make it easy "out of the box" to achieve multiprocessor concurrency. Node instead gives you a non-blocking design and an event loop that uses a single thread without sharing memory. With a single thread there is no memory shared between threads, therefore locks aren't needed. Node is lock free. One node process uses one thread, and this makes node both safe and powerful.
When you need to split work up among multiple processes then use some sort of message passing to communicate with the other processes / servers. e.g. IPC/RPC.
For more see:
Awesome answer from SO on What is Node.js... with tons of goodness.
Understanding process.nextTick()
Asynchronous and parallel are not the same thing. Asynchronous means that you don't have to wait for synchronization. Parallel means that you can be doing multiple things at the same time. Node.js is only asynchronous, but it's only ever one thread. It can only work on one thing at once. If you have a long-running computation, you should start another process and then just have your node.js process asynchronously wait for results.
To do this you could use child_process.spawn and then read the data it writes to stdout.
http://nodejs.org/api/child_process.html#child_process_child_process_spawn_command_args_options
var spawn = require('child_process').spawn;
var process2 = spawn('sh', ['./computationProgram', 'parameter']);

process2.stderr.on('data', function (data) {
    // handle error output
});

process2.stdout.on('data', function (data) {
    // handle data results
});
Keep in mind I/O is parallelized by Node.js; only your JavaScript callbacks are single threaded.
Assuming you are writing a server, an alternative to adding the complexity of spawning processes or forking is to simply build stateless node servers and run an instance per core, or better yet run many instances each in their own virtualized micro server. Coordinate incoming requests using a reverse proxy or load balancer.
You could also offload computation to another server, maybe MongoDB (using MapReduce) or Hadoop.
To be truly hardcore, you could write a Node plugin in C++ and have fine-grained control of parallelizing the computation code. The speed-up from C++ might negate the need for parallelization anyway.
You can always write code to perform computationally intensive tasks in another language best suited for numeric computation, and e.g. expose them through a REST API.
Finally, you could perhaps run the code on the GPU using node-cuda or something similar depending on the type of computation (not all can be optimized for GPU).
Yes, you can fork and spawn other processes, but it seems to me one of the major advantages of node is not having to worry much about parallelization and threading, and thereby bypassing a great amount of complexity altogether.
Depending on your use case you can use something like
task.js: a simplified interface for getting CPU-intensive code to run on all cores (node.js, and web)
An example would be:
function blocking (exampleArgument) {
  // block thread
}

// `task` comes from the task.js library linked above
// turn blocking pure function into a worker task
const blockingAsync = task.wrap(blocking);

// run task on an autoscaling worker pool
blockingAsync('exampleArgumentValue').then(result => {
  // do something with result
});
Just recently came across parallel.js, but it actually seems to use multiple cores and also has map-reduce-type features.
http://adambom.github.io/parallel.js/