Parallelizing tasks in Node.js - javascript

I have some tasks I want to do in JS that are resource intensive. For this question, let's assume they are some heavy calculations rather than system access. Now I want to run tasks A, B and C at the same time, and execute some function D when they are all done.
The async library provides a nice scaffolding for this:
async.parallel([A, B, C], D);
If what I am doing is just calculations, then this will still run synchronously (unless the library is putting the tasks on different threads itself, which I expect is not the case). How do I make this actually parallel? What is the thing typically done by async code to avoid blocking the caller (when working with NodeJS)? Is it starting a child process?

2022 notice: this answer predates the introduction of worker threads in Node.js
How do I make this be actually parallel?
First, you won't really be running in parallel while in a single node application. A node application runs on a single thread and only one event at a time is processed by node's event loop. Even when running on a multi-core box you won't get parallelism of processing within a node application.
That said, you can get processing parallelism on a multicore machine by forking the code into separate node processes or by spawning child processes. In effect, this allows you to create multiple instances of node itself and to communicate with those processes in different ways (e.g. stdout, the process fork IPC mechanism). Additionally, you could separate the functions (by responsibility) into their own node app/server and call them via RPC.
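For illustration, a rough sketch (not from the original answer) of the fork/IPC approach; the file name ./heavyTask.js and the doHeavyCalculation function are hypothetical:

// parent.js
const { fork } = require('child_process');

const child = fork('./heavyTask.js');      // a second node process
child.send({ input: 42 });                 // hand the work off over the IPC channel
child.on('message', (result) => {
  console.log('result from child process:', result);
});

// heavyTask.js (the child) would look roughly like:
// process.on('message', ({ input }) => {
//   const result = doHeavyCalculation(input); // hypothetical CPU-bound function
//   process.send(result);
// });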
What is the thing done typically by async code to not block the caller (when working with NodeJS)? Is it starting a child process?
It is not starting a new process. Underneath, async.parallel in node.js uses process.nextTick(), and nextTick() lets you avoid blocking the caller by deferring work onto a new stack so you can interleave CPU-intensive tasks, etc.
Long story short
Node doesn't make it easy "out of the box" to achieve multiprocessor concurrency. Instead, Node gives you a non-blocking design and an event loop that leverages a single thread without sharing memory. Because only one thread runs your JavaScript, there is no shared memory to protect, so locks aren't needed; Node is lock free. One node process leverages one thread, and this makes node both safe and powerful.
When you need to split work up among multiple processes, use some sort of message passing to communicate with the other processes/servers, e.g. IPC/RPC.
For more see:
Awesome answer from SO on What is Node.js... with tons of goodness.
Understanding process.nextTick()

Asynchronous and parallel are not the same thing. Asynchronous means that you don't have to wait for synchronization. Parallel means that you can be doing multiple things at the same time. Node.js is only asynchronous, but it's only ever 1 thread. It can only work on 1 thing at once. If you have a long-running computation, you should start another process and then just have your node.js process asynchronously wait for results.
To do this you could use child_process.spawn and then read the data from its stdout.
http://nodejs.org/api/child_process.html#child_process_child_process_spawn_command_args_options
var spawn = require('child_process').spawn;

// run the external program that does the heavy calculation
var process2 = spawn('sh', ['./computationProgram', 'parameter']);

process2.stderr.on('data', function (data) {
    // handle error output
});

process2.stdout.on('data', function (data) {
    // handle the result data
});

Keep in mind I/O is parallelized by Node.js; only your JavaScript callbacks are single threaded.
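For example (a small sketch; the file names are hypothetical), the three reads below are all in flight at the same time because libuv performs the disk I/O off the main thread, while the continuation still runs on the single JS thread:

const fs = require('fs/promises');

Promise.all([
  fs.readFile('a.txt', 'utf8'),
  fs.readFile('b.txt', 'utf8'),
  fs.readFile('c.txt', 'utf8'),
]).then(([a, b, c]) => {
  // this callback runs on the one JS thread, after all three reads complete
  console.log(a.length, b.length, c.length);
});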
Assuming you are writing a server, an alternative to adding the complexity of spawning processes or forking is to simply build stateless node servers and run an instance per core, or better yet run many instances each in their own virtualized micro server. Coordinate incoming requests using a reverse proxy or load balancer.
You could also offload computation to another server, maybe MongoDB (using MapReduce) or Hadoop.
To be truly hardcore, you could write a Node plugin in C++ and have fine-grained control over parallelizing the computation code. The speed-up from C++ might negate the need for parallelization anyway.
You can always write code to perform computationally intensive tasks in another language best suited for numeric computation, and e.g. expose them through a REST API.
Finally, you could perhaps run the code on the GPU using node-cuda or something similar depending on the type of computation (not all can be optimized for GPU).
Yes, you can fork and spawn other processes, but it seems to me that one of the major advantages of node is not having to worry much about parallelization and threading, and therefore bypassing a great amount of complexity altogether.

Depending on your use case you can use something like
task.js - a simplified interface for getting CPU-intensive code to run on all cores (node.js and web)
An example would be:
// assumes task.js is available, e.g. const task = require("task.js");
function blocking (exampleArgument) {
    // block thread
}

// turn the blocking pure function into a worker task
const blockingAsync = task.wrap(blocking);

// run the task on an autoscaling worker pool
blockingAsync('exampleArgumentValue').then(result => {
    // do something with the result
});

Just recently came across parallel.js, but it seems to actually use multiple cores and also has map-reduce-type features.
http://adambom.github.io/parallel.js/

Related

node.js handling blocking IO operation

I want to understand the internal working of node.js, so I am intentionally including a computation task (a for loop). But I see it is still blocking the main thread.
Here is my script
console.log("start");
for (let i = 0; i < 10; i++) {
console.log(i)
}
console.log("end")
And the output is:
start
0
1
2
....
9
end
But according to the node.js architecture, shouldn't high-computation tasks be executed by a different thread picked from the thread pool while the event loop continues executing non-blocking tasks?
I am referencing the node.js internal architecture using this link.
Can someone please explain the architecture and behavior of the script?
By default, nodejs uses only ONE thread to run your Javascript with. That means that (unless you engage WorkerThreads which are essentially an entirely separate VM), only one piece of Javascript is ever running at once. Nodejs does not "detect" some long running piece of Javascript and move it to another thread. It has no features like that at all. If you have some long running piece of synchronous Javascript, it will block the event loop and block all other Javascript and all other event processing.
Internal to its implementation, nodejs has a thread pool that it uses for certain types of native code (the internal implementations of file I/O and crypto operations). That pool only supports asynchronous implementations of file I/O and crypto operations; it does not parallelize the running of Javascript.
So, the script you show:
console.log("start");
for (let i = 0; i < 10; i++) {
console.log(i)
}
console.log("end")
is entirely synchronous, runs sequentially, and blocks all other Javascript from running while it runs, because it occupies the one thread used for running Javascript.
Nodejs gets its excellent scalability from its asynchronous I/O model that does not have to use a separate thread in order to have lots of asynchronous operations in flight at the same time. But, keep in mind that these asynchronous I/O operations all have native code behind them (some of which may use threads in their native code implementations).
But, if you have long running synchronous Javascript operations (say, something like image analysis written in Javascript), then those typically need to be moved out of the main event loop thread, either by shunting them off to WorkerThreads, to other processes, or to a native code implementation that may use OS threads.
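For reference, a minimal single-file sketch of the WorkerThreads approach (the loop bound is just an illustrative stand-in for a heavy computation):

const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  // main thread: spawn a worker and keep the event loop free
  const worker = new Worker(__filename, { workerData: 1e9 });
  worker.on('message', (sum) => console.log('sum from worker:', sum));
  worker.on('error', (err) => console.error(err));
  console.log('main thread is not blocked while the worker computes');
} else {
  // worker thread: this loop blocks only this thread, not the main one
  let sum = 0;
  for (let i = 0; i < workerData; i++) sum += i;
  parentPort.postMessage(sum);
}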
But according to the node.js architecture, shouldn't high-computation tasks be executed by a different thread picked from the thread pool while the event loop continues executing non-blocking tasks?
No, that is not how nodejs works and is not a correct interpretation of the diagram you show. The thread pool is NOT used for running your Javascript. It is used for internal implementation of some APIs such as file I/O and some crypto operations. It is not used for running your Javascript. There is just one main thread for running your Javascript (unless you specifically run your code in a WorkerThread).
I want to understand the internal working of node.js, so I am intentionally including a computation task (a for loop). But I see it is still blocking the main thread.
Yes, a for loop (that does not contain an await statement that is awaiting a promise) will completely occupy the single Javascript thread and will block the event loop from processing other events while the for loop is running.
JS executes its code synchronously. There are a few things that appear "asynchronous", like setInterval or setTimeout, for example. But that does not mean they run in parallel. Take setTimeout: calling it adds the callback to the task queue; later the event loop grabs it from the queue, puts it onto the stack, and executes it synchronously. If you want to execute something truly in parallel, you should consider using a worker thread.
There are absolutely no threads in JS (unless you explicitly use worker threads). Javascript uses cooperative multi-tasking which means that a function will always complete before the next one will start. The only other way to yield control back to the scheduler is to separate a task out into another function that is called asynchronously. So in your example, e.g., you could do:
console.log("start");
setTimeout(() => {
for (let i = 0; i < 10; i++) {
console.log(i)
}}, 0);
console.log("end")
and you would get:
start
end
0
1
..
9
This also answers your question about heavy computations: unless you use the relatively new worker threads, you cannot run heavy computations in node.js "in the background" without the use of native code.
So if you really have heavy loads you have three options:
worker threads,
native code that is multi-threaded, e.g., written in C/C++, or
breaking your computation down into small pieces, each one yielding control back to the scheduler when done (e.g., a map/reduce-style split); a sketch of this approach follows below.
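A minimal sketch of the third option: split a long computation into chunks and yield to the event loop between chunks with setImmediate (the total and chunk size below are illustrative values):

function sumInChunks(total, chunkSize, done) {
  let i = 0;
  let sum = 0;
  function runChunk() {
    const end = Math.min(i + chunkSize, total);
    for (; i < end; i++) sum += i;   // do a bounded amount of work
    if (i < total) {
      setImmediate(runChunk);        // let pending I/O and timers run first
    } else {
      done(sum);
    }
  }
  runChunk();
}

sumInChunks(1e8, 1e5, (sum) => console.log('done:', sum));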

Why can't NodeJs process CPU-intensive tasks even though it runs async?

What I don't understand is this: nodejs is asynchronous, so I can run multiple tasks no matter how long they take, because the parser continues on to the other tasks and, when a long task is completed, it lets me know with a callback function. So where is the blocking for CPU-intensive tasks? Even if a task takes 10 seconds, the other lines of JS code will continue to execute and start other tasks. Since NodeJs uses only 4 threads for heavy tasks, I understand that all of these threads may become busy, and that is the scenario where heavy CPU tasks should not be used with nodejs. Am I right?
var listener = readAsync("I/O heavy calculation", function () {
    console.log("I run after the I/O is done.");
});
// The parser will send the request and pass to the next line of code
console.log("I run before I/O request is done");
I expect the globally declared console.log to run before the callback function.
Nodejs programs are single threaded by design. It prioritizes what is most important inside of the event loop. There are ways to optimize your performance at scale. For example, there is a cluster module built into node js that assigns incoming connections to the same node app across different worker processes.
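A minimal sketch of the cluster approach (the port number is just an illustrative choice): the primary process forks one worker per CPU core and each worker runs the same server.

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {                  // `isPrimary` in newer Node versions
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();                      // one worker per core
  }
} else {
  http.createServer((req, res) => {
    res.end(`handled by worker ${process.pid}`);
  }).listen(3000);
}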
https://nodejs.org/en/blog/release/v10.5.0/ - from node 10.5 there is multi-threading support (worker threads), but it is experimental.

Is there a limit to how many promises can or should run concurrently?

Surprisingly, Google had trouble returning a result for this question.
I'm wondering how many promises can or should be run in parallel before queuing them and waiting for the next one to finish. I guess it might depend on the user's internet connection, but I figured it was worth asking.
If it's based on the user's ISP/connection type, is there a way to test for the ideal number of promises to send before starting a queue?
Also, I'm talking strictly from the client side. So, single thread js.
Example code:
function uploadToServer(requestData){
    return new Promise((...));
}

function sendRequests(requestArray){
    var count = 0;
    for(var requestData of requestArray){
        if(count < idealAmount){
            uploadToServer(requestData).then(() => count--);
            count++;
        }else{
            // Logic to wait before attempting to fire event
        }
    }
}
Promises themselves have no particular coded limits. They are just a notification system and you could have millions of them just fine (as long as you had enough memory to hold those Javascript objects).
Now, if a promise represents an underlying asynchronous operation (which they usually do), there could very well be some limits to how many of that specific type of asynchronous operation can be in flight at the same time. For example, at some point you might run into limits of how many requests a single host would accept from you at the same time. Or, you might run into local resources issues with zillions of connections somewhere.
For things like node.js disk I/O operations, the underlying disk I/O sub-system already has a queuing system so that only a small number of operations are actually running at once and the rest are queued.
So, to answer a question about how many concurrent operations you can have, it can only be analyzed and answered in the context of a specific type of asynchronous request and sometimes even a specific type of receiving host.
If you know you're processing a large or potentially large array of requests and you'll be sending a network request for every item in the array, then it is common to code a limit yourself to avoid overwhelming either local resources or the target host resources. This is usually not done with a queue, but rather code that just launches N requests and then as one finishes, it launches the next one and so on. Both the Bluebird and Async libraries have methods for managing this for you. In Bluebird, it's the concurrency option for Promise.map(). I've also hand-coded loops that manage the number of concurrent connections several times myself and here are links to some of that code:
Promise.all consumes all my RAM
Javascript - how to control how many promises access network in parallel
Make several requests to an API that can only handle 20 request a minute
Loop through an api get request with variable URL
Choose proper async method for batch processing for max requests/sec
Nodejs: Async request with a list of URL
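To illustrate the hand-rolled pattern described above (launch N requests, then start the next as each one finishes), here is a rough sketch; uploadToServer and requestArray are the hypothetical names from the question:

function sendRequests(requestArray, limit) {
  let index = 0;
  let active = 0;
  const results = [];
  return new Promise((resolve, reject) => {
    function launchNext() {
      if (index === requestArray.length && active === 0) return resolve(results);
      // keep at most `limit` uploads in flight at once
      while (active < limit && index < requestArray.length) {
        const i = index++;
        active++;
        uploadToServer(requestArray[i])
          .then((res) => { results[i] = res; active--; launchNext(); })
          .catch(reject);
      }
    }
    launchNext();
  });
}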
As #jfried00 mentioned, there can't be any limit on the number of promises running, as there's no such thing as "running" a Promise. Once you run an async function or code like new Promise(res => something(res)), the method runs.
What you can do is limit the number of promise chains being resolved:
// ten promises ago:
let oldPromise = doSomethingAsync();

// and now:
oldPromise.then(() => doSomethingNewAsync());
But actually coding this on your own is gonna dye your hair grey rather quickly as my example has shown - error handling, finding the empty slots and keeping the flow in the right order will be hard.
That said it is possible and my framework, Scramjet, which I'll shamelessly plug here does what you need:
DataStream.from(requestArray)
    .setOptions({maxParallel: 4})
    .unorder(requestData => uploadToServer(requestData))
    .run();
Scramjet will keep 4 promises resolving but won't try to keep order (there are other methods for that) and you can use any function - if it doesn't return a promise, it will work the same as if it did. Here's some more text on unordered transforms in scramjet. You can also peek at the source code if you'd rather do that yourself...

Is moving database works to child processes a good idea in node.js?

I just started getting into child_process and all I know is that it's good for delegating blocking functions (e.g. looping a huge array) to child processes.
I use typeorm to communicate with the mysql database. I was wondering if there's a benefit to moving some of the asynchronous database work to child processes. I read in another thread (unfortunately I couldn't find it in my browser history) that there's no good reason to delegate async functions to child processes. Is that true?
example code:
child.js
import {createConnection} from "./dbConnection";
import {SomeTable} from "./entity/SomeTable";
process.on('message', (m) => {
    createConnection().then(async connection => {
        let repository = connection.getRepository(SomeTable);
        let results = await repository
            .createQueryBuilder("t")
            .orderBy("t.postId", "DESC")
            .getMany();
        process.send(results);
    });
});
main.js
const cp = require('child_process');
const child = cp.fork('./child.js');

child.send('Please fetch some data');
child.on('message', (m) => {
    console.log(m);
});
The big gain with Javascript is its asynchronous nature...
What happens when you call an asynchronous function is that the code continues to execute without waiting for the answer. Only when the function is done and an answer is given does it continue with that part.
Your database call is already asynchronous. So you would spawn another node process for completely nothing. Since your database takes all the heat, having more nodeJS processes wouldn't help on that part.
Take the same example but with a file write. What could make the write to the disk faster? Nothing much really... But do we care? Nope because our NodeJS is not blocked and keeps answering requests and handling tasks. The only thing that you might want to check is to not send a thousand file writes at the same time, if they are big there would be a negative impact on the file system, but since a write is not CPU intensive, node will run just fine.
Child processes really are a great tool, but it is rare to need them. I too wanted to use some when I heard about them, but the thing is that you will most likely not need them at all... The only time I decided to use them was to create a CPU-intensive worker. It would make sure to spawn one child process per core (since node is single threaded) and respawn any faulty ones.
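A rough sketch of that pattern: fork one CPU-bound worker per core and respawn any that die unexpectedly (./cpuWorker.js is a hypothetical script name):

const { fork } = require('child_process');
const os = require('os');

function startWorker() {
  const worker = fork('./cpuWorker.js');
  worker.on('exit', (code) => {
    if (code !== 0) {
      console.error(`worker ${worker.pid} died (code ${code}), respawning`);
      startWorker();
    }
  });
  return worker;
}

// one worker per core
const workers = os.cpus().map(() => startWorker());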

Are nodejs data-structures thread-safe by design?

I read in a node.js related web document that it is a single-threaded server. So it confuses me whether all data structures are thread-safe by default in a node server!
I have multiple call-backs accessing a global object like this :
function callback1() {
    global_var['key'] = val;
}

function callback2() {
    global_var['key'] = val;
}
'key' may be same at times and may be different as well. Will the global_var be thread-safe ?
The callbacks, as intended, get called back as and when something is done, in no particular order.
Node.JS contains a "dispatcher." It accepts web requests and hands them off for asynchronous processing. That dispatcher is single threaded. But the dispatcher spins up a new thread for each task, and quickly hands off the task to the new thread, freeing the dispatcher's thread for servicing a new request.
To the extent that those task threads are kept separate (i.e. they don't modify each other's state), yes, they are threadsafe.
All of the javascript you write for your node.js application executes as if it were running in a single thread.
Any multithreading occurs behind the scenes, in the I/O code and in other native modules. So there's no need to worry about the thread safety of any application code, regardless.
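A small sketch to illustrate: the two callbacks below both touch the same global object but can never run at the same instant, so no lock is needed; only their relative order is undetermined.

const global_var = {};

setTimeout(() => { global_var['key'] = 'set by callback 1'; }, 10);
setTimeout(() => { global_var['key'] = 'set by callback 2'; }, 10);

setTimeout(() => {
  console.log(global_var['key']);   // whichever callback ran last "wins"
}, 50);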
