What is the optimal way of sharing linear tasks between worker threads to improve performance?
Take the following example of a basic Deno web-server:
Main Thread
// Create an array of four worker threads
// (Array.fill would put the same Worker instance in every slot)
const workers = Array.from({ length: 4 }, () =>
  new Worker(new URL("./worker.ts", import.meta.url).href, {
    type: "module",
  })
);
for await (const req of server) {
// Pass this request to a worker thread
}
worker.ts
self.onmessage = async (req) => {
// Perform some linear task on the request and make a response
};
Would the optimal way of distributing tasks be something along the lines of this?
function* generator(): Generator<number> {
let i = 0;
while (true) {
i == 3 ? (i = 0) : i++;
yield i;
}
}
const gen = generator();
// Array.from creates four distinct workers (fill would reuse one instance)
const workers = Array.from({ length: 4 }, () =>
  new Worker(new URL("./worker.ts", import.meta.url).href, {
    type: "module",
  })
);
for await (const req of server) {
// Pass this request to a worker thread
workers[gen.next().value].postMessage(req);
}
Or is there a better way of doing this? Say, for example, using Atomics to determine which threads are free to accept another task.
When working with WorkerThread code like this, I found that the best way to distribute jobs was to have the WorkerThread ask the main thread for a job when the WorkerThread knew that it was done with the prior job. The main thread could then send it a new job in response to that message.
In the main thread, I maintained a queue of jobs and a queue of WorkerThreads waiting for a job. If the job queue was empty, then the WorkerThread queue would likely have some workerThreads in it waiting for a job. Then, any time a job is added to the job queue, the code checks to see if there's a workerThread waiting and, if so, removes it from the queue and sends it the next job.
Anytime a workerThread sends a message indicating it is ready for the next job, then we check the job queue. If there's a job there, it is removed and sent to that worker. If not, the worker is added to the WorkerThread queue.
This whole bit of logic was very clean, did not need atomics or shared memory (because everything was gated through the event loop of the main process) and wasn't very much code.
I arrived at this mechanism after trying several other approaches that each had their own problems. In one, I had concurrency issues; in another, I was starving the event loop; in a third, I didn't have proper flow control to the WorkerThreads, so I was overwhelming them and not distributing load equally.
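The bookkeeping described above can be sketched as follows. This is a minimal sketch, not the answerer's actual code: the Dispatcher class, its method names, and the worker.run(job) hook are all hypothetical. With real workers, run would call postMessage, and the worker's "ready" message would trigger workerReady.

```javascript
// Minimal sketch of a pull-based dispatcher (all names are hypothetical).
class Dispatcher {
  constructor() {
    this.jobs = [];        // jobs waiting for a free worker
    this.idleWorkers = []; // workers waiting for a job
  }

  // Called whenever new work arrives (e.g. an incoming request).
  addJob(job) {
    const worker = this.idleWorkers.shift();
    if (worker) {
      worker.run(job);     // hand it straight to a waiting worker
    } else {
      this.jobs.push(job); // every worker is busy, so queue the job
    }
  }

  // Called when a worker reports that it is ready for its next job.
  workerReady(worker) {
    if (this.jobs.length > 0) {
      worker.run(this.jobs.shift());
    } else {
      this.idleWorkers.push(worker);
    }
  }
}

// Stand-in for a real worker: it only records the jobs it was given.
function makeFakeWorker(name) {
  return { name, received: [], run(job) { this.received.push(job); } };
}
```

Because both methods run on the main thread's event loop, the two queues are never touched concurrently, which is why no atomics or shared memory are needed.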
There are some abstractions in Deno that handle this kind of need very easily, in particular the pooledMap functionality.
So you have a server which is an async iterator, and you would like to leverage threading to generate responses, since each response depends on a time-consuming, heavy computation, right?
Simple.
import { serve } from "https://deno.land/std/http/server.ts";
import { pooledMap } from "https://deno.land/std@0.173.0/async/pool.ts";
const server = serve({ port: 8000 }),
ress = pooledMap( window.navigator.hardwareConcurrency - 1
, server
, req => new Promise(v => v(respondWith(req)))
);
for await (const res of ress) {
// respond with res
}
That's it. In this particular case the respondWith function carries out the heavy calculation that prepares your response object. If it is already an async function, you don't even need to wrap it in a promise. Here I have set the pool limit to one less than the number of available hardware threads, but it's up to you to decide how many to use.
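To see what a pool limit does, here is a sketch of the general technique (bounded concurrency over promises) in plain JavaScript. It is not pooledMap's implementation, and mapWithLimit is a hypothetical name. Note that it only limits how many asynchronous calls are in flight at once; it does not create threads by itself.

```javascript
// Hypothetical helper: run `fn` over `items` with at most `limit` calls in flight.
async function mapWithLimit(limit, items, fn) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has picked up yet

  // Each runner keeps pulling the next unclaimed item until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index]);
    }
  }

  // Start up to `limit` runners and wait for all of them to drain the list.
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

Results are stored by index, so the output order matches the input order even when later items finish first.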
Express.js serving a Remix app. The server-side code sets several timers at startup that do various background jobs every so often, one of which checks if a remote Jenkins build is finished. If so, it copies several large PDFs from one network path to another network path (both on GSA).
One function creates an array of chained glob+copyFile promises:
import { copyFile } from 'node:fs/promises';
import { promisify } from "util";
import glob from "glob";
...
async function getFiles() {
let result: Promise<void>[] = [];
let globPromise = promisify(glob);
for (let wildcard of wildcards) { // lots of file wildcards here
result.push(globPromise(wildcard).then(
(files: string[]) => {
if (files.length < 1) {
// do error stuff
} else {
for (let srcFile of files) {
let tgtFile = tgtDir + basename(srcFile);
return copyFile(srcFile, tgtFile);
}
}
},
(reason: any) => {
// do error stuff
}));
}
return result;
}
Another async function gets that array and does Promise.allSettled on it:
copyPromises = await getFiles();
console.log("CALLING ALLSETTLED.THEN()...");
return Promise.allSettled(copyPromises).then(
(results) => {
console.log("ALLSETTLED COMPLETE...");
Between the "CALLING" and "COMPLETE" messages, which can take on the order of several minutes, the server no longer responds to browser requests, which time out.
However, during this time my other active backend timers can still be seen running and completing just fine in the server console log (I made one run every 5 seconds for test purposes, and it runs quite smoothly over and over while those file copies are crawling along).
So it's not blocking the server as a whole, it's seemingly just preventing browser requests from being handled. And once the "COMPLETE" message pops up in the log, browser requests are served up normally again.
The Express startup script basically just does this for Remix:
const { createRequestHandler } = require("@remix-run/express");
...
app.all(
"*",
createRequestHandler({
build: require(BUILD_DIR),
mode: process.env.NODE_ENV,
})
);
What's going on here, and how do I solve this?
It's apparent no further discussion is forthcoming, and I've not determined why the async I/O functions are preventing server responses, so I'll go ahead and post an answer that was basically Konrad Linkowski's workaround solution from the comments: to use the OS to do the copies instead of using copyFile(). It boils down to this in place of the glob+copyFile calls inside getFiles:
const util = require('node:util');
const exec = util.promisify(require('node:child_process').exec);
...
async function getFiles() {
...
result.push( exec("copy /y " + wildcard + " " + tgtDir) );
...
}
This does not exhibit any of the request-crippling behavior; for the entire time the copies are chugging away (many minutes), browser requests are handled instantly.
It's an OS-specific solution and thus non-portable as-is, but that's fine in our case since we will likely be using a Windows server for this app for many years to come. And certainly if needed, runtime OS-detection could be used to make the commands run on other OSes.
I guess that this is due to node's libuv using a threadpool with synchronous access for file system operations, and the pool size is only 4. See https://kariera.future-processing.pl/blog/on-problems-with-threads-in-node-js/ for a demonstration of the problem, or Nodejs - What is maximum thread that can run same time as thread pool size is four? for an explanation of how this is normally not a problem in network-heavy applications.
So if you have a filesystem-access-heavy application, try increasing the thread pool by setting the UV_THREADPOOL_SIZE environment variable.
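The variable has to be in place before the thread pool is first used, so setting it in the environment the process starts with is the straightforward way. A minimal sketch that launches the server from a small wrapper script; the helper names, the script path, and the pool size of 16 are assumptions for illustration, and the right size for a given workload is something to measure.

```javascript
const { spawn } = require("node:child_process");

// Build an environment with UV_THREADPOOL_SIZE set (libuv's default is 4).
function threadPoolEnv(size, baseEnv = process.env) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("thread pool size must be a positive integer");
  }
  return { ...baseEnv, UV_THREADPOOL_SIZE: String(size) };
}

// Start a Node script (hypothetical path) with the larger pool.
function launchWithThreadPool(scriptPath, size) {
  return spawn(process.execPath, [scriptPath], {
    env: threadPoolEnv(size),
    stdio: "inherit",
  });
}

// Example (not executed here): launchWithThreadPool("./server.js", 16);
```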
I have a feature in which I have to receive items from different users (identified with some ID) and process them in relative order (so user A's item 1 is processed before user A's item 2, but no restriction on the order compared to user B's item N). In order to improve throughput, I'd like to process them asynchronously.
I've done this with multithreaded languages and concurrent queues before, leaving the thread running in a busy wait if there wasn't anything left in the queue. But I don't know how I would solve this with JavaScript's single thread. I was thinking about using Promises and an infinite loop inside, but I don't know if that's an anti-pattern.
As a restriction, my client is reticent to use external libraries that haven't been cleared by their audit team, so I'm trying to do this artisanally.
Thank you
You generally never busy-wait in JavaScript. Instead, many things that you can do (send a request to a web server, load an image (which is the same thing as sending a request to a web server), wait for user input) take a callback or return a promise (another form of callback).
Processing lists of asynchronous things is pretty easy with async/await:
function doThing(thingMsg) {
// this just represents some process that runs async
return new Promise((resolve) => {
const duration = Math.random() * 500 + 1500; // 1.5 to 2 seconds
setTimeout(() => {
resolve(thingMsg);
}, duration);
});
}
const userAToDoList = [
'mix flour',
'knead',
'bake',
'slice',
];
const userBToDoList = [
'chop',
'fry',
'braise',
'combine',
];
async function doListInOrder(list, userName) {
for (const item of list) {
const result = await doThing(item);
console.log(`${userName} did ${result}`);
}
console.log(`${userName} finished`);
}
doListInOrder(userAToDoList, 'userA');
doListInOrder(userBToDoList, 'userB');
Replace doThing with the things you want to do like "wait for a file to download", "wait for an image to download" or "wait for something from a worker" etc.
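When items arrive over time rather than as ready-made lists, the same idea can be kept by chaining each new item onto that user's previous promise. A minimal sketch with no external libraries, assuming a hypothetical PerUserQueue class and a caller-supplied handler(userId, item) that returns a promise:

```javascript
// Hypothetical per-user queue: items for one user run in arrival order,
// while different users proceed independently of each other.
class PerUserQueue {
  constructor(handler) {
    this.handler = handler; // async (userId, item) => result
    this.tails = new Map(); // userId -> promise for that user's latest item
  }

  push(userId, item) {
    const previous = this.tails.get(userId) ?? Promise.resolve();
    // Start only after the previous item settles; swallowing its error here
    // keeps one failed item from blocking the user's later items.
    const current = previous
      .catch(() => {})
      .then(() => this.handler(userId, item));
    this.tails.set(userId, current);

    // Forget the user once their chain has drained, so the map does not grow.
    const cleanup = () => {
      if (this.tails.get(userId) === current) this.tails.delete(userId);
    };
    current.then(cleanup, cleanup);

    return current; // resolves or rejects with this item's own outcome
  }
}
```

There is no loop waiting for work: each call to push schedules exactly one continuation, and nothing runs while the queues are empty.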
Threading-wise, what's the difference between web workers and functions declared as
async function xxx()
{
}
?
I am aware web workers are executed on separate threads, but what about async functions? Are such functions threaded in the same way as a function executed through setInterval is, or are they subject to yet another different kind of threading?
async functions are just syntactic sugar around Promises, and Promises are wrappers for callbacks.
// v await is just syntactic sugar
// v Promises are just wrappers
// v functions taking callbacks are actually the source for the asynchronous behavior
await new Promise(resolve => setTimeout(resolve));
Now a callback could be called back immediately by the code, e.g. if you .filter an array, or the engine could store the callback internally somewhere. Then, when a specific event occurs, it executes the callback. One could say that these are asynchronous callbacks, and those are usually the ones we wrap into Promises and await them.
To make sure that two callbacks do not run at the same time (which would make concurrent modifications possible, which causes a lot of trouble) whenever an event occurs the event does not get processed immediately, instead a Job (callback with arguments) gets placed into a Job Queue. Whenever the JavaScript Agent (= thread²) finishes execution of the current job, it looks into that queue for the next job to process¹.
Therefore one could say that an async function is just a way to express a continuous series of jobs.
async function getPage() {
// the first job starts fetching the webpage
const response = await fetch("https://stackoverflow.com"); // callback gets registered under the hood somewhere, somewhen an event gets triggered
// the second job starts reading the content (the page is HTML, so text() rather than json())
const result = await response.text(); // again, callback and event under the hood
// the third job logs the result
console.log(result);
}
// the same series of jobs can also be found here:
fetch("https://stackoverflow.com") // first job
.then(response => response.text()) // second job / callback
.then(result => console.log(result)); // third job / callback
Although two jobs cannot run in parallel on one agent (= thread), the job of one async function might run between the jobs of another. Therefore, two async functions can run concurrently.
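The interleaving can be observed directly. A small sketch in which two calls to an async function (the name task is made up for illustration) append to a shared log; each await ends the current job and lets the other call's next job run.

```javascript
const log = [];

async function task(name, steps) {
  for (let i = 1; i <= steps; i++) {
    log.push(`${name} step ${i}`);
    // Awaiting even a plain value ends the current job;
    // the rest of the function continues in a later job.
    await null;
  }
}

const done = Promise.all([task("A", 2), task("B", 2)]);
done.then(() => console.log(log));
// logs ["A step 1", "B step 1", "A step 2", "B step 2"]
```

Everything here runs on a single thread: the steps alternate, but no two of them ever execute at the same time.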
Now, who produces these asynchronous events? That depends on what you are awaiting in the async function (or rather: what callback you registered). If it is a timer (setTimeout), an internal timer is set and the JS-thread continues with other jobs until the timer is done and then it executes the callback passed. Some of them, especially in the Node.js environment (fetch, fs.readFile) will start another thread internally. You only hand over some arguments and receive the results when the thread is done (through an event).
To get real parallelism, that is running two jobs at the same time, multiple agents are needed. WebWorkers are exactly that - agents. The code in the WebWorker therefore runs independently (has its own job queues and executor).
Agents can communicate with each other via events, and you can react to those events with callbacks. For sure you can await actions from another agent too, if you wrap the callbacks into Promises:
// assumes `worker` is a previously created Worker; its messages arrive on the Worker object
const workerDone = new Promise(res => worker.onmessage = res);
(async function(){
const result = await workerDone;
//...
})();
TL;DR:
JS <---> callbacks / promises <--> internal Thread / Webworker
¹ There are other terms coined for this behavior, such as event loop / queue and others. The term Job is specified by ECMA262.
² How the engine implements agents is up to the engine, though as one agent may only execute one Job at a time, it very much makes sense to have one thread per agent.
In contrast to WebWorkers, async functions are never guaranteed to be executed on a separate thread.
They just don't block the whole thread until their response arrives. You can think of them as being registered as waiting for a result, let other code execute and when their response comes through they get executed; hence the name asynchronous programming.
This is achieved through a message queue, which is a list of messages to be processed. Each message has an associated function which gets called in order to handle the message.
Doing this:
setTimeout(() => {
console.log('foo')
}, 1000)
will simply register a timer together with the callback function (that logs to the console). When its 1000ms timer elapses, a message for the callback is added to the message queue, and the callback is executed as soon as the thread is free to process it.
While the timer is ticking, other code is free to execute. This is what gives the illusion of multithreading.
The setTimeout example above uses callbacks. Promises and async work the same way at a lower level — they piggyback on that message-queue concept, but are just syntactically different.
Workers are also accessed through asynchronous code (i.e. Promises). However, Workers are a solution for CPU-intensive tasks, which would otherwise block the thread that the JS code is running on, even if the CPU-intensive function is invoked asynchronously.
So if you have a CPU intensive function like renderThread(duration) and if you do like
new Promise((v,x) => setTimeout(_ => (renderThread(500), v(1)),0))
.then(v => console.log(v));
new Promise((v,x) => setTimeout(_ => (renderThread(100), v(2)),0))
.then(v => console.log(v));
Even if the second one takes less time to complete, it will only be invoked after the first one releases the CPU thread. So we will get first 1 and then 2 on the console.
However, had these two functions been run on separate Workers, the outcome we would expect is 2 and then 1, since they could then run in parallel and the second one would finish and post its message earlier.
So for basic IO operations standard single threaded asynchronous code is very efficient and the need for Workers arises from need of using tasks which are CPU intensive and can be segmented (assigned to multiple Workers at once) such as FFT and whatnot.
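A runnable version of the single-threaded case above, as a sketch: renderThread is not a real API, so a busy-wait stand-in is assumed here, and the durations and the runLater helper are arbitrary choices for illustration.

```javascript
// Hypothetical stand-in for a CPU-intensive function: spins for `ms` milliseconds.
function renderThread(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy-wait; nothing else can run on this thread meanwhile
  }
}

const order = [];

// Schedule the heavy call on a 0 ms timer and record when it completes.
function runLater(ms, label) {
  return new Promise((resolve) =>
    setTimeout(() => {
      renderThread(ms);
      resolve(label);
    }, 0)
  ).then((value) => {
    order.push(value);
    return value;
  });
}

const finished = Promise.all([runLater(200, 1), runLater(20, 2)]);
finished.then(() => console.log(order)); // logs [1, 2]
```

The shorter task cannot overtake the longer one: both callbacks sit in the same queue, and the second only starts once the first has returned.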
Async functions have nothing to do with web workers or node child processes - unlike those, they are not a solution for parallel processing on multiple threads.
An async function is just¹ syntactic sugar for a function returning a promise then() chain.
async function example() {
await delay(1000);
console.log("waited.");
}
is just the same as
function example() {
return Promise.resolve(delay(1000)).then(() => {
console.log("waited.");
});
}
These two are virtually indistinguishable in their behaviour. The semantics of await are specified in terms of promises, and every async function returns a promise for its result.
1: The syntactic sugar gets a bit more elaborate in the presence of control structures such as if/else or loops which are much harder to express as a linear promise chain, but it's still conceptually the same.
Are such functions threaded in the same way as a function executed through setInterval is?
Yes, the asynchronous parts of async functions run as (promise) callbacks on the standard event loop. The delay in the example above would be implemented with the normal setTimeout - wrapped in a promise for easy consumption:
function delay(t) {
return new Promise(resolve => {
setTimeout(resolve, t);
});
}
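The footnote about control structures can be illustrated with a loop. A sketch with made-up function names, showing the same behaviour written once with async/await and once as an explicit promise chain, which needs recursion to express the loop:

```javascript
function delay(t) {
  return new Promise((resolve) => setTimeout(resolve, t));
}

// With async/await the loop reads like synchronous code.
async function countAsync(n, out) {
  for (let i = 0; i < n; i++) {
    await delay(1);
    out.push(i);
  }
  return out;
}

// The same behaviour as an explicit promise chain: each step schedules the next.
function countChain(n, out, i = 0) {
  if (i >= n) return Promise.resolve(out);
  return delay(1).then(() => {
    out.push(i);
    return countChain(n, out, i + 1);
  });
}
```

Both functions return a promise for the filled array, and in both cases every iteration runs as a callback on the ordinary event loop.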
I want to add my own answer to my question, with the understanding I gathered through all the other people's answers:
Ultimately, all but web workers, are glorified callbacks. Code in async functions, functions called through promises, functions called through setInterval and such - all get executed in the main thread with a mechanism akin to context switching. No parallelism exists at all.
True parallel execution with all its advantages and pitfalls, pertains to webworkers and webworkers alone.
(pity - I thought with "async functions" we finally got streamlined and "inline" threading)
Here is a way to call standard functions as workers, enabling true parallelism. It's an unholy hack written in blood with help from satan, and probably there are a ton of browser quirks that can break it, but as far as I can tell it works.
[constraints: the function header has to be as simple as function f(a,b,c) and if there's any result, it has to go through a return statement]
function Async(func, params, callback)
{
// ACQUIRE ORIGINAL FUNCTION'S CODE
var text = func.toString();
// EXTRACT ARGUMENTS
var args = text.slice(text.indexOf("(") + 1, text.indexOf(")"));
// trim each name (reassigning the loop variable would not change the array)
args = args.split(",").map(function(arg) { return arg.trim(); });
// ALTER FUNCTION'S CODE:
// 1) DECLARE ARGUMENTS AS VARIABLES
// 2) REPLACE RETURN STATEMENTS WITH THREAD POSTMESSAGE AND TERMINATION
var body = text.slice(text.indexOf("{") + 1, text.lastIndexOf("}"));
for(var i = 0, c = params.length; i<c; i++) body = "var " + args[i] + " = " + JSON.stringify(params[i]) + ";" + body;
body = body + " self.close();";
body = body.replace(/return\s+([^;]*);/g, 'self.postMessage($1); self.close();');
// CREATE THE WORKER FROM FUNCTION'S ALTERED CODE
var code = URL.createObjectURL(new Blob([body], {type:"text/javascript"}));
var thread = new Worker(code);
// WHEN THE WORKER SENDS BACK A RESULT, CALLBACK AND TERMINATE THE THREAD
thread.onmessage =
function(result)
{
if(callback) callback(result.data);
thread.terminate();
}
}
So, assuming you have this potentially cpu intensive function...
function HeavyWorkload(nx, ny)
{
var data = [];
for(var x = 0; x < nx; x++)
{
data[x] = [];
for(var y = 0; y < ny; y++)
{
data[x][y] = Math.random();
}
}
return data;
}
...you can now call it like this:
Async(HeavyWorkload, [1000, 1000],
function(result)
{
console.log(result);
}
);
I am trying to find a way to "promisify" event callbacks on workers, so that the master can:
wait for all workers to finish CPU intensive tasks,
then do some calculation based on the results returned.
I came up with the following code and this works fine, however I am not confident with the approach I am taking.
Function to create a promise that wraps the message event on a worker:
_waitAsync(worker) {
// using bluebird for Promise
return Promise.promisify((callback) => {
worker.on('message', callback.bind(undefined, undefined));
})();
}
Master calls as below:
doCPUIntensiveTaskAsync(){
const promises = [];
let k = 0;
for (var wid in cluster.workers) {
promises.push(this._waitAsync(cluster.workers[wid]).bind(this));
cluster.workers[wid].send(messages[k++]);
}
return Promise.all(promises).then(calculate);
}
What are the recommended / better ways to take advantage of multi-core environment (ie. offload CPU intensive tasks to different thread/process) in Node.js?
Start a worker for each CPU-intensive task, then let it exit. Do not reuse workers; start a new one for the next task.
Easy way to promisify worker exit:
import * as Q from 'q';
let worker = cluster.fork();
let workerExit = Q.denodeify(worker.on.bind(worker, 'exit'));
await workerExit();// here can be workerExit().then ... instead of await
console.log('child exited');
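The question's idea of wrapping a worker's message event in a promise can also be sketched without Bluebird or Q, using the once helper from Node's events module. The function names are hypothetical, and the sketch assumes each worker replies with exactly one message per task.

```javascript
const { once } = require("node:events");

// Resolve with the first message a worker sends back. `worker` can be
// anything that emits "message" (cluster worker, child process, worker thread).
async function waitForMessage(worker) {
  const [message] = await once(worker, "message");
  return message;
}

// Send one task to each worker and collect the replies in worker order.
async function runOnAll(workers, tasks) {
  // Subscribe before sending so that a fast reply cannot be missed.
  const replies = workers.map((worker) => waitForMessage(worker));
  workers.forEach((worker, i) => worker.send(tasks[i]));
  return Promise.all(replies);
}
```

With cluster, the call could look like runOnAll(Object.values(cluster.workers), messages).then(calculate), where messages and calculate are the names from the question.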
I am quite confused about why my promise is blocking requests to the Node app.
Here is my simplified code:
var express = require('express');
var someModule = require('somemodule');
var app = express();
app.get('/', function (req, res) {
res.status(200).send('Main');
});
app.get('/status', function (req, res) {
res.status(200).send('Status');
});
// Init Promise
someModule.doSomething({}).then(function(){},function(){}, function(progress){
console.log(progress);
});
var server = app.listen(3000, function () {
var host = server.address().address;
var port = server.address().port;
console.log('Example app listening at http://%s:%s in %s environment',host, port, app.get('env'));
});
And the module:
var q = require('q');
function SomeModule(){
this.doSomething = function(){
return q.Promise(function(resolve, reject, notify){
for (var i=0;i<10000;i++){
notify('Progress '+i);
}
resolve();
});
}
}
module.exports = SomeModule;
Obviously this is very simplified. The promise function does some work that takes anywhere from 5 to 30 minutes and has to run only when server starts up.
There is NO async operation in that promise function. It's just a lot of data processing, loops etc.
I want to be able to make requests right away, though. So what I expect is that when I run the server, I can go right away to 127.0.0.1:3000 and see Main, and the same for any other request.
Eventually I want to see the progress of that task by accessing /status, but I'm sure I can make that work once the server works as expected.
At the moment, when I open / it just hangs until the promise job finishes.
Obviously I'm doing something wrong...
If your task is IO-bound, go with process.nextTick. If your task is CPU-bound, asynchronous calls won't offer much performance-wise. In that case you need to delegate the task to another process. An example solution would be to spawn a child process, do the work there, and pipe the results back to the parent process when done.
See nodejs.org/api/child_process.html for more.
If your application needs to do this often then forking lots of child processes quickly becomes a resource hog - each time you fork, a new V8 process will be loaded into memory. In this case it is probably better to use one of the multiprocessing modules like Node's own Cluster. This module offers easy creation and communication between master-worker processes and can remove a lot of complexity from your code.
See also a related question: Node.js - Sending a big object to child_process is slow
The main thread of Javascript in node.js is single threaded. So, if you do some giant loop that is processor bound, then that will hog the one thread and no other JS will run in node.js until that one operation is done.
So, when you call:
someModule.doSomething()
and that is all synchronous, then it does not return until it is done executing and thus the lines of code following that don't execute until the doSomething() method returns. And, just so you understand, the use of promises with synchronous CPU-hogging code does not help your cause at all. If it's synchronous and CPU bound, it's just going to take a long time to run before anything else can run.
If there is I/O involved in the loop (like disk I/O or network I/O), then there are opportunities to use async I/O operations and make the code non-blocking. But, if not and it's just a lot of CPU stuff, then it will block until done and no other code will run.
Your opportunities for changing this are:
Run the CPU consuming code in another process. Either create a separate program that you run as a child process that you can pass input to and get output from or create a separate server that you can then make async requests to.
Break the blocking work into chunks where you execute 100ms chunks of work at a time, then yield the processor back to the event loop (using something like setTimeout()) to allow other things in the event queue to be serviced and run before you pick up and run the next chunk of work. You can see Best way to iterate over an array without blocking the UI for ideas on how to chunk synchronous work.
As an example, you could chunk your current loop. This runs up to 100ms of cycles and then breaks execution to give other things a chance to run. You can set the cycle time to whatever you want.
function SomeModule(){
this.doSomething = function(){
return q.Promise(function(resolve, reject, notify){
var cntr = 0, numIterations = 10000, timePerSlice = 100;
function run() {
if (cntr < numIterations) {
var start = Date.now();
while (Date.now() - start < timePerSlice && cntr < numIterations) {
notify('Progress '+cntr);
++cntr;
}
// give some other things a chance to run and then call us again
// setImmediate() is also an option here, but setTimeout() gives all
// other operations a chance to run alongside this operation
setTimeout(run, 10);
} else {
resolve();
}
}
run();
});
}
}
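The first option, moving the CPU-bound work off the main thread, can be sketched with Node's built-in worker_threads module instead of a separate process. This is a swapped-in technique, not something the question's code used: the loop inside the worker is a placeholder for the real data processing, and runInWorker is a hypothetical helper name.

```javascript
const { Worker } = require("node:worker_threads");

// Code that runs inside the worker thread. The loop is a placeholder
// for the real CPU-bound processing.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  let sum = 0;
  for (let i = 0; i < workerData.iterations; i++) {
    sum += i;
    if (i % workerData.reportEvery === 0) {
      parentPort.postMessage({ type: "progress", done: i });
    }
  }
  parentPort.postMessage({ type: "result", sum });
`;

// Run the placeholder job in a worker; resolves with its final result.
function runInWorker(iterations, onProgress) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true, // treat the first argument as source code, not a file path
      workerData: { iterations, reportEvery: Math.ceil(iterations / 10) },
    });
    worker.on("message", (msg) => {
      if (msg.type === "progress") onProgress(msg.done);
      else resolve(msg.sum);
    });
    worker.on("error", reject);
    worker.on("exit", (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}
```

The main thread only handles short message callbacks, so Express routes keep responding while the job runs; storing the latest progress value in a variable would be one way to serve it from /status.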