I am trying to understand how workers work in Node.js.
My understanding is that every time we spawn a worker, it creates a new thread with its own Node/V8 instance.
So will the below code spawn 50 threads?
How is it distributed over the cpu cores?
This is the index.js
const { Worker } = require("worker_threads");

var count = 0;
console.log("Start Program");

const runService = () => {
  return new Promise((resolve, reject) => {
    const worker = new Worker("./service.js", {});
    worker.on("message", resolve);
    worker.on("error", reject);
    worker.on("exit", code => {
      if (code != 0) {
        reject(new Error("Worker has stopped"));
      }
    });
  });
};

const run = async () => {
  const result = await runService();
  console.log(count++);
  console.log(result);
};

for (let i = 0; i < 50; i++) {
  run().catch(error => console.log(error));
}

setTimeout(() => console.log("End Program"), 2000);
and this is the service.js file
const { workerData, parentPort } = require("worker_threads");

// You can do any heavy stuff here, in a synchronous way
// without blocking the "main thread"
const sleep = () => {
  return new Promise(resolve => setTimeout(resolve, 500));
};

let cnt = 0;
for (let i = 0; i < 10e8; i += 1) {
  cnt += 1;
}

parentPort.postMessage({ data: cnt });
So will the below code spawn 50 threads?
....
for (let i = 0; i < 50; i++) {
  run().catch(error => console.log(error));
}
Yes.
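As a quick sanity check, you can log each worker's threadId; worker_threads gives every spawned worker its own thread id. This is only a sketch reusing the service.js from the question:
const { Worker } = require("worker_threads");

// Sketch: each new Worker gets its own threadId, i.e. its own thread
// (with its own V8 isolate and event loop).
for (let i = 0; i < 5; i++) {
  const worker = new Worker("./service.js", {});
  console.log(`spawned worker ${i} on threadId ${worker.threadId}`);
  worker.on("message", msg => console.log(`threadId ${worker.threadId} done`, msg));
}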
How is it distributed over the cpu cores?
The OS will handle this.
Depending on the OS, there is a feature called processor affinity that allows you to manually set the "affinity", or preference, a task has for a CPU core. On many OSes this is just a hint and the OS will override your preference if it needs to. Some real-time OSes treat this as mandatory, allowing you more control over the hardware (when writing algorithms for self-driving cars or factory robots you sometimes don't want the OS to take control of your carefully crafted software at random times).
Some OSes, like Linux, let you set processor affinity with command-line tools, so you can easily write a shell script or use child_process to fine-tune your threads. At the moment there is no built-in way to manage processor affinity for worker threads. There is one third-party module I'm aware of that does this on Windows and Linux: nodeaffinity, but it doesn't work on macOS (or other OSes like BSD, Solaris/illumos, etc.).
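For example, on Linux you could pin the whole Node process (and therefore all of its worker threads) to specific cores from inside the program; this is only a sketch and assumes the taskset utility is installed:
const { execSync } = require("child_process");

// Sketch (Linux only): pin this Node process, and every worker thread it spawns,
// to cores 0 and 1. Requires the `taskset` utility.
try {
  execSync(`taskset -cp 0,1 ${process.pid}`);
} catch (err) {
  console.error("Could not set CPU affinity:", err.message);
}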
Also keep in mind that Node.js itself is single threaded; the number of workers that can actually run in parallel depends on how many hardware threads (CPU cores) your system provides. Spawning far more workers or child processes than that won't help: it will just slow the whole thing down and reduce throughput.
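In practice, a common approach is to size the worker pool to the logical core count reported by os.cpus() instead of launching all 50 workers at once. A rough sketch based on the index.js above (the pooling logic itself is illustrative, not part of the original code):
const os = require("os");
const { Worker } = require("worker_threads");

// Sketch: run 50 jobs through a pool no larger than the number of logical cores.
const poolSize = os.cpus().length;
let remaining = 50;

const runNext = () => {
  if (remaining-- <= 0) return;
  const worker = new Worker("./service.js", {});
  worker.on("message", result => console.log(result));
  worker.on("exit", runNext); // start the next job when this worker finishes
};

for (let i = 0; i < poolSize; i++) runNext();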
Related
I'm building a browser tool that samples a big file and shows some stats about it.
The program picks k random parts of a file, and processes each part of the file separately. Once each part is processed, an object is modified that keeps track of the rolling "stats" of the file (in the example below, I've simplified to incrementing a rolling counter).
The issue is that now every part is read in parallel, but I'd like it to be in series - so that the updates to the rolling counter are thread safe.
I think the for-loop is kicking off the next processFileChunk before the previous one finishes. How do I get this to be done serially?
I'm fairly new to Vue, and frontend in general. Is this a simple asynchronicity problem? How do I tell if something is asynchronous?
Edit: the parsing step uses the papaparse library (which I bet is the asynchronous part)
import {parse} from 'papaparse'

export default {
  data() {
    return {
      counter: 0
    }
  },
  methods: {
    streamAndSample(file) {
      var vm = this;
      const k = 10 // number of samples
      var pointers = PickRandomPointers(file) // this is an array of integers, representing a random byte location of a file

      for (const k_th_random_pointer in pointers) {
        this.processFileChunk(file, k_th_random_pointer)
      }
    },
    processFileChunk(file, k_th_random_pointer) {
      var vm = this;
      var reader = new FileReader();
      reader.readAsText(file.slice(k_th_random_pointer, k_th_random_pointer + 100000)) // read 100 KB
      reader.onload = function (oEvent) {
        var text = oEvent.target.result
        parse(text, {complete: function (res) {
          for (var i = 0; i < res.data.length; i++) {
            vm.counter = vm.counter + 1
          }
        }})
      }
    }
  }
}
"thread safe" JavaScript
JavaScript is single-threaded, so only one thread of execution is run at a time. Async operations are put into a master event queue, and each is run until completion one after another.
Perhaps you meant "race condition", where completion order, rather than read order, determines when each result affects the counter. That is, a chunk that parses quickly might bump the counter before a larger or slower one that the parser saw first.
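To see that it's completion order, not start order, that drives the counter, here's a small standalone sketch (timers stand in for the parse step):
// Sketch: the task started second finishes first, just like a chunk that
// parses quickly can bump the counter before one the parser saw earlier.
let counter = 0;

setTimeout(() => { counter += 1; console.log('slow chunk done, counter =', counter); }, 200);
setTimeout(() => { counter += 1; console.log('fast chunk done, counter =', counter); }, 50);
// logs "fast chunk done, counter = 1" before "slow chunk done, counter = 2"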
Awaiting each result
To await the parser completion of each file before moving onto the next, return a Promise from processFileChunk() that resolves the parsed data length:
export default {
  methods: {
    processFileChunk(file, k_th_random_pointer) {
      return new Promise((resolve, reject) => {
        const reader = new FileReader()
        reader.onload = oEvent => {
          const text = oEvent.target.result
          const result = parse(text)
          resolve(result.data.length)
        }
        reader.onerror = err => reject(err)
        reader.onabort = () => reject()
        reader.readAsText(file.slice(k_th_random_pointer, k_th_random_pointer + 100000)) // read 100 KB
      })
    }
  }
}
Then make streamAndSample() an async function in order to await the result of each processFileChunk() call (the result is the data length resolved in the Promise):
export default {
  methods: {
    // 👇
    async streamAndSample(file) {
      const k = 10
      const pointers = PickRandomPointers(file)

      for (const k_th_random_pointer in pointers) {
        // 👇
        const length = await this.processFileChunk(file, k_th_random_pointer)
        this.counter += length
      }
    }
  }
}
Aside: Instead of passing a cached this into a callback, use an arrow function, which automatically preserves the context. I've done that in the code blocks above.
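For illustration, here's a self-contained sketch of the difference, using setTimeout in place of the FileReader callback:
// Sketch: caching `this` vs. letting an arrow function capture it.
const componentLike = {
  counter: 0,
  cachedThisVersion() {
    const vm = this; // must cache `this` for the plain function below
    setTimeout(function () { vm.counter += 1; }, 0);
  },
  arrowVersion() {
    setTimeout(() => { this.counter += 1; }, 0); // arrow keeps `this` automatically
  },
};

componentLike.cachedThisVersion();
componentLike.arrowVersion();
setTimeout(() => console.log(componentLike.counter), 10); // 2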
It's worth noting that papaparse.parse() also supports streaming for large files (although the starting read index cannot be specified), so processFileChunk() might be rewritten like this:
export default {
  methods: {
    processFileChunk(file, k_th_random_pointer) {
      return new Promise((resolve, reject) => {
        let total = 0
        parse(file, {
          chunk(res, parser) {
            console.log('chunk', res.data.length)
            total += res.data.length // results are not passed to `complete` when streaming
          },
          chunkSize: 100000,
          complete: () => resolve(total),
          error: err => reject(err)
        })
      })
    }
  }
}
For each line in a file, I want to execute a computationally intensive task, such as image compression. The problem I have is that the data comes in too fast and overwhelms the memory. Ideally I'd like to be able to pause and continue the stream as the data gets processed.
I initially tried using the readline module with a file stream like this:
const fs = require('fs')
const readline = require('readline')

const fileStream = fs.createReadStream('long-list.txt')
const rl = readline.createInterface({ input: fileStream })

rl.on('line', (line, lineCount) => {
  doTheHeavyTask(line)
})
However, this quickly overwhelms the memory with thousands of calls to doTheHeavyTask().
I settled on pushing each line into a queue and creating an event that dequeues the next line when the previous line is done being processed:
const Queue = require('queue-fifo')
const EventEmitter = require('events')

const lineQ = new Queue()
rl.on('line', (line, lineCount) => {
  lineQ.enqueue(line)
})

const lineEmitter = new EventEmitter()
lineEmitter.on('processNextLine', async () => {
  await doTheHeavyTask( lineQ.dequeue() )
  if (!lineQ.isEmpty()) lineEmitter.emit('processNextLine')
})

setTimeout( () => lineEmitter.emit('processNextLine'), 20) // Give rl a moment to enqueue some lines
This works, but it seems kind of hacky and not much better than just reading in the file all at once.
I'm vaguely aware of concepts like "backpressure" and "generators" in JavaScript, but I'm not sure how to apply them.
The problem here is not the stream itself, but the async tasks that you trigger. Every task (whether it is a callback with a closure or an async function) consumes memory. If you start multiple (thousands of) tasks at the same time, that will use resources.
You can use an async iterator to go over the lines, and do one task for each (and wait for it):
(async function () {
  for await (const el of rl) {
    await doHeavyTask(el);
  }
})();
That will apply backpressure correctly.
However, it only does one task at a time, which might be quite slow. To buffer a few elements and process them concurrently, you could do:
const SIZE = 10; // to be tested with different values

(async function () {
  let chunk = [];
  for await (const el of rl) {
    chunk.push(el);
    if (chunk.length >= SIZE) {
      await Promise.all(chunk.map(doHeavyTask));
      chunk.length = 0;
    }
  }
  await Promise.all(chunk.map(doHeavyTask));
})();
You need at least Node 11.14.0 for this to work.
I have a system where I send a large array into the worker, which then uses a library to compress it. Compression takes a long time and makes the worker busy. I'm wondering whether further messages sent to the worker will be received while the worker is still busy. Would the messages be lost? Will they be queued and eventually processed by the worker as well?
Yes, once the worker has finished what it's working on currently, it'll be able to process the next message in the queue. Here's an example:
// https://stackoverflow.com/questions/10343913/how-to-create-a-web-worker-from-a-string
// I put it in a Stack Snippet for live demonstration
// in a real project, put this in a separate file
const workerFn = () => {
  self.onmessage = () => {
    self.postMessage('Worker message callback just fired');
    // something expensive
    for (let i = 0; i < 2e9; i++) {
    }
    self.postMessage('Expensive operation ending');
  };
};

const workerFnStr = `(${workerFn})();`;
const blob = new Blob([workerFnStr], { type: 'text/javascript' });
const worker = new Worker(window.URL.createObjectURL(blob));
worker.onmessage = ({ data }) => console.log(data);

worker.postMessage(null);
worker.postMessage(null);
This is similar to the idea that when an element in the DOM is clicked on, that click will be processed once the current event loop task finishes, even if said task is expensive.
setTimeout(() => {
  window.onclick = () => console.log('click callback running');
  // something expensive
  for (let i = 0; i < 3e9; i++) {
  }
  console.log('Main synchronous task finishing');
}, 500);
Click here *quickly* after running the snippet
I have a repo with ~70 tests of executables. When running under Mocha or Jest, I typically get errors on the first couple of promises, either because of a timeout or because stdout never made it back to the parent process.
My minimal reproduction of this problem spawns 100 child processes, each running a command line that sleeps for about 10s:
let child_process = require('child_process');

let AllTests = [];

/* start processes */
for (let i = 0; i < 100; ++i) {
  AllTests.push({
    i: i,
    start: new Date(),
    exec: new Promise((resolve, reject) => {
      let program = child_process.spawn(
        'node', ['-e', 'setTimeout(() => { process.exit(0); }, 9999)'])
        // 'node', ['-e', 'for (let i = 0; i < 2**28; ++i) ;'])
      program.on('exit', exitCode => { resolve({exitCode: exitCode}) })
      program.on('error', err => { reject(err) })
    })
  })
}

/* test results */
describe('churn', () => {
  AllTests.forEach(test => {
    it('should execute test ' + test.i + '.',
      done => {
        test.exec.then(exec => {
          test.end = new Date()
          done()
        })
      })
  })
})
On my under-powered laptop, I typically get:
93 passing (19s)
7 failing
1) churn
should execute test 0.:
Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/eric/errz/flood.js)
...
Adding some accounting afterwards:
after(() => {
  console.log()
  AllTests.forEach(test => {
    console.log(test.i, (test.end - test.start)/1000.0)
  })
})
shows that each process takes ~19s.
Given that this occurs in Mocha and Jest, I guess that the issue is related to the 100 simultaneous processes. Suggestions?
I was able to address the timeouts and the stdio stream issues almost separately.
The stream issues mostly cleared up when I pushed the exit-handler for process termination into the next event cycle:
program.on("exit", function(exitCode) {
setTimeout(
() => resolve({stdout:stdout, stderr:stderr, exitCode:exitCode}), 0
)
});
program.on("error", function(err) { reject(err); });
The timeouts were because I was flooding the process table.
Rather than getting overly intimate with the kernel's scheduler, I used timeout-promise-queue, which throttled the total concurrent processes and provided timeouts based on the start time of each queued process.
Using timeout-promise-queue also cleared up the lingering stream issues, which only showed up when the process table got too large.
After thousands of tests, I settled on a process queue of 25 and a 0-length timeout on the exit-handler.
The resulting diffs are pretty minimal and self-explanatory and I no longer have to hit [↻Restart job] on Travis tests.
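If you'd rather avoid a dependency, the same throttling idea can be hand-rolled; the sketch below is illustrative (it is not the timeout-promise-queue API), and startTest is a hypothetical thunk that spawns one child process and returns its promise:
// Sketch: run async jobs with at most `limit` in flight at once.
const runWithLimit = async (jobs, limit) => {
  const results = [];
  let next = 0;
  const lanes = Array.from({ length: limit }, async () => {
    while (next < jobs.length) {
      const i = next++;          // claim the next job index
      results[i] = await jobs[i]();
    }
  });
  await Promise.all(lanes);
  return results;
};

// usage sketch: wrap each spawn in a thunk so it only starts when dequeued
// const results = await runWithLimit(testCommands.map(cmd => () => startTest(cmd)), 25);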
In the code attached, I am looking to run the function returnFile after all database queries have run. The problem is that I am unable to tell, from inside a query's callback, which response will be the last. What I was thinking was to separate the loops and just have the last callback run the returnFile function, but that would dramatically slow things down.
for (var i = 0, len = articleRevisionData.length; i < len; i++) {
  tagNames = []
  console.log("step 1, " + articleRevisionData.length + " i:" + i);
  if (articleRevisionData[i]["tags"]) {
    for (var x = 0, len2 = articleRevisionData[i]["tags"].length; x < len2; x++) {
      console.log("step 2, I: " + i + " x: " + x + articleRevisionData[i]["articleID"])
      tagData.find({"tagID": articleRevisionData[i]["tags"][x]}).toArray(function(iteration, len3, iterationA, error, resultC) {
        console.log("step 3, I: " + i + " x: " + x + " iteration: " + iteration + " len3: " + len3)
        if (resultC.length > 0) {
          tagNames.push(resultC[0]["tagName"]);
        }
        //console.log("iteration: " + iteration + " len: " + len3)
        if (iteration + 1 == len3) {
          console.log("step 4, iterationA: " + iterationA + " I: " + iteration)
          articleRevisionData[iterationA]["tags"] = tagNames.join(",");
        }
      }.bind(tagData, x, len2, i));
    }
  }
  if (i == len - 1) {
    templateData = {
      name: userData["firstName"] + " " + userData["lastName"],
      articleData: articleData,
      articleRevisionData: articleRevisionData
    }
    returnFile(res, "/usr/share/node/Admin/anonymousAttempt2/Admin/Articles/home.html", templateData);
  }
}
It is rarely a good idea to call an asynchronous function from within a loop since, as you've discovered, you cannot know when all the calls complete (which is the nature of asynchrony.)
In your example, it's important to note that all of your async calls run concurrently, which can consume more system resources than you might wish.
I've found that the best solution to these kinds of problems is to use events to manage execution flow, as in:
const EventEmitter = require('events');

const emitter = new EventEmitter();
let iterations = articleRevisionData.length;

// start up state
emitter.on('start', () => {
  // do setup here
  emitter.emit('next_iteration');
});

// loop state
emitter.on('next_iteration', () => {
  if (iterations--) {
    asyncFunc(args, (err, result) => {
      if (err) {
        emitter.emit('error', err);
        return;
      }
      // do something with result
      emitter.emit('next_iteration');
    });
    return;
  }
  // no more iterations
  emitter.emit('complete');
});

// error state
emitter.on('error', (e) => {
  console.error(`processing failed on iteration ${iterations+1}: ${e.toString()}`);
});

// processing complete state
emitter.on('complete', () => {
  // do something with all results
  console.log('all iterations complete');
});

// start processing
emitter.emit('start');
Note how simple and clean this code is, lacking any "callback hell", and how easy it is to visualize program flow.
It is also worth noting that you can express every kind of execution control (doWhile, doUntil, map/reduce, queue workers, etc.) using events. Since event handling is at the very core of Node, you'll find that using them in this manner will outperform most, if not all, other solutions.
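As a small illustration of the queue-worker variant, here's a sketch that drains an in-memory queue with the same emitter pattern (processItem is a stand-in for whatever async work you need):
const EventEmitter = require('events');

const queue = ['a', 'b', 'c'];
const emitter = new EventEmitter();

// placeholder async task
const processItem = (item, cb) => setTimeout(() => cb(null, item.toUpperCase()), 100);

emitter.on('work', () => {
  if (queue.length === 0) return emitter.emit('drain');
  processItem(queue.shift(), (err, result) => {
    if (err) return emitter.emit('error', err);
    console.log('processed', result);
    emitter.emit('work'); // pull the next item
  });
});

emitter.on('drain', () => console.log('queue empty'));
emitter.on('error', e => console.error(e));

emitter.emit('work');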
See Node Events for more information on event handling in Node.