I have a wasm process (compiled from c++) that processes data inside a web application. Let's say the necessary code looks like this:
std::vector<JSONObject> data
for (size_t i = 0; i < data.size(); i++)
{
process_data(data[i]);
if (i % 1000 == 0) {
bool is_cancelled = check_if_cancelled();
if (is_cancelled) {
break;
}
}
}
This code basically "runs/processes a query" similar to a SQL query interface:
However, queries may take several minutes to run/process and at any given time the user may cancel their query. The cancellation process would occur in the normal javascript/web application, outside of the service Worker running the wasm.
My question then is what would be an example of how we could know that the user has clicked the 'cancel' button and communicate it to the wasm process so that knows the process has been cancelled so it can exit? Using the worker.terminate() is not an option, as we need to keep all the loaded data for that worker and cannot just kill that worker (it needs to stay alive with its stored data, so another query can be run...).
What would be an example way to communicate here between the javascript and worker/wasm/c++ application so that we can know when to exit, and how to do it properly?
Additionally, let us suppose a typical query takes 60s to run and processes 500MB of data in-browser using cpp/wasm.
Update: I think there are the following possible solutions here based on some research (and the initial answers/comments below) with some feedback on them:
Use two workers, with one worker storing the data and another worker processing the data. In this way the processing-worker can be terminated, and the data will always remain. Feasible? Not really, as it would take way too much time to copy over ~ 500MB of data to the webworker whenever it starts. This could have been done (previously) using SharedArrayBuffer, but its support is now quite limited/nonexistent due to some security concerns. Too bad, as this seems like by far the best solution if it were supported...
Use a single worker using Emterpreter and using emscripten_sleep_with_yield. Feasible? No, destroys performance when using Emterpreter (mentioned in the docs above), and slows down all queries by about 4-6x.
Always run a second worker and in the UI just display the most recent. Feasible? No, would probably run into quite a few OOM errors if it's not a shared data structure and the data size is 500MB x 2 = 1GB (500MB seems to be a large though acceptable size when running in a modern desktop browser/computer).
Use an API call to a server to store the status and check whether the query is cancelled or not. Feasible? Yes, though it seems quite heavy-handed to long-poll with network requests every second from every running query.
Use an incremental-parsing approach where only a row at a time is parsed. Feasible? Yes, but also would require a tremendous amount of re-writing the parsing functions so that every function supports this (the actual data parsing is handled in several functions -- filter, search, calculate, group by, sort, etc. etc.
Use IndexedDB and store the state in javascript. Allocate a chunk of memory in WASM, then return its pointer to JavaScript. Then read database there and fill the pointer. Then process your data in C++. Feasible? Not sure, though this seems like the best solution if it can be implemented.
[Anything else?]
In the bounty then I was wondering three things:
If the above six analyses seem generally valid?
Are there other (perhaps better) approaches I'm missing?
Would anyone be able to show a very basic example of doing #6 -- seems like that would be the best solution if it's possible and works cross-browser.
For Chrome (only) you may use shared memory (shared buffer as memory). And raise a flag in memory when you want to halt. Not a big fan of this solution (is complex and is supported only in chrome). It also depends on how your query works, and if there are places where the lengthy query can check the flag.
Instead you should probably call the c++ function multiple times (e.g. for each query) and check if you should halt after each call (just send a message to the worker to halt).
What I mean by multiple time is make the query in stages (multiple function cals for a single query). It may not be applicable in your case.
Regardless, AFAIK there is no way to send a signal to a Webassembly execution (e.g. Linux kill). Therefore, you'll have to wait for the operation to finish in order to complete the cancellation.
I'm attaching a code snippet that may explain this idea.
worker.js:
... init webassembly
onmessage = function(q) {
// query received from main thread.
const result = ... call webassembly(q);
postMessage(result);
}
main.js:
const worker = new Worker("worker.js");
const cancel = false;
const processing = false;
worker.onmessage(function(r) {
// when worker has finished processing the query.
// r is the results of the processing.
processing = false;
if (cancel === true) {
// processing is done, but result is not required.
// instead of showing the results, update that the query was canceled.
cancel = false;
... update UI "cancled".
return;
}
... update UI "results r".
}
function onCancel() {
// Occurs when user clicks on the cancel button.
if (cancel) {
// sanity test - prevent this in UI.
throw "already cancelling";
}
cancel = true;
... update UI "canceling".
}
function onQuery(q) {
if (processing === true) {
// sanity test - prevent this in UI.
throw "already processing";
}
processing = true;
// Send the query to the worker.
// When the worker receives the message it will process the query via webassembly.
worker.postMessage(q);
}
An idea from user experience perspective:
You may create ~two workers. This will take twice the memory, but will allow you to "cancel" "immediately" once. (it will just mean that in the backend the 2nd worker will run the next query, and when the 1st finishes the cancellation, cancellation will again become immediate).
Shared Thread
Since the worker and the C++ function that it called share the same thread, the worker will also be blocked until the C++ loop is finished, and won't be able to handle any incoming messages. I think the a solid option would minimize the amount of time that the thread is blocked by instead initializing one iteration at a time from the main application.
It would look something like this.
main.js -> worker.js -> C++ function -> worker.js -> main.js
Breaking up the Loop
Below, C++ has a variable initialized at 0, which will be incremented at each loop iteration and stored in memory.
C++ function then performs one iteration of the loop, increments the variable to keep track of loop position, and immediately breaks.
int x;
x = 0; // initialized counter at 0
std::vector<JSONObject> data
for (size_t i = x; i < data.size(); i++)
{
process_data(data[i]);
x++ // increment counter
break; // stop function until told to iterate again starting at x
}
Then you should be able to post a message to the web worker, which then sends a message to main.js that the thread is no longer blocked.
Canceling the Operation
From this point, main.js knows that the web worker thread is no longer blocked, and can decide whether or not to tell the web worker to execute the C++ function again (with the C++ variable keeping track of the loop increment in memory.)
let continueOperation = true
// here you can set to false at any time since the thread is not blocked here
worker.expensiveThreadBlockingFunction()
// results in one iteration of the loop being iterated until message is received below
worker.onmessage = function(e) {
if (continueOperation) {
worker.expensiveThreadBlockingFunction()
// execute worker function again, ultimately continuing the increment in C++
} {
return false
// or send message to worker to reset C++ counter to prepare for next execution
}
}
Continuing the Operation
Assuming all is well, and the user has not cancelled the operation, the loop should continue until finished. Keep in mind you should also send a distinct message for whether the loop has completed, or needs to continue, so you don't keep blocking the worker thread.
I have been reading up on nodejs lately, trying to understand how it handles multiple concurrent requests. I know NodeJs is a single threaded event loop based architecture, and at a given point in time only one statement is going to be executing, i.e. on the main thread and that blocking code/IO calls are handled by the worker threads (default is 4).
Now my question is, what happens when a web server built using NodeJs receives multiple requests? I know that there are lots of similar questions here, but haven't found a concrete answer to my question.
So as an example, let's say we have following code inside a route like /index:
app.use('/index', function(req, res, next) {
console.log("hello index routes was invoked");
readImage("path", function(err, content) {
status = "Success";
if(err) {
console.log("err :", err);
status = "Error"
}
else {
console.log("Image read");
}
return res.send({ status: status });
});
var a = 4, b = 5;
console.log("sum =", a + b);
});
Let's assume that the readImage() function takes around 1 min to read that Image.
If two requests, T1, and T2 come in concurrently, how is NodeJs going to process these request ?
Does it going to take first request T1, process it while queueing the request T2? I assume that if any async/blocking stuff is encountered like readImage, it then sends that to a worker thread (then some point later when async stuff is done that thread notifies the main thread and main thread starts executing the callback?), and continues by executing the next line of code?
When it is done with T1, it then processes the T2 request? Is that correct? Or it can process T2 in between (meaning whilethe code for readImage is running, it can start processing T2)?
Is that right?
Your confusion might be coming from not focusing on the event loop enough. Clearly you have an idea of how this works, but maybe you do not have the full picture yet.
Part 1, Event Loop Basics
When you call the use method, what happens behind the scenes is another thread is created to listen for connections.
However, when a request comes in, because we're in a different thread than the V8 engine (and cannot directly invoke the route function), a serialized call to the function is appended onto the shared event loop, for it to be called later. ('event loop' is a poor name in this context, as it operates more like a queue or stack)
At the end of the JavaScript file, the V8 engine will check if there are any running theads or messages in the event loop. If there are none, it will exit with a code of 0 (this is why server code keeps the process running). So the first Timing nuance to understand is that no request will be processed until the synchronous end of the JavaScript file is reached.
If the event loop was appended to while the process was starting up, each function call on the event loop will be handled one by one, in its entirety, synchronously.
For simplicity, let me break down your example into something more expressive.
function callback() {
setTimeout(function inner() {
console.log('hello inner!');
}, 0); // †
console.log('hello callback!');
}
setTimeout(callback, 0);
setTimeout(callback, 0);
† setTimeout with a time of 0, is a quick and easy way to put something on the event loop without any timer complications, since no matter what, it has always been at least 0ms.
In this example, the output will always be:
hello callback!
hello callback!
hello inner!
hello inner!
Both serialized calls to callback are appended to the event loop before either of them is called. This is guaranteed. That happens because nothing can be invoked from the event loop until after the full synchronous execution of the file.
It can be helpful to think of the execution of your file, as the first thing on the event loop. Because each invocation from the event loop can only happen in series, it becomes a logical consequence, that no other event loop invocation can occur during its execution; Only when the previous invocation is finished, can the next event loop function be invoked.
Part 2, The inner Callback
The same logic applies to the inner callback as well, and can be used to explain why the program will never output:
hello callback!
hello inner!
hello callback!
hello inner!
Like you might expect.
By the end of the execution of the file, two serialized function calls will be on the event loop, both for callback. As the Event loop is a FIFO (first in, first out), the setTimeout that came first, will be be invoked first.
The first thing callback does is perform another setTimeout. As before, this will append a serialized call, this time to the inner function, to the event loop. setTimeout immediately returns, and execution will move on to the first console.log.
At this time, the event loop looks like this:
1 [callback] (executing)
2 [callback] (next in line)
3 [inner] (just added by callback)
The return of callback is the signal for the event loop to remove that invocation from itself. This leaves 2 things in the event loop now: 1 more call to callback, and 1 call to inner.
Now callback is the next function in line, so it will be invoked next. The process repeats itself. A call to inner is appended to the event loop. A console.log prints Hello Callback! and we finish by removing this invocation of callback from the event loop.
This leaves the event loop with 2 more functions:
1 [inner] (next in line)
2 [inner] (added by most recent callback)
Neither of these functions mess with the event loop any further. They execute one after the other, the second one waiting for the first one's return. Then when the second one returns, the event loop is left empty. This fact, combined with the fact that there are no other threads currently running, triggers the end of the process, which exits with a return code of 0.
Part 3, Relating to the Original Example
The first thing that happens in your example, is that a thread is created within the process which will create a server bound to a particular port. Note, this is happening in precompiled C++ code, not JavaScript, and is not a separate process, it's a thread within the same process. see: C++ Thread Tutorial.
So now, whenever a request comes in, the execution of your original code won't be disturbed. Instead, incoming connection requests will be opened, held onto, and appended to the event loop.
The use function, is the gateway into catching the events for incoming requests. Its an abstraction layer, but for the sake of simplicity, it's helpful to think of the use function like you would a setTimeout. Except, instead of waiting a set amount of time, it appends the callback to the event loop upon incoming http requests.
So, let's assume that there are two requests coming in to the server: T1 and T2. In your question you say they come in concurrently, since this is technically impossible, I'm going to assume they are one after the other, with a negligible time in between them.
Whichever request comes in first, will be handled first by the secondary thread from earlier. Once that connection has been opened, it's appended to the event loop, and we move on to the next request, and repeat.
At any point after the first request is added to the event loop, V8 can begin execution of the use callback.
A quick aside about readImage
Since its unclear whether readImage is from a particular library, something you wrote or otherwise, it's impossible to tell exactly what it will do in this case. There are only 2 possibilities though, so here they are:
It's entirely synchronous, never using an alternate thread or the event loop
function readImage (path, callback) {
let image = fs.readFileSync(path);
callback(null, image);
// a definition like this will force the callback to
// fully return before readImage returns. This means
// means readImage will block any subsequent calls.
}
It's entirely asynchronous, and takes advantage of fs' async callback.
function readImage (path, callback) {
fs.readFile(path, (err, data) => {
callback(err, data);
});
// a definition like this will force the readImage
// to immediately return, and allow exectution
// to continue.
}
For the purposes of explanation, I'll be operating under the assumption that readImage will immediately return, as proper asynchronous functions should.
Once the use callback execution is started, the following will happen:
The first console log will print.
readImage will kick off a worker thread and immediately return.
The second console log will print.
During all of this, its important to note, these operations are happening synchronously; No other event loop invocation can start until these are finished. readImage may be asynchronous, but calling it is not, the callback and usage of a worker thread is what makes it asynchronous.
After this use callback returns, the next request has probably already finished parsing and was added to the event loop, while V8 was busy doing our console logs and readImage call.
So the next use callback is invoked, and repeats the same process: log, kick off a readImage thread, log again, return.
After this point, the readImage functions (depending on how long they take) have probably already retrieved what they needed and appended their callback to the event loop. So they will get executed next, in order of whichever one retrieved its data first. Remember, these operations were happening in separate threads, so they happened not only in parallel to the main javascript thread, but also parallel to each other, so here, it doesn't matter which one got called first, it matters which one finished first, and got 'dibs' on the event loop.
Whichever readImage completed first will be the first one to execute. So, assuming no errors occured, we'll print out to the console, then write to the response for the corresponding request, held in lexical scope.
When that send returns, the next readImage callback will begin execution: console log, and writing to the response.
At this point, both readImage threads have died, and the event loop is empty, but the thread that holds the server port binding is keeping the process alive, waiting for something else to add to the event loop, and the cycle to continue.
I hope this helps you understand the mechanics behind the asynchronous nature of the example you provided.
For each incoming request, node will handle it one by one. That means there must be order, just like the queue, first in first serve. When node starts processing request, all synchronous code will execute, and asynchronous will pass to work thread, so node can start to process the next request. When the asynchrous part is done, it will go back to main thread and keep going.
So when your synchronous code takes too long, you block the main thread, node won't be able to handle other request, it's easy to test.
app.use('/index', function(req, res, next) {
// synchronous part
console.log("hello index routes was invoked");
var sum = 0;
// useless heavy task to keep running and block the main thread
for (var i = 0; i < 100000000000000000; i++) {
sum += i;
}
// asynchronous part, pass to work thread
readImage("path", function(err, content) {
// when work thread finishes, add this to the end of the event loop and wait to be processed by main thread
status = "Success";
if(err) {
console.log("err :", err);
status = "Error"
}
else {
console.log("Image read");
}
return res.send({ status: status });
});
// continue synchronous part at the same time.
var a = 4, b = 5;
console.log("sum =", a + b);
});
Node won't start processing the next request until finish all synchronous part. So people said don't block the main thread.
There are a number of articles that explain this such as this one
The long and the short of it is that nodejs is not really a single threaded application, its an illusion. The diagram at the top of the above link explains it reasonably well, however as a summary
NodeJS event-loop runs in a single thread
When it gets a request, it hands that request off to a new thread
So, in your code, your running application will have a PID of 1 for example. When you get request T1 it creates PID 2 that processes that request (taking 1 minute). While thats running you get request T2 which spawns PID 3 also taking 1 minute. Both PID 2 and 3 will end after their task is completed, however PID 1 will continue listening and handing off events as and when they come in.
In summary, NodeJS being 'single threaded' is true, however its just an event-loop listener. When events are heard (requests), it passes them off to a pool of threads that execute asynchronously, meaning its not blocking other requests.
You can simply create child process by shifting readImage() function in a different file using fork().
The parent file, parent.js:
const { fork } = require('child_process');
const forked = fork('child.js');
forked.on('message', (msg) => {
console.log('Message from child', msg);
});
forked.send({ hello: 'world' });
The child file, child.js:
process.on('message', (msg) => {
console.log('Message from parent:', msg);
});
let counter = 0;
setInterval(() => {
process.send({ counter: counter++ });
}, 1000);
Above article might be useful to you .
In the parent file above, we fork child.js (which will execute the file with the node command) and then we listen for the message event. The message event will be emitted whenever the child uses process.send, which we’re doing every second.
To pass down messages from the parent to the child, we can execute the send function on the forked object itself, and then, in the child script, we can listen to the message event on the global process object.
When executing the parent.js file above, it’ll first send down the { hello: 'world' } object to be printed by the forked child process and then the forked child process will send an incremented counter value every second to be printed by the parent process.
The V8 JS interpeter (ie: Node) is basically single threaded. But, the processes it kicks off can be async, example: 'fs.readFile'.
As the express server runs, it will open new processes as it needs to complete the requests. So the 'readImage' function will be kicked off (usually asynchronously) meaning that they will return in any order. However the server will manage which response goes to which request automatically.
So you will NOT have to manage which readImage response goes to which request.
So basically, T1 and T2, will not return concurrently, this is virtually impossible. They are both heavily reliant on the Filesystem to complete the 'read' and they may finish in ANY ORDER (this cannot be predicted). Note that processes are handled by the OS layer and are by nature multithreaded (in a modern computer).
If you are looking for a queue system, it should not be too hard to implement/ensure that images are read/returned in the exact order that they are requested.
Since there's not really more to add to the previous answer from Marcus - here's a graphic that explains the single threaded event-loop mechanism:
I'm using the npm library jsdiff, which has a function that determines the difference between two strings. This is a synchronous function, but given two large, very different strings, it will take extremely long periods of time to compute.
diff = jsdiff.diffWords(article[revision_comparison.field], content[revision_comparison.comparison]);
This function is called in a stack that handles an request through Express. How can I, for the sake of the user, make the experience more bearable? I think my two options are:
Cancelling the synchronous function somehow.
Cancelling the user request somehow. (But would this keep the function still running?)
Edit: I should note that given two very large and different strings, I want a different logic to take place in the code. Therefore, simply waiting for the process to finish is unnecessary and cumbersome on the load - I definitely don't want it to run for any long period of time.
fork a child process for that specific task, you can even create a queu to limit the number of child process that can be running in a given moment.
Here you have a basic example of a worker that sends the original express req and res to a child that performs heavy sync. operations without blocking the main (master) thread, and once it has finished returns back to the master the outcome.
Worker (Fork Example) :
process.on('message', function(req,res) {
/* > Your jsdiff logic goes here */
//change this for your heavy synchronous :
var input = req.params.input;
var outcome = false;
if(input=='testlongerstring'){outcome = true;}
// Pass results back to parent process :
process.send(req,res,outcome);
});
And from your Master :
var cp = require('child_process');
var child = cp.fork(__dirname+'/worker.js');
child.on('message', function(req,res,outcome) {
// Receive results from child process
console.log('received: ' + outcome);
res.send(outcome); // end response with data
});
You can perfectly send some work to the child along with the req and res like this (from the Master): (imagine app = express)
app.get('/stringCheck/:input',function(req,res){
child.send(req,res);
});
I found this on jsdiff's repository:
All methods above which accept the optional callback method will run in sync mode when that parameter is omitted and in async mode when supplied. This allows for larger diffs without blocking the event loop. This may be passed either directly as the final parameter or as the callback field in the options object.
This means that you should be able to add a callback as the last parameter, making the function asynchronous. It will look something like this:
jsdiff.diffWords(article[x], content[y], function(err, diff) {
//add whatever you need
});
Now, you have several choices:
Return directly to the user and keep the function running in the background.
Set a 2 second timeout (or whatever limit fits your application) using setTimeout as outlined in this
answer.
If you go with option 2, your code should look something like this
jsdiff.diffWords(article[x], content[y], function(err, diff) {
//add whatever you need
return callback(err, diff);
});
//if this was called, it means that the above operation took more than 2000ms (2 seconds)
setTimeout(function() { return callback(); }, 2000);
In the next code, I want to process several files at the same time without wait to the end of each other. For this reason, I first read the files (array) and then the callback is called to process an element of this array instance.
I have found a problem into this javascript code, exactly in a async for-loop, where this process is executed as a sync code instead of async.
var array = ['string1','string2','string3','string4'];
function processArray (arrayString,callback){
//Read file Example.csv thought sync way
try{
var ifs = new InputFileStream('Example.csv','utf8');
table = ifs.read(0);
ifs.close();
}catch(err){
console.log(err.stack);
}
callback(arrayString, table);
}
//Async for
for (var i=0; i<array.length; i++) {
processArray(array[i], function(arrayString, table){
//Here process the file values thought async way
console.log('processed_'+i);
});
}
You could put the call back in a setTimeout with a delay of 1ms. That will run it in the next block of execution and your loop will continue on.
e.g. use this:
setTimeout(function() { callback(arrayString, table); }, 1);
instead of this:
callback(arrayString, table);
An alternative to this is to run the callback on a separate thread using Web Workers. I don't think it would appropiate to provide a long answer describing how to do multi threaded JavaScript here so I'll just leave the link to the docs. https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers
where this process is executed as a sync code instead of async
I've seen that you just have find out the answers of your question, remember that JavaScript is single thread.
So, for that when you execute operations that require full use of CPU like for..loops, while, etc; you just will get your code running synchronous and not only that,
You will get your web page freeze if they are huge loops
Let me give you an example, this is a while loop that will run for 6 seconds, look how you cannot do anything in stackoverflow.
function blocker (ms) {
console.log('You cannot do anything')
var now = new Date().getTime();
while(true) {
if (new Date().getTime() > now +ms)
return;
}
}
blocker(6000) //This stop your entire web page for 6 seconds
If you really want to achieve running blocking code in the background read about Web Workers or you just can use a small library I wrote, that allow you to execute a blocking CPU function in the background, I called it GenericWebWorker
I need to perform several functions in my JavaScript/jQuery, but I want to avoid blocking the UI.
AJAX is not a viable solution, because of the nature of the application, those functions will easily reach the thousands. Doing this asynchroniously will kill the browser.
So, I need some way of chaining the functions the browser needs to process, and only send the next function after the first has finished.
The algorithm is something like this
For steps from 2 to 15
HTTP:GET amount of items for current step (ranges somewhere from a couple of hundred to multiple thousands)
For every item, HTTP:GET the results
As you see, I have two GET-request-"chains" I somehow need to manage... Especially the innermost loop crashes the browser near to instantly, if it's done asynchroniously - but I'd still like the user to be able to operate the page, so a pure (blocking) synchronous way will not work.
You can easily do this asynchronously without firing all requests at once. All you need to do is manage a queue. The following is pseudo-code for clarity. It's easily translatable to real AJAX requests:
// Basic structure of the request queue. It's a list of objects
// that defines ajax requests:
var request_queue = [{
url : "some/path",
callback : function_to_process_the_data
}];
// This function implements the main loop.
// It looks recursive but is not because each function
// call happens in an event handler:
function process_request_queue () {
// If we have anything in the queue, do an ajax call.
// Otherwise do nothing and let the loop end.
if (request_queue.length) {
// Get one request from the queue. We can either
// shift or pop depending on weather you prefer
// depth first or breadth first processing:
var req = request_queue.pop();
ajax(req.url,function(result){
req.callback(result);
// At the end of the ajax request process
// the queue again:
process_request_queue();
}
}
}
// Now get the ball rolling:
process_request_queue();
So basically we turn the ajax call itself into a pseudo loop. It's basically the classic continuation passing style of programming done recursively.
In your case, an example of a request would be:
request_queue.push({
url : "path/to/OUTER/request",
callback : function (result) {
// You mentioned that the result of the OUTER request
// should trigger another round of INNER requests.
// To do this simply add the INNER requests to the queue:
request_queue.push({
url : result.inner_url,
callback : function_to_handle_inner_request
});
}
});
This is quite flexible because you not only have the option of processing requests either breadth first or depth first (shift vs pop). But you can also use splice to add stuff to the middle of the queue or use unshift vs push to put requests at the head of the queue for high priority requests.
You can also increase the number of simultaneous requests by popping more than one request per loop. Just be sure to only call process_request_queue only once per loop to avoid exponential growth of simultaneous requests:
// Handling two simultaneous request channels:
function process_request_queue () {
if (request_queue.length) {
var req = request_queue.pop();
ajax(req.url,function(result){
req.callback(result);
// Best to loop on the first request.
// The second request below may never fire
// if the queue runs out of requests.
process_request_queue();
}
}
if (request_queue.length) {
var req = request_queue.pop();
ajax(req.url,function(result){
req.callback(result);
// DON'T CALL process_request_queue here
// We only want to call it once per "loop"
// otherwise the "loop" would grow into a "tree"
}
}
}
You could make that ASYNC and use a small library I wrote some time ago that will let you queue function calls.