Long-running asynchronous file copies block browser requests

Express.js serving a Remix app. The server-side code sets several timers at startup that do various background jobs every so often, one of which checks if a remote Jenkins build is finished. If so, it copies several large PDFs from one network path to another network path (both on GSA).
One function creates an array of chained glob+copyFile promises:
import { copyFile } from 'node:fs/promises';
import { basename } from 'node:path';
import { promisify } from 'node:util';
import glob from 'glob';
...
async function getFiles() {
  let result: Promise<void>[] = [];
  let globPromise = promisify(glob);
  for (let wildcard of wildcards) { // lots of file wildcards here
    result.push(globPromise(wildcard).then(
      (files: string[]) => {
        if (files.length < 1) {
          // do error stuff
        } else {
          for (let srcFile of files) {
            let tgtFile = tgtDir + basename(srcFile);
            return copyFile(srcFile, tgtFile);
          }
        }
      },
      (reason: any) => {
        // do error stuff
      }));
  }
  return result;
}
Another async function gets that array and does Promise.allSettled on it:
copyPromises = await getFiles();
console.log("CALLING ALLSETTLED.THEN()...");
return Promise.allSettled(copyPromises).then(
  (results) => {
    console.log("ALLSETTLED COMPLETE...");
Between the "CALLING" and "COMPLETE" messages, which can be separated by several minutes, the server no longer responds to browser requests, which time out.
However, during this time my other active backend timers can still be seen running and completing just fine in the server console log (I made one run every 5 seconds for test purposes, and it runs quite smoothly over and over while those file copies are crawling along).
So it's not blocking the server as a whole, it's seemingly just preventing browser requests from being handled. And once the "COMPLETE" message pops up in the log, browser requests are served up normally again.
The Express startup script basically just does this for Remix:
const { createRequestHandler } = require("@remix-run/express");
...
app.all(
  "*",
  createRequestHandler({
    build: require(BUILD_DIR),
    mode: process.env.NODE_ENV,
  })
);
What's going on here, and how do I solve this?

It's apparent no further discussion is forthcoming, and I've not determined why the async I/O functions are preventing server responses, so I'll go ahead and post an answer that was basically Konrad Linkowski's workaround solution from the comments: to use the OS to do the copies instead of using copyFile(). It boils down to this in place of the glob+copyFile calls inside getFiles:
const exec = util.promisify(require('node:child_process').exec);
...
async function getFiles() {
  ...
  result.push( exec("copy /y " + wildcard + " " + tgtDir) );
  ...
}
This does not exhibit any of the request-crippling behavior; for the entire time the copies are chugging away (many minutes), browser requests are handled instantly.
It's an OS-specific solution and thus non-portable as-is, but that's fine in our case since we will likely be using a Windows server for this app for many years to come. And certainly if needed, runtime OS-detection could be used to make the commands run on other OSes.

My guess is that this is due to Node's libuv using a thread pool (with synchronous access) for file system operations, and the pool size defaulting to only 4. See https://kariera.future-processing.pl/blog/on-problems-with-threads-in-node-js/ for a demonstration of the problem, or the question "Nodejs - What is maximum thread that can run same time as thread pool size is four?" for an explanation of why this is normally not a problem in network-heavy applications.
So if you have a filesystem-access-heavy application, try increasing the thread pool by setting the UV_THREADPOOL_SIZE environment variable.


Optimal way of sharing load between worker threads

What is the optimal way of sharing linear tasks between worker threads to improve performance?
Take the following example of a basic Deno web-server:
Main Thread
// Create an array of four worker threads
const workers = new Array<Worker>(4).fill(
  new Worker(new URL("./worker.ts", import.meta.url).href, {
    type: "module",
  })
);
for await (const req of server) {
  // Pass this request to a worker thread
}
worker.ts
self.onmessage = async (req) => {
  // Perform some linear task on the request and make a response
};
Would the optimal way of distributing tasks be something along the lines of this?
function* generator(): Generator<number> {
  let i = 0;
  while (true) {
    i == 3 ? (i = 0) : i++;
    yield i;
  }
}

const gen = generator();
const workers = new Array<Worker>(4).fill(
  new Worker(new URL("./worker.ts", import.meta.url).href, {
    type: "module",
  })
);

for await (const req of server) {
  // Pass this request to a worker thread
  workers[gen.next().value].postMessage(req);
}
Or is there a better way of doing this? Say, for example, using Atomics to determine which threads are free to accept another task.
When working with WorkerThread code like this, I found that the best way to distribute jobs was to have the WorkerThread ask the main thread for a job when the WorkerThread knew that it was done with the prior job. The main thread could then send it a new job in response to that message.
In the main thread, I maintained a queue of jobs and a queue of WorkerThreads waiting for a job. If the job queue was empty, then the WorkerThread queue would likely have some workerThreads in it waiting for a job. Then, any time a job is added to the job queue, the code checks to see if there's a workerThread waiting and, if so, removes it from the queue and sends it the next job.
Anytime a workerThread sends a message indicating it is ready for the next job, then we check the job queue. If there's a job there, it is removed and sent to that worker. If not, the worker is added to the WorkerThread queue.
This whole bit of logic was very clean, did not need atomics or shared memory (because everything was gated through the event loop of the main process) and wasn't very much code.
I arrived at this mechanism after trying several other ways that each had their own problems. In one case, I had concurrency issues, in another I was starving the event loop, in another, I didn't have proper flow control to the WorkerThreads and was overwhelming them and not distributing load equally.
There are some abstractions in Deno that handle these kinds of needs very easily, especially the pooledMap functionality.
So you have a server, which is an async iterator, and you would like to leverage threading to generate responses, since each response depends on a time-consuming heavy computation, right?
Simple.
import { serve } from "https://deno.land/std/http/server.ts";
import { pooledMap } from "https://deno.land/std@0.173.0/async/pool.ts";

const server = serve({ port: 8000 });
const ress = pooledMap(
  window.navigator.hardwareConcurrency - 1,
  server,
  (req) => new Promise((v) => v(respondWith(req)))
);
for await (const res of ress) {
// respond with res
}
That's it. In this particular case the respondWith function carries the heavy calculation that prepares your response object. If it's already an async function, you don't even need to wrap it in a promise. Here I have simply used one less than the number of available cores, but it's up to you to decide how many threads to spawn.

Is it safe to rely on Node.js require behavior to implement Singletons?

Suppose I have a module implemented like this:
const dbLib = require('db-lib');
const dbConfig = require('db-config');

const dbInstance = dbLib.createInstance({
  host: dbConfig.host,
  database: dbConfig.database,
  user: dbConfig.user,
  password: dbConfig.password,
});

module.exports = dbInstance;
Here an instance of a database connection pool is created and exported. Then suppose db-instance.js is required several times throughout the app. Node.js should execute its code only once and always return the same single instance of the database pool. Is it safe to rely on this behavior of Node.js's require? I want to use it so that I don't need to implement dependency injection.
Every file that you require in Node.js is a singleton.
Also, require is a synchronous operation and works deterministically. This means it is predictable and will always behave the same way.
This behaviour is not limited to require. Node.js has only one thread in its event loop, and it works like this (slightly simplified):
Look for a task it can do
Take the task
Run the task synchronously from beginning to end
If there are any asynchronous calls, push them into a "do later" queue, but never start them before the synchronous part of the current task is done (unless a worker is spawned, but you don't need the details of that here)
Repeat the whole process
For example, imagine this code is in a file infinite.js:
setTimeout(() => {
  while (true) {
    console.log('only this');
  }
}, 1000)

setTimeout(() => {
  while (true) {
    console.log('never this');
  }
}, 2000)

while (someConditionThatTakes5000Miliseconds) {
  console.log('requiring');
}
When you require this file, it first registers the first setTimeout in the "do later" queue as "resolve after 1000 ms", and the second as "resolve after 2000 ms" (note that this is not "run after 1000 ms").
Then it runs the while loop for 5000 ms (given a condition like that), and nothing else happens in your code.
After 5000 ms the require completes, the synchronous part is finished, and the event loop looks for a new task to do. The first one it sees is the setTimeout with the 1000 ms delay (again: after 1000 ms it was merely marked as "can be taken by the event loop"; you don't know when it will actually run).
That callback contains a never-ending while loop, so you will see "only this" in the console. The second setTimeout will never be taken from the event loop: it is marked as "can be taken" after 2000 ms, but by then the event loop is already stuck in the never-ending while loop.
With this knowledge, you can use require (and other Node.js features) with confidence.
Conclusion: require is synchronous and deterministic. Once it finishes requiring a file (the output is an object with the methods and properties you export, or an empty object if you don't export anything), a reference to that object is saved in Node.js's module cache. When you require the file from somewhere else, Node.js first looks into that cache and, if it finds an entry, just uses the reference to the object and therefore never executes the module twice.
POC:
Create file infinite.js
const time = Date.now();

setTimeout(() => {
  let i = 0;
  console.log('Now I get stuck');
  while (true) {
    i++;
    if (i % 100000000 === 0) {
      console.log(i);
    }
  }
  console.log('Never reach this');
}, 1000)

setTimeout(() => {
  while (true) {
    console.log('never this');
  }
}, 2000)

console.log('Prepare for 5sec wait')
while (new Date() < new Date(time + 5 * 1000)) {
  // blocked
}
console.log('Done, lets allow other')
Then create server.js in the same folder with:
console.log('start program');
require('./infinite');
console.log('done with requiring');
Run it with node server.
This will be the output (with the numbers never ending):
start program
Prepare for 5sec wait
Done, lets allow other
done with requiring
Now I get stuck
100000000
200000000
300000000
400000000
500000000
600000000
700000000
800000000
900000000
The documentation of Node.js about modules explains:
Modules are cached after the first time they are loaded. This means (among other things) that every call to require('foo') will get exactly the same object returned, if it would resolve to the same file.
Multiple calls to require('foo') may not cause the module code to be executed multiple times.
It is also worth mentioning the situations in which require produces a singleton and those in which that goal cannot be reached (and why):
Modules are cached based on their resolved filename. Since modules may resolve to a different filename based on the location of the calling module (loading from node_modules folders), it is not a guarantee that require('foo') will always return the exact same object, if it would resolve to different files.
Additionally, on case-insensitive file systems or operating systems, different resolved filenames can point to the same file, but the cache will still treat them as different modules and will reload the file multiple times. For example, require('./foo') and require('./FOO') return two different objects, irrespective of whether or not ./foo and ./FOO are the same file.
To summarize: if your module name is unique inside the project, you'll always get a singleton. Otherwise, when there are two modules with the same name, require-ing that name in different places may produce different objects. To ensure they produce the same object (the desired singleton), you have to refer to the module in a way that resolves to the same file in both places.
You can use require.resolve() to find out the exact file that is resolved by a require statement.

Cloud Functions for Firebase: completing long processes without touching maximum timeout

I have to transcode videos from webm to mp4 when they're uploaded to firebase storage. I have a code demo here that works, but if the uploaded video is too large, firebase functions will time out on me before the conversion is finished. I know it's possible to increase the timeout limit for the function, but that seems messy, since I can't ever confirm the process will take less time than the timeout limit.
Is there some way to stop firebase from timing out without just increasing the maximum timeout limit?
If not, is there a way to complete time consuming processes (like video conversion) while still having each process start using firebase function triggers?
If even completing time-consuming processes using firebase functions isn't something that really exists, is there some way to speed up the conversion of fluent-ffmpeg without touching the quality that much? (I realize this part is a lot to ask. I plan on lowering the quality if I absolutely have to, as the reason WebMs are being converted to mp4 is for iOS devices.)
For reference, here's the main portion of the demo I mentioned. As I said before, the full code can be seen here, but this section of the code copied over is the part that creates the Promise that makes sure the transcoding finishes. The full code is only 70 something lines, so it should be relatively easy to go through if needed.
const functions = require('firebase-functions');
const mkdirp = require('mkdirp-promise');
const gcs = require('@google-cloud/storage')();
const Promise = require('bluebird');
const ffmpeg = require('fluent-ffmpeg');
const ffmpeg_static = require('ffmpeg-static');
(There's a bunch of text parsing code here, followed by this next chunk of code inside an onChange event)
function promisifyCommand(command) {
  return new Promise((resolve, reject) => {
    command
      .on('end', () => { resolve() })
      .on('error', (error) => { reject(error) })
      .run();
  })
}

return mkdirp(tempLocalDir).then(() => {
  console.log('Directory Created')
  // Download item from bucket
  const bucket = gcs.bucket(object.bucket);
  return bucket.file(filePath).download({destination: tempLocalFile}).then(() => {
    console.log('file downloaded to convert. Location:', tempLocalFile)
    cmd = ffmpeg({source: tempLocalFile})
      .setFfmpegPath(ffmpeg_static.path)
      .inputFormat(fileExtension)
      .output(tempLocalMP4File)
    cmd = promisifyCommand(cmd)
    return cmd.then(() => {
      // Getting here takes forever, because video transcoding takes forever!
      console.log('mp4 created at ', tempLocalMP4File)
      return bucket.upload(tempLocalMP4File, {
        destination: MP4FilePath
      }).then(() => {
        console.log('mp4 uploaded at', filePath);
      });
    })
  });
});
Cloud Functions for Firebase is not well suited (and not supported) for long-running tasks that can go beyond the maximum timeout. Your only real chance at using only Cloud Functions to perform very heavy compute operations is to find a way to split up the work into multiple function invocations, then join the results of all that work into a final product. For something like video transcoding, that sounds like a very difficult task.
Instead, consider using a function to trigger a long-running task in App Engine or Compute Engine.
As a follow-up for anyone trying to figure out how to handle transcoding videos or other long processes: here's a version of the same code example that instead sends an HTTP request to a Google App Engine process, which transcodes the file. There's no documentation for it as of right now, but looking at the Firebase/functions/index.js code and the app.js code may help you with your issue.
https://github.com/Scew5145/GCSConvertDemo
Good luck.

How to be sure two clients are not requesting at the same time corrupting state

I am new to Node.js and I am building a small app that relies on the filesystem.
Let's say that the goal of my app is to fill a file like this:
1
2
3
4
..
And I want each request to write a new line to the file, in the right order.
Can I achieve that?
I know I can't leave my question here without including any code, so here it is. I am using an Express server:
(We imagine that the file contains only 1 at the first code launch)
import express from 'express'
import fs from 'fs'

let app = express();

app.all('*', function (req, res, next) {
  // At every request, I want to write my file
  writeFile()
  next()
})

app.get('/', function (req, res) {
  res.send('Hello World')
})

app.listen(3000, function (req, res) {
  console.log('listening on port 3000')
})

function writeFile() {
  // I get the file
  let content = fs.readFileSync('myfile.txt', 'utf-8')
  // I get an array of the numbers
  let numbers = content.split('\n').map(item => {
    return parseInt(item)
  })
  // I compute the new number and push it to the list
  let new_number = numbers[numbers.length - 1] + 1
  numbers.push(new_number)
  // I write back the file
  fs.writeFileSync('myfile.txt', numbers.join('\n'))
}
My guess was that, since the process is synchronous, nothing else could happen at the same moment, but I was really not sure...
If I am unclear, please tell me in the comments
If I understood you correctly, what you are worried about is a race condition: if two clients reach the HTTP server at the same time, the file could be saved with the same contents, with the number incremented only once instead of twice.
The simple fix is to make sure the shared resource is only accessed or modified by one operation at a time. In this case, using synchronous methods fixes your problem: while they are executing, the whole Node process is blocked and will not do anything else.
If you replace the synchronous methods with their asynchronous counterparts without any other concurrency control measures, then your code is definitely vulnerable to race conditions and corrupted state.
Now, if this is the only thing your application is doing, it's probably best to keep it this way, as it's very simple. But let's say you want to add other functionality; in that case you probably want to avoid synchronous methods, since they block the process and prevent any concurrency.
A simple way to add concurrency control is to keep a counter of how many requests are in flight. If no write is running, start one immediately; otherwise just bump the counter. Each time a write finishes, decrement the counter and, if more requests arrived in the meantime, run again:
let counter = 0;

app.all('*', function (req, res, next) {
  // At every request, I want to write my file
  writeFile();
  next();
});

function writeFile() {
  counter++;
  if (counter === 1) {
    // nothing else in flight: start working immediately
    work(onWriteFileDone);
  }
  // else: a write is already running; the counter remembers this request

  function onWriteFileDone() {
    counter--;
    if (counter > 0) {
      work(onWriteFileDone); // service the requests that queued up
    }
  }

  function work(callback) {
    // I get the file
    fs.readFile('myfile.txt', 'utf-8', function (err, content) {
      // ignore the error because life is too short on stackoverflow questions...
      // I get an array of the numbers
      let numbers = content.split('\n').map(item => parseInt(item, 10));
      // I compute the new number and push it to the list
      let new_number = numbers[numbers.length - 1] + 1;
      numbers.push(new_number);
      // I write back the file
      fs.writeFile('myfile.txt', numbers.join('\n'), callback);
    });
  }
}
Of course this function doesn't take any arguments, but if you want to add some, you'll need a queue instead of the counter, with the arguments stored in the queue.
In practice, don't write your own concurrency mechanisms; there are plenty in the Node ecosystem. For example, you can use the async module, which provides a queue.
Note that if you only have one process, you don't have to worry about multiple threads, since in Node.js a single process has only one thread of execution at a time. If multiple processes write to the file, things get more complicated, but let's keep that for another question if it isn't already covered. Operating systems provide a few different ways to handle this; you could also use your own lock files, a dedicated writer process, or a message queue process.

NodeJS promise blocking requests

I am quite confused about why my promise is blocking the Node app's requests.
Here is my simplified code:
var express = require('express');
var someModule = require('somemodule');

app = express();

app.get('/', function (req, res) {
  res.status(200).send('Main');
});

app.get('/status', function (req, res) {
  res.status(200).send('Status');
});

// Init Promise
someModule.doSomething({}).then(function () {}, function () {}, function (progress) {
  console.log(progress);
});

var server = app.listen(3000, function () {
  var host = server.address().address;
  var port = server.address().port;
  console.log('Example app listening at http://%s:%s in %s environment', host, port, app.get('env'));
});
And the module:
var q = require('q');

function SomeModule() {
  this.doSomething = function () {
    return q.Promise(function (resolve, reject, notify) {
      for (var i = 0; i < 10000; i++) {
        notify('Progress ' + i);
      }
      resolve();
    });
  }
}

module.exports = SomeModule;
Obviously this is very simplified. The promise function does some work that takes anywhere from 5 to 30 minutes and has to run only when server starts up.
There is NO async operation in that promise function. Its just a lot of data processing, loops etc.
I want to be able to make requests right away, though. So what I expect is that when I run the server, I can go straight to 127.0.0.1:3000 and see Main, and the same for any other requests.
Eventually I want to see the progress of that task by accessing /status, but I'm sure I can make that work once the server behaves as expected.
At the moment, when I open / it just hangs until the promise job finishes...
Obviously I'm doing something wrong...
If your task is IO-bound go with process.nextTick. If your task is CPU-bound asynchronous calls won't offer much performance-wise. In that case you need to delegate the task to another process. An example solution would be to spawn a child process, do the work and pipe the results back to the parent process when done.
See nodejs.org/api/child_process.html for more.
If your application needs to do this often then forking lots of child processes quickly becomes a resource hog - each time you fork, a new V8 process will be loaded into memory. In this case it is probably better to use one of the multiprocessing modules like Node's own Cluster. This module offers easy creation and communication between master-worker processes and can remove a lot of complexity from your code.
See also a related question: Node.js - Sending a big object to child_process is slow
The main thread of JavaScript in node.js is single threaded. So, if you run some giant loop that is processor-bound, it will hog the one thread and no other JS will run in node.js until that operation is done.
So, when you call:
someModule.doSomething()
and that is all synchronous, then it does not return until it is done executing and thus the lines of code following that don't execute until the doSomething() method returns. And, just so you understand, the use of promises with synchronous CPU-hogging code does not help your cause at all. If it's synchronous and CPU bound, it's just going to take a long time to run before anything else can run.
If there is I/O involves in the loop (like disk I/O or network I/O), then there are opportunities to use async I/O operations and make the code non-blocking. But, if not and it's just a lot of CPU stuff, then it will block until done and no other code will run.
Your opportunities for changing this are:
Run the CPU consuming code in another process. Either create a separate program that you run as a child process that you can pass input to and get output from or create a separate server that you can then make async requests to.
Break the long-running work into chunks where you execute 100ms chunks of work at a time, then yield the processor back to the event loop (using something like setTimeout()) to allow other things in the event queue to be serviced and run before you pick up and run the next chunk of work. You can see Best way to iterate over an array without blocking the UI for ideas on how to chunk synchronous work.
As an example, you could chunk your current loop. This runs up to 100ms of cycles and then breaks execution to give other things a chance to run. You can set the cycle time to whatever you want.
function SomeModule() {
  this.doSomething = function () {
    return q.Promise(function (resolve, reject, notify) {
      var cntr = 0, numIterations = 10000, timePerSlice = 100;

      function run() {
        if (cntr < numIterations) {
          var start = Date.now();
          while (Date.now() - start < timePerSlice && cntr < numIterations) {
            notify('Progress ' + cntr);
            ++cntr;
          }
          // give some other things a chance to run and then call us again
          // setImmediate() is also an option here, but setTimeout() gives all
          // other operations a chance to run alongside this operation
          setTimeout(run, 10);
        } else {
          resolve();
        }
      }

      run();
    });
  }
}
