Queueing file system access in Node - javascript

I have a wrapper class managing file system access on the server in my web app.
async saveArchiveData(id, data) { /* saving data to the disk using fs */ }
async getArchiveData(id) { /* read data from disk */ }
This is all written in TypeScript but cut down to the relevant parts for readability.
These functions may be called in such a way that getArchiveData tries to read data that is currently being written by saveArchiveData. In that case I don't want getArchiveData to fail; I want it to wait until the data is available and only return then (so, kind of like queueing those functions). What is the best practice for this?
Thanks!

Use a promise queue:
constructor() {
  this.queue = Promise.resolve();
}
_enqueue(fn) {
  const promise = this.queue.then(fn);
  this.queue = promise.then(x => void x, _err => { /* ignore */ });
  return promise;
}
async _writeData(id, data) { /* saving data to the disk using fs */ }
async _readData(id) { /* read data from disk */ }
saveArchiveData(id, data) {
  return this._enqueue(() => this._writeData(id, data));
}
getArchiveData(id) {
  return this._enqueue(() => this._readData(id));
}
This will guarantee that _writeData and _readData will never run concurrently (per instance of your class).
You may further want to have one queue per id if that fits your application.
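For example, a per-id variant of the same idea might look like this (a minimal sketch; the ArchiveStore name, the Map of queues and the stubbed _writeData/_readData bodies are illustrative, not part of the original answer):
class ArchiveStore {
  constructor() {
    this.queues = new Map(); // one independent promise chain per id
  }
  _enqueue(id, fn) {
    const tail = this.queues.get(id) || Promise.resolve();
    const promise = tail.then(fn);
    // keep the chain alive even if fn rejects, so later calls for this id still run
    this.queues.set(id, promise.then(x => void x, _err => { /* ignore */ }));
    return promise;
  }
  async _writeData(id, data) { /* saving data to the disk using fs */ }
  async _readData(id) { /* read data from disk */ }
  saveArchiveData(id, data) {
    return this._enqueue(id, () => this._writeData(id, data));
  }
  getArchiveData(id) {
    return this._enqueue(id, () => this._readData(id));
  }
}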

So this issue is called "consistency". Typically it is dealt with via a pattern called "eventual consistency", in which reads may lag slightly behind writes but are accurate enough. What you've described is "strong consistency", which means reads always return the most up-to-date information possible, but it is much more difficult to build and is typically only done when it's absolutely necessary.

Related

Long-running asynchronous file copies block browser requests

I have Express.js serving a Remix app. The server-side code sets several timers at startup that run various background jobs periodically, one of which checks whether a remote Jenkins build is finished. If so, it copies several large PDFs from one network path to another network path (both on GSA).
One function creates an array of chained glob+copyFile promises:
import { copyFile } from 'node:fs/promises';
import { promisify } from "util";
import glob from "glob";
...
async function getFiles() {
  let result: Promise<void>[] = [];
  let globPromise = promisify(glob);
  for (let wildcard of wildcards) { // lots of file wildcards here
    result.push(globPromise(wildcard).then(
      (files: string[]) => {
        if (files.length < 1) {
          // do error stuff
        } else {
          for (let srcFile of files) {
            let tgtFile = tgtDir + basename(srcFile);
            return copyFile(srcFile, tgtFile);
          }
        }
      },
      (reason: any) => {
        // do error stuff
      }));
  }
  return result;
}
Another async function gets that array and does Promise.allSettled on it:
copyPromises = await getFiles();
console.log("CALLING ALLSETTLED.THEN()...");
return Promise.allSettled(copyPromises).then(
  (results) => {
    console.log("ALLSETTLED COMPLETE...");
Between the "CALLING" and "COMPLETE" messages, which can take on the order of several minutes, the server no longer responds to browser requests, which timeout.
However, during this time my other active backend timers can still be seen running and completing just fine in the server console log (I made one run every 5 seconds for test purposes, and it runs quite smoothly over and over while those file copies are crawling along).
So it's not blocking the server as a whole, it's seemingly just preventing browser requests from being handled. And once the "COMPLETE" message pops up in the log, browser requests are served up normally again.
The Express startup script basically just does this for Remix:
const { createRequestHandler } = require("@remix-run/express");
...
app.all(
  "*",
  createRequestHandler({
    build: require(BUILD_DIR),
    mode: process.env.NODE_ENV,
  })
);
What's going on here, and how do I solve this?
It's apparent no further discussion is forthcoming, and I've not determined why the async I/O functions are preventing server responses, so I'll go ahead and post an answer that was basically Konrad Linkowski's workaround solution from the comments: to use the OS to do the copies instead of using copyFile(). It boils down to this in place of the glob+copyFile calls inside getFiles:
const exec = util.promisify(require('node:child_process').exec);
...
async function getFiles() {
  ...
  result.push( exec("copy /y " + wildcard + " " + tgtDir) );
  ...
}
This does not exhibit any of the request-crippling behavior; for the entire time the copies are chugging away (many minutes), browser requests are handled instantly.
It's an OS-specific solution and thus non-portable as-is, but that's fine in our case since we will likely be using a Windows server for this app for many years to come. And certainly if needed, runtime OS-detection could be used to make the commands run on other OSes.
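For instance, a runtime check along these lines could pick a suitable shell command per platform (a rough sketch; the non-Windows branch and its flags are assumptions, untested here):
// inside the loop over wildcards, in place of the hard-coded Windows command
const cmd = process.platform === "win32"
  ? "copy /y " + wildcard + " " + tgtDir
  : "cp " + wildcard + " " + tgtDir; // POSIX shells expand the wildcard themselves
result.push(exec(cmd));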
I suspect this is due to Node's libuv using a thread pool (with synchronous access) for file system operations, and the default pool size is only 4. See https://kariera.future-processing.pl/blog/on-problems-with-threads-in-node-js/ for a demonstration of the problem, or "Nodejs - What is maximum thread that can run same time as thread pool size is four?" for an explanation of why this is normally not a problem in network-heavy applications.
So if you have a filesystem-access-heavy application, try increasing the thread pool by setting the UV_THREADPOOL_SIZE environment variable.
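For example (a sketch; 16 is an arbitrary value chosen only to illustrate, the default is 4):
// Preferred: set the variable when launching the process, e.g.
//   UV_THREADPOOL_SIZE=16 node server.js          (Linux/macOS)
//   set UV_THREADPOOL_SIZE=16 && node server.js   (Windows cmd)
// Setting it at the very top of the entry script, before anything touches the
// fs/dns/crypto thread pool, usually works as well:
process.env.UV_THREADPOOL_SIZE = "16";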

Optimizing a file content parser class written in TypeScript

I have a TypeScript module (used by a VS Code extension) that accepts a directory and parses the content of the files it contains. For directories containing a large number of files this parsing takes a while, so I would like some advice on how to optimize it.
I don't want to copy/paste the entire class files, so I'm using mock pseudocode containing the parts I think are relevant.
class Parser {
  constructor(_dir: string) {
    this.dir = _dir;
  }
  async parse() {
    let tree: any = getFileTree(this.dir);
    try {
      let parsedObjects: MyDTO[] = await this.iterate(tree.children);
    } catch (err) {
      console.error(err);
    }
  }
  async iterate(children: any[]): Promise<MyDTO[]> {
    let objs: MyDTO[] = [];
    for (let i = 0; i < children.length; i++) {
      let child: any = children[i];
      if (child.type === Constants.FILE) {
        let dto: FileDTO = await this.heavyFileProcessingMethod(file); // this takes time
        objs.push(dto);
      } else {
        // child is a folder
        let dtos: MyDTO[] = await this.iterateChildItems(child.children);
        let dto: FolderDTO = new FolderDTO();
        dto.files = dtos.filter(item => item instanceof FileDTO);
        dto.folders = dtos.filter(item => item instanceof FolderDTO);
        objs.push(dto);
      }
    }
    return objs;
  }
  async heavyFileProcessingMethod(file: string): Promise<FileDTO> {
    let content: string = readFile(file); // util method to synchronously read file content using fs
    return new FileDTO(await this.parseFileContent(content));
  }
  async parseFileContent(content): Promise<any[]> {
    // parsing happens here and the file content is parsed into separate blocks
    let ast: any = await convertToAST(content); // uses an asynchronous method of an external dependency to convert content to AST
    let blocks = parseToBlocks(ast); // synchronous method called to convert AST to blocks
    return await this.processBlocks(blocks);
  }
  async processBlocks(blocks: any[]): Promise<any[]> {
    for (let i = 0; i < blocks.length; i++) {
      let block: Block = blocks[i];
      if (block.condition === true) {
        // this can take some time because if this condition is true, some external assets will be downloaded (via internet)
        // on to the caller's machine + some additional processing takes place
        await processBlock(block);
      }
    }
    return blocks;
  }
}
I'm still sort of a beginner with TypeScript/Node.js. I am looking for a multithreading/Java-esque solution here if possible. In the context of Java, this.heavyFileProcessingMethod would be an instance of a Callable object, these objects would be pushed into a List<Callable>, and the list would then be executed in parallel by an ExecutorService returning a List<Future<Object>>.
Basically I want all files to be processed in parallel, but the function must wait for all the files to be processed before returning from the method (so the entire iterate method will only take as long as the time taken to parse the largest file).
Been reading on running tasks in worker threads in NodeJS, can something like this be used in TypeScript as well? If so, can it be used in this situation? If my Parser class needs to be refactored to accommodate this change (or any other suggested change) it's no issue.
EDIT: Using Promise.all
iterate(children: any[]): Promise<MyDTO>[] {
  let promises: Promise<MyDTO>[] = [];
  for (let i = 0; i < children.length; i++) {
    let child: any = children[i];
    if (child.type === Constants.FILE) {
      let promise: Promise<FileDTO> = this.heavyFileProcessingMethod(file); // this takes time
      promises.push(promise);
    } else {
      // child is a folder
      let dtos: Promise<MyDTO>[] = this.iterateChildItems(child.children);
      let promise: Promise<FolderDTO> = this.getFolderPromise(dtos);
      promises.push(promise);
    }
  }
  return promises;
}
async getFolderPromise(promises: Promise<MyDTO>[]): Promise<FolderDTO> {
  return Promise.all(promises).then(dtos => {
    let dto: FolderDTO = new FolderDTO();
    dto.files = dtos.filter(item => item instanceof FileDTO);
    dto.folders = dtos.filter(item => item instanceof FolderDTO);
    return dto;
  })
}
first: Typescript is really Javascript
Typescript is just Javascript with static type checking, and those static types are erased when the TS is transpiled to JS. Since your question is about algorithms and runtime language features, Typescript has no bearing; your question is a Javascript one. So right off the bat that tells us the answer to
Been reading on running tasks in worker threads in NodeJS, can something like this be used in TypeScript as well?
is YES.
As to the second part of your question,
can it be used in this situation?
the answer is YES, but...
second: Use Worker Threads only if the task is CPU bound.
Can does not necessarily mean you should. It depends on whether your processes are IO bound or CPU bound. If they are IO bound, you're most likely far better off relying on Javascript's longstanding asynchronous programming model (callbacks, Promises). But if they are CPU bound, then using Node's relatively new support for Thread-based parallelism is more likely to result in throughput gains. See Node.js Multithreading!, though I think this one is better: Understanding Worker Threads in Node.js.
While worker threads are lighter weight than previous Node options for parallelism (spawning child processes), they are still relatively heavyweight compared to threads in Java. Each worker runs in its own Node VM, and regular variables are not shared (you have to use special data types and/or message passing to share data). It had to be done this way because Javascript is designed around a single-threaded programming model. It's extremely efficient within that model, but that design makes support for multithreading harder. Here's a good SO answer with useful info for you: https://stackoverflow.com/a/63225073/8910547
My guess is your parsing is more IO bound, and the overhead of spawning worker threads will outweigh any gains. But give it a go and it will be a learning experience. :)
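If you do end up experimenting with worker threads, a minimal single-file sketch looks roughly like this (plain Node JavaScript; heavyParse is a hypothetical stand-in for the real parsing routine, and with TypeScript the Worker path would have to point at the compiled .js file):
const { Worker, isMainThread, parentPort, workerData } = require('node:worker_threads');

if (isMainThread) {
  // main thread: spawn one worker per file (a real implementation would cap concurrency with a pool)
  const runInWorker = (file) =>
    new Promise((resolve, reject) => {
      const w = new Worker(__filename, { workerData: file });
      w.once('message', resolve);
      w.once('error', reject);
    });
  // usage: const dtos = await Promise.all(allFiles.map(runInWorker));
} else {
  // worker thread: CPU-bound parsing happens off the main thread
  const result = heavyParse(workerData); // hypothetical parsing function
  parentPort.postMessage(result);
}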
It looks like your biggest problem is navigating the nested directory structure and keeping individual per-file and per-dir promises organized. My suggestion would be to do that in a simpler way.
Have a function that takes a directory path and returns a flat list of all files it can find, no matter how deep, in a manner similar to the find program. This function can be like this:
import * as fs from 'fs/promises'
import * as path from 'path'
async function fileList(dir: string): Promise<string[]> {
  let entries = await fs.readdir(dir, {withFileTypes: true})
  let files = entries
    .filter(e => e.isFile())
    .map(e => path.join(dir, e.name))
  let dirs = entries
    .filter(e => e.isDirectory())
    .map(e => path.join(dir, e.name))
  let subLists = await Promise.all(dirs.map(d => fileList(d)))
  return files.concat(subLists.flat())
}
Basically, obtain directory entries, find (sub)directories and iterate them recursively in parallel. Once the iteration is complete, flatten, merge and return the list.
Using this function, you can apply your heavy task to all files at once by simply using map + Promise.all:
let allFiles = await fileList(dir)
let results = await Promise.all(allFiles.map(f => heavyTask(f)))

JS Program using Async & Recursion gets stuck in Main process

I am trying to retrieve the Firebase users in a project. I can retrieve them, but the main program never ends. Whether I debug in VS Code or run the script with npm run, it gets stuck either way. In VS Code there is nothing left on the stack; it just never stops.
Users function (it returns the users without a problem):
admin.auth().listUsers returns a ListUsersResult object with a users array and a pageToken for fetching the next batch.
const BATCH_SIZE = 2;
const listAllUsers = async (nextPageToken = undefined) => {
  let listUsersResult = await admin.auth().listUsers(BATCH_SIZE, nextPageToken);
  if (listUsersResult.pageToken) {
    return listUsersResult.users.concat(await listAllUsers(listUsersResult.pageToken));
  } else {
    return listUsersResult.users;
  }
};
Main Routine (this is the one that gets stuck)
const uploadUsersMain = async () => {
  try {
    // if I comment out this call, then there is no problem
    let firestoreUsers = await listAllUsers();
  } catch (error) {
    log.error(`Unable to retrieve users ${error}`)
  } finally {
    // do stuff
  }
}
uploadUsersMain();
What could be the issue that stops the main program from ending? What should I look for? Thanks
To shut down your script you should use
await admin.app().delete();
Node.js will quietly keep running as long as there are open handles, which can really be anything; a network socket certainly counts.
You could also use process.exit(), but it's good practice to properly shut down the SDK, instead of exiting abruptly when it might not be finished.
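In the context of the posted uploadUsersMain, that would mean something like this (a sketch, assuming nothing else in the program still needs the default app afterwards):
const uploadUsersMain = async () => {
  try {
    let firestoreUsers = await listAllUsers();
    // ...process the users...
  } catch (error) {
    log.error(`Unable to retrieve users ${error}`);
  } finally {
    // release the SDK's open handles so the process can exit naturally
    await admin.app().delete();
  }
};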

WebGL: Callback hell in the resource loader

Please help me deal with this garbage I produced:
Program.prototype.init = function()
{
  loadText('../res/shaders/blinnPhong-shader.vsh', function (vshErr, vshText) {
    if (vshErr) {
      alert('Fatal error loading vertex shader.');
      console.error(vshErr);
    } else {
      loadText('../res/shaders/blinnPhong-shader.fsh', function (fshErr, fshText) {
        if (fshErr) {
          alert('Fatal error loading fragment shader.');
          console.error(fshErr);
        } else {
          loadJSON('../res/models/dragon.json', function (modelErr, modelObj) {
            if (modelErr) {
              alert('Fatal error loading model.');
              console.error(modelErr);
            } else {
              loadImage('../res/textures/susanTexture.png', function (imgErr, img) {
                if (imgErr) {
                  alert('Fatal error loading texture.');
                  console.error(imgErr);
                } else {
                  this.run = true;
                  RunProgram(vshText, fshText, img, modelObj);
                }
              });
            }
          });
        }
      });
    }
  });
};
My actual goal is to abstract the resource loading process for a WebGL program.
That means in the future there will be arrays of meshes, textures and shaders, and I want to be able to express dependencies between resources. For example: I want to create two GameObjects, One and Two. One uses shaders and loads its geometry from a mesh but has no texture, whereas Two uses the same shaders as One but uses its own mesh and also needs a texture. What principles could I use to build these dependencies in JavaScript (with asynchronous loading and so on)?
Edit:
So the following is happening with this code: I kept callbacks for now. However, this method is part of a singleton object. I edited the code because in the last else case I am setting a flag of the program to true. I keep a global reference to the program object in my main. However, due to the callbacks that reference is somehow lost: the global reference keeps its flag at false, so the main loop is never reached. It is clearly a problem with the callbacks, since the flag is set correctly when I call "this.run = true" outside the nested callbacks. Any advice on that?
Using modern APIs like Promises, Fetch and sugar like arrow functions, your code can become:
Program.prototype.init = function () {
  return Promise.all([
    fetch('../res/shaders/blinnPhong-shader.vsh').then(r => r.text()),
    fetch('../res/shaders/blinnPhong-shader.fsh').then(r => r.text()),
    fetch('../res/models/dragon.json').then(r => r.json()),
    new Promise(function (resolve, reject) {
      var i = new Image();
      i.onload = () => resolve(i);
      i.onerror = () => reject('Error loading image ' + i.src);
      i.src = '../res/textures/susanTexture.png';
    })
  ]).then(([vshText, fshText, modelObj, img]) => RunProgram(vshText, fshText, img, modelObj));
};
You could spice things up even further by using related ES2017 features like async functions/await or go all in on compatibility by forgoing arrow functions and using seamless polyfills for promises and fetch. For some simple request caching, wrap fetch:
const fetchCache = Object.create(null);
function fetchCached (url) {
  if (fetchCache[url])
    return Promise.resolve(fetchCache[url]);
  return fetch.apply(null, arguments).then(r => fetchCache[url] = r);
}
Note that you want your resources to be unique, so the above-mentioned caching still needs another layer of actual GPU resource caching on top of it: you don't want to create multiple shader programs with the same shader code, or array buffers with the same vertex data in them.
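As a rough illustration of that second layer (a sketch; getProgram and compileAndLinkProgram are hypothetical names, not an existing API):
const programCache = new Map();
function getProgram(gl, vshSrc, fshSrc) {
  const key = vshSrc + '\u0000' + fshSrc; // key the cache on the shader sources
  if (!programCache.has(key)) {
    programCache.set(key, compileAndLinkProgram(gl, vshSrc, fshSrc)); // hypothetical compile-and-link helper
  }
  return programCache.get(key);
}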
Your actual core question as to how you could manage dependencies is a bit too broad / application specific to be answered here on SO. In regards to managing the async nature in such an environment I see two options:
Use placeholder resources and seamlessly replace them once the actual resources are loaded
Wait until everything is loaded before you insert the GameObject into the rendering pipeline
Both approaches have their pros and cons, but usually I'd recommend the first option.
You can use promises for this. With the bluebird module, you can convert loadText to a promise-returning function with Promise.promisifyAll(the module loadText is from), or, if it is your own module, you can make it return a new Promise(function (resolve, reject) {}) yourself.
Using promises, you can build an array of all the promises you want to run and pass it to Promise.all([loadText('shader'), loadText('other shader'), ...]).
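For instance, a hand-rolled wrapper might look like this (a sketch assuming loadText(path, callback) follows the (err, text) callback convention shown in the question):
function loadTextAsync(path) {
  return new Promise(function (resolve, reject) {
    loadText(path, function (err, text) {
      if (err) reject(err); // propagate load failures
      else resolve(text);   // fulfil with the file contents
    });
  });
}
// usage: loadTextAsync('../res/shaders/blinnPhong-shader.vsh').then(src => { /* ... */ });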
More information on promises

Node.js WriteStream synchronous

I'm writing a purely synchronous, single-threaded command-line program in Node.js which needs to write a single binary file, for which I'm using WriteStream. My usage pattern is along the lines of:
var stream = fs.createWriteStream(file)
stream.write(buf1)
stream.write(buf2)
This seems to work, but the documentation says it's asynchronous and I want to make sure I'm not writing code that works 99% of the time. I don't care exactly when the data gets written as long as it's written in the specified order and no later than when the program exits, and the quantity of data is small so speed and memory consumption are not issues.
I've seen mention of stream.end() but it seems to work without it and I've also seen suggestions that calling it may actually be a bad idea if you're not using callbacks because it might end up getting called before all the data is written.
Is my approach correct (given that I want purely synchronous) or is there anything I need to watch out for?
You can do this; the only problem is if you create two or more concurrent streams for the same path, in which case the order of writes from the different streams will be undefined. By the way, there is a synchronous fs write stream implementation in Node: fs.SyncWriteStream. It's kind of private and requires an fd as an argument, but if you really want it... (note that it has since been deprecated).
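If fully synchronous writes are really what you're after, a sketch that skips streams entirely (a plain-fs alternative, not what the question's code does) would be:
const fs = require('fs');

const fd = fs.openSync(file, 'w'); // open for writing, truncating any existing file
fs.writeSync(fd, buf1);            // writes happen synchronously, in call order
fs.writeSync(fd, buf2);
fs.closeSync(fd);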
I'm working on a timing-critical API where a new file has to have been written and its stream completely handled before the next action can be performed. The solution in my case (and quite possibly in the OP's case) was to use:
writer.on('finish', () => {
  console.error('All writes are now complete.');
});
as per the fs Event: 'finish' documentation
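If the caller needs to await that point, the event can be wrapped in a promise (a small sketch building on the snippet above):
function closeAndWait(writer) {
  return new Promise((resolve, reject) => {
    writer.once('error', reject);
    writer.once('finish', resolve);
    writer.end(); // flush buffered data; 'finish' fires once everything is written
  });
}
// usage: await closeAndWait(stream);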
const fs = require("fs");

const writeToLocalDisk = (stream, path) => {
  return new Promise((resolve, reject) => {
    const istream = stream;
    const ostream = fs.createWriteStream(path);
    istream.pipe(ostream);
    // resolve only once the write stream has flushed everything to disk
    ostream.on("finish", () => {
      console.log(`Fetched ${path} from elsewhere`);
      resolve();
    });
    istream.on("error", (err) => {
      console.log(JSON.stringify(err, null, 2));
      reject(err);
    });
  });
};
// Then use an async function to perform sequential-like operation
async function sequential(stream) {
  const path = "";
  await writeToLocalDisk(stream, path);
  console.log('other operation here');
}
