How to stream-read a directory in Node.js? - javascript

Suppose I have a directory that contains 100K+ or even 500K+ files. I want to read the directory with fs.readdir, but it's async, not streaming. Someone told me that the async call holds the entire file list in memory before the read is done.
So what is the solution? I want to readdir with a stream approach. Can I?

On modern computers, traversing a directory with 500K files is nothing. When you call fs.readdir asynchronously in Node.js, it just reads the list of file names in the specified directory; it doesn't read the files' contents. I've just tested with 700K files in the dir: loading this list of file names takes only 21MB of memory.
Once you've loaded this list of file names, you just traverse them one by one, or in parallel by setting some limit for concurrency, and you can easily consume them all. Example:
var async = require('async'),
    fs = require('fs'),
    path = require('path'),
    parentDir = '/home/user';

async.waterfall([
    function (cb) {
        fs.readdir(parentDir, cb);
    },
    function (files, cb) {
        // `files` is just an array of file names, not full paths.
        // Consume 10 files in parallel.
        async.eachLimit(files, 10, function (filename, done) {
            var filePath = path.join(parentDir, filename);
            // Do whatever you want with this file.
            // Then don't forget to call `done()`.
            done();
        }, cb);
    }
], function (err) {
    err && console.trace(err);
    console.log('Done');
});

Now there is a way to do it with async iteration! You can do:
const dir = fs.opendirSync('/tmp')
for await (let file of dir) {
  console.log(file.name)
}
To turn it into a stream:
const _pipeline = util.promisify(pipeline)
await _pipeline([
  Readable.from(dir),
  ... // consume!
])
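For completeness, here is one way the consuming side of that pipeline might look. This is a minimal sketch using stream/promises (available since Node 15) and a Writable in object mode; the console.log consumer is just a placeholder:
const fs = require('fs')
const { pipeline } = require('stream/promises')
const { Readable, Writable } = require('stream')

async function main () {
  const dir = await fs.promises.opendir('/tmp')
  await pipeline(
    Readable.from(dir), // yields fs.Dirent objects
    new Writable({
      objectMode: true,
      write (dirent, _encoding, callback) {
        console.log(dirent.name) // placeholder consumer
        callback()
      }
    })
  )
}

main().catch(console.error)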

The more modern answer for this is to use opendir (added in v12.12.0) to iterate over each file as it is found:
import { opendirSync } from "fs";

const dir = opendirSync("./files");
for await (const entry of dir) {
  console.log("Found file:", entry.name);
}
fsPromises.opendir / fs.opendirSync return an instance of Dir, which is an async iterable that yields a Dirent (directory entry) for every file in the directory.
This is more efficient because it returns each file as it is found, rather than having to wait until all files are collected.
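Since each entry is a Dirent, you can also branch on its type without an extra fs.stat call. A small sketch:
import { opendir } from "fs/promises";

const dir = await opendir("./files");
for await (const entry of dir) {
  // Dirent carries the entry type, so no separate stat call is needed
  if (entry.isDirectory()) console.log("Found directory:", entry.name);
  else if (entry.isFile()) console.log("Found file:", entry.name);
}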

Here are two viable solutions:
Async generators. You can use the fs.opendir function to create a Dir object, which has a Symbol.asyncIterator property.
import { opendir } from 'fs/promises';

// An async generator that accepts a directory name
const openDirGen = async function* (directory: string) {
    // Create a Dir object for that directory
    const dir = await opendir(directory);

    // Iterate through the items in the directory asynchronously
    for await (const file of dir) {
        // (yield whatever you want here)
        yield file.name;
    }
};
The usage of this is as follows:
for await (const name of openDirGen('./src')) {
    console.log(name);
}
A Readable stream can be created using the async generator we created above.
// ...
import { Readable } from 'stream';
// ...

// A function accepting the directory name
const openDirStream = (directory: string) => {
    return new Readable({
        // Set encoding to utf-8 to get the names of the items in
        // the directory as utf-8 strings.
        encoding: 'utf-8',
        // Create a custom read method which is async, but works
        // because it doesn't need to be awaited, as Readable is
        // event-based anyways.
        async read() {
            // Asynchronously iterate through the items' names in
            // the directory using the openDirGen generator.
            for await (const name of openDirGen(directory)) {
                // Push each name into the stream, emitting the
                // 'data' event each time.
                this.push(name);
            }
            // Once iteration is complete, manually destroy the stream.
            this.destroy();
        },
    });
};
You can use this the same way you'd use any other Readable stream:
const myDir = openDirStream('./src');

myDir.on('data', (name) => {
    // Logs the file name of each file in my './src' directory
    console.log(name);
    // You can do anything you want here, including actually reading
    // the file.
});
Both of these solutions will allow you to asynchronously iterate through the item names within a directory rather than pull them all into memory at once like fs.readdir does.

The answer by @mstephen19 gave the right direction, but it uses an async generator inside Readable.read(), which Readable does not support. If you try to turn opendirGen() into a recursive function, to recurse into directories, it does not work anymore.
Using Readable.from() is the solution here. The following is that solution adapted as such (with opendirGen() still not recursive; a recursive variant is sketched below):
import { opendir } from 'node:fs/promises';
import { Readable } from 'node:stream';

async function* opendirGen(dir) {
    for await (const file of await opendir(dir)) {
        yield file.name;
    }
}

Readable
    .from(opendirGen('/tmp'), { encoding: 'utf8' })
    .on('data', name => console.log(name));
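As a sketch of the recursive variant mentioned above, you can descend into subdirectories and yield full paths instead of bare names (adjust what you yield to your needs):
import { opendir } from 'node:fs/promises';
import { join } from 'node:path';
import { Readable } from 'node:stream';

// Recursively walk a directory tree, yielding the full path of every file found
async function* walk(dir) {
    for await (const entry of await opendir(dir)) {
        const full = join(dir, entry.name);
        if (entry.isDirectory()) {
            yield* walk(full); // recurse into subdirectories
        } else {
            yield full;
        }
    }
}

Readable
    .from(walk('/tmp'), { encoding: 'utf8' })
    .on('data', name => console.log(name));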

As of version 10, there is still no good solution for this. Node is just not that mature yet.
Modern filesystems can easily handle millions of files in a directory, and of course you can make a good case for streaming them in large-scale operations, as you suggest.
The underlying C library iterates over the directory listing one entry at a time, as it should. But all Node implementations I have seen that claim to iterate use fs.readdir, which reads everything into memory as fast as it can.
As I understand it, you have to wait for a new version of libuv to be adopted into Node, and then for the maintainers to address this old issue. See the discussion at https://github.com/nodejs/node/issues/583
Some improvements will happen in version 12.

Related

Optimizing a file content parser class written in TypeScript

I have a TypeScript module (used by a VSCode extension) which accepts a directory and parses the content contained within the files. For directories containing a large number of files this parsing takes a bit of time, so I would like some advice on how to optimize it.
I don't want to copy/paste the entire class files, so I will be using mock pseudocode containing the parts that I think are relevant.
class Parser {
    dir: string;

    constructor(_dir: string) {
        this.dir = _dir;
    }

    async parse() {
        let tree: any = getFileTree(this.dir);
        try {
            let parsedObjects: MyDTO[] = await this.iterate(tree.children);
        } catch (err) {
            console.error(err);
        }
    }

    async iterate(children: any[]): Promise<MyDTO[]> {
        let objs: MyDTO[] = [];
        for (let i = 0; i < children.length; i++) {
            let child: any = children[i];
            if (child.type === Constants.FILE) {
                let dto: FileDTO = await this.heavyFileProcessingMethod(file); // this takes time
                objs.push(dto);
            } else {
                // child is a folder
                let dtos: MyDTO[] = await this.iterateChildItems(child.children);
                let dto: FolderDTO = new FolderDTO();
                dto.files = dtos.filter(item => item instanceof FileDTO);
                dto.folders = dtos.filter(item => item instanceof FolderDTO);
                objs.push(dto);
            }
        }
        return objs;
    }

    async heavyFileProcessingMethod(file: string): Promise<FileDTO> {
        let content: string = readFile(file); // util method to synchronously read file content using fs
        return new FileDTO(await this.parseFileContent(content));
    }

    async parseFileContent(content): Promise<any[]> {
        // parsing happens here and the file content is parsed into separate blocks
        let ast: any = await convertToAST(content); // uses an asynchronous method of an external dependency to convert content to AST
        let blocks = parseToBlocks(ast); // synchronous method called to convert AST to blocks
        return await this.processBlocks(blocks);
    }

    async processBlocks(blocks: any[]): Promise<any[]> {
        for (let i = 0; i < blocks.length; i++) {
            let block: Block = blocks[i];
            if (block.condition === true) {
                // this can take some time because if this condition is true, some external assets will be downloaded (via internet)
                // on to the caller's machine + some additional processing takes place
                await processBlock(block);
            }
        }
        return blocks;
    }
}
I'm still sort of a beginner to TypeScript/NodeJS. I am looking for a multithreading/Java-esque solution here if possible. In the context of Java, this.heavyFileProcessingMethod would be an instance of Callable, and this object would be pushed into a List<Callable>, which would then be executed in parallel by an ExecutorService returning List<Future<Object>>.
Basically I want all files to be processed in parallel, but the function must wait for all the files to be processed before returning from the method (so the entire iterate method will only take as long as the time taken to parse the largest file).
I've been reading about running tasks in worker threads in NodeJS; can something like this be used in TypeScript as well? If so, can it be used in this situation? If my Parser class needs to be refactored to accommodate this change (or any other suggested change) it's no issue.
EDIT: Using Promise.all
iterate(children: any[]): Promise<MyDTO>[] {
    let promises: Promise<MyDTO>[] = [];
    for (let i = 0; i < children.length; i++) {
        let child: any = children[i];
        if (child.type === Constants.FILE) {
            let promise: Promise<FileDTO> = this.heavyFileProcessingMethod(file); // this takes time
            promises.push(promise);
        } else {
            // child is a folder
            let dtos: Promise<MyDTO>[] = this.iterateChildItems(child.children);
            let promise: Promise<FolderDTO> = this.getFolderPromise(dtos);
            promises.push(promise);
        }
    }
    return promises;
}

async getFolderPromise(promises: Promise<MyDTO>[]): Promise<FolderDTO> {
    return Promise.all(promises).then(dtos => {
        let dto: FolderDTO = new FolderDTO();
        dto.files = dtos.filter(item => item instanceof FileDTO);
        dto.folders = dtos.filter(item => item instanceof FolderDTO);
        return dto;
    });
}
first: Typescript is really Javascript
Typescript is just Javascript with static type checking, and those static types are erased when the TS is transpiled to JS. Since your question is about algorithms and runtime language features, Typescript has no bearing; your question is a Javascript one. So right off the bat that tells us the answer to
Been reading on running tasks in worker threads in NodeJS, can something like this be used in TypeScript as well?
is YES.
As to the second part of your question,
can it be used in this situation?
the answer is YES, but...
second: Use Worker Threads only if the task is CPU bound.
Can does not necessarily mean you should. It depends on whether your processes are IO bound or CPU bound. If they are IO bound, you're most likely far better off relying on Javascript's longstanding asynchronous programming model (callbacks, Promises). But if they are CPU bound, then using Node's relatively new support for Thread-based parallelism is more likely to result in throughput gains. See Node.js Multithreading!, though I think this one is better: Understanding Worker Threads in Node.js.
While worker threads are lighter weight than Node's previous option for parallelism (spawning child processes), they are still relatively heavyweight compared to threads in Java. Each worker runs in its own Node VM, and regular variables are not shared (you have to use special data types and/or message passing to share data). It had to be done this way because JavaScript is designed around a single-threaded programming model. It's extremely efficient within that model, but that design makes support for multithreading harder. Here's a good SO answer with useful info for you: https://stackoverflow.com/a/63225073/8910547
My guess is your parsing is more IO bound, and the overhead of spawning worker threads will outweigh any gains. But give it a go and it will be a learning experience. :)
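If you do decide to experiment with worker threads, a minimal sketch of the usual pattern looks like this (the parse-worker.js file name and doHeavyParsing function are hypothetical stand-ins for your own parsing code):
// main.js
const { Worker } = require('worker_threads');

// Run one worker per file and resolve with whatever the worker posts back
function parseInWorker(filePath) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./parse-worker.js', { workerData: filePath });
    worker.on('message', resolve);
    worker.on('error', reject);
    worker.on('exit', code => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}

// parse-worker.js
// const { parentPort, workerData } = require('worker_threads');
// parentPort.postMessage(doHeavyParsing(workerData)); // doHeavyParsing is your CPU-bound routine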
It looks like your biggest problem is navigating the nested directory structure and keeping individual per-file and per-dir promises organized. My suggestion would be to do that in a simpler way.
Have a function that takes a directory path and returns a flat list of all files it can find, no matter how deep, in a manner similar to the find program. Such a function could look like this:
import * as fs from 'fs/promises'
import * as path from 'path'
async function fileList(dir: string): Promise<string[]> {
    let entries = await fs.readdir(dir, {withFileTypes: true})

    let files = entries
        .filter(e => e.isFile())
        .map(e => path.join(dir, e.name))

    let dirs = entries
        .filter(e => e.isDirectory())
        .map(e => path.join(dir, e.name))

    let subLists = await Promise.all(dirs.map(d => fileList(d)))

    return files.concat(subLists.flat())
}
Basically, obtain directory entries, find (sub)directories and iterate them recursively in parallel. Once the iteration is complete, flatten, merge and return the list.
Using this function, you can apply your heavy task to all files at once by simply using map + Promise.all:
let allFiles = await fileList(dir)
let results = await Promise.all(allFiles.map(f => heavyTask(f)))
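If the tree is very large and heavyTask hits the disk or network hard, you may want to cap how many files are processed at once rather than firing everything in one go. A minimal sketch using simple batching (mapInBatches is a hypothetical helper; the batch size of 10 is arbitrary):
// Process files in fixed-size batches instead of all at once
async function mapInBatches(items, limit, fn) {
    const results = []
    for (let i = 0; i < items.length; i += limit) {
        const batch = items.slice(i, i + limit)
        results.push(...await Promise.all(batch.map(fn)))
    }
    return results
}

let allFiles = await fileList(dir)
let results = await mapInBatches(allFiles, 10, f => heavyTask(f))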

Returning ffprobe metadata to another function using fluent-ffmpeg

I'm trying to use fluent-ffmpeg's ffprobe to get the metadata of a file and add it to a list, but I want to separate the process of getting the metadata from the method that checks the file, mostly because the addFileToList() function is quite long as it is and the ffprobe routine is quite long as well.
I've tried the following code, but it doesn't give the results I'm expecting:
export default {
  // ...
  methods: {
    getVideoMetadata (file) {
      const ffmpeg = require('fluent-ffmpeg')
      ffmpeg.ffprobe(file.name, (err, metadata) => {
        if (!err) {
          console.log(metadata) // this shows the metadata just fine
          return metadata
        }
      })
    },
    addFileToList (file) {
      // file checking routines
      console.log(this.getVideoMetadata(file)) // this returns null
      item.metadata = this.getVideoMetadata(file)
      // item saving routines
    }
  }
}
I've already tried to nest the getVideoMetadata() routines inside addFileToList(), and it works, but not as intended: the actions are carried out, but not the first time, only the second time. It seems to be an async issue, but I don't know how I can tackle this.
What can I do? Should I stick to my idea of decoupling getVideoMetadata() or should I nest it inside addFileToList() and wrestle with async/await?
It turns out that probing for metadata is asynchronous: the caller reads the value before the callback has fired. The code should therefore be structured so that the flow continues from inside the callback:
const ffmpeg = require('fluent-ffmpeg')

var data

ffmpeg.ffprobe(file.name, (err, metadata) => {
  if (!err) {
    data = metadata
    continueDoingStuff()
  }
})
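If you would rather keep getVideoMetadata() decoupled, another option (a sketch, assuming the same component structure as in the question) is to have it return a Promise and await it in addFileToList():
getVideoMetadata (file) {
  const ffmpeg = require('fluent-ffmpeg')
  // Wrap ffprobe's callback API in a Promise so callers can await it
  return new Promise((resolve, reject) => {
    ffmpeg.ffprobe(file.name, (err, metadata) => {
      if (err) reject(err)
      else resolve(metadata)
    })
  })
},
async addFileToList (file) {
  // file checking routines
  item.metadata = await this.getVideoMetadata(file) // `item` comes from the question's saving routines
  // item saving routines
}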

How can I download/upload multiple files at once using SFTP through the ssh2 Node.js module (within an Electron app)?

I am building a simple SFTP client with Electron and I am attempting to download or upload multiple files at once using the ssh2 module and the SFTPStream within that module. I have tried many different method structures, some including use of es6-promise-pool. Every attempt I make results in one file from the array of files to transfer being transferred properly and then a subsequent
MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 sftp_message listeners added to [EventEmitter]. Use emitter.setMaxListeners() to increase limit
message is displayed in the console and the rest of the files are not transferred. I am unsure how to change my method structure to prevent this from occurring. I am using ipcRenderer to tell ipcMain to execute the methods I will display here (here is my structure for uploading files for example).
let counter = 0;

// Upload local file to server
ipcMain.on('upload_local_files', (event, args) => { // args[0] is connection settings, args[1] is array of file paths
  let conn = new Client();
  conn.on('ready', () => {
    let pool = new PromisePool(uploadFileProducer(conn, args[1]), 10);
    pool.start().then(() => {
      conn.end();
      counter = 0;
      let tempArgs = [];
      tempArgs.push(curLocalDir);
      tempArgs.push(curRemoteDir);
      event.sender.send('local_upload_complete', tempArgs);
    });
  }).connect(args[0]);
});

// Producer used in promise pool to upload files
function uploadFileProducer(conn, files){
  if(counter < 100 && counter < files.length){
    counter++;
    return(uploadFile(conn, files[counter - 1]));
  }else{
    return null;
  }
}

// Function used to upload files in promise pool
function uploadFile(conn, file){
  return new Promise((resolve, reject) => {
    conn.sftp((error, sftp) => {
      return sftp.fastPut(file, curRemoteDir + file.substring(file.lastIndexOf('/') + 1), {}, (error) => {
        resolve(file);
      });
    });
  });
}
Admittedly, the use of promise pools is new to me and I am unsure if I am going about using them properly. Another post about this topic used promise pools to prevent the problem I am having from occurring, but that example did not involve an Electron app (I don't know if that's relevant). I appreciate any help I can get!
The problem is not the Warning, which is just that, a warning, and normal in your current use case. The issue with the uploads is the incorrect usage of PromisePool.
I'm assuming you're using es6-promise-pool
You should pass a promise-producer function to the constructor, but instead you're calling the function and passing a promise; that's why only a single file gets uploaded.
You should pass the producer without calling it, make a producer that returns a function, or use a generator.
The PromisePool constructor takes a Promise-producing function as its
first argument.
function *uploadFileProducer(conn, files) {
  for (const file of files)
    yield uploadFile(conn, file);
}
Now you can call:
let pool = new PromisePool(uploadFileProducer(conn, args[1]), 10)
And the PromisePool will iterate correctly the iterator returned by the generator function, and handle the concurrency accordingly.
You can also create a function that returns a Promise each call.
function uploadFileProducer(conn, files) {
  files = files.slice(); // don't want to mutate the original
  // return null once the list is exhausted so the pool knows it's done
  return () => (files.length ? uploadFile(conn, files.shift()) : null);
}
Regarding the warning, it's normal if you're uploading multiple things concurrently; if that's the case, you can increase the limit using:
emitter.setMaxListeners(n)
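The warning itself typically comes from opening a new SFTP session (conn.sftp()) for every file. As a rough sketch, not a drop-in replacement (and reusing the question's curRemoteDir), you could open the session once and reuse it for every fastPut:
// Open a single SFTP session once, then reuse it for every upload
function getSftp(conn) {
  return new Promise((resolve, reject) => {
    conn.sftp((err, sftp) => (err ? reject(err) : resolve(sftp)));
  });
}

function uploadFile(sftp, file) {
  return new Promise((resolve, reject) => {
    sftp.fastPut(file, curRemoteDir + file.substring(file.lastIndexOf('/') + 1), {}, (err) => {
      if (err) reject(err);
      else resolve(file);
    });
  });
}

// inside conn.on('ready', ...), e.g.:
// getSftp(conn).then(sftp => {
//   const pool = new PromisePool(uploadFileProducer(sftp, args[1]), 10);
//   return pool.start();
// });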

Return SVG from a function

I'm trying to return an SVG file located in the same folder as the Javascript file when a function is executed.
The function is called returnSVG, the SVG is called image.
import image from './image.svg'

returnSVG = () => {
  return image;
}
But when I call it I get this: /static/media/image.f85cba53.svg
An SVG file contains a textual representation of a graphic, encoded in an XML-like syntax.
So, you cannot simply import thing from 'filepath' because the file contents are not a JavaScript module, a loadable binary extension, or a JSON-encoded string.
Rather than importing what cannot be imported, the correct approach would be to explicitly read the contents from the file:
const { readFileSync } = require('fs')

const returnSvg = (path = './image.svg') => readFileSync(path)
Note that this example uses the synchronous version of the fs function to read a file. This will pause (block) all processing in NodeJS until the contents have been read.
The better solution would be to use Promises via async/await as in:
const { readFile } = require('fs')
const { promisify } = require('util')

const asyncReadFile = promisify(readFile)

const returnSvg = async (path = './image.svg') => {
  const data = await asyncReadFile(path)
  // since fs.readFile returns a buffer, we should probably convert it to a string.
  return data.toString()
}
Using Promises and async/await requires that all of your processing occur within the chain of promises, so you'd need to restructure your code to:
returnSvg()
  .then(data => { /* do something with the file contents here. */ })
  .catch(err => console.error(`failed to read svg file: ${err}`))
Remember, in NodeJS, operations can be either asynchronous (via callbacks or Promises which async/await uses "under the covers"), or synchronous (via special "blocking" calls).
One option, if you choose to use the synchronous version, would be to load all your files into your server before you call app.listen(8080, () => console.log('listening on 8080')), but this implies you know which files will be required before you run your server, and it would not work for dynamically loaded files.
You might imagine doing something like:
const data = returnSvg()
console.log(data)
or perhaps:
let data
returnSvg().then(res => data = res)
console.log(data)
but neither will work, because an asynchronous function (one defined with the async keyword) can only return a Promise.
Neither attempt will print a usable value to the console, because at the time console.log() is called, NodeJS has no idea what the value of data will be.
The rule of thumb here is:
You cannot return a value or interact with any higher context from within a Promise.
Any manipulation of the data generated within a Promise chain, can ONLY be accessed from within its own chain.
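In practice that means consuming the async function from within another async context, for example:
// An async IIFE gives us a scope where `await` is allowed
(async () => {
  const svg = await returnSvg()
  console.log(svg) // the file contents are only usable here, inside the async scope
})().catch(err => console.error(`failed to read svg file: ${err}`))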

Node read line by line, process and store

I have the following code in Node.js which reads from a file, line by line. I want to do stuff to each line and store it in an array. The array would then be used in other functions in the same file. The problem I'm running into is the async nature of reading the stream which results in an empty array. The solutions I've come across all seem to rely on modules.
function processLine(file) {
  const fs = require('fs');
  const readline = require('readline');

  const input = fs.createReadStream(file);
  const rl = readline.createInterface(input);
  const arr = []

  rl.on('line', (line) => {
    // do stuff to data and store in array
  })

  // return array;
}
I am aware of being able to store the chunks and operate on the whole file with input.on('end', cb)... However, I feel like this would put too much functionality within the callback. Plus I still can't use its return value, since it's async. I guess my question is: is there a way to store data being read and use it within the file?
If you would like to process elements in chunks, take a look at highWaterMark:
https://nodejs.org/api/stream.html#stream_types_of_streams
You will probably also be interested in objectMode.
There are also the stream interfaces you can build on:
Readable
Writable
Duplex
Transform
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback
Inside a Transform you can use any Promise-based function and simply call the callback to finish processing an element at the right point in time:
_transform = function(data, encoding, callback) {
  this.push(data);
  callback();
};
or, if you subclass the stream classes directly (https://nodejs.org/api/stream.html#stream_class_stream_transform):
_write(chunk, encoding, callback) {
  // ...
}
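Putting those pieces together for the question's use case, a minimal sketch (the trim() call stands in for the real per-line processing) that collects the processed lines and hands the finished array to a callback once the stream ends:
const fs = require('fs');
const readline = require('readline');
const { Readable, Transform, Writable, pipeline } = require('stream');

function processLines(file, done) {
  const rl = readline.createInterface({ input: fs.createReadStream(file) });
  const arr = [];

  pipeline(
    Readable.from(rl), // readline's interface is async-iterable; each chunk is one line
    new Transform({
      objectMode: true,
      transform(line, _encoding, callback) {
        // do stuff to the line here, then pass it along
        callback(null, line.trim());
      }
    }),
    new Writable({
      objectMode: true,
      write(line, _encoding, callback) {
        arr.push(line);
        callback();
      }
    }),
    err => done(err, arr) // arr is only complete once the pipeline has finished
  );
}

// usage: the array is available only in the callback, after the file has been fully read
processLines('./data.txt', (err, lines) => {
  if (err) return console.error(err);
  console.log(lines.length);
});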
However, there is another solution: the RxJS bindings for Node streams, which you could also use to process elements.
