Node read line by line, process and store - javascript

I have the following code in Node.js which reads from a file, line by line. I want to do stuff to each line and store it in an array. The array would then be used in other functions in the same file. The problem I'm running into is the async nature of reading the stream which results in an empty array. The solutions I've come across all seem to rely on modules.
function processLine(file) {
  const fs = require('fs');
  const readline = require('readline');
  const input = fs.createReadStream(file);
  const rl = readline.createInterface(input);
  const arr = [];

  rl.on('line', (line) => {
    // do stuff to data and store in array
  });

  // return array;
}
I am aware that I can store the chunks and operate on the whole file with input.on('end', cb)... However, I feel like this would put too much functionality within the cb. Plus I still can't use its return value, since it's async. I guess my question is: is there a way to store the data being read and use it elsewhere within the file?

If you would like to process elements as chunks, take a look at
highWaterMark
https://nodejs.org/api/stream.html#stream_types_of_streams
You will probably be interested in
objectMode
as well.
There are also several interfaces you can use when working with streams:
Readable
Writable
Duplex
Transform
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback
In a Transform you can use any Promise-based function and simply call the callback to finish processing an element at the right point in time:
_transform = function(data, encoding, callback) {
    this.push(data);
    callback();
};
or
https://nodejs.org/api/stream.html#stream_class_stream_transform
_write(chunk, encoding, callback) {
    // ...
}
There is also another option: the rxjs bindings for Node streams, which you can use while processing elements.
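As a rough sketch of the Transform approach described above (not the asker's exact code; 'file.txt' is a placeholder path), here is a line-splitting Transform whose output is collected into an array; the array is only complete once the 'end' event fires:
const fs = require('fs');
const { Transform } = require('stream');

// Sketch only: split incoming chunks into lines and push each line downstream.
// Calling callback() tells the stream the current chunk has been processed.
const lineSplitter = new Transform({
    readableObjectMode: true, // emit one 'data' event per line
    transform(chunk, encoding, callback) {
        this.buffered = (this.buffered || '') + chunk.toString();
        const lines = this.buffered.split('\n');
        this.buffered = lines.pop(); // keep the trailing partial line
        for (const line of lines) this.push(line);
        callback();
    },
    flush(callback) {
        if (this.buffered) this.push(this.buffered);
        callback();
    }
});

const arr = [];
fs.createReadStream('file.txt')            // placeholder path
    .pipe(lineSplitter)
    .on('data', (line) => arr.push(line))  // do stuff to each line here
    .on('end', () => {
        // only here is arr complete; hand it to the other functions now
        console.log(arr.length, 'lines collected');
    });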

Related

Parallel stream huge line delimited json file in nodejs

I am reading a file with 350M lines using createReadStream, transforming each line, and writing it back out as a line-delimited file. Below is the code which I am using to do it.
var fs = require("fs");
var args = process.argv.slice(2);
var split = require("split");

fs.createReadStream(args[0])
    .pipe(split(JSON.parse))
    .on('data', function(obj) {
        // <data transformation operation>
    })
    .on('error', function(err) {
    });
To read 350M lines takes 40 minutes, and it only uses one CPU core while doing it. I have 16 CPU cores. How can I make this line-reading process run in parallel, so that at least 10 cores are utilized and the entire operation finishes in less time?
I tried using this module - https://www.npmjs.com/package/parallel-transform. But when I checked in htop, it was still a single CPU doing the operation.
var stream = transform(10, {
    objectMode: true
}, function(data, callback) {
    // <data transformation operation>
    callback(null, data);
});

fs.createReadStream(args[0])
    .pipe(stream)
    .pipe(process.stdout);
What is the better way to parallel read file while streaming?
You can try scramjet - I would be happy to find someone with a strong multi-threaded use case for setting up proper tests around this.
Your code would look something like this:
var fs = require("fs");
var os = require("os"); // needed for os.cpus() below
var {StringStream} = require("scramjet");

var args = process.argv.slice(2);

let i = 0;
let threads = os.cpus().length; // you may want to check this out

StringStream.from(fs.createReadStream(args[0]))
    .lines()                               // it's better to deserialize this in the threads
    .separate(() => i = ++i % threads)
    .cluster(stream => stream              // these will happen in the thread
        .JSONParse()
        .map(yourProcessingFunc)           // this can be async as well
    )
    .mux()                                 // if the function above returns something you'll get
                                           // a stream of results
    .run()                                 // this executes the whole workflow
    .catch(errorHandler)
You can use a better affinity function for separate; see the docs, where based on the data you can direct records to specific workers. If you encounter any issues, please create a repo and let's see how to fix those.

Save csv-parse output to a variable

I'm new to using csv-parse and this example from the project's github does what I need with one exception. Instead of outputting via console.log I want to store data in a variable. I've tried assigning the fs line to a variable and then returning data rather than logging it but that just returned a whole bunch of stuff I didn't understand. The end goal is to import a CSV file into SQLite.
var fs = require('fs');
var parse = require('..');

var parser = parse({delimiter: ';'}, function(err, data){
    console.log(data);
});

fs.createReadStream(__dirname + '/fs_read.csv').pipe(parser);
Here is what I have tried:
const fs = require("fs");
const parse = require("./node_modules/csv-parse");
const sqlite3 = require("sqlite3");
// const db = new sqlite3.Database("testing.sqlite");

let parser = parse({delimiter: ","}, (err, data) => {
    // console.log(data);
    return data;
});

const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);
I was also struggling to figure out how to get the data from csv-parse back to the top-level code that invokes parsing. Specifically, I was trying to get parser.info data at the end of processing to see if it was successful, but the solution for that can work to get the row data as well, if you need it.
The key was to wrap all the stream event listeners into a Promise, and within the parser's callback resolve the Promise.
function startFileImport(myFile) {
    // THIS IS THE WRAPPER YOU NEED
    return new Promise((resolve, reject) => {
        let readStream = fs.createReadStream(myFile);
        let fileRows = [];
        const parser = parse({
            delimiter: ','
        });

        // Use the readable stream api
        parser.on('readable', function () {
            let record;
            while (record = parser.read()) {
                if (record) { fileRows.push(record); }
            }
        });

        // Catch any error
        parser.on('error', function (err) {
            console.error(err.message);
        });

        parser.on('end', function () {
            const { lines } = parser.info;
            // RESOLVE OUTPUT THAT YOU WANT AT PARENT-LEVEL
            resolve({ status: 'Successfully processed lines: ', lines });
        });

        // This will wait until we know the readable stream is actually valid before piping
        readStream.on('open', function () {
            // This pipes the read stream into the parser
            readStream.pipe(parser);
        });

        // This catches any errors that happen while creating the readable stream (usually invalid names)
        readStream.on('error', function (err) {
            resolve({ status: null, error: 'readStream error' + err });
        });
    });
}
This is a question that suggests confusion about an asynchronous streaming API and seems to ask at least three things.
How do I get output to contain an array-of-arrays representing the parsed CSV data?
That output will never exist at the top-level, like you (and many other programmers) hope it would, because of how asynchronous APIs operate. All the data assembled neatly in one place can only exist in a callback function. The next best thing syntactically is const output = await somePromiseOfOutput() but that can only occur in an async function and only if we switch from streams to promises. That's all possible, and I mention it so you can check it out later on your own. I'll assume you want to stick with streams.
An array consisting of all the rows can only exist after reading the entire stream. That's why all the rows are only available in the author's "Stream API" example only in the .on('end', ...) callback. If you want to do anything with all the rows present at the same time, you'll need to do it in the end callback.
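For reference, a minimal sketch of that promise-based route, using the same callback form of parse shown in the question (parseCsvFile is just a hypothetical helper name, not part of csv-parse):
const fs = require("fs");
const parse = require("csv-parse");

// Wrap the parser/stream pair in a Promise so the rows can be awaited.
function parseCsvFile(path) {
    return new Promise((resolve, reject) => {
        const parser = parse({ delimiter: "," }, (err, rows) => {
            if (err) return reject(err);
            resolve(rows); // rows is the array-of-arrays you wanted in `output`
        });
        fs.createReadStream(path).on("error", reject).pipe(parser);
    });
}

// Usage, inside an async function:
// const output = await parseCsvFile(__dirname + "/users.csv");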
From https://csv.js.org/parse/api/ note that the author:
uses the on readable callback to push single records into a previously empty array defined externally named output.
uses the on error callback to report errors
uses the on end callback to compare all the accumulated records in output to the expected result
...
const output = []
...
parser.on('readable', function(){
    let record
    while (record = parser.read()) {
        output.push(record)
    }
})
// Catch any error
parser.on('error', function(err){
    console.error(err.message)
})
// When we are done, test that the parsed output matched what expected
parser.on('end', function(){
    assert.deepEqual(
        output,
        [
            [ 'root','x','0','0','root','/root','/bin/bash' ],
            [ 'someone','x','1022','1022','','/home/someone','/bin/bash' ]
        ]
    )
})
As to the goal of interfacing with sqlite, this is essentially building a customized streaming endpoint.
In this use case, implement a customized writable stream that accepts the output of parser and sends rows to the database.
Then you simply chain pipe calls as
fs.createReadStream(__dirname + '/fs_read.csv')
    .pipe(parser)
    .pipe(your_writable_stream)
Beware: This code returns immediately. It does not wait for the operations to finish. It interacts with a hidden event loop internal to node.js. The event loop often confuses new developers who are arriving from another language, used to a more imperative style, and skipped this part of their node.js training.
Implementing such a customized writable stream can get complicated and is left as an exercise for the reader. It will be easiest if the parser emits a row, and then the writer can be written to handle single rows. Make sure you are able to notice errors somehow and throw appropriate exceptions, or you'll be cursed with incomplete results and no warning or reason why.
A hackish way to do it would have been to replace console.log(data) in let parser = ... with a customized function writeRowToSqlite(data) that you'll have to write anyway to implement a custom stream. Because of asynchronous API issues, using return data there does not do anything useful. It certainly, as you saw, fails to put the data into the output variable.
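For illustration only, here is a minimal sketch of such a customized writable stream; the table and column names are hypothetical placeholders, and the parser is assumed to be the stream form created as parse({delimiter: ','}) so records flow down the pipe:
const { Writable } = require("stream");
const sqlite3 = require("sqlite3");

const db = new sqlite3.Database("testing.sqlite");

// Sketch: an object-mode Writable that inserts each parsed CSV row into sqlite.
const sqliteWriter = new Writable({
    objectMode: true, // each chunk is one parsed row (an array of fields)
    write(row, _encoding, callback) {
        db.run(
            "INSERT INTO users (name, email) VALUES (?, ?)", // hypothetical table/columns
            [row[0], row[1]],
            callback // reports success or failure back to the stream
        );
    }
});

sqliteWriter.on("error", (err) => console.error(err));
Something like this would take the place of your_writable_stream in the pipe chain above.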
As to why output in your modified posting does not contain the data...
Unfortunately, as you discovered, this is usually wrong-headed:
const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);
Here, the variable output will be a ReadableStream, which is not the same as the data contained in the readable stream. Put simply, it's like when you have a file in your filesystem, and you can obtain all kinds of system information about the file, but the content contained in the file is accessed through a different call.

node.js data consistency when iterating asynchronously

I have a tool whose basic idea is as follows:
//get a bunch of couchdb databases. this is an array
const jsonFile = require('jsonfile');

let dbList = getDbList();
const filePath = 'some/path/to/file';
const changesObject = {};

//iterate the db list. do asynchronous stuff on each iteration
dbList.forEach(function(db){
    let merchantDb = nano.use(db);
    //get some changes from the database. validate inside callback
    merchantDb.get("_changes", function(err, changes){
        validateChanges(changes);
        changesObject['db'] = changes.someAttribute;
        //write changes to file
        jsonFile.writeFile(filePath, changesObject, function (err) {
            if (err) {
                logger.error("Unable to write to file: ");
            }
        });
    });
});

const validateChanges = function(changes) {
    if (!validateLogic(changes)) sendAlertMail();
};
For performance reasons the iteration is not done synchronously, so there can be multiple iterations running in 'parallel'. My question is: can this cause any data inconsistencies and/or any issues with the file-writing process?
Edit:
The same file gets written to on each iteration.
Edit:2
The changes are stored as a JSON object with key value pairs. The key being the db name.
If you're really writing to a single file, which you appear to be (though it's hard to be sure), then yes, it can: you have a race condition in which multiple callbacks may try to write to the same file, possibly at the same time (remember, I/O isn't done on the JavaScript thread in Node unless you use the *Sync functions), which will at best mean the last one wins and at worst mean I/O errors because of overlap.
If you're writing to separate files for each db, then provided there's no cross-talk (shared state) amongst validateChanges, validateLogic, sendAlertMail, etc., that should be fine.
Just for detail: It will start tasks (jobs) getting the changes and then writing them out; the callbacks of the calls to get won't be run until later, when all of those jobs are queued.
You are creating closures in loops, but the way you're doing it is okay, both because you're doing it within the forEach callback and because you're not using db in the get callback (which would be fine with the forEach callback but not with some other ways you might loop arrays). Details on that aspect in this question's answers if you're interested.
This line is suspect, though:
let merchantDb = nano.use('db');
I suspect you meant (no quotes):
let merchantDb = nano.use(db);
For what it's worth, it sounds from the updates to the question and your various comments like the better solution would be not to write out the file separately each time. Instead, you want to gather up the changes and then write them out.
You can do that with the classic Node-callback APIs you're using like this:
let completed = 0;
//iterate the db list. do asynchronous stuff on each iteration
dbList.forEach(function(db) {
    let merchantDb = nano.use(db);
    //get some changes from the database. validate inside callback
    merchantDb.get("_changes", function(err, changes) {
        if (err) {
            // Deal with the fact there was an error (don't return)
        } else {
            validateChanges(changes);
            changesObject[db] = changes.someAttribute; // <=== NOTE: This line had 'db' rather than db, I assume that was meant to be just db
        }
        if (++completed === dbList.length) {
            // All done, write changes to file
            jsonFile.writeFile(filePath, changesObject, function(err) {
                if (err) {
                    logger.error("Unable to write to file: ");
                }
            });
        }
    });
});
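For comparison, a rough sketch of the same gather-then-write-once idea using promises instead of a manual counter, assuming merchantDb.get and jsonFile.writeFile follow the standard error-first callback convention. Note that unlike the counter version, Promise.all rejects on the first error rather than continuing with the remaining databases:
const { promisify } = require("util");

Promise.all(dbList.map(async (db) => {
    const merchantDb = nano.use(db);
    const getChanges = promisify(merchantDb.get.bind(merchantDb));
    const changes = await getChanges("_changes");
    validateChanges(changes);
    changesObject[db] = changes.someAttribute;
}))
    .then(() => promisify(jsonFile.writeFile)(filePath, changesObject))
    .catch((err) => logger.error("Unable to gather or write changes: " + err));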

how to stream read directory in node.js?

Suppose I have a directory that contains 100K+ or even 500K+ files. I want to read the directory with fs.readdir, but it's async, not a stream. Someone told me that the async version holds the entire file list in memory before it is done reading.
So what is the solution? I want to readdir with a stream approach. Can I?
On modern computers, traversing a directory with 500K files is nothing. When you call fs.readdir asynchronously in Node.js, all it does is read a list of file names in the specified directory. It doesn't read the files' contents. I've just tested with 700K files in the dir: it takes only 21MB of memory to load this list of file names.
Once you've loaded this list of file names, you just traverse them one by one, or in parallel by setting some limit for concurrency, and you can easily consume them all. Example:
var async = require('async'),
    fs = require('fs'),
    path = require('path'),
    parentDir = '/home/user';

async.waterfall([
    function (cb) {
        fs.readdir(parentDir, cb);
    },
    function (files, cb) {
        // `files` is just an array of file names, not full paths.
        // Consume 10 files in parallel.
        async.eachLimit(files, 10, function (filename, done) {
            var filePath = path.join(parentDir, filename);
            // Do whatever you want with this file.
            // Then don't forget to call `done()`.
            done();
        }, cb);
    }
], function (err) {
    err && console.trace(err);
    console.log('Done');
});
Now there is a way to do it with async iteration! You can do:
const dir = fs.opendirSync('/tmp')
for await (let file of dir) {
    console.log(file.name)
}
To turn it into a stream:
const _pipeline = util.promisify(pipeline)
await _pipeline([
    Readable.from(dir),
    ... // consume!
])
The more modern answer for this is to use opendir (added v12.12.0) to iterate over each found file, as it is found:
import { opendirSync } from "fs";

const dir = opendirSync("./files");
for await (const entry of dir) {
    console.log("Found file:", entry.name);
}
fsPromises.opendir / fs.opendirSync return an instance of Dir, which is an async iterable that yields a Dirent (directory entry) for every file in the directory.
This is more efficient because it returns each file as it is found, rather than having to wait till all files are collected.
Here are two viable solutions:
Async generators. You can use the fs.opendir function to create a Dir object, which has a Symbol.asyncIterator property.
import { opendir } from 'fs/promises';

// An async generator that accepts a directory name
const openDirGen = async function* (directory: string) {
    // Create a Dir object for that directory
    const dir = await opendir(directory);
    // Iterate through the items in the directory asynchronously
    for await (const file of dir) {
        // (yield whatever you want here)
        yield file.name;
    }
};
The usage of this is as follows:
for await (const name of openDirGen('./src')) {
    console.log(name);
}
A Readable stream can be created using the async generator we created above.
// ...
import { Readable } from 'stream';
// ...

// A function accepting the directory name
const openDirStream = (directory: string) => {
    return new Readable({
        // Set encoding to utf-8 to get the names of the items in
        // the directory as utf-8 strings.
        encoding: 'utf-8',
        // Create a custom read method which is async, but works
        // because it doesn't need to be awaited, as Readable is
        // event-based anyways.
        async read() {
            // Asynchronously iterate through the items' names in
            // the directory using the openDirGen generator.
            for await (const name of openDirGen(directory)) {
                // Push each name into the stream, emitting the
                // 'data' event each time.
                this.push(name);
            }
            // Once iteration is complete, manually destroy the stream.
            this.destroy();
        },
    });
};
You can use this the same way you'd use any other Readable stream:
const myDir = openDirStream('./src');

myDir.on('data', (name) => {
    // Logs the file name of each file in my './src' directory
    console.log(name);
    // You can do anything you want here, including actually reading
    // the file.
});
Both of these solutions will allow you to asynchronously iterate through the item names within a directory rather than pull them all into memory at once like fs.readdir does.
The answer by #mstephen19 gave the right direction, but it uses an async generator where Readable.read() does not support it. If you try to turn opendirGen() into a recursive function, to recurse into directories, it does not work anymore.
Using Readable.from() is the solution here. The following is his solution adapted as such (with opendirGen() still not recursive):
import { opendir } from 'node:fs/promises';
import { Readable } from 'node:stream';

async function* opendirGen(dir) {
    for await (const file of await opendir(dir)) {
        yield file.name;
    }
}

Readable
    .from(opendirGen('/tmp'), {encoding: 'utf8'})
    .on('data', name => console.log(name));
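For completeness, here is a sketch of what a recursive variant could look like with the same Readable.from() approach (walkDirGen is a hypothetical name, and it yields full paths rather than bare names):
import { opendir } from 'node:fs/promises';
import { join } from 'node:path';
import { Readable } from 'node:stream';

// Recursively walk a directory tree, yielding full paths as they are found.
async function* walkDirGen(dir) {
    for await (const entry of await opendir(dir)) {
        const full = join(dir, entry.name);
        if (entry.isDirectory()) {
            yield* walkDirGen(full); // descend into subdirectories
        } else {
            yield full;
        }
    }
}

Readable
    .from(walkDirGen('/tmp'), {encoding: 'utf8'})
    .on('data', name => console.log(name));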
As of version 10, there is still no good solution for this. Node is just not that mature yet.
Modern filesystems can easily handle millions of files in a directory, and of course you can make a good case for this in large-scale operations, as you suggest.
The underlying C library iterates over the directory listing one entry at a time, as it should. But all Node implementations I have seen that claim to iterate use fs.readdir, which reads everything into memory as fast as it can.
As I understand it, you have to wait for a new version of libuv to be adopted into Node, and then for the maintainers to address this old issue. See the discussion at https://github.com/nodejs/node/issues/583
Some improvements will happen with version 12.

How to read an entire text stream in node.js?

In RingoJS there's a function called read which allows you to read an entire stream until the end is reached. This is useful when you're making a command line application. For example you may write a tac program as follows:
#!/usr/bin/env ringo
var string = system.stdin.read(); // read the entire input stream
var lines = string.split("\n"); // split the lines
lines.reverse(); // reverse the lines
var reversed = lines.join("\n"); // join the reversed lines
system.stdout.write(reversed); // write the reversed lines
This allows you to fire up a shell and run the tac command. Then you type in as many lines as you wish to and after you're done you can press Ctrl+D (or Ctrl+Z on Windows) to signal the end of transmission.
I want to do the same thing in node.js but I can't find any function which would do so. I thought of using the readSync function from the fs library to simulate as follows, but to no avail:
fs.readSync(0, buffer, 0, buffer.length, null);
The file descriptor for stdin (the first argument) is 0. So it should read the data from the keyboard. Instead it gives me the following error:
Error: ESPIPE, invalid seek
at Object.fs.readSync (fs.js:381:19)
at repl:1:4
at REPLServer.self.eval (repl.js:109:21)
at rli.on.self.bufferedCmd (repl.js:258:20)
at REPLServer.self.eval (repl.js:116:5)
at Interface.<anonymous> (repl.js:248:12)
at Interface.EventEmitter.emit (events.js:96:17)
at Interface._onLine (readline.js:200:10)
at Interface._line (readline.js:518:8)
at Interface._ttyWrite (readline.js:736:14)
How would you synchronously collect all the data in an input text stream and return it as a string in node.js? A code example would be very helpful.
As node.js is event- and stream-oriented, there is no API to wait until the end of stdin and buffer the result, but it's easy to do manually:
var content = '';
process.stdin.resume();
process.stdin.on('data', function(buf) { content += buf.toString(); });
process.stdin.on('end', function() {
    // your code here
    console.log(content.split('\n').reverse().join('\n'));
});
In most cases it's better not to buffer the data at all and instead process incoming chunks as they arrive (using a chain of already available stream parsers, like xml or zlib, or your own FSM parser).
The key is to use these two Stream events:
Event: 'data'
Event: 'end'
For stream.on('data', ...) you should collect your data into either a Buffer (if it is binary) or a string.
For on('end', ...) you should call a callback with your completed buffer, or resolve a promise with it if you are using a Promises library.
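A minimal sketch of that pattern (readWholeStream is just a hypothetical helper name), wrapping the two events in a Promise so the completed content can be awaited:
// Collect 'data' chunks and resolve with the full content on 'end'.
function readWholeStream(stream) {
    return new Promise((resolve, reject) => {
        const chunks = [];
        stream.on('data', (chunk) => chunks.push(chunk));
        stream.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
        stream.on('error', reject);
    });
}

// Usage:
// readWholeStream(process.stdin).then((text) => { /* use text here */ });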
Let me illustrate StreetStrider's answer.
Here is how to do it with concat-stream
var concat = require('concat-stream');

yourStream.pipe(concat(function(buf){
    // buf is a Node Buffer instance which contains the entire data in stream
    // if your stream sends textual data, use buf.toString() to get entire stream as string
    var streamContent = buf.toString();
    doSomething(streamContent);
}));

// error handling is still on stream
yourStream.on('error', function(err){
    console.error(err);
});
Please note that process.stdin is a stream.
There is a module for that particular task, called concat-stream.
If you are in an async context and have a recent version of Node.js, here is a quick suggestion:
const chunks = []
for await (let chunk of readable) {
    chunks.push(chunk)
}
console.log(Buffer.concat(chunks))
On Windows, I had some problems with the other solutions posted here - the program would run indefinitely when there's no input.
Here is a TypeScript implementation for modern NodeJS, using async generators and for await - quite a bit simpler and more robust than using the old callback based APIs, and this worked on Windows:
import process from "process";

/**
 * Read everything from standard input and return a string.
 *
 * (If there is no data available, the Promise is rejected.)
 */
export async function readInput(): Promise<string> {
    const { stdin } = process;
    const chunks: Uint8Array[] = [];

    if (stdin.isTTY) {
        throw new Error("No input available");
    }

    for await (const chunk of stdin) {
        chunks.push(chunk);
    }

    return Buffer.concat(chunks).toString('utf8');
}
Example:
(async () => {
    const input = await readInput();
    console.log(input);
})();
(consider adding a try/catch, if you want to handle the Promise rejection and display a more user-friendly error-message when there's no input.)
