How to read an entire text stream in node.js? - javascript

In RingoJS there's a function called read which allows you to read an entire stream until the end is reached. This is useful when you're making a command line application. For example you may write a tac program as follows:
#!/usr/bin/env ringo
var string = system.stdin.read(); // read the entire input stream
var lines = string.split("\n"); // split the lines
lines.reverse(); // reverse the lines
var reversed = lines.join("\n"); // join the reversed lines
system.stdout.write(reversed); // write the reversed lines
This allows you to fire up a shell and run the tac command. Then you type in as many lines as you wish, and after you're done you press Ctrl+D (or Ctrl+Z on Windows) to signal the end of transmission.
I want to do the same thing in node.js but I can't find any function which would do so. I thought of using the readSync function from the fs library to simulate it as follows, but to no avail:
fs.readSync(0, buffer, 0, buffer.length, null);
The file descriptor for stdin (the first argument) is 0. So it should read the data from the keyboard. Instead it gives me the following error:
Error: ESPIPE, invalid seek
at Object.fs.readSync (fs.js:381:19)
at repl:1:4
at REPLServer.self.eval (repl.js:109:21)
at rli.on.self.bufferedCmd (repl.js:258:20)
at REPLServer.self.eval (repl.js:116:5)
at Interface.<anonymous> (repl.js:248:12)
at Interface.EventEmitter.emit (events.js:96:17)
at Interface._onLine (readline.js:200:10)
at Interface._line (readline.js:518:8)
at Interface._ttyWrite (readline.js:736:14)
How would you synchronously collect all the data in an input text stream and return it as a string in node.js? A code example would be very helpful.

As node.js is event- and stream-oriented, there is no API to wait for the end of stdin and buffer the result, but it's easy to do manually:
var content = '';
process.stdin.resume();
process.stdin.on('data', function(buf) { content += buf.toString(); });
process.stdin.on('end', function() {
    // your code here
    console.log(content.split('').reverse().join(''));
});
In most cases it's better not to buffer the data at all, but to process incoming chunks as they arrive (using a chain of already available stream parsers such as xml or zlib, or your own FSM parser).
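For instance, here is a minimal sketch of the no-buffering approach, assuming gzip-compressed input on stdin: each chunk is decompressed as it arrives and written straight back out, so memory use stays constant regardless of input size.
// sketch: process chunks as they arrive instead of buffering the whole stream
var zlib = require('zlib');

process.stdin
    .pipe(zlib.createGunzip())   // each compressed chunk is inflated as it arrives
    .pipe(process.stdout);       // and written out immediately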

The key is to use these two Stream events:
Event: 'data'
Event: 'end'
For stream.on('data', ...) you should collect your data into either a Buffer (if it is binary) or a string.
For stream.on('end', ...) you should call a callback with your completed buffer, or, if you can inline it, return the result using a Promises library.
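As a concrete example of that pattern, here is a small sketch that wraps the 'data'/'end' events in a Promise (assumes text input on stdin):
// sketch: wrap the 'data'/'end' events so callers can await the full text
function readStdin() {
    return new Promise((resolve, reject) => {
        let data = '';
        process.stdin.setEncoding('utf8');
        process.stdin.on('data', chunk => data += chunk);
        process.stdin.on('end', () => resolve(data));
        process.stdin.on('error', reject);
    });
}

// usage:
// readStdin().then(text => console.log(text.split('\n').reverse().join('\n')));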

Let me illustrate StreetStrider's answer.
Here is how to do it with concat-stream
var concat = require('concat-stream');
yourStream.pipe(concat(function (buf) {
    // buf is a Node Buffer instance which contains the entire data in stream
    // if your stream sends textual data, use buf.toString() to get entire stream as string
    var streamContent = buf.toString();
    doSomething(streamContent);
}));

// error handling is still on stream
yourStream.on('error', function (err) {
    console.error(err);
});
Please note that process.stdin is a stream.
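So, for the original stdin use case, the same pattern applies (a sketch of the tac example from the question, using concat-stream on stdin):
process.stdin.pipe(concat(function (buf) {
    // reverse the lines and write them back out
    process.stdout.write(buf.toString().split('\n').reverse().join('\n'));
}));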

There is a module for that particular task, called concat-stream.

If you are in an async context and have a recent version of Node.js, here is a quick suggestion:
const chunks = []
for await (let chunk of readable) {
    chunks.push(chunk)
}
console.log(Buffer.concat(chunks))
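If you need this in more than one place, it is easy to wrap into a helper (a sketch; readAll is a made-up name):
// sketch: reusable helper built on the same for await loop
async function readAll(readable) {
    const chunks = [];
    for await (const chunk of readable) {
        chunks.push(chunk);
    }
    return Buffer.concat(chunks).toString('utf8');
}

// usage: readAll(process.stdin).then(text => console.log(text));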

On Windows, I had some problems with the other solutions posted here - the program would run indefinitely when there's no input.
Here is a TypeScript implementation for modern Node.js, using async generators and for await - quite a bit simpler and more robust than the old callback-based APIs, and it worked on Windows:
import process from "process";
/**
* Read everything from standard input and return a string.
*
* (If there is no data available, the Promise is rejected.)
*/
export async function readInput(): Promise<string> {
    const { stdin } = process;
    const chunks: Uint8Array[] = [];

    if (stdin.isTTY) {
        throw new Error("No input available");
    }

    for await (const chunk of stdin) {
        chunks.push(chunk);
    }

    return Buffer.concat(chunks).toString('utf8');
}
Example:
(async () => {
    const input = await readInput();
    console.log(input);
})();
(consider adding a try/catch, if you want to handle the Promise rejection and display a more user-friendly error-message when there's no input.)

Asynchronous readline loop without async / await

I'd like to be using this function (which runs fine on my laptop), or something like it, on my embedded device. But two problems:
Node.JS on our device is so old that the JavaScript doesn't support async / await. Given our hundreds of units in the field, updating is probably impractical.
Even if that weren't a problem, we have another problem: the file may be tens of megabytes in size, not something that could fit in memory all at once. And this for-await loop will set thousands of asynchronous tasks into motion, more or less all at once.
async function sendOldData(pathname) {
    const reader = ReadLine.createInterface({
        input: fs.createReadStream(pathname),
        crlfDelay: Infinity
    })
    for await (const line of reader) {
        const record = JSON.parse(line);
        sendOldRecord(record);
    }
}
function sendOldRecord(record) {...}
Promises work in this old version. I'm sure there is a graceful syntax for doing this sequentially with Promises:
read one line
massage its data
send that data to our server
sequentially, but asynchronously, so that the JavaScript event loop is not blocked while the data is sent to the server.
Please, could someone suggest the right syntax for doing this in my outdated JavaScript?
Make a queue, so that the next record is only taken off the array after the previous one has been sent:
function foo() {
    const reader = [
        '{"foo": 1}',
        '{"foo": 2}',
        '{"foo": 3}',
        '{"foo": 4}',
        '{"foo": 5}',
    ];

    function doNext() {
        if (!reader.length) {
            console.log('done');
            return;
        }
        const line = reader.shift();
        const record = JSON.parse(line);
        sendOldRecord(record, doNext);
    }

    doNext();
}
function sendOldRecord(record, done) {
    console.log(record);
    // whatever your async task is (setTimeout here just simulates it)
    setTimeout(function () {
        done();
    }, Math.floor(2000 * Math.random()));
}
foo();
The Problem
Streams and asynchronous processing are somewhat of a pain to get working well together while handling all possible error conditions, and things are even worse for the readline module. Since you seem to be saying that you can't use the for await () construct on the readable (which, even when it is supported, has various issues), things are a bit more complicated still.
The main problem with readline.createInterface() on a stream is that it reads a chunk of the file, parses that chunk for full lines and then synchronously sends all the lines in a tight for loop.
You can literally see the code here:
for (let n = 0; n < lines.length; n++) this[kOnLine](lines[n]);
The implementation of kOnLine does this:
this.emit('line', line);
So, this is a tight for loop that emits all the lines it read. So ... if you try to do something asynchronous in your response to the line event, the moment you hit an await or an asynchronous callback, this readline code will send the next line event before you're done processing the previous one. This makes it a pain to do asynchronous processing of the line events in sequential order, where you finish asynchronously processing one line before starting on the next one. IMO, this is a very busted design, as it only really works with synchronous processing. You will also notice that this for loop doesn't care whether the readline object was paused. It just pumps out all the lines it has without regard for anything.
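To see the problem concretely, here is a small sketch (temp.txt is a placeholder file name). Because the line events for a chunk are emitted synchronously, for a small file every 'start' log appears before any 'end' log, i.e. the handlers interleave instead of running one line at a time:
// sketch: an async 'line' handler does NOT hold back the next 'line' event
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({ input: fs.createReadStream('temp.txt') });
rl.on('line', async (line) => {
    console.log('start', line);
    await new Promise(resolve => setTimeout(resolve, 100)); // simulate async work
    console.log('end  ', line); // runs long after later lines have already started
});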
Discussion of Possible Solutions
So, what to do about that. Some part of a fix for this is in the asynchronous iterator interface to readline (but it has other problems which I've filed bugs on). But, the supposition of your question seems to be that you can't use that asynchronous iterator interface because your device may have an older version of nodejs. If that's the case, then I only know of two options:
Ditch the readline.createInterface() functionality entirely and either use a 3rd party module or do your own line boundary processing.
Cover the line event with your own code that supports asynchronous processing of lines without getting the next line in the middle of still processing the previous one.
A Solution
I've written an implementation for option #2, covering the line event with your own code. In my implementation, we just acknowledge that line events will arrive during our asynchronous processing of previous lines, but instead of notifying you about them, the input stream gets paused and these "early" lines get queued. With this solution the readline code will read a chunk of data from the input stream, parse it into its full lines, and synchronously send all the line events for those full lines. But, upon receipt of the first line event, we will pause the input stream and initiate queueing of subsequent line events. So, you can asynchronously process a line and you won't get another one until you ask for the next line.
This code has a different way of communicating incoming lines to your code. Since we're in the age of promises for asynchronous code, I've added a promise-based reader.getNextLine() function to the reader object.
This lets you write code like this:
import fs from 'fs';
async function run(filename) {
    let reader = createLineReader({
        input: fs.createReadStream(filename),
        crlfDelay: Infinity
    });
    let line;
    let cntr = 0;
    while ((line = await reader.getNextLine()) !== null) {
        // simulate some asynchronous operation in the processing of the line
        console.log(`${++cntr}: ${line}`);
        await processLine(line);
    }
}

run("temp.txt").then(result => {
    console.log("done");
}).catch(err => {
    console.log(err);
});
And, here's the implementation of createLineReader():
import * as ReadLine from 'readline';
function createLineReader(options) {
    const stream = options.input;
    const reader = ReadLine.createInterface(options);

    // state machine variables
    let latchedErr = null;
    let isPaused = false;
    let readerClosed = false;
    const queuedLines = [];

    // resolves with line
    // resolves with null if no more lines
    // rejects with error
    reader.getNextLine = async function() {
        if (latchedErr) {
            // once we get an error, we're done
            throw latchedErr;
        } else if (queuedLines.length) {
            // if something in the queue, return the oldest from the queue
            const line = queuedLines.shift();
            if (queuedLines.length === 0 && isPaused) {
                reader.resume();
            }
            return line;
        } else if (readerClosed) {
            // if nothing in the queue and the reader is closed, then signify end of data
            return null;
        } else {
            // waiting for more line data to arrive
            return new Promise((resolve, reject) => {
                function clear() {
                    reader.off('error', errorListener);
                    reader.off('queued', queuedListener);
                    reader.off('done', doneListener);
                }
                function queuedListener() {
                    clear();
                    resolve(queuedLines.shift());
                }
                function errorListener(e) {
                    clear();
                    reject(e);
                }
                function doneListener() {
                    clear();
                    resolve(null);
                }
                reader.once('queued', queuedListener);
                reader.once('error', errorListener);
                reader.once('done', doneListener);
            });
        }
    };

    reader.on('pause', () => {
        isPaused = true;
    }).on('resume', () => {
        isPaused = false;
    }).on('line', line => {
        queuedLines.push(line);
        if (!isPaused) {
            reader.pause();
        }
        // tell any queue listener that something was just added to the queue
        reader.emit('queued');
    }).on('close', () => {
        readerClosed = true;
        if (queuedLines.length === 0) {
            reader.emit('done');
        }
    });

    return reader;
}
Explanation
Internally, the implementation takes each new line event and puts it into a queue. Then, reader.getNextLine() just pulls items from the queue or waits (with a promise) for the queue to get something put in it.
During operation, the readline object will get a chunk of data from your readstream and parse it into whole lines. The whole lines will all get added to the queue (via line events). The readstream will be paused so it won't generate any more lines until the queue has been drained.
When the queue becomes empty, the readstream will be resumed so it can send more data to the reader object.
This is scalable to very large files because it will only queue the whole lines found in one chunk of the file being read. Once those lines are queued, the input stream is paused so it won't put more into the queue. After the queue is drained, the input stream is resumed so it can send more data, and the cycle repeats...
Any errors in the readstream will trigger an error event on the readline object which will either reject a reader.getNextLine() that is already waiting for the next line or will reject the next time reader.getNextLine() is called.
Disclaimers
This has only been tested with file-based readstreams.
I would not recommend having more than one reader.getNextLine() in process at once as this code does not anticipate that and it's not even clear what that should do.
Basically you can achieve this using a functional approach:
const arrayOfValues = [1,2,3,4,5];
const chainOfPromises = arrayOfValues.reduce((acc, item) => {
    return acc.then((result) => {
        // Here you can add your logic for parsing/sending request
        // And here you are chaining next promise request
        return yourAsyncFunction(item);
    })
}, Promise.resolve());
// Basically this will do
// Promise.resolve().then(_ => yourAsyncFunction(1)).then(_ => yourAsyncFunction(2)) and so on...
// Start
chainOfPromises.then();
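Applied to the original question, you could collect the lines first and then chain the sends. This is a rough sketch, where sendOldRecord is assumed to return a Promise (unlike the callback version above); note that it buffers all lines in memory first, which may not suit very large files:
// sketch: gather lines with readline, then send them one at a time via a promise chain
const fs = require('fs');
const ReadLine = require('readline');

function sendOldData(pathname) {
    return new Promise((resolve, reject) => {
        const lines = [];
        const reader = ReadLine.createInterface({
            input: fs.createReadStream(pathname),
            crlfDelay: Infinity
        });
        reader.on('line', line => lines.push(line));
        reader.on('close', () => {
            lines
                .reduce((acc, line) => acc.then(() => sendOldRecord(JSON.parse(line))), Promise.resolve())
                .then(resolve, reject);
        });
    });
}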

Parallel stream huge line delimited json file in nodejs

I am reading a file with 350M lines using createReadStream, transforming each line, and writing it back as a line-delimited file. Below is the code which I am using to do it.
var fs = require("fs");
var args = process.argv.slice(2);
var split = require("split")
fs.createReadStream(args[0])
    .pipe(split(JSON.parse))
    .on('data', function(obj) {
        // <data transformation operation>
    })
    .on('error', function(err) {
    })
To read 350M lines it takes 40 minutes, and it only uses one CPU core while doing it. I have 16 CPU cores. How can I make this line-reading process run in parallel, so that at least 10 cores are utilized and the entire operation finishes in less time?
I tried using this module - https://www.npmjs.com/package/parallel-transform. But when I checked in htop, it was still a single CPU doing the operation.
var stream = transform(10, {
    objectMode: true
}, function(data, callback) {
    // <data transformation operation>
    callback(null, data);
});

fs.createReadStream(args[0])
    .pipe(stream)
    .pipe(process.stdout);
What is a better way to read the file in parallel while streaming?
You can try scramjet - I would be happy to find someone with a strong multi-threaded use case for setting up proper tests around this.
Your code would look something like this:
var fs = require("fs");
var {StringStream} = require("scramjet");
var args = process.argv.slice(2);
let i = 0;
let threads = os.cpus().length; // you may want to check this out
StringStream.from(fs.createReadStream(args[0]))
.lines() // it's better to deserialize this in the threads
.separate(() => i = ++i % threads)
.cluster(stream => stream // these will happen in the thread
.JSONParse()
.map(yourProcessingFunc) // this can be async as well
)
.mux() // if the function above returns something you'll get
// a stream of results
.run() // this executes the whole workflow.
.catch(errorHandler)
You can use a better affinity function for separate; see the docs, where, based on the data, you can direct items to specific workers. If you encounter any issues, please create a repo and let's see how to fix those.
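If you would rather stay on Node built-ins, the same round-robin idea can be sketched with the worker_threads module (Node 12+). This is an untested outline with no backpressure handling, and it assumes each line is an independent JSON document:
// sketch: single-file round-robin fan-out of lines to worker threads
const { Worker, isMainThread, parentPort } = require('worker_threads');
const fs = require('fs');
const os = require('os');
const readline = require('readline');

if (isMainThread) {
    const threads = os.cpus().length;
    const workers = Array.from({ length: threads }, () => new Worker(__filename));
    let i = 0;
    const rl = readline.createInterface({
        input: fs.createReadStream(process.argv[2]),
        crlfDelay: Infinity
    });
    rl.on('line', line => workers[i++ % threads].postMessage(line)); // no backpressure here
    rl.on('close', () => workers.forEach(w => w.postMessage(null))); // signal end of input
} else {
    parentPort.on('message', line => {
        if (line === null) return process.exit(0);
        const obj = JSON.parse(line); // parse in the worker, off the main thread
        // <data transformation operation> would go here
    });
}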

Save csv-parse output to a variable

I'm new to using csv-parse and this example from the project's github does what I need with one exception. Instead of outputting via console.log I want to store data in a variable. I've tried assigning the fs line to a variable and then returning data rather than logging it but that just returned a whole bunch of stuff I didn't understand. The end goal is to import a CSV file into SQLite.
var fs = require('fs');
var parse = require('..');
var parser = parse({delimiter: ';'}, function(err, data){
    console.log(data);
});
fs.createReadStream(__dirname+'/fs_read.csv').pipe(parser);
Here is what I have tried:
const fs = require("fs");
const parse = require("./node_modules/csv-parse");
const sqlite3 = require("sqlite3");
// const db = new sqlite3.Database("testing.sqlite");
let parser = parse({delimiter: ","}, (err, data) => {
    // console.log(data);
    return data;
});
const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);
I was also struggling to figure out how to get the data from csv-parse back to the top-level that invokes parsing. Specifically I was trying to get parser.info data at the end of processing to see if it was successful, but the solution for that can work to get the row data as well, if you need.
The key was to wrap all the stream event listeners into a Promise, and within the parser's callback resolve the Promise.
function startFileImport(myFile) {
    // THIS IS THE WRAPPER YOU NEED
    return new Promise((resolve, reject) => {
        let readStream = fs.createReadStream(myFile);
        let fileRows = [];
        const parser = parse({
            delimiter: ','
        });

        // Use the readable stream api
        parser.on('readable', function () {
            let record;
            while (record = parser.read()) {
                if (record) { fileRows.push(record); }
            }
        });
        // Catch any error
        parser.on('error', function (err) {
            console.error(err.message);
        });
        parser.on('end', function () {
            const { lines } = parser.info;
            // RESOLVE OUTPUT THAT YOU WANT AT PARENT-LEVEL
            resolve({ status: 'Successfully processed lines: ', lines });
        });

        // This will wait until we know the readable stream is actually valid before piping
        readStream.on('open', function () {
            // This just pipes the read stream into the parser
            readStream.pipe(parser);
        });
        // This catches any errors that happen while creating the readable stream (usually invalid names)
        readStream.on('error', function (err) {
            resolve({ status: null, error: 'readStream error' + err });
        });
    });
}
This is a question that suggests confusion about an asynchronous streaming API and seems to ask at least three things.
How do I get output to contain an array-of-arrays representing the parsed CSV data?
That output will never exist at the top-level, like you (and many other programmers) hope it would, because of how asynchronous APIs operate. All the data assembled neatly in one place can only exist in a callback function. The next best thing syntactically is const output = await somePromiseOfOutput() but that can only occur in an async function and only if we switch from streams to promises. That's all possible, and I mention it so you can check it out later on your own. I'll assume you want to stick with streams.
An array consisting of all the rows can only exist after reading the entire stream. That's why all the rows are only available in the author's "Stream API" example only in the .on('end', ...) callback. If you want to do anything with all the rows present at the same time, you'll need to do it in the end callback.
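For reference, the await-based variant mentioned above could look roughly like this (a sketch; parseCsvFile is a made-up helper name, and newer csv-parse versions export parse as a named export):
// sketch: wrap the stream in a Promise that resolves with the array of rows
const fs = require('fs');
const { parse } = require('csv-parse');

function parseCsvFile(path) {
    return new Promise((resolve, reject) => {
        const rows = [];
        fs.createReadStream(path)
            .pipe(parse({ delimiter: ',' }))
            .on('data', row => rows.push(row)) // each row is an array of field values
            .on('end', () => resolve(rows))
            .on('error', reject);
    });
}

// inside an async function:
// const output = await parseCsvFile(__dirname + '/users.csv');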
From https://csv.js.org/parse/api/ note that the author:
uses the on readable callback to push single records into a previously empty array defined externally named output.
uses the on error callback to report errors
uses the on end callback to compare all the accumulated records in output to the expected result
...
const output = []
...
parser.on('readable', function(){
    let record
    while (record = parser.read()) {
        output.push(record)
    }
})
// Catch any error
parser.on('error', function(err){
    console.error(err.message)
})
// When we are done, test that the parsed output matched what expected
parser.on('end', function(){
    assert.deepEqual(
        output,
        [
            [ 'root','x','0','0','root','/root','/bin/bash' ],
            [ 'someone','x','1022','1022','','/home/someone','/bin/bash' ]
        ]
    )
})
As to the goal of interfacing with sqlite, this is essentially building a customized streaming endpoint.
In this use case, implement a customized writable stream that accepts the output of parser and sends rows to the database.
Then you simply chain pipe calls as
fs.createReadStream(__dirname + '/fs_read.csv')
    .pipe(parser)
    .pipe(your_writable_stream)
Beware: This code returns immediately. It does not wait for the operations to finish. It interacts with a hidden event loop internal to node.js. The event loop often confuses new developers who are arriving from another language, used to a more imperative style, and skipped this part of their node.js training.
Implementing such a customized writable stream can get complicated and is left as an exercise for the reader. It will be easiest if the parser emits a row, and then the writer can be written to handle single rows. Make sure you are able to notice errors somehow and throw appropriate exceptions, or you'll be cursed with incomplete results and no warning or reason why.
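As a rough illustration (not production code), such a writable could look like the sketch below. The table name and column count are made up, and it assumes the parser is piped in object mode so each chunk is one parsed row:
// sketch: an object-mode Writable that inserts each parsed CSV row into sqlite
const { Writable } = require('stream');
const sqlite3 = require('sqlite3');

const db = new sqlite3.Database('testing.sqlite');

const sqliteWriter = new Writable({
    objectMode: true,
    write(row, encoding, callback) {
        // row is the array of fields for one CSV line; adjust placeholders to your schema
        db.run('INSERT INTO users VALUES (?, ?, ?)', row, callback); // callback(err) propagates failures
    }
});
Pipe the parser into it as in the chain above, and listen for 'error' and 'finish' on sqliteWriter to know when the import has failed or completed.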
A hackish way to do it would have been to replace console.log(data) in let parser = ... with a customized function writeRowToSqlite(data) that you'll have to write anyway to implement a custom stream. Because of asynchronous API issues, using return data there does not do anything useful. It certainly, as you saw, fails to put the data into the output variable.
As to why output in your modified posting does not contain the data...
Unfortunately, as you discovered, this is usually wrong-headed:
const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);
Here, the variable output will be a ReadableStream, which is not the same as the data contained in the readable stream. Put simply, it's like when you have a file in your filesystem, and you can obtain all kinds of system information about the file, but the content contained in the file is accessed through a different call.

Node read line by line, process and store

I have the following code in Node.js which reads from a file, line by line. I want to do stuff to each line and store it in an array. The array would then be used in other functions in the same file. The problem I'm running into is the async nature of reading the stream which results in an empty array. The solutions I've come across all seem to rely on modules.
function processLine(file) {
    const fs = require('fs');
    const readline = require('readline');
    const input = fs.createReadStream(file);
    const rl = readline.createInterface(input);
    const arr = []

    rl.on('line', (line) => {
        // do stuff to data and store in array
    })

    // return array;
}
I am aware of being able to store the chunks and operate on the whole file with input.on('end', cb)... However, I feel like this would put too much functionality within the cb. Plus I still can't use its return value since it's async. I guess my question is, is there a way to store data being read and use it within the file?
If you would like to process elements as chunks, take a look at
highWaterMark
https://nodejs.org/api/stream.html#stream_types_of_streams
Probably you will also be interested in
objectMode
as well.
Also, there are interfaces which you could use while using streams:
Readable
Writable
Duplex
Transform
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback
Here you could use any Promise-based function and simply call the callback to finish processing an element at the right point in time:
_transform = function(data, encoding, callback) {
    this.push(data);
    callback();
};
or
https://nodejs.org/api/stream.html#stream_class_stream_transform
_write(chunk, encoding, callback) {
// ...
}
However, there is another solution - an rxjs binding for node streams - which you could use while processing elements.
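Alternatively, if the goal is simply to get the processed lines back out of processLine, the usual pattern without any extra modules is to wrap the stream events in a Promise and await it. A sketch (toUpperCase stands in for whatever per-line processing you need, and 'somefile.txt' is a placeholder):
// sketch: resolve with the array once the 'close' event says the file is fully read
function processLine(file) {
    const fs = require('fs');
    const readline = require('readline');

    return new Promise((resolve, reject) => {
        const arr = [];
        const input = fs.createReadStream(file);
        const rl = readline.createInterface({ input });

        rl.on('line', line => {
            arr.push(line.toUpperCase()); // do stuff to the line and store it
        });
        rl.on('close', () => resolve(arr));
        input.on('error', reject);
    });
}

// usage: processLine('somefile.txt').then(arr => { /* use the array here */ });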

What's the proper way to handle back-pressure in a node.js Transform stream?

Intro
These are my first adventures in writing the node.js server side. It's been
fun so far but I'm having some difficulty understanding the proper way
to implement something as it relates to node.js streams.
Problem
For test and learning purposes I'm working with large files whose
content is zlib compressed. The compressed content is binary data, each
packet being 38 bytes in length. I'm trying to create a resulting file
that looks almost identical to the original file except that there is an
uncompressed 31-byte header for every 1024 38-byte packets.
Original file content (decompressed)
+----------+----------+----------+----------+
| packet 1 | packet 2 | ...... | packet N |
| 38 bytes | 38 bytes | ...... | 38 bytes |
+----------+----------+----------+----------+
Resulting file content
+----------+--------------------------------+----------+--------------------------------+
| header 1 | 1024 38 byte packets | header 2 | 1024 38 byte packets |
| 31 bytes | zlib compressed | 31 bytes | zlib compressed |
+----------+--------------------------------+----------+--------------------------------+
As you can see, it's somewhat of a translation problem. This means, I'm
taking some source stream as input and then slightly transforming it
into some output stream. Therefore, it felt natural to implement a
Transform stream.
The class simply attempts to accomplish the following:
Takes stream as input
zlib inflates the chunks of data to count the number of packets,
putting together 1024 of them, zlib deflating, and
prepending a header.
Passes the new resulting chunk on through the pipeline via
this.push(chunk).
A use case would be something like:
var fs = require('fs');
var me = require('./me'); // Where my Transform stream code sits
var inp = fs.createReadStream('depth_1000000');
var out = fs.createWriteStream('depth_1000000.out');
inp.pipe(me.createMyTranslate()).pipe(out);
Question(s)
Assuming Transform is a good choice for this use case, I seem to be
running into a possible back-pressure issue. My call to this.push(chunk)
within _transform keeps returning false. Why would this be and how
to handle such things?
This question from 2013 is all I was able to find on how to deal with "back pressure"
when creating node Transform streams.
From the node 7.10.0 Transform stream and Readable stream documentation what I gathered
was that once push returned false, nothing else should be pushed until _read was
called.
The Transform documentation doesn't mention _read except to mention that the base Transform
class implements it (and _write). I found the information about push returning false
and _read being called in the Readable stream documentation.
The only other authoritative comment I found on Transform back pressure only mentioned
it as an issue, and that was in a comment at the top of the node file _stream_transform.js.
Here's the section about back pressure from that comment:
// This way, back-pressure is actually determined by the reading side,
// since _read has to be called to start processing a new chunk. However,
// a pathological inflate type of transform can cause excessive buffering
// here. For example, imagine a stream where every byte of input is
// interpreted as an integer from 0-255, and then results in that many
// bytes of output. Writing the 4 bytes {ff,ff,ff,ff} would result in
// 1kb of data being output. In this case, you could write a very small
// amount of input, and end up with a very large amount of output. In
// such a pathological inflating mechanism, there'd be no way to tell
// the system to stop doing the transform. A single 4MB write could
// cause the system to run out of memory.
//
// However, even in such a pathological case, only a single written chunk
// would be consumed, and then the rest would wait (un-transformed) until
// the results of the previous transformed chunk were consumed.
Solution example
Here's the solution I pieced together to handle the back pressure in a Transform stream
which I'm pretty sure works. (I haven't written any real tests, which would require
writing a Writable stream to control the back pressure.)
This is a rudimentary Line transform which needs work as a line transform but does
demonstrate handling the "back pressure".
const stream = require('stream');

class LineTransform extends stream.Transform
{
    constructor(options)
    {
        super(options);
        this._lastLine = "";
        this._continueTransform = null;
        this._transforming = false;
        this._debugTransformCallCount = 0;
    }

    _transform(chunk, encoding, callback)
    {
        if (encoding === "buffer")
            return callback(new Error("Buffer chunks not supported"));

        if (this._continueTransform !== null)
            return callback(new Error("_transform called before previous transform has completed."));

        // DEBUG: Uncomment for debugging help to see what's going on
        //console.error(`${++this._debugTransformCallCount} _transform called:`);

        // Guard (so we don't call _continueTransform from _read while it is being
        // invoked from _transform)
        this._transforming = true;

        // Do our transforming (in this case splitting the big chunk into lines)
        let lines = (this._lastLine + chunk).split(/\r\n|\n/);
        this._lastLine = lines.pop();

        // In order to respond to "back pressure" create a function
        // that will push all of the lines stopping when push returns false,
        // and then resume where it left off when called again, only calling
        // the "callback" once all lines from this transform have been pushed.
        // Resuming (until done) will be done by _read().
        let nextLine = 0;
        this._continueTransform = () =>
        {
            let backpressure = false;
            while (nextLine < lines.length)
            {
                if (!this.push(lines[nextLine++] + "\n"))
                {
                    // we've got more to push, but we got backpressure so it has to wait.
                    if (backpressure)
                        return;

                    backpressure = !this.push(lines[nextLine++] + "\n");
                }
            }

            // DEBUG: Uncomment for debugging help to see what's going on
            //console.error(`_continueTransform ${this._debugTransformCallCount} finished\n`);

            // All lines are pushed, remove this function from the LineTransform instance
            this._continueTransform = null;
            return callback();
        };

        // Start pushing the lines
        this._continueTransform();

        // Turn off guard allowing _read to continue the transform pushes if needed.
        this._transforming = false;
    }

    _flush(callback)
    {
        if (this._lastLine.length > 0)
        {
            this.push(this._lastLine);
            this._lastLine = "";
        }

        return callback();
    }

    _read(size)
    {
        // DEBUG: Uncomment for debugging help to see what's going on
        //if (this._transforming)
        //    console.error(`_read called during _transform ${this._debugTransformCallCount}`);

        // If a transform has not pushed every line yet, continue that transform
        // otherwise just let the base class implementation do its thing.
        if (!this._transforming && this._continueTransform !== null)
            this._continueTransform();
        else
            super._read(size);
    }
}
I tested the above by running it with the DEBUG lines uncommented on a ~10000 line
~200KB file. Redirect stdout or stderr to a file (or both) to separate the debugging
statements from the expected output. (node test.js > out.log 2> err.log)
const fs = require('fs');
let inStrm = fs.createReadStream("testdata/largefile.txt", { encoding: "utf8" });
let lineStrm = new LineTransform({ encoding: "utf8", decodeStrings: false });
inStrm.pipe(lineStrm).pipe(process.stdout);
Helpful debugging hint
While writing this initially I didn't realize that _read could be called before
_transform returned, so I hadn't implemented the this._transforming guard and I was
getting the following error:
Error: no writecb in Transform class
at afterTransform (_stream_transform.js:71:33)
at TransformState.afterTransform (_stream_transform.js:54:12)
at LineTransform._continueTransform (/userdata/mjl/Projects/personal/srt-shift/dist/textfilelines.js:44:13)
at LineTransform._transform (/userdata/mjl/Projects/personal/srt-shift/dist/textfilelines.js:46:21)
at LineTransform.Transform._read (_stream_transform.js:167:10)
at LineTransform._read (/userdata/mjl/Projects/personal/srt-shift/dist/textfilelines.js:56:15)
at LineTransform.Transform._write (_stream_transform.js:155:12)
at doWrite (_stream_writable.js:331:12)
at writeOrBuffer (_stream_writable.js:317:5)
at LineTransform.Writable.write (_stream_writable.js:243:11)
Looking at the node implementation I realized that this error meant that the callback
given to _transform was being called more than once. There wasn't much information
to be found about this error either so I thought I'd include what I figured out here.
I think Transform is suitable for this, but I would perform the inflate as a separate step in the pipeline.
Here's a quick and largely untested example:
var zlib = require('zlib');
var stream = require('stream');

var transformer = new stream.Transform();

// Properties used to keep internal state of transformer.
transformer._buffers = [];
transformer._inputSize = 0;
transformer._targetSize = 1024 * 38;

// Dump one 'output packet'
transformer._dump = function(done) {
    // concatenate buffers and convert to binary string
    var buffer = Buffer.concat(this._buffers).toString('binary');

    // Take first 1024 packets.
    var packetBuffer = buffer.substring(0, this._targetSize);

    // Keep the rest and reset counter.
    // (new Buffer() is deprecated in modern Node; Buffer.from() would be used today)
    this._buffers = [ new Buffer(buffer.substring(this._targetSize)) ];
    this._inputSize = this._buffers[0].length;

    // output header
    this.push('HELLO WORLD');

    // output compressed packet buffer
    zlib.deflate(packetBuffer, function(err, compressed) {
        // TODO: handle `err`
        this.push(compressed);
        if (done) {
            done();
        }
    }.bind(this));
};

// Main transformer logic: buffer chunks and dump them once the
// target size has been met.
transformer._transform = function(chunk, encoding, done) {
    this._buffers.push(chunk);
    this._inputSize += chunk.length;

    if (this._inputSize >= this._targetSize) {
        this._dump(done);
    } else {
        done();
    }
};

// Flush any remaining buffers.
transformer._flush = function(done) {
    this._dump(done);
};

// Example:
var fs = require('fs');
fs.createReadStream('depth_1000000')
    .pipe(zlib.createInflate())
    .pipe(transformer)
    .pipe(fs.createWriteStream('depth_1000000.out'));
push will return false if the stream you are writing to (in this case, a file output stream) has too much data buffered. Since you're writing to disk, this makes sense: you are processing data faster than you can write it out.
When out's buffer is full, your transform stream will fail to push, and start buffering data itself. If that buffer should fill, then inp's will start to fill. This is how things should be working. The piped streams are only going to process data as fast as the slowest link in the chain can handle it (once your buffers are full).
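The same mechanism can be seen one level down with a plain Writable: write() returns false when its buffer is full, and the 'drain' event says it is safe to continue. A small sketch (with a made-up file name):
// sketch: honoring backpressure when writing to a file stream by hand
const fs = require('fs');
const out = fs.createWriteStream('depth_1000000.out');

function writeChunk(chunk) {
    if (out.write(chunk)) {
        return Promise.resolve(); // buffer had room; keep going
    }
    return new Promise(resolve => out.once('drain', resolve)); // wait until the file catches up
}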
Ran into a similar problem lately, needing to handle backpressure in an inflating transform stream - the secret to handling push() returning false is to register and handle the 'drain' event on the stream
_transform(data, enc, callback) {
    const continueTransforming = () => {
        // ... do some work / parse the data, keep state of where we're at etc
        if (!this.push(event))
            this._readableState.pipes.once('drain', continueTransforming); // will get called again when the reader can consume more data
        if (allDone)
            callback();
    }
    continueTransforming()
}
NOTE this is a bit hacky as we're reaching into the internals and pipes can even be an array of Readables but it does work in the common case of ....pipe(transform).pipe(...
Would be great if someone from the Node community can suggest a "correct" method for handling .push() returning false
I ended up following Ledion's example and created a utility Transform class which assists with backpressure. The utility adds an async method named addData, which the implementing Transform can await.
'use strict';

const { Transform } = require('stream');

/**
 * The BackPressureTransform class adds a utility method addData which
 * allows for pushing data to the Readable, while honoring back-pressure.
 */
class BackPressureTransform extends Transform {
    constructor(...args) {
        super(...args);
    }

    /**
     * Asynchronously add a chunk of data to the output, honoring back-pressure.
     *
     * @param {String} data
     * The chunk of data to add to the output.
     *
     * @returns {Promise<void>}
     * A Promise resolving after the data has been added.
     */
    async addData(data) {
        // if .push() returns false, it means that the readable buffer is full
        // when this occurs, we must wait for the internal readable to emit
        // the 'drain' event, signalling the readable is ready for more data
        if (!this.push(data)) {
            await new Promise((resolve, reject) => {
                const errorHandler = error => {
                    this.emit('error', error);
                    reject();
                };
                const boundErrorHandler = errorHandler.bind(this);

                this._readableState.pipes.on('error', boundErrorHandler);
                this._readableState.pipes.once('drain', () => {
                    this._readableState.pipes.removeListener('error', boundErrorHandler);
                    resolve();
                });
            });
        }
    }
}

module.exports = {
    BackPressureTransform
};
Using this utility class, my Transforms look like this now:
'use strict';

const { BackPressureTransform } = require('./back-pressure-transform');

/**
 * The Formatter class accepts the transformed row to be added to the output file.
 * The class provides generic support for formatting the result file.
 */
class Formatter extends BackPressureTransform {
    constructor() {
        super({
            encoding: 'utf8',
            readableObjectMode: false,
            writableObjectMode: true
        });
        this.anyObjectsWritten = false;
    }

    /**
     * Called when the data pipeline is complete.
     *
     * @param {Function} callback
     * The function which is called when final processing is complete.
     *
     * @returns {Promise<void>}
     * A Promise resolving after the flush completes.
     */
    async _flush(callback) {
        // if any object is added, close the surrounding array
        if (this.anyObjectsWritten) {
            await this.addData('\n]');
        }

        callback(null);
    }

    /**
     * Given the transformed row from the ETL, format it to the desired layout.
     *
     * @param {Object} sourceRow
     * The transformed row from the ETL.
     *
     * @param {String} encoding
     * Ignored in object mode.
     *
     * @param {Function} callback
     * The callback function which is called when the formatting is complete.
     *
     * @returns {Promise<void>}
     * A Promise resolving after the row is transformed.
     */
    async _transform(sourceRow, encoding, callback) {
        // before the first object is added, surround the data as an array
        // between each object, add a comma separator
        await this.addData(this.anyObjectsWritten ? ',\n' : '[\n');

        // update state
        this.anyObjectsWritten = true;

        // add the object to the output
        const parsed = JSON.stringify(sourceRow, null, 2).split('\n');
        for (const [index, row] of parsed.entries()) {
            // prepend the row with 2 additional spaces since we're inside a larger array
            await this.addData(`  ${row}`);

            // add line breaks except for the last row
            if (index < parsed.length - 1) {
                await this.addData('\n');
            }
        }

        callback(null);
    }
}

module.exports = {
    Formatter
};
Mike Lippert's answer is the closest to the truth, I think. It appears that waiting for a new _read() call to begin again from the reading stream is the only way that the Transform is actively notified that the reader is ready. I wanted to share a simple example of how I override _read() temporarily.
_transform(buf, enc, callback) {
    // prepend any unused data from the prior chunk.
    if (this.prev) {
        buf = Buffer.concat([ this.prev, buf ]);
        this.prev = null;
    }

    // will keep transforming until buf runs low on data.
    if (buf.length < this.requiredData) {
        this.prev = buf;
        return callback();
    }

    var result = // do something with data...
    var nextbuf = buf.slice(this.requiredData);

    if (this.push(result)) {
        // Continue transforming this chunk
        this._transform(nextbuf, enc, callback);
    }
    else {
        // Node is warning us to slow down (applying "backpressure")
        // Temporarily override _read request to continue the transform
        this._read = function() {
            delete this._read;
            this._transform(nextbuf, enc, callback);
        };
    }
}
I was trying to find the comment mentioned in the source code for transform and the reference link keeps being changed so I will leave this here for reference:
// a transform stream is a readable/writable stream where you do
// something with the data. Sometimes it's called a "filter",
// but that's not a great name for it, since that implies a thing where
// some bits pass through, and others are simply ignored. (That would
// be a valid example of a transform, of course.)
//
// While the output is causally related to the input, it's not a
// necessarily symmetric or synchronous transformation. For example,
// a zlib stream might take multiple plain-text writes(), and then
// emit a single compressed chunk some time in the future.
//
// Here's how this works:
//
// The Transform stream has all the aspects of the readable and writable
// stream classes. When you write(chunk), that calls _write(chunk,cb)
// internally, and returns false if there's a lot of pending writes
// buffered up. When you call read(), that calls _read(n) until
// there's enough pending readable data buffered up.
//
// In a transform stream, the written data is placed in a buffer. When
// _read(n) is called, it transforms the queued up data, calling the
// buffered _write cb's as it consumes chunks. If consuming a single
// written chunk would result in multiple output chunks, then the first
// outputted bit calls the readcb, and subsequent chunks just go into
// the read buffer, and will cause it to emit 'readable' if necessary.
//
// This way, back-pressure is actually determined by the reading side,
// since _read has to be called to start processing a new chunk. However,
// a pathological inflate type of transform can cause excessive buffering
// here. For example, imagine a stream where every byte of input is
// interpreted as an integer from 0-255, and then results in that many
// bytes of output. Writing the 4 bytes {ff,ff,ff,ff} would result in
// 1kb of data being output. In this case, you could write a very small
// amount of input, and end up with a very large amount of output. In
// such a pathological inflating mechanism, there'd be no way to tell
// the system to stop doing the transform. A single 4MB write could
// cause the system to run out of memory.
//
// However, even in such a pathological case, only a single written chunk
// would be consumed, and then the rest would wait (un-transformed) until
// the results of the previous transformed chunk were consumed.
I discovered a solution similar to Ledion's without needing to dive into the internals of the current stream pipeline. You can achieve this via:
_transform(data, enc, callback) {
    const continueTransforming = () => {
        // ... do some work / parse the data, keep state of where we're at etc
        if (!this.push(event))
            this.once('data', continueTransforming); // will get called again when the reader can consume more data
        if (allDone)
            callback();
    }
    continueTransforming()
}
This works because data is only emitted when someone downstream is consuming the Transform's readable buffer that you're this.push()-ing to. So whenever the downstream has capacity to pull off of this buffer, you should be able to start writing back to the buffer.
The flaw with listening for drain on the downstream (other than reaching into the internals of node) is that you are also relying on your Transform's own buffer having been drained as well, and there is no guarantee that it has been when the downstream emits drain.
