What is the difference between file reading and streaming? - javascript

I read in a book that using streaming is better than reading a whole file at once in node.js, and I understand the idea, but I wonder: isn't file reading already done with streams? I'm used to this from Java and C++, where I use streams whenever I want to read a file. So what's the difference here?
Also, what is the difference between fs.createReadStream(<somefile>); and fs.readFile(<somefile>);?
Both are asynchronous, right?

The first thing to know is that file reading (readFile) is a fully buffered method, while streaming is a partially buffered method.
Now what does it mean?
Fully buffered function calls like readFileSync() and readFile() expose
the data as one big blob. That is, reading is performed and then the full set of data is returned either in synchronous or asynchronous fashion.
With these fully buffered methods, we have to wait until all of the data is read, and internally Node needs to allocate enough memory to store all of it. This can be problematic - imagine an application that reads a 1 GB file from disk. With only fully buffered access we would need 1 GB of memory to hold the whole content of the file, since both readFile and readFileSync return a single buffer (or string, if an encoding is given) containing all of the data.
Partially buffered access methods are different. They do not treat data input as a discrete event, but rather as a series of events which occur as the data is being read or written. They allow us to access data as it is being read from disk/network/other I/O.
Streams return smaller parts of the data (using a Buffer), and trigger a callback when new data is available for processing.
Streams are EventEmitters. If our 1 GB file would, for example, need to be processed in some way once, we could use a stream and process the data as soon as it is read. This is useful, since we do not need to hold all of the data in memory in some buffer: after processing, we no longer need to keep the data in memory for this kind of application.
The Node stream interface consists of two parts: Readable streams and Writable streams. Some streams are both readable and writable.
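To make the contrast concrete, here is a minimal sketch (not from the original answer; the file name is just an example) that processes a large file chunk by chunk, so the whole file is never held in memory:

const fs = require('fs');

let total = 0;
const stream = fs.createReadStream('huge-file.bin');

stream.on('data', (chunk) => {
  total += chunk.length; // handle each chunk as it arrives, then let it be garbage-collected
});
stream.on('end', () => {
  console.log('read ' + total + ' bytes without ever buffering the whole file');
});
stream.on('error', (err) => {
  console.error(err);
});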

So what's the difference here? Also, what is the difference between fs.createReadStream(<somefile>); and fs.readFile(<somefile>);? Both are asynchronous, right?
Well, aside from the fact that fs.createReadStream() directly returns a stream object while fs.readFile() expects a callback function as its second argument, there is another huge difference.
Yes, they are both asynchronous, but that doesn't change the fact that fs.readFile() doesn't give you any data until the entire file has been buffered into memory. This is much less memory-efficient and slower when relaying data back through server responses. With fs.createReadStream(), you can pipe() the stream object directly to a server's response object, which means your client can start receiving data immediately, even if the file is 500 MB.
Not only that, you also improve memory efficiency by dealing with the file one chunk at a time rather than all at once: memory only has to hold a few kilobytes of the file at any moment instead of the entire thing.
Here are two snippets demonstrating what I'm saying:
const fs = require('fs');
const http = require('http');

// using readFile()
http.createServer(function (req, res) {
  // let's pretend this is a huge 500MB zip file
  fs.readFile('some/file/path.zip', function (err, data) {
    // the entire file must be buffered in memory as `data`, which can be very slow;
    // it is then sent in one go - no streaming here
    res.write(data);
    res.end();
  });
}).listen(8000); // example port

// using createReadStream()
http.createServer(function (req, res) {
  // processes the large file in chunks,
  // sending them to the client as soon as they're ready
  fs.createReadStream('some/file/path.zip').pipe(res);
  // this is more memory-efficient and responsive
}).listen(8001); // example port

Related

How to limit flow between streams in NodeJS

I have a readStream piped to a writeStream. The read stream reads from the internet and the write stream writes to my local database instance. I noticed that the read speed is much faster than the write speed, and my app's memory usage rises until it hits
JavaScript heap out of memory
I suspect that the read data accumulates inside the NodeJS app faster than it can be written out.
How can I limit the read stream so it only reads what the write stream is capable of writing at any given time?
OK, long story short: the mechanism you need to be aware of to solve this kind of issue is backpressure. It is not a problem when you are using Node's standard pipe(); it only happened to me because I am using a custom fan-out to multiple streams.
You can read about it here: https://nodejs.org/en/docs/guides/backpressuring-in-streams/
This solution is not ideal, as it blocks the read stream whenever any of the fan-out write streams is blocked, but it gives a general idea of how to approach the problem:
combinedStream.pipe(transformer).on('data', async (data: DbObject) => {
  const writeStream = dbClient.getStreamForTable(data.table);
  if (!writeStream.write(data.csv)) {
    combinedStream.pause();
    await new Promise((resolve) => writeStream.once('drain', resolve));
    combinedStream.resume();
  }
});
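For the simple one-reader/one-writer case in the question you don't need any of this by hand: pipe() (or stream.pipeline()) already applies backpressure for you. A minimal sketch, with placeholder file streams standing in for the network source and database sink:

const { pipeline } = require('stream');
const fs = require('fs');

// placeholder source/destination; in the question these would be the
// internet read stream and the database write stream
const source = fs.createReadStream('input.csv');
const destination = fs.createWriteStream('output.csv');

// pipeline() propagates backpressure and errors between the two streams
pipeline(source, destination, (err) => {
  if (err) console.error('pipeline failed', err);
  else console.log('pipeline succeeded');
});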

Nodejs: synchronizing multiple simultaneous file changes

I have a small development web server that I use to write missing translations into files.
app.post('/locales/add/:language/:namespace', async (req, res) => {
  const { language, namespace } = req.params
  // I'm using fs.promises
  let current = await fs.readFile(`./locales/${language}/${namespace}.json`, 'utf8')
  current = JSON.parse(current)
  const newData = JSON.stringify({ ...req.body, ...current }, null, 2)
  await fs.writeFile(`./locales/${language}/${namespace}.json`, newData)
})
Obviously, when my i18n library does multiple writes into one file like this:
fetch('/locales/add/en/index', { body: `{"hello":"hello"}` })
fetch('/locales/add/en/index', { body: `{"bye":"bye"}` })
it seems like the file gets overwritten and only the result of the last request is saved. I cannot just append to the file, because it's JSON. How can I fix this?
You will have to use some sort of concurrency control to keep two concurrent requests that are both trying to write to the same resource from interfering with each other.
If you have lots of different files that you may be writing to, and perhaps multiple servers writing to them, then you pretty much have to use some sort of file locking, either OS-supplied or implemented manually with lock files, and have subsequent requests wait for the lock to be released. If only one server writes to the files and the number of files is manageable, you can instead build a file queue that tracks the order of requests and whether a file is busy, and returns a promise when it's a given request's turn to write (a sketch of that approach is at the end of this answer).
Concurrency control is exactly the kind of thing databases are particularly good at.
I have no experience with either of these packages, but they implement the general idea:
https://www.npmjs.com/package/lockfile
https://www.npmjs.com/package/proper-lockfile
These will guarantee one-at-a-time access. I don't know whether they guarantee that multiple requests are granted access in the precise order they attempted to acquire the lock; if you need that, you might have to add some sort of queue on top.
Some discussion of this topic here: How can I lock a file while writing to it asynchronously
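Here is a minimal sketch of the single-server queue idea, not a production implementation: per-file promise chains serialize the read-modify-write cycles, so concurrent requests to the same file are applied one after another. app is the same Express app as in the question; the helper name withFileQueue is made up for this example, and this does nothing to coordinate multiple server processes.

const fs = require('fs').promises;

const queues = new Map(); // file path -> tail of that file's promise chain

function withFileQueue(path, task) {
  const tail = queues.get(path) || Promise.resolve();
  const run = tail.then(task, task);     // run after the previous write settles
  queues.set(path, run.catch(() => {})); // keep the chain alive if a task fails
  return run;
}

app.post('/locales/add/:language/:namespace', async (req, res) => {
  const { language, namespace } = req.params;
  const file = `./locales/${language}/${namespace}.json`;
  await withFileQueue(file, async () => {
    const current = JSON.parse(await fs.readFile(file, 'utf8'));
    const merged = JSON.stringify({ ...req.body, ...current }, null, 2);
    await fs.writeFile(file, merged);
  });
  res.end();
});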

What happens at the buffer level when an input stream is piped to multiple output streams?

I'm reading the stream documentation, looking for a description of the buffering behaviour of streams: https://nodejs.org/api/stream.html#stream_buffering
The documentation doesn't seem to mention what happens to the input stream's buffer (or buffers?) when it is piped to multiple outputs that consume at different speeds:
Does the readable stream keep a dedicated buffer for every output when piping to multiple outputs?
Do the outputs consume at the same speed, or does the faster one finish earlier?
const fs = require('fs');

const input = fs.createReadStream('img.jpg');
const target1 = input.pipe(fs.createWriteStream('target1.jpg'));
const target2 = input.pipe(fs.createWriteStream('target2.jpg'));
TL;DR: the slower target stream controls the flow rate.
So first of all, let's see what happens on the read side.
const input = fs.createReadStream('img.jpg');
When you instantiate the input stream, it is created in paused mode and scheduled for reading (no reading is done synchronously, so it will not touch the file yet). The stream has highWaterMark set to something like 16384 bytes and currently has a buffer of 0 bytes.
const target1 = input.pipe(fs.createWriteStream('target1.jpg'));
const target2 = input.pipe(fs.createWriteStream('target2.jpg'));
Now when you actually pipe it to the writable streams, flowing mode is enabled by the on('data') event handler that the pipe() implementation adds - see the source.
Once this is done there is no more program to run, so Node starts the actual reading and executes that handler, which simply writes whatever data comes through.
Flow control kicks in when one of the targets has more data buffered than its highWaterMark, which causes its write() to return false. Reading is then stopped by calling pause() here in the code. Two lines above that you'll see that state.awaitDrain is incremented.
Now the read stream is paused and the writable streams are writing their bytes to disk - at some point a writable's buffer level drops back below its highWaterMark. At that point a 'drain' event fires, which executes this line and, once all awaited drains have fired, resumes the flow. That is done by checking whether the decremented awaitDrain counter has reached zero, meaning every awaited 'drain' event has been emitted.
In the case above, the faster of the two streams may also return false from write() at some point, but it will certainly drain first. If it weren't for awaitDrain, the faster stream's drain would resume the data flow on its own, and the slower stream's buffer could keep growing without bound.
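For intuition, here is a rough, simplified sketch of the flow control pipe() performs, written by hand for a single destination (the real implementation additionally tracks awaitDrain across all destinations, as described above):

const fs = require('fs');

const input = fs.createReadStream('img.jpg');
const target = fs.createWriteStream('target1.jpg');

input.on('data', (chunk) => {
  // write() returns false once the writable's internal buffer exceeds its highWaterMark
  if (!target.write(chunk)) {
    input.pause();                              // stop reading until the writer catches up
    target.once('drain', () => input.resume()); // resume once the buffer has drained
  }
});
input.on('end', () => target.end());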

Should I use the async File IO methods over their synchronous equivalents for local files in node.js?

I have a very simple utility script that I've written in JavaScript for node.js which reads a file, does a few calculations, then writes an output file. The source in its current form looks something like this:
fs.readFile(inputPath, function (err, data) {
  if (err) throw err;
  // do something with the data
  fs.writeFile(outputPath, output, function (err) {
    if (err) throw err;
    console.log("File successfully written.");
  });
});
This works fine, but I'm wondering if there is any disadvantage in this case to using the synchronous variety of these functions instead, like this:
var data = fs.readFileSync(inputPath);
// do something with the data
fs.writeFileSync(outputPath, output);
console.log("File successfully written.");
To me, this is much simpler to read and understand than the callback variety. Is there any reason to use the former method in this case?
I realize that speed isn't an issue at all with this simple script I'm running locally, but I'm interested in understanding the theory behind it. When does using the async methods help, and when does it not? Even in a production application, if I'm only reading a file, then waiting to perform the next task, is there any reason to use the asynchronous method?
What matters is what ELSE your node process needs to do while the synchronous IO happens. In the case of a simple shell script that is run at the command line by a single user, synchronous IO is totally fine, since if you were doing asynchronous IO all you'd be doing is waiting for the IO to come back anyway.
However, in a network service with multiple users you can NEVER use ANY synchronous IO calls (which is kind of the whole point of node, so believe me when I say this). To do so will cause ALL connected clients to halt processing and it is complete doom.
Rule of thumb: shell script: OK, network service: verboten!
For further reading, I made several analogies in this answer.
Basically, when node does asynchronous IO in a network server, it can ask the OS to do many things at once: read a few files, make some DB queries, send out some network traffic, and while waiting for that async IO to be ready, it can do memory/CPU work in the main event thread. Using this architecture, node gets pretty good performance/concurrency. However, when a synchronous IO operation happens, the entire node process just blocks and does absolutely nothing. It just waits. No new connections can be accepted. No processing happens, no event loop ticks, no callbacks, nothing. Just one synchronous operation stalls the entire server for all clients. It doesn't matter how fast it is, and it doesn't matter whether it's the local filesystem or a network request. Even if you spend only 10 ms reading a tiny file from disk for each client, with 100 clients, client 100 will wait a full second while that file is read one at a time, over and over, for clients 1-99.
Asynchronous code does not block the flow of execution, allowing your program to perform other tasks while waiting for an operation to complete.
In the first example, your code can continue running without waiting for the file to be written. In your second example, the code execution is "blocked" until the file is written. This is why synchronous code is known as "blocking" while asynchronous is known as "non-blocking."
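A tiny illustration of this, assuming some large file at ./big.file: with the async read, the timer keeps firing while the file is read; swap in the commented-out synchronous read and the ticks stop until it returns.

const fs = require('fs');

setInterval(() => console.log('event loop tick'), 100); // keeps firing during async IO

fs.readFile('./big.file', (err, data) => {
  if (err) throw err;
  console.log('async read done');
});

// const data = fs.readFileSync('./big.file'); // uncomment: no ticks print until this returns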

Mongoose.js: What are QueryStreams

I looked through the documentation of the mongoosejs ODM and found the following:
http://mongoosejs.com/docs/querystream.html
What are they used for? What can I do with them?
I am not sure whether they are for streaming docs or for dynamically updating queries...
Regards
Well, it's all about the API.
QueryStream lets you use the ReadStream API, so in order to appreciate QueryStream you need to know more about ReadStream/WriteStream.
There are many pros:
You can process a large amount of data, which you receive in "chunks", so only one item is in memory at a time (it could be a DB document, a DB row, a single line from a file, etc.)
You can pause/resume the stream(s)
You can pipe read->write very easily
The idea is that it gives you a unified API for read and write operations.
To answer your question "What can I do with them":
You could do anything with or without node.js's stream API, but having some sort of standard definitely makes things clearer and easier to use.
Also, node.js's streams are event-based (built on EventEmitter), so they help with decoupling.
Edit:
That was more about streams in general. In Mongoose's case, a single chunk contains a document.
To clarify the advantage of the API:
node.js's http.ServerResponse is a writable stream, which means you can stream Mongoose's result set to the browser with a single line:
// 'res' is the http response from your route's callback.
Posts.find().stream().pipe(res);
The point is that it doesn't matter if you're writing to http.ServerResponse, a file or anything else. As long as it implements a writable stream, it should work without changes.
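If you want to process the documents yourself instead of piping, here is a minimal sketch using the QueryStream events from the linked docs (Posts is the same model as in the example above):

const stream = Posts.find().stream();

stream.on('data', function (doc) {
  // one mongoose document per 'data' event
  console.log(doc._id);
});
stream.on('error', function (err) {
  console.error(err);
});
stream.on('close', function () {
  console.log('done');
});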
Hope I made it clearer.
