How to limit flow between streams in NodeJS - javascript

I have a read stream piped to a write stream. The read stream reads from the internet and the write stream writes to my local database instance. I noticed that the read speed is much faster than the write speed, and my app's memory usage rises until it hits
JavaScript heap out of memory
I suspect that the read data is accumulating inside the Node.js app while the writes lag behind.
How can I limit read stream so it reads only what write stream is capable of writing at the given time?

OK, so long story short: the mechanism you need to be aware of to solve this kind of issue is backpressure. It is not a problem when you are using Node's standard pipe(); I was using a custom fan-out to multiple streams, which is why it happened.
You can read about it here: https://nodejs.org/en/docs/guides/backpressuring-in-streams/
This solution is not ideal, as it blocks the read stream whenever any of the fan-out write streams is blocked, but it gives a general idea of how to approach the problem:
combinedStream.pipe(transformer).on('data', async (data: DbObject) => {
  const writeStream = dbClient.getStreamForTable(data.table);
  if (!writeStream.write(data.csv)) {
    // write() returned false: the destination's internal buffer is full.
    // Pause the source, wait for 'drain', then resume reading.
    combinedStream.pause();
    await new Promise((resolve) => writeStream.once('drain', resolve));
    combinedStream.resume();
  }
});
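For comparison, when there is a single source and a single destination, Node's built-in piping takes care of backpressure for you. A minimal sketch using the standard stream.pipeline() (the file paths are just placeholders):
const { pipeline } = require('stream');
const fs = require('fs');

// pipeline() propagates backpressure automatically: the source is paused
// whenever the destination's internal buffer fills up and resumed once it drains.
pipeline(
  fs.createReadStream('./input.csv'),   // placeholder source
  fs.createWriteStream('./output.csv'), // placeholder destination
  (err) => {
    if (err) console.error('Pipeline failed:', err);
  }
);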

Related

Nodejs: synchronizing multiple simultaneous file changes

I have a small development web server that I use to write missing translations into files.
app.post('/locales/add/:language/:namespace', async (req, res) => {
  const { language, namespace } = req.params
  // I'm using fs.promises
  let current = await fs.readFile(`./locales/${language}/${namespace}.json`, 'utf8')
  current = JSON.parse(current)
  const newData = JSON.stringify({ ...req.body, ...current }, null, 2)
  await fs.writeFile(`./locales/${language}/${namespace}.json`, newData)
})
Obviously, when my i18n library does multiple writes into one file like this:
fetch('/locales/add/en/index', { method: 'POST', body: `{"hello":"hello"}` })
fetch('/locales/add/en/index', { method: 'POST', body: `{"bye":"bye"}` })
it seems like the file is being overwritten and only the result of the last request is saved. I cannot just append to the file, because it's JSON. How to fix this?
You will have to use some sort of concurrency control to keep two concurrent requests that are both trying to write to the same resource from interfering with each other.
If you have lots of different files that you may be writing to, and perhaps multiple servers writing to them, then you pretty much have to use some sort of file locking, either OS-supplied or implemented manually with lock files, where subsequent requests wait for the lock to be released. If you have only one server writing to the file and a manageable number of files, then you can create a file queue that keeps track of the order of requests, knows when a file is busy, and returns a promise when it's a particular request's turn to write.
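A minimal sketch of that single-server queue approach, serializing writes per file by chaining them on a promise (the names writeQueues and queuedWrite are hypothetical, not from any package):
const fs = require('fs').promises

const writeQueues = new Map() // file path -> tail of that file's promise chain

function queuedWrite(path, update) {
  const tail = writeQueues.get(path) || Promise.resolve()
  const next = tail.then(async () => {
    let current = {}
    try {
      current = JSON.parse(await fs.readFile(path, 'utf8'))
    } catch (e) { /* missing or invalid file: start fresh */ }
    await fs.writeFile(path, JSON.stringify({ ...current, ...update }, null, 2))
  })
  writeQueues.set(path, next.catch(() => {})) // keep the chain alive after errors
  return next
}
Each request for the same file now waits for the previous write to finish, so concurrent updates merge instead of overwriting each other.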
Concurrency control is always what databases are particularly good at.
I have no experience with either of these packages, but these are the general idea:
https://www.npmjs.com/package/lockfile
https://www.npmjs.com/package/proper-lockfile
These will guarantee one-at-a-time access. I don't know whether they guarantee that multiple requests are granted access in the precise order in which they attempted to acquire the lock; if you need that, you might have to add it on top with some sort of queue.
Some discussion of this topic here: How can I lock a file while writing to it asynchronously

Firebase "update" operation downloads data?

I was profiling a "download leak" in my firebase database (I'm using JavaScript SDK/firebase functions Node.js) and finally narrowed down to the "update" function which surprisingly caused data download (which impacts billing in my case quite significantly - ~50% of the bill comes from this leak):
Firebase functions index.js:
exports.myTrigger = functions.database.ref("some/data/path").onWrite((data, context) => {
  var dbRootRef = data.after.ref.root;
  return dbRootRef.child("/user/gCapeUausrUSDRqZH8tPzcrqnF42/wr").update({field1: "val1", field2: "val2"});
});
This function generates downloads at "/user/gCapeUausrUSDRqZH8tPzcrqnF42/wr" node
If I change the paths to something like this:
exports.myTrigger = functions.database.ref("some/data/path").onWrite((data, context) => {
  var dbRootRef = data.after.ref.root;
  return dbRootRef.child("/user/gCapeUausrUSDRqZH8tPzcrqnF42").update({"wr/field1": "val1", "wr/field2": "val2"});
});
It generates download at "/user/gCapeUausrUSDRqZH8tPzcrqnF42" node.
Here are the results of firebase database:profile.
How can I get rid of the download while updating data or reduce the usage since I only need to upload it?
I don't think that is possible with a Firebase Cloud Functions trigger.
The .onWrite((data, context) => ...) handler receives a data argument, which is the complete DataSnapshot, and there is no way to configure it not to fetch its value.
Still, there are two things that you can do to help reduce the data cost (a sketch combining them follows below):
Watch a narrower path for the trigger, e.g. functions.database.ref("some/data/path") rather than ("some").
Use a more specific hook, i.e. onCreate() or onUpdate() rather than onWrite().
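If the write only needs to happen for newly created data, a hedged sketch of both suggestions combined might look like this (the paths and field names are just the placeholders from the question):
exports.myTrigger = functions.database.ref("some/data/path")
  .onCreate((snapshot, context) => {
    // onCreate receives a single DataSnapshot rather than a before/after pair
    return snapshot.ref.root
      .child("/user/gCapeUausrUSDRqZH8tPzcrqnF42/wr")
      .update({field1: "val1", field2: "val2"});
  });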
You should expect that all operations will round trip with your client code. Otherwise, how would the client know when the work is complete? It's going to take some space to express that. The screenshot you're showing (which is very tiny and hard to read - consider copying the text directly into your question) indicates a very small amount of download data.
To get a better sense of what the real cost is, run multiple tests and see whether that tiny cost is actually just part of the one-time handshake between the client and server when the connection is established. That cost might not be an issue at all, since your function code maintains a persistent connection over time as the Cloud Functions instance is reused.

Is moving database works to child processes a good idea in node.js?

I just started getting into child_process, and all I know is that it's good for delegating blocking functions (e.g. looping over a huge array) to child processes.
I use TypeORM to communicate with the MySQL database. I was wondering if there's a benefit to moving some of the asynchronous database work to child processes. I read in another thread (unfortunately I couldn't find it in my browser history) that there's no good reason to delegate async functions to child processes. Is that true?
example code:
child.js
import { createConnection } from "./dbConnection";
import { SomeTable } from "./entity/SomeTable";

process.on('message', (m) => {
  createConnection().then(async (connection) => {
    let repository = connection.getRepository(SomeTable);
    let results = await repository
      .createQueryBuilder("t")
      .orderBy("t.postId", "DESC")
      .getMany();
    process.send(results);
  });
});
main.js
const cp = require('child_process');
const child = cp.fork('./child.js');

child.send('Please fetch some data');
child.on('message', (m) => {
  console.log(m);
});
The big win with JavaScript is its asynchronous nature...
When you call an asynchronous function, the code continues to execute without waiting for the answer; only when the function is done and an answer is available does execution continue with that part.
Your database call is already asynchronous, so you would be spawning another Node process for nothing. Since the database is taking all the heat, having more Node.js processes wouldn't help on that front.
Take the same example but with a file write. What could make the write to disk faster? Nothing much, really. But do we care? No, because our Node.js process is not blocked: it keeps answering requests and handling tasks. The only thing you might want to check is that you don't issue a thousand file writes at the same time; if they are big, there would be a negative impact on the file system. But since a write is not CPU-intensive, Node will run just fine.
Child processes really are a great tool, but it is rare to need them. I too wanted to use some when I heard about them, but the truth is that you will most likely not need them at all... The only time I decided to use one was to create a CPU-intensive worker: it would spawn one child process per core (since Node is single-threaded) and respawn any faulty ones.
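For reference, a minimal sketch of that setup, assuming a hypothetical CPU-bound script worker.js:
const cp = require('child_process');
const os = require('os');

// Spawn one worker per CPU core and respawn any that exit abnormally.
function spawnWorker() {
  const child = cp.fork('./worker.js'); // hypothetical CPU-intensive script
  child.on('exit', (code) => {
    if (code !== 0) spawnWorker();
  });
}

os.cpus().forEach(spawnWorker);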

Node.js streams and responsibility for retrieval of new data

Streams enable us to model infinite time series.
They can be implemented using generators, with values retrieved from the stream as the program wants via next().
But Node.js streams are event emitters that tell the program when to handle data through the data event.
This seems to hand the responsibility for deciding when to process new data to the stream, instead of leaving it with the program, which presumably knows best whether it can handle the new data.
Is there a name for these two approaches (push and pull?), and when should each be used?
There is a pull-based method set available for streams (usually called streams2), but I agree it's still quite complicated to use, and generators have been raised many times as a possible solution.
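To illustrate the two modes on a plain Node Readable (a small demo, assuming a modern Node version with Readable.from()):
const { Readable } = require('stream');

// Flowing ("push") mode: the stream decides when to hand you data.
Readable.from(['a', 'b', 'c'])
  .on('data', (chunk) => console.log('pushed:', chunk));

// Paused ("pull") mode: the program calls read() when it is ready.
const pullStream = Readable.from(['x', 'y', 'z']);
pullStream.on('readable', () => {
  let chunk;
  while ((chunk = pullStream.read()) !== null) {
    console.log('pulled:', chunk);
  }
});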
I implemented this in my Scramjet framework, so a very simple stream generator would look like this:
const { DataStream } = require('scramjet');

const dat = [1, 2, 3, 4];
const iter = (function* () { yield* dat; })();
exports.stream = () => DataStream.fromIterator(iter);
You can run a test case using nodeunit in the scramjet repo.

What is the difference between file reading and streaming?

I read in a book that using streaming is better than reading a whole file at a time in Node.js. I understand the idea... but I wonder, isn't file reading done with streams anyway? I'm used to this from Java and C++: when I want to read a file, I use streams. So what's the difference here?
Also, what is the difference between fs.createReadStream(<somefile>); and fs.readFile(<somefile>);?
Both are asynchronous, right?!
The first thing is that file reading is a fully buffered method, while streaming is a partially buffered method.
Now what does it mean?
Fully buffered function calls like readFileSync() and readFile() expose the data as one big blob. That is, reading is performed and then the full set of data is returned either in synchronous or asynchronous fashion.
With these fully buffered methods, we have to wait until all of the data is read, and internally Node will need to allocate enough memory to store all of the data in memory. This can be problematic - imagine an application that reads a 1 GB file from disk. With only fully buffered access we would need to use 1 GB of memory to store the whole content of the file for reading - since both readFile and readFileSync return a string containing all of the data.
Partially buffered access methods are different. They do not treat data input as a discrete event, but rather as a series of events which occur as the data is being read or written. They allow us to access data as it is being read from disk/network/other I/O.
Streams return smaller parts of the data (using a Buffer), and trigger a callback when new data is available for processing.
Streams are EventEmitters. If our 1 GB file would, for example, need to be processed in some way once, we could use a stream and process the data as soon as it is read. This is useful, since we do not need to hold all of the data in memory in some buffer: after processing, we no longer need to keep the data in memory for this kind of application.
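For example, a tiny sketch of that one-pass pattern, counting the bytes of a potentially huge file while holding only one chunk in memory at a time (the file path is a placeholder):
const fs = require('fs');

let total = 0;
fs.createReadStream('./huge-file.bin')
  .on('data', (chunk) => { total += chunk.length; })
  .on('end', () => console.log(`read ${total} bytes`));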
The Node stream interface consists of two parts: Readable streams and Writable streams. Some streams are both readable and writable.
So what's the difference here? Also, what is the difference between fs.createReadStream(<somefile>); and fs.readFile(<somefile>);? Both are asynchronous, right?!
Well, aside from the fact that fs.createReadStream() directly returns a stream object, while fs.readFile() expects a callback function as its second argument, there is another huge difference.
Yes they are both asynchronous, but that doesn't change the fact that fs.readFile() doesn't give you any data until the entire file has been buffered into memory. This is much less memory-efficient and slower when relaying data back through server responses. With fs.createReadStream(), you can pipe() the stream object directly to a server's response object, which means your client can immediately start receiving data even if the file is 500MB.
Not only this, you also improve memory efficiency by dealing with the file one chunk at a time rather than all at once. This means that your memory only has to buffer the file contents a few kilobytes at a time rather than all at once.
Here are two snippets demonstrating what I'm saying:
const fs = require('fs');
const http = require('http');

// using readFile()
http.createServer(function (req, res) {
  // let's pretend this is a huge 500MB zip file
  fs.readFile('some/file/path.zip', function (err, data) {
    // the entire file must be buffered into memory as `data`, which could be very slow;
    // the whole payload is then sent at once - no streaming here
    res.write(data);
    res.end();
  });
}).listen(8000);

// using createReadStream()
http.createServer(function (req, res) {
  // processes the large file in chunks,
  // sending them to the client as soon as they're ready;
  // this is more memory-efficient and responsive
  fs.createReadStream('some/file/path.zip').pipe(res);
}).listen(8001);
