Node.js streams and responsibility for retrieval of new data - javascript

Streams enable us to model infinite time series.
They can be implemented using generators, with values retrieved from the stream whenever the program wants them via next().
But Node.js streams are event emitters that tell the program when to handle data through the data event.
This seems to hand responsibility for deciding when to process new data to the stream, instead of leaving it with the program, which presumably knows best whether it can handle the new data.
Is there a name for these two approaches (push and pull?), and when should each be used?
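To make the distinction concrete, here is a minimal sketch of the two styles being contrasted, using only core JavaScript and Node's EventEmitter (the names counter and producer are purely illustrative):

// Pull: the consumer decides when to ask for the next value via next().
function* counter() {
    let i = 0;
    while (true) yield i++;
}
const it = counter();
console.log(it.next().value); // 0, pulled only when the consumer is ready
console.log(it.next().value); // 1

// Push: the producer decides when the consumer has to handle data.
const { EventEmitter } = require('events');
const producer = new EventEmitter();
producer.on('data', (chunk) => console.log('pushed:', chunk));
producer.emit('data', 42); // the handler runs whether or not the consumer is ready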

However, there is a push/pull method set available for streams (usually called streams2). I agree it's still quite complicated to use, and generators have been raised many times as a possible solution.
I implemented this in my Scramjet framework, so a very simple stream generator would look like this:
const { DataStream } = require("scramjet"); // assuming scramjet's exported DataStream class

const dat = [1, 2, 3, 4];
const iter = (function* () { yield* dat; })();

exports.stream = () => DataStream.fromIterator(iter);
You can run a test case using nodeunit in the scramjet repo here.

Related

How to limit flow between streams in NodeJS

I have a readStream piped to a writeStream. The read stream reads from the internet and the write stream writes to my local database instance. I noticed that the read speed is much faster than the write speed, and my app's memory usage rises until it reaches
JavaScript heap out of memory
I suspect that the read data accumulates in the NodeJS app. How can I limit the read stream so it reads only what the write stream is capable of writing at a given time?
OK, so long story short: the mechanism you need to be aware of to solve this kind of issue is backpressure. It is not a problem when you are using Node's standard pipe(); it only happened here because I am using a custom fan-out to multiple streams.
You can read about it here: https://nodejs.org/en/docs/guides/backpressuring-in-streams/
This solution is not ideal, as it will block the read stream whenever any of the fan-out write streams is blocked, but it gives a general idea of how to approach the problem:
combinedStream.pipe(transformer).on('data', async (data: DbObject) => {
    const writeStream = dbClient.getStreamForTable(data.table);
    if (!writeStream.write(data.csv)) {
        // The destination's buffer is full: pause the source until it drains
        combinedStream.pause();
        await new Promise((resolve) => writeStream.once('drain', resolve));
        combinedStream.resume();
    }
});
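For the simple one-source, one-destination case described in the question, a hedged sketch of the standard approach (assuming hypothetical readStream, transformer and writeStream variables): pipe() and pipeline() propagate backpressure automatically, so no manual pause/resume is needed.

const { pipeline } = require('stream');

// pipeline() pauses the source whenever the destination's buffer is full
// and resumes it on 'drain', so memory usage stays bounded.
pipeline(readStream, transformer, writeStream, (err) => {
    if (err) console.error('Pipeline failed:', err);
    else console.log('Pipeline finished');
});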

gRPC Server-Side Streaming: How to Continue Stream Indefinitely?

I am having an issue with a lightweight gRPC server I'm writing in NodeJS. I'm referencing the documentation here. I have been able to compile my proto files representing messages and services, and have successfully stood up a gRPC server with a server-side stream method I am able to trigger via BloomRPC.
I have a proto message called parcel, which has one field: parcel_id. I want this method to stream a parcel of data every second. My first rudimentary pass at this was a loop that executes every second for a minute, and applies a new parcel via call.write(parcel). I've included the method below, it executes with no errors when I invoke it via gRPC.
/**
 * Implements the updateParcel RPC method.
 * Feeds new parcels to the passed-in "call" param
 * until the simulation is stopped.
 */
function updateParcels(call) {
    console.log("Parcels requested...");
    // Continuously stream parcel protos to the requester
    let i = 0;
    let id = 0;
    while (i < 60) {
        // Create a dummy parcel
        let parcel = new messages.Parcel();
        parcel.setParcelId(id);
        id++; // Increment id
        // Write the parcel to the call object
        console.log("Sending parcel...");
        call.write(parcel);
        // Sleep for a second (1000 ms) before repeating
        sleep(1000);
        i++; // Count iterations so the loop stops after a minute
    }
    call.end();
}
My issue is that, although I am able to call my methods and receive results, the behavior is that I receive the first result immediately on the client (for both NodeJS client code and BloomRPC calls), but receive the last 59 results all at once only after the server executes call.end(). There are no errors, and the parcel objects I receive on the client are accurate and formatted correctly, they are just batched as described.
How can I achieve a constant stream of my parcels in real time? Is this possible? I've looked but can't tell for sure - do gRPC server-side streams have a batching behavior by default? I've tried my best to understand the gRPC documentation, but I can't tell if I'm simply trying to force gRPC server-side streams to do something they weren't designed to do. Thanks for any help, and let me know if I can provide any more information, as this is my first gRPC related SO question and I may have missed some relevant information.
The issue could be unrelated to gRPC and caused instead by the sleep implementation used there.
The promise-based sleep provided by Node has to be awaited, so for this to work you probably have to declare the function as async and call await sleep(1000); otherwise the loop never yields to the event loop, and all the writes get flushed only at the end. A sketch of that fix follows below.
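A hedged sketch of that fix, assuming Node 15+ where a promise-based setTimeout is available from timers/promises (messages.Parcel is taken from the question's code):

const { setTimeout: sleep } = require('timers/promises');

async function updateParcels(call) {
    for (let id = 0; id < 60; id++) {
        const parcel = new messages.Parcel();
        parcel.setParcelId(id);
        call.write(parcel); // flushed to the client while we await below
        await sleep(1000);  // yield to the event loop for one second
    }
    call.end();
}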

Is moving database works to child processes a good idea in node.js?

I just started getting into child_process and all I know is that it's good for delegating blocking functions (e.g. looping a huge array) to child processes.
I use typeorm to communicate with the MySQL database. I was wondering if there's a benefit to moving some of the asynchronous database work to child processes. I read in another thread (unfortunately I couldn't find it in my browser history) that there's no good reason to delegate async functions to child processes. Is that true?
example code:
child.js
import { createConnection } from "./dbConnection";
import { SomeTable } from "./entity/SomeTable";

process.on('message', (m) => {
    createConnection().then(async connection => {
        let repository = connection.getRepository(SomeTable);
        let results = await repository
            .createQueryBuilder("t")
            .orderBy("t.postId", "DESC")
            .getMany();
        process.send(results);
    });
});
main.js
const cp = require('child_process');
const child = cp.fork('./child.js');
child.send('Please fetch some data');
child.on('message', (m) => {
    console.log(m);
});
The big gain of JavaScript is its asynchronous nature...
When you call an asynchronous function, the code continues to execute without waiting for the answer; only when the function is done and an answer is available does it continue with that part.
Your database call is already asynchronous, so you would be spawning another Node process for nothing. Since your database takes all the heat, having more Node.js processes wouldn't help on that front.
Take the same example but with a file write. What could make the write to disk faster? Nothing much, really. But do we care? No, because our Node.js process is not blocked and keeps answering requests and handling tasks. The only thing you might want to check is that you don't send a thousand file writes at the same time; if they are big, there would be a negative impact on the file system, but since a write is not CPU intensive, Node will run just fine.
Child processes really are a great tool, but it is rare to need them. I too wanted to use them when I heard about them, but the thing is that you will most likely not need them at all... The only time I decided to use them was to create a CPU-intensive worker. It would make sure it spawns one child process per core (since Node is single-threaded) and respawn any faulty ones, as sketched below.
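A rough sketch of that worker pattern, assuming a hypothetical CPU-bound script named cpu-worker.js: fork one child per core and respawn any that exit abnormally.

const cp = require('child_process');
const os = require('os');

function spawnWorker() {
    const worker = cp.fork('./cpu-worker.js'); // hypothetical CPU-bound script
    worker.on('exit', (code) => {
        if (code !== 0) spawnWorker(); // respawn a faulty worker
    });
}

// One worker per core, since a single Node process runs JavaScript on one thread.
os.cpus().forEach(spawnWorker);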

Multiple Rhino (java) threads manipulate the same file

I am writing a piece of javascript (ecmascript) within a 3rd-party application which uses embedded Rhino. The application may start multiple Java threads to handle data concurrently. It seems that every Java thread starts its own embedded Rhino context which in turn runs my script.
The purpose of my script is to receive data from the application and use it to maintain the contents of a particular file. I need a fail-safe way to handle this concurrency from within my script.
So far, what I have come up with is to call out to Java and use java.nio.channels.FileLock. However, the documentation here states:
File locks are held on behalf of the entire Java virtual machine. They are not suitable for controlling access to a file by multiple threads within the same virtual machine.
Sure enough, the blocking call FileChannel.lock() does not block but throws an exception, leading to the following ugly code:
var count = 0;
while ( count < 100 )
{
    try
    {
        var rFile = new java.io.RandomAccessFile(this.mapFile, "rw");
        var lock = rFile.getChannel().lock();
        try
        {
            // Here I do whatever the script needs to do with the file
        }
        finally
        {
            lock.release();
        }
        rFile.close();
        break;
    } catch (ex) {
        // This is reached whenever another instance has a lock
        count++;
        java.lang.Thread.sleep( 10 );
    }
}
Q: How can I solve this in a safe and reliable manner?
I have seen posts regarding Rhino sync() being similar to Java synchronized but that does not seem to work between multiple instances of Rhino.
UPDATE
I have tried the suggestion of using Synchronizer with org.mozilla.javascript.tools.shell.Global as a template:
function synchronize( fn, obj )
{
    return new Packages.org.mozilla.javascript.Synchronizer(fn).call(obj);
}
Next, I use this function as follows:
var mapFile = new java.io.File(mapFilePath);
// MapWriter is a js object
var writer = new MapWriter( mapFile, tempMap );
var on = Packages.java.lang.Class.forName("java.lang.Object");
// Call the writer's update function synchronized
synchronize( function() { writer.update() } , on );
However I see that two threads enter the update() function simultaneously. What is wrong with my code?
Depending how Rhino is embedded, there are two possibilities:
If the code is executed in the Rhino shell, use the sync(f,lock) function to turn a function into a function that synchronizes on the second argument, or on the this object of its invocation if the second argument is absent. (Earlier versions only had the one-argument method, so unless your third-party application uses a recent version, you may need to use that or roll your own; see below.)
If the application is not using the Rhino shell, but using a custom embedding that does not include concurrency tools, you'll need to roll your own version. The source code for sync is a good starting point (see the source code for Global and Synchronizer; you should be able to use Synchronizer pretty much out-of-the-box the same way Global uses it).
It is possible that the problem is that the object on which you are trying to synchronize is not shared across contexts, but is created multiple times by the embedding or something. If so, you may need to use some sort of hack, especially if you have no control over the embedding. If you have no control over the embedding, you could use some kind of VM-global object on which to synchronize, like Runtime.getRuntime() or something (I can't think of any that I immediately know are single objects, but I suspect several of those with singleton APIs like Runtime are.)
Another candidate for something on which to synchronize would be something like Packages.java.lang.Class.forName("java.lang.Object"), which should refer to the same object (the Object class) in all contexts unless the embedding's class loader setup is extremely unusual.
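For example, a hedged sketch of that idea, assuming the embedding exposes the shell's two-argument sync() (or an equivalent rolled from Synchronizer): use a JVM-wide singleton as the monitor so every Rhino context shares the same lock.

// Runtime.getRuntime() is the same object in every context within one JVM,
// so all threads serialize on it no matter which Rhino context they run in.
var vmLock = Packages.java.lang.Runtime.getRuntime();

var writeMapFile = sync(function () {
    // ...update the map file here (e.g. call writer.update())...
}, vmLock);

writeMapFile(); // only one thread at a time can execute the function body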

How can I open a nodejs Duplex stream given a file descriptor?

I'm porting an existing program to nodejs. In this program, I open a file descriptor and then hand it off to a thread which calls poll on it in order to determine when it's readable.
Instead of writing a custom C++ module, I'd really like to do this in pure javascript making use of Node's handy dandy Duplex stream.
For example I'd like to do something like this:
var device = new PollDuplexStream(fileDescriptor);
device.on('data', function(data) {
    // data handling logic here
});
...
var chunk = new Buffer(...);
device.write(chunk);
It seems like this should exist, but I'm not seeing where it does. Perhaps I'm just blind? What's the real world equivalent of PollDuplexStream from the example above?
Please note that I'm explicitly looking for a solution which starts with a file descriptor rather than a path, otherwise I'd just create my own from fs.createReadStream and fs.createWriteStream.
Also I don't care that it calls poll internally - in fact, I'd prefer that it use libuv's uv_poll_* internally.
You'll need to create a binary addon that uses a uv_poll_t handle, which can poll arbitrary file descriptors for readability/writability.
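As a side note (an assumption about your descriptor, not a general solution): if the file descriptor happens to refer to a socket or a pipe, Node's net.Socket can already wrap it as a Duplex stream; for truly arbitrary descriptors, the native uv_poll_t addon described above is still required.

const net = require('net');

// Only valid when fileDescriptor refers to a socket or pipe; other kinds of
// descriptors still need the native addon approach described above.
const device = new net.Socket({ fd: fileDescriptor, readable: true, writable: true });

device.on('data', (data) => {
    // data handling logic here
});

device.write(Buffer.from([0x01, 0x02]));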
