I am reading data from a PostgreSQL database using Node.js:
const readFromDatabase = function (callback) {
  pg.connect('pg://…', (errConnect, client, disconnect) => {
    if (errConnect) {
      return callback(errConnect);
    }
    const query = client.query('SELECT * FROM …');
    // …
  });
};
The query object is now an event emitter that emits a row event whenever a row is received. Additionally, it emits an end event once all rows have been read.
What I would like to do now is to wrap this event emitter into a Highland.js stream and hand this over to the caller of my function. Basically this should do the job:
const stream = highland('row', query);
callback(null, stream);
Unfortunately, I still need to call the disconnect function once all rows have been read, and I don't want the caller to care about this. So how can I hand out the stream while still being able to register a callback for the end event?
I have seen that Highland.js offers the done function, which does exactly what I need, but it also causes the stream to start flowing (which I do not want to trigger internally; that's up to my caller).
How can I solve this?
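One possible workaround (just a sketch, not taken from an accepted answer): since the query object itself emits end once all rows have been read, the connection can be released there, so the caller never has to know about disconnect:
const readFromDatabase = function (callback) {
  pg.connect('pg://…', (errConnect, client, disconnect) => {
    if (errConnect) {
      return callback(errConnect);
    }
    const query = client.query('SELECT * FROM …');
    // Release the connection once the query has emitted all of its rows;
    // this does not require the Highland stream to start flowing.
    query.on('end', () => disconnect());
    const stream = highland('row', query);
    callback(null, stream);
  });
};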
Short backstory: I am trying to create a Readable stream based on data chunks that are emitted back to my server from the client side with WebSockets. Here's a class I've created to "simulate" that behavior:
class DataEmitter extends EventEmitter {
  constructor() {
    super();
    const data = ['foo', 'bar', 'baz', 'hello', 'world', 'abc', '123'];
    // Every second, emit an event with a chunk of data
    const interval = setInterval(() => {
      this.emit('chunk', data.splice(0, 1)[0]);
      // Once there are no more items, emit an event
      // notifying that that is the case
      if (!data.length) {
        this.emit('done');
        clearInterval(interval);
      }
    }, 1e3);
  }
}
In this post, the dataEmitter in question will have been created like this.
// Our data is being emitted through events in chunks from some place.
// This is just to simulate that. We cannot change the flow - only listen
// for the events and do something with the chunks.
const dataEmitter = new DataEmitter();
Right, so I initially tried this:
const readable = new Readable();
dataEmitter.on('chunk', (data) => {
  readable.push(data);
});
dataEmitter.once('done', () => {
  readable.push(null);
});
But that results in this error:
Error [ERR_METHOD_NOT_IMPLEMENTED]: The _read() method is not implemented
So I did this, implementing read() as an empty function:
const readable = new Readable({
  read() {},
});
dataEmitter.on('chunk', (data) => {
  readable.push(data);
});
dataEmitter.once('done', () => {
  readable.push(null);
});
And it works when piping into a write stream, or sending the stream to my test API server. The resulting .txt file looks exactly as it should:
foobarbazhelloworldabc123
However, I feel like there's something quite wrong and hacky with my solution. I attempted to put the listener registration logic (.on('chunk', ...) and .once('done', ...)) within the read() implementation; however, read() seems to get called multiple times, and that results in the listeners being registered multiple times.
The Node.js documentation says this about the _read() method:
When readable._read() is called, if data is available from the resource, the implementation should begin pushing that data into the read queue using the this.push(dataChunk) method. _read() will be called again after each call to this.push(dataChunk) once the stream is ready to accept more data. _read() may continue reading from the resource and pushing data until readable.push() returns false. Only when _read() is called again after it has stopped should it resume pushing additional data into the queue.
After dissecting this, it seems that the consumer of the stream calls upon .read() when it's ready to read more data. And when it is called, data should be pushed into the stream. But, if it is not called, the stream should not have data pushed into it until the method is called again (???). So wait, does the consumer call .read() when it is ready for more data, or does it call it after each time .push() is called? Or both?? The docs seem to contradict themselves.
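For what it's worth, here is the minimal pull-based pattern I understand the docs to be describing (just a sketch for context, not my actual use case):
const { Readable } = require('stream');

let i = 0;
const counter = new Readable({
  // read() is only called when the consumer wants more data,
  // so push one chunk per call and end the stream with null.
  read() {
    this.push(i < 5 ? String(i++) : null);
  },
});

counter.pipe(process.stdout); // prints 01234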
Implementing .read() on Readable is straightforward when you've got a basic resource to stream, but what would be the proper way of implementing it in this case?
And also, would someone be able to explain in better terms what the .read() method is on a deeper level, and how it should be implemented?
Thanks!
Response to the answer:
I did try registering the listeners within the read() implementation, but because it is called multiple times by the consumer, it registers the listeners multiple times.
Observing this code:
const readable = new Readable({
  read() {
    console.log('called');
    dataEmitter.on('chunk', (data) => {
      readable.push(data);
    });
    dataEmitter.once('done', () => {
      readable.push(null);
    });
  },
});
readable.pipe(createWriteStream('./data.txt'));
The resulting file looks like this:
foobarbarbazbazbazhellohellohellohelloworldworldworldworldworldabcabcabcabcabcabc123123123123123123123
Which makes sense, because the listeners are being registered multiple times.
It seems like the only purpose of actually implementing the read() method here is to start receiving the chunks and pushing them into the stream once the consumer is ready for that.
Based on these conclusions, I've come up with this solution.
class MyReadable extends Readable {
  // Keep track of whether or not the listeners have already
  // been added to the data emitter.
  #registered = false;

  _read() {
    // If the listeners have already been registered, do
    // absolutely nothing.
    if (this.#registered) return;

    // "Notify" the client via websockets that we're ready
    // to start streaming the data chunks.
    const emitter = new DataEmitter();

    const handler = (chunk: string) => {
      this.push(chunk);
    };

    emitter.on('chunk', handler);

    emitter.once('done', () => {
      this.push(null);
      // Clean up the listener once it's done (this is
      // assuming the emitter object will still be used
      // in the future).
      emitter.off('chunk', handler);
    });

    // Mark the listeners as registered.
    this.#registered = true;
  }
}
const readable = new MyReadable();
readable.pipe(createWriteStream('./data.txt'));
But this implementation doesn't allow for the consumer to control when things are pushed. I guess, however, in order to achieve that sort of control, you'd need to communicate with the resource emitting the chunks to tell it to stop until the read() method is called again.
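If the chunk source could be told to stop and start, that communication might look something like this (a sketch: pause() and resume() are hypothetical methods the real source would have to expose; the DataEmitter above does not have them):
const { Readable } = require('stream');

class BackpressureReadable extends Readable {
  constructor(emitter, options) {
    super(options);
    this.emitter = emitter;
    emitter.on('chunk', (chunk) => {
      // push() returning false means the internal buffer is full,
      // so ask the source to stop until _read() is called again.
      if (!this.push(chunk)) emitter.pause();
    });
    emitter.once('done', () => this.push(null));
  }
  _read() {
    // The consumer is ready for more data: tell the source to continue.
    this.emitter.resume();
  }
}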
I have the following setup:
async function MyFunction(param) {
  //... Do some computation
  await WriteToDB()
}

io.on('connection', (socket) => {
  socket.on('AnEvent', (param) => MyFunction(param))
})
When an event comes in, it calls an asynchronous function which does some computation and, at the end, writes the result to a database with another asynchronous call.
If MyFunction didn't have an asynchronous call to write to the database at the end, for example:
function MyFunction(param) {
  //... Do some computation
}
then it would be obvious that all events are processed in their incoming order: the processing of the next event only starts after the processing of the previous one finishes. However, because of the asynchronous call to the database, I don't know if those incoming events will still be fully processed in order. I am afraid that the processing of the next event starts before the previous await WriteToDB() finishes. How do I change the code so that they are fully processed in order?
You are correct that there's no guarantee that incoming events will be processed in order.
To achieve what you are asking, you would need a "Message Queue" that will periodically check for new messages and process them one by one.
const messageQueue = [];

// SocketIO adding Message to MessageQueue
const eventHandler = (message) => {
  messageQueue.push(message);
}

const messageHandler = async () => {
  if (messageQueue.length === 0) {
    return;
  }
  const message = messageQueue.shift();
  // Handle the message and wait for it, e.g. await MyFunction(message)
  // If successful, ask for next message
  return messageHandler();
}

// Periodically check for new messages
setInterval(messageHandler, 100);
Of course, my example is pretty naive, but I hope it gives you a general idea of how what you are asking can be accomplished.
If you find yourself needing a more robust message queue, look into RabbitMQ, BullMQ, or Kafka.
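For the concrete setup in the question, one way to drive such a queue without a timer (a sketch; it assumes MyFunction from the question does all the per-event work):
const messageQueue = [];
let draining = false;

// Drain the queue one message at a time, in arrival order
const drainQueue = async () => {
  if (draining) return; // only one drain loop at a time
  draining = true;
  while (messageQueue.length > 0) {
    const param = messageQueue.shift();
    await MyFunction(param); // waits for WriteToDB() to finish
  }
  draining = false;
}

io.on('connection', (socket) => {
  socket.on('AnEvent', (param) => {
    messageQueue.push(param);
    drainQueue();
  })
})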
I want to make sure that a Stream is ready for changes before I can send data to the client. My code:
// Get the Stream (MongoDB collection)
let stream = collection.watch()
// Generate an identifier to send to the client
const id = uuid()
// Listen for changes
stream
  .on('change', () => {
    // Send something to the WebSocket client
    webSocket.emit('identifier', id)
  })
// Mutate the database collection to kick off the "change" event
await collection.updateOne()
The line with webSocket.emit is my problem. How do I know whether the Stream is already ready to receive events? It can happen that the change event never occurs, so webSocket.emit never gets invoked.
TL;DR
Basically, I need to send something to the client but need to make sure that the Stream is ready for receiving events before that.
This looks like a race condition where your update query is executed before the changeStream aggregation pipeline reaches the server. Basically you need to wait for the stream cursor to be set before triggering the change.
I couldn't find any "cursor ready" event, so as a workaround you can check its id. It is assigned by the server, so once it is available on the client it more or less guarantees that all subsequent data changes will be captured.
Something like this should do the job:
async function streamReady(stream) {
  return new Promise(ok => {
    const i = setInterval(() => {
      if (stream.cursor.cursorState.cursorId) {
        clearInterval(i);
        return ok()
      }
    }, 1)
  });
}
Then in your code:
// Get the Stream (MongoDB collection)
let stream = collection.watch()
// Generate an identifier to send to the client
const id = uuid()
// Listen for changes
stream
  .on('change', () => {
    // Send something to the WebSocket client
    webSocket.emit('identifier', id)
  })

await streamReady(stream);

// Mutate the database collection to kick off the "change" event
await collection.updateOne()
Disclaimer:
The streamReady function above relies on cursorState. It is an internal field which can be changed without notice even in a patch version update of the driver.
I've managed to get this to work without using an undocumented piece of the API.
await new Promise<void>((resolve) => {
  changeStream.once('resumeTokenChanged', () => {
    resolve();
  });
});
I'm using the resumeTokenChanged event. This worked for me.
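In the context of the question's code, that would look something like this (a sketch; stream is the change stream returned by collection.watch(), and id is the identifier generated earlier):
let stream = collection.watch()

stream.on('change', () => {
  webSocket.emit('identifier', id)
})

// Wait until the server has handed out a resume token, i.e. the change
// stream is actually live, before triggering the change
await new Promise((resolve) => stream.once('resumeTokenChanged', resolve))

await collection.updateOne()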
I have an application which listens to a websocket endpoint and processes the data received from it and saves it to a database.
The problem of a race condition arises when two callbacks are invoked concurrently (for example: one task may begin processing, then another task may begin processing and update the database, then the first task may update the database, so in the end the database updates are out of order).
The solution I thought of was to record the exact time a callback is called, process the data, then attach that time to the data passed to the database, and in the database compare it with the time of the last update and act accordingly.
One possible problem I thought of is that the time may be recorded out of order (for example: consider the scenario where the first callback is called, then the second callback is called and the time is recorded, then the time is recorded for the first callback).
How would you do this the right way? Are there solutions to this problem, or other ways to go about it?
EDIT: To be more specific: since I'm intending for the program to be as real-time as possible, I'd like to allow the most up-to-date callback to be processed without delay (without waiting for all previous callbacks to finish processing), but to ensure that the end result of the processing (as recorded in the database) adheres to the order in which the callbacks arrived (is not corrupt).
You can have the data handler callback return a promise that resolves when it's finished.
Each time you get new data from the socket, wait for that promise before handling it, then store the resulting promise for the next piece of data to wait for.
That would look like this:
let ready = Promise.resolve();

socket.on(..., data => {
  // Chain each piece of data onto the previous one's promise
  ready = ready.then(() => processData(data));
});
This will have no effect on any other code.
EDIT: To do expensive work outside the lock, you can write
socket.on(..., data => {
  const result = doExpensiveWork(data); // Returns a promise
  // Wait for both the expensive work and the previous operation before inserting
  ready = Promise.all([result, ready]).then(([result]) => insertData(result));
});
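Put together with error handling, the whole handler might look like this (a sketch; the event name and the processData/insertData helpers are placeholders for the question's actual computation and database write):
let ready = Promise.resolve();

socket.on('someEvent', (data) => { // event name is a placeholder
  // Start the expensive processing right away, outside the lock...
  const processed = processData(data); // returns a promise
  // ...but serialize the database writes in arrival order
  ready = Promise.all([processed, ready])
    .then(([result]) => insertData(result))
    .catch((err) => console.error(err)); // keep the chain alive on errors
});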
I am trying to implement a stream with the new Node.js streams API that will buffer a certain amount of data. When this stream is piped to another stream, or if something consumes readable events, this stream should flush its buffer and then simply become pass-through. The catch is, this stream will be piped to many other streams, and when each destination stream is attached, the buffer must be flushed even if it is already flushed to another stream.
For example:
- BufferStream implements stream.Transform, and keeps a 512KB internal ring buffer
- ReadableStreamA is piped to an instance of BufferStream
- BufferStream writes to its ring buffer, reading data from ReadableStreamA as it comes in. (It doesn't matter if data is lost, as the buffer overwrites old data.)
- BufferStream is piped to WritableStreamB
- WritableStreamB receives the entire 512KB buffer, and continues to get data as it is written from ReadableStreamA through BufferStream.
- BufferStream is piped to WritableStreamC
- WritableStreamC also receives the entire 512KB buffer, but this buffer is now different than what WritableStreamB received, because more data has since been written to BufferStream.
Is this possible with the streams API? The only method I can think of would be to create an object with a method that spins up a new PassThrough stream for each destination, meaning I couldn't simply pipe to and from it.
For what it's worth, I've done this with the old "flowing" API by simply listening for new handlers on data events. When a new function was attached with .on('data'), I would call it directly with a copy of the ring buffer.
Here's my take on your issue.
The basic idea is to create a Transform stream, which will allow us to execute your custom buffering logic before sending the data on the output of the stream:
var util = require('util')
var stream = require('stream')

var BufferStream = function (streamOptions) {
  stream.Transform.call(this, streamOptions)
  this.buffer = Buffer.alloc(0)
}

util.inherits(BufferStream, stream.Transform)

BufferStream.prototype._transform = function (chunk, encoding, done) {
  // custom buffering logic
  // ie. add chunk to this.buffer, check buffer size, etc.
  this.buffer = Buffer.from(chunk)
  this.push(chunk)
  done()
}
Then, we need to override the .pipe() method so that we are notified when the BufferStream is piped into a stream, which allows us to automatically write the buffered data to it:
BufferStream.prototype.pipe = function (destination, options) {
  var res = BufferStream.super_.prototype.pipe.call(this, destination, options)
  res.write(this.buffer)
  return res
}
In this way, when we write buffer.pipe(someStream), we perform the pipe as intended and write the internal buffer to the output stream. After that, the Transform class takes care of everything, while keeping track of the backpressure and whatnot.
Here is a working gist. Please note that I didn't bother writing a correct buffering logic (ie. I don't care about the size of the internal buffer), but this should be easy to fix.
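For example, using it roughly as in the scenario from the question (readableStreamA, writableStreamB and writableStreamC stand in for the real streams):
var bufferStream = new BufferStream()

readableStreamA.pipe(bufferStream)

// B first receives whatever has been buffered so far, then live data
bufferStream.pipe(writableStreamB)

// Later: C also receives the (by now updated) buffer before live data
bufferStream.pipe(writableStreamC)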
Paul's answer is good, but I don't think it meets the exact requirements. It sounds like what needs to happen is that every time pipe() is called on this transform stream, it first needs to flush the buffer that represents all of the data accumulated between the time the transform stream was created (connected to the source stream) and the time it was connected to the current writable/destination stream.
Something like this might be more correct:
var BufferStream = function () {
  stream.Transform.apply(this, arguments);
  this.buffer = []; // I guess an array will do
};

util.inherits(BufferStream, stream.Transform);

BufferStream.prototype._transform = function (chunk, encoding, done) {
  this.push(chunk ? String(chunk) : null);
  this.buffer.push(chunk ? String(chunk) : null);
  done();
};

BufferStream.prototype.pipe = function (destination, options) {
  var res = BufferStream.super_.prototype.pipe.apply(this, arguments);
  this.buffer.forEach(function (b) {
    res.write(String(b));
  });
  return res;
};

return new BufferStream();
I suppose this:
BufferStream.super_.prototype.pipe.apply(this, arguments);
is equivalent to this:
stream.Transform.prototype.pipe.apply(this, arguments);
You could probably optimize this and use some flags when pipe/unpipe are called.