NodeJS: Readable object streams, patterns for generating the data asynchronously - javascript

I'd like to crawl data over SSH in a server cluster with NodeJS.
The remote scripts output JSON that is then parsed and split into an object stream.
My problem is that the callback-oriented libraries I use (SSH2, MySQL) lead to a callback pattern that I find hard to reconcile with the Readable API spec. How do I implement _read(size) when the data to push sits behind a bunch of callbacks?
My current implementation takes advantage of the fact that Streams are also EventEmitters. I start to populate my data upon constructing the Stream instance. When all my callbacks are done, I emit an event. I then listen on the custom event, and only then do I start to push data down the pipe chain.
// Calling code
var stream = new CrawlerStream(argsForTheStream);
stream.on('queue_completed', function() {
  stream
    .pipe(logger)
    .pipe(dbWriter)
    .on('end', function() {
      // Close db connection etc...
    });
});
A mock of the CrawlerStream would be
// Mock of the Readable stream implementation
function CrawlerStream(args) {
  // boilerplate

  // array holding the data to push
  this.data = [];
  // semi-colon separated string of commands
  var cmdQueue = getCommandQueue();
  var self = this;

  db.query(sql, function(err, sitesToCrawl, fields) {
    var servers = groupSitesByServer(sitesToCrawl);
    for (var s in servers) {
      sshConnect(getRemoteServer(s), function(err, conn) {
        sshExec({
          ssh: conn,
          cmd: cmdQueue
        }, function(err, stdout, stderr) {
          // Stdout is parsed as JSON
          // Finally I can populate self.data!
          // Check if all servers are done
          // If I'm the last callback to execute
          self.data.push(null);
          self.emit('queue_completed');
        });
      });
    }
  });
}

util.inherits(CrawlerStream, Readable);

CrawlerStream.prototype._read = function(size) {
  while (this.data.length) {
    this.push(this.data.shift());
  }
};
I'm unsure if this is the idiomatic way to accomplish this and would like to get your advice.
Please note in your answers that I'd like to retain the vanilla NodeJS style of using callbacks (no promises) and that I'm stuck with ES5.
Thanks for your time!

Related

BigQuery stream to front-end with Express

I'm trying to read a query from BigQuery and stream it to the front-end. In Node.js-land with Express, this would be:
app.get('/endpoint', (req, res) => {
  bigQuery.createQueryStream(query).pipe(res);
});
However, createQueryStream() does not create a native Node.js stream; instead, it's a custom stream object that returns table rows, and as such it fails:
(node:21236) UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be one of type string or Buffer. Received type object
This is confirmed in the official documentation:
bigquery.createQueryStream(query)
  .on('data', function(row) {
    // row is a result from your query.
  })
So, is there a way to stream BigQuery data to the front-end? I've thought of two potential solutions, but wanted to know if anyone knows a better way:
1. JSON.stringify() the row and return JSONL instead of plain JSON. This adds a front-end burden to decode it, but makes it fairly easy on both sides.
2. Move to the REST API and do actual streaming with request, like request(url, { body: { query, params } }).pipe(res) (or whatever the specific API is, I haven't dug there yet).
I was confused that a Node.js library that says that it does streaming doesn't work with Node.js native streams, but this seems to be the case.
BigQuery is intended to be used with a wide array of client libraries written for different programming languages, so it does not return Node.js-specific data structures but rather more general structures, such as plain objects, that are common to almost any structured programming language. To answer your question: yes, there is a way to stream BigQuery data to the front-end, but it is a rather personal choice, because all it entails is converting from one data type to another. The most straightforward way to do this is by calling JSON.stringify(), which you have already mentioned.
I hope that helps.
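For reference, a minimal sketch of option 1 from the question (newline-delimited JSON), assuming bigQuery is an authenticated @google-cloud/bigquery client and query/params are defined elsewhere; each row is stringified onto its own line, and the front-end splits on newlines:

// Hypothetical Express handler streaming JSONL instead of a JSON array
app.get('/endpoint', (req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/x-ndjson' });
  bigQuery
    .createQueryStream({ query, params })
    .on('error', err => {
      // Headers are already sent, so just log and end the response
      console.error(err);
      res.end();
    })
    .on('data', row => {
      // One JSON document per line; the client splits on '\n'
      res.write(JSON.stringify(row) + '\n');
    })
    .on('end', () => res.end());
});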
We ended up making an implementation that stitched together the reply from BigQuery into a big JSON array:
exports.stream = (query, params, res) => {
  // Light testing for descriptive errors in the parameters
  for (let key in params) {
    if (typeof params[key] === "number" && isNaN(params[key])) {
      throw new TypeError(`The parameter "${key}" should be a number`);
    }
  }
  return new Promise((resolve, reject) => {
    let prev = false;
    const onData = row => {
      try {
        // Only handle it when there's a row
        if (!row) return;
        // There was a previous row written before, so add a comma
        if (prev) {
          res.write(",");
        }
        res.write(stringify(row));
        prev = true;
      } catch (error) {
        console.error("Cannot parse row:", error);
        // Just ignore it, don't write this frame
      }
    };
    const onEnd = () => {
      res.write("]");
      res.end();
      resolve();
    };
    res.writeHead(200, { "Content-Type": "application/json" });
    res.write("[");
    bigQuery
      .createQueryStream({ query, params })
      .on("error", reject)
      .on("data", onData)
      .on("end", onEnd);
  });
};
It will build a large JSON array by stitching together:
[ // <- First character sent
stringify(row1) // <- First row
, // <- add comma on second row iteration
stringify(row2) // <- Second row
...
stringify(rowN) // <- Last row
] // <- Send the "]" character to close the array
This has the following advantages:
- The data is sent as soon as it is available, so the bandwidth needs are lower.
- (Depending on the BigQuery implementation) lower memory needs on the server side, since not all the data is held in memory at once, only small chunks.

NodeJS Event Emitter Blocking Issue

I have a node application handling some ZeroMQ events coming from another application utilizing the Node-ZMQ bindings found here: https://github.com/JustinTulloss/zeromq.node
The issue I am running into is one of the operations from an event takes a long time to process and this appears to be blocking any other event from being processed during this time. Although the application is not currently clustered, doing so would only afford a few more threads and doesn't really solve the issue. I am wondering if there is a way of allowing for these async calls to not block other incoming requests while they process, and how I might go about implementing them.
Here is a highly condensed/contrived code example of what I am doing currently:
var zmq = require('zmq');
var Q = require('q');
var zmqResponder = zmq.socket('rep');
var Client = require('node-rest-client').Client;
var client = new Client();

zmqResponder.on('message', function (msg, data) {
  var parsed = JSON.parse(msg);
  logging.info('ZMQ Request received: ' + parsed.event);
  switch (parsed.event) {
    case 'create':
      // Typically short running process, not an issue
      break;
    case 'update':
      // Long running process, this is the issue
      serverRequest().then(function (response) {
        zmqResponder.send(JSON.stringify(response));
      });
      break;
  }
});

function serverRequest() {
  var deferred = Q.defer();
  client.get(function (data, response) {
    if (response.statusCode !== 200) {
      deferred.reject(data.data);
    } else {
      deferred.resolve(data.data);
    }
  });
  return deferred.promise;
}
EDIT: Here's a gist of the code: https://gist.github.com/battlecow/cd0c2233e9f197ec0049
I think, through the comment thread, I've identified your issue. REQ/REP has a strict synchronous message order guarantee... You must receive-send-receive-send-etc. REQ must start with send and REP must start with receive. So, you're only processing one message at a time because the socket types you've chosen enforce that.
If you were using a different, non-event-driven language, you'd likely get an error telling you what you'd done wrong when you tried to send or receive twice in a row, but node lets you do it and just queues the subsequent messages until it's their turn in the message order.
You want to change REQ/REP to DEALER/ROUTER and it'll work the way you expect. You'll have to change your logic slightly for the ROUTER socket to get it to send appropriately, but everything else should work the same.
Rough example code, using the relevant portions of the posted gist:
var zmqResponder = zmq.socket('router');

zmqResponder.on('message', function (msg, data) {
  var peer_id = msg[0];
  var parsed = JSON.parse(msg[1]);
  switch (parsed.event) {
    case 'create':
      // build parsedResponse, then...
      zmqResponder.send([peer_id, JSON.stringify(parsedResponse)]);
      break;
  }
});

zmqResponder.bind('tcp://*:5668', function (err) {
  if (err) {
    logging.error(err);
  } else {
    logging.info("ZMQ awaiting orders on port 5668");
  }
});
... you need to grab the peer_id (or whatever you want to call it; in ZMQ nomenclature it's the socket ID of the socket you're sending from, think of it as an "address" of sorts) from the first frame of the message you receive, and then send it as the first frame of the message you send back.
By the way, I just noticed in your gist you are both connect()-ing and bind()-ing on the same socket (zmq.js lines 52 & 143, respectively). Don't do that. Inferring from other clues, you just want to bind() on this side of the process.

How do pipe (stream of Node.js) and bl (BufferList) work together?

This is actually exercise No. 8 from the Node.js tutorial ([https://github.com/workshopper/learnyounode][1]).
The goal:
Write a program that performs an HTTP GET request to a URL provided to you as the first command-line argument. Collect all data from the server (not just the first "data" event) and then write two lines to the console (stdout).
The first line you write should just be an integer representing the number of characters received from the server. The second line should contain the complete String of characters sent by the server.
So here's my solution (it passes, but looks uglier compared to the official solution).
var http = require('http'),
    bl = require('bl');

var myBL = new bl(function(err, myBL) {
  console.log(myBL.length);
  console.log(myBL.toString());
});

var url = process.argv[2];

http.get(url, function(res) {
  res.pipe(myBL);
  res.on('end', function() {
    myBL.end();
  });
});
The official solution:
var http = require('http')
var bl = require('bl')

http.get(process.argv[2], function (response) {
  response.pipe(bl(function (err, data) {
    if (err)
      return console.error(err)
    data = data.toString()
    console.log(data.length)
    console.log(data)
  }))
})
I have difficulty understanding how the official solution works. I mainly have two questions:
1. The bl constructor expects the 2nd argument to be an instance of bl (according to the bl module's documentation, [https://github.com/rvagg/bl#new-bufferlist-callback--buffer--buffer-array-][2]), but what is data? It came out of nowhere; it should be undefined when it is passed to construct the bl instance.
2. When is bl.end() called? I can see nowhere that bl.end() is called...
Hope someone can shed some light on these questions. (I know I should've read the source code, but you know...)
[1]: https://github.com/workshopper/learnyounode
[2]: https://github.com/rvagg/bl#new-bufferlist-callback--buffer--buffer-array-
This portion of the bl github page more or less answers your question:
Give it a callback in the constructor and use it just like concat-stream:

const bl = require('bl')
    , fs = require('fs')

fs.createReadStream('README.md')
  .pipe(bl(function (err, data) { // note 'new' isn't strictly required
    // `data` is a complete Buffer object containing the full data
    console.log(data.toString())
  }))

Note that when you use the callback method like this, the resulting data parameter is a concatenation of all Buffer objects in the list. If you want to avoid the overhead of this concatenation (in cases of extreme performance consciousness), then avoid the callback method and just listen to 'end' instead, like a standard Stream.
You're passing a callback to bl, which is basically a function that it will call when it has a stream of data to do something with. Thus, data is undefined for now... it's just a parameter name that will later be used to pass the text from the GET call for printing.
I believe that bl.end() doesn't have to be called because there's no real performance overhead to letting it run, but I could be wrong.
I have read the source code of the bl library and the Node stream API.
BufferList is a custom duplex stream, i.e. both Readable and Writable. When you run readableStream.pipe(bufferList), end() is called by default on the BufferList as the destination when the source stream emits 'end', which fires when there will be no more data to read.
See the implementation of BufferList.prototype.end:
BufferList.prototype.end = function (chunk) {
  DuplexStream.prototype.end.call(this, chunk)
  if (this._callback) {
    this._callback(null, this.slice())
    this._callback = null
  }
}
So the callback passed to the BufferList will be called after the BufferList has received all data from the source stream; calling this.slice() returns the result of concatenating all the Buffers in the BufferList, which is where the data parameter comes from.
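As a side note, that default can be turned off. A minimal sketch (using any readable source and writable destination, not specific to bl) of Node's standard pipe() option that controls whether end() is forwarded:

var fs = require('fs');

var source = fs.createReadStream('README.md');
var dest = fs.createWriteStream('copy.md');

// Default behaviour: dest.end() is called once source emits 'end'.
source.pipe(dest);

// Opting out: pass { end: false } and the destination stays open
// after the source ends, so you must call dest.end() yourself.
// source.pipe(dest, { end: false });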
var request = require('request');

request(process.argv[2], function(err, response, body) {
  console.log(body.length);
  console.log(body);
});
You can have a look at this approach to solve the above exercise.
P.S.: request is a third-party module, though.

meteor js create mongodb database hook to store data from API at fixed interval

tl;dr: What is the best pattern to create a 'proprietary database' with data from an API? In this case, using Meteor JS and collections in MongoDB.
Steps:
1. Ping the API
2. Insert data into Mongo at some interval
In lib/collections.js
Prices = new Mongo.Collection("prices");
Basic stock api call, in server.js:
Meteor.methods({
  getPrice: function () {
    var result = Meteor.http.call("GET", "http://api.fakestockprices.com/ticker/GOOG.json");
    return result.data;
  }
});
Assume the JSON is returned clean and tidy, and I want to store the entire object (how you manipulate what is returned is not important, storing the return value is)
We could manipulate the data in the Meteor method above, but should we? In Angular, services are used to call APIs, and it's recommended to modularize and keep the API call in its own function. Let's borrow that, and Meteor.call the getPrice method above.
Assume this is also done in server.js (please correct me if not).
Meteor.call("getPrice", function(error, result) {
if (error)
console.log(error)
var price = result;
Meteor.setInterval(function() {
Prices.insert(price);
}, 1800000); // 30min
});
Once in the db, a pub/sub could be established, which I'll omit and link to this overview.
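For completeness, a minimal sketch of the pub/sub the question leaves out (the publication name "prices" is illustrative):

// server.js -- publish the collection
Meteor.publish("prices", function () {
  return Prices.find();
});

// client.js -- subscribe to it
Meteor.subscribe("prices");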
You may want to take a look at the synced-cron package.
With a cron job it's pretty easy, just call your method:
// server.js
SyncedCron.start();

SyncedCron.add({
  name: "get Price",
  schedule: function(parser) {
    return parser.text('every 30 minutes');
  },
  job: function() {
    return Meteor.call("getPrice");
  }
});
Then in getPrice you can do var result = HTTP.call(/* etc */); and Prices.insert(result);. You would want some additional checks of course, as you have pointed out.
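A minimal sketch of what that getPrice method could look like, reusing the URL from the question and the HTTP.call / Prices.insert calls mentioned above (error handling and duplicate checks are left out):

// server.js -- method invoked by the cron job above
Meteor.methods({
  getPrice: function () {
    // Synchronous-style HTTP call on the server
    var result = HTTP.call("GET", "http://api.fakestockprices.com/ticker/GOOG.json");
    // Store the whole returned object, as the question intends
    if (result && result.data) {
      Prices.insert(result.data);
    }
  }
});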

node.js and mongodb-native: wait till a collection is non-empty

I am using node.js with the native MongoDB driver (node-mongodb-native).
My current project uses node.js + now.js + MongoDB.
The system basically sends data from the browser to node.js, where it is processed with Haskell and later fed back to the browser again.
Via a form and node.js, the text is inserted into a MongoDB collection called "messages".
A Haskell thread reads the entry and stores the result in the db collection "results". This works fine.
But now I need the JavaScript code that waits for the result to appear in the collection "results".
Pseudo code:
1. Wait until the collection "results" is non-empty.
2. findOne() from the collection "results".
3. Delete the collection "results".
I currently connect to the mongodb like this:
var mongo = require('mongodb'),
    Server = mongo.Server,
    Db = mongo.Db;

var server = new Server('localhost', 27017, {
  auto_reconnect: true
});
var db = new Db('test', server);
My Haskell knowledge is quite good, but my JavaScript skills are not.
So I did extensive searches, but I didn't get far.
Glad you solved it; I was going to write something similar:
setTimeout(function() {
  db.collection('results', function(coll) {
    coll.findOne({}, function(err, one) {
      if (err) return callback(err);
      coll.drop(callback); // or destroy, not really sure <-- this will drop the whole collection
    });
  });
}, 1000);
The solution is to use the async library.
var async = require('async');

globalCount = -1;

async.whilst(
  function () {
    return globalCount < 1;
  },
  function (callback) {
    console.log("inner while loop");
    setTimeout(db_count(callback), 1000);
  },
  function (err) {
    console.log(" || whilst loop finished!!");
  }
);
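The snippet assumes a db_count helper that isn't shown. A minimal sketch of what it could look like with the old node-mongodb-native callback API (the collection name "results" comes from the question; note the anonymous wrapper so setTimeout doesn't invoke db_count immediately):

// Hypothetical helper: counts documents in "results" and updates globalCount.
function db_count(callback) {
  db.collection('results', function (err, coll) {
    if (err) return callback(err);
    coll.count(function (err, count) {
      if (err) return callback(err);
      globalCount = count;
      callback(null);
    });
  });
}

// Inside the whilst body, defer the call by one second:
// setTimeout(function () { db_count(callback); }, 1000);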
