I have a tool whose basic idea is as follows:
//get a bunch of couchdb databases. this is an array
const jsonFile = require('jsonfile');
let dbList = getDbList();
const filePath = 'some/path/to/file';
const changesObject = {};
//iterate the db list. do asynchronous stuff on each iteration
dbList.forEach(function(db){
let merchantDb = nano.use(db);
//get some changes from the database. validate inside callback
merchantDb.get("_changes", function(err,changes){
validateChanges(changes);
changesObject['db'] = changes.someAttribute;
//write changes to file
jsonFile.writeFile(filePath, changesObject, function (err) {
if (err) {
logger.error("Unable to write to file: ");
}
});
  });
});

const validateChanges = function(changes) {
  if (!validateLogic(changes)) sendAlertMail();
};
For performance reasons the iteration is not done synchronously, so there can be multiple iterations running in 'parallel'. My question is: can this cause any data inconsistencies and/or issues with the file-writing process?
Edit:
The same file gets written to on each iteration.
Edit:2
The changes are stored as a JSON object with key value pairs. The key being the db name.
If you're really writing to a single file, which you appear to be (though it's hard to be sure), then you have a race condition: multiple callbacks will try to write to the same file, possibly at the same time (remember, I/O isn't done on the JavaScript thread in Node unless you use the *Sync functions). At best that means the last one wins; at worst it means I/O errors because of overlap.
If you're writing to separate files for each db, then provided there's no cross-talk (shared state) amongst validateChanges, validateLogic, sendAlertMail, etc., that should be fine.
Just for detail: the forEach loop will start the tasks (jobs) that get the changes and then write them out; the callbacks of the get calls won't run until later, after all of those jobs have been queued.
You are creating closures in loops, but the way you're doing it is okay, both because you're doing it within the forEach callback and because you're not using db in the get callback (which would be fine with the forEach callback but not with some other ways you might loop arrays). Details on that aspect in this question's answers if you're interested.
This line is suspect, though:
let merchantDb = nano.use('db');
I suspect you meant (no quotes):
let merchantDb = nano.use(db);
For what it's worth, it sounds from the updates to the question and your various comments like the better solution would be not to write out the file separately each time. Instead, you want to gather up the changes and then write them out.
You can do that with the classic Node-callback APIs you're using like this:
let completed = 0;
//iterate the db list. do asynchronous stuff on each iteration
dbList.forEach(function(db) {
let merchantDb = nano.use(db);
//get some changes from the database. validate inside callback
merchantDb.get("_changes", function(err, changes) {
if (err) {
// Deal with the fact there was an error (don't return)
} else {
validateChanges(changes);
changesObject[db] = changes.someAttribute; // <=== NOTE: This line had 'db' rather than db, I assume that was meant to be just db
}
if (++completed === dbList.length) {
// All done, write changes to file
jsonFile.writeFile(filePath, changesObject, function(err) {
if (err) {
logger.error("Unable to write to file: ");
}
});
}
})
});
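If you later move to promises, the same gather-then-write idea can be sketched with util.promisify layered on the question's own names (dbList, nano, jsonFile, filePath, validateChanges); this is only an illustration, not a drop-in replacement:

const util = require('util');

async function collectAndWriteChanges() {
  const changesObject = {};
  await Promise.all(dbList.map(async (db) => {
    const merchantDb = nano.use(db);
    // Wrap the callback-style get in a promise-returning function
    const get = util.promisify(merchantDb.get.bind(merchantDb));
    try {
      const changes = await get("_changes");
      validateChanges(changes);
      changesObject[db] = changes.someAttribute;
    } catch (err) {
      // Deal with the per-database error here (swallow it if the rest should continue)
    }
  }));
  // All databases handled; write the combined object once
  await util.promisify(jsonFile.writeFile)(filePath, changesObject);
}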
Related
I'm new to using csv-parse and this example from the project's github does what I need with one exception. Instead of outputting via console.log I want to store data in a variable. I've tried assigning the fs line to a variable and then returning data rather than logging it but that just returned a whole bunch of stuff I didn't understand. The end goal is to import a CSV file into SQLite.
var fs = require('fs');
var parse = require('..');
var parser = parse({delimiter: ';'}, function(err, data){
console.log(data);
});
fs.createReadStream(__dirname+'/fs_read.csv').pipe(parser);
Here is what I have tried:
const fs = require("fs");
const parse = require("./node_modules/csv-parse");
const sqlite3 = require("sqlite3");
// const db = new sqlite3.Database("testing.sqlite");
let parser = parse({delimiter: ","}, (err, data) => {
// console.log(data);
return data;
});
const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);
I was also struggling to figure out how to get the data from csv-parse back to the top-level that invokes parsing. Specifically I was trying to get parser.info data at the end of processing to see if it was successful, but the solution for that can work to get the row data as well, if you need.
The key was to wrap all the stream event listeners into a Promise, and within the parser's callback resolve the Promise.
// Requires assumed by this snippet (the import shape differs across csv-parse versions):
const fs = require('fs');
const parse = require('csv-parse'); // newer releases: const { parse } = require('csv-parse');

function startFileImport(myFile) {
// THIS IS THE WRAPPER YOU NEED
return new Promise((resolve, reject) => {
let readStream = fs.createReadStream(myFile);
let fileRows = [];
const parser = parse({
delimiter: ','
});
// Use the readable stream api
parser.on('readable', function () {
let record
while (record = parser.read()) {
if (record) { fileRows.push(record); }
}
});
// Catch any error
parser.on('error', function (err) {
console.error(err.message)
});
parser.on('end', function () {
const { lines } = parser.info;
// RESOLVE OUTPUT THAT YOU WANT AT PARENT-LEVEL
resolve({ status: 'Successfully processed lines: ', lines });
});
// This will wait until we know the readable stream is actually valid before piping
readStream.on('open', function () {
// This just pipes the read stream to the response object (which goes to the client)
readStream.pipe(parser);
});
// This catches any errors that happen while creating the readable stream (usually invalid names)
readStream.on('error', function (err) {
resolve({ status: null, error: 'readStream error: ' + err });
});
});
}
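A caller can then consume the wrapped result with await. This is a hypothetical usage sketch; the importUsers name and the file path are made up:

async function importUsers() {
  const result = await startFileImport(__dirname + '/users.csv');
  if (result.error) {
    // The readStream failed, so there is nothing to report beyond the error
    console.error(result.error);
    return;
  }
  console.log(result.status, result.lines);
}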
This is a question that suggests confusion about an asynchronous streaming API and seems to ask at least three things.
How do I get output to contain an array-of-arrays representing the parsed CSV data?
That output will never exist at the top-level, like you (and many other programmers) hope it would, because of how asynchronous APIs operate. All the data assembled neatly in one place can only exist in a callback function. The next best thing syntactically is const output = await somePromiseOfOutput() but that can only occur in an async function and only if we switch from streams to promises. That's all possible, and I mention it so you can check it out later on your own. I'll assume you want to stick with streams.
An array consisting of all the rows can only exist after the entire stream has been read. That's why, in the author's "Stream API" example, all the rows are available only in the .on('end', ...) callback. If you want to do anything with all the rows present at the same time, you'll need to do it in that end callback.
From https://csv.js.org/parse/api/ note that the author:
uses the 'readable' callback to push single records into an initially empty, externally defined array named output
uses the 'error' callback to report errors
uses the 'end' callback to compare all the accumulated records in output to the expected result
...
const output = []
...
parser.on('readable', function(){
let record
while (record = parser.read()) {
output.push(record)
}
})
// Catch any error
parser.on('error', function(err){
console.error(err.message)
})
// When we are done, test that the parsed output matched what expected
parser.on('end', function(){
assert.deepEqual(
output,
[
[ 'root','x','0','0','root','/root','/bin/bash' ],
[ 'someone','x','1022','1022','','/home/someone','/bin/bash' ]
]
)
})
As to the goal of interfacing with sqlite: this is essentially building a customized streaming endpoint.
In this use case, implement a customized writable stream that accepts the output of parser and sends rows to the database.
Then you simply chain pipe calls as
fs.createReadStream(__dirname+'/fs_read.csv')
.pipe(parser)
.pipe(your_writable_stream)
Beware: this code returns immediately. It does not wait for the operations to finish; it interacts with the event loop internal to node.js. The event loop often confuses new developers who arrive from another language, are used to a more imperative style, and skipped this part of their node.js training.
Implementing such a customized writable stream can get complicated and is left as an exercise for the reader. It will be easiest if the parser emits a row, and then the writer can be written to handle single rows. Make sure you are able to notice errors somehow and throw appropriate exceptions, or you'll be cursed with incomplete results and no warning or reason why.
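For illustration only, here is a minimal sketch of such a writable stream. It assumes each CSV row has exactly two columns going into a made-up users(name, email) table; the requires mirror the earlier snippets:

const fs = require('fs');
const { Writable } = require('stream');
const parse = require('csv-parse'); // newer releases: const { parse } = require('csv-parse');
const sqlite3 = require('sqlite3');

const db = new sqlite3.Database('testing.sqlite');
const parser = parse({ delimiter: ',' });

// Each chunk arriving here is one parsed CSV row (an array of fields).
const sqliteWriter = new Writable({
  objectMode: true,
  write(row, _encoding, callback) {
    // Passing callback to db.run makes insert failures surface as stream 'error' events.
    db.run('INSERT INTO users (name, email) VALUES (?, ?)', row, callback);
  }
});

fs.createReadStream(__dirname + '/users.csv')
  .pipe(parser)
  .pipe(sqliteWriter)
  .on('error', (err) => console.error(err))
  .on('finish', () => db.close());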
A hackish way to do it would have been to replace console.log(data) in let parser = ... with a customized function writeRowToSqlite(data) that you'll have to write anyway to implement a custom stream. Because of asynchronous API issues, using return data there does not do anything useful. It certainly, as you saw, fails to put the data into the output variable.
As to why output in your modified posting does not contain the data...
Unfortunately, as you discovered, this is usually wrong-headed:
const output = fs.createReadStream(__dirname + "/users.csv").pipe(parser);
console.log(output);
Here, the variable output will be a ReadableStream, which is not the same as the data contained in the readable stream. Put simply, it's like when you have a file in your filesystem, and you can obtain all kinds of system information about the file, but the content contained in the file is accessed through a different call.
I can't seem to get the article duplicates out of my web scraper results. This is my code:
app.get("/scrape", function (req, res) {
request("https://www.nytimes.com/", function (error, response, html) {
// Load the HTML into cheerio and save it to a variable
// '$' becomes a shorthand for cheerio's selector commands, much like jQuery's '$'
var $ = cheerio.load(html);
var uniqueResults = [];
// With cheerio, find each p-tag with the "title" class
// (i: iterator. element: the current element)
$("div.collection").each(function (i, element) {
// An empty array to save the data that we'll scrape
var results = [];
// store scraped data in appropriate variables
results.link = $(element).find("a").attr("href");
results.title = $(element).find("a").text();
results.summary = $(element).find("p.summary").text().trim();
// Log the results once you've looped through each of the elements found with cheerio
db.Article.create(results)
.then(function (dbArticle) {
res.json(dbArticle);
}).catch(function (err) {
return res.json(err);
});
});
res.send("You scraped the data successfully.");
});
});
// Route for getting all Articles from the db
app.get("/articles", function (req, res) {
// Grab every document in the Articles collection
db.Article.find()
.then(function (dbArticle) {
res.json(dbArticle);
})
.catch(function (err) {
res.json(err);
});
});
Right now I am getting five copies of each article sent to the user. I have tried db.Article.distinct and various versions of this to filter the results down to only unique articles. Any tips?
In Short:
Switching var results = [] from an Array to an Object (var results = {}) did the trick for me. I still haven't figured out the exact reason for the duplicate insertion of documents in the database; I will update as soon as I find out.
Long Story:
You have multiple mistakes and points of improvement in your code; I will try to point them out.
Let's address the mistakes first to make your code error-free.
Mistakes
1. Although mongoose's model.create (i.e. new Model()) does seem to work fine with Arrays, I haven't seen such a use before and it does not even look appropriate.
If you intend to create documents one after another then represent your documents using an object instead of an Array. Using an array is more mainstream when you intend to create multiple documents at once.
So switch -
var results = [];
to
var results = {};
2. Sending response headers after they have already been sent will produce an error. I don't know if you have noticed it yet, but once that error pops up the remaining documents won't get stored, because of a PromiseRejection error if you haven't set up a try/catch block.
The db.Article.create(...) calls inside $("div.collection").each(function (i, element) { ... }) are asynchronous, so control does not wait for each document to be saved; it immediately goes on to execute res.send("You scraped the data successfully.");.
This effectively terminates the HTTP connection between the client and the server, and any further response-terminating calls like res.json(dbArticle) or res.json(err) will throw an error.
So just comment out the res.json statements inside .create's then and catch handlers. This terminates the response before all the articles are saved in the DB, but you need not worry: your code will still work behind the scenes, saving the articles in the database asynchronously.
If you want your response to be terminated only after you have successfully saved the data, then change your implementation to:
request('https://www.nytimes.com', (err, response, html) => {
var $ = cheerio.load(html);
var results = [];
$("div.collection").each(function (i, element) {
var ob = {};
ob.link = $(element).find("a").attr("href");
ob.title = $(element).find("a").text();
ob.summary = $(element).find("p.summary").text().trim();
results.push(ob);
});
db.Article.create(results)
.then(function (dbArticles) {
res.json(dbArticles);
}).catch(function (err) {
return res.json(err);
});
});
After making the above changes (and even with just the first one), my version of your code ran fine. So if you want, you can continue with your current version, or you can read on for some points of improvement.
Points of Improvement
1. The era of callbacks is long gone:
Convert your implementation to use Promises, as they are more maintainable and easier to reason about. Here are the things you can do:
Change the request library from request to axios or any other that supports Promises by default (see the sketch after this list).
2. Make effective use of mongoose methods for insertion. You can perform bulk inserts of multiple documents in just one call. You may find the docs on creating documents in mongodb quite helpful.
3. Start using a headless browser automation library such as puppeteer or nightmare.js for data-scraping tasks. Trust me, they make life a lot easier than using cheerio or any other library for the same purpose. Their docs are really good and well maintained, so you won't have a hard time picking them up.
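For instance, the request call from earlier could be swapped for axios roughly like this; it is only a sketch reusing the question's db.Article and selectors, not a drop-in replacement:

const axios = require('axios');
const cheerio = require('cheerio');

app.get("/scrape", async function (req, res) {
  try {
    // axios resolves with a response object; the HTML body is on .data
    const { data: html } = await axios.get("https://www.nytimes.com/");
    const $ = cheerio.load(html);
    const results = [];
    $("div.collection").each(function (i, element) {
      results.push({
        link: $(element).find("a").attr("href"),
        title: $(element).find("a").text(),
        summary: $(element).find("p.summary").text().trim()
      });
    });
    // Bulk insert, then respond exactly once
    const dbArticles = await db.Article.create(results);
    res.json(dbArticles);
  } catch (err) {
    res.status(500).json(err);
  }
});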
I have the following code in Node.js which reads from a file, line by line. I want to do stuff to each line and store it in an array. The array would then be used in other functions in the same file. The problem I'm running into is the async nature of reading the stream which results in an empty array. The solutions I've come across all seem to rely on modules.
function processLine(file) {
const fs = require('fs');
const readline = require('readline');
const input = fs.createReadStream(file);
const rl = readline.createInterface(input);
const arr = []
rl.on('line', (line) => {
// do stuff to data and store in array
})
// return array;
}
I am aware of being able to store the chunks and operate on the whole file with input.on('end', cb)... However, I feel like this would put too much functionality within the cb. Plus I still can't use its return value since it's async. I guess my question is: is there a way to store the data being read and use it within the file?
If you would like to process elements as chunks, take a look at highWaterMark:
https://nodejs.org/api/stream.html#stream_types_of_streams
You will probably be interested in objectMode as well.
There are also several stream interfaces you can use when working with streams:
Readable
Writable
Duplex
Transform
https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback
In these you can use any Promise-based function and simply invoke the callback to finish processing an element at the right point in time:
_transform = function(data, encoding, callback) {
this.push(data);
callback();
};
or
https://nodejs.org/api/stream.html#stream_class_stream_transform
_write(chunk, encoding, callback) {
// ...
}
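Pulling those hooks together, a minimal object-mode Transform might look like the sketch below; the per-chunk logic is only a placeholder:

const { Transform } = require('stream');

// Upper-cases each incoming chunk before passing it downstream.
// Any Promise-based work could happen here too; just call callback when it finishes.
const upperCaser = new Transform({
  objectMode: true,
  transform(chunk, _encoding, callback) {
    callback(null, String(chunk).toUpperCase());
  }
});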
There is also another option: the rxjs bindings for Node streams, which you can use while processing elements.
As a js/node newcomer, I'm having some problems understanding how I can get around this issue.
Basically I have a list of objects that I would like to save to a MongoDB database if they don't already exist.
Here is some code:
var getDataHandler = function (err, resp, body) {
var data = JSON.parse(body);
for (var i=0; i < data.length; i++) {
var item = data[i];
models.Entry.findOne({id: item.id}, function(err, result) {
  if (err) { }
  else if (result === null) {
    var entry = new models.Entry(item);
    entry.save(function(err, saved) {
if (err) {}
});
}
});
}
}
The problem I have is that because it is asynchronous, once the new models.Entry(item) line is executed the value of item will be equal to the last element in the data array for every single callback.
What kind of pattern can I use to avoid this issue ?
Thanks.
Two kinds of patterns are available:
1) Callbacks. That is, you keep calling functions from your functions by passing them as parameters. Callbacks are generally fine but, especially server side when dealing with databases or other asynchronous resources, you quickly end up in "callback hell" and you may grow tired of looking for tricks to reduce the indentation levels of your code. And you may sometimes wonder how you really deal with exceptions. But callbacks are the basis: you must understand how to deal with this problem using callbacks.
2) Promises. Using promises you may have something like this (example from my related blog post):
db.on(userId) // get a connection from the pool
.then(db.getUser) // use it to issue an asynchronous query
.then(function(user){ // then, with the result of the query
ui.showUser(user); // do something
}).finally(db.off); // and return the connection to the pool
Instead of passing the next function as a callback, you just chain with then (in fact it's a little more complex; there are other functions, for example to deal with collections, parallel resolution, or error catching in a clean way).
Regarding your scope problem with the variable evolving before the callback is called, the standard solution is this one :
for (var i=0; i<n; i++) {
(function(i){
// any function defined here (a callback) will use the value of i fixed when iterating
})(i);
}
This works because calling a function creates a scope and the callback you create in that scope retains a pointer to that scope where it will fetch i (that's called a closure).
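Applied to the findOne loop above, that pattern would look roughly like this (reusing the question's models.Entry and data; the error handling is only a placeholder):

for (var i = 0; i < data.length; i++) {
  (function(item) {
    models.Entry.findOne({ id: item.id }, function(err, result) {
      if (err) { return; /* handle the lookup error */ }
      if (result === null) {
        // item is fixed for this callback because it was passed into the wrapper function
        var entry = new models.Entry(item);
        entry.save(function(err) {
          if (err) { /* handle the save error */ }
        });
      }
    });
  })(data[i]);
}

In modern JavaScript you can get the same effect by declaring the loop variable with let instead of wrapping the body in a function.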
Please could I ask for some advice on a control flow issue with node and redis? (aka Python coder trying to get used to JavaScript)
I don't understand why client.smembers and client.get (Redis lookups) need to be callbacks rather than simply being statements - it makes life very complicated.
Basically I'd like to query a set, and then when I have the results for the set, I need to carry out a get for each result. When I've got all the data, I need to broadcast it back to the client.
Currently I do this inside two callbacks, using a global object, which seems messy. I'm not even sure if it's safe (will the code wait for one client.get to complete before starting another?).
The current code looks like this:
var all_users = [];
// Get all the users for this page.
client.smembers("page:" + current_page_id, function (err, user_ids ) {
// Now get the name of each of those users.
for (var i = 0; i < user_ids.length; i++) {
client.get('user:' + user_ids[i] + ':name', function(err, name) {
var myobj = {};
myobj[user_ids[i]] = name;
all_users.push(myobj);
// Broadcast when we have got to the end of the loop,
// so all users have been added to the list -
// is this the best way? It seems messy.
if (i === (user_ids.length - 1)) {
socket.broadcast('all_users', all_users);
}
});
}
});
But this seems very messy. Is it really the best way to do this? How can I be sure that all lookups have been performed before calling socket.broadcast?
(scratches head) Thanks in advance for any advice.
I don't understand why client.smembers and client.get (Redis lookups) need to be callbacks rather than simply being statements - it makes life very complicated.
That's what Node is. (I'm pretty sure this topic has been discussed more than enough times here; look through other questions, it's definitely there.)
How can I be sure that all lookups have been performed before calling socket.broadcast?
That's what err is for in the callback function. This is Node's convention: the first parameter in a callback is the error object (null if everything is fine). So just use something like this to be sure no errors occurred:
if (err) {
... // handle errors.
return // or not, it depends.
}
... // process results
But this seems very messy.
You'll get used to it. I actually find it nice when code is well formatted and the project is cleverly structured.
Other ways are:
Using libraries to control async code-flow (Async.js, Step.js, etc.)
If spaghetti-style code is what you think mess is, define some functions to process results and pass them as parameters instead of anonymous ones.
If you totally dislike writing stuff callback-style, you might want to try streamlinejs:
var all_users = [];
// Get all the users for this page.
var user_ids = client.smembers("page:" + current_page_id, _);
// Now get the name of each of those users.
for (var i = 0; i < user_ids.length; i++) {
var name = client.get('user:' + user_ids[i] + ':name', _);
var myobj = {};
myobj[user_ids[i]] = name;
all_users.push(myobj);
}
socket.broadcast('all_users', all_users);
Note that a disadvantage of this variant is that only one username will be fetched at a time. Also, you should still be aware of what this code really does.
Async is a great library and you should take a look. Why? Clean code, a clear process, easy to track, etc.
Also, keep in mind that all your async callbacks will be processed after your for loop finishes. In your example, that may result in the wrong "i" value. Use a closure:
for (var i = 0; i < user_ids.length; i++) {
  (function(i) {
client.get('user:' + user_ids[i] + ':name', function(err, name) {
var myobj = {};
myobj[user_ids[i]] = name;
all_users.push(myobj);
// Broadcast when we have got to the end of the loop,
// so all users have been added to the list -
// is this the best way? It seems messy.
if (i === (user_ids.length - 1)) {
socket.broadcast('all_users', all_users);
}
});
  })(i);
}
To know when it's finished, use a recursive pattern like async does (I think). It's much simpler than doing it yourself.
async.series({
  getMembers: function(callback) {
    client.smembers("page:" + current_page_id, callback);
  }
}, function(err, results) {
  var all_users = [];
  async.forEachSeries(results.getMembers, function(user_id, cb) {
    client.get('user:' + user_id + ':name', function(err, name) {
      var myobj = {};
      myobj[user_id] = name;
      all_users.push(myobj);
      cb(err);
    });
  }, function(err) {
    socket.broadcast('all_users', all_users);
  });
});
This code is only a sketch, but you should be able to figure out how to adapt it.
The Step library is good too (and only ~30 lines of code, I think).
I don't understand why client.smembers and client.get (Redis lookups)
need to be callbacks rather than simply being statements - it makes
life very complicated.
Right, so everyone agrees callback hell is no bueno. As of this writing, callback-style APIs are steadily giving way to Promises in Node. Unfortunately, the Redis library does not have native support for returning Promises.
But there is a module you can require in like so:
const util = require("util");
This is a standard module included in the Node runtime that has a bunch of utility functions we can use, one of them being "promisify":
https://nodejs.org/api/util.html#util_util_promisify_original
Of course, when you asked this question seven years ago, util.promisify(original) did not exist; it was added in Node v8.0.0, so we can now give this question an updated answer.
So promisify is a function: we pass it a function like client.get(), and it returns a new function that wraps the callback behavior up nice and neat so it returns a Promise instead. In other words, promisify takes any function that accepts a callback as its last argument and makes it return a Promise, which is exactly the behavior you wanted seven years ago and that we are afforded today.
const util = require("util");
client.get = util.promisify(client.get);
So we are passing a reference to the .get() function to util.promisify(), which wraps it up so that instead of taking a callback it returns a Promise. You can then take that new, promisified function and override the existing client.get() with it.
Nowadays you do not have to use a callback for Redis lookups; you can use the async/await syntax like so:
const cachedMembers = await client.get('user:' + user_ids[i]);
So we wait for this to be resolved and whatever it resolves with will be assigned to cachedMembers.
The code can be cleaned up even further by using an ES6 array helper method instead of your for loop. I hope this answer is useful for current readers, even though the original question is old.
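For example, the original smembers-plus-get loop could be rewritten along these lines; this is only a sketch using the question's names (client, socket, current_page_id) with Promise.all as the array helper:

const util = require("util");

client.smembers = util.promisify(client.smembers);
client.get = util.promisify(client.get);

async function broadcastAllUsers(current_page_id) {
  const user_ids = await client.smembers("page:" + current_page_id);
  // Fetch every name in parallel and pair it with its user id.
  const all_users = await Promise.all(
    user_ids.map(async (id) => {
      const name = await client.get("user:" + id + ":name");
      return { [id]: name };
    })
  );
  socket.broadcast("all_users", all_users);
}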