Understanding node.js asynchronicity - for loop versus nested callback

Understanding node.js asynchronicity - for loop versus nested callback - javascript

I'm new to nodejs and is trying to understand its asynchronous idea. In the following code snippet, I'm trying to get two documents from mongodb database randomly. It works fine, but looks very ugly because of the nested callback functions. If I want to get 100 documents instead of 2, that would be a disaster.
app.get('/api/two', function(req, res){
dataset.count(function(err, count){
var docs = [];
var rand = Math.floor(Math.random() * count);
dataset.findOne({'index':rand}, function(err, doc){
docs.push(doc);
rand = Math.floor(Math.random() * count);
dataset.findOne({'index':rand}, function(err, doc1){
docs.push(doc1);
res.json(docs);
});
});
});
});
So I tried to use for-loop instead, however, the following code just doesn't work, and I guess I misunderstand the asynchronous method idea.
app.get('/api/two', function(req, res){
dataset.count(function(err, count){
var docs = []
for(i = 0; i < 2 ; i++){
var rand = Math.floor(Math.random() * count);
dataset.findOne({'index':rand}, function(err, doc){
docs.push(doc);
});
}
res.json(docs);
});
});
Can anyone help me with that and explain to me why it doesn't work? Thank you very much.

Can anyone help me with that and explain to me why it doesn't work?
tl;dr -- The problem is caused by running a loop over an asynchronous function (dataset.findOne) that cannot complete before the loop completes. You need to handle this with a library like async (as suggested by the other answer) or by callbacks as in the first code example.
Looping over an synchronous function
This may sound pedantic, but it's important to understand the differences between looping in a synchronous and asynchronous world. Consider this synchronous loop:
var numbers = [];
for( i = 0 ; i < 5 ; i++ ){
numbers[i] = i*2;
}
console.log("array:",numbers);
On my system, this outputs:
array: [ 0, 2, 4, 6, 8 ]
This is because the assignment to numbers[i] happens before the loop can iterate. For any synchronous ("blocking") assignment/function, you will get results in this manner.
For illustration, let's try this code:
function sleep(time){
var stop = new Date().getTime();
while(new Date().getTime() < stop + time) {}
}
for( i = 0 ; i < 5 ; i++ ){
sleep(1000);
}
If you get your watch out or throw in some console.log messages, you'll see that "sleeps" for 5 seconds.
This is because the while loop in sleep blocks...it iterates until the time milliseconds have passed before returning control back to the for loop.
Looping over an asynchronous function
The root of your problem is that dataset.findOne is asynchronous...which means it passes control back to the loop before the database has returned results. The findOne method takes a callback (the anonymous function(err, doc)) that creates a closure.
Describing closures here is beyond the scope of this answer, but if you search this site or use your favorite search engine for "javascript closures" you'll get tons of info.
The bottom line, though, is that the asynchronous call send the query off to the database. Because the transaction will take some time and it has a callback that can accept the query results, it hands control back to the for-loop. (Important: this is where node's "event loop" and it's intersection with "asynchronous programming" comes into play. Node is providing a non-blocking environment by allowing asynchronous behavior like this.)
Let's look at an example of how async issues can trip us up:
for( i = 0 ; i < 5 ; i++ ){
setTimeout(
function(){console.log("I think I is: ", i);} // anonymous callback
,1 // wait 1ms before using the callback function
)
}
console.log("I am done executing.")
You'll get output that looks like this:
I am done executing.
I think I is: 5
I think I is: 5
I think I is: 5
I think I is: 5
I think I is: 5
This is because setTimeout gets a function to call...so even though we only said "wait ONE millisecond", that's still longer than it takes for the loop to iterate 5 times and move on to the last console.log line.
What happens, then, is that the last line fires before the first anonymous callback fires. When it does fire, the loop has finished and i is equal to 5. So what you see here is that the loop is done, has moved on, even though the anonymous function handed to setTimeout still has access to the value of i. (This is "closures" in action...)
If we take this concept and use it to consider your second "broken" code example, we can see why you aren't getting the results you expected.
app.get('/api/two', function(req, res){
dataset.count(function(err, count){
var docs = []
for(i = 0; i < 2 ; i++){
var rand = Math.floor(Math.random() * count);
// THIS IS ASYNCHRONOUS.
// findOne gets a callback...
// hands control back to the for loop...
// and later pushes info into the "doc" array...
// too late for res.json, at least...
dataset.findOne({'index':rand}, function(err, doc){
docs.push(doc);
});
}
// THE LOOP HAS ENDED BEFORE any of the findOne callbacks fire...
// There's nothing in 'docs' to be sent back to the client. :(
res.json(docs);
});
});
The reason async, promises and other similar libraries are a good tool is that they help to solve the problem you are facing. async and promises can turn the "callback hell" that is created in this situation into a relatively clean solution...it's easier to read, easier to see where async stuff is happening, and when you need to make edits you don't have to worry about which callback level you are at/editing/etc.

You could use the async module. For example:
var async = require('async');
async.times(2, function(n, next) {
var rand = Math.floor(Math.random() * count);
dataset.findOne({'index':rand}, function(err, doc) {
next(err, doc);
});
}, function(err, docs) {
res.json(docs);
});
If you want to get 100 documents, you just need to change Async.times(2, to Async.times(100,.

The async module as mentioned above is a good solution. The reason this is happening is because a regular Javascript for loop is synchronous, while your calls to the database are asynchronous. The for loop does not know that you want to wait until the data is retrieved to go onto the next iteration, so it just keeps going, and finishes faster than the data retrieval.

Related

Node.js not waiting for nested inner function calls to execute

Admittedly I'm a novice with node, but it seems like this should be working fine. I am using multiparty to parse a form, which returns an array. I am then using a for each to step through the array. However - the for each is not waiting for the inner code to execute. I am a little confused as to why it is not, though.
var return_GROBID = function(req, res, next) {
var form = new multiparty.Form();
var response_array = [];
form.parse(req, function(err, fields, files) {
files.PDFs.forEach(function (element, index, array) {
fs.readFile(element.path, function (err, data) {
var newPath = __dirname + "/../public/PDFs/" + element.originalFilename;
fs.writeFile(newPath, data, function (err) {
if(err) {
res.send(err);
}
GROBIDrequest.GROBID2js(newPath, function(response) {
response_array.push(response);
if (response_array.length == array.length) {
res.locals.body = response_array;
next();
}
});
});
});
});
});
}
If someone can give me some insight on the proper way to do this that would be great.
EDIT: The mystery continues. I ran this code on another machine and IT WORKED. What is going on? Why would one machine be inconsistent with another?

I'd guess the PDFs.forEach is you just calling the built-in forEach function, correct?
In Javascript many things are asynchronous - meaning that given:
linea();
lineb();
lineb may be executed before linea has finished whatever operation it started (because in asynchronous programming, we don't wait around until a network request comes back, for example).
This is different from other programming languages: most languages will "block" until linea is complete, even if linea could take time (like making a network request). (This is called synchronous programming).
With that preamble done, back to your original question:
So forEach is a synchronous function. If you rewrote your code like the following, it would work (but not be useful):
PDFs.forEach(function (element, index, array) {
console.log(element.path)
}
(console.log is a rare synchronous method in Javascript).
But in your forEach loop you have fs.readFile. Notice that last parameter, a function? Node will call that function back when the operation is complete (a callback).
Your code will currently, and as observed, hit that fs.readFile, say, "ok, next thing", and move on to the next item in the loop.
One way to fix this, with the least changing the code, is to use the async library.
async.forEachOf(PDFs, function(value, key, everythingAllDoneCallback) {
GROBIDrequest.GROBID2js(newPath, function(response) {
response_array.push(response);
if (response_array.length = array.length) {
...
}
everythingAllDoneCallback(null)
} );
With this code you are going through all your asynchronous work, then triggering the callback when it's safe to move on to the next item in the list.
Node and callbacks like this are a very common Node pattern, it should be well covered by beginner material on Node. But it is one of the most... unexpected concepts in Node development.
One resource I found on this was (one from a set of lessons) about NodeJS For Beginners: Callbacks. This, and playing around with blocking (synchronous) and non-blocking (asynchronous) functions, and hopefully this SO answer, may provide some enlightenment :)

Why count callbacks?

I am working on problem 10: ASYNC JUGGLING in the learnyounode tutorials.
This problem is the same as the previous problem (HTTP COLLECT) in
that you need to use http.get(). However, this time you will be
provided with three URLs as the first three command-line arguments.
You must collect the complete content provided to you by each of the
URLs and print it to the console (stdout). You don't need to print out
the length, just the data as a String; one line per URL. The catch is
that you must print them out in the same order as the URLs are
provided to you as command-line arguments.
The official solution involves counting callbacks:
var http = require('http')
var bl = require('bl')
var results = []
var count = 0
function printResults () {
for (var i = 0; i < 3; i++)
console.log(results[i])
}
function httpGet (index) {
http.get(process.argv[2 + index], function (response) {
response.pipe(bl(function (err, data) {
if (err)
return console.error(err)
results[index] = data.toString()
count++
if (count == 3) // yay! we are the last one!
printResults()
}))
})
}
for (var i = 0; i < 3; i++)
httpGet(i)
The program must wait until all three responses have been received before printing them out, so they come out in the same order they were entered.
My attempt involved using a callback to ensure the correct order:
var http = require('http')
var bl = require('bl')
var results = []
function printResults () {
console.log(results[0])
console.log(results[1])
console.log(results[2])
}
function httpGet (i) {
http.get(process.argv[2 + i], function (response) {
response.pipe(bl(function (err, data) {
if (err)
return console.error(err)
results[index] = data.toString()
}))
})
}
function httpGetAll (callback) {
httpGet(0)
httpGet(1)
httpGet(2)
callback()
}
httpGetAll(printResults)
But this spits out undefined three times. So it seems as though the printResults() is being called before the three httpGet() lines are executed. Seems like I don't understand callbacks as well as I thought.
So my question is, is there any way to achieve this using a callback on httpGetAll()? Or do I have to count callbacks to httpGet()?

But this spits out undefined three times. So it seems as though the printResults() is being called before the three httpGet() lines are executed. Seems like I don't understand callbacks as well as I thought.
Yes, you are misunderstanding how asynchronous code behaves. The three httpGet() ARE executed first but their asynchronous callbacks that have the results are NOT executed until a later event loop tick. If you look in httpGet, the code indented 1 level runs on the first tick, which is just really that first line, all the code in the nested callback function that is indented 2 levels does NOT execute on the same tick. That code is just scheduled on the event queue for later after the HTTP response arrives, but node doesn't just wait, it keeps going in the interim.
So my question is, is there any way to achieve this using a callback on httpGetAll()? Or do I have to count callbacks to httpGet()?
Yes, there are ways to implement this correctly without specifically counting callbacks, however, you must "keep track" of the pending calls somehow. Counting is a straightforward and efficient way to do this, but you could also use an array as a queue of pending calls, remove an element from the queue when each response arrives, and know you are done when the queue is empty. You could also track state in an object per request with a done property that starts false and you set to true when the response arrives, and you check if they are all done by ensuring all done properties are true. It's not technically counting, but it is book-keeping of a similar nature.

asynchronously iterate over massive array in JavaScript without triggering stack size exceeded

My environment is NodeJS, although this could be a web related problem as well. I have a large set of data from a database which I am attempting to enumerate over. However, for the sake of argument lets say that I have an array of 20,000 strings:
var y = 'strstrstrstrstrstrstrstrstrstr';
var x = [];
for(var i = 0; i < 20000; i++)
x.push(y);
and I want to enumerate this list asynchronously, lets say using the async library, and lets say because I'm super cautious that I even limit my enumeration to 5 iterations at once:
var allDone = function() { console.log('done!') };
require('async').eachLimit(x, 5, function(item, cb){
...
someAsyncCall(.., cb);
}, allDone);
The expectation is that 5 items of x would be iterated concurrently above and that eventually all 20,000 items would be iterated over and the console would print 'done!'. What actually happens is:
Uncaught exception: [RangeError: Maximum call stack size exceeded]
And at this point I assumed that this must be some sort of bug with the async library, so I wrote my own version of eachLimit which follows:
function eachLimit(data, limit, iterator, cb) {
var consumed = 0;
var consume;
var finished = false;
consume = function() {
if(!finished && consumed >= data.length) {
finished = true;
cb();
}else if(!finished) {
return iterator(data[consumed++], consume);
}
};
var concurrent = limit > data.length ? data.length : limit;
for(var i = 0; i < concurrent; i++)
consume();
}
and interestingly enough, this solved my problem. But then when I moved my experiment from nodeJS over to Chrome, even with my solution above I still receive a stack size exceeded.
Clearly, my method does not increase the stack as large as the eachLimit method contained within async. However, I still consider my approach to be bad because maybe not for 20k items, but for some sized array I can still exceed the stack size using my method. I feel like I need to design some sort of solution to this problem using tail recursion, but I'm not sure if v8 will even optimize for this case, or if it's possible given the problem.

I feel like I need to design some sort of solution to this problem using tail recursion, but I'm not sure if v8 will even optimize for this case, or if it's possible given the problem.
The continuation-passing-style you are using is already tail recursive (or close to anyway). The problem is that most JS engines really tend to do stackoverflows in these sorts of situations.
There are two main ways to work around this issue:
1) Force the code to be async using setTimeout.
What is happening with your code is that you are calling the return callbacks before the original function returns. In some async libraries this will end up resulting in stackoverflow. One simple workaround is to force the callback to run only in the next iteration of the event handling loop, by wrapping it inside a setTimeout. Translate
//Turns out this was actually "someSyncCall"...
someAsyncCall(.., cb);
into
someAsyncCall(..., function(){
setTimeout(cb, 0)
});
The main advantage here is that this is very simple to do. The disadvantage is that this add some latency to your loop because setTimeout is implemented so that there will always be some nonzero delay to the callback (even if you set it to zero). On the server you can use nextTick (or somethign like that, forgot the precise name) to do something similar as well.
That said, its already a bit weird to have a large loop of sequential async operations. If your operations are all actually async then its going to take years to complete due to the network latency.
2) Use trampolining to handle the sync code.
The only way to 100% avoid a stackoverflow is to use bona-fide while loops. With promises this would be a bit easier to write the pseudocode for:
//vastly incomplete pseudocode
function loopStartingFrom(array, i){
for(;i<array.length; i++){
var x = run_next_item(i);
if(is_promise(x)){
return x.then(function(){
loopStartingFrom(array, i+1)
});
}
}
}
Basically, you run your loop in an actual loop, with some way to detect if one of your iterations is returning immediately or deferring to an async computation. When things return immediately you keep the loop running and when you finally get a real async result you stop the loop and resume it when the async iteration result completes.
The downside of using trampolining is that its a bit more complicated. That said, there are some async libraries out there that guarantee that stackoverflow does not occur (by using one of the two tricks I mentioned under the hood).

To prevent a stack overflow, you need to avoid that consume recurses into itself. You can do that using a simple flag:
function eachLimit(data, limit, iterator, cb) {
var consumed = 0,
running = 0,
isAsync = true;
function consume() {
running--;
if (!isAsync)
return;
while (running < limit && consumed < data.length) {
isAsync = false;
running++;
iterator(data[consumed++], consume);
isAsync = true;
}
if (running == 0)
cb();
}
running++;
consume();
}

Have you considered using promises for this? They should resolve the issue of an ever-increasing stack (and also you get to use promises, which is a big plus in my book):
// Here, iterator() should take a single data value as input and return
// a promise for the asynchronous behavior (if it is asynchronous)
// or any value if it is synchronous
function eachLimit(data, limit, iterator) {
return Promise(function (resolve, reject) {
var i = 0;
var failed = false;
function handleFailure(error) {
failed = true;
reject(error);
}
function queueAction() {
try {
Promise.when(iterator(data[i]))
.then(handleSuccess, handleFailure);
} catch (error) {
reject(error);
}
}
function handleSuccess() {
if (!failed) {
if (i < data.length) {
queueAction();
i += 1;
} else {
resolve();
}
}
}
for (; i < data.length && i < limit; i += 1) {
queueAction();
}
});
}

NodeJS callback scope

As a js/node newcomer, I'm having some problems understanding how I can get around this issue.
Basically I have a list of objects that I would like to save to a MongoDB database if they don't already exist.
Here is some code:
var getDataHandler = function (err, resp, body) {
var data = JSON.parse(body);
for (var i=0; i < data.length; i++) {
var item = data[i];
models.Entry.findOne({id: item.id}, function(err, res) {
if (err) { }
else if (result === null) {
var entry = new models.Entry(item);
feedbackEntry.save(function(err, result) {
if (err) {}
});
}
});
}
}
The problem I have is that because it is asynchronous, once the new models.Entry(item) line is executed the value of item will be equal to the last element in the data array for every single callback.
What kind of pattern can I use to avoid this issue ?
Thanks.

Two kinds of patterns are available :
1) Callbacks. That is you go on calling functions from your functions by passing them as parameters. Callbacks are generally fine but, especially server side when dealing with database or other asynchronous resources, you fast end in "callback hell" and you may grow tired of looking for tricks to reduce the indentation levels of your code. And you may sometimes wonder how you really deal with exceptions. But callbacks are the basis : you must understand how to deal with that problem using callbacks.
2) Promises. Using promises you may have something like that (example from my related blog post) :
db.on(userId) // get a connection from the pool
.then(db.getUser) // use it to issue an asynchronous query
.then(function(user){ // then, with the result of the query
ui.showUser(user); // do something
}).finally(db.off); // and return the connection to the pool
Instead of passing the next function as callback, you just chain with then (in fact it's a little more complex, you have other functions, for example to deal with collections and parallel resolution or error catching in a clean way).
Regarding your scope problem with the variable evolving before the callback is called, the standard solution is this one :
for (var i=0; i<n; i++) {
(function(i){
// any function defined here (a callback) will use the value of i fixed when iterating
})(i);
});
This works because calling a function creates a scope and the callback you create in that scope retains a pointer to that scope where it will fetch i (that's called a closure).

node.js and asynchronous programming palindrome

This question might be possible duplication. I am a noob to node.js and asynchronous programming palindrome. I have google searched and seen a lot of examples on this, but I still have bit confusion.
OK, from google search what I understand is that all the callbacks are handled asynchronous.
for example, let's take readfile function from node.js api
fs.readFile(filename, [options], callback) // callback here will be handled asynchronously
fs.readFileSync(filename, [options])
var fs = require('fs');
fs.readFile('async-try.js' ,'utf8' ,function(err,data){
console.log(data); })
console.log("hii");
The above code will first print hii then it will print the content of
the file.
So, my questions are:
Are all callbacks handled asynchronously?
The below code is not asynchronous, why and how do I make it?
function compute(callback){
for(var i =0; i < 1000 ; i++){}
callback(i);
}
function print(num){
console.log("value of i is:" + num);
}
compute(print);
console.log("hii");

Are all callbacks handled asynchronously?
Not necessarily. Generally, they are, because in NodeJS their very goal is to resume execution of a function (a continuation) after a long running task finishes (typically, IO operations). However, you wrote yourself a synchronous callback, so as you can see they're not always asynchronous.
The below code is not asynchronous, why and how do I make it?
If you want your callback to be called asynchronously, you have to tell Node to execute it "when it has time to do so". In other words, you defer execution of your callback for later, when Node will have finished the ongoing execution.
function compute(callback){
for (var i = 0; i < 1000; i++);
// Defer execution for later
process.nextTick(function () { callback(i); });
}
Output:
hii
value of i is:1000
For more information on how asynchronous callbacks work, please read this blog post that explains how process.nextTick works.

No, that is a regular function call.
A callback will not be asynchronous unless it is forced to be. A good way to do this is by calling it within a setTimeout of 0 milliseconds, e.g.
setTimeout(function() {
// Am now asynchronous
}, 0);
Generally callbacks are made asynchronous when the calling function involves making a new request on the server (e.g. Opening a new file) and it doesn't make sense to halt execution whilst waiting for it to complete.

The below code is not asynchronous, why and how do I make it?
function compute(callback){
for(var i =0; i < 1000 ; i++){}
callback(i);
}
I'm going to assume your code is trying to say, "I need to do something 1000 times then use my callback when everything is complete".
Even your for loop won't work here, because imagine this:
function compute(callback){
for(var i =0; i < 1000 ; i++){
DatabaseModel.save( function (err, result) {
// ^^^^^^ or whatever, Some async function here.
console.log("I am called when the record is saved!!");
});
}
callback(i);
}
In this case your for loop will execute the save calls, not wait around for them to be completed. So, in your example, you may get output like (depending on timing)
I am called when the record is saved
hii
I am called when the record is saved
...
For your compute method to only call the callback when everything is truely complete - all 1000 records have been saved in the database - I would look into the async Node package, which can do this easily for you, and provide patterns for many async problems you'll face in Node.
So, you could rewrite your compute function to be like:
function compute(callback){
var count = 0
async.whilst(
function() { return count < 1000 },
function(callback_for_async_module) {
DatabaseModel.save( function (err, result) {
console.log("I am called when the record is saved!!");
callback_for_async_module();
count++;
});
},
function(err) {
// this method is called when callback_for_async_module has
// been called 1000 times
callback(count);
);
console.log("Out of compute method!");
}
Note that your compute function's callback parameter will get called sometime after console.log("Out of compute method"). This function is now asynchronous: the rest of the application does not wait around for compute to complete.

You can put every callback call inside a timeout with one milisecond, that way they will be executed first when there are a thread free and all synchron tasks are done, then will the processor work through the stack of timeouts that want to be executet.

We Keep Coding

JavaScript is the programming language of the Web.