With node.js I want to http.get a number of remote urls in a way that only 10 (or n) runs at a time.
I also want to retry a request if an exception occures locally (m times), but when the status code returns an error (5XX, 4XX, etc) the request counts as valid.
This is really hard for me to wrap my head around.
Problems:
Cannot try-catch http.get as it is async.
Need a way to retry a request on failure.
I need some kind of semaphore that keeps track of the currently active request count.
When all requests finished I want to get the list of all request urls and response status codes in a list which I want to sort/group/manipulate, so I need to wait for all requests to finish.
Seems like for every async problem using promises are recommended, but I end up nesting too many promises and it quickly becomes uncypherable.
There are lots of ways to approach the 10 requests running at a time.
Async Library - Use the async library with the .parallelLimit() method where you can specify the number of requests you want running at one time.
Bluebird Promise Library - Use the Bluebird promise library and the request library to wrap your http.get() into something that can return a promise and then use Promise.map() with a concurrency option set to 10.
Manually coded - Code your requests manually to start up 10 and then each time one completes, start another one.
In all cases, you will have to manually write some retry code and as with all retry code, you will have to very carefully decide which types of errors you retry, how soon you retry them, how much you backoff between retry attempts and when you eventually give up (all things you have not specified).
Other related answers:
How to make millions of parallel http requests from nodejs app?
Million requests, 10 at a time - manually coded example
My preferred method is with Bluebird and promises. Including retry and result collection in order, that could look something like this:
const request = require('request');
const Promise = require('bluebird');
const get = Promise.promisify(request.get);
let remoteUrls = [...]; // large array of URLs
const maxRetryCnt = 3;
const retryDelay = 500;
Promise.map(remoteUrls, function(url) {
let retryCnt = 0;
function run() {
return get(url).then(function(result) {
// do whatever you want with the result here
return result;
}).catch(function(err) {
// decide what your retry strategy is here
// catch all errors here so other URLs continue to execute
if (err is of retry type && retryCnt < maxRetryCnt) {
++retryCnt;
// try again after a short delay
// chain onto previous promise so Promise.map() is still
// respecting our concurrency value
return Promise.delay(retryDelay).then(run);
}
// make value be null if no retries succeeded
return null;
});
}
return run();
}, {concurrency: 10}).then(function(allResults) {
// everything done here and allResults contains results with null for err URLs
});
The simple way is to use async library, it has a .parallelLimit method that does exactly what you need.
Related
Disclaimer: I'm not experienced with programming or with networks in general so I might be missing something quite obvious.
So i'm making a function in node.js that should go over an array of image links from my database and check if they're still working. There's thousands of links to check so I can't just fire off several thousand fetch calls at once and wait for results, instead I'm staggering the requests, going 10 by 10 and doing head requests to minimize the bandwidth usage.
I have two issues.
The first one is that after fetching the first 10-20 links quickly, the other requests take quite a bit longer and 9 or 10 out of 10 of them will time out. This might be due to some sort of network mechanism that throttles my requests when there are many being fired at once, but I'm thinking it's likely due to my second issue.
The second issue is that the checking process slows down after a few iterations. Here's an outline of what I'm doing. I'm taking the string array of image links and slicing it 10 by 10 then I check those 10 posts in 10 promises: (ignore the i and j variables, they're there just to track the individual promises and timeouts for loging/debugging)
const partialResult = await Promise.all(postsToCheck.map(async (post, j) => await this.checkPostForBrokenLink(post, i + j)));
within checkPostForBrokenLink I have a race between the fetch and a timeout of 10 seconds because I don't want to have to wait for the connection to time out every time timing out is a problem, I give it 10 seconds and then flag it as having timed out and move on.
const timeoutPromise = index => {
let timeoutRef;
const promise = new Promise<null>((resolve, reject) => {
const start = new Date().getTime();
console.log('===TIMEOUT INIT===' + index);
timeoutRef = setTimeout(() => {
const end = new Date().getTime();
console.log('===TIMEOUT FIRE===' + index, end - start);
resolve(null);
}, 10 * 1000);
});
return { timeoutRef, promise, index };
};
const fetchAndCancelTimeout = timeout => {
return fetch(post.fileUrl, { method: 'HEAD' })
.then(result => {
return result;
})
.finally(() => {
console.log('===CLEAR===' + index); //index is from the parent function
clearTimeout(timeout);
});
};
const timeout = timeoutPromise(index);
const videoTest = await Promise.race([fetchAndCancelTimeout(timeout.timeoutRef), timeout.promise]);
if fetchAndCancelTimeout finishes before timeout.promise does, it will cancel that timeout, but if the timeout finishes first the promise is still "resolving" in the background, despite the code having moved on. I'm guessing this is why my code is slowing down. The later timeouts take 20-30 seconds from being set up to firing, despite being set to 10 seconds. As far as I know, this has to be because the main process is busy and doesn't have time to execute the event queue, though I don't really know what it could be doing except waiting for the promises to resolve.
So the question is, first off, am I doing something stupid here that I shouldn't be doing and that's causing everything to be slow? Secondly, if not, can I somehow manually stop the execution of the fetch promise if the timeout fires first so as not to waste resources on a pointless process? Lastly, is there a better way to check if a large number of links are valid that what I'm doing here?
I found the problem and it wasn't, at least not directly, related to promise buildup. The code shown was for checking video links but, for images, the fetch call was done by a plugin and that plugin was causing the slowdown. When I started using the same code for both videos and images, the process suddenly became orders of magnitude quicker. I didn't think to check the plugin at first because it was supposed to only do a head request and format the results which shouldn't be an issue.
For anyone looking at this trying to find a way to cancel a fetch, #some provided an idea that seems like it might work. Check out https://www.npmjs.com/package/node-fetch#request-cancellation-with-abortsignal
Something you might want to investigate here is the Bluebird Promise library.
There are two functions in particular that I believe could simplify your implementation regarding rate limiting your requests and handling timeouts.
Bluebird Promise.map has a concurrency option (link), which allows you to set the number of concurrent requests and it also has a Promise.timeout function (link) which will return a rejection of the promise if a certain timeout has occurred.
I am reaching out to a SongKick's REST API in my ReactJS Application, using Axios.
I do not know how many requests I am going to make, because in each request there is a limit of data that I can get, so in each request I need to ask for a different page.
In order to do so, I am using a for loop, like this:
while (toContinue)
{
axios.get(baseUrl.replace('{page}',page))
.then(result => {
...
if(result.data.resultsPage.results.artist === undefined)
toContinue = false;
});
page++;
}
Now, the problem that I'm facing is that I need to know when all my requests are finished. I know about Promise.All, but in order to use it, I need to know the exact URLs I'm going to get the data from, and in my case I do not know it, until I get empty results from the request, meaning that the last request is the last one.
Of course that axios.get call is asynchronous, so I am aware that my code is not optimal and should not be written like this, but I searched for this kind of problem and could not find a proper solution.
Is there any way to know when all the async code inside a function is finished?
Is there another way to get all the pages from a REST API without looping, because using my code, it will cause extra requests (because it is async) being called although I have reached empty results?
Actually your code will crash the engine as the while loop will run forever as it never stops, the asynchronous requests are started but they will never finish as the thread is blocked by the loop.
To resolve that use async / await to wait for the request before continuing the loop, so actually the loop is not infinite:
async function getResults() {
while(true) {
const result = await axios.get(baseUrl.replace('{page}', page));
// ...
if(!result.data.resultsPage.results.artist)
break;
}
}
I have a long, complicated asynchronous process in TypeScript/JavaScript spread across many libraries and functions that, when it is finally finished receiving and processing all of its data, calls a function processComplete() to signal that it's finished:
processComplete(); // Let the program know we're done processing
Right now, that function looks something like this:
let complete = false;
function processComplete() {
complete = true;
}
In order to determine whether the process is complete, other code either uses timeouts or process.nextTick and checks the complete variable over and over again in loops. This is complicated and inefficient.
I'd instead like to let various async functions simply use await to wait and be awoken when the process is complete:
// This code will appear in many different places
await /* something regarding completion */;
console.log("We're done!");
If I were programming in Windows in C, I'd use an event synchronization primitive and the code would look something like this:
Event complete;
void processComplete() {
SetEvent(complete);
}
// Elsewhere, repeated in many different places
WaitForSingleObject(complete, INFINITE);
console.log("We're done!");
In JavaScript or TypeScript, rather than setting a boolean complete value to true, what exactly could processComplete do to make wake up any number of functions that are waiting using await? In other words, how can I implement an event synchronization primitive using await and async or Promises?
This pattern is quite close to your code:
const processComplete = args => new Promise(resolve => {
// ...
// In the middle of a callback for a async function, etc.:
resolve(); // instead of `complete = true;`
// ...
}));
// elsewhere
await processComplete(args);
console.log("We're done!");
More info: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise.
It really depends on what you mean by "other code" in this scenario. It sounds like you want to use some variation of the delegation pattern or the observer pattern.
A simple approach is to take advantage of the fact that JavaScript allows you to store an array of functions. Your processComplete() method could do something like this:
function processComplete(){
arrayOfFunctions.forEach(fn => fn());
}
Elsewhere, in your other code, you could create functions for what needs to be done when the process is complete, and add those functions to the arrayOfFunctions.
If you don't want these different parts of code to be so closely connected, you could set up a completely separate part of your code that functions as a notification center. Then, you would have your other code tell the notification center that it wants to be notified when the process is complete, and your processComplete() method would simply tell the notification center that the process is complete.
Another approach is to use promises.
I have a long, complicated asynchronous process in TypeScript/JavaScript spread across many libraries and functions
Then make sure that every bit of the process that is asynchronous returns a promise for its partial result, so that you can chain onto them and compose them together or await them.
When it is finally finished receiving and processing all of its data, calls a function processComplete() to signal that it's finished
It shouldn't. The function that starts the process should return a promise, and when the process is finished it should fulfill that promise.
If you don't want to properly promisify every bit of the whole process because it's too cumbersome, you can just do
function startProcess(…);
… // do whatever you need to do
return new Promise(resolve => {
processComplete = resolve;
// don't forget to reject when the process failed!
});
}
In JavaScript or TypeScript, rather than setting a boolean complete value to true, what exactly could processComplete do to make wake up any number of functions that are waiting using await?
If they are already awaiting the result of the promise, there is nothing else that needs to be done. (The awaited promise internally has such a flag already). It's really just doing
// somewhere:
var resultPromise = startProcess(…);
// elsewhere:
await resultPromise;
… // the process is completed here
You don't even need to fulfill the promise with a useful result if all you need is to synchronise your tasks, but you really should. (If there's no data they are waiting for, what are they waiting for at all?)
So basically I'm trying to make a ton of, (about 25,000), GET requests to an API. I'm using axios as my library for making HTTP calls.
So I have:
dsHistPromises.push(axios.get(url))
And then I'm using:
axios.all(dsHistPromises)
.then(function(results) {
results.forEach(function(response){
if (format === lists.formats.highlow) {
storage.darkskyHistoryHighLow.push(requests.parseDarkskyHighLow(response, city))
}
// parse data here and print it to files...
})
}).catch(err => {
throw err
})
to handle all of my promises.
When I try to run my code, I get errors like
(node:10400) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 2): Error: read ECONNRESET
I have to imagine this is an issue with the API I'm connecting to. The server probably is having a hard time managing my requests, correct?
Are there are any tricks to getting around this?
As you suspect, I'd say it was because of the API you're accessing. Most sane APIs will have some sort of throttle to prevent one user from bringing them down. You're probably running into that.
What you'll need to do is throttle your messages by putting a time delay in between calls.
There are a couple of ways to handle this. One would be to do each call in order, one at a time. This will help avoid throttling from the server, but may be slower.
If the server will allow it, you can do multiple requests in parallel.
Basically, it'd look something like this:
const urls = getUrlsToCallFromSomewhere();
const delay = 500; // ms
const next = () => {
if (!urls.length) {
return;
}
const url = urls.pop();
axios.get(url).then(data => {
results.push(data);
const now = Date.now();
if (now - start > delay) {
next();
} else {
setTimeout(next, delay - (now - start));
}
});
};
This isn't super perfect, but should give you an idea. Basically, call them one at a time. When it's done, check how long it's been. If you've hit the delay amount already, call it right away. If not, use setTimeout() to wait the amount of time.
You can arrange your code to get a few of these going at the same time to speed things up, and adjust the delay to be as long or as fast as you need.
At the moment, I'm trying to request a very large JSON object from an API (particularly this one) which, depending on various factors, can be upwards of a few MB. The problem is, however, is that NodeJS takes forever to do anything and then just runs out of memory: the first line of my response callback doesn't ever execute.
I could request each item individually, but that is a tremendous amount of requests. To quote the a dev behind the new API:
Until now, if you wanted to get all the market orders for Tranquility you had to request every type per region individually. That would generally be 50+ regions multiplied by upwards of 13,000 types. Even if it was just 13,000 types and 50 regions, that is 650,000 requests required to get all the market information. And if you wanted to get all the data in the 5-minute cache window, it would require almost 2,200 requests per second.
Obviously, that is not a great idea.
I'm trying to get the array items into redis for use later, then follow the next url and repeat until the last page is reached. Is there any way to do this?
EDIT:
Here's the problem code. Visiting the URL works fine in-browser.
// ...
REGIONS.forEach((region) => {
LOG.info(' * Grabbing data for `' + region.name + '#' + region.id + '`');
var href = url + region.id + '/orders/all/', next = href;
var page = 1;
while (!!next) {
https.get(next, (res) => {
LOG.info(' * * Page ' + page++ + ' responded with ' + res.statusCode);
// ...
The first LOG.info line executes, while the second does not.
It appears that you are doing a while(!!next) loop which is the cause of your problem. If you show more of the server code, we could advise more precisely and even suggest a better way to code it.
Javascript run your code single threaded. That means one thread of execution runs to completion before any other events can be run.
So, if you do:
while(!!next) {
https.get(..., (res) => {
// hoping this will run
});
}
Then, your callback to http.get() will never get called. Your while loop just keeps running forever. As long as it is running, the callback from the https.get() can never get called. That request is likely long since completed and there's an event sitting in the internal JS event queue to call the callback, but until your while() loop finished, that event can't get called. So you have a deadlock. The while() loop is waiting for something else to run to change it's condition, but nothing else can run until the while() loop is done.
There are several other ways to do serial async iterations. In general, you can't use .forEach() or while().
Here are several schemes for async looping:
Node.js: How do you handle callbacks in a loop?
While loop with jQuery async AJAX calls
How to synchronize a sequence of promises?
How to use after and each in conjunction to create a synchronous loop in underscore js
Or, the async library which you mentioned also has functions for doing async looping.
First of all, a few MBs of json payload is not exactly huge. So the route handler code might require some close scrutiny.
However, to actually deal with huge amounts of JSON, you can consume your request as a stream. JSONStream (along with many other similar libraries) allow you to do this in a memory efficient way. You can specify the paths you need to process using JSONPath (XPath analog for JSON) and then subscribe to the stream for matching data sets.
Following example from the README of JSONStream illustrates this succinctly:
var request = require('request')
, JSONStream = require('JSONStream')
, es = require('event-stream')
request({url: 'http://isaacs.couchone.com/registry/_all_docs'})
.pipe(JSONStream.parse('rows.*'))
.pipe(es.mapSync(function (data) {
console.error(data)
return data
}))
Use the stream functionality of the request module to process large amounts of incoming data. As data comes through the stream, parse it to a chunk of data that can be worked with, push that data through the pipe, and pull in the next chunk of data.
You might create a transform stream to manipulate a chunk of data that has been parsed and a write stream to store the chunk of data.
For example:
var stream = request ({ url: your_url }).pipe(parseStream)
.pipe(transformStream)
.pipe (writeStream);
stream.on('finish', () => {
setImmediate (() => process.exit(0));
});
Try for info on creating streams https://bl.ocks.org/joyrexus/10026630