NodeJS http and extremely large response bodies - javascript

At the moment, I'm trying to request a very large JSON object from an API (particularly this one) which, depending on various factors, can be upwards of a few MB. The problem, however, is that NodeJS takes forever to do anything and then just runs out of memory: the first line of my response callback never executes.
I could request each item individually, but that is a tremendous number of requests. To quote a dev behind the new API:
Until now, if you wanted to get all the market orders for Tranquility you had to request every type per region individually. That would generally be 50+ regions multiplied by upwards of 13,000 types. Even if it was just 13,000 types and 50 regions, that is 650,000 requests required to get all the market information. And if you wanted to get all the data in the 5-minute cache window, it would require almost 2,200 requests per second.
Obviously, that is not a great idea.
I'm trying to get the array items into redis for use later, then follow the next url and repeat until the last page is reached. Is there any way to do this?
EDIT:
Here's the problem code. Visiting the URL works fine in-browser.
// ...
REGIONS.forEach((region) => {
    LOG.info(' * Grabbing data for `' + region.name + '#' + region.id + '`');
    var href = url + region.id + '/orders/all/', next = href;
    var page = 1;
    while (!!next) {
        https.get(next, (res) => {
            LOG.info(' * * Page ' + page++ + ' responded with ' + res.statusCode);
            // ...
The first LOG.info line executes, while the second does not.

It appears that your while (!!next) loop is the cause of your problem. If you show more of the surrounding code, we could advise more precisely and even suggest a better way to code it.
JavaScript runs your code single threaded. That means one thread of execution runs to completion before any other events can be run.
So, if you do:
while (!!next) {
    https.get(..., (res) => {
        // hoping this will run
    });
}
Then, your callback to https.get() will never get called. Your while loop just keeps running forever. As long as it is running, the callback from the https.get() can never be called. That request has likely long since completed and there's an event sitting in the internal JS event queue waiting to call the callback, but until your while() loop finishes, that event can't be processed. So you have a deadlock: the while() loop is waiting for something else to run to change its condition, but nothing else can run until the while() loop is done.
There are several other ways to do serial async iterations. In general, you can't drive them with a plain .forEach() or while() loop; a rough async/await sketch follows the links below.
Here are several schemes for async looping:
Node.js: How do you handle callbacks in a loop?
While loop with jQuery async AJAX calls
How to synchronize a sequence of promises?
How to use after and each in conjunction to create a synchronous loop in underscore js
Or, the async library which you mentioned also has functions for doing async looping.
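For example, here is a rough sketch of the pagination loop from the question rewritten with async/await (Node 8+). fetchPage is a hypothetical helper that wraps https.get and parses the JSON body, and the data.next / data.items field names are assumptions about how the API exposes paging; url, REGIONS and LOG come from the question's code.

const https = require('https');

// Hypothetical helper: wraps https.get and resolves with the parsed JSON body.
function fetchPage(pageUrl) {
    return new Promise((resolve, reject) => {
        https.get(pageUrl, (res) => {
            let body = '';
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => {
                try { resolve(JSON.parse(body)); } catch (e) { reject(e); }
            });
        }).on('error', reject);
    });
}

// One region and one page at a time: the loop only advances after the
// previous request has completed, so the event loop is never blocked.
async function grabRegion(region) {
    let next = url + region.id + '/orders/all/';
    let page = 1;
    while (next) {
        const data = await fetchPage(next);
        LOG.info(' * * Page ' + page++ + ' fetched for ' + region.name);
        // ...store data.items in redis here...
        next = data.next; // assumption: the payload carries the next page URL (or null on the last page)
    }
}

// Regions one after another; use Promise.all(REGIONS.map(grabRegion)) to run them in parallel instead.
(async () => {
    for (const region of REGIONS) {
        await grabRegion(region);
    }
})();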

First of all, a few MB of JSON payload is not exactly huge, so the route handler code might require some close scrutiny.
However, to actually deal with huge amounts of JSON, you can consume your request as a stream. JSONStream (along with many other similar libraries) allow you to do this in a memory efficient way. You can specify the paths you need to process using JSONPath (XPath analog for JSON) and then subscribe to the stream for matching data sets.
The following example from the JSONStream README illustrates this succinctly:
var request = require('request')
  , JSONStream = require('JSONStream')
  , es = require('event-stream')

request({url: 'http://isaacs.couchone.com/registry/_all_docs'})
  .pipe(JSONStream.parse('rows.*'))
  .pipe(es.mapSync(function (data) {
    console.error(data)
    return data
  }))
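The same technique works directly with the https.get call from the question, since the res object handed to its callback is itself a readable stream; the 'items.*' path below is only an assumption about the shape of the paged payload:

https.get(next, (res) => {
    res.pipe(JSONStream.parse('items.*'))
        .on('data', (order) => {
            // handle each order as it is parsed (e.g. write it to redis)
            // instead of buffering the whole multi-MB body in memory
        });
});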

Use the stream functionality of the request module to process large amounts of incoming data. As data comes through the stream, parse it into a chunk of data that can be worked with, push that data through the pipe, and pull in the next chunk of data.
You might create a transform stream to manipulate a chunk of data that has been parsed and a write stream to store the chunk of data.
For example:
var stream = request({ url: your_url })
    .pipe(parseStream)
    .pipe(transformStream)
    .pipe(writeStream);

stream.on('finish', () => {
    setImmediate(() => process.exit(0));
});
For info on creating streams, try https://bl.ocks.org/joyrexus/10026630
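As a rough sketch of what those pieces could look like (the stream names are from the snippet above; JSONStream stands in for parseStream, and saveToDatabase is a hypothetical storage function):

const request = require('request');
const JSONStream = require('JSONStream');
const { Transform, Writable } = require('stream');

// Transform step: reshape each parsed record before it is stored.
const transformStream = new Transform({
    objectMode: true,
    transform(record, encoding, callback) {
        callback(null, { id: record.id, price: record.price }); // field names are assumptions
    }
});

// Write step: persist each transformed record.
const writeStream = new Writable({
    objectMode: true,
    write(record, encoding, callback) {
        saveToDatabase(record, callback); // hypothetical storage function
    }
});

request({ url: your_url })
    .pipe(JSONStream.parse('rows.*')) // plays the role of parseStream
    .pipe(transformStream)
    .pipe(writeStream)
    .on('finish', () => process.exit(0));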

Related

How to ensure orderly processing of a websocket's callback results?

I have an application which listens to a websocket endpoint and processes the data received from it and saves it to a database.
The problem of a race condition arises when two callbacks are invoked concurrently (for example: one task may begin processing, then another task may begin processing and update the database, then the first task may update the database - so in the end the database updates are out of order).
The solution I thought of was to record the exact time a callback is called, process the data, attach that time to the data passed to the database, and then in the database compare this time with the last update time and act accordingly.
One possible problem I thought of is that the time may be recorded out of order (for example: consider the scenario where the first callback is called, then the second callback is called and the time is recorded, then the time is recorded for the first callback).
How would you do it the right way? Solutions to this problem or other ways to go about it?
EDIT: To be more specific: since I intend for the program to be as real-time as possible, I'd like the most up-to-date callback to be processed without delay (without waiting for all previous callbacks to finish processing), but to ensure that the end result of the processing (as recorded in the database) adheres to the order in which the callbacks arrived (is not corrupted).
You can have the data handler callback return a promise that resolves when it's finished.
Each time you get a new data from the socket, wait for that promise before handling it, then store the resulting promise for the next data to wait for.
That would look like this:
let ready = Promise.resolve();

socket.on(..., data => {
    ready = ready.then(() => processData(data));
});
This will have no effect on any other code.
EDIT: To do expensive work outside the lock, you can write
socket.on(..., data => {
    const result = doExpensiveWork(data); // Returns a promise
    ready = Promise.all([result, ready]).then(([result]) => insertData(result));
});
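One practical note on this pattern: if insertData can reject, add a catch so that one failed write doesn't stall every later one. A minimal sketch along the same lines ('message' is a placeholder event name):

let ready = Promise.resolve();

socket.on('message', data => {
    const result = doExpensiveWork(data); // expensive work starts immediately, concurrently
    // Only the database writes are serialized; a failed insert is logged
    // and the chain stays usable for the next message.
    ready = Promise.all([result, ready])
        .then(([processed]) => insertData(processed))
        .catch(err => console.error('insert failed', err));
});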

How to handle async code within a loop in JavaScript?

I am reaching out to a SongKick's REST API in my ReactJS Application, using Axios.
I do not know how many requests I am going to make, because in each request there is a limit of data that I can get, so in each request I need to ask for a different page.
In order to do so, I am using a loop, like this:
while (toContinue)
{
    axios.get(baseUrl.replace('{page}', page))
        .then(result => {
            ...
            if (result.data.resultsPage.results.artist === undefined)
                toContinue = false;
        });
    page++;
}
Now, the problem that I'm facing is that I need to know when all my requests are finished. I know about Promise.all, but in order to use it, I need to know the exact URLs I'm going to get the data from, and in my case I do not know them until I get empty results from a request, meaning that the last request was the final one.
Of course that axios.get call is asynchronous, so I am aware that my code is not optimal and should not be written like this, but I searched for this kind of problem and could not find a proper solution.
Is there any way to know when all the async code inside a function is finished?
Is there another way to get all the pages from a REST API without looping? Because my code is async, it will make extra requests even after I have reached empty results.
Actually your code will crash the engine: the while loop runs forever and never stops, and the asynchronous requests that were started can never finish because the thread is blocked by the loop.
To resolve that, use async / await to wait for each request before continuing the loop, so the loop is no longer infinite:
async function getResults() {
    let page = 1;
    while (true) {
        const result = await axios.get(baseUrl.replace('{page}', page));
        // ...
        if (!result.data.resultsPage.results.artist)
            break;
        page++;
    }
}
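With the loop inside one async function, "all requests finished" is simply the moment its returned promise settles; anything collected inside the loop (for example, results pushed onto an array and returned at the end) comes back here:

getResults()
    .then(() => console.log('all pages fetched'))
    .catch(err => console.error('a request failed', err));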

Run 1000 requests so that only 10 runs at a time

With node.js I want to http.get a number of remote urls in a way that only 10 (or n) run at a time.
I also want to retry a request if an exception occurs locally (m times), but when the status code returns an error (5XX, 4XX, etc.) the request counts as valid.
This is really hard for me to wrap my head around.
Problems:
Cannot try-catch http.get as it is async.
Need a way to retry a request on failure.
I need some kind of semaphore that keeps track of the currently active request count.
When all requests have finished, I want to get a list of all request URLs and response status codes, which I then want to sort/group/manipulate, so I need to wait for all requests to finish.
It seems like for every async problem promises are recommended, but I end up nesting too many promises and it quickly becomes indecipherable.
There are lots of ways to approach the 10 requests running at a time.
Async Library - Use the async library with the .parallelLimit() method where you can specify the number of requests you want running at one time.
Bluebird Promise Library - Use the Bluebird promise library and the request library to wrap your http.get() into something that can return a promise and then use Promise.map() with a concurrency option set to 10.
Manually coded - Code your requests manually to start up 10 and then each time one completes, start another one.
In all cases, you will have to write some retry code manually, and as with all retry code, you will have to decide very carefully which types of errors you retry, how soon you retry them, how much you back off between retry attempts, and when you eventually give up (all things you have not specified).
Other related answers:
How to make millions of parallel http requests from nodejs app?
Million requests, 10 at a time - manually coded example
My preferred method is with Bluebird and promises. Including retry and result collection in order, that could look something like this:
const request = require('request');
const Promise = require('bluebird');
const get = Promise.promisify(request.get);

let remoteUrls = [...]; // large array of URLs

const maxRetryCnt = 3;
const retryDelay = 500;

Promise.map(remoteUrls, function (url) {
    let retryCnt = 0;
    function run() {
        return get(url).then(function (result) {
            // do whatever you want with the result here
            return result;
        }).catch(function (err) {
            // decide what your retry strategy is here
            // catch all errors here so other URLs continue to execute
            if (isRetryableError(err) && retryCnt < maxRetryCnt) { // isRetryableError() is your own check
                ++retryCnt;
                // try again after a short delay
                // chain onto previous promise so Promise.map() is still
                // respecting our concurrency value
                return Promise.delay(retryDelay).then(run);
            }
            // make value be null if no retries succeeded
            return null;
        });
    }
    return run();
}, { concurrency: 10 }).then(function (allResults) {
    // everything done here and allResults contains results with null for err URLs
});
The simple way is to use the async library; it has a .parallelLimit() method that does exactly what you need.
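For completeness, here is a rough sketch of that route using the async library's mapLimit (a close relative of parallelLimit that also collects results in input order); remoteUrls is the same array of URLs as in the Bluebird example above, and the retry logic is left as a stub:

const async = require('async');
const request = require('request');

async.mapLimit(remoteUrls, 10, (url, done) => {
    request(url, (err, response) => {
        if (err) {
            // local failure: your retry logic would go here; report null for now
            return done(null, { url: url, status: null });
        }
        // any HTTP status code (including 4XX/5XX) counts as a valid result
        done(null, { url: url, status: response.statusCode });
    });
}, (err, results) => {
    // results is in the same order as remoteUrls
    console.log(results);
});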

AngularJS $http.get async execution order

I recently did a lot of coding in AngularJS. After some time I started to feel comfortable with it and also got really productive. But unfortunately there is this one thing I don't understand:
Within my project I need to get data through $http.get and a RESTful API server. This is where I started to stumble first. After implementing promises ($q.defer etc. and .then) in the functions that process the data needed to continue, I thought I had conquered the problem.
But in this code:
$scope.getObservationsByLocations = function() {
    var promise = $q.defer();
    var locationCount = 0;
    angular.forEach($scope.analysisData, function(loc) { // for each location
        $http.get($scope.api + 'Device?_format=json', { // get all devices
            params: {
                location: loc.location.id
            }
        }).then(function (resultDevices) {
            var data = angular.fromJson(resultDevices);
            promise.resolve(data);
            // for each device in this location
            angular.forEach(angular.fromJson(resultDevices).data.entry.map(function (dev) {
                $http.get($scope.api + 'Observation?_format=json', { // get all observations
                    params: {
                        device: dev.resource.id
                    }
                }).then(function (resultObservations) {
                    var observations = angular.fromJson(resultObservations);
                    // for each observation of that device in this location
                    angular.forEach(observations.data.entry.map(function(obs) {
                        $scope.analysisData[locationCount].observations.push({observation: obs.resource});
                    }));
                })
            }))
        });
        locationCount++
    });
    return promise.promise
};
I can't understand in which order the commands are executed. Since I use the WebStorm IDE and its debugging feature, it would be more accurate to say I don't know why the commands are executed in an order I don't understand.
Thinking simply, everything included in the forEach has to be executed before the return is reached, because the $http.get's are connected through .then's. But following the debugging information, the function iterates over locationCount++ and even returns the promise before it goes deeper (meaning past the first .then()).
What's that all about? Did I misunderstood this part of the AngularJS concept?
Or is this just really bad practice and I should reach out for a different solution?
If the context is important/interesting: Objects are based on i.e. https://www.hl7.org/fhir/2015May/location.html#5.15.3
With JavaScript you can create only single-threaded applications, although e.g. here they say it is not guaranteed to be like that.
But we are talking about the real world and real browsers, so your code sample runs as a single thread (by the way, the same thread is also used for rendering your CSS and HTML, at least in Firefox).
When it comes to an asynchronous call such as
$http.get($scope.api + 'Device?_format=json', {
it says "hey, I can do that later". And it does wait with it, because it must go on with the current thread.
Then, once the current task is done with return, it can finally start getting the remote data.
Proof? Check this fiddle:
console.log(1);
for (var i = 0; i < 1000000; i++) setTimeout(function () {
    console.log(2);
}, 0);
console.log(3);
You see the spike with the for loop? This is the moment when it registers the setTimeout asynchronous calls. Still, 3 is printed before any 2, because the current task is not done until the 3 is printed.
The $http.get is asynchronous, so depending on (among other things) how large the fetched data is, the time it takes to complete the get is variable. Hence there is no telling in what order they will complete.
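If the goal is for the returned promise to resolve only after every nested request has finished, the usual Angular pattern is to collect the inner promises and combine them with $q.all. Here is a rough sketch against the same endpoints (the response shapes are assumptions taken from the question's code):

$scope.getObservationsByLocations = function () {
    // One promise per location; each resolves only after all of that
    // location's device and observation requests have completed.
    var locationPromises = $scope.analysisData.map(function (loc) {
        return $http.get($scope.api + 'Device?_format=json', {
            params: { location: loc.location.id }
        }).then(function (resultDevices) {
            var devicePromises = resultDevices.data.entry.map(function (dev) {
                return $http.get($scope.api + 'Observation?_format=json', {
                    params: { device: dev.resource.id }
                }).then(function (resultObservations) {
                    resultObservations.data.entry.forEach(function (obs) {
                        loc.observations.push({ observation: obs.resource });
                    });
                });
            });
            return $q.all(devicePromises);
        });
    });
    // Resolves once every device and observation request has finished.
    return $q.all(locationPromises);
};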

JavaScript Double Null Check and Locking

In a language with threads and locks it is easy to implement a lazy load by checking the value of a variable; if it's null, then lock the next section of code, check the value again, and then load the resource and assign it. This prevents it from being loaded multiple times and causes threads after the first to wait for the first thread to complete the action that's needed.
Pseudocode:
if (myvar == null) {
    lock(obj) {
        if (myvar == null) {
            myvar = getData();
        }
    }
}
return myvar;
JavaScript runs in a single thread; however, it still has this type of issue because of asynchronous execution while one call is waiting on a blocking resource. In this Node.js example:
var allRecords;

module.exports = function getAllRecords(callback) {
    if (allRecords) {
        return callback(null, allRecords);
    }
    db.getRecords({}, function(err, records) {
        if (err) {
            return callback(err);
        }
        // Use existing object if it has been
        // set by another async request to this
        // function
        allRecords = allRecords || records;
        return callback(null, allRecords);
    });
};
I'm lazy loading all the records from a small DB table the first time this function is called and then returning the in-memory records on subsequent calls.
Problem: If multiple async requests are made to this function at the same time then the table is going to be loaded unnecessarily from the DB multiple times.
In order to solve this I could simulate a locking mechanism by creating a var lock; variable and setting it to true while the table is loading. I would then put the other async calls into a setTimeout() loop and check back on this variable every (say) 1 second until the data was available and then allow them to return.
The problems with that solution are:
It's fragile: what if the first async call throws and doesn't unset the lock?
How many times do we loop back into the timer before giving up?
How long should the timer be set for? In some environments 1 second might be way too long and inefficient.
Is there a best practice for solving this in JavaScript?
On the first call to the service, initialize an array. Start the fetch operation. Create a Promise, store it in the array.
On subsequent calls, if the data is there, return an already-fulfilled Promise. If not, add another Promise to the array and return that.
When the data arrives, resolve all the waiting Promise objects in the list. (You can throw away the list once the data's there.)
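A minimal sketch of that idea, assuming the same db.getRecords call as in the question; the waiters list and variable names are illustrative, and the promise-returning signature follows the approach described above rather than the original callback API:

var allRecords = null;   // cached data once loaded
var waiters = null;      // pending promise handlers while the first load is in flight

module.exports = function getAllRecords() {
    if (allRecords) {
        // Data already loaded: hand back an already-fulfilled promise.
        return Promise.resolve(allRecords);
    }
    if (waiters) {
        // A load is already in flight: queue up and wait for it.
        return new Promise(function (resolve, reject) {
            waiters.push({ resolve: resolve, reject: reject });
        });
    }
    // First caller: create the waiting list and start the fetch.
    waiters = [];
    var first = new Promise(function (resolve, reject) {
        waiters.push({ resolve: resolve, reject: reject });
    });
    db.getRecords({}, function (err, records) {
        var pending = waiters;
        waiters = null; // the list can be thrown away now
        if (err) {
            pending.forEach(function (w) { w.reject(err); });
            return;
        }
        allRecords = records;
        pending.forEach(function (w) { w.resolve(allRecords); });
    });
    return first;
};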
I really like the promise solution in the other answer -- very clever, very interesting. Promises aren't the dominant methodology, so you may need to educate the team. I'm going to go in another direction though.
What you're after is a memoize function -- an in-memory key/value cache of expensive results. JavaScript: The Good Parts has a memoize sample towards the end. Lodash has a memoize function. These assume synchronous processing, so they don't account for your scenario -- which is to say they'd hit the database lots of times until one of the "threads" replied.
The async library also has a memoize function that does exactly what you want. In its innards, it keeps a queue array of callbacks, and once it gets the answer, it both caches it and calls all the callbacks.
If you're into inventing, by all means, use promises. If you'd just like a plug-n-play answer, use async#memoize.
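And a quick sketch of the plug-and-play route, reusing the db.getRecords call from the question:

const async = require('async');

// async.memoize caches the result and queues callbacks that arrive while the
// first DB call is still running, so concurrent callers share one db.getRecords call.
const getAllRecords = async.memoize(function (callback) {
    db.getRecords({}, callback);
});

// Every caller, concurrent or later, receives the same cached records.
getAllRecords(function (err, allRecords) {
    if (err) { /* handle the failed load */ }
    // use allRecords here
});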
