I was tasked with transferring a large portion of data from one database to another using JavaScript and an API. Yes, I understand that there are better ways of accomplishing this task, but I was asked to try this method.
I wrote some JavaScript that makes a GET call to an API that returns an array of data, which I then turn around and send to another API as individual POST requests.
What I have written so far seems to work fairly well, and I have been able to send over 50k individual POST requests without any errors. But I run into trouble when the number of POST requests increases past around 100k: I end up running out of memory and the browser crashes.
From what I understand so far about promises, there may be an issue where promises (or something else?) are still kept in heap memory after they are resolved, which results in running out of memory after too many requests.
I've tried three different methods to get all the records to POST successfully after searching for the past couple of days, including using Bluebird's Promise.map as well as breaking up the array into chunks before sending them as POST requests. Each method seems to work until it has processed about 100k records, and then it crashes.
async function amGetRequest(controllerName) {
    try {
        const amURL = "http://localhost:8081/api/" + controllerName;
        const amResponse = await fetch(amURL, {
            "method": "GET",
        });
        return await amResponse.json();
    } catch (err) {
        closeModal();
        console.error(err);
    }
};
async function brmPostRequest(controllerName, body) {
    const brmURL = urlBuilderBRM(controllerName);
    const headers = headerBuilderBRM();
    try {
        await fetch(brmURL, {
            "method": "POST",
            "headers": headers,
            "body": JSON.stringify(body)
        });
    } catch (error) {
        closeModal();
        console.error(error);
    };
};
//V1.0 Send one by one and resolve all promises at the end.
const amResult = await amGetRequest(controllerName); //(returns an array of ~245,000 records)
let promiseArray = [];
for (let i = 0; i < amResult.length; i++) {
    promiseArray.push(await brmPostRequest(controllerName, amResult[i]));
};
const postResults = await Promise.all(promiseArray);
//V2.0 Use Bluebird's Promise.map with concurrency set to 100
const amResult = await amGetRequest(controllerName); //(returns an array of ~245,000 records)
const postResults = Promise.map(amResult, async data => {
    await brmPostRequest(controllerName, data);
    return Promise.resolve();
}, {concurrency: 100});
//V3.0 Chunk array into max 1000 records and resolve 1000 promises before looping to the next 1000 records
const amResult = await amGetRequest(controllerName); //(returns an array of ~245,000 records)
const numPasses = Math.ceil(amResult.length / 1000);
for (let i = 0; i <= numPasses; i++) {
    let subset = amResult.splice(0, 1000);
    let promises = subset.map(async (record) => {
        await brmPostRequest(controllerName, record);
    });
    await Promise.all(promises);
    subset.length = 0; //clear out temp array before looping again
};
Is there something that I am missing about getting these promises cleared out of memory after they have been resolved?
Or perhaps a better method of accomplishing this task?
Edit: Disclaimer - I'm still fairly new to JS and still learning.
"Well-l-l-l ... you're gonna need to put a throttle on this thing!"
Without (pardon me ...) attempting to dive too deeply into your code: no matter how many records you need to transfer, you need to control the number of requests the browser attempts to make at any one time.
What's probably happening right now is that you're stacking up hundreds or thousands of "promised" requests in local memory, but how many requests can the browser actually transmit at once? That should govern the number of requests the browser actually attempts. As each reply comes back, your software decides whether to start another request and, if so, for which record.
Conceptually, you have so many "worker bees," matching the number of network requests your browser can actually make simultaneously. Your software never launches more simultaneous requests than that: it simply launches one new request as each request completes. Each request, upon completion, triggers code that decides whether to launch the next one.
So you never are "sending thousands of fetch requests." You're probably sending only a handful at a time, even though, in this you-controlled manner, "thousands of requests do eventually get sent."
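For example, here is a minimal sketch of that "worker bee" pool, reusing brmPostRequest from the question (the pool size of 6 is only an assumption based on typical per-host browser connection limits, so tune it for your environment):
async function postAllWithWorkers(controllerName, records, poolSize = 6) {
    let nextIndex = 0;
    // Each worker repeatedly takes the next unprocessed record until none remain.
    async function worker() {
        while (nextIndex < records.length) {
            const record = records[nextIndex++];
            await brmPostRequest(controllerName, record);
        }
    }
    // Start poolSize workers and wait for all of them to drain the array.
    await Promise.all(Array.from({ length: poolSize }, () => worker()));
}
Because only poolSize requests are ever in flight, and nothing is accumulated per record, memory use stays flat no matter how many records there are.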
As you are not interested in the values delivered by brmPostRequest(), there's no point mapping the original array; neither the promises nor the results need to be accumulated.
Not doing so will save memory and may allow progress beyond the 100k sticking point.
async function foo() {
    const amResult = await amGetRequest(controllerName);
    let counts = { 'successes': 0, 'errors': 0 };
    for (let i = 0; i < amResult.length; i++) {
        try {
            await brmPostRequest(controllerName, amResult[i]);
            counts.successes += 1;
        } catch (err) {
            counts.errors += 1;
        }
    };
    console.log(counts);
}
Related
I have a bunch of data that I want to send to a server over HTTP. However, on the server side I need to process the data in the same order as it was sent (e.g. if the order of sending is elem1, elem2 and elem3, I would like to process elem1 first, then elem2 and then elem3). Since HTTP gives no guarantee that the order will be maintained, I need some way to maintain it.
Currently I keep the data in a queue, send one element, and await the response. Once the response reaches me, I send the next element.
while (!queue.isEmpty()) {
    let data = queue.dequeue();
    await sendDataToServer(data);
}
I am not very sure whether this will actually work in a production environment, or what the impact on performance will be.
Any sort of help is much appreciated. Thank you
Sorry, I don't have enough reputation to comment, thus I am posting this as an answer.
Firstly, your code will work as intended.
However, since the server has to receive them in order, the performance won't be good. If you can change the server, I suggest you implement it like this:
Add an ID to each data item.
Send all the data items, no need to ensure order.
Create a buffer on the server, the buffer will be able to contain all the data items.
The server receives the items and puts them into the buffer in the right position.
Example code:
Client (see Promise.all)
let i = 0;
let promises = [];
await sendDataLengthToServer(queue.length());
while (!queue.isEmpty()) {
    let data = queue.dequeue();
    data.id = i++; // tag each item with its position so the server can reorder
    // no need to wait for a request to finish
    promises.push(sendDataToServer(data));
}
await Promise.all(promises);
Server (pseudo-code)
length = receiveDataLengthFromClient()
buffer = new Array(length)
received = 0
onDataReceivedFromClient(data, {
    received = received + 1
    buffer[data.id] = data
    if (received == length) {
        // the buffer contains the data in the right order
    }
})
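For illustration only, here is a rough sketch of that server logic as an Express app; the /length and /data routes and the in-memory state are assumptions for the example, not part of the answer above:
const express = require('express');
const app = express();
app.use(express.json());

let buffer = [];
let expected = 0;
let received = 0;

// The client announces how many items it will send.
app.post('/length', (req, res) => {
    expected = req.body.length;
    buffer = new Array(expected);
    received = 0;
    res.sendStatus(200);
});

// Items may arrive in any order; slot each one into place by its id.
app.post('/data', (req, res) => {
    buffer[req.body.id] = req.body;
    received += 1;
    if (received === expected) {
        // buffer now holds the items in their original order; process them here
    }
    res.sendStatus(200);
});

app.listen(3000);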
I have the following scenario:
My scrapes are behind a login, so there is one login page that I always need to hit first.
Then I have a list of 30 URLs that can be scraped asynchronously for all I care.
Then, at the very end, when all those 30 URLs have been scraped, I need to hit one last separate URL to put the results of the 30-URL scrape into a Firebase DB and do some other mutations (like geo lookups for addresses, etc.).
Currently I have all 30 URLs in a request queue (through the Apify web interface) and I'm trying to see when they are all finished.
But obviously they all run async, so that data is never reliable.
const queue = await Apify.openRequestQueue();
let pendingRequestCount = await queue.getInfo();
The reason why I need that last URL to be separate is twofold:
The most obvious reason is that I need to be sure I have the results of all 30 scrapes before I send everything to the DB.
None of the 30 URLs allow me to do Ajax/Fetch calls, which I need for sending to Firebase and doing the geo lookups of addresses.
Edit: Tried this based on the answer from @Lukáš Křivka. handledRequestCount in the while loop reaches a max of 2, never 4, and Puppeteer just ends normally. I've put the "return" inside the while loop because otherwise requests never finish (of course).
In my current test setup I have 4 URLs to be scraped (in the Start URLs input field of Puppeteer Scraper on Apify.com) and this code:
let title = "";
const queue = await Apify.openRequestQueue();
let {handledRequestCount} = await queue.getInfo();
while (handledRequestCount < 4){
await new Promise((resolve) => setTimeout(resolve, 2000)) // wait for 2 secs
handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
console.log(`Curently handled here: ${handledRequestCount} --- waiting`) // this goes max to '2'
title = await page.evaluate(()=>{ return $('h1').text()});
return {title};
}
log.info("Here I want to add another URL to the queue where I can do ajax stuff to save results from above runs to firebase db");
title = await page.evaluate(()=>{ return $('h1').text()});
return {title};
I would need to see your code to answer completely correctly, but this has solutions.
Simply use Apify.PuppeteerCrawler for the 30 URLs. Then you run the crawler with await crawler.run().
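For example, a minimal sketch of such a crawler (the start URLs and the handlePageFunction body are placeholders; this assumes the Apify SDK API used elsewhere in this thread):
const Apify = require('apify');

Apify.main(async () => {
    // Placeholder start URLs; replace with your 30 real URLs.
    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'https://example.com/page-1' },
        { url: 'https://example.com/page-2' },
    ]);

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            // Scrape whatever you need and push it to the default dataset.
            await Apify.pushData({ url: request.url, title: await page.title() });
        },
    });

    await crawler.run(); // resolves once all requests are handled
});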
After that, you can simply load the data from the default dataset via
const dataset = await Apify.openDataset();
const data = await dataset.getData().then((response) => response.items);
And do whatever you want with the data; you can even create a new Apify.PuppeteerCrawler to crawl the last URL and use the data.
If you are using Web Scraper though, it is a bit more complicated. You can either:
1) Create a separate actor for the Firebase upload and pass it a webhook from your Web Scraper to load the data from it. If you look at the Apify store, we already have a Firestore uploader.
2) Add logic that polls the requestQueue like you did, and proceed only when all the requests are handled. You can create some kind of loop that waits, e.g.
const queue = await Apify.openRequestQueue();
let { handledRequestCount } = await queue.getInfo();
while (handledRequestCount < 30) {
    console.log(`Currently handled: ${handledRequestCount} --- waiting`);
    await new Promise((resolve) => setTimeout(resolve, 2000)); // wait for 2 secs
    handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
}
// Do your Firebase stuff
In the scenario where you have one async function that's called for each of the 30 URLs you scrape, first make sure the function returns its result after all necessary awaits; then you could await Promise.all(arrayOfAll30Promises) and run your last piece of code.
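As a rough sketch of that idea (scrapeOne, thirtyUrls and saveResultsToFirebase are hypothetical names standing in for your own functions):
const arrayOfAll30Promises = thirtyUrls.map((url) => scrapeOne(url));
const allResults = await Promise.all(arrayOfAll30Promises);
// All 30 scrapes are finished here; now hit the final URL / do the Firebase work.
await saveResultsToFirebase(allResults);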
Because I was not able to get consistent results with the {handledRequestCount} from getInfo() (see my edit in my original question), I went another route.
I'm basically keeping a record of which URLs have already been scraped via the key/value store.
urls = [
    { done: false, label: "vietnam", url: "https://en.wikipedia.org/wiki/Vietnam" },
    { done: false, label: "cambodia", url: "https://en.wikipedia.org/wiki/Cambodia" }
]

// Loop over the array and add them to the Queue
for (let i = 0; i < urls.length; i++) {
    await queue.addRequest(new Apify.Request({ url: urls[i].url }));
}

// Push the array to the key/value store with key 'URLS'
await Apify.setValue('URLS', urls);
Now every time I've processed a URL, I set its "done" value to true.
When they are all true, I push another (final) URL into the queue:
await queue.addRequest(new Apify.Request({ url: "http://www.placekitten.com" }));
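Roughly, the per-page logic looks like this (a sketch of the approach just described, not the exact actor code; request and queue come from the scraper's page-handling context):
// Inside the handler for each scraped page
const urls = await Apify.getValue('URLS');

// Mark the URL we just finished as done and persist the updated array.
const current = urls.find((u) => u.url === request.url);
if (current) current.done = true;
await Apify.setValue('URLS', urls);

// Once every URL is done, enqueue the final "save to Firebase" request.
if (urls.every((u) => u.done)) {
    await queue.addRequest(new Apify.Request({ url: "http://www.placekitten.com" }));
}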
I'm sending data to Firestore at 33 Hz. This means that over time, a lot of data is stored.
I need to be able to download this data. For that, I have made an HTTP function in Firebase that receives the user UID, the device's serial number, and start and end date/times. The function then checks whether the user exists and whether they own that serial number. Then it queries Firestore and appends data to a JSON object, which is eventually sent as the response.
But depending on how long a period is queried, the function will time out. It does work for short periods.
What should I do to make the function faster? Am I using the wrong tool?
[...]
const nD1 = db.collection('grind-bit').doc(req.query.serial).collection('history')
    .where('date', '>=', startdate).where('date', '<=', enddate)
    .get().then(snapshot => {
        snapshot.forEach(doc => {
            elem = {};
            elem.date = doc.data().date;
            elem.rms0 = doc.data().rms0;
            elem.rms1 = doc.data().rms1;
            elem.rms2 = doc.data().rms2;
            data[key].push(elem);
        });

        if (data[key].length) {
            let csv = json2csv(data[key]);
            csv = JSON.stringify(csv);
            res.set("Content-Type", "text/csv");
            return promises.push(res.status(200).send(csv));
        } else {
            return promises.push(res.status(401).send('No data has been found in this interval.'));
        }
[...]
If the timeout is caused by the fact that you're trying to read/return too much data in one go, you might want to consider limiting how much data you return. By adding a limit() clause to your query, you can limit how much data it returns at most.
const nD1 = db.collection('grind-bit').doc(req.query.serial).collection('history')
.where('date', '>=', startdate).where('date', '<=', enddate)
.limit(100)
.get().then(snapshot => {
snapshot.forEach(doc => {
...
This of course means that you may have to call the HTTP function multiple times to ensure you process all data. The simplest approach for that is: if you get 100 results (or whatever limit you specify), try another call after processing them.
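A rough sketch of that repeated-call idea, using Firestore's startAfter() cursor so each call picks up where the previous page ended (this assumes the date-range query from the question with an explicit .orderBy('date'); the page size of 100 is just the example limit):
async function fetchAllPages(baseQuery) {
    const allDocs = [];
    let lastDoc = null;
    while (true) {
        let query = baseQuery.limit(100);
        if (lastDoc) query = query.startAfter(lastDoc);
        const snapshot = await query.get();
        if (snapshot.empty) break;
        snapshot.forEach((doc) => allDocs.push(doc));
        lastDoc = snapshot.docs[snapshot.docs.length - 1];
        // Fewer than 100 results means this was the last page.
        if (snapshot.size < 100) break;
    }
    return allDocs;
}
Keep in mind a Cloud Function still has an overall execution limit, so for very long ranges the client may need to make several HTTP calls, each resuming from the last date it received.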
Also see:
The Firebase documentation on limiting query results
I'm making a bunch of calls to a database that contains a large amount of data, on a Windows 7 64-bit OS. As the calls are queuing up, I get this error (for every HTTP call after the first error):
Error: connect ENOBUFS *omitted* - Local (undefined:undefined)
From my Google searching I've learned that this error means that my buffer has grown too large and my system's memory can no longer handle the buffer's size.
But I don't really understand what this means. I'm using Node.js with an HTTPS library to handle my requests. When the requests are getting queued and the sockets are opening, is the buffer's size allocated in RAM? What would allow the buffer to expand to a greater size? Is this simply a hardware limitation?
I've also read that some operating systems are able to handle the buffer size better than others. Is this the case? If so, which OS would be better suited for running a Node script that needs to fetch a lot of data via HTTPS requests?
Here's how I'm doing my requests.
for (let j = 0; j < dataQueries; j++) {
    getData(function (parsed) {
        // use parsed response here (omitted for brevity)
    })
}

function getData(callback) {
    axios.get(url, config)
        .then((res) => {
            // parse res
            callback(parsedRes(res))
        }).catch(function (err) {
            console.log("Spooky problem alert! : " + err);
        })
}
I've omitted some code for brevity, but this is generally how I'm doing my requests. I have a for loop that for every iteration launches a GET request via axios.
I know there is an axios.all method that can be used to collect the promises that axios.get and its siblings return, but I saw no change in behavior when I set it up to store the promises and then iterate over them via axios.all.
Thanks @Jonasw for your help, but there is a very simple solution to this problem.
I used the small library throttled-queue to get the job done. (If you look at the source code, it would be pretty easy to implement your own queue based on this package.)
My code changed to:
const throttledQueue = require('throttled-queue')
let throttle = throttledQueue(15, 1000) // 15 times per second

for (let j = 0; j < dataQueries; j++) {
    throttle(function () {
        getData(function (res) {
            // do parsing
        })
    })
}

function getData(callback) {
    axios.get(url, config)
        .then((res) => {
            // parse res
            callback(parsedRes(res))
        }).catch(function (err) {
            console.log("Spooky problem alert! : " + err);
        })
}
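For reference, and only as a rough sketch rather than the library's actual implementation, a home-grown version of the same idea could look like this:
// Minimal throttled queue: run at most maxPerInterval queued functions every interval ms.
function makeThrottledQueue(maxPerInterval, interval) {
    const pending = [];
    setInterval(() => {
        for (let i = 0; i < maxPerInterval && pending.length > 0; i++) {
            const fn = pending.shift();
            fn();
        }
    }, interval);
    return function enqueue(fn) {
        pending.push(fn);
    };
}

// Usage mirrors throttled-queue: at most 15 calls per second.
const throttle = makeThrottledQueue(15, 1000);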
In my case this got resolved by deleting the autogenerated zip files from my workspace, which were created every time I ran cdk deploy. It turns out that my TypeScript compiler treated these files as source files and included them in the tarball.
You're starting a lot of data queries at the same time. You could chain them up using a partially recursive function, so that they're executed one after another:
(function proceedwith(j) {
    getData(function () {
        if (j < dataQueries - 1) proceedwith(j + 1);
    });
})(0)
I experienced the same issue when starting too many requests.
I tried throttled-queue, but it wasn't working correctly for me.
system-sleep worked for me, effectively slowing down the rate at which the requests were made. Sleep is best used in synchronous code, to block before sync/async calls.
Example: (using sleep to limit the rate updateAddress() is called)
const sleep = require('system-sleep'); // blocking sleep

// Asynchronous call (what is important is that forEach is synchronous)
con.query(sql, function (err, result) {
    if (err) throw err;
    result.forEach(function (element) {
        sleep(120); // Blocking call: sleep for 120 ms
        updateAddress(element.address); // Another async call (network request)
    });
});
This question might be a little vague, but I'll try my best to explain.
I'm trying to create an array of all of the tweets that I can retrieve from Twitter's API, but it limits each request to 200 returned tweets. How can I make requests to Twitter asynchronously up to the maximum limit of 3200 returned tweets? What I mean is: is it possible to call Twitter's API asynchronously but build the array sequentially, making sure that the tweets are correctly sorted by date?
So, I have an array:
var results = [];
and I'm using node's request module:
var request = require('request');
what I have right now (for just the limit of 200) is
request(options, function (err, response, body) {
    body = JSON.parse(body);
    for (var i = 0; i < body.length; i++) {
        results.push(body[i].text);
    }
    return res.json(results);
});
I've looked into maybe using the 'promise' module, but it was confusing to understand. I tried using a while loop, but it got complicated because I couldn't follow the path that the server was taking.
Let me know if this didn't explain things well.
In the end, I want results to be an array populated with all of the tweets that the requests send.
I would suggest using request-promise instead of request. Here is my solution.
var rp = require('request-promise');
var tweets = [];
var promises = [];
for (var i = 1; i < 10; i++) {
    var promise = rp(options);
    promises.push(promise);
}
Promise.all(promises).then(function (data) {
    data.forEach(function (item) {
        // handle tweets here
    });
    return res.json(tweets);
});
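One caveat (my own observation, not part of the answer above): as written, every rp(options) call uses the same options, so each request would return the same page of tweets. To walk back through the timeline toward the 3200-tweet limit, you would typically vary Twitter's paging parameters per request; a hypothetical sketch, assuming options is a request-promise options object with a qs query object:
async function fetchAllTweets(baseOptions) {
    const texts = [];
    let maxId = null;
    for (let page = 0; page < 16; page++) { // 16 pages * 200 tweets = 3200
        const options = Object.assign({}, baseOptions, { json: true });
        options.qs = Object.assign({}, baseOptions.qs, { count: 200 });
        if (maxId) options.qs.max_id = maxId;
        const body = await rp(options);
        if (!body.length) break;
        body.forEach((tweet) => texts.push(tweet.text));
        // max_id is inclusive, so the boundary tweet repeats on the next page;
        // dedupe or decrement the id if that matters for your use case.
        maxId = body[body.length - 1].id_str;
    }
    return texts;
}
Because each page is requested only after the previous one returns, the array is built sequentially and stays sorted by date, which also addresses the ordering concern in the question.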