Best way to push one more scrape after all are done - javascript

I have the following scenario:
My scrapes are behind a login, so there is one login page that I always need to hit first.
Then I have a list of 30 URLs that can be scraped asynchronously for all I care.
Then at the very end, when all those 30 URLs have been scraped, I need to hit one last separate URL to put the results of the 30-URL scrape into a Firebase DB and to do some other mutations (like geo lookups for addresses etc.).
Currently I have all 30 URLs in a request queue (through the Apify web interface) and I'm trying to see when they are all finished.
But obviously they all run async, so that data is never reliable:
const queue = await Apify.openRequestQueue();
let pendingRequestCount = await queue.getInfo();
The reasons why I need that last URL to be separate are two-fold:
The most obvious reason being that I need to be sure I have the results of all 30 scrapes before I send everything to the DB.
None of the 30 URLs allow me to do Ajax / Fetch calls, which I need for sending to Firebase and doing the geo lookups of addresses.
Edit: Tried this based on the answer from Lukáš Křivka. handledRequestCount in the while loop reaches a max of 2, never 4 ... and Puppeteer just ends normally. I've put the "return" inside the while loop because otherwise requests never finish (of course).
In my current test setup I have 4 URLs to be scraped (in the Start URLs input fields of Puppeteer Scraper on Apify.com) and this code:
let title = "";
const queue = await Apify.openRequestQueue();
let { handledRequestCount } = await queue.getInfo();
while (handledRequestCount < 4) {
    await new Promise((resolve) => setTimeout(resolve, 2000)); // wait for 2 secs
    handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
    console.log(`Currently handled here: ${handledRequestCount} --- waiting`); // this goes max to '2'
    title = await page.evaluate(() => { return $('h1').text(); });
    return { title };
}
log.info("Here I want to add another URL to the queue where I can do ajax stuff to save results from above runs to firebase db");
title = await page.evaluate(() => { return $('h1').text(); });
return { title };

I would need to see your code to answer completely correctly, but this has solutions.
Simply use Apify.PuppeteerCrawler for the 30 URLs. Then you run the crawler with await crawler.run().
After that, you can simply load the data from the default dataset via
const dataset = await Apify.openDataset();
const data = await dataset.getData().then((response) => response.items);
And do whatever you want with the data; you can even create a new Apify.PuppeteerCrawler to crawl the last URL and use the data.
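For what it's worth, here is a minimal sketch of that flow with the Apify SDK (the login URL, the page-scraping logic, and the final Firebase step are placeholders, not part of the original answer):

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    // Add the login page first, then the 30 scrape URLs (placeholder URL here).
    await requestQueue.addRequest({ url: 'https://example.com/login' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            // ... scrape the page and store its result in the default dataset ...
            await Apify.pushData({ url: request.url, title: await page.title() });
        },
    });
    await crawler.run(); // resolves only after every queued request is handled

    // Now all 30 results are in the default dataset.
    const dataset = await Apify.openDataset();
    const { items } = await dataset.getData();
    // ... hit the final URL / push items to Firebase here ...
});

crawler.run() only resolves once the request queue is drained, so everything after it runs exactly once, after all 30 pages are done.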
If you are using Web Scraper though, it is a bit more complicated. You can either:
1) Create a separate actor for the Firebase upload and pass it a webhook from your Web Scraper to load the data from it. If you look at the Apify store, we already have a Firestore uploader.
2) Add logic that polls the requestQueue like you did, and only proceed when all the requests are handled. You can create some kind of loop that will wait, e.g.:
const queue = await Apify.openRequestQueue();
let { handledRequestCount } = await queue.getInfo();
while (handledRequestCount < 30) {
    console.log(`Currently handled: ${handledRequestCount} --- waiting`);
    await new Promise((resolve) => setTimeout(resolve, 2000)); // wait for 2 secs
    handledRequestCount = await queue.getInfo().then((info) => info.handledRequestCount);
}
// Do your Firebase stuff

In the scenario where you have one async function that's called for all 30 URLs you scrape, first make sure the function returns its result after all necessary awaits. You could then await Promise.all(arrayOfAll30Promises) and run your last piece of code.
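As a rough sketch of that pattern (scrapeOne and sendResultsToFirebase are hypothetical placeholders, not code from the question):

// Hypothetical helper that scrapes a single URL and resolves with its result.
async function scrapeOne(url) {
    // ... open the page, extract whatever you need ...
    return { url };
}

async function run(urls) {
    // Start all 30 scrapes, then wait until every one of them has resolved.
    const arrayOfAll30Promises = urls.map((url) => scrapeOne(url));
    const results = await Promise.all(arrayOfAll30Promises);

    // Only now run the last piece of code (e.g. push everything to Firebase).
    await sendResultsToFirebase(results); // hypothetical final step
}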

Because I was not able to get consistent results with the {handledRequestCount} from getInfo() (see my edit in my original question), I went another route.
I'm basically keeping a record of which URLs have already been scraped via the key/value store.
const urls = [
    { done: false, label: "vietnam", url: "https://en.wikipedia.org/wiki/Vietnam" },
    { done: false, label: "cambodia", url: "https://en.wikipedia.org/wiki/Cambodia" }
];

// Loop over the array and add them to the queue
for (let i = 0; i < urls.length; i++) {
    await queue.addRequest(new Apify.Request({ url: urls[i].url }));
}

// Push the array to the key/value store with key 'URLS'
await Apify.setValue('URLS', urls);
Now every time I've processed a URL, I set its "done" value to true.
When they are all true, I push another (final) URL into the queue:
await queue.addRequest(new Apify.Request({ url: "http://www.placekitten.com" }));
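For illustration, a rough sketch of how that bookkeeping might look after each page has been scraped (a sketch only; it assumes request and queue are available as in the snippets above, and matches entries by URL):

// After a URL has been scraped successfully:
const urls = await Apify.getValue('URLS');
const entry = urls.find((u) => u.url === request.url);
if (entry) entry.done = true;
await Apify.setValue('URLS', urls);

// Once every entry is done, enqueue the single final request.
if (urls.every((u) => u.done)) {
    await queue.addRequest(new Apify.Request({ url: "http://www.placekitten.com" }));
}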

Related

Strapi Async API Request in a Loop

I'm using a headless CMS (Strapi), which forces pagination on its GET endpoints, with up to 100 entries per page on the content types. I have ~400 entries on a Location content type that I want to show on the same page. I'm trying to read each page on load and store each Location in an array, but I'm having some issues fetching this data asynchronously. Instead of getting each page, I get the first page for each of the iterations of the loop. I'm new to async-await requests, so I'm not sure how to ensure that I am reading each page's data. Thanks in advance!
const getLocationsPage = async (page) => {
    return await useFetch(`${this.config.API_BASE_URL}/api/locations?sort=slug&pagination[page]=${page}&pagination[pageSize]=50`)
}

for (let i = 1; i < this.pageCount; i++) {
    const pageResults = await getLocationsPage(i)
    console.log(pageResults)
}
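For reference, a sketch of one way to gather every page into a single array, using plain fetch rather than Nuxt's useFetch (an assumption; apiBaseUrl and pageCount stand in for this.config.API_BASE_URL and this.pageCount):

// Build one request per page up front, then wait for all of them.
async function getAllLocations(apiBaseUrl, pageCount) {
    const pageNumbers = Array.from({ length: pageCount }, (_, i) => i + 1);
    const pages = await Promise.all(
        pageNumbers.map(async (page) => {
            const res = await fetch(`${apiBaseUrl}/api/locations?sort=slug&pagination[page]=${page}&pagination[pageSize]=50`);
            const json = await res.json();
            return json.data; // Strapi v4 wraps results in a `data` array
        })
    );
    return pages.flat(); // one flat array with all ~400 locations
}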

Manage a long-running operation node.js

I am creating a Telegram bot which allows you to get some information about the Destiny 2 game world, using the Bungie API. The bot is based on the Bot Framework and uses Telegram as a channel (I am using JavaScript as the language).
Now I find myself in the situation where, when I send a request to the bot, it makes a series of HTTP calls to the endpoints of the API to collect information, formats it, and returns it via Adaptive Cards. In many cases, however, this process takes more than 15 seconds, showing the message "POST to DestinyVendorBot timed out after 15s" in chat (even though this message is shown, the bot works perfectly).
Searching online I noticed that there doesn't seem to be a way to hide this message or increase the time before it shows up. So the only thing left for me to do is to make sure it doesn't show up. To do this I tried to refer to this documentation article, but the code shown is in C#. Could someone give me an idea of how to solve this problem, or maybe some sample code?
I leave here an example of a call that takes too long and generates the message:
// Shows the gunsmith's inventory
if (LuisRecognizer.topIntent(luisResult) === 'GetGunsmith') {
    // Takes more than 15 seconds
    const mod = await this.br.getGunsmith(accessdata, process.env.MemberShipType, process.env.Character);
    if (mod.error == 0) {
        var card = {
        }
        await step.context.sendActivity({
            text: 'Ecco le mod vendute oggi da Banshee-44:',
            attachments: [CardFactory.adaptiveCard(card)]
        });
    } else {
        await step.context.sendActivity("Codice di accesso scaduto.");
        await this.loginStep(step);
    }
}
I have done something similar where you call another function and send the message once the function is complete via proactive message. In my case, I set up the function directly inside the bot instead of as a separate Azure Function. First, you need to save the conversation reference somewhere. I store this in conversation state, and resave it every turn (you could probably do this in onMembersAdded but I chose onMessage when I did it so it resaves the conversation reference every turn). You'll need to import const { TurnContext } = require('botbuilder') for this.
// In your onMessage handler
const conversationData = await this.dialogState.get(context, {});
conversationData.conversationReference = TurnContext.getConversationReference(context.activity);
await this.conversationState.saveChanges(context);
You'll need this for the proactive message. When it's time to send the API, you'll need to send a message (well technically that's optional but recommended), get the conversation data if you haven't gotten it already, and call the API function without awaiting it. If your API is always coming back around 15 seconds, you may just want a standard message (e.g. "One moment while I look that up for you"), but if it's going to be longer I would recommend setting the expectation with the user (e.g. "I will look that up for you. It may take up to a minute to get an answer. In the meantime you can continue to ask me questions."). You should be saving user/conversation state further down in your turn handler. Since you are not awaiting the call, the turn will end and the bot will not hang up or send the timeout message. Here is what I did with a simulation I created.
await dc.context.sendActivity(`OK, I'll simulate a long-running API call and send a proactive message when it's done.`);
const conversationData = await this.dialogState.get(context, {});
apiSimulation.longRunningRequest(conversationData.conversationReference);
// That is in a switch statement. At the end of my turn handler I save state
await this.conversationState.saveChanges(context);
await this.userState.saveChanges(context);
And then the function that I called. As this was just a simulation, I have just awaited a promise, but obviously you would call and await your API(s). Once that comes back you will create a new BotFrameworkAdapter to send the proactive message back to the user.
const request = require('request-promise-native');
const { BotFrameworkAdapter } = require('botbuilder');

class apiSimulation {
    static async longRunningRequest(conversationReference) {
        console.log('Starting simulated API');
        await new Promise(resolve => setTimeout(resolve, 30000));
        console.log('Simulated API complete');
        // Set up the adapter and send the message
        try {
            const adapter = new BotFrameworkAdapter({
                appId: process.env.microsoftAppID,
                appPassword: process.env.microsoftAppPassword,
                channelService: process.env.ChannelService,
                openIdMetadata: process.env.BotOpenIdMetadata
            });
            await adapter.continueConversation(conversationReference, async turnContext => {
                await turnContext.sendActivity('This message was sent after a simulated long-running API');
            });
        } catch (error) {
            //console.log('Bad Request. Please ensure your message contains the conversation reference and message text.');
            console.log(error);
        }
    }
}

module.exports.apiSimulation = apiSimulation;

Start and Restart async function with setIntervalAsync returns TypeError cannot convert undefined or null to object

I've been trying to implement a web scraper that will use data pulled from MongoDB to create an array of URLs to scrape periodically with Puppeteer. I have been trying to get my scraper function to run periodically by using setIntervalAsync.
Here is my code right now that throws "UnhandledPromiseRejectionWarning: TypeError: Cannot convert undefined or null to object at Function.values..."
puppeteer.js
async function scrape(array) {
    // initialize for loop here
    let port = '9052'
    if (localStorage.getItem('scrapeRunning') == 'restart') {
        clearIntervalAsync(scrape)
        localStorage.setItem('scrapeRunning') == 'go'
    } else if (localStorage.getItem('scrapeRunning') != 'restart') {
        /// Puppeteer scrapes urls in array here ///
    }
}
server.js
app.post('/submit-form', [
    // Form Validation Here //
], (req, res) => {
    async function submitForm(amazonUrl, desiredPrice, email) {
        // Connect to MongoDB and update entry or create new entry
        // with post request data
        createMongo.newConnectToMongo(amazonUrl, desiredPrice, email)
            .then(() => {
                // Set local variable that will alert scraper to clearIntervalAsync ///
                localStorage.setItem('scrapeRunning', 'restart');
                // before pulling the new updated mongoDB data ...
                return createMongo.pullMongoArray();
            })
            .then((result) => {
                // and start scraping again with the new data
                puppeteer.scrape(result)
            })
    }
    submitForm(req.body.amazonUrl, req.body.desiredPrice, req.body.email);
})
createMongo.pullMongoArray()
.then((result)=>{
setIntervalAsync(puppeteer.scrape, 10000, result);
})
Currently the scraper starts as expected after the server is started and keeps 10 seconds between when the scrape ends and when it begins again. Once there is a post submit the MongoDB collection gets updated with the post data, the localStorage item is created, but the scrape function goes off the rails and throws the typeError. I am not sure what is going on and have tried multiple ways to fix this (including leaving setIntervalAsync and clearIntervalAsync inside of the post request code block) but have been unsuccessful so far. I am somewhat new to coding, and extremely inexperienced with asynchronous code, so if someone has any experience with this kind of issue and could shed some light on what is happening I would truly appreciate it!
I only think that it has something to do with async because, no matter what I have tried, it also seems to run the pullMongoArray function before the newConnectToMongo function is complete.
After a few more hours of searching around I think I may have found a workable solution. I've completely eliminated the use of localStorage and removed the if and else if statements from within the scrape function. I have also made a global timer variable and added control functions to this file.
puppeteer.js
let timer;

function start(result) {
    timer = setIntervalAsync(scrape, 4000, result)
}

function stop() {
    clearIntervalAsync(timer)
}

async function scrape(array) {
    // initialize for loop here
    let port = '9052'
    // Puppeteer scrapes urls from array here //
}
I've altered my server code a little bit so that at server start it gets the results from MongoDB and uses them in the scraper start function. A post request also calls the stop function before updating MongoDB, pulling a new result from MongoDB, and then calling the scraper start function again.
server.js
createMongo.pullMongoArray()
.then((result)=>{
puppeteer.start(result);
})
app.post('/submit-form', [
    // Form Validation In Here //
], (req, res) => {
    async function submitForm(amazonUrl, desiredPrice, email) {
        // Stop the current instance of the scrape function
        puppeteer.stop();
        // Connect to MongoDB and update entry or create new entry
        // with post request data
        createMongo.newConnectToMongo(amazonUrl, desiredPrice, email)
            .then(() => {
                // before pulling the new updated mongoDB data ...
                console.log('Pulling New Array');
                return createMongo.pullMongoArray();
            })
            .then((result) => {
                // and restarting the repeating scrape function
                puppeteer.start(result);
            })
    }
    submitForm(req.body.amazonUrl, req.body.desiredPrice, req.body.email);
})

Sending thousands of fetch requests crashes the browser. Out of memory

I was tasked with transferring a large portion of data using javascript and an API from one database to another. Yes I understand that there are better ways of accomplishing this task, but I was asked to try this method.
I wrote some JavaScript that makes a GET call to an API that returns an array of data, which I then turn around and send to another API as individual POST requests.
What I have written so far seems to work fairly well, and I have been able to send over 50k individual POST requests without any errors. But I am having trouble when the number of POST requests increases past around 100k. I end up running out of memory and the browser crashes.
From what I understand so far about promises, there may be an issue where promises (or something else?) are still kept in heap memory after they are resolved, which results in running out of memory after too many requests.
I've tried 3 different methods to get all the records to POST successfully after searching for the past couple of days. This has included using Bluebird's Promise.map, as well as breaking up the array into chunks first before sending them as POST requests. Each method seems to work up until it has processed about 100k records before it crashes.
async function amGetRequest(controllerName) {
    try {
        const amURL = "http://localhost:8081/api/" + controllerName;
        const amResponse = await fetch(amURL, {
            "method": "GET",
        });
        return await amResponse.json();
    } catch (err) {
        closeModal()
        console.error(err)
    }
};

async function brmPostRequest(controllerName, body) {
    const brmURL = urlBuilderBRM(controllerName);
    const headers = headerBuilderBRM();
    try {
        await fetch(brmURL, {
            "method": "POST",
            "headers": headers,
            "body": JSON.stringify(body)
        });
    }
    catch (error) {
        closeModal()
        console.error(error);
    };
};
//V1.0 Send one by one and resolve all promises at the end.
const amResult = await amGetRequest(controllerName); //(returns an array of ~245,000 records)
let promiseArray = [];
for (let i = 0; i < amResult.length; i++) {
    promiseArray.push(await brmPostRequest(controllerName, amResult[i]));
};
const postResults = await Promise.all(promiseArray);

//V2.0 Use bluebirds Promise.map with concurrency set to 100
const amResult = await amGetRequest(controllerName); //(returns an array of ~245,000 records)
const postResults = Promise.map(amResult, async data => {
    await brmPostRequest(controllerName, data);
    return Promise.resolve();
}, { concurrency: 100 });

//V3.0 Chunk array into max 1000 records and resolve 1000 promises before looping to the next 1000 records
const amResult = await amGetRequest(controllerName); //(returns an array of ~245,000 records)
const numPasses = Math.ceil(amResult.length / 1000);
for (let i = 0; i <= numPasses; i++) {
    let subset = amResult.splice(0, 1000);
    let promises = subset.map(async (record) => {
        await brmPostRequest(controllerName, record);
    });
    await Promise.all(promises);
    subset.length = 0; //clear out temp array before looping again
};
Is there something that I am missing about getting these promises cleared out of memory after they have been resolved?
Or perhaps a better method of accomplishing this task?
Edit: Disclaimer - I'm still fairly new to JS and still learning.
"Well-l-l-l ... you're gonna need to put a throttle on this thing!"
Without (pardon me ...) attempting to dive too deeply into your code, "no matter how many records you need to transfer, you need to control the number of requests that the browser attempts to do at any one time."
What's probably happening right now is that you're stacking up hundreds or thousands of "promised" requests in local memory – but, how many requests can the browser actually transmit at once? That should govern the number of requests that the browser actually attempts to do. As each reply is returned, your software then decides whether to start another request and if so for which record.
Conceptually, you have so-many "worker bees," according to the number of actual network requests your browser can simultaneously do. Your software never attempts to launch more simultaneous requests than that: it simply launches one new request as each one request is completed. Each request, upon completion, triggers code that decides to launch the next one.
So – you never are "sending thousands of fetch requests." You're probably sending only a handful at a time, even though, in this you-controlled manner, "thousands of requests do eventually get sent."
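A minimal sketch of that throttling idea, reusing the question's brmPostRequest (the pool size of 10 is an arbitrary assumption):

// Process the records with a fixed number of concurrent "worker bees".
async function postAllThrottled(controllerName, records, concurrency = 10) {
    let nextIndex = 0;

    // Each worker repeatedly takes the next unprocessed record until none remain.
    async function worker() {
        while (nextIndex < records.length) {
            const record = records[nextIndex++];
            await brmPostRequest(controllerName, record);
        }
    }

    // Launch the workers and wait for all of them to drain the list.
    await Promise.all(Array.from({ length: concurrency }, () => worker()));
}

// const amResult = await amGetRequest(controllerName);
// await postAllThrottled(controllerName, amResult, 10);

Because each worker only starts its next request after its previous one finishes, at most "concurrency" requests are in flight at any moment, and no large promise array ever accumulates in memory.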
As you are not interested in the values delivered by brmPostRequest(), there's no point mapping the original array; neither the promises nor the results need to be accumulated.
Not doing so will save memory and may allow progress beyond the 100k sticking point.
async function foo() {
    const amResult = await amGetRequest(controllerName);
    let counts = { 'successes': 0, 'errors': 0 };
    for (let i = 0; i < amResult.length; i++) {
        try {
            await brmPostRequest(controllerName, amResult[i]);
            counts.successes += 1;
        } catch (err) {
            counts.errors += 1;
        }
    };
    console.log(counts);
}

nodejs express stream from array

I'm building an app in which I need to stream data to the client; my data is simply an array of objects.
This is the for loop which builds the array:
for (let i = 0; i < files.length; i++) {
    try {
        let file = files[i]
        var musicPath = `${baseDir}/${file}`
        let meta = await getMusicMeta(musicPath)
        musics.push(meta)
    } catch (err) {
        console.error(err)
    }
}
Right now I wait for the loop to finish its work and then send the whole musics array to the client. I want to use a stream to send the items of the musics array to the client one by one instead of waiting for the loop to finish.
Use scramjet and send the stream straight to the response:
const { DataStream } = require("scramjet");
// ...
response.writeHead(200);
DataStream.fromArray(files)
    // all the magic happens below - flow control
    .map(file => getMusicMeta(`${baseDir}/${file}`))
    .toJSONArray()
    .pipe(response);
Scramjet will make use of your flow control and most importantly - it'll get the result out faster than any other streaming framework.
Edit: I wrote a couple lines of code to make this use case easier in scramjet. :)
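For context, a rough sketch of how this might sit inside an Express route (the /musics path is made up, and baseDir, files, and getMusicMeta are assumed to exist as in the question):

const express = require("express");
const { DataStream } = require("scramjet");

const app = express();

app.get("/musics", (req, res) => {
    res.writeHead(200, { "Content-Type": "application/json" });
    DataStream.fromArray(files)
        // getMusicMeta resolves per file; each result is streamed as soon as it's ready
        .map(file => getMusicMeta(`${baseDir}/${file}`))
        .toJSONArray()
        .pipe(res);
});

app.listen(3000);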
