How to call a function after it's already defined in Puppeteer - javascript

I have a function that grabs text data from this site and stores it in a JSON file. This function only crawls the first page of the website, but I'd like to click through (or "goto") each URL (there are 10 pages) and grab the text data from each page:
await page.goto('http://quotes.toscrape.com/page/1/')
//grab quote data
const quotes = await page.evaluate(() => {
const grabFromDiv = (div, selector) => Array.from(div
.querySelectorAll(selector), (el => el.innerText.trim()))
Currently, it just navigates to page 1, grabs the data, stores it, and then exits. Is there a way to call the quotes function over and over until I've navigated through all 10 pages and collected all the data?

I would just do the same for each page.
If you know the number of pages, then just do:
var quotes = ''
for (const url of pageUrls) { // pageUrls: the known list of page URLs
  await page.goto(url)
  quotes += await page.evaluate(myPageFunction)
}
If you don't know the number of pages, you need to get that information from the actual page.
Then, in the evaluate function, just search for the next page:
const myPageFunction = function () {
  // get your data
  const nextPage = document.querySelector('.next a')?.href
  return { data: yourData, nextPage: nextPage }
}
You would then have something like:
let nextPage = 'http://quotes.toscrape.com/page/1/'
while (nextPage) {
  await page.goto(nextPage)
  const result = await page.evaluate(myPageFunction)
  quotes += result.data
  nextPage = result.nextPage
}
The code is just an example and won't work as is.
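Pulling those pieces together, a runnable sketch of the whole loop might look like the code below. It is only a sketch: the .quote / .text selectors are assumptions about the quotes.toscrape.com markup, while .next a is the pager link used above.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let quotes = [];
  let nextPage = 'http://quotes.toscrape.com/page/1/';

  // Keep following the "next" link until it no longer exists (after the last page).
  while (nextPage) {
    await page.goto(nextPage);
    const result = await page.evaluate(() => {
      // '.quote .text' is an assumed selector for the quote text on each page.
      const data = Array.from(document.querySelectorAll('.quote .text'),
        (el) => el.innerText.trim());
      const next = document.querySelector('.next a');
      return { data, nextPage: next ? next.href : null };
    });
    quotes = quotes.concat(result.data);
    nextPage = result.nextPage;
  }

  console.log(`Collected ${quotes.length} quotes`);
  await browser.close();
})();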
Best!

Related

How do I get whole html from Apify Cheerio crawler?

I want to get the whole html not just text.
Apify.main(async () => {
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({
url: //adress,
uniqueKey: makeid(100)
});
const handlePageFunction = async ({ request, $ }) => {
var content_to = $('.class')
};
// Set up the crawler, passing a single options object as an argument.
const crawler = new Apify.CheerioCrawler({
requestQueue,
handlePageFunction,
});
await crawler.run();
});
When I try this, the crawler returns a complex object. I know I can extract the text from the content_to variable using .text(), but I need the whole HTML including the tags. What should I do?
If I understand you correctly, you could just use .html() instead of .text(). This way you will get the inner HTML of the element instead of its inner text.
Another thing to mention: you could also add body to the handlePageFunction argument object:
const handlePageFunction = async ({ request, body, $ }) => {
body would have the whole raw html of the page.
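For completeness, a minimal sketch combining both suggestions might look like the following; the URL and the .class selector are placeholders, just as in the question.
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' }); // placeholder URL

    const handlePageFunction = async ({ request, body, $ }) => {
        const innerHtml = $('.class').html();  // inner HTML of the matched element
        const outerHtml = $.html($('.class')); // the element including its own tags
        console.log(innerHtml, outerHtml);
        console.log(body.length);              // 'body' is the raw HTML of the whole page
    };

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction,
    });

    await crawler.run();
});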

Pulling Articles from Large JSON Response

I'm trying to code something which tracks the Ontario Immigrant Nominee Program Updates page for updates and then sends an email alert if there's a new article. I've done this in PHP but I wanted to try and recreate it in JS because I've been learning JS for the last few weeks.
The OINP has a public API, but the entire body of the webpage is stored in the JSON response (you can see this here: https://api.ontario.ca/api/drupal/page%2F2020-ontario-immigrant-nominee-program-updates?fields=body)
Looking through the safe_value - the common trend is that the Date / Title is always between <h3> tags. What I did with PHP was create a function that stored the text between <h3> into a variable called Date / Title. Then - to store the article body text I just grabbed all the text between </h3> and </p><h3> (basically everything after the title, until the beginning of the next title), stored it in a 'bodytext' variable and then iterated through all occurrences.
I'm stumped figuring out how to do this in JS.
So far - trying to keep it simple, I literally have:
const fetch = require("node-fetch");
fetch(
"https://api.ontario.ca/api/drupal/page%2F2020-ontario-immigrant-nominee-program-updates?fields=body"
)
.then((result) => {
return result.json();
})
.then((data) => {
let websiteData = data.body.und[0].safe_value;
console.log(websiteData);
});
This outputs all of the body. Can anyone point me in the direction of a library / some tips that can help me:
Read through the entire safe_value response and break down each article (Date / Title + Article body) into an array.
I'm probably then just going to upload each article into a MongoDB and then I'll have it checked twice daily -> if there's a new article I'll send an email notif.
Any advice is appreciated!!
Thanks,
You can use a regex to get the content of the tags, e.g.
/<h3>(.*?)<\/h3>/g.exec(data.body.und[0].safe_value)[1]
returns August 26, 2020
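Since the goal is to iterate through every occurrence, the same pattern can be run in a loop with the g flag to collect all of the headings. A small sketch, using a stand-in string in place of the real safe_value:
// 'safeValue' stands in for data.body.und[0].safe_value from the fetch above.
const safeValue = '<h3>August 26, 2020</h3><p>...</p><h3>Another date</h3><p>...</p>';

const headingRegex = /<h3>(.*?)<\/h3>/g;
const headings = [];
let match;
while ((match = headingRegex.exec(safeValue)) !== null) {
  headings.push(match[1]); // the captured text between one pair of <h3> tags
}

console.log(headings); // [ 'August 26, 2020', 'Another date' ]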
With the use of some regex you can get this done pretty easily.
I wasn't sure exactly what the "date / title / content" parts were, but the code below shows how to parse the HTML.
I also changed the code to async/await. This is more of a personal preference; the code should work the same with then/catch.
(async () => {
try {
// Make request
const response = await fetch("https://api.ontario.ca/api/drupal/page%2F2020-ontario-immigrant-nominee-program-updates?fields=body");
// Parse response into json
const data = await response.json();
// Get the parsed data we need
const websiteData = data.body.und[0].safe_value;
// Split the html into separate articles (every <h2> is the start of a new article)
const articles = websiteData.split(/(?=<h2)/g);
// Get the data for each article
const articleInfo = articles.map((article) => {
// The text between the first <h3> tags is the date
const date = /<h3>(.*)<\/h3>/m.exec(article)[1];
// The text between the first <h4> tags is the title
const title = /<h4>(.*)<\/h4>/m.exec(article)[1];
// Everything between the first <p> and the last </p> is the content of the article
const content = /<p>(.*)<\/p>/m.exec(article)[1];
return {date, title, content};
});
// Show results
console.log(articleInfo);
} catch(error) {
// Show error if there are any
console.log(error);
}
})();
Without comments
(async () => {
try {
const response = await fetch("https://api.ontario.ca/api/drupal/page%2F2020-ontario-immigrant-nominee-program-updates?fields=body");
const data = await response.json();
const websiteData = data.body.und[0].safe_value;
const articles = websiteData.split(/(?=<h2)/g);
const articleInfo = articles.map((article) => {
const date = /<h3>(.*)<\/h3>/m.exec(article)[1];
const title = /<h4>(.*)<\/h4>/m.exec(article)[1];
const content = /<p>(.*)<\/p>/m.exec(article)[1];
return {date, title, content};
});
console.log(articleInfo);
} catch(error) {
console.log(error);
}
})();
I just completed creating a .NET Core worker service for this.
The value you are looking for is "metatags.description.og:updated_time.#attached.drupal_add_html_head..#value"
The idea is that if the last updated time changes, you send an email notification!
Try this in your JavaScript:
fetch(`https://api.ontario.ca/api/drupal/page%2F2021-ontario-immigrant-nominee-program-updates`)
.then((result) => {
return result.json();
})
.then((data) => {
let lastUpdated = data.metatags["og:updated_time"]["#attached"].drupal_add_html_head[0][0]["#value"];
console.log(lastUpdated);
});
I will be happy to add you to the email list for the app I just created!

How to connect loop data to pdfgeneratorapi with wix corvid?

I'm generating PDF by using https://pdfgeneratorapi.com/.
Now I can show the data one by one using this code. Can anyone give me a suggestion on how to show all the data with a loop, or in any other way?
The photos below show my template from PDF Generator API.
This is the code I'm using to generate PDF
let communicationWay1=[
{0:"dim"},
{1:"kal"}
];
let cstomerExpence1=[
{0:"dim"},
{1:"kal"}
];
let title="test";
let names="test";
let phone="test";
let email="test";
let maritalStatus="test";
let city="test";
let other="test";
const result = await wixData.query(collection)
.eq('main_user_email', $w('#mainE').text)
.find()
.then( (results) => {
if (results.totalCount>0) {
count=1;
// title=results.items[1].title;
names=results.items[0].names;
email=results.items[0].emial;
phone=results.items[0].phone;
maritalStatus=results.items[0].maritalStatus;
city=results.items[0].city;
other=results.items[0].cousterExpenses_other;
title=results.items[0].title;
communicationWay=results.items[0].communicationWay;
cstomerExpence=results.items[0].cstomerExpence;
}
if (results.totalCount>1) {
names1=results.items[1].names;
email1=results.items[1].emial;
phone1=results.items[1].phone;
maritalStatus1=results.items[1].maritalStatus;
city1=results.items[1].city;
other1=results.items[1].cousterExpenses_other;
title1=results.items[1].title;
communicationWay1=results.items[1].communicationWay;
cstomerExpence1=results.items[1].cstomerExpence;
}
} )
.catch( (err) => {
console.log(err);
} );
// Add your code for this event here:
const pdfUrl = await getPdfUrl
({title,names,email,phone,city,maritalStatus,other,communicationWay,cstomerExpence,title1,
names1,email1,phone1,city1,maritalStatus1,other1,communicationWay1,cstomerExpence1
});
if (count===0) { $w("#text21").show();}
else{ $w("#downloadButton").link=wixLocation.to(pdfUrl);}
The code below is my backend/JSW code.
I also want to open the PDF in a new tab. I know "_blank" can be used to open a new tab, but I'm not sure how to combine it with the URL.
import PDFGeneratorAPI from 'pdf-generator-api'
const apiKey = 'MYKEY';
const apiSecret = 'MYAPISECRET';
const baseUrl = 'https://us1.pdfgeneratorapi.com/api/v3/';
const workspace = "HELLO#gmail.com";
const templateID = "MYTEMPLATEID";
let Client = new PDFGeneratorAPI(apiKey, apiSecret)
Client.setBaseUrl(baseUrl)
Client.setWorkspace(workspace)
export async function getPdfUrl(data) {
const {response} = await Client.output(templateID, data, undefined, undefined, {output: 'url'})
return response
}
Just put it in a while loop with a boolean condition.
You can create a variable, for example allShowed, and set its value to false. After that, create another variable, for example numberOfDataToShow, and set it to the number of elements you want to display. Then create a counter, countShowed, initialized to 0.
Now create a while loop: while allShowed is false, keep looping (and adding data).
Every time a piece of your data is shown, increment countShowed. When countShowed equals numberOfDataToShow, set allShowed to true. The loop will stop and all your data will have been shown.
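A literal sketch of that counter pattern, using hypothetical stand-ins for the data and the "show" step:
// Hypothetical stand-ins: a sample array and a console.log-based "show" step.
const dataItems = ['dim', 'kal', 'test'];
const showData = (item) => console.log(item);

let allShowed = false;
const numberOfDataToShow = dataItems.length;
let countShowed = 0;

// Loop until every item has been shown.
while (!allShowed) {
  showData(dataItems[countShowed]);
  countShowed += 1;
  if (countShowed === numberOfDataToShow) {
    allShowed = true;
  }
}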
You would need to use the Container or Table component in PDF Generator API to iterate over a list of items. As #JustCallMeA said you need to send an array of items. PDF Generator API now has an official Wix Velo (previously Corvid) tutorial with a demo page: https://support.pdfgeneratorapi.com/en/article/how-to-integrate-with-wix-velo-13s8135
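For example, instead of the numbered variables above, the query results could be collapsed into a single array and passed to the template. The customers field name below is a placeholder and has to match whatever the Container/Table component in your template is bound to; the item field names are kept as they appear in the original collection code.
// Sketch only: build one array from the query results instead of numbered fields.
const customers = results.items.map((item) => ({
  title: item.title,
  names: item.names,
  email: item.emial,
  phone: item.phone,
  maritalStatus: item.maritalStatus,
  city: item.city,
  other: item.cousterExpenses_other,
  communicationWay: item.communicationWay,
  cstomerExpence: item.cstomerExpence,
}));

const pdfUrl = await getPdfUrl({ customers });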

Passing Puppeteer page as params in a function is not working as expected

Intro
The loginToLInkined function takes me to the login page and then returns the Puppeteer page, which is assigned to root so that I can keep working with it:
const root = await loginToLInkined("https://www.linkedin.com/login");
await root.goto(url);
max_page = await getMaxPage(root);
console.log("max page",max_page)
I then goto(url); url is another page I need to go to.
After that I call getMaxPage(root) with root as a param, so I can evaluate() inside that function.
Problem
const getMaxPage = async root => {
const maxPage = await root.evaluate(()=> {
return document.querySelector(
"li.artdeco-pagination__indicator:nth-last-Child(1)"
);
});
console.log(maxPage)
return parseInt(maxPage.innerText);
};
The problem is that when I console.log(maxPage) it returns undefined, and I realized that passing root as a param isn't working the way I expected.
What am I doing wrong, and how is it properly done?
Note: when I call root.evaluate directly, without wrapping it in a separate function and passing root as a param, it does return the max page for me.
The issue is in what you return from page.evaluate():
const maxPage = await root.evaluate(()=> {
return document.querySelector(
"li.artdeco-pagination__indicator:nth-last-Child(1)"
);
});
This is a DOM node, which is a complex object that cannot be serialized, and the return value must be serializable in order to be returned back to node from Chromium.
So, to fix this (and future scripts), return only what is needed and what can be passed through JSON.stringify without error. As pguardiario correctly noted in the comment, in this case it's enough to return the innerText from that node:
const maxPage = await root.evaluate(()=> {
return document.querySelector("li.artdeco-pagination__indicator:nth-last-Child(1)").innerText;
});
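As a side note (not part of the original answer), Puppeteer's page.$eval helper does the same thing in one call:
const getMaxPage = async (root) => {
  // $eval runs the callback inside the page and returns only the serializable
  // innerText string back to Node.
  const maxPage = await root.$eval(
    "li.artdeco-pagination__indicator:nth-last-Child(1)",
    (el) => el.innerText
  );
  return parseInt(maxPage, 10);
};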

How to use a loop to request a URL, scrape data, grab a new URL from that page and move on to the next page X times

I am looking to:
Open a known URL (www.source.com/1 below)
scrape all URLs on that page (e.g. www.urllookingfor.com/1 to .../10) and log to console
scrape a new URL (e.g. www.source.com/2) from that page
load the next page and repeat the process X number of times
Imagine a list of 50 URLs divided across 5 pages, where you need to click the next button to move on to the next page.
The first two steps work fine, but I think the issue is that nextLink isn't updated before the loop runs again. Essentially, step four gets repeated with the original URL instead of the 'new' URL. The steps above are inside an if block.
I've tried using setTimeout and async...await, because I think the issue is that it doesn't have time to load the 'new' URL before the next function completes, but this did not work.
If I add console.log(URL) inside the if block, it prints the original URL. But when I add console.log outside the if block it prints the updated URL, which makes me think nextLink isn't updated until after the if block.
I've also tried repeating the functions over and over (essentially a repeated if statement), but this also does not seem to update nextLink before the next function runs, which goes against the above.
const request = require('request')
const cheerio = require('cheerio')

let nextLink = 'www.source.com/1'
// this pulls the source page and scrapes the required URLs
const getDatafromPage = () => {
request(nextLink, (error, response, html) => {
if((!error) && (response.statusCode == 200))
{
let $ = cheerio.load(html);
$('.class1').each((i, el) => {
let link = $(el).find('.class2').attr('href');
console.log(`${link}`);
})
}
})
}
//this gets the next URL
const getNextLink = () => {
request(nextLink, (error, response, html) => {
if((!error) && (response.statusCode == 200))
{
let $ = cheerio.load(html);
nextLink = $('.class3').attr('href');
}
})
}
for (let i = 0; i <= 4; i++) {
getDatafromPage();
getNextLink();
}
console.log(nextLink)
Expected results (all 50 URLs from the pages and ends by logging the last source URL)
www.urllookingfor.com/1
...
www.urllookingfor.com/50
www.source.com/5
Actual results (repeats the first page, but then logs the next page at the end):
www.urllookingfor.com/1
...
www.urllookingfor.com/10
www.urllookingfor.com/1
...
www.urllookingfor.com/10
www.source.com/2
Here's more or less what it might look like when I do it:
const doPage = async ($) => {
// do stuff here
}
;(async function(){
let response = await request(url)
let $ = cheerio.load(response)
await doPage($)
let a
// keep following next links
while(a = $('[rel=next]')[0]){
url = new URL($(a).attr('href'), url).href
response = await request(url)
$ = cheerio.load(response)
await doPage($)
}
})()
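One detail worth noting: this sketch assumes a promise-returning request (for example request-promise, which resolves with the response body and therefore fits the cheerio.load(response) calls above). With the callback-based request package from the question, each call returns immediately, before the response arrives, which is exactly why the original for loop runs getNextLink five times against a nextLink that has not been updated yet. Awaiting each request serializes the page loads, so the next link is known before the next iteration starts.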
