Scraping filmography from a Wikipedia page using Node.js and Puppeteer

I'm trying to get the filmography from Wikipedia. Using Puppeteer, I select the filmography section in the element inspector and copy its XPath. However, I'm not getting any film data.
scrapers.js
const puppeteer = require("puppeteer")
const scrapeProduct = async (url) => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(url)
  const [el] = await page.$x(`//*[@id="mw-content-text"]/div[1]/div[8]/div`)
  console.log("el=>", el)
  browser.close()
}
scrapeProduct("https://en.wikipedia.org/wiki/Werner_Herzog")
Here's what I'm getting in console.log(el):
https://hastebin.com/usozakisen.yaml

el is an ElementHandle, not the content itself. You can try getting the innerText of that handle:
console.log(await el.evaluate(el => el.innerText));
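As a hedged sketch of that suggestion: page.$x returns an empty array when the XPath matches nothing, so el can be undefined. The helper below guards against that before reading innerText. The name textOf is hypothetical (it is not part of Puppeteer), and the helper only assumes that the handle exposes evaluate, as the ElementHandle above does.

```javascript
// Hypothetical helper: read the visible text of an ElementHandle, or return
// null when the XPath matched nothing and the handle is undefined.
async function textOf(handle) {
  if (!handle) return null;
  // evaluate runs the callback against the underlying DOM node
  return handle.evaluate(node => node.innerText);
}

// Intended use inside scrapeProduct (not run here):
//   const [el] = await page.$x(xpath);
//   console.log(await textOf(el));
```

Returning null instead of throwing makes it easier to tell "the XPath is wrong" apart from "the element has no text".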

Related

I get this error when running my index.js file... throw new Error('Execution context was destroyed, most likely because of a navigation.');

I've provided the code below. Can you tell me why I would get this error? I am trying to scrape some information from one website to put on a website I am creating, and I already have permission to do so. The information I am trying to scrape is the name, time, location, and description of each event. I saw this tutorial on YouTube, but for some reason I get this error when running mine.
async function scrapeProduct(url){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.goto(url);
  const [el] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/h3/a');
  const txt = await el.getProperty('textContent')
  const rawTxt = await txt.jsonValue();
  const [el1] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/time[1]/span[2]');
  const txt1 = await el1.getProperty('textContent')
  const rawTxt1 = await txt1.jsonValue();
  console.log({rawTxt, rawTxt1});
  browser.close();
}
scrapeProduct('https://events.ucf.edu');
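This question has no answer in the thread. One plausible cause, offered as an assumption rather than a confirmed diagnosis, is that page.goto(url) is not awaited, so page.$x can run while the navigation is still in progress. A minimal sketch of the awaited version follows. The name scrapeFirstEvent is hypothetical, and the XPath is the one from the question.

```javascript
// Hypothetical helper: takes an already-created Puppeteer page so the
// navigation step can be shown in isolation.
async function scrapeFirstEvent(page, url) {
  // Wait for the navigation to finish before touching the DOM
  await page.goto(url);

  const [el] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/h3/a');
  if (!el) return null; // XPath matched nothing

  const prop = await el.getProperty('textContent');
  return prop.jsonValue();
}

// Intended use (not run here):
//   const page = await browser.newPage();
//   console.log(await scrapeFirstEvent(page, 'https://events.ucf.edu'));
```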

How to set cookies in a variable and use it again: PUPPETEER

I'm trying to store ALL browser cookies in a variable and then use them again later.
My attempt:
const puppeteer = require('puppeteer')
const fs = require('fs').promises;
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
  });
  const page = await browser.newPage();
  await page.goto('https://www.google.com/');
Below is my code to get the cookies. It used to work, but now it prints an error message.
  const client = await page.target().createCDPSession();
  const all_browser_cookies = (await client.send('Network.getAllCookies')).cookies;
  const current_url_cookies = await page.cookies();
  var third_party_cookies = all_browser_cookies.filter(cookie => cookie.domain !== current_url_cookies[0].domain);
And below is the second page (which will use the cookies):
  (async () => {
    const browser2 = await puppeteer.launch({
    });
    const url = 'https://www.google.com/';
    const page2 = await browser2.newPage();
    try{
      await page2.setCookie(...third_party_cookies);
      await page2.goto(url);
    }catch(e){
      console.log(e);
    }
    await browser2.close()
  })();
})();
Until yesterday it worked, but today this error message appears:
Error: Protocol error (Network.setCookies): Invalid parameters Failed to deserialize params.cookies.expires - BINDINGS: double value expected at position 662891
Does anyone know what it is?

OK, the solution is here: https://github.com/puppeteer/puppeteer/issues/7029
You are saving an array of cookies and page.setCookie only works with one cookie. You have to iterate through your array like this:
for (let i = 0; i < third_party_cookies.length; i++) {
  await page2.setCookie(third_party_cookies[i]);
}

@julio-codesal's suggestion is OK, but according to the Puppeteer documentation it is fine to call page.setCookie with an array, as long as you spread it, e.g.:
await page.setCookie(...arr)
source: https://devdocs.io/puppeteer/index#pagesetcookiecookies
I'm not exactly sure what causes it, but the error the OP has is something else.
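Neither answer pins down the cause. The error text does say that some cookie's expires value is not a double, so one defensive sketch, assuming that is the whole problem, is to drop expires whenever it is not a finite number before calling page.setCookie. The helpers sanitizeCookie and sanitizeCookies are hypothetical names, not Puppeteer APIs.

```javascript
// Hypothetical helper: return a copy of the cookie, keeping `expires` only
// when it is a finite number. This assumes a non-numeric `expires` is what
// triggers the "double value expected" error; the thread does not confirm it.
function sanitizeCookie(cookie) {
  const { expires, ...rest } = cookie;
  return Number.isFinite(expires) ? { ...rest, expires } : rest;
}

// Apply the same cleanup to a whole array without mutating the input
function sanitizeCookies(cookies) {
  return cookies.map(sanitizeCookie);
}

// Intended use with the variables from the question (not run here):
//   await page2.setCookie(...sanitizeCookies(third_party_cookies));
```

A cookie that loses its expires field is set without an explicit expiry, so this trades persistence for a request that the protocol can deserialize.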

Take screenshots of different elements with specific names in Puppeteer

I am trying to take screenshots of each section in a landing page which may contain multiple sections. I was able to do that effectively in "Round1", which I commented out.
My goal is to learn how to write leaner/cleaner code, so I made another attempt, "Round2".
This version does take a screenshot, but only of section 3, and with the file name JSHandle@node.png. I am definitely doing something wrong.
Round1 (works perfectly)
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.somelandingpage.com');
  // const elOne = await page.$('.section-one');
  // await elOne.screenshot({path: './public/SectionOne.png'})
  // takes a screenshot SectionOne.png
  // const elTwo = await page.$('.section-two')
  // await elTwo.screenshot({path: './public/SectionTwo.png'})
  // takes a screenshot SectionTwo.png
  // const elThree = await page.$('.section-three')
  // await elThree.screenshot({path: './public/SectionThree.png'})
  // takes a screenshot SectionThree.png
Round2
I created an array that holds all the variables and tried to loop through them.
  const elOne = await page.$('.section-one');
  const elTwo = await page.$('.section-two')
  const elThree = await page.$('.section-three')
  let lpElements = [elOne, elTwo, elThree];
  for(var i=0; i<lpElements.length; i++){
    await lpElements[i].screenshot({path: './public/'+lpElements[i] + '.png'})
  }
  await browser.close();
})();
This takes a screenshot of section-three only, but with the wrong file name (JSHandle@node.png). There are no error messages on the console.
How can I reproduce Round1 by modifying the Round2 code?

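Your array holds only Puppeteer element handles, and string concatenation calls .toString() on them. Every handle stringifies to JSHandle@node, so all three screenshots are written to the same file and only the last one (section three) survives.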
A clean way to do this is to use an array of objects, each of which has a selector and its name. Then, when you run your loop, you have access to both name and selector.
const puppeteer = require('puppeteer');
const content = `
  <div class="section-one">foo</div>
  <div class="section-two">bar</div>
  <div class="section-three">baz</div>
`;
const elementsToScreenshot = [
  {selector: '.section-one', name: 'SectionOne'},
  {selector: '.section-two', name: 'SectionTwo'},
  {selector: '.section-three', name: 'SectionThree'},
];
const getPath = name => `./public/${name}.png`;
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(content);
  for (const {selector, name} of elementsToScreenshot) {
    const el = await page.$(selector);
    await el.screenshot({path: getPath(name)});
  }
})()
  .catch(err => console.error(err))
  .finally(async () => await browser.close())
;
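To see why the Round2 file names collide without launching a browser, here is a small sketch. FakeHandle is a hypothetical stand-in for a Puppeteer element handle; it only mimics the string form reported in the question (JSHandle@node) and has no screenshot logic.

```javascript
// FakeHandle is a hypothetical stand-in for a Puppeteer element handle.
// It reproduces only the string form seen in the question's file name.
class FakeHandle {
  toString() {
    return 'JSHandle@node';
  }
}

const handles = [new FakeHandle(), new FakeHandle(), new FakeHandle()];

// Round2 approach: concatenation calls toString(), so every path is identical
const round2Paths = handles.map(h => './public/' + h + '.png');

// Named approach: the file name comes from data you control
const names = ['SectionOne', 'SectionTwo', 'SectionThree'];
const namedPaths = names.map(name => `./public/${name}.png`);

console.log(new Set(round2Paths).size); // 1 distinct path for 3 handles
console.log(new Set(namedPaths).size);  // 3 distinct paths
```

Because the three Round2 paths are equal, each screenshot overwrites the previous one, which matches the single file the question describes.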

Scrape multiple websites using Puppeteer

I am trying to scrape just two elements, but from more than one website (in this case the PS Store). I'm also trying to do it in the simplest way possible. Since I'm a rookie in JS, please be gentle ;) My script is below. I tried to make it work with a for loop, but without success (it still only gets the first website from the array). Thanks a lot for any kind of help.
const puppeteer = require("puppeteer");
async function scrapeProduct(url) {
  const urls = [
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
  ];
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  for (i = 0; i < urls.length; i++) {
    const url = urls[i];
    const promise = page.waitForNavigation({ waitUntil: "networkidle" });
    await page.goto(`${url}`);
    await promise;
  }
  const [el] = await page.$x(
    "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
  );
  const txt = await el.getProperty("textContent");
  const title = await txt.jsonValue();
  const [el2] = await page.$x(
    "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
  );
  const txt2 = await el2.getProperty("textContent");
  const price = await txt2.jsonValue();
  console.log({ title, price });
  browser.close();
}
scrapeProduct();

In general, your code is quite okay. A few things should be corrected, though:
const puppeteer = require("puppeteer");
async function scrapeProduct(url) {
  const urls = [
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
  ];
  const browser = await puppeteer.launch({
    headless: false
  });
  // declare the loop counter so it does not leak into the global scope
  for (let i = 0; i < urls.length; i++) {
    const page = await browser.newPage();
    const url = urls[i];
    const promise = page.waitForNavigation({
      waitUntil: "networkidle2"
    });
    await page.goto(`${url}`);
    await promise;
    // look up the elements inside the loop, once per page
    const [el] = await page.$x(
      "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
    );
    const txt = await el.getProperty("textContent");
    const title = await txt.jsonValue();
    const [el2] = await page.$x(
      "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
    );
    const txt2 = await el2.getProperty("textContent");
    const price = await txt2.jsonValue();
    console.log({
      title,
      price
    });
  }
  browser.close();
}
scrapeProduct();
You open each web page inside the loop, which is correct, but then look for the elements outside of it. The lookups should happen inside the loop, once per page.
For debugging, I suggest using { headless: false }. This lets you see what actually happens in the browser.
I'm not sure which version of Puppeteer you are using, but the latest version from npm has no waitUntil value called networkidle. Use networkidle0 or networkidle2 instead.
You are locating the elements via XPath (/html/body/div...). This might be subjective, but I think standard JS/CSS selectors are more readable: body > div .... But, well, if it works...
The code above yields the following in my case:
{ title: 'Days Gone™', price: '289,00 zl' }
{ title: 'Ghost of Tsushima', price: '289,00 zl' }
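The "do the work inside the loop" point can be sketched independently of the PS Store selectors. In the sketch below, scrapeAll and scrapeOne are hypothetical names: scrapeOne stands for whatever per-page logic you use (for example the goto plus the two $x lookups above) and is passed in so that the loop itself stays small.

```javascript
// Hypothetical helper: visit the URLs one after another and collect one
// result per URL. `scrapeOne` is supplied by the caller and is expected to
// return an object such as { title, price }.
async function scrapeAll(urls, scrapeOne) {
  const results = [];
  for (const url of urls) {
    try {
      // The per-page work runs inside the loop, once for every URL
      results.push({ url, ...(await scrapeOne(url)) });
    } catch (err) {
      // Record the failure and continue with the remaining URLs
      results.push({ url, error: err.message });
    }
  }
  return results;
}

// Intended use (not run here):
//   const results = await scrapeAll(urls, async url => {
//     const page = await browser.newPage();
//     await page.goto(url, { waitUntil: 'networkidle2' });
//     /* ...look up title and price... */
//     return { title, price };
//   });
```

Catching errors per URL is a design choice for this sketch: one broken selector does not discard the results already collected.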

Puppeteer - Link inside an iFrame

I have to get the advertising link below the bullet points of this page.
I am trying with Puppeteer, but I am having trouble because the ad is in an iframe!
I can successfully get what I need using the Chrome console:
document.querySelector('#adContainer a').href
Puppeteer:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.setViewport({width: 1440, height: 1000})
  await page.goto('https://www.amazon.co.uk/dp/B07DDDB34D', {waitUntil: 'networkidle2'})
  await page.waitFor(2500);
  const elementHandle = await page.$eval('#adContainer a', el => el.href);
  console.log(elementHandle);
  await page.screenshot({path: 'example.png', fullPage: false});
  await browser.close();
})();
Error: Error: failed to find element matching selector "#adContainer a"
EDIT:
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.setViewport({width: 1440, height: 1000})
await page.goto('https://www.amazon.co.uk/dp/B07DDDB34D', {waitUntil: 'networkidle2'})
const adFrame = page.frames().find(frame => frame.name().includes('"adServer":"cs'))
const urlSelector = '#sp_hqp_shared_inner > div > a';
const url = await adFrame.$eval(urlSelector, element => element.textContent);
console.log(url);
await browser.close();
Run: https://try-puppeteer.appspot.com/

You need to do that query inside the frame itself, which can be accessed via page.frames():
const adFrame = page.frames().find(frame => frame.name().includes('<some text only appearing in name of this iFrame>'));
const urlSelector = '#sp_hqp_shared_inner > div > a';
const url = await adFrame.$eval(urlSelector, element => element.textContent);
console.log(url);
How I got the selector of that url:
Disclaimer
I haven't tried this myself. Also, I think the appropriate way to get that URL inside the iframe is something more like this:
const url = await adFrame.evaluate((sel) => {
  return document.querySelectorAll(sel)[0].href;
}, urlSelector);
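One practical detail when looking a frame up by name: Array.prototype.find returns undefined when nothing matches, and the later adFrame.$eval call then fails with a less helpful error. A minimal sketch of a guarded lookup follows. The name findFrameByName is hypothetical, and the sketch only assumes that each frame exposes name(), as used in the answer above.

```javascript
// Hypothetical helper: find the first frame whose name contains `fragment`,
// and fail with a clear message when no frame matches.
function findFrameByName(frames, fragment) {
  const frame = frames.find(f => f.name().includes(fragment));
  if (!frame) {
    throw new Error(`No frame whose name contains "${fragment}"`);
  }
  return frame;
}

// Intended use (not run here):
//   const adFrame = findFrameByName(page.frames(), '"adServer":"cs');
//   const url = await adFrame.$eval(urlSelector, element => element.href);
```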

You have to switch to the frame you want to work on every time the page loads.
async getRequiredLink() {
  return await this.page.evaluate(() => {
    let iframe = document.getElementById('frame_id'); // pass the id of your frame
    let doc = iframe.contentDocument; // changing the context to the working frame
    let ele = doc.querySelector('your-selector'); // selecting the required element
    return ele.href;
  });
}
