Having trouble scraping a particular element on a website using Puppeteer - javascript

I am trying to scrape the key features section of the page at "https://www.alpinestars.com/products/stella-missile-v2-1-piece-suit-1" using Puppeteer. However, whenever I use a selector that works in the Chrome console, the output of my code is always an empty array or object. For example, both document.querySelectorAll("#key\ features > p") and document.getElementById('key features') come back as empty arrays or objects when I output them through my code, but they work in the Chrome console.
I have attached my code below:
const puppeteer = require('puppeteer');
async function getDescripData(url) {
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto(url);
const descripFeatures = await page.evaluate(() => {
const tds = Array.from(document.getElementById('key features'))
console.log(tds)
return tds.map(td => td.innerText)
});
console.log(descripFeatures)
await browser.close();
return {
features: descripFeatures
}
}
How should I go about overcoming this issue?
Thanks in advance!

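The problem is the Array.from call: document.getElementById returns a single element (or null), not an iterable collection, so Array.from gives you an empty array and there is nothing left to map over.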
This works for me:
const puppeteer = require('puppeteer');
const url = 'https://www.alpinestars.com/products/stella-missile-v2-1-piece-suit-1';
(async () => {
const browser = await puppeteer.launch({
headless: false,
defaultViewport: null,
args: ['--start-maximized'],
devtools: true
});
const page = (await browser.pages())[0];
await page.goto(url);
const descripFeatures = await page.evaluate(() => {
const tds = document.getElementById('key features').innerText;
return tds.split('• ');
});
console.log(descripFeatures)
await browser.close();
})();
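To see why the original Array.from(document.getElementById(...)) call came back empty, here is a minimal sketch in plain JavaScript that runs without a browser. The fakeElement and fakeNodeList objects are hypothetical stand-ins for a DOM element and a NodeList, and the sample feature text is made up.

```javascript
// Hypothetical stand-in for a single DOM element: it has no length and no iterator.
const fakeElement = { id: "key features", innerText: "• Feature A\n• Feature B" };

// Array.from() on a value that is neither iterable nor array-like gives [].
const fromSingleElement = Array.from(fakeElement);

// A NodeList is array-like (indexed entries plus length), so Array.from() works on it.
const fakeNodeList = { 0: fakeElement, length: 1 };
const fromNodeList = Array.from(fakeNodeList, (el) => el.id);

// Split the element's text on the bullet, then drop the empty first entry.
const features = fakeElement.innerText
  .split("• ")
  .map((item) => item.trim())
  .filter(Boolean);

console.log(fromSingleElement, fromNodeList, features);
```

The trim and filter steps are an optional extra on top of the answer's split call; they only tidy up the leading empty string and trailing newlines.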

Related

Issue by listing specific selectors and get href

I'm new to JavaScript and Puppeteer. I'd like to get all the podcast URLs from this page: https://www.lesonunique.com/podcasts
Here's my code:
const puppeteer = require("puppeteer")
const getList = async () => {
const browser = await puppeteer.launch({ headless: false, args: [ '--proxy-server=firewall.ina.fr:81' ] })
const page = await browser.newPage()
await page.goto("https://www.lesonunique.com/podcasts")
await page.waitForSelector('div.block3-content.block3-content-modif')
const urls = await page.evaluate(() => Array.from(
document.querySelectorAll('div.block3-content.block3-content-modif'),
link => link.href
));
console.log(urls);
console.log(urls[1]);
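The result of console.log(urls) is a 3x3 matrix of null values.
What did I do wrong?
I found the correct way to do it. The matched div elements have no href of their own, so the href has to be read from the link inside each one: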
const elements = await page_podcast_list.$$eval('div.block3-content.block3-content-modif', podcasts => {
return podcasts.map(podcast => podcast.getElementsByTagName('a')[0].getAttribute('href'));
});
console.log(typeof(elements));
console.log(elements);
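The null values can be reproduced without a browser. The sketch below uses hypothetical plain objects in place of the matched div elements, and a JSON round trip as a rough stand-in for how values come back from page.evaluate; that equivalence is an assumption for illustration, not a description of Puppeteer's internals.

```javascript
// Hypothetical stand-ins for the matched <div> elements: each one contains
// a link, but has no href property of its own.
const fakeDivs = [
  { tagName: "DIV", links: [{ href: "/podcasts/a" }] },
  { tagName: "DIV", links: [{ href: "/podcasts/b" }] },
];

// Reading href directly from the div gives undefined for every entry.
const direct = fakeDivs.map((div) => div.href);

// Assumed stand-in for serialization: undefined inside an array becomes null.
const roundTripped = JSON.parse(JSON.stringify(direct));

// Reading href from the first link inside each div gives the actual URLs.
const nested = fakeDivs.map((div) => div.links[0].href);

console.log(direct, roundTripped, nested);
```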

Scrape nested page puppeteer

I would like to know how to scrape data located in nested pages. Here's an example I tried to build, but I couldn't make it work. The idea is to go to https://dev.to/, click a question, and grab its title, then go back and repeat the process for the next question.
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://dev.to/");
try {
const selectors = await page.$$(".crayons-story > a");
for (const post of selectors) {
await Promise.all([
page.waitForNavigation(),
post.click(),
page.goBack(),
]);
}
} catch (error) {
console.log(error);
} finally {
browser.close();
}
})();
When I run this code, I get
Error: Node is either not visible or not an HTMLElement
Edit: The code is missing the piece that grabs the title, but it is enough for the purpose of the question.
What is happening is that the website doesn't have that node yet when the page first opens, but Puppeteer reads the page contents immediately after navigating. You need to wait so that the site's scripts can run and inject the stories.
To wait, use the following command:
await page.waitForSelector(".crayons-story > a")
By default this makes Puppeteer wait until an element matching that selector is present in the DOM (pass { visible: true } to also require it to be visible) before it starts scraping.
So your final code should look like this:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://dev.to/");
await page.waitForSelector(".crayons-story > a")
try {
const selectors = await page.$$(".crayons-story > a");
for (const post of selectors) {
await Promise.all([
page.waitForNavigation(),
post.click(),
page.goBack(),
]);
}
} catch (error) {
console.log(error);
} finally {
browser.close();
}
})();
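The idea behind waiting for a selector can be sketched as a small polling loop in plain JavaScript. This is not how Puppeteer implements waitForSelector; waitForCondition, its options, and the simulated late injection on fakePage are all hypothetical names used for illustration.

```javascript
// Poll `check` until it returns a truthy value or the timeout elapses.
function waitForCondition(check, { timeout = 1000, interval = 20 } = {}) {
  return new Promise((resolve, reject) => {
    const startedAt = Date.now();
    const poll = () => {
      const result = check();
      if (result) {
        resolve(result);
      } else if (Date.now() - startedAt >= timeout) {
        reject(new Error(`Condition not met within ${timeout} ms`));
      } else {
        setTimeout(poll, interval);
      }
    };
    poll();
  });
}

// Simulate a page whose script injects the story link shortly after load.
const fakePage = { storyLink: null };
setTimeout(() => {
  fakePage.storyLink = "/some-story";
}, 50);

// Reading fakePage.storyLink right away would give null; waiting gives the link.
const found = waitForCondition(() => fakePage.storyLink);
found.then((link) => console.log("found:", link));
```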
The problem I'm facing here is very similar to this one:
Puppeteer Execution context was destroyed, most likely because of a navigation
The best solution I could come up with is to avoid page.goBack() and use page.goto() instead, collecting the link URLs up front so that no element references are lost when the page navigates.
Solution 1 (uses map so the pages are scraped concurrently, which is much quicker than Solution 2 below):
const puppeteer = require("puppeteer");
const SELECTOR_POSTS_LINK = ".article--post__title > a";
const SELECTOR_POST_TITLE = ".article-header--title";
async function scrape() {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://www.smashingmagazine.com/articles/");
try {
const links = await page.$$eval(SELECTOR_POSTS_LINK, (links) => links.map((link) => link.href));
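const resolver = async (link) => {
// Each link gets its own tab: concurrent goto() calls on a single page would interrupt one another.
const tab = await browser.newPage();
await tab.goto(link);
const title = await tab.$eval(SELECTOR_POST_TITLE, (el) => el.textContent);
await tab.close();
return { title };
};
const promises = links.map((link) => resolver(link));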
const articles = await Promise.all(promises);
console.log(articles);
} catch (error) {
console.log(error);
} finally {
browser.close();
}
}
scrape();
Solution 2 (uses for...of, so the links are visited one at a time, which is much slower than Solution 1):
const puppeteer = require("puppeteer");
const SELECTOR_POSTS_LINK = ".article--post__title > a";
const SELECTOR_POST_TITLE = ".article-header--title";
async function scrape() {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://www.smashingmagazine.com/articles/");
try {
const links = await page.$$eval(SELECTOR_POSTS_LINK, (links) => links.map((link) => link.href));
const articles = [];
for (const link of links) {
await page.goto(link);
const title = await page.$eval(SELECTOR_POST_TITLE, (el) => el.textContent);
articles.push({ title });
}
console.log(articles);
} catch (error) {
console.log(error);
} finally {
browser.close();
}
}
scrape();
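The difference between the two solutions can be seen without a browser. In this sketch, createFakeScraper is a hypothetical stand-in for "open the link and read its title" that also records how many scrapes overlap; the links and titles are made up.

```javascript
// Hypothetical scraper: resolves after a short delay and tracks how many
// calls are in flight at the same moment.
function createFakeScraper() {
  let inFlight = 0;
  let peak = 0;
  const scrape = (link) => {
    inFlight += 1;
    peak = Math.max(peak, inFlight);
    return new Promise((resolve) => {
      setTimeout(() => {
        inFlight -= 1;
        resolve({ title: `Title of ${link}` });
      }, 10);
    });
  };
  return { scrape, peakConcurrency: () => peak };
}

// map + Promise.all: every scrape starts immediately, so they overlap.
async function scrapeConcurrently(links, scrape) {
  return Promise.all(links.map((link) => scrape(link)));
}

// for...of + await: each scrape starts only after the previous one has finished.
async function scrapeSequentially(links, scrape) {
  const articles = [];
  for (const link of links) {
    articles.push(await scrape(link));
  }
  return articles;
}

const links = ["/a", "/b", "/c"];
const concurrent = createFakeScraper();
const sequential = createFakeScraper();

const done = Promise.all([
  scrapeConcurrently(links, concurrent.scrape),
  scrapeSequentially(links, sequential.scrape),
]).then(([fast, slow]) => {
  console.log(fast, "peak:", concurrent.peakConcurrency());
  console.log(slow, "peak:", sequential.peakConcurrency());
  return { fast, slow };
});
```

Both approaches return the same articles in the same order; only the amount of overlap differs. With a real browser, overlapping work generally needs a separate page per link.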

How do I define a variable as a scraped element using Puppeteer

const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
headless: false,
defaultViewport: null
})
const page = await browser.newPage()
await page.goto('https://www.supremenewyork.com/shop/sweatshirts/ftq968f24/lhrblx1z5')
var productName = await page.evaluate(() => {
document.querySelector('div[id="details"] > p[itemprop="model"]').innerText;
})
console.log(productName);
})()
When I run my code, which is supposed to grab the name of the Supreme item, it logs undefined instead of the name.
The callback you pass to page.evaluate does not return anything, so productName ends up undefined. Try something like this instead, which uses $eval to return the innerText of the matching element:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({
headless: false,
defaultViewport: null,
});
const page = await browser.newPage();
await page.goto(
"https://www.supremenewyork.com/shop/sweatshirts/ftq968f24/lhrblx1z5"
);
const productName = await page.$eval(
'div[id="details"] > p[itemprop="model"]',
(el) => el.innerText
);
console.log(productName);
})();
If you prefer to use evaluate it would look like:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({
headless: false,
defaultViewport: null,
});
const page = await browser.newPage();
await page.goto(
"https://www.supremenewyork.com/shop/sweatshirts/ftq968f24/lhrblx1z5"
);
const productName = await page.evaluate(() => {
// notice the return
return document.querySelector('div[id="details"] > p[itemprop="model"]').innerText;
});
console.log(productName);
})();
If innerText doesn't return anything you may instead need to use something like textContent.
Hopefully that helps!
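The missing return is a plain JavaScript issue and can be shown without Puppeteer. In this sketch, fakeDocument is a hypothetical stand-in for the page's document, and the product name is made up.

```javascript
// Hypothetical stand-in for the page's document.
const fakeDocument = {
  querySelector: () => ({ innerText: "Example Sweatshirt" }),
};

// Braces create a function body: without `return`, the result is undefined.
const withoutReturn = () => {
  fakeDocument.querySelector("p").innerText;
};

// Same body with an explicit return.
const withReturn = () => {
  return fakeDocument.querySelector("p").innerText;
};

// No braces: the value of the expression is returned implicitly.
const conciseBody = () => fakeDocument.querySelector("p").innerText;

console.log(withoutReturn(), withReturn(), conciseBody());
```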

puppeteer trouble. Or, at least, javascript trouble; you decide, please

Can someone please explain what might be going wrong here:
await page.click('some-selector-that-devtools-confirms-is-definitely-there')
let peeps = await page.evaluate(() =>
{
document.querySelector('some-selector-that-devtools-confirms-is-definitely-there')
}
);
console.log('classes: '+peeps.classList)
I've tried page.wait...., to no avail, same error
Error
TypeError: Cannot read property 'classList' of undefined
Incidentally, is there a best practice for finding out if an element has a certain css class?
You have two problems here.
First, you don't return document.querySelector('some-selector-that-devtools-confirms-is-definitely-there') from the callback, so the variable peeps will be undefined.
Second, you expect page.evaluate() to be able to return any value, but actually it can only return a serializable value, so it is not possible to return an element or NodeList from the page environment using this method.
Example that returns the class list via page.evaluate():
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://google.com", { waitUntil: "networkidle2" });
const classList = await page.evaluate(() => {
return [...document.querySelector("div").classList];
});
console.log(classList);
await browser.close();
})();
There are two main problems with your code:
Your evaluate callback is not returning anything.
You need to access the classList inside the evaluate callback.
Here's an example:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://stackoverflow.com/");
const classes = await page.evaluate(() => {
return [...document.querySelector("body").classList];
});
console.log(classes);
await browser.close();
})();
As an alternative approach, you could use getProperty("className"):
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://stackoverflow.com/");
const el = await page.$("body");
const className = await el.getProperty("className");
const classes = (await className.jsonValue()).split(" ");
console.log(classes);
await browser.close();
})();
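On the question of checking whether an element has a certain CSS class: one common approach is to call classList.contains inside the page and return the resulting boolean, since a boolean serializes cleanly. The sketch below shows the idea in plain JavaScript. FakeClassList is a hypothetical stand-in for the browser's DOMTokenList, and the JSON round trip is an assumed stand-in for the serialization that page.evaluate applies.

```javascript
// Hypothetical stand-in for DOMTokenList: iterable, with a contains() method.
class FakeClassList {
  constructor(names) {
    this.names = names;
  }
  contains(name) {
    return this.names.includes(name);
  }
  [Symbol.iterator]() {
    return this.names[Symbol.iterator]();
  }
}

const fakeElement = { classList: new FakeClassList(["btn", "active"]) };

// Assumed stand-in for what happens to a value returned from the page.
const roundTrip = (value) => JSON.parse(JSON.stringify(value));

// Returning the rich object loses its methods on the way back.
const rawResult = roundTrip(fakeElement.classList);

// Returning plain data works: an array of strings, or just the boolean you need.
const asArray = roundTrip([...fakeElement.classList]);
const hasActive = roundTrip(fakeElement.classList.contains("active"));

console.log(typeof rawResult.contains, asArray, hasActive);
```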

Puppeteer unable to use get property

Cannot read property 'getProperty' of undefined is the error that I get.
const puppeteer = require('puppeteer');
async function scrapeUdemy(url) {
try {
const browser = await puppeteer.launch({headless: false, slowMo: 250});
const page = await browser.newPage()
await page.goto(url)
const [el] = await page.$x('//*[@id="udemy"]/div[1]/div[4]/div/div/div[2]/div/div/div[1]/a/div[1]/div[1]');
const txt = await el.getProperty('textContent');
const rawTxt = await txt.jsonValue();
console.log({rawTxt});
browser.close();
}
catch(err) {
console.log(err.message);
}
}
scrapeUdemy('https://www.udemy.com/user/eren-cem-salta/')
I tried other versions, but they do not work either. It does not work with the catch block either.
The element that you want is loaded via AJAX after the initial page load, so you have to wait until it appears in the DOM:
await page.waitForSelector('[data-purpose="course-card-container"] div.udlite-heading-sm');
Then why not use the same selector to get all of the cards:
const titles = await page.evaluate(() => {
const nodes = document.querySelectorAll(
'[data-purpose="course-card-container"] div.udlite-heading-sm'
);
return [...nodes].map((node) => node.textContent);
})
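The original error comes from destructuring an empty result: when the query matches nothing, const [el] = [] leaves el undefined, and calling a method on it throws. Here is a minimal sketch of guarding against that in plain JavaScript. firstTextOrNull and the sample objects are hypothetical stand-ins, not real element handles.

```javascript
// Return the trimmed text of the first match, or null when nothing matched.
function firstTextOrNull(matches) {
  const [first] = matches; // undefined when `matches` is empty
  if (first === undefined) {
    return null;
  }
  return first.textContent.trim();
}

// Hypothetical query results: one empty, one with a single match.
const noMatches = [];
const oneMatch = [{ textContent: "  Example Course Title  " }];

console.log(firstTextOrNull(noMatches));
console.log(firstTextOrNull(oneMatch));
```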
