I'm scraping a website and I'm using Cheerio and Puppeteer.
I need to click a certain button with a given text. Here is my code:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.website.com', {waitUntil: 'networkidle0'});
const html = await page.content();
const $ = cheerio.load(html);
const items = [];
$('.grid-table-container').each((index, element) => {
items.push({
element: $($('.grid-option-name', element)[0]).contents().not($('.grid-option-name', element).children()).text(),
button: $('.grid-option-selectable>div', element)
});
});
items.forEach(item => {
if (item.element === 'Foo Bar') {
await page.click(item.button);
}
});
Here is the markup I'm trying to scrape:
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table">
<div class="grid-item">
<div class="grid-item-container">
<div class="grid-table-container">
<div class="grid-option-header">
<div class="grid-option-caption">
<div class="grid-option-name">
Foo Bar
<span>some other text</span>
</div>
</div>
</div>
<div class="grid-option-table">
<div class="grid-option">
<div class="grid-option-selectable">
<div></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="item-table"></div>
<div class="item-table"></div>
Clicking a Cheerio element doesn't work. Is there any way to do it?
You could add jquery to the page and do it there:
await page.addScriptTag({path: "jquery.js"})
await page.evaluate(() => {
// do jquery stuff here
})
There's no way to do this. Puppeteer is a totally different API from Cheerio. The two don't talk to each other or interoperate at all. The only thing you can do is snapshot an HTML string in Puppeteer and pass it to Cheerio.
Puppeteer works in the browser context on the live website, with native XPath and CSS capabilities--basically, all the power of the browser at your disposal.
On the other hand, Cheerio is a Node-based HTML parser that simulates a tiny portion of the browser environment. It offers a small subset of Puppeteer's functionality, so don't use Cheerio and Puppeteer together under most circumstances.
Taking a snapshot of the live site, then re-parsing the string into a tree Cheerio can work with is confusing, inefficient and offers few obvious advantages over using the actual thing that's right in front of you. It's like buying a bike just to carry it around.
The solution is to stick with Puppeteer ElementHandle objects:
const puppeteer = require("puppeteer"); // ^19.0.0
const html = `
<div class="item-table">
<div class="grid-item">
<div class="grid-item-container">
<div class="grid-table-container">
<div class="grid-option-header">
<div class="grid-option-caption">
<div class="grid-option-name">
Foo Bar
<span>some other text</span>
</div>
</div>
</div>
<div class="grid-option-table">
<div class="grid-option">
<div class="grid-option-selectable">
<div></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<script>
// for testing purposes
const el = document.querySelector(".grid-option-selectable > div");
el.addEventListener("click", e => e.target.textContent = "clicked");
el.style.height = el.style.width = "50px";
</script>
`;
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
for (const el of await page.$$(".grid-item-container")) {
const text = await el.$eval(
".grid-option-name",
el => el.childNodes[0].textContent
);
const sel = ".grid-option-selectable > div";
if (text.trim() === "Foo Bar") {
const selectable = await el.$(sel);
await selectable.click();
}
console.log(await el.$eval(sel, el => el.textContent)); // => clicked
}
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Or perform your click in the browser:
await page.$$eval(".grid-item-container", els => els.forEach(el => {
  const text = el.querySelector(".grid-option-name")
    .childNodes[0].textContent.trim();
  if (text === "Foo Bar") {
    // query inside this container rather than the whole document
    el.querySelector(".grid-option-selectable > div").click();
  }
}));
You might consider selecting using an XPath or iterating childNodes to examine all text nodes rather than assuming the text is at position 0, but I've left these as exercises to focus on the main point at hand.
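As a rough sketch of the "examine all text nodes" idea, the helper below collects only an element's direct text-node children. The name ownText and the mock object are illustrative assumptions so that it runs in plain Node; in the browser the same logic would operate on a real element.

```javascript
// Collect the text of an element's direct text-node children, skipping
// child elements such as the <span>. Works on anything exposing DOM-style
// `childNodes` whose entries have `nodeType` and `textContent`.
const TEXT_NODE = 3; // value of Node.TEXT_NODE in the DOM

function ownText(el) {
  return Array.from(el.childNodes)
    .filter(node => node.nodeType === TEXT_NODE)
    .map(node => node.textContent)
    .join("")
    .trim();
}

// Stand-in for <div class="grid-option-name"> (hypothetical mock, not a real DOM node).
const mockName = {
  childNodes: [
    { nodeType: 3, textContent: "\n  Foo Bar\n  " },
    { nodeType: 1, textContent: "some other text" },
    { nodeType: 3, textContent: "\n" },
  ],
};

console.log(ownText(mockName)); // => Foo Bar
```

With Puppeteer, the body of ownText would need to be written inside the $eval callback itself, since functions defined in Node are not available in the page.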
Related
I am trying to write a purchasing bot for the Supreme website as a way to teach myself JavaScript and Puppeteer. I am having trouble finding a way to click on the item that contains the text given as an argument to the function. Here is what I am trying:
'''
const puppeteer = require('puppeteer-extra');
const pluginStealth = require("puppeteer-extra-plugin-stealth");
puppeteer.use(pluginStealth());
const BASE_URL = "https://www.supremenewyork.com/shop/all";
const CHECKOUT_URL = "https://www.supremenewyork.com/checkout";
async function startBot(itemCategory) {
const browser = await puppeteer.launch({
args: ['--no-sandbox', '--disable-setuid-sandbox'],
headless: false
});
let page = await browser.newPage();
await page.goto(itemCategory);
return {page};
}
async function closeBrowser(browser) {
return browser.close();
}
async function addToCart(page, itemName) {
page.$$eval('a', as => as.find(a => a.innerText.match(itemName).click()));
}
async function checkout() {
const page = await startBot(categories['t_shirts']);
await addToCart(page, item_info['name']);
}
checkout();
'''
The error I get is:
TypeError: page.$x is not a function
Am I going about this the correct way, or is there a better method?
This is a brief overview of the HTML
<div class="inner-article">
<a style="height:150px;" href="/shop/t-shirts/kw2h4jdue/dvor2fh76"><img width="150" height="150" src="//assets.supremenewyork.com/231423/vi/u13QfMEqLmM.jpg" alt="U13qfmeqlmm">
</a>
<h1>
<a class="name-link" href="/shop/t-shirts/kw2h4jdue/dvor2fh76">Knowledge Tee
</a>
</h1>
<p>
<a class="name-link" href="/shop/t-shirts/kw2h4jdue/dvor2fh76">Black
</a>
</p>
</div>
I want to click the a element whose text is 'Knowledge Tee'.
Thanks
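page.$x should be a function, I think, so something else is going wrong there. One likely culprit: startBot returns {page}, an object wrapping the page, so checkout needs const {page} = await startBot(...) to get the page itself.
You can also do something like:
await page.$$eval('a', as => as.find(a => a.innerText.match("Knowledge Tee")).click())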
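A minimal sketch of the find-then-click idea, using mock objects so it runs in plain Node. The names findByText and makeLink are made up for illustration. The point it shows is that click() belongs on the element returned by find(), not on the result of the text match inside the callback.

```javascript
// Return the first element whose visible text contains `text`, or undefined.
function findByText(elements, text) {
  return elements.find(el => el.innerText.includes(text));
}

// Hypothetical stand-ins for the three <a> elements in the question's markup.
const clicked = [];
const makeLink = innerText => ({
  innerText,
  click: () => clicked.push(innerText.trim()),
});
const links = [makeLink(""), makeLink("Knowledge Tee\n"), makeLink("Black\n")];

const target = findByText(links, "Knowledge Tee");
if (target) target.click(); // guard: find() returns undefined when nothing matches

console.log(clicked); // => [ 'Knowledge Tee' ]
```

In Puppeteer the same logic would live inside the page.$$eval callback, with the item name passed as an extra argument after the callback, because variables from Node are not visible inside the page.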
I am fetching all anchor tag links via web scraping and want to print them with a separator between them. I added "\n" in the console.log call, but there is still no space between the end of the first link's text and the start of the second.
Code:
(async() => {
const html = await axios.get('https://www.xyz');
const $ = await cheerio.load(html.data);
let data = []
$(".div-previews").each((i, elem) => {
console.log('data::', $(elem).find(".header-text a").text() + "\n"); // show links with space between them
});
})();
This should work better - I replaced your each and find
(async() => {
  const html = await axios.get('https://www.xyz');
  const $ = cheerio.load(html.data); // load is synchronous, no await needed
  console.log('data::',
    $(".div-previews .header-text a")
      .map(function() { return $(this).text() }) // $(this).text() works in both Cheerio and jQuery
      .get()
      .join("\n") // or .join(" ")
  )
})();
Example
console.log('data::',
$(".div-previews .header-text a")
.map(function() {
return this.textContent
})
.get()
.join("\n") // or .join(" ")
)
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>
<div class="div-previews">
<div class="header-text">
<a>Link 1</a>
</div>
</div>
<div class="div-previews">
<div class="header-text">
<a>Link 2</a>
</div>
</div>
<div class="div-previews">
<div class="header-text">
<a>Link 3</a>
<a>Link 4</a>
</div>
</div>
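To see why the join step matters, here is a small plain-array sketch. The previews array is made-up data mirroring the example markup, not real scraped output. Calling .text() on a selection that holds several elements concatenates their text with no separator, whereas collecting the texts into an array lets you pick the separator.

```javascript
// Link texts grouped per .div-previews block (illustrative data only).
const previews = [["Link 1"], ["Link 2"], ["Link 3", "Link 4"]];

// Concatenating with no separator mimics .text() over several elements.
const runTogether = previews.map(links => links.join("")).join("");

// Flattening first and joining with "\n" puts every link on its own line.
const separated = previews.flat().join("\n");

console.log(runTogether); // => Link 1Link 2Link 3Link 4
console.log(separated);
```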
I'm a complete beginner in JavaScript and web scraping with Puppeteer, and I am trying to get the scores of a single EuroLeague round at
https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019
By inspecting the score list above, I found that it is a div element containing other divs with the stats displayed.
HTML for a single match between 2 teams (there are more divs for matches below this example )
//score list
<div class="wp-module wp-module-asidegames wp-module-5lfarqnjesnirthi">
//the data-code increases to "euro_245" ...
<div class="">
<div class="game played" data-code="euro_244" data-date="1583427600000" data-played="1">
<a href="/main/results/showgame?gamecode=244&seasoncode=E2019" class="game-link">
<div class="club">
<span class="name">Zenit St Petersburg</span>
<span class="score homepts winner">76</span>
</div>
<div class="club">
<span class="name">Zalgiris Kaunas</span>
<span class="score awaypts ">75</span>
</div>
<div class="info">
<span class="date">March 5 18:00 CET</span>
<span class="live">
LIVE <span class="minute"></span>
</span>
<span class="final">
FINAL
</span>
</div>
</a>
</div>
//more teams
</div>
</div>
What I want is to iterate through the outer div element, get the teams playing and the score of each match, and store them in a JSON file. However, since I am a complete beginner, I do not understand how to iterate through the HTML above.
This is my web scraping code to get the element :
const puppeteer = require('puppeteer');
const sleep = (delay) => new Promise((resolve) => setTimeout(resolve,delay));
async function getTeams(url){
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
await sleep(3000);
const games = await page.$x('//*[@id="main-one"]/div/div/div/div[1]/div[1]/div[3]');
//this is where I will execute the iteration part to get the matches with their scores
await sleep(2000);
await browser.close();
}
getTeams('https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019');
I would appreciate your help in guiding me through the iteration part.
Thank you in advance.
The most accurate selector for a game box is div.game.played (a div that has both the .game and .played CSS classes). You will need to count the elements that match this selector. This is possible with page.$$eval (page.$$eval(selector, pageFunction[, ...args])), which runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to pageFunction.
Because we use the element indexes to pick out the specific data fields, we run a regular for loop over the number of elements.
If you need a specific range of "euro_xyz" codes, you can read the data-code attribute values in page.evaluate with Element.getAttribute and compare their numeric part against the desired "xyz" number.
To collect each game's data we can define a collector array (gameObj) that is extended on each iteration. In each iteration we fill an actualGame object with the actual data.
It is important to determine which child elements contain the corresponding data values. For example, the home club's name is 'div.game.played > a > div:nth-child(1) > span:nth-child(1)': the div child number selects the club, while the span child number chooses between the club name and the points. The loop's [i] index is responsible for grabbing the right game box's values (that's why the elements were counted in the beginning).
For example:
const allGames = await page.$$('div.game.played')
const allGameLength = await page.$$eval('div.game.played', el => el.length)
const gameObj = []
for (let i = 0; i < allGameLength; i++) {
try {
let dataCode = await page.evaluate(el => el.getAttribute('data-code'), allGames[i])
dataCode = parseInt(dataCode.replace('euro_', ''))
if (dataCode > 243) {
const actualGame = {
homeClub: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(1) > span:nth-child(1)'))[i]),
awayClub: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(2) > span:nth-child(1)'))[i]),
homePoints: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(1) > span:nth-child(2)'))[i]),
awayPoints: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(2) > span:nth-child(2)'))[i]),
gameDate: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(3) > span:nth-child(1)'))[i])
}
gameObj.push(actualGame)
}
} catch (e) {
console.error(e)
}
}
console.log(JSON.stringify(gameObj))
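The data-code check and the "store them in a JSON file" step from the question can be sketched in plain Node as follows. The helper name parseGameCode, the second sample row and the output path are assumptions for illustration. Only the first row comes from the question's markup.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// "euro_244" -> 244; yields NaN if the attribute is missing or malformed.
function parseGameCode(dataCode) {
  return Number.parseInt(String(dataCode).replace("euro_", ""), 10);
}

// Sample records: the first mirrors the question's markup, the second is a
// made-up placeholder that the filter below should reject.
const scraped = [
  { dataCode: "euro_244", homeClub: "Zenit St Petersburg", awayClub: "Zalgiris Kaunas", homePoints: "76", awayPoints: "75" },
  { dataCode: "euro_243", homeClub: "Home (placeholder)", awayClub: "Away (placeholder)", homePoints: "0", awayPoints: "0" },
];

const games = scraped
  .filter(game => parseGameCode(game.dataCode) > 243)
  .map(game => ({
    ...game,
    homePoints: Number(game.homePoints), // textContent is a string; store numbers
    awayPoints: Number(game.awayPoints),
  }));

// Hypothetical output location; any writable path works.
const outFile = path.join(os.tmpdir(), "euroleague-games.json");
fs.writeFileSync(outFile, JSON.stringify(games, null, 2));

console.log(`wrote ${games.length} game(s) to ${outFile}`);
```

Passing the radix 10 to parseInt and converting the point strings to numbers are optional refinements. The scraping code above works without them.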
Puppeteer has had a page.waitFor method (deprecated in later releases) for the same purpose as your sleep function, but you can also wait for a selector to appear with page.waitForSelector, which avoids guessing a delay.
Clicking the div element with role='button' doesn't work. I need to click on the icon, but I can't do it.
html:
<div class="list">
<div class="item">
<div role="button" tabindex="-1">
<strong>ItemName2</strong>
</div>
<div class="d">
<div class="item-icon" role="button" tabindex="-1" style="display: none">
<i aria-label="icon: edit" class="edit"></i>
</div>
</div>
</div>
<div class="item"> ... </div>
<div class="item"> ... </div>
<div class="item"> ... </div>
</div>
js:
try {
await driver.get("http://127.0.0.1");
let findButtons = await driver.findElements(By.tagName('strong'));
let buttons = findButtons.map(elem => elem.getText());
const allButtons = await Promise.all(buttons);
console.log(allButtons); // It is displayed all button values, such as ItemName1
let tButton;
for (let i = 0; i < findButtons.length; i++) {
if (allButtons[i] == 'ItemName2') {
tButton = await findButtons[i];
tButton.click(); // I try to click on this button, where value = ItemName2
console.log(allButtons[i]); // It is displayed button value 'ItemName2'
}}}
Console error:
(node:12254) UnhandledPromiseRejectionWarning: StaleElementReferenceError: stale element reference: element is not attached to the page document
You are getting the stale element exception because you are using old element references. Each time you click an element in your loop, the page can update and the earlier references become invalid, so findButtons[i] no longer points at an attached element. To handle this, fetch fresh references to the buttons before clicking. Try the code below.
js:
const { By, Key, until } = require("selenium-webdriver");
const webdriver = require("selenium-webdriver");
require("chromedriver");
(async () => {
  let driver = await new webdriver.Builder().forBrowser("chrome").build();
  try {
    await driver.get("http://10.203.201.77:8000/login");
    let findButtons = await driver.findElements(By.tagName('strong'));
    const allButtons = await Promise.all(findButtons.map(elem => elem.getText()));
    console.log(allButtons); // all button values, such as ItemName1
    for (let i = 0; i < allButtons.length; i++) {
      if (allButtons[i] == 'ItemName2') {
        // re-find the elements so the reference is fresh before clicking
        findButtons = await driver.findElements(By.tagName('strong'));
        await findButtons[i].click(); // await so the click finishes before driver.quit()
        console.log(allButtons[i]); // 'ItemName2'
      }
    }
    console.log("DONE");
  } catch (e) {
    console.log(e);
  } finally {
    await driver.quit();
  }
})();
I found a solution:
let findButtons = await driver.findElements(By.tagName('strong'));
let buttons = findButtons.map(async elem => await elem.getText()); // I add async & await
const allButtons = await Promise.all(buttons);
console.log(allButtons); // There are all itemName
I am working with Cheerio and I am stuck at a point where I want to get the href value of the a element inside a child div of <div class="Card">.
<div class="Card">
<div class="title">
<a target="_blank" href="test">
Php </a>
</div>
<div>some content</div>
<div>some content</div>
<div>some content</div>
</div>
I got the first children correctly, but I want to get the href value of the a element inside div class=title. I am new to Node and I already searched for this but didn't find an appropriate answer.
var jobs = $("div.jobsearch-SerpJobCard",html);
Here is my script:
const rp = require('request-promise');
const $ = require('cheerio');
const potusParse = require('./potusParser');
const url = "";
rp(url)
.then((html)=>{
const Urls = [];
var jobs = $("div.Card",html);
for (let i = 2; i < jobs.length; i++) {
Urls.push(
$("div.Card > div[class='title'] >a", html)[i].attribs.href
);
}
console.log(Urls);
})
.catch(err => console.log(err));
It looks something like this:
$('.Card').map((i, card) => {
return {
link: $(card).find('a').text(),
href: $(card).find('a').attr('href'),
}
}).get()
Edit: the nlp library is chrono-node and I also recommend timeago.js to go the opposite way