I'm a complete beginner in JavaScript and web scraping with Puppeteer, and I am trying to get the scores of a single EuroLeague round from
https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019
By inspecting the page I found that the score list is a div element containing other divs with the stats.
HTML for a single match between 2 teams (there are more divs for matches below this example):
//score list
<div class="wp-module wp-module-asidegames wp-module-5lfarqnjesnirthi">
//the data-code increases to "euro_245" ...
<div class="">
<div class="game played" data-code="euro_244" data-date="1583427600000" data-played="1">
<a href="/main/results/showgame?gamecode=244&seasoncode=E2019" class="game-link">
<div class="club">
<span class="name">Zenit St Petersburg</span>
<span class="score homepts winner">76</span>
</div>
<div class="club">
<span class="name">Zalgiris Kaunas</span>
<span class="score awaypts ">75</span>
</div>
<div class="info">
<span class="date">March 5 18:00 CET</span>
<span class="live">
LIVE <span class="minute"></span>
</span>
<span class="final">
FINAL
</span>
</div>
</a>
</div>
//more teams
</div>
</div>
What I want is to iterate through the outer div element, get the teams playing and the score of each match, and store them in a JSON file. However, since I am a complete beginner, I do not understand how to iterate through the HTML above.
This is my web scraping code to get the element:
const puppeteer = require('puppeteer');
const sleep = (delay) => new Promise((resolve) => setTimeout(resolve,delay));
async function getTeams(url){
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
await sleep(3000);
const games = await page.$x('//*[@id="main-one"]/div/div/div/div[1]/div[1]/div[3]');
//this is where I will execute the iteration part to get the matches with their scores
await sleep(2000);
await browser.close();
}
getTeams('https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019');
I would appreciate your help with the iteration part.
Thank you in advance
The most accurate selector for a game box is div.game.played (a div that has both the .game and the .played CSS classes). You first need to count the elements that match this selector. That is possible with page.$$eval (page.$$eval(selector, pageFunction[, ...args])), which runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to pageFunction.
Because we use the element index to pick out the specific data fields, we run a regular for loop over the number of elements.
If you need a specific range of "euro_xyz" codes, you can read the data-code attribute values in a page.evaluate call with Element.getAttribute and compare their numeric part against the desired "xyz" number.
To collect each game's data we can define a collector array (gameObj) that grows with each iteration. In each iteration we fill an actualGame object with that game's data.
It is important to determine which child elements contain the corresponding data values. For example, the home club's name is 'div.game.played > a > div:nth-child(1) > span:nth-child(1)': the div child number selects the club, while the span child number chooses between the club name and the points. The loop's [i] index picks the right game box's values (that is why the boxes were counted at the beginning).
For example:
const allGames = await page.$$('div.game.played')
const allGameLength = await page.$$eval('div.game.played', el => el.length)
const gameObj = []
for (let i = 0; i < allGameLength; i++) {
try {
let dataCode = await page.evaluate(el => el.getAttribute('data-code'), allGames[i])
dataCode = parseInt(dataCode.replace('euro_', ''))
if (dataCode > 243) {
const actualGame = {
homeClub: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(1) > span:nth-child(1)'))[i]),
awayClub: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(2) > span:nth-child(1)'))[i]),
homePoints: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(1) > span:nth-child(2)'))[i]),
awayPoints: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(2) > span:nth-child(2)'))[i]),
gameDate: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(3) > span:nth-child(1)'))[i])
}
gameObj.push(actualGame)
}
} catch (e) {
console.error(e)
}
}
console.log(JSON.stringify(gameObj))
Puppeteer has a page.waitFor method that serves the same purpose as your sleep function (it is deprecated in newer Puppeteer versions), but you can also wait for a selector to appear with page.waitForSelector.
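As a possible alternative to indexing into the page once per field, the extraction could be done in a single page.$$eval call and the result written to disk, since the question asks for a JSON file. This is only a sketch: extractGames and saveGames are hypothetical names, the selectors are taken from the markup in the question, and page is assumed to be a Puppeteer Page that has already opened the results URL.

```javascript
const fs = require('fs');

// Runs inside the page: turns each game box into a plain object.
// It must not use variables from the outer scope, because Puppeteer
// serializes the function before sending it to the browser.
function extractGames(boxes) {
  return boxes.map(box => {
    const text = (root, sel) => root.querySelector(sel).textContent.trim();
    const clubs = box.querySelectorAll('div.club'); // [home, away]
    return {
      code: box.getAttribute('data-code'),
      homeClub: text(clubs[0], '.name'),
      homePoints: Number(text(clubs[0], '.score')),
      awayClub: text(clubs[1], '.name'),
      awayPoints: Number(text(clubs[1], '.score')),
      gameDate: text(box, '.date'),
    };
  });
}

// Hypothetical helper: `page` is assumed to be a Puppeteer Page,
// `outFile` is whatever path the JSON should be written to.
async function saveGames(page, outFile) {
  await page.waitForSelector('div.game.played');
  const games = await page.$$eval('div.game.played', extractGames);
  fs.writeFileSync(outFile, JSON.stringify(games, null, 2));
  return games;
}
```

Inside getTeams this could be called as await saveGames(page, 'round28.json') right after page.goto(url), which would also make the fixed sleep calls unnecessary.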
I am trying to write a purchasing bot for the Supreme website as a way to teach myself JavaScript and Puppeteer. I am having trouble finding a way to click on the item that contains the text given as an argument to the function. Here is what I am trying:
'''
const puppeteer = require('puppeteer-extra');
const pluginStealth = require("puppeteer-extra-plugin-stealth");
puppeteer.use(pluginStealth());
const BASE_URL = "https://www.supremenewyork.com/shop/all";
const CHECKOUT_URL = "https://www.supremenewyork.com/checkout";
async function startBot(itemCategory) {
const browser = await puppeteer.launch({
args: ['--no-sandbox', '--disable-setuid-sandbox'],
headless: false
});
let page = await browser.newPage();
await page.goto(itemCategory);
return {page};
}
async function closeBrowser(browser) {
return browser.close();
}
async function addToCart(page, itemName) {
page.$$eval('a', as => as.find(a => a.innerText.match(itemName).click()));
}
async function checkout() {
const page = await startBot(categories['t_shirts']);
await addToCart(page, item_info['name']);
}
checkout();
'''
The error I get is:
TypeError: page.$x is not a function
Am I going about this the correct way, or is there a better method?
This is a brief overview of the HTML:
<div class="inner-article">
<a style="height:150px;" href="/shop/t-shirts/kw2h4jdue/dvor2fh76"><img width="150" height="150" src="//assets.supremenewyork.com/231423/vi/u13QfMEqLmM.jpg" alt="U13qfmeqlmm">
</a>
<h1>
<a class="name-link" href="/shop/t-shirts/kw2h4jdue/dvor2fh76">Knowledge Tee
</a>
</h1>
<p>
<a class="name-link" href="/shop/t-shirts/kw2h4jdue/dvor2fh76">Black
</a>
</p>
</div>
I want to click the a.name-link element whose text is 'Knowledge Tee'.
Thanks
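page.$x should be a function, I think, so something else is going wrong there. One likely cause: startBot returns { page }, so the value that checkout passes to addToCart is a wrapper object rather than the Puppeteer page itself.
You can also do something like:
await page.$$eval('a', as => as.find(a => a.innerText.match("Knowledge Tee")).click())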
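To pass the item name in as an argument instead of hard-coding it, the extra arguments of page.$$eval can be used. The following is a rough sketch, not a tested bot: clickFirstMatch and clickLinkByText are hypothetical names, the a.name-link selector comes from the HTML in the question, and page is assumed to be the Puppeteer Page itself.

```javascript
// Runs inside the page: clicks the first link whose text contains `wanted`.
// Returns false instead of throwing when nothing matches.
function clickFirstMatch(links, wanted) {
  const link = links.find(a => a.innerText.trim().includes(wanted));
  if (!link) return false;
  link.click();
  return true;
}

// Hypothetical helper: `page` is assumed to be a Puppeteer Page that is
// already on the shop listing. `itemName` is forwarded to the browser
// as the second argument of clickFirstMatch.
async function clickLinkByText(page, itemName) {
  return page.$$eval('a.name-link', clickFirstMatch, itemName);
}
```

With this, addToCart could be reduced to return clickLinkByText(page, itemName), and the boolean result tells the caller whether the item was found.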
I'm fetching a group of documents from Firestore in a forEach. I know I'm fetching them all, as I can see the documents and their fields in the console.
But when I assign the fetched fields to the elements by id, only one document is shown (there should be 5 tiles and images, but I can only see one).
From reading, I know I somehow need to duplicate the HTML code for each document fetched from the collection, but I am struggling to do so.
JavaScript code
const i = query(collection(db, "teams"));
const unsubscribe = onSnapshot(i, (querySnapshot) => {
querySnapshot.forEach((doc) => {
const docData = doc.data();
document.getElementById("ageGroup").innerText = docData.ageGroup,
document.getElementById("teamImage").src = docData.teamImage,
console.log("Current data: ", docData);
});
});
HTML Code
<section class="teams">
<article>
<h1 class="team-names" id="ageGroup"></h1>
<div class="team-line"></div>
<img class="team-image" id="teamImage">
</article>
</section>
Unless you know how many items will be in the result in advance, dynamically create new elements inside the loop, and then insert the data from the database into the HTML.
Don't use IDs - those should only be used for absolutely unique elements, not for repeated elements.
For an <article> for each item, where the parent of all articles is the .teams, do:
const teams = document.querySelector('.teams');
querySnapshot.forEach((doc) => {
const docData = doc.data();
const article = teams.appendChild(document.createElement('article'));
article.innerHTML = `
<h1 class="team-names"></h1>
<div class="team-line"></div>
<img class="team-image">
`;
article.children[0].textContent = docData.ageGroup;
article.children[2].src = docData.teamImage;
});
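One thing to watch for: onSnapshot calls its callback again whenever the query results change, so appending on every call would duplicate the tiles. A minimal sketch of one way to handle that, assuming it is acceptable to rebuild the list on each snapshot. renderTeams is a hypothetical name, and the doc parameter exists only so the function can be exercised without a browser.

```javascript
// Rebuilds the team tiles inside `container` from an array of plain objects
// with `ageGroup` and `teamImage` fields (the fields used in the question).
// `doc` is the document used to create elements; pass `document` in the browser.
function renderTeams(container, teams, doc) {
  container.replaceChildren(); // drop the tiles from the previous snapshot
  for (const team of teams) {
    const article = doc.createElement('article');

    const title = doc.createElement('h1');
    title.className = 'team-names';
    title.textContent = team.ageGroup;

    const line = doc.createElement('div');
    line.className = 'team-line';

    const image = doc.createElement('img');
    image.className = 'team-image';
    image.src = team.teamImage;

    article.append(title, line, image);
    container.append(article);
  }
}
```

In the snapshot callback this could be used as renderTeams(document.querySelector('.teams'), querySnapshot.docs.map(d => d.data()), document).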
Why does searching for something else delete the previous contents? For example, first you search for egg and the results are shown, but when you then search for beef the program removes the egg results and shows only beef. Code:
const searchBtn = document.getElementById('search-btn');
const mealList = document.getElementById('meal');
const mealDetailsContent = document.querySelector('.meal-details-content');
const recipeCloseBtn = document.getElementById('recipe-close-btn');
// event listeners
searchBtn.addEventListener('click', getMealList);
mealList.addEventListener('click', getMealRecipe);
recipeCloseBtn.addEventListener('click', () => {
mealDetailsContent.parentElement.classList.remove('showRecipe');
});
// get meal list that matches with the ingredients
function getMealList(){
let searchInputTxt = document.getElementById('search-input').value.trim();
fetch(`https://www.themealdb.com/api/json/v1/1/filter.php?i=${searchInputTxt}`)
.then(response => response.json())
.then(data => {
let html = "";
if(data.meals){
data.meals.forEach(meal => {
html += `
<div class = "meal-item" data-id = "${meal.idMeal}">
<div class = "meal-img">
<img src = "${meal.strMealThumb}" alt = "food">
</div>
<div class = "meal-name">
<h3>${meal.strMeal}</h3>
Get Recipe
</div>
</div>
`;
});
mealList.classList.remove('notFound');
} else{
html = "Sorry, we didn't find any meal!";
mealList.classList.add('notFound');
}
mealList.innerHTML = html;
});
}
It's because you are replacing the contents of the mealList element every time.
A simple workaround would be to retrieve the existing innerHTML value before you update it.
Something like
let html = mealList.innerHTML;
rather than starting off empty every time you call the function should do the trick.
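A variation on the same idea, sketched under the assumption that new results should simply be added after the existing ones: build the markup for the new meals only and append it with insertAdjacentHTML, so the existing nodes are not re-parsed. mealCardsHtml and appendMeals are hypothetical names, and the card markup is a simplified version of the one in the question.

```javascript
// Builds the markup for a list of meals (same fields as in the question).
function mealCardsHtml(meals) {
  return meals.map(meal => `
<div class="meal-item" data-id="${meal.idMeal}">
  <div class="meal-img"><img src="${meal.strMealThumb}" alt="food"></div>
  <div class="meal-name"><h3>${meal.strMeal}</h3></div>
</div>`).join('');
}

// Appends the new meals to `mealList` and returns how many were added.
// `meals` is data.meals from the API response, which is falsy when
// nothing was found; in that case the existing cards are left untouched.
function appendMeals(mealList, meals) {
  if (!meals) {
    mealList.classList.add('notFound');
    return 0;
  }
  mealList.classList.remove('notFound');
  mealList.insertAdjacentHTML('beforeend', mealCardsHtml(meals));
  return meals.length;
}
```

Inside the second .then of getMealList this would replace the whole if/else block with appendMeals(mealList, data.meals).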
I'm scraping a website and I'm using Cheerio and Puppeteer.
I need to click a certain button with a given text. Here is my code:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.website.com', {waitUntil: 'networkidle0'});
const html = await page.content();
const $ = cheerio.load(html);
const items = [];
$('.grid-table-container').each((index, element) => {
items.push({
element: $($('.grid-option-name', element)[0]).contents().not($('.grid-option-name', element).children()).text(),
button: $('.grid-option-selectable>div', element)
});
});
items.forEach(item => {
if (item.element === 'Foo Bar') {
await page.click(item.button);
}
});
Here is the markup I'm trying to scrape:
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table"></div>
<div class="item-table">
<div class="grid-item">
<div class="grid-item-container">
<div class="grid-table-container">
<div class="grid-option-header">
<div class="grid-option-caption">
<div class="grid-option-name">
Foo Bar
<span>some other text</span>
</div>
</div>
</div>
<div class="grid-option-table">
<div class="grid-option">
<div class="grid-option-selectable">
<div></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="item-table"></div>
<div class="item-table"></div>
Clicking on a Cheerio element doesn't work. Is there any way to do it?
You could add jquery to the page and do it there:
await page.addScriptTag({path: "jquery.js"})
await page.evaluate(() => {
// do jquery stuff here
})
There's no way to do this. Puppeteer is a totally different API from Cheerio. The two don't talk to each other or interoperate at all. The only thing you can do is snapshot an HTML string in Puppeteer and pass it to Cheerio.
Puppeteer works in the browser context on the live website, with native XPath and CSS capabilities--basically, all the power of the browser at your disposal.
On the other hand, Cheerio is a Node-based HTML parser that simulates a tiny portion of the browser environment. It offers a small subset of Puppeteer's functionality, so don't use Cheerio and Puppeteer together under most circumstances.
Taking a snapshot of the live site, then re-parsing the string into a tree Cheerio can work with is confusing, inefficient and offers few obvious advantages over using the actual thing that's right in front of you. It's like buying a bike just to carry it around.
The solution is to stick with Puppeteer ElementHandle objects:
const puppeteer = require("puppeteer"); // ^19.0.0
const html = `
<div class="item-table">
<div class="grid-item">
<div class="grid-item-container">
<div class="grid-table-container">
<div class="grid-option-header">
<div class="grid-option-caption">
<div class="grid-option-name">
Foo Bar
<span>some other text</span>
</div>
</div>
</div>
<div class="grid-option-table">
<div class="grid-option">
<div class="grid-option-selectable">
<div></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<script>
// for testing purposes
const el = document.querySelector(".grid-option-selectable > div");
el.addEventListener("click", e => e.target.textContent = "clicked");
el.style.height = el.style.width = "50px";
</script>
`;
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
for (const el of await page.$$(".grid-item-container")) {
const text = await el.$eval(
".grid-option-name",
el => el.childNodes[0].textContent
);
const sel = ".grid-option-selectable > div";
if (text.trim() === "Foo Bar") {
const selectable = await el.$(sel);
await selectable.click();
}
console.log(await el.$eval(sel, el => el.textContent)); // => clicked
}
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Or perform your click in the browser:
await page.$$eval(".grid-item-container", els => els.forEach(el => {
const text = el.querySelector(".grid-option-name")
.childNodes[0].textContent.trim();
if (text === "Foo Bar") {
el.querySelector(".grid-option-selectable > div").click();
}
}));
You might consider selecting using an XPath or iterating childNodes to examine all text nodes rather than assuming the text is at position 0, but I've left these as exercises to focus on the main point at hand.
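For the childNodes variant mentioned above, one possible sketch is a small helper that joins all direct text nodes of an element and ignores nested elements such as the span. ownText is a hypothetical name. The constant is declared inside the function on purpose, so the function still works when Puppeteer serializes it and runs it in the browser.

```javascript
// Returns the text that belongs directly to `el`, skipping child elements.
function ownText(el) {
  const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE
  return Array.from(el.childNodes)
    .filter(node => node.nodeType === TEXT_NODE)
    .map(node => node.textContent)
    .join(' ')
    .replace(/\s+/g, ' ') // collapse newlines and indentation
    .trim();
}
```

With an ElementHandle it could be passed straight to $eval, for example const text = await el.$eval(".grid-option-name", ownText), after which the comparison with "Foo Bar" no longer depends on the text being the first child node.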
Clicking the div element with role='button' doesn't work. I need to click on the icon, but I can't do it.
html:
<div class="list">
<div class="item">
<div role="button" tabindex="-1">
<strong>ItemName2</strong>
</div>
<div class="d">
<div class="item-icon" role="button" tabindex="-1" style="display: none">
<i aria-label="icon: edit" class="edit"></i>
</div>
</div>
</div>
<div class="item"> ... </div>
<div class="item"> ... </div>
<div class="item"> ... </div>
</div>
js:
try {
await driver.get("http://127.0.0.1");
let findButtons = await driver.findElements(By.tagName('strong'));
let buttons = findButtons.map(elem => elem.getText());
const allButtons = await Promise.all(buttons);
console.log(allButtons); // It is displayed all button values, such as ItemName1
let tButton;
for (let i = 0; i < findButtons.length; i++) {
if (allButtons[i] == 'ItemName2') {
tButton = await findButtons[i];
tButton.click(); // I try to click on this button, where value = ItemName2
console.log(allButtons[i]); // It is displayed button value 'ItemName2'
}}}
Console error:
(node:12254) UnhandledPromiseRejectionWarning: StaleElementReferenceError: stale element reference: element is not attached to the page document
You are getting the stale element exception because you are using old element references. Each time you click an element in your loop, the page can update, and the references you collected earlier go stale, so the element at index i is no longer attached to the document. To handle this, fetch fresh references to the buttons before using them. Try the code below.
js:
const { By, Key, until } = require("selenium-webdriver");
const webdriver = require("selenium-webdriver");
require("chromedriver");
(async () => {
let driver = await new webdriver.Builder().forBrowser("chrome").build();
try {
await driver.get("http://10.203.201.77:8000/login");
let findButtons = await driver.findElements(By.tagName('strong'));
const allButtons = await Promise.all(findButtons.map(elem => elem.getText()));
console.log(allButtons); // all button labels, such as ItemName1
for (let i = 0; i < allButtons.length; i++) {
// re-locate the elements on every pass so the references are fresh
findButtons = await driver.findElements(By.tagName('strong'));
if (allButtons[i] == 'ItemName2') {
const tButton = findButtons[i];
await tButton.click(); // await the click so it finishes before the driver quits
console.log(allButtons[i]); // prints 'ItemName2'
}
}
console.log("DONE");
} catch (e) {
console.log(e);
} finally {
await driver.quit();
}
})();
I found a solution:
let findButtons = await driver.findElements(By.tagName('strong'));
let buttons = findButtons.map(async elem => await elem.getText()); // I add async & await
const allButtons = await Promise.all(buttons);
console.log(allButtons); // There are all itemName
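Another way to avoid holding element references across a loop might be to locate the target directly by its text and click it as soon as it is found. This is only a sketch under assumptions: strongXPath and clickButtonByLabel are hypothetical names, By and until are the objects exported by selenium-webdriver, and the label is assumed not to contain double quotes.

```javascript
// Builds an XPath that matches a <strong> whose trimmed text equals `label`.
// Assumption: `label` contains no double quotes.
function strongXPath(label) {
  return `//strong[normalize-space()="${label}"]`;
}

// Waits until the element exists, then clicks it. `driver`, `By` and `until`
// are passed in by the caller, e.g. from require("selenium-webdriver").
async function clickButtonByLabel(driver, By, until, label, timeoutMs = 5000) {
  const locator = By.xpath(strongXPath(label));
  const element = await driver.wait(until.elementLocated(locator), timeoutMs);
  await element.click(); // awaited so a failure surfaces here, not as an unhandled rejection
}
```

A call such as await clickButtonByLabel(driver, By, until, 'ItemName2') would then replace the whole loop.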