Hello, I have a site whose URLs are rendered by JavaScript.
I want to find all script tags on my site, then match each script's src and keep only the valid ones.
Next, I want to find the parent of each script and finally click on the link inside it.
This is what I have:
const scripts = await page.$$('script').then(scripts => {
return scripts.map(script => {
if(script.src.indexOf('aaa')>0){
return script
}
});
});
scripts.forEach(script => {
let link = script.parentElement.querySelector('a');
link.click();
});
My problem is that script.src is undefined.
When I remove that condition, I get to the forEach loop, but then querySelector is undefined. I can write this code in plain JS in the browser's DevTools console, but I can't port it to the Puppeteer API.
From the console, I get the results I expect:
let scripts = document.querySelectorAll('script');
scripts.forEach(script=>{
let el = script.parentElement.querySelector('a');
console.log(el)
})
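When you use $$ or $, you get back ElementHandles (a kind of JSHandle), which are not the same as the HTML Node or NodeList you get when you run querySelector inside evaluate. An ElementHandle is a reference held in Node to an element that lives in the browser, so script.src will always be undefined.
You can use the following instead. $$eval runs the selector in the page and passes the array of matching elements to your callback, which itself runs in the browser.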
page.$$eval('script', scripts => {
  // the callback receives the whole array of matches, so loop over it
  scripts.forEach(script => {
    const valid = (script.getAttribute('src') || '').includes('aaa'); // do some checks; inline scripts have no src
    const link = valid && script.parentElement.querySelector('a'); // the nearby anchor element if the check passed
    if (link) link.click(); // click if it exists
  });
});
There are other ways to achieve this, but I merged everything into one call. In other words, if the code works in the browser, you can also use .evaluate, run the exact same code, and get the same result.
page.evaluate(() => {
let scripts = document.querySelectorAll('script');
scripts.forEach(script => {
let el = script.parentElement.querySelector('a');
console.log(el) // this is logged in the browser's console while it is running, not in your Node console
if (el) el.click(); // skip scripts whose parent contains no link
})
})
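If you want to check the filtering logic without launching a browser, one option is to write the page function as a standalone function and hand it to $$eval. This is only a sketch: clickLinksNearScripts and the 'aaa' needle are names made up for illustration, and the function must not reference outside variables because Puppeteer serializes it before running it in the page.

```javascript
// Hypothetical page function, meant to be used as:
//   const clicked = await page.$$eval('script', clickLinksNearScripts, 'aaa');
// It runs in the browser, so it may only use its own arguments.
function clickLinksNearScripts(scripts, needle) {
  let clicked = 0;
  for (const script of scripts) {
    // Inline scripts have no src attribute, so fall back to an empty string.
    const src = script.getAttribute('src') || '';
    if (!src.includes(needle)) continue;
    // Look for an anchor inside the script's parent element.
    const parent = script.parentElement;
    const link = parent ? parent.querySelector('a') : null;
    if (link) {
      link.click();
      clicked += 1;
    }
  }
  // A plain number is serializable, so it can be returned to Node.
  return clicked;
}
```

The returned count tells you on the Node side whether anything was actually clicked, which is easier to debug than a silent no-op.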
I am new to JS and have spent hours trying to get the inner text of the bio__name__display element (see the attached DOM), without success.
What could be the problem here?
I am calling async functions defined in page.js from index.js. When I console.log the return value of the first function, it works fine. However, the second function does not work (the output is undefined).
For the CSS selector, I tried the following, but to no avail.
div.bio__name > span.bio__name__display
"div.bio__name>span.bio__name__display"
div.bio__name span.bio__name__display
index.js
const splinterlandsPage= require('./page');
.
.
.
await page.waitForTimeout(10000);
let [mana, displayName] = await Promise.all([
splinterlandsPage.checkMatchMana(page).then((mana) => mana).catch(() => 'no mana'),
splinterlandsPage.getText(page).then((displayName) => displayName).catch(() => 'displayName name not caught')
]);
console.log("mana : ", mana) //works
console.log("displayName: ", displayName); //does not work
.
.
.
page.js
.
.
.
// first function
async function checkMatchMana(page) {
const mana = await page.$$eval("div.col-md-12 > div.mana-cap__icon", el => el.map(x => x.getAttribute("data-original-title")));
const manaValue = parseInt(mana[0].split(':')[1], 10);
return manaValue;
}
// second function
async function getText(page) {
const displayName= await page.$$eval("div.bio__name > span.bio__name__display", el => el.innerText);
return displayName
}
.
.
.
exports.checkMatchMana = checkMatchMana;
exports.getText= getText;
DOM
As much as I wanted to share the actual site URL, it was tough to do so since the access to the DOM required a sign-up for the site and this specific DOM in question is available for only 2 minutes after a certain button within the site is clicked.
I finally figured it out myself today. Below is a solution and some notes for later reference.
Solution
The issue was not the CSS selector but what $$eval passes to the callback.
Since $$eval passes an array of elements to the callback, the value has to be processed like el => el.map(x => x.innerText) instead of el => el.innerText.
page.js
// second function
async function getText(page) {
const displayName = await page.$$eval("div.bio__name > span.bio__name__display", el => el.map(x => x.innerText));
return displayName[0]
}
Another workaround is to use $eval instead, which passes the callback a single matching element.
// second function
async function getText(page) {
const displayName = await page.$eval("div.bio__name > span.bio__name__display", el => el.innerText);
return displayName
}
Notes
$$eval and $eval
$$eval runs document.querySelectorAll('CSS Selector') internally, which returns multiple elements that match the specified group of selectors.
(It also works when only a single element matches, but pageFunction still receives an array and has to handle it accordingly.)
$eval runs document.querySelector('CSS Selector') internally, which returns a single element that matches the specified group of selectors.
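To see why el => el.innerText came back undefined, here is a browser-free sketch. The fakeMatches array is a made-up stand-in for the array that $$eval passes to pageFunction; only its shape matters.

```javascript
// Stand-in for the array of matched elements that $$eval hands to the callback.
const fakeMatches = [{ innerText: 'PlayerOne' }];

// Callback written as if it received one element: an array has no innerText.
const asSingle = (el) => el.innerText;
// Callback written for an array of elements.
const asList = (els) => els.map((x) => x.innerText);

console.log(asSingle(fakeMatches)); // undefined
console.log(asList(fakeMatches)[0]); // PlayerOne
console.log(asSingle(fakeMatches[0])); // PlayerOne ($eval passes a single element)
```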
CSS Selectors
As a new learner, I often find myself lost when trying to choose the right selectors.
It might be a quick and dirty way, but the first answer to this question helped.
I can't believe I'm asking such an obvious question, but I still get the wrong result in the console log.
The console shows "[]" as the crawl result, even though I've checked for typos at least 10 times. Anyway, here's the JavaScript code.
I want to crawl the site.
This is the kangnam.js file:
const axios = require('axios');
const cheerio = require('cheerio');
const log = console.log;
const getHTML = async () => {
try {
return await axios.get('https://web.kangnam.ac.kr', {
headers: {
Accept: 'text/html'
}
});
} catch (error) {
console.log(error);
}
};
getHTML()
.then(html => {
let ulList = [];
const $ = cheerio.load(html.data);
const $allNotices = $("ul.tab_listl div.list_txt");
$allNotices.each(function(idx, element) {
ulList[idx] = {
title : $(this).find("list_txt title").text(),
url : $(this).find("list_txt a").attr('href')
};
});
const data = ulList.filter(n => n.title);
return data;
}). then(res => log(res));
I've checked and revised it at least 10 times.
Yet JS still gives this result:
root#goorm:/workspace/web_platform_test/myapp/kangnamCrawling(master)# node kangnam.js
[]
Mate, I think the issue is that you're parsing it incorrectly.
$allNotices.each(function(idx, element) {
ulList[idx] = {
title : $(this).find("list_txt title").text(),
url : $(this).find("list_txt a").attr('href')
};
});
Two things are going on here. $(this) is a Cheerio object wrapping a single DOM Node, in this case the div.list_txt element itself. And Cheerio's .find() is not the array method linked below: it takes a CSS selector and searches the descendants of that element. "list_txt title" has no leading dot, so it is read as a <title> element inside a <list_txt> element, which matches nothing. The same goes for "list_txt a", and even ".list_txt a" would only match a .list_txt nested inside the current one. .text() on an empty selection returns an empty string, so filter(n => n.title) throws every entry away and you end up with [].
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/find
All the data you're looking for is contained within the Node object, so you can read it straight off the node with property accessors. You also don't need $(this), since the element parameter already gives you the same Node.
$allNotices.each(function(idx, element) {
ulList[idx] = {
title : element.children[0].attribs.title,
url : element.children[0].attribs.href
};
});
This should now populate your data array correctly. You should always inspect the data structures you're parsing, since that's the only way to parse them correctly.
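One caveat with children[0]: it assumes the anchor is the very first child node. If the markup has whitespace before the <a>, the first child may be a text node instead. Here is a small sketch that looks for the first <a> tag child, assuming the nodes have the type, name, attribs and children fields that the snippet above already relies on (firstAnchorAttribs and the sample node are made up for illustration):

```javascript
// Return the attribs of the first <a> child of a parsed node, or null.
function firstAnchorAttribs(element) {
  const children = element.children || [];
  const anchor = children.find(
    (child) => child.type === 'tag' && child.name === 'a'
  );
  return anchor ? anchor.attribs : null;
}

// Hand-built node in the same shape, with a whitespace text node first.
const sample = {
  type: 'tag',
  name: 'div',
  attribs: { class: 'list_txt' },
  children: [
    { type: 'text', data: '\n  ' },
    { type: 'tag', name: 'a', attribs: { title: 'Notice', href: '/notice/1' }, children: [] },
  ],
};

console.log(firstAnchorAttribs(sample)); // { title: 'Notice', href: '/notice/1' }
```

Inside the .each() callback you would call firstAnchorAttribs(element) and skip the entry when it returns null.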
Anyways, I hope I solved your problem!
I'm trying to load page elements into an array and retrieve the innerHTML from both and be able to click on them.
var grabElements = await page.$$(selector);
await grabElements[0].click();
This allows me to grab my elements and click on them but it won't display innerHTML.
var elNum = await page.$$eval(selector, (element) => {
let n = []
element.forEach(e => {
n.push(e);
})
return n;
});
await elNum[0].click();
This lets me get the innerHTML if I push the innerHTML to n. If I push just the element e and try to click or get its innerHTML outside of the var declaration, it doesn't work. The innerHTML comes as undefined and if I click, I get an error saying elnum[index].click() is not a function. What am I doing wrong?
The difference between page.$$eval (and other evaluate-style methods, with the exception of evaluateHandle) and page.$$ is that the evaluate family can only return serializable values. As you discovered, you can't return elements from these methods because they're not serializable (they have circular references and would be useless in Node anyway).
On the other hand, page.$$ returns Puppeteer ElementHandles, which are references to DOM elements that can be manipulated from Puppeteer's API in Node rather than in the browser. This is useful for many reasons, one of which is that ElementHandle.click() scrolls the element into view and issues trusted mouse events on it, a totally different set of operations from running the native HTMLElement.click() in the browser.
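If you need both the innerHTML and a clickable handle, one pattern is to return only plain data from $$eval and then use the index to pick the matching ElementHandle from page.$$. This is a sketch, with summarize and pickIndexes as hypothetical helper names and stand-in objects in place of real elements:

```javascript
// Page function for $$eval: returns only serializable data.
//   const rows = await page.$$eval(selector, summarize);
function summarize(elements) {
  return elements.map((el, index) => ({ index, html: el.innerHTML }));
}

// Runs in Node: choose which indexes to click based on the returned data.
function pickIndexes(rows, wantedHtml) {
  return rows
    .filter((row) => row.html.trim() === wantedHtml)
    .map((row) => row.index);
}

// Browser-free check with stand-in elements.
const rows = summarize([
  { innerHTML: 'Option A ' },
  { innerHTML: 'Option B' },
]);
console.log(pickIndexes(rows, 'Option A')); // [ 0 ]
```

With real handles from const handles = await page.$$(selector), you would then await handles[i].click() for each returned index, assuming the page has not re-rendered between the two calls.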
From the comments:
An example of what I'm trying to get is: <div class = "class">This is the innerHTML text I want. </div>. On the page, it's text inside a clickable portion of the website. What i want to do is loop through the available options, then click on the ones that match an innerHTML I'm looking for.
Here's a simple example you should be able to extrapolate to your actual use case:
const puppeteer = require("puppeteer"); // ^19.1.0
const {setTimeout} = require("timers/promises");
const html = `
<div>
<div class="class">This is the innerHTML text I want.</div>
<div class="class">This is the innerHTML text I don't want.</div>
<div class="class">This is the innerHTML text I want.</div>
</div>
<script>
document.querySelectorAll(".class").forEach(e => {
e.addEventListener("click", () => e.textContent = "clicked");
});
</script>
`;
const target = "This is the innerHTML text I want.";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
///////////////////////////////////////////
// approach 1 -- trusted Puppeteer click //
///////////////////////////////////////////
const handles = await page.$$(".class");
for (const handle of handles) {
if (target === (await handle.evaluate(el => el.textContent))) {
await handle.click();
}
}
// show that it worked and reset
console.log(await page.$eval("div", el => el.innerHTML));
await page.setContent(html);
//////////////////////////////////////////////
// approach 2 -- untrusted native DOM click //
//////////////////////////////////////////////
await page.$$eval(".class", (els, target) => {
els.forEach(el => {
if (target === el.textContent) {
el.click();
}
});
}, target);
// show that it worked and reset
console.log(await page.$eval("div", el => el.innerHTML));
await page.setContent(html);
/////////////////////////////////////////////////////////////////
// approach 3 -- selecting with XPath and using trusted clicks //
/////////////////////////////////////////////////////////////////
const xp = '//*[@class="class"][text()="This is the innerHTML text I want."]';
for (const handle of await page.$x(xp)) {
await handle.click();
}
// show that it worked and reset
console.log(await page.$eval("div", el => el.innerHTML));
await page.setContent(html);
///////////////////////////////////////////////////////////////////
// approach 4 -- selecting with XPath and using untrusted clicks //
///////////////////////////////////////////////////////////////////
await page.evaluate(xp => {
// https://stackoverflow.com/a/68216786/6243352
const $x = xp => {
const snapshot = document.evaluate(
xp, document, null,
XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
);
return [...Array(snapshot.snapshotLength)]
.map((_, i) => snapshot.snapshotItem(i))
;
};
$x(xp).forEach(e => e.click());
}, xp);
// show that it worked
console.log(await page.$eval("div", el => el.innerHTML));
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Output in all cases is:
<div class="class">clicked</div>
<div class="class">This is the innerHTML text I don't want.</div>
<div class="class">clicked</div>
Note that === might be too strict without calling .trim() on the textContent first. You may want an .includes() substring test instead, although the risk there is that it's too permissive. Or a regex may be the right tool. In short, use whatever makes sense for your use case rather than (necessarily) my === test.
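To make those trade-offs concrete, here is a small sketch of a matcher factory. makeTextMatcher and its mode values are made up for illustration; pick whichever comparison fits your page.

```javascript
// Build a predicate for comparing an element's text against a target.
// mode is a made-up option: 'exact', 'trimmed' (default), or 'includes'.
// Passing a RegExp as the target uses the regex instead.
function makeTextMatcher(target, mode = 'trimmed') {
  if (target instanceof RegExp) return (text) => target.test(text);
  if (mode === 'exact') return (text) => text === target;
  if (mode === 'includes') return (text) => text.includes(target);
  return (text) => text.trim() === target.trim();
}

const wanted = 'This is the innerHTML text I want.';
const padded = '\n  This is the innerHTML text I want.  ';
const unwanted = "This is the innerHTML text I don't want.";

console.log(makeTextMatcher(wanted, 'exact')(padded)); // false
console.log(makeTextMatcher(wanted)(padded)); // true
console.log(makeTextMatcher('text I', 'includes')(unwanted)); // true (too permissive)
console.log(makeTextMatcher(/text I want\.$/)(wanted)); // true
```

In approach 1 the text comes back to Node through handle.evaluate, so a matcher like this can run in Node. Inside $$eval the comparison has to live in the page function itself, since functions can't be passed as arguments. Avoid regexes with the g flag here, because test() on them is stateful.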
With respect to the XPath approach, this answer shows a few options for dealing with whitespace and substrings.
I have read a few different QAs related to this but none seem to be working.
I am trying to target an element (Angular) called mat-radio-button with a class called mat-radio-checked. And then select the inner text.
In Chrome this is easy:
https://i.stack.imgur.com/Ev0iQ.png
https://i.stack.imgur.com/lVoG3.png
To find the first element that matches in Playwright I can do something like:
let test: any = await page.textContent(
"mat-radio-button.mat-radio-checked"
);
console.log(test);
But if I try this:
let test: any = await page.$$(
"mat-radio-button.mat-radio-checked"
);
console.log(test);
console.log(test[0]);
console.log(test[1]);
});
It does not return an array of elements I can select the inner text of.
I need to be able to find all elements with that class so I can use expect to ensure the returned inner text is correct, eg:
expect(test).toBe("Australian Citizen");
I found out the issue was that the elements were not yet available when my code ran, because the page was still rendering. So I added a waitForSelector:
await page.waitForSelector("mat-radio-button");
const elements = await page.$$("mat-radio-button.mat-radio-checked");
console.log(elements.length);
console.log(await elements[0].innerText());
console.log(await elements[1].innerText());
console.log(await elements[2].innerText());
public async validListOfItems(name: string): Promise<string[]> {
  // Wait (up to 5 s) for a matching element to appear before querying.
  await this.page.waitForSelector(name, {timeout: 5000});
  const itemlists = await this.page.$$(name);
  const allitems: string[] = [];
  for (const itemlist of itemlists) {
    allitems.push(await itemlist.innerText());
  }
  return allitems;
}
Make this a common method and call it with the selector as a parameter.
Intro
The loginToLInkined function takes me to the login page and then returns the Puppeteer page, which I assign to root so I can keep working with it.
const root = await loginToLInkined("https://www.linkedin.com/login");
await root.goto(url);
max_page = await getMaxPage(root);
console.log("max page",max_page)
I then call goto(url), where url is another page I need to go to.
After that, I call getMaxPage(root) with root as a param so I can call evaluate() inside that function.
Problem
const getMaxPage = async root => {
const maxPage = await root.evaluate(()=> {
return document.querySelector(
"li.artdeco-pagination__indicator:nth-last-Child(1)"
);
});
console.log(maxPage)
return parseInt(maxPage.innerText);
};
The problem is that when I console.log(maxPage), it prints undefined, and I realized that passing root as a param isn't working the way I expected.
What am I doing wrong, and how is it properly done?
Note: I have actually tried calling root.evaluate directly, without wrapping it in a function that takes root as a param, and it did return the max page for me.
The issue is in what you return from page.evaluate():
const maxPage = await root.evaluate(()=> {
return document.querySelector(
"li.artdeco-pagination__indicator:nth-last-Child(1)"
);
});
This is a DOM node, which is a complex object that cannot be serialized, and the return value must be serializable in order to be returned back to node from Chromium.
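As a rough analogy (Puppeteer's own serialization is not literally JSON.stringify, so treat this as an illustration only), here is what happens when you try to serialize an object that refers back to itself, the way DOM nodes do through parentNode and children. The list and item objects are made-up stand-ins:

```javascript
// Stand-ins for DOM nodes: the list and its item point at each other.
const list = { tagName: 'UL', children: [] };
const item = { tagName: 'LI', innerText: '12', parentNode: list };
list.children.push(item);

let error = null;
try {
  JSON.stringify(item);
} catch (err) {
  error = err; // circular structure cannot be converted to JSON
}
console.log(error instanceof TypeError); // true

// Picking out a primitive first avoids the problem.
console.log(JSON.stringify(item.innerText)); // "12"
```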
So to fix this, here and in all future scripts, return only what is needed and what can be passed through JSON.stringify without error. As pguardiario correctly noted in the comment, in this case it's enough to return innerText from that node:
const maxPage = await root.evaluate(()=> {
return document.querySelector("li.artdeco-pagination__indicator:nth-last-Child(1)").innerText;
});
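Putting it together, here is a sketch of getMaxPage that returns only text from the page and parses it in Node. The null checks are an addition, on the assumption that the pagination element may be missing on some pages, and parseMaxPage is a hypothetical helper name.

```javascript
// Parse the text of the last pagination indicator; null if it isn't a number.
function parseMaxPage(text) {
  const n = Number.parseInt(String(text ?? '').trim(), 10);
  return Number.isNaN(n) ? null : n;
}

// root is the Puppeteer page from the question.
const getMaxPage = async (root) => {
  const text = await root.evaluate(() => {
    const el = document.querySelector(
      'li.artdeco-pagination__indicator:nth-last-child(1)'
    );
    return el ? el.innerText : null; // a string or null, both serializable
  });
  return parseMaxPage(text);
};

console.log(parseMaxPage(' 12\n')); // 12
console.log(parseMaxPage(null)); // null
```

Returning null instead of NaN makes the "no pagination found" case easy to check for in the caller.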