puppeteer: Getting HTML from NodeList? - javascript

I'm getting a list of 30 items from the code:
const boxes = await page.evaluate(() => {
return document.querySelectorAll("DIV.a-row.dealContainer.dealTile")
})
console.log(boxes);
The result
{ '0': {},
'1': {},
'2': {},
....
'28': {},
'29': {} }
I need to see the HTML of the elements.
But every property of boxes I tried is simply undefined. I tried length, innerHTML, innerText and some others.
I am sure boxes really contains something, because Puppeteer's screenshot shows the content before I start to 'browse' the content of the page.
What am I doing wrong?

There are multiple ways to do this:
Use page.$$eval to execute the selector and return the result in one step.
Use page.$$ to get element handles, then page.evaluate to read properties from a handle.
Code sample for page.$$eval
const htmls = await page.$$eval('selector', els => els.map(el => el.innerHTML));
Code sample for page.evaluate
const boxes = await page.$$('selector'); // array of ElementHandles, not a NodeList
const singleBox = boxes[0];
const html = await page.evaluate(el => el.innerHTML, singleBox);
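As for why the question's output is a list of empty objects: page.evaluate can only hand back values that survive serialization between the browser and Node. Here is a rough analogy in plain Node, not Puppeteer itself. FakeElement is a hypothetical stand-in for a DOM element, and JSON is used only as an approximation of the real serialization, which differs in detail.

```javascript
// Hypothetical stand-in for a DOM element: as with real elements, the
// interesting properties are accessors on the prototype, not own properties.
class FakeElement {
  #html;
  constructor(html) {
    this.#html = html;
  }
  get innerHTML() {
    return this.#html;
  }
}

const fakeNodeList = {
  0: new FakeElement("<b>a</b>"),
  1: new FakeElement("<i>b</i>"),
};

// JSON only approximates crossing the browser/Node boundary.
const roundTrip = value => JSON.parse(JSON.stringify(value));

// The elements arrive as empty objects, so innerHTML is undefined.
const lossy = roundTrip(fakeNodeList);
console.log(lossy, lossy[0].innerHTML);

// Extracting plain strings before crossing the boundary keeps the data.
const htmls = roundTrip(Object.values(fakeNodeList).map(el => el.innerHTML));
console.log(htmls);
```

This is why the working samples return strings (innerHTML) rather than the elements themselves.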

Related

What is the difference between page.$$(selector) and page.$$eval(selector, function) in puppeteer?

I'm trying to load page elements into an array and retrieve the innerHTML from both and be able to click on them.
var grabElements = await page.$$(selector);
await grabElements[0].click();
This allows me to grab my elements and click on them but it won't display innerHTML.
var elNum = await page.$$eval(selector, (element) => {
let n = []
element.forEach(e => {
n.push(e);
})
return n;
});
await elNum[0].click();
This lets me get the innerHTML if I push the innerHTML to n. If I push just the element e and try to click or get its innerHTML outside of the var declaration, it doesn't work. The innerHTML comes as undefined and if I click, I get an error saying elnum[index].click() is not a function. What am I doing wrong?
The difference between page.$$eval (and other evaluate-style methods, with the exception of evaluateHandle) and page.$$ is that the evaluate family only works with serializable values. As you discovered, you can't return elements from these methods because they're not serializable (they have circular references and would be useless in Node anyway).
On the other hand, page.$$ returns Puppeteer ElementHandles that are references to DOM elements that can be manipulated from Puppeteer's API in Node rather than in the browser. This is useful for many reasons, one of which is that ElementHandle.click() issues a totally different set of operations than running the native DOMElement.click() in the browser.
From the comments:
An example of what I'm trying to get is: <div class="class">This is the innerHTML text I want. </div>. On the page, it's text inside a clickable portion of the website. What I want to do is loop through the available options, then click on the ones that match an innerHTML I'm looking for.
Here's a simple example you should be able to extrapolate to your actual use case:
const puppeteer = require("puppeteer"); // ^19.1.0
const {setTimeout} = require("timers/promises");
const html = `
<div>
<div class="class">This is the innerHTML text I want.</div>
<div class="class">This is the innerHTML text I don't want.</div>
<div class="class">This is the innerHTML text I want.</div>
</div>
<script>
document.querySelectorAll(".class").forEach(e => {
e.addEventListener("click", () => e.textContent = "clicked");
});
</script>
`;
const target = "This is the innerHTML text I want.";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
///////////////////////////////////////////
// approach 1 -- trusted Puppeteer click //
///////////////////////////////////////////
const handles = await page.$$(".class");
for (const handle of handles) {
if (target === (await handle.evaluate(el => el.textContent))) {
await handle.click();
}
}
// show that it worked and reset
console.log(await page.$eval("div", el => el.innerHTML));
await page.setContent(html);
//////////////////////////////////////////////
// approach 2 -- untrusted native DOM click //
//////////////////////////////////////////////
await page.$$eval(".class", (els, target) => {
els.forEach(el => {
if (target === el.textContent) {
el.click();
}
});
}, target);
// show that it worked and reset
console.log(await page.$eval("div", el => el.innerHTML));
await page.setContent(html);
/////////////////////////////////////////////////////////////////
// approach 3 -- selecting with XPath and using trusted clicks //
/////////////////////////////////////////////////////////////////
const xp = '//*[@class="class"][text()="This is the innerHTML text I want."]';
for (const handle of await page.$x(xp)) {
await handle.click();
}
// show that it worked and reset
console.log(await page.$eval("div", el => el.innerHTML));
await page.setContent(html);
///////////////////////////////////////////////////////////////////
// approach 4 -- selecting with XPath and using untrusted clicks //
///////////////////////////////////////////////////////////////////
await page.evaluate(xp => {
// https://stackoverflow.com/a/68216786/6243352
const $x = xp => {
const snapshot = document.evaluate(
xp, document, null,
XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
);
return [...Array(snapshot.snapshotLength)]
.map((_, i) => snapshot.snapshotItem(i))
;
};
$x(xp).forEach(e => e.click());
}, xp);
// show that it worked
console.log(await page.$eval("div", el => el.innerHTML));
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Output in all cases is:
<div class="class">clicked</div>
<div class="class">This is the innerHTML text I don't want.</div>
<div class="class">clicked</div>
Note that === might be too strict without calling .trim() on the textContent first. You may want an .includes() substring test instead, although the risk there is that it's too permissive. Or a regex may be the right tool. In short, use whatever makes sense for your use case rather than (necessarily) my === test.
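To make those trade-offs concrete, here is a minimal sketch of the matching strategies as plain predicates. makeTextMatcher and its mode names are hypothetical, not part of Puppeteer. The comparison runs in Node on a string you have already pulled out with something like handle.evaluate(el => el.textContent).

```javascript
// Hypothetical helper: returns a predicate for one matching strategy.
const makeTextMatcher = (target, mode = "exact") => {
  switch (mode) {
    case "exact":
      return text => text === target;
    case "trimmed":
      return text => text.trim() === target.trim();
    case "includes":
      return text => text.includes(target);
    case "regex":
      return text => target.test(text); // target must be a RegExp without the g flag
    default:
      throw new Error(`unknown mode: ${mode}`);
  }
};

const wanted = "This is the innerHTML text I want.";
const padded = "\n  This is the innerHTML text I want.  \n";
const unwanted = "This is the innerHTML text I don't want.";

console.log(makeTextMatcher(wanted)(padded)); // false: whitespace defeats ===
console.log(makeTextMatcher(wanted, "trimmed")(padded)); // true
console.log(makeTextMatcher("text I", "includes")(unwanted)); // true: too permissive
console.log(makeTextMatcher(/^\s*This is .* I want\.\s*$/, "regex")(padded)); // true
```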
With respect to the XPath approach, this answer shows a few options for dealing with whitespace and substrings.
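As one possible illustration, the standard XPath 1.0 functions normalize-space() and contains() can handle whitespace and substrings. xpathForText below is a hypothetical helper, and it assumes the class name and text contain no double quotes, since XPath 1.0 has no escape syntax for them. Note that normalize-space(.) tests the element's whole string value, including descendant text, which is slightly different from the text() test used above.

```javascript
// Hypothetical helper that builds an XPath 1.0 expression as a string.
// Assumption: cls and text contain no double quotes.
const xpathForText = (cls, text, { substring = false } = {}) => {
  if ((cls + text).includes('"')) {
    throw new Error("double quotes are not supported in this sketch");
  }
  const textTest = substring
    ? `contains(normalize-space(.), "${text}")`
    : `normalize-space(.)="${text}"`;
  return `//*[@class="${cls}"][${textTest}]`;
};

console.log(xpathForText("class", "This is the innerHTML text I want."));
// //*[@class="class"][normalize-space(.)="This is the innerHTML text I want."]

console.log(xpathForText("class", "text I want", { substring: true }));
// //*[@class="class"][contains(normalize-space(.), "text I want")]
```

The resulting string can be passed wherever the xp variable is used in the example above.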

Playwright - Find multiple elements or class names

I have read a few different QAs related to this but none seem to be working.
I am trying to target an element (Angular) called mat-radio-button with a class called mat-radio-checked. And then select the inner text.
In Chrome this is easy:
https://i.stack.imgur.com/Ev0iQ.png
https://i.stack.imgur.com/lVoG3.png
To find the first element that matches in Playwright I can do something like:
let test: any = await page.textContent(
"mat-radio-button.mat-radio-checked"
);
console.log(test);
But if I try this:
let test: any = await page.$$(
"mat-radio-button.mat-radio-checked"
);
console.log(test);
console.log(test[0]);
console.log(test[1]);
It does not return an array of elements I can select the inner text of.
I need to be able to find all elements with that class so I can use expect to ensure the returned inner text is correct, eg:
expect(test).toBe("Australian Citizen");
I found out the issue was that the elements were not yet available when the query ran, because the page was still rendering. So I added a waitForSelector:
await page.waitForSelector("mat-radio-button");
const elements = await page.$$("mat-radio-button.mat-radio-checked");
console.log(elements.length);
console.log(await elements[0].innerText());
console.log(await elements[1].innerText());
console.log(await elements[2].innerText());
public async validListOfItems(name) {
await this.page.waitForSelector(name, {timeout: 5000});
const itemlists = await this.page.$$(name);
const allitems: string[] = [];
// a plain for...of is enough here; only innerText() is asynchronous
for (const itemlist of itemlists) {
allitems.push(await itemlist.innerText());
}
return allitems;
}
Make a common method like this and call it with the selector as a parameter.
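The same collection step can be sketched without a browser. collectInnerTexts is a hypothetical helper, and the stub objects (with made-up texts) stand in for Playwright ElementHandles, which expose an async innerText() method.

```javascript
// Works with any objects exposing an async innerText() method.
// Promise.all keeps the results in the same order as the input handles.
const collectInnerTexts = handles =>
  Promise.all(handles.map(handle => handle.innerText()));

// Stubs standing in for the handles returned by page.$$ (texts are made up).
const stubHandles = ["Australian Citizen", "Yes"].map(text => ({
  innerText: async () => text,
}));

collectInnerTexts(stubHandles).then(texts => console.log(texts));
// [ 'Australian Citizen', 'Yes' ]
```

With the real handles, the resulting array could then be checked in one assertion, for example with expect(texts).toEqual([...]).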

get all spans and click them with puppeteer - fails with "node not visible" and other errors

I have an HTML page that has many div elements, each one with the following structure (the input id and name change):
<div class="item">
<div class="box">
<div class="img-block">
<label for="check-11">
<input id="check-11" name="result11" type="checkbox">
<span class="fake-input"></span>
</label>
</div>
</div>
</div>
I want to use puppeteer to get all the spans with the 'fake-input' class and click on them.
The problem is that it never works, no matter what I try.
In every attempt the start is the same:
(async () => {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto(baseUrl, { waitUntil: 'networkidle2' });
// FETCHING AND CLICKING
})();
I tried many things:
1:
await page.waitForSelector('span.fake-input');
await page.click('span.fake-input');
2:
await page.waitForSelector('span.fake-input')
.then(()=>{
console.log(`clicked!`);
page.click('span.fake-input')
});
3:
const spans = await page.evaluate(() => {
return Array.from(document.querySelectorAll('span'), el => el.textContent)
})
console.log('spans', spans)
for (let index = 0; index < 7; index++) {
const element = spans[index];
await page.click('span')
}
4:
await page.evaluate(selector=>{
return document.querySelector(selector).click();
},'span.fake-input')
console.log('clicked');
In every attempt the page either fails to get anything at all (the result is null or undefined, so the error is that "click" is not a function of null) or it fails with the error "Node is either not visible or not an HTMLElement".
Whatever the error, I always fail to fetch all the spans and click on them.
Can anyone tell me what I'm doing wrong?
Use page.$$ to return multiple elements (the equivalent of document.querySelectorAll). Use page.$ to return a single element (the equivalent of document.querySelector).
If you want to extract a certain value from a group of elements, use page.$$eval; for a single element, use page.$eval.
e.g. return element handles to the script
const spans = await page.$$('div.item label .fake-input')
for (const span of spans) await span.click()
If you are extracting a value from an element, pass a callback that returns what you need to extract
e.g.
const spanTexts = await page.$$eval('div.item label .fake-input', spans => {
return spans.map(span => span.innerText)
})
console.log(spanTexts)
I should add that page.$$eval and page.$eval execute your callback in the browser context.
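One detail worth spelling out: click() on an element handle returns a promise, so a loop over handles needs to await each call. Array.prototype.forEach ignores the promises its callback returns. Below is a minimal sketch of sequential clicking; makeStub and clickAll are hypothetical names, and the stubs stand in for Puppeteer ElementHandles so the snippet runs without a browser.

```javascript
// Stub handle standing in for a Puppeteer ElementHandle (hypothetical).
const makeStub = (id, log) => ({
  click: async () => {
    // simulate the round trip to the browser
    await new Promise(resolve => setTimeout(resolve, 5));
    log.push(id);
  },
});

// Clicks the handles one at a time, in order.
const clickAll = async handles => {
  for (const handle of handles) {
    await handle.click(); // each click finishes before the next one starts
  }
};

(async () => {
  const log = [];
  await clickAll([1, 2, 3].map(id => makeStub(id, log)));
  console.log(log); // [ 1, 2, 3 ]
})();
```

With real handles, clickAll(await page.$$(selector)) would follow the same pattern.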
var obj = document.querySelectorAll("span.fake-input");
for(var i=0;i<obj.length;i++){
obj[i].click();
}
Vanilla JavaScript, run in the browser context, would be much easier.

How do I get whole html from Apify Cheerio crawler?

I want to get the whole html not just text.
Apify.main(async () => {
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({
url: //adress,
uniqueKey: makeid(100)
});
const handlePageFunction = async ({ request, $ }) => {
var content_to = $('.class')
};
// Set up the crawler, passing a single options object as an argument.
const crawler = new Apify.CheerioCrawler({
requestQueue,
handlePageFunction,
});
await crawler.run();
});
When I try this, the crawler returns a complex object. I know I can extract the text from the content_to variable using .text(), but I need the whole HTML, including the tags. What should I do?
If I understand you correctly, you could just use .html() instead of .text(). This way you will get the inner HTML of the element instead of its inner text.
Another thing to mention: you could also add body to the handlePageFunction argument object:
const handlePageFunction = async ({ request, body, $ }) => {
body then holds the whole raw HTML of the page.

Puppeteer find array elements in page and then click

Hello, I have a site whose URLs are rendered by JavaScript.
I want to find all script tags on my site, then match on the script src and return only the valid ones.
Next, find the parent of each script and finally click on the link.
This is what I have:
const scripts = await page.$$('script').then(scripts => {
return scripts.map(script => {
if(script.src.indexOf('aaa')>0){
return script
}
});
});
scripts.forEach(script => {
let link = script.parentElement.querySelector('a');
link.click();
});
My problem is that script.src is undefined.
When I remove that condition I get to the forEach loop, but then I get querySelector is undefined. I can write that code in JS in the console in debug mode, but I can't move it to the Puppeteer API.
From the console I get the expected results:
let scripts = document.querySelectorAll('script');
scripts.forEach(script=>{
let el = script.parentElement.querySelector('a');
console.log(el)
})
When you use $$ or $, it returns ElementHandles (a kind of JSHandle), which are not the same as the HTML Node or NodeList you get when you run querySelector inside evaluate. So script.src will always be undefined.
You can use the following instead. $$eval evaluates a selector and passes the matching elements to your callback as an array, and the callback runs in the browser.
await page.$$eval('script', scripts => {
scripts.forEach(script => {
const src = script.getAttribute('src') || '' // inline scripts have no src attribute
const valid = src.indexOf('aaa') > 0 // do some checks
const link = valid && script.parentElement.querySelector('a') // the nearby anchor element, if the check passed
if (link) link.click() // click if it exists
})
})
There are other ways to achieve this, but I merged everything into one call. That is, if it works in the browser, you can also use .evaluate, run the exact same code, and get the same result.
page.evaluate(() => {
let scripts = document.querySelectorAll('script');
scripts.forEach(script => {
let el = script.parentElement.querySelector('a');
console.log(el) // it won't show on your node console, but on your actual browser when it is running;
if (el) el.click(); // skip scripts whose parent contains no link
})
})
