I've been scraping a bunch of websites lately and some of them work just fine, while on others I encounter problems.
```
const { url, selectors } = WebsiteUtils

const getData = async (url) => {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()
  await page.setViewport({
    height: 20000,
    width: 1500,
  })
  await page.goto(url, { waitUntil: 'networkidle2' })
  const link = await page.$(selectors.link)
  return link.innerHTML
}

console.log(await getData(url))
```
In this case, selectors.link is literally just 'ul', as there are a bunch of them in the page. I already tried waiting for the selector (.waitForSelector), I used page.evaluate instead of .$, and even added a 5-second timeout to make sure everything was loaded. The HTML is not even inside an iframe; it's a really straightforward page. At this point I'm pretty much out of options.
The output here is undefined, and if I just log link I get ElementHandle {}.
Update:
I figured this out. Unlike what I previously wrote, iframes were indeed present in the page, although they didn't show up in the HTML in devtools.
I spotted them by running page.frames() and accessed the content of the individual frames with frame.$eval().
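For anyone landing here with the same symptom, a minimal sketch of that approach (the frame-matching test is an illustrative assumption; match on whatever identifies your frame):

```
// List every frame in the page, including ones that don't show up in devtools' HTML.
const frames = page.frames();
// Pick the frame that holds the content (the URL test here is a placeholder).
const frame = frames.find(f => f.url().includes('widget'));
// Query inside the frame instead of the top-level document.
const html = await frame.$eval('ul', el => el.innerHTML);
```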
I'm attempting to iterate through a list of URLs, and instead of puppeteer loading each page, it only loads one. What can I do to make this work?
```
async function main() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.setViewport({width: 1200, height: 720});
  await page.goto('https://s23.a2zinc.net/clients/acmedia/americancoatingsshow2022/Public/Exhibitors.aspx?Index=All#', { waitUntil: 'networkidle0' }); // wait until page load
  const hrefs = await page.$$eval('a', as => as.map(a => a.href));
  for (let i = 0; i < hrefs.length; i++) {
    const url = hrefs[i];
    if (url.includes('eBooth.aspx')) {
      console.log(url)
      const page = await browser.newPage()
      await page.goto(`${url}`);
      await page.waitForNavigation({ waitUntil: 'networkidle0' });
    }
  }
}
main();
```
The main problem is an extra await page.waitForNavigation({ waitUntil: 'networkidle0' }); that will fail to resolve. page.goto already waits for navigation, so you're asking Puppeteer to wait for a navigation that will never happen.
Only use page.waitForNavigation if you're doing something to trigger a navigation, not as part of a typical page.goto call. Remove this line and your code should work (more or less) as expected.
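For reference, a sketch of the situation where page.waitForNavigation does belong, paired with the action that triggers the navigation (the selector is purely illustrative):

```
// Start waiting for the navigation *before* the click that causes it,
// so the navigation can't slip through between the two calls.
await Promise.all([
  page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
  page.click('a.some-link'), // hypothetical link that navigates
]);
```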
Furthermore, you're opening a whole new page (browser tab) per link. That's 360 tabs by my count, liable to run most computers out of memory. Better to navigate a single page repeatedly or close pages after you're finished doing whatever you plan to do on these pages. If that's too slow, try running chunks in parallel or using a task queue.
Also, the links are available in the static HTML, so you might not need Puppeteer here, again, depending on what you're planning on doing on each page. If you can get all of the data from each page statically, you could have a massive speedup, completing 360 scrapes with fetch/cheerio in a fraction of the time it'd take Puppeteer.
If you do stick with Puppeteer to bypass detection or deal with JS/interactivity, consider using domcontentloaded rather than networkidle0, which is usually unnecessarily strict and slow. See my answer in the canonical thread Puppeteer wait until page is completely loaded for a deeper dive into the various load conditions in Puppeteer.
a[href] is a more precise selector than a, because it's possible that some a anchors have no href and should be discarded to avoid undefineds popping up.
Here's how I'd write this (with the aforementioned caveat that Puppeteer might not be needed at all):
```
const puppeteer = require("puppeteer"); // ^14.3.0

let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  await page.setViewport({width: 1200, height: 720});
  const url = "https://s23.a2zinc.net/clients/acmedia/americancoatingsshow2022/Public/Exhibitors.aspx?Index=All#";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const hrefs = await page.$$eval("a[href]", els =>
    els.map(a => a.href).filter(e => e.includes("eBooth.aspx"))
  );
  console.log(hrefs.length); // => 360
  for (const url of hrefs) {
    await page.goto(url);
    // page is loaded; do your thing on this page
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
```
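And if the static-HTML route mentioned above works for your data, a rough sketch with fetch and cheerio (assumes Node 18+ for global fetch; untested against this site beyond the link extraction shown):

```
const cheerio = require("cheerio");

(async () => {
  const url = "https://s23.a2zinc.net/clients/acmedia/americancoatingsshow2022/Public/Exhibitors.aspx?Index=All#";
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);
  const hrefs = $("a[href]")
    .toArray()
    .map(el => $(el).attr("href"))
    .filter(href => href.includes("eBooth.aspx"));
  // Note: these may be relative; resolve with new URL(href, url).href if needed.
  console.log(hrefs.length);
})();
```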
Is it possible to target an iframe when using the GUI Workflow builder in AWS CloudWatch Synthetics?
I've set up the canary to log in to a website and redirect the page which has run successfully, but one of the elements I need to check with Node.js is within an iFrame which isn't being recognised.
This is the iframe code. It loads from JavaScript, but all content is from the same domain:

```
<iframe id="paramsFrame" src="empty.htm" frameborder="0" ppTabId="-1"
  onload="paramsDocumentLoaded('paramsFrame', true);"></iframe>
```
This is the code I'm using for this section, but it's just returning a timeout error:
```
await synthetics.executeStep('verifyText', async function () {
  const elementHandle = await page.waitForSelector('#paramsFrame');
  const frame = await elementHandle.contentFrame();
  await frame.waitForXPath("//div[@class='css7'][contains(text(),'Specificity')]", { timeout: 30000 });
});
```
This code is trying to target a div with class css7 found within an iframe with id paramsFrame
Edit: I did a null check on frame and it came back as not null; not sure if that's relevant.
I also tried to target an element directly:

```
const next = await frame.waitForSelector('.protocol-name-link');
```

but I got the error message:

```
TimeoutError: waiting for selector .protocol-name-link
```
If the iframe is on a different origin (e.g. different domain), you cannot access it through Puppeteer.
You can try to disable some of the browser's security features, although this is not advised.
Specifically, you'd probably want to add these args to puppeteer.launch:

```
--disable-web-security
--disable-features=IsolateOrigins,site-per-process
```
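Put together, a minimal launch sketch with those flags (only worth trying if the frame really is cross-origin):

```
const browser = await puppeteer.launch({
  args: [
    "--disable-web-security",
    "--disable-features=IsolateOrigins,site-per-process",
  ],
});
```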
I tried running similar code on a website which had a YouTube iframe, and I didn't need the Puppeteer launch args, i.e.:

```
args: [
  "--disable-web-security",
  "--disable-features=IsolateOrigins,site-per-process",
],
```

But first, I would suggest confirming that the iframe you're targeting is really the one you need, maybe by logging, debugging, or just poking around in the dev console.
Second, try using the full XPath of the element in the frame.
Here is the code I tried running:

```
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  console.log("open page");
  await page.goto("https://captioncrusher.com/");
  console.log("page opened");
  // Use this if you want to wait for all the requests to be done.
  // await page.waitForNetworkIdle();
  const elementHandle = await page.waitForSelector("iframe.yt");
  const frame = await elementHandle.contentFrame();
  // These both work for me:
  const aLink = await frame.waitForXPath("/html/body/div/div/a");
  const classLink = await frame.waitForSelector(".ytp-impression-link");
  await browser.close();
})();
```
Sometimes Puppeteer does not type in some input fields. To be specific, I tried to simply type something into the input field of the website https://webtor.io/, which has a single huge input field. I hope someone could help me with that specific example.
```
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://webtor.io/');
  await page.type(`input[type="text"]`, 'something', { delay: 50 })
})();
```
This happens because when you go to the page, it still has to render the HTML and load its scripts; sometimes the text input hasn't loaded yet by the time you call page.type, hence the failure.

```
await page.goto('https://webtor.io/', { waitUntil: 'networkidle0' });
```

Check out this link for further details:
Puppeteer wait until page is completely loaded
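An alternative is to wait for the field itself rather than for the whole network to settle; a minimal sketch, reusing the selector from the question:

```
await page.goto('https://webtor.io/');
// Block until the input actually exists before typing into it.
await page.waitForSelector('input[type="text"]');
await page.type('input[type="text"]', 'something', { delay: 50 });
```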
The problem was resolved by adding a cookie from an actual browser.
I'm trying to get half-price products from this website: https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials. The website is rendered by AngularJS, so I'm trying to use Puppeteer for data scraping.
With headless: false, just a blank page shows up; with headless: true, it throws an exception (see the attached image: "Error while running with headless browser").
```
const puppeteer = require('puppeteer');

async function getProductNames() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setViewport({ width: 1000, height: 926 });
  await page.goto("https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials");
  await page.waitForSelector('.product-name');
  console.log("Begin to evaluate JS");
  const productNames = await page.evaluate(() => {
    const divs = document.querySelectorAll('.product-name');
    console.log(divs);
    const productnames = [];
    // leave it blank for now
    return productnames;
  });
  console.log(productNames);
  await browser.close();
}
getProductNames();
```
P.S.: While looking into this issue, I figured out that the web page actually console.logs the data of each page, but I can't trace the request. If you can show me how, that would be great.
(screenshot: the web page's console logging the data)
Try adding the options parameter to the page.goto(url[, options]) method:

```
page.goto("https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials", { waitUntil: 'networkidle2' })
```

It will consider navigation to be finished only when there are no more than 2 network connections for at least 500 ms.
You can refer to the documentation for the options object here: Goto Options parameter
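As for the P.S., you can relay the page's own console output to Node and fill in the evaluate callback; a rough sketch (the textContent extraction is an assumption about the markup):

```
// Relay the page's console messages (the ones mentioned in the P.S.) to Node.
page.on('console', msg => console.log('PAGE LOG:', msg.text()));

const productNames = await page.evaluate(() =>
  [...document.querySelectorAll('.product-name')].map(el => el.textContent.trim())
);
console.log(productNames);
```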
I am trying to get a new tab and scrape the title of that page with Puppeteer.
This is what I have:
```
// use puppeteer
const puppeteer = require('puppeteer');

// set wait length in ms: 1000ms = 1sec
const short_wait_ms = 1000;

async function run() {
  const browser = await puppeteer.launch({ headless: false, timeout: 0 });
  const page = await browser.newPage();
  await page.goto('https://biologyforfun.wordpress.com/2017/04/03/interpreting-random-effects-in-linear-mixed-effect-models/');

  // second page DOM elements
  const CLICKHERE_SELECTOR = '#post-2068 > div > div.entry-content > p:nth-child(2) > a:nth-child(1)';

  // main page
  await page.waitFor(short_wait_ms);
  await page.click(CLICKHERE_SELECTOR);

  // new tab opens - move to new tab
  let pages = await browser.pages();
  // go to the newly opened page
  // console.log title -- Generalized Linear Mixed Models in Ecology and in R
}
run();
```
I can't figure out how to use browser.pages() to start working on the new page.
According to the Puppeteer documentation, page.title() returns <Promise<string>>, the page's title, and is a shortcut for page.mainFrame().title().
Therefore, you should use page.title() for getting the title of the newly opened page.
Alternatively, you can gain a slight performance boost by using the following:

```
page._frameManager._mainFrame.evaluate(() => document.title)
```
Note: Make sure to use the await operator when calling page.title(), as the title tag must be downloaded before Puppeteer can access its content.
You shouldn't need to move to the new tab.
To get the title of any page you can use:
const pageTitle = await page.title();
Also, after you click something and are waiting for the new page to load, you should wait for the load event or for the network to be idle:

```
// Wait for redirection
await page.waitForNavigation({ waitUntil: 'networkidle0' });
```
Check the docs: https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagewaitfornavigationoptions
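If you do want a handle on the new tab itself rather than reusing the original page, one approach (a sketch, not from either answer above) is to wait for the target that the click opens:

```
// Start listening for the new tab before clicking, so it can't be missed.
const [newTarget] = await Promise.all([
  browser.waitForTarget(t => t.opener() === page.target()),
  page.click(CLICKHERE_SELECTOR),
]);
const newPage = await newTarget.page();
await newPage.waitForSelector('title'); // make sure the document has a title to read
console.log(await newPage.title());
```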