Background
I am running Puppeteer in an application locally, and it works fine. When I move it to a production Debian server, it times out at the page.goto(url) call.
Example
I have tried a bunch of different suggestions from online. The example below shows a few of the suggested options. I have tried each of them alone and in different combinations with each other. Yes, I am that desperate at this point.
const browser = await puppeteer.launch({
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--ignore-certificate-errors',
'--ignore-certificate-errors-spki-list',
'--user-data-dir']});
const page = await browser.newPage();
await page.goto(
`https://example.com/${template}?data=${JSON.stringify(req.body)}`, {waitUntil: 'networkidle0'}
);
page.goto(url) works locally, but fails when running on server.
Question
Why is page.goto() failing on the server, and is there any workaround?
page.setDefaultNavigationTimeout is your option
const browser = await puppeteer.launch({
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--ignore-certificate-errors',
'--ignore-certificate-errors-spki-list',
'--user-data-dir']});
const page = await browser.newPage();
page.setDefaultNavigationTimeout(3600000); // 1 hour (the timeout is in milliseconds)
await page.goto(
`https://example.com/${template}?data=${JSON.stringify(req.body)}`, {waitUntil: 'networkidle2'}
);
reference https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagesetdefaultnavigationtimeouttimeout
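Since the timeout is given in milliseconds, it can help to spell the units out. Here is a minimal sketch, assuming a hypothetical helper named gotoWithTimeout that receives an already-created page; the one-minute default is an arbitrary choice of mine, not something from the docs. A longer timeout only helps if the page is slow rather than unreachable from the server.

```javascript
// Puppeteer timeouts are expressed in milliseconds.
const SECOND_MS = 1000;
const MINUTE_MS = 60 * SECOND_MS;

// Hypothetical helper (not part of Puppeteer): raise the navigation
// timeout on `page`, then navigate. `page` is expected to be a Puppeteer
// Page, or any object exposing the same two methods.
async function gotoWithTimeout(page, url, timeoutMs = MINUTE_MS) {
  page.setDefaultNavigationTimeout(timeoutMs);
  return page.goto(url, {waitUntil: "networkidle2"});
}
```

Usage would look like await gotoWithTimeout(page, url, 2 * MINUTE_MS); inside the existing async code.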
Related
I'm attempting to iterate through a list of URLs, and instead of puppeteer loading each page, it only loads one. What can I do to make this work?
async function main() {
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.setViewport({width: 1200, height: 720});
await page.goto('https://s23.a2zinc.net/clients/acmedia/americancoatingsshow2022/Public/Exhibitors.aspx?Index=All#', { waitUntil: 'networkidle0' }); // wait until page load
const hrefs = await page.$$eval('a', as => as.map(a => a.href));
for (let i = 0; i < hrefs.length; i++) {
const url = hrefs[i];
if (url.includes('eBooth.aspx')) {
console.log(url)
const page = await browser.newPage()
await page.goto(`${url}`);
await page.waitForNavigation({ waitUntil: 'networkidle0' });
}
}
}
main();
The main problem is an extra await page.waitForNavigation({ waitUntil: 'networkidle0' }); that will fail to resolve. page.goto already waits for navigation, so you're asking Puppeteer to wait for a navigation that will never happen.
Only use page.waitForNavigation if you're doing something to trigger a navigation, not as part of a typical page.goto call. Remove this line and your code should work (more or less) as expected.
Furthermore, you're opening a whole new page (browser tab) per link. That's 360 tabs by my count, liable to run most computers out of memory. Better to navigate a single page repeatedly or close pages after you're finished doing whatever you plan to do on these pages. If that's too slow, try running chunks in parallel or using a task queue.
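As a rough illustration of the "chunks in parallel" idea, here is a minimal sketch. The names chunk and scrapeInChunks are hypothetical helpers of mine, and visit stands for whatever you plan to do on each page; the batch size is something you would tune for your machine.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical helper: visit `urls` in batches of `size`, one tab per URL,
// closing each tab when `visit` finishes (or throws). At most `size` tabs
// opened by this function exist at any one time.
async function scrapeInChunks(browser, urls, size, visit) {
  const results = [];
  for (const batch of chunk(urls, size)) {
    const batchResults = await Promise.all(
      batch.map(async url => {
        const page = await browser.newPage();
        try {
          await page.goto(url, {waitUntil: "domcontentloaded"});
          return await visit(page, url);
        } finally {
          await page.close();
        }
      })
    );
    results.push(...batchResults);
  }
  return results;
}
```

For example, await scrapeInChunks(browser, hrefs, 5, page => page.title()) would collect the titles five pages at a time. Note that Promise.all rejects as soon as one page fails, so you may prefer Promise.allSettled if individual failures should not abort a batch.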
Also, the links are available in the static HTML, so you might not need Puppeteer here, again, depending on what you're planning on doing on each page. If you can get all of the data from each page statically, you could have a massive speedup, completing 360 scrapes with fetch/cheerio in a fraction of the time it'd take Puppeteer.
If you do stick with Puppeteer to bypass detection or deal with JS/interactivity, consider using domcontentloaded rather than networkidle0, which is usually unnecessarily strict and slow. The blog post linked explains the difference between the various loading conditions. See also my answer in the canonical thread Puppeteer wait until page is completely loaded for a deeper dive into page loading in Puppeteer.
a[href] is a more precise selector than a, because it's possible that some a anchors have no href and should be discarded to avoid undefineds popping up.
Here's how I'd write this (with the aforementioned caveat that Puppeteer might not be needed at all):
const puppeteer = require("puppeteer"); // ^14.3.0
let browser;
(async () => {
browser = await puppeteer.launch({headless: false});
const [page] = await browser.pages();
await page.setViewport({width: 1200, height: 720});
const url = "https://s23.a2zinc.net/clients/acmedia/americancoatingsshow2022/Public/Exhibitors.aspx?Index=All#";
await page.goto(url, {waitUntil: "domcontentloaded"});
const hrefs = await page.$$eval("a[href]", els =>
els.map(a => a.href).filter(e => e.includes("eBooth.aspx"))
);
console.log(hrefs.length); // => 360
for (const url of hrefs) {
await page.goto(url);
// page is loaded; do your thing on this page
}
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Sometimes Puppeteer does not type into some input fields. To be specific, I tried to simply type something into the input field on the website "https://webtor.io/", which has a single huge input field. I hope someone can help me with that specific example.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://webtor.io/');
await page.type(`input[type="text"]`, 'something', { delay: 50 })
})();
This happens because, when you navigate to the page, it still has to render the HTML and load its scripts. That causes a delay, and sometimes the text input has not loaded yet when page.type runs, hence the failure. Try waiting for the network to go idle:
await page.goto('https://webtor.io/', {waitUntil: 'networkidle0'});
Check out this link for further details.
Puppeteer wait until page is completely loaded
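Another option is to wait for the specific element instead of waiting for the network to go quiet. Here is a minimal sketch, assuming a hypothetical helper named typeWhenReady and assuming the selector from the question really matches the field; I have not verified it against that site.

```javascript
// Hypothetical helper (not part of Puppeteer): wait until `selector` is
// present and visible, then type `text` into it with a per-key delay.
async function typeWhenReady(page, selector, text, delay = 50) {
  await page.waitForSelector(selector, {visible: true});
  await page.type(selector, text, {delay});
}
```

In the question's code this would replace the page.type line: await typeWhenReady(page, 'input[type="text"]', 'something');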
The problem was resolved by adding a cookie from an actual browser.
I'm trying to get half-price products from this website https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials. The website is rendered by AngularJS so I'm trying to use puppeteer for data scraping.
When headless is false, just a blank page shows up.
When headless is true, it throws an exception (screenshot: "Error while running with headless browser").
const puppeteer = require('puppeteer');
async function getProductNames(){
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.setViewport({ width: 1000, height: 926 });
await page.goto("https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials");
await page.waitForSelector('.product-name')
console.log("Begin to evaluate JS")
var productNames = await page.evaluate(() => {
var div = document.querySelectorAll('.product-name');
console.log(div)
var productnames = []
// leave it blank for now
return productnames
})
console.log(productNames)
browser.close()
}
getProductNames();
P.S. While looking into this issue, I figured out that the web page actually logs the data for each page with console.log, but I can't trace the request. If you can show me how, that would be great.
Try adding the options parameter to the page.goto(url[, options]) method:
page.goto("https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials", { waitUntil: 'networkidle2' })
It will consider navigation to be finished only when there are no more than 2 network connections for at least 500 ms.
You can refer to the documentation for the fields of the options object here: Goto Options parameter
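On the P.S. about the data the page writes to its console: Puppeteer emits a "console" event on the page for each message, so the output can be captured from Node. Here is a minimal sketch, assuming a hypothetical helper named collectConsole; it only records the message type and text. Objects logged by the page may show up as handle descriptions rather than their contents, in which case msg.args() would need unpacking, which this sketch does not do.

```javascript
// Hypothetical helper (not part of Puppeteer): collect everything the page
// writes to its console. Attach it before page.goto so that early messages
// are not missed. The returned array fills up as messages arrive.
function collectConsole(page) {
  const messages = [];
  page.on("console", msg => {
    messages.push({type: msg.type(), text: msg.text()});
  });
  return messages;
}
```

Call const messages = collectConsole(page); right after browser.newPage(), then inspect messages once the page has loaded.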
I want to start a headless Chromium browser instance, do some automated operations, and then make it visible before doing the rest of the work.
Is this possible to do using Puppeteer, and if it is, can you tell me how? And if it is not, is there any other framework or library for browser automation that can do this?
So far I've tried the following but it didn't work.
const browser = await puppeteer.launch({'headless': false});
browser.headless = true;
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});
Short answer: It's not possible
Chrome can only be started in either headless or non-headless mode. You have to specify the mode when you launch the browser, and it is not possible to switch at runtime.
What is possible, is to launch a second browser and reuse cookies (and any other data) from the first browser.
Long answer
You would assume that you could just reuse the data directory when calling puppeteer.launch, but this is currently not possible due to multiple bugs (#1268, #1270 in the puppeteer repo).
So the best approach is to save any cookies or local storage data that you need to share between the browser instances and restore that data when you launch the second browser. You then visit the website a second time. Be aware that any state the website holds in JavaScript variables will be lost when you recrawl the page.
Process
Summing up, the whole process should look like this (or vice versa for headless to headful):
Crawl in non-headless mode until you want to switch mode
Serialize cookies
Launch or reuse second browser (in headless mode)
Restore cookies
Revisit page
Continue crawling
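The "serialize cookies" and "restore cookies" steps could be sketched roughly as follows. copyCookies is a hypothetical helper of mine; it relies on page.cookies() and page.setCookie(), and it only carries over cookies for the URL the first page is currently on, not local storage or any in-page JavaScript state.

```javascript
// Hypothetical helper (not part of Puppeteer): copy the cookies visible to
// `fromPage` into `toPage`. page.cookies() returns the cookies for the
// page's current URL; page.setCookie() takes them as separate arguments.
// Returns the number of cookies copied.
async function copyCookies(fromPage, toPage) {
  const cookies = await fromPage.cookies();
  if (cookies.length > 0) {
    await toPage.setCookie(...cookies);
  }
  return cookies.length;
}
```

The idea is to call await copyCookies(firstPage, secondPage) after opening a page in the second browser and before calling secondPage.goto(url), so the revisit is made with the restored cookies.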
As mentioned, this isn't currently possible since the headless switch occurs via Chromium launch flags.
I usually do this with userDataDir, which the Chromium docs describe as follows:
The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state.
Here's a simple example. This launches a browser headlessly, sets a local storage value on an arbitrary page, closes the browser, re-opens it headfully, retrieves the local storage value and prints it.
const puppeteer = require("puppeteer"); // ^18.0.4
const url = "https://www.example.com";
const opts = {userDataDir: "./data"};
let browser;
(async () => {
{
browser = await puppeteer.launch({...opts, headless: true});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.evaluate(() => localStorage.setItem("hello", "world"));
await browser.close();
}
{
browser = await puppeteer.launch({...opts, headless: false});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
const result = await page.evaluate(() => localStorage.getItem("hello"));
console.log(result); // => world
}
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Change const opts = {userDataDir: "./data"}; to const opts = {}; and you'll see null print instead of world; the user data doesn't persist.
The answer from a few years ago mentions issues with userDataDir and suggests a cookies solution. That's fine, but I haven't had any issues with userDataDir so either they've been resolved on the Puppeteer end or my use cases haven't triggered the issues.
There's a useful-looking answer from a reputable source in How to turn headless on after launch? but I haven't had a chance to try it yet.
I'm wondering how I can generate a PDF using headless Chrome (for example, with Puppeteer). It seems like a good PDF maker, but only on Chrome using @media print. So here is my question:
Can I generate a PDF with Puppeteer on another browser (IE, Mozilla) too? I think I could if I only wanted to print a static page with no inputs. But what if the page has inputs that users fill in and save in IE? Can I still use this somehow?
OK, I downloaded Puppeteer. Here's the code I have:
$scope.aClick = function(){
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('/vUrl_form.html', {waitUntil: 'networkidle'});
await page.pdf({path: 'images/asd.pdf', format: 'A4'});
browser.close();
})();
};
and this still doesn't work (I don't know why, but the app won't run).
No - Puppeteer only works with Chromium/Chrome.
Unfortunately Puppeteer only works with Chromium/Chrome.
If you want to use Headless Mozilla Firefox, you might consider checking this out https://developer.mozilla.org/en-US/Firefox/Headless_mode .
If you still want to use Puppeteer, here is a working snippet that creates a .pdf file:
const puppeteer = require('puppeteer');
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
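// Older Puppeteer releases accepted 'networkidle'; current ones expect
// 'networkidle0' or 'networkidle2'.
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
// page.pdf() is currently supported only in headless mode.
// @see https://bugs.chromium.org/p/chromium/issues/detail?id=753118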
await page.pdf({
path: 'hn.pdf',
format: 'letter'
});
await browser.close();
})();
Today it's possible to use Firefox with Puppeteer: https://firefox-puppeteer.readthedocs.io/en/master/ Maybe it wasn't when the earlier answers were written. However, I can't find any URL-to-PDF functionality there, just screenshots.