I am trying to build a simple scraper for this website (speisekarte.de) using Puppeteer.
The code is as follows:
const browser = await puppeteer.launch({
headless: false
});
const page = await browser.newPage();
let pagelink = "https://www.speisekarte.de/berlin/restaurants?page=1"
await page.waitFor(3 * 1000);
await page.goto(pagelink);
await page.waitFor(3 * 1000);
await page.waitForSelector("#notice")
However, I cannot access the cookie-consent overlay, which should have the id "notice".
await page.waitForSelector("#notice") does not work in my Puppeteer code.
Nor does document.getElementById("notice") when I use Chromium's console manually during the session. It also does not work when I run it in Firefox's console. Oddly enough, queries like
document.querySelectorAll("button")
work as expected. I checked with a colleague and she can access the element using the queries above in both her Chrome and her Firefox; she also uses a Mac. Any idea what is happening here? Any help would be much appreciated.
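One thing worth checking (an assumption on my part, not something established in this thread) is whether the consent banner is rendered inside an iframe, or only injected for visitors who have not yet stored a consent cookie. Elements inside a child frame are invisible to document.getElementById on the top-level document, and page.waitForSelector only searches the main frame. A quick sketch for scanning every frame:

// Sketch: look for #notice in every frame, not just the main document.
// Assumes `page` is the Page object from the code above, after page.goto().
for (const frame of page.frames()) {
  const notice = await frame.$('#notice');
  console.log(frame.url(), notice ? 'has #notice' : 'no #notice');
}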
Once I open a Puppeteer browser page and navigate to a specific site, if the proxy used for the browser gets banned, I want to be able to switch it without closing or restarting the browser; refreshing the page would be the best option. However, I have not yet been able to figure out how to do it. I've tried making a variable out of '--proxy-server=xxxx' and switching that variable whenever the proxy is banned, but that didn't work out. I've tried many other things too, but have yet to figure it out. Any kind of help would be much appreciated.
var proxies = get_proxy() // get proxy
var useragent = randomUseragent.getRandom()
let browser = await puppeteer.launch({
headless: false,
args: [
`--proxy-server=${proxies.address}:${proxies.port}`,
`--user-agent=${useragent}`,
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-infobars',
'--window-position=0,0',
'--ignore-certificate-errors',
'--ignore-certificate-errors-spki-list'
]
})
let page = await browser.newPage()
Example of how I want it to work:
const proxy_banned = await check_proxy_ban()
if (proxy_banned) {
  // switch the Puppeteer browser's proxy
  // refresh the page
  // return from the function
} else {
  // return from the function
}
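Chromium's --proxy-server flag cannot be changed once the browser process is running, so one pattern that might fit the flow above (my own suggestion, not something from this post) is to point the browser at a local forwarding proxy and swap that proxy's upstream whenever the current one is banned. A sketch using the proxy-chain package, where get_proxy() and check_proxy_ban() are the poster's hypothetical helpers:

const puppeteer = require('puppeteer');
const ProxyChain = require('proxy-chain');

let upstream = get_proxy(); // hypothetical helper from the question

// Local proxy that forwards every request to whatever `upstream` currently is.
const server = new ProxyChain.Server({
  port: 8000,
  prepareRequestFunction: () => ({
    upstreamProxyUrl: `http://${upstream.address}:${upstream.port}`,
  }),
});

(async () => {
  await new Promise((resolve) => server.listen(resolve));

  const browser = await puppeteer.launch({
    headless: false,
    args: ['--proxy-server=127.0.0.1:8000'], // this flag never changes
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  if (await check_proxy_ban()) { // hypothetical helper from the question
    upstream = get_proxy();      // switch the upstream proxy...
    await page.reload();         // ...and just refresh the page
  }
})();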
This is my first time using Puppeteer, and I want to open a Google Chrome page and navigate to a Chrome extension I have installed. I try to enable the extension, but when I run my script in headless: false mode, the browser pops up without it.
My code:
//my extension path
const StayFocusd = 'C:\\Users\\vasilis\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\Extensions\\laankejkbhbdhmipfmgcngdelahlfoji\\1.6.0_0';
async function run(){
//this is where I try to enable my extension
const browser = await puppeteer.launch({
headless: false,
ignoreDefaultArgs: [`--disable-extensions-except=${StayFocusd}`,"--enable-automation"],
}
);
const page = await browser.newPage();
sleep(3000);
await browser.close();
}
run();
So the extension does not load, and I get no error or anything. I would appreciate your help.
It is not enough to set the --disable-extensions-except launch flag with your extension path; you should also pass --load-extension to actually load your extension in the opened browser instance.
You also seem to have made a mistake using ignoreDefaultArgs where you should have used args (this way Chromium did literally the opposite of what you expected).
Correct usage of puppeteer.launch:
const browser = await puppeteer.launch({
headless: false,
args: [
`--disable-extensions-except=${StayFocusd}`,
`--load-extension=${StayFocusd}`,
'--enable-automation'
]
})
You can make use of the official docs about Working with Chrome Extensions.
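As a quick sanity check (my addition, not part of the original answer), you can wait for the extension's background page to show up among the browser targets, roughly as those docs describe:

// Sketch: confirm the extension loaded by waiting for its background page
// target (assumes a Manifest V2 extension that has a background page).
const extensionTarget = await browser.waitForTarget(
  (target) => target.type() === 'background_page',
  { timeout: 5000 }
);
console.log('Extension background page:', extensionTarget.url());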
I am currently developing a Node.js script that needs to launch a headful Chromium instance using Puppeteer and then take a screenshot of a page every 3 seconds. This is my code:
const puppeteer = require('puppeteer');
async function init (){
const browser = await puppeteer.launch({headless: false}); // headful, as described above
const page = await browser.newPage();
await page.goto('https://example.com');
screenshot(page)
};
async function screenshot(page){
let buffer = await page.screenshot();
let imageBuffer = buffer.toString('base64');
// save imageBuffer to database
setTimeout(screenshot, 3000, page)
}
init();
My current issue is that the user still needs to be able to navigate normally in the browser and on their computer, but this is impossible because:
The page lags while the screenshot is taken, as you can see in the following video: https://youtu.be/Tl2w-qKckkc
The browser window takes focus and jumps on top of all other windows whenever the screenshot is taken.
I also tried Playwright, but the same bug occurs when using it with Chromium. Can someone please help?
In Playwright, do the following:
// Affects all platforms.
const page = await browser.newPage({ viewport: null });

// Or, as a local fix for those using Apple hardware with Retina displays:
const page = await browser.newPage({ deviceScaleFactor: 2 });
I posted a detailed reply at https://github.com/microsoft/playwright/issues/2576. Please feel free to follow up and ask questions / request features there!
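If staying with Puppeteer, one workaround that may be worth trying (my own suggestion, not something from this thread) is to capture via a raw DevTools protocol session: page.screenshot() brings the tab to the front before capturing, which matches the focus behaviour described in the question, while Page.captureScreenshot over a plain CDP session may avoid that. A rough sketch:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Raw DevTools session; Page.captureScreenshot does not activate the tab
  // the way page.screenshot() does.
  const client = await page.target().createCDPSession();

  const screenshot = async () => {
    const { data } = await client.send('Page.captureScreenshot', { format: 'png' });
    // `data` is a base64-encoded string; save it to the database here.
    setTimeout(screenshot, 3000);
  };
  screenshot();
})();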
I want to start a Chromium browser instance headless, do some automated operations, and then turn it visible before doing the rest of the work.
Is this possible with Puppeteer, and if it is, can you tell me how? And if it is not, is there another framework or library for browser automation that can do this?
So far I've tried the following, but it didn't work.
const browser = await puppeteer.launch({'headless': false});
browser.headless = true;
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});
Short answer: It's not possible
Chrome can only be started in either headless or non-headless mode. You have to specify this when you launch the browser, and it is not possible to switch at runtime.
What is possible is to launch a second browser and reuse the cookies (and any other data) from the first one.
Long answer
You would assume that you could just reuse the data directory when calling puppeteer.launch, but this is currently not possible due to multiple bugs (#1268, #1270 in the puppeteer repo).
So the best approach is to save any cookies or local storage data that you need to share between the browser instances and restore that data when you launch the second browser. You then visit the website a second time. Be aware that any state the website holds in JavaScript variables will be lost when you recrawl the page.
Process
Summing up, the whole process should look like this (or the reverse when going from headless to headful); a sketch follows the list:
Crawl in non-headless mode until you want to switch mode
Serialize cookies
Launch or reuse second browser (in headless mode)
Restore cookies
Revisit page
Continue crawling
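A minimal sketch of that cookie hand-off (my own illustration of the steps above; localStorage and sessionStorage would need separate handling via page.evaluate):

const puppeteer = require('puppeteer');

(async () => {
  // 1. Crawl in non-headless mode.
  const first = await puppeteer.launch({ headless: false });
  let page = await first.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // 2. Serialize cookies.
  const cookies = await page.cookies();
  await first.close();

  // 3. Launch the second browser in headless mode.
  const second = await puppeteer.launch({ headless: true });
  page = await second.newPage();

  // 4. Restore the cookies and 5. revisit the page.
  await page.setCookie(...cookies);
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // 6. Continue crawling...
  await second.close();
})();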
As mentioned, this isn't currently possible since the headless switch occurs via Chromium launch flags.
I usually do this with userDataDir, which the Chromium docs describe as follows:
The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state.
Here's a simple example. This launches a browser headlessly, sets a local storage value on an arbitrary page, closes the browser, re-opens it headfully, retrieves the local storage value and prints it.
const puppeteer = require("puppeteer"); // ^18.0.4
const url = "https://www.example.com";
const opts = {userDataDir: "./data"};
let browser;
(async () => {
{
browser = await puppeteer.launch({...opts, headless: true});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.evaluate(() => localStorage.setItem("hello", "world"));
await browser.close();
}
{
browser = await puppeteer.launch({...opts, headless: false});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
const result = await page.evaluate(() => localStorage.getItem("hello"));
console.log(result); // => world
}
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Change const opts = {userDataDir: "./data"}; to const opts = {}; and you'll see null print instead of world; the user data doesn't persist.
The answer from a few years ago mentions issues with userDataDir and suggests a cookies solution. That's fine, but I haven't had any issues with userDataDir so either they've been resolved on the Puppeteer end or my use cases haven't triggered the issues.
There's a useful-looking answer from a reputable source in How to turn headless on after launch? but I haven't had a chance to try it yet.
I am wondering how I can generate a PDF using Chrome Headless (for example, with Puppeteer). It seems like a good PDF maker, but only in Chrome using @media print. So here is my question:
Can I generate a PDF with Puppeteer in another browser (e.g., Mozilla Firefox) too? I think I can do that if I want to print a static page with no inputs. But what if I have inputs for users and they save the page in IE, can I use this somehow?
OK, I downloaded Puppeteer. I've got this code:
$scope.aClick = function(){
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('/vUrl_form.html', {waitUntil: 'networkidle'});
await page.pdf({path: 'images/asd.pdf', format: 'A4'});
browser.close();
})();
};
and this still doesn't work (I don't know why, but the app won't run).
No - Puppeteer only works with Chromium/Chrome.
Unfortunately Puppeteer only works with Chromium/Chrome.
If you want to use headless Mozilla Firefox, you might consider checking out its headless mode: https://developer.mozilla.org/en-US/Firefox/Headless_mode.
If you still want to use Puppeteer, here is a working snippet that creates a .pdf file:
const puppeteer = require('puppeteer');
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
// page.pdf() is currently supported only in headless mode.
// #see https://bugs.chromium.org/p/chromium/issues/detail?id=753118
await page.pdf({
path: 'hn.pdf',
format: 'letter'
});
await browser.close();
})();
Today it's possible to use Firefox with Puppeteer: https://firefox-puppeteer.readthedocs.io/en/master/. Maybe when people answered, it wasn't. But I can't find the PDF functionality there, just screenshots.
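For what it's worth, newer Puppeteer releases also ship experimental Firefox support via the product launch option (this is my addition, not part of the thread); screenshots work there, but page.pdf() remains Chromium-only. A rough sketch, assuming the Firefox binary has been fetched (e.g. with PUPPETEER_PRODUCT=firefox npm install puppeteer):

const puppeteer = require('puppeteer');

(async () => {
  // Experimental Firefox support; page.pdf() is not available here.
  const browser = await puppeteer.launch({ product: 'firefox' });
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com', { waitUntil: 'domcontentloaded' });
  await page.screenshot({ path: 'hn.png' }); // screenshots work
  await browser.close();
})();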