The problem has been resolved by adding a cookie from an actual browser.
I'm trying to get half-price products from this website https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials. The website is rendered by AngularJS so I'm trying to use puppeteer for data scraping.
With headless: false, just a blank page shows up.
With headless: true, it throws an exception (image: "Error while running with headless browser").
const puppeteer = require('puppeteer');
async function getProductNames() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setViewport({ width: 1000, height: 926 });
  await page.goto("https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials");
  await page.waitForSelector('.product-name')
  console.log("Begin to evaluate JS")
  var productNames = await page.evaluate(() => {
    var div = document.querySelectorAll('.product-name');
    console.log(div)
    var productnames = []
    // leave it blank for now
    return productnames
  })
  console.log(productNames)
  await browser.close()
}
getProductNames();
P.S. While looking into this issue, I found that the web page actually logs the data of each page with console.log, but I can't trace the request. If you can show me how, that would be great.
Try adding the options parameter to the page.goto(url[, options]) method:
page.goto("https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials", { waitUntil: 'networkidle2' })
It will consider navigation to be finished only when there are no more than 2 network connections for at least 500 ms.
You can refer to the documentation for the parameters of the options object here: Goto Options parameter
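As a rough sketch of how that option could fit into the scraper from the question: the helper below (scrapeText is a made-up name, not part of Puppeteer) navigates with waitUntil: 'networkidle2', waits for the selector, and then collects the text of every match with page.$$eval. It assumes you pass in a page you already created with browser.newPage(); whether the site serves its content to an automated browser at all is a separate problem.

```javascript
// Hypothetical helper: navigate, wait for the selector, and return the
// trimmed text of all matching elements.
// `page` is assumed to be a Puppeteer Page (e.g. from `await browser.newPage()`).
async function scrapeText(page, url, selector) {
  // Navigation counts as finished once at most 2 connections remain for 500 ms
  await page.goto(url, { waitUntil: 'networkidle2' });
  // Rejects with a TimeoutError if the selector never appears
  await page.waitForSelector(selector);
  // The callback runs inside the page, so it may only use its arguments
  return page.$$eval(selector, nodes => nodes.map(n => n.textContent.trim()));
}
```

In the question's script this would replace the page.evaluate block, for example: const productNames = await scrapeText(page, url, '.product-name');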
I've been scraping a bunch of websites lately and some of them work just fine, while on others I encounter problems.
const { url, selectors } = WebsiteUtils
const getData = async (url) => {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()
  await page.setViewport({
    height: 20000,
    width: 1500,
  })
  await page.goto(url, { waitUntil: 'networkidle2' })
  const link = page.$(selectors.link)
  return link.innerHTML
}
console.log(await getData(url))
In this case, selectors.link is literally just 'ul', as there are a bunch of them on the page. I already tried waiting for the selector (.waitForSelector), using page.evaluate instead of .$, and even adding a 5-second timeout to make sure everything was loaded. The HTML is not even inside an iframe; it's a really straightforward page. At this point I'm pretty much out of options.
The output here is undefined; if I just log link, I get ElementHandle {}.
Update:
I figured this out. Contrary to what I previously wrote, iframes were indeed present in the page, although they did not show up in the HTML in DevTools.
I spotted them by running page.frames(), and I accessed the content of the individual frames with frame.$eval().
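For anyone landing here, a minimal sketch of that approach, assuming a Puppeteer page that has finished loading. The helper name innerHtmlInAnyFrame is made up for illustration; it only relies on page.frames(), frame.$() and frame.$eval(). A frame that detaches while it is being queried could still throw, which this sketch does not handle.

```javascript
// Hypothetical helper: return the innerHTML of the first element matching
// `selector` in any frame of the page (the main frame is included in
// page.frames()), or null if no frame contains a match.
async function innerHtmlInAnyFrame(page, selector) {
  for (const frame of page.frames()) {
    // frame.$ resolves to null when nothing matches in this frame
    const handle = await frame.$(selector);
    if (handle) {
      // $eval runs the callback inside that frame's document
      return frame.$eval(selector, el => el.innerHTML);
    }
  }
  return null;
}
```

With the code from the question, the call would be something like: console.log(await innerHtmlInAnyFrame(page, selectors.link));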
Sometimes puppeteer does not type into some input fields. To be specific, I tried to simply type something into the input field on the website "https://webtor.io/", which has a single huge input field. I hope someone can help me with that specific example.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://webtor.io/');
await page.type(`input[type="text"]`, 'something', { delay: 50 })
})();
This happens because, when you go to the page, it still has to render the HTML and load its scripts. That causes a delay, and sometimes the text input is not loaded yet when you try to type, hence the failure.
await page.goto('https://webtor.io/', {waitUntil: 'networkidle0'});
Check out this link for further details.
Puppeteer wait until page is completely loaded
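An alternative that does not depend on the network going quiet is to wait for the input itself before typing. The sketch below is only an illustration: typeWhenReady is a made-up name, and it assumes a Puppeteer page plus a selector that matches the field you want.

```javascript
// Hypothetical helper: wait until the input exists and is visible, then type.
// `page` is assumed to be a Puppeteer Page.
async function typeWhenReady(page, selector, text, delay = 50) {
  // waitForSelector resolves with the element handle once it is visible
  const input = await page.waitForSelector(selector, { visible: true });
  // `delay` is the pause between key presses, in milliseconds
  await input.type(text, { delay });
  return input;
}
```

For the example in the question, that would be: await typeWhenReady(page, 'input[type="text"]', 'something');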
I'm using Node.js and puppeteer to scrape a website. I've come across a situation where I need to go back in a new tab, but I couldn't find a way to do it in puppeteer (I can reproduce it manually on Windows by Ctrl+clicking the browser's back button).
Below is an example where I need to launch many pages in parallel, starting from a particular page:
const page = await browser.newPage();
await page.goto(myWebsiteUrl);
// going through some pages...
for (let i = 0; i < numberOfPagesInParallel; i++) {
  // instantiating many pages with "go back"
  const newBackPage = await page.gobackAndReturnNewPage(); // this is what I wish I could do, but it is not possible in puppeteer
  const promise = processNewBackPageAsync(newBackPage);
  this.allPromises.push(promise);
}
await Promise.all([...this.allPromises])
I searched the puppeteer API and the Chrome DevTools Protocol and didn't find any way to clone a tab or to clone history to another tab. Maybe this would be a useful feature to add to both puppeteer and the CDP.
But there is a way to create a new page and go back in history without needing to track the history yourself. The limitation of this solution is that the new page does not share or clone the history of the original page. I also tried to use Page.navigateToHistoryEntry, but since the history is owned by the page, I got an error.
So, here is a solution that creates a new page and goes to the previous URL in the history.
const puppeteer = require("puppeteer");
(async function() {
// headless: false
// to see the result in the browser
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
// let's do some navigation
await page.goto("http://localhost:5000");
await page.goto("http://localhost:5000/page-one");
await page.goto("http://localhost:5000/page-two");
// access history and pick the previous entry (the one before the current page)
const session = await page.target().createCDPSession();
const history = await session.send("Page.getNavigationHistory");
const last = history.entries[history.entries.length - 2];
// create a new page and go back
// important: the page created here does not share the history
const backPage = await browser.newPage();
await backPage.goto(last.url);
// see results
await page.screenshot({ path: "page.png" });
await backPage.screenshot({ path: "back-page.png" });
// uncomment if you use headless chrome
// await browser.close();
})();
References:
https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getNavigationHistory
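One caveat with entries.length - 2: it only points at the previous page while the current page is the last entry. Page.getNavigationHistory also returns a currentIndex field, so a slightly more defensive lookup could be sketched like this (previousEntry is a made-up helper name):

```javascript
// `history` is assumed to be the object returned by
// session.send("Page.getNavigationHistory"), which has the shape
// { currentIndex: number, entries: [{ id, url, title, ... }, ...] }.
function previousEntry(history) {
  const index = history.currentIndex - 1;
  // There is nothing to go back to from the first entry
  return index >= 0 ? history.entries[index] : null;
}
```

In the script above, const last = previousEntry(history); would then replace the entries.length - 2 lookup, with a null check before calling backPage.goto(last.url).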
I am trying to run my first code on puppeteer.
Puppeteer v1.20.0
Node v8.11.3
Npm v5.6.0
It is a basic example, but it doesn't work:
const puppeteer = require('puppeteer');
puppeteer.launch({headless: false}).then(async browser => {
const page = await browser.newPage();
await page.goto('https://www.linkedin.com/learning/login', { waitUntil: 'networkidle0' });
console.log(0);
await page.waitFor('#username');
console.log(1);
await browser.close();
});
When I run the script, Chromium starts and I can see the LinkedIn login page with the form and the #username field, but puppeteer doesn't find the field. It displays 0 but never 1, and then throws TimeoutError: waiting for selector "#username" failed: timeout 30000ms exceeded.
Increasing the timeout doesn't change anything, and if I check the console in Chromium, the field is there.
The LinkedIn login page works as an SPA, and I don't know if I'm approaching this the right way.
Thank you in advance.
The username input is inside an iframe, so you can't access it like this. You need to access the iframe first:
await page.goto('https://www.linkedin.com/learning/login');
await page.waitForSelector('.authentication-iframe');
var frames = await page.frames();
var myframe = frames.find(f => f.url().indexOf("uas/login") > 0);
let username = '123456#gmail.com';
const usernamefild = await myframe.$("#username");
await usernamefild.type(username, {delay: 10});
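Note that frames.find returns undefined if the iframe has not navigated to its login URL yet, in which case myframe.$ would throw. A rough way around that is to poll page.frames() until a matching frame shows up. The helper below is only a sketch: waitForFrameByUrl and its timeout and interval options are made-up names, not Puppeteer APIs.

```javascript
// Hypothetical helper: poll page.frames() until a frame whose URL contains
// `fragment` appears, or reject after `timeout` milliseconds.
async function waitForFrameByUrl(page, fragment, { timeout = 30000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  while (true) {
    const frame = page.frames().find(f => f.url().includes(fragment));
    if (frame) return frame;
    if (Date.now() >= deadline) {
      throw new Error(`No frame with "${fragment}" in its URL after ${timeout} ms`);
    }
    // Sleep briefly before looking at the frame list again
    await new Promise(resolve => setTimeout(resolve, interval));
  }
}
```

With it, the lookup above becomes const myframe = await waitForFrameByUrl(page, "uas/login"); and the rest of the snippet stays the same.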
I want to start a Chromium browser instance in headless mode, do some automated operations, and then make it visible before doing the rest of the work.
Is this possible to do using Puppeteer, and if it is, can you tell me how? And if it is not, is there any other framework or library for browser automation that can do this?
So far I've tried the following but it didn't work.
const browser = await puppeteer.launch({'headless': false});
browser.headless = true;
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});
Short answer: It's not possible
Chrome only allows you to start the browser in either headless or non-headless mode. You have to specify the mode when you launch the browser, and it is not possible to switch at runtime.
What is possible, is to launch a second browser and reuse cookies (and any other data) from the first browser.
Long answer
You would assume that you could just reuse the data directory when calling puppeteer.launch, but this is currently not possible due to multiple bugs (#1268, #1270 in the puppeteer repo).
So the best approach is to save any cookies or local storage data that you need to share between the browser instances and restore the data when you launch the second browser. You then visit the website a second time. Be aware that any state the website holds in JavaScript variables will be lost when you recrawl the page.
Process
Summing up, the whole process should look like this (or vice versa for headless to headful):
1. Crawl in non-headless mode until you want to switch mode
2. Serialize cookies
3. Launch or reuse second browser (in headless mode)
4. Restore cookies
5. Revisit page
6. Continue crawling
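The cookie steps of this process can be sketched roughly as follows. The helper names serializeCookies and restoreCookies are made up for illustration; the sketch assumes the page.cookies() and page.setCookie() methods of the Puppeteer Page class, and newer Puppeteer releases may prefer browser- or context-level cookie APIs instead. Local storage is not covered here.

```javascript
// Hypothetical helpers for moving cookies between two browser instances.
// `page` is assumed to be a Puppeteer Page in both functions.

// Serialize the cookies visible to the page's current URL as a JSON string.
async function serializeCookies(page) {
  const cookies = await page.cookies();
  return JSON.stringify(cookies);
}

// Restore previously serialized cookies into a page of the second browser.
// Returns the number of cookies that were restored.
async function restoreCookies(page, serialized) {
  const cookies = JSON.parse(serialized);
  if (cookies.length > 0) {
    // setCookie takes the cookie objects as separate arguments
    await page.setCookie(...cookies);
  }
  return cookies.length;
}
```

A typical sequence would be: const saved = await serializeCookies(page); then close or keep the first browser, launch the second one with the other headless setting, call await restoreCookies(secondPage, saved); and only then revisit the page with secondPage.goto(url).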
As mentioned, this isn't currently possible since the headless switch occurs via Chromium launch flags.
I usually do this with userDataDir, which the Chromium docs describe as follows:
The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state.
Here's a simple example. This launches a browser headlessly, sets a local storage value on an arbitrary page, closes the browser, re-opens it headfully, retrieves the local storage value and prints it.
const puppeteer = require("puppeteer"); // ^18.0.4
const url = "https://www.example.com";
const opts = {userDataDir: "./data"};
let browser;
(async () => {
{
browser = await puppeteer.launch({...opts, headless: true});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.evaluate(() => localStorage.setItem("hello", "world"));
await browser.close();
}
{
browser = await puppeteer.launch({...opts, headless: false});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
const result = await page.evaluate(() => localStorage.getItem("hello"));
console.log(result); // => world
}
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Change const opts = {userDataDir: "./data"}; to const opts = {}; and you'll see null print instead of world; the user data doesn't persist.
The answer from a few years ago mentions issues with userDataDir and suggests a cookies solution. That's fine, but I haven't had any issues with userDataDir so either they've been resolved on the Puppeteer end or my use cases haven't triggered the issues.
There's a useful-looking answer from a reputable source in How to turn headless on after launch? but I haven't had a chance to try it yet.