Node.js Puppeteer cookies fail

I'm trying to pull data from a website with Puppeteer, but the cookies I added disappear after the page is opened, and the session is suddenly dropped:
const fs = require('fs');

// read serialized cookies from disk and apply them before navigating
const cookies = fs.readFileSync(__dirname + '/cookie');
const deserializedCookies = JSON.parse(cookies);
await page.setCookie(...deserializedCookies);
await page.goto(slugx, { waitUntil: 'networkidle2' });
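One common cause (an assumption here, since the question doesn't show the cookie file) is that cookies exported by a browser extension use field names that page.setCookie rejects, e.g. expirationDate instead of the expires field (Unix time in seconds) that Puppeteer expects. A minimal normalization sketch under that assumption:
const fs = require('fs');

// hypothetical fix: keep only the fields page.setCookie understands
// and rename expirationDate to expires (Unix time in seconds)
const raw = JSON.parse(fs.readFileSync(__dirname + '/cookie', 'utf8'));
const normalized = raw.map(({ name, value, domain, path, expirationDate, httpOnly, secure }) => ({
  name, value, domain, path,
  expires: expirationDate,
  httpOnly, secure,
}));
await page.setCookie(...normalized);
await page.goto(slugx, { waitUntil: 'networkidle2' });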

Related

How to reuse a cookie so that the website knows I've already accepted the terms?

Most websites ask you to accept cookies and a privacy policy when you load them; I think it's mainly an EU thing. I'm struggling with how to reuse the cookies so I don't have to keep clicking "Accept all" every time I load up Chrome.
My thinking is that if I click "Accept all" the first time and save the cookie, I can write code that fetches the cookie file, so the site knows I've accepted its cookies and the pop-up doesn't appear again.
The website I'm using for this example is https://finviz.com/
const puppeteer = require('puppeteer')
const fs = require('fs').promises
;(async () => {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()
  // cookies must be set before navigating for the site to see them
  const cookiesString = await fs.readFile('./cookies.json', 'utf8')
  const cookies = JSON.parse(cookiesString)
  await page.setCookie(...cookies)
  await page.goto('https://finviz.com/')
})()
It is at least complicated to write an app that listens for cookies being set, copies them to a file, and puts them back when the browser is restarted. The same applies if you want to save the cookies manually.
But if you do that, then deleting the cookies would be unnecessary, so you could simply allow cookies in your browser's settings.
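To make the idea concrete, here is a minimal sketch of the save-then-restore flow for https://finviz.com/; the consent-button selector is hypothetical (inspect the banner for the real one), and note that some sites keep consent state in localStorage rather than cookies:
const puppeteer = require('puppeteer')
const fs = require('fs')
;(async () => {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()
  if (fs.existsSync('./cookies.json')) {
    // second run onwards: restore cookies before navigating,
    // so the consent banner does not appear again
    const cookies = JSON.parse(fs.readFileSync('./cookies.json', 'utf8'))
    await page.setCookie(...cookies)
    await page.goto('https://finviz.com/')
  } else {
    // first run: accept the banner, then persist the cookies
    await page.goto('https://finviz.com/')
    // hypothetical selector -- replace with the banner's real "accept all" button
    // await page.click('#accept-all')
    fs.writeFileSync('./cookies.json', JSON.stringify(await page.cookies(), null, 2))
  }
  await browser.close()
})()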

How to go back in a new page using puppeteer?

I'm using Node.js Puppeteer to scrape a website. I've come across a situation where I need to go back in a new tab, but I couldn't find a way to do it in Puppeteer (I can produce it manually on Windows by Ctrl + clicking the browser's back button).
Below is an example where I need to launch many pages in parallel, starting from a particular page:
const page = await browser.newPage();
await page.goto(myWebsiteUrl);
// going through some pages...
for (let i = 0; i < numberOfPagesInParallel; i++) {
  // instantiating many pages with go-back
  const newBackPage = await page.gobackAndReturnNewPage(); // this is what I wish I could do, but it's not possible in Puppeteer
  const promise = processNewBackPageAsync(newBackPage);
  this.allPromises.push(promise);
}
await Promise.all([...this.allPromises]);
I searched across the Puppeteer API and the Chrome DevTools Protocol and didn't find any way to clone a tab or clone history to another tab; maybe this would be a useful feature to add to both Puppeteer and the CDP.
But there is a way to create a new page and go back in history without needing to track the history yourself. The limitation of this solution is that the new page does not share/clone the history of the original page. I also tried to use Page.navigateToHistoryEntry, but since the history is owned by the original page, I got an error.
So here is a solution that creates a new page and goes to the last history URL:
const puppeteer = require("puppeteer");

(async function() {
  // headless: false to see the result in the browser
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // let's do some navigation
  await page.goto("http://localhost:5000");
  await page.goto("http://localhost:5000/page-one");
  await page.goto("http://localhost:5000/page-two");
  // access the history and take the previous entry (the one before the current page)
  const session = await page.target().createCDPSession();
  const history = await session.send("Page.getNavigationHistory");
  const last = history.entries[history.entries.length - 2];
  // create a new page and go back
  // important: the page created here does not share the history
  const backPage = await browser.newPage();
  await backPage.goto(last.url);
  // see results
  await page.screenshot({ path: "page.png" });
  await backPage.screenshot({ path: "back-page.png" });
  // uncomment if you use headless chrome
  // await browser.close();
})();
References:
https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getNavigationHistory
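For completeness, going back within the same tab via the CDP does work, since the history entries belong to that page. A small sketch reusing the session variable from the code above:
// go back within the *same* tab; Page.navigateToHistoryEntry works here
// because the entry belongs to this page's own history
const { entries, currentIndex } = await session.send("Page.getNavigationHistory");
if (currentIndex > 0) {
  await session.send("Page.navigateToHistoryEntry", {
    entryId: entries[currentIndex - 1].id,
  });
}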

Puppeteer cannot goto web page to get selector

The problem has been resolved by adding a cookie from an actual browser.
I'm trying to get half-price products from this website: https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials. The website is rendered by AngularJS, so I'm trying to use Puppeteer for data scraping.
With headless set to false, just a blank page shows up.
With headless set to true, it throws an exception (see the screenshot "Error while running with headless browser").
const puppeteer = require('puppeteer');

async function getProductNames() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setViewport({ width: 1000, height: 926 });
  await page.goto("https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials");
  await page.waitForSelector('.product-name');
  console.log("Begin to evaluate JS");
  const productNames = await page.evaluate(() => {
    const divs = document.querySelectorAll('.product-name');
    console.log(divs);
    const productnames = [];
    // leave it blank for now
    return productnames;
  });
  console.log(productNames);
  await browser.close();
}

getProductNames();
P.S.: While looking into this issue, I figured out that the web page actually logs the data of each page to the console, but I can't trace the request. If you can show me how, that would be great.
(Screenshot: the web page's console log data.)
Try adding an options parameter to the page.goto(url[, options]) method:
page.goto("https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials", { waitUntil: 'networkidle2' })
With networkidle2, navigation is considered finished only when there are no more than 2 network connections for at least 500 ms.
You can refer to the documentation about the options object here: Goto Options parameter
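Combined with the waitForSelector call from the question, the fix would look roughly like this (the 60-second timeout is an assumption; tune it to how long the site actually takes to render):
await page.goto("https://shop.coles.com.au/a/richmond-south/specials/search/half-price-specials", { waitUntil: 'networkidle2' });
// give the AngularJS app time to render the product list before evaluating
await page.waitForSelector('.product-name', { timeout: 60000 });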

Can the browser be turned headless mid-execution when it was started normally, or vice versa?

I want to start a Chromium browser instance headless, do some automated operations, and then turn it visible before doing the rest of the stuff.
Is this possible to do using Puppeteer, and if it is, can you tell me how? And if it is not, is there any other framework or library for browser automation that can do this?
So far I've tried the following, but it didn't work:
const browser = await puppeteer.launch({'headless': false});
browser.headless = true; // has no effect: headless mode can't be changed after launch
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});
Short answer: it's not possible.
Chrome only allows you to start the browser in either headless or non-headless mode. You have to specify it when you launch the browser, and it is not possible to switch during runtime.
What is possible is to launch a second browser and reuse cookies (and any other data) from the first browser.
Long answer
You would assume that you could just reuse the data directory when calling puppeteer.launch, but this is currently not possible due to multiple bugs (#1268, #1270 in the puppeteer repo).
So the best approach is to save any cookies or local storage data that you need to share between the browser instances and restore the data when you launch the second browser. You then visit the website a second time. Be aware that any state the website has in terms of JavaScript variables will be lost when you recrawl the page.
Process
Summing up, the whole process should look like this (or vice versa for headless to headful); a sketch in code follows the list:
Crawl in non-headless mode until you want to switch mode
Serialize cookies
Launch or reuse second browser (in headless mode)
Restore cookies
Revisit page
Continue crawling
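A minimal sketch of that process, headful to headless (https://news.ycombinator.com stands in for whatever page you are crawling; local storage could be copied the same way via page.evaluate):
const puppeteer = require('puppeteer');

(async () => {
  // 1. crawl in non-headless mode until you want to switch
  const headfulBrowser = await puppeteer.launch({ headless: false });
  const page = await headfulBrowser.newPage();
  await page.goto('https://news.ycombinator.com', { waitUntil: 'networkidle2' });

  // 2. serialize the cookies
  const cookies = await page.cookies();
  await headfulBrowser.close();

  // 3. launch the second browser, this time headless
  const headlessBrowser = await puppeteer.launch({ headless: true });
  const headlessPage = await headlessBrowser.newPage();

  // 4. + 5. restore the cookies and revisit the page
  await headlessPage.setCookie(...cookies);
  await headlessPage.goto('https://news.ycombinator.com', { waitUntil: 'networkidle2' });

  // 6. continue crawling...
  await headlessBrowser.close();
})();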
As mentioned, this isn't currently possible since the headless switch occurs via Chromium launch flags.
I usually do this with userDataDir, which the Chromium docs describe as follows:
The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state.
Here's a simple example. This launches a browser headlessly, sets a local storage value on an arbitrary page, closes the browser, re-opens it headfully, retrieves the local storage value and prints it.
const puppeteer = require("puppeteer"); // ^18.0.4
const url = "https://www.example.com";
const opts = {userDataDir: "./data"};
let browser;
(async () => {
{
browser = await puppeteer.launch({...opts, headless: true});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.evaluate(() => localStorage.setItem("hello", "world"));
await browser.close();
}
{
browser = await puppeteer.launch({...opts, headless: false});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
const result = await page.evaluate(() => localStorage.getItem("hello"));
console.log(result); // => world
}
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Change const opts = {userDataDir: "./data"}; to const opts = {}; and you'll see null print instead of world; the user data doesn't persist.
The answer from a few years ago mentions issues with userDataDir and suggests a cookies solution. That's fine, but I haven't had any issues with userDataDir so either they've been resolved on the Puppeteer end or my use cases haven't triggered the issues.
There's a useful-looking answer from a reputable source in How to turn headless on after launch? but I haven't had a chance to try it yet.

Puppeteer get 3rd-party cookies

How can I get 3rd-party cookies from a website using Puppeteer?
For first party, I know I can use:
await page.cookies()
I was interested to know the answer, so I found a solution too; it works with the current versions of Chromium (75.0.3765.0) and Puppeteer (1.15.0, as of May 2nd, 2019).
Using the internal Puppeteer page._client methods, we can make use of the Chrome DevTools Protocol directly:
(async () => {
  const browser = await puppeteer.launch({});
  const page = await browser.newPage();
  await page.goto('https://stackoverflow.com', { waitUntil: 'networkidle2' });
  // Here we can get all of the cookies
  console.log(await page._client.send('Network.getAllCookies'));
})();
In the returned object there are cookies for google.com and imgur.com, which we couldn't have obtained with normal browser JavaScript (see the screenshot of the output).
You can create a Chrome DevTools Protocol session on the page target using target.createCDPSession(). Then you can send Network.getAllCookies to obtain a list of all browser cookies.
The page.cookies() function will only return cookies for the current URL. So we can filter out the current page cookies from all of the browser cookies to obtain a list of third-party cookies only.
const client = await page.target().createCDPSession();
const all_browser_cookies = (await client.send('Network.getAllCookies')).cookies;
const current_url_cookies = await page.cookies();
const third_party_cookies = all_browser_cookies.filter(cookie => cookie.domain !== current_url_cookies[0].domain);
console.log(all_browser_cookies); // All Browser Cookies
console.log(current_url_cookies); // Current URL Cookies
console.log(third_party_cookies); // Third-Party Cookies
const browser = await puppeteer.launch({});
const page = await browser.newPage();
await page.goto('https://www.stackoverflow.com/', { waitUntil: 'networkidle0' });
// networkidle2, domcontentloaded, and load are the other waitUntil options
// Here we can get all of the cookies
const content = await page._client.send('Network.getAllCookies');
console.log(JSON.stringify(content, null, 4));
