Puppeteer get 3rd-party cookies - javascript

How can I get 3rd-party cookies from a website using Puppeteer?
For first party, I know I can use:
await page.cookies()

I was interested to know the answer so have found a solution too, it works for the current versions of Chromium 75.0.3765.0 and puppeteer 1.15.0 (updated May 2nd 2019).
Using internal puppeteer page._client methods we can make use of Chrome DevTools Protocol directly:
(async() => {
const browser = await puppeteer.launch({});
const page = await browser.newPage();
await page.goto('https://stackoverflow.com', {waitUntil : 'networkidle2' });
// Here we can get all of the cookies
console.log(await page._client.send('Network.getAllCookies'));
})();
In the object returned there are cookies for google.com and imgur.com which we couldn't have obtained with normal browser javascript:

You can create a Chrome DevTools Protocol session on the page target using target.createCDPSession(). Then you can send Network.getAllCookies to obtain a list of all browser cookies.
The page.cookies() function will only return cookies for the current URL. So we can filter out the current page cookies from all of the browser cookies to obtain a list of third-party cookies only.
const client = await page.target().createCDPSession();
const all_browser_cookies = (await client.send('Network.getAllCookies')).cookies;
const current_url_cookies = await page.cookies();
const third_party_cookies = all_browser_cookies.filter(cookie => cookie.domain !== current_url_cookies[0].domain);
console.log(all_browser_cookies); // All Browser Cookies
console.log(current_url_cookies); // Current URL Cookies
console.log(third_party_cookies); // Third-Party Cookies

const browser = await puppeteer.launch({});
const page = await browser.newPage();
await page.goto('https://www.stackoverflow.com/', {waitUntil : 'networkidle0' });
// networkidle2, domcontentloaded, load are the options for wai until
// Here we can get all of the cookies
var content = await page._client.send('Network.getAllCookies');
console.log(JSON.stringify(content, null, 4));

Related

NodeJS puppeteer cookies fail

I'm trying to pull data from a website with puppeteer, but the cookies I added disappear after the page is opened.
and the session is suddenly dropped
const cookies = fs.readFileSync(__dirname+'/cookie');
const deserializedCookies = JSON.parse(cookies)
await page.setCookie(...deserializedCookies)
await page.goto(slugx, {waitUntil: 'networkidle2'});

How to reuse a cookie so, that the website knows I've already accept the terms?

Most websites when you load ask you to accept cookies and privacy, I think it's mainly in the EU.
I'm struggling on how to reuse the cookies so, I don't have to keep clicking "accept all", every time I load up chrome.
The way I'm thinking is that if I click on "accept all" the first time and save the cookie, I can write a code that fetches the cookie file and it knows I accepted the website cookie and so, it doesn't pop up again.
The website I'm using for this example is https://finviz.com/
const puppeteer = require('puppeteer')
const fs = require('fs')
;(async () => {
const browser = await puppeteer.launch({ headless: false })
const page = await browser.newPage()
await page.goto('https://finviz.com/')
const cookiesString = await fs.readFile('./cookies.json')
const cookies = JSON.parse(cookiesString)
await page.setCookie(...cookies)
})()
It is at least complicated to write an app that listens for the setting of cookies to copy them to a file and put them back when the browser is restartet. The same applies for the case that you want to save the cookies manually.
But if you do that then deleting the cookies would be unnecessary - so you could simply allow cookies in the settings of your browser.

How to go back in a new page using puppeteer?

I m using nodejs puppeteer to scrape a website. I've come across a situation where i need to go back in a new tab, but i couldn't find a way to do it in puppeteer (i can produce it manually on windows by ctrl + clicking the browser's go back button)
below is an example where i need to launch many pages in parallel starting from a particular page
const page = await browser.newPage();
await page.goto(myWebsiteUrl);
// going through some pages..
for (let i = 0; i < numberOfPagesInParallel; i++) {
// instanciating many pages with goback
const newBackPage = await page.gobackAndReturnNewPage(); // this is what i wish i could do, but not possible in puppeteer
const promise = processNewBackPageAsync(newBackPage);
this.allPromises.push(promise);
}
await Promise.all([...this.allPromises])
I searched across puppeteer api and chrome devtools protocol and don't find any way to clone a tab or clone history to another tab, maybe this is a usefull feature to add to both puppeteer and chrome CDP.
But, there is a way to create a new page and go back in history without need to track history, the limitation of this solution is that the new page does not share/clone the history of original page, I also tried to use Page.navigateToHistoryEntry but since the history is owned by page I got a error
So, there is the solution that creates a new page and go to last history url.
const puppeteer = require("puppeteer");
(async function() {
// headless: false
// to see the result in the browser
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
// let's do some navigation
await page.goto("http://localhost:5000");
await page.goto("http://localhost:5000/page-one");
await page.goto("http://localhost:5000/page-two");
// access history and evaluate last url of page
const session = await page.target().createCDPSession();
const history = await session.send("Page.getNavigationHistory");
const last = history.entries[history.entries.length - 2];
// create a new page and go back
// important: the page created here does not share the history
const backPage = await browser.newPage();
await backPage.goto(last.url);
// see results
await page.screenshot({ path: "page.png" });
await backPage.screenshot({ path: "back-page.png" });
// uncomment if you use headless chrome
// await browser.close();
})();
References:
https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getNavigationHistory

Can the browser turned headless mid-execution when it was started normally, or vice-versa?

I want to start a chromium browser instant headless, do some automated operations, and then turn it visible before doing the rest of the stuff.
Is this possible to do using Puppeteer, and if it is, can you tell me how? And if it is not, is there any other framework or library for browser automation that can do this?
So far I've tried the following but it didn't work.
const browser = await puppeteer.launch({'headless': false});
browser.headless = true;
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});
Short answer: It's not possible
Chrome only allows to either start the browser in headless or non-headless mode. You have to specify it when you launch the browser and it is not possible to switch during runtime.
What is possible, is to launch a second browser and reuse cookies (and any other data) from the first browser.
Long answer
You would assume that you could just reuse the data directory when calling puppeteer.launch, but this is currently not possible due to multiple bugs (#1268, #1270 in the puppeteer repo).
So the best approach is to save any cookies or local storage data that you need to share between the browser instances and restore the data when you launch the browser. You then visit the website a second time. Be aware that any state the website has in terms of JavaScript variable, will be lost when you recrawl the page.
Process
Summing up, the whole process should look like this (or vice versa for headless to headfull):
Crawl in non-headless mode until you want to switch mode
Serialize cookies
Launch or reuse second browser (in headless mode)
Restore cookies
Revisit page
Continue crawling
As mentioned, this isn't currently possible since the headless switch occurs via Chromium launch flags.
I usually do this with userDataDir, which the Chromium docs describe as follows:
The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state.
Here's a simple example. This launches a browser headlessly, sets a local storage value on an arbitrary page, closes the browser, re-opens it headfully, retrieves the local storage value and prints it.
const puppeteer = require("puppeteer"); // ^18.0.4
const url = "https://www.example.com";
const opts = {userDataDir: "./data"};
let browser;
(async () => {
{
browser = await puppeteer.launch({...opts, headless: true});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.evaluate(() => localStorage.setItem("hello", "world"));
await browser.close();
}
{
browser = await puppeteer.launch({...opts, headless: false});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
const result = await page.evaluate(() => localStorage.getItem("hello"));
console.log(result); // => world
}
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Change const opts = {userDataDir: "./data"}; to const opts = {}; and you'll see null print instead of world; the user data doesn't persist.
The answer from a few years ago mentions issues with userDataDir and suggests a cookies solution. That's fine, but I haven't had any issues with userDataDir so either they've been resolved on the Puppeteer end or my use cases haven't triggered the issues.
There's a useful-looking answer from a reputable source in How to turn headless on after launch? but I haven't had a chance to try it yet.

Puppeteer Launch Incognito

I am connected to a browser using a ws endpoint (puppeteer.connect({ browserWSEndpoint: '' })).
When I launch the browser that I ultimately connect to, is there a way to launch this in incognito?
I know I can do something like this:
const incognito = await this.browser.createIncognitoBrowserContext();
But it seems like the incognito session is tied to the originally opened browser. I just want it to be by itself.
I also see you can do this:
const baseOptions: LaunchOptions = { args: ['--incognito']};
But I am not sure if this is the best way or not.
Any advice would be appreciated. Thank you!
The best way to accomplish your goal is to launch the browser directly into incognito mode by passing the --incognito flag to puppeteer.launch():
const browser = await puppeteer.launch({
args: [
'--incognito',
],
});
Alternatively, you can create a new incognito browser context after launching the browser using browser.createIncognitoBrowserContext():
const browser = await puppeteer.launch();
const context = await browser.createIncognitoBrowserContext();
You can check whether a browser context is incognito using browserContext.isIncognito():
if (context.isIncognito()) { /* ... */ }
the solutions above didn't work for me:
an incognito window is created, but then when the new page is created, it is no longer incognito.
The solution that worked for me was:
const browser = await puppeteer.launch();
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
then you can use page and it's an incognito page
For Puppeteer sharp it's rather messy but this seems to work.. Hopefully it helps someone.
using (Browser browser = await Puppeteer.LaunchAsync(options))
{
// create the async context
var context = await browser.CreateIncognitoBrowserContextAsync();
// get the page created by default when launch async ran and close it whilst keeping the browser active
var browserPages = await browser.PagesAsync();
await browserPages[0].CloseAsync();
// create a new page using the incognito context
using (Page page = await context.NewPageAsync())
{
// do something
}
}

Categories