HTTPRequests are lost/missing when using Puppeteer with setRequestInterception enabled

HTTPRequests are lost/missing when using Puppeteer with setRequestInterception enabled - javascript

Here's the scoop.
I'm trying to use Puppeteer v18.0.5 with the bundled chromium browser against a specific website. I'm using Node v16.16.0 However, when I enable request interception via page.setRequestInterception(true), all of the HTTPRequests for any image resources are lost. My handler is invoked far less while intercepting than when not intercepting. The page never fires any requests for images. But when I disable the interception, the page loads normally. Yes, I know about invoking continue() on all requests. I'm currently doing that in the request handler on the page.
I've also poured over the Puppeteer issues pages and have found similar symptoms on some of the earlier Puppeteer versions, but they were all different issues that have all been resolved since those early versions. This seems unique.
I've looked through Puppeteer source code as well as CDP events to try and find any explanation, but have found none.
As an important note for anyone trying to reproduce this, you must be proxied through a server in the London general area in order to successfully load this site.
Here's my code to reproduce:
const puppeteer = require('puppeteer');
(async () => {
const options = {
browserWidth: 1366,
browserHeight: 983,
intercepting: false
};
const browser = await puppeteer.launch(
{
args: [`--window-size=${options.browserWidth},${options.browserHeight}`],
defaultViewport: {width: options.browserWidth, height: options.browserHeight},
headless: false
}
);
const page = (await browser.pages())[0];
page.on('request', async (request) => {
console.log(`Request: ${request.method()} | ${request.url()} | ${request.resourceType()} | ${request._requestId}`);
if (options.intercepting) await request.continue();
});
await page.setRequestInterception(options.intercepting);
await page.goto('https://vegas.williamhill.com', {waitUntil: 'networkidle2', timeout: 65000});
// To give a moment to view the page in headful mode before closing browser.
await new Promise(resolve => setTimeout(resolve, 5000));
await browser.close();
})();
Here's what the page looks like with intercepting disabled:
Expected Page Load
Here's what the page looks like with intercepting enabled and continuing all requests.
Page load while intercepting and continuing all requests
With request interception disabled my handler is invoked for 104 different requests. But with the interception enabled it's only invoked 22 times. I'm not hitting a navigation timeout as the .goto() method returns before my timeout each time.
Any insight into what configuration/strategy I'm missing would be immensely appreciated.

Maybe you are incepting some javascript files that initiate the requests that you are not seeing?

Related

Testcafe scripts mostly fails over Mozilla Firefox browser due page load issue

The page is very slow to load when run via TestCafe over Firefox. If I manually load the page in the browser, the page loads reasonably fast. So I'm not sure what's causing the page to load slowly. Even while executing basic testcafe scripts in Firefox we can easily observe this Slowness issue. We have a longer Script execution time because of this issue. If the same script is executed over the chrome browser we are not facing a page load issue and script execution time is almost half as compared to chrome.
To quantify the number of scripts passed ratio in chrome is 70-75% but in Firefox it's hardly around 20-25%. I'm not sure what causes this issue.
Any help or suggestion will be much appreciated. Thank you in advance for your help!
Sample Code :
import { Role } from "testcafe";
import { Selector } from "testcafe";
fixture`FirefoxSlowness`.beforeEach(async t => {
await t.wait(8000)
await t.navigateTo("https://dxc.com/us/en");
await t.maximizeWindow();
});
test('DXCTestWebsite_Test_01', async t => {
const Logo = Selector('#Logo_and_Copy').exists;
await t.expect(Logo).ok();
await t.click('[class="nav-link mega-menu-toggle collapsed"]');
await t.wait(1000)
await t.click('#onetrust-accept-btn-handler')
await t.click('[class="mega-menu__nav-item nav-item"]')
await t.click('[class="card-link"]')
await t.click('#contact-us-link')
const contactus = Selector('[class="cmp-accordion__title"]')
await t.click(contactus.withExactText("Careers"))
await t.wait(2000)
await t.click('[href="/us/en/careers"]')
await t.click('#typehead')
await t.typeText('#typehead', '42A - Human Resources Specialist')
await t.click('#ph-search-backdrop')
await t.click('[data-ph-id="ph-page-element-page24-AA0w8h"]')
});

Puppeteer change proxy of already opened browser page

Once I open a puppeteer browser page and redirect to a specific site. If the proxy used for the browser is banned, I want to be able to switch it without closing or restarting the browser. refreshing the browser would be the best option. However, I have not yet been able to figure out how to do it. I've tried making variables out of '--proxy-server=xxxx' and switching that variable whenever proxy is banned, but that didnt work out. I've tried many other things too but have yet been able to figure it out. Any kind of help would be much appreciated.
var proxies = get_proxy() // get proxy
var useragent = randomUseragent.getRandom()
let browser = await puppeteer.launch({
headless: false,
args: [
`--proxy-server=${proxies.address}:${proxies.port}`,
`--user-agent="${useragent}"`,
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-infobars',
'--window-position=0,0',
'--ignore-certifcate-errors',
'--ignore-certifcate-errors-spki-list'
]
})
let page = await browser.newPage()
Example of how i want it to be:
const proxy_banned = await check_proxy_ban()
If(proxy_banned){
- Switch puppeteer browser proxy
- refresh
- return function
} else{
- return function
}

Multiple separate browser with one tab each - simultaneous interaction with elements on pages (puppeteer headless)

Using Node.js, Chrome and puppeteer as headless on ubuntu server, I'm scraping a few different websites. One of the occasional task is to interact with the loaded page (click on a link to open another page and then possibly do another click to accept the terms and such).
I can do all this just fine, but I'm trying to understand how it will work if I have multiple pages open simultaneously and am trying to interact with different loaded pages at the same time (overlapping times).
To visualize this, I'm thinking how a user will do the same job. They'll have to open multiple browser windows, open the page and switch between them to see and then click on links.
But using puppeteer, we have separate browser object, we don't need to see the window or page to know where to click. We can traverse it through the browser object and then do a click on desired element without looking (headless).
I'm thinking I should be able to do multiple pages at the same time as long as I have CPU and memory available to handle them.
Does anyone have any experience with puppeteer interacting with multiple websites simultaneously? Anything I need to watch out for?

This is the problem the library puppeteer-cluster (I'm the author) is addressing. It allows you to build a pool of pages (or browsers) to use and run tasks inside.
You find several general code samples in the repository (and also on stackoverflow). Let me address your specific use case of running different tasks with an example.
Code Sample
The following code creates two tasks:
crawl: Opens the page and extracts an URL to then start the second task
screenshot: Takes a screenshot of the extracted URL
The process is started by queuing the crawl task with the URLs.
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({ // use four pages in parallel
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 4,
});
// We define two tasks
const crawl = async ({ page, data: url }) => {
await page.goto(url);
const extractedURL = /* ... */; // extract an URL (or multiple) from the document somehow
cluster.queue(extractedURL, screenshot);
};
const screenshot = async ({ page, data: url }) => {
await page.goto(url);
await page.screenshot();
};
// Crawl some pages
cluster.queue('https://www.google.com/', crawl);
cluster.queue('https://github.com/', crawl);
// Wait until everything is done and close the cluster
await cluster.idle();
await cluster.close();
})();
This is a minimal example. I left out error handling, monitoring and the setup options.

I can usually get 5 or so browsers going on a 4GB server, if you're just popping urls off a queue it's pretty straightforward:
const puppeteer = require('puppeteer');
let queue = [
'http://www.amazon.com',
'http://www.google.com',
'http://www.fabebook.com',
'http://www.reddit.com',
]
const doQueue = async () => {
const browser = await puppeteer.launch()
const page = await browser.newPage()
let url
while(url = queue.shift()){
await page.goto(url)
console.log(await page.title())
}
await browser.close()
}
[1,2,3].map(() => doQueue())

uploading a file using puppeteer browserWSEndpoint

I am trying to upload a file using puppeteer and browserWSEndpoint, the error message I am getting is
"Uncaught (in promise) Error: File chooser handling does not work with multiple connections to the same page".
Here is my code:
const puppeteer = require('puppeteer');
async function getTest() {
const browser = await puppeteer.connect({
browserWSEndpoint: 'wss://chrome.browserless.io'
});
const page = (await browser.pages())[0];
await page.goto('https://someWebSite');
//DO STUFF
console.log("before upload"); //code runs until here
const [fileChooser] = await Promise.all([page.waitForFileChooser(),page.click('#uploadTrigger'),]);
await fileChooser.accept(['C:\\myProgram\\pic.jpg']);
await page.click('#edit-submit');
}
getTest().then(console.log);
I must mention that if I don't use browserWSEndpoint, and use this code at the beginning instead, everything works fine.
const browser = await puppeteer.launch({headless: false, defaultViewport:null});
Honnestly I am pretty lost with browserWSEndpoint, I used info from this post How to run Puppeteer code in any web browser?
which led me to browserless.io, copied the code and it works.
Now this is my precise question, my error indicates does not work with multiple connections to the same page. How exactly am I connecting with multiple connections? Maybe I can resolve this issue and then I could use const [fileChooser].
My main issue is that I need to upload a file, using browserless
Others seem to have the same problem according to https://github.com/GoogleChrome/puppeteer/issues/4783, but using chromuim is not an option if I want to use browserless

If you are the only client connected to that browser you must be connected to a browser that doesn't support the fileChooser. You should connect to a Chromium 77.0.3844.0 (r674921) or higher.

Can the browser turned headless mid-execution when it was started normally, or vice-versa?

I want to start a chromium browser instant headless, do some automated operations, and then turn it visible before doing the rest of the stuff.
Is this possible to do using Puppeteer, and if it is, can you tell me how? And if it is not, is there any other framework or library for browser automation that can do this?
So far I've tried the following but it didn't work.
const browser = await puppeteer.launch({'headless': false});
browser.headless = true;
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});

Short answer: It's not possible
Chrome only allows to either start the browser in headless or non-headless mode. You have to specify it when you launch the browser and it is not possible to switch during runtime.
What is possible, is to launch a second browser and reuse cookies (and any other data) from the first browser.
Long answer
You would assume that you could just reuse the data directory when calling puppeteer.launch, but this is currently not possible due to multiple bugs (#1268, #1270 in the puppeteer repo).
So the best approach is to save any cookies or local storage data that you need to share between the browser instances and restore the data when you launch the browser. You then visit the website a second time. Be aware that any state the website has in terms of JavaScript variable, will be lost when you recrawl the page.
Process
Summing up, the whole process should look like this (or vice versa for headless to headfull):
Crawl in non-headless mode until you want to switch mode
Serialize cookies
Launch or reuse second browser (in headless mode)
Restore cookies
Revisit page
Continue crawling

As mentioned, this isn't currently possible since the headless switch occurs via Chromium launch flags.
I usually do this with userDataDir, which the Chromium docs describe as follows:
The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state.
Here's a simple example. This launches a browser headlessly, sets a local storage value on an arbitrary page, closes the browser, re-opens it headfully, retrieves the local storage value and prints it.
const puppeteer = require("puppeteer"); // ^18.0.4
const url = "https://www.example.com";
const opts = {userDataDir: "./data"};
let browser;
(async () => {
{
browser = await puppeteer.launch({...opts, headless: true});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.evaluate(() => localStorage.setItem("hello", "world"));
await browser.close();
}
{
browser = await puppeteer.launch({...opts, headless: false});
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
const result = await page.evaluate(() => localStorage.getItem("hello"));
console.log(result); // => world
}
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Change const opts = {userDataDir: "./data"}; to const opts = {}; and you'll see null print instead of world; the user data doesn't persist.
The answer from a few years ago mentions issues with userDataDir and suggests a cookies solution. That's fine, but I haven't had any issues with userDataDir so either they've been resolved on the Puppeteer end or my use cases haven't triggered the issues.
There's a useful-looking answer from a reputable source in How to turn headless on after launch? but I haven't had a chance to try it yet.

We Keep Coding

JavaScript is the programming language of the Web.

HTTPRequests are lost/missing when using Puppeteer with setRequestInterception enabled - javascript

Maybe you are incepting some javascript files that initiate the requests that you are not seeing?

Related

Testcafe scripts mostly fails over Mozilla Firefox browser due page load issue

Puppeteer change proxy of already opened browser page

Multiple separate browser with one tab each - simultaneous interaction with elements on pages (puppeteer headless)

uploading a file using puppeteer browserWSEndpoint

Can the browser turned headless mid-execution when it was started normally, or vice-versa?

Categories

Resources