Puppeteer: is there a way to access the DevTools Network API? - javascript

I am trying to use Puppeteer for end-to-end tests. These tests require accessing the network emulation capabilities of DevTools (e.g. to simulate offline browsing).
So far I am using chrome-remote-interface, but it is too low-level for my taste.
As far as I know, Puppeteer does not expose the network DevTools features (emulateNetworkConditions in the DevTools protocol).
Is there an escape hatch in Puppeteer to access those features, e.g. a way to execute a JavaScript snippet in a context in which the DevTools API is accessible?
Thanks
Edit:
OK, so it seems that I can work around the lack of an API using something like this:
const client = page._client;
const res = await client.send('Network.emulateNetworkConditions', {
  offline: true,
  latency: 40,
  downloadThroughput: 40 * 1024 * 1024,
  uploadThroughput: 40 * 1024 * 1024,
});
But I suppose it is Bad Form and may slip under my feet at any time?
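For what it's worth, a less fragile variant of the same workaround is to open a dedicated CDP session rather than touching the private _client property. A minimal sketch, assuming a Puppeteer version that exposes page.target().createCDPSession() (the URL is just a placeholder):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Open a dedicated DevTools protocol session for this page.
  const client = await page.target().createCDPSession();

  // Same CDP command as above, but through a supported entry point.
  await client.send('Network.emulateNetworkConditions', {
    offline: true,
    latency: 40,                          // added round-trip latency, ms
    downloadThroughput: 40 * 1024 * 1024, // bytes/sec
    uploadThroughput: 40 * 1024 * 1024,   // bytes/sec
  });

  // Navigation should now fail, since the page is emulating offline.
  await page.goto('https://example.com').catch(() => {});
  await browser.close();
})();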

Update: headless Chrome now supports network throttling!
In Puppeteer, you can emulate devices (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pageemulateoptions) but not network conditions. It's something we're considering, but headless Chrome needs to support network throttling first.
To emulate a device, I'd use the predefined devices found in DeviceDescriptors:
const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const iPhone = devices['iPhone 6'];

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.google.com');
  // other actions...
  await browser.close();
});

Related

how to launch Google Chrome with authenticated Proxies

I want to launch Google Chrome with authenticated proxies so I can connect a Puppeteer instance to it. I'm using this command line to launch a new instance:
chrome --remote-debugging-port=9222 --user-data-dir="C:\Users\USER\AppData\Local\Google\Chrome\User Data"
I managed to use authenticated proxies with Chrome, but it was a bit complicated, especially in my case, as I want to launch multiple Chrome browsers, each with its own proxies.
I used this: proxy-login-automator
It worked fine, but as I said, it's a bit complicated and needs some work before I can integrate it the way I want. This is how I'm connecting to the Chrome instance:
const browserURL = 'http://127.0.0.1:9222';
const browser = await puppeteer.connect({browserURL});
const page = await browser.newPage();
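A simpler route, if launching the browser from Puppeteer itself is acceptable, is to pass the proxy with --proxy-server and answer its auth challenge with page.authenticate(). A minimal sketch; the proxy host, port, and credentials below are placeholders:

const puppeteer = require('puppeteer');

(async () => {
  // One browser per proxy; repeat with a different --proxy-server per instance.
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://proxy.example.com:8080'],
  });
  const page = await browser.newPage();

  // Responds to the proxy's 407 authentication challenge.
  await page.authenticate({ username: 'user', password: 'pass' });

  await page.goto('https://example.com');
  await browser.close();
})();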

Confusion over args for Puppeteer

I am a little confused over the arguments needed for Puppeteer, in particular when the puppeteer-extra stealth plugin is used. I am currently just using all the default settings and Chromium; however, I keep seeing examples like this:
let options = {
  headless: false,
  ignoreHTTPSErrors: true,
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-sync',
    '--ignore-certificate-errors'
  ],
  defaultViewport: { width: 1366, height: 768 }
};
Do I actually need any of these to avoid being detected? I've been using Puppeteer without setting any of them and it passes the bot test out of the box. What is --no-sandbox for?
These are Chromium flags, not Puppeteer-specific.
Please take a look at the following section on --no-sandbox, for example:
https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md#setting-up-chrome-linux-sandbox
Setting Up Chrome Linux Sandbox
In order to protect the host environment from untrusted web content, Chrome uses multiple layers of sandboxing. For this to work properly, the host should be configured first. If there's no good sandbox for Chrome to use, it will crash with the error No usable sandbox!.
If you absolutely trust the content you open in Chrome, you can launch Chrome with the --no-sandbox argument:

const browser = await puppeteer.launch({
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
NOTE: Running without a sandbox is strongly discouraged. Consider configuring a sandbox instead.
https://chromium.googlesource.com/chromium/src/+/HEAD/docs/linux/sandboxing.md#linux-sandboxing
Chromium uses a multiprocess model, which allows giving different privileges and restrictions to different parts of the browser. For instance, we want renderers to run with a limited set of privileges since they process untrusted input and are likely to be compromised. Renderers will use an IPC mechanism to request access to resources from a more privileged (browser) process. You can find more about this general design here.
We use different sandboxing techniques on Linux and Chrome OS, in combination, to achieve a good level of sandboxing. You can see which sandboxes are currently engaged by looking at chrome://sandbox (renderer processes) and chrome://gpu (gpu process).
. . .
You can disable all sandboxing (for testing) with --no-sandbox.
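So, back to the original question: under a normal desktop setup none of those flags are required, and --no-sandbox in particular is only a workaround for environments (e.g. running as root in some containers) where the sandbox cannot be set up. A minimal sketch with all defaults, using example.com as a placeholder:

const puppeteer = require('puppeteer');

(async () => {
  // Default launch: headless, sandboxed, bundled Chromium, no extra args.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();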

Why does headless need to be false for Puppeteer to work?

I'm creating a web API that scrapes a given URL and sends the result back. I am using Puppeteer to do this. I asked this question: Puppeteer not behaving like in Developer Console
and received an answer suggesting it would only work if headless was set to false. I don't want to be constantly opening up a browser UI I don't need (I just need the data!), so I'm looking for why headless has to be false, and whether there is a fix that lets headless = true.
Here's my code:
express()
  .get("/*", (req, res) => {
    global.notBaseURL = req.params[0];
    (async () => {
      const browser = await puppet.launch({ headless: false }); // Line of Interest
      const page = await browser.newPage();
      console.log(req.params[0]);
      await page.goto(req.params[0], { waitUntil: "networkidle2" }); // this is the url
      const title = await page.$eval("title", (el) => el.innerText);
      await browser.close();
      res.send({
        title: title,
      });
    })();
  })
  .listen(PORT, () => console.log(`Listening on ${PORT}`));
This is the page I'm trying to scrape: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106?origin=coordinating-5460106-0-1-FTR-recbot-recently_viewed_snowplow_mvp&recs_placement=FTR&recs_strategy=recently_viewed_snowplow_mvp&recs_source=recbot&recs_page_type=category&recs_seed=0&color=BLACK
The reason it might work in UI mode but not headless is that sites that aggressively fight scraping will detect that you are running in a headless browser.
Some possible workarounds:
Use puppeteer-extra
Found here: https://github.com/berstend/puppeteer-extra
Check out their docs for how to use it. It has a couple plugins that might help in getting past headless-mode detection:
puppeteer-extra-plugin-anonymize-ua -- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit https://amiunique.org/ it is unlikely to be enough to keep you from being identified as a repeat visitor.
puppeteer-extra-plugin-stealth -- this might help win the cat-and-mouse game of not being detected as headless. There are many tricks that are employed to detect headless mode, and as many tricks to evade them.
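For reference, wiring the stealth plugin in is only a few lines. A minimal sketch along the lines of the puppeteer-extra README (the URL and screenshot path are placeholders):

// npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// puppeteer-extra wraps the regular Puppeteer API, so launch/newPage work as usual.
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'stealth-test.png' });
  await browser.close();
})();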
Run a "real" Chromium instance/UI
It's possible to run a single browser UI in a manner that lets you attach Puppeteer to that running instance. Here's an article that explains it: https://medium.com/@jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0
Essentially you're starting Chrome or Chromium (or Edge?) from the command line with --remote-debugging-port=9222 (or any old port?) plus other command line switches depending on what environment you're running it in. Then you use puppeteer to connect to that running instance instead of having it do the default behavior of launching a headless Chromium instance: const browser = await puppeteer.connect({ browserURL: ENDPOINT_URL });. Read the puppeteer docs here for more info: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-puppeteerlaunchoptions
The endpoint is displayed in the terminal when you launch the browser from the command line with the --remote-debugging-port=9222 option: the ws:// URL printed there can be passed to puppeteer.connect as browserWSEndpoint, or you can simply pass http://127.0.0.1:9222 as the browserURL.
This option is going to require some server/ops mojo, so be prepared to do a lot more Stack Overflow searches. :-)
There are other strategies I'm sure but those are the two I'm most familiar with. Good luck!
Todd's answer is thorough, but before resorting to some of the recommendations there, it's worth trying the following user-agent line, pulled from the relevant Puppeteer GitHub issue, Different behavior between { headless: false } and { headless: true }:
await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
await page.goto(yourURL);
Now, the Nordstrom site provided by OP seems to be able to detect robots even with headless: false, at least at the present moment. But other sites are less strict, and I've found the above line useful on some of them, as shown in Puppeteer can't find elements when Headless TRUE and Puppeteer, bringing back blank array.
Visit the GH issue thread above for other ideas and see useragents.me for a rotating list of current user agents.

How to "hook in" puppeteer into a running Chrome instance/tab

Is it somehow possible to attach Puppeteer to a running Chrome instance (a manually started browser) and then take over control within a tab? I'm assuming it's related to starting the Chrome browser with the --no-sandbox flag, but I don't know how to continue from there.
Thanks for any help
You can use puppeteer.connect(options) (see here):
const puppeteer = require('puppeteer');
const browserWSEndpoint = 'a browser websocket endpoint to connect to';
const browser = await puppeteer.connect({browserWSEndpoint});
//continue from here
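If you only know the debugging port, the websocket endpoint can be read from the DevTools /json/version endpoint. A sketch, assuming Chrome was started manually with --remote-debugging-port=9222 and that a fetch implementation is available (global fetch in recent Node versions, or a library such as node-fetch):

const puppeteer = require('puppeteer');

(async () => {
  // Chrome must already be running with --remote-debugging-port=9222.
  const res = await fetch('http://127.0.0.1:9222/json/version');
  const { webSocketDebuggerUrl } = await res.json();

  const browser = await puppeteer.connect({
    browserWSEndpoint: webSocketDebuggerUrl,
  });

  // Take over a tab that is already open instead of creating a new one.
  const [page] = await browser.pages();
  console.log(await page.title());

  // disconnect() leaves the manually started browser running.
  browser.disconnect();
})();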

Capturing application screen with JavaScript

Is it possible to capture the entire window as screenshot using JavaScript?
The application might contain many iframes and div's where content are loaded asynchronously.
I have explored canvas2image, but it works on a single HTML element, and using it discards any iframes present on the page.
I am looking for a solution where the capture will take care of all the iframes present.
The only way to capture the contents of an iframe using ONLY JavaScript in the webpage (no extensions, or applications running outside the browser on a user's system) is to use the HTMLIFrameElement.getScreenshot() API in Firefox. This API is non-standard and ONLY works in Firefox.
For any other browser, no. An iframe is typically sandboxed, and as such it is not accessible by the browser by design.
The best way I have found (and use) to get a screenshot of a webpage is an instance of Headless Chrome or Headless Firefox. These will take a screenshot of everything on the page, just as a user would see it.
Yes, with Puppeteer it is possible.
1 - Just install the dependency:
npm i puppeteer
2 - Create a JavaScript file, screenshot.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://yourweb.com');
  await page.screenshot({ path: 'screenshot.png' });
  await browser.close();
})();
3 - Run:
node screenshot.js
Source
Web pages are not the best things to be screenshotted because of their nature; they can include async elements, frames and the like, and they are usually responsive, etc.
For your purpose the best way is to use an external API or an external service; I think it is not a good idea to try doing that with JS.
You should try https://www.url2png.com/
