How to configure puppeteer to use current browser user.
Sometimes, I want to get some information through my account, and with my chrome plugin that is installed currently.
So, I try to configure the "userDataDir" fields, I don't know if I configured it incorrectly, but it never works.
const browser = await puppeteer.launch({
executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
headless: false,
userDataDir: '~/Library/Application Support/Google/Chrome/Default',
})
Does anyone have the same problem as me?
Related
I am a little confused over the arguments needed for Puppeteer, in particular when the puppeteer-extra stealth plugin is used. I am currently just using all the default settings and Chromium however I keep seeing examples like this:
let options = {
headless: false,
ignoreHTTPSErrors: true,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-sync',
'--ignore-certificate-errors'
],
defaultViewport: { width: 1366, height: 768 }
};
Do I actually need any of these to avoid being detected? Been using Puppeteer without setting any of them and it passes the bot test out of the box. What is --no-sandbox for?
these are chromium features - not puppeteer specific
please take a look at the following sections for --no-sandbox for example.
https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md#setting-up-chrome-linux-sandbox
Setting Up Chrome Linux Sandbox
In order to protect the host
environment from untrusted web content, Chrome uses multiple layers of
sandboxing. For this to work properly, the host should be configured
first. If there's no good sandbox for Chrome to use, it will crash
with the error No usable sandbox!.
If you absolutely trust the content you open in Chrome, you can launch
Chrome with the --no-sandbox argument:
const browser = await puppeteer.launch({args: ['--no-sandbox',
'--disable-setuid-sandbox']});
NOTE: Running without a sandbox is
strongly discouraged. Consider configuring a sandbox instead.
https://chromium.googlesource.com/chromium/src/+/HEAD/docs/linux/sandboxing.md#linux-sandboxing
Chromium uses a multiprocess model, which allows to give different
privileges and restrictions to different parts of the browser. For
instance, we want renderers to run with a limited set of privileges
since they process untrusted input and are likely to be compromised.
Renderers will use an IPC mechanism to request access to resource from
a more privileged (browser process). You can find more about this
general design here.
We use different sandboxing techniques on Linux and Chrome OS, in
combination, to achieve a good level of sandboxing. You can see which
sandboxes are currently engaged by looking at chrome://sandbox
(renderer processes) and chrome://gpu (gpu process).\
. . .
You can disable all sandboxing (for
testing) with --no-sandbox.
I'm creating a web api that scrapes a given url and sends that back. I am using Puppeteer to do this. I asked this question: Puppeteer not behaving like in Developer Console
and recieved an answer that suggested it would only work if headless was set to be false. I don't want to be constantly opening up a browser UI i don't need (I just the need the data!) so I'm looking for why headless has to be false and can I get a fix that lets headless = true.
Here's my code:
express()
.get("/*", (req, res) => {
global.notBaseURL = req.params[0];
(async () => {
const browser = await puppet.launch({ headless: false }); // Line of Interest
const page = await browser.newPage();
console.log(req.params[0]);
await page.goto(req.params[0], { waitUntil: "networkidle2" }); //this is the url
title = await page.$eval("title", (el) => el.innerText);
browser.close();
res.send({
title: title,
});
})();
})
.listen(PORT, () => console.log(`Listening on ${PORT}`));
This is the page I'm trying to scrape: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106?origin=coordinating-5460106-0-1-FTR-recbot-recently_viewed_snowplow_mvp&recs_placement=FTR&recs_strategy=recently_viewed_snowplow_mvp&recs_source=recbot&recs_page_type=category&recs_seed=0&color=BLACK
The reason it might work in UI mode but not headless is that sites who aggressively fight scraping will detect that you are running in a headless browser.
Some possible workarounds:
Use puppeteer-extra
Found here: https://github.com/berstend/puppeteer-extra
Check out their docs for how to use it. It has a couple plugins that might help in getting past headless-mode detection:
puppeteer-extra-plugin-anonymize-ua -- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit https://amiunique.org/ it is unlikely to be enough to keep you from being identified as a repeat visitor.
puppeteer-extra-plugin-stealth -- this might help win the cat-and-mouse game of not being detected as headless. There are many tricks that are employed to detect headless mode, and as many tricks to evade them.
Run a "real" Chromium instance/UI
It's possible to run a single browser UI in a manner that let's you attach puppeteer to that running instance. Here's an article that explains it: https://medium.com/#jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0
Essentially you're starting Chrome or Chromium (or Edge?) from the command line with --remote-debugging-port=9222 (or any old port?) plus other command line switches depending on what environment you're running it in. Then you use puppeteer to connect to that running instance instead of having it do the default behavior of launching a headless Chromium instance: const browser = await puppeteer.connect({ browserURL: ENDPOINT_URL });. Read the puppeteer docs here for more info: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-puppeteerlaunchoptions
The ENDPOINT_URL is displayed in the terminal when you launch the browser from the command line with the --remote-debugging-port=9222 option.
This option is going to require some server/ops mojo, so be prepared to do a lot more Stack Overflow searches. :-)
There are other strategies I'm sure but those are the two I'm most familiar with. Good luck!
Todd's answer is thorough, but worth trying before resorting to some of the recommendations there is to slap on the following user agent line pulled from the relevant Puppeteer GitHub issue Different behavior between { headless: false } and { headless: true }:
await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
await page.goto(yourURL);
Now, the Nordstorm site provided by OP seems to be able to detect robots even with headless: false, at least at the present moment. But other sites are less strict and I've found the above line to be useful on some of them as shown in Puppeteer can't find elements when Headless TRUE and Puppeteer , bringing back blank array.
Visit the GH issue thread above for other ideas and see useragents.me for a rotating list of current user agents.
I have this code right here that opens a browser and pushes it inside an array, anyway iam trying to use a proxy ip to launch the browser with but everytime the pages i try to go to never load and they just give me an error like This site cant be reached OR ERR NETWORK RESET / ERR_TUNNEL_CONNECTION_FAILED etc, here is the code iam using, and also i use this website to get the proxies http://free-proxy.cz/en/
bots.browsers.push(await puppeteer.launch({
headless: false,
args:[
'--start-maximized',
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--no-first-run',
'--no-zygote',
'--single-process',
'--disable-gpu',
'--proxy-server=52.149.152.236:80'
]
}));
}
I want to launch only one Chromium instance from first script and then attach to it from other scripts. I know about puppeteer.connect() but the problem is that I start the script which is supposed to launch Chromium:
const puppeteer = require('puppeteer');
const fs = require('fs');
const logger = fs.createWriteStream('log.txt', {
flags: 'a' // 'a' means appending (old data will be preserved)
});
(async() => {
const browser = await puppeteer.launch({ headless: false});
logger.write('-----Browser is launched\n');
logger.write(browser.wsEndpoint());
})();
...and it never ends because I didn`t do browser.close(). Thus, I can`t start running other scripts. How can I launch Chromium, obtain its endpoint and end the script remaining Chromium launched.
(This one doesn`t contain an appropriate answer)
Answering question
Basically you can spawn child_process with detached set to true. Then exit your main script with process.exit() to launch Chromium see 1.js.
Script that responsible to launch Chromium and saving the web socket see chromiumLauncher.js
When the web socket are saved, you can connect via puppeteer.launch see 2.js
Here i push it on github (dirty code).
I am trying to use Puppeteer for end-to-end tests. These tests require accessing the network emulation capabilities of DevTools (e.g. to simulate offline browsing).
So far I am using chrome-remote-interface, but it is too low-level for my taste.
As far as I know, Puppeteer does not expose the network DevTools features (emulateNetworkConditions in the DevTools protocol).
Is there an escape hatch in Puppeteer to access those features, e.g. a way to execute a Javascript snippet in a context in which the DevTools API is accessible?
Thanks
Edit:
OK, so it seems that I can work around the lack of an API using something like this:
const client = page._client;
const res = await client.send('Network.emulateNetworkConditions',
{ offline: true, latency: 40, downloadThroughput: 40*1024*1024,
uploadThroughput: 40*1024*1024 });
But I suppose it is Bad Form and may slip under my feet at any time?
Update: headless Chrome now supports network throttling!
In Puppeteer, you can emulate devices (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pageemulateoptions) but not network conditions. It's something we're considering, but headless Chrome needs to support network throttling first.
To emulate a device, I'd use the predefined devices found in DeviceDescriptors:
const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const iPhone = devices['iPhone 6'];
puppeteer.launch().then(async browser => {
const page = await browser.newPage();
await page.emulate(iPhone);
await page.goto('https://www.google.com');
// other actions...
browser.close();
});