Puppeteer.js can't run headless with proxies - javascript

I have this code that opens a browser and pushes it into an array. I'm trying to launch the browser with a proxy IP, but the pages I try to visit never load; they just give me an error like "This site can't be reached", ERR_NETWORK_RESET, ERR_TUNNEL_CONNECTION_FAILED, etc. Here is the code I'm using. I get the proxies from this website: http://free-proxy.cz/en/
bots.browsers.push(await puppeteer.launch({
    headless: false,
    args: [
        '--start-maximized',
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process',
        '--disable-gpu',
        '--proxy-server=52.149.152.236:80'
    ]
}));
}
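Worth noting: ERR_TUNNEL_CONNECTION_FAILED usually means the proxy itself is dead or refuses HTTPS CONNECT requests, which is very common with free proxy lists. As a sketch (the helpers parseProxy/launchOptions are illustrative names, not a Puppeteer API), you can parse each list entry up front and build the launch options from it, which also covers authenticated proxies via page.authenticate:

```javascript
// Sketch: turn a proxy-list entry into Puppeteer launch options.
// Supports "host:port" and "user:pass@host:port" entries.
function parseProxy(entry) {
  const at = entry.lastIndexOf('@');
  const credentials = at === -1 ? null : entry.slice(0, at);
  const [host, port] = entry.slice(at + 1).split(':');
  let auth = null;
  if (credentials) {
    const [username, password] = credentials.split(':');
    auth = { username, password };
  }
  return { host, port: Number(port), auth };
}

function launchOptions(proxy) {
  return {
    headless: false,
    args: [`--proxy-server=${proxy.host}:${proxy.port}`],
  };
}

// Usage (with puppeteer installed):
// const proxy = parseProxy('52.149.152.236:80');
// bots.browsers.push(await puppeteer.launch(launchOptions(proxy)));
// If the proxy needs credentials, call this on each new page:
// if (proxy.auth) await page.authenticate(proxy.auth);
```

Testing each entry with a plain HTTP request before handing it to Puppeteer saves a lot of failed browser launches.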

Related

How to configure puppeteer to use current browser user?

Sometimes I want to fetch some information through my own account, using the Chrome extensions that are currently installed.
So I tried configuring the userDataDir field. I don't know if I configured it incorrectly, but it never works.
const browser = await puppeteer.launch({
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    headless: false,
    userDataDir: '~/Library/Application Support/Google/Chrome/Default',
})
Does anyone have the same problem as me?

How to enable sharedArrayBuffer in chrome without cross-origin isolation

I have this experiment which I only run on my local machine: I load an external webpage, for example https://example.com, and then with Puppeteer I inject a JavaScript file which is served from http://localhost:5000.
So far there are no issues. But this injected JavaScript file loads a WebAssembly file, and then I get the following error:
Uncaught (in promise) ReferenceError: SharedArrayBuffer is not defined
....
And indeed, SharedArrayBuffer is not defined (Chrome v96), with the result that my code is not working at all (it used to work, though). So my question is: how can I solve this error?
Reading more about this, it seems that you can add two headers:
res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
which I did for both files, without much success. Maybe this will not work given that the page is from a different domain than the injected JS and WASM files.
But maybe there is another solution possible. Here is my command to start Chrome:
client.browser = await puppeteer.launch({
    headless: false,
    devtools: true,
    defaultViewport: null,
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    args: [
        '--debug-devtools',
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-web-security',
        '--allow-running-insecure-content',
        '--disable-notifications',
        '--window-size=1920,1080'
    ]
    // slowMo: 500
});
I know Chrome has many options, so maybe there is an option for this SharedArrayBuffer issue as well?
Hope someone knows how this works and can help me. Thanks a lot!
In this thread someone suggested starting Chrome as follows:
$> chrome --enable-features=SharedArrayBuffer
meaning I can add '--enable-features=SharedArrayBuffer' to my Puppeteer config!
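Concretely, that means appending one flag to the args list from the launch call above (a sketch of the same config):

```javascript
// The same args list as in the launch call above, plus the suggested flag.
const launchArgs = [
  '--debug-devtools',
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-web-security',
  '--allow-running-insecure-content',
  '--disable-notifications',
  '--window-size=1920,1080',
  '--enable-features=SharedArrayBuffer', // enables SharedArrayBuffer without COOP/COEP headers
];

// client.browser = await puppeteer.launch({
//   headless: false,
//   devtools: true,
//   defaultViewport: null,
//   args: launchArgs,
// });
```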
Peter Beverloo made an extensive list of Chromium command line switches on his blog a while back.
There are lots of command lines which can be used with the Google Chrome browser. Some change behavior of features, others are for debugging or experimenting. This page lists the available switches including their conditions and descriptions. Last automated update occurred on 2020-08-12.
See https://peter.sh/experiments/chromium-command-line-switches/
If you're looking for a specific command, it will be there; give it a shot. Though I'm pretty sure cross-origin restrictions were implemented specifically to prevent what you're trying to do.

Confusion over args for Puppeteer

I am a little confused about the arguments needed for Puppeteer, in particular when the puppeteer-extra stealth plugin is used. I am currently just using all the default settings and Chromium; however, I keep seeing examples like this:
let options = {
    headless: false,
    ignoreHTTPSErrors: true,
    args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-sync',
        '--ignore-certificate-errors'
    ],
    defaultViewport: { width: 1366, height: 768 }
};
Do I actually need any of these to avoid being detected? I've been using Puppeteer without setting any of them and it passes the bot test out of the box. What is --no-sandbox for?
These are Chromium flags, not Puppeteer-specific.
Please take a look at the following section on --no-sandbox, for example:
https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md#setting-up-chrome-linux-sandbox
Setting Up Chrome Linux Sandbox
In order to protect the host environment from untrusted web content, Chrome uses multiple layers of sandboxing. For this to work properly, the host should be configured first. If there's no good sandbox for Chrome to use, it will crash with the error No usable sandbox!.
If you absolutely trust the content you open in Chrome, you can launch Chrome with the --no-sandbox argument:
const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox']
});
NOTE: Running without a sandbox is strongly discouraged. Consider configuring a sandbox instead.
https://chromium.googlesource.com/chromium/src/+/HEAD/docs/linux/sandboxing.md#linux-sandboxing
Chromium uses a multiprocess model, which allows giving different privileges and restrictions to different parts of the browser. For instance, we want renderers to run with a limited set of privileges, since they process untrusted input and are likely to be compromised. Renderers will use an IPC mechanism to request access to resources from a more privileged process (the browser process). You can find more about this general design here.
We use different sandboxing techniques on Linux and Chrome OS, in combination, to achieve a good level of sandboxing. You can see which sandboxes are currently engaged by looking at chrome://sandbox (renderer processes) and chrome://gpu (gpu process).
. . .
You can disable all sandboxing (for testing) with --no-sandbox.
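Given that warning, one pattern is to pass the sandbox-disabling flags only in environments that actually need them, such as a container without a usable sandbox. A sketch, where IN_DOCKER is an environment variable you would set yourself in your own image:

```javascript
// Disable the sandbox only where it genuinely cannot run.
// IN_DOCKER is an assumed env var, set by your own container setup.
function sandboxArgs(env = process.env) {
  return env.IN_DOCKER ? ['--no-sandbox', '--disable-setuid-sandbox'] : [];
}

// const browser = await puppeteer.launch({ args: sandboxArgs() });
```

Locally the sandbox stays on; only the constrained environment opts out.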

Why does headless need to be false for Puppeteer to work?

I'm creating a web API that scrapes a given URL and sends the result back. I am using Puppeteer to do this. I asked this question: Puppeteer not behaving like in Developer Console
and received an answer that suggested it would only work if headless was set to false. I don't want to constantly open a browser UI I don't need (I just need the data!), so I'm looking for why headless has to be false, and whether there is a fix that lets headless = true.
Here's my code:
express()
    .get("/*", (req, res) => {
        global.notBaseURL = req.params[0];
        (async () => {
            const browser = await puppet.launch({ headless: false }); // Line of Interest
            const page = await browser.newPage();
            console.log(req.params[0]);
            await page.goto(req.params[0], { waitUntil: "networkidle2" }); // this is the url
            const title = await page.$eval("title", (el) => el.innerText);
            await browser.close();
            res.send({
                title: title,
            });
        })();
    })
    .listen(PORT, () => console.log(`Listening on ${PORT}`));
This is the page I'm trying to scrape: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106?origin=coordinating-5460106-0-1-FTR-recbot-recently_viewed_snowplow_mvp&recs_placement=FTR&recs_strategy=recently_viewed_snowplow_mvp&recs_source=recbot&recs_page_type=category&recs_seed=0&color=BLACK
The reason it might work in UI mode but not headless is that sites that aggressively fight scraping will detect that you are running in a headless browser.
Some possible workarounds:
Use puppeteer-extra
Found here: https://github.com/berstend/puppeteer-extra
Check out their docs for how to use it. It has a couple plugins that might help in getting past headless-mode detection:
puppeteer-extra-plugin-anonymize-ua -- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit https://amiunique.org/ it is unlikely to be enough to keep you from being identified as a repeat visitor.
puppeteer-extra-plugin-stealth -- this might help win the cat-and-mouse game of not being detected as headless. There are many tricks that are employed to detect headless mode, and as many tricks to evade them.
Run a "real" Chromium instance/UI
It's possible to run a single browser UI in a manner that lets you attach Puppeteer to that running instance. Here's an article that explains it: https://medium.com/@jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0
Essentially, you start Chrome or Chromium (or Edge?) from the command line with --remote-debugging-port=9222 (or any old port?) plus other command-line switches depending on what environment you're running it in. Then you use Puppeteer to connect to that running instance instead of having it do its default behavior of launching a headless Chromium instance: const browser = await puppeteer.connect({ browserURL: ENDPOINT_URL });. Read the Puppeteer docs here for more info: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-puppeteerlaunchoptions
The ENDPOINT_URL is displayed in the terminal when you launch the browser from the command line with the --remote-debugging-port=9222 option.
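The connect step can be sketched like this (debuggerURL is an illustrative helper; the port must match the one you passed on the command line):

```javascript
// Build the HTTP debugging endpoint for a browser started with:
//   chrome --remote-debugging-port=9222
function debuggerURL(port = 9222) {
  return `http://127.0.0.1:${port}`;
}

// Usage (with puppeteer installed and the browser already running):
// const browser = await puppeteer.connect({ browserURL: debuggerURL(9222) });
// const page = await browser.newPage();
```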
This option is going to require some server/ops mojo, so be prepared to do a lot more Stack Overflow searches. :-)
There are other strategies I'm sure but those are the two I'm most familiar with. Good luck!
Todd's answer is thorough, but before resorting to some of the recommendations there, it's worth trying to slap on the following user-agent line, pulled from the relevant Puppeteer GitHub issue, Different behavior between { headless: false } and { headless: true }:
await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
await page.goto(yourURL);
Now, the Nordstrom site provided by the OP seems to be able to detect robots even with headless: false, at least at present. But other sites are less strict, and I've found the above line to be useful on some of them, as shown in Puppeteer can't find elements when Headless TRUE and Puppeteer, bringing back blank array.
Visit the GH issue thread above for other ideas and see useragents.me for a rotating list of current user agents.

Unable to use pa11y actions to login to sites

I'm seemingly unable to log in to sites using pa11y and its 'actions' feature. The documentation and sites I've found discussing pa11y actions seem to indicate this is a simple affair, but I'm having no luck.
I've attempted to log in to various sites, ranging from established sites (GitHub) to my own sites hosted in the cloud, and even sites running on my local machine. All of them get to the "click" action for the login form's submit button (usually a POST request), then hang and eventually time out before the "wait for url to change" action takes place.
If I set headless to false and open the inspector in Chromium, I can see the POST request being made after the login button has been clicked, but it never completes. Interestingly, when I try this against sites I control and can see the logs for, I see the successful logins happening on the server side, but pa11y (maybe it's really a puppeteer or headless-chrome issue) never seems to receive the response.
I've included the code I've tried for GitHub, which is not working. I'm using Node 8.11.3 on a Mac. I've even tried on a second computer and still have the same issue. The code is based on an example given in the pa11y docs, modified only slightly (https://github.com/pa11y/pa11y/blob/master/example/actions/index.js).
What am I doing wrong?
'use strict';

const pa11y = require('pa11y');

runExample();

async function runExample() {
    try {
        const urlToTest = 'https://github.com/login';
        const result = await pa11y(urlToTest, {
            actions: [
                'navigate to https://github.com/login',
                'set field #login_field to MYUSERNAME',
                'set field #password to MYPASSWORD',
                'click element #login input.btn.btn-primary.btn-block',
                // everything stops here...
                'wait for url to not be https://github.com/login',
                'navigate to https://github.com/settings/profile',
                'screen-capture github.png'
            ],
            chromeLaunchConfig: {
                headless: false,
                slowMo: 250
            },
            log: {
                debug: console.log,
                error: console.error,
                info: console.log
            },
            timeout: 90000
        });
        console.log(result);
    } catch (error) {
        console.error(error.message);
    }
}
I had the same issue. I managed to get past the login page by combining this: https://github.com/emadehsan/thal with the example from here: https://github.com/pa11y/pa11y/blob/master/example/puppeteer/index.js. Maybe it helps you.
The problem appears to stem from using certain versions of the puppeteer library, which is a dependency of pa11y. Using puppeteer 1.6.2 results in a working environment; using puppeteer 1.7.0 was producing the timeout issues. I was able to fix this by specifying "puppeteer": "~1.6.0" alongside the pa11y dependency in my package.json file. The current version of pa11y will happily work with any version of puppeteer down to 1.4.0.
Here's the relevant issue on GitHub for the pa11y library to follow for updates: https://github.com/pa11y/pa11y/issues/421
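For reference, the pin described above looks like this in package.json (version number taken from the answer; your existing pa11y entry stays as it is):

```json
{
  "dependencies": {
    "puppeteer": "~1.6.0"
  }
}
```

The tilde range ~1.6.0 allows patch updates (1.6.x) but never 1.7.0.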
