Why does Cloudflare 403 only on headless requests? - javascript

I'm trying to use Puppeteer to load a page that's protected by Cloudflare. When it loads as headless, it 403s and presents a captcha. However, when it loads as not headless, it receives a 200 and loads the page correctly.
I'm trying to understand what is happening during the not-headless load that allows the page to load correctly.
I thought it might be some kind of local JavaScript execution on the page, but I've disabled JavaScript in all cases. I've also ruled out IP and rate limiting: all tests were conducted on the same personal machine, at least 1 minute apart.
Here's the code:
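const puppeteer = require('puppeteer');

(async () => {
  // Toggle to false to compare against a visible (non-headless) browser.
  const headless = true;
  const browser = await puppeteer.launch({headless});
  const page = await browser.newPage();
  // JavaScript is disabled to rule out client-side scripts as the cause.
  await page.setJavaScriptEnabled(false);
  await page.goto('https://angel.co/company/angellist/jobs');
  await page.screenshot({path: 'out.png'});
  await browser.close();
})();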
And the results: with headless = false the page loads correctly with a 200; with headless = true the response is a 403 and a captcha is shown.
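To narrow down what differs between the two runs, it can help to record the HTTP status and the user agent string the browser reports in each mode. The following is a minimal sketch, not a diagnosis: `probe` and `compareModes` are hypothetical helper names, and the Puppeteer module is passed in as an argument so the helpers stay independent of how it is installed.

```javascript
// Hypothetical helper: load `url` once with the given headless setting and
// report the response status and the browser's user agent string.
async function probe(puppeteer, url, headless) {
  const browser = await puppeteer.launch({ headless });
  try {
    const page = await browser.newPage();
    await page.setJavaScriptEnabled(false); // same setting as in the question
    const response = await page.goto(url);
    return {
      headless,
      status: response ? response.status() : null,
      userAgent: await browser.userAgent(),
    };
  } finally {
    await browser.close();
  }
}

// Run the non-headless and headless modes one after the other.
async function compareModes(puppeteer, url) {
  const results = [];
  for (const headless of [false, true]) {
    results.push(await probe(puppeteer, url, headless));
  }
  return results;
}
```

It could be called as `compareModes(require('puppeteer'), 'https://angel.co/company/angellist/jobs').then(console.log)`. If the two user agent strings differ, that is one candidate for what the server reacts to, though other signals may be involved.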

Related

Headless Chrome to Execute Javascript

I want to server-side render my React application. The request first reaches my web server (written in Rust), where I aggregate all the data required to generate the HTML.
After that, I want to execute my React application using headless Chrome.
Every headless Chrome example I have found starts by navigating to a page.
For example, in Node.js with the Puppeteer library:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Navigate to a URL and wait until the network is mostly idle.
  await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
  await page.pdf({path: 'hn.pdf', format: 'A4'});
  await browser.close();
})();
Instead of navigating to a URL, I just want to use headless Chrome as a JavaScript engine: give it some JavaScript and have it execute it.
I have looked, but could not find an example of that anywhere.
A quick look at the Puppeteer API turns up this method:
page.evaluate(pageFunction[, ...args])
pageFunction <function|string> Function to be evaluated in the page context
...
Exact link to API: https://github.com/puppeteer/puppeteer/blob/v3.1.0/docs/api.md#pageevaluatepagefunction-args
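As a minimal sketch of that idea, assuming a Puppeteer `browser` object is already available: a freshly opened page is blank, and `page.evaluate` accepts a source string, so no `page.goto` call is needed. The helper name `runScript` is hypothetical.

```javascript
// Hypothetical helper: evaluate a JavaScript source string in a fresh blank
// page and return its serializable result. No navigation is involved.
async function runScript(browser, source) {
  const page = await browser.newPage();
  try {
    return await page.evaluate(source);
  } finally {
    await page.close();
  }
}
```

With a real browser this would be used as `const browser = await puppeteer.launch(); const result = await runScript(browser, '1 + 2'); await browser.close();`. Only values that can be serialized are passed back from the page context.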

How to get a screenshot/preview of another website

Is there a way to get a screenshot of another website's pages?
For example: you enter a URL in an input, hit Enter, and a script gives you a screenshot of that site. I managed to do it with headless browsers, but I fear that launching one (say, PhantomJS) every time the input is used would take too much time and too many resources, since the headless browser has to fetch the new page each time. I also looked into HotJar, which does something similar to what I'm looking for: it gives you a script that you must put into the page header (which is fine by me), and afterwards you get a preview. How does that work, and how can one replicate it?
Do you want a screenshot of your own page or of someone else's?
Own page
Use Puppeteer or PhantomJS with every build of your site. This way it only runs when the site changes, and you have a screenshot ready at any time.
Foreign page
You have access to it (the owner runs your script)
Either try to get into the owner's build pipeline and use the solution from above,
or use the approach from "Using HTML5/Canvas/JavaScript to take in-browser screenshots".
You don't have any access
Use a long-running process that returns a screenshot on request.
Imagine a server with one URL endpoint: screenshot.example.com?facebook.com.
The long-running server keeps a Puppeteer/PhantomJS instance ready to go. When given a URL, it loads that page, takes the screenshot, and sends it back. The requesting browser simply sees it as a slow image request.
You can build this with Puppeteer.
Install it with: npm i puppeteer
Save the following code to example.js:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
and run it with:
node example.js
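The long-running process described above could be sketched roughly as follows. This is an illustration under assumptions, not a hardened service: `targetFromRequest` and `createScreenshotServer` are hypothetical names, the target is read from the raw query string as in the `screenshot.example.com?facebook.com` example, and one shared Puppeteer browser is passed in so it is launched only once.

```javascript
const http = require('http');

// Hypothetical helper: turn a request path such as "/?facebook.com" into an
// absolute http(s) URL, or return null when no usable target is given.
function targetFromRequest(requestUrl) {
  const index = requestUrl.indexOf('?');
  if (index === -1) return null;
  try {
    const raw = decodeURIComponent(requestUrl.slice(index + 1));
    if (!raw) return null;
    const candidate = /^https?:\/\//i.test(raw) ? raw : `https://${raw}`;
    return new URL(candidate).href;
  } catch (err) {
    return null; // malformed encoding or not a valid URL
  }
}

// Build (but do not start) an HTTP server that answers every request with a
// PNG screenshot of the requested target, reusing the given browser.
function createScreenshotServer(browser) {
  return http.createServer(async (req, res) => {
    const target = targetFromRequest(req.url);
    if (!target) {
      res.writeHead(400, { 'Content-Type': 'text/plain' });
      res.end('Usage: /?example.com');
      return;
    }
    let page;
    try {
      page = await browser.newPage();
      await page.goto(target);
      const image = await page.screenshot(); // PNG data when no path is given
      res.writeHead(200, { 'Content-Type': 'image/png' });
      res.end(image);
    } catch (err) {
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('Could not capture ' + target);
    } finally {
      if (page) await page.close().catch(() => {});
    }
  });
}
```

To try it, launch a browser once and start listening, for example `createScreenshotServer(await puppeteer.launch()).listen(3000)`; the port is an arbitrary choice. A real deployment would also need to restrict which targets may be requested.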

How to make Puppeteer automatically follow redirects?

How can I make Puppeteer follow redirects automatically, the same way a browser handles them?
For example, if I load https://www.aonhewitt.com in the browser, it automatically redirects to https://www.aon.com/home/solutions/retirement.html. But if I try the following with Puppeteer:
const pageResponse = await page.goto(url, {waitUntil: "domcontentloaded"});
it fails with the error "Navigation timeout of 30000 ms exceeded". How can I avoid this error?
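`page.goto` follows HTTP redirects on its own and resolves with the response of the last one, so the timeout may have a different cause. As a way to investigate, the sketch below raises the navigation timeout and lists the redirect hops the main request went through. The helper name `gotoWithRedirects` and the 60-second default are assumptions chosen for illustration.

```javascript
// Hypothetical helper: navigate with a longer timeout and report the final
// URL, the final status, and the URLs of any redirected requests on the way.
async function gotoWithRedirects(page, url, timeout = 60000) {
  const response = await page.goto(url, {
    waitUntil: 'domcontentloaded',
    timeout,
  });
  const hops = response
    ? response.request().redirectChain().map((request) => request.url())
    : [];
  return {
    finalUrl: page.url(),
    status: response ? response.status() : null,
    hops,
  };
}
```

If the call still times out with a longer limit, the page is probably not reaching `domcontentloaded` at all, which points away from redirect handling.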

Puppeteer: is there a way to access the DevTools Network API?

I am trying to use Puppeteer for end-to-end tests. These tests require accessing the network emulation capabilities of DevTools (e.g. to simulate offline browsing).
So far I am using chrome-remote-interface, but it is too low-level for my taste.
As far as I know, Puppeteer does not expose the network DevTools features (emulateNetworkConditions in the DevTools protocol).
Is there an escape hatch in Puppeteer to access those features, e.g. a way to execute a Javascript snippet in a context in which the DevTools API is accessible?
Thanks
Edit:
OK, so it seems that I can work around the lack of an API using something like this:
const client = page._client;
const res = await client.send('Network.emulateNetworkConditions',
{ offline: true, latency: 40, downloadThroughput: 40*1024*1024,
uploadThroughput: 40*1024*1024 });
But I suppose it is Bad Form and may slip under my feet at any time?
Update: headless Chrome now supports network throttling!
In Puppeteer, you can emulate devices (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pageemulateoptions) but not network conditions. It's something we're considering, but headless Chrome needs to support network throttling first.
To emulate a device, I'd use the predefined devices found in DeviceDescriptors:
const puppeteer = require('puppeteer');
// This require path matches older Puppeteer releases; newer versions
// expose the device list differently.
const devices = require('puppeteer/DeviceDescriptors');
const iPhone = devices['iPhone 6'];

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.google.com');
  // other actions...
  await browser.close();
});
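For the network conditions themselves, the private `page._client` used in the question can be avoided by opening a DevTools protocol session through the public API. This is a minimal sketch, assuming a Puppeteer version in which `page.target().createCDPSession()` is available; `emulateNetwork` is a hypothetical helper name, and its defaults (no added latency, throttling disabled with -1) follow the parameters of `Network.emulateNetworkConditions`.

```javascript
// Hypothetical helper: open a CDP session for the page and apply network
// conditions. Fields not given in `conditions` fall back to "no throttling".
async function emulateNetwork(page, conditions) {
  const client = await page.target().createCDPSession();
  await client.send('Network.enable');
  await client.send('Network.emulateNetworkConditions', {
    offline: false,
    latency: 0,
    downloadThroughput: -1, // -1 disables download throttling
    uploadThroughput: -1, // -1 disables upload throttling
    ...conditions,
  });
  return client; // keep it to change or reset the conditions later
}
```

Simulating offline browsing would then be `await emulateNetwork(page, { offline: true })`, called before the navigation under test.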

JavaScript: screenshot rendered web page Browser style

I assume this question has been asked before, but after hours of searching I haven't found anything satisfying.
Here's my question: is it possible to screenshot a fully rendered web page using JavaScript? A little like what most browsers do on Windows when you press Ctrl+P.
I have looked into a lot of alternative solutions, such as html2Canvas.js, but none suits my needs. The biggest issue is that my web page is rendered almost entirely on the client side using JavaScript. This is also why server-side solutions like PhantomJS are hardly applicable.
I need the screenshots to be output as an image or a PDF.
Any ideas?
Thanks.
Have you looked into Puppeteer by Google?
If you're able to run it on your server, it might be exactly what you're looking for. See their example code:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
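Since the question asks for an image or a PDF of a page rendered mostly on the client, the example above could be extended along these lines. This is a sketch under assumptions: `capture` is a hypothetical helper, waiting for `networkidle0` is only a rough stand-in for "client-side rendering has finished", and the A4 format is an arbitrary choice.

```javascript
// Hypothetical helper: render `url`, then save both a full-page PNG and a
// PDF next to each other, returning the two file names.
async function capture(browser, url, baseName) {
  const page = await browser.newPage();
  try {
    // Wait until the network has gone quiet so client-side code can run.
    await page.goto(url, { waitUntil: 'networkidle0' });
    const files = [`${baseName}.png`, `${baseName}.pdf`];
    await page.screenshot({ path: files[0], fullPage: true });
    // PDF generation has historically required headless mode.
    await page.pdf({ path: files[1], format: 'A4' });
    return files;
  } finally {
    await page.close();
  }
}
```

With a launched browser, `await capture(browser, 'https://example.com', 'example')` would write `example.png` and `example.pdf` to the working directory.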
