Manually change response URL during Puppeteer request interception - javascript

I'm having a hard time navigating relative URLs with Puppeteer for a specific use case. Below you can see the basic setup and a pseudo-example describing the problem.
Essentially, I want to change the URL that the browser thinks it is currently at.
What I already tried:
Manipulating the response body by resolving all relative URLs myself. This collides with some JavaScript-based links.
Triggering a new page.goto(response.url) if the request URL doesn't match the response URL, and returning the response from the previous request. I can't seem to pass custom options, so I don't know which request is a fake page.goto.
Can somebody lend me a helping hand? Thanks in advance.
Setup:
const browser = await puppeteer.launch({
headless: false,
});
const [page] = await browser.pages();
await page.setRequestInterception(true);
page.on('request', async (request) => { // async so that await is allowed inside the handler
const resourceType = request.resourceType();
if (['document', 'xhr', 'script'].includes(resourceType)) {
// fetching takes place on a different instance and handles redirects internally
const response = await fetch(request);
request.respond({
body: response.body,
status: response.statusCode, // request.respond() expects `status`, not `statusCode`
url: response.url // no effect
});
} else {
request.abort('aborted');
}
});
Navigation:
await page.goto('https://start.de');
// redirects to https://redirect.de
await page.click('a');
// relative href '/demo.html' resolves to https://start.de/demo.html instead of https://redirect.de/demo.html
await page.click('a');
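The mismatch noted in the comments above comes down to which base URL a relative href is resolved against. A minimal sketch of that resolution, using the standard WHATWG URL constructor (available in Node and in browsers) rather than anything Puppeteer-specific; resolveHref is a hypothetical helper name:

```javascript
// Resolve a relative href the way a browser would, against a given base URL.
function resolveHref(href, baseUrl) {
  return new URL(href, baseUrl).href;
}

const href = '/demo.html';
// Base the browser still believes it is on:
console.log(resolveHref(href, 'https://start.de/'));    // https://start.de/demo.html
// Base the link should be resolved against after the internal redirect:
console.log(resolveHref(href, 'https://redirect.de/')); // https://redirect.de/demo.html
```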
Update 1
Solution
Manipulating the browser's current location via window.location.
await page.goto('https://start.de');
// redirects to https://redirect.de internally
await page.click('a');
// changing current window location
await page.evaluate(() => {
window.location.href = 'https://redirect.de';
});
// correctly resolves to https://redirect.de/demo.html instead of https://start.de/demo.html
await page.click('a');

When the intercepted request matches the one whose body you want to edit, take its URL and make the call yourself with a module such as "node-fetch" or "request". When you receive the body, edit it and send it back as the response to the original request.
For example:
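const requestModule = require("request-promise-native"); // promise-based wrapper: plain "request" has no .then() and ignores resolveWithFullResponse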
const cheerio = require("cheerio");
page.on("request", async (request) => {
// Match the url that you want
const isMatched = /page-12/.test(request.url());
if (isMatched) {
// Make a new call
requestModule({
url: request.url(),
resolveWithFullResponse: true,
})
.then((response) => {
const { body, headers, statusCode, statusMessage } = response;
const contentType = headers["content-type"];
// Edit body using cheerio module
const $ = cheerio.load(body);
$("a").each(function () {
$(this).attr("href", "/fake_pathname");
});
// Send response
request.respond({
ok: statusMessage === "OK",
status: statusCode,
contentType,
body: $.html(),
});
})
.catch(() => request.continue());
} else request.continue();
});
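A variation on the body-editing approach above, offered only as a sketch: instead of rewriting every link, insert a <base> element that points at the post-redirect URL, so the browser resolves relative hrefs against it. injectBaseTag is a hypothetical helper, the regex-based insertion assumes reasonably well-formed HTML, and links built in JavaScript from window.location would not be affected.

```javascript
// Hypothetical helper: add <base href="..."> so relative links resolve
// against `baseUrl` instead of the URL the browser originally requested.
function injectBaseTag(html, baseUrl) {
  // Normalise the URL; this also percent-encodes characters such as quotes.
  const href = new URL(baseUrl).href;
  const baseTag = `<base href="${href}">`;
  // Match an opening <head> tag (with or without attributes), but not <header>.
  const headPattern = /<head(\s[^>]*)?>/i;
  if (headPattern.test(html)) {
    // A function replacement avoids special handling of "$" in the URL.
    return html.replace(headPattern, (headTag) => headTag + baseTag);
  }
  // No <head> found: fall back to prepending the tag.
  return baseTag + html;
}

const original =
  '<html><head><title>Demo</title></head><body><a href="/demo.html">demo</a></body></html>';
console.log(injectBaseTag(original, 'https://redirect.de/'));
```

Inside the request handler, the result would be passed as body to request.respond(), with the final response URL as baseUrl.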

Related

Can't get json with axios.get and headers

I am trying to get the joke from https://icanhazdadjoke.com/. This is the code I used
const getDadJoke = async () => {
const res = await axios.get('https://icanhazdadjoke.com/', {headers: {Accept: 'application/json'}})
console.log(res.data.joke)
}
getDadJoke()
I expected to get the joke but instead I got the full html page, as if I didn't specify the headers at all. What am I doing wrong?
If you look at the API documentation for icanhazdadjoke.com, there is a section titled "Custom user agent." In that section, they explain that they want every request to carry a User-Agent header. If you use Axios in a browser context, the User-Agent is set for you by the browser. But I'm going to go out on a limb and say that you are running this code via Node, in which case you may need to set the User-Agent header manually, like so:
const getDadJoke = async () => {
const res = await axios.get(
'https://icanhazdadjoke.com/',
{
headers:
{
'Accept': 'application/json',
'User-Agent': 'my URL, email or whatever'
}
}
)
console.log(res.data.joke)
}
getDadJoke()
The docs say what they want you to put for the User Agent, but I think it would honestly work if there were any User Agent field at all.
The HTML page you're getting is a 503 response from Cloudflare.
As per the API documentation
Custom user agent
If you intend on using the icanhazdadjoke.com API we kindly ask that you set a custom User-Agent header for all requests.
My guess is they have a Cloudflare Browser Integrity Check configured that's triggering for the default Node / Axios user-agent.
Setting a custom user-agent appears to get around this...
const getDadJoke = async () => {
try {
const res = await axios.get("https://icanhazdadjoke.com/", {
headers: {
accept: "application/json",
"user-agent": "My Node and Axios app", // use something better than this
},
});
console.log(res.data.joke);
} catch (err) {
console.error(err.response?.data, err.toJSON());
}
};
Given how unreliable Axios releases have been since v1.0.0, I highly recommend you switch to something else. The Fetch API has been available natively in Node since v18:
const getDadJoke = async () => {
try {
const res = await fetch("https://icanhazdadjoke.com/", {
headers: {
accept: "application/json",
"user-agent": "My Node and Fetch app", // use something better than this
},
});
if (!res.ok) {
const err = new Error(`${res.status} ${res.statusText}`);
err.text = await res.text();
throw err;
}
console.log((await res.json()).joke);
} catch (err) {
console.error(err, err.text);
}
};
Using an Axios REST API call that returns JSON
If you use the API documented at https://icanhazdadjoke.com/api#authentication, you can use Axios.
Here is an example.
Alternative method
Because https://icanhazdadjoke.com/ responds with HTML, you can use web scraping instead.
This is an example of how to scrape the page using the Puppeteer library in Node.js.
Demo code
Save it as get-joke.js.
const puppeteer = require("puppeteer");
async function getJoke() {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://icanhazdadjoke.com/');
const joke = await page.evaluate(() => {
const jokes = Array.from(document.querySelectorAll('p[class="subtitle"]'))
return jokes[0].innerText;
});
await browser.close();
return Promise.resolve(joke);
} catch (error) {
return Promise.reject(error);
}
}
getJoke()
.then((joke) => {
console.log(joke);
})
Selector
The main idea is to use a DOM tree selector.
Chrome's DevTools (opened by pressing F12) shows the HTML DOM tree structure.
The <p> tag has the class name subtitle:
document.querySelectorAll('p[class="subtitle"]')
Install the dependency and run it
npm install puppeteer
node get-joke.js
Result
You can get the joke from that website.

Puppeteer how to send PUT request to the redirected page

I am trying to send a PUT request to the final URL, but there is a redirect before the final URL. Sending a fetch request from inside the final URL's page would also be fine. When I open the DevTools console and write the fetch there, it works, but of course I need to do it from code.
When I set await page.setRequestInterception(true); and page.once('request', (req) => {...}), it sends the PUT request to the first page, which I don't want.
Let's say the first URL is https://example.com/first --> this redirects to the final URL.
The final URL is https://example.com/final --> this is where I want to send the PUT request and retrieve the status code. I have tried setting a timer, and getting the current URL with page.url() combined with some if/else statements, but neither worked.
Here is my current code:
app.get('/cookie', async (req, res) => {
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({
headless: false,
executablePath: `C:/Program Files (x86)/Google/Chrome/Application/chrome.exe`,
defaultViewport: null,
args: ['--start-maximized'],
slowMo: 150,
});
const page = await browser.newPage();
await page.setUserAgent(randomUserAgent.getRandom());
page.setDefaultNavigationTimeout(0);
page.setJavaScriptEnabled(true);
await page.goto(
'finalURL',
{ waitUntil: 'load', timeout: 0 }
);
await delay(5000);
await page.setRequestInterception(true);
page.once('request', (request) => {
request.continue({
method: 'PUT',
});
page.setRequestInterception(false);
});
let statusCode;
await page.waitForResponse((response) => {
statusCode = response.status();
return true;
});
res.json(statusCode);
});
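Since the question notes that running fetch from the DevTools console on the final page works, one possible approach (a sketch, not a confirmed fix) is to let the navigation and its redirect finish and then issue the same fetch inside the page with page.evaluate(). putFromPage and the JSON payload are hypothetical; page is assumed to be a Puppeteer Page that has already landed on the final URL.

```javascript
// Hypothetical helper: run a PUT request from inside the loaded page, so it
// goes out with that page's origin and cookies, and return the status code.
async function putFromPage(page, url, payload) {
  // The callback is serialised and executed in the browser context;
  // `url` and `payload` are passed in as arguments.
  return page.evaluate(async (targetUrl, body) => {
    const response = await fetch(targetUrl, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
    return response.status;
  }, url, payload);
}

// Usage (inside the route handler, after the redirect has completed):
//   await page.goto('https://example.com/first', { waitUntil: 'load' });
//   const statusCode = await putFromPage(page, page.url(), {});
//   res.json(statusCode);
```

Because the request is made after navigation has settled, no request interception is needed and the first URL is never touched by the PUT.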

How to accept a JavaScript alert popup in Puppeteer?

I have a script which uses Puppeteer to automatically log in to a corporate portal. The login uses SAML. So, when puppeteer opens up an instance of chromium and visits the page, a popup appears on screen to confirm the identity of the user. All I need to do is either manually click on "OK" button or press Enter from keyboard.
I have tried to simulate the pressing of the Enter key using puppeteer but it does not work.
The login screen -
Script -
const puppeteer = require('puppeteer');
async function startDb() {
const browser = await puppeteer.launch({
headless:false,
defaultViewport:null
});
const page = await browser.newPage();
await page.goto("https://example.com");
await page.waitFor(3000);
await page.keyboard.press('Enter');
console.log('Opened')
};
startDb();
EDIT
There is a solution proposed in this issue:
Basically, just intercept the request, fire the request off yourself using your favorite HTTP client library, and respond to the intercepted request with the response info.
const puppeteer = require('puppeteer');
const request = require('request');
const fs = require('fs');
(async () => {
const browser = await puppeteer.launch();
let page = await browser.newPage();
// Enable Request Interception
await page.setRequestInterception(true);
// Client cert files
const cert = fs.readFileSync('/path/to/cert.crt.pem');
const key = fs.readFileSync('/path/to/cert.key.pem');
page.on('request', interceptedRequest => {
// Intercept Request, pull out request options, add in client cert
const options = {
uri: interceptedRequest.url(),
method: interceptedRequest.method(),
headers: interceptedRequest.headers(),
body: interceptedRequest.postData(),
cert: cert,
key: key
};
// Fire off the request manually (example is using the 'request' lib)
request(options, function(err, resp, body) {
// Abort interceptedRequest on error
if (err) {
console.error(`Unable to call ${options.uri}`, err);
return interceptedRequest.abort('connectionrefused');
}
// Return retrieved response to interceptedRequest
interceptedRequest.respond({
status: resp.statusCode,
contentType: resp.headers['content-type'],
headers: resp.headers,
body: body
});
});
});
await page.goto('https://client.badssl.com/');
await browser.close();
})();
Before page.goto(), put this code:
page.on('dialog', async dialog => {
await dialog.accept();
});

Puppeteer - How can I get the current page (application/pdf) as a buffer or file?

Using Puppeteer (https://github.com/GoogleChrome/puppeteer), I have a page that's an application/pdf. With headless: false, the page is loaded through the Chromium PDF viewer, but I want to use headless mode. How can I download the original .pdf file, or use it as a blob with another library such as pdf-parse (https://www.npmjs.com/package/pdf-parse)?
Since Puppeteer does not currently support navigation to a PDF document in headless mode via page.goto() due to the upstream issue, you can use page.setRequestInterception() to enable request interception, and then you can listen for the 'request' event and detect whether the resource is a PDF before using the request client to obtain the PDF buffer.
After obtaining the PDF buffer, you can use request.abort() to abort the original Puppeteer request, or if the request is not for a PDF, you can use request.continue() to continue the request normally.
Here's a full working example:
'use strict';
const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', request => {
if (request.url().endsWith('.pdf')) {
request_client({
uri: request.url(),
encoding: null,
headers: {
'Content-type': 'application/pdf',
},
}).then(response => {
console.log(response); // PDF Buffer
request.abort();
});
} else {
request.continue();
}
});
await page.goto('https://example.com/hello-world.pdf').catch(error => {});
await browser.close();
})();
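The request library used above has since been deprecated. As a sketch under that assumption, the same download can be done with the fetch built into Node 18+. isPdfUrl and downloadAsBuffer are hypothetical helper names, and checking the URL's pathname (rather than the whole URL) is a small change so that a query string does not hide the .pdf extension.

```javascript
// Hypothetical helper: true when the path part of the URL ends in ".pdf".
function isPdfUrl(url) {
  return new URL(url).pathname.toLowerCase().endsWith('.pdf');
}

// Hypothetical helper: download a URL with Node's built-in fetch and
// return the raw bytes as a Buffer.
async function downloadAsBuffer(url, headers = {}) {
  const response = await fetch(url, { headers });
  if (!response.ok) {
    throw new Error(`Download failed: ${response.status} ${response.statusText}`);
  }
  return Buffer.from(await response.arrayBuffer());
}

console.log(isPdfUrl('https://example.com/hello-world.pdf?download=1')); // true
console.log(isPdfUrl('https://example.com/index.html'));                 // false

// Usage inside the 'request' handler:
//   if (isPdfUrl(request.url())) {
//     const pdf = await downloadAsBuffer(request.url());
//     request.abort();
//   } else {
//     request.continue();
//   }
```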
Grant Miller's solution didn't work for me because I was logged in to the website. If the PDF is public, that solution works well.
The solution in my case was to add the cookies:
await page.setRequestInterception(true);
page.on('request', async request => {
    if (request.url().indexOf('exibirFat.do') > 0) { // This condition is true only on the PDF page (in my case, of course)
        const options = {
            encoding: null,
            // use the public accessors instead of private fields such as request._method
            method: request.method(),
            uri: request.url(),
            body: request.postData(),
            headers: request.headers()
        };
        /* add the cookies */
        const cookies = await page.cookies();
        options.headers.Cookie = cookies.map(ck => ck.name + '=' + ck.value).join('; ');
        /* resend the request */
        const response = await request_client(options);
        //console.log(response); // PDF Buffer
        const buffer = response;
        const filename = 'file.pdf';
        fs.writeFileSync(filename, buffer); // Save file (requires const fs = require('fs');)
        request.abort(); // settle the intercepted request so it does not hang
    } else {
        request.continue();
    }
});
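The cookie step above is the part that carries the logged-in session over to the re-sent request. A minimal sketch of just that serialisation, assuming cookie objects shaped like the ones page.cookies() returns ({ name, value, ... }); toCookieHeader is a hypothetical helper name and the sample cookies are made up for illustration:

```javascript
// Hypothetical helper: turn an array of cookie objects into the value of a
// Cookie request header, with pairs separated by "; ".
function toCookieHeader(cookies) {
  return cookies.map(({ name, value }) => `${name}=${value}`).join('; ');
}

// Illustrative data only; real values come from `await page.cookies()`.
const sampleCookies = [
  { name: 'JSESSIONID', value: 'abc123', domain: 'example.com' },
  { name: 'lang', value: 'pt-BR', domain: 'example.com' },
];
console.log(toCookieHeader(sampleCookies)); // JSESSIONID=abc123; lang=pt-BR
```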

Programmatically capturing AJAX traffic with headless Chrome

Chrome officially supports running the browser in headless mode (including programmatic control via the Puppeteer API and/or the CRI library).
I've searched through the documentation, but I haven't found how to programmatically capture the AJAX traffic from the instances, i.e. start an instance of Chrome from code, navigate to a page, and access the background request/response calls and raw data (all from code, not using the developer tools or extensions).
Do you have any suggestions or examples detailing how this could be achieved? Thanks!
Update
As @Alejandro pointed out in the comment, resourceType is a function and its return value is lowercase:
page.on('request', request => {
if (request.resourceType() === 'xhr') {
// do something
}
});
Original answer
Puppeteer's API makes this really easy:
page.on('request', request => {
if (request.resourceType === 'XHR')
// do something
});
You can also intercept requests with setRequestInterception, but it's not needed in this example if you're not going to modify the requests.
There's an example of intercepting image requests that you can adapt.
resourceTypes are defined here.
I finally found how to do what I wanted. It can be done with chrome-remote-interface (CRI), and node.js. I'm attaching the minimal code required.
const CDP = require('chrome-remote-interface');
(async function () {
// you need to have a Chrome open with remote debugging enabled
// ie. chrome --remote-debugging-port=9222
const protocol = await CDP({port: 9222});
const {Page, Network} = protocol;
await Page.enable();
await Network.enable(); // need this to call Network.getResponseBody below
Page.navigate({url: 'http://localhost/'}); // your URL
const onDataReceived = async (e) => {
try {
let response = await Network.getResponseBody({requestId: e.requestId})
if (typeof response.body === 'string') {
console.log(response.body);
}
} catch (ex) {
console.log(ex.message)
}
}
protocol.on('Network.dataReceived', onDataReceived)
})();
Puppeteer's listeners can help you capture XHR responses via the response and request events.
You should first check whether request.resourceType() is xhr or fetch.
listener = page.on('response', response => {
const isXhr = ['xhr','fetch'].includes(response.request().resourceType())
if (isXhr){
console.log(response.url());
response.text().then(console.log)
}
})
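Building on the listener above, a sketch that hands each XHR/fetch response to a callback together with its status and body. captureAjaxResponses and the shape of the captured object are hypothetical; the calls on response and request are the same Puppeteer methods used above, plus response.status().

```javascript
// Hypothetical helper: report every XHR/fetch response seen by `page`.
function captureAjaxResponses(page, onCapture) {
  page.on('response', async (response) => {
    const type = response.request().resourceType();
    if (type !== 'xhr' && type !== 'fetch') return;

    let body = null;
    try {
      body = await response.text();
    } catch (err) {
      // Some responses (for example redirects) have no readable body.
    }
    onCapture({ url: response.url(), status: response.status(), body });
  });
}

// Usage, assuming `page` is a Puppeteer Page:
//   const captured = [];
//   captureAjaxResponses(page, (entry) => captured.push(entry));
//   await page.goto('https://example.com');
```

No call to setRequestInterception is needed here, since the responses are only observed and not modified.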
const browser = await puppeteer.launch();
const page = await browser.newPage();
const pageClient = page["_client"];
pageClient.on("Network.responseReceived", event => {
if (~event.response.url.indexOf('/api/chart/rank')) {
console.log(event.response.url);
pageClient.send('Network.getResponseBody', {
requestId: event.requestId
}).then(async response => {
const body = response.body;
if (body) {
try {
const json = JSON.parse(body);
}
catch (e) {
}
}
});
}
});
await page.setRequestInterception(true);
page.on("request", async request => {
request.continue();
});
await page.goto('http://www.example.com', { timeout: 0 });
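page["_client"] above is a private Puppeteer field. A sketch of the same idea through a CDP session created with the public page.target().createCDPSession(); captureJsonBodies and its callback are hypothetical names, and waiting for Network.loadingFinished before asking for the body is an assumption made here so that the body is complete when it is requested.

```javascript
// Hypothetical helper: parse the JSON body of every response whose URL
// contains `urlFragment`, using the Chrome DevTools Protocol directly.
async function captureJsonBodies(page, urlFragment, onJson) {
  const client = await page.target().createCDPSession();
  await client.send('Network.enable');

  const matchedUrls = new Map(); // requestId -> response URL

  client.on('Network.responseReceived', (event) => {
    if (event.response.url.includes(urlFragment)) {
      matchedUrls.set(event.requestId, event.response.url);
    }
  });

  client.on('Network.loadingFinished', async (event) => {
    const url = matchedUrls.get(event.requestId);
    if (!url) return;
    matchedUrls.delete(event.requestId);
    try {
      const { body, base64Encoded } = await client.send('Network.getResponseBody', {
        requestId: event.requestId,
      });
      const text = base64Encoded ? Buffer.from(body, 'base64').toString('utf8') : body;
      onJson(url, JSON.parse(text));
    } catch (err) {
      console.warn(`Could not read JSON body for ${url}: ${err.message}`);
    }
  });

  return client;
}

// Usage, assuming `page` is a Puppeteer Page:
//   await captureJsonBodies(page, '/api/chart/rank', (url, json) => console.log(url, json));
//   await page.goto('http://www.example.com', { timeout: 0 });
```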
