Chrome officially supports running the browser in headless mode (including programmatic control via the Puppeteer API and/or the CRI library).
I've searched through the documentation, but I haven't found how to programmatically capture the AJAX traffic from the instances, i.e. start an instance of Chrome from code, navigate to a page, and access the background request/response calls and raw data, all from code, without using the developer tools or extensions.
Do you have any suggestions or examples detailing how this could be achieved? Thanks!
Update
As @Alejandro pointed out in the comments, resourceType is a function and its return value is lowercase:
page.on('request', request => {
  if (request.resourceType() === 'xhr') {
    // do something
  }
});
Original answer
Puppeteer's API makes this really easy:
page.on('request', request => {
  if (request.resourceType === 'XHR') {
    // do something
  }
});
You can also intercept requests with setRequestInterception, but it's not needed in this example if you're not going to modify the requests.
There's an example of intercepting image requests that you can adapt.
resourceTypes are defined here.
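For reference, a minimal sketch of what interception-based blocking could look like (this is an illustrative adaptation, not the linked example itself):
// Illustrative sketch: block image requests, let everything else through.
await page.setRequestInterception(true);
page.on('request', request => {
  if (request.resourceType() === 'image') {
    request.abort();      // drop the request
  } else {
    request.continue();   // pass it through unchanged
  }
});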
I finally found out how to do what I wanted. It can be done with chrome-remote-interface (CRI) and Node.js. I'm attaching the minimal code required.
const CDP = require('chrome-remote-interface');

(async function () {
    // you need to have a Chrome open with remote debugging enabled
    // ie. chrome --remote-debugging-port=9222
    const protocol = await CDP({port: 9222});
    const {Page, Network} = protocol;
    await Page.enable();
    await Network.enable(); // need this to call Network.getResponseBody below

    Page.navigate({url: 'http://localhost/'}); // your URL

    const onDataReceived = async (e) => {
        try {
            let response = await Network.getResponseBody({requestId: e.requestId});
            if (typeof response.body === 'string') {
                console.log(response.body);
            }
        } catch (ex) {
            console.log(ex.message);
        }
    };

    protocol.on('Network.dataReceived', onDataReceived);
})();
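If you also want to start Chrome from code instead of launching it by hand, one option (an assumption on my part, not something the snippet above requires) is the chrome-launcher package:
// Sketch using the chrome-launcher package (an assumption; any way of starting
// Chrome with --remote-debugging-port=9222 works just as well).
const chromeLauncher = require('chrome-launcher');

(async () => {
  const chrome = await chromeLauncher.launch({
    port: 9222,
    chromeFlags: ['--headless', '--disable-gpu']
  });
  // ...connect with CDP({port: chrome.port}) as in the snippet above...
  await chrome.kill(); // shut Chrome down when finished
})();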
Puppeteer's listeners can help you capture XHR responses via the response and request events.
You should check whether request.resourceType() is 'xhr' or 'fetch' first:
listener = page.on('response', response => {
  const isXhr = ['xhr', 'fetch'].includes(response.request().resourceType());
  if (isXhr) {
    console.log(response.url());
    response.text().then(console.log);
  }
});
const browser = await puppeteer.launch();
const page = await browser.newPage();
const pageClient = page["_client"];

pageClient.on("Network.responseReceived", event => {
  if (~event.response.url.indexOf('/api/chart/rank')) {
    console.log(event.response.url);
    pageClient.send('Network.getResponseBody', {
      requestId: event.requestId
    }).then(async response => {
      const body = response.body;
      if (body) {
        try {
          const json = JSON.parse(body);
          // do something with the parsed JSON here
        } catch (e) {
          // ignore bodies that aren't valid JSON
        }
      }
    });
  }
});

await page.setRequestInterception(true);
page.on("request", async request => {
  request.continue();
});
await page.goto('http://www.example.com', { timeout: 0 });
I am trying to get the joke from https://icanhazdadjoke.com/. This is the code I used:
const getDadJoke = async () => {
const res = await axios.get('https://icanhazdadjoke.com/', {headers: {Accept: 'application/json'}})
console.log(res.data.joke)
}
getDadJoke()
I expected to get the joke, but instead I got the full HTML page, as if I hadn't specified the headers at all. What am I doing wrong?
If you look at the API documentation for icanhazdadjoke.com, there is a section titled "Custom user agent." In that section, they explain that they want any requests to have a User-Agent header. If you use Axios in a browser context, the User Agent is set for you by your browser, but I'm going to go out on a limb and say that you are running this code via Node, in which case you may need to set the User-Agent header manually, like so:
const getDadJoke = async () => {
  const res = await axios.get('https://icanhazdadjoke.com/', {
    headers: {
      'Accept': 'application/json',
      'User-Agent': 'my URL, email or whatever'
    }
  });
  console.log(res.data.joke);
};

getDadJoke();
The docs say what they want you to put in the User-Agent, but I think it would honestly work with any User-Agent value at all.
The HTML page you're getting is a 503 response from Cloudflare.
As per the API documentation
Custom user agent
If you intend on using the icanhazdadjoke.com API we kindly ask that you set a custom User-Agent header for all requests.
My guess is they have a Cloudflare Browser Integrity Check configured that's triggering for the default Node / Axios user-agent.
Setting a custom user-agent appears to get around this...
const getDadJoke = async () => {
  try {
    const res = await axios.get("https://icanhazdadjoke.com/", {
      headers: {
        accept: "application/json",
        "user-agent": "My Node and Axios app", // use something better than this
      },
    });
    console.log(res.data.joke);
  } catch (err) {
    console.error(err.response?.data, err.toJSON());
  }
};
Given how unreliable Axios releases have been since v1.0.0, I highly recommend you switch to something else. The Fetch API has been available natively in Node since v18:
const getDadJoke = async () => {
  try {
    const res = await fetch("https://icanhazdadjoke.com/", {
      headers: {
        accept: "application/json",
        "user-agent": "My Node and Fetch app", // use something better than this
      },
    });
    if (!res.ok) {
      const err = new Error(`${res.status} ${res.statusText}`);
      err.text = await res.text();
      throw err;
    }
    console.log((await res.json()).joke);
  } catch (err) {
    console.error(err, err.text);
  }
};
You can use Axios to call the REST API, which responds in JSON format.
If you use the API described at https://icanhazdadjoke.com/api#authentication, Axios works fine. Here is an example.
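A minimal sketch of such an Axios call (the User-Agent value is only a placeholder; the API docs ask for a custom one):
const axios = require('axios');

const getDadJoke = async () => {
  const res = await axios.get('https://icanhazdadjoke.com/', {
    headers: {
      'Accept': 'application/json',                      // ask the API for JSON
      'User-Agent': 'my-joke-app (contact@example.com)'  // placeholder, use your own
    }
  });
  console.log(res.data.joke);
};

getDadJoke();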
Alternative method
You need to use a web-scraping approach in this case, because https://icanhazdadjoke.com/ responds with HTML.
This is an example of how to scrape it using the puppeteer library in Node.js.
Demo code
Save it as a get-joke.js file:
const puppeteer = require("puppeteer");

async function getJoke() {
  try {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://icanhazdadjoke.com/');
    const joke = await page.evaluate(() => {
      const jokes = Array.from(document.querySelectorAll('p[class="subtitle"]'));
      return jokes[0].innerText;
    });
    await browser.close();
    return Promise.resolve(joke);
  } catch (error) {
    return Promise.reject(error);
  }
}

getJoke()
  .then((joke) => {
    console.log(joke);
  })
  .catch((error) => {
    console.error(error);
  });
Selector
The main idea is to use a DOM tree selector. Chrome's DevTools (opened by pressing F12) shows the HTML DOM tree structure. The <p> tag has the class name subtitle, so it can be selected with:
document.querySelectorAll('p[class="subtitle"]')
Install the dependency and run it:
npm install puppeteer
node get-joke.js
Result
You can get the joke from that website.
I am currently working on migrating a web application to a PWA, and for a few days I have been encountering this problem: "Site cannot be installed: Page does not work offline. Starting in Chrome 93, the installability criteria is changing, and this site will not be installable. See https://goo.gle/improved-pwa-offline-detection for more information."
I did several searches on forums, especially on Stack Overflow, to find a solution to my problem, but without success. None of the proposed solutions solved my problem.
I created this JS script following recommendations found on Stack Overflow:
importScripts('./ngsw-worker.js');

const OFFLINE_VERSION = 1;
const CACHE_NAME = 'offline-html';
const OFFLINE_URL = 'offline.html';

self.addEventListener('install', (event) => {
  event.waitUntil((async () => {
    const cache = await caches.open(CACHE_NAME);
    await cache.add(new Request(OFFLINE_URL, {cache: 'reload'}));
  })());
});

self.addEventListener('fetch', (event) => {
  if (event.request.mode === 'navigate') {
    event.respondWith((async () => {
      try {
        const preloadResponse = await event.preloadResponse;
        if (preloadResponse) {
          return preloadResponse;
        }
        const networkResponse = await fetch(event.request);
        return networkResponse;
      } catch (error) {
        console.log('Fetch failed; returning offline page instead.', error);
        const cache = await caches.open(CACHE_NAME);
        const cachedResponse = await cache.match(OFFLINE_URL);
        return cachedResponse;
      }
    })());
  }
});
This script is executed correctly when the application is run, but the error message persists.
Do you have a solution for this problem?
Thank you.
P.S.: I am using Angular version 8.1.1.
I am calling server data using AJAX in index.html, and it is fetching the data perfectly. Now I am working with a service worker. I can cache all the static assets (images, JS, CSS) and check those cached assets in Cache Storage on the Application tab of Chrome DevTools. I can also see in the Network tab that those assets are cached (disk cache).
Now I want to cache the AJAX response (an array of image files) using the service worker. In the Network tab I can see it is calling the URL (type: xhr) and it is not cached. So far I have tried to fetch the URL and cache it, but I have not been able to do it.
Here is my AJAX call in index.html:
<script type="text/javascript">
  $(document).ready(function () {
    var url = 'index.cfm?action=main.appcache';
    $.ajax({
      type: "GET",
      url: url,
      data: function (data) {
        var resData = JSON.stringify(data);
      },
      cache: true,
      complete: doSomething
    });
  });

  function doSomething(data) {
    console.log(data.responseText);
  }
</script>
Here is my service worker fetch event:
self.addEventListener('fetch', (event) => {
  if (event.request.mode === 'navigate') {
    event.respondWith((async () => {
      try {
        const preloadResponse = await event.preloadResponse;
        if (preloadResponse) {
          return preloadResponse;
        }

        const normalizedUrl = new URL(event.request.url);
        if (normalizedUrl.endsWith === 'index.cfm?action=main.appcache') {
          const fetchResponseP = fetch(normalizedUrl);
          const fetchResponseCloneP = fetchResponseP.then(r => r.clone());
          event.waitUntil(async function () {
            const cache = await caches.open(precacheName);
            await cache.put(normalizedUrl, await fetchResponseCloneP);
          }());
          return (await caches.match(normalizedUrl)) || fetchResponseP;
        }

        const networkResponse = await fetch(event.request);
        return networkResponse;
      } catch (error) {
        console.log('Fetch failed; returning offline page instead.', error);
        const cache = await caches.open(precacheName);
        const cachedResponse = await cache.match(offlineDefaultPage);
        return cachedResponse;
      }
    })());
  }
});
Please help me with the changes needed to cache the response.
Your fetch event handler starts with
if (event.request.mode === 'navigate') {
// ...
}
That means the code inside of it will only execute if the incoming fetch event is for a navigation request. Only the initial request for an HTML document when first loading a page is a navigation request. Your AJAX requests for other subresources are not navigation requests.
If you want to cache your requests for index.cfm?action=main.appcache in addition to your logic in place for navigation requests, you can add another if statement after your first one, and check for that URL:
self.addEventListener('fetch', (event) => {
  if (event.request.mode === 'navigate') {
    // ...
  }

  if (event.request.url.endsWith('index.cfm?action=main.appcache')) {
    // Your caching logic goes here. See:
    // https://developers.google.com/web/fundamentals/instant-and-offline/offline-cookbook#serving-suggestions
  }
});
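As a rough sketch of what that caching logic could look like, here is a network-first handler with a cache fallback (the cache name 'ajax-cache' is just an example; the offline cookbook linked above describes several other strategies):
if (event.request.url.endsWith('index.cfm?action=main.appcache')) {
  event.respondWith((async () => {
    const cache = await caches.open('ajax-cache'); // example cache name
    try {
      // Network first: fetch, store a copy, then return the live response.
      const networkResponse = await fetch(event.request);
      await cache.put(event.request, networkResponse.clone());
      return networkResponse;
    } catch (error) {
      // Offline or fetch failed: fall back to the cached copy, if any.
      return await cache.match(event.request);
    }
  })());
}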
I have a Node server. I pass a URL into request and then extract the contents with cheerio. Now what I'm trying to do is detect whether that webpage is using Google Analytics. How would I do this?
request({uri: URL}, function (error, response, body) {
  if (!error) {
    const $ = cheerio.load(body);
    const usesAnalytics = body.includes('googletag') || body.includes('analytics.js') || body.includes('ga.js');
    const isUsingGA = ?;
  }
});
From the official Analytics site, they say that you can find some strings that would indicate GA is active. I have tried scanning the body for these, but they always return false even if the page is running GA. I included this in the code above.
I've looked at websites that use it and I can't see anything in their index that would suggest they are using it. It's only when I go to their sources that I see they are using it. How would I detect this in Node?
I have a Node script which uses Puppeteer to monitor the requests sent from a website.
I wrote this some time ago, so some parts might be irrelevant to you, but here you go:
'use strict';

const puppeteer = require('puppeteer');

function getGaTag(lookupDomain) {
  return new Promise((resolve) => {
    (async () => {
      var result = [];
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.setRequestInterception(true);

      page.on('request', request => {
        const url = request.url();
        const regexp = /(UA|YT|MO)-\d+-\d+/i;
        // look for tracking script
        if (url.match(/^https?:\/\/www\.google-analytics\.com\/(r\/)?collect/i)) {
          console.log(url.match(regexp));
          console.log('\n');
          result.push(url.match(regexp)[0]);
        }
        request.continue();
      });

      try {
        await page.goto(lookupDomain);
        await page.waitFor(9000);
      } catch (err) {
        console.log("Couldn't fetch page " + err);
      }

      await browser.close();
      resolve(result);
    })();
  });
}

getGaTag('https://store.google.com/').then(result => {
  console.log(result);
});
Running node ga-check.js now returns the UA ID of the Google Analytics tracker on the lookup domain: [ 'UA-54090495-1' ], which in this case is https://store.google.com.
Hope this helps!
I'm having a hard time navigating relative URLs with Puppeteer for a specific use case. Below you can see the basic setup and a pseudo example describing the problem.
Essentially I want to change the current URL the browser thinks it is at.
What I already tried:
Manipulating the response body by resolving all relative URLs myself. This collides with some JavaScript-based links.
Triggering a new page.goto(response.url) if the request URL doesn't match the response URL and returning the response from the previous request. I can't seem to pass custom options, so I don't know which request is a fake page.goto.
Can somebody lend me a helping hand? Thanks in advance.
Setup:
const browser = await puppeteer.launch({
  headless: false,
});
const [page] = await browser.pages();

await page.setRequestInterception(true);
page.on('request', async (request) => {
  const resourceType = request.resourceType();
  if (['document', 'xhr', 'script'].includes(resourceType)) {
    // fetching takes place on a different instance and handles redirects internally
    const response = await fetch(request);
    request.respond({
      body: response.body,
      statusCode: response.statusCode,
      url: response.url // no effect
    });
  } else {
    request.abort('aborted');
  }
});
Navigation:
await page.goto('https://start.de');
// redirects to https://redirect.de
await page.click('a');
// relative href '/demo.html' resolves to https://start.de/demo.html instead of https://redirect.de/demo.html
await page.click('a');
Update 1
Solution
Manipulating the browser location directly via window.location.
await page.goto('https://start.de');
// redirects to https://redirect.de internally
await page.click('a');
// changing current window location
await page.evaluate(() => {
  window.location.href = 'https://redirect.de';
});
// correctly resolves to https://redirect.de/demo.html instead of https://start.de/demo.html
await page.click('a');
When you match the request whose body you want to edit, just take the URL and make a new call using the "node-fetch" or "request" modules; when you receive the body, edit it and then send it as the response to the original request.
For example:
const requestModule = require("request-promise-native"); // promise-enabled request, needed for the .then() below
const cheerio = require("cheerio");

page.on("request", async (request) => {
  // Match the url that you want
  const isMatched = /page-12/.test(request.url());
  if (isMatched) {
    // Make a new call
    requestModule({
      url: request.url(),
      resolveWithFullResponse: true,
    })
      .then((response) => {
        const { body, headers, statusCode, statusMessage } = response;
        const contentType = headers["content-type"];

        // Edit body using cheerio module
        const $ = cheerio.load(body);
        $("a").each(function () {
          $(this).attr("href", "/fake_pathname");
        });

        // Send response
        request.respond({
          ok: statusMessage === "OK",
          status: statusCode,
          contentType,
          body: $.html(),
        });
      })
      .catch(() => request.continue());
  } else request.continue();
});