puppeteer: How to intercept ServiceWorker/WebWorker request? - javascript

There are some old solutions to resolve this question, the one is in github, the other is in stackoverflow.
puppeteer has _client property in lower version.
The solution in lower vision as follows:
page._client.send('Network.setBypassServiceWorker', {bypass: true})
The puppeteer version is 18.0.5, so the Page has not _client property.
So the same solution in higher vison as follows:
const client = await page.target().createCDPSession();
await client.send("Network.setBypassServiceWorker", { bypass: true });
But it not working.
So how to resolve this problem?

We should add a new line await client.send("Network.enable") to enable the network. The code as follows:
...
const client = await page.target().createCDPSession();
await client.send("Network.enable"); // Must enable network.
await client.send("Network.setBypassServiceWorker", { bypass: true });
await page.setRequestInterception(true);
...
So we can handle the response in page.on().
...
page.on("response", async (res) => {
// do somethings.
})
...

Related

Using wappalyzer and puppeteer in node.js

I am trying to build a scraper to monitor web projects automatically.
So far so good, the script is running, but now I want to add a feature that automatically analyses what libraries I used in the projects. The most powerful script for this job is wappalyser. They have a node package (https://www.npmjs.com/package/wappalyzer) and it's written that you can use it combined with pupperteer.
I managed to run pupperteer and to log the source code of the sites in the console, but I don't get the right way to pass the source code to the wappalyzer analyse function.
Do you guys have a hint for me?
I tryed this code but a am getting a TypeError: url.split is not a function
function getLibarys(url) {
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto(url);
// get source code with puppeteer
const html = await page.content();
const wappalyzer = new Wappalyzer();
(async function () {
try {
await wappalyzer.init()
// Optionally set additional request headers
const headers = {}
const site = await wappalyzer.open(page, headers)
// Optionally capture and output errors
site.on('error', console.error)
const results = await site.analyze()
console.log(JSON.stringify(results, null, 2))
} catch (error) {
console.error(error)
}
await wappalyzer.destroy()
})()
await browser.close()
})()
}
Fixed it by using the sample code from wappalyzer.
function getLibarys(url) {
const Wappalyzer = require('wappalyzer');
const options = {
debug: false,
delay: 500,
headers: {},
maxDepth: 3,
maxUrls: 10,
maxWait: 5000,
recursive: true,
probe: true,
proxy: false,
userAgent: 'Wappalyzer',
htmlMaxCols: 2000,
htmlMaxRows: 2000,
noScripts: false,
noRedirect: false,
};
const wappalyzer = new Wappalyzer(options)
;(async function() {
try {
await wappalyzer.init()
// Optionally set additional request headers
const headers = {}
const site = await wappalyzer.open(url, headers)
// Optionally capture and output errors
site.on('error', console.error)
const results = await site.analyze()
console.log(JSON.stringify(results, null, 2))
} catch (error) {
console.error(error)
}
await wappalyzer.destroy()
})()
}
I do not know if you still need an answer to this. But this is what a wappalyzer collaborator told me:
Normally you'd run Wappalyzer like this:
const Wappalyzer = require('wappalyzer')
const wappalyzer = new Wappalyzer()
await wappalyzer.init() // Launches a Puppeteer instance
const site = await wappalyzer.open(url)
If you want to use your own browser instance, you can skip wappalyzer.init() and assign the instance to wappalyzer.browser:
const Wappalyzer = require('wappalyzer')
const wappalyzer = new Wappalyzer()
wappalyzer.browser = await puppeteer.launch() // Use your own Puppeteer launch logic
const site = await wappalyzer.open(url)
You can find the discussion here.
Hope this helps.

Preserving indentation with Tesseract.js library

I am using tessearct.js library in my angular code.
I want to preserve the white spaces, the indentation as it is. How to do it?
Currently I am using this piece of code to do it.
async doOCR {
const worker = createWorker({
logger: m => console.log(m),
});
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const value = await worker.recognize(this.selectedFile);
}
I am looking a method to do it on client side only, that's why not using its python library.
You can give it a try after version (3.04), they have added the preserve_interword_spaces`. You can try this and check if this works:
async doOCR {
const worker = createWorker({
logger: m => console.log(m),
});
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
// there is no proper documentation, but they have added this flag
// to run it as a command
await worker.setParameters({
preserve_interword_spaces: 1,
});
const value = await worker.recognize(this.selectedFile);
}

UnhandledPromiseRejectionWarning: Error: Evaluation failed theme is not defined

Before I start the question, I am new in JavaScript, and I have very basic knowledge of async js, but i need to solve this so i can have my first project functional.
I am trying to build a scraping app using Node and Puppeteer. Basically, the user enters a URL ("link" in the code below), puppeteer goes trough the website code, tries to find the specific piece and returns the data. That part I got working so far.
The problem is when a user enters a URL of a site that doesn't have that piece of code. In that case, I get UnhandledPromiseRejectionWarning: Error: Evaluation failed theme is not defined
What do I do so when there is an error like that, I can catch it and redirect the page instead of Getting Internal Server error.
app.post("/results", function(req, res) {
var link = req.body.link;
(async link => {
const browser = await puppeteer.launch({ args: ['--no-sandbox'] })
const page = await browser.newPage()
await page.goto(link, { waitUntil: 'networkidle2'})
const data = await page.evaluate('theme.name');
await browser.close()
return data
})(link)
.then(data => {
res.render("index", {data: data, siteUrl: link});
})
})
You can extend the async part to the whole route handler and do whatever you want on catch:
app.post('/results', async (req, res) => {
try {
const link = req.body.link
const browser = await puppeteer.launch({ args: ['--no-sandbox'] })
const page = await browser.newPage()
await page.goto(link, { waitUntil: 'networkidle2'})
const data = await page.evaluate('theme.name')
await browser.close()
res.render("index", {data: data, siteUrl: link})
} catch(e) {
// redirect or whatever
res.redirect('/')
}
});

Manually change response URL during Puppeteer request interception

I'm having a hard time navigating relative urls with puppeteer for a specific use case. Below you can see the basic setup and an pseudo example describing the problem.
Essentially I want to change the current url the browser thinks he is at.
What I already tried:
Manipulating the response body by resolving all relative URLs by myself. Collides with some javascript based links.
Triggering a new page.goto(response.url) if request url doesn't match response url and returning the response from the previous request. Can't seem to input custom options, so I don't know which request is a fake page.goto.
Can somebody lend me a helping hand? Thanks in advance.
Setup:
const browser = await puppeteer.launch({
headless: false,
});
const [page] = await browser.pages();
await page.setRequestInterception(true);
page.on('request', (request) => {
const resourceType = request.resourceType();
if (['document', 'xhr', 'script'].includes(resourceType)) {
// fetching takes place on an different instance and handles redirects internally
const response = await fetch(request);
request.respond({
body: response.body,
statusCode: response.statusCode,
url: response.url // no effect
});
} else {
request.abort('aborted');
}
});
Navigation:
await page.goto('https://start.de');
// redirects to https://redirect.de
await page.click('a');
// relative href '/demo.html' resolves to https://start.de/demo.html instead of https://redirect.de/demo.html
await page.click('a');
Update 1
Solution
Manipulating the browser history direction via window.location.
await page.goto('https://start.de');
// redirects to https://redirect.de internally
await page.click('a');
// changing current window location
await page.evaluate(() => {
window.location.href = 'https://redirect.de';
});
// correctly resolves to https://redirect.de/demo.html instead of https://start.de/demo.html
await page.click('a');
When you match the request that you want to edit its body, just get the URL and make a call using "node-fetch" or "request" modules, when you receive the body edit it then sends it as a response to the original request.
for example:
const requestModule = require("request");
const cheerio = require("cheerio");
page.on("request", async (request) => {
// Match the url that you want
const isMatched = /page-12/.test(request.url());
if (isMatched) {
// Make a new call
requestModule({
url: request.url(),
resolveWithFullResponse: true,
})
.then((response) => {
const { body, headers, statusCode, statusMessage } = response;
const contentType = headers["content-type"];
// Edit body using cheerio module
const $ = cheerio.load(body);
$("a").each(function () {
$(this).attr("href", "/fake_pathname");
});
// Send response
request.respond({
ok: statusMessage === "OK",
status: statusCode,
contentType,
body: $.html(),
});
})
.catch(() => request.continue());
} else request.continue();
});

Programmatically capturing AJAX traffic with headless Chrome

Chrome officially supports running the browser in headless mode (including programmatic control via the Puppeteer API and/or the CRI library).
I've searched through the documentation, but I haven't found how to programmatically capture the AJAX traffic from the instances (ie. start an instance of Chrome from code, navigate to a page, and access the background response/request calls & raw data (all from code not using the developer tools or extensions).
Do you have any suggestions or examples detailing how this could be achieved? Thanks!
Update
As #Alejandro pointed out in the comment, resourceType is a function and the return value is lowercased
page.on('request', request => {
if (request.resourceType() === 'xhr')
// do something
});
Original answer
Puppeteer's API makes this really easy:
page.on('request', request => {
if (request.resourceType === 'XHR')
// do something
});
You can also intercept requests with setRequestInterception, but it's not needed in this example if you're not going to modify the requests.
There's an example of intercepting image requests that you can adapt.
resourceTypes are defined here.
I finally found how to do what I wanted. It can be done with chrome-remote-interface (CRI), and node.js. I'm attaching the minimal code required.
const CDP = require('chrome-remote-interface');
(async function () {
// you need to have a Chrome open with remote debugging enabled
// ie. chrome --remote-debugging-port=9222
const protocol = await CDP({port: 9222});
const {Page, Network} = protocol;
await Page.enable();
await Network.enable(); // need this to call Network.getResponseBody below
Page.navigate({url: 'http://localhost/'}); // your URL
const onDataReceived = async (e) => {
try {
let response = await Network.getResponseBody({requestId: e.requestId})
if (typeof response.body === 'string') {
console.log(response.body);
}
} catch (ex) {
console.log(ex.message)
}
}
protocol.on('Network.dataReceived', onDataReceived)
})();
Puppeteer's listeners could help you capture xhr response via response and request event.
You should check wether request.resourceType() is xhr or fetch first.
listener = page.on('response', response => {
const isXhr = ['xhr','fetch'].includes(response.request().resourceType())
if (isXhr){
console.log(response.url());
response.text().then(console.log)
}
})
const browser = await puppeteer.launch();
const page = await browser.newPage();
const pageClient = page["_client"];
pageClient.on("Network.responseReceived", event => {
if (~event.response.url.indexOf('/api/chart/rank')) {
console.log(event.response.url);
pageClient.send('Network.getResponseBody', {
requestId: event.requestId
}).then(async response => {
const body = response.body;
if (body) {
try {
const json = JSON.parse(body);
}
catch (e) {
}
}
});
}
});
await page.setRequestInterception(true);
page.on("request", async request => {
request.continue();
});
await page.goto('http://www.example.com', { timeout: 0 });

Categories