I have a Node server. I pass a URL into request and then extract the contents with cheerio. Now what I'm trying to do is detect whether that webpage is using Google Analytics. How would I do this?
request({ uri: URL }, function (error, response, body) {
    if (!error) {
        const $ = cheerio.load(body);
        const usesAnalytics = body.includes('googletag') || body.includes('analytics.js') || body.includes('ga.js');
        const isUsingGA = ?; // <-- what should go here?
    }
});
From the official Analytics site, they say that you can find certain strings that would indicate GA is active. I have tried scanning the body for these, but the checks always return false even when the page is running GA. I included this in the code above.
I've looked at websites that use it and I can't see anything in their index page that would suggest they are using it. It's only when I go to their sources that I see they are using it. How would I detect this in Node?
I have a Node script which uses Puppeteer to monitor the requests sent from a website.
I wrote this some time ago so some parts might be irrelevant to you but here you go:
'use strict';
const puppeteer = require('puppeteer');

function getGaTag(lookupDomain) {
    return new Promise((resolve) => {
        (async () => {
            const result = [];
            const browser = await puppeteer.launch({ headless: true });
            const page = await browser.newPage();
            await page.setRequestInterception(true);
            page.on('request', request => {
                const url = request.url();
                // matches classic tracking IDs such as UA-12345678-1
                const regexp = /(UA|YT|MO)-\d+-\d+/i;
                // look for hits sent to the Google Analytics collect endpoint
                if (url.match(/^https?:\/\/www\.google-analytics\.com\/(r\/)?collect/i)) {
                    console.log(url.match(regexp));
                    console.log('\n');
                    result.push(url.match(regexp)[0]);
                }
                request.continue();
            });
            try {
                await page.goto(lookupDomain);
                // give the page time to fire its trackers
                // (waitFor is deprecated in newer Puppeteer; use waitForTimeout there)
                await page.waitFor(9000);
            } catch (err) {
                console.log("Couldn't fetch page " + err);
            }
            await browser.close();
            resolve(result);
        })();
    });
}

getGaTag('https://store.google.com/').then(result => {
    console.log(result);
});
Running node ga-check.js now returns the UA ID of the Google Analytics tracker on the lookup domain, which in this case is https://store.google.com: [ 'UA-54090495-1' ]
Hope this helps!
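One caveat if you use this approach today: newer GA4 properties send their hits to /g/collect (sometimes on regional hosts) with a G-XXXXXXXXXX measurement ID rather than a UA-XXXXXXX-X tracking ID, so the patterns above would miss them. A sketch of widened patterns (an assumption on my part, not tested against every GA setup):
// also match GA4 measurement IDs and the GA4 endpoint, including regional hosts
const regexp = /(UA-\d+-\d+|G-[A-Z0-9]+)/i;
const collectEndpoint = /^https?:\/\/(www|region\d+)\.google-analytics\.com\/(r\/|g\/)?collect/i;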
I am trying to get the joke from https://icanhazdadjoke.com/. This is the code I used:
const axios = require('axios');

const getDadJoke = async () => {
    const res = await axios.get('https://icanhazdadjoke.com/', { headers: { Accept: 'application/json' } });
    console.log(res.data.joke);
};

getDadJoke();
I expected to get the joke but instead I got the full html page, as if I didn't specify the headers at all. What am I doing wrong?
If you look at the API documentation for icanhazdadjoke.com, there is a section titled "Custom user agent." In that section, they explain that they want all requests to set a User-Agent header. If you use Axios in a browser context, the User-Agent is set for you by your browser. But I'm going to go out on a limb and say that you are running this code via Node, in which case you may need to set the User-Agent header manually, like so:
const getDadJoke = async () => {
    const res = await axios.get('https://icanhazdadjoke.com/', {
        headers: {
            'Accept': 'application/json',
            'User-Agent': 'my URL, email or whatever'
        }
    });
    console.log(res.data.joke);
};

getDadJoke();
The docs say what they want you to put in the User-Agent header, but honestly I think it would work with any User-Agent value at all.
The HTML page you're getting is a 503 response from Cloudflare.
As per the API documentation
Custom user agent
If you intend on using the icanhazdadjoke.com API we kindly ask that you set a custom User-Agent header for all requests.
My guess is they have a Cloudflare Browser Integrity Check configured that's triggering for the default Node / Axios user-agent.
Setting a custom user-agent appears to get around this...
const getDadJoke = async () => {
    try {
        const res = await axios.get("https://icanhazdadjoke.com/", {
            headers: {
                accept: "application/json",
                "user-agent": "My Node and Axios app", // use something better than this
            },
        });
        console.log(res.data.joke);
    } catch (err) {
        console.error(err.response?.data, err.toJSON());
    }
};
Given how unreliable Axios releases have been since v1.0.0, I highly recommend you switch to something else. The Fetch API has been available natively in Node since v18:
const getDadJoke = async () => {
    try {
        const res = await fetch("https://icanhazdadjoke.com/", {
            headers: {
                accept: "application/json",
                "user-agent": "My Node and Fetch app", // use something better than this
            },
        });
        if (!res.ok) {
            const err = new Error(`${res.status} ${res.statusText}`);
            err.text = await res.text();
            throw err;
        }
        console.log((await res.json()).joke);
    } catch (err) {
        console.error(err, err.text);
    }
};
Using Axios, the REST API call responds in JSON format.
If you use the API from https://icanhazdadjoke.com/api#authentication, you can use Axios. Here is an example.
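A minimal sketch of such a call (the User-Agent value below is a placeholder; the API only asks that you set one):
const axios = require('axios');

const getJoke = async () => {
    const res = await axios.get('https://icanhazdadjoke.com/', {
        headers: {
            'Accept': 'application/json',
            'User-Agent': 'my-joke-app (contact@example.com)' // placeholder; use your own
        }
    });
    console.log(res.data.joke);
};

getJoke();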
Alternative method.
You can also use a web-scraping approach here, because https://icanhazdadjoke.com/ serves an HTML response by default.
This is an example of how to scrape it using the puppeteer library in Node.js.
Demo code
Save it as a file named get-joke.js.
const puppeteer = require("puppeteer");

async function getJoke() {
    try {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('https://icanhazdadjoke.com/');
        const joke = await page.evaluate(() => {
            // the joke text lives in the first <p class="subtitle"> element
            const jokes = Array.from(document.querySelectorAll('p[class="subtitle"]'));
            return jokes[0].innerText;
        });
        await browser.close();
        return Promise.resolve(joke);
    } catch (error) {
        return Promise.reject(error);
    }
}

getJoke()
    .then((joke) => {
        console.log(joke);
    });
Selector
The main idea is to use a DOM tree selector.
Chrome's DevTools (opened by pressing F12) show the HTML DOM tree structure.
The <p> tag has the class name subtitle, so it can be selected with:
document.querySelectorAll('p[class="subtitle"]')
(equivalently: document.querySelectorAll('p.subtitle'))
Install the dependency and run it
npm install puppeteer
node get-joke.js
Result
You can get the joke from that website.
I'm trying to build a Telegram bot to parse a page on user request. My parsing code works fine inside one async function, but completely falls on its face if I try to put it inside another async function.
Here is the relevant code I have:
const puppeteer = require('puppeteer');
const fs = require('fs/promises');
const { Console } = require('console');

async function start() {
    async function searcher(input) {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        const url = ; // here is a long url combining logic, that works fine
        await page.goto(url);
        const currentUrl = requestPage.url();
        console.log(currentUrl); // returns nothing.
        // here is some long parsing logic
        await browser.close();
        return combinedResult;
    }

    // here is the bot code
    const { Telegraf } = require('telegraf');
    const bot = new Telegraf('my bot ID');

    bot.command('start', ctx => {
        console.log(ctx.from);
        bot.telegram.sendMessage(ctx.chat.id, 'Greatings message', {});
        bot.telegram.sendMessage(ctx.chat.id, 'request prompt ', {});
    });

    bot.on('text', (ctx) => {
        console.log(ctx.message.text);
        const queryOutput = searcher(ctx.message.text);
        bot.telegram.sendMessage(ctx.chat.id, queryOutput, {});
    });

    bot.launch();
}

start();
Here is the error message:
/Users/a.rassanov/Desktop/Fetch/node_modules/puppeteer/lib/cjs/puppeteer/common/Connection.js:218
return Promise.reject(new Error(`Protocol error (${method}): Session closed. Most likely the ${this._targetType} has been closed.`));
^
Error: Protocol error (Page.navigate): Session closed. Most likely the page has been closed.
I'm very new to this, and your help is really appreciated.
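For what it's worth, one detail that stands out in the snippet above: searcher() is async but is called without await, so queryOutput is still a pending Promise when it is passed to sendMessage (and requestPage on the currentUrl line is never defined). A minimal sketch of an awaiting handler, assuming Telegraf v4, where handlers may be async:
bot.on('text', async (ctx) => {
    console.log(ctx.message.text);
    const queryOutput = await searcher(ctx.message.text); // wait for the scrape to finish
    await bot.telegram.sendMessage(ctx.chat.id, queryOutput, {});
});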
I am currently working on the migration of a web application to a PWA, and since a few days I encounter this problem: "Site cannot be installed: Page does not work offline. Starting in Chrome 93, the installability criteria is changing, and this site will not be installable. See https://goo.gle/improved-pwa-offline-detection for more information."
I did several searches on forums, especially on Stack Overflow, to find a solution to my problem, but without success: none of the proposed solutions solved my problem.
I created this JS script following recommendations found on Stack Overflow:
importScripts('./ngsw-worker.js');

const OFFLINE_VERSION = 1;
const CACHE_NAME = 'offline-html';
const OFFLINE_URL = 'offline.html';

self.addEventListener('install', (event) => {
    event.waitUntil((async () => {
        const cache = await caches.open(CACHE_NAME);
        await cache.add(new Request(OFFLINE_URL, { cache: 'reload' }));
    })());
});

self.addEventListener('fetch', (event) => {
    if (event.request.mode === 'navigate') {
        event.respondWith((async () => {
            try {
                const preloadResponse = await event.preloadResponse;
                if (preloadResponse) {
                    return preloadResponse;
                }
                const networkResponse = await fetch(event.request);
                return networkResponse;
            } catch (error) {
                console.log('Fetch failed; returning offline page instead.', error);
                const cache = await caches.open(CACHE_NAME);
                const cachedResponse = await cache.match(OFFLINE_URL);
                return cachedResponse;
            }
        })());
    }
});
This script executes correctly when the application runs, but the error message persists.
Do you have a solution for this problem?
Thank you.
PS: I am using Angular 8.1.1.
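One thing worth double-checking in a setup like this (an assumption on my part, since the registration code isn't shown): Chrome only sees the fetch handler if the custom file above is the service worker that actually gets registered. If the Angular app still registers ngsw-worker.js directly, the custom script never runs. With Angular, that would mean pointing the registration at the custom file, along these lines (the file name here is hypothetical):
// app.module.ts -- register the custom worker, which importScripts ngsw-worker.js
ServiceWorkerModule.register('custom-sw.js', {
    enabled: environment.production
})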
I need to integrate the jest-puppeteer test framework with TestRail using the TestRail API. But for that, I need to know which of the tests failed and which passed.
I searched for information in the official GitHub repository and on the Jest site, but found nothing about it.
Test:
describe('Single company page Tests:', () => {
    let homePage;

    beforeAll(async () => {
        homePage = await addTokenToBrowser(browser);
    }, LOGIN_FLOW_MAX_TIME);

    it('Open the company page from the list', async done => {
        await goto(homePage, LIST_PAGE_RELATIVE_PATH);
        await listPage.clickSearchByCompanyName(homePage);
        await addCompanyNamePopup.isPopupDisplayed(homePage);
        await addCompanyNamePopup.fillCompanyName(homePage, companies.century.link);
        await addCompanyNamePopup.clickNext(homePage);
        await addCompanyNamePopup.fillListName(homePage, listNames[0]);
        await addCompanyNamePopup.clickSave(homePage);
        await addCompanyNamePopup.clickViewList(homePage);
        const nextPage = await clickCompanyName(homePage, browser, companies.century.name);
        await companyPage.isOverviewTabPresent(nextPage);
        await companyPage.isPeopleTabPresent(nextPage);
        await companyPage.isSocialTabPresent(nextPage);
        await companyPage.isFinanceTabPresent(nextPage);
        await companyPage.isLeaseTabPresent(nextPage);
        await homePage.close();
        done();
    });
});
I expected to get the names of all passed and failed test cases and to write them to JSON, with each test case's name and result.
As it stands, I have none of this.
You can use a true/false assertion approach, like I do in my GitHub project.
For example, anchor the test case to some final selector with a simple assert:
const assert = require('assert');

describe('E2E testing', () => {
    it('[Random Color Picker] color button clickable', async () => {
        // Setup
        let expected = true;
        let expectedCssLocator = '#color-button';
        let actual;

        // Execute
        let actualPromise = await page.waitForSelector(expectedCssLocator);
        if (actualPromise != null) {
            await page.click(expectedCssLocator);
            actual = true;
        }
        else
            actual = false;

        // Verify
        assert.equal(actual, expected);
    });
});
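If you then need the pass/fail data as JSON for the TestRail API, one option is a custom Jest reporter (a sketch built on Jest's documented reporter interface, not taken from the project above) that writes every test name and status to a file:
// testrail-reporter.js -- enable it in jest.config.js via:
// reporters: ['default', '<rootDir>/testrail-reporter.js']
const fs = require('fs');

class TestRailReporter {
    onRunComplete(contexts, results) {
        const cases = [];
        for (const suite of results.testResults) {      // one entry per test file
            for (const test of suite.testResults) {     // one entry per it()
                cases.push({ name: test.fullName, status: test.status }); // 'passed' / 'failed' / 'pending'
            }
        }
        fs.writeFileSync('test-results.json', JSON.stringify(cases, null, 2));
    }
}

module.exports = TestRailReporter;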
Chrome officially supports running the browser in headless mode (including programmatic control via the Puppeteer API and/or the CRI library).
I've searched through the documentation, but I haven't found how to programmatically capture the AJAX traffic from the instances (i.e. start an instance of Chrome from code, navigate to a page, and access the background request/response calls & raw data, all from code, not using the developer tools or extensions).
Do you have any suggestions or examples detailing how this could be achieved? Thanks!
Update
As @Alejandro pointed out in the comments, resourceType is a function and its return value is lowercased:
page.on('request', request => {
    if (request.resourceType() === 'xhr') {
        // do something
    }
});
Original answer
Puppeteer's API makes this really easy:
page.on('request', request => {
    if (request.resourceType === 'XHR') {
        // do something
    }
});
You can also intercept requests with setRequestInterception, but it's not needed in this example if you're not going to modify the requests.
There's an example of intercepting image requests that you can adapt.
resourceTypes are defined here.
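For completeness, a small sketch of that interception route (here blocking image requests; request.abort() and request.continue() are the standard Puppeteer request methods):
await page.setRequestInterception(true);
page.on('request', request => {
    if (request.resourceType() === 'image') {
        request.abort();    // drop image requests
    } else {
        request.continue(); // let everything else through
    }
});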
I finally found how to do what I wanted. It can be done with chrome-remote-interface (CRI), and node.js. I'm attaching the minimal code required.
const CDP = require('chrome-remote-interface');

(async function () {
    // you need to have a Chrome open with remote debugging enabled
    // ie. chrome --remote-debugging-port=9222
    const protocol = await CDP({ port: 9222 });
    const { Page, Network } = protocol;
    await Page.enable();
    await Network.enable(); // need this to call Network.getResponseBody below
    Page.navigate({ url: 'http://localhost/' }); // your URL
    const onDataReceived = async (e) => {
        try {
            let response = await Network.getResponseBody({ requestId: e.requestId });
            if (typeof response.body === 'string') {
                console.log(response.body);
            }
        } catch (ex) {
            console.log(ex.message);
        }
    };
    protocol.on('Network.dataReceived', onDataReceived);
})();
Puppeteer's listeners can help you capture XHR responses via the response and request events.
You should check whether request.resourceType() is xhr or fetch first.
const listener = page.on('response', response => {
    const isXhr = ['xhr', 'fetch'].includes(response.request().resourceType());
    if (isXhr) {
        console.log(response.url());
        response.text().then(console.log);
    }
});
// Note: this relies on Puppeteer's private CDP session (page._client),
// which may break between Puppeteer versions.
const browser = await puppeteer.launch();
const page = await browser.newPage();
const pageClient = page["_client"];

pageClient.on("Network.responseReceived", event => {
    if (~event.response.url.indexOf('/api/chart/rank')) { // only responses whose URL contains this path
        console.log(event.response.url);
        pageClient.send('Network.getResponseBody', {
            requestId: event.requestId
        }).then(async response => {
            const body = response.body;
            if (body) {
                try {
                    const json = JSON.parse(body); // parsed payload; do something with it here
                }
                catch (e) {
                    // ignore bodies that aren't valid JSON
                }
            }
        });
    }
});

await page.setRequestInterception(true);
page.on("request", async request => {
    request.continue();
});
await page.goto('http://www.example.com', { timeout: 0 });