I'm trying to create a Node app that takes a URL from the user. The URL is then passed to scrape.js, which uses Puppeteer to scrape certain fields and passes the data back to app.js in JSON format (so that I can then upsert it into a doc). But what I receive is the entire ServerResponse and not the data in JSON format as I'm intending.
I was hoping someone with more experience could shed some light. Here is what I have so far:
// app.js
const express = require('express');
const scrape = require('./scrape');
const router = express.Router();
router.get( '/', ( req, res ) => {
const url = req.body.url;
const item = new Promise((resolve, reject) => {
scrape
.scrapeData()
.then((data) => res.json(data))
.catch(err => reject('Scraping failed...'))
})
});
// scrape.js
const puppeteer = require('puppeteer');
const scrapeData = async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.setViewport({ width: 360, height: 640 });
await page.goto(url);
let scrapedData = await page.evaluate(() => {
let scrapedDetails = [];
let elements = document.querySelectorAll('#a-page');
elements.forEach(element => {
let detailsJson = {};
try {
detailsJson.title = element.querySelector('h1#title').innerText;
detailsJson.desc = element.querySelector('#description_box').innerText;
} catch (exception) {}
scrapedDetails.push(detailsJson);
});
return scrapedDetails;
});
// console.dir(scrapeData) - logs the data successfully.
};
module.exports.scrapeData = scrapeData
You have a naming problem. scrape.js is exporting the scrapeData function. Inside that function, you declared a scrapedData variable, which is not the same thing.
Where you put:
console.dir(scrapeData) - logs the data successfully.
Add
return scrapedData;
That should solve your issue.
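For completeness, here is a minimal sketch of how the two files could fit together once the data is returned. It reuses the selectors and viewport from the question; passing the URL in as a query-string parameter and closing the browser in a finally block are my own assumptions, not part of the original code:
// scrape.js
const puppeteer = require('puppeteer');

const scrapeData = async (url) => {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 360, height: 640 });
    await page.goto(url);
    // Runs in the page context; whatever it returns is serialized back to Node
    const scrapedData = await page.evaluate(() => {
      const scrapedDetails = [];
      document.querySelectorAll('#a-page').forEach(element => {
        const detailsJson = {};
        try {
          detailsJson.title = element.querySelector('h1#title').innerText;
          detailsJson.desc = element.querySelector('#description_box').innerText;
        } catch (exception) {}
        scrapedDetails.push(detailsJson);
      });
      return scrapedDetails;
    });
    return scrapedData; // this is what the route receives
  } finally {
    await browser.close(); // don't leave the browser running
  }
};

module.exports.scrapeData = scrapeData;

// app.js
const express = require('express');
const scrape = require('./scrape');
const router = express.Router();

router.get('/', async (req, res) => {
  try {
    // a GET request has no body, so the URL is taken from the query string here (assumption)
    const data = await scrape.scrapeData(req.query.url);
    res.json(data);
  } catch (err) {
    res.status(500).json({ error: 'Scraping failed...' });
  }
});

module.exports = router;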
I'm trying to create an application to search for my music on some sites that post illegal content so I can ask them to delete it later.
I am facing this problem in puppeteer, when I try to press enter on the search input I get this error: Error: Execution context was destroyed, most likely because of a navigation.
I have two files. One called urlScrapper.js with my script and an array with the names of my songs:
import InfringementFinder from './InfringementFinder.js';
const songs = ['Artist Name - Song Name', 'Artist Name - Song Name'];
const irscCodes = ['XXXXXXXXXX', 'XXXXXXXXXX'];
InfringementFinder(songs, irscCodes).then(() => {
console.log('Search complete!');
}).catch((error) => {
console.error(error);
});
and InfringementFinder.js:
import puppeteer from 'puppeteer';
const InfringementFinder = async (songs, irscCodes) => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const mainPage = 'https://example.com/';
await page.goto(mainPage);
// This enters the search term in the input field
await page.type('.search-field', 'Artist Name - Song Name'); // this is supposed to be my prop but somehow it doesn't work
// Trigger the search by submitting the form
const searchSubmit = await page.waitForSelector('.search-submit');
await searchSubmit.press('Enter');
// Wait for the search results to load
await page.waitForSelector('.g1-frame');
// This finds the first entry-content element containing the information
const entryContent = await page.$('.g1-frame');
if (!entryContent) return;
// This press on the element
await entryContent.press('Enter');
// Extract the relevant information
try {
const data = await page.evaluate(() => {
const trackElements = Array.from(document.querySelectorAll('li', 'ol', 'a', 'href', 'strong', 'span', 'p', 'div', 'class'))
.filter(el => el.innerText.includes('Artist Name - Song Name'));
const tracks = trackElements.map(trackElement => {
const trackName = trackElement.innerText.split(' – ')[0];
return { trackName };
});
const downloadLinks = Array.from(document.querySelectorAll('.dl-btn'))
.map(link => link.getAttribute('href'));
return { tracks, downloadLinks };
});
console.log('Data:', data);
} catch (error) {
console.error(error);
} finally {
await browser.close();
}
};
export default InfringementFinder;
It only works if I scrape a page where I already know my music is posted, using a different version of the code, but the idea is to search the whole website using the search input.
The logic is as follows: You click on the search input, type the name of the song, navigate to another page, click on your music, navigate to another page, and scrape the name of the songs and links to illegal downloads.
Your error is probably due to the following code:
await entryContent.press('Enter'); // triggers a nav
// Extract the relevant information immediately
// without waiting for nav to complete
try {
const data = await page.evaluate(() => {
I'd either wait for a nav here or wait for the selector on the next page you're about to access with evaluate.
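For example, a minimal sketch of both options (the selector is a placeholder for something you know exists on the results page):
// Option 1: wait for the navigation triggered by pressing Enter
await Promise.all([
  page.waitForNavigation(),
  entryContent.press('Enter'),
]);

// Option 2: press, then wait for a selector that only exists on the destination page
await entryContent.press('Enter');
await page.waitForSelector('.entry-content'); // placeholder selector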
Also,
document.querySelectorAll('li', 'ol', 'a', 'href', 'strong', 'span', 'p', 'div', 'class')
doesn't make sense: querySelectorAll only accepts one parameter, and 'class' isn't a name of an HTML element. It's a good idea to test this in the browser first, because it's plain JS.
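If you really do want to match several element types at once, the whole group goes into a single comma-separated selector string, for example:
// one argument: a selector list, not several arguments
const elements = document.querySelectorAll('li, ol, a, strong, span, p, div');
// extra arguments such as 'href' or 'class' are simply ignored by the browser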
I don't see 'Artist Name - Song Name' anywhere on the page.
This code:
await page.waitForSelector('.g1-frame');
// This finds the first entry-content element containing the information
const entryContent = await page.$('.g1-frame');
if (!entryContent) return;
could just be:
const entryContent = await page.waitForSelector('.g1-frame');
It's common to assume you need to navigate the site as the user would: go to the homepage, type in the search term, press Enter...
Better is to look at the query string of the search page and build your own, avoiding the extra nav and fuss of dealing with the DOM. Here's an example:
const puppeteer = require("puppeteer"); // ^19.6.3
const baseUrl = "<Your base URL, ending in .net>";
const searchTerm = "autechre";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
const url = `${baseUrl}?s=${encodeURIComponent(searchTerm)}`;
await page.setJavaScriptEnabled(false);
await page.setRequestInterception(true);
page.on("request", request => {
if (request.resourceType() === "document") {
request.continue();
}
else {
request.abort();
}
});
await page.goto(url, {waitUntil: "domcontentloaded"});
await (await page.$(".g1-frame")).click();
const trackListEl = await page.waitForSelector(".entry-content > .greyf12");
const tracks = await trackListEl.$$eval("li", els => {
const fields = ["artist", "track"];
return els.map(e =>
Object.fromEntries(
e.textContent
.split(/ *– */)
.map((e, i) => [fields[i], e.trim()]),
)
);
});
const downloadLinks = await page.$$eval(".dl-btn", els =>
els.map(e => e.getAttribute("href"))
);
console.log({tracks, downloadLinks});
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Note that we don't need to execute JS and we're blocking almost all resource requests, so we can speed up the scrape significantly by switching to fetch/axios and a simple HTML parser like Cheerio:
const cheerio = require("cheerio"); // 1.0.0-rc.12
const baseUrl = "<Your base URL, ending in .net>";
const searchTerm = "autechre";
const url = `${baseUrl}?s=${encodeURIComponent(searchTerm)}`;
const get = url =>
fetch(url) // Node 18 or install node-fetch, or use another library like axios
.then(res => {
if (!res.ok) {
throw Error(res.statusText);
}
return res.text();
});
get(url)
.then(html =>
get(cheerio.load(html)(".entry-title a").attr("href"))
)
.then(html => {
const $ = cheerio.load(html);
const tracks = [...$(".entry-content > .greyf12 li")].map(
e => {
const fields = ["artist", "track"];
return Object.fromEntries(
$(e)
.text()
.split(/ *– */)
.map((e, i) => [fields[i], e.trim()])
);
}
);
const downloadLinks = [...$(".dl-btn")].map(e =>
$(e).attr("href")
);
console.log({tracks, downloadLinks});
});
The code is simpler, and on my machine, twice as fast as Puppeteer.
Recently I started to crawl the web using Puppeteer. Below is code for extracting specific product names from a shopping mall.
const puppeteer = require('puppeteer');
(async () => {
const width = 1600, height = 1040;
const option = { headless: false, slowMo: true, args: [`--window-size=${width},${height}`] };
const browser = await puppeteer.launch(option);
const page = await browser.newPage();
const vp = {width: width, height: height};
await page.setViewport(vp);
const navigationPromise = page.waitForNavigation();
await page.goto('https://shopping.naver.com/home/p/index.nhn');
await navigationPromise;
await page.waitFor(2000);
const textBoxId = 'co_srh_input';
await page.type('.' + textBoxId, '양말', {delay: 100});
await page.keyboard.press('Enter');
await page.waitFor(5000);
await page.waitForSelector('div.info > a.tit');
const stores = await page.evaluate(() => {
const links = Array.from(document.querySelectorAll('div.info > a.tit'));
return links.map(link => link.innerText).slice(0, 10); // take only the first 10 products
});
console.log(stores);
await browser.close();
})();
I have a question. How can I output the crawled results to an HTML document (without using the database)? Please use sample code to explain it.
I used what I saw on blog.kowalczyk.info:
const puppeteer = require("puppeteer");
const fs = require("fs");
async function run() {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://www.google.com/", { waitUntil: "networkidle2" });
// hacky defensive move but I don't know a better way:
// wait a bit so that the browser finishes executing JavaScript
await page.waitFor(1 * 1000);
const html = await page.content();
fs.writeFileSync("index.html", html);
await browser.close();
}
run();
You can use fs.writeFile(). The following write_file function wraps it in a Promise that resolves or rejects when fs.writeFile() succeeds or fails.
Then, you can await that Promise from within your anonymous, asynchronous function and check whether or not the data was written to the file:
'use strict';
const fs = require('fs');
const puppeteer = require('puppeteer');
const write_file = (file, data) => new Promise((resolve, reject) => {
fs.writeFile(file, data, 'utf8', error => {
if (error) {
console.error(error);
reject(false);
} else {
resolve(true);
}
});
});
(async () => {
// ...
const stores = await page.evaluate(() => {
return Array.from(document.querySelectorAll('div.info > a.tit'), link => link.innerText).slice(0, 10); // take only the first 10 products
});
if (await write_file('example.html', stores.toString()) === false) {
console.error('Error: Unable to write stores to example.html.');
}
// ...
})();
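As a side note (not part of the original answer): on recent Node versions you can skip the manual wrapper entirely and use the built-in promise-based API:
const { writeFile } = require('fs/promises'); // available since Node 14

// inside the async IIFE
try {
  await writeFile('example.html', stores.toString(), 'utf8');
} catch (error) {
  console.error('Error: Unable to write stores to example.html.', error);
}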
Here is the code in my data scraping file:
const puppeteer = require('puppeteer');
const db = require('../db');
const Job = require('../models/job');
(async() => {
try {
const browser = await puppeteer.launch({
headless: false,
defaultViewport: null,
// args: ['--no-zygote', '--no-sandbox']
});
const url = 'https://www.linkedin.com/jobs/search?keywords=Junior%20Software%20Developer&location=Indianapolis%2C%20IN&geoId=&trk=homepage-jobseeker_jobs-search-bar_search-submit&position=1&pageNum=0';
// Open browser instance
const page = await browser.newPage({
waitUntil: 'networkidle0'
});
console.log(`Navigating to ${url}`);
await page.goto(url);
// Scroll to bottom of page, click on 'See More Jobs' and repeat
let lastHeight = await page.evaluate('document.body.scrollHeight');
const scroll = async() => {
while (true) {
await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
await page.waitForTimeout(2000);
let newHeight = await page.evaluate('document.body.scrollHeight');
if (newHeight === lastHeight) {
console.log('Done scrolling!');
break;
}
lastHeight = newHeight;
seeMoreJobs();
}
console.log(data);
}
// Click on 'See More Jobs'
const seeMoreJobs = async() => {
await page.evaluate(() => {
document.querySelector('button[data-tracking-control-name="infinite-scroller_show-more"]').click();
});
}
// Collect data
const data = await page.evaluate(() => {
const allJobsArr = Array.from(document.querySelectorAll('a[data-tracking-control-name="public_jobs_jserp-result_search-card"]'));
const namesAndUrls = allJobsArr.map(job => {
return {
name: job.innerText,
url: job.href,
path: job.pathname
}
});
return namesAndUrls;
});
scroll();
} catch (err) {
console.log(err);
}
})();
So the above code is designed to navigate to the variable url and then to scroll until the scroll function "breaks"/finishes, i.e., to the very bottom of the page. Once these actions have finished, I want to then log some data in the form of an array with three properties from each job posting: name, href, and path. When I run the IIFE as shown I am able to grab the first 24-25 job postings with my data function, which are the first to be displayed on page load (before any of the scrolling takes place).
For whatever reason, this data function is unable to evaluate the entire page or document after all the scrolling has occurred.
I have tried various things and have really analyzed what the code is doing, but alas, I am at a loss for a solution. My end goal here is to comb through every job posting that has displayed with my scrolling function and then to log everything (not just the first 24-25 results) returned with the desired data properties to the console.
Thanks, all.
OK, I have now figured out why it was only pulling the first 25 results, and it was essentially the ordering/scope problem I had outlined in the original question: in the first version, the data evaluation ran before any scrolling had happened, because scroll() was only invoked afterwards and never awaited. I ended up housing the 'data' functional expression within the scroll() function, after the while loop, so that the page is evaluated only once all the scrolling is done. I know this might not be the most accurate explanation, so if someone would like to better articulate it for me, that would be awesome. Here is the simple solution to the simple problem I was having. Thanks.
const puppeteer = require('puppeteer');
const db = require('../db');
const Job = require('../models/job');
(async() => {
try {
const browser = await puppeteer.launch({
headless: false,
defaultViewport: null,
// args: ['--no-zygote', '--no-sandbox']
});
const url = 'https://www.linkedin.com/jobs/search?keywords=Junior%20Software%20Developer&location=Indianapolis%2C%20IN&geoId=&trk=homepage-jobseeker_jobs-search-bar_search-submit&position=1&pageNum=0';
// Open browser instance
const page = await browser.newPage({
waitUntil: 'networkidle0'
});
console.log(`Navigating to ${url}`);
await page.goto(url);
// Scroll to bottom of page, click on 'See More Jobs' and repeat
let lastHeight = await page.evaluate('document.body.scrollHeight');
const scroll = async() => {
while (true) {
await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
await page.waitForTimeout(2000);
let newHeight = await page.evaluate('document.body.scrollHeight');
if (newHeight === lastHeight) {
break;
}
lastHeight = newHeight;
seeMoreJobs();
}
// Scrape all junior job titles
const data = await page.evaluate(() => {
const allJobsArr = Array.from(document.querySelectorAll('a[data-tracking-control-name="public_jobs_jserp-result_search-card"]'));
const namesAndUrls = allJobsArr.map(job => {
return {
name: job.innerText,
url: job.href,
path: job.pathname
}
});
const juniorJobs = namesAndUrls.filter(function(job) {
return job.name.includes('Junior') || job.name.includes('Jr') || job.name.includes('Entry') && job.url && job.path;
});
return juniorJobs;
});
console.log(data);
}
// Click on 'See More Jobs'
const seeMoreJobs = async() => {
await page.evaluate(() => {
document.querySelector('button[data-tracking-control-name="infinite-scroller_show-more"]').click();
});
}
scroll();
} catch (err) {
console.log(err);
}
})();
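One small caveat worth noting (my observation, not part of the original post): scroll() is still invoked without await, so a rejection inside it bypasses the surrounding try/catch and the browser is never closed. A minimal adjustment would be:
// at the bottom of the try block, instead of a bare scroll():
await scroll();        // errors now propagate to the catch block
await browser.close(); // shut the browser down once the scrape is done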
I'm sorry for the long question. I'm a newbie at Node, but I have made a CRUD API before with authentication and everything; I just need to understand how to integrate Puppeteer into the API. So let me begin:
This is my project structure:
puppeteer-api
controllers - puppeteer.controller.js
routes - puppeteer.routes.js
index.js
This is my index.js file:
const puppeteer = require('puppeteer');
const express = require('express');
const booking = require('./routes/puppeteer.routes')
const app = express();
app.use('/booking', booking);
let port = 8080;
app.listen(port, () => {
console.log('Server is running on https://localhost:8080/');
});
puppeteer.routes.js:
const express = require('express');
const router = express.Router();
const puppeteer_controller = require('../controllers/puppeteer.controller');
router.get('/', puppeteer_controller.get_booking);
module.exports = router;
puppeteer.controller.js:
const puppeteer = require('puppeteer');
exports.get_booking = (req, res, next) => {
res.json = (async function main() {
try {
const browser = await puppeteer.launch({ headless: true});
const page = await browser.newPage();
await page.goto('https://www.booking.com/searchresults.es-ar.html?label=gen173nr-1DCAEoggI46AdIM1gEaAyIAQGYASy4ARfIAQzYAQPoAQGIAgGoAgM&lang=es-ar&sid=bc11c3e819d105b3c501d0c7a501c718&sb=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.es-ar.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaAyIAQGYASy4ARfIAQzYAQPoAQGIAgGoAgM%3Bsid%3Dbc11c3e819d105b3c501d0c7a501c718%3Bsb_price_type%3Dtotal%26%3B&ss=El+Bols%C3%B3n%2C+R%C3%ADo+Negro%2C+Argentina&is_ski_area=&checkin_year=&checkin_month=&checkout_year=&checkout_month=&no_rooms=1&group_adults=2&group_children=0&b_h4u_keep_filters=&from_sf=1&ss_raw=el+bols&ac_position=0&ac_langcode=es&ac_click_type=b&dest_id=-985282&dest_type=city&place_id_lat=-41.964452&place_id_lon=-71.532732&search_pageview_id=06d48fb6823e00e9&search_selected=true&search_pageview_id=06d48fb6823e00e9&ac_suggestion_list_length=5&ac_suggestion_theme_list_length=0');
await page.waitForSelector('.sr_item');
page.on('console', consoleObj => console.log(consoleObj.text()));
console.log('Retrieving hotels data');
const hoteles = page.evaluate(() => {
let hoteles = [];
let x = document.getElementsByClassName('sr_item');
hoteles.push(x);
let navigation = document.getElementsByClassName('sr_pagination_item');
for (const nav of navigation) {
nav.click();
hoteles.push(document.getElementsByClassName('sr_item'));
}
console.log('Finished looping through');
return hoteles;
});
} catch(e) {
console.log('error', e);
}
})();
};
So, what I want is to send a GET request from my app and get a response from my API with a list of hotels from Booking; it's just a personal project. The thing is, when I send the GET request with Postman I get no response at all, so I'm wondering what I'm doing wrong and what direction to follow. If anyone could point me in the right direction I would be very grateful.
The (async function main() { ... })() block runs your code immediately when the module is loaded, instead of on each request.
Also, res.json is a function; you should not reassign it.
Instead, move the function somewhere else and call it like below,
async function scraper() {
let browser; // declared outside the try so the finally block can close it
try {
browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://www.booking.com/searchresults.es-ar.html?label=gen173nr-1DCAEoggI46AdIM1gEaAyIAQGYASy4ARfIAQzYAQPoAQGIAgGoAgM&lang=es-ar&sid=bc11c3e819d105b3c501d0c7a501c718&sb=1&src=index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.es-ar.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaAyIAQGYASy4ARfIAQzYAQPoAQGIAgGoAgM%3Bsid%3Dbc11c3e819d105b3c501d0c7a501c718%3Bsb_price_type%3Dtotal%26%3B&ss=El+Bols%C3%B3n%2C+R%C3%ADo+Negro%2C+Argentina&is_ski_area=&checkin_year=&checkin_month=&checkout_year=&checkout_month=&no_rooms=1&group_adults=2&group_children=0&b_h4u_keep_filters=&from_sf=1&ss_raw=el+bols&ac_position=0&ac_langcode=es&ac_click_type=b&dest_id=-985282&dest_type=city&place_id_lat=-41.964452&place_id_lon=-71.532732&search_pageview_id=06d48fb6823e00e9&search_selected=true&search_pageview_id=06d48fb6823e00e9&ac_suggestion_list_length=5&ac_suggestion_theme_list_length=0');
await page.waitForSelector('.sr_item');
page.on('console', consoleObj => console.log(consoleObj.text()));
console.log('Retrieving hotels data');
// Note: evaluate must be awaited, and only serializable values survive the round trip
const hoteles = await page.evaluate(() => {
let hoteles = [];
let x = document.getElementsByClassName('sr_item');
hoteles.push(x);
let navigation = document.getElementsByClassName('sr_pagination_item');
for (const nav of navigation) {
nav.click();
hoteles.push(document.getElementsByClassName('sr_item'));
}
console.log('Finished looping through');
return hoteles;
});
return hoteles; // return the data so the controller can send it as JSON
} catch(e) {
console.log('error', e);
} finally {
if (browser) await browser.close(); // close the browser whether the scrape succeeded or failed
}
}
// Call the scraper
exports.get_booking = async (req, res, next) => {
const scraperData = await scraper();
res.json(scraperData)
}
This makes the controller an async function that awaits the scraper and sends the scraped data back as JSON.
I'm running puppeteer on express/node/ubuntu as follows:
var puppeteer = require('puppeteer');
var express = require('express');
var router = express.Router();
/* GET home page. */
router.get('/', function(req, res, next) {
(async () => {
headless = true;
const browser = await puppeteer.launch({headless: true, args:['--no-sandbox']});
const page = await browser.newPage();
url = req.query.url;
await page.goto(url);
let bodyHTML = await page.evaluate(() => document.body.innerHTML);
res.send(bodyHTML)
await browser.close();
})();
});
Running this script multiple times leaves hundreds of zombie Chrome processes:
$ pgrep chrome | wc -l
133
which clogs the server.
How do I fix this? Would running kill from the Express JS script solve it?
Is there a better way to get the same result other than puppeteer and headless chrome?
Ahhh! This is a simple oversight: what if an error occurs and your await browser.close() never executes, leaving you with zombies?
Using shell.js seems to be a hacky way of solving this issue.
The better practice is to use try..catch..finally. The reason being you would want the browser to be closed irrespective of a happy flow or an error being thrown.
And unlike the other code snippet, you don't have to try to close the browser in both the catch block and the finally block. The finally block is always executed, whether or not an error is thrown.
So, your code should look like,
const puppeteer = require('puppeteer');
const express = require('express');
const router = express.Router();
/* GET home page. */
router.get('/', function(req, res, next) {
(async () => {
const browser = await puppeteer.launch({
headless: true,
args: ['--no-sandbox'],
});
try {
const page = await browser.newPage();
url = req.query.url;
await page.goto(url);
const bodyHTML = await page.evaluate(() => document.body.innerHTML);
res.send(bodyHTML);
} catch (e) {
console.log(e);
} finally {
await browser.close();
}
})();
});
Hope this helps!
Wrap your code in try...catch...finally like this and see if it helps:
headless = true;
const browser = await puppeteer.launch({headless: true, args:['--no-sandbox']});
try {
const page = await browser.newPage();
url = req.query.url;
await page.goto(url);
let bodyHTML = await page.evaluate(() => document.body.innerHTML);
res.send(bodyHTML);
await browser.close();
} catch (error) {
console.log(error);
} finally {
await browser.close();
}
In my experience, the browser process may take some time to exit after close is called. In any case, you can check the browser's process property to see whether it is still running and force-kill it:
if (browser && browser.process() != null) browser.process().kill('SIGINT');
I'm also posting the full code of my Puppeteer resource manager below. Take a look at the bw.on('disconnected', ...) handler:
const puppeteer = require('puppeteer-extra')
const randomUseragent = require('random-useragent');
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
const USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36';
puppeteer.use(StealthPlugin())
function ResourceManager(loadImages) {
let browser = null;
const _this = this;
let retries = 0;
let isReleased = false;
this.init = async () => {
isReleased = false;
retries = 0;
browser = await runBrowser();
};
this.release = async () => {
isReleased = true;
if (browser) await browser.close();
}
this.createPage = async (url) => {
if (!browser) browser = await runBrowser();
return await createPage(browser,url);
}
async function runBrowser () {
const bw = await puppeteer.launch({
headless: true,
devtools: false,
ignoreHTTPSErrors: true,
slowMo: 0,
args: ['--disable-gpu','--no-sandbox','--no-zygote','--disable-setuid-sandbox','--disable-accelerated-2d-canvas','--disable-dev-shm-usage', "--proxy-server='direct://'", "--proxy-bypass-list=*"]
});
bw.on('disconnected', async () => {
if (isReleased) return;
console.log("BROWSER CRASH");
if (retries <= 3) {
retries += 1;
if (browser && browser.process() != null) browser.process().kill('SIGINT');
await _this.init();
} else {
throw "===================== BROWSER crashed more than 3 times";
}
});
return bw;
}
async function createPage (browser,url) {
const userAgent = randomUseragent.getRandom();
const UA = userAgent || USER_AGENT;
const page = await browser.newPage();
await page.setViewport({
width: 1920 + Math.floor(Math.random() * 100),
height: 3000 + Math.floor(Math.random() * 100),
deviceScaleFactor: 1,
hasTouch: false,
isLandscape: false,
isMobile: false,
});
await page.setUserAgent(UA);
await page.setJavaScriptEnabled(true);
await page.setDefaultNavigationTimeout(0);
if (!loadImages) {
await page.setRequestInterception(true);
page.on('request', (req) => {
if(req.resourceType() == 'stylesheet' || req.resourceType() == 'font' || req.resourceType() == 'image'){
req.abort();
} else {
req.continue();
}
});
}
await page.evaluateOnNewDocument(() => {
//pass webdriver check
Object.defineProperty(navigator, 'webdriver', {
get: () => false,
});
});
await page.evaluateOnNewDocument(() => {
//pass chrome check
window.chrome = {
runtime: {},
// etc.
};
});
await page.evaluateOnNewDocument(() => {
//pass plugins check
const originalQuery = window.navigator.permissions.query;
return window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications' ?
Promise.resolve({ state: Notification.permission }) :
originalQuery(parameters)
);
});
await page.evaluateOnNewDocument(() => {
// Overwrite the `plugins` property to use a custom getter.
Object.defineProperty(navigator, 'plugins', {
// This just needs to have `length > 0` for the current test,
// but we could mock the plugins too if necessary.
get: () => [1, 2, 3, 4, 5],
});
});
await page.evaluateOnNewDocument(() => {
// Overwrite the `plugins` property to use a custom getter.
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en'],
});
});
await page.goto(url, { waitUntil: 'networkidle2',timeout: 0 } );
return page;
}
}
module.exports = {ResourceManager}
I solve it with https://www.npmjs.com/package/shelljs
var shell = require('shelljs');
shell.exec('pkill chrome')
Try closing the browser before sending the response:
var puppeteer = require('puppeteer');
var express = require('express');
var router = express.Router();
router.get('/', function(req, res, next) {
(async () => {
headless = true;
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
url = req.query.url;
await page.goto(url);
let bodyHTML = await page.evaluate(() => document.body.innerHTML);
await browser.close();
res.send(bodyHTML);
})();
});
I ran into the same issue and while your shelljs solution did work, it kills all chrome processes, which might interrupt one that is still processing a request. Here is a better solution that should work.
var puppeteer = require('puppeteer');
var express = require('express');
var router = express.Router();
router.get('/', function (req, res, next) {
(async () => {
await puppeteer.launch({ headless: true }).then(async browser => {
const page = await browser.newPage();
url = req.query.url;
await page.goto(url);
let bodyHTML = await page.evaluate(() => document.body.innerHTML);
await browser.close();
res.send(bodyHTML);
});
})();
});
Use
(await browser).close()
This happens because the browser variable holds a promise that you have to resolve first. I suffered a lot because of this; I hope it helps.
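In other words, a small sketch of the situation being described (assuming the launch call was never awaited):
// inside an async function
const browser = puppeteer.launch({ headless: true }); // no await, so `browser` is a Promise
// browser.close();        // TypeError: browser.close is not a function
(await browser).close();   // resolve the promise first, then close the real browser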
I use the following basic setup for running Puppeteer:
const puppeteer = require("puppeteer");
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
/* use the page */
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Here, the finally block guarantees the browser will close correctly regardless of whether an error was thrown. Errors are logged (if desired). I like .catch and .finally as chained calls because the mainline Puppeteer code is one level flatter, but this accomplishes the same thing:
const puppeteer = require("puppeteer");
(async () => {
let browser;
try {
browser = await puppeteer.launch();
const [page] = await browser.pages();
/* use the page */
}
catch (err) {
console.error(err);
}
finally {
await browser?.close();
}
})();
There's no reason to call newPage because Puppeteer starts with a page open.
As for Express, you need only place the entire code above, including let browser; and excluding require("puppeteer"), into your route, and you're good to go, although you might want to use an async middleware error handler.
You ask:
Is there a better way to get the same result other than puppeteer and headless chrome?
That depends on what you're doing and what you mean by "better". If your goal is to get document.body.innerHTML and the page content you're interested in is baked into the static HTML, you can dump Puppeteer entirely and just make a request to get the resource, then use Cheerio to extract the desired information.
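For instance, a rough sketch of that lighter-weight route (assuming Node 18+ for the built-in fetch, and that Cheerio is installed):
const cheerio = require("cheerio");

// inside the route handler
const response = await fetch(req.query.url);
if (!response.ok) {
  throw Error(response.statusText);
}
const $ = cheerio.load(await response.text());
res.send($("body").html()); // or extract whichever fields you actually need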
Another consideration is that you may not need to load and close a whole browser per request. If you can use one new page per request, consider the following strategy:
const express = require("express");
const puppeteer = require("puppeteer");
const asyncHandler = fn => (req, res, next) =>
Promise.resolve(fn(req, res, next)).catch(next)
;
const browserReady = puppeteer.launch({
args: ["--no-sandbox", "--disable-setuid-sandbox"]
});
const app = express();
app
.set("port", process.env.PORT || 5000)
.get("/", asyncHandler(async (req, res) => {
const browser = await browserReady;
const page = await browser.newPage();
try {
await page.goto(req.query.url || "http://www.example.com");
return res.send(await page.content());
}
catch (err) {
return res.status(400).send(err.message);
}
finally {
await page.close();
}
}))
.use((err, req, res, next) => res.sendStatus(500))
.listen(app.get("port"), () =>
console.log("listening on port", app.get("port"))
)
;
Finally, make sure to never set any timeouts to 0 (for example, page.setDefaultNavigationTimeout(0);), which introduces the potential for the script to hang forever. If you need a generous timeout, at most set it to a few minutes--long enough not to trigger false positives.
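For example (two minutes here is just an arbitrary but generous bound):
page.setDefaultNavigationTimeout(2 * 60 * 1000); // 0 would disable the timeout and risk hanging forever
await page.goto(url, { waitUntil: "domcontentloaded", timeout: 2 * 60 * 1000 });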
See also:
Parallelism of Puppeteer with Express Router Node JS. How to pass page between routes while maintaining concurrency
Puppeteer unable to run on heroku