How to manage session data in puppeteer web scraping

I am trying to scrape data from the website immobilienscout24.de using Puppeteer. I think I need to keep session data in order to navigate between pages on the site. My code is below. Sometimes pages fail to load and my requests are detected as robot requests.
Please look at the code and help me with session management when web scraping with Puppeteer.
const puppeteer = require('puppeteer-extra')
const storage = require('node-persist');
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())
const cheerio = require('cheerio')
const pretty = require("pretty");
puppeteer.launch({
    headless: false,
    args: ["--disable-setuid-sandbox"],
    'ignoreHTTPSErrors': true,
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    userDataDir: '/Users/username/Library/Application Support/Google/Chrome/Default'
}).then(async browser => {
    const page = await browser.newPage()
    const baseURL = 'https://www.immobilienscout24.de'
    for(var p=1; p <= 10; p++) {
        await page.goto("https://www.immobilienscout24.de/Suche/de/neubauwohnung-mieten?pagenumber="+p,{
            waitUntil: "load"
        })
        const client = await page.target().createCDPSession();
        const cookies = (await client.send('Network.getAllCookies')).cookies;
        await page.setCookie(...cookies);
        const localStorage = await page.evaluate(() => Object.assign({}, window.localStorage))
        const html = await page.content();
        const $ = cheerio.load(html);
        const tiles = $('.result-list__listing');
        tiles.map( async (i, item) => {
            let link = $(item).find('a.result-list-entry__brand-title-container').attr('href');
            if (link.includes("expose")) {
                link = baseURL+link
            }
            console.log(link)
        });
        await page.waitForTimeout(10000)
    }
    await browser.close()
})

You're making 10 requests at the same time, because you're using a traditional loop:
for(var p=1; p <= 10; p++)
So the website probably has something like a rate limit to prevent DDoS attacks, which is why you're detected as a bot.
With ES6, you can make the 10 requests in sequence like this:
for (let p of [...Array(10).keys()]) {
    // execute your request here
}
Hope it helps!
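For what it's worth, `await` inside either loop form already makes each iteration wait for the previous one, so the gain usually comes from pacing the requests rather than from the loop syntax. Below is a minimal sketch of paced, sequential visits. The names `visitInSequence`, `sleep`, and the `visit` callback are hypothetical, and `visit` stands in for whatever navigates to and scrapes one page.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visits the given page numbers strictly one after another and pauses
// between visits. `visit` is a placeholder for the real work, for example
// (p) => page.goto(searchUrl + p, { waitUntil: 'load' }).
async function visitInSequence(pageNumbers, visit, pauseMs = 0) {
  const results = [];
  for (const p of pageNumbers) {
    // The next iteration starts only after this visit has settled.
    results.push(await visit(p));
    if (pauseMs > 0) {
      await sleep(pauseMs);
    }
  }
  return results;
}
```

It could be called as `await visitInSequence([1, 2, 3], scrapeOnePage, 10000)`, where `scrapeOnePage` is your own function. Note that `[...Array(10).keys()]` yields 0 through 9, so add 1 if the site's page numbers start at 1.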

Related

I get this error when running my index.js file... throw new Error('Execution context was destroyed, most likely because of a navigation.');

I've provided the code below. Can you tell me why I get this error? I am trying to scrape some information from one website to put on a website I am creating; I already have permission to do so. The information I am trying to scrape is the name, time, location, and description of each event. I saw this tutorial on YouTube, but for some reason I get this error when running mine.
async function scrapeProduct(url){
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    page.goto(url);
    const [el] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/h3/a');
    const txt = await el.getProperty('textContent')
    const rawTxt = await txt.jsonValue();
    const [el1] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/time[1]/span[2]');
    const txt1 = await el1.getProperty('textContent')
    const rawTxt1 = await txt1.jsonValue();
    console.log({rawTxt, rawTxt1});
    browser.close();
}
scrapeProduct('https://events.ucf.edu');
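A likely cause, assuming the script is otherwise as shown, is that `page.goto(url)` is called without `await`, so the element lookups start while the page is still navigating. The sketch below waits for navigation and for the element before reading it. `readText` is a hypothetical helper, and `page` is expected to behave like a Puppeteer Page. The CSS selector in the usage note is an untested guess derived from the XPath in the question.

```javascript
// Sketch: read the trimmed text of the first element matching `selector`,
// making sure navigation has finished before touching the DOM.
async function readText(page, url, selector) {
  // Wait for the navigation to complete instead of firing and forgetting.
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  // Wait until the element actually exists in the page.
  await page.waitForSelector(selector);
  // Evaluate in the page context and return a plain string.
  return page.$eval(selector, (el) => el.textContent.trim());
}
```

For example: `const title = await readText(page, 'https://events.ucf.edu', '#calendar-events-day ul li h3 a')`.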

How to set cookies in a variable and use it again: PUPPETEER

I'm trying to store ALL browser cookies in a variable and then use them again later.
My attempt:
const puppeteer = require('puppeteer')
const fs = require('fs').promises;
(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
    });
    const page = await browser.newPage();
    await page.goto('https://www.google.com/');
Below is my code to get the cookies. It used to work, but now it prints an error message.
    const client = await page.target().createCDPSession();
    const all_browser_cookies = (await client.send('Network.getAllCookies')).cookies;
    const current_url_cookies = await page.cookies();
    var third_party_cookies = all_browser_cookies.filter(cookie => cookie.domain !== current_url_cookies[0].domain);
And below is the second page (which will use the cookies):
    (async () => {
        const browser2 = await puppeteer.launch({
        });
        const url = 'https://www.google.com/';
        const page2 = await browser2.newPage();
        try{
            await page2.setCookie(...third_party_cookies);
            await page2.goto(url);
        }catch(e){
            console.log(e);
        }
        await browser2.close()
    })();
})();
It worked until yesterday, but today this error message appears:
Error: Protocol error (Network.setCookies): Invalid parameters Failed to deserialize params.cookies.expires - BINDINGS: double value expected at position 662891
Does anyone know what it is?

Ok, the solution is here: https://github.com/puppeteer/puppeteer/issues/7029
You are saving an array of cookies and page.setCookie only works with one cookie. You have to iterate through your array like this:
for (let i = 0; i < third_party_cookies.length; i++) {
    await page2.setCookie(third_party_cookies[i]);
}

@julio-codesal's suggestion is OK, but according to the Puppeteer documentation it is fine to use page.setCookie with an array, as long as you spread it, e.g.:
await page.setCookie(...arr)
Source: https://devdocs.io/puppeteer/index#pagesetcookiecookies
I'm not exactly sure what it is, but the error the OP has is something else.
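The exact cause of the error is not established in this thread. One hedged guess is that cookies returned by `Network.getAllCookies` carry extra fields, or an `expires` value that is not a plain number, which `page.setCookie` then rejects. The sketch below copies only a conservative set of fields and drops `expires` when it is not a finite number, which turns that cookie into a session cookie. `sanitizeCookie` and `ALLOWED_FIELDS` are hypothetical names, and the field list is an assumption to check against the Puppeteer documentation for your version.

```javascript
// Assumption: a conservative subset of the fields page.setCookie accepts.
const ALLOWED_FIELDS = [
  'name', 'value', 'url', 'domain', 'path',
  'expires', 'httpOnly', 'secure', 'sameSite',
];

// Returns a copy of `cookie` limited to ALLOWED_FIELDS.
function sanitizeCookie(cookie) {
  const clean = {};
  for (const key of ALLOWED_FIELDS) {
    if (cookie[key] !== undefined && cookie[key] !== null) {
      clean[key] = cookie[key];
    }
  }
  // Assumption: a non-numeric `expires` is what triggers
  // "double value expected", so drop it rather than pass it on.
  if (typeof clean.expires !== 'number' || !Number.isFinite(clean.expires)) {
    delete clean.expires;
  }
  return clean;
}
```

It would be used as `await page2.setCookie(...third_party_cookies.map(sanitizeCookie))`.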

Take screenshots of different elements with specific names in Puppeteer

I am trying to take screenshots of each section of a landing page, which may contain multiple sections. I was able to do that effectively in "Round1", which I have since commented out.
My goal is to learn how to write leaner/cleaner code, so I made another attempt, "Round2".
This attempt does take a screenshot, but it is a screenshot of section 3 with the file name JSHandle@node.png. Clearly, I am doing something wrong.
Round1 (works perfectly)
const puppeteer = require('puppeteer');
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.somelandingpage.com');
    // const elOne = await page.$('.section-one');
    // await elOne.screenshot({path: './public/SectionOne.png'})
    // takes a screenshot SectionOne.png
    // const elTwo = await page.$('.section-two')
    // await elTwo.screenshot({path: './public/SectionTwo.png'})
    // takes a screenshot SectionTwo.png
    // const elThree = await page.$('.section-three')
    // await elThree.screenshot({path: './public/SectionThree.png'})
    // takes a screenshot SectionThree.png
Round2
I created an array that holds all the variables and tried to loop through them.
    const elOne = await page.$('.section-one');
    const elTwo = await page.$('.section-two')
    const elThree = await page.$('.section-three')
    let lpElements = [elOne, elTwo, elThree];
    for(var i=0; i<lpElements.length; i++){
        await lpElements[i].screenshot({path: './public/'+lpElements[i] + '.png'})
    }
    await browser.close();
})();
This takes a screenshot of section-three only, and with the wrong file name (JSHandle@node.png). There are no error messages in the console.
How can I reproduce Round1 by modifying the Round2 code?

Your array contains only Puppeteer element handle objects, which get .toString() called on them when you concatenate them into the path. Every handle produces the same string, so each screenshot overwrites the previous one.
A clean way to do this is to use an array of objects, each of which has a selector and its name. Then, when you run your loop, you have access to both the name and the selector.
const puppeteer = require('puppeteer');
const content = `
<div class="section-one">foo</div>
<div class="section-two">bar</div>
<div class="section-three">baz</div>
`;
const elementsToScreenshot = [
    {selector: '.section-one', name: 'SectionOne'},
    {selector: '.section-two', name: 'SectionTwo'},
    {selector: '.section-three', name: 'SectionThree'},
];
const getPath = name => `./public/${name}.png`;
let browser;
(async () => {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.setContent(content);
    for (const {selector, name} of elementsToScreenshot) {
        const el = await page.$(selector);
        await el.screenshot({path: getPath(name)});
    }
})()
    .catch(err => console.error(err))
    .finally(async () => await browser.close())
;

Initializing a Puppeteer Browser Outside of Scraping Function

I am very new to puppeteer (I started today). I have some code that is working the way that I want it to except for an issue that I think is making it extremely inefficient. I have a function that links me through potentially thousands of urls that have incremental IDs to pull the name, position, and stats of each player and then inserts that data into a neDB database. Here is my code:
const puppeteer = require('puppeteer');
const Datastore = require('nedb');
const database = new Datastore('database.db');
database.loadDatabase();
async function scrapeProduct(url, id){
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
let attributes = [];
const [name] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_name"]');
const txt = await name.getProperty('innerText');
const playerName = await txt.jsonValue();
attributes.push(playerName);
//Make sure that there is a legitimate player profile before trying to pull a bunch of 'undefined' information.
if(playerName){
const [role] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_position"]');
const roleTxt = await role.getProperty('innerText');
const playerRole = await roleTxt.jsonValue();
attributes.push(playerRole);
//Loop through the 12 attributes and pull their values.
for(let i = 1; i < 13; i++){
let vLink = '//*[@id="ctl00_ctl00_ctl00_Main_Main_SectionTabBox"]/div/div/div/div[1]/table/tbody/tr['+i+']/td[2]';
const [e1] = await page.$x(vLink);
const val = await e1.getProperty('innerText');
const skillVal = await val.jsonValue();
attributes.push(skillVal);
}
//Create a player profile to be pushed into the database. (I realize this is very wordy and ugly code)
let player = {
Name: attributes[0],
Role: attributes[1],
Athleticism: attributes[2],
Speed: attributes[3],
Durability: attributes[4],
Work_Ethic: attributes[5],
Stamina: attributes[6],
Strength: attributes[7],
Blocking: attributes[8],
Tackling: attributes[9],
Hands: attributes[10],
Game_Instinct: attributes[11],
Elusiveness: attributes[12],
Technique: attributes[13],
_id: id,
};
database.insert(player);
console.log('player #' + id + " scraped.");
await browser.close();
} else {
console.log("Blank profile");
await browser.close();
}
}
//Making sure the first URL is scraped before moving on to the next URL. (i removed the URL because its unreasonably long and is not important for this part).
(async () => {
for(let i = 0; i <= 1000; i++){
let link = 'https://url.com/Ratings.aspx?rid='+i+'&section=Ratings';
await scrapeProduct(link, i);
}
})();
What I think is making this so inefficient is the fact that every time scrapeProduct() is called, I create a new browser and a new page. Instead, I believe it would be more efficient to create one browser and one page and just change the page's URL with
await page.goto(url)
I believe that to accomplish this, I need to move:
const browser = await puppeteer.launch();
const page = await browser.newPage();
outside of my scrapeProduct() function, but I cannot seem to get this to work. Any time I try, I get an error in my function saying that page is not defined. I am very new to Puppeteer (started today) and would appreciate any guidance on how to accomplish this. Thank you very much!
TL;DR
How do I create one Browser instance and one Page instance that a function can use repeatedly, changing only the URL passed to await page.goto(url)?

About a year ago I tried to make a React Native Pokemon Go helper app. Since there wasn't an API for Pokemon nests and Pokestops, I created a server that scraped thesilphroad.com, and I found the need to implement something like what @Arkan described.
I wanted the server to be able to handle multiple requests, so I decided to initialize the browser when the server boots up. When a request is received, the server checks whether MAX_TABS has been reached. If it has, the request waits; if not, a new tab is opened and the scrape is performed.
Here's the scraper.js:
const puppeteer = require ('puppeteer')
const fs = require('fs')
const Page = require('./Page')
const exec = require('child_process').exec
const execSync = require('util').promisify(exec)
module.exports = class scraper {
constructor(){
this.browser = null
this.getPages = null
this.getTotalPages = null
this.isRunning = false
//browser permissions
this.permissions = ['geolocation']
this.MAX_TABS = 5
//when puppeteer launches
this.useFirstTab = true
}
async init(config={}){
let headless = config.headless != undefined ? config.headless : true
this.permissions = this.permissions.concat(config.permissions || [])
//get local chromium location
let browserPath = await getBrowserPath('firefox') || await getBrowserPath('chrome')
this.browser = await puppeteer.launch({
headless:headless,
executablePath:browserPath,
defaultViewport:null,
args:[
'--start-maximized',
]
})
//keep the call bound to the browser instance
this.getPages = () => this.browser.pages()
this.getTotalPages = ()=>{
return this.getPages().then(pages=>pages.length).catch(err=>0)
}
this.isRunning = true
}
async waitForTab(){
let time = Date.now()
let cycle = 1
await new Promise(resolve=>{
let interval = setInterval(async()=>{
let totalPages = await this.getTotalPages()
if(totalPages < this.MAX_TABS){
clearInterval(interval)
resolve()
}
if(Date.now() - time > 100)
console.log('Waiting...')
if(Date.now() - time > 20*1000){
console.log('... ...\n'.repeat(cycle)+'Still waiting...')
cycle++
time = Date.now()
}
},500)
})
}
//open new tab and go to page
async openPage(url,waitSelector,lat,long){
await this.waitForTab()
let pg
//puppeteer launches with a blank tab, use this
// if(this.useFirstTab){
// let pages = await this.browser.pages()
// pg = pages.pop()
// this.useFirstTab = false
// }
// else
pg = await this.browser.newPage()
if(lat && long){
await this.setPermissions(url)
}
let page = await new Page()
await page.init(pg,url,waitSelector,lat,long)
return page
}
async setPermissions(url){
const context = this.browser.defaultBrowserContext();
await context.overridePermissions(url,this.permissions)
}
}
// assumes that the browser is in path
async function getBrowserPath(browserName){
return execSync('command -v chromium').then(({stdout,stderr})=>{
if(stdout.includes('not found'))
return null
return stdout.trim()
}).catch(err=>null)
}
The scraper imports Page.js, which is just a wrapper for a Puppeteer Page object that exposes the functions I used most:
const path = require('path')
const fs = require('fs')
const userAgents = require('./staticData/userAgents.json')
const cookiesPath = path.normalize('./cookies.json')
// a wrapper for a puppeteer page with pre-made functions
module.exports = class Page{
constructor(useCookies=false){
this.page = null
this.useCookies = useCookies
this.previousSession = this.useCookies && fs.existsSync(cookiesPath)
}
async close (){
await this.page.close()
}
async init(page,url,waitSelector,lat,long){
this.page = page
let userAgent = userAgents[Math.floor(Math.random()*userAgents.length)]
await this.page.setUserAgent(userAgent)
await this.restoredSession()
if(lat && long)
await this.page.setGeolocation({
latitude: lat || 59.95, longitude:long || 30.31667, accuracy:40
})
await this.page.goto(url)
await this.wait(waitSelector)
}
async screenshotElement(selector='body',directory='./screenshots',padding=0,offset={}) {
const rect = await this.page.evaluate(selector => {
const el = document.querySelector(selector)
const {x, y, width, height} = el.getBoundingClientRect()
return {
left: x,
top: y,
width,
height,
id: el.id
}
}, selector)
let ext = 'jpeg'
let filename = path.normalize(directory+'/'+Date.now())
return await this.page.screenshot({
type:ext,
path:filename+' - '+selector.substring(5)+'.'+ext,
clip: {
x: rect.left - padding+(offset.left || 0),
y: rect.top - padding+(offset.top || 0),
width: rect.width + padding * 2+(offset.width||0),
height: rect.height + padding * 2+ (offset.height||0)
},
encoding:'base64'
})
}
async restoredSession(){
if(!this.previousSession)
return false
let cookies = require(cookiesPath)
for(let cookie of cookies){
await this.page.setCookie(cookie)
}
console.log('Loaded previous session')
return true
}
async saveSession(){
//write cookie to file
if(!this.useCookies)
return
const cookies = await this.page.cookies()
fs.writeFileSync(cookiesPath,JSON.stringify(cookies,null,2))
console.log('Wrote cookies to file')
}
//wait for text input elment and type text
async type(selector,text,options={delay:150}){
await this.wait(selector)
await this.page.type(selector,text,options)
}
//click and waits
async click(clickSelector,waitSelector=500){
await this.page.click(clickSelector)
await this.wait(waitSelector)
}
//hovers over element and waits
async hover(selector,waitSelector=500){
await this.page.hover(selector)
await this.wait(1000)
await this.wait(waitSelector)
}
//waits and suppresses timeout errors
async wait(selector=500, waitForNav=false){
try{
//waitForNav is a puppeteer's waitForNavigation function
//which for me does nothing but timeouts after 30s
waitForNav && await this.page.waitForNavigation()
await this.page.waitFor(selector)
} catch (err){
//print everything but timeout errors
if(err.name != 'TimeoutError'){
console.log('error name:',err.name)
console.log(err)
console.log('- - - '.repeat(4))
}
this.close()
}
}
}
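The MAX_TABS check described above polls the tab count on a timer. An alternative sketch of the same idea is a small queue-based limiter that lets at most `max` scrapes run at once and hands a freed slot directly to the next waiting request. `TabLimiter` and `run` are hypothetical names and are not part of the scraper above.

```javascript
// Hypothetical limiter: at most `max` tasks run concurrently;
// additional callers wait in a FIFO queue until a slot is handed over.
class TabLimiter {
  constructor(max) {
    this.max = max;
    this.active = 0;
    this.waiting = [];
  }

  async run(task) {
    if (this.active >= this.max) {
      // No free slot: wait until a finishing task hands its slot over.
      await new Promise((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // pass the slot straight to the next waiter
      } else {
        this.active--; // nobody is waiting, so free the slot
      }
    }
  }
}
```

With a limiter created as `new TabLimiter(5)`, each request could be wrapped as `limiter.run(() => scrapeOne(url))`, where `scrapeOne` is your own function that opens a tab, scrapes, and closes the tab.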

To achieve this, you just need to separate the browser from your requests, for example in a class:
const puppeteer = require('puppeteer');
class PuppeteerScraper {
    async launch(options = {}) {
        this.browser = await puppeteer.launch(options);
        // you could reuse the page instance if it was defined here
    }
    /**
     * Pass the address and the function that will scrape your data,
     * in order to maintain the page inside this object
     */
    async goto(url, callback) {
        const page = await this.browser.newPage();
        await page.goto(url);
        /** evaluate its content */
        await callback(page);
        await page.close();
    }
    async close() {
        await this.browser.close();
    }
}
And to use it:
/**
 * scrape function, takes the page instance as its parameter
 */
async function evaluate_page(page) {
    const titles = await page.$$eval('.col-xs-6 .star-rating ~ h3 a', (itens) => {
        const text_titles = [];
        for (const item of itens) {
            if (item && item.textContent) {
                text_titles.push(item.textContent);
            }
        }
        return text_titles;
    });
    console.log('titles', titles);
}
(async () => {
    const scraper = new PuppeteerScraper();
    await scraper.launch({ headless: false });
    for (let i = 1; i <= 6; i++) {
        let link = `https://books.toscrape.com/catalogue/page-${i}.html`;
        await scraper.goto(link, evaluate_page);
    }
    await scraper.close();
})();
If you want something more complex, you could take a look at how the Apify project does it.
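The comment in launch() hints that the page could be reused as well. A minimal sketch of that variant, with one browser and one page shared by every URL, might look like this. `SinglePageScraper` is a hypothetical name, and `launcher` is assumed to be anything with a Puppeteer-style `launch()` method, such as the object returned by `require('puppeteer')`.

```javascript
// Sketch: one browser and one page, reused for every URL.
class SinglePageScraper {
  constructor(launcher) {
    // e.g. require('puppeteer'); injected so it can be swapped out
    this.launcher = launcher;
  }

  async launch(options = {}) {
    this.browser = await this.launcher.launch(options);
    this.page = await this.browser.newPage(); // created once, reused below
  }

  // Navigates the shared page and passes it to the scraping callback.
  async scrape(url, callback) {
    await this.page.goto(url);
    return callback(this.page);
  }

  async close() {
    await this.browser.close();
  }
}
```

Because the page is shared, calls to `scrape` have to be awaited one at a time. Running them in parallel would make the navigations interfere with each other.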

Scrape multiple websites using Puppeteer

I am trying to scrape just two elements, but from more than one website (in this case the PS Store), and to do it in the easiest way possible. Since I'm a rookie in JS, please be gentle ;) My script is below. I tried to make it work with a for loop, but with no effect (it still only gets the first website from the array). Thanks a lot for any help.
const puppeteer = require("puppeteer");
async function scrapeProduct(url) {
    const urls = [
        "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
        "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
    ];
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    for (i = 0; i < urls.length; i++) {
        const url = urls[i];
        const promise = page.waitForNavigation({ waitUntil: "networkidle" });
        await page.goto(`${url}`);
        await promise;
    }
    const [el] = await page.$x(
        "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
    );
    const txt = await el.getProperty("textContent");
    const title = await txt.jsonValue();
    const [el2] = await page.$x(
        "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
    );
    const txt2 = await el2.getProperty("textContent");
    const price = await txt2.jsonValue();
    console.log({ title, price });
    browser.close();
}
scrapeProduct();

In general, your code is quite okay. A few things should be corrected, though:
const puppeteer = require("puppeteer");
async function scrapeProduct(url) {
    const urls = [
        "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
        "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
    ];
    const browser = await puppeteer.launch({
        headless: false
    });
    for (let i = 0; i < urls.length; i++) {
        const page = await browser.newPage();
        const url = urls[i];
        const promise = page.waitForNavigation({
            waitUntil: "networkidle2"
        });
        await page.goto(`${url}`);
        await promise;
        const [el] = await page.$x(
            "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
        );
        const txt = await el.getProperty("textContent");
        const title = await txt.jsonValue();
        const [el2] = await page.$x(
            "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
        );
        const txt2 = await el2.getProperty("textContent");
        const price = await txt2.jsonValue();
        console.log({
            title,
            price
        });
    }
    await browser.close();
}
scrapeProduct();
You open the web page inside the loop, which is correct, but then you look for the elements outside of the loop. You should do that within the loop as well.
For debugging, I suggest using { headless: false }. This allows you to see what actually happens in the browser.
I'm not sure which version of Puppeteer you are using, but there is no networkidle event in the latest version from npm. You should use networkidle0 or networkidle2 instead.
You are locating the elements via XPath (/html/body/div...). This might be subjective, but I think standard JS/CSS selectors are more readable: body > div .... But, well, if it works...
Code above yields the following in my case:
{ title: 'Days Gone™', price: '289,00 zl' }
{ title: 'Ghost of Tsushima', price: '289,00 zl' }
