Scraping Amazon with Puppeteer - JavaScript

I'm currently working on some personal projects and I had the idea to do some Amazon scraping so I can get product details like the name and price.
I found that the most consistent view, one that uses the same IDs for the product name and price, was the mobile view, so that's why I'm using it.
The problem is that I can't get the price.
I've used the exact same kind of query selector for the price as for the name (which works), but with no success.
const puppeteer = require('puppeteer');

const url = 'https://www.amazon.com/dp/B01MUAGZ49';

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setViewport({ width: 360, height: 640 });
  await page.goto(url);

  let productData = await page.evaluate(() => {
    let productDetails = [];
    let elements = document.querySelectorAll('#a-page');
    elements.forEach(element => {
      let detailsJson = {};
      try {
        detailsJson.name = element.querySelector('h1#title').innerText;
        detailsJson.price = element.querySelector('#newBuyBoxPrice').innerText;
      } catch (exception) {}
      productDetails.push(detailsJson);
    });
    return productDetails;
  });

  console.dir(productData);
})();
I expect to get both the name and the price from console.dir, but right now I only get:
[ { name: 'Nintendo Switch – Neon Red and Neon Blue Joy-Con ' } ]

Just setting the viewport's width and height is not enough to fully simulate a mobile browser. Right now the page assumes that you just have a very small browser window.
The easiest way to simulate a mobile device is to use the function page.emulate together with the default DeviceDescriptors, which contain information about a large number of mobile devices.
Quote from the docs for page.emulate:
Emulates given device metrics and user agent. This method is a shortcut for calling two methods:
page.setUserAgent(userAgent)
page.setViewport(viewport)
To aid emulation, puppeteer provides a list of device descriptors which can be obtained via the require('puppeteer/DeviceDescriptors') command. [...]
Example
Here is an example of how to simulate an iPhone when visiting the page.
const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const iPhone = devices['iPhone 6'];

const url = '...';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto(url);

  // Simplified page.evaluate
  let productData = await page.evaluate(() => ({
    name: document.querySelector('#a-page h1#title').innerText,
    price: document.querySelector('#a-page #newBuyBoxPrice').innerText
  }));

  console.dir(productData);
})();
I also simplified your page.evaluate a little, but of course you can also use your original code after the page.goto. This returned the name and the price of the product for me.
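(Note: depending on your Puppeteer version, the puppeteer/DeviceDescriptors path may no longer exist; in more recent releases the descriptors are exposed on the main module instead. Treat the exact property name as an assumption to verify against your installed version:)

const puppeteer = require('puppeteer');
// In newer releases the descriptors ship with the main export
// (verify the property name for your version):
const iPhone = puppeteer.devices['iPhone 6'];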

Related

I get this error when running my index.js file... throw new Error('Execution context was destroyed, most likely because of a navigation.');

I've provided the code below; can you tell me why I would get this error? I am trying to web-scrape some information from one website to put on a website I am creating, and I already have permission to do so. The information I am trying to scrape is the name of the event, the time of the event, the location of the event, and the description of the event... I saw a tutorial on YouTube, but for some reason I get this error running mine.
const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.goto(url);
  const [el] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/h3/a');
  const txt = await el.getProperty('textContent');
  const rawTxt = await txt.jsonValue();
  const [el1] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/time[1]/span[2]');
  const txt1 = await el1.getProperty('textContent');
  const rawTxt1 = await txt1.jsonValue();
  console.log({ rawTxt, rawTxt1 });
  browser.close();
}

scrapeProduct('https://events.ucf.edu');
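For what it's worth, a likely cause (offered as an educated guess, not a confirmed answer): page.goto(url) is not awaited, so the page.$x calls run while the page is still navigating, which destroys their execution context. A minimal sketch of the fix, keeping everything else unchanged:

async function scrapeProduct(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Await the navigation so later queries don't race against it.
  await page.goto(url);
  const [el] = await page.$x('//*[@id="calendar-events-day"]/ul/li[1]/h3/a');
  const rawTxt = await (await el.getProperty('textContent')).jsonValue();
  console.log(rawTxt);
  await browser.close();
}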

Take screenshots of different elements with specific names in Puppeteer

I am trying to take screenshots of each section in a landing page, which may contain multiple sections. I was able to do that effectively in "Round1", which I commented out.
My goal is to learn how to write leaner/cleaner code, so I made another attempt, "Round2".
In this version it does take a screenshot, but it takes a screenshot of section 3 with the file name JSHandle#node.png. I am definitely doing this wrong.
Round1 (works perfectly)
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.somelandingpage.com');

  // const elOne = await page.$('.section-one');
  // await elOne.screenshot({path: './public/SectionOne.png'})
  // takes a screenshot SectionOne.png

  // const elTwo = await page.$('.section-two')
  // await elTwo.screenshot({path: './public/SectionTwo.png'})
  // takes a screenshot SectionTwo.png

  // const elThree = await page.$('.section-three')
  // await elThree.screenshot({path: './public/SectionThree.png'})
  // takes a screenshot SectionThree.png
Round2
I created an array that holds all the variables and tried to loop through them.
  const elOne = await page.$('.section-one');
  const elTwo = await page.$('.section-two');
  const elThree = await page.$('.section-three');

  let lpElements = [elOne, elTwo, elThree];

  for (var i = 0; i < lpElements.length; i++) {
    await lpElements[i].screenshot({path: './public/' + lpElements[i] + '.png'});
  }

  await browser.close();
})();
This takes a screenshot of section-three only, and with the wrong file name (JSHandle#node.png). There are no error messages in the console.
How can I reproduce Round1 by modifying the Round2 code?
Your array contains only Puppeteer element handle objects, and .toString() gets called on them when they are concatenated into the path string; that string is JSHandle#node, which is why every screenshot is written to the same file.
A clean way to do this is to use an array of objects, each of which has a selector and its name. Then, when you run your loop, you have access to both name and selector.
const puppeteer = require('puppeteer');

const content = `
<div class="section-one">foo</div>
<div class="section-two">bar</div>
<div class="section-three">baz</div>
`;

const elementsToScreenshot = [
  {selector: '.section-one', name: 'SectionOne'},
  {selector: '.section-two', name: 'SectionTwo'},
  {selector: '.section-three', name: 'SectionThree'},
];

const getPath = name => `./public/${name}.png`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(content);

  for (const {selector, name} of elementsToScreenshot) {
    const el = await page.$(selector);
    await el.screenshot({path: getPath(name)});
  }
})()
  .catch(err => console.error(err))
  .finally(async () => await browser.close());
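As a variation on the same idea (just a sketch), the file name could also be derived from the selector, so there is only one list to maintain. The toName helper is hypothetical and only handles simple class selectors; the loop is a drop-in replacement for the one inside the async IIFE above:

// Hypothetical helper: '.section-one' -> 'section-one'
const toName = selector => selector.replace(/^\./, '');

for (const selector of ['.section-one', '.section-two', '.section-three']) {
  const el = await page.$(selector);
  await el.screenshot({path: getPath(toName(selector))});
}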

How to scrape image src URLs using Node.js and Puppeteer

I want to scrape an image from a Wikipedia page, but the problem is that I am getting 3 URLs of the same image at a time, and those three URLs are in the same img tag. I just want the src URL. Does anybody know how to do it?
const puppeteer = require('puppeteer');
const sleep = require('sleep');

(async () => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto("https://www.wikipedia.org/");
  const xpathselector = `//span[contains(text(), "Commons")]`;
  const commonlinks = await page.waitForXPath(xpathselector);
  await page.waitFor(3000);
  await commonlinks.click();
  await page.waitFor(2000);
  //await page.waitForSelector()
  const images = await page.$eval('a[class="image"] > img[src]', node => node.innerHTML);
  console.log(images);
})();

//*[@id="mainpage-potd"]/div[1]/a/img
I bet that you "see" three URLs because you are looking at the srcset, which contains multiple URLs for different screen resolutions. You could return the src property instead:
const images = await page.$eval('a[class="image"] > img[src]', node => node.src);
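And if you need the src of every matching image rather than just the first, page.$$eval maps over all matches (same selector as in the question):

const srcs = await page.$$eval('a[class="image"] > img', nodes => nodes.map(n => n.src));
console.log(srcs);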

Get transferred/resource size and DOMContentLoaded/load time from DevTools network metrics

I am building a web application named UXChecker which works similarly to website.grader.com but with a number of differences.
After the user sends a URL from the frontend via an API to the backend, I want to crawl this URL and do a UX measurement. I will be checking buttons, SEO, responsiveness, etc., as well as checking network metrics for the total size of the page (MB) and how fast it loaded. I decided to do it in Node.js using Puppeteer.
Is there a way to get "total page size" (resources size) and "load time" (or at least DOMContentLoaded) metrics?
If not, could you recommend a different library or code snippets, please?
So far I have tried:
const perfEntries = JSON.parse(
  await page.evaluate(() => JSON.stringify(performance.getEntries()))
);
I summed all the transferSize values, but the number is not even close to the numbers in the DevTools Network tab.
After that I tried:
let performanceMetrics = await page._client.send('Performance.getMetrics');
which gave me "DOMContentLoaded" among the other metrics. But this number was different from the value in the Network tab, and I wasn't able to get the totalSize (resources size) of the page.
My Puppeteer code (you can test by running node example.js):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const context = await browser.createIncognitoBrowserContext();
  const page = await context.newPage();
  await page.setCacheEnabled(false);

  // setting width and height
  await page.setViewport({
    width: 1200,
    height: 780,
    deviceScaleFactor: 1,
  });

  // going to URL
  await page.goto("http://www.cupcakeipsum.com/", { waitUntil: 'networkidle2' });

  // performance entries available
  const perfEntries = JSON.parse(
    await page.evaluate(() => JSON.stringify(performance.getEntries()))
  );

  // performance test
  // total size of the website:
  const totalSize = () => {
    let totalSize = 0;
    perfEntries.forEach(entry => {
      if (entry.transferSize > 0) {
        totalSize += entry.transferSize;
      }
    });
    return totalSize;
  };

  console.log("==== Total size ====");
  console.log(totalSize());

  // performance metrics
  console.log("==== Devtools: Performance.getMetrics ====");
  let performanceMetrics = await page._client.send('Performance.getMetrics');
  console.log(performanceMetrics.metrics);

  await browser.close();
})();
Is there a way to measure these metrics? I will be thankful for any help and tips.
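One approach, sketched here under stated assumptions rather than as a verified answer: open a CDP session and sum the encodedDataLength reported by Network.loadingFinished events, which counts bytes on the wire (what the Network tab's "transferred" column shows), and read DOMContentLoaded/load times from the navigation timing entry:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Track bytes actually transferred over the network via the CDP Network domain.
  const client = await page.target().createCDPSession();
  await client.send('Network.enable');
  let transferred = 0;
  client.on('Network.loadingFinished', event => {
    transferred += event.encodedDataLength;
  });

  await page.goto('http://www.cupcakeipsum.com/', { waitUntil: 'networkidle2' });

  // Navigation timing reports DOMContentLoaded/load in ms since navigation start.
  const timing = await page.evaluate(() => {
    const [nav] = performance.getEntriesByType('navigation');
    return {
      domContentLoaded: nav.domContentLoadedEventEnd,
      load: nav.loadEventEnd,
    };
  });

  console.log(`Transferred: ${(transferred / 1024 / 1024).toFixed(2)} MB`);
  console.log('Timing (ms):', timing);

  await browser.close();
})();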

jQuery: .find() locates certain elements, but not others

I am trying to scrape Glassdoor company reviews as an exercise, and I've attempted to learn JavaScript and jQuery to do it with Puppeteer. In my script, I try to output to the console both the
summary of the review, and the
date of the review.
(Pictures 1 and 2 show the HTML positions of the summary and date.)
However, for some reason only the summaries get printed to the console, whereas the dates do not. Do I have a mistake in my code?
const puppeteer = require("puppeteer");
const cheerio = require('cheerio');

// puppeteer usage as normal
puppeteer.launch({ headless: false }).then(async browser => {
  const page = await browser.newPage();
  const navigationPromise = page.waitForNavigation();
  await page.setViewport({ width: 1440, height: 794 });
  await page.goto('https://www.glassdoor.com/Reviews/Grubhub-Reviews-E419089.htm');
  await navigationPromise;

  var data = [];
  const html = await page.content();
  const $ = cheerio.load(html);

  $(".hreview").each(function() {
    console.log("\nMain scraping function happening...");
    // This works
    console.log($(this).find("span.summary").text());
    // This does not work
    console.log($(this).find("time.date").text());
  });

  await browser.close();
});
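A hedged suggestion rather than a confirmed answer (I haven't inspected Glassdoor's current markup): if the dates are rendered client-side after the initial HTML arrives, page.content() may snapshot the page before they exist, while the summaries are already in the server-rendered HTML. Waiting for the element first rules that out; this line would go just before const html = await page.content();:

// Wait until at least one date element is attached before snapshotting the HTML.
// 'time.date' is the selector from the question; adjust if the markup differs.
await page.waitForSelector('time.date');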
