I'm currently working on some personal projects, and I had the idea to do some Amazon scraping so I can get product details like the name and price.
I found that the most consistent view, i.e. the one that uses the same IDs for the product name and price, is the mobile view, so that's why I'm using it.
The problem is that I can't get the price.
I've used the exact same query selector approach for the price as for the name (which works), but with no success.
const puppeteer = require('puppeteer');

const url = 'https://www.amazon.com/dp/B01MUAGZ49';

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.setViewport({ width: 360, height: 640 });
    await page.goto(url);

    let productData = await page.evaluate(() => {
        let productDetails = [];
        let elements = document.querySelectorAll('#a-page');
        elements.forEach(element => {
            let detailsJson = {};
            try {
                detailsJson.name = element.querySelector('h1#title').innerText;
                detailsJson.price = element.querySelector('#newBuyBoxPrice').innerText;
            } catch (exception) {}
            productDetails.push(detailsJson);
        });
        return productDetails;
    });

    console.dir(productData);
})();
I expected to get both the name and the price from console.dir, but right now I only get:
[ { name: 'Nintendo Switch – Neon Red and Neon Blue Joy-Con ' } ]
Just setting the viewport's height and width is not enough to fully simulate a mobile browser. Right now the page simply assumes that you have a very small browser window.
The easiest way to simulate a mobile device is to use the function page.emulate together with the default DeviceDescriptors, which contain information about a large number of mobile devices.
Quote from the docs for page.emulate:
Emulates given device metrics and user agent. This method is a shortcut for calling two methods:
page.setUserAgent(userAgent)
page.setViewport(viewport)
To aid emulation, puppeteer provides a list of device descriptors which can be obtained via the require('puppeteer/DeviceDescriptors') command. [...]
Example
Here is an example of how to simulate an iPhone when visiting the page.
const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const iPhone = devices['iPhone 6'];

const url = '...';

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.emulate(iPhone);
    await page.goto(url);

    // Simplified page.evaluate
    let productData = await page.evaluate(() => ({
        name: document.querySelector('#a-page h1#title').innerText,
        price: document.querySelector('#a-page #newBuyBoxPrice').innerText
    }));

    console.dir(productData);
})();
I also simplified your page.evaluate a little, but you can of course also use your original code after the page.goto. This returned the name and the price of the product for me.
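Note: depending on your Puppeteer version, the device descriptors may no longer live under puppeteer/DeviceDescriptors; in more recent releases they are exposed directly on the main module. A minimal sketch, assuming a version that exposes puppeteer.devices (in very recent versions the equivalent export is KnownDevices):

const puppeteer = require('puppeteer');

// Same descriptor as above, but taken from the main module instead of a separate require.
const iPhone = puppeteer.devices['iPhone 6'];

const url = '...';

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.emulate(iPhone);   // sets both the user agent and the viewport
    await page.goto(url);
    // ... same page.evaluate as above
    await browser.close();
})();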
Related
I've provided the code below; can you tell me why I get this error? I am trying to web-scrape some information from one website to put on a website I am creating, and I already have permission to do so. The information I am trying to scrape is the name of the event, the time of the event, the location of the event, and the description of the event... I saw this tutorial on YouTube, but for some reason I get this error when running mine.
const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    page.goto(url);

    const [el] = await page.$x('//*[#id="calendar-events-day"]/ul/li[1]/h3/a');
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue();

    const [el1] = await page.$x('//*[#id="calendar-events-day"]/ul/li[1]/time[1]/span[2]');
    const txt1 = await el1.getProperty('textContent');
    const rawTxt1 = await txt1.jsonValue();

    console.log({rawTxt, rawTxt1});

    browser.close();
}

scrapeProduct('https://events.ucf.edu');
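Two things in the posted snippet stand out: page.goto is not awaited, so the XPath queries can run before the page has finished loading and come back with no element, and the XPaths use #id where Chrome's "Copy XPath" normally produces @id. A minimal sketch of a version that waits for the elements before reading them; the selectors themselves are assumptions carried over from the question:

const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);   // wait for the navigation to finish

    // Wait for each element instead of querying immediately; note @id, not #id.
    const el = await page.waitForXPath('//*[@id="calendar-events-day"]/ul/li[1]/h3/a');
    const rawTxt = await (await el.getProperty('textContent')).jsonValue();

    const el1 = await page.waitForXPath('//*[@id="calendar-events-day"]/ul/li[1]/time[1]/span[2]');
    const rawTxt1 = await (await el1.getProperty('textContent')).jsonValue();

    console.log({ rawTxt, rawTxt1 });
    await browser.close();
}

scrapeProduct('https://events.ucf.edu');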
I am trying to take screenshots of each section in a landing page, which may contain multiple sections. I was able to do that effectively in "Round1", which I commented out.
My goal is to learn how to write leaner/cleaner code, so I made another attempt, "Round2".
Round2 does take a screenshot, but only of section three, and with the file name JSHandle#node.png. I am definitely doing something wrong.
Round1 (works perfectly)
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.somelandingpage.com');

    // const elOne = await page.$('.section-one');
    // await elOne.screenshot({path: './public/SectionOne.png'})
    // takes a screenshot SectionOne.png

    // const elTwo = await page.$('.section-two')
    // await elTwo.screenshot({path: './public/SectionTwo.png'})
    // takes a screenshot SectionTwo.png

    // const elThree = await page.$('.section-three')
    // await elThree.screenshot({path: './public/SectionThree.png'})
    // takes a screenshot SectionThree.png
Round2
I created an array that holds all the variables and tried to loop through them.
    const elOne = await page.$('.section-one');
    const elTwo = await page.$('.section-two');
    const elThree = await page.$('.section-three');

    let lpElements = [elOne, elTwo, elThree];

    for (var i = 0; i < lpElements.length; i++) {
        await lpElements[i].screenshot({path: './public/' + lpElements[i] + '.png'});
    }

    await browser.close();
})();
This takes a screenshot of section-three only, but with the wrong file name (JSHandle#node.png). There are no error messages in the console.
How can I reproduce Round1 by modifying the Round2 code?
Your array contains only Puppeteer element handle objects, which get .toString() called on them when you concatenate them into the file path; that's where the JSHandle#node name comes from, and since every iteration produces the same path, each screenshot overwrites the previous one, leaving only the last section.
A clean way to do this is to use an array of objects, each of which has a selector and its name. Then, when you run your loop, you have access to both the name and the selector.
const puppeteer = require('puppeteer');

const content = `
<div class="section-one">foo</div>
<div class="section-two">bar</div>
<div class="section-three">baz</div>
`;

const elementsToScreenshot = [
    {selector: '.section-one', name: 'SectionOne'},
    {selector: '.section-two', name: 'SectionTwo'},
    {selector: '.section-three', name: 'SectionThree'},
];

const getPath = name => `./public/${name}.png`;

let browser;
(async () => {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.setContent(content);

    for (const {selector, name} of elementsToScreenshot) {
        const el = await page.$(selector);
        await el.screenshot({path: getPath(name)});
    }
})()
    .catch(err => console.error(err))
    .finally(async () => await browser.close());
I want to scrape an image from a Wikipedia page, but the problem is that I am getting 3 URLs of the same image at a time, and those three URLs are in the same img tag. I just want the src URL. Does anybody know how to do it?
const puppeteer = require('puppeteer');
const sleep = require('sleep');

(async () => {
    const browser = await puppeteer.launch({
        "headless": false
    });
    const page = await browser.newPage();
    await page.goto("https://www.wikipedia.org/");

    const xpathselector = `//span[contains(text(), "Commons")]`;
    const commonlinks = await page.waitForXPath(xpathselector);

    await page.waitFor(3000);
    await commonlinks.click();
    await page.waitFor(2000);

    //await page.waitForSelector()
    const images = await page.$eval(('a[class="image"] > img[src]'), node => node.innerHTML);
    console.log(images);
})();

//*[#id="mainpage-potd"]/div[1]/a/img
I bet that you "see" three URLs because you are looking at the srcset attribute, which has multiple URLs for different screen resolutions. You could return the src property instead:
const images = await page.$eval(('a[class="image"] > img[src]'),node => node.src);
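The srcset attribute lists the higher-resolution variants of the same picture, while src (or currentSrc) resolves to a single URL. If more than one image matches the selector and you want the plain src of each, $$eval maps over all matches; this is just a sketch reusing the selector from the question:

// Collect the src (not srcset) of every image wrapped in an <a class="image"> link.
const imageSrcs = await page.$$eval('a[class="image"] > img[src]',
    nodes => nodes.map(node => node.src));
console.log(imageSrcs);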
I am building a web application named UXChecker which works similarly to website.grader.com, but with a number of differences.
After the user sends a URL from the frontend via the API to the backend, I want to crawl this URL and do UX measurements. I will be checking buttons, SEO, responsiveness, etc., as well as network metrics such as the total size of the page (MB) and how fast it loaded. I decided to do it in Node.js using Puppeteer.
Is there a way to get "total page size" (resource size) and "load time" (or at least DOMContentLoaded) metrics?
If not, could you recommend a different library or code snippets, please?
So far I have tried:
const perfEntries = JSON.parse(
    await page.evaluate(() => JSON.stringify(performance.getEntries()))
);
I have summed up all the transferSize values, but the number is not even close to the numbers in the DevTools Network tab.
After that I tried:
let performanceMetrics = await page._client.send('Performance.getMetrics');
which gave me "DOMContentLoaded" among the other metrics, but this number was different from the value in the Network tab. I wasn't able to get the total size (resource size) of the page.
My Puppeteer code (you can test it by running node example.js):
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const context = await browser.createIncognitoBrowserContext();
    const page = await context.newPage();
    await page.setCacheEnabled(false);

    // setting width and height
    await page.setViewport({
        width: 1200,
        height: 780,
        deviceScaleFactor: 1,
    });

    // going to URL
    await page.goto("http://www.cupcakeipsum.com/", { waitUntil: 'networkidle2' });

    // performance entries available
    const perfEntries = JSON.parse(
        await page.evaluate(() => JSON.stringify(performance.getEntries()))
    );

    // performance test
    // total size of the website:
    const totalSize = () => {
        let totalSize = 0;
        perfEntries.forEach(entry => {
            if (entry.transferSize > 0) {
                totalSize += entry.transferSize;
            }
        });
        return totalSize;
    };

    console.log("==== Total size ====");
    console.log(totalSize());

    // performance metrics
    console.log("==== Devtools: Performance.getMetrics ====");
    let performanceMetrics = await page._client.send('Performance.getMetrics');
    console.log(performanceMetrics.metrics);

    await browser.close();
})();
Is there a way to measure these metrics, please? I will be thankful for any help and tips.
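One way to approximate both numbers, assuming it is acceptable that they won't match DevTools byte-for-byte (the Network tab reports the compressed transfer size including headers, while response.buffer() gives the decoded body), is to listen for responses and sum the body sizes, and to read the load timing from the Navigation Timing API. A rough sketch:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setCacheEnabled(false);

    // Sum the decoded body size of every response the page triggers.
    const responseSizes = [];
    page.on('response', response => {
        // Redirect responses have no body that can be buffered, hence the catch.
        responseSizes.push(response.buffer().then(b => b.length).catch(() => 0));
    });

    await page.goto('http://www.cupcakeipsum.com/', { waitUntil: 'networkidle2' });

    const totalBytes = (await Promise.all(responseSizes)).reduce((a, b) => a + b, 0);

    // Navigation Timing: milliseconds until DOMContentLoaded / load fired.
    const timing = await page.evaluate(() => {
        const t = performance.timing;
        return {
            domContentLoadedMs: t.domContentLoadedEventEnd - t.navigationStart,
            loadMs: t.loadEventEnd - t.navigationStart,
        };
    });

    console.log('Approximate total size (bytes):', totalBytes);
    console.log('Timing:', timing);

    await browser.close();
})();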
I am trying to scrape Glassdoor company reviews as an exercise, and I've been learning JavaScript and jQuery in order to do it with Puppeteer. In my script, I try to output to the console both the
summary of the review, and the
date of the review.
(Pictures 1 and 2 show the HTML positions of the summary and the date.)
However, for some reason, only the summaries get printed to the console, whereas the dates do not. Do I have a mistake in my code?
const puppeteer = require("puppeteer");
const cheerio = require('cheerio');

// puppeteer usage as normal
puppeteer.launch({ headless: false }).then(async browser => {
    const page = await browser.newPage();
    const navigationPromise = page.waitForNavigation();

    await page.setViewport({ width: 1440, height: 794 });
    await page.goto('https://www.glassdoor.com/Reviews/Grubhub-Reviews-E419089.htm');
    await navigationPromise;

    var data = [];
    const html = await page.content();
    const $ = cheerio.load(html);

    $(".hreview").each(function () {
        console.log("\nMain scraping function happening...");

        // This works
        console.log($(this).find("span.summary").text());

        // This does not work
        console.log($(this).find("time.date").text());
    });

    await browser.close();
});
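Without seeing the live markup it is hard to say whether time.date is still the right selector, and if the date lives in a datetime attribute rather than in the element's text, .text() will be empty even though the selector matches. A small debugging sketch (the selectors here are assumptions) that dumps what each review actually contains:

$(".hreview").each(function () {
    console.log($(this).find("span.summary").text());

    // Inspect whether a <time> element exists inside this review at all,
    // and what class and datetime attribute it actually carries.
    const time = $(this).find("time").first();
    console.log("time elements found:", $(this).find("time").length);
    console.log("class:", time.attr("class"), "datetime:", time.attr("datetime"));
});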