I'm trying to scrape three fields for each of the companies listed on a webpage. The idea is to parse the inner page links of the different companies from the landing page, then scrape the title, phone and email from each detail page. The script I've created can do this without any issue.
However, I also wish to scrape content from the next pages (Next). As I'm very new to Node, I would appreciate it if somebody could help me implement the logic for grabbing the next pages within the script below.
const puppeteer = require("puppeteer");
const base = "https://www.timesbusinessdirectory.com";
const url = "https://www.timesbusinessdirectory.com/company-listings";
(async () => {
const browser = await puppeteer.launch({headless:false});
const [page] = await browser.pages();
await page.goto(url,{waitUntil: 'networkidle2'});
await page.waitForSelector(".company-listing");
const sections = await page.$$(".company-listing");
let data = [];
for (const section of sections) {
const itemName = await section.$eval("h3 > a", el => el.getAttribute("href"));
data.push(itemName);
}
let itmdata = [];
for (const link of data) {
const newlink = base.concat(link);
await page.goto(newlink,{waitUntil: 'networkidle2'});
await page.waitForSelector(".company-details");
const result = await page.$$(".company-details");
for (const itmres of result) {
const company = await itmres.$eval("h3", el => el.textContent);
const phone = await itmres.$eval("#valuephone a[href]", el => el.textContent);
const email = await itmres.$eval("a[onclick^='showCompanyEmail']", el => el.getAttribute("onclick"));
console.log({
title: company,
tel : phone.trim(),
emailId : email.split("('")[1].split("',")[0]
});
}
}
await browser.close();
})();
How can I implement the logic of traversing next pages within the script?
EDIT:
I tried to use the next page link within the script iteratively by defining a while loop. This time, however, the script doesn't seem to respond at all. Perhaps I've done something wrong somewhere along the way.
In your current code, you declare a new url variable in each iteration. Instead, declare the outer url variable with let and reassign it:
let url = "https://www.timesbusinessdirectory.com/company-listings";
// ...
url = base.concat(nextPageLink);
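Putting it together, the whole link-collection phase can be sketched as a while loop. This is a sketch, not a drop-in: the `a[rel='next']` selector for the Next link is an assumption that you must verify against the site's actual markup.

```javascript
// Sketch of the pagination loop. `page` is the Puppeteer page from the
// script above; the Next-link selector is a guess and must be verified.
const base = "https://www.timesbusinessdirectory.com";

// Turn a relative href into an absolute URL.
const toAbsolute = (href) => href.startsWith("http") ? href : base.concat(href);

async function collectAllLinks(page, startUrl) {
  let url = startUrl;
  const links = [];
  while (url) {
    await page.goto(url, { waitUntil: "networkidle2" });
    await page.waitForSelector(".company-listing");
    // Detail-page links on the current listing page.
    const hrefs = await page.$$eval(
      ".company-listing h3 > a",
      els => els.map(el => el.getAttribute("href"))
    );
    links.push(...hrefs.map(toAbsolute));
    // Hypothetical selector for the Next button; null ends the loop.
    const next = await page.$("a[rel='next']");
    url = next
      ? toAbsolute(await next.evaluate(el => el.getAttribute("href")))
      : null;
  }
  return links;
}
```

You would then feed the returned `links` array into your existing detail-page loop instead of `data`.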
I save data to the workbook with the following code:
export const storeSettingsToWorkbook = async (settingsType: Settings, storeData: WorkbookModel) => {
return Excel.run(async (context) => {
const originalXml = createXmlObject(storeData);
const customXmlPart = context.workbook.customXmlParts.add(originalXml);
customXmlPart.load("id");
await context.sync();
// Store the XML part's ID in a setting
const settings = context.workbook.settings;
settings.add(settingsTitles[settingsType], customXmlPart.id);
await context.sync();
})
}
When I read the data back, it works normally. But when I want to get this data from another add-in in Excel, I cannot get it:
const {settings} = context.workbook;
const sheet = context.workbook.worksheets.getActiveWorksheet().load("items");
const xmlPartIDSetting = settings.getItemOrNullObject(settingsTitles[settingsType]).load("value");
await context.sync();
if (xmlPartIDSetting.value) {
const customXmlPart = context.workbook.customXmlParts.getItem(xmlPartIDSetting.value);
const xmlBlob = customXmlPart.getXml();
await context.sync()
const parsedObject = parseFromXmlString(xmlBlob.value);
const normalizedData = normalizeParsedData(parsedObject);
Any ideas?
Thanks for reaching out to us.
This is by design. Each add-in has its own settings and cannot share them with other add-ins.
You can use 'context.workbook.properties.custom' as a workaround.
You can also use 'context.workbook.worksheets.getActiveWorksheet().customProperties', but then the two add-ins are required to operate on the same worksheet.
I am trying to take screenshots of each section in a landing page, which may contain multiple sections. I was able to do that effectively in "Round1", which I commented out.
My goal is to learn how to write leaner/cleaner code, so I made another attempt, "Round2".
Round2 does take a screenshot, but only of section 3, and with the file name JSHandle#node.png. I am definitely doing this wrong.
Round1 (works perfectly)
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.somelandingpage.com');
// const elOne = await page.$('.section-one');
// await elOne.screenshot({path: './public/SectionOne.png'})
// takes a screenshot SectionOne.png
// const elTwo = await page.$('.section-two')
// await elTwo.screenshot({path: './public/SectionTwo.png'})
// takes a screenshot SectionTwo.png
// const elThree = await page.$('.section-three')
// await elThree.screenshot({path: './public/SectionThree.png'})
// takes a screenshot SectionThree.png
Round2
I created an array that holds all the variables and tried to loop through them.
const elOne = await page.$('.section-one');
const elTwo = await page.$('.section-two')
const elThree = await page.$('.section-three')
let lpElements = [elOne, elTwo, elThree];
for(var i=0; i<lpElements.length; i++){
await lpElements[i].screenshot({path: './public/'+lpElements[i] + '.png'})
}
await browser.close();
})();
This takes a screenshot of section-three only, but with the wrong file name (JSHandle#node.png). There are no error messages in the console.
How can I reproduce Round1 by modifying the Round2 code?
Your array contains only Puppeteer element handle objects, which have .toString() called on them when concatenated into the path. That's why every screenshot gets the same JSHandle#node.png name: each one overwrites the previous file, leaving you with only the last section's image.
A clean way to do this is to use an array of objects, each of which has a selector and its name. Then, when you run your loop, you have access to both name and selector.
const puppeteer = require('puppeteer');
const content = `
<div class="section-one">foo</div>
<div class="section-two">bar</div>
<div class="section-three">baz</div>
`;
const elementsToScreenshot = [
{selector: '.section-one', name: 'SectionOne'},
{selector: '.section-two', name: 'SectionTwo'},
{selector: '.section-three', name: 'SectionThree'},
];
const getPath = name => `./public/${name}.png`;
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(content);
for (const {selector, name} of elementsToScreenshot) {
const el = await page.$(selector);
await el.screenshot({path: getPath(name)});
}
})()
.catch(err => console.error(err))
.finally(async () => await browser.close())
;
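If you'd rather not repeat both the selector and the name, you could also derive the file name from the selector. This helper is purely illustrative (my own naming, not part of the answer above), and assumes your class names follow the `.section-one` pattern:

```javascript
// Derive a PascalCase file name from a class selector,
// e.g. '.section-one' -> 'SectionOne'. Illustrative only.
const nameFromSelector = selector =>
  selector
    .replace(/^\./, '')
    .split('-')
    .map(part => part.charAt(0).toUpperCase() + part.slice(1))
    .join('');
```

The loop then only needs a list of selectors, and the path becomes `./public/${nameFromSelector(selector)}.png`.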
I'm currently working on some personal projects and I just had the idea to do some Amazon scraping so I can get product details like the name and price.
I found that the mobile view was the most consistent one, using the same IDs for the product name and price, so that's why I'm using it.
The problem is that I can't get the price.
I've used the exact same kind of query selector for the price as for the name (which works), but with no success.
const puppeteer = require('puppeteer');
const url = 'https://www.amazon.com/dp/B01MUAGZ49';
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.setViewport({ width: 360, height: 640 });
await page.goto(url);
let producData = await page.evaluate(() => {
let productDetails = [];
let elements = document.querySelectorAll('#a-page');
elements.forEach(element => {
let detailsJson = {};
try {
detailsJson.name = element.querySelector('h1#title').innerText;
detailsJson.price = element.querySelector('#newBuyBoxPrice').innerText;
} catch (exception) {}
productDetails.push(detailsJson);
});
return productDetails;
});
console.dir(producData);
})();
I should get the name and the price in the console.dir but right now I only get
[ { name: 'Nintendo Switch – Neon Red and Neon Blue Joy-Con ' } ]
Just setting the viewport's height and width is not enough to fully simulate a mobile browser. Right now, the page just assumes that you have a very small browser window.
The easiest way to simulate a mobile device is by using the function page.emulate and the default DeviceDescriptors, which contain information about a large number of mobile devices.
Quote from the docs for page.emulate:
Emulates given device metrics and user agent. This method is a shortcut for calling two methods:
page.setUserAgent(userAgent)
page.setViewport(viewport)
To aid emulation, puppeteer provides a list of device descriptors which can be obtained via the require('puppeteer/DeviceDescriptors') command. [...]
Example
Here is an example on how to simulate an iPhone when visiting the page.
const puppeteer = require('puppeteer');
const devices = require('puppeteer/DeviceDescriptors');
const iPhone = devices['iPhone 6'];
const url = '...';
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.emulate(iPhone);
await page.goto(url);
// Simplified page.evaluate
let producData = await page.evaluate(() => ({
name: document.querySelector('#a-page h1#title').innerText,
price: document.querySelector('#a-page #newBuyBoxPrice').innerText
}));
console.dir(producData);
})();
I also simplified your page.evaluate a little, but you can of course also use your original code after the page.goto. This returned the name and the price of the product for me.
For my college project, I made a Wikipedia scraper using Node.js and Puppeteer. It works for all but one link. After scraping almost half the data of a table on that page (I use console.log to see which data has been scraped at any moment), it just does nothing: it shows no error, it doesn't stop executing, it simply does nothing after that point. The Puppeteer browser doesn't close either.
In the original scraper, I used a loop over links to generate the data. Since it was not working, I made a separate scraper for just that link, but the same thing happens. Can anyone help me out?
const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
try {
const browser = await puppeteer.launch({
headless: false
});
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 800 });
const link = "https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_June_2016";
console.log("==============================");
console.log("Travelling to link:", link);
console.log("==============================");
await page.goto(link, {waitUntil: 'networkidle0'});
let rowArray = await page.$$("table[class='wikitable sortable jquery-tablesorter'] > tbody > tr");
var dataA = [];
for(let row of rowArray){
let date = await row.$eval('td:nth-child(1)', element => element.textContent);
date = date.substring(0, date.length - 1);
let type = await row.$eval('td:nth-child(2)', element => element.textContent);
type = type.substring(0, type.length - 1);
let dead = await row.$eval('td:nth-child(3)', element => element.textContent);
dead = dead.substring(0, dead.length - 1);
let injured = await row.$eval('td:nth-child(4)', element => element.textContent);
injured = injured.substring(0, injured.length - 1);
let location = await row.$eval('td:nth-child(5)', element => element.textContent);
location = location.substring(0, location.length - 1);
let details = await row.$eval('td:nth-child(6)', element => element.textContent);
details = details.substring(0, details.length - 1);
let perpetrator = await row.$eval('td:nth-child(7)', element => element.textContent);
perpetrator = perpetrator.substring(0, perpetrator.length - 1);
let partOf = await row.$eval('td:nth-child(8)', element => element.textContent);
partOf = partOf.substring(0, partOf.length - 1);
console.log("==============================");
console.log({date, type, dead, injured, location, details, perpetrator, partOf});
console.log("==============================");
dataA.push({date, type, dead, injured, location, details, perpetrator, partOf});
}
console.log("==============================");
console.log("Started writing JSON file");
fs.writeFileSync(`./june.json`, JSON.stringify(dataA), 'utf-8');
console.log("Finished writing JSON file");
console.log("==============================");
await browser.close();
} catch (error) {
console.error(error);
}
})();
Just by looking at the point where it stops, it seems the script struggles with the next row, which is missing a cell. Looking at the Wikipedia source, the "part of" cell is absent in that row, so this call rejects because the selector matches nothing:
let partOf = await row.$eval('td:nth-child(8)', element => element.textContent);
The rejection lands in your catch block, but console.error() is called without an argument, so nothing is printed and browser.close() is never reached. That's why you get no errors and the script appears to do nothing. My guess is that if you edit that page and close the cell it will work — or, better, update your script to handle that scenario.
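One way to handle the missing cell is to query it with row.$ first and fall back to an empty string when it isn't there. A sketch, assuming the row handles from your loop (cellText is my own helper name, not Puppeteer API):

```javascript
// Returns the cell's text (with the trailing newline trimmed, as in the
// original code), or '' when the cell doesn't exist in this row.
async function cellText(row, index) {
  const cell = await row.$(`td:nth-child(${index})`);
  if (!cell) return '';
  const text = await cell.evaluate(el => el.textContent);
  return text.replace(/\n$/, '');
}
```

Then each field becomes one line, e.g. `let partOf = await cellText(row, 8);`, which also shortens the loop considerably.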
I have a JSON file that is generated dynamically by a crawler that takes values from a page. The JSON is created as follows:
{
"temperatura":"31°C",
"sensacao":"RealFeel® 36°",
"chuva":"0 mm",
"vento":"NNO11km/h",
"momentoAtualizacao":"Dia",
"Cidade":"carazinho",
"Site":"Accuweather"
}
{
"temperatura":"29 º",
"sensacao":"29º ST",
"vento":"11 Km/h",
"umidade":"51% UR",
"pressao":"1013 hPa",
"Cidade":"carazinho",
"Site":"Tempo Agora"
}
The problem is that the generated file is missing the [] to wrap all the objects in an array, and the commas to separate them.
The final json should look like this:
[{
"temperatura":"31°C",
"sensacao":"RealFeel® 36°",
"chuva":"0 mm",
"vento":"NNO11km/h",
"momentoAtualizacao":"Dia",
"Cidade":"carazinho",
"Site":"Accuweather"
},
{
"temperatura":"29 º",
"sensacao":"29º ST",
"vento":"11 Km/h",
"umidade":"51% UR",
"pressao":"1013 hPa",
"Cidade":"carazinho",
"Site":"Tempo Agora"
}]
I am currently using this code to generate json.
const climatempo = async (config) => {
const browser = await puppeteer.launch()
const page = await browser.newPage()
const override = Object.assign(page.viewport(), {width: 1920, height: 1024});
await page.setViewport(override);
await page.goto(config.cidades[cidade],{waitUntil: 'load',timeout:'60000'})
if(siteEscolhido == "accu"){
const elementTemp = await page.$(config.regras.elementTemp)
const temperatura = await page.evaluate(elementTemp => elementTemp.textContent, elementTemp)
const sensacaoElement= await page.$(config.regras.sensacaoElement)
const sensacao = await page.evaluate(sensacaoElement => sensacaoElement.textContent, sensacaoElement)
const chuvaElement = await page.$(config.regras.chuvaElement)
const chuva = await page.evaluate(chuvaElement => chuvaElement.textContent, chuvaElement)
const ventoElement = await page.$(config.regras.ventoElement)
const vento = await page.evaluate(ventoElement => ventoElement.textContent, ventoElement)
const atualizadoA = await page.$(config.regras.atualizadoA)
const momentoAtualizacao = await page.evaluate(atualizadoA => atualizadoA.textContent, atualizadoA)
var dado = {
temperatura:temperatura,
sensacao:sensacao,
chuva:chuva,
vento:vento,
momentoAtualizacao:momentoAtualizacao,
Cidade:cidade,
Site:"Accuweather"
}
//dados.push(dado)
var x = JSON.stringify(dado)
fs.appendFile('climatempo.json',x,function(err){
if(err) throw err
})
console.log("Temperatura:" + temperatura)
console.log(sensacao)
console.log("Vento:" + vento)
console.log("chuva:" + chuva)
console.log(momentoAtualizacao)
await browser.close()
If anyone has any idea how to solve my problem, please let me know!
Grateful, Carlos
I would suggest doing it a little differently.
I will try to explain in pseudocode, since I don't understand your variable names:
read json file
array = JSON.parse(fileContents)
array.push(newItem)
newContents = JSON.stringify(array)
file WRITE (not append) newContents
I recommend reading the file, pushing onto an array captured from that file, and then writing the file back to disk.
Assuming the file has content already in the form of an array:
let fileDado = JSON.parse(fs.readFileSync('climatempo.json'));
fileDado.push(dado);
fs.writeFileSync('climatempo.json', JSON.stringify(fileDado));