I'm trying to scrape data from a CDC website.
I'm using cheerio.js to fetch the data, and copying the HTML selector into my code, like so:
const listItems = $('#tab1_content > div > table > tbody > tr:nth-child(1) > td:nth-child(3)');
However, when I run the program, I just get a blank array. How is this possible? I'm copying the HTML selector verbatim into my code, so why is this not working? Here is a short video showing the issue: https://youtu.be/a3lqnO_D4pM
Here is my full code, along with a link where you can run the code:
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");
// URL of the page we want to scrape
const url = "https://nccd.cdc.gov/DHDSPAtlas/reports.aspx?geographyType=county&state=CO&themeId=2&filterIds=5,1,3,6,7&filterOptions=1,1,1,1,1";
// Async function which scrapes the data
async function scrapeData() {
try {
// Fetch HTML of the page we want to scrape
const { data } = await axios.get(url);
// Load HTML we fetched in the previous line
const $ = cheerio.load(data);
// Select the table cell matched by the copied selector
const listItems = $('#tab1_content > div > table > tbody > tr:nth-child(1) > td:nth-child(3)');
// Stores data in array
const dataArray = [];
// Use .each method to loop through the elements
listItems.each((idx, el) => {
// Object holding data
const dataObject = { name: ""};
// Store the textcontent in the above object
dataObject.name = $(el).text();
// Populate array with data
dataArray.push(dataObject);
});
// Log array to the console
console.dir(dataArray);
} catch (err) {
console.error(err);
}
}
// Invoke the above function
scrapeData();
Run the code here: https://replit.com/#STCollier/Web-Scraping#index.js
Thanks for any help.
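A blank result often means the element is not present in the HTML that axios receives, for example because the table is filled in by JavaScript after the page loads. That is only an assumption here; it has not been verified for this CDC page. A minimal sketch for checking it, using a hypothetical helper named inspectRawHtml and plain string matching so that it does not depend on any selector:

```javascript
// Hypothetical helper (not part of cheerio or axios): checks whether the raw
// HTML string contains an element with the given id and counts <td> tags.
function inspectRawHtml(html, id) {
  // Escape regex metacharacters so the id is matched literally.
  const escapedId = id.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const idPattern = new RegExp(`id\\s*=\\s*["']${escapedId}["']`, "i");
  const cells = html.match(/<td[\s>]/gi) || [];
  return { hasElement: idPattern.test(html), cellCount: cells.length };
}

// Stand-in strings for illustration only; in the scraper, pass `data`
// from `axios.get(url)` instead.
const staticPage =
  '<div id="tab1_content"><table><tr><td>a</td><td>b</td></tr></table></div>';
const emptyShell = '<div id="tab1_content"></div><script src="app.js"></script>';

console.log(inspectRawHtml(staticPage, "tab1_content"));
console.log(inspectRawHtml(emptyShell, "tab1_content"));
```

If cellCount is 0 for the real response, the rows are probably added in the browser, and the scraper would need either a headless browser such as Puppeteer or the underlying data request that the page itself makes.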
I've tried a lot of different things, and this seems to be the closest, but it only returns one of the div items I'm looking for.
I've gotten it to return multiple divs, but whenever it does, I get a blank list when I call the .text() function on the HTML and push the result into a list. (I've since deleted that code.)
Here is the webpage; if you check, every single item box has a seller name. I've been at this for about 5 hours now, and I'm obviously a beginner, especially in JS, so it's been a good challenge. I think I'm lacking some fundamental knowledge, which is holding me back.
https://poshmark.com/category/Men-Jackets_&_Coats?sort_by=like_count&all_size=true&my_size=false
Thank you to everyone who offers help. Happy holidays.
const express = require("express");
const cheerio = require("cheerio");
const request = require("request-promise");
const pretty = require("pretty");
const { default: axios } = require("axios");
const app = express();
const port = process.env.port || 5000;
let states =[];
const url = "https://poshmark.com/category/Men-Jackets_&_Coats?sort_by=like_count&all_size=true&my_size=false";
const fetchData = async () => {
try {
let res = await axios.get(url);
let $ = await cheerio.load(res.data);
const items = $("#content > div > div > div > div:nth-child(4) > section > div.tiles_container.m--t--1");
const itemArea = $("#content > div > div > div > div:nth-child(4) > section > div.tiles_container.m--t--1 > div:nth-child(1) > div > div")
itemArea.each(function(i){
itemHref = itemArea.find("href").text()
areaText = itemArea.text();
console.log(areaText);
console.log(itemHref);
//console.log(`${i} : ${itemArea}\n\n\n`)
});
} catch (error) {
console.log(error)
return
}
};
fetchData();
I tried grabbing the href and grabbing the class with the username at the bottom of each div, and each time it returned blank or undefined just when I thought I had finally got it.
It may just be an issue with your selector. However, I found that this works:
const itemArea = $(".tiles_container a.tile__creator span");
itemArea.each(function (i, element) {
console.log('username: ', $(element).text());
});
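This way we get only the username in the text and not the entire card text. Note also that find("href") looks for an <href> element, which does not exist; href is an attribute, so it is read with .attr("href") on the link element.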
I am trying to use axios to retrieve data from a url and then append the data to html elements I created using javascript.
In a nutshell for each programming language in my url I would like to have a card showing the headline and author name of each article.
This is my HTML
<body>
<div class="parentDiv">
</div>
</body>
</html>
and my JS
const CardsTest = (one) => {
// class headline
const divHead = Object.assign(document.createElement('div'), {className: 'one', textContent: one.headline});
// class author
const divAut = Object.assign(document.createElement('div'), {className: 'writer'});
const spanCont = Object.assign(document.createElement('span'), {className: 'name', textContent: one.authorName});
divAut.appendChild(SpanCont);
divHead.appendChild(divAut);
return divHead;
}
const cardAppender = (div) => {
const divOne = document.querySelector(div);
axios.get('http://localhost:5000/api/articles')
.then((resp) => {
Object.keys(resp.data).forEach (
function(obj) {
const topicsData = CardsTest(obj.articles);
divOne.appendChild(obj.articles)
}
)
})
}
cardAppender('parentDiv')
I know that my CardsTest function creates all the components, and that cardAppender can, at the very least, print out the JSON from the target URL. If I run the axios call and console.log obj.articles, I get an object promise containing the articles from the URL.
To summarize: I expect cardAppender to fetch the URL, pass the articles to a callback function (CardsTest) that fills in the writer and headline elements, and then append the result to parentDiv in my HTML. However, this is not happening.
UPDATE
I tried changing my cardAppender function by creating an array of programming languages (the keys in my JSON) and passing the headline and authorName for each article to CardsTest, but the components from CardsTest are still not being created:
const cardsAppender = (div) => {
const newArr = ["javascript","bootstrap","technology","jquery","node"]
const divOne = document.querySelector('.parentDiv');
axios.get('http://localhost:5000/api/articles')
.then((resp) => {
newArr.forEach((item) => {
const cardsHolds = CardsTest(resp.data.articles[item])
divOne.appendChild(cardsHolds)
})
})
}
cardsAppender('.parentDiv')
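You are using Object.keys to iterate, so the callback receives each key as a string; you need to use that key to index into resp.data. Alternatively, use Object.values to iterate over the values directly. It's not really clear what shape your data is, though, so this might need to be tweaked. Either way, append the element returned by CardsTest, since appendChild expects a DOM node rather than raw data.
Object.keys(resp.data).forEach(
function(key) {
// key is a property name, so look the value up on resp.data
const topicsData = CardsTest(resp.data[key].articles);
divOne.appendChild(topicsData);
}
);
Or, with Object.values:
Object.values(resp.data).forEach(
function(obj) {
// obj is already the value stored under each key
const topicsData = CardsTest(obj.articles);
divOne.appendChild(topicsData);
}
);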
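To make the difference between keys and values concrete, here is a small sketch that runs without a browser or server. It assumes the response is shaped as an articles object mapping each topic to an array of articles, which is a guess based on the resp.data.articles[item] access in the question; sampleData and flattenArticles are hypothetical names used only for illustration.

```javascript
// Hypothetical sample mirroring the ASSUMED response shape:
// topic name -> array of { headline, authorName } objects.
const sampleData = {
  articles: {
    javascript: [
      { headline: "ES modules in practice", authorName: "A. Writer" },
      { headline: "Async iteration", authorName: "B. Author" },
    ],
    node: [{ headline: "Streams overview", authorName: "C. Coder" }],
  },
};

// Object.keys yields the property names (strings), not the values.
const topics = Object.keys(sampleData.articles);

// Object.entries yields [key, value] pairs, so both are available at once.
function flattenArticles(articlesByTopic) {
  const cards = [];
  for (const [topic, articles] of Object.entries(articlesByTopic)) {
    for (const article of articles) {
      cards.push({
        topic,
        headline: article.headline,
        authorName: article.authorName,
      });
    }
  }
  return cards;
}

const cards = flattenArticles(sampleData.articles);
console.log(topics);
console.log(cards.length);
```

Under that assumption, each element of cards is a single article, so it can be passed to CardsTest one at a time and the returned element appended to the parent div.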
I'm having an issue trying to scrape a web table. I'm able to extract table nodes one at a time by using 'firstChild' and 'lastElementChild'. My problem is that I want to extract all the child nodes (rows/cells) into a map or array so I can iterate over them and extract data in a loop.
NOTE: I'm using Puppeteer, hence the async function.
Here is a code snippet:
const [table] = await page.$x(xpath);
const tbody = await table.getProperty('lastElementChild'); //<-- in this case tbody is lastchild
const rows = Array.from(await tbody.getProperties('childNodes')); // <-- LINE OF THE PROBLEM
const cell = await rows.getProperty('firstChild') // <-- using firstChild for testing (ideally 'childNodes' with forEach())
const data = await cell.getProperty('innerText');
const txt = await data.jsonValue();
console.log(txt);
I found another way.
Here is the solution:
const row = await page.evaluate(() => {
let row = document.querySelector('.fluid-table__row'); //<-- this refers to a HTML class
let cells = [];
row.childNodes.forEach(function(cell){
cells.push(cell.textContent)
})
return cells;
})
console.log(row);
I'm trying to crawl a site using Puppeteer. The data I'm looking for is spread across multiple tables, and I only want the data from a single one. I was able to target that table using a fairly verbose selector, .querySelectorAll('table.myclass ~ table.myclass'). My issue now is that my code grabs the first item of each matching table (starting from the correct table, which is the 2nd one), and I can't find a way to get it to grab all the data in only the 2nd table.
const puppeteer = require('puppeteer');
const myUrl = "https://coolurl.com";
(async () => {
const browser = await puppeteer.launch({
headless: true
});
const page = (await browser.pages())[0];
await page.setViewport({
width: 1920,
height: 926
});
await page.goto(myUrl);
let gameData = await page.evaluate(() => {
let games = [];
let gamesElms = document.querySelectorAll('table.myclass ~ table.myclass');
gamesElms.forEach((gameelement) => {
let gameJson = {};
try {
gameJson.name = gameelement.querySelector('.myclass2').textContent;
} catch (exception) {
console.warn(exception);
}
games.push(gameJson);
});
return games;
})
console.log(gameData);
browser.close();
})();
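You can use either of the following to select the second table. Note that the second form only matches if that table is also the second child of its parent, because :nth-child counts all sibling elements, not just those matching table.myclass: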
let gamesElms = document.querySelectorAll('table.myclass')[1];
let gamesElms = document.querySelector('table.myclass:nth-child(2)');
Additionally, you can use the example below to push all of the data from the table to an array:
let games = Array.from(document.querySelectorAll('table.myclass:nth-child(2) tr'), e => {
return Array.from(e.querySelectorAll('th, td'), e => e.textContent);
});
// console.log(games[rowNum][cellNum]); <-- textContent
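Once the table is a two-dimensional array, it can be convenient to key each row by its column header instead of by index. The sketch below assumes the first row of the array holds the header cells, which may not be true for every table; rowsToObjects is a hypothetical helper and the rows are made-up stand-ins for the games array returned from the page.

```javascript
// Hypothetical helper: turns [[header...], [cell...], ...] into an array of
// objects keyed by header text. ASSUMES the first row is the header row.
function rowsToObjects(rows) {
  if (rows.length === 0) return [];
  const [header, ...body] = rows;
  const keys = header.map((text) => text.trim());
  return body.map((cells) =>
    // Missing cells become empty strings so every object has the same keys.
    Object.fromEntries(keys.map((key, i) => [key, (cells[i] ?? "").trim()]))
  );
}

// Made-up rows for illustration only.
const games = [
  ["Name", "Score"],
  [" Alpha ", "3"],
  ["Beta", "5"],
];

console.log(rowsToObjects(games));
```

Because the helper works on plain arrays, it can run in Node after page.evaluate returns, rather than inside the browser context.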
I have a Firebase database whose structure follows this format: Ordergroup/groupid/item
I want to get the buyerid for each item once a new group is created in the Ordergroup node. So I tried this:
exports.sendorderemailtoseller = functions.database.ref('/Ordergroup/{pushId}').onCreate((snapshot, context) => {
const parentRef = snapshot.ref.parent;
const ref = snapshot.ref;
const original = snapshot.val();
const buyerid = original.buyerid;
})
I then noticed that original only returns the first child and buyerid comes out as undefined. How can I get a snapshot of all the children in the groupid, excluding Ordersummary?
Your variable 'original' is actually getting the whole tree under node 1522509953304, so you will need to iterate over its children to get each buyerid, like below:
exports.sendorderemailtoseller = functions.database.ref('/Ordergroup/{pushId}').onCreate((snapshot, context) => {
const buyerids = [];
snapshot.forEach((item) => {
buyerids.push(item.val().buyerid);
});
console.log(buyerids);
});
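The loop above also visits the Ordersummary child, which the question wanted to exclude. One possible way to skip it is to filter on the child key. The sketch below works on the plain object that snapshot.val() would return, so it can be tried without Firebase; collectBuyerIds is a hypothetical helper, and the sample data is made up to match the assumed Ordergroup/groupid/item layout.

```javascript
// Hypothetical helper: `group` is the plain object from snapshot.val().
// Children listed in `excludedKeys`, and children without a buyerid, are skipped.
function collectBuyerIds(group, excludedKeys = ["Ordersummary"]) {
  return Object.entries(group || {})
    .filter(
      ([key, item]) =>
        !excludedKeys.includes(key) && item && item.buyerid !== undefined
    )
    .map(([, item]) => item.buyerid);
}

// Made-up data in the ASSUMED shape: item children plus an Ordersummary child.
const sampleGroup = {
  item1: { buyerid: "buyerA", price: 10 },
  item2: { buyerid: "buyerB", price: 20 },
  Ordersummary: { total: 30 },
};

console.log(collectBuyerIds(sampleGroup));
```

Inside the Cloud Function this would be called as collectBuyerIds(snapshot.val()). The same filter could instead be written inside snapshot.forEach by comparing item.key with 'Ordersummary'.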