Extracting childNodes from Table.getProperties('childNodes/children') - javascript

I'm having trouble scraping a web table. I can extract individual table nodes with 'firstChild' and 'lastElementChild', but each of those gives me only a single child node. What I want is all the child nodes (rows/cells) in a map or array so that I can iterate over them and extract the data in a loop.
NOTE: I'm using Puppeteer, so everything runs inside an async function.
Here is a code snippet:
const [table] = await page.$x(xpath);
const tbody = await table.getProperty('lastElementChild'); //<-- in this case tbody is lastchild
const rows = Array.from(await tbody.getProperties('childNodes')); // <-- LINE OF THE PROBLEM
const cell = await rows.getProperty('firstChild') // <-- using firstChild for testing (ideally 'childNodes' with forEach())
const data = await cell.getProperty('innerText');
const txt = await data.jsonValue();
console.log(txt);

I found another way.
Here is the solution:
const row = await page.evaluate(() => {
  let row = document.querySelector('.fluid-table__row'); //<-- this refers to an HTML class
  let cells = [];
  row.childNodes.forEach(function (cell) {
    cells.push(cell.textContent);
  });
  return cells;
});
console.log(row);
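One thing to watch with childNodes is that it also contains the whitespace text nodes between cells, so the resulting array can include blank entries. A minimal sketch of filtering those out is below. The names cellTexts and mockRow are made up for illustration, and the mock object only stands in for a real table row so the snippet runs outside a browser.

```javascript
// Collect the text of each element child of a row, skipping text nodes.
// Works on a real <tr> and on any object that has a `childNodes` list.
function cellTexts(row) {
  const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in the browser
  return Array.from(row.childNodes)
    .filter((node) => node.nodeType === ELEMENT_NODE)
    .map((cell) => cell.textContent.trim());
}

// Hypothetical stand-in for a parsed <tr>, including the whitespace text
// nodes (nodeType 3) that indentation in the HTML source produces.
const mockRow = {
  childNodes: [
    { nodeType: 3, textContent: '\n  ' },
    { nodeType: 1, textContent: 'Alice' },
    { nodeType: 3, textContent: '\n  ' },
    { nodeType: 1, textContent: ' 42 ' },
    { nodeType: 3, textContent: '\n' },
  ],
};

console.log(cellTexts(mockRow)); // [ 'Alice', '42' ]
```

In Puppeteer the same function could probably be handed over as await page.$eval('.fluid-table__row', cellTexts), because it does not refer to anything outside its own body.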

Related

Cannot select element using Puppeteer

I'm using puppeteer to scrape some data off of a website, but all of my selections for a certain element are always undefined.
const tempFunction = await page.evaluate(() => {
  let a = document.querySelectorAll(".flex.flex-wrap.w-100.flex-grow-0.flex-shrink-0.ph2.pr0-xl.pl4-xl.mt0-xl.mt3")
  let container = document.querySelector(".flex.flex-wrap.w-100.flex-grow-0.flex-shrink-0.ph2.pr0-xl.pl4-xl.mt0-xl.mt3")
  let b = container.getElementsByClassName("mb1 ph1 pa0-xl bb b--near-white w-33")
  return b
})
For some reason this code always returns undefined, but similar code works fine.
const checkData = await page.evaluate(() => {
  let tempArray = []
  let element = document.querySelectorAll('.weather-block')
  tempArray.push(element[0].innerText)
  return tempArray
})
Even when I try more specific selectors or IDs, I only get undefined. I'm not sure where to go from here.

Data are blank when scraping website (cheerio.js)

I'm trying to scrape data from a CDC website.
I'm using cheerio.js to fetch the data, and copying the HTML selector into my code, like so:
const listItems = $('#tab1_content > div > table > tbody > tr:nth-child(1) > td:nth-child(3)');
However, when I run the program, I just get a blank array. How is this possible? I'm copying the HTML selector verbatim into my code, so why is this not working? Here is a short video showing the issue: https://youtu.be/a3lqnO_D4pM
Here is my full code, along with a link where you can run it:
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");
// URL of the page we want to scrape
const url = "https://nccd.cdc.gov/DHDSPAtlas/reports.aspx?geographyType=county&state=CO&themeId=2&filterIds=5,1,3,6,7&filterOptions=1,1,1,1,1";
// Async function which scrapes the data
async function scrapeData() {
  try {
    // Fetch HTML of the page we want to scrape
    const { data } = await axios.get(url);
    // Load HTML we fetched in the previous line
    const $ = cheerio.load(data);
    // Select the third cell of the first row of the table in #tab1_content
    const listItems = $('#tab1_content > div > table > tbody > tr:nth-child(1) > td:nth-child(3)');
    // Stores data in array
    const dataArray = [];
    // Use .each method to loop through the elements
    listItems.each((idx, el) => {
      // Object holding data
      const dataObject = { name: "" };
      // Store the text content in the above object
      dataObject.name = $(el).text();
      // Populate array with data
      dataArray.push(dataObject);
    });
    // Log array to the console
    console.dir(dataArray);
  } catch (err) {
    console.error(err);
  }
}
// Invoke the above function
scrapeData();
Run the code here: https://replit.com/#STCollier/Web-Scraping#index.js
Thanks for any help.

How do I print just the first and second element of an array?

I have an array called tagline that looks like this:
[" Leger Poll", " Web survey of 2", "test", "test", "test", "test"]
It is pulled from an external CSV file, and I have assigned it to a variable named tagline.
I want to print the first and second elements using document.getElementById so that I can style the text, but it is not working and I am not sure why. I tried moving the variable outside the main function so that it would be global, but it still does not work. I am a beginner coder. Here is what I have. Please help.
var tagline = [];
async function getData() {
  // const response = await fetch('testdata.csv');
  var response = await fetch('data/test3.csv');
  var data = await response.text();
  data = data.replace(/"/g, "");
  var years = [];
  var vals = [];
  var rows = data.split('\n').slice(1);
  rows = rows.slice(0, rows.length - 1);
  rows = rows.filter(row => row.length !== 0)
  rows.forEach(row => {
    var cols = row.split(",");
    years.push(cols[0]);
    vals.push(0 + parseFloat(cols[1]));
    tagline.push(cols[2]);
  });
  console.log(years, vals, tagline);
  return { years, vals, tagline };
}
var res = tagline.slice(1);
document.getElementById("demo1").innerHTML = res;
var res2 = tagline.slice(2);
document.getElementById("demo2").innerHTML = res2;
It seems you defined the function getData() but never called it, so it never runs.
Because it is an async function, I am using then() to wait for it to finish.
var tagline = [];
async function getData() { /* ... your function ... */ }
getData().then(() => {
  const res = tagline[0];
  document.getElementById("demo1").innerHTML = res;
  const res2 = tagline[1];
  document.getElementById("demo2").innerHTML = res2;
});
To access a specific index of an array use:
array[index];
In your case:
tagline[0]; //first element
tagline[1]; //second element
Since getData is async, you must await it so that it has filled tagline before you read from it:
await getData(); // call it before you use the tagline array
If you are using an older version of JS that does not support async/await, you need to wait for the promise with .then instead.
Also, be aware:
The slice() method returns a shallow copy of a portion of an array
into a new array object selected from start to end (end not included)
where start and end represent the index of items in that array.
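To make the difference between indexing and slice() concrete, here is a small sketch. It assumes a hypothetical parseCsv helper and an inline sample string in place of the real data/test3.csv, so that it runs without fetch. The parsing mirrors the question's code but is not the asker's exact function.

```javascript
// Pure parsing step, kept separate from the network call so it runs anywhere.
function parseCsv(text) {
  const years = [];
  const vals = [];
  const tagline = [];
  text
    .replace(/"/g, '')
    .split('\n')
    .slice(1) // drop the header row
    .filter((row) => row.trim().length !== 0)
    .forEach((row) => {
      const cols = row.split(',');
      years.push(cols[0]);
      vals.push(parseFloat(cols[1]));
      tagline.push(cols[2]);
    });
  return { years, vals, tagline };
}

// Made-up sample standing in for the contents of the CSV file.
const sample =
  'year,value,tagline\n2019,1.5, Leger Poll\n2020,2.5, Web survey of 2\n2021,3.5,test\n';

const { tagline } = parseCsv(sample);

const first = tagline[0]; // a single element
const second = tagline[1]; // a single element
const firstTwo = tagline.slice(0, 2); // new array holding elements 0 and 1
const allButFirst = tagline.slice(1); // new array holding everything from index 1 on

console.log(first, second, firstTwo, allButFirst);
```

In the page itself the parsing would still sit behind the awaited fetch, so first and second should only be read after getData() has resolved.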

CefSharp does not marshal DOM nodes

When I execute JavaScript in CefSharp using EvaluateScriptAsync(), I can return simple values such as a string or an array. For example, the following works:
var result = await Browser.EvaluateScriptAsync("Array.from(document.getElementsByTagName('input')).map(element => element.value)");
if (result.Success && result.Result != null)
{
    dynamic values = result.Result;
    foreach (dynamic value in values)
    {
        MessageBox.Show($"Value is: {value}");
    }
}
But once I try to get a DOM element, either one or a list of, I get null:
var result = await Browser.EvaluateScriptAsync("Array.from(document.getElementsByTagName('input'))");
// `result.Success` is `true`, `result.Result` is `null`
I thought that CEFSharp only knows how to marshal primitive types, but object literals also work:
var result = await Browser.EvaluateScriptAsync("({ a: 1, b: 'hello' })");
if (result.Success && result.Result != null)
{
    dynamic obj = result.Result;
    MessageBox.Show($"{{ a: {obj.a}, b: {obj.b} }}");
}
So it turns out that DOM objects are the only thing CefSharp doesn't know how to marshal.
Why? Is there a solution or workaround out there?
Firstly it's important to understand that Javascript is executed in the render process. The result of EvaluateScriptAsync is effectively a DTO, we create an object that represents the result of executing the script.
It's not currently possible to return a HTMLElement or any object that has a cyclic reference.
If we look at HTMLElement as a specific example, it will have a parentElement/parentNode, and the parent has children which include the node itself. You also end up walking the whole DOM tree as well.
CEF has very limited type support for its CefV8Value type, so it's hard to do anything too fancy. See this.
We could potentially add an extension method that wraps the user script in an IIFE and does some instanceof HTMLElement style type checking to return a trimmed down representation of the HTML element. See this for an example of how I'm fudging support for returning a Promise.
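The cyclic-reference problem and the "trimmed down representation" idea can be sketched in plain JavaScript. This is an illustration only, not CefSharp's implementation. describeElement is a hypothetical helper, and the mock objects stand in for real DOM nodes so that the snippet runs outside a browser.

```javascript
// Reduce an element to plain, acyclic data that can cross a process boundary.
function describeElement(el) {
  return {
    tagName: el.tagName,
    id: el.id || null,
    value: 'value' in el ? el.value : null,
    text: (el.textContent || '').trim(),
  };
}

// Hypothetical mock of a parent/child pair with the same back-references
// a real DOM tree has: parent -> children -> child -> parentNode -> parent.
const form = { tagName: 'FORM', id: 'login', textContent: '', children: [] };
const input = {
  tagName: 'INPUT',
  id: 'user',
  value: 'alice',
  textContent: '',
  parentNode: form,
};
form.children.push(input);

let cyclicError = null;
try {
  JSON.stringify(input); // walks into parentNode, then back into children
} catch (err) {
  cyclicError = err; // a TypeError about the circular structure
}

const summary = describeElement(input);
console.log(cyclicError instanceof TypeError, JSON.stringify(summary));
```

With a real page, a script along the lines of Array.from(document.getElementsByTagName('input')).map(describeElement) should return plain objects, provided the function definition is part of the same script string.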
As an alternative to using JavaScript you can now use CefSharp.Dom which is an asynchronous library for accessing the DOM.
It's freely available on
// Add using CefSharp.Dom to access CreateDevToolsContextAsync and related extension methods.
await using var devToolsContext = await chromiumWebBrowser.CreateDevToolsContextAsync();
// Get element by Id
// https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector
var element = await devToolsContext.QuerySelectorAsync<HtmlElement>("#myElementId");
//Strongly typed element types (this is only a subset of the types mapped)
var htmlDivElement = await devToolsContext.QuerySelectorAsync<HtmlDivElement>("#myDivElementId");
var htmlSpanElement = await devToolsContext.QuerySelectorAsync<HtmlSpanElement>("#mySpanElementId");
var htmlSelectElement = await devToolsContext.QuerySelectorAsync<HtmlSelectElement>("#mySelectElementId");
var htmlInputElement = await devToolsContext.QuerySelectorAsync<HtmlInputElement>("#myInputElementId");
var htmlFormElement = await devToolsContext.QuerySelectorAsync<HtmlFormElement>("#myFormElementId");
var htmlAnchorElement = await devToolsContext.QuerySelectorAsync<HtmlAnchorElement>("#myAnchorElementId");
var htmlImageElement = await devToolsContext.QuerySelectorAsync<HtmlImageElement>("#myImageElementId");
var htmlTextAreaElement = await devToolsContext.QuerySelectorAsync<HtmlTextAreaElement>("#myTextAreaElementId");
var htmlButtonElement = await devToolsContext.QuerySelectorAsync<HtmlButtonElement>("#myButtonElementId");
var htmlParagraphElement = await devToolsContext.QuerySelectorAsync<HtmlParagraphElement>("#myParagraphElementId");
var htmlTableElement = await devToolsContext.QuerySelectorAsync<HtmlTableElement>("#myTableElementId");
// Get a custom attribute value
var customAttribute = await element.GetAttributeAsync<string>("data-customAttribute");
//Set innerText property for the element
await element.SetInnerTextAsync("Welcome!");
//Get innerText property for the element
var innerText = await element.GetInnerTextAsync();
//Get all child elements
var childElements = await element.QuerySelectorAllAsync("div");
//Change CSS style background colour
await element.EvaluateFunctionAsync("e => e.style.backgroundColor = 'yellow'");
//Type text in an input field
await element.TypeAsync("Welcome to my Website!");
//Click The element
await element.ClickAsync();
// Simple way of chaining method calls together when you don't need a handle to the HtmlElement
var htmlButtonElementInnerText = await devToolsContext.QuerySelectorAsync<HtmlButtonElement>("#myButtonElementId")
.AndThen(x => x.GetInnerTextAsync());
//Event Handler
//Expose a function to javascript, functions persist across navigations
//So only need to do this once
await devToolsContext.ExposeFunctionAsync("jsAlertButtonClick", () =>
{
_ = devToolsContext.EvaluateExpressionAsync("window.alert('Hello! You invoked window.alert()');");
});
var jsAlertButton = await devToolsContext.QuerySelectorAsync<HtmlButtonElement>("#jsAlertButton");
//Wire up the click event listener to call our exposed function
_ = jsAlertButton.AddEventListenerAsync("click", "jsAlertButtonClick");
//Get a collection of HtmlElements
var divElements = await devToolsContext.QuerySelectorAllAsync<HtmlDivElement>("div");
foreach (var div in divElements)
{
// Get a reference to the CSSStyleDeclaration
var style = await div.GetStyleAsync();
//Set the border to 1px solid red
await style.SetPropertyAsync("border", "1px solid red", important: true);
await div.SetAttributeAsync("data-customAttribute", "123");
await div.SetInnerTextAsync("Updated Div innerText");
}
//Using standard array
var tableRows = await htmlTableElement.GetRowsAsync().ToArrayAsync();
foreach (var row in tableRows)
{
var cells = await row.GetCellsAsync().ToArrayAsync();
foreach (var cell in cells)
{
var newDiv = await devToolsContext.CreateHtmlElementAsync<HtmlDivElement>("div");
await newDiv.SetInnerTextAsync("New Div Added!");
await cell.AppendChildAsync(newDiv);
}
}
//Get a reference to the HtmlCollection and use async enumerable
//Requires .NET Core 3.1 or higher
var tableRowsHtmlCollection = await htmlTableElement.GetRowsAsync();
await foreach (var row in tableRowsHtmlCollection)
{
var cells = await row.GetCellsAsync();
await foreach (var cell in cells)
{
var newDiv = await devToolsContext.CreateHtmlElementAsync<HtmlDivElement>("div");
await newDiv.SetInnerTextAsync("New Div Added!");
await cell.AppendChildAsync(newDiv);
}
}

Struggling to query specific element among others with the same class name using .querySelector

I'm trying to crawl a site using Puppeteer. The data I'm after is spread across multiple tables, and I only want the data from a single one. I was able to reach that table with a rather verbose .querySelector('table.myclass ~ table.myclass'). My issue now is that the code grabs the first item of each table (starting from the correct table, which is the 2nd one), and I can't find a way to make it grab all the data from only the 2nd table.
const puppeteer = require('puppeteer');
const myUrl = "https://coolurl.com";
(async () => {
  const browser = await puppeteer.launch({
    headless: true
  });
  const page = (await browser.pages())[0];
  await page.setViewport({
    width: 1920,
    height: 926
  });
  await page.goto(myUrl);
  let gameData = await page.evaluate(() => {
    let games = [];
    let gamesElms = document.querySelectorAll('table.myclass ~ table.myclass');
    gamesElms.forEach((gameelement) => {
      let gameJson = {};
      try {
        gameJson.name = gameelement.querySelector('.myclass2').textContent;
      } catch (exception) {
        console.warn(exception);
      }
      games.push(gameJson);
    });
    return games;
  });
  console.log(gameData);
  await browser.close();
})();
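You can use either of the following methods to select the second table. Note that :nth-child(2) matches only when the table is the second child of its parent, so the index form is the safer choice if other elements sit between the tables:
let gamesElms = document.querySelectorAll('table.myclass')[1];
let gamesElms = document.querySelector('table.myclass:nth-child(2)');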
Additionally, you can use the example below to push all of the data from the table to an array:
let games = Array.from(document.querySelectorAll('table.myclass:nth-child(2) tr'), e => {
return Array.from(e.querySelectorAll('th, td'), e => e.textContent);
});
// console.log(games[rowNum][cellNum]); <-- textContent
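The index-based variant can be wrapped in one self-contained function, sketched below. nthTableToMatrix and the mock tables are hypothetical names used for illustration. The function relies only on the rows and cells collections that real table and row elements expose, so plain objects are enough to try it outside a browser.

```javascript
// Pick one table out of a list by index and turn it into a 2D array of text.
function nthTableToMatrix(tables, index) {
  const table = tables[index];
  if (!table) return []; // no such table on the page
  return Array.from(table.rows, (row) =>
    Array.from(row.cells, (cell) => cell.textContent.trim())
  );
}

// Hypothetical stand-ins for two <table class="myclass"> elements.
const mockTables = [
  { rows: [{ cells: [{ textContent: 'ignored' }] }] },
  {
    rows: [
      { cells: [{ textContent: 'Game' }, { textContent: 'Score' }] },
      { cells: [{ textContent: ' Chess ' }, { textContent: '3' }] },
    ],
  },
];

const games = nthTableToMatrix(mockTables, 1);
console.log(games[1][0]); // Chess
```

In Puppeteer this could likely be called as await page.$$eval('table.myclass', nthTableToMatrix, 1), since $$eval passes the matched elements as the first argument and forwards any extra arguments after it.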
