Using Cheerio and axios, I am trying to get the text "Version" and "1.7.0" from the divs inside nested tds in a table on a VS Code Marketplace page.
I have tried this and a bunch of other ways to pinpoint the div text at the bottom, but I'm not sure I'm addressing it correctly. I am unsure where to start to get the nested elements inside the table, and I am pretty confused. Any help with this surely simple problem is appreciated.
const cheerio = require('cheerio');
const axios = require('axios');
const url = "https://marketplace.visualstudio.com/items?itemName=bloumbs.borders-dark"
axios.get(url).then((response) => {
const $ = cheerio.load(response.data)
// With this I get no response:
$('.ux-table-metadata > tbody > tr > td > div').each(() => {
console.log($(this).text());
});
// And with this method, it returns "null"
let version = $('.ux-table-metadata tbody tr td div').html($.versionText)
console.log(version)
})
This is the section of html I'm working with:
<div class="ux-section-other">
<h3 class="itemdetails-section-header right">More Info</h3>
<div>
<table class="ux-table-metadata">
<tbody>
<tr>
<td>
<div>Version</div>
</td>
<td>
<div>1.7.0</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
$(".ux-table-metadata > tbody > tr > td").each(function() {
console.log($(this).find("div").html());
});
or
$(".ux-table-metadata > tbody > tr > td").each(function() {
console.log($(this).children().html());
});
Tested with jQuery; the same selectors should work in Cheerio, which mirrors jQuery's traversal API.
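One likely reason the first attempt in the question prints nothing useful is that its callback is an arrow function, which does not get its own `this`, so `$(this)` is not the current element. Here is a minimal sketch of the difference. The `each` helper is a hypothetical stand-in that only imitates how a jQuery/Cheerio-style `.each()` invokes its callback; it is not Cheerio's implementation.

```javascript
// Hypothetical stand-in for a jQuery/Cheerio-style .each(): it calls the
// callback with `this` bound to the current element and passes (index, element).
function each(elements, callback) {
  elements.forEach((element, index) => callback.call(element, index, element));
}

// Plain objects standing in for the two <div> elements from the question.
const cells = [{ text: 'Version' }, { text: '1.7.0' }];

const viaFunction = [];
each(cells, function () {
  // A regular function receives the current element as `this`.
  viaFunction.push(this.text);
});

const viaArrow = [];
each(cells, (index, element) => {
  // An arrow function ignores the bound `this`, so read the element argument instead.
  viaArrow.push(element.text);
});

console.log(viaFunction);
console.log(viaArrow);
```

So either keep `function () { ... }` as in the snippets above, or keep the arrow function and use its second argument instead of `this`.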
Related
I am trying to get all of the td elements that have a class name of calendarCellRegularPast. I have made multiple attempts with page.$, page.$$, page.eval and many others and cannot seem to get the proper element. I got the object at one point but was unable to parse it. I know that there is a way to do this, but I am new to Puppeteer and JavaScript and just cannot figure it out.
This is a calendar of scheduled work days and I am trying to get even further down and grab the date and time. I want to have that in a for loop, though, so I can grab all of the work times.
I don't need a fix for my code, as I know it's messy; I just need to know how to grab those elements.
Here is some of my code so far, though it is messy as I was trying numerous things:
const el = await page.$("#scrollContainer > table > tbody > tr:nth-child(2) > td:nth-child(1)")
const className = await el.getProperty('className')
const getElm = async() => {
try {
const elm = await page.evaluate(()=> {
let persons = [];
const weekElm = document.querySelectorAll("#scrollContainer > table > tbody");
document.querySelector("#scrollContainer > table > tbody > tr:nth-child(2) > td:nth-child(3)")
console.log('\n\n\n', weekElm.length, '\n\n\n');
try {
console.log(weekElm);
} catch (e) {
console.error('could not log "weeksElm"');
}
for (let i = 2; i < weekElm.length; i++) {
try {
const trd = weekElm[i];
console.log(trd);
const tr = document.querySelectorAll(`tr:nth-child${i+1}`)
persons.push(tr);
try {
console.log('\n\n\n', tr, '\n\n\n');
} catch (e) {
console.error('could not log "\\n\\n\\n, tr, \\n\\n\\n"');
}
for ( let j = 0; j < tr.length; i++) {
const td = (tr.querySelectorAll(`td:nth-child${j}`))
try {
pass
} catch (e) {
console.error('could not log "weeksElm"');
}
}
} catch (e) {}
}
})
console.log(persons);
} catch (e){
console.error("unable to get the work schedule \n unfortunately i still work");
}
}
}
Firstly, I would recommend using an HTTP client and cheerio.js for scraping, as it's faster and more lightweight. But I can understand why you would use Puppeteer because of authentication.
You can get the HTML of the page in Puppeteer with page.content() and then pass that into Cheerio. But if you wanted to do it using an HTTP client, it would look like this:
const axios = require("axios");
const cheerio = require("cheerio");
const PAGEURL = "https://example.com"; // axios needs a full URL, including the scheme
// axios.get() already returns a promise, so there is no need to wrap it in new Promise()
async function getData() {
  const res = await axios.get(PAGEURL);
  const $ = cheerio.load(res.data);
  // scrape data here
}
First, here's a mock-up of your HTML:
<table class="etmSchedualTable">
<tbody>
<tr>
<td class="calanderCellRegularPast">
<table class="etmCursor">
<tbody>
<tr>
<td>
<span class="calanderDateNormal"> 01</span>
</td>
<td>
<span class="calanderCellRegularPast etmNoBorder">
<span>17:15</span> " - "
<span>23:45</span>
<span class="etmMoreLink">more...</span>
</span>
</td>
</tr>
<tr>
<td>
<span class="calanderDateNormal"> 02</span>
</td>
<td>
<span class="calanderCellRegularPast etmNoBorder">
<span>17:15</span> " - "
<span>23:45</span>
<span class="etmMoreLink">more...</span>
</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
At the end, we want an array of objects that look like this:
{
date: 1,
start: "17:15",
end: "23:45"
}
To do so, we first need to select the parent that holds all the items we want. In this case it's $(".etmCursor tbody tr"). This will give us a list of all the trs in the table .etmCursor.
Now we have to loop through the table rows (tr) and build an object from each one.
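const timetable = $(".etmCursor tbody tr").map((i, elm) => {
  return {
    // class names are spelled "calander..." to match the mock-up above
    date: $(elm).find(".calanderDateNormal").text(),
    // the two times are the first and second child spans of the wrapper span
    start: $(elm).find(".calanderCellRegularPast > span:nth-child(1)").text(),
    end: $(elm).find(".calanderCellRegularPast > span:nth-child(2)").text(),
  }
}).toArray();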
Cheerio's .map() passes its arguments in the reverse order of a normal Array map: the index comes first and the element second. And since it returns a Cheerio collection, we have to call toArray() to turn it into a plain array.
We use .find() to search within the selected element.
.text() then gets us the text of that element.
Now we should have the data you wanted.
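Note that `.text()` returns raw strings, so with the mock-up above the date would come back as " 01" rather than the number 1 shown in the target object. A minimal sketch of a clean-up step follows. `normalizeEntry` is a hypothetical helper name, and the sample values are simply copied from the mock-up.

```javascript
// Hypothetical post-processing step: turn the raw strings from .text()
// into the shape of the target object (numeric date, trimmed times).
function normalizeEntry(raw) {
  return {
    date: Number.parseInt(raw.date.trim(), 10),
    start: raw.start.trim(),
    end: raw.end.trim(),
  };
}

// Sample values as they appear in the mock-up HTML (note the leading space).
const scraped = [
  { date: ' 01', start: '17:15', end: '23:45' },
  { date: ' 02', start: '17:15', end: '23:45' },
];

const timetable = scraped.map(normalizeEntry);
console.log(timetable);
```

If you would rather keep the date as a string, just drop the `parseInt` call and keep the `trim()`.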
I am working on scraping a bunch of pages with Puppeteer. The content is not differentiated with classes/ids/etc. and is presented in a different order between pages. As such, I will need to select the elements based on their inner text. I have included a simplified sample html below:
<table>
<tr>
<th>Product name</th>
<td>Shakeweight</td>
</tr>
<tr>
<th>Product category</th>
<td>Exercise equipment</td>
</tr>
<tr>
<th>Manufacturer name</th>
<td>The Shakeweight Company</td>
</tr>
<tr>
<th>Manufacturer address</th>
<td>
<table>
<tr><td>123 Fake Street</td></tr>
<tr><td>Springfield, MO</td></tr>
</table>
</td>
</tr>
In this example, I would need to scrape the manufacturer name and manufacturer address. So I suppose I would need to select the appropriate tr based upon the inner text of the nested th and scrape the associated td within that same tr. Note that the order of the rows of this table is not always the same and the table contains many more rows than this simplified example, so I can't just select the 3rd and 4th td.
I have tried to select an element based on its inner text using XPath as below, but it does not seem to be working:
var manufacturerName = document.evaluate("//th[text()='Manufacturer name']", document, null, XPathResult.ANY_TYPE, null)
This wouldn't even be the data I would need (it would be the td associated with this th), but I figured this would be step 1 at least. If someone could provide input on the strategy to select by inner text, or to select the td associated with this th, I'd really appreciate it.
This is really an XPath question and isn't specific to Puppeteer, so this question might also help, as you're going to need to find the <td> that comes after the <th> you've found: XPath:: Get following Sibling
But your xpath does work for me. In Chrome DevTools on the page with the HTML in your question, run this line to query the document:
$x('//th[text()="Manufacturer name"]')
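NOTE: $x() is a helper function that only works in Chrome DevTools, though Puppeteer has a similar Page.$x function (recent Puppeteer releases have removed Page.$x; there the same expression can be passed to page.$$ with an xpath/ prefix).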
That expression should return an array with one element, the <th> with that text in the query. To get the <td> next to it:
$x('//th[text()="Manufacturer name"]/following-sibling::td')
And to get its inner text:
$x('//th[text()="Manufacturer name"]/following-sibling::td')[0].innerText
Once you're able to follow that pattern you should be able to use similar strategies to get the data you want in puppeteer, similar to this:
const puppeteer = require('puppeteer');
const main = async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://127.0.0.1:8080/'); // <-- EDIT THIS
const mfg = await page.$x('//th[text()="Manufacturer name"]/following-sibling::td');
const prop = await mfg[0].getProperty('innerText');
const text = await prop.jsonValue();
console.log(text);
await browser.close();
}
main();
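Since you need more than one field (manufacturer name and manufacturer address), it may help to build the expression from the label instead of hard-coding it. Here is a rough sketch. `xpathLiteral` and `cellXPathFor` are hypothetical helper names, not part of Puppeteer; the only thing they do is assemble the same XPath string used above.

```javascript
// XPath 1.0 has no escape character inside string literals, so a label that
// contains both quote types has to be assembled with concat().
function xpathLiteral(value) {
  if (!value.includes('"')) return `"${value}"`;
  if (!value.includes("'")) return `'${value}'`;
  const separator = `, '"', `;
  const parts = value.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(separator)})`;
}

// Build the "td next to the th with this exact text" expression for any label.
function cellXPathFor(label) {
  return `//th[text()=${xpathLiteral(label)}]/following-sibling::td`;
}

console.log(cellXPathFor('Manufacturer name'));
console.log(cellXPathFor('Manufacturer address'));
```

The resulting strings can then be passed to the same XPath query call as in the script above, once per label.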
As per your use case explanation in the above answer, here is the logic for the use case:
await page.goto(url, { waitUntil: 'networkidle2' }); // Go to webpage url
await page.waitForSelector('table'); // wait for an element that contains the text
const textDataArr = await page.evaluate(() => {
  const trArr = Array.from(document.querySelectorAll('table tbody tr'));
  // Find the tr whose th innerText equals 'Manufacturer name'
  // (rows of the nested table have no th, so guard against null)
  const row = trArr.find((tr) => {
    const th = tr.querySelector('th');
    return th !== null && th.innerText.trim() === 'Manufacturer name';
  });
  // If the row is found, return the innerText of its td; otherwise return undefined
  const td = row ? row.querySelector('td') : null;
  return td ? td.innerText : undefined;
});
console.log(textDataArr);
You can do something like this to get the data:
await page.goto(url, { waitUntil: 'networkidle2' }); // Go to webpage url
await page.waitForSelector('table'); // wait for an element that contains the text
const textDataArr = await page.evaluate(() => {
  const element = document.querySelector('table tbody tr:nth-child(3) td'); // select the third row's td element
  return element && element.innerText; // returns the text, or null if the element is not found
});
console.log(textDataArr);
A simple way to get those all at once:
let data = await page.evaluate(() => {
return [...document.querySelectorAll('tr')].reduce((acc, tr, i) => {
let cells = [...tr.querySelectorAll('th,td')].map(el => el.innerText)
acc[cells[0]] = cells[1]
return acc
}, {})
})
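To see what that reduce produces without opening a browser, here is the same logic run on plain arrays. Each inner array stands for the cell texts of one row of the sample table; the exact whitespace of the address cell is an assumption, since `innerText` of a nested table depends on how the browser renders it.

```javascript
// Each inner array stands for one <tr>: the texts of its th/td cells in order.
const rows = [
  ['Product name', 'Shakeweight'],
  ['Product category', 'Exercise equipment'],
  ['Manufacturer name', 'The Shakeweight Company'],
  // Assumed rendering of the nested address table as two lines of text.
  ['Manufacturer address', '123 Fake Street\nSpringfield, MO'],
];

// Same reduce as above: the first cell becomes the key, the second the value.
const data = rows.reduce((acc, cells) => {
  acc[cells[0]] = cells[1];
  return acc;
}, {});

console.log(data['Manufacturer name']);
console.log(data['Manufacturer address']);
```

Because the result is keyed by the header text, the order of the rows on each page no longer matters.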
I'm looking to retrieve the text inside an HTML table that is rendered via a WebGrid. The text that I want is located inside a div with the class productID. My starting reference point is in the same row, in the last td with the class span2. I'm trying to use jQuery's closest() method, but I'm not getting any value returned.
Please see below for a section of the rendered HTML and my jQuery function:
HTML:
<tr>
<td class="span1"><div class="productID">1</div></td>
<td class="span2">Listing</td>
<td class="span2">Full Districtution</td>
<td class="span2">$1,350.00</td>
<td class="span2">2016-01-01</td>
<td class="span2"><div title="This is my brand new title!" data-original-title="" class="priceToolTip">2016-04-30</div></td>
<td>Select</td>
</tr>
jQuery:
$(".priceToolTip").mouseover(function () {
var row = $(this).closest("span1").find(".productID").parent().find(".productID").text();
console.log("Closest row is: " + row);
});
The .closest() method looks for a match among the element itself and its ancestors. So you can use it to grab the tr and then look for .productID like so:
var productID = $(this).closest('tr').find('.productID').text();
Or, walking up from the div to its td and then to the tr:
var productID = $(this).parent().parent().find('.productID').text();
Or, going from the div's td to its sibling td:
var productID = $(this).parent().siblings('.span1').find('.productID').text();
.span1 is not an ancestor of .priceToolTip, so closest() cannot reach it (and "span1" without the leading dot is a tag selector, not a class selector). Use closest("tr").find(".span1 .productID") as follows:
$(".priceToolTip").mouseover(function () {
var row = $(this).closest("tr").find(".span1 .productID").text();
console.log("Closest row is: " + row);
});
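If the direction of the search is what's confusing, here is a small model of it in plain JavaScript. The `node`, `closestByTag` and `findByClass` helpers are hypothetical and only imitate which way `.closest()` and `.find()` walk the tree; they are not jQuery's implementation.

```javascript
// Build a plain-object tree node and link its children back to it.
function node(tag, className, children = []) {
  const self = { tag, className, children, parent: null };
  children.forEach((child) => { child.parent = self; });
  return self;
}

// Walk upwards: the element itself, then its ancestors (like .closest()).
function closestByTag(start, tag) {
  for (let current = start; current !== null; current = current.parent) {
    if (current.tag === tag) return current;
  }
  return null;
}

// Walk downwards through the descendants (like .find()).
function findByClass(root, className) {
  for (const child of root.children) {
    if (child.className === className) return child;
    const match = findByClass(child, className);
    if (match !== null) return match;
  }
  return null;
}

// A cut-down version of the row from the question.
const productId = node('div', 'productID');
productId.text = '1';
const tooltip = node('div', 'priceToolTip');
const row = node('tr', '', [
  node('td', 'span1', [productId]),
  node('td', 'span2', [tooltip]),
]);

// Searching upwards from the tooltip never meets td.span1: it is in a sibling branch.
const wrong = closestByTag(tooltip, 'span1');
// Going up to the row and then down again reaches the div.
const right = findByClass(closestByTag(tooltip, 'tr'), 'productID');

console.log(wrong);
console.log(right.text);
```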
I would like to split this entire table into three sub-tables using JavaScript. Each table should retain its header information.
I cannot adjust the ids or classes as they are generated by a web application, so I need to make do with what is available.
I've been trying to crack this with JSFiddle for quite a while and am getting frustrated. I'm pretty new to JavaScript, but can't imagine this would require a lot of code. If anyone knows how to split this apart by row size as well (i.e. split the table up, but selectively), that would be appreciated too.
I'm limited to JavaScript and jQuery 1.7.
<div id="serviceArray">
<table border="1" class="array vertical-array">
<thead>
<tr>
<th>#</th>
<th>Savings</th>
<th>Expenses</th>
<th>Savings</th>
<th>Expenses</th>
<th>Savings</th>
<th>Expenses</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sum</td>
<td>$180</td>
<td>$500</td>
<td>$300</td>
<td>$700</td>
<td>$600</td>
<td>$1000</td>
</tr>
<tr>
<td>Home</td>
<td>$100</td>
<td>$200</td>
<td>$200</td>
<td>$300</td>
<td>$400</td>
<td>$500</td>
</tr>
<tr>
<td>Work</td>
<td>$80</td>
<td>$300</td>
<td>$100</td>
<td>$400</td>
<td>$200</td>
<td>$500</td>
</tr>
</tbody>
</table>
</div>
Did you mean like this?
var tables = $('#serviceArray table tbody tr').map(function () { //For each row
var $els = $(this).closest('tbody') //go to its parent tbody
.siblings('thead').add( //fetch thead
$(this) //and add itself (tr)
.wrap($('<tbody/>')) //wrapping itself in tbody
.closest('tbody')); //get itself with its tbody wrapper
return $els.clone() //clone the above created steps , i.e thead and tbody with one tr
.wrapAll($('<table/>', { //wrap them all to a new table with
'border': '1', //attributes.
'class': 'array vertical-array'
})
).closest('table'); //get the new table
}).get();
$('#serviceArray table').remove();
$('body').append(tables); //append all the new tables to the body.
Or just simply clone the table and remove all other trs from tbody except this one and add it to DOM (Much Shorter Solution).
var tables = $('#serviceArray table tbody tr').map(function (idx) {
var $table = $(this).closest('table').clone().find('tbody tr:not(:eq(' + idx + '))').remove().end();
return $table;
}).get();
Each of the methods used is documented online, so you can use the documentation to adapt this to what you need.
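For the "split by row size" part of the question, it may be easier to think about the grouping separately from the DOM work. Below is a rough data-level sketch: `splitTable` is a hypothetical helper that works on plain arrays (the header and rows are copied from the table in the question), and each resulting group would still have to be rendered as its own table with one of the approaches above.

```javascript
// Split the body rows into groups of `size`, repeating the header for each group.
function splitTable(header, rows, size) {
  const tables = [];
  for (let i = 0; i < rows.length; i += size) {
    tables.push({ header, rows: rows.slice(i, i + size) });
  }
  return tables;
}

const header = ['#', 'Savings', 'Expenses', 'Savings', 'Expenses', 'Savings', 'Expenses'];
const rows = [
  ['Sum', '$180', '$500', '$300', '$700', '$600', '$1000'],
  ['Home', '$100', '$200', '$200', '$300', '$400', '$500'],
  ['Work', '$80', '$300', '$100', '$400', '$200', '$500'],
];

const perRow = splitTable(header, rows, 1); // three tables, one row each
const uneven = splitTable(header, rows, 2); // two tables: two rows, then one row

console.log(perRow.length);
console.log(uneven.map((table) => table.rows.length));
```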
You can use plain JavaScript to create the table; it will generate rows according to the response returned from your API.
var tableHeader = this.responseJsonData.Table_Headers;
var tableData = this.responseJsonData.Table_Data;
let table = document.querySelector("table");
function generateTableHead(table, data) {
//alert("In Table Head");
let thead = table.createTHead();
let row = thead.insertRow();
for (let key of data) {
let th = document.createElement("th");
let text = document.createTextNode(key);
th.appendChild(text);
row.appendChild(th);
}
}
function generateTable(table, data) {
// alert("In Generate Head");
for (let element of data) {
let row = table.insertRow();
for (const key in element) {
let cell = row.insertCell();
let text = document.createTextNode(element[key]);
cell.appendChild(text);
}
}
}
I have a page with 2-3 tables. In those tables I want to change the text of a specific column located in <thead> and also a value in each <td> line, and I would like to get the id from each line.
What is the fastest way to do this, performance-wise?
HTML
Table-Layout:
<table class="ms-viewtable">
<thead id="xxx">
<tr class ="ms-viewheadertr">
<th>
<th>
<tbody>
<tr class="ms-itmHover..." id="2,1,0">
<td>
<td>
<tr class="ms-itmHover..." id="2,2,0">
<td>
<td>
</table>
JavaScript
Script with that I started:
$('.ms-viewtable').each(function () {
  var $table = $(this);
  $table.find('tr > th').each(function () {
    //Code here
  });
  $table.find('tr > td').each(function () {
    //Code here
  });
});
How can I get the id? Is there a better way to do what I want?
You can get the id of an element by calling .attr on "id" i.e. $(this).attr("id");.
In jQuery, the most direct way to get to any element is by giving it an ID and referencing it.
I would structure it the other way around - give the table elements meaningful IDs, and then put the information that I'd like to retrieve in their class attributes.
<tr id="ms-itmHover..." class="2,2,0">
And then retrieve it as follows: $('#ms-itmHover...').attr('class');
You can get the IDs by "mapping" from table row to associated ID thus:
var ids = $table.find('tbody > tr').map(function() {
return this.id;
}).get();
You can access individual cells using the .cells property of the table row:
$table.find('tbody > tr').each(function() {
  var cell = this.cells[i]; // where 'i' is desired column number
  ...
});
Go through all tables, collect all rows, and use their identifiers as needed:
var num = 0; // index of the column to change (assumed here; adjust as needed)
$('table.ms-viewtable').each(function(){
  $(this).find('tr').each(function(){
    var cells = $(this).children(); //all cells (ths or tds)
    if (this.parentNode.nodeName == 'THEAD') {
      cells.eq(num).html('header row '+this.parentNode.id);
    } else { // in "TBODY"
      cells.eq(num).html('body row '+this.id);
    }
  });
});