I am trying to get all of the td elements that have a class name of calendarCellRegularPast. I have tried page.$, page.$$, page.eval and many others and cannot seem to get the proper element. I got the object at one point but was unable to parse it. I know that there is a way to do this, but I am new to Puppeteer and JavaScript and just cannot figure it out.
This is a calendar of scheduled work days, and I am trying to go further down and grab the date and time of each one. I want to do that in a for loop so I can grab all of the work times.
I don't need a fix for my code, as I know it's messy; I just need to know how to grab those elements.
Here is some of my code so far, though it is messy as I was trying numerous things:
const el = await page.$("#scrollContainer > table > tbody > tr:nth-child(2) > td:nth-child(1)")
const className = await el.getProperty('className')
const getElm = async() => {
try {
const elm = await page.evaluate(()=> {
let persons = [];
const weekElm = document.querySelectorAll("#scrollContainer > table > tbody");
document.querySelector("#scrollContainer > table > tbody > tr:nth-child(2) > td:nth-child(3)")
console.log('\n\n\n', weekElm.length, '\n\n\n');
try {
console.log(weekElm);
} catch (e) {
console.error('could not log "weeksElm"');
}
for (let i = 2; i < weekElm.length; i++) {
try {
const trd = weekElm[i];
console.log(trd);
const tr = document.querySelectorAll(`tr:nth-child${i+1}`)
persons.push(tr);
try {
console.log('\n\n\n', tr, '\n\n\n');
} catch (e) {
console.error('could not log "\\n\\n\\n, tr, \\n\\n\\n"');
}
for ( let j = 0; j < tr.length; i++) {
const td = (tr.querySelectorAll(`td:nth-child${j}`))
try {
pass
} catch (e) {
console.error('could not log "weeksElm"');
}
}
} catch (e) {}
}
})
console.log(persons);
} catch (e){
console.error("unable to get the work schedule \n unfortunately i still work");
}
}
}
Firstly, I would recommend using an HTTP client and cheerio.js for scraping, as it's faster and more lightweight. But I can understand why you would use Puppeteer because of authentication.
You can get the HTML of the page from Puppeteer with page.content() and then pass it into cheerio. But if you wanted to do it using an HTTP client, it would look like this:
const axios = require("axios");
const cheerio = require("cheerio");
const PAGEURL = "https://example.com"; // axios needs the protocol in the URL
async function getData() {
  // axios.get already returns a promise, so there is no need to wrap it in new Promise
  const res = await axios.get(PAGEURL);
  const $ = cheerio.load(res.data);
  // scrape data here
}
First, here's a mock-up of your HTML:
<table class="etmSchedualTable">
<tbody>
<tr>
<td class="calanderCellRegularPast">
<table class="etmCursor">
<tbody>
<tr>
<td>
<span class="calanderDateNormal"> 01</span>
</td>
<td>
<span class="calanderCellRegularPast etmNoBorder">
<span>17:15</span> " - "
<span>23:45</span>
<span class="etmMoreLink">more...</span>
</span>
</td>
</tr>
<tr>
<td>
<span class="calanderDateNormal"> 02</span>
</td>
<td>
<span class="calanderCellRegularPast etmNoBorder">
<span>17:15</span> " - "
<span>23:45</span>
<span class="etmMoreLink">more...</span>
</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
At the end, we want an array with objects that look like this:
{
date: 1,
start: "17:15",
end: "23:45"
}
To do so, we first need to select the parent that holds all the items we want. In this case it's $(".etmCursor tbody tr"). This will give us a list of all the tr's in the table .etmCursor.
Now we have to loop through the table rows (tr) and build an object from each one.
// Class names are spelled as in the mock-up above; adjust them to match the real page
const timetable = $(".etmCursor tbody tr").map((i, elm) => {
  return {
    date: Number($(elm).find(".calanderDateNormal").text().trim()),
    start: $(elm).find(".calanderCellRegularPast > span:nth-child(1)").text(),
    end: $(elm).find(".calanderCellRegularPast > span:nth-child(2)").text(),
  }
}).toArray();
Cheerio's .map() passes its callback arguments in the opposite order to Array.prototype.map: the index comes first and the element second. And since it returns a Cheerio collection, we have to call .toArray() to turn it into a plain array.
We use .find() to search within the selected element.
.text() then gets us the text of that element.
Now we should have the data you wanted.
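If you would rather stay in Puppeteer, one option is to pull the visible text of each row out of the page and parse it in Node. The sketch below covers only the parsing step. parseShiftRow is a hypothetical helper name, and the regular expression assumes each row's text looks like the mock-up above (a day number followed by two HH:MM times).

```javascript
// Hypothetical helper (not part of Puppeteer or cheerio).
// Assumes the row text contains a day number followed by two HH:MM times,
// e.g. ' 01 17:15 " - " 23:45 more...'.
function parseShiftRow(text) {
  const match = text.match(/(\d{1,2})\D+?(\d{1,2}:\d{2})\D+?(\d{1,2}:\d{2})/);
  if (!match) return null; // row without a shift, e.g. a day off
  return { date: Number(match[1]), start: match[2], end: match[3] };
}

// Sample row texts standing in for what the page would return.
const rowTexts = [
  ' 01 17:15 " - " 23:45 more...',
  ' 02 17:15 " - " 23:45 more...',
  ' 03',
];
const timetable = rowTexts.map(parseShiftRow).filter((shift) => shift !== null);
console.log(timetable);
```

With Puppeteer, the row texts could come from something like await page.$$eval('.etmCursor tbody tr', rows => rows.map(row => row.innerText)), with the selector adjusted to the real page.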
I am having a problem when I try to update the DOM with new information coming from an API.
Every time I click to add a new user, the table displays the old information again along with the new. Ideally, it would update the array first and then display only the new information. I would like the DOM to update with only the information of the new user every time the user clicks on add new user.
HTML part
<table class="table is-fullwidth table is-hoverable table-info">
<thead>
<tr>
<th title="Channel Name" class="has-text-left"> Channel Name </th>
<th title="View per week" class="has-text-right"> View per week </th>
</tr>
</thead>
<tbody id="body-table">
<tr id="tr-table">
</tr>
</tbody>
</table>
script.js
const trline = document.getElementById('body-table')
let usersList = [];
async function getnewUsers(){
const res = await fetch('https://randomuser.me/api')
const data = await res.json()
// create an instance of the results
const user = data.results[0]
// create the new user
const newUser = {
name:`${user.name.first} ${user.name.last}`,
social: Math.floor(Math.random() * 10000 )
}
// update the new user to the database...
addData(newUser)
}
function addData(obj) {
usersList.push(obj)
// update the information on the screen
updateDOM()
}
function updateDOM( providedData = usersList){
providedData.forEach(item => {
const element = document.createElement('tr')
element.innerHTML = `
<td class="has-text-left cname"> ${item.name} </td>
<td class="has-text-right cview"> ${item.social} k</td>
`
trline.appendChild(element)
})
}
addUser.addEventListener('click', getnewUsers)
I found the problem and the solution.
I wasn't clearing the table body before adding the new rows, so every call to updateDOM appended the whole list again. I fixed the function updateDOM by adding this at the top: trline.innerHTML = ''
After that, the function works fine.
function updateDOM( providedData = usersList){
trline.innerHTML = '' // clear everything before adding new stuff
providedData.forEach(item => {
const element = document.createElement('tr')
element.innerHTML = `
<td class="has-text-left cname"> ${item.name} </td>
<td class="has-text-right cview"> ${item.social} k</td>
`
trline.appendChild(element)
})
}
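Another way to get the same result is to build the markup for all rows as a single string and assign it once, which replaces whatever was in the tbody before. This is a minimal sketch, assuming the same item shape as above. rowsToHtml and escapeHtml are hypothetical helper names, and the escaping is there because the names come from an external API.

```javascript
// Hypothetical helper: escape characters that have a meaning in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: build the markup for every row in one string.
function rowsToHtml(users) {
  return users
    .map((item) =>
      `<tr><td class="has-text-left cname">${escapeHtml(item.name)}</td>` +
      `<td class="has-text-right cview">${escapeHtml(item.social)} k</td></tr>`)
    .join('');
}

// Sample data standing in for usersList.
const sampleUsers = [{ name: 'Ada <Lovelace>', social: 1234 }];
console.log(rowsToHtml(sampleUsers));
```

In updateDOM this would become a single assignment, trline.innerHTML = rowsToHtml(providedData), so no separate clearing step is needed.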
I am working on scraping a bunch of pages with Puppeteer. The content is not differentiated with classes/ids/etc. and is presented in a different order between pages. As such, I will need to select the elements based on their inner text. I have included a simplified sample html below:
<table>
<tr>
<th>Product name</th>
<td>Shakeweight</td>
</tr>
<tr>
<th>Product category</th>
<td>Exercise equipment</td>
</tr>
<tr>
<th>Manufacturer name</th>
<td>The Shakeweight Company</td>
</tr>
<tr>
<th>Manufacturer address</th>
<td>
<table>
<tr><td>123 Fake Street</td></tr>
<tr><td>Springfield, MO</td></tr>
</table>
</td>
</tr>
</table>
In this example, I would need to scrape the manufacturer name and manufacturer address. So I suppose I would need to select the appropriate tr based upon the inner text of the nested th and scrape the associated td within that same tr. Note that the order of the rows of this table is not always the same and the table contains many more rows than this simplified example, so I can't just select the 3rd and 4th td.
I have tried to select an element based on inner text using XPATH as below but it does not seem to be working:
var manufacturerName = document.evaluate("//th[text()='Manufacturer name']", document, null, XPathResult.ANY_TYPE, null)
This wouldn't even be the data I would need (it would be the td associated with this th), but I figured this would be step 1 at least. If someone could provide input on the strategy to select by inner text, or to select the td associated with this th, I'd really appreciate it.
This is really an XPath question and isn't specific to Puppeteer, so this question might also help, as you're going to need to find the <td> that comes after the <th> you've found: XPath:: Get following Sibling
But your XPath does work for me. In Chrome DevTools on the page with the HTML in your question, run this line to query the document:
$x('//th[text()="Manufacturer name"]')
NOTE: $x() is a helper function that only works in Chrome DevTools, though Puppeteer has a similar page.$x method (recent major versions of Puppeteer removed it in favor of page.$$ with an xpath/ prefixed selector).
That expression should return an array with one element, the <th> with that text in the query. To get the <td> next to it:
$x('//th[text()="Manufacturer name"]/following-sibling::td')
And to get its inner text:
$x('//th[text()="Manufacturer name"]/following-sibling::td')[0].innerText
Once you're able to follow that pattern you should be able to use similar strategies to get the data you want in puppeteer, similar to this:
const puppeteer = require('puppeteer');
const main = async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://127.0.0.1:8080/'); // <-- EDIT THIS
const mfg = await page.$x('//th[text()="Manufacturer name"]/following-sibling::td');
const prop = await mfg[0].getProperty('innerText');
const text = await prop.jsonValue();
console.log(text);
await browser.close();
}
main();
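Since the question needs more than one field, the XPath expression can be built from the header text. This is a small sketch, assuming the labels never contain double quotes. xpathForValueOf is a hypothetical helper name, and normalize-space(.) is used instead of text() so that stray whitespace around the label does not break the match.

```javascript
// Hypothetical helper: XPath for the <td> that follows a <th> with the given label.
function xpathForValueOf(label) {
  if (label.includes('"')) {
    // XPath 1.0 string literals have no escape syntax, so this sketch simply refuses
    throw new Error('labels containing double quotes are not supported');
  }
  return `//th[normalize-space(.)="${label}"]/following-sibling::td`;
}

const labels = ['Manufacturer name', 'Manufacturer address'];
const expressions = labels.map(xpathForValueOf);
console.log(expressions);
```

Each expression can then be used in place of the hard-coded string in the example above.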
Following the use case explained in the answer above, here is the logic:
await page.goto(url, { waitUntil: 'networkidle2' }); // Go to webpage url
await page.waitForSelector('table'); // wait for the table that contains the text
const textDataArr = await page.evaluate(() => {
  const trArr = Array.from(document.querySelectorAll('table tbody tr'));
  // Find the row whose th innerText equals 'Manufacturer name'
  const row = trArr.find((tr) => {
    const th = tr.querySelector('th');
    return th !== null && th.innerText.trim() === 'Manufacturer name';
  });
  // If the row is found, return the innerText of its td; otherwise return undefined
  return row ? row.querySelector('td').innerText : undefined;
});
console.log(textDataArr);
You can do something like this to get the data:
await page.goto(url, { waitUntil: 'networkidle2' }); // Go to webpage url
await page.waitForSelector('table'); // wait for the table that contains the text
const textDataArr = await page.evaluate(() => {
const element = document.querySelector('table tbody tr:nth-child(3) td'); // select the third row's td element
return element && element.innerText; // returns the text, or null if the element is not found
});
console.log(textDataArr);
A simple way to get those all at once:
let data = await page.evaluate(() => {
return [...document.querySelectorAll('tr')].reduce((acc, tr, i) => {
let cells = [...tr.querySelectorAll('th,td')].map(el => el.innerText)
acc[cells[0]] = cells[1]
return acc
}, {})
})
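One thing to watch for with the sample HTML is the nested address table: querySelectorAll('tr') also returns its inner rows, which have no th. Below is a hedged variation that skips those rows, assuming every wanted row has a th and a td as direct children. tableToRecord is a hypothetical name, and it takes the root node as a parameter so it can be tried outside a browser.

```javascript
// Hypothetical helper: map header text to value text,
// skipping rows that do not have a <th> as a direct child.
function tableToRecord(root) {
  const record = {};
  for (const tr of root.querySelectorAll('tr')) {
    const th = tr.querySelector(':scope > th');
    const td = tr.querySelector(':scope > td');
    if (th && td) {
      record[th.innerText.trim()] = td.innerText.trim();
    }
  }
  return record;
}
```

In Puppeteer the same loop would run inside page.evaluate with document as the root, since the evaluated function executes in the page and cannot see helpers defined in Node.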
Using Cheerio and axios, I am trying to get the text "Version" and "1.7.0" from the divs inside nested tds in a table on a VS Code Marketplace page.
I have tried this and a bunch of other ways to pinpoint the div text at the bottom, but I'm not sure that I'm addressing it correctly. I am unsure where to start to get the nested elements inside the table, and I am pretty confused. Any help on this surely simple problem is appreciated.
const cheerio = require('cheerio');
const axios = require('axios');
const url = "https://marketplace.visualstudio.com/items?itemName=bloumbs.borders-dark"
axios.get(url).then((response) => {
const $ = cheerio.load(response.data)
// With this I get no response:
$('.ux-table-metadata > tbody > tr > td > div').each(() => {
console.log($(this).text());
});
// And with this method, it return "null"
let version = $('.ux-table-metadata tbody tr td div').html($.versionText)
console.log(version)
})
This is the section of html I'm working with:
<div class="ux-section-other">
<h3 class="itemdetails-section-header right">More Info</h3>
<div>
<table class="ux-table-metadata">
<tbody>
<tr>
<td>
<div>Version</div>
</td>
<td>
<div>1.7.0</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
$(".ux-table-metadata > tbody > tr > td").each(function() {
console.log($(this).find("div").html());
});
or
$(".ux-table-metadata > tbody > tr > td").each(function() {
console.log($(this).children().html());
});
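Both were tested with jQuery. Note that the callback is a regular function rather than an arrow function, so this refers to the current element.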
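To see why the callback style matters: .each calls its callback with this bound to the current element, and an arrow function ignores that binding. The sketch below imitates that calling convention with a hypothetical stand-in named each (it is not cheerio's or jQuery's implementation) and plain objects in place of elements.

```javascript
// Hypothetical stand-in for a jQuery/cheerio-style .each():
// the callback is invoked with `this` set to the current item, plus (index, item).
function each(items, callback) {
  items.forEach((item, index) => callback.call(item, index, item));
}

// Plain objects standing in for the two <div> elements.
const cells = [{ text: 'Version' }, { text: '1.7.0' }];

const viaFunction = [];
each(cells, function () {
  viaFunction.push(this.text); // `this` is the current item
});

const viaArrow = [];
each(cells, (index, item) => {
  viaArrow.push(item.text); // an arrow function has to use the element argument instead
});

console.log(viaFunction, viaArrow);
```

With cheerio, the arrow-function form of the loop would be .each((i, el) => console.log($(el).text())).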
My goal is to fetch .textContent from different <td> tags, each lying within a separate <tr>.
I think the problem lies with the table variable, as I am not checking the correct variable for children. Currently, the data variable is only fetching the first <tr>, so price evaluates with this code. However, volume and turnover do not. I think it is a simple fix but I just can't figure it out!
JavaScript:
try {
const tradingData = await page.evaluate(() => {
let table = document.querySelector("#trading-data tbody");
let tableData = Array.from(table.children);
let data = tableData.map(tradeData => {
console.log(tradeData);
let price = tradeData.querySelector(".quoteapi-price").textContent;
console.log(price);
let volume = tradeData.querySelector("quoteapi-volume").textContent;
console.log(volume);
let turnover = tradeData.querySelector("quoteapi-value").textContent;
console.log(turnover);
return { price, volume, turnover };
})
return data;
});
console.log(tradingData);
} catch (err) {
console.log(err);
}
HTML:
<table id="trading-data" class="qq_table">
<tbody>
<tr class="qq_tr_border_bot">
<td>Price</td>
<td class="qq_td_right quoteapi-number quoteapi-price" data-quoteapi="price">$0.105</td>
</tr>
<tr class="qq_tr_border_bot">
<td>Change</td>
<td class="qq_td_right pos" data-quoteapi="changeSignCSS">
<span data-quoteapi="change (signed)" class="quoteapi-number quoteapi-price quoteapi-change">0.005</span>
<span data-quoteapi="pctChange (pct)" class="quoteapi-number quoteapi-pct-change">(5.00%)</span>
</td>
</tr>
<tr class="qq_tr_border_bot">
<td>Volume</td>
<td class="qq_td_right quoteapi-number quoteapi-volume" data-quoteapi="volume scale=false">5,119,162</td>
</tr>
<tr>
<td>Turnover</td>
<td class="qq_td_right quoteapi-number quoteapi-value" data-quoteapi="value scale=false">$540,173</td>
</tr>
</tbody>
</table>
For example, this should return price="$0.11", volume="3,900,558", turnover="$412,187"
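There are two problems. The selectors for volume and turnover are missing the leading dot, so they look for <quoteapi-volume> and <quoteapi-value> tags instead of classes. And mapping over the rows runs every selector against every <tr>, so querySelector returns null for rows that do not contain that class and .textContent throws. You only need the map function when you are expecting multiple tables or tbodies. As this seems not to be the case in your example, you can query the tbody directly: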
const tradingData = await page.evaluate(() => {
let table = document.querySelector("#trading-data tbody");
let price = table.querySelector(".quoteapi-price").textContent;
let volume = table.querySelector(".quoteapi-volume").textContent;
let turnover = table.querySelector(".quoteapi-value").textContent;
return { price, volume, turnover };
});
console.log(tradingData);
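If some of these cells can be missing on other pages, a small guard avoids the TypeError that is thrown when querySelector returns null. This is a sketch with hypothetical helper names, textOf and readTradingData; both take the root element as a parameter so they can be exercised without a browser.

```javascript
// Hypothetical helper: text of the first match under root, or null when nothing matches.
function textOf(root, selector) {
  const element = root.querySelector(selector);
  return element ? element.textContent.trim() : null;
}

// Hypothetical helper: collect the three values from the trading-data tbody.
function readTradingData(tbody) {
  return {
    price: textOf(tbody, '.quoteapi-price'),
    volume: textOf(tbody, '.quoteapi-volume'),
    turnover: textOf(tbody, '.quoteapi-value'),
  };
}
```

Inside page.evaluate the helpers would have to be defined within the evaluated function, because that function runs in the page rather than in Node.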
I have an API call that gives me a list of data, and I am iterating over it with ng-repeat (it's a list of more than 100 items).
To get the list, I call the API in the App controller in AngularJS like this:
var path = serverUrl + 'api/getAllMails';
$http.get(path).then(function (result) {
$scope.mails=result
})
To iterate over the mails in the HTML file, I use a table like the one below:
<table>
<tr class="header">
<th class="center">Id</th>
<th class="center">Mode of Payment</th>
<th class="center">Payment Collected</th>
<th class="center">Status</th>
</tr>
<tr ng-repeat="mail in mails">
<td>{{mail.id}}</td>
<td>{{mail.paymentType}}</td>
<td>Rs. {{mail.cost}}
<input type="text" ng-model="mail.cost">
<button ng-click="updateCost(mail.id, mail.cost)">Update Cost</button>
</td>
<td>{{mail.status}}
<input type="text" ng-model="mail.status">
<button ng-click="updateStatus(mail.id, mail.status)">Update Status</button>
</td>
</tr>
</table>
Suppose that in the first row the cost is "100" and the status is "pending", and I want to update only this row, changing the cost to "1000" and the status to "Delivered".
In my AngularJS App controller I have created two methods. They call the APIs, update the data in the database, and return the list of updated mails.
$scope.updateStatus = function(mailId, mailStatus) {
var path = serverUrl + 'api/updateStatus';
$http.get(path, {
params: {
mailId: mailId,
mailStatus: mailStatus
}
}).then(function(result) {
$scope.mails = result
})
}
$scope.updateCost = function(mailId, mailCost) {
var path = serverUrl + 'api/updateStatus';
$http.get(path, {
params: {
mailId: mailId,
mailCost: mailCost
}
}).then(function(result) {
$scope.mails = result
})
}
This code works, but the page takes a long time to load. What can I do to reduce the loading time, or is there a better way to do the same thing?
Any help will be appreciated. Thank you.
You are replacing the entire dataset when there is no reason to; you should only update the row you changed. Make sure your updateStatus and updateCost endpoints return the object that was updated, and then replace just that item in $scope.mails.
For example:
$scope.updateCost = function(mailId, mailCost) {
  var path = serverUrl + 'api/updateStatus';
  $http.get(path, {
    params: {
      mailId: mailId,
      mailCost: mailCost
    }
  }).then(function(result) {
    // result.data is the item you changed (assuming the API returns only that item)
    for (var i = $scope.mails.length - 1; i >= 0; i--) {
      if ($scope.mails[i].id === mailId) {
        $scope.mails[i] = result.data;
        return;
      }
    }
  })
}
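The replacement loop can also be pulled out into a small function, which keeps the controller short and is easy to test on its own. This is a sketch, assuming each mail has a unique id; replaceById is a hypothetical helper name.

```javascript
// Hypothetical helper: replace the item with the same id in place.
// Returns true when an item was replaced, false when the id was not found.
function replaceById(list, updated) {
  const index = list.findIndex((item) => item.id === updated.id);
  if (index === -1) return false;
  list[index] = updated; // mutate in place; the other items keep their identity
  return true;
}

// Sample data standing in for $scope.mails.
const mails = [
  { id: 1, cost: 100, status: 'pending' },
  { id: 2, cost: 250, status: 'pending' },
];
replaceById(mails, { id: 1, cost: 1000, status: 'Delivered' });
console.log(mails);
```

In the controller, the body of the .then callback would then reduce to a single call such as replaceById($scope.mails, result.data).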