Puppeteer - how to select an element based on its inner text? - javascript

I am working on scraping a bunch of pages with Puppeteer. The content is not differentiated with classes/ids/etc. and is presented in a different order between pages. As such, I will need to select the elements based on their inner text. I have included a simplified sample html below:
<table>
<tr>
<th>Product name</th>
<td>Shakeweight</td>
</tr>
<tr>
<th>Product category</th>
<td>Exercise equipment</td>
</tr>
<tr>
<th>Manufacturer name</th>
<td>The Shakeweight Company</td>
</tr>
<tr>
<th>Manufacturer address</th>
<td>
<table>
<tr><td>123 Fake Street</td></tr>
<tr><td>Springfield, MO</td></tr>
</table>
</td>
</tr>
</table>
In this example, I would need to scrape the manufacturer name and manufacturer address. So I suppose I would need to select the appropriate tr based upon the inner text of the nested th and scrape the associated td within that same tr. Note that the order of the rows of this table is not always the same and the table contains many more rows than this simplified example, so I can't just select the 3rd and 4th td.
I have tried to select an element based on its inner text using XPath as below, but it does not seem to be working:
var manufacturerName = document.evaluate("//th[text()='Manufacturer name']", document, null, XPathResult.ANY_TYPE, null)
This wouldn't even be the data I would need (it would be the td associated with this th), but I figured this would be step 1 at least. If someone could provide input on the strategy to select by inner text, or to select the td associated with this th, I'd really appreciate it.

This is really an XPath question and isn't specific to Puppeteer, so this question might also help, as you're going to need to find the <td> that comes after the <th> you've found: XPath:: Get following Sibling
But your xpath does work for me. In Chrome DevTools on the page with the HTML in your question, run this line to query the document:
$x('//th[text()="Manufacturer name"]')
NOTE: $x() is a helper function that only works in Chrome DevTools, though Puppeteer has a similar Page.$x function.
That expression should return an array with one element, the <th> with that text in the query. To get the <td> next to it:
$x('//th[text()="Manufacturer name"]/following-sibling::td')
And to get its inner text:
$x('//th[text()="Manufacturer name"]/following-sibling::td')[0].innerText
Once you're able to follow that pattern you should be able to use similar strategies to get the data you want in puppeteer, similar to this:
const puppeteer = require('puppeteer');
const main = async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://127.0.0.1:8080/'); // <-- EDIT THIS
const mfg = await page.$x('//th[text()="Manufacturer name"]/following-sibling::td');
const prop = await mfg[0].getProperty('innerText');
const text = await prop.jsonValue();
console.log(text);
await browser.close();
}
main();
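If the header text varies between runs, the XPath string can be built by a small helper instead of being hard-coded. The sketch below is not part of the original answer: xpathLiteral and cellXPathForHeader are hypothetical names, and it assumes the value you want is the first <td> that follows the matching <th>. It uses normalize-space() so stray whitespace around the header text does not break the match, and it quotes the label safely even if it contains quote characters.

```javascript
// Hypothetical helper: quote arbitrary text as an XPath 1.0 string literal.
function xpathLiteral(text) {
  if (!text.includes('"')) return `"${text}"`;
  if (!text.includes("'")) return `'${text}'`;
  // Both quote types are present: split on double quotes and rejoin with concat().
  const parts = text.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(`, '"', `)})`;
}

// Hypothetical helper: XPath for the first <td> following the <th> with this label.
function cellXPathForHeader(label) {
  return `//th[normalize-space(.)=${xpathLiteral(label)}]/following-sibling::td[1]`;
}

console.log(cellXPathForHeader('Manufacturer name'));
```

The returned string can be passed to the same page.$x call used above, for example page.$x(cellXPathForHeader('Manufacturer address')).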

Based on the use case you described, here is one way to do it:
await page.goto(url, { waitUntil: 'networkidle2' }); // Go to the webpage URL
await page.waitForSelector('table'); // Wait for the table that contains the text
const textDataArr = await page.evaluate(() => {
  const trArr = Array.from(document.querySelectorAll('table tbody tr'));
  // Find the row whose th innerText equals 'Manufacturer name'
  const row = trArr.find((tr) => {
    const th = tr.querySelector('th');
    return th !== null && th.innerText.trim() === 'Manufacturer name';
  });
  // If the row is found, return the innerText of its td; otherwise return undefined
  const td = row ? row.querySelector('td') : null;
  return td ? td.innerText : undefined;
});
console.log(textDataArr);

You can do something like this to get the data:
await page.goto(url, { waitUntil: 'networkidle2' }); // Go to webpage url
await page.waitForSelector('table'); // Wait for the table that contains the text
const textDataArr = await page.evaluate(() => {
const element = document.querySelector('table tbody tr:nth-child(3) td'); // select the third row's td element
return element && element.innerText; // returns the text, or null if the element is not found
});
console.log(textDataArr);

A simple way to get those all at once:
let data = await page.evaluate(() => {
return [...document.querySelectorAll('tr')].reduce((acc, tr, i) => {
let cells = [...tr.querySelectorAll('th,td')].map(el => el.innerText)
acc[cells[0]] = cells[1]
return acc
}, {})
})
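One caveat with collecting every tr: rows that belong to a nested table (such as the address lines in the sample) have no label/value pair and would end up as keys with undefined values. A minimal sketch of the reduce step on plain arrays, so it can be run outside the browser, is shown below. rowsToObject is a hypothetical helper, and it assumes each row has already been turned into an array of cell texts (for example inside page.evaluate).

```javascript
// rows: one array of cell texts per table row (assumed input shape).
function rowsToObject(rows) {
  return rows.reduce((acc, cells) => {
    const [label, value] = cells.map((text) => text.trim());
    // Skip rows that do not carry a label/value pair (e.g. rows of a nested table).
    if (label && value !== undefined) {
      acc[label] = value;
    }
    return acc;
  }, {});
}

const sample = [
  ['Product name', 'Shakeweight'],
  ['Manufacturer name', ' The Shakeweight Company '],
  ['123 Fake Street'], // row from the nested address table: only one cell
];
const data = rowsToObject(sample);
console.log(data['Manufacturer name']);
```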

Related

Grab the classname of nested HTML elements with Puppeteer

I am trying to get all of the td elements that have a class name of calendarCellRegularPast. I have made multiple attempts with page.$, page.$$, page.eval and many others and cannot seem to get the proper element. I got the object at one point but was unable to parse it. I know that there is a way to do this, but I am new to Puppeteer and JavaScript and just cannot figure it out.
This is a calendar of scheduled work days and I am trying to get even further down and grab the date and time. I want to have that in a for loop though so I can grab all of the work times.
I don't need a fix to my code as I know it's messy; I just need to know how to grab those elements.
Here is some of my code so far, though it is messy as I was trying numerous things:
const el = await page.$("#scrollContainer > table > tbody > tr:nth-child(2) > td:nth-child(1)")
const className = await el.getProperty('className')
const getElm = async() => {
try {
const elm = await page.evaluate(()=> {
let persons = [];
const weekElm = document.querySelectorAll("#scrollContainer > table > tbody");
document.querySelector("#scrollContainer > table > tbody > tr:nth-child(2) > td:nth-child(3)")
console.log('\n\n\n', weekElm.length, '\n\n\n');
try {
console.log(weekElm);
} catch (e) {
console.error('could not log "weeksElm"');
}
for (let i = 2; i < weekElm.length; i++) {
try {
const trd = weekElm[i];
console.log(trd);
const tr = document.querySelectorAll(`tr:nth-child${i+1}`)
persons.push(tr);
try {
console.log('\n\n\n', tr, '\n\n\n');
} catch (e) {
console.error('could not log "\\n\\n\\n, tr, \\n\\n\\n"');
}
for ( let j = 0; j < tr.length; i++) {
const td = (tr.querySelectorAll(`td:nth-child${j}`))
try {
pass
} catch (e) {
console.error('could not log "weeksElm"');
}
}
} catch (e) {}
}
})
console.log(persons);
} catch (e){
console.error("unable to get the work schedule \n unfortunately i still work");
}
}
}
Firstly, I would recommend using an HTTP client and cheerio.js for scraping, as it's faster and more lightweight. But I can understand why you would use Puppeteer because of authentication.
You can get the HTML of the page from Puppeteer with page.content() and then pass that into cheerio. But if you wanted to do it using an HTTP client, it would look like this:
const axios = require("axios");
const cheerio = require("cheerio");
const PAGEURL = "https://example.com"; // placeholder URL; include the scheme
async function getData() {
  const res = await axios.get(PAGEURL);
  const $ = cheerio.load(res.data);
  // scrape data here
}
First, here's a mock-up of your HTML:
<table class="etmSchedualTable">
<tbody>
<tr>
<td class="calendarCellRegularPast">
<table class="etmCursor">
<tbody>
<tr>
<td>
<span class="calendarDateNormal"> 01</span>
</td>
<td>
<span class="calendarCellRegularPast etmNoBorder">
<span>17:15</span> " - "
<span>23:45</span>
<span class="etmMoreLink">more...</span>
</span>
</td>
</tr>
<tr>
<td>
<span class="calendarDateNormal"> 02</span>
</td>
<td>
<span class="calendarCellRegularPast etmNoBorder">
<span>17:15</span> " - "
<span>23:45</span>
<span class="etmMoreLink">more...</span>
</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
At the end, we want an array with objects that look like this:
{
date: 1,
start: "17:15",
end: "23:45"
}
To do so, we first need to select the parent that holds all the items we want. In this case it's $(".etmCursor tbody tr"). This will give us a list of all the tr's in the table .etmCursor.
Now we have to loop through the table rows (tr) and get the object properties.
const timetable = $(".etmCursor tbody tr").map((i, elm) => {
return {
date: $(elm).find(".calendarDateNormal").text().trim(),
// the start and end times are the first and second child spans of the cell span
start: $(elm).find(".calendarCellRegularPast > span:nth-child(1)").text(),
end: $(elm).find(".calendarCellRegularPast > span:nth-child(2)").text(),
}
}).toArray();
Cheerio's .map() passes its callback arguments in the opposite order to a normal array map: you get the index before the element. And since it returns a Cheerio collection, we have to use toArray() to turn it into a plain array.
We use .find() to search within the selected row.
.text() then gets us the text of that element.
Now we should have the data you wanted.
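Note that .text() always returns strings, including any surrounding whitespace, while the target object above has a numeric date. A small post-processing step could close that gap. This is only a sketch: normalizeShift is a hypothetical helper, and it assumes the date text is a day number such as " 01".

```javascript
// Hypothetical helper: turn the scraped strings into the desired shape.
function normalizeShift(raw) {
  return {
    date: Number.parseInt(raw.date.trim(), 10), // " 01" becomes 1
    start: raw.start.trim(),
    end: raw.end.trim(),
  };
}

const shift = normalizeShift({ date: ' 01', start: '17:15', end: '23:45 ' });
console.log(shift);
```

With the timetable array from above, this could be applied as timetable.map(normalizeShift).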

Updating the DOM with arrays JavaScript

I am having a problem when I try to update the DOM with new information coming from an API.
Every time I click to add a new user, the table displays both the old and the new information. Ideally, it would update the array first and then display each user only once. I will attach a picture of what is happening. Every time the user clicks "add new user", I would like the DOM to update with only the information of that new user.
HTML part
<table class="table is-fullwidth table is-hoverable table-info">
<thead>
<tr>
<th title="Channel Name" class="has-text-left"> Channel Name </th>
<th title="View per week" class="has-text-right"> View per week </th>
</tr>
</thead>
<tbody id="body-table">
<tr id="tr-table">
</tr>
</tbody>
</table>
script.js
const trline = document.getElementById('body-table')
let usersList = [];
async function getnewUsers(){
const res = await fetch('https://randomuser.me/api')
const data = await res.json()
// create an instance of the results
const user = data.results[0]
// create the new user
const newUser = {
name:`${user.name.first} ${user.name.last}`,
social: Math.floor(Math.random() * 10000 )
}
// update the new user to the database...
addData(newUser)
}
function addData(obj) {
usersList.push(obj)
// update the information on the screen
updateDOM()
}
function updateDOM( providedData = usersList){
providedData.forEach(item => {
const element = document.createElement('tr')
element.innerHTML = `
<td class="has-text-left cname"> ${item.name} </td>
<td class="has-text-right cview"> ${item.social} k</td>
`
trline.appendChild(element)
})
}
addUser.addEventListener('click', getnewUsers)
Result picture:
I found the problem and the solution.
I wasn't clearing the table body before adding the rows again, so every call re-appended the whole list. I fixed the function updateDOM by adding this line: trline.innerHTML = ''
After that, the function works fine.
function updateDOM( providedData = usersList){
trline.innerHTML = '' // clear everything before adding new stuff
providedData.forEach(item => {
const element = document.createElement('tr')
element.innerHTML = `
<td class="has-text-left cname"> ${item.name} </td>
<td class="has-text-right cview"> ${item.social} k</td>
`
trline.appendChild(element)
})
}

Table from fetched URL clickable array (or to variable)

I'm new to JavaScript and trying to achieve one thing. My site has a table whose TDs are filled from an API URL by the code below, with the result shown in the image below.
What I'm trying to achieve is to make the TDs clickable, so that a click redirects to another page where the fetch URL changes from https://api.ulozenka.cz/v3/consignments to https://api.ulozenka.cz/v3/consignments/id (the id from the row's first td), or at least redirects to another page and stores the id from the row's first td in a variable (a PHP variable would be better if that's possible).
Thanks for your reply.
Image
<table id="consignments-list" align="center" style="text-align: center; border-collapse: collapse;
border-spacing: 0;
width: 97%;
border: 3px solid #ddd;" cellpadding="10">
<thead>
<tr>
<th width="20%">ID</th>
<th width="40%">Name</th>
<th width="40%">Status</th>
</tr>
</thead>
<tbody>
<!-- load users here -->
</tbody>
</table>
<script id="jsbin-javascript">
window.onload = () => {
const table = document.querySelector('#consignments-list');
// call API using `fetch`
fetch('https://api.ulozenka.cz/v3/consignments', {
headers: new Headers({
'X-Shop': '18157',
'X-Key': '##################'
})
})
.then(res => res.json())
.then(res => {
// loop over all consignments
res.data.forEach(consignment => {
// create a `tr` element
const tr = document.createElement('tr');
// create ID `td`
const idTd = document.createElement('td');
idTd.textContent = consignment.id;
// create Status `td`
const statusTd = document.createElement('td');
statusTd.textContent = `${consignment.status.name}`;
// create Name `td`
const nameTd = document.createElement('td');
nameTd.textContent = `${consignment.customer_name} ${consignment.customer_surname}`;
// add tds to tr (ID, Name, Status, matching the header order)
tr.appendChild(idTd);
tr.appendChild(nameTd);
tr.appendChild(statusTd);
// append tr to table
table.querySelector('tbody').appendChild(tr);
});
})
.catch(err => console.log('Error:', err));
};
</script>
If I understand your question correctly, you want to add an onclick attribute to each td element. The attribute value has to be JavaScript rather than a bare URL, so you can do something like this:
nameTd.setAttribute("onclick", `window.location.href = 'https://api.ulozenka.cz/v3/consignments/${consignment.id}'`)
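Since the question asks for a redirect to another page of the site (rather than straight to the API), one common pattern is to pass the id in the query string and rebuild the API URL on the detail page. The sketch below is an assumption-laden illustration, not the answerer's code: detail.html is a hypothetical page name, and detailPageUrl and apiUrlFromSearch are hypothetical helpers.

```javascript
// Hypothetical page name; the question does not name the detail page.
const DETAIL_PAGE = 'detail.html';
const API_BASE = 'https://api.ulozenka.cz/v3/consignments';

// List page: build the link target for one consignment row.
function detailPageUrl(id) {
  return `${DETAIL_PAGE}?id=${encodeURIComponent(id)}`;
}

// Detail page: read the id back from location.search and build the API URL.
function apiUrlFromSearch(search) {
  const id = new URLSearchParams(search).get('id');
  return id ? `${API_BASE}/${encodeURIComponent(id)}` : null;
}

console.log(detailPageUrl(12345));
console.log(apiUrlFromSearch('?id=12345'));
```

On the list page, a click handler on the row could set window.location.href = detailPageUrl(consignment.id); on the detail page, apiUrlFromSearch(window.location.search) gives the URL to fetch. If the detail page is rendered by PHP, the same query parameter is available there as $_GET['id'].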

How to apply hyperlinks to URL text in existing HTML - UPDATED WITH JSFIDDLE & WORKING SOLUTION

UPDATE: http://jsfiddle.net/daltontech/qfjr7e6a/ - Thanks to both that helped!
Original question:
I get JSON data from a report in my HelpDesk software that I import into an HTML table (via Python) & one of the columns is the address of the request, but it is not clickable. I can edit the Python file (though I don't expect the answer is there) and the HTML file (and Javascript is both fine and expected to be the solution), but I cannot change the JSON data (much).
I can use JQuery, but if vanilla Javascript can do it, that is my preference.
I tried innerHTML (with and without global flag), but after about 20 rows, it fails spectacularly in IE & Chrome (all I tested) & this list is typically 50+.
I do use innerHTML successfully in other places, mainly linking technician names to their requests (a shorter list) like:
{ document.body.innerHTML = document.body.innerHTML.replace('Jenny', 'Jenny'); }
Here's what I have to work with:
<table class="Requests" id="Requests">
<thead><tr><th>URL</th><th>Title</th><th>Technician</th></tr></thead>
<tr><td>https://helpdesk.domain.com/8675309</td><td>I need a phone number</td><td>Jenny</td></tr>
<tr><td>https://helpdesk.domain.com/8675310</td><td>Some other issue</td>
<td>John</td></tr>
</table>
Everything before the number is always the same, so that gives some flexibility and I can have the JSON file provide a few options (just not the <a> tag...) like:
1. 8675309
2. https://helpdesk.domain.com/8675309
3. sometext8675309
4. sometext8675309someothertext
I'm hoping to accomplish either of the two row examples - either works, might prefer latter:
<table class="Requests" id="Requests">
<thead><tr><th>URL</th><th>Title</th><th>Technician</th></tr></thead>
<tr><td>https://helpdesk.domain.com/8675309</td><td>I need a phone number</td><td>Jenny</td></tr>
<tr><td>link</td><td>Some other issue</td><td>John</td></tr>
</table>
Commented Code:
// get the elements
document
.querySelectorAll(".Requests > tbody > tr > td:first-child")
// for each element remove the text and
// replace it with an anchor tag
// use the original element's text as the link
.forEach(c => {
let a = Object.assign(document.createElement("a"), {
href: c.textContent,
textContent: c.textContent
});
c.textContent = "";
c.appendChild(a);
});
Example Snippet
document
.querySelectorAll(".Requests > tbody > tr > td:first-child")
.forEach(c =>
c.parentNode.replaceChild(Object.assign(document.createElement("a"), {
href: c.textContent,
textContent: c.textContent
}), c)
);
<table class="Requests" id="Requests">
<thead>
<tr>
<th>URL</th>
<th>Title</th>
<th>Technician</th>
</tr>
</thead>
<tr>
<td>https://helpdesk.domain.com/8675309</td>
<td>I need a phone number</td>
<td>Jenny</td>
</tr>
<tr>
<td>https://helpdesk.domain.com/8675310</td>
<td>Some other issue</td>
<td>John</td>
</tr>
</table>
If I understand your question right, you want to use client-side JS to modify an already generated HTML table.
The below code works for me with +200 rows so I don't think using .innerHTML has an inherent issue, maybe there is something else causing your code to crash?
EDIT (IE):
let rows = document.querySelectorAll('#Requests tr');
for (let i = 1; i < rows.length; i++) {
rows[i].getElementsByTagName('td')[0].innerHTML = '<a href="' + rows[i].getElementsByTagName('td')[0]
.innerHTML + '">' + rows[i].getElementsByTagName('td')[0].innerHTML + '</a>';
}
let rows = document.querySelectorAll('#Requests tr');
rows.forEach(function(r, i) {
if (i > 0) {
r.getElementsByTagName('td')[0].innerHTML = '<a href="' + r.getElementsByTagName('td')[0].innerHTML + '">' + r.getElementsByTagName('td')[0].innerHTML + '</a>'
}
});
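The question mentions that the JSON could provide the request in several forms (bare number, full URL, or a number surrounded by other text). A small helper could normalize all of them to the same link target before the anchor is built. This is a sketch under assumptions: requestUrl is a hypothetical helper, the base URL is taken from the sample rows, and the request number is assumed to be the last run of digits in the cell text.

```javascript
// Base URL taken from the sample table in the question.
const HELPDESK_BASE = 'https://helpdesk.domain.com/';

// Hypothetical helper: extract the last run of digits and build the request URL.
function requestUrl(cellText) {
  const match = cellText.match(/(\d+)(?!.*\d)/);
  return match ? HELPDESK_BASE + match[1] : null;
}

console.log(requestUrl('sometext8675309someothertext'));
```

The result can be used as the href when creating the anchor, while the visible text stays whatever the cell contained (or a fixed word such as "link").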

NodeJS: How can I scrape two different tables, that are visually part of the same table, into one JSON Object?

Here's an example of the table of data I'm scraping:
The elements in red are in <th> tags while the elements in green are in <td> tags; the <tr> tags follow how they're grouped (i.e. '1' is in its own <tr>). HTML snippet:
EDIT: I forgot to add the surrounding div
<div class="table-cont">
<table class="tg-1">
<thead>
<tr>
<th class="tg-phtq">ID</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky">1</td>
<td class="tg-0pky">2</td>
<td class="tg-0pky">3</td>
</tr>
</tbody>
</table>
<table class="tg-2">
<thead>
<tr>
<th class="tg-phtq">Sample1</th>
<th class="tg-phtq">Sample2</th>
<...the rest of the table code matches the pattern...>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky">Swimm</td>
<td class="tg-dvpl">1:30</td>
<...>
</tr>
</tbody>
<...the rest of the table code...>
</table>
</div>
As you can see, in the HTML they're actually two different tables while they're displayed in the above example as only one. I want to generate a JSON object where the keys and values include the data from the two tables as if they were one, and output a single JSON Object.
How I'm scraping it right now is a bit of modified javascript code I found on a tutorial:
EDIT: In the below, I've been trying to find a way to select all relevant <th> tags from both tables and insert them into the same array as the rest of the <th> tag array and do the same for <tr> in the table body; I'm fairly sure for the th I can just insert the element separately before the rest but only because there's a single one - I've been having problems figuring out how to do that for both arrays and make sure all the items in the two arrays map correctly to each other
EDIT 2: Possible solution? I tried using XPath Selectors and I can use them in devTools to select everything I want, but page.evaluate doesn't accept them and page.$x('XPath') returns JSHandle#node since I'm trying to make an array, but I don't know where to go from there
let scrapeMemberTable = async (page) => {
return page.evaluate(() => {
let ths = Array.from(document.querySelectorAll('div.table-cont > table.tg-2 > thead > tr > th'));
let trs = Array.from(document.querySelectorAll('div.table-cont > table.tg-2 > tbody > tr'));
// the above two lines of code are the main problem area - I haven't been
// able to select all the head/body elements I want in just those two lines of code
// just removing the table class "tg-2" seems to deselect the whole thing
const headers = ths.map(th => th.textContent);
let results = [];
trs.forEach(tr => {
let r = {};
let tds = Array.from(tr.querySelectorAll('td')).map(td => td.textContent);
headers.forEach((k,i) => r[k] = tds[i]);
results.push(r);
});
return results; // results is an array of plain objects, serializable as JSON
});
}
...
results = results.concat( //merge into one array OBJ
await scrapeMemberTable(page)
);
...
Intended Result:
[
{
"ID": "1", <-- this is the goal
"Sample1": "Swimm",
"Sample2": "1:30",
"Sample3": "2:05",
"Sample4": "1:15",
"Sample5": "1:41"
}
]
Actual Result:
[
{
"Sample1": "Swimm",
"Sample2": "1:30",
"Sample3": "2:05",
"Sample4": "1:15",
"Sample5": "1:41"
}
]
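One possible way to get the intended result is to scrape each table separately with the same header/row pattern and then merge the two by row index. The sketch below shows only the merge step on plain arrays so it can be run outside the browser. mergeTables is a hypothetical helper, and it assumes the i-th body row of the ID table belongs to the i-th body row of the sample table; if the ID cells all sit in a single <tr>, as in the snippet, they would need to be split into one row per ID first.

```javascript
// Each table is described as { headers: string[], rows: string[][] } (assumed shape).
function mergeTables(left, right) {
  return right.rows.map((rightCells, i) => {
    const record = {};
    const leftCells = left.rows[i] || [];
    // Keys from the left table come first so "ID" leads each record.
    left.headers.forEach((key, j) => { record[key] = leftCells[j]; });
    right.headers.forEach((key, j) => { record[key] = rightCells[j]; });
    return record;
  });
}

const merged = mergeTables(
  { headers: ['ID'], rows: [['1']] },
  { headers: ['Sample1', 'Sample2'], rows: [['Swimm', '1:30']] }
);
console.log(JSON.stringify(merged, null, 2));
```

Inside page.evaluate, the headers and rows for each table could be collected with the same querySelectorAll calls as above, once with table.tg-1 and once with table.tg-2 in the selector.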
