I'm using Puppeteer to scrape a website. There are a few pieces of text I need to get, all of which are in the same type of HTML tag:
<div class="card__title">
<div class="visually-hidden">Title</div>
text i want</div>
How can I extract all of the innerText values?
I can get only the first one with this:
await page.$eval('div[class="card__title"]', e => e.innerText)
You can get it this way.
var cardTitleCollection = document.getElementsByClassName('card__title');
for(var i = 0; i < cardTitleCollection.length; i++) {
var item = cardTitleCollection[i];
console.log(item.innerText)
}
Example here: https://codepen.io/yasgo/pen/RwGbJxE
You can get all the nodes inside your target element using querySelector and its childNodes property, which returns a collection. Convert that collection to an array and filter for the text nodes.
var el = document.querySelector('.card__title').childNodes;
var textNodes = Array.prototype.slice.call( el, 0 ).filter(item => item.nodeType === Node.TEXT_NODE)
console.log(textNodes[1].nodeValue)
<div class="card__title">
<div class="visually-hidden">Title</div>
text i want</div>
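Both answers above read from the page's document, and the second one inspects only the first .card__title. A minimal sketch of applying the text-node filter to every matching element follows. The name collectOwnText is hypothetical, and plain objects stand in for DOM nodes so that the sketch runs outside a browser; the constant 3 is the value of Node.TEXT_NODE.

```javascript
// Hypothetical helper: for each element, join its direct text nodes,
// skipping child elements such as the "visually-hidden" div.
function collectOwnText(elements) {
  const TEXT_NODE = 3; // same value as Node.TEXT_NODE in a browser
  return Array.from(elements).map(el =>
    Array.from(el.childNodes)
      .filter(node => node.nodeType === TEXT_NODE)
      .map(node => node.nodeValue.trim())
      .filter(text => text.length > 0)
      .join(' ')
  );
}

// Stand-ins for two <div class="card__title"> elements (not real DOM nodes).
const fakeCards = [
  { childNodes: [
    { nodeType: 3, nodeValue: '\n' },
    { nodeType: 1, nodeValue: null }, // the nested "visually-hidden" div
    { nodeType: 3, nodeValue: '\ntext i want' },
  ] },
  { childNodes: [
    { nodeType: 1, nodeValue: null },
    { nodeType: 3, nodeValue: ' another title ' },
  ] },
];

console.log(collectOwnText(fakeCards)); // [ 'text i want', 'another title' ]
```

Because the function references nothing outside its own body, it should be possible to hand it to Puppeteer unchanged, for example `await page.$$eval('div.card__title', collectOwnText)`, which runs it in the page against all matching elements.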
I'm trying to use Puppeteer to find prices from websites, and the way I'm doing it is running a $$eval and a set of if statements. I'm checking that there is a "$" in the innerText as well as a children.length of 0. This works decently, but there are sometimes children for just part of the string (i.e. a bold $ but a regular number). Is there any way to find out what tag the children are easily? I would like to check for no children having <div> tags. Below is my code:
var prices = await page.$$eval("div, span, p", (priceElements) => {
var tmpPrices = [];
for (var i = 0; i < priceElements.length; i++) {
const priceIndex = priceElements[i].innerText.toLowerCase().indexOf("$");
if (
priceIndex > -1 &&
priceElements[i].children.length == 0 &&
priceElements[i].innerText.substring(priceIndex + 1).indexOf("$") == -1
) {
tmpPrices.push(priceElements[i].innerText);
}
}
return tmpPrices;
});
And here's an example of when it doesn't work:
<div class="pricevalue1 accent-color1" style="font-size: 20px;">
<b>
<span class="currency-symbol">$</span>
92,950
</b>
</div>
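One way to approach the "what tag are the children" part is to look at each child's tagName, which the DOM reports in upper case for HTML elements. The following is only a sketch: hasNoDivDescendants is a hypothetical helper, and plain objects stand in for elements so it can run outside a browser.

```javascript
// Hypothetical helper: true when no descendant of `el` is a <div>.
function hasNoDivDescendants(el) {
  return Array.from(el.children).every(
    child => child.tagName !== 'DIV' && hasNoDivDescendants(child)
  );
}

// Stand-in for the failing example: <div><b><span>$</span>92,950</b></div>
const priceDiv = {
  tagName: 'DIV',
  children: [
    { tagName: 'B', children: [{ tagName: 'SPAN', children: [] }] },
  ],
};

// Stand-in for a wrapper <div> that merely contains the price div.
const wrapperDiv = { tagName: 'DIV', children: [priceDiv] };

console.log(hasNoDivDescendants(priceDiv));   // true
console.log(hasNoDivDescendants(wrapperDiv)); // false
```

In a real page, `el.querySelector('div') === null` performs the same descendant check natively, so it could replace the `children.length == 0` condition in the code above.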
Consider the following hierarchy in DOM
<div class="bodyCells">
<div style="foo">
<div style="foo">
<div style="foo1"> 'contains the list of text elements I want to scrape' </div>
<div style="foo2"> 'contains the list of text elements I want to scrape' </div>
</div>
<div style="foo">
<div style="foo3"> 'contains the list of text elements I want to scrape' </div>
<div style="foo4"> 'contains the list of text elements I want to scrape' </div>
</div>
Using the class name bodyCells, I need to scrape the data from each of the divs one at a time (i.e., first from the 1st div, then from the next div, and so on) and store it in separate arrays. How can I achieve this using Puppeteer?
NOTE: I have tried using the class name directly, but that gives all the texts in a single array. I need to get the data from each tag separately, in different arrays.
Expected output:
array1=["text present within style="foo1" div tag"]
array2=["text present within style="foo2" div tag"]
array3=["text present within style="foo3" div tag"]
array4=["text present within style="foo4" div tag"]
As you noted, you can fetch each of the texts in a single array using the class name. Next, if you iterate over each of those, you can create a separate array for each subsection.
I created a fiddle here - https://jsfiddle.net/32bnoey6/ - with this example code:
const cells = document.getElementsByClassName('bodyCells');
const scrapedElements = [];
for (var i = 0; i < cells.length; i++) {
const item = cells[i];
for (var j = 0; j < item.children.length; j++) {
const outerDiv = item.children[j];
const innerDivs = outerDiv.children;
for (var k = 0; k < innerDivs.length; k++) {
const targetDiv = innerDivs[k];
scrapedElements.push([targetDiv.innerHTML]);
}
}
}
console.log(scrapedElements);
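The loops above assume a fixed nesting depth. As an alternative sketch, which is not the answerer's method, a recursive walk can collect the text of every innermost div into its own array regardless of depth. leafTexts is a hypothetical name, and the objects below merely imitate the children and textContent properties of real elements.

```javascript
// Hypothetical helper: return one single-item array per innermost element.
function leafTexts(el) {
  const kids = Array.from(el.children);
  if (kids.length === 0) {
    return [[el.textContent.trim()]];
  }
  // flatMap concatenates the per-child lists into one list of arrays.
  return kids.flatMap(kid => leafTexts(kid));
}

// Stand-in for the bodyCells hierarchy from the question.
const leaf = text => ({ children: [], textContent: text });
const fakeBodyCells = {
  children: [{
    children: [
      { children: [leaf('foo1 text'), leaf('foo2 text')] },
      { children: [leaf('foo3 text'), leaf('foo4 text')] },
    ],
  }],
};

const [array1, array2, array3, array4] = leafTexts(fakeBodyCells);
console.log(array1, array4); // [ 'foo1 text' ] [ 'foo4 text' ]
```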
With the following code, I'm getting the values of "id" (almost 35 of them) and then adding 1 to each "id", so 1 becomes 2 and so on. Where I'm stuck is how to replace that id number in the HTML.
This is the code I use to get the value of each id. I push the values into an array, then run another "for loop" to add 1 to each value, but I don't know how to write them back to the HTML.
var x = document.getElementsByClassName('p-divs');
var portfolio = new Array;
for (var i = 0; i < x.length; i++)
{
var y = document.getElementsByClassName('p-divs')[i].getAttribute('id');
portfolio.push(y);
}
console.log(portfolio);
var portfolio2 = new Array;
for (var i = 0; i<portfolio.length; i++)
{
var newId;
newId = parseInt(portfolio[i]) + 1;
portfolio2.push(newId);
}
console.log(portfolio2);
<div class="col-lg-3 col-md-3 col-sm-6 col-xs-12 p-divs" id="1">
<div class="portfolio">
<center>
<img src="images/pace.png" width="230" height="190" alt="" class="img-responsive">
</center>
</div>
</div>
Since you're using the jQuery library, the code could be simpler than what you have so far, using the .each() method:
$('.p-divs').each(function(){
$(this).attr('id', Number(this.id) + 1);
});
Or shorter, using the .attr() method with a callback:
$('.p-divs').attr('id', function(){
return Number(this.id) + 1;
});
A clearer version could be:
$('.p-divs').each(function(){
var current_id = Number(this.id); //Get current id
var new_id = current_id + 1; //Increment to define the new one
$(this).attr('id', new_id); //Set the new_id to the current element 'id'
});
Hope this helps.
$(function(){
$('.p-divs').attr('id', function(){
return Number(this.id) + 1;
});
//Just for Debug
console.log( $('body').html() );
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div class="p-divs" id="1">
<div class="portfolio">
<center>Image 1</center>
</div>
</div>
<div class="p-divs" id="2">
<div class="portfolio">
<center>Image 2</center>
</div>
</div>
<div class="p-divs" id="3">
<div class="portfolio">
<center>Image 3</center>
</div>
</div>
<div class="p-divs" id="4">
<div class="portfolio">
<center>Image 4</center>
</div>
</div>
Using native JavaScript, just use getAttribute's opposite, setAttribute:
for (var i = 0; i < x.length; i++)
{
var y = document.getElementsByClassName('p-divs')[i].getAttribute('id');
y++;
document.getElementsByClassName('p-divs')[i].setAttribute("id",y);
}
var j = document.getElementsByClassName('p-divs');
for (var i = 0; i < x.length; i++) {
j[i].id = portfolio2[i];
}
Add this to the end of your code. Vanilla JS.
j will be an array-like collection of your divs, i keeps count of which div we're on, and we simply access the "id" of each element in "j" and update it to the corresponding value in your pre-populated "portfolio2" array.
Hope this helps!
P.S.- I would also recommend that instead of using 'new Array' to instantiate your arrays, you use the array literal notation '[]'. This is more concise and also avoids needing to put (); after Array.
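One further reason to prefer the literal, shown here as a small illustration: the Array constructor treats a single numeric argument as a length rather than as an element.

```javascript
const viaConstructor = new Array(3); // three empty slots, no elements
const viaLiteral = [3];              // one element, the number 3

console.log(viaConstructor.length); // 3
console.log(viaLiteral.length);     // 1
console.log(viaLiteral[0]);         // 3
```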
I'd suggest, assuming I'm not missing something and that you're able to use ES6 methods:
// converting the NodeList returned from document.querySelectorAll()
// into an Array, and iterating over that Array using
// Array.prototype.forEach():
Array.from( document.querySelectorAll('.p-divs') ).forEach(
// using an Arrow function to work with the current element
// (divElement) of the Array of elements,
// here we use parseInt() to convert the id of the current
// element into a number (with no sanity checking), adding 1
// and assigning that result to be the new id:
divElement => divElement.id = parseInt( divElement.id, 10 ) + 1
);
Note that changing an id shouldn't be necessary in most circumstances, and a purely numeric id can cause problems when selecting those elements with CSS: it is valid in HTML5, but an ID selector such as #1 is not valid CSS unless the leading digit is escaped.
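The snippet above is commented as doing no sanity checking. A hedged sketch of one possible guard follows, leaving untouched any id that parseInt cannot read as a number. incrementIds is a hypothetical name, and plain objects with an id property stand in for the .p-divs elements.

```javascript
// Hypothetical helper: add 1 to every id that parses as an integer.
function incrementIds(elements) {
  Array.from(elements).forEach(el => {
    const current = Number.parseInt(el.id, 10);
    if (Number.isNaN(current)) {
      return; // parseInt found no leading number; leave this id alone
    }
    el.id = String(current + 1);
  });
}

// Stand-ins for elements; in a page this could be
// document.querySelectorAll('.p-divs').
const fakeDivs = [{ id: '1' }, { id: 'header' }, { id: '34' }];
incrementIds(fakeDivs);

console.log(fakeDivs.map(el => el.id)); // [ '2', 'header', '35' ]
```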
for (var i = 0; i < $('.p-divs').length; i++) {
var newId = parseInt($($('.p-divs')[i]).attr('id'), 10) + 1;
$($('.p-divs')[i]).attr('id', newId);
}
Using jQuery's .attr() method.
I have this html code on page:
<div style="display:none" id="roles">
<span>Manager</span>
<span>Seller</span>
</div>
And I just want to get an array of the strings inside the span elements.
var roles = document.getElementById("roles").innerText.match("what i should get here"); // output roles = ["Manager", "Seller"]
var roles = Array.prototype.slice.call(document.getElementById('roles').getElementsByTagName('span')).map(function(node) {
return node.innerText || node.textContent;
});
console.log(roles);
<div style="display:none" id="roles">
<span>Manager</span>
<span>Seller</span>
</div>
You need to iterate span elements and get its text.
//Get the span elements
var spans = document.querySelectorAll("#roles span");
var roles = [];
//Iterate the elements
for (var i = 0; i < spans.length; i++) {
//fetch textContent and push it to array
roles.push(spans[i].textContent);
}
console.log(roles)
<div style="display:none" id="roles">
<span>Manager</span>
<span>Seller</span>
</div>
You can also use Array.from with map, as suggested by @Tushar:
//Get the span elements
var spans = document.querySelectorAll("#roles span");
var roles = Array.from(spans).map(s => s.textContent);
console.log(roles)
<div style="display:none" id="roles">
<span>Manager</span>
<span>Seller</span>
</div>
Get all the span elements using querySelectorAll, convert the result to an array with the Array.from method (in older browsers use [].slice.call), and then generate the result array using the Array#map method.
// for older browser use `[].slice.call(...` instead of `Array.from(...`
var res = Array.from(document.querySelectorAll('#roles span')).map(function(e) {
return e.textContent;
});
console.log(res);
<div style="display:none" id="roles">
<span>Manager</span>
<span>Seller</span>
</div>
<div id="exmpl1">
<span id="1st"> A1</span>
<span id="2nd"> A2</span>
<span id="3rd"> B1</span>
<span id="4th"> B2</span>
<span id="5th"> C1</span>
</div>
var spans = document.getElementById('exmpl1').getElementsByTagName('span'),
obj = {};
for (var i = 0, j = spans.length; i < j; i++) {
obj[spans[i].id] = spans[i].textContent || spans[i].innerText;
}
console.log(obj);
Here is another solution:
var spans = []; // must be initialised as an array before push() is called
document.querySelectorAll('#roles > span').forEach(function(element){
spans.push(element.textContent);
});
I would recommend first finding the descendants of the roles div and then iterating through them in a for loop to push the results into an array.
//Define the parent div
var parent = document.getElementById('roles');
//Find the children of the parent div
var children = parent.getElementsByTagName('span');
//Count how many children there are within the parent div
var totalChildren = children.length;
//Declare a new array to store the roles in
var rolesArray = [];
//And let the for loop push each role into the array
for (var i = 0; i < totalChildren; i++) {
rolesArray.push(children[i].innerText);
}
I have a bunch of spans of class = "change" and each has a unique id. I created an array of those spans using:
var changesArray = $('.change').toArray()
I want to be able to get the index of the span in the array when I click on it. I tried:
$('.change').click(function(){
var thisChange = $(this).attr('id');
var thisChangeIndex = $.inArray(thisChange,changesArray);
});
But all I get is -1 for every .change I click on.
I'm a bit of a newbie with this type of code. Help?
The documentation for the toArray method says:
"Retrieve all the elements contained in the jQuery set, as an array."
The array therefore holds elements, and you are looking for a particular id string in it, which will never match.
If you want the index of the item, you can use .index():
$('.change').click(function(){
var thisChangeIndex = $('.change').index(this);
console.log(thisChangeIndex);
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div>
<span class="change">change1</span>
<span class="change">change2</span>
<span class="change">change3</span>
<span class="change">change4</span>
</div>
<div>
<span class="change">change5</span>
<span class="change">change6</span>
<span class="change">change7</span>
<span class="change">change8</span>
</div>
You should keep a plain array of the unique IDs only:
var changesArrayIds = $('.change').toArray().map(function(x) { return x.id; });
Then this line should work fine:
var thisChangeIndex = $.inArray(thisChange, changesArrayIds);
If you insist on using .toArray, this works: http://codepen.io/8odoros/pen/JKWxqz
var changesArray = $('.change').toArray();
$('.change').click(function(){
var thisChange = $(this).attr('id');
var thisChangeIndex = -1;
$.each( changesArray, function( i, val ) {
if( thisChange==val.id) thisChangeIndex= i;
});
console.log(thisChangeIndex);
});
When you call toArray, you get an array of the DOM nodes, not jQuery objects. You can search for this instead of $(this):
var changesArray = $('.change').click(function(){
var thisChangeIndex = $.inArray(this,changesArray);
}).toArray();
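The -1 in the question comes down to strict equality: the array holds element objects, so a search for an id string never matches. Here is a small sketch with plain objects in place of DOM nodes (the ids are made up). Array.prototype.indexOf is used as a stand-in for $.inArray, which compares the same way.

```javascript
// Stand-ins for the DOM nodes that $('.change').toArray() would return.
const fakeChanges = [{ id: 'c1' }, { id: 'c2' }, { id: 'c3' }];
const clicked = fakeChanges[1]; // plays the role of `this` in the handler

// Searching for the id string: no element is strictly equal to a string.
const byIdString = fakeChanges.indexOf(clicked.id);

// Searching for the node itself: matches by object identity.
const byNode = fakeChanges.indexOf(clicked);

// Comparing ids explicitly also works.
const byMatchingId = fakeChanges.findIndex(el => el.id === clicked.id);

console.log(byIdString, byNode, byMatchingId); // -1 1 1
```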