Scrape text from a complex DOM structure - javascript

Consider the following DOM hierarchy:
<div class="bodyCells">
  <div style="foo">
    <div style="foo">
      <div style="foo1"> 'contains the list of text elements I want to scrape' </div>
      <div style="foo2"> 'contains the list of text elements I want to scrape' </div>
    </div>
    <div style="foo">
      <div style="foo3"> 'contains the list of text elements I want to scrape' </div>
      <div style="foo4"> 'contains the list of text elements I want to scrape' </div>
    </div>
  </div>
</div>
Using the class name bodyCells, I need to scrape the data from each of the inner divs one at a time (first from the 1st div, then from the next one, and so on) and store each result in a separate array. How can I achieve this using puppeteer?
NOTE: I have tried using the class name directly, but that returns all the texts in a single array. I need the data from each tag in its own array.
Expected output:
array1 = ["text present within the style='foo1' div tag"]
array2 = ["text present within the style='foo2' div tag"]
array3 = ["text present within the style='foo3' div tag"]
array4 = ["text present within the style='foo4' div tag"]

As you noted, you can fetch each of the texts in a single array using the class name. Next, if you iterate over each of those, you can create a separate array for each subsection.
I created a fiddle here - https://jsfiddle.net/32bnoey6/ - with this example code:
const cells = document.getElementsByClassName('bodyCells');
const scrapedElements = [];
// Assumes the target divs sit exactly two levels below .bodyCells;
// markup with an extra wrapper div needs one more loop level.
for (let i = 0; i < cells.length; i++) {
  const item = cells[i];
  for (let j = 0; j < item.children.length; j++) {
    const outerDiv = item.children[j];
    const innerDivs = outerDiv.children;
    for (let k = 0; k < innerDivs.length; k++) {
      const targetDiv = innerDivs[k];
      // one single-item array per target div
      scrapedElements.push([targetDiv.innerHTML]);
    }
  }
}
console.log(scrapedElements);
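If the nesting depth is not fixed, the hard-coded loops could be replaced by a walk that stops at the innermost divs. This is only a sketch under assumptions: collectLeafTexts is a name made up for this example, it treats every element without child elements as a target, and it reads textContent instead of innerHTML. The Puppeteer call in the closing comment is an untested illustration.

```javascript
// Collects the text of every "leaf" element (an element that has no
// element children) below the given roots, in document order.
// Returns one single-item array per leaf, e.g. [["a"], ["b"]].
// `roots` is an array of elements (or anything exposing `children`
// and `textContent` the way DOM elements do).
function collectLeafTexts(roots) {
  const result = [];
  const visit = (el) => {
    const kids = Array.from(el.children);
    if (kids.length === 0) {
      // innermost element: keep its trimmed text in its own array
      result.push([el.textContent.trim()]);
    } else {
      kids.forEach(visit);
    }
  };
  roots.forEach(visit);
  return result;
}

// In the page itself (assumed usage):
//   collectLeafTexts(Array.from(document.getElementsByClassName('bodyCells')));
// With Puppeteer (assumed usage, the function is serialized and run in the page):
//   const arrays = await page.$$eval('.bodyCells', collectLeafTexts);
```

The helper is declared inside the function on purpose, so the function still works when it is serialized and evaluated in the page context.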

Related

Creating a template out of HTML Elements

Let's say I have a parent div, and in this container I want to display 5 elements that all have the same structure. For example:
<div class="element">
<p class="name">
</p>
<div class="logo">
</div>
</div>
Is there a way to make an object or prototype out of it, so I don't have to generate every single HTML element, with its classes and src values, using appendChild and dot notation in a for loop?
I'm thinking of something like:
for (let i = 0; i < 5; i++) {
  var element = new element(class,src1,src2 ...);
}
where "element" is defined in an external class file or something similar.
I'm a beginner, so please show mercy :)
You'll need to clone the node from the template's content. For example:
const templateElement = document.querySelector("#someTemplate")
  .content
  .querySelector(".element");
// create an Array of nodes (so in memory)
const fiveNodes = [];
for (let i = 0; i < 5; i += 1) {
  // true = deep clone (the whole subtree)
  const nwNode = templateElement.cloneNode(true);
  nwNode.querySelector("p.name").textContent += ` #${i + 1}`;
  fiveNodes.push(nwNode);
}
// append all nodes to document.body in a single call
document.body.append(...fiveNodes);
<template id="someTemplate">
  <div class="element">
    <p class="name">I am node</p>
    <div class="logo"></div>
  </div>
</template>

JavaScript get multiple innertext

I'm using puppeteer to scrape a website. There are a few texts that I need to get, all of which are in the same type of HTML tag:
<div class="card__title">
<div class="visually-hidden">Title</div>
text i want</div>
How can I extract all of the innerText values?
I can get only the first one with this:
await page.$eval('div[class="card__title"]', e => e.innerText)
You can get it this way.
var cardTitleCollection = document.getElementsByClassName('card__title');
for (var i = 0; i < cardTitleCollection.length; i++) {
  var item = cardTitleCollection[i];
  console.log(item.innerText);
}
Example here: https://codepen.io/yasgo/pen/RwGbJxE
You can get all the nodes inside your target element with querySelector and childNodes, which returns a NodeList. Convert it to an array and filter for the text nodes.
var el = document.querySelector('.card__title').childNodes;
var textNodes = Array.prototype.slice.call(el, 0).filter(item => item.nodeType === Node.TEXT_NODE);
// index 1, because in the markup below the line break before the inner div is text node 0
console.log(textNodes[1].nodeValue);
<div class="card__title">
<div class="visually-hidden">Title</div>
text i want</div>
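To cover every card rather than only the first one, and to avoid depending on the index of the text node, the two answers can be combined. A minimal sketch, assuming the wanted text always sits directly inside the .card__title element; ownText and ownTexts are hypothetical helper names, not part of any library:

```javascript
const TEXT_NODE = 3; // same numeric value as Node.TEXT_NODE in browsers

// Returns only the text that belongs directly to `el`,
// skipping the text of child elements such as .visually-hidden.
function ownText(el) {
  return Array.from(el.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE)
    .map((node) => node.nodeValue)
    .join('')
    .trim();
}

// Maps a collection of elements to their own text, one string per element.
function ownTexts(elements) {
  return Array.from(elements).map(ownText);
}

// In the page itself (assumed usage):
//   ownTexts(document.querySelectorAll('.card__title'));
// With Puppeteer the same logic would have to be written inside the
// function passed to page.$$eval, because that function is serialized
// and cannot see helpers defined outside of it.
```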

Create a HTML list based on value of a data attribute of div tags?

I have a PHP function that generates a hierarchical view of some blog posts according to their category, child category, grandchild category and so on. It generates a string containing div tags with data attributes. I want to convert those divs to an HTML <ul><li> structure based on the value of their aria-level attribute.
Actual output from php method
<div role="heading" aria-level="1">Test 1</div>
<div role="heading" aria-level="2">Test 1.1</div>
<div role="heading" aria-level="3">Test 1.1.1</div>
<div role="heading" aria-level="3">Test 1.1.2</div>
<div role="heading" aria-level="1">Test 2</div>
<div role="heading" aria-level="3">Test 2.1.1</div>
<div role="heading" aria-level="2">Test 2.2</div>
Desired output using php/js/jquery/any framework:
- Test 1
  - Test 1.1
    - Test 1.1.1
    - Test 1.1.2
- Test 2
    - Test 2.1.1
  - Test 2.2
What I have achieved so far:
function buildRec(nodes, elm, lv) {
  var node;
  // filter
  do {
    node = nodes.shift();
  } while (node && !(/^h[123456]$/i.test(node.tagName)));
  // process the next node
  if (node) {
    var ul, li, cnt;
    var curLv = parseInt(node.tagName.substring(1), 10);
    if (curLv == lv) { // same level: append an li
      cnt = 0;
    } else if (curLv < lv) { // walk up, then append an li
      cnt = 0;
      do {
        elm = elm.parentNode.parentNode;
        cnt--;
      } while (cnt > (curLv - lv));
    } else if (curLv > lv) { // create children, then append an li
      cnt = 0;
      do {
        li = elm.lastChild;
        if (li == null)
          li = elm.appendChild(document.createElement("li"));
        elm = li.appendChild(document.createElement("ul"));
        cnt++;
      } while (cnt < (curLv - lv));
    }
    li = elm.appendChild(document.createElement("li"));
    // replace the next line with anchor tags or whatever you want
    li.innerHTML = node.innerHTML;
    // recursive call
    buildRec(nodes, elm, lv + cnt);
  }
}
// example usage
var all = document.getElementById("content").getElementsByTagName("*");
var nodes = [];
for (var i = all.length; i--; nodes.unshift(all[i]));
var result = document.createElement("ul");
buildRec(nodes, result, 1);
document.getElementById("outp").appendChild(result);
<div id="outp">
</div>
<div id="content">
  <h1>Test 1</h1>
  <h2>Test 1.1</h2>
  <h3>Test 1.1.1</h3>
  <h3>Test 1.1.2</h3>
  <h1>Test 2</h1>
  <h3>Test 2.1.1</h3>
  <h2>Test 2.2</h2>
  <p></p>
</div>
Problem to be resolved:
As you can see above, it uses heading tags to sort. Unfortunately, my category hierarchy is not limited to six levels; it may grow. So I want JS/jQuery/any framework code that converts some tags to a ul/li structure based on an attribute value. If any change on the backend is needed, I can change those attributes/tags on the PHP side. If it can be done easily on the PHP side, example code snippets are also welcome. Consider the above div tags as a single string input. :)
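One possible way to lift the six-level limit is to read the level from aria-level instead of the tag name. A minimal sketch, assuming the levels are positive integers and the text is already safe to emit as HTML; buildTree and renderNestedList are hypothetical names. A skipped level (Test 2 followed directly by Test 2.1.1) is simply attached to the nearest shallower item, which may or may not be the nesting you want.

```javascript
// Turns a flat list of { level, text } items into a tree.
// Assumes every level is a positive integer (1, 2, 3, ...).
function buildTree(items) {
  const root = { text: null, children: [] };
  // stack of open ancestors; the root acts as level 0 and is never popped
  const stack = [{ level: 0, node: root }];
  for (const { level, text } of items) {
    // close every open item that is at the same level or deeper
    while (stack[stack.length - 1].level >= level) stack.pop();
    const node = { text, children: [] };
    stack[stack.length - 1].node.children.push(node);
    stack.push({ level, node });
  }
  return root.children;
}

// Renders the tree as nested <ul><li> markup.
// The text is inserted as-is, so it must already be HTML-safe.
function renderNestedList(nodes) {
  if (nodes.length === 0) return '';
  const items = nodes
    .map((n) => `<li>${n.text}${renderNestedList(n.children)}</li>`)
    .join('');
  return `<ul>${items}</ul>`;
}

// In the browser the items could be read like this (assumed usage):
//   const items = Array.from(
//     document.querySelectorAll('[role="heading"][aria-level]'),
//     (el) => ({ level: Number(el.getAttribute('aria-level')), text: el.textContent })
//   );
//   document.getElementById('outp').innerHTML = renderNestedList(buildTree(items));
```

Separating the tree building from the rendering keeps the level logic testable without a DOM, and the same two steps could be ported to PHP if the conversion should happen on the backend.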

Handle dynamically growing list in react js

I have a component for displaying a list, and this component is rendered from a parent component. I am currently generating the contents of this list using a for loop, like this:
let content = []
for (let i = 0; i < list.length; i++) {
  content.push(
    <div>
      <p>{list[i].something}</p>
      <p>{list[i].somethingElse}</p>
    </div>
  )
}
return content;
Now whenever a new object is added to this list, all the previous objects of the list, and the newly added object get rendered. This becomes extremely slow when the list contains around 1000 objects.
Is there a way to render only the newly added object, without re-rendering all the previous entries of the list?
This is most likely because you haven't added a key. Try the following code after setting an id for each list item and assigning it to the key prop.
let content = []
for (let i = 0; i < list.length; i++) {
  content.push(
    <div key={list[i].id}>
      <p>{list[i].something}</p>
      <p>{list[i].somethingElse}</p>
    </div>
  )
}
return content;
If the list is a static one that doesn't change, the index can also be used as the value for the key prop.
let content = []
for (let i = 0; i < list.length; i++) {
  content.push(
    <div key={i}>
      <p>{list[i].something}</p>
      <p>{list[i].somethingElse}</p>
    </div>
  )
}
return content;
You should have a unique id for each item in order to render the list items. In ES6 you can do something like:
const renderList = (list) =>
  list.map(item => (
    <div key={item.id}>
      <p>{item.something}</p>
      <p>{item.somethingElse}</p>
    </div>
  ));

Count sibling elements that all have same unknown class name

I need help working with a large list of sibling elements with different class names. I need to:
Get the number of elements with the same class name and put them in an array.
Find the first element in each class group (this can be a number or a name).
Run a function conditionally: if the element is the first element of its group, do console.log("first element");
Here's an example of the first 3 classes, but this will go from groupA to groupZ:
<div class = 'slider'>
  <div class = 'item1 groupA'> <!-- Start Group A -->
    <img src='xyz' />
  </div>
  <div class = 'item1 groupA'>
    <img src='xyz' />
  </div>
  <div class = 'item1 groupA'>
    <img src='xyz' />
  </div>
  <div class = 'item1 groupA'>
    <img src='xyz' />
  </div>
  <div class = 'item1 groupB'> <!-- Start Group B -->
    <img src='xyz' />
  </div>
  <div class = 'item1 groupB'>
    <img src='xyz' />
  </div>
  <div class = 'item1 groupB'>
    <img src='xyz' />
  </div>
  <div class = 'item1 groupC'> <!-- Start Group C -->
    <img src='xyz' />
  </div>
  <div class = 'item1 groupC'>
    <img src='xyz' />
  </div> <!-- All the way to group Z -->
</div>
Edit: Your requirement is very specific. Below is just a sample that loops through all the children and stores the count and the first element for each class. Let me
$(function () {
  $.fn.benton = function () {
    // just the immediate children
    var $chds = $(this).children();
    var lc = {
      firstEl: {},
      classCount: {}
    };
    $.each($chds, function (idx, el) {
      if (el.className) {
        var tokens = el.className.split(' ');
        for (var i = 0; i < tokens.length; i++) {
          if (lc.classCount.hasOwnProperty(tokens[i])) {
            lc.classCount[tokens[i]] += 1;
          } else {
            lc.classCount[tokens[i]] = 1;
            lc.firstEl[tokens[i]] = $(el);
          }
        }
      }
    });
    return lc;
  };
  var stats = $('.slider').benton();
  console.log(stats.classCount['groupA']);
  stats.firstEl['item1'].css({border: '1px solid red', width: 100, height: 10});
});
DEMO: http://jsfiddle.net/LhwQ4/1/
I think what you need is to use the slider as the context to get the child elements. See below:
var $slider = $('.slider');
Now, using the $slider context:
$('.groupA', $slider)
// Returns a jQuery collection of the elements that have class `groupA`
$('.groupA:first', $slider)
// Returns the first element in the collection of elements with class `groupA`
To get all elements with the same class name, you only need a simple jQuery selector. The returned value is an array-like jQuery object containing all matching elements.
var groupA = $(".groupA");
To get the number of items, read the length property of that object.
var groupALength = groupA.length;
If you want to extract only the first element of any matched elements, you can use jQuery's :first selector.
var firstElement = $(".groupA:first");
var groups = {};
$(".slider").children().each(function(i, el) {
  var classes = el.className.split(/\s+/);
  // j, not i, so the loop does not clobber the callback's index argument
  for (var j = 0; j < classes.length; j++) {
    if (classes[j] in groups)
      groups[classes[j]].push(el);
    else
      groups[classes[j]] = [el];
  }
});
Now you can access all elements of a group via groups["groupA"] etc. (as a jQuery collection: $(groups["groupB"])) and the first one via groups["groupC"][0]. The number of elements in a group is just the length of the array.
Notice that this puts all elements in the group "item1" - I don't know what you need that class for.
Ok, so this solution is quite sensitive. I'm making a few assumptions about your HTML.
In your example you gave each item a class of item1. I am assuming that this is just an issue of copying and pasting the element. Each "item" should have the same class so that you can retrieve all the items with one selector. For my example, I'm assuming a class of item.
There should be only this item class plus an additional "group" class. Any other class given to the item will render this solution invalid.
// fetch ALL items
var allItems = $(".item");
// initialize the groups object (group name -> array of items)
var groups = {};
$.each(allItems, function(index, elem) {
  var item = $(this);
  var itemClass = item.attr('class');
  // remove the "item" class and any leftover whitespace
  itemClass = $.trim(itemClass.replace('item', '')); // should now be groupA/groupB...
  // add the item to the array stored under the group's name
  if (groups[itemClass] === undefined) {
    groups[itemClass] = [];
  }
  groups[itemClass].push(item);
});
You should now be left with an object of arrays, keyed by group name, containing all the items. To see this in action, you can check out this jsFiddle.
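For comparison, the same grouping can be sketched without jQuery and without the restriction that each item carries exactly two classes. This assumes that every group class starts with the prefix "group"; groupByClassPrefix is a made-up name and the prefix is a parameter precisely because that naming is a guess:

```javascript
// Groups elements by each of their class names that start with `prefix`.
// Returns a Map: class name -> array of elements, in document order.
// `elements` can be an array or an array-like such as `parent.children`.
function groupByClassPrefix(elements, prefix = 'group') {
  const groups = new Map();
  for (const el of Array.from(elements)) {
    // split on whitespace and drop empty tokens
    const tokens = el.className.split(/\s+/).filter(Boolean);
    for (const token of tokens) {
      if (!token.startsWith(prefix)) continue; // ignore e.g. "item1"
      if (!groups.has(token)) groups.set(token, []);
      groups.get(token).push(el);
    }
  }
  return groups;
}

// In the browser (assumed usage):
//   const groups = groupByClassPrefix(document.querySelector('.slider').children);
//   groups.get('groupA').length; // number of elements in group A
//   groups.get('groupA')[0];     // first element of group A
//   // "is this element the first of its group?" for a known group name:
//   //   el === groups.get('groupA')[0]
```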
