Puppeteer: delete a node inside an element - javascript

I want to scrape a page that contains some news items.
Here is a simplified HTML version of what I have:
<info id="random_number" class="news">
<div class="author">
Name of author
</div>
<div class="news-body">
<blockquote>...</blockquote>
Here it's the news text
</div>
</info>
<info id="random_number" class="news">
<div class="author">
Name of author
</div>
<div class="news-body">
Here it's the news text
</div>
</info>
I want to get the author and text body of each news, without the blockquote part.
So I wrote this code :
const newsItems = await newsPage.$$("info.news"); // newsPage is the Puppeteer page
for (const news of newsItems) { // Loop through each element
let author = await news.$eval('.author', s => s.textContent.trim());
let textBody = await news.$eval('.news-body', s => s.textContent.trim());
console.log('Author :'+ author);
console.log('TextBody :'+ textBody);
}
It works well, but I don't know how to remove the blockquote from the "news-body" element before getting the text. How can I do this?
EDIT: Sometimes a blockquote exists, sometimes not.

You can use optional chaining with ChildNode.remove(), so the call is skipped when there is no blockquote. Also consider innerText, whose output may be more readable.
let textBody = await news.$eval('.news-body', (element) => {
element.querySelector('blockquote')?.remove();
return element.innerText.trim();
});
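If a news body can contain more than one blockquote, or you would rather not modify the live page, a variation along these lines might help. It is only a sketch: the names extractTextWithoutQuotes and scrapeNews are made up for illustration, and `page` is assumed to be an already-open Puppeteer page. It uses textContent because the work happens on a detached copy of the element.

```javascript
// Runs in the browser context when passed to $eval, so it must not
// reference any variable from the surrounding Node.js scope.
function extractTextWithoutQuotes(element) {
  // Work on a deep copy so the live page is left untouched.
  const copy = element.cloneNode(true);
  // Remove every blockquote; the loop simply does nothing when there is none.
  for (const quote of copy.querySelectorAll("blockquote")) {
    quote.remove();
  }
  return copy.textContent.trim();
}

// Hypothetical helper: `page` is assumed to be an open Puppeteer page.
async function scrapeNews(page) {
  const results = [];
  for (const news of await page.$$("info.news")) {
    results.push({
      author: await news.$eval(".author", (el) => el.textContent.trim()),
      textBody: await news.$eval(".news-body", extractTextWithoutQuotes),
    });
  }
  return results;
}
```

Calling `await scrapeNews(page)` would then give one `{ author, textBody }` object per news item.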

Related

Click to change image is changing the wrong one

I'm working on a website that lists a lot of movies, and people can choose which movies are their favorites. For that, I have a star image that can be clicked, and when clicked it changes to another image (from an empty star to a filled star).
The problem is that the only image that changes is the first one. When I click, for example, on the star next to the Ratatouille movie, the first star changes instead.
This is the HTML:
const getMovieHtml = (movie) => `<div class="movie">
<h2>${movie.title}</h2>
<img onclick="bottonclick()" id="estrelinhas" src="./icons/empty-star.png" alt="estrela vazia" width=40>
<div class="content">
<img src="${movie.posterUrl}" alt="${movie.title}" />
<div class="text">
<p>${movie.summary}</p>
<div class="year">${movie.year}</div>
<div><strong>Directors:</strong> ${movie.director}</div>
<div><strong>Actors:</strong> ${movie.actors}</div>
</div>
</div>
</div>`;
And this is the arrow function I used to make the star change:
const bottonclick = () => {
if (document.getElementById("estrelinhas").src.includes("empty-star.png")) {
document.getElementById("estrelinhas").src = "./icons/filled-star.png";
} else {
document.getElementById("estrelinhas").src = "./icons/empty-star.png";
}
};
ID attributes of HTML elements should be unique. If your IDs aren't unique, getElementById returns only the first matching element, so the code can't tell which star to update. Read more about IDs here: https://www.w3schools.com/html/html_id.asp
To fix this, use a unique identifier for each image, so that when you click "favourite" the code knows which star to reference.
Assuming, for example, that movie.posterUrl is unique, you can use that as the ID. However, the data source you get the movies from might already provide a unique identifier that you could pass to the id attribute of the image instead.
Your code could look something like this:
const getMovieHtml = (movie) => `<div class="movie">
<h2>${movie.title}</h2>
<img onclick="buttonClick(event)" id="${movie.posterUrl}" src="./icons/empty-star.png" alt="estrela vazia" width=40>
<div class="content">
<img src="${movie.posterUrl}" alt="${movie.title}" />
<div class="text">
<p>${movie.summary}</p>
<div class="year">${movie.year}</div>
<div><strong>Directors:</strong> ${movie.director}</div>
<div><strong>Actors:</strong> ${movie.actors}</div>
</div>
</div>
</div>`;
const buttonClick = (e) => {
const star = document.getElementById(e.target.id);
if (star.src.includes("empty-star.png")) {
star.src = "./icons/filled-star.png";
} else {
star.src = "./icons/empty-star.png";
}
}
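This happens because getElementById returns a single element, the first one with that id, not every star.
Either select the elements with querySelectorAll, or give each star its own id so it can be located in the DOM.
Pass the event to the handler in the onclick attribute:
buttonClick(event)
and read the clicked element from it in the function:
function buttonClick(e) {
// e.target is the clicked <img>; use e.target.id, e.target.src, etc.
}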
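Since both answers come down to acting on the element that was actually clicked, another option is to skip ids entirely and use a single delegated click listener. This is a rough sketch, not the original poster's code: the "star" class on the image and the names nextStarSrc and onStarClick are assumptions made for illustration.

```javascript
const EMPTY_STAR = "./icons/empty-star.png";
const FILLED_STAR = "./icons/filled-star.png";

// Pure helper: given the current src, return the src to switch to.
const nextStarSrc = (currentSrc) =>
  currentSrc.includes("empty-star.png") ? FILLED_STAR : EMPTY_STAR;

// Toggles whichever star was clicked; no id lookup is needed.
const onStarClick = (event) => {
  // Assumes each star is rendered as <img class="star" ...>.
  const star = event.target.closest("img.star");
  if (star) {
    star.src = nextStarSrc(star.src);
  }
};

// One listener for the whole page instead of an onclick on every <img>.
// Guarded so the snippet can also be loaded outside a browser.
if (typeof document !== "undefined") {
  document.addEventListener("click", onStarClick);
}
```

With this in place the template only needs class="star" on the star image, and stars added to the page later are handled without attaching new listeners.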

Excluding inner tags from string using Regex

I have the following text:
If there would be more <div>matches<div>in</div> string</div>, you will merge them to one
How do I make a JS regex that will produce the following text?
If there would be more <div>matches in string</div>, you will merge them to one
As you can see, the additional <div> tag has been removed.
I would use a DOMParser to parseFromString into the more fluent HTMLDocument interface to solve this problem. You are not going to solve it well with regex.
const htmlDocument = new DOMParser().parseFromString("this <div>has <div>nested</div> divs</div>", "text/html");
htmlDocument.body.childNodes; // NodeList(2): [ #text, div ]
From there, the algorithm depends on exactly what you want to do. Solving the problem exactly as you described to us isn't too tricky: recursively walk the DOM tree; remember whether you've seen a tag yet; if so, exclude the node and merge its children into the parent's children.
In code:
const simpleExampleHtml = `<div>Hello, this is <p>a paragraph</p> and <div>some <div><div><div>very deeply</div></div> nested</div> divs</div> that should be eliminated</div>`
// Parse into an HTML document
const doc = new DOMParser().parseFromString(simpleExampleHtml, "text/html").body;
// Process a node, removing any tags that have already been seen
const processNode = (node, seenTags = []) => {
// If this is a text node, return it
if (node.nodeName === "#text") {
return node.cloneNode()
}
// If this node has been seen, return its children
if (seenTags.includes(node.tagName)) {
// flatMap flattens, in case the same node is repeatedly nested
// note that this is a newer JS feature and lacks IE11 support: https://caniuse.com/?search=flatMap
return Array.from(node.childNodes).flatMap(child => processNode(child, seenTags))
}
// If this node has not been seen, process its children and return it
const newChildren = Array.from(node.childNodes).flatMap(child => processNode(child, [...seenTags, node.tagName]))
// Clone the node so we don't mutate the original
const newNode = node.cloneNode()
// We can't directly assign to node.childNodes - append every child instead
newChildren.forEach(child => newNode.appendChild(child))
return newNode
}
// resultBody is an HTML <body> Node with the desired result as its childNodes
const resultBody = processNode(doc);
const resultText = resultBody.innerHTML
// <div>Hello, this is <p>a paragraph</p> and some very deeply nested divs that should be eliminated</div>
But make sure you know EXACTLY what you want to do!
There are many potential complications with data that's more complex than your example. Here are some cases where the simple approach may not give you the desired result.
<!-- nodes where nested identical children are meaningful -->
<ul>
<li>Nested list below</li>
<li>
<ul>
<li>Nested list item</li>
</ul>
</li>
</ul>
<!-- nested nodes with classes or IDs -->
<span>A span with <span class="some-class">nested spans <span id="DeeplyNested" class="another-class">with classes and IDs</span></span></span>
<!-- places where divs are essential to the layout -->
<div class="form-container">
<form>
<div class="form-row">
<label for="username">Username</label>
<input type="text" name="username" />
</div>
<div class="form-row">
<label for="password">Password</label>
<input type="text" name="password" />
</div>
</form>
</div>
A simple approach without regex: put the text into a p element, replace the first div's content with its own innerText (which strips any nested HTML tags), and then read the result back with innerHTML:
let text = 'If there would be more <div>matches <div>in</div> string</div>, you will merge them to one';
const p = document.createElement('p');
p.innerHTML = text;
p.querySelector('div').innerText = p.querySelector('div').innerText;
console.log(p.innerHTML);

How to render a large amount of html after hitting search button?

I am building a recipe search app, but I am not sure how to render the big chunk of HTML representing the 30 recipes I want to display. These recipe cards have to change based on what kind of meal I search for. How do I build the HTML in my JS so it changes based on the meal type?
I tried to do it with insertAdjacentHTML, but it doesn't replace the old recipes on the page; it adds to them instead, so it didn't work.
Nevertheless, here is the code:
// recipe card markup
const renderRecipe = () => {
const markup = `
<div class="recipe-results__card">
<div class="recipe-results--img-container">
<img
src="${data.data.hits.recipe.image}"
alt="${data.data.hits.recipe.label}"
class="recipe-results__img"
/>
</div>
<p class="recipe-results__categories">
${limitCategories}
</p>
<h2 class="recipe-results__name">
${data.data.hits.recipe.label}
</h2>
</div>
`;
DOM.recipeGrid.insertAdjacentHTML("beforeend", markup);
};
// fn to display the markup for every search res
export const renderResults = recipes => {
recipes.forEach(renderRecipe);
};
Concatenate the HTML (push each card to an array) so you make ONE update to the DOM.
You can use innerHTML instead of insertAdjacentHTML to replace the content:
let html = [];
data.data.hits.forEach(recipe => {
html.push(`
<div class="recipe-results__card">
<div class="recipe-results--img-container">
<img
src="${recipe.image}"
alt="${recipe.label}"
class="recipe-results__img"
/>
</div>
<p class="recipe-results__categories">
${limitCategories}
</p>
<h2 class="recipe-results__name">
${recipe.label}
</h2>
</div>
`) });
document.getElementById("recipeGrid").innerHTML = html.join("");
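The same idea can be written with map and join, and it may be worth escaping the values that come back from the API before putting them into markup. The following is a sketch under assumptions: each recipe object is taken to have `image` and `label` properties as in the answer above, the categories paragraph is left out, and escapeHtml, recipeCardHtml, renderResultsHtml and showResults are illustrative names.

```javascript
// Escape characters that have a special meaning in HTML.
const escapeHtml = (value) =>
  String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");

// Markup for one card; assumes the recipe has `image` and `label`.
const recipeCardHtml = (recipe) => `
<div class="recipe-results__card">
  <div class="recipe-results--img-container">
    <img src="${escapeHtml(recipe.image)}" alt="${escapeHtml(recipe.label)}" class="recipe-results__img" />
  </div>
  <h2 class="recipe-results__name">${escapeHtml(recipe.label)}</h2>
</div>`;

// Build the whole result list as a single string.
const renderResultsHtml = (recipes) => recipes.map(recipeCardHtml).join("");

// Assigning innerHTML replaces the previous results in one DOM update.
const showResults = (container, recipes) => {
  container.innerHTML = renderResultsHtml(recipes);
};
```

In the page this could be called as `showResults(document.getElementById("recipeGrid"), recipes)` after each search, so the old cards disappear and the new ones take their place.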

Selecting a childnode by inner text from a NodeList

First part [solved]
Given this HTML
<div id="search_filters">
<section class="facet clearfix"><p>something</p><ul>...</ul></section>
<section class="facet clearfix"><p>something1</p><ul>...</ul></section>
<section class="facet clearfix"><p>something2</p><ul>...</ul></section>
<section class="facet clearfix"><p>something3</p><ul>...</ul></section>
<section class="facet clearfix"><p>something4</p><ul>...</ul></section>
</div>
I can select all the section with
const select = document.querySelectorAll('section[class^="facet clearfix"]');
The result is a nodelist with 5 children.
What I'm trying to accomplish is to select only the section containing the "something3" string.
This is my first attempt:
const xpathResult = document.evaluate("//section[p[contains(.,'something3')]]",
select, null, XPathResult.ANY_TYPE, null);
How can I filter the selection to select only the node I need?
Second part:
Sorry for repeatedly updating the question, but this problem seems to be beyond my current skill.
Now that I have the node I need, what I have to do is set a custom order for the li elements in the section:
<ul id="" class="collapse">
<li>
<label>
<span>
<input>
<span>
<a> TEXT
</a>
</span>
</input>
</span>
</label>
</li>
<li>..</li>
Assuming there are n li elements with n different texts and I have a custom order to follow, I think the best way would be to look at the innerText and match it against an array where I set the correct order.
var sortOrder = ['text2','text1','text4','text3']
I think this approach should lead you to the solution.
Given your HTML
<div id="search_filters">
<section class="facet clearfix"><p>something</p><ul>...</ul></section>
<section class="facet clearfix"><p>something1</p><ul>...</ul></section>
<section class="facet clearfix"><p>something2</p><ul>...</ul></section>
<section class="facet clearfix"><p>something3</p><ul>...</ul></section>
<section class="facet clearfix"><p>something4</p><ul>...</ul></section>
</div>
I would write this js
const needle = "something3";
const selection = document.querySelectorAll('section.facet.clearfix');
let i = -1;
console.info("SELECTION", selection);
let targetIndex;
while(++i < selection.length){
if(selection[i].innerHTML.indexOf(needle) > -1){
targetIndex = i;
}
}
console.info("targetIndex", targetIndex);
console.info("TARGET", selection[targetIndex]);
Then you can play and swap elements around without removing them from the DOM.
PS. Since you know the CSS classes of the elements, you don't need the ^= (starts with) attribute selector, so I replaced it with a class selector.
PART 2: ordering children li based on content
const ul = selection[targetIndex].querySelector("ul"); // Find the <ul>
const lis = ul.querySelectorAll("li"); // Find all the <li>
const sortOrder = ['text2', 'text1', 'text4', 'text3'];
i = -1; // We already declared this above
while(++i < sortOrder.length){
const li = [...lis].find((li) => li.innerHTML.indexOf(sortOrder[i]) > -1);
!!li && ul.appendChild(li);
}
This will move the elements you want (only the one listed in sortOrder) in the order you need, based on the content and the position in sortOrder.
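The ordering step can also be separated from the DOM work, which makes it easier to check on its own. Here is a minimal sketch, assuming the texts in the order array do not contain one another (for example 'text1' would also match 'text10'); sortByCustomOrder and reorderList are hypothetical names. Note one difference from the loop above: items that match no entry end up after the listed ones instead of before them.

```javascript
// Return a new array ordered by where each item's text appears in `order`.
// Items whose text matches no entry keep their relative order at the end.
const sortByCustomOrder = (items, order, getText) => {
  const rank = (item) => {
    const text = getText(item);
    const index = order.findIndex((key) => text.includes(key));
    return index === -1 ? order.length : index;
  };
  // Copy first so the input is not mutated; sort is stable in modern engines.
  return [...items].sort((a, b) => rank(a) - rank(b));
};

// appendChild moves an existing node, so re-appending the <li> elements
// in sorted order reorders the list without cloning anything.
const reorderList = (ul, order) => {
  const sorted = sortByCustomOrder(ul.children, order, (li) => li.textContent);
  sorted.forEach((li) => ul.appendChild(li));
};
```

With the variables from the answer above, this would be called as `reorderList(ul, sortOrder)`.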
To use a higher-order function like filter on your NodeList, you have to convert it to an array. You can do this with spread syntax:
var select = document.querySelectorAll('section[class^="facet clearfix"]');
const filtered = [...select].filter(section => section.children[0].innerText == "something3")
This answer explains the magic behind it better.
You can get that list filtered as:
DOM Elements:
Array.from(document.querySelectorAll('section[class^="facet clearfix"]')).filter(_ => _.querySelector('p').innerText === "something3")
HTML Strings:
Array.from(document.querySelectorAll('section[class^="facet clearfix"]')).filter(_ => _.querySelector('p').innerText === "something3").map(_ => _.outerHTML)
:)

Unable to grab an image with cheerio/node.js

My problem is pretty straightforward. I am trying to console log the URL of an image from the Amazon link below, either through a more precise selection or by filtering what I already get.
I've spent the majority of the time attempting to select the id/class of the image, but the closest I can get is #imgTagWrapperId, which returns a lot of redundant information. In theory I should be able to grab the links by narrowing things down with regex, but for the life of me I can only seem to replace the text I return, not simply grab it. Alternatively, as stated, I have attempted to grab the img src itself, only to get a nonsensical string of code. When I view the page source the same ball of text appears there, but not when I inspect the elements directly.
const request = require('request');
const cheerio = require('cheerio');
request(`http://amazon.com/dp/B079H6RLKQ`, (error,response,html) =>{
if (!error && response.statusCode ==200) {
const $ = cheerio.load(html);
const productTitle = $("#productTitle").text().replace(/\s\s+/g, '');
const prodImg = $(`#imgTagWrapperId`).html();
console.log(productTitle);
console.log(prodImg);
} else {
console.log(error);
}
})
This current code returns the product title faithfully but returns this for the prodImg output:
<img alt="Samsung Galaxy S9 G960U 64GB Unlocked 4G LTE Phone w/ 12MP Camera - Midnight Black" src="

...(this nonsense goes on for a mile) ....
" data-old-hires="https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SL1500_.jpg" class="a-dynamic-image a-stretch-horizontal" id="landingImage" data-a-dynamic-image="{"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX522_.jpg":[564,522],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX342_.jpg":[369,342],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX679_.jpg":[733,679],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX425_.jpg":[459,425],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX466_.jpg":[503,466],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX569_.jpg":[615,569],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX385_.jpg":[416,385]}" style="max-width:679px;max-height:733px;">
</div>
Thank you in advance for any help and guidance with this. I have exhausted all other usual sources and am ready to be called an idiot.
EDIT:
Someone wanted the HTML before and after the selection. I'll oblige, but it might be better to just view the page source in the link and Ctrl+F. Wall of text below.
<div class="variationUnavailable unavailableExp">
<div class="inner">
<div class="a-box a-alert a-alert-error" aria-live="assertive" role="alert"><div class="a-box-inner a-alert-container"><h4 class="a-alert-heading">Image Unavailable</h4><i class="a-icon a-icon-alert"></i><div class="a-alert-content">
<span class="a-text-bold">
Image not available for<br/>Color:
<span class="unvailableVariation"></span>
</span>
</div></div></div>
</div>
</div>
<!-- Append onload function to stretch image on load to avoid flicker when transitioning from low res image from Mason to large image variant in desktop -->
<!-- any change in onload function requires a corresponding change in Mason to allow it pass in /mason/amazon-family/gp/product/features/embed-features.mi -->
<!-- and /mason/amazon-family/gp/product/features/embed-landing-image.mi -->
<ul class="a-unordered-list a-nostyle a-horizontal list maintain-height">
<span id="imageBlockEDPOverlay"></span>
<li class="image item itemNo0 selected maintain-height"><span class="a-list-item">
<span class="a-declarative" data-action="main-image-click" data-main-image-click="{}">
<div id="imgTagWrapperId" class="imgTagWrapper">
<img alt="Samsung Galaxy S9 G960U 64GB Unlocked 4G LTE Phone w/ 12MP Camera - Midnight Black" src="

" data-old-hires="https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SL1500_.jpg" class="a-dynamic-image a-stretch-horizontal" id="landingImage" data-a-dynamic-image="{"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX522_.jpg":[564,522],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX342_.jpg":[369,342],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX679_.jpg":[733,679],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX425_.jpg":[459,425],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX466_.jpg":[503,466],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX569_.jpg":[615,569],"https://images-na.ssl-images-amazon.com/images/I/81%2Bh9mpyQmL._SX385_.jpg":[416,385]}" style="max-width:679px;max-height:733px;">
</div>
</span>
</span></li>
<li class="mainImageTemplate template"><span class="a-list-item">
<span class="a-declarative" data-action="main-image-click" data-main-image-click="{}">
<div class="imgTagWrapper">
<span class="placeHolder"></span>
</div>
</span>
</span></li>
Thanks to Rishi Raj for giving a quick-fix solution: $('#landingImage').attr('data-old-hires'). I was also adding an unnecessary .html() to the const, which had gotten in the way. Thanks again everyone!
Couldn't you simply target the image directly and read the URL from one of its attributes? The src holds the long inline string shown in the question, so the code below reads data-old-hires instead:
const request = require('request');
const cheerio = require('cheerio');
request('http://amazon.com/dp/B079H6RLKQ', (error,response,html) => {
if (!error && response.statusCode === 200) {
const $ = cheerio.load(html);
const productTitle = $('#productTitle').text().replace(/\s\s+/g, '');
const prodImg = $('#landingImage').attr('data-old-hires');
console.log(productTitle);
console.log(prodImg);
} else {
console.log(error);
}
});
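If data-old-hires ever turns out to be empty, the data-a-dynamic-image attribute shown in the question could serve as a fallback. The sketch below assumes that attribute holds a JSON object mapping each image URL to a pair of pixel dimensions, as in the posted markup; largestImageUrl is a made-up helper name and the page structure may of course change.

```javascript
// Pick the URL with the largest pixel area from a JSON string shaped like
// {"https://.../a.jpg":[564,522],"https://.../b.jpg":[733,679]}.
const largestImageUrl = (dynamicImageJson) => {
  const sizesByUrl = JSON.parse(dynamicImageJson);
  let bestUrl = null;
  let bestArea = -1;
  for (const [url, [first, second]] of Object.entries(sizesByUrl)) {
    const area = first * second;
    if (area > bestArea) {
      bestUrl = url;
      bestArea = area;
    }
  }
  return bestUrl; // null when the object has no entries
};

// Possible use with the cheerio code above:
// const prodImg = largestImageUrl($('#landingImage').attr('data-a-dynamic-image'));
```

Comparing areas avoids having to know whether each pair is stored as width and height or the other way round.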
