How do I get whole html from Apify Cheerio crawler? - javascript

I want to get the whole html not just text.
Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({
        url: address, // the target URL (omitted in the question)
        uniqueKey: makeid(100)
    });

    const handlePageFunction = async ({ request, $ }) => {
        var content_to = $('.class');
    };

    // Set up the crawler, passing a single options object as an argument.
    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction,
    });

    await crawler.run();
});
When I try this, the crawler returns a complex object. I know I can extract the text from the content_to variable using .text(), but I need the whole HTML including the tags. What should I do?

If I understand you correctly, you could just use .html() instead of .text(). That way you get the inner HTML of the element instead of its inner text.
Another thing to mention: you could also add body to the handlePageFunction argument object:
const handlePageFunction = async ({ request, body, $ }) => {
body would then contain the whole raw HTML of the page.
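Putting both suggestions together, a minimal sketch of such a handlePageFunction might look like this ('.class' is the placeholder selector from the question, and the return value is only there to make the mapping explicit):

```javascript
// Sketch of a handlePageFunction that captures HTML rather than text.
// '.class' is the placeholder selector from the question.
const handlePageFunction = async ({ request, body, $ }) => {
    const fullHtml = body;                  // raw HTML of the entire page
    const innerHtml = $('.class').html();   // inner HTML of the matched element
    const outerHtml = $.html($('.class'));  // outer HTML, tag included
    return { fullHtml, innerHtml, outerHtml };
};
```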

Related

How to make an element contenteditable and send the new string to Node/Express-server

So I have built a REST API with working GET, POST, PUT and DELETE endpoints on an Express server, and I'm currently working on the client side. The Express server reads and writes to a JSON file.

On the client side I have an HTML page and a script file. So far I've managed to make a GET request and render the contents of the JSON file on the page. I've also built an input form and a POST request that creates a new object with all the keys/values I want it to have.

My idea was to make a PUT request to update a particular object. Is it possible to use the HTML attribute contenteditable for this? While rendering everything in the script file I'm inside a for...of loop, and I add an event listener that should send the NEW information from the contenteditable element to the async/await function that makes the PUT request. But in all my attempts I only ever see the old information.
async function printCharacters() {
    const get = await fetch('http://localhost:3000/api');
    const characters = await get.json();
    for (const character of characters) {
        const characterContainers = document.createElement("div");
        main.appendChild(characterContainers);
        characterContainers.className = "characterContainer";
        const characterName = document.createElement("p");
        characterName.setAttribute("contenteditable", "true");
        characterName.innerHTML = character.characterName;
        characterContainers.appendChild(characterName);
        const updateButton = document.createElement("button");
        updateButton.className = "updateButton";
        updateButton.innerText = "Update";
        characterContainers.appendChild(updateButton);
        updateButton.addEventListener("click", () => {
            updateCharacter(characterName.innerHTML);
        });
    }
}
async function updateCharacter(data) {
    const response = await fetch(`http://localhost:3000/api/update/${data.id}`, {
        method: "PUT",
        headers: {
            "Content-Type": "application/json"
        },
        body: JSON.stringify({
            "characterName": data.characterName
        })
    });
    return response.json();
}
I've since tried building the data object like this, and then it's possible to console.log and capture the new information:
updateButton.addEventListener("click", (e) => {
    e.preventDefault();
    e.stopPropagation();
    const changedData = {
        id: character.id,
        characterName: characterName.innerHTML,
        class: characterClass.innerHTML,
        weapon: characterWeapon.innerHTML,
        description: characterDescription.innerHTML
    };
    updateCharacter(changedData);
});
This was the working solution. I had tried reading innerHTML both inside and outside the event listener before, but to get the updated content of the element the read has to happen inside the listener, together with the addition of preventDefault and stopPropagation.
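The timing issue can be sketched without the DOM at all: what matters is whether the value is read when the listener is registered or when the click fires. (makeElement and onClick below are illustrative stand-ins, not part of the original code.)

```javascript
// Stand-in for a contenteditable element: just an object with innerHTML.
function makeElement(initial) {
    return { innerHTML: initial };
}

const characterName = makeElement('Aragorn');

// Wrong: the value is captured once, when the listener is set up.
const staleSnapshot = characterName.innerHTML;

// Right: the value is read inside the handler, at click time.
const onClick = () => ({ characterName: characterName.innerHTML });

// The user now edits the element...
characterName.innerHTML = 'Strider';

// staleSnapshot still says 'Aragorn'; onClick() sees 'Strider'.
```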

How to crawling using Node.js

I can't believe I'm asking such an obvious question, but I still get the wrong result in the console.
The console shows the crawl result as "[]", but I've checked for typos at least 10 times. Anyway, here's the JavaScript code.
I want to crawl the site.
This is the kangnam.js file:
const axios = require('axios');
const cheerio = require('cheerio');
const log = console.log;

const getHTML = async () => {
    try {
        return await axios.get('https://web.kangnam.ac.kr', {
            headers: {
                Accept: 'text/html'
            }
        });
    } catch (error) {
        console.log(error);
    }
};

getHTML()
    .then(html => {
        let ulList = [];
        const $ = cheerio.load(html.data);
        const $allNotices = $("ul.tab_listl div.list_txt");
        $allNotices.each(function(idx, element) {
            ulList[idx] = {
                title: $(this).find("list_txt title").text(),
                url: $(this).find("list_txt a").attr('href')
            };
        });
        const data = ulList.filter(n => n.title);
        return data;
    })
    .then(res => log(res));
I've checked and revised it at least 10 times, yet JS still prints this result:
root#goorm:/workspace/web_platform_test/myapp/kangnamCrawling(master)# node kangnam.js
[]
Mate, I think the issue is you're parsing it incorrectly.
$allNotices.each(function(idx, element) {
    ulList[idx] = {
        title: $(this).find("list_txt title").text(),
        url: $(this).find("list_txt a").attr('href')
    };
});
The data you're trying to parse is located at the first index of the $(this) array, which is really just storing a DOM Node. As to why the DOM stores Nodes this way, it's most likely for efficiency. All the data you're looking for is contained within that Node object. However, find() is superficial and only checks the indexes of an array against the condition you supplied, which here is a string search. The $(this) array only contains a Node, not a string, so when you call .find() with a string it will always return undefined.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/find
You first need to access the initial index and use property accessors on the Node. You also don't need $(this), since the element parameter already gives you the exact same data; it's more efficient to just use element.
$allNotices.each(function(idx, element) {
    ulList[idx] = {
        title: element.children[0].attribs.title,
        url: element.children[0].attribs.href
    };
});
This should now populate your data array correctly. You should always analyze the data structures you're parsing, since that's the only way to parse them correctly.
Anyways, I hope I solved your problem!
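To make the Node-shape argument concrete, here is a hand-built object mimicking what one parsed entry might look like (the children/attribs layout follows the answer's description; the actual values are invented for illustration):

```javascript
// Assumed shape of one cheerio DOM node for a notice entry (values invented).
const element = {
    children: [
        { name: 'a', attribs: { title: 'Notice title', href: '/board/123' } }
    ]
};

// Extract the fields by index and property access, as the answer suggests.
const entry = {
    title: element.children[0].attribs.title,
    url: element.children[0].attribs.href
};
```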

How to call a function after it's already defined in Puppeteer

I have a function that grabs text data from this site and stores it in a JSON file. This function is only crawling the first page of this website, but I'd like click through or "goto" each url (there are 10 pages) and grab the text data from each page:
await page.goto('http://quotes.toscrape.com/page/1/')

// grab quote data
const quotes = await page.evaluate(() => {
    const grabFromDiv = (div, selector) => Array.from(div
        .querySelectorAll(selector), el => el.innerText.trim())
Currently, it just navigates to page 1, grabs the data, stores it, and then exits. Is there a way to call the quotes function over and over until I've navigated through all 10 pages and collected all the data?
I would just do the same for each page.
If you know the number of pages, then just do:
var quotes = ''
for each page:
    await page.goto(page)
    quotes += await page.evaluate(myPageFunction)
If you don't know the number of pages, you need to get that information from the actual page.
then, in the evaluate function just search for the next page:
myPageFunction = function () {
    // get your data
    const nextPage = document.querySelector('.next a')?.href
    return { data: yourData, nextPage: nextPage }
}
You would then have something like:
nextPage = 'http://quotes.toscrape.com/page/1/'
while (nextPage) {
    await page.goto(nextPage)
    const result = await page.evaluate(myPageFunction)
    quotes += result.data
    nextPage = result.nextPage
}
The code is just an example and won't work as is.
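The control flow of that loop can be checked without a browser by faking page.evaluate with a lookup table (fakePages, evaluatePage and collectQuotes are stand-ins invented for this sketch):

```javascript
// fakePages stands in for the site; evaluatePage for page.evaluate(myPageFunction).
const fakePages = {
    '/page/1/': { data: ['q1', 'q2'], nextPage: '/page/2/' },
    '/page/2/': { data: ['q3'], nextPage: null },
};

async function evaluatePage(url) {
    return fakePages[url];
}

async function collectQuotes(startUrl) {
    const quotes = [];
    let nextPage = startUrl;
    while (nextPage) {                       // stop when no "next" link is found
        const result = await evaluatePage(nextPage);
        quotes.push(...result.data);
        nextPage = result.nextPage;
    }
    return quotes;
}
```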
Best!

cheerio js couldn't select anchor tag

I want to select all download links from this HTML source:
This is my code:
export const getSingleMusic = async (url) => {
    const html = await getHtml(url)
    const $ = cheerio.load(html)
    const downloadSection = $('#mt-modal-download .col-4')
    downloadSection.each((i, elem) => {
        console.log($(elem).find('a').html())
    })
}
But this is the result:
How can I select them?
In jQuery, $elem.html() returns the innerHTML. Cheerio used to have .html() return the outerHTML, but as you can see from the issue ".html() returns outerHTML, while .html(html) sets innerHTML", they changed it: $elem.html() now returns the innerHTML, and if we want the outerHTML we can call $.html($elem).
You have to either use $.html() like this:
downloadSection.each((i, elem) => {
    console.log($.html($(elem).find('a')));
});
or just call $(elem).html() directly to get the innerHTML, which contains the anchor links:
downloadSection.each((i, elem) => {
    console.log($(elem).html());
});

Get list of elements with class name in javascript with selenium

How can I get a list of elements with a certain class name in javascript with selenium?
I am searching for any elements with the class message_body. I want an array containing all of the elements that have that class.
driver.findElements(By.className("message_body")) doesn't work, it seems to return something else.
How can I get this list?
Here is an example to get the text from a list of elements:
driver.findElements(By.className("message_body")).then(function(elements) {
    elements.forEach(function(element) {
        element.getText().then(function(text) {
            console.log(text);
        });
    });
});
So, I'm using an older version of Selenium, v2.47.1, but something I used when driver.findElements(By.className("someClass")) wasn't sufficient was driver.findElements(By.xpath("/path/to/[@class='someClass']")). This will return a List<WebElement>. If I remember correctly, By.xpath is a little slower than some of the other options on some browsers, but not by a whole lot.
First you need to get the elements collection:
public async getInstalledApps(): Promise<any[]> {
    const appsList: WebComponent = this.browser.find(By.css(`div .icons__title`));
    return await appsList.getElements();
}
Then, using the function above, you can do anything with the collection, for example read each element's text and save it. For example, if it's a group of App buttons and you want an array of their names:
public async getInstalledAppsList(): Promise<string[]> {
    const appsList: string[] = [];
    // forEach with an async callback would return before the pushes finish,
    // so a for...of loop awaits each element in turn.
    for (const element of await this.getInstalledApps()) {
        const app: string = await element.getText();
        appsList.push(app);
    }
    return appsList;
}
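The same pattern in plain JavaScript, with stub objects standing in for WebElements (getTexts and the stubs are invented for this sketch):

```javascript
// Collect text from a list of element-like objects, one await at a time.
async function getTexts(elements) {
    const texts = [];
    for (const element of elements) {
        texts.push(await element.getText());
    }
    return texts;
}

// Stub elements standing in for Selenium WebElements.
const stubs = [
    { getText: async () => 'first message' },
    { getText: async () => 'second message' },
];
```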
