Data crawling websites using cheerio - javascript

I'm having trouble crawling data from a website. I can't get the <tbody> tag of the table, and consequently I can't get the text content of the <tr> and <td> tags. I'm using cheerio to crawl the data. Please help me. Here is the code:
const cheerio = require('cheerio');
const request = require('request-promise');

request('https://liveboard.cafef.vn/', (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html);
        const tab = $('#myTable');
        const tr = tab.find('tbody').find('tr');
        for (var j = 0; j < tr.length; j++) {
            const contTr = $(tr[j]);
            console.log(contTr.text().trim());
            const td = contTr.find('td');
            for (var i = 0; i < td.length; i++) {
                const contTd = $(td[i]);
                console.log(contTd.text());
            }
        }
    } else {
        throw new Error(error);
    }
});

If you look at view-source:https://liveboard.cafef.vn/ you will see that all the tbody tags are empty. They are filled in by JavaScript. cheerio does not run JavaScript; it just parses the static source HTML. You need something like https://github.com/puppeteer/puppeteer/, a headless browser automation library that runs JavaScript and can scrape fully rendered pages.

Related

Why can't I scrape for specific information of this webpage? (with node.js and jQuery)

So I want to scrape for specific information about news from this website: https://24.hu/fn/gazdasag/2022/07/23/igy-lehet-olcsobb-a-maganorvos-az-egeszsegpenztarbol/
I'm working on a web crawler and I need the news title and the content. I use node.js, JavaScript and jQuery. I've also created tests for it: I can get the title, but I can't get the content, despite the fact that I've tried the code in the browser console and it works well there.
This would be the code in the console:
$('[data-io-article-url="https://24.hu/fn/gazdasag/2022/07/23/igy-lehet-olcsobb-a-maganorvos-az-egeszsegpenztarbol/"]').text().trim();
And I get the following answer:
A pandémia az életünk számos területét befolyásolta, de talán semmit sem annyira közvetlenül, mint az orvoshoz járási szokásainkat. Az elmúlt két évben tanúi lehettünk annak, hogy a végletekig leterhelt állami egészségügyi rendszer egyre nehezebben bírja a betegek megfelelő ellátását. Ráadásul úgy tűnik, hogy a járványt még korántsem tudhatjuk magunk mögött.....
In my VS Code I saved the html of the webpage and created the following test:
const fs = require("fs");
const parser = require("../24Parser");

const newsPage1Html = fs.readFileSync("tests/html/test.html");

let parserResult;

beforeAll(() => {
    parserResult = parser(newsPage1Html, );
})

describe("parsing html news page correctly", () => {
    test("title", () => {
        expect(parserResult.title).toBe("Így lehet olcsóbb a magánorvos az egészségpénztárból");
    })
    test("content", () => {
        expect(parserResult.content).toBe("lskl");
    })
})
And my parser looks like this:
const cheerio = require("cheerio");

function parseAll(html, page) {
    const $ = cheerio.load(html);
    const title = $('[itemprop="headline"]').text().trim();
    //const content = $(`[data-io-article-url="${page}"]`).text().trim();
    const content = $('[data-io-article-url="https://24.hu/fn/gazdasag/2022/07/23/igy-lehet-olcsobb-a-maganorvos-az-egeszsegpenztarbol/"]').text().trim();
    return { title, content }
}

module.exports = parseAll;
So I use exactly the same code, yet I get nothing for the content. Why is that?
I'd like to make it dynamic later; that's why the commented-out line is there.
Check whether scraping is blocked on this site, or whether it rate-limits requests. It can also help to use puppeteer, which opens Chrome in headless mode and does the scraping there.

Is it possible to validate my JavaScript library using a public/private key signing method?

I am dynamically loading my JavaScript as a "plugin" on a third party's page and would like to verify that it's coming from me.
I could use Subresource Integrity (SRI) on the content, however I update the library frequently and need it to be loaded dynamically.
Below is an example of what I'd like to see. Is there a way I can achieve this?
const publicKey = '...my-key...'
const response = await fetch('https://example.com/my-library.js')
const sig = response.headers.get('X-Signature')
const text = await response.text()

try {
    validate(text, sig, publicKey)
    const s = document.createElement('script')
    s.innerHTML = text
    document.head.appendChild(s)
} catch (err) {
    console.error(err)
}

Nothing shows up in the console when scraping a website

I'm doing a personal project where I want to scrape some game rankings off a website, but I'm unable to locate in the HTML the titles of the games that I want to scrape.
const request = require('request');
const cheerio = require('cheerio');

request('https://newzoo.com/insights/rankings/top-20-core-pc-games/', (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html);
        //var table = $('#ranking');
        //console.log(table.text());
        $('.ranking-row').each((i, el) => {
            const title = $(el).find('td').find('td:nth-child(1)').text();
            console.log(title);
        });
    }
});
Change
const title = $(el).find('td').find('td:nth-child(1)').text();
to
const title = $(el).find('td:nth-child(2)').text();
PS: To debug selectors, use the Chrome DevTools. If you go to this specific site and search for .ranking-row td td:nth-child(1), you will see that nothing is returned, but .ranking-row td:nth-child(2) gives the desired result.
This is a simple CSS selector error caused by looking for the same td twice and using the wrong index in nth-child.

Extract public posts from Facebook page without API/APP key/token/secret

Just to clarify in advance, I don't have a Facebook account and I have no intent to create one. Also, what I'm trying to achieve is perfectly legal in my country and the USA.
Instead of using the Facebook API to get the latest timeline posts of a Facebook page, I want to send a get request directly to the page URL (e.g. this page) and extract the posts from the HTML source code.
(I'd like to get the text and the creation time of the post.)
When I run this in the web console:
document.getElementsByClassName('userContent')
I get a list of elements containing the text of the latest posts.
But I'd like to extract that information from a nodejs script. I could probably do it quite easily using a headless browser like puppeteer or the like, but that would create a ton of unnecessary overhead. I'd really like a simpler approach: downloading the HTML, passing it to cheerio and using cheerio's jQuery-like API to extract the posts.
Here is my attempt at exactly that:
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

rp.get('https://www.facebook.com/pg/officialstackoverflow/posts/').then(postsHtml => {
    const $ = cheerio.load(postsHtml);
    const timeLinePostEls = $('.userContent');
    console.log(timeLinePostEls.html()); // should NOT be null
    const newestPostEl = timeLinePostEls.eq(0); // .eq() keeps the cheerio wrapper (unlike .get())
    console.log(newestPostEl.html()); // should NOT be null
    const newestPostText = newestPostEl.text();
    console.log(newestPostText);
    //const newestPostTime = newestPostEl.parent(??).child('.livetimestamp').title;
    //console.log(newestPostTime);
}).catch(console.error);
Unfortunately $('.userContent') does not match anything. However, I was able to verify that the data I'm looking for is embedded somewhere in that HTML code.
But I couldn't come up with a good regex approach or the like to extract that data.
Depending on the post content, the number of HTML tags within the post varies heavily.
Here is a simple example of a post containing one link:
<div class="_5pbx userContent _3576" data-ft="{"tn":"K"}"><p>We're proud to be named one of Built In NYC's Best Places to Work in 2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for Best Perks and Benefits. See what it took to make the list and check out our profile to see some of our job openings. http://*******/2H3Kbr2</p></div>
Formatted in a more readable form it looks somewhat like this:
<div class="_5pbx userContent _3576" data-ft="{"tn":"K"}">
<p>
We're proud to be named one of Built In NYC's Best Places to Work in
2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for
Best Perks and Benefits. See what it took to make the list and check out our
profile to see some of our job openings.
SHORT_LINK.....
</p>
</div>
This regex seems to work okay, but I don't think it is very reliable:
/<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g
If, for example, the post contained another div element, it wouldn't work properly. In addition, I have no way of knowing the time/date the post was created using this approach.
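The nested-div failure is easy to reproduce with a made-up post body: the lazy (.+?) stops at the first closing </div>, so any div inside the post truncates the match:

```javascript
const re = /<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g;

// Made-up post containing a nested div
const html = '<div class="_5pbx userContent _3576" data-ft="x">' +
             '<p>before <div>nested</div> after</p></div>';

const match = re.exec(html)[1];
console.log(match); // '<p>before <div>nested' - the rest of the post is lost
```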
Any ideas how I could relatively reliably extract the most recent 2-3 posts including the creation date/time?
Okay, I finally figured it out. I hope this will be useful to others. This function will extract the 20 latest posts, including the creation time:
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

function GetFbPosts(pageUrl) {
    const requestOptions = {
        url: pageUrl,
        headers: {
            'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0'
        }
    };
    return rp.get(requestOptions).then(postsHtml => {
        const $ = cheerio.load(postsHtml);
        const timeLinePostEls = $('.userContent').map((i, el) => $(el)).get();
        const posts = timeLinePostEls.map(post => {
            return {
                message: post.html(),
                created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
            };
        });
        return posts;
    });
}

GetFbPosts('https://www.facebook.com/pg/officialstackoverflow/posts/').then(posts => {
    // Log all posts
    for (const post of posts) {
        console.log(post.created_at, post.message);
    }
});
Since Facebook posts can have complicated formatting, the message is not plain text but HTML. You can strip the formatting and get just the text by replacing message: post.html() with message: post.text().
Edit:
If you want to get more than the latest 20 posts, it is more complicated. The first 20 posts are served statically on the initial html page. All following posts are retrieved via ajax in chunks of 8 posts.
It can be achieved like that:
// make sure your node.js version supports async/await (v10 and above should be fine)
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

class FbScrape {
    constructor(options = {}) {
        this.headers = options.headers || {
            'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' // you may have to update this at some point
        };
    }

    async getPosts(pageUrl, limit = 20) {
        const staticPostsHtml = await rp.get({ url: pageUrl, headers: this.headers });
        if (limit <= 20) {
            return this._parsePostsHtml(staticPostsHtml);
        } else {
            const staticPosts = this._parsePostsHtml(staticPostsHtml);
            const nextResultsUrl = this._getNextPageAjaxUrl(staticPostsHtml);
            const ajaxPosts = await this._getAjaxPosts(nextResultsUrl, limit - 20);
            return staticPosts.concat(ajaxPosts);
        }
    }

    _parsePostsHtml(postsHtml) {
        const $ = cheerio.load(postsHtml);
        const timeLinePostEls = $('.userContent').map((i, el) => $(el)).get();
        const posts = timeLinePostEls.map(post => {
            return {
                message: post.html(),
                created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
            };
        });
        return posts;
    }

    async _getAjaxPosts(resultsUrl, limit = 8, posts = []) {
        const responseBody = await rp.get({ url: resultsUrl, headers: this.headers });
        const extractedJson = JSON.parse(responseBody.substr(9));
        const postsHtml = extractedJson.domops[0][3].__html;
        const newPosts = this._parsePostsHtml(postsHtml);
        const allPosts = posts.concat(newPosts);
        const nextResultsUrl = this._getNextPageAjaxUrl(postsHtml);
        if (allPosts.length + 1 >= limit)
            return allPosts;
        else
            return await this._getAjaxPosts(nextResultsUrl, limit, allPosts);
    }

    _getNextPageAjaxUrl(html) {
        return 'https://www.facebook.com' + /"(\/pages_reaction_units\/more[^"]+)"/g.exec(html)[1].replace(/&amp;/g, '&') + '&__a=1';
    }
}

const fbScrape = new FbScrape();
const minimum = 28; // minimum number of posts to request (gets rounded up to 20, 28, 36, 44, 52, 60, 68 etc. because of the page sizes: page 1 = 20 posts, every following page = 8 posts)

fbScrape.getPosts('https://www.facebook.com/pg/officialstackoverflow/posts/', minimum).then(posts => { // get at least the 28 latest posts
    // Log all posts
    for (const post of posts) {
        console.log(post.created_at, post.message);
    }
});

convert cheerio.load() to a DOM object

I'm trying to learn how to make a web scraper and save content from a site into a text file using node. My issue is that to get the content, I am using cheerio and jquery (I think?), which I have no experience with. I'm trying to take the result I got from cheerio and convert it to a DOM object which I have much more experience dealing with. How can I take the html from cheerio and convert it to a DOM object? Thanks in advance!
const request = require('request');
const cheerio = require('cheerio');

request('https://www.wuxiaworld.com/novel/overgeared/og-chapter-153', (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html);
        console.log(html);
        html.getElementsByClassName('fr-view')[1]; // I want the ability to do this
    }
});
You are already using cheerio, and the first example in its documentation shows how to select elements and get a string of HTML back.
You can change your code to look like this:
const request = require('request');
const cheerio = require('cheerio');

request('https://www.wuxiaworld.com/novel/overgeared/og-chapter-153', (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html);
        const result = $('.my-className').html(); // cheerio's API finds by CSS selector, just like jQuery
        console.log(result);
    }
});
