convert cheerio.load() to a DOM object - javascript

I'm trying to learn how to make a web scraper and save content from a site into a text file using node. My issue is that to get the content, I am using cheerio and jquery (I think?), which I have no experience with. I'm trying to take the result I got from cheerio and convert it to a DOM object which I have much more experience dealing with. How can I take the html from cheerio and convert it to a DOM object? Thanks in advance!
const request = require('request');
const cheerio = require('cheerio');
request('https://www.wuxiaworld.com/novel/overgeared/og-chapter-153',(error, response, html) => {
if(!error && response.statusCode == 200) {
const $ = cheerio.load(html);
console.log(html);
html.getElementsByClassName('fr-view')[1];//I want the ability to do this
}
})

You are using cheerio; the first example in its documentation shows how to add a class and get the resulting HTML as a string.
You can change your code to look like this:
const request = require('request');
const cheerio = require('cheerio');
request('https://www.wuxiaworld.com/novel/overgeared/og-chapter-153',(error, response, html) => {
if(!error && response.statusCode == 200) {
const $ = cheerio.load(html);
const result = $('.my-className').html(); // cheerio API: find by CSS selector, just like jQuery
console.log(result);
}
})

Related

Data crawling websites using cheerio

I'm having trouble crawling data from a website. I can't get the table's <tbody> tag, so I can't get the text content of the <tr> and <td> tags. I'm using cheerio to crawl the data. Please help me. Here is the code:
const cheerio= require('cheerio');
const request= require('request-promise');
request('https://liveboard.cafef.vn/',(error,response,html) => {
if(!error && response.statusCode==200)
{
const $=cheerio.load(html);
const tab=$('#myTable')
const tr=tab.find('tbody').find('tr')
for (var j=0; j<tr.length;j++ )
{
const contTr=$(tr[j])
console.log(contTr.text().trim())
const td=contTr.find('td')
for (var i=0;i<td.length;i++)
{
const contTd=$(td[i])
console.log(contTd.text())
}
}
}
else
{
console.error(error)
}
})
If you look at view-source:https://liveboard.cafef.vn/ you will see that all the tbody tags are empty. They are filled in by JavaScript. cheerio does not run JavaScript; it only parses the static source HTML. You need something like https://github.com/puppeteer/puppeteer/, a headless browser automation library that runs JavaScript and can scrape fully rendered pages.

Scraping data from Youtube with Cheerio

var req = require('request');
var cheerio = require('cheerio');
req('https://www.youtube.com/channel/UCVRhrcoG6FOvHGKehYtvKHg/about', (err, response , body) => {
if(!err) {
let $ = cheerio.load(body);
console.log($('style-scope.ytd-channel-about-metadata-renderer').html())
} else {
console.log(err);
}
})
https://somon.is-inside.me/B49SiWJC.png
Hello, I'm trying to scrape the 'views' data from YouTube, but every time it logs null to the console.
The screenshot is linked above. I'm trying to fetch the data by class name, but I couldn't get it to work. Where is the error?
You need to add the class selector character ".".
Try this:
console.log($('.style-scope .ytd-channel-about-metadata-renderer').html())
or this:
console.log($('.ytd-channel-about-metadata-renderer').html())

Nothing shows up in the console when scraping a website

I'm doing a personal project where I want to scrape some game rankings off a website, but I'm unable to locate in the HTML the titles of the games that I want to scrape.
const request = require('request');
const cheerio = require('cheerio');
request('https://newzoo.com/insights/rankings/top-20-core-pc-games/', (error, response, html) => {
if (!error && response.statusCode == 200) {
const $ = cheerio.load(html);
//var table = $('#ranking');
//console.log(table.text());
$('.ranking-row').each((i,el) => {
const title = $(el).find('td').find('td:nth-child(1)').text();
console.log(title);
});
}
});
Change
const title = $(el).find('td').find('td:nth-child(1)').text();
to
const title = $(el).find('td:nth-child(2)').text();
PS: To debug selectors, use the Chrome DevTools. If you go to this specific site and search for .ranking-row td td:nth-child(1), you will see that nothing is returned. But if you search for .ranking-row td:nth-child(2), you get the desired result.
This is a simple selector error caused by looking for a td inside another td and using the wrong index in nth-child.

Web Scraping Node.js in DOM page

I want to get information from the site using Node.js
I tried so hard, and ̶g̶o̶t̶ ̶s̶o̶ ̶f̶a̶r̶ . So, I want to get a magnet URI link. This link is in:
<div id="download">
<img src="/parse/s.rutor.org/i/magnet.gif">
How can I get this link from the href attribute inside the div using cheerio? I don't know jQuery; I just want to write a parser.
Here is my try:
const request = require('request');
const cheerio = require('cheerio');
request('http://s.new-rutor.org/torrent/562496/povorot-ne-tuda-5-krovnoe-rodstvo_wrong-turn-5-bloodlines-2012-bdrip-avc-p/', function(err, resp, body) {
if (!err){
const $ = cheerio.load(body);
var magnet = $('.href', '#downloads').text()
// $('#downloads').find('href').text()
console.log(magnet);
}
});
That code only prints an empty line to the console.
Note: I'm using request-promise instead of request.
This code logs the href of every a tag whose href contains 'magnet':
const request = require('request-promise');
const cheerio = require('cheerio');
request('http://s.new-rutor.org/torrent/562496/povorot-ne-tuda-5-krovnoe-rodstvo_wrong-turn-5-bloodlines-2012-bdrip-avc-p/').then(res => {
const $ = cheerio.load(res)
const links = $('a')
links.each(i => {
const link = links.eq(i).attr('href')
if (link && link.includes('magnet')) {
console.log(link)
}
})
})
eq selects the link at the given index:
links.each(i => links.eq(i))
Then we read the value of the href attribute (the magnet link) with:
links.eq(i).attr('href')

Unable to get information from <div> Node spider with Cheerio

I'm trying to download the lat/long locations of CCTV locations from the City of Baltimore website (project on the surveillance state) but not getting the console to log anything.
Here's the site:
and my code is:
const request = require('request');
const cheerio = require('cheerio');
let URL = 'https://data.baltimorecity.gov/Public-Safety/CCTV-Locations/hdyb-27ak/data'
let cameras = [];
request(URL, function(err, res, body) {
if(!err && res.statusCode == 200) {
let $ = cheerio.load(body);
$('div.blist-t1-c140113793').each(function() {
let camera = $(this);
let location = camera.text();
console.log(location);
cameras.push(location);
});
console.log(cameras);
}
});
I've tried setting the selector to blist-t1-c140113793 and blist-td blist-t1-c140113793, but neither has worked.
That's because the data for those divs is loaded asynchronously, after the page has rendered. Cheerio, like other HTML-parsing libraries of its kind, does not execute JavaScript. You'll need either to analyze the network traffic to find which HTTP call loads this data, or to use something like Selenium, which actually executes JavaScript inside a browser.
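A rough sketch of the first option, assuming the page loads its rows from a JSON endpoint that you have found in the browser's Network tab. The URL in the usage comment and the `latitude`/`longitude` field names are assumptions for illustration, not the real Baltimore API. `fetchRows` and `toLocations` are hypothetical helper names.

```javascript
// Download JSON rows from the endpoint that the page itself calls.
// fetchImpl defaults to the global fetch available in Node 18+;
// passing it in makes the function easy to try out with a stub.
async function fetchRows(endpoint, fetchImpl = fetch) {
  const res = await fetchImpl(endpoint);
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return res.json();
}

// Keep only rows that have coordinates and convert them to numbers.
// "latitude" and "longitude" are assumed field names.
function toLocations(rows) {
  return rows
    .filter(row => row.latitude != null && row.longitude != null)
    .map(row => ({ lat: Number(row.latitude), lng: Number(row.longitude) }));
}

// Usage (hypothetical URL, replace it with the one seen in the Network tab):
// fetchRows('https://example.com/api/cameras.json')
//   .then(rows => console.log(toLocations(rows)))
//   .catch(console.error);
```

When such an endpoint exists, this tends to be lighter than driving a full browser, and no HTML parsing is needed at all.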
