Web scraping a DOM page with Node.js - javascript

I want to get information from a site using Node.js.
I tried so hard, and got so far. So, I want to get a magnet URI link; this link is in:
<div id="download">
<img src="/parse/s.rutor.org/i/magnet.gif">
How can I get this link from the div and its href field using cheerio? I don't know jQuery; I just want to write a parser.
Here is my attempt:
const request = require('request');
const cheerio = require('cheerio');

request('http://s.new-rutor.org/torrent/562496/povorot-ne-tuda-5-krovnoe-rodstvo_wrong-turn-5-bloodlines-2012-bdrip-avc-p/', function(err, resp, body) {
    if (!err) {
        const $ = cheerio.load(body);
        var magnet = $('.href', '#downloads').text();
        // $('#downloads').find('href').text()
        console.log(magnet);
    }
});
That code only prints an empty string to the console.

Note: I'm using request-promise instead of request
This code console.logs every a tag whose href contains 'magnet':
const request = require('request-promise');
const cheerio = require('cheerio');

request('http://s.new-rutor.org/torrent/562496/povorot-ne-tuda-5-krovnoe-rodstvo_wrong-turn-5-bloodlines-2012-bdrip-avc-p/').then(res => {
    const $ = cheerio.load(res);
    const links = $('a');
    links.each(i => {
        const link = links.eq(i).attr('href');
        if (link && link.includes('magnet')) {
            console.log(link);
        }
    });
});
eq selects the link at that index:
links.each(i => links.eq(i))
Then we grab the content of the href attribute (the magnet link) with:
links.eq(i).attr('href')
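If the page structure is known, a more direct selector avoids looping over every link. This is a minimal sketch assuming the magnet anchor sits inside the #download div shown in the question:
const request = require('request-promise');
const cheerio = require('cheerio');

request('http://s.new-rutor.org/torrent/562496/povorot-ne-tuda-5-krovnoe-rodstvo_wrong-turn-5-bloodlines-2012-bdrip-avc-p/').then(body => {
    const $ = cheerio.load(body);
    // first anchor inside #download whose href starts with "magnet:"
    const magnet = $('#download a[href^="magnet:"]').attr('href');
    console.log(magnet);
});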

Related

How to get a specific class/xpath data using request in ElectronJS

I am trying to get only the price of the coin, but instead I get the whole HTML of the page because I am just logging the body.
I cannot find any documentation or usage examples for the request package, so I needed to ask here.
I am trying to find the class="price" element, which contains only the price.
Is there a way to search by class or XPath, or a way to cut everything out except the class="price" section?
const request = require('request')

request('https://www.livecoinwatch.com/price/Spark-FLR', function (error, response, body) {
    console.error('error:', error)
    console.log('body:', body)
})
When you get the document, use a package like jQuery to parse it, then find the price.
Like this:
const jsdom = require('jsdom');
const { JSDOM } = jsdom;
// load the fetched HTML (the `body` from the request callback) into a DOM
const { document } = (new JSDOM(body)).window;
global.document = document;
const window = document.defaultView;
const $ = require('jquery')(window);
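Putting that together with the request call from the question might look roughly like this (assuming the price appears in the server-rendered HTML; if the site fills it in with client-side JavaScript, a plain HTTP request will not contain it and a headless browser would be needed instead):
const request = require('request');
const { JSDOM } = require('jsdom');

request('https://www.livecoinwatch.com/price/Spark-FLR', function (error, response, body) {
    if (error) return console.error('error:', error);
    const { window } = new JSDOM(body);
    const $ = require('jquery')(window);
    // select only the element with class="price"
    console.log($('.price').text());
});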

Scraping data from Youtube with Cheerio

var req = require('request');
var cheerio = require('cheerio');

req('https://www.youtube.com/channel/UCVRhrcoG6FOvHGKehYtvKHg/about', (err, response, body) => {
    if (!err) {
        let $ = cheerio.load(body);
        console.log($('style-scope.ytd-channel-about-metadata-renderer').html());
    } else {
        console.log(err);
    }
});
https://somon.is-inside.me/B49SiWJC.png
Hello, I'm trying to scrape the 'views' data on YouTube, but every time it logs null to the console.
There is a screenshot link above; I'm trying to fetch the data by class name but I couldn't get it to work. Where is the error?
You need to add the class selector character ".".
Try this: console.log($('.style-scope .ytd-channel-about-metadata-renderer').html())
or this: console.log($('.ytd-channel-about-metadata-renderer').html())
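For reference, here is the request callback with the corrected class selector in place. Keep in mind (a general caveat, not part of the original answer) that YouTube builds much of its markup with client-side JavaScript, so the element may still be missing from the static HTML that request receives:
const request = require('request');
const cheerio = require('cheerio');

request('https://www.youtube.com/channel/UCVRhrcoG6FOvHGKehYtvKHg/about', (err, response, body) => {
    if (err) return console.log(err);
    const $ = cheerio.load(body);
    // "." marks a class selector; without it cheerio looks for a <style-scope> tag
    console.log($('.ytd-channel-about-metadata-renderer').html());
});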

Nothing shows up in the console when scraping a website

I'm doing a personal project where I want to scrape some game rankings off a website, but I'm unable to extract the titles of the games from the HTML.
const request = require('request');
const cheerio = require('cheerio');

request('https://newzoo.com/insights/rankings/top-20-core-pc-games/', (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html);
        // var table = $('#ranking');
        // console.log(table.text());
        $('.ranking-row').each((i, el) => {
            const title = $(el).find('td').find('td:nth-child(1)').text();
            console.log(title);
        });
    }
});
Change
const title = $(el).find('td').find('td:nth-child(1)').text();
to
const title = $(el).find('td:nth-child(2)').text();
PS: To debug selectors, use the Chrome DevTools console. If you go to this specific site and search for .ranking-row td td:nth-child(1), you will see that nothing is returned, but .ranking-row td:nth-child(2) gives the desired result.
This is a simple selector error caused by looking for a td inside another td and by using the wrong index in nth-child.
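With that change, the loop from the question becomes (a sketch, assuming the game title really sits in the second cell of each .ranking-row, as the selector test above suggests):
$('.ranking-row').each((i, el) => {
    // the second <td> of each ranking row holds the game title
    const title = $(el).find('td:nth-child(2)').text().trim();
    console.log(title);
});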

Getting all images from a webpage and saving them to disk programmatically (Node.js & JavaScript)

I need to get a lot of images from a few websites and download them to my disk so that I can use them (I will upload them to a blob (Azure) and then save the link in my DB).
GETTING THE IMAGES
I know how to get the images from the HTML with JS; for example, for one of the sites I would make a for loop and do:
document.getElementsByClassName('person')[i].querySelector('div').querySelector('img').getAttribute('src')
And there I would have the links to all the images.
SAVING THE IMAGES
I also saw that I can save the files to disk using Node and the fs module, by doing:
function saveImageToDisk(url, localPath) {
    var fullUrl = url;
    var file = fs.createWriteStream(localPath);
    var request = https.get(url, function(response) {
        response.pipe(file);
    });
}
HOW TO PUT IT ALL TOGETHER
This is where I am stuck: I don't know exactly how to connect the two parts (the browser script and the Node.js code). I want to get the image and also the image name (the alt tag in this case), then use them in Node to upload the image to a blob and put the name and blob URL in my DB.
I thought I could download the HTML page and then put the JS script at the bottom of the body, but then I don't know how to pass the URLs to the Node.js code.
How can I do this?
I am not very used to working with in-page scripts; I have mostly used Node without them, and I get a bit confused by their interactions and how to connect JS scripts to my code.
Also, is this the best way to go about it, or is there a simpler/better way I am not seeing?
This feels like a job for a crawler. The following code should work (using the npm module crawler):
const Crawler = require("crawler")

const c = new Crawler({
    callback: function(error, res, done) {
        if (error) {
            console.log({error})
        } else {
            const images = res.$('.person div img')
            images.each(index => {
                // here you can save the file, or collect the URLs in an array to download later
                console.log({
                    src: images[index].attribs.src,
                    alt: images[index].attribs.alt,
                })
            })
        }
        done()
    }
})

c.queue('https://www.yoursite.com')
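To actually write the files out, the download step can reuse the same pattern as the question's saveImageToDisk helper. A sketch with https and fs (the filename scheme here is just an illustration):
const https = require('https')
const fs = require('fs')

// downloads a single image URL to a local file
function saveImageToDisk(url, localPath) {
    https.get(url, response => {
        response.pipe(fs.createWriteStream(localPath))
    })
}

// e.g. inside the crawler callback above:
// saveImageToDisk(images[index].attribs.src, `images/${index}.jpg`)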
You need a bridge between the Web API (for DOM parsing, etc.) and the Node.js API, for example a headless browser automation tool for Node.js. Say, you can use puppeteer with this script:
'use strict';

const puppeteer = require('puppeteer');
const https = require('https');
const fs = require('fs');

(async function main() {
    try {
        const browser = await puppeteer.launch();
        const [page] = await browser.pages();

        await page.goto('https://en.wikipedia.org/wiki/Image');
        const imgURLs = await page.evaluate(() =>
            Array.from(
                document.querySelectorAll('#mw-content-text img.thumbimage'),
                ({ src }) => src,
            )
        );
        console.log(imgURLs);

        await browser.close();

        imgURLs.forEach((imgURL, i) => {
            https.get(imgURL, (response) => {
                response.pipe(fs.createWriteStream(`${i}.${imgURL.slice(-3)}`));
            });
        });
    } catch (err) {
        console.error(err);
    }
})();
You can even avoid downloading the images twice, by reusing the pictures already fetched by the browser. This script saves the same images, but with a single session of requests and without the https Node.js module (which saves time, network traffic, and server workload):
'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
    try {
        const browser = await puppeteer.launch();
        const [page] = await browser.pages();

        // remember every image response the browser receives
        const allImgResponses = {};
        page.on('response', (response) => {
            if (response.request().resourceType() === 'image') {
                allImgResponses[response.url()] = response;
            }
        });

        await page.goto('https://en.wikipedia.org/wiki/Image');
        const selectedImgURLs = await page.evaluate(() =>
            Array.from(
                document.querySelectorAll('#mw-content-text img.thumbimage'),
                ({ src }) => src,
            )
        );
        console.log(selectedImgURLs);

        let i = 0;
        for (const imgURL of selectedImgURLs) {
            fs.writeFileSync(
                `${i++}.${imgURL.slice(-3)}`,
                await allImgResponses[imgURL].buffer(),
            );
        }

        await browser.close();
    } catch (err) {
        console.error(err);
    }
})();
I recommend using the dom-parser module. See here: https://www.npmjs.com/package/dom-parser
That way, you can download the whole HTML file with http.get(), parse it using dom-parser, and extract all the information you need from it. For each image URL, use your saveImageToDisk() function.
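A rough sketch of that flow, assuming dom-parser's documented parseFromString / getElementsByTagName / getAttribute API (the URL is the same placeholder used above, and saveImageToDisk is the question's helper):
const https = require('https');
const DomParser = require('dom-parser');

https.get('https://www.yoursite.com', (res) => {
    let html = '';
    res.on('data', chunk => { html += chunk; });
    res.on('end', () => {
        const dom = new DomParser().parseFromString(html);
        // grab every <img>, then keep what you need from src/alt
        dom.getElementsByTagName('img').forEach((img, i) => {
            const src = img.getAttribute('src');
            const alt = img.getAttribute('alt');
            // note: relative src values would need to be resolved against the page URL
            if (src) saveImageToDisk(src, `${i}-${alt || 'image'}.jpg`);
        });
    });
});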
Following your original idea, you would add the JS script to the HTML file as you mentioned, but in addition you would have to use Ajax (XMLHttpRequest) to post the URLs to a Node.js server.
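As a rough illustration of that Ajax hand-off (the endpoint name and port are made up for this sketch, and a real setup would also need CORS headers if the page is served from a different origin):
// In the browser: script appended to the downloaded HTML page
const urls = Array.from(document.querySelectorAll('.person div img'))
    .map(img => ({ src: img.src, alt: img.alt }));
const xhr = new XMLHttpRequest();
xhr.open('POST', 'http://localhost:3000/images'); // hypothetical Node.js endpoint
xhr.setRequestHeader('Content-Type', 'application/json');
xhr.send(JSON.stringify(urls));

// On the Node.js side: receive the URLs and hand them to saveImageToDisk
const http = require('http');
http.createServer((req, res) => {
    if (req.method === 'POST' && req.url === '/images') {
        let data = '';
        req.on('data', chunk => { data += chunk; });
        req.on('end', () => {
            JSON.parse(data).forEach((img, i) => saveImageToDisk(img.src, `${i}.jpg`));
            res.end('ok');
        });
    } else {
        res.end();
    }
}).listen(3000);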
You can use a Promise and, inside it, do the job of getting all the images and putting the image URLs in an array. Then, inside the then method, you can either iterate over the array and call saveImageToDisk each time, or send the whole array to the middle layer with a slight modification. The second option is better since it makes only one network call.
function getImages() {
    return new Promise((resolve, reject) => {
        // Array.from will create an array
        // map will return a new array with all the image urls
        let k = Array.from(document.getElementsByClassName('person')[0]
            .querySelector('div')
            .querySelectorAll('img'))
            .map((item) => {
                return item.getAttribute('src')
            })
        resolve(k)
    })
}

getImages().then((d) => {
    // this runs only after the promise is resolved
    console.log('****', d);
    d.forEach(item => {
        // call the saveImageToDisk function here
    })
})

function saveImageToDisk(url, localPath) {
    var fullUrl = url;
    var file = fs.createWriteStream(localPath);
    var request = https.get(url, function(response) {
        response.pipe(file);
    });
}
<div class='person'>
    <div>
        <img src='https://www.fast-growing-trees.com/images/P/Leyland-Cypress-450-MAIN.jpg'>
        <img src='http://cdn.shopify.com/s/files/1/2473/3486/products/Cypress_Leyland_2_Horticopia_d1b5b63a-8bf7-4897-96fb-05320bf3d81b_grande.jpg?v=1532991076'>
        <img src='https://www.fast-growing-trees.com/images/P/Live-Oak-Tree-450w.jpg'>
        <img src='https://www.greatgardenplants.com/images/uploads/452_1262_popup.jpg'>
        <img src='https://shop.arborday.org/data/default/images/catalog/600/Turnkey/1/Leyland-Cypress_3-828.jpg'>
        <img src='https://images-na.ssl-images-amazon.com/images/I/51RZkKnrlSL._SX425_.jpg'>
        <img src='https://thumbs-prod.si-cdn.com/Z3JYiuJ96ReLq04NCT1B94sTd4E=/800x600/filters:no_upscale()/https://public-media.si-cdn.com/filer/06/9c/069cfb16-c46c-4742-85f0-3c7e45fa139d/mar2018_a05_talkingtrees.jpg'>
    </div>
</div>

convert cheerio.load() to a DOM object

I'm trying to learn how to make a web scraper and save content from a site into a text file using Node. My issue is that to get the content I am using cheerio and jQuery (I think?), which I have no experience with. I'm trying to take the result I got from cheerio and convert it to a DOM object, which I have much more experience dealing with. How can I take the HTML from cheerio and convert it to a DOM object? Thanks in advance!
const request = require('request');
const cheerio = require('cheerio');

request('https://www.wuxiaworld.com/novel/overgeared/og-chapter-153', (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html);
        console.log(html);
        html.getElementsByClassName('fr-view')[1]; // I want the ability to do this
    }
});
You are already using cheerio; the first example in its readme shows you how to select by a class and get the HTML back as a string.
You can change your code to look like this:
const request = require('request');
const cheerio = require('cheerio');

request('https://www.wuxiaworld.com/novel/overgeared/og-chapter-153', (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html);
        const result = $('.my-className').html(); // cheerio api to find by css selector, just like jQuery
        console.log(result);
    }
});
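And for the specific line from the question, the cheerio equivalent of html.getElementsByClassName('fr-view')[1] would be roughly:
const request = require('request');
const cheerio = require('cheerio');

request('https://www.wuxiaworld.com/novel/overgeared/og-chapter-153', (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html);
        // .eq(1) picks the second match, like [1] on a getElementsByClassName result
        console.log($('.fr-view').eq(1).html());
    }
});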
