Unable to get information from <div> Node spider with Cheerio - javascript

I'm trying to download the lat/long locations of CCTV locations from the City of Baltimore website (project on the surveillance state) but not getting the console to log anything.
Here's the site:
and my code is:
const request = require('request');
const cheerio = require('cheerio');
let URL = 'https://data.baltimorecity.gov/Public-Safety/CCTV-Locations/hdyb-27ak/data'
let cameras = [];
request(URL, function(err, res, body) {
if(!err && res.statusCode == 200) {
let $ = cheerio.load(body);
$('div.blist-t1-c140113793').each(function() {
let camera = $(this);
let location = camera.text();
console.log(location);
cameras.push(location);
});
console.log(cameras);
}
});
I've tried setting the selector to blist-t1-c140113793 and blist-td blist-t1-c140113793, but neither has worked.

That's because the data for those divs is loaded asynchronously, after the page has been rendered. Cheerio, like other HTML-parsing libraries, does not execute JavaScript. You'll need either to analyze the network traffic and work out which HTTP call loads this data, or to use something like Selenium, which actually executes JavaScript inside a browser.
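The first option (calling the HTTP endpoint that loads the data) might look roughly like the sketch below. It is not the site's actual API: `DATA_URL` is a placeholder for whatever request shows up in the browser's Network tab, and the `latitude`/`longitude` field names are assumptions about the shape of the response.

```javascript
const https = require('https');

// ASSUMPTION: placeholder URL. Replace it with the request you find in the
// browser's Network tab while the page loads its table.
const DATA_URL = 'https://example.org/cctv-locations.json';

// Pull coordinates out of already-parsed rows.
// ASSUMPTION: each row carries `latitude` and `longitude` fields.
function extractCoordinates(rows) {
  return rows
    .filter((row) => row.latitude != null && row.longitude != null)
    .map((row) => ({ lat: Number(row.latitude), lng: Number(row.longitude) }));
}

// Fetch a URL and resolve with its body parsed as JSON.
function fetchJson(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status ' + res.statusCode));
        return;
      }
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        try {
          resolve(JSON.parse(body));
        } catch (parseError) {
          reject(parseError);
        }
      });
    }).on('error', reject);
  });
}

// Usage (left commented out because DATA_URL is only a placeholder):
// fetchJson(DATA_URL)
//   .then((rows) => console.log(extractCoordinates(rows)))
//   .catch((err) => console.error(err));
```

Because the response is already structured data, no HTML parsing (and no Cheerio) is needed for this route.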

Related

iFrame + PDF.js + puppeteer - good combination to generate and show PDF files?

Since Monday I have been trying to find the right way to quickly and securely generate and display PDF files with the following setup (maybe I'm just confused or too blind to see the answer):
Apache - runs my PHP Scripts for my actual project (port 443)
NodeJS - runs a single script for generating PDF files from HTML (port 8080)
What I need: to ensure that the user is allowed to generate and view the PDF.
It is important to me that the viewer bar (as seen in the screenshot) is available.
There is a cookie in which a session hash is stored, and with which the user authenticates on every request (for example via AJAX).
Description of the full procedure:
On one page of my project an iframe is displayed. In it, a PDF viewer (from PDF.js) is loaded, with some buttons around it:
state before it all begins
Clicking on a button on the left (named with "Load PDF 1", ...) fires the following Event:
$(document).on("click", ".reportelement", function () {
//some data needs to be passed
let data = "report=birthdaylist";
//point iFrame to a new address
$("#pdfViewer").attr("src", "https://example.org/inc/javascript/web/viewer.html?file=https://example.org:8080?" + data);
});
At this point, the iFrame is going to reload the viewer, which takes the GET argument and executes it:
https://example.org/inc/javascript/web/viewer.html?file=https://example.org:8080?" + data //sends the data to the NodeJS script and receives the PDF
==> ?file=https://example.org:8080 //GET... it's bad... How to do a POST in iFrame?!
So, have a look at the NodeJS script (I have to say I am not very familiar with async and NodeJS):
const https = require("https");
const fs = require("fs");
const puppeteer = require('puppeteer');
const url = require("url");
var qs = require('querystring');
const request = require("request-promise");
const options = {
key: fs.readFileSync("key.pem", "utf-8"),
cert: fs.readFileSync("cert.pem", "utf-8"),
passphrase: 'XXXXXXXX'
};
https.createServer(options, function (req, res) {
(async function () {
if (req.method == 'POST') {
var body = '';
req.on('data', function (data) {
body += data;
// Too much POST data, kill the connection!
// 1e6 === 1 * Math.pow(10, 6) === 1 * 1000000 ~~~ 1MB
if (body.length > 1e6)
req.connection.destroy();
});
req.on('end', function () {
//got a selfsigned certificate only, will change it soon!
process.env['NODE_TLS_REJECT_UNAUTHORIZED'] = 0; // semicolon required: the next line starts with "("
(async function () {
var result = await request.post('https://example.org/index.php', {
//htpasswd secured at the moment
'auth': {
'user': 'user',
'pass': 'pass',
'sendImmediately': false
},
//i would like to send the cookie oder the hash in it
//or something else to it ensure, that the user is allowed to
form: {
giveme: 'html'
}
},
function (error, response, body) {
//for debugging reasons
console.log("error: " + error);
console.log("response: " + response);
console.log("body: " + body);
}
);
const browser = await puppeteer.launch();
const main = async () => {
//generating pdf using result from request.post
}
const rendered_pdf = await main();
res.writeHead(200, {
"Access-Control-Allow-Headers": "Origin, X-Requested-With, Content-Type, Accept",
"Access-Control-Allow-Origin": "*",
'Content-Type': 'application/pdf',
'Content-Disposition': 'attachment; filename=mypdf.pdf',
'Content-Length': rendered_pdf.length
});
res.end(rendered_pdf);
})();
});
} else if (req.method == 'GET') {
console.log("we got a GET");
} else {
console.log("we got NOTHING");
}
})();
}).listen(8080);
Everything is working fine and the PDFs are displayed well, but as I mentioned before, I don't know how to ensure that the user is allowed to generate and see the PDF.
tl;dr
Is there a way (maybe without an iframe) to make sure the user is permitted? It is important to me that the viewer bar (as seen in the screenshot) is available.
diagram of current procedure
I think I found a solution.
diagram of new approach/token logic
Using a token (a hash or random string) that is only valid for retrieving a PDF file should do it.
The token does not authenticate the user, so I think this is a safer approach?
Feel free to comment/answer :)
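The token idea could be sketched roughly as follows. This is only an illustration under assumptions, not the poster's implementation: the names `issueToken` and `redeemToken`, the in-memory `Map`, and the 60-second lifetime are all hypothetical. In the setup described above, the PHP application would issue the token after checking the session and the Node service would redeem it, so the two would need a shared store rather than one process's memory.

```javascript
const crypto = require('crypto');

// ASSUMPTION: a single-process, in-memory store (token -> entry).
// A real deployment with PHP and Node would need a store both can reach.
const tokens = new Map();

// Create a random, unguessable token that is only good for one report.
// `ttlMs` and `now` are parameters so the lifetime can be tuned and tested.
function issueToken(report, ttlMs = 60000, now = Date.now()) {
  const token = crypto.randomBytes(32).toString('hex');
  tokens.set(token, { report, expiresAt: now + ttlMs });
  return token;
}

// Exchange a token for the report name it was issued for.
// Returns null for unknown, already used, or expired tokens.
function redeemToken(token, now = Date.now()) {
  const entry = tokens.get(token);
  if (!entry) {
    return null;
  }
  tokens.delete(token); // single use: a token cannot be replayed
  return entry.expiresAt > now ? entry.report : null;
}

// Example: the authenticated side issues a token, the PDF side redeems it.
const exampleToken = issueToken('birthdaylist');
console.log(redeemToken(exampleToken)); // birthdaylist
console.log(redeemToken(exampleToken)); // null (already used)
```

Since the token carries no session information and expires quickly, it can be passed in the viewer's `file` URL without exposing the session hash.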

Nothing shows up in the console when scraping a website

I'm doing a personal project where I want to scrape some game rankings off a website, but I'm unable to locate in the HTML the titles of the games that I want to scrape.
const request = require('request');
const cheerio = require('cheerio');
request('https://newzoo.com/insights/rankings/top-20-core-pc-games/', (error, response, html) => {
if (!error && response.statusCode == 200) {
const $ = cheerio.load(html);
//var table = $('#ranking');
//console.log(table.text());
$('.ranking-row').each((i,el) => {
const title = $(el).find('td').find('td:nth-child(1)').text();
console.log(title);
});
}
});
Change
const title = $(el).find('td').find('td:nth-child(1)').text();
to
const title = $(el).find('td:nth-child(2)').text();
PS: To debug CSS selectors, use the Chrome DevTools. If you go to this specific site and search for .ranking-row td td:nth-child(1), you will see that nothing is returned. But if you search for .ranking-row td:nth-child(2), you get the desired result.
This is a simple selector error caused by looking for the same td twice and using the wrong index in nth-child.

convert cheerio.load() to a DOM object

I'm trying to learn how to make a web scraper and save content from a site into a text file using node. My issue is that to get the content, I am using cheerio and jquery (I think?), which I have no experience with. I'm trying to take the result I got from cheerio and convert it to a DOM object which I have much more experience dealing with. How can I take the html from cheerio and convert it to a DOM object? Thanks in advance!
const request = require('request');
const cheerio = require('cheerio');
request('https://www.wuxiaworld.com/novel/overgeared/og-chapter-153',(error, response, html) => {
if(!error & response.statusCode == 200) {
const $ = cheerio.load(html);
console.log(html);
html.getElementsByClassName('fr-view')[1];//I want the ability to do this
}
})
You are using cheerio; the first example in its documentation shows you how to add a class and get a string with the HTML.
You can change your code to look like this:
const request = require('request');
const cheerio = require('cheerio');
request('https://www.wuxiaworld.com/novel/overgeared/og-chapter-153',(error, response, html) => {
if(!error && response.statusCode == 200) {
const $ = cheerio.load(html);
const result = $('.my-className').html(); // cheerio API to find by CSS selector, just like jQuery.
console.log(result);
}
})

Using Node.js to find the value of Bitcoin on a webpage at real time

I'm trying to make a .js file that will constantly have the price of bitcoin updated (every five minutes or so). I've tried tons of different ways to web scrape but they always output with either null or nothing. Here is my latest code, any ideas?
var express = require('express');
var path = require('path');
var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');
var app = express();
var url = 'https://blockchain.info/charts/';
var port = 9945;
function BTC() {
request(url, function (err, res, body) {
var $ = cheerio.load(body);
var a = $(".market-price");
var b = a.text();
console.log(b);
})
setInterval(BTC, 300000)
}
BTC();
app.listen(port);
console.log('server is running on '+port);
It successfully says what port it's running on, that's not the problem. This example (when outputting) just makes a line break every time the function happens.
UPDATE:
I changed the new code I got from Wartoshika and it stopped working, but I'm not sure why. Here it is:
function BTCPrice() {
request('https://blockchain.info/de/ticker', (error, response, body) => {
const data = JSON.parse(body);
var value = (parseInt(data.USD.buy, 10) + parseInt(data.USD.sell, 10)) / 2;
return value;
});
};
console.log(BTCPrice());
If I have it console.log directly from inside the function it works, but when I have it console.log the output of the function it outputs undefined. Any ideas?
I would rather use a JSON API to get the current bitcoin value instead of an HTML parser. With the JSON API you get a straightforward result set that can be parsed directly with JSON.parse.
Check out the Exchange Rates API.
Url will look like https://blockchain.info/de/ticker
Working script:
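const request = require('request');
function BTC() {
    // send a request to blockchain
    request('https://blockchain.info/de/ticker', (error, response, body) => {
        // stop early if the request failed, since body would not be valid JSON
        if (error || response.statusCode !== 200) {
            console.error('Request failed:', error || response.statusCode);
            return;
        }
        // parse the JSON answer and get the current bitcoin value
        const data = JSON.parse(body);
        const value = (parseInt(data.THB.buy, 10) + parseInt(data.THB.sell, 10)) / 2;
        console.log(value);
    });
}
BTC();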
Returning the value through a Promise:
const request = require('request');
function BTC() {
    return new Promise((resolve, reject) => {
        // send a request to blockchain
        request('https://blockchain.info/de/ticker', (error, response, body) => {
            if (error) {
                reject(error);
                return;
            }
            try {
                // parse the JSON answer and get the current bitcoin value
                const data = JSON.parse(body);
                const value = (parseInt(data.THB.buy, 10) + parseInt(data.THB.sell, 10)) / 2;
                resolve(value);
            } catch (parseError) {
                reject(parseError);
            }
        });
    });
}
BTC().then(val => console.log(val)).catch(err => console.error(err));
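The reason `console.log(BTCPrice())` printed undefined in the updated question can be shown without any network access. In the sketch below, `fakeRequest` and its payload are made up for illustration and merely stand in for the `request` call. A `return` inside the callback returns from the callback only, so the outer function has nothing to return. Wrapping the call in a Promise lets the caller wait for the value.

```javascript
// ASSUMPTION: fakeRequest is a stand-in for an HTTP call. It invokes the
// callback asynchronously with a ticker-like JSON string.
function fakeRequest(callback) {
  setTimeout(() => {
    callback(null, JSON.stringify({ USD: { buy: 100, sell: 102 } }));
  }, 10);
}

// Broken: the `return` belongs to the arrow function, not to brokenPrice,
// and brokenPrice has already finished by the time the callback runs.
function brokenPrice() {
  fakeRequest((error, body) => {
    const data = JSON.parse(body);
    return (data.USD.buy + data.USD.sell) / 2;
  });
}

// Working: hand the value to the caller through a Promise.
function price() {
  return new Promise((resolve, reject) => {
    fakeRequest((error, body) => {
      if (error) {
        reject(error);
        return;
      }
      try {
        const data = JSON.parse(body);
        resolve((data.USD.buy + data.USD.sell) / 2);
      } catch (parseError) {
        reject(parseError);
      }
    });
  });
}

async function main() {
  console.log(brokenPrice()); // undefined
  console.log(await price()); // 101
}

main();
```

With `async`/`await`, the calling code reads top to bottom while still waiting for the asynchronous result.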
As the other answer stated, you should really use an API. You should also think about what type of price you want to request. If you just want a sort of index price that aggregates prices from multiple exchanges, use something like the CoinGecko API. Also, if you need real-time data, you need a WebSocket-based API, not a REST API.
If you need prices for a particular exchange, for example because you're building a trading bot for one or more exchanges, you'll need to communicate with each exchange's WebSocket API directly. For that I would recommend something like the Coygo API, a node.js package that connects you directly to each exchange's real-time data feeds. You want something that doesn't add a middleman, since that would add latency to your data.

Node.js Returning Null when Scraping HTML

I'm currently trying out some code which is meant to look for a specific torrent on Kickass Torrents as a proof of concept, but for some reason my simple code is failing to return any value besides null, despite the fact that I have confirmed that a torrent exists with the ID that I have in the program.
searchTerm = "photoshop"
var request = require("request"),
cheerio = require("cheerio"),
url = "https://kat.cr/usearch/" + searchTerm + "/";
request(url, function (error, response, body) {
if (!error) {
var $ = cheerio.load(body),
magnet = $("[data-id='233B2C174D5FEF9D6AFAA61D150EC0B6F821D6A9'] href").html();
console.log(magnet)
}
});
The main KickAssTorrents site, kat.cr, encrypts the data that is returned. Instead, you could use http://kickasstorrentseu.com, which doesn't encrypt what is returned.
