How to get data for a specific class/XPath using request in ElectronJS - javascript

I am trying to get only the price of the coin, but instead I am getting the whole HTML of the page in the body.
I cannot find any documentation or usage examples for the request package, so I needed to ask here.
I am trying to find the element with class="price", which holds only the price.
Is there a way to search by class or by XPath, or a way to cut everything out except the class="price" section?
const request = require('request')

request('https://www.livecoinwatch.com/price/Spark-FLR', function (error, response, body) {
    console.error('error:', error)
    console.log('body:', body)
})

When you get the document, use a package like jQuery to parse it, then find the price.
Like this:
const jsdom = require('jsdom');
const { JSDOM } = jsdom;

// Load the fetched HTML (the `body` from the request callback above) into a
// simulated DOM, then attach jQuery to that window.
const { window } = new JSDOM(body);
const $ = require('jquery')(window);

// Query by class, just as you would in the browser.
console.log($('.price').text());
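Alternatively, cheerio gives you the same jQuery-style selectors without simulating a full browser window. A minimal sketch, assuming the price is present in the static HTML (if the site renders it client-side, the request body won't contain it):

const request = require('request');
const cheerio = require('cheerio');

request('https://www.livecoinwatch.com/price/Spark-FLR', function (error, response, body) {
    if (error) return console.error('error:', error);
    const $ = cheerio.load(body);
    // class="price" becomes the .price selector, same as in jQuery
    console.log('price:', $('.price').text().trim());
});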

Related

Reading a web page with node.js and urllib

I'm learning programming and found myself in a tough spot: the code from the tutorial is not working and I can't understand why.
It's a shell script that's supposed to retrieve a Wikipedia page, strip it of the references, and return just the paragraphs' text.
It uses the urllib library. In the code below, the only difference from the tutorial's version is the use of fs to write the page content to a text file. The rest is copied and pasted.
#!/usr/local/bin/node
// Returns the paragraphs from a Wikipedia link, stripped of reference numbers.
let urllib = require("urllib");
let url = process.argv[2];
let fs = require("fs");
console.log(url);
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
urllib.request(url, { followRedirect: true }, function(error, data, response) {
    let body = data.toString();
    // Simulate a Document Object Model.
    let { document } = (new JSDOM(body)).window;
    // Grab all the paragraphs and references.
    let paragraphs = document.querySelectorAll("p");
    let references = document.querySelectorAll(".reference");
    // Remove any references.
    references.forEach(function(reference) {
        reference.remove();
    });
    // Print out all of the paragraphs.
    paragraphs.forEach(function(paragraph) {
        console.log(paragraph.textContent);
        fs.appendFileSync("article.txt", `${paragraph}\n`);
    });
});
My first guess was that urllib was not working for some reason, because even though I installed it as per the official documentation, typing which urllib at the command line doesn't return a path.
But then, node doesn't throw an error about require("urllib") when I run the file.
The actual output is the following:
$ ./wikp https://es.wikipedia.org/wiki/JavaScript
https://es.wikipedia.org/wiki/JavaScript
$
Can anybody help please?
I think the tutorial you followed might have been a little out of date.
This works for me:
let urllib = require("urllib");
let url = process.argv[2];
let fs = require("fs");
console.log(url);
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
urllib.request(url, { followRedirect: true }).then(({data, res}) => {
let body = data.toString();
// Simulate a Document Object Model.
let { document } = (new JSDOM(body)).window;
// Grab all the paragraphs and references.
let paragraphs = document.querySelectorAll("p");
let references = document.querySelectorAll(".reference");
// Remove any references.
references.forEach(function(reference) {
reference.remove();
});
// Print out all of the paragraphs.
paragraphs.forEach(function(paragraph) {
console.log(paragraph.textContent);
fs.appendFileSync("article.txt", `${paragraph.textContent}\n`);
});
});
The package you are using (urllib) now exposes a promise-based API; that may have been different in the past, when the tutorial was released.
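A minimal sketch of the two call shapes (the callback form matches older urllib releases; exact signatures varied by version):

// Callback style, as the tutorial assumed:
// urllib.request(url, options, function (error, data, response) { ... });

// Promise style, as current urllib exposes:
urllib.request(url, { followRedirect: true })
    .then(({ data, res }) => console.log(data.toString()))
    .catch((error) => console.error(error));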

Why can't I scrape for specific information of this webpage? (with node.js and jQuery)

So I want to scrape specific information about a news item from this website: https://24.hu/fn/gazdasag/2022/07/23/igy-lehet-olcsobb-a-maganorvos-az-egeszsegpenztarbol/
I'm working on a web crawler and I need the news item's title and content. I use node.js, JavaScript and jQuery. I've also created tests for it: I can reach the title, but I can't get the content, even though I've tried the code in the browser console, where it works well.
This is the code in the console:
$('[data-io-article-url="https://24.hu/fn/gazdasag/2022/07/23/igy-lehet-olcsobb-a-maganorvos-az-egeszsegpenztarbol/"]').text().trim();
And I get the following answer:
A pandémia az életünk számos területét befolyásolta, de talán semmit sem annyira közvetlenül, mint az orvoshoz járási szokásainkat. Az elmúlt két évben tanúi lehettünk annak, hogy a végletekig leterhelt állami egészségügyi rendszer egyre nehezebben bírja a betegek megfelelő ellátását. Ráadásul úgy tűnik, hogy a járványt még korántsem tudhatjuk magunk mögött.....
In VS Code, I saved the HTML of the webpage and created the following test:
const fs = require("fs");
const parser = require("../24Parser");
const newsPage1Html = fs.readFileSync("tests/html/test.html");
let parserResult;
beforeAll(() => {
parserResult = parser(newsPage1Html, );
})
describe("parsing html news page correctly", () => {
test("title", () => {
expect(parserResult.title).toBe("Így lehet olcsóbb a magánorvos az egészségpénztárból");
})
test("content", () => {
expect(parserResult.content).toBe("lskl");
})
})
And my parser looks like this:
const cheerio = require("cheerio");
function parseAll(html, page) {
const $ = cheerio.load(html);
const title = $('[itemprop="headline"]').text().trim();
//const content = $(`[data-io-article-url="${page}"]`).text().trim();
const content = $('[data-io-article-url="https://24.hu/fn/gazdasag/2022/07/23/igy-lehet-olcsobb-a-maganorvos-az-egeszsegpenztarbol/"]').text().trim();
return { title, content}
}
module.exports = parseAll;
So I use exactly the same code, yet I get nothing for the content. Why is that?
I would like to make the selector dynamic later; that's why the commented-out line is there.
Check whether scraping is blocked on this site, or whether it limits requests. It can also help to use Puppeteer, which runs Chrome in headless mode and scrapes there, so the page's client-side JavaScript gets a chance to run first.
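For example, a minimal Puppeteer sketch, assuming the article body is injected client-side (the URL and the data-io-article-url selector are the ones from the question):

const puppeteer = require('puppeteer');

async function scrapeContent(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering can finish
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Run the same selector inside the real page context
    const content = await page.evaluate((u) => {
        const el = document.querySelector(`[data-io-article-url="${u}"]`);
        return el ? el.textContent.trim() : null;
    }, url);
    await browser.close();
    return content;
}

scrapeContent('https://24.hu/fn/gazdasag/2022/07/23/igy-lehet-olcsobb-a-maganorvos-az-egeszsegpenztarbol/')
    .then(console.log);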

Editing an XML document

I am new to JavaScript and need the ability to create, edit and export an XML document on the server side. I have seen different options on the Internet, but they do not suit me.
I did find one workable option: process my XML file into JSON, edit it, convert it back, and then export it through another plugin. But maybe there is a simpler way?
Thanks!
I recently came across a similar problem. The solution turned out to be very simple: use xml-writer.
In your project folder, first install it via the console:
npm install xml-writer
Next, import it and build the document:
var XMLWriter = require('xml-writer');
var xw = new XMLWriter();
xw.startDocument();
xw.startElement('root');
xw.writeAttribute('foo', 'value');
xw.text('Some content');
xw.endDocument();
console.log(xw.toString());
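This prints something like:
<?xml version="1.0"?><root foo="value">Some content</root>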
You can find more information in the xml-writer documentation, which shows example code for each method. In this way, you can create, edit and export XML files. Good luck, and if something is not clear, write!
Additional
You will also need the fs module (plus xml2json and xml-formatter for the full example):
const fs = require("fs")
const xmlParser = require("xml2json")
const formatXml = require("xml-formatter")
Completed code:
const fs = require("fs")
const xmlParser = require("xml2json")
const formatXml = require("xml-formatter")
var XMLWriter = require('xml-writer');

var xw = new XMLWriter();
xw.startDocument();
xw.startElement('root');
xw.startElement('man');
xw.writeElement('name', 'Sergio');
xw.writeElement('adult', 'no');
xw.endElement();
xw.startElement('item');
xw.writeElement('name', 'phone');
xw.writeElement('price', '305.77');
xw.endElement();
xw.endDocument();

// Parse the generated XML into a plain object so it can be edited before export
const xmlObj = xmlParser.toJson(xw.toString(), { object: true })
const stringifiedXmlObj = JSON.stringify(xmlObj)
const finalXml = xmlParser.toXml(stringifiedXmlObj)

fs.writeFile("./datax.xml", formatXml(finalXml, { collapseContent: true }), function (err) {
    if (err) {
        console.log("Error")
    } else {
        console.log("Xml file successfully updated.")
    }
})
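The written datax.xml should then look roughly like this (the exact output depends on xml2json's round-trip and xml-formatter's defaults):

<root>
    <man>
        <name>Sergio</name>
        <adult>no</adult>
    </man>
    <item>
        <name>phone</name>
        <price>305.77</price>
    </item>
</root>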

Web Scraping Node.js in DOM page

I want to get information from the site using Node.js.
I tried so hard, without getting far. I want to get a magnet URI link, and this link is in:
<div id="download">
  <a href="magnet:?xt=..."><img src="/parse/s.rutor.org/i/magnet.gif"></a>
</div>
How do I get this link from the div's href attribute using cheerio? I don't know jQuery; I just want to write a parser.
Here is my try:
const request = require('request');
const cheerio = require('cheerio');

request('http://s.new-rutor.org/torrent/562496/povorot-ne-tuda-5-krovnoe-rodstvo_wrong-turn-5-bloodlines-2012-bdrip-avc-p/', function(err, resp, body) {
    if (!err) {
        const $ = cheerio.load(body);
        var magnet = $('.href', '#downloads').text()
        // $('#downloads').find('href').text()
        console.log(magnet);
    }
});
That code only prints an empty string to the console.
Note: I'm using request-promise instead of request.
This code console.logs all a-tags whose href contains 'magnet':
const request = require('request-promise');
const cheerio = require('cheerio');

request('http://s.new-rutor.org/torrent/562496/povorot-ne-tuda-5-krovnoe-rodstvo_wrong-turn-5-bloodlines-2012-bdrip-avc-p/').then(res => {
    const $ = cheerio.load(res)
    const links = $('a')
    links.each(i => {
        const link = links.eq(i).attr('href')
        if (link && link.includes('magnet')) {
            console.log(link)
        }
    })
})
eq selects the link at that specific index:
links.each(i => links.eq(i))
Then we grab the content of the href attribute (the magnet link) with:
links.eq(i).attr('href')
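Since the question's markup puts the anchor inside <div id="download">, a more direct sketch once $ is loaded (note that the original attempt queried #downloads, which doesn't exist, and treated href as a class rather than an attribute):

const magnet = $('#download a').attr('href');
console.log(magnet);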

node.js parsing html text to get a value to a javascript variable

I'm doing this successfully to get the help text of my page of interest.
router.get('/get', function (req, res) {
    var pg = 'https://j......com/f/resource'
    console.log('get', pg);
    requestify.get(pg).then(function (resp) {
        console.log(resp.body);
    });
});
Now that I have the page's text, I want to parse it to get the value of a JavaScript variable which I know exists in the text.
<script> var x1 = {"p": {.......bla bla ...}};</script>
I know that sometimes the <script> tag will include the type attribute, but it will not always include it.
When I find the value of x1, I want to use it in my JavaScript app as the value of a myVar variable.
If you do not have THE answer, then your comment/tip as to what I should research is appreciated.
I was hoping to find some module I can just drop the entire text into and have it output all the variables and values for me.
So you're not re-inventing the wheel, I feel like using JSDOM (and its execution capabilities) would be the best bet. To mock what you have:
const express = require('express');
const jsdom = require("jsdom");
const { JSDOM } = jsdom; // it exports a JSDOM class

// Mock a remote resource
const remote = express()
    .use('/', (req, res) => {
        res.send('<!DOCTYPE html><html lang="en-US"><head><title>Test document</title><script>var x1 = { "p": { "foo": "bar" } };</script></head><body></body></html>');
    })
    .listen(3001);

// Create "your" server
const local = express()
    .use('/', (req, res) => {
        // fetch the remote resource and load it into JSDOM. No need for
        // requestify, but you can use the JSDOM ctor and pass it a string
        // if you're doing something more complex than hitting an endpoint
        // (like passing auth, creds, etc.)
        JSDOM.fromURL('http://localhost:3001/', {
            runScripts: "dangerously" // allow <script> to run
        }).then((dom) => {
            // pass back the result of "x1" from the context of the
            // loaded dom page.
            res.send(dom.window.x1);
        });
    })
    .listen(3000);
I then receive back:
{"p":{"foo":"bar"}}
