Reading a web page with node.js and urllib

I'm learning programming and I'm stuck: the code from a tutorial is not working and I can't understand why.
It's a Node.js script, run from the shell, that is supposed to retrieve a Wikipedia page, strip out the references, and return just the paragraph text.
It uses the urllib library. The only difference between the code below and the tutorial's is that I use fs to write the page content to a text file. The rest is copied and pasted.
#!/usr/local/bin/node
// Returns the paragraphs from a Wikipedia link, stripped of reference numbers.
let urllib = require("urllib");
let url = process.argv[2];
let fs = require("fs");
console.log(url);
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
urllib.request(url, { followRedirect: true }, function(error, data, response) {
let body = data.toString();
// Simulate a Document Object Model.
let { document } = (new JSDOM(body)).window;
// Grab all the paragraphs and references.
let paragraphs = document.querySelectorAll("p");
let references = document.querySelectorAll(".reference");
// Remove any references.
references.forEach(function(reference) {
reference.remove();
});
// Print out all of the paragraphs.
paragraphs.forEach(function(paragraph) {
console.log(paragraph.textContent);
fs.appendFileSync("article.txt", `${paragraph}\n`);
});
});
My first guess was that urllib was not working for some reason, because even though I installed it as per the official documentation, typing which urllib at the command line doesn't return a path.
But then, node doesn't raise an error about require("urllib") when I run the file.
The actual output is the following:
$ ./wikp https://es.wikipedia.org/wiki/JavaScript
https://es.wikipedia.org/wiki/JavaScript
$
Can anybody help please?

I think the tutorial you followed might have been a little out of date.
This works for me:
let urllib = require("urllib");
let url = process.argv[2];
let fs = require("fs");
console.log(url);
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
urllib.request(url, { followRedirect: true }).then(({data, res}) => {
let body = data.toString();
// Simulate a Document Object Model.
let { document } = (new JSDOM(body)).window;
// Grab all the paragraphs and references.
let paragraphs = document.querySelectorAll("p");
let references = document.querySelectorAll(".reference");
// Remove any references.
references.forEach(function(reference) {
reference.remove();
});
// Print out all of the paragraphs.
paragraphs.forEach(function(paragraph) {
console.log(paragraph.textContent);
fs.appendFileSync("article.txt", `${paragraph.textContent}\n`);
});
});
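The urllib package you have installed appears to return a promise from request() instead of calling a callback, so the function you passed as the third argument never runs and nothing is printed. That was probably different when the tutorial was released. I also changed ${paragraph} to ${paragraph.textContent} in the appendFileSync call, so the file gets the paragraph text rather than [object HTMLParagraphElement].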
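To see why the original script printed only the URL and then exited without an error, here is a minimal sketch. It does not use urllib itself: promiseOnlyRequest is a hypothetical stand-in for any function that returns a promise and ignores extra arguments, which is assumed to be how the installed urllib version treats the callback.

```javascript
// Hypothetical stand-in for a promise-only API (not urllib's real code).
// It accepts (url, options) and silently ignores any further arguments.
function promiseOnlyRequest(url, options) {
  return Promise.resolve({
    data: Buffer.from(`<p>page at ${url}</p>`),
    res: { status: 200 },
  });
}

let callbackRan = false;

// Old callback style: the third argument is never called, and no error is raised.
promiseOnlyRequest("https://example.org", {}, function (error, data, response) {
  callbackRan = true;
});

// Promise style: the result arrives through .then().
const pending = promiseOnlyRequest("https://example.org", {}).then(({ data, res }) => {
  return data.toString();
});

pending.then((body) => {
  console.log("callback ran:", callbackRan); // callback ran: false
  console.log("body:", body);
});
```

JavaScript does not complain about surplus arguments, which is why the tutorial's code fails silently instead of throwing.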

Related

Editing an XML document

I am new to JavaScript and need to create, edit, and export an XML document on the server side. I have seen different options on the Internet, but none of them suit me.
I did find one workable option: convert my XML file into JSON, edit it, convert it back, and then export it through another plugin. Is there an easier way?
Thanks!

I recently came across a similar problem. The solution turned out to be very simple: use xml-writer.
In your project folder, first install it from the console:
npm install xml-writer
Next, import it and create a new file to try it out:
var XMLWriter = require('xml-writer');
var xw = new XMLWriter;
xw.startDocument();
xw.startElement('root');
xw.writeAttribute('foo', 'value');
xw.text('Some content');
xw.endDocument();
console.log(xw.toString());
You can find more information here; at the bottom of the page there is sample code for each kind of item. In this way you can create, edit, and export XML files. Good luck, and if something is not clear, write!
Additional
You will also need the fs module to save the result, and xml-formatter if you want the output pretty-printed:
const fs = require("fs")
const formatXml = require("xml-formatter")
Completed code:
const fs = require("fs")
const formatXml = require("xml-formatter")
const XMLWriter = require('xml-writer');
const xw = new XMLWriter;
xw.startDocument();
xw.startElement('root');
xw.startElement('man');
xw.writeElement('name', 'Sergio');
xw.writeElement('adult', 'no');
xw.endElement();
xw.startElement('item');
xw.writeElement('name', 'phone');
xw.writeElement('price', '305.77');
xw.endElement();
xw.endDocument();
// xw.toString() returns the XML built above; format it and write it to disk.
const finalXml = formatXml(xw.toString(), { collapseContent: true })
fs.writeFile("./datax.xml", finalXml, function (err) {
if (err) {
console.log("Error")
} else {
console.log("Xml file successfully updated.")
}
})

Getting all the text content from a HTML string in NodeJS

I need to get only the text content from a HTML String with a space or a line break separating the text content of different elements.
For example, the HTML String might be:
<ul>
<li>First</li>
<li>Second</li>
</ul>
What I want:
First Second
or
First
Second
I've tried to get the text content by first wrapping the entire string inside a div and then getting the textContent using third party libraries. But, there is no spacing or line breaks between text content of different elements which I specifically require (i.e. I get FirstSecond which is not what I want).
The only solution I am thinking of right now is to make a DOM Tree and then apply recursion to get the nodes that contain text, and then append the text of that element to a string with spaces.
Is there a cleaner, neater, and simpler solution than this?

Convert HTML to Plain Text:
In your terminal, install the html-to-text npm package:
npm install html-to-text
Then in JavaScript:
const { convert } = require('html-to-text'); // Import the library
var htmlString = `
<ul>
<li>First</li>
<li>Second</li>
</ul>
`;
var text = convert(htmlString, { wordwrap: 130 })
// Out:
// First
// Second
Hope this helps!

You can try to get rid of the HTML tags using a regex. For your example, try the following:
let str = `<ul>
<li>First</li>
<li>Second</li>
</ul>`
console.log(str)
let regex = '<\/?!?(li|ul)[^>]*>'
var re = new RegExp(regex, 'g');
str = str.replace(re, '');
console.log(str)
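The pattern above only removes li and ul tags and leaves the line breaks between them in place. As a rough sketch of a more general variant, the hypothetical helper stripTags below replaces every tag with a space and then collapses runs of whitespace. It assumes simple, well-formed markup; a regular expression is not an HTML parser, so comments, script blocks, or a > inside an attribute value will trip it up.

```javascript
// Hypothetical helper: strip anything that looks like a tag, keep the text.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, " ") // each tag becomes a space so adjacent texts stay separated
    .replace(/\s+/g, " ")     // collapse newlines and repeated spaces
    .trim();
}

const sample = `<ul>
<li>First</li>
<li>Second</li>
</ul>`;

console.log(stripTags(sample)); // First Second
```

Replacing tags with a space rather than an empty string is what keeps First and Second from running together.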

You can try this example; it may help you.
I used the jsdom module:
https://www.npmjs.com/package/jsdom
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent);
It helped me, so I think it can help you too :)

In a browser you could use Node.textContent. However, Node.js has no textContent of its own (since it doesn't have native access to the DOM), so you need external packages. You could install request and cheerio using npm. cheerio, suggested by Jon Church, is maybe the easiest web scraping tool to use (there are also more complex ones like jsdom).
With the power of cheerio and request in your hands, you could write:
const request = require("request");
const cheerio = require("cheerio");
const fs = require("fs");
//taken from https://stackoverflow.com/a/19709846/10713877
function is_absolute(url)
{
var r = new RegExp('^(?:[a-z]+:)?//', 'i');
return r.test(url);
}
function is_local(url)
{
var r = new RegExp('^(?:file:)?//', 'i');
return (r.test(url) || !is_absolute(url));
}
function send_request(URL)
{
if(is_local(URL))
{
// Strip the file:// prefix, if any, to get a plain path.
const url_tmp = URL.slice(0,7)==="file://" ? URL.slice(7) : URL;
//taken from https://stackoverflow.com/a/20665078/10713877
const $ = cheerio.load(fs.readFileSync(url_tmp));
//Do something
console.log($.text())
}
else
{
var options = {
url: URL,
headers: {
'User-Agent': 'Your-User-Agent'
}
};
request(options, function(error, response, html) {
//no error
if(!error && response.statusCode == 200)
{
console.log("Success");
const $ = cheerio.load(html);
return Promise.resolve().then(()=> {
//Do something
console.log($.text())
});
}
else
{
console.log(`Failure: ${error}`);
}
});
}
}
Let me explain the code. You pass a URL to the send_request function. It checks whether the URL string is a path to a local file (a relative path, or a path starting with file://). If it is a local file, it goes straight to the cheerio module; otherwise it has to send a request to the website using the request module and then use cheerio. Regular expressions are used in is_absolute and is_local. You get the text using the text() method provided by cheerio. Under the //Do something comments, you can do whatever you want with the text.
There are websites that tell you your user agent; copy and paste it in place of 'Your-User-Agent'.
The lines below will work:
//your local file
send_request("/absolute/path/to/your/local/index.html");
send_request("relative/path/to/your/local/index.html");
send_request("file:///absolute/path/to/your/local/index.html");
//website
send_request("https://stackoverflow.com/");
EDIT: I am on a linux system.
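To make the two regular expressions easier to follow, here is a small sketch that runs the same checks on a few sample strings. The two functions use the same patterns as the answer, written as regex literals; the sample URLs and paths are made up for illustration.

```javascript
// Same checks as in the answer above, written with regex literals.
function is_absolute(url) {
  // An optional scheme such as "https:" or "file:", followed by "//".
  return /^(?:[a-z]+:)?\/\//i.test(url);
}
function is_local(url) {
  // Either "file://..." (or a bare "//..."), or anything that is not absolute.
  return /^(?:file:)?\/\//i.test(url) || !is_absolute(url);
}

const samples = [
  "https://stackoverflow.com/",
  "file:///home/user/index.html",
  "/home/user/index.html",
  "pages/index.html",
];
for (const s of samples) {
  console.log(s, "->", is_local(s) ? "local file" : "remote URL");
}
```

Only the first sample is classified as remote. One quirk worth knowing: a protocol-relative URL such as //cdn.example.com/app.js also matches the is_local pattern, so it would be treated as a local path.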

You can try the npm library htmlparser2. It makes this very simple:
const htmlparser2 = require('htmlparser2');
const htmlString = ''; // your HTML string goes here
const parts = []; // text of each element, in document order
const parser = new htmlparser2.Parser({
ontext(text) {
if (text && text.trim().length > 0) {
// collect the trimmed text; you could also concatenate it into a string
parts.push(text.trim());
}
}
});
parser.write(htmlString);
parser.end();
console.log(parts.join(' ')); // use '\n' instead for line breaks

how to get discord.js to pick a random image from file

I'm in another pickle. Over the past week I've realized that my images are not loading because the links have expired, so I want to find out how to use a local file directory in the code instead.
Here's what I've tried:
});
client.on('message', message => {
if (message.content.startsWith('L!hug')) {
var fs = require('fs');
var files = fs.readdirSync('C:\Users\nevbw\Desktop\games\FBIBot\images\hugs')
/* now files is an Array of the name of the files in the folder and you can pick a random name inside of that array */
let chosenFile = files[Math.floor(Math.random() * files.length)]
}
});
and
});
client.on('message', message => {
if (message.content.startsWith('L!hug')) {
const path = 'C:\Users\nevbw\Desktop\games\FBIBot\images\hugs';
const fs = require('fs');
fs.readdirSync(path).forEach(file => {
ranfile = Math.floor(Math.random() * file.length);
message.channel.sendFile(ranfile);
})
}
});

After a lot of searching I found an answer and modified it to this. I hope people can use it for future reference!
const num = (Math.floor(Math.random() * 5) + 1).toString();
message.channel.send({ files: [`./slap/slap${num}.gif`] });

Using fs.readdirSync('./images/') instead of fs.readFileSync('./images/') is easier, but then you have to create the folder in VS Code and put the images in it. You can also drag and drop the images into the project and use:
var files = fs.readdirSync(`./images/`).filter(file => file.endsWith('.png'))
so that when it looks for an image, it doesn't select anything else. I hope this helps some people.
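The readdirSync-plus-filter idea can be sketched as a small helper. pickRandomFile and the demo folder are hypothetical names made up for this illustration; the demo creates a throwaway folder so that the sketch runs without any bot or image files.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: return the full path of a random file in `dir`
// whose name ends with one of `extensions`, or null if there is none.
function pickRandomFile(dir, extensions = [".png", ".gif", ".jpg"]) {
  const files = fs
    .readdirSync(dir)
    .filter((name) => extensions.some((ext) => name.toLowerCase().endsWith(ext)));
  if (files.length === 0) return null;
  const chosen = files[Math.floor(Math.random() * files.length)];
  return path.join(dir, chosen);
}

// Demo with a throwaway folder so the sketch runs anywhere.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "hugs-"));
fs.writeFileSync(path.join(demoDir, "hug1.png"), "");
fs.writeFileSync(path.join(demoDir, "hug2.gif"), "");
fs.writeFileSync(path.join(demoDir, "notes.txt"), "");

console.log(pickRandomFile(demoDir)); // one of the two image paths, never notes.txt
```

In the bot, the returned path could presumably be passed to message.channel.send({ files: [...] }) as in the answer above; whether that call is available depends on your discord.js version, so treat it as an assumption.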

Happy to help.
You're using FS the wrong way. This Is What It Should Look Like :D Also Here Is Some Documentation on It ( https://nodejs.org/dist/latest-v13.x/docs/api/fs.html ).
-- Code --
Also Just As A Tip! I See You Are Using Full Directories, That's Quite Innificeng (E.g if You Change Your Username, Drive ID, etc.) so in fs provided the image is in the same folder you can just do ./(ImageName), or if it is in the same folder but under another say /FBIBot/Images you can do ./Images/(ImageName). ^^
--
What The Error Was: (I Unfortunately Cannot Test it But I Am Like 99% Sure).
You Were Using fs.readdirSync(path).forEach(file => { When You Were Meant To Be Using fs.readfilesync(path).forEach(file => {.
-- First Code --
});
client.on('message', message => {
if (message.content.startsWith('L!hug')) {
var fs = require('fs');
var files = fs.readfileSync('C:\Users\nevbw\Desktop\games\FBIBot\images\hugs')
/* now files is an Array of the name of the files in the folder and you can pick a random name inside of that array */
let chosenFile = files[Math.floor(Math.random() * files.length)]
}
});
-- Second Code --
});
client.on('message', message => {
if (message.content.startsWith('L!hug')) {
var fs = require('fs');
var files = fs.readFileSync('C:\Users\nevbw\Desktop\games\FBIBot\images\hugs')
/* now files is an Array of the name of the files in the folder and you can pick a random name inside of that array */
let chosenFile = files[Math.floor(Math.random() * files.length)]
}
});
^^

Saving text from an element on a web page using javascript

Let's say there's some text inside a particular element on a web page that I want to save. Using JavaScript, how could I save that text, or append it, to a file "myfile.txt" on my hard drive? The element changes dynamically over time, so whenever it updates I'd like the new text to be appended to the file.
I've been doing some research on web scraping, and it just seems too over the top/complicated for this task.

I've written a Node.js program that fetches a web page URL every X seconds and compares the previous and new value of a specific HTML element. It only appends to the output file when the value has changed.
Note that the previous value is kept in memory only and is lost when the program exits, so the first fetch after each start always saves the extracted text (because there's nothing to compare it to).
This program uses the node-fetch and jsdom npm packages.
fs is a built-in module in Node.js.
If you are new to Node.js, you can follow this to install it on your computer.
const fetch = require('node-fetch');
const jsdom = require('jsdom');
const fs = require('fs');
// Local previous extracted text variable to compare and determine changes
let prevExtractedText;
// The webpage URL to fetch from
const url = 'https://en.wikipedia.org/wiki/Node.js';
// Setting your file's output path
const outputFilepath = 'myfile.txt';
// Setting timeout every 5 seconds
const timeout = 5000;
const handleOnError = (err) => {
console.error(`! An error occurred: ${err.message}`);
process.exit(1);
}
const handleFetchAndSaveFile = async () => {
let html;
try {
console.log(`* Fetching ${url}...`);
const resp = await fetch(url);
console.log('* Converting response into html text...');
html = await resp.text();
} catch (err) {
handleOnError(err);
}
// Convert into a DOM in the Node.js environment
const dom = new jsdom.JSDOM(html);
// Example with element of id "footer-places-privacy"
const extractedText = dom.window.document.getElementById("footer-places-privacy").textContent;
console.log(`* Comparing previous extracted text (${prevExtractedText}) and current extracted text (${extractedText})`);
if (prevExtractedText !== extractedText) {
// Update prevExtractedText
prevExtractedText = extractedText;
console.log(`* Updating ${outputFilepath}...`);
try {
// Append the new extracted text to the file, one entry per line
// (appendFileSync is synchronous, so no await is needed)
fs.appendFileSync(outputFilepath, `${extractedText}\n`);
console.log(`* ${outputFilepath} has been updated and saved.`);
} catch (err) {
handleOnError(err);
}
}
console.log('--------------------------------------------------')
}
console.log(`* Polling ${url} every ${timeout}ms`);
setInterval(handleFetchAndSaveFile, timeout);
Working demo: https://codesandbox.io/s/nodejs-webpage-polling-jqf6v?fontsize=14&hidenavigation=1&theme=dark
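The compare-then-append step at the heart of the program can be sketched on its own, without any network access. makeChangeRecorder is a hypothetical helper invented for this illustration, and the demo feeds it fixed strings and a temporary file instead of text scraped from a live page.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: returns a function that appends `text` to `filePath`
// only when it differs from the last text it was given.
function makeChangeRecorder(filePath) {
  let previous; // undefined until the first call, so the first value is always saved
  return function record(text) {
    if (text === previous) return false; // unchanged, nothing written
    previous = text;
    fs.appendFileSync(filePath, `${text}\n`);
    return true;
  };
}

// Demo against a temporary file instead of a live web page.
const outFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "poll-")), "myfile.txt");
const record = makeChangeRecorder(outFile);
record("v1");
record("v1"); // duplicate, skipped
record("v2");
console.log(fs.readFileSync(outFile, "utf8")); // v1 and v2, one per line
```

Keeping the previous value inside a closure means each output file gets its own comparison state, which would make it straightforward to watch several elements at once.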

JSDOM is not loading JavaScript included with <script> tag

Note: This question is not a duplicate of other existing questions, because it does not use the jsdom.env() function call that older versions of JSDOM use.
File bar.js:
console.log('bar says: hello')
File foo.js:
var jsdom = require('jsdom')
var html = '<!DOCTYPE html><head><script src="bar.js"></script></head><body><div>Foo</div></body>'
var window = new jsdom.JSDOM(html).window
window.onload = function () {
console.log('window loaded')
}
When I run foo.js, I get this output.
$ node foo.js
window loaded
Why did the bar says: hello output not appear? It looks like bar.js was not loaded. How can I make jsdom load the file in the script tag?
[EDIT/SOLUTION]: Problem solved after following a suggestion in the answer by Quentin. This code works:
var jsdom = require('jsdom')
var html = '<!DOCTYPE html><head><script src="bar.js"></script></head><body><div>Foo</div></body>'
var window = new jsdom.JSDOM(html, { runScripts: "dangerously", resources: "usable" }).window
window.onload = function () {
console.log('window loaded')
}

Go to the JSDOM homepage.
Skim the headings until you find one marked Executing scripts
To enable executing scripts inside the page, you can use the
runScripts: "dangerously" option:
const dom = new JSDOM(`<body>
<script>document.body.appendChild(document.createElement("hr"));</script>
</body>`, { runScripts: "dangerously" });
// The script will be executed and modify the DOM:
dom.window.document.body.children.length === 2;
Again we emphasize to only use this when feeding jsdom code you know
is safe. If you use it on arbitrary user-supplied code, or code from
the Internet, you are effectively running untrusted Node.js code, and
your machine could be compromised.
If you want to execute external scripts, included via <script
src="">, you'll also need to ensure that they load them. To do this,
add the option resources: "usable" as described below.

Given that I was unable to reproduce the URL-based solution from the code above...
Brute-force bundling alternative: inline it all!
Read the various .js files and inject them as strings into the HTML page. Then wait for the page to load, as in a normal browser.
These libraries are loaded into _window = new JSDOM(html, { options }).window; and are therefore available to your Node script.
This is likely to prevent you from making XHR calls, so it only partially solves the issue.
say-hello.js
// fired when loaded
console.log("say-hello.js says: hello!")
// defined and needing a call
var sayBye = function(name) {
var name = name ||'Hero!';
console.log("say-hello.js says: Good bye! "+name)
}
main.js:
const fs = require("fs");
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
var NAME = process.env.NAME; // variable from terminal
var html = '<!DOCTYPE html><head></head><body><div>Foo</div></body>'
var _window = new JSDOM(html, {
runScripts: "dangerously",
resources: "usable" }).window;
/* ************************************************************************* */
/* Add scripts to head ***************************************************** */
var jsFiles = [
'say-hello.js'
];
var scriptsContent = ``;
for(var i =0; i< jsFiles.length;i++){
console.log(__dirname + '/'+ jsFiles[i])
let scriptContent = fs.readFileSync( jsFiles[i], 'utf8');
scriptsContent = scriptsContent + `
/* ******************************************************************************************* */
/* `+jsFiles[i]+` **************************************************************************** */
`+scriptContent;
};
let scriptElement = _window.document.createElement('script');
scriptElement.textContent = scriptsContent;
_window.document.head.appendChild(scriptElement);
/* ************************************************************************* */
/* Run page **************************************************************** */
_window.document.addEventListener('DOMContentLoaded', () => {
console.log('main says: DOMContentLoaded')
// The inlined script ran when it was appended to the head,
// so sayBye is already defined on the window at this point.
_window.sayBye(NAME); // prints "say-hello.js says: Good bye! " followed by NAME
});
Run it:
NAME=John node main.js # expect the hello message, then "Good bye! John"
Source:
https://github.com/jsdom/jsdom/issues/1914
https://github.com/jsdom/jsdom/issues/3023

Setting the JSDOM option url to file://${__dirname}/index.html could work, according to a source. If you test it, please report the result here.
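If you do try the url option, building the file:// URL by string concatenation can break on paths that contain spaces or Windows drive letters. The sketch below uses Node's built-in url.pathToFileURL instead; pageUrl is a hypothetical helper name, and the JSDOM call at the end is left as a comment because it is an untested assumption.

```javascript
const path = require("path");
const { pathToFileURL } = require("url");

// Hypothetical helper: build a file:// URL for a page that lives in `dir`.
// pathToFileURL percent-encodes spaces and handles Windows drive letters,
// which plain string concatenation with file:// does not.
function pageUrl(dir, fileName) {
  return pathToFileURL(path.join(dir, fileName)).href;
}

const baseUrl = pageUrl(process.cwd(), "index.html");
console.log(baseUrl);

// Assumed usage, untested here (requires the jsdom package):
// new JSDOM(html, { url: baseUrl, runScripts: "dangerously", resources: "usable" });
```

With a file:// base URL in place, a relative src such as bar.js would be resolved against that folder, which is presumably what the suggestion above relies on.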
