Extracting table values from a URL with Node.js - javascript

I am quite new to Node.js and Express, but I am trying to build a website which serves static files. After some research I've found out that Node.js with Express can be quite useful for this.
So far I have managed to serve some static HTML files which are located on my server, but now I want to do something else:
I have a URL to an HTML page, and in that HTML page there is a table with some information.
I want to extract a couple of specific values from it, and 1) save them as JSON in a file, 2) write those values into an HTML page. I've tried to play with jQuery, but so far I've been unsuccessful.
This is what I have so far:
1. A Node app running on port 8081, which I will access from anywhere through an NGINX reverse proxy (I already have NGINX set up and it works).
2. I can fetch the URL and serve it as HTML when I use the proper URI.
3. I see that the table doesn't have an ID, only the "details" class associated with it. Also, I am only interested in getting these rows:
<div class='group'>
  <table class='details'>
    <tr>
      <th>Status:</th>
      <td>
        With editors
      </td>
    </tr>
From what I've seen so far, jQuery would work fine if the table had an ID.
This is my code in app.js:
var express = require('express');
var app = express();
var request = require('request');
const path = require('path');

var content;

app.use('/', function (req, res, next) {
  var status = 'It works';
  console.log('This is very %s', status);
  //console.log(content);
  next();
});

request(
  {
    uri:
      'https://authors.aps.org/Submissions/status?utf8=%E2%9C%93&accode=CH10674&author=Poenaru&commit=Submit'
  },
  function (error, response, body) {
    content = body;
  }
);

app.get('/', function (req, res) {
  console.log('Got a GET request for the homepage');
  res.sendFile(path.join(__dirname, '/', 'index.html'));
});

app.get('/url', function (req, res) {
  console.log('You requested table data!!!');
  // TODO: send only the values of that table instead of the whole HTML page
  res.send(content);
});

var server = app.listen(8081, function () {
  var host = server.address().address;
  var port = server.address().port;
  console.log('Node-App listening at http://%s:%s', host, port);
});
Basically, the HTML content of that URL is saved into the content variable, and now I would like to keep only the table from it and output just that saved part to the new HTML page.
Any ideas?
Thank you in advance :)

OK, so I've come across this package called cheerio, which basically allows one to use jQuery on the server. Having the HTML code from that specific URL, I could search that table for the elements I need. Cheerio is quite straightforward, and with this code I got the results I needed:
var request = require('request');
var cheerio = require('cheerio');

request(
  'https://authors.aps.org/Submissions/status?utf8=%E2%9C%93&accode=CH10674&author=Poenaru&commit=Submit',
  (error, res, html) => {
    if (!error && res.statusCode === 200) {
      const $ = cheerio.load(html);
      const details = $('.details');
      const articleInfo = details.find('th').eq(0);
      const articleStatus = details
        .find('th')
        .next()
        .eq(0);
      //console.log(details.html());
      console.log(articleInfo.html());
      console.log(articleStatus.html());
    }
  }
);
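A minimal sketch (untested; the file name and JSON shape are assumptions) of how those values could then be saved as JSON and served from the /url route:
var fs = require('fs');

// inside the request callback, after the extraction above:
var data = {
  label: articleInfo.html(),          // e.g. "Status:"
  status: articleStatus.text().trim() // e.g. "With editors"
};

// 1) save the values as JSON in a file ('status.json' is an assumed name)
fs.writeFile('status.json', JSON.stringify(data, null, 2), function (err) {
  if (err) console.error(err);
});

// 2) keep just these values around for the /url route
content = data; // and in app.get('/url', ...): res.json(content)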
Thank you @O.Jones and @AvcS for guiding me to jsdom and node-html-parser. I will definitely play with those in the near future :)
Cheers!

Your task is called "scraping." You want to scrape a particular chunk of data from some web page you did not create and then return it as part of your own web page.
You have noticed a problem with scraping: often the page you're scraping does not cleanly identify the data you want with a distinctive id. So you must use some guesswork to find it. @AvcS pointed out a server-side npm library called jsdom that you can use for this purpose.
Notice this: even though browsers and Node.js both use JavaScript, they are still very different environments. Browser JavaScript has lots of built-in APIs to access web pages' Document Object Models (DOMs). Node.js doesn't have those APIs. If you try to load jQuery into Node.js, it won't work, because jQuery depends on the browser DOM APIs. The jsdom package gives you some of those DOM APIs.
Once you have fetched that web page to scrape, code like this may help you get what you need.
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
...
const page = new JSDOM(page_in_text_string).window;
Then you can use a subset of the DOM APIs to find the elements you want in your page. In your example, that means the selector div.group table.details (your snippet puts the group class on the div and details on the table), and from the table you can climb to the enclosing div.group element.
You can do this sort of thing to find what you need:
const desiredTbl = page.document.querySelector("div.group table.details");
const desiredDiv = desiredTbl ? desiredTbl.parentNode : null;
const result = desiredDiv ? desiredDiv.textContent : null;
Finally do this:
page.close();
Your question says you want certain rows from your document. HTML documents don't have rows as such, they have elements (here, the <tr> elements of the table). If you want to extract just parts of elements (part of the table rather than the whole thing), you'll need to dig into those elements or do some text-string work. Just sayin'
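Still, individual <tr> elements are reachable with the same DOM APIs. A sketch (untested, selectors guessed from your snippet) that pulls out just the Status row, to run before page.close():
// walk the rows of the table found above and keep the one whose header is "Status:"
const rows = desiredTbl ? desiredTbl.querySelectorAll("tr") : [];
for (const row of rows) {
  const th = row.querySelector("th");
  const td = row.querySelector("td");
  if (th && td && th.textContent.trim() === "Status:") {
    console.log(td.textContent.trim()); // "With editors"
  }
}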
Also, I have not debugged any of this. That is left to you.
There's a smaller and faster library to do similar things called node-html-parser. If performance is important you may want that one instead.
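For completeness, the same lookup sketched with node-html-parser (untested, selector guessed as before):
const { parse } = require('node-html-parser');

const root = parse(page_in_text_string);
const tbl = root.querySelector('div.group table.details');
console.log(tbl ? tbl.structuredText : null); // text content of the table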

Related

Input Processing in JavaScript

I'm new to Web Development (including JavaScript and HTML) and have a few issues within my personal project that seem to have no clear fixes.
Overview
My project takes input from a user on the website and feeds it to my back-end, which outputs a list of word-completion suggestions.
For example, given the input "bass", the program would suggest "bassist", "bassa", "bassalia", "bassalian", "bassalan", etc. as possible completions of the pattern "bass" (these are words extracted from an English dictionary text file).
The backend - running on Node JS libraries
trie.js file:
/* code for the trie not fully shown */
var Deque = require("collections/deque"); // to be used somewhere

function add_word_to_trie(word) { ... }
function get_words_matching_pattern(pattern, number_to_get = DEFAULT_FETCH) { ... }

// read in words from English dictionary
var file = require('fs');
const DICTIONARY = 'somefile.txt';

function preprocess() {
  file.readFileSync(DICTIONARY, 'utf-8')
    .split('\n')
    .forEach((item) => {
      add_word_to_trie(item.replace(/\r?\n|\r/g, ""));
    });
}

preprocess();
module.exports = get_words_matching_pattern;
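The trie internals are elided above; purely for illustration, here is a minimal sketch of what those two functions might look like (every detail, including the default fetch count, is an assumption):
const DEFAULT_FETCH = 5; // assumed value
const root = {};         // plain-object trie; '$' marks the end of a word

function add_word_to_trie(word) {
  let node = root;
  for (const ch of word) {
    node = node[ch] = node[ch] || {};
  }
  node.$ = true;
}

function get_words_matching_pattern(pattern, number_to_get = DEFAULT_FETCH) {
  let node = root;
  for (const ch of pattern) {
    node = node[ch];
    if (!node) return []; // no word starts with this pattern
  }
  const results = [];
  (function walk(n, suffix) {
    if (results.length >= number_to_get) return;
    if (n.$) results.push(pattern + suffix);
    for (const key of Object.keys(n)) {
      if (key !== '$') walk(n[key], suffix + key);
    }
  })(node, '');
  return results;
}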
The frontend
An HTML file that renders the visuals for the website, gets input from the user, and passes it on to the backend script to fetch possible suggestions. It looks something like this:
index.html script:
<!DOCTYPE HTML>
<html>
<!-- code for formatting website and headers not shown -->
<body>
  <script src="./trie.js">
    function get_predicted_text() {
      const autofill_options = get_words_matching_pattern(input.value);
      /* add the first suggestion we get from the autofill options to the user's input
         arbitrary, because I couldn't get this to actually work. Actual version of
         autofill would be more sophisticated. */
      document.querySelector("input").value += autofill_options[0];
    }
  </script>
  <input placeholder="Enter text..." oninput="get_predicted_text()">
  <!-- I get a runtime error here saying that get_predicted_text is not defined -->
</body>
</html>
Errors I get
Firstly, I get the obvious error of require() being undefined on the client side. This I fix using browserify.
Secondly, there is the issue of fs not existing on the client side, since it is a Node.js module. I have tried running the trie.js file using node and wrapping it with some server-side code:
const fs = require('fs');
const http = require('http');
const PORT = 8080; // the port is not given in the post; value assumed

function respond_to_user_input() {
  fs.readFile('./index.html', null, (err, html) => {
    if (err) throw err;
    http.createServer((request, response) => {
      response.write(html);
      response.end();
    }).listen(PORT);
  });
}

respond_to_user_input();
With this, I'm not exactly sure how to edit document elements, such as changing input.value in index.html, or triggering the oninput event listener on the input field. Also, my CSS stylesheet is not loaded when I serve the HTML file through the node trie.js command in the terminal.
This leaves me with the question: is it even possible to run index.html directly (through Google Chrome) and have it use Node.js modules when it calls the trie.js script? And if I use the server-side code I described above with the HTTP module, how can I fix the issues of loading my external CSS stylesheet (which my HTML file references via href) and accessing document.querySelector("input") to edit my input field?

How to fix '$(...).click is not a function' in Node/Cheerio

I am writing an application in Node.js that will navigate to a website, click a button on it, and then extract certain pieces of data from the page. All is going well except for the button-clicking aspect. I cannot seem to simulate a button click. I'm extremely new at this, so I'd appreciate any suggestions y'all have! Sadly I've scoured the internet looking for a solution to this issue and have been unable to find one.
I have used .click() and .bind('click', ...) in a .js file that uses 'request' and 'cheerio'.
I have also tried using page.click() and page.evaluate() in a different .js file that uses 'chrome-launcher', 'chrome-remote-interface', and 'puppeteer'.
Here is my code for the 'request' and 'cheerio' file:
const request = require('request');
const cheerio = require('cheerio');

let p1 = {}, p2 = {}, p3 = {}, p4 = {}, p5 = {};
p1.name = 'TheJackal666';
p2.name = 'Naether Raviel';
p3.name = 'qman37';
p4.name = 'ranger51';
p5.name = 'fernanda12x';
const team = { 1: p1, 2: p2, 3: p3, 4: p4, 5: p5 };

for (var x in team) {
  let url = 'https://na.op.gg/summoner/userName=' + team[x].name;
  request(url, (error, response, html) => {
    if (!error && response.statusCode == 200) {
      const $ = cheerio.load(html);
      $('.SummonerRefreshButton.Button.SemiRound.Blue').click();
      //FIXME: MAKE A FUNCTION THAT SUCCESSFULLY "CLICKS" UPDATE BUTTON
      team[x].overallWR = $('.winratio');
      team[x].overallWR = team[x].overallWR.text().match(/\d/g);
      team[x].overallWR = team[x].overallWR.join("");
      console.log(team[x].overallWR);
    }
  });
}
I expect to successfully click the update button on any of the pages (there is a section on the page that says when it was last updated) without getting an error. As it is, I either get an error that:
"$(...).click is not a function"
or (if I incorporate that line into an outer function) I get no error, but no result.
See the documentation:
Cheerio is not a web browser
Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript. If your use case requires any of this functionality, you should consider projects like PhantomJS or JSDom.
Cheerio is an HTML parser.
Cheerio can be used to select and manipulate DOM elements, but it is not a full browser.
Cheerio only has access to the original source DOM, which means that if the DOM of a webpage is manipulated by JavaScript, Cheerio will not see that change.
Cheerio cannot be used to interact with DOM elements (a la jQuery) because it does not similarly execute within a window (a JS window).
As of this moment, if you need to manipulate or select against JS-rendered HTML, your best option is puppeteer. This is likely to change, though.
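If you do reach for puppeteer, the click-then-scrape flow looks roughly like this (a sketch, untested; the selectors are copied from the question):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://na.op.gg/summoner/userName=TheJackal666');

  // the click runs inside a real (headless) browser,
  // so the page's own javascript handles the update
  await page.click('.SummonerRefreshButton.Button.SemiRound.Blue');

  // wait for the stat to be present, then read it
  await page.waitForSelector('.winratio');
  const winRatio = await page.$eval('.winratio', el => el.textContent);
  console.log(winRatio);

  await browser.close();
})();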
HTH

Access a web page from Node script on desktop

I know that this could be a very stupid question, but since I'm totally new to JavaScript, I'm not sure how to do this. I want to write a script and run it through node on my laptop, and in this script I want to interact with a web page in order to use functions like document.getElementById and the like.
In Python one could do this with something like Beautiful Soup or requests, but how do you do this in JavaScript?
I have implemented a crawler using cheerio and request-promise as follows:
https://www.npmjs.com/package/cheerio
let request = require('request-promise');
let cheerio = require('cheerio');

request = request.defaults({
  transform: function (body) {
    return cheerio.load(body);
  }
});

// ... omitted

request({ uri: 'http://example.org' })
  .then($ => {
    const element = $('.element-with-class');
  });
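A possible continuation of that chain (the selector is still the placeholder from above), with error handling added:
request({ uri: 'http://example.org' })
  .then($ => {
    // $ is already a loaded cheerio instance thanks to the transform above
    console.log($('.element-with-class').first().text().trim());
  })
  .catch(err => console.error('Request failed:', err));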

node.js request a webpage with async scripts

I'm downloading a webpage using the request module, which is very straightforward.
My problem is that the page I'm trying to download has some async scripts (they carry the async attribute), and those are not downloaded with the HTML document returned by the HTTP request.
My question is how I can make an HTTP request, preferably with (but if necessary without) the request module, and have the WHOLE page downloaded, without gaps like the edge case described above.
Sounds like you are trying to do web scraping using JavaScript.
Using request is a very fundamental approach which may be too low-level and time-consuming for your needs. The topic is pretty broad, but you should look into more purpose-built modules such as cheerio, x-ray and nightmare.
x-ray will let you select elements directly from the page in a jQuery-like way instead of parsing the whole body.
nightmare provides a modern headless browser which makes it possible for you to enter input as though using the browser manually. With this you should be able to better handle the ajax-type requests which are causing you problems (see the sketch below).
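For example, a minimal nightmare sketch (untested; the URL and selector are placeholders) that lets the page's scripts run before grabbing the rendered markup:
const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: false });

nightmare
  .goto('http://example.org')
  .wait('body') // or wait for a selector that the async script renders
  .evaluate(() => document.documentElement.outerHTML)
  .end()
  .then(html => console.log(html)) // fully rendered page, async scripts included
  .catch(err => console.error(err));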
HTH and good luck!
Using only request, you could try the following approach to pull the async scripts.
Note: I have tested this with a very basic setup and there is work to be done to make it robust. However, it worked for me:
Test setup
To set up the test I created an HTML file which includes a script in the body like this: <script src="abc.js" async></script>
Then I created a temporary server to launch it (httpster).
Scraper
"use strict";
const request = require('request');
const options1 = { url: 'http://localhost:3333/' }
// hard coded script name for test purposes
const options2 = { url: 'http://localhost:3333/abc.js' }
let htmlData // store html page here
request.get(options1)
.on('response', resp => resp.on('data', d => htmlData += d))
.on('end', () => {
let scripts; // store scripts here
// htmlData contains webpage
// Use xml parser to find all script tags with async tags
// and their base urls
// NOT DONE FOR THIS EXAMPLE
request.get(options2)
.on('response', resp => resp.on('data', d => scripts += d))
.on('end', () => {
let allData = htmlData.toString() + scripts.toString();
console.log(allData);
})
.on('error', err => console.log(err))
})
.on('error', err => console.log(err))
This basic example works. You will need to find all the JS scripts on the page and extract the URL part, which I have not done here.
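That missing step could be sketched like this, using cheerio as the parser (my choice here, not part of the original test):
const cheerio = require('cheerio');

// find every async script tag in the downloaded page and collect its src,
// so each one can be requested the way abc.js is above
const $ = cheerio.load(htmlData.toString());
const scriptUrls = $('script[async]')
  .map((i, el) => $(el).attr('src'))
  .get()
  .filter(Boolean);
console.log(scriptUrls); // e.g. [ 'abc.js' ]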

Display in iframe a div with a certain ID from an external page

I want to display the content of a certain div with ID="example" from an external page in an iframe on my website. Is that possible, and how can it be done? I have no control over the external page, only over my website. The only thing I know about that external page is the ID of the div I want to show on my website... maybe you have a fiddle example?
There are many possibilities for building this (server or client side).
One of them is to use Node.js working as a web crawler: http://net.tutsplus.com/tutorials/javascript-ajax/how-to-scrape-web-pages-with-node-js-and-jquery/
Here is an example, using Node.js with express and cheerio (for jQuery-style manipulation):
var url = 'YOUR_URL';

function getContentJSON(body) {
  var cheerio = require('cheerio');
  var $ = cheerio.load(body);
  var content = $('div#ID').text().trim();
  var result = { "content": content };
  return JSON.stringify(result);
}

function requestPage(url, res) {
  var request = require('request');
  request(url, function (error, response, body) {
    if (!error && response.statusCode == 200) {
      var json = getContentJSON(body);
      // statusCodes is assumed to be provided by the hosting framework
      res.send(statusCodes.OK, json);
    }
  });
}

exports.get = function (request, response) {
  requestPage(url, response);
};
If I understand correctly, you want to display only the div on your website.
As far as I know this is not possible due to browser restrictions. If you just load the page into an iframe, you will not be able to manipulate the DOM of the content within that iframe. You could try to see whether you can at least manipulate the scrollbars, but I don't know whether that is possible or whether the same restrictions apply.
Another option would be to load the external page using AJAX. This however will be a cross domain request, requiring the external website to allow CORS requests from your domain. This is highly unlikely. You could use a proxy on your webserver to circumvent this problem.
In short: I don't think this is possible using standard javascript. It could however be done using a server side proxy.
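Such a proxy could be sketched like this, reusing the express/cheerio approach from the answer above (the external URL, route and port are placeholders):
const express = require('express');
const request = require('request');
const cheerio = require('cheerio');

const app = express();

// proxy route: fetch the external page, keep only #example, serve just that;
// your iframe then points at /proxy on your own domain, so no CORS issue
app.get('/proxy', (req, res) => {
  request('https://external.example.com/page.html', (error, response, body) => {
    if (error || response.statusCode !== 200) {
      return res.status(502).send('Could not fetch external page');
    }
    const $ = cheerio.load(body);
    res.send($.html($('#example')));
  });
});

app.listen(8081);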
