How to fix '$(...).click is not a function' in Node/Cheerio - javascript

I am writing an application in node.js that will navigate to a website, click a button on the website, and then extract certain pieces of data from the website. All is going well except for the button-clicking aspect. I cannot seem to simulate a button click. I'm extremely new at this, so I'd appreciate any suggestions y'all have! Sadly I've scoured the internet looking for a solution to this issue and have been unable to find one.
I have used .click() and .bind('click', ...) in a .js file that uses 'request' and 'cheerio'.
I have also tried using page.click() and page.evaluate() in a different .js file that uses 'chrome-launcher', 'chrome-remote-interface', and 'puppeteer'.
Here is my code for the 'request' and 'cheerio' file:
const request = require('request');
const cheerio = require('cheerio');

let p1 = {}, p2 = {}, p3 = {}, p4 = {}, p5 = {};
p1.name = 'TheJackal666';
p2.name = 'Naether Raviel';
p3.name = 'qman37';
p4.name = 'ranger51';
p5.name = 'fernanda12x';
const team = {1: p1, 2: p2, 3: p3, 4: p4, 5: p5};

for (var x in team) {
    let url = 'https://na.op.gg/summoner/userName=' + team[x].name;
    request(url, (error, response, html) => {
        if (!error && response.statusCode == 200) {
            const $ = cheerio.load(html);
            $('.SummonerRefreshButton.Button.SemiRound.Blue').click();
            //FIXME: MAKE A FUNCTION THAT SUCCESSFULLY "CLICKS" UPDATE BUTTON
            team[x].overallWR = $('.winratio');
            team[x].overallWR = team[x].overallWR.text().match(/\d/g);
            team[x].overallWR = team[x].overallWR.join("");
            console.log(team[x].overallWR);
        }
    });
}
I expect to successfully click the update button on any of the pages (there is a section on each page that says when it was last updated) without getting an error. As it is, I either get the error:
"$(...).click is not a function"
or (if I incorporate that line into an outer function) I get no error, but also no result.

See the documentation:
Cheerio is not a web browser
Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript. If your use case requires any of this functionality, you should consider projects like PhantomJS or JSDom.

Cheerio is an HTML parser.
Cheerio can be used to select and manipulate DOM elements, but it is not a full browser.
Cheerio only has access to the original source DOM. If the DOM of a webpage is manipulated by JavaScript, Cheerio will not see that change.
Cheerio cannot be used to interact with DOM elements the way jQuery can, because it does not execute inside a browser window.
As of this moment, if you need to click elements or select against JS-rendered HTML, your best option is Puppeteer. This is likely to change, though.
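For a sense of what that looks like, here is a minimal Puppeteer sketch of the click-then-scrape flow. It assumes the selectors from your code ('.SummonerRefreshButton.Button.SemiRound.Blue' and '.winratio') are still present on the live page:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://na.op.gg/summoner/userName=TheJackal666', { waitUntil: 'networkidle2' });
    // Clicking works here because Puppeteer drives a real browser
    await page.click('.SummonerRefreshButton.Button.SemiRound.Blue');
    // Wait for the win-ratio element to be (re)rendered before reading it
    await page.waitForSelector('.winratio');
    const wr = await page.$eval('.winratio', el => el.textContent);
    console.log(wr.match(/\d/g).join(''));
    await browser.close();
})();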
HTH

Related

Extracting table value from a URL with Node JS

I am quite new to Node JS and Express, but I am trying to build a website which serves static files. After some research I've found out that NodeJS with Express can be quite useful for this.
So far I have managed to serve some static html files which are located on my server, but now I want to do something else:
I have a URL to an html page, and in that html page there is a table with some information.
I want to extract a couple of specific values from it, and 1) save them as JSON in a file, 2) write those values in an html page. I've tried to play with jQuery, but so far I've been unsuccessful.
This is what I have so far:
1. A node app running on port 8081, which I will access from anywhere through an NGINX reverse proxy (I already have nginx set up and it works).
2. I can get the URL and serve it as HTML when I use the proper URI.
3. I see that the table doesn't have an ID, only the "details" class associated with it. Also, I am only interested in getting these rows:
<div class='group'>
  <table class='details'>
    <tr>
      <th>Status:</th>
      <td>
        With editors
      </td>
    </tr>
From what I've seen so far, jQuery would work fine if the table has an ID.
This is my code in app.js
var express = require('express');
var app = express();
var request = require('request');
const path = require('path');

var content;

app.use('/', function(req, res, next) {
    var status = 'It works';
    console.log('This is very %s', status);
    //console.log(content);
    next();
});

request(
    {
        uri: 'https://authors.aps.org/Submissions/status?utf8=%E2%9C%93&accode=CH10674&author=Poenaru&commit=Submit'
    },
    function(error, response, body) {
        content = body;
    }
);

app.get('/', function(req, res) {
    console.log('Got a GET request for the homepage');
    res.sendFile(path.join(__dirname, '/', 'index.html'));
});

app.get('/url', function(req, res) {
    console.log('You requested table data!!!');
    // TODO: show only the values of that table instead of the whole HTML page
    res.send(content);
});

var server = app.listen(8081, function() {
    var host = server.address().address;
    var port = server.address().port;
    console.log('Node-App listening at http://%s:%s', host, port);
});
Basically, the HTML content of that URL is saved into the content variable, and now I would like to keep only the table from it, and output only that saved part to the new html page.
Any ideas?
Thank you in advance :)
OK, so I've come across this package called cheerio, which basically allows one to use jQuery on the server. Having the html code from that specific URL, I could search within that table for the elements I need. Cheerio is quite straightforward, and with this code I got the results I needed:
var cheerio = require('cheerio');

request(
    'https://authors.aps.org/Submissions/status?utf8=%E2%9C%93&accode=CH10674&author=Poenaru&commit=Submit',
    (error, res, html) => {
        if (!error && res.statusCode === 200) {
            const $ = cheerio.load(html);
            const details = $('.details');
            const articleInfo = details.find('th').eq(0);
            const articleStatus = details.find('th').next().eq(0);
            //console.log(details.html());
            console.log(articleInfo.html());
            console.log(articleStatus.html());
        }
    }
);
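If you then want the /url route to send back only that table instead of the whole page, a small sketch along these lines should work (assuming content holds the fetched HTML, as in the app.js above):

app.get('/url', function(req, res) {
    console.log('You requested table data!!!');
    const $ = cheerio.load(content);
    // $.html(...) returns the outer HTML of the selected element
    res.send($.html($('table.details')));
});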
Thank you #O.Jones and #avcS for guiding me to jsdom and node-html-parser. I will definitely play with those in the near future :)
Cheers!
Your task is called "scraping." You want to scrape a particular chunk of data from some web page you did not create and then return it as part of your own web page.
You have noticed a problem with scraping: often the page you're scraping does not cleanly identify the data you want with a distinctive id. So you must use some guesswork to find it. #AvcS pointed out a server-side npm library called jsdom you can use for this purpose.
Notice this: even though browsers and Node.js both use JavaScript, they are still very different environments. Browser JavaScript has lots of built-in APIs for accessing web pages' Document Object Models (DOMs). But Node.js doesn't have those APIs. If you try to load jQuery into Node.js, it won't work, because jQuery depends on the browser DOM APIs. The jsdom package gives you some of those DOM APIs.
Once you have fetched that web page to scrape, code like this may help you get what you need.
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
...
const page = new JSDOM(page_in_text_string).window;
Then you can use a subset of the DOM APIs to find the elements you want in your page. In your example, you are looking for a table matching the selector div.group table.details, and then for its enclosing div.group element.
You can do this sort of thing to find what you need:
const desiredTbl = page.document.querySelector("div.group table.details");
const desiredDiv = desiredTbl ? desiredTbl.parentNode : null;
const result = desiredDiv ? desiredDiv.textContent : null;
Finally do this:
page.close();
Your question says you want certain rows from your document. HTML documents don't have rows; they have elements. If you want to extract just parts of elements (part of the table rather than the whole thing), you'll need to write some text-string code. Just sayin'
Also, I have not debugged any of this. That is left to you.
There's a smaller and faster library to do similar things called node-html-parser. If performance is important you may want that one instead.
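For completeness, a tiny node-html-parser sketch of the same lookup, under the same assumption that the fragment is the div.group / table.details markup shown in the question:

const { parse } = require('node-html-parser');

const root = parse(page_in_text_string); // same fetched HTML string as above
const tbl = root.querySelector('div.group table.details');
console.log(tbl ? tbl.text : 'not found');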

Access a web page from Node script on desktop

I know that this could be a very stupid question, but, since I'm totally new to Javascript, I'm not sure about how to do this. I want to write a script and run it through node on my laptop, and, in this script, I want to interact with a web page in order to use functions like document.getElementById and stuff like that.
In Python one could do this by using something like Beautiful Soup or requests, but how do you do this in Javascript?
I have implemented a crawler using cheerio and request-promise as follows:
https://www.npmjs.com/package/cheerio
let request = require('request-promise');
let cheerio = require('cheerio');

request = request.defaults({
    transform: function (body) {
        return cheerio.load(body);
    }
});

// ... omitted

request({uri: 'http://example.org'})
    .then($ => {
        const element = $('.element-with-class');
    });
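For the document.getElementById part of the question: with cheerio that's just an id selector. A quick sketch building on the setup above (the id #someId is hypothetical):

request({uri: 'http://example.org'})
    .then($ => {
        // roughly document.getElementById('someId').textContent
        console.log($('#someId').text());
    })
    .catch(err => console.error(err));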

Node.js: requesting a page and allowing the page to build before scraping

I've seen some answers to this that refer the asker to other libraries (like phantom.js), but I'm here wondering if it is at all possible to do this in just node.js?
Considering my code below: it requests a webpage using request, then uses cheerio to explore the DOM and scrape the page for data. It works flawlessly, and if everything had gone as planned, I believe it would have outputted a file as I imagined in my head.
The problem is that the page I am requesting builds the table I'm looking at asynchronously, using either AJAX or JSONP; I'm not entirely sure how .jsp pages work.
So here I am trying to find a way to "wait" for this data to load before I scrape the data for my new file.
var cheerio = require('cheerio'),
    request = require('request'),
    fs = require('fs');

// Go to the page in question
request({
    method: 'GET',
    url: 'http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp'
}, function(err, response, body) {
    if (err) return console.error(err);
    // Tell Cheerio to load the HTML
    $ = cheerio.load(body);
    // Create an empty object to write to the file later
    var toSort = {};
    // Iterate over the DOM and fill the toSort object
    $('#emb table td.list_right').each(function() {
        var row = $(this).parent();
        toSort[$(this).text()] = {
            [$("#lastdate").text()]: $(row).find(".idx1").html(),
            [$("#currdate").text()]: $(row).find(".idx2").html()
        };
    });
    // Write/overwrite a new file
    var stream = fs.createWriteStream("/tmp/shipping.txt");
    var toWrite = "";
    stream.once('open', function(fd) {
        toWrite += "{\r\n";
        for (i in toSort) {
            toWrite += "\t" + i + ": { \r\n";
            for (j in toSort[i]) {
                toWrite += "\t\t" + j + ":" + toSort[i][j] + ",\r\n";
            }
            toWrite += "\t" + "}, \r\n";
        }
        toWrite += "}";
        stream.write(toWrite);
        stream.end();
    });
});
The expected result is a text file with information formatted like a JSON object.
It should look like several instances of this:
"QINHUANGDAO - GUANGZHOU (50,000-60,000DWT)": {
 "2016-09-29": 26.7,
"2016-09-30": 26.8,
},
But since the name is the only thing that doesn't load async (the dates and values are async), I get a messed-up object.
I actually tried just setting a setTimeout in various places in the code. The script will only be touched by developers who can afford to run it several times if it fails a few times. So while not ideal, even a setTimeout (up to maybe 5 seconds) would be good enough.
It turns out the setTimeouts don't work. I suspect that once I request the page, I'm stuck with a snapshot of the page "as is" when I receive it, and I'm in fact not looking at a live thing whose dynamic content I can wait for.
I've wondered about intercepting the packets as they come, but I don't understand HTTP well enough to know where to start.
The setTimeout will not make any difference even if you increase it to an hour. The problem here is that you are making a request against this url:
http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp
and their server returns the HTML, and in this HTML there are the JS and CSS imports. That is the end of your case: you just have the HTML, and that's it. A browser, by contrast, knows how to parse the HTML document, understand the JavaScript it references, and execute it, and this is exactly your problem: your program cannot run the scripts the HTML points to. You need to find or write a scraper that is able to run JavaScript. I just found this similar issue on Stack Overflow:
Web-scraping JavaScript page with Python
The guy there suggests https://github.com/niklasb/dryscrape and it seems that this tool is able to run JavaScript. It is written in Python though.
You are trying to scrape the original page that doesn't include the data you need.
When the page is loaded, browser evaluates JS code it includes, and this code knows where and how to get the data.
The first option is to evaluate the same code, as PhantomJS does.
The other (and you seem to be interested in it) is to investigate the page's network activity and to understand what additional requests you should perform to get the data you need.
In your case, these are:
http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast?SpecifiedDate=&jc=jsonp1475577615267&_=1475577619626
and
http://index.chineseshipping.com.cn/servlet/allGetCurrentComposites?date=Tue%20Oct%2004%202016%2013:40:20%20GMT+0300%20(MSK)&jc=jsonp1475577615268&_=1475577620325
In both requests:
_ is a cache-busting parameter to prevent caching.
jc is the name of a JS wrapper function which should be invoked with the result (https://en.wikipedia.org/wiki/JSONP)
So, by scraping the table template at http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp and performing the two additional requests, you will be able to combine them into the same data structure you see in the browser.
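To illustrate, a hedged sketch of that second approach: call the endpoints directly and unwrap the JSONP padding. The parameter names below are taken from the URLs above; whether the server needs anything else is an assumption:

const request = require('request');

function fetchJsonp(url, cb) {
    request(url, (err, res, body) => {
        if (err) return cb(err);
        // A JSONP body looks like jc({...}); strip the function wrapper and parse the JSON inside
        const json = body.slice(body.indexOf('(') + 1, body.lastIndexOf(')'));
        cb(null, JSON.parse(json));
    });
}

fetchJsonp('http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast?SpecifiedDate=&jc=cb&_=' + Date.now(), (err, data) => {
    if (err) return console.error(err);
    console.log(data); // merge with the table template scraped from the .jsp page
});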

How do I use jQuery in Windows Script Host?

I'm working on some code that needs to parse numerous files that contain fragments of HTML. It seems that jQuery would be very useful for this, but when I try to load jQuery into something like WScript or CScript, it throws an error because of jQuery's many references to the window object.
What practical way is there to use jQuery in code that runs without a browser?
Update: In response to the comments, I have successfully written JavaScript code to read the contents of files using new ActiveXObject('Scripting.FileSystemObject');. I know that ActiveX is evil, but this is just an internal project to get some data out of some files that contain HTML fragments and into a proper database.
Another Update: My code so far looks about like this:
var fileIo, here;
fileIo = new ActiveXObject('Scripting.FileSystemObject');
here = unescape(fileIo.GetParentFolderName(WScript.ScriptFullName) + "\\");
(function() {
var files, thisFile, thisFileName, thisFileText;
for (files = new Enumerator(fileIo.GetFolder(here).files); !files.atEnd(); files.moveNext()) {
thisFileName = files.item().Name;
thisFile = fileIo.OpenTextFile(here + thisFileName);
thisFileText = thisFile.ReadAll();
// I want to do something like this:
s = $(thisFileText).find('input#txtFoo').val();
}
})();
Update: I posted this question on the jQuery forums as well: http://forum.jquery.com/topic/how-to-use-jquery-without-a-browser#14737000003719577
Following along with your code, you could create an instance of IE using Windows Script Host, load your HTML file into the instance, append jQuery dynamically to the loaded page, then script against that.
This works in IE8 on XP, but I'm aware of some security issues in Windows 7/IE9. If you run into problems, you could try lowering your security settings.
var fileIo, here, ie;
fileIo = new ActiveXObject('Scripting.FileSystemObject');
here = unescape(fileIo.GetParentFolderName(WScript.ScriptFullName) + "\\");
ie = new ActiveXObject("InternetExplorer.Application");
ie.visible = true;

function loadDoc(src) {
    var head, script;
    ie.Navigate(src);
    while (ie.busy) {
        WScript.sleep(100);
    }
    head = ie.document.getElementsByTagName("head")[0];
    script = ie.document.createElement('script');
    script.src = "http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js";
    head.appendChild(script);
    return ie.document.parentWindow;
}

(function() {
    var files, thisFile, win;
    for (files = new Enumerator(fileIo.GetFolder(here).files); !files.atEnd(); files.moveNext()) {
        thisFile = files.item();
        if (fileIo.GetExtensionName(thisFile) == "htm") {
            win = loadDoc(thisFile);
            // your jQuery reference = win.$
            WScript.echo(thisFile + ": " + win.$('input#txtFoo').val());
        }
    }
})();
This is pretty easy to do in Node.js with the cheerio package. You can read in arbitrary HTML from whatever source you want, parse it with cheerio and then access the parsed elements using jQuery style selectors.
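For example, a minimal sketch of that approach, reading a local HTML fragment and reusing the selector from the question (the file name fragment.htm is hypothetical):

const fs = require('fs');
const cheerio = require('cheerio');

// Read the HTML fragment from disk and parse it with cheerio
const html = fs.readFileSync('fragment.htm', 'utf8');
const $ = cheerio.load(html);
console.log($('input#txtFoo').val());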

How can I list websites on IIS7, from script, without using IIS6 compat pack (WMI veneer)

On IIS6, I can use WMI to list available websites, like this:
var iis = GetObject("winmgmts://localhost/root/MicrosoftIISv2");
var query = "SELECT * FROM IIsWebServerSetting";

// get the list of virtual servers
var results = iis.ExecQuery(query);
for (var e = new Enumerator(results); !e.atEnd(); e.moveNext()) {
    var site = e.item();
    // site.Name                    // W3SVC/1, W3SVC/12378398, etc
    // site.Name.substr(6)          // 1, 12378398, etc
    // site.ServerComment           // "Default Web Site", "Site2", etc
    // site.ServerBindings(0).Port  // 80, 8080, etc
}
I know I can run this script on IIS7, if I have previously installed the IIS6 Compatibility Pack.
Is it possible to get the list of WebSites without requiring the compatibility pack as a pre-requisite?
I know I can run AppCmd to do this from the command line:
\Windows\system32\inetsrv\appcmd list sites
But... can I run that from a custom action in an MSI?
And... if not, how can I do the equivalent thing (list websites on IIS7) from javascript?
EDIT
Here's how I tried running the command from within Javascript.
function GetWebSites_IIS7() {
    var ParseOneLine = function(oneLine) {
        // ...a bunch of regex parsing here....
    };

    LogMessage("GetWebSites_IIS7() ENTER");

    var shell = new ActiveXObject("WScript.Shell");
    var windir = shell.Environment("system")("windir");
    // aka Session.Property("%WINDIR%")
    var appcmd = windir + "\\system32\\inetsrv\\appcmd.exe list sites";
    var oExec = shell.Exec(appcmd);
    var sites = [];
    while (!oExec.StdOut.AtEndOfStream) {
        var oneLine = oExec.StdOut.ReadLine();
        var line = ParseOneLine(oneLine);
        LogMessage("  site: " + line.name);
        sites.push(line);
    }
    return sites;
}
This works, but it briefly pops up a visible console window, which then disappears. It doesn't look very polished. I think I can avoid the console window by using shell.Run() instead of shell.Exec(), but shell.Run() doesn't give access to stdout, so I would have to redirect the output to a temporary file, then read the output. I haven't tried that yet. That may introduce some security issues; I'll have to see.
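Untested, but the shell.Run() variant I have in mind is sketched below: run appcmd with a hidden window, redirect stdout to a temp file, then read the file back. The temp-file handling is an assumption:

var shell = new ActiveXObject("WScript.Shell");
var fso = new ActiveXObject("Scripting.FileSystemObject");
var tmp = fso.BuildPath(fso.GetSpecialFolder(2).Path, fso.GetTempName()); // 2 = TemporaryFolder
var cmd = "%comspec% /c \"%windir%\\system32\\inetsrv\\appcmd.exe list sites > " + tmp + "\"";
shell.Run(cmd, 0, true); // window style 0 = hidden; true = wait for exit
var ts = fso.OpenTextFile(tmp, 1); // 1 = ForReading
while (!ts.AtEndOfStream) {
    LogMessage("  site: " + ts.ReadLine());
}
ts.Close();
fso.DeleteFile(tmp);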
Related:
Where and how should my CustomAction create and read a temporary file?
Yes, you can run appcmd from a custom action the same way you run any custom action that launches an exe. First off, you should author DirectorySearch/FileSearch elements to find the full path to the executable. Next, add a custom action with the ExeCommand attribute. You're probably trying to get feedback from the user, so leave it immediate. Also, think about using QuietExec in order not to show a console window to your users.
By the way, if my guess is correct, you're trying to do something like this. Hope this helps.
