I am currently writing a program that collects information from a sports website (it contains the history of some basketball matches). The problem is that the website uses AngularJS for dynamic HTML binding. Consequently, the HTML source code contains lots of variables.
I need to find out the values of the variables in order to make my program work as I want. Is there any library or framework that could help me?
Edit: I am not limited by anything, but I prefer a web app (MEAN, JS frameworks with node-webkit). If it can't be done, I can also code it in C++ or Java (or extend it further to Android with NDK or SDK)
Disclaimer: This is not grey-hat stuff. I just need to do some web-scraping.
PhantomJS is a headless browser. It will allow you to use JavaScript to get the information you want.
Details:
It browses to the page you want, executes the JavaScript like any browser, and has access to the page as if it were displayed to a normal user in a normal browser. Using JavaScript DOM traversal, you can get the information you need. This is almost the same as automating the task of opening a console in a browser and executing JavaScript that gets the information from the page.
While the example below is really simple, PhantomJS can do much more than just get the page results: it can click buttons, navigate to other pages, extract only the relevant information, render the page as an image, and so on. Do not hesitate to refer to its Quick Start documentation to learn more about it.
Example script returning the complete HTML page after waiting 10 seconds for AngularJS to finish rendering the page:
Command line usage: phantomjs-1.9.1 this_script.js <url>
this_script.js (PhantomJS 2.0 may have different syntax in some cases):
var url = phantom.args[0];
// Returns the rendered DOM of the page as an HTML string
function getDocumentElementAsHTML(page) {
    return page.evaluate(function () {
        return document.documentElement.innerHTML;
    });
}
var page = new WebPage();
page.settings.userAgent = "PhantomJS";
//page.onConsoleMessage = function (msg) { console.log(msg); }
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
        phantom.exit();
    } else {
        // Give AngularJS 10 seconds to finish rendering
        setTimeout(function () {
            console.log(getDocumentElementAsHTML(page));
            phantom.exit();
        }, 10000);
    }
});
PS: Waiting 10 seconds is not always a great solution. Instead, I used to periodically test for the existence of the elements I wanted to read, to be sure the JavaScript had finished loading.
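The periodic check mentioned in the PS could look roughly like this. It is only a sketch: `waitFor` and its parameter names are made up for illustration and are not part of the PhantomJS API.

```javascript
// Polls `condition` every `intervalMs` until it returns true, then calls
// `onReady`; calls `onTimeout` instead if `timeoutMs` elapses first.
// All names here are illustrative assumptions, not PhantomJS API.
function waitFor(condition, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    if (condition()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs);
}
```

In the script above, the condition would typically wrap a `page.evaluate` call that returns whether some element exists (for example `document.querySelector('.score') !== null`, where `.score` is a hypothetical selector), and `onReady` would print the HTML and call `phantom.exit()`.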
Source: grey-hat stuff I did in the past
I'd say you'd want to look at http://phantomjs.org/, http://www.slimerjs.org/, and/or http://casperjs.org/.
Phantom & Slimer give you API access to Webkit and Gecko respectively. Casper adds a more user friendly API over the top.
I want to create a custom profiler for Javascript as a Chrome DevTools Extension. To do so, I'd have to instrument all Javascript code of a website (parse to AST, inject hooks, generate new source). This should've been easily possible using chrome.devtools.inspectedWindow.reload() and its parameter preprocessorScript described here: https://developer.chrome.com/extensions/devtools_inspectedWindow.
Unfortunately, this feature has been removed (https://bugs.chromium.org/p/chromium/issues/detail?id=438626) because nobody was using it.
Do you know of any other way I could achieve the same thing with a Chrome Extension? Is there any other way I can replace an incoming Javascript source with a changed version? This question is very specific to Chrome Extensions (and maybe extensions to other browsers), I'm asking this as a last resort before going a different route (e.g. dedicated app).
Use the Chrome Debugging Protocol.
First, use DOMDebugger.setInstrumentationBreakpoint with eventName: "scriptFirstStatement" as a parameter to add a break-point to the first statement of each script.
Second, in the Debugger Domain, there is an event called scriptParsed. Listen to it and if called, use Debugger.setScriptSource to change the source.
Finally, call Debugger.resume each time after you edited a source file with setScriptSource.
Example in semi-pseudo-code:
// Prevent code being executed
cdp.sendCommand("DOMDebugger.setInstrumentationBreakpoint", {
eventName: "scriptFirstStatement"
});
// Enable Debugger domain to receive its events
cdp.sendCommand("Debugger.enable");
cdp.addListener("message", (event, method, params) => {
// Script is ready to be edited
if (method === "Debugger.scriptParsed") {
cdp.sendCommand("Debugger.setScriptSource", {
scriptId: params.scriptId,
scriptSource: `console.log("edited script ${params.url}");`
}, (err, msg) => {
// After editing, resume code execution.
cdp.sendCommand("Debugger.resume");
});
}
});
The implementation above is not ideal. It should probably listen to the breakpoint event, get to the script using the associated event data, edit the script and then resume. Listening to scriptParsed and resuming the debugger are two things that shouldn't be coupled, because that could create problems. It makes for a simpler example, though.
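A rough sketch of that alternative, reacting to Debugger.paused instead of scriptParsed, might look like the following. It assumes the same hypothetical `cdp` object as the example above (with `sendCommand(method, params, callback)` and a "message" listener); `instrumentOnPause` and `transform` are illustrative names. Whether Chrome accepts an edit while paused at the first statement may depend on the Chrome version, so treat this as untested.

```javascript
// `cdp` is assumed to expose sendCommand(method, params, callback(err, result))
// and addListener("message", (event, method, params) => ...), as in the
// example above. `transform` is a hypothetical source-to-source function.
function instrumentOnPause(cdp, transform) {
  cdp.sendCommand("Debugger.enable");
  cdp.sendCommand("DOMDebugger.setInstrumentationBreakpoint", {
    eventName: "scriptFirstStatement"
  });
  cdp.addListener("message", (event, method, params) => {
    if (method !== "Debugger.paused") return;
    // The paused event carries the call stack; the top frame tells us
    // which script is about to run.
    const frame = params.callFrames && params.callFrames[0];
    if (!frame) {
      cdp.sendCommand("Debugger.resume");
      return;
    }
    const scriptId = frame.location.scriptId;
    cdp.sendCommand("Debugger.getScriptSource", { scriptId }, (err, result) => {
      if (err) {
        cdp.sendCommand("Debugger.resume");
        return;
      }
      cdp.sendCommand("Debugger.setScriptSource", {
        scriptId,
        scriptSource: transform(result.scriptSource)
      }, () => {
        // Resume whether or not the edit was accepted.
        cdp.sendCommand("Debugger.resume");
      });
    });
  });
}
```

Note that this sketch resumes on every pause, so it would also skip ordinary breakpoints; a fuller version would inspect the pause reason first.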
On HTTP you can use the chrome.webRequest API to redirect requests for JS code to data URLs containing the processed JavaScript code.
However, this won't work for inline script tags. It also won't work on HTTPS, since the data URLs are considered unsafe. And data URLs can't be longer than 2MB in Chrome, so you won't be able to redirect to large JS files.
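As a rough illustration of the redirect idea, here is a sketch for a Manifest V2 style background page with the "webRequest" and "webRequestBlocking" permissions. `toDataUrl` and `processSource` are hypothetical helpers, and fetching the original script with a synchronous XMLHttpRequest is just one simple assumption about how to get at the source.

```javascript
// Builds a data: URL holding the given JavaScript source.
function toDataUrl(source) {
  return "data:text/javascript;charset=utf-8," + encodeURIComponent(source);
}

// Hypothetical stand-in for the real instrumentation step.
function processSource(source) {
  return "/* instrumented */\n" + source;
}

// Only runs inside an extension; skipped elsewhere.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeRequest.addListener(
    function (details) {
      // Synchronous fetch so the listener can answer immediately.
      var xhr = new XMLHttpRequest();
      xhr.open("GET", details.url, false);
      xhr.send(null);
      return { redirectUrl: toDataUrl(processSource(xhr.responseText)) };
    },
    // Plain HTTP script requests only, per the limitations above.
    { urls: ["http://*/*"], types: ["script"] },
    ["blocking"]
  );
}
```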
If the exact order of execution of each script isn't important you could cancel the script requests and then later send a message with the script content to the page. This would make it work on HTTPS.
To address both issues you could redirect the HTML page itself to a data URL, in order to gain more control. That has a few negative consequences though:
1. Can't reload page because URL is fixed to data URL
2. Need to add or update <base> tag to make sure stylesheet/image URLs go to the correct URL
3. Breaks ajax requests that require cookies/authentication (not sure if this can be fixed)
4. No support for localStorage on data URLs
Not sure if this works: in order to fix #1 and #4 you could consider setting up an HTML page within your Chrome extension and then using that as the base page instead of a data URL.
Another idea that may or may not work: Use chrome.debugger to modify the source code.
I'm interested in the concept of injecting a bit of HTML into existing web pages to perform a service. The idea is to create an improved bookmarking system, but I digress; the specific implementation is unimportant. I'm quite new to web development and so I have no definite idea as to how to accomplish this, though I have noticed a couple of possibilities.
I found out I can right click > 'inspect element' and proceed to edit my browser's version of the HTML corresponding with the webpage I'm viewing. I assume that this means I can edit what I see and interact with. Could I possibly create a script that ran from a button on bookmarks bar that injected an Iframe which linked to a web service of my making? (And deleted itself after being used).
Could I possibly use a chrome extension to accomplish this? I have no experience with creating extensions and so I have no clue what they're capable of - though I wouldn't be against learning.
Which of these would be best? If they are even valid ideas. Or is there another way that I've yet to know of?
EDIT: The goal is to have a user click a button in the browser if they would like to save this page. They are then presented an interface visually independent of the rest of the page that allows them to categorize this webpage according to their interests. It would take the current link, add some information such as a comment, rating, etc. and add it to the user's data. This is meant as a sort of side-service to a website whose purpose would be to better organize and display the browsing information of the user.
Yes, you can absolutely do this. You're asking about Bookmarklets.
A bookmarklet is just a bookmark whose address is a piece of JavaScript instead of a web URL. They are very simple, yet they can do anything to a web page: they get full JavaScript access.
A bookmarklet can be engaged on any web page -- the user simply has to click the bookmark(let) to launch it on the current page.
Bookmark = "http://chasemoskal.com/"
Bookmarklet = "javascript:(function(){ alert('I can do anything!') })();"
That's all it is. You can create a bookmarklet link which can be clicked-and-dragged onto a bookmark bar like this:
<a href="javascript:(function(){ alert('I can do anything!') })();">Bookmarklet</a>
Bookmarklets can be limited in size, however, you can load an entire external script from the bookmarklet.
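Loading an external script from a bookmarklet could be done along these lines. This is a small sketch: `makeLoaderBookmarklet` is an illustrative helper, and the script URL you pass in would be whatever your own service hosts.

```javascript
// Returns a bookmarklet URL that injects an external script into the
// current page. `scriptUrl` is the address of the script to load.
function makeLoaderBookmarklet(scriptUrl) {
  var body =
    "var s=document.createElement('script');" +
    "s.src=" + JSON.stringify(scriptUrl) + ";" +
    "document.body.appendChild(s);";
  return "javascript:(function(){" + body + "})();";
}
```

The returned string goes into the href of the link that users drag to their bookmark bar. It contains double quotes, so it needs HTML-escaping if the attribute itself is double-quoted.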
You can get an effect similar to an <iframe>. Put simply, the steps are:
Create an XMLHttpRequest object and request a page through it.
Set the innerHTML of an element to the response text of that request, i.e. the HTML structure.
Let's assume you have an element with id="Result" in your HTML. The request goes like this:
var req = new XMLHttpRequest();
req.open('GET', 'http://example.com/mydocument.html', true);
req.onreadystatechange = function (aEvt) {
if (req.readyState == 4 && req.status == 200) {
Result.innerHTML = req.responseText;
}
};
req.send(null);
Here's an improved version in the form of a fiddle.
When you're done, you can delete that injected HTML by simply:
Result.innerHTML = '';
And then anything inside it will be gone.
However, you can't make requests to other servers because of the same-origin policy: the page has to be on the same domain, unless the other server explicitly allows cross-origin requests (CORS). Take a look at "Using XMLHttpRequest" on the MDN reference pages for more information.
I have an old html page that creates a script file and executes it using:
fsoObject = new ActiveXObject("Scripting.FileSystemObject")
wshObject = new ActiveXObject("WScript.Shell")
I am trying to modify it and make it usable also from other browsers. If you know the answer stop reading and please answer. If there is no quick answer, here is the description of my attempts. I was successful in doing the job, but only when the script is shorter than 2000 characters. I need help for scripts longer than 2000 characters.
The webpage is for internal use only, so it is easy for me to create a custom URL protocol on each computer that runs a VBScript file from a network drive.
I created my custom URL Protocol that starts a VBScript file like this:
Windows Registry Editor Version 5.00
[HKEY_CLASSES_ROOT\MyUrlProtocol]
"URL Protocol"=""
@="Url:MyUrlProtocol"
"UseOriginalUrlEncoding"=dword:00000001
[HKEY_CLASSES_ROOT\MyUrlProtocol\DefaultIcon]
@="C:\\Windows\\System32\\WScript.exe"
[HKEY_CLASSES_ROOT\MyUrlProtocol\shell]
[HKEY_CLASSES_ROOT\MyUrlProtocol\shell\open]
[HKEY_CLASSES_ROOT\MyUrlProtocol\shell\open\command]
@="C:\\Windows\\System32\\WScript.exe \"X:\\MyUrlProtocol.vbs\" \"%1\""
In MyUrlProtocol.vbs I have this:
MsgBox "The length of the link is " & Len(WScript.Arguments(0)) & " characters"
MsgBox "The content of the link is: " & WScript.Arguments(0)
When I click the test link labeled "click me", I see both messages, so everything works well (tested with Chrome and IE on Windows 7).
It also works when I execute document.getElementById("test").click()
I thought this could be the solution: I would pass the text of the script to the VBS static script, which would create the dynamic script and run it, but with this system I can't pass more than ~2000 characters.
So I tried to split the text of the script in chunks smaller than 2000 characters and simulate several clicks on the link, but only the first one works.
So I tried with xmlhttp.open("GET","MyUrlProtocol:test",false);, but Chrome says Cross origin requests are only supported for HTTP.
Is it possible to pass more than 2000 characters to a VBScript script via a custom URL protocol?
If not, is it possible to call several custom URL protocols in sequence?
If not, is there another way to create a script file and run it from Javascript?
EDIT 1
I found a solution, but in Chrome it only works intermittently, so I'm back to square one.
The code below in IE executes the script 4 times (correct), but in Chrome only the first execution runs.
If I change it to delay += 2000, then Chrome usually runs the script 2 times, but sometimes 1 and sometimes 3 or even 4 times.
If I change it to delay += 10000, then it usually runs the script 4 times, but sometimes misses one.
The function is always executed 4 times, both in Chrome and IE. What is weird is that the sr.click() sometimes does nothing and the function execution continues.
<HTML>
<HEAD>
<script>
var delay;
function runScript(text) {
setTimeout(function(){runScript2(text)}, delay);
delay += 100;
}
function runScript2(text) {
var sr = document.getElementById('scriptRunner');
sr.href='intelliclad:'+text;
sr.click();
}
function test(){
delay = 0;
runScript("uno");
runScript("due");
runScript("tre");
runScript("quattro");
}
</script>
</HEAD>
<BODY>
<input type="button" value="Run test" onclick="test()">
<a id="scriptRunner" href="#">scriptRunner</a>
</BODY>
</HTML>
EDIT 2
I tried with Luke's suggestion of setting the next timeout from inside the call back but nothing changed (IE works always, Chrome whenever it likes).
Here is the new code:
var scripts;
var delay = 2000;
function runScript() {
var sr = document.getElementById('scriptRunner');
sr.href = 'intelliclad:' + scripts.shift();
sr.click();
if(scripts.length)
setTimeout(function() {runScript()}, delay);
}
function test(){
scripts = ["uno", "due", "tre", "quattro"];
runScript();
}
Some background: The page asks for the shape of a panel, which can be just a few parameters [nfaces=1, shape1='square', width1=100] or hundreds of parameters for panels with many faces, many slots, many fasteners, etc. After asking for all the parameters a script for our internal 3D CAD (which can be larger than 20KB) is generated and the CAD is started and asked to execute the script.
I would like to do all on the client side, because the page is served by a Domino web server, which can't even dream of managing such a complex script.
I didn't read your whole post...have an answer:
I too wish that custom URL protocols could handle long URLs. They simply do not. IE is even worse, as some OSs only accept 800 chars.
So, here's the solution:
For long URLs, pass only a single-use token. The VBScript uses the token and does a URL GET to your web server to get all of the data.
This is the only way I've been able to successfully pass lots of data around. If you ever find a clearer solution, please remember to post it here.
Update:
Note that this is the best way I have found to deal with the url protocol limitations. I too wish this was not necessary. This does work and works well.
You mentioned Dominos, so possibly you need something in a POS environment... I create a web based POS system, so we could face a lot of the same issues.
Suppose you want a custom url to print a pdf to the default printer without the annoying popup window. We need to do this thousands of times a day...
1. When building the web page, add the print button which, when pressed, calls the custom URL: myproto://printpdf?id=12345&token=onetimetoken
2. This will execute your VBScript on the local desktop.
3. In your VBScript, parse the arguments and react. In this case, your command is printpdf, the id is 12345, and you have a one-time token key.
4. Have the VBScript do an HTTPS GET to: https://mydomain.com/APIs/printpdf.whatever?id=12345&key=onetimetoken
5. Check the credentials based on the IP address and token. If all aligns, return the contents of the PDF (you may want to convert the PDF to a byte array string).
6. Now the VBScript has the PDF. Assemble it, write it to a temp folder, then execute a silent PDF print command (I use Sumatra PDF http://blog.kowalczyk.info/software/sumatrapdf/free-pdf-reader.html).
mission accomplished.
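The "parse the arguments" step is the only fiddly part. The real handler in this workflow is VBScript, but the logic is the same in any language; here it is sketched in JavaScript to match the rest of the page. `parseProtocolUrl` is an illustrative name, and the sample URL is only an example.

```javascript
// Splits a custom-protocol URL such as
// "myproto://printpdf?id=12345&token=abc" into a command and parameters.
// Returns null when the string does not look like such a URL.
function parseProtocolUrl(url) {
  var match = /^[a-z][a-z0-9+.-]*:\/\/([^?\/]*)\/?(?:\?(.*))?$/i.exec(url);
  if (!match) return null;
  var params = {};
  (match[2] || "").split("&").forEach(function (pair) {
    if (!pair) return;
    var i = pair.indexOf("=");
    var key = i < 0 ? pair : pair.slice(0, i);
    var value = i < 0 ? "" : pair.slice(i + 1);
    params[decodeURIComponent(key)] = decodeURIComponent(value);
  });
  return { command: match[1], params: params };
}
```

The optional slash in the pattern is there because some browsers append one after the command before handing the URL to the protocol handler.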
Since I don't know what you want to do in your custom URL or your general workflow, I can only describe how I've solved the URL length issue.
Using this technique, the possibilities are limitless. You have full control over the local computer running the web browser, and you have a one-time-use token which grants access to a web API which can return any sort of information you program.
You could write a custom url protocol to turn on the pizza oven if you wanted :)
If you are not able to create the server side code which is listening for vbscript's get request then this would not work.
You might be able to pass the data from the browser to the vbscript using the clipboard.
Update 2:
Since in this case the data is on the client (one single form can define hundreds of parameters), the server API doesn't know what to answer to the vb script request. So the workflow described above must be preceded by these two steps:
The onkeypress event executes a submit to send the current parameters to the server
The server replies with the refreshed form, adding to the body onload a call to a function which uses another submit to call the custom URL, as described in point 1 of the list above.
Update 3:
stenci, what you've added (in Update 2) will work. I would do it like this:
user presses a button saying I'm done editing the form
ajax post the form to the server
the server saves the data and attaches unique key to the datastore
the server returns the key to ajax callback function
now the client has a single use key and invokes the url schema passing the key
vbscript does an https get to the server and passes the key
server returns the data to the vbscript
It is a bit long winded. Once coded it will work like a charm.
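The server side of that key exchange can be quite small. Below is a minimal sketch for a Node.js server, purely as an illustration (the thread itself uses a Domino server, so the real implementation would differ). `createTokenStore`, `issue` and `redeem` are made-up names, and the in-memory Map is an assumption; a real server would persist and expire entries.

```javascript
var crypto = require("crypto");

// In-memory store; a real server would persist entries and expire them.
function createTokenStore() {
  var entries = new Map();
  return {
    // Saves the form data and returns a fresh single-use key.
    issue: function (data) {
      var key = crypto.randomBytes(16).toString("hex");
      entries.set(key, data);
      return key;
    },
    // Returns the data once, then forgets the key; undefined if unknown.
    redeem: function (key) {
      var data = entries.get(key);
      entries.delete(key);
      return data;
    }
  };
}
```

The ajax POST handler would call `issue` with the form data and return the key; the endpoint the VBScript calls would pass that key to `redeem`.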
The only other alternative I can see is to copy the form data to the clipboard using something like: http://zeroclipboard.org/
and then in vbscript see if you can read the clipboard like: Use clipboard from VBScript
How about creating an iFrame for each instance?
Something like this:
function runScript(text) {
var iframe = document.createElement('iframe');
iframe.src = 'intelliclad:'+text;
document.body.appendChild(iframe);
}
function test(){
runScript("uno");
runScript("due");
runScript("tre");
runScript("quattro");
}
You can then use css styling to make these iframes transparent / hidden.
You might not like this answer, but I've used this method in the past and it works.
Instead of relying on ActiveX, consider using a Java Applet, and JNI.
Basically, you have to make sure the native scripts you want to run are available on your client machine, along with a JNI wrapper.
The applet will have to be at least self signed, for the browser to allow it to load and access a native library. Once the JNI libraries are loaded, you can easily call methods from the page / applet.
As a consequence of using Java, you could possibly use the same applet for windows as well as linux clients, provided of course you have native libraries present on the respective clients.
This series of articles talks about precisely your problem : http://www.javaworld.com/article/2076775/java-security/escape-the-sandbox--access-native-methods-from-an-applet.html
P.S the article is really old, but the concept remains unchanged.
I want to fetch some information from a website using the phantomjs/casperjs libraries, as I'm looking for the HTML result after all javascripts on a site are run. I worked it out with the following code from this answer:
var page = require('webpage').create();
page.open('http://www.scorespro.com/basketball/', function (status) {
if (status !== 'success') {
console.log('Unable to access network');
} else {
var p = page.evaluate(function () {
return document.getElementsByTagName('html')[0].innerHTML
});
console.log(p);
}
phantom.exit();
});
And I also worked it out to get phantomjs/casperjs running on heroku by following these instructions, so when I now say heroku run phantomjs theScriptAbove.js on OS X terminal I get the HTML of the given basketball scores website as I was expecting.
But what I actually want is to get the html text from within a Mac desktop application, this is the reason why I was looking for a way to run the scripts on a web server like heroku. So, my question is:
Is there any way to get the HTML text (that my script prints as a result) remotely within my Objective-C desktop application?
Or asked in another way: how can I run and get the answer of my script remotely by using POST/GET?
p.s.
I can work with Rails applications, so if there's a way to do this using Rails, I just need the basic idea of what I have to do and how to get the phantomjs script to communicate with Rails. But I think there might be an even simpler solution ...
If I understand you correctly you're talking about interprocess communication - so that Phantom's result (the page HTML) can somehow be retrieved by the app.
Per the PhantomJS docs, there are a couple of options:
write the HTML to a file and pick up the file in your app
run the webserver module and do a GET to phantom, and have the phantom script respond with the page HTML
see http://phantomjs.org/api/webserver/
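The second option could look roughly like the sketch below, which uses the PhantomJS webserver and webpage modules. `sendHtml` is an illustrative helper, the target URL is the one from the question, and reading the port from the PORT environment variable with a fallback of 8080 is an assumption about the hosting setup. I haven't run this on Heroku.

```javascript
// Sends an HTML string back through a PhantomJS webserver response object.
function sendHtml(response, html) {
  response.statusCode = 200;
  response.headers = { 'Content-Type': 'text/html; charset=utf-8' };
  response.write(html);
  response.close();
}

// Everything below only runs inside PhantomJS.
if (typeof phantom !== 'undefined') {
  var server = require('webserver').create();
  var port = require('system').env.PORT || 8080; // assumption: port from env
  var target = 'http://www.scorespro.com/basketball/';
  var listening = server.listen(port, function (request, response) {
    var page = require('webpage').create();
    page.open(target, function (status) {
      if (status !== 'success') {
        response.statusCode = 502;
        response.write('Unable to access network');
        response.close();
      } else {
        sendHtml(response, page.evaluate(function () {
          return document.documentElement.outerHTML;
        }));
      }
      page.close();
    });
  });
  if (!listening) {
    console.log('Could not listen on port ' + port);
    phantom.exit(1);
  }
}
```

The desktop app then only needs a plain HTTP GET to that server to receive the rendered HTML.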
I've been travelling and developing for the past few weeks.
The site I'm developing was running well.
Then, the other day, I connected to a network and the page 'looked' fine, but it turned out the JavaScript wasn't running. I checked Firebug and there were no errors; I had suspected that maybe a script didn't load (I'm using the Google API for jQuery and jQuery UI, as well as loading the Google Maps API and fbconnect).
I would suspect that if the issue was with one of these pages not loading I would get an error, and yet there was nothing.
Thinking maybe I didn't connect properly or something, I reconnected to the network and even restarted my computer, and also tried to run the local version. I got nothing.
The local version not running also hinted to me that it was the loading of an external javascript which caused the problem.
I let it pass as something strange with that one network. Unfortunately now I'm 100s of miles away.
Today my brother sent me an e-mail that the network he was on at the airport wouldn't load my page. Same issue. Everything is laid out properly, and part of the layout is set in Javascript, so clearly javascript is running.
He too got no errors. Of course, he got on his plane, and now he is no longer at the airport. Now the site works on his computer (and I haven't changed anything).
How on earth would you go about figuring out what happened in this situation? That is two of maybe 12 or so networks. But I have no idea how i would find a network that doesn't work (and living in a small town, it could be difficult for me to find a network that doesn't work).
Any ideas?
The site is still in Dev, so I'd rather not post a link just yet (but could in a few days).
What I can see not working is the javascript functions which are called on load, and on click. So i do think it is a javascript issue, but no errors.
This wouldn't be as HUGE an issue if I could find and sit on one of these networks, but I can't. So what would you do?
EDIT ----------------------------------------------------------
The first function(s) (they're linked) that don't get called are below.
I've cut the code off at the .ajax call, as the call wasn't being made.
function getResultsFromForm(){
jQuery('form#filterList input.button').hide();
var searchAddress=jQuery('form#filterList input#searchTxt').val();
if(searchAddress=='' || searchAddress=='<?php echo $searchLocation; ?>'){
mapShow(20, -40, 0, 'areaMap', 2);
jQuery('form#filterList input.button').show();
return;
}
if (GBrowserIsCompatible()) {
var geo = new GClientGeocoder();
geo.setBaseCountryCode(cl.address.country);
geo.getLocations(searchAddress, function (result)
{
if(!result.Placemark && searchAddress!='<?php echo $searchLocation; ?>'){
jQuery('span#addressNotFound').text('<?php echo $addressNotFound; ?>').slideDown('slow');
jQuery('form#filterList input.button').show();
} else {
jQuery('span#addressNotFound').slideUp('slow').empty();
jQuery('span#headerLocal').text(searchAddress);
var date = new Date();
date.setTime(date.getTime() + (8 * 24 * 60 * 60 * 1000));
jQuery.cookie('address', searchAddress, { expires: date});
var accuracy= result.Placemark[0].AddressDetails.Accuracy;
var lat = result.Placemark[0].Point.coordinates[1];
var long = result.Placemark[0].Point.coordinates[0];
lat=parseFloat(lat);
long=parseFloat(long);
var getTab=jQuery('div#tabs div#active').attr('class');
jQuery('div#tabs').show();
loadForecast(lat, long, getTab, 'true', 0);
var zoom=zoomLevel();
mapShow(lat, long, accuracy, 'areaMap', zoom );
}
});
}
}
function zoomLevel(){
var zoomarray= new Array();
zoomarray=jQuery('span.viewDist').attr('id');
zoomarray=zoomarray.split("-");
var zoom=zoomarray[1];
if(zoom==''){
zoom=5;
}
zoom=parseFloat(zoom);
return(zoom);
}
function loadForecast(lat, long, type, loadForecast, page){
jQuery('div#holdForecast').empty();
var date = new Date();
var d = date.getDate();
var day = (d < 10) ? '0' + d : d;
var m = date.getMonth() + 1;
var month = (m < 10) ? '0' + m : m;
var year='2009';
toDate=year+'-'+month+'-'+day;
var genre=jQuery('span.genreblock span#updateGenre').html();
var numDays='';
var numResults='';
var range=jQuery('span.viewDist').attr('id');
var dateRange = jQuery('.updateDate').attr('id');
jQuery('div#holdShows ul.showList').html('<li class="show"><div class="showData"><center><img src="../hwImages/loading.gif"/></center></div></li>');
jQuery('div#holdShows ul.'+type+'List').livequery(function(){
jQuery.ajax({
type: "GET",
url: "processes/formatShows.php",
data: "output=&genre="+genre+"&numResults="+numResults+"&date="+toDate+"&dateRange="+dateRange+"&range="+range+"&lat="+lat+"&long="+long+'&page='+page,
success: function(response){
EDIT 2 -----------------------------------------------------------------------------
Please keep in mind that the problem is not that I can't load the site. The site works fine on most connections, but there are times when the site doesn't work, no errors are thrown, and nothing changes. My brother couldn't run it earlier today while I had no problems, so it was something to do with his location/network. HOWEVER, the page loads, he had a connection, and it was his first time visiting the site, so nothing could have been cached. Same with when I had the issue a few days before: I didn't change anything, and when I got to a different network everything worked fine.
Two things: first -- get the javascript local to your site when developing. Loading it from elsewhere to take advantage of caching is an optimization that I'd leave to the end. I'd also only load it from highly available remote sites, like Google, to minimize problems. Second, make your site at least minimally usable without javascript enabled. Use form postbacks that get replaced with Ajax functionality from javascript that runs when the page is loaded, for example. You might not be able to get everything, but I've found that I can make most things work without javascript in at least a workable, if not elegant fashion.
I realize this doesn't solve your immediate problem, but I think it would help your site to remain available in the face of situations like this.
Turns out the problem with this was in assuming that google map could find any lat/long within north america reliably via ip address.
I added an if(!google.loader.ClientLocation) check for the cases where Google cannot find the location via IP.
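For anyone hitting the same thing, the guard can be as small as this sketch. `initialView` and `DEFAULT_VIEW` are illustrative names; the default coordinates come from the fallback call mapShow(20, -40, 0, 'areaMap', 2) in the question, and the zoom of 5 is an assumption borrowed from zoomLevel().

```javascript
// Default view taken from the question's own fallback call
// mapShow(20, -40, 0, 'areaMap', 2).
var DEFAULT_VIEW = { lat: 20, lng: -40, zoom: 2 };

// `clientLocation` is google.loader.ClientLocation, which is null when
// Google cannot place the visitor's IP address.
function initialView(clientLocation) {
  if (!clientLocation) {
    return DEFAULT_VIEW;
  }
  return {
    lat: clientLocation.latitude,
    lng: clientLocation.longitude,
    zoom: 5 // assumption: same default as zoomLevel() in the question
  };
}
```

Calling `initialView(google.loader.ClientLocation)` before touching any of its properties avoids dereferencing null on networks that Google can't locate.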
The strangest bit was that I was hitting this error in an office in downtown Palo Alto which I thought would have been heavily mapped by the google geocoder.
If some of the JS is being hosted by a different server (eg: if you are including something like jQuery from the jQuery site instead of hosting a copy of it yourself) then maybe one of these sites is down temporarily.
You could try using something like "Live HTTP Headers" (available at the Mozilla Addons site) to watch HTTP headers in real time, which can be really useful when doing web development. You should be able to determine very quickly whether all your JS is in fact loading correctly.
You could also use something like Ethereal or Wireshark, but that is probably a little heavy-handed when all you need is to see the request/response headers. Using an Addon is far less hassle.
A few things:
1.) where you include your scripts... make sure you have a separate closing tag! DO NOT self close them.
<script src="..."/><!--self-closing will fail, -->
<script src="..."></script><!--this will work -->
2.) is there a reason why you are using jQuery() rather than $() ?
3.) does your SERVER specify a DOCTYPE that your local environment didn't?
4.) what browser are you testing in? in particular are you testing in IE, if so does it work in Firefox?
5.) can you post some of the generated code if you can't supply a URL?
Once you're sure that all the scripts are loading as expected in the dev environment, you could try removing them one by one from the page to see which removal recreates the issue you were having. That's the script that wasn't loading.
Firebug has a Net Tab which does something very similar to what Live HTTP Headers does, even tracking the timings of the load and any ajax requests.
I almost always recommend the Firefox plugin Firebug for this.
Firstly, you can inspect the scripts to check that they have loaded, which will rule out the "it didn't load from a remote source" problem.
Secondly, it will display errors that occur in the JavaScript console, which may point out something that previously appeared to fail silently.
Lastly, keep an eye out for cross-site-scripting problems when using JavaScript from another domain.
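Since the failing networks are hard to get back to, the "did it load" check can also be done from the page itself and reported back. A small sketch, where `missingGlobals` is an illustrative helper and the names passed in are whatever globals your external scripts are supposed to define:

```javascript
// Returns the names from `expected` that are not defined on `scope`.
function missingGlobals(scope, expected) {
  return expected.filter(function (name) {
    return typeof scope[name] === "undefined";
  });
}
```

On page load you could call `missingGlobals(window, ["jQuery", "GBrowserIsCompatible", "google"])` and show or log the result, so a visitor on a problem network can tell you which script never arrived.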