I am using PhantomJS to take a screenshot of a page every five minutes, and it works correctly most of the time. The problem is that the page sometimes fails to load the AngularJS library, and then it can't build the rest of the page. So I am trying to figure out how to load a local copy in its place. Here is what I have been trying:
var page = require('webpage').create(),system = require('system');
var home = 'https://smartway.tn.gov/traffic/';
page.open(home, function (status) {
if(status === "success"){
page.injectJs('angular.js');
window.setTimeout((function() {
page.evaluate(function () {
/*stuff*/
});
}), 2000);
}
});
So angular.js is the name of my local copy of what the site would normally download. The site calls the script at the end of the body with several other scripts, and I am trying to find the best way to include it. I am wondering if it needs to be included by replacing the script tag in the html so it can be loaded in sequence, but I am not sure how to do that.
Thanks
It is problematic to reload a single JavaScript file after it has failed to load, particularly when it is the framework. There are probably many scripts which depend on it. When the core framework is not loaded, those scripts will stop executing, because the angular reference cannot be resolved.
You could inject a local version of angular, but then you would have to go over all the other scripts which reference angular and "reload" them by either downloading and evaling them in order or putting them into the page as script elements. I advise against it, because it is probably very error prone.
You should just reload the page if angular does not exist after page load (in the callback of page.open). Since the same problem may occur during the reload, this has to be done recursively:
function open(countDown, done){
if (countDown === 0) {
done("ERROR: not loaded");
return;
}
page.open(home, function (status) {
if(status === "success"){
var angularExists = page.evaluate(function () {
// typeof avoids a ReferenceError when angular was never defined
return typeof angular !== 'undefined';
});
if (angularExists){
done();
} else {
open(countDown - 1, done);
}
} else {
open(countDown - 1, done);
}
});
}
open(5, function(err){
if(err) {
console.log(err);
} else {
page.render(target);
}
});
You can also try the page.reload() function instead of a page.open().
The other possibility is to always inject the local version when the page starts loading and to abort any request for the remote version of the script:
page.onLoadStarted = function() {
page.injectJs('angular.js');
};
page.onResourceRequested = function(requestData, networkRequest) {
var match = requestData.url.match(/angular\.min\.js/g);
if (match != null) {
networkRequest.abort();
}
};
page.open(home, function (status) {
if(status === "success"){
window.setTimeout((function() {
page.evaluate(function () {
/*stuff*/
});
}), 2000);
}
});
This version works entirely without reloading.
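The pattern used in onResourceRequested only matches the minified file name. If the site might also request the unminified file or append a query string, the check could be pulled into a small function that can be tried outside PhantomJS first. This is only a sketch: the function name is hypothetical, and the actual file name the site requests is an assumption.

```javascript
// Hypothetical helper: true for the core AngularJS file, minified or not,
// with an optional query string; false for add-on modules such as
// angular-route and for source maps.
function isAngularCoreScript(url) {
  return /\/angular(\.min)?\.js(\?.*)?$/.test(url);
}
```

Inside page.onResourceRequested it would be called as isAngularCoreScript(requestData.url) before networkRequest.abort().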
Related
I am trying to find a way to wget/download a website.
I have tried wget and curl but no luck, then I've been led to PhantomJS.
var url = 'https://www.sagedining.com/menus/admiralfarragutacademy';
var fs = require('fs');
var page = require('webpage').create();
page.open(url, function(status) {
if (status === 'success') {
var html = page.evaluate(function() {
return document.documentElement.outerHTML;
});
try {
fs.write("/root/choate/page.html", html, 'w');
} catch(e) {
console.log(e);
}
}
phantom.exit();
});
When I run this code on my Debian VPS,
sudo xvfb-run -- phantomjs menu.js
It downloads the site when it's still loading, and therefore only downloads the loading screen.
It also throws this error every time it runs:
TypeError: Attempting to change the setter of an unconfigurable property.
TypeError: Attempting to change the setter of an unconfigurable property.
Is there any way to download this website after it loads all the menus? Does the error message have anything to do with it?
Thank you in advance.
That error comes from PhantomJS: the page's code is trying to set some DOM properties that it may not have access to. To get the finished page, wait for it to load before reading the HTML; you can do that with a timeout. Call phantom.exit() inside the timeout callback too, otherwise the script exits before the callback ever runs:
if (status === 'success') {
window.setTimeout(function () {
var html = page.evaluate(function() {
return document.documentElement.outerHTML;
});
try {
fs.write("/root/choate/page.html", html, 'w');
} catch(e) {
console.log(e);
}
phantom.exit(); // exit only after the file has been written
}, 1000); //Increase the value if you need more time
} else {
phantom.exit();
}
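A fixed delay is a guess, so an alternative is to watch the page's network traffic and read the HTML once no requests are in flight. The following is a minimal sketch of the bookkeeping only; createRequestTracker and its method names are hypothetical, and the wiring shown in the comments assumes the id and stage fields that PhantomJS passes to its resource callbacks.

```javascript
// Hypothetical helper: tracks in-flight requests by id so a script can tell
// when the network has gone quiet.
function createRequestTracker() {
  var pending = {};
  var count = 0;
  return {
    // Record a request as started (ignored if the id is already tracked).
    started: function (id) {
      if (!pending[id]) {
        pending[id] = true;
        count += 1;
      }
    },
    // Record a request as finished or failed (ignored if the id is unknown).
    finished: function (id) {
      if (pending[id]) {
        delete pending[id];
        count -= 1;
      }
    },
    pendingCount: function () {
      return count;
    }
  };
}

// Assumed wiring inside a PhantomJS script:
//   var tracker = createRequestTracker();
//   page.onResourceRequested = function (req) { tracker.started(req.id); };
//   page.onResourceReceived = function (res) {
//     if (res.stage === 'end') { tracker.finished(res.id); }
//   };
//   page.onResourceError = function (err) { tracker.finished(err.id); };
```

A quiet network does not prove that the page has finished rendering, so it is probably safest to check pendingCount() on a short interval and still allow a small delay before writing the file.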
I am using ipinfo.io to detect the visitor's country and then reload the page with a query string appended based on that. When the page loads, I would like to do something after DOMContentLoaded.
DOMContentLoaded is called fine if I don't reload the page, but I would like it to work with the reload. How do I achieve that?
Sample code below:
jQuery.getJSON('https://ipinfo.io', function(data){
if(data){
if(data.country){
if(data.country.toLowerCase()=='us')
{
window.location.replace(window.location.href+"?location=us");
}
}
}
});
//works when page is not reloaded
document.addEventListener("DOMContentLoaded",
function() {
doSomething...
});
You have a race condition here: based on your description it is likely that the getJSON command is "racing" with the DOMContentLoaded event. If getJSON is successful before your DOM is ready, then it will redirect the page and stop all script execution on the page.
To avoid that, try moving getJSON into the DOMContentLoaded callback.
document.addEventListener("DOMContentLoaded", function() {
jQuery.getJSON('https://ipinfo.io', function(data) {
if (data) {
if (data.country) {
if (data.country.toLowerCase() == 'us') {
window.location.replace(window.location.href + "?location=us");
}
}
}
});
// Other logic here
});
On a side note, you can avoid triple nesting by combining the three if statements (and remember to use strict comparison whenever possible, ===):
jQuery.getJSON('https://ipinfo.io', function(data) {
if (data && data.country && data.country.toLowerCase() === 'us') {
window.location.replace(window.location.href + "?location=us");
}
});
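One more thing to watch for: after the redirect, the same script runs again on the new URL and would append "?location=us" a second time, and so on. A guard that skips the redirect when the parameter is already present avoids that. The helper below is a sketch with a hypothetical name; it also copes with an existing query string or fragment.

```javascript
// Hypothetical helper: returns the URL with a location parameter added,
// or null when the URL already carries one (so the caller must not redirect).
function addLocationParam(href, country) {
  // Split off the fragment so the parameter lands before "#...".
  var hashIndex = href.indexOf('#');
  var hash = hashIndex === -1 ? '' : href.slice(hashIndex);
  var base = hashIndex === -1 ? href : href.slice(0, hashIndex);
  if (/[?&]location=/.test(base)) {
    return null;
  }
  var separator = base.indexOf('?') === -1 ? '?' : '&';
  return base + separator + 'location=' + encodeURIComponent(country) + hash;
}
```

In the getJSON callback, the result of addLocationParam(window.location.href, 'us') would be passed to window.location.replace() only when it is not null.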
I use phantomjs 2.1.1 and something is bothering me.
Here is the piece of code that I use for scraping a url and the html of the website is written into output.html file
page = require('webpage').create();
page.open(url, function (status) {
if (status !== 'success') {
console.log('Unable to load the address!');
phantom.exit();
} else {
window.setTimeout(function () {
var content = page.content;
fs.write("output.html", content, 'w');
}, 40000); //40 seconds timeout
}
});
Now I need to scrape the paginated results too. The next pages are loaded by a JavaScript function call such as page(2); or page(3);. I tried to do it using
var pageinationOutput = page.evaluate(function (s) {
page(2);
});
console.log(pageinationOutput); // I need the output made by the `page(2);` call.
page = require('webpage').create();
page.open(url, function (status) {
if (status !== 'success') {
console.log('Unable to load the address!');
phantom.exit();
} else {
window.setTimeout(function () {
var content = page.content;
fs.write("output.html", content, 'w');
}, 40000); //40 seconds timeout
}
});
But I am not getting any output from this.
How can I execute a JavaScript function after a page has finished loading and then capture the changes it makes to the page's contents? In this case the website loads the next page via AJAX after the page(2); call.
Thanks in advance!
I found out the solution myself but I am not sure whether it's the perfect way to do it.
Code:
page.open(url, function (status) {
if (status !== 'success') {
console.log('Unable to load the address!');
phantom.exit();
} else {
window.setTimeout(function () {
var content = page.content;
fs.write("output.html", content, 'w');
page.evaluate(function (cb) {
window.page(2);
});
var waiter = window.setInterval(function () {
var nextPageContent = page.evaluate(function (cb) {
return document.documentElement.outerHTML;
});
if (nextPageContent !== false) {
window.clearInterval(waiter);
// write the freshly captured markup, not the first page's content
fs.write("output-2.html", nextPageContent, 'w');
}
}, 40000);//40 seconds timeout
}, 40000);//40 seconds timeout
}
});
I recently published a project that gives PHP access to a browser. Get it here: https://github.com/merlinthemagic/MTS. It also uses PhantomJS under the hood.
If you provided the URL I could make a working example. I need to know how you determine the last page. In the example I simply set it to 10.
I also need to know whether the page buttons have an id attribute. If they don't, no problem, we can find another way to trigger them. But for this example I assume they do, and to keep it simple the ids will be page_2, page_3 ....
After downloading and setup you would simply use the following code:
$myUrl = "http://www.example.com";
$windowObj = \MTS\Factories::getDevices()->getLocalHost()->getBrowser('phantomjs')->getNewWindow($myUrl);
//now you can either retrieve the DOM for each page:
$doms = array();
//get the initial page DOM
$doms[] = $windowObj->getDom();
$pageID = "page_";
$lastPage = 10;
for ($i = 2; $i <= $lastPage; $i++) {
$windowObj->mouseEventOnElement("[id=".$pageID. $i . "]", 'leftclick');
$doms[] = $windowObj->getDom();
}
//$doms now hold all the pages, so you can parse them.
I'm using PhantomJS to scrape data from a webpage. PhantomJS is not returning anything from the evaluate method. The script just runs for a few seconds and then exits.
I've already checked to see if PhantomJS is connecting to the page -- it is.
PhantomJS is also able to grab the page title. I've already double-checked the class I'm looking for, yes -- I'm spelling it correctly.
var page = require('webpage').create();
page.open('http://www.maccosmetics.com/product/13854/36182/Products/Makeup/Lips/Lipstick/Giambattista-Valli-Lipstick', function(status) {
page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
waitFor(function() {
return page.evaluate(function() {
$('.product__price').is(':visible');
});
}, function(){
search = page.evaluate(function() {
return $('.product__price').text();
});
console.log(search)
});
});
phantom.exit();
});
I don't know what's going wrong here.
It's not showing you anything, because you're exiting too early. All functions (except evaluate()) that take a callback are asynchronous.
You're requesting to include jQuery in the page by calling page.includeJs(), but then you immediately exit PhantomJS before its callback can run. You need to exit when you're finished:
var page = require('webpage').create();
page.open('http://www.maccosmetics.com/product/13854/36182/Products/Makeup/Lips/Lipstick/Giambattista-Valli-Lipstick', function(status) {
page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
waitFor(function() {
return page.evaluate(function() {
// return the result, otherwise evaluate() yields undefined and the wait never succeeds
return $('.product__price').is(':visible');
});
}, function(){
search = page.evaluate(function() {
return $('.product__price').text();
});
console.log(search);
phantom.exit();
});
});
});
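Note that waitFor() is not a PhantomJS built-in, so the script has to define it itself; the PhantomJS examples include a helper of that name. The version below is a minimal sketch rather than that exact helper: the onTimeout and intervalMs parameters and the default values are assumptions made here.

```javascript
// Polls testFx until it returns a truthy value, then calls onReady.
// Calls onTimeout (if given) when timeoutMs elapses first.
function waitFor(testFx, onReady, onTimeout, timeoutMs, intervalMs) {
  var maxWait = timeoutMs || 3000; // assumed default: give up after 3 s
  var step = intervalMs || 250;    // assumed default: poll every 250 ms
  var start = Date.now();
  var timer = setInterval(function () {
    var ok = false;
    try {
      ok = !!testFx();
    } catch (e) {
      ok = false; // treat a throwing condition as "not ready yet"
    }
    if (ok) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= maxWait) {
      clearInterval(timer);
      if (onTimeout) {
        onTimeout();
      }
    }
  }, step);
  return timer;
}
```

It would be defined before the page.open() call. Passing a timeout handler that logs a message and calls phantom.exit() keeps the script from hanging when the element never appears.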
I have written a plugin that depends on external libraries that I want to include conditionally, that is, the user can choose to not have them be included automatically in case the user's web site already has those libraries. Here is some pseudocode to illustrate the issue
<script type="text/javascript" src="path/to/plugin.js"></script>
<script type="text/javascript">
PLUGIN.init({
"param1": "foo",
"parma2": 33,
"include": {"jquery": 0, "googlemaps": 0}
});
</script>
In my plugin script
var PLUGIN = {
"init": function(obj) {
if (obj.include.googlemaps !== 0) {
document.write('<script type="text/javascript" src="http://maps.google.com/maps/api/js?sensor=true&v=3.6">\x3C/script>');
}
if (obj.include.jquery !== 0) {
document.write('<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.js">\x3C/script>');
}
.. do more things ..
}
The problem is that when I am ready to "do more things," the libraries don't seem to be loaded yet. I get an error that jquery not found, or google maps not found. I can solve this by changing my code to
document.write('<script type="text/javascript" src="http://maps.google.com/maps/api/js?sensor=true&v=3.6">\x3C/script>');
document.write('<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.js">\x3C/script>');
var PLUGIN = {
"init": function(obj) {
.. do more things ..
}
but now the user can't control loading the libraries or not loading them. Suggestions? Workarounds?
Update: Thanks for the suggestions, you all, but no joy so far. Here is what I am doing, and what is happening. Since I am potentially loading 0 or more scripts (the user can optionally decide which scripts need not be loaded), I have made my code like so
"importLib": function(libPath, callback) {
var newLib = document.createElement("script");
if (callback !== null) {
newLib.onload = callback;
}
newLib.src = libPath;
document.head.appendChild(newLib);
},
"init": function(obj) {
var scripts = [];
if (obj.include.googlemaps !== 0) {
scripts.push("http://maps.google.com/maps/api/js?sensor=true&v=3.6");
}
if (obj.include.jquery !== 0) {
scripts.push("http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.js");
}
if (obj.include.anotherlib !== 0) {
scripts.push("http://path/to/another/lib.js");
}
var len_scripts = scripts.length,
callback = null;
if (len_scripts > 0) {
for (var i = 0; i < len_scripts; i++) {
// add callback only on the last lib to be loaded
if (i == len_scripts - 1) {
callback = function() { startApp(obj) };
}
importLib(scripts[i], callback);
}
}
// Start the app rightaway if no scripts need to be loaded
else {
startApp(obj);
}
},
"startApp": function(obj) {
}
What happens is that Firefox fails with an "attempt to run compile-and-go script on a cleared scope" error, while Safari doesn't raise that error but doesn't load anything either. Oddly, the Safari error console shows no error at all. The Firefox error seems to be caused by the line document.head.appendChild(newLib); if I comment it out, the error goes away, but of course the web page doesn't load correctly.
You should add each script as a DOM node and use the onload attribute to take action when it has completed loading.
function importLib(libPath, callback) {
var newLib = document.createElement("script");
newLib.onload = callback;
newLib.src = libPath;
document.head.appendChild(newLib);
}
Above, the libPath argument is the URL of the library, and the callback argument is a function to call when loading is complete. You could use it as follows:
importLib("http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.js", function() {
alert("jquery loaded!");
nowDoSomething(aboutIt);
});
By the way: in general, document.write is not a good solution for most problems (but I won't say never the right solution -- there are exceptions to every rule).
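When several libraries are needed, attaching the callback only to the last script that was appended is fragile, because scripts inserted this way can finish loading in any order. One way around that is to count completions and start the application when the count reaches zero. The sketch below assumes a loader with the same signature as importLib above; loadAll is a hypothetical name.

```javascript
// Hypothetical helper: calls done() once every URL has reported that it loaded.
// loadOne(url, callback) is expected to behave like importLib.
function loadAll(urls, loadOne, done) {
  var remaining = urls.length;
  if (remaining === 0) {
    done(); // nothing to load, start right away
    return;
  }
  function oneFinished() {
    remaining -= 1;
    if (remaining === 0) {
      done();
    }
  }
  for (var i = 0; i < urls.length; i++) {
    loadOne(urls[i], oneFinished);
  }
}
```

A call would look like loadAll(scripts, importLib, function () { startApp(obj); });. This does not control the order in which the scripts execute, so if one library depends on another it is safer to chain the importLib calls instead.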
The above solution works in modern browsers, but for IE 7/8 you might want to add a little extra code, something like this:
function importLib(libPath, callback) {
var newLib = document.createElement("script");
if (navigator.userAgent.indexOf('MSIE') !== -1) {
newLib.onreadystatechange = function () {// this piece is for IE 7 and 8
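if (this.readyState === 'loaded' || this.readyState === 'complete') {
this.onreadystatechange = null; // IE may report either state; fire only once
callback();
}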
};
} else {
newLib.onload = callback;
}
newLib.src = libPath;
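// document.head is not available in IE 7/8, so fall back to a tag lookup
(document.head || document.getElementsByTagName('head')[0]).appendChild(newLib);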
}
I encountered the same issue. One workaround, if you are writing your site in .NET, is to conditionally write the script reference from the code-behind before the page loads. My issue is that when remote users access my app over VPN, it blocks access to the internet, so Google Maps cannot be referenced. This prevents the rest of the page from loading within a reasonable timeframe. I tried controlling the script reference to the Google Maps library via jQuery's getScript() command, but as you found out, the subsequent Google Maps configuration code runs before the external library is referenced.
My solution was to conditionally reference google maps from code behind instead:
VB (code behind):
'If VPN mode is not enabled, add the external Google Maps script reference (skipping it speeds up the interface significantly when using VPN)
If Session("VPNMode") = False Then
Dim sb As System.Text.StringBuilder = New System.Text.StringBuilder()
sb.AppendLine("")
sb.AppendLine("<script type='text/javascript'")
sb.Append(" src='http://maps.google.com/maps/api/js?v=3&sensor=false'>")
sb.Append("</script>")
Dim header As LiteralControl = New LiteralControl
header.Text = sb.ToString()
Me.Page.Header.Controls.Add(header)
End If
Client side script (javascript):
<script type='text/javascript'>
$(function () {
if ($("input[name*='VPN']").is(":checked"))
{ }
else {
loadGoogleMap()
}
});
function loadGoogleMap() {
hazsite = new google.maps.LatLng(hazLat, hazLong);
map = new google.maps.Map(document.getElementById('map_canvas'), { zoom: 18, center: hazsite, mapTypeId: google.maps.MapTypeId.SATELLITE });
var marker = new google.maps.Marker({
position: hazsite,
map: map,
title: "Site Location"
});
}
</script>