I'm trying to get PhantomJS to take an html string and then have it render the full page as a browser would (including execution of any javascript in the page source). I need the resulting html result as a string. I have seen examples of page.open which is of no use since I already have the page source in my database.
Do I need to use page.open to trigger the JavaScript rendering engine in PhantomJS? Is there any way to do this all in memory (i.e. without page.open making a request or reading/writing HTML source from/to disk)?
I have seen a similar question and answer here but it doesn't quite solve my issue. After running the code below, nothing I do seems to render the javascript in the html source string.
var page = require('webpage').create();
page.setContent('raw html and javascript in this string', 'http://whatever.com');
//everything i've tried from here on doesn't execute the javascript in the string
--------------Update---------------
Tried the following based on the suggestion below but this still does not work. Just returns the raw source that I supplied with no javascript rendered.
var page = require('webpage').create();
page.settings.localToRemoteUrlAccessEnabled = true;
page.settings.webSecurityEnabled = false;
page.onLoadFinished = function(){
var resultingHtml = page.evaluate(function() {
return document.documentElement.innerHTML;
});
console.log(resultingHtml);
//console.log(page.content); // this didn't work either
phantom.exit();
};
page.url = input.Url;
page.content = input.RawHtml;
//page.setContent(input.RawHtml, input.Url); //this didn't work either
The following works
page.onLoadFinished = function(){
console.log(page.content); // rendered content
};
page.content = "your source html string";
But you have to keep in mind that if you set the page from a string, the domain will be about:blank. So if the html loads resources from other domains, then you should run PhantomJS with the --web-security=false --local-to-remote-url-access=true commandline options:
phantomjs --web-security=false --local-to-remote-url-access=true script.js
Additionally, you may need to wait for JavaScript execution to complete, since it might not be finished when PhantomJS reports that the page has loaded. Use either setTimeout() to wait a fixed amount of time or waitFor() to wait for a specific condition on the page. More robust ways to wait for a full page are given in this question: phantomjs not waiting for “full” page load
The setTimeout made it work even though I'm not excited to wait a set amount of time for each page. The waitFor approach that is discussed here doesn't work since I have no idea what elements each page might have.
var system = require('system');
var page = require('webpage').create();
page.setContent(input.RawHtml, input.Url);
window.setTimeout(function () {
console.log(page.content);
phantom.exit();
}, input.WaitToRenderTimeInMilliseconds);
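When no selector is known in advance, one alternative to a fixed delay is to poll page.content and stop once it has stopped changing. The sketch below is a minimal, engine-agnostic helper and not part of the PhantomJS API; the name makeStabilityChecker and its requiredRepeats parameter are hypothetical. A page that keeps mutating itself will never count as stable, so a hard timeout is still needed alongside it.

```javascript
// Hypothetical helper: returns a function that reports true once the same
// snapshot has been seen `requiredRepeats` times in a row.
// Nothing in this helper is PhantomJS-specific.
function makeStabilityChecker(requiredRepeats) {
  var last = null;   // most recent snapshot
  var repeats = 0;   // how many consecutive times it has been seen
  return function check(snapshot) {
    if (snapshot === last) {
      repeats += 1;
    } else {
      last = snapshot;
      repeats = 1;
    }
    return repeats >= requiredRepeats;
  };
}

// Assumed usage inside a PhantomJS script (shown as comments, not executed here):
// var isStable = makeStabilityChecker(3);
// var timer = setInterval(function () {
//   if (isStable(page.content)) {
//     clearInterval(timer);
//     console.log(page.content);
//     phantom.exit();
//   }
// }, 200);
```

The polling interval (200 ms) and repeat count (3) are arbitrary illustration values; together they trade latency against the risk of dumping the page between two AJAX updates.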
Maybe not the answer you want, but using PhantomJsCloud.com you can do it easily, Here's an example: http://api.phantomjscloud.com/api/browser/v2/a-demo-key-with-low-quota-per-ip-address/?request={url:%22http://example.com%22,content:%22%3Ch1%3ENew%20Content!%3C/h1%3E%22,renderType:%22png%22,scripts:{domReady:[%22var%20hiDiv=document.createElement%28%27div%27%29;hiDiv.innerHTML=%27Hello%20World!%27;document.body.appendChild%28hiDiv%29;window._pjscMeta.scriptOutput={Goodbye:%27World%27};%22]},outputAsJson:false} The "New Content!" is the content that replaces the original content, and the "Hello World!" is placed in the page by a script.
If you want to do this via normal PhantomJs, you'll need to use the injectJs or includeJs functions, after the page content is loaded.
Related
I am trying to get the HTML (ie what you see initially when the page completes loading) for some web-page URI. Stripping out all error checking and assuming static HTML, it's a single line of code:
function GetDisplayedHTML($uri) {
return file_get_contents($uri);
}
This works fine for static HTML, and is easy to extend by simple parsing, if the page has static file dependencies/references. So tags like <script src="XXX">, <a href="XXX">, <img src="XXX">, and CSS, can also be detected and the dependencies returned in an array, if they matter.
But what about web pages where the HTML is dynamically created using events/AJAX? For example suppose the HTML for the web page is just a brief AJAX-based or OnLoad script that builds the visible web page? Then parsing alone won't work.
I guess what I need is a way from within PHP, to open and render the http response (ie the HTML we get at first) via some javascript engine or browser, and once it 'stabilises', capture the HTML (or static DOM?) that's now present, which will be what the user's actually seeing.
Since such a webpage could continually change itself, I'd have to define "stable" (OnLoad or after X seconds?). I also don't need to capture any timer or async event states (ie "things set in motion that might cause web page updates at some future time"). I only need enough of the DOM to represent the static appearance the user could see, at that time.
What would I need to do, to achieve this programmatically in PHP?
To render a page with JS you need to use a browser of some kind. PhantomJS was created for tasks like this. Here is a simple script to run with PhantomJS:
var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;
if (args.length === 1) {
console.log('First argument must be page URL!');
phantom.exit(1); // exit explicitly; otherwise PhantomJS keeps running
} else {
page.open(args[1], function (status) {
window.setTimeout(function () { //Wait for scripts to run
var content = page.content;
console.log(content);
phantom.exit();
}, 500);
});
}
It returns resulting HTML to console output.
You can run it from console like this:
./phantomjs.exe render.js http://yandex.ru
Or you can use PHP to run it:
<?php
$path = dirname(__FILE__);
$html = shell_exec($path . DIRECTORY_SEPARATOR . 'phantomjs.exe render.js http://phantomjs.org/');
echo htmlspecialchars($html);
My PHP code assumes that the PhantomJS executable is in the same directory as the PHP script.
I've been trying to figure this out for a couple days now but haven't been able to achieve it.
There's this web page where I need to scrape all the records available on it. I've noticed that if I modify the pagination link with Firebug or the browser's inspector I can get all the records I need. For example, this is the original link:
<a href="javascript:gReport.navigate.paginate('paginator_min_row=16max_rows=15rows_fetched=15')">
If I modify that link like this
<a href="javascript:gReport.navigate.paginate('paginator_min_row=1max_rows=5000rows_fetched=5000')">
And then click on the pagination button on the browser (the very same that contains the link I've just changed) I'm able to get all records I need from that site (most of the time "rows" doesn't get any bigger than 4000, I use 5000 just in case)
Since I have to process that file by hand every single day, I thought I could automate the process with PhantomJS and get the whole page in a single run, without finding that link and changing it manually. To modify the pagination link and get all records I'm using the following code:
var page = require('webpage').create();
var fs = require('fs');
page.open('http://testingsite1.local', function () {
page.evaluate(function(){
$('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').first().attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id','clickit');
$('#clickit')[0].click();
});
page.render('test.png');
fs.write('test.html', page.content, 'w');
phantom.exit();
});
Notice that there are TWO pagination links on that website, because of that I'm using jquery's ".first()" to choose only the first one.
Also, since the required link doesn't have any identifier, I select it by its href, change it to what I need, and lastly add the "clickit" ID to it so I can call it later.
Now, these are my questions:
I'm not exactly sure why it isn't working. If I run the code it fetches only the first page. After examining the source of the fetched page I do see that the href has been changed to what I want, but it just doesn't get called. I have two theories about what might be wrong:
The modified href isn't getting "clicked" so the page isn't getting updated
The href does get clicked, but since the page takes a few seconds to load all results dynamically I only get to dump the first page Phantomjs gets to see
What do you guys think about it?
[UPDATE NOV 6 2015]
Ok, so the answers provided by #Artjomb and #pguardiario pointed me in a new direction:
I needed more debugging info on what was going on
I needed to call gReport.navigate.paginate function directly
Sadly, I simply lack the experience to use PhantomJS properly. Several other samples indicated that I could achieve what I wanted with CasperJS, so I tried it. This is what I produced after a couple of hours:
var utils = require('utils');
var fs = require('fs');
var url = 'http://testingsite1.local';
var casper = require('casper').create({
verbose: true,
logLevel: 'debug'
});
casper.on('error', function(msg, backtrace) {
this.echo("=========================");
this.echo("ERROR:");
this.echo(msg);
this.echo(backtrace);
this.echo("=========================");
});
casper.on("page.error", function(msg, backtrace) {
this.echo("=========================");
this.echo("PAGE.ERROR:");
this.echo(msg);
this.echo(backtrace);
this.echo("=========================");
});
casper.start(url, function() {
var url = this.evaluate(function() {
$('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id', 'clicklink');
return gReport.navigate.paginate('paginator_min_row=1max_rows=5000rows_fetched=5000');
});
});
casper.then(function() {
this.waitForSelector('.nonexistant', function() {
// Nothing here
}, function() {
//selector never appears, so this timeout callback runs after 50 seconds
this.capture('screen.png');
var html = this.getPageContent();
var f = fs.open('test.html', 'w');
f.write(html);
f.close();
}, 50000);
});
casper.run(function() {
this.exit();
});
Please be gentle, as I know this code sucks. I'm no JavaScript expert and in fact I know very little of it. I know I should have waited for an element to appear, but that simply didn't work in my tests: I was still getting the page without the update from the AJAX request.
In the end I waited a long time (50 seconds) for the AJAX request to show on page and then dump the HTML
Oh! and calling the function directly did work great!
The href does get clicked, but since the page takes a few seconds to load all results dynamically I only get to dump the first page Phantomjs gets to see
It's easy to check whether it's that by wrapping the render, write and exit calls in setTimeout and trying different timeouts:
page.open('http://testingsite1.local', function () {
page.evaluate(function(){
$('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').first().attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id','clickit');
$('#clickit')[0].click();
});
setTimeout(function(){
page.render('test.png');
fs.write('test.html', page.content, 'w');
phantom.exit();
}, 5000);
});
If it's really just a timeout issue, then you should use the waitFor() function to wait for a specific condition like "all elements loaded" or "x elements of that type are loaded".
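The waitFor() idea can be sketched as a small polling helper. This is a minimal sketch in the spirit of the waitFor example script, not a built-in PhantomJS function; createWaiter, its option names, and the injectable now clock are all hypothetical and exist only so the logic can be exercised without real timers.

```javascript
// Hypothetical polling helper. Call the returned tick() function repeatedly
// (for example from setInterval) until it stops returning 'pending'.
function createWaiter(options) {
  var now = options.now || Date.now; // injectable clock, assumption for testing
  var startedAt = now();
  var finished = false;
  return function tick() {
    if (finished) {
      return 'finished';
    }
    if (options.test()) {
      // Condition met: fire onReady exactly once.
      finished = true;
      options.onReady();
      return 'ready';
    }
    if (now() - startedAt >= options.timeoutMs) {
      // Condition never met within the allowed time: fire onTimeout once.
      finished = true;
      options.onTimeout();
      return 'timeout';
    }
    return 'pending';
  };
}

// Assumed usage inside a PhantomJS script (comments only, not executed here).
// The 'tr' selector and the row count are placeholders for the real page.
// var tick = createWaiter({
//   test: function () {
//     return page.evaluate(function () {
//       return document.querySelectorAll('tr').length > 15;
//     });
//   },
//   onReady: function () { fs.write('test.html', page.content, 'w'); phantom.exit(); },
//   onTimeout: function () { console.log('gave up waiting'); phantom.exit(1); },
//   timeoutMs: 30000
// });
// var timer = setInterval(function () {
//   if (tick() !== 'pending') { clearInterval(timer); }
// }, 250);
```

Compared with a fixed setTimeout, this returns as soon as the condition holds and only pays the full delay in the failure case.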
The modified href isn't getting "clicked" so the page isn't getting updated
This is a little trickier. You can listen to the onConsoleMessage, onError, onResourceError, onResourceTimeout events (Example) and see if there are errors on the page. Some of those errors are fixable by the stuff you can do in PhantomJS: Function.prototype.bind not available or HTTPS site/resources cannot be loaded.
There are other ways to click something that are more reliable such as this one.
I've been debugging this for a long time and it has me completely baffled. I need to save ads to my computer for a work project. Here is an example ad that I got from CNN.com:
http://ads.cnn.com/html.ng/site=cnn&cnn_pagetype=main&cnn_position=300x250_rgt&cnn_rollup=homepage&page.allowcompete=no&params.styles=fs&Params.User.UserID=5372450203c5be0a3c695e599b05d821&transactionID=13999976982075532128681984&tile=2897967999935&domId=6f4501668a5e9d58&kxid=&kxseg=
When I visit this link in Google Chrome and Firefox, I see an ad (if the link stops working, simply go to CNN.com and grab the iframe URL for one of the ads). I developed a PhantomJS script that will save a screenshot and the HTML of any page. It works on any website, but it doesn't seem to work on these ads. The screenshot is blank and the HTML contains a tracking pixel (a 1x1 transparent gif used to track the ad). I thought that it would give me what I see in my normal browser.
The only thing that I can think of is that the AJAX calls are somehow messing up PhantomJS, so I hard-coded a delay but I got the same results.
Here is the most basic piece of test code that reproduces my problem:
var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];
page.open(url, function (status) {
if (status !== 'success') {
console.log('Unable to load the address!');
phantom.exit();
}
else {
// Output Results Immediately
var html = page.evaluate(function () {
return document.getElementsByTagName('html')[0].innerHTML;
});
fs.write("HtmlBeforeTimeout.htm", html, 'w');
page.render('RenderBeforeTimeout.png');
// Output Results After Delay (for AJAX)
window.setTimeout(function () {
var html = page.evaluate(function () {
return document.getElementsByTagName('html')[0].innerHTML;
});
fs.write("HtmlAfterTimeout.htm", html, 'w');
page.render('RenderAfterTimeout.png');
phantom.exit();
}, 9000); // 9 Second Delay
}
});
You can run this code using this command in your terminal:
phantomjs getHtml.js 'http://www.google.com/'
The above command works well. When you replace the Google URL with an ad URL (like the one at the top of this post), it gives me the unexpected results that I described.
Thanks so much for your help! This is my first question that I've ever posted on here, because I can almost always find the answer by searching Stack Overflow. This one, however, has me completely stumped! :)
EDIT: I'm running PhantomJS 1.9.7 on Ubuntu 14.04 (Trusty Tahr)
EDIT: Okay, I've been working on it for a while now and I think it has something to do with cookies. If I clear all of my history and view the link in my browser, it also comes up blank. If I then refresh the page, it displays fine. It also displays fine if I open it in a new tab. The only time it doesn't is when I try to view it directly after clearing my cookies.
EDIT: I've tried loading the link twice in PhantomJS without exiting (manually requesting it twice in my script before calling phantom.exit()). It doesn't work. In the PhantomJS documentation it says that the cookie jar is enabled by default. Any ideas? :)
You should try using the onLoadFinished callback instead of checking for status in page.open. Something like this should work:
var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];
page.open(url);
page.onLoadFinished = function()
{
// Output Results Immediately
var html = page.evaluate(function () {
return document.getElementsByTagName('html')[0].innerHTML;
});
fs.write("HtmlBeforeTimeout.htm", html, 'w');
page.render('RenderBeforeTimeout.png');
// Output Results After Delay (for AJAX)
window.setTimeout(function () {
var html = page.evaluate(function () {
return document.getElementsByTagName('html')[0].innerHTML;
});
fs.write("HtmlAfterTimeout.htm", html, 'w');
page.render('RenderAfterTimeout.png');
phantom.exit();
}, 9000); // 9 Second Delay
};
I have an answer here that loops through all files in a local folder and saves images of the resulting pages: Using Phantom JS to convert all HTML files in a folder to PNG
The same principle applies to remote HTML pages.
Here is what I have from the output:
Before Timeout:
http://i.stack.imgur.com/GmsH9.jpg
After Timeout:
http://i.stack.imgur.com/mo6Ax.jpg
I was searching Google for a JS library that can capture an image of any website or URL, and I learned that the PhantomJS library can do it. I found a small piece of code that captures the GitHub home page and converts it to a PNG image.
If anyone is familiar with PhantomJS, please tell me what this line means:
var page = require('webpage').create();
Can I use any name here instead of webpage?
If I need to capture just a portion of a web page, how can I do it with this library? Can anyone guide me?
var page = require('webpage').create();
page.open('http://github.com/', function () {
page.render('github.png');
phantom.exit();
});
https://github.com/ariya/phantomjs/wiki
thanks
Here is a simple phantomjs script for grabbing an image:
var page = require('webpage').create(),
system = require('system'),
address, output, size;
address = "http://google.com";
output = "your_image.png";
page.viewportSize = { width: 900, height: 600 };
page.open(address, function (status) {
if (status !== 'success') {
console.log('Unable to load the address!');
phantom.exit();
} else {
window.setTimeout(function () {
page.render(output);
console.log('done');
phantom.exit();
}, 10000);
}
})
Where..
'address' is your url string.
'output' is your filename string.
Also, 'width' & 'height' set the size of the viewport the page is laid out in (comment this line out to use the default viewport).
To run this from the command line, save the above as 'script_name.js' and fire off phantomjs with the js file as the first argument.
Hope this helps :)
The line you ask about:
var page = require('webpage').create();
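As far as I can tell, that line does 3 things: it loads a module with require('webpage'), then creates a WebPage object in PhantomJS with .create(), and then assigns that object to the variable page.
The name "webpage" tells it which module to load, so it cannot be swapped for an arbitrary name; the variable name page, on the other hand, is yours to choose.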
http://phantomjs.org/api/webpage/
I too need a way to use page.render() to capture just one section of a web page, but I don't see an easy way to do this. It would be nice to select a page element by ID and just render out that element based at whatever size it is. They should really add that for the next version of PhantomJS.
For now, my only workaround is to add an anchor tag to my URL http://example.com/page.html#element to make the page scroll to the element that I want, and then set a width and height that gets close to the size I need.
I recently discovered that I can manipulate the page somewhat before rendering, so I want to try to use this technique to hide all of the other elements except the one I want to capture. I have not tried this yet, but maybe I will have some success.
See this page and look at how they use querySelector(): https://github.com/ariya/phantomjs/blob/master/examples/technews.js
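For capturing a single element, PhantomJS documents a clipRect property on the page object that limits the area render() captures, which may cover this case. The sketch below is a hypothetical helper (toClipRect is not part of PhantomJS) that turns an element's bounding box into the { top, left, width, height } shape clipRect expects; the '#element' selector is taken from the example URL above and is an assumption about the page.

```javascript
// Hypothetical helper: converts a bounding box (as produced by
// getBoundingClientRect) into a clip rectangle with whole-pixel values.
// `padding` is optional extra space around the element; the origin is
// clamped so it never goes negative.
function toClipRect(rect, padding) {
  var pad = padding || 0;
  return {
    top: Math.max(0, Math.floor(rect.top - pad)),
    left: Math.max(0, Math.floor(rect.left - pad)),
    width: Math.ceil(rect.width + 2 * pad),
    height: Math.ceil(rect.height + 2 * pad)
  };
}

// Assumed usage inside a PhantomJS script (comments only, not executed here).
// If the selector matches nothing, querySelector returns null and this throws.
// var rect = page.evaluate(function () {
//   var r = document.querySelector('#element').getBoundingClientRect();
//   return { top: r.top, left: r.left, width: r.width, height: r.height };
// });
// page.clipRect = toClipRect(rect, 0);
// page.render('element.png');
```

Copying the four fields into a plain object inside evaluate() is a defensive choice, since only simple serializable values are passed back from the page context.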
I have about 100 static HTML pages that I want to apply some DOM manipulations to. They all follow the same HTML structure. I want to apply some DOM manipulations to each of these files, and then save the resulting HTML.
These are the manipulations I want to apply:
// [start]
$("h1.title, h2.description", this).wrap("<hgroup>");
if ( $("h1.title").height() < 200 ) {
$("div.content").addClass('tall');
}
// [end]
// SAVE NEW HTML
The first line (.wrap()) I could easily do with a find and replace, but it gets tricky when I have to determine the calculated height of an element, which can't easily be determined without JavaScript.
Does anyone know how I can achieve this? Thanks!
While the first part could indeed be solved in "text mode" using regular expressions or a more complete DOM implementation in JavaScript, for the second part (the height calculation), you'll need a real, full browser or a headless engine like PhantomJS.
From the PhantomJS homepage:
PhantomJS is a command-line tool that packs and embeds WebKit.
Literally it acts like any other WebKit-based web browser, except that
nothing gets displayed to the screen (thus, the term headless). In
addition to that, PhantomJS can be controlled or scripted using its
JavaScript API.
A schematic instruction (which I admit is not tested) follows.
In your modification script (say, modify-html-file.js) open an HTML page, modify its DOM tree and console.log the HTML of the root element:
var page = new WebPage();
page.open(encodeURI('file://' + phantom.args[0]), function (status) {
if (status === 'success') {
var html = page.evaluate(function () {
// your DOM manipulation here
return document.documentElement.outerHTML;
});
console.log(html);
}
phantom.exit();
});
Next, save the new HTML by redirecting your script's output to a file:
#!/bin/bash
mkdir modified
for i in *.html; do
phantomjs modify-html-file.js "$i" > modified/"$i"
done
I tried PhantomJS as in katspaugh's answer, but ran into several issues trying to manipulate pages. My use case was modifying the static HTML output of Doxygen, without modifying Doxygen itself. The goal was to reduce the delivered file size by removing unnecessary elements from the page, and to convert it to HTML5. Additionally, I wanted to use jQuery to access and modify elements more easily.
Loading the page in PhantomJS
The APIs appear to have changed drastically since the accepted answer. Additionally, I used a different approach (derived from this answer), which will be important in mitigating one of the major issues I encountered.
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
// Reading the page's content into your "webpage"
// This automatically refreshes the page
page.content = fs.read(system.args[1]);
// Make all your changes here
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Preventing JavaScript from Running
My page uses Google Analytics in the footer, and now the page is modified beyond my intention, presumably because javascript was run. If we disable javascript, we can't actually use jQuery to modify the page, so that isn't an option. I've tried temporarily changing the tag, but when I do, every special character is replaced with an html-escaped equivalent, destroying all javascript code on the page. Then, I came across this answer, which gave me the following idea.
var rawPageString = fs.read(system.args[1]);
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");
page.content = rawPageString;
// Make all your changes here
rawPageString = page.content;
rawPageString = rawPageString.replace(/<script type='foo\/bar'/g, "<script");
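The same replace() calls can be wrapped in a pair of functions so the round trip can be checked outside PhantomJS. This is a minimal sketch; disableScripts and restoreScripts are hypothetical names. The restore step accepts either quote style, on the assumption (worth verifying) that the engine may not preserve the original single quotes when page.content is read back. As in the snippet above, restoring drops the type attribute rather than putting text/javascript back.

```javascript
// Hypothetical helpers mirroring the replace() calls above.
var PLACEHOLDER_TYPE = 'foo/bar'; // any type the browser will not execute

// Rewrites executable <script> tags so the engine treats them as inert data.
function disableScripts(html) {
  return html
    .replace(/<script type="text\/javascript"/g, "<script type='" + PLACEHOLDER_TYPE + "'")
    .replace(/<script>/g, "<script type='" + PLACEHOLDER_TYPE + "'>");
}

// Removes the placeholder type again, whether it is quoted with ' or ".
function restoreScripts(html) {
  return html.replace(/<script type=(['"])foo\/bar\1/g, '<script');
}

// Assumed usage inside a PhantomJS script (comments only, not executed here):
// page.content = disableScripts(fs.read(system.args[1]));
// // ... make changes ...
// fs.write(system.args[2], restoreScripts(page.content), 'w');
```

Script tags written in other forms (for example with extra attributes before type, or uppercase tag names) are not matched by these patterns and would still execute.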
Adding jQuery
There's actually an example on how to use jQuery. However, I thought an offline copy would be more appropriate. Initially I tried using page.includeJs as in the example, but found that page.injectJs was more suitable for the use case. Unlike includeJs, there's no <script> tag added to the page context, and the call blocks execution which simplifies the code. jQuery was placed in the same directory I was executing my script from.
page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {
// Make all changes here
// Remove the foo/bar type more easily here
$("script[type^=foo]").removeAttr("type");
});
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Putting it All Together
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
var rawPageString = fs.read(system.args[1]);
// Prevent in-page javascript execution
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");
page.content = rawPageString;
page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {
// Make all changes here
// Remove the foo/bar type
$("script[type^=foo]").removeAttr("type");
});
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Using it from the command line:
phantomjs modify-html-file.js "input_file.html" "output_file.html"
Note: This was tested and working with PhantomJS 2.0.0 on Windows 8.1.
Pro tip: If speed matters, you should consider iterating the files from within your PhantomJS script rather than a shell script. This will avoid the latency that PhantomJS has when starting up.
You can get your modified content with $('html').html() (or a more specific selector if you don't want things like the head tags), then submit it as one big string to your server and write the file server-side.