HTML output from PhantomJS and Google Chrome/Firefox is different - javascript

I've been debugging this for a long time and it has me completely baffled. I need to save ads to my computer for a work project. Here is an example ad that I got from CNN.com:
http://ads.cnn.com/html.ng/site=cnn&cnn_pagetype=main&cnn_position=300x250_rgt&cnn_rollup=homepage&page.allowcompete=no&params.styles=fs&Params.User.UserID=5372450203c5be0a3c695e599b05d821&transactionID=13999976982075532128681984&tile=2897967999935&domId=6f4501668a5e9d58&kxid=&kxseg=
When I visit this link in Google Chrome and Firefox, I see an ad (if the link stops working, simply go to CNN.com and grab the iframe URL for one of the ads). I developed a PhantomJS script that will save a screenshot and the HTML of any page. It works on any website, but it doesn't seem to work on these ads. The screenshot is blank and the HTML contains a tracking pixel (a 1x1 transparent gif used to track the ad). I thought that it would give me what I see in my normal browser.
The only thing I can think of is that the AJAX calls are somehow messing up PhantomJS, so I hard-coded a delay, but I got the same results.
Here is the most basic piece of test code that reproduces my problem:
var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        // Output results immediately
        var html = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        fs.write("HtmlBeforeTimeout.htm", html, 'w');
        page.render('RenderBeforeTimeout.png');

        // Output results after a delay (for AJAX)
        window.setTimeout(function () {
            var html = page.evaluate(function () {
                return document.getElementsByTagName('html')[0].innerHTML;
            });
            fs.write("HtmlAfterTimeout.htm", html, 'w');
            page.render('RenderAfterTimeout.png');
            phantom.exit();
        }, 9000); // 9 second delay
    }
});
You can run this code using this command in your terminal:
phantomjs getHtml.js 'http://www.google.com/'
The above command works well. When you replace the Google URL with an ad URL (like the one at the top of this post), it gives me the unexpected results I described.
Thanks so much for your help! This is the first question I've ever posted here, because I can almost always find the answer by searching Stack Overflow. This one, however, has me completely stumped! :)
EDIT: I'm running PhantomJS 1.9.7 on Ubuntu 14.04 (Trusty Tahr)
EDIT: Okay, I've been working on it for a while now and I think it has something to do with cookies. If I clear all of my history and view the link in my browser, it also comes up blank. If I then refresh the page, it displays fine. It also displays fine if I open it in a new tab. The only time it doesn't is when I try to view it directly after clearing my cookies.
EDIT: I've tried loading the link twice in PhantomJS without exiting (manually requesting it twice in my script before calling phantom.exit()). It doesn't work. In the PhantomJS documentation it says that the cookie jar is enabled by default. Any ideas? :)
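Here is roughly how I requested it twice (a sketch, not my exact script; same command-line usage as above):
var page = require('webpage').create();
var url = phantom.args[0];

// First request: primes PhantomJS's cookie jar (enabled by default).
page.open(url, function () {
    // Second request: should carry any cookies set by the first response.
    page.open(url, function () {
        page.render('SecondLoad.png');
        phantom.exit();
    });
});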

You should try using the onLoadFinished callback instead of checking for status in page.open. Something like this should work:
var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];

page.open(url);
page.onLoadFinished = function () {
    // Output results immediately
    var html = page.evaluate(function () {
        return document.getElementsByTagName('html')[0].innerHTML;
    });
    fs.write("HtmlBeforeTimeout.htm", html, 'w');
    page.render('RenderBeforeTimeout.png');

    // Output results after a delay (for AJAX)
    window.setTimeout(function () {
        var html = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        fs.write("HtmlAfterTimeout.htm", html, 'w');
        page.render('RenderAfterTimeout.png');
        phantom.exit();
    }, 9000); // 9 second delay
};
I have an answer here that loops through all files in a local folder and saves images of the resulting pages: Using Phantom JS to convert all HTML files in a folder to PNG
The same principle applies to remote HTML pages.
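As a sketch of the same idea applied to remote pages (these URLs and filenames are placeholders):
var page = require('webpage').create();
var urls = ['http://example.com/a', 'http://example.com/b']; // placeholders
var i = 0;

function renderNext() {
    if (i >= urls.length) {
        phantom.exit();
        return;
    }
    page.open(urls[i], function () {
        page.render('page' + i + '.png'); // one image per URL
        i += 1;
        renderNext();
    });
}
renderNext();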
Here is what I have from the output:
Before Timeout:
http://i.stack.imgur.com/GmsH9.jpg
After Timeout:
http://i.stack.imgur.com/mo6Ax.jpg


Selecting menu item using PhantomJS

I have a simple PhantomJS script that renders the JavaScript content of a website to HTML. (Some data is then extracted from the HTML using another tool.)
var page = require('webpage').create();
var fs = require('fs'); // File System module
var output = '/tmp/sourcefile'; // path for saving the local file

page.open('targeturl', function() { // open the page
    fs.write(output, page.content, 'w'); // write the page source to the local file
    phantom.exit(); // exit PhantomJS
});
(I got these lines of code from http://kochi-coders.com/2014/05/06/scraping-a-javascript-enabled-web-page-using-beautiful-soup-and-phantomjs/)
This used to work when all targets had direct links. Now they are behind the same URL and there is a drop-down menu:
<select id="observation-station-menu" name="station" onchange="updateObservationProductsBasedOnForm(this);">
    <option value="101533">Alajärvi Möksy</option>
    ...
    <option value="101541">Äänekoski Kalaniemi</option>
</select>
This is the menu item I would actually like to load:
<option value="101632">Joensuu Linnunlahti</option>
Because of this menu, my script only downloads data for the default location. How do I load the contents of another item from the menu and download the HTML content of that item instead?
My target site is this: http://ilmatieteenlaitos.fi/suomen-havainnot
(If there is a better way than PhantomJS to do this, I could use it just as well. My interest is in dealing with the data once I get it scraped, and I chose PhantomJS just because it was the first thing that worked. Some options might be limited because my server is a Raspberry Pi and might not work on it: Python Selenium: Firefox profile error)
Since the page has jQuery, you can do:
page.open('targeturl', function() { // open the page
    page.evaluate(function() {
        jQuery('#observation-station-menu').val('101632').change();
    }); // change the select value, then fire the change event
    fs.write(output, page.content, 'w'); // write the page source to the local file
    phantom.exit(); // exit PhantomJS
});
You could directly call the function, which is defined in the underlying js on that page:
var page = require('webpage').create();
var fs = require('fs'); // File System module
var output = '/tmp/sourcefile'; // path for saving the local file

page.open('targeturl', function() { // open the page
    page.evaluate(function() {
        updateObservationProducts(101632, 'weather');
    });
    window.setTimeout(function () {
        fs.write(output, page.content, 'w'); // write the page source to the local file
        phantom.exit(); // exit PhantomJS
    }, 1000); // change the timeout as required to allow sufficient time
});
For waiting on the render to finish, see phantomjs not waiting for "full" page load; I copy-pasted a part from rhunwicks' solution.

Changing a link dynamically in PhantomJS and clicking it to scrape the page

I've been trying to figure this out for a couple of days now but haven't been able to achieve it.
There's a web page where I need to scrape all the available records. I've noticed that if I modify the pagination link with Firebug or the browser's inspector, I can get all the records I need. For example, this is the original link:
<a href="javascript:gReport.navigate.paginate('paginator_min_row=16max_rows=15rows_fetched=15')">
If I modify that link like this
<a href="javascript:gReport.navigate.paginate('paginator_min_row=1max_rows=5000rows_fetched=5000')">
And then click the pagination button in the browser (the very same one that contains the link I've just changed), I'm able to get all the records I need from that site (most of the time "rows" doesn't get any bigger than 4000; I use 5000 just in case).
Since I have to process that file by hand every single day, I thought that maybe I could automate the process with PhantomJS and get the whole page in a single run, without looking for that link and changing it manually. So, in order to modify the pagination link and get all the records, I'm using the following code:
var page = require('webpage').create();
var fs = require('fs');

page.open('http://testingsite1.local', function () {
    page.evaluate(function(){
        $('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').first().attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id', 'clickit');
        $('#clickit')[0].click();
    });
    page.render('test.png');
    fs.write('test.html', page.content, 'w');
    phantom.exit();
});
Notice that there are TWO pagination links on that website; because of that, I'm using jQuery's .first() to choose only the first one.
Also, since the required link doesn't have any identifier, I select it using its own href, change it to what I need, and lastly add the "clickit" ID to it so I can call it later.
Now, these are my questions:
I'm not exactly sure why it isn't working. If I run the code, it fetches the first page only. After examining the requested page's source code, I do see that the href has been changed to what I want, but it just doesn't get called. I have two different theories on what might be wrong:
The modified href isn't getting "clicked", so the page isn't getting updated
The href does get clicked, but since the page takes a few seconds to load all results dynamically, I only get to dump the first page PhantomJS gets to see
What do you guys think about it?
[UPDATE NOV 6 2015]
Ok, so the answers provided by @Artjomb and @pguardiario pointed me in a new direction:
I needed more debugging info on what was going on
I needed to call gReport.navigate.paginate function directly
Sadly, I simply lack the experience to use PhantomJS properly. Several other samples indicated that I could achieve what I wanted with CasperJS, so I tried it. This is what I produced after a couple of hours:
var utils = require('utils');
var fs = require('fs');
var url = 'http://testingsite1.local';

var casper = require('casper').create({
    verbose: true,
    logLevel: 'debug'
});

casper.on('error', function(msg, backtrace) {
    this.echo("=========================");
    this.echo("ERROR:");
    this.echo(msg);
    this.echo(backtrace);
    this.echo("=========================");
});

casper.on("page.error", function(msg, backtrace) {
    this.echo("=========================");
    this.echo("PAGE.ERROR:");
    this.echo(msg);
    this.echo(backtrace);
    this.echo("=========================");
});

casper.start(url, function() {
    var url = this.evaluate(function() {
        $('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id', 'clicklink');
        return gReport.navigate.paginate('paginator_min_row=1max_rows=5000rows_fetched=5000');
    });
});

casper.then(function() {
    this.waitForSelector('.nonexistant', function() {
        // nothing here
    }, function() {
        // the selector never appeared within 50 seconds; dump the page now
        this.capture('screen.png');
        var html = this.getPageContent();
        var f = fs.open('test.html', 'w');
        f.write(html);
        f.close();
    }, 50000);
});

casper.run(function() {
    this.exit();
});
Please be gentle, as I know this code sucks; I'm no JavaScript expert and in fact I know very little of it. I know I should have waited for an element to appear, but it simply didn't work in my tests, as I was still getting the page without the update from the AJAX request.
In the end, I waited a long time (50 seconds) for the AJAX request to show up on the page and then dumped the HTML.
Oh, and calling the function directly did work great!
The href does get clicked, but since the page takes a few seconds to load all results dynamically, I only get to dump the first page PhantomJS gets to see
It's easy to check whether that's the case by wrapping the render, write, and exit calls in setTimeout and trying different timeouts:
page.open('http://testingsite1.local', function () {
    page.evaluate(function(){
        $('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').first().attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id', 'clickit');
        $('#clickit')[0].click();
    });
    setTimeout(function(){
        page.render('test.png');
        fs.write('test.html', page.content, 'w');
        phantom.exit();
    }, 5000);
});
If it's really just a timeout issue, then you should use the waitFor() function to wait for a specific condition like "all elements loaded" or "x elements of that type are loaded".
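For reference, a trimmed-down waitFor() along the lines of the PhantomJS examples (a sketch; the row selector is a guess at what "all results loaded" might mean for your page):
function waitFor(testFx, onReady, timeOutMillis) {
    var maxtimeOutMillis = timeOutMillis || 5000,
        start = new Date().getTime(),
        interval = setInterval(function() {
            if ((new Date().getTime() - start < maxtimeOutMillis) && !testFx()) {
                return; // condition not met yet, keep polling
            }
            clearInterval(interval);
            onReady(); // called on success or after the timeout
        }, 250);
}

// Usage sketch: wait until more than 15 rows are present, then dump the page.
waitFor(function() {
    return page.evaluate(function() {
        return document.querySelectorAll('tr').length > 15;
    });
}, function() {
    page.render('test.png');
    fs.write('test.html', page.content, 'w');
    phantom.exit();
}, 10000);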
The modified href isn't getting "clicked", so the page isn't getting updated
This is a little trickier. You can listen to the onConsoleMessage, onError, onResourceError, and onResourceTimeout events (Example) and see if there are errors on the page. Some of those errors are fixable by things you can do in PhantomJS: Function.prototype.bind not available, or HTTPS sites/resources cannot be loaded.
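A sketch of wiring those up before calling page.open (these are standard PhantomJS callbacks; the handlers here just log):
page.onConsoleMessage = function(msg) {
    console.log('CONSOLE: ' + msg);
};
page.onError = function(msg, trace) {
    console.log('PAGE ERROR: ' + msg);
};
page.onResourceError = function(resourceError) {
    console.log('RESOURCE ERROR: ' + resourceError.url + ' - ' + resourceError.errorString);
};
page.onResourceTimeout = function(request) {
    console.log('RESOURCE TIMEOUT: ' + request.url);
};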
There are other ways to click something that are more reliable such as this one.
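One such alternative is dispatching a synthetic mouse event instead of calling the element's click() (a sketch, using the #clickit ID from the question):
page.evaluate(function() {
    var el = document.querySelector('#clickit');
    var ev = document.createEvent('MouseEvents');
    ev.initMouseEvent('click', true, true, window, 1, 0, 0, 0, 0,
                      false, false, false, false, 0, null);
    el.dispatchEvent(ev); // fires bubbling/cancelable listeners, not just onclick
});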

PhantomJS how to render javascript in html string

I'm trying to get PhantomJS to take an HTML string and then have it render the full page as a browser would (including execution of any JavaScript in the page source). I need the resulting HTML as a string. I have seen examples of page.open, which is of no use since I already have the page source in my database.
Do I need to use page.open to trigger the JavaScript rendering engine in PhantomJS? Is there any way to do this all in memory (i.e., without page.open making a request or reading/writing HTML source from/to disk)?
I have seen a similar question and answer here, but it doesn't quite solve my issue. After running the code below, nothing I do seems to render the JavaScript in the HTML source string.
var page = require('webpage').create();
page.setContent('raw html and javascript in this string', 'http://whatever.com');
// everything I've tried from here on doesn't execute the javascript in the string
--------------Update---------------
I tried the following based on the suggestion below, but it still does not work. It just returns the raw source that I supplied, with no JavaScript rendered.
var page = require('webpage').create();
page.settings.localToRemoteUrlAccessEnabled = true;
page.settings.webSecurityEnabled = false;

page.onLoadFinished = function(){
    var resultingHtml = page.evaluate(function() {
        return document.documentElement.innerHTML;
    });
    console.log(resultingHtml);
    //console.log(page.content); // this didn't work either
    phantom.exit();
};

page.url = input.Url;
page.content = input.RawHtml;
//page.setContent(input.RawHtml, input.Url); // this didn't work either
The following works:
page.onLoadFinished = function(){
    console.log(page.content); // rendered content
};
page.content = "your source html string";
But you have to keep in mind that if you set the page from a string, the domain will be about:blank. So if the HTML loads resources from other domains, you should run PhantomJS with the --web-security=false --local-to-remote-url-access=true command-line options:
phantomjs --web-security=false --local-to-remote-url-access=true script.js
Additionally, you may need to wait for the completion of the JavaScript execution, which might not be finished when PhantomJS thinks the page has loaded. Use either setTimeout() to wait a static amount of time or waitFor() to wait for a specific condition on the page. More robust ways to wait for a full page are given in this question: phantomjs not waiting for "full" page load
The setTimeout approach made it work, even though I'm not excited about waiting a set amount of time for each page. The waitFor approach discussed here doesn't work for me, since I have no idea what elements each page might have.
var page = require('webpage').create();

// input.* values come from my own context; they are not defined by PhantomJS
page.setContent(input.RawHtml, input.Url);

window.setTimeout(function () {
    console.log(page.content);
    phantom.exit();
}, input.WaitToRenderTimeInMilliseconds);
Maybe not the answer you want, but using PhantomJsCloud.com you can do it easily. Here's an example: http://api.phantomjscloud.com/api/browser/v2/a-demo-key-with-low-quota-per-ip-address/?request={url:%22http://example.com%22,content:%22%3Ch1%3ENew%20Content!%3C/h1%3E%22,renderType:%22png%22,scripts:{domReady:[%22var%20hiDiv=document.createElement%28%27div%27%29;hiDiv.innerHTML=%27Hello%20World!%27;document.body.appendChild%28hiDiv%29;window._pjscMeta.scriptOutput={Goodbye:%27World%27};%22]},outputAsJson:false} The "New Content!" is the content that replaces the original content, and the "Hello World!" is placed in the page by a script.
If you want to do this via normal PhantomJS, you'll need to use the injectJs or includeJs functions after the page content is loaded.
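For instance, a sketch with includeJs after setting the content (the jQuery CDN URL is just an example, and rawHtml stands in for your HTML string):
var page = require('webpage').create();
var rawHtml = '<html><body><p>hello</p></body></html>'; // placeholder

page.setContent(rawHtml, 'http://example.com/');
page.includeJs('https://code.jquery.com/jquery-1.11.1.min.js', function() {
    // Runs once the external script has loaded and executed in the page.
    var text = page.evaluate(function() {
        return jQuery('body').text();
    });
    console.log(text);
    phantom.exit();
});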

Why does this phantomjs code return null and the document title?

I am trying to learn PhantomJS. I would appreciate it if you could help me understand why the code below gives me an error (shown below) and help me fix it. I am trying to execute some JavaScript on a page using PhantomJS. The lines of code in the evaluate function work well when I enter them in the Chrome console, i.e., they give the expected result (document.title).
Thank you.
PhantomJS Code
var page = require('webpage').create();
var url = 'http://www.google.com';

page.open(url, function(status) {
    var title = page.evaluate(function(query) {
        document.querySelector('input[name=q]').setAttribute('value', query);
        document.querySelector('input[name="btnK"]').click();
        return document.title;
    }, 'phantomJS');
    console.log(title);
    phantom.exit();
});
Error
TypeError: 'null' is not an object (evaluating 'document.querySelector('input[name="btnK"]').click')
phantomjs://webpage.evaluate():4
phantomjs://webpage.evaluate():7
phantomjs://webpage.evaluate():7
null
Edit 1: In response to Andrew's answer
Andrew, it is strange, but on my computer the button is an input element. The following screenshot shows the result on my computer.
Edit 2: click event unreliable
Sometimes, the following click event works, sometimes it does not.
document.querySelector('input[name="btnK"]')
It's not clear to me what is happening.
About the answer
For future readers: in addition to the answer, the gist by Artjom B. is helpful in understanding what is happening. However, for a more robust solution, I think something like the waitfor.js example will have to be used (as suggested in the answer). I hope it is okay to copy and paste Artjom B.'s gist here. While the gist below works (with the form submit), it is still not clear to me why it does not work if I try to simulate the click on the input button. If anyone can clarify that, it would be great.
// Gist by Artjom B.
var page = require('webpage').create();
var url = 'http://www.google.com';

page.open(url, function(status) {
    var query = 'phantomJS';
    page.evaluate(function(query) {
        document.querySelector('input[name=q]').value = query;
        document.querySelector('form[action="/search"]').submit();
    }, query);

    setTimeout(function(){
        var title = page.evaluate(function() {
            return document.title;
        });
        console.log(title);
        phantom.exit();
    }, 2000);
});
Google uses a form for submitting its queries. It's also highly likely that Google has changed the prototype methods for their search buttons, so it's not really the best site to test web scraping.
The easiest way to do this is to actually perform a form submit, which slightly tweaks your example.
var page = require('webpage').create();
var url = 'http://www.google.com';

page.open(url, function(status) {
    var query = 'phantomJS';
    var title = page.evaluate(function(query) {
        document.querySelector('input[name=q]').value = query;
        document.querySelector('form[action="/search"]').submit();
        return document.title;
    }, query);
    console.log(title);
    phantom.exit();
});
Note that the response from this call is async, so getting the title directly will likely result in an undefined error (you need to account for the time it takes for the page to load before looking up data; you can review this in their waitfor.js example).
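A sketch of one way to handle that: since the form submit triggers a navigation, onLoadFinished fires again when the results page is ready, so the title can be read there instead:
var page = require('webpage').create();
var url = 'http://www.google.com';

page.open(url, function(status) {
    page.evaluate(function(query) {
        document.querySelector('input[name=q]').value = query;
        document.querySelector('form[action="/search"]').submit();
    }, 'phantomJS');

    // The submit navigates the page, so onLoadFinished fires once more
    // when the results page has loaded.
    page.onLoadFinished = function() {
        console.log(page.evaluate(function() {
            return document.title;
        }));
        phantom.exit();
    };
});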
You can open google.com and try document.querySelector('input[name="btnK"]') in the console; it's null.
Actually, try replacing input with button:
document.querySelector('button[name="btnK"]')

Web page Capture and save to image using phantomjs lib

I was searching Google for any JS lib which can capture an image of any website or URL. I came to know that the PhantomJS library can do it. Here is a small piece of code I found which captures and converts the GitHub home page to a PNG image.
If anyone is familiar with PhantomJS, please tell me the meaning of this line:
var page = require('webpage').create();
Can I give any name here instead of webpage?
And if I need to capture only a portion of a webpage, how can I do it with the help of this library? Can anyone guide me?
var page = require('webpage').create();
page.open('http://github.com/', function () {
    page.render('github.png');
    phantom.exit();
});
https://github.com/ariya/phantomjs/wiki
thanks
Here is a simple PhantomJS script for grabbing an image:
var page = require('webpage').create(),
    system = require('system'),
    address, output, size;

address = "http://google.com";
output = "your_image.png";

page.viewportSize = { width: 900, height: 600 };
page.open(address, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            page.render(output);
            console.log('done');
            phantom.exit();
        }, 10000);
    }
});
Where:
'address' is your URL string.
'output' is your filename string.
Also, 'width' & 'height' set the viewport dimensions used for layout (comment this out if you want the defaults).
To run this from the command line, save the above as 'script_name.js' and fire off phantomjs with the JS file as the first argument.
Hope this helps :)
The line you ask about:
var page = require('webpage').create();
As far as I can tell, that line does three things: it imports a module, require('webpage'); creates a WebPage object in PhantomJS, .create(); and then assigns that object to the variable page.
The name "webpage" tells it which module to import.
http://phantomjs.org/api/webpage/
I too need a way to use page.render() to capture just one section of a web page, but I don't see an easy way to do this. It would be nice to select a page element by ID and just render out that element at whatever size it is. They should really add that in the next version of PhantomJS.
For now, my only workaround is to add an anchor to my URL (http://example.com/page.html#element) to make the page scroll to the element that I want, and then set a width and height that get close to the size I need.
I recently discovered that I can manipulate the page somewhat before rendering, so I want to try to use this technique to hide all of the other elements except the one I want to capture. I have not tried this yet, but maybe I will have some success.
See this page and look at how they use querySelector(): https://github.com/ariya/phantomjs/blob/master/examples/technews.js
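For anyone attempting the section capture: one approach that may work is combining getBoundingClientRect() with page.clipRect, which restricts render() to a rectangle ('#element' is a placeholder selector):
// Sketch: render only one element of the page.
var rect = page.evaluate(function() {
    var r = document.querySelector('#element').getBoundingClientRect();
    // Copy to a plain object, since DOM objects don't survive evaluate().
    return { top: r.top, left: r.left, width: r.width, height: r.height };
});
page.clipRect = rect;
page.render('element.png');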
