I am going to scrape some content from a website that uses JavaScript to load dynamic content. Previously, I used request and cheerio to scrape, and they worked just fine, but I just found out that request and cheerio cannot scrape dynamic content. After doing some research, I found PhantomJS, which can get all the content after the page has loaded. My problem now is that I cannot use jQuery selectors the way I did in cheerio. This is my sample code, but the selector returns nothing.
var page = require('webpage').create();
var url = 'http://angkorauto.com/vehicle';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            // console.log(page.content);
            page.includeJs('https://cdnjs.cloudflare.com/ajax/libs/jquery/3.1.1/jquery.min.js', function () {
                page.evaluate(function () {
                    console.log($('.divTitle').find('a').attr('href'));
                });
            });
            phantom.exit();
        }, 1500);
    }
});
Could you help me with this problem? I'm really stuck now.
Thanks for your time.
The website you want to scrape already has jQuery (like many other websites), so you don't have to load it again.
This works fine:
var page = require('webpage').create();
var url = 'http://angkorauto.com/vehicle';

page.open(url, function (status) {
    var href = page.evaluate(function () {
        return jQuery('.divTitle').find('a').attr('href');
    });
    console.log(href);
    phantom.exit(); // without this the script never terminates
});
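One caveat worth knowing here: page.evaluate() runs inside the page's context and can only return serializable values (strings, numbers, plain arrays and objects), never DOM nodes or jQuery objects. So if you want every listing link rather than just the first, return an array of strings. A minimal sketch (collectHrefs is a helper name of my own, not a PhantomJS or jQuery API; the '.divTitle a' selector comes from the question):

```javascript
// Pull plain href strings out of anchor-like objects so that only
// serializable data crosses the evaluate() boundary.
function collectHrefs(anchors) {
    var hrefs = [];
    for (var i = 0; i < anchors.length; i++) {
        if (anchors[i].href) {
            hrefs.push(anchors[i].href);
        }
    }
    return hrefs;
}

// In PhantomJS it would be wired up roughly like this:
// var hrefs = page.evaluate(function () {
//     return collectHrefs(document.querySelectorAll('.divTitle a'));
// });
// console.log(hrefs.join('\n'));
```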
I am trying to scrape all the content of a specific TripAdvisor page. Using the code below, I am getting the full .html with all the content to scrape. What I would like to do with PhantomJS is manipulate the page to select three things before downloading all the HTML:
Select sorting by 'Date'
Select 'Any' language
Expand all the 'More' buttons so that all the reviews are displayed in full.
I attached a screenshot to make it more clear.
http://www.tripadvisor.com/Restaurant_Review-g187234-d2631590-Reviews-Le_Bedouin_Chez_Michel-Nice_French_Riviera_Cote_d_Azur_Provence.html#REVIEWS
// scrape_techstars.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'reviews.html';

page.open('http://www.tripadvisor.com/Restaurant_Review-g187234-d2631590-Reviews-Le_Bedouin_Chez_Michel-Nice_French_Riviera_Cote_d_Azur_Provence.html#REVIEWS', function (status) {
    var content = page.content;
    fs.write(path, content, 'w');
    phantom.exit();
});
Can anyone with experience with this JS library tell me how to execute these actions?
Thanks!
You'll want to add an onLoadFinished function. If it were me, I would inject jQuery and use it to interact with the DOM.
page.onLoadFinished = function () {
    page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function () {
        page.evaluate(function () {
            // do dom stuff here
        });
    });
};
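For the three actions themselves, the usual pattern is to run clicks inside page.evaluate() and then give the page time to re-render before grabbing page.content. The selectors for TripAdvisor's sort control, language filter, and 'More' links have to be found by inspecting the page yourself; the reusable core is just clicking every matching node. A sketch (clickAll and all selectors below are assumptions, not TripAdvisor's real markup):

```javascript
// Click every node in a list and report how many were clicked.
// Inside page.evaluate() you would pass it the result of querySelectorAll.
function clickAll(nodes) {
    var clicked = 0;
    for (var i = 0; i < nodes.length; i++) {
        nodes[i].click();
        clicked++;
    }
    return clicked;
}

// In PhantomJS, roughly (all selectors are hypothetical placeholders):
// page.evaluate(function () {
//     document.querySelector('#sortOrder-DATE').click();   // sort by 'Date'
//     document.querySelector('#filterLang_ALL').click();   // 'Any' language
// });
// window.setTimeout(function () {
//     page.evaluate(function () {
//         clickAll(document.querySelectorAll('span.moreLink')); // expand 'More'
//     });
// }, 2000); // allow the page to reload the review list first
```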
All of the following will successfully redirect a user to another page (with their own caveats, of course):
window.location.replace(new_url)
window.location.assign(new_url)
window.location = new_url
The typical response to someone asking if you can get a callback for changing location is, of course, no, because whisking a user off to a new page means the scripts on your page are no longer active.
That's all fine and dandy, but in the case where you are using any of the three methods above to download a file, not only does the user stay on the same page as they are on, but there is also a slight lag (depending on network speeds) between updating location and when the file actually starts downloading.
In this situation (the user remaining on the page in which window.location was updated), is there any way to create a callback that would enable, for example, a loading icon being displayed the line prior to the redirect up until the file actually starts downloading?
You can create a hidden iFrame, pointing at the downloadable file. The user won't notice the difference, at the same time you can continue running scripts on the main document.
var count = 0; // counter so each hidden iframe gets a unique id

function downloadURL(url, callback) {
    var hiddenIFrameID = 'hiddenDownloader' + count++;
    var iframe = document.createElement('iframe');
    iframe.id = hiddenIFrameID;
    iframe.style.display = 'none';
    document.body.appendChild(iframe);
    iframe.src = url;
    callback();
}

downloadURL('http://...', function () { alert('is downloading'); });
I wrote the following code, inspired by @Semyon Krotkih's answer. It uses jQuery. It doesn't have any error handling, but it works :)
function downloadURL(url, callback) {
    var id = 'hiddenDownloader';
    $('body').append('<iframe id="' + id + '" style="display: block" src="' + url + '"></iframe>');
    var $iframe = $('#' + id);
    $iframe.on('load', function () {
        $iframe.remove();
        // no error support
        callback();
    });
}

downloadURL('http://example.com', function () {
    alert('Done');
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js"></script>
I've been debugging this for a long time and it has me completely baffled. I need to save ads to my computer for a work project. Here is an example ad that I got from CNN.com:
http://ads.cnn.com/html.ng/site=cnn&cnn_pagetype=main&cnn_position=300x250_rgt&cnn_rollup=homepage&page.allowcompete=no&params.styles=fs&Params.User.UserID=5372450203c5be0a3c695e599b05d821&transactionID=13999976982075532128681984&tile=2897967999935&domId=6f4501668a5e9d58&kxid=&kxseg=
When I visit this link in Google Chrome and Firefox, I see an ad (if the link stops working, simply go to CNN.com and grab the iframe URL for one of the ads). I developed a PhantomJS script that will save a screenshot and the HTML of any page. It works on any website, but it doesn't seem to work on these ads. The screenshot is blank and the HTML contains a tracking pixel (a 1x1 transparent gif used to track the ad). I thought that it would give me what I see in my normal browser.
The only thing that I can think of is that the AJAX calls are somehow messing up PhantomJS, so I hard-coded a delay but I got the same results.
Here is the most basic piece of test code that reproduces my problem:
var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        // Output results immediately
        var html = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        fs.write("HtmlBeforeTimeout.htm", html, 'w');
        page.render('RenderBeforeTimeout.png');

        // Output results after a delay (for AJAX)
        window.setTimeout(function () {
            var html = page.evaluate(function () {
                return document.getElementsByTagName('html')[0].innerHTML;
            });
            fs.write("HtmlAfterTimeout.htm", html, 'w');
            page.render('RenderAfterTimeout.png');
            phantom.exit();
        }, 9000); // 9-second delay
    }
});
You can run this code using this command in your terminal:
phantomjs getHtml.js 'http://www.google.com/'
The above command works well. When you replace the Google URL with an ad URL (like the one at the top of this post), it gives me the unexpected results I explained.
Thanks so much for your help! This is my first question that I've ever posted on here, because I can almost always find the answer by searching Stack Overflow. This one, however, has me completely stumped! :)
EDIT: I'm running PhantomJS 1.9.7 on Ubuntu 14.04 (Trusty Tahr)
EDIT: Okay, I've been working on it for a while now and I think it has something to do with cookies. If I clear all of my history and view the link in my browser, it also comes up blank. If I then refresh the page, it displays fine. It also displays fine if I open it in a new tab. The only time it doesn't is when I try to view it directly after clearing my cookies.
EDIT: I've tried loading the link twice in PhantomJS without exiting (manually requesting it twice in my script before calling phantom.exit()). It doesn't work. In the PhantomJS documentation it says that the cookie jar is enabled by default. Any ideas? :)
You should try using the onLoadFinished callback instead of checking for status in page.open. Something like this should work:
var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];

page.open(url);

page.onLoadFinished = function () {
    // Output results immediately
    var html = page.evaluate(function () {
        return document.getElementsByTagName('html')[0].innerHTML;
    });
    fs.write("HtmlBeforeTimeout.htm", html, 'w');
    page.render('RenderBeforeTimeout.png');

    // Output results after a delay (for AJAX)
    window.setTimeout(function () {
        var html = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        fs.write("HtmlAfterTimeout.htm", html, 'w');
        page.render('RenderAfterTimeout.png');
        phantom.exit();
    }, 9000); // 9-second delay
};
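Rather than a fixed 9-second delay, a common alternative is a small polling helper that fires as soon as some condition in the page becomes true, or gives up after a timeout. A minimal sketch of that pattern (the function name and signature are my own; it is plain JavaScript, not a PhantomJS API):

```javascript
// Calls onReady() as soon as testFx() returns true, or onTimeout() if
// maxMillis elapses first. The condition is checked synchronously once,
// then every 100 ms on an interval.
function waitFor(testFx, onReady, onTimeout, maxMillis) {
    var start = Date.now();
    if (testFx()) {          // fast path: condition already holds
        onReady();
        return;
    }
    var timer = setInterval(function () {
        if (testFx()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= maxMillis) {
            clearInterval(timer);
            onTimeout();
        }
    }, 100);
}
```

In the script above you could then wait for the ad markup to actually appear (e.g. a condition that checks page.evaluate() for more than the tracking pixel) instead of hoping 9 seconds is enough.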
I have an answer here that loops through all files in a local folder and saves images of the resulting pages: Using Phantom JS to convert all HTML files in a folder to PNG
The same principle applies to remote HTML pages.
Here is what I have from the output:
Before Timeout:
http://i.stack.imgur.com/GmsH9.jpg
After Timeout:
http://i.stack.imgur.com/mo6Ax.jpg
I'm having difficulties accessing the contentDocument of an iframe. I am using phantomjs (1.9). I have looked into various threads but none seem to have the answer.
This is my phantomjs script where I have injected jquery to try and select the element.
var page = require('webpage').create();

page.onConsoleMessage = function (msg, lineNum, sourceId) {
    console.log('CONSOLE: ' + msg);
};

page.onError = function (msg) {
    console.log('ERROR MESSAGE: ' + msg);
};

page.open('http://localhost:8080/', function () {
    page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
        page.evaluate(function () {
            console.log($('iframe').contentDocument.documentElement);
        });
        phantom.exit();
    });
});
Apart from jQuery, I have also used these two lines of code to get the DOM element I want (the HTML element inside the iframe). PhantomJS seems unable to parse anything beyond getElementsByTagName('iframe') or $('iframe'). Could it be because the iframe hasn't finished loading yet?
document.getElementsByTagName('iframe')[0].contentDocument.activeElement;
document.getElementsByTagName('iframe')[0].contentDocument.documentElement;
I am also running the script with web security disabled (--web-security=no).
I ran into this issue but found it was because I was not wrapping the code in evaluate(). You seem to be doing that, though. Try this, without jQuery:
page.evaluate(function () {
    var iframeDoc = document.getElementById('iframeName').contentDocument;
    iframeDoc.getElementById("testInput").value = "test";
});
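Note that contentDocument is null until the iframe has actually loaded (and is blocked for cross-origin frames unless web security is disabled), and some older engines only expose the document via contentWindow.document. A defensive accessor along these lines can help (getIframeDoc is a name of my own, not a standard API):

```javascript
// Return the iframe's document, falling back to contentWindow.document
// for older engines, or null if the frame is absent or not yet loaded.
function getIframeDoc(iframe) {
    if (!iframe) {
        return null;
    }
    return iframe.contentDocument ||
        (iframe.contentWindow && iframe.contentWindow.document) ||
        null;
}

// Usage inside page.evaluate():
// var doc = getIframeDoc(document.getElementsByTagName('iframe')[0]);
// if (doc) { console.log(doc.documentElement.outerHTML); }
```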
I was searching Google for a JS library that can capture an image of any website or URL, and I came to know that the PhantomJS library can do it. Here I have a small piece of code that captures the GitHub home page and converts it to a PNG image.
If anyone is familiar with PhantomJS, please tell me the meaning of this line:
var page = require('webpage').create();
Can I give any name here instead of webpage?
Also, if I need to capture only a portion of a webpage, how can I do that with the help of this library? Can anyone guide me?
var page = require('webpage').create();
page.open('http://github.com/', function () {
    page.render('github.png');
    phantom.exit();
});
https://github.com/ariya/phantomjs/wiki
Thanks!
Here is a simple phantomjs script for grabbing an image:
var page = require('webpage').create(),
    system = require('system'),
    address, output, size;

address = "http://google.com";
output = "your_image.png";
page.viewportSize = { width: 900, height: 600 };

page.open(address, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            page.render(output);
            console.log('done');
            phantom.exit();
        }, 10000);
    }
});
Where:
'address' is your URL string.
'output' is your filename string.
'width' and 'height' are the dimensions of the area of the site to capture (comment this out if you want the whole page).
To run this from the command line, save the above as script_name.js and fire off phantomjs with the JS file as the first argument.
Hope this helps :)
The line you ask about:
var page = require('webpage').create();
As far as I can tell, that line does three things: it loads a module with require('webpage'), creates a WebPage object in PhantomJS with .create(), and assigns that object to the variable page.
The name "webpage" tells it which module to add.
http://phantomjs.org/api/webpage/
I too need a way to use page.render() to capture just one section of a web page, but I don't see an easy way to do this. It would be nice to select a page element by ID and render out just that element at whatever size it is. They should really add that in the next version of PhantomJS.
For now, my only workaround is to add an anchor to my URL (http://example.com/page.html#element) to make the page scroll to the element I want, and then set a width and height that get close to the size I need.
I recently discovered that I can manipulate the page somewhat before rendering, so I want to try to use this technique to hide all of the other elements except the one I want to capture. I have not tried this yet, but maybe I will have some success.
See this page and look at how they use querySelector(): https://github.com/ariya/phantomjs/blob/master/examples/technews.js
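Another option is to set page.clipRect (a documented PhantomJS page property) from the target element's bounding box before calling page.render(). A sketch of the idea (toClipRect is a helper name of my own, and '#element' is a placeholder selector):

```javascript
// Convert a bounding-box-like object into a PhantomJS clipRect,
// with optional padding around the element.
function toClipRect(rect, padding) {
    padding = padding || 0;
    return {
        top: rect.top - padding,
        left: rect.left - padding,
        width: rect.width + 2 * padding,
        height: rect.height + 2 * padding
    };
}

// In PhantomJS ('#element' is a placeholder):
// page.clipRect = toClipRect(page.evaluate(function () {
//     var r = document.querySelector('#element').getBoundingClientRect();
//     return { top: r.top, left: r.left, width: r.width, height: r.height };
// }), 4);
// page.render('element.png');
```

This avoids both the anchor-scrolling workaround and having to hide every other element on the page.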