I am trying to scrape all the content of a specific TripAdvisor page. With the code below I get the full HTML of the page, including the content I want to scrape. What I would like to do with PhantomJS is manipulate the page to select three things before downloading the HTML:
Select sorting by 'Date'
Select 'Any' language
Expand every 'More' button so that all the reviews are displayed in full.
I attached a screenshot to make it more clear.
http://www.tripadvisor.com/Restaurant_Review-g187234-d2631590-Reviews-Le_Bedouin_Chez_Michel-Nice_French_Riviera_Cote_d_Azur_Provence.html#REVIEWS
// scrape_techstars.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'reviews.html';
page.open('http://www.tripadvisor.com/Restaurant_Review-g187234-d2631590-Reviews-Le_Bedouin_Chez_Michel-Nice_French_Riviera_Cote_d_Azur_Provence.html#REVIEWS', function (status) {
    var content = page.content;
    fs.write(path, content, 'w');
    phantom.exit();
});
Can anyone with experience with this JS library tell me how to execute these actions?
Thanks!
You'll want to add an onLoadFinished handler. If it were me, I would inject jQuery and use it to interact with the DOM.
page.onLoadFinished = function() {
    page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
        page.evaluate(function() {
            // do dom stuff here
        });
    });
};
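As a rough sketch of what the "do dom stuff here" part might look like for expanding links, the snippet below clicks every element that matches a selector. The URL and the selector 'span.moreLink' are placeholders I made up, not TripAdvisor's real markup, so inspect the page to find the right ones. The function is passed to page.evaluate() directly because PhantomJS serializes it and it cannot see variables from the outer script.

```javascript
// Runs inside the page: clicks every element matching `selector` and returns
// how many were clicked. It only uses the page's global `document`, because
// page.evaluate() serializes the function and closures are lost.
function clickAll(selector) {
    var nodes = document.querySelectorAll(selector);
    for (var i = 0; i < nodes.length; i++) {
        // Dispatch a synthetic click; older PhantomJS builds lack el.click()
        // on non-form elements.
        var ev = document.createEvent('MouseEvents');
        ev.initEvent('click', true, true); // bubbles, cancelable
        nodes[i].dispatchEvent(ev);
    }
    return nodes.length;
}

// Only runs under PhantomJS; skipped in any other JavaScript runtime.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://example.com/'; // placeholder URL (assumption)
    page.open(url, function (status) {
        if (status === 'success') {
            // 'span.moreLink' is a hypothetical selector, not the real one
            var count = page.evaluate(clickAll, 'span.moreLink');
            console.log('clicked ' + count + ' elements');
        }
        phantom.exit();
    });
}
```

If the clicks trigger further requests, the new content will probably not be in page.content straight away, so some waiting is likely needed before saving the HTML.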
I am going to scrape some content from a website that uses JavaScript to load dynamic content. Previously I used request and cheerio to scrape, and they worked just fine, but I have just found out that request and cheerio cannot scrape dynamic content. After some research, I found PhantomJS, which can get all the content after the page has loaded. My problem now is that I cannot use jQuery selectors the way I did in cheerio. This is my sample code, but the selector returns nothing.
var page = require('webpage').create();
var url = 'http://angkorauto.com/vehicle';
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            // console.log(page.content);
            page.includeJs('https://cdnjs.cloudflare.com/ajax/libs/jquery/3.1.1/jquery.min.js', function(){
                page.evaluate(function(){
                    console.log($('.divTitle').find('a').attr('href'));
                });
            });
            phantom.exit();
        }, 1500);
    }
});
Could you help me with this problem? I am really stuck now.
Thanks for your time.
The website you want to scrape already has jQuery (like many other websites), so you don't have to load it again.
This works fine:
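var page = require('webpage').create();
var url = 'http://angkorauto.com/vehicle';
page.open(url, function(status) {
    // Runs in the page context and uses the jQuery the site already loads
    var href = page.evaluate(function(){
        return jQuery('.divTitle').find('a').attr('href');
    });
    console.log(href);
    phantom.exit(); // without this the PhantomJS process never terminates
});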
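If you are not sure whether a page ships jQuery, a sketch along these lines could check first and inject it only when it is missing. Two details are my reading of the question's script rather than something stated above: phantom.exit() there runs before the includeJs callback fires, and console.log inside page.evaluate() prints in the page, not in the terminal, unless page.onConsoleMessage forwards it. The helper name hasJQuery is made up for this example.

```javascript
// Runs inside the page: reports whether the page already exposes jQuery.
function hasJQuery() {
    return typeof jQuery === 'function';
}

// Only runs under PhantomJS; skipped in any other JavaScript runtime.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://angkorauto.com/vehicle';

    // Forward console.log calls made inside the page to the terminal.
    page.onConsoleMessage = function (msg) {
        console.log('page: ' + msg);
    };

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
            return;
        }
        function scrape() {
            var href = page.evaluate(function () {
                return jQuery('.divTitle').find('a').attr('href');
            });
            console.log(href);
            phantom.exit(); // exit only after the value has been read
        }
        if (page.evaluate(hasJQuery)) {
            scrape();
        } else {
            // Inject jQuery, then scrape once the script has loaded.
            page.includeJs('https://cdnjs.cloudflare.com/ajax/libs/jquery/3.1.1/jquery.min.js', scrape);
        }
    });
}
```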
I am running some tests against a page using Nightwatch.js. I need to get the text content of an Ace Editor that's on the page, so that I can compare it against some JSON.
Is this possible?
Thanks in advance!
Instead, I hooked into the page's angular controller and grabbed the data I needed from there:
'is the Splash-Live schema set up correctly?': function (client) {
    var pageSplashLive = client.page.pageSplashLive();
    var code;
    login(client);
    pageSplashLive.navigate()
        .waitForElementVisible('#jsonButton', 15000)
        .click('#jsonButton')
        .waitForElementVisible('.ace_content', 10000)
        .api.execute("return angular.element($('.data-editor-form')).scope()['ctrl']['selectedData']['schema']['properties'];", [], function(response) {
            code = JSON.stringify(response.value);
            console.log(code);
            client.assert.equal(code,
                'json_goes_here');
        });
},
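One thing to watch with the comparison above: JSON.stringify output depends on key order, so two equivalent objects can produce different strings. A small helper like the following (the names canonicalize and sameJson are made up for this sketch) compares the returned value with the expected JSON text regardless of key order.

```javascript
// Returns a copy of `value` with object keys sorted recursively, so that
// equivalent structures serialize to the same string.
function canonicalize(value) {
    if (Array.isArray(value)) {
        return value.map(canonicalize);
    }
    if (value !== null && typeof value === 'object') {
        var sorted = {};
        Object.keys(value).sort().forEach(function (key) {
            sorted[key] = canonicalize(value[key]);
        });
        return sorted;
    }
    return value;
}

// Compares a value returned from the browser with expected JSON text.
function sameJson(actual, expectedText) {
    var expected = JSON.parse(expectedText);
    return JSON.stringify(canonicalize(actual)) ===
        JSON.stringify(canonicalize(expected));
}
```

Inside the execute callback this could be used as sameJson(response.value, 'json_goes_here'), with the result handed to whichever boolean assertion your Nightwatch version provides.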
I have a simple PhantomJS script that renders a website's JavaScript-generated content to HTML. (Some data is then extracted from the HTML using another tool.)
var page = require('webpage').create();
var fs = require('fs'); // File System Module
var output = '/tmp/sourcefile'; // path for saving the local file
page.open('targeturl', function() { // open the file
    fs.write(output, page.content, 'w'); // Write the page to the local file using page.content
    phantom.exit(); // exit PhantomJs
});
(I got these lines of code from http://kochi-coders.com/2014/05/06/scraping-a-javascript-enabled-web-page-using-beautiful-soup-and-phantomjs/)
This used to work when all targets had direct links. Now they are behind the same URL, and there is a drop-down menu:
<select id="observation-station-menu" name="station" onchange="updateObservationProductsBasedOnForm(this);">
<option value="101533">Alajärvi Möksy</option>
...
<option value="101541">Äänekoski Kalaniemi</option>
</select>
This is the menu item I would actually like to load:
<option value="101632">Joensuu Linnunlahti</option>
Because of this menu, my script only downloads data for the default location. How can I load the content of another item from the menu and download the HTML for that item instead?
My target site is this: http://ilmatieteenlaitos.fi/suomen-havainnot
(If there is a better way than PhantomJS to do this, I could use it just as well. My interest is in dealing with the data once it has been scraped, and I chose PhantomJS just because it was the first thing that worked. Some options might be limited because my server is a Raspberry Pi and they might not work on it: Python Selenium: Firefox profile error)
Since the page has jQuery, you can do:
page.open('targeturl', function() { // open the file
    page.evaluate(function() {
        jQuery('#observation-station-menu').val('101632').change();
    }); // select the menu option, then fire its change event
    fs.write(output, page.content, 'w'); // Write the page to the local file using page.content
    phantom.exit(); // exit PhantomJs
});
You could directly call the function, which is defined in the underlying js on that page:
var page = require('webpage').create();
var fs = require('fs'); // File System Module
var output = '/tmp/sourcefile'; // path for saving the local file
page.open('targeturl', function() { // open the file
    page.evaluate(function() {
        updateObservationProducts(101632, 'weather');
    });
    window.setTimeout(function () {
        fs.write(output, page.content, 'w'); // Write the page to the local file using page.content
        phantom.exit(); // exit PhantomJs
    }, 1000); // Change timeout as required to allow sufficient time
});
For waiting until the page has rendered, see the question 'phantomjs not waiting for "full" page load'; I copied part of rhunwicks's solution from there.
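Instead of a fixed timeout, you could poll until the page shows what you expect. This is only a sketch: the waitFor helper is written here for illustration (it is not part of PhantomJS), and '#station-name' is a hypothetical element I made up, so replace the condition with something that really changes on the target page once the new station has loaded.

```javascript
// Calls `condition` repeatedly until it returns true (then `onReady` runs)
// or until `timeoutMs` has elapsed (then `onTimeout` runs).
function waitFor(condition, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    function tick() {
        if (condition()) {
            onReady();
            return;
        }
        if (Date.now() - start >= timeoutMs) {
            onTimeout();
            return;
        }
        setTimeout(tick, intervalMs);
    }
    tick(); // the first check happens immediately
}

// Only runs under PhantomJS; skipped in any other JavaScript runtime.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var fs = require('fs');
    var output = '/tmp/sourcefile';
    page.open('targeturl', function () {
        page.evaluate(function () {
            jQuery('#observation-station-menu').val('101632').change();
        });
        waitFor(function () {
            return page.evaluate(function () {
                // '#station-name' is a hypothetical selector (assumption)
                var el = document.querySelector('#station-name');
                return !!el && el.textContent.indexOf('Joensuu') !== -1;
            });
        }, function () {
            fs.write(output, page.content, 'w');
            phantom.exit();
        }, function () {
            console.log('Timed out waiting for the station data');
            phantom.exit(1);
        }, 10000, 250);
    });
}
```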
I am writing a test that clicks a button, which opens a new tab and takes you to a new website. I want to read that website's URL so that I can parse the part after the rfp code in the page name. I then open a decoder site and use it to decode the value and make sure the decoded page name works properly.
The code I'm using:
this.switchesToGetQuotePage = function() {
    browser.getAllWindowHandles().then(function(handles) {
        newWindowHandle = handles[1]; // this is your new window
        browser.switchTo().window(newWindowHandle).then(function() {
            getCurrentUrl.then(function(text) {
                console.log(text);
            });
        });
    });
};
When I call the getCurrentUrl function it returns below as the value:
data: ,
Use Protractor's built-in getLocationAbsUrl() to get the URL of the current page if it's Angular-based. Here's how:
browser.getLocationAbsUrl().then(function(url){
    console.log(url);
});
However, if you are working on a non-Angular page, wait until the page has loaded, because the URL keeps changing (through redirections) until the final page is delivered to the client, and then call getCurrentUrl() on the page. Here's how:
var ele = $("ELEMENT_ON_NEW_PAGE"); // replace it with an element on the new page
browser.switchTo().window(newWindowHandle).then(function() {
    browser.wait(protractor.ExpectedConditions.visibilityOf(ele), 10000).then(function(){
        browser.getCurrentUrl().then(function(text) {
            console.log(text);
        });
    });
});
Hope it helps.
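If you do not have a convenient element to wait for, another option might be to wait on the URL itself until it stops being the blank "data:," placeholder. This is a sketch under assumptions: isRealUrl is a helper name made up here, and taking the last handle as the new tab is a guess about your setup.

```javascript
// True once the tab has left the blank placeholder URLs a new tab starts on.
function isRealUrl(url) {
    return typeof url === 'string' &&
        url.indexOf('data:') !== 0 &&
        url !== 'about:blank';
}

// Only runs inside a Protractor test, where the global `browser` exists.
if (typeof browser !== 'undefined') {
    browser.getAllWindowHandles().then(function (handles) {
        // Assumes the newly opened tab is the last handle in the list.
        var newWindowHandle = handles[handles.length - 1];
        return browser.switchTo().window(newWindowHandle);
    }).then(function () {
        // Poll the URL for up to 10 seconds until it is a real address.
        return browser.wait(function () {
            return browser.getCurrentUrl().then(isRealUrl);
        }, 10000);
    }).then(function () {
        return browser.getCurrentUrl();
    }).then(function (url) {
        console.log(url);
    });
}
```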
I was searching Google for a JS library that can capture an image of any website or URL, and I learned that the PhantomJS library can do it. I found a small piece of code that captures the GitHub home page and converts it to a PNG image.
If anyone is familiar with PhantomJS, please tell me the meaning of this line:
var page = require('webpage').create();
Can I use any name here instead of webpage?
If I need to capture only a portion of a webpage, how can I do it with this library? Can anyone guide me?
var page = require('webpage').create();
page.open('http://github.com/', function () {
    page.render('github.png');
    phantom.exit();
});
https://github.com/ariya/phantomjs/wiki
thanks
Here is a simple phantomjs script for grabbing an image:
var page = require('webpage').create(),
    system = require('system'),
    address, output, size;
address = "http://google.com";
output = "your_image.png";
page.viewportSize = { width: 900, height: 600 };
page.open(address, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            page.render(output);
            console.log('done');
            phantom.exit();
        }, 10000);
    }
});
Where..
'address' is your url string.
'output' is your filename string.
Also 'width' & 'height' are the dimensions of what area of the site to capture (comment this out if you want the whole page)
To run this from the command line, save the above as 'script_name.js' and launch phantomjs with the JS file as the first argument.
Hope this helps :)
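The script above requires the system module but hard-codes the address and file name. As a sketch, the two values could be read from the command line through system.args, where index 0 is the script name. The parseArgs helper and its defaults are only illustrative.

```javascript
// Turns PhantomJS's system.args into capture settings, falling back to
// the values used in the script above when an argument is missing.
function parseArgs(args) {
    return {
        address: args[1] || 'http://google.com',
        output: args[2] || 'your_image.png'
    };
}

// Only runs under PhantomJS; skipped in any other JavaScript runtime.
if (typeof phantom !== 'undefined') {
    var opts = parseArgs(require('system').args);
    var page = require('webpage').create();
    page.viewportSize = { width: 900, height: 600 };
    page.open(opts.address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
            return;
        }
        page.render(opts.output);
        console.log('done');
        phantom.exit();
    });
}
```

It would then be run as: phantomjs script_name.js http://example.com/ shot.png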
The line you ask about:
var page = require('webpage').create();
As far as I can tell, that line does three things: it loads a module with require('webpage'), creates a WebPage object in PhantomJS with .create(), and then assigns that object to the variable page.
The name "webpage" tells it which module to add.
http://phantomjs.org/api/webpage/
I too need a way to use page.render() to capture just one section of a web page, but I don't see an easy way to do this. It would be nice to select a page element by ID and just render out that element based at whatever size it is. They should really add that for the next version of PhantomJS.
For now, my only workaround is to add an anchor tag to my URL http://example.com/page.html#element to make the page scroll to the element that I want, and then set a width and height that gets close to the size I need.
I recently discovered that I can manipulate the page somewhat before rendering, so I want to try to use this technique to hide all of the other elements except the one I want to capture. I have not tried this yet, but maybe I will have some success.
See this page and look at how they use querySelector(): https://github.com/ariya/phantomjs/blob/master/examples/technews.js
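One thing that may be worth trying is the clipRect property documented for the WebPage object, which limits page.render() to a rectangle. The sketch below measures an element with getBoundingClientRect() and renders only that area. I have not verified it against every PhantomJS version. The URL and the '#element' selector are just the placeholders from the example above, toClipRect is a helper name made up here, and the coordinates assume the page has not been scrolled.

```javascript
// Converts a DOM bounding rectangle into the whole-pixel object shape that
// page.clipRect expects.
function toClipRect(rect) {
    return {
        top: Math.round(rect.top),
        left: Math.round(rect.left),
        width: Math.round(rect.width),
        height: Math.round(rect.height)
    };
}

// Only runs under PhantomJS; skipped in any other JavaScript runtime.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.viewportSize = { width: 900, height: 600 };
    page.open('http://example.com/page.html', function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
            return;
        }
        // Measure the element inside the page; plain numbers are returned
        // because only serializable values survive page.evaluate().
        var rect = page.evaluate(function (selector) {
            var el = document.querySelector(selector);
            if (!el) { return null; }
            var r = el.getBoundingClientRect();
            return { top: r.top, left: r.left, width: r.width, height: r.height };
        }, '#element');
        if (rect) {
            page.clipRect = toClipRect(rect);
            page.render('element.png');
        } else {
            console.log('Element not found');
        }
        phantom.exit();
    });
}
```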