At work I'd like to parse some web pages. Unfortunately I can't add any real page to my example because the urls at work are confident. I can only try to explain what is the problem.
To parse I wrote the following script in R. As a mock url I used www.imdb.com.:
library(rvest)
library(plyr)
# urls
url <- "http://www.imdb.com/"
# parse
html <- try(read_html(url))
# select
select_meta <- function(html) {
html %>%
html_nodes(xpath = "//div") %>%
html_attrs # function to select meta
}
meta <- select_meta(html)
Problem is this script doesn't return anything for the pages I use at work. I guess this is because the scripts are generated by javascript. I found this tutorial which explains how to scrape javascript generated pages in R.
The code used to generate the page in the tutorial is the following:
// scrape_techstars.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'techstars.html'
page.open('http://www.techstars.com/companies/stats/', function (status) {
var content = page.content;
fs.write(path,content,'w')
phantom.exit();
});
I don't have any Javascript knowledge so I'm having trouble scaling page.open (which only works for 1 page) to multiple pages (at work I have to parse roughly 100 pages). So instead of relying on phantom js I'd rather have a solution which is completely R based (if this is totally inefficient and offensive to real coders, I apologise in advance). So the crux of my question is: "how can I generate several pages in R?".
This is a one-off thing so I'm not really thinking about reading up on Javascript or parsing. Thanks in advance for helping me out.
Related
I am trying to scrape data from an interactive map (looking to get crime data for a county). I am using R (rvest) and trying to use phantomjs too. I'm new to web scraping so I am not really understanding how all the elements work together (trying to get there).
The problem I believe I am having is that after I run the phantomjs and upload the html using R's rvest package, I end up with more scripts and no clear data in the html. My code is below.
writeLines("var url = 'http://www.google.com';
var page = new WebPage();
var fs = require('fs');
page.open(url, function (status) {
just_wait();
});
function just_wait() {
setTimeout(function() {
fs.write('cool.html', page.content, 'w');
phantom.exit();
}, 2500);
}
", con = "scrape.js")
A function that takes in the url that I want to scrape
s_scrape <- function(url = "https://gis.adacounty.id.gov/apps/crimemapper/",
js_path = "scrape.js",
phantompath = "/Users/alihoop/Documents/phantomjs/bin/phantomjs"){
# this section will replace the url in scrape.js to whatever you want
lines <- readLines(js_path)
lines[1] <- paste0("var url ='", url ,"';")
writeLines(lines, js_path)
command = paste(phantompath, js_path, sep = " ")
system(command)
}
Execute the js_scrape() function and get a html file saved as "cool.html"
js_scrape()
Where I am not understanding what to do next is the below R code:
map_data <- read_html('cool.html') %>%
html_nodes('script')
The output I get in the HTML via phantomjs is just scripts again. Looking for help on how to proceed when faced (in my mind) is javascript nested in javascript scripts(?)
Thank you!
This site uses javascript to make queries to the server. One solution is to reproduce the rest request and read the returning JSON file directly. This avoids the need to use Phantomjs.
From the developer tools options from your browser and looking through the xhr files, you will find a file(s) named "query" with a link similar to: "https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000"
Read this JSON response directly and convert to a list with the use of the jsonlite package:
library(jsonlite)
output<-jsonlite::fromJSON("https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000")
output$features
Find the first number in the link, (11 in this case) "FeatureServer/11/query?f=json". This number will determine which crime to query the server with. I found, it can take a value from 0 to 11. Enter 0 for arson, 4 for drugs, 11 for vandalism, etc.
I am writing a web crawler. I extracted heading and Main Discussion of the this link but I am unable to find any one of the comment (Ctrl+u -> Ctrl+f . Comment Text). I think the comments are written in JavaScript. Can I extract it?
RT are using a service from spot.im for comments
you need to do make two POST requests, first https://api.spot.im/me/network-token/spotim to get a token, then https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
i wrote a quick script to do this
import requests
import re
import json
def get_rt_comments(article_url):
spotim_spotId = 'sp_6phY2k0C' # spotim id for RT
post_id = re.search('([0-9]+)', article_url).group(0)
r1 = requests.post('https://api.spot.im/me/network-token/spotim').json()
spotim_token = r1['token']
payload = {
"count": 25, #number of comments to fetch
"sort_by":"best",
"cursor":{"offset":0,"comments_read":0},
"host_url": article_url,
"canonical_url": article_url
}
r2_url ='https://api.spot.im/conversation-read/spot/' + spotim_spotId + '/post/'+ post_id +'/get'
r2 = requests.post(r2_url, data=json.dumps(payload), headers={'X-Spotim-Token': spotim_token , "Content-Type": "application/json"})
return r2.json()
if __name__ == '__main__':
url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
comments = get_rt_comments(url)
print(comments)
Yes, if it can be viewed with a web browser, you can extract it.
If you look at the source it is really an iframe that loads a piece of javascript, that then creates a new tag in the document with the source of that script tag loading bundle.js, which really contains the commenting software. This in turns then fetches the actual comments.
Instead of going through this manually, you could consider using for example webkit to create a headless browser that executes the javascript like an ordinary browser. Then you can scrape from that instead of having to manually make your crawler fetch the external resources.
Examples of such headless browsers could be Spynner, Dryscape, or the PhantomJS derived PhantomPy (the latter seems to be an abandoned project now).
I'm trying to scrape county assessor data on historic property values for multiple parcels generated using javascript from https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=07101001 using phantomjs controlled by RSelenium.
'paraid' in the url is the 9 digit parcel number. I have a dataframe containing a list of parcel numbers that I'm interested in (a few hundred in total), but have been attempting to make the code work on a small subset of those:
parcel_nums
[1] "00905101" "00905102" "00905103" "00905104" "00905105"
[6] "00905106" "00905107" "00905108" "00905201" "00905202"
I need to scrape the data in the table generated on the page for each parcel and preserve it. I have chosen to write the page to a file "output.htm" and then parse the file afterwards. My code is as follows:
require(plyr)
require(rvest)
require(RSelenium)
require(tidyr)
require(dplyr)
parcel_nums <- prop_attr$APN[1:10] #Vector of parcel numbers
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
result <- remDr$phantomExecute("var page = this;
var fs = require(\"fs\");
page.onLoadFinished = function(status) {
var file = fs.open(\"output.htm\", \"w\");
file.write(page.content);
file.close();
};")
for (i in 1:length(parcel_nums)){
url <- paste("https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=",
parcel_nums[i], sep = "")
Sys.sleep(5)
emDr$navigate(url)
dat <- read_html("output.htm", encoding = "UTF-8") %>%
html_nodes("table") %>%
html_table(, header = T)
df <- data.frame(dat)
#assign parcel number to panel
df$apn <- parcel_nums[i]
#on first iteratation initialize final data frame, on sebsequent iterations append the final data frame
ifelse(i == 1, parcel_data <- df, parcel_data <- rbind(parcel_data, df))
}
remDr$close
pJS$stop()
This will work perfectly for one or two iterations of the loop, but it suddenly stops preserving the data generated by the javascript and produces an error:
Error in `$<-.data.frame`(`*tmp*`, "apn", value = "00905105") :
replacement has 1 row, data has 0
which is due to the parser not locating the table in the output file because it is not being preserved. I'm unsure if there is a problem with the implementation I've chosen or if there is some idiosycrasy of the particular site that is causing the issue. I am not familiar with JavaScript so the code snippet used is taken from an example I found. Thank you for any assistance.
The below answer worked perfectly. I also moved the Sys.sleep(5) to after the $navigate to allow the page time to load the javascript. The loop is now executing to completion.
require(plyr)
require(rvest)
require(RSelenium)
require(tidyr)
require(dplyr)
parcel_nums <- prop_attr$APN[1:10] #Vector of parcel numbers
#pJS <- phantom()
remDr <- remoteDriver()
remDr$open()
# #result <- remDr$executeScript("var page = this;
# var fs = require(\"fs\");
# page.onLoadFinished = function(status) {
# var file = fs.open(\"output.htm\", \"w\");
# file.write(page.content);
# file.close();
# };")
#length(parcel_nums)
for (i in 1:length(parcel_nums)){
url <- paste("https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=",
parcel_nums[i], sep = "")
Sys.sleep(5)
remDr$navigate(url)
doc <- htmlParse(remDr$getPageSource()[[1]])
doc_t<-readHTMLTable(doc,header = TRUE)$`NULL`
df<-data.frame(doc_t)
#assign parcel number to panel
df$apn <- parcel_nums[i]
#on first iteratation initialize final data frame, on sebsequent iterations append the final data frame
ifelse(i == 1, parcel_data <- df, parcel_data <- rbind(parcel_data, df))
}
remDr$close
This gave me a solution. And this should work with the the phantomJS too. I request you to test and reply.
I have lost an entire day trying to solve a similar issue. So I share my learning to help others save time and nerves..
I guess we need to understand that opening, navigating and other browsing actions through the remote driver need time to complete.
So we have to wait before we try to read or do anything on the pages we are expecting to scrape.
My problems were solved when I introduced Sys.sleep(5) after the remDr$navigate(url) call.
It seems that a neater solution consists of inserting an remDr$setTimeout(type = "page load", milliseconds = 10000) as suggested at how to check if page finished loading in RSelenium but didn't test it yet.
I'm developing application using AngularJS. Everything seems to be nice until I meet something that leads me to headache: SEO.
From many references, I found out that AJAX content crawled and indexed by Google bot or Bing bot 'is not that easy' since the crawlers don't render Javascript.
Currently I need a solution using PHP. I use PHP Slim Framework so my main file is index.php which contains function to echo the content of my index.html. My question is:
Is it possible to make a snapshot of rendered Javascript in HTML?
My strategy is:
If the request query string contains _escaped_fragment_, the application will generate a snapshot and give that snapshot as response instead of the exact file.
Any help would be appreciated. Thanks.
After plenty of times searching and researching, I finally managed to solve my problem by mixing PHP with PhantomJS (version 2.0). I use exec() function in PHP to run phantomJS and create Javascript file to take get the content of the targeted URL. Here are the snippets:
index.php
// Let's assume that you have a bin folder under your root folder directory which contains phantomjs.exe and content.js
$script = __DIR__ ."/bin/content.js";
$target = "http://www.kincir.com"; // target URL
$cmd = __DIR__."/bin/phantomjs.exe $script $target";
exec($cmd, $output);
return implode("", $output);
content.js
var webPage = require('webpage');
var system = require('system');
var page = webPage.create();
var url = system.args[1]; // This will get the second argument from $cmd, in this example, it will be the value of $target on index.php which is "http://www.kincir.com"
page.open(url, function (status) {
page.onLoadFinished = function () { // Make sure to return the content of the page once the page is finish loaded
var content = page.content;
console.log(content);
phantom.exit();
};
});
I recently published a project that gives PHP access to a browser. Get it here: https://github.com/merlinthemagic/MTS. It also relies on PhantomJS.
After downloading and setup you would simply use the following code:
$myUrl = "http://www.example.com";
$windowObj = \MTS\Factories::getDevices()->getLocalHost()->getBrowser('phantomjs')->getNewWindow($myUrl);
//now you can either retrive the DOM and parse it, like this:
$domData = $windowObj->getDom();
//this project also lets you manipulate the live page. Click, fill forms, submit etc.
I have about 100 static HTML pages that I want to apply some DOM manipulations to. They all follow the same HTML structure. I want to apply some DOM manipulations to each of these files, and then save the resulting HTML.
These are the manipulations I want to apply:
# [start]
$("h1.title, h2.description", this).wrap("<hgroup>");
if ( $("h1.title").height() < 200 ) {
$("div.content").addClass('tall');
}
# [end]
# SAVE NEW HTML
The first line (.wrap()) I could easily do with a find and replace, but it gets tricky when I have to determine the calculated height of an element, which can't be easily be determined sans-JavaScript.
Does anyone know how I can achieve this? Thanks!
While the first part could indeed be solved in "text mode" using regular expressions or a more complete DOM implementation in JavaScript, for the second part (the height calculation), you'll need a real, full browser or a headless engine like PhantomJS.
From the PhantomJS homepage:
PhantomJS is a command-line tool that packs and embeds WebKit.
Literally it acts like any other WebKit-based web browser, except that
nothing gets displayed to the screen (thus, the term headless). In
addition to that, PhantomJS can be controlled or scripted using its
JavaScript API.
A schematic instruction (which I admit is not tested) follows.
In your modification script (say, modify-html-file.js) open an HTML page, modify it's DOM tree and console.log the HTML of the root element:
var page = new WebPage();
page.open(encodeURI('file://' + phantom.args[0]), function (status) {
if (status === 'success') {
var html = page.evaluate(function () {
// your DOM manipulation here
return document.documentElement.outerHTML;
});
console.log(html);
}
phantom.exit();
});
Next, save the new HTML by redirecting your script's output to a file:
#!/bin/bash
mkdir modified
for i in *.html; do
phantomjs modify-html-file.js "$1" > modified/"$1"
done
I tried PhantomJS as in katspaugh's answer, but ran into several issues trying to manipulate pages. My use case was modifying the static html output of Doxygen, without modifying Doxygen itself. The goal was to reduce delivered file size by remove unnecessary elements from the page, and convert it to HTML5. Additionally I also wanted to use jQuery to access and modify elements more easily.
Loading the page in PhantomJS
The APIs appear to have changed drastically since the accepted answer. Additionally, I used a different approach (derived from this answer), which will be important in mitigating one of the major issues I encountered.
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
// Reading the page's content into your "webpage"
// This automatically refreshes the page
page.content = fs.read(system.args[1]);
// Make all your changes here
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Preventing JavaScript from Running
My page uses Google Analytics in the footer, and now the page is modified beyond my intention, presumably because javascript was run. If we disable javascript, we can't actually use jQuery to modify the page, so that isn't an option. I've tried temporarily changing the tag, but when I do, every special character is replaced with an html-escaped equivalent, destroying all javascript code on the page. Then, I came across this answer, which gave me the following idea.
var rawPageString = fs.read(system.args[1]);
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");
page.content = rawPageString;
// Make all your changes here
rawPageString = page.content;
rawPageString = rawPageString.replace(/<script type='foo\/bar'/g, "<script");
Adding jQuery
There's actually an example on how to use jQuery. However, I thought an offline copy would be more appropriate. Initially I tried using page.includeJs as in the example, but found that page.injectJs was more suitable for the use case. Unlike includeJs, there's no <script> tag added to the page context, and the call blocks execution which simplifies the code. jQuery was placed in the same directory I was executing my script from.
page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {
// Make all changes here
// Remove the foo/bar type more easily here
$("script[type^=foo]").removeAttr("type");
});
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Putting it All Together
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();
var rawPageString = fs.read(system.args[1]);
// Prevent in-page javascript execution
rawPageString = rawPageString.replace(/<script type="text\/javascript"/g, "<script type='foo/bar'");
rawPageString = rawPageString.replace(/<script>/g, "<script type='foo/bar'>");
page.content = rawPageString;
page.injectJs("jquery-2.1.4.min.js");
page.evaluate(function () {
// Make all changes here
// Remove the foo/bar type
$("script[type^=foo]").removeAttr("type");
});
fs.write(system.args[2], page.content, 'w');
phantom.exit();
Using it from the command line:
phantomjs modify-html-file.js "input_file.html" "output_file.html"
Note: This was tested and working with PhantomJS 2.0.0 on Windows 8.1.
Pro tip: If speed matters, you should consider iterating the files from within your PhantomJS script rather than a shell script. This will avoid the latency that PhantomJS has when starting up.
you can get your modified content by $('html').html() (or a more specific selector if you don't want stuff like head tags), then submit it as a big string to your server and write the file server side.