Unable to scrape multiple pages using phantomjs in R

I'm trying to scrape county assessor data on historic property values for multiple parcels from https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=07101001, a page generated with JavaScript, using phantomjs controlled by RSelenium.
'parid' in the URL is the parcel number. I have a data frame containing a list of parcel numbers I'm interested in (a few hundred in total), but I have been testing the code on a small subset of them:
parcel_nums
[1] "00905101" "00905102" "00905103" "00905104" "00905105"
[6] "00905106" "00905107" "00905108" "00905201" "00905202"
I need to scrape the data in the table generated on the page for each parcel and preserve it. I have chosen to write the page to a file "output.htm" and then parse the file afterwards. My code is as follows:
require(plyr)
require(rvest)
require(RSelenium)
require(tidyr)
require(dplyr)
parcel_nums <- prop_attr$APN[1:10] #Vector of parcel numbers
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
result <- remDr$phantomExecute("var page = this;
                                var fs = require(\"fs\");
                                page.onLoadFinished = function(status) {
                                  var file = fs.open(\"output.htm\", \"w\");
                                  file.write(page.content);
                                  file.close();
                                };")

for (i in 1:length(parcel_nums)) {
  url <- paste("https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=",
               parcel_nums[i], sep = "")
  Sys.sleep(5)
  remDr$navigate(url)
  dat <- read_html("output.htm", encoding = "UTF-8") %>%
    html_nodes("table") %>%
    html_table(header = TRUE)
  df <- data.frame(dat)
  # assign the parcel number to the panel
  df$apn <- parcel_nums[i]
  # on the first iteration initialize the final data frame, on subsequent iterations append to it
  ifelse(i == 1, parcel_data <- df, parcel_data <- rbind(parcel_data, df))
}
remDr$close()
pJS$stop()
This will work perfectly for one or two iterations of the loop, but it suddenly stops preserving the data generated by the javascript and produces an error:
Error in `$<-.data.frame`(`*tmp*`, "apn", value = "00905105") :
replacement has 1 row, data has 0
which is due to the parser not locating the table in the output file because it is not being preserved. I'm unsure whether there is a problem with the implementation I've chosen or some idiosyncrasy of this particular site that is causing the issue. I am not familiar with JavaScript, so the code snippet used is taken from an example I found. Thank you for any assistance.
The answer below worked perfectly. I also moved the Sys.sleep(5) to after the $navigate call to give the page time to load the JavaScript. The loop now runs to completion.

require(plyr)
require(rvest)
require(XML)        # for htmlParse() and readHTMLTable()
require(RSelenium)
require(tidyr)
require(dplyr)
parcel_nums <- prop_attr$APN[1:10] # vector of parcel numbers
# pJS <- phantom()
remDr <- remoteDriver()
remDr$open()
# result <- remDr$executeScript("var page = this;
#   var fs = require(\"fs\");
#   page.onLoadFinished = function(status) {
#     var file = fs.open(\"output.htm\", \"w\");
#     file.write(page.content);
#     file.close();
#   };")
# length(parcel_nums)
for (i in 1:length(parcel_nums)) {
  url <- paste("https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=",
               parcel_nums[i], sep = "")
  Sys.sleep(5)
  remDr$navigate(url)
  doc <- htmlParse(remDr$getPageSource()[[1]])
  doc_t <- readHTMLTable(doc, header = TRUE)$`NULL`
  df <- data.frame(doc_t)
  # assign the parcel number to the panel
  df$apn <- parcel_nums[i]
  # on the first iteration initialize the final data frame, on subsequent iterations append to it
  ifelse(i == 1, parcel_data <- df, parcel_data <- rbind(parcel_data, df))
}
remDr$close()
This gave me a solution, and it should work with phantomJS too. Please test it and reply.

I lost an entire day trying to solve a similar issue, so I'm sharing what I learned to help others save time and nerves.
I guess we need to understand that opening, navigating and other browsing actions through the remote driver need time to complete.
So we have to wait before we try to read or do anything on the pages we are expecting to scrape.
My problems were solved when I introduced Sys.sleep(5) after the remDr$navigate(url) call.
A neater solution seems to be setting remDr$setTimeout(type = "page load", milliseconds = 10000), as suggested in "how to check if page finished loading in RSelenium", but I haven't tested it yet.
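For reference, here is a minimal, untested sketch of what that could look like combined with the loop from the accepted code above; it assumes remDr is an already-open remoteDriver and parcel_nums holds the parcel numbers, as earlier in this thread:

# Untested sketch: set a page-load timeout once instead of a fixed Sys.sleep()
# after every navigate() call. Assumes `remDr` is an open remoteDriver and
# `parcel_nums` holds the parcel numbers, as in the code above.
remDr$setTimeout(type = "page load", milliseconds = 10000)

parcel_data <- NULL
for (i in seq_along(parcel_nums)) {
  url <- paste0("https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=",
                parcel_nums[i])
  remDr$navigate(url)  # should now wait for the page to load (up to 10 s)
  doc   <- XML::htmlParse(remDr$getPageSource()[[1]])
  doc_t <- XML::readHTMLTable(doc, header = TRUE)$`NULL`
  if (is.null(doc_t)) next           # skip parcels where no table came back
  df     <- data.frame(doc_t)
  df$apn <- parcel_nums[i]
  parcel_data <- rbind(parcel_data, df)
}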

Get the actual domain name (server name) in a Shiny app

I have 3 servers: dev, test and prod. My Shiny code should be deployed from dev to prod.
Now the problem:
In ui.R I refer via href = 'https://dev.com/start/' to another site named start. Is it possible to get the domain name (dev, test or prod) automatically? Something like href = 'https://<actual domain>.com/start/'.
Addendum: as DanielR answered, one can use session$clientData$url_hostname; however, my problem is that I need the hostname in dashboardHeader. The place in ui.R where I need the dynamic href is:
dashboardPage(
  dashboardHeader(title = "KRB",
                  titleWidth = 150,
                  tags$li(a(href = 'https://dev.com/start/
You can get the hostname using the session$clientData$url_hostname in your server function. See https://shiny.rstudio.com/articles/client-data.html
Here's a little app:
library(shiny)

ui <- fluidPage(
  uiOutput('urlui')
)

server <- function(input, output, session) {
  output$urlui <- renderUI({
    htmltools::a('my link',
                 href = paste0('http://', session$clientData$url_hostname))
  })
}

shinyApp(ui = ui, server = server)
Now the problem: In ui.R I refer via href = 'https://dev.com/start/' to another site named start. Is it possible to get the domain name (dev, test or prod) automatically?
For what you want to achieve here, you don't need to get the actual host name at all; you can just use a relative URL instead of a full absolute one to begin with.
Instead of
tags$li(a(href ='https://dev.com/start/' …
use
tags$li(a(href ='/start/' …
Relative URLs with a leading slash refer to the domain root, so this should resolve to https://[hostname]/start/ automatically, without you having to determine what [hostname] actually is in this case. The browser basically does that part for you when it resolves relative URLs, based on the address of the currently displayed main document.
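To make that concrete, here is a minimal sketch of the dashboardHeader from the question using a root-relative href; the link text "Start" and the class = "dropdown" on the li tag are assumptions, since the original snippet is truncated:

library(shiny)
library(shinydashboard)

# Minimal sketch: the href is root-relative, so the browser resolves it against
# whichever host (dev, test or prod) is serving the app. The link text "Start"
# and class = "dropdown" are assumed, as the original snippet is truncated.
ui <- dashboardPage(
  dashboardHeader(title = "KRB",
                  titleWidth = 150,
                  tags$li(class = "dropdown",
                          a(href = '/start/', 'Start'))),
  dashboardSidebar(),
  dashboardBody()
)

server <- function(input, output, session) {}

shinyApp(ui, server)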

Web Scraping interactive map (javascript) with R and PhantomJS

I am trying to scrape data from an interactive map (looking to get crime data for a county). I am using R (rvest) and trying to use phantomjs too. I'm new to web scraping, so I don't really understand yet how all the pieces fit together (trying to get there).
The problem, I believe, is that after I run phantomjs and load the saved HTML using R's rvest package, I end up with more scripts and no clear data in the HTML. My code is below.
writeLines("var url = 'http://www.google.com';
var page = new WebPage();
var fs = require('fs');
page.open(url, function (status) {
just_wait();
});
function just_wait() {
setTimeout(function() {
fs.write('cool.html', page.content, 'w');
phantom.exit();
}, 2500);
}
", con = "scrape.js")
A function that takes in the url that I want to scrape
s_scrape <- function(url = "https://gis.adacounty.id.gov/apps/crimemapper/",
                     js_path = "scrape.js",
                     phantompath = "/Users/alihoop/Documents/phantomjs/bin/phantomjs") {
  # this section will replace the url in scrape.js with whatever you want
  lines <- readLines(js_path)
  lines[1] <- paste0("var url ='", url, "';")
  writeLines(lines, js_path)

  command <- paste(phantompath, js_path, sep = " ")
  system(command)
}
Execute the s_scrape() function to get an HTML file saved as "cool.html":
s_scrape()
Where I am not understanding what to do next is the below R code:
map_data <- read_html('cool.html') %>%
  html_nodes('script')
The output I get in the HTML via phantomjs is just more scripts. I'm looking for help on how to proceed when faced with what looks (to me) like JavaScript nested inside JavaScript.
Thank you!
This site uses JavaScript to make queries to the server. One solution is to reproduce the REST request and read the returned JSON directly. This avoids the need to use PhantomJS.
In your browser's developer tools, looking through the XHR requests, you will find one or more requests named "query" with a URL similar to: "https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000"
Read this JSON response directly and convert to a list with the use of the jsonlite package:
library(jsonlite)
output<-jsonlite::fromJSON("https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000")
output$features
Note the layer number in the link (11 in this case, in "FeatureServer/11/query?f=json"). This number determines which crime type the server is queried for. I found it can take values from 0 to 11: enter 0 for arson, 4 for drugs, 11 for vandalism, etc.
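Building on that, here is a sketch of how one might loop over the layer numbers and stack the results; the assumption that each response keeps its records in features$attributes follows the usual ArcGIS FeatureServer JSON layout and should be checked against the actual responses:

library(jsonlite)

# Sketch: query each crime layer (0-11) and bind the attribute tables together.
# Assumes each response stores its records in features$attributes, which is the
# usual layout of ArcGIS FeatureServer JSON -- verify against the real response.
query_layer <- function(layer) {
  url <- paste0("https://gisapi.adacounty.id.gov/arcgis/rest/services/",
                "CrimeMapper/CrimeMapperWAB/FeatureServer/", layer,
                "/query?f=json&where=1%3D1&returnGeometry=true",
                "&spatialRel=esriSpatialRelIntersects&outFields=*",
                "&outSR=102100&resultOffset=0&resultRecordCount=1000")
  resp  <- jsonlite::fromJSON(url)
  attrs <- resp$features$attributes
  if (is.null(attrs)) return(NULL)
  attrs$layer_id <- layer        # keep track of which crime layer the rows came from
  attrs
}

crime_list <- lapply(0:11, query_layer)
# bind_rows() pads columns that differ between layers with NA
crime_data <- dplyr::bind_rows(crime_list)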

Plotly (offline) for Python click event

Is it possible to add click events to a Plotly scatter plot (offline mode in Python)?
As an example, I want to change the shape of a set of scatter points upon being clicked.
What I tried so far
My understanding from reading other questions on this site (with no clear answer) is that I may have to produce the HTML and then edit it after the fact by inserting JavaScript code. So I could write a JavaScript function, save it to my_js.js, and then link to it from the HTML?
I've been doing some work with offline plots in plotly and had the same challenge.
Here's a kludge I've come up with which may serve as inspiration for others.
Some limitations:
Assumes that you have the offline output in a single html file, for a single plot.
Assumes that your on events are named the same as the event handlers.
Requires Beautiful Soup 4.
Assumes you've got lxml installed.
Developed with Plotly 2.2.2
Code Snippet:
import re
import bs4


def add_custom_plotly_events(
        filename,
        events = {
            "plotly_click": "function plotly_click(data) { console.log(data); }",
            "plotly_hover": "function plotly_hover(data) { console.log(data); }"
        },
        prettify_html = True):
    # the value we're looking for in the javascript
    find_string = "Plotly.newPlot"
    # stop if we find this value (the file has already been patched)
    stop_string = "then(function(myPlot)"

    def locate_newplot_script_tag(soup):
        script_tag = soup.find_all(string=re.compile(find_string))

        if len(script_tag) == 0:
            raise ValueError("Couldn't locate the newPlot javascript in {}".format(filename))
        elif len(script_tag) > 1:
            raise ValueError("Located multiple newPlot javascript in {}".format(filename))

        if script_tag[0].find(stop_string) > -1:
            raise ValueError("Already updated javascript, it contains:", stop_string)

        return script_tag[0]

    def split_javascript_lines(new_plot_script_tag):
        return new_plot_script_tag.string.split(";")

    def find_newplot_creation_line(javascript_lines):
        for index, line in enumerate(javascript_lines):
            if line.find(find_string) > -1:
                return index, line
        raise ValueError("Missing new plot creation in javascript, couldn't find:", find_string)

    def join_javascript_lines(javascript_lines):
        # join the lines with the javascript line terminator ;
        return ";".join(javascript_lines)

    def register_on_events(events):
        on_events_registration = []
        for function_name in events:
            on_events_registration.append("myPlot.on('{}', {})".format(
                function_name, function_name
            ))
        return on_events_registration

    # load the file
    with open(filename) as inf:
        txt = inf.read()
        soup = bs4.BeautifulSoup(txt, "lxml")

    new_plot_script_tag = locate_newplot_script_tag(soup)
    javascript_lines = split_javascript_lines(new_plot_script_tag)

    line_index, line_text = find_newplot_creation_line(javascript_lines)
    on_events_registration = register_on_events(events)

    # replace whitespace characters with actual whitespace
    # using + to concat the strings as {} in format
    # causes fun times with {} as the brackets in js
    # could possibly overcome this with ES6 arrows and such
    line_text = line_text + ".then(function(myPlot) { " + join_javascript_lines(on_events_registration) + " })".replace('\n', ' ').replace('\r', '')

    # now add the function bodies we've registered in the on handlers
    for function_name in events:
        javascript_lines.append(events[function_name])

    # update the specific line
    javascript_lines[line_index] = line_text

    # update the text of the script tag
    new_plot_script_tag.string.replace_with(join_javascript_lines(javascript_lines))

    # save the file again
    with open(filename, "w") as outf:
        # tbh the pretty output is still ugly af
        if prettify_html:
            for line in soup.prettify(formatter = None):
                outf.write(str(line))
        else:
            outf.write(str(soup))
According to "Click events in python offline mode?" on Plotly's community site, this is not supported, at least as of December 2015.
That post does contain some hints as to how to implement this functionality yourself, if you're feeling adventurous.

HTML img tags failing to load

Just for fun, I'm trying to implement a "15 puzzle", but with 16 images (cut from one music photo) instead.
The thing is split into two scripts/sides: a Python CGI script that performs the Last.FM query and splits the image into Y x Z chunks. When the Python script finishes, it outputs a JSON string that contains the location (on the server), the extension, etc.
{"succes": true, "content": {"nrofpieces": 16, "size": {"width": 1096, "height": 961}, "directoryname": "Mako", "extension": "jpeg"}}
On the other side is an HTML/JS/(CSS) combo that queries the CGI script for the images.
$(document).ready(function () {
    var artiest = $("#artiest")
    var rijen = $("#rijen")
    var kolommen = $("#kolommen")
    var speelveld = $("#speelveld")
    var search = $("#search")

    $("#buttonClick").click(function () {
        var artiestZ = artiest.val()
        var rijenZ = rijen.val()
        var kolommenZ = kolommen.val()

        $.getJSON("http://localhost:8000/cgi-bin/cgiScript.py", "artiest=" + artiestZ + "&rijen=" + rijenZ + "&kolommen=" + kolommenZ, function (JsonSring) {
            console.log("HIIIIII")
            if (JsonSring.succes === true) {
                console.log(JsonSring)
                var baseUrl = "http://localhost:8000/"
                var extension = JsonSring.content.extension
                var url = baseUrl + JsonSring.content.directoryname + "/"
                var amountX = rijenZ
                var amountY = kolommenZ
                for (var i = 0; i < amountX; i += 1) {
                    for (var p = 0; p < amountY; p += 1) {
                        console.log("HI")
                        var doc = new Image
                        doc.setAttribute("src", url + JsonSring.content.directoryname + i + "_" + p + "." + extension)
                        document.getElementById("speelveld").appendChild(doc)
                    }
                }
            } else {
                // Search failed. Deal with it.
            }
        })
    })
})
where the various ids refer to various HTML elements (text fields, buttons and divs).
Beneath is a screenshot of the full folder that contains the image files.
Now, coming to the point: all the HTML img tags' src attributes seem correct, yet some images don't load while others do. I also noticed that the images that fail to load do so at 2-second intervals. Is there some kind of timeout at play?
All this is being run from a local machine, so disk speed and CPU shouldn't really affect the matter. Also, from what I understand, the code that creates the img tags runs in a callback from getJSON, meaning it only runs once getJSON has finished / received a reply.
Does the great StackOverflow community have an idea of what's happening here?
To share my knowledge/experiences with the great StackOverflow community,
Small backstory
After progressing a bit further into the project, I started to run into various issues, ranging from JSON parsing to missing Access-Control-Allow-Origin: * headers, making it very hard to get the Ajax request (client ==> Python CGI) working.
In the meantime I also started developing on my main desktop (which for some reason either has massive issues with Python versioning or none at all). Because the terminal on my desktop runs Python 3.4+, there was no CGIHTTPServer module. After a little digging, I found that CGIHTTPServer had been merged into http.server, yet when running plain old python -m http.server, the CGI script wouldn't run; it would just be displayed. Of course, I had forgotten the --cgi option.
Main solution
The times I was successfully using CGIHTTPServer, I had trouble: the images wouldn't load, as described above. I suspect the module just couldn't handle that many requests, meaning that when Y x Z requests suddenly came in, it struggled to deliver all the data ==> Connection Refused.
Since switching to python -m http.server --cgi, no problems whatsoever. Currently working on a Bootstrap grid for all those images!
Thanks @Lashane and @Ruud.

Parsing javascript generated pages in R

At work I'd like to parse some web pages. Unfortunately I can't add a real page to my example because the URLs at work are confidential. I can only try to explain the problem.
To parse them, I wrote the following script in R. As a mock URL I used www.imdb.com:
library(rvest)
library(plyr)

# urls
url <- "http://www.imdb.com/"

# parse
html <- try(read_html(url))

# select
select_meta <- function(html) {
  html %>%
    html_nodes(xpath = "//div") %>%
    html_attrs # function to select meta
}

meta <- select_meta(html)
The problem is that this script doesn't return anything for the pages I use at work. I guess this is because those pages are generated by JavaScript. I found this tutorial, which explains how to scrape JavaScript-generated pages in R.
The code used to generate the page in the tutorial is the following:
// scrape_techstars.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'techstars.html'

page.open('http://www.techstars.com/companies/stats/', function (status) {
  var content = page.content;
  fs.write(path, content, 'w')
  phantom.exit();
});
I don't have any JavaScript knowledge, so I'm having trouble scaling page.open (which only works for one page) to multiple pages (at work I have to parse roughly 100 pages). Instead of relying on phantomjs directly, I'd rather have a solution that is driven entirely from R (if this is totally inefficient and offensive to real coders, I apologise in advance). So the crux of my question is: how can I generate several pages from R?
This is a one-off thing, so I'm not really planning to read up on JavaScript or parsing. Thanks in advance for helping me out.
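For what it's worth, here is a minimal sketch of one way to stay almost entirely in R: keep a small phantomjs script as a template, rewrite it from an R loop for each URL, and parse the saved files with rvest afterwards. The phantomjs path and the my_urls vector are placeholders, not part of the original question:

library(rvest)

# Sketch: loop a vector of URLs through phantomjs, saving one HTML file per page,
# then parse the saved files with rvest. `phantompath` and `my_urls` are
# placeholders -- substitute your own values.
phantompath <- "/usr/local/bin/phantomjs"
my_urls <- c("http://www.techstars.com/companies/stats/")  # replace with the ~100 work URLs

scrape_one <- function(url, outfile) {
  # write a phantomjs script for this URL, run it, then read the saved HTML
  js <- sprintf("var page = require('webpage').create();
var fs = require('fs');
page.open('%s', function (status) {
  window.setTimeout(function () {
    fs.write('%s', page.content, 'w');
    phantom.exit();
  }, 2500);
});", url, outfile)
  writeLines(js, "scrape_one.js")
  system(paste(phantompath, "scrape_one.js"))
  read_html(outfile)
}

pages <- lapply(seq_along(my_urls), function(i) {
  scrape_one(my_urls[i], sprintf("page_%03d.html", i))
})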
