Plotly (offline) for Python click event - javascript

Is it possible to add click events to a Plotly scatter plot (offline mode in Python)?
As an example, I want to change the shape of a set of scatter points upon being clicked.
What I tried so far
My understanding from reading other questions on this site (with no clear answer) is that I may have to produce the HTML and then edit it after the fact by inserting JavaScript code. So I could write a JavaScript function, save it to my_js.js, and then link to it from the HTML?

I've been doing some work with offline plots in Plotly and had the same challenge.
Here's a kludge I've come up with which may serve as inspiration for others.
Some limitations:
Assumes that you have the offline output in a single html file, for a single plot.
Assumes that your event handler functions are named the same as the events they handle (the keys of the events dict are used for both).
Requires Beautiful Soup 4.
Assumes you've got lxml installed.
Developed with Plotly 2.2.2
Code Snippet:
import re

import bs4


def add_custom_plotly_events(
        filename,
        events={
            "plotly_click": "function plotly_click(data) { console.log(data); }",
            "plotly_hover": "function plotly_hover(data) { console.log(data); }"
        },
        prettify_html=True):

    # the string that identifies the script tag which creates the plot
    find_string = "Plotly.newPlot"
    # stop if we find this value (the file has already been patched)
    stop_string = "then(function(myPlot)"

    def locate_newplot_script_tag(soup):
        script_tag = soup.find_all(string=re.compile(find_string))

        if len(script_tag) == 0:
            raise ValueError("Couldn't locate the newPlot javascript in {}".format(filename))
        elif len(script_tag) > 1:
            raise ValueError("Located multiple newPlot javascript in {}".format(filename))

        if script_tag[0].find(stop_string) > -1:
            raise ValueError("Already updated javascript, it contains:", stop_string)

        return script_tag[0]

    def split_javascript_lines(new_plot_script_tag):
        return new_plot_script_tag.string.split(";")

    def find_newplot_creation_line(javascript_lines):
        for index, line in enumerate(javascript_lines):
            if line.find(find_string) > -1:
                return index, line
        raise ValueError("Missing new plot creation in javascript, couldn't find:", find_string)

    def join_javascript_lines(javascript_lines):
        # join the lines with the javascript statement terminator ;
        return ";".join(javascript_lines)

    def register_on_events(events):
        on_events_registration = []
        for function_name in events:
            on_events_registration.append("myPlot.on('{}', {})".format(
                function_name, function_name))
        return on_events_registration

    # load the file
    with open(filename) as inf:
        txt = inf.read()
        soup = bs4.BeautifulSoup(txt, "lxml")

    new_plot_script_tag = locate_newplot_script_tag(soup)
    javascript_lines = split_javascript_lines(new_plot_script_tag)

    line_index, line_text = find_newplot_creation_line(javascript_lines)
    on_events_registration = register_on_events(events)

    # concatenating with + rather than str.format, because the {} braces
    # in the javascript clash with format placeholders; also strip any
    # newlines so the whole thing stays a single javascript statement
    line_text = (line_text + ".then(function(myPlot) { "
                 + join_javascript_lines(on_events_registration)
                 + " })").replace('\n', ' ').replace('\r', '')

    # now add the function bodies we've registered in the on handlers
    for function_name in events:
        javascript_lines.append(events[function_name])

    # update the specific line
    javascript_lines[line_index] = line_text

    # update the text of the script tag
    new_plot_script_tag.string.replace_with(join_javascript_lines(javascript_lines))

    # save the file again
    with open(filename, "w") as outf:
        # to be honest the prettified output is still fairly ugly
        if prettify_html:
            for line in soup.prettify(formatter=None):
                outf.write(str(line))
        else:
            outf.write(str(soup))
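For context, a quick usage sketch (the figure and filename here are hypothetical; it assumes add_custom_plotly_events is defined or imported as above):
import plotly.graph_objs as go
from plotly.offline import plot

fig = go.Figure(data=[go.Scatter(x=[1, 2, 3], y=[3, 1, 2], mode="markers")])
plot(fig, filename="my_plot.html", auto_open=False)

# patch the generated HTML so clicks and hovers are logged to the browser console
add_custom_plotly_events("my_plot.html")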

According to Click events in python offline mode? on Plotly's community site this is not supported, at least as of December 2015.
That post does contain some hints as to how to implement this functionality yourself, if you're feeling adventurous.
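If you are on a more recent Plotly (4+), it may be easier to skip the HTML patching entirely: plotly.io.write_html accepts a post_script argument whose JavaScript runs right after the plot is created, and any '{plot_id}' placeholder in it is replaced with the id of the plot's div. A minimal sketch, assuming Plotly 4+; check the write_html documentation for your version:
import plotly.graph_objects as go
import plotly.io as pio

fig = go.Figure(go.Scatter(x=[1, 2, 3], y=[4, 1, 2], mode="markers"))

# '{plot_id}' is substituted by write_html with the id of the generated div
post_script = """
var plot = document.getElementById('{plot_id}');
plot.on('plotly_click', function(data) {
    console.log(data);
});
"""

pio.write_html(fig, "scatter_click.html", post_script=post_script)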

Related

Programmatically change page in PDF.js with QWebEngineView

I am making an application in PyQt5 that involves displaying a PDF using the QWebEngineView and PDF.js by Mozilla.
I am able to display the PDF no problem, but I cannot figure out how to either:
1: set the page on load, or
2: update the page after it is already loaded
I have tried the numerous options from other Stackoverflow posts that involve using self.runJavaScript() to change it, but it always results in either "Cannot set property of undefined" or "Object is NoneType".
Here is my method:
def load_file(self, file, page=0) -> None:
    url = QtCore.QUrl().fromLocalFile(os.path.abspath("./pdfjs/web/viewer.html"))
    query = QtCore.QUrlQuery()
    query.addQueryItem("file", os.path.normpath(os.path.abspath(file)))
    url.setQuery(query)
    self.pdf_view.load(url)
where self.pdf_view is a QWebEngineView.
I would appreciate any help on how to accomplish this.
EDIT: I was able to specify the page on load with the # symbol, but how to change the page without re-loading the whole thing is still unknown to me.
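For reference, setting the page on load amounts to adding a fragment to the viewer URL (a sketch of my load_file method above; the #page= fragment is understood by the PDF.js viewer, and the default page number here is just an example):
def load_file(self, file, page=1) -> None:
    url = QtCore.QUrl.fromLocalFile(os.path.abspath("./pdfjs/web/viewer.html"))
    query = QtCore.QUrlQuery()
    query.addQueryItem("file", os.path.normpath(os.path.abspath(file)))
    url.setQuery(query)
    url.setFragment("page={}".format(page))  # e.g. #page=3 opens the viewer on page 3
    self.pdf_view.load(url)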
The PDF.js viewer loads some scripts that create a PDFViewer object with all the necessary properties for programmatically navigating pages. So you just need to run some simple javascript on the main viewer page to get the functionality you need. To make things a little nicer to work with, it's also helpful to provide a way to run the javascript synchronously so that return values can be accessed more easily.
Below is a simple working demo that implements that (only tested on Linux). Hopefully it should be clear how to adapt it to work with your own application:
import sys, os
from PyQt5 import QtCore, QtWidgets, QtWebEngineWidgets

# PDFJS = '/usr/share/pdf.js/web/viewer.html'
PDFJS = './pdfjs/web/viewer.html'

class Window(QtWidgets.QWidget):
    def __init__(self):
        super().__init__()
        self.buttonNext = QtWidgets.QPushButton('Next Page')
        self.buttonNext.clicked.connect(lambda: self.changePage(+1))
        self.buttonPrev = QtWidgets.QPushButton('Previous Page')
        self.buttonPrev.clicked.connect(lambda: self.changePage(-1))
        self.viewer = QtWebEngineWidgets.QWebEngineView()
        layout = QtWidgets.QGridLayout(self)
        layout.addWidget(self.viewer, 0, 0, 1, 2)
        layout.addWidget(self.buttonPrev, 1, 0)
        layout.addWidget(self.buttonNext, 1, 1)

    def loadFile(self, file):
        url = QtCore.QUrl.fromLocalFile(os.path.abspath(PDFJS))
        query = QtCore.QUrlQuery()
        query.addQueryItem('file', os.path.abspath(file))
        url.setQuery(query)
        self.viewer.load(url)

    def execJavaScript(self, script):
        result = None

        def callback(data):
            nonlocal result
            result = data
            loop.quit()

        loop = QtCore.QEventLoop()
        QtCore.QTimer.singleShot(
            0, lambda: self.viewer.page().runJavaScript(script, callback))
        loop.exec()
        return result

    def changePage(self, delta):
        page = self.execJavaScript(
            'PDFViewerApplication.pdfViewer.currentPageNumber')
        self.setCurrentPage(page + int(delta))

    def setCurrentPage(self, page):
        count = self.execJavaScript(
            'PDFViewerApplication.pdfViewer.pagesCount')
        if 1 <= page <= count:
            self.execJavaScript(
                f'PDFViewerApplication.pdfViewer.currentPageNumber = {page}')

if __name__ == '__main__':
    app = QtWidgets.QApplication(sys.argv)
    window = Window()
    if len(sys.argv) > 1:
        window.loadFile(sys.argv[1])
    window.setGeometry(600, 50, 800, 600)
    window.show()
    sys.exit(app.exec_())

Web Scraping interactive map (javascript) with R and PhantomJS

I am trying to scrape data from an interactive map (looking to get crime data for a county). I am using R (rvest) and trying to use phantomjs too. I'm new to web scraping so I am not really understanding how all the elements work together (trying to get there).
The problem I believe I am having is that after I run phantomjs and load the html using R's rvest package, I end up with more scripts and no clear data in the html. My code is below.
writeLines("var url = 'http://www.google.com';
var page = new WebPage();
var fs = require('fs');
page.open(url, function (status) {
    just_wait();
});
function just_wait() {
    setTimeout(function() {
        fs.write('cool.html', page.content, 'w');
        phantom.exit();
    }, 2500);
}
", con = "scrape.js")
A function that takes in the url that I want to scrape:
s_scrape <- function(url = "https://gis.adacounty.id.gov/apps/crimemapper/",
                     js_path = "scrape.js",
                     phantompath = "/Users/alihoop/Documents/phantomjs/bin/phantomjs"){
  # this section will replace the url in scrape.js to whatever you want
  lines <- readLines(js_path)
  lines[1] <- paste0("var url ='", url, "';")
  writeLines(lines, js_path)

  command = paste(phantompath, js_path, sep = " ")
  system(command)
}
Execute the s_scrape() function and get a html file saved as "cool.html":
s_scrape()
Where I am not understanding what to do next is the R code below:
map_data <- read_html('cool.html') %>%
  html_nodes('script')
The output I get in the HTML via phantomjs is just more scripts. Looking for help on how to proceed when faced with (what looks to me like) javascript nested inside javascript(?)
Thank you!
This site uses javascript to make queries to the server. One solution is to reproduce the rest request and read the returning JSON file directly. This avoids the need to use Phantomjs.
From the developer tools options from your browser and looking through the xhr files, you will find a file(s) named "query" with a link similar to: "https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000"
Read this JSON response directly and convert to a list with the use of the jsonlite package:
library(jsonlite)
output <- jsonlite::fromJSON("https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/CrimeMapperWAB/FeatureServer/11/query?f=json&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&outSR=102100&resultOffset=0&resultRecordCount=1000")
output$features
Find the first number in the link (11 in this case): "FeatureServer/11/query?f=json". This number determines which crime type the server is queried for. I found it can take a value from 0 to 11. Enter 0 for arson, 4 for drugs, 11 for vandalism, etc.
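If you happen to be working from Python rather than R, the same REST query can be reproduced with the requests module. A rough sketch using the endpoint above (the layer index and query parameters are taken from the answer, not verified independently):
import requests

layer = 11  # 0 = arson, 4 = drugs, 11 = vandalism, etc.
url = ("https://gisapi.adacounty.id.gov/arcgis/rest/services/CrimeMapper/"
       "CrimeMapperWAB/FeatureServer/{}/query".format(layer))
params = {
    "f": "json",
    "where": "1=1",
    "returnGeometry": "true",
    "spatialRel": "esriSpatialRelIntersects",
    "outFields": "*",
    "outSR": "102100",
    "resultOffset": "0",
    "resultRecordCount": "1000",
}
features = requests.get(url, params=params).json()["features"]
print(len(features))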

Parsing html from a javascript rendered url with python object

I would like to extract the market information from the following url and all of its subsequent pages:
https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1
I have successfully parsed the data that I want from the first page using some code from the following url:
https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages
I have also been able to parse out the url for the next page to feed into a loop in order to grab data from the next page. The problem is it crashes before the next page loads for a reason I don't fully understand.
I have a hunch that the class that I have borrowed from 'impythonist' may be causing the problem. I don't know enough object oriented programming to work out the problem. Here is my code, much of which is borrowed from the url above:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
import re
from bs4 import BeautifulSoup

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

base_url='https://uk.reuters.com'
complete_next_page='https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1'

#LOOP TO RENDER PAGES AND GRAB DATA
while complete_next_page != '':
    print ('NEXT PAGE: ',complete_next_page, '\n')
    r = Render(complete_next_page)  # USE THE CLASS TO RENDER JAVASCRIPT FROM PAGE
    result = r.frame.toHtml()       # ERROR IS THROWN HERE ON 2nd PAGE

    # PARSE THE HTML
    soup = BeautifulSoup(result, 'lxml')
    row_data=soup.find('div', attrs={'class':'column1 gridPanel grid8'})
    print (len(row_data))

    # PARSE ALL ROW DATA
    stripe_rows=row_data.findAll('tr', attrs={'class':'stripe'})
    non_stripe_rows=row_data.findAll('tr', attrs={'class':''})
    print (len(stripe_rows))
    print (len(non_stripe_rows))

    # PARSE SPECIFIC ROW DATA FROM INDEX COMPONENTS
    #non_stripe_rows: from 4 to 18 (inclusive) contain data
    #stripe_rows: from 2 to 16 (inclusive) contain data
    i=2
    while i < len(stripe_rows):
        print('CURRENT LINE IS: ',str(i))
        print(stripe_rows[i])
        print('###############################################')
        print(non_stripe_rows[i+2])
        print('\n')
        i+=1

    #GETS LINK TO NEXT PAGE
    next_page=str(soup.find('div', attrs={'class':'pageNavigation'}).find('li', attrs={'class':'next'}).find('a')['href']) #GETS LINK TO NEXT PAGE WORKS
    complete_next_page=base_url+next_page
I have annotated the bits of code that I have written and understand, but I don't really know what's going on in the 'Render' class well enough to diagnose the error. Unless it's something else?
Here is the error:
result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'
I don't need to keep the information in the class once I have parsed it out, so I was thinking perhaps it could be cleared or reset somehow and then updated to hold the new url information from page 2 onwards, but I have no idea how to do this.
Alternatively if anyone knows another way to grab this specific data from this page and the following ones then that would be equally helpful?
Many thanks in advance.
How about using selenium and phantomjs instead of PyQt?
You can easily get selenium by executing "pip install selenium".
If you use a Mac you can get phantomjs by executing "brew install phantomjs".
If you are on Windows, use choco instead of brew; on Ubuntu, use apt-get.
from selenium import webdriver
from bs4 import BeautifulSoup

base_url = "https://uk.reuters.com"
first_page = "/business/markets/index/.FTSE?sortBy=&sortDir=&pn=1"

browser = webdriver.PhantomJS()

# PARSE THE HTML
browser.get(base_url + first_page)
soup = BeautifulSoup(browser.page_source, "lxml")
row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})

# PARSE ALL ROW DATA
stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
print(len(stripe_rows), len(non_stripe_rows))

# GO TO THE NEXT PAGE
next_button = soup.find("li", attrs={"class":"next"})
while next_button:
    next_page = next_button.find("a")["href"]
    browser.get(base_url + next_page)
    soup = BeautifulSoup(browser.page_source, "lxml")
    row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})
    stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
    print(len(stripe_rows), len(non_stripe_rows))
    next_button = soup.find("li", attrs={"class":"next"})

# DONT FORGET THIS!!
browser.quit()
I know the code above is not efficient (too slow, I feel), but I think it will bring you the results you desire. In addition, if the web page you want to scrape does not use Javascript, even PhantomJS and selenium are unnecessary; you can use the requests module. However, since I wanted to show you the contrast with PyQt, I used PhantomJS and Selenium in this answer.
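To illustrate that last point, a static page (one whose rows are present in the initial HTML rather than injected by Javascript) can be scraped with requests alone. A minimal sketch for comparison; the Reuters page above does need rendering, so expect row_data to be None there:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1")
soup = BeautifulSoup(resp.text, "lxml")

# this only finds the rows if they are part of the raw HTML;
# for javascript-built tables it returns None, which is exactly
# why PhantomJS/selenium is used above
row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
print(row_data is not None)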

Powershell Search Automation - GetElementbyID won't return values for controls created with Javascript

I'm working on a project to automate the search of tax records on a county website. Eventually, I want to be able to give Powershell a list of ID numbers and have it return results for all those numbers. The code I have so far is here:
#Open IE and go to www.gastontax.com
$ie = new-object -com "InternetExplorer.Application"
$ie.navigate("http://www.gastontax.com/")
$ie.visible = $true

#Yield the script while the page is loading
While ( $ie.busy -eq $true){
    [System.Threading.Thread]::Sleep(200)
}

#Set the document and the frames
$doc = $ie.document
$frames = $doc.frames

#Accept the notification
$btn = $doc.getElementByID("ctl00_Tax_btnAccept")
$btn.click()

#Yield the script while the page is loading
While ( $ie.busy -eq $true){
    [System.Threading.Thread]::Sleep(200)
}

#Create the drop-down search parameters
$taxyear = "All"
$status = "Unpaid"
$searchtype = "Both"
$searchparam = "Parcel Number"
$searchtext = "XXXXXX"

#Set the drop-down search parameters
$doc.getElementbyID("ctl00_Tax_drpTaxYear").value = $taxyear
$doc.getElementbyID("ctl00_Tax_drpStatus").value = $status
$doc.getElementbyID("ctl00_Tax_drpSearchType").value = $searchtype
$doc.getElementbyID("ctl00_Tax_drpSearchParam").value = $searchparam

#Create the parcel name parameter
$doc.getElementbyID("ctl00_Tax_txtSearchParam").value = $searchtext

$btn2 = $doc.getElementByID("ctl00_Tax_btnSearch")
$btn2.click()
However, whenever I try to set the values of the controls, I get the following message for each of the getElementbyID lines:
The property 'value' cannot be found on this object. Verify that the property exists and can be set.
I noticed that the IDs I'm looking for ("ctl00_Tax_drpTaxYear", "ctl00_Tax_drpStatus", etc.) do not exist in the page's source code until after I hit the button on the welcome page (marked by $btn.click()). Could this have something to do with the error Powershell is throwing? If so, how would I get around it?
Thanks!
I could only get it to work with a hack involving strategic toggling of the visibility of IE (Internet Explorer), which appears to be necessary for the DOM to be fully reflected in $ie.Document (see comments in source code below).
The hack seems to work reliably on my Windows 10 machine with IE 11, but YMMV.
Generally:
The Internet Explorer COM Automation interface is obsolescent, so it's worth considering an alternative, such as Edge's WebDriver support (an entirely different technology that may become a W3C standard).
Sticking with the IE COM object, the hack may not be necessary if DOM-related events are used instead, but I'm not sure if they're supported in PowerShell.
# Helper function for waiting until a condition is true.
function await([scriptblock] $sb) {
    while (-not (& $sb)) { Start-Sleep -Milliseconds 200 }
}

$ie = new-object -com "InternetExplorer.Application"
$ie.navigate("http://www.gastontax.com/")

# Wait until the page has finished loading.
await { $ie.ReadyState -eq 4 }

# Click the "Yes, I accept" button.
$ie.document.getElementById("ctl00_Tax_btnAccept").click()

# !! HACK (part 1 of 2):
# !! Delaying making IE visible until here seems to be the only way
# !! to get the new DOM to be at least *partially* reflected in $ie.document.
$ie.visible = $true

# Wait until the page has finished loading.
await { $ie.ReadyState -eq 4 }

# !! HACK (part 2 of 2):
# !! Inexplicably, another toggling of visibility is what it takes for *all*
# !! elements in the new DOM to be accessible via $ie.Document.
$ie.visible = $false
$ie.visible = $true

#Create the drop-down search parameters
$taxyear = "All"
$status = "Unpaid"
$searchtype = "Both"
$searchparam = "Parcel Number"
$searchtext = "XXXXXX"

#Set the drop-down search parameters
$doc = $ie.document
$doc.getElementById("ctl00_Tax_drpTaxYear").value = $taxyear
$ie.document.getElementById("ctl00_Tax_drpStatus").value = $status
$ie.document.getElementById("ctl00_Tax_drpSearchType").value = $searchtype
$ie.document.getElementById("ctl00_Tax_drpSearchParam").value = $searchparam

#Create the parcel name parameter
$ie.document.getElementById("ctl00_Tax_txtSearchParam").value = $searchtext

$ie.document.getElementById("ctl00_Tax_btnSearch").click()
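As a footnote to the WebDriver alternative mentioned above: driven from Python with Selenium, the same flow would look roughly like the sketch below. This is untested against the site (the form may live inside a frame, which would need extra handling), and the element IDs are simply the ones from the question:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Edge()   # or webdriver.Chrome() / webdriver.Firefox()
driver.get("http://www.gastontax.com/")
wait = WebDriverWait(driver, 10)

# Accept the notification, then wait for the search controls to exist.
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_Tax_btnAccept"))).click()
wait.until(EC.presence_of_element_located((By.ID, "ctl00_Tax_drpTaxYear")))

# Fill in the drop-downs and the parcel number, then search.
Select(driver.find_element(By.ID, "ctl00_Tax_drpTaxYear")).select_by_visible_text("All")
Select(driver.find_element(By.ID, "ctl00_Tax_drpStatus")).select_by_visible_text("Unpaid")
Select(driver.find_element(By.ID, "ctl00_Tax_drpSearchType")).select_by_visible_text("Both")
Select(driver.find_element(By.ID, "ctl00_Tax_drpSearchParam")).select_by_visible_text("Parcel Number")
driver.find_element(By.ID, "ctl00_Tax_txtSearchParam").send_keys("XXXXXX")
driver.find_element(By.ID, "ctl00_Tax_btnSearch").click()

driver.quit()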

Can I extract comments of any page from https://www.rt.com/ using python3?

I am writing a web crawler. I extracted the heading and main discussion from this link, but I am unable to find any of the comments (Ctrl+U -> Ctrl+F, then searching for the comment text). I think the comments are rendered with JavaScript. Can I extract them?
RT are using a service from spot.im for comments.
You need to make two POST requests: first to https://api.spot.im/me/network-token/spotim to get a token, then to https://api.spot.im/conversation-read/spot/sp_6phY2k0C/post/353493/get to get the comments as JSON.
I wrote a quick script to do this:
import requests
import re
import json

def get_rt_comments(article_url):
    spotim_spotId = 'sp_6phY2k0C'   # spotim id for RT
    post_id = re.search('([0-9]+)', article_url).group(0)

    r1 = requests.post('https://api.spot.im/me/network-token/spotim').json()
    spotim_token = r1['token']

    payload = {
        "count": 25,  # number of comments to fetch
        "sort_by": "best",
        "cursor": {"offset": 0, "comments_read": 0},
        "host_url": article_url,
        "canonical_url": article_url
    }

    r2_url = 'https://api.spot.im/conversation-read/spot/' + spotim_spotId + '/post/' + post_id + '/get'
    r2 = requests.post(r2_url, data=json.dumps(payload),
                       headers={'X-Spotim-Token': spotim_token, "Content-Type": "application/json"})
    return r2.json()

if __name__ == '__main__':
    url = 'https://www.rt.com/usa/353493-clinton-speech-affairs-silence/'
    comments = get_rt_comments(url)
    print(comments)
Yes, if it can be viewed with a web browser, you can extract it.
If you look at the source, it is really an iframe that loads a piece of javascript, which then creates a new script tag in the document whose source loads bundle.js, which contains the actual commenting software. This in turn fetches the actual comments.
Instead of going through this manually, you could consider using, for example, webkit to create a headless browser that executes the javascript like an ordinary browser. Then you can scrape from that instead of having to make your crawler fetch the external resources manually.
Examples of such headless browsers are Spynner, dryscrape, or the PhantomJS-derived PhantomPy (the latter seems to be an abandoned project now).
