I'm trying to webscrape the historical 'Market Value Development' chart on this website:
https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290
After learning that it's rendered with JavaScript, I started learning about scraping JS pages using webdrivers (Selenium), headless browsers, and Chrome/Chromium. After inspecting the page, I found that the ID I'm probably looking for is 'yw0', which seems to house the chart.
Given this, here is my code:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path='/Applications/Utilities/chromedriver', chrome_options=options)
driver.get('https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290')
element = driver.find_element_by_id('yw0')
print(element)
When I run it, it outputs this:
<selenium.webdriver.remote.webelement.WebElement (session="bd8e42834fcdd92383ce2ed13c7943c0", element="8df128aa-d242-40a0-9306-f523136bfe57")>
When I change the code after element to
value = element.text
print(value)
I get:
Current Market Value : 180,00 Mill. €
2010
2012
2014
2016
2018
50,0
100,0
150,0
200,0
This isn't the chart data, though, but the x- and y-axis interval labels of the chart.
I've tried different id tags of the chart to see if I'm simply identifying the wrong container (e.g. highcharts-0). But I'm unable to find the actual data values of the chart.
What's curious is that the chart changes a bit after I run my code: it 'gets wider' and runs off the area designated for the chart.
I'm wondering what I need to change in the code in order to scrape the data points displayed on the chart.
You can regex it out of the JavaScript and do a little string manipulation. The code below gives you a list of dictionaries. No need for Selenium.
import requests, re, ast

r = requests.get('https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290', headers={'User-Agent': 'Mozilla/5.0'})
# Capture the 'data' array from the inline Highcharts config
p = re.compile(r"'data':(.*)}\],")
s = p.findall(r.text)[0]
# Decode the \uXXXX escapes in the embedded strings
s = s.encode().decode('unicode_escape')
# The captured array is valid Python literal syntax, so it evaluates to a list of dicts
data = ast.literal_eval(s)
tl;dr:
When the page loads in a browser, jQuery pulls the chart info out of a script tag, which produces what you see. The regex extracts that same series info for the chart directly from where jQuery sourced it.
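To make the regex step concrete, here is a self-contained sketch run against a made-up fragment shaped like the page's inline Highcharts config (the field names and values are invented for illustration):

```python
import re
import ast

# Made-up fragment shaped like the inline Highcharts config on the page
js = "'series':[{'data':[{'x':1262300400000,'y':5.0},{'x':1293836400000,'y':20.0}]}],"

# Same pattern as above: capture everything between 'data': and the closing }],
p = re.compile(r"'data':(.*)}\],")
s = p.findall(js)[0]

# The captured text is a valid Python literal, so literal_eval parses it
data = ast.literal_eval(s)
print(data[0]['y'])  # -> 5.0
```

On the real page the captured list also contains strings with \uXXXX escapes, which is why the answer adds the unicode_escape decoding step before literal_eval.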
Selenium:
There is certainly room for improvement, but this demonstrates the general principle. The values are retrieved from script tags to update the tooltip as you hover over each data point on the chart, and each value is associated with the x,y of a chart point. So you cannot read the tooltip info from the markers themselves. Instead, you can click each data point and grab the updated info from the tooltip element.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
import time

options = Options()
options.add_argument("--start-maximized")
url = 'https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290'
d = webdriver.Chrome(options=options)
d.get(url)
# Dismiss the cookie-consent banner before interacting with the chart
WebDriverWait(d, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".as-oil__btn-optin"))).click()
markers = d.find_elements_by_css_selector('.highcharts-markers image')
time.sleep(1)
for marker in markers:
    ActionChains(d).click_and_hold(marker).perform()
    text = d.find_element_by_css_selector('div.highcharts-tooltip').text
    # The tooltip may not have updated yet; retry the click until it has content
    while len(text) == 0:
        ActionChains(d).click_and_hold(marker).perform()
        text = d.find_element_by_css_selector('div.highcharts-tooltip').text
    print(text)
I'm trying to access the chart data (Highcharts format) from the below website using Python and Selenium. The default "1 year" option works perfectly, but when I use Selenium to click the "5Y" option in the chart and get the data, it still returns the "1Y" information.
import time
from selenium import webdriver
website = 'https://www.moneycontrol.com/nps/nav/lic-pension-fund-scheme-g-tier-ii/SM003010'
# Open Website
driver = webdriver.Firefox()
driver.get(website)
time.sleep(2)
# Click on 5 Year Option in Chart
driver.find_element_by_id("li_5y").click()
time.sleep(2)
# Get Data from Highcharts Series
output = driver.execute_script('return window.Highcharts.charts[2].series[0].options.data')
driver.close()
I've also tried an alternative for clicking 5 year data but the same issue persists:
driver.execute_script("get_stock_graph('','5Y','li_5y','fiveymfd_5')")
Any advice would be appreciated on how to get the refreshed driver page info.
Thanks!
On that page, every time you change the time period a new chart is created, so you need to get the data from the last one in the Highcharts.charts array:
output = driver.execute_script('return window.Highcharts.charts[window.Highcharts.charts.length-1].series[0].options.data')
API Reference: https://api.highcharts.com/class-reference/Highcharts#.charts
I'm writing a Python script in jupyter-notebook that fetches data from my website's SQL server, and I want to call this script from a JavaScript function every time the page is loaded, so that the page shows the Plotly graphs.
Here is my code:
# coding: utf-8
# In[1]:
#import os
#os.chdir("D:/Datasets/Trell")
# In[2]:
import json
from pandas.io.json import json_normalize
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from plotly.offline import init_notebook_mode,plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.offline as offline
offline.init_notebook_mode()
import plotly.tools as tls
# In[3]:
# importing the requests library
import requests
# api-endpoint
URL = "https://*****.co.in/*****/*******.php"
# location given here
token= '************'
query= 'SELECT userId,createdAt,userName,trails_count,bio FROM users WHERE createdAt >= "2018-07-01"'
# defining a params dict for the parameters to be sent to the API
PARAMS = {'token':token, 'query':query}
# sending get request and saving the response as response object
r = requests.post(url = URL, data = PARAMS)
# In[4]:
data=r.json()
# In[5]:
df=pd.DataFrame(data)
# In[6]:
df.head(1)
# In[7]:
df['date'] = pd.DatetimeIndex(df.createdAt).normalize()
# In[8]:
df['user']=1
# In[9]:
df_user=df.groupby(['date'],as_index=False)['user'].agg('sum')
# In[10]:
data = [go.Scatter( x=df_user['date'], y=df_user['user'] )]
plot(data, filename='time-series.')
# In[11]:
df_user['day_of_week']=df_user['date'].dt.weekday_name
df_newuser_day=df_user.groupby(['day_of_week'],as_index=False)['user'].agg('sum')
df_newuser_day=df_newuser_day.sort_values(['user'],ascending=False)
trace = go.Bar(
    x=df_newuser_day['day_of_week'],
    y=df_newuser_day.user,
    marker=dict(
        color="blue",
        #colorscale = 'Blues',
        reversescale=True
    ),
)
layout = go.Layout(
    title='Days of Week on which max. users register (July)'
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
plot(fig, filename="medal.")
But the problem is that every time the plot() function is executed, a new HTML tab opens with the filename mentioned inside the function.
All I want is for all the graphs to appear together on a single HTML page, one after the other, when I execute the file. I'd also like to add a header in an <h1> tag before each plot so the plots are understandable, along with some HTML and CSS, so that it looks like a clean webpage with all the Plotly graphs under their headers.
P.S. I don't want to use iplot because it only plots inside the notebook and doesn't save a file.
To make the plots appear in the same page, use plotly offline's iplot method instead of plot.
So the statement
plot(fig, filename="medal.")
becomes
iplot(fig)
If you wish to add HTML before the plot, please use the display and HTML provided by ipython.
from IPython.core.display import display, HTML
display(HTML('<h1>Hello, world!</h1>'))
iplot(fig)
Thus, we can first insert the HTML and then plot the graph!
To know more, visit this SO Answer
Late reply: subplots may be the answer to this problem.
For example, to create subplots with 2 rows and 2 columns:
from plotly import tools
plots = tools.make_subplots(rows=2, cols=2, print_grid=True)
I want to scrape the number of views that specific videos on Instagram have. I'm relatively new to python but I'm guessing there must be a way given that the views can be found in the source code.
https://www.instagram.com/p/BOTU6rJhShv/ is one video I have been working with. As of this writing, it has 1759 views, and in the source code, 1759 is clearly listed as "video_views" inside a dictionary-like element.
This element sits deep inside one of the page's script tags. From what I've read, the data is currently organized as JavaScript and should be converted to JSON to use in Python. Here's what I have so far:
import json
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
page = urlopen('https://www.instagram.com/p/BOTU6rJhShv/')
soup = bs(page.read(),"html.parser")
body = soup.find('body',{'class':''})
script = body.find('script',{'type':'text/javascript'})
print(script)
Since I print the result of script at the bottom, I know this homes in on the section of the page I want to focus on. If I could read that information into Python, I could iterate through it to find the "video_views" key, but that is where I am stuck. How can I convert the information between the script tags to JSON format and load it into Python?
Well, since the format is always the same, you could simply do this:
data = json.loads(script.text.replace('window._sharedData = ', '')[:-1])
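To see why the string manipulation works, here is a self-contained sketch on a made-up payload shaped like Instagram's window._sharedData assignment (the real JSON is much larger; only the nesting that matters here is reproduced):

```python
import json

# Invented payload shaped like the page's script tag contents
script_text = 'window._sharedData = {"entry_data": {"PostPage": [{"media": {"video_views": 1759}}]}};'

# Drop the JS assignment prefix, then the trailing semicolon, leaving pure JSON
data = json.loads(script_text.replace('window._sharedData = ', '')[:-1])
print(data['entry_data']['PostPage'][0]['media']['video_views'])  # -> 1759
```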
Update: (I'm using python 2.7, so urllib2.urlopen is used instead)
I do get consistent output from this code:
import json
import re
import urllib2
from bs4 import BeautifulSoup as bs
page = urllib2.urlopen('https://www.instagram.com/p/BOTU6rJhShv/')
soup = bs(page.read(),"html.parser")
body = soup.find('body',{'class':''})
script = body.find('script',{'type':'text/javascript'})
data = json.loads(script.text.replace('window._sharedData = ', '')[:-1])
print data
print data['entry_data']['PostPage'][0]['media']['video_views']
Currently the video_views is 1759.
I would like to download the markers from an embedded google map in order to find the country the marker resides in. I did not create the map, so I can not export the KML. I tried downloading the content using requests and parsing through the HTML content using Beautiful Soup, then finding the country information by parsing the JavaScript in slimit. However this only allowed me to find a small number of the waypoints on the map. The organization operates in over 100 countries, but my search is only returning 14 country names. I wonder if I need to use a google maps specific module?
Sample Code:
import requests
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
#Get HTML Content
with requests.Session() as c:
    page = c.get("http://www.ifla-world-report.org/cgi-bin/static.ifla_wr.cgi?dynamic=1&d=ifla_wr_browse&page=query&interface=map")
    pContent = page.content

#Parse through HTML and Grab the Javascript
soup = BeautifulSoup(pContent)
text_to_find = "country"
for script in soup.find_all('script'):
    #Now Parse through the Javascript to find the country variable
    lookat = Parser()
    tree = lookat.parse(script.text)
    for node in nodevisitor.visit(tree):
        if isinstance(node, ast.Assign):
            value = getattr(node.left, 'value', '')
            if text_to_find in value:
                country = getattr(node.right, 'value', '')
                print country[1:-1]
I am trying to embed a bokeh plot in a webpage served by a simple Flask app, using the embed.autoload_server function that I picked up from looking over the bokeh embed examples on github. Everything seems to be working as expected on the python side of things, but the page renders without any data (even though the data is within the JS plot object). I do see the 5 bokeh plot manipulation buttons but I do not see the actual plot. After turning on the JS console I see that the i variable is being returned as undefined in the following statement (line 23512, bokeh.js):
i = this.get('dimension');
As a result, ranges[i] is also undefined, which is the error I'm getting in the console.
I can navigate the browser to the actual plot json and I see all the data as expected there, which is why I turned to the JS console to troubleshoot.
Any ideas would be very appreciated, my JS is pretty rusty at the moment. Is there a relationship between the attributes of the python "plot" objects and the JS "plot" objects? It seems like this is just an issue of my front end object missing the "dimension" attribute.
In response to the question, here is the code. It is pretty much lifted directly from the candlestick example code, but that was from a pull several weeks ago, so it could well be dated. I have pulled again since, but didn't revisit this code because there were no issues creating the plot data.
import pandas as pd
from math import pi
from bokeh.plotting import *  # assumed imports: output_server, figure, hold, segment, rect, curplot, xaxis, grid, cursession
from bokeh import embed

def candlestick():
    store = pd.HDFStore('../data/dt_metastock.h5')
    keys = [key for key in store.keys() if 'daily' in key]
    df = store[keys[0]][:800]
    #df['date'] = pd.to_datetime(df['date'])
    mids = (df.open + df.close)/2
    spans = abs(df.close - df.open)
    inc = df.close > df.open
    dec = df.open > df.close
    w = 12*60*60*1000 # half day in ms

    output_server("candlestick")
    figure(tools="pan,wheel_zoom,box_zoom,reset,previewsave",
           plot_width=1000, name="candlestick")
    hold()
    segment(df.idx, df.high, df.idx, df.low, color='black')
    w = .5
    rect(df.idx[inc].values, mids[inc], w, spans[inc], fill_color="#D5E1DD", line_color="black")
    rect(df.idx[dec].values, mids[dec], w, spans[dec], fill_color="#F2583E", line_color="black")
    curplot().title = keys[0]
    xaxis().major_label_orientation = pi/4
    grid().grid_line_alpha = 0.3
    tag = embed.autoload_server(curplot(), cursession())
    return tag
Can you post the code of your plot? We recently merged a new layout system, and it seems to me that you are probably using an old way of setting up the axes in your plot...