I'm trying to access the chart data (Highcharts format) from the website below using Python and Selenium. The default "1 year" option works perfectly, but when I use Selenium to click the "5Y" option in the chart and get the data, it still returns the "1Y" information.
import time
from selenium import webdriver
website = 'https://www.moneycontrol.com/nps/nav/lic-pension-fund-scheme-g-tier-ii/SM003010'
# Open Website
driver = webdriver.Firefox()
driver.get(website)
time.sleep(2)
# Click on 5 Year Option in Chart
driver.find_element_by_id("li_5y").click()
time.sleep(2)
# Get Data from Highcharts Series
output = driver.execute_script('return window.Highcharts.charts[2].series[0].options.data')
driver.close()
I've also tried an alternative for clicking 5 year data but the same issue persists:
driver.execute_script("get_stock_graph('','5Y','li_5y','fiveymfd_5')")
Any advice would be appreciated on how to get the refreshed driver page info.
Thanks!
On that page, every time you change a time period a new chart is created, so you need to get the data from the last one in Highcharts.charts array:
output = driver.execute_script('return window.Highcharts.charts[window.Highcharts.charts.length-1].series[0].options.data')
API Reference: https://api.highcharts.com/class-reference/Highcharts#.charts
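One caveat with taking the last index: per the API reference, a destroyed chart leaves an undefined slot in Highcharts.charts, so the final entry is not guaranteed to be a live chart. A minimal sketch that skips such slots follows; the helper name latest_chart_data is made up, and it only assumes a driver object that exposes execute_script.

```python
# JavaScript run in the page: drop undefined slots left by destroyed charts,
# then return the first series' data from the most recently created chart.
LATEST_CHART_JS = """
var live = window.Highcharts.charts.filter(function (c) { return c; });
if (!live.length) { return null; }
return live[live.length - 1].series[0].options.data;
"""


def latest_chart_data(driver):
    """Return the data of the newest live Highcharts chart, or None.

    `driver` is any object with an `execute_script` method, for example
    a Selenium WebDriver. The function name is hypothetical.
    """
    return driver.execute_script(LATEST_CHART_JS)
```

After clicking the "5Y" option and waiting for the redraw, `output = latest_chart_data(driver)` would replace the hard-coded index.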
I'm almost there with my first try of using scrapy, selenium to collect data from website with javascript loaded content.
Here is my code:
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.webdriver.common.by import By
import time
class FreePlayersSpider(scrapy.Spider):
    name = 'free_players'
    allowed_domains = ['www.forge-db.com']
    start_urls = ['https://www.forge-db.com/fr/fr11/players/?server=fr11']
    driver = {}

    def __init__(self):
        self.driver = webdriver.Chrome('/home/alain/Documents/repository/web/foe-python/chromedriver')
        self.driver.get('https://forge-db.com/fr/fr11/players/?server=fr11')

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        #time.sleep(1)
        sel = Selector(text = self.driver.page_source)
        players = sel.xpath('.//table/tbody/tr')
        for player in players:
            joueur = player.xpath('.//td[3]/a/text()').get()
            guilde = player.xpath('.//td[4]/a/text()').get()
            yield {
                'player' : joueur,
                'guild' : guilde
            }
        next_page_btn = self.driver.find_element_by_xpath('//a[@class="paginate_button next"]')
        if next_page_btn:
            time.sleep(2)
            next_page_btn.click()
            yield scrapy.Request(url = self.start_urls, callback=self.parse)
        # Close the selenium driver, so in fact it closes the testing browser
        self.driver.quit()

    def parse_players(self):
        pass
I want to collect user names and their relative guild and output to a csv file.
For now my issue is proceeding to the NEXT PAGE and parsing the content loaded by JavaScript again.
Even if I'm able to simulate a click on the NEXT button, I'm not 100% sure the code will go through all the pages, and I'm not able to parse the new content using the same function.
Any idea how I could solve this issue?
thx.
Instead of using Selenium, try recreating the request that updates the table. If you look at the network traffic in Chrome DevTools, you can see that the request is made with parameters and that the response comes back with the data in a nicely structured format.
Please see here with regards to dynamic content in Scrapy. As it explains, the first question to ask is whether you need to recreate browser activity at all, or whether you can get the information you need by reverse engineering the HTTP GET requests. Sometimes the information is hidden within <script></script> tags, and you can use a regex or string methods to extract what you want. Rendering the page and then driving the browser should be thought of as a last resort.
Before I go into some background on reverse engineering the requests: the website you're trying to get information from only requires reverse engineering the HTTP requests.
Reverse Engineering HTTP requests in Scrapy
To look at the site itself, open Chrome DevTools by right-clicking on the page and choosing Inspect. The Network tab shows all the requests the browser makes to render the page. In this case you want to see what happens when you click next.
Image1: here
Here you can see all the requests made when you click next on the page. I always look for the biggest sized response as that'll most likely have your data.
Image2: here
Here you can see the request headers/params etc... things you need to make a proper HTTP request. We can see that the referring URL is actually getplayers.php with all the params to get the next page added on. If you scroll down you can see all the same parameters it sends to getplayers.php. Keep this in mind, sometimes we need to send headers, cookies and parameters.
Image3: here
Here is the preview of the data we would get back from the server if we make the correct request, it's a nice neat format which is great for scraping.
You could copy the headers, parameters and cookies shown here into Scrapy, but it's always worth checking first whether a plain HTTP request to the URL alone returns the data you want, because that is the simplest way.
In this case it does, and in fact you get all the data back in a nice, neat format.
Code example
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['forge-db.com']

    def start_requests(self):
        url = 'https://www.forge-db.com/fr/fr11/getPlayers.php?'
        yield scrapy.Request(url=url)

    def parse(self, response):
        for row in response.json()['data']:
            yield {'name': row[2], 'guild': row[3]}
Settings
In settings.py, you need to set ROBOTSTXT_OBEY = False. The site doesn't want you to access this data, so we need to set it to False. Be careful: you could end up getting banned from the server.
I would also suggest a couple of other settings to be respectful and cache the results so if you want to play around with this large dataset you don't hammer the server.
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 3
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
Comments on the code
We make a request to https://www.forge-db.com/fr/fr11/getPlayers.php? and, if you print the response, you get all the data from the table, which is quite a lot. The body is JSON, so we use Scrapy's response.json() method to convert it into a Python dictionary. Be sure your Scrapy is up to date to take advantage of this (response.json() was added in Scrapy 2.2); otherwise you can use Python's built-in json library to do the same thing.
You have to look at the preview data a bit here, but the individual rows are in response.json()['data'][i], where i is the index of the row. The name and guild are in response.json()['data'][i][2] and response.json()['data'][i][3], so we loop over response.json()['data'] and grab the name and guild from each row.
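For older Scrapy versions without response.json(), the json library route could look roughly like this. It is only a sketch: extract_players is a made-up helper name, and the sample body is invented to mimic the shape described above (a top-level "data" list with the name at index 2 and the guild at index 3).

```python
import json


def extract_players(body_text):
    """Parse the JSON body and return one dict per table row.

    Assumes the layout described above: a top-level "data" list whose
    rows hold the player name at index 2 and the guild at index 3.
    """
    payload = json.loads(body_text)
    return [{"name": row[2], "guild": row[3]} for row in payload["data"]]


# Made-up sample body with the same shape, for illustration only.
sample = '{"data": [[1, 101, "Alice", "Guild A"], [2, 102, "Bob", "Guild B"]]}'
players = extract_players(sample)
```

Inside the spider's parse method this would be called as extract_players(response.text), yielding each dict in turn.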
If the data weren't as structured as it is here and needed modifying, I would strongly urge you to use Items or ItemLoaders to define the fields you then output. You can modify the extracted data more easily with ItemLoaders, and you can deal with duplicate items etc. using a pipeline. These are just some thoughts for the future; I almost never yield a plain dictionary when extracting data, particularly for large datasets.
I'm trying to get the data from "https://fortune.com/global500/2019/search/" using the Python requests-html module. I'm able to get the first 100 items (from the first page) because the page is rendered with JavaScript. We need to click "next" to load the second page, so currently I only get the first 100 items.
When I click "next" in the browser, the URL in the address bar does not change, so I'm clueless how to get the next pages using requests-html.
from requests_html import HTMLSession

def get_fortune500():
    companies = []
    url = 'https://fortune.com/global500/2019/search/'
    session = HTMLSession()
    r = session.get(url)
    r.html.render(wait=1, retries=2)
    table = r.html.find('div.rt-tbody', first=True)
    rows = table.find('div.rt-tr-group')
    for row in rows:
        row_data = []
        cells = row.find('div.rt-td')
        for cell in cells:
            celldata = cell.text.lstrip('$').replace(',', '')
            row_data.append(celldata)
        companies.append(row_data)
    return companies

fortune_list = get_fortune500()
print(fortune_list)
print(len(fortune_list))
I really appreciate your time.
Here is the full list of all 500:
https://content.fortune.com/wp-json/irving/v1/data/franchise-search-results?list_id=2666483
The website stores the response of this API in the browser's IndexedDB, and after that the frontend takes over.
You can work out how to read that response from the first request.
You can do it just by navigating to the JSON mentioned by @Jugraj, but if you want to learn more about requests-html, you can always look at its official documentation.
I'm trying to webscrape the historical 'Market Value Development' chart on this website:
https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290
After learning that it's JavaScript, I started learning about web scraping JS using webdrivers (Selenium), headless browsers, and Chrome/Chromium. After inspecting the page, I found that the ID I might be looking for is id_='yw0', which seems to house the chart:
Given this, here is my code:
import selenium as se
from selenium import webdriver
options = se.webdriver.ChromeOptions()
options.add_argument('headless')
driver = se.webdriver.Chrome(executable_path='/Applications/Utilities/chromedriver', chrome_options=options)
driver.get('https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290')
element = driver.find_element_by_id(id_='yw0')
print(element)
When I run it it outputs this:
<selenium.webdriver.remote.webelement.WebElement (session="bd8e42834fcdd92383ce2ed13c7943c0", element="8df128aa-d242-40a0-9306-f523136bfe57")>
When changing the code after element to
value = element.text
print(value)
I get:
Current Market Value : 180,00 Mill. €
2010
2012
2014
2016
2018
50,0
100,0
150,0
200,0
Which isn't the data but the x and y values of the chart intervals.
I've tried different id tags of the chart to see if I'm simply identifying the wrong container (e.g. highcharts-0). But I'm unable to find the actual data values of the chart.
What's curious is that the chart changes a bit after I run my code. The chart 'gets wider' and runs off the designated area for the chart. It looks like this:
I'm wondering what I can and need to change in the code in order to scrape the data points that are displayed on the chart.
You can regex it out from javascript and do a little string manipulation. You get a list of dictionaries from the below. No need for selenium.
import requests, re, ast
r = requests.get('https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290', headers = {'User-Agent':'Mozilla/5.0'})
p = re.compile(r"'data':(.*)}\],")
s = p.findall(r.text)[0]
s = s.encode().decode('unicode_escape')
data = ast.literal_eval(s)
Looking at first item:
Regex:
tl;dr;
When the page loads in a browser, jQuery pulls the chart info from a script tag, which results in what you see. The regex extracts that same info, i.e. the relevant series data for the chart, from the place jQuery sourced it.
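To see what each step of the regex approach does without hitting the site, here is a small offline sketch. The sample_script string and its field names ('y', 'mw') are invented stand-ins for the page's inline script, so treat the shape as an assumption; the regex, the unicode_escape step and ast.literal_eval are the same as in the snippet above.

```python
import ast
import re

# Made-up stand-in for the inline script on the page; the real one is much longer.
sample_script = (
    r"'series':[{'name':'Marktwert','data':["
    r"{'y':1000000,'mw':'1,00 Mill. \u20ac'},"
    r"{'y':180000000,'mw':'180,00 Mill. \u20ac'}"
    r"]}],'credits':{}"
)

# Capture everything between 'data': and the closing }], of the series list.
pattern = re.compile(r"'data':(.*)}\],")
raw = pattern.findall(sample_script)[0]

# Turn literal \uXXXX escape sequences into the characters they stand for.
raw = raw.encode().decode('unicode_escape')

# The captured text is now a valid Python literal: a list of dictionaries.
points = ast.literal_eval(raw)
```

Because the match is greedy, the capture runs up to the last `}],` in the text, which is worth checking if the page's script contains more than one series.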
Selenium:
There is certainly room for improvement here, but it demonstrates the general principle. The values are retrieved from script tags to update the tooltip as you hover over each data point on the chart, and they are associated with the x,y of the chart point. So you cannot read the tooltip info from the element you were looking at. Instead, you can click each data point and grab the updated info from the tooltip element.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options = Options()
options.add_argument("--start-maximized")
url = 'https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290'
d = webdriver.Chrome(options = options)
d.get(url)
# Dismiss the consent overlay before interacting with the chart
WebDriverWait(d, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".as-oil__btn-optin"))).click()
markers = d.find_elements(By.CSS_SELECTOR, '.highcharts-markers image')
time.sleep(1)
for marker in markers:
    text = ''
    # Retry a few times until the tooltip has been populated
    for _ in range(5):
        ActionChains(d).click_and_hold(marker).perform()
        text = d.find_element(By.CSS_SELECTOR, 'div.highcharts-tooltip').text
        if text:
            break
    print(text)
I'm attempting to scrape data on food seasonality from the Seasonal Food Guide but hitting a snag. The site has a fairly simple URL structure:
https://www.seasonalfoodguide.org/produce_name/state_name
I've been able to use Selenium and Beautiful Soup to successfully scrape the seasonality information from one page, but on subsequent loops the section of text I'm looking for doesn't actually load so I get AttributeError: 'NoneType' object has no attribute 'text'. I know it's because months_list_raw is coming back empty due to the fact that the 'wheel-months-list' portion of the page isn't loading on the second loop. Code is below. Any ideas?
for ingredient in produce_list:
    for state in state_list:
        # grab page content
        search_url = 'https://www.seasonalfoodguide.org/{}/{}'.format(ingredient,state)
        driver.get(search_url)
        page_soup = soup(driver.page_source, 'lxml')
        # grab list of months
        months_list_raw = page_soup.find('p',{'id':'wheel-months-list'})
        months_list = months_list_raw.text
The page is being rendered on the client side, which means when you open the page, another request is being made to a backend server to fetch the data based on your selected filters. So the issue is that when you open the page and read the HTML, the content is not fully loaded yet. The simplest thing you could do is sleep for some time after opening the page with Selenium in order to wait for it to fully load. I've tested your code by throwing in time.sleep(3) after the driver.get(search_url) and it worked fine.
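A fixed sleep either wastes time or is too short on a slow load. A common alternative is to poll until the element shows up, which is what Selenium's WebDriverWait does internally. The sketch below is a plain-Python version of that idea; wait_for and its parameters are made-up names, not part of Selenium or Beautiful Soup.

```python
import time


def wait_for(fetch, timeout=10.0, interval=0.5):
    """Call fetch() until it returns something other than None.

    Returns that value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

In the loop from the question this could be used as `months_list_raw = wait_for(lambda: soup(driver.page_source, 'lxml').find('p', {'id': 'wheel-months-list'}))`, which re-parses the page source on each attempt and still returns None for pages that never render the element.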
To prevent the error from occurring and let your loop continue, you need to check that months_list_raw is not None. It seems that some of the produce pages do not have any data for some states, so you will need to handle that case in your program however you see fit.
for ingredient in produce_list:
    for state in state_list:
        # grab page content
        search_url = 'https://www.seasonalfoodguide.org/{}/{}'.format(ingredient,state)
        driver.get(search_url)
        time.sleep(3)  # give the client-side rendering time to finish
        page_soup = soup(driver.page_source, 'lxml')
        # grab list of months
        months_list_raw = page_soup.find('p',{'id':'wheel-months-list'})
        if months_list_raw is not None:
            months_list = months_list_raw.text
        else:
            # Handle case where ingredient/state data doesn't exist
            months_list = None
I'm trying to make a python script using jupyter-notebook, which is fetching data from my website's sql-server and I want to call this script using a javascript function every time the page is loaded. So the page will have the Plotly graphs.
Here is my code:
# coding: utf-8
# In[1]:
#import os
#os.chdir("D:/Datasets/Trell")
# In[2]:
import json
from pandas.io.json import json_normalize
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from plotly.offline import init_notebook_mode,plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.offline as offline
offline.init_notebook_mode()
import plotly.tools as tls
# In[3]:
# importing the requests library
import requests
# api-endpoint
URL = "https://*****.co.in/*****/*******.php"
# location given here
token= '************'
query= 'SELECT userId,createdAt,userName,trails_count,bio FROM users WHERE createdAt >= "2018-07-01"'
# defining a params dict for the parameters to be sent to the API
PARAMS = {'token':token, 'query':query}
# sending get request and saving the response as response object
r = requests.post(url = URL, data = PARAMS)
# In[4]:
data=r.json()
# In[5]:
df=pd.DataFrame(data)
# In[6]:
df.head(1)
# In[7]:
df['date'] = pd.DatetimeIndex(df.createdAt).normalize()
# In[8]:
df['user']=1
# In[9]:
df_user=df.groupby(['date'],as_index=False)['user'].agg('sum')
# In[10]:
data = [go.Scatter( x=df_user['date'], y=df_user['user'] )]
plot(data, filename='time-series.')
# In[11]:
df_user['day_of_week']=df_user['date'].dt.weekday_name
df_newuser_day=df_user.groupby(['day_of_week'],as_index=False)['user'].agg('sum')
df_newuser_day=df_newuser_day.sort_values(['user'],ascending=False)
trace = go.Bar(
x=df_newuser_day['day_of_week'],
y=df_newuser_day.user,
marker=dict(
color="blue",
#colorscale = 'Blues',
reversescale = True
),
)
layout = go.Layout(
title='Days of Week on which max. users register (July)'
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
plot(fig, filename="medal.")
But the problem is that every time the plot() function is executed, a new HTML tab opens with the filename= given in the call.
All I want is that, when I execute the file, all the graphs appear on a single HTML page. I also want to put a header in an <h1> tag before every plot so that the plots are understandable. So is there a way to do that, along with adding some HTML and CSS around the Plotly plots, so that the result looks like a clean webpage with all the Plotly graphs under their <h1> headers?
In other words, I want all the graphs to appear on the same page, one after the other.
P.S. I don't want to use iplot because it only plots inside the notebook and doesn't save a file.
To make the plots appear in the same page, please use plotly offline's iplot method, instead of plot.
So the statement.
plot(fig, filename="medal.")
will become.
iplot(fig)
If you wish to add HTML before the plot, use display and HTML, which are provided by IPython.
from IPython.display import display, HTML
display(HTML('<h1>Hello, world!</h1>'))
iplot(fig)
Thus, we can insert the HTML first and then plot the graph!
To know more, visit this SO Answer
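Since the question rules out iplot and asks for a saved file, a different technique may fit better: render each figure to an HTML fragment and stitch the fragments into one page. This is not what the answer above does. It is a rough sketch in which build_page is a made-up helper, and the fragments are assumed to come from plotly.offline.plot(fig, output_type='div'), with include_plotlyjs=False on every figure after the first so that plotly.js is embedded only once.

```python
from html import escape


def build_page(sections, page_title="Plots"):
    """Join (heading, plot_div) pairs into one HTML document.

    Each plot_div is expected to be an HTML fragment, for example the
    string returned by plotly.offline.plot(fig, output_type='div').
    Headings are escaped; plot fragments are inserted unchanged.
    """
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        "<title>{}</title>".format(escape(page_title)),
        "</head><body>",
    ]
    for heading, plot_div in sections:
        parts.append("<h1>{}</h1>".format(escape(heading)))
        parts.append(plot_div)
    parts.append("</body></html>")
    return "\n".join(parts)
```

The returned string can then be written to a single file, for example with `open('report.html', 'w').write(build_page(sections))`, and any CSS can go into the head section of the template.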
Late reply: subplots may be the answer to this problem.
For example, to create subplots in 2 rows and 2 columns:
from plotly import tools
plots = tools.make_subplots(rows=2, cols=2, print_grid=True)