Python BeautifulSoup - Scraping Google Finance historical data

I was trying to scrape Google Finance historical data. I needed the total number of rows, which is shown next to the pagination controls. The following div tag is responsible for displaying the total number of rows:
<div class="tpsd">1 - 30 of 1634 rows</div>
I tried the following code to get the data, but it returns an empty list:
soup.find_all('div', 'tpsd')
I tried getting the entire table, but that did not work either. When I checked the page source, I found the value inside a JavaScript function. When I searched for how to get values out of a script tag, the suggestion was to use a regex. So I tried a regex, and the following is my code:
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ')
soup = BeautifulSoup(r.content,'lxml')
var = soup.find_all("script")[8].string
a = re.compile(r"google\.finance\.applyPagination\((.*)'http", re.DOTALL)
b = a.search(var)
num = b.group(1)
print(num.replace(',','').split('\n')[3])
I am able to get the values I want, but I am not sure whether the code above is the right way to do it, or whether there is a better way. Kindly help.

You can simply pass an offset, i.e. start=.., in the URL and fetch 30 rows at a time, which is exactly what the pagination logic does:
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
"enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"
with requests.session() as s:
    start = 0
    req = s.get(url.format(start))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    all_rows = table.find_all("tr")
    while True:
        start += 30
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        if not table:
            break
        all_rows.extend(table.find_all("tr"))
You can also get the total rows using the script tag and use that with range:
import re
with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    # the script tag that calls applyPagination carries the total row count
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    for start in range(30, total+1, 30):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
    print(len(all_rows))
The num=30 parameter is the number of rows per page. To make fewer requests you can set it to 200, which seems to be the maximum, and work out your step/offset from that.
url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
"enddate=Aug+18%2C+2016&num=200&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"
with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    for start in range(200, total+1, 200):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        print(url.format(start))
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
If we run the code, you will see we get 1643 rows (presumably the 1634 data rows plus one header row from each of the 9 pages):
In [7]: with requests.session() as s:
   ...:     req = s.get(url.format(0))
   ...:     soup = BeautifulSoup(req.content, "lxml")
   ...:     table = soup.select_one("table.gf-table.historical_price")
   ...:     scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
   ...:     total = int(scr.text.split(",", 3)[2])
   ...:     all_rows = table.find_all("tr")
   ...:     for start in range(200, total+1, 200):
   ...:         soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
   ...:         table = soup.select_one("table.gf-table.historical_price")
   ...:         all_rows.extend(table.find_all("tr"))
   ...:     print(len(all_rows))
   ...:
1643
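The offsets that the loops above step through can be worked out on their own, without touching the network. The following is only a sketch: page_offsets is a hypothetical helper name, not part of the answer. It stops before total_rows rather than at total+1, so that a total which is an exact multiple of the page size would not produce a request for a page past the end (where select_one would presumably return None).

```python
def page_offsets(total_rows, rows_per_page):
    """Return the start= offsets needed to cover total_rows rows."""
    # range() stops before total_rows, so no offset points past the last row
    return list(range(0, total_rows, rows_per_page))


offsets = page_offsets(1634, 200)
print(offsets)       # offsets 0, 200, ..., 1600
print(len(offsets))  # number of requests needed
```

Each offset can then be passed to url.format(start) exactly as in the loops above.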

You can just use the Python module: https://pypi.python.org/pypi/googlefinance
The API is simple:
# The Google Finance API that we need.
from googlefinance import getQuotes
# The JSON handler, since the API returns JSON.
import json
intelJSON = (getQuotes('INTC'))
intelDump = json.dumps(intelJSON, indent=2)
intelInfo = json.loads(intelDump)
intelPrice = intelInfo[0]['LastTradePrice']
intelTime = intelInfo[0]['LastTradeDateTimeLong']
print ("As of " + intelTime + ", Intel stock is trading at: " + intelPrice)

I prefer having all the raw CSV files that are available for download from Google Finance. I wrote a quick python script to automatically download all the historical price info for a list of companies -- it's equivalent to how a human might use the "Download to Spreadsheet" link manually.
Here's the GitHub repo, with the downloaded CSV files for all S&P 500 stocks (in the rawCSV folder): https://github.com/liezl200/stockScraper
It uses this link http://www.google.com/finance/historical?q=googl&startdate=May+3%2C+2012&enddate=Apr+30%2C+2017&output=csv where the key here is the last output parameter, output=csv. I use urllib.urlretrieve(download_url, local_csv_filename) to retrieve the CSV.
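As a rough illustration of how such a download link can be assembled, here is a minimal sketch that builds the URL from the parameters visible in the link above (q, startdate, enddate, output). The helper name build_csv_url is an assumption made for this example, and whether the endpoint still responds is not something the sketch checks.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.google.com/finance/historical"


def build_csv_url(symbol, start_date, end_date):
    """Build a historical-data URL that asks for CSV output."""
    params = {
        "q": symbol,
        "startdate": start_date,
        "enddate": end_date,
        "output": "csv",  # the parameter that switches the response to CSV
    }
    # urlencode escapes spaces as '+' and commas as '%2C'
    return BASE_URL + "?" + urlencode(params)


url = build_csv_url("googl", "May 3, 2012", "Apr 30, 2017")
print(url)
```

On Python 3 the resulting URL could be passed to urllib.request.urlretrieve, the counterpart of the Python 2 urllib.urlretrieve call mentioned above.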

Related

CSV data with 2 different delimiters for Python 2/Ignition

I am developing code in Ignition using Jython/Python 2 scripts. We need to read data from a CSV file that uses two delimiters: "," in the header and "\t" in the data. The code we use is:
file_path = r'T:\test1.csv'
csvData = csv.reader(open(file_path, 'r'))
header = csvData.next() # Read the header (first row)
dataset = system.dataset.toDataSet(header,list(csvData))
calcwindow.rootContainer.getComponent('Power Table').data = dataset
After applying this code we get this:
Power Table
The question is: how can we separate the data so that all rows and columns line up when using csv.reader? Ignition does not support pandas or re :(
I updated the code, and now it separates the data correctly:
csvData = csv.reader(open(file_path, 'r'),delimiter=',')
header = csvData.next() # Read the header (first row)
for line in csvData:
    str1 = "".join(line) # removes commas
    #print str1
    parts = str1.split("\t")
    print parts
dataset = system.dataset.toDataSet(header,list(parts))
calcwindow.rootContainer.getComponent('Power Table').data = dataset
But then this error came up:
Row 0 doesn't have the same number of columns as header list.
Any suggestions??
Thanks
Igor

I figured it out myself.
Here is the code:
import csv
file_path = r'T:\test1.csv'
try:
    file = open(file_path)
    csvData = csv.reader(file,delimiter=',') # open the file with comma delimiter
    header = csvData.next() # read the header (first row)
    csvData1 = list(csvData) # create list from data
    lstLine = csvData1[-1] # select the last line added
    str1 = "".join(lstLine) # remove commas and create a string
    parts = str1.split("\t") # split the string back into a list
    dataset = system.dataset.toDataSet(header,[parts]) # rows must be a list of lists
    calcwindow.rootContainer.getComponent('Power Table').data = dataset
    file.close()
except:
    print "CSV busy exporting from TIA software"
I hope it helps someone.
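The same idea can be written without csv.reader by splitting the header line on commas and every later line on tabs. This is only a sketch under assumptions: parse_mixed_delimiters is a hypothetical helper, the sample column names are made up, and unlike the code above it keeps every data row rather than only the last one. The returned header and rows have the shape that system.dataset.toDataSet is called with above.

```python
def parse_mixed_delimiters(lines):
    """Split the first line on commas and every later line on tabs."""
    # drop blank lines and trailing newlines
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    header = rows[0].split(",")
    data = [row.split("\t") for row in rows[1:]]
    # keep only rows whose width matches the header, since a mismatch
    # is what triggered the "same number of columns" error
    return header, [row for row in data if len(row) == len(header)]


# made-up sample content standing in for the exported file
sample = ["Time,Power,Voltage\n", "10:00\t1.5\t230\n", "10:01\t1.7\t231\n"]
header, data = parse_mixed_delimiters(sample)
print(header)
print(data)
```

Silently dropping rows of the wrong width is a design choice for the sketch; logging them instead may be preferable when the file is still being written.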

Pull variable value from javascript source using BeautifulSoup4 Python

I'm a newbie in Python programming. I'm learning BeautifulSoup to scrape websites.
I want to extract the value of "stream" and store it in a variable.
My Python code is as follows:
import bs4 as bs #Importing BeautifulSoup4 Python Library.
import urllib.request
import requests
import json
import re
headers = {'User-Agent':'Mozilla/5.0'}
url = "http://thoptv.com/partners/mhdTVlive/Core.php?level=1200&channel=Dsports_HD"
page = requests.get(url)
soup = bs.BeautifulSoup(page.text,"html.parser")
pattern = re.compile('var stream = (.*?);')
scripts = soup.find_all('script')
for script in scripts:
    if(pattern.match(str(script.string))):
        data = pattern.match(script.string)
        links = json.loads(data.groups()[0])
        print(links)
This is the source javascript code to get the stream url value.
https://content.jwplatform.com/libraries/oncyToRO.js'>if( navigator.userAgent.match(/android/i)||
navigator.userAgent.match(/webOS/i)||
navigator.userAgent.match(/iPhone/i)||
navigator.userAgent.match(/iPad/i)||
navigator.userAgent.match(/iPod/i)||
navigator.userAgent.match(/BlackBerry/i)||
navigator.userAgent.match(/Windows Phone/i)) {var stream =
"http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=ibxIPxc6rkq1yIUJb4RlEV&pxe=1504146411&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw";}else{var
stream =
"http://hd.simiptv.com:8080//index.m3u8?key=VIoVSsGRLRouHWGNo1epzX&exp=932213423&domain=thoptv.stream&id=461";}jwplayer("THOPTVPlayer").setup({"title":
'thoptv.stream',"stretching":"exactfit","width": "100%","file":
none,"height": "100%","skin": "seven","autostart": "true","logo":
{"file":"https://i.imgur.com/EprI2uu.png","margin":"-0",
"position":"top-left","hide":"false","link":"http://mhdtvlive.co.in"},"androidhls":
true,});jwplayer("THOPTVPlayer").onError(function(){jwplayer().load({file:"http://content.jwplatform.com/videos/7RtXk3vl-52qL9xLP.mp4",image:"http://content.jwplatform.com/thumbs/7RtXk3vl-480.jpg"});jwplayer().play();});jwplayer("THOPTVPlayer").onComplete(function(){window.location
= window.location.href;});jwplayer("THOPTVPlayer").onPlay(function(){clearTimeout(theTimeout);});
I need to extract the url from stream.
var stream = "http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=ibxIPxc6rkq1yIUJb4RlEV&pxe=1504146411&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw";}

Rather than doing something complicated with regex, if the link is the only dynamically changing part, you can split the string on some known separator tokens.
x = """
https://content.jwplatform.com/libraries/oncyToRO.js'>if( navigator.userAgent.match(/android/i)|| navigator.userAgent.match(/webOS/i)|| navigator.userAgent.match(/iPhone/i)|| navigator.userAgent.match(/iPad/i)|| navigator.userAgent.match(/iPod/i)|| navigator.userAgent.match(/BlackBerry/i)|| navigator.userAgent.match(/Windows Phone/i)) {var stream = "http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=ibxIPxc6rkq1yIUJb4RlEV&pxe=1504146411&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw";}else{var stream = "http://hd.simiptv.com:8080//index.m3u8?key=VIoVSsGRLRouHWGNo1epzX&exp=932213423&domain=thoptv.stream&id=461";}jwplayer("THOPTVPlayer").setup({"title": 'thoptv.stream',"stretching":"exactfit","width": "100%","file": none,"height": "100%","skin": "seven","autostart": "true","logo": {"file":"https://i.imgur.com/EprI2uu.png","margin":"-0", "position":"top-left","hide":"false","link":"http://mhdtvlive.co.in"},"androidhls": true,});jwplayer("THOPTVPlayer").onError(function(){jwplayer().load({file:"http://content.jwplatform.com/videos/7RtXk3vl-52qL9xLP.mp4",image:"http://content.jwplatform.com/thumbs/7RtXk3vl-480.jpg"});jwplayer().play();});jwplayer("THOPTVPlayer").onComplete(function(){window.location = window.location.href;});jwplayer("THOPTVPlayer").onPlay(function(){clearTimeout(theTimeout);});
"""
left1, right1 = x.split("Phone/i)) {var stream =")
left2, right2 = right1.split(";}else")
print(left2)
# "http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=ibxIPxc6rkq1yIUJb4RlEV&pxe=1504146411&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw"

pattern.match() matches the pattern from the beginning of the string. Try using pattern.search() instead - it will match anywhere within the string.
Change your for loop to this:
for script in scripts:
    data = pattern.search(script.text)
    if data is not None:
        stream_url = data.groups()[0]
        print(stream_url)
You can also get rid of the surrounding quotes by changing the regex pattern to:
pattern = re.compile('var stream = "(.*?)";')
so that the double quotes are not included in the group.
You might also have noticed that there are two possible stream variables, depending on the user agent making the request. For mobile devices (Android, iPhone, iPad and similar) the first would be appropriate, while all other user agents should use the second stream. You can use pattern.findall() to get all of them:
>>> pattern.findall(script.text)
['"http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=LEurobVVelOhbzOZ6EkTwr&pxe=1571716053&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.*AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw*"', '"http://hd.simiptv.com:8080//index.m3u8?key=vaERnLJswnWXM8THmfvDq5&exp=944825312&domain=thoptv.stream&id=461"']
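The difference between match(), search() and findall() can be checked offline on a small string. The following sketch uses a shortened, made-up script body with example.com URLs in place of the real stream links; the \s* around the equals sign is an addition to tolerate line breaks like the ones in the page source.

```python
import re

# shortened stand-in for the real script text; the URLs are placeholders
script_text = (
    'if (navigator.userAgent.match(/iPhone/i)) {var stream = '
    '"http://example.com/mobile.m3u8?key=abc";}else{var stream = '
    '"http://example.com/desktop.m3u8?key=xyz";}'
)

pattern = re.compile(r'var stream\s*=\s*"(.*?)";')

at_start = pattern.match(script_text)    # anchored at position 0
first = pattern.search(script_text)      # first hit anywhere in the string
urls = pattern.findall(script_text)      # every captured group

print(at_start)
print(first.group(1))
print(urls)
```

Because the script starts with "if", match() finds nothing, which is the reason the original loop never printed a link.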

This code works for me:
import bs4 as bs #Importing BeautifulSoup4 Python Library.
import urllib.request
import requests
import json
headers = {'User-Agent':'Mozilla/5.0'}
url = "http://thoptv.com/partners/mhdTVlive/Core.php?level=1200&channel=Dsports_HD"
page = requests.get(url)
soup = bs.BeautifulSoup(page.text,"html.parser")
scripts = soup.find_all('script')
out = list()
for c, i in enumerate(scripts): # go over the list
    text = i.text
    if(text[:2] == "if"): # if the (if) comes first
        for count, t in enumerate(text): # then we have reached the correct item in the list
            if text[count] == "{" and text[count + 1] == "v" and text[count + 5] == "s": # if this is here, stream is being set
                tmp = text[count:] # add this to the tmp variable
                break # and end
        co = 0
        for m in tmp: # loop over the result of the previous step
            if m == "\"" and co == 0: # if the string is starting
                co = 1 # set count to "true" (1)
            elif m == "\"" and co == 1: # if it is ending, stop
                print(''.join(out)) # results
                break
            elif co == 1:
                # as long as we are looping over the right string
                out.append(m) # add to the out list
                pass
result = ''.join(out) # set result
It basically filters the string manually.
But if we use user1767754's method (brilliant, by the way), we end up with something like this:
import bs4 as bs #Importing BeautifulSoup4 Python Library.
import urllib.request
import requests
import json
headers = {'User-Agent':'Mozilla/5.0'}
url = "http://thoptv.com/partners/mhdTVlive/Core.php?level=1200&channel=Dsports_HD"
page = requests.get(url)
soup = bs.BeautifulSoup(page.text,"html.parser")
scripts = soup.find_all('script')
x = scripts[3].text
left1, right1 = x.split("Phone/i)) {var stream =")
left2, right2 = right1.split(";}else")
print(left2)

Follow each link of a page and scrape content, Scrapy + Selenium

This is the website I'm working on. On each page, there are 18 posts in a table. I want to access each post and scrape its content, and repeat this for the first 5 pages.
My approach is to have my spider scrape all links on the 5 pages and then iterate over them to get the content. Because the "next page" button and certain text in each post are generated by JavaScript, I use Selenium together with Scrapy. I ran my spider and could see the Firefox webdriver display the first 5 pages, but then the spider stopped without scraping any content. Scrapy returned no error message either.
Now I suspect that the failure may be due to:
1) No link is stored into all_links.
2) Somehow parse_content did not run.
My diagnosis may be wrong and I need help with finding the problem. Thank you very much!
This is my spider:
import scrapy
from bjdaxing.items_bjdaxing import BjdaxingItem
from selenium import webdriver
from scrapy.http import TextResponse
import time
all_links = [] # a global variable to store post links
class Bjdaxing(scrapy.Spider):
    name = "daxing"
    allowed_domains = ["bjdx.gov.cn"] # DO NOT use www in allowed domains
    start_urls = ["http://app.bjdx.gov.cn/cms/daxing/lookliuyan_bjdx.jsp"] # This has to start with http
    def __init__(self):
        self.driver = webdriver.Firefox()
    def parse(self, response):
        self.driver.get(response.url) # request the start url in the browser
        i = 1
        while i <= 5: # The number of pages to be scraped in this session
            response = TextResponse(url = response.url, body = self.driver.page_source, encoding='utf-8') # Assign page source to response. I can treat response as if it's a normal scrapy project.
            global all_links
            all_links.extend(response.xpath("//a/@href").extract()[0:18])
            next = self.driver.find_element_by_xpath(u'//a[text()="\u4e0b\u9875\xa0"]') # locate "next" button
            next.click() # Click next page
            time.sleep(2) # Wait a few seconds for next page to load.
            i += 1
    def parse_content(self, response):
        item = BjdaxingItem()
        global all_links
        for link in all_links:
            self.driver.get("http://app.bjdx.gov.cn/cms/daxing/" + link)
            response = TextResponse(url = response.url, body = self.driver.page_source, encoding = 'utf-8')
            if len(response.xpath("//table/tbody/tr[1]/td[2]/text()").extract()) > 0:
                item['title'] = response.xpath("//table/tbody/tr[1]/td[2]/text()").extract()
            else:
                item['title'] = ""
            if len(response.xpath("//table/tbody/tr[3]/td[2]/text()").extract()) > 0:
                item['netizen'] = response.xpath("//table/tbody/tr[3]/td[2]/text()").extract()
            else:
                item['netizen'] = ""
            if len(response.xpath("//table/tbody/tr[3]/td[4]/text()").extract()) > 0:
                item['sex'] = response.xpath("//table/tbody/tr[3]/td[4]/text()").extract()
            else:
                item['sex'] = ""
            if len(response.xpath("//table/tbody/tr[5]/td[2]/text()").extract()) > 0:
                item['time1'] = response.xpath("//table/tbody/tr[5]/td[2]/text()").extract()
            else:
                item['time1'] = ""
            if len(response.xpath("//table/tbody/tr[11]/td[2]/text()").extract()) > 0:
                item['time2'] = response.xpath("//table/tbody/tr[11]/td[2]/text()").extract()
            else:
                item['time2'] = ""
            if len(response.xpath("//table/tbody/tr[7]/td[2]/text()").extract()) > 0:
                question = "".join(response.xpath("//table/tbody/tr[7]/td[2]/text()").extract())
                item['question'] = "".join(map(unicode.strip, question))
            else: item['question'] = ""
            if len(response.xpath("//table/tbody/tr[9]/td[2]/text()").extract()) > 0:
                reply = "".join(response.xpath("//table/tbody/tr[9]/td[2]/text()").extract())
                item['reply'] = "".join(map(unicode.strip, reply))
            else: item['reply'] = ""
            if len(response.xpath("//table/tbody/tr[13]/td[2]/text()").extract()) > 0:
                agency = "".join(response.xpath("//table/tbody/tr[13]/td[2]/text()").extract())
                item['agency'] = "".join(map(unicode.strip, agency))
            else: item['agency'] = ""
            yield item

Multiple problems and possible improvements here:
you don't have any "link" between the parse() and the parse_content() methods
using global variables is usually a bad practice
you don't need selenium here at all. To follow the pagination you just need to make a POST request to the same url providing the currPage parameter
The idea is to use .start_requests() and create a list/queue of requests to handle the pagination. Follow the pagination and gather the links from the table. Once the queue of requests is empty, switch to following the previously gathered links. Implementation:
import json
from urlparse import urljoin
import scrapy
NUM_PAGES = 5
class Bjdaxing(scrapy.Spider):
    name = "daxing"
    allowed_domains = ["bjdx.gov.cn"] # DO NOT use www in allowed domains
    def __init__(self):
        self.pages = []
        self.links = []
    def start_requests(self):
        # one POST request per page; they are issued one at a time from this queue
        self.pages = [scrapy.Request("http://app.bjdx.gov.cn/cms/daxing/lookliuyan_bjdx.jsp",
                                     body=json.dumps({"currPage": str(page)}),
                                     method="POST",
                                     callback=self.parse_page,
                                     dont_filter=True)
                      for page in range(1, NUM_PAGES + 1)]
        yield self.pages.pop()
    def parse_page(self, response):
        base_url = response.url
        self.links += [urljoin(base_url, link) for link in response.css("table tr td a::attr(href)").extract()]
        try:
            yield self.pages.pop()
        except IndexError: # no more pages to follow, going over the gathered links
            for link in self.links:
                yield scrapy.Request(link, callback=self.parse_content)
    def parse_content(self, response):
        # your parse_content method here
        pass
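The urljoin call in parse_page is what turns the hrefs gathered from the table into absolute URLs that can be requested. A minimal offline sketch of that step follows; the href values are hypothetical, since the actual links on the site are not shown here, and the import uses the Python 3 location of urljoin rather than the Python 2 urlparse module used above.

```python
from urllib.parse import urljoin  # Python 3 home of urljoin

listing_url = "http://app.bjdx.gov.cn/cms/daxing/lookliuyan_bjdx.jsp"

# hypothetical hrefs: relative, root-relative and already absolute
hrefs = [
    "detail.jsp?id=1",
    "/cms/daxing/detail.jsp?id=2",
    "http://app.bjdx.gov.cn/other.jsp",
]

# resolve every href against the URL of the page it was found on
links = [urljoin(listing_url, href) for href in hrefs]
for link in links:
    print(link)
```

Resolving against response.url instead of concatenating a hard-coded prefix means all three forms of href end up as valid request URLs.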

how to fetch javascript contents in python

I have a website where the data I want to fetch is stored in JavaScript. How do I fetch it?
The code is this :- http://pastebin.com/zhdWT5HM
I want to fetch from "var playersData" line. I want to fetch this thing :- "playerId":"showsPlayer" (without quotes obviously). How do I do so?
I've tried beautiful soup. My current script looks like this
q = requests.get('websitelink')
soup = BeautifulSoup(q.text)
searching = soup.findAll('script',{'type':'text/javascript'})
for playerId in searching:
    x = playerId.find_all('var playersData', limit=1)
    print x
I'm getting [] as my output. I can't seem to figure out my problem here.
Please help out guys and gals :)

BeautifulSoup would only help locating the desired script tag. Then, you would have multiple options: you can extract the desired data with a javascript parser, like slimit, or use regular expressions:
import re
from bs4 import BeautifulSoup
page = """
<script type="text/javascript">
var logged = true;
var video_id = 59374;
var item_type = 'official';
var debug = false;
var baseUrl = 'http://www.example.com';
var base_url = 'http://www.example.com/';
var assetsBaseUrl = 'http://www.example.com/assets';
var apiBaseUrl = 'http://www.example.com/common';
var playersData = [{"playerId":"showsPlayer","userId":true,"solution":"flash","playlist":[{"itemId":"5090","itemAK":"Movie"}]];
</script><script type="text/javascript" >
"""
soup = BeautifulSoup(page)
pattern = re.compile(r'"playerId":"(.*?)"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print pattern.search(script.text).group(1)
Prints:
showsPlayer
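If more than one field is needed, another option is to capture the whole array literal and hand it to the json module. This is a sketch under assumptions: extract_js_array is a hypothetical helper, the sample below is a well-formed variant of the data shown above, and the approach only works when the JavaScript literal happens to be valid JSON and contains no "];" inside a string value.

```python
import json
import re

# well-formed sample modeled on the script shown above
script_text = '''
var debug = false;
var playersData = [{"playerId":"showsPlayer","userId":true,"solution":"flash","playlist":[{"itemId":"5090","itemAK":"Movie"}]}];
'''


def extract_js_array(source, name):
    """Return the array assigned to `var <name>` as Python data, or None."""
    # non-greedy match up to the first "];" that closes the statement
    pattern = r"var\s+" + re.escape(name) + r"\s*=\s*(\[.*?\]);"
    match = re.search(pattern, source, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


players = extract_js_array(script_text, "playersData")
print(players[0]["playerId"])
print(players[0]["playlist"][0]["itemAK"])
```

For literals that are not valid JSON (single quotes, unquoted keys), a JavaScript parser such as the slimit option mentioned above is the safer route.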

Any way to get JS object using scrapy

I am using scrapy to gather schedule information on uslpro website. The site I am crawling is http://uslpro.uslsoccer.com/schedules/index_E.html.
The content of the page is rendered when the page is loaded. So I can't get the table data directly from source code. I looked at the source code and found that the schedule objects are stored in one object.
Here is the JavaScript Code.
preRender: function(){
var gmsA=diiH2A(DIISnapshot.gamesHolder);
....
This gmsA object has all schedule information. Is there any way to get this JS object using scrapy? Thank you very much for your help.

For starters, you have multiple options to choose from:
parse the javascript file containing the data (which is what I'm describing below)
use Scrapy with the scrapyjs tool
automate a real browser with the help of selenium
Okay, the first option (which is arguably the most complicated).
The page is loaded via a separate call to a .js file which contains the information about matches and teams in two separate objects:
DIISnapshot.gms = {
"4428801":{"code":"1","tg":65672522,"fg":"2953156","fac":"22419","facn":"Blackbaud Stadium","tm1":"13380700","tm2":"22310","sc1":"1","sc2":"1","gmapply":"","dt":"22-MAR-2014","tim":"30-DEC-1899 19:30:00.0000","se":"65672455","modst":"","gmlabel":"","golive":0,"gmrpt":"67842863","urlvideo":"http://www.youtube.com/watch?v=JHi6_nnuAsQ","urlaudio":""}
, "4428803":{"code":"2","tg":65672522,"fg":"2953471","fac":"1078448","facn":"StubHub Center","tm1":"33398866","tm2":"66919078","sc1":"1","sc2":"3","gmapply":"","dt":"22-MAR-2014","tim":"30-DEC-1899 22:30:00.0000","se":"65672455","modst":"","gmlabel":"","golive":0,"gmrpt":"67846731","urlvideo":"http://www.youtube.com/watch?v=nLaRaTi7BgE","urlaudio":""}
...
, "5004593":{"code":"217","tg":65672522,"fg":"66919058","fac":"66919059","facn":"Bonney Field","tm1":"934394","tm2":"65674034","sc1":"0","sc2":"2","gmapply":"3","dt":"27-SEP-2014","tim":"30-DEC-1899 22:30:00.0000","se":"65672455","modst":"21-SEP-2014 1:48:26.5710","gmlabel":"FINAL","golive":0,"gmrpt":"72827154","urlvideo":"https://www.youtube.com/watch?v=QPhL8Ktkz4M","urlaudio":""}
};
DIISnapshot.tms = {
"13380700":{"name":"Orlando City SC","club":"","nick":"Orlando","primarytg":"65672522"}
...
, "8969532":{"name":"Pittsburgh Riverhounds","club":"","nick":"Pittsburgh","primarytg":"65672522"}
, "934394":{"name":"Harrisburg City Islanders","club":"","nick":"Harrisburg","primarytg":"65672522"}
};
And things are getting a bit more difficult because the URL to that js file is also constructed with javascript in the following script tag:
<script type="text/javascript">
var DIISnapshot = {
goLive: function(gamekey) {
clickpop1=window.open('http://uslpro.uslsoccer.com/scripts/runisa.dll?M2:gp::72013+Elements/DisplayBlank+E+2187955++'+gamekey+'+65672455','clickpop1','toolbar=0,location=0,status=0,menubar=0,scrollbars=1,resizable=0,top=100,left=100,width=315,height=425');
}
};
var DIISchedule = {
MISL_lgkey: '36509042',
sename:'2014',
sekey: '65672455',
lgkey: '2792331',
tg: '65672522',
...
fetchInfo:function(){
var fname = DIISchedule.tg;
if (fname === '') fname = DIISchedule.sekey;
new Ajax.Request('/schedules/' + DIISchedule.seSeq + '/' + fname + '.js?'+rand4(),{asynchronous: false});
DIISnapshot.gamesHolder = DIISnapshot.gms;
DIISnapshot.teamsHolder = DIISnapshot.tms;
DIISnapshot.origTeams = [];
for (var teamkey in DIISnapshot.tms) DIISnapshot.origTeams.push(teamkey);
},
...
DIISchedule.scheduleLoaded = true;
}
}
document.observe('dom:loaded',DIISchedule.init);
</script>
Okay, let's use the BeautifulSoup HTML parser and the slimit JavaScript parser to get the dynamic part used to construct the URL (that tg value is the name of the js file with the data), then make a request to that URL, parse the JavaScript and print out the matches:
import json
import random
import re
from bs4 import BeautifulSoup
import requests
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
# start a session
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'}
session = requests.Session()
response = session.get('http://uslpro.uslsoccer.com/schedules/index_E.html', headers=headers)
# get the dynamic part of the JS url
soup = BeautifulSoup(response.content)
script = soup.find('script', text=lambda x: x and 'var DIISchedule' in x)
tg = re.search(r"tg: '(\d+)',", script.text).group(1)
# request to JS url
js_url = "http://uslpro.uslsoccer.com/schedules/2014/{tg}.js?{rand}".format(tg=tg, rand=random.randint(1000, 9999))
response = session.get(js_url, headers=headers)
# parse js
parser = Parser()
tree = parser.parse(response.content)
matches, teams = [json.loads(node.right.to_ecma())
                  for node in nodevisitor.visit(tree)
                  if isinstance(node, ast.Assign) and isinstance(node.left, ast.DotAccessor)]
for match in matches.itervalues():
    print teams[match['tm1']]['name'], '%s : %s' % (match['sc1'], match['sc2']), teams[match['tm2']]['name']
Prints:
Arizona United SC 0 : 2 Orange County Blues FC
LA Galaxy II 1 : 0 Seattle Sounders FC Reserves
LA Galaxy II 1 : 3 Harrisburg City Islanders
New York Red Bulls Reserves 0 : 1 OKC Energy FC
Wilmington Hammerheads FC 2 : 1 Charlotte Eagles
Richmond Kickers 3 : 2 Harrisburg City Islanders
Charleston Battery 0 : 2 Orlando City SC
Charlotte Eagles 0 : 2 Richmond Kickers
Sacramento Republic FC 2 : 1 Dayton Dutch Lions FC
OKC Energy FC 0 : 5 LA Galaxy II
...
The part printing the list of matches is for demonstration purposes. You can use matches and teams dictionaries to output the data in a format you need.
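Because the two objects in that .js file look like plain JSON, a lighter alternative to a full JavaScript parser might be a regular expression plus the json module. The sketch below swaps in that technique instead of slimit; extract_assignments is a hypothetical helper, the inline sample is a trimmed, made-up pairing of teams rather than real fixture data, and it assumes no "};" appears inside a string value.

```python
import json
import re

# trimmed, made-up sample in the same shape as the real .js file
js_source = '''
DIISnapshot.gms = {
"4428801":{"tm1":"13380700","tm2":"934394","sc1":"1","sc2":"1"}
};
DIISnapshot.tms = {
"13380700":{"name":"Orlando City SC"}
, "934394":{"name":"Harrisburg City Islanders"}
};
'''


def extract_assignments(source):
    """Map each DIISnapshot.<name> = {...}; assignment to parsed JSON."""
    pattern = re.compile(r"DIISnapshot\.(\w+)\s*=\s*(\{.*?\});", re.DOTALL)
    return {name: json.loads(body) for name, body in pattern.findall(source)}


snapshot = extract_assignments(js_source)
matches, teams = snapshot["gms"], snapshot["tms"]
lines = [
    "%s %s : %s %s" % (teams[m["tm1"]]["name"], m["sc1"], m["sc2"], teams[m["tm2"]]["name"])
    for m in matches.values()
]
print("\n".join(lines))
```

Looking the objects up by name also avoids relying on the order in which the assignments appear in the file.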
As this is not a popular tag I don't expect any upvotes - most importantly, it was an interesting challenge for me.
