Trying to scrape table using Pandas from Selenium's result - javascript

I am trying to scrape a table from a Javascript website using Pandas. For this, I used Selenium to first reach my desired page. I am able to print the table in text format (as shown in commented script), but I want to be able to have the table in Pandas, too. I am attaching my script as below and I hope someone could help me figure this out.
import time
from selenium import webdriver
import pandas as pd

chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'
page = driver.get(url)
time.sleep(2)
driver.find_element_by_xpath('//*[@id="bursa_boards"]/option[2]').click()
driver.find_element_by_xpath('//*[@id="bursa_sectors"]/option[11]').click()
time.sleep(2)
driver.find_element_by_xpath('//*[@id="bm_equity_price_search"]').click()
time.sleep(5)
target = driver.find_elements_by_id('bm_equities_prices_table')

##for data in target:
##    print (data.text)

for data in target:
    dfs = pd.read_html(target, match='+')
for df in dfs:
    print(df)
Running the above script, I get the error below:
Traceback (most recent call last):
  File "E:\Coding\Python\BS_Bursa Properties\Selenium_Pandas_Bursa Properties.py", line 29, in <module>
    dfs = pd.read_html(target,match = '+')
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 906, in read_html
    keep_default_na=keep_default_na)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 728, in _parse
    compiled_match = re.compile(match)  # you can pass a compiled regex here
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 233, in compile
    return _compile(pattern, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 616, in _parse
    source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 0
I've also tried using pd.read_html on the URL directly, but it returned a "No Table Found" error. The URL is: http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page=1.

You can get the table using the following code:
import time
from selenium import webdriver
import pandas as pd
chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'
page = driver.get(url)
time.sleep(2)
df = pd.read_html(driver.page_source)[0]
print(df.head())
This is the output
No Code Name Rem Last Done LACP Chg % Chg Vol ('00) Buy Vol ('00) Buy Sell Sell Vol ('00) High Low
0 1 5284CB LCTITAN-CB s 0.025 0.020 0.005 +25.00 406550 19878 0.020 0.025 106630 0.025 0.015
1 2 1201 SUMATEC [S] s 0.050 0.050 - - 389354 43815 0.050 0.055 187301 0.055 0.050
2 3 5284 LCTITAN [S] s 4.470 4.700 -0.230 -4.89 367335 430 4.470 4.480 34 4.780 4.140
3 4 0176 KRONO [S] - 0.875 0.805 0.070 +8.70 300473 3770 0.870 0.875 797 0.900 0.775
4 5 5284CE LCTITAN-CE s 0.130 0.135 -0.005 -3.70 292379 7214 0.125 0.130 50 0.155 0.100
To get data from all pages, you can crawl the remaining pages and use df.append (or pd.concat); a rough sketch of that loop follows.
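As a minimal sketch, assuming the pagination exposes a clickable "next" link (the a.next selector below is a placeholder; check the real markup before using it), the per-page tables could be collected like this:

import time
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome(r"Path to chrome driver")
driver.get('http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02')
time.sleep(2)

frames = []
while True:
    # parse the table rendered on the current page
    frames.append(pd.read_html(driver.page_source)[0])
    try:
        # hypothetical selector for the pagination "next" link -- adjust to the real markup
        driver.find_element_by_css_selector('a.next').click()
    except NoSuchElementException:
        break
    time.sleep(2)

df = pd.concat(frames, ignore_index=True)  # equivalent to repeatedly calling df.append
print(df.shape)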

Answer:
df = pd.read_html(target[0].get_attribute('outerHTML'))
Reason for target[0]:
driver.find_elements_by_id('bm_equities_prices_table') returns a list of Selenium web elements; in your case there is only one element, hence [0].
Reason for get_attribute('outerHTML'):
We want the HTML of the element. get_attribute can return two flavours of it: 'innerHTML' vs 'outerHTML'. We chose 'outerHTML' because we need to include the current element itself (where the table headers are, I suppose), rather than only its inner contents.
Reason for df[0]:
pd.read_html() returns a list of DataFrames, the first of which is the result we want, hence [0].
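Putting the pieces together, the extraction step could read as follows (assuming the driver has already navigated and clicked through the filters as in the question):

target = driver.find_elements_by_id('bm_equities_prices_table')
# outerHTML of the first (and only) matching element, parsed by pandas;
# read_html returns a list of DataFrames, so take the first one
df = pd.read_html(target[0].get_attribute('outerHTML'))[0]
print(df.head())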

Related

Hi all. I started working on a license plate recognition project and I'm new to computer vision. I want to run the code but I get an error

I have a trained model.weights on YOLOv4 for license plate detection. Now I need to recognize the characters on the license plate. I have the code; when I run test_img('D:/RTP/car.jpg', config_file, weights, 'D:/RTP/') I get an error:
AttributeError                            Traceback (most recent call last)
Cell In [20], line 1
----> 1 test_img('D:/RTP/car.jpg', config_file, weights,'D:/RTP/')

Cell In [18], line 3, in test_img(input, config_file, weights, out_path)
      1 def test_img(input, config_file, weights, out_path):
      2     # Loading darknet network and classes along with the bbox colors.
----> 3     network, class_names, class_colors = darknet.load_network(
      4         config_file,
      5         data_file,
      6         weights,
      7         batch_size= batch_size
      8     )
     10     # Reading the image and performing YOLOv4 detection.
     11     img = cv2.imread(input)

AttributeError: module 'darknet' has no attribute 'load_network'
This is the test_img() function:
def test_img(input, config_file, weights, out_path):
    # Loading darknet network and classes along with the bbox colors.
    network, class_names, class_colors = darknet.load_network(
        config_file,
        data_file,
        weights,
        batch_size=batch_size
    )

    # Reading the image and performing YOLOv4 detection.
    img = cv2.imread(input)
    bboxes, scores, det_time = yolo_det(img, config_file, data_file, batch_size, weights, thresh, out_path, network, class_names, class_colors)

    # Extracting or cropping the license plate and applying the OCR.
    for bbox in bboxes:
        bbox = [bbox[0], bbox[1], bbox[2] - bbox[0], bbox[3] - bbox[1]]
        cr_img = crop(img, bbox)
        result = ocr.ocr(cr_img, cls=False, det=False)
        ocr_res = result[0][0]
        rec_conf = result[0][1]
        print(result)

        # Plotting the predictions using OpenCV.
        (label_width, label_height), baseline = cv2.getTextSize(ocr_res, font, 2, 3)
        top_left = tuple(map(int, [int(bbox[0]), int(bbox[1]) - (label_height + baseline)]))
        top_right = tuple(map(int, [int(bbox[0]) + label_width, int(bbox[1])]))
        org = tuple(map(int, [int(bbox[0]), int(bbox[1]) - baseline]))
        cv2.rectangle(img, (int(bbox[0]), int(bbox[1])), (int(bbox[2]), int(bbox[3])), blue_color, 2)
        cv2.rectangle(img, top_left, top_right, blue_color, -1)
        cv2.putText(img, ocr_res, org, font, 2, white_color, 3)

    # Writing output image.
    file_name = os.path.join(out_path, 'out_' + input.split('/')[-1])
    cv2.imwrite(file_name, img)

Parse <script type="text/javascript" twitter python

very long code..
I need to parse screen_name:
<script type="text/javascript" charset="utf-8" nonce="YjJmNTAwODgtODBmMy00YzQ5LWJhODItMmQwNTk0Yjg4MTI1">window.__INITIAL_STATE__={"optimist":[],"urt":{},"toasts":[],"needs_phone_verification":false,"normal_followers_count":2,"notifications":false,"pinned_tweet_ids_str":[],"profile_image_url_https":"https://pbs.twimg.com/profile_images/1174197230003208192/qK5cqalJ_normal.jpg","profile_interstitial_type":"","protected":false,"featureSwitch":{"config":{"2fa_multikey_management_enabled":{"value":false},"screen_name":"Vickson25435099","always_use_https":true,"use_cookie_personalization":false,"sleep_time":{"enabled":false,"end_time":null,"start_time":null},"geo_enabled":false,"language":"en","discoverable_by_email":true,"discoverable_by_mobile_phone":true,"personalized_trends":true,"allow_media_tagging":"none","allow_contributor_request":"all","allow_ads_personalization":true,"allow_logged_out_device_personalization":true,"allow_location_history_personalization":true,"allow_sharing_data_for_third_party_personalization":false,"allow_dms_from":"following","allow_dm_groups_from":"following","translator_type":"none","country_code":"us","nsfw_user":false,"nsfw_admin":false,"ranked_timeline_setting":1,"ranked_timeline_eligible":null,"address_book_live_sync_enabled":false,"universal_quality_filtering_enabled":"enabled","dm_receipt_setting":"all_disabled","alt_text_compose_enabled":null,"mention_filter":"unfiltered","allow_authenticated_periscope_requests":true,"protect_password_reset":false,"require_password_login":false,"requires_login_verification":false,"dm_quality_filter":"enabled","autoplay_disabled":false,"settings_metadata":{}},"fetchStatus":"loaded"},"dataSaver":{"dataSaverMode":false},"transient":{"dtabBarInfo":{"hide":false},"loginPromptShown":false,"lastViewedDmInboxPath":"/messages","themeFocus":""}},"devices":{"browserPush":{"fetchStatus":"none","pushNotificationsPrompt":{"dismissed":false,"fetchStatus":"none"},"subscribed":false,"supported":null},"devices":{"data":{"emails":[],"phone_numbers":[]},"fetchStatus":"none"},"notificationSettings":{"push_settings":{"error":null,"fetchStatus":"none"},"push_settings_template":{"template":{"settings":[]}},"sms_settings":{"error":null,"fetchStatus":"none"},"sms_settings_template":{"template":{"settings":[]}},"checkin_time":null}},"audio":{"conversationLookup":{}},"hashflags":{"fetchStatus":"none","hashflags":{}},"friendships":{"pendingFollowers":{"acceptedIds":[],"ids":[],"fetchStatus":{"bottom":"none","top":"none"},"hydratedIds":[]}},"homeTimeline":{"useLatest":false,"fetchStatus":"none"},"multiAccount":{"fetchStatus":"none","users":[],"badgeCounts":{},"addAccountFetchStatus":"none"},"badgeCount":{"unreadDMCount":0},"ocf_location":{"startLocation":{}},"navigation":{},"teams":{"fetchStatus":"none","teams":{}},"cardState":{},"promotedContent":{}};window.__META_DATA__={"env":"prod","isLoggedIn":true,"isRTL":false,"hasMultiAccountCookie":false,"uaParserTags":["m2","rweb","msw"],"serverDate":1614578006755,"sha":"9921d3a6d626dc45b0f5a65681ef95c891d815cd"};window.__PREFETCH_DATA__={"items":[{"key":"dataUsageSettings","payload":{"dataSaverMode":false}}],"timestamp":1614578006700};</script>
I'm trying this method:
import requests
import json
from bs4 import BeautifulSoup

x = requests.get('https://twitter.com/home')
b = BeautifulSoup(x.text, 'html.parser')
for b in b.find_all('script'):
    wis = x.text.split('window.__INITIAL_STATE__=')
    if len(wis) > 1:
        data = json.loads(wis[1].split(';')[0])
        print(data["screen_name"])
Result: KeyError "screen_name"
And this way doesn't work either:
import requests
import json
x = requests.get('https://twitter.com/home')
html = x.text.split('window.__INITIAL_STATE__=')[0]
html = html.split(';</script>')[0]
data = json.loads(html)
print(data['screen_name'])
Result
Traceback (most recent call last):
  File "<string>", line 8, in <module>
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Update, using the full source HTML:
for script in b.find_all('script'):
    if 'window.__INITIAL_STATE__=' not in script.contents[0]:
        continue
    wis = script.contents[0].split('window.__INITIAL_STATE__=')
    data = json.loads(wis[1].split(';window.__META_DATA__')[0])
    print(data["settings"]["remote"]["settings"]["screen_name"])
    break
You won't get screen_name this way; it is only present for the currently logged-in user, so you have to make the request with valid cookies to get the data.
By the way, in the example above there are multiple variables (JSON objects); you want the JSON between window.__INITIAL_STATE__= and ,"devices":
b = BeautifulSoup(html, 'html.parser')
for script in b.find_all('script'):
    if 'window.__INITIAL_STATE__=' not in script.contents[0]:
        continue
    wis = script.contents[0].split('window.__INITIAL_STATE__=')
    data = json.loads(wis[1].split(',"devices"')[0])
    print(data['featureSwitch']['config']['screen_name'])
    break
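As a rough sketch of "requests with valid cookies" (the cookie names and values below are placeholders to be copied from a logged-in browser session, not documented constants):

import requests

cookies = {'auth_token': '<from your browser>', 'ct0': '<from your browser>'}  # placeholder cookies
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://twitter.com/home', headers=headers, cookies=cookies).text
# then feed `html` into the BeautifulSoup parsing shown above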

Highlight and Label Line in line chart for Bokeh

I'm dealing with a callback method to create a line chart on the go, given a specific dataframe.
def Total_value(DF):
    return pd.DataFrame(pd.DataFrame(DF)['FinalSalePrice'].
                        groupby(level=0, group_keys=False).
                        apply(lambda x: x.sort_values(ascending=False).head(15))).reset_index()

def TOP_Item(data):
    return np.array(data.ItemCode.value_counts()[data.ItemCode.value_counts() > 20].index)

def figure_creator(arr, l):
    # colors = ["#%06x" % random.randint(0,0xFFFFFF) for c in range(len(arr))]
    fig = figure(plot_width=1000, plot_height=300, x_axis_type='datetime')
    for item in arr:
        fig.line(l[l.ItemCode == item].ServicedOn.unique(), l[l.ItemCode == item][np.int(0)], line_width=2)
    # fig.add_tools(HoverTool(show_arrow=False,
    #                         line_policy='nearest',
    #                         tooltips=None))
    return fig
At the very end I call:
show(figure_creator(TOP_Item(Total_value(SER_2016)),Total_value(SER_2016)))
I want to add a HoverTool that highlights the hovered line and also displays its label.
The DataFrame for these is quite big, so I can't upload it here.
But the premise of each function is explained below:
Total_value: calculates the total amount of money each unique item in the dataframe has made, sorts the items, and keeps only the top 15.
TOP_Item: determines which of the 15 items have appeared more than 20 times in a 14-day period of the year (there are roughly 25 such 14-day periods in a year) and returns the list of those items.
figure_creator: creates a line for each returned item.
Is there a way to create a callback on the HoverTool (commented out above) for each new line that is generated?
I figured it out using the select tool. Posting for others who might run into a similar problem.
def figure_creator(arr, l):
    # colors = ["#%06x" % random.randint(0,0xFFFFFF) for c in range(len(arr))]
    fig = figure(plot_width=1000, plot_height=300, x_axis_type='datetime', tools="reset,hover")
    for item in arr:
        # dicta
        fig.line(l[l.ItemCode == item].ServicedOn.unique(), l[l.ItemCode == item][np.int(0)], line_width=2, alpha=0.4,
                 hover_line_color='red', hover_line_alpha=0.8)
        fig.select(dict(type=HoverTool)).tooltips = {"item": item}
    # fig.add_tools(HoverTool(show_arrow=False,
    #                         line_policy='nearest',
    #                         tooltips=None))
    return fig
This renders the chart with the hovered line highlighted in red and an "item" tooltip.
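A variation of the same idea is to attach a separate HoverTool to each line, so that every renderer reports its own label (a self-contained sketch with made-up data standing in for the original dataframe):

from bokeh.plotting import figure, show
from bokeh.models import HoverTool

fig = figure(plot_width=1000, plot_height=300)
items = {'A': [1, 3, 2], 'B': [2, 1, 4]}  # stand-in for the real per-item series
for name, ys in items.items():
    r = fig.line([0, 1, 2], ys, line_width=2, alpha=0.4,
                 hover_line_color='red', hover_line_alpha=0.8)
    # one HoverTool per renderer, so each line carries its own "item" label
    fig.add_tools(HoverTool(renderers=[r], tooltips=[("item", name)]))
show(fig)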

Python BeautifulSoup - Scraping Google Finance historical data

I was trying to scrape Google Finance historical data. I needed the total number of rows, which is displayed along with the pagination. The following is the div tag responsible for displaying the total number of rows:
<div class="tpsd">1 - 30 of 1634 rows</div>
I tried using the following code to get the data, but it returns an empty list:
soup.find_all('div', 'tpsd')
I tried getting the entire table, but even then I was not successful. When I checked the page source, I was able to find the value inside a JavaScript function. When I Googled how to get values from a script tag, using regex was suggested, so I tried regex; the following is my code:
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ')
soup = BeautifulSoup(r.content,'lxml')
var = soup.find_all("script")[8].string
a = re.compile('google.finance.applyPagination\((.*)\'http', re.DOTALL)
b = a.search(var)
num = b.group(1)
print(num.replace(',','').split('\n')[3])
I am able to get the values I want, but my doubt is whether the code above is the right way to do it, or whether there is a better way. Kindly help.
You can easily pass an offset, i.e. start=.., to the URL, getting 30 rows at a time, which is exactly what happens with the pagination logic:
from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
      "enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"

with requests.session() as s:
    start = 0
    req = s.get(url.format(start))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    all_rows = table.find_all("tr")
    while True:
        start += 30
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        if not table:
            break
        all_rows.extend(table.find_all("tr"))
You can also get the total rows using the script tag and use that with range:
with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    for start in range(30, total+1, 30):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
    print(len(all_rows))
num=30 is the number of rows per page; to make fewer requests you can set it to 200, which seems to be the maximum, and work your step/offset from that.
url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
      "enddate=Aug+18%2C+2016&num=200&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"

with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    for start in range(200, total+1, 200):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        print(url.format(start))
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
If we run the code, you will see we get 1643 rows:
In [7]: with requests.session() as s:
   ...:     req = s.get(url.format(0))
   ...:     soup = BeautifulSoup(req.content, "lxml")
   ...:     table = soup.select_one("table.gf-table.historical_price")
   ...:     scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
   ...:     total = int(scr.text.split(",", 3)[2])
   ...:     all_rows = table.find_all("tr")
   ...:     for start in range(200, total+1, 200):
   ...:         soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
   ...:         table = soup.select_one("table.gf-table.historical_price")
   ...:         all_rows.extend(table.find_all("tr"))
   ...:     print(len(all_rows))
   ...:
1643

In [8]:
You can just use the Python module: https://pypi.python.org/pypi/googlefinance
The API is simple:
# The Google Finance API that we need.
from googlefinance import getQuotes
# The JSON handler, since the API returns JSON.
import json

intelJSON = (getQuotes('INTC'))
intelDump = json.dumps(intelJSON, indent=2)
intelInfo = json.loads(intelDump)
intelPrice = intelInfo[0]['LastTradePrice']
intelTime = intelInfo[0]['LastTradeDateTimeLong']
print ("As of " + intelTime + ", Intel stock is trading at: " + intelPrice)
I prefer having all the raw CSV files that are available for download from Google Finance. I wrote a quick python script to automatically download all the historical price info for a list of companies -- it's equivalent to how a human might use the "Download to Spreadsheet" link manually.
Here's the GitHub repo, with the downloaded CSV files for all S&P 500 stocks (in the rawCSV folder): https://github.com/liezl200/stockScraper
It uses this link http://www.google.com/finance/historical?q=googl&startdate=May+3%2C+2012&enddate=Apr+30%2C+2017&output=csv where the key here is the last output parameter, output=csv. I use urllib.urlretrieve(download_url, local_csv_filename) to retrieve the CSV.
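For reference, a minimal sketch of that download step (adapted to Python 3's urllib.request; the endpoint and parameters are copied verbatim from the link above and may no longer be served by Google):

import urllib.request

# output=csv is the key parameter -- it returns the same file the
# "Download to Spreadsheet" link would give you
download_url = ("http://www.google.com/finance/historical?q=googl"
                "&startdate=May+3%2C+2012&enddate=Apr+30%2C+2017&output=csv")
urllib.request.urlretrieve(download_url, "googl_historical.csv")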

Any way to get JS object using scrapy

I am using Scrapy to gather schedule information from the USL Pro website. The site I am crawling is http://uslpro.uslsoccer.com/schedules/index_E.html.
The content of the page is rendered when the page loads, so I can't get the table data directly from the source code. I looked at the source and found that the schedule objects are stored in one object.
Here is the JavaScript Code.
preRender: function(){
var gmsA=diiH2A(DIISnapshot.gamesHolder);
....
This gmsA object has all schedule information. Is there any way to get this JS object using scrapy? Thank you very much for your help.
For starters, you have multiple options to choose from:
parse the javascript file containing the data (which is what I'm describing below)
use Scrapy with the scrapyjs tool
automate a real browser with the help of selenium (a short sketch follows this list)
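For completeness, option 3 could look roughly like this (a sketch only; the sleep duration and the row selector are guesses, since the rendered markup would need to be inspected first):

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://uslpro.uslsoccer.com/schedules/index_E.html')
time.sleep(5)  # crude wait for the schedule javascript to render
for row in driver.find_elements_by_css_selector('table tr'):  # placeholder selector
    print(row.text)
driver.quit()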
Okay, the first option is arguably the most complicated.
The page is loaded via a separate call to a .js file which contains the information about matches and teams in two separate objects:
DIISnapshot.gms = {
"4428801":{"code":"1","tg":65672522,"fg":"2953156","fac":"22419","facn":"Blackbaud Stadium","tm1":"13380700","tm2":"22310","sc1":"1","sc2":"1","gmapply":"","dt":"22-MAR-2014","tim":"30-DEC-1899 19:30:00.0000","se":"65672455","modst":"","gmlabel":"","golive":0,"gmrpt":"67842863","urlvideo":"http://www.youtube.com/watch?v=JHi6_nnuAsQ","urlaudio":""}
, "4428803":{"code":"2","tg":65672522,"fg":"2953471","fac":"1078448","facn":"StubHub Center","tm1":"33398866","tm2":"66919078","sc1":"1","sc2":"3","gmapply":"","dt":"22-MAR-2014","tim":"30-DEC-1899 22:30:00.0000","se":"65672455","modst":"","gmlabel":"","golive":0,"gmrpt":"67846731","urlvideo":"http://www.youtube.com/watch?v=nLaRaTi7BgE","urlaudio":""}
...
, "5004593":{"code":"217","tg":65672522,"fg":"66919058","fac":"66919059","facn":"Bonney Field","tm1":"934394","tm2":"65674034","sc1":"0","sc2":"2","gmapply":"3","dt":"27-SEP-2014","tim":"30-DEC-1899 22:30:00.0000","se":"65672455","modst":"21-SEP-2014 1:48:26.5710","gmlabel":"FINAL","golive":0,"gmrpt":"72827154","urlvideo":"https://www.youtube.com/watch?v=QPhL8Ktkz4M","urlaudio":""}
};
DIISnapshot.tms = {
"13380700":{"name":"Orlando City SC","club":"","nick":"Orlando","primarytg":"65672522"}
...
, "8969532":{"name":"Pittsburgh Riverhounds","club":"","nick":"Pittsburgh","primarytg":"65672522"}
, "934394":{"name":"Harrisburg City Islanders","club":"","nick":"Harrisburg","primarytg":"65672522"}
};
And things get a bit more difficult, because the URL of that js file is also constructed with javascript, in the following script tag:
<script type="text/javascript">
  var DIISnapshot = {
    goLive: function(gamekey) {
      clickpop1=window.open('http://uslpro.uslsoccer.com/scripts/runisa.dll?M2:gp::72013+Elements/DisplayBlank+E+2187955++'+gamekey+'+65672455','clickpop1','toolbar=0,location=0,status=0,menubar=0,scrollbars=1,resizable=0,top=100,left=100,width=315,height=425');
    }
  };
  var DIISchedule = {
    MISL_lgkey: '36509042',
    sename:'2014',
    sekey: '65672455',
    lgkey: '2792331',
    tg: '65672522',
    ...
    fetchInfo:function(){
      var fname = DIISchedule.tg;
      if (fname === '') fname = DIISchedule.sekey;
      new Ajax.Request('/schedules/' + DIISchedule.seSeq + '/' + fname + '.js?'+rand4(),{asynchronous: false});
      DIISnapshot.gamesHolder = DIISnapshot.gms;
      DIISnapshot.teamsHolder = DIISnapshot.tms;
      DIISnapshot.origTeams = [];
      for (var teamkey in DIISnapshot.tms) DIISnapshot.origTeams.push(teamkey);
    },
    ...
      DIISchedule.scheduleLoaded = true;
    }
  }
  document.observe('dom:loaded',DIISchedule.init);
</script>
Okay, let's use the BeautifulSoup HTML parser and the slimit javascript parser to get the dynamic part (that tg value is the name of the js file with the data) used to construct the URL, then make a request to that URL, parse the javascript, and print out the matches:
import json
import random
import re

from bs4 import BeautifulSoup
import requests
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

# start a session
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'}
session = requests.Session()
response = session.get('http://uslpro.uslsoccer.com/schedules/index_E.html', headers=headers)

# get the dynamic part of the JS url
soup = BeautifulSoup(response.content)
script = soup.find('script', text=lambda x: x and 'var DIISchedule' in x)
tg = re.search(r"tg: '(\d+)',", script.text).group(1)

# request to JS url
js_url = "http://uslpro.uslsoccer.com/schedules/2014/{tg}.js?{rand}".format(tg=tg, rand=random.randint(1000, 9999))
response = session.get(js_url, headers=headers)

# parse js
parser = Parser()
tree = parser.parse(response.content)
matches, teams = [json.loads(node.right.to_ecma())
                  for node in nodevisitor.visit(tree)
                  if isinstance(node, ast.Assign) and isinstance(node.left, ast.DotAccessor)]

for match in matches.itervalues():
    print teams[match['tm1']]['name'], '%s : %s' % (match['sc1'], match['sc2']), teams[match['tm2']]['name']
Prints:
Arizona United SC 0 : 2 Orange County Blues FC
LA Galaxy II 1 : 0 Seattle Sounders FC Reserves
LA Galaxy II 1 : 3 Harrisburg City Islanders
New York Red Bulls Reserves 0 : 1 OKC Energy FC
Wilmington Hammerheads FC 2 : 1 Charlotte Eagles
Richmond Kickers 3 : 2 Harrisburg City Islanders
Charleston Battery 0 : 2 Orlando City SC
Charlotte Eagles 0 : 2 Richmond Kickers
Sacramento Republic FC 2 : 1 Dayton Dutch Lions FC
OKC Energy FC 0 : 5 LA Galaxy II
...
The part printing the list of matches is for demonstration purposes. You can use the matches and teams dictionaries to output the data in whatever format you need; see the sketch below.
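For instance, a small sketch that dumps the schedule to CSV using the csv module (it assumes the matches and teams dictionaries built by the script above; the field names are the ones visible in the .js sample):

import csv

with open('schedule.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'home', 'away', 'home_score', 'away_score'])
    for match in matches.values():
        writer.writerow([match['dt'],
                         teams[match['tm1']]['name'],
                         teams[match['tm2']]['name'],
                         match['sc1'], match['sc2']])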
As this is not a popular tag I don't expect any upvotes - most importantly, it was an interesting challenge for me.
