Any way to get JS object using scrapy - javascript

I am using scrapy to gather schedule information on uslpro website. The site I am crawling is http://uslpro.uslsoccer.com/schedules/index_E.html.
The content of the page is rendered when the page is loaded. So I can't get the table data directly from source code. I looked at the source code and found that the schedule objects are stored in one object.
Here is the JavaScript Code.
preRender: function(){
var gmsA=diiH2A(DIISnapshot.gamesHolder);
....
This gmsA object has all schedule information. Is there any way to get this JS object using scrapy? Thank you very much for your help.

For starters, you have multiple options to choose from:
parse the javascript file containing the data (which is what I'm describing below)
use Scrapy with scrapyjs tool
automate a real browser with the help of selenium
Okay, the first option (arguably the most complicated).
The page is loaded via a separate call to a .js file which contains the information about matches and teams in two separate objects:
DIISnapshot.gms = {
"4428801":{"code":"1","tg":65672522,"fg":"2953156","fac":"22419","facn":"Blackbaud Stadium","tm1":"13380700","tm2":"22310","sc1":"1","sc2":"1","gmapply":"","dt":"22-MAR-2014","tim":"30-DEC-1899 19:30:00.0000","se":"65672455","modst":"","gmlabel":"","golive":0,"gmrpt":"67842863","urlvideo":"http://www.youtube.com/watch?v=JHi6_nnuAsQ","urlaudio":""}
, "4428803":{"code":"2","tg":65672522,"fg":"2953471","fac":"1078448","facn":"StubHub Center","tm1":"33398866","tm2":"66919078","sc1":"1","sc2":"3","gmapply":"","dt":"22-MAR-2014","tim":"30-DEC-1899 22:30:00.0000","se":"65672455","modst":"","gmlabel":"","golive":0,"gmrpt":"67846731","urlvideo":"http://www.youtube.com/watch?v=nLaRaTi7BgE","urlaudio":""}
...
, "5004593":{"code":"217","tg":65672522,"fg":"66919058","fac":"66919059","facn":"Bonney Field","tm1":"934394","tm2":"65674034","sc1":"0","sc2":"2","gmapply":"3","dt":"27-SEP-2014","tim":"30-DEC-1899 22:30:00.0000","se":"65672455","modst":"21-SEP-2014 1:48:26.5710","gmlabel":"FINAL","golive":0,"gmrpt":"72827154","urlvideo":"https://www.youtube.com/watch?v=QPhL8Ktkz4M","urlaudio":""}
};
DIISnapshot.tms = {
"13380700":{"name":"Orlando City SC","club":"","nick":"Orlando","primarytg":"65672522"}
...
, "8969532":{"name":"Pittsburgh Riverhounds","club":"","nick":"Pittsburgh","primarytg":"65672522"}
, "934394":{"name":"Harrisburg City Islanders","club":"","nick":"Harrisburg","primarytg":"65672522"}
};
Things get a bit more difficult because the URL of that .js file is itself constructed with JavaScript in the following script tag:
<script type="text/javascript">
var DIISnapshot = {
goLive: function(gamekey) {
clickpop1=window.open('http://uslpro.uslsoccer.com/scripts/runisa.dll?M2:gp::72013+Elements/DisplayBlank+E+2187955++'+gamekey+'+65672455','clickpop1','toolbar=0,location=0,status=0,menubar=0,scrollbars=1,resizable=0,top=100,left=100,width=315,height=425');
}
};
var DIISchedule = {
MISL_lgkey: '36509042',
sename:'2014',
sekey: '65672455',
lgkey: '2792331',
tg: '65672522',
...
fetchInfo:function(){
var fname = DIISchedule.tg;
if (fname === '') fname = DIISchedule.sekey;
new Ajax.Request('/schedules/' + DIISchedule.seSeq + '/' + fname + '.js?'+rand4(),{asynchronous: false});
DIISnapshot.gamesHolder = DIISnapshot.gms;
DIISnapshot.teamsHolder = DIISnapshot.tms;
DIISnapshot.origTeams = [];
for (var teamkey in DIISnapshot.tms) DIISnapshot.origTeams.push(teamkey);
},
...
DIISchedule.scheduleLoaded = true;
}
}
document.observe('dom:loaded',DIISchedule.init);
</script>
Okay, let's use the BeautifulSoup HTML parser to get the dynamic part of the URL (the tg value, which is the name of the .js file holding the data), make a request to that URL, then parse the JavaScript with the slimit parser and print out the matches:
import json
import random
import re
from bs4 import BeautifulSoup
import requests
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
# start a session
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'}
session = requests.Session()
response = session.get('http://uslpro.uslsoccer.com/schedules/index_E.html', headers=headers)
# get the dynamic part of the JS url
soup = BeautifulSoup(response.content)
script = soup.find('script', text=lambda x: x and 'var DIISchedule' in x)
tg = re.search(r"tg: '(\d+)',", script.text).group(1)
# request to JS url
js_url = "http://uslpro.uslsoccer.com/schedules/2014/{tg}.js?{rand}".format(tg=tg, rand=random.randint(1000, 9999))
response = session.get(js_url, headers=headers)
# parse js
parser = Parser()
tree = parser.parse(response.content)
# the two top-level assignments (DIISnapshot.gms and DIISnapshot.tms) become two dicts
matches, teams = [json.loads(node.right.to_ecma())
                  for node in nodevisitor.visit(tree)
                  if isinstance(node, ast.Assign) and isinstance(node.left, ast.DotAccessor)]
# Python 2 syntax (dict.itervalues and the print statement)
for match in matches.itervalues():
    print teams[match['tm1']]['name'], '%s : %s' % (match['sc1'], match['sc2']), teams[match['tm2']]['name']
Prints:
Arizona United SC 0 : 2 Orange County Blues FC
LA Galaxy II 1 : 0 Seattle Sounders FC Reserves
LA Galaxy II 1 : 3 Harrisburg City Islanders
New York Red Bulls Reserves 0 : 1 OKC Energy FC
Wilmington Hammerheads FC 2 : 1 Charlotte Eagles
Richmond Kickers 3 : 2 Harrisburg City Islanders
Charleston Battery 0 : 2 Orlando City SC
Charlotte Eagles 0 : 2 Richmond Kickers
Sacramento Republic FC 2 : 1 Dayton Dutch Lions FC
OKC Energy FC 0 : 5 LA Galaxy II
...
The part printing the list of matches is for demonstration purposes. You can use matches and teams dictionaries to output the data in a format you need.
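If pulling in a full JavaScript parser feels heavy, a lighter sketch is possible under one assumption: each object literal in the file is itself valid JSON, as the excerpt above suggests. The helper name extract_object and the trimmed SAMPLE_JS string are made up for illustration; against the live site you would pass the text of the downloaded .js file instead.

```python
import json
import re

# Trimmed, illustrative sample in the same shape as the schedule file shown above.
SAMPLE_JS = '''
DIISnapshot.gms = {
"4428801":{"tm1":"13380700","tm2":"934394","sc1":"1","sc2":"1","dt":"22-MAR-2014"}
};
DIISnapshot.tms = {
"13380700":{"name":"Orlando City SC","nick":"Orlando"}
, "934394":{"name":"Harrisburg City Islanders","nick":"Harrisburg"}
};
'''


def extract_object(js_text, name):
    """Return the object literal assigned to DIISnapshot.<name> as a dict.

    Assumes the literal is valid JSON and that the text "};" does not occur
    inside any string value (hypothetical helper, not part of any library).
    """
    pattern = r"DIISnapshot\.%s\s*=\s*(\{.*?\})\s*;" % re.escape(name)
    found = re.search(pattern, js_text, re.DOTALL)
    if found is None:
        raise ValueError("assignment to DIISnapshot.%s not found" % name)
    return json.loads(found.group(1))


matches = extract_object(SAMPLE_JS, "gms")
teams = extract_object(SAMPLE_JS, "tms")
for match in matches.values():
    home = teams[match["tm1"]]["name"]
    away = teams[match["tm2"]]["name"]
    print(home, "%s : %s" % (match["sc1"], match["sc2"]), away)
```

The regular expression is the fragile part: if the file ever uses unquoted keys or single quotes, json.loads will fail and a real JavaScript parser such as slimit is the safer choice.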
As this is not a popular tag I don't expect any upvotes - most importantly, it was an interesting challenge for me.


How to parse JavaScript JSON into a Python dict, efficiently

I am looking for a way to read the JavaScript JSON data loaded in one of the script tags of this page. I have tried various regex patterns found on Google and Stack Overflow, but none of them worked.
A JSON formatter reports the extracted data as invalid (RFC 8259).
Here is the code:
import requests,json
from scrapy.selector import Selector
headers = {'Content-Type': 'application/json', 'Accept-Language': 'en-US,en;q=0.5', 'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'}
url = 'https://www.zocdoc.com/doctor/andrew-fagelman-md-7363?insuranceCarrier=-1&insurancePlan=-1'
response = requests.get(url,headers = headers)
sel = Selector(text = response.text)
profile_data = sel.css('script:contains(APOLLO_STATE)::text').get('{}').split('__REDUX_STATE__ = JSON.parse(')[-1].split(');\n window.ZD = {')[0]
profile_json = json.loads(profile_data)
print(type(profile_json))
The problem seems to be an invalid JSON format. With the code above, profile_json ends up as a string rather than a dict, while a small amendment to the code produces the error below:
>>> profile_data = sel.css('script:contains(APOLLO_STATE)::text').get('{}').split('__REDUX_STATE__ = JSON.parse("')[-1].split('");\n window.ZD = {')[0].replace("\\","")
>>> profile_json = json.loads(profile_data)
Traceback (most recent call last):
File "/usr/lib/python3.6/code.py", line 91, in runcode
exec(code, self.locals)
File "<console>", line 1, in <module>
File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 41316 (char 41315)
The original HTML contains this (heavily trimmed):
<script>
...
window.__REDUX_STATE__ = JSON.parse("{\"routing\": ...
\"awards\":[\"Journal of Urology - \\\"Efficacy, Safety, and Use of Viagra in Clinical Practice.\\\"\",\"Critical Care Resident of the Year - 2003\"],
...
The same string extracted by scrapy is this:
"awards":[
"Journal of Urology - ""Efficacy",
"Safety",
"and Use of Viagra in Clinical Practice.""",
"Critical Care Resident of the Year - 2003"
],
It appears the backslashes are removed from it, making the JSON invalid.
I don't know whether this is an efficient way of handling the problem, but the code below resolved it for me.
>>> import js2xml
>>> profile_data = sel.css('script:contains(APOLLO_STATE)::text').get('{}')
>>> parsed = js2xml.parse(profile_data)
>>> js = json.loads(parsed.xpath("//string[contains(text(),'routing')]/text()")[0])
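Another way to look at it: the argument of JSON.parse("...") is a quoted string literal whose content is JSON, so the backslashes must be decoded rather than stripped. The sketch below assumes the literal only uses JSON-compatible escapes (which is not guaranteed for arbitrary JavaScript) and decodes it twice with json.loads. The function name decode_json_parse_argument and the trimmed script_text are illustrative stand-ins, not the real page.

```python
import json
import re

# Hypothetical, heavily trimmed stand-in for the text of the page's script tag.
script_text = r'''
window.__REDUX_STATE__ = JSON.parse("{\"routing\":{},\"awards\":[\"Journal of Urology - \\\"Efficacy, Safety\\\"\",\"Resident of the Year - 2003\"]}");
window.ZD = {};
'''


def decode_json_parse_argument(text):
    """Find JSON.parse("...") in text and return the decoded Python object."""
    # A double-quoted literal: any run of non-quote, non-backslash characters
    # or backslash escapes, up to the closing unescaped quote.
    found = re.search(r'JSON\.parse\(\s*("(?:[^"\\]|\\.)*")\s*\)', text, re.DOTALL)
    if found is None:
        raise ValueError("no JSON.parse(\"...\") call found")
    inner_json = json.loads(found.group(1))  # first pass: undo the string escaping
    return json.loads(inner_json)            # second pass: parse the actual JSON


state = decode_json_parse_argument(script_text)
print(type(state))
print(state["awards"][0])
```

The first json.loads call turns the quoted literal into a plain string, which is why the original attempt returned a str; the second call produces the dict.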

ctx in ANTLR4 javascript visitor

Using ANTLR4 v4.8
I am in the process of writing a transpiler and exploring the use of ANTLR (JavaScript target with a visitor).
Grammar -> lex/parse is fine, and I now have a parse tree.
Grammar
grammar Mygrammar;
/*
* parser rules
*/
progm : stmt+;
stmt
: progdecl
| print
;
progdecl : PROGDECLKW ID '..';
print : WRITEKW STRLIT '..';
/*
* lexer rules
*/
PROGDECLKW : 'DECLAREPROGRAM';
WRITEKW : 'PRINT';
// Literal
STRLIT : '\'' .*? '\'' ;
// Identifier
ID : [a-zA-Z0-9]+;
// skip
LINE_COMMENT : '*' .*? '\n' -> skip;
TERMINATOR : [\r\n]+ -> skip;
WS : [ \t\n\r]+ -> skip;
hw.mg
***************
* Hello world
***************
DECLAREPROGRAM hw..
PRINT 'Hello World!'..
index.js
...
const myVisitor = require('./src/myVisitor').myVisitor;
const input = './src_sample/hw.mg';
const chars = new antlr4.FileStream(input);
...
parser.buildParseTrees = true;
const myVisit = new myVisitor();
myVisit.visitPrint(parser.print());
Using the visitor didn't seem straightforward, though this SO post helps to an extent.
On the use of context: is there a good way to track ctx when I hit each node?
Using myVisit.visit(tree) as the starting context is fine. When I start visiting each node using a non-root context, myVisit.visitPrint(parser.print()) throws an error.
Error:
PrintContext {
parentCtx: null,
invokingState: -1,
ruleIndex: 3,
children: null,
start: CommonToken {
source: [ [MygrammarLexer], [FileStream] ],
type: -1,
channel: 0,
start: 217,
together with exception: InputMismatchException [Error]
I believe it is because children is null instead of being populated.
Which, in turn, is due to
line 9:0 mismatched input '<EOF>' expecting {'DECLAREPROGRAM', 'PRINT'}
Question:
Is the above the only way to pass the context, or am I doing this wrong?
If the usage is correct, then I am inclined to report this as a bug.
edit 17.3 - added grammar and source
When you invoke parser.print() but feed it the input:
***************
* Hello world
***************
DECLAREPROGRAM hw..
PRINT 'Hello World!'..
it will not work. For print(), the parser expects input like this: PRINT 'Hello World!'.. (a single print statement). For the entire input, you will have to invoke progm() instead. Also, it is wise to "anchor" your starting rule with the EOF token, which will force ANTLR to consume the entire input:
progm : stmt+ EOF;
If you want to parse and visit an entire parse tree (using progm()), but are only interested in the print node/context, then it is better to use a listener instead of a visitor. Check this page for how to use a listener: https://github.com/antlr/antlr4/blob/master/doc/javascript-target.md
EDIT
Here's how a listener works (a Python demo since I don't have the JS set up properly):
import antlr4
from playground.MygrammarLexer import MygrammarLexer
from playground.MygrammarParser import MygrammarParser
from playground.MygrammarListener import MygrammarListener
class PrintPreprocessor(MygrammarListener):
    def enterPrint_(self, ctx: MygrammarParser.Print_Context):
        print("Entered print: `{}`".format(ctx.getText()))

if __name__ == '__main__':
    source = """
***************
* Hello world
***************
DECLAREPROGRAM hw..
PRINT 'Hello World!'..
"""
    lexer = MygrammarLexer(antlr4.InputStream(source))
    parser = MygrammarParser(antlr4.CommonTokenStream(lexer))
    antlr4.ParseTreeWalker().walk(PrintPreprocessor(), parser.progm())
When running the code above, the following will be printed:
Entered print: `PRINT'Hello World!'..`
So, in short: this listener accepts the entire parse tree of your input, but only "listens" when we enter the print parser rule.
Note that I renamed print to print_ because print is protected in the Python target.

Pull variable value from javascript source using BeautifulSoup4 Python

I'm a newbie in Python programming. I'm learning BeautifulSoup to scrape websites.
I want to extract the value of "stream" and store it in a variable.
My Python code is as follows:
import bs4 as bs #Importing BeautifulSoup4 Python Library.
import urllib.request
import requests
import json
import re
headers = {'User-Agent':'Mozilla/5.0'}
url = "http://thoptv.com/partners/mhdTVlive/Core.php?level=1200&channel=Dsports_HD"
page = requests.get(url)
soup = bs.BeautifulSoup(page.text,"html.parser")
pattern = re.compile('var stream = (.*?);')
scripts = soup.find_all('script')
for script in scripts:
    if(pattern.match(str(script.string))):
        data = pattern.match(script.string)
        links = json.loads(data.groups()[0])
        print(links)
This is the source javascript code to get the stream url value.
https://content.jwplatform.com/libraries/oncyToRO.js'>if( navigator.userAgent.match(/android/i)||
navigator.userAgent.match(/webOS/i)||
navigator.userAgent.match(/iPhone/i)||
navigator.userAgent.match(/iPad/i)||
navigator.userAgent.match(/iPod/i)||
navigator.userAgent.match(/BlackBerry/i)||
navigator.userAgent.match(/Windows Phone/i)) {var stream =
"http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=ibxIPxc6rkq1yIUJb4RlEV&pxe=1504146411&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw";}else{var
stream =
"http://hd.simiptv.com:8080//index.m3u8?key=VIoVSsGRLRouHWGNo1epzX&exp=932213423&domain=thoptv.stream&id=461";}jwplayer("THOPTVPlayer").setup({"title":
'thoptv.stream',"stretching":"exactfit","width": "100%","file":
none,"height": "100%","skin": "seven","autostart": "true","logo":
{"file":"https://i.imgur.com/EprI2uu.png","margin":"-0",
"position":"top-left","hide":"false","link":"http://mhdtvlive.co.in"},"androidhls":
true,});jwplayer("THOPTVPlayer").onError(function(){jwplayer().load({file:"http://content.jwplatform.com/videos/7RtXk3vl-52qL9xLP.mp4",image:"http://content.jwplatform.com/thumbs/7RtXk3vl-480.jpg"});jwplayer().play();});jwplayer("THOPTVPlayer").onComplete(function(){window.location
= window.location.href;});jwplayer("THOPTVPlayer").onPlay(function(){clearTimeout(theTimeout);});
I need to extract the url from stream.
var stream = "http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=ibxIPxc6rkq1yIUJb4RlEV&pxe=1504146411&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw";}
Rather than building a complicated regex, if the link is the only dynamically changing part, you can split the string on some known separator tokens.
x = """
https://content.jwplatform.com/libraries/oncyToRO.js'>if( navigator.userAgent.match(/android/i)|| navigator.userAgent.match(/webOS/i)|| navigator.userAgent.match(/iPhone/i)|| navigator.userAgent.match(/iPad/i)|| navigator.userAgent.match(/iPod/i)|| navigator.userAgent.match(/BlackBerry/i)|| navigator.userAgent.match(/Windows Phone/i)) {var stream = "http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=ibxIPxc6rkq1yIUJb4RlEV&pxe=1504146411&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw";}else{var stream = "http://hd.simiptv.com:8080//index.m3u8?key=VIoVSsGRLRouHWGNo1epzX&exp=932213423&domain=thoptv.stream&id=461";}jwplayer("THOPTVPlayer").setup({"title": 'thoptv.stream',"stretching":"exactfit","width": "100%","file": none,"height": "100%","skin": "seven","autostart": "true","logo": {"file":"https://i.imgur.com/EprI2uu.png","margin":"-0", "position":"top-left","hide":"false","link":"http://mhdtvlive.co.in"},"androidhls": true,});jwplayer("THOPTVPlayer").onError(function(){jwplayer().load({file:"http://content.jwplatform.com/videos/7RtXk3vl-52qL9xLP.mp4",image:"http://content.jwplatform.com/thumbs/7RtXk3vl-480.jpg"});jwplayer().play();});jwplayer("THOPTVPlayer").onComplete(function(){window.location = window.location.href;});jwplayer("THOPTVPlayer").onPlay(function(){clearTimeout(theTimeout);});
"""
left1, right1 = x.split("Phone/i)) {var stream =")
left2, right2 = right1.split(";}else")
print(left2)
# "http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=ibxIPxc6rkq1yIUJb4RlEV&pxe=1504146411&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw"
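One caveat with the tuple-unpacking split above: it raises ValueError whenever a separator is missing or occurs more than once. A slightly more forgiving sketch uses str.partition, which always returns three parts. The helper name between and the shortened sample string (with example.com URLs) are made up for illustration.

```python
def between(text, left, right):
    """Return the text between the first `left` and the next `right`, or None.

    Surrounding whitespace and double quotes are stripped from the result.
    """
    _, found_left, rest = text.partition(left)
    if not found_left:
        return None  # left separator not present
    value, found_right, _ = rest.partition(right)
    if not found_right:
        return None  # right separator not present after the left one
    return value.strip().strip('"')


# Shortened stand-in for the script text; the URLs are placeholders.
sample = ('if (mobile) {var stream = "http://example.com/a.m3u8";}'
          'else{var stream = "http://example.com/b.m3u8";}')

stream_url = between(sample, "{var stream =", ";}else")
print(stream_url)
```

Returning None instead of raising makes it easy to loop over several script tags and skip the ones that do not contain the variable.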
pattern.match() matches the pattern from the beginning of the string. Try using pattern.search() instead - it will match anywhere within the string.
Change your for loop to this:
for script in scripts:
    data = pattern.search(script.text)
    if data is not None:
        stream_url = data.groups()[0]
        print(stream_url)
You can also get rid of the surrounding quotes by changing the regex pattern to:
pattern = re.compile('var stream = "(.*?)";')
so that the double quotes are not included in the group.
You might also have noticed that there are two possible stream variables, depending on the user agent. For mobile devices (Android, iPhone, iPad and similar) the first applies, while all other user agents should use the second stream. You can use pattern.findall() to get all of them:
>>> pattern.findall(script.text)
['"http://ssrigcdnems01.cdnsrv.jio.com/jiotv.live.cdn.jio.com/Dsports_HD/Dsports_HD_800.m3u8?jct=LEurobVVelOhbzOZ6EkTwr&pxe=1571716053&st=AQIC5wM2LY4SfczRaEwgGl4Dyvly_3HihdlD_Oduojk5Kxs.*AAJTSQACMDIAAlNLABQtNjUxNDEwODczODgxNzkyMzg5OQACUzEAAjYw*"', '"http://hd.simiptv.com:8080//index.m3u8?key=vaERnLJswnWXM8THmfvDq5&exp=944825312&domain=thoptv.stream&id=461"']
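The difference between the three calls can be seen on a small, self-contained string. This is only a sketch: script_text is a shortened stand-in for the real script, and the example.com URLs are placeholders.

```python
import re

# Shortened stand-in for the script text; the URLs are placeholders.
script_text = ('if (isMobile) {var stream = "http://example.com/mobile.m3u8";}'
               'else{var stream = "http://example.com/desktop.m3u8";}')

pattern = re.compile(r'var stream = "(.*?)";')

# match() only succeeds if the pattern matches at position 0 of the string.
at_start = pattern.match(script_text)

# search() scans forward and returns the first match anywhere in the string.
first_url = pattern.search(script_text).group(1)

# findall() returns the captured group of every non-overlapping match.
all_urls = pattern.findall(script_text)

print(at_start)
print(first_url)
print(all_urls)
```

Because the script starts with "if", match() returns None here, which is the same reason the loop in the question never printed anything.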
This code works for me:
import bs4 as bs #Importing BeautifulSoup4 Python Library.
import urllib.request
import requests
import json
headers = {'User-Agent':'Mozilla/5.0'}
url = "http://thoptv.com/partners/mhdTVlive/Core.php?level=1200&channel=Dsports_HD"
page = requests.get(url)
soup = bs.BeautifulSoup(page.text,"html.parser")
scripts = soup.find_all('script')
out = list()
for c, i in enumerate(scripts): # go over the list of script tags
    text = i.text
    if(text[:2] == "if"): # the script we want starts with "if"
        for count, t in enumerate(text):
            # look for "{var s", which is where stream is set
            if text[count] == "{" and text[count + 1] == "v" and text[count + 5] == "s":
                tmp = text[count:] # keep everything from this point on
                break
        co = 0
        for m in tmp: # loop over the remainder found above
            if m == "\"" and co == 0: # opening quote: the string starts
                co = 1
            elif m == "\"" and co == 1: # closing quote: stop
                print(''.join(out)) # results
                break
            elif co == 1:
                # still inside the quoted string
                out.append(m)
result = ''.join(out) # set result
It basically filters the string manually.
But if we use user1767754's method (brilliant, by the way), we end up with something like this:
import bs4 as bs #Importing BeautifulSoup4 Python Library.
import urllib.request
import requests
import json
headers = {'User-Agent':'Mozilla/5.0'}
url = "http://thoptv.com/partners/mhdTVlive/Core.php?level=1200&channel=Dsports_HD"
page = requests.get(url)
soup = bs.BeautifulSoup(page.text,"html.parser")
scripts = soup.find_all('script')
x = scripts[3].text
left1, right1 = x.split("Phone/i)) {var stream =")
left2, right2 = right1.split(";}else")
print(left2)

Trying to scrape table using Pandas from Selenium's result

I am trying to scrape a table from a Javascript website using Pandas. For this, I used Selenium to first reach my desired page. I am able to print the table in text format (as shown in commented script), but I want to be able to have the table in Pandas, too. I am attaching my script as below and I hope someone could help me figure this out.
import time
from selenium import webdriver
import pandas as pd
chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'
page = driver.get(url)
time.sleep(2)
driver.find_element_by_xpath('//*[@id="bursa_boards"]/option[2]').click()
driver.find_element_by_xpath('//*[@id="bursa_sectors"]/option[11]').click()
time.sleep(2)
driver.find_element_by_xpath('//*[@id="bm_equity_price_search"]').click()
time.sleep(5)
target = driver.find_elements_by_id('bm_equities_prices_table')
##for data in target:
## print (data.text)
for data in target:
    dfs = pd.read_html(target,match = '+')
    for df in dfs:
        print (df)
Running the above script, I get the error below:
Traceback (most recent call last):
File "E:\Coding\Python\BS_Bursa Properties\Selenium_Pandas_Bursa Properties.py", line 29, in <module>
dfs = pd.read_html(target,match = '+')
File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 906, in read_html
keep_default_na=keep_default_na)
File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\io\html.py", line 728, in _parse
compiled_match = re.compile(match) # you can pass a compiled regex here
File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 233, in compile
return _compile(pattern, flags)
File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 855, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "C:\Users\lnv\AppData\Local\Programs\Python\Python36-32\lib\sre_parse.py", line 616, in _parse
source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 0
I've tried using pd.read_html on the url also, but it returned an error of "No Table Found". The url is: http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS08&board=MAIN-MKT&sector=PROPERTIES&page=1.
You can get the table using the following code
import time
from selenium import webdriver
import pandas as pd
chrome_path = r"Path to chrome driver"
driver = webdriver.Chrome(chrome_path)
url = 'http://www.bursamalaysia.com/market/securities/equities/prices/#/?filter=BS02'
page = driver.get(url)
time.sleep(2)
df = pd.read_html(driver.page_source)[0]
print(df.head())
This is the output
No Code Name Rem Last Done LACP Chg % Chg Vol ('00) Buy Vol ('00) Buy Sell Sell Vol ('00) High Low
0 1 5284CB LCTITAN-CB s 0.025 0.020 0.005 +25.00 406550 19878 0.020 0.025 106630 0.025 0.015
1 2 1201 SUMATEC [S] s 0.050 0.050 - - 389354 43815 0.050 0.055 187301 0.055 0.050
2 3 5284 LCTITAN [S] s 4.470 4.700 -0.230 -4.89 367335 430 4.470 4.480 34 4.780 4.140
3 4 0176 KRONO [S] - 0.875 0.805 0.070 +8.70 300473 3770 0.870 0.875 797 0.900 0.775
4 5 5284CE LCTITAN-CE s 0.130 0.135 -0.005 -3.70 292379 7214 0.125 0.130 50 0.155 0.100
To get data from all pages, you can crawl the remaining pages and combine the frames with df.append (or pd.concat on pandas 2.0 and later, where DataFrame.append was removed).
Answer:
df = pd.read_html(target[0].get_attribute('outerHTML'))
Reason for target[0]:
driver.find_elements_by_id('bm_equities_prices_table') returns a list of selenium webelements, in your case, there's only 1 element, hence [0]
Reason for get_attribute('outerHTML'):
We want the HTML of the element. There are two relevant attributes for get_attribute: 'innerHTML' and 'outerHTML'. We choose 'outerHTML' because we need to include the current element itself (where the table headers are, I suppose) instead of only the inner contents of the element.
Reason for df[0]:
pd.read_html() returns a list of data frames, the first of which is the result we want, hence [0].
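The traceback in the question also points at a second, separate problem: the match argument is compiled as a regular expression, and a bare '+' is a quantifier with nothing in front of it. A minimal sketch of that failure using only the re module (no pandas or Selenium involved), with re.escape as one way to match a literal plus sign:

```python
import re

# A bare "+" is a quantifier, so compiling it fails just like in the traceback.
try:
    re.compile('+')
    message = None
except re.error as exc:
    message = str(exc)

# Escaping the character produces a pattern that matches a literal plus sign.
literal_plus = re.compile(re.escape('+'))
found = literal_plus.search('+25.00')

print(message)
print(found is not None)
```

So if the match argument is kept at all, something like match=re.escape('+') avoids the "nothing to repeat" error; whether filtering on a plus sign selects the right table is a separate question.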

Python BeautifulSoup - Scraping Google Finance historical data

I was trying to scrape Google Finance historical data. I needed the total number of rows, which is shown alongside the pagination. The following div tag is responsible for displaying the total number of rows:
<div class="tpsd">1 - 30 of 1634 rows</div>
I tried using the following code to get the data, but it's returning an empty list:
soup.find_all('div', 'tpsd')
I tried getting the entire table, but that was not successful either. When I checked the page source, I found the value inside a JavaScript function. When I Googled how to get values from a script tag, the suggestion was to use regex. So I tried regex, and the following is my code:
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ')
soup = BeautifulSoup(r.content,'lxml')
var = soup.find_all("script")[8].string
a = re.compile('google.finance.applyPagination\((.*)\'http', re.DOTALL)
b = a.search(var)
num = b.group(1)
print(num.replace(',','').split('\n')[3])
I am able to get the values I want, but I am not sure whether the code above is the right way to do it, or whether there is a better way. Kindly help.
You can simply pass an offset (start=...) in the URL to get 30 rows at a time, which is exactly what the pagination logic does:
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
"enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"
with requests.session() as s:
    start = 0
    req = s.get(url.format(start))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    all_rows = table.find_all("tr")
    while True:
        start += 30
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        if not table:
            break
        all_rows.extend(table.find_all("tr"))
You can also get the total rows using the script tag and use that with range:
import re

with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    for start in range(30, total+1, 30):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
    print(len(all_rows))
The num=30 parameter is the number of rows per page. To make fewer requests you can set it to 200, which seems to be the maximum, and work out your step/offset from that.
url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
"enddate=Aug+18%2C+2016&num=200&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"
with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    for start in range(200, total+1, 200):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        print(url.format(start))
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
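If we run the code, you will see we get 1643 rows (presumably the 1634 data rows plus one header row for each of the 9 pages, since every tr is counted):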
In [7]: with requests.session() as s:
   ...:     req = s.get(url.format(0))
   ...:     soup = BeautifulSoup(req.content, "lxml")
   ...:     table = soup.select_one("table.gf-table.historical_price")
   ...:     scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
   ...:     total = int(scr.text.split(",", 3)[2])
   ...:     all_rows = table.find_all("tr")
   ...:     for start in range(200, total+1, 200):
   ...:         soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
   ...:         table = soup.select_one("table.gf-table.historical_price")
   ...:         all_rows.extend(table.find_all("tr"))
   ...:     print(len(all_rows))
   ...:
1643
In [8]:
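The offsets requested by the loops above can also be computed up front, which makes the number of requests easy to check before hitting the server. A small sketch, where page_offsets is a made-up helper name and 1634 is the row total shown in the question's div:

```python
def page_offsets(total_rows, page_size):
    """Return the start= offsets needed to cover total_rows rows."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(0, total_rows, page_size))


offsets_30 = page_offsets(1634, 30)
offsets_200 = page_offsets(1634, 200)

print(len(offsets_30), offsets_30[:3], offsets_30[-1])
print(len(offsets_200), offsets_200)
```

Unlike the loops above, this list includes offset 0 for the first page, so a single loop over it covers every request.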
You can just use the Python module googlefinance: https://pypi.python.org/pypi/googlefinance
The API is simple:
# The Google Finance API wrapper that we need.
from googlefinance import getQuotes
# The JSON handler, since the API returns JSON.
import json
intelJSON = (getQuotes('INTC'))
intelDump = json.dumps(intelJSON, indent=2)
intelInfo = json.loads(intelDump)
intelPrice = intelInfo[0]['LastTradePrice']
intelTime = intelInfo[0]['LastTradeDateTimeLong']
print ("As of " + intelTime + ", Intel stock is trading at: " + intelPrice)
I prefer having all the raw CSV files that are available for download from Google Finance. I wrote a quick python script to automatically download all the historical price info for a list of companies -- it's equivalent to how a human might use the "Download to Spreadsheet" link manually.
Here's the GitHub repo, with the downloaded CSV files for all S&P 500 stocks (in the rawCSV folder): https://github.com/liezl200/stockScraper
It uses this link: http://www.google.com/finance/historical?q=googl&startdate=May+3%2C+2012&enddate=Apr+30%2C+2017&output=csv where the key part is the last parameter, output=csv. I use urllib.urlretrieve(download_url, local_csv_filename) to retrieve the CSV (in Python 3 the function is urllib.request.urlretrieve).
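For reference, a download URL of that shape can be assembled with the standard library instead of by hand. This is a sketch only: build_csv_url and download_csv are hypothetical helper names, the parameter names are taken from the link above, and nothing here checks whether the endpoint still responds.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

BASE_URL = "http://www.google.com/finance/historical"


def build_csv_url(symbol, start_date, end_date):
    """Build the CSV download URL; dates are strings such as 'May 3, 2012'."""
    params = {
        "q": symbol,
        "startdate": start_date,
        "enddate": end_date,
        "output": "csv",  # the parameter that switches the response to CSV
    }
    return BASE_URL + "?" + urlencode(params)


def download_csv(symbol, start_date, end_date, local_csv_filename):
    """Fetch the CSV to a local file (Python 3 counterpart of urllib.urlretrieve)."""
    urlretrieve(build_csv_url(symbol, start_date, end_date), local_csv_filename)


csv_url = build_csv_url("googl", "May 3, 2012", "Apr 30, 2017")
print(csv_url)
```

urlencode takes care of the escaping (spaces become + and commas become %2C), so the dates can be passed in readable form.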
