The problem: A website I am trying to gather data from uses Javascript to produce a graph. I'd like to be able to pull the data that is being used in the graph, but I am not sure where to start. For example, the data might be as follows:
var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],
["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],
["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],
["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],
["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],
["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];
This is pricing data (Date, Price, Volume). I've found another question here - Parsing variable data out of a js tag using python - which suggests that I use JSON and BeautifulSoup, but I am unsure how to apply it to this particular problem because the formatting is slightly different. In fact, in this problem the code looks more like python than any type of JSON dictionary format.
I suppose I could read it in as a string, and then use XPATH and some funky string editing to convert it, but this seems like too much work for something that is already formatted as a Javascript variable.
So, what can I do here to pull this type of organized data from this variable while using python? (I am most familiar with python and BS4)
If your format really is just one or more var foo = [JSON array or object literal];, you can just write a dotall regex to extract them, then parse each one as JSON. For example:
>>> j = '''var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],
["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],
["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],
["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],
["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],
["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];\s*$'''
>>> values = re.findall(r'var.*?=\s*(.*?);', j, re.DOTALL | re.MULTILINE)
>>> for value in values:
... print(json.loads(value))
[[['Wed, 12 Jun 2013 01:00:00 +0000', 22.4916114807, '2 sold'],
['Fri, 14 Jun 2013 01:00:00 +0000', 27.4950008392, '2 sold'],
['Sun, 16 Jun 2013 01:00:00 +0000', 19.5499992371, '1 sold'],
['Tue, 18 Jun 2013 01:00:00 +0000', 17.25, '1 sold'],
['Sun, 23 Jun 2013 01:00:00 +0000', 15.5420341492, '2 sold'],
['Thu, 27 Jun 2013 01:00:00 +0000', 8.79045295715, '3 sold'],
['Fri, 28 Jun 2013 01:00:00 +0000', 10, '1 sold']]]
Of course this makes a few assumptions:
A semicolon at the end of the line must be an actual statement separator, not the middle of a string. This should be safe because JS doesn't have Python-style multiline strings.
The code actually does have semicolons at the end of each statement, even though they're optional in JS. Most JS code has those semicolons, but it obviously isn't guaranteed.
The array and object literals really are JSON-compatible. This definitely isn't guaranteed; for example, JS can use single-quoted strings, but JSON can't. But it does work for your example.
Your format really is this well-defined. For example, if there might be a statement like var line2 = [[1]] + line1; in the middle of your code, it's going to cause problems.
Note that if the data might contain JavaScript literals that aren't all valid JSON, but are all valid Python literals (which isn't likely, but isn't impossible, either), you can use ast.literal_eval on them instead of json.loads. But I wouldn't do that unless you know this is the case.
Okay, so there are a few ways to do it, but I ended up simply using a regular expression to find everything between line1= and ;
#Read page data as a string
pageData = sock.read()
#set p as regular expression
p = re.compile('(?<=line1=)(.*)(?=;)')
#find all instances of regular expression in pageData
parsed = p.findall(pageData)
#evaluate list as python code => turn into list in python
newParsed = eval(parsed[0])
Regex is nice when you have good coding, but is this method better (EDIT: or worse!) than any of the other answers here?
EDIT: I ultimately used the following:
#Read page data as a string
pageData = sock.read()
#set p as regular expression
p = re.compile('(?<=line1=)(.*)(?=;)')
#find all instances of regular expression in pageData
parsed = p.findall(pageData)
#load as JSON instead of using evaluate to prevent risky execution of unknown code
newParsed = json.loads(parsed[0])
The following makes a few assumptions such as knowing how the page is formatted, but a way of getting your example into memory on Python is like this
# example data
data = 'foo bar foo bar foo bar foo bar\r\nfoo bar foo bar foo bar foo bar \r\nvar line1=\r\n[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],\r\n["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],\r\n["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],\r\n["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],\r\n["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],\r\n["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],\r\n["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];\r\nfoo bar foo bar foo bar foo bar\r\nfoo bar foo bar foo bar foo bar'
# find your variable's start and end
x = data.find('line1=') + 6
y = data.find(';', x)
# so you can get just the relevant bit
interesting = data[x:y].strip()
# most dangerous step! don't do this on unknown sources
parsed = eval(interesting)
# maybe you'd want to use JSON instead, if the data has the right syntax
from json import loads as JSON
parsed = JSON(interesting)
# now parsed is your data
Assuming you have a python variable with a javascript line/block as a string like"var line1 = [[a,b,c], [d,e,f]];", you could use the following few lines of code.
>>> code = """var line1 = [['a','b','c'], ['d','e','f'], ['g','h','i']];"""
>>> python_readable_code = code.strip("var ;")
>>> exec(python_readable_code)
>>> print(line1)
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
exec() Will run the code that is formatted as a string. In this case it will set the variable line1 to a list with lists.
And than you could use something like this:
for list in line1:
print(list[0], list[1], list[2])
# Or do something else with those values, like save them to a file