Web Scraping Python using Google Chrome extension - javascript

Hi, I am a Python newbie and I am web scraping a webpage.
I am using the Google Chrome Developer Tools to identify the class of the objects I want to scrape. However, my code returns an empty array of results, whereas the screenshots clearly show that those strings are in the HTML code.
[Screenshot: Chrome Developer Tools showing the target elements]
import requests
from bs4 import BeautifulSoup
url = 'http://www.momondo.de/flightsearch/?Search=true&TripType=2&SegNo=2&SO0=BOS&SD0=LON&SDP0=07-09-2016&SO1=LON&SD1=BOS&SDP1=12-09-2016&AD=1&TK=ECO&DO=false&NA=false'
html = requests.get(url)
soup = BeautifulSoup(html.text,"lxml")
x = soup.find_all("span", {"class":"value"})
print(x)
#pprint.pprint (soup.div)
I would very much appreciate your help!
Many thanks!

Converted my comment to an answer...
Make sure the data you are expecting is actually there. Use print(soup.prettify()) to see what was actually returned by the request. Depending on how the site works, the data you are looking for may only exist in the browser after the JavaScript is processed. You might also want to take a look at Selenium.
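For the JavaScript case, a minimal Selenium sketch (this assumes Chrome plus a matching chromedriver are installed; the span class "value" and the URL come from the question):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'http://www.momondo.de/flightsearch/?Search=true&TripType=2&SegNo=2&SO0=BOS&SD0=LON&SDP0=07-09-2016&SO1=LON&SD1=BOS&SDP1=12-09-2016&AD=1&TK=ECO&DO=false&NA=false'
browser = webdriver.Chrome()
browser.get(url)
time.sleep(10)  # crude wait for the JavaScript results; Selenium's WebDriverWait is the robust option
# page_source is the DOM after JavaScript has run, unlike requests' raw response
soup = BeautifulSoup(browser.page_source, "lxml")
x = soup.find_all("span", {"class": "value"})
browser.quit()
print(x)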

Related

No Element in Xpath with lxml : Javascript Generated Page

I have had XPath work with other things before; in the Chrome browser I can find my XPath in the console with $x('//*[@id="profile"]/div[2]/div[2]/div[1]/div[2]/div[2]/div[1]/span[2]') at https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na.
When I try to get this element in code it returns an empty array, anybody know why?
import requests
from lxml import html

@client.command(pass_context=True)
async def checkChrisPubg(ctx):
    page = requests.get('https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na')
    tree = html.fromstring(page.content)
    duoRank = tree.xpath('//*[@id="profile"]/div[2]/div[2]/div[1]/div[2]')
    print(duoRank)
print(duoRank) gives me []
So, I tried to do this with PyQt4 and had no real success in practice. A simpler, but slightly more invasive, solution is to use Selenium, a webdriver for loading web pages.
I am sure there are multiple solutions to this but I was having a hell of a time even knowing what was wrong until I found my solution.
When using lxml you should ensure the data you are trying to grab is not generated by JavaScript. To check, open Chrome Developer Tools, click the menu (3 vertical dots), go to Settings, scroll to the bottom, disable JavaScript, and reload the page.
If the data is no longer there, the content is generated with JavaScript.
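An equivalent programmatic check (a sketch; the id "profile" is taken from the question's XPath) is to look for the target in the raw HTML that requests receives:

import requests

raw = requests.get('https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na').text
# If the id from the XPath never appears in the raw response, the element is built client-side
print('id="profile"' in raw)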
A simple solution is below; it will wait for the page to render and then let you parse the tree with lxml.
This solution requires the following imports (you must install selenium):
from selenium import webdriver
from lxml import html
Now, you can load the page and start scraping:
# Load your browser (I use Chrome)
browser = webdriver.Chrome()
# Choose the url you want to scrape
url = 'https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na'
# Get the url with Selenium
browser.get(url)
# Get the inner HTML from the rendered page
innerHTML = browser.execute_script("return document.body.innerHTML")
# Now use lxml to parse the page
tree = html.fromstring(innerHTML)
# Get your element with XPath
duoRank = tree.xpath('//*[@id="profile"]/div[2]/div[2]/div[1]/div[2]/div[2]/div[1]/span[2]/text()')
# Close the browser
browser.quit()
My original solution would have been nice, but just didn't work because much of it is deprecated.
What library are you using as a parser?
If it is xml.etree.ElementTree, note that:
ElementTree provides limited support for XPath expressions. The goal is to support a small subset of the abbreviated syntax; a full XPath engine is outside the scope of the core library.
http://effbot.org/zone/element-xpath.htm
Open the page source (view-source:https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na); the script with the JSON playerData is at line 491. Just parse it.
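A sketch of that approach, assuming the page embeds the data as something like playerData = {...}; (the exact assignment form is an assumption, and a real page may need a more careful extraction than this regex):

import json
import re
import requests

page = requests.get('https://pubgtracker.com/profile/pc/Fuzzyllama/duo?region=na')
# Assumption: the stats are assigned to a JS variable named playerData
m = re.search(r'playerData\s*=\s*(\{.*?\});', page.text, re.DOTALL)
if m:
    data = json.loads(m.group(1))
    print(data.keys())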

Querying & Manipulating Data located in Javascript Variable

I'm trying to make my own DB with data pulled from a JavaScript variable located on this URL (https://www.numberfire.com/nba/fantasy/fantasy-basketball-projections/). Since the data is only made available in the variable (NF_DATA), I'm not able to query it easily as I would an API.
I'm able to pull the data as I would any JSON object using the Chrome Developer Console.
I would like to be able to pull the data with a Python script the same way I was doing it in the Chrome Developer Console (for instance, identifying the exact data by writing NF_DATA['daily_projections']['1'][...] in the script, so I can do the manipulation directly in the script, not in Chrome DevTools). Any recommendations on how to do this? I have tried using BeautifulSoup in Python, but was having trouble grabbing the data without turning the complete output into a string (would this even be a good way to think about it??)
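One way to do this, as a sketch: fetch the page with requests, cut the NF_DATA literal out of the script text, and parse it with json. The assignment form var NF_DATA = {...}; is an assumption about how the page embeds it:

import json
import re
import requests

resp = requests.get('https://www.numberfire.com/nba/fantasy/fantasy-basketball-projections/')
# Assumption: the page contains an assignment like "var NF_DATA = {...};"
m = re.search(r'var\s+NF_DATA\s*=\s*(\{.*?\});', resp.text, re.DOTALL)
if m:
    nf = json.loads(m.group(1))
    # The keys below are taken from the question
    print(nf['daily_projections']['1'])

Once parsed, nf is a regular Python dict, so the same lookups used in the console work directly in the script.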

Force Chrome to show Json response formatted as tree structure

I can make my app "force" Chrome (v 39.0.2171.99 m) to show the HTTP response as JSON (instead of XML).
How do I get the JSON in a tree structure (instead of a string)?
Checking the Preview tab in dev tools doesn't work for me.
I could paste the Json string into JsonLint, but I want to know a more direct route, if there is one.
I've been using this extension for years (it's called JSON Viewer and the source is available on GitHub); it works great.
I don't know who the developer is, but if he ever reads this: thanks for taking the time to develop such a timesaving tool!
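If you would rather stay outside the browser entirely, Python offers a direct route too (a sketch; the endpoint URL is hypothetical):

import json
import requests

resp = requests.get('https://example.com/api/data')  # hypothetical endpoint
# indent=2 renders the JSON as an indented tree rather than one long string
print(json.dumps(resp.json(), indent=2))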

How to retrieve title from dynamically formed web pages

I'm working on a bookmarking app using Django and would like to extract the title from web pages that use JavaScript to generate the title. I've looked at Windmill and installed/ran Selenium, which worked, but I believe these tools are more than what I need just to obtain the title of a web page. I'm currently trying to use spynner, but haven't been successful in retrieving the contents after the page is fully rendered. Here is the code that I currently have...
from spynner import Browser
from pyquery import PyQuery
browser = Browser()
browser.set_html_parser(PyQuery)
browser.load("https://www.coursera.org/course/techcity")
I receive a SpynnerTimeout: Timeout reached: 10 seconds error when executing the last line in a Python shell. If I execute the last statement again, it returns True, but only the page as it is before the JavaScript runs, which doesn't have the "correct" page title. I also tried the following:
browser.load("https://www.coursera.org/course/techcity", wait_callback=wait_load(10))
browser.soup("title")[0].text
But this also returns the incorrect title - 'Coursera.org' (i.e. title before the javascript is run).
Here are my questions:
Is there a more efficient recommended approach for extracting a web page title that is dynamically generated with JavaScript, using some other Python tool/library? If so, what is that approach? Any example code appreciated.
If spynner is a good approach, what should I be doing to get the title after the page is loaded, or even better, right after the title has been rendered by the JavaScript? The code I have now is just what I pieced together from a blog post and from looking at the source for spynner on GitHub.
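Since Selenium already worked for the asker, a minimal sketch of the title extraction with it (assumes Chrome and chromedriver are installed; heavier than spynner, but it runs the JavaScript):

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://www.coursera.org/course/techcity")
# .title reflects the DOM after JavaScript has updated it
print(browser.title)
browser.quit()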

How to parse html that includes javascript code

How does one parse HTML documents which make heavy use of JavaScript? I know there are a few libraries in Python which can parse static XML/HTML files, and I'm basically looking for a program or library (or even a Firefox plugin) which reads HTML+JavaScript, executes the JavaScript, and outputs HTML code without JavaScript, so it would look identical if displayed in a browser.
As a simple example, a link whose target is filled in by a JavaScript function should be replaced by a plain link containing the value the function returns.
A more complex example would be a saved facebook html page which is littered with loads of javascript code.
Probably related to
How to "execute" HTML+Javascript page with Node.js
but do I really need Node.js and JSDOM? Also slightly related is
Python library for rendering HTML and javascript
but I'm not interested in rendering, just the pure HTML output.
You can use Selenium with python as detailed here
Example:
# Note: this is an old Selenium RC example in Python 2; xmlrpclib became xmlrpc.client in Python 3
import xmlrpclib
# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpclib.ServerProxy(server_url)
# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)
import os
os.system('start run_firefox.bat')
print app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/')
print app.verifyTitle('Amazon.com: Welcome')
print app.verifySelected('url', 'All Products')
print app.select('url', 'Books')
print app.verifySelected('url', 'Books')
print app.verifyValue('field-keywords', '')
print app.type('field-keywords', 'Python Cookbook')
print app.clickAndWait('Go')
print app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook')
print app.verifyTextPresent('Python Cookbook', '')
print app.verifyTextPresent('Alex Martelli, David Ascher', '')
print app.testComplete()
From Mozilla Gecko FAQ:
Q. Can you invoke the Gecko engine from a Unix shell script? Could you send it HTML and get back a web page that might be sent to the printer?
A. Not really supported; you can probably get something close to what you want by writing your own application using Gecko's embedding APIs, though. Note that it's currently not possible to print without a widget on the screen to render to.
Embedding Gecko in a program that outputs what you want may be way too heavy, but at least your output will be as good as it gets.
PhantomJS can be loaded using Selenium
$ ipython
In [1]: from selenium import webdriver
In [2]: browser=webdriver.PhantomJS()
In [3]: browser.get('http://seleniumhq.org/')
In [4]: browser.title
Out[4]: u'Selenium - Web Browser Automation'
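A note on this: PhantomJS support has since been removed from Selenium, so on a recent setup the equivalent is headless Chrome (a sketch, assuming chromedriver is installed):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without a visible window
browser = webdriver.Chrome(options=options)
browser.get('http://seleniumhq.org/')
print(browser.title)
browser.quit()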
