Python Scraping from JavaScript table on PGA Website

I'm just getting into Python and have been working mostly with BeautifulSoup to scrape sports data from the web. I've run into an issue with a table on the PGA website that is generated by JavaScript, and I was hoping someone could walk me through the process in the context of the specific site I'm working with. Here is a sample link: "http://www.pgatour.com/content/pgatour/players/player.29745.tyler-aldridge.html/statistics". The tables in question are all of the player statistics tables. Thanks!

When a web page uses JavaScript to build or fetch its content, you are out of luck with tools that just download HTML from the web. You need something that mimics a web browser more thoroughly and interprets JavaScript: in other words, a so-called headless browser. There are several out there, some with good Python integration. You may want to start your journey by searching for PhantomJS or Selenium. Once you've chosen your tool, you can let the browser do its retrieving and rendering work and then browse the DOM in much the same way as you did with BeautifulSoup on static pages.
I would, however, also take a look at the Network tab of your browser's debugger first. Sometimes you can identify the GET request that actually fetches the table data from the server. In that case it may be easier to GET the data yourself (e.g. via requests) than to employ complex technology to do it for you. It is also quite likely that you will get the information you want in plain JSON, which makes it even simpler to use. The PGA site makes hundreds of GET requests to build the page, but browsing through them is still a good trade.
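As a minimal sketch of that second approach, assuming you have spotted a JSON endpoint for the statistics table in the Network tab (the URL and the "rows" key below are hypothetical placeholders, not the real PGA endpoint):

import requests

# Hypothetical endpoint spotted in the browser's Network tab; replace it with
# the actual request URL you see when the statistics table loads.
STATS_URL = "http://www.pgatour.com/some/stats/endpoint/29745.json"

response = requests.get(STATS_URL)
response.raise_for_status()

data = response.json()  # plain JSON, no JavaScript engine needed

# Drill into whatever structure you observe in the real response.
for row in data.get("rows", []):
    print(row)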

You need a JavaScript engine to parse and run the JavaScript code inside the page. There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
Also, consider using this:
http://www.seleniumhq.org/docs/03_webdriver.jsp
Selenium-WebDriver makes direct calls to the browser using each browser’s native support for automation. How these direct calls are made, and the features they support depends on the browser you are using. Information on each ‘browser driver’ is provided later in this chapter.
For those familiar with Selenium-RC, this is quite different from what you are used to. Selenium-RC worked the same way for each supported browser. It ‘injected’ javascript functions into the browser when the browser was loaded and then used its javascript to drive the AUT within the browser. WebDriver does not use this technique. Again, it drives the browser directly using the browser’s built in support for automation.
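As a minimal sketch of driving a browser with Selenium's Python bindings (the URL is the one from the question; the CSS selector is just a placeholder for whichever table you are after):

from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.Chrome(), any installed driver works
driver.get("http://www.pgatour.com/content/pgatour/players/player.29745.tyler-aldridge.html/statistics")

# Once the page has rendered, the JavaScript-built tables are part of the DOM
# and can be located like any other element.
table = driver.find_element_by_css_selector("table")
print(table.text)

driver.quit()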

Related

Get a javascript variable from a web page without interaction/headlessly

Good afternoon!
We're looking to get a JavaScript variable from a webpage, which we are usually able to retrieve by typing app in the Chrome DevTools console.
However, we're looking to do this headlessly, as it has to be performed on numerous apps.
Our ideas:
Using a Puppeteer instance to open the page, evaluate the command, and return the variable; this works, but it is very resource-consuming.
Using a GET/POST request to the page and trying to inject the JS command, but we didn't succeed.
We're wondering whether there is an easier solution, such as a special API that could extract the variable.
The goal would be to automate this process with no human interaction.
Thanks for your help!
Your question is not so much about a JS API (since the webpage is not yours to edit, you can only request it) as it is about webcrawling / browser automation.
You have to add details to get a definitive answer, but I see two scenarios:
The website actively checks for evidence of human browsing (for example, it sits behind Cloudflare and has enabled that protection), or the scripts depend heavily on a browser execution environment being available. In this case, the simplest option is to automate a real browser, because anything more lightweight has to get many things right to fool the server or the scripts. I would use Karate, which is easier than, say, Selenium, and can execute in-browser scripts. It is written in Java, but you can run it externally and just read its reports.
The website does not check for such evidence and the scripts do not really require a browser execution environment. Then you can simply download everything required locally and attempt to jury-rig the JS into executing in any JS environment. According to your post, this failed, but it is impossible to help further unless you can describe how it fails. This option can be fully headless.
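For the browser-automation route, here is a minimal sketch using Playwright's Python bindings (an alternative to the Puppeteer approach from the question; the variable name app is taken from the question, and the URL is a placeholder):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/the-app-page")  # placeholder URL

    # Equivalent to typing "app" in the DevTools console, but scripted.
    value = page.evaluate("window.app")
    print(value)

    browser.close()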
You can embed Chrome into your application and instrument it. It will be headless.
We've used this approach in the past to copy content from PowerPoint Online.
We were using .NET to do this and therefore used CEFSharp.

can requests python library force a page to load all javascript dynamic content before storing the contents of that page

BeautifulSoup can often be used to (1) store the contents of a page in a variable and
(2) parse elements in a webpage.
However, BeautifulSoup on its own cannot open password-protected pages that return HTTP 403 errors, so I used requests for that task.
Now I am wondering: does the requests library have the ability to force the JavaScript on a page to load?
I am using Python 2.7.
Does requests have something like requests.open(some_url).forceJavascriptLoad?
No. requests doesn't have the ability to execute JavaScript in any way. You need a so-called "headless" web browser to do what you want. Here is a list of some of them. As a piece of advice, I recommend trying PhantomJS; although it is not written in Python, it has several advantages over the others:
It is easy to setup and use
Actively developed and not abandoned like a lot of other headless browsers
Has really good JavaScript support
Is fast
Provides precompiled binaries in case you have problems with compiling stuff
I tried a lot of headless browsers myself, and I was only happy with PhantomJS. If you still want to try a Python-based headless browser, you can give Ghost.py a try.
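As a minimal sketch of this suggestion, driving PhantomJS through Selenium's Python bindings (this assumes an older Selenium release, since PhantomJS support was removed in Selenium 4, plus a phantomjs binary on your PATH; the URL is a placeholder):

from selenium import webdriver

driver = webdriver.PhantomJS()  # requires the phantomjs executable on PATH
driver.get("http://example.com/page-built-by-javascript")  # placeholder URL

# page_source now contains the DOM after the JavaScript has run, so it can be
# parsed or stored like any static HTML document.
print(len(driver.page_source))

driver.quit()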

Selenium versus BeautifulSoup for web scraping

I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize, but I saw that the website had a button that created content via JavaScript, so I decided to use Selenium.
Given that I can find elements and get their content using Selenium with methods like driver.find_element_by_xpath, what reason is there to use BeautifulSoup when I could just use Selenium for everything?
And in this particular case, where I need to use Selenium to click the JavaScript button, is it better to use Selenium to parse as well, or should I use both Selenium and BeautifulSoup?
Before answering your question directly, it's worth saying as a starting point: if all you need to do is pull content from static HTML pages, you should probably use an HTTP library (like Requests or the built-in urllib.request) with lxml or BeautifulSoup, not Selenium (although Selenium will probably be adequate too). The advantages of not using Selenium needlessly:
Bandwidth, and time to run your script. Using Selenium means fetching all the resources that would normally be fetched when you visit a page in a browser - stylesheets, scripts, images, and so on. This is probably unnecessary.
Stability and ease of error recovery. Selenium can be a little fragile, in my experience - even with PhantomJS - and creating the architecture to kill a hung Selenium instance and create a new one is a little more irritating than setting up simple retry-on-exception logic when using requests.
Potentially, CPU and memory usage - depending upon the site you're crawling, and how many spider threads you're trying to run in parallel, it's conceivable that either DOM layout logic or JavaScript execution could get pretty expensive.
Note that a site requiring cookies to function isn't a reason to break out Selenium - you can easily create a URL-opening function that magically sets and sends cookies with HTTP requests using cookielib/cookiejar.
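For example, a minimal sketch of that cookie-handling point using a requests Session, which keeps a cookie jar for you (the URLs are placeholders):

import requests

session = requests.Session()  # stores cookies across requests automatically

# The Set-Cookie headers from this response are kept in session.cookies...
session.get("http://example.com/sets-a-cookie")

# ...and sent back on every subsequent request made through the session.
page = session.get("http://example.com/needs-that-cookie")
print(page.status_code)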
Okay, so why might you consider using Selenium? Pretty much entirely to handle the case where the content you want to crawl is being added to the page via JavaScript, rather than baked into the HTML. Even then, you might be able to get the data you want without breaking out the heavy machinery. Usually one of these scenarios applies:
JavaScript served with the page has the content already baked into it. The JavaScript is just there to do the templating or other DOM manipulation that puts the content into the page. In this case, you might want to see if there's an easy way to pull the content you're interested in straight out of the JavaScript using regex.
The JavaScript is hitting a web API to load content. In this case, consider if you can identify the relevant API URLs and just hit them yourself; this may be much simpler and more direct than actually running the JavaScript and scraping content off the web page.
If you do decide your situation merits using Selenium, use it in headless mode, which is supported by (at least) the Firefox and Chrome drivers. Web spidering doesn't ordinarily require actually graphically rendering the page, or using any browser-specific quirks or features, so a headless browser - with its lower CPU and memory cost and fewer moving parts to crash or hang - is ideal.
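To tie the two tools together, here is a minimal sketch of headless Chrome via Selenium handing the rendered page to BeautifulSoup (the URL and CSS selector are placeholders):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # no visible browser window

driver = webdriver.Chrome(options=options)
driver.get("http://example.com/javascript-heavy-page")  # placeholder URL

# Hand the fully rendered DOM to BeautifulSoup for the actual parsing.
soup = BeautifulSoup(driver.page_source, "html.parser")
for cell in soup.select("table td"):  # placeholder selector
    print(cell.get_text(strip=True))

driver.quit()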
I would recommend using Selenium for things such as interacting with web pages, whether in a full-blown browser or in headless mode, such as headless Chrome. I would also say that BeautifulSoup is better for inspecting the page and writing logic that depends on whether an element is found, or on what is found, and then using Selenium to execute interactive tasks with the page if the user desires.
I used Selenium for web scraping, but it is not a happy solution. In my last project I used https://github.com/chromedp/chromedp . It is a simpler solution than Selenium.

Using Python to Automatically Login to a Website with a JavaScript Form

I'm attempting to write a script that logs into a website. This specific website contains a JavaScript form, so I had little to no luck making use of "mechanize".
I'm curious whether there are other solutions I may be unaware of that would help in my situation. If this question or some related variant has been asked here before, please excuse me and point me to it. Otherwise, what are some common techniques/approaches for dealing with this issue?
Thanks.
I've recently been using PhantomJS for this kind of work - it's a command-line tool that allows you to run JavaScript in a browser environment (based on WebKit). This allows you to do scraping and online interactions that require JavaScript-enabled interfaces. There's a Python-based implementation here that's fully compatible with the API of the C++ version, or you could run either version from Python via subprocess.
Depending on what you're trying to do, another good option might be to use Selenium, which has a client driver implementation in Python - it's meant for integration testing, but it can do a lot of automation as long as you're okay with running the Java-based Selenium Server and having the automation happen in an open browser rather than as a background process.
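As a minimal sketch of automating such a login with Selenium's Python bindings (the URL and the field names username, password and submit are placeholders; inspect the real form to find the correct ones):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/login")  # placeholder URL

# Locators are placeholders; use the names/ids of the actual form fields.
driver.find_element_by_name("username").send_keys("my_user")
driver.find_element_by_name("password").send_keys("my_password")
driver.find_element_by_name("submit").click()

# After the JavaScript-driven login completes, the session lives in the
# browser, and further pages can be scraped from the same driver.
print(driver.title)
driver.quit()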

Webscraping a javascript based website

There are many tools that scrape HTML pages with JavaScript off; however, are there any that will scrape with JavaScript on, including pressing buttons that trigger JavaScript callbacks?
I'm currently trying to scrape a site that is solely navigated through JavaScript calls. All the buttons that lead to the content execute JavaScript, without an href in sight. I could reverse-engineer the JavaScript calls (which do, in part, return HTML), but that is going to take some time; are there any shortcuts?
I use HtmlUnit, generally wrapped in a JVM-based scripting language like JRuby. HtmlUnit is fantastic because its JavaScript engine handles all of the dynamic functionality, including AJAX, behind the scenes. That makes it very easy to scrape.
Have you tried using scRUBYt!? I'm not 100% sure, but I think I used it to scrape some dynamic web sites.
It has some useful methods like
click_link_and_wait 'Get results', 5
Win32::IE::Mechanize
You could use Watij if you're into Java (and want to automate Internet Explorer). Alternatively, you can use WebDriver and automate Firefox as well. WebDriver has a Python API too.
At the end of the day, websites that do not use Flash or other embedded plugins still need to make HTTP requests from the browser to the server. Most, if not all, of those requests will have patterns within their URIs. Use Firebug/LiveHTTPHeaders to capture all the requests, which in turn will let you see what data comes back. From there, you can build ways to grab the data you want.
That is, of course, assuming they are not using some crappy form of obfuscation/encryption to slow you down.
