I'm scraping content from a website using Python. At first I used BeautifulSoup and Mechanize, but I saw that the website had a button that created content via JavaScript, so I decided to use Selenium.
Given that I can find elements and get their content using Selenium with methods like driver.find_element_by_xpath, what reason is there to use BeautifulSoup when I could just use Selenium for everything?
And in this particular case, I need to use Selenium to click the JavaScript button, so is it better to use Selenium to parse as well, or should I use both Selenium and Beautiful Soup?
Before answering your question directly, it's worth saying as a starting point: if all you need to do is pull content from static HTML pages, you should probably use an HTTP library (like Requests or the built-in urllib.request) with lxml or BeautifulSoup, not Selenium (although Selenium will probably be adequate too). The advantages of not using Selenium needlessly:
Bandwidth, and time to run your script. Using Selenium means fetching all the resources that would normally be fetched when you visit a page in a browser - stylesheets, scripts, images, and so on. This is probably unnecessary.
Stability and ease of error recovery. Selenium can be a little fragile, in my experience - even with PhantomJS - and creating the architecture to kill a hung Selenium instance and create a new one is a little more irritating than setting up simple retry-on-exception logic when using requests.
Potentially, CPU and memory usage - depending upon the site you're crawling, and how many spider threads you're trying to run in parallel, it's conceivable that either DOM layout logic or JavaScript execution could get pretty expensive.
Note that a site requiring cookies to function isn't a reason to break out Selenium - you can easily create a URL-opening function that magically sets and sends cookies with HTTP requests using cookielib/cookiejar.
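For example, here is a minimal sketch using requests (mentioned above); the URLs are placeholders, and the point is just that a Session object stores and resends cookies for you:

```python
import requests

# A Session persists cookies across requests, so any cookie set by the
# first response is sent automatically with every later request.
session = requests.Session()
session.get("https://example.com/sets-a-cookie")             # placeholder: server sets cookies here
response = session.get("https://example.com/needs-cookie")   # cookies sent automatically
print(response.status_code)
```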
Okay, so why might you consider using Selenium? Pretty much entirely to handle the case where the content you want to crawl is being added to the page via JavaScript, rather than baked into the HTML. Even then, you might be able to get the data you want without breaking out the heavy machinery. Usually one of these scenarios applies:
JavaScript served with the page has the content already baked into it. The JavaScript is just there to do the templating or other DOM manipulation that puts the content into the page. In this case, you might want to see if there's an easy way to pull the content you're interested in straight out of the JavaScript using regex.
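As a rough illustration of that first scenario (the variable name, URL, and regex are all made up for the sketch), you can sometimes pull a JSON blob straight out of an inline script:

```python
import json
import re

import requests

html = requests.get("https://example.com/some-page").text   # placeholder URL

# Hypothetical: the page embeds its data as `var pageData = {...};` inside a
# script tag. The regex assumes the object literal ends with "};".
match = re.search(r"var pageData\s*=\s*(\{.*?\});", html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(data)
```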
The JavaScript is hitting a web API to load content. In this case, consider if you can identify the relevant API URLs and just hit them yourself; this may be much simpler and more direct than actually running the JavaScript and scraping content off the web page.
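Again as a sketch (the endpoint and parameters are invented): once you've spotted the request in your browser's network tab, you can often reproduce it with requests and get clean JSON back:

```python
import requests

# Hypothetical API endpoint spotted in the browser's network tab.
response = requests.get(
    "https://example.com/api/items",
    params={"page": 1},
    headers={"Accept": "application/json"},
)
response.raise_for_status()
items = response.json()   # already structured data, no HTML parsing needed
print(len(items))
```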
If you do decide your situation merits using Selenium, use it in headless mode, which is supported by (at least) the Firefox and Chrome drivers. Web spidering doesn't ordinarily require actually graphically rendering the page, or using any browser-specific quirks or features, so a headless browser - with its lower CPU and memory cost and fewer moving parts to crash or hang - is ideal.
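For instance, a minimal headless Chrome sketch with Selenium (depending on your Selenium version the keyword argument may be chrome_options rather than options):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")   # placeholder URL
    print(driver.title)
finally:
    driver.quit()
```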
I would recommend using Selenium for things such as interacting with web pages, whether in a full-blown browser or a browser in headless mode, such as headless Chrome. I would also say that Beautiful Soup is better for inspecting the page and writing logic that depends on whether an element is found, or on WHAT is found, and then using Selenium to execute interactive tasks with the page if the user desires.
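If you do end up combining the two, a common pattern (sketched here with placeholder URL and selectors) is to let Selenium do the clicking and then hand the rendered HTML to BeautifulSoup for parsing; note that newer Selenium versions use find_element(By...), while older ones use the find_element_by_* methods mentioned in the question:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")                 # placeholder URL
    driver.find_element(By.ID, "load-more").click()   # hypothetical JavaScript button
    # In practice, wait for the new content here (e.g. with WebDriverWait).

    # Hand the rendered HTML over to BeautifulSoup for parsing.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for item in soup.select("div.item"):              # placeholder selector
        print(item.get_text(strip=True))
finally:
    driver.quit()
```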
I used Selenium for web scraping, but it was not a happy solution. In my last project I used https://github.com/chromedp/chromedp. It is a simpler solution than Selenium.
Good afternoon!
We're looking to get a JavaScript variable from a webpage, which we can usually retrieve by typing app in the Chrome DevTools console.
However, we're looking to do this headlessly, as it has to be performed on numerous apps.
Our ideas:
Using a Puppeteer instance to go to the page, type the command, and return the variable, which works but is very resource-intensive.
Using a GET/POST request to the page and trying to inject the JS command, but we didn't succeed.
We're wondering, then, whether there is an easier solution, such as a special API that could extract the variable?
The goal would be to automate this process with no human interaction.
Thanks for your help!
Your question is not so much about a JS API (since the webpage is not yours to edit, you can only request it) as it is about webcrawling / browser automation.
You have to add details to get a definitive answer, but I see two scenarios:
the website actively checks for evidence of human browsing (for example, it sits behind Cloudflare and has enabled that option), or the scripts depend heavily on a browser execution environment being available. In this case, the simplest option is to automate a browser, because a headless approach has to get many things right to fool the server or the scripts. I would use Karate, which is easier than, say, Selenium and can execute in-browser scripts. It is written in Java, but you can execute it externally and just read its reports.
the website does not check for such evidence and the scripts do not really require a browser execution environment. Then you can simply download everything required locally and attempt to jury-rig the JS into executing in any JS environment. According to your post, this fails, but it is impossible to help unless you can describe how it fails. This option can be headless.
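For what it's worth, if a browser turns out to be acceptable after all, grabbing the variable itself is straightforward; here is a sketch with headless Chrome driven from Python via Selenium, assuming the value really is a serialisable global called app and the URL is a placeholder:

```python
import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/the-app-page")   # placeholder URL
    # Serialise the global `app` variable inside the page and bring it back to Python.
    raw = driver.execute_script("return JSON.stringify(window.app);")
    app = json.loads(raw)
    print(app)
finally:
    driver.quit()
```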
You can embed Chrome into your application and instrument it. It will be headless.
We've used this approach in the past to copy content from PowerPoint Online.
We were using .NET to do this and therefore used CEFSharp.
Please do note that I am a novice when it comes to web technologies.
I have to crawl and scrape quite a few websites that are built with a mix of React / JavaScript / HTML techs. These sites have approximately 0.1 to 0.5 million pages in total.
I am planning to use Selenium and Scrapy to get the crawling and scraping done. Scrapy alone cannot scrape React pages, and using Selenium to scrape regular JavaScript/HTML pages can prove very time-consuming.
I want to know if there is any way my crawler/scraper can understand what differentiates a React page from a plain JavaScript/HTML page.
Awaiting response.
Not sure if this has come too late, but I'll write my two cents on this issue nonetheless.
I have to crawl and scrape quite a few websites that are built with a mix of React / javascript / html techs.
Correct me if I'm wrong, but I believe what you meant by this is that some webpages of the site contain the data of interest (the data to be scraped) already loaded in the HTML, without involving JS. Hence, you wish to tell the webpages that need JS rendering apart from those that don't, to improve scraping efficiency.
To answer your question directly, there is no smart system that a crawler can use to differentiate between those two types of webpages without rendering it at least once.
Unless the URLs of the webpages follow a pattern that enables you to easily discern between pages that need JS and pages that only require plain HTML crawling:
You can try to fetch the page at least once and write conditional code around the response. What I mean is: first crawl the target URL with Scrapy (plain HTML, no JS execution), and if the response received is incomplete (assuming the incomplete response is not due to erroneous element-selection code), crawl it a second time with a JS renderer.
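Here is a rough sketch of that conditional logic in a Scrapy spider; the URL, the selector, and the hand-off to a JS renderer are all placeholders for whatever you actually use (scrapy-splash, scrapy-playwright, a Pyppeteer-based callback, and so on):

```python
import scrapy


class MixedSpider(scrapy.Spider):
    name = "mixed"
    start_urls = ["https://example.com/some-page"]   # placeholder

    def parse(self, response):
        # Hypothetical selector for the data we expect in the plain HTML.
        rows = response.css("table.stats tr")
        if rows:
            for row in rows:
                yield {"text": row.get()}
        else:
            # The plain HTML came back without the data, so retry this URL
            # through whatever JS renderer you have plugged into the pipeline.
            yield scrapy.Request(
                response.url,
                callback=self.parse_rendered,
                meta={"needs_js": True},   # hypothetical flag for the renderer middleware
                dont_filter=True,
            )

    def parse_rendered(self, response):
        # This callback would receive the JS-rendered HTML from the renderer.
        for row in response.css("table.stats tr"):
            yield {"text": row.get()}
```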
This brings me to my second point. If the webpages do not have a fixed URL pattern, you can simply try using a faster and more lightweight JS renderer.
Selenium indeed has relatively high overhead when mass crawling (up to 0.5M pages in your case), since it is not built for that in the first place. You can check out Pyppeteer, an unofficial Python port of Google's Node.js library Puppeteer. This will allow you to integrate it with Scrapy easily.
You can read up on the pros and cons of Puppeteer versus Selenium to better calibrate it to your use case. One major limitation is that Puppeteer only supports Chrome for now.
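Here is a bare-bones Pyppeteer sketch (the URL is a placeholder) that fetches the fully rendered HTML, which you could then feed to your existing parsing code:

```python
import asyncio

from pyppeteer import launch


async def fetch_rendered(url):
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        # Wait until network activity settles so JS-driven requests have finished.
        await page.goto(url, {"waitUntil": "networkidle0"})
        return await page.content()   # the rendered HTML
    finally:
        await browser.close()


html = asyncio.get_event_loop().run_until_complete(
    fetch_rendered("https://example.com/react-page")   # placeholder URL
)
print(len(html))
```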
I'm just getting into Python and have been working mostly with BeautifulSoup to scrape sports data from the web. I have run into an issue with a table on the PGA website that is generated by JavaScript, and I was hoping someone could walk me through the process in the context of the specific website I am working with. Here is a sample link: "http://www.pgatour.com/content/pgatour/players/player.29745.tyler-aldridge.html/statistics". The tables I need are all of the player statistics tables. Thanks!
When a web page uses JavaScript to build or fetch its content, you are out of luck with tools that just download HTML from the web. You need something that mimics a web browser more thoroughly and interprets JavaScript; in other words, a so-called headless browser. There are several out there, some with good Python integration. You may want to start your journey by searching for PhantomJS or Selenium. Once you've chosen your tool, you can let the browser do its retrieving and rendering work and then browse the DOM in much the same way as you did with BeautifulSoup on static pages.
I would, however, also take a look at the Network tab of your browser's developer tools first. Sometimes you can identify the GET request that actually fetches the table data from the server. In that case it might be easier to GET the data yourself (e.g. via requests) than to employ complex technology to do it for you. It is also very likely that you get the information you want as plain JSON, which makes it even simpler to use. The PGA site GETs hundreds of resources to build a page, but it is still a good trade to browse through them.
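As a hedged sketch of that approach (the endpoint below is invented, so substitute whatever URL you actually see serving the statistics data in the Network tab):

```python
import requests

# Hypothetical endpoint: replace with the URL from the Network tab that
# actually returns the statistics as JSON.
url = "https://www.pgatour.com/some/stats/endpoint.json"

response = requests.get(url, headers={"Accept": "application/json"})
response.raise_for_status()
stats = response.json()

# The key names are guesses; inspect the real payload to see its structure.
for entry in stats.get("rows", []):
    print(entry)
```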
You need a JavaScript engine to parse and run the JavaScript code inside the page. There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
Also, consider using this:
http://www.seleniumhq.org/docs/03_webdriver.jsp
Selenium-WebDriver makes direct calls to the browser using each browser’s native support for automation. How these direct calls are made, and the features they support depends on the browser you are using. Information on each ‘browser driver’ is provided later in this chapter.
For those familiar with Selenium-RC, this is quite different from what you are used to. Selenium-RC worked the same way for each supported browser. It ‘injected’ javascript functions into the browser when the browser was loaded and then used its javascript to drive the AUT within the browser. WebDriver does not use this technique. Again, it drives the browser directly using the browser’s built in support for automation.
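Just to make the direct-driving model concrete, a minimal WebDriver sketch in Python (URL and selector are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()   # or webdriver.Chrome(), whichever driver you have installed
try:
    driver.get("https://example.com")                      # placeholder URL
    element = driver.find_element(By.CSS_SELECTOR, "h1")   # placeholder selector
    print(element.text)
finally:
    driver.quit()
```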
I have a few clients that pay me to post ads on sites such as craigslist.com and backpage.com. Currently, every hour or so a macro runs and I manually do the captchas (which I'm fine with). But now I have some free time and I want to write a proper program to prevent the stupid errors that can happen with macros (screen resizes, misclicks, etc.).
Part of my posting includes selecting images to upload. I know that for security reasons JavaScript doesn't let you specify which file the user uploads; the user has to pick the file themselves. I'm sure I could do it using NodeJS somehow, since it would be running locally on my machine, but I don't have the slightest idea how to approach this.
Any guidance or direction would be very helpful.
If you use NodeJS, you have to do a lot of the work yourself, for example:
- get the HTML content and parse it
- construct the input that you want
- re-submit the form and re-post the data
The easier way is to use browser automation such as Selenium to handle the whole flow end to end for you.
more info: http://www.seleniumhq.org/
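For the image-upload part specifically, Selenium can usually handle it by sending the local file path straight to the file input, with no need to click through the OS file dialog. A sketch, with placeholder URL, selectors, and path:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/post-ad")   # placeholder posting form
    file_input = driver.find_element(By.CSS_SELECTOR, "input[type='file']")
    # Sending a local path to the <input type="file"> element selects the file
    # without opening the browser's file-picker dialog.
    file_input.send_keys("/home/me/ads/photo1.jpg")   # placeholder path
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
finally:
    driver.quit()
```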
If you are familiar with NodeJS and JavaScript, then I recommend you use Protractor.
It is the current default end-to-end automation testing tool for AngularJS applications, but I'm pretty sure it will solve your problem.
Instead of using AngularJS-specific selectors (like element(by.model)) to "find" your HTML elements, you can use regular CSS selectors, for example $$("div.top"), which returns all the divs with the CSS class top.
Protractor implements the Selenium WebDriver protocol, which means that the scripts you write can communicate with almost any automation-ready browser driver, such as ChromeDriver, FirefoxDriver, or PhantomJSDriver (a no-GUI, lower-fidelity but fast alternative).
Make sure you check the getting started section for a jump start.
BeautifulSoup can often be used to (1) store the contents of a page in a variable and (2) parse elements in a webpage.
However, BeautifulSoup on its own cannot open password-protected pages (HTTP error 403), so I used requests for this task.
Now I am wondering: does the requests library have the ability to force the JavaScript on a page to load?
I am using Python 2.7.
Does requests have something like requests.open(some url).forceJavascriptLoad?
No. Requests doesn't have the ability to execute JavaScript in any way. You need a so-called "headless" web browser to do what you want. Here is a list of some of them. I recommend trying PhantomJS; although it is not written in Python, it has several advantages over the others:
It is easy to setup and use
Actively developed and not abandoned like a lot of other headless browsers
Has really good JavaScript support
Is fast
Provides precompiled binaries in case you have problems with compiling stuff
I tried a lot of headless browsers myself and I was only happy with PhantomJS. If you still want to try a Python-based headless browser, you can give Ghost.py a try.
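A small sketch of the PhantomJS route from Python, via Selenium's PhantomJS driver (deprecated in recent Selenium versions, but it works with the older releases you'd use on Python 2.7), handing the rendered HTML to BeautifulSoup; the URL is a placeholder and the phantomjs binary must be on your PATH:

```python
from bs4 import BeautifulSoup
from selenium import webdriver

# Requires the phantomjs executable to be installed and on your PATH.
driver = webdriver.PhantomJS()
try:
    driver.get("https://example.com/js-heavy-page")   # placeholder URL
    # page_source now contains the HTML after the JavaScript has run.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.title.string if soup.title else "no title")
finally:
    driver.quit()
```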