Web Crawler with Scraper that uses Puppeteer and Scrapy - javascript

Please do note that I am a novice when it comes to web technologies.
I have to crawl and scrape quite a few websites that are built with a mix of React / JavaScript / HTML. In total, these sites have approximately 0.1 to 0.5 million pages.
I am planning to use Selenium and Scrapy to get the crawling and scraping done. Scrapy alone cannot scrape React pages, and using Selenium to scrape regular JavaScript/HTML pages can prove to be very time consuming.
I want to know whether there is any way my crawler/scraper can tell a React page apart from a plain JavaScript/HTML page.
Awaiting response.

Not sure if this has come too late, but I'll write my two cents on this issue nonetheless.
I have to crawl and scrape quite a few websites that are built with a mix of React / javascript / html techs.
Correct me if I'm wrong, but I believe what you meant is that some webpages of those sites contain the data of interest (the data to be scraped) already present in the HTML, without involving JS. Hence, you wish to distinguish the webpages that need JS rendering from those that don't, to improve scraping efficiency.
To answer your question directly: there is no smart system a crawler can use to differentiate between those two types of webpages without rendering them at least once.
If the URLs of the webpages follow a pattern that enables you to easily discern between pages that use JS and pages that only require HTML crawling, you can simply route them accordingly. If they don't, here are two options.
The first: you can render the page at most twice and write conditional code around the response. What I mean is, first crawl the target URL with Scrapy (plain HTML), and if the response received is incomplete (assuming the invalid response is not due to erroneous element-selection code), crawl it a second time with a JS renderer. A rough sketch of this fallback approach is below.
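Here is a minimal sketch of that fallback idea, assuming a Scrapy project with a JS rendering backend such as scrapy-playwright already configured in the settings; the URL and CSS selectors are placeholders, not taken from your sites:

import scrapy

class FallbackSpider(scrapy.Spider):
    name = "fallback"
    start_urls = ["https://example.com/page-1"]  # placeholder URL

    def parse(self, response):
        rows = response.css("div.article")  # placeholder selector
        if rows:
            # The plain HTML response already contains the data; scrape it directly.
            for row in rows:
                yield {"title": row.css("h2::text").get()}
        else:
            # The response looks incomplete; retry the same URL with JS rendering.
            yield scrapy.Request(
                response.url,
                callback=self.parse_rendered,
                meta={"playwright": True},  # requires scrapy-playwright to be set up
                dont_filter=True,
            )

    def parse_rendered(self, response):
        # Same extraction logic, but against the JS-rendered HTML.
        for row in response.css("div.article"):
            yield {"title": row.css("h2::text").get()}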
This brings me to my second point. If the webpages have no fixed URL pattern to go by, you can simply use a faster and more lightweight JS renderer for every page.
Selenium indeed has relatively high overhead when mass crawling (up to 0.5M pages in your case), since it was not built for that in the first place. You can check out Pyppeteer, an unofficial Python port of Google's Node.js library Puppeteer, which will allow you to integrate it easily with Scrapy.
Here you can read up on the pros and cons of Puppeteer vs. Selenium, to better calibrate it to your use case. One major limitation is that Puppeteer only supports Chrome for now.
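For reference, a minimal Pyppeteer sketch (assuming pip install pyppeteer); it renders one page headlessly and returns the post-JS HTML, which you could then feed to Scrapy/parsel selectors. The URL is a placeholder:

import asyncio
from pyppeteer import launch

async def render(url):
    # Launch headless Chromium, let the page's JavaScript run, grab the HTML.
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url, {"waitUntil": "networkidle0"})  # wait for JS-driven requests
    html = await page.content()
    await browser.close()
    return html

if __name__ == "__main__":
    html = asyncio.get_event_loop().run_until_complete(render("https://example.com"))
    print(len(html))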

Related

Python Scraping from JavaScript table on PGA Website

I'm just getting into Python and have been working mostly with BeautifulSoup to scrape sports data from the web. I have run into an issue with a table on the PGA website that is generated by JavaScript, and I was hoping someone could walk me through the process in the context of the specific website I am working with. Here is a sample link: "http://www.pgatour.com/content/pgatour/players/player.29745.tyler-aldridge.html/statistics" The tables are all of the player statistics tables. Thanks!
When a web page uses JavaScript to build or fetch its content, you are out of luck with tools that just download HTML from the web. You need something that mimics a web browser more thoroughly and interprets JavaScript; in other words, a so-called headless browser. There are some out there, even some with good Python integration. You may want to start your journey by searching for PhantomJS or Selenium. Once you've chosen your tool, you can let the browser do its retrieving and rendering work and then browse the DOM in much the same way as you did with BeautifulSoup on static pages.
I would, however, also take a look at the Network tab of your browser's debugger first. Sometimes you can identify the GET request that actually fetches the table data from the server. In that case it might be easier to GET the data yourself (e.g. via requests) than to employ complex technology to do it for you. It is also very likely that you will get the information you want in plain JSON, which makes it even simpler to use. The PGA site makes hundreds of GET requests to build its pages, but it will still be a good trade to browse through them.
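For example, if the Network tab reveals a JSON endpoint behind the stats table, a few lines of requests are enough. The URL and the response structure below are assumptions for illustration, not the real PGA endpoint:

import requests

# Placeholder endpoint spotted in the Network tab, not the real PGA URL.
resp = requests.get("https://www.example.com/stats/player/29745.json")
resp.raise_for_status()
data = resp.json()
# The key names here are assumptions; inspect the actual JSON to find yours.
for row in data.get("rows", []):
    print(row)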
You need a JavaScript engine to parse and run the JavaScript code inside the page. There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
Also, consider using this:
http://www.seleniumhq.org/docs/03_webdriver.jsp
Selenium-WebDriver makes direct calls to the browser using each browser’s native support for automation. How these direct calls are made, and the features they support depends on the browser you are using. Information on each ‘browser driver’ is provided later in this chapter.
For those familiar with Selenium-RC, this is quite different from what you are used to. Selenium-RC worked the same way for each supported browser. It ‘injected’ javascript functions into the browser when the browser was loaded and then used its javascript to drive the AUT within the browser. WebDriver does not use this technique. Again, it drives the browser directly using the browser’s built in support for automation.

Fastest solution for scraping text rendered by javascript in Python

I am relatively new to scraping and am trying to scrape this site (and many, many like it): http://www.superiorcourt.maricopa.gov/docket/CriminalCourtCases/caseInfo.asp?caseNumber=CR1999-012267
I'm using Python and Scrapy. My problem is that when I start up a Scrapy shell and point it to this URL, the response body is full of code I can't read, e.g.:
c%*u9u\\'! (vy!}vyO"9u#$"v/!!!"yJZ*9u!##v/!"*!%y\\_9u\\')"v/\\'!#myJOu9u$)}vy}vy9CCVe^SdY_^uvkT_Se]U^dKju"&#$)\\')&vMK9u)}&vy}MKju!\\'$#)(# (!#vMuvmy\\:*Ve^SdY_^uCy\\y
The information I actually want to scrape does not appear to be accessible.
I believe this is a javascript problem, and have confirmed that using tools others have suggested before, like Selenium, renders the page correctly. My problem is that I will need to scrape several million of these sites, and don't believe that a browser-based solution is going to be fast enough.
Is there a better way to do this? I do not need to click any links on the page (I have a long list of all the URLs I want to scrape) or interact with it in any other way. Is it possible that the response body contains JSON that I could parse?
If you just want to wait for the JavaScript data to load, I'd use ScrapyJS.
If you need to interact with JavaScript elements on the website, maybe use Scrapy + Selenium + PhantomJS. The latter approach is usually the more popular choice because it's easier to learn and can do more, but it's slower.
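ScrapyJS has since evolved into scrapy-splash; here is a minimal sketch assuming a local Splash instance on port 8050 and the middleware configured per the scrapy-splash README. The URL and selector are placeholders, not the real court docket page:

import scrapy
from scrapy_splash import SplashRequest

class CaseSpider(scrapy.Spider):
    name = "case_info"

    def start_requests(self):
        urls = ["https://example.com/caseInfo?caseNumber=123"]  # placeholder URLs
        for url in urls:
            # "wait" gives the page's JavaScript a moment to populate the DOM.
            yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # Splash hands back the rendered HTML, so ordinary selectors work here.
        yield {"case_text": response.css("td::text").getall()}  # placeholder selector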

Selenium versus BeautifulSoup for web scraping

I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize, but I saw that the website had a button that created content via JavaScript, so I decided to use Selenium.
Given that I can find elements and get their content using Selenium with methods like driver.find_element_by_xpath, what reason is there to use BeautifulSoup when I could just use Selenium for everything?
And in this particular case, I need to use Selenium to click on the JavaScript button so is it better to use Selenium to parse as well or should I use both Selenium and Beautiful Soup?
Before answering your question directly, it's worth saying as a starting point: if all you need to do is pull content from static HTML pages, you should probably use an HTTP library (like Requests or the built-in urllib.request) with lxml or BeautifulSoup, not Selenium (although Selenium will probably be adequate too). The advantages of not using Selenium needlessly:
Bandwidth, and time to run your script. Using Selenium means fetching all the resources that would normally be fetched when you visit a page in a browser - stylesheets, scripts, images, and so on. This is probably unnecessary.
Stability and ease of error recovery. Selenium can be a little fragile, in my experience - even with PhantomJS - and creating the architecture to kill a hung Selenium instance and create a new one is a little more irritating than setting up simple retry-on-exception logic when using requests.
Potentially, CPU and memory usage - depending upon the site you're crawling, and how many spider threads you're trying to run in parallel, it's conceivable that either DOM layout logic or JavaScript execution could get pretty expensive.
Note that a site requiring cookies to function isn't a reason to break out Selenium - you can easily create a URL-opening function that magically sets and sends cookies with HTTP requests using cookielib/cookiejar.
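For instance, a minimal sketch with requests, whose Session object keeps and resends cookies automatically (it wraps http.cookiejar under the hood); the login URL and form fields are placeholders:

import requests

session = requests.Session()
# Placeholder login endpoint and form fields; the session stores any cookies set here.
session.post("https://example.com/login", data={"username": "u", "password": "p"})
# Subsequent requests on the same session send those cookies back automatically.
page = session.get("https://example.com/members-only")
print(page.status_code)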
Okay, so why might you consider using Selenium? Pretty much entirely to handle the case where the content you want to crawl is being added to the page via JavaScript, rather than baked into the HTML. Even then, you might be able to get the data you want without breaking out the heavy machinery. Usually one of these scenarios applies:
JavaScript served with the page has the content already baked into it. The JavaScript is just there to do the templating or other DOM manipulation that puts the content into the page. In this case, you might want to see if there's an easy way to pull the content you're interested in straight out of the JavaScript using a regex (a sketch of this follows the next point).
The JavaScript is hitting a web API to load content. In this case, consider if you can identify the relevant API URLs and just hit them yourself; this may be much simpler and more direct than actually running the JavaScript and scraping content off the web page.
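As a sketch of the first scenario, suppose the page ships its data inside an inline script tag as a JSON assignment; a regex plus json.loads then gets it without running any JavaScript. The variable name window.__DATA__ and the URL are assumptions for illustration:

import json
import re

import requests

html = requests.get("https://example.com/page").text  # placeholder URL
# Look for something like: window.__DATA__ = {...}; inside an inline <script>.
# Naive pattern; real pages may need a more careful extraction.
match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(sorted(data.keys()))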
If you do decide your situation merits using Selenium, use it in headless mode, which is supported by (at least) the Firefox and Chrome drivers. Web spidering doesn't ordinarily require actually graphically rendering the page, or using any browser-specific quirks or features, so a headless browser - with its lower CPU and memory cost and fewer moving parts to crash or hang - is ideal.
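For example, a minimal headless Chrome sketch with the Selenium 4 API (assumes a matching chromedriver is available); the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # older Chrome versions use plain "--headless"
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    print(driver.title)
finally:
    driver.quit()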
I would recommend using Selenium for things such as interacting with web pages, whether in a full-blown browser or a browser in headless mode, such as headless Chrome. I would also say that Beautiful Soup is better for observing and writing statements that rely on whether an element is found, or what is found, and then using Selenium to execute interactive tasks with the page if the user desires.
I used Selenium for web scraping, but it is not a happy solution. In my last project I used https://github.com/chromedp/chromedp . It is a simpler solution than Selenium.

Pros and cons of fully generating html using javascript

Recently, I came up with an idea for improving the overall performance of a web application: instead of generating a ready-to-show HTML page on the web server, why not let it be fully generated on the client side?
Doing it this way, only pure data (in my case, data in JSON format) needs to be sent to the browser.
This offloads the work of HTML generation from the server to the client's browser and reduces the size of the response sent back to users.
After some research, I found that this framework (http://beebole-apps.com/pure/) does the same thing as my idea.
What I want to know is the pros and cons of doing it this way.
It is surely faster and better for the web server, and with modern browsers JavaScript runs fast enough that page generation can be done quickly.
The likely downside of this method is SEO.
I am not sure whether search engines like Google will index my website appropriately.
Could you tell me what the downsides of this method are?
PS: I have also attached sample code to help describe the method, as follows:
In the head, simple JavaScript code is written like this:
<script type="text/javascript" src="html_generator.js"></script>
<script>
// Build the page from the embedded JSON once the document has loaded.
window.onload = function onPageLoad() {
    htmlGenerate($('#data').val());
};
</script>
In the body, only one element exists, used merely as a data container, as follows:
<input type='hidden' id='data' value='{"a": 1, "b": 2, "c": 3}'/>
When the browser renders the file, the htmlGenerate function defined in html_generator.js is called and the whole HTML page is generated by that function. Note that html_generator.js can be large, since it contains a lot of HTML templates, but because it can be cached in the browser, it will be downloaded once and only once.
Downsides
Search Engines will not be able to index your page. If they do, you're very lucky.
Users with JavaScript disabled, or on mobile devices, will very likely not be able to use it.
The speed advantages might turn out to be minimal, especially if the user's using a slow JavaScript engine like in IE.
Maintainability: Unless you automate the generation of your javascript, it's going to be a nightmare!
All in all
This method is not recommended if you're doing it only for a speed increase. However, it is often done in web applications where users stay on the same page, but then you would probably be better off using one of the existing frameworks for this, such as Backbone.js, and making sure it remains crawlable by following Google's hashbang guidelines or HTML5 PushState (as suggested by @rohk).
If it's just performance you're looking for, and your app doesn't strictly need to work like this, don't do it. But if you do go this way, then make sure you do it in the standardized way so that it remains fast and indexable.
Client-side templating is often used in Single Page Applications.
You have to weigh the pros and cons:
Pros:
More responsive interface
Load reduced on your web server
Cons:
SEO will be harder than for a classic website (unless you use HTML5 PushState)
Accessibility: you are relying heavily on JavaScript...
If you are going this way I recommend that you look at Backbone.js.
You can find tutorials and examples here : http://www.quora.com/What-are-some-good-resources-for-Backbone-js
Examples of SPAs:
Document Cloud
Grooveshark
Gmail
Also look at this answer : https://stackoverflow.com/a/8372246/467003
Yikes.
jQuery templates are closer to what you are thinking. Sanity says you should have some HTML templates that you can easily modify at will. You want MAINTAINABILITY, not string concatenations that keep you tossing and turning in your worst nightmares.
You can continue with this madness, but you will end up learning the hard way. I implore you to first try some prototypes using jQuery templates alongside your pure-code way. Sanity will surely overcome you, my friend; I say that having tried your way for a time. :)
It's also possible to load the template you need dynamically using AJAX. It doesn't make sense to take a panacea approach where every single conceivable template has to be downloaded in a single request.
The pros? I really can't think of any. You say that it will be easier on the web server, but serving HTML is what web servers are designed to do.
The cons? It goes against just about every best practise when it comes to building websites:
search engines will not be able to index your site well, if at all
reduced maintainability
no matter how fast JS engines are, the DOM is slow, and will never be as fast as building HTML on the server
one JS error and your entire site doesn't render. Oops. Browsers however are extremely tolerant of HTML errors.
Ultimately HTML is a great tool for displaying content. JavaScript is a great way of adding interaction. Mixing them up like you suggest is just nuts and will only cause problems.
Cons
SEO
Excluding users without JavaScript
Pros
Faster workflow / more fluid interface
Small reduction in server load
If SEO is not important to you and most of your users have JavaScript, you could create a Single Page Application. It will not reduce your load very much, but it can create a faster workflow on the page.
Btw: I would not pack all templates into one file. It would be too big for big projects.

Webscraping a javascript based website

There are many tools that scrape HTML pages with JavaScript off, but are there any that will scrape with JavaScript on, including pressing buttons that are JavaScript callbacks?
I'm currently trying to scrape a site that is solely navigated through JavaScript calls. All the buttons that lead to the content execute JavaScript without an href in sight. I could reverse engineer the JavaScript calls (which do, in part, return HTML), but that is going to take some time. Are there any shortcuts?
I use HtmlUnit, generally wrapped in a Java-based scripting language like JRuby. HtmlUnit is fantastic because its JavaScript engine handles all of the dynamic functionality, including AJAX, behind the scenes. This makes it very easy to scrape.
Have you tried using scRubyIt? I'm not 100% sure, but I think I used it to scrape some dynamic web sites.
It has some useful methods like
click_link_and_wait 'Get results', 5
Win32::IE::Mechanize
You could use Watij if you're into Java (and want to automate Internet Explorer). Alternatively, you can use WebDriver and automate Firefox instead. WebDriver has a Python API too.
At the end of the day, websites that do not use Flash or other embedded plugins will need to make HTTP requests from the browser to the server. Most, if not all, of those requests will have patterns within their URIs. Use Firebug/LiveHTTPHeaders to capture all the requests, which in turn will let you see what data comes back. From there, you can build ways to grab the data you want.
That is, of course, unless they are using some crappy form of obfuscation/encryption to slow you down.
