I am relatively new to scraping and am trying to scrape this site (and many, many like it): http://www.superiorcourt.maricopa.gov/docket/CriminalCourtCases/caseInfo.asp?caseNumber=CR1999-012267
I'm using Python and Scrapy. My problem is that when I start up a Scrapy shell and point it at this URL, the response body is full of code I can't read, e.g.:
c%*u9u\\'! (vy!}vyO"9u#$"v/!!!"yJZ*9u!##v/!"*!%y\\_9u\\')"v/\\'!#myJOu9u$)}vy}vy9CCVe^SdY_^uvkT_Se]U^dKju"&#$)\\')&vMK9u)}&vy}MKju!\\'$#)(# (!#vMuvmy\\:*Ve^SdY_^uCy\\y
The information I actually want to scrape does not appear to be accessible.
I believe this is a JavaScript problem, and I have confirmed that tools others have suggested before, like Selenium, render the page correctly. My problem is that I will need to scrape several million of these pages, and I don't believe a browser-based solution will be fast enough.
Is there a better way to do this? I do not need to click any links on the page (I have a long list of all the URLs I want to scrape) or interact with it in any other way. Is it possible that the response body contains JSON I could parse?
If you just want to wait for the JavaScript data to load, I'd use ScrapyJS (now distributed as scrapy-splash).
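As a rough sketch of what that looks like (this assumes a Splash instance running on localhost:8050 and the scrapy-splash package installed; the spider name and selector are made up):

# settings.py (sketch)
SPLASH_URL = "http://localhost:8050"
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# spider (sketch)
import scrapy
from scrapy_splash import SplashRequest

class CaseSpider(scrapy.Spider):
    name = "cases"

    def start_requests(self):
        yield SplashRequest(
            "http://www.superiorcourt.maricopa.gov/docket/CriminalCourtCases/caseInfo.asp?caseNumber=CR1999-012267",
            self.parse,
            args={"wait": 2},  # give the page's javascript a couple of seconds to run
        )

    def parse(self, response):
        # response.text is now the rendered HTML, so ordinary selectors work
        yield {"title": response.css("title::text").get()}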
If you need to interact with JavaScript elements on the website, consider Scrapy + Selenium + PhantomJS. The latter setup is usually the more popular choice because it's easier to learn and can do more, but it's slower.
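If you go the Selenium route, a bare-bones fetch looks roughly like this (older Selenium releases expose a PhantomJS driver; newer ones have dropped it in favour of headless Chrome/Firefox, so treat this as a sketch):

from selenium import webdriver

driver = webdriver.PhantomJS()  # or webdriver.Firefox() / headless Chrome on newer Selenium versions
driver.get("http://www.superiorcourt.maricopa.gov/docket/CriminalCourtCases/caseInfo.asp?caseNumber=CR1999-012267")
html = driver.page_source  # rendered HTML after the page's javascript has run
driver.quit()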
Related
Please do note that I am a novice when it comes to web technologies.
I have to crawl and scrape quite a few websites that are built with a mix of React / JavaScript / HTML. In all, these sites have approximately 0.1 to 0.5 million pages.
I am planning to use Selenium and Scrapy to get the crawling and scraping done. Scrapy alone cannot scrape React pages, and using Selenium to scrape regular JavaScript/HTML pages can prove very time-consuming.
I want to know if there is any way my crawler/scraper can tell a React page apart from a plain JavaScript/HTML page.
Awaiting response.
Not sure if this has come too late, but I'll add my two cents on this issue nonetheless.
I have to crawl and scrape quite a few websites that are built with a mix of React / javascript / html techs.
Correct me if I'm wrong, but I believe what you meant is that some webpages of those sites contain the data of interest (the data to be scraped) already in the HTML, without involving JS. Hence, you wish to separate the webpages that need JS rendering from those that don't, to improve scraping efficiency.
To answer your question directly: there is no smart system a crawler can use to differentiate between those two types of webpages without rendering each page at least once, unless the URLs follow a pattern that lets you easily tell the pages that need JS rendering apart from the pages that only require plain HTML crawling.
If there is no such pattern, you can try to render the page at least once and write conditional code around the response. What I mean is: first crawl the target URL with ordinary Scrapy (no JS rendering), and if the response comes back incomplete (assuming the missing data is not just down to erroneous element-selection code), crawl it a second time with a JS renderer.
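A minimal sketch of that fallback, assuming the scrapy-splash plugin is configured and using a made-up selector for the data you care about:

import scrapy
from scrapy_splash import SplashRequest  # assumes scrapy-splash is installed and enabled in settings

class FallbackSpider(scrapy.Spider):
    name = "fallback"
    start_urls = ["https://example.com/page/1"]  # placeholder URLs

    def parse(self, response):
        rows = response.css("div.case-info")  # hypothetical selector for the data of interest
        if rows:
            # The plain HTML response already contains the data: no JS rendering needed.
            yield {"rows": rows.getall()}
        else:
            # Data is missing, so fetch the same URL again through the JS renderer.
            yield SplashRequest(response.url, self.parse_rendered,
                                args={"wait": 2}, dont_filter=True)

    def parse_rendered(self, response):
        yield {"rows": response.css("div.case-info").getall()}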
This brings me to my second point: if the webpages have no fixed URL pattern at all, you can also simply use a faster and more lightweight JS renderer for everything.
Selenium indeed has relatively high overhead for mass crawling (up to 0.5 million pages in your case), since it was not built for that in the first place. You can check out Pyppeteer, an unofficial Python port of Google's Node.js library Puppeteer, which is easier to integrate with Scrapy.
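A rough Pyppeteer sketch (assumes pip install pyppeteer; the URL is a placeholder) looks like this:

import asyncio
from pyppeteer import launch

async def fetch_rendered(url):
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url, waitUntil="networkidle0")  # wait for network activity to settle
    html = await page.content()  # the fully rendered DOM as HTML
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(fetch_rendered("https://example.com"))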
You can read up on the pros and cons of Puppeteer versus Selenium to better calibrate it to your use case. One major limitation is that Puppeteer only supports Chrome for now.
I have a webpage that I am working on for a personal project to learn about javascript and web development. The page consists of a menu bar and some content. My idea is, when a link is clicked, to change the content of the page via AJAX, such that I:
fetch some new content from a different page,
swap out the old content with the new,
animate the content change with some pretty javascript visual effects.
I figure this is a little more efficient than getting a whole document with a standard HTTP GET request, since the browser won't have to fetch the style sheets and scripts sourced in the <head> tag of the document. I should also add that I am fetching content solely from documents served by a web app I created myself, whose content I am fully aware of.
Anyway, I came across this answer on an SO question recently, and it got me wondering about the ideal way to engineer a solution that fits the requirements I have given for the web page. The way I see it, there are two solutions, neither of which seem ideal:
Configure the back-end (server) from which I am fetching such that it will return content and not the entire page if asked for only content, and then load that content in with AJAX (my current solution), or
Get the entire document with AJAX and then use a script to extract the content and load it into the page.
It seems to me that neither solution is quite right. For 1, it seems that I am splitting logic across two different places: the server has to be configured to serve content when asked for content, and the javascript has to know how to ask for content. For 2, this seems an inappropriate use of AJAX (according to the previously mentioned SO answer), given that I am asking for a whole page and then parsing it, when AJAX is meant to fetch small bits and pieces of information rather than whole documents.
So I am asking: which of these two solutions is better from an engineering perspective? Is there another solution which would be better than either of these two options?
animate the content change with some pretty javascript visual effects.
Please don't. Anyway, you seem to be looking for a JS MVC framework like Knockout.
Using such a framework, you can let the server return models, represented in JSON or XML, which a little piece of JS transforms into HTML, using various ways of templating and annotations.
So instead of doing the model-to-HTML translation server-side and sending a chunk of HTML to the browser, you just return a list of business objects (say, Addresses) in a form the browser (or rather the JS) understands and let Knockout bind them to a grid view, input elements and so on.
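The server half of that pattern is framework-agnostic; a hypothetical sketch of the JSON endpoint in Flask (chosen purely for illustration) might look like this, with the Knockout binding living in the page's own JS:

from flask import Flask, jsonify  # Flask is only an example; any backend can return JSON

app = Flask(__name__)

ADDRESSES = [  # hypothetical business objects
    {"street": "1 Main St", "city": "Phoenix", "zip": "85001"},
    {"street": "2 Oak Ave", "city": "Tempe", "zip": "85281"},
]

@app.route("/addresses")
def addresses():
    # Return the raw models; the client-side JS (e.g. Knockout) turns them into HTML.
    return jsonify(ADDRESSES)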
I'm going to be accessing a number of accounts on Amazon's KDP - http://kdp.amazon.com/
My task is to log in to each account and check the account's earnings. Mechanize works great for logging in and dealing with the cookies and such, but the page that displays the account earnings uses javascript to populate itself dynamically.
I did a little digging and found that the javascript sends out the following request:
https://kdp.amazon.com/self-publishing/reports/transactionSummary?_=1326419839161&marketplaceID=ATVPDKIKX0DER
Along with a cookie which contains a session ID, a token, and some random stuff. Every time I click a link to display the results, the numerical part of the above GET url is different, even if it's the same link.
In response to the request, the browser then receives this (I cut out a bunch of it so it doesn't take up the whole page):
{"iTotalDisplayRecords":13,"iTotalRecords":13,"aaData":[["12/03/2011","<span
title=\"Booktitle\">Hold That ...<\/span>","<span
title=\"Author\">Amy
....
<\/span>","B004PGMHEM","1","1","0","70%","4.47","0.06","4.47","0.01","0.00",""],["","","","","","","","","","","","","<div
class='grandtotal'>Total: $ 39.53<\/div>","Junk"]]}
I think I can use Mechanize's cookie jar to extract the cookies that are part of that request, but how do I figure out what that number is and how it's generated? The javascript files in the page's source seem cryptic at the best of times. Here's one of them:
http://kdp.amazon.com/DTPUIFramework/js/all-signin-thin.js
Is there a way to track down which javascript runs "behind the scenes", so to speak, after I click something on the page, so that I can emulate that request with Mechanize?
Thanks.
PS: I can't (or rather, don't want to) use Watir for this task, because in theory I might be handling more than just a handful of accounts, so this has to be pretty snappy.
It's just a timestamp and it's only used for cache busting. Try this:
(Time.now.to_f * 1000).to_i.to_s # milliseconds since the epoch, matching the 13-digit value in the URL
Mechanize doesn't run JavaScript that is embedded in the page. It only retrieves the HTML.
If the page contains JavaScript, Mechanize can see it and you can use Nokogiri, which Mechanize uses internally, to retrieve the <script> tags' content. But, anything that would be loaded as a result of the JavaScript being executed in a browser will not run in Mechanize. Watir is the solution for that, because it drives the browser itself, which will interpret and run the JavaScript in the page.
You can step through the pages in a browser and use Firebug to look at the source and get an idea of what is running. From that information you can work out what the JavaScript is doing, and then use Mechanize and Nokogiri to extract the information you need to build your next URLs, but it can be a lot of work.
What you're asking is similar to many other questions regarding Mechanize and JavaScript. I'd recommend looking at these SO links for alternative ideas:
Mechanize and JavaScript
Ruby Mechanize not returning Javascript built page correctly
Or search Stack Overflow for questions about Ruby, JavaScript and Mechanize.
As mentioned in a previous question, I'm coding a crawler for the QuakeLive website.
I've been using WWW::Mechanize to get the web content, and this worked fine for all the pages except the one with matches. The problem is that I need to get all the IDs of this kind:
<div id="ffa_c14065c8-d433-11df-a920-001a6433f796_50498929" class="areaMapC">
These are used to build the URLs of specific matches, but I simply can't get at them.
I managed to see those IDs only via Firebug, and no page downloader, parser, or getter I tried was able to help here. All I can get is a simpler version of the page, whose code is what you see via "View Source" in Firefox.
Since FireBug shows the IDs I can safely assume they are already loaded, but then I can't understand why nothing else gets them. It might have something to do with JavaScript.
You can find a page example HERE
To get at the DOM containing those IDs, you'll probably have to execute the javascript code on that site. I'm not aware of any libraries that would let you do that and then introspect the resulting DOM from within Perl, so controlling an actual browser and later asking it for the DOM, or only parts of it, seems like a good way to go about this.
Various browsers provide ways to be controlled programmatically. With a Mozilla-based browser such as Firefox, this could be as easy as loading mozrepl into the browser, opening a socket from Perl space, sending a few lines of javascript code over to actually load that page, and then some more javascript code that gives you back the parts of the DOM you're interested in. The result of that you could then parse with one of the many JSON modules on CPAN.
Alternatively, you could work through the javascript code executed on your page and figure out what it actually does, to then mimic that in your crawler.
The problem is that mechanize mimics the networking layer of the browser but not the rendering or javascript execution layer.
Many folks use the web browser control provided by Microsoft. This is a full instance of IE in a control that you can host in a WinForm, WPF or plain old Console app. It allows you to, among other things, load the web page and run javascript as well as send and receive javascript commands.
Here's a reasonable intro into how to host a browser control: http://www.switchonthecode.com/tutorials/csharp-snippet-tutorial-the-web-browser-control
A ton of data is sent over ajax requests. You need to account for that in your crawler somehow.
It looks like they are using AJAX; I can see where the requests are being sent using Firebug. You may need to pick up on this by parsing and executing the javascript that affects the DOM.
You should be able to use WWW::HtmlUnit - it loads and executes javascript.
Read the FAQ. WWW::Mechanize doesn't do javascript. They're probably using javascript to change the page. You'll need a different approach.
There are many tools that scrape HTML pages with javascript off, but are there any that will scrape with javascript on, including pressing buttons that are javascript callbacks?
I'm currently trying to scrape a site that is solely navigated through javascript calls. All the buttons that lead to the content execute javascript without an href in sight. I could reverse-engineer the javascript calls (which do, in part, return HTML), but that is going to take some time. Are there any shortcuts?
I use HtmlUnit, generally wrapped in a JVM-based scripting language like JRuby. HtmlUnit is fantastic because its JavaScript engine handles all of the dynamic functionality, including AJAX, behind the scenes. It makes scraping very easy.
Have you tried using scRubyIt? I'm not 100% sure, but I think I used it to scrape some dynamic web sites.
It has some useful methods like
click_link_and_wait 'Get results', 5
Win32::IE::Mechanize
You could use Watij if you're into Java (and want to automate Internet Explorer). Alternatively, you can use WebDriver, which can also automate Firefox. WebDriver has a Python API too.
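For example, with the Python bindings, a button whose only purpose is a javascript callback can be clicked directly (the URL and link text here are hypothetical):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://example.com")  # placeholder URL
driver.find_element(By.LINK_TEXT, "Next").click()  # fires the javascript handler behind the "link"
html = driver.page_source  # DOM after the callback has updated the page
driver.quit()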
At the end of the day, websites that do not use Flash or other embedded plugins still have to make HTTP requests from the browser to the server. Most, if not all, of those requests will have patterns within their URIs. Use Firebug/LiveHTTPHeaders to capture all the requests, which in turn will let you see what data comes back. From there, you can build ways to grab the data you want.
That is, of course, assuming they are not using some crappy form of obfuscation/encryption to slow you down.
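As a rough illustration of replaying one of those captured requests in Python (the endpoint, parameter, and cookie here are made up; substitute whatever Firebug/LiveHTTPHeaders shows you):

import requests

session = requests.Session()
session.headers.update({"X-Requested-With": "XMLHttpRequest"})  # many AJAX endpoints expect this header
session.cookies.set("session-id", "REPLACE_WITH_CAPTURED_VALUE")

resp = session.get("https://example.com/ajax/results", params={"page": 1})  # hypothetical endpoint
data = resp.json()  # many such endpoints return JSON you can use directly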