Best Open Source Spider for Site Coverage - javascript

I am interested in crawling a lot of websites. The most important consideration is that the spider is able to reach as much as the site as possible. One key feature that is lacking from most spiders is the ability to execute JavaScript. This is required in order to crawl ajax powered sites. I really like Open Source and I will need to modify the code for my project.
Currently I think that Solr, which is apart of Lucine is a very good solution.
http://lucene.apache.org/solr/features.html
Has anyone used Solr or Lucine? My biggest problem with Solr can not execute javascript, however its has a rich feature set and is scalability both of which makes Solr attractive.

Solr is not a crawler, but a search engine (searches over an index to return results).
That said, I really like heritrix for its flexibility. Most crawlers won't execute Javascript (but some, as Heritrix, will try to extract links from it) as that doesn't make much sense, even today. The thing is that Heritrix will allow you to plug in your own classes to do whatever you wish with the crawled data.

Try HTMLUnit. http://htmlunit.sourceforge.net/

Solr is a search engine built on the top of Lucene. It is not doing anything with crawling. Take a look at Apache Nutch. Cracking javascript might be a problem, as they are often inteded to lead the crawler to the dead-end.

watir might be useful for you.

With pages that create the dom based on javascript templating you'll really want full javascript execution in your spider. Take a look at https://github.com/mikeal/spider for Node JS.

Related

Pros and cons of fully generating html using javascript

Recently, I have come up with some idea about how to improve the overall performance for
a web application in that instead of generating a ready-to-show html page from the web server, why not let it be fully generated in the client side.
Doing it this way, only pure data (in my case is data in JSON format) needs to be sent to the browser.
This will offload the work of html generation from the server to the client's browser and
reduce the size of the response packet sent back to users.
After some research, I have found that this framework (http://beebole-apps.com/pure/)
does the same thing as mine.
What I want to know is the pros and cons of doing it this way.
It is surely faster and better for web servers and with modern browser, Javascript code can run fast so page generation can be done fast.
What should be a downside for this method is probably for SEO.
I am not sure if search engines like Google will appropriately index my website.
Could you tell me what the downside for this method is?
Ps: I also attached sample code to help describe the method as follows:
In the head, simple javascript code will be written like this:
<script type='javascript' src='html_generator.js'/>
<script>
function onPageLoad(){
htmlGenerate($('#data').val());
}
</script>
In the body, only one element exist, used merely as a data container as follows:
<input type='hidden' id='data' value='{"a": 1, "b": 2, "c": 3}'/>
When the browser renders the file, htmlGenerate function which is in html_generator.js will be called and the whole html page will be generated in this function. Note that the html_generator.js file can be large since it contains a lot of html templates but since it can be cached in the browser, it will be downloaded once and only once.
Downsides
Search Engines will not be able to index your page. If they do, you're very lucky.
Users with JavaScript disabled, or on mobile devices, will very likely not be able to use it.
The speed advantages might turn out to be minimal, especially if the user's using a slow JavaScript engine like in IE.
Maintainability: Unless you automate the generation of your javascript, it's going to be a nightmare!
All in all
This method is not recommended if you're doing it only for speed increase. However, it is often done in web applications, where users stay on the same page, but then you would probably be better off using one of the existing frameworks for this, such as backbone.js, and make sure it remains crawlable by following Google's hashbang guidelines or HTML5 PushState (as suggested by #rohk).
If it's just performance you're looking for, and your app doesn't strictly need to work like this, don't do it. But if you do go this way, then make sure you do it in the standardized way so that it remains fast and indexable.
Client-side templating is often used in Single Page Applications.
You have to weight the pros and cons:
Pros :
More responsive interface
Load reduced on your web server
Cons :
SEO will be harder than for a classic website (unless you use HTML5 PushState)
Accessibility : you are relying heavily on javascript...
If you are going this way I recommend that you look at Backbone.js.
You can find tutorials and examples here : http://www.quora.com/What-are-some-good-resources-for-Backbone-js
Examples of SPA :
Document Cloud
Grooveshark
Gmail
Also look at this answer : https://stackoverflow.com/a/8372246/467003
Yikes.
jQuery templates are closer to what you are thinking. Sanity says you should have some HTML templates that you can easily modify at will. You want MAINTAINABILITY not string concatenations that keep you tossing and turning in your worst nightmares.
You can continue with this madness but you will end up learning the hard way. I employ you to first try some protypes using jQuery templates and your pure code way. Sanity will surely overcome you my friend and I say that coming from trying your way for a time. :)
Its possible to load content in dynamically of the template you need as well using AJAX. It doesn't make sense that you will have a panacea approach where every single conceivable template is needed to be downloaded in a single request.
The pros? I really can't think of any. You say that it will be easier in the webserver, but serving HTML is what web servers are designed to do.
The cons? It goes against just about every best practise when it comes to building websites:
search engines will not be able to index your site well, if at all
reduced maintainability
no matter how fast JS engines are, the DOM is slow, and will never be as fast as building HTML on the server
one JS error and your entire site doesn't render. Oops. Browsers however are extremely tolerant of HTML errors.
Ultimately HTML is a great tool for displaying content. JavaScript is a great way of adding interaction. Mixing them up like you suggest is just nuts and will only cause problems.
Cons
SEO
Excluding user without javascript
Pros
Faster workflow / Fluider interface
Small load reduce
If SEO is not important for you and most of your user have Javascript you could create a Single Page Applications. It will not reduce your load very much but it can create a faster workflow on the page.
Btw: I would not pack all templates in one file. It would be too big for big projects.

Writing a single page client-side only webpage

I'm considering to try out an idea, mostly for fun, and my question is if this is reasonable and if there are any libraries or frameworks that could make this experiment a little easier.
So, the idea: Basically it is to write a new UI for a website I've developed, but doing it with client-side code only. I can read/write data using ajax, since my existing website has an API that allows me to perform all kinds of queries. This allows me to use JavaScript for the whole thing and theoretically put all of the code in a single file.
Obviously there are limitations to circumvent; bookmarking, page refreshing, the back-button etc. But these limitations are what makes it interesting, right? :) I'm not so worried about search engine indexing, since one has to be logged in to use the site anyway.
The site itself is not overly complex, but it is not simple either. There are four different levels of users, multiple languages and quite a lot of data to be presented.
Is this a bad idea? If so, why would you advice against it? And do you know of any JavaScript frameworks or libraries that could make this easier? (And no, I'm not looking for an abstraction like Google Web Toolkit; I would like something purely JavaScript)
One of my coworkers did this. A nice feature of this concept was that you don't have a ton of POSTS whenever the user 'changes pages', since they are actually not ever changing a page until they submit their data for the final time. He did this for product registration software, which was nice. Our servers were only taking a hit when the user initially requested the page, and then when they submitted it.
The major, MAJOR downside to this concept is that most web developers are not expecting this. My coworker (and you) have a cool idea - but unless it is well implemented, with comments, 100% valid HTML, and a host of other good design principles in place, it can be confusing since most web developers have basically never seen this done before. His site was a nightmare to work with, as he did not actually know what engineering web software meant, and it was all slapped together. My organization never pursued this (potentially useful) idea because his implementation was so poor.
So, when I looked at this idea here were the trade-off I came up with:
1.) You cannot require any server-side interaction during intermediate pages.
2.) The initial page loading is longer, but there are no intermediate page requests (better optimization).
3.) This is vastly different than anything anyone usually does, which means you need to be especially careful with documentation.
4.) This design concept facilitates totally stand-alone web software to be easily deployed without the web.
5.) You might be increasing complexity for avoiding page loads, but maybe not. I'm not sure.
All together, I think it just depends on what you want to accomplish. My coworker really just wanted to see if he could do it, which he could. However, how he did it was really pretty bad, to the point where everyone else connected his poor implementation with a poor idea it was fairly sad.
Mostly, I think if you follow good web design practices this wouldn't be too bad of a thing to pursue. What are your goals though?
I'm sorry I couldn't directly answer all of your questions. I hope my experiences are still helpful in answering if I think it is a bad idea or not.
-Brian J. Stinar-
SproutCore is one of the best out there when it comes to really being built from the ground up for single-page, potentially complex applications. Unlike some others like GWT or Cappucino, SproutCore is really about using JavaScript directly. They aren't the only one though. You might also want to look at JavaScriptMVC and qooxdoo.
Personally, I've built an extremely large and complex single-page application using JavaScript. It is currently around 100,000 lines (including comments/whitespace). To give a sense of scale, jQuery is around 6,000. To reach that size, I built my own framework, build tools etc. and it is extremely maintainable. You're asking if it can work, and I can tell you that it does, but you do need a little infrastructure if you're looking at doing something big. (BTW, there's a lot of lazy loading - its not 100,000 lines all at once!)
I also wouldn't really recommend this method for anything other than a web application. As pointed out, its still difficult for SEO, and may be strange for some users.
You're looking for ExtJS
I did my site like that in jquery, backed by spring mvc 3. For me it works fine and you have a lightweight, flashy feling while on it.
If you have time to dig in javascript seriously... it is really good exercise.
The only thing I really need to add is SEO, since spiders see only intial html. If it is important for you, i think backend static files and sitemap would be good enough solution.
I found something very light-weight that provides the basic needed functionally without forcing you into a whole framework. It's called Sammy and is inspired by Ruby's Sinatra: http://code.quirkey.com/sammy/
Very recommended!
Alternative: code the same in Java. Take a look to ItsNat
Think in JavaScript but code the same in Java in server with the same DOM APIs, in server is way easier to manage your application without custom client/bridges because UI and data are together.
Regarding SEO, bookmarking etc in single page there are solutions, take a look to the Single Page Interface Manifesto

Execute javascript on IIS server

I have the following situation. A customer uses JavaScript with jQuery to create a complex website. We would like to use JavaScript and jQuery on the server (IIS) for the following reasons:
Skills transfer - we would like to use JavaScript and jQuery on the server and not have to use eg VB Script. / classic asp. .Net framework/Java etc is ruled out because of this.
Improved options for search/accessibility. We would like to be able to use jQuery as a templating system, but this isn't viable for search engines and users with js turned off - unless we can selectively run this code on the server.
There is significant investment in IIS and Windows Server, so changing that is not an option.
I know you can run jScript on IIS using windows Script host, but am unsure of the scalability and the process surrounding this. I am also unsure whether this would have access to the DOM.
Here is a diagram that hopefully explains the situation. I was wondering if anyone has done anything similar?
EDIT: I am not looking for critic on web architecture, I am simply wanting to know if there are any options for manipulating the DOM of a page before it is sent to the client, using javascript. Jaxer is one such product (no IIS) Thanks.
Have a look at bringing the browser to the server, Rhino, and Use Microsoft's IIS as a Java servlet engine.
The first link is from John Resig's (jQuery's creator) blog.
Update August 2 2011
Node.js is coming to Windows.
The idea to reuse client JS on the server may sound tempting, but I am not sure that jQuery itself would be ready to run in server environment.
You will need to define global context for jQuery somehow by initializing window, document, self, location, etc.. I am not sure it is doable.
Besides, as Cheeso has mentioned, Active Server Pages is a very outdated technology, it was replaced with ASP.Net by Microsoft in the beginning of the century. I used to maintain a legacy system using ASP 3.0 for more than a year and that was pain. The most wonderful pastime was debugging: you will hardly find anything for the purpose today and will have to decript beautiful errors like in IIS log:
error '800a9c68'
Application-defined or object-defined error
Nevertheless, I can confirm that I managed to reuse client and server JScript. But this was code written by me who knew that it was going to be used on the server.
P.S. I would not recommend move that way. There are plenty templating frameworks which are familiar to those who write HTML and JavaScript.
JScript runs on IIS via something called ASP.
Active Server Pages.
It was first available in 1996.
Eventually ASP.NET was introduced as a successor. But ASP is still supported.
There is no DOM for the HTML page, though.
You might need to reconsider your architecture a bit.
I think the only viable solutions you're likely to find anywhere near ready to go involve putting IIS in front of Java. There are two browser-like environments I'm aware of coded for Java:
1) Env-js (see http://groups.google.com/group/envjs and http://github.com/thatcher/env-js )
I believe this one has contributions from jQuery's John Resig and was put together with jQuery testing/support in mind.
2) HTMLUnit (see http://htmlunit.sourceforge.net/ ) This one's older, and wasn't originally conceived around jQuery, but there are reports in the wild of using it to run jQuery's test suite successfully (http://daniel.gredler.net/2007/08/08/htmlunit-taming-jquery/ ).
If you want something pure-IIS/MS, I think your observation about windowsScript host and/or something like the semi-abandoned JScript.NET is probably about as close as you're going to come, along with a port (which you'll probably have to start) of something like Env-js or HTMLUnit.
Also, I don't know if you've seen the Wikipedia list of server-side JavaScript solutions:
http://en.wikipedia.org/wiki/Server-side_JavaScript
Finally... you could probably write a serviceable jQuery-like library in any language that already has some kind of DOM library and first-class functions (or, failing that an eval facility). See, for example pQuery for Perl (http://metacpan.org/pod/pQuery ). This would get you the benefits of the jQuery style of manipulating documents. Skill transfer is great and JavaScript has a wonderful confluence of very nice features, but on the other hand, having developers who care enough to learn multiple languages is also great, and js isn't the only nice language out there.
I think it's mainly a browser based script so probably you are better of using technologies based on VB or .NET to perform or generate HTML from templates. I'm sure there are because in the java world there are a few of these around (like velocity). You'd then use jQuery to create or add client side functionality and usability so it makes the website more usable than it would have been.
What exactly do you mean by
"A customer uses JavaScript with
jQuery to create a complex website"
Half the point of jQuery is to make it easy for the developer to manipulate the DOM, and therefore add interactive enhancements to a web site. By running the Javascript on the server and only rendering HTML you will lose the ability to add these enhancements, without doing a round trip to the server (think WebForms postback model...ugh).
Now if what you really mean is the customer uses a site builder based on jQuery, why not have that tool output flat HTML in the first place?
Take a look at this technology. You can invoke scripts to run at server, at client, or both. Plus, this really implements the firefox engine on the server. Take a look at it.
Aptana's Jaxer is the first AJAX web server so far. I have not tryed it yet, but I will. Looks promising and very powerful.

Introductory JavaScript programming task for an expert developer

What would be a good mini-project to get intimate with JavaScript, as an advanced 'introduction' to the language? I want to actually code an application in JS, not hook up bits of it to enhance a web application.
A lot of stuff you could learn by doing an RSS reader on a page. Google shows what can be done. The whole lection concentrates on javascript, network access, security restrictions and medium data mangeling.
If you have the ability to do any sort of backend programming than AJAX is really neat to do. You can get a lot of good effects with less efforts. A good thing to build on up.
I would argue that if you're really an advanced programmer then the exercises above would not really give you any insight into the language as they are just variations on things you probably have already done. Javascript's strongest suit is it's LISP style ability to grow. Write something AI(ish) that creates new functions. Most people don't utilize the language in this way, but, its ability to augment its own classes on the fly is, I would argue, it's most unusual and most powerful feature.
Although not a project, watch the Douglas Crockford videos at YUI theater.
The biggest web based Javascript projects are going to deal with the DOM. Do some nifty stuff with JQuery. Make a table with rows that highlight when you hover. Make them update themselves through AJAX and JSON when you click on them.
If you're really looking for something magical and usefull write a scrollable table with fixed headers and footers for IE8.
If you want to stay away from the WEB use the JDK 1.6 and run Javascript code in that. You could do TONS with that.
Whenever I'm trying to get familiar with a language, I will work on Project Euler problems with it.
I would implement a simple game like sokoban first.
The second application would be an AJAX-based multiuser chat application, first fetching other people's responses by polling, later with AJAX push.
Interesting question.
Really you could do any sort of application. In order to make sure you're using the latest and greatest stuff, I'd try making a simple CRUD style application using DHTML and AJAX. Perhaps a contacts list or calendar. If you're feeling really energetic, you could write the back-end in JavaScript as well.
Unless you want to get really friendly with the DOM and browser compatibility, I'd learn Javascript through the mask of one of the nice frameworks like Jquery or Prototype.
The Holy Grail - a WYSIWYG editor. They wouldn't need to complete it, but just seeing their plan of attack would be interesting. Plays right into patterns and OO.
I suggest you create a Google Gadget. You can create one for free and perhaps make something useful out of it. If you don't have a Google account, sign up for one. Then add the Google Gadget Editor to begin writing your code.
With the gadget, you'll be able to mess with JavaScript, JSON, CSS, etc. Furthermore, you'll be able to store the file on Google's server so you can work on it from any computer.
I created a simple RSS reader and wrote JavaScript to get the feed (using Google's API) then dealing with that JavaScript object because it came back as JSON. I then developed some JavaScript to hide/show div tags.
It was a good starter project for me to learn JavaScript.
Get JavaScript the Good Parts by Douglas Crockford. Also check out his web site: http://www.crockford.com
Key reason: just because JavaScript looks like C/C++/Java/C# doesn't mean it actually is like them. Things are significantly different. I suggest reading his book to get a grasp of those differences.
Otherwise, I would look at the JQuery web site. JavaScript is cool and all, but a good framework will save you from a lot of the pitfalls and make you much more productive faster.
try making an advanced AJAX application like for example try to recreate the google calander.
How about a firefox plugin to monitor StackOverflow? It could use RSS to monitor feeds and let you know when new questions are asked with your tags.
It could also be grown as your js skills progress.
Write yet another javascript framework, but focused specially in something, ie game programming.

Should I use Google Web Toolkit for my new webapp?

I would like to create a database backed interactive AJAX webapp which has a custom (specific kind of events, editing) calendaring system. This would involve quite a lot of JavaScript and AJAX, and I thought about Google Web Toolkit for the interface and Ruby on Rails for server side.
Is Google Web Toolkit reliable and good? What hidden risks might be if I choose Google Web Toolkit? Can one easily combine it with Ruby on Rails on server side? Or should I try to use directly a JavaScript library like jQuery?
I have no experience in web development except some HTML, but I am an experienced programmer (c++, java, c#), and I would like to use only free tools for this project.
RoR is actually one of the things the GWT is made to work well with, as long as you're using REST properly. It's in the Google Web Toolkit Applications book, and you can see a demo from the book using this kind of idea here. That's not to say that you won't have any problems, but I think the support is definitely out there for it.
There's a neat project for making RoR/GWT easy that you can find here (MIT license). I haven't had a chance to try it out yet, but it looks like a good amount of thought has been put into it. One catch is that it looks like it hasn't been fully tested with 2.1 Rails yet, just 2.0, so you may run into a few (probably minor and fixable) errors.
If you are looking to integrate GWT with non-Java backends such as ROR, PHP etc., you should bear in mind that GWT 1.5 now supports JavaScript Overlay types. This feature lets you write classes that can be mapped over the top of native JavaScript objects to easily provide accessor methods for properties of those objects and other extended functionality.
See this link for more details:
JavaScript Overlay Types
So you could return JSON encoded data from your backend via AJAX calls, parse it into a JavaScript Object and then access the data through your GWT Java code using the overlay classes you've created. Or when you render your page you can render static config data as JavaScript Objects and read it in via this mechanism, rather than having to do an AJAX call to grab the data.
If you know JAVA, and have somewhere you can host it (like a tomcat or glassfish container) I would recommend that much more than using Ruby for the back end. The main reason is that then you can share all of your objects, and use the built in RPC mechanism. I've done this for quite a lot of our projects and it's a huge timesaver, not to mention that the code is less error prone, because you don't convert your java objects to anything and then back again.
I have linked my GWT with Rails before, using the to_json function in Rails and then reading the JSON in GWT. It's all supported, but it is far more annoying than just doing the back end in JAVA.
Of course if you have cheap hosting, then Java containers are pretty much out of the question, in which case I would think Rails would be the next best thing.
GWT is very high quality with a great community. However you do need to know CSS if you want to adjust the look of things (you will) - CSS can do a lot of the layout, just like regular web if you want it to. Libraries like GWT-ext or ExtGWT can help a bit as they have stunning "out of the box" looks but for a price (extra size to your app).
You can code everything in Java using GWT, and you can integrate existing 3rd party javascript libraries with it. It's very good. I've never used RoR much though, so can't say anything about that.
If you're experienced in Java but not in Javascript/CSS, then GWT is going to be a lifesaver (unless you want to learn them, of course). CSS has so many little fiddly details. It is not uncommon to spend half a day fixing a 2 pixel misalignment that only occurs in IE6.
I am not sure about how easy it would be to use ROR for the back end... It is possible, I am sure, since GWT ajax communication is just servlets. But they provide some really nice functionality for passing Java objects back and forth which you won't be able to utilize if your server isn't also using Java.
I wrote about some of the disadvantages of GWT recently. Mainly, the disadvantages are: long deployment cycle for changes to some parts of the application and a rather steep learning curve. As a seasoned Java programmer, the second should be less of a problem and if you use a seperate backend, the first is also mitigated (as a complete redeploy is primarily required when you change the 'server' part of the application).
GWT is a wonderful framework with lots of potential. Keep in mind that it's still quite new, though. There are some unresolved bugs that can really annoy you, and they usually require ugly workarounds to get past. The community is great but you'll probably end up with a few problems sooner or later that Google can't answer yet.
But hey, I say go for it. The potential for GWT is awesome, and I bet it's future will be bright.
You should definitely use GWT for a new project (it's pretty easy to use in an old project too).
I my experience it's very fast to learn and use. The compiled javascript code is much better than anything you could ever write by hand and it works fast too.
Another benefit is the ability to debug you're code (which is hell with javascript alone)
This blog has inputs from many experienced users of GWT and have some great discussion points. I personally have huge experience with varied UI Frameworks. I will add my two cents. Lets look at fundamental advantages and disadvantages of GWT
Fundamental Advantage
GWT takes the web layer programming to JAVA. So, the obvious advantages of Java start getting into play. It will provide Object Oriented programming. It will also provide great debugging and compile time checks. Since it generates HTML and Javascript, it will also have ability to hide some complexity within its generator.
Fundamental Disadvantage
The disadvantage starts from the same statement. GWT takes the web layer programming to JAVA. If you know JAVA, probably you will never look out for an alternative language to write your business logic. It's self sufficient and great. But when it comes to writing configurations for a JAVA application. We use property files, database, XML etc. We never store configurations in a JAVA class file. Think hard, why is that?
This is because configuration is a static data. It often require hierarchy. It is supposed to be readable. It never requires compilation. It doesn't require knowledge of JAVA programming language. In short, it is a different ball game. Now the question is, how it relates to our discussion?
Now, lets think about a web page. Do you think when we write a web page we write a business logic? Absolutely not. Web page is just a configuration. It is a configuration of hierarchical containers and fields. We need to write business logic for the data that will be captured from and displayed on the web page and not to create the web page itself.
Previous paragraph makes a very very strong statement. This will explain why HTML and XML based web pages are still the most popular ones. XML is the best in business to write configurations. A framework must allow a clear separation of web page from business logic (the goal of MVC framework). By doing this a web designer will be able to apply his skills of visualization and artistry to create brilliant looking web pages just by configuring XMLs and without being bothered about the intricacies of a programming language. Developers will be able to use their best in business JAVA for writing business logic.
Finally, lets talk about the repercussions in direct terms. GWT breaks this principal so it is bound to fail. The cost for developing GWT application will be very high because you will need multiskill programmers to write web pages. The required look and feel will be very hard to achieve. The turn around time of modifying the web page will be very high because of unnecessary compilation. And lastly, since you are writing web pages in JAVA it is very easy to corrupt it with business logic. Unknowingly you will introduce complexities that must be avoided.
You could also consider Grails ("Groovy on Rails") which gives you the benefits of a Rails framework and the use of the Java VM.
Our team recently asked the same question, and we chose to go with GWT, especially since the designer plugin made working with GWT more accessible to non-java experts on the team. Whoever makes this choice, just beware you DON'T use the GWT Designer plugin !! It has not been updated (in at least a year, apparently) to create a GWT application that is compatible with IE8.
Our team had almost completed our application layouts, which were working perfectly in Chrome, FF and Safari. Then they blew up in IE. IE 7 would load partial pages (but not composite includes), and IE8 was not even able to load up the application. It just hung.
The designer plugin has buttons that allow the user to add CellTable widgets that are not IE compatible (CellTable, DeckPanel, Horizontal Panel, Vertical Panel, among others). These will cause intense pain when the layouts have to be re-done in java without assistance from the designer.
Experienced GWT users love it, but the designer plugin will kill you.

Categories