I would like to load a DOM using a document (in string form) or a URL, and then execute JavaScript functions (including jQuery selectors) against it. This would be entirely server-side and in-process, with no client or browser.
Basically I need to load the DOM and then use jQuery selectors and functions such as text() and val() to extract strings from it. I don't really need to manipulate the DOM.
I have looked at .NET JavaScript engines such as Jurassic and Jint, but neither supports loading a DOM, so they can't do what I need.
I would be willing to consider non-.NET solutions (node.js, Ruby, etc.) if they exist, but would really prefer .NET.
edit
The answer below is a good one, but currently I'm trying a different route: I'm attempting to port envjs to Jurassic. If I can get that working I think it will do what I want. Stay tuned...
The answer depends on what you are trying to do. If your goal is basically a complete web browser simulation, or a "headless browser," there are a number of solutions, but none of them (that I know of) exist cleanly in .NET. To mimic a browser, you need a javascript engine and a DOM. You've identified a few engines; I've found Jurassic to be both the most robust and fastest. The google chrome V8 engine is also very popular; the Neosis Javascript.NET project provides a .NET wrapper for it. It's not quite pure .NET since you have a non-.NET dependency, but it integrates cleanly and is not much trouble to use.
But as you've noted, you still need a DOM. In pure C# there is XBrowser, but it looks a bit stale. There are JavaScript-based representations of the entire browser DOM like jsdom, too. You could probably run jsdom in Jurassic, giving you a DOM simulation without a browser, all in C# (though likely very slowly!). It would definitely run just fine in V8. If you get outside the .NET realm, there are other better-supported solutions. This question discusses HtmlUnit. Then there's Selenium for automating actual web browsers.
Also, bear in mind that a lot of the work done around these tools is for testing. While that doesn't mean you couldn't use them for something else, they may not perform or integrate well for any kind of stable use in inline production code. If you are trying to basically do real-time HTML manipulation, then a solution mixing a lot of technologies that aren't widely used except for testing might be a poor choice.
If your need is actually HTML manipulation, and it doesn't really need to use Javascript but you are thinking more about the wealth of such tools available in JS, then I would look at C# tools designed for this purpose. For example HTML Agility Pack, or my own project CsQuery, which is a C# jQuery port.
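When the goal is only to pull strings out of markup, no JavaScript engine is needed at all. As a rough, language-neutral illustration of that idea, here is a minimal sketch in Python using only the standard library's html.parser. It is not HTML Agility Pack or CsQuery, and it only matches on a single CSS class rather than full selectors. The names TextByClass and text_by_class and the sample markup are made up for the example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextByClass(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0        # > 0 while we are inside a matching element
        self.results = []     # one string per matching element
        self._buffer = []     # text pieces of the element being collected

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1   # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def text_by_class(html, wanted_class):
    """Return the text of each element whose class list contains wanted_class."""
    parser = TextByClass(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


sample = ('<ul><li class="price">10 <b>USD</b></li><li>skip</li>'
          '<li class="price big">20<br>EUR</li></ul>')
prices = text_by_class(sample, "price")
```

This assumes reasonably well-formed markup. A dedicated HTML library handles malformed input and real selector syntax, which is why the tools named above are the better choice for production use.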
If you are basically trying to take some code that was written for the client, but run it on a server -- e.g. for sophisticated/accelerated web scraping -- I'd search around using those terms. For example this question discusses this, with answers including PhantomJS, a headless webkit browser stack, as well as some of the testing tools I have already mentioned. For web scraping, I would imagine you can live without it all being in .NET, and that may be the only reasonable answer anyway.
Related
When I discovered that Node.js was built using the V8 JavaScript engine, I thought:
Great, web scraping will be easier as the page
will be rendered like in the browser, with a
"native" DOM supporting XPath and any AJAX calls on
the page executed.
Why doesn't it have a native DOM when it uses the same JavaScript engine as Chrome?
Why doesn't it have a mode to run JavaScript in retrieved pages?
What am I not understanding about JavaScript engines vs the engine in a web browser?
Many thanks!
The DOM is the DOM, and the JavaScript implementation is simply a separate entity. The DOM represents a set of facilities that a web browser exposes to the JavaScript environment. There's no requirement however that any particular JavaScript runtime will have any facilities exposed via the global object.
What Node.js is is a stand-alone JavaScript environment completely independent of a web browser. There's no intrinsic link between web browsers and JavaScript; the DOM is not part of the JavaScript language or specification or anything.
I use the old Rhino Java-based JavaScript implementation in my Java-based web server. That environment also has nothing at all to do with any DOM. It's my own application that's responsible for populating the global object with facilities to do what I need it to be able to do, and it's not a DOM.
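The separation between a language runtime and the objects its host exposes can be illustrated in any embeddable language. The sketch below uses Python's exec as a stand-in for a JavaScript engine (it is not Rhino or V8): the host decides which names the script can see, and a `document` exists only if the host supplies one. FakeDocument and run_script are hypothetical names invented for this example.

```python
class FakeDocument:
    """A stand-in for a host-provided object; this is not a real DOM."""

    def __init__(self, title):
        self.title = title


def run_script(source, host_globals):
    """Run `source` with only the names the host chooses to expose."""
    scope = dict(host_globals)   # copy so the caller's dict is not mutated
    exec(source, scope)
    return scope.get("result")


script = "result = document.title.upper()"

# A "browser-like" host exposes a document object to the script.
browser_result = run_script(script, {"document": FakeDocument("home")})

# A "server-like" host exposes nothing of the sort, so the same script fails.
try:
    run_script(script, {})
    server_error = None
except NameError as exc:
    server_error = str(exc)
```

The script is identical in both runs; only the environment differs, which is the point being made about the DOM and the JavaScript language.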
Note that there are projects like jsdom if you want a virtual DOM in your Node project. Because of its very nature as a server-side platform, a DOM is a facility that Node can do without and still make perfect sense for a wide variety of server applications. That's not to say that a DOM might not be useful to some people, but it's just not in the same category of services as things like process control, I/O, networking, database interop, and so on.
There may be some "official" answer to the question "why?" out there, but it's basically just the business of those who maintain Node (the Node Foundation now). If some intrepid developer out there decides that Node should ship by default with a set of modules to support a virtual DOM, and successfully works and works and makes that happen, then Node will have a DOM.
P.S: When reading this question I was also wondering if V8 (node.js is built on top of this) had a DOM
Why when it uses the same JS engine as Chrome doesn't it have a native
DOM?
But I searched Google and found Google's V8 page, which states the following:
JavaScript is most commonly used for client-side scripting in a
browser, being used to manipulate Document Object Model (DOM) objects
for example. The DOM is not, however, typically provided by the
JavaScript engine but instead by a browser. The same is true of
V8—Google Chrome provides the DOM. V8 does however provide all the
data types, operators, objects and functions specified in the ECMA
standard.
node.js uses V8 and not Google Chrome.
Likewise, why doesn't it have a mode to run JS in retrieved pages?
I also think we don't really need it that bad. Ryan Dahl created node.js as one man (single programmer). Maybe now he (his team) will develop this, but I was already extremely amazed by the amount of code he produced (crazy). He wanted to make a non-blocking easy/efficient library, which I think he did a mighty good job at.
But then again, another developer created a module which is pretty good and actively developed (today) at https://github.com/tmpvar/jsdom.
What am I not understanding about Javascript engines vs the engine in
a web browser? :)
Those are different things as is hopefully clear from the quote above.
The Document Object Model (DOM in short) is a programming interface for HTML and XML documents and it represents the page so that programs can change the document structure, style, and content. More on this subject.
The necessary distinction between client-side (browser) and server-side (Node.js) and their main goals:
Client-side: accessing and displaying information of the web
Server-side: providing stable and reliable ways to deliver web information
Why is there no DOM in Node.js by default?
By default, Node.js doesn't have access, nor have any knowledge about the actual DOM in your own browser. Node.js just delivers the data, that will be used by your own browser to process and render the whole website, the DOM included. The server provides the data to your browser to use and process. That is the intended way.
Why wouldn't you want to access the DOM in Node.js?
Accessing your browser's actual DOM from Node.js is simply outside the server's role. Your browser's role is to display the data coming from the server. However, it is certainly possible, and there are multiple solutions of varying depth to pre-render, manipulate, or change the DOM using AJAX calls. We'll see what future trends bring.
Why would you want to access the DOM in Node.js?
By default, you shouldn't access your own, actual DOM (or at least some data from it) using Node.js. Client-side and server-side are separated in terms of role, functionality, and responsibility, based on years of experience and knowledge. There are, however, several situations where there are solid reasons to do so:
Gathering usage data (A/B testing, UI/UX efficiency and feedback)
Headless testing (Development, automation, web-scraping)
How can you access the DOM in Node.js?
jsdom: pure-JavaScript implementation, good for testing your own DOM/browser-related project
cheerio: great solution if you like/often use jQuery
puppeteer: Google's own way to provide headless testing using Google Chrome
own solution (your possible future project link here)
These solutions do not provide a way to access your browser's own, actual DOM by default, but you can create a project that sends some form of data about your DOM to the server, then use, render, or manipulate that data based on your needs.
...and yes, web-scraping and web development in terms of tools and utilities became more sophisticated and certainly easier in several fields.
node.js chose not to include it in their standard library. For any functionality, there is an inevitable tradeoff between comprehensiveness, scalability, and maintainability.
That doesn't mean it's not potentially useful. There is at least one JavaScript DOM implementation intended for NodeJS (among other CommonJS implementations).
You seem to have a flawed assumption that V8 and the DOM are inextricably related; that's not the case. The DOM is actually handled by WebKit. V8 doesn't handle the DOM; it handles JavaScript calls to the DOM. Don't let this discourage you: Node.js has carved out a significant niche in the realtime server market, but don't let anybody tell you it's just for servers. Node makes it possible to build almost anything with JavaScript.
It is possible to do what you're talking about. For example, there is the very good jsdom library if you really need access to the DOM, and node-htmlparser. There are also some really good scraping libraries that take advantage of these, like apricot.
2018 answer: mainly for historical reasons, but this may change in future.
Historically, very little DOM manipulation was done on the server. Additionally, as other answers allude, the JS stdlib and the DOM are separate libraries: if you're using Node for, say, Unix scripting, then HTMLElement, NodeList, etc. aren't really relevant.
However: server-side DOM manipulation is now a very common part of delivering web apps. Web servers need to understand the structure of pages, and, if asked to render a resource as HTML, deliver HTML content that reflects the initial state of a web application. This means web apps load much faster than if the server simply delivers a stub page and has the browsers then do the work of filling in the real content. Currently this is done with JSDom and similar, but in the same way node has Request and Response objects built in, having DOM functions maintained as part of the stdlib would help with this task.
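To make the "initial state" idea concrete, here is a minimal sketch, in Python rather than Node and without JSDom, of a server producing complete HTML for the current data instead of an empty stub page. The function name render_page and the sample data are assumptions made for illustration only.

```python
from html import escape


def render_page(title, items):
    """Return full HTML for the initial state, so content exists before any script runs."""
    rows = "\n".join(f"    <li>{escape(item)}</li>" for item in items)
    return (
        "<!doctype html>\n"
        f"<title>{escape(title)}</title>\n"
        '<ul id="items">\n'
        f"{rows}\n"
        "</ul>\n"
    )


# The server fills in the real content; the browser only has to display it.
page = render_page("Inbox", ["Hello", "1 < 2"])
```

Escaping each value matters here because the data ends up inside markup. A real server-side DOM would also let client code pick up where the server left off, which string templating alone does not provide.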
Javascript != browser. Javascript as a language is not tied to browsers; node.js is simply an implementation of Javascript that is intended for servers, not browsers. Hence no DOM.
If you read DOM as 'linked objects immediately accessible from my script', then the answer is 'it does, but it's very different from the set of objects available to a web document script'. The main reason is that Node is 'evented I/O for V8', not 'HTML tree objects for V8'.
Node is a runtime environment, it does not render a DOM like a browser.
Because there isn't a DOM. DOM stands for Document Object Model. There is no document in Node, so there is no DOM to manipulate. That is definitely a browser thing.
You can use a library like cheerio though which gives you some simple DOM manipulation.
Node is server-level JavaScript. It's just the language applied to a basic system API, more like C++ or Java.
It seems people have answered 'why' but not 'how'. A quick answer to 'how' is that in a web browser, a document object is exposed (hence DOM, Document Object Model). On Windows this object is called the document object. You can refer to this page and look at the methods it exposes for handling HTML documents, like createElement. I don't use node.js and haven't done COM programming in a while, but I'd imagine you could use the DOM in node.js by calling the COM object IHTMLDocument3. Of course, for other platforms like Mac OS X or Linux you would probably have to use something from their OS API. This should allow you to build a webpage server-side using the DOM, or to scrape incoming web pages.
Node.js is for serverside programming. There is no DOM to be rendered in the server.
1) What does it mean for it to have a Document Object Model? There's no document to represent.
2) Most of the time you're not retrieving pages. You can, but most Node apps probably won't be.
3) Without a document and a browser, JavaScript is just another programming language. So you may as well ask why there isn't a DOM in C# or Java.
I am a web developer, and I have observed that many times I need the same function on both client and server. So I write it in JS as well as in PHP or whichever server side language. I am fed up with this. If I have to change it then I need to change it in both places. If I want to use it for some hand held devices, then I will have to rewrite that code yet again using objective-C or Java etc. Then if I need to change that function then I will need to change it everywhere.
Is there a solution for this? If I will call some webservice via ajax, then the client will have a delay. If it will be in JS then it can't be accessed from within PHP or Java, etc. If I use some service in PHP from another language then that can also become a performance issue.
It is also possible that we sometimes need a function's output for some input parameters, with or without a database.
I know there must be some fairly simple solution, but I am not aware of it. Please suggest a language-independent solution, as I don't always have a VPS.
I am not sure if my question actually belongs to stackoverflow.com or programmers.stackexchange.com so please transfer it to programmers.stackexchange.com instead of closing this question if it belongs to there.
Typically, the solution to this problem is to write common code in one language and use translators or library linking to allow access from other languages.
Node.js allows you to write server-side code in JavaScript.
Node.js is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices.
You can also use JavaScript to write HTML5 apps for mobile devices.
"Building iPhone Apps with HTML, CSS, and JavaScript"
Now web designers and developers can join the iPhone app party without having to learn Cocoa's Objective-C programming language. It's true: You can write iPhone apps quickly and efficiently using your existing skills with HTML, CSS, and JavaScript. This book shows you how with lots of detailed examples, step-by-step instructions, and hands-on exercises.
If you don't want to try to write large complex applications in JavaScript, GWT provides a way to write Java and, via translation, run it on the client.
The GWT SDK contains the Java API libraries, compiler, and development server. It lets you write client-side applications in Java and deploy them as JavaScript.
If you develop in .Net languages: C# -> JavaScript ScriptSharp
Script# is a free tool that enables developers to author C# source code and subsequently compile it into regular script that works across all modern browsers
You could use the spidermonkey extension to translate PHP into JavaScript. This way you can write your functions in PHP, then simply convert them to JavaScript and reuse them in the browser.
Here is a good tutorial to show you how this is done.
A lot of questions have been asked and answered about running server-side javascript on Google App Engine, but all of the answers deal with Java instances in order to make use of Java-based JS interpreters like Rhino, Rhino for Webapps, etc.
Is there any way to execute server-side javascript code on a Python GAE instance? I'm thinking something exactly along the lines of pyv8, but with support for App Engine (which I guess would mean a pure python implementation of the interpreter).
The only solution I can come up with at the moment is to use some sort of gross hack to run a Java and Python GAE instance side-by-side (via different versions) so they can both talk to the same datastore, let the Java instance host the JS code, and use an API to talk back'n'forth. Not very appealing.
No need to get into all the "this is unnecessary, you shouldn't be doing this" discussion -- I know this isn't ideal and I'm simply curious if it can be done.
As far as I can find: No
I've done a bit of searching, but it seems that nobody has tried to implement a pure Python JavaScript engine, and I can't blame them: it would be a huge amount of work for very few use cases (unfortunately, yours is one of those). A couple of projects, Grailbrowser and Pybrowser, have Python code to render HTML, so they might one day aim to run JavaScript, but that work hasn't even started, and neither appears to be in active development.
The most likely way it would ever happen is if Google were to offer the Parrot VM (which can run various dynamic languages) on Appengine. That's a cool idea, but I'm not holding my breath.
What might work is to run Jython (and Rhino) in a Java instance. Of course, then you'd have to get to any App services through the Java API, not the Python one, which would be ugly.
Actually, it can indeed be done, using either AppEngineJs or ESXX:
http://www.appenginejs.org/
http://esxx.blogspot.com/2009/06/esxx-on-google-app-engine.html
I am currently trying to solve the same problem with PyJON:
http://code.google.com/p/pyjon/
It seems to be a pure Python JavaScript parser and interpreter.
Honestly, now when we have so many javascript features on the frontend I really wish javascript in the browsers could replace html and css entirely.
We could deal with objects (structure + design + functionality) instead of html elements and css that style these elements.
But since that is never going to happen, I wonder if there is any low-level framework that abstracts away html and css entirely, like node.js (not high-level like Sproutcore) but for the frontend?
I think that would be the next big thing :)
I think that a framework that abstracts away html and css, would by definition, be high level. So you are asking an invalid question.
I consider these all high level frameworks, but they are the only ones I am aware of that abstract away html and css.
qooxdoo
gwt (Google Web Toolkit)
pyjamas
cappuccino
This would have been easy if all browsers followed a strict standard. As things stand, you will eventually find yourself needing to tweak the "low-level" JavaScript/CSS to make it compatible with all major browsers.
jQuery already provides an abstraction over cross-browser incompatibilities, but it is still low-level by your definition, since you still need to manipulate elements yourself.
There are many attempts to "objectify" at least HTML, especially in Java and server-side programming, including Wicket, Groovlets, and the aforementioned GWT, to name a few.
GWT does that using Java. So all you have to deal with is Java objects and the tree you build to contain other objects. Though with UiBinder they have kind of brought back the old HTML.
And going by my limited reading about Sproutcore, GWT does something similar too. GWT can be used to make the server sessionless, i.e. it doesn't recognize the user but only serves the data.
In my project we use GWT to code the business logic on the client, which asks the server for the data the widgets require.
Try Google Closure Library. It's a low-level framework. It has similarities with CommonJS (like goog.require), but differs in that it has its own goog.exportSymbol and goog.exportProperty rather than module.exports. However, it does not completely abstract away HTML/CSS!
I have the following situation. A customer uses JavaScript with jQuery to create a complex website. We would like to use JavaScript and jQuery on the server (IIS) for the following reasons:
Skills transfer - we would like to use JavaScript and jQuery on the server and not have to use e.g. VBScript / classic ASP. The .NET Framework, Java, etc. are ruled out because of this.
Improved options for search/accessibility. We would like to be able to use jQuery as a templating system, but this isn't viable for search engines and users with js turned off - unless we can selectively run this code on the server.
There is significant investment in IIS and Windows Server, so changing that is not an option.
I know you can run JScript on IIS using Windows Script Host, but I am unsure of the scalability and the process surrounding this. I am also unsure whether this would have access to the DOM.
Here is a diagram that hopefully explains the situation. I was wondering if anyone has done anything similar?
EDIT: I am not looking for criticism of the web architecture; I simply want to know if there are any options for manipulating the DOM of a page before it is sent to the client, using JavaScript. Jaxer is one such product (no IIS support). Thanks.
Have a look at bringing the browser to the server, Rhino, and Use Microsoft's IIS as a Java servlet engine.
The first link is from John Resig's (jQuery's creator) blog.
Update August 2 2011
Node.js is coming to Windows.
The idea to reuse client JS on the server may sound tempting, but I am not sure that jQuery itself would be ready to run in server environment.
You will need to define a global context for jQuery somehow by initializing window, document, self, location, etc. I am not sure it is doable.
Besides, as Cheeso has mentioned, Active Server Pages is a very outdated technology; Microsoft replaced it with ASP.NET at the beginning of the century. I maintained a legacy system using ASP 3.0 for more than a year, and it was a pain. The most wonderful pastime was debugging: you will hardly find any tools for the purpose today, and you will have to decipher beautiful errors like this one in the IIS log:
error '800a9c68'
Application-defined or object-defined error
Nevertheless, I can confirm that I managed to reuse client and server JScript. But this was code written by me who knew that it was going to be used on the server.
P.S. I would not recommend going that way. There are plenty of templating frameworks that are familiar to those who write HTML and JavaScript.
JScript runs on IIS via something called ASP.
Active Server Pages.
It was first available in 1996.
Eventually ASP.NET was introduced as a successor. But ASP is still supported.
There is no DOM for the HTML page, though.
You might need to reconsider your architecture a bit.
I think the only viable solutions you're likely to find anywhere near ready to go involve putting IIS in front of Java. There are two browser-like environments I'm aware of coded for Java:
1) Env-js (see http://groups.google.com/group/envjs and http://github.com/thatcher/env-js )
I believe this one has contributions from jQuery's John Resig and was put together with jQuery testing/support in mind.
2) HTMLUnit (see http://htmlunit.sourceforge.net/ ) This one's older, and wasn't originally conceived around jQuery, but there are reports in the wild of using it to run jQuery's test suite successfully (http://daniel.gredler.net/2007/08/08/htmlunit-taming-jquery/ ).
If you want something pure-IIS/MS, I think your observation about Windows Script Host and/or something like the semi-abandoned JScript.NET is probably about as close as you're going to come, along with a port (which you'll probably have to start) of something like Env-js or HTMLUnit.
Also, I don't know if you've seen the Wikipedia list of server-side JavaScript solutions:
http://en.wikipedia.org/wiki/Server-side_JavaScript
Finally... you could probably write a serviceable jQuery-like library in any language that already has some kind of DOM library and first-class functions (or, failing that an eval facility). See, for example pQuery for Perl (http://metacpan.org/pod/pQuery ). This would get you the benefits of the jQuery style of manipulating documents. Skill transfer is great and JavaScript has a wonderful confluence of very nice features, but on the other hand, having developers who care enough to learn multiple languages is also great, and js isn't the only nice language out there.
I think it's mainly a browser-based scripting language, so you are probably better off using technologies based on VB or .NET to generate HTML from templates. I'm sure such tools exist, because in the Java world there are a few of these around (like Velocity). You'd then use jQuery to add client-side functionality so that the website is more usable than it would otherwise have been.
What exactly do you mean by
"A customer uses JavaScript with
jQuery to create a complex website"
Half the point of jQuery is to make it easy for the developer to manipulate the DOM, and therefore add interactive enhancements to a web site. By running the Javascript on the server and only rendering HTML you will lose the ability to add these enhancements, without doing a round trip to the server (think WebForms postback model...ugh).
Now if what you really mean is the customer uses a site builder based on jQuery, why not have that tool output flat HTML in the first place?
Take a look at this technology. You can invoke scripts to run at server, at client, or both. Plus, this really implements the firefox engine on the server. Take a look at it.
Aptana's Jaxer is the first AJAX web server so far. I have not tried it yet, but I will. It looks promising and very powerful.