When I discovered that Node.js was built using the V8 JavaScript engine, I thought:
Great, web scraping will be easier as the page
will be rendered like in the browser, with a
"native" DOM supporting XPath and any AJAX calls on
the page executed.
Why doesn't it have a native DOM when it uses the same JavaScript engine as Chrome?
Why doesn't it have a mode to run JavaScript in retrieved pages?
What am I not understanding about JavaScript engines vs the engine in a web browser?
Many thanks!
The DOM is the DOM, and the JavaScript implementation is simply a separate entity. The DOM represents a set of facilities that a web browser exposes to the JavaScript environment. There's no requirement however that any particular JavaScript runtime will have any facilities exposed via the global object.
Node.js is a stand-alone JavaScript environment, completely independent of a web browser. There's no intrinsic link between web browsers and JavaScript; the DOM is not part of the JavaScript language or its specification.
I use the old Rhino Java-based JavaScript implementation in my Java-based web server. That environment also has nothing at all to do with any DOM. It's my own application that's responsible for populating the global object with facilities to do what I need it to be able to do, and it's not a DOM.
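To make the "host populates the global object" point concrete, here is a minimal sketch using Node's built-in vm module. The hostDocument object and its Map-backed getElementById are made up for illustration; they are not a real DOM, they only show that the same script sees a document or not depending on what the embedding application chooses to expose.

```javascript
const vm = require('vm');

// Hypothetical host facility: a tiny "document" backed by a Map.
// Nothing here is a real DOM; it only shows who supplies the object.
const elementsById = new Map([['foo', { id: 'foo', textContent: 'hello' }]]);
const hostDocument = {
  getElementById: (id) => elementsById.get(id) || null,
};

// The same script, evaluated against two different global objects.
const script =
  "typeof document === 'undefined' ? 'no document' : document.getElementById('foo').textContent";

// Context 1: the host exposes nothing beyond the language itself.
const withoutDom = vm.runInContext(script, vm.createContext({}));

// Context 2: the host decides to expose its own `document`.
const withDom = vm.runInContext(script, vm.createContext({ document: hostDocument }));

console.log(withoutDom); // no document
console.log(withDom);    // hello
```

The engine is identical in both runs; only the object the host handed over as the global differs.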
Note that there are projects like jsdom if you want a virtual DOM in your Node project. Because of its very nature as a server-side platform, a DOM is a facility that Node can do without and still make perfect sense for a wide variety of server applications. That's not to say that a DOM might not be useful to some people, but it's just not in the same category of services as things like process control, I/O, networking, database interop, and so on.
There may be some "official" answer to the question "why?" out there, but it's basically just the business of those who maintain Node (the Node Foundation now). If some intrepid developer out there decides that Node should ship by default with a set of modules to support a virtual DOM, and successfully works and works and makes that happen, then Node will have a DOM.
P.S: When reading this question I was also wondering if V8 (node.js is built on top of this) had a DOM
Why when it uses the same JS engine as Chrome doesn't it have a native
DOM?
But I searched Google and found Google's V8 page, which says the following:
JavaScript is most commonly used for client-side scripting in a
browser, being used to manipulate Document Object Model (DOM) objects
for example. The DOM is not, however, typically provided by the
JavaScript engine but instead by a browser. The same is true of
V8—Google Chrome provides the DOM. V8 does however provide all the
data types, operators, objects and functions specified in the ECMA
standard.
node.js uses V8 and not Google Chrome.
Likewise, why doesn't it have a mode to run JS in retrieved pages?
I also think we don't really need it that bad. Ryan Dahl created node.js as one man (single programmer). Maybe now he (his team) will develop this, but I was already extremely amazed by the amount of code he produced (crazy). He wanted to make a non-blocking easy/efficient library, which I think he did a mighty good job at.
But then again, another developer created a module which is pretty good and actively developed (today) at https://github.com/tmpvar/jsdom.
What am I not understanding about Javascript engines vs the engine in
a web browser? :)
Those are different things as is hopefully clear from the quote above.
The Document Object Model (DOM in short) is a programming interface for HTML and XML documents and it represents the page so that programs can change the document structure, style, and content. More on this subject.
The necessary distinction between client-side (browser) and server-side (Node.js) and their main goals:
Client-side: accessing and displaying information of the web
Server-side: providing stable and reliable ways to deliver web information
Why is there no DOM in Node.js by default?
By default, Node.js has neither access to nor any knowledge of the actual DOM in your own browser. Node.js just delivers the data that your browser will use to process and render the whole website, the DOM included. The server provides the data for your browser to use and process. That is the intended way.
Why wouldn't you want to access the DOM in Node.js?
Accessing your browser's actual DOM from Node.js would simply be outside the server's purpose. Your browser's role is to display the data coming from the server. However, it is certainly possible, and there are multiple solutions of varying depth to pre-render, manipulate, or change the DOM using AJAX calls. We'll see what future trends will bring.
Why would you want to access the DOM in Node.js?
By default, you shouldn't access your own, actual DOM (or at least some data from it) using Node.js. Client-side and server-side are separated in terms of role, functionality, and responsibility, based on years of experience and knowledge. There are, however, several situations where there are solid reasons to do so:
Gathering usage data (A/B testing, UI/UX efficiency and feedback)
Headless testing (Development, automation, web-scraping)
How can you access the DOM in Node.js?
jsdom: pure-JavaScript implementation, good for testing your own DOM/browser-related project
cheerio: great solution if you like/often use jQuery
puppeteer: Google's own way to provide headless testing using Google Chrome
own solution (your possible future project link here)
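As a rough illustration of the first option, the sketch below parses an HTML string with jsdom and reads text through the standard querySelector API. It assumes the third-party jsdom package has been installed (npm install jsdom); extractText and its return shape are hypothetical helpers, not part of jsdom.

```javascript
// Assumes the third-party `jsdom` package is installed (npm install jsdom).
// If it is missing, the function reports that instead of throwing.
function extractText(html, selector) {
  let JSDOM;
  try {
    ({ JSDOM } = require('jsdom'));
  } catch (err) {
    return { ok: false, reason: 'jsdom is not installed' };
  }
  // jsdom builds a DOM from the string; `window.document` behaves like the
  // browser's document object for querying purposes.
  const { document } = new JSDOM(html).window;
  const node = document.querySelector(selector);
  return { ok: true, text: node ? node.textContent : null };
}

const result = extractText('<ul><li class="item">first</li></ul>', '.item');
console.log(result);
```

Note that this DOM lives entirely in the Node process; it has no connection to any DOM in a user's browser.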
These solutions do not provide a way to access your browser's own, actual DOM by default, but you can create a project that sends some form of data about your DOM to the server and then use, render, or manipulate that data based on your needs.
...and yes, web-scraping and web development in terms of tools and utilities became more sophisticated and certainly easier in several fields.
node.js chose not to include it in their standard library. For any functionality, there is an inevitable tradeoff between comprehensiveness, scalability, and maintainability.
That doesn't mean it's not potentially useful. There is at least one JavaScript DOM implementation intended for NodeJS (among other CommonJS implementations).
You seem to have a flawed assumption that V8 and the DOM are inextricably related; that's not the case. In Chrome the DOM is handled by the rendering engine (WebKit when this was written, Blink today). V8 doesn't implement the DOM; it only executes the JavaScript that calls into it. Don't let this discourage you: Node.js has carved out a significant niche in the realtime server market, but don't let anybody tell you it's just for servers. Node makes it possible to build almost anything with JavaScript.
It is possible to do what you're talking about. For example, there is the very good jsdom library if you really need access to the DOM, as well as node-htmlparser, and there are some really good scraping libraries that take advantage of these, like apricot.
2018 answer: mainly for historical reasons, but this may change in future.
Historically, very little DOM manipulation was done on the server. Additionally, as other answers allude to, the JS stdlib and the DOM are separate libraries: if you're using node for, say, Unix scripting, then HTMLElement, NodeList, etc. aren't really relevant to that.
However: server-side DOM manipulation is now a very common part of delivering web apps. Web servers need to understand the structure of pages, and, if asked to render a resource as HTML, deliver HTML content that reflects the initial state of a web application. This means web apps load much faster than if the server simply delivers a stub page and has the browsers then do the work of filling in the real content. Currently this is done with JSDom and similar, but in the same way node has Request and Response objects built in, having DOM functions maintained as part of the stdlib would help with this task.
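As a small illustration of "deliver HTML content that reflects the initial state", here is a dependency-free sketch. It uses plain string templating rather than a server-side DOM such as jsdom, so it is a stand-in for the idea, not the technique itself. The state shape ({ title, items }) and the function names are assumptions made for the example.

```javascript
// Escape text so that state values cannot inject markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical state shape: { title: string, items: string[] }.
// The server sends a page that already contains the content, instead of an
// empty stub that the browser has to fill in after loading.
function renderPage(state) {
  const items = state.items
    .map((item) => `<li>${escapeHtml(item)}</li>`)
    .join('');
  return (
    `<!doctype html><html><head><title>${escapeHtml(state.title)}</title></head>` +
    `<body><ul id="items">${items}</ul></body></html>`
  );
}

const html = renderPage({ title: 'Inbox', items: ['a < b', 'done'] });
console.log(html);
```

With a real DOM available on the server, the same result could be built with createElement and appendChild and then serialized, which is where a built-in DOM would help.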
Javascript != browser. Javascript as a language is not tied to browsers; node.js is simply an implementation of Javascript that is intended for servers, not browsers. Hence no DOM.
If you read DOM as 'linked objects immediately accessible from my script', then the answer is 'it does, but they are very different from the set of objects available to a script in a web document'. The main reason is that Node is 'evented I/O for V8', not 'HTML tree objects for V8'.
Node is a runtime environment, it does not render a DOM like a browser.
Because there isn't a DOM. DOM stands for Document Object Model. There is no document in Node, so there is no DOM to manipulate. That is definitely a browser thing.
You can use a library like cheerio though which gives you some simple DOM manipulation.
Node is server-level JavaScript. It's just the language applied to a basic system API, more like C++ or Java.
It seems people have answered 'why' but not 'how'. A quick answer to 'how' is that in a web browser a document object is exposed (hence DOM, Document Object Model). On Windows this object is called the document object. You can refer to this page and look at the methods it exposes for handling HTML documents, like createElement. I don't use node.js and haven't done COM programming in a while, but I'd imagine you could use the DOM in node.js by simply calling the COM object IHTMLDocument3. Of course, for other platforms like Mac OS X or Linux you would probably have to use something from their OS API. This should allow you to easily build a web page server-side using the DOM, or to scrape incoming web pages.
Node.js is for serverside programming. There is no DOM to be rendered in the server.
1) What does it mean for it to have a Document Object Model? There's no document to represent.
2) Most of the time you're not retrieving pages. You can, but most Node apps probably won't be.
3) Without a document and a browser, JavaScript is just another programming language. So you may ask why there isn't a DOM in C# or Java.
Related
I was thinking about this today and I realized I don't have a clear picture here.
Here are some statements I think to be true (please correct me if I'm wrong):
the DOM is a collection of interfaces specified by W3C.
when parsing HTML source code, the browser creates a DOM tree which has nodes that implement DOM interfaces.
the ECMAScript spec has no reference of browser host objects (DOM, BOM, HTML5 APIs etc.).
how the DOM is actually implemented depends on browser internals and is probably different among most of them.
modern JS interpreters use JIT to improve the code performance and translate it to bytecode
I am curious about what happens behind the scenes when I call document.getElementById('foo'). Does the call get delegated to browser native code by the interpreter or does the browser have JS implementations of all host objects? Do you know about any optimizations they do in regard to this?
I read this overview of browser internals but it didn't mention anything about this. I will look through the Chrome and FF source when I have time, but I thought about asking here first. :)
All of your bullet points are correct, except:
modern JS interpreters use JIT to improve the code performance and translate it to bytecode
should be "...and translate it to native code". SpiderMonkey (the JS engine in Firefox) worked as a bytecode interpreter for a long time before the current JS speed arms race.
On Mozilla's JS-to-DOM bridge:
The host objects are typically implemented in C++, though there is an experiment underway to implement DOM in JS. So when a web page calls document.getElementById('foo'), the actual work of retrieving the element by its ID is done in a C++ method, as hsivonen noted.
The specific way the underlying C++ implementation gets called depends on the API and also changed over time (note that I'm not involved in the development, so might be wrong about some details, here's a blog post by jst, who was actually involved in creating much of this code):
At the lowest level every JS engine provides APIs to define host objects. For example, the browser can call JS_DefineFunctions (as demonstrated in the SpiderMonkey User Guide) to let the engine know that whenever script calls a function with the specified name, a provided C callback should be called. Same for other aspects of the host objects (e.g. enumeration, property getters/setters, etc.)
For the core ECMAScript functionality and in some tricky DOM cases the JS engine/the browser uses these APIs directly to define host objects and their behaviors, but it requires a lot of common boilerplate code for e.g. checking parameter types, converting them to the appropriate C++ types, error handling etc.
For reasons I won't go into, let's say historically, Mozilla made heavy use of XPCOM for many of its objects, including much of the DOM. One feature of XPCOM is its binding to JS called XPConnect. Among other things, XPConnect can take an interface definition in IDL (such as nsIDOMDocument; or more precisely its compiled representation), expose an object with the specified properties to the script, and later, when a script calls getElementById, perform the necessary parameter checks/conversions and route the call directly to a C++ method (nsDocument::GetElementById(const nsAString& aId, nsIDOMElement** aReturn))
The way XPConnect worked was quite inefficient: it registered generic functions as callbacks to be executed when a script accesses a host object, and these generic functions figured out what they needed to do in every particular case dynamically. This post about quickstubs walks you through one example.
"Quick stubs" mentioned in the previous link is a way to optimize JS->C++ calls time by trading some code size for it: instead of always using generic C++ functions that know how to make any kind of call, the specialized code is automatically generated at the Firefox build time for a pre-defined list of "hot" calls.
Later on the JIT (tracemonkey at that time) was taught to generate the code calling C++ methods as part of the native code generated for "hot" paths in JS. I'm not sure how the newer JITs (jaegermonkey) work in this regard.
With "paris bindings" the objects are exposed to webpage JS without any reliance on XPConnect, instead generating all the necessary glue JSClass code based on WebIDL (instead of XPCOM-era IDL). See also posts by developers who worked on this: jst and khuey. Also see How is the web-exposed DOM implemented?
I'm fuzzy on details of the three last points in particular, so take it with a grain of salt.
The most recent improvements are listed as dependencies of bug 622298, but I don't follow them closely.
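The difference between the generic XPConnect-style path and "quick stubs" can be sketched in plain JavaScript. This is only an analogy, not Mozilla's code: interfaceInfo stands in for compiled IDL, and genericCall and generateStubs are hypothetical names.

```javascript
// Hypothetical interface description, standing in for compiled IDL.
const interfaceInfo = {
  getElementById: { params: ['string'], native: (id) => `element#${id}` },
};

// Generic path: one function handles every call and consults the
// description at call time to decide how to convert the arguments.
function genericCall(name, args) {
  const info = interfaceInfo[name];
  if (!info) throw new TypeError(`unknown method ${name}`);
  const converted = info.params.map((type, i) =>
    type === 'string' ? String(args[i]) : args[i]
  );
  return info.native(...converted);
}

// Stub path: specialised functions are generated once, ahead of time, for a
// fixed list of "hot" methods, so the per-call lookup and dispatch disappear.
function generateStubs(hotNames) {
  const stubs = {};
  for (const name of hotNames) {
    const { params, native } = interfaceInfo[name];
    if (params.length === 1 && params[0] === 'string') {
      stubs[name] = (arg) => native(String(arg));
    }
  }
  return stubs;
}

const stubs = generateStubs(['getElementById']);
console.log(genericCall('getElementById', ['foo'])); // element#foo
console.log(stubs.getElementById('foo'));            // element#foo
```

Both paths return the same result; the stub trades extra generated code for less work on each call, which is the trade-off described above.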
JS calls to DOM methods like getElementById cause the JS engine to call into the C++ code that implements the DOM. For example, in Firefox, the call ends up in nsDocument::GetElementById(const nsAString& aId, nsIDOMElement** aReturn).
As you can see, Firefox maintains a hashtable that maps ids to elements in C++ as an optimization in this case, so it doesn't walk the whole DOM tree looking for the id.
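The idea behind that optimization can be shown with a toy tree in JavaScript. The node shape ({ id, children }) and both function names are invented for the example, and a real browser keeps its table up to date as elements are added and removed rather than building it once.

```javascript
// Hypothetical minimal tree node: { id, children }.

// Without an index: visit nodes depth-first until the id matches.
function findByWalking(node, id) {
  if (node.id === id) return node;
  for (const child of node.children) {
    const hit = findByWalking(child, id);
    if (hit) return hit;
  }
  return null;
}

// With an index: build an id -> node table once, then look up directly.
// The first node with a given id wins, mirroring document order.
function buildIdIndex(node, index = new Map()) {
  if (node.id && !index.has(node.id)) index.set(node.id, node);
  node.children.forEach((child) => buildIdIndex(child, index));
  return index;
}

const tree = {
  id: 'root',
  children: [
    { id: 'nav', children: [] },
    { id: 'main', children: [{ id: 'foo', children: [] }] },
  ],
};

const idIndex = buildIdIndex(tree);
console.log(findByWalking(tree, 'foo') === idIndex.get('foo')); // true
```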
The DOM is implemented as a language-independent library pretty much in all major browser implementations, which means it's in a different library from the Javascript engine. For example in IE, the JS engine is implemented in jscript.dll while the DOM is implemented in mshtml.dll. Safari has Nitro(JS) and WebCore(DOM). Chrome has V8(JS) and WebCore(DOM), and Firefox has SpiderMonkey/TraceMonkey(JS) and Gecko(DOM).
What this means is that any time your JS has to access the DOM, it has to reach over to the DOM library, which is inherently slow because of all the marshaling that has to take place. An analogy that has been used is two pieces of land connected by a toll bridge: any time you touch the DOM, you must cross the bridge and cross back, paying a performance toll.
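The toll-bridge idea can be mimicked without a browser by counting how often code reaches across to a host-like object. In this sketch makeCountingList is a made-up stand-in for a DOM collection whose length property costs one "crossing" per read; it measures crossings, not real time.

```javascript
// Stand-in for a host object: every read of `length` counts as one crossing.
function makeCountingList(length) {
  let crossings = 0;
  const list = {
    get length() {
      crossings += 1;
      return length;
    },
  };
  return { list, getCrossings: () => crossings };
}

// Reads list.length on every iteration of the loop condition.
function naiveLoop(list) {
  let iterations = 0;
  for (let i = 0; i < list.length; i++) iterations++;
  return iterations;
}

// Reads list.length once and keeps the value on the JavaScript side.
function cachedLoop(list) {
  let iterations = 0;
  for (let i = 0, n = list.length; i < n; i++) iterations++;
  return iterations;
}

const naive = makeCountingList(100);
naiveLoop(naive.list);
const cached = makeCountingList(100);
cachedLoop(cached.list);

console.log(naive.getCrossings(), cached.getCrossings()); // 101 1
```

Both loops do the same work; the second simply pays the toll once instead of on every iteration.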
References
Video: Building High Performance Web Applications and Sites
Book: High Performance Javascript (Chapter 3 on the DOM)
I want to extract data from an HTML string in a Web Worker.
I want to clarify that I do not want to manipulate the DOM. I am sending an HTML string to the Web Worker, which then should extract data from the HTML, and then return the extracted data.
In the browser I could do:
var html = $("<body><div>...more html...</div></body>");
var extractedText = $(".selector", html).text();
My Question:
What is the equivalent of the above code in a Web Worker environment if given the same HTML string? There's no jQuery, no DOMParser, no querySelector.. in the Web Worker etc. Are there alternatives?
The Why:
I'm doing on the fly scraping of pages in a browser and don't want to block the UI thread because it's pretty heavy work.
I've looked at jsdom, cheerio, etc. but could not figure out how to make them work.
Regarding suggested duplicates:
I have reviewed both of the suggested duplicates and they are ones that I have read before while searching for answers to this question. They address XML parsing and not HTML parsing, and also do not address how to use CSS-selection inside Web Workers.
Short answer:
You cannot do any sort of HTML/CSS manipulation, including querying, in a web worker.
Long answer:
There are many DOMs. There's the main DOM, which is rendered on the page, but everything that a browser does that touches an HTML or XML tree, including querySelector and friends, requires the browser to build a DOM for that tree. (see also: DocumentFragment)
One of Mozilla's developers talked a bit about some reasons why they can't build any DOMs on worker threads (found via this question, on nabble):
You're assuming that none of the DOM implementation code uses any sort of non-DOM objects, ever, or that if it does those objects are fully threadsafe. That's just not the case, at least in Gecko.
The issue in this case is not the same DOM object being touched on multiple threads. The issue is two DOM objects on different threads both touching some global third object.
For example, the XML parser has to do some things that in Gecko can only be done on the main thread (DTD loading, offhand; there are a few others that I've seen before but don't recall offhand).
So. We obviously can't use querySelector, createElement, or anything useful in a worker, so what can we do?
Build our own DOM parser/selector modules, of course!
Not really. Try including a copy of htmlparser2 in your worker, maybe via browserify (making that work is its own question). With that, and with CSSselect to allow querySelector-like selecting, you should be ready to go.
Admittedly, you can't use jQuery with those, but for simple querying needs they (and querySelector/querySelectorAll) should be more than sufficient.
You can do DOM selection from inside a worker, but you will need to create an API that uses postMessage to exchange data between the main thread and the worker (because you can't use the DOM directly in a worker). The limitation is that you will need to pass strings back and forth, so you can't return DOM nodes unless you have some code that creates DOM nodes in the worker based on data from the main thread.
Because JavaScript is dynamic, it should be easy to create a dynamic wrapper that creates all those functions for you, lets you call querySelector('.foo'), and exposes all the DOM APIs. With proxy objects you can even allow querySelector('.foo').innerHTML = 'hello'; in the worker, given the proper code.
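A minimal sketch of that wrapper idea is below. The worker side uses a Proxy to turn method calls into messages, and the main-thread side answers with strings. Everything here is hypothetical: the message shape ({ id, method, args }), the single text operation, and the fake transport and fake document used in the demo, which stand in for self.postMessage and the real document.

```javascript
// Worker side: turn method calls into messages. `postMessage` is whatever
// transport the caller provides (in a real worker, self.postMessage).
function createRemoteDom(postMessage) {
  let nextId = 1;
  const pending = new Map(); // request id -> resolve callback

  const api = new Proxy({}, {
    get(_target, method) {
      return (...args) =>
        new Promise((resolve) => {
          const id = nextId++;
          pending.set(id, resolve);
          postMessage({ id, method: String(method), args });
        });
    },
  });

  // Call this from the worker's message handler with the main thread's reply.
  function onReply(reply) {
    const resolve = pending.get(reply.id);
    if (!resolve) return;
    pending.delete(reply.id);
    resolve(reply.result);
  }

  return { api, onReply };
}

// Main-thread side: run the requested operation against the document and
// send back a string, since DOM nodes cannot be posted to a worker.
function handleRequest(doc, request) {
  const { id, method, args } = request;
  if (method === 'text') {
    const node = doc.querySelector(args[0]);
    return { id, result: node ? node.textContent : null };
  }
  return { id, result: null };
}

// Demo with a fake transport and a fake document (both stand-ins).
const sent = [];
const remote = createRemoteDom((message) => sent.push(message));
const fakeDocument = {
  querySelector: (selector) =>
    selector === '.foo' ? { textContent: 'hello' } : null,
};

const pendingText = remote.api.text('.foo');
const reply = handleRequest(fakeDocument, sent[0]);
remote.onReply(reply);
pendingText.then((text) => console.log(text)); // hello
```

Because messages are asynchronous, every call returns a Promise; assignments such as innerHTML = 'hello' would need a set trap that posts a similar message.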
There is a library from Google, Comlink, that makes creating such an API easier. If you don't want to use a library, you can check this code: this git web terminal exposes isomorphic git functions to a worker using RPC-like code (it's inspired by Jason Miller's workerize).
Create worker:
https://github.com/jcubic/git/blob/gh-pages/js/main.js#L116
Worker code
https://github.com/jcubic/git/blob/gh-pages/js/git-worker.js
A quick search also turns up a project that looks promising, "Worker DOM". It should give you a DOM API in a worker (I'm almost sure it uses the solution I proposed), but I didn't check it and I'm not sure how it works.
With a bit of work you may even get jQuery working inside a worker; it would be a good project to make open source.
I would like to write a command-line tool that receives notifications from Google App Engine's Channel API. This seems to be quite straightforward thanks to open JavaScript VMs such as v8 and js. One problem with this approach, though, is that these VMs do not provide standard browser objects such as window and document, which the Channel API references. Running such code therefore gives you window/document/.. not found errors.
There seem to be two ways of circumventing this obstacle:
To write a lightweight header in javascript to emulate the behavior of the required objects.
To edit Google's javascript (/_ah/channel/jsapi) and eliminate references to such objects.
Does anyone know if there are existing implementations of these approaches, or know of a better idea? Furthermore, is there a clean, uncompressed version of the channel API client side javascript code available somewhere?
You can't edit the script used by /_ah/channel/jsapi -- it's only used when the channel is running against the dev app server. When running in production, that script redirects to https://talkgadget.google.com/talkgadget/channel.js
So you're left with emulating the required objects, or just using a hidden browser window. I would opt for the latter, since I think emulating all the DOM calls is going to get very difficult very quickly.
I would like to load a DOM using a document (in string form) or a URL, and then Execute javascript functions (including jquery selectors) against it. This would be totally server side, in process, no client/browser.
Basically I need to load the dom and then use jquery selectors and text() & type val() functions to extract strings from it. I don't really need to manipulate the dom.
I have looked at .NET JavaScript engines such as Jurassic and Jint, but neither supports loading a DOM, so they can't do what I need.
I would be willing to consider non .Net solutions (node.js, ruby, etc) if they exist, but would really prefer .Net.
edit
The below is a good answer, but currently I'm trying a different route, I'm attempting to port envjs to jurassic. If I can get that working I think it will do what I want, stay tuned....
The answer depends on what you are trying to do. If your goal is basically a complete web browser simulation, or a "headless browser," there are a number of solutions, but none of them (that I know of) exist cleanly in .NET. To mimic a browser, you need a javascript engine and a DOM. You've identified a few engines; I've found Jurassic to be both the most robust and fastest. The google chrome V8 engine is also very popular; the Neosis Javascript.NET project provides a .NET wrapper for it. It's not quite pure .NET since you have a non-.NET dependency, but it integrates cleanly and is not much trouble to use.
But as you've noted, you still need a DOM. In pure C# there is XBrowser, but it looks a bit stale. There are javascript-based representations of the entire browser DOM like jsdom, too. You could probably run jsdom in Jurassic, giving you a DOM simulation without a browser, all in C# (though likely very slowly!) It would definitely run just fine in V8. If you get outside the .NET realm, there are other better-supported solutions. This question discusses HtmlUnit. Then there's Selenium for automating actual web browsers.
Also, bear in mind that a lot of the work done around these tools is for testing. While that doesn't mean you couldn't use them for something else, they may not perform or integrate well for any kind of stable use in inline production code. If you are trying to basically do real-time HTML manipulation, then a solution mixing a lot of technologies that aren't widely used except for testing might be a poor choice.
If your need is actually HTML manipulation, and it doesn't really need to use Javascript but you are thinking more about the wealth of such tools available in JS, then I would look at C# tools designed for this purpose. For example HTML Agility Pack, or my own project CsQuery, which is a C# jQuery port.
If you are basically trying to take some code that was written for the client, but run it on a server -- e.g. for sophisticated/accelerated web scraping -- I'd search around using those terms. For example this question discusses this, with answers including PhantomJS, a headless webkit browser stack, as well as some of the testing tools I have already mentioned. For web scraping, I would imagine you can live without it all being in .NET, and that may be the only reasonable answer anyway.