This question already has answers here:
Is there a way to create out of DOM elements in Web Worker?
(10 answers)
Closed 7 years ago.
We all know we can spin up some web workers, give them some mundane tasks to do and get a response at some point, at which stage we normally parse the data and render it somehow to the user right.
Well... Can anyone please provide a relatively in-depth explanation as to why Web Workers do not have access to the DOM? Given a well thought out OO design approach, I just don't get why they shouldn't?
EDITED:
I know this is not possible and it won't have a practical answer, but I feel I needed a more in depth explanation. Feel free to close the question if you feel it's irrelevant to the community.
The other answers are correct; JS was originally designed to be single-threaded as a simplification, and adding multi-threading at this point has to be done very carefully. I wanted to elaborate a little more on thread safety in Web Workers.
When you send messages to workers using postMessage, one of three things happen,
The browser serializes the message into a JSON string, and de-serializes on the other end (in the worker thread). This creates a copy of the object.
The browser constructs a structured clone of the message, and passes it to the worker thread. This allows messaging using more complex data types than 1., but is essentially the same thing.
The browser transfers ownership of the message (or parts of it). This applies only for certain data types. It is zero-copy, but once the main context transfers the message to the worker context, it becomes unavailable in the main context. This is, again, to avoid sharing memory.
When it comes to DOM nodes,
is obviously not an option, since DOM nodes contain circular references,
could work, but in a read-only mode, since any changes made by the worker to the clone of the node will not be reflected in the UI, and
is unfortunately not an option since the DOM is a tree, and transferring ownership of a node would imply transferring ownership of the entire tree. This poses a problem if the UI thread needs to consult the DOM to paint the screen and it doesn't have ownership.
The entire world of Javascript in a browser was originally designed with a huge simplification - single threadedness (before there were webWorkers). This single assumption makes coding a browser a ton simpler because there can never be thread contention and can never be two threads of activity trying to modify the same objects or properties at the same time. This makes writing Javascript in the browser and implementing Javascript in the browser a ton simpler.
Then, along came a need for an ability to do some longer running "background" type things that wouldn't block the main thread. Well, the ONLY practical way to put that into an existing browser with hundreds of millions of pages of Javascript already out there on the web was to make sure that this "new" thread of Javascript could never mess up the existing main thread.
So, it was implemented in a way that the ONLY way it could communicate with the main thread or any of the objects that the main thread could modify (such as the entire DOM and nearly all host objects) was via messaging which, by its implementation is naturally synchronized and safe from thread contention.
Here's a related answer: Why doesn't JavaScript get its own thread in common browsers?
Web Workers do not have DOM access because DOM code is not thread-safe. By DOM code, I mean the code in the browser that handles your DOM calls.
Making thread-unsafe code safe is a huge project, and none of the major browsers are doing it.
Servo project is writing new browser from scratch, and I think they are writing their DOM code with thread-safety in mind.
Related
When I discovered that Node.js was built using the V8 JavaScript engine, I thought:
Great, web scraping will be easier as the page
will be rendered like in the browser, with a
"native" DOM supporting XPath and any AJAX calls on
the page executed.
Why doesn't it have a native DOM when it uses the same JavaScript engine as Chrome?
Why doesn't it have a mode to run JavaScript in retrieved pages?
What am I not understanding about JavaScript engines vs the engine in a web browser?
Many thanks!
The DOM is the DOM, and the JavaScript implementation is simply a separate entity. The DOM represents a set of facilities that a web browser exposes to the JavaScript environment. There's no requirement however that any particular JavaScript runtime will have any facilities exposed via the global object.
What Node.js is is a stand-alone JavaScript environment completely independent of a web browser. There's no intrinsic link between web browsers and JavaScript; the DOM is not part of the JavaScript language or specification or anything.
I use the old Rhino Java-based JavaScript implementation in my Java-based web server. That environment also has nothing at all to do with any DOM. It's my own application that's responsible for populating the global object with facilities to do what I need it to be able to do, and it's not a DOM.
Note that there are projects like jsdom if you want a virtual DOM in your Node project. Because of its very nature as a server-side platform, a DOM is a facility that Node can do without and still make perfect sense for a wide variety of server applications. That's not to say that a DOM might not be useful to some people, but it's just not in the same category of services as things like process control, I/O, networking, database interop, and so on.
There may be some "official" answer to the question "why?" out there, but it's basically just the business of those who maintain Node (the Node Foundation now). If some intrepid developer out there decides that Node should ship by default with a set of modules to support a virtual DOM, and successfully works and works and makes that happen, then Node will have a DOM.
P.S: When reading this question I was also wondering if V8 (node.js is built on top of this) had a DOM
Why when it uses the same JS engine as Chrome doesn't it have a native
DOM?
But I searched google and found Google's V8 page which recites the following:
JavaScript is most commonly used for client-side scripting in a
browser, being used to manipulate Document Object Model (DOM) objects
for example. The DOM is not, however, typically provided by the
JavaScript engine but instead by a browser. The same is true of
V8—Google Chrome provides the DOM. V8 does however provide all the
data types, operators, objects and functions specified in the ECMA
standard.
node.js uses V8 and not Google Chrome.
Likewise, why doesn't it have a mode to run JS in retrieved pages?
I also think we don't really need it that bad. Ryan Dahl created node.js as one man (single programmer). Maybe now he (his team) will develop this, but I was already extremely amazed by the amount of code he produced (crazy). He wanted to make a non-blocking easy/efficient library, which I think he did a mighty good job at.
But then again, another developer created a module which is pretty good and actively developed (today) at https://github.com/tmpvar/jsdom.
What am I not understanding about Javascript engines vs the engine in
a web browser? :)
Those are different things as is hopefully clear from the quote above.
The Document Object Model (DOM in short) is a programming interface for HTML and XML documents and it represents the page so that programs can change the document structure, style, and content. More on this subject.
The necessary distinction between client-side (browser) and server-side (Node.js) and their main goals:
Client-side: accessing and displaying information of the web
Server-side: providing stable and reliable ways to deliver web information
Why is there no DOM in Node.js be default?
By default, Node.js doesn't have access, nor have any knowledge about the actual DOM in your own browser. Node.js just delivers the data, that will be used by your own browser to process and render the whole website, the DOM included. The server provides the data to your browser to use and process. That is the intended way.
Why wouldn't you want to access the DOM in Node.js?
Accessing your browser's actual DOM using Node.js would be just simply out of the goal of the server. Your own browser's role is to display the data coming from the server. However it is certainly possible and there are multiple solutions in different level of depths and varieties to pre-render, manipulate or change the DOM using AJAX calls. We'll see what future trends will bring.
Why would you want to access the DOM in Node.js?
By default, you shouldn't access your own, actual DOM (at least some data of it) using Node.js. Client-side and server-side are separated in terms of role, functionality, and responsibility based on years of experience and knowledge. Although there are several situations, where there are solid reasons to do so:
Gathering usage data (A/B testing, UI/UX efficiency and feedback)
Headless testing (Development, automation, web-scraping)
How can you access the DOM in Node.js?
jsdom: pure-JavaScript implementation, good for testing your own DOM/browser-related project
cheerio: great solution if you like/often use jQuery
puppeteer: Google's own way to provide headless testing using Google Chrome
own solution (your possible future project link here)
Although these solutions do not provide a way to access your browser's own, actual DOM by default, but you can create a project to send some form of data about your DOM to the server, then use/render/manipulate that data based on your needs.
...and yes, web-scraping and web development in terms of tools and utilities became more sophisticated and certainly easier in several fields.
node.js chose not to include it in their standard library. For any functionality, there is an inevitable tradeoff between comprehensiveness, scalability, and maintainability.
That doesn't mean it's not potentially useful. There is at least one JavaScript DOM implementation intended for NodeJS (among other CommonJS implementations).
You seem to have a flawed assumption that V8 and the DOM are inextricably related, that's not the case. The DOM is actually handled by Webkit, V8 doesn't handle the DOM, it handles Javascript calls to the DOM. Don't let this discourage you, Node.js has carved out a significant niche in the realtime server market, but don't let anybody tell you it's just for servers. Node makes it possible to build almost anything with JavaScript.
It is possible to do what you're talking about. For example there is the very good jsdom library if you really need access to the DOM, and node-htmlparser, there are also some really good scraping libraries that take advantage of these like apricot.
2018 answer: mainly for historical reasons, but this may change in future.
Historically, very little DOM manipulation was done on the server. Addiotinally, as other answers allude, the JS stdlib and the DOM are seperate libraries - if you're using node, for, say, Unix scripting, then HTMLElement and NodeList etc aren't really relevant to that.
However: server-side DOM manipulation is now a very common part of delivering web apps. Web servers need to understand the structure of pages, and, if asked to render a resource as HTML, deliver HTML content that reflects the initial state of a web application. This means web apps load much faster than if the server simply delivers a stub page and has the browsers then do the work of filling in the real content. Currently this is done with JSDom and similar, but in the same way node has Request and Response objects built in, having DOM functions maintained as part of the stdlib would help with this task.
Javascript != browser. Javascript as a language is not tied to browsers; node.js is simply an implementation of Javascript that is intended for servers, not browsers. Hence no DOM.
If you read DOM as 'linked objects immediately accessible from my script' then the answer 'it does, but it's very different from set of objects available from web document script'. The main reason is that node is 'evented I/O for V8', not 'HTML tree objects for V8'
Node is a runtime environment, it does not render a DOM like a browser.
Because there isn't a DOM. DOM stands for Document Object Model. There is no document in Node, so not DOM to manipulate it. That is definitively a browser thing.
You can use a library like cheerio though which gives you some simple DOM manipulation.
Node is server-level JavaScript. It's just the language applied to a basic system API, more like C++ or Java.
It seems people have answered 'why' but not how. A quick answer of how is that in a web browser, a document object is exposed (hence DOM , document object model). On windows this object is called document object. You can refer to this page and look at the methods it exposes which are for handling HTML documents like createElement. I don't use node.js or haven't done COM programming in a while but I'd imagine you could use DOM in node.js by simply calling the COM object IHTMLDocument3. Of course for other platforms like Mac OS X or Linux you would probably have to use something from their OS api. This should allow you to easily build a webpage server side using DOM, or to scrape incoming web pages.
Node.js is for serverside programming. There is no DOM to be rendered in the server.
1) What does it mean for it to have a D ocument O bject M odel? There's no document to represent.
2) You're most of the time you're not retrieving pages. You can, but most Node apps probably won't be.
3) Without a document and a browser, Javascript is just another programming language. So you may ask why there isn't a DOM in C# or Java
I was thinking about this today and I realized I don't have a clear picture here.
Here are some statements I think to be true (please correct me if I'm wrong):
the DOM is a collection of interfaces specified by W3C.
when parsing HTML source code, the browser creates a DOM tree which has nodes that implement DOM interfaces.
the ECMAScript spec has no reference of browser host objects (DOM, BOM, HTML5 APIs etc.).
how the DOM is actually implemented depends on browser internals and is probably different among most of them.
modern JS interpreters use JIT to improve the code performance and translate it to bytecode
I am curious about what happens behind the scenes when I call document.getElementById('foo'). Does the call get delegated to browser native code by the interpreter or does the browser have JS implementations of all host objects? Do you know about any optimizations they do in regard to this?
I read this overview of browser internals but it didn't mention anything about this. I will look through the Chrome and FF source when I have time, but I thought about asking here first. :)
All of your bullet points are correct, except:
modern JS interpreters use JIT to improve the code performance and translate it to bytecode
should be "...and translate it to native code". SpiderMonkey (the JS engine in Firefox) worked as a bytecode interpreter for a long time before the current JS speed arms race.
On Mozilla's JS-to-DOM bridge:
The host objects are typically implemented in C++, though there is an experiment underway to implement DOM in JS. So when a web page calls document.getElementById('foo'), the actual work of retrieving the element by its ID is done in a C++ method, as hsivonen noted.
The specific way the underlying C++ implementation gets called depends on the API and also changed over time (note that I'm not involved in the development, so might be wrong about some details, here's a blog post by jst, who was actually involved in creating much of this code):
At the lowest level every JS engine provides APIs to define host objects. For example, the browser can call JS_DefineFunctions (as demonstrated in the SpiderMonkey User Guide) to let the engine know that whenever script calls a function with the specified name, a provided C callback should be called. Same for other aspects of the host objects (e.g. enumeration, property getters/setters, etc.)
For the core ECMAScript functionality and in some tricky DOM cases the JS engine/the browser uses these APIs directly to define host objects and their behaviors, but it requires a lot of common boilerplate code for e.g. checking parameter types, converting them to the appropriate C++ types, error handling etc.
For reasons I won't go into, let's say historically, Mozilla made heavy use of XPCOM for many of its objects, including much of the DOM. One feature of XPCOM is its binding to JS called XPConnect. Among other things, XPConnect can take an interface definition in IDL (such as nsIDOMDocument; or more precisely its compiled representation), expose an object with the specified properties to the script, and later, when a script calls getElementById, perform the necessary parameter checks/conversions and route the call directly to a C++ method (nsDocument::GetElementById(const nsAString& aId, nsIDOMElement** aReturn))
The way XPConnect worked was quite inefficient: it registered generic functions as callbacks to be executed when a script accesses a host object, and these generic functions figured out what they needed to do in every particular case dynamically. This post about quickstubs walks you through one example.
"Quick stubs" mentioned in the previous link is a way to optimize JS->C++ calls time by trading some code size for it: instead of always using generic C++ functions that know how to make any kind of call, the specialized code is automatically generated at the Firefox build time for a pre-defined list of "hot" calls.
Later on the JIT (tracemonkey at that time) was taught to generate the code calling C++ methods as part of the native code generated for "hot" paths in JS. I'm not sure how the newer JITs (jaegermonkey) work in this regard.
With "paris bindings" the objects are exposed to webpage JS without any reliance on XPConnect, instead generating all the necessary glue JSClass code based on WebIDL (instead of XPCOM-era IDL). See also posts by developers who worked on this: jst and khuey. Also see How is the web-exposed DOM implemented?
I'm fuzzy on details of the three last points in particular, so take it with a grain of salt.
The most recent improvements are listed as dependencies of bug 622298, but I don't follow them closely.
JS calls to DOM methods like getElementById cause the JS engine to call into the C++ code that implements the DOM. For example, in Firefox, the call ends up in nsDocument::GetElementById(const nsAString& aId, nsIDOMElement** aReturn).
As you can see, Firefox maintains a hashtable that maps ids to elements in C++ as an optimization in this case, so it doesn't walk the whole DOM tree looking for the id.
The DOM is implemented as a language-independent library pretty much in all major browser implementations, which means it's in a different library from the Javascript engine. For example in IE, the JS engine is implemented in jscript.dll while the DOM is implemented in mshtml.dll. Safari has Nitro(JS) and WebCore(DOM). Chrome has V8(JS) and WebCore(DOM), and Firefox has SpiderMonkey/TraceMonkey(JS) and Gecko(DOM).
What this means is that anytime your JS has to access the DOM, it has to reach over to the DOM library - which is inherently slow because of all the marshaling that has to take place. An analogy that has been used is 2 pieces of land connected by a toll bridge, any time you touch the DOM, you must cross over the bridge and cross back - paying a performance toll.
References
Video: Building High Performance Web Applications and Sites
Book: High Performance Javascript (Chapter 3 on the DOM)
I want to extract data from an HTML string in a Web Worker.
I want to clarify that I do not want to manipulate the DOM. I am sending an HTML string to the Web Worker, which then should extract data from the HTML, and then return the extracted data.
In the browser I could do:
var html = $("<body><div>...more html...</div></body>");
var extractedText = $(".selector", html).text();
My Question:
What is the equivalent of the above code in a Web Worker environment if given the same HTML string? There's no jQuery, no DOMParser, no querySelector.. in the Web Worker etc. Are there alternatives?
The Why:
I'm doing on the fly scraping of pages in a browser and don't want to block the UI thread because it's pretty heavy work.
I've looked at jsdom, cheerio, etc. but could not figure out how to make them work.
Regarding suggested duplicates:
I have reviewed both of the suggested duplicates and they are ones that I have read before while searching for answers to this question. They address XML parsing and not HTML parsing, and also do not address how to use CSS-selection inside Web Workers.
Short answer:
You cannot do any sort of HTML/CSS manipulation, including querying, in a web worker.
Long answer:
There are many DOMs. There's the main DOM, which is rendered on the page, but everything that a browser does that touches an HTML or XML tree, including querySelector and friends, requires the browser to build a DOM for that tree. (see also: DocumentFragment)
One of Mozilla's developers talked a bit about some reasons why they can't build any DOMs on worker threads (found via this question, on nabble):
You're assuming that none of the DOM implementation code uses any sort of non-DOM objects, ever, or that if it does those objects are fully threadsafe. That's just not not the case, at least in Gecko.
The issue in this case is not the same DOM object being touched on multiple threads. The issue is two DOM objects on different threads both touching some global third object.
For example, the XML parser has to do some things that in Gecko can only be done on the main thread (DTD loading, offhand; there are a few others that I've seen before but don't recall offhand).
So. We obviously can't use querySelector, createElement, or anything useful in a worker, so what can we do?
Build our own DOM parser/selector modules, of course!
Not really. Try including a copy of htmlparser2 in your worker, maybe via browserify (making that work is its own question). With that, and with CSSselect to allow querySelector-like selecting, you should be ready to go.
Admittedly, you can't use jQuery with those, but for simple querying needs they (and querySelector/querySelectorAll) should be more than sufficient.
You can make dom selection inside worker but you will need to create an API that will use post message to change data between main tread and worker (because you can't use DOM directly in worker). The limitation is that you will need to pass strings between, so you can't return Dom Nodes, unless you have some code that will create DOM nodes in worker based on data from main tread.
Because JavaScript is dynamic it should be easy to create dynamic wrapper that will create all those functions for you, and will allow to call querySelelector('.foo') and expose all the Dom APIs. With proxy objects you can even allow to use querySelelector('.foo').innerHTML = 'hello'; in worker with proper code.
There is library that make creating such API easier Comlink from Google. If you don't want to use library you can check this code, this git web terminal that expose isomorphic git functions using RPC like code to worker (It's inspired by Jason's Miller
workerize).
Create worker:
https://github.com/jcubic/git/blob/gh-pages/js/main.js#L116
Worker code
https://github.com/jcubic/git/blob/gh-pages/js/git-worker.js
and quick search give this project that looks promising "Worker DOM", it should give you DOM api in worker (that I'm almost sure use solution I proposed) but I didn't check it and I'm not sure how it works.
With some bit of work you may even have working jQuery inside worker, it would good project to make open source.
Not enough that JavaScript isn't multithreaded, apparently JavaScript doesn't even get its own but shares a thread with a load of other stuff. Even in most modern browsers JavaScript is typically in the same queue as painting, updating styles, and handling user actions.
Why is that?
From my experience an immensely improved user experience could be gained if JavaScript ran on its own thread, alone by JS not blocking UI rendering or the liberation of intricate or limited message queue optimization boilerplate (yes, also you, webworkers!) which the developer has to write themselves to keep the UI responsive all over the place when it really comes down to it.
I'm interested in understanding the motivation which governs such a seemingly unfortunate design decision, is there a convincing reason from a software architecture perspective?
User Actions Require Participation from JS Event Handlers
User actions can trigger Javascript events (clicks, focus events, key events, etc...) that participate and potentially influence the user action so clearly the single JS thread can't be executing while user actions are being processed because, if so, then the JS thread couldn't participate in the user actions because it is already doing something else. So, the browser doesn't process the default user actions until the JS thread is available to participate in that process.
Rendering
Rendering is more complicated. A typical DOM modification sequence goes like this: 1) DOM modified by JS, layout marked dirty, 2) JS thread finishes executing so the browser now knows that JS is done modifying the DOM, 3) Browser does layout to relayout changed DOM, 4) Browser paints screen as needed.
Step 2) is important here. If the browser did a new layout and screen painting after every single JS DOM modification, the whole process could be incredibly inefficient if the JS was actually going to make a bunch of DOM modifications. Plus, there would be thread synchronization issues because if you had JS modifying the DOM at the same time as the browser was trying to do a relayout and repaint, you'd have to synchronize that activity (e.g. block somebody so an operation could complete without the underlying data being changed by another thread).
FYI, there are some work-arounds that can be used to force a relayout or to force a repaint from within your JS code (not exactly what you were asking, but useful in some circumstances).
Multiple Threads Accessing DOM Really Complex
The DOM is essentially a big shared data structure. The browser constructs it when the page is parsed. Then loading scripts and various JS events have a chance to modify it.
If you suddenly had multiple JS threads with access to the DOM running concurrently, you'd have a really complicated problem. How would you synchronize access? You couldn't even write the most basic DOM operation that would involve finding a DOM object in the page and then modifying it because that wouldn't be an atomic operation. The DOM could get changed between the time you found the DOM object and when you made your modification. Instead, you'd probably have to acquire a lock on at least a sub-tree in the DOM preventing it from being changed by some other thread while you were manipulating or searching it. Then, after making the modifications, you'd have to release the lock and release any knowledge of the state of the DOM from your code (because as soon as you release the lock, some other thread could be changing it). And, if you didn't do things correctly, you could end up with deadlocks or all sorts of nasty bugs. In reality, you'd have to treat the DOM like a concurrent, multi-user datastore. This would be a significantly more complex programming model.
Avoid Complexity
There is one unifying theme among the "single threaded JS" design decision. Keep things simple. Don't require an understanding of a multiple-threaded environment and thread synchronization tools and debugging of multiple threads in order to write solid, reliable browser Javascript.
One reason browser Javascript is a successful platform is because it is very accessible to all levels of developers and it relatively easy to learn and to write solid code. While browser JS may get more advanced features over time (like we got with WebWorkers), you can be absolutely sure that these will be done in a way that simple things stay simple while more advanced things can be done by more advanced developers, but without breaking any of the things that keep things simple now.
FYI, I've written a multi-user web server application in node.js and I am constantly amazed at how much less complicated much of the server design is because of single threaded nature of nodejs Javascript. Yes, there are a few things that are more of a pain to write (learn promises for writing lots of async code), but wow the simplifying assumption that your JS code is never interrupted by another request drastically simplifies the design, testing and reduces the hard to find and fix bugs that concurrency design and coding is always fraught with.
Discussion
Certainly the first issue could be solved by allowing user action event handlers to run in their own thread so they could occur any time. But, then you immediately have multi-threaded Javascript and now require a whole new JS infrastructure for thread synchronization and whole new classes of bugs. The designers of browser Javascript have consistently decided not to open that box.
The Rendering issue could be improved if desired, but at a significant complication to the browser code. You'd have to invent some way to guess when the running JS code seems like it is no longer changing the DOM (perhaps some number of ms go by with no more changes) because you have to avoid doing a relayout and screen paint immediately on every DOM change. If the browser did that, some JS operations would become 100x slower than they are today (the 100x is a wild guess, but the point is they'd be a lot slower). And, you'd have to implement thread synchronization between layout, painting and JS DOM modifications which is doable, but complicated, a lot of work and a fertile ground for browser implementation bugs. And, you have to decide what to do when you're part-way through a relayout or repaint and the JS thread makes a DOM modification (none of the answers are great).
I use JavaScript for rendering 20 tables of 100 rows each.
The data for each table is provided by controller as JSON.
Each table is split into section that have "totals" and have some other JavaScript logic code. Some totals are outside of the table itself.
As a result JavaScript blocks browser for a couple of seconds (especially in IE6) :(
I was consideting to use http://code.google.com/p/jsworker/,
however Google Gears Workers (I guess workers in general) will not allow me to make changes to DOM at the worker code, and also it seems to me that I can not use jQuery inside jsworker worker code. (Maybe I am wrong here?).
This issue seems to be fundamental to the JavaScript coding practice, can you share with me your thoughts how to approach it?
Workers can only communicate with the page's execution by passing messages. They are unable to interact directly with the page because this would introduce enormous difficulties.
You will need to optimise your DOM manipulation code to speed up the processing time. It's worth consulting google for good practices.
One way to speed up execution is to build the table outside of the DOM and insert it into the document only when it is completed. This stops the browser having to re-draw on every insertion, which is where a lot of the time is spent.
You are not supposed to change the UI in a backgroundworker in general. You should always signal the main thread that the worker is finished, and return a result to the main thread whom then can process that.
HTML5 includes an async property for tags, and writing to your UI from there causes a blank page with just the stuff you wanted to write.
So you need another approach there.
I am not a JS guru, so I can't help you with an implementation, but at least you now have the concept :)
If you want to dig into the world of browser performance, the High Performance Web Sites blog has heaps of great info — including when javascripts block page rendering, and best practice to avoid such issues.