I want to extract data from an HTML string in a Web Worker.
I want to clarify that I do not want to manipulate the DOM. I am sending an HTML string to the Web Worker, which then should extract data from the HTML, and then return the extracted data.
In the browser I could do:
var html = $("<body><div>...more html...</div></body>");
var extractedText = $(".selector", html).text();
My Question:
What is the equivalent of the above code in a Web Worker environment, given the same HTML string? There's no jQuery, no DOMParser, no querySelector, etc. in a Web Worker. Are there alternatives?
The Why:
I'm doing on-the-fly scraping of pages in a browser, and I don't want to block the UI thread because it's pretty heavy work.
I've looked at jsdom, cheerio, etc. but could not figure out how to make them work.
Regarding suggested duplicates:
I have reviewed both of the suggested duplicates and they are ones that I have read before while searching for answers to this question. They address XML parsing and not HTML parsing, and also do not address how to use CSS-selection inside Web Workers.
Short answer:
You cannot do any sort of HTML/CSS manipulation, including querying, in a web worker.
Long answer:
There are many DOMs. There's the main DOM, which is rendered on the page, but everything that a browser does that touches an HTML or XML tree, including querySelector and friends, requires the browser to build a DOM for that tree. (see also: DocumentFragment)
One of Mozilla's developers talked a bit about some reasons why they can't build any DOMs on worker threads (found via this question, on nabble):
You're assuming that none of the DOM implementation code uses any sort of non-DOM objects, ever, or that if it does those objects are fully threadsafe. That's just not the case, at least in Gecko.
The issue in this case is not the same DOM object being touched on multiple threads. The issue is two DOM objects on different threads both touching some global third object.
For example, the XML parser has to do some things that in Gecko can only be done on the main thread (DTD loading, offhand; there are a few others that I've seen before but don't recall offhand).
So. We obviously can't use querySelector, createElement, or anything useful in a worker, so what can we do?
Build our own DOM parser/selector modules, of course!
Not really. Try including a copy of htmlparser2 in your worker, maybe via browserify (making that work is its own question). With that, and with CSSselect to allow querySelector-like selecting, you should be ready to go.
Admittedly, you can't use jQuery with those, but for simple querying needs they (and querySelector/querySelectorAll) should be more than sufficient.
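For example, here's a minimal sketch of that approach, assuming htmlparser2, css-select, and domutils have been bundled into the worker script (e.g. via browserify); exact function names may vary between library versions:

// worker.js - parse an HTML string and run a CSS selector against it,
// entirely off the main thread.
var htmlparser2 = require('htmlparser2');
var CSSselect = require('css-select');
var DomUtils = require('domutils');

self.onmessage = function (e) {
  // Build a lightweight DOM-like tree from the HTML string.
  var dom = htmlparser2.parseDocument(e.data.html);
  // querySelectorAll-style selection over that tree.
  var matches = CSSselect.selectAll(e.data.selector, dom.children);
  // Only serializable data can cross the worker boundary,
  // so post back the extracted text, not the nodes.
  self.postMessage(matches.map(function (el) {
    return DomUtils.textContent(el);
  }));
};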
You can do DOM selection inside a worker, but you will need to create an API that uses postMessage to exchange data between the main thread and the worker (because you can't use the DOM directly in a worker). The limitation is that you can only pass serializable data between them, so you can't return DOM nodes, unless you have some code that creates DOM nodes in the worker based on data from the main thread.
Because JavaScript is dynamic, it should be easy to create a dynamic wrapper that creates all those functions for you and lets you call querySelector('.foo'), exposing all the DOM APIs. With Proxy objects you can even allow querySelector('.foo').innerHTML = 'hello'; in a worker, given the proper code.
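For illustration, a minimal sketch of that messaging pattern (the function and message names here are hypothetical, not a library API):

// main.js - answer selector requests coming from the worker.
var worker = new Worker('worker.js');
worker.onmessage = function (e) {
  if (e.data.type === 'query') {
    var el = document.querySelector(e.data.selector);
    // Only serializable data can go back, never the node itself.
    worker.postMessage({ id: e.data.id, text: el ? el.textContent : null });
  }
};

// worker.js - a promise-based wrapper around the round trip.
var pending = {}, nextId = 0;
function querySelectorText(selector) {
  return new Promise(function (resolve) {
    var id = nextId++;
    pending[id] = resolve;
    self.postMessage({ type: 'query', selector: selector, id: id });
  });
}
self.onmessage = function (e) {
  if (e.data.id in pending) {
    pending[e.data.id](e.data.text);
    delete pending[e.data.id];
  }
};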
There is a library that makes creating such an API easier: Comlink from Google. If you don't want to use a library, you can check this code from this git web terminal, which exposes isomorphic-git functions to the worker using RPC-like code (it's inspired by Jason Miller's workerize).
Create worker:
https://github.com/jcubic/git/blob/gh-pages/js/main.js#L116
Worker code
https://github.com/jcubic/git/blob/gh-pages/js/git-worker.js
A quick search also turned up a project that looks promising, "Worker DOM"; it should give you the DOM API in a worker (I'm almost sure it uses the solution I proposed), but I didn't check it and I'm not sure how it works.
With a bit of work you may even get jQuery working inside a worker; it would be a good project to make open source.
Related
When I discovered that Node.js was built using the V8 JavaScript engine, I thought:
Great, web scraping will be easier as the page will be rendered like in the browser, with a "native" DOM supporting XPath and any AJAX calls on the page executed.
Why doesn't it have a native DOM when it uses the same JavaScript engine as Chrome?
Why doesn't it have a mode to run JavaScript in retrieved pages?
What am I not understanding about JavaScript engines vs the engine in a web browser?
Many thanks!
The DOM is the DOM, and the JavaScript implementation is simply a separate entity. The DOM represents a set of facilities that a web browser exposes to the JavaScript environment. There's no requirement, however, that any particular JavaScript runtime have such facilities exposed via the global object.
Node.js is a stand-alone JavaScript environment, completely independent of a web browser. There's no intrinsic link between web browsers and JavaScript; the DOM is not part of the JavaScript language or specification or anything.
I use the old Rhino Java-based JavaScript implementation in my Java-based web server. That environment also has nothing at all to do with any DOM. It's my own application that's responsible for populating the global object with facilities to do what I need it to be able to do, and it's not a DOM.
Note that there are projects like jsdom if you want a virtual DOM in your Node project. Because of its very nature as a server-side platform, a DOM is a facility that Node can do without and still make perfect sense for a wide variety of server applications. That's not to say that a DOM might not be useful to some people, but it's just not in the same category of services as things like process control, I/O, networking, database interop, and so on.
There may be some "official" answer to the question "why?" out there, but it's basically just the business of those who maintain Node (the Node Foundation now). If some intrepid developer out there decides that Node should ship by default with a set of modules to support a virtual DOM, and successfully works to make that happen, then Node will have a DOM.
P.S.: When reading this question I was also wondering if V8 (Node.js is built on top of it) had a DOM.
Why when it uses the same JS engine as Chrome doesn't it have a native DOM?
But I searched google and found Google's V8 page which recites the following:
JavaScript is most commonly used for client-side scripting in a browser, being used to manipulate Document Object Model (DOM) objects for example. The DOM is not, however, typically provided by the JavaScript engine but instead by a browser. The same is true of V8—Google Chrome provides the DOM. V8 does however provide all the data types, operators, objects and functions specified in the ECMA standard.
Node.js uses V8, not Google Chrome.
Likewise, why doesn't it have a mode to run JS in retrieved pages?
I also think we don't really need it that badly. Ryan Dahl created Node.js as a single programmer. Maybe now his team will develop this, but I was already extremely amazed by the amount of code he produced. He wanted to make a non-blocking, easy, efficient library, and I think he did a mighty good job at that.
But then again, another developer created a module which is pretty good and actively developed (today) at https://github.com/tmpvar/jsdom.
What am I not understanding about Javascript engines vs the engine in a web browser? :)
Those are different things as is hopefully clear from the quote above.
The Document Object Model (DOM in short) is a programming interface for HTML and XML documents and it represents the page so that programs can change the document structure, style, and content. More on this subject.
The necessary distinction between client-side (browser) and server-side (Node.js) and their main goals:
Client-side: accessing and displaying information of the web
Server-side: providing stable and reliable ways to deliver web information
Why is there no DOM in Node.js by default?
By default, Node.js doesn't have access to, nor any knowledge of, the actual DOM in your browser. Node.js just delivers the data that your browser will use to process and render the whole website, the DOM included. The server provides the data for your browser to use and process. That is the intended way.
Why wouldn't you want to access the DOM in Node.js?
Accessing your browser's actual DOM using Node.js would simply be outside the server's role. Your browser's role is to display the data coming from the server. However, it is certainly possible, and there are multiple solutions of different depths and varieties to pre-render, manipulate, or change the DOM using AJAX calls. We'll see what future trends bring.
Why would you want to access the DOM in Node.js?
By default, you shouldn't access your own, actual DOM (or at least some data of it) using Node.js. Client-side and server-side are separated in terms of role, functionality, and responsibility based on years of experience and knowledge. However, there are several situations where there are solid reasons to do so:
Gathering usage data (A/B testing, UI/UX efficiency and feedback)
Headless testing (Development, automation, web-scraping)
How can you access the DOM in Node.js?
jsdom: pure-JavaScript implementation, good for testing your own DOM/browser-related project
cheerio: great solution if you like/often use jQuery (see the sketch after this list)
puppeteer: Google's own way to provide headless testing using Google Chrome
own solution (your possible future project link here)
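For instance, a minimal cheerio sketch of the extraction from the original question (the API below follows cheerio's documented jQuery-like interface):

const cheerio = require('cheerio');

const $ = cheerio.load('<body><div class="selector">Hello</div></body>');
console.log($('.selector').text()); // prints "Hello"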
Although these solutions do not provide a way to access your browser's own, actual DOM by default, you can create a project to send some form of data about your DOM to the server, then use/render/manipulate that data based on your needs.
...and yes, web scraping and web development have become more sophisticated, and certainly easier, in several fields in terms of tools and utilities.
Node.js chose not to include it in the standard library. For any functionality, there is an inevitable tradeoff between comprehensiveness, scalability, and maintainability.
That doesn't mean it's not potentially useful. There is at least one JavaScript DOM implementation intended for NodeJS (among other CommonJS implementations).
You seem to have a flawed assumption that V8 and the DOM are inextricably related; that's not the case. The DOM is actually handled by WebKit; V8 doesn't handle the DOM, it handles JavaScript calls to the DOM. Don't let this discourage you, Node.js has carved out a significant niche in the realtime server market, but don't let anybody tell you it's just for servers. Node makes it possible to build almost anything with JavaScript.
It is possible to do what you're talking about. For example, there is the very good jsdom library if you really need access to the DOM, and node-htmlparser. There are also some really good scraping libraries that take advantage of these, like apricot.
2018 answer: mainly for historical reasons, but this may change in future.
Historically, very little DOM manipulation was done on the server. Additionally, as other answers allude, the JS stdlib and the DOM are separate libraries - if you're using Node for, say, Unix scripting, then HTMLElement and NodeList etc. aren't really relevant to that.
However: server-side DOM manipulation is now a very common part of delivering web apps. Web servers need to understand the structure of pages, and, if asked to render a resource as HTML, deliver HTML content that reflects the initial state of a web application. This means web apps load much faster than if the server simply delivers a stub page and has the browsers then do the work of filling in the real content. Currently this is done with JSDom and similar, but in the same way node has Request and Response objects built in, having DOM functions maintained as part of the stdlib would help with this task.
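As a concrete illustration, a minimal jsdom sketch of that server-side rendering step (using jsdom's documented JSDOM API):

const { JSDOM } = require('jsdom');

// Build the initial state of the page on the server...
const dom = new JSDOM('<!DOCTYPE html><div id="app"></div>');
dom.window.document.getElementById('app').textContent = 'initial app state';

// ...and deliver real HTML instead of a stub page.
console.log(dom.serialize());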
Javascript != browser. Javascript as a language is not tied to browsers; node.js is simply an implementation of Javascript that is intended for servers, not browsers. Hence no DOM.
If you read "DOM" as "linked objects immediately accessible from my script", then the answer is: it does, but it's very different from the set of objects available to a web document's script. The main reason is that Node is "evented I/O for V8", not "HTML tree objects for V8".
Node is a runtime environment, it does not render a DOM like a browser.
Because there isn't a DOM. DOM stands for Document Object Model. There is no document in Node, so there is no DOM to manipulate. That is definitively a browser thing.
You can use a library like cheerio though which gives you some simple DOM manipulation.
Node is server-level JavaScript. It's just the language applied to a basic system API, more like C++ or Java.
It seems people have answered "why" but not "how". A quick answer of how: in a web browser, a document object is exposed (hence DOM, Document Object Model). On Windows this object is called the document object. You can refer to this page and look at the methods it exposes, which are for handling HTML documents, like createElement. I don't use Node.js and haven't done COM programming in a while, but I'd imagine you could use the DOM in Node.js by simply calling the COM object IHTMLDocument3. Of course, for other platforms like Mac OS X or Linux you would probably have to use something from their OS APIs. This should allow you to easily build a webpage server-side using the DOM, or to scrape incoming web pages.
Node.js is for serverside programming. There is no DOM to be rendered in the server.
1) What does it mean for it to have a Document Object Model? There's no document to represent.
2) Most of the time you're not retrieving pages. You can, but most Node apps probably won't be.
3) Without a document and a browser, Javascript is just another programming language. So you may ask why there isn't a DOM in C# or Java.
I'm a developer on a very large, many-page web app. We're making a push to improve the sanity of our JavaScript, so I'd like to introduce a module loader. Currently everything is done the old-fashioned way of slapping a bunch of script tags on a page and hoping for the best. Some restrictions are:
Our HTML pages are templated, inherited, and composed, such that the final page sent to the client brings together pieces from many different source HTML files. Any of these files may depend on JavaScript resources introduced higher up the chain.
This must be achievable piecemeal. The code base is far too large to convert everything at once, so I'm looking for a good solution for new pages, and for migrating over existing pages as needed.
This solution needs to coexist on the same page as existing, non-module JavaScript. Some things (like menus and analytics) exist on every page, and I can't remove global jQuery, for instance, as it's used all over the place.
Basically I'd like a way to carve out safe spaces that use modules and modern dependency management. Most tutorials and articles I've read all target new projects.
RequireJS looks like a decent option, and I've played with it a bit. Its fully async nature is a hindrance in some cases, though - I'd like to use RequireJS to package up global resources, but I can't do that if I can't control the execution order. Also, it sucks to have the main function of a page (say, rendering a table) happen after secondary things like analytics calls.
Any suggestions?
That's a lot to ask for. :-) So here are my thoughts on this:
Usually, when you load a web page, everything that was being displayed is wiped so that the new incoming information does not interfere with what was already there. So you really have only a couple of options (that I know of) where you can keep everything in memory in the browser and yet load new pages as needed.
The first (and easiest) way is to use FRAMES. These are not used much anymore from what I've seen, but each FRAME allows you to display a different web page. So what you do is make one frame use 100% and the second one use "*" so it isn't seen. You can then use the unseen frame to control what is going on in the 100% frame.
The second way is to use JavaScript to control everything. In this scenario you create a namespace area and then use jQuery's getScript() function to load in each page as you need it. I did this once by generating the HTML, converting it to hex via bin2hex(), sending it back as a JavaScript function, and then unhexing it in the browser and applying it to the web page. If it was to completely replace the web page, you do that; if it was an update to a web page, you just insert it into the HTML already on the page. Any new JavaScript function always attaches itself to the namespace. If a function is no longer needed, it can be removed from the namespace to free up memory.
Why convert the HTML to hex and then send it? Because then you can use whatever characters you want (like single and double quotes) and it doesn't affect the JavaScript at all. There are fairly fast hex routines available on GitHub (see my toHex and fromHex routines).
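A minimal sketch of that hex round trip (toHex/fromHex here are illustrative stand-ins, not the GitHub routines mentioned above; this simple version assumes code points below 0x100):

// Encode each character as two hex digits so quotes and other
// special characters can't break the surrounding JavaScript.
function toHex(str) {
  return Array.from(str).map(function (c) {
    return c.charCodeAt(0).toString(16).padStart(2, '0');
  }).join('');
}
function fromHex(hex) {
  var out = '';
  for (var i = 0; i < hex.length; i += 2) {
    out += String.fromCharCode(parseInt(hex.substr(i, 2), 16));
  }
  return out;
}
fromHex(toHex('<div class="x">"quotes" survive</div>')); // round-trips intact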
One of the extra benefits of using getScript() is that it can sometimes make it almost impossible for anyone to see your code - a documented quirk of how getScript() works.
We all know we can spin up some web workers, give them some mundane tasks to do, and get a response at some point, at which stage we normally parse the data and render it somehow to the user, right?
Well... Can anyone please provide a relatively in-depth explanation as to why Web Workers do not have access to the DOM? Given a well-thought-out OO design approach, I just don't get why they shouldn't.
EDITED:
I know this is not possible and it won't have a practical answer, but I feel I needed a more in depth explanation. Feel free to close the question if you feel it's irrelevant to the community.
The other answers are correct; JS was originally designed to be single-threaded as a simplification, and adding multi-threading at this point has to be done very carefully. I wanted to elaborate a little more on thread safety in Web Workers.
When you send messages to workers using postMessage, one of three things happens:
1. The browser serializes the message into a JSON string, and de-serializes on the other end (in the worker thread). This creates a copy of the object.
2. The browser constructs a structured clone of the message, and passes it to the worker thread. This allows messaging using more complex data types than option 1, but is essentially the same thing.
3. The browser transfers ownership of the message (or parts of it). This applies only to certain data types. It is zero-copy, but once the main context transfers the message to the worker context, it becomes unavailable in the main context. This is, again, to avoid sharing memory.
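Concretely, cases 2 and 3 look like this with the standard APIs (worker.js is assumed to exist):

var worker = new Worker('worker.js');

// Case 2: structured clone - the worker receives a copy.
worker.postMessage({ rows: [1, 2, 3] });

// Case 3: transfer - zero-copy, but the buffer is detached here afterwards.
var buf = new ArrayBuffer(1024);
worker.postMessage(buf, [buf]);
console.log(buf.byteLength); // 0 - ownership moved to the worker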
When it comes to DOM nodes:
Option 1 is obviously not an option, since DOM nodes contain circular references,
Option 2 could work, but only in a read-only mode, since any changes made by the worker to the clone of the node will not be reflected in the UI, and
Option 3 is unfortunately not an option, since the DOM is a tree, and transferring ownership of a node would imply transferring ownership of the entire tree. This poses a problem if the UI thread needs to consult the DOM to paint the screen and it doesn't have ownership.
The entire world of JavaScript in a browser was originally designed with a huge simplification - single-threadedness (before there were Web Workers). This single assumption makes coding a browser a ton simpler because there can never be thread contention, and there can never be two threads of activity trying to modify the same objects or properties at the same time. This makes both writing JavaScript in the browser and implementing JavaScript in the browser a ton simpler.
Then, along came a need for an ability to do some longer running "background" type things that wouldn't block the main thread. Well, the ONLY practical way to put that into an existing browser with hundreds of millions of pages of Javascript already out there on the web was to make sure that this "new" thread of Javascript could never mess up the existing main thread.
So, it was implemented in a way that the ONLY way it could communicate with the main thread or any of the objects that the main thread could modify (such as the entire DOM and nearly all host objects) was via messaging which, by its implementation is naturally synchronized and safe from thread contention.
Here's a related answer: Why doesn't JavaScript get its own thread in common browsers?
Web Workers do not have DOM access because DOM code is not thread-safe. By DOM code, I mean the code in the browser that handles your DOM calls.
Making thread-unsafe code safe is a huge project, and none of the major browsers are doing it.
The Servo project is writing a new browser from scratch, and I think they are writing their DOM code with thread-safety in mind.
This is a follow-up to my popular and technically challenging HTML injection into someone else's website? question.
To recap: I'd like to demo my technology to customers without actually modifying their live website (e.g. demoing the idea of Stackoverflow financial bounties, without modifying the live site). Essentially, I'm trying to create a server-side version of Greasemonkey.
I've implemented the mirror as follows:
1. A request comes in to http://myserver.com/forward?uri=[remote]
2. My server opens a connection to [remote], pulls down the data, and returns its body/headers to the request from step 1.
I chose this syntax because I needed to handle requests from multiple domains (meaning, if stackoverflow.com links to meta.stackoverflow.com I need to handle both domains from the same forwarding server).
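For illustration, a hypothetical sketch of that /forward endpoint in Node with Express (the original server stack isn't specified, and real header handling needs more care than this):

const express = require('express');
const app = express();

app.get('/forward', async (req, res) => {
  // Pull down [remote] and relay its status, headers, and body.
  const upstream = await fetch(req.query.uri); // Node 18+ global fetch
  res.status(upstream.status);
  upstream.headers.forEach((value, name) => {
    // fetch already decoded the body, so skip hop-by-hop/encoding headers.
    if (!['transfer-encoding', 'content-encoding', 'content-length'].includes(name)) {
      res.setHeader(name, value);
    }
  });
  res.send(Buffer.from(await upstream.arrayBuffer()));
});

app.listen(8080);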
I have managed to rewrite links in the HTML and CSS files so they are relative to my server. The final hurdle is rewriting URLs referenced by Javascript files.
What is the best way to programmatically rewrite URLs referenced by someone else's Javascript code? Is this even technically doable?
Discussion
I'll give you an example of the technical hurdle I am facing. Take http://www.honda.com/ for example. They embed a Flash element on the page, but instead of embedding the <object> tag directly, they use JavaScript to document.write() the <object> tag containing the URL.
First attempt
Use https://stackoverflow.com/a/14570614/14731 to listen for DOM change events. Instead of trying to rewrite URLs in the Javascript code, wait for it to modify the DOM and rewrite the URLs on the fly (see the sketch after the related resources below).
Intercept all XmlHttpRequest requests using https://stackoverflow.com/a/629782/14731
Ideally we want to intercept DOM changes before they render, so the browser does not request URLs before we have a chance to rewrite them.
Related resources:
http://blog.vjeux.com/2011/javascript/intercept-and-alter-script-include.html
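A minimal sketch of that first attempt using MutationObserver (rewriteUrl is a hypothetical helper mapping URLs through the forwarding server; note that MutationObserver fires after nodes are inserted, which is exactly the timing problem mentioned above):

function rewriteUrl(url) {
  return 'http://myserver.com/forward?uri=' + encodeURIComponent(url);
}

new MutationObserver(function (mutations) {
  mutations.forEach(function (m) {
    m.addedNodes.forEach(function (node) {
      // Element nodes carrying a src attribute (script, img, embed, ...).
      if (node.nodeType === 1 && node.src) {
        node.src = rewriteUrl(node.src);
      }
    });
  });
}).observe(document.documentElement, { childList: true, subtree: true });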
Second attempt
A server-side solution will not work. Even if I can rewrite all DOM URLs, I've seen an example where an embedded Flash application references URLs stored in Javascript variables. There is no programmatic way to detect that these variables represent URLs because the Flash application is opaque.
Next, I plan on trying a client-side solution. I will load the original website in one frame, and manipulate its contents using Javascript in a second (hidden) frame. I hope to be able to inject new DOM elements (to demo my product) without having to rewrite the existing elements.
Very challenging and interesting task. I would first save the JavaScript files on my server and reference them from the HTML served. Then I would find the URLs in the files (using a regex or something) and replace them with the wanted values. I know it is not very fast, and it is not very dynamic and all, but I believe it would be easier to implement.
Hope I helped!
Answering my own question.
After much research, I find this technique works best: https://stackoverflow.com/a/23231268/14731
In other words, there doesn't seem to be a general algorithm to rewrite links. Patching them by hand isn't as much work as you'd expect.
Anyone who writes client-side JavaScript is familiar with the DOM - the tree structure that your browser references in memory, generated from the HTML it got from the server. JavaScript can add, remove and modify nodes on the DOM tree to make changes to the page. I find it very nice to work with (browser bugs aside), and very different from the way my server-side code has to generate the page in the first place.
My question is: what server-side frameworks/languages build a page by treating it as a DOM tree from the beginning - inserting nodes instead of echoing strings? I think it would be very helpful if the client-side and server-side code both saw the page the same way. You could certainly hack something like this together in any web server language, but a framework dedicated to creating a page this way could make some very nice optimizations.
Open source, being widely deployed and having been around a while would all be pluses.
You're describing Rhino on Rails, which is not out but will be soon.
Similarly, Aptana Jaxer; however, RnR will include an actual framework (Rails), whereas Jaxer is just the server technology.
Aptana's Jaxer AJAX server might be something for you to check out, as it uses JS server-side, as well.
That being said, I would argue that you're better off not generating your markup with print statements or echoes, but rather using templates and hooking in your dynamic content.
Jaxer is server-side javascript + the DOM. You can integrate jaxer with other languages, by post-processing their output.
Also, in Java, PHP, etc. you can use XPath to manipulate the DOM.
I see where you're coming from, but it's all a bit moot, isn't it? You can't send anything but rendered content to the browser, and you have to do it all in one go (AJAX aside). There's no value in what you are suggesting (from what I can see), as even if you build it tree-like, you're still only building a page which is sent wholesale to the client.