What are the main differences between Node's cluster module and threads_a_gogo, and what are the advantages and disadvantages of each? As far as I understand, threads_a_gogo creates a thread that runs in the background, while Node's cluster creates a new process that runs in the background. I'm interested in the differences in ease of use and performance, and in when to prefer one over the other.
Just having a quick look, it does use threads, yes. Node, on the other hand, uses processes, since its design is single-threaded; internally, though, it maintains a thread pool, so threads are created to service the asynchronous work behind your callbacks.
Node's processes communicate over sockets (IPC channels), which is quite slow in terms of latency. Your tasks should therefore be divisible, so that they don't need much communication.
Threads are like processes, but they share memory with the process that spawned them, so communication is quicker, but also more dangerous.
So the real question is: are threads better than processes for concurrency? It depends. But in the Node context, prefer cluster and processes.
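For illustration, here is a minimal sketch of the cluster/process approach (the message shapes are made up for the example): master and worker are separate processes with separate memory, and they can only talk by passing messages over the IPC channel.

const cluster = require('cluster');

if (cluster.isMaster) {
  // Master process: fork a worker and talk to it over IPC.
  const worker = cluster.fork();
  worker.on('message', msg => {
    console.log('from worker:', msg);
    worker.kill();
  });
  worker.send({ task: 'start' });
} else {
  // Worker process: separate memory, communicates via messages only.
  process.on('message', msg => process.send({ done: true, got: msg }));
}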
The library you are referring to is quite old, right? Better not to use it; there is a reason people abandon projects like that.
Related
In a tutorial I read that one should use Node's event-loop approach mainly for I/O-intensive tasks, like reading from the hard disk or using the network, but not for CPU-intensive tasks.
What's the concrete reason for the quoted statements?
Or, asked the other way around:
What would happen if you occupied Node.js with CPU-intensive tasks?
Node uses a small number of threads to handle many clients. In Node there are two types of threads: one Event Loop (aka the main loop, main thread, event thread, etc.), and a pool of k Workers in a Worker Pool (aka the threadpool).
If a thread is taking a long time to execute a callback (Event Loop) or a task (Worker), we call it "blocked". While a thread is blocked working on behalf of one client, it cannot handle requests from any other clients.
You can read more about it in the official Node.js guide.
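To make "blocked" concrete, here is a small hedged example (fib is just a stand-in for any heavy synchronous computation): while one request is being computed, the single event loop cannot respond to any other client.

const http = require('http');

// Deliberately slow, synchronous CPU work.
function fib(n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

http.createServer((req, res) => {
  // While fib(40) runs, the event loop can serve no one else.
  res.end(String(fib(40)));
}).listen(8000);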
TL;DR
Is it possible to have two different cluster masters that aren't a shared global? Such as making a new instance myCluster = new cluster()?
The longer version
After using the cluster module in NodeJS for some time, I've run into the problem of two separate clusters overlapping. Two different libraries (npm packages) access the same cluster master, because the cluster module is a global in the current process, no matter where you require it from.
Calling cluster.workers from any library will list every worker spawned by every library.
Everyone is going nuts over how easy and how much more efficient it is, but after running into the issue of two libraries using the same cluster, I'm worried about one interfering with the other through the global cluster functions, such as cluster.disconnect(), or through the global cluster.workers object. I understand that it's a fairly single-use-case module: "create a self-sustaining cluster of disposable workers that can easily be restarted with a watchdog".
But it's the easiest solution for parallel tasks, and it sugar-coats a lot of the bother of child_process. What if two libraries decided they needed cluster, but didn't go to the effort of tracking which workers belong to them, and instead called a cheeky
Object.values(cluster.workers).forEach(worker => worker.kill())
as their cleanup?
Is it possible to have two different "instances" or "namespaces" for clusters, so as not to interfere with any other cluster masters? Or is the cluster module just a global variable that you must accept?
I've delved into the documentation, but as far as I can tell there is no way to create a new cluster instance by calling myCluster = new cluster(), or to pass some unique identity to forked workers. I find it surprising that there is no obvious solution, especially since Node targets enterprise applications, where problems like this shouldn't exist.
The trend in programming (and it has been for a while) is to keep away from global instances and to create self-sufficient instances that are only aware of what they need to know, so-called "dumb components". Cluster is a fairly recent addition to NodeJS; have they just decided to half-implement a great feature?
I would be very grateful for your thoughts or workarounds on the subject. Right now I'm building a library that could benefit heavily from distributing tasks, but I don't want to dirty up the global cluster of whatever package depends on mine. Should I fall back to the low-level child_process?
Many thanks!
Answering my own question
After switching to the lower-level child_process module, I have reverted to using cluster, for one simple reason: Node.js's --inspect debugger.
After a lot of pain debugging another package, I traced the error back to a dependency: my own library, which implemented the child_process approach. It was forking a child process with the execArgv inherited from the parent process, which meant the child tried to bind to the same inspector port as the parent, failed silently, and crashed.
My first workaround was to generate a random free debug port (assigned by the OS) for the child_process. That was valid, but posed its own problems: above all, the process had to be manually attached to a debugger by copying the unique port each time.
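For reference, a rough sketch of that workaround (./worker.js is a hypothetical script name; --inspect=0 asks Node to bind the inspector to a free OS-assigned port):

const { fork } = require('child_process');

// Drop the parent's --inspect flag and let the OS pick a free
// inspector port for the child instead.
const execArgv = process.execArgv
  .filter(arg => !arg.startsWith('--inspect'))
  .concat(['--inspect=0']);

const child = fork('./worker.js', [], { execArgv });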
VS Code has a magical autoAttachChildProcesses setting that discovers forked processes, but only when using cluster, at least in my case. It seems cluster adds some sugar-coating around the forking process that makes workers more discoverable and adaptable during debugging, or perhaps VS Code just listens exclusively for cluster.fork(). I have accepted the downside of making my workers visible to the entire application in favour of a more robust experience for users.
This is annoying behaviour that has an open issue on GitHub: https://github.com/nodejs/node/issues/9435
My conclusion is that the cluster module exists to provide a more unified forking experience with more intelligent process handling and discovery. Packages will simply have to agree never to touch cluster.workers or similar globals, and instead maintain their own array of the workers returned by cluster.fork().
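A minimal sketch of that convention (myWorkers and the function names are just illustrative): the library tracks only the workers it forked itself and never iterates the global cluster.workers.

const cluster = require('cluster');

const myWorkers = [];

function spawn(count) {
  // Track only the workers this library forked itself.
  for (let i = 0; i < count; i++) myWorkers.push(cluster.fork());
}

function cleanup() {
  // Kill only our own workers, never everything in cluster.workers.
  myWorkers.forEach(worker => worker.kill());
  myWorkers.length = 0;
}

module.exports = { spawn, cleanup };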
Well no. Or yes.
See, since clustering in Node terms means forking the current process, there is no actual cluster, just a list of forks, i.e. workers.
But you can fork, have the fork require the script you want it to run as a worker (or branch into a worker-only block with an if/else), and save that new worker into your own array, i.e. your own cluster/fork list.
I have a simple support module that separates out the worker script (it simply loads different scripts and handles the communication): runworker
Disclaimer: my knowledge of node.js comes from a few articles, mostly summarized by this: http://en.wikipedia.org/wiki/Node.js
That said, my understanding is that it's supposed to be very quick because it avoids the overhead of threading: it puts everything into a single loop instead of paying the cost of switching between threads or processes.
I assume there is a reason why such sophisticated machinery for switching contexts between threads exists. My question is: what is the benefit of having threads over the node.js approach?
Node.js is extremely fast with IO-intensive tasks, since its event model supports IO delays perfectly. On the other hand, it is completely incapable of doing CPU-intensive tasks without stopping everything. Thus, if you need some heavy calculation, you will want to fork off a worker to do it for you.
The threaded model switches contexts automatically, whatever a thread is doing, and can thus handle CPU-intensive jobs without hurting other threads too much. (Or rather, the other threads will still run, just more slowly once CPU capacity is reached.)
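A hedged sketch of that fork-off-a-worker pattern, written as a single file for convenience (the busy loop is a stand-in for real work): the heavy computation runs in a separate process, so the parent's event loop stays responsive.

const { fork } = require('child_process');

if (process.send === undefined) {
  // Parent: fork this same file as the worker process.
  const worker = fork(__filename);
  worker.send(1e8);
  worker.on('message', result => {
    console.log('result:', result);
    worker.kill();
  });
} else {
  // Child: do the heavy, blocking work here.
  process.on('message', n => {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += i; // stand-in for heavy work
    process.send(sum);
  });
}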
Recently in school, I've been taught C++/OpenMPI in a parallel computing class. I don't really like programming in C++, as it's low-level and harder to program in: it's easier to make mistakes, and so on.
So I've been wondering: is JavaScript/NodeJS (something I've started to like) actually truly parallel? Or is it simply using non-blocking operations to simulate parallel execution (which is what I suspect)? There are libraries like async that provide functions similar to what I've used in OpenMPI: gather, scatter, even "parallel". But I have a feeling it's just simulating parallelism using non-blocking IO?
Perhaps only node-webcl is truly parallel?
UPDATE: It seems possible via web workers (around the 31-minute mark): http://www.infoq.com/presentations/Parallel-Programming-with-Nodejs
Indeed, JavaScript is single-threaded by design. But you're not the first person to want some parallelism in it, so there are some things that can work truly in parallel:
WebWorkers - run in threads, which means they are quite cheap to create. They are limited in their data-interchange abilities: at first you could only send messages between workers, but now they are a lot better; you can even use SharedArrayBuffer for concurrent memory access. Not supported in NodeJs, only in browsers.
WebGL/WebCL - utilize the graphics subsystem for parallel computing. Very fast, but effective only for a limited set of problems; not every task can be computed efficiently on a GPU-like subsystem. It also requires additional transformations to present your data in a pixel-like format. Has decent browser support as WebGL, but, as you've already mentioned, only experimental implementations for NodeJs.
SIMD - parallelism over data. It was a promising idea, but it is no longer on the roadmap for JavaScript; instead it will become part of the WebAssembly standard.
Cluster - the NodeJs solution for parallelism. It lets you run multiple processes (not threads) and even supports SharedArrayBuffer for communication since version 9. (A minimal sketch follows at the end of this answer.)
That's pretty much it. There's also the WebAssembly threads proposal, but, firstly, it is a proposal and, secondly, WebAssembly is not JavaScript.
In general, JavaScript is far from the best tool for low-level parallel computing. There are plenty of tools better suited to it: Java, C#, Go...
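As promised above, a minimal sketch of the cluster approach, forking one worker process per CPU core (cluster.isMaster is the API name in the Node versions discussed here):

const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  // One full Node process per CPU core.
  os.cpus().forEach(() => cluster.fork());
} else {
  // Each worker runs this branch in parallel, in its own process.
  console.log('worker', process.pid, 'started');
}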
With Node.js, your JavaScript runs in a single thread. IO is non-blocking.
I'm new to JavaScript so forgive me for being a n00b.
When intensive calculation is required, it more than likely involves loops, recursive or otherwise. Sometimes this may mean having a recursive loop that runs four functions, where maybe each of those functions walks the entire DOM tree, reads positions, and does some math for collision detection or whatever.
While the first function is walking the DOM tree, the next one has to wait for the first to finish, and so forth. Instead of doing this, why not launch those loops-within-loops separately, outside the program, and act on their results in another loop that runs slower because it isn't doing the calculations itself?
Dumb or clever?
Thanks in advance!
Long-running computations are exactly what Web Workers are for. What you describe is the common pattern of producer and/or consumer threads. While you could do this using Web Workers, the synchronization overhead would likely outweigh any gains, even on highly parallel systems.
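For instance, a minimal hedged sketch (collide.js is a hypothetical worker script): the worker grinds through the math off the main thread, and only the finished result crosses the message boundary back to the thread that owns the DOM.

// main.js - runs on the page's main thread
const worker = new Worker('collide.js');
worker.postMessage([1, 2, 3, 4]);
worker.onmessage = (event) => {
  // Only the result reaches the DOM-owning thread.
  document.getElementById('out').textContent = event.data;
};

// collide.js - separate thread, no DOM access here
onmessage = (event) => {
  const sum = event.data.reduce((a, b) => a + b, 0); // stand-in math
  postMessage(sum);
};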
JavaScript is not the ideal language for computationally demanding applications. Also, the processing power of web-browsing machines varies wildly (think a low-end smartphone vs. a 16-core workstation). Therefore, consider calculating the complex stuff on the server and sending the result to the client for display.
For your everyday web application, you should take a single-threaded approach and analyze performance once it becomes a problem. Heck, why not ask for help about your performance problem here?
JavaScript was never meant to perform such computationally intensive tasks, and even though this is changing, the fact remains that JavaScript is inherently single-threaded. The recent Web Workers technology provides a limited form of multi-threading, but worker threads can't access the DOM directly; they can only send and receive messages to the main thread, which can then access it on their behalf.
Currently, the only way to get real parallel processing in JS is to use Web Workers, and they are only supported by very recent browsers. And if your program requires such a thing, it may mean you are not using the right tools (for example, walking the DOM tree is generally done with DOM selectors like querySelectorAll).