TL;DR
Is it possible to have two different cluster masters that aren't a shared global? For example, by creating a new instance with myCluster = new cluster()?
The longer version
After using the new cluster module in NodeJS for some time, I've come across the problem of two separate clusters overlapping. Two different libraries (npm packages) access the same cluster master, as the cluster module is a global in the current running process, no matter where you require it from.
Calling cluster.workers from any library will list every worker spawned by every library.
Everyone is going nuts over how easy and how much more efficient it is, but having come across the issue of two libraries using the same cluster, I'm worried about one interfering with the other by calling global cluster functions such as cluster.disconnect(), or by accessing the global workers object, cluster.workers. I understand that it's a fairly single-use-case module: "create a self-sustainable cluster of disposable workers that can easily be restarted with a watchdog".
But it's the easiest solution for multi-threaded tasks, and it sugarcoats a lot of the bother of child_process. What if two libraries decided it was necessary to use cluster, but didn't go to the effort of keeping track of which workers belong to them, and instead called a cheeky
Object.values(cluster.workers).forEach(worker => worker.kill())
as their cleanup?
Is it possible to have two different "instances" or "namespaces" for clusters, as so not to interfere with any other cluster masters? Or is the cluster module just a global variable that you must accept?
I've delved into the documentation, but from what I can tell there is no way to create a new cluster instance by calling myCluster = new cluster(), or to pass some unique identity to forked workers. I find it surprising that there is no obvious solution to this problem, especially considering that Node targets enterprise applications, where such problems should not exist.
The trend in programming (and it has been for a while) is to move away from global instances and create self-sufficient instances that are only aware of what they need to know, so-called "dumb components". Cluster is a fairly new addition to Node.js; have they just decided to half-implement a great feature?
I would be very grateful for your thoughts or workarounds on the subject. Right now I'm creating a library that could benefit heavily from distributing tasks, but I don't want to dirty up the global cluster of the dependent package. Should I resort to the lower-level child_process instead?
Many thanks!
Answering my own question
After switching to the lower-level child_process module, I have reverted to using cluster for one simple reason: Node.js's --inspect debugger.
After a lot of pain debugging another package, I traced the error back to a dependency: my own library, implementing the child_process approach. It was forking a child process with the execArgv inherited from the parent process, meaning the child process was trying to bind to the same inspector port as the parent, failing silently and crashing.
My first workaround was generating a random free debug port (assigned by the OS) for the child process. This was valid, but posed its own problems; most importantly, the process had to be manually attached to a debugger by copying the unique port each time.
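For reference, a minimal sketch of that workaround, assuming a hypothetical ./worker.js entry point: strip the inherited inspector flag and let the OS pick a free port (--inspect=0 binds the inspector to port 0).

const { fork } = require('child_process');

// Drop the parent's inherited inspector flag, then ask for an
// OS-assigned free port instead.
const execArgv = process.execArgv
  .filter(arg => !arg.startsWith('--inspect'))
  .concat(['--inspect=0']);

// './worker.js' is a placeholder for the actual worker entry point.
const child = fork('./worker.js', [], { execArgv });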
VS Code has a magical autoAttachChildProcesses setting that discovers forked processes, but only when using cluster, at least in my case. It seems that cluster adds some sugar-coating around the forking process that makes it more discoverable and adaptable during debugging, or perhaps VS Code just listens exclusively for cluster.fork(). I have given in to the downside of making my workers visible to the entire application, in favour of a more robust experience for users.
This is annoying behaviour that has an open issue on GitHub: https://github.com/nodejs/node/issues/9435
My conclusion is that the cluster module is there for a more unified forking experience with more intelligent process handling and discovery. Packages will just have to agree never to touch cluster.workers or similar global functions, and instead maintain their own array of the workers returned by cluster.fork().
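In other words, something like this hypothetical pattern, where a library only ever touches the workers it forked itself:

const cluster = require('cluster');

// Workers this library forked; nothing else is ever touched.
const myWorkers = [];

function spawnWorker() {
  const worker = cluster.fork();
  myWorkers.push(worker);
  return worker;
}

function cleanup() {
  // Kill only our own workers, leaving other libraries' forks alone.
  myWorkers.forEach(worker => worker.kill());
  myWorkers.length = 0;
}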
Well no. Or yes.
See, since clustering in Node terms means forking the current process, there is no actual cluster, just a list of forks, i.e. workers.
But you can fork, require the script you want the fork to run as a worker (or branch into a worker block with an if/else), and save that new worker into your own array, i.e. your own cluster/fork list.
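A rough sketch of that pattern (the worker count and log line are made up):

const cluster = require('cluster');

if (cluster.isMaster) {
  // Master block: keep the forks in a private list (your own "cluster").
  const forkList = [];
  for (let i = 0; i < 2; i++) {
    forkList.push(cluster.fork());
  }
} else {
  // Worker block: require the actual worker script here.
  console.log(`worker ${process.pid} started`);
}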
I have a simple support module that separates out the worker script (it simply loads different scripts and handles communication): runworker
Related
I am trying to use worker threads with a worker pool in my application, which is intended to run in 256 MB Docker containers.
My main thread takes around 30 MB of memory and one worker thread around 25 MB (counting the require of third-party node modules). Given that, I would only be able to create a pool of ~7 workers.
But my application requirements are such that it should be able to handle many jobs at a time, by keeping many workers up and listening for jobs (around 20 or more).
Is there any way to share the third-party modules (lodash, request, etc.) across worker threads, to save the memory each worker needs to require all the necessary modules?
My initial thought was to try shared memory (SharedArrayBuffer), but that won't work, as it doesn't allow passing such complex object structures and functions.
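To illustrate the limitation (a toy sketch, not a solution): a SharedArrayBuffer shares raw bytes viewed through typed arrays, so a module's objects and functions can never live in it.

const { Worker, isMainThread, workerData } = require('worker_threads');

if (isMainThread) {
  const shared = new SharedArrayBuffer(4); // 4 raw bytes of shared memory
  new Int32Array(shared)[0] = 42;          // only typed-array views work
  new Worker(__filename, { workerData: shared });
} else {
  // The worker sees the same bytes: numbers, but never objects or functions.
  console.log(new Int32Array(workerData)[0]); // 42
}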
Can anyone suggest a possible solution?
Thanks in advance!
I'm new to Node.js and I want to understand how this engine works and how to use it to obtain performance and speed.
I'm building a big website using Node.js (with Express etc.), but I'm worried about what I have found: I read that JavaScript uses only a single thread, and that it is better to start multiple Node instances instead of a single Node process that does all the work.
Node.js has been updated and now supports parallelization and clustering to do all the work while exploiting maximum performance. But I am a bit skeptical about this.
Yesterday I modified my Node setup and activated cluster mode, enabling workers on all the threads of my server.
The performance is probably better, but I want to know what happens if I build my whole website using a single Node instance with this configuration. Node.js is perfect for organizing modules and controllers (in blocks), and we can build a great, well-organized program using a single Node process.
But performance? Is the single-thread problem solved by using clustering and load balancing?
Thanks for your support :)
Just host your Node.js project on Heroku and you can choose how many instances of your application you want, or host it on AWS Lambda and it will scale automatically for you if you start to receive multiple requests at the same time.
Cluster will do what you want, but it's usually not worth it, unless you're an infra guy setting up the servers.
What are the main differences between Node's cluster and threads_a_gogo, and what are the advantages and disadvantages of each? As far as I have understood, threads_a_gogo creates a thread to run in the background, while cluster creates a new process that runs in the background. I am interested in what differences there would be in ease of use or performance, and when to prefer one over the other.
Just having a quick look, it uses threads, yes. Node, on the other hand, uses processes, since its design is single-threaded; internally, however, it creates thread pools, and hence threads, to service callbacks.
Node's implementation of processes uses sockets for communication, which is quite slow in terms of latency. Your tasks should hence be divisible, so that you won't need much communication.
Threads are like processes, but they share memory with their calling process, so communication is quicker, but also more dangerous.
So the question rather becomes: are threads better than processes for concurrency? It depends... But in a Node context, prefer cluster and processes.
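To illustrate, a toy sketch of that advice: divide the work so that IPC amounts to one message out and one message back (the payload is invented).

const cluster = require('cluster');

if (cluster.isMaster) {
  const worker = cluster.fork();
  worker.send({ numbers: [1, 2, 3, 4] });   // one self-contained task out...
  worker.on('message', (sum) => {
    console.log('sum from worker:', sum);   // ...one result back
    worker.kill();
  });
} else {
  process.on('message', ({ numbers }) => {
    process.send(numbers.reduce((a, b) => a + b, 0));
  });
}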
The library you are referring to is quite old, right? Better not to use it; there is a reason why people abandon projects like this.
What is a good approach to handling background processes in a Node.js application?
Scenario: after a user posts something to an app, I want to crunch the data, request additional data from external resources, etc. All of this is quite time-consuming, so I want it out of the req/res loop. Ideally there would just be a queue of jobs that you can quickly dump a job onto, with a daemon or task runner always taking the oldest job and processing it.
In RoR I would have done it with something like Delayed Job. What is the Node equivalent of this API?
If you want something lightweight that runs in the same process as the server, I highly recommend Bull. It has a simple API that allows for fine-grained control over your queues.
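For a rough feel of the API, a minimal producer/consumer sketch (the queue name and payload are made up; Bull connects to a local Redis by default):

const Queue = require('bull');

const crunchQueue = new Queue('crunch-data');

// Producer: inside the request handler, enqueue and respond immediately.
crunchQueue.add({ postId: 123 });

// Consumer: runs outside the req/res loop, taking jobs off the queue.
crunchQueue.process(async (job) => {
  console.log('crunching post', job.data.postId);
});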
If you're familiar with Ruby's Resque, there is a node implementation called Node-resque
Bull and Node-resque are both backed by Redis, which is ubiquitous among Node.js worker queues. They would both be able to do what RoR's DelayedJob does; it's a matter of the specific features you want and your API preferences.
Background jobs are not directly related to your web service work, so they should not be in the same process. As you scale up, the memory usage of the background jobs will impact the web service's performance. But you can put them in the same code repository if you want, whatever makes more sense.
One good choice for messaging between the two processes would be Redis, if dropping a message every now and then is OK. If you want "no message left behind", you'll need a more heavyweight broker like RabbitMQ. Your web service process can publish and your background job process can subscribe.
It is not necessary for the two processes to be co-hosted, they can be on separate VMs, Docker containers, whatever you use. This allows you to scale out without much trouble.
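A hedged sketch of the pub/sub split described above, using the redis package (the channel name and payload are invented; the two halves live in separate processes):

const redis = require('redis');

// In the web service process:
const publisher = redis.createClient();
publisher.publish('background-jobs', JSON.stringify({ task: 'crunch', postId: 123 }));

// In the background job process:
const subscriber = redis.createClient();
subscriber.subscribe('background-jobs');
subscriber.on('message', (channel, message) => {
  const job = JSON.parse(message);
  // ...do the heavy lifting here, outside the req/res loop
});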
If you're using MongoDB, I recommend Agenda. That way you aren't running a separate Redis instance, and features such as scheduling, queuing, and a web UI are all present. The Agenda UI is optional and can of course be run separately.
I would also recommend setting up a loosely coupled abstraction between your application logic and the queuing/scheduling system, so that the entire background-processing system can be swapped out if needed. In other words, keep as much application/processing logic as possible out of your Agenda job definitions, in order to keep them lightweight.
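For instance, a thin job definition might look roughly like this (the connection string and the ./lib/crunch module are placeholders):

const Agenda = require('agenda');

const agenda = new Agenda({ db: { address: 'mongodb://127.0.0.1/agenda' } });

// Keep the definition thin: delegate straight to swappable application logic.
agenda.define('crunch data', async (job) => {
  await require('./lib/crunch')(job.attrs.data); // hypothetical module
});

(async () => {
  await agenda.start();
  await agenda.now('crunch data', { postId: 123 });
})();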
I'd like to suggest using Redis for scheduling jobs. It has plenty of different data structures, so you can always pick one that suits your use case best.
You mentioned RoR and DJ, so I assume you're familiar with Sidekiq. You can use node-sidekiq for job scheduling if you want to, but it's suboptimal IMO, since its main purpose is to integrate Node.js with RoR.
For daemonising the worker I'd recommend using PM2. It's widely used and actively maintained. It solves a lot of problems (e.g. deployment, monitoring, clustering), so make sure it won't be overkill for you.
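A minimal PM2 process file, assuming a hypothetical worker.js that consumes the queue; start it with pm2 start ecosystem.config.js:

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'job-worker',
    script: './worker.js', // the queue-consuming process (placeholder name)
    instances: 2,          // PM2 can cluster the worker too, if needed
    autorestart: true
  }]
};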
I tried bee-queue & bull and chose bull in the end.
I first chose bee-queue because it is quite simple and its examples are easy to understand, while Bull's examples are a bit complicated. Bee's wiki page, Bee Queue's Origin, also resonates with me. But the problems with Bee are that (1) its issue-resolution time is quite slow and its latest update was 10 months ago, and (2) I can't find an easy way to pause/cancel a job.
Bull, on the other hand, frequently updates its code and responds to issues. The Node.js job queue evaluation said Bull's weakness was "slow issue resolution time", but my experience has been the opposite!
But anyway, their APIs are similar, so it is quite easy to switch from one to the other.
I suggest using a proper Node.js framework to build your app.
I think that the most powerful and easy to use is Sails.js.
It's an MVC framework, so if you are used to developing in RoR, you will find it very, very easy!
If you use it, a powerful (in JavaScript terms) job manager is already present:
// Runs every Sunday at 01:01:00, Dublin time
// (arguments: cron pattern, onTick, onComplete, start, timezone).
new sails.cronJobs('0 01 01 * * 0', function () {
  sails.log.warn("START ListJob");
}, null, true, "Europe/Dublin");
If you need more info, don't hesitate to contact me!
I have designed a Meteor.js application and it works great on localhost, and even when deployed to the internet. Now I want to create a sign-up site that will spin up new instances of the application for each client who signs up, on the back end. Assuming a Meteor.js application, and Python or JavaScript for the sign-up site, what high-level steps need to be taken to implement this?
I am looking for a more correct and complete answer that takes the form of my poorly imagined version of this:
Use something like node or python to call a shell script that may or may not run as sudo
That script might create a new folder to hold instance-specific stuff (like client files, config, and/or that instance's database).
The script or Python code would deploy an instance of the application to that folder, on a specific port.
Python might add configuration information to a tool like Pound to forward a subdomain to that port.
Other things....!?
I don't really understand the high-level steps that need to be taken here, so if someone could provide those steps, and maybe even some useful tools or tutorials for doing so, I'd be extremely grateful.
I have a similar situation to you but ended up solving it in a completely different way. It is now available as a Meteor smart package:
https://github.com/mizzao/meteor-partitioner
The problem we share is that we want to write a Meteor app as if only one client (or group of clients, in my case) exists, while the app actually needs to handle multiple sets of clients without them knowing about each other. I am doing the following:
Assume the Meteor app is programmed for just a single instance
Using a smart package, hook the collections on the server (and possibly the client) so that all operations are 'scoped' to just the instance of the user calling them. One way to do this is to automatically attach an 'instance' or 'group' field to each document that is being added.
Doing this correctly requires a lot of knowledge about the internals of Meteor, which I've been learning. However, this approach is a lot cleaner and less resource-intensive than trying to deploy multiple Meteor apps at once. It means that you can still code the app as if only one client exists, instead of explicitly handling multiple clients. Additionally, it allows you to share resources between the instances where sharing makes sense (i.e. static assets, shared state, etc.).
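Something along these lines, using the collection-hooks package linked below (the 'Posts' collection and the 'group' field are illustrative, not part of the actual partitioner package):

Posts.before.insert(function (userId, doc) {
  // Stamp every new document with the calling user's group.
  doc.group = Meteor.users.findOne(userId).group;
});

Posts.before.find(function (userId, selector) {
  // Scope every query to the calling user's group.
  selector.group = Meteor.users.findOne(userId).group;
});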
For more details and discussions, see:
https://groups.google.com/forum/#!topic/meteor-talk/8u2LVk8si_s
https://github.com/matb33/meteor-collection-hooks (the collection-hooks package; read issues for additional discussions)
Let me remark first that I think spinning up multiple instances of the same app is a bad design choice. If it is a stop-gap measure, though, here's what I would suggest:
Create an archive that can be readily deployed. (Bundle the app, reinstall fibers if necessary, re-zip.) Deploy (unzip) the archive to a new folder when a new instance is created, using a script.
Create a template of an init script, and use forever, daemonize, jesus, etc. to start the site on reboot and keep the sites up during normal operation. See Meteor deploying to a VM by installing meteor or How does one start a node.js server as a daemon process? for examples. When a new instance is deployed, populate the template with the new values (i.e. port number, database name, folder) and copy the filled-out template to init.d, linking it into the runlevel. Alternatively, create one script in init.d that executes other scripts to bring up the site.
Each instance should listen on its own port, so you'll need a reverse proxy. AFAIK, Apache and Nginx require restarts when you change the configuration, so you'll probably want to look at Hipache: https://github.com/dotcloud/hipache. Hipache uses Redis to store the configuration information, so adding a new instance only requires adding a key to Redis. There is also an experimental port of Hipache that brings the functionality to Nginx: https://github.com/samalba/hipache-nginx
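Registering a new instance could look roughly like this (hostname, port, and identifier are made up; the same rpush commands can also be issued from redis-cli):

const redis = require('redis');
const client = redis.createClient();

// Hipache reads routes from Redis lists: the first element is an
// identifier, the following elements are backend URLs.
client.rpush('frontend:client1.example.com', 'client1');
client.rpush('frontend:client1.example.com', 'http://127.0.0.1:3001');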
What about DNS updates? Once you create a new instance, do you need to add a new record to your DNS configuration?
I don't really have an answer to your question... but I just want to remind you of another potential problem that you may run into, since I see you mentioned Python; in other words, you may be running another web app on Apache/Nginx, etc. The problem is that Meteor is not very friendly when it comes to coexisting with another HTTP server. The project I'm working on was troubled by this issue, and we had to move it to a standalone server after days of hassle with the guys from Meteor. I did not work on the issue myself, so I'm not able to give you more details, but I looked online and found something similar: https://serverfault.com/questions/424693/serving-meteor-on-main-domain-and-apache-on-subdomain-independently
Just something to keep in mind...