I am trying to use worker threads with a worker pool in my application, which is intended to run in a 256 MB Docker container.
My main thread takes around 30 MB of memory and each worker thread takes around 25 MB (after requiring third-party node modules). Given that, I would only be able to create a pool of ~7 workers.
But my application requirements are such that it should be able to handle many jobs at a time by spinning up many workers (around 20 or more) that listen for jobs.
Is there any way to share third-party modules (lodash, request, etc.) across worker threads, to save the memory each worker spends requiring all the necessary modules?
My initial thought was to try shared memory (SharedArrayBuffer), but that won't work, as it can't hold complex object structures or functions.
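To show what I mean, a minimal sketch of the limitation:

```js
const { Worker, isMainThread, workerData } = require('worker_threads');

if (isMainThread) {
  const shared = new SharedArrayBuffer(8); // 8 bytes of shared memory
  const view = new Int32Array(shared); // viewed as two 32-bit integers
  view[0] = 42; // plain numbers work fine...

  // ...but objects and functions don't: assigning something like
  // require('lodash') to view[1] would be silently coerced to a number,
  // so a module can never actually live in shared memory.
  new Worker(__filename, { workerData: shared });
} else {
  const view = new Int32Array(workerData);
  console.log(view[0]); // 42: only the raw bytes made it across
}
```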
Can anyone suggest a possible solution?
Thanks in advance!
I'm still trying to understand what worker threads are and how they differ from child processes, so please bear with me.
So I'm currently building a desktop app with Node.js + Electron. The app works on several tasks at a time, some of which are CPU- and I/O-intensive.
The architecture currently has one main process and a number of child processes matching the host's CPU core count.
The main process handles the Electron instance, the renderer process, and the local database, and manages the other child processes.
Meanwhile, the child processes do the tasks that are CPU- and I/O-intensive in nature.
So far, I have 4 questions here:
In my case, is it more beneficial to use worker threads instead?
If a task requires several packages/libraries, will worker threads have to require them again each time the task is run?
Currently, a child process has no access to the Electron API, so only the main process handles it. Would using worker threads allow me to access the Electron API?
In simple terms, what's the difference between worker threads and a thread pool? And should I use a thread pool instead of the other two (child processes and worker threads)?
I am trying to understand the following part of documentation.
Please note: By default graphql-subscriptions exports an in-memory (EventEmitter) event system to re-run subscriptions. This is not suitable for running in a serious production app, because there is no way to share subscriptions and publishes across many running servers.
What does sharing subscriptions/publishes mean, and in which cases might I need it in production?
In a production environment, you'll typically want multiple instances running on different servers to increase resiliency. This way, a single server failure won't necessarily impact availability. As your business and the resource demands on your servers grow, it's also often easier and more cost-effective to scale horizontally by adding more servers than to add more resources to an individual server.
The EventEmitters created by the basic PubSub implementation are tied to an individual process. There's no way to share their usage across multiple processes, let alone different servers. So if you publish something, an application running on a different process or server will not be notified of the event. On the other hand, if you use the Redis implementation of PubSub (or pretty much any of the other ones), then Redis serves as a go-between for your application instances: each application publishes events to Redis and subscribes to it for changes.
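For illustration, a rough sketch using the graphql-redis-subscriptions package (POST_CREATED and the resolver names are placeholders):

```js
const { RedisPubSub } = require('graphql-redis-subscriptions');

// every server instance connects to the same Redis
const pubsub = new RedisPubSub({
  connection: { host: '127.0.0.1', port: 6379 },
});

// in a Subscription resolver: subscribe via Redis
const postCreated = {
  subscribe: () => pubsub.asyncIterator('POST_CREATED'),
};

// in a Mutation resolver: publish via Redis, so instances running in
// other processes or on other servers are notified as well
async function createPost(post) {
  await pubsub.publish('POST_CREATED', { postCreated: post });
  return post;
}
```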
TL;DR
Is it possible to have two different cluster masters that aren't a shared global? Such as making a new instance myCluster = new cluster()?
The longer version
After using the new cluster module in NodeJS for some time, I've come across the problem of two separate clusters overlapping. Two different libraries (npm packages) access the same cluster master, as the cluster module is a global in the current running process, no matter where you require it from.
Calling cluster.workers from any library will list every worker spawned by every library.
Everyone is going nuts over how easy and how much more efficient it is, but having come across the issue of two libraries using the same cluster, I'm worried about one interfering with the other by using some of the global cluster functions, such as cluster.disconnect(), or accessing the global workers object cluster.workers. I understand that it's a module with a fairly narrow use case: "create a self-sustainable cluster of disposable workers that can be easily restarted with a watchdog".
But it's the easiest solution for multi-threaded tasks, and sugarcoats a lot of the bother with child_process. What if two libraries decided it was necessary to use cluster, but didn't go through the effort of keeping track of which workers belong to them, and instead call a cheeky
Object.values(cluster.workers).forEach(worker => worker.kill())
as their cleanup?
Is it possible to have two different "instances" or "namespaces" for clusters, as so not to interfere with any other cluster masters? Or is the cluster module just a global variable that you must accept?
I've delved into the documentation, but from what I can tell there is no way to create a new cluster instance by calling myCluster = new cluster() or pass some unique identity to forked workers. I find it surprising that there is no obvious solution to this problem, especially considering that it targets enterprise applications, where such problems should not exist.
The trend in programming (and it has been for a while) is to keep away from global instances and create self-sufficient instances that are only aware of what they need to know, so-called "dumb components". Cluster is a fairly new addition to Node.js; have they just decided to half-implement a great feature?
I would be very grateful for your thoughts or workarounds on the subject. Right now I'm creating a library that could heavily benefit from distributing its tasks, but I don't want to dirty up the global cluster of whatever package depends on mine. Should I resort to the lower-level child_process instead?
Many thanks!
Answering my own question
After switching to the lower-level child_process module, I have reverted back to using cluster for one simple reason. Node.js's --inspect debugger.
After a lot of pain debugging another package, I traced the error back to a dependency that was my own library implementing the child_process approach. It was forking a child process with the execArgv inherited from the parent process, meaning the child was trying to bind to the same inspector port as the parent, failing silently and crashing.
My first workaround was to generate a random free debug port (assigned by the OS) for each child process. This worked, but posed its own problems: foremost, each process had to be manually attached to a debugger by copying its unique port every time.
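The workaround looked something like this sketch (worker.js is a stand-in for the real script; --inspect=0 asks the OS for a free port):

```js
const { fork } = require('child_process');

// Drop the parent's inherited --inspect flag and let the OS assign a
// free inspector port instead, so parent and child never collide.
const execArgv = process.execArgv
  .filter((arg) => !arg.startsWith('--inspect'))
  .concat(['--inspect=0']); // port 0 = pick any free port

const child = fork('./worker.js', [], { execArgv });
```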
VS Code has a magical autoAttachChildProcesses setting that discovers forked processes, but only when using cluster, at least in my case. It seems that cluster adds some sugar-coating around the forking process that makes it more discoverable and adaptable during debugging, or perhaps VS Code just listens exclusively for cluster.fork(). I have given in to the downside of making my workers visible to the entire application in favour of a more robust experience for users.
This is annoying behaviour that has an open issue on GitHub: https://github.com/nodejs/node/issues/9435
My conclusion is that the cluster module is there for a more unified forking experience with more intelligent process handling and discovery. Packages will simply have to agree never to touch cluster.workers or similar global state, and instead maintain their own array of the workers returned by cluster.fork().
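To make that convention concrete, a minimal sketch (spawnWorkers and shutdown are made-up names):

```js
const cluster = require('cluster');

const myWorkers = []; // only the workers this library forked itself

function spawnWorkers(n) {
  for (let i = 0; i < n; i++) {
    myWorkers.push(cluster.fork());
  }
}

function shutdown() {
  // clean up only our own workers; anything else in cluster.workers
  // belongs to someone else and is left untouched
  for (const worker of myWorkers) worker.kill();
  myWorkers.length = 0;
}
```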
Well no. Or yes.
See, since clustering in Node terms means forking the current process, there is no actual cluster, just a list of forks, i.e. workers.
But you can fork, require the script you want the fork to run as a worker (or branch into a block with an if/else), and save that new worker into your own array, i.e. your own cluster/fork list (see the sketch below).
I have a simple support module that separates out the worker script (it simply loads different scripts and handles the communication): runworker
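For illustration, a minimal sketch of that branch-by-role idea (WORKER_ROLE and the script names are made up):

```js
const cluster = require('cluster');

const forkList = []; // our own "cluster"

if (cluster.isPrimary) { // cluster.isMaster on older Node versions
  forkList.push(cluster.fork({ WORKER_ROLE: 'resize' }));
  forkList.push(cluster.fork({ WORKER_ROLE: 'transcode' }));
} else if (process.env.WORKER_ROLE === 'resize') {
  require('./resize-worker'); // each fork loads only the script it needs
} else if (process.env.WORKER_ROLE === 'transcode') {
  require('./transcode-worker');
}
```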
I'm working on a small side project and would like to grow it out, but I'm not too sure how. My question is, how should I design my NodeJs worker application to be able to execute multiple long running jobs at the same time? (i.e. should I be using multiprocessing libraries, a load-balancer, etc)
My current situation is that I have one Node.js app running purely to serve web requests and put jobs on a queue, while another Node.js app reads off that queue and carries out those jobs (on a Heroku worker dyno). Each job may take anywhere from 1 hour to 1 week of purely writing to a database. Due to the nature of the job, and because it specifically requires an npm package, I feel like I should be using Node, but I'm not sure it's the best option considering I'd like to scale so that hundreds of jobs can be executed at the same time.
Any advice/suggestions as to how I should architect this design would be appreciated. Thank you.
First off, a single node.js app can handle lots of jobs that are just reading/writing from a database because those activities are mostly asynchronous which means node.js is spending most of its time doing nothing while waiting for the database to respond back from the last request. So, you could probably have a single node.js app handle literally at least hundreds of jobs, perhaps even thousands of jobs (depending upon exactly what the jobs are doing). In fact, I wouldn't be surprised if a single node.js app could throw more work at your database than the database could possibly keep up with.
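To illustrate the point, a runnable sketch in which setTimeout stands in for an awaited database call:

```js
// One process, hundreds of overlapping I/O-bound jobs.
const fakeDbCall = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runJob(id) {
  await fakeDbCall(100); // "read from the database"
  await fakeDbCall(100); // "write the result back"
  return id;
}

async function main() {
  console.time('500 jobs');
  // all 500 jobs interleave on the event loop: total is roughly 200 ms,
  // not 500 * 200 ms, because each job mostly waits on I/O
  await Promise.all(Array.from({ length: 500 }, (_, i) => runJob(i)));
  console.timeEnd('500 jobs');
}

main();
```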
Then, if you want to scale how many worker node.js apps are running these jobs, you can simply fire up as many worker apps as you want (and as many as your hardware can handle) using the child_process module. You create one central work queue in your main node.js app. Then, create a bunch of child_processes whose job it is to grab N items from the work queue and process them. Note, I suggest you grab N items at once because a single node.js process can probably work on many separate jobs at once because of asynchronous I/O to your database.
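A minimal sketch of that layout (the file names, message shapes, and batch size are made up):

```js
// main.js: central work queue, hands out a batch of N jobs on request
const { fork } = require('child_process');

const queue = []; // the web-facing code pushes jobs here
const BATCH = 10; // N jobs at once, since each worker multiplexes async I/O

for (let i = 0; i < 4; i++) {
  const worker = fork('./worker.js');
  worker.on('message', (msg) => {
    if (msg.type === 'ready') worker.send({ jobs: queue.splice(0, BATCH) });
  });
}
```

```js
// worker.js: grab a batch, work it concurrently, ask for more
const processJob = async (job) => { /* the database writes happen here */ };

process.on('message', async ({ jobs }) => {
  await Promise.all(jobs.map(processJob)); // work the whole batch at once
  // ask again immediately, or back off briefly if the queue was empty
  setTimeout(() => process.send({ type: 'ready' }), jobs.length ? 0 : 1000);
});
process.send({ type: 'ready' });
```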
You may also want to explore the cluster module which doesn't even need a work queue. You can just fire up as many clustered instances of your main app as you want and they can all share the workload (both serving web pages and working on the long running jobs). The usual guideline is to set up a clustered instance for each CPU you have in the computer. So, if you have 4 cores, you would set up a cluster with a total of four servers in it.
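For example, a sketch of that setup (cluster.isPrimary is cluster.isMaster on older Node versions):

```js
const cluster = require('cluster');
const os = require('os');
const http = require('http');

if (cluster.isPrimary) {
  // one instance per CPU core
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
} else {
  // every worker runs the full app; incoming connections are shared
  http
    .createServer((req, res) => res.end(`handled by ${process.pid}`))
    .listen(3000);
}
```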
For example, say the user requests some audio file to be processed. Node.js of course can't do the intense processing on its own event loop, so it should offload it to a worker process.
These workers probably need to be able to pub/sub to events, respawn when they die, and the queue should be able to load balance, maintain a cache, and keep alive. I've seen 0MQ, and others like it, but I'm not sure how I would go about integrating it into a web app...
What is the industry-standard way to create and manage these worker processes? And what tools are used?
Edit: One more thing: say the audio processing takes a long time, and the request times out. Is there a way to get around that other than increasing the timeout?
Edit 2: By workers, I mean like the Heroku worker dynos - how do they work?
Personally, I'd probably do something like adding the tasks to a Redis database. Then another process (probably written in a different language if the work is very processing-heavy) subscribes to that key in the Redis database and starts the tasks that get added to it. It could even manage some worker threads itself.
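Something like this sketch, using the ioredis package and a Redis list with BRPOP rather than pub/sub, so a task isn't lost when no worker happens to be listening (key names are made up):

```js
const Redis = require('ioredis');

// web process: enqueue the task and return immediately
const redis = new Redis();
async function enqueue(task) {
  await redis.lpush('audio:tasks', JSON.stringify(task));
}

// worker process (could just as well be a different language):
const processAudio = async (task) => ({ id: task.id, done: true }); // placeholder for the heavy lifting

async function workLoop() {
  const conn = new Redis();
  for (;;) {
    const [, raw] = await conn.brpop('audio:tasks', 0); // 0 = block until a task arrives
    const task = JSON.parse(raw);
    const result = await processAudio(task);
    await conn.set(`audio:result:${task.id}`, JSON.stringify(result));
  }
}
```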
For the timeout, you could use one request to only start the task and another request to fetch the result (with long-polling if the result isn't available yet, or even better, server push through websockets, for example).
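A rough sketch of that two-request pattern with plain polling, assuming Express and the Redis keys from the sketch above:

```js
const express = require('express');
const Redis = require('ioredis');

const app = express();
const redis = new Redis();

// request 1: start the task, answer immediately with its id
app.post('/tasks', express.json(), async (req, res) => {
  const id = Date.now().toString(); // placeholder id scheme
  await redis.lpush('audio:tasks', JSON.stringify({ id, ...req.body }));
  res.status(202).json({ id });
});

// request 2: fetch the result, or report that it's still pending
app.get('/tasks/:id', async (req, res) => {
  const result = await redis.get(`audio:result:${req.params.id}`);
  if (!result) return res.status(202).json({ status: 'pending' });
  res.json(JSON.parse(result));
});

app.listen(3000);
```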