How to create and manage worker processes in NodeJS?

For example, say the user requests some audio file to be processed, then of course nodejs can't do the intense processing, so it should offload it to a worker process.
These workers probably need to be able to pub/sub to events, respawn when they die, and the queue should be able to load balance, maintain a cache, and keep alive. I've seen 0MQ, and others like it, but I'm not sure how I would go about integrating it into a web app...
What is the industry-standard way to create and manage these worker processes? And what tools are used?
Edit: One more thing: say the audio processing takes a long time, and the request times out. Is there a way to get around that other than increasing the timeout?
Edit 2: By workers, I mean like the Heroku worker dynos - how do they work?

Personally I'd probably do something like add the tasks to a Redis database. Then another process (probably written in a different language if it's very processing heavy) is subscribed to that key in the Redis database and starts the tasks that get added to it. It could even manage some worker threads itself.
For the timeout you could use one request to only start the task and another request to get the results (with long polling if the result isn't available yet, or even better, server push through WebSockets, for example).
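The two-request idea can be sketched roughly as follows. This is only an illustration: the names `startJob`, `completeJob` and `getJob` are made up here, and the in-memory `Map` stands in for whatever shared store (Redis, a database) a separate worker process would actually update.

```javascript
// Hypothetical in-memory job store; a real setup would use Redis or a database
// so that a separate worker process can update the job state.
const jobs = new Map(); // jobId -> { status, payload, result }
let nextId = 1;

// First request: record the job and return its id immediately.
function startJob(payload) {
  const id = String(nextId++);
  jobs.set(id, { status: "queued", payload, result: null });
  return id;
}

// Called by the worker once processing has finished.
function completeJob(id, result) {
  const job = jobs.get(id);
  if (!job) return false;
  job.status = "done";
  job.result = result;
  return true;
}

// Second request (polled by the client): report the current state.
function getJob(id) {
  const job = jobs.get(id);
  if (!job) return { status: "unknown" };
  return { status: job.status, result: job.result };
}
```

The first HTTP handler would respond with the id from `startJob`, and the client would call a status endpoint backed by `getJob` until the status is `"done"`, so no single request has to stay open for the whole processing time.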

Related

Run jobs on FCFS basis in Nodejs from a database

I am developing a NodeJS application wherein a user can schedule a (CPU-intensive) job to be run. I am keeping the event loop free and want to run the job in a separate process. When the user submits the job, I make an entry in the database (PostgreSQL) with the timestamp along with some other information. The jobs should be run in FCFS order. Upon some research on Stack Overflow, I found people suggesting Bulljs (with Redis), Kue, RabbitMQ, etc. as a solution. My doubt is why do I need to use those when I can just poll the database and get the oldest job. I don't intend to poll the db at a regular interval but instead only when the current job is done executing.
My application does not receive too many simultaneous requests. Also, users do not wait for the job to be completed. Instead they log out and are notified by mail when the job is done. What are the potential drawbacks of using the child_process (spawn/exec) module as a solution?
My doubt is why do I need to use those when I can just poll the database and get the oldest job.
How are you planning on handling failures? What if Node.js crashes with a job mid-progress, would that affect your users? Would you then retry a failed job? How do you support back-off? How many attempts before it should completely stop?
These questions are answered in the Bull implementation, RabbitMQ and almost every solution you'll find for your current challenge.
From what I've seen, child_process is a lower-level building block in Node.js, meaning that a lot of the functionality you'll typically require (failover/back-off) isn't included. You'll have to implement this yourself.
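To give a feel for what "implementing this yourself" involves, here is a minimal retry-with-back-off sketch. The function names, the doubling schedule and the default limits are assumptions chosen for illustration; they are not what Bull or RabbitMQ do internally.

```javascript
// Delay before the next retry after failed attempt `attempt` (1-based):
// doubles each time, capped at maxMs. The schedule is an illustrative choice.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task` until it succeeds or maxAttempts is reached, waiting between tries.
// `sleep` is injectable so the waiting can be replaced in tests.
async function runWithRetry(
  task,
  { maxAttempts = 3, baseMs = 1000, sleep = defaultSleep } = {}
) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await sleep(backoffDelay(attempt, baseMs));
      }
    }
  }
  throw lastError; // all attempts failed; the caller decides what happens next
}
```

Note that this only covers retries inside one running process. Surviving a crash of Node.js itself also requires persisting the attempt count and job state somewhere durable, which is a large part of what the purpose-built queues provide.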
That's where it usually becomes more trouble than it's worth, although admittedly managing, monitoring and deploying a Redis server may not be the optimal solution either.
Have you considered a different approach? For example, how would a periodic cron job work?
The challenge with such a system is usually how you plan to handle failure and what impact failure has on your application and end-users.
I will say, in defense of Bull, that for a CPU-intensive task I prefer to have a separate instance of the worker process, which I can then re-deploy as many times as I need. This keeps my back-end code separated and generally easier to manage, whilst also giving me the ability to easily scale up/down when required.
EDIT: I mentioned "more trouble than it's worth". If you're looking to really learn how technology like this is developed, go with child_process and build your own abstractions on top. If it's something you need today, use Bull, RabbitMQ or any purpose-built alternative.

How to design a NodeJs worker to handle concurrent long running jobs

I'm working on a small side project and would like to grow it out, but I'm not too sure how. My question is, how should I design my NodeJs worker application to be able to execute multiple long running jobs at the same time? (i.e. should I be using multiprocessing libraries, a load-balancer, etc)
My current situation is that I have a NodeJs app running purely to serve web requests and put jobs on a queue, while another NodeJs app reading off that queue carries out those jobs (on a heroku worker dyno). Each job may take anywhere from 1 hour to 1 week of purely writing to a database. Due to the nature of the job, and it requiring an npm package specifically, I feel like I should be using Node, but at the same time I'm not sure it's the best option when considering I would like to scale it so that hundreds of jobs can be executed at the same time.
Any advice/suggestions as to how I should architect this design would be appreciated. Thank you.
First off, a single node.js app can handle lots of jobs that are just reading/writing from a database because those activities are mostly asynchronous which means node.js is spending most of its time doing nothing while waiting for the database to respond back from the last request. So, you could probably have a single node.js app handle literally at least hundreds of jobs, perhaps even thousands of jobs (depending upon exactly what the jobs are doing). In fact, I wouldn't be surprised if a single node.js app could throw more work at your database than the database could possibly keep up with.
Then, if you want to scale how many worker node.js apps are running these jobs, you can simply fire up as many worker apps as you want (and as many as your hardware can handle) using the child_process module. You create one central work queue in your main node.js app. Then, create a bunch of child_processes whose job it is to grab N items from the work queue and process them. Note, I suggest you grab N items at once because a single node.js process can probably work on many separate jobs at once because of asynchronous I/O to your database.
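As a rough sketch of the "grab N items at once" idea: the `WorkQueue` class and `drain` function below are hypothetical names, and everything runs in one process for brevity. In the setup described above, each worker would be a separate process (for example started with `child_process.fork`) that asks the main app for its next batch.

```javascript
// Hypothetical central work queue living in the main app.
class WorkQueue {
  constructor(items = []) {
    this.items = [...items];
  }

  push(item) {
    this.items.push(item);
  }

  // Remove and return up to n items, oldest first.
  takeBatch(n) {
    return this.items.splice(0, n);
  }

  get size() {
    return this.items.length;
  }
}

// Worker loop: keep taking batches until the queue is empty. The jobs in a
// batch run concurrently, which suits jobs that mostly wait on database I/O.
async function drain(queue, batchSize, handler) {
  const results = [];
  while (queue.size > 0) {
    const batch = queue.takeBatch(batchSize);
    results.push(...(await Promise.all(batch.map(handler))));
  }
  return results;
}
```

The batch size plays the role of N: larger batches keep each worker busier while it waits on I/O, at the cost of more work being lost or delayed if that worker dies mid-batch.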
You may also want to explore the cluster module which doesn't even need a work queue. You can just fire up as many clustered instances of your main app as you want and they can all share the workload (both serving web pages and working on the long running jobs). The usual guideline is to set up a clustered instance for each CPU you have in the computer. So, if you have 4 cores, you would set up a cluster with a total of four servers in it.

Error 504, avoid it with some data passing from server to client?

I'm developing an app that should receive a .CSV file, save it, scan it, and insert data of every record into DB and at the end delete the file.
With a file of about 10000 records there are no problems, but with a larger file the PHP script runs correctly and all data is saved into the DB, yet ERROR 504 "The server didn't respond in time." is displayed.
I'm scanning the .CSV file with the PHP function fgetcsv().
I've already edited the settings in the php.ini file (max execution time (120), etc.) but nothing changes: after 1 minute the error is shown.
I've also tried using a JavaScript function to show an alert every 10 seconds, but the error is shown in this case too.
Is there a solution to avoid this problem? Is it possible to pass some data from server to client every few seconds to avoid the error?
Thanks
It's typically when scaling issues pop up that you need to start evolving your system architecture, and your application will need to work asynchronously. The problem you are having is very common (some of my team are dealing with one as I write), and everyone needs to deal with it eventually.
Solution 1: Cron Job
The most common solution is to create a cron job that periodically scans a queue for new work to do. I won't explain the nature of the queue since everyone has their own, some are alright and others are really bad, but typically it involves a DB table with relevant information and a job status (<-- one of the bad solutions), or a solution involving Memcached, also MongoDB is quite popular.
The "problem" with this solution is ultimately again "scaling". Cron jobs run periodically at fixed intervals, so if a task takes a particularly long time jobs are likely to overlap. This means you need to work in some kind of locking or utilize a scheduler that supports running the job sequentially.
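One simple form of the locking mentioned above is a lock file that each cron run tries to create exclusively. The sketch below is in JavaScript rather than PHP and uses hypothetical helper names; it relies on the `"wx"` flag of `fs.openSync`, which fails with `EEXIST` when the file already exists.

```javascript
const fs = require("fs");

// Try to take the lock by creating the file; "wx" fails if it already exists.
function acquireLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, "wx");
    fs.writeSync(fd, String(process.pid)); // record the owner for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === "EEXIST") return false; // another run holds the lock
    throw err;
  }
}

function releaseLock(lockPath) {
  try {
    fs.unlinkSync(lockPath);
  } catch (err) {
    if (err.code !== "ENOENT") throw err; // already gone is fine
  }
}

// Run `job` only if no other run is in progress; returns false when skipped.
function runExclusive(lockPath, job) {
  if (!acquireLock(lockPath)) return false;
  try {
    job();
  } finally {
    releaseLock(lockPath);
  }
  return true;
}
```

A caveat with this approach: if a run crashes hard before releasing, the stale lock file blocks later runs until it is removed, so real setups usually add a timeout or check whether the recorded process is still alive.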
In the end, you won't run into the timeout problem, and you can typically dedicate an entire machine to running these tasks so memory isn't as much of an issue either.
Solution 2: Worker Delegation
I'll use Gearman as an example for this solution, but other tools implement standards like AMQP, such as RabbitMQ. I prefer Gearman because it's simpler to set up, and it's designed more for work processing than for messaging.
This kind of delegation has the advantage of running immediately after you call it. The server is basically waiting for stuff to do (not unlike an Apache server); when it gets a request, it shifts the workload from the client onto one of your "workers". These are scripts you've written which run indefinitely, listening to the server for work.
You can have as many of these workers as you like, each running the same or different types of tasks. This means scaling is determined by the number of workers you have, and this scales horizontally very cleanly.
Conclusion:
Crons are fine, in my opinion, for automated maintenance, but they run into problems when they need to work concurrently, which makes running workers the ideal choice.
Either way, you are going to need to change the way users receive feedback on their requests. They will need to be informed that their request is processing and to check later to get the result. Alternatively, you can periodically track the status of the running task to provide real-time feedback to the user via AJAX. That's a little tricky with cron jobs, since you will need to persist the state of the task during its execution, but Gearman has a nice built-in solution for doing just that.
http://php.net/manual/en/book.gearman.php

Queue in webbrowser on top of database?

In a web application the user is able to perform some tasks that I need to send to the server asynchronously. Basically, this is really easy, but now I would like it to also work in offline mode.
My idea is to use a client-side queue, and transfer elements from that queue to the server if the network connection is available.
I could use PouchDB, but I don't need all the tasks on the client-side, so I don't want a full client-side database with all the elements the server has as well. I only need some kind of queue: Put it in there, and try to send it to the server: If it worked, dequeue, otherwise try again after a short pause.
How could I implement this? Is there something such as RabbitMQ (conceptually!) available for browsers? A queue on top of the browsers' built-in database? Something like that?
Or can this issue be solved using PouchDB?
PouchDB does support one-way replication (just do clientDb.replicate.to("http://server/")), so if you are already running CouchDB on your server, it might be a quick & easy way to implement a queueing of tasks type of system.
You will probably want to use a filter on your replication, because when you "dequeue" or delete a task from the client side db, you probably don't want to replicate that delete to the server :) This answer is specific to CouchDB, but it should work in PouchDB too, as I think PouchDB does support filtered replication: CouchDB replicate without deleting documents.
That said, using PouchDB like this seems a little awkward, and the full replication system might be a little more overhead than is necessary for a simple queueing of tasks. Depends on what the needs of your app are though, and the exact nature of the tasks you are queueing! It could be as simple as an array that you push tasks into, and periodically check if there are tasks in there, which you can pop or shift off the array and send to the server.
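The plain-array approach could look roughly like this. `createOutbox` and the `send` callback are hypothetical names; `send` is assumed to return a promise that resolves when the server accepted the task and rejects when the request failed (for example while offline).

```javascript
// Minimal in-memory outbox: tasks stay queued until the server accepts them.
function createOutbox(send) {
  const pending = [];
  let flushing = false;

  async function flush() {
    if (flushing) return; // avoid two overlapping flushes sending duplicates
    flushing = true;
    try {
      while (pending.length > 0) {
        try {
          await send(pending[0]); // only dequeue after a successful send
          pending.shift();
        } catch (err) {
          break; // offline or server error: stop and retry on the next flush
        }
      }
    } finally {
      flushing = false;
    }
  }

  return {
    enqueue(task) {
      pending.push(task);
    },
    flush,
    get size() {
      return pending.length;
    },
  };
}
```

In a browser, `flush` could be called on a timer and when the `online` event fires on `window`. Since the array lives only in memory, tasks are lost on page reload unless they are also written to some client-side storage.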
There's also async.queue, which is commonly used in node.js but also works in the browser (this queue is not backed by any type of storage, but you could add persistent storage using PouchDB or another client-side db).

How does Node.js work when a lot of requests arrive at the web server at the same time?

In Node.js:
If Node is not using multiple threads, we can only run one function at a time. How can this work when a lot of requests arrive at the web server at the same time?
Can someone clarify the picture regarding threads and processes?
Theoretically if a large number of requests happened in the same second - or if each request has to do something that takes a while, like hard math - your server could get bogged down. Either by not responding to people at the "end" of the line (late by milliseconds), or never finishing all of the things Node needs to do to serve those requests at the "front" of the line.
In general the strategy Node takes is that if you're going to perform a long operation - like querying the database - the execution of the program should not sit around waiting, but should "call back" to some other function when the database query is eventually done.
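A tiny illustration of that "call back" strategy, with `fakeQuery` as a made-up stand-in for a database call that takes a while:

```javascript
const order = [];

// Stand-in for a slow database query: it returns immediately and invokes the
// callback later, once the (simulated) result is ready.
function fakeQuery(callback) {
  setTimeout(() => callback(null, "rows"), 10);
}

order.push("start query");
fakeQuery((err, rows) => {
  order.push("got " + rows);
});
// Execution continues right away, so Node is free to serve other requests.
order.push("handle other requests");

// At this point `order` is ["start query", "handle other requests"];
// "got rows" is appended only after the timer fires.
```

The single thread never sits blocked inside `fakeQuery`; the waiting happens outside the JavaScript code, and the callback is run when the result arrives.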
I talk more about this in another SO answer. You could Google "node.js is cancer" for other examples of just what you are talking about.
But the prevalence of this strategy is one of the major differences between Node and other languages/frameworks: that's just how JavaScript deals with it.
Now, in practice, several things actually happen.
First, any production Node app really should be running with Cluster or some kind of solution that provides load balancing. Because you'd have multiple processes of your app running, your solution can do more than one thing at once.
Secondly, in general Node.js stays up really well, because of the idea of not waiting around for everything. It keeps your server busy, instead of cooling its jets waiting for something to be done.
Thirdly, yes, you do have to be careful about what you do in the server. If something's going to take too long (modifying all the records in the database), it's probably wise to do it in the background via some kind of worker queue system: "Hey, I need to update this person's username in all of (these) records in the database" should probably be handled by yet another Node.js process acting as the worker.
