Does node.js cron affect system performance - javascript

I want to use a cron-style module (later, node-cron, etc.) in my node server to schedule jobs. The jobs are notifications (e.g. email) sent to a user to update his profile picture if he hasn't done so within 1 hour of signup. I am using the later module to schedule a task when the user signs up; it executes 1 hour afterwards, checks whether he has uploaded a profile picture, and sends a notification if not. My questions are:
1. Will having a large number of scheduled jobs affect server performance?
2. What is the best way to schedule large number of jobs without affecting the system performance?

It's hard to say whether a large number will affect system performance, since it depends an awful lot on what the jobs are doing and how they're doing it, as well as how many end up firing at the same time, etc.
That said, my opinion is that using this sort of thing to manage the kinds of tasks you're describing will severely limit your scalability, as each job is scoped to the application instance (so if you want to scale horizontally, each instance manages its own jobs; if one crashes without sending its notifications out, what happens? etc.).
I think you're far better off using some kind of distributed task manager, based perhaps on a Redis queue or similar. Your web app posts jobs to that, and a separate process or processes handle running tasks when they expire. There are a few modules for this sort of thing; kue is the one I've played with.
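For illustration, here is a minimal sketch of that pattern with kue, assuming a local Redis instance; newUser, hasProfilePicture() and sendReminderEmail() are placeholders for your own code, and depending on the kue version you may also need to enable promotion of delayed jobs (see kue's docs).

    var kue = require('kue');
    var queue = kue.createQueue(); // connects to a local Redis by default

    // web app: when a user signs up, enqueue a delayed check
    queue.create('profile-picture-check', { userId: newUser.id })
      .delay(60 * 60 * 1000) // fire roughly one hour later
      .attempts(3)           // retry a few times if the worker fails
      .save();

    // separate worker process: a web app crash no longer loses pending jobs
    queue.process('profile-picture-check', function (job, done) {
      hasProfilePicture(job.data.userId, function (err, hasPicture) {
        if (err) return done(err);
        if (!hasPicture) sendReminderEmail(job.data.userId);
        done();
      });
    });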

Related

Is there a workaround for pinging a database every few minutes?

So, I have a NodeJS MySQL system which pings the database every 10 minutes (I might set that to one minute) so as not to lose the connection, but that is not the code quality I like to write.
So now I am asking myself whether there is any way to avoid that.
I have no idea what pools are or what they do. I've seen them in a few other posts but haven't looked into them; are they the only option? Thanks in advance.
Yes, you should learn about connection pools - they are a good tool for managing a connection (or a bunch of them to support concurrency).
Connections - whether to databases or to any other remote systems - are unreliable. They can come and go, and you must not assume that a connection will live forever. Hosts can sometimes be restarted, network links can go down, or a firewall may choose to terminate your TCP session.
Of course, you could implement a "connection recycler" that maintains a single connection and replaces it with a fresh one whenever it's closed. However, that would be unproductive, since a pool already does exactly that - only it typically manages more than one connection under the hood. It's a good exercise for learning purposes in any case.
A pool has another advantage - it can scale according to load, creating and destroying connections as needed. This lets you put less load on the database when the application load is low.
As a closing remark, if you use query builders (knex.js), object-relational mappers (sequelize, TypeORM), or other types of tools that abstract database access, they'll typically use a pool under the hood, anyway - so understanding this important layer of infrastructure is beneficial in the long run.
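As a sketch of what this looks like in practice with the mysql2 driver (the connection details below are placeholders):

    const mysql = require('mysql2/promise');

    const pool = mysql.createPool({
      host: 'localhost',
      user: 'app',
      database: 'app_db',
      connectionLimit: 10,      // upper bound on concurrent connections
      waitForConnections: true  // queue queries while all connections are busy
    });

    async function getUser(id) {
      // the pool hands out a healthy connection and reclaims it afterwards,
      // so there is no need for a keep-alive ping of your own
      const [rows] = await pool.query('SELECT * FROM users WHERE id = ?', [id]);
      return rows[0];
    }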

Run jobs on FCFS basis in Nodejs from a database

I am developing a NodeJS application wherein a user can schedule a job (CPU intensive) to be run. I am keeping the event loop free and want to run the job in a separate process. When the user submits the job, I make an entry in the database (PostgreSQL), with the timestamp along with some other information. The processes should be run in the FCFS order. Upon some research on stackoverflow, I found people suggesting Bulljs (with Redis), Kue, RabbitMQ, etc. as a solution. My doubt is why do I need to use those when I can just poll the database and get the oldest job. I don't intend to poll the db at a regular interval but instead only when the current job is done executing.
My application does not receive too many simultaneous requests. And also users do not wait for the job to be completed. Instead they logout and are notified through mail when the job is done. What can be the potential drawbacks of using child_process (spawn/exec) module as a solution?
My doubt is why do I need to use those when I can just poll the database and get the oldest job.
How are you planning on handling failures? What if Node.js crashes with a job mid-progress; would that affect your users? Would you then retry the failed job? How do you support back-off? How many attempts before it should give up completely?
These questions are answered in the Bull implementation, RabbitMQ and almost every solution you'll find for your current challenge.
From what I've seen, child_process is a lower-level building block (low-level by Node.js standards), meaning that a lot of the functionality you'll typically require (failover, back-off) isn't included; you'll have to implement it yourself.
That's where it usually becomes more trouble than it's worth, although admittedly managing, monitoring and deploying a Redis server may not be the optimal solution either.
Have you considered a different approach? For example, how would a periodic cron job work?
The challenge with such a system is usually how you plan to handle failure and what impact failure has on your application and end-users.
I will say, in defense of Bull, that for a CPU-intensive task I prefer to have a separate instance of the worker process; I can then re-deploy that single process as many times as I need. This keeps my back-end code separated and generally easier to manage, whilst also giving me the ability to easily scale up/down when required.
EDIT: I mentioned "more trouble than it's worth" - if you're looking to really learn how technology like this is developed, go with child_process and build your own abstractions on top; if it's something you need today, use Bull, RabbitMQ or any purpose-built alternative.
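For reference, a minimal Bull sketch of that producer/worker split (the Redis URL, job payload and processor file are assumptions):

    const Queue = require('bull');

    // producer (web app): enqueue work with a retry/back-off policy
    const jobs = new Queue('cpu-jobs', 'redis://127.0.0.1:6379');
    jobs.add({ userId: 42 }, {
      attempts: 5,                                  // retry up to 5 times on failure
      backoff: { type: 'exponential', delay: 1000 } // 1s, 2s, 4s, ...
    });

    // worker (separately deployable): passing a file path makes Bull run
    // the job in a sandboxed child process, keeping this event loop free
    jobs.process(1, __dirname + '/processor.js');

    // processor.js would export something like:
    //   module.exports = (job) => doHeavyWork(job.data);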

How to design a NodeJs worker to handle concurrent long running jobs

I'm working on a small side project and would like to grow it out, but I'm not too sure how. My question is: how should I design my NodeJs worker application to be able to execute multiple long-running jobs at the same time? (e.g. should I be using multiprocessing libraries, a load balancer, etc.)
My current situation is that I have a NodeJs app running purely to serve web requests and put jobs on a queue, while another NodeJs app reading off that queue carries out those jobs (on a Heroku worker dyno). Each job may take anywhere from 1 hour to 1 week of purely writing to a database. Due to the nature of the job and its dependence on a specific npm package, I feel like I should be using Node, but at the same time I'm not sure it's the best option considering I would like to scale it so that hundreds of jobs can be executed at the same time.
Any advice/suggestions as to how I should architect this design would be appreciated. Thank you.
First off, a single node.js app can handle lots of jobs that are just reading/writing from a database because those activities are mostly asynchronous which means node.js is spending most of its time doing nothing while waiting for the database to respond back from the last request. So, you could probably have a single node.js app handle literally at least hundreds of jobs, perhaps even thousands of jobs (depending upon exactly what the jobs are doing). In fact, I wouldn't be surprised if a single node.js app could throw more work at your database than the database could possibly keep up with.
Then, if you want to scale how many worker node.js apps are running these jobs, you can simply fire up as many worker apps as you want (and as many as your hardware can handle) using the child_process module. You create one central work queue in your main node.js app. Then, create a bunch of child_processes whose job it is to grab N items from the work queue and process them. Note, I suggest you grab N items at once because a single node.js process can probably work on many separate jobs at once because of asynchronous I/O to your database.
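A rough sketch of that master/worker hand-off with child_process.fork; runJob and the queue contents are placeholders, and a real version would pause when the queue is empty instead of immediately handing out another (empty) batch:

    // master.js - one worker per core, each handed batches of jobs
    const { fork } = require('child_process');
    const os = require('os');

    const queue = []; // the web app pushes job descriptors here

    os.cpus().forEach(function () {
      const worker = fork(__dirname + '/worker.js');
      worker.on('message', function (msg) {
        if (msg === 'ready') worker.send(queue.splice(0, 10)); // up to N=10 jobs
      });
    });

    // worker.js - processes a batch concurrently (the work is mostly async
    // DB I/O), then asks the master for more
    process.on('message', function (jobs) {
      Promise.all(jobs.map(runJob)).then(function () {
        process.send('ready');
      });
    });
    process.send('ready');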
You may also want to explore the cluster module which doesn't even need a work queue. You can just fire up as many clustered instances of your main app as you want and they can all share the workload (both serving web pages and working on the long running jobs). The usual guideline is to set up a clustered instance for each CPU you have in the computer. So, if you have 4 cores, you would set up a cluster with a total of four servers in it.
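A minimal cluster sketch of that guideline (the request handler is a stand-in for your real app):

    const cluster = require('cluster');
    const os = require('os');
    const http = require('http');

    if (cluster.isMaster) {
      // one worker per CPU core
      os.cpus().forEach(function () { cluster.fork(); });
      cluster.on('exit', function () { cluster.fork(); }); // replace crashed workers
    } else {
      // every worker runs the same app; they all share the listening port
      http.createServer(function (req, res) {
        res.end('ok');
      }).listen(3000);
    }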

Error 504, avoid it with some data passing from server to client?

I'm developing an app that should receive a .CSV file, save it, scan it, insert the data of every record into the DB, and at the end delete the file.
With a file of about 10000 records there are no problems, but with a larger file the PHP script runs correctly and all the data is saved into the DB, yet "ERROR 504: The server didn't respond in time." is printed.
I'm scanning the .CSV file with the PHP function fgetcsv().
I've already edited settings in the php.ini file (max_execution_time = 120, etc.), but nothing changes; after 1 minute the error is shown.
I've also tried using a JavaScript function to show an alert every 10 seconds, but the error is still shown.
Is there a solution to avoid this problem? Is it possible to pass some data from server to client every few seconds to avoid the error?
Thanks
It's typically when scaling issues pop up that you need to start evolving your system architecture and making your application work asynchronously. The problem you are having is very common (some of my team are dealing with one as I write), and everyone needs to deal with it eventually.
Solution 1: Cron Job
The most common solution is to create a cron job that periodically scans a queue for new work to do. I won't explain the nature of the queue, since everyone has their own - some are alright and others are really bad - but typically it involves a DB table with the relevant information and a job status (<-- one of the bad solutions), or a solution involving Memcached; MongoDB is also quite popular.
The "problem" with this solution is ultimately, again, "scaling". Cron jobs run periodically at fixed intervals, so if a task takes a particularly long time, jobs are likely to overlap. This means you need to build in some kind of locking, or use a scheduler that supports running jobs sequentially.
In the end, you won't run into the timeout problem, and you can typically dedicate an entire machine to running these tasks so memory isn't as much of an issue either.
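To make the locking point concrete, here is a sketch of a claim-then-process queue scanner (shown in Node.js for consistency with the rest of this page; it assumes a jobs table with id, status and worker columns, and importCsv is a placeholder for the actual work):

    const mysql = require('mysql2/promise');

    // run from cron, e.g. every minute; exits when the queue is empty
    async function run() {
      const db = await mysql.createConnection({ host: 'localhost', user: 'app', database: 'app_db' });
      const claim = process.pid + '-' + Date.now(); // unique token for this run
      while (true) {
        // atomic claim: only one concurrent run can flip a given row to 'running'
        const [res] = await db.execute(
          "UPDATE jobs SET status = 'running', worker = ? " +
          "WHERE status = 'pending' ORDER BY id LIMIT 1", [claim]);
        if (res.affectedRows === 0) break; // nothing pending, wait for the next cron run
        const [[job]] = await db.execute(
          "SELECT * FROM jobs WHERE worker = ? AND status = 'running'", [claim]);
        await importCsv(job);
        await db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", [job.id]);
      }
      await db.end();
    }

    run();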
Solution 2: Worker Delegation
I'll use Gearman as an example for this solution, but other tools implement standards such as AMQP (RabbitMQ, for example). I prefer Gearman because it's simpler to set up, and it's designed more for work processing than for messaging.
This kind of delegation has the advantage of running immediately after you call it. The server is basically waiting for stuff to do (not unlike an Apache server); when it gets a request, it shifts the workload from the client onto one of your "workers" - scripts you've written which run indefinitely, listening to the server for work.
You can have as many of these workers as you like, each running the same or different types of tasks. This means scaling is determined by the number of workers you have, and this scales horizontally very cleanly.
Conclusion:
Crons are fine, in my opinion, for automated maintenance, but they run into problems when they need to work concurrently, which makes running workers the better choice.
Either way, you are going to need to change the way users receive feedback on their requests. They will need to be informed that their request is processing and to check back later for the result; alternatively, you can periodically track the status of the running task to provide real-time feedback to the user via AJAX. That's a little tricky with cron jobs, since you will need to persist the state of the task during its execution, but Gearman has a nice built-in solution for doing just that.
http://php.net/manual/en/book.gearman.php
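On the client side, the periodic status check can be as simple as polling an endpoint; a sketch, where /job-status and the response shape are assumptions:

    // poll every 5 seconds until the job reaches a terminal state
    function pollStatus(jobId) {
      var timer = setInterval(function () {
        fetch('/job-status?id=' + jobId)
          .then(function (res) { return res.json(); })
          .then(function (job) { // assumed shape: { state: 'running', progress: 42 }
            document.getElementById('status').textContent =
              job.state + ' ' + job.progress + '%';
            if (job.state === 'done' || job.state === 'failed') clearInterval(timer);
          });
      }, 5000);
    }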

Node.js scalable/performance capabilities with large number of users

I am in the middle of creating a node.js project, but I am concerned about the performance of the website once it is up and running.
I am expecting it to have a surge of maybe 2000 users for 4-5 hours over a period of one night per week.
The issue is that each user could be receiving a very small message once every second, i.e. a timer adjustment or a price change.
2000 users * 60 seconds = 120000 messages in total per minute.
Would this be possible? It would be extremely important that there was minimal lag, less than 1 second if possible.
Thanks for the help
You can certainly scale to that many users with socket.io, but the question is how you want to do that. It is unclear whether you have considered the cluster module, as that will significantly take the load off the single node process for that number of users and reduce latency. Of course, when you do this you need to stop using the in-memory store that socket.io uses by default and use something like Redis instead, so that you don't end up with duplicate authentication handshakes. Socket.io even has a document explaining how to do this.
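A sketch of that setup using the socket.io-redis adapter (getLatestPrice is a placeholder for your data source):

    var server = require('http').createServer();
    var io = require('socket.io')(server);
    var redisAdapter = require('socket.io-redis');

    // route pub/sub through Redis instead of the default in-memory store,
    // so an emit from any clustered worker reaches clients on every worker
    io.adapter(redisAdapter({ host: 'localhost', port: 6379 }));

    // one small message per second to every connected client
    setInterval(function () {
      io.emit('price', getLatestPrice());
    }, 1000);

    server.listen(3000);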
What you really need to do is test the performance of your application by creating 2000 clients and simulating 2000 users. The socket.io client can let you set this up in a node application (and then perhaps sign up for the free tier of an EC2 machine to run them all).
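For the simulation itself, something along these lines with socket.io-client; the event name matches the sketch above, and the numbers come from the question:

    var io = require('socket.io-client');

    var received = 0;
    for (var i = 0; i < 2000; i++) {
      // forceNew stops the client from multiplexing all 2000 "users"
      // over one underlying connection
      var socket = io('http://localhost:3000', { forceNew: true });
      socket.on('price', function () { received++; });
    }

    // expect roughly 2000 messages per second if the server keeps up
    setInterval(function () {
      console.log(received + ' messages in the last second');
      received = 0;
    }, 1000);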
It's worth noting that I haven't actually seen v1.0 benchmarks; really, the socket.io developers should have a page dedicated to benchmarks, as this is always a common question from developers.
