Context:
My Express.js web server is currently serving an API which wraps a SOAP service (some legacy service which I can't change). The SOAP service takes a dynamic number of items to process and takes about 1.5 seconds to process each request. The Nginx server has a timeout of 60 seconds.
Problem:
For a request to this API that takes, let's say, more than 60 seconds to complete, I am observing that the service gets re-triggered automatically (I am assuming by Express.js). So if the original request was expected to insert, say, 50 records into a table, the re-triggering of the API leaves me with 100 records inserted (duplication).
Here is a skeleton log sample that shows the issue (sensitive info stripped):
January 10, 2022 15:35:44 [... ee905] - Starting myAwesomeAPI() <-- Original API trigger
January 10, 2022 15:36:44 [... ff870] - Starting myAwesomeAPI() <-- Re-trigger happens
January 10, 2022 15:36:54 [... ee905] - Completed myAwesomeAPI() <-- Original API ends (inserts 50 records in the table)
January 10, 2022 15:37:54 [... ff870] - Completed myAwesomeAPI() <-- Re-triggered API ends (inserting another 50 records in the table resulting in duplication)
What I have tried:
I tried to reproduce the issue and check whether the re-triggering happens independently of Nginx. With the Nginx timeout still set to 60 seconds, I changed my Express server's timeout to 10 seconds and sent 15 items to process (to force a timeout before processing could complete) using this:
const express = require("express")
const server = express()
server.setTimeout(10000) // sets all requests to have a 10 seconds timeout
// myAwesomeAPI code
Testing showed that after 10 seconds, the timeout "did" re-trigger the API and the 15 items were duplicated (I saw 30 records inserted). So this tells me that the API is getting re-triggered by Express.js.
Question(s):
How do I stop the re-trigger from happening? Is there an Express server configuration to enable/disable the automatic re-triggering on timeout?
Solutions & Ideas:
Since the max items = 100 (set by team), increasing the Nginx and Express.js timeout to 300 seconds should be a quick but dirty fix. I understand that tying async API calls to some approximation of time is pure foolishness (tell me about trying to explain this to other engineers in my team ;-p), so I would like to avoid this approach.
Create a composite key from some combination of columns and enforce it as an insert restriction on the table. Combine this with checking whether the composite key is already present in the table to decide whether to skip or insert. This approach seems a bit better.
Another approach can be to respond back to the API call immediately on receipt (which will close the request) and then continue with the request processing. Something like this (inspiration): https://www.bennadel.com/blog/3275-you-can-continue-to-process-an-express-js-request-after-the-client-response-has-been-sent.htm.
This will make me independent of platform's timeout settings but will take away the real-time nature of the response being delivered with statuses for different items and add a bit more complexity of tracking the request statuses via other lookups etc.
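The composite-key idea above can be sketched without a database. In this minimal sketch an in-memory Map stands in for the table and its unique constraint; the names compositeKey and insertOnce and the columns orderId and itemId are hypothetical and not part of the original API. With a real database, the unique constraint would live on the table itself, so a retried request could not insert the same row twice.

```javascript
// Build a composite key from the chosen columns (hypothetical column names).
// JSON.stringify keeps values apart, so ['a', 'bc'] and ['ab', 'c'] differ.
function compositeKey(record, keyFields) {
  return JSON.stringify(keyFields.map((field) => record[field]));
}

// Insert the record only if its composite key is not present yet.
// Returns true when the record was inserted, false when it was skipped.
function insertOnce(table, record, keyFields) {
  const key = compositeKey(record, keyFields);
  if (table.has(key)) {
    return false;
  }
  table.set(key, record);
  return true;
}

// The Map plays the role of the table with a unique constraint.
const table = new Map();
const keyFields = ['orderId', 'itemId'];
const batch = [
  { orderId: 1, itemId: 'a' },
  { orderId: 1, itemId: 'b' },
];

// The original request inserts both records.
const firstRun = batch.map((record) => insertOnce(table, record, keyFields));
// A re-triggered request with the same payload inserts nothing.
const retriedRun = batch.map((record) => insertOnce(table, record, keyFields));

console.log(firstRun, retriedRun, table.size);
```

The check-then-insert in this sketch is safe only because it runs synchronously in one process; against a real table, the constraint itself has to reject the duplicate.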
If you have the ability to alter the front end you can add a transaction ID to it. Store the transaction routine in an object linked to the transaction ID, then if you get an API request for an ongoing transaction you can refer to the ongoing transaction.
Something like this:
// In-flight transactions, keyed by the client-supplied transaction ID
const transactions = {};
router.get('/myapi', async (req, res, next) => {
  try {
    // Query-string parameters of a GET request arrive in req.query
    const { transactionID, ...soapParams } = req.query;
    let transaction = transactions[transactionID];
    if (!transaction) {
      transaction = (async () => {
        const ret = await SOAPCall(soapParams);
        // hold onto the transaction for some period of time
        const to = setTimeout(() => {
          delete transactions[transactionID];
        }, 5000);
        to.unref(); // don't hold up process exit
        return ret;
      })();
      transactions[transactionID] = transaction;
    }
    // A repeated request awaits the same promise instead of calling SOAP again
    const ret = await transaction;
    res.json(ret);
  } catch (err) {
    next(err);
  }
});
Related
I want to write a program that sends congratulatory birthday emails to several users. The program takes today's date and runs a query against the database (an Excel file), reading each user's date of birth and comparing it with the current date; if the month and day match, an email is sent. I think it can be done with setInterval(), but I don't know whether that affects the performance of the program, since it will be deployed on a Windows Server 2012 machine at my company.
My code:
const express = require("express");
const app = express();
const excel = require('./public/scripts/readExcel.js');
const email = require('./services/email/sendEmail.js');
app.post('/send-email',(req, res)=>{
setInterval(() => {
email.sendEmail()
.then((result) => {
res.status(200).jsonp(req.body);;
console.log(result)
}).catch((err) => {
res.status(500).jsonp(req.body);;
console.log(err);
});
}, 3600000);//1 hour
});
app.listen(4000, ()=>{
console.log("Serven on -> http://localhost:4000");
})
Basically, it calls the sendEmail function every hour, which reads the Excel file or the database, extracts the date fields, compares them with the current day, and sends the mail with Nodemailer to those who have birthdays. Also, should the setInterval go in the "/send-email" route, or how should the request be made?
For that, you can also run a cron job every hour using the npm package node-cron:
var cron = require('node-cron');
// '0 * * * *' fires at minute 0 of every hour
cron.schedule('0 * * * *', () => {
  console.log('running a task every hour');
});
I'll break down my answer in two parts
What you need to do to make your solution work
How can you optimise the performance
1. What you need to do to make your solution work
There are two essential problems you need to resolve.
a. Your solution will work as it is; the only thing you need to do is call the /send-email endpoint once after starting your server. BUT... this comes with side effects.
setInterval calls the email.sendEmail... code block every hour, and this code block calls res.status(200).jsonp(req.body) every time. In case you don't know, res.status... sets the response for the request you received, in this case your request to /send-email. The first time, it works fine because you are returning the response to that request. But when the second call to this code block kicks in, it has nothing to respond to, because the request has already been answered. Remember, HTTP responds to a request once, and then the request is complete. So on every later run the res.status... call is invalid. So first, call res.status only once: I'd move this line out of the setInterval code block as follows
app.post('/send-email',(req, res)=>{
setInterval(() => {
email.sendEmail()
.then((result) => {
console.log(result)
}).catch((err) => {
console.log(err);
});
}, 3600000);//1 hour
res.status(200).jsonp(req.body);
})
b. Also, I don't think you'd want the hassle of calling /send-email every time you start the server, so I'd make sure that the birthday-wishes code block gets kicked off automatically on every server start. To do that, I'd remove the line app.post('/send-email',(req, res)=>{. Note that since this no longer runs in response to a request, there is nothing to send a response to, so I can also remove the res.status... line. Your code now looks like this
const express = require("express");
const app = express();
const email = require('./services/email/sendEmail.js');
(function(){
// Birthday wish email functionality
setInterval(() => {
email.sendEmail()
.then((result) => {
console.log(result)
}).catch((err) => {
console.log(err);
});
}, 3600000);//1 hour
})() // Immediately invoked function
That's it, your solution works now. Now to send birthday wish emails, you don't need to do anything else other than just starting your server.
Let's move on to second part now
2. How can you optimise the performance
a. Set interval to be 24hrs instead of 1 hr
Why do you need to check every hour for birthday? If you don't have a good answer here, I'd definitely change the interval to be 24hrs
b. Making the code more robust to deal with large data
As long as you have only hundreds of entries in your Excel file and they are not going to grow much in the future, I wouldn't make it more complex for the sake of performance.
But if your entries are destined to grow into the thousands and beyond, I'd suggest using a database (such as MongoDB, Postgres, or MySQL) to store your data and querying only the entries whose birthday matches the particular date.
I'd also implement a queuing system to process the query results and send emails in batches instead of doing all of that at once.
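To make the month-and-day comparison and the batching idea concrete, here is a minimal sketch. The user objects, the field names name and birthDate, and the helpers hasBirthdayOn and toBatches are assumptions for illustration; the real data would come from the Excel file or the database, and each batch would be handed to the mail-sending code.

```javascript
// True when the birth date falls on the same month and day as `today`.
// Both arguments are Date objects interpreted in local time.
function hasBirthdayOn(birthDate, today) {
  return (
    birthDate.getMonth() === today.getMonth() &&
    birthDate.getDate() === today.getDate()
  );
}

// Split a list into consecutive batches of at most `size` items.
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical sample data; month indexes are zero-based (0 = January).
const users = [
  { name: 'Ana', birthDate: new Date(1990, 0, 10) },
  { name: 'Ben', birthDate: new Date(1985, 5, 2) },
  { name: 'Cy', birthDate: new Date(2000, 0, 10) },
  { name: 'Di', birthDate: new Date(1979, 0, 10) },
];
const today = new Date(2022, 0, 10);

const celebrating = users.filter((user) => hasBirthdayOn(user.birthDate, today));
const batches = toBatches(celebrating, 2);

console.log(batches.map((batch) => batch.map((user) => user.name)));
```

With a database, the filter would become part of the query, so only the matching rows are read at all.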
I wrote a simple video creator script in NodeJS.
It runs as a scheduled cron job.
I have a panel written in PHP; the user enters details and clicks the "Submit new Video Job" button.
The new job is saved to the DB with its details, jobId, and status="waiting".
The PHP API is responsible for returning one job at a time: when asked, it queries for status="waiting" with a limit of 1 and returns the data with its jobId.
The video creation script calls that API every x seconds to ask whether a new job is available.
It has 5 tasks.
available=true.
Check if new job order available (With GET Request in every 20 seconds), if has new job;
available=false
Get details (name, picture url, etc.)
Create video with details.
Upload Video to FTP
Post data to API to update details. And Mark that job as "done"
available=true;
These tasks are async, so every task has to wait for the previous task to finish.
Right now, sending a GET or POST request to the API every 20 seconds (the exact interval doesn't matter) to check for a new job seems like a bad approach to me.
So is there any way / package / system to accomplish this behavior?
Code Example:
const cron = require('node-cron');
const axios = require('axios');

let available = true;

const makevideo = async () => {
  available = false;
  try {
    const { data } = await axios.get('https://api/checkJob');
    if (data == 0) {
      console.log('No Job');
      return;
    }
    const jobid = data.id;
    await createvideo();
    await sendToFTP();
    await axios.post('https://api/saveJob', {
      id: jobid,
      videoPath: 'somevideopath',
    });
  } finally {
    // Release the flag even if a step throws, so later ticks can still run
    available = true;
  }
};

// Poll every 20 seconds, skipping ticks while a job is still running
const scheduler = cron.schedule(
  '*/20 * * * * *',
  () => {
    if (available) {
      makevideo().catch(console.error);
    }
  },
  {
    scheduled: false,
    timezone: 'Europe/Istanbul',
  }
);

scheduler.start();
RabbitMQ is also a good queueing system.
Why ?
It's really well documented (examples for many languages including javascript & php).
Tutorials are simple while they're exposing real use cases.
It has a REST API.
It ships with a monitoring UI.
How to use it to solve your problem ?
On the job producer side : send messages (jobs) to a queue by following tutorial 1
To consume jobs with your nodejs process : see RabbitMQ's tutorial 2
Other suggestions :
Use a prefetch value of 1 together with manual consumer acknowledgements, so that a consumer instance will not receive another message while a job is still running. Publisher confirms additionally let the producer know that the broker accepted each job.
Roadmap for a quick prototype: tutorial 1, then tutorial 2. After sending and receiving messages, you can explore the options you can set on queues and messages.
Nodejs package : http://www.squaremobius.net/amqp.node/
PHP package : https://github.com/php-amqplib/php-amqplib
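The effect of a prefetch value of 1 can be modeled in plain JavaScript. This is an in-memory sketch and not RabbitMQ's API: the class PrefetchOneQueue and its methods publish, deliver, and ack are hypothetical names chosen for illustration. It only shows the rule that a consumer gets no second job until it has acknowledged the first.

```javascript
// In-memory model of a queue whose single consumer has a prefetch of 1.
class PrefetchOneQueue {
  constructor() {
    this.pending = [];    // jobs waiting to be delivered
    this.inFlight = null; // the one delivered but unacknowledged job
  }

  // Producer side: add a job to the queue.
  publish(job) {
    this.pending.push(job);
  }

  // Consumer side: returns the next job, or null while a job is
  // still unacknowledged or the queue is empty.
  deliver() {
    if (this.inFlight !== null || this.pending.length === 0) {
      return null;
    }
    this.inFlight = this.pending.shift();
    return this.inFlight;
  }

  // Consumer side: mark the in-flight job as done.
  ack() {
    this.inFlight = null;
  }
}

const queue = new PrefetchOneQueue();
queue.publish({ id: 1 });
queue.publish({ id: 2 });

const first = queue.deliver();   // job 1 is handed out
const blocked = queue.deliver(); // nothing: job 1 is not acknowledged yet
queue.ack();                     // video created, uploaded, marked as done
const second = queue.deliver();  // now job 2 is handed out

console.log(first, blocked, second);
```

In the video script, the acknowledgement would be sent only after the upload and the status update have succeeded, which replaces the available flag.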
While it is possible to use the database as a queue, it is commonly known as an anti-pattern (next to using the database for logging), and as you are looking for:
So any way / package / system to accomplish this behavior?
I use the free-form of your question thanks to the placed bounty to suggest: Beanstalk.
Beanstalk is a simple, fast work queue.
Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously.
It has client libraries in the languages you mention in your question (and many more), is easy to develop with and to run in production.
What you are doing is a very standard system design paradigm, usually implemented with Apache Kafka or any queue-based system (e.g. RabbitMQ). You can read up on Kafka or RabbitMQ; without going into details, the basic idea is:
There is a central Queue.
When user submits a job the job gets added to the Queue.
The video processor runs indefinitely subscribing to the queue.
You can go ahead and look up : https://www.gentlydownthe.stream/ and you will recognize the similarities on what you are doing.
Here you don't need to poll yourself, you need to subscribe to an event and the other things will be managed by the respective queues.
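The difference between polling and subscribing can be shown with Node's built-in EventEmitter. This is only an in-process sketch of the idea, not Kafka or RabbitMQ: the event name 'job' and the job objects are assumptions for illustration, and a real broker would deliver the jobs across processes and machines.

```javascript
const { EventEmitter } = require('events');

// Stands in for the central queue.
const jobs = new EventEmitter();
const processed = [];

// The video processor subscribes once; there is no timer and no polling.
jobs.on('job', (job) => {
  processed.push(job.id);
});

// Submitting a job pushes it to the subscriber immediately.
jobs.emit('job', { id: 101 });
jobs.emit('job', { id: 102 });

console.log(processed);
```

The handler runs only when a job actually arrives, so an idle system does no work at all.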
I'm experimenting with Node and its child_process module.
My goal is to create a server which will run on a maximum of 3 processes (1 main and optionally 2 children).
I'm aware that the code below may be incorrect, but it produces interesting results.
const app = require ("express")();
const {fork} = require("child_process")
const maxChildrenRuning = 2
let childrenRunning = 0
app.get("/isprime", (req, res) => {
if(childrenRunning+1 <= maxChildrenRuning) {
childrenRunning+=1;
console.log(childrenRunning)
const childProcess = fork('./isprime.js');
childProcess.send({"number": parseInt(req.query.number)})
childProcess.on("message", message => {
console.log(message)
res.send(message)
childrenRunning-=1;
})
}
})
function isPrime(number) {
...
}
app.listen(8000, ()=>console.log("Listening on 8000") )
I'm launching 3 requests with 5*10^9'ish numbers.
After 30 seconds I receive 2 responses with correct results.
CPU stops doing hard work and goes idle
Surprisingly, after another 1 minute 30 seconds, one thread starts processing the still-pending 3rd request and finishes after another 30 seconds with the correct answer. The console log is shown below:
> node index.js
Listening on 8000
1
2
{ number: 5000000029, isPrime: true, time: 32471 }
{ number: 5000000039, isPrime: true, time: 32557 }
1
{ number: 5000000063, isPrime: true, time: 32251 }
Either express listens and checks pending requests once for a while or my browser sends actual requests every x time while pending. Can anybody explain what is happening here and why? How can I correctly achieve my goal?
The way your server code is written, if you receive a /isprime request and two child processes are already running, your request handler for /isprime does nothing. It never sends any response. You don't pass that first if test and then nothing happens afterwards. So, that request will just sit there with the client waiting for a response. Depending upon the client, it will probably eventually time out as a dead/inactive request and the client will shut it down.
Some clients (like browsers) may assume that something just got lost in the network and retry the request by sending it again. My guess is that this is what is happening in your case: the browser eventually times out and then resends the request. By the time it retries, there are fewer than two child processes running, so the request gets processed on the retry.
You could verify that the browser is retrying automatically by going to the network tab in the Chrome debugger and watching exactly what the browser sends to your server and watch that third request, see it timeout and see if it is the browser retrying the request.
Note, this code seems to be only partially implemented: you start up to two child processes, but you don't reuse them. Once one finishes and you decrement childrenRunning, your code will start a brand-new child process for the next request. Probably what you really want to do is keep track of the two child processes you started and, when one finishes, add it to an array of "available child processes", so that when a new request comes in you can use an existing child process that is already started but idle.
You also need to either queue incoming requests when all the child processes are full or you need to send some sort of error response to the http request. Never sending an http response to an incoming request is a poor design that just leads to great inefficiencies (connections hanging around much longer than needed that never actually accomplish anything).
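A minimal sketch of the queueing option follows. The class WorkerSlots and its methods submit and finish are hypothetical names, and the callbacks stand in for sending a number to a child process. In the Express handler, submit would wrap the childProcess.send call and finish would be called from the "message" handler after res.send.

```javascript
// Limits how many tasks run at once and queues the rest.
class WorkerSlots {
  constructor(maxRunning) {
    this.maxRunning = maxRunning;
    this.running = 0;
    this.waiting = []; // start callbacks of queued tasks
  }

  // Run `start` now if a slot is free, otherwise queue it.
  submit(start) {
    if (this.running < this.maxRunning) {
      this.running += 1;
      start();
      return 'started';
    }
    this.waiting.push(start);
    return 'queued';
  }

  // Call when a task completes: hand the slot to the next queued
  // task if there is one, otherwise free the slot.
  finish() {
    const next = this.waiting.shift();
    if (next) {
      next();
    } else {
      this.running -= 1;
    }
  }
}

const slots = new WorkerSlots(2);
const started = [];

// Three requests arrive while only two slots exist.
const results = [1, 2, 3].map((n) => slots.submit(() => started.push(n)));
const startedBeforeFinish = started.slice();

// One task completes, so the queued third request starts.
slots.finish();

console.log(results, startedBeforeFinish, started);
```

If queueing is not wanted, the 'queued' branch is the place to send an error response such as a 503 instead, so the request never hangs without an answer.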
I am calling a Google Analytics API multiple times and load that data into a subscription. Now I want to create a progressbar to inform the user that data is being loaded and give a view on how long it is going to take.
I read that it's best to use publications to pass data from server to client. Is this true?
I created the following publication on the server. What it does is the following:
set the initial progressValue and add the initial document with id 1
keep looping while progressValue is less than 100 and report that the document with id 1 has changed.
Below this code I have another publication running, where progressValue is set in steps in a loop.
When looking at the client only the last progressValue gets posted. Before this I receive a lot of empty arrays. So it's like:
[]
[]
[]
[]
...
Progress publication
What I want is that the client receives every change in progressValue instead of only the last one. How can I solve this?
If there are any better suggestions on how to create a subscription progressbar, these answers will also be accepted.
if (Meteor.isServer) {
let progressValue = 0;
Meteor.publish('progress', function() {
const self = this;
let lastProgressValue = 0;
const id = 1;
self.added('progress', id, {
progress: progressValue,
total: 100,
});
while (progressValue < 100) {
self.changed('progress', id, {
progress: progressValue,
total: 100,
});
}
self.ready();
});
...
Hmm... so, a couple of things here.
I read that it's best to use publications to pass data from server to client. Is this true?
This is the whole point of Meteor and its DDP protocol: data is sent to the client automagically from the server, so the bulk of the work of manipulating data is actually handled client side using minimongo.
Have a look at this article for a good discussion of the 'automagic' part...
http://richsilv.github.io/meteor/meteor-low-level-publications/
How do you do progress?
You don't want to try to handle the incrementing on the server side. Instead, you want to get a simple count from the server, perhaps using a reactive aggregate (see my answer to "How to reactively aggregate mongodb in meteor"), and send that to the client. So the server does a count as a publication and tells the client that 57 records are coming.
Then as your normal data publication, you send the 57 records to the client. ON THE CLIENT, you now basically do the same sum as you did on the server, but as only some of the 57 data records have been received by the client, you effectively get a progress counter by dividing client received by the servers message of total to be sent.
Summary
On the SERVER - 2 publications, 1 reactive aggregate for the count of the records to be sent and 1 as the normal data being sent
On the CLIENT - function to count the records in the local minimongo collection - collection.find({}).count() - will do the trick. This will increment as each record is received by the client from the server.
Progress is as simple as count on client divided by server sent count to be delivered.
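As a minimal sketch of that division, the helper below is a plain function with the hypothetical name progressPercent. In Meteor, the first argument would come from collection.find({}).count() on the client and the second from the count publication; here both are passed in as plain numbers.

```javascript
// Percentage of records received so far, clamped to the range 0..100.
function progressPercent(receivedCount, totalCount) {
  if (totalCount <= 0) {
    return 0; // nothing announced yet, so show an empty bar
  }
  return Math.min(100, Math.round((receivedCount / totalCount) * 100));
}

// The server announced 57 records; the client has received 0, 19, then 57.
const total = 57;
const samples = [0, 19, 57].map((received) => progressPercent(received, total));

console.log(samples);
```

Because the client-side count is reactive, a template helper that calls such a function would re-run as each record arrives.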
I just created a small NodeJS application for doing some network IO and ran into what I believe is a stack overflow because of the number of callbacks.
My application is basically a file transfer function composed of 2 network operations; let's call them fetchData() and sendData(). All they do is make an HTTP request to either GET or POST data.
I have my application setup like this:
function runTransfer(){
// I need to transfer 100 million small documents
// So I need to fetch and send a group (1000) at a time
fetchData(fetchDataCallback);
function fetchDataCallback(data){
// Now that I have my data, lets send it to my other 'store'
sendData(data, sendDataCallback);
}
function sendDataCallback(){
// now that I got a 200 OK response we can fetch more data... and so on
if(totalDocsFetched >= totalNumDocs){
return; // This is where the application would finally end, once we fetched and sent all 100 million docs
}
fetchData(fetchDataCallback);
}
}
Following this design pattern, I think that I would eventually get a stack overflow, because fetchData -> fetchDataCallback -> sendData -> sendDataCallback -> fetchData -> fetchDataCallback -> sendData -> sendDataCallback... and so forth until my stack explodes!
What kind of design pattern can I use here to ensure I don't get an overflow like this?