Elastic Beanstalk environment degrading due to high CPU utilisation

Elastic Beanstalk environment degrading due to high CPU utilisation - javascript

I'm trying to sync some google ads account into my system.
This process pull the data from the google ads account from 2017-01-01 to to last date.
Query for a single date, process it in a for loop to make a proper object
inserting into database.
Also tried with load balancers. But degrading occurs for one instance.
code
querying google ads data
var difference = dateDiffInDays(new Date(2017, 0, 1), new Date());
// getting last N days
days = LastNDays(difference + 1)
// making array of date ranges
var result = days.chunk(20);
// querying google ads data
for (var value of result) {
const list = await customer.report({
entity: 'keyword_view',
attributes: adAttributes,
segments: ['segments.date'],
from_date: value[0],
to_date: value[value.length - 1]
})
await saveKeywordsData(list, value[value.length - 1])
}
I think the problem is the following function.
Becasue the output of above query is more than 5000 or 6000 (for a single date. Here calling date fro 2017-01-01).
So when handling more than 5000 data continuously for some time lead to high cpu utilisation.
function saveKeywordsData
async function saveKeywordsData(list, cronUntill) {
let metricsArray = []
for await (let element of list) {
let metrics = element.metrics
metrics.criterion_id = element.ad_group_criterion.criterion_id
metrics.keyword = element.ad_group_criterion.keyword.text
metrics.accId = accId
metrics.agencyId = agencyId
metrics.accountMId = accountMId
metrics.date = element.segments.date
metrics.dateTime = new Date(element.segments.date)
metrics.createdAt = new Date()
metricsArray.push(metrics);
}
// metricsArray length may be more than 5000 for each loop
await chunkInsertion(metricsArray, 'keywords')
return 1;
};
function chunkInsertion
async function chunkInsertion(metricsArray, type) {
let model
if (type == 'ads')
model = app.models.googleAdsInsights
else
model = app.models.googleAdsAuctionInsights
var data = metricsArray.chunk(50);
for (let item of data) {
await model.create(item)
}
return 1
}

Based on the comment.
I can provide only generic description on how a worker environment could be used. Exact implementation details are case specific.
EB worker environments are used for executing long running tasks. This could be a good solution for your use cases, as you would decouple your web environment from those heavy processing jobs that elevate your CPU.
In this scenario your web environment would be responsible for initiating the job and collecting the results. It would not be performing the actual processing, which would be handled by the dedicated worker environment.
The worker environment exposes a SQS queue. This is different from a web environment which gives you url to your website. From worker env you get only the SQS queue endpoint. The endpoint is used to submit jobs to the worker. Your worker application would receive the jobs from the queue and perform the query independently from the web environment.
Handling of the results can be done in many ways. One way would be for worker to write the results to, e.g. a DynamoDB. The web environment in that case would query the database for the result from time to time to check when they are available. The other way is for your web app to expose a dedicated url endpoint which would be called by the worker to signal the completion of a job.
This is how you generally decouple the web environment from long-running cpu or memory intensive tasks. But this would require changing how your application works and development of worker application to be deployed on EB worker environments.

Related

Express retriggering API endpoint on request timeouts

Context:
My Express.js web server is currently serving an API which wraps a SOAP service (some legacy service which I can't change). The SOAP service takes a dynamic number of items to process and takes about 1.5 seconds to process each request. The Nginx server has a timeout of 60 seconds.
Problem:
For a request to this API which e.g. lets say takes more than 60 seconds to complete, I am observing that the service is getting re-triggered automatically (I am assuming by Express.js). So if in the original request I was expecting to insert lets say 50 records to a table, now due to the re-triggering of the API I am ending up with 100 records inserted (duplication).
Here is a skeleton/sample of log that kind of shows the issue: (sensitive info stripped)
January 10, 2022 15:35:44 [... ee905] - Starting myAwesomeAPI() <-- Original API trigger
January 10, 2022 15:36:44 [... ff870] - Starting myAwesomeAPI() <-- Re-trigger happens
January 10, 2022 15:36:54 [... ee905] - Completed myAwesomeAPI() <-- Original API ends (inserts 50 records in the table)
January 10, 2022 15:37:54 [... ff870] - Completed myAwesomeAPI() <-- Re-triggered API ends (inserting another 50 records in the table resulting in duplication)
What I have tried:
To reproduce the issue and check if the re-triggering can be independent of nginx. With the Nginx timeout set to 60 seconds, I changed my Express server's timeout to 10 seconds and 15 items to process (to force timeout before processing can be complete) using this:
const express = require("express")
const server = express()
server.setTimeout(10000) <-- sets all requests to have a 10 seconds timeout
// myAwesomeAPI code
Testing showed that after 10 seconds, the timeout "did" re-trigger the API and the 15 items were duplicated (I saw 30 records inserted). So this tells me that the API is getting re-triggered by Express.js.
Question(s):
How to stop the re-trigger from happening, is there an express server configuration to enable/disable the auto re-triggering on timeout?
Solutions & Ideas:
Since the max items = 100 (set by team), increasing the Nginx and Express.js timeout to 300 seconds should be a quick but dirty fix. I understand that tying async API calls to some approximation of time is pure foolishness (tell me about trying to explain this to other engineers in my team ;-p), so I would like to avoid this approach.
Create a composite key with some combination of columns and enforce the insert restrictions on the table. Combine this with checking if the composite key is already inserted/present in the table and decide to skip/insert. This approach seems a bit better .
Another approach can be to respond back to the API call immediately on receipt (which will close the request) and then continue with the request processing. Something like this (inspiration): https://www.bennadel.com/blog/3275-you-can-continue-to-process-an-express-js-request-after-the-client-response-has-been-sent.htm.
This will make me independent of platform's timeout settings but will take away the real-time nature of the response being delivered with statuses for different items and add a bit more complexity of tracking the request statuses via other lookups etc.

If you have the ability to alter the front end you can add a transaction ID to it. Store the transaction routine in an object linked to the transaction ID, then if you get an API request for an ongoing transaction you can refer to the ongoing transaction.
Something like this:
let transactions = {};
router.get('/myapi', async (req,res,next) => {
try {
let {transactionID} = req.params;
delete(req.params.transactionID);
let transaction = transactions[transactionID];
if(!transaction) {
transaction = (async () => {
let ret = await SOAPCall(req.params);
// hold onto the transaction for some period of time
let to = setTimeout(()=>{
delete(transactions[transactionID]);
}, 5000);
to.detach(); // don't hold up process exit
return ret;
})();
transactions[transactionID] = transaction;
}
let ret = await transaction;
res.json(ret);
}
catch(err) { next(err) }
});

Better way to schedule cron jobs based on job orders from php script

So I wrote simple video creator script in NodeJS.
It's running on scheduled cron job.
I have a panel written in PHP, user enter details and clicks "Submit new Video Job" Button.
This new job is saving to DB with details, jobId and status="waiting" data.
PHP API is responsible for returning 1 status at a time, checks status="waiting" limits query to 1 then returns data with jobID when asked
Video Creation Script requests every x seconds to that API asks for new job is available.
It has 5 tasks.
available=true.
Check if new job order available (With GET Request in every 20 seconds), if has new job;
available=false
Get details (name, picture url, etc.)
Create video with details.
Upload Video to FTP
Post data to API to update details. And Mark that job as "done"
available=true;
These tasks are async so everytask has to be wait previous task to be done.
Right now, get or post requesting api if new job available in every 20 seconds (Time doesnt mattter) seems bad way to me.
So any way / package / system to accomplish this behavior?
Code Example:
const cron = require('node-cron');
let available=true;
var scheduler = cron.schedule(
'*/20 * * * * *',
() => {
if (available) {
makevideo();
}
},
{
scheduled: false,
timezone: 'Europe/Istanbul',
}
);
let makevideo = async () => {
available = false;
let {data} = await axios.get(
'https://api/checkJob'
);
if (data == 0) {
console.log('No Job');
available = true;
} else {
let jobid = data.id;
await createvideo();
await sendToFTP();
await axios.post('https://api/saveJob', {
id: jobid,
videoPath: 'somevideopath',
});
available = true;
}
};
scheduler.start();

RabbitMQ is also a good queueing system.
Why ?
It's really well documented (examples for many languages including javascript & php).
Tutorials are simple while they're exposing real use cases.
It has a REST API.
It ships with a monitoring UI.
How to use it to solve your problem ?
On the job producer side : send messages (jobs) to a queue by following tutorial 1
To consume jobs with your nodejs process : see RabbitMQ's tutorial 2
Other suggestions :
Use a prefetch value of 1 and publisher confirms so you can ensure that an instance of consumer will not receive messages while there's a job running.
Roadmap for a quick prototype : tutorial 1... then tutorial 2 x). After sending and receiving messages you can explore the options you can set on queues and messages
Nodejs package : http://www.squaremobius.net/amqp.node/
PHP package : https://github.com/php-amqplib/php-amqplib

While it is possible to use the database as a queue, it is commonly known as an anti-pattern (next to using the database for logging), and as you are looking for:
So any way / package / system to accomplish this behavior?
I use the free-form of your question thanks to the placed bounty to suggest: Beanstalk.
Beanstalk is a simple, fast work queue.
Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously.
It has client libraries in the languages you mention in your question (and many more), is easy to develop with and to run in production.

What you are doing in a very standard system design paradigm, done with Apache Kafka or any queue based implementation(ex, RabbitMQ). You can check out about Kafka/rabbitmq but basically Not going into details:
There is a central Queue.
When user submits a job the job gets added to the Queue.
The video processor runs indefinitely subscribing to the queue.
You can go ahead and look up : https://www.gentlydownthe.stream/ and you will recognize the similarities on what you are doing.
Here you don't need to poll yourself, you need to subscribe to an event and the other things will be managed by the respective queues.

Which is the easiest way to allow comunication between NodeJS and a Python script [duplicate]

Node.js is a perfect match for our web project, but there are few computational tasks for which we would prefer Python. We also already have a Python code for them.
We are highly concerned about speed, what is the most elegant way how to call a Python "worker" from node.js in an asynchronous non-blocking way?

This sounds like a scenario where zeroMQ would be a good fit. It's a messaging framework that's similar to using TCP or Unix sockets, but it's much more robust (http://zguide.zeromq.org/py:all)
There's a library that uses zeroMQ to provide a RPC framework that works pretty well. It's called zeroRPC (http://www.zerorpc.io/). Here's the hello world.
Python "Hello x" server:
import zerorpc
class HelloRPC(object):
'''pass the method a name, it replies "Hello name!"'''
def hello(self, name):
return "Hello, {0}!".format(name)
def main():
s = zerorpc.Server(HelloRPC())
s.bind("tcp://*:4242")
s.run()
if __name__ == "__main__" : main()
And the node.js client:
var zerorpc = require("zerorpc");
var client = new zerorpc.Client();
client.connect("tcp://127.0.0.1:4242");
//calls the method on the python object
client.invoke("hello", "World", function(error, reply, streaming) {
if(error){
console.log("ERROR: ", error);
}
console.log(reply);
});
Or vice-versa, node.js server:
var zerorpc = require("zerorpc");
var server = new zerorpc.Server({
hello: function(name, reply) {
reply(null, "Hello, " + name, false);
}
});
server.bind("tcp://0.0.0.0:4242");
And the python client
import zerorpc, sys
c = zerorpc.Client()
c.connect("tcp://127.0.0.1:4242")
name = sys.argv[1] if len(sys.argv) > 1 else "dude"
print c.hello(name)

For communication between node.js and Python server, I would use Unix sockets if both processes run on the same server and TCP/IP sockets otherwise. For marshaling protocol I would take JSON or protocol buffer. If threaded Python shows up to be a bottleneck, consider using Twisted Python, which
provides the same event driven concurrency as do node.js.
If you feel adventurous, learn clojure (clojurescript, clojure-py) and you'll get the same language that runs and interoperates with existing code on Java, JavaScript (node.js included), CLR and Python. And you get superb marshalling protocol by simply using clojure data structures.

If you arrange to have your Python worker in a separate process (either long-running server-type process or a spawned child on demand), your communication with it will be asynchronous on the node.js side. UNIX/TCP sockets and stdin/out/err communication are inherently async in node.

I've had a lot of success using thoonk.js along with thoonk.py. Thoonk leverages Redis (in-memory key-value store) to give you feed (think publish/subscribe), queue and job patterns for communication.
Why is this better than unix sockets or direct tcp sockets? Overall performance may be decreased a little, however Thoonk provides a really simple API that simplifies having to manually deal with a socket. Thoonk also helps make it really trivial to implement a distributed computing model that allows you to scale your python workers to increase performance, since you just spin up new instances of your python workers and connect them to the same redis server.

I'd consider also Apache Thrift http://thrift.apache.org/
It can bridge between several programming languages, is highly efficient and has support for async or sync calls. See full features here http://thrift.apache.org/docs/features/
The multi language can be useful for future plans, for example if you later want to do part of the computational task in C++ it's very easy to do add it to the mix using Thrift.

I'd recommend using some work queue using, for example, the excellent Gearman, which will provide you with a great way to dispatch background jobs, and asynchronously get their result once they're processed.
The advantage of this, used heavily at Digg (among many others) is that it provides a strong, scalable and robust way to make workers in any language to speak with clients in any language.

Update 2019
There are several ways to achieve this and here is the list in increasing order of complexity
Python Shell, you will write streams to the python console and it
will write back to you
Redis Pub Sub, you can have a channel
listening in Python while your node js publisher pushes data
Websocket connection where Node acts as the client and Python acts
as the server or vice-versa
API connection with Express/Flask/Tornado etc working separately with an API endpoint exposed for the other to query
Approach 1 Python Shell Simplest approach
source.js file
const ps = require('python-shell')
// very important to add -u option since our python script runs infinitely
var options = {
pythonPath: '/Users/zup/.local/share/virtualenvs/python_shell_test-TJN5lQez/bin/python',
pythonOptions: ['-u'], // get print results in real-time
// make sure you use an absolute path for scriptPath
scriptPath: "./subscriber/",
// args: ['value1', 'value2', 'value3'],
mode: 'json'
};
const shell = new ps.PythonShell("destination.py", options);
function generateArray() {
const list = []
for (let i = 0; i < 1000; i++) {
list.push(Math.random() * 1000)
}
return list
}
setInterval(() => {
shell.send(generateArray())
}, 1000);
shell.on("message", message => {
console.log(message);
})
destination.py file
import datetime
import sys
import time
import numpy
import talib
import timeit
import json
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
size = 1000
p = 100
o = numpy.random.random(size)
h = numpy.random.random(size)
l = numpy.random.random(size)
c = numpy.random.random(size)
v = numpy.random.random(size)
def get_indicators(values):
# Return the RSI of the values sent from node.js
numpy_values = numpy.array(values, dtype=numpy.double)
return talib.func.RSI(numpy_values, 14)
for line in sys.stdin:
l = json.loads(line)
print(get_indicators(l))
# Without this step the output may not be immediately available in node
sys.stdout.flush()
Notes: Make a folder called subscriber which is at the same level as source.js file and put destination.py inside it. Dont forget to change your virtualenv environment

How to run Google Cloud SQL only when I need it?

Google Cloud SQL advertises that it's only $0.0150 per hour for the smallest machine type, and I'm being charged for every hour, not just hours that I'm connected. Is this because I'm using a pool? How do I setup my backend so that it queries the cloud db only when needed so I don't get charged for every hour of the day?
const mysql = require('mysql');
const pool = mysql.createPool({
host : process.env.SQL_IP,
user : 'root',
password : process.env.SQL_PASS,
database : 'mydb',
ssl : {
[redacted]
}
});
function query(queryStatement, cB){
pool.getConnection(function(err, connection) {
// Use the connection
connection.query(queryStatement, function (error, results, fields) {
// And done with the connection.
connection.destroy();
// Callback
cB(error,results,fields);
});
});
}

This is not so much about the pool as it is about the nature of Cloud SQL. Unlike App Engine, Cloud SQL instances are always up. I learned this the hard way one Saturday morning when I'd been away from the project for a week. :)
There's no way to spin them down when they're not being used, unless you explicitly go stop the service.
There's no way to schedule a service stop, at least within the GCP SDK. You could alway write a cron job, or something like that, that runs a little gcloud sql instances patch [INSTANCE_NAME] --activation-policy NEVER command at, for example, 6pm local time, M-F. I was too lazy to do that, so I just set a calendar reminder for myself to shut down my instance at the end of my workday.
Here's the MySQL Instance start/stop/restart page for the current SDK's docs:
https://cloud.google.com/sql/docs/mysql/start-stop-restart-instance
On an additional note, there is an ongoing 'Feature Request' in the GCP Platform to start/stop the Cloud SQL (2nd Gen), according to the traffic as well. You can also visit the link and provide your valuable suggestions/comments there as well.

I took the idea from #ingernet and created a cloud function which starts/stops the CloudSQL instance when needed. It can be triggered via a scheduled job so you can define when the instance goes up or down.
The details are here in this Github Gist (inspiration taken from here). Disclaimer: I'm not a python developer so there might be issues in the code, but at the end it works.
Basically you need to follow these steps:
Create a pub/sub topic which will be used to trigger the cloud function.
Create the cloud function and copy in the code below.
Make sure to set the correct project ID in line 8.
Set the trigger to Pub/Sub and choose the topic created in step 1.
Create a cloud scheduler job to trigger the cloud function on a regular basis.
Choose the frequency when you want the cloud function to be triggered.
Set the target to Pub/Sub and define the topic created in step 1.
The payload should be set to start [CloudSQL instance name] or stop [CloudSQL instance name] to start or stop the specified instance (e.g. start my_cloudsql_instance will start the CloudSQL instance with the name my_cloudsql_instance)
Main.py:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import base64
from pprint import pprint
credentials = GoogleCredentials.get_application_default()
service = discovery.build('sqladmin', 'v1beta4', credentials=credentials, cache_discovery=False)
project = 'INSERT PROJECT_ID HERE'
def start_stop(event, context):
print(event)
pubsub_message = base64.b64decode(event['data']).decode('utf-8')
print(pubsub_message)
command, instance_name = pubsub_message.split(' ', 1)
if command == 'start':
start(instance_name)
elif command == 'stop':
stop(instance_name)
else:
print("unknown command " + command)
def start(instance_name):
print("starting " + instance_name)
patch(instance_name, "ALWAYS")
def stop(instance_name):
print("stopping " + instance_name)
patch(instance_name, "NEVER")
def patch(instance, activation_policy):
request = service.instances().get(project=project, instance=instance)
response = request.execute()
dbinstancebody = {
"settings": {
"settingsVersion": response["settings"]["settingsVersion"],
"activationPolicy": activation_policy
}
}
request = service.instances().patch(
project=project,
instance=instance,
body=dbinstancebody)
response = request.execute()
pprint(response)
Requirements.txt
google-api-python-client==1.10.0
google-auth-httplib2==0.0.4
google-auth==1.19.2
oauth2client==4.1.3

How to implement time to live with socket.io

I believe I have surfed the web enough but still cant get any resources on the topic. How can I implement the 'time-to-live' function with Socket.io?
I am using Node.js with express.
The above mentioned time-to-live function is intended to work as described below:
If I specify timeToLive = 10; secs, clients that connect in less than 10 sec after the message is emitted should still get the message.
This function is available on some of the cloud messaging libraries like GCM.
Any online resource will appreciated.

There is no such functionality in socket.io. You will have to implement it yourself. Consider using an array of objects that holds messages and Date.now() of that message and loop it when a user connects. Delete any messages that are expired and emit the ones that are still valid.
Minimum code could be this but due to heavy use of splice it could be slow.
var messages = [];
var TTL = 10;
io.on('connection', function (socket) {
for(i = 0; i < messages.length; i++)
if(messages[i].time + TTL < Date.now())
messages.splice(i, 1);
socket.emit('data', messages);
});
Consider using redis or any other high performance database to also synchronize between multiple servers if you require that:

We Keep Coding

JavaScript is the programming language of the Web.