JavaScript leaking memory (Node.js/Restify/MongoDB)

Update 4: By instantiating the Restify client (see controllers/messages.js) outside of the function and calling global.gc() after every request, the memory growth rate seems to have been reduced a lot (~500KB per 10 secs). Yet, the memory usage is still constantly growing.
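For reference, a minimal sketch of what Update 4 describes, assuming Restify's 'after' server event and a Restify version that still bundles createJsonClient; global.gc() only exists when Node is started with --expose-gc, and the client URL is a placeholder:
// Run with: node --expose-gc app.js
var restify = require('restify');

var server = restify.createServer();

// Instantiate the client once, outside any request handler
var client = restify.createJsonClient({ url: 'https://api.example.com' });

// 'after' fires once a request has been fully handled
server.on('after', function (req, res, route) {
    if (global.gc) {
        global.gc(); // force a collection after every request
    }
});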
Update 3: Came across this post: https://journal.paul.querna.org/articles/2011/04/05/openssl-memory-use/
It might be worth noting that I'm using HTTPS with Restify.
Update 2: Updated the code below to the current state. I've tried swapping out Restify for Express. Sadly this didn't make any difference. It seems that the API call at the end of the chain (restify -> mongodb -> external api) causes everything to be retained in memory.
Update 1: I have replaced Mongoose with the standard MongoDB driver. Memory usage seems to grow more slowly, yet the leak remains.
I've been working on trying to locate this leak for a couple of days now.
I'm running an API using Restify and Mongoose and for every API call I do at least one MongoDB lookup. I've got about 1-2k users that hit the API multiple times in a day.
What I have tried
I've isolated my code to just using Restify and used ApacheBench to fire a huge amount of requests (100k+). The memory usage stays around 60MB during the test.
I've isolated my code to just using Restify and Mongoose and tested it the same way as above. Memory usage stays around 80MB.
I've tested the full production code locally using ApacheBench. Memory usage stays around 80MB.
I've automatically dumped the heap at intervals. The biggest heap dump I had was 400MB. All I can see is that there are tons of Strings and Arrays, but I cannot clearly see a pattern in them.
So, what could be wrong?
I've done the above tests using just one API user, which means that Mongoose only grabs the same document over and over. The difference with production is that a lot of different users hit the API, meaning Mongoose fetches a lot of different documents.
When I start the Node.js server the memory quickly grows to 100MB-200MB. It eventually stabilizes around 500MB. Could this mean that it leaks memory for every user, and will stabilize once every user has visited the API?
I've included my code below which outlines the general structure of my API. I would love to know if there's a critical mistake in my code or any other approach to finding out what is causing the high memory usage.
Code
app.js
var restify = require('restify');
var MongoClient = require('mongodb').MongoClient;
// ... setup restify server and mongodb
require('./api/message')(server, db);
api/message.js
module.exports = function(server, db) {
    // Controllers used for retrieving accounts via MongoDB and communication with an external api
    var accountController = require('../controllers/accounts')(db);
    var messageController = require('../controllers/messages')();

    // Restify bind to put
    server.put('/api/message', function(req, res, next) {
        // Token from body
        var token = req.body.token;

        // Get account by token
        accountController.getAccount(token, function(error, account) {
            // Send a message using external API
            messageController.sendMessage(token, account.email, function() {
                res.send(201, {});
                return next();
            });
        });
    });
};
controllers/accounts.js
module.exports = function(db) {
    // Gets account by a token
    function getAccount(token, callback) {
        var ObjectID = require('mongodb').ObjectID;
        var collection = db.collection('accounts');

        collection.findOne({
            token: token
        }, function(error, account) {
            if (error) {
                return callback(error);
            }
            if (account) {
                return callback('', account);
            }
            return callback('Account not found');
        });
    }
};
controllers/messages.js
module.exports = function() {
    function sendMessage(token, email, callback) {
        // Get a token used for external API
        getAccessToken(function() {
            // ... Setup client

            // Do POST
            client.post('/external_api', values, function(err, req, res, obj) {
                return callback();
            });
        });
    }

    return {
        sendMessage: sendMessage
    };
};
Heap snapshot of suspected leak

It might be a bug in getters; I got it when using virtuals or getters on a Mongoose schema: https://github.com/LearnBoost/mongoose/issues/1565

It's actually normal to only see strings and arrays, as most programs are largely built on them. Profilers that allow sorting by total object count are therefore not of much use, as they often give the same results for many different programs.
A better way to use Chrome's memory profiling is to take one snapshot, for example after one user calls an API, and then a second heap snapshot after a second user has called the API.
The profiler gives you the possibility to compare two snapshots and see what differs between one and the other (see this tutorial); this will help you understand why the memory grew in an unexpected way.
Objects are retained in memory because there is still a reference to them that prevents them from being garbage collected.
So another way to use the profiler to find memory leaks is to look for an object that you believe should not be there and inspect its retaining paths, checking whether any of them are unexpected.
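If you prefer to capture the snapshots from code rather than from DevTools, here is a minimal sketch assuming Node's built-in v8 module (v8.writeHeapSnapshot is available since Node 11.13; the route path is a placeholder):
var v8 = require('v8');

// Writes a .heapsnapshot file that the Chrome DevTools Memory tab
// can load via "Load profile"; hit it once per user, then compare.
server.get('/debug/heapdump', function (req, res, next) {
    var filename = v8.writeHeapSnapshot();
    res.send(200, { written: filename });
    return next();
});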

Not sure whether this helps, but could you try removing the unnecessary returns?
api/message.js
// Send a message using external API
messageController.sendMessage(token, account.email, function() {
    res.send(201, {});
    next(); // remove 'return'
});
controllers/accounts.js
module.exports = function(db) {
    // Gets account by a token
    function getAccount(token, callback) {
        var ObjectID = require('mongodb').ObjectID;
        var collection = db.collection('accounts');

        collection.findOne({
            token: token
        }, function(error, account) {
            if (error) {
                callback(error); // remove 'return'
            } else if (account) {
                callback('', account); // remove 'return'
            } else {
                callback('Account not found'); // remove 'return'
            }
        });
    }

    return { // I guess you missed copying this into the question.
        getAccount: getAccount
    };
};
controllers/messages.js
// Do POST
client.post('/external_api', values, function(err, req, res, obj) {
    callback(); // remove 'return'
});

Your issue is in getAccount, combined with how the GC works.
When you chain lots of functions, the GC only clears one level at a time, and the older something is in memory the lower its chances of being collected. In your getAccount there are, by my count, at least 6 nested levels that need calls to global.gc() (or automatic runs) before it can be collected; by that time the GC assumes it's something it probably won't collect, so it doesn't even check it anymore.
collection {
    findOne {
        function(error, account) {
            callback('', account)
            sendMessage(...) {
                getAccessToken() {
                    Post
                }
            }
        }
    }
}
As suggested by Gene, remove this chaining.
PS: This is just a representation of how the GC works and it depends on the implementation, but you get the point.

Related

Node.js / Mongoose - Undo Database Modifications on Error

I have quite a complex Express route in my Node.js server that makes many modifications to the database via Mongoose.
My goal is to implement a mechanism that reverts all changes when any error occurs. My idea was to implement undo functions in the catch block.
But this is quite ugly, as I have to know what the previous values were, and what if an error occurs in the catch block? It's especially difficult to revert changes when an error occurred during a Promise.all(array.map( /* ... */ )).
My route looks akin to this:
module.exports = async (req, res) => {
    var arr1, var2, var3
    try {
        const body = req.body
        arr1 = await fetchVar1(body)
        const _data = await Promise.all([
            Promise.all(
                arr1.map(async x => {
                    const y = await fetchSomething(x)
                    return doSomething(y)
                })
            ),
            doSomething3(arr1),
        ])
        var2 = _data[0]
        var3 = _data[1]
        return res.json({ arr1, var2, var3 })
    } catch (err) {
        /**
         * If an error occurs I have to undo
         * everything that has been done
         * in the try block
         */
    }
}
Preferably I would like to implement something that "batches" all changes and "commits" the changes if no errors occurred.
What you are looking for is transactions: https://mongoosejs.com/docs/transactions.html
Manually undoing things after doing them won't protect you from every issue, so you should not rely on that. For example, exactly as you wrote: what happens if there is a crash after a partial write (some data is written, some is not), and then another crash during your "rollback" code, which does not clean everything up? If your code depends on your data being absolutely clean, then you have a problem. Your code should either be able to handle partial data correctly, or you must have some way to guarantee that your data is perfectly good at all times.
Transactions are the way to go, because everything is only committed at once if everything works.
What you’re looking for is called Transactions.
Transactions are new in MongoDB 4.0 and Mongoose 5.2.0. Transactions let you execute multiple operations in isolation and potentially undo all the operations if one of them fails. This guide will get you started using transactions with Mongoose.
For more information check the link below:
https://mongoosejs.com/docs/transactions.html
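A minimal sketch of the pattern from that guide; the Account model and the update payloads are placeholders, not taken from the question:
const mongoose = require('mongoose');

// Placeholder model, not from the question
const Account = mongoose.model('Account', new mongoose.Schema({ balance: Number }));

module.exports = async (req, res) => {
    const session = await mongoose.startSession();
    session.startTransaction();
    try {
        // Pass the session to every operation that should be part of the transaction
        await Account.updateOne({ _id: req.body.from }, { $inc: { balance: -10 } }, { session });
        await Account.updateOne({ _id: req.body.to }, { $inc: { balance: 10 } }, { session });

        // Nothing becomes visible to other clients until this succeeds
        await session.commitTransaction();
        res.json({ ok: true });
    } catch (err) {
        // Abort rolls back every operation performed with this session
        await session.abortTransaction();
        res.status(500).json({ error: err.message });
    } finally {
        session.endSession();
    }
};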

Duplicate Array Data Web Scraping

I can't seem to get the duplicate articles out of my web scraper results. This is my code:
app.get("/scrape", function (req, res) {
request("https://www.nytimes.com/", function (error, response, html) {
// Load the HTML into cheerio and save it to a variable
// '$' becomes a shorthand for cheerio's selector commands, much like jQuery's '$'
var $ = cheerio.load(html);
var uniqueResults = [];
// With cheerio, find each p-tag with the "title" class
// (i: iterator. element: the current element)
$("div.collection").each(function (i, element) {
// An empty array to save the data that we'll scrape
var results = [];
// store scraped data in appropriate variables
results.link = $(element).find("a").attr("href");
results.title = $(element).find("a").text();
results.summary = $(element).find("p.summary").text().trim();
// Log the results once you've looped through each of the elements found with cheerio
db.Article.create(results)
.then(function (dbArticle) {
res.json(dbArticle);
}).catch(function (err) {
return res.json(err);
});
});
res.send("You scraped the data successfully.");
});
});
// Route for getting all Articles from the db
app.get("/articles", function (req, res) {
// Grab every document in the Articles collection
db.Article.find()
.then(function (dbArticle) {
res.json(dbArticle);
})
.catch(function (err) {
res.json(err);
});
});
Right now I am getting five copies of each article sent to the user. I have tried db.Article.distinct and various versions of this to filter the results down to only unique articles. Any tips?
In Short:
Switching var results = [] from an Array to an Object (var results = {}) did the trick for me. I still haven't figured out the exact reason for the duplicate insertion of documents in the database; I will update as soon as I find out.
Long Story:
You have multiple mistakes and points of improvement in your code; I will try to point them out.
Let's go through the mistakes first to make your code error free.
Mistakes
1. Although Mongoose's model.create does seem to work fine with arrays, I haven't seen such a use before and it does not even look appropriate.
If you intend to create documents one after another, then represent each document using an object instead of an array. Using an array is more mainstream when you intend to create multiple documents at once.
So switch -
var results = [];
to
var results = {};
2. Sending response headers after they have already been sent will produce an error. I don't know if you have already noticed it or not, but it's pretty clear up front: once the error pops up, the remaining documents won't get stored because of a PromiseRejection error if you haven't set up a try/catch block.
The block inside $("div.collection").each(function (i, element) runs asynchronously, so your process control won't wait for each document to be processed; instead it immediately executes res.send("You scraped the data successfully.");.
This effectively terminates the HTTP connection between the client and the server, and any further response termination calls like res.json(dbArticle) or res.json(err) will throw an error.
So, just comment out the res.json statements inside .create's then and catch methods. This will terminate the response before all the articles are saved in the DB, but you need not worry, as your code will still work behind the scenes, saving articles in the database for you (asynchronously).
If you want your response to be terminated only after you have successfully saved the data then change your middleware implementation to -
request('https://www.nytimes.com', (err, response, html) => {
    var $ = cheerio.load(html);
    var results = [];

    $("div.collection").each(function (i, element) {
        var ob = {};
        ob.link = $(element).find("a").attr("href");
        ob.title = $(element).find("a").text();
        ob.summary = $(element).find("p.summary").text().trim();
        results.push(ob);
    });

    db.Article.create(results)
        .then(function (dbArticles) {
            res.json(dbArticles);
        }).catch(function (err) {
            return res.json(err);
        });
});
After making the above changes (and even after just the first one), my version of your code ran fine. So if you want, you can continue with your current version, or you may try reading some points of improvement.
Points of Improvement
1. The era of callbacks is long gone:
Convert your implementation to use Promises, as they are more maintainable and easier to reason about. Here are the things you can do -
Change the request library from request to axios, or any library which supports Promises by default.
2. Make effective use of Mongoose methods for insertion. You can perform bulk inserts of multiple documents in just one query (see the sketch after this list). You may find the docs on creating documents in MongoDB quite helpful.
3. Start using a browser automation library such as puppeteer or nightmare.js for data scraping tasks. Trust me, they make life a hell of a lot easier than using cheerio or any other library for the same purpose. Their docs are really good and well maintained, so you won't have a hard time picking them up.
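For point 2, a hedged sketch using Mongoose's Model.insertMany, which inserts the whole results array from the snippet above in a single bulk operation:
db.Article.insertMany(results)
    .then(function (dbArticles) {
        // One round trip to MongoDB, all articles inserted
        res.json(dbArticles);
    })
    .catch(function (err) {
        res.json(err);
    });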

Text Game: Asynchronous Node - Prompts and MySQL

PREFACE
So it seems I've coded myself into a corner again. I've been teaching myself how to code for about 6 months now and have a really bad habit of starting over and changing projects, so please, if my code reveals other bad practices, let me know; but help me figure out how to effectively use promises/callbacks in this project first, before my brain explodes. I currently want to delete it all and start over, but I'm trying to break that habit. I did not anticipate the brain-melting difficulty spike of asynchronicity (is that the word? spellcheck hates it).
The Project
WorldBuilder - simply a CLI that will talk to a MySQL database on my machine and mimic basic text game features to allow building out an environment quickly in my spare time. It should allow me to MOVE [DIRECTION], LOOK at, and CREATE [rooms / objects].
Currently it works like this:
I use Inquirer to handle the prompt...
request_command: function() {
    var command = {type: "input", name: "command", message: "CMD:>"}
    inquirer.prompt(command).then(answers => {
        parser.parse(answers.command);
    }).catch((error) => {
        console.log(error);
    });
}
The prompt passes whatever the user types to the parser:
parse: function(text) {
    player_verbs = [
        'look',
        'walk',
        'build',
        'test'
    ]
    words = text.split(' ');
    found_verb = this.find(player_verbs, words)[0];
    if (found_verb == undefined) {
        console.log('no verb found');
    } else {
        this.verbs.execute(found_verb, words)
    }
}
The parser breaks the string into words and checks those words against possible verbs. Once it finds the verb in the command it accesses the verbs object...
(ignore scope mouthwash, that is for another post)
verbs: {
    look: function(cmds) {
        // ToDo: Check for objects in the room that match a word in cmds
        player.look();
    },
    walk: function(cmds) {
        possible_directions = ['north', 'south', 'east', 'west'];
        direction = module.exports.find(possible_directions, cmds);
        player.walk(direction[0]);
    },
    build: function(cmds) {
        // NOTE scope_mouthwash exists because for some
        // reason I cannot access the global constant
        // from within this particular nested function
        // scope_mouthwash == menus
        const scope_mouthwash = require('./menus.js');
        scope_mouthwash.room_creation_menu();
    },
    execute: function(verb, args) {
        try {
            this[verb](args);
        } catch (e) {
            throw new Error(e);
        }
    }
}
Those verbs then access other objects and do various things, but essentially it comes down to querying the database an unknown number of times, either to retrieve info about an object and make a calculation, or to store info about an object. Currently my database query function returns a promise.
query: function (sql) {
    return new Promise(function(resolve, reject) {
        db.query(sql, (err, rows) => {
            if (err) {
                reject(err);
            } else {
                resolve(rows);
            }
        });
    });
},
The Problem
Unfortunately I jumped the gun and started using promises before I fully understood when to use them, and I believe I should have used callbacks in some of these situations, but I just don't quite get it yet. I solved 'callback hell' before I had to experience 'callback hell', and now I am trying to avoid 'promise hell'. My prompt used to call itself in a loop after it triggered the required verbs, but this whole approach broke down when I realized I'd get prompt messages in the middle of other prompt cycles for room building and such.
Should my queries be returning promises or should I rewrite them to use callback functions handled by whichever verb calls the query? How should I determine when to prompt the user again in the situation of having an unknown number of asynchronous processes?
So, put in a different way, my question is..
Given the parameters of my program how should I be visualizing and managing the asynchronous flow of commands, each of which may chain to an unknown number of database queries?
Possible solution directions that have occurred to me:
Create an object that keeps track of when there are pending promises and simply prompts the user when all promises are resolved or failed.
Rewrite my program to use callback functions where possible and force a known number of promises. If I can get all verbs to work off callback functions I might be able to say something like Prompt.then(resolve verb).then(Prompt)... (see the sketch below)
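A minimal sketch of that second direction, assuming inquirer and the parser object from the question, and assuming parser.parse is changed to return a promise that settles once every query its verb triggered has finished:
function promptLoop() {
    var command = { type: "input", name: "command", message: "CMD:>" };
    return inquirer.prompt(command)
        .then(function (answers) {
            // Resolves only once the verb's whole query chain has settled
            return parser.parse(answers.command);
        })
        .catch(function (error) {
            console.log(error);
        })
        .then(promptLoop); // re-prompt only after the previous command finished
}

promptLoop();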
Thank you for bearing with me, I know that was a long post. I know enough to get myself in trouble and I'm pretty lost in the woods right now.

How can I promisify the MongoDB native Javascript driver using bluebird?

I'd like to use the MongoDB native JS driver with bluebird promises. How can I use Promise.promisifyAll() on this library?
The 2.0 branch documentation contains a better promisification guide https://github.com/petkaantonov/bluebird/blob/master/API.md#promisification
It actually has a mongodb example, which is much simpler:
var Promise = require("bluebird");
var MongoDB = require("mongodb");
Promise.promisifyAll(MongoDB);
When using Promise.promisifyAll(), it helps to identify a target prototype if your target object must be instantiated. In case of the MongoDB JS driver, the standard pattern is:
Get a Db object, using either MongoClient static method or the Db constructor
Call Db#collection() to get a Collection object.
So, borrowing from https://stackoverflow.com/a/21733446/741970, you can:
var Promise = require('bluebird');
var mongodb = require('mongodb');
var MongoClient = mongodb.MongoClient;
var Collection = mongodb.Collection;
Promise.promisifyAll(Collection.prototype);
Promise.promisifyAll(MongoClient);
Now you can:
var client = MongoClient.connectAsync('mongodb://localhost:27017/test')
    .then(function(db) {
        return db.collection("myCollection").findOneAsync({ id: 'someId' })
    })
    .then(function(item) {
        // Use `item`
    })
    .catch(function(err) {
        // An error occurred
    });
This gets you pretty far, but it also helps to make sure the Cursor objects returned by Collection#find() are promisified. In the MongoDB JS driver, the cursor returned by Collection#find() is not built from a prototype, so you can wrap the method and promisify the cursor each time. This isn't necessary if you don't use cursors, or don't want to incur the overhead. Here's one approach:
Collection.prototype._find = Collection.prototype.find;
Collection.prototype.find = function() {
    var cursor = this._find.apply(this, arguments);
    cursor.toArrayAsync = Promise.promisify(cursor.toArray, cursor);
    cursor.countAsync = Promise.promisify(cursor.count, cursor);
    return cursor;
}
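A usage sketch for the wrapper above, assuming the promisified setup from earlier; the collection name and query are placeholders:
db.collection('myCollection').find({ active: true })
    .toArrayAsync()
    .then(function (docs) {
        // docs is a plain array of the matching documents
        console.log(docs.length);
    });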
I know this has been answered several times, but I wanted to add a little more information on this topic. Per Bluebird's own documentation, you should use 'using' to clean up connections and prevent memory leaks.
Resource Management in Bluebird
I looked all over the place for how to do this correctly, and information was scarce, so I thought I'd share what I found after much trial and error. The data I used below (restaurants) comes from the MongoDB sample data, which you can get here: MongoDB Import Data
// Using dotenv for environment / connection information
require('dotenv').load();

var Promise = require('bluebird'),
    mongodb = Promise.promisifyAll(require('mongodb')),
    using = Promise.using;

function getConnectionAsync() {
    // process.env.MongoDbUrl stored in my .env file using the require above
    return mongodb.MongoClient.connectAsync(process.env.MongoDbUrl)
        // .disposer is what handles cleaning up the connection
        .disposer(function(connection) {
            connection.close();
        });
}
// The two methods below retrieve the same data and output the same data
// but the difference is the first one does as much as it can asynchronously
// while the 2nd one uses the blocking versions of each
// NOTE: using limitAsync seems to go away to never-never land and never come back!
// Everything is done asynchronously here with promises
using(
    getConnectionAsync(),
    function(connection) {
        // Because we used promisifyAll(), most (if not all) of the
        // methods in what was promisified now have an Async sibling
        // collection : collectionAsync
        // find : findAsync
        // etc.
        return connection.collectionAsync('restaurants')
            .then(function(collection) {
                return collection.findAsync()
            })
            .then(function(data) {
                return data.limit(10).toArrayAsync();
            });
    }
// Before this ".then" is called, the using statement will now call the
// .dispose() that was set up in the getConnectionAsync method
).then(
    function(data) {
        console.log("end data", data);
    }
);
// Here, only the connection is asynchronous - the rest are blocking processes
using(
    getConnectionAsync(),
    function(connection) {
        // Here because I'm not using any of the Async functions, these should
        // all be blocking requests unlike the promisified versions above
        return connection.collection('restaurants').find().limit(10).toArray();
    }
).then(
    function(data) {
        console.log("end data", data);
    }
);
I hope this helps someone else out who wanted to do things by the Bluebird book.
Version 1.4.9 of mongodb should now be easily promisifiable, like so:
Promise.promisifyAll(mongo.Cursor.prototype);
See https://github.com/mongodb/node-mongodb-native/pull/1201 for more details.
We have been using the following driver in production for a while now. It's essentially a promise wrapper over the native Node.js driver. It also adds some additional helper functions.
poseidon-mongo - https://github.com/playlyfe/poseidon-mongo

Pre-emptive: How to avoid deep callback hierarchy in Node? (Using Express) [duplicate]

This question already has answers here:
How to avoid long nesting of asynchronous functions in Node.js
(23 answers)
Closed 9 years ago.
I just started experimenting with Node (using Express to build a simple website with a MySQL database).
I have basically run with the application structure Express provides (which doesn't matter for the sake of this question). I have a file routes/index.js which exports the index function that is hit whenever a request is made for my home page. The contents of index.js are:
var db = require('../db');

exports.index = function(req, res) {
    db.getConnection(function(err, connection) {
        connection.query('SELECT * FROM test_table', function (err, rows) {
            var templateVariables = {
                title: 'Index Page',
                response: rows[0].text
            };
            res.render('index', templateVariables);
        });
        connection.end();
    });
};
This is obviously a very preliminary and lightweight example; however, on this particular GET request for the index page there is already a 3-deep set of callback functions. Each callback must live within the callback of its "parent", for it depends on the result (in a sequentially executed language/environment this would be obvious and trivial).
My question is: when building more complex and potentially very large applications, how does one avoid the issue of having massive nestings of callback functions? This will of course be the case when you have sequential dependency on logic. I know the philosophy of Node is to be asynchronous, but when it comes to waiting for data from the database(s), and say we're running 5 separate queries, what then? Would we just write a single multi-statement query as an atomic unit? Though this question isn't exclusive to databases.
There's a nice general discussion on the issue here:
http://callbackhell.com/
Also, many people use modules like Async to manage the flow control issues.
Since you mention you're using Express, you can use next() as an alternative to callbacks.
app.get('/', function first(req, res, next) {
    res.write('Hello ');
    res.locals.user = req.user;
    next();
    // Execute next handler in queue
});

app.get('/', function second(req, res, next) {
    res.write('World!');
    // Use res.locals.user
    res.end();
});

// response shows Hello World!
The route handlers take an extra parameter, next, and are executed in the order they are given, until one of them returns a response. next takes either no parameters at all or an error as a parameter. You can set any variable you want to pass to the next handler on res.locals.
Use a Promise or Future library, such as Q (available on npm).
Quoting from Q's readme, promises let you turn this:
step1(function (value1) {
    step2(value1, function(value2) {
        step3(value2, function(value3) {
            step4(value3, function(value4) {
                // Do something with value4
            });
        });
    });
});
into this:
Q.fcall(step1)
    .then(step2)
    .then(step3)
    .then(step4)
    .then(function (value4) {
        // Do something with value4
    }, function (error) {
        // Handle any error from step1 through step4
    })
    .done();
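To tie this back to the route in the question, a hedged sketch assuming Q's Node helpers (Q.ninvoke, .spread, .fin) and the db/connection API from the question:
var Q = require('q');
var db = require('../db');

exports.index = function(req, res) {
    var connection;
    Q.ninvoke(db, 'getConnection')
        .then(function (conn) {
            connection = conn;
            // node-mysql's query callback is (err, rows, fields)
            return Q.ninvoke(connection, 'query', 'SELECT * FROM test_table');
        })
        .spread(function (rows, fields) {
            res.render('index', { title: 'Index Page', response: rows[0].text });
        })
        .fin(function () {
            if (connection) connection.end();
        })
        .done();
};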
Every other solution I've seen to callback hell introduces trade-offs that simply seem a step backward. Asynchronous operations do not form the natural logical boundaries between larger operations, so if you factor functions or otherwise modularize along these boundaries you will get microfactored code.
One way that I am fond of doing is like this...
exports.index = function(req, res) {
    var connection

    db.getConnection(gotConnection)

    function gotConnection(err, _c) {
        connection = _c
        connection.query('SELECT * FROM test_table', gotData)
    }

    function gotData(err, rows) {
        connection.end();
        var templateVariables = {
            title: 'Index Page',
            response: rows[0].text
        }
        res.render('index', templateVariables);
    }
}
You should also always handle errors in your code; I'm assuming you left them out to make the code easier to read here.
