Deleting large JavaScript objects when process is running out of memory - javascript

I'm a novice at this kind of JavaScript, so I'll give a brief explanation:
I have a web scraper built in Node.js that gathers (quite a bit of) data, processes it with Cheerio (basically jQuery for Node), creates an object, then uploads it to MongoDB.
It works just fine, except on larger sites. What appears to be happening is:
I give the scraper an online store's URL to scrape
Node goes to that URL and retrieves anywhere from 5,000 - 40,000 product urls to scrape
For each of these new URLs, Node's request module gets the page source then loads up the data to Cheerio.
Using Cheerio I create a JS object which represents the product.
I ship the object off to MongoDB where it's saved to my database.
As I say, this happens for thousands of URLs, and once I get to, say, 10,000 URLs loaded I get errors in Node. The most common is:
Node: Fatal JS Error: Process out of memory
Ok, here's the actual question(s):
I think this is happening because Node's garbage collection isn't working properly. It's possible that, for example, the request data scraped from all 40,000 URLs is still in memory, or at the very least the 40,000 created JavaScript objects may be. Perhaps it's also because the MongoDB connection is made at the start of the session and is never closed (I just close the script manually once all the products are done). This is to avoid opening/closing the connection every single time I log a new product.
To really ensure they're cleaned up properly (once the product goes to MongoDB I don't use it anymore, so it can be deleted from memory), can/should I simply delete it from memory using delete product?
Moreover (I'm clearly not across how JS handles objects), if I delete one reference to the object, is it totally wiped from memory, or do I have to delete all of them?
For instance:
var request = require('request');
var cheerio = require('cheerio');
var saveToDB = require('./mongoDBFunction.js');

function getData(link){
    request(link, function(error, response, data){
        var $ = cheerio.load(data);
        createProduct($);
    });
}

function createProduct($){
    var product = {
        a: 'asadf',
        b: 'asdfsd'
        // there's about 50 lines of data in here in the real products but this is for brevity
    };
    product.name = $('.selector').dostuffwithitinjquery('etc');
    saveToDB(product);
}

// In mongoDBFunction.js
exports.saveToDB = function(item){
    db.products.save(item, function(err){
        console.log("Item was successfully saved!");
        delete item; // Will this completely delete the item from memory?
    });
};

delete in javascript is NOT used to delete variables or free memory. It is ONLY used to remove a property from an object. You may find this article on the delete operator a good read.
You can remove a reference to the data held in a variable by setting the variable to something like null. If there are no other references to that data, then that will make it eligible for garbage collection. If there are other references to that object, then it will not be cleared from memory until there are no more references to it (e.g. no way for your code to get to it).
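A minimal sketch of that distinction (the variable names here are illustrative, not from the question):

```javascript
// `delete` removes a *property* from an object; it does not free variables.
var cache = { bigData: new Array(1000).fill('x') };
delete cache.bigData;              // removes the property from `cache`
console.log('bigData' in cache);   // false

// For a plain variable, drop the reference instead; the old array becomes
// eligible for garbage collection once nothing else points at it.
var bigArray = new Array(1000).fill('y');
bigArray = null;
```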
As for what is causing the memory accumulation, there are a number of possibilities and we can't really see enough of your code to know what references could be held onto that would keep the GC from freeing up things.
If this is a single, long running process with no breaks in execution, you might also need to manually run the garbage collector to make sure it gets a chance to clean up things you have released.
Here are a couple of articles on tracking down memory usage in Node.js: http://dtrace.org/blogs/bmc/2012/05/05/debugging-node-js-memory-leaks/ and https://hacks.mozilla.org/2012/11/tracking-down-memory-leaks-in-node-js-a-node-js-holiday-season/.

JavaScript has a garbage collector that automatically tracks which variables are "reachable". If a variable is "reachable", then its value won't be released.
For example, if you have a global variable var g_hugeArray and you assign it a huge array, you actually have two JavaScript objects here: one is the huge block that holds the array data; the other is a property on the window object whose name is "g_hugeArray" and which points to that data. So the reference chain is: window -> g_hugeArray -> the actual array.
In order to release the actual array, you make it "unreachable". You can break either link in the above chain to achieve this. If you set g_hugeArray to null, you break the link between g_hugeArray and the actual array. This makes the array data unreachable, so it will be released when the garbage collector runs. Alternatively, you can use "delete window.g_hugeArray" to remove the property "g_hugeArray" from the window object. This breaks the link between window and g_hugeArray and also makes the actual array unreachable.
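The same chain can be sketched in runnable form; this uses the standard globalThis instead of window so it also works in Node:

```javascript
// Reference chain: globalThis -> g_hugeArray -> the actual array.
globalThis.g_hugeArray = new Array(100000).fill(0);

// Breaking either link makes the array unreachable. Here we remove the
// property itself, severing the globalThis -> g_hugeArray link:
delete globalThis.g_hugeArray;
console.log('g_hugeArray' in globalThis); // false
```

Note that delete works here only because the property was created by plain assignment; a top-level `var` declaration in a browser script creates a non-configurable property that delete cannot remove.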
The situation gets more complicated when you have closures. A closure is created when you have a local function that references a local variable. For example:
function a() {
    var x = 10;
    var y = 20;
    setTimeout(function() {
        alert(x);
    }, 100);
}
In this case, local variable x is still reachable from the anonymous timeout function even after function "a" has returned. Without the timeout function, both local variables x and y would become unreachable as soon as function a returns; the existence of the anonymous function changes this. Depending on how the JavaScript engine is implemented, it may choose to keep both x and y (because it doesn't know whether the function will need y until the function actually runs, which occurs after function a returns), or, if it is smart enough, it can keep only x. Imagine that both x and y point to big things: this can be a problem. So closures are very convenient, but at times they are more likely to cause memory issues, and they can make memory issues harder to track down.
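One common mitigation, shown here as a sketch with illustrative names, is to copy out the small piece you actually need and null the big local before the closure outlives the function, so the closure never keeps the large data alive:

```javascript
function makeLogger() {
    let big = new Array(1000000).fill('x'); // large temporary
    const size = big.length;                // keep only the small value we need
    big = null;                             // the big array can now be collected
    return function () {                    // the closure captures only `size`
        return size;
    };
}

const logger = makeLogger();
console.log(logger()); // 1000000
```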

I faced the same problem in an application with similar functionality. I looked for memory leaks and the like; my process's memory consumption reached 1.4 GB, depending on the number of links to be downloaded.
The first thing I noticed was that after manually running the garbage collector, almost all memory was freed. Each page that I downloaded took about 1 MB, was processed, and was stored in the database.
Then I installed heapdump and looked at a snapshot of the application. More information about memory profiling can be found on the Webstorm Blog.
My guess was that while the application was running, the GC never started. To check this, I began running the application with the --expose-gc flag and triggering the GC manually as the program ran.
const runGCIfNeeded = (() => {
    let i = 0;
    return function runGCIfNeeded() {
        if (i++ > 200) {
            i = 0;
            if (global.gc) {
                global.gc();
            } else {
                logger.warn('Garbage collection unavailable. Pass --expose-gc when launching node to enable forced garbage collection.');
            }
        }
    };
})();
// run GC check after each iteration
checkProduct(product._id)
.then(/* ... */)
.finally(runGCIfNeeded)

Interestingly, if you do not use const, let, var, etc. when you define something in the global scope, it becomes a property of the global object, and deleting it returns true. That allows it to be garbage collected. I tested it like this, and it seems to have the intended effect on my memory usage; please let me know if this is incorrect or if you get drastically different results:
x = [];
process.memoryUsage();
i = 0;
while (i < 1000000) {
    x.push(10.5);
    i++;
}
process.memoryUsage();
delete x;
process.memoryUsage();

Related

Is it necessary to nullify primitive values for garbage collection?

If I have the following code:
function MyClass() {
    this.data = {
        // lots of data
    };
}
var myClassInstance = new MyClass();
var myobj = {
    num: 123,
    str: "hello",
    theClass: myClassInstance
};
I know it's absolutely necessary to do:
myobj.theClass = null;
To free up myClassInstance and its data property for GC. However, what should I do with myobj.num and myobj.str? Do I have to give them a value of null too? Does the fact that they're primitive change anything regarding GC?
The JavaScript runtime that implements garbage collection will be able to collect items as soon as values are no longer reachable from code. This is true for object references as well as primitives. The details of the exact moment the item is collected varies by implementation, but it is not even necessary to set your object references to null (as you state) unless you need the object cleaned up sooner than the natural termination of the current function.
This all ties into the fundamental concept of "scope" and the scope chain. When an item is no longer in any other object's scope chain, it can be collected. Understanding this clearly will answer this question and also help you understand closures, which are scenarios where items stay in memory longer than you might have expected.
There are a lot of "it depends" here, ranging from what your code is doing to what browser you're running in. However, if your object is JIT-compiled not to use a map for its attributes, then the number should be an 8-byte double stored inline inside the object. Nulling it will do nothing.
The string and the MyClass instance will be pointers to memory allocated outside the object (since a string can be arbitrarily many bytes, it can't be stored inline; a compiler could conceivably store one instance of the string in memory and never free it, however). Nulling them can allow the garbage collector to free them before the main object goes out of scope.
However, the real question is why you're worried about this. Unless you have profiled your code and identified garbage collection or memory leaks as a problem, you should not be trying to optimize GC behavior. In particular, unless your myobj object is itself going to be live for a long time, you should not worry about nulling fields. The GC will collect it when it goes out of scope.
Setting the property to undefined (not null) will also work; however, delete is better, for example delete myobj.theClass.
Just to avoid misunderstanding, I will say that there is no way to really delete an object from memory in JavaScript. You delete its references or set them to undefined so that the GC can do its work and actually free it.
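Putting the suggestions side by side, as a small sketch based on the question's object:

```javascript
function MyClass() { this.data = { /* lots of data */ }; }

var myClassInstance = new MyClass();
var myobj = {
    num: 123,
    str: "hello",
    theClass: myClassInstance
};

delete myobj.theClass;             // removes the property entirely
// myobj.theClass = undefined;     // alternative: keeps the property, drops the value
console.log('theClass' in myobj);  // false

// The instance only becomes collectable once *every* reference is gone:
myClassInstance = null;
```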

jQuery closures garbage collector

Before asking I read this question:
How to free up the memory in JavaScript
When your variable data is large and you want that data to be eligible for garbage collection, you are correct to assign something small to that variable, like undefined or null or "". But it only makes sense when the variable persists (e.g. it's global or part of some persistent data structure), as stated in the previous question.
Then I made a test request like this:
setInterval(function() {
    $.get('http://localhost/small-data', function(r) {
        r = null;
    }, 'json');
}, 1000);
In my example, my URL (at localhost) will output only 5KB;
jQuery will parse the response r as JSON. I create a loop with setInterval that does nothing else, and after a few hours Firefox shows more than 1 GB in the task manager.
It doesn't make sense: Firefox never frees the memory. Does this example code have a memory leak?

How is cyclic referencing handled in javascript and browsers?

I have been exploring patterns in various MV* frameworks out there and today noticed a weird one, which seems to cause some issues
The Model prototype has a property collections: [].
The Collection prototype has a property models: [].
When a collection gets a new model, it is being pushed into collection.models but the model itself is also decorated to be aware of the collection it is a member of - i.e. the collection instance is pushed into model.collections.
So model.collections[0] is a collection whose .models[0] is the model that has a collections property... and so on.
at its most basic:
var A = function() {
    this.collections = [];
},
B = function() {
    this.models = [];
    this.add = function(what) {
        what.collections.push(this);
        this.models.push(what);
    };
};
var model = new A();
var collection = new B();
collection.add(model);
Here's the guilty party in action: https://github.com/lyonbros/composer.js/blob/master/composer.js#L310-313 and then further down it's pushing into models here: https://github.com/lyonbros/composer.js/blob/master/composer.js#L781-784
I suppose there is going to be a degree of lazy evaluation - things won't be used until they are needed. That code - on its own - works.
But I was also writing tests via buster.js, and I noticed that all the tests that relied on sinon.spy() were producing InternalError: too much recursion (FF) or RangeError: Maximum call stack size exceeded (Chrome). The captured FF was even crashing unresponsively, which I had never encountered with the buster test driver before; it even went to 3.5 GB of RAM use over my lunch break.
After a fair amount of debugging, I undid the reference storage and suddenly, it was all working fine again. Admittedly, the removal of the spy() assertions also worked but that's not the point.
So, the question is - having code like that, is it acceptable, how will the browsers interpret it, what is the bottleneck and how would you decorate your models with a pointer to the collection they belong in (perhaps a collection controller and collection uids or something).
full gist of the buster.js test that will fail: https://gist.github.com/2960549
The browsers don't care. The issue is that the tool you were using failed to check for cyclic reference chains through the object graph. Those are perfectly legitimate, at least they are if you want them and expect them.
If you think of an object and its properties, and the objects referenced directly or indirectly via those properties, then that assembly makes up a graph. If it's possible to follow references around and wind up back where you started, then that means the graph has a cycle. It's definitely a good thing that the language allows cycles. Whether it's appropriate in a given system is up to the relevant code.
Thus, for example, a recursive function that traverses an object graph without checking to see if it's already visited an object will definitely trigger a "too much recursion" error if the graph is cyclic.
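A sketch of such a traversal that tolerates cycles by remembering visited objects in a WeakSet (the graph shape here mirrors the model/collection pair from the question):

```javascript
// Count objects in a possibly-cyclic graph without recursing forever.
function countObjects(value, seen = new WeakSet()) {
    if (value === null || typeof value !== 'object' || seen.has(value)) {
        return 0; // primitives and already-visited objects end the recursion
    }
    seen.add(value);
    let count = 1;
    for (const key of Object.keys(value)) {
        count += countObjects(value[key], seen);
    }
    return count;
}

const model = { collections: [] };
const collection = { models: [model] };
model.collections.push(collection); // model <-> collection cycle

console.log(countObjects(model)); // 4 (model, its array, collection, its array)
```

Without the seen check, the same call would bounce between model and collection until the call stack overflowed.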
There will only be two objects referencing each other (a "circular reference"):
var a = {}, b = {};
a.b = b; // a.b: pointer to b
b.a = a; // b.a: pointer to a
There is no recursion at all in such a structure. If you are getting too much recursion or Maximum call stack size exceeded errors, there must be a function somewhere that recurses too deeply. This could happen, for example, when you try to clone the objects and recurse over their properties without accounting for circular references. You'll need to look further into your code; the error messages should also include a (very long) call stack.

Trying to Understand Javascript Closures + Memory Leaks

I've been reading up a lot on closures in JavaScript. I come from a more traditional (C, C++, etc.) background and understand call stacks and such, but I am having trouble with memory usage in JavaScript. Here's a (simplified) test case I set up:
function updateLater() {
    console.log('timer update');
    var params = new Object();
    for (var y = 0; y < 1000000; y++) {
        params[y] = {'test': y};
    }
}
Alternatively, I've also tried using a closure:
function updateLaterClosure() {
    return (function() {
        console.log('timer update');
        var params = new Object();
        for (var y = 0; y < 1000000; y++) {
            params[y] = {'test': y};
        }
    });
}
Then, I set an interval to run the function...
setInterval(updateLater, 5000); // or var c = updateLaterClosure(); setInterval(c,5000);
The first time the timer runs, the memory usage jumps from 50 MB to 75 MB (according to Chrome's Task Manager). The second time, it goes above 100 MB. Occasionally it drops back down a little, but never below 75 MB.
Check it out yourself: https://local.phazm.com:4435/Streamified/extension/branches/lib/test.html
Clearly, params is not being fully garbage collected, because the memory from the first timer call is not being freed... yet, neither is it adding 25MB of memory on EACH call, so it is not as if the garbage collection is NEVER happening... it almost seems as though one instance of "params" is always being kept around. I've tried setting up a sub-closure and other things... no dice.
What is MOST disturbing, though, is that the memory usage trends upwards. It might "just" be 75MB for now, but leave it running for long enough (overnight) and it'll get to 500 MB.
Ideas?
Thanks!
Allocating 25 MB causes a GC to happen. This GC cleans up the previous instance, but of course not the current one, so you always have one instance around.
GC does not happen while the program is idle; it does not run between your timer calls, so the memory stays around.
That is not even a closure. A closure is when you return something from a function (an array, a function, an object, or anything that can contain references) and it carries with it all the local variables of that function.
What you have there is just a very long loop building a very big object, and perhaps your memory does not get reclaimed as fast as you are building the huge objects.
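For contrast, a minimal actual closure; the returned function keeps the enclosing local alive between calls:

```javascript
function makeCounter() {
    let count = 0;           // local to makeCounter...
    return function () {     // ...but captured by the returned function
        return ++count;
    };
}

const counter = makeCounter();
console.log(counter()); // 1
console.log(counter()); // 2
```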

Garbage Collection and jQuery?

How do jQuery (JavaScript) and GC work?
callBack is a function that runs as a callback to a JSON response.
What will be in memory when the callBack function has executed?
What I would like to hear is that the data object and the autoCompleteData will be garbage collected. And only the data stored in $("input#reciever") resides in the memory.
Is this the case?
//The code in question:
var callBack = function(data) {
var autoCompleteData = jQuery.map(data.receivers, function(receiver, i){
return {label: receiver.name, id: receiver.id };
});
$("input#reciever").autocomplete({
source: autoCompleteData,
select: function(event, receiver) {
$("input#reciever").val(receiver.item.label);
$("input#recieverId").val(receiver.item.id);
return false;
}
});
}
Garbage collection in Javascript works by freeing the memory of any object that no other javascript code has a reference to. If nobody has a reference to it, it can't be in use any more so it can be safely freed.
References to an object can be from a variable or from a scope of execution that is still active.
In your example above, while waiting for the .autocomplete() function to finish, everything in your code is still in scope, and nothing will be garbage collected. That means autoCompleteData will be preserved (not garbage collected) until the .autocomplete() method is completely done executing. This is normal, expected, and in fact required for proper function in many places.
As a measure of one reason why this data is still in scope, the variable autoCompleteData is still in scope in the select callback function. It would be legal and proper for you to reference that variable in the select callback function. Thus the JS engine must not garbage collect it until it is no longer in scope and can no longer be referenced by any code.
In some cases, you can cause memory to be available for garbage collection by explicitly clearing a variable.
For example, if you restructured your code like this:
var callBack = function(data) {
    $("input#reciever").autocomplete({
        source: jQuery.map(data.receivers, function(receiver, i){
            return {label: receiver.name, id: receiver.id };
        }),
        select: function(event, receiver) {
            $("input#reciever").val(receiver.item.label);
            $("input#recieverId").val(receiver.item.id);
            return false;
        }
    });
};
Then, the autocomplete data only exists as an argument to .autocomplete(), and it may be eligible for garbage collection sooner, as there is no requirement that the JS engine keep that data until the select callback is called, as there was before. Whether the data is actually garbage collected right away depends on whether the internal implementation of .autocomplete() stores it somewhere that lasts until select is called.
FYI, the exact timing of garbage collection matters most with big pieces of data (many megabytes) or zillions of pieces of data that add up to hundreds of megabytes. If the size of something is measured in kilobytes, or even hundreds of kilobytes, and there's only one of it, then whether the memory is garbage collected immediately or when a callback fires is not all that important, as browsers these days have access to a reasonable amount of memory. Giant pieces of data, zillions of them, or a leak in something repetitive could all cause problems (particularly on mobile), but an example like the one above is unlikely to cause an issue unless the data set is large relative to the memory available in the browser.
Objects are passed by reference in JavaScript, so the object reached by accessing autoCompleteData will be the same one that the autocomplete plugin uses.
Because of this, the variable autoCompleteData will not be garbage collected (but this is not detrimental to your program, as it's required by the autocomplete plugin).
The data object however, should be garbage collected, as nothing is providing a reference to it, and it has fallen out of scope.
Additionally, it is important to note that garbage collection does not work differently for jQuery; it behaves the same as it does across JavaScript (and, of course, all other JavaScript frameworks).
