I have a very large collection of objects in Node.js (millions) that I need to keep in memory for caching purposes; they are maintained in several global hash objects. Each hash collection stores about 750k keys.
In order to keep GC work to a minimum, I want to find out the best way to store these items. Would it be better to split these items into hundreds of thousands of hashes? Should I maybe not use hashes at all? Is there any way to keep them off the heap completely so they are never examined by the GC (and if so, how would I do that)?
There is no public API to control garbage collection from JavaScript.
But GC has come a long way over the years. Modern GC implementations notice that some objects live a long time and move them into a special "area" that is collected only rarely.
How exactly this works is completely implementation dependent; every browser does its own thing, and it often changes again when a new browser version is released.
EDIT The layout and organization of your data in memory is largely irrelevant. Modern GCs are really hard to understand in detail without investing a couple of weeks reading the actual code. So what I explain now is a very simplified picture; the real code works differently (and some GCs use completely different tricks to achieve the same goal).
Imagine that the GC keeps a counter for each object in which it counts how often it has seen that object in the past. It also keeps several lists holding objects of different ages, that is, objects whose counters have passed certain thresholds. When a counter reaches some limit, the object is moved to the next list.
The first list is visited every time GC runs. The second list is only considered for every Nth GC run.
An alternative implementation might add new objects to the top of the "GC list" and, on every GC run, only check the first N elements. Long-lived objects thus move down the list, and after a while they won't be checked every time.
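To make the idea concrete, here is a toy model of that aging scheme, purely illustrative; real engines do this natively and very differently, and the names are made up:

var youngList = [];   // scanned on every collection
var oldList = [];     // scanned only on every Nth collection
var PROMOTION_THRESHOLD = 3;
// New allocations would start in the young list, e.g. youngList.push({ timesSeen: 0, value: obj });

function toyCollect() {
  // Only models promotion by age; real collectors also check reachability.
  youngList = youngList.filter(function (entry) {
    entry.timesSeen += 1;
    if (entry.timesSeen >= PROMOTION_THRESHOLD) {
      oldList.push(entry);   // long-lived: rarely examined from now on
      return false;
    }
    return true;             // still young: will be scanned again next time
  });
}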
What this means for you is that you don't have to do anything; the GC will figure out that your huge map lives a long time (and the same is true for all objects in the map) and after a while, it will start to ignore this data structure.
Related
In attempting to determine how to most efficiently perform a series of repetitive steps, I find that I know very little about memory usage in a web browser. I've read about the garbage collector and memory leaks several times before, but I still know very little. The specific scenario I am interested in is as follows.
A set of data objects is retrieved from indexedDB using a getAll statement over a key range. The data is all text, would never exceed 0.25MB, and is likely always less than 0.1MB.
The array of data objects is passed to a function that builds a node in a document fragment and replaces a node in the DOM.
As the user modifies/adds/deletes the data, the state is saved in the database using a get/put sequence, or an add if a new object is built. This is performed on one object at a time, not on the whole array.
When the user is finished, the next set of data is retrieved and the DOM node is replaced again. The users can navigate forward and backward through the packets of data they build.
After the new node is built, the variable that holds the array of data objects is no longer used and is set to null in an attempt to avoid any potential memory leaks. But I started thinking about what really happens to that data and memory allocation each time the functions are invoked.
Suppose the user rapidly clicks through a series of data packets. Each time the data is retrieved and the new nodes are built and replaced in the DOM, is the same memory space being used over and over again? Or does each invocation take up more space, with the memory used by the earlier invocations released later by the GC, such that rapidly clicking through ten nodes temporarily takes up ten nodes' worth of data? I mean this to include all the data involved in these steps, not just the DOM node being replaced.
One reason I ask is that I was considering holding the array of data objects in RAM, updating the objects as the user makes edits/builds more objects, and then just using a put operation to record them in the database rather than a get/put sequence. That made me wonder what takes place if the array is held in a property of one of the functions. Will the memory backing the function property be re-used with each new invocation, or will it also take up more space each time until the former invocations are released?
I thought that, even when the variables are set to null after the new nodes are built, the GC may take time to release the memory anyway, such that memory is always in use; in that case it may as well hold one data packet at a time and eliminate the need for a get, waiting for the onsuccess event, updating the object, and putting it back again.
Of course, all of this works very quickly with the get/put sequence, but I'd like to understand what is happening regarding memory; and, if setting the variables to null and not holding the array in RAM is really saving nothing, then there is no reason to use get, and the less work done may make it less likely that, after a user works with this tool for an hour or two, there would be a memory issue.
Thank you for considering my rather novice question.
Thank you for the comments. Although they are interesting, they don't really concern what I am asking. I'm not asking about memory leaks, but only commented that the variables are being set to null at the close of the functions. They are set to null because I read that if the references to an area of memory are 'broken' it helps the GC's mark-and-sweep algorithm identify that the particular area of memory may be released. And, if some type of reference back to a variable remains, it will at least be pointing to null rather than an area of remaining data. Whether or not that is true, I don't know for sure; but that is what I read. I read many articles on memory leaks but these three, article 1, article 2, and article 3 I was able to find again easily to provide for examples. In article 3, page 5 shows setting a variable to null as a way to protect against memory leaks by making an object unreachable.
I don't understand why the question would be considered a micro optimization. It's just a novice question about how memory is used in the browser when multiple invocations of functions take place in between GC cycles. And it's just an example of what I've been examining, and does not depend on indexedDB at all, nor whether or not memory leaks really exist in properly coded applications. The description I provided likely made it confusing.
If a function uses local variables to retrieve an array of data objects from an indexedDB object store, and in the transaction.oncomplete handler passes a reference to that area of memory to another function that uses it to build a document fragment and replace a DOM node, what happens to the data? Two functions have a reference to the array and one to the fragment. If the references are manually broken by setting the variables that pointed to them to null (and even if they are not), eventually the GC will release that memory. But if the user clicks through rapidly, repeatedly invoking these functions in a short interval of time, that is, between GC cycles, will each invocation allocate a new area of memory for the data array and fragment, such that between GC cycles there could be ten sets of data areas held in RAM waiting to be released? If the array were held in a property of the function that retrieves it, would each invocation re-use that same area of memory, or would the function property simply change its reference to a new area of memory holding the new array, breaking the reference to the memory holding the previous array, so that there would still be ten sets of data areas in between GC cycles?
As I wrote before, I was wondering if there is a way to re-use the same memory area for each invocation. If that is not possible, and there will always be several sets of data held in RAM waiting to be released between GC cycles, it may be beneficial to retain a reference to the current data set and use it to reduce the amount of work the browser performs to save the current state, which is in itself an entirely separate issue but one that depends on how the memory is used by the browser.
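To make the scenario concrete, here is a rough sketch of the pattern I am describing; db, the 'packets' store, buildFragment, and the use of in-line keys are placeholders and assumptions, not my actual code:

var currentPacket = null;   // long-lived reference to the data set currently displayed

function loadPacket(range, domNode) {
  var req = db.transaction('packets', 'readonly').objectStore('packets').getAll(range);
  req.onsuccess = function () {
    currentPacket = req.result;        // the previous array loses its last reference here
    var fragment = buildFragment(currentPacket);
    domNode.replaceWith(fragment);     // the replaced node becomes garbage as well
  };
}

function saveEdit(index, changes) {
  Object.assign(currentPacket[index], changes);  // edit the copy held in RAM
  // put only, no prior get (assumes the store uses in-line keys)
  db.transaction('packets', 'readwrite').objectStore('packets').put(currentPacket[index]);
}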
When I observe snapshots of the memory in the Firefox developer tools, memory use grows as I step through these data sets, repeatedly retrieving new data and building a new fragment to replace the DOM node; but the amount of that data is relatively small, and it may just build up until the GC cycle runs. From this observation, it appears that each invocation uses a new area of memory and breaks the reference to the previous one. Therefore, maintaining a reference to the current data set is not a memory issue, because there are always many such memory areas held in RAM between GC cycles anyway.
Nonetheless, I must have an issue of some sort, because after adding 100 data packets and navigating up and down through them with Next/Previous buttons, the memory usage continues to grow, nearly all of it in the domNode section. It starts out at 7MB total, 6MB in domNode and 5MB of that in #document. #document remains at 5MB, but domNode grows to at least 150MB while I do nothing but move up and down the records, retrieving data, building and replacing nodes, and never editing the data; there is never more than one node, and the replacement is always exactly the same size because in this test the 100 data packets are identical copies. So: just getAll, build a fragment, replace a DOM node, over and over again.
Thank you.
I have some very large objects that are used intensively but only occasionally in my Node.js program. Loading these objects is expensive. In total they occupy more memory than I have in the system.
Is there any way to create a "weak reference" in JavaScript, so that the garbage collector will remove my objects when memory is low, and then I can check whether the object is still there and reload it if it was garbage collected since my last access?
The particular use case I had in mind was cartographic reprojection and tiling of gigabytes of map imagery.
Is there any way to create a "weak reference" in Javascript, so that the garbage collector will remove my objects when memory is low?
No, Javascript does not have that.
I don't think a WeakMap or WeakSet will offer you anything useful here. They don't do what you're asking for. Instead, they allow you to hold a reference to something without prohibiting garbage collection. But if there are no other references to the data, it will be garbage collected immediately, so they won't keep the data around for a while like you want. And if you have any other reference to those objects (to keep them around), they will never get garbage collected. JavaScript doesn't offer a weak reference that is only garbage collected when memory starts to get full. Something is either eligible for garbage collection or it isn't; if it's eligible, it will be freed in the next GC pass.
It sounds like what you probably want is a memory cache. You decide how large you want the cache to be and then keep items in it based on some strategy. The most common strategy is LRU (least recently used): when the cache reaches its size limit and you need to load a new item, you kick out the item that was used least recently. To do that, you keep track of when each item in the cache was last used. If you are trying to manage the cache to a memory-usage target, you will also need some scheme for estimating the memory usage of your objects in the cache.
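For illustration, a minimal Map-based LRU cache along those lines might look like this; loadObject() and the size limit are placeholders for your own code:

// A Map remembers insertion order, so the first key is the least recently used one.
var MAX_ENTRIES = 10;
var cache = new Map();

function getCached(key) {
  if (cache.has(key)) {
    var value = cache.get(key);
    cache.delete(key);          // re-insert to mark as most recently used
    cache.set(key, value);
    return value;
  }
  var obj = loadObject(key);    // your expensive load (placeholder)
  cache.set(key, obj);
  if (cache.size > MAX_ENTRIES) {
    var oldestKey = cache.keys().next().value;   // least recently used entry
    cache.delete(oldestKey);
  }
  return obj;
}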
Note that many databases will essentially offer you this functionality as a built-in feature since they will usually contain some sort of caching themselves so if you request an item from the database that you have recently requested, it probably comes from a read cache. You don't really say what your objects are so it's hard for us to suggest exactly how they could be used from a database.
Consider the JavaScript code below. I am creating an array of, let's say, 4 elements and then immediately deleting the reference to it. When will garbage collection happen? I know it is specific to the language implementation, but there are not that many JavaScript engines.
Edit: it is the simplest possible case, but it interests me because garbage collection causes audible glitches in some web audio applications.
var a = [1, 2, 3, 4];
a = null;
// other code
Update:
Javascript and Garbage collection does not explain the sequence of events or how collection is triggered. I don't want to control garbage collection; I need a better understanding in order to design better code.
There are many ways GCs can be triggered:
allocation-triggered: there is no more room in the allocation buffers / young generation / however it is called, and a GC is required to free up objects. This gets a bit more complicated in a mixed GC/manual-allocation environment: malloc wrappers might need to trigger GCs too, since JavaScript can hold onto manually allocated resources.
[in browsers] when a tab/window gets closed/replaced on the assumption that there is lots of easy-to-collect garbage at that point
time-triggered by heuristics for incremental collections to meet pause time goals
privileged javascript may be able to trigger collections directly
as a last-ditch effort by various non-GC-managed components when they run out of native resources (file handles, virtual address space), in the hope that GC objects awaiting finalization were holding onto them
There probably are other reasons that I can't think of right now. It's generally not something you should concern yourself with.
If you want to keep GC activity low, you need to keep your object allocation rate low. For example, some algorithms can benefit from pre-allocated buffers on which they perform their work (see the sketch below). The effectiveness of that strategy depends on whether the JavaScript runtime has escape analysis and how effective it is at avoiding short-lived allocations in the first place, and on whether the collector is generational. A generational collector suffers a lot less from rapid, short-lived allocations.
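As a sketch of the pre-allocated buffer idea (relevant to the web audio concern above; the block size and the per-sample work are made up):

// Allocate the working buffer once, outside the hot path.
var scratch = new Float32Array(128);

// Called many times per second. Writing into the pre-allocated buffer instead of
// creating a new array on each call keeps the allocation rate, and thus GC pressure, low.
function processBlock(input) {
  for (var i = 0; i < scratch.length; i++) {
    scratch[i] = input[i] * 0.5;   // placeholder per-sample work
  }
  return scratch;                  // callers must copy if they need to keep the result
}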
It's a good approach to delete the references to objects that are no longer in use. Many (almost all) JavaScript engines use a mark-and-sweep approach. If you delete a reference to an object and it has no other references, it will probably be collected in the next mark-and-sweep cycle.
Garbage collection is a cyclic process that happens at regular intervals in some engines. Some engines claim to do GC in an efficient manner, but there is little solid documentation available.
I just stumbled upon this jsperf result: http://jsperf.com/delet-is-slow
It shows that using delete is slow in JavaScript, but I am not sure I get why. What is the JavaScript engine doing behind the scenes to make things slow?
I think the question is not why delete is slow... The speed of a simple delete operation is not worth measuring...
The JS perf link that you show does the following:
Create two arrays of 6 elements each.
Delete at one of the indexes of one array.
Iterate through all the indexes of each array.
The script shows that iterating through an array on which delete was applied is slower than iterating through a normal array.
You should ask yourself: why does delete make an array slow?
The engine internally stores array elements in contiguous memory space and accesses them using a numeric indexer.
That's what they call a fast access array.
If you delete one of the elements in this ordered and contiguous index, you force the array to mutate into dictionary mode... thus, what was previously the exact location of the item in the array (the index) becomes a key in a dictionary under which the array has to look the element up.
So iterating becomes slow: you no longer move to the next spot in contiguous memory, but instead perform a hash lookup over and over again.
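You can observe this yourself with a small timing test (results vary by engine and version; modern engines may use an intermediate "holey" representation rather than a full dictionary):

var fast = [], holey = [], i, sum, t;
for (i = 0; i < 1000000; i++) { fast[i] = i; holey[i] = i; }
delete holey[500000];   // punching a hole typically forces a slower element representation

t = Date.now(); sum = 0;
for (i = 0; i < fast.length; i++) sum += fast[i];
console.log('contiguous: ' + (Date.now() - t) + ' ms');

t = Date.now(); sum = 0;
for (i = 0; i < holey.length; i++) sum += holey[i] || 0;   // the hole reads as undefined
console.log('after delete: ' + (Date.now() - t) + ' ms');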
You'll get a lot of answers here about micro-optimisation, but delete really does sometimes have severe problems in JS, becoming incredibly slow in certain scenarios that people should be aware of. These are, to my knowledge, edge cases and may or may not apply to you.
I recommend profiling and benchmarking in different browsers to detect these anomalies.
I honestly don't know the reasons, as I tend to just work around it, but I would guess a combination of quirks in the GC (it might be getting invoked too often), brutal rehashing, optimisations for other cases, and weird object structure/bad time complexity.
The cases usually involve moderate to large numbers of keys, for example:
Delete from objects with many keys:
(function() {
  var o = {}, s, i, c = console;
  // Add 100,000 numeric keys (every 10th integer up to 1,000,000).
  s = new Date(); for (i = 0; i < 1000000; i += 10) o[i] = i; c.log('Set: ' + (new Date() - s));
  // Delete the first 5,000 of those keys.
  s = new Date(); for (i = 0; i < 50000; i += 10) delete o[i]; c.log('Delete: ' + (new Date() - s));
})();
Chrome:
Set: 21
Delete: 2084
Firefox:
Set: 74
Delete: 2
I have encountered a few variations of this and they are not always easy to reproduce. A signature is that it usually seems to degrade exponentially. In one case in Firefox, delete inside a for-in loop would degrade to around 3-6 operations a second, whereas deleting while iterating Object.keys would be fine.
I personally tend to think that these cases can be considered bugs. You get massive, disproportionate, asymptotic performance degradation that you can work around in ways that shouldn't change the time or space complexity, or that might even normally make performance moderately worse. This means that, when JS is considered as a declarative language, the implementation/optimisations get it wrong. Map does not have the same problem with delete, as far as I have seen.
Reasons:
To be sure, you would have to look into the source code or run some profiling.
delete in various scenarios can change speed arbitrarily based on how engines are written and this can change from version to version.
JavaScript objects tend not to be used with large numbers of properties, and delete is called relatively infrequently in everyday usage. They're also used heavily as part of the language itself (they're essentially associative arrays). Virtually everything relies on the object implementation: if you create a function, that creates an object, and even a numeric literal gets temporarily wrapped in an object when you call a method on it.
It's quite possible for it to be slow purely because it hasn't been well optimised (either neglect or other priorities) or due to mistakes in implementation.
There are some common possible causes aside from optimisation deficit and mistakes.
Garbage Collection:
A poorly implemented delete function may inadvertently trigger garbage collection excessively.
Garbage collection has to iterate over everything in memory to find out if there are any orphans, traversing variables and references as a graph.
The need to detect circular references can make GC especially expensive. GC without circular reference detection can be done using reference counters and a simple check. Circular reference checking requires traversing the reference graph or another structure for managing references and in either case that's a lot slower than using reference counters.
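For instance, a plain reference counter can never free the cycle below, while a mark-and-sweep collector reclaims it as soon as it becomes unreachable from any root:

function makeCycle() {
  var a = {};
  var b = {};
  a.partner = b;   // a -> b
  b.partner = a;   // b -> a, so each object always has one incoming reference
}
makeCycle();
// After the call returns, neither object is reachable from a root. A reference
// counter still sees a count of 1 on each; mark-and-sweep collects both.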
In normal implementations, rather than freeing and rearranging memory every time something is deleted, things are usually marked as possible to delete with GC deferred to perform in bulk or batches intermittently.
Mistakes can potentially lead to the entire GC being triggered too frequently. That may involve someone having put in an instruction to run the GC on every delete, or a mistake in its statistical heuristics.
Resizing:
For large objects, it's also quite possible that the memory remapping needed to shrink them is not well optimised. When you have dynamically sized structures inside the engine, it can be expensive to resize them. Engines also have varying levels of their own memory management on top of that provided by the operating system, which significantly complicates things.
Where an engine manages its own memory, even a delete that frees an object efficiently without needing a full GC run (because it has no circular references) can trigger the need to rearrange memory internally to fill the gap.
That may involve allocating memory at the new size and copying everything from the old memory to the new memory before freeing the old memory. In some cases it can also require updating all of the pointers to the old memory (i.e., where a pointer-to-pointer indirection ["all roads lead to Rome"] wasn't used).
Rehashing:
The engine may also rehash on deletes (needed because the object is an associative array). Often implementations only rehash on demand, when there are hash collisions, but some engines may also rehash on deletes.
Rehashing on deletes prevents a problem where you add 10 items to an object, then add a million items, then remove those million items, and the object left with the 10 items both takes up more memory and is slower.
A hash with ten items needs ten slots with an optimal hash, though it will typically round up to 16 slots. At the size of a pointer per slot, that is 16 * 8 bytes, or 128 bytes. When you add the million items it needs about a million slots (roughly 2^20), which at 8 bytes per slot is about 8 megabytes. If you delete the million keys without rehashing, the object with ten items is still taking up 8 megabytes when it only needs 128 bytes. That makes it important to rehash on item removal.
The problem with this is that when you add, you know whether you need to rehash because there will be a collision; when deleting, you don't know whether you need to rehash or not. It's easy to make a performance mistake here and rehash every time.
There are a number of strategies for choosing reasonable downsizing intervals, although it is easy to make a mistake without making things complicated. Simple approaches (e.g., time since last rehash, table size versus minimum size, history of collisions, tombstone keys with bulk deletion as part of GC, etc.) tend to work moderately well on average but can also easily get stuck in corner cases. Some engines might switch to a different hash implementation for large objects, such as a nested one, whereas others might try to use one implementation for everything.
In simple implementations, rehashing tends to work the same way as resizing: create an entirely new table, then insert the old entries into it. For rehashing, however, a lot more needs to be done beforehand.
No Bulk:
The delete keyword doesn't let you remove a bunch of things from a hash at once. It's usually much more efficient to delete items from a hash in bulk (most operations on the same structure work better in bulk), but delete only allows you to remove them one by one. That makes it slow by design for cases with multiple deletes on the same object.
Because of this, only a handful of implementations of delete will be comparable in speed to creating a new object and inserting only the items you want to keep (though this doesn't explain why, in some engines, delete is slow in its own right for a single call).
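A common workaround, sketched below, is to rebuild the object with only the keys you want to keep instead of calling delete in a loop:

// Instead of looping delete obj[key] for every key in keysToRemove:
function withoutKeys(obj, keysToRemove) {
  var drop = new Set(keysToRemove);
  var result = {};
  for (var key in obj) {
    if (Object.prototype.hasOwnProperty.call(obj, key) && !drop.has(key)) {
      result[key] = obj[key];    // copy only the survivors
    }
  }
  return result;                 // the old object becomes garbage in one piece
}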
V8:
Slow delete of object properties in JS in V8
Apparently this was caused by switching between implementations, and a problem similar to but different from the downsizing problem seen with hashes (downsizing to a flat array / linear-scan implementation rather than shrinking the hash). At least I was close.
Downsizing is a fairly common problem in programming which results in what is somewhat akin to a game of Jenga.
I am writing a code that solves equations for users. The user will be able to enter matrices, vectors and such. Right now, as I solve those equations, I write the solution to a table with each row representing any given variable or equation. The javascript object that was used to hold the equation during calculation is destroyed. I did this on purpose to prevent large arrays of data being stored in memory and potentially slowing down the script. As an array is needed later, it is read into an object from the appropriate table row.
I am wondering how holding memory in the DOM like this compares to holding it in objects in memory. Am I more at risk of slowing down my code execution by holding these things in memory as objects instead of in the DOM? Also, since these objects will need to be accessed by multiple parts of the code, they would have to be global objects. Right?
In a word, no.
There's one rule that is a constant in web development: the DOM is slow.
Selecting, accessing, and iterating DOM elements and reading and modifying their properties is one of the single slowest things you can do in browser JS. On the other hand, working with native JS objects is very fast. As a rule, I avoid working with the DOM as much as possible.
You make the (incorrect) assumption that low memory usage somehow equates to improved performance. In fact, it is generally1 (and not just in the browser) the case that increased memory usage increases application performance. By the nature of having calculated results cached in memory, you avoid the processor- and time-intensive need to rebuild data at a later time. Of course, you must balance your use of memory with the need for performance.
In your case, I'd definitely keep the equation object in memory. It's probably using a surprisingly small amount of memory. For example, I just created an array of 1,000 integers in Chrome, and the memory profiler told me that it used less than 40 kB – almost all of it object overhead. The values themselves only took 4 kB – 1000 * 4 bytes (32 bits).
When you consider that the routines that read the DOM to regenerate the object probably take tens to hundreds of milliseconds (or much worse, depending on complexity) to run, why wouldn't you just generate the object once and keep it, especially since the cost is so low?
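To make the comparison concrete, here is a sketch of the two approaches; the table layout, rowCount/colCount, and solveEquations() are assumptions standing in for your actual code:

// Slow path: re-parse every value out of table cells whenever it is needed again.
function readMatrixFromTable(table, rowCount, colCount) {
  var matrix = [];
  for (var r = 0; r < rowCount; r++) {
    matrix[r] = [];
    for (var c = 0; c < colCount; c++) {
      matrix[r][c] = parseFloat(table.rows[r].cells[c].textContent);
    }
  }
  return matrix;
}

// Fast path: compute the result once, keep the reference, and reuse it everywhere.
var solvedMatrix = solveEquations(userInput);   // placeholder for your solver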
However, as with all performance questions, the real answer is that you have to profile your application to see where the performance bottlenecks really lie. Use Chrome's Developer Tools' Profiles tab to take CPU profiles (which measure processor utilization over time) and heap snapshots (which measure memory usage in an instant) as you use your app.
1 - This really depends on whether your program uses so much memory that pages must be swapped out. On desktop machines where 16 GB of RAM has become commonplace, you have nearly nothing to worry about. However, it is still important to watch memory usage on mobile devices, where resources are more constrained.
I am wondering how holding memory in the DOM like this compares to holding it in objects in memory.
Test it and see, I'd expect the DOM to use more memory and be very much slower to access. In any case, unless you are storing an awful lot of stuff the memory implications are likely minimal.
I am more at risk to slow down my code execution by holding these things in memory as objects instead of on the DOM?
The opposite, I'd expect: accessing the DOM will be very much slower.
Also, since these objects will need to be accessed by multiple parts of the code, these would have to be global objects. Right?
Wrong. You can hold references in a closure, or put all the objects into a single array or object. In any case, global objects aren't bad per se, they are just considered untidy and increase the chance of name collisions.
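For example, a single closure can hold all the shared state without polluting the global namespace (a sketch; the names are illustrative):

// Module pattern: one shared, non-global home for the solver's data.
var equationStore = (function () {
  var equations = {};              // private; reachable only through the API below
  return {
    save: function (name, matrix) { equations[name] = matrix; },
    load: function (name) { return equations[name]; }
  };
})();

equationStore.save('problem1', [[1, 2], [3, 4]]);
var m = equationStore.load('problem1');   // same array reference that was saved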