Deoptimizations kill the performance with binary trees


I need some help to optimize the code below. I can't understand how to rewrite the code to avoid the deoptimizations.
Below is the code that runs very slowly on Node. I took it from the benchmarksgame-team binary-trees benchmark and made minor changes.
When it is run with --trace-deopt, it shows that functions in the hot path are deoptimized, e.g.
[bailout (kind: deopt-lazy, reason: (unknown)): begin. deoptimizing 0x02117de4aa11 <JSFunction bottomUpTree2 (sfi = 000002F24C7D7539)>, opt id 3, bytecode offset 9, deopt exit 17, FP to SP delta 80, caller SP 0x00807d7fe6f0, pc 0x7ff6afaca78d]
The benchmark (run it using node --trace-deopt a 20):
function mainThread() {
const maxDepth = Math.max(6, parseInt(process.argv[2]));
const stretchDepth = maxDepth + 1;
const check = itemCheck(bottomUpTree(stretchDepth));
console.log(`stretch tree of depth ${stretchDepth}\t check: ${check}`);
const longLivedTree = bottomUpTree(maxDepth);
for (let depth = 4; depth <= maxDepth; depth += 2) {
const iterations = 1 << maxDepth - depth + 4;
work(iterations, depth);
}
console.log(`long lived tree of depth ${maxDepth}\t check: ${itemCheck(longLivedTree)}`);
}
function work(iterations, depth) {
let check = 0;
for (let i = 0; i < iterations; i++) {
check += itemCheck(bottomUpTree(depth));
}
console.log(`${iterations}\t trees of depth ${depth}\t check: ${check}`);
}
function TreeNode(left, right) {
return {left, right};
}
function itemCheck(node) {
if (node.left === null) {
return 1;
}
return 1 + itemCheck2(node);
}
function itemCheck2(node) {
return itemCheck(node.left) + itemCheck(node.right);
}
function bottomUpTree(depth) {
return depth > 0
? bottomUpTree2(depth)
: new TreeNode(null, null);
}
function bottomUpTree2(depth) {
return new TreeNode(bottomUpTree(depth - 1), bottomUpTree(depth - 1))
}
console.time();
mainThread();
console.timeEnd();

(V8 developer here.)
The premise of this question is incorrect: a few deopts don't matter, and don't move the needle regarding performance. Trying to avoid them is an exercise in futility.
The first step when trying to improve performance of something is to profile it. In this case, a profile reveals that the benchmark is spending:
about 46.3% of the time in optimized code (about 4/5 of that for tree creation and 1/5 for tree iteration)
about 0.1% of the time in unoptimized code
about 52.8% of the time in the garbage collector, tracing and freeing all those short-lived objects.
This is as artificial a microbenchmark as they come. 50% GC time never happens in real-world code that does useful things aside from allocating multiple gigabytes of short-lived objects as fast as possible.
In fact, calling them "short-lived objects" is a bit inaccurate in this case. While the vast majority of the individual trees being constructed are indeed short-lived, the code allocates one super-large long-lived tree early on. That fools V8's adaptive mechanisms into assuming that all future TreeNodes will be long-lived too, so it allocates them in "old space" right away -- which would save time if the guess was correct, but ends up wasting time because the TreeNodes that follow are actually short-lived and would be better placed in "new space" (which is optimized for quickly freeing short-lived objects). So just by reshuffling the order of operations, I can get a 3x speedup.
This is a typical example of one of the general problems with microbenchmarks: by doing something extreme and unrealistic, they often create situations that are not at all representative of typical real-world scenarios. If engine developers optimized for such microbenchmarks, engines would perform worse for real-world code. If JavaScript developers try to derive insights from microbenchmarks, they'll write code that performs worse under realistic conditions.
Anyway, if you want to optimize this code, avoid as many of those object allocations as you can.
Concretely:
An artificial microbenchmark like this, by its nature, intentionally does useless work (such as: computing the same value a million times). You said you wanted to optimize it, which means avoiding useless work, but you didn't specify which parts of the useless work you'd like to preserve, if any. So in the absence of a preference, I'll assume that all useless work is useless. So let's optimize!
Looking at the code, it creates perfect binary trees of a given depth and counts their nodes. In other words, it sums up the 1s in these examples:
depth=0:
   1
depth=1:
   1
  / \
 1   1
depth=2:
      1
    /   \
   1     1
  / \   / \
 1   1 1   1
and so on. If you think about it for a bit, you'll realize that such a tree of depth N has (2 ** (N+1)) - 1 nodes. So we can replace:
itemCheck(bottomUpTree(depth));
with
(2 ** (depth+1)) - 1
(and analogously for the "stretchDepth" line).
Next, we can take care of the useless repetitions. Since x + x + x + x + ... N times is the same as x*N, we can replace:
let check = 0;
for (let i = 0; i < iterations; i++) {
check += (2 ** (depth + 1)) - 1;
}
with just:
let check = ((2 ** (depth + 1)) - 1) * iterations;
With that we're from 12 seconds down to about 0.1 seconds. Not bad for five minutes of work, eh?
And that remaining time is almost entirely due to the longLivedTree. To apply the same optimizations to the operations creating and iterating that tree, we'd have to move them together, getting rid of its "long-livedness". Would you find that acceptable? You could get the overall time down to less than a millisecond! Would that make the benchmark useless? Not actually any more useless than it was to begin with, just more obviously so.
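The closed form used above can be sanity-checked against the original tree code for small depths. This is only a sketch: the helpers are condensed from the question, and nodeCount and workClosedForm are hypothetical names introduced here for illustration.

```javascript
// Tree construction and counting, condensed from the question's code.
function bottomUpTree(depth) {
  return depth > 0
    ? { left: bottomUpTree(depth - 1), right: bottomUpTree(depth - 1) }
    : { left: null, right: null };
}

function itemCheck(node) {
  return node.left === null
    ? 1
    : 1 + itemCheck(node.left) + itemCheck(node.right);
}

// Closed form: a perfect binary tree of depth N has 2^(N+1) - 1 nodes.
function nodeCount(depth) {
  return 2 ** (depth + 1) - 1;
}

// Replacement for the summing loop in work(): x added N times is x * N.
function workClosedForm(iterations, depth) {
  return nodeCount(depth) * iterations;
}

// Compare both approaches for a few small depths.
for (let depth = 0; depth <= 10; depth++) {
  const counted = itemCheck(bottomUpTree(depth));
  console.log(depth, counted, nodeCount(depth), counted === nodeCount(depth));
}
```

The comparison is kept to small depths on purpose, since building the trees is exactly the work the closed form avoids.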

Related

Interpreting performance tests for non-zero check

I've recently fallen into the trap of premature optimization again, and hopefully climbed back out. However, during my short intermission, I encountered something that I'd like to confirm.
My very basic performance test yielded similar results on Chrome (all variables are just declared globally; the first takes ~9ms, the second ~7.5ms):
input = Array.from({ length: 1000000 }, () => Math.random() > 0.8 ? 0 : Math.random() * 1000000000 );
start = performance.now();
for (let i = 0; i < 1000000; i++) {
input[i] = input[i] === 0 ? 0 : 1;
}
console.log(performance.now() - start);
and
input = Array.from({ length: 1000000 }, () => Math.random() > 0.8 ? 0 : Math.random() * 1000000000 );
start = performance.now();
for (let i = 0; i < 1000000; i++) {
input[i] = ((input[i] | (~input[i] + 1)) >>> 31) & 1;
}
console.log(performance.now() - start);
Taking into account that the loop and assignment alone already take a lot of time (~4.5ms), the second version potentially takes ~33% less time, which is a far smaller gain than I'd expect (and within measuring inaccuracy; e.g. on Firefox, both take much longer, and the second is ~33% worse).
Can I conclude at this point that the condition is already being optimized, with a similar code change being applied by the engine, or am I falling for some mirage of a microbenchmark? In my mind, a branch of this kind should be far more time-consuming than the calculation.
I am primarily skeptical of my own reasoning, because I know that these kinds of tests can very easily have skewed results for unforeseen reasons.

Yes, it's most likely a combination of measurement noise and JIT-compiler effects.
The heuristic-driven JIT compilers in modern browsers are good enough that you have little chance of seeing the "real" performance benefit of the intuitively faster statement.
The variance in my browser (Opera) is so high that both snippets take approximately the same time, which matches what you saw in Firefox.
Be glad that you found that 1.5 ms advantage; you won't see more.
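Before comparing timings, it may be worth checking that the two expressions from the question actually compute the same thing. The sketch below wraps them in two hypothetical helpers, branchy and bitTrick (names introduced here). Bitwise operators convert their operands to 32-bit integers first, so the two agree for integers in that range but not for fractional values between 0 and 1.

```javascript
// The two per-element expressions from the question, as functions.
const branchy = (x) => (x === 0 ? 0 : 1);
const bitTrick = (x) => ((x | (~x + 1)) >>> 31) & 1;

// Integers in the 32-bit range: both versions agree.
for (const x of [0, 1, 5, -3, 123456789]) {
  console.log(x, branchy(x), bitTrick(x));
}

// Fractional values are truncated to 0 by the bitwise operators,
// so the bit trick reports 0 where the comparison reports 1.
for (const x of [0.5, 0.999]) {
  console.log(x, branchy(x), bitTrick(x));
}
```

With inputs drawn from Math.random() * 1000000000, values below 1 are extremely rare, so this difference is unlikely to explain the timings; it only shows that the two loops are not strictly interchangeable.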

Why is using a loop to iterate from start of array to end faster than iterating both start to end and end to start?

Given an array of length 100 containing the values 0 to 99 at the respective indexes, the requirement is to find the element of the array equal to n: 51.
Why is using a loop to iterate from start of array to end faster than iterating both start to end and end to start?
const arr = Array.from({length: 100}, (_, i) => i);
const n = 51;
const len = arr.length;
console.time("iterate from start");
for (let i = 0; i < len; i++) {
if (arr[i] === n) break;
}
console.timeEnd("iterate from start");
const arr = Array.from({length: 100}, (_, i) => i);
const n = 51;
const len = arr.length;
console.time("iterate from start and end");
for (let i = 0, k = len - 1; i < len && k >= 0; i++, k--) {
if (arr[i] === n || arr[k] === n) break;
}
console.timeEnd("iterate from start and end");
jsperf https://jsperf.com/iterate-from-start-iterate-from-start-and-end/1

The answer is pretty obvious:
More operations take more time.
When judging the speed of code, you look at how many operations it will perform. Just step through and count them. Every instruction will take one or more CPU cycles, and the more there are the longer it will take to run. That different instructions take a different amount of cycles mostly does not matter - while an array lookup might be more costly than integer arithmetic, both of them basically take constant time and if there are too many, it dominates the cost of our algorithm.
In your example, there are few different types of operations that you might want to count individually:
comparisons
increments/decrements
array lookup
conditional jumps
(we could be more granular, such as counting variable fetch and store operations, but those hardly matter - everything is in registers anyway - and their number basically is linear to the others).
Now both of your loops iterate about 50 times - the element on which they break the loop is in the middle of the array. Ignoring off-by-a-few errors, those are the counts:
               | forwards   | forwards and backwards
---------------+------------+------------------------
 >=/===/<      |        100 |                    200
 ++/--         |         50 |                    100
 a[b]          |         50 |                    100
 &&/||/if/for  |        100 |                    200
Given that, it's not unexpected that doing twice the work takes considerably longer.
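The array-lookup row of that count can be reproduced by instrumenting the two loops from the question. This is a sketch only: countForward and countBothEnds are hypothetical helper names, and the || condition is split into two if statements so each lookup can be counted separately (the behavior is the same because || short-circuits).

```javascript
const arr = Array.from({ length: 100 }, (_, i) => i);
const n = 51;

// Counts array lookups for the plain forward loop.
function countForward(arr, n) {
  let lookups = 0;
  for (let i = 0; i < arr.length; i++) {
    lookups++; // arr[i]
    if (arr[i] === n) break;
  }
  return lookups;
}

// Counts array lookups for the loop that walks in from both ends.
function countBothEnds(arr, n) {
  let lookups = 0;
  for (let i = 0, k = arr.length - 1; i < arr.length && k >= 0; i++, k--) {
    lookups++; // arr[i]
    if (arr[i] === n) break;
    lookups++; // arr[k], only evaluated when arr[i] did not match
    if (arr[k] === n) break;
  }
  return lookups;
}

console.log(countForward(arr, n));  // 52
console.log(countBothEnds(arr, n)); // 98
```

Both loops run about 50 iterations, but the second one performs roughly twice as many lookups and comparisons per iteration.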
I'll also answer a few questions from your comments:
Is additional time needed for the second object lookup?
Yes, every individual lookup counts. It's not like they could be performed at once, or optimised into a single lookup (imaginable if they had looked up the same index).
Should there be two separate loops for each start to end and end to start?
Doesn't matter for the number of operations, just for their order.
Or, put differently still, what is the fastest approach to find an element in an array?
There is no "fastest" regarding the order, if you don't know where the element is (and they are evenly distributed) you have to try every index. Any order - even random ones - would work the same. Notice however that your code is strictly worse, as it looks at each index twice when the element is not found - it does not stop in the middle.
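If searching from both ends is still wanted, the double visit can be avoided by stopping once the two indexes cross. A minimal sketch, with findFromBothEnds as a hypothetical name; note that with duplicate values it returns some matching index, not necessarily the first one.

```javascript
// Walks inward from both ends and stops when the indexes cross,
// so every index is examined at most once (the middle one twice
// for odd lengths).
function findFromBothEnds(arr, n) {
  for (let i = 0, k = arr.length - 1; i <= k; i++, k--) {
    if (arr[i] === n) return i;
    if (arr[k] === n) return k;
  }
  return -1; // not found
}

const values = Array.from({ length: 100 }, (_, i) => i);
console.log(findFromBothEnds(values, 51));  // 51
console.log(findFromBothEnds(values, 100)); // -1
```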
But still, there are a few different approaches at micro-optimising such a loop - check these benchmarks.
let is (still?) slower than var, see Why is using `let` inside a `for` loop so slow on Chrome? and Why is let slower than var in a for loop in nodejs?. This tear-up and tear-down (about 50 times) of the loop body scope in fact does dominate your runtime - that's why your inefficient code isn't completely twice as slow.
comparing against 0 is marginally faster than comparing against the length, which puts looping backwards at an advantage. See Why is iterating through an array backwards faster than forwards, JavaScript loop performance - Why is to decrement the iterator toward 0 faster than incrementing and Are loops really faster in reverse?
in general, see What's the fastest way to loop through an array in JavaScript?: it changes from engine update to engine update. Don't do anything weird, write idiomatic code, that's what will get optimised better.

@Bergi is correct. More operations is more time. Why? More CPU clock cycles.
Time is really a reference to how many clock cycles it takes to execute the code.
In order to get to the nitty-gritty of that you need to look at the machine level code (like assembly level code) to find the true evidence. Each CPU (core?) clock cycle can execute one instruction, so how many instructions are you executing?
I haven't counted the clock cycles in a long time since programming Motorola CPUs for embedded applications. If your code is taking longer then it is in fact generating a larger instruction set of machine code, even if the loop is shorter or runs an equal amount of times.
Never forget that your code is actually getting compiled into a set of commands that the CPU is going to execute (memory pointers, instruction-code level pointers, interrupts, etc.). That is how computers work, and it's easier to understand at the microcontroller level, like an ARM or Motorola processor, but the same is true for the sophisticated machines that we are running on today.
Your code simply does not run the way you write it (sounds crazy, right?). It runs as the machine-level instructions it was compiled to (writing a compiler is no fun). Mathematical expressions and logic can be compiled into quite a heap of assembly (machine-level code), and that is up to how the compiler chooses to interpret them (it is bit shifting, etc.; remember binary mathematics, anyone?)
Reference:
https://software.intel.com/en-us/articles/introduction-to-x64-assembly
Your question is hard to answer, but as @Bergi stated, the more operations, the longer it takes. But why? The more clock cycles it takes to execute your code. Dual core, quad core, threading, assembly (machine language): it is complex. But no code gets executed as you have written it, whether C++, C, Pascal, JavaScript, or Java, unless you are writing in assembly (even that compiles down to machine code, but it is closer to the code that is actually executed).
Do a master's in CS and you will get to counting clock cycles and sort times. You will likely make your own language based on machine instruction sets.
Most people say who cares? Memory is cheap today and CPUs are screaming fast and getting faster.
But there are some critical applications where 10 ms matters, where an immediate interrupt is needed, etc.
Commerce, NASA, a Nuclear power plant, Defense Contractors, some robotics, you get the idea . . .
I vote let it ride and keep moving.
Cheers,
Wookie

Since the element you're looking for is always roughly in the middle of the array, you should expect the version that walks inward from both the start and end of the array to take about twice as long as one that just starts from the beginning.
Each variable update takes time, each comparison takes time, and you're doing twice as many of them. Since this version terminates only one or two loop iterations earlier, you should expect it to cost about twice as much CPU time.
This strategy is still O(n) time complexity since it only looks at each item once, it's just specifically worse when the item is near the center of the list. If it's near the end, this approach will have a better expected runtime. Try looking for item 90 in both, for example.

Selected answer is excellent. I'd like to add another aspect: Try findIndex(), it's 2-3 times faster than using loops:
const arr = Array.from({length: 900}, (_, i) => i);
const n = 51;
const len = arr.length;
console.time("iterate from start");
for (let i = 0; i < len; i++) {
if (arr[i] === n) break;
}
console.timeEnd("iterate from start");
console.time("iterate using findIndex");
var i = arr.findIndex(function(v) {
return v === n;
});
console.timeEnd("iterate using findIndex");

The other answers here cover the main reasons, but I think an interesting addition could be mentioning cache.
In general, sequentially accessing an array will be more efficient, particularly with large arrays. When your CPU reads an array from memory, it also fetches nearby memory locations into cache. This means that when you fetch element n, element n+1 is also probably loaded into cache. Now, cache is relatively big these days, so your 100 int array can probably fit comfortably in cache. However, on an array of much larger size, reading sequentially will be faster than switching between the beginning and the end of the array.

Slow delete of object properties in JS in V8

To practice TypeScript, I wrote a simple ES6 Map+Set-like implementation based on a plain JS object. It works only for primitive keys, so no buckets, no hash codes, etc. The problem I encountered is implementing the delete method. Using plain delete is unacceptably slow: for large maps it's about 300-400x slower than ES6 Map delete. I noticed a huge performance degradation once the object gets large. On Node.js 7.9.0 (and Chrome 57, for example), if the object has 50855 properties, delete performance is the same as ES6 Map. But with 50856 properties, ES6 Map is faster by 2 orders of magnitude. Here is simple code to reproduce:
// for node 6: 76300
// for node 7: 50855
const N0 = 50855;
function fast() {
const N = N0
const o = {}
for ( let i = 0; i < N; i++ ) {
o[i] = i
}
const t1 = Date.now()
for ( let i = 0; i < N; i++ ) {
delete o[i]
}
const t2 = Date.now()
console.log( N / (t2 - t1) + ' KOP/S' )
}
function slow() {
const N = N0 + 1 // adding just 1
const o = {}
for ( let i = 0; i < N; i++ ) {
o[i] = i
}
const t1 = Date.now()
for ( let i = 0; i < N; i++ ) {
delete o[i]
}
const t2 = Date.now()
console.log( N / (t2 - t1) + ' KOP/S' )
}
fast()
slow()
I guess that instead of deleting properties I could set them to undefined or to some guard object, but this will mess up the code, because hasOwnProperty will not work correctly, for...in loops will need an additional check, and so on. Are there nicer solutions?
P.S. I'm using node 7.9.0 on OSX Sierra
Edited
Thanks for the comments, guys; I fixed OP/S => KOP/S. I think I asked a rather badly specified question, so I changed the title. After some investigation I found out that, for example, Firefox has no such problem: the cost of deleting grows linearly. So it's a problem of super-smart V8. And I think it's just a bug :(

(V8 developer here.) Yes, this is a known issue. The underlying problem is that objects should switch their elements backing store from a flat array to a dictionary when they become too sparse, and the way this has historically been implemented was for every delete operation to check if enough elements were still present for that transition not to happen yet. The bigger the array, the more time this check took. Under certain conditions (recently created objects below a certain size), the check was skipped -- the resulting impressive speedup is what you're observing in the fast() case.
I've taken this opportunity to fix the (frankly quite silly) behavior of the regular/slow path. It should be sufficient to check every now and then, not on every single delete. The fix will be in V8 6.0, which should be picked up by Node in a few months (I believe Node 8 is supposed to get it at some point).
That said, using delete causes various forms and magnitudes of slowdown in many situations, because it tends to make things more complicated, forcing the engine (any engine) to perform more checks and/or fall off various fast paths. It is generally recommended to avoid using delete whenever possible. Since you have ES6 maps/sets, use them! :-)
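A minimal sketch of the Map-based alternative recommended above, reusing the problematic size from the question. Timings are deliberately left out, since they depend on the engine version.

```javascript
// Same workload as slow() in the question, but with an ES6 Map.
const N = 50856;
const m = new Map();

for (let i = 0; i < N; i++) {
  m.set(i, i);
}
console.log(m.size); // 50856

for (let i = 0; i < N; i++) {
  m.delete(i); // returns true if the key existed
}
console.log(m.size, m.has(0)); // 0 false
```

Map.prototype.has takes over the role of hasOwnProperty here, and iterating with for...of only visits keys that are still present, so no guard values are needed.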

To answer the question "why adding 1 to N slows the delete operation".
My guess: the slowness comes from the way the memory is allocated for your Object.
Try changing your code to this:
(() => {
const N = 50855
const o = {}
for ( let i = 0; i < N; i++ ) {
o[i] = i
}
// Show the heap memory allocated
console.log(process.memoryUsage().heapTotal);
const t1 = Date.now()
for ( let i = 0; i < N; i++ ) {
delete o[i]
}
const t2 = Date.now()
console.log( N / (t2 - t1) + ' KOP/S' ) // operations per millisecond, i.e. thousands per second
})();
Now, when you run with N = 50855 the memory allocated is: "8306688 bytes" (8.3MB)
When you run with N = 50856 the memory allocated is: "8929280 bytes" (8.9MB).
So you got a roughly 600 KB increase in the size of the memory allocated, only by adding one more key to your Object.
Now, I say I "guess" that this is where the slowness comes from, but I think it makes sense for the delete function to be slower as the size of your Object increases.
If you try with N = 70855 you would still have same 8.9MB used. This is because usually memory allocators allocate memory in fixed "batches" while increasing the size of an Array/Object in order to reduce the number of memory allocations they do.
Now, same thing might happen with delete and the GC. The memory you delete has to be picked up by the GC, and if the Object size is larger the GC will be slower. Also the memory might be released if the number of keys goes under a specific number.
(You should read about memory allocation for dynamic arrays if you want to know more; there was a cool article on what increase rate you should use for memory allocation, I can't find it atm :( )
PS: delete is not "extremely slow", you just compute the op/s wrong. The time passed is in milliseconds, not in seconds so you have to multiply by 1000.

Coursera Node.js Fibonacci implementation hanging

I am currently doing program 2 for the Startup Engineering course offered on Coursera.
I'm programming on an Ubuntu instance on Amazon Web Services and my program is constantly hanging. There might be something wrong with my Node.js program, but I can't seem to locate it.
This program is meant to produce the first 100 Fibonacci numbers separated with commas.
#! /usr/bin/env node
//calculation
var fibonacci = function(n){
if(n < 1){return 0;}
else if(n == 1 || n == 2){return 1;}
else if(n > 2){return fibonacci(n - 1) + fibonacci(n-2);}
};
//put in array
var firstkfib = function(k){
var i;
var arr = [];
for(i = 1; i <= k; i++){
arr.push(fibonacci(i));
}
return arr
};
//print
var format = function(arr){
return arr.join(",");
};
var k = 100;
console.log("firstkfib(" + k +")");
console.log(format(firstkfib(k)));
The only output I get is
ubuntu@ip-172-31-30-245:~$ node fib.js
firstkfib(100)
and then the program hangs

I don't know if you are familiar with Time complexity and algorithmic analysis, but, it turns out that your program has an exponential running time. This basically means that, as the input increases, the time it takes to run your program increases exponentially. (If my explanation is not very clear, check this link)
It turns out that this sort of running time is extremely slow. For example, if it takes 1 ms to run your program for k=1, it would take 2^100 ms to run it for k=100. This turns out to be a ridiculously big number.
In any case, as Zhehao points out, the solution is to save the value of fib(n-1) and fib(n-2) (in an array, for example), and reuse it to compute fib(n). Check out this video lecture from MIT (the first 15 mins) on how to do it.

You may want to try printing out the numbers as they are being computed, instead of printing out the entire list at the end. It's possible that the computation is hanging somewhere along the line.
On another note, this is probably the most inefficient way of computing a list of fibonacci numbers. You compute fibonacci(n) and then fibonacci(n+1) without reusing any of the work from the previous computation. You may want to go back and rethink your method. There's a much faster and simpler iterative method.
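A minimal sketch of such an iterative method, keeping the question's firstkfib idea but under a slightly different name (firstKFib). One assumption is added here: BigInt is used, because Fibonacci numbers from about the 79th onward exceed Number.MAX_SAFE_INTEGER and would otherwise be printed inexactly.

```javascript
// Builds the first k Fibonacci numbers in a single pass,
// reusing the previous two values instead of recursing.
function firstKFib(k) {
  const out = [];
  let a = 1n; // F(1)
  let b = 1n; // F(2)
  for (let i = 0; i < k; i++) {
    out.push(a);
    [a, b] = [b, a + b];
  }
  return out;
}

console.log(firstKFib(10).join(",")); // 1,1,2,3,5,8,13,21,34,55
console.log(firstKFib(100).join(","));
```

This does k additions in total, so printing the first 100 numbers finishes immediately.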

Writing computationally intensive code in Node.js blocks the event loop. Since this Fibonacci implementation is computationally intensive, it may end up blocking.

Why is this js code so slow?

This code takes 3 seconds on Chrome and 6s on Firefox.
If I write the code in Java and run it under Java 7.0 it takes only 10ms.
Chrome's JS engine is usually very fast. Why is it so slow here?
btw. this code is just for testing. I know it's not very practical way to write a fibonacci function
fib = function(n) {
if (n < 2) {
return n;
} else {
return fib(n - 1) + fib(n - 2);
}
};
console.log(fib(32));

This isn't the fault of JavaScript, but of your algorithm. You're recomputing the same subproblems over and over again, and it gets worse as N gets bigger. This is the call graph for a single call:
                        F(32)
               /                   \
          F(31)                     F(30)
         /     \                   /     \
     F(30)     F(29)           F(29)     F(28)
     /  \      /  \            /  \      /  \
 F(29) F(28) F(28) F(27)   F(28) F(27) F(27) F(26)
 ... deeper and deeper
As you can see from this tree, you're computing some Fibonacci numbers several times; for example, F(28) already appears 4 times in the part of the tree shown (and is computed 5 times in the full tree). From the "Algorithm Design Manual" book:
How much time does this algorithm take to compute F(n)? Since F(n+1)/F(n) ≈ φ = (1 + sqrt(5))/2 ≈ 1.61803, this means that F(n) > 1.6^n. Since our recursion tree has only 0 and 1 as leaves, summing up to such a large number means we must have at least 1.6^n leaves or procedure calls! This humble little program takes exponential time to run!
You have to use memoization or build solution bottom up (i.e. small subproblems first).
This solution uses memoization (thus, we're computing each Fibonacci number only once):
var cache = {};
function fib(n) {
if (!(n in cache)) { // test for presence: cache[0] is 0, which is falsy
cache[n] = (n < 2) ? n : fib(n - 1) + fib(n - 2);
}
return cache[n];
}
This one solves it bottom up:
function fib(n) {
if (n < 2) return n;
var a = 0, b = 1;
while (--n) {
var t = a + b;
a = b;
b = t;
}
return b;
}

As is fairly well known, the implementation of the fibonacci function you gave in your question requires a lot of steps if implemented naively. In particular, it takes 7,049,155 calls.
However, these kinds of algorithms can be greatly sped up with a technique known as memoization. If you see the function call fib(32) taking several seconds, the function is being implemented naively. If it returns instantly, there is a high probability that the implementation is using memoization.
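The call count quoted above can be reproduced by instrumenting the function from the question. A small sketch; the calls counter and the countCalls helper are names introduced here.

```javascript
let calls = 0;

// Same algorithm as in the question, plus a call counter.
function fib(n) {
  calls++;
  return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

// Resets the counter, runs fib(n), and reports both results.
function countCalls(n) {
  calls = 0;
  const value = fib(n);
  return { value, calls };
}

console.log(countCalls(32)); // { value: 2178309, calls: 7049155 }
```

The number of calls satisfies calls(n) = 1 + calls(n - 1) + calls(n - 2), so it grows at the same exponential rate as the Fibonacci numbers themselves.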

Based on the evidence already provided the conclusion I draw is:
When the code is not run from the console (like in the jsFiddle where my machine, a Sandy Bridge Macbook Air, computes it in 55ms) the JS engine is able to JIT and possibly automatically memoize the algorithm.
When run from the js console none of this occurs. On my machine it was only under 10x slower: 460ms.
I then edited the code to look for F(38) which bumped the times up to 967ms and 9414ms so it has maintained a similar speedup factor. This indicates that no memoization is being performed and the speedup is probably due to JITting.

Just a comment...
Function calls are relatively expensive, recursion is very expensive and always slower than an equivalent using an efficient loop. e.g the following is thousands of times faster than the recursive alternative in IE:
function fib2(n) {
var arr = [0, 1];
var len = 2;
while (len <= n) {
arr[len] = arr[len-1] + arr[len-2];
++len;
}
return arr[n];
}
And as noted in other answers, it seems the OP algorithm is also inherently slow, but I guess that isn't really the issue.

In addition to the memoization approach recommended by @galymzhan, you could also use another algorithm altogether. Traditionally, the formula for the nth Fibonacci number is F(n) = F(n-1) + F(n-2). Evaluated iteratively, this takes a number of steps directly proportional to n.
Dijkstra came up with an algorithm to derive Fibonacci numbers in less than half the steps required by the conventional formula. This was outlined in his note EWD 654. It goes:
For even numbers, F(2n) = (F(n))^2 + (F(n+1))^2
For odd numbers, F(2n+1) = (2F(n) + F(n+1)) * F(n+1) OR F(2n-1) = (2F(n+1) - F(n)) * F(n)
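A sketch of the same halving idea in code. Note the assumptions: this uses the commonly cited "fast doubling" identities with the indexing F(0) = 0, F(1) = 1, namely F(2k) = F(k) * (2F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2. They may be indexed differently from the lines quoted above, so this is not presented as Dijkstra's exact formulation. fastFib and fibPair are hypothetical names, and BigInt is used to keep large results exact.

```javascript
// Returns the pair [F(n), F(n+1)] using the doubling identities.
function fibPair(n) {
  if (n === 0) return [0n, 1n];
  const [a, b] = fibPair(Math.floor(n / 2)); // a = F(k), b = F(k+1)
  const c = a * (2n * b - a);                // F(2k)
  const d = a * a + b * b;                   // F(2k+1)
  return n % 2 === 0 ? [c, d] : [d, c + d];
}

// nth Fibonacci number for a non-negative integer n.
function fastFib(n) {
  return fibPair(n)[0];
}

console.log(fastFib(32));  // 2178309n
console.log(fastFib(100)); // 354224848179261915075n
```

Each step halves n, so the recursion depth is about log2(n) rather than n.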
