How to implement a robust hash table like v8 - javascript

Looking to learn how to implement a hash table in a decent way in JavaScript.
I would like for it to be able to:
Efficiently resolve collisions,
Be space efficient, and
Be unbounded in size (at least in principle, like v8 objects are, up to the size of the system memory).
From my research and help from SO, there are many ways to resolve collisions in hash tables. The way v8 does it is Quadratic probing:
hash-table.h
The Wikipedia algorithm for quadratic probing, translated to JavaScript, looks something like this:
var i = 0
var SIZE = 10000
var key = getKey(arbitraryString)
var hash = key % SIZE
if (hashtable[hash]) {
  while (i < SIZE) {
    i++
    hash = (key + i * i) % SIZE
    if (!hashtable[hash]) break
    if (i == SIZE) throw new Error('Hashtable full.')
  }
  hashtable[hash] = key
} else {
  hashtable[hash] = key
}
The elements that are missing from the Wikipedia entry are:
How to compute the hash getKey(arbitraryString). Hoping to learn how v8 does this (not necessarily an exact replica, just along the same lines). Not being proficient in C++, it looks like the key is an object and the hash is a 32-bit integer. Not sure if the lookup-cache.h is important.
How to make it dynamic so the SIZE constraint can be removed.
Where to store the final hash, and how to compute it more than once.
V8 allows you to specify your own "Shape" object to use in the hash table:
// The hash table class is parameterized with a Shape.
// Shape must be a class with the following interface:
// class ExampleShape {
// public:
// // Tells whether key matches other.
// static bool IsMatch(Key key, Object* other);
// // Returns the hash value for key.
// static uint32_t Hash(Isolate* isolate, Key key);
// // Returns the hash value for object.
// static uint32_t HashForObject(Isolate* isolate, Object* object);
// // Convert key to an object.
// static inline Handle<Object> AsHandle(Isolate* isolate, Key key);
// // The prefix size indicates number of elements in the beginning
// // of the backing storage.
// static const int kPrefixSize = ..;
// // The Element size indicates number of elements per entry.
// static const int kEntrySize = ..;
// // Indicates whether IsMatch can deal with other being the_hole (a
// // deleted entry).
// static const bool kNeedsHoleCheck = ..;
// };
But it's not clear what the key is, or how they convert that key to the hash so that keys are evenly distributed and the hash function isn't just a hello-world example.
The question is how to implement a quick hash table like V8's that can efficiently resolve collisions and is unbounded in size. It doesn't have to be exactly like V8, but it should have the features outlined above.
In terms of space efficiency, a naive approach would do var array = new Array(10000), which would eat up a bunch of memory until it was filled out. Not sure how v8 handles it, but if you do var x = {} a bunch of times, it doesn't allocate a bunch of memory for unused keys, somehow it is dynamic.
I'm stuck here essentially:
var m = require('node-murmurhash')

function HashTable() {
  this.array = new Array(10000)
}

HashTable.prototype.key = function(value){
  // not sure if the key is actually this, or
  // the final result computed from the .set function,
  // and if so, how to store that.
  return m(value)
}

HashTable.prototype.set = function(value){
  var key = this.key(value)
  var array = this.array
  // not sure how to get rid of this constraint.
  var SIZE = 10000
  var hash = key % SIZE
  var i = 0
  if (array[hash]) {
    while (i < SIZE) {
      i++
      hash = (key + i * i) % SIZE
      if (!array[hash]) break
      if (i == SIZE) throw new Error('Hashtable full.')
    }
    array[hash] = value
  } else {
    array[hash] = value
  }
}

HashTable.prototype.get = function(index){
  return this.array[index]
}

This is a very broad question, and I'm not sure what exactly you want an answer to. ("How to implement ...?" sounds like you just want someone to do your work for you. Please be more specific.)
How to compute the hash
Any hash function will do. I've pointed out V8's implementation in the other question you've asked; but you really have a lot of freedom here.
Not sure if the lookup-cache.h is important.
Nope, it's unrelated.
How to make it dynamic so the SIZE constraint can be removed.
Store the table's current size as a variable, keep track of the number of elements in your hash table, and grow the table when the percentage of used slots exceeds a given threshold (you have a space-time tradeoff there: lower load factors like 50% give fewer collisions but use more memory, higher factors like 80% use less memory but hit more slow cases). I'd start with a capacity that's an estimate of "minimum number of entries you'll likely need", and grow in steps of 2x (e.g. 32 -> 64 -> 128 -> etc.).
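To make that concrete, here is a minimal sketch of such a growth strategy. This is not V8's code: the toy string hash, the 70% load-factor threshold, and the use of linear probing (rather than the quadratic probing from the question) are all arbitrary choices for illustration.

function HashTable(initialCapacity = 32) {
  this.keys = new Array(initialCapacity);
  this.values = new Array(initialCapacity);
  this.size = 0;                                  // number of occupied slots
}
HashTable.prototype.hash = function (key) {
  let h = 0;                                      // toy hash; assumes string keys
  for (let i = 0; i < key.length; i++) h = (h * 31 + key.charCodeAt(i)) | 0;
  return h >>> 0;
};
HashTable.prototype.set = function (key, value) {
  if (this.size / this.keys.length > 0.7) this.grow();   // 70% load factor
  let pos = this.hash(key) % this.keys.length;
  while (this.keys[pos] !== undefined && this.keys[pos] !== key) {
    pos = (pos + 1) % this.keys.length;           // linear probing
  }
  if (this.keys[pos] === undefined) this.size++;
  this.keys[pos] = key;
  this.values[pos] = value;
};
HashTable.prototype.get = function (key) {
  let pos = this.hash(key) % this.keys.length;
  while (this.keys[pos] !== undefined) {
    if (this.keys[pos] === key) return this.values[pos];
    pos = (pos + 1) % this.keys.length;
  }
  return undefined;
};
HashTable.prototype.grow = function () {
  const oldKeys = this.keys, oldValues = this.values;
  this.keys = new Array(oldKeys.length * 2);      // grow in steps of 2x
  this.values = new Array(this.keys.length);
  this.size = 0;
  for (let i = 0; i < oldKeys.length; i++) {      // re-insert everything
    if (oldKeys[i] !== undefined) this.set(oldKeys[i], oldValues[i]);
  }
};

Because the table grows before it can fill up, there is always an empty slot, so lookups terminate without the "Hashtable full" case from the question.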
Where to store the final hash,
That one's difficult: in JavaScript, you can't store additional properties on strings (or primitives in general). You could use a Map (or object) on the side, but if you're going to do that anyway, then you might as well use that as the hash table, and not bother implementing your own thing on top.
and how to compute it more than once.
That one's easy: invoke your hashing function again ;-)
I just want a function getUniqueString(string)
How about this:
var table = new Map();
var max = 0;
function getUniqueString(string) {
  var unique = table.get(string);
  if (unique === undefined) {
    unique = (++max).toString();
    table.set(string, unique);
  }
  return unique;
}
For nicer encapsulation, you could define an object that has table and max as properties.
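For example, a hypothetical wrapper class along those lines (the name is illustrative) might look like this:

class UniqueStringTable {
  constructor() {
    this.table = new Map();   // string -> unique id
    this.max = 0;             // last id handed out
  }
  getUniqueString(string) {
    let unique = this.table.get(string);
    if (unique === undefined) {
      unique = (++this.max).toString();
      this.table.set(string, unique);
    }
    return unique;
  }
}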

Related

Can I pre-allocate memory for a Map/Set with a known number of elements?

In the case of a JS array, it's possible to create one with a predefined length. If we pass the length to the constructor, like new Array(itemCount), a JS engine could pre-allocate memory for the array, so it won't need to be reallocated as new items are added.
Is it possible to pre-allocate memory for a Map or Set? They don't accept a length in the constructor like Array does. If you know that a map will contain 10 000 items, it should be much more efficient to allocate the memory once than to reallocate it multiple times.
Here is the same question about arrays.
There is discussion in the comments about whether a predefined array size has an impact on performance or not. I've created a simple test to check it; the result is:
Chrome - filling an array with a predefined size is about 3 times faster
Firefox - there is not much difference, but the array with a predefined length is a little bit faster
Edge - filling an array with a predefined size is about 1.5 times faster
There is a suggestion to create the entries array first and pass it to the Map, so the map can take the length from the array. I've created such a test too; the result is:
Chrome - construction of a map which accepts an entries array is about 1.5 times slower
Firefox - there is no difference
Edge - passing an entries array to the Map constructor is about 1.5 times faster
There is no way to influence how the engine allocates memory, so whatever you do, you cannot guarantee that the trick you are using works on every engine, or even on another version of the same engine.
There are, however, typed arrays in JS; they are the only data structures with a fixed size. With them, you can implement your own hash map with a fixed size.
A very naive implementation is slightly faster² on inserts; not sure how it behaves on reads and other operations (also it can only store 8-bit integers > 0):
function FixedMap(size) {
  const keys = new Uint8Array(size),
        values = new Uint8Array(size);
  return {
    get(key) {
      let pos = key % size;
      while (keys[pos] !== key) {
        // stop once we leave the run of entries belonging to this bucket
        if (keys[pos] % size !== key % size)
          return undefined;
        pos = (pos + 1) % size;
      }
      return values[pos];
    },
    set(key, value) {
      let pos = key % size;
      // linear probing: advance until an empty slot or the same key is found
      while (keys[pos] && keys[pos] !== key) pos = (pos + 1) % size;
      keys[pos] = key;
      values[pos] = value;
    }
  };
}
console.time("native");
const map = new Map();
for(let i = 0; i < 10000; i++)
map.set(i, Math.floor(Math.random() * 1000));
console.timeEnd("native");
console.time("self");
const fixedMap = FixedMap(10000);
for(let i = 0; i < 10000; i++)
fixedMap.set(i, Math.floor(Math.random() * 1000));
console.timeEnd("self");
² Marketing would say "It is up to 20% faster!" and I would say "It is up to 2 ms faster, and I spent more than 10 minutes on this ..."
No, this is not possible. It's doubtful even that the Array(length) trick still works; engines have become much better at optimising array assignments and might choose a different strategy to (pre)determine how much memory to allocate.
For Maps and Sets, no such trick exists. Your best bet will be creating them by passing an iterable to their constructor, like new Map(array), instead of using the set/add methods. This does at least offer the chance for the engine to take the iterable size as a hint - although filtering out duplicates can still lead to differences.
Disclaimer: I don't know whether any engines implement, or plan to implement, such an optimisation. It might not be worth the effort, especially if the current memory allocation strategy is already good enough to diminish any improvements.
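As a concrete sketch of that suggestion (the size of 10 000 is just an assumed example), you would build the entries first and hand the whole iterable to the constructor instead of calling set() in a loop:

const itemCount = 10000;                  // assumed size, known in advance
const entries = new Array(itemCount);
for (let i = 0; i < itemCount; i++) {
  entries[i] = [i, i * 2];                // [key, value] pairs
}
// the constructor receives the whole iterable at once,
// which at least gives the engine a size hint
const map = new Map(entries);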

In JS which one makes sense: searching for a value in object collection by foreach vs keeping multiple collections with different keys

I'm working on some kind of 1:1 chat system; the environment is Node.JS.
For each country there is a country room (lobby), and for each socket client a JS class/object is created; each object is kept in a list under its unique user id.
This unique id is preserved even when users log in from different browser tabs, etc.
Each object is stored in collections like: "connections" (all of them), "operators" (only operators), "{countryISO}_clients" (users), and the reference key is their unique id.
In some circumstances, I need to access these connections by their socket ids.
At this point, I can think of 2 options:
Using a for-each loop to find the desired object
Creating another collection, this time keyed by socket id (or something else) instead of the unique id
Which one makes sense? Since in JS the second collection would hold references instead of copies, it feels like it makes sense (and looks nicer), but I can't be sure. Which one is more expensive in memory/performance terms?
I can't run thorough tests since I don't know how to create dummy (simultaneous) socket connections.
Expected connected socket client count: 300 - 1000 (depends on the time of the day)
e.g. user:
"qk32d2k":{
"uid":"qk32d2k",
"name":"Josh",
"socket":"{socket.io's socket reference}",
"role":"user",
"rooms":["room1"],
"socketids":["sid1"]
"country":"us",
...
info:() => { return gatherSomeData(); },
update:(obj) => { return updateSomeData(obj); },
send:(data)=>{ /*send data to this user*/ }
}
e.g. Countries collection:
{
  us: {
    "qk32d2k": {"object above."},
    "l33t71": {"another user object."}
  },
  ca: {
    "asd231": {"other user object."}
  }
}
Pick Simple Design First that Optimizes for Most Common Access
There is no ideal answer here in the absolute. CPUs are wicked fast these days, so if I were you I'd start out with one simple mechanism of storing the sockets that you can access both ways you want, even if one way is kind of a brute force search. Pick the data structure that optimizes the access mechanism that you expect to be either most common or most sensitive to performance.
So, if you are going to be looking up by userID the most, then I'd probably store the sockets in a Map object with the userID as the key. That will give you fast, optimized access to get the socket for a given userID.
For finding a socket by some other property of the socket, you will just iterate the Map item by item until you find the desired match on some other socket property. I'd probably use a for/of loop because it's both fast and easy to bail out of the loop when you've found your match (something you can't do on a Map or Array object with .forEach()). You can obviously make yourself a little utility function or method that will do the brute force lookup for you and that will allow you to modify the implementation later without changing much calling code.
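For instance, a minimal sketch of that approach might look like this (the connections Map and the socketids property are assumptions based on the example object in the question):

const connections = new Map();   // userID -> connection object (primary index)

function findBySocketId(socketId) {
  // brute force scan; for/of lets us return as soon as we find a match
  for (const conn of connections.values()) {
    if (conn.socketids.includes(socketId)) return conn;
  }
  return null;
}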
Measure and Add Further Optimization Later (if data shows you need to)
Then, once you get up to scale (or simulated scale in pre-production test), you take a look at the performance of your system. If you have loads of room to spare, you're done - no need to look further. If you have some operations that are slower than desired or higher than desired CPU usage, then you profile your system and find out where the time is going. It's most likely that your performance bottlenecks will be elsewhere in your system and you can then concentrate on those aspects of the system. If, in your profiling, you find that the linear lookup to find the desired socket is causing some of your slow-down, then you can make a second parallel lookup Map with the socketID as the key in order to optimize that type of lookup.
But, I would not recommend doing this until you've actually shown that it is an issue. Premature optimization, before you have actual metrics that prove it's worth optimizing something, just adds complexity to a program without any proof that it is required or even anywhere close to a meaningful bottleneck in your system. Our intuition about what the bottlenecks are is often way, way off. For that reason, I tend to pick an intelligent first design that is relatively simple to implement, maintain and use, and then, only when we have real usage data by which we can measure actual performance metrics, would I spend more time optimizing it or tweaking it or making it more complicated in order to make it faster.
Encapsulate the Implementation in Class
If you encapsulate all operations here in a class:
Adding a socket to the data structure.
Removing a socket from the data structure.
Looking up by userID
Looking up by socketID
Any other access to the data structure
Then, all calling code will access this data structure via the class and you can tweak the implementation some time in the future (to optimize based on data) without having to modify any of the calling code. This type of encapsulation can be very useful if you suspect future modifications or change of modifications to the way the data is stored or accessed.
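A sketch of such a class, under the same assumptions as above (method names are illustrative, not prescriptive):

class ConnectionStore {
  constructor() {
    this.byUserId = new Map();          // userID -> connection object
  }
  add(conn)        { this.byUserId.set(conn.uid, conn); }
  remove(uid)      { this.byUserId.delete(uid); }
  getByUserId(uid) { return this.byUserId.get(uid); }
  getBySocketId(socketId) {
    // brute force for now; callers won't notice if this is later
    // replaced by a second Map keyed by socketID
    for (const conn of this.byUserId.values()) {
      if (conn.socketids.includes(socketId)) return conn;
    }
    return null;
  }
}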
If You're Still Worried, Design a Quick Bench Measurement
I created a quick snippet that tests how long a brute force lookup takes in a 1000 element Map object (when you want to find an item by something other than the key) and compares it to an indexed lookup.
On my computer, a brute force lookup (non-indexed lookup) takes about 0.002549 ms per lookup (that's an average over 1,000,000 lookups). For comparison, an indexed lookup on the same Map takes about 0.000017 ms. So you save about 0.002532 ms per lookup - fractions of a millisecond.
function addCommas(str) {
  var parts = (str + "").split("."),
      main = parts[0],
      len = main.length,
      output = "",
      i = len - 1;
  while (i >= 0) {
    output = main.charAt(i) + output;
    if ((len - i) % 3 === 0 && i > 0) {
      output = "," + output;
    }
    --i;
  }
  // put decimal part back
  if (parts.length > 1) {
    output += "." + parts[1];
  }
  return output;
}

let m = new Map();

// populate the Map with objects that have a property that
// you have to do a brute force lookup on
function rand(min, max) {
  return Math.floor((Math.random() * (max - min)) + min)
}

// keep all randoms here just so we can randomly get one
// to try to find (wouldn't normally do this)
// just for testing purposes
let allRandoms = [];
for (let i = 0; i < 1000; i++) {
  let r = rand(1, 1000000);
  m.set(i, {id: r});
  allRandoms.push(r);
}

// create a set of test lookups
// we do this ahead of time so it's not part of the timed
// section so we're only timing the actual brute force lookup
let numRuns = 1000000;
let lookupTests = [];
for (let i = 0; i < numRuns; i++) {
  lookupTests.push(allRandoms[rand(0, allRandoms.length)]);
}
let indexTests = [];
for (let i = 0; i < numRuns; i++) {
  indexTests.push(rand(0, allRandoms.length));
}

// function to brute force search the map to find one of the random items
function findObj(targetVal) {
  for (let [key, val] of m) {
    if (val.id === targetVal) {
      return val;
    }
  }
  return null;
}

let startTime = Date.now();
for (let i = 0; i < lookupTests.length; i++) {
  // get an id from the allRandoms to search for
  let found = findObj(lookupTests[i]);
  if (!found) {
    console.log("!!didn't find brute force target")
  }
}
let delta = Date.now() - startTime;
//console.log(`Total run time for ${addCommas(numRuns)} lookups: ${delta} ms`);
//console.log(`Avg run time per lookup: ${delta/numRuns} ms`);

// Now, see how fast the same number of indexed lookups are
let startTime2 = Date.now();
for (let i = 0; i < indexTests.length; i++) {
  let found = m.get(indexTests[i]);
  if (!found) {
    console.log("!!didn't find indexed target")
  }
}
let delta2 = Date.now() - startTime2;
//console.log(`Total run time for ${addCommas(numRuns)} lookups: ${delta2} ms`);
//console.log(`Avg run time per lookup: ${delta2/numRuns} ms`);

let results = `
Total run time for ${addCommas(numRuns)} brute force lookups: ${delta} ms<br>
Avg run time per brute force lookup: ${delta/numRuns} ms<br>
<hr>
Total run time for ${addCommas(numRuns)} indexed lookups: ${delta2} ms<br>
Avg run time per indexed lookup: ${delta2/numRuns} ms<br>
<hr>
Net savings of an indexed lookup is ${(delta - delta2)/numRuns} ms per lookup
`;
document.body.innerHTML = results;

Implement partition refinement algorithm

Partition Refinement Algorithm
I have this algorithm from another Stack Exchange question:
Let
S = {x_1,x_2,x_3,x_4,x_5}
be the state space and
R = {
(x_1,a,x_2),
(x_1,b,x_3),
(x_2,a,x_2),
(x_3,a,x_4),
(x_3,b,x_5),
(x_4,a,x_4), // loop
(x_5,a,x_5), // loop (a-transition)
(x_5,b,x_5) // loop (b-transition)
}
be the transition relation
Then we start with the partition
Pi_1 = {{x_1,x_2,x_3,x_4,x_5}}
where all the states are lumped together.
Now, x_2 and x_4 can both always do an a-transition to a state, but no b-transitions, whereas the remaining states can do both a- and b-transitions, so we split the state space as
Pi_2 = {{x_1,x_3,x_5},{x_2,x_4}}.
Next, x_5 can do an a-transition into the class {x_1,x_3,x_5},
but x_1 and x_3 can not, since their a-transitions go into the class {x_2,x_4}. Hence these should again be split, so we get
Pi_3 = {{x_1,x_3},{x_5},{x_2,x_4}}.
Now it should come as no surprise that x_3 can do a b-transition into the class {x_5}, but x_1 can not, so they must also be split, so we get
Pi_4 = {{x_1},{x_3},{x_5},{x_2,x_4}},
and if you do one more step, you will see that Pi_4 = Pi_5, so this is the result.
Implementation
I do not know how to implement this algorithm in JavaScript.
// blocks in initial partition (only consist of 1 block)
const initialPartition = { block1: getStates() };
// new partition
const partition = {};
// loop through each block in the partition
Object.values(initialPartition).forEach(function (block) {
  // now split the block into subgroups based on the relations.
  // loop through each node in block to see to which nodes it has relations
  block.forEach(function (node) {
    // recursively get edges (and their targets) coming out of the node
    const successors = node.successors();
    // ...
  });
});
I guess I should create a function that, for each node, can say which transitions it can make. If I have such a function, I can loop through each node in each block, find the transitions, and create a key using something like
const key = getTransitions(node).sort().join();
and use this key to group the nodes into subblocks, making it possible to do something like
// blocks in initial partition (only consist of 1 block)
const initialPartition = { block1: getStates() };
// new partition
const partition = {};
// keep running until terminating
while (true) {
  // loop through each block in the partition
  Object.values(initialPartition).forEach(function (block) {
    // now split the block into subgroups based on the relations.
    // loop through each node in block to see to which nodes it has relations
    block.forEach(function (node) {
      // get possible transitions
      const key = getTransitions(node).sort().join();
      // group by key
      if (!partition[key]) partition[key] = [];
      partition[key].push(node);
    });
  });
}
but I need to remember which nodes were already separated into blocks, so the subblocks keep becoming smaller (i.e. if I have {{x_1,x_3,x_5},{x_2,x_4}}, I should remember that these blocks can only become smaller and never 'interchange').
Can someone give an idea of how to implement the algorithm? Just pseudocode or something, so I can implement how to get the nodes' successors, incomers, labels (e.g. a-transition or b-transition), etc. This is a bisimulation algorithm; it is implemented in Haskell in the slides here, but I do not understand Haskell.
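One way to realize the idea sketched above, as a minimal, unoptimized sketch: build each node's key from its current block plus the blocks its labelled transitions lead into, regroup by key, and repeat until the partition stops changing. Including the current block in the key is what guarantees that blocks only ever split, never merge. The states and transitions below are taken from the example at the top of the question; representing R as an array of triples is an assumption of this sketch.

const S = ["x1", "x2", "x3", "x4", "x5"];
const R = [
  ["x1", "a", "x2"], ["x1", "b", "x3"],
  ["x2", "a", "x2"],
  ["x3", "a", "x4"], ["x3", "b", "x5"],
  ["x4", "a", "x4"],
  ["x5", "a", "x5"], ["x5", "b", "x5"]
];

function refine(states, transitions) {
  // blockOf maps each state to the id of its current block; start with a single block
  let blockOf = new Map(states.map(s => [s, 0]));
  let blockCount = 1;
  while (true) {
    // key = current block plus the sorted set of (label -> successor block) moves
    const keyOf = state => {
      const moves = transitions
        .filter(([from]) => from === state)
        .map(([, label, to]) => `${label}->${blockOf.get(to)}`);
      return blockOf.get(state) + "|" + [...new Set(moves)].sort().join();
    };
    // group the states by key; every group becomes a block of the refined partition
    const groups = new Map();
    for (const s of states) {
      const key = keyOf(s);
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(s);
    }
    if (groups.size === blockCount) break;   // stable: no block was split
    blockCount = groups.size;
    blockOf = new Map();
    let id = 0;
    for (const group of groups.values()) {
      for (const s of group) blockOf.set(s, id);
      id++;
    }
  }
  // turn the block ids back into explicit blocks
  const blocks = new Map();
  for (const [s, id] of blockOf) {
    if (!blocks.has(id)) blocks.set(id, []);
    blocks.get(id).push(s);
  }
  return [...blocks.values()];
}

console.log(refine(S, R));   // [["x1"], ["x2", "x4"], ["x3"], ["x5"]]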

es6 Map and Set complexity, v8 implementation

Is it a fair assumption that in v8 implementation retrieval / lookup is O(1)?
(I know that the standard doesn't guarantee that)
Is it a fair assumption that in v8 implementation retrieval / lookup is O(1)?
Yes. V8 uses a variant of hash tables that generally have O(1) complexity for these operations.
For details, you might want to have a look at https://codereview.chromium.org/220293002/ where OrderedHashTable is implemented based on https://wiki.mozilla.org/User:Jorend/Deterministic_hash_tables.
For people who don't want to dig too deep into the rabbit hole:
1: We can assume that good hash table implementations have practically O(1) time complexity.
2: Here is a blog post by the V8 team that explains how some memory optimization was done on its hash table implementation for Map, Set, WeakSet, and WeakMap: Optimizing hash tables: hiding the hash code
Based on 1 and 2: V8's Set and Map get/set/add/has operations have practically O(1) time complexity.
The original question is already answered, but O(1) doesn't tell a lot about how efficient the implementation is.
First of all, we need to understand what variation of hash table is used for Maps. "Classical" hash tables won't do, as they do not provide any iteration order guarantees, while the ES6 spec requires iteration to follow insertion order. So, Maps in V8 are built on top of so-called deterministic hash tables. The idea is similar to the classical algorithm, but there is another layer of indirection for buckets, and all entries are inserted and stored in a contiguous array of a fixed size. The deterministic hash table algorithm indeed guarantees O(1) time complexity for basic operations, like set or get.
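As an illustration only (this is not V8's actual code, and the toy string hash and class shape are assumptions), a deterministic hash table can be sketched like this: bucket heads point into a flat, insertion-ordered entries array, and each entry keeps a link to the previous entry that landed in the same bucket.

class DeterministicHashTable {
  constructor(numBuckets = 4) {
    this.buckets = new Array(numBuckets).fill(-1); // head entry index per bucket
    this.entries = [];                             // { key, value, chain } in insertion order
  }
  _hash(key) {
    let h = 0;                                     // toy hash; assumes string keys
    for (let i = 0; i < key.length; i++) h = (h * 31 + key.charCodeAt(i)) | 0;
    return (h >>> 0) % this.buckets.length;
  }
  set(key, value) {
    const b = this._hash(key);
    for (let i = this.buckets[b]; i !== -1; i = this.entries[i].chain) {
      if (this.entries[i].key === key) { this.entries[i].value = value; return; }
    }
    this.entries.push({ key, value, chain: this.buckets[b] });
    this.buckets[b] = this.entries.length - 1;
    // a real implementation would also grow (rehash) once the load factor
    // is exceeded; omitted here for brevity
  }
  get(key) {
    const b = this._hash(key);
    for (let i = this.buckets[b]; i !== -1; i = this.entries[i].chain) {
      if (this.entries[i].key === key) return this.entries[i].value;
    }
    return undefined;
  }
  *[Symbol.iterator]() {                           // iteration follows insertion order
    for (const e of this.entries) yield [e.key, e.value];
  }
}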
Next, we need to know the initial size of the hash table, the load factor, and how (and when) it grows/shrinks. The short answer is: the initial size is 4, the load factor is 2, the table (i.e. the Map) grows 2x as soon as its capacity is reached, and shrinks as soon as more than 1/2 of the entries are deleted.
Let’s consider the worst-case when the table has N out of N entries (it's full), all entries belong to a single bucket, and the required entry is located at the tail. In such a scenario, a lookup requires N moves through the chain elements.
On the other hand, in the best possible scenario when the table is full, but each bucket has 2 entries, a lookup will require up to 2 moves.
It is a well-known fact that while individual operations in hash tables are “cheap”, rehashing is not. Rehashing has O(N) time complexity and requires allocation of the new hash table on the heap. Moreover, rehashing is performed as a part of insertion or deletion operations, when necessary. So, for instance, a map.set() call could be more expensive than you would expect. Luckily, rehashing is a relatively infrequent operation.
Apart from that, such details as memory layout or hash function also matter, but I'm not going to go into these details here. If you're curious how V8 Maps work under the hood, you may find more details here. I was interested in the topic some time ago and tried to share my findings in a readable manner.
let map = new Map();
let obj = {};

const benchMarkMapSet = size => {
  console.time("benchMarkMapSet");
  for (let i = 0; i < size; i++) {
    map.set(i, i);
  }
  console.timeEnd("benchMarkMapSet");
};

const benchMarkMapGet = size => {
  console.time("benchMarkMapGet");
  for (let i = 0; i < size; i++) {
    let x = map.get(i);
  }
  console.timeEnd("benchMarkMapGet");
};

const benchMarkObjSet = size => {
  console.time("benchMarkObjSet");
  for (let i = 0; i < size; i++) {
    obj[i] = i;
  }
  console.timeEnd("benchMarkObjSet");
};

const benchMarkObjGet = size => {
  console.time("benchMarkObjGet");
  for (let i = 0; i < size; i++) {
    let x = obj[i];
  }
  console.timeEnd("benchMarkObjGet");
};

let size = 2e6;
benchMarkMapSet(size);
benchMarkObjSet(size);
benchMarkMapGet(size);
benchMarkObjGet(size);
benchMarkMapSet: 382.935ms
benchMarkObjSet: 76.077ms
benchMarkMapGet: 125.478ms
benchMarkObjGet: 2.764ms
benchMarkMapSet: 373.172ms
benchMarkObjSet: 77.192ms
benchMarkMapGet: 123.035ms
benchMarkObjGet: 2.638ms
Why don't we just test it?
var size_1 = 1000,
    size_2 = 1000000,
    map_sm = new Map(Array.from({length: size_1}, (_, i) => [++i, i])),
    map_lg = new Map(Array.from({length: size_2}, (_, i) => [++i, i])),
    i = size_1,
    j = size_2,
    s;

s = performance.now();
while (i) map_sm.get(i--);
console.log(`Size ${size_1} map returns an item in average ${(performance.now() - s) / size_1}ms`);

s = performance.now();
while (j) map_lg.get(j--);
console.log(`Size ${size_2} map returns an item in average ${(performance.now() - s) / size_2}ms`);
So to me it seems like it converges to O(1) as the size grows. Well, let's call it O(1) then.

creating a simple one-way hash

Are there any standard hash functions/methods that maps an arbitrary 9 digit integer into another (unique) 9 digit integer, such that it is somewhat difficult to map back (without using brute force).
Hashes should not collide, so every output 1 ≤ y < 10^9 needs to be mapped from one and only one input value in 1 ≤ x < 10^9.
The problem you describe is really what Format-Preserving Encryption aims to solve.
One standard is currently being worked out by NIST: the new FFX mode of encryption for block ciphers.
It may be more complex than what you expected though. I cannot find any implementation in Javascript, but some examples exist in other languages: here (Python) or here (C++).
You are requiring a non-colliding hash function with only about 30 bits. That's going to be a tall order for any hash function. Actually, what you need is not a Pseudo Random Function such as a hash but a Pseudo Random Permutation.
You could use an encryption function for this, but you would obviously need to keep the key secret. Furthermore, encryption functions normally take a fixed number of bits as input and output, and a range of 10^9 values does not correspond to an exact number of bits. So if you go for such an option, you may have to use format-preserving encryption.
You may also use any other function that is a PRP within the group 0..10^9-1 (after decrementing the value with 1), but if an attacker finds out what parameters you are using then it becomes really simple to revert back to the original. An example would be a multiplication with a number that is relatively prime with 10^9-1, modulo 10^9-1.
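As an illustration of that last idea (one reading of the suggestion; the constant below is an arbitrary assumption that happens to be coprime with 10^9 - 1, and this is not secure once the constant is known):

// permutes the range 1 .. 10^9 - 1; BigInt avoids overflow in the multiplication
const M = 10n ** 9n - 1n;       // modulus (999999999 = 3^4 * 37 * 333667)
const A = 123456791n;           // assumed constant, coprime with M

function permute(x) {           // x is an integer with 1 <= x < 10^9
  return Number((BigInt(x) - 1n) * A % M + 1n);
}

Because multiplication by a number coprime with the modulus is a bijection, every input in the range maps to a distinct output in the same range.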
This is what I can come up with:
var used = {};

var hash = function (num) {
  num = md5(num);
  if (used[num] !== undefined) {
    return used[num];
  } else {
    var newNum;
    do {
      newNum = Math.floor(Math.random() * 1000000000) + 1;
    } while (contains(newNum))
    used[num] = newNum;
    return newNum;
  }
};

var contains = function (num) {
  for (var i in used) {
    if (used[i] === num) {
      return true;
    }
  }
  return false;
};
var md5 = function (num) {
  // method that returns an md5 (or any other) hash
};
I should note, however, that it will run into problems when you try to hash a lot of different numbers, because the do..while produces random numbers and compares them with the already generated ones. The more numbers you have already generated, the longer it will take to randomly hit one of the remaining unused values.
