Heuristic for multi-dimensional knapsack - javascript

This is a follow-up question to: Finding max value of a weighted subset sum of a power set
Whereas the previous question solves (to optimality) problems of size <= 15 in reasonable time, I would like to solve problems of size ~2000 to near-optimality.
As a small example problem, I am given a certain range of nodes:
var range = [0,1,2,3,4];
A function creates a power set for all the nodes in the range and assigns each combination a numeric score. Negative scores are removed, resulting in the following array S. S[n][0] is the bitwise OR of all included nodes, and S[n][1] is the score:
var S = [
  [1, 0],    // 0
  [2, 0],    // 1
  [4, 0],    // 2
  [8, 0],    // 3
  [16, 0],   // 4
  [3, 50],   // 0-1
  [5, 100],  // 0-2
  [6, 75],   // 1-2
  [20, 130], // 2-4
  [7, 179]   // 0-1-2 e.g. combining 0-1-2 has a key of 7 (bitwise `1|2|4`) and a score of 179.
];
The optimal solution, maximizing the score, would be:
var solution = [[8,3,20],180]
Where solution[0] is an array of combinations from S and solution[1] is the resulting score. Note that bitwise 8 & 3 & 20 == 0 (the keys share no bits), signifying that each node is used only once.
Problem specifics: Each node must be used exactly once and the score for the single-node combinations will always be 0, as shown in the S array above.
My current solution (seen here) uses dynamic programming and works for small problems. I have seen heuristics involving dynamic programming, such as https://youtu.be/ze1Xa28h_Ns, but can't figure out how I'd apply that to a multi-dimensional problem. Given the problem constraints, what would be a reasonable heuristic to apply?
EDIT:
Things I've tried
Greedy approach (sort score greatest to least, pick the next viable candidate)
Same as above, but sort by score/cardinality(combo)
GRASP (edit each score by up to 10%, then sort, repeat until a better solution hasn't been found in x seconds)

This problem is really an integer optimization problem, with binary variables x_i indicating if the i^th element of S is selected and constraints indicating that each bit is used exactly once. The objective is to maximize the score attained across the selected elements. If we defined S_i to be the i^th element of S, L_b to be the indices of elements in S with bit b set, w_i to be the score associated with element i, and assumed there were n elements in set S and k bits, we could write this in mathematical notation as:
max_{x} \sum_{i=1..n} w_i*x_i
s.t. \sum_{i \in L_b} x_i = 1 \forall b = 1..k
x_i \in {0, 1} \forall i = 1..n
In many cases, linear programming solvers are much (much, much) more effective than exhaustive search at solving these sorts of problems. Unfortunately I am not aware of any javascript linear programming libraries (a Google query turned up SimplexJS and glpk.js and node-lp_solve -- I have no experience with any of these and couldn't immediately get any to work). As a result I will do the implementation in R using the lpSolve package.
w <- c(0, 0, 0, 0, 0, 50, 100, 75, 130, 179)
elt <- c(1, 2, 4, 8, 16, 3, 5, 6, 20, 7)
k <- 5
library(lpSolve)
mod <- lp(direction = "max",
          objective.in = w,
          const.mat = t(sapply(1:k, function(b) 1*(bitwAnd(elt, 2^(b-1)) > 0))),
          const.dir = rep("=", k),
          const.rhs = rep(1, k),
          all.bin = TRUE)
elt[mod$solution > 0.999]
# [1] 8 3 20
mod$objval
# [1] 180
As you'll note, this is an exact formulation of your problem. However, by setting a timeout (you'd actually need to use the lpSolveAPI package in R instead of the lpSolve package to do this), you can get the best solution found by the solver before reaching your specified timeout. This may not be an optimal solution, and you can control how long before the heuristic stops trying to find better solutions. If the solver terminates before the timeout, the solution is guaranteed to be optimal.

A reasonable heuristic (the first that comes to my mind) would be one that iteratively took the feasible element with the largest score, eliminating all elements that have overlapping bits with the selected element.
I would implement this by first sorting in decreasing order by score and then iteratively adding the first element and filtering the list, removing any element that overlaps the selected element.
In javascript:
function comp(a, b) {
  if (a[1] < b[1]) return 1;
  if (a[1] > b[1]) return -1;
  return 0;
}

S.sort(comp); // Sort descending by score

var output = [];
var score = 0;
while (S.length > 0) {
  output.push(S[0][0]);
  score += S[0][1];
  var newS = [];
  for (var i = 0; i < S.length; i++) {
    if ((S[i][0] & S[0][0]) == 0) { // keep only combinations that share no bits with the pick
      newS.push(S[i]);
    }
  }
  S = newS;
}
alert(JSON.stringify([output, score]));
This selects elements 7, 8, and 16, with score 179 (as opposed to the optimal score of 180).
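One possible refinement, in the spirit of the GRASP attempt mentioned in the question, is to randomize the greedy step and keep the best of many runs: instead of always taking the top-scoring feasible element, pick randomly among the top few candidates and restart several times. The sketch below is only illustrative and not part of the answer above; the function names, restart count and candidate-window size are arbitrary choices.
// Sketch: randomized greedy with restarts.
function greedyOnce(S, topN) {
  var pool = S.slice().sort(function(a, b) { return b[1] - a[1]; }); // descending by score
  var picked = [], score = 0;
  while (pool.length > 0) {
    // pick randomly among the best `topN` remaining candidates
    var idx = Math.floor(Math.random() * Math.min(topN, pool.length));
    var chosen = pool[idx];
    picked.push(chosen[0]);
    score += chosen[1];
    // keep only elements that share no bits with the chosen element (this also drops the element itself)
    pool = pool.filter(function(elt) { return (elt[0] & chosen[0]) === 0; });
  }
  return [picked, score];
}

function randomizedGreedy(S, restarts, topN) {
  var best = greedyOnce(S, 1); // plain greedy as the baseline
  for (var r = 0; r < restarts; r++) {
    var candidate = greedyOnce(S, topN);
    if (candidate[1] > best[1]) best = candidate;
  }
  return best;
}

// e.g. randomizedGreedy(S, 1000, 3) usually finds the optimal score of 180
// on the small example above (picking 20, 3 and 8 in some order).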

Related

JS function that creates a range of numbers (inclusive beginning and end) with a limit on range items

You are given a starting number and ending number and the max number of output elements allowed. How would you create an output array with as even a distribution as possible, while still including the first and last points in the output?
Function signature
function generatePoints(startingNumber, endingNumber, maxPoints) {}
Function desired output
generatePoints(0, 8, 5) // [0, 2, 4, 6, 8]
Here's what I tried so far
function generatePoints(startingNumber, endingNumber, maxPoints) {
  const interval = Math.round((endingNumber - startingNumber) / maxPoints)
  let count = 0
  let counter = 0
  let points = []
  while (count < maxPoints - 1) {
    points.push(counter)
    counter += interval
    count++
  }
  points.push(endingNumber)
  return points
}
Technically this creates the correct output for the simple case, but falls short when up against most other edge cases due to the fact that I'm stopping one iteration early and then adding the final point. I'm thinking that the better way to do this (to create a better distribution) is to build from the center of the array outwards, versus building from the start of the array and then stopping one element early and appending the endingNumber.
Note this:
0       2       4       6       8
+-------+-------+-------+-------+
    A       B       C       D
Splitting our range into intervals with 5 points including the endpoints, we have only four intervals. It will always be one fewer than the number of points. We can divide our range up evenly into these smaller ranges, simply by continually adding the width of one interval, which is just (endingNumber - startingNumber) / (maxPoints - 1). We can do it like this:
const generatePoints = (startingNumber, endingNumber, maxPoints) => Array .from (
{length: maxPoints},
(_, i) => startingNumber + i * (endingNumber - startingNumber) / (maxPoints - 1)
)
console .log (generatePoints (0, 8, 5))
We just build an array of the right length, using the index parameter to count the number of smaller intervals we're using.
We do no error-checking here, and if maxPoints were just 1, we might have an issue. But that's easy enough to handle how you like.
But there is a concern here. Why is the parameter called maxPoints instead of points? If the number of points allowed is variable, I think we need further requirements.
Do not Math.round(interval). Instead, Math.round(counter) at the last moment, as shown in the sketch below.
The reason is that if you've added k intervals, the accumulated error in your result can be as much as 0.5*k. But if you round at the last minute, the error is never more than 0.5.
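Combining that advice with the interval width from the answer above ((endingNumber - startingNumber) / (maxPoints - 1)), a minimal sketch of the fixed loop might look like this:
function generatePoints(startingNumber, endingNumber, maxPoints) {
  const interval = (endingNumber - startingNumber) / (maxPoints - 1) // keep full precision
  let points = []
  let counter = startingNumber
  for (let count = 0; count < maxPoints - 1; count++) {
    points.push(Math.round(counter)) // round only when emitting a point
    counter += interval
  }
  points.push(endingNumber)
  return points
}

generatePoints(0, 8, 5) // [0, 2, 4, 6, 8]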

Efficient computation of n choose k in Node.js

I have some performance sensitive code on a Node.js server that needs to count combinations. From this SO answer, I used this simple recursive function for computing n choose k:
function choose(n, k) {
  if (k === 0) return 1;
  return (n * choose(n-1, k-1)) / k;
}
Then since we all know iteration is almost always faster than recursion, I wrote this function based on the multiplicative formula:
function choosei(n, k) {
  var result = 1;
  for (var i = 1; i <= k; i++) {
    result *= (n + 1 - i) / i;
  }
  return result;
}
I ran a few benchmarks on my machine. Here are the results of just one of them:
Recursive x 178,836 ops/sec ±7.03% (60 runs sampled)
Iterative x 550,284 ops/sec ±5.10% (51 runs sampled)
Fastest is Iterative
The results consistently showed that the iterative method is indeed about 3 to 4 times faster than the recursive method in Node.js (at least on my machine).
This is probably fast enough for my needs, but is there any way to make it faster? My code has to call this function very frequently, sometimes with fairly large values of n and k, so the faster the better.
EDIT
After running a few more tests with le_m's and Mike's solutions, it turns out that while both are significantly faster than the iterative method I proposed, Mike's method using Pascal's triangle appears to be slightly faster than le_m's log table method.
Recursive x 189,036 ops/sec ±8.83% (58 runs sampled)
Iterative x 538,655 ops/sec ±6.08% (51 runs sampled)
LogLUT x 14,048,513 ops/sec ±9.03% (50 runs sampled)
PascalsLUT x 26,538,429 ops/sec ±5.83% (62 runs sampled)
Fastest is PascalsLUT
The logarithmic look up method has been around 26-28 times faster than the iterative method in my tests, and the method using Pascal's triangle has been about 1.3 to 1.8 times faster than the logarithmic look up method.
Note that I followed le_m's suggestion of pre-computing the logarithms with higher precision using mathjs, then converted them back to regular JavaScript Numbers (which are always double-precision 64 bit floats).
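That precomputation might look roughly like the sketch below. It assumes the mathjs package (its bignumber, add and log functions) with the default BigNumber precision; the variable names are mine.
// Sketch: accumulate log-factorials as BigNumbers, then downcast each entry to a regular Number.
const math = require('mathjs');

const size = 1000;
const logf = new Array(size + 1);
let acc = math.bignumber(0);
logf[0] = 0;
for (let i = 1; i <= size; i++) {
  acc = math.add(acc, math.log(math.bignumber(i))); // high-precision ln(i)
  logf[i] = acc.toNumber();                         // stored as a 64-bit float for fast lookup
}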
Never compute factorials, they grow too quickly. Instead compute the result you want. In this case, you want the binomial numbers, which have an incredibly simple geometric construction: you can build Pascal's triangle as you need it, and do it using plain arithmetic.
Start with [1] and [1,1]. The next row has [1] at the start, [1+1] in the middle, and [1] at the end: [1,2,1]. Next row: [1] at the start, the sum of the first two terms in spot 2, the sum of the next two terms in spot three, and [1] at the end: [1,3,3,1]. Next row: [1], then 1+3=4, then 3+3=6, then 3+1=4, then [1] at the end, and so on and so on. As you can see, no factorials, logarithms, or even multiplications: just super fast addition with clean integer numbers. So simple, you can build a massive lookup table by hand.
And you should.
Never compute in code what you can compute by hand and just include as constants for immediate lookup; in this case, writing out the table for up to something around n=20 is absolutely trivial, and you can then just use that as your "starting LUT" and probably never even access the high rows.
But, if you do need them, or more, then because you can't build an infinite lookup table you compromise: you start with a pre-specified LUT, and a function that can "fill it up" to some term you need that's not in it yet:
// step 1: a basic LUT with a few steps of Pascal's triangle
const binomials = [
  [1],
  [1, 1],
  [1, 2, 1],
  [1, 3, 3, 1],
  [1, 4, 6, 4, 1],
  [1, 5, 10, 10, 5, 1],
  [1, 6, 15, 20, 15, 6, 1],
  [1, 7, 21, 35, 35, 21, 7, 1],
  [1, 8, 28, 56, 70, 56, 28, 8, 1],
  // ...
];

// step 2: a function that builds out the LUT if it needs to.
module.exports = function binomial(n, k) {
  while (n >= binomials.length) {
    let s = binomials.length;
    let nextRow = [];
    nextRow[0] = 1;
    for (let i = 1, prev = s - 1; i < s; i++) {
      nextRow[i] = binomials[prev][i - 1] + binomials[prev][i];
    }
    nextRow[s] = 1;
    binomials.push(nextRow);
  }
  return binomials[n][k];
};
Since this is an array of ints, the memory footprint is tiny. For a lot of work involving binomials, we realistically don't even need more than two bytes per integer, making this a minuscule lookup table: we don't need more than 2 bytes until you need binomials higher than n=19, and the full lookup table up to n=19 takes up a measly 380 bytes. This is nothing compared to the rest of your program. Even if we allow for 32 bit ints, we can get up to n=35 in a mere 2380 bytes.
So the lookup is fast: O(1) for previously computed values, (n*(n+1))/2 addition steps if we have no LUT at all (in big O notation that would be O(n²), but big O notation is almost never the right complexity measure here), and somewhere in between for terms we need that aren't in the LUT yet. Run some benchmarks for your application, which will tell you how big your initial LUT should be, simply hard code that (seriously, these are constants; they are exactly the kind of values that should be hard coded), and keep the generator around just in case.
However, do remember that you're in JavaScript land, and you are constrained by the JavaScript numerical type: integers only go up to 2^53, beyond that the integer property (every n has a distinct m=n+1 such that m-n=1) is not guaranteed. This should hardly ever be a problem, though: once we hit that limit, we're dealing with binomial coefficients that you should never even be using.
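If exact values beyond 2^53 ever do matter, the same triangle builder carries over to BigInt essentially unchanged (slower arithmetic, but exact results). This is my own variant, not part of the answer above:
const bigBinomials = [[1n], [1n, 1n]];

function bigBinomial(n, k) {
  while (n >= bigBinomials.length) {
    const s = bigBinomials.length;
    const prev = bigBinomials[s - 1];
    const nextRow = [1n];
    for (let i = 1; i < s; i++) {
      nextRow[i] = prev[i - 1] + prev[i]; // exact BigInt addition, no 2^53 ceiling
    }
    nextRow[s] = 1n;
    bigBinomials.push(nextRow);
  }
  return bigBinomials[n][k];
}

console.log(bigBinomial(67, 33).toString()); // larger than Number.MAX_SAFE_INTEGER, still exact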
The following algorithm has a run-time complexity of O(1) given a linear look-up table of log-factorials with space-complexity O(n).
Limiting n and k to the range [0, 1000] makes sense since binomial(1000, 500) is already dangerously close to Number.MAX_VALUE. We would thus need a look-up table of size 1000.
On a modern JavaScript engine, a compact array of n numbers has a size of n * 8 bytes. A full look-up table would thus require 8 kilobytes of memory. If we limit our input to the range [0, 100], the table would only occupy 800 bytes.
var logf = [0, 0, 0.6931471805599453, 1.791759469228055, 3.1780538303479458, 4.787491742782046, 6.579251212010101, 8.525161361065415, 10.60460290274525, 12.801827480081469, 15.104412573075516, 17.502307845873887, 19.987214495661885, 22.552163853123425, 25.19122118273868, 27.89927138384089, 30.671860106080672, 33.50507345013689, 36.39544520803305, 39.339884187199495, 42.335616460753485, 45.38013889847691, 48.47118135183523, 51.60667556776438, 54.78472939811232, 58.00360522298052, 61.261701761002, 64.55753862700634, 67.88974313718154, 71.25703896716801, 74.65823634883016, 78.0922235533153, 81.55795945611504, 85.05446701758152, 88.58082754219768, 92.1361756036871, 95.7196945421432, 99.33061245478743, 102.96819861451381, 106.63176026064346, 110.32063971475739, 114.0342117814617, 117.77188139974507, 121.53308151543864, 125.3172711493569, 129.12393363912722, 132.95257503561632, 136.80272263732635, 140.67392364823425, 144.5657439463449, 148.47776695177302, 152.40959258449735, 156.3608363030788, 160.3311282166309, 164.32011226319517, 168.32744544842765, 172.3527971391628, 176.39584840699735, 180.45629141754378, 184.53382886144948, 188.6281734236716, 192.7390472878449, 196.86618167289, 201.00931639928152, 205.1681994826412, 209.34258675253685, 213.53224149456327, 217.73693411395422, 221.95644181913033, 226.1905483237276, 230.43904356577696, 234.70172344281826, 238.97838956183432, 243.2688490029827, 247.57291409618688, 251.8904022097232, 256.22113555000954, 260.5649409718632, 264.9216497985528, 269.2910976510198, 273.6731242856937, 278.0675734403661, 282.4742926876304, 286.893133295427, 291.3239500942703, 295.76660135076065, 300.22094864701415, 304.6868567656687, 309.1641935801469, 313.65282994987905, 318.1526396202093, 322.66349912672615, 327.1852877037752, 331.7178871969285, 336.26118197919845, 340.815058870799, 345.37940706226686, 349.95411804077025, 354.5390855194408, 359.1342053695754, 363.73937555556347];
function binomial(n, k) {
  return Math.exp(logf[n] - logf[n-k] - logf[k]);
}
console.log(binomial(5, 3));
Explanation
Starting with the original iterative algorithm, we first replace the product with a sum of logarithms:
function binomial(n, k) {
  var logresult = 0;
  for (var i = 1; i <= k; i++) {
    logresult += Math.log(n + 1 - i) - Math.log(i);
  }
  return Math.exp(logresult);
}
Our loop now sums over k terms. If we rearrange the sum, we can easily see that we sum over consecutive logarithms log(1) + log(2) + ... + log(k) etc., which we can replace with sum_of_logs(k), which is identical to log(k!). Precomputing these values and storing them in our lookup-table logf then leads to the above one-liner algorithm.
Computing the look-up table:
I recommend precomputing the lookup-table with higher precision and converting the resulting elements to 64-bit floats. If you do not need that little bit of additional precision or want to run this code on the client side, use this:
var size = 1000, logf = new Array(size);
logf[0] = 0;
for (var i = 1; i <= size; ++i) logf[i] = logf[i-1] + Math.log(i);
Numerical precision:
By using log-factorials, we avoid precision problems inherent to storing raw factorials.
We could even use Stirling's approximation for log(n!) instead of a lookup-table and still get 12 significant figures for the above computation, in both run-time and space complexity O(1):
function logf(n) {
  return n === 0 ? 0 : (n + .5) * Math.log(n) - n + 0.9189385332046728 + 0.08333333333333333 / n - 0.002777777777777778 * Math.pow(n, -3);
}

function binomial(n, k) {
  return Math.exp(logf(n) - logf(n - k) - logf(k));
}
console.log(binomial(1000, 500)); // 2.7028824094539536e+299
Using Pascal's triangle is a fast method for calculating n choose k.
The fastest method I know of would be to make use of the results from "On the Complexity of Calculating Factorials". Just calculate all 3 factorials, then perform the two division operations, each with complexity M(n logn).

Generate a random array in Javascript/jquery for Sudoku puzzle

I want to fill the 9 x 9 grid from the array while taking care of the following conditions:
A particular number should not be repeated across the same column.
A particular number should not be repeated across the same row.
When I execute the code below, it fills the whole 9 x 9 grid with random values without respecting the above conditions. How can I add those two conditions before inserting values into my 9 x 9 grid?
var sudoku_array = ['1','2','3','4','6','5','7','8','9'];
$('.smallbox input').each(function(index) {
  $(this).val(sudoku_array[Math.floor(Math.random() * sudoku_array.length)]);
});
My JSFIDDLE LINK
Generating and solving Sudokus is actually not as simple as other (wrong) answers might suggest, but it is not rocket science either. Instead of copying and pasting from Wikipedia I'd like to point you to this question.
However, since it is bad practice to just point to external links, I want to justify it by providing you at least with the intuition why naive approaches fail.
If you start generating a Sudoku board by filling some fields with random numbers (thereby taking into account your constraints), you obtain a partially filled board. Completing it is then equivalent to solving a Sudoku, which is nothing other than completing a partially filled board while adhering to the Sudoku rules. If you ever tried it, you will know that this is not possible if you decide on the next number by choosing a valid number only with respect to the 3x3 box, the column and the row. For all but the simplest Sudokus there is some trial and error, so you need a form of backtracking.
I hope this helps.
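To make that intuition concrete, here is a minimal backtracking sketch of my own (not taken from the linked question): it fills cells one by one with shuffled candidate digits, checks the row, the column and the 3x3 box, and undoes a choice whenever it runs into a dead end.
function fillBoard(board) { // board: a 9x9 array of 0s (0 = empty cell)
  function canPlace(r, c, v) {
    for (var i = 0; i < 9; i++) {
      if (board[r][i] === v || board[i][c] === v) return false; // row / column clash
    }
    var br = r - r % 3, bc = c - c % 3; // top-left corner of the 3x3 box
    for (var y = 0; y < 3; y++) {
      for (var x = 0; x < 3; x++) {
        if (board[br + y][bc + x] === v) return false; // box clash
      }
    }
    return true;
  }

  function solve(cell) {
    if (cell === 81) return true; // every cell filled
    var r = Math.floor(cell / 9), c = cell % 9;
    var digits = [1, 2, 3, 4, 5, 6, 7, 8, 9];
    for (var i = digits.length - 1; i > 0; i--) { // Fisher-Yates shuffle of the candidates
      var j = Math.floor(Math.random() * (i + 1));
      var t = digits[i]; digits[i] = digits[j]; digits[j] = t;
    }
    for (var k = 0; k < 9; k++) {
      if (canPlace(r, c, digits[k])) {
        board[r][c] = digits[k];
        if (solve(cell + 1)) return true; // this choice worked out
        board[r][c] = 0;                  // dead end: undo and try the next digit
      }
    }
    return false; // no digit fits here, the caller must backtrack
  }

  solve(0);
  return board;
}

var empty = [];
for (var i = 0; i < 9; i++) empty.push([0, 0, 0, 0, 0, 0, 0, 0, 0]);
console.log(fillBoard(empty).map(function (row) { return row.join(" "); }).join("\n"));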
To ensure that no number is repeated on a row, you might need a shuffling function. For columns, you'll just have to do it the hard way (checking previous rows to see if a number already exists in that column). I hope I am not confusing rows for columns, I tend to do it a lot.
It's similar to the eight queens problem in evolutionary computing. Backtracking, a pure random walk or an evolved solution would solve the problem.
This code will take a while, but it'll do the job.
You can then iterate through the returned two-dimensional array and fill the sudoku box. Holla if you need any help with that.
Array.prototype.shuffle = function() {
  var arr = this.valueOf();
  var ret = [];
  while (ret.length < arr.length) {
    var x = arr[Math.floor(Number(Math.random() * arr.length))];
    if (!(ret.indexOf(x) >= 0)) ret.push(x);
  }
  return ret;
}
function getSudoku() {
  var sudoku = [];
  var arr = [1, 2, 3, 4, 5, 6, 7, 8, 9];
  sudoku.push(arr);
  for (var i = 1; i < 9; i++) {
    while (sudoku.length <= i) {
      var newarr = arr.shuffle();
      var b = false;
      for (var j = 0; j < arr.length; j++) {
        for (var k = 0; k < i; k++) {
          if (sudoku[k].indexOf(newarr[j]) == j) b = true;
        }
      }
      if (!b) {
        sudoku.push(newarr);
        document.body.innerHTML += `${newarr}<br/>`;
      }
    }
  }
  return sudoku;
}
getSudoku()
You need to keep track of what you have inserted before, for the following line:
$(this).val(sudoku_array[Math.floor(Math.random()*sudoku_array.length)]);
For example, you can have a jagged array (an array of arrays, like a 2-D array) instead of the 'sudoku_array' you have created, to keep track of available numbers. In fact, you can create two jagged arrays, one for columns and one for rows. Since you don't keep track of what you have inserted before, numbers are generated randomly.
After you create an array that keeps available numbers, you do the following:
After you generate number, remove it from the jagged array's respective row and column to mark it unavailable for those row and columns.
Before creating any number, check if it is available in the jagged array(check for both column and row). If not available, make it try another number.
Note: You can reduce the limits of random number you generate to available numbers. If you do that the random number x you generate would mean xth available number for that cell. That way you would not get a number that is not available and thus it works significantly faster.
Edit: As Lex82 pointed out in the comments and in his answer, you will also need backtracking to avoid dead ends, or you need to go deeper into the mathematics. I'm just going to keep my answer in case it gives you an idea.

Finding the difference between two arrays in PHP, Node and Golang

Here is a typical example of what I need to do
$testArr = array(2.05080E6,29400,420);
$stockArrays = array(
    array(2.05080E6,29400,0),
    array(2.05080E6,9800,420),
    array(1.715E6,24500,280),
    array(2.05080E6,29400,140),
    array(2.05080E6,4900,7));
I need to identify the stockArray that is the least different. A few clarifications
The numeric values of array elements at each position are guaranteed not to overlap (i.e. arr[0] will always have the biggest values, arr[1] will be at least an order of magnitude smaller, etc.).
The absolute values of the differences do not count when determining least different. Only, the number of differing array indices matter.
Positional differences do have a weighting. Thus in my example stockArrays[1] is "more different" though it too - like its stockArrays[0] and stockArrays[3] counterparts - differs in only one index position, because the position at which it differs carries more weight.
The number of stockArrays elements will typically be less than 10 but could potentially be much more (though never into 3 figures)
The stock arrays will always have the same number of elements. The test array will have the same number or fewer. However, when it has fewer, testArr would be padded out so that potentially matching elements are always in the same place as in the stockArray, e.g.
$testArr = array(29400,140);
would be transformed to
$testArr = array(0,29400,140);
prior to being subjected to difference testing.
Finally, a tie is possible. For instance my example above the matches would be stockArrays[0] and stockArrays[3].
In my example the result would be
$result = array(0=>array(0,0,1),3=>array(0,0,1));
indicating that the least different stock arrays are at indices 0 & 3 with the differences being at position 2.
In PHP I would handle all of this with array_diff as my starting point. For Node/JavaScript I would probably be tempted to use the php.js array_diff port, though I would be inclined to explore a bit given that in the worst case scenario it is an O(n²) affair.
I am a newbie when it comes to Golang so I am not sure how I would implement this problem there. I have noted that Node does have an array_diff npm module.
One off-beat idea I have had is converting the array to a padded string (smaller array elements are 0 padded) and effectively do an XOR on the ordinal value of each character but have dismissed that as probably a rather nutty thing to do.
I am concerned with speed but not at all costs. In an ideal world the same solution (algorithm) would be used in each target language though in reality the differences between them might mean that is not possible/not a good idea.
Perhaps someone here might be able to point me to less pedestrian ways of accomplishing this - i.e. not just array_diff ports.
Here's the equivalent of the array_diff solution: (assuming I didn't make a mistake)
package main

import "fmt"

func FindLeastDifferent(needle []float64, haystack [][]float64) int {
    if len(haystack) == 0 {
        return -1
    }
    var currentIndex, currentDiff int
    for i, arr := range haystack {
        diff := 0
        for j := range needle {
            if arr[j] != needle[j] {
                diff++
            }
        }
        if i == 0 || diff < currentDiff {
            currentDiff = diff
            currentIndex = i
        }
    }
    return currentIndex
}

func main() {
    idx := FindLeastDifferent(
        []float64{2.05080E6, 29400, 420},
        [][]float64{
            {2.05080E6, 29400, 0},
            {2.05080E6, 9800, 420},
            {1.715E6, 24500, 280},
            {2.05080E6, 29400, 140},
            {2.05080E6, 4900, 7},
            {2.05080E6, 29400, 420},
        },
    )
    fmt.Println(idx)
}
Like you said, it's O(n * m) where n is the number of elements in the needle array, and m is the number of arrays in the haystack.
If you don't know the haystack ahead of time, then there's probably not much you can do to improve this. But if, instead, you're storing this list in a database, I think your intuition about string search has some potential. PostgreSQL for example supports string similarity indexes. (And here's an explanation of a similar idea for regular expressions: http://swtch.com/~rsc/regexp/regexp4.html)
One other idea: if your arrays are really big you can calculate fuzzy hashes (http://ssdeep.sourceforge.net/) which would make your n smaller.
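For the Node side of the question, here is a sketch of one possible interpretation in plain JavaScript. The function names and the exact weighting scheme (difference count first, then a position weight to break ties) are my own, since the question leaves the precise weighting open; it also assumes testArr has already been padded to the stock length.
function diffScore(testArr, stock) {
  var flags = stock.map(function (v, j) { return v !== testArr[j] ? 1 : 0; });
  var count = flags.reduce(function (a, b) { return a + b; }, 0);
  // positional weighting: a difference at an earlier (more significant) index weighs more
  var weight = flags.reduce(function (a, f, j) { return a + f * (stock.length - j); }, 0);
  return { flags: flags, count: count, weight: weight };
}

function findLeastDifferent(testArr, stockArrays) {
  var best = null, result = {};
  stockArrays.forEach(function (stock, idx) {
    var s = diffScore(testArr, stock);
    if (best === null || s.count < best.count ||
        (s.count === best.count && s.weight < best.weight)) {
      best = s;
      result = {};
      result[idx] = s.flags;
    } else if (s.count === best.count && s.weight === best.weight) {
      result[idx] = s.flags; // tie: report this stock array as well
    }
  });
  return result;
}

var testArr = [2.05080E6, 29400, 420];
var stockArrays = [
  [2.05080E6, 29400, 0],
  [2.05080E6, 9800, 420],
  [1.715E6, 24500, 280],
  [2.05080E6, 29400, 140],
  [2.05080E6, 4900, 7]
];
console.log(findLeastDifferent(testArr, stockArrays));
// { '0': [ 0, 0, 1 ], '3': [ 0, 0, 1 ] }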

Is it correct to use JavaScript Array.sort() method for shuffling?

I was helping somebody out with his JavaScript code and my eyes were caught by a section that looked like that:
function randOrd(){
  return (Math.round(Math.random())-0.5);
}
coords.sort(randOrd);
alert(coords);
My first thought was: hey, this can't possibly work! But then I did some experimenting and found that it indeed at least seems to provide nicely randomized results.
Then I did some web search and almost at the top found an article from which this code was most certainly copied. Looked like a pretty respectable site and author...
But my gut feeling tells me that this must be wrong. Especially as the sorting algorithm is not specified by the ECMA standard. I think different sorting algorithms will result in different non-uniform shuffles. Some sorting algorithms may probably even loop infinitely...
But what do you think?
And as another question... how would I now go and measure how random the results of this shuffling technique are?
update: I did some measurements and posted the results below as one of the answers.
After Jon has already covered the theory, here's an implementation:
function shuffle(array) {
  var tmp, current, top = array.length;
  if (top) while (--top) {
    current = Math.floor(Math.random() * (top + 1));
    tmp = array[current];
    array[current] = array[top];
    array[top] = tmp;
  }
  return array;
}
The algorithm is O(n), whereas sorting should be O(n log n). Depending on the overhead of executing JS code compared to the native sort() function, this might lead to a noticeable difference in performance, which should increase with array size.
In the comments to bobobobo's answer, I stated that the algorithm in question might not produce evenly distributed probabilities (depending on the implementation of sort()).
My argument goes along these lines: A sorting algorithm requires a certain number c of comparisons, e.g. c = n(n-1)/2 for Bubblesort. Our random comparison function makes the outcome of each comparison equally likely, i.e. there are 2^c equally probable results. Now, each result has to correspond to one of the n! permutations of the array's entries, which makes an even distribution impossible in the general case. (This is a simplification, as the actual number of comparisons needed depends on the input array, but the assertion should still hold.)
As Jon pointed out, this alone is no reason to prefer Fisher-Yates over using sort(), as the random number generator will also map a finite number of pseudo-random values to the n! permutations. But the results of Fisher-Yates should still be better:
Math.random() produces a pseudo-random number in the range [0;1[. As JS uses double-precision floating point values, this corresponds to 2^x possible values where 52 ≤ x ≤ 63 (I'm too lazy to find the actual number). A probability distribution generated using Math.random() will stop behaving well if the number of atomic events is of the same order of magnitude.
When using Fisher-Yates, the relevant parameter is the size of the array, which should never approach 2^52 due to practical limitations.
When sorting with a random comparison function, the function basically only cares if the return value is positive or negative, so this will never be a problem. But there is a similar one: Because the comparison function is well-behaved, the 2^c possible results are, as stated, equally probable. If c ~ n log n then 2^c ~ n^(a·n) where a = const, which makes it at least possible that 2^c is of the same magnitude as (or even less than) n!, thus leading to an uneven distribution, even if the sorting algorithm were to map onto the permutations evenly. Whether this has any practical impact is beyond me.
The real problem is that the sorting algorithms are not guaranteed to map onto the permutations evenly. It's easy to see that Mergesort does as it's symmetric, but reasoning about something like Bubblesort or, more importantly, Quicksort or Heapsort, is not.
The bottom line: As long as sort() uses Mergesort, you should be reasonably safe except in corner cases (at least I'm hoping that 2^c ≤ n! is a corner case), if not, all bets are off.
It's never been my favourite way of shuffling, partly because it is implementation-specific as you say. In particular, I seem to remember that the standard library sorting from either Java or .NET (not sure which) can often detect if you end up with an inconsistent comparison between some elements (e.g. you first claim A < B and B < C, but then C < A).
It also ends up as a more complex (in terms of execution time) shuffle than you really need.
I prefer the shuffle algorithm which effectively partitions the collection into "shuffled" (at the start of the collection, initially empty) and "unshuffled" (the rest of the collection). At each step of the algorithm, pick a random unshuffled element (which could be the first one) and swap it with the first unshuffled element - then treat it as shuffled (i.e. mentally move the partition to include it).
This is O(n) and only requires n-1 calls to the random number generator, which is nice. It also produces a genuine shuffle - any element has a 1/n chance of ending up in each space, regardless of its original position (assuming a reasonable RNG). The sorted version approximates to an even distribution (assuming that the random number generator doesn't pick the same value twice, which is highly unlikely if it's returning random doubles) but I find it easier to reason about the shuffle version :)
This approach is called a Fisher-Yates shuffle.
I would regard it as a best practice to code up this shuffle once and reuse it everywhere you need to shuffle items. Then you don't need to worry about sort implementations in terms of reliability or complexity. It's only a few lines of code (which I won't attempt in JavaScript!)
The Wikipedia article on shuffling (and in particular the shuffle algorithms section) talks about sorting a random projection - it's worth reading the section on poor implementations of shuffling in general, so you know what to avoid.
I did some measurements of how random the results of this random sort are...
My technique was to take a small array [1,2,3,4] and create all (4! = 24) permutations of it. Then I would apply the shuffling function to the array a large number of times and count how many times each permutation is generated. A good shuffling algorithm would distribute the results quite evenly over all the permutations, while a bad one would not create that uniform result.
Using the code below I tested in Firefox, Opera, Chrome, IE6/7/8.
Surprisingly for me, the random sort and the real shuffle both created equally uniform distributions. So it seems that (as many have suggested) the main browsers are using merge sort. This of course doesn't mean that there can't be a browser out there that does it differently, but I would say it means that this random-sort method is reliable enough to use in practice.
EDIT: This test didn't really measure the randomness (or lack thereof) correctly. See the other answer I posted.
But on the performance side the shuffle function given by Christoph was a clear winner. Even for small four-element arrays the real shuffle performed about twice as fast as the random sort!
// The shuffle function posted by Christoph.
var shuffle = function(array) {
  var tmp, current, top = array.length;
  if (top) while (--top) {
    current = Math.floor(Math.random() * (top + 1));
    tmp = array[current];
    array[current] = array[top];
    array[top] = tmp;
  }
  return array;
};

// the random sort function
var rnd = function() {
  return Math.round(Math.random()) - 0.5;
};
var randSort = function(A) {
  return A.sort(rnd);
};

var permutations = function(A) {
  if (A.length == 1) {
    return [A];
  }
  else {
    var perms = [];
    for (var i = 0; i < A.length; i++) {
      var x = A.slice(i, i + 1);
      var xs = A.slice(0, i).concat(A.slice(i + 1));
      var subperms = permutations(xs);
      for (var j = 0; j < subperms.length; j++) {
        perms.push(x.concat(subperms[j]));
      }
    }
    return perms;
  }
};

var test = function(A, iterations, func) {
  // init permutations
  var stats = {};
  var perms = permutations(A);
  for (var i in perms) {
    stats["" + perms[i]] = 0;
  }
  // shuffle many times and gather stats
  var start = new Date();
  for (var i = 0; i < iterations; i++) {
    var shuffled = func(A);
    stats["" + shuffled]++;
  }
  var end = new Date();
  // format result
  var arr = [];
  for (var i in stats) {
    arr.push(i + " " + stats[i]);
  }
  return arr.join("\n") + "\n\nTime taken: " + ((end - start) / 1000) + " seconds.";
};

alert("random sort: " + test([1,2,3,4], 100000, randSort));
alert("shuffle: " + test([1,2,3,4], 100000, shuffle));
Interestingly, Microsoft used the same technique in their pick-random-browser-page.
They used a slightly different comparison function:
function RandomSort(a,b) {
  return (0.5 - Math.random());
}
Looks almost the same to me, but it turned out to be not so random...
So I made some test runs again with the same methodology used in the linked article, and indeed - it turned out that the random-sorting method produced flawed results. New test code here:
function shuffle(arr) {
  arr.sort(function(a, b) {
    return (0.5 - Math.random());
  });
}

function shuffle2(arr) {
  arr.sort(function(a, b) {
    return (Math.round(Math.random()) - 0.5);
  });
}

function shuffle3(array) {
  var tmp, current, top = array.length;
  if (top) while (--top) {
    current = Math.floor(Math.random() * (top + 1));
    tmp = array[current];
    array[current] = array[top];
    array[top] = tmp;
  }
  return array;
}

var counts = [
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]
];

var arr;
for (var i = 0; i < 100000; i++) {
  arr = [0, 1, 2, 3, 4];
  shuffle3(arr);
  arr.forEach(function(x, i) { counts[x][i]++; });
}
alert(counts.map(function(a) { return a.join(", "); }).join("\n"));
I have placed a simple test page on my website showing the bias of your current browser versus other popular browsers using different methods to shuffle. It shows the terrible bias of just using Math.random()-0.5, another 'random' shuffle that isn't biased, and the Fisher-Yates method mentioned above.
You can see that on some browsers there is as high as a 50% chance that certain elements will not change place at all during the 'shuffle'!
Note: you can make the implementation of the Fisher-Yates shuffle by #Christoph slightly faster for Safari by changing the code to:
function shuffle(array) {
  for (var tmp, cur, top = array.length; top--;) {
    cur = (Math.random() * (top + 1)) << 0;
    tmp = array[cur]; array[cur] = array[top]; array[top] = tmp;
  }
  return array;
}
Test results: http://jsperf.com/optimized-fisher-yates
I think it's fine for cases where you're not picky about distribution and you want the source code to be small.
In JavaScript (where the source is transmitted constantly), small makes a difference in bandwidth costs.
It's been four years, but I'd like to point out that the random comparator method won't be correctly distributed, no matter what sorting algorithm you use.
Proof:
For an array of n elements, there are exactly n! permutations (i.e. possible shuffles).
Every comparison during a shuffle is a choice between two sets of permutations. For a random comparator, there is a 1/2 chance of choosing each set.
Thus, for each permutation p, the chance of ending up with permutation p is a fraction with denominator 2^k (for some k), because it is a sum of such fractions (e.g. 1/8 + 1/16 = 3/16).
For n = 3, there are six equally-likely permutations. The chance of each permutation, then, is 1/6. 1/6 can't be expressed as a fraction with a power of 2 as its denominator.
Therefore, the coin flip sort will never result in a fair distribution of shuffles.
The only sizes that could possibly be correctly distributed are n=0,1,2.
As an exercise, try drawing out the decision tree of different sort algorithms for n=3.
There is a gap in the proof: If a sort algorithm depends on the consistency of the comparator, and has unbounded runtime with an inconsistent comparator, it can have an infinite sum of probabilities, which is allowed to add up to 1/6 even if every denominator in the sum is a power of 2. Try to find one.
Also, if a comparator has a fixed chance of giving either answer (e.g. (Math.random() < P)*2 - 1, for constant P), the above proof holds. If the comparator instead changes its odds based on previous answers, it may be possible to generate fair results. Finding such a comparator for a given sorting algorithm could be a research paper.
It is a hack, certainly. In practice, an infinitely looping algorithm is not likely.
If you're sorting objects, you could loop through the coords array and do something like:
for (var i = 0; i < coords.length; i++)
  coords[i].sortValue = Math.random();

coords.sort(useSortValue);

function useSortValue(a, b)
{
  return a.sortValue - b.sortValue;
}
(and then loop through them again to remove the sortValue)
Still a hack though. If you want to do it nicely, you have to do it the hard way :)
If you're using D3 there is a built-in shuffle function (using Fisher-Yates):
var days = ['Lundi','Mardi','Mercredi','Jeudi','Vendredi','Samedi','Dimanche'];
d3.shuffle(days);
And here is Mike going into details about it:
http://bost.ocks.org/mike/shuffle/
No, it is not correct. As other answers have noted, it will lead to a non-uniform shuffle and the quality of the shuffle will also depend on which sorting algorithm the browser uses.
Now, that might not sound too bad to you, because even if theoretically the distribution is not uniform, in practice it's probably nearly uniform, right? Well, no, not even close. The following charts show heat-maps of which indices each element gets shuffled to, in Chrome and Firefox respectively: if the pixel (i, j) is green, it means the element at index i gets shuffled to index j too often, and if it's red then it gets shuffled there too rarely.
These screenshots are taken from Mike Bostock's page on this subject.
As you can see, shuffling using a random comparator is severely biased in Chrome and even more so in Firefox. In particular, both have a lot of green along the diagonal, meaning that too many elements get "shuffled" somewhere very close to where they were in the original sequence. In comparison, a similar chart for an unbiased shuffle (e.g. using the Fisher-Yates algorithm) would be all pale yellow with just a small amount of random noise.
Here's an approach that uses a single array:
The basic logic is:
Starting with an array of n elements
Remove a random element from the array and push it onto the array
Remove a random element from the first n - 1 elements of the array and push it onto the array
Remove a random element from the first n - 2 elements of the array and push it onto the array
...
Remove the first element of the array and push it onto the array
Code:
for(i=a.length;i--;) a.push(a.splice(Math.floor(Math.random() * (i + 1)),1)[0]);
Can you use the Array.sort() function to shuffle an array – Yes.
Are the results random enough – No.
Consider the following code snippet:
/*
* The following code sample shuffles an array using Math.random() trick
* After shuffling, the new position of each item is recorded
* The process is repeated 100 times
* The result is printed out, listing each item and the number of times
* it appeared on a given position after shuffling
*/
var array = ["a", "b", "c", "d", "e"];
var stats = {};
array.forEach(function(v) {
  stats[v] = Array(array.length).fill(0);
});
var i, clone;
for (i = 0; i < 100; i++) {
  clone = array.slice();
  clone.sort(function() {
    return Math.random() - 0.5;
  });
  clone.forEach(function(v, i) {
    stats[v][i]++;
  });
}
Object.keys(stats).forEach(function(v, i) {
  console.log(v + ": [" + stats[v].join(", ") + "]");
});
Sample output:
a: [29, 38, 20, 6, 7]
b: [29, 33, 22, 11, 5]
c: [17, 14, 32, 17, 20]
d: [16, 9, 17, 35, 23]
e: [ 9, 6, 9, 31, 45]
Ideally, the counts should be evenly distributed (for the above example, all counts should be around 20). But they are not. Apparently, the distribution depends on what sorting algorithm is implemented by the browser and how it iterates the array items for sorting.
There is nothing wrong with it.
The function you pass to .sort() usually looks something like
function sortingFunc( first, second )
{
  // example:
  return first - second;
}
Your job in sortingFunc is to return:
a negative number if first goes before second
a positive number if first should go after second
and 0 if they are completely equal
The above sorting function puts things in order.
If you return -'s and +'s randomly as what you have, you get a random ordering.
Like in MySQL:
SELECT * from table ORDER BY rand()
