Considering performance, what's the best way to get a random subset from an array?
Say we have an array with 90000 items and I want to get 10000 random items from it.
One approach I'm considering is to pick a random index between 0 and array.length, remove the selected item from the original array using Array.prototype.splice, and then pick the next random item from what remains.
But splice re-indexes all the items after the one just removed, shifting each of them one step forward. Doesn't that hurt performance?
Items in the array may be duplicates, but the indices we select should not be. Say we've selected index 0; then we should only look at the remaining indices 1~89999.
If you want a subset of the shuffled array, you don't need to shuffle the whole array. You can stop the classic Fisher-Yates shuffle once you have drawn your 10000 items, leaving the other 80000 indices untouched.
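For example, a minimal sketch of that early-exit shuffle (sampleBySwapping and bigArray are illustrative names, not from the answer; the input is copied so the original array stays intact):

// Partial Fisher-Yates: only the first k positions are ever finalized.
function sampleBySwapping(arr, k) {
    var copy = arr.slice();
    for (var i = 0; i < k; i++) {
        // pick a random index from the not-yet-drawn part [i, copy.length)
        var j = i + Math.floor(Math.random() * (copy.length - i));
        var tmp = copy[i];
        copy[i] = copy[j];
        copy[j] = tmp;
    }
    return copy.slice(0, k); // the first k entries are a uniform random subset
}

var subset = sampleBySwapping(bigArray, 10000); // e.g. bigArray holds the 90000 items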
I would first randomize the whole array and then splice off 10000 items.
How to randomize (shuffle) a JavaScript array?
Explains a good way to randomize an array in JavaScript.
A reservoir sampling algorithm can do this.
Here's an attempt at implementing Knuth's "Algorithm S" from TAOCP Volume 2 Section 3.4.2:
function sample(source, size) {
    var chosen = 0,
        srcLen = source.length,
        result = new Array(size);
    for (var seen = 0; chosen < size; seen++) {
        var remainingInput = srcLen - seen,
            remainingOutput = size - chosen;
        // keep the current item with probability (items still needed) / (items still available)
        if (remainingInput * Math.random() < remainingOutput) {
            result[chosen++] = source[seen];
        }
    }
    return result;
}
Basically it makes one pass over the input array, choosing or skipping each item based on a random number, the number of items remaining in the input, and the number of items still required for the output.
There are three potential problems with this code:
1. I may have mucked it up.
2. Knuth calls for a random number "between zero and one", and I'm not sure whether that means the half-open [0, 1) interval JavaScript provides, or the fully closed or fully open interval.
3. It's vulnerable to PRNG bias.
The performance characteristics should be very good. It's O(srcLen). Most of the time we finish before going through the entire input. The input is accessed in order, which is a good thing if you are running your code on a computer that has a cache. We don't even waste any time reading or writing elements that don't ultimately end up in the output.
This version doesn't modify the input array. It is possible to write an in-place version, which might save some memory, but it probably wouldn't be much faster.
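A quick usage sketch (the numeric source array is just made-up example data):

var source = [];
for (var i = 0; i < 90000; i++) {
    source.push(i);
}
var picked = sample(source, 10000);
console.log(picked.length); // 10000 items, kept in the same relative order as the input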
I am working on an implementation of the 15-piece sliding puzzle, and I am stuck at the point where I must make sure I only shuffle into "solvable permutations" - in my case, with the empty tile in the bottom-right corner, that means even permutations.
I have read many similar threads such as How can I ensure that when I shuffle my puzzle I still end up with an even permutation? and understand that I need to "count the parity of the number of inversions in the permutation".
I am writing in JavaScript, and using the Fisher-Yates algorithm to randomize my numbers:
var allNrs = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14];
for (var i = allNrs.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var temp1 = allNrs[i];
    var temp2 = allNrs[j];
    allNrs[i] = temp2;
    allNrs[j] = temp1;
}
How do I actually calculate this permutation parity value that I have read about in so many posts?
Just count the number of swaps you're making. If the number of swaps is even then the permutation has an even parity.
For example, these are the even permutations for 3 numbers. Note that you need 0 or 2 swaps to get to them from [1,2,3]:
1,2,3
2,3,1
3,1,2
Each swap of two numbers you do flips the parity. If you have an even number of them, you are good. If you have an odd number, you are not.
This is essentially what parity means and it is a (simple) theorem of group theory that any two ways to get to the same shuffle have the same parity.
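A minimal sketch of that idea layered on the Fisher-Yates loop from the question (the final fix-up swap of the first two tiles is one simple way to force even parity; it's my addition, not part of the answer above):

function shuffleIntoEvenPermutation(allNrs) {
    var swaps = 0;
    for (var i = allNrs.length - 1; i > 0; i--) {
        var j = Math.floor(Math.random() * (i + 1));
        if (j !== i) { // swapping an element with itself doesn't change the permutation
            var tmp = allNrs[i];
            allNrs[i] = allNrs[j];
            allNrs[j] = tmp;
            swaps++; // every real swap flips the parity
        }
    }
    if (swaps % 2 === 1) { // odd parity: one extra swap makes it even
        var t = allNrs[0];
        allNrs[0] = allNrs[1];
        allNrs[1] = t;
    }
    return allNrs;
}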
Here's my basic problem: I'm given a currentTime, for example 750 seconds. I also have an array containing 1000 to 2000 objects, each of which has a startTime, endTime, and _id attribute. Given the currentTime, I need to find the object whose startTime and endTime bracket that value - for example, startTime: 740, endTime: 755.
What is the most efficient way to do this in JavaScript?
I've simply been doing something like this, for starters:
var arrayLength = array.length;
var x = 0;
while (x < arrayLength) {
    if (currentTime >= array[x].startTime && currentTime <= array[x].endTime) {
        // then I've found my object
    }
    x++;
}
But I suspect that looping isn't the best option here. Any suggestions?
EDIT: For clarity, the currentTime has to fall within the startTime and endTime
My solution: The structure of my data affords me certain benefits that allows me to simplify things a bit. I've done a basic binary search, as suggested, since the array is already sorted by startTime. I haven't fully tested the speed of this thing, but I suspect it's a fair bit faster, especially with larger arrays.
var binarySearch = function(array, currentTime) {
    var low = 0;
    var high = array.length - 1;
    var i;
    while (low <= high) {
        i = Math.floor((low + high) / 2);
        if (array[i].startTime <= currentTime) {
            if (array[i].endTime >= currentTime) {
                // this is the one
                return array[i]._id;
            } else {
                low = i + 1;
            }
        } else {
            high = i - 1;
        }
    }
    return null;
};
The best way to tackle this problem depends on the number of times you will have to call your search function.
If you call your function just a few times, let's say m times, go for linear search. The overall complexity for the calls of this function will be O(mn).
If you call your function many times, and by many I mean more than log(n) times, you should:
Sort your array in O(n log n) by startTime, then by endTime if several items have equal startTime values.
Do a binary search to find the range of elements with startTime <= x. This means doing two binary searches: one for the start of the range and one for the end of the range. This takes O(log n).
Do a linear search inside [start, end]. You have to do a linear search because the order of startTimes tells you nothing about the endTimes. This can be anywhere between O(1) and O(n), depending on the distribution of your segments and the value of x (a rough sketch of steps 2 and 3 follows below).
Average case: O(n log n) for initialization and O(log n) for each search.
Worst case: an array containing many equal segments, or segments that share a common interval, with x falling inside that interval. In that case you will do O(n log n) for initialization and O(n + log n) = O(n) per search.
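As promised, a rough sketch of steps 2 and 3, assuming the array is already sorted by startTime (findCovering is an illustrative name; startTime/endTime match the question):

function findCovering(items, x) {
    // step 2: binary search for the last index whose startTime <= x
    var low = 0, high = items.length - 1, last = -1;
    while (low <= high) {
        var mid = Math.floor((low + high) / 2);
        if (items[mid].startTime <= x) {
            last = mid;
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    // step 3: linear scan over the candidates; their endTimes carry no ordering guarantee
    for (var i = last; i >= 0; i--) {
        if (items[i].endTime >= x) {
            return items[i];
        }
    }
    return null;
}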
Sounds like a problem for binary search.
Assuming that your search array is long-lived and relatively constant, the first iteration would be to sort all the array elements by start time (or create an index of sorted start times pointing to the array elements if you don't want them sorted).
Then you can efficiently (with a binary chop) discount ones that start too late. A sequential search of the others would then be faster.
For even more speed, maintain separate sorted indexes for start and end times. Then do the same operation as mentioned previously to throw away those that start too late.
Then, for the remaining ones, use the end time index to throw away those that end too early, and what you have left is your candidate list.
But, make sure this is actually needed. Two thousand elements doesn't seem like a huge amount so you should time the current approach and only attempt optimisation if it is indeed a problem.
From the information given it is not possible to tell what would be the best solution. If the array is not sorted, looping is the best way for single queries. A single scan along the array only takes O(N) (where N is the length of the array), whereas sorting it and then doing a binary search would take O(N log(N) + log(N)), thus it would in this case take more time.
The analysis looks quite different if you run a great number of queries against the same large array. If you have about N queries on the same array, sorting it first can actually improve performance, because each query then takes only O(log(N)). For N queries that adds up to O(N log(N)) including the initial sort, whereas N unsorted scans take O(N^2), which is clearly larger. Exactly when sorting starts to pay off also depends on the size of the array.
The situation is also different again, when you update the array fairly often. Updating an unsorted array can be done in O(1) amortized, whereas updating a sorted array takes O(N). So if you have fairly frequent updates sorting might hurt.
There are also some very efficient data structures for range queries, however again it depends on the actual usage if they make sense or not.
If the array is not sorted, yours is the correct way.
Do not fall into the trap of thinking to sort the array first, and then apply your search.
With the code you tried, you have a complexity of O(n), where n is the number of elements.
If you sort the array first, you first incur a cost of O(n log(n)) in the average case (see sorting algorithms).
Then you have to apply the binary search, which takes on average about log2(n) - 1 comparisons.
So, you will end up spending, in the average case:
O(n log(n) + log(n))
instead of just O(n).
An interval tree is a data structure that allows answering such queries in O(lg n) time (both average and worst case), if there are n intervals in total. Preprocessing time to construct the data structure is O(n lg n); space is O(n). Insertion and deletion times are O(lg n) for augmented interval trees. Time to answer all-interval queries is O(m + lg n) if m intervals cover a point. Wikipedia describes several kinds of interval trees; for example, a centered interval tree is a ternary tree with each node storing:
• A center point
• A pointer to another node containing all intervals completely to the left of the center point
• A pointer to another node containing all intervals completely to the right of the center point
• All intervals overlapping the center point sorted by their beginning point
• All intervals overlapping the center point sorted by their ending point
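A minimal sketch of a point query against such a node, assuming each node carries center, left, right, byStart (overlapping intervals sorted by beginning point) and byEnd (sorted by ending point) fields, and each interval has start and end properties; construction and balancing are omitted:

function queryPoint(node, x, out) {
    if (!node) return out;
    if (x < node.center) {
        // intervals stored here all reach past the center,
        // so they contain x exactly when they begin at or before x
        for (var i = 0; i < node.byStart.length && node.byStart[i].start <= x; i++) {
            out.push(node.byStart[i]);
        }
        return queryPoint(node.left, x, out);
    } else {
        // intervals stored here all begin at or before the center,
        // so they contain x exactly when they end at or after x
        for (var j = node.byEnd.length - 1; j >= 0 && node.byEnd[j].end >= x; j--) {
            out.push(node.byEnd[j]);
        }
        return queryPoint(node.right, x, out);
    }
}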
Note that an interval tree has O(lg n) complexity for both average and worst-case queries that find one interval covering a point. The previous answers have O(n) worst-case query performance for the same task. Several previous answers claimed O(lg n) average time, but none of them offer evidence; they merely assert it. The main feature of those answers is using a binary search on begin times. Then some say to use a linear search, and others a binary search, on end times, without making clear what set of intervals the latter search is over. Claiming O(lg n) average performance for that is merely wishful thinking. As pointed out in the Wikipedia article under the heading Naive Approach,
A naive approach might be to build two parallel trees, one ordered by the beginning point, and one ordered by the ending point of each interval. This allows discarding half of each tree in O(log n) time, but the results must be merged, requiring O(n) time. This gives us queries in O(n + log n) = O(n), which is no better than brute-force.
I was helping somebody out with his JavaScript code and my eyes were caught by a section that looked like this:
function randOrd(){
    return (Math.round(Math.random()) - 0.5);
}
coords.sort(randOrd);
alert(coords);
My first thought was: hey, this can't possibly work! But then I did some experimenting and found that it does at least seem to provide nicely randomized results.
Then I did some web searching and almost at the top found an article from which this code was most certainly copied. It looked like a pretty respectable site and author...
But my gut feeling tells me that this must be wrong, especially as the sorting algorithm is not specified by the ECMA standard. I think different sorting algorithms will result in different non-uniform shuffles. Some may even loop infinitely...
But what do you think?
And as another question... how would I now go and measure how random the results of this shuffling technique are?
update: I did some measurements and posted the results below as one of the answers.
After Jon has already covered the theory, here's an implementation:
function shuffle(array) {
    var tmp, current, top = array.length;
    if (top) while (--top) {
        current = Math.floor(Math.random() * (top + 1));
        tmp = array[current];
        array[current] = array[top];
        array[top] = tmp;
    }
    return array;
}
The algorithm is O(n), whereas sorting should be O(n log n). Depending on the overhead of executing JS code compared to the native sort() function, this might lead to a noticeable difference in performance, which should increase with array size.
In the comments to bobobobo's answer, I stated that the algorithm in question might not produce evenly distributed probabilities (depending on the implementation of sort()).
My argument goes along these lines: a sorting algorithm requires a certain number c of comparisons, e.g. c = n(n-1)/2 for Bubblesort. Our random comparison function makes the outcome of each comparison equally likely, i.e. there are 2^c equally probable results. Now, each result has to correspond to one of the n! permutations of the array's entries, which makes an even distribution impossible in the general case. (This is a simplification, as the actual number of comparisons needed depends on the input array, but the assertion should still hold.)
As Jon pointed out, this alone is no reason to prefer Fisher-Yates over using sort(), as the random number generator will also map a finite number of pseudo-random values to the n! permutations. But the results of Fisher-Yates should still be better:
Math.random() produces a pseudo-random number in the range [0, 1). As JS uses double-precision floating-point values, this corresponds to 2^x possible values, where 52 ≤ x ≤ 63 (I'm too lazy to find the actual number). A probability distribution generated using Math.random() will stop behaving well if the number of atomic events is of the same order of magnitude.
When using Fisher-Yates, the relevant parameter is the size of the array, which should never approach 2^52 due to practical limitations.
When sorting with a random comparison function, the function basically only cares whether the return value is positive or negative, so this will never be a problem. But there is a similar one: because the comparison function is well-behaved, the 2^c possible results are, as stated, equally probable. If c ~ n log n, then 2^c ~ n^(a·n) where a = const, which makes it at least possible that 2^c is of the same magnitude as (or even less than) n!, and thus leads to an uneven distribution, even if the sorting algorithm were to map onto the permutations evenly. Whether this has any practical impact is beyond me.
The real problem is that the sorting algorithms are not guaranteed to map onto the permutations evenly. It's easy to see that Mergesort does, as it's symmetric, but reasoning about something like Bubblesort or, more importantly, Quicksort or Heapsort, is not as straightforward.
The bottom line: as long as sort() uses Mergesort, you should be reasonably safe except in corner cases (at least I'm hoping that 2^c ≤ n! is a corner case); if not, all bets are off.
It's never been my favourite way of shuffling, partly because it is implementation-specific as you say. In particular, I seem to remember that the standard library sorting from either Java or .NET (not sure which) can often detect if you end up with an inconsistent comparison between some elements (e.g. you first claim A < B and B < C, but then C < A).
It also ends up as a more complex (in terms of execution time) shuffle than you really need.
I prefer the shuffle algorithm which effectively partitions the collection into "shuffled" (at the start of the collection, initially empty) and "unshuffled" (the rest of the collection). At each step of the algorithm, pick a random unshuffled element (which could be the first one) and swap it with the first unshuffled element - then treat it as shuffled (i.e. mentally move the partition to include it).
This is O(n) and only requires n-1 calls to the random number generator, which is nice. It also produces a genuine shuffle - any element has a 1/n chance of ending up in each space, regardless of its original position (assuming a reasonable RNG). The sorted version approximates to an even distribution (assuming that the random number generator doesn't pick the same value twice, which is highly unlikely if it's returning random doubles) but I find it easier to reason about the shuffle version :)
This approach is called a Fisher-Yates shuffle.
I would regard it as a best practice to code up this shuffle once and reuse it everywhere you need to shuffle items. Then you don't need to worry about sort implementations in terms of reliability or complexity. It's only a few lines of code (which I won't attempt in JavaScript!)
The Wikipedia article on shuffling (and in particular the shuffle algorithms section) talks about sorting a random projection - it's worth reading the section on poor implementations of shuffling in general, so you know what to avoid.
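For reference, a minimal sketch of the forward-sweeping variant described above (my illustration, not the answerer's code; it is the same algorithm as the implementation posted elsewhere in this thread, just growing the shuffled partition from the front):

function shuffleForward(items) {
    for (var i = 0; i < items.length - 1; i++) {
        // pick a random element from the unshuffled part [i, items.length)
        var j = i + Math.floor(Math.random() * (items.length - i));
        var tmp = items[i];
        items[i] = items[j];
        items[j] = tmp;
    }
    return items; // n-1 calls to Math.random() in total
}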
I did some measurements of how random the results of this random sort are...
My technique was to take a small array [1,2,3,4] and create all (4! = 24) permutations of it. Then I applied the shuffling function to the array a large number of times and counted how many times each permutation was generated. A good shuffling algorithm distributes the results quite evenly over all the permutations, while a bad one does not.
Using the code below I tested in Firefox, Opera, Chrome, and IE6/7/8.
Surprisingly for me, the random sort and the real shuffle both created equally uniform distributions. So it seems that (as many have suggested) the main browsers are using merge sort. This of course doesn't mean that there can't be a browser out there that does things differently, but I would say it means that this random-sort method is reliable enough to use in practice.
EDIT: This test didn't really measure the randomness (or lack thereof) correctly. See the other answer I posted.
But on the performance side the shuffle function given by Christoph was a clear winner. Even for small four-element arrays the real shuffle performed about twice as fast as the random sort!
// The shuffle function posted by Christoph.
var shuffle = function(array) {
var tmp, current, top = array.length;
if(top) while(--top) {
current = Math.floor(Math.random() * (top + 1));
tmp = array[current];
array[current] = array[top];
array[top] = tmp;
}
return array;
};
// the random sort function
var rnd = function() {
return Math.round(Math.random())-0.5;
};
var randSort = function(A) {
return A.sort(rnd);
};
var permutations = function(A) {
if (A.length == 1) {
return [A];
}
else {
var perms = [];
for (var i=0; i<A.length; i++) {
var x = A.slice(i, i+1);
var xs = A.slice(0, i).concat(A.slice(i+1));
var subperms = permutations(xs);
for (var j=0; j<subperms.length; j++) {
perms.push(x.concat(subperms[j]));
}
}
return perms;
}
};
var test = function(A, iterations, func) {
// init permutations
var stats = {};
var perms = permutations(A);
for (var i in perms){
stats[""+perms[i]] = 0;
}
// shuffle many times and gather stats
var start=new Date();
for (var i=0; i<iterations; i++) {
var shuffled = func(A);
stats[""+shuffled]++;
}
var end=new Date();
// format result
var arr=[];
for (var i in stats) {
arr.push(i+" "+stats[i]);
}
return arr.join("\n")+"\n\nTime taken: " + ((end - start)/1000) + " seconds.";
};
alert("random sort: " + test([1,2,3,4], 100000, randSort));
alert("shuffle: " + test([1,2,3,4], 100000, shuffle));
Interestingly, Microsoft used the same technique in their pick-random-browser-page.
They used a slightly different comparison function:
function RandomSort(a,b) {
return (0.5 - Math.random());
}
Looks almost the same to me, but it turned out to be not so random...
So I made some test runs again with the same methodology used in the linked article, and indeed it turned out that the random-sorting method produced flawed results. New test code here:
function shuffle(arr) {
arr.sort(function(a,b) {
return (0.5 - Math.random());
});
}
function shuffle2(arr) {
arr.sort(function(a,b) {
return (Math.round(Math.random())-0.5);
});
}
function shuffle3(array) {
var tmp, current, top = array.length;
if(top) while(--top) {
current = Math.floor(Math.random() * (top + 1));
tmp = array[current];
array[current] = array[top];
array[top] = tmp;
}
return array;
}
var counts = [
[0,0,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0]
];
var arr;
for (var i=0; i<100000; i++) {
arr = [0,1,2,3,4];
shuffle3(arr);
arr.forEach(function(x, i){ counts[x][i]++;});
}
alert(counts.map(function(a){return a.join(", ");}).join("\n"));
I have placed a simple test page on my website showing the bias of your current browser versus other popular browsers using different methods to shuffle. It shows the terrible bias of just using Math.random()-0.5, another 'random' shuffle that isn't biased, and the Fisher-Yates method mentioned above.
You can see that on some browsers there is as high as a 50% chance that certain elements will not change place at all during the 'shuffle'!
Note: you can make the implementation of the Fisher-Yates shuffle by #Christoph slightly faster for Safari by changing the code to:
function shuffle(array) {
for (var tmp, cur, top=array.length; top--;){
cur = (Math.random() * (top + 1)) << 0;
tmp = array[cur]; array[cur] = array[top]; array[top] = tmp;
}
return array;
}
Test results: http://jsperf.com/optimized-fisher-yates
I think it's fine for cases where you're not picky about distribution and you want the source code to be small.
In JavaScript (where the source is transmitted constantly), small makes a difference in bandwidth costs.
It's been four years, but I'd like to point out that the random comparator method won't be correctly distributed, no matter what sorting algorithm you use.
Proof:
For an array of n elements, there are exactly n! permutations (i.e. possible shuffles).
Every comparison during a shuffle is a choice between two sets of permutations. For a random comparator, there is a 1/2 chance of choosing each set.
Thus, for each permutation p, the chance of ending up with permutation p is a fraction with denominator 2^k (for some k), because it is a sum of such fractions (e.g. 1/8 + 1/16 = 3/16).
For n = 3, there are six equally-likely permutations. The chance of each permutation, then, is 1/6. 1/6 can't be expressed as a fraction with a power of 2 as its denominator.
Therefore, the coin flip sort will never result in a fair distribution of shuffles.
The only sizes that could possibly be correctly distributed are n=0,1,2.
As an exercise, try drawing out the decision tree of different sort algorithms for n=3.
There is a gap in the proof: If a sort algorithm depends on the consistency of the comparator, and has unbounded runtime with an inconsistent comparator, it can have an infinite sum of probabilities, which is allowed to add up to 1/6 even if every denominator in the sum is a power of 2. Try to find one.
Also, if a comparator has a fixed chance of giving either answer (e.g. (Math.random() < P)*2 - 1, for constant P), the above proof holds. If the comparator instead changes its odds based on previous answers, it may be possible to generate fair results. Finding such a comparator for a given sorting algorithm could be a research paper.
It is a hack, certainly. In practice, an infinitely looping algorithm is not likely.
If you're sorting objects, you could loop through the coords array and do something like:
for (var i = 0; i < coords.length; i++)
    coords[i].sortValue = Math.random();

coords.sort(useSortValue);

function useSortValue(a, b)
{
    return a.sortValue - b.sortValue;
}
(and then loop through them again to remove the sortValue)
Still a hack though. If you want to do it nicely, you have to do it the hard way :)
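Putting the three passes together, a minimal sketch might look like this (sortValue is just a temporary property name, as above):

function shuffleByRandomKey(items) {
    // decorate: give every object a one-off random key
    for (var i = 0; i < items.length; i++) {
        items[i].sortValue = Math.random();
    }
    // sort by that key; the comparator is consistent, so the sort is well defined
    items.sort(function(a, b) {
        return a.sortValue - b.sortValue;
    });
    // undecorate: remove the temporary key again
    for (var k = 0; k < items.length; k++) {
        delete items[k].sortValue;
    }
    return items;
}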
If you're using D3 there is a built-in shuffle function (using Fisher-Yates):
var days = ['Lundi','Mardi','Mercredi','Jeudi','Vendredi','Samedi','Dimanche'];
d3.shuffle(days);
And here is Mike going into details about it:
http://bost.ocks.org/mike/shuffle/
No, it is not correct. As other answers have noted, it will lead to a non-uniform shuffle and the quality of the shuffle will also depend on which sorting algorithm the browser uses.
Now, that might not sound too bad to you, because even if theoretically the distribution is not uniform, in practice it's probably nearly uniform, right? Well, no, not even close. The following charts show heat-maps of which indices each element gets shuffled to, in Chrome and Firefox respectively: if the pixel (i, j) is green, it means the element at index i gets shuffled to index j too often, and if it's red then it gets shuffled there too rarely.
These screenshots are taken from Mike Bostock's page on this subject.
As you can see, shuffling using a random comparator is severely biased in Chrome and even more so in Firefox. In particular, both have a lot of green along the diagonal, meaning that too many elements get "shuffled" somewhere very close to where they were in the original sequence. In comparison, a similar chart for an unbiased shuffle (e.g. using the Fisher-Yates algorithm) would be all pale yellow with just a small amount of random noise.
Here's an approach that uses a single array:
The basic logic is:
Starting with an array of n elements
Remove a random element from the array and push it onto the array
Remove a random element from the first n - 1 elements of the array and push it onto the array
Remove a random element from the first n - 2 elements of the array and push it onto the array
...
Remove the first element of the array and push it onto the array
Code:
for(i=a.length;i--;) a.push(a.splice(Math.floor(Math.random() * (i + 1)),1)[0]);
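Expanded for readability, the same one-liner does this (a is the array being shuffled):

for (var i = a.length; i--;) {
    // remove one random element from the not-yet-shuffled first i+1 entries...
    var picked = a.splice(Math.floor(Math.random() * (i + 1)), 1)[0];
    // ...and push it onto the end of the array
    a.push(picked);
}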
Can you use the Array.sort() function to shuffle an array – Yes.
Are the results random enough – No.
Consider the following code snippet:
/*
* The following code sample shuffles an array using Math.random() trick
* After shuffling, the new position of each item is recorded
* The process is repeated 100 times
* The result is printed out, listing each item and the number of times
* it appeared on a given position after shuffling
*/
var array = ["a", "b", "c", "d", "e"];
var stats = {};
array.forEach(function(v) {
stats[v] = Array(array.length).fill(0);
});
var i, clone;
for (i = 0; i < 100; i++) {
clone = array.slice();
clone.sort(function() {
return Math.random() - 0.5;
});
clone.forEach(function(v, i) {
stats[v][i]++;
});
}
Object.keys(stats).forEach(function(v, i) {
console.log(v + ": [" + stats[v].join(", ") + "]");
});
Sample output:
a: [29, 38, 20, 6, 7]
b: [29, 33, 22, 11, 5]
c: [17, 14, 32, 17, 20]
d: [16, 9, 17, 35, 23]
e: [ 9, 6, 9, 31, 45]
Ideally, the counts should be evenly distributed (for the above example, all counts should be around 20). But they are not. Apparently, the distribution depends on what sorting algorithm is implemented by the browser and how it iterates the array items for sorting.
There is nothing wrong with it.
The function you pass to .sort() usually looks something like
function sortingFunc(first, second)
{
    // example:
    return first - second;
}
Your job in sortingFunc is to return:
a negative number if first goes before second
a positive number if first should go after second
and 0 if they are completely equal
The above sorting function puts things in order.
If you return negative and positive values at random, as the code in question does, you get a random ordering.
Like in MySQL:
SELECT * from table ORDER BY rand()