Finding the difference between two arrays in PHP, Node and Golang - javascript

Here is a typical example of what I need to do
$testArr = array(2.05080E6, 29400, 420);
$stockArrays = array(
    array(2.05080E6, 29400, 0),
    array(2.05080E6, 9800, 420),
    array(1.715E6, 24500, 280),
    array(2.05080E6, 29400, 140),
    array(2.05080E6, 4900, 7));
I need to identify the stockArray that is the least different. A few clarifications
The numeric values of array elements at each position are guaranteed not to overlap (i.e. arr[0] will always have the biggest values, arr[1] will be at least an order of magnitude smaller, etc.).
The absolute values of the differences do not count when determining "least different". Only the number of differing array indices matters.
Positional differences do have a weighting. Thus in my example stockArrays[1] is "more different" though it too - like its stockArrays[0] & stockArrays[3] counterparts - differs in only one index position, because the position at which it differs is more significant.
The number of stockArrays elements will typically be less than 10 but could potentially be much more (though never into 3 figures)
The stock arrays will always have the same number of elements. The test array will have the same or fewer elements. However, when it has fewer, testArr would be padded out so that potentially matching elements are always in the same position as in the stock arrays, e.g.
$testArray = array(29400,140);
would be transformed to
$testArray = array(0,29400,140);
prior to being subjected to difference testing.
Finally, a tie is possible. For instance, in my example above the matches would be stockArrays[0] and stockArrays[3].
In my example the result would be
$result = array(0=>array(0,0,1),3=>array(0,0,1));
indicating that the least different stock arrays are at indices 0 & 3 with the differences being at position 2.
In PHP I would handle all of this with array_diff as my starting point. For Node/JavaScript I would probably be tempted to use the php.js array_diff port, though I would be inclined to explore a bit given that in the worst-case scenario it is an O(n²) affair.
I am a newbie when it comes to Golang so I am not sure how I would implement this problem there. I have noted that Node does have an array_diff npm module.
One off-beat idea I have had is converting the array to a padded string (smaller array elements are 0 padded) and effectively do an XOR on the ordinal value of each character but have dismissed that as probably a rather nutty thing to do.
I am concerned with speed but not at all costs. In an ideal world the same solution (algorithm) would be used in each target language though in reality the differences between them might mean that is not possible/not a good idea.
Perhaps someone here might be able to point me to less pedestrian ways of accomplishing this - i.e. not just array_diff ports.
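To make the spec above concrete, here is a minimal JavaScript sketch of the comparison being described (the function name, the tie handling and the positional-weighting scheme are my own choices, not an array_diff port):

function leastDifferent(testArr, stockArrays) {
  var best = null;   // best [diffCount, positionWeight] seen so far
  var result = {};   // index => difference mask, e.g. {0: [0,0,1], 3: [0,0,1]}

  stockArrays.forEach(function (stock, idx) {
    var mask = [];
    var count = 0;
    var weight = 0;
    for (var j = 0; j < stock.length; j++) {
      if (stock[j] !== testArr[j]) {
        mask.push(1);
        count++;
        weight += stock.length - j; // earlier positions weigh more
      } else {
        mask.push(0);
      }
    }
    if (best === null || count < best[0] || (count === best[0] && weight < best[1])) {
      best = [count, weight];
      result = {};
      result[idx] = mask;
    } else if (count === best[0] && weight === best[1]) {
      result[idx] = mask; // tie: keep both
    }
  });
  return result;
}

// leastDifferent([2.05080E6, 29400, 420], stockArrays)
// => { 0: [0, 0, 1], 3: [0, 0, 1] }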

Here's the equivalent of the array_diff solution: (assuming I didn't make a mistake)
package main

import "fmt"

func FindLeastDifferent(needle []float64, haystack [][]float64) int {
    if len(haystack) == 0 {
        return -1
    }
    var currentIndex, currentDiff int
    for i, arr := range haystack {
        diff := 0
        for j := range needle {
            if arr[j] != needle[j] {
                diff++
            }
        }
        if i == 0 || diff < currentDiff {
            currentDiff = diff
            currentIndex = i
        }
    }
    return currentIndex
}

func main() {
    idx := FindLeastDifferent(
        []float64{2.05080E6, 29400, 420},
        [][]float64{
            {2.05080E6, 29400, 0},
            {2.05080E6, 9800, 420},
            {1.715E6, 24500, 280},
            {2.05080E6, 29400, 140},
            {2.05080E6, 4900, 7},
            {2.05080E6, 29400, 420},
        },
    )
    fmt.Println(idx)
}
Like you said, it's O(n * m), where n is the number of elements in the needle array, and m is the number of arrays in the haystack.
If you don't know the haystack ahead of time, then there's probably not much you can do to improve this. But if, instead, you're storing this list in a database, I think your intuition about string search has some potential. PostgreSQL for example supports string similarity indexes. (And here's an explanation of a similar idea for regular expressions: http://swtch.com/~rsc/regexp/regexp4.html)
One other idea: if your arrays are really big you can calculate fuzzy hashes (http://ssdeep.sourceforge.net/) which would make your n smaller.

Why is using a loop to iterate from start of array to end faster than iterating both start to end and end to start?

Given an array having .length 100, containing elements having values 0 to 99 at the respective indexes, where the requirement is to find the element of the array equal to n : 51.
Why is using a loop to iterate from start of array to end faster than iterating both start to end and end to start?
const arr = Array.from({length: 100}, (_, i) => i);
const n = 51;
const len = arr.length;

console.time("iterate from start");
for (let i = 0; i < len; i++) {
  if (arr[i] === n) break;
}
console.timeEnd("iterate from start");

const arr = Array.from({length: 100}, (_, i) => i);
const n = 51;
const len = arr.length;

console.time("iterate from start and end");
for (let i = 0, k = len - 1; i < len && k >= 0; i++, k--) {
  if (arr[i] === n || arr[k] === n) break;
}
console.timeEnd("iterate from start and end");
jsperf https://jsperf.com/iterate-from-start-iterate-from-start-and-end/1
The answer is pretty obvious:
More operations take more time.
When judging the speed of code, you look at how many operations it will perform. Just step through and count them. Every instruction will take one or more CPU cycles, and the more there are, the longer it will take to run. That different instructions take a different number of cycles mostly does not matter - while an array lookup might be more costly than integer arithmetic, both of them basically take constant time, and if there are many of them, their count dominates the cost of our algorithm.
In your example, there are few different types of operations that you might want to count individually:
comparisons
increments/decrements
array lookup
conditional jumps
(we could be more granular, such as counting variable fetch and store operations, but those hardly matter - everything is in registers anyway - and their number is basically proportional to the number of the other operations).
Now both of your loops iterate about 50 times - the element on which they break the loop is in the middle of the array. Ignoring off-by-a-few errors, these are the counts:
               | forwards   | forwards and backwards
---------------+------------+------------------------
 >=/===/<      |    100     |    200
 ++/--         |     50     |    100
 a[b]          |     50     |    100
 &&/||/if/for  |    100     |    200
Given that, it's not unexpected that doing twice the work takes considerably longer.
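A quick way to convince yourself of these counts is to instrument the two loops with explicit counters. This is only an illustrative sketch (the bookkeeping is approximate, not what the engine literally executes):

const arr = Array.from({ length: 100 }, (_, i) => i);
const n = 51;
const len = arr.length;

let ops = { comparisons: 0, increments: 0, lookups: 0 };
for (let i = 0; i < len; i++) {
  ops.comparisons++;                        // i < len
  ops.lookups++; ops.comparisons++;         // arr[i] === n
  if (arr[i] === n) break;
  ops.increments++;                         // i++
}
console.log("forwards:", ops);              // ~100 comparisons, ~50 lookups/increments

ops = { comparisons: 0, increments: 0, lookups: 0 };
for (let i = 0, k = len - 1; i < len && k >= 0; i++, k--) {
  ops.comparisons += 2;                     // i < len, k >= 0
  ops.lookups += 2; ops.comparisons += 2;   // arr[i] === n, arr[k] === n
  if (arr[i] === n || arr[k] === n) break;
  ops.increments += 2;                      // i++, k--
}
console.log("both ends:", ops);             // roughly double the counts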
I'll also answer a few questions from your comments:
Is additional time needed for the second object lookup?
Yes, every individual lookup counts. It's not like they could be performed at once, or optimised into a single lookup (imaginable if they had looked up the same index).
Should there be two separate loops for each start to end and end to start?
Doesn't matter for the number of operations, just for their order.
Or, put differently still, what is the fastest approach to find an element in an array?
There is no "fastest" regarding the order: if you don't know where the element is (and the positions are evenly distributed) you have to try every index, and any order - even a random one - would work the same. Notice however that your code is strictly worse, as it looks at each index twice when the element is not found - it does not stop in the middle.
But still, there are a few different approaches at micro-optimising such a loop - check these benchmarks.
let is (still?) slower than var, see Why is using `let` inside a `for` loop so slow on Chrome? and Why is let slower than var in a for loop in nodejs?. This tear-up and tear-down (about 50 times) of the loop body scope in fact does dominate your runtime - that's why your inefficient code isn't completely twice as slow.
comparing against 0 is marginally faster than comparing against the length, which puts looping backwards at an advantage. See Why is iterating through an array backwards faster than forwards, JavaScript loop performance - Why is to decrement the iterator toward 0 faster than incrementing and Are loops really faster in reverse?
in general, see What's the fastest way to loop through an array in JavaScript?: it changes from engine update to engine update. Don't do anything weird, write idiomatic code, that's what will get optimised better.
@Bergi is correct. More operations means more time. Why? More CPU clock cycles.
Time is really a reference to how many clock cycles it takes to execute the code.
In order to get to the nitty-gritty of that you need to look at the machine level code (like assembly level code) to find the true evidence. Each CPU (core?) clock cycle can execute one instruction, so how many instructions are you executing?
I haven't counted clock cycles in a long time, not since programming Motorola CPUs for embedded applications. If your code is taking longer, then it is in fact generating more machine-code instructions, even if the loop is shorter or runs an equal number of times.
Never forget that your code is actually getting compiled into a set of commands that the CPU is going to execute (memory pointers, instruction-level pointers, interrupts, etc.). That is how computers work, and it's easier to understand at the microcontroller level, like an ARM or Motorola processor, but the same is true for the sophisticated machines we are running on today.
Your code simply does not run the way you write it (sounds crazy, right?). It is run as it is compiled into machine-level instructions (writing a compiler is no fun). Mathematical expressions and logic can be compiled into quite a heap of assembly and machine-level code, and that is up to how the compiler chooses to interpret it (bit shifting, etc. - remember binary mathematics, anyone?)
Reference:
https://software.intel.com/en-us/articles/introduction-to-x64-assembly
Your question is hard to answer, but as @Bergi stated, the more operations, the longer it takes. Why? The more clock cycles it takes to execute your code. Dual core, quad core, threading, assembly (machine language) - it is complex. But no code gets executed as you have written it: C++, C, Pascal, JavaScript, Java - unless you are writing in assembly (and even that compiles down to machine code), though assembly is closer to the actual execution code.
A masters in CS and you will get to counting clock cycles and sort times. You will likely make your own language framed on machine instruction sets.
Most people say who cares? Memory is cheap today and CPUs are screaming fast and getting faster.
But there are some critical applications where 10 ms matters, where an immediate interrupt is needed, etc.
Commerce, NASA, a Nuclear power plant, Defense Contractors, some robotics, you get the idea . . .
I vote let it ride and keep moving.
Cheers,
Wookie
Since the element you're looking for is always roughly in the middle of the array, you should expect the version that walks inward from both the start and end of the array to take about twice as long as one that just starts from the beginning.
Each variable update takes time, each comparison takes time, and you're doing twice as many of them per iteration. Since this version takes only one or two fewer iterations of the loop to terminate, you should reason it will cost about twice as much CPU time.
This strategy is still O(n) time complexity since it only looks at each item once, it's just specifically worse when the item is near the center of the list. If it's near the end, this approach will have a better expected runtime. Try looking for item 90 in both, for example.
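Here is a small sketch of that experiment (my own, just changing n to 90 in the question's benchmark):

const arr = Array.from({ length: 100 }, (_, i) => i);
const n = 90; // near the end of the array
const len = arr.length;

console.time("from start only");
for (let i = 0; i < len; i++) {
  if (arr[i] === n) break;
}
console.timeEnd("from start only");

console.time("from both ends");
for (let i = 0, k = len - 1; i < len && k >= 0; i++, k--) {
  if (arr[i] === n || arr[k] === n) break; // k reaches 90 after ~10 iterations
}
console.timeEnd("from both ends");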
The selected answer is excellent. I'd like to add another aspect: try findIndex(); it's 2-3 times faster than using loops:
const arr = Array.from({length: 900}, (_, i) => i);
const n = 51;
const len = arr.length;

console.time("iterate from start");
for (let i = 0; i < len; i++) {
  if (arr[i] === n) break;
}
console.timeEnd("iterate from start");

console.time("iterate using findIndex");
var i = arr.findIndex(function(v) {
  return v === n;
});
console.timeEnd("iterate using findIndex");
The other answers here cover the main reasons, but I think an interesting addition could be mentioning cache.
In general, sequentially accessing an array will be more efficient, particularly with large arrays. When your CPU reads an array from memory, it also fetches nearby memory locations into cache. This means that when you fetch element n, element n+1 is also probably loaded into cache. Now, cache is relatively big these days, so your 100 int array can probably fit comfortably in cache. However, on an array of much larger size, reading sequentially will be faster than switching between the beginning and the end of the array.
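A rough sketch of how you might observe this on a much larger array (my own; whether a measurable difference appears depends heavily on the engine and hardware, since the two-ended scan is still two sequential streams):

const big = new Float64Array(10000000).map((_, i) => i);
let sum = 0;

console.time("sequential");
for (let i = 0; i < big.length; i++) sum += big[i];
console.timeEnd("sequential");

console.time("alternating ends");
for (let i = 0, k = big.length - 1; i < k; i++, k--) {
  sum += big[i] + big[k];
}
console.timeEnd("alternating ends");
console.log(sum); // keep the result alive so the loops aren't optimised away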

Efficient sparse mapping from integers to integers

I'm implementing a purpose-built regular expression engine using finite automata. I will have to store thousands of states, each state with its own transition table from Unicode code points (or UTF-16 code units; I haven't decided) to state IDs.
In many cases, the table will be extremely sparse, but in other cases it will be nearly full. In most cases, most of the entries will fall into several contiguous ranges with the same value.
The simplest implementation would be a lookup table, but each such table would take up a great deal of space. A list of (range, value) pairs would be much smaller, but slower. A binary search tree would be faster than a list.
Is there a better approach, perhaps leveraging built-in functionality?
Unfortunately, JavaScript's built-in data-types - especially Map - are not of great help in accomplishing this task, as they lack the relevant methods.
In most cases, most of the entries will fall into several contiguous ranges with the same value.
We can however exploit this and use a binary search strategy on sorted arrays, assuming the transition tables won't be modified often.
Encode contiguous input ranges leading to the same state by storing each input range's lowest value in a sorted array. Keep the states at corresponding indices in a separate array:
let inputs = [0, 5, 10]; // Input ranges [0,4], [5,9], [10,∞)
let states = [0, 1, 0 ]; // Inputs [0,4] lead to state 0, [5,9] to 1, [10,∞) to 0
Now, given an input, you need to perform a binary search on the inputs array similar to Java's floorEntry(k):
// Returns the index of the greatest element less than or equal to
// the given element, or -1 if there is no such element:
function floorIndex(sorted, element) {
  let low = 0;
  let high = sorted.length - 1;
  while (low <= high) {
    let mid = (low + high) >> 1;
    if (sorted[mid] > element) {
      high = mid - 1;
    } else if (sorted[mid] < element) {
      low = mid + 1;
    } else {
      return mid;
    }
  }
  return low - 1;
}
// Example: Transition to 1 for emoticons in range 1F600 - 1F64F:
let transitions = {
  inputs: [0x00000, 0x1F600, 0x1F650],
  states: [0, 1, 0]
};
let input = 0x1F60B; // 😋
let next = transitions.states[floorIndex(transitions.inputs, input)];
console.log(`transition to ${next}`);
This search completes in O(log n) steps where n is the number of contiguous input ranges. The transition table for a single state then has a space requirement of O(n). This approach works equally well for sparse and dense transition tables as long as our initial assumption - the number of contiguous input ranges leading to the same state is small - holds.
Sounds like you have two very different cases ("in many cases, the table will be extremely sparse, but in other cases it will be nearly full").
For the sparse case, you could possibly have a separate sparse index (or several layers of indexes), then your actual data could be stored in a typed array. Because the index(es) would be mapping from integers to integers, they could be represented as typed arrays as well.
Looking up a value would look like this:
1. Binary search the index. The index stores pairs as consecutive entries in the typed array – the first element is the search value, the second is the position in the data set (or the next index).
2. If you have multiple indexes, repeat step 1 as necessary.
3. Start iterating your dataset at the position given by the last index. Because the index is sparse, this position might not be the one where the value is stored, but it is a good starting point, as the correct value is guaranteed to be nearby.
4. The dataset itself is represented as a typed array where consecutive pairs hold the key and the value.
I cannot think of anything better to use in JavaScript. Typed arrays are pretty fast and having indexes should increase the speed drastically. That being said, if you only have a couple thousand entries, don't bother with indexes, do a binary search directly on the typed array (described in 4. above).
For the dense case, I am not sure. If the dense case happens to be a case where repeated values across ranges of keys are likely, consider using something like run-length encoding – identical consecutive values are represented simply as their number of occurrences and then the actual value. Once again, use typed arrays and binary search, possibly even indexes to make this faster.
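A minimal sketch of that run-length idea for the dense case, using typed arrays and a binary search over the run start keys (the names and layout are my own assumptions):

// Runs are stored as two parallel typed arrays: runStarts[i] is the first key
// of run i, runValues[i] is the value for every key in that run.
const runStarts = new Uint32Array([0x0000, 0x1F600, 0x1F650]); // sorted
const runValues = new Uint32Array([0, 1, 0]);

// Binary search for the last run whose start key is <= key.
function lookup(key) {
  let low = 0, high = runStarts.length - 1, found = -1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    if (runStarts[mid] <= key) {
      found = mid;
      low = mid + 1;
    } else {
      high = mid - 1;
    }
  }
  return found === -1 ? undefined : runValues[found];
}

console.log(lookup(0x1F60B)); // 1 - falls into the 0x1F600 run
console.log(lookup(0x0041));  // 0 - falls into the first run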

First element in array

Why do indexes in arrays always start with 0? Does it have something to do with binary? For example:
var myArray = [5,6,7,8];
To access the number 5, you would have to say
myArray[0]
But why?
No, I don't have a real problem. As you can evidently tell I'm new to this stuff.
I'm sure this has been asked and answered a hundred times, but I'll bite.
One way of looking at the "index" or "key" is as an "offset".
myArray essentially acts as a pointer to the first item in a series of items. Specifically, it points to the number "5" in memory. So when you say myArray[1] it's like saying "the location of the first element in myArray plus 1 item over", thus you would be jumping over the first element.
In C, when you write *myArray (pointer dereference) it actually gives you back the first element.
#include <stdio.h>

int main(void) {
    int myArray[] = {5, 6, 7, 8};
    printf("%d", *myArray);      // prints "5", equivalent to myArray[0]
    printf("%d", *(myArray+1));  // prints "6", equivalent to myArray[1]
    return 0;
}
There are more practical reasons than "that's the way computers work" too.
A nice blog post about the historical reasons: http://developeronline.blogspot.fi/2008/04/why-array-index-should-start-from-0.html
It's basic computer science stuff, which harkens back to the days when memory was very limited: everything started at 0 rather than 1 because, counting from 0, a single digit can cover ten distinct values.
You're clearly new to this; trust me, from now on you'll be counting 0, 1, 2, 3!
Wikipedia gives us this explanation:
Index origin
Some languages, such as C, provide only zero-based array types, for
which the minimum valid value for any index is 0. This choice is
convenient for array implementation and address computations. With a
language such as C, a pointer to the interior of any array can be
defined that will symbolically act as a pseudo-array that accommodates
negative indices. This works only because C does not check an index
against bounds when used. Other languages provide only one-based array
types, where each index starts at 1; this is the traditional
convention in mathematics for matrices and mathematical sequences. A
few languages, such as Pascal, support n-based array types, whose
minimum legal indices are chosen by the programmer. The relative
merits of each choice have been the subject of heated debate.
Zero-based indexing has a natural advantage to one-based indexing in avoiding off-by-one or fencepost errors. See comparison of
programming languages (array) for the base indices used by various
languages.
Read more about arrays here
Read more about off-by-one and fencepost errors here
In JavaScript, like many other languages, arrays always start at index zero, but it's not that way in all languages.
In Pascal, for example, you define the lower and upper boundary, so you can start an array at index three:
var myArray: array[3..6] of Integer;
It's most common to start arrays at zero, because that's most efficient when you access the items. If you start at any other index, that value has to be subtracted when the address where the item is stored is calculated. That extra calculation wouldn't be an issue today, but back when languages like C were designed it surely was.
(Well, arrays in JavaScript are actually accessed quite differently from most other languages, but they use zero-based indexes because most similar languages, from which the inspiration comes, do.)
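To illustrate the address arithmetic being described, here is a hypothetical sketch (mine, not from the answer) that emulates it with a DataView over raw bytes:

// With a zero-based array the byte offset is simply index * elementSize;
// with a lower bound of 3 (as in the Pascal example) you must subtract it first.
const buffer = new ArrayBuffer(4 * 4);           // room for 4 32-bit integers
const view = new DataView(buffer);
[5, 6, 7, 8].forEach((v, i) => view.setInt32(i * 4, v));

const lowerBound = 3;                            // hypothetical non-zero base
function itemAt(index) {
  return view.getInt32((index - lowerBound) * 4); // the extra subtraction
}
console.log(itemAt(3)); // 5
console.log(itemAt(6)); // 8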

why to use sorting maps on arrays. how is it better in some instances

I'm trying to learn about array sorting. It seems pretty straightforward. But on the mozilla site, I ran across a section discussing sorting maps (about three-quarters down the page).
The compareFunction can be invoked multiple times per element within
the array. Depending on the compareFunction's nature, this may yield a
high overhead. The more work a compareFunction does and the more
elements there are to sort, the wiser it may be to consider using a
map for sorting.
The example given is this:
// the array to be sorted
var list = ["Delta", "alpha", "CHARLIE", "bravo"];

// temporary holder of position and sort-value
var map = [];
// container for the resulting order
var result = [];

// walk original array to map values and positions
for (var i = 0, length = list.length; i < length; i++) {
  map.push({
    // remember the index within the original array
    index: i,
    // evaluate the value to sort
    value: list[i].toLowerCase()
  });
}

// sorting the map containing the reduced values
map.sort(function(a, b) {
  return a.value > b.value ? 1 : -1;
});

// copy values in right order
for (var i = 0, length = map.length; i < length; i++) {
  result.push(list[map[i].index]);
}

// print sorted list
console.log(result);
I don't understand a couple of things. To wit: What does it mean, "The compareFunction can be invoked multiple times per element within the array"? Can someone show me an example of that? Secondly, I understand what's being done in the example, but I don't understand the potential "high[er] overhead" of the compareFunction. The example shown here seems really straightforward, and mapping the array into objects, sorting by value, then putting it back into an array looks, at first glance, like it would take much more overhead. I understand this is a simple example, probably not intended for anything other than to show the procedure. But can someone give an example of when it would be lower overhead to map like this? It seems like a lot more work.
Thanks!
When sorting a list, an item isn't just compared to one other item, it may need to be compared to several other items. Some of the items may even have to be compared to all other items.
Let's see how many comparisons there actually are when sorting an array:
var list = ["Delta", "alpha", "CHARLIE", "bravo", "orch", "worm", "tower"];
var o = [];
for (var i = 0; i < list.length; i++) {
  o.push({
    value: list[i],
    cnt: 0
  });
}
o.sort(function(x, y) {
  x.cnt++;
  y.cnt++;
  return x.value == y.value ? 0 : x.value < y.value ? -1 : 1;
});
console.log(o);
Result:
[
{ value="CHARLIE", cnt=3},
{ value="Delta", cnt=3},
{ value="alpha", cnt=4},
{ value="bravo", cnt=3},
{ value="orch", cnt=3},
{ value="tower", cnt=7},
{ value="worm", cnt=3}
]
(Fiddle: http://jsfiddle.net/Guffa/hC6rV/)
As you see, each item was compared to several other items. The string "tower" even had more comparisons than there are other strings, which means that it was compared to at least one other string at least twice.
If the comparison needs some calculation before the values can be compared (like the toLowerCase method in the example), then that calculation will be done several times. By caching the values after that calculation, it will be done only once for each item.
The primary time saving in that example is gotten by avoiding calls to toLowerCase() in the comparison function. The comparison function is called by the sort code each time a pair of elements needs to be compared, so that's a savings of a lot of function calls. The cost of building and un-building the map is worth it for large arrays.
That the comparison function may be called more than once per element is a natural implication of how sorting works. If only one comparison per element were necessary, it would be a linear-time process.
edit — the number of comparisons that'll be made will be roughly proportional to the length of the array times the base-2 log of the length. For a 1000 element array, then, that's proportional to 10,000 comparisons (probably closer to 15,000, depending on the actual sort algorithm). Saving 20,000 unnecessary function calls is worth the 2000 operations necessary to build and un-build the sort map.
This is called the “decorate - sort - undecorate” pattern (you can find a nice explanation on Wikipedia).
The idea is that a comparison-based sort will have to call the comparison function at least n - 1 times (where n is the number of items in the list), as this is the number of comparisons needed just to check that the array is already sorted. Usually, the number of comparisons will be larger than that (O(n ln n) if you are using a good algorithm), and according to the pigeonhole principle, there is at least one value that will be passed to the comparison function more than once.
If your comparison function does some expensive processing before comparing the two values, then you can reduce the cost by first doing the expensive part and storing the result for each value (since you know that even in the best scenario you'll have to do that processing). Then, when sorting, you use a cheaper comparison function that only compares those cached outputs.
In this example, the "expensive" part is converting the string to lowercase.
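For reference, here is a compact sketch of the same decorate-sort-undecorate pattern written with map() (my own rewrite of the question's example, not code from MDN):

var list = ["Delta", "alpha", "CHARLIE", "bravo"];

// decorate: do the expensive part (toLowerCase) exactly once per item
var sorted = list
  .map(function (value, index) {
    return { index: index, key: value.toLowerCase() };
  })
  // sort: compare only the cached keys
  .sort(function (a, b) {
    return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
  })
  // undecorate: rebuild the array in the new order
  .map(function (entry) {
    return list[entry.index];
  });

console.log(sorted); // ["alpha", "bravo", "CHARLIE", "Delta"]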
Think of this like caching. It's simply saying that you should not do lots of calculation in the compare function, because you will be calculating the same value over and over.
What does it mean, "The compareFunction can be invoked multiple times per element within the array"?
It means exactly what it says. Let's say you have three items, A, B and C. They need to be sorted by the result of the compare function, and the comparisons might be done like this:
compare(A) to compare(B)
compare(A) to compare(C)
compare(B) to compare(C)
So here, we have 3 values, but the expensive compare() calculation was executed 6 times. Using a temporary array to cache things ensures we do the calculation only once per item, and can compare those cached results.
Secondly, I understand what's being done in the example, but I don't understand the potential "high[er] overhead" of the compareFunction.
What if compare() does a database fetch (comparing the counts of matching rows)? Or a complex math calculation (factorial, recursive Fibonacci, or iteration over a large number of items)? These are the sorts of things you don't want to do more than once.
I would say most of the time it's fine to leave really simple/fast calculations inline. Don't over-optimize. But if you need to do anything complex or slow in the comparison, you have to be smarter about it.
To respond to your first question, why would the compareFunction be called multiple times per element in the array?
Sorting an array almost always requires more than N comparisons, where N is the size of the array (unless the array is already sorted). Thus, every element in your array may be compared against other elements up to N times (bubble sort requires up to N^2 comparisons in total). The compareFunction you provide is used each time to determine whether two elements are less than, equal to or greater than each other, and thus will be called multiple times per element in the array.
A simple response to your second question: why would there be potentially higher overhead in a compareFunction?
Say your compareFunction does a lot of unnecessary work while comparing two elements of the array. This can cause sort to be slower, and thus using a compareFunction could potentially cause higher overhead.

Is it correct to use JavaScript Array.sort() method for shuffling?

I was helping somebody out with his JavaScript code and my eyes were caught by a section that looked like that:
function randOrd() {
  return (Math.round(Math.random()) - 0.5);
}
coords.sort(randOrd);
alert(coords);
My first thought was: hey, this can't possibly work! But then I did some experimenting and found that it indeed at least seems to provide nicely randomized results.
Then I did some web search and almost at the top found an article from which this code was most certainly copied. Looked like a pretty respectable site and author...
But my gut feeling tells me that this must be wrong. Especially as the sorting algorithm is not specified by the ECMA standard. I think different sorting algorithms will result in different non-uniform shuffles. Some sorting algorithms may probably even loop infinitely...
But what do you think?
And as another question... how would I now go and measure how random the results of this shuffling technique are?
update: I did some measurements and posted the results below as one of the answers.
After Jon has already covered the theory, here's an implementation:
function shuffle(array) {
  var tmp, current, top = array.length;

  if (top) while (--top) {
    current = Math.floor(Math.random() * (top + 1));
    tmp = array[current];
    array[current] = array[top];
    array[top] = tmp;
  }

  return array;
}
The algorithm is O(n), whereas sorting should be O(n log n). Depending on the overhead of executing JS code compared to the native sort() function, this might lead to a noticeable difference in performance, which should increase with array size.
In the comments to bobobobo's answer, I stated that the algorithm in question might not produce evenly distributed probabilities (depending on the implementation of sort()).
My argument goes along these lines: A sorting algorithm requires a certain number c of comparisons, e.g. c = n(n-1)/2 for Bubblesort. Our random comparison function makes the outcome of each comparison equally likely, i.e. there are 2^c equally probable results. Now, each result has to correspond to one of the n! permutations of the array's entries, which makes an even distribution impossible in the general case. (This is a simplification, as the actual number of comparisons needed depends on the input array, but the assertion should still hold.)
As Jon pointed out, this alone is no reason to prefer Fisher-Yates over using sort(), as the random number generator will also map a finite number of pseudo-random values to the n! permutations. But the results of Fisher-Yates should still be better:
Math.random() produces a pseudo-random number in the range [0;1[. As JS uses double-precision floating point values, this corresponds to 2^x possible values where 52 ≤ x ≤ 63 (I'm too lazy to find the actual number). A probability distribution generated using Math.random() will stop behaving well if the number of atomic events is of the same order of magnitude.
When using Fisher-Yates, the relevant parameter is the size of the array, which should never approach 2^52 due to practical limitations.
When sorting with a random comparison function, the function basically only cares if the return value is positive or negative, so this will never be a problem. But there is a similar one: Because the comparison function is well-behaved, the 2^c possible results are, as stated, equally probable. If c ~ n log n then 2^c ~ n^(a·n) where a = const, which makes it at least possible that 2^c is of the same magnitude as (or even less than) n! and thus leads to an uneven distribution, even if the sorting algorithm were to map onto the permutations evenly. Whether this has any practical impact is beyond me.
The real problem is that the sorting algorithms are not guaranteed to map onto the permutations evenly. It's easy to see that Mergesort does, as it's symmetric, but reasoning about something like Bubblesort or, more importantly, Quicksort or Heapsort, is not as easy.
The bottom line: As long as sort() uses Mergesort, you should be reasonably safe except in corner cases (at least I'm hoping that 2^c ≤ n! is a corner case), if not, all bets are off.
It's never been my favourite way of shuffling, partly because it is implementation-specific as you say. In particular, I seem to remember that the standard library sorting from either Java or .NET (not sure which) can often detect if you end up with an inconsistent comparison between some elements (e.g. you first claim A < B and B < C, but then C < A).
It also ends up as a more complex (in terms of execution time) shuffle than you really need.
I prefer the shuffle algorithm which effectively partitions the collection into "shuffled" (at the start of the collection, initially empty) and "unshuffled" (the rest of the collection). At each step of the algorithm, pick a random unshuffled element (which could be the first one) and swap it with the first unshuffled element - then treat it as shuffled (i.e. mentally move the partition to include it).
This is O(n) and only requires n-1 calls to the random number generator, which is nice. It also produces a genuine shuffle - any element has a 1/n chance of ending up in each space, regardless of its original position (assuming a reasonable RNG). The sorted version approximates to an even distribution (assuming that the random number generator doesn't pick the same value twice, which is highly unlikely if it's returning random doubles) but I find it easier to reason about the shuffle version :)
This approach is called a Fisher-Yates shuffle.
I would regard it as a best practice to code up this shuffle once and reuse it everywhere you need to shuffle items. Then you don't need to worry about sort implementations in terms of reliability or complexity. It's only a few lines of code (which I won't attempt in JavaScript!)
The Wikipedia article on shuffling (and in particular the shuffle algorithms section) talks about sorting a random projection - it's worth reading the section on poor implementations of shuffling in general, so you know what to avoid.
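For completeness, here is a sketch of the variant described above, with the "shuffled" partition growing from the start of the array (Christoph's implementation earlier in the thread works from the end; both are Fisher-Yates):

function shuffleForward(array) {
  // Everything before `i` is the "shuffled" partition,
  // everything from `i` on is still "unshuffled".
  for (var i = 0; i < array.length - 1; i++) {
    // pick a random unshuffled element (possibly the first unshuffled one)...
    var j = i + Math.floor(Math.random() * (array.length - i));
    // ...and swap it with the first unshuffled element
    var tmp = array[i];
    array[i] = array[j];
    array[j] = tmp;
  }
  return array;
}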
I did some measurements of how random the results of this random sort are...
My technique was to take a small array [1,2,3,4] and create all (4! = 24) permutations of it. Then I would apply the shuffling function to the array a large number of times and count how many times each permutation is generated. A good shuffling algorithm would distribute the results quite evenly over all the permutations, while a bad one would not create that uniform result.
Using the code below I tested in Firefox, Opera, Chrome, IE6/7/8.
Surprisingly for me, the random sort and the real shuffle both created equally uniform distributions. So it seems that (as many have suggested) the main browsers are using merge sort. This of course doesn't mean that there can't be a browser out there that does it differently, but I would say it means that this random-sort method is reliable enough to use in practice.
EDIT: This test didn't really measure the randomness (or lack thereof) correctly. See the other answer I posted.
But on the performance side the shuffle function given by Christoph was a clear winner. Even for small four-element arrays the real shuffle performed about twice as fast as the random sort!
// The shuffle function posted by Christoph.
var shuffle = function(array) {
  var tmp, current, top = array.length;

  if (top) while (--top) {
    current = Math.floor(Math.random() * (top + 1));
    tmp = array[current];
    array[current] = array[top];
    array[top] = tmp;
  }

  return array;
};

// the random sort function
var rnd = function() {
  return Math.round(Math.random()) - 0.5;
};
var randSort = function(A) {
  return A.sort(rnd);
};

var permutations = function(A) {
  if (A.length == 1) {
    return [A];
  } else {
    var perms = [];
    for (var i = 0; i < A.length; i++) {
      var x = A.slice(i, i + 1);
      var xs = A.slice(0, i).concat(A.slice(i + 1));
      var subperms = permutations(xs);
      for (var j = 0; j < subperms.length; j++) {
        perms.push(x.concat(subperms[j]));
      }
    }
    return perms;
  }
};

var test = function(A, iterations, func) {
  // init permutations
  var stats = {};
  var perms = permutations(A);
  for (var i in perms) {
    stats["" + perms[i]] = 0;
  }

  // shuffle many times and gather stats
  var start = new Date();
  for (var i = 0; i < iterations; i++) {
    var shuffled = func(A);
    stats["" + shuffled]++;
  }
  var end = new Date();

  // format result
  var arr = [];
  for (var i in stats) {
    arr.push(i + " " + stats[i]);
  }
  return arr.join("\n") + "\n\nTime taken: " + ((end - start) / 1000) + " seconds.";
};

alert("random sort: " + test([1,2,3,4], 100000, randSort));
alert("shuffle: " + test([1,2,3,4], 100000, shuffle));
Interestingly, Microsoft used the same technique in their pick-random-browser-page.
They used a slightly different comparison function:
function RandomSort(a, b) {
  return (0.5 - Math.random());
}
Looks almost the same to me, but it turned out to be not so random...
So I made some test runs again with the same methodology used in the linked article, and indeed - it turned out that the random-sorting method produced flawed results. New test code here:
function shuffle(arr) {
  arr.sort(function(a, b) {
    return (0.5 - Math.random());
  });
}

function shuffle2(arr) {
  arr.sort(function(a, b) {
    return (Math.round(Math.random()) - 0.5);
  });
}

function shuffle3(array) {
  var tmp, current, top = array.length;

  if (top) while (--top) {
    current = Math.floor(Math.random() * (top + 1));
    tmp = array[current];
    array[current] = array[top];
    array[top] = tmp;
  }

  return array;
}

var counts = [
  [0,0,0,0,0],
  [0,0,0,0,0],
  [0,0,0,0,0],
  [0,0,0,0,0],
  [0,0,0,0,0]
];

var arr;
for (var i = 0; i < 100000; i++) {
  arr = [0,1,2,3,4];
  shuffle3(arr);
  arr.forEach(function(x, i) { counts[x][i]++; });
}

alert(counts.map(function(a) { return a.join(", "); }).join("\n"));
I have placed a simple test page on my website showing the bias of your current browser versus other popular browsers using different methods to shuffle. It shows the terrible bias of just using Math.random()-0.5, another 'random' shuffle that isn't biased, and the Fisher-Yates method mentioned above.
You can see that on some browsers there is as high as a 50% chance that certain elements will not change place at all during the 'shuffle'!
Note: you can make the implementation of the Fisher-Yates shuffle by @Christoph slightly faster for Safari by changing the code to:
function shuffle(array) {
  for (var tmp, cur, top = array.length; top--;) {
    cur = (Math.random() * (top + 1)) << 0;
    tmp = array[cur]; array[cur] = array[top]; array[top] = tmp;
  }
  return array;
}
Test results: http://jsperf.com/optimized-fisher-yates
I think it's fine for cases where you're not picky about distribution and you want the source code to be small.
In JavaScript (where the source is transmitted constantly), small makes a difference in bandwidth costs.
It's been four years, but I'd like to point out that the random comparator method won't be correctly distributed, no matter what sorting algorithm you use.
Proof:
For an array of n elements, there are exactly n! permutations (i.e. possible shuffles).
Every comparison during a shuffle is a choice between two sets of permutations. For a random comparator, there is a 1/2 chance of choosing each set.
Thus, for each permutation p, the chance of ending up with permutation p is a fraction with denominator 2^k (for some k), because it is a sum of such fractions (e.g. 1/8 + 1/16 = 3/16).
For n = 3, there are six equally-likely permutations. The chance of each permutation, then, is 1/6. 1/6 can't be expressed as a fraction with a power of 2 as its denominator.
Therefore, the coin flip sort will never result in a fair distribution of shuffles.
The only sizes that could possibly be correctly distributed are n=0,1,2.
As an exercise, try drawing out the decision tree of different sort algorithms for n=3.
There is a gap in the proof: If a sort algorithm depends on the consistency of the comparator, and has unbounded runtime with an inconsistent comparator, it can have an infinite sum of probabilities, which is allowed to add up to 1/6 even if every denominator in the sum is a power of 2. Try to find one.
Also, if a comparator has a fixed chance of giving either answer (e.g. (Math.random() < P)*2 - 1, for constant P), the above proof holds. If the comparator instead changes its odds based on previous answers, it may be possible to generate fair results. Finding such a comparator for a given sorting algorithm could be a research paper.
It is a hack, certainly. In practice, an infinitely looping algorithm is not likely.
If you're sorting objects, you could loop through the coords array and do something like:
for (var i = 0; i < coords.length; i++)
  coords[i].sortValue = Math.random();

coords.sort(useSortValue);

function useSortValue(a, b) {
  return a.sortValue - b.sortValue;
}
(and then loop through them again to remove the sortValue)
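That cleanup pass could be as simple as (a sketch):

for (var i = 0; i < coords.length; i++)
  delete coords[i].sortValue;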
Still a hack though. If you want to do it nicely, you have to do it the hard way :)
If you're using D3 there is a built-in shuffle function (using Fisher-Yates):
var days = ['Lundi','Mardi','Mercredi','Jeudi','Vendredi','Samedi','Dimanche'];
d3.shuffle(days);
And here is Mike going into details about it:
http://bost.ocks.org/mike/shuffle/
No, it is not correct. As other answers have noted, it will lead to a non-uniform shuffle and the quality of the shuffle will also depend on which sorting algorithm the browser uses.
Now, that might not sound too bad to you, because even if theoretically the distribution is not uniform, in practice it's probably nearly uniform, right? Well, no, not even close. The following charts show heat-maps of which indices each element gets shuffled to, in Chrome and Firefox respectively: if the pixel (i, j) is green, it means the element at index i gets shuffled to index j too often, and if it's red then it gets shuffled there too rarely.
These screenshots are taken from Mike Bostock's page on this subject.
As you can see, shuffling using a random comparator is severely biased in Chrome and even more so in Firefox. In particular, both have a lot of green along the diagonal, meaning that too many elements get "shuffled" somewhere very close to where they were in the original sequence. In comparison, a similar chart for an unbiased shuffle (e.g. using the Fisher-Yates algorithm) would be all pale yellow with just a small amount of random noise.
Here's an approach that uses a single array:
The basic logic is:
Starting with an array of n elements
Remove a random element from the array and push it onto the array
Remove a random element from the first n - 1 elements of the array and push it onto the array
Remove a random element from the first n - 2 elements of the array and push it onto the array
...
Remove the first element of the array and push it onto the array
Code:
for (var i = a.length; i--; )
  a.push(a.splice(Math.floor(Math.random() * (i + 1)), 1)[0]);
Can you use the Array.sort() function to shuffle an array – Yes.
Are the results random enough – No.
Consider the following code snippet:
/*
 * The following code sample shuffles an array using the Math.random() trick
 * After shuffling, the new position of each item is recorded
 * The process is repeated 100 times
 * The result is printed out, listing each item and the number of times
 * it appeared on a given position after shuffling
 */
var array = ["a", "b", "c", "d", "e"];
var stats = {};
array.forEach(function(v) {
  stats[v] = Array(array.length).fill(0);
});

var i, clone;
for (i = 0; i < 100; i++) {
  clone = array.slice();
  clone.sort(function() {
    return Math.random() - 0.5;
  });
  clone.forEach(function(v, i) {
    stats[v][i]++;
  });
}

Object.keys(stats).forEach(function(v, i) {
  console.log(v + ": [" + stats[v].join(", ") + "]");
});
Sample output:
a: [29, 38, 20, 6, 7]
b: [29, 33, 22, 11, 5]
c: [17, 14, 32, 17, 20]
d: [16, 9, 17, 35, 23]
e: [ 9, 6, 9, 31, 45]
Ideally, the counts should be evenly distributed (for the above example, all counts should be around 20). But they are not. Apparently, the distribution depends on what sorting algorithm is implemented by the browser and how it iterates the array items for sorting.
There is nothing wrong with it.
The function you pass to .sort() usually looks something like
function sortingFunc(first, second) {
  // example:
  return first - second;
}
Your job in sortingFunc is to return:
a negative number if first goes before second
a positive number if first should go after second
and 0 if they are completely equal
The above sorting function puts things in order.
If you return negative and positive values randomly, as the code above does, you get a random ordering.
Like in MySQL:
SELECT * from table ORDER BY rand()
