I would like to randomly select one element from an array, but each element has a known probability of selection.
All chances together (within the array) sum to 1.
What algorithm would you suggest as the fastest and most suitable for huge calculations?
Example:
id => chance
array[
0 => 0.8
1 => 0.2
]
For this pseudocode, the algorithm in question should, over many calls, statistically return id 0 four times for every one time it returns id 1.
Compute the discrete cumulative distribution function (CDF) of your list, or in simple terms the array of cumulative sums of the weights. Then generate a random number in the range between 0 and the sum of all weights (1 in your case), do a binary search for the first entry of the CDF array that is greater than this random number, and take the element corresponding to that entry: this is your weighted random choice.
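A minimal Python sketch of the CDF-plus-binary-search idea described above. The function names build_cdf and weighted_choice are hypothetical, chosen for illustration only.

```python
import bisect
import itertools
import random


def build_cdf(weights):
    """Return the running totals of the weights (the discrete CDF)."""
    return list(itertools.accumulate(weights))


def weighted_choice(items, cdf, rng=random):
    """Pick one item; cdf must come from build_cdf() for the same items."""
    # Uniform draw in [0, total); the total is the last cumulative sum.
    r = rng.random() * cdf[-1]
    # bisect_right returns the index of the first cumulative sum > r.
    index = bisect.bisect_right(cdf, r)
    # Guard against a floating-point edge case where r rounds up to the total.
    return items[min(index, len(items) - 1)]


if __name__ == "__main__":
    items = [0, 1]
    cdf = build_cdf([0.8, 0.2])
    counts = {0: 0, 1: 0}
    for _ in range(10_000):
        counts[weighted_choice(items, cdf)] += 1
    print(counts)  # id 0 should come up about four times as often as id 1
```

Building the CDF is O(n) and is done once; each pick is then O(log n). Python's standard library offers the same behaviour through random.choices(population, cum_weights=...).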
The algorithm is straightforward:
rand_num = rand(0,1)
for each element in array
if (rand_num < element.probability)
select and break
rand_num = rand_num - element.probability
I have found this article to be the most useful at understanding this problem fully.
This stackoverflow question may also be what you're looking for.
I believe the optimal solution is to use the Alias Method (wikipedia).
It requires O(n) time to initialize, O(1) time to make a selection, and O(n) memory.
Here is the algorithm for generating the result of rolling a weighted n-sided die (from here it is trivial to select an element from a length-n array), as taken from this article.
The author assumes you have functions for rolling a fair die (floor(random() * n)) and flipping a biased coin (random() < p).
Algorithm: Vose's Alias Method
Initialization:
Create arrays Alias and Prob, each of size n.
Create two worklists, Small and Large.
Multiply each probability by n.
For each scaled probability pi:
If pi < 1, add i to Small.
Otherwise (pi ≥ 1), add i to Large.
While Small and Large are not empty: (Large might be emptied first)
Remove the first element from Small; call it l.
Remove the first element from Large; call it g.
Set Prob[l]=pl.
Set Alias[l]=g.
Set pg := (pg+pl)−1. (This is a more numerically stable option.)
If pg<1, add g to Small.
Otherwise (pg ≥ 1), add g to Large.
While Large is not empty:
Remove the first element from Large; call it g.
Set Prob[g] = 1.
While Small is not empty: This is only possible due to numerical instability.
Remove the first element from Small; call it l.
Set Prob[l] = 1.
Generation:
Generate a fair die roll from an n-sided die; call the side i.
Flip a biased coin that comes up heads with probability Prob[i].
If the coin comes up "heads," return i.
Otherwise, return Alias[i].
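The steps above can be sketched in Python as follows. This is an illustrative transcription, not the article's own code: the function names are hypothetical, and elements are removed from the end of each worklist rather than the front, which does not affect correctness.

```python
import random


def build_alias_table(probabilities):
    """Build the Prob and Alias arrays for probabilities that sum to 1."""
    n = len(probabilities)
    scaled = [p * n for p in probabilities]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]

    while small and large:
        small_i = small.pop()
        large_i = large.pop()
        prob[small_i] = scaled[small_i]
        alias[small_i] = large_i
        # The large entry donates mass to fill the small entry's column.
        scaled[large_i] = (scaled[large_i] + scaled[small_i]) - 1.0
        (small if scaled[large_i] < 1.0 else large).append(large_i)

    # Leftovers have probability 1; Small is only non-empty here
    # because of rounding error.
    for i in large + small:
        prob[i] = 1.0
    return prob, alias


def alias_sample(prob, alias, rng=random):
    """Draw one index in O(1): fair die roll, then biased coin flip."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

A table built once with build_alias_table([0.8, 0.2]) can then be sampled repeatedly with alias_sample(prob, alias), which is where the O(1) per-selection cost pays off.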
Here is an implementation in Ruby:
def weighted_rand(weights = {})
# Compare with a small tolerance: floating-point sums are rarely exactly 1.0
raise 'Probabilities must sum up to 1' unless (weights.values.inject(0.0, :+) - 1.0).abs < 1e-9
raise 'Probabilities must not be negative' unless weights.values.all? { |p| p >= 0 }
# Do more sanity checks depending on the amount of trust in the software component using this method,
# e.g. don't allow duplicates, don't allow non-numeric values, etc.
# Ignore elements with probability 0
weights = weights.reject { |k, v| v == 0.0 } # e.g. => {"a"=>0.4, "b"=>0.4, "c"=>0.2}
# Accumulate probabilities and map them to a value
u = 0.0
ranges = weights.map { |v, p| [u += p, v] } # e.g. => [[0.4, "a"], [0.8, "b"], [1.0, "c"]]
# Generate a (pseudo-)random floating point number between 0.0(included) and 1.0(excluded)
u = rand # e.g. => 0.4651073966724186
# Find the first value that has an accumulated probability greater than the random number u
ranges.find { |p, v| p > u }.last # e.g. => "b"
end
How to use:
weights = {'a' => 0.4, 'b' => 0.4, 'c' => 0.2, 'd' => 0.0}
weighted_rand weights
What to expect roughly:
sample = 1000.times.map { weighted_rand weights }
sample.count('a') # 396
sample.count('b') # 406
sample.count('c') # 198
sample.count('d') # 0
An example in Ruby:
# each element is associated with its probability
a = {1 => 0.25 ,2 => 0.5 ,3 => 0.2, 4 => 0.05}
# at some point, convert to cumulative probability
acc = 0
a.each { |e,w| a[e] = acc+=w }
# to select an element, pick a random number between 0 and 1 and find the first
# cumulative probability that's greater than the random number
r = rand
selected = a.find{ |e,w| w>r }
p selected[0]
This can be done in O(1) expected time per sample as follows.
Compute the CDF F(i) for each element i as the sum of the probabilities of all elements with index less than or equal to i (so F(0) = 0).
Define the range r(i) of an element i to be the interval [F(i - 1), F(i)].
For each interval [(i - 1)/n, i/n], create a bucket consisting of the list of the elements whose range overlaps the interval. This takes O(n) time in total for the full array as long as you are reasonably careful.
When you randomly sample the array, you simply compute which bucket the random number is in, and compare with each element of the list until you find the interval that contains it.
The cost of a sample is O(the expected length of a randomly chosen list) <= 2.
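A rough Python sketch of this bucket scheme, under the assumption that the probabilities sum to 1 and elements are indexed from 0. The names build_buckets, lookup and bucket_sample are hypothetical.

```python
import itertools
import random


def build_buckets(probabilities):
    """Return (cdf, buckets); bucket k lists the elements whose range
    [F(i-1), F(i)) overlaps the interval [k/n, (k+1)/n)."""
    n = len(probabilities)
    cdf = list(itertools.accumulate(probabilities))
    buckets = [[] for _ in range(n)]
    i = 0
    for k in range(n):
        lo, hi = k / n, (k + 1) / n
        # Skip elements whose range ends at or before the bucket start.
        while i < n - 1 and cdf[i] <= lo:
            i += 1
        j = i
        buckets[k].append(j)
        # Add every following element that starts before the bucket end.
        while j < n - 1 and cdf[j] < hi:
            j += 1
            buckets[k].append(j)
    return cdf, buckets


def lookup(cdf, buckets, r):
    """Map a number r in [0, 1) to the element whose range contains it."""
    n = len(buckets)
    k = min(int(r * n), n - 1)
    for i in buckets[k]:
        if r < cdf[i]:
            return i
    return buckets[k][-1]  # only reached through rounding error


def bucket_sample(cdf, buckets, rng=random):
    return lookup(cdf, buckets, rng.random())
```

The total length of all bucket lists is at most 2n - 1, which is what keeps both the construction and the expected lookup cost low.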
This is the PHP code I used in production:
/**
* @return \App\Models\CdnServer
*/
protected function selectWeightedServer(Collection $servers)
{
if ($servers->count() == 1) {
return $servers->first();
}
$totalWeight = 0;
foreach ($servers as $server) {
$totalWeight += $server->getWeight();
}
// Select a random server using weighted choice
$randWeight = mt_rand(1, $totalWeight);
$accWeight = 0;
foreach ($servers as $server) {
$accWeight += $server->getWeight();
if ($accWeight >= $randWeight) {
return $server;
}
}
}
Ruby solution using the pickup gem:
require 'pickup'
chances = {0=>80, 1=>20}
picker = Pickup.new(chances)
Example:
5.times.collect {
picker.pick(5)
}
gave output:
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]]
If the array is small, I would give the array a length of, in this case, five and assign the values as appropriate:
array[
0 => 0
1 => 0
2 => 0
3 => 0
4 => 1
]
"Wheel of Fortune" O(n), use for small arrays only:
function pickRandomWeighted(array, weights) {
var sum = 0;
for (var i=0; i<weights.length; i++) sum += weights[i];
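// subtract the current weight before advancing i (the original order skipped ahead)
for (var i=0, pick=Math.random()*sum; i<weights.length; pick-=weights[i], i++)
if (pick-weights[i]<0) return array[i];
return array[array.length-1]; // fallback for floating-point rounding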
}
The trick could be to sample from an auxiliary array in which each element is repeated in proportion to its probability.
Given the elements and their associated probabilities:
h = {1 => 0.5, 2 => 0.3, 3 => 0.05, 4 => 0.05 }
auxiliary_array = h.inject([]){|memo,(k,v)| memo += Array.new((100*v).to_i,k) }
ruby-1.9.3-p194 > auxiliary_array
=> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]
auxiliary_array.sample
If you want to be as generic as possible, calculate the multiplier from the maximum number of fractional digits and use it in place of 100:
m = 10**h.values.collect{|e| e.to_s.split(".").last.size }.max
Another possibility is to associate, with each element of the array, a random number drawn from an exponential distribution with parameter given by the weight for that element. Then pick the element with the lowest such ‘ordering number’. In this case, the probability that a particular element has the lowest ordering number of the array is proportional to the array element's weight.
This is O(n), doesn't involve any reordering or extra storage, and the selection can be done in the course of a single pass through the array. The weights must be greater than zero, but don't have to sum to any particular value.
This has the further advantage that, if you store the ordering number with each array element, you have the option to sort the array by increasing ordering number, to get a random ordering of the array in which elements with higher weights have a higher probability of coming early (I've found this useful when deciding which DNS SRV record to pick, to decide which machine to query).
Repeated random sampling with replacement requires a new pass through the array each time; for random selection without replacement, the array can be sorted in order of increasing ordering number, and k elements can be read out in that order.
See the Wikipedia page about the exponential distribution (in particular the remarks about the distribution of the minima of an ensemble of such variates) for the proof that the above is true, and also for the pointer towards the technique of generating such variates: if T has a uniform random distribution in [0,1), then Z=-log(1-T)/w (where w is the parameter of the distribution; here the weight of the associated element) has an exponential distribution.
That is:
For each element i in the array, calculate zi = -log(1-T)/wi, where T is drawn from a uniform distribution in [0,1) and wi is the weight of the i-th element. (zi = -log(T)/wi has the same distribution, but then T = 0 must be excluded.)
Select the element which has the lowest zi.
The element i will be selected with probability wi/(w1+w2+...+wn).
See below for an illustration of this in Python, which takes a single pass through the array of weights, for each of 10000 trials.
import math, random
random.seed()
weights = [10, 20, 50, 20]
nw = len(weights)
results = [0 for i in range(nw)]
n = 10000
while n > 0: # do n trials
smallest_i = 0
smallest_z = -math.log(1-random.random())/weights[0]
for i in range(1, nw):
z = -math.log(1-random.random())/weights[i]
if z < smallest_z:
smallest_i = i
smallest_z = z
results[smallest_i] += 1 # accumulate our choices
n -= 1
for i in range(nw):
print("{} -> {}".format(weights[i], results[i]))
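The weighted ordering and the selection without replacement mentioned earlier can be sketched in the same way. This is an illustration only; weighted_order and weighted_sample_without_replacement are hypothetical names.

```python
import math
import random


def weighted_order(items, weights, rng=random):
    """Return items in a random order biased towards heavier weights."""
    keyed = []
    for item, w in zip(items, weights):
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        z = -math.log(1.0 - rng.random()) / w
        keyed.append((z, item))
    # Smallest ordering number first.
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed]


def weighted_sample_without_replacement(items, weights, k, rng=random):
    """Read the first k items of a weighted ordering."""
    return weighted_order(items, weights, rng)[:k]
```

Sorting makes this O(n log n) per ordering; when only the single lowest ordering number is needed, the one-pass loop shown above is enough.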
Edit (for history): after posting this, I felt sure I couldn't be the first to have thought of it, and another search with this solution in mind shows that this is indeed the case.
In an answer to a similar question, Joe K suggested this algorithm (and also noted that someone else must have thought of it before).
Another answer to that question, meanwhile, pointed to Efraimidis and Spirakis (preprint), which describes a similar method.
I'm pretty sure, looking at it, that the Efraimidis and Spirakis is in fact the same exponential-distribution algorithm in disguise, and this is corroborated by a passing remark in the Wikipedia page about Reservoir sampling that ‘[e]quivalently, a more numerically stable formulation of this algorithm’ is the exponential-distribution algorithm above. The reference there is to a sequence of lecture notes by Richard Arratia; the relevant property of the exponential distribution is mentioned in Sect.1.3 (which mentions that something similar to this is a ‘familiar fact’ in some circles), but not its relationship to the Efraimidis and Spirakis algorithm.
I would imagine that numbers greater than or equal to 0.8 but less than 1.0 select the third element.
In other terms:
x is a random number between 0 and 1
if 0.0 <= x < 0.2 : Item 1
if 0.2 <= x < 0.8 : Item 2
if 0.8 <= x < 1.0 : Item 3
I am going to improve on the answer by https://stackoverflow.com/users/626341/masciugo.
Basically you make one big array where the number of times an element shows up is proportional to the weight.
It has some drawbacks.
The weights might not be integers. Imagine element 1 has a probability of pi and element 2 has a probability of 1-pi. How do you divide that? Or imagine there are hundreds of such elements.
The array created can be very big. If the least common multiple is 1 million, we will need an array of 1 million elements to pick from.
To counter that, this is what you do.
Create such an array, but insert each element only randomly. The probability that an element is inserted is proportional to its weight.
Then select a random element from it as usual.
So if there are 3 elements with various weights, you simply pick an element from an array of 1-3 elements.
Problems may arise if the constructed array is empty, that is, if it just happens that no elements show up in the array because their dice rolled unfavourably.
In which case, I propose that the probability an element is inserted is p(inserted)=wi/wmax.
That way, one element, namely the one that has the highest probability, will be inserted. The other elements will be inserted by the relative probability.
Say we have 2 objects.
Element 1 shows up 20% of the time.
Element 2 shows up 40% of the time and has the highest probability.
In the array, element 2 will show up all the time. Element 1 will show up half the time.
When both are present, a uniform pick chooses each with probability 1/2, so element 1 is picked with probability 0.5 × 0.5 = 0.25 and element 2 with probability 0.75: elements with a higher weight are favoured, though the ratio here is 3:1 rather than exactly 2:1. The pick always succeeds because the array always has at least one element.
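The acceptance probability wi/wmax is also the core of rejection sampling, a related but different scheme: pick an index uniformly, keep it with probability wi/wmax, and otherwise try again. Because each round accepts element i with probability proportional to wi, the selection frequencies are exactly proportional to the weights. The sketch below shows that swapped-in technique, not the array construction described above; rejection_pick is a hypothetical name.

```python
import random


def rejection_pick(weights, rng=random):
    """Return an index chosen with probability proportional to its weight."""
    w_max = max(weights)
    while True:
        # Propose an index uniformly at random.
        i = rng.randrange(len(weights))
        # Accept it with probability weights[i] / w_max, else retry.
        if rng.random() * w_max < weights[i]:
            return i
```

No auxiliary array is needed, and the weights do not have to be integers or sum to 1. The expected number of rounds is wmax divided by the mean weight, so it can be slow when one weight dwarfs the rest.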
I wrote an implementation in C#:
https://github.com/cdanek/KaimiraWeightedList
O(1) gets (fast!), O(n) recalculates, O(n) memory use.
I have an array of values, specifically pixel offsets of a certain type of element.
Let's say they are in an array arrScroll[] with values [5, 10, 15, 50, 100, 250].
If my window scrolls past 5 pixels but not past 10 pixels, I get the index 0. If my window scrolls past 15 pixels but not past 50 pixels, I get the index 2. If it scrolls back below 10 pixels but not below 5, I get the index 0 again.
What I'm trying to do is find a graceful way (instead of a ton of conditionals for each possible range, as the number of ranges can change) to always get the lower of the two indexes of the scroll range that I am in, except at the range 0 to arrScroll[0], in which case I pass a different value than the index.
Another example: if I am in the range of arrScroll[3] and arrScroll[4] then I will obtain the index 3. Only once I pass the position of the higher index number do I get its index.
This has nothing to do with sorting as the values are already sorted from smallest to greatest. On a scroll event listener, I simply want to obtain the index of the lower of the two values comprising the index.
What would be the best way to organize this so that it can function for an array of arbitrary length?
More complete example:
I have the colors red, blue, and green. I have an array with values [100, 200, 300]. When my window scrolls past 100 pixels but not past 200 pixels, I will have something like $(element).css('background-color', colorArr[index]) where the color in colorArr[] at index 0 is red.
Then if the window scrolls past 200, but not past 300, I run the same code snippet, but the index is now 1 and the color is blue.
If I scroll back below 200 pixels but not below 100 pixels, the index is once again 0 and the color passed is red.
This is trivial to create with if statements if the length of the array is known, but I don't know how to do it for an array of arbitrary length.
If I'm not mistaken, you want to find an index based on a value?
let arr = [5, 10, 15, 50, 100, 250];
let colorArr = ["red", "blue", "green", "yellow", "orange", "black"]
//fake listener
let scrollListener = (pixelScrolled) => {
let index = (arr.findIndex((element)=>pixelScrolled<element)+arr.length)%(arr.length+1);
$(document.body).css('background-color', colorArr[index])
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<button onclick="(()=>scrollListener(9))()">fake scroll toggle : 9</button>
<button onclick="(()=>scrollListener(49))()">fake scroll toggle : 49</button>
<button onclick="(()=>scrollListener(250))()">fake scroll toggle : 250</button>
I'd iterate over your arrScroll values from the beginning, assuming your offsets are always sorted in ascending order, and break as soon as the test position is smaller than the tested offset:
var test = function (position, offsets) {
for (var index = 0; index < offsets.length; index++) { // check every offset so positions past the last one map to the last index
if (position < offsets[index]) break;
}
return index > 0 ? index - 1 : 'something';
}
// returns index 1 because 4 is between 1 and 5
console.log(test(4, [0, 1, 5, 10, 20, 42]))
// returns 3
console.log(test(12, [0, 1, 5, 10, 20, 42]))
// 16 is below the first offset, so this returns "something" (your special case)
console.log(test(16, [20, 100]))
That way you iterate over only part of the offsets array, exiting as soon as possible.
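Since the offsets are already sorted, the same lookup can also be done with a binary search. A minimal Python sketch of that variant, assuming a position equal to an offset counts as having reached it; range_index and below_first are hypothetical names.

```python
import bisect


def range_index(position, offsets, below_first=None):
    """Index of the largest offset <= position, or below_first if none."""
    # bisect_right counts how many offsets are <= position.
    index = bisect.bisect_right(offsets, position) - 1
    return index if index >= 0 else below_first


offsets = [5, 10, 15, 50, 100, 250]
print(range_index(9, offsets))    # 0
print(range_index(49, offsets))   # 2
print(range_index(3, offsets, below_first="something"))  # something
```

This works for an array of any length and costs O(log n) per scroll event instead of O(n).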
Okay this is going to be hard to explain. So bear with me.
I'm having less of a problem with the programming, and more of a problem with the idea behind what I'm trying to do.
I have a grid of triangles. Ref: http://i.imgur.com/08BPHiD.png
Each triangle is its own polygon on a canvas element that I have set as an object within the code. The only difference between the objects is the coordinates that I pass through as parameters of a function like so:
var triCoordX = [1, 2, 3, ...];
var triCoordY = [1, 2, 3, ...];
var triCoordFlipX = [1, 2, 3, ...];
var triCoordFlipY = [1, 2, 3, ...];
var createTri = function(x, y, z) {
return {
x: x,
y: y,
sides: 3,
radius: 15,
rotation: z,
fillRed: 17,
fillGreen: 17,
fillBlue: 17,
closed: true,
shadowColor: '#5febff',
shadowBlur: 5,
shadowOpacity: 0.18
}
};
for (i = 0; i < triCoordX.length; i++){
var tri = new Kinetic.RegularPolygon(createTri(triCoordX[i], triCoordY[i], 0));
}
for (i = 0; i < triCoordFlipX.length; i++){
var triFlip = new Kinetic.RegularPolygon(createTri(triCoordFlipX[i], triCoordFlipY[i], 180));
}
Now what I'm trying to do exactly is have each object polygon be able to 'recognise' its neighbors for various graphical effects.
How I propose to do this is pass a 4th parameter into the function that I push from another array using the for loop that sets a kind of "index" for each polygon. Also in the for loop I will define a function that points to the index 'neighbors' of the object polygon.
So for instance, if I want to select a random triangle from the grid and make it glow, and on completion of a tween want to make one of its neighbors glow, I will have the original triangle use its object function to identify a 'neighbor' index and pick at random one of its 3 'neighbors'.
The problem is that with this model, I'm not entirely sure how to do it without large amounts of bloat in my programming, or how to set up the loop so that it picks the correct index numbers for what are actually the triangle's neighbors.
If all of that made sense, I'm looking for any and all suggestions.
Think of your triangles as being laid out in a grid with the triangle in the top left corner being col==0, row==0.
Then you can find the row/col coordinates of the 3 neighbors of any triangle with the following function.
Ignore any neighbors with the following coordinates because the neighbors would be off the grid.
col<0
row<0
col>ColumnCount-1
row>RowCount-1
Example code (warning...untested code--you may have to tweak it):
function findNeighbors(t){
// determine if this triangle's row/col are even or odd
var evenRow=(t.row%2==0);
var evenCol=(t.col%2==0);
// declare the neighbors locally instead of leaking globals
var n1, n2, n3;
// left neighbor is always the same
n1={ col:t.col-1, row:t.row };
// right neighbor is always the same
n2={ col:t.col+1, row:t.row };
// third neighbor depends on row/col being even or odd
if(evenRow && evenCol){
n3={ col:t.col, row:t.row+1 };
}
if(evenRow && !evenCol){
n3={ col:t.col, row:t.row-1 };
}
if(!evenRow && evenCol){
n3={ col:t.col, row:t.row-1 };
}
if(!evenRow && !evenCol){
n3={ col:t.col, row:t.row+1 };
}
// return an array with the 3 neighbors
return([n1,n2,n3]);
}
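A compact Python sketch of the same neighbor rule, with the off-grid filtering described above folded in. It inherits the untested assumption about which row/column parity points up or down, so the vertical neighbor may need flipping for your layout; find_neighbors and its parameter names are hypothetical.

```python
def find_neighbors(col, row, column_count, row_count):
    """Return the (col, row) pairs of the on-grid neighbors of a triangle."""
    # Row and column both even or both odd: third neighbor is in the row
    # below. Otherwise it is in the row above.
    same_parity = (col % 2) == (row % 2)
    candidates = [
        (col - 1, row),  # left neighbor
        (col + 1, row),  # right neighbor
        (col, row + 1) if same_parity else (col, row - 1),
    ]
    # Drop neighbors that would fall off the grid.
    return [
        (c, r)
        for c, r in candidates
        if 0 <= c < column_count and 0 <= r < row_count
    ]


print(find_neighbors(0, 0, 4, 3))  # [(1, 0), (0, 1)]
```

Picking a random neighbor to glow next is then a matter of choosing one entry from the returned list.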
I have been racking my brain over how to make this work. I can find no examples of this and no previous questions. Basically, I have 121 thumbnail images (all with the exact same dimensions) that I want to arrange in an 11x11 grid with gutters. I want to place the first image in the center, then place each subsequent image at the vacant location closest to the center image until all images are used up. The list of images is assumed to come from an array object. What is the most efficient way of doing this?
Most likely not the most efficient way of solving this, but I wanted to play with it:
You could iterate over all the points in your grid, calculate their distances to the center point and then sort the points by this distance. The advantage over the algorithmic solutions is that you can use all sorts of distance functions:
// Setup constants
var arraySize = 11;
var centerPoint = {x:5, y:5};
// Calculate the Euclidean Distance between two points
function distance(point1, point2) {
return Math.sqrt(Math.pow(point1.x - point2.x, 2) + Math.pow(point1.y - point2.y, 2));
}
// Create array containing points with distance values
var pointsWithDistances = [];
for (var i=0; i<arraySize; i++) {
for (var j=0; j<arraySize; j++) {
var point = {x:i, y:j};
point.distance = distance(centerPoint, point);
pointsWithDistances.push(point);
}
}
// Sort points by distance value
pointsWithDistances.sort(function(point1, point2) {
return point1.distance == point2.distance ? 0 : point1.distance < point2.distance ? -1 : 1;
});
The resulting pointsWithDistances array will look like this:
[
{x:5, y:5, distance:0},
{x:4, y:5, distance:1},
{x:5, y:4, distance:1},
...
{x:4, y:4, distance:1.4142135623730951},
{x:4, y:6, distance:1.4142135623730951},
...
{x:3, y:5, distance:2},
...
]
By iterating over the array in this order you are effectively filling the grid from the center outwards.
(Thanks to Andreas Carlbom for the idea of how to display this structure.)
Check out the difference to using Rectilinear Distances:
// Rectilinear Distance between two points
function distance(point1, point2) {
return Math.abs(point1.x - point2.x) + Math.abs(point1.y - point2.y);
}
For the shell-like structure of the algorithmic approaches you can use the Maximum Metric:
// 'Maximum Metric' Distance between two points
function distance(point1, point2) {
return Math.max(Math.abs(point1.x - point2.x), Math.abs(point1.y - point2.y));
}
You can play with the code here: http://jsfiddle.net/green/B3cF8/
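For comparison, the same sort-by-distance idea can be sketched in Python with the metric passed in as a parameter. The function names are hypothetical; ties are left in generation order because Python's sort is stable.

```python
import math


def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def rectilinear(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def maximum_metric(a, b):
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def fill_order(size, center, distance=euclidean):
    """Return all (x, y) cells of a size x size grid, nearest to center first."""
    cells = [(x, y) for x in range(size) for y in range(size)]
    return sorted(cells, key=lambda cell: distance(center, cell))


order = fill_order(11, (5, 5))
print(order[0])   # (5, 5)
print(len(order)) # 121
```

Pairing the image array with this list (for example with zip) assigns each image its grid position, starting from the center.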