I have a master list with 8 items in it, and a number of other lists containing the same items as the master list but in a different order. How do I find a percentage similarity between each list and the master list?
For example, the master list might be:
[8,7,6,5,4,3,2,1];
One of the lists I want to compare it against could be:
[8,6,4,2,7,5,3,1];
I know I could just loop through the master list and check for matches, but is there an elegant way to work out how close each number in the list is to the same number in the master list?
For example:
position 0: '8' match in position 0; 0 positions difference (100%)
position 1: '7' match in position 4; 3 positions difference (57.1%)
position 2: '6' match in position 1; 1 position difference (85.7%)
etc.
The end result would be a percentage similarity between the two lists.
You could use the Array map and reduce functions:
function getSimilarity(a, b) {
  return a.map(function(val, index) {
    // calculate each value's position offset, normalized by the list length
    var posOffset = Math.abs(b.indexOf(val) - index);
    return posOffset / a.length;
  }).reduce(function(total, offset) {
    // each value contributes an equal share of (1 - offset) to the similarity
    return total + (1 - offset) / a.length;
  }, 0);
}
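For the example lists in the question, this works out to 0.8125, i.e. roughly 81% similar:

var master = [8,7,6,5,4,3,2,1];
var other = [8,6,4,2,7,5,3,1];
console.log(getSimilarity(master, other)); // 0.8125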
If the lists aren't guaranteed to have the same values, you would need to add handling for that.
Also note that the order in which you pass the arguments a and b to the getSimilarity function will affect the result. It's not clear whether this is an issue for your application.
PS: I think your question was down-voted for not including the code you already had trying to solve this problem.
I have an ordered data set of decimal numbers. This data is always similar, but not always the same. The expected data is a few (0 - 5) large numbers, followed by several (10 - 90) average numbers, followed by smaller numbers. There are cases where a large number may be mixed into the average numbers. See the following arrays.
let expectedData = [35.267,9.267,9.332,9.186,9.220,9.141,9.107,9.114,9.098,9.181,9.220,4.012,0.132];
let expectedData = [35.267,32.267,9.267,9.332,9.186,9.220,9.141,9.107,30.267,9.114,9.098,9.181,9.220,4.012,0.132];
I am trying to analyze the data by getting the average without the high numbers at the front and the low numbers at the back. High/low values in the middle are fine to keep in the average. I have a partial solution below. Right now I am sort of brute forcing it, but the solution isn't perfect. On smaller datasets the first average calculation is influenced by the large number.
My question is: Is there a way to handle this type of problem, which is identifying patterns in an array of numbers?
My algorithm is:
Get an average of the array
Calculate an above/below average value
Remove front (n) elements that are above average
Remove end elements that are below average
Recalculate average
In JavaScript I have (this is partial, leaving out the below-average part):
let total = expectedData.reduce((rt, cur) => { return rt + cur; }, 0);
let avg = total / expectedData.length;
let aboveAvg = avg * 0.1 + avg;
let remove = -1;
for (let k = 0; k < expectedData.length; k++) {
  if (expectedData[k] > aboveAvg) {
    remove = k;
  } else {
    if (k == 0) {
      remove = -1; //no need to remove
    }
    //break because we don't want large values from middle removed.
    break;
  }
}
if (remove >= 0) {
  //remove front above average
  expectedData.splice(0, remove + 1);
}
//remove belows
//recalculate average
I believe you are looking for an outlier detection algorithm. There are already a bunch of questions related to this on Stack Overflow.
However, each outlier detection algorithm has its own merits.
Here are a few of them:
https://mathworld.wolfram.com/Outlier.html
High outliers are anything beyond the 3rd quartile + 1.5 * the inter-quartile range (IQR)
Low outliers are anything beneath the 1st quartile - 1.5 * IQR (a code sketch of this rule follows below)
Grubbs's test
You can check how it works for your expectations here
Apart from these two, there is a comparison calculator here. You can visit it to use other algorithms per your need.
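As an illustration of the IQR rule above, a minimal JavaScript sketch might look like this (it assumes a simple linear-interpolation quartile estimate and is not taken from any particular library):

function removeOutliersIQR(data) {
  const sorted = data.slice().sort((a, b) => a - b);
  // quartile estimated by linear interpolation between the two nearest ranks
  const quantile = p => {
    const pos = (sorted.length - 1) * p;
    const lo = Math.floor(pos), hi = Math.ceil(pos);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
  };
  const q1 = quantile(0.25), q3 = quantile(0.75);
  const iqr = q3 - q1;
  // keep only values inside [q1 - 1.5*IQR, q3 + 1.5*IQR]
  return data.filter(v => v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr);
}

console.log(removeOutliersIQR([35.267,9.267,9.332,9.186,9.220,9.141,9.107,9.114,9.098,9.181,9.220,4.012,0.132]));
// keeps only the 9.x values; 35.267, 4.012 and 0.132 are dropped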
I would first try a sliding window coupled with a hysteresis / band filter in order to detect the high-value peaks.
Then, as your sliding window advances, you can add the previous first value (which is now the last of the analyzed values) to the global sum, and add 1 to the total number of values.
When you encounter a peak (= something that causes the hysteresis to move or overflow the band filter), you either remove the values (which may be costly), or better, set the value to NaN so you can safely ignore it.
You should keep computing a sliding average within your sliding window in order to be able to auto-correct the hysteresis / band filter, so that it rejects only the starting values of a peak (the ending values are the starting values of the next one); once values have stabilized at a new level, they will be kept again.
The size of the sliding window sets how many consecutive "stable" values are needed before values are kept, or in other words how many unstable values are rejected when you reach a new level.
For that, you can compute the mode of the (rounded) values and then take all the numbers in a certain range around the mode. That range can be taken from the data itself, for example as 10% of the max - min value. That helps you filter your data, and you can select the percentage that fits your needs. Something like this:
let expectedData = [35.267,9.267,9.332,9.186,9.220,9.141,9.107,9.114,9.098,9.181,9.220,4.012,0.132];
expectedData.sort((a, b) => a - b);
/// Get the range of the data
const RANGE = expectedData[ expectedData.length - 1 ] - expectedData[0];
const WINDOW = 0.1; /// Window of selection 10% from left and right
/// Frequency of each number
let dist = expectedData.reduce((acc, e) => (acc[ Math.floor(e) ] = (acc[ Math.floor(e) ] || 0) + 1, acc), {});
let mode = +Object.entries(dist).sort((a, b) => b[1] - a[1])[0][0];
let newData = expectedData.filter(e => mode - RANGE * WINDOW <= e && e <= mode + RANGE * WINDOW);
console.log(newData);
Suppose you are given a bunch of points as (x, y) values and you need to generate points by linearly interpolating between the 2 nearest values on the x axis; what is the fastest implementation to do so?
I searched around but I was unable to find a satisfactory answer; I feel it's because I wasn't searching for the right words.
For example, if I was given (0, 0) (0.5, 1) (1, 0.5) and I want to get a value at 0.7, it would be (0.7-0.5)/(1-0.5) * (0.5-1) + 1; but what data structure would allow me to find the 2 nearest key values to interpolate between? Is a simple linear search / binary search the best I could do if I have many key values?
The way I usually implement O(1) interpolation is by means of an additional data structure, which I call an IntervalSelector, that in O(1) time will give the two surrounding values of the sequence that have to be interpolated between.
An IntervalSelector is a class that, when given a sequence of n abscissas, builds and remembers a table that will map any given value of x to the index i such that sequence[i] <= x < sequence[i+1] in O(1) time.
Note: In what follows arrays are 1 based.
The algorithm that builds the table proceeds as follow:
Find delta to be the minimum distance between two consecutive elements in the input sequence of abscissas.
Set count := (b-a)/delta + 1, where a and b are respectively the first and last of the (ascending) sequence and / stands for the integer quotient of the division.
Define table to be an Array of count elements.
For j between 1 and n, set table[(sequence[j]-a)/delta + 1] := j.
Copy every entry of table set in step 4 forward into the unvisited positions that come right after it.
On output, table maps j to i if (j-1)*delta <= sequence[i] - a < j*delta.
Here is an example: since the 3rd and 4th elements are the closest ones, we divide the interval into subintervals of this smallest length. Now, we remember in the table the position of the left end of each of these delta-intervals. Later on, when an input x is given, we compute the delta-interval of that x as (x-a)/delta + 1 and use the table to deduce the corresponding interval in the sequence. If x falls to the left of the ith sequence element, we choose the (i-1)th.
More precisely:
Given any input x between a and b, calculate j := (x-a)/delta + 1 and i := table[j]. If x < sequence[i], put i := i - 1. Then the index i satisfies sequence[i] <= x < sequence[i+1]; otherwise the distance between these two consecutive elements would be smaller than delta, which it is not.
Remark: Be aware that if the minimum distance delta between consecutive elements in sequence is too small the table will have too many entries. The simple description I've presented here ignores these pathological cases, which require additional work.
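A rough JavaScript sketch of this scheme, using 0-based arrays rather than the 1-based description above (my own translation; it ignores the pathological tiny-delta case):

// Builds an O(1) lookup from x to the index i with seq[i] <= x < seq[i+1].
// seq must be sorted in ascending order.
function makeIntervalSelector(seq) {
  let delta = Infinity;
  for (let i = 1; i < seq.length; i++) {
    delta = Math.min(delta, seq[i] - seq[i - 1]);
  }
  const a = seq[0];
  const count = Math.floor((seq[seq.length - 1] - a) / delta) + 1;
  const table = new Array(count).fill(-1);
  // mark the delta-cell that contains each abscissa
  for (let j = 0; j < seq.length; j++) {
    table[Math.floor((seq[j] - a) / delta)] = j;
  }
  // propagate each mark forward into the unmarked cells that follow it
  for (let k = 1; k < count; k++) {
    if (table[k] === -1) table[k] = table[k - 1];
  }
  return function (x) {
    let i = table[Math.floor((x - a) / delta)];
    if (x < seq[i]) i -= 1;
    return i;
  };
}

const select = makeIntervalSelector([0, 0.5, 1]);
console.log(select(0.7)); // 1, i.e. x lies between the points at 0.5 and 1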
Yes, a simple binary search should do well and will typically suffice.
If you need to do better, you might try interpolation search (which has nothing to do with your value interpolation).
If your points are distributed at fixed intervals (like in your example, 0 0.5 1), you can also simply store the values in an array and access them in constant time via their index.
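For the binary-search route, a small sketch, assuming the points are kept sorted by x in an array of [x, y] pairs:

function interpolateAt(points, xq) {
  // binary search for the rightmost point with x <= xq
  let lo = 0, hi = points.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (points[mid][0] <= xq) lo = mid;
    else hi = mid - 1;
  }
  if (lo === points.length - 1) return points[lo][1]; // clamp at the right edge
  const [x0, y0] = points[lo];
  const [x1, y1] = points[lo + 1];
  return y0 + (xq - x0) / (x1 - x0) * (y1 - y0);
}

console.log(interpolateAt([[0, 0], [0.5, 1], [1, 0.5]], 0.7)); // 0.8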
Considering performance, what's the best way to get a random subset from an array?
Say we have an array with 90000 items and I want to get 10000 random items from it.
One approach I'm thinking about is to get a random index from 0 to array.length and then remove the selected item from the original array using Array.prototype.splice, then get the next random item from the rest.
But the splice method will rearrange the indices of all the items after the one we just selected and move them forward one step. Doesn't that affect performance?
Items may have duplicates, but the indices we select should not. Say we've selected index 0; then we should only look at the remaining indices 1~89999.
If you want a subset of the shuffled array, you do not need to shuffle the whole array. You can stop the classic Fisher-Yates shuffle when you have drawn your 10000 items, leaving the other 80000 indices untouched.
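A sketch of that idea: run the Fisher-Yates swap loop but stop after 10000 draws (the copy avoids modifying the input):

function sampleWithoutReplacement(arr, k) {
  const a = arr.slice();
  for (let i = 0; i < k; i++) {
    // swap a uniformly chosen not-yet-drawn element into position i
    const j = i + Math.floor(Math.random() * (a.length - i));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a.slice(0, k);
}

// e.g. sampleWithoutReplacement(bigArray, 10000), bigArray being the 90000-item array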
I would first randomize the whole array and then splice off 10000 items.
How to randomize (shuffle) a JavaScript array?
This explains a good way to randomize an array in JavaScript.
A reservoir sampling algorithm can do this.
Here's an attempt at implementing Knuth's "Algorithm S" from TAOCP Volume 2 Section 3.4.2:
function sample(source, size) {
  var chosen = 0,
      srcLen = source.length,
      result = new Array(size);
  for (var seen = 0; chosen < size; seen++) {
    var remainingInput = srcLen - seen,
        remainingOutput = size - chosen;
    if (remainingInput * Math.random() < remainingOutput) {
      result[chosen++] = source[seen];
    }
  }
  return result;
}
Basically it makes one pass over the input array, choosing or skipping items based on a function of a random number, the number of items remaining in the input, and the number of items still required in the output.
There are three potential problems with this code:
1. I may have mucked it up,
2. Knuth calls for a random number "between zero and one" and I'm not sure if this means the [0, 1) interval JavaScript provides or the fully closed or fully open interval,
3. it's vulnerable to PRNG bias.
The performance characteristics should be very good. It's O(srcLen). Most of the time we finish before going through the entire input. The input is accessed in order, which is a good thing if you are running your code on a computer that has a cache. We don't even waste any time reading or writing elements that don't ultimately end up in the output.
This version doesn't modify the input array. It is possible to write an in-place version, which might save some memory, but it probably wouldn't be much faster.
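For the numbers in the question, usage would be something like this (sourceArray standing in for the 90000-item array):

var subset = sample(sourceArray, 10000); // 10000 random items, kept in source order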
My coworker and I are arguing about why the shuffle algorithm given in this list of JS tips & tricks doesn't produce biased results like the sort Jeff Atwood describes for naive shuffles.
The array shuffle code in the tips is:
list.sort(function() { return Math.random() - 0.5; });
Jeff's naive shuffle code is:
for (int i = 0; i < cards.Length; i++)
{
    int n = rand.Next(cards.Length);
    Swap(ref cards[i], ref cards[n]);
}
I wrote this JS to test the shuffle:
var list = [1,2,3];
var result = {123:0,132:0,321:0,213:0,231:0,312:0};
function shuffle() { return Math.random() - 0.5; }
for (var i = 0; i < 60000000; i++) {
  result[ list.sort(shuffle).join('') ]++;
}
For which I get results (from Firefox 5) like:
Order   Count       %Diff from True Avg
123     9997461     -0.0002539
132     10003451     0.0003451
213     10001507     0.0001507
231     9997563     -0.0002437
312     9995658     -0.0004342
321     10004360     0.000436
Presumably Array.sort is walking the list array and performing swaps of (adjacent) elements similar to Jeff's example. So why don't the results look biased?
I found the reason it appears unbiased.
Array.sort() not only returns the array, it changes the array itself. If we re-initialize the array for each loop, we get results like:
123 14941
132 7530
321 7377
213 15189
231 7455
312 7508
Which shows a very significant bias.
For those interested, here's the modified code:
var result = {123:0,132:0,321:0,213:0,231:0,312:0};
var iterations = 60000;
var comparisons = 0;
function shuffle() {
  comparisons++;
  return Math.random() - 0.5;
}
for (var i = 0; i < iterations; i++) {
  var list = [1,2,3];
  result[ list.sort(shuffle).join('') ]++;
}
console.log(result);
The problem with the naive shuffle is that the value might have already been swapped and you might swap it again later. Let's say you have three cards and you pick one truly at random for the first card. If you later can randomly swap that card with a latter one then you are taking away from the randomness of that first selection.
If the sort is quicksort, it continually splits the list about in half. The next iteration splits each of those groups into two groups randomly. This keeps going on until you are down to single cards, then you combine them all together. The difference is that you never take a card from the second randomly selected group and move it back to the first group.
The Knuth-Fisher-Yates shuffle is different from the naive shuffle because you only pick each card once. If you were picking random cards from a deck, would you put a card back and pick again? No, you take random cards one at a time. This is the first I've heard of it, but I've done something similar back in high school, going from index 0 up. KFY is probably faster because my version has an extra addition in the random statement.
for (int i = 0; i < cards.Length - 1; i++)
{
    int n = rand.Next(cards.Length - i) + i; // (i to cards.Length - 1)
    Swap(ref cards[i], ref cards[n]);
}
Don't think of it as swapping, think of it as selecting random cards from a deck. For each element in the array (except the last, because there is only one left) you pick a random card out of all the remaining cards and lay it down, forming a new stack of cards that is randomly shuffled. It doesn't matter that your remaining cards are no longer in the original order if you've done any swapping already; you are still picking one random card from all the remaining cards.
The random quicksort is like taking a stack of cards and randomly dividing them into two groups, then taking each group and randomly dividing it into two smaller groups, and on and on until you have individual cards then putting them back together.
Actually, that doesn't implement his naïve random sort. His algorithm actually transposes array keys manually, while sort actively sorts a list.
sort uses quicksort or insertion sort (thanks to cwolves for pointing that out -- see comments) to do this (this will vary based on the implementation):
Is A bigger or smaller than B? Smaller? Decrement.
Is A bigger or smaller than C? Smaller? Decrement.
Is A bigger or smaller than D? Smaller? Insert A after D
Is B bigger or smaller than C? Smaller? Decrement.
Is B bigger or smaller than D? Smaller? Insert B after D and before A...
This means that your big O for the average case is O(n log n) and your big O for the worst case is O(n^2) for each loop iteration.
Meanwhile the Atwood naïve random sort is a simple:
Start at A. Find random value. Swap.
Go to B. Find random value. Swap.
Go to C. Find random value. Swap.
(Knuth-Fisher-Yates is almost the same, only backwards)
So his has a big O of O(n) for the worst case and O(n) for the average case.
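For reference, a minimal Knuth-Fisher-Yates shuffle in JavaScript, written backwards as mentioned above (a sketch, not code from either post):

function shuffleInPlace(arr) {
  // walk from the end, swapping each position with a random one at or before it
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}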
I'm writing a bit of JavaScript code which should select a random item from a canvas if the item meets certain requirements. There are different kinds of items (circles, triangles, squares etc.) and there's usually not the same number of items for each kind. The items are arranged in a hierarchy (so a square can contain a few circles, and a circle can contain other circles and so on - they can all be nested).
Right now, my (primitive) approach at selecting a random item is to:
Recursively traverse the canvas and build a (possibly huge!) list of items
Shuffle the list
Iterate the shuffled list from the front until I find an item which meets some extra requirements.
The problem with this is that it doesn't scale well. I often run into memory issues because either the recursion depth is too high or the total list of items becomes too large.
I was considering rewriting this code so that I choose elements as I traverse the canvas, but I don't know how I could "randomly" choose an element if I don't know how many elements there are in total.
Does anybody have some idea how to solve this?
Start with max_r = -1 and rand_node = null. Iterate through the tree. For each node meeting the criteria:
r = random()
if r > max_r:
    rand_node = node
    max_r = r
At the end, rand_node will be a uniformly selected node, with only a single traversal of the tree required.
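In JavaScript that could be sketched roughly like this (matches and node.children are hypothetical stand-ins for your requirements and canvas hierarchy):

function pickByMaxKey(root, matches) {
  let maxR = -1;
  let randNode = null;
  (function visit(node) {
    if (matches(node)) {
      const r = Math.random();   // each matching node draws a random key
      if (r > maxR) {            // keep the node with the largest key so far
        maxR = r;
        randNode = node;
      }
    }
    for (const child of node.children || []) visit(child);
  })(root);
  return randNode;
}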
You can do this without first creating a list (sorry for my C pseudo code)
int index = 0;
foreach (element)
{
    if (element matches criteria)
    {
        index++;
        int rand = random value in [1, index] range
        if (1 == rand)
        {
            chosen = element;
        }
    }
}
The math works out. Say you have a list where 3 of the elements match the criteria; the probability that the first element will be chosen is:
1 * (1 - 1 / 2) * (1 - 1 / 3) = 1 * (1 / 2) * (2 / 3) = 1 / 3
The probability of the second being chosen is:
(1 / 2) * (1 - 1 / 3) = (1 / 2) * (2 / 3) = 1 / 3
and finally for the third element
1 / 3
Which is the correct answer.
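Translated to JavaScript and applied during a recursive traversal, the same idea might look like this (again with matches and node.children as hypothetical stand-ins):

function pickRandomMatch(root, matches) {
  let count = 0;
  let chosen = null;
  (function visit(node) {
    if (matches(node)) {
      count++;
      // replace the current choice with probability 1/count
      if (Math.random() < 1 / count) chosen = node;
    }
    for (const child of node.children || []) visit(child);
  })(root);
  return chosen;
}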
You could recursively do this:
Traverse the current tree level, creating a list
Select a random node from it (using a random number from 0 to list.Length)
Repeat until you get to a leaf node
That would give items in small subtrees a higher probability of being selected, though.
Alternatively, you don't need to build a list, you only need to keep track of the number of items. That would mean traversing the tree a second time to get to the selected item, but you wouldn't need the extra memory for the list.
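A sketch of that second, list-free variant: count the matches in one pass, pick a random index, then walk the tree again to fetch that match (same hypothetical matches/children stand-ins as above):

function pickByCounting(root, matches) {
  let count = 0;
  (function countVisit(node) {
    if (matches(node)) count++;
    for (const child of node.children || []) countVisit(child);
  })(root);
  if (count === 0) return null;

  let target = Math.floor(Math.random() * count); // index of the match to return
  let chosen = null;
  (function fetchVisit(node) {
    if (chosen !== null) return;
    if (matches(node) && target-- === 0) chosen = node;
    for (const child of node.children || []) fetchVisit(child);
  })(root);
  return chosen;
}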