I have an ordered data set of decimal numbers. This data is always similar, but not always the same. The expected data is a few (0-5) large numbers, followed by several (10-90) average numbers, then followed by smaller numbers. There are cases where a large number may be mixed in among the average numbers. See the following arrays.
let expectedData = [35.267,9.267,9.332,9.186,9.220,9.141,9.107,9.114,9.098,9.181,9.220,4.012,0.132];
let expectedData = [35.267,32.267,9.267,9.332,9.186,9.220,9.141,9.107,30.267,9.114,9.098,9.181,9.220,4.012,0.132];
I am trying to analyze the data by getting the average without the high numbers at the front and the low numbers at the back. The middle high/low values are fine to keep in the average. I have a partial solution below. Right now I am sort of brute-forcing it, but the solution isn't perfect. On smaller datasets the first average calculation is influenced by the large number.
My question is: Is there a way to handle this type of problem, which is identifying patterns in an array of numbers?
My algorithm is:
Get an average of the array
Calculate an above/below average value
Remove front (n) elements that are above average
remove end elements that are below average
Recalculate average
In JavaScript I have (the below-average pass mirrors the above-average one):
let total = expectedData.reduce((rt, cur) => rt + cur, 0);
let avg = total / expectedData.length;
let aboveAvg = avg * 0.1 + avg;
let remove = -1;
for (let k = 0; k < expectedData.length; k++) {
    if (expectedData[k] > aboveAvg) {
        remove = k;
    } else {
        if (k == 0) {
            remove = -1; // no need to remove
        }
        // break because we don't want large values from the middle removed
        break;
    }
}
if (remove >= 0) {
    // remove front above-average values
    expectedData.splice(0, remove + 1);
}
// remove trailing below-average values (mirror of the front pass)
let belowAvg = avg - avg * 0.1;
remove = -1;
for (let k = expectedData.length - 1; k >= 0; k--) {
    if (expectedData[k] < belowAvg) {
        remove = k;
    } else {
        break;
    }
}
if (remove >= 0) {
    expectedData.splice(remove);
}
// recalculate the average on the trimmed array
let newAvg = expectedData.reduce((rt, cur) => rt + cur, 0) / expectedData.length;
I believe you are looking for an outlier detection algorithm. There are already a bunch of questions related to this on Stack Overflow.
However, each outlier detection algorithm has its own merits.
Here are a few of them:
https://mathworld.wolfram.com/Outlier.html
High outliers are anything beyond the 3rd quartile + 1.5 * the inter-quartile range (IQR)
Low outliers are anything beneath the 1st quartile - 1.5 * IQR
Grubbs's test
You can check how it works for your expectations here.
Apart from these two, there is a comparison calculator here. You can visit it to apply other algorithms per your need.
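For illustration, here is a minimal sketch of the IQR rule above in JavaScript (the quartile computation uses linear interpolation, which is one common convention among several):
// Sketch: drop anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
function quantile(sorted, q) {
    const pos = (sorted.length - 1) * q;
    const base = Math.floor(pos);
    const rest = pos - base;
    const next = sorted[base + 1] !== undefined ? sorted[base + 1] : sorted[base];
    return sorted[base] + rest * (next - sorted[base]);
}
function removeOutliersIQR(data) {
    const sorted = [...data].sort((a, b) => a - b);
    const q1 = quantile(sorted, 0.25);
    const q3 = quantile(sorted, 0.75);
    const iqr = q3 - q1;
    return data.filter(x => x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr);
}
console.log(removeOutliersIQR(
    [35.267, 9.267, 9.332, 9.186, 9.220, 9.141, 9.107, 9.114, 9.098, 9.181, 9.220, 4.012, 0.132]
));
Run on the first sample array, this keeps only the ten average values and drops 35.267, 4.012 and 0.132.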
I would have tried a sliding window coupled with a hysteresis / band filter in order to detect the high-value peaks first.
Then, when your sliding window advances, you can add the previous first value (which is now the last of the analyzed values) to the global sum, and add 1 to the total count of values.
When you encounter a peak (= something that causes the hysteresis to move or overflow the band filter), you either remove the values (which may be costly) or, better, set the value to NaN so you can safely ignore it.
You should keep computing a sliding average within your sliding window in order to auto-correct the hysteresis/band filter, so it rejects only the start values of a peak (the end values are the start values of the next one); once values have stabilized at a new level, they will be kept again.
The size of the sliding window sets how many consecutive "stable" values are needed before values are kept, or in other words how many unstable values are rejected when you reach a new level.
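A rough sketch of that idea (the window size and the 10% band below are illustrative assumptions, not tuned values):
// A value is committed to the global sum only once it has survived a full
// window of mutually consistent values; values that break the band before
// the window stabilizes are dropped (the "set it to NaN" case above).
function robustMean(data, windowSize = 3, band = 0.1) {
    let sum = 0, count = 0, window = [];
    const commit = v => { sum += v; count++; };
    for (const v of data) {
        const avg = window.length
            ? window.reduce((s, x) => s + x, 0) / window.length
            : v;
        if (Math.abs(v - avg) <= Math.abs(avg) * band) {
            window.push(v); // value agrees with the current level
            if (window.length > windowSize) commit(window.shift());
        } else if (window.length >= windowSize) {
            window.forEach(commit); // the old level was stable: keep it
            window = [v];           // start tracking a possible new level
        } else {
            window = [v]; // the old values never stabilized: reject them
        }
    }
    if (window.length >= windowSize) window.forEach(commit);
    return sum / count;
}
console.log(robustMean(
    [35.267, 9.267, 9.332, 9.186, 9.220, 9.141, 9.107, 9.114, 9.098, 9.181, 9.220, 4.012, 0.132]
)); // ≈ 9.187, the mean of just the ten average values
Note that, unlike what the question asks for, this also rejects large values in the middle of the data; whether that is acceptable depends on how often such spikes occur.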
For that, you can check the mode of the (rounded) values and then take all the numbers within a certain range around the mode. That range can be taken from the data itself, for example 10% of the max - min span. That helps you filter your data, and you can select the percentage that fits your needs. Something like this:
let expectedData = [35.267,9.267,9.332,9.186,9.220,9.141,9.107,9.114,9.098,9.181,9.220,4.012,0.132];
expectedData.sort((a, b) => a - b);
/// Get the range of the data
const RANGE = expectedData[ expectedData.length - 1 ] - expectedData[0];
const WINDOW = 0.1; /// Window of selection 10% from left and right
/// Frequency of each number
let dist = expectedData.reduce((acc, e) => (acc[ Math.floor(e) ] = (acc[ Math.floor(e) ] || 0) + 1, acc), {});
let mode = +Object.entries(dist).sort((a, b) => b[1] - a[1])[0][0];
let newData = expectedData.filter(e => mode - RANGE * WINDOW <= e && e <= mode + RANGE * WINDOW);
console.log(newData);
You are given a starting number and ending number and the max number of output elements allowed. How would you create an output array with as even a distribution as possible, while still including the first and last points in the output?
Function signature
function generatePoints(startingNumber, endingNumber, maxPoints) {}
Function desired output
generatePoints(0, 8, 5) // [0, 2, 4, 6, 8]
Here's what I tried so far
function generatePoints(startingNumber, endingNumber, maxPoints) {
    const interval = Math.round((endingNumber - startingNumber) / maxPoints)
    let count = 0
    let counter = 0
    let points = []
    while (count < maxPoints - 1) {
        points.push(counter)
        counter += interval
        count++
    }
    points.push(endingNumber)
    return points
}
Technically this creates the correct output for the simple case, but it falls short on most other edge cases because I'm stopping one iteration early and then appending the final point. I'm thinking that a better way to do this (to create a better distribution) is to build from the center of the array outwards, rather than building from the start, stopping one element early, and appending the endingNumber.
Note this:
0 2 4 6 8
+-----+ +-----+ +-----+ +-----+
A B C D
Splitting our range into intervals with 5 points including the endpoints, we have only four intervals. It will always be one fewer than the number of points. We can divide our range up evenly into these smaller ranges, simply by continually adding the width of one interval, which is just (endingNumber - startingNumber) / (maxPoints - 1). We can do it like this:
const generatePoints = (startingNumber, endingNumber, maxPoints) =>
    Array.from(
        {length: maxPoints},
        (_, i) => startingNumber + i * (endingNumber - startingNumber) / (maxPoints - 1)
    )

console.log(generatePoints(0, 8, 5))
We just build an array of the right length, using the index parameter to count the number of smaller intervals we're using.
We do no error-checking here, and if maxPoints were just 1, we might have an issue. But that's easy enough to handle how you like.
But there is a concern here. Why is the parameter called maxPoints instead of points? If the number of points allowed is variable, I think we need further requirements.
Do not Math.round(interval). Instead, Math.round(counter) at the last moment.
The reason is that if you've added k rounded intervals, the accumulated error can be as much as 0.5*k. But if you round at the last minute, the error is never more than 0.5.
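Applied to the loop from the question (and using maxPoints - 1 intervals, as in the answer above), that advice looks roughly like this:
function generatePoints(startingNumber, endingNumber, maxPoints) {
    // keep the exact fractional interval; round only when emitting a point
    const interval = (endingNumber - startingNumber) / (maxPoints - 1)
    let points = []
    for (let count = 0; count < maxPoints - 1; count++) {
        points.push(Math.round(startingNumber + count * interval))
    }
    points.push(endingNumber)
    return points
}
console.log(generatePoints(0, 8, 5))  // [0, 2, 4, 6, 8]
console.log(generatePoints(0, 10, 4)) // [0, 3, 7, 10]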
I need to develop an algorithm that randomly selects values within user-specified intervals. Furthermore, these values need to be separated by a minimum user-defined distance. In my case the values and intervals are times, but this may not be important for the development of a general algorithm.
For example: a user may define three time intervals (0900-1200, 1200-1500, 1500-1800) from which 3 values (1 per interval) are to be selected. The user may also say they want the values to be separated by at least 30 minutes. Thus, the values cannot be 1159, 1201, 1530, because the first two elements are separated by only 2 minutes.
A few hundred (however many I am able to give) points will be awarded to the most efficient algorithm. The answer can be language agnostic, but answers either in pseudocode or JavaScript are preferred.
Note:
The number of intervals, and the length of each interval, are completely determined by the user.
The distance between two randomly selected points is also completely determined by the user (but must be less than the length of the next interval)
The user-defined intervals will not overlap
There may be gaps between the user-defined intervals (e.g., 0900-1200, 1500-1800, 2000-2300)
I already have the following two algorithms and am hoping to find something more computationally efficient:
Randomly select value in Interval #1. If this value is less than user-specified distance from the beginning of Interval #2, adjust the beginning of Interval #2 prior to randomly selecting a value from Interval #2. Repeat for all intervals.
Randomly select values from all intervals. Loop through array of selected values and determine if they are separated by user-defined minimum distance. If not (i.e., values are too close), randomly select new values. Repeat until valid array.
This works for me, and I'm currently not able to make it "more efficient":
function random(intervals, gap = 1) {
    if (!intervals.length) return [];
    // ensure the ordering of the groups
    intervals = intervals.sort((a, b) => a[0] - b[0])
    let res = []
    for (let i = 0; i < intervals.length; i++) {
        let [min, max] = intervals[i]
        // check whether a valid number can exist at all
        if (i < intervals.length - 1 && min + gap > intervals[i + 1][1]) {
            throw new Error("invalid ranges and gap")
        }
        // if we can't create a number in the current section, go back and
        // generate another number from the previous interval
        if (i > 0 && res[i - 1] + gap > max) {
            // reset the max value for the previous interval to force a smaller number
            intervals[i - 1][1] = res[i - 1] - 1
            res.pop()
            i -= 2
        }
        else {
            // the lower bound is the larger of the interval's min and the
            // previous number + gap
            if (i > 0) {
                min = Math.max(res[i - 1] + gap, min)
            }
            // usual formula to get a random number in a specific interval
            res.push(Math.round(Math.random() * (max - min) + min))
        }
    }
    return res
}

console.log(random([
    [900, 1200],
    [1200, 1500],
    [1500, 1800],
], 400))
this works like:
generate the first number
check whether the second number can be generated (for the gap rule)
- if it can, generate it and go back to step 2 (but with the third number)
- if it can't, set the max of the previous interval to the number just generated, and generate that one again (so that it comes out lower)
I can't figure out the complexity, since random numbers are involved, but it might happen that with 100 intervals, at the generation of the 100th random number, you find that you can't, so in the worst case this goes back and regenerates everything from the first number.
However, every time it goes back, it shrinks the range of an interval, so it will converge to a solution if one exists.
This seems to do the job. For explanations see comments in the code ...
Be aware that this code does not do any checks on your conditions, i.e. non-overlapping intervals and intervals big enough to allow the minimum distance to be fulfilled. If the conditions are not met, it may generate erroneous results.
This algorithm allows the minimum distance between two values to be defined with each interval separately.
Also be aware, that an interval limit like 900 in this algorithm does not mean 9:00 o'clock, but just the numeric value of 900. If you want the intervals to represent times, you have to represent them as, for instance, minutes since midnight. Ie 9:00 will become 540, 12:00 will become 720 and 15:00 will become 900.
EDIT
With the current edit it also supports wrap-arounds at midnight (although it does not support intervals or minimum distances of more than a whole day).
//these are the values entered by the user
//as intervals are non overlapping I interpret
//an interval [100, 300, ...] as 100 <= x < 300
//ie the upper limit is not part of that interval
//the third value in each interval is the minimum
//distance from the last value, ie [500, 900, 200]
//means a value between 500 and 900 and it must be
//at least 200 away from the last value
//if the upper limit of an interval is less than the lower limit
//this means a wrap-around at midnight.
//the minimum distance in the first interval is obviously 0
let intervals = [
    [100, 300, 0],
    [500, 900, 200],
    [900, 560, 500]
]
//the total upper limit of an interval (for instance number of minutes in a day)
//length of a day. if you don't need wrap-arounds set to
//Number.MAX_SAFE_INTEGER
let upperlimit = 1440;

//generates a random value x with min <= x < max
function rand(min, max) {
    return Math.floor(Math.random() * (max - min)) + min;
}

//holds all generated values
let vals = [];
let val = 0;
//iterate over all intervals to generate one value from each interval
for (let iv of intervals) {
    //the next random must be greater than the beginning of the interval
    //and, if the last value is within range of mindist, also greater than
    //lastval + mindist
    let min = Math.max(val + iv[2], iv[0]);
    //the next random must be less than the end of the interval
    //if the end of the interval is less than the current min
    //there is a wrap-around at midnight, thus extend the max
    let max = iv[1] < min ? iv[1] + upperlimit : iv[1];
    //generate the next val. if it's greater than upperlimit
    //it's on the next day, thus remove a whole day
    val = rand(min, max);
    if (val > upperlimit) val -= upperlimit;
    vals.push(val);
}
console.log(vals)
As you may notice, this is more or less an implementation of your proposal #1, but I don't see any way of making it more "computationally efficient" than that. You can't get around selecting one value from each interval, and the most efficient way of always generating a valid number is to adjust the lower limit of the interval, if necessary.
Of course, with this approach, the selection of next number is always limited by the selection of the previous. Especially if the minimum distance between two numbers is near the length of the interval, and the previous number selected was rather at the upper limit of its interval.
This can simply be done by separating the intervals by the required number of minutes. However, there might be edge cases, like a given interval being shorter than the separation, or even worse, two consecutive intervals together being shorter than the separation, in which case you can safely throw an error; e.g. in the [[900,1200],[1200,1500]] case, had 1500 - 900 < 30 been true. So you had best check this per consecutive pair of tuples and throw an error if they don't satisfy the condition before trying any further.
Then it gets a little hairy. I mean, probabilistically. A naive approach would choose a random value in [900,1200] and, depending on the result, add 30 to it and limit the bottom boundary of the second tuple accordingly. Say the random number chosen in [900,1200] turns out to be 1190; then we would force the second random number to be chosen from [1220,1500]. This makes the second random choice dependent on the outcome of the first, and as far as I remember from probability lessons, that is no good. I believe we have to find all possible borders, make a random choice among them, and then make two safe random choices, one from each range.
Another point to consider: this might be a long list of tuples to start with, so we should take care not to limit the second tuple in each turn, since it will be the first tuple in the next turn and we would like it to be as wide as possible. So perhaps taking the minimum possible value from the first range (limiting the first range as much as possible) may turn out to be more productive than random tries, which might (most probably) cause problems in further steps.
I could give you the code, but since you haven't shown any attempts, you'll have to settle for this rod and go fishing yourself.
Suppose you are given a bunch of points as (x,y) values, and you need to generate points by linearly interpolating between the 2 nearest values on the x axis. What is the fastest implementation to do so?
I searched around but was unable to find a satisfactory answer; I feel it's because I wasn't searching for the right words.
For example, if I am given (0,0) (0.5, 1) (1, 0.5) and I want the value at 0.7, it would be (0.7-0.5)/(1-0.5) * (0.5-1) + 1; but what data structure would allow me to find the 2 nearest key values to interpolate between? Is a simple linear or binary search over the keys the best I can do?
The way I usually implement O(1) interpolation is by means of an additional data structure, which I call IntervalSelector that in time O(1) will give the two surrounding values of the sequence that have to be interpolated.
An IntervalSelector is a class that, when given a sequence of n abscissas builds and remembers a table that will map any given value of x to the index i such that sequence[i] <= x < sequence[i+1] in time O(1).
Note: in what follows, arrays are 1-based.
The algorithm that builds the table proceeds as follows:
1. Find delta, the minimum distance between two consecutive elements in the input sequence of abscissas.
2. Set count := (b-a)/delta + 1, where a and b are respectively the first and last elements of the (ascending) sequence and / stands for integer division.
3. Define table to be an Array of count elements.
4. For j between 1 and n, set table[(sequence[j]-a)/delta + 1] := j.
5. Copy every entry of table set in step 4 into the unvisited positions that come right after it.
On output, table maps j to i if (j-1)*delta <= sequence[i] - a < j*delta.
Here is an example: suppose the 3rd and 4th elements of the sequence are the closest ones. We divide the whole range into subintervals of that smallest length, and remember in the table the position of the left end of each of these delta-intervals. Later on, when an input x is given, we compute the delta-interval of that x as (x-a)/delta + 1 and use the table to deduce the corresponding interval in the sequence. If x falls to the left of the i-th sequence element, we choose the (i-1)-th.
More precisely:
Given any input x between a and b, calculate j := (x-a)/delta + 1 and i := table[j]. If x < sequence[i], put i := i - 1. Then the index i satisfies sequence[i] <= x < sequence[i+1]; otherwise the distance between these two consecutive elements would be smaller than delta, which it is not.
Remark: Be aware that if the minimum distance delta between consecutive elements in sequence is too small the table will have too many entries. The simple description I've presented here ignores these pathological cases, which require additional work.
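A rough JavaScript sketch of the construction and lookup (0-based arrays here, unlike the 1-based description above; it also glosses over the floating-point care a real implementation would need):
// Build is O(n + count); lookup is O(1).
class IntervalSelector {
    constructor(sequence) { // sequence: ascending abscissas
        this.seq = sequence;
        this.a = sequence[0];
        // delta = minimum gap between consecutive abscissas
        this.delta = Infinity;
        for (let i = 1; i < sequence.length; i++) {
            this.delta = Math.min(this.delta, sequence[i] - sequence[i - 1]);
        }
        const b = sequence[sequence.length - 1];
        const count = Math.floor((b - this.a) / this.delta) + 1;
        this.table = new Array(count + 1);
        for (let i = 0; i < sequence.length; i++) {
            this.table[Math.floor((sequence[i] - this.a) / this.delta)] = i;
        }
        // propagate each filled entry forward into the unfilled slots after it
        for (let j = 1; j < this.table.length; j++) {
            if (this.table[j] === undefined) this.table[j] = this.table[j - 1];
        }
    }
    // returns i such that seq[i] <= x < seq[i+1]
    indexOf(x) {
        let i = this.table[Math.floor((x - this.a) / this.delta)];
        if (x < this.seq[i]) i--;
        return i;
    }
}

const sel = new IntervalSelector([0, 0.5, 1]);
console.log(sel.indexOf(0.7)); // 1, i.e. 0.7 lies between 0.5 and 1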
Yes, a simple binary search should do well and will typically suffice.
If you need to get better, you might try interpolation search (has nothing to do with your value interpolation).
If your points are distributed at fixed intervals (like in your example, 0 0.5 1), you can also simply store the values in an array and access them in constant time via their index.
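For reference, a minimal sketch of the binary-search approach (it assumes xs is sorted ascending and the query lies within its range):
// Binary search for the bracketing interval, then linear interpolation.
function interpolate(xs, ys, x) {
    let lo = 0, hi = xs.length - 1;
    while (hi - lo > 1) {                  // invariant: xs[lo] <= x <= xs[hi]
        const mid = (lo + hi) >> 1;
        if (xs[mid] <= x) lo = mid; else hi = mid;
    }
    const t = (x - xs[lo]) / (xs[hi] - xs[lo]);
    return ys[lo] + t * (ys[hi] - ys[lo]);
}

console.log(interpolate([0, 0.5, 1], [0, 1, 0.5], 0.7)); // 0.8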
Basically I can only plot 1000 values on a chart but my dataset frequently has more than 1000 values.
So... let's say I have 3000 values - that's easy: every 3rd point is plotted (when i % 3 == 0). What about when it's a number like 2106? I'm trying to plot evenly.
for (var i = 0; i < chartdata.node.length; i++) {
    //something in here
}
Since you may have more or fewer than 1000, I would go with something like this:
var inc = Math.floor(chartdata.node.length / 1000);
if (inc == 0)
    inc = 1;
for (var i = 0; i < chartdata.node.length; i += inc) {
    // plot chartdata.node[i]
}
Exactly 1000 points, slightly irregular spacing
Let A be the number of data points you have (i.e. 2106) and B the number of data points you want to plot (i.e. 1000). In the continuous case, you'd place a plot point every A/B data points. With discrete data points you can do the following: maintain a counter C, initialized to zero. For every one of the A input data points, add B to the counter. Whenever the counter reaches at least A, plot the data point and subtract A from the counter. On the whole, you'll have added B to the counter A times and subtracted A from it B times, so you end up with a zero counter again, having plotted exactly B data points.
You can tweak this to obtain different behaviour at the end points, e.g. to always include the first and last data point. Simply plot the first point unconditionally, then do the above scheme for the remaining points, i.e. with A=2105 and B=999. One benefit of this whole approach is that all of this works in integer arithmetic, so rounding errors will be of no concern to you.
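A short sketch of that counter scheme (the function name downsample is illustrative):
// Plots exactly B of the A data points, evenly spread, using only
// integer arithmetic (assumes B <= A).
function downsample(data, B) {
    const A = data.length;
    const plotted = [];
    let C = 0;
    for (const point of data) {
        C += B;
        if (C >= A) {   // counter overflowed: emit this point
            C -= A;
            plotted.push(point);
        }
    }
    return plotted;
}

const sample = downsample([...Array(2106).keys()], 1000);
console.log(sample.length); // 1000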
Perfectly regular spacing, but fewer data points
If even spacing is more important, you can simply compute the amount by which to increment your index for every plotted point as ceil(A/B), rounding up so you never exceed B points. Due to the rounding, this step can be noticeably larger than the fractional result: in the worst case a ratio just above one gets rounded up to two, so only slightly more than 500 of the 1000 allowed points are actually plotted. These will be evenly spaced, though.
You could try something like this (in pseudo-code):
var desired_data_length = 1000;
for (var i = 0; i < desired_data_length; i++) {
    var actual_i = int(float(i) / float(desired_data_length) * float(chartdata.length));
    // do something with actual_i as the index
}
This will use desired_data_length number of indices, and will linearly interpolate from [0,desired_data_length) to [0,chartdata.length), which is what you want.
If the data is purely numerical you may try Typed Arrays.
I'm having problems generating normally distributed random numbers (mu=0, sigma=1) using JavaScript.
I've tried the Box-Muller method and the ziggurat method, but the mean of the generated series of numbers comes out at around 0.0015 or -0.0018, very far from zero! Over 500,000 randomly generated numbers this is a big issue. It should be close to zero, something like 0.000000000001.
I cannot figure out whether it's a problem with the method, or whether JavaScript's built-in Math.random() does not generate exactly uniformly distributed numbers.
Has anyone encountered similar problems?
Here you can find the ziggurat function:
http://www.filosophy.org/post/35/normaldistributed_random_values_in_javascript_using_the_ziggurat_algorithm/
And below is the code for the Box-Muller:
function rnd_bmt() {
    var x = 0, y = 0, rds, c;
    // Get two random numbers from -1 to 1.
    // If the radius is zero or greater than 1, throw them out and pick two
    // new ones. Rejection sampling throws away about 20% of the pairs.
    do {
        x = Math.random() * 2 - 1;
        y = Math.random() * 2 - 1;
        rds = x * x + y * y;
    } while (rds === 0 || rds > 1);
    // This magic is the Box-Muller transform
    c = Math.sqrt(-2 * Math.log(rds) / rds);
    // It always creates a pair of numbers. I'll return them in an array.
    // This function is quite efficient, so don't be afraid to throw one away
    // if you don't need both.
    return [x * c, y * c];
}
If you generate n independent normal random variables, the standard deviation of the mean will be sigma / sqrt(n).
In your case, n = 500000 and sigma = 1, so the standard error of the mean is approximately 1/707 ≈ 0.0014. The 95% confidence interval, given a true mean of 0, would be around twice this, i.e. (-0.0028, 0.0028). Your sample means are well within this range.
Your expectation of obtaining 0.000000000001 (1e-12) is not mathematically grounded. To get within that range of accuracy, you would need to generate about 10^24 samples; at 10,000 samples per second that would still take on the order of 3 trillion years... this is precisely why it's good to avoid computing things by simulation if possible.
On the other hand, your algorithm does seem to be implemented correctly :)
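For a quick empirical check of the numbers above (reusing the rnd_bmt function from the question):
var n = 500000, sum = 0;
for (var i = 0; i < n / 2; i++) {
    var pair = rnd_bmt(); // each call yields two independent normals
    sum += pair[0] + pair[1];
}
console.log('sample mean:', sum / n);             // typically within +/- 0.0028
console.log('standard error:', 1 / Math.sqrt(n)); // ≈ 0.0014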