Calculate percentile rank (Parse) - javascript

I need to calculate the percentile rank of a particular value against a large number of values filtered in various ways. The data is all stored on Parse.com, which has a limitation of returning a maximum of 1000 rows per query. The number of values stored is likely to be well over 100,000.
By 'percentile rank', I mean I need to calculate the percentage of values that the provided value is greater than. I am not trying to calculate the value of a provided percentile. For example, given a list of values {20, 23, 24, 29, 30, 31, 35, 40, 40, 43} the percentile rank of the provided value 35 is 70%. The algorithm for this is simply the rank of the value / count of values * 100. Not sure if 'percentile rank' is the correct terminology for this.
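A throwaway sketch of the calculation I mean, in plain JS (using <= so that the value's own position counts as its rank, which is what makes the example come out at 70%):
function percentileRank(values, x) {
    // rank = the 1-based position of x once the values are ordered,
    // i.e. the count of values less than or equal to x
    var rank = values.filter(function(v) { return v <= x; }).length;
    return rank / values.length * 100;
}

console.log(percentileRank([20, 23, 24, 29, 30, 31, 35, 40, 40, 43], 35)); // 7 / 10 * 100 = 70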
I have considered a couple of different approaches to this. The first is to pull down the full list of values (into Parse Cloud) and then calculate the percentile rank from there, then filter the list and calculate again, repeating the last two steps as many times as required. The problem with this approach is that it will not work once we reach 1000 values, which we can expect pretty quickly.
Another option, which is the best I can come up with so far, is to query the count of items, and the rank of the provided value. For example:
var rank_world_alltime = new Parse.Query("Values")
    .lessThan("value", request.params.value) // filters the query to values less than the provided value, so counting this query returns the rank
    .count();
var count_world_alltime = new Parse.Query("Values")
    .count();

Parse.Promise.when(rank_world_alltime, count_world_alltime).then(function(rank, count) {
    var percentile = rank / count * 100;
    console.log("world_alltime_percentile = " + percentile);
});
This works well for a single calculation, but I need to perform multiple calculations, and with this approach the queries add up very quickly. I expect to need to run about 15 calculations per call, which is 30 queries. All calculations need to complete in under 3 seconds before Parse terminates the job, and I am limited to 30 reqs/second, so this is very quickly going to become a problem.
Does anyone have any suggestions on how else I could approach this? I've thought about somehow pre-processing some of this but can't quite work out how to do so, as the filters will be based on time and location (city and country), so there are potentially a LOT of pre-calculations that would need to be run at regular intervals. The results do not need to be 100% accurate, but they should be close.
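To illustrate the kind of pre-processing I have in mind but haven't worked out, something like per-filter bucket counts (all names here are made up):
// one pre-computed histogram per (country, city, period) combination
var histogram = {
    country: "AU", city: "Sydney", period: "2014-W21",
    bucketWidth: 10,
    buckets: [12, 40, 95, 60, 8] // counts of values in [0,10), [10,20), ...
};

function approxPercentileRank(hist, value) {
    var idx = Math.floor(value / hist.bucketWidth);
    var below = 0, total = 0;
    for (var i = 0; i < hist.buckets.length; i++) {
        total += hist.buckets[i];
        if (i < idx) below += hist.buckets[i];
    }
    // approximate: ignores where the value sits inside its own bucket
    return below / total * 100;
}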

I don't know much about Parse, but as far as I understand what you say, it is some kind of cloud database thingy that holds your high scores, and limits you to 1000 rows per query, 3 seconds per job, and 30 queries per second.
In order to have approximate calculations and halve the number of queries, I would first of all cache the totals (count_world_alltime, count_region_week, whatever), if you can save them somewhere locally. For numbers around 100K, just getting the order of magnitude (thus not the latest updated number) should be good enough to get a percentile.
Maybe you can get several counts per query. However, my lack of expertise in Parse/NoSQL kind of stops me from being sure of this; you'll have to check their documentation. If it is possible, however, for the case where you need percentiles for a series of values all in the same category, I would:
Order the values; let's call them a, b, c, d, e (once ordered)
Get the number of values in the intervals [0,a], [a,b], [b,c], [c,d], [d,e]
Use the cached total to get the percentiles (where Nxy is the number of values in [x,y]):
Pa = 100 * N0a / total
Pb = 100 * ( N0a + Nab ) / total
Pc = 100 * ( N0a + Nab + Nbc ) / total
and so on...
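If such batched counts are possible, the arithmetic itself is just a cumulative sum. A rough sketch in plain JS (how the interval counts are actually fetched depends on Parse):
// sortedValues: the values needing percentiles, ascending (a, b, c, ...)
// intervalCounts[i]: count of stored values in the i-th interval
// total: the cached overall count
function percentiles(sortedValues, intervalCounts, total) {
    var cumulative = 0;
    return sortedValues.map(function(v, i) {
        cumulative += intervalCounts[i];
        return { value: v, percentile: 100 * cumulative / total };
    });
}

// e.g. N0a = 400, Nab = 250, Nbc = 150, total = 1000
console.log(percentiles([10, 20, 30], [400, 250, 150], 1000));
// -> Pa = 40, Pb = 65, Pc = 80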
If you need one value ranked worldwide, another per region, some per week, others over all time, etc., this doesn't apply. In that case I don't think you can get below one query per number, beyond caching the totals.

Related

Algorithm: randomly select points within user-defined intervals, separated by minimum user-defined values

I need to develop an algorithm that randomly selects values within user-specified intervals. Furthermore, these values need to be separated by a minimum user-defined distance. In my case the values and intervals are times, but this may not be important for the development of a general algorithm.
For example: A user may define three time intervals (0900-1200, 1200-1500, 1500-1800) upon which 3 values (1 per interval) are to be selected. The user may also say they want the values to be separated by at least 30 minutes. Thus, values cannot be 1159, 1201, 1530 because the first two elements are separated by only 2 minutes.
A few hundred (however many I am able to give) points will be awarded to the most efficient algorithm. The answer can be language agnostic, but answers either in pseudocode or JavaScript are preferred.
Note:
The number of intervals, and the length of each interval, are completely determined by the user.
The distance between two randomly selected points is also completely determined by the user (but must be less than the length of the next interval)
The user-defined intervals will not overlap
There may be gaps between the user-defined intervals (e.g., 0900-1200, 1500-1800, 2000-2300)
I already have the following two algorithms and am hoping to find something more computationally efficient:
Randomly select value in Interval #1. If this value is less than user-specified distance from the beginning of Interval #2, adjust the beginning of Interval #2 prior to randomly selecting a value from Interval #2. Repeat for all intervals.
Randomly select values from all intervals. Loop through array of selected values and determine if they are separated by user-defined minimum distance. If not (i.e., values are too close), randomly select new values. Repeat until valid array.
This works for me, and I'm currently not able to make it "more efficient":
function random(intervals, gap = 1) {
    if (!intervals.length) return [];
    // ensure the ordering of the intervals
    intervals = intervals.sort((a, b) => a[0] - b[0]);
    let res = [];
    for (let i = 0; i < intervals.length; i++) {
        let [min, max] = intervals[i];
        // check whether a valid number can exist at all given the next interval
        if (i < intervals.length - 1 && min + gap > intervals[i + 1][1]) {
            throw new Error("invalid ranges and gap");
        }
        // if we can't generate a number in the current interval,
        // go back and regenerate the previous number
        if (i > 0 && res[i - 1] + gap > max) {
            // shrink the previous interval's max to force a smaller number
            intervals[i - 1][1] = res[i - 1] - 1;
            res.pop();
            i -= 2;
        } else {
            // the lower bound is the greater of the interval's min
            // and the previously generated number + gap
            if (i > 0) {
                min = Math.max(res[i - 1] + gap, min);
            }
            // usual formula to get a random number in a specific interval
            res.push(Math.round(Math.random() * (max - min) + min));
        }
    }
    return res;
}

console.log(random([
    [900, 1200],
    [1200, 1500],
    [1500, 1800],
], 400));
this works like:
1. generate the first number
2. check if the next number can be generated (for the gap rule)
- if it can, generate it and go back to point 2 (for the following number)
- if it can't, set the max of the previous interval to the number that was generated, and generate it again (so that it comes out lower)
I can't figure out the complexity, since random numbers are involved, but it might happen that with 100 intervals, at the generation of the 100th random number, you find that you can't, and so in the worst case this goes back and regenerates everything from the first number.
However, every time it goes back it shrinks the range of an interval, so it will converge to a solution if one exists.
This seems to do the job. For explanations see the comments in the code.
Be aware that this code does not do any checks of your conditions, i.e. that the intervals don't overlap and that each interval is big enough for the mindist to be fulfilled. If the conditions are not met, it may generate erroneous results.
This algorithm allows the minimum distance between two values to be defined with each interval separately.
Also be aware that an interval limit like 900 in this algorithm does not mean 9:00 o'clock, but just the numeric value 900. If you want the intervals to represent times, you have to express them as, for instance, minutes since midnight, i.e. 9:00 becomes 540, 12:00 becomes 720 and 15:00 becomes 900.
EDIT
With the current edit it also supports wrap-arounds at midnight (although it does not support intervals or minimum distances of more than a whole day)
//these are the values entered by the user
//as intervals are non-overlapping I interpret
//an interval [100, 300, ...] as 100 <= x < 300,
//ie the upper limit is not part of that interval
//the third value in each interval is the minimum
//distance from the last value, ie [500, 900, 200]
//means a value between 500 and 900 that must be
//at least 200 away from the last value
//if the upper limit of an interval is less than the lower limit,
//this means a wrap-around at midnight
//the minimum distance in the first interval is obviously 0
let intervals = [
    [100, 300, 0],
    [500, 900, 200],
    [900, 560, 500]
];

//the total upper limit of a value (for instance the number of
//minutes in a day, ie the length of a day)
//if you don't need wrap-arounds, set this to Number.MAX_SAFE_INTEGER
let upperlimit = 1440;

//generates a random value x with min <= x < max
function rand(min, max) {
    return Math.floor(Math.random() * (max - min)) + min;
}

//holds all generated values
let vals = [];
let val = 0;
//iterate over all intervals, to generate one value from each interval
for (let iv of intervals) {
    //the next random must be greater than the beginning of the interval,
    //and if the last value is within range of mindist, also greater than
    //lastval + mindist
    let min = Math.max(val + iv[2], iv[0]);
    //the next random must be less than the end of the interval
    //if the end of the interval is less than the current min,
    //there is a wrap-around at midnight, thus extend the max
    let max = iv[1] < min ? iv[1] + upperlimit : iv[1];
    //generate the next val; if it's greater than upperlimit,
    //it's on the next day, thus remove a whole day
    val = rand(min, max);
    if (val > upperlimit) val -= upperlimit;
    vals.push(val);
}
console.log(vals);
As you may notice, this is more or less an implementation of your proposal #1, but I don't see any way of making this more "computationally efficient" than that. You can't get around selecting one value from each interval, and the most efficient way of always generating a valid number is to adjust the lower limit of the interval, if necessary.
Of course, with this approach, the selection of the next number is always limited by the selection of the previous one, especially if the minimum distance between two numbers is near the length of the interval and the previous number was selected near the upper limit of its interval.
This can simply be done by separating the intervals by the required number of minutes. However, there might be edge cases, like a given interval being shorter than a separation, or even worse, two consecutive intervals being shorter than the separation, in which case you can safely throw an error; e.g. in the [[900,1200],[1200,1500]] case, had 1500 - 900 < 30 been true. So you'd best check this per consecutive pair of tuples and throw an error if they don't satisfy the condition before trying any further.
Then it gets a little hairy, I mean probabilistically. A naive approach would choose a random value from [900,1200] and, depending on the result, add 30 to it and limit the bottom boundary of the second tuple accordingly. Say the random number chosen from [900,1200] turns out to be 1190; then we would force the second random number to be chosen from [1220,1500]. This makes the second random choice dependent on the outcome of the first, and as far as I remember from probability lessons, that is no good. I believe we have to find all possible borders, make a random choice among them, and then make two safe random choices, one from each range.
Another point to consider: this might be a long list of tuples to start with, so we should take care not to limit the second tuple in each turn, since it will be the first tuple on the next turn and we would like it as wide as possible. So perhaps taking the minimum possible value from the first range (limiting the first range as much as possible) may turn out to be more productive than random tries, which might (quite possibly) cause problems in further steps.
I could give you the code, but since you haven't shown any attempts, you'll have to settle for this rod and go fish yourself.

How to Create a custom objectID in mongodb

As far as I know, you can call ObjectId("something"); to generate a new id.
Is it possible to generate a random id which does not already exist in the database/collection and has a specific format?
In my case, I want the object id to be a unique random 10-digit number.
So the result should be:
var ObjectId = require('mongodb').ObjectID
var id = new ObjectId("something");
console.log(id) ==> 0123456789
As in the comments, your best bet would be to get the seconds into the current year when inserting the document. But the real question is how intensive the insertion of new documents will be. You need to account for multiple factors, the first being that the more documents you insert, the higher the chance your ids will collide at some point.
I would recommend just keeping the standard ObjectId Mongo generates for you. However, a solution I can think of off the top of my head would be taking a substring of the timestamp to get its last 5 digits, then generating 5 random digits and merging them together.
new Date().getTime().toString().substring(8) + (Math.floor(Math.random() * 90000) + 10000);
With the above (80 bits) you would get a 4.135898×10⁻¹³ collision probability on 1,000,000 documents.
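If you do go with a custom 10-digit _id, one way to guarantee uniqueness is to let MongoDB's unique index on _id do the work and retry on collision. A sketch assuming the Node.js driver (11000 is MongoDB's duplicate-key error code):
async function insertWithRandomId(collection, doc) {
    for (;;) {
        // 10 random digits, kept as a string so leading zeros survive
        const id = String(Math.floor(Math.random() * 1e10)).padStart(10, "0");
        try {
            await collection.insertOne(Object.assign({ _id: id }, doc));
            return id;
        } catch (err) {
            if (err.code !== 11000) throw err; // not a duplicate-key error
            // _id collision: loop and try a new id
        }
    }
}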

How to display users' chances (%) depending on their bet amounts?

I'm developing a Jackpot Roulette game. There is a main pot. Users join each round, every bet goes into the main pot, and the winner gets the pot.
I'm looking for a way to display to each user their chance to win at any moment (I can update that chance every second to keep them informed).
I have an array of users, and every user made a bet of some amount.
For example:
{ username: 'user1', betAmount: 600 },
{ username: 'user2', betAmount: 400 }
and so on...
How can I calculate their chance to win?
For this example array, I would like to show user1 that his chance is 60%, and user2 that his is 40%.
I don't know how to calculate their chances or how to display them.
I'm looking for a function that can calculate each user's chance depending on the bet amount they made compared with the other users. The maximum chance is 100%, divided among the users present at that moment.
Example of how you'd calculate the odds based on betAmount / total(betAmounts) * 100:
var bets = [
    { username: 'user1', betAmount: 100 },
    { username: 'user2', betAmount: 300 }
];

// get the total bets
var totalbets = bets.map(({betAmount}) => betAmount).reduce((a, b) => a + b);

// output (to console)
bets.forEach(({username, betAmount}) => {
    console.log(`user ${username} odds ${betAmount / totalbets * 100}`);
});
How to display it on a web page depends on the page, you've supplied no HTML to work with.
A little explanation of the code:
Array#map creates a new array using the results of calling the provided function for each element in the original array. In this case the result of .map is an array of numbers [100, 300].
Array#reduce applies the given function against an "accumulator", with the end result being a single value. In this case the function is called a single time, with a = 100 and b = 300; the result is the returned value a + b = 400.
If there were three values in the array, say [3,2,1]:
First call: a=3, b=2, returns a+b = 5
Second call: a=5 (the value returned from the previous iteration), b=1; the result is a+b = 6
Read the linked documentation for each for a better explanation; reduce, for example, can also be supplied an initial value for the accumulator, which means that for an array of 3 values the callback function is called 3 times, not 2.
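A quick demo of that difference:
[3, 2, 1].reduce((a, b) => a + b);    // 2 calls: 3+2 = 5, then 5+1 = 6
[3, 2, 1].reduce((a, b) => a + b, 0); // 3 calls: 0+3 = 3, 3+2 = 5, 5+1 = 6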
Well, it depends... What game are the people playing? The game being played determines whether the bet amount is a factor in the chance to win.
Instead of using bet amounts, try using whatever value they bet on and then compare that value to the number of possible values.
For example, if the rules of the game are to choose a number from 1-100, and the more you bet the more numbers you can choose, then you can calculate a player's chance to win as the number of values they chose / 100. So if I bet 200 and chose the numbers 10 and 3, and you bet 100 and chose the number 61, I have a 2% chance to win and you have a 1% chance to win.
However if you want to use bet amounts as a factor then do this:
var bettotal = bet1 + bet2 + bet3 + bet4;
var user1chance = bet1 / bettotal * 100;
//etc etc...
And then if you want to display it to an element:
document.getElementById('myID').value = user1chance;
It sounds like a given user's chance to win is bet amount divided by total amount bet (×100%). So if user 1 bets 100, user 2 300, then user 1 has a 25% chance to win (100 / 400)

Getting the index of an object in an ordered list in Firebase

I'm building a leaderboard using Firebase. The player's position in the leaderboard is tracked using Firebase's priority system.
At some point in my program's execution, I need to know what position a given user is at in the leaderboard. I might have thousands of users, so iterating through all of them to find an object with the same ID (thus giving me the index) isn't really an option.
Is there a more performant way to determine the index of an object in an ordered list in Firebase?
edit: I'm trying to figure out the following:
/
---- leaderboard
--------user4 {...}
--------user1 {...}
--------user3 {...} <- what is the index of user3, given a snapshot of user3?
--------...
If you are processing tens or hundreds of elements and don't mind taking a bandwidth hit, see Kato's answer.
If you're processing thousands of records, you'll need to follow an approach outlined in principle in pperrin's answer. The following answer details that.
Step 1: set up Flashlight to index your leaderboard with ElasticSearch
Flashlight is a convenient node script that syncs ElasticSearch with Firebase data.
Read about how to set it up here.
Step 2: modify Flashlight to allow you to pass query options to ElasticSearch
As of this writing, Flashlight gives you no way to tell ElasticSearch you're only interested in the number of documents matched and not the documents themselves.
I've submitted this pull request which uses a simple one-line fix to add this functionality. If it isn't closed by the time you read this answer, simply make the change in your copy/fork of Flashlight manually.
Step 3: Perform the query!
This is the query I sent via Firebase:
{
    index: 'firebase',
    type: 'allTime',
    query: {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "range": {
                    "points": {
                        "gte": minPoints
                    }
                }
            }
        }
    },
    options: {
        "search_type": "count"
    }
};
Replace points with the name of the field tracking points for your users, and minPoints with the number of points of the user whose rank you are interested in.
The response will look something like:
{
    max_score: 0,
    total: 2
}
total is the number of users who have the same or greater number of points -- in other words, the user's rank!
Since Firebase stores objects, not arrays, the elements do not have an "index" in the list; JavaScript objects, and by extension JSON ones, are inherently unordered. As explained in Ordered Docs and demonstrated in the leaderboard example, you accomplish ordering by using priorities.
A set operation:
var ref = new Firebase('URL/leaderboard');
ref.child('user1').setPriority( newPosition /*score?*/ );
A read operation:
var ref = new Firebase('URL/leaderboard');
ref.child('user1').once('value', function(snap) {
    console.log('user1 is at position', snap.getPriority());
});
To get the info you want, at some point a process is going to have to enumerate the nodes to count them. So the question is then where/when the counting takes place.
Using .count() in the client will mean it is done every time it is needed; it will be pretty accurate, but processing/traffic heavy.
If you keep a separate index of the count, it will need regular refreshing or constant updating (each insert causing a shuffling up of the remaining entries).
Depending on the distribution and volume of your data, I would be tempted to go with a background process that just updates (or rebuilds) the index every (say) ten or twenty additions, and indexes every (say) 10 positions.
"Leaderboard",$UserId = priority=$score
...
"Rank",'10' = $UserId,priority=$score
"Rank",'20' = $UserId,priority=$score
...
From a score you get the rank to within ten, and then using a startAt/endAt/count on your "Leaderboard" you get it down to the unit, as in the sketch below.
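A rough sketch of that last step with the legacy priority-based JS API (the exact off-by-one handling depends on how the index entries are stored):
var ref = new Firebase('URL/Leaderboard');
// rank10Score: the priority stored on the nearest "Rank" entry at or below the user
// count the leaderboard entries whose priorities lie between that score and ours
ref.startAt(rank10Score).endAt(userScore).once('value', function(snap) {
    console.log('rank is roughly', 10 + snap.numChildren() - 1);
});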
If your background process is monitoring the updates to the leaderboard, it could be more intelligent about its work, updating the index only as required.
I know this is an old question, but I just wanted to share my solutions for future reference. First of all, the Firebase ecosystem has changed quite a bit, and I'm assuming the current best practices (i.e. Firestore and serverless functions). I personally considered these solutions while building a real application, and ended up picking the scheduled approximated ranks.
Live ranks (most up-to-date, but expensive)
When preparing a user leaderboard I make a few assumptions:
The leaderboard ranks users based on a number which I'll call 'score' from now on
New users rank lowest on the leaderboard, so upon user creation, their rank is set to the total user count (with a Firebase function, which sets the rank, but also increases the 'total user' counter by 1).
Scores can only increase (with a few adaptations decreasing scores can also be supported).
Deleted users keep a 'ghost' spot on the leaderboard.
Whenever a user gets to increase their score, a Firebase function responds to this change by querying all surpassed users (whose score is >= the user's old score but < the user's new score) and have their rank decreased by 1. The user's own rank is increased by the size of the before-mentioned query.
The rank is now immediately available on client reads. However, the ranking updates inside of the proposed functions are fairly read- and write-heavy. The exact number of operations depends greatly on your application, but for my personal application a great frequency of score changes and relative closeness of scores rendered this approach too inefficient. I'm curious if anyone has found a more efficient (live) alternative.
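A sketch of such a function (Firestore with the v1 Cloud Functions API; the 'users' collection and the 'score'/'rank' fields are assumed names, and I use the convention that rank 1 is the top spot, so the mover's rank number drops while the surpassed users' numbers rise):
const functions = require("firebase-functions");
const admin = require("firebase-admin");
admin.initializeApp();

exports.onScoreIncrease = functions.firestore
    .document("users/{userId}")
    .onUpdate(async (change, context) => {
        const before = change.before.data();
        const after = change.after.data();
        if (after.score <= before.score) return null; // scores only increase

        // everyone the user just passed: old score <= their score < new score
        const passed = await admin.firestore().collection("users")
            .where("score", ">=", before.score)
            .where("score", "<", after.score)
            .get();

        const batch = admin.firestore().batch(); // note: batches cap at 500 writes
        let surpassed = 0;
        passed.forEach((doc) => {
            if (doc.id === context.params.userId) return; // skip the mover itself
            batch.update(doc.ref, { rank: admin.firestore.FieldValue.increment(1) });
            surpassed++;
        });
        batch.update(change.after.ref, {
            rank: admin.firestore.FieldValue.increment(-surpassed),
        });
        return batch.commit();
    });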
Scheduled ranks (simplest, but expensive and periodic)
Schedule a Firebase function to simply sort the entire user collection by ascending score and write back the rank for each user (in a batch update). This process can be repeated daily, or more or less frequently depending on your application. For N users, the function always makes N reads and N writes.
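A sketch of that job (again with assumed names; I sort descending here so that rank 1 is the highest score, and keep in mind Firestore batches cap at 500 writes, so larger collections need chunking):
exports.updateRanks = functions.pubsub
    .schedule("every 24 hours")
    .onRun(async () => {
        const snap = await admin.firestore().collection("users")
            .orderBy("score", "desc")
            .get();
        const batch = admin.firestore().batch();
        snap.docs.forEach((doc, i) => {
            batch.update(doc.ref, { rank: i + 1 }); // rank 1 = highest score
        });
        return batch.commit();
    });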
Scheduled approximated ranks (cheapest, but non-precise and periodic)
As an alternative to the 'Scheduled ranks' option, I would suggest an approximation technique: instead of writing each user's exact rank for each scheduled update, the collection of users (still sorted as before) is simply split into M chunks of equal size, and the scores that bound these chunks are written to a separate 'stats' collection.
So, for example: if we use M = 3 for simplicity and we read 60 users sorted by ascending score, we have three chunks of 20 users. For each of the (still sorted) chunks we get the score of the first user (lowest score of the chunk) and the last user (highest score of the chunk), i.e. the range that contains all scores of that chunk. Let's say that the chunk with the lowest scores has scores ranging from 20-120, the second chunk has scores from 130-180, and the chunk with the highest scores has scores 200-350. We now simply write these ranges to a 'stats' collection (the write count is reduced to 1, no matter how many users!); a sketch follows below.
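A sketch of that scheduled update (names assumed; this would live inside a scheduled function like the one above):
const M = 20; // number of chunks
const snap = await admin.firestore().collection("users")
    .orderBy("score", "asc")
    .get();
const scores = snap.docs.map((d) => d.get("score"));
const size = Math.ceil(scores.length / M);
const ranges = [];
for (let i = 0; i < scores.length; i += size) {
    const chunk = scores.slice(i, i + size);
    ranges.push([chunk[0], chunk[chunk.length - 1]]); // [lowest, highest] score of the chunk
}
await admin.firestore().doc("stats/leaderboard").set({ ranges });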
Upon rank retrieval, the user simply reads the most recent 'stats' document and approximates their percentile rank by comparing the ranges with their own score. Of course it is possible that a user scores higher than the greatest score or lower than the lowest score from the previous 'stats' update, but I would just consider them belonging to the highest scoring group and the lowest scoring group respectively.
In my own application I used M = 20 and could therefore show the user percentile ranks by 5% accuracy, and estimate even within that range using linear interpolation (for example, if the user score is 450 and falls into the 40%-45%-chunk ranging from 439-474, we estimate the user's percentile rank to be 40 + (450 - 439) / (474 - 439) * 5 = 41.57...%).
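On the client, the interpolation described above could look like this (chunkRanges being the ascending [lowest, highest] ranges read from the 'stats' document):
function approximatePercentile(score, chunkRanges) {
    const width = 100 / chunkRanges.length; // percentile points per chunk
    for (let i = 0; i < chunkRanges.length; i++) {
        const [lo, hi] = chunkRanges[i];
        if (score < lo) return i * width; // falls in a gap below this chunk
        if (score <= hi) {
            // linear interpolation inside the chunk
            const frac = hi > lo ? (score - lo) / (hi - lo) : 0;
            return (i + frac) * width;
        }
    }
    return 100; // above the highest recorded score
}

// the example from above: a score of 450 in the 40%-45% chunk [439, 474] (M = 20)
// gives 40 + (450 - 439) / (474 - 439) * 5 ≈ 41.57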
If you want to get real fancy you can also estimate exact percentile ranks by fitting your expected score distribution (e.g. normal distribution) to the measured ranges.
Note: all users DO need to read the 'stats' document to approximate their rank. However, in most applications not all users actually view the statistics (as they are either not active daily or just not interested in the stats). Personally, I also used the 'stats' document (named differently) for storing other DB values that are shared among users, so this document is already retrieved anyways. Besides that, reads are 3x cheaper than writes. Worst case scenario is 2N reads and 1 write.

divide one value by another

I have some code from someone but wondering why they might have used a function like this.
this.viewable = 45;

getGroups: function() {
    return Math.ceil( this.getList().length / this.viewable );
}
Why would they divide the list length by a number, viewable?
The result is the number of items that should be rendered on the screen.
Why not just use the number 45 directly? Is it meant to be a percentage of the list? Usually I will divide a large value by a smaller value to get the percentage.
Sorry if this seems like a stupid math question, but my math skills are crap :) and I'm just trying to understand and learn some simple math skills.
It's returning the number of groups (pages) that are required to display the list. The reason it's declared as a variable (vs. using the constant in the formula) is so that it can be modified easily in one place. And likely this is part of a plugin for which the view length can be modified from outside, so this declaration provides a handle to it, with 45 being the default.
That will give the number of pages required to view them all.
I would guess you can fit 45 items on a page and this is calculating the number of pages.
Or something similar to that?
This would return the total number of pages.
Total items = 100 (for example)
Viewable = 45
100 / 45 = 2.22222....
Math.ceil(2.2222) = 3
Therefore 3 pages
Judging by the function name "getGroups", viewable is the capacity to show items (probably some interface list size).
By doing that division we know how many pages (groups) the data has to be divided into in order to be viewed on the interface. The ceil function guarantees that we don't leave out partial pages: if we have some records left over that don't fill a complete page, we still want to show them, and they therefore count as a page.
