Reduce the complexity of matching the elements of two arrays

I wrote code that extracts the column headers (the first row of a sheet) from a Google Sheet and compares them with an array of objects. Each object in the array has 3 properties: "question", "answer", and "category". The code compares the header of each column with the "question" property of each object in the array.
If they're similar, it should add the index of the column as a key to a dictionary and set its value to an array that holds the answer and the category of that question. I won't explain the motivation at length, but briefly: I built this logic to grade applicants' answers to some questions (hence linking the index of a question to its right answer and its category). Here is the code:
for (var i = 0; i < columnHeaders[0].length; i++) {
  for (var j = 0; j < questionsObjects.length; j++) {
    // Get the current question and convert it to lower case
    var question = questionsObjects[j].question.toString().toLowerCase();
    // Get the column header and convert it to lower case
    var columnHeader = columnHeaders[0][i].toString().toLowerCase();
    if (isStringSimilar(columnHeader, question)) {
      // Link the column index to the category and right answer of the matching question
      var catAndAnswer = [];
      catAndAnswer.push(questionsObjects[j].category.toLowerCase());
      catAndAnswer.push(questionsObjects[j].rightAnswer.toLowerCase());
      columnsQuestionsDictionary[i] = catAndAnswer;
    } else {
      SpreadsheetApp.getActive().getSheetByName("log").appendRow(["no", columnHeader, question]);
    }
  }
}
The code runs well; my only problem is the complexity, which is very high. In some cases this method takes almost 6 minutes to execute (in this case I had around 40 columns and 7 question objects)! To decouple the nested loops, I thought of concatenating the question values (of all objects in the questions array) into a single string, preceding each question with its index in the objects array.
For example:
var str = "";
for (var j = 0; j < questionsObjects.length; j++) {
  str = str + j + questionsObjects[j].question.toString().toLowerCase();
}
Then, I can have another separate loop through the columns headers, extract each header into a string, then use regex exec method to match that header in the long questions string (str), and if it's found I would get its index in str, then subtract 1 from it to know its index in the objects array. However, it turned out that the complexity of matching a regular expression is O(N) where N is the length of the string we search in (str in this example), given that this will be inside the columns loop, I see that we still get a high complexity that can go to O(N^2).
How can I optimize those nested loops so the code runs in the most efficient way possible?

OK, so I followed the suggestion Nina Scholz made in the comments and moved columnHeader = columnHeaders[0][i].toString().toLowerCase(); into the outer loop instead of the inner one, since it only depends on the outer loop's index.
The time needed to run the code was reduced from ~295 seconds to ~208 seconds, which is good.
I also tried switching the loop order, making the outer loop the inner one and the inner loop the outer one, and updated the usage of i and j accordingly. I did that because it's recommended to give the outer loop fewer iterations and the inner loop more (according to this resource), and in my case the loop that iterates over the questions array is always expected to have a number of iterations <= that of the other loop.
This is because the cost of 2 nested loops is (i × j) + i, where i and j are the number of iterations of the outer and the inner loops, respectively. Switching the loop order doesn't affect the multiplication part (i × j), but it does affect the addition part. So it's better for the outer loop to have fewer iterations than the inner one.
After doing this the final time of the run became ~202 seconds.
Of course since the loops are switched now, I moved this line to the inner loop: columnHeader = columnHeaders[0][i].toString().toLowerCase();, but at the same time I moved this question = questionsObjects[j].question.toString().toLowerCase(); to be under the outer loop because it's only needed there.
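The hoisting described in this update can be taken one step further by normalizing every header and every question once, before any loop runs. The following is only a sketch under assumptions: buildColumnsQuestionsDictionary and its isSimilar parameter are hypothetical names, isSimilar stands in for isStringSimilar (whose body is not shown in the question), and the logging to the "log" sheet is left out.

```javascript
// Hypothetical helper: returns a map of column index -> [category, rightAnswer].
// `isSimilar` is an assumed stand-in for the question's isStringSimilar.
function buildColumnsQuestionsDictionary(headerRow, questionsObjects, isSimilar) {
  // Lower-case each header exactly once.
  var headers = headerRow.map(function (h) {
    return String(h).toLowerCase();
  });
  // Lower-case each question once and prepare its dictionary value up front.
  var questions = questionsObjects.map(function (q) {
    return {
      text: String(q.question).toLowerCase(),
      value: [q.category.toLowerCase(), q.rightAnswer.toLowerCase()]
    };
  });

  var dictionary = {};
  for (var j = 0; j < questions.length; j++) {
    for (var i = 0; i < headers.length; i++) {
      if (isSimilar(headers[i], questions[j].text)) {
        dictionary[i] = questions[j].value;
      }
    }
  }
  return dictionary;
}
```

The number of isSimilar calls is unchanged (columns × questions); only the repeated string conversions are removed, so the gain depends on how expensive the similarity check itself is.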

Related

Question concerning efficiency in calculated array indices and repeated references to a specific array value

In trying to minimize the time it takes a JavaScript function to process, please consider this set up. In the function is a loop that operates on an array of similar objects. The index of the array is [4 + loop counter] and there are several references to array[4+i][various property names], such as a[4+i].x, a[4+i].y, a[4+i].z in each loop iteration.
Is it okay to keep calculating 4+i several times within each loop iteration, or would efficiency be gained by declaring a variable at the top of the loop to hold the value of 4+i and use that variable as the index, or declare a variable to be a reference to the a[4+i] object? Is it more work for the browser to declare a new variable or to add 4+i ten times? Does the browser work each time to find a[n] such that, if one needs to use the object in a[n] multiple times per loop iteration, it would be better to set x = a[n] and just reference x.property_names ten times?
Thank you for considering my very novice question.

Does the browser work each time to find a[n] such that, if one needs to use the object in a[n] multiple times per loop iteration, it would be better to set x = a[n] and just reference x.property_names ten times?
Yes. Although the JavaScript engine may be able to optimize away the repeated a[4+i] work, it also might not be able to, depending on what your code is doing. In contrast, creating a local variable to store the reference in is very, very little work.
Subjectively, it's also probably clearer to the reader and more maintainable to do x = a[4+i] once and then use x.
That said, the best way to know the answer to this question in your specific situation is to do it and see if there's an improvement.
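A minimal sketch of the "store the reference once" variant suggested above. The function name sumFrom and its parameters are made up for illustration; the offset 4 from the question is passed in as an argument.

```javascript
// Sums x + y + z for `count` objects starting at a[offset].
// The element is looked up once per iteration and reused through `item`.
function sumFrom(a, offset, count) {
  var total = 0;
  for (var i = 0; i < count; i++) {
    var item = a[offset + i]; // one index calculation and one array lookup
    total += item.x + item.y + item.z;
  }
  return total;
}
```

Whether this is measurably faster than repeating a[4+i] depends on the engine and the surrounding code, so it is worth timing both forms on the real data.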

This snippet runs for a bit more than half a minute...
// Calls f ten times and returns the total time (ms) of the 8 fastest runs.
function m(f) {
  const t = [Date.now()];
  const s = []; // collects the results of the calls
  for (let r = 0; r < 10; r++) {
    s.push(f());
    t.push(Date.now());
  }
  // turn the timestamps into per-run durations
  for (let i = 0; i < t.length - 1; i++)
    t[i] = t[i + 1] - t[i];
  t.pop();
  // sort ascending and drop the two slowest runs
  t.sort((a, b) => a - b);
  t.pop();
  t.pop();
  return t.reduce((a, b) => a + b);
}

const times = 1000;
const bignumber = 100000;
const bigarray = new Array(bignumber);
for (let i = 0; i < bignumber; i++)
  bigarray[i] = {x: Math.random(), y: Math.random(), z: Math.random()};
// 4 extra elements so that i+4 stays in range
for (let i = 0; i < 4; i++)
  bigarray.push(bigarray[i]);

console.log("idx", m(function() {
  let sum = 0;
  for (let r = 0; r < times; r++)
    for (let i = 0; i < bignumber; i++)
      sum += bigarray[i].x + bigarray[i].y + bigarray[i].z;
  return sum;
}));

console.log("idx+4", m(function() {
  let sum = 0;
  for (let r = 0; r < times; r++)
    for (let i = 0; i < bignumber; i++)
      sum += bigarray[i + 4].x + bigarray[i + 4].y + bigarray[i + 4].z;
  return sum;
}));

console.log("item", m(function() {
  let sum = 0;
  for (let r = 0; r < times; r++)
    for (let i = 0; i < bignumber; i++) {
      let item = bigarray[i];
      sum += item.x + item.y + item.z;
    }
  return sum;
}));

console.log("item+4", m(function() {
  let sum = 0;
  for (let r = 0; r < times; r++)
    for (let i = 0; i < bignumber; i++) {
      let item = bigarray[i + 4];
      sum += item.x + item.y + item.z;
    }
  return sum;
}));
... and produces output like
idx 2398
idx+4 2788
item 2252
item+4 2303
for me on Chrome. The numbers are the total runtime in milliseconds of the 8 best runs out of 10.
Where
idx is bigarray[i].x+bigarray[i].y+bigarray[i].z, repeated access to the same element with a named index (i)
idx+4 is bigarray[i+4].x+bigarray[i+4].y+bigarray[i+4].z, repeated access to the same element with a calculated index (i+4)
item is item.x+item.y+item.z, so the array element was stored in a variable
item+4 is item.x+item.y+item.z too, just with the array element picked from i+4
The case from your question is very visibly the outlier here. Repeated access to an element with a "fixed" index (the idx case) is already a bit slower than pulling the element out into a variable (the item and item+4 cases, where item+4 is the slower one of course; that addition is executed 800 million times after all). But accessing an element 3 times with a calculated index (the idx+4 case) is 15-20+% slower than any of the others.
Here the array is small enough to fit into the L3 cache. If you "move" a couple of 0-s from times to bignumber, the overall difference decreases to 10-15%, and everything other than idx+4 performs practically the same.

Need to write an algorithm for getting sum of values from Array 1 values for each Array 2 value

I am creating an algorithm to match any combination of cells of the first array to each value of the second array, with priority following the order of the second array. For example, in JavaScript:
var arr=[10,20,30,40,50,60,70,80,90];
var arr2=[100,120,140];
What I want is to derive the following mapping automatically (priority goes to the cells of the second array in order). Please help me find pseudocode for the algorithm:
100 = 10+20+30+40 //arr2[0] = arr1[0] + arr1[1] + arr1[2] + arr1[3]
120 = 50+70 //arr2[1] = arr1[4] + arr1[6]
140 = 60+80 //arr2[2] = arr1[5] + arr1[7]
90 = 90 //remaining arr1[8]
The values are only examples and can change dynamically.

A solution is possible if you treat both arrays as sorted. Start adding elements from the end of the first array (array1), which holds the greatest values because the array is sorted, and check whether the sum matches. If it does, proceed to the next target. If the sum is less than the element of array2 you are checking, add a third element from array1. If the sum is greater than the element of array2, drop one of the elements of array1 used in the sum and replace it with the previous element of array1. Repeat these steps. You will need to think about how to do this correctly, or else share some of your work or the logic you have in mind so that we can help.

As the matter is quite complex, over and above sufficing on a pseudo code style explanation, I have also coded a practical implementation that you may find at this link.
I advise you to refrain from looking at the solution and first try to implement the algorithm yourself as there is a lot of scope for further improvement.
Here is in broad lines an explanation to the way I have decided to tackle the algorithm:
The problem presented by the OP is related to a classic example of distributing n unique elements over k unique boxes.
In this case here, arr has 9 unique elements that need to be distributed over three distinct spots, represented by the container: arr2.
So the first step in tackling this problem is to figure out how you can implement a function that given n and k, is able to calculate all the possible distributions that apply.
The closest that I could come up with was the Stirling Numbers of the Second Kind, which is defined as:
The number of ways of partitioning a set of n elements into m nonempty sets (i.e., m set blocks), also called a Stirling set number. For example, the set {1,2,3} can be partitioned into three subsets in one way: {{1},{2},{3}}; into two subsets in three ways: {{1,2},{3}}, {{1,3},{2}}, and {{1},{2,3}}; and into one subset in one way: {{1,2,3}}.
If you pay close attention to the example provided, you will realize that it pertains to the enumeration of all the distribution combinations possible over INDISTINGUISHABLE partitions as order doesn't matter.
Since in our case, each spot in the container arr2 represents a UNIQUE spot and order therefore does matter, we will thus be required to enumerate all the Stirling Combinations over every possible combination of arr2.
Practically speaking, this means that for our example where arr2.length === 3, we will be required to apply all of the Stirling Combinations obtained to [100,120,140], [120,140,100], [140,100,120] etc.(in total 6 permutations)
The main challenging part here is to implement the Stirling Function, but luckily somebody has already done so:
http://blogs.msdn.com/b/oldnewthing/archive/2014/03/24/10510315.aspx
After copy and pasting the Stirling Function and using it to distribute arr over 3 unique spots, you now need to filter out the distributions that don't sum up to the designated spots encompassed by arr2.
This will then leave you with all the possible solutions that apply. In your case, for
var arr=[10,20,30,40,50,60,70,80,90];
var arr2=[100,120,140];
no solutions apply at all.
A quick workaround to that is by expanding the distribution target arr2 from [100,120,140] to [100,120,140,90]. A better workaround is that in the case zero solutions are found, then take away one element from list arr until you obtain a solution. Then you can later on expand your solution sets by including this element where it represents a mapping of it unto itself.
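For small inputs, the idea of letting some elements of arr remain unassigned can also be sketched without enumerating Stirling partitions. The code below swaps in a plain backtracking subset-sum search; it is not the linked implementation. The name assignSums is hypothetical, all values are assumed to be positive numbers, and the worst-case running time is exponential.

```javascript
// Hypothetical sketch: for each target (in order), find a group of unused
// indices of `values` whose elements sum to it. Returns null if impossible.
function assignSums(values, targets) {
  var used = values.map(function () { return false; });
  var groups = [];

  // Fill target number t; `start` keeps indices increasing within one group.
  function fill(t, start, remaining, group) {
    if (t === targets.length) return true; // every target has a group
    if (remaining === 0) {
      groups[t] = group.slice();
      return fill(t + 1, 0, targets[t + 1], []);
    }
    for (var i = start; i < values.length; i++) {
      if (used[i] || values[i] > remaining) continue;
      used[i] = true;
      group.push(i);
      if (fill(t, i + 1, remaining - values[i], group)) return true;
      group.pop(); // undo the choice and try the next candidate
      used[i] = false;
    }
    return false;
  }

  if (!fill(0, 0, targets[0], [])) return null;
  var leftover = [];
  used.forEach(function (u, i) { if (!u) leftover.push(i); });
  return { groups: groups, leftover: leftover };
}

var result = assignSums([10, 20, 30, 40, 50, 60, 70, 80, 90], [100, 120, 140]);
console.log(JSON.stringify(result));
// → {"groups":[[0,1,2,3],[4,6],[5,7]],"leftover":[8]}
```

The groups hold indices into arr, so this reproduces the mapping from the question: 100 = 10+20+30+40, 120 = 50+70, 140 = 60+80, with 90 left over.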

Best ways to get random items from an array in javascript

Considering the performance, what's the best way to get random subset from an array?
Say we get an array with 90000 items, I wanna get 10000 random items from it.
One approach I'm thinking about is to get a random index from 0 to array.length and then remove the selected item from the original array using Array.prototype.splice, then pick the next random item from the rest.
But splice re-indexes all the items after the one we just selected, moving them forward one step. Doesn't that affect performance?
Item values may be duplicated, but the selected positions should not be. Say we've selected index 0; then we should only look at the remaining indices 1~89999.

If you want a subset of the shuffled array, you do not need to shuffle the whole array. You can stop the classic Fisher-Yates shuffle once you have drawn your 10000 items, leaving the other 80000 indices untouched.
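A minimal sketch of that early-stopping shuffle. The function name partialShuffleSample is made up for illustration, and the input is copied first so the caller's array is not reordered; dropping the copy and shuffling in place is the obvious variation if mutation is acceptable.

```javascript
// Draws `size` items at distinct positions by running only the first
// `size` steps of a Fisher-Yates shuffle on a copy of the input.
function partialShuffleSample(source, size) {
  var pool = source.slice();
  var n = pool.length;
  var count = Math.min(size, n);
  for (var i = 0; i < count; i++) {
    // pick a random position from the not-yet-drawn part [i, n)
    var j = i + Math.floor(Math.random() * (n - i));
    var tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, count);
}
```

Each draw is a constant-time swap, so no elements are shifted the way splice would shift them.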

I would first shuffle the whole array, then splice off 10000 items.
How to randomize (shuffle) a JavaScript array?
explains a good way to shuffle an array in JavaScript.

A reservoir sampling algorithm can do this.
Here's an attempt at implementing Knuth's "Algorithm S" from TAOCP Volume 2 Section 3.4.2:
function sample(source, size) {
  var chosen = 0,
      srcLen = source.length,
      result = new Array(size);
  for (var seen = 0; chosen < size; seen++) {
    var remainingInput = srcLen - seen,
        remainingOutput = size - chosen;
    if (remainingInput * Math.random() < remainingOutput) {
      result[chosen++] = source[seen];
    }
  }
  return result;
}
Basically it makes one pass over the input array, choosing or skipping items based on a function of a random number, the number of items remaining in the input, and the number of items remaining to be required in the output.
There are three potential problems with this code: 1. I may have mucked it up, 2. Knuth calls for a random number "between zero and one" and I'm not sure if this means the [0, 1) interval JavaScript provides or the fully closed or fully open interval, 3. it's vulnerable to PRNG bias.
The performance characteristics should be very good. It's O(srcLen). Most of the time we finish before going through the entire input. The input is accessed in order, which is a good thing if you are running your code on a computer that has a cache. We don't even waste any time reading or writing elements that don't ultimately end up in the output.
This version doesn't modify the input array. It is possible to write an in-place version, which might save some memory, but it probably wouldn't be much faster.

Random Selections from Arrays

I am creating a game in Unity. I'm in the planning stage of it right now, but I'm trying to work out a problem I've come to. The game involves randomly selected objects from three different categories falling and the player has to catch the particular objects in particular bins.
So here's what needs to happen:
One or two of the arrays must be randomly chosen, one or two of the objects within that particular array must be chosen, no more than four objects can fall at once, the different objects must fall from different places and fall at different times.
Now I have a clip of code that I got from another project I did that's written in JavaScript (which is what I've been using, but I could also do it in Boo or C++) that solves part of the last point. It chooses a random location along the x axis and then has the object fall until y=0, and then it resets.
function Update()
{
    transform.position.y -= 50 * Time.deltaTime;
    if(transform.position.y < 0)
    {
        transform.position.y = 50;
        transform.position.x = Random.Range(0,60);
        transform.position.z = -16;
    }
}
I'm going to rewrite part of it so that the object resets after it hits a particular collider, yields for a short time period, and then finds a new random object and drops that instead. But what I'm having problems with is the actual randomizing of the objects. I have six objects in each of the three arrays, and I've looked for code where something is chosen from an array by numerical value, but found nothing about randomly choosing one of the arrays and then choosing something within that array. Neither have I found anything about random selection in JavaScript, Boo, or C++.
Any information on this code would be helpful, thanks in advance!

To select one object at random from one of three arrays chosen at random, you'd better work with an array of arrays. You will then need to generate two random numbers and use them as indexes into the array of arrays.
So instead of three different arrays, initialize a single array:
var a = [];
a.push([1,2,3]);
a.push([10,20]);
a.push([100,200,300,400]);
and then
var i = Math.floor(Math.random()*a.length);
var j = Math.floor(Math.random()*a[i].length);
var o = a[i][j];

why to use sorting maps on arrays. how is it better in some instances

I'm trying to learn about array sorting. It seems pretty straightforward. But on the mozilla site, I ran across a section discussing sorting maps (about three-quarters down the page).
The compareFunction can be invoked multiple times per element within
the array. Depending on the compareFunction's nature, this may yield a
high overhead. The more work a compareFunction does and the more
elements there are to sort, the wiser it may be to consider using a
map for sorting.
The example given is this:
// the array to be sorted
var list = ["Delta", "alpha", "CHARLIE", "bravo"];
// temporary holder of position and sort-value
var map = [];
// container for the resulting order
var result = [];
// walk original array to map values and positions
for (var i=0, length = list.length; i < length; i++) {
  map.push({
    // remember the index within the original array
    index: i,
    // evaluate the value to sort
    value: list[i].toLowerCase()
  });
}
// sorting the map containing the reduced values
map.sort(function(a, b) {
  return a.value > b.value ? 1 : -1;
});
// copy values in right order
for (var i=0, length = map.length; i < length; i++) {
  result.push(list[map[i].index]);
}
// print sorted list
print(result);
I don't understand a couple of things. To wit: What does it mean, "The compareFunction can be invoked multiple times per element within the array"? Can someone show me an example of that. Secondly, I understand what's being done in the example, but I don't understand the potential "high[er] overhead" of the compareFunction. The example shown here seems really straightforward and mapping the array into an object, sorting its value, then putting it back into an array would take much more overhead I'd think at first glance. I understand this is a simple example, and probably not intended for anything else than to show the procedure. But can someone give an example of when it would be lower overhead to map like this? It seems like a lot more work.
Thanks!

When sorting a list, an item isn't just compared to one other item, it may need to be compared to several other items. Some of the items may even have to be compared to all other items.
Let's see how many comparisons there actually are when sorting an array:
var list = ["Delta", "alpha", "CHARLIE", "bravo", "orch", "worm", "tower"];
var o = [];
for (var i = 0; i < list.length; i++) {
  o.push({
    value: list[i],
    cnt: 0
  });
}
o.sort(function(x, y){
  // count every comparison each item takes part in
  x.cnt++;
  y.cnt++;
  return x.value == y.value ? 0 : x.value < y.value ? -1 : 1;
});
console.log(o);
Result:
[
{ value="CHARLIE", cnt=3},
{ value="Delta", cnt=3},
{ value="alpha", cnt=4},
{ value="bravo", cnt=3},
{ value="orch", cnt=3},
{ value="tower", cnt=7},
{ value="worm", cnt=3}
]
(Fiddle: http://jsfiddle.net/Guffa/hC6rV/)
As you see, each item was compared to several other items. The string "tower" even had more comparisons than there are other strings, which means that it was compared to at least one other string at least twice.
If the comparison needs some calculation before the values can be compared (like the toLowerCase method in the example), then that calculation will be done several times. By caching the values after that calculation, it will be done only once for each item.

The primary time saving in that example is gotten by avoiding calls to toLowerCase() in the comparison function. The comparison function is called by the sort code each time a pair of elements needs to be compared, so that's a savings of a lot of function calls. The cost of building and un-building the map is worth it for large arrays.
That the comparison function may be called more than once per element is a natural implication of how sorting works. If only one comparison per element were necessary, it would be a linear-time process.
edit — the number of comparisons that'll be made will be roughly proportional to the length of the array times the base-2 log of the length. For a 1000 element array, then, that's proportional to 10,000 comparisons (probably closer to 15,000, depending on the actual sort algorithm). Saving 20,000 unnecessary function calls is worth the 2000 operations necessary to build and un-build the sort map.
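The estimate above is easy to check empirically. The sketch below counts how often the comparison function is called when sorting random numbers; countComparisons is a made-up name, and the exact count depends on the engine's sort algorithm and on the input, so no specific figure is claimed here.

```javascript
// Sorts n random numbers and returns how many times the comparator ran.
function countComparisons(n) {
  var data = [];
  for (var i = 0; i < n; i++) data.push(Math.random());
  var calls = 0;
  data.sort(function (a, b) {
    calls++;
    return a - b;
  });
  return calls;
}

// Measured count next to n * log2(n), which is about 9966 for n = 1000.
console.log(countComparisons(1000), Math.round(1000 * Math.log2(1000)));
```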

This is called the “decorate - sort - undecorate” pattern (you can find a nice explanation on Wikipedia).
The idea is that a comparison-based sort will have to call the comparison function at least n − 1 times (where n is the number of items in the list), as this is the number of comparisons you need just to check that the array is already sorted. Usually, the number of comparisons will be larger than that (O(n log n) if you are using a good algorithm), and according to the pigeonhole principle, there is at least one value that will be passed twice to the comparison function.
If your comparison function does some expensive processing before comparing the two values, then you can reduce the cost by first doing the expensive part and storing the result for each value (since you know that even in the best scenario you'll have to do that processing). Then, when sorting, you use a cheaper comparison function that only compares those cached outputs.
In this example, the "expensive" part is converting the string to lowercase.
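A compact sketch of the pattern using Array.prototype.map. The helper name sortByKey and its keyFn parameter are made up for illustration; keyFn plays the role of the expensive step (here toLowerCase) and is called exactly once per element.

```javascript
// Decorate-sort-undecorate: compute each key once, sort by the cached
// keys, then strip the keys again. The input array is not modified.
function sortByKey(list, keyFn) {
  return list
    .map(function (item) {            // decorate
      return { item: item, key: keyFn(item) };
    })
    .sort(function (a, b) {           // sort on the cached key only
      return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
    })
    .map(function (entry) {           // undecorate
      return entry.item;
    });
}

var sorted = sortByKey(["Delta", "alpha", "CHARLIE", "bravo"], function (s) {
  return s.toLowerCase();
});
console.log(sorted); // → ["alpha", "bravo", "CHARLIE", "Delta"]
```

For a key as cheap as toLowerCase on a short list, the extra objects may cost more than they save; the pattern pays off as the key computation or the list grows.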

Think of this like caching. It's simply saying that you should not do lots of calculation in the compare function, because you will be calculating the same value over and over.
What does it mean, "The compareFunction can be invoked multiple times per element within the array"?
It means exactly what it says. Let's say you have three items, A, B and C. They need to be sorted by the result of the compare function. The comparisons might be done like this:
compare(A) to compare(B)
compare(A) to compare(C)
compare(B) to compare(C)
So here, we have 3 values, but the compare() function was executed 6 times. Using a temporary array to cache things ensures we do a calculation only once per item, and can compare those results.
Secondly, I understand what's being done in the example, but I don't understand the potential "high[er] overhead" of the compareFunction.
What if compare() does a database fetch (comparing the counts of matching rows)? Or a complex math calculation (factorial, recursive Fibonacci, or iteration over a large number of items)? These are the sorts of things you don't want to do more than once.
I would say most of the time, it's fine to leave really simple/fast calculations inline. Don't over-optimize. But if you need to do anything complex or slow in the comparison, you have to be smarter about it.

To respond to your first question, why would the compareFunction be called multiple times per element in the array?
Sorting an array almost always requires more than N comparisons, where N is the size of the array (unless the array is already sorted). Thus, every element in your array may be compared to other elements up to N times (bubble sort requires at most N^2 comparisons). The compareFunction you provide will be used every time to determine whether two elements are less/equal/greater and thus will be called multiple times per element in the array.
A simple response to your second question, why would there be potentially higher overhead for a compareFunction?
Say your compareFunction does a lot of unnecessary work while comparing two elements of the array. This can cause sort to be slower, and thus using a compareFunction could potentially cause higher overhead.
