Efficient sparse mapping from integers to integers - JavaScript

I'm implementing a purpose-built regular expression engine using finite automata. I will have to store thousands of states, each state with its own transition table from Unicode code points (or UTF-16 code units; I haven't decided) to state IDs.
In many cases, the table will be extremely sparse, but in other cases it will be nearly full. In most cases, most of the entries will fall into several contiguous ranges with the same value.
The simplest implementation would be a lookup table, but each such table would take up a great deal of space. A list of (range, value) pairs would be much smaller, but slower. A binary search tree would be faster than a list.
Is there a better approach, perhaps leveraging built-in functionality?

Unfortunately, JavaScript's built-in data types - especially Map - are not of great help in accomplishing this task, as they lack the relevant methods.
"In most cases, most of the entries will fall into several contiguous ranges with the same value."
We can, however, exploit this and use a binary search strategy on sorted arrays, assuming the transition tables won't be modified often.
Encode contiguous input ranges leading to the same state by storing each input range's lowest value in a sorted array. Keep the states at corresponding indices in a separate array:
let inputs = [0, 5, 10]; // Input ranges [0,4], [5,9], [10,∞)
let states = [0, 1, 0 ]; // Inputs [0,4] lead to state 0, [5,9] to 1, [10,∞) to 0
Now, given an input, you need to perform a binary search on the inputs array similar to Java's floorEntry(k):
// Returns the index of the greatest element less than or equal to
// the given element, or -1 if there is no such element:
function floorIndex(sorted, element) {
  let low = 0;
  let high = sorted.length - 1;
  while (low <= high) {
    let mid = (low + high) >> 1;
    if (sorted[mid] > element) {
      high = mid - 1;
    } else if (sorted[mid] < element) {
      low = mid + 1;
    } else {
      return mid;
    }
  }
  return low - 1;
}
// Example: Transition to 1 for emoticons in range 1F600 - 1F64F:
let transitions = {
  inputs: [0x00000, 0x1F600, 0x1F650],
  states: [0, 1, 0]
};
let input = 0x1F60B; // 😋
let next = transitions.states[floorIndex(transitions.inputs, input)];
console.log(`transition to ${next}`);
This search completes in O(log n) steps where n is the number of contiguous input ranges. The transition table for a single state then has a space requirement of O(n). This approach works equally well for sparse and dense transition tables as long as our initial assumption - the number of contiguous input ranges leading to the same state is small - holds.

Sounds like you have two very different cases ("in many cases, the table will be extremely sparse, but in other cases it will be nearly full").
For the sparse case, you could possibly have a separate sparse index (or several layers of indexes), then your actual data could be stored in a typed array. Because the index(es) would be mapping from integers to integers, they could be represented as typed arrays as well.
Looking up a value would look like this:
1. Binary search the index. The index stores pairs as consecutive entries in the typed array – the first element is the search value, the second is the position in the data set (or the next index).
2. If you have multiple indexes, repeat 1 as necessary.
3. Start iterating your dataset at the position given by the last index. Because the index is sparse, this position might not be the one where the value is stored, but it is a good starting point as the correct value is guaranteed to be nearby.
4. The dataset itself is represented as a typed array where consecutive pairs hold the key and the value.
I cannot think of anything better to use in JavaScript. Typed arrays are pretty fast, and having indexes should increase the speed drastically. That being said, if you only have a couple thousand entries, don't bother with indexes; do a binary search directly on the typed array (as described in 4 above).
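For the flat, unindexed variant from point 4, a minimal sketch might look like this (an illustration only, assuming keys and values fit into 32-bit unsigned integers; lookup is just an illustrative name):
// Keys and values interleaved in one typed array [k0, v0, k1, v1, ...],
// with keys sorted ascending. Returns the value stored for `key`,
// or undefined if the key is not present.
function lookup(pairs, key) {
  let low = 0;
  let high = (pairs.length >> 1) - 1; // number of pairs - 1
  while (low <= high) {
    const mid = (low + high) >> 1;
    const k = pairs[mid << 1];
    if (k < key) {
      low = mid + 1;
    } else if (k > key) {
      high = mid - 1;
    } else {
      return pairs[(mid << 1) + 1];
    }
  }
  return undefined;
}
// Example: code point 0x61 maps to state 3, 0x62 to state 7.
const pairs = Uint32Array.from([0x61, 3, 0x62, 7]);
console.log(lookup(pairs, 0x61)); // 3
console.log(lookup(pairs, 0x63)); // undefined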
For the dense case, I am not sure. If the dense case happens to be a case where repeated values across ranges of keys are likely, consider using something like run-length encoding – identical consecutive values are represented simply as their number of occurrences and then the actual value. Once again, use typed arrays and binary search, possibly even indexes to make this faster.
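One way to make such a run-length encoded row binary-searchable is to store cumulative run ends rather than raw counts; a rough sketch (array and function names are only illustrative):
// `runEnds[i]` is the last key covered by run i (runs are contiguous and
// sorted ascending); `runValues[i]` is the state shared by every key in
// that run. Finds the first run whose end is >= key.
function rleLookup(runEnds, runValues, key) {
  let low = 0;
  let high = runEnds.length - 1;
  while (low < high) {
    const mid = (low + high) >> 1;
    if (runEnds[mid] < key) {
      low = mid + 1;
    } else {
      high = mid;
    }
  }
  return runValues[low];
}
// Keys 0-9 -> state 0, 10-99 -> state 2, 100-0x10FFFF -> state 0:
const runEnds = Uint32Array.from([9, 99, 0x10FFFF]);
const runValues = Uint8Array.from([0, 2, 0]);
console.log(rleLookup(runEnds, runValues, 50)); // 2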

Related

Has this sorting algorithm been invented? Is it linear time complexity?

I am not sure how objects natively sort numbers, and whether that affects the time complexity of this algorithm. That is why I wonder if this is linear (O(n)).
I am aware that the space complexity is terrible.
This is my code:
const objSort = (arr) => {
  let storage = {};
  let sorted = [];
  let entries;
  arr.forEach((num) => {
    storage[num] ? storage[num]++ : storage[num] = 1;
  });
  entries = Object.entries(storage);
  entries.forEach(([key, value]) => {
    for (let i = 0; i < value; i++) {
      sorted.push(+key);
    }
  });
  return sorted;
};
I am not sure about how objects sort numbers automatically
Nope, they do not. Object key order is not guaranteed. Therefore, you don't actually sort anything :)
This is the gist of a potentially linear sort. However, it depends on several implementation requirements.
First, I want to make sure that I understand the approach:
Start with a blank storage structure. This will count the occurrences of each element of arr, indexed by the element itself. For instance, given the input string "abaczaz", storage would finish as {'a': 3, 'b':1, 'c':1, 'z':2}
Iterate through storage; for each entry, emit the indicated quantity of the listed element. For the current example, this would produce "aaabczz".
Each of these iterations is length N, a linear solution. However, note the requirements for this to work:
You must have a mapping of objects to indexes that is O(1).
That mapping must embody the sorting order for the objects.
Access time into and from storage must be O(1).
Iteration through storage must be O(N).
For this last point, most solutions are O(range(object)), a number large enough to be impractical. For instance, to sort 64-bit integers, you would need an array of length 2^64 memory locations, and you'd iterate through all 2^64 integers to emit the sorted array.
Granted, the fixed bound makes this technically O(1), but for a very large value of 1. :-) In theoretical terms, it's O(log(max(arr) - min(arr))) per element, as the key size depends on the value range. This drives your overall complexity to O(n log n).
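For reference, here is a minimal counting sort sketch under the assumptions above (non-negative integer keys with a small, known bound), which makes the iteration order explicit instead of relying on object key enumeration; countingSort and maxValue are just illustrative names:
function countingSort(arr, maxValue) {
  const counts = new Uint32Array(maxValue + 1); // O(range) memory
  for (const num of arr) {
    counts[num]++;                              // O(n) counting pass
  }
  const sorted = [];
  for (let value = 0; value <= maxValue; value++) { // O(range) emit pass
    for (let i = 0; i < counts[value]; i++) {
      sorted.push(value);
    }
  }
  return sorted;
}
console.log(countingSort([5, 1, 4, 1, 3], 5)); // [1, 1, 3, 4, 5]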
FINALLY ...
Yes, this sort has already been implemented. I've had it as an interview question.

Finding the difference between two arrays in PHP, Node and Golang

Here is a typical example of what I need to do
$testArr = array(2.05080E6, 29400, 420);
$stockArrays = array(
    array(2.05080E6, 29400, 0),
    array(2.05080E6, 9800, 420),
    array(1.715E6, 24500, 280),
    array(2.05080E6, 29400, 140),
    array(2.05080E6, 4900, 7));
I need to identify the stockArray that is the least different. A few clarifications:
The numeric values of array elements at each position are guaranteed not to overlap (i.e. arr[0] will always have the biggest values, arr[1] will be at least an order of magnitude smaller, etc.).
The absolute values of the differences do not count when determining least different. Only the number of differing array indices matters.
Positional differences do have a weighting. Thus in my example stockArr[1] is "more different" even though it too - like its stockArr[0] & stockArr[3] counterparts - differs in only one index position, because the values at that index position are bigger.
The number of stockArrays elements will typically be fewer than 10 but could potentially be much more (though never into 3 figures).
The stock arrays will always have the same number of elements. The test array will have the same or fewer elements. However, when it has fewer, testArr would be padded out so that potentially matching elements are always in the same place as in the stockArray. e.g.
$testArr = array(29400, 140);
would be transformed to
$testArr = array(0, 29400, 140);
prior to being subjected to difference testing.
Finally, a tie is possible. For instance, in my example above, the matches would be stockArrays[0] and stockArrays[3].
In my example the result would be
$result = array(0=>array(0,0,1),3=>array(0,0,1));
indicating that the least different stock arrays are at indices 0 & 3 with the differences being at position 2.
In PHP I would handle all of this with array_diff as my starting point. For Node/JavaScript I would probably be tempted to use the php.js array_diff port, though I would be inclined to explore a bit, given that in the worst case scenario it is an O(n^2) affair.
I am a newbie when it comes to Golang so I am not sure how I would implement this problem there. I have noted that Node does have an array_diff npm module.
One off-beat idea I have had is converting the array to a padded string (smaller array elements are 0 padded) and effectively do an XOR on the ordinal value of each character but have dismissed that as probably a rather nutty thing to do.
I am concerned with speed but not at all costs. In an ideal world the same solution (algorithm) would be used in each target language though in reality the differences between them might mean that is not possible/not a good idea.
Perhaps someone here might be able to point me to less pedestrian ways of accomplishing this - i.e. not just array_diff ports.
Here's the equivalent of the array_diff solution: (assuming I didn't make a mistake)
package main

import "fmt"

func FindLeastDifferent(needle []float64, haystack [][]float64) int {
    if len(haystack) == 0 {
        return -1
    }
    var currentIndex, currentDiff int
    for i, arr := range haystack {
        diff := 0
        for j := range needle {
            if arr[j] != needle[j] {
                diff++
            }
        }
        if i == 0 || diff < currentDiff {
            currentDiff = diff
            currentIndex = i
        }
    }
    return currentIndex
}

func main() {
    idx := FindLeastDifferent(
        []float64{2.05080E6, 29400, 420},
        [][]float64{
            {2.05080E6, 29400, 0},
            {2.05080E6, 9800, 420},
            {1.715E6, 24500, 280},
            {2.05080E6, 29400, 140},
            {2.05080E6, 4900, 7},
            {2.05080E6, 29400, 420},
        },
    )
    fmt.Println(idx)
}
Like you said, it's O(n * m) where n is the number of elements in the needle array, and m is the number of arrays in the haystack.
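For the Node side, a sketch of the same positional-mismatch count (like array_diff, it ignores the positional weighting mentioned in the question; findLeastDifferent is just an illustrative name, and it returns all tied indices):
function findLeastDifferent(needle, haystack) {
  let best = Infinity;
  let result = [];
  haystack.forEach((arr, i) => {
    let diff = 0;
    for (let j = 0; j < needle.length; j++) {
      if (arr[j] !== needle[j]) diff++;
    }
    if (diff < best) {          // new best: reset the result list
      best = diff;
      result = [i];
    } else if (diff === best) { // tie: keep both indices
      result.push(i);
    }
  });
  return result;
}
console.log(findLeastDifferent(
  [2.05080E6, 29400, 420],
  [
    [2.05080E6, 29400, 0],
    [2.05080E6, 9800, 420],
    [1.715E6, 24500, 280],
    [2.05080E6, 29400, 140],
    [2.05080E6, 4900, 7]
  ]
)); // [0, 1, 3] - one mismatching position each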
If you don't know the haystack ahead of time, then there's probably not much you can do to improve this. But if, instead, you're storing this list in a database, I think your intuition about string search has some potential. PostgreSQL for example supports string similarity indexes. (And here's an explanation of a similar idea for regular expressions: http://swtch.com/~rsc/regexp/regexp4.html)
One other idea: if your arrays are really big you can calculate fuzzy hashes (http://ssdeep.sourceforge.net/) which would make your n smaller.

First element in array

Why do indexes in arrays always start with 0? Does it have something to do with binary? For example:
var myArray = [5,6,7,8];
To access the number 5, you would have to say
myArray[0]
But why?
No, I don't have a real problem. As you can evidently tell I'm new to this stuff.
I'm sure this has been asked and answered a hundred times, but I'll bite.
One way of looking at the "index" or "key" is as an "offset".
myArray essentially acts as a pointer to the first item in a series of items. Specifically, it points to the number "5" in memory. So when you say myArray[1] it's like saying "the location of the first element in myArray plus 1 item over", thus you would be jumping over the first element.
In C, when you write *myArray (pointer dereference) it actually gives you back the first element.
#include <stdio.h>

int main(void) {
    int myArray[] = {5, 6, 7, 8};
    printf("%d", *myArray);       // prints "5", equivalent to myArray[0]
    printf("%d", *(myArray + 1)); // prints "6", equivalent to myArray[1]
    return 0;
}
There are more practical reasons than "that's the way computers work" too.
A nice blog post about the historical reasons: http://developeronline.blogspot.fi/2008/04/why-array-index-should-start-from-0.html
It's basic computer science stuff, which harkens back to the day when memory was very limited. Everything started with 0 and not 1 because, starting at 0, you can represent ten distinct values in a single digit.
You're clearly new to this; trust me, from now on you'll be counting 0, 1, 2, 3!
Wikipedia gives us this explanation:
Index origin
Some languages, such as C, provide only zero-based array types, for which the minimum valid value for any index is 0. This choice is convenient for array implementation and address computations. With a language such as C, a pointer to the interior of any array can be defined that will symbolically act as a pseudo-array that accommodates negative indices. This works only because C does not check an index against bounds when used. Other languages provide only one-based array types, where each index starts at 1; this is the traditional convention in mathematics for matrices and mathematical sequences. A few languages, such as Pascal, support n-based array types, whose minimum legal indices are chosen by the programmer. The relative merits of each choice have been the subject of heated debate.
Zero-based indexing has a natural advantage to one-based indexing in avoiding off-by-one or fencepost errors. See comparison of programming languages (array) for the base indices used by various languages.
Read more about arrays here
Read more about off-by-one and fencepost errors here
In Javascript, like many other languages, arrays always start at index zero, but it's not that way in all languages.
In Pascal, for example, you define the lower and upper boundary, so you can start an array at index three:
var myArray: array[3..6] of Integer;
It's most common to start arrays at zero, because that's most efficient when you access the items. If you start at any other index, that value has to be subtracted when the address where the item is stored is calculated. That extra calculation wouldn't be an issue today, but back when languages like C were designed, it surely was.
(Well, arrays in JavaScript are actually accessed completely differently from most other languages, but they use zero-based indexes because most of the similar languages they drew inspiration from do.)

Why use sorting maps on arrays? How is it better in some instances?

I'm trying to learn about array sorting. It seems pretty straightforward. But on the Mozilla site, I ran across a section discussing sorting maps (about three-quarters down the page).
The compareFunction can be invoked multiple times per element within the array. Depending on the compareFunction's nature, this may yield a high overhead. The more work a compareFunction does and the more elements there are to sort, the wiser it may be to consider using a map for sorting.
The example given is this:
// the array to be sorted
var list = ["Delta", "alpha", "CHARLIE", "bravo"];

// temporary holder of position and sort-value
var map = [];

// container for the resulting order
var result = [];

// walk original array to map values and positions
for (var i = 0, length = list.length; i < length; i++) {
  map.push({
    // remember the index within the original array
    index: i,
    // evaluate the value to sort
    value: list[i].toLowerCase()
  });
}

// sorting the map containing the reduced values
map.sort(function(a, b) {
  return a.value > b.value ? 1 : -1;
});

// copy values in right order
for (var i = 0, length = map.length; i < length; i++) {
  result.push(list[map[i].index]);
}

// print sorted list
print(result);
I don't understand a couple of things. To wit: What does it mean, "The compareFunction can be invoked multiple times per element within the array"? Can someone show me an example of that. Secondly, I understand what's being done in the example, but I don't understand the potential "high[er] overhead" of the compareFunction. The example shown here seems really straightforward and mapping the array into an object, sorting its value, then putting it back into an array would take much more overhead I'd think at first glance. I understand this is a simple example, and probably not intended for anything else than to show the procedure. But can someone give an example of when it would be lower overhead to map like this? It seems like a lot more work.
Thanks!
When sorting a list, an item isn't just compared to one other item, it may need to be compared to several other items. Some of the items may even have to be compared to all other items.
Let's see how many comparisons there actually are when sorting an array:
var list = ["Delta", "alpha", "CHARLIE", "bravo", "orch", "worm", "tower"];
var o = [];
for (var i = 0; i < list.length; i++) {
o.push({
value: list[i],
cnt: 0
});
}
o.sort(function(x, y){
x.cnt++;
y.cnt++;
return x.value == y.value ? 0 : x.value < y.value ? -1 : 1;
});
console.log(o);
Result:
[
{ value="CHARLIE", cnt=3},
{ value="Delta", cnt=3},
{ value="alpha", cnt=4},
{ value="bravo", cnt=3},
{ value="orch", cnt=3},
{ value="tower", cnt=7},
{ value="worm", cnt=3}
]
(Fiddle: http://jsfiddle.net/Guffa/hC6rV/)
As you see, each item was compared to several other items. The string "tower" even had more comparisons than there are other strings, which means that it was compared to at least one other string at least twice.
If the comparison needs some calculation before the values can be compared (like the toLowerCase method in the example), then that calculation will be done several times. By caching the values after that calculation, it will be done only once for each item.
The primary time saving in that example is gotten by avoiding calls to toLowerCase() in the comparison function. The comparison function is called by the sort code each time a pair of elements needs to be compared, so that's a savings of a lot of function calls. The cost of building and un-building the map is worth it for large arrays.
That the comparison function may be called more than once per element is a natural implication of how sorting works. If only one comparison per element were necessary, it would be a linear-time process.
edit — the number of comparisons that'll be made will be roughly proportional to the length of the array times the base-2 log of the length. For a 1000 element array, then, that's proportional to 10,000 comparisons (probably closer to 15,000, depending on the actual sort algorithm). Saving 20,000 unnecessary function calls is worth the 2000 operations necessary to build and un-build the sort map.
This is called the “decorate - sort - undecorate” pattern (you can find a nice explanation on Wikipedia).
The idea is that a comparison-based sort will have to call the comparison function at least n - 1 times (where n is the number of items in the list), as that is the number of comparisons you need just to check that the array is already sorted. Usually, the number of comparisons will be larger than that (O(n ln n) if you are using a good algorithm), and by the pigeonhole principle, at least one value will be passed to the comparison function more than once.
If your comparison function does some expensive processing before comparing the two values, then you can reduce the cost by first doing the expensive part and storing the result for each value (since you know that even in the best scenario you'll have to do that processing). Then, when sorting, you use a cheaper comparison function that only compares those cached outputs.
In this example, the "expensive" part is converting the string to lowercase.
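For reference, a compact sketch of the same decorate-sort-undecorate idea, computing the lowercased key once per element (this just illustrates the pattern described above, not a different technique):
const list = ["Delta", "alpha", "CHARLIE", "bravo"];

// Decorate: compute the expensive key once per element.
const decorated = list.map(value => ({ value, key: value.toLowerCase() }));

// Sort: the comparison only touches the cached keys.
decorated.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

// Undecorate: pull the original values back out in sorted order.
const result = decorated.map(item => item.value);

console.log(result); // ["alpha", "bravo", "CHARLIE", "Delta"]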
Think of this like caching. It's simply saying that you should not do lots of calculation in the compare function, because you will be calculating the same value over and over.
What does it mean, "The compareFunction can be invoked multiple times per element within the array"?
It means exactly what it says. Let's say you have three items, A, B, and C. They need to be sorted by the result of the compare function. The comparisons might be done like this:
compare(A) to compare(B)
compare(A) to compare(C)
compare(B) to compare(C)
So here, we have 3 values, but the compare() function was executed 6 times. Using a temporary array to cache things ensures we do a calculation only once per item, and can compare those results.
Secondly, I understand what's being done in the example, but I don't understand the potential "high[er] overhead" of the compareFunction.
What if compare() does a database fetch (comparing the counts of matching rows)? Or a complex math calculation (factorial, recursive Fibonacci, or iteration over a large number of items)? These sorts of things you don't want to do more than once.
I would say most of the time it's fine to leave really simple/fast calculations inline. Don't over-optimize. But if you need to do anything complex or slow in the comparison, you have to be smarter about it.
To respond to your first question, why would the compareFunction be called multiple times per element in the array?
Sorting an array almost always requires more than N comparisons, where N is the size of the array (unless the array is already sorted). Thus, every element in your array may be compared to another element in your array up to N times (bubble sort requires at most N^2 comparisons). The compareFunction you provide will be used every time to determine whether two elements are less/equal/greater, and thus will be called multiple times per element in the array.
A simple response to your second question: why would there be potentially higher overhead for a compareFunction?
Say your compareFunction does a lot of unnecessary work while comparing two elements of the array. This can cause sort to be slower, and thus using a compareFunction could potentially cause higher overhead.

efficiently finding an object that falls within a certain number range

Here's my basic problem: I'm given a currentTime. For example, 750 seconds. I also have an array which contains 1000 to 2000 objects, each of which have a startTime, endTime, and an _id attribute. Given the currentTime, I need to find the object that has a startTime and endTime that falls within that range -- for example, startTime : 740, endTime : 755.
What is the most efficient way to do this in Javascript?
I've simply been doing something like this, for starters:
var arrayLength = array.length;
var x = 0;
while (x < arrayLength) {
  if (currentTime >= array[x].startTime && currentTime <= array[x].endTime) {
    // then I've found my object
  }
  x++;
}
But I suspect that looping isn't the best option here. Any suggestions?
EDIT: For clarity, the currentTime has to fall within the startTime and endTime
My solution: The structure of my data affords me certain benefits that allows me to simplify things a bit. I've done a basic binary search, as suggested, since the array is already sorted by startTime. I haven't fully tested the speed of this thing, but I suspect it's a fair bit faster, especially with larger arrays.
var binarySearch = function(array, currentTime) {
  var low = 0;
  var high = array.length - 1;
  var i;
  while (low <= high) {
    i = Math.floor((low + high) / 2);
    if (array[i].startTime <= currentTime) {
      if (array[i].endTime >= currentTime) {
        // this is the one
        return array[i]._id;
      } else {
        low = i + 1;
      }
    } else {
      high = i - 1;
    }
  }
  return null;
};
The best way to tackle this problem depends on the number of times you will have to call your search function.
If you call your function just a few times, let's say m times, go for linear search. The overall complexity for the calls of this function will be O(mn).
If you call your function many times, and by many I mean more than log(n) times, you should:
1. Sort your array in O(n log n) by startTime, then by endTime if you have several items with equal values of startTime.
2. Do binary search to find the range of elements with startTime <= x. This means doing two binary searches: one for the start of the range and one for the end of the range. This is done in O(log n).
3. Do linear search inside [start, end]. You have to do linear search because the order of startTimes tells you nothing about the endTimes. This can be anywhere between O(1) and O(n), and it depends on the distribution of your segments and the value of x.
Average case: O(n log n) for initialization and O(log n) for each search.
Worst case: an array containing many equal segments, or segments that have a common interval, and searching in this interval. In that case you will do O(n log n) for initialization and O(n + log n) = O(n) for search.
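A minimal sketch of the three steps above, assuming the array is already sorted by startTime (findContaining is just an illustrative name; a single binary search suffices here because the elements with startTime <= t always form a prefix of the sorted array, and the worst case is O(n) as noted):
function findContaining(items, t) {
  // Binary search for the last index whose startTime is <= t.
  var low = 0, high = items.length - 1, last = -1;
  while (low <= high) {
    var mid = (low + high) >> 1;
    if (items[mid].startTime <= t) {
      last = mid;
      low = mid + 1;
    } else {
      high = mid - 1;
    }
  }
  // Linear scan of the candidate prefix; only the endTimes are unknown.
  var matches = [];
  for (var i = 0; i <= last; i++) {
    if (items[i].endTime >= t) matches.push(items[i]);
  }
  return matches;
}

var items = [
  { _id: "a", startTime: 0,   endTime: 100 },
  { _id: "b", startTime: 740, endTime: 755 },
  { _id: "c", startTime: 760, endTime: 800 }
];
console.log(findContaining(items, 750).map(function(x) { return x._id; })); // ["b"]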
Sounds like a problem for binary search.
Assuming that your search array is long-lived and relatively constant, the first iteration would be to sort all the array elements by start time (or create an index of sorted start times pointing to the array elements if you don't want them sorted).
Then you can efficiently (with a binary chop) discount ones that start too late. A sequential search of the others would then be faster.
For even more speed, maintain separate sorted indexes for start and end times. Then do the same operation as mentioned previously to throw away those that start too late.
Then, for the remaining ones, use the end time index to throw away those that end too early, and what you have left is your candidate list.
But, make sure this is actually needed. Two thousand elements doesn't seem like a huge amount so you should time the current approach and only attempt optimisation if it is indeed a problem.
From the information given it is not possible to tell what would be the best solution. If the array is not sorted, looping is the best way for single queries. A single scan along the array only takes O(N) (where N is the length of the array), whereas sorting it and then doing a binary search would take O(N log(N) + log(N)), so in this case it would take more time.
The analysis will look much different if you have a great number of different queries on the same large array. If you have about N queries on the same array, sorting might actually improve the performance, as each query will take O(log(N)). Thus for N queries it will require O(N log(N)) in total (the extra log(N) term gets absorbed now), whereas the unsorted search will take O(N^2), which is clearly larger. Exactly when sorting starts to pay off also depends on the size of the array.
The situation is also different again, when you update the array fairly often. Updating an unsorted array can be done in O(1) amortized, whereas updating a sorted array takes O(N). So if you have fairly frequent updates sorting might hurt.
There are also some very efficient data structures for range queries, however again it depends on the actual usage if they make sense or not.
If the array is not sorted, yours is the correct way.
Do not fall into the trap of thinking to sort the array first, and then apply your search.
With the code you tried, you have a complexity of O(n), where n is the number of elements.
If you sort the array first, you first incur a cost of O(n log(n)) (compare to Sorting algorithm), in the average case.
Then you have to apply the binary search, which executes at an average complexity of O(log2(n) - 1).
So, you will end up spending, in the average case:
O(n log(n) + (log2(n) - 1))
instead of just O(n).
An interval tree is a data structure that allows answering such queries in O(lg n) time (both average and worst-case), if there are n intervals total. Preprocessing time to construct the data structure is O(n lg n); space is O(n). Insertion and deletion times are O(lg n) for augmented interval trees. Time to answer all-interval queries is O(m + lg n) if m intervals cover a point. Wikipedia describes several kinds of interval trees; for example, a centered interval tree is a ternary tree with each node storing:
• A center point
• A pointer to another node containing all intervals completely to the left of the center point
• A pointer to another node containing all intervals completely to the right of the center point
• All intervals overlapping the center point sorted by their beginning point
• All intervals overlapping the center point sorted by their ending point
Note, an interval tree has O(lg n) complexity for both average and worst-case queries that find one interval to cover a point. The previous answers have O(n) worst-case query performance for the same. Several previous answers claimed to have O(lg n) average time, but none of them offer evidence; they merely assert it. The main feature of those previous answers is using a binary search for begin times. Then some say to use a linear search, and others a binary search, for end times, but without making clear what set of intervals the latter search is over. The claimed O(lg n) average performance is merely wishful thinking. As pointed out in the Wikipedia article under the heading Naive Approach,
A naive approach might be to build two parallel trees, one ordered by the beginning point, and one ordered by the ending point of each interval. This allows discarding half of each tree in O(log n) time, but the results must be merged, requiring O(n) time. This gives us queries in O(n + log n) = O(n), which is no better than brute-force.
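A compact sketch of a centered interval tree along those lines (intervals are {start, end, _id} objects; the field and function names are only illustrative, and the center point here is simply a median endpoint):
function buildIntervalTree(intervals) {
  if (intervals.length === 0) return null;
  // Use the median endpoint as the center point of this node.
  const points = intervals.flatMap(iv => [iv.start, iv.end]).sort((a, b) => a - b);
  const center = points[points.length >> 1];
  const left = [], right = [], overlapping = [];
  for (const iv of intervals) {
    if (iv.end < center) left.push(iv);        // entirely left of center
    else if (iv.start > center) right.push(iv); // entirely right of center
    else overlapping.push(iv);                  // overlaps the center point
  }
  return {
    center,
    byStart: [...overlapping].sort((a, b) => a.start - b.start),
    byEnd: [...overlapping].sort((a, b) => b.end - a.end), // descending end
    left: buildIntervalTree(left),
    right: buildIntervalTree(right)
  };
}

function queryPoint(node, point, out = []) {
  if (!node) return out;
  if (point < node.center) {
    // Intervals overlapping the center that also start at or before `point`.
    for (const iv of node.byStart) {
      if (iv.start > point) break;
      out.push(iv);
    }
    return queryPoint(node.left, point, out);
  } else if (point > node.center) {
    // Intervals overlapping the center that also end at or after `point`.
    for (const iv of node.byEnd) {
      if (iv.end < point) break;
      out.push(iv);
    }
    return queryPoint(node.right, point, out);
  }
  // point === center: every interval stored at this node overlaps it.
  out.push(...node.byStart);
  return out;
}

const tree = buildIntervalTree([
  { start: 0,   end: 100, _id: "a" },
  { start: 740, end: 755, _id: "b" },
  { start: 760, end: 800, _id: "c" }
]);
console.log(queryPoint(tree, 750).map(iv => iv._id)); // ["b"]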
