I am learning about join algorithms in the context of relational query processing. The simple case is the nested loop join:
function nestedJoin(R, S, compare) {
  const out = []
  for (const r of R) {
    for (const s of S) {
      if (compare(r, s)) {
        out.push([ r, s ])
      }
    }
  }
  return out
}
Where compare would compare the join attribute.
The case I'm wondering about is the index join. Roughly translating the pseudo-code from the cheat sheet into JS, we have:
function indexJoin(R, S) {
  const out = []
  for (const r of R) {
    const X = findInIndex(S.C, r.c)
    for (const s of X) {
      out.push([ r, s ])
    }
  }
  return out
}
But what is that findInIndex(S.C, r.c)? What is being passed into it (S.C)? And how does it work?
The join indices paper says this:
With no indices, two basic algorithms based on sorting [4] and hashing [5, 10] avoid the prohibitive cost of the nested loop method.
A join index is a binary relation. It only contains pairs of surrogates, which makes it small. However, for generality, we assume that it does not always fit in RAM. Therefore, a join index must be clustered. Since we may need fast access to JI tuples via either r values or s values, depending on whether there are selects on relations R or S, a JI should be clustered on (r, s). A simple and uniform solution is to maintain two copies of the JI, one clustered on r and the other clustered on s. Each copy is implemented by a B+-tree, an efficient variation of the versatile B-tree [1, 7].
So if it were a B+tree, what would the key and value be in each B+tree, and how would you use them (in what order do you plug in keys and get values out)? Also, couldn't the "join index" just be implemented with something like this in JavaScript?
const joinIndex = {}

function join(r, s) {
  const rest = joinIndex[r.c] = joinIndex[r.c] ?? {}
  rest[s.c] = true
}

function findInIndex(leftKey) {
  return Object.keys(joinIndex[leftKey])
}
Please show how the join algorithm would be implemented, either with my approach or the B+tree approach. If it is the B+tree approach, you don't need to implement a B+tree; just explain how you would plug things into and out of the two B+trees to make the algorithm clearer.
First of all, the join index the paper speaks of can best be imagined as a table that implements a many-to-many relationship between two tables. A record in this join index table consists of two foreign keys: one referencing the primary key in the R table, and another referencing the primary key in the S table.
I didn't get the S.C notation used in the cheat sheet. But it is clear you'll somehow need to specify which join index to use, and more specifically, which B+Tree (clustering) you want to use on it (in case two of them are defined), and finally, which value (r.c, the key of r) you want to find in it.
The role of the B+tree is to provide an ordered dictionary: you can search for a key efficiently, and you can easily walk from that point through the subsequent entries in order. For this particular use as a join index, that allows you to efficiently find all pairs (r1, s) for a given r1. The first of those is found by drilling down from the root of the B+tree to the first leaf entry with key r1. A walk forward across the bottom layer of the B+tree then yields all the other tuples with r1, until a tuple is encountered that no longer has this r1 value.
Note that you still need an index on the original tables as well, in order to find the complete record for a given key. In practice that could also be done with a B+Tree, but in JavaScript, a simple dictionary (plain object) would suffice.
So in JavaScript syntax we could imagine something like this:
// Arguments:
// - joinIndexTree: a B+Tree of (rKey, sKey) tuples, keyed and ordered by rKey.
// - rKey: the key to find all matches for
function findInIndex(joinIndexTree, rKey) {
  let result = []; // Collects every sKey for which there is an (rKey, sKey) tuple
  // Find the left-most location in the B+Tree where rKey should occur (if present)
  let btreeCursor = joinIndexTree.find(rKey);
  if (btreeCursor.EOF()) return result; // At the far-right end of the B+Tree
  let tuple = btreeCursor.get(); // Read the tuple found at this location
  while (tuple[0] === rKey) { // First member of the tuple matches rKey
    result.push(tuple[1]); // Collect the corresponding s-value
    btreeCursor.next(); // Walk to the next tuple
    if (btreeCursor.EOF()) break; // At the end of the B+Tree
    tuple = btreeCursor.get(); // Read the tuple found at this location
  }
  return result;
}
The main program would be:
// joinIndexTree: assumed to be a B+Tree loaded with (rKey, sKey) pairs, ordered by rKey
const sIndex = Object.fromEntries(S.map(s => [s.id, s])); // primary-key dictionary for S

function indexJoin(joinIndexTree, R, sIndex) {
  const out = []
  for (const r of R) {
    const sids = findInIndex(joinIndexTree, r.id)
    for (const s_id of sids) {
      const s = sIndex[s_id]; // Look up the full record by primary key
      out.push([ r, s ])
    }
  }
  return out
}
When you only need read-only operations on the tables (queries), then instead of a B+Tree you can create a dictionary of arrays, where a lookup of joinIndex[r.id] gives you an array of s.id values. This is certainly easy to set up and work with, but it is a pain to keep updated when the tables are not read-only.
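A minimal sketch of that read-only variant, assuming (as above) that records expose an id field; buildJoinIndex and matches are illustrative names, not from the original, and matches is assumed to be an array of [r, s] pairs produced once, e.g. by a nested loop or sort-merge pass:

// Build a dictionary-of-arrays join index from known (r, s) matches.
function buildJoinIndex(matches) {
  const joinIndex = {};
  for (const [r, s] of matches) {
    (joinIndex[r.id] = joinIndex[r.id] ?? []).push(s.id);
  }
  return joinIndex;
}

// Lookup is then a plain property access:
function findInIndex(joinIndex, rKey) {
  return joinIndex[rKey] ?? [];
}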
As an alternative to a B+Tree, you can also use other balanced search trees, such as AVL or red-black trees, but in my experience B+Trees have superior performance.
My task is:
Implement the function duplicateStudents(), which gets the variable "students" and filters for students with the same matriculation number. First, project all elements in students by matriculation number. After that you can filter for duplicates relatively easily. At the end, project using the following format: { matrikelnummer: (matrikelnummer), students: [ (students[i], students[j], ...) ] }.
Implement the invalidGrades() function, which gets the variable "grades" and filters for possibly incorrect grades. In order to keep manual checking to a minimum, the function should determine for which matriculation numbers several grades were transmitted for the same course. Example: for matriculation number X, a 2.7 and a 2.3 were transmitted for course Y. However, the function would also pick up the valid case, i.e. for matriculation number X, once a 5.0 and once a 2.3 were transmitted for course Y.
In this task you should only use map(), reduce(), and filter(). Do not use for-loops.
function duplicateStudents(students) {
  return students
  // TODO: implement me
}

function invalidGrades(grades) {
  return grades
    .map((s) => {
      // TODO: implement me
      return {
        matrikelnummer: -1, /* put something here */
        grades: [], /* put something here */
      };
    })
    .filter((e) => e.grades.length > 0)
}
The variables students and grades are in a separate file. I know it might be helpful to upload the files too, but one is 1000 lines long and the other 500, which is why I'm not uploading them. I hope it is possible to do the task without the values. It is important to say that the values are represented as arrays.
I'll give you an example of using reduce for duplicateStudents. It doesn't return the expected format, but you can go from there.
const duplicateStudents = (students) => {
  const grouping = students.reduce((previous, current) => {
    if (previous[current.matrikelnummer]) previous[current.matrikelnummer].push(current); // matrikelnummer already exists: add the student
    else previous[current.matrikelnummer] = [current];
    return previous;
  }, {});
  console.log(grouping);
  return; // you could process `grouping` into the expected format here
};
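One way to finish from there, as a sketch that keeps to the task's map()/filter()/reduce() constraint and assumes grouping is built as above:

const duplicateStudents = (students) => {
  const grouping = students.reduce((previous, current) => {
    (previous[current.matrikelnummer] = previous[current.matrikelnummer] ?? []).push(current);
    return previous;
  }, {});
  // Keep only matriculation numbers that occur more than once,
  // then project into the requested shape.
  // Note: Object.entries() yields keys as strings; convert with Number() if needed.
  return Object.entries(grouping)
    .filter(([, group]) => group.length > 1)
    .map(([matrikelnummer, group]) => ({ matrikelnummer, students: group }));
};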
Here are references for you: map(), filter(), reduce().
I have a React JS app that uses an API to pull a JSON object containing an array of 10,000+ objects. This array is used to populate a table, with various filters and options from checkboxes and dropdowns that can manipulate the data. When a checkbox is tapped, filter, sort, and reduce functions are run on the array to return a specific subset of the array that repopulates the table.
There are 10-15 options to choose from, so 10-15 filter/map/reduce functions running on the data each time a box is checked.
These filtering options now cause a noticeable lag between clicking on the checkbox and changing the table. The app freezes while it calculates the new array. Is there a more efficient flow to filter my data?
Some example functions below:
// gameData is an array of 10k+ objects
let inData = gameData

const options = {
  dateToTime: new Date('2020-03-01'),
  servers: [1, 2, 3],
  maps: ['A', 'B', 'C']
}

function groupByArray(array, key) {
  return array.reduce(function (rv, x) {
    let v = key instanceof Function ? key(x) : x[key];
    let el = rv.find((r) => r && r.key === v);
    if (el) {
      el.values.push(x);
    } else {
      rv.push({ key: v, values: [x] });
    }
    return rv;
  }, []);
}

const gamesGrouped = groupByArray(inData, 'gameid')
inData = gamesGrouped.filter(a => a.playername != "new")
inData = inData.filter(game => {
  const thisTime = new Date(game.creationtime)
  return (thisTime < options.dateToTime)
})
inData = inData.filter(game => options.servers.includes(game.serverip))
inData = inData.filter(game => options.maps.includes(game.map))
Thanks in advance!
I would say it is impossible to give a general answer on how to process array data, but I can give some pointers.
be careful when nesting loops (to avoid repeating the same iterations)
avoid overhead, e.g. find() can be replaced with a for loop, which is quite a bit faster (I know it is easier to write find(), but you are looking at roughly a 30% performance increase by switching to a for loop)
paginate: you can process the array in chunks using generators (e.g. if you only need to show the first 10 results, that is faster than processing all of them); see the sketch below
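A minimal sketch of that pagination idea; filterLazy and takeFirst are hypothetical helper names:

// Lazily yield matching items; the caller pulls only as many as it needs.
function* filterLazy(array, predicate) {
  for (const item of array) {
    if (predicate(item)) yield item;
  }
}

// Take just the first n results, e.g. one page of the table.
function takeFirst(iterator, n) {
  const page = [];
  for (const item of iterator) {
    page.push(item);
    if (page.length === n) break;
  }
  return page;
}

// Usage: only as much of gameData is scanned as is needed to fill 10 rows.
// const firstPage = takeFirst(filterLazy(gameData, g => options.maps.includes(g.map)), 10);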
Also, the code you provided is a bit cryptic; you might want to use better naming.
Here is a performance comparison for the groupByArray() function: https://jsbench.me/t7kltjm1sy/1
Worth noting: whenever I deal with performance-sensitive situations, I keep the code as close to vanilla JS as possible, because with large data sets even a slight function-call overhead can be noticeable.
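As one concrete example of avoiding the repeated find() scan: groupByArray() above re-scans its accumulator for every element, which is quadratic. A Map-keyed version (a sketch; groupByArrayFast is an illustrative name) does the grouping in one linear pass:

// Group in O(n) by keeping a Map from key to group, instead of
// re-scanning the result array with find() for every element.
function groupByArrayFast(array, key) {
  const groups = new Map();
  for (const x of array) {
    const v = key instanceof Function ? key(x) : x[key];
    let el = groups.get(v);
    if (!el) {
      el = { key: v, values: [] };
      groups.set(v, el);
    }
    el.values.push(x);
  }
  return [...groups.values()];
}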
I have a group of arrays from which I need to filter out duplicates. It needs to work in such a fashion that within each array there are no duplicates, and within the whole group there are no two arrays that hold the same values.
The first part is easy: for each inner array, I can apply Set to the array and filter it out. So, given the matrix arrays, I can apply the following filter:
const sets : string[][] = arrays.map(arr=>[...new Set(arr)].sort());
This will give me an array of sets. How can I make this into a set of sets? As in, if sets=[[a, b],[c],[d, a],[c],[e]] I would like setOfSets to equal [[a, b],[c],[d, a],[e]]?
Applying setOfSets = [...new Set(sets)]; would not work, since arrays with equal contents are not considered equal by default if they are different objects. Is there a way to force Set to compare by value, or another effective way to create this effect?
Edit
Original matrix:
[[a, b, b],
[c,c],
[b,a],
[d,a],
[c,c],
[e,e]]
after creating and sorting sets:
[[a,b],
[c],
[a,b],
[d,a],
[c],
[e]]
desired result:
[[a,b],
[c],
[d,a],
[e]]
If the data in your set is easy to serialize, I would opt for a solution like this:
const data = [
  ["a", "b", "b"],
  ["c", "c"],
  ["b", "a"],
  ["d", "a"],
  ["c", "c"],
  ["e", "e"]
];

// Create the "hash" of your set
const serializeSet = s => Array
  .from(s)
  .sort()
  .join("___");

// Create a map (or object) that ensures 1 entry per hash
const outputMap = data
  .map(xs => new Set(xs))
  .reduce(
    (acc, s) => acc.set(serializeSet(s), s),
    new Map()
  );

// Turn your Map and Sets back into arrays
const output = Array
  .from(outputMap.values())
  .map(s => Array.from(s));

console.log(output);
To come up with a good hash function for your set, you need to have a good look at your data. For example:
When your arrays consist of single characters from a-z, like in my example above, we can sort those strings using the default sort and then join the result with a character from outside the a-z range.
If your arrays consist of arbitrary strings or numbers, JSON.stringify(Array.from(s).sort()) is safer to use.
When your arrays consist of plain objects, you could JSON.stringify their sorted elements, but watch out for differences in the order of object properties (e.g. {a: 1, b: 2} vs {b: 2, a: 1}).
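A sketch of that JSON.stringify variant, assuming the set elements are strings or numbers:

// Serialize a set of strings/numbers into a stable hash string.
// JSON.stringify keeps "1" and 1 distinct ("\"1\"" vs "1"), which the
// "___" join above would not.
const serializeSet = s => JSON.stringify(Array.from(s).sort());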
I have a javascript array of nested data that holds data which will be displayed to the user.
The user would like to be able to apply 0 to n filter conditions to the data they are looking at.
In order to meet this goal, I need to first find elements that match the 0 to n filter conditions, then perform some data manipulation on those entries. An obvious way of solving this is to have several filter statements back to back (with a conditional check inside them to see if the filter needs to be applied) and then a map function at the end like this:
var firstFilterList = _.filter(myData, firstFilterFunction);
var secondFilterList = _.filter(firstFilterList, secondFilterFunction);
var thirdFilterList = _.filter(secondFilterList, thirdFilterFunction);
var finalList = _.map(thirdFilterList, postFilterFunction);
In this case, however, the javascript array is traversed four times. A way around this would be to have a single filter that checks all three (or 0 to n) conditions before determining if there is a match, and then to do the data manipulation at the end of that same pass. However, this seems a bit hacky and makes the "filter" responsible for more than one thing, which is not ideal. The upside is that the javascript array is traversed only once.
Is there a "best practices" way of doing what I am trying to accomplish?
EDIT: I am also interested in hearing if it is considered bad practice to perform data manipulation (adding fields to javascript objects etc...) within a filter function.
You could collect all filter functions in an array, check every filter against the actual data set, and filter by the result. Then apply your mapping function to get the wanted result.
var data = [ /* ... */ ],
    filterFn1 = () => Math.round(Math.random()), // dummy filter: randomly keeps items
    filterFn2 = ({ age }) => age > 35,
    filterFn3 = ({ year }) => year === 1955,
    fns = [filterFn1, filterFn2, filterFn3],
    whatever = x => x, // final function for mapping (identity as a placeholder)
    result = data
        .filter(x => fns.every(f => f(x)))
        .map(whatever);
One thing you can do is combine all those filter functions into one single function, using reduce, then call filter with the combined function.
var combined = [firstFilterFunction, secondFilterFunction, ...]
    .reduce((x, y) => (z => x(z) && y(z)));

var filtered = myData.filter(combined);
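One caveat: without an initial value, reduce() throws on an empty array (i.e. zero active filters), so you may want to seed it with an always-true predicate; filterFunctions here is a stand-in for your array of filters:

// Seeding with an always-true predicate handles the 0-filters case.
var combined = filterFunctions
    .reduce((x, y) => (z => x(z) && y(z)), () => true);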
So I've got a store in IndexedDB with a composed keyPath: [someKey, someSubKey]. I've got an index on someKey. I am looking for the fastest way to delete all objects for a given someKey. The approach I'm trying to go with is to get the lower and upper subkey from the index, and then delete on the store with a key range between the lower and upper subkey.
This works, at least in Firefox and Chrome. But it assumes the order of values will be the same in the index as in the store. I'm wondering if this is a safe assumption to make? I'm thinking it might be, as the key paths of the store and the index share the first key, but I can't find much documentation on sort order. I'd rather not delete every record individually, as there could be thousands of them.
The pseudo-code below describes the approach:
const someKeyRange = IDBKeyRange.only(givenSomeKey);
const lowerSubKey = index.openKeyCursor(someKeyRange, "next")
  .primaryKey[1];
const upperSubKey = index.openKeyCursor(someKeyRange, "prev")
  .primaryKey[1]; // [1] to get the subkey
const storeRange = IDBKeyRange.bound(
  [givenSomeKey, lowerSubKey],
  [givenSomeKey, upperSubKey]
);
store.delete(storeRange);
Key ordering is defined here:
https://w3c.github.io/IndexedDB/#key-construct
Specifically for your case, with array keys, they are ordered member-wise, i.e. if A < B, then [A, ...] < [B, ...].
And yes, if you were to delete the range IDBKeyRange.bound([A], [B], false, true), that would delete anything with [A, ...]. (This assumes there is no value between A and B. Sadly, there's no prefix range in the API: https://github.com/w3c/IndexedDB/issues/47)
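Put as code, that range delete might look like this sketch, where nextSomeKey stands for the next possible value after givenSomeKey (the "no value between A and B" assumption above):

// Delete every record whose key starts with givenSomeKey in one call,
// without first probing the index for the lower/upper subkeys.
const prefixRange = IDBKeyRange.bound(
  [givenSomeKey], // lower bound [A]
  [nextSomeKey],  // upper bound [B]
  false,          // include the lower bound
  true            // exclude the upper bound
);
store.delete(prefixRange);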