ES2015: multikeys Map: where is the bottleneck? - javascript

I have an algorithm design issue in ES2017 involving an async generator. I need to extract data from a list of randomly-distributed maps.
My initial goal is to build a multi-key, multi-value data structure (a project I called mukemuva), with the ability to compact overlapping information.
(This is a long post, and answering it requires reviewing some code, so please take your time.)
The problem I am trying to solve is the following:
in a regular ES2015 Map, keys behave as expected for scalar values (0, 1, 2, 3, 'avc', true, ...) but not for literal property-bag objects (such as 3D points { x: 1, y: 2, z: 3 }) or functions, because JS Maps compare object keys by reference identity, not by the shape of the property bag itself.
For example, in the node REPL type
m = new Map()
m.set({ x: 1, y: 2, z: 3 } , 123)
m.set({ x: 1, y: 2, z: 3 } , 456)
m.set({ x: 1, y: 2, z: 3 } , 789)
... and you'll end up with a Map containing 3 key/value pairs.
In a SQL RDBMS, doing the same would result in a single-row table holding only the last value (789), and that is the behaviour I expect.
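To make the contrast concrete, here is a minimal sketch (my own illustration, not part of mukemuva): a plain Map keyed by object literals keeps all three entries, whereas serializing the bag into a canonical string key gives the SQL-like overwrite behaviour.
// Plain Map: three distinct object literals are three distinct keys.
const byIdentity = new Map();
byIdentity.set({ x: 1, y: 2, z: 3 }, 123);
byIdentity.set({ x: 1, y: 2, z: 3 }, 456);
byIdentity.set({ x: 1, y: 2, z: 3 }, 789);
console.log(byIdentity.size); // 3

// Canonical string keys: the three sets collapse into one entry.
const keyOf = bag => JSON.stringify(Object.entries(bag).sort());
const byShape = new Map();
byShape.set(keyOf({ x: 1, y: 2, z: 3 }), 123);
byShape.set(keyOf({ x: 1, y: 2, z: 3 }), 456);
byShape.set(keyOf({ x: 1, y: 2, z: 3 }), 789);
console.log(byShape.size); // 1
console.log(byShape.get(keyOf({ z: 3, y: 2, x: 1 }))); // 789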
NAIVE IMPLEMENTATION
A fast but memory-consuming implementation is to maintain an internal list of pairs:
const pairs = [
  { bag: { x: 0, y: 0, z: 0 }, with_data: { r: 'ba', g: 'da', b: '55' } },
  { bag: { x: 1, y: 2, z: 3 }, with_data: 123 }
]
and loop over the whole list to decide how to fill it:
if a bag is already present in the list, replace its "with_data" value (UPDATE)
if not, create a new entry in the pairs list, with the appropriate "bag" and "with_data" fields (INSERT/CREATE)
This is super fast, but it degrades as the entry list grows, especially for large numbers of entries (typically 500K to 1000K).
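A rough sketch of that naive approach (my own reconstruction from the description above, not the actual mukemuva code) could look like this:
// Naive multi-key map: a flat list of { bag, with_data } pairs.
const pairs = [];

// Structural comparison of two property bags (assumes flat, scalar values).
function sameBag(a, b) {
  const ka = Object.keys(a), kb = Object.keys(b);
  return ka.length === kb.length && ka.every(k => a[k] === b[k]);
}

// UPDATE if the bag already exists, otherwise INSERT: O(n) per call.
function naiveSet(bag, with_data) {
  const entry = pairs.find(p => sameBag(p.bag, bag));
  if (entry) {
    entry.with_data = with_data; // UPDATE
  } else {
    pairs.push({ bag, with_data }); // INSERT/CREATE
  }
}

naiveSet({ x: 1, y: 2, z: 3 }, 123);
naiveSet({ x: 1, y: 2, z: 3 }, 456);
console.log(pairs.length); // 1, with_data is now 456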
ADVANCED IMPLEMENTATION
First, I conceptualized and named things:
m.set({ x: 1, y: 2, z: 3 }, 'c0ffee')
'x', 'y' and 'z' are the so-called "roles"
1, 2, and 3 are called "role partitions" or "parts"
the 'c0ffee' string is the stored "item"
duck blocks
For storage I have developed a block memory pool allocator that stores data under pseudo-random keys generated with an MWC (Multiply-With-Carry) generator, such that:
pool.set(data) --> uid, a key of 5 digits with 37 possible values each (0-9, _ and A-Z)
pool.get(uid) --> the arbitrary data, returned to the user
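I do not know the exact generator the pool uses, but a minimal sketch of an MWC generator mapped onto 5 base-37 digits (0-9, _ and A-Z) might look like this (parameters are illustrative only):
// Hypothetical MWC (Multiply-With-Carry) generator producing 5-character keys
// over the 37-symbol alphabet 0-9, _ and A-Z.
const ALPHABET = '0123456789_ABCDEFGHIJKLMNOPQRSTUVWXYZ'; // 37 symbols

function makeMwc(seed = 123456789, carry = 362436) {
  const a = 36969; // classic 16-bit MWC multiplier
  let x = seed & 0xffff, c = carry & 0xffff;
  return function next() {
    const t = a * x + c;
    x = t & 0xffff;
    c = Math.floor(t / 0x10000);
    return x; // 16-bit pseudo-random value
  };
}

const next = makeMwc();

// Combine two draws into one value and write it as 5 base-37 digits.
function nextKey37() {
  let v = (next() * 0x10000 + next()) % Math.pow(37, 5);
  let key = '';
  for (let i = 0; i < 5; i++) {
    key = ALPHABET[v % 37] + key;
    v = Math.floor(v / 37);
  }
  return key;
}

console.log(nextKey37()); // e.g. a key such as '3F_9K'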
Using the pool, I can store duck blocks with this shape:
type DuckBlock = {
  uid: KEY37(5),
  ducktype: ROLE | PART | ITEM,
  counter: 0, # refs counter
  related: array of KEY37(5), # related block UIDs
  fellow: KEY37(5), # "parent" block UID, for PART roles
  data: <any> # user arbitrary data object
}
Using lookup Maps to store:
roles: <role_name> => UID,
parts: <part_value> => [UID] (list of parts)
items
Querying system:
const generator = await multimap.select({
  x: 0, // single value search
  y: '*', // wildcard search
  z: (val) => -5 < val && val < 5 // filter function search
})
Brief Algorithm:
for each role, find the corresponding partitions that match the filters
for each part duck block, get the item UIDs
compute the cartesian product parts_for_role_X * parts_for_role_Y * parts_for_role_Z
get the item_uids list for each entry
intersect the lists of item UIDs to get the selected items
grab the items by their UID (a sketch of this selection pipeline follows below)
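Here is a hedged sketch of that selection pipeline (my own reconstruction; roleParts and partItems are hypothetical stand-ins for the real lookup Maps):
// roleParts: Map of role name -> array of { value, uid } part blocks
// partItems: Map of part UID -> array of item UIDs stored under that part
function selectUids(roleParts, partItems, filters) {
  // 1. For each role, keep only the partitions that match its filter.
  const partsPerRole = Object.entries(filters).map(([role, filter]) =>
    (roleParts.get(role) || []).filter(part =>
      filter === '*' ||
      (typeof filter === 'function' ? filter(part.value) : part.value === filter)
    )
  );

  // 2. Cartesian product of the matching parts across roles.
  const product = partsPerRole.reduce(
    (combos, parts) => combos.flatMap(combo => parts.map(p => [...combo, p])),
    [[]]
  );

  // 3. For each combination, intersect the item UID lists of its parts.
  const selected = new Set();
  for (const combo of product) {
    const uidLists = combo.map(part => partItems.get(part.uid) || []);
    if (uidLists.length === 0) continue;
    uidLists[0]
      .filter(uid => uidLists.every(list => list.includes(uid)))
      .forEach(uid => selected.add(uid));
  }
  return [...selected]; // item UIDs, to be fetched from the pool
}
Note that both the cartesian product and the includes-based intersection grow combinatorially with the number of matching parts, which is worth keeping in mind when profiling.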
insert / set up values
To set a value:
multimap.set({ x: 0, y: 0, z: 0 }, { rgb: 'bada55' })
... I run the selection generator, using the values of the key as filters for the select method, then:
if the selection is empty: insert a new value
if not empty: overwrite the existing value
THE QUESTION
The naive code is surprisingly fast (3 s for 1M points recorded), but I suspect delete operations are costly (imagine an array.splice on such an array).
The advanced version can save up to 25% of space (tested), but it is sadly slow (almost 30 minutes for 1M records).
I suspect the mechanism of collecting UIDs has too high an algorithmic complexity...
But where is the bottleneck really, and how could I go about fixing it?
(it will have its own repo soon...)
https://github.com/hefeust/data-manipulation-software/mukemuva
Thanks for replies.

Related

Why does this vanilla js function return different results in d3v3 and d3v4

This is an MWE based on some templates, going from v3 to v4 of the amazing d3.js.
The data is in a CSV file; both examples load the same file (it's clean):
day,movie1,movie2,movie3,movie4,movie5,movie6
1,20,8,3,0,0,0
2,18,5,1,13,0,0
3,14,3,1,10,0,0
4,7,3,0,5,27,15
5,4,3,0,2,20,14
6,3,1,0,0,10,13
7,2,0,0,0,8,12
8,0,0,0,0,6,11
9,0,0,0,0,3,9
10,0,0,0,0,1,8
Here is the MWE in question:
d3.csv("../data/source/movies.csv", function (error, data) {
  dataViz(data);
});

function dataViz(incData) {
  expData = incData;
  stackData = [];
  for (x in incData[0]) {
    if (x != "day") {
      var newMovieObject = {
        name: x,
        values: []
      };
      for (y in incData) {
        newMovieObject.values.push({
          x: parseInt(incData[y]["day"]),
          y: parseInt(incData[y][x])
        });
      }
      stackData.push(newMovieObject);
    }
  }
}
Now in v3 the stackData array has 6 objects with 10 values each e.g.:
{name: "movie1" values:[
{x: 1, y:20} //0
...
{x:10, y:0} //9
]
…
}
In v4, however, I get an array of 6 objects with 11 values each, the last one annoyingly being:
{name: "movie1", values: [
{x: 1, y:20} //0
...
{x:10, y:0} //9
{x: NaN, y: NaN} //10 *ouch*
]
…
}
As a JS noob, I don't understand why this vanilla JS function returns different results, or what to do about it. Any help would be greatly appreciated.
The reason for this difference is that D3 v4.x adds an additional property named columns to the data array when it parses the CSV (see the documentation).
So, for instance, given your data:
day,movie1,movie2,movie3,movie4,movie5,movie6
1,20,8,3,0,0,0
2,18,5,1,13,0,0
...
D3 creates, after the "normal" objects, this additional object (technically speaking, an additional property on the array):
columns: ["day", "movie1", "movie2", "movie3", "movie4", "movie5", "movie6"]
Which you can call using data.columns.
The problem you're facing right now is that when you use a for...in loop you end up iterating this property as well, getting a lot of NaN.
Solution: you can simply avoid iterating over columns or, if you don't need it, you can remove it from your data. There are several ways of removing a property from an array in JavaScript, the simplest being this:
delete incData.columns;
To check this columns property, simply console.log(data) using D3 v3 and v4, comparing the results.
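A hedged alternative sketch (not from the answer above): instead of deleting columns, iterate the rows with an array method such as map, which only visits numeric indices and therefore never touches the extra columns property:
function dataViz(incData) {
  var stackData = [];
  // Column names come from the first row object, minus the "day" field.
  var movies = Object.keys(incData[0]).filter(function (name) {
    return name !== "day";
  });
  movies.forEach(function (name) {
    stackData.push({
      name: name,
      // incData.map only visits indices 0..length-1, skipping `columns`.
      values: incData.map(function (row) {
        return { x: parseInt(row.day, 10), y: parseInt(row[name], 10) };
      })
    });
  });
  return stackData;
}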

Unexpected index values when chaining filter functions

The following code produces the result I desire:
// output 1,2,3,4,5,etc
var input$ = Rx.Observable.interval(500).map((v, idx) => idx + 1);
var inputEveryOtherOdd$ = input$
  // filter even numbers
  .filter(v => v % 2)
  .map(x => x) // workaround to fix the index (read below)
  // filter every other remaining number (take every other odd number)
  .filter((v, idx) => {
    console.log(`Value: ${v}, Index: ${idx}`);
    return !(idx % 2);
  })
  .subscribe(function (v) {
    output.textContent += v;
  });
The log produces:
Value: 1, Index: 0
Value: 3, Index: 1
Value: 5, Index: 2
Value: 7, Index: 3
Value: 9, Index: 4
Each item that passes the first filter has the next index so that the index is incremented by one for each item (0,1,2,3,4,5,etc).
What I can't understand is if I remove map, the second filter receives different idx values for the same items:
Value: 1, Index: 0
Value: 3, Index: 2
Value: 5, Index: 4
Value: 7, Index: 6
Value: 9, Index: 8
It seems that the values removed by the first filter are still being counted by the second filter. I can't make any sense of it. The filter function doesn't run for those values, so how can the index be incrementing for items that were filtered out? Why does map make any difference?
Click here for live demo.
I would expect chaining two filters together to produce the same result as synchronous array filtering, where the idx value would be 0,1,2,3,4,5,etc. in each filter:
var result = [1,2,3,4,5,6,7,8,9]
.filter(v => v % 2)
.filter((v, idx) => !(idx % 2));
I was originally using startWith in place of map. It seems like just putting something in between the filters makes the idx value what I expect.
The problem is ironically a "feature" of the current version (v4) of RxJS.
The filter operator contains an optimization that detects whether it is being used in a "chained" manner, that is, whether you have multiple filters in a row. If so, it will not create a second FilterObservable but will instead fold the new filter predicate into the existing Observable.
See source code here and here.
As a result, no matter how many filters you chain together you will always receive the same index for all of them.
When you use an operator in between the two filters then it cannot perform the optimization and so the filters will not be merged and you will see different indices.
The index idx is therefore the index of the element in the original sequence, independently of what has been filtered out.
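To illustrate this with plain arrays (an analogy only, not the RxJS source): when the two predicates are merged into a single pass, both see the index of the original sequence, which matches the second set of logs above.
const source = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
const isOdd = v => v % 2;
const everyOther = (v, idx) => !(idx % 2);

// Two separate passes: the second filter sees re-numbered indices 0,1,2,...
source.filter(isOdd).filter(everyOther); // [1, 5, 9]

// One merged pass: both predicates share the original indices 0,2,4,...
source.filter((v, idx) => isOdd(v) && everyOther(v, idx)); // [1, 3, 5, 7, 9]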

Time-series data in JSON

I need to model 1,000,000+ data points in JSON. I am thinking of two ways of doing this:
a) Array of objects:
[{time:123456789,value:1432423},{time:123456790,value:1432424},....]
or
b) Nested arrays
[[123456789,1432423],[123456790,1432424],....]
Naively comparing these two approaches, it feels like the latter is faster because it uses fewer characters, but it is also less descriptive. Is b really faster than a? Which one would you choose, and why?
Is there a 3rd approach ?
{time:[123456789,123456790,...], value:[1432423,1432424,...]}
Why?
iterating over a primitive array is faster
the "JSON size" is comparable to b), but you do not lose the "column" information
This npm package could be of interest: https://github.com/michaelwittig/fliptable
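A quick sketch of how the column-oriented form (c) is consumed, in case it helps weigh the trade-off:
const series = { time: [123456789, 123456790], value: [1432423, 1432424] };

// Iterate the points without materialising one object per point.
for (let i = 0; i < series.time.length; i++) {
  console.log(series.time[i], series.value[i]);
}

// Or rebuild the array-of-objects form (a) when an API needs it.
const asObjects = series.time.map((t, i) => ({ time: t, value: series.value[i] }));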
If your time-series data models some continuous function, especially over regular time intervals, there can be a much more efficient representation using delta compression, even if you are still using JSON:
[
{time:10001,value:12345},
{time:10002,value:12354},
{time:10003,value:12354},
{time:10010,value:12352}
]
Can be represented as:
[[10001,1,1,7],[12345,9,,-2]]
This is roughly a 4-times shorter representation.
The original could be reconstructed with:
[{time:a[0][0],value:a[1][0]},{time:a[0][0] + a[0][1]||1, value: a[1][0] + a[1][1]||0 ...
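A hedged sketch of that round trip, with hypothetical encodeDeltas/decodeDeltas helpers (the answer itself only shows the compact form):
// Keep the first absolute value, then store successive differences.
function encodeDeltas(values) {
  return values.map((v, i) => (i === 0 ? v : v - values[i - 1]));
}

function decodeDeltas(deltas) {
  const out = [];
  deltas.forEach((d, i) => out.push(i === 0 ? d : out[i - 1] + d));
  return out;
}

const points = [
  { time: 10001, value: 12345 },
  { time: 10002, value: 12354 },
  { time: 10003, value: 12354 },
  { time: 10010, value: 12352 }
];

// Column-wise delta encoding, matching the compact form above.
const packed = [
  encodeDeltas(points.map(p => p.time)),  // [10001, 1, 1, 7]
  encodeDeltas(points.map(p => p.value))  // [12345, 9, 0, -2]
];

// Reconstruction.
const times = decodeDeltas(packed[0]);
const values = decodeDeltas(packed[1]);
const restored = times.map((t, i) => ({ time: t, value: values[i] }));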
To add another example (idea: 'time is a key'):
ts1 = {123456789: 1432423, 123456790: 1432424}
One could imagine even:
ts2 = {"2017-01-01": {x: 2, y: 3}, "2017-02-01": {x: 1, y: 5}}
Quite compact in notation.
When you want to get the keys, use Object.keys:
Object.keys(ts2) // ["2017-01-01", "2017-02-01"]
You can then either get the values by iterating using these keys or use the more experimental Object.values:
Object.values(ts2) // [{x: 2, y: 3}, {x: 1, y: 5}]
In terms of speed: a quick test with 10,000,000 entries in an object worked here:
obj3 = {};
for(var i=0; i < 10000000; i++) {obj3[i] = Math.random()};
console.time("values() test");
Object.values(obj3);
console.timeEnd("values() test");
console.time("keys() test");
Object.keys(obj3);
console.timeEnd("keys() test");
Results on my machine (Chrome, 3.2 GHz Xeon):
values() test: 181.77978515625ms
keys() test: 1230.604736328125ms

Need help building complex JS object

I'm trying to construct an array in JavaScript, but I'm not sure of the correct way to do it, or if this type of array is even possible...
Let's say I have a key for each item in the array, starting with 'a' and ending with 'z'. For each item, the array key will correspond with another multidimensional array. These new multidimensional arrays are a series of coordinates (x and y). Each item in the original array can have many sets of coordinates. For example:
How can I construct such an array with JavaScript? What is the proper syntax?
Just to add another possible option to your list, along the same lines as @SMcCrohan's answer, mixing objects and arrays.
var coords = {
  a: [
    { x: 20, y: 15 },
    { x: 25, y: 17 }
  ],
  b: [
    { x: 10, y: 30 }
  ],
  ....
};
This assumes you will always use coordinates x and y. It means you can access the values like so:
var value1 = coords.a[1].x; // 25
var value2 = coords.b[0].y; // 30
For the data you've provided:
var arr = {
  a: [[20,15],[25,17],[10,45]],
  b: [[10,33],[12,2],[14,9],[72,103],[88,12]],
  c: [[2,2],[41,21]],
  d: [[0,0],[21,2],[44,44],[19,99],[1,1],[100,100]],
  e: [[1,1]],
  f: [[3,40],[41,86]]
}
The first structure you want, a keyed array, isn't an array in JavaScript - it's an object. Objects contain key-value pairs. In this case, the values are arrays, and the objects in those arrays are themselves arrays.
An important thing to note here if you're coming from another language that defines 'regular' multi-dimensional arrays is that there is no expectation or guarantee that the 'rows' of this structure are all the same length.
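For instance, iterating such a structure (reusing the arr object from above) works fine even though the rows have different lengths:
// Visit every coordinate pair, regardless of how many pairs each key holds.
for (const key of Object.keys(arr)) {
  for (const [x, y] of arr[key]) {
    console.log(key + ': (' + x + ', ' + y + ')');
  }
}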

underscore/lodash unique by multiple properties

I have an array of objects with duplicates and I'm trying to get a unique listing, where uniqueness is defined by a subset of the properties of the object. For example,
{a:"1",b:"1",c:"2"}
And I want to ignore c in the uniqueness comparison.
I can do something like
_.uniq(myArray, function(element) { return element.a + "_" + element.b; });
I was hoping I could do
_.uniq(myArray,function(element) { return {a:element.a, b:element.b} });
But that doesn't work. Is there something like that I can do, or do I need to create a comparable representation of the object if I'm comparing multiple properties?
Use Lodash's uniqWith method:
_.uniqWith(array, [comparator])
This method is like _.uniq except that it accepts comparator which is invoked to compare elements of array. The order of result values is determined by the order they occur in the array. The comparator is invoked with two arguments: (arrVal, othVal).
When the comparator returns true, the items are considered duplicates and only the first occurrence will be included in the new array.
Example:
I have a list of locations with latitude and longitude coordinates -- some of which are identical -- and I want to see the list of locations with unique coordinates:
const locations = [
{
name: "Office 1",
latitude: -30,
longitude: -30
},
{
name: "Office 2",
latitude: -30,
longitude: 10
},
{
name: "Office 3",
latitude: -30,
longitude: 10
}
];
const uniqueLocations = _.uniqWith(
locations,
(locationA, locationB) =>
locationA.latitude === locationB.latitude &&
locationA.longitude === locationB.longitude
);
// Result has Office 1 and Office 2
There doesn't seem to be a straightforward way to do this, unfortunately. Short of writing your own function for this, you'll need to return something that can be directly compared for equality (as in your first example).
One method would be to just .join() the properties you need:
_.uniqBy(myArray, function(elem) { return [elem.a, elem.b].join(); });
Alternatively, you can use _.pick or _.omit to remove whatever you don't need. From there, you could use _.values with a .join(), or even just JSON.stringify:
_.uniqBy(myArray, function(elem) {
return JSON.stringify(_.pick(elem, ['a', 'b']));
});
Keep in mind that objects are not deterministic as far as property order goes, so you may want to just stick to the explicit array approach.
P.S. Replace uniqBy with uniq for Lodash < 4
The correct answer is here:
javascript - lodash - create a unique list based on multiple attributes.
FYI: var result = _.uniqBy(list, v => [v.id, v.sequence].join());
I do think that the join() approach is still the simplest. Despite concerns raised in the previous solution, I think choosing the right separator is the key to avoiding the identified pitfalls (with different value sets returning the same joined value). Keep in mind, the separator need not be a single character, it can be any string that you are confident will not occur naturally in the data itself. I do this all the time and am fond of using '~!$~' as my separator. It can also include special characters like \t\r\n etc.
If the data contained is truly that unpredictable, perhaps the max length is known and you could simply pad each element to its max length before joining.
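For example, a small sketch assuming Lodash 4:
// An explicit multi-character separator that is unlikely to occur in the data.
const SEP = '~!$~';
const unique = _.uniqBy(myArray, el => [el.a, el.b].join(SEP));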
There is a hint in @voithos' and @Danail's combined answers. The way I solved this was to add a unique key to the objects in my array.
Starting Sample Data
const animalArray = [
{ a: 4, b: 'cat', d: 'generic' },
{ a: 5, b: 'cat', d: 'generic' },
{ a: 4, b: 'dog', d: 'generic' },
{ a: 4, b: 'cat', d: 'generic' },
];
In the example above, I want the array to be unique by a and b but right now I have two objects that have a: 4 and b: 'cat'. By combining a + b into a string I can get a unique key to check by.
{ a: 4, b: 'cat', d: 'generic', id: `${a}-${b}` }. // id is now '4-cat'
Note: You obviously need to map over the data or do this during creation of the object as you cannot reference properties of an object within the same object.
Now the comparison is simple...
_.uniqBy(animalArray, 'id');
The resulting array will have a length of 3; it will have removed the last duplicate.
Late to the party, but I found this in the Lodash docs:
var objects = [{ 'x': 1, 'y': 2 }, { 'x': 2, 'y': 1 }, { 'x': 1, 'y': 2 }];
_.uniqWith(objects, _.isEqual);
// => [{ 'x': 1, 'y': 2 }, { 'x': 2, 'y': 1 }]
