MongoDB - slow query on old documents (aggregation and sorting) - javascript

I have two DBs for testing, each containing thousands to hundreds of thousands of documents, with the same schemas and CRUD operations.
Let's call them DB1 and DB2.
I am using Mongoose.
Suddenly DB1 became really slow during this query:
const eventQueryPipeline = [
  {
    $match: {
      $and: [{ userId: req.body.userId }, { serverId: req.body.serverId }],
    },
  },
  {
    $sort: {
      sort: -1,
    },
  },
];
const aggregation = db.collection
  .aggregate(eventQueryPipeline)
  .allowDiskUse(true);
aggregation.exect((err, result) => {
  res.json(result);
});
In DB2 the exact same query runs in milliseconds, up to a maximum of about 10 seconds.
In DB1 the query never takes less than 40 seconds.
I do not understand why. What could I be missing?
I compared the documents and the indexes and they're the same.
Deleting the collection and re-inserting the documents brings the speed back to normal and acceptable, but why is it happening? Has anyone had the same experience?

Short answer:
You should create the following index:
{ "userId": 1, "serverId": 1, "sort": 1 }
Longer answer:
Based on your code (I see that you have .allowDiskUse(true)), it looks like MongoDB is trying to do an in-memory sort with "a lot" of data. By default, MongoDB has a 100 MB memory limit for sort operations, and you can allow it to use temporary files on disk to store data if it hits that limit.
You can read more about it here: https://www.mongodb.com/docs/manual/reference/method/cursor.allowDiskUse/
In order to optimise the performance of your queries, you can use indexes.
A common rule to follow when planning indexes is ESR (Equality, Sort, Range). You can read more about it here: https://www.mongodb.com/docs/v4.2/tutorial/equality-sort-range-rule/
If we follow that rule while creating our compound index, we add equality matches first, in your case "userId" and "serverId". After that comes the sort field, in your case "sort".
If you needed to additionally filter results based on some range (e.g. some value greater than X, or a timestamp after yesterday), you would add that after the "sort".
That means your index should look like this:
schema.index({ userId: 1, serverId: 1, sort: 1 });
Additionally, you can probably remove allowDiskUse, and handle err inside the aggregation.exec callback (I'm assuming that aggregation.exect is a typo).
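Putting it together, a minimal sketch of the suggested changes, assuming a Mongoose schema object named schema and the same request fields as in the question:
// Compound index following ESR: equality fields first, then the sort field.
schema.index({ userId: 1, serverId: 1, sort: 1 });

const eventQueryPipeline = [
  { $match: { userId: req.body.userId, serverId: req.body.serverId } },
  { $sort: { sort: -1 } },
];

// allowDiskUse dropped; errors handled inside the exec callback.
db.collection.aggregate(eventQueryPipeline).exec((err, result) => {
  if (err) {
    return res.status(500).json({ error: err.message });
  }
  res.json(result);
});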

Related

Mongo Aggregation to Calculate Unique IDs in Returned Docs is Slow

In addition to returning a filtered set of documents to the end user via a Mongo view, I also have a couple of functions running to generate running totals. For some reason, though, while my find() operation is really fast (225 ms), this additional aggregation I'm running takes over 6 seconds to perform, which slows down the whole endpoint because this data is passed on in what I'm returning.
I'm trying to understand why this aggregation is so slow. It matches on the filters passed in by the end user, and then calculates the number of unique customer IDs that appear in the returned documents. This is what it looks like:
let totalCustomers = await db
  .collection("view_accounts_report")
  .aggregate([
    {
      $match: search
    },
    {
      $group: {
        _id: null,
        customers: {
          $addToSet: "$customer._id"
        }
      }
    },
    {
      $project: {
        uniqueCustomers: {
          $size: "$customers"
        }
      }
    }
  ])
  .next();
Why would this take 6 seconds to run? Any ideas? Any tips on how I can speed it up?
Well, it'll definitely take some time if there are a lot of distinct customers (or a lot of documents in general). If you want to check why it's taking so much time, try explain.
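For example, a hedged way to pull the plan with the Node driver, reusing the pipeline from the question (the verbosity string is optional):
// Explain the $match + $group stages to see whether an index is used
// and how many documents are examined.
const explanation = await db
  .collection("view_accounts_report")
  .aggregate([
    { $match: search },
    { $group: { _id: null, customers: { $addToSet: "$customer._id" } } },
  ])
  .explain("executionStats");

console.log(JSON.stringify(explanation, null, 2));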
On the other hand, distinct might be a better choice here:
// distinct(field, query): unique customer ids among documents matching the filter
let customers = await db.collection('view_accounts_report').distinct('customer._id', search);
let totalCustomers = customers.length;

Algolia search on nested objects in a record - multiple facetFilters in one object

I'm migrating from Mongo to Firebase with Algolia on top to provide the search, but I'm hitting a snag coming up with a comparable way to search within individual elements of a record.
I have an object that stores when a room is available: from and to. Each record can have many individual from/to combos (see the sample below with 2). I want to be able to run a search something like:
roomavailable.from <= 1522195200 AND roomavailable.to >= 1522900799
But I only want the query to match within each individual element, not across facets in all elements. An element query in Mongo works like that. If I run that query on the record listed below, it will return the record because the two roomavailable objects together satisfy the .from and .to conditions, I think.
Is there a way to ensure the search only matches a pair of .from and .to within an individual object/element?
Below is the pertinent part of the record stored in Algolia so you can see the structure.
"roomavailable": [
{
"_id": "rJbdWvY9M",
"from": 1522195200,
"to": 1522799999
},
{
"_id": "r1H_-vKqz",
"from": 1523923200,
"to": 1524268799
}
],
And here is the Mongo (Mongoose) equivalent, where it searches inside individual elements (this works):
$elemMatch: {
  from: {
    $lte: moment(dateArray[0]).utc().startOf('day').format()
  },
  to: {
    $gte: moment(dateArray[1]).utc().endOf('day').format()
  }
}
I have also tried this query, but it seems to still match the .from AND .to across any of the individual roomavailable elements:
index.search({
  query: '',
  filters: filters,
  facetFilters: ['roomavailable.from: 1522195200', 'roomavailable.to: 1524268799'],
  attributesToRetrieve: [
    "roomavailable",
  ],
  restrictHighlightAndSnippetArrays: true
})
I found a couple of posts on Algolia discussing using 1 bracket vs. 2 brackets in the facetFilters. I've tried both. Neither works.
Any suggestions would be awesome. Thanks!
Edit: See discussion on Algolia Discourse:
https://discourse.algolia.com/t/how-to-match-multiple-attributes-in-nested-object-with-numericfilters/4887/8
Hi @kanec, thanks for clarifying your question!
Indeed, what @Alefort suggested (using roomavailable in a separate index) would be the easiest option, since the query I mentioned above will definitely return the results you want. This means you'll have to query the room availability index separately in order to get which IDs are available, so you'll have to use multiple queries:
https://www.algolia.com/doc/api-reference/api-methods/multiple-queries/
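A rough sketch of what that could look like with the JavaScript client; the rooms_availability index (one record per from/to pair, carrying a parent ID back to its listing) is an assumption for illustration, not something from the original post:
// Query the availability index and the main index in one round trip.
const algoliasearch = require('algoliasearch');
const client = algoliasearch('YourAppID', 'YourSearchOnlyAPIKey');

const queries = [
  {
    indexName: 'rooms_availability',
    query: '',
    params: {
      numericFilters: ['from <= 1522195200', 'to >= 1522900799'],
      attributesToRetrieve: ['parentObjectID'], // hypothetical back-reference field
    },
  },
  {
    indexName: 'rooms',
    query: '',
    params: { filters: filters },
  },
];

// On older client versions this may be client.search(queries, callback) instead.
client.multipleQueries(queries).then(({ results }) => {
  // results[0].hits: from/to pairs matching both bounds;
  // results[1].hits: the main records; intersect them by parent ID client side.
  console.log(results[0].hits, results[1].hits);
});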
That said, I asked our core API team to see if there's a more reasonable way to approach this issue, but I fear that this is a filter limit due to performance reasons with arrays. You could transform your data structure into the following and index your rooms as an object instead:
[
  {
    "roomavailable": {
      "0": {
        "_id": "rJbdWvY9M",
        "from": 1522195200,
        "to": 1522799999
      },
      "1": {
        "_id": "r1H_-vKqz",
        "from": 1523923200,
        "to": 1524268799
      }
    }
  }
]
So you can apply the following filter:
{
  "filters": "roomavailable.0.from <= 1522195200 AND roomavailable.0.to >= 1522799999 AND roomavailable.1.from <= 1522195200 AND roomavailable.1.to >= 1522900799"
}
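In practice that filter string would be assembled on the front end; a small sketch, assuming a roomavailable_count attribute (mentioned just below) was added at indexing time:
// Builds the positional filter string described above, one clause per
// indexed roomavailable entry, joined the same way as in the example.
function buildAvailabilityFilters(count, from, to) {
  const clauses = [];
  for (let i = 0; i < count; i++) {
    clauses.push(`roomavailable.${i}.from <= ${from} AND roomavailable.${i}.to >= ${to}`);
  }
  return clauses.join(' AND ');
}

const roomCount = 2; // would come from the record's roomavailable_count
index.search({
  query: '',
  filters: buildAvailabilityFilters(roomCount, 1522195200, 1522900799),
  attributesToRetrieve: ['roomavailable'],
});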
The downside of this is that you'll need to know the length of roomavailable in order to build the search query on the front end (you can do so at indexing time by adding a roomavailable_count property), and this will probably be less performant with a considerable number of rooms per item; in this case, switching to a dedicated index makes total sense for the following reasons:
If in your backend you frequently update available rooms you won't impact the other indices' build time
Filters will perform better (as explained above)
Indexing strategy will be simpler to handle
Let me know what you think about this and if it helps you out.

Mongoose findOneAndUpdate: create and then update nested array

I have a program where I'm requesting weather data from a server, processing the data, and then saving it to an mLab account using Mongoose. I'm gathering 10 years of data, but the API I'm requesting the data from only allows about a year at a time to be requested.
I'm using findOneAndUpdate to create/update the document for each weather station, but I'm having trouble updating the arrays within the data object. (Probably not the best way to describe it...)
For example, here's the model:
const stnDataSchema = new Schema(
  {
    station: { type: String, default: null },
    elevation: { type: String, default: null },
    timeZone: { type: String, default: null },
    dates: {},
    data: {}
  },
  { collection: 'stndata' },
  { runSettersOnQuery: true }
)
where the dates object looks like this:
dates: ["2007-01-01",
"2007-01-02",
"2007-01-03",
"2007-01-04",
"2007-01-05",
"2007-01-06",
"2007-01-07",
"2007-01-08",
"2007-01-09"]
and the data object like this:
"data": [
{
"maxT": [
0,
null,
4.4,
0,
-2.7,
etc.....
What I want to have happen is: when I run findOneAndUpdate I want to find the document based on the station, and then append new maxT values and dates to the respective arrays. I have it working for the dates array, but am running into trouble with the data array, as the elements I'm updating are nested.
I tried this:
const update = {
  $set: { 'station': station, 'elevation': elevation, 'timeZone': timeZone },
  $push: { 'dates': datesTest, 'data.0.maxT': testMaxT }
};
StnData.findOneAndUpdate(query, update, { upsert: true },
  function(err, doc) {
    if (err) {
      console.log("error in updateStation", err)
      throw new Error('error in updateStation')
    }
    else {
      console.log('saved')
    }
  });
but got output in mLab like this:
"data": {
  "0": {
    "maxT": [
      "a",
      "b",
The issue is that I get a "0" instead of an array of one element. I tried 'data[0].maxT' but nothing happens when I do that.
The issue is that the first time I run the data for a station, I want to create a new document with a data object of the format in my third code block, and then on subsequent runs, once that document already exists, update the maxT array with new values. Any ideas?
You are getting this output:
"data": {
"0": {
"maxT": [
"a",
"b",
because you are upserting the document. Upserting gets a bit complicated when dealing with arrays of documents.
When updating an array, MongoDB knows that data.0 refers to the first element in the array. However, when inserting, MongoDB can't tell if it's meant to be an array or an object. So it assumes it's an object. So rather than inserting ["val"], it inserts {"0": "val"}.
Simplest Solution
Don't use an upsert. Insert a document for each new weather station, then use findOneAndUpdate to push values into the arrays in the documents. As long as you insert the arrays correctly the first time, you will be able to push to them without them turning into objects. A minimal sketch follows.
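This sketch reuses the field and variable names from the question, and assumes datesTest and testMaxT are arrays of new values:
// First run for a station: insert real arrays so later pushes stay arrays.
// Subsequent runs: append to the existing arrays with $push / $each.
const existing = await StnData.findOne({ station });

if (!existing) {
  await StnData.create({
    station,
    elevation,
    timeZone,
    dates: datesTest,
    data: [{ maxT: testMaxT }],
  });
} else {
  await StnData.findOneAndUpdate(
    { station },
    {
      $set: { elevation, timeZone },
      $push: {
        dates: { $each: datesTest },
        'data.0.maxT': { $each: testMaxT },
      },
    }
  );
}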
Alternative Simple Solution if data just Contains one Object
From your question, it looks like you only have one object in data. If that is the case, you could just make the maxT array top-level, instead of being a property of a single document in an array. Then it would act just like dates.
More Complicated MongoDB 3.6 Solution
If you truly cannot do without upserts, MongoDB 3.6 introduced the filtered positional operator $[<identifier>]. You can use this operator to update specific elements in an array which match a query. Unlike the simple positional operator $, the new $[<identifier>] operator can be used to upsert as long as an exact match is used.
You can read more about this operator here: https://docs.mongodb.com/manual/reference/operator/update/positional-filtered/
So your data objects will need to have a field which can be matched exactly on (say name). An example query would look something like this:
let query = {
  _id: 'idOfDocument',
  data: [{ name: 'subobjectName' }] // Need this for an exact match
}
let update = { $push: { 'data.$[el].maxT': testMaxT } }
let options = { upsert: true, arrayFilters: [{ 'el.name': 'subobjectName' }] }
StnData.findOneAndUpdate(query, update, options, callbackFn)
As you can see this adds much more complexity. It would be much easier to forget about trying to do upserts. Just do one insert then update.
Moreover mLab currently does not support MongoDB 3.6. So this method won't be viable when using mLab until 3.6 is supported.

How to optimize performance of searching in two arrays of objects

There are two arrays of objects, one from the database and one from a CSV. I need to compare both arrays by their phone and email properties and find the duplicates among them. Due to the odd database object structure, I have to compare the arrays in JavaScript. What is the best algorithm and the best way to compare and find duplicates?
Let me explain with a simple calculation.
There are 5000 contacts in my database and a user may upload another 3000 contacts from a CSV. Each time, we need to find the duplicate contacts in the database; if a duplicate is found it may be overwritten, and the rest should be inserted. If I compare contacts row by row, it may loop 5000 database contacts x 3000 CSV contacts = 15,000,000 iterations.
This is the scenario I currently face, and because of it the system gets stuck. I need an efficient solution to this issue.
I'm developing this in Node.js with RethinkDB.
The database object structure looks exactly like this, and the email and phone values may also appear as duplicate entries in other contacts.
[{
  id: 2349287349082734,
  name: "ABC",
  phones: [
    {
      id: 2234234,
      flag: true,
      value: 982389679823
    },
    {
      id: 65234234,
      flag: false,
      value: 2979023423
    }
  ],
  emails: [
    {
      id: 22346234,
      flag: true,
      value: "test@domain.com"
    },
    {
      id: 609834234,
      flag: false,
      value: "test2@domain.com"
    }
  ]
}]
Please review the fiddle code if you want: https://jsfiddle.net/dipakchavda2912/eua1truj/
I have already done indexing. The problem looks very easy and well known at first sight, but when we talk about concurrency it is really critical and CPU intensive.
If I understand the question correctly, you can use the lodash method differenceWith:
const l = require("lodash");

let csvContacts = []; // fill it with your values
let databaseContacts = .... // from your database
let diffArray = []; // the non-duplicated objects

diffArray = l.differenceWith(
  csvContacts,
  databaseContacts,
  (firstValue, secValue) => firstValue.email == secValue.email
);
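If the comparison should follow the actual contact structure from the question (where emails is an array of objects), the comparator could check for any overlapping email value; a sketch, assuming email overlap alone marks a duplicate:
const _ = require('lodash');

// Two contacts are considered duplicates when they share at least one email
// value; the same idea could be extended to the phones array.
const sharesEmail = (a, b) => {
  const emailsA = (a.emails || []).map(e => e.value);
  const emailsB = (b.emails || []).map(e => e.value);
  return emailsA.some(value => emailsB.includes(value));
};

const diffArray = _.differenceWith(csvContacts, databaseContacts, sharesEmail);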

query for last 10 items in dynamodb shell

I'm learning DynamoDB, and for that I installed the local server, which comes with a shell at http://localhost:8000/shell
Now, I created the following table:
var serverUpTimeTableName = 'bingodrive_server_uptime';
var eventUpTimeColumn = 'up_time';
var params = {
  TableName: serverUpTimeTableName,
  KeySchema: [ // The type of schema. Must start with a HASH type, with an optional second RANGE.
    { // Required HASH type attribute
      AttributeName: eventUpTimeColumn,
      KeyType: 'HASH',
    },
  ],
  AttributeDefinitions: [ // The names and types of all primary and index key attributes only
    {
      AttributeName: eventUpTimeColumn,
      AttributeType: 'N', // (S | N | B) for string, number, binary
    },
  ],
  ProvisionedThroughput: { // required provisioned throughput for the table
    ReadCapacityUnits: 2,
    WriteCapacityUnits: 2,
  }
};
dynamodb.createTable(params, callback);
So I created a table with only one hash key called up_time; that's actually the only attribute in the table.
Now I want to fetch the last 10 inserted up times.
So far I have written the following code:
var serverUpTimeTableName = 'bingodrive_server_uptime';
var eventUpTimeColumn = 'up_time';
var params = {
  TableName: serverUpTimeTableName,
  KeyConditionExpression: eventUpTimeColumn + ' != :value',
  ExpressionAttributeValues: {
    ':value': 0
  },
  Limit: 10,
  ScanIndexForward: false
}
docClient.query(params, function(err, data) {
  if (err) ppJson(err); // an error occurred
  else ppJson(data); // successful response
});
OK, so a few things to notice:
I don't really need a KeyCondition; I just want the last 10 items, so I used Limit: 10 for the limit and ScanIndexForward: false for reverse order.
!= and NE are not supported in key expressions for hash keys, and it seems that I must use some kind of index in the query. I'm confused about that.
So, any information regarding the issue would be greatly appreciated.
Some modern terminology: Hash is now called Partition, Range is now called Sort.
Thank you Amazon.
You need to understand that querying is an action on hash keys. In order to initiate a query you must supply a hash key. Since your table's primary key is only a hash key (and not hash + range), you can't query it. You can only Scan it in order to find items. Scan doesn't require any knowledge about items in the table.
Moving on: when you say "last 10 items" you actually do want a condition, because you are filtering on the date attribute; since you haven't defined any index, you can't have the engine hand you those 10 results. If it were a range key, you could get the top 10 ordered elements by querying backwards (ScanIndexForward: false), but again, that's not your schema.
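For completeness, a Scan against the current schema would look like the sketch below, reusing the shell's docClient and ppJson helpers; note that Limit only caps how many items are examined in scan order, so it does not give you a "last 10":
docClient.scan({
  TableName: 'bingodrive_server_uptime',
  Limit: 10 // caps items examined, returned in no particular order
}, function(err, data) {
  if (err) ppJson(err);
  else ppJson(data); // data.Items: up to 10 up_time values, unordered
});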
In your current table, what exactly are you trying to do? You currently have only one attribute, which is also the hash key, so 10 items would look like this (no order, no duplicates):
12312
53453
34234
123
534534
3101
11
You could move those values to a range key and use a constant hash key "stub" just to make the query possible, but that goes against DynamoDB guidelines: you'd have a hot partition and it won't have the best performance. Not sure this bothers you at the moment, but it is worth mentioning. A sketch of that layout follows.
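The sketch below uses an illustrative table name, a constant "stub" partition key, and up_time as the sort key:
// Table: constant partition key + numeric sort key holding the timestamp.
var params = {
  TableName: 'bingodrive_server_uptime_v2', // illustrative name
  KeySchema: [
    { AttributeName: 'stub', KeyType: 'HASH' },    // always the same value
    { AttributeName: 'up_time', KeyType: 'RANGE' } // sort key
  ],
  AttributeDefinitions: [
    { AttributeName: 'stub', AttributeType: 'S' },
    { AttributeName: 'up_time', AttributeType: 'N' }
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 2, WriteCapacityUnits: 2 }
};
dynamodb.createTable(params, callback);

// Last 10 inserted up times: query the single partition, newest first.
docClient.query({
  TableName: 'bingodrive_server_uptime_v2',
  KeyConditionExpression: 'stub = :stub',
  ExpressionAttributeValues: { ':stub': 'uptime' },
  ScanIndexForward: false,
  Limit: 10
}, function(err, data) {
  if (err) ppJson(err);
  else ppJson(data);
});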
