I am interested in optimizing a "pagination" solution I'm working on with MongoDB. My problem is straightforward. I usually limit the number of documents returned using the limit() functionality. This forces me to issue a redundant query without limit() in order to also capture the total number of documents matched by the query, so I can pass that to the client and let them know they'll have to issue additional requests to retrieve the rest of the documents.
Is there a way to condense this into one query: get the total number of documents but at the same time only retrieve a subset using limit()? Or is there a different way to think about this problem than the way I am approaching it?
MongoDB 3.4 introduced the $facet aggregation stage,
which processes multiple aggregation pipelines within a single stage
on the same set of input documents.
Using $facet and $group you can find documents with $limit and get the total count at the same time.
You can use the following aggregation in MongoDB 3.4:
db.collection.aggregate([
{ "$facet": {
"totalData": [
{ "$match": { }},
{ "$skip": 10 },
{ "$limit": 10 }
],
"totalCount": [
{ "$group": {
"_id": null,
"count": { "$sum": 1 }
}}
]
}}
])
You can also use the $count stage, which was introduced in MongoDB 3.6.
You can use the following aggregation in MongoDB 3.6:
db.collection.aggregate([
{ "$facet": {
"totalData": [
{ "$match": { }},
{ "$skip": 10 },
{ "$limit": 10 }
],
"totalCount": [
{ "$count": "count" }
]
}}
])
No, there is no other way: two queries - one for the count, one with limit(). Or you have to use a different database. Apache Solr, for instance, works the way you want: every query there is limited and returns the total count.
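For completeness, here is a minimal sketch of that two-query approach with the Node.js driver (collection, query, page and pageSize are placeholders; countDocuments() is available in current drivers):
// Issue both queries; they can run in parallel since neither depends on the other.
const [docs, total] = await Promise.all([
  collection.find(query).skip(page * pageSize).limit(pageSize).toArray(),
  collection.countDocuments(query)   // the second, count-only query
]);
// `total` is what you hand to the client so it knows how many more requests to make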
MongoDB allows you to use cursor.count() even when you pass limit() or skip().
Let's say you have a db.collection with 10 items.
You can do:
async function getQuery() {
let query = db.collection.find({}).skip(5).limit(5); // cursor for the last 5 items in the collection
let countTotal = await query.count() // returns 10 -- does not take `skip` or `limit` into consideration
let countWithConstraints = await query.count(true) // returns 5 -- takes `skip` and `limit` into consideration
return { query, countTotal, countWithConstraints }
}
Here's how to do this with MongoDB 3.4+ (with Mongoose) using $facet. This example returns a $count based on the documents after they have been matched.
const facetedPipeline = [
{ "$match": { "dateCreated": { $gte: new Date('2021-01-01') } } },
{ "$project": { 'exclude.some.field': 0 } },
{
"$facet": {
"data": [
{ "$skip": 10 },
{ "$limit": 10 }
],
"pagination": [
{ "$count": "total" }
]
}
}
];
const results = await Model.aggregate(facetedPipeline);
This pattern is useful for getting pagination information to return from a REST API.
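For example, the facet result can be unpacked into a response along these lines (a sketch assuming an Express-style res.json; results comes from the aggregate call above, and pagination is an empty array when nothing matches):
const [{ data, pagination }] = results;
res.json({
  items: data,
  total: pagination.length ? pagination[0].total : 0
});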
Reference: MongoDB $facet
Times have changed, and I believe you can achieve what the OP is asking by using aggregation with $sort, $group and $project. For my system, I also needed to grab some user info from my users collection; hopefully this can answer any questions around that as well. Below is an aggregation pipeline. The last three stages (sort, group and project) are what handle getting the total count and then providing pagination capabilities.
db.posts.aggregate([
{ $match: { public: true } },
{ $lookup: {
from: 'users',
localField: 'userId',
foreignField: 'userId',
as: 'userInfo'
} },
{ $project: {
postId: 1,
title: 1,
description: 1,
updated: 1,
userInfo: {
$let: {
vars: {
firstUser: {
$arrayElemAt: ['$userInfo', 0]
}
},
in: {
username: '$$firstUser.username'
}
}
}
} },
{ $sort: { updated: -1 } },
{ $group: {
_id: null,
postCount: { $sum: 1 },
posts: {
$push: '$$ROOT'
}
} },
{ $project: {
_id: 0,
postCount: 1,
posts: {
$slice: [
'$posts',
currentPage ? (currentPage - 1) * RESULTS_PER_PAGE : 0,
RESULTS_PER_PAGE
]
}
} }
])
There is a way in MongoDB 3.4: $facet.
You can do:
db.collection.aggregate([
{
$facet: {
data: [{ $match: {} }],
total: [{ $count: 'total' }]
}
}
])
Then you will be able to run the two aggregations at the same time.
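Reading the single result document back might look like this (a sketch with the Node.js driver; pipeline stands for the stages above, and the total facet is an empty array when nothing matches):
const [result] = await db.collection('collection').aggregate(pipeline).toArray();
const docs = result.data;                                            // the matched documents
const totalCount = result.total.length ? result.total[0].total : 0;  // 0 when nothing matched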
By default, the count() method ignores the effects of the
cursor.skip() and cursor.limit() (MongoDB docs)
As the count method excludes the effects of limit and skip, you can use cursor.count() to get the total count:
const cursor = await database.collection(collectionName).find(query).skip(offset).limit(limit)
return {
data: await cursor.toArray(),
count: await cursor.count() // this will give count of all the documents before .skip() and limit()
};
It all depends on the pagination experience you need as to whether or not you need to do two queries.
Do you need to list every single page, or even a range of pages? Does anyone even go to page 1051 - conceptually, what does that actually mean?
There's been lots of UX work on pagination patterns - "Avoid the pains of pagination" covers various types of pagination and their scenarios, and many don't need a count query to know if there's a next page. For example, if you display 10 items on a page and you limit the query to 13, you'll know if there's another page.
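As an illustration, here is a sketch of that "fetch one extra" idea with the Node.js driver (collection, query, page and pageSize are placeholders; any over-fetch beyond the page size is enough to detect a next page):
const pageSize = 10;
const docs = await collection.find(query)
  .skip(page * pageSize)
  .limit(pageSize + 1)                 // peek one document beyond the page
  .toArray();
const hasNextPage = docs.length > pageSize;
const items = docs.slice(0, pageSize); // what you actually return for this page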
MongoDB has introduced a new method for getting only the count of the documents matching a given query and it goes as follows:
const result = await db.collection('foo').count({name: 'bar'});
console.log('result:', result) // prints the matching doc count
Recipe for usage in pagination:
const query = {name: 'bar'};
const skip = (pageNo - 1) * pageSize; // assuming pageNo starts from 1
const limit = pageSize;
const [listResult, countResult] = await Promise.all([
db.collection('foo')
.find(query)
.skip(skip)
.limit(limit)
.toArray(),
db.collection('foo').count(query)
])
return {
totalCount: countResult,
list: listResult
}
For more details on db.collection.count, see the documentation.
It is possible to get the total result size without the effect of limit() using count() as answered here:
Limiting results in MongoDB but still getting the full count?
According to the documentation you can even control whether limit/pagination is taken into account when calling count():
https://docs.mongodb.com/manual/reference/method/cursor.count/#cursor.count
Edit: in contrast to what is written elsewhere - the docs clearly state that "The operation does not perform the query but instead counts the results that would be returned by the query". Which - from my understanding - means that only one query is executed.
Example:
> db.createCollection("test")
{ "ok" : 1 }
> db.test.insert([{name: "first"}, {name: "second"}, {name: "third"},
{name: "forth"}, {name: "fifth"}])
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 5,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]
})
> db.test.find()
{ "_id" : ObjectId("58ff00918f5e60ff211521c5"), "name" : "first" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c6"), "name" : "second" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c7"), "name" : "third" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c8"), "name" : "forth" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c9"), "name" : "fifth" }
> db.test.count()
5
> var result = db.test.find().limit(3)
> result
{ "_id" : ObjectId("58ff00918f5e60ff211521c5"), "name" : "first" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c6"), "name" : "second" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c7"), "name" : "third" }
> result.count()
5 (total result size of the query without limit)
> result.count(1)
3 (result size with limit(3) taken into account)
Try as below:
cursor.count(false, function(err, total){ console.log("total", total) })
core.db.users.find(query, {}, {skip:0, limit:1}, function(err, cursor){
if(err)
return callback(err);
cursor.toArray(function(err, items){
if(err)
return callback(err);
cursor.count(false, function(err, total){
if(err)
return callback(err);
console.log("cursor", total)
callback(null, {items: items, total:total})
})
})
})
A word of caution when using aggregation for pagination: it is better to use two queries if the API is used frequently by users to fetch data. In our case this was at least 50 times faster than getting the data with an aggregation on a production server with many users accessing the system online. Aggregation and $facet are better suited to dashboards, reports and cron jobs that are called less frequently.
We can do it using two queries.
const limit = parseInt(req.query.limit || 50, 10);
let page = parseInt(req.query.page || 0, 10);
if (page > 0) { page = page - 1}
let doc = await req.db.collection('bookings').find().sort( { _id: -1 }).skip(page * limit).limit(limit).toArray();
let count = await req.db.collection('bookings').find().count();
res.json({data: [...doc], count: count});
I took the two queries approach, and the following code has been taken straight out of a project I'm working on, using MongoDB Atlas and a full-text search index:
return new Promise( async (resolve, reject) => {
try {
const search = {
$search: {
index: 'assets',
compound: {
should: [{
text: {
query: args.phraseToSearch,
path: [
'title', 'note'
]
}
}]
}
}
}
const project = {
$project: {
_id: 0,
id: '$_id',
userId: 1,
title: 1,
note: 1,
score: {
$meta: 'searchScore'
}
}
}
const match = {
$match: {
userId: args.userId
}
}
const skip = {
$skip: args.skip
}
const limit = {
$limit: args.first
}
const group = {
$group: {
_id: null,
count: { $sum: 1 }
}
}
const searchAllAssets = await Models.Assets.schema.aggregate([
search, project, match, skip, limit
])
const [ totalNumberOfAssets ] = await Models.Assets.schema.aggregate([
search, project, match, group
])
return await resolve({
searchAllAssets: searchAllAssets,
totalNumberOfAssets: totalNumberOfAssets.count
})
} catch (exception) {
return reject(new Error(exception))
}
})
I had the same problem and came across this question. The correct solution to this problem is posted here.
You can do this in one query. First you run a count and within that run the limit() function.
In Node.js and Express.js, you will have to use it like this to be able to use the "count" function along with toArray's "result".
var curFind = db.collection('tasks').find({query});
Then you can run two functions after it like this (one nested in the other)
curFind.count(function (e, count) {
// Use count here
curFind.skip(0).limit(10).toArray(function(err, result) {
// Use result here and count here
});
});
Related
I have a list of content IDs and I'm trying to fetch the most recent comment (if one exists) for each of the content IDs in the list.
My query looks as follows:
const query = [
{
$match: {
content_id: { $in: myContentIds },
},
},
{ $sort: { 'comment_created': -1 } },
]
const results = await collection.find(query).toArray();
My understanding is this will fetch all of the comments related to the contentIds in the myContentIds array and sort them in descending order based on the date.
I could then limit my results using { $limit: 1 }, but this would return the most recent comment on any of the content items, rather than the most recent comment for each content item.
How can I modify my query to return the most recent comment for each of my content items?
$group by content_id and take the first (most recent) document
$replaceRoot to promote that recent document to the root (this is optional; you could instead use the document from the recentComment field)
const query = [
{ $match: { content_id: { $in: myContentIds } } },
{ $sort: { comment_created: -1 } },
{
$group: {
_id: "$content_id",
recentComment: { $first: "$$ROOT" }
}
},
{ $replaceRoot: { newRoot: "$recentComment" } }
];
const results = await collection.aggregate(query).toArray();
In MongoDB, is it possible to update the value of a field using the value from another field? The equivalent SQL would be something like:
UPDATE Person SET Name = FirstName + ' ' + LastName
And the MongoDB pseudo-code would be:
db.person.update( {}, { $set : { name : firstName + ' ' + lastName } );
The best way to do this is with version 4.2+, which allows using an aggregation pipeline in the update document with the updateOne, updateMany, or update (deprecated in most, if not all, language drivers) collection methods.
MongoDB 4.2+
Version 4.2 also introduced the $set pipeline stage operator, which is an alias for $addFields. I will use $set here as it maps to what we are trying to achieve.
db.collection.<update method>(
{},
[
{"$set": {"name": { "$concat": ["$firstName", " ", "$lastName"]}}}
]
)
Note that the square brackets in the second argument to the method specify an aggregation pipeline instead of a plain update document; using a plain document will not work correctly.
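For contrast, here is a sketch of what happens without the array: in a plain update document, "$firstName" is not evaluated as a field reference, so every document would end up with the literal text rather than the concatenated value.
db.collection.updateMany(
  {},
  { "$set": { "name": "$firstName" } }  // stores the literal string "$firstName"
)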
MongoDB 3.4+
In 3.4+, you can use $addFields and the $out aggregation pipeline operators.
db.collection.aggregate(
[
{ "$addFields": {
"name": { "$concat": [ "$firstName", " ", "$lastName" ] }
}},
{ "$out": <output collection name> }
]
)
Note that this does not update your collection but instead replaces the existing collection or creates a new one. Also, for update operations that require "typecasting", you will need client-side processing, and depending on the operation, you may need to use the find() method instead of the .aggregate() method.
MongoDB 3.2 and 3.0
The way we do this is by $projecting our documents and using the $concat string aggregation operator to return the concatenated string.
You then iterate the cursor and use the $set update operator to add the new field to your documents using bulk operations for maximum efficiency.
Aggregation query:
var cursor = db.collection.aggregate([
{ "$project": {
"name": { "$concat": [ "$firstName", " ", "$lastName" ] }
}}
])
MongoDB 3.2 or newer
You need to use the bulkWrite method.
var requests = [];
cursor.forEach(document => {
requests.push( {
'updateOne': {
'filter': { '_id': document._id },
'update': { '$set': { 'name': document.name } }
}
});
if (requests.length === 500) {
//Execute per 500 operations and re-init
db.collection.bulkWrite(requests);
requests = [];
}
});
if(requests.length > 0) {
db.collection.bulkWrite(requests);
}
MongoDB 2.6 and 3.0
From this version, you need to use the now deprecated Bulk API and its associated methods.
var bulk = db.collection.initializeUnorderedBulkOp();
var count = 0;
cursor.snapshot().forEach(function(document) {
bulk.find({ '_id': document._id }).updateOne( {
'$set': { 'name': document.name }
});
count++;
if(count%500 === 0) {
// Execute per 500 operations and re-init
bulk.execute();
bulk = db.collection.initializeUnorderedBulkOp();
}
})
// clean up queues
if(count > 0) {
bulk.execute();
}
MongoDB 2.4
cursor["result"].forEach(function(document) {
db.collection.update(
{ "_id": document._id },
{ "$set": { "name": document.name } }
);
})
You should iterate through. For your specific case:
db.person.find().snapshot().forEach(
function (elem) {
db.person.update(
{
_id: elem._id
},
{
$set: {
name: elem.firstname + ' ' + elem.lastname
}
}
);
}
);
Apparently there is a way to do this efficiently since MongoDB 3.4, see styvane's answer.
Obsolete answer below
You cannot refer to the document itself in an update (yet). You'll need to iterate through the documents and update each document using a function. See this answer for an example, or this one for server-side eval().
For a database with high activity, you may run into issues where your updates affect actively changing records, and for this reason I recommend using snapshot():
db.person.find().snapshot().forEach( function (hombre) {
hombre.name = hombre.firstName + ' ' + hombre.lastName;
db.person.save(hombre);
});
http://docs.mongodb.org/manual/reference/method/cursor.snapshot/
Starting with Mongo 4.2, db.collection.update() can accept an aggregation pipeline, finally allowing the update/creation of a field based on another field:
// { firstName: "Hello", lastName: "World" }
db.collection.updateMany(
{},
[{ $set: { name: { $concat: [ "$firstName", " ", "$lastName" ] } } }]
)
// { "firstName" : "Hello", "lastName" : "World", "name" : "Hello World" }
The first part {} is the match query, filtering which documents to update (in our case all documents).
The second part [{ $set: { name: { ... } } }] is the update aggregation pipeline (note the square brackets signifying the use of an aggregation pipeline). $set is a new aggregation operator and an alias of $addFields.
Regarding this answer, the snapshot function is deprecated in version 3.6, according to this update. So, on version 3.6 and above, it is possible to perform the operation this way:
db.person.find().forEach(
function (elem) {
db.person.update(
{
_id: elem._id
},
{
$set: {
name: elem.firstname + ' ' + elem.lastname
}
}
);
}
);
I tried the above solution but I found it unsuitable for large amounts of data. I then discovered the stream feature:
MongoClient.connect("...", function(err, db){
var c = db.collection('yourCollection');
var s = c.find({/* your query */}).stream();
s.on('data', function(doc){
c.update({_id: doc._id}, {$set: {name : doc.firstName + ' ' + doc.lastName}}, function(err, result) { /* result == true? */ });
});
s.on('end', function(){
// stream can end before all your updates do if you have a lot
})
})
The update() method takes an aggregation pipeline as a parameter, like:
db.collection_name.update(
{
// Query
},
[
// Aggregation pipeline
{ "$set": { "id": "$_id" } }
],
{
// Options
"multi": true // false when a single doc has to be updated
}
)
The field can be set or unset with existing values using the aggregation pipeline.
Note: use $ with the field name to specify the field whose value has to be read.
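For example, a sketch of copying one field and removing the original in the same pipeline update (price and priceBackup are hypothetical field names):
db.collection_name.updateMany(
  {},
  [
    { "$set": { "priceBackup": "$price" } }, // read the existing value with $
    { "$unset": "price" }                    // then drop the old field
  ]
)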
Here's what we came up with for copying one field to another for ~150,000 records. It took about 6 minutes, but it is still significantly less resource-intensive than it would have been to instantiate and iterate over the same number of Ruby objects.
js_query = %({
$or : [
{
'settings.mobile_notifications' : { $exists : false },
'settings.mobile_admin_notifications' : { $exists : false }
}
]
})
js_for_each = %(function(user) {
if (!user.settings.hasOwnProperty('mobile_notifications')) {
user.settings.mobile_notifications = user.settings.email_notifications;
}
if (!user.settings.hasOwnProperty('mobile_admin_notifications')) {
user.settings.mobile_admin_notifications = user.settings.email_admin_notifications;
}
db.users.save(user);
})
js = "db.users.find(#{js_query}).forEach(#{js_for_each});"
Mongoid::Sessions.default.command('$eval' => js)
With MongoDB version 4.2+, updates are more flexible, as they allow the use of the aggregation pipeline in update, updateOne and updateMany. You can now transform your documents using aggregation operators and then update, without the need to explicitly state the $set command (instead we use $replaceRoot: {newRoot: "$$ROOT"}).
Here we use an aggregation query to extract the timestamp from MongoDB's ObjectID "_id" field and update the documents. (I am not an expert in SQL, but I think SQL does not provide an auto-generated ObjectID that carries a timestamp; you would have to create that date yourself.)
var collection = "person"
agg_query = [
{
"$addFields" : {
"_last_updated" : {
"$toDate" : "$_id"
}
}
},
{
$replaceRoot: {
newRoot: "$$ROOT"
}
}
]
db.getCollection(collection).updateMany({}, agg_query, {upsert: true})
(I would have posted this as a comment, but couldn't)
For anyone who lands here trying to update one field using another in the document with the c# driver...
I could not figure out how to use any of the UpdateXXX methods and their associated overloads since they take an UpdateDefinition as an argument.
// we want to set Prop1 to Prop2
class Foo { public string Prop1 { get; set; } public string Prop2 { get; set;} }
void Test()
{
var update = new UpdateDefinitionBuilder<Foo>();
update.Set(x => x.Prop1, <new value; no way to get a hold of the object that I can find>)
}
As a workaround, I found that you can use the RunCommand method on an IMongoDatabase (https://docs.mongodb.com/manual/reference/command/update/#dbcmd.update).
var command = new BsonDocument
{
{ "update", "CollectionToUpdate" },
{ "updates", new BsonArray
{
new BsonDocument
{
// Any filter; here the check is if Prop1 does not exist
{ "q", new BsonDocument{ ["Prop1"] = new BsonDocument("$exists", false) }},
// set it to the value of Prop2
{ "u", new BsonArray { new BsonDocument { ["$set"] = new BsonDocument("Prop1", "$Prop2") }}},
{ "multi", true }
}
}
}
};
database.RunCommand<BsonDocument>(command);
MongoDB 4.2+ Golang
result, err := collection.UpdateMany(ctx, bson.M{},
mongo.Pipeline{
bson.D{{"$set",
bson.M{"name": bson.M{"$concat": []string{"$lastName", " ", "$firstName"}}}
}},
)
I need to find 5 random documents in MongoDB using the find function. I am using the LoopBack 4 framework. I already tried to use sample (it is in the comment):
const userParties: IndividualParty[] = (await find(
this.logger,
{
where: {
and: [
{ _id: { nin: ids.map(id => id) } },
{ gender: { inq: gender } },
],
},
//sample: { size: 5 },
//limit: 5,
} as Filter<IndividualParty>,
this.partyRepository,
)) as IndividualParty[];
I'm not familiar with Loopback, but using pure node and node MongoDB driver, here's the shortest example I can come up with:
var run = async function() {
const conn = await require('mongodb').MongoClient.connect('mongodb://localhost:27017', {useNewUrlParser: true})
let agg = [
{'$match': {'_id': {'$gte': 50}}},
{'$sample': {'size': 5}}
]
let res = await conn.db('test').collection('test').aggregate(agg).toArray()
console.log(res)
await conn.close()
}()
In a collection containing _id from 0 to 99, this will randomly output 5 documents having _id of at least 50. Example output:
[ { _id: 60 }, { _id: 77 }, { _id: 84 }, { _id: 96 }, { _id: 63 } ]
You would need to make the above example work with Loopback, but the basic idea is there.
Note:
You need aggregation instead of find().
Have a read through the $sample documentation and note especially its behavior:
$sample uses one of two methods to obtain N random documents, depending on the size of the collection, the size of N, and $sample’s position in the pipeline.
The position of $sample in the pipeline is important. If you need to select a subset of the collection to do $sample on via a $match stage (as with the example above), then you will need to ensure that the subset to be sampled is within 16 MB (the limit of MongoDB in-memory sort).
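To illustrate the difference placement makes, here is a sketch against the same test collection as above (which of the two behaviors applies depends on the conditions described in the quoted documentation):
// $sample as the very first stage: eligible for the optimized pseudo-random cursor path
let sampleFirst = [
  {'$sample': {'size': 5}}
]

// $sample after $match: samples only the matched subset, which may require
// an in-memory random sort of that subset
let matchThenSample = [
  {'$match': {'_id': {'$gte': 50}}},
  {'$sample': {'size': 5}}
]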
I want to update an object inside an array of schemas without having to make two requests to the database. I am currently incrementing the field using findOneAndUpdate() if the object already exists, and it works fine. But when the object does not exist, I have to make another request using update() to push the new object and make it available for later increments.
I want to be able to make only one request (e.g. findOne()) to get the user, then increment the field if the object exists in the array, and if not, push the new object instead, then save the document. This way I am only making one read/request to the database instead of two.
This is the function now:
async addItemToCart(body, userId) {
const itemInDb = await Model.findOneAndUpdate(
{
_id: userId,
'cart.productId': body.productId,
},
{ $inc: { 'cart.$.count': 1 } }
);
if (itemInDb) return true;
const updated = await Model.update(
{ _id: userId },
{ $push: { cart: body } }
);
if (updated.ok !== 1)
return createError(500, 'something went wrong in userService');
return true;
}
What I would like to do is:
async addItemToCart(body, userId) {
const itemInDb = await Model.findOne(
{
_id: userId,
'cart.productId': body.productId,
}
);
if (itemInDb) {
/**
*
* increment cart in itemInDb then do itemInDb.save() <<------------
*/
} else {
/**
* push product to itemInDb then save
*/
}
}
Thank you!
You can try findOneAndUpdate() with upsert.
With upsert: true, the document is created if it does not exist in the DB.
Model.findOneAndUpdate(
{
_id: userId,
'cart.productId': body.productId,
},
{ $inc: { 'cart.$.count': 1 } },
{
upsert: true,
}
)
Use $set and $inc in one query.
try {
db.scores.findOneAndUpdate(
{
_id: userId,
'cart.productId': body.productId,
},
{ $set: { "cart.$.productName" : "A.B.C", "cart.$.productPrice" : 5}, $inc : { "cart.$.count" : 1 } },
{ upsert:true, returnNewDocument : true }
);
}
catch (e){
//error
}
You can use upsert.
An upsert is an operation that creates a new document when no document matches the query criteria, and updates the document when one matches. It is an option for the update command. If you execute a command like the one below, it works as an update if there is a document matching the query, or as an insert with a document described by the update argument.
Example: I am just giving a simple example; you have to change it according to your requirements.
db.people.update(
{ name: "Andy" },
{
name: "Andy",
rating: 1,
score: 1
},
{ upsert: true }
)
So in the above example, if a person with the name Andy is found, the update operation will be performed. If not, it will create a new document.
I am currently querying my MongoDB for an array of URLs in one collection, which returns an array. I then want to use that array to go through another collection and find the matching elements for each element in the previous query's returned array. Is it proper to use forEach on the array and do individual queries?
My code looks like this; the first function, getUrls, works great. The current error I get is:
(node:10754) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): TypeError: Cannot read property 'limit' of undefined
(node:10754) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
async function useUrls () {
let domains = await getUrls()
let db = await mongo.connect("mongodb://35.185.206.31:80/lc_data")
let results = []
domains.forEach( domain =>{
let query = {"$match":
{"email_domain": domain}
}
let cursor = db.collection('circleback')
.aggregate([query], (err, data) =>{
if(err)
throw err;
console.log("cb", data)
}).limit(1100)
})
As noted, the code in the question has a few problems, most of which can be addressed by looking at the full sample listing as supplied at the end of this response. What you are essentially asking for here is a variation on the "Top-N results" problem, for which there are a couple of ways to "practically" handle this.
So somewhat ranking from "worst" to "best":
Aggregation $slice
So rather than "loop" your results of your function, you can alternately supply all the results to a query using $in. That alleviates the need to "loop inputs", but the other thing needed here is the "top-N per output".
There really is not a "stable" mechanism in MongoDB for this as yet, but "if" it is plausible on the size of given collections then you can in fact simply $group on your "distinct" keys matching the provided $in arguments, and then $push all documents into an array and $slice the results:
let results = await db.collection('circleback').aggregate([
{ "$match": { "email_domain": { "$in": domains } } },
{ "$group": {
"_id": "$email_domain",
"docs": { "$push": "$$ROOT" }
}},
{ "$sort": { "_id": 1 } },
{ "$addFields": { "docs": { "$slice": [ "$docs", 0, 1100 ] } } }
]).toArray();
The "wider" issue here is that MongoDB has no way of "limiting" the array content on the initial $push. And this in fact is awaiting a long outstanding issue. SERVER-9377.
So whilst we can do this sort of operation "in theory", it often is not practical at all since the 16MB BSON Limit often restricts that "initial" array size, even if the $slice result would indeed stay under that cap.
Serial Loop Execution async/await
Your code shows you are running under this environment, so I suggest you actually use it. Simply await on each loop iteration from the source:
let results = [];
for ( let domain of domains ) {
results = results.concat(
await db.collection('circleback').find({ "email_domain": domain })
.limit(1100).toArray()
);
}
Simple functions allow you to do this, such as returning the standard cursor result of .find() as an array via .toArray() and then using .concat() to join with previous arrays of results.
It's simple and effective, but we can probably do a little better.
Concurrent Execution of Async Methods
So instead of using a "loop" and await on each called async function, you can instead execute them all ( or at least "most" ) concurrently instead. This is in fact part of the problem you presently have as presented in the question, because nothing actually "waits" for the loop iteration.
We could use Promise.all() to effectively do this, however if it is actually a "very large" number of promises that would be running concurrently, this would run into the same problem as experienced, where the call stack is exceeded.
To avoid this, yet still have the benefits we can use Bluebird promises with Promise.map(). This has a "concurrent limiter" option, that allows only a specified number of operations to act simultaneously:
let results = [].concat.apply([],
await Promise.map(domains, domain =>
db.collection('circleback').find({ "email_domain": domain })
.limit(1100).toArray()
,{ concurrency: 10 })
);
In fact you should even be able to use a library such as Bluebird to "plug in" the .map() functionality to anything else that returns a Promise, such as your "source" function returning the list of "domains". Then you could "chain" just as is shown in the later examples.
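For instance, a sketch of that chaining with Bluebird, reusing the getUrls() source function from the question (Promise.resolve wraps its result in a Bluebird promise so the instance .map() with its concurrency option becomes available):
const Promise = require('bluebird');

let results = [].concat.apply([],
  await Promise.resolve(getUrls())
    .map(domain =>
      db.collection('circleback').find({ "email_domain": domain })
        .limit(1100).toArray(),
      { concurrency: 10 })
);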
Future MongoDB
Future releases of MongoDB ( from MongoDB 3.6 ) actually have a new "Non-Correlated" form of $lookup that allows a special case here. So going back to the original aggregation example, we can get the "distinct" values for each matching key, and then $lookup with a "pipeline" argument which would then allow a $limit to be applied on results.
let results = await db.collection('circleback').aggregate([
{ "$match": { "email_domain": { "$in": domains } } },
{ "$group": { "_id": "$email_domain" }},
{ "$sort": { "_id": 1 } },
{ "$lookup": {
"from": "circleback",
"let": {
"domain": "$_id"
},
"pipeline": [
{ "$redact": {
"$cond": {
"if": { "$eq": [ "$email_domain", "$$domain" ] },
"then": "$$KEEP",
"else": "$$PRUNE"
}
}},
{ "$limit": 1100 }
],
"as": "docs"
}}
]).toArray();
This would then always stay under the 16MB BSON limit, presuming of course that the argument to $in allowed that to be the case.
Example Listing
As a full Example Listing you can run, and generally play with as the default data set creation is intentionally quite large. It demonstrates all techniques described above as well as some general usage patterns to follow.
const mongodb = require('mongodb'),
Promise = require('bluebird'),
MongoClient = mongodb.MongoClient,
Logger = mongodb.Logger;
const uri = 'mongodb://localhost/bigpara';
function log(data) {
console.log(JSON.stringify(data,undefined,2))
}
(async function() {
let db;
try {
db = await MongoClient.connect(uri,{ promiseLibrary: Promise });
Logger.setLevel('info');
let source = db.collection('source');
let data = db.collection('data');
// Clean collections
await Promise.all(
[source,data].map( coll => coll.remove({}) )
);
// Create some data to work with
await source.insertMany(
Array.apply([],Array(500)).map((e,i) => ({ item: i+1 }))
);
let ops = [];
for (let i=1; i <= 10000; i++) {
ops.push({
item: Math.floor(Math.random() * 500) + 1,
index: i,
amount: Math.floor(Math.random() * (200 - 100 + 1)) + 100
});
if ( i % 1000 === 0 ) {
await data.insertMany(ops,{ ordered: false });
ops = [];
}
}
/* Fetch 5 and 5 example
*
* Note that the async method to supply to $in is a simulation
* of any real source that is returning an array
*
* Not the best since it means ALL documents go into the array
* for the selection. Then you $slice off only what you need.
*/
console.log('\nAggregate $in Example');
await (async function(source,data) {
let results = await data.aggregate([
{ "$match": {
"item": {
"$in": (await source.find().limit(5).toArray()).map(d => d.item)
}
}},
{ "$group": {
"_id": "$item",
"docs": { "$push": "$$ROOT" }
}},
{ "$addFields": {
"docs": { "$slice": [ "$docs", 0, 5 ] }
}},
{ "$sort": { "_id": 1 } }
]).toArray();
log(results);
})(source,data);
/*
* Fetch 10 by 2 example
*
* Much better usage of concurrent processes and only gets
* what is needed. But it is actually 1 request per item
*/
console.log('\nPromise.map concurrency example');
await (async function(source,data) {
let results = [].concat.apply([],
await source.find().limit(10).toArray().map(d =>
data.find({ item: d.item }).limit(2).toArray()
,{ concurrency: 5 })
);
log(results);
})(source,data);
/*
* Plain loop async/await serial example
*
* Still one request per item, requests are serial
* and therefore take longer to complete than concurrent
*/
console.log('\nasync/await serial loop');
await (async function(source,data) {
let items = (await source.find().limit(10).toArray());
let results = [];
for ( let item of items ) {
results = results.concat(
await data.find({ item: item.item }).limit(2).toArray()
);
}
log(results);
})(source,data);
/*
* Non-Correlated $lookup example
*
* Uses aggregate to get the "distinct" matching results and then does
* a $lookup operation to retrieve the matching documents to the
* specified $limit
*
* Typically not as efficient as the concurrent example, but does
* actually run completely on the server, and does not require
* additional connections.
*
*/
let version = (await db.db('admin').command({'buildinfo': 1})).version;
if ( version >= "3.5" ) {
console.log('\nNon-Correlated $lookup example $limit')
await (async function(source,data) {
let items = (await source.find().limit(5).toArray()).map(d => d.item);
let results = await data.aggregate([
{ "$match": { "item": { "$in": items } } },
{ "$group": { "_id": "$item" } },
{ "$sort": { "_id": 1 } },
{ "$lookup": {
"from": "data",
"let": {
"itemId": "$_id",
},
"pipeline": [
{ "$redact": {
"$cond": {
"if": { "$eq": [ "$item", "$$itemId" ] },
"then": "$$KEEP",
"else": "$$PRUNE"
}
}},
{ "$limit": 5 }
],
"as": "docs"
}}
]).toArray();
log(results);
})(source,data);
} else {
console.log('\nSkipped Non-Correlated $lookup demo');
}
} catch(e) {
console.error(e);
} finally {
db.close();
}
})();