MongooseJS - limit number of documents in subdocument - javascript

I wonder if there's a way to limit the number of elements in a (sub)document. In my app, elements can be added to the subdocument via ajax, so it's important that I prevent a malicious user from adding a ridiculous number of entries.
I can think of one way: query the parent document, check the number of elements in the subdocument, and if it's less than X, save it. That, however, requires querying the whole document just to update a little thing. I would much prefer to do everything with update() instead of findOne() and save(). Is there a way?
Edit: It can be done without a db lookup prior to updating. I don't know if it's possible with a subdoc; I did it with a plain object. Simply store your data in an array, or better, a classical object. Then before updating, validate the key numbers: if a key number is greater than the allowed maximum, ignore it. Of course, if dealing with an object, check that the key is a number first.

You can enforce an array limit in Mongoose using a custom validate function, with something like:
validate: [v => v.length <= MAX, 'message']
where MAX is the maximum number of elements and 'message' is the message to show on validation failure. See:
http://mongoosejs.com/docs/api.html#schematype_SchemaType-validate
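For illustration, a minimal sketch of that validator in a full schema; the field name, limit, and message are assumptions, not from the question:

const mongoose = require('mongoose');

const MAX = 50; // illustrative cap on subdocument entries

const parentSchema = new mongoose.Schema({
  entries: {
    type: [{ title: String }],
    // reject the document when the array has grown past MAX
    validate: [v => v.length <= MAX, 'entries exceeds the maximum allowed length']
  }
});

const Parent = mongoose.model('Parent', parentSchema);

Note that validators like this run on save(); whether they also run on update() depends on your Mongoose version and the runValidators option.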
But you cannot make a single request to MongoDB that would fail if the array is too long, because MongoDB has no concept of schemas, and that is what would be needed to do what you want. Schemas are only present in Mongoose.
So if you want to use MongoDB directly, then not only will you be unable to do it in one request, it's actually much worse than that: you also cannot eliminate the race condition, because MongoDB doesn't support transactions, so you cannot really guarantee that your limit will never be exceeded.

Related

DynamoDB/Dynamoose query based on ALL elements in an array

Is there any way using DynamoDB/Dynamoose to query based on an array field containing all of the specified elements? I'm making the move from MongoDB/Mongoose and I need something similar to the functionality of the $all operator in MongoDB (https://docs.mongodb.com/manual/reference/operator/query/all/)
No, there is no such feature.
In any case, you are talking about a FilterExpression parameter to Query here, not a KeyConditionExpression (which can only be about the key columns, and those cannot be nested arrays), so you will pay for the entire items anyway. You might as well just read the entire items and do the comparisons you want in the client. This will cost you extra network bandwidth, but nothing extra in DynamoDB operations, which you'll pay for in any case.
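As an illustration, here is a hedged sketch of that client-side approach with the AWS SDK for JavaScript; the table name, key attribute, and tags attribute are all assumptions:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function findWithAllTags(partitionKey, requiredTags) {
  // Query by key only; DynamoDB itself has no equivalent of MongoDB's $all
  const result = await docClient.query({
    TableName: 'Items',                  // assumed table name
    KeyConditionExpression: 'pk = :pk',  // assumed key attribute
    ExpressionAttributeValues: { ':pk': partitionKey }
  }).promise();

  // Emulate $all in the client: keep items whose tags contain every required element
  return result.Items.filter(item =>
    requiredTags.every(tag => (item.tags || []).includes(tag)));
}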

Fastest way to check for an object's existence in a list in JavaScript

I've got a list of 100,000 items that live in memory (all of them big ints stored as strings).
The data structure these come in doesn't really matter. Right now they live in an array like so:
const list = ['1', '2', '3', /* ... */ '100000'];
Note: The above is just an example - in reality, each entry is an 18 digit string.
I need to check for an object's existence. Currently I'm doing:
const needToCheck = '3';
const doesInclude = list.includes(needToCheck);
However, there are a lot of ways I could do this existence check. I need this to be as performant as possible.
A few other avenues I could follow are:
Create a Map with the value being undefined
Create an object ({}) and create the keys of the object as the entries in list, then use hasOwnProperty.
Use a Set()
Use some other sort of data structure (a tree?), given that these are all numbers. However, they're all 18 digits in length, so maybe that'll be less performant.
I can accept a higher upfront cost to build the data structure to get a bigger speed increase later, as this is for a URL route that will be hit >1MM times a day.
Array.prototype.includes is an O(n) operation, which is not desirable - every time you want to check whether a value exists, you'll have to iterate over much of the collection (perhaps the entire collection).
A Map, Set, or object are better, since checking whether they have a value is an O(1) operation.
A tree is not desirable either, because a lookup necessarily takes multiple operations down the tree (O(log n) at best), which could be an issue if the tree is large and you look up frequently - so an O(1) solution is better.
A Map, while it works, probably isn't appropriate because you just want to see if a value exists - you don't need key-value pairs, just values. A Set is composed of only values (and Set.has is indeed O(1)), so that's the best choice for this situation. An object with keys, while it could work too, might not be a good idea because it may create many unnecessary hidden classes - a Set is more designed towards dynamic values at runtime.
So, the Set approach looks to be the most performant and appropriate choice.
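For illustration, a minimal sketch of that approach; the one-time O(n) build cost buys O(1) average-case membership checks afterwards:

// Build once, up front (entries are 18-digit strings in reality)
const list = ['1', '2', '3', /* ... */ '100000'];
const lookup = new Set(list);

// Every later check is O(1) on average instead of O(n)
const needToCheck = '3';
const doesInclude = lookup.has(needToCheck);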
You might also consider the possibility of moving the calculation to the server. 100,000 items isn't necessarily too much, but it's still a surprisingly large amount to see client-side.
Unconventionally, you could also use an object and set each of your 100,000 items as a property because under the hood, the JavaScript Object is implemented with a hash table.
For example,
var numbers = {
  "1": 1243213,
  "2": 4314121,
  "3": 3142123
  // ...
};
You could then very quickly check if an item exists by testing whether numbers["1"] === undefined. And not only that, but you can also get the value of the property at the same time.
However, this method does come with some drawbacks like iterating through the list becoming a lot more complicated (though still possible).
For reference, see https://stackoverflow.com/a/24196259/8250558

mysql order by field usage/limit/performance

I'm currently trying to modify the selection order of some records using a javascript drag&drop mechanism.
This is the idea:
Once I've ordered the elements by d&d I retrieve the IDs of each element (in the right order) and I send them to php via ajax call.
I store the array of IDs somewhere (to develop)
Then, I run a query like this:
$sql = "SELECT * FROM items ORDER BY field(id, ".$order.");";
(where $order is the imploded array of IDs)
It works quite good but, since I never used this feature before, my doubt is:
since my IDs are strings of 16 characters, and supposing I have 200 records to order...
...should I expect any trouble in terms of performance?
Do you see any better solution?
Thanks.
The comments up there made me think and I realized that this approach has a big issue.
Even considering sending the $order array only at the end of the drag&drop process - I mean, push a button (unlock d&d), reorder, confirm (send&lock) - it would still be necessary to perform a custom select on every single js action that entails a refresh of the elements (view, create, rename, ...). And that's pretty dumb.
So I guess that the best approach is the one suggested by Kiko, maybe with a lock system as described above to avoid an ajax call and consequent reindexing of the order field at every single move.
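For illustration, a hedged sketch of that order-field approach in Node.js with the mysql package; table and column names are assumptions. The final order is persisted once when the user confirms, so every later read is a plain indexed sort instead of a FIELD() call:

const mysql = require('mysql');
const connection = mysql.createConnection({ /* host, user, password, database */ });

// orderedIds arrives from the ajax call after the user confirms the drag & drop
function saveOrder(orderedIds, done) {
  const updates = orderedIds.map((id, index) =>
    new Promise((resolve, reject) =>
      connection.query(
        'UPDATE items SET sort_order = ? WHERE id = ?', [index, id],
        err => (err ? reject(err) : resolve()))));
  Promise.all(updates).then(() => done(null), done);
}

// Reading back no longer needs FIELD():
// connection.query('SELECT * FROM items ORDER BY sort_order', ...);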

Followers - mongodb database design

So I'm using mongodb and I'm unsure if I've got the correct / best database collection design for what I'm trying to do.
There can be many items, and a user can create new groups with these items in. Any user may follow any group!
I have not just added the followers and items into the group collection because there could be 5 items in the group, or there could be 10000 (and the same for followers) and from research I believe that you should not use unbound arrays (where the limit is unknown) due to performance issues when the document has to be moved because of its expanding size. (Is there a recommended maximum for array lengths before hitting performance issues anyway?)
I think with the following design a real performance issue could be when I want to get all of the groups that a user is following for a specific item (based off of the user_id and item_id), because then I have to find all of the groups the user is following, and from that find all of the item_groups with the group_id $in and the item id. (but I can't actually see any other way of doing this)
Follower
  .find({ user_id: "54c93d61596b62c316134d2e" })
  .exec(function (err, following) {
    if (err) { throw err; }
    var groups = [];
    for (var i = 0; i < following.length; i++) {
      groups.push(following[i].group_id);
    }
    item_groups.find({
      'group_id': { $in: groups },
      'item_id': '54ca9a2a6508ff7c9ecd7810'
    })
    .exec(function (err, groups) {
      if (err) { throw err; }
      res.json(groups);
    });
  });
Are there any better DB patterns for dealing with this type of setup?
UPDATE: Example use case added in comment below.
Any help / advice will be really appreciated.
Many Thanks,
Mac
I agree with the general notion of other answers that this is a borderline relational problem.
The key to MongoDB data models is write-heaviness, but that can be tricky for this use case, mostly because of the bookkeeping that would be required if you wanted to link users to items directly (a change to a group that is followed by lots of users would incur a huge number of writes, and you need some worker to do this).
Let's investigate whether the read-heavy model is inapplicable here, or whether we're doing premature optimization.
The Read Heavy Approach
Your key concern is the following use case:
a real performance issue could be when I want to get all of the groups that a user is following for a specific item [...] because then I have to find all of the groups the user is following, and from that find all of the item_groups with the group_id $in and the item id.
Let's dissect this:
Get all groups that the user is following
That's a simple query: db.followers.find({userId : userId}). We're going to need an index on userId which will make the runtime of this operation O(log n), or blazing fast even for large n.
from that find all of the item_groups with the group_id $in and the item id
Now this is the trickier part. Let's assume for a moment that it's unlikely for items to be part of a large number of groups. Then a compound index { itemId, groupId } would work best, because we can reduce the candidate set dramatically through the first criterion - if an item is shared in only 800 groups and the user is following 220 groups, MongoDB only needs to find the intersection of these, which is comparatively easy because both sets are small.
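In shell terms, the two indexes discussed so far would look something like this (collection names are illustrative):

db.followers.createIndex({ userId: 1 });
db.itemGroups.createIndex({ itemId: 1, groupId: 1 });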
We'll need to go deeper than this, though:
The structure of your data is probably that of a complex network. Complex networks come in many flavors, but it makes sense to assume your follower graph is nearly scale-free, which is also pretty much the worst case. In a scale free network, a very small number of nodes (celebrities, super bowl, Wikipedia) attract a whole lot of 'attention' (i.e. have many connections), while a much larger number of nodes have trouble getting the same amount of attention combined.
The small nodes are no reason for concern: the queries above, including round-trips to the database, are in the 2ms range on my development machine on a dataset with tens of millions of connections and > 5GB of data. Now that data set isn't huge, but no matter what technology you choose, you will be RAM bound, because the indices must be in RAM in any case (data locality and separability in networks are generally poor), and the set intersection size is small by definition. In other words: this regime is dominated by hardware bottlenecks.
What about the supernodes though?
Since that would be guesswork and I'm interested in network models a lot, I took the liberty of implementing a dramatically simplified network tool based on your data model to make some measurements. (Sorry it's in C#, but generating well-structured networks is hard enough in the language I'm most fluent in...).
When querying the supernodes, I get results in the range of 7ms tops (that's on 12M entries in a 1.3GB db, with the largest group having 133,000 items in it and a user that follows 143 groups.)
The assumption in this code is that the number of groups followed by a user isn't huge, but that seems reasonable here. If it's not, I'd go for the write-heavy approach.
Feel free to play with the code. Unfortunately, it will need a bit of optimization if you want to try this with more than a couple of GB of data, because it's simply not optimized and does some very inefficient calculations here and there (especially the beta-weighted random shuffle could be improved).
In other words: I wouldn't worry about the performance of the read-heavy approach yet. The problem is often not so much that the number of users grows, but that users use the system in unexpected ways.
The Write Heavy Approach
The alternative approach is probably to reverse the order of linking:
UserItemLinker
{
  userId,
  itemId,
  groupIds[] // for faster retrieval of the linker; it's unlikely that this grows large
}
This is probably the most scalable data model, but I wouldn't go for it unless we're talking about HUGE amounts of data where sharding is a key requirement. The key difference here is that we can now efficiently compartmentalize the data by using the userId as part of the shard key. That helps to parallelize queries, shard efficiently and improve data locality in multi-datacenter-scenarios.
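In shell terms, that sharding setup might look like this (database and collection names are illustrative); userId leads the shard key, so all of a user's links live on the same shard:

sh.enableSharding('mydb');
db.userItemLinker.createIndex({ userId: 1, itemId: 1 });
sh.shardCollection('mydb.userItemLinker', { userId: 1, itemId: 1 });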
This could be tested with a more elaborate version of the testbed, but I didn't find the time yet, and frankly, I think it's overkill for most applications.
I read your comment/use-case, so I've updated my answer.
I suggest changing the design as per this article: MongoDB Many-To-Many
The design approach is different, and you might want to remodel accordingly. I'll try to give you an idea to start with.
I make the assumption that a User and a Follower are basically the same entities here.
I think the point you might find interesting is that in MongoDB you can store array fields and this is what I will use to simplify/correct your design for MongoDB.
The two entities I would omit are: Followers and ItemGroups
Followers: It is simply a User who can follow Groups. I would add an array of group ids to hold the list of Groups that the User follows. So instead of having a Follower entity, I would only have User, with an array field containing a list of Group Ids.
ItemGroups: I would remove this entity too. Instead I would use an array of Item Ids in the Group entity and an array of Group Ids in the Item entity.
This is basically it. You will be able to do what you described in your use case. The design is simpler and more accurate in the sense that it reflects the design decisions of a document based database.
Notes:
You can define indexes on array fields in MongoDB. See Multikey Indexes for example.
Be wary about using indexes on array fields though. You need to understand your use case in order to decide whether it is reasonable or not. See this article. Since you only reference ObjectIds I thought you could try it, but there might be other cases where it is better to change the design.
Also note that the _id field is a MongoDB-specific field type of ObjectID, used as the primary key. To access the ids you can refer to them as e.g. user.id, group.id, etc. You can use an index to ensure uniqueness as per this question.
Your schema design could look like this:
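(A hedged Mongoose sketch of the remodeled design described above; all names are illustrative.)

const mongoose = require('mongoose');
const { ObjectId } = mongoose.Schema.Types;

// A User holds the groups they follow (replaces the Follower entity)
const UserSchema = new mongoose.Schema({
  name: String,
  groupIds: [{ type: ObjectId, ref: 'Group' }]
});

// A Group holds its items (replaces the ItemGroups entity)
const GroupSchema = new mongoose.Schema({
  name: String,
  itemIds: [{ type: ObjectId, ref: 'Item' }]
});

// An Item also holds the groups it appears in, for the reverse lookup
const ItemSchema = new mongoose.Schema({
  name: String,
  groupIds: [{ type: ObjectId, ref: 'Group' }]
});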
As to your other question/concerns
Is there a recommended maximum for array lengths before hitting performance issues anyway?
the answer is: in MongoDB the document size is limited to 16 MB, and there is no way you can work around that. However, 16 MB is considered to be sufficient; if you hit the 16 MB limit, then your design has to be improved. See here for info, section Document Size Limit.
I think with the following design a real performance issue could be when I want to get all of the groups that a user is following for a specific item (based off of the user_id and item_id)...
I would do it this way (a sketch follows the steps below). Note how much "easier" it sounds when using MongoDB:
get the item of the user
get groups that reference that item
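A sketch of those two steps against the remodeled schema (names as in the sketch above):

Item.findById(itemId, function (err, item) {
  if (err) throw err;
  // intersect the item's groups with the groups the user follows
  const followedGroupIds = item.groupIds.filter(id =>
    user.groupIds.some(gid => gid.equals(id)));
  Group.find({ _id: { $in: followedGroupIds } }, function (err, groups) {
    if (err) throw err;
    res.json(groups);
  });
});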
I would be rather concerned if the arrays got very large and you were using indexes on them. This could slow down write operations on the respective document(s) overall. Maybe not so much in your case, but I'm not entirely sure.
You're on the right track to creating a performant NoSQL schema design, and I think you're asking the right questions as to how to properly lay things out.
Here's my understanding of your application:
It looks like Groups can both have many Followers (mapping users to groups) and many Items, but Items may not necessarily be in many Groups (although it is possible). And from your given use-case example, it sounds like retrieving all the Groups an Item is in and all the Items in a Group will be some common read operations.
In your current schema design, you've implemented a model between mapping users to groups as followers and items to groups as item_groups. This works alright until you mention the problem with more complex queries:
I think with the following design a real performance issue could be when I want to get all of the groups that a user is following for a specific item (based off of the user_id and item_id)
I think a few things could help you out in this situation:
Take advantage of MongoDB's powerful indexing capabilities. In particular, I think you should consider creating compound indexes on your Follower objects covering your Group and User, and your Item_Groups on Item and Group, respectively. You'll also want to make sure this kind of relationship is unique, in that a user can only follow a group once and an item can only be added to a group once. This would best be achieved in some pre-save hooks defined in your schema, or using a plugin to check for validity.
FollowerSchema.index({ group: 1, user: 1 }, { unique: true });
Item_GroupsSchema.index({ group: 1, item: 1 }, { unique: true });
Using an index on these fields will create some overhead when writing to the collection, but it sounds like reading from the collection will be a more common interaction so it'll be worth it (I'd suggest reading more up on index performance).
Since a User probably won't be following thousands of groups, I think it'd be worthwhile to include in the user model an array of groups the user is following. This will help you out with that complex query when you want to find all instances of an item in groups that a user is currently following, since you'll have the list of groups right there. You'll still have the implementation where you're using $in: groups, but it'll be with one less query to the collection (see the sketch after this list).
As I mentioned before, it seems like items may not necessarily be in that many groups (just like users won't necessarily be following thousands of groups). If the case may commonly be that an item is in maybe a couple hundred groups, I'd consider just adding an array to the item model for each group that it gets added to. This would increase your performance when reading all the groups an item is in, a query you mentioned would be a common one. Note: You'd still use the Item_Groups model to retrieve all the items in a group by querying on the (now indexed) group_id.
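Putting suggestions 2 and 3 together, the complex query can then be served from a single collection read; a sketch, assuming user.groups is the denormalized array from point 2 and the field names from the indexes above:

Item_Groups.find({ group: { $in: user.groups }, item: itemId })
  .exec(function (err, itemGroups) {
    if (err) throw err;
    res.json(itemGroups);
  });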
Unfortunately, NoSQL databases aren't a good fit in this case. Your data model seems strictly relational, and according to the MongoDB documentation there are limits to what we can model and query here.
There are some established practices, though. MongoDB advises using a Followers collection to find which user follows which group (and vice versa) with good performance. You may find the closest case to your situation on slide 14 of this page. But I think the slides only apply if you want to show each result on a different page. For instance: you are a Twitter user, and when you click the followers button you see all your followers; then you click on a follower's name and see the follower's messages, and so on. As we can see, all of that works step by step. No relational query is needed.
I believe that you should not use unbound arrays (where the limit is unknown) due to performance issues when the document has to be moved because of its expanding size. (Is there a recommended maximum for array lengths before hitting performance issues anyway?)
Yes, you're right. http://askasya.com/post/largeembeddedarrays .
But if you have about a hundred items in your array, there is no problem.
If you have fixed-size data, even thousands of entries, you may embed it in your collections as an array, and you can query your indexed embedded document fields rapidly.
In my humble opinion, you should create hundreds of thousands of test records and check whether the performance of embedded documents and arrays suits your case. Don't forget to create indexes appropriate to your queries. You may also try using document references in your tests. If you like the performance after testing, go ahead.
You tried to find the group_id records that are followed by a specific user, and then tried to find a specific item within those group_ids. Could it be that the Item_Groups and Followers collections have a many-to-many relation?
If so, many-to-many relations aren't supported by NoSQL databases.
Is there any chance you can change your database to MySQL?
If so, you should check this out.
Briefly, MongoDB pros versus MySQL:
- Better write performance
Briefly, MongoDB cons versus MySQL:
- Worse read performance
If you work with Node.js you may check https://www.npmjs.com/package/mysql and https://github.com/felixge/node-mysql/
Good luck...

Return formatted value in MongoDB db.collection.find()

I have a MongoDB JavaScript function saved in db.system.js, and I want to use it to both query and produce output data.
I'm able to query the results of the function using the $where clause like so:
db.records.find(
  { $where: "formatEmail(this.email.toString(), 'format type') == 'xxx'" },
  { email: 1 }
)
but I'm unable to use this to return a formatted value for the projected "email" field, like so:
db.records.find({}, {"formatEmail(this.email.toString(), 'format type')": 1})
Is there any way to do this while preserving the ability to simply use a pre-built function?
UPDATE:
Thank you all for your prompt participation.
Let me explain why I need to do this in MongoDB, and why it's not a matter of client logic at the wrong layer. What I am really trying to do is use the function for a shard bucketing value. Email was just one example; in reality, what I have is a hash function that returns a mod value.
I'm aware of Mongo having the ability to shard based on a hashed value, but from what I gather, it produces a highly random value that can burden the re-balancing of shards with unnecessary load. So I want to control it like so: func(_id, mod), which would return a value from 0 to, say, 1000 (depending on the mod value).
I guess I would also like to use the output of the function in some sort of grouping scenario, and Map Reduce does come to mind. I was just hoping to avoid writing an overly complex M/R for something so simple... also, I don't really know how to do Map Reduce... lol.
So, I gather from your answers that there is no way to return a formatted value back from Mongo (without map/reduce), is that right?
I think you are mixing your "layers" of functionality here - the database stores and retrieves data, that's all. What you need to do is:
* get the data and store the cursor in a variable
* loop through the cursor, and for every record you go through
* format and output your record as you see fit
This is somewhat similar to what you have described in your question, but it's not part of MongoDB, and you have to provide the "formatEmail" function in your "application layer".
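A hedged sketch of that loop with the Node.js MongoDB driver; the connection details are assumptions, and formatEmail is assumed to be ported from db.system.js into the application:

const { MongoClient } = require('mongodb');

async function printFormattedEmails() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const records = client.db('mydb').collection('records');

  // store the cursor, then format each record in the application layer
  const cursor = records.find({}, { projection: { email: 1 } });
  for await (const doc of cursor) {
    console.log(formatEmail(doc.email.toString(), 'format type'));
  }
  await client.close();
}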
Hope it helps
As @alernerdev has already mentioned, this is generally not done at the database layer. However, sometimes storing a pre-formatted version in your database is the way to go. Here are some instances where you may wish to store extra data:
If you need to look up data in a particular format. For example, I have "username" and "usernameLowercase" fields in my primary user collection. The lowercased one is indexed and is the one I use for username lookups. The mixed-case one is used for displaying the username.
If you need to report a large amount of data in a particular format. 100,000 email addresses all formatted in a particular way? Probably best to just store them in that format in the db.
If your translation from one format to another is computationally expensive. Doubly so if you're processing lots of records.
In this case, if all you're doing is looking up or retrieving an email in a specific format, I'd recommend adding a field for it and then indexing it. That way you won't need to do actual document retrieval for the lookup or the display. Super fast. Disk storage space for something the size of an email address is super cheap!
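In shell terms, that could look something like this (field names are illustrative):

// store the formatted value alongside the raw one, and index it
db.records.createIndex({ emailFormatted: 1 });
db.records.update(
  { _id: someId },
  { $set: { email: rawEmail, emailFormatted: formatEmail(rawEmail, 'format type') } }
);

// lookups and projections can now use the indexed, pre-formatted field directly
db.records.find({ emailFormatted: 'xxx' }, { emailFormatted: 1 });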
