How to remove duplicates based on a condition in MongoDB?

{
"_id" : ObjectId("5d3acf79ea99ef80dca9bcca"),
"memberId" : "123",
"generatedId" : "00000d2f-9922-457a-be23-731f5fefeb14",
"memberType" : "premium"
},
{
"_id" : ObjectId("5e01554cea99eff7f98d7eed"),
"memberId" : "123",
"generatedId" : "34jkd2092sdlk02kl23kl2309k2309kr",
"memberType" : "premium"
}
I have 1 million docs in this format. How can I remove the docs that are duplicated on "memberId"?
Among the duplicates, I need to remove the docs whose "generatedId" value does not contain "-". In this example the bottom doc should be deleted, since its "generatedId" value does not contain "-".
Can someone share an idea of how to do this?

Well, there is a strategy, but it depends a lot on your data.
Take your docs and group them by memberId to count them (a count above 1 means duplicates). Then, from the duplicates, pick out the entries whose generatedId does not contain a hyphen "-". Docs that are duplicates and also have no "-" in their generatedId are the ones you can delete.
const result = await Collection.aggregate([
  {
    $project: {
      _id: 1, // keep the _id field where it is anyway
      doc: "$$ROOT", // store the entire document in the "doc" field
    },
  },
  {
    $group: {
      _id: "$doc.memberId", // group the documents by memberId
      count: { $sum: 1 }, // count the number of documents in this group
      // $first keeps the values of only one document per group for the later stages
      generatedId: { $first: "$doc.generatedId" },
      memberType: { $first: "$doc.memberType" },
    },
  },
  {
    $match: {
      count: { $gt: 1 }, // only keep what's duplicated, because it'll have a count greater than 1
      // match only the groups whose kept generatedId has no "-" in it
      generatedId: { $regex: /^[^-]*$/ },
    },
  },
]);
The result now contains the memberId duplicates that have no "-" in their generatedId. You can query for them to delete them.
Warning:
Depending on your data, some duplicated memberId may have no "-" in any of its generatedIds, so you might delete all of its docs.
Always take a backup before performing operations whose outcome is uncertain.
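To make the warning concrete, here is a minimal plain-JavaScript sketch of the selection logic, run on documents already loaded into memory. The helper name selectIdsToDelete and the sample _id values are made up for illustration, and loading 1 million docs this way may not be practical, so treat it as a way to reason about the rule rather than a finished solution.

```javascript
// Hypothetical helper: returns the _ids of duplicated docs whose generatedId has no "-".
function selectIdsToDelete(docs) {
  // Group the docs by memberId.
  const byMember = new Map();
  for (const doc of docs) {
    if (!byMember.has(doc.memberId)) byMember.set(doc.memberId, []);
    byMember.get(doc.memberId).push(doc);
  }

  const ids = [];
  for (const group of byMember.values()) {
    if (group.length < 2) continue; // not duplicated, nothing to remove
    const withoutHyphen = group.filter((d) => !d.generatedId.includes("-"));
    // Guard from the warning: if no doc in the group has a "-", deleting
    // would remove every doc for this memberId, so skip the group.
    if (withoutHyphen.length === group.length) continue;
    for (const d of withoutHyphen) ids.push(d._id);
  }
  return ids;
}

// Made-up sample data; the first two docs mirror the question.
const sample = [
  { _id: "a", memberId: "123", generatedId: "00000d2f-9922-457a-be23-731f5fefeb14" },
  { _id: "b", memberId: "123", generatedId: "34jkd2092sdlk02kl23kl2309k2309kr" },
  { _id: "c", memberId: "456", generatedId: "nohyphen1" },
  { _id: "d", memberId: "456", generatedId: "nohyphen2" },
  { _id: "e", memberId: "789", generatedId: "single" },
];

console.log(selectIdsToDelete(sample)); // [ 'b' ]
```

The returned ids could then be passed to something like deleteMany({ _id: { $in: ids } }). Note that a memberId with several hyphenated docs would still be duplicated afterwards, since this rule only removes the docs without "-".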

db.collection.aggregate([
  // first match all records having "-" in generatedId
  { "$match": { "generatedId": { "$regex": "[-]" } } },
  // then group them
  { "$group": { "_id": "$memberId" } }
])

Related

Mongodb Random Data from aggregate function with Bias

Does the MongoDB aggregate function support biases, that is, can it favor specified values? For example, I have an array of values for a variable
genre = [science, math, english]
and I want to get 5 random documents from the database where each document has one of the genres in the specified array. I want the selection to be biased rather than strict: if only 2 documents match the condition, I want the other 3 to be random, thus completing the 5 random documents that I need.
Here's what I've got so far, but it only gets random data without favoring any values
const book = await Book.aggregate([
{
$match: { genre: { $type: "string" } }
},
{
$sample: { size: 6 }
},
{
$set: {
genre: {
$cond: {
if: { $eq: [{ $type: "$genre" }, "string"] },
then: ["$genre"],
else: "$genre"
}
}
}
},
]);

How to find next N elements from a cursor with MongoDB, without _id and on a sorted cursor

Let's say I have three person documents in a MongoDB, inserted in a random order.
{
"firstName": "Hulda",
"lastName": "Lamb",
},
{
"firstName": "Austin",
"lastName": "Todd",
},
{
"firstName": "John",
"lastName": "Doe",
}
My goal is to obtain, let's say, the next person after Austin when the list is in alphabetical order. So I would like to get the person with firstName = Hulda.
We can assume that I know Austin's _id.
My first attempt was to rely on the fact that _id is incremental, but it won't work because the persons can be added in any order in the database. Hulda's _id field has a value less than Austin's. I cannot do something like {_id: {$gt: <Austin's _id here>}};
And I also need to limit the number of returned elements, so N is a dynamic value.
Here is the code I have now, but as I mentioned, the ID trick is not working.
let cursor: any = this.db.collection(collectionName).find({_id: {$gt: startId}});
cursor = cursor.sort({firstName: 1});
cursor = cursor.limit(limit);
return cursor.toArray();
Some clarifications:
startId is a valid, existing _id of an object
limit is a variable holding a positive integer value
sorting and limit work as expected; only the selection of the next elements is wrong, so the {_id: {$gt: startId}} filter messes up the selection.

Most of the Aggregation Framework's operations work in the context of a single document, and before MongoDB 5.0 (which introduced $setWindowFields) there was no mechanism like window functions in SQL. One way is to use $group to get an array which contains all your documents and then get Austin's index to be able to apply $slice:
db.collection.aggregate([
  {
    $sort: { firstName: 1 }
  },
  {
    $group: {
      _id: null,
      docs: { $push: "$$ROOT" }
    }
  },
  {
    $project: {
      nextNPeople: {
        $slice: [ "$docs", { $add: [ { $indexOfArray: [ "$docs.firstName", "Austin" ] }, 1 ] }, 1 ]
      }
    }
  },
  { $unwind: "$nextNPeople" },
  {
    $replaceRoot: {
      newRoot: "$nextNPeople"
    }
  }
])
Depending on your data size and MongoDB performance, the above solution may or may not be acceptable. It's up to you to decide whether to deploy such code to production, since the $group operation can be pretty heavy.
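To see what the sort, index lookup and slice steps amount to, here is a small plain-JavaScript sketch that does the same thing on an in-memory array, with the number of returned people as a parameter. The function name nextPeople and its parameters are hypothetical; this only illustrates the logic and does not replace the pipeline.

```javascript
// Hypothetical in-memory mirror of the $sort -> $indexOfArray -> $slice steps.
function nextPeople(people, firstName, n) {
  // Sort a copy by firstName using plain string comparison (like $sort: { firstName: 1 }).
  const sorted = [...people].sort((a, b) =>
    a.firstName < b.firstName ? -1 : a.firstName > b.firstName ? 1 : 0
  );
  // Find the position of the starting person (like $indexOfArray).
  const index = sorted.findIndex((p) => p.firstName === firstName);
  if (index === -1) return []; // unknown starting person: return nothing
  // Take the n elements that follow it (like $slice with a start position and a length).
  return sorted.slice(index + 1, index + 1 + n);
}

const people = [
  { firstName: "Hulda", lastName: "Lamb" },
  { firstName: "Austin", lastName: "Todd" },
  { firstName: "John", lastName: "Doe" },
];

console.log(nextPeople(people, "Austin", 1).map((p) => p.firstName)); // [ 'Hulda' ]
```

In the pipeline, the last argument of $slice plays the role of n here.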

How to count number of subdocuments with condition

I have a MongoDB collection with documents like the ones below. I want to count, cumulatively over all documents, how many subdocuments of the events field are not null.
{
name: "name1",
events: {
created: {
timestamp: 1512477520951
},
edited: {
timestamp: 1512638551022
},
deleted: null
}
}
{
name: "name2",
events: {
created: {
timestamp: 1512649915779
},
edited: null,
deleted: null
}
}
So the query on these two documents should return 3, because there are 3 events that are not null in the collection. I cannot change the format of the document to make the events field an array.

You want $objectToArray from MongoDB 3.4.7 or greater in order to do this as an aggregation statement:
db.collection.aggregate([
  { "$group": {
    "_id": null,
    "total": {
      "$sum": {
        "$size": {
          "$filter": {
            "input": {
              "$objectToArray": "$events"
            },
            "cond": { "$ne": [ "$$this.v", null ] }
          }
        }
      }
    }
  }}
])
The $objectToArray part is needed to look at the "events" object and translate each of its "key/value" pairs into array entries. That way you can apply the $filter operation to remove the null "values" (the "v" property) and then use $size to count the remaining list.
All of that is done under a $group pipeline stage using the $sum accumulator.
Or, if you don't have a supporting version, you need mapReduce and JavaScript execution to do the same "object to array" operation:
db.collection.mapReduce(
  function() {
    emit(null,
      Object.keys(this.events).filter(k => this.events[k] != null).length);
  },
  function(key,values) {
    return Array.sum(values);
  },
  { out: { inline: 1 } }
)
That uses the same basic process by obtaining the object keys as an array and rejecting those where the value is found to be null, then obtaining the length of the resulting array.
Because of the JavaScript evaluation, this is much slower than the aggregation framework counterpart. But it's really a question of what server version you have available to support what you need.
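For a quick check of the expected result, the same "object to array, filter, count, sum" steps can be sketched in plain JavaScript on the two sample documents. The function name countNonNullEvents is made up for illustration; it runs on in-memory objects, not on the server.

```javascript
// Hypothetical in-memory mirror of $objectToArray -> $filter -> $size -> $sum.
function countNonNullEvents(docs) {
  return docs.reduce((total, doc) => {
    // Turn the events object into { k, v } entries (like $objectToArray).
    const entries = Object.entries(doc.events).map(([k, v]) => ({ k, v }));
    // Keep only the entries whose value is not null (like $filter).
    const nonNull = entries.filter((entry) => entry.v !== null);
    // Add the number of remaining entries to the running total (like $size and $sum).
    return total + nonNull.length;
  }, 0);
}

// The two sample documents from the question.
const docs = [
  {
    name: "name1",
    events: {
      created: { timestamp: 1512477520951 },
      edited: { timestamp: 1512638551022 },
      deleted: null,
    },
  },
  {
    name: "name2",
    events: {
      created: { timestamp: 1512649915779 },
      edited: null,
      deleted: null,
    },
  },
];

console.log(countNonNullEvents(docs)); // 3
```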

Is it possible to find random documents in a collection, without repeated fields? (mongodb\node.js)

For example, I have a collection users with the following structure:
{
_id: 1,
name: "John",
from: "Amsterdam"
},
{
_id: 2,
name: "John",
from: "Boston"
},
{
_id: 3,
name: "Mia",
from: "Paris"
},
{
_id: 4,
name: "Kate",
from: "London"
},
{
_id: 5,
name: "Kate",
from: "Moscow"
}
How can I get 3 random documents in which names are not repeated?
Using the function getThreeNumbers(1, 5), I get an array with 3 non-repeating numbers and search by _id
var random_nums = getThreeNumbers(1, 5); // [2,3,1]
users.find({_id: {$in: random_nums}}, function (err, data) {...}); // [John, Mia, John]
But the result can contain two Johns or two Kates, which is unwanted behavior. How can I get three random documents ([John, Mia, Kate], not [John, Kate, Kate] or [John, Mia, John]) with 1 or at most 2 queries? Which Kate or John (the duplicated names) is returned should be random, but the name should not be repeated.

There you go - see the comments in the code for further explanation of what the stages do:
users.aggregate(
  [
    { // eliminate duplicates based on "name" field and keep track of first document of each group
      $group: {
        "_id": "$name",
        "doc": { $first: "$$ROOT" }
      }
    },
    {
      // restore the original document structure
      $replaceRoot: {
        newRoot: "$doc"
      }
    },
    {
      // select 3 random documents from the result
      $sample: {
        size: 3
      }
    }
  ])
As always with the aggregation framework, you can run the query with more or fewer stages in order to see the transformations step by step.
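If it helps to see the three stages outside of MongoDB, here is a rough plain-JavaScript sketch of the same idea on an in-memory array: keep the first document per name, then pick a random subset. The function name sampleUniqueNames is hypothetical, and the shuffle only stands in for $sample; it makes no claim about how the server samples.

```javascript
// Hypothetical in-memory mirror of $group/$first -> $replaceRoot -> $sample.
function sampleUniqueNames(users, size) {
  // Keep the first document seen for each name (like $group with $first).
  const firstByName = new Map();
  for (const user of users) {
    if (!firstByName.has(user.name)) firstByName.set(user.name, user);
  }
  // Back to a flat list of documents (like $replaceRoot).
  const unique = [...firstByName.values()];
  // Fisher-Yates shuffle, then take the first `size` documents (stand-in for $sample).
  for (let i = unique.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [unique[i], unique[j]] = [unique[j], unique[i]];
  }
  return unique.slice(0, size);
}

// The sample collection from the question.
const users = [
  { _id: 1, name: "John", from: "Amsterdam" },
  { _id: 2, name: "John", from: "Boston" },
  { _id: 3, name: "Mia", from: "Paris" },
  { _id: 4, name: "Kate", from: "London" },
  { _id: 5, name: "Kate", from: "Moscow" },
];

// Prints John, Mia and Kate in a random order, never the same name twice.
console.log(sampleUniqueNames(users, 3).map((u) => u.name));
```

Like the pipeline, this always keeps the first John and the first Kate it sees, so which John or Kate you get is not random.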

I think what you are looking for is the $group stage, which will give you the distinct values of a field in the collection. It can be used as:
db.users.aggregate( [ { $group : { _id : "$name" } } ] );
MongoDB docs: Retrieve Distinct Values

Integrate between two collections

I have two collections:
'DBVisit_DB':
"_id" : ObjectId("582bc54958f2245b05b455c6"),
"visitEnd" : NumberLong(1479252157766),
"visitStart" : NumberLong(1479249815749),
"fuseLocation" : {.... }
"userId" : "A926D9E4853196A98D1E4AC6006DAF00#1927cc81cfcf7a467e9d4f4ac7a1534b",
"modificationTimeInMillis" : NumberLong(1479263563107),
"objectId" : "C4B4CE9B-3AF1-42BC-891C-C8ABB0F8DC40",
"creationTime" : NumberLong(1479252167996),
"lastUserInteractionTime" : NumberLong(1479252167996)
}
'device_data':
"_id" : { "$binary" : "AN6GmE7Thi+Sd/dpLRjIilgsV/4AAAg=", "$type" : "00" },
"auditVersion" : "1.0",
"currentTime" : NumberLong(1479301118381),
"data" : {
"networkOperatorName" : "Cellcom",...
},
"timezone" : "Asia/Jerusalem",
"collectionAlias" : "DEVICE_DATA",
"shortDate" : 17121,
"userId" : "00DE86984ED3862F9277F7692D18C88A#1927cc81cfcf7a467e9d4f4ac7a1534b"
In DBVisit_DB I need to show all visits, only for Cellcom users, which took more than 1 hour (visitEnd - visitStart > 1 hour), by matching the userId value in both collections.
This is what I did so far:
//create an array that contains all the rows that "Cellcom" is their networkOperatorName
var users = db.device_data.find({ "data.networkOperatorName": "Cellcom" },{ userId: 1, _id: 0}).toArray();
//create an array that contains all the rows that the visit time is more then one hour
var time = db.DBVisit_DB.find( { $where: function() {
timePassed = new Date(this.visitEnd - this.visitStart).getHours();
return timePassed > 1}},
{ userId: 1, _id: 0, "visitEnd" : 1, "visitStart":1} ).toArray();
//merge between the two arrays
var result = [];
var i, j;
for (i = 0; i < time; i++) {
for (j = 0; j < users; j++) {
if (time[i].userId == users[j].userId) {
result.push(time[i]);
}
}
}
for (var i = 0; i < result.length; i++) {
print(result[i].userId);
}
But it doesn't show anything, although I know for sure that there are ids that can be found in both of the arrays I created.
*For verification: I'm not 100% sure that I calculated the visit time correctly.
By the way, I'm new to both JavaScript and MongoDB.
********update********
In "device_data" there are different rows with the same "userId" field.
In "device_data" I also have the "data.networkOperatorName" field, which contains different cellular companies.
I've been asked to show all "Cellcom" users that, based on the 'DBVisit_DB' collection, have been connected for more than an hour. That means,
based on the fields "visitEnd" and "visitStart", I need to know if ("visitEnd" - "visitStart" > 1 hour)
{ "userId" : "457A7A0097F83074DA5E05F7E05BEA1D#1927cc81cfcf7a467e9d4f4ac7a1534b" }
{ "userId" : "E0F5C56AC227972CFAFC9124E039F0DE#1927cc81cfcf7a467e9d4f4ac7a1534b" }
{ "userId" : "309FA12926EC3EB49EB9AE40B6078109#1927cc81cfcf7a467e9d4f4ac7a1534b" }
{ "userId" : "B10420C71798F1E8768ACCF3B5E378D0#1927cc81cfcf7a467e9d4f4ac7a1534b" }
{ "userId" : "EE5C11AD6BFBC9644AF3C742097C531C#1927cc81cfcf7a467e9d4f4ac7a1534b" }
{ "userId" : "20EA1468672EFA6793A02149623DA2C4#1927cc81cfcf7a467e9d4f4ac7a1534b" }
Each array has this format after my queries. I need to merge them into one, so that I'll have the intersection between them.
Thanks a lot for all the help!

With the aggregation framework, you can achieve the desired result with the $lookup operator, which lets you do a "left-join" operation on collections in the same database, and the $redact pipeline operator, which can accommodate arithmetic operators that manipulate the timestamps and convert them to minutes that you can query.
As a simple example of how useful these aggregate operators are, you can run the following pipeline on the DBVisit_DB collection to see the actual time difference in minutes:
db.getCollection('DBVisit_DB').aggregate([
  {
    "$project": {
      "visitStart": { "$add": [ "$visitStart", new Date(0) ] },
      "visitEnd": { "$add": [ "$visitEnd", new Date(0) ] },
      "timeDiffInMinutes": {
        "$divide": [
          { "$subtract": ["$visitEnd", "$visitStart"] },
          1000 * 60
        ]
      },
      "isMoreThanHour": {
        "$gt": [
          {
            "$divide": [
              { "$subtract": ["$visitEnd", "$visitStart"] },
              1000 * 60
            ]
          }, 60
        ]
      }
    }
  }
])
Sample Output
{
"_id" : ObjectId("582bc54958f2245b05b455c6"),
"visitEnd" : ISODate("2016-11-15T23:22:37.766Z"),
"visitStart" : ISODate("2016-11-15T22:43:35.749Z"),
"timeDiffInMinutes" : 39.0336166666667,
"isMoreThanHour" : false
}
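The numbers in the sample output can be checked by hand with the same arithmetic in plain JavaScript, using the visitStart and visitEnd values of the document posted in the question. This is only a sanity check of the formula, not part of the pipeline.

```javascript
// Timestamps (milliseconds since the epoch) from the sample DBVisit_DB document.
const visitStart = 1479249815749;
const visitEnd = 1479252157766;

// Same steps as the pipeline: $subtract, then $divide by the milliseconds in a minute.
const timeDiffInMinutes = (visitEnd - visitStart) / (1000 * 60);
// Same comparison as $gt against 60 minutes.
const isMoreThanHour = timeDiffInMinutes > 60;

console.log(timeDiffInMinutes.toFixed(4), isMoreThanHour); // 39.0336 false
```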
Now that you understand how the above operators work, you can apply them in the following example. The aggregate pipeline uses the device_data collection as the main collection, first filters the documents on the specified field using $match, and then joins to the DBVisit_DB collection using $lookup. $redact processes the logical condition of getting visits which are more than an hour long within $cond, and uses the special system variable $$KEEP to "keep" the document where the condition is true or $$PRUNE to "discard" the document where it is false.
The arithmetic operators $divide and $subtract let you calculate the difference between the two timestamp fields in minutes, and the $gt comparison operator then evaluates the condition:
db.device_data.aggregate([
  /* Filter input documents */
  { "$match": { "data.networkOperatorName": "Cellcom" } },
  /* Do a left-join to DBVisit_DB collection */
  {
    "$lookup": {
      "from": "DBVisit_DB",
      "localField": "userId",
      "foreignField": "userId",
      "as": "userVisits"
    }
  },
  /* Flatten resulting array */
  { "$unwind": "$userVisits" },
  /* Redact documents */
  {
    "$redact": {
      "$cond": [
        {
          "$gt": [
            {
              "$divide": [
                { "$subtract": [
                  "$userVisits.visitEnd",
                  "$userVisits.visitStart"
                ] },
                1000 * 60
              ]
            },
            60
          ]
        },
        "$$KEEP",
        "$$PRUNE"
      ]
    }
  }
])

There are a couple of things incorrect in your JavaScript.
In the for loops, replace the time and users conditions with time.length and users.length.
Your timePassed calculation should be
timePassed = this.visitEnd - this.visitStart
return timePassed > 3600000
You also have a couple of data-related issues.
The documents you posted in the question have no matching userId, and the difference between visitEnd and visitStart is less than an hour.
For a Mongo-based query you should check out the other answer.
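Putting the two fixes together, the merge step could look roughly like the sketch below. It works on plain arrays with numeric timestamps, and the function name longVisitsForUsers and the sample data are made up for illustration. It uses a Set instead of the nested loops, which is a swapped-in technique: it also avoids pushing the same visit several times when device_data has several rows with the same userId.

```javascript
const ONE_HOUR_MS = 60 * 60 * 1000; // 3600000 milliseconds

// Hypothetical helper: keep the visits longer than an hour whose userId is in the users list.
function longVisitsForUsers(visits, users) {
  // Collect the distinct userIds once, so repeated rows do not matter.
  const userIds = new Set(users.map((u) => u.userId));
  return visits.filter(
    (v) => v.visitEnd - v.visitStart > ONE_HOUR_MS && userIds.has(v.userId)
  );
}

// Made-up sample data in the shape of the two arrays from the question.
const users = [{ userId: "u1" }, { userId: "u1" }, { userId: "u2" }];
const visits = [
  { userId: "u1", visitStart: 0, visitEnd: 2 * ONE_HOUR_MS }, // 2 hours, known user: kept
  { userId: "u2", visitStart: 0, visitEnd: 1000 }, // 1 second: too short
  { userId: "u3", visitStart: 0, visitEnd: 2 * ONE_HOUR_MS }, // long, but not in users
];

for (const visit of longVisitsForUsers(visits, users)) {
  console.log(visit.userId); // prints "u1"
}
```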
