There are two arrays of objects, one from the database and one from a CSV file. I need to compare the objects in both arrays by their phones and emails properties and find the duplicates among them. Because of the odd database object structure, I have to do the comparison in JavaScript. I want to know: what is the best algorithm and the best way to compare them and find duplicates?
Let me explain with a simple calculation.
There are 5,000 contacts in my database, and a user may upload another 3,000 contacts from a CSV file. Every time, we need to find the duplicate contacts in the database; if found, they should be overwritten, and the rest should be inserted. If I compare the contacts row by row, the loop runs 5,000 database contacts x 3,000 CSV contacts = 15,000,000 traversals.
This is the scenario I currently face, and it makes the system hang. I need an efficient solution to this problem.
I am developing this in Node.js with RethinkDB.
The database object structure looks exactly like the example below, and the same emails and phones may appear as duplicate entries in other contacts as well.
[{
  id: 2349287349082734,
  name: "ABC",
  phones: [
    {
      id: 2234234,
      flag: true,
      value: 982389679823
    },
    {
      id: 65234234,
      flag: false,
      value: 2979023423
    }
  ],
  emails: [
    {
      id: 22346234,
      flag: true,
      value: "test@domain.com"
    },
    {
      id: 609834234,
      flag: false,
      value: "test2@domain.com"
    }
  ]
}]
Please review the fiddle code if you want: https://jsfiddle.net/dipakchavda2912/eua1truj/
I have already added indexing. The problem looks very easy and familiar at first sight, but when we talk about concurrency, it is really critical and CPU-intensive.
If I understand the question correctly, you can use the lodash method differenceWith:
const _ = require("lodash");

let csvContacts = []; // fill it with your values
let databaseContacts = []; // from your database

// Two contacts are treated as duplicates when they share at least one email value.
const diffArray = _.differenceWith(
  csvContacts,
  databaseContacts,
  (csvContact, dbContact) =>
    csvContact.emails.some(csvEmail =>
      dbContact.emails.some(dbEmail => dbEmail.value === csvEmail.value)
    )
); // the non-duplicated objects
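Note that differenceWith compares every CSV contact against every database contact, which is exactly the 5,000 x 3,000 pairing the question is trying to avoid. A minimal sketch of a hash-based alternative, assuming a duplicate is any contact sharing at least one email or phone value; the helper names are illustrative, not from the original post:

// Build a lookup set of every email and phone value in the database: O(n).
function buildValueSet(contacts) {
  const seen = new Set();
  for (const contact of contacts) {
    for (const email of contact.emails) seen.add("email:" + email.value);
    for (const phone of contact.phones) seen.add("phone:" + String(phone.value));
  }
  return seen;
}

// Split CSV contacts into duplicates and new entries in a single pass: O(m).
function splitDuplicates(csvContacts, dbValueSet) {
  const duplicates = [];
  const fresh = [];
  for (const contact of csvContacts) {
    const isDuplicate =
      contact.emails.some(e => dbValueSet.has("email:" + e.value)) ||
      contact.phones.some(p => dbValueSet.has("phone:" + String(p.value)));
    (isDuplicate ? duplicates : fresh).push(contact);
  }
  return { duplicates, fresh };
}

// Usage: roughly 5,000 + 3,000 iterations instead of 15,000,000 comparisons.
const dbValueSet = buildValueSet(databaseContacts);
const { duplicates, fresh } = splitDuplicates(csvContacts, dbValueSet);

The Set gives O(1) membership checks, so the total work is one pass over each list instead of a pass over every pair.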
I have two DBs for testing, and each contains thousands to hundreds of thousands of documents.
But they have the same schemas and CRUD operations.
Let's call them DB1 and DB2.
I am using Mongoose.
Suddenly DB1 became really slow during:
const eventQueryPipeline = [
  {
    $match: {
      $and: [{ userId: req.body.userId }, { serverId: req.body.serverId }],
    },
  },
  {
    $sort: {
      sort: -1,
    },
  },
];
const aggregation = db.collection
  .aggregate(eventQueryPipeline)
  .allowDiskUse(true);
aggregation.exect((err, result) => {
  res.json(result);
});
In DB2 the same exact query runs in milliseconds, up to a maximum of 10 seconds.
In DB1 the query never takes less than 40 seconds.
I do not understand why. What could I be missing?
I compared the documents and the indexes, and they are the same.
Deleting the collection and re-saving the documents brings the speed back to normal and acceptable levels, but why is this happening? Has someone had the same experience?
Short answer:
You should create the following index:
{ "userId": 1, "serverId": 1, "sort": 1 }
Longer answer:
Based on your code (I see that you have .allowDiskUse(true)), it looks like Mongo is trying to do an in-memory sort with "a lot" of data. By default, Mongo has a 100 MB memory limit for sort operations; allowDiskUse lets it write temporary files to disk when it hits that limit.
You can read more about it here: https://www.mongodb.com/docs/manual/reference/method/cursor.allowDiskUse/
In order to optimise the performance of your queries, you can use indexes.
A common rule to follow when planning indexes is ESR (Equality, Sort, Range). You can read more about it here: https://www.mongodb.com/docs/v4.2/tutorial/equality-sort-range-rule/
If we follow that rule while creating the compound index, we add the equality matches first, in your case "userId" and "serverId". After that comes the sort field, in your case "sort".
If you also needed to filter results by some range (e.g. some value greater than X, or a timestamp newer than yesterday), you would add that field after "sort".
That means your index should look like this:
schema.index({ userId: 1, serverId: 1, sort: 1 });
Additionally, you can probably remove allowDiskUse, and handle err inside the aggregation.exec callback (I'm assuming that aggregation.exect is a typo).
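A minimal sketch of that cleaned-up call, assuming a callback-style Mongoose version and that the compound index above is already in place (the model name Event is illustrative):

// With { userId: 1, serverId: 1, sort: 1 } in place, the sort can use the
// index, so allowDiskUse(true) is no longer needed.
Event.aggregate(eventQueryPipeline).exec((err, result) => {
  if (err) {
    // Handle the error instead of silently sending an undefined result.
    return res.status(500).json({ error: err.message });
  }
  res.json(result);
});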
I have a db object looking like this:
{
  user_name: 'string',
  skills: [
    { skill: 'skill1', lvl: 3 }
  ],
  wantsToLearn: [
    { skill: 'skill2' }
  ]
}
I want to make a query that finds all users with a wantsToLearn skill matching one of my input user's skills (regardless of lvl), AND vice versa. Basically, I want to be able to find all users where there is a match between a skill and something they want to learn.
I have looked at the MongoDB documentation and am still a bit clueless about the best way to do this. I am new to databases in general, except for some SQL.
Any pointers would be very appreciated!
If you want to find all users matching a given skill, all you have to do is:
db.getCollection('yourCollection').find({"wantsToLearn.skill": "skill2" })
That is the way you query subdocuments in MongoDB, even inside arrays.
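For the two-way requirement in the question, one possible shape is an $or over both directions; a minimal sketch, assuming the input user's skill names have already been collected into plain arrays (the variable names are illustrative):

// Skills the input user has, and skills they want to learn.
var mySkills = ['skill1'];  // e.g. inputUser.skills.map(s => s.skill)
var myWants = ['skill2'];   // e.g. inputUser.wantsToLearn.map(w => w.skill)

db.getCollection('yourCollection').find({
  $or: [
    // They want to learn a skill I have...
    { 'wantsToLearn.skill': { $in: mySkills } },
    // ...or they have a skill I want to learn.
    { 'skills.skill': { $in: myWants } }
  ]
});

If both directions must hold for the same user (a mutual match), replace $or with $and.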
I am trying to take a JSON list that is formatted as such (the real list has over 2,500 entries):
[
  ['fb.com', 'http://facebook.com/'],
  ['ggle.com', 'http://google.com/']
]
Each entry in the JSON list represents a ['request url', 'destination url'] pair. It is for a redirect audit tool built on Node.js.
The goal is to put those value pairs into a JavaScript object holding a key/array pair for each column, as such:
var importedUrls = {
  requestUrl : [
    'fb.com',
    'ggle.com'
  ],
  destinationUrl : [
    'https://www.facebook.com/',
    'http://www.google.com/'
  ]
}
Because of the sheer number of redirects, I would prefer a non-blocking solution if possible.
You first need to create your object:
var importedUrls = {
  requestUrl: [],
  destinationUrl: []
}
Now, let's say you have your data in an array called importedData, for lack of a better name. You can then iterate over that array and push each value into its proper new array:
importedData.forEach(function(urls){
  importedUrls.requestUrl.push(urls[0]);
  importedUrls.destinationUrl.push(urls[1]);
});
This will format your object the way you want, I hope.
I would propose that you take another approach.
Why not have an array of imported URL entries, each one with its corresponding keys?
You could have something like:
importedUrls = [
  {
    requestUrl: 'req',
    destinationUrl: 'dest'
  },
  {
    requestUrl: 'req2',
    destinationUrl: 'dest2'
  },
]
I'm sure you can figure out how to tweak the code I showed to fit this format if you want to; one way is sketched below. What you gain with this is a very clear separation of your URLs, and it makes the iterations a lot more intuitive.
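A minimal sketch of that tweak, assuming the same importedData array of [requestUrl, destinationUrl] pairs as above:

// Map each [requestUrl, destinationUrl] pair to one object per redirect.
var importedUrls = importedData.map(function (urls) {
  return {
    requestUrl: urls[0],
    destinationUrl: urls[1]
  };
});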
I've run into a bit of an issue with some data that I'm storing in my MongoDB (Note: I'm using mongoose as an ODM). I have two schemas:
mongoose.model('Buyer', {
  credit: Number,
})
and
mongoose.model('Item', {
  bid: Number,
  location: { type: [Number], index: '2d' }
})
Buyer/Item will have a parent/child association with a one-to-many relationship. I know that I can set up Items as embedded subdocuments of the Buyer document, or I can create two separate documents with ObjectId references to each other.
The problem I am facing is that I need to query Items whose bid is lower than the Buyer's credit, but also whose location is near a certain geo coordinate.
To satisfy the first criterion, it seems I should embed Items as subdocuments so that I can compare the two numbers. But in order to compare locations with a geoNear query, it seems better to keep the documents separate; otherwise I can't perform geoNear on each subdocument.
Is there any way that I can perform both tasks on this data? If so, how should I structure my data? If not, is there a way that I can perform one query and then a second query on the result from the first query?
Thanks for your help!
There is another option (besides embedding and normalizing) for storing hierarchies in MongoDB: storing them as tree structures. In this case you would store Buyers and Items in separate documents but in the same collection. Each Item document would need a field pointing to its Buyer (parent) document, and each Buyer document's parent field would be set to null. The docs I linked to explain several implementations you could choose from.
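A minimal sketch of the parent-reference flavour of that pattern; the collection and field names here are illustrative, not from the original post:

// Both document types live in one collection, distinguished by a type field.
db.entities.insertMany([
  { _id: 1, type: 'buyer', parent: null, credit: 500 },
  { _id: 2, type: 'item', parent: 1, bid: 300, location: [-73.97, 40.77] }
]);

// All Items belonging to a given Buyer:
db.entities.find({ type: 'item', parent: 1 });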
If your items are stored in two separate collections, then the best option will be to write your own function and run it with mongoose.connection.db.eval('some code...');. That way you can execute your advanced logic on the server side.
You can write something like this:
var allNearItems = db.Items.find(
  { location: {
      $near: {
        $geometry: {
          type: "Point",
          coordinates: [ <longitude>, <latitude> ]
        },
        $maxDistance: 100
      }
    }
  });
var res = [];
allNearItems.forEach(function(item){
  var buyer = db.Buyers.findOne({ id: item.buyerId });
  if (!buyer) return; // continue is not valid inside a forEach callback
  if (item.bid < buyer.credit) {
    res.push(item.id);
  }
});
return res;
After evaluation (place it in a mongoose.connection.db.eval("...") call) you will get the array of item ids.
Use it with caution. If your allNearItems result is too large, or you run this query very often, you may run into performance problems. The MongoDB team has actually deprecated direct JS code execution, but it is still available in the current stable release.
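Because eval is deprecated, a $lookup aggregation (MongoDB 3.2+, with $expr requiring 3.6+) can express the same join server-side without executing JavaScript. A minimal sketch, assuming a 2dsphere index on location and a buyerId field on Items as in the code above:

db.Items.aggregate([
  // $geoNear must be the first stage; it replaces the $near find above.
  { $geoNear: {
      near: { type: "Point", coordinates: [ <longitude>, <latitude> ] },
      distanceField: "dist",
      maxDistance: 100
  } },
  // Pull in the matching Buyer document for each Item.
  { $lookup: {
      from: "Buyers",
      localField: "buyerId",
      foreignField: "id",
      as: "buyer"
  } },
  { $unwind: "$buyer" },
  // Keep only Items whose bid is below the Buyer's credit.
  { $match: { $expr: { $lt: [ "$bid", "$buyer.credit" ] } } },
  { $project: { _id: 1 } }
]);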
Well, I am struggling with an aggregation problem. I thought the easiest way to solve it would be map-reduce, or making separate find queries and then looping through the results with the help of the async library.
The schema is here:
db.keyword
  keyword: String
  start: Date
  source: String (only one of 'google', 'yahoo', 'bing', 'duckduckgo')
  job: ref db.job
  results: [
    {
      title: String
      url: String
      position: Number
    }
  ]

db.job
  name: String
  keywords: [ String ]
  urls: [ String ]
  sources: [ String ('google', 'yahoo', 'bing', 'duckduckgo') ]
Now I need to get the data into this form:
data = {
  categories: [ 'keyword1', 'keyword2', 'keyword3' ],
  series: [
    {
      name: 'google',
      data: [33, 43, 22]
    },
    {
      name: 'yahoo',
      data: [12, 5, 3]
    }
  ]
}
The biggest problem is that the series[0].data array is built from a really difficult find: matching db.job.urls against db.keyword.results.url and then getting the position.
Is there any way to simplify the query? I have looked through many of the map-reduce examples, but I can't figure out which data to map and which to reduce.
It looks as though you are trying to combine data from two separate collections (keyword and job).
Map Reduce, as well as the new Aggregation Framework, can only operate on a single collection at a time.
Your best bet is probably to query each collection separately and programmatically combine the results, saving them in whichever form is best suited to your application.
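A minimal sketch of that programmatic combination, assuming the job document and its keyword documents have already been fetched, and that a null position means the URL was not found (all names here are illustrative):

// Build the chart data for one job by scanning its keyword documents.
function buildChartData(job, keywordDocs) {
  var data = { categories: job.keywords, series: [] };

  job.sources.forEach(function (source) {
    var positions = job.keywords.map(function (kw) {
      // Find the keyword document for this keyword/source pair.
      var doc = keywordDocs.find(function (d) {
        return d.keyword === kw && d.source === source;
      });
      if (!doc) return null;
      // Match the job's URLs against the stored results to get a position.
      var hit = doc.results.find(function (r) {
        return job.urls.indexOf(r.url) !== -1;
      });
      return hit ? hit.position : null;
    });
    data.series.push({ name: source, data: positions });
  });

  return data;
}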
If you would like to experiment with Map Reduce, here is a link to a blog post written by a user who used an incremental Map Reduce operation to combine values from two collections.
http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/
For more information on using Map Reduce with MongoDB, please see the Mongo Documentation:
http://www.mongodb.org/display/DOCS/MapReduce
(The section on incremental Map Reduce is here: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce)
There are some additional Map Reduce examples in the MongoDB Cookbook:
http://cookbook.mongodb.org/
For a step-by-step walkthrough of how a Map Reduce operation is run, please see the "Extras" section of the MongoDB Cookbook recipe "Finding Max And Min Values with Versioned Documents" http://cookbook.mongodb.org/patterns/finding_max_and_min/
Hopefully the above will give you some ideas for how to achieve your desired results. As I mentioned, I believe that the most straightforward solution is simply to combine the results programmatically. However, if you are successful writing a Map Reduce operation that does this, please post your solution, so that the Community may gain the benefit of your experience.