Fairly new to node and mongo. I'm a developer from a relational db background.
I have been asked to write a report to calculate the conversion rate from leads relating to vehicle workshop bookings to invoices. A conversion is where an invoice was produced within 60 days of a lead being generated.
So I have managed with mongodb, mongoose and nodejs to import all of the data from flat files into two collections, leads and invoices. I have 1M leads and about 30M invoices over a 5 year period and the rates are to be produced on a month by month basis. All data has vehicle reg in common.
So my problem is how do I join the data together with mongoose and nodejs?
So far I have attempted, for each single lead, to find any invoices within a 60 day period in order for the lead to qualify as a conversion. This works, but my script stops after about 20 or so successful updates. At this point I think my script, which makes an individual invoice query per lead, is too heavy a load on MongoDB, and I can see that making millions of individual queries is too much for it.
After hours of browsing, I'm not sure what I should be looking for!?
Any help would be greatly appreciated.
Your attempt should be working without a problem. What helps me, though, with large MongoDB instances and analysis on them: run the queries directly in Mongo, not through Node. That way you avoid having to convert Mongo structures (e.g. iterators) into Node structures (e.g. arrays) and generally lose a lot of overhead.
Also, make sure you have the correct indexes set up. That can make a HUGE difference in performance on big databases.
What I would do then is something like (this should be considered pseudo code):
let converted = 0;
db.leads.find({}, { id: 1, date: 1 }).forEach(lead => {
  // 60 days after the lead date
  const cutoff = new Date(lead.date.getTime() + 60 * 24 * 60 * 60 * 1000);
  // Count invoices for this lead that fall inside the 60 day window
  const invoiceCount = db.invoices.count({ leadId: lead.id, date: { $gte: lead.date, $lte: cutoff } });
  if (invoiceCount > 0) converted++;
});
To speed things up, I'd use the following index for this case:
db.invoices.createIndex({leadId: 1, date: -1});
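If you would rather push the join to the server, another option is the aggregation pipeline's $lookup stage (MongoDB 3.2+). The sketch below is untested against your schema and assumes both collections carry the vehicle registration in a field named reg and the relevant date in a field named date:
// Sketch only -- reg and date are assumed field names, adjust to your schema.
db.leads.aggregate([
  // Pull in all invoices for the same vehicle registration
  { $lookup: { from: "invoices", localField: "reg", foreignField: "reg", as: "invs" } },
  // A lead converts if at least one joined invoice falls within 60 days of the lead date
  { $project: {
      month: { $dateToString: { format: "%Y-%m", date: "$date" } },
      converted: { $anyElementTrue: { $map: { input: "$invs", as: "inv", in: {
        $and: [
          { $gte: ["$$inv.date", "$date"] },
          { $lte: ["$$inv.date", { $add: ["$date", 1000 * 60 * 60 * 24 * 60] }] }
        ]
      } } } }
  } },
  // Month-by-month conversion counts
  { $group: { _id: "$month", leads: { $sum: 1 },
              conversions: { $sum: { $cond: ["$converted", 1, 0] } } } }
]);
With 30M invoices, an index on the invoices reg field is essential here as well, otherwise the $lookup will be extremely slow.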
Currently developing some engineering software using PHP and MongoDB for storing the data.
I plan on performing many calculations on a collection. Essentially, it contains data, and I want to perform a calculation on a field, update it, calculate on the next field, and so on.
However, my developer has hit a snag.
He was doing what I thought would be a simple operation.
> Upload a CSV into a collection.
> Create a secondary collection by transforming all of the values of the first collection according to a user-input value plugged into a formula.
Similar to Excel's "Copy Value" followed by Paste Special: Multiply. Essentially, create a new collection as the product of the first collection.
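In mongo shell terms, what he is attempting amounts to roughly this (a sketch for illustration only, assuming a numeric field named value and a user-entered factor; the collection names source and derived are made up, this is not our actual code):
// Multiply every "value" by a user-supplied factor and write the results into a second collection
var factor = 2.5; // hypothetical user input
db.source.aggregate([
  { $project: { value: { $multiply: ["$value", factor] } } },
  { $out: "derived" } // writes the results as a new collection
]);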
The developer reported back that this slowed his PC down to a crawl.
This concerns me: my more advanced application has no hope of getting off the ground if Mongo is this slow at carrying out such a simple (to me) task.
Is there a proper way to go about performing thousands of calculations on a NoSQL collection? Are databases not meant for this sort of workload? Would I then have to pull the data out into an array, perform the calculations, and insert the new values once the simulation is done?
I have read that Java has better performance than PHP; should I direct the code toward Java for engineering applications?
There are a few questions that need to be checked before coming to any conclusion:
1) What is the OS? On Windows, check Task Manager for the mongod process details while you are running the queries; on Linux, use the top command to check the process details.
2) What is the volume of data, i.e. the size in megabytes/bytes?
3) If you have 2 GB of RAM, you should consider splitting the data into volumes of less than 1 GB and processing them in turn.
4) If the data volume fits the RAM size, then the bottleneck is likely disk speed/RPM.
5) If the data processing is only running against localhost, you can try processing the data on another laptop with a higher configuration to compare the outcome.
6) What version of MongoDB are you using? Try to upgrade to the latest.
7) You can consider a free MongoDB Atlas cluster (https://cloud.mongodb.com/user#/atlas/login) if the volume of data is not very huge.
8) You can also create a cluster in the AWS free tier for a few hours to see the outcome.
I hope you won't shy away from trying. Last but not least, it all comes down to your requirement.
TL;DR:
I'm making an app for a canteen. I have a collection with the persons and a collection where I "log" every meal taken. I need to know who DIDN'T take the meal.
Long version:
I'm making an application for my local Red Cross.
I'm trying to optimize this situation:
there is a canteen at which the people we help can take food at breakfast, lunch and supper. We need to know how many took the meal (and this is easy).
if they are present they HAVE TO take the meal and eat, so we need to know how many (and who) HAVEN'T eaten (this is the part that I need to optimize).
When they take the meal, the "cashier" enters their barcode and the program logs the "transaction" in the Log collection.
Currently, on creation of the "canteen" template I create a local collection "Meals" and populate it with the data of all the people in the DB (ID, name, fasting/satiated), then I use this collection for my counters and to display who took the meal and who didn't.
(The variable "mealKind" is "breakfast" OR "lunch" OR "dinner", depending on the meal currently being served.)
Template.canteen.created = function(){
  Meals = new Mongo.Collection(null);  // client-only local collection
  var today = new Date();
  today.setHours(0,0,1);
  var pers = Persons.find({"status":"present"},{fields:{"Name":1,"Surname":1,"barcode":1}}).fetch();
  pers.forEach(function(uno){
    // One Log lookup per person: has this barcode already been served this meal today?
    var vediamo = Log.findOne({"dest":uno.barcode,"what":mealKind,"when":{"$gte": today}});
    if(typeof vediamo == "object"){
      uno['eat'] = "satiated";
    }else{
      uno['eat'] = "fasting";
    }
    Meals.insert(uno);
  });
};
Template.canteen.destroyed = function(){
  Meals.remove({});
};
From the Meals collection I extract the two columns of people, satiated (with name, surname and barcode) and fasting, and I also use two helpers:
Template.canteen.helpers({
  fasting: function(){
    return Meals.find({"eat":"fasting"});
  },
  countFasting: function(){
    return Meals.find({"eat":"fasting"}).count();
  }
  // same for satiated
});
This was OK, but now the number of people is really increasing (we are around 1000 and counting) and the creation of the page is very, very slow; it usually stops with errors, so I can read that "100 fasting, 400 satiated" even though I have around 1000 persons in the DB.
I can't figure out how to optimize the workflow; every other method I tried involved (in one way or another) more queries to the DB. I think I missed the point and now I cannot see it.
I'm not sure aggregation is an option at this level inside Meteor, because of minimongo.
Although making this server side rather than client side would be clever, the problem here is HOW to discriminate "fasting" vs "satiated" without cycling through the whole Persons collection.
+1 if the solution is compatible with aldeed:tabular
EDIT
I am still not sure what is causing your performance issue (too many things in client memory / minimongo? too many calls to it?), but you could at least try different approaches, more traditionally based on your server.
By the way, you did not mention how you display your data, nor how you end up with the incorrect reading for the number of already served / missing Persons.
If you are building a classic HTML table, please note that browsers struggle to render more than a few hundred rows. If that is your case, you could implement client-side table pagination / infinite scrolling. Look for example at the jQuery DataTables plugin (on which aldeed:tabular is based). You can even skip the step of building an actual HTML table and fill it directly using $table.rows.add(myArrayOfData).draw() to avoid the browser limitation.
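As an illustration only (assuming a <table id="meals"> element and the field names from your question), that route could look like:
var table = $('#meals').DataTable({
  columns: [{ data: 'Name' }, { data: 'Surname' }, { data: 'barcode' }, { data: 'eat' }]
});
// Feed plain objects straight into the table instead of rendering 1000 rows with Blaze
table.rows.add(Meals.find().fetch()).draw();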
Original answer
I do not exactly understand why you need to duplicate your Persons collection into a client-side Meals local collection?
This requires first sending all Persons documents from the server to the client (this may not be problematic if your server is local / well connected, and if you still have the autopublish package on you have already been paying that penalty), and then cloning every document (checking your Log collection to retrieve any previous passage), effectively doubling your memory needs.
Is your server and/or remote DB that slow to justify your need to do everything locally (client side)?
It could be much more problematic: should you have more than one "cashier" / client browser open, their Meals local collections will not be synchronized.
If your server-client connection is good, there is no reason to do everything client side. Meteor will automatically cache just what is needed, and provide optimistic DB modification to keep your user experience fast (should you structure your code correctly).
With aldeed:tabular package, you can easily display your Persons big table by "pages".
You can also link it with your Log collection using the dburles:collection-helpers package (IIRC there is an example on the aldeed:tabular home page).
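For instance, a minimal server-side sketch (assuming the field names from your question: barcode on Persons, and dest / what / when on Log; the method name fastingPersons is just for illustration) could build the fasting list with a single $nin query instead of one Log lookup per person:
// Server side -- a sketch, not a drop-in replacement for your template code
Meteor.methods({
  fastingPersons: function (mealKind) {
    var today = new Date();
    today.setHours(0, 0, 1);
    // Barcodes of everyone already served this meal today
    var served = Log.find(
      { what: mealKind, when: { $gte: today } },
      { fields: { dest: 1 } }
    ).map(function (l) { return l.dest; });
    // Present persons whose barcode is NOT in the served list
    return Persons.find(
      { status: 'present', barcode: { $nin: served } },
      { fields: { Name: 1, Surname: 1, barcode: 1 } }
    ).fetch();
  }
});
The satiated list is the same query with $in instead of $nin, and the counters come for free from .count() on those cursors.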
I am currently challenged with archiving data on a MongoDB server that has been running for over a year, accumulating nearly 100GB of data (many collections having 10+ million documents). Much to my disappointment, the data model was designed in a way very similar to what you would expect to see in a relational database; a collection for each Model and foreign keys associating records to each other. For example:
// Collection: conversations (~1M documents)
{
_id: ObjectId(),
last_read_at: new Date,
}
// Collection: messages (~100M documents, 0-200k per conversation)
{
_id: ObjectId(),
conversation: ObjectId(),
}
// Collection: likes (~50M documents, 0-1k per message)
{
_id: ObjectId(),
message: ObjectId(),
}
If I were faced with a traditional RDBMS I could quite easily use JOIN to find all of the relevant conversations, messages and likes, archive them, then DELETE them. Unfortunately I'm not so lucky.
My current approach is this:
Query for all conversations that haven't been read in n days and pipe the results to the Archive Strategy
Archive Strategy - in parallel:
Archive the documents to a file using pipe() from the Cursor
Document by document, query the other collections that reference the document, and pipe the results into a new Archive Strategy stream
What we end up with is a dependency tree where the root is the collection we actually want to archive, and each level in the tree is a stream for archiving its dependencies. Given our dataset and model complexity this has turned out to be unbearably slow. I guess I have two questions:
Can I avoid the n+1 queries somehow, or is this simply a constraint of MongoDB since I can't join? My next idea here is to batch the dependency queries in groups using $in
Are streams the most effective way to handle a workload like this?
I successfully re-architected the program producing speed gains of over 100X by making the following changes:
Rather than producing a tree of streams, I broke up each branch into individual jobs which I placed into a queue. This necessitated saving intermediate data (_ids and other keys needed to crawl) temporarily to disk, but that turned out to have no impact on speed since each branch was bottlenecked elsewhere. This allowed each branch of the crawl to run at its own pace without slowing down faster branches. The tree design ended up having way too much contention to work.
I consumed the above-mentioned queue in n worker processes using the cluster module, where n was the number of CPU cores I had.
When consuming the _ids from disk, I batched them into groups of 1000 to minimize the number of queries I had to perform.
Using the above methods I was able to archive over 100M documents in about three hours with no noticeable impact on user experience. The program is now CPU bound but runs fast enough that I won't be digging deeper.
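A simplified sketch of that $in batching idea (hypothetical function name; the messages / conversation names come from the schema above; db is a connected Db from the official Node.js driver and outStream is any writable stream):
async function archiveChildren(db, parentIds, outStream) {
  const BATCH = 1000;
  for (let i = 0; i < parentIds.length; i += BATCH) {
    const batch = parentIds.slice(i, i + BATCH);
    // One query per 1000 parents instead of one query per parent
    const cursor = db.collection('messages').find({ conversation: { $in: batch } });
    for await (const doc of cursor) {
      outStream.write(JSON.stringify(doc) + '\n'); // newline-delimited JSON archive
    }
  }
}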
I am working with a database that was handed down to me. It has approximately 25 tables and a very buggy query system that hasn't worked correctly for a while. I figured that instead of trying to bug-test the existing code, I'd just start over from scratch. Before I get into it, I want to say that I'm not asking anyone to build the code for me. I'm not that lazy; all I want to know is what would be the best way to lay out the code. The existing query uses JOIN to combine the results of all the tables in one variable and spits it into the query. I have been told in other questions displaying this code that it's just too much, with far too many bugs to try to single out what is causing the break.
What would be the most efficient way to query these tables that reference each other?
Example: a person chooses car year, make and model. PHP then gathers that information and queries the SQL database to find which parts have a matching year, vehicle ID and compatibility. It then uses those results to pull parts that have matching car model IDs OR vehicle IDs (because the database was built very sloppily), comparing all the different tables to produce: parts, descriptions, prices, part number, SKU number, any retailer notes, wheelbase, drive-train compatibility, etc.
I've been working on this for two weeks, and I'm approaching my deadline with little to no progress. I'm about to scrap their database, do data entry for a week, and rebuild their mess if that would be easier, but if I can use the existing pile of crap they've given me and save some time, I would prefer it.
Would it be easier to do a couple of queries and compare the results, then use those results to query for more results, step by step like that, or is one huge query comparing everything at once more efficient?
Should I use JOIN and pull all the tables at once and compare, or pass the input into individual variables and hand the work off to JavaScript on the client side to save server load? Would it be simpler to break the code up so I can identify the breaking points, or would using one long query reduce query time and server load? This is a very complex question, but I just want to make sure there aren't too many responses asking for clarification on trivial areas. I'm mainly seeking the best advice possible on how to handle this complicated situation.
Rebuild the database, then write a PHP import to bring over the data.
My application has 3 main models: companies, posts, and postdata. I'm providing an in depth analytics dashboard and am having trouble figuring out what is the best method to structure the models in mongodb for the best performance.
The postdata contains the fields: date, number of posts (for that date), average post length (for that date), and company id
The post contains the fields: date, post text, post length
In the dashboard view I want to display two graphs and two pieces of data.
graphs: one of the number of posts by date, and the other the average post length by date.
data: total number of posts for a date range, average post length for a date range
Currently, in the views I loop through the postdata collection to produce a total post count for the date range and an average post length for the date range. I know I probably shouldn't be doing that much work in the views, but how else can I get the data I'm looking for? Should I get rid of the postdata collection and just use underscore's countBy to create the data for the charts? What will give me the best performance / what is the preferred method?
I would take a look at Marionette; it adds a few nice features to Backbone, one of which is collection views. This is a nice way to separate out the views for your graphs.
If you think you'd like the functionality of your dashboard to grow more and more complex, then I would ditch the postdata model and do the analysis on the client side; you can use libraries like d3, crossfilter, and rickshaw. This will give you a lot of flexibility to quickly add features. The advantages of keeping the postdata model would be simplicity on the front end and the best front-end performance.
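For instance, dropping postdata and building the two series on the client could be as simple as this (a sketch, assuming each post object exposes date and post_length fields; underscore is already available as a Backbone dependency):
// Posts per date, and average post length per date, computed client side
var postsByDate = _.countBy(posts, 'date');
var avgLengthByDate = _.mapObject(_.groupBy(posts, 'date'), function (group) {
  return _.reduce(group, function (sum, p) { return sum + p.post_length; }, 0) / group.length;
});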
I think looping through the collection shouldn't have that much impact on performance.
You could, though, try something like this: in addition to the Backbone Collection, create a plain object with dates as keys and data objects as values, used for calculations only. To query it you'll have to build an array of dates covering the requested range. It may work pretty fast if the range is relatively small, but as the range increases, performance will approach that of just looping through the whole thing. I admit it sounds a little crazy even to me; some experimenting will definitely be needed. A rough sketch of the idea follows.
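A rough sketch of that date-keyed lookup object (the attribute names num_posts and avg_length are hypothetical, substitute your own):
// Build a plain lookup object once, then read ranges from it without touching the collection
var byDate = {};
postdata.each(function (model) {
  byDate[model.get('date')] = {
    posts: model.get('num_posts'),
    avgLength: model.get('avg_length')
  };
});
// e.g. byDate['2014-03-01'] gives the data object for that day, if any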