I'm currently designing a database for a fairly large Meteor app, and we're debating whether Meteor will perform better with more subscriptions to collections of tiny documents, or fewer subscriptions to collections of larger documents.
Some of these documents could in the end become quite large, such as listings of user favourites or preferences that would only be viewable to the individual user in a particular view.
In terms of numbers we're talking about 10 subscriptions, at least four of which would not be consistently subscribed to, each returning only a single larger document in particular views.
Versus 4 subscriptions to collections of possibly quite large documents (I realize that those individual views would probably render faster with the data already on the client).
Any insights or empirical data would be incredibly helpful.
Thanks.
This is probably not a complete answer. I'm currently in the process of solving this problem myself.
I have some experience with larger data sets. I subscribe to a single collection without any restrictions:
Meteor.publish('collectionName', function () {
  // no field or document restrictions: the whole collection is published
  return collectionName.find();
});
My collection contains 400 documents with a total size of approximately 600KB (after dumping the collection using mongodump). With about 100 users in the system (using it daily from different continents) we do have a few performance issues:
When I load the page displaying the 400 items, you can watch the list being populated.
It is also important what you do with the 400 documents. Each of my entries is a fairly complex DOM node with multiple helpers executed for each item.
Generating the DOM takes some time and causes a CPU spike on the client. A lot of data is also transferred to the client.
A few ideas to solve the problems:
Using pagination (e.g. package alethes:pages) to prevent intense client computation.
Use a separate publication/subscription for the list view that includes only the fields necessary to display the list items, to reduce the size of each document (see the sketch below this list).
Only publish/subscribe to documents that are viewable to the current user (e.g. checking for login status, access permissions, etc) to reduce the number of documents.
Cache the subscription: https://github.com/meteorhacks/subs-manager (haven't tried that yet)
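A rough sketch of the second idea, a slimmer publication for the list view only; the concrete field names are placeholders for whatever the list actually renders:
Meteor.publish('collectionName.listFields', function () {
  // send only the fields the list view needs
  return collectionName.find(
    {},
    { fields: { title: 1, updatedAt: 1 } }
  );
});

// Client: subscribe to the slim record set for the list; subscribe to the
// full document only when a single item is opened.
Meteor.subscribe('collectionName.listFields');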
Related
I have a chat app that is charging me a large number of reads for each page load, 1 for each message to show. I'm trying to figure out a way to reduce that number and optimize for cost, as refreshing the page a few times costs hundreds of reads.
The firestore pricing documentation says
For queries other than document reads, such as a request for a list of collection IDs, you are billed for one document read. If fetching the complete set of results requires more than one request (for example, if you are using pagination), you are billed once per request.
I considered that if I fetch an entire collection without a query, as shown here in the docs, a cost difference might be remotely possible. I'm sure that's probably wrong, but I can't find anything specifying which cases are the exceptions that only cost 1 read. It also crossed my mind to create an array to hold the most recent messages in the parent document of the collection, but the security rules for updating that array seem overly complex and not practical. I also read about using the Firebase cache, but that doesn't seem useful here.
Here is code to demonstrate how I'm currently loading messages. I'm using the react-firebase-hooks library to snapshot this data with useCollectionData:
const q = query(messagesRef, orderBy("createdAt", "desc"), limit(100))
const [messages] = useCollectionData(q)
In researching, I found this question where I'm pretty sure the accepted answer is wrong. It did make me question the rules. Are there any strategies to reduce the number of reads for this common use case?
Pagination still incurs charges on a per-document read, right?
Yes, it does, but only when you load more pages.
I'm not trying to load the entire collection, but rather wondering if loading the collection without a query has a different cost than with.
Loading a collection without a query that limits the results means that you're reading the entire collection. And yes, the cost will be much higher if you're not using a query. Remember that the cost of reading a collection/query in Firestore equals the number of documents that are actually returned. For example, if you have a collection of 1 million documents and your query returns 100, you'll pay for only 100 document reads.
I'm overall trying to figure out if there's a strategy that can improve the read cost of the example query I gave.
No. If you need to get the newest 100 messages, that's the best query you can have. The only change you can make to decrease the number of reads is to lower the value you pass to the limit() function. And that may well make sense, since a user might not be interested in reading 100 messages at once. Always try to display only the data that fits on a screen, and load any other data progressively.
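If it helps, here is a rough sketch of that progressive loading using the plain v9 SDK rather than the react-firebase-hooks wrapper; the page size of 25 and the loadOlder() name are assumptions, not part of the original code:
import { query, orderBy, limit, startAfter, getDocs } from "firebase/firestore";

let lastVisible = null;

async function loadOlder() {
  // always order the same way; only pull the next small page
  const constraints = [orderBy("createdAt", "desc")];
  if (lastVisible) constraints.push(startAfter(lastVisible));
  constraints.push(limit(25));

  const snap = await getDocs(query(messagesRef, ...constraints));
  lastVisible = snap.docs[snap.docs.length - 1] || lastVisible;

  // you are billed one read per document actually returned (at most 25 here)
  return snap.docs.map((d) => d.data());
}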
TL;DR:
I'm making an app for a canteen. I have a collection with the persons and a collection where I "log" every meal taken. I need to know those who DIDN'T take the meal.
Long version:
I'm making an application for my local Red Cross.
I'm trying to optimize this situation:
there is a canteen at which the people we help can take food at breakfast, lunch and supper. We need to know how many took the meal (and this is easy).
if they are present they HAVE TO take the meal and eat, so we need to know how many (and who) HAVEN'T eaten (this is the part that I need to optimize).
When they take the meal, the "cashier" enters their barcode and the program logs the "transaction" in the Log collection.
Currently, on creation of the "canteen" template I create a local collection "Meals" and populate it with the data of all the people in the DB (so ID, name, fasting/satiated); then I use this collection for my counters and to display who took the meal and who didn't.
(The variable "mealKind" is "breakfast", "lunch" or "dinner", depending on the current serving.)
Template.canteen.created = function () {
  Meals = new Mongo.Collection(null); // client-only, in-memory collection
  var today = new Date();
  today.setHours(0, 0, 1);

  var pers = Persons.find(
    { status: "present" },
    { fields: { Name: 1, Surname: 1, barcode: 1 } }
  ).fetch();

  pers.forEach(function (uno) {
    // has this person already been logged for the current meal today?
    var vediamo = Log.findOne({
      dest: uno.barcode,
      what: mealKind,
      when: { $gte: today }
    });
    uno.eat = vediamo ? "satiated" : "fasting";
    Meals.insert(uno);
  });
};
Template.canteen.destroyed = function () {
  Meals.remove({});
};
From the Meals collection I extract the two columns of people, satiated (with name, surname and barcode) and fasting, and I also use two helpers:
fasting: function () {
  return Meals.find({ eat: "fasting" });
},
countFasting: function () {
  return Meals.find({ eat: "fasting" }).count();
}
// same for satiated
This was OK, but now the number of people is really increasing (we are around 1000 and counting), the creation of the page is very, very slow, and it usually stops with errors, so I can read something like "100 fasting, 400 satiated" even though I have around 1000 persons in the DB.
I can't figure out how to optimize the workflow; every other method that I tried involved (in one way or another) more queries to the DB. I think that I missed the point and now I cannot see it.
I'm not sure about aggregation at this level and inside meteor, because of minimongo.
Although making this server side and not client side is clever, the problem here is HOW to discriminate "fasting" vs "satiated" without cycling through the whole Persons collection.
+1 if the solution is compatible with aldeed:tabular
EDIT
I am still not sure about what is causing your performance issue (too many things in client memory / minimongo, too many calls to it?), but you could at least try different approaches, more traditionally based on your server.
By the way, you did not mention how you display your data, nor how you end up with the incorrect count of already served / missing Persons.
If you are building a classic HTML table, please note that browsers struggle rendering more than a few hundred rows. If you are in that case, you could implement a client-side table pagination / infinite scrolling. Look for example at jQuery DataTables plugin (on which is based aldeed:tabular). Skip the step of building an actual HTML table, and fill it directly using $table.rows.add(myArrayOfData).draw() to avoid the browser limitation.
Original answer
I do not exactly understand why you need to duplicate your Persons collection into a client-side Meals local collection?
This requires first sending all Persons documents from server to client (which may not be a problem if your server is well connected / local; you may also still have the autopublish package on, in which case you have already been paying that penalty), and then cloning every document (checking your Log collection to retrieve any previous passages), effectively doubling your memory needs.
Is your server and/or remote DB that slow to justify your need to do everything locally (client side)?
Much more problematic: should you have more than one "cashier" / client browser open, their Meals local collections will not be synchronized.
If your server-client connection is good, there is no reason to do everything client side. Meteor will automatically cache just what is needed, and provide optimistic DB modification to keep your user experience fast (should you structure your code correctly).
With aldeed:tabular package, you can easily display your Persons big table by "pages".
You can also link it with your Log collection using dburles:collection-helpers (IIRC there is an example on the aldeed:tabular home page).
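As for the "HOW to discriminate without cycling the whole Persons collection" part, here is a minimal server-side sketch, reusing the field names from your question (barcode, dest, what, when); it uses two queries instead of one Log lookup per person:
Meteor.methods({
  'canteen.fasting': function (mealKind) {
    var today = new Date();
    today.setHours(0, 0, 1);

    // barcodes already logged for this meal today
    var served = Log.find(
      { what: mealKind, when: { $gte: today } },
      { fields: { dest: 1 } }
    ).map(function (entry) { return entry.dest; });

    // everyone present whose barcode is not in the served list is still fasting
    return Persons.find(
      { status: 'present', barcode: { $nin: served } },
      { fields: { Name: 1, Surname: 1, barcode: 1 } }
    ).fetch();
  }
});
The satiated list is the same query with $in instead of $nin, and the counters come from .count() on the same cursors, so nothing is looped over per person.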
I have a question about Collections - specifically, I want to have a large collection on a server, and load only small bits of it a piece at a time, in an unpredictable order, where I might stop wanting to have a local copy of any given piece at any time. Should I make a new subscription for each piece of data, and then stop it when I no longer want that piece of data? Or should I use some other method? Or should I just load large chunks of it that I won't use and leave them sitting around in my local copy of the collection?
Edit: Or should I have one subscription with a list of the IDs for each piece of data I want, and have the publication function specifically find each of those? Seems complicated, but it does mean I only have to deal with one subscription.
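For reference, a rough sketch of that ids-based idea (collection, publication and Session key names are placeholders):
Meteor.publish('pieces.byIds', function (ids) {
  check(ids, [String]); // from the standard check package
  return Pieces.find({ _id: { $in: ids } });
});

// Client: re-run the subscription whenever the wanted ids change; documents
// dropped from the list are removed from Minimongo automatically.
Tracker.autorun(function () {
  Meteor.subscribe('pieces.byIds', Session.get('wantedPieceIds') || []);
});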
Edit: Or maybe I should just skip using publications and subscriptions, and just use Methods to pass my data to the client? Loses a lot of functionality, and requires some extra work, but it does dodge most of the problems and should work just fine for my purposes.
Suppose a MongoDB collection "items" contains documents like:
{
  name: 'item1',
  type: 'basic',
  qty: 40
}
You define the collection on the Meteor server with:
Items= new Mongo.Collection('items')
1. These collections contain all the data from the MongoDB collections, and you can run Items.find({...}) on them, which will return a cursor (a set of records, with methods to iterate through them and return them).
Meteor.publish('itemOver30', function itemPublication() {
  return Items.find(
    { qty: { $gt: 30 } },
    { fields: { name: 1, qty: 1 } }
  );
});
This will return a cursor to all the records in the items collection with qty over 30 (a subset of the records, not the whole collection), publishing only the name and qty fields.
2. A cursor is used to publish (send) a set of records (called a "record set"). You can optionally publish only some fields from those records. It is record sets (not collections) that clients subscribe to.
Meteor.subscribe('itemOver30');
On the client, you have Minimongo collections that partially mirror some of the records from the server. "Partially" because they may contain only some of the fields, and "some of the records" because you usually want to send to the client only the records it needs, to speed up page load, and only those it needs and has permission to access.
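A minimal client-side sketch of using that record set (the template name is a placeholder; the same collection declaration is also needed on the client so Minimongo has something to mirror into):
// Client: Items = new Mongo.Collection('items'); must exist here too.
Meteor.subscribe('itemOver30');

Template.itemList.helpers({
  items: function () {
    // only the published subset (name and qty of items with qty over 30) is available here
    return Items.find({}, { sort: { qty: -1 } });
  }
});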
We are investigating using Breeze for field deployment of some tools. The scenario is this -- an auditor will visit sites in the field, where most of the time there will be no -- or very degraded -- internet access. Rather than replicate our SQL database on all the laptops and tablets (if that's even possible), we are hoping to use Breeze to cache the data and then store it locally so it is accessible when there is not a usable connection.
Unfortunately, Breeze seems to choke when caching any significant amount of data. Generally on Chrome it's somewhere between 8 and 13MB worth of entities (as measured by the HTTP response headers). This can change a bit depending on how many tabs I have open and such, but I have not been able to move that by more than 10%. The error I get is that the Chrome tab crashes and tells me to reload. The error is reproducible (I download the data in 100K chunks and it fails on the same read every time, and works fine if I stop it after the previous read). When I change the page size, it always fails within the same range.
Is this a limitation of Breeze, or Chrome? Or Windows? I tried it on Firefox, and it handles even less data before the whole browser crashes. IE fares a little better, but none of them do great.
Looking at performance in task manager, I get the following:
IE goes from 250M memory usage to 1.7G of memory usage during the caching process and caches a total of about 14MB before throwing an out-of-memory error.
Chrome goes from 206M memory usage to about 850M while caching a total of around 9MB
Firefox goes from around 400M to about 750M and manages to cache about 5MB before the whole program crashes.
I can calculate how much will be downloaded with any selection criteria, but I cannot find a way to calculate how much data can be handled by any specific browser instance. This makes using Breeze for offline auditing close to useless.
Has anyone else tackled this problem yet? What are the best approaches to handling something like this? I've thought of several things, but none of them are ideal. Any ideas would be appreciated.
ADDED At Steve Schmitt's request:
Here are some helpful links:
Metadata
Entity Diagram (pdf) (and html and edmx)
The first query, just to populate the tags on the page, runs quickly and downloads minimal data:
var query = breeze.EntityQuery
.from("Countries")
.orderBy("Name")
.expand("Regions.Districts.Seasons, Regions.Districts.Sites");
Once the user has selected the Sites s/he wishes to cache, the following two queries are kicked off (it used to be one query, but I broke it into two hoping it would be less of a burden on resources -- it didn't help). The first query (usually 2-3K entities and about 2MB) runs as expected. Some combination of the predicates listed is used to filter the data.
var qry = breeze.EntityQuery
.from("SeasonClients")
.expand("Client,Group.Site,Season,VSeasonClientCredit")
.orderBy("DistrictId,SeasonId,GroupId,ClientId")
var p = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p1 = breeze.Predicate("SeasonId", "==", SeasonId);
var p2 = breeze.Predicate("DistrictId", "==", DistrictId);
var p3 = breeze.Predicate("Group.Site.SiteId", "in", SiteIds);
After the first query runs, the second query (below) runs, also using some combination of the predicates listed to filter the data; at about 9MB, it will have about 50K rows to download. When the total download burden between the two queries is between 10MB and 13MB, browsers will crash.
var qry = breeze.EntityQuery
.from("Repayments")
.orderBy('SeasonId,ClientId,RepaymentDate');
var p1 = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p2 = breeze.Predicate("SeasonId", "==", SeasonId);
var p3 = breeze.Predicate("DistrictId", "==", DistrictId);
var p4 = breeze.Predicate("SiteId", "in", SiteIds);
Thanks for the interest, Steve. You should know that the Entity Relationships are inherited and currently in production supporting the majority of the organization's operations, so as few changes as possible to that would be best. Also, the hope is to grow this from a reporting application to one with which data entry can be done in the field (so, as I understand it, using projections to limit the data wouldn't work).
Thanks for the interest, and let me know if there is anything else you need.
Here are some suggestions based on my experience building an offline-capable web application using Breeze. Some or all of these might not make sense for your use cases...
Identify which entity types need to be editable vs which are used to fill drop-downs etc. Load non-editable data using the noTracking query option and cache it in localStorage yourself using JSON.stringify (see the sketch after this list). This avoids the overhead of coercing the data into entities, change tracking, etc. Good candidates for this approach in your model might be entity types like Country, Region, District, Site, etc.
If possible, provide a facility in your application for users to identify which records they want to "take offline". This way you don't need to load and cache everything, which can get quite expensive depending on the number of relationships, entities, properties, etc.
In conjunction with suggestion #2, avoid loading all the editable data at once and avoid using the same EntityManager instance to load each set of data. For example, if the Client entity is something that needs to be editable out in the field without a connection, create a new EntityManager, load a single client (expanding any children that also need to be editable) and cache this data separately from other clients.
Cache the Breeze metadata once. When calling exportEntities, the includeMetadata argument should be false. More info on this here.
To create new EntityManager instances make use of the createEmptyCopy method.
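A minimal sketch of suggestion #1 (the service name "breeze/Audit" and the storage key are assumptions):
var manager = new breeze.EntityManager('breeze/Audit');

var refQuery = breeze.EntityQuery
    .from('Countries')
    .noTracking(); // results come back as plain objects, not tracked entities

manager.executeQuery(refQuery).then(function (data) {
    // cache the plain results as JSON for offline use
    localStorage.setItem('ref.countries', JSON.stringify(data.results));
});

// Later, offline:
var countries = JSON.parse(localStorage.getItem('ref.countries') || '[]');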
EDIT:
I want to respond to this comment:
Say I have a client who has bills and payments. That client is in a group, in a site, in a region, in a country. Are you saying that the client, payment, and bill information might each have their own EM, while the location hierarchy might be in a 4th EM with no-tracking? Then when I refer to them, I wire up the relationships as needed using LINQs on the different EMs (give me all the bills for customer A, give me all the payments for customer A)?
It's a bit of a judgement call in terms of deciding how to separate things out. Some of what I'm suggesting might be overkill, it really depends on the amount of data and the way your application is used.
Assuming you don't need to edit groups, sites, regions and countries while offline, the first thing I'd do would be to load the list of groups using the noTracking option and cache them in localStorage for offline use. Then do the same for sites, regions and countries. Keep in mind, entities loaded with the noTracking option aren't cached in the entity manager so you'll need to grab the query result, JSON.stringify it and then call localStorage.setItem. The intent here is to make sure your application always has access to the list of groups, sites, regions, etc so that when you display a form to edit a client entity you'll have the data you need to populate the group, site, region and country select/combobox/dropdown.
Assuming the user has identified the subset of clients they want to work with while offline, I'd then load each of these clients one at a time (including their payment and bill information but not expanding their group, site, region, country) and cache each client+payments+bills set using entityManager.exportEntities. Reasoning here is it doesn't make sense to load several clients plus their payments and bills into the same EntityManager each time you want to edit a particular client. That could be a lot of unnecessary overhead, but again, this is a bit of a judgement call.
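Something like this hedged sketch is what I mean by caching each client separately (entity and navigation names such as "Clients", "Payments" and "Bills" are assumptions based on the discussion):
// One EntityManager per client taken offline; metadata is excluded here
// because it is cached once separately (see the exportEntities suggestion above).
function takeClientOffline(baseManager, clientId) {
    var em = baseManager.createEmptyCopy(); // same service and metadata, empty cache

    var query = breeze.EntityQuery
        .from('Clients')
        .where('ClientId', '==', clientId)
        .expand('Payments, Bills');

    return em.executeQuery(query).then(function () {
        var bundle = em.exportEntities(null, { includeMetadata: false });
        localStorage.setItem('client.' + clientId, bundle);
    });
}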
@Jeremy's answer was excellent and very helpful, but it didn't actually answer the question, which I was starting to think was unanswerable, or at least the wrong question. However, @Steve in the comments gave me the most appropriate information for this question.
It is neither Breeze nor the browser, but rather Knockout. Apparently the Knockout wrapper around the Breeze entities uses all that memory (at least while loading the entities, and in my environment). As described above, Knockout/Breeze would crap out after reading around 5MB of data, causing Chrome to crash with over 1.7GB of memory usage (from a pre-download memory usage of around 300MB). Rewriting the app in AngularJS eliminated the problem. So far I have been able to download over 50MB from the exact same EF6 model into Breeze/Angular, and total Chrome memory usage never went above 625MB.
I will be testing larger payloads, but 50 MB more than satisfies my needs for the moment. Thanks everyone for your help.
I have a huge master CouchDB database and a read-only slave CouchDB database that synchronizes with the master.
Because the rate of change is high and the channel between the servers is slow and unstable, I want to set an order/priority to define which documents come first. I need to ensure that the documents with the highest priority are definitely at the latest version, while the documents at the end of the list can be ignored.
SORTING, not FILTERING
If it is not possible, what could the solution be?
Resources I have already looked at:
http://wiki.apache.org/couchdb/Replication
http://couchapp.org/page/index
UPDATE: the master database is actually the Node.js NPM registry, and the order is the list of Most Depended-upon Packages. I am trying to make a proxy, because cloning 50GB always fails after a while. But the fact is, "we don't need 90% of those modules, but quick & reliable access to those we depend on."
The short answer is no.
The long answer is that CouchDB provides ACID guarantees at the individual document level only, by design. The replicator will update each document atomically when it replicates (as can anyone; the replicator is just using the public API), but it does not guarantee ordering. This is mostly because it uses multiple HTTP connections to improve throughput. You can configure that down to 1 if you like, and you'll get better ordering, but it's not a panacea.
After the BigCouch merge, all bets are off: there will be multiple sources and multiple targets with no imposed total order.
CouchDB, out of the box, does not provide you with any options to control the order of replication. I'm guessing you could piece something together if you keep documents with different priorities in different databases on the master, though. Then, you could replicate the high-priority master database into the slave database first, replicate lower-priority databases after that, etc.
You could set up filtered replication or named document replication:
http://wiki.apache.org/couchdb/Replication#Filtered_Replication
http://wiki.apache.org/couchdb/Replication#Named_Document_Replication
Both of these are alternatives to replicating an entire database. You could do the replication in smaller batch sizes, and order the batches to match your priorities.
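For example, a rough sketch of that batched, priority-ordered replication via the standard _replicate endpoint (hosts, database names, ids, and the filter name are placeholders):
const SOURCE = 'http://master:5984/registry';
const TARGET = 'http://slave:5984/registry';

const batches = [
  // highest priority first: named document replication of the packages you depend on
  { source: SOURCE, target: TARGET, doc_ids: ['lodash', 'express', 'request'] },
  // then a filtered batch using a filter function defined on the source database
  { source: SOURCE, target: TARGET, filter: 'priority/medium' }
];

async function replicateInOrder() {
  for (const body of batches) {
    await fetch('http://slave:5984/_replicate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body)
    });
  }
}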