I am currently developing some engineering software using PHP, with MongoDB for storing the data.
I plan to run many calculations over a collection: essentially, it contains data, and I want to perform a calculation on a field, update it, move on to the next field, and so on.
However, my developer has hit a snag.
He was doing what I thought would be a simple operation.
> Upload a CSV into a collection.
> Create a secondary collection by transforming all of the values
> of the first collection according to a value the user enters into a formula.
> Similar to Excel's "Copy Value" then Paste Special Multiply.
> Essentially, create a new collection as the product of the first
> collection.
The developer reported back that this slowed his PC down to a crawl.
This concerns me: if MongoDB is slow to carry out this (to me) simple task, my more advanced application has no hope of getting off the ground.
Is there a proper way to go about performing thousands of calculations on a NoSQL collection? Are databases not meant for this sort of workload? Would I then have to pull the data out into an array, perform the calculations, then insert the new values after the simulation is done?
I have read that Java has better performance than PHP; should I direct the code toward Java for engineering applications?
There are a few questions that need to be checked before coming to any conclusion:
1) What is the OS? On Windows, check Task Manager for the mongod process details while you are running the queries; on Linux, use the top command to check the process details.
2) What is the volume of data, i.e. its size in megabytes/bytes?
3) If you have 2 GB of RAM, you should consider splitting the data into volumes of less than 1 GB and processing them separately.
4) If the data volume fits the RAM size, then look at the disk speed/RPM.
5) If the data processing is only on localhost, you can process the data on another laptop with a higher configuration to see the outcome.
6) What version of MongoDB are you using? Try to upgrade to the latest.
7) You can consider the free MongoDB Atlas cluster (https://cloud.mongodb.com/user#/atlas/login) if the volume of data is not very large.
8) You can also create a cluster in the AWS free tier for a few hours to see the outcome.
I hope you won't shy away from trying. Last but not least, it comes down to your requirements.
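Separately from the checklist above, it may be worth letting MongoDB do the arithmetic itself instead of round-tripping every document through PHP. The following is only a sketch in the mongo shell; the collection name sourceData, the field name value and the multiplier are assumptions, and it relies on the aggregation $out stage (MongoDB 2.6+):

// Multiply every document's "value" by a user-supplied factor and
// materialize the result as a new collection, entirely server-side.
var factor = 2.5;  // the user-supplied value (assumed)
db.sourceData.aggregate([
  // keeps _id plus the transformed field; add other fields to $project if you need them
  { $project: { value: { $multiply: [ "$value", factor ] } } },
  { $out: "transformedData" }   // writes the result set to a new collection
]);

If the formula genuinely has to run in PHP, the same idea still applies: read in batches and write back with bulk operations rather than issuing one update per document.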
Related
I wrote a script that generates 5,443,200 data records in batches of 544,320, which are inserted into a MySQL DB using the mysqljs/mysql module. When I try to bump up the volume of data records, I get a Node heap out-of-memory error. I tried to refactor this using a generator and iterator, but doing so seemed to introduce a lot of buggy behavior and ultimately crashed my MySQL server. I had also thrown in the cluster Node module to see if that would help, but that caused my MySQL server to refuse connections and sometimes crashed my computer altogether.
My question is: how would I be able to scale the script such that I can generate 30 times the 5 million I've generated, ideally at the same generation and insertion rate? I reckon I can still work with generators and iterators, as it's most likely my particular implementation that's buggy.
https://gist.github.com/anonymous/bed4a311fb746ba04c65d331d23bd0a8
Batch-inserting more than 1000 rows at a time is very much into "diminishing returns". 1000 gives you about 99% of the theoretical maximum.
When you go much beyond 1000, you first get into inefficiencies, then you get (as you found out) into limitations. Your attempt to get to 100% has backfired.
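A sketch of what that looks like with the mysqljs/mysql module; the table name, columns and connection settings are placeholders, and the row-building loop stands in for your real record generator. The point is to keep each INSERT at roughly 1000 rows and to build the next batch only after the previous one has been written, so the heap never holds millions of rows at once:

const mysql = require('mysql');

const connection = mysql.createConnection({
  host: 'localhost', user: 'root', password: '', database: 'test'  // assumed credentials
});

const BATCH_SIZE = 1000;      // past ~1000 rows per INSERT the gains flatten out
const TOTAL_ROWS = 5443200;

function insertBatch(offset, done) {
  if (offset >= TOTAL_ROWS) return done();
  const rows = [];
  for (let i = offset; i < Math.min(offset + BATCH_SIZE, TOTAL_ROWS); i++) {
    rows.push([i, 'value-' + i]);  // stand-in for your real record generator
  }
  // "VALUES ?" with an array of arrays becomes a multi-row insert in mysqljs/mysql
  connection.query('INSERT INTO records (a, b) VALUES ?', [rows], function (err) {
    if (err) return done(err);
    insertBatch(offset + BATCH_SIZE, done);  // next batch only after this one lands
  });
}

connection.connect(function () {
  insertBatch(0, function (err) {
    if (err) console.error(err);
    connection.end();
  });
});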
I am creating a simple game in HTML5 canvas and I run it using JavaScript. I want it to be a multiplayer game, but first I need a database where I can put the x and y position of an object, updated every 30 milliseconds (these are the keyframes of my game animation). I need to save it in a file or database so other players can see the updates to the x and y positions of the other players... I hope you get my point...
Now I am asking: what database or file should I use for this position updating, one that is able to update that fast?
For a scenario like this, you will probably get more mileage if you cache locations in memory locally, but then periodically "sync" them with the database. This will require ways to resolve conflicts in position (e.g. if the position you predict / have on the client-side JavaScript deviates from the actual position as reported by the database) but will allow you to be more efficient (e.g. updating at a faster rate when players are nearer to your player and less frequently when they are far away, for example). It will also allow you to animate your player's movements more steadily without jank in the event that a particular database request falls outside of your frame rate requirements.
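A minimal sketch of that pattern; sendPositions and fetchPositions are placeholders for whatever backend API you end up with, and the sync interval is an assumption:

// Local, in-memory position cache updated every animation frame
const localPositions = { me: { x: 0, y: 0 }, others: {} };

function gameLoop() {
  // ...move the player, update localPositions.me.x / .y, draw everything...
  requestAnimationFrame(gameLoop);  // render from the local cache, never from the database
}
requestAnimationFrame(gameLoop);

// Push our position and pull the others' on a slower, fixed interval
setInterval(async () => {
  await sendPositions(localPositions.me);   // hypothetical backend call
  const remote = await fetchPositions();    // hypothetical backend call
  // Reconcile: overwrite (or interpolate toward) the server-reported positions
  Object.assign(localPositions.others, remote);
}, 100);  // e.g. 10 syncs per second instead of a database write every frame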
As for the database, itself, there are a lot of databases to choose from. However, if you don't want to write the server-side code to provide an API for interacting with your database, then you may be interested in Firebase, which provides direct access from client-side JavaScript (without the need to create your own server / API layer on top of the database). Of course you can also use any other database -- Google Cloud Datastore, Google Cloud SQL, MySQL, Cassandra, MongoDB -- and write an appropriate API server layer in the language of your choice (which could also be JavaScript) to provide access to the underlying data, as a valid option as well (and, in fact, that might make more sense if you already have or plan to have a frontend webserver).
TL;DR:
I'm making an app for a canteen. I have a collection with the persons and a collection where I "log" every meal taken. I need to know who DIDN'T take the meal.
Long version:
I'm making an application for my local Red Cross.
I'm trying to optimize this situation:
there is a canteen at which the people we help can take food at breakfast, lunch and supper. We need to know how many took the meal (and this is easy).
if they are present they HAVE TO take the meal and eat, so we need to know how many (and who) HAVEN'T eaten (this is the part that I need to optimize).
When they take the meal, the "cashier" inserts their barcode and the program logs the "transaction" in the Log collection.
Currently, on creation of the template "canteen", I create a local collection "Meals" and populate it with the data of all the people in the DB (so ID, name, fasting/satiated); then I use this collection for my counters and to display who took the meal and who didn't.
(The variable "mealKind" is "breakfast", "lunch" or "dinner", depending on the meal currently being served.)
Template.canteen.created = function(){
  // Client-only local collection, rebuilt each time the template is created
  Meals = new Mongo.Collection(null);
  var today = new Date();
  today.setHours(0, 0, 1);
  var pers = Persons.find({"status": "present"},
                          {fields: {"Name": 1, "Surname": 1, "barcode": 1}}).fetch();
  pers.forEach(function(uno){
    // Look for a log entry for this person, this meal kind, today
    var vediamo = Log.findOne({"dest": uno.barcode, "what": mealKind, "when": {"$gte": today}});
    if(typeof vediamo == "object"){
      uno['eat'] = "satiated";
    }else{
      uno['eat'] = "fasting";
    }
    Meals.insert(uno);
  });
};
Template.canteen.destroyed = function(){
  Meals.remove({});  // clear the client-only collection
};
From the Meals collection I extract the two columns of people, satiated and fasting (each with name, surname and barcode), and I also use two helpers:
fasting: function(){
  return Meals.find({"eat": "fasting"});
},
"countFasting": function(){
  return Meals.find({"eat": "fasting"}).count();
}
//same for satiated
This was OK, but now the number of people is really increasing (we are around 1000 and counting) and the creation of the page is very, very slow; it usually stops with errors, so I read something like "100 fasting, 400 satiated" even though I have around 1000 persons in the DB.
I can't figure out how to optimize the workflow; every other method I tried involved (one way or another) more queries to the DB. I think I missed the point and now I cannot see it.
I'm not sure about aggregation at this level and inside Meteor, because of minimongo.
Although making this server side and not client side is clever, the problem here is HOW to discriminate "fasting" vs "satiated" without cycling through the whole Persons collection.
+1 if the solution is compatible with aldeed:tabular.
EDIT
I am still not sure about what is causing your performance issue (too many things in client memory / minimongo, too many calls to it?), but you could at least try different approaches, more traditionally based on your server.
By the way, you did not mention how you display your data, nor how you end up with an incorrect reading for the number of already served / missing Persons.
If you are building a classic HTML table, please note that browsers struggle rendering more than a few hundred rows. If you are in that case, you could implement a client-side table pagination / infinite scrolling. Look for example at jQuery DataTables plugin (on which is based aldeed:tabular). Skip the step of building an actual HTML table, and fill it directly using $table.rows.add(myArrayOfData).draw() to avoid the browser limitation.
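For example (a sketch only, assuming jQuery and DataTables are loaded, an empty <table id="mealsTable"> exists, and myArrayOfData is an array of [name, surname, status] rows):

// Initialize an empty, paginated DataTable once
var table = $('#mealsTable').DataTable({
  columns: [{ title: 'Name' }, { title: 'Surname' }, { title: 'Status' }],
  pageLength: 50
});

// Feed it plain arrays instead of building ~1000 <tr> elements yourself
table.rows.add(myArrayOfData).draw();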
Original answer
I do not exactly understand why you need to duplicate your Persons collection into a client-side Meals local collection?
This requires that you have first all documents of Persons sent from server to client (this may not be problematic if your server is well connected / local. You may also still have autopublish package on, so you would have already seen that penalty), and then cloning all documents (checking for your Logs collection to retrieve any previous passages), effectively doubling your memory need.
Is your server and/or remote DB that slow to justify your need to do everything locally (client side)?
Much more problematic: should you have more than one "cashier" / client browser open, their Meals local collections will not be synchronized.
If your server-client connection is good, there is no reason to do everything client side. Meteor will automatically cache just what is needed, and provide optimistic DB modification to keep your user experience fast (should you structure your code correctly).
With aldeed:tabular package, you can easily display your Persons big table by "pages".
You can also link it with your Logs collection using dburles:collection-helpers (IIRC there is an example on the aldeed:tabular home page).
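To illustrate the server-side route, here is a sketch only (it reuses the field names from your snippet and assumes a Meteor method call fits your flow): the "who hasn't eaten" question becomes a single $nin query instead of a loop over every Person.

// server-side sketch, not a drop-in replacement
Meteor.methods({
  fastingPersons: function (mealKind) {
    var today = new Date();
    today.setHours(0, 0, 1);
    // Barcodes already logged for this meal today
    var served = Log.find(
      { what: mealKind, when: { $gte: today } },
      { fields: { dest: 1 } }
    ).map(function (entry) { return entry.dest; });
    // Everyone present whose barcode is NOT in the served list
    return Persons.find(
      { status: "present", barcode: { $nin: served } },
      { fields: { Name: 1, Surname: 1, barcode: 1 } }
    ).fetch();
  }
});

The satiated count is then just the number of present Persons minus the length of that result, so neither list requires copying the whole Persons collection into a local Meals collection.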
We are investigating using Breeze for field deployment of some tools. The scenario is this -- an auditor will visit sites in the field, where most of the time there will be no -- or very degraded -- internet access. Rather than replicate our SQL database on all the laptops and tablets (if that's even possible), we are hoping to use Breeze to cache the data and then store it locally so it is accessible when there is not a usable connection.
Unfortunately, Breeze seems to choke when caching any significant amount of data. Generally on Chrome it's somewhere between 8 and 13 MB worth of entities (as measured by the HTTP response headers). This can change a bit depending on how many tabs I have open and such, but I have not been able to move that by more than 10%. The error I get is that the Chrome tab crashes and tells me to reload. The error is replicable (I download the data in 100K chunks; it fails on the same read every time, and works fine if I stop it after the previous read). When I change the page size, it always fails within the same range.
Is this a limitation of Breeze, or Chrome? Or Windows? I tried it on Firefox, and it handles even less data before the whole browser crashes. IE fares a little better, but none of them do great.
Looking at performance in task manager, I get the following:
IE goes from 250 MB of memory usage to 1.7 GB during the caching process and caches a total of about 14 MB before throwing an out-of-memory error.
Chrome goes from 206 MB of memory usage to about 850 MB while caching a total of around 9 MB.
Firefox goes from around 400 MB to about 750 MB and manages to cache about 5 MB before the whole program crashes.
I can calculate how much will be downloaded with any selection criteria, but I cannot find a way to calculate how much data can be handled by any specific browser instance. This makes using Breeze for offline auditing close to useless.
Has anyone else tackled this problem yet? What are the best approaches to handling something like this? I've thought of several things, but none of them are ideal. Any ideas would be appreciated.
ADDED At Steve Schmitt's request:
Here are some helpful links:
Metadata
Entity Diagram (pdf) (and html and edmx)
The first query, just to populate the tags on the page, runs quickly and downloads minimal data:
var query = breeze.EntityQuery
.from("Countries")
.orderBy("Name")
.expand("Regions.Districts.Seasons, Regions.Districts.Sites");
Once the user has selected the Sites s/he wishes to cache, the following two queries are kicked off (it used to be one query, but I broke it into two hoping it would be less of a burden on resources -- it didn't help). The first query (usually 2-3K entities and about 2 MB) runs as expected. Some combination of the predicates listed is used to filter the data.
var qry = breeze.EntityQuery
.from("SeasonClients")
.expand("Client,Group.Site,Season,VSeasonClientCredit")
.orderBy("DistrictId,SeasonId,GroupId,ClientId")
var p = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p1 = breeze.Predicate("SeasonId", "==", SeasonId);
var p2 = breeze.Predicate("DistrictId", "==", DistrictId);
var p3 = breeze.Predicate("Group.Site.SiteId", "in", SiteIds);
After the first query runs, the second query (below) runs, also using some combination of the predicates listed to filter the data. At about 9 MB, it will have about 50K rows to download. When the total download burden between the two queries is between 10 MB and 13 MB, the browsers crash.
var qry = breeze.EntityQuery
.from("Repayments")
.orderBy('SeasonId,ClientId,RepaymentDate');
var p1 = breeze.Predicate("District.Region.CountryId", "==", CountryId);
var p2 = breeze.Predicate("SeasonId", "==", SeasonId);
var p3 = breeze.Predicate("DistrictId", "==", DistrictId);
var p4 = breeze.Predicate("SiteId", "in", SiteIds);
Thanks for the interest, Steve. You should know that the Entity Relationships are inherited and currently in production supporting the majority of the organization's operations, so as few changes as possible to that would be best. Also, the hope is to grow this from a reporting application to one with which data entry can be done in the field (so, as I understand it, using projections to limit the data wouldn't work).
Thanks for the interest, and let me know if there is anything else you need.
Here are some suggestions based on my experience building an offline-capable web application using Breeze. Some or all of these might not make sense for your use cases...
1) Identify which entity types need to be editable vs which are used to fill drop-downs etc. Load non-editable data using the noTracking query option and cache it in localStorage yourself using JSON.stringify (see the sketch after this list). This avoids the overhead of coercing the data into entities, change tracking, etc. Good candidates for this approach in your model might be entity types like Country, Region, District, Site, etc.
2) If possible, provide a facility in your application for users to identify which records they want to "take offline". This way you don't need to load and cache everything, which can get quite expensive depending on the number of relationships, entities, properties, etc.
3) In conjunction with suggestion #2, avoid loading all the editable data at once and avoid using the same EntityManager instance to load each set of data. For example, if the Client entity is something that needs to be editable out in the field without a connection, create a new EntityManager, load a single client (expanding any children that also need to be editable) and cache this data separately from other clients.
4) Cache the Breeze metadata once. When calling exportEntities the includeMetadata argument should be false. More info on this here.
5) To create new EntityManager instances, make use of the createEmptyCopy method.
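A sketch of suggestions #1 and #5 combined; manager is assumed to be an existing EntityManager, the "lookups" key and resource names are placeholders, and the exact Breeze calls may differ slightly between versions:

// Load reference data without entity/change-tracking overhead and park it in localStorage
var lookupQuery = breeze.EntityQuery
  .from("Countries")
  .expand("Regions.Districts.Sites")
  .noTracking(true);   // plain JS objects, no entity wrapping

manager.executeQuery(lookupQuery).then(function (data) {
  localStorage.setItem("lookups", JSON.stringify(data.results));
});

// Later, read it back without touching the server or the EntityManager
var lookups = JSON.parse(localStorage.getItem("lookups") || "[]");

// For editable data, work in a fresh manager per unit of work
var clientManager = manager.createEmptyCopy();   // same service and metadata, empty cache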
EDIT:
I want to respond to this comment:
> Say I have a client who has bills and payments. That client is in a group, in a site, in a region, in a country.
> Are you saying that the client, payment, and bill information might each have their own EM, while the location hierarchy might be in a 4th EM with no-tracking?
> Then when I refer to them, I wire up the relationships as needed using LINQs on the different EMs (give me all the bills for customer A, give me all the payments for customer A)?
It's a bit of a judgement call in terms of deciding how to separate things out. Some of what I'm suggesting might be overkill, it really depends on the amount of data and the way your application is used.
Assuming you don't need to edit groups, sites, regions and countries while offline, the first thing I'd do would be to load the list of groups using the noTracking option and cache them in localStorage for offline use. Then do the same for sites, regions and countries. Keep in mind, entities loaded with the noTracking option aren't cached in the entity manager so you'll need to grab the query result, JSON.stringify it and then call localStorage.setItem. The intent here is to make sure your application always has access to the list of groups, sites, regions, etc so that when you display a form to edit a client entity you'll have the data you need to populate the group, site, region and country select/combobox/dropdown.
Assuming the user has identified the subset of clients they want to work with while offline, I'd then load each of these clients one at a time (including their payment and bill information but not expanding their group, site, region, country) and cache each client+payments+bills set using entityManager.exportEntities. Reasoning here is it doesn't make sense to load several clients plus their payments and bills into the same EntityManager each time you want to edit a particular client. That could be a lot of unnecessary overhead, but again, this is a bit of a judgement call.
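A sketch of that per-client flow; masterManager and clientId stand for your existing manager and the selected client's id, the resource and navigation property names (Clients, Payments, Bills) come from your comment rather than your actual model, the localStorage key scheme is an assumption, and the exportEntities options argument may differ by Breeze version:

// One small EntityManager per client the user takes offline
var clientManager = masterManager.createEmptyCopy();

var query = breeze.EntityQuery
  .from("Clients")
  .where("ClientId", "==", clientId)
  .expand("Payments, Bills");   // only the children that must be editable offline

clientManager.executeQuery(query).then(function () {
  // Serialize just this client and its children; metadata was cached separately, once
  var exported = clientManager.exportEntities(null, { includeMetadata: false });
  localStorage.setItem("client:" + clientId, exported);
});

// Back in the field: rehydrate into a fresh manager and edit as usual
var offlineManager = masterManager.createEmptyCopy();
offlineManager.importEntities(localStorage.getItem("client:" + clientId));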
@Jeremy's answer was excellent and very helpful, but didn't actually answer the question, which I was starting to think was unanswerable, or at least the wrong question. However, @Steve in the comments gave me the most appropriate information for this question.
It is neither Breeze nor the browser, but rather Knockout. Apparently the Knockout wrapper around the Breeze entities uses all that memory (at least while loading the entities, and in my environment). As described above, Knockout/Breeze would crap out after reading around 5 MB of data, causing Chrome to crash with over 1.7 GB of memory usage (from a pre-download memory usage of around 300 MB). Rewriting the app in AngularJS eliminated the problem. So far I have been able to download over 50 MB from the exact same EF6 model into Breeze/Angular, and total Chrome memory usage never went above 625 MB.
I will be testing larger payloads, but 50 MB more than satisfies my needs for the moment. Thanks everyone for your help.
I want to make a Chrome extension that will store potentially large code snippets (with the snippet's name) from the user and use them.
I want the user to be able to upload files containing these snippets (or, even better, to copy and paste these snippets into a textarea in the extension's options page).
And, the tricky part, I want these snippets to be persisted by the extension so that they are accessible the next time the user starts Chrome.
What kind of storage do you think I should use?
What is the average size of a code snippet?
Depending on the answer you can use:
1) localStorage - the easiest solution, but it has size limitations: about 5 MB, based on the data from http://en.wikipedia.org/wiki/Web_storage
2) chrome.storage - with the "unlimitedStorage" permission it can store more data, and it can be synced between devices with storage.sync. See more at: https://developer.chrome.com/extensions/storage
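A sketch of the chrome.storage route; the snippets key is your own choice, and it assumes the "storage" (and, if needed, "unlimitedStorage") permission is declared in manifest.json:

// Save a named snippet (e.g. from the options page textarea)
function saveSnippet(name, code, done) {
  chrome.storage.local.get({ snippets: {} }, function (data) {
    data.snippets[name] = code;
    chrome.storage.local.set({ snippets: data.snippets }, done);
  });
}

// Load all snippets back on the next browser start
function loadSnippets(callback) {
  chrome.storage.local.get({ snippets: {} }, function (data) {
    callback(data.snippets);   // { name: code, ... }
  });
}

Swap chrome.storage.local for chrome.storage.sync if you want the snippets to follow the user across devices, keeping in mind that sync has much stricter quota limits.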
I would also suggest a cloud back-end, just because it's cool :) Have a look at one of these:
https://parse.com - quite a powerful one with lots of features
https://www.firebase.com - fewer features but still very popular
Up to certain limits, the usage of both services is free.
Update
The Parse.com pricing page even says you can have up to 20 GB of database storage for free. Their FAQ reads:
> What happens when I exceed 20GB of database storage?
> The overage rate for database size is $10/GB, but we only allow increases in increments of 20GB. When you exceed 20GB of database size we will increase your soft limit to 40GB and begin charging you an incremental $200/month.