How would one efficiently get the total size of a query result from a Firestore collection that has thousands of documents?
In my case I query documents by a few different criteria:
Start date
End date
Place id
Keywords
Then I limit the query to show only 50 records, but I would need the size of the query without this limitation so that pagination displays correctly in the front end.
I could use a Cloud Function that runs the same query without the limit and then returns its size, but is there a more efficient way of doing this? The query could match thousands of documents, so are there any performance issues with doing it this way? And how does billing work in this kind of situation?
If, for example, my query matches 1500 documents, are there going to be 1500 read operations just to get the size of this query?
There have been other topics that recommend using counters to get the size of a collection, but this does not suit my case since the size depends on the user's search parameters stated above.
All recommendations for this problem are welcome!
If you have thousands of documents in one collection, you might need to update a counter very often. In Cloud Firestore, you can only update a single document about once per second, which might be too low for some high-traffic applications.
The query could match thousands of documents, so are there any performance issues with doing it this way?
No, there won't be. According to the official documentation on Firestore counters, you can use distributed counters:
To support more frequent counter updates, create a distributed counter. Each counter is a document with a subcollection of "shards," and the value of the counter is the sum of the value of the shards.
This practice can help you achieve what you want.
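For reference, a minimal sketch of such a distributed counter with the Firestore v9 web SDK might look like this; the shard count of 10 and the counter path are assumptions for illustration, not values from the question:

import { doc, collection, setDoc, updateDoc, getDocs, increment } from "firebase/firestore";

// counterPath is a document path, e.g. "counters/myCounter" (hypothetical).
// Create the counter document plus its "shards" subcollection.
// numShards = 10 is arbitrary; more shards allow more concurrent increments.
async function createCounter(db, counterPath, numShards = 10) {
  await setDoc(doc(db, counterPath), { numShards });
  for (let i = 0; i < numShards; i++) {
    await setDoc(doc(db, counterPath, "shards", String(i)), { count: 0 });
  }
}

// Increment a randomly chosen shard, spreading writes across shards.
async function incrementCounter(db, counterPath, numShards = 10) {
  const shardId = String(Math.floor(Math.random() * numShards));
  await updateDoc(doc(db, counterPath, "shards", shardId), { count: increment(1) });
}

// The counter's value is the sum of all shard values (one read per shard).
async function getCount(db, counterPath) {
  const shards = await getDocs(collection(db, counterPath, "shards"));
  return shards.docs.reduce((sum, s) => sum + s.data().count, 0);
}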
And how does billing work in this kind of situation?
If you read the entire collection at once, you'll be billed one read operation for each document that is read.
If my query matches 1500 documents, are there going to be 1500 read operations to get the size of this query?
If you are looping over the entire collection to get the number of documents, yes.
For more details about storing counters, please see the last part of my answer from this post:
As a personal hint, don't store this kind of counter in Cloud Firestore, because every time you increase or decrease it you'll pay for a write operation. Host this counter in the Firebase Realtime Database instead, at no per-operation cost.
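If you take that approach, a tiny sketch of bumping such a counter in the Realtime Database could look like this (the "counters/posts" path is made up for this example):

import { getDatabase, ref, update, increment } from "firebase/database";

// Atomically bump a counter kept in the Realtime Database
// instead of Firestore ("counters/posts" is a hypothetical path).
const db = getDatabase();
await update(ref(db, "counters"), { posts: increment(1) });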
More precisely, I want a slice of the ordered documents, i.e., an offset plus a limit. My idea would be something like this, but it isn't valid:
firestore().collection("queue").orderBy("order_id", "asc").limit(3,5)
I'd be grateful if anyone could answer it.
Best Practice
"Do not use offsets. Instead, use cursors. Using an offset only avoids returning the skipped documents to your application, but these documents are still retrieved internally. The skipped documents affect the latency of the query, and your application is billed for the read operations required to retrieve them."
Firestore does not offer offset-based query results for web and mobile clients, as they are inefficient and costly on your bill. If you want to implement pagination in your app, you should follow the linked documentation and design your app accordingly. That gives you the ability to move forward and backward through query results, but not to jump to a specific index or offset without first reading everything up to that offset (which is the expensive part Firestore is suggesting you avoid).
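As a hedged sketch of the cursor approach from that documentation, reusing the "queue" collection and "order_id" field from the question:

import { getFirestore, collection, query, orderBy, limit, startAfter, getDocs } from "firebase/firestore";

async function loadTwoPages() {
  const db = getFirestore();
  const base = query(collection(db, "queue"), orderBy("order_id", "asc"));

  // First page: the three smallest order_id values.
  const firstPage = await getDocs(query(base, limit(3)));

  // Next page: resume the cursor after the last visible document,
  // instead of skipping (and paying for) the documents before it.
  const lastVisible = firstPage.docs[firstPage.docs.length - 1];
  const nextPage = await getDocs(query(base, startAfter(lastVisible), limit(3)));
  return { firstPage, nextPage };
}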
I have a chat app that is charging me a large number of reads on each page load, one for each message shown. I'm trying to figure out a way to reduce that number and optimize for cost, as refreshing the page a few times costs hundreds of reads.
The Firestore pricing documentation says:
"For queries other than document reads, such as a request for a list of collection IDs, you are billed for one document read. If fetching the complete set of results requires more than one request (for example, if you are using pagination), you are billed once per request."
I considered that maybe if I fetch an entire collection without a query, as shown here in the docs, a cost difference might be remotely possible. I'm sure that's probably wrong, but I can't find anything specifying what the exceptions are that cost only one read. It also crossed my mind to create an array holding the most recent messages in the parent document of the collection, but the security rules for updating that array seem overly complex and impractical. I also read about using the Firebase cache, but that doesn't seem useful here.
Here is code to demonstrate how I'm currently loading messages. I'm using the react-firebase-hooks library to snapshot this data with useCollectionData:
import { query, orderBy, limit } from "firebase/firestore"
import { useCollectionData } from "react-firebase-hooks/firestore"
const q = query(messagesRef, orderBy("createdAt", "desc"), limit(100))
const [messages] = useCollectionData(q)
While researching, I found this question, where I'm pretty sure the accepted answer is wrong, though it did make me question the rules. Are there any strategies to reduce the number of reads for this common use case?
Pagination still incurs charges on a per-document basis, right?
Yes, it does, but only when you load more pages.
I'm not trying to load the entire collection, but rather wondering if loading the collection without a query has a different cost than with.
Loading a collection without a query that limits the results means that you're reading the entire collection. And yes, the cost will be much higher if you're not limiting the query. Remember that the cost of reading a collection or query in Firestore equals the number of documents that are actually returned. For example, if you have a collection of 1 million documents and your query returns 100, you'll pay for only 100 document reads.
I'm overall trying to figure out if there's a strategy that can improve the read cost of the example query I gave.
No. If you need to get the newest 100 messages, that's the best query you can have. The only change you can make to decrease the number of reads is to lower the value you pass to the limit() function. And that may make sense, since a user is probably not interested in reading 100 messages at once. Always try to display only as much data as fits on a screen, and load any other data progressively.
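One possible sketch of that progressive loading, keeping useCollectionData from the question and simply making the limit stateful (the initial page size of 25 is an arbitrary choice):

import { useState } from "react";
import { query, orderBy, limit } from "firebase/firestore";
import { useCollectionData } from "react-firebase-hooks/firestore";

function Messages({ messagesRef }) {
  // Start small; "Load more" re-runs the query with a larger limit.
  const [pageSize, setPageSize] = useState(25);
  const q = query(messagesRef, orderBy("createdAt", "desc"), limit(pageSize));
  const [messages] = useCollectionData(q);

  return (
    <>
      {messages?.map((m, i) => <p key={i}>{m.text}</p>)}
      <button onClick={() => setPageSize((n) => n + 25)}>Load more</button>
    </>
  );
}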
I have a list of documents stored in DynamoDB. There are more than 10,000 documents in the table. I have set the limit to 100 per request. DynamoDB is doing a good job of returning me the first 100 documents and a LastEvaluatedKey to get the next 100 documents.
The problem is that I also want DynamoDB to return the total number of pages for pagination purposes. In this case, since I have 10,000 documents and a page size of 100, it should return 100 (the number of pages).
For now, what I have done is count manually by looping the query until it no longer returns a LastEvaluatedKey, adding up how many loops were made to get the total number of pages. But I believe there is a better approach.
As the other answer has correctly explained, there is no efficient way to get total result counts for DynamoDB Query or Scan operations. As such, there is no efficient way to get the total number of pages.
However, what I wanted to call out is that modern UIs have been moving away from classic pagination towards an infinite-scroll design pattern, where the "next page" of results is loaded on demand as the list is scrolled. This can be achieved with DynamoDB: you can still show discrete pages, but you cannot show, a priori, how many results or how many pages there are. It's a current shortcoming of DynamoDB.
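A sketch of that on-demand page loading with the AWS SDK for JavaScript v3; the table name and key schema below are invented for illustration:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fetch one page of up to 100 items; pass the previous page's
// LastEvaluatedKey back as ExclusiveStartKey to get the next page.
async function fetchPage(startKey) {
  const out = await ddb.send(new QueryCommand({
    TableName: "Documents",                 // hypothetical table name
    KeyConditionExpression: "pk = :pk",     // hypothetical key schema
    ExpressionAttributeValues: { ":pk": "DOC" },
    Limit: 100,
    ExclusiveStartKey: startKey,            // undefined on the first call
  }));
  return { items: out.Items, nextKey: out.LastEvaluatedKey };
}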
Neither "Scan" (to read the entire table) nor "Query" (to read all the items with the same partition key) operations return an estimate on how many total results there are. In some cases (e.g., when a FilterExpression is provided), there is no efficient way for DynamoDB to do this estimation. In other cases there may be a way for DynamoDB to provide this information, but it doesn't.
If I understand you correctly, you want to Scan the entire table, without a filter. Like I said, Scan itself doesn't give you the number of items in the table. But you can find this number using DescribeTable, which returns, among other things, an "ItemCount" attribute, which is an estimate on the total number of items in the table, which may not be completely up-to-date but perhaps is good enough for your needs (if you want an estimate for some sort of progress report, that doesn't need to be 100% accurate).
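A minimal sketch of reading that estimate with the AWS SDK for JavaScript v3:

import { DynamoDBClient, DescribeTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// ItemCount is refreshed by DynamoDB roughly every six hours,
// so treat the returned value strictly as an estimate.
async function estimateItemCount(tableName) {
  const { Table } = await client.send(new DescribeTableCommand({ TableName: tableName }));
  return Table.ItemCount;
}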
If you really need accurate and up-to-the-moment counts of items in a partition or an entire table, you can always try to maintain such counters as separate data. Doing this correctly is not trivial, and it will have performance implications, but in some use cases (e.g., rare writes and a lot of reads) it may be useful.
You can maintain your own counts using DynamoDB Streams: create a Lambda function that watches for items being created or deleted, and have it write back to a counter item in DynamoDB that stores the current count of items.
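A hedged sketch of such a stream-triggered handler; the table name and the counter item's key ("META"/"COUNT") are invented for this example:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Lambda handler attached to the table's stream: net out inserts
// and removes in the batch, then apply the delta to a counter item.
export const handler = async (event) => {
  let delta = 0;
  for (const record of event.Records) {
    if (record.eventName === "INSERT") delta += 1;
    if (record.eventName === "REMOVE") delta -= 1;
  }
  if (delta !== 0) {
    await ddb.send(new UpdateCommand({
      TableName: "Documents",
      Key: { pk: "META", sk: "COUNT" },       // hypothetical counter item
      UpdateExpression: "ADD itemCount :d",
      ExpressionAttributeValues: { ":d": delta },
    }));
  }
};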
I need to move my local project to a web server, and it is time to start saving things persistently (user progress and history).
The main idea is that every 50 ms or so the web app will calculate 8 values related to the user who is using it.
My questions are:
Should I use MySQL to store the data? At the moment I'm using a plain text file with a predefined format like:
Option1,Option2,Option3
Iteration 1
value1,value2,value3,value4,value5
Iteration 2
value1,value2,value3,value4,value5
Iteration 3
value1,value2,value3,value4,value5
...
If so, should I use 5 (or more in the future) columns, one for each value, with the iteration number as the ID? Keep in mind I will have 5000+ iterations per session (roughly 4 minutes).
Each user can have 10-20 sessions a day.
Will the DB become too big to be efficient?
Given the sampling speed, a call to the DB every 50 ms seems like a problem to me (especially since I have to animate the web page heavily). I was wondering if it would be better to implement a Save button that writes all 5000+ values to the DB in one go. If so, what would be the best way to do it?
Would it be better to save the *.txt directly in a folder on the web server? Something like DB/usernameXYZ/dateZXY/filename_XZA.txt. To me yes, way less effort. If so, which function allows me to do so (possibly JS/HTML)?
The rules are simple, and are discussed in many Q&A here.
With rare exceptions...
Do not have multiple tables with the same schema. (Eg, one table per User)
Do not splay an array across columns. Use another table.
Do not put an array into a single column as a comma-separated list. Exception: if you never use SQL to look at the individual items in the list, then it is OK for it to be an opaque text field.
Be wary of EAV schema.
Do batch INSERTs or use LOAD DATA; a 10x speedup over one row per INSERT. (See the sketch after this list.)
Properly indexed, a billion-row table performs just fine. (Problem: It may not be possible to provide an adequate index.)
Images (a la your .txt files) could be stored in the filesystem or in a TEXT column in the database -- there is no universal answer of which to do. (That is, need more details to answer your question.)
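To illustrate the batch-INSERT rule referenced above, a sketch in Node.js assuming the mysql2 driver (whose query() expands a nested array into multiple row tuples); the "readings" table and its columns are invented to match the five-values-per-iteration format in the question:

import mysql from "mysql2/promise";

// Buffer the ~5000 iterations in memory during the session, then
// write them with a single multi-row INSERT when the user saves.
async function saveSession(rows) {
  // rows = [[sessionId, iteration, v1, v2, v3, v4, v5], ...]
  const conn = await mysql.createConnection({ host: "localhost", user: "app", database: "webapp" });
  await conn.query(
    "INSERT INTO readings (session_id, iteration, v1, v2, v3, v4, v5) VALUES ?",
    [rows]
  );
  await conn.end();
}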
"calculate 8 values that are related to the user" -- to vague. Some possibilities:
Dynamically computing a 'rank' is costly and time-consuming.
Summary data is best pre-computed
Huge numbers (eg, search hits) are best approximated
Calculating age from birth date - trivial
Data sitting in the table for that user is, of course, trivial to get
Counting number of 'friends' - it depends
etc.
Well, I am creating a social network that allows users to create posts and like them.
By asking on Stack Overflow I've understood how to structure my database:
A collection which includes a document for each post.
A collection which includes a document for each like; each of these documents holds a reference to the post it belongs to.
When I want to get ALL the likes for a post, I can query the likes collection looking for references to that post.
And up to here I am OK. But assuming I'll have millions of documents in the likes collection, I wondered how I could query and search among them without it taking too long.
And I was advised to use ensureIndex; in this case, I have to index the field which contains the reference to a post.
But when do I have to create this index? Is it enough to create it once (for example, when I set up my database) and it will be kept by default in MongoDB, or do I have to do it during the application's lifetime? Thank you.
But assuming I'll have millions of documents in the likes collection, I wondered how I could query and search among them without it taking too long.
I assume you would most likely want to do a count of the likes, as an example?
You can't do that cheaply; instead, you use optimizations to combat it. A count over millions of documents might get a bit slow.
A typical pattern with SQL technologies is to maintain counters that amend the parent row with an aggregate figure for its children.
The same applies to MongoDB.
You would aggregate important data to the top.
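A sketch of that aggregate-to-the-top idea with the official Node.js MongoDB driver; the "likesCount" field name is made up for this example:

// db is a connected Db handle from the official "mongodb" driver.
// When a like is written, also bump a cached counter on the post
// itself, so displaying the total never scans the likes collection.
async function addLike(db, postId, userId) {
  await db.collection("likes").insertOne({ post: postId, user: userId, createdAt: new Date() });
  await db.collection("posts").updateOne({ _id: postId }, { $inc: { likesCount: 1 } });
}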
If you need to actually query the likes to show some of the users who liked a post, then you limit those likes. Google+ and other networks tend to limit the number of likes they show to about 1,000.
And I was advised to use ensureIndex;
Adding indexes to a database does help with actually searching for documents.
But when do I have to create this index? Is it enough to create it once?
Yes, MongoDB will manage the index itself. You only need to ensure it once.
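One note: ensureIndex is the legacy name, and current drivers and shells call it createIndex. A sketch with the Node.js driver (the connection string and database name are placeholders):

import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();

// Index the field that references the post; creating an index that
// already exists is a no-op, so this can safely run at every start-up.
await client.db("mynetwork").collection("likes").createIndex({ post: 1 });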