I just started writing some Lambda functions, and my problem is this:
I have around 7000 items to write.
Each item has two indexes: the primary key is the id, and there is a secondary index on the spotname.
To write all those items into DynamoDB I used a batch write.
Unfortunately I ran into the batchWrite limit of 25 items per request, and I worked around it in the following way:
for (var j = 0; j < event.length; j++){
if(event[j][0] && event[j][1] && event[j][2] && event[j][3]){
requests.push(new Station(event[j][0],event[j][1],event[j][2],event[j][3]));
if(requests.length == 25 || j == (event.length -1)) { // when you have 25 ready..
var params = {
RequestItems: {
'Stations': requests
}
};
requests=[];
DynamoDB.batchWrite(params, function(err, data) {
if (err){
console.log("Error while batchWrite into dynamoDb");
console.log(err);
}
else{
console.log("Pushed all the added elements");
}
});
}
}
}
Now, I noticed that with a low capacity:
Table Read: 5 Write: 5
spotname-index Read: 5 Write: 5
I manage to write only about 1500 records into the database.
Any advice?
I had this problem; this is how I solved it.
Increase the capacity for a short period of time. I learnt that it is billed by the hour, so if you increase the capacity, try to use it within one hour and then bring it down.
You cannot bring it down more than 4 times a day as of now, so you get 4 chances per day to lower your capacity. You can increase the write capacity any number of times.
The second approach is:
You can control the rate at which you write to DynamoDB, so you spread your writes evenly across your capacity.
Make sure your write capacity is always higher than the average rate of incoming records.
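A minimal sketch of that pacing idea with the AWS SDK DocumentClient (assuming that is what DynamoDB refers to in the code above); the 'Stations' table name comes from the question, while the batch preparation and the delay value are assumptions you would tune against your own capacity:

// Rough sketch: send one 25-item batch at a time and pause between batches so
// the average write rate stays under the table's provisioned capacity.
// 'batches' is assumed to be an array of arrays of PutRequest entries.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function writePaced(batches, delayMs = 1000) {
  for (const batch of batches) {
    await dynamo.batchWrite({ RequestItems: { Stations: batch } }).promise();
    await sleep(delayMs); // spread the requests evenly across your capacity
  }
}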
Hope it helps.
Using the batch write API for DynamoDB doesn't actually use less throughput. It is really intended to reduce the amount of HTTP overhead when sending a large number of requests to DynamoDB. However, this means that one or more of the items in a batch may fail to be written, and it is your responsibility to detect this and retry those requests. This is likely why some of the records are not ending up in the database. To fix this issue you should look at the response to the batch write (the UnprocessedItems it returns) and retry those writes yourself.
In contrast, when putting single records one at a time, the AWS SDK will automatically retry. If you are using a single thread, as in the case above, and switch away from batch writes, your requests will definitely be throttled, but they will be given time to retry and succeed; this just slows down the execution while keeping the throughput of the table low.
The better option is to temporarily raise the write throughput of the table to a value high enough to support the bulk load. For this example I'd recommend a value between 50 and 100 writes. A single-threaded load operation will likely be rate-limited by the round-trip time to the DynamoDB API well below these numbers. For loading only 7000 items I'd recommend avoiding the batch write API, as it requires implementing the retry logic yourself. However, if you are loading a lot more data, or need the load to complete in less time, the batch API can give you a theoretical 25-times performance improvement on the HTTP overhead, assuming you are not being throttled.
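For reference, a hedged sketch of checking UnprocessedItems and retrying with the DocumentClient; the backoff values are only illustrative:

// Retry whatever DynamoDB reports back as unprocessed, backing off a little
// more on each attempt.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

async function batchWriteWithRetry(requestItems, attempt = 0) {
  const res = await dynamo.batchWrite({ RequestItems: requestItems }).promise();
  const unprocessed = res.UnprocessedItems || {};
  if (Object.keys(unprocessed).length === 0) return;
  // wait a bit longer on each attempt before retrying the throttled items
  await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 100));
  return batchWriteWithRetry(unprocessed, attempt + 1);
}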
Edit x1: Replaced the snippet with the full file
I'm currently in the process of seeding 1.8K rows in DynamoDB. When a user is created, these rows need to be generated and inserted. They don't need to be read immediately (let's say, within less than 3 to 5 seconds). I'm currently using AWS Lambda and I'm getting hit by a timeout exception (probably because more WCUs are consumed than provisioned; I have 5, with Auto-Scaling disabled).
I've tried searching around Google and StackOverflow and this seems to be a gray area (which is kind of strange, considering that DynamoDB is marketed as an incredible solution for handling massive amounts of data per second) in which no clear path exists.
We know that DynamoDB limits batch writes to 25 items per request (the batching itself being there to reduce HTTP overhead), meaning that we could call batchWrite any number of times and increase the WCUs.
I've tried calling batchWrite repeatedly by just firing the calls and not awaiting them (will this count? I've read that since JS is single-threaded the requests will be handled one by one anyway, except that I wouldn't have to wait for the response if I don't use a promise... currently using Node 10 and Lambda), and nothing seems to happen. If I promisify the calls and await them, I get a Lambda timeout exception (probably because it ran out of WCUs).
I currently have 5 WCUs and 5 RCUs (are these too small for these random, spiked operations?).
I'm kind of stuck, as I don't want to randomly increase the WCUs for short periods of time. In addition, I've read that Auto-Scaling doesn't automatically kick in, and Amazon will only resize the capacity units 4 times a day.
How to write more than 25 items/rows into Table for DynamoDB?
https://www.keithrozario.com/2017/12/writing-millions-of-rows-into-dynamodb.html
What should I do about it?
Here's the full file that I'm using to insert into DynamoDB:
const aws = require("aws-sdk");
export async function batchWrite(
data: {
PutRequest: {
Item: any;
};
}[]
) {
const client = new aws.DynamoDB.DocumentClient({
region: "us-east-2"
});
// 25 is the limit imposed by DynamoDB's batchWrite:
// Member must have length less than or equal to 25.
// This verifies whether the data is shaped correctly and has no duplicates.
const sortKeyList: string[] = [];
data.forEach((put, index) => {
const item = put.PutRequest.Item;
const has = Object.prototype.hasOwnProperty; // cache the hasOwnProperty lookup
const hasPk = has.call(item, "pk");
const hasSk = has.call(item, "sk");
// Checks if it doesn't have a sort key. Unless it's a tenant object, which has
// the accountType attribute.
if (!hasPk || !hasSk) {
throw `hasPk is ${hasPk} and hasSk is ${hasSk} at index ${index}`;
}
if (typeof item["pk"] !== "string" || typeof item["sk"] !== "string") {
throw `Item at index ${index} pk or sk is not a string`;
}
if (sortKeyList.indexOf(item.sk) !== -1) {
throw `The item # index ${index} and sortkey ${item.sk} has duplicate values`;
}
if (item.sk.indexOf("undefined") !== -1) {
throw `There's an undefined in the sortkey ${index} and ${item.sk}`;
}
sortKeyList.push(put.PutRequest.Item.sk);
});
// DynamoDB only accepts 25 items at a time.
for (let i = 0; i < data.length; i += 25) {
const upperLimit = Math.min(i + 25, data.length);
const newItems = data.slice(i, upperLimit);
try {
await client
.batchWrite({
RequestItems: {
schon: newItems
}
})
.promise();
} catch (e) {
console.log("Total Batches: " + Math.ceil(data.length / 25));
console.error("There was an error while processing the request");
console.log(e.message);
console.log("Total data to insert", data.length);
console.log("New items is", newItems);
console.log("index is ", i);
console.log("top index is", upperLimit);
break;
}
}
console.log(
"If no errors are shown, creation in DynamoDB has been successful"
);
}
There are two issues that you're facing, and I'll attempt to address both.
A full example of the items being written and the actual batchWrite request with the items shown has not been provided, so it is unclear if the actual request is properly formatted. Based on the information provided, and the issue being faced, it appears that the request is not correctly formatted.
The documentation for the batchWrite operation in the AWS Javascript SDK can be found here, and a previous answer here shows a solution for correctly building and formatting a batchWrite request.
Nonetheless, even if the request is formatted correctly, there is still a second issue: whether sufficient capacity is provisioned to handle the write requests needed to insert 1800 records within the required amount of time, which has an upper limit of 5 seconds.
TL;DR the quick and easy solution to the capacity issue is to switch from Provisioned Capacity to On Demand capacity. As is shown below, the math indicates that unless you have consistent and/or predictable capacity requirements, most of the time On Demand capacity is going to not only remove the management overhead of provisioned capacity, but it's also going to be substantially less expensive.
As per the AWS DynamoDB documentation for provisioned capacity here, a Write Capacity Unit or WCU is billed, and thus defined, as follows:
Each API call to write data to your table is a write request. For items up to 1 KB in size, one WCU can perform one standard write request per second.
The AWS documentation for the batchWrite / batchWriteItem API here indicates that a batchWrite API request supports up to 25 items per request and individual items can be up to 400kb. Further to this, the number of WCU's required to process the batchWrite request depends on the size of the items in the request. The AWS documentation for managing capacity in DynamoDB here, advises the number of WCU's required to process a batchWrite request is calculated as follows:
BatchWriteItem — Writes up to 25 items to one or more tables. DynamoDB processes each item in the batch as an individual PutItem or DeleteItem request (updates are not supported). So DynamoDB first rounds up the size of each item to the next 1 KB boundary, and then calculates the total size. The result is not necessarily the same as the total size of all the items. For example, if BatchWriteItem writes a 500-byte item and a 3.5 KB item, DynamoDB calculates the size as 5 KB (1 KB + 4 KB), not 4 KB (500 bytes + 3.5 KB).
The size of the items in the batchWrite request has not been provided, but for the sake of this answer the assumption is made that they are <1KB each. With 25 items of <1KB each in the request, a minimum provisioned capacity of 25 WCU's is required to process a single batchWrite request per second. Given that minimum of 25 WCU's, and considering the 5 second time limit on inserting the items, only one request with 25 items can be made per second, which totals 125 items inserted within the 5 second time limit. Based on this, in order to insert 1800 items in 5 seconds, 360 WCU's are needed.
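As a quick sanity check, the same arithmetic expressed in code (the item count and time limit come from the question; the 1 KB assumption is stated above):

// Back-of-the-envelope WCU estimate under the assumptions above.
const items = 1800;          // records to insert when a user is created
const timeLimitSeconds = 5;  // upper bound from the question
const wcuPerItem = 1;        // each item assumed < 1 KB, rounded up to 1 WCU
const requiredWcu = Math.ceil((items * wcuPerItem) / timeLimitSeconds);
console.log(requiredWcu);    // 360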
Based on the current pricing for Provisioned Capacity found here, 360 WCU's of provisioned capacity would have a cost of approximately $175/month (not considering free tier credits).
There are two options for how you can handle this issue
Increase provisioned capacity. To achieve 1800 items in 5 seconds you're going to need to provision 360 WCU's.
The better option is to simply switch to On Demand capacity. The question mentioned that the write requests are “random-spiked operations”. If write requests are not predictable and consistent operations on a table, then the outcome is often over provisioning of the table and paying for idle capacity. “On Demand” capacity solves this and adheres to the Serverless philosophy of only paying for what you use where you are only billed for what you consume. Currently, on demand pricing is $1.25 / 1 million WCU's consumed. Based on this, if every new user is generating 1800 new items to be inserted, it would take 97,223 new users being created per month, before provisioning capacity for the table is competitive vs using on demand capacity. Put another way, until a new user is being registered on-average every 26 seconds, the math suggests sticking with on-demand capacity (worth noting that this does not consider RCU's or other items in the table or other access patterns).
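If you do decide to switch, a hedged sketch of flipping an existing table to on-demand with the AWS SDK for JavaScript (using the 'schon' table name and 'us-east-2' region from the question's code):

// Switch an existing table from provisioned to on-demand billing.
const AWS = require("aws-sdk");
const dynamodb = new AWS.DynamoDB({ region: "us-east-2" });

dynamodb
  .updateTable({ TableName: "schon", BillingMode: "PAY_PER_REQUEST" })
  .promise()
  .then(() => console.log("Table switched to on-demand capacity"))
  .catch(console.error);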
I have been thinking about this for a few days trying to see if there is a generic way to write this function so that you don't ever need to worry about it breaking again. That is, it is as robust as it can be, and it can support using up all of the memory efficiently and effectively (in JavaScript).
So the question is about a basic thing. Often times when you create objects in JavaScript of a certain type, you might give them an ID. In the browser, for example with virtual DOM elements, you might just give them a globally unique ID (GUID) and set it to an incrementing integer.
let GUID = 1
let a = createNode() // { id: 1 }
let b = createNode() // { id: 2 }
let c = createNode() // { id: 3 }
function createNode() {
return { id: GUID++ }
}
But what happens when you run out of integers? Number.MAX_SAFE_INTEGER == 2⁵³ - 1. That is obviously a very large number: 9,007,199,254,740,991, roughly 9 quadrillion. But if JS can reach 10 million ops per second, let's say as a ballpark figure, then it takes about 900,719,925 seconds to reach that number, or roughly 10,425 days, or about 30 years. So in this case, if you left your computer running for about 30 years, it would eventually run out of incrementing IDs. This would be a hard bug to find!!!
If you parallelized the generation of the IDs, then you could more realistically (more quickly) run out of the incremented integers. Assuming you don't want to use a GUID scheme.
Given the memory limits of computers, you can only create a certain number of objects. In JS you probably can't create more than a few billion.
But my question is, as a theoretical exercise, how can you solve this problem of generating the incremented integers such that if you got up to Number.MAX_SAFE_INTEGER, you would cycle back from the beginning, yet not use the potentially billions (or just millions) that you already have "live and bound". What sort of scheme would you have to use to make it so you could simply cycle through the integers and always know you have a free one available?
let i = 0
function getNextID() {
  if (i++ > Number.MAX_SAFE_INTEGER) {
    return i = 0
  } else {
    return i
  }
}
Random notes:
The fastest overall was Chrome 11 (under 2 sec per billion iterations, or at most 4 CPU cycles per iteration); the slowest was IE8 (about 55 sec per billion iterations, or over 100 CPU cycles per iteration).
Basically, this question stems from the fact that our typical "practical" solutions will break in the super-edge case of running into Number.MAX_SAFE_INTEGER, which is very hard to test. I would like to know some ways where you could solve for that, without just erroring out in some way.
But what happens when you run out of integers?
You won't. Ever.
But if JS can reach 10 million ops per second [it'll take] about 30 years.
Not much to add. No computer will run for 30 years on the same program. Also in this very contrived example you only generate ids. In a realistic calculation you might spend 1/10000 of the time to generate ids, so the 30 years turn into 300000 years.
how can you solve this problem of generating the incremented integers such that if you got up to Number.MAX_SAFE_INTEGER, you would cycle back from the beginning,
If you "cycle back from the beginning", they won't be "incremental" anymore. One of your requirements cannot be fullfilled.
If you parallelized the generation of the IDs, then you could more realistically (more quickly) run out of the incremented integers.
No. For the ids to be strictly incremental, you have to share a counter between these parallelized agents. And access to shared memory is only possible through synchronization, so that won't be faster at all.
If you still really think that you'll run out of the 52 bits, use BigInts. Or Symbols, depending on your use case.
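For example, a minimal sketch of a BigInt-based counter (just an illustration, not tied to any particular codebase):

let nextId = 0n;

function getNextId() {
    nextId += 1n;  // BigInt has no fixed upper bound (only memory limits),
    return nextId; // so the wrap-around case never has to be handled
}

// usage: getNextId() -> 1n, 2n, 3n, ...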
I have a complex data processing pipeline using streams, in which I have a readable stream input, a writable stream output, and a series of transform streams (let's call them step1, step2, step3, and step4). While step1, step3, and output are stateless, relying only on the data chunks coming in to produce their output, chunk for chunk, step2 and step4 are aggregation steps, collecting data from multiple chunks to produce their output, and often having outputs that overlap time-wise (e.g. chunk1, chunk3 and chunk5 might produce output1, chunk2 and chunk4 might produce output2, and so on).
Currently, the pipeline is structured as follows:
input.pipe(step1).pipe(step2).pipe(step3).pipe(step4).pipe(output);
This pipeline is very computationally expensive, and as such I'd like to split it across multiple instances, preferably running on multiple cores. Node.js streams preserve order: data chunks that come out of one step first get passed into the next step first, and this is a property I'd need to keep with whatever method I come up with for making this computation concurrent.
I'm definitely not asking for hand-holding, more if anyone has solved this problem before, and the general approach used for this kind of thing. I'm not really sure where to start.
Although I haven't yet implemented full order preservation, the streaming framework I support, scramjet, will get you really close to achieving your goal.
I'll nudge you here towards the best solution:
let seq = 0;
source.pipe(new DataStream())
    .map(data => ({data, itr: seq++}))  // mark your order
    .separate(x => x.itr % 8)           // separate into 8 streams
    .cluster((stream) => {              // spawn subprocesses
        // do your multi threaded transforms here
    }, {threads: 8})
    .mux((a, b) => a.itr - b.itr)       // merge in the order above
At some point I will introduce reordering, but in order to keep the abstraction general I can't take too many shortcuts. You, however, can take yours, such as relying on the 2^52 limitation of the counter in the example above (seq will eventually run out of safe integer space to increment).
This should lead you towards some solution.
In NetSuite there is a limit on how frequently you can use certain APIs (as well as certain scripts). For what I am doing, I believe the following are the applicable costs:
nlapiLoadSearch: 5
nlobjSearchResultSet.getSearch(): 10
It takes about an hour, but every time, my script (which follows) errors out, probably due to this. How do I change it so that it has a lower governance cost?
function walkCat2(catId, pad){
var loadCategory = nlapiLoadRecord("sitecategory", "14958149");
var dupRecords = nlapiLoadSearch('Item', '1951'); //load saved search
var resultSet = dupRecords.runSearch(); //run saved search
resultSet.forEachResult(function(searchResult)
{
var InterID=(searchResult.getValue('InternalID')); // process- search
var LINEINX=loadCategory.getLineItemCount('presentationitem');
loadCategory.insertLineItem("presentationitem",LINEINX);
loadCategory.setLineItemValue("presentationitem", "presentationitem", LINEINX, InterID+'INVTITEM'); //--- Sets the line value.-jf
nlapiSubmitRecord(loadCategory , true);
return true; // return true to keep iterating
});
}
nlapiLoadRecord uses 5 units, nlapiLoadSearch uses 5, then actually it is resultSet.forEachResult that uses another 10. On top of that, you are running nlapiSubmitRecord for each search result, which will use 10 more units for each result.
It looks to me like all you are doing with your search results is adding line items to the Category record. You do not need to submit the record until you are completely done adding all the lines. Right now, you are submitting the record after every line you add.
Move the nlapiSubmitRecord after your forEachResult call. This will reduce your governance (and especially your execution time) from 10 units per search result to just 10 units.
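A sketch of the restructured function, keeping the record and search IDs from the question; the only change is the single submit after the loop:

function walkCat2(catId, pad) {
    var loadCategory = nlapiLoadRecord("sitecategory", "14958149"); // 5 units
    var dupRecords = nlapiLoadSearch('Item', '1951');               // 5 units
    var resultSet = dupRecords.runSearch();
    resultSet.forEachResult(function(searchResult) {                // 10 units
        var InterID = searchResult.getValue('InternalID');
        var LINEINX = loadCategory.getLineItemCount('presentationitem');
        loadCategory.insertLineItem("presentationitem", LINEINX);
        loadCategory.setLineItemValue("presentationitem", "presentationitem", LINEINX, InterID + 'INVTITEM');
        return true; // keep iterating; no submit inside the loop
    });
    nlapiSubmitRecord(loadCategory, true); // single 10-unit submit at the end
}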
Different APIs have different costs associated with them [see SuiteAnswers ID 10365]. Also, different types of scripts (user, scheduled, etc.) have different maximum limits on the total usage allowed [see SuiteAnswers ID 10481].
Your script should consume less than that limit else NetSuite will throw an error.
You can use the following line of code to measure your remaining usage at different points in your code.
nlapiLogExecution('AUDIT', 'Script Usage', 'RemainingUsage:'+nlapiGetContext().getRemainingUsage());
One strategy to avoid the maximum-usage-exceeded exception is to change the type of script to a scheduled script, since that has the largest limit. Given that your loop is working off a search, the result set could be huge, and that may cause even a scheduled script to exceed its limits. In such a case, you would want to introduce checkpoints in your code and make it reentrant. That way, if you see that nlapiGetContext().getRemainingUsage() is less than your threshold, you can offload the remaining work to a subsequent scheduled script.
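A hedged sketch of that checkpointing idea in a SuiteScript 1.0 scheduled script, using the yield/recovery-point APIs; the 200-unit threshold is only illustrative:

// Inside the scheduled script's processing loop: when the remaining usage
// drops below a chosen threshold, save a recovery point and yield so NetSuite
// requeues the script and resumes it from the checkpoint.
resultSet.forEachResult(function(searchResult) {
    // ... process the result ...
    if (nlapiGetContext().getRemainingUsage() < 200) { // illustrative threshold
        nlapiSetRecoveryPoint();
        nlapiYieldScript();
    }
    return true; // keep iterating
});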
I'm building a leaderboard using Firebase. The player's position in the leaderboard is tracked using Firebase's priority system.
At some point in my program's execution, I need to know what position a given user is at in the leaderboard. I might have thousands of users, so iterating through all of them to find an object with the same ID (thus giving me the index) isn't really an option.
Is there a more performant way to determine the index of an object in an ordered list in Firebase?
edit: I'm trying to figure out the following:
/
---- leaderboard
--------user4 {...}
--------user1 {...}
--------user3 {...} <- what is the index of user3, given a snapshot of user3?
--------...
If you are processing tens or hundreds of elements and don't mind taking a bandwidth hit, see Kato's answer.
If you're processing thousands of records, you'll need to follow the approach outlined in principle in pperrin's answer. The following answer details that.
Step 1: setup Flashlight to index your leaderboard with ElasticSearch
Flashlight is a convenient node script that syncs elasticsearch with Firebase data.
Read about how to set it up here.
Step 2: modify Flashlight to allow you to pass query options to ElasticSearch
As of this writing, Flashlight gives you no way to tell ElasticSearch that you're only interested in the number of documents matched and not the documents themselves.
I've submitted this pull request which uses a simple one-line fix to add this functionality. If it isn't closed by the time you read this answer, simply make the change in your copy/fork of flashlight manually.
Step 3: Perform the query!
This is the query I sent via Firebase:
{
index: 'firebase',
type: 'allTime',
query: {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"points": {
"gte": minPoints
}
}
}
}
},
options: {
"search_type": "count"
}
};
Replace points with the name of the field tracking points for your users, and minPoints with the number of points of the user whose rank you are interested in.
The response will look something like:
{
max_score: 0,
total: 2
}
total is the number of users who have the same or greater number of points -- in other words, the user's rank!
Since Firebase stores objects, not arrays, the elements do not have an "index" in the list; JavaScript and, by extension, JSON objects are inherently unordered. As explained in Ordered Docs and demonstrated in the leaderboard example, you accomplish ordering by using priorities.
A set operation:
var ref = new Firebase('URL/leaderboard');
ref.child('user1').setPriority( newPosition /*score?*/ );
A read operation:
var ref = new Firebase('URL/leaderboard');
ref.child('user1').once('value', function(snap) {
console.log('user1 is at position', snap.getPriority());
});
To get the info you want, at some point a process is going to have to enumerate the nodes to count them. So the question is then where/when the counting takes place.
Using .count() in the client will mean it is done every time it is needed; it will be pretty accurate, but processing- and traffic-heavy.
If you keep a separate index of the count, it will need regular refreshing or constant updating (each insert causing a shuffling up of the remaining entries).
Depending on the distribution and volume of your data I would be tempted to go with a background process that just updates(/rebuilds) the index every (say) ten or twenty additions. And indexes every (say) 10 positions.
"Leaderboard",$UserId = priority=$score
...
"Rank",'10' = $UserId,priority=$score
"Rank",'20' = $UserId,priority=$score
...
From a score you get the rank to within ten, and then, using a startAt/endAt/count on your "Leaderboard", you get it down to the exact position.
If your background process is monitoring the updates to the leaderboard, it could be more intelligent about its updates to the index, for example updating only as required.
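A rough sketch of that lookup with the legacy (priority-based) Firebase JS API used elsewhere in this thread; coarseRank, bucketScore, and userScore are hypothetical values read from the "Rank" index:

// Count the Leaderboard entries whose priority (score) lies between the
// bucket boundary and the user's score, then offset the coarse rank by that
// count to get the exact position.
var boardRef = new Firebase('URL/Leaderboard');

function refineRank(coarseRank, bucketScore, userScore, callback) {
    boardRef.startAt(bucketScore).endAt(userScore).once('value', function(snap) {
        // whether to add or subtract depends on which way your scores sort
        callback(coarseRank - snap.numChildren());
    });
}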
I know this is an old question, but I just wanted to share my solutions for future reference. First of all, the Firebase ecosystem has changed quite a bit, and I'm assuming the current best practices (i.e. Firestore and serverless functions). I personally considered these solutions while building a real application, and ended up picking the scheduled approximated ranks.
Live ranks (most up-to-date, but expensive)
When preparing a user leaderboard I make a few assumptions:
The leaderboard ranks users based on a number which I'll call 'score' from now on
New users rank lowest on the leaderboard, so upon user creation, their rank is set to the total user count (with a Firebase function, which sets the rank, but also increases the 'total user' counter by 1).
Scores can only increase (with a few adaptations decreasing scores can also be supported).
Deleted users keep a 'ghost' spot on the leaderboard.
Whenever a user increases their score, a Firebase function responds to this change by querying all surpassed users (whose score is >= the user's old score but < the user's new score) and decreasing their rank by 1. The user's own rank is increased by the size of the aforementioned query.
The rank is now immediately available on client reads. However, the ranking updates inside of the proposed functions are fairly read- and write-heavy. The exact number of operations depends greatly on your application, but for my personal application a great frequency of score changes and relative closeness of scores rendered this approach too inefficient. I'm curious if anyone has found a more efficient (live) alternative.
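A hedged sketch of such a function with Cloud Functions and the Firestore Admin SDK, following the convention described above; the 'users' collection and the 'score'/'rank' field names are assumptions, and a single batch only holds up to 500 writes:

// On a score increase, decrement the rank of every surpassed user and bump
// the user's own rank by the number of users surpassed.
const functions = require("firebase-functions");
const admin = require("firebase-admin");

admin.initializeApp();
const db = admin.firestore();

exports.onScoreIncrease = functions.firestore
  .document("users/{userId}")
  .onUpdate(async (change, context) => {
    const before = change.before.data();
    const after = change.after.data();
    if (after.score <= before.score) return null; // scores only increase

    // users whose score is >= the old score but < the new score were surpassed
    const surpassed = await db
      .collection("users")
      .where("score", ">=", before.score)
      .where("score", "<", after.score)
      .get();

    const batch = db.batch(); // a single batch holds at most 500 writes
    surpassed.docs.forEach((doc) => {
      if (doc.id === context.params.userId) return; // skip the user themselves
      batch.update(doc.ref, { rank: admin.firestore.FieldValue.increment(-1) });
    });
    batch.update(change.after.ref, {
      rank: admin.firestore.FieldValue.increment(surpassed.size),
    });
    return batch.commit();
  });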
Scheduled ranks (simplest, but expensive and periodic)
Schedule a Firebase function to simply sort the entire user collection by ascending score and write back the rank for each (in a batch update). This process can be repeated daily, or more frequent/infrequent depending on your application. For N users, the function always makes N reads and N writes.
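A minimal sketch of that scheduled job, with the same assumed 'users'/'score'/'rank' names; for more than 500 users the writes would need to be spread across several batches:

// Sort all users by ascending score and write back each user's rank.
const functions = require("firebase-functions");
const admin = require("firebase-admin");
const db = admin.firestore(); // assumes admin.initializeApp() ran elsewhere

exports.updateRanks = functions.pubsub
  .schedule("every 24 hours")
  .onRun(async () => {
    const snapshot = await db.collection("users").orderBy("score", "asc").get();
    const batch = db.batch();
    snapshot.docs.forEach((doc, index) => {
      // lowest score gets rank N, highest score gets rank 1
      batch.update(doc.ref, { rank: snapshot.size - index });
    });
    return batch.commit();
  });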
Scheduled approximated ranks (cheapest, but non-precise and periodic)
As an alternative to the 'Scheduled ranks' option, I would suggest an approximation technique: instead of writing each user's exact rank on each scheduled update, the collection of users (still sorted as before) is simply split into M chunks of equal size, and the scores that bound these chunks are written to a separate 'stats' collection.
So, for example: if we use M = 3 for simplicity and we read 60 users sorted by ascending score, we have three chunks of 20 users. For each of the (still sorted) chunks we get the scores of its first and last users, i.e. the range that contains all scores of that chunk. Let's say that the chunk with the lowest scores has scores ranging from 20-120, the second chunk has scores from 130-180, and the chunk with the highest scores has scores 200-350. We now simply write these ranges to a 'stats' collection (the write count is reduced to 1, no matter how many users!).
Upon rank retrieval, the user simply reads the most recent 'stats' document and approximates their percentile rank by comparing the ranges with their own score. Of course it is possible that a user scores higher than the greatest score or lower than the lowest score from the previous 'stats' update, but I would just consider them belonging to the highest scoring group and the lowest scoring group respectively.
In my own application I used M = 20 and could therefore show the user percentile ranks by 5% accuracy, and estimate even within that range using linear interpolation (for example, if the user score is 450 and falls into the 40%-45%-chunk ranging from 439-474, we estimate the user's percentile rank to be 40 + (450 - 439) / (474 - 439) * 5 = 41.57...%).
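A small sketch of that client-side estimate; 'ranges' is assumed to be the array of { min, max } score ranges from the 'stats' document, sorted from lowest to highest scores:

// Estimate a user's percentile rank from the chunk ranges, using linear
// interpolation within the chunk that contains the score.
function approximatePercentile(score, ranges) {
  const chunkSize = 100 / ranges.length; // e.g. 5% per chunk when M = 20
  if (score < ranges[0].min) return 0;                    // below the lowest measured score
  if (score > ranges[ranges.length - 1].max) return 100;  // above the highest
  for (let i = 0; i < ranges.length; i++) {
    const { min, max } = ranges[i];
    if (score <= max) {
      const fraction = max === min ? 0 : Math.max(0, (score - min) / (max - min));
      return i * chunkSize + fraction * chunkSize;
    }
  }
  return 100;
}

// Example from the answer: a score of 450 inside the 40%-45% chunk (439-474)
// gives 40 + (450 - 439) / (474 - 439) * 5 ≈ 41.57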
If you want to get real fancy you can also estimate exact percentile ranks by fitting your expected score distribution (e.g. normal distribution) to the measured ranges.
Note: all users DO need to read the 'stats' document to approximate their rank. However, in most applications not all users actually view the statistics (as they are either not active daily or just not interested in the stats). Personally, I also used the 'stats' document (named differently) for storing other DB values that are shared among users, so this document is already retrieved anyways. Besides that, reads are 3x cheaper than writes. Worst case scenario is 2N reads and 1 write.