DynamoDB read spikes using scan - javascript

I have a small job that runs every minute and performs a scan on a table with nearly 3,000 rows:
async execute (dialStatus) {
  if (!process.env.DIAL_TABLE) {
    throw new Error('Dial table not found')
  }

  const params = {
    TableName: process.env.DIAL_TABLE,
    FilterExpression: '#name = :name AND #dial_status = :dial_status AND #expires_on > :expires_on',
    ExpressionAttributeNames: {
      '#name': 'name',
      '#dial_status': 'dial_status',
      '#expires_on': 'expires_on'
    },
    ExpressionAttributeValues: {
      ':name': { 'S': this.name },
      ':dial_status': { 'S': dialStatus ? dialStatus : 'received' },
      ':expires_on': { 'N': Math.floor(moment().valueOf() / 1000).toString() }
    }
  }

  console.log('params', params)
  const dynamodb = new AWS.DynamoDB()
  const data = await dynamodb.scan(params).promise()
  return this._buildObject(data)
}
I'm facing a problem with read units and timeouts on DynamoDB. Right now I'm using 50 read units, and it's getting expensive compared to RDS.
The attribute names used in the scan are not my primary key: name is a secondary index and dial_status is a regular attribute in my JSON, but every row has it.
This job runs every minute for a list of parameters (i.e. if I have 10 parameters, I'll perform this scan 10 times per minute).
My table has the following schema:
phone (partition/hash key);
configuration: JSON in String format;
dial_status: String;
expires_on: Number (TTL);
name: String;
origin: String;
The job should get all items matching name and dial_status, limited to 15 items per execution (each minute). Each item is then enqueued on SQS to be processed.
I really need to decrease those read units, but I'm not sure how to optimize this function. I've read about reducing the page size or avoiding scans altogether. What are my alternatives to avoid a scan when I don't have the primary key and I want to return a group of rows?
Any idea how to fix this code so it can be called 10-15 times every minute?

I suggest you create a GSI (Global Secondary Index) with the following keys:
HASH: name_dialStatus
RANGE: expiresOn
As you've probably guessed, the value of the hash key is the concatenation of the two independent fields name and dialStatus.
Now you can use a query on this GSI, which is much more efficient since it doesn't scan the whole table but reads only the items you are interested in:
async execute(dialStatus) {
  if (!process.env.DIAL_TABLE) {
    throw new Error('Dial table not found')
  }

  const params = {
    TableName: process.env.DIAL_TABLE,
    IndexName: 'MY_GSI_NAME',
    // replaces `FilterExpression`
    // always test the partition key for equality!
    KeyConditionExpression: '#pk = :pk AND #sk > :skLow',
    ExpressionAttributeNames: {
      '#pk': 'name_dialStatus', // partition key name
      '#sk': 'expires_on'       // sort key name
    },
    ExpressionAttributeValues: {
      ':pk': { 'S': `${this.name}:${dialStatus || 'received'}` },
      ':skLow': { 'N': Math.floor(moment().valueOf() / 1000).toString() }
    }
  }

  console.log('params', params)
  // Friendly advice: with AWS.DynamoDB.DocumentClient() there would be no need to specify field types.
  const dynamodb = new AWS.DynamoDB();
  // `scan` becomes `query`!
  const data = await dynamodb.query(params).promise();
  return this._buildObject(data);
}
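Note that name_dialStatus is a synthetic attribute: DynamoDB will not concatenate the two fields for you, so it has to be written on every item you want the GSI to cover. A minimal sketch of what that write could look like, assuming the table and attribute names used in this thread and the DocumentClient (my choice, not from the original post):

// Hedged sketch: populate the composite GSI partition key when saving an item.
const docClient = new AWS.DynamoDB.DocumentClient()

async function saveDialItem (item) {
  await docClient.put({
    TableName: process.env.DIAL_TABLE,
    Item: {
      ...item,
      // synthetic GSI hash key: must be kept in sync with `name` and `dial_status`
      name_dialStatus: `${item.name}:${item.dial_status}`
    }
  }).promise()
}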

It is always recommended to design a DynamoDB table around your access patterns, so that you can query it easily with keys (partition key/sort key) and avoid expensive scan operations.
Revisit your table schema if it is not too late.
If it is already too late, create a GSI with "name" as the partition key and "expires_on" as the sort key, projecting only the attributes you need (e.g. "dial_status") so that you query just the required data and lower the consumed read capacity.
If you still do not want to go with option 1 or option 2, run the scan with a rate limiter and allow it only about 25% of the table's read capacity so that you avoid the spike (see the sketch below).
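A minimal sketch of that third option, assuming the AWS SDK for JavaScript v2 (the `.promise()` style used above) and the same scan params as in the question; the budget and page size are illustrative values, not something prescribed by DynamoDB:

// Hedged sketch: paginate the scan, ask DynamoDB for the consumed capacity,
// and sleep between pages so the job stays under a fraction of the table's RCUs.
const READ_BUDGET_PER_SECOND = 12.5 // e.g. 25% of 50 provisioned read units (assumption)
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function rateLimitedScan (params) {
  const dynamodb = new AWS.DynamoDB()
  const items = []
  let ExclusiveStartKey

  do {
    const page = await dynamodb.scan({
      ...params,
      Limit: 100, // small pages keep each burst of consumed capacity low
      ReturnConsumedCapacity: 'TOTAL',
      ExclusiveStartKey
    }).promise()

    items.push(...page.Items)
    ExclusiveStartKey = page.LastEvaluatedKey

    // wait long enough that the units consumed by this page fit in the budget
    const consumed = page.ConsumedCapacity ? page.ConsumedCapacity.CapacityUnits : 0
    await sleep(Math.ceil((consumed / READ_BUDGET_PER_SECOND) * 1000))
  } while (ExclusiveStartKey)

  return items
}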

Related

How can I store a large dataset result by chunks to a csv file in nodejs?

I have a MySQL table of about 10 million records, and I would like to export those records to a CSV file using Node.js.
I know I can run a query to get all records, store the result in a JSON-format variable, and send it to a CSV file using a library like fastcsv in conjunction with createWriteStream. Writing the result to the file is doable using a stream. But what I want to avoid is loading 10 million records into memory (suppose the records have a lot of columns).
What I would like to do is query only a subset of the result (for example 20k rows), store it in the file, then query the next subset (the next 20k rows), append the results to the same file, and continue the process until it finishes. The problem I have right now is that I don't know how to control the execution of the next iteration. According to the debugger, different write operations are being executed at the same time because of the asynchronous nature of Node.js, giving me a file where some lines are mixed (multiple results on the same line) and records are out of order.
I know the total execution time is affected by this approach, but in this case I prefer a controlled process and lower RAM consumption.
For the database query I'm using Sequelize with MySQL, but the idea is the same regardless of the query method.
This is my code so far:
// Store file function receives:
// (String) filename
// (Boolean) headers: true on the first iteration to write the column names
// (json document) jsonData is the information to store in the file
// (Boolean) append: false on the first iteration to create a new file
const storeFile = (filename, headers, jsonData, append) => {
  const flags = append === true ? 'a' : 'w'
  const ws = fs.createWriteStream(filename, { flags, rowDelimiter: '\r\n' })
  fastcsv
    .write(jsonData, { headers })
    .on('finish', () => {
      logger.info(`file=${filename} created/updated successfully`)
    })
    .pipe(ws)
}
// main
let filename = 'test.csv'
let offset = 0
let append = false
let headers = true
const limit = 20000
const totalIterations = Math.ceil(10000000 / limit)

for (let i = 0; i < totalIterations; i += 1) {
  // eslint-disable-next-line no-await-in-loop
  const records = await Record.findAll({
    offset,
    limit,
    raw: true,
  })
  storeFile(filename, headers, records, append)
  headers = false
  append = true
  offset += limit // offset is incremented to get the next subset
}
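One way to get the controlled, one-chunk-at-a-time behaviour described above is to make the write itself awaitable, so the loop does not start the next query until the previous chunk has been flushed. A minimal sketch under that assumption (same fs/fastcsv setup as in the question; storeFileAsync is a hypothetical helper name):

// Hedged sketch: return a Promise that settles when the write stream finishes,
// then `await` it inside the loop so chunks are written strictly in order.
const storeFileAsync = (filename, headers, jsonData, append) =>
  new Promise((resolve, reject) => {
    const flags = append === true ? 'a' : 'w'
    const ws = fs.createWriteStream(filename, { flags })
    ws.on('finish', resolve)
    ws.on('error', reject)
    fastcsv
      .write(jsonData, { headers })
      .on('error', reject)
      .pipe(ws)
  })

// inside the loop:
//   await storeFileAsync(filename, headers, records, append)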

Firestore call extremely slow - how to make it faster?

I'm migrating from MongoDB to Firestore and I've been struggling to make this call faster. To give you some context, this is what every document in the collection looks like:
Document sample: (screenshot not reproduced; see the sketch below)
The "entities" object may have the following fields: "customer", "supplier", "product", "product_group" and/or "origin". What I'm trying to do is build a ranking, e.g. a supplier ranking for docs with origin = "A" and product = "B". That means I have to get from the DB all documents with entities = { origin: "A", product: "B", supplier: exists }.
The collection has more than 300,000 documents, and I make between 4 and 10 calls depending on the "entities", each returning between 0 and a few hundred results. These calls take excessively long to execute in Firestore (between 10 and 30 seconds in total), whereas with MongoDB they took around 2-3 seconds in total. This is what my code looks like as of now (following the previous example, the parameters in the function would be entities = {origin:"A", product:"B"} and entity = "supplier"):
async getRanking(entities: { [key: string]: any }, stage: string, metric: string, organisationId: string, entity: string) {
  let indicators: IndicatorInsight[] = [];
  const entitySet = ['product', 'origin', 'product_group', 'supplier', 'customer'];
  const entitiesKeys = [];

  // this way we always keep the values in the same order as in entitySet
  for (const ent of entitySet) {
    if ([...Object.keys(entities), entity].includes(ent)) {
      entitiesKeys.push(ent);
    }
  }

  let rankingRef = OrganisationRef
    .doc(organisationId)
    .collection('insight')
    .where('stage', '==', stage)
    .where('metric', '==', metric)
    .where('insight_type', '==', 'indicator')
    .where('entities_keys', '==', entitiesKeys);

  for (const ent of Object.keys(entities)) {
    rankingRef = rankingRef.where(`entities.${ent}`, '==', entities[ent]);
  }

  (await rankingRef.get()).forEach(snap => indicators.push(snap.data() as IndicatorInsight));

  return indicators;
}
So my question is: do you have any suggestions on how to improve the structure of the documents and the querying in order to improve the performance of this call? As I mentioned, with MongoDB this was quite fast, 2-3 seconds tops.
I've been told Firestore should be extremely fast, so I guess I'm not doing things right here, and I'd really appreciate your help. Please let me know if you need more details.

Complex Key - Filter by first key and sort only by second key

I have an application that displays services. I would like the user to be able to filter by service name, while the results are always sorted (descending) by date (no sorting on service name).
Also, the following question doesn't contain the appropriate solution:
Couch DB filter by key and sort by another field
Below is a simplified version of the code to give an idea of what I am currently trying (serviceName must be the first key, as I cannot filter exclusively on the second key):
function(doc) {
  if (doc.type && doc.type == "incident") {
    emit([doc.serviceName, doc.date], doc)
  }
};
Here is a snippet from my API, written in Node, that builds the parameters for the query:
if (req.query.search !== "") {
  const search = req.query.search;
  params = {
    limit: limit,
    skip: skip,
    startkey: [search + "Z"],
    endkey: [search],
    descending: true,
    include_docs: true
  };
} else {
  params = { limit: limit, skip: skip, descending: true, include_docs: true };
}
The above code currently filters by service name, but it also sorts by service name before sorting by date. Is there anything I can add (like the snippet below) to force the query to sort the results by date, without me having to do it in my code after I get the result?
const sort = [{'date': 'desc'}];
What I want is something like this when filtering by serviceName:
SELECT * FROM incidents
WHERE serviceName LIKE #search+'%'
ORDER BY date DESC
But when not filtering by servicename:
SELECT * FROM incidents
ORDER BY date DESC
One way to do this is to ensure that the second element of the array you emit (the date-related part) decreases as time proceeds. This can be achieved with a bit of hacking in the map function:
function(doc) {
  if (doc.type && doc.type == "incident") {
    // get the date from the document
    var d = doc.date;
    // remove dash characters
    var d1 = d.replace(/\-/g, '');
    // convert to integer
    var d2 = parseInt(d1);
    // subtract from a number representing the year 3000
    emit([doc.serviceName, 30000000 - d2], doc)
  }
}
You may now query this view in ascending order (without descending=true) and the data will be sorted by serviceName and the time (reversed).
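For completeness, a hedged sketch of how the API parameters from the question might change against this reversed-date view. The "\ufff0" high-Unicode sentinel and the trailing {} in the endkey are common CouchDB prefix-matching conventions, not something taken from the original post:

// Hedged sketch: query the reversed-date view in ascending order when filtering.
// Prefix match on serviceName via startkey/endkey; within each service the
// reversed date puts the newest incidents first.
if (req.query.search !== "") {
  const search = req.query.search;
  params = {
    limit: limit,
    skip: skip,
    startkey: [search],
    endkey: [search + "\ufff0", {}],
    include_docs: true
  };
}
// Note: without a serviceName filter this view still sorts by serviceName first,
// so the unfiltered listing would likely need a separate view keyed by date.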

Get latest record per field in Parse.com JS query

I have a table in my Parse database with columns validFrom and uniqueId. There can be multiple records with the same uniqueId (like a name).
What query do I have to use to get the items with the latest validFrom for a given set of uniqueIds? I tried the following, but this limits my search to 1 item for the entire set rather than 1 item per unique_id:
var UpdateObject = Parse.Object.extend("UpdateObject");
var query = new Parse.Query(UpdateObject);
query.containedIn("unique_id", uniqueIdsArray).select('status', 'unique_id').descending("validFrom").limit(1);
The query semantics are limited, so the only approach is to query for a superset and manipulate the result to what you need. This is better done on the server to limit the transmission of extra objects.
Big caveat: did this with pencil and paper, not a running parse.app, so it may be wrong. But the big idea is to get all of the matching objects for all of the uniqueIds, group them by uniqueId, and then for each group return the one with the maximum validFrom date...
function updateObjectsInSet(uniqueIdsArray) {
  var _ = require('underscore'); // underscore is available in Parse Cloud Code
  var query = new Parse.Query("UpdateObject");
  // find all of the UpdateObjects with the given ids
  query.containedIn("unique_id", uniqueIdsArray);
  query.limit(1000);
  return query.find().then(function(allUpdates) {
    // group the results by id
    var byId = _.groupBy(allUpdates, function(update) { return update.get("unique_id"); });
    // for each group, select the one with the newest validFrom date
    return _.map(byId, function(updates) {
      return _.max(updates, function(update) { return update.get("validFrom").getTime(); });
    });
  });
}
To place this in the cloud, just wrap it:
Parse.Cloud.define("updateObjectsInSet", function(request, response) {
  updateObjectsInSet(request.params.uniqueIdsArray).then(function(result) {
    response.success(result);
  }, function(error) {
    response.error(error);
  });
});
Then use Parse.Cloud.run() from the client to call it.
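A minimal client-side sketch of that call (promise style; the ids are placeholders):

// Hedged sketch: invoke the cloud function from the client.
Parse.Cloud.run("updateObjectsInSet", { uniqueIdsArray: ["id-1", "id-2", "id-3"] })
  .then(function(latestPerId) {
    // one UpdateObject per unique_id, each with the newest validFrom
    console.log(latestPerId);
  }, function(error) {
    console.error(error);
  });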

MongoDB: How to rename a field using regex

I have a field in my documents that is named after its timestamp, like so:
{
  _id: ObjectId("53f2b954b55e91756c81d3a5"),
  domain: "example.com",
  "2014-08-07 01:25:08": {
    A: [
      "123.123.123.123"
    ],
    NS: [
      "ns1.example.com.",
      "ns2.example.com."
    ]
  }
}
This is very impractical for queries, since every document has a different timestamp.
Therefore, I want to rename this field, for all documents, to a fixed name.
However, I need to be able to match the field names using regex, because they are all different.
I tried doing this, but this is an illegal query.
db['my_collection'].update({}, {$rename:{ /2014.*/ :"201408"}}, false, true);
Does someone have a solution for this problem?
SOLUTION BASED ON NEIL LUNN'S ANSWER:
conn = new Mongo();
db = conn.getDB("my_db");

var bulk = db['my_coll'].initializeOrderedBulkOp();
var counter = 0;

db['my_coll'].find().forEach(function(doc) {
  for (var k in doc) {
    if (k.match(/^2014.*/)) {
      print("replacing " + k)
      var unset = {};
      unset[k] = 1;
      bulk.find({ "_id": doc._id }).updateOne({ "$unset": unset, "$set": { WK1: doc[k] } });
      counter++;
    }
  }

  if (counter % 1000 == 0) {
    bulk.execute();
    bulk = db['my_coll'].initializeOrderedBulkOp();
  }
});

if (counter % 1000 != 0)
  bulk.execute();
This is not a mapReduce operation, not unless you want a new collection that consists only of the _id and value fields that are produced from mapReduce output, much like:
"_id": ObjectId("53f2b954b55e91756c81d3a5"),
"value": {
"domain": "example.com",
...
}
}
Which at best is a kind of "server side" reworking of your collection, but of course not in the structure you want.
While there are ways to execute all of the code in the server, please don't try to do so unless you are really in a spot. These ways generally don't play well with sharding anyway, which is usually where people "really are in a spot" for the sheer size of records.
When you want to change things and do it in bulk, you generally have to "loop" the collection results and process the updates while having access to the current document information. That is, in the case where your "update" is "based on" information already contained in fields or structure of the document.
There is therefore no "regex replace" operation available, and there certainly isn't one for renaming a field. So let's loop with bulk operations for the "safest" way of doing this without running the code all on the server.
var bulk = db.collection.initializeOrderedBulkOp();
var counter = 0;

db.collection.find().forEach(function(doc) {
  for (var k in doc) {
    // match on the field *name*, not its value
    if (k.match(/^2014.*/)) {
      var update = { "$unset": {}, "$set": {} };
      update["$unset"][k] = 1;
      update["$set"][ k.replace(/(\d+)-(\d+)-(\d+).+/, "$1$2$3") ] = doc[k];
      bulk.find({ "_id": doc._id }).updateOne(update);
      counter++;
    }
  }

  if (counter % 1000 == 0) {
    bulk.execute();
    bulk = db.collection.initializeOrderedBulkOp();
  }
});

if (counter % 1000 != 0)
  bulk.execute();
So the main pieces are the $unset operator to remove the existing field and the $set operator to create the new field in the document. You need to examine the document content in order to use both the "field name" and the "value", hence the looping: there is no other way.
If you don't have MongoDB 2.6 or greater on the server then the looping concept still remains without the immediate performance benefit. You can look into things like .eval() in order to process on the server, but as the documentation suggests, it really is not recommended. Use with caution if you must.
As you already recognized, value keys are indeed very bad for the MongoDB query language; so bad that what you want to do doesn't work.
But you could do it with a MapReduce. The map and reduce functions wouldn't do anything, but the finalize function would do the conversion in JavaScript.
Or you could write a little program, in a programming language of your choice, which reads all documents from the collection, makes the change, and writes them back. A rough sketch of that approach is below.
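A minimal sketch of that external-program approach, assuming the Node.js MongoDB driver; the connection string, 'my_db', 'my_coll', and the WK1 target field are placeholders taken from this thread, and it writes back with an update ($set/$unset) rather than collection.save:

// Hedged sketch of the "external program" approach.
const { MongoClient } = require('mongodb');

async function renameTimestampFields() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const coll = client.db('my_db').collection('my_coll');

  for await (const doc of coll.find({})) {
    for (const k of Object.keys(doc)) {
      if (/^2014.*/.test(k)) {
        // move the value to a fixed field name and drop the timestamped field
        await coll.updateOne(
          { _id: doc._id },
          { $set: { WK1: doc[k] }, $unset: { [k]: 1 } }
        );
      }
    }
  }

  await client.close();
}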
