I have a DocumentDB stored procedure that does an insert or update (not a replace; it reads and updates the existing document). The stored procedure does at most two operations:
query by Id and
either insert or update
The document is also not particularly large. However, every now and then I get either a timeout (caused by bounded execution) or a 449 (conflict updating resources, which is a transient error).
IMO this isn't a particularly taxing stored procedure, but it seems that I'm running into limitations already. I could do more work client side, but I love the ACID guarantee in the stored procedure.
Is there any quantitative measure on bounded execution? I'm wondering if I'm simply doing things wrong or if I have indeed hit a limit of DocumentDB.
My stored procedure is a modified version of https://github.com/Azure/azure-documentdb-js-server/blob/master/samples/stored-procedures/update.js that takes in a document instead of an id. I'm using "$addToSet" in particular, and the code looks like this:
function unique(arr) {
    var uniqueArr = [], map = {};
    for (var i = 0; i < arr.length; i++) {
        var exists = map[arr[i]];
        if (!exists) {
            uniqueArr.push(arr[i]);
            map[arr[i]] = true;
        }
    }
    return uniqueArr;
}

// The $addToSet operator adds elements to an array only if they do not already exist in the set.
function addToSet(document, update) {
    var fields, i;
    if (update.$addToSet) {
        console.log(">addToSet");
        fields = Object.keys(update.$addToSet);
        for (i = 0; i < fields.length; i++) {
            if (!Array.isArray(document[fields[i]])) {
                // Validate the document field; throw an exception if it is not an array.
                throw new Error("Bad $addToSet parameter - field in document must be an array.");
            }
            // convert to array if the input is not an array
            var newIds = Array.isArray(update.$addToSet[fields[i]])
                ? update.$addToSet[fields[i]]
                : [update.$addToSet[fields[i]]];
            var finalIds = unique(document[fields[i]].concat(newIds));
            document[fields[i]] = finalIds;
        }
    }
}
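For illustration, here's a minimal call showing the shape of the input I pass in (the document and tag values below are made up just for the example):

var doc = { id: "item-1", tags: ["a", "b"] };          // hypothetical existing document
var update = { $addToSet: { tags: ["b", "c"] } };      // hypothetical update
addToSet(doc, update);
// doc.tags is now ["a", "b", "c"]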
DocumentDB stored procedures must complete within 5 seconds. They are also limited by the provisioned throughput of the collection. If you have 5000 RU/s provisioned, then the stored procedure cannot consume more than 5000 * 5 RUs in total.
When a stored procedure reaches its execution time or its throughput limit, any request to perform a database operation (read, write, query) will receive a pre-emption signal, i.e. the request will not be accepted; this is the signal for the stored procedure to wrap up execution and return to the caller. If you check the return code from each call, your stored procedure will never time out. Here's a snippet showing how to do this (full samples are available at https://github.com/Azure/azure-documentdb-js-server/blob/master/samples/stored-procedures/):
var isAccepted = collection.replaceDocument(..., function (err, doc) {
    // additional logic in callback
});
if (!isAccepted) {
    // wrap up execution and return
}
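The same check applies to the query and the write calls in a read-then-update sproc like yours. Here's a rough sketch (not the actual update.js sample) - the query text and the completed flag in the response body are assumptions made for illustration:

function upsertSketch(doc) {
    var collection = getContext().getCollection();
    var response = getContext().getResponse();

    var accepted = collection.queryDocuments(
        collection.getSelfLink(),
        { query: "SELECT * FROM root r WHERE r.id = @id", parameters: [{ name: "@id", value: doc.id }] },
        {},
        function (err, docs) {
            if (err) throw err;
            if (docs.length === 0) {
                // not found -> insert; check the boolean here as well
                if (!collection.createDocument(collection.getSelfLink(), doc, function (e) { if (e) throw e; })) {
                    response.setBody({ completed: false }); // tell the caller to retry
                }
            } else {
                // found -> merge your $addToSet changes into docs[0] here, then replace
                if (!collection.replaceDocument(docs[0]._self, docs[0], function (e) { if (e) throw e; })) {
                    response.setBody({ completed: false });
                }
            }
        });

    if (!accepted) {
        // the query itself was not accepted (time/RU budget reached before any work was done)
        response.setBody({ completed: false });
    }
}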
Regarding 449: this is a concurrency error that can be returned if your stored procedure attempts to perform a conflicting write. It is side-effect free and safe to retry from the client. You can implement a retry-until-succeeded pattern whenever you run into this error.
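For example, a minimal client-side retry sketch, assuming the Node 'documentdb' SDK's executeStoredProcedure(sprocLink, params, callback) and that the error object exposes the HTTP status as err.code (adjust to whichever SDK you actually use):

function executeWithRetry(client, sprocLink, params, retriesLeft, callback) {
    client.executeStoredProcedure(sprocLink, params, function (err, result) {
        if (err && err.code === 449 && retriesLeft > 0) {
            // 449 is transient and side-effect free, so it is safe to simply try again
            setTimeout(function () {
                executeWithRetry(client, sprocLink, params, retriesLeft - 1, callback);
            }, 100); // small fixed backoff; exponential backoff also works
            return;
        }
        callback(err, result);
    });
}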
Related
I have an SPA and for technical reasons I have different elements potentially firing the same fetch() call pretty much at the same time.[1]
Rather than going insane trying to prevent multiple unrelated elements from orchestrating the loading of data, I am thinking about creating a globalFetch() call where:
the init argument is serialised (along with the resource parameter) and used as hash
when a request is made, it's queued and its hash is stored
when another request comes in and the hash matches (which means it's in flight), another request will NOT be made, and it will piggyback on the previous one
async function globalFetch(resource, init) {
    const sigObject = { ...init, resource }
    const sig = JSON.stringify(sigObject)
    // If it's already happening, return that one
    if (globalFetch.inFlight[sig]) {
        // NOTE: I know I don't yet have sig.timeStamp, this is just to show
        // the logic
        if (Date.now - sig.timeStamp < 1000 * 5) {
            return globalFetch.inFlight[sig]
        } else {
            delete globalFetch.inFlight[sig]
        }
    }
    const ret = globalFetch.inFlight[sig] = fetch(resource, init)
    return ret
}
globalFetch.inFlight = {}
It's obviously missing a way to have the requests' timestamps. Plus, it's missing a way to delete old requests in batch. Other than that... is this a good way to go about it?
Or, is there something already out there, and I am reinventing the wheel...?
[1] If you are curious, I have several location-aware elements which will reload data independently based on the URL. It's all nice and decoupled, except that it's a little... too decoupled. Nested elements (with partially matching URLs) needing the same data potentially end up making the same request at the same time.
Your concept will generally work just fine.
Some things missing from your implementation:
Failed responses should either not be cached in the first place or removed from the cache when you see the failure. And failure is not just rejected promises, but also any request that doesn't return an appropriate success status (probably a 2xx status).
JSON.stringify(sigObject) is not a canonical representation of the exact same data because properties might not be stringified in the same order depending upon how the sigObject was built. If you grabbed the properties, sorted them, and inserted them in sorted order onto a temporary object and then stringified that, it would be more canonical.
I'd recommend using a Map object instead of a regular object for globalFetch.inFlight because it's more efficient when you're adding/removing items regularly and will never have any name collision with property names or methods (though your hash would probably not conflict anyway, but it's still a better practice to use a Map object for this kind of thing).
Items should be aged from the cache (as you apparently know already). You can just use a setInterval() that runs every so often (it doesn't have to run very often - perhaps every 30 minutes) and iterates through all the items in the cache, removing any that are older than some amount of time. Since you're already checking the time when you find one, you don't have to clean the cache very often - you're just trying to prevent a non-stop build-up of stale data that is never re-requested, and therefore never gets replaced with newer data or read from the cache again.
If you have any case insensitive properties or values in the request parameters or the URL, the current design would see different case as different requests. Not sure if that matters in your situation or not or if it's worth doing anything about it.
When you write the real code, you need Date.now(), not Date.now.
Here's a sample implementation that implements all of the above (except for case sensitivity because that's data-specific):
function makeHash(url, obj) {
    // put properties in sorted order to make the hash canonical
    // the canonical sort is top level only,
    // does not sort properties in nested objects
    let items = Object.entries(obj).sort((a, b) => b[0].localeCompare(a[0]));
    // add URL on the front
    items.unshift(url);
    return JSON.stringify(items);
}

async function globalFetch(resource, init = {}) {
    const key = makeHash(resource, init);

    const now = Date.now();
    const expirationDuration = 5 * 1000;
    const newExpiration = now + expirationDuration;

    const cachedItem = globalFetch.cache.get(key);
    // if we found an item and it expires in the future (not expired yet)
    if (cachedItem && cachedItem.expires >= now) {
        // update expiration time
        cachedItem.expires = newExpiration;
        return cachedItem.promise;
    }

    // couldn't use a value from the cache
    // make the request
    let p = fetch(resource, init);
    p.then(response => {
        if (!response.ok) {
            // if response not OK, remove it from the cache
            globalFetch.cache.delete(key);
        }
    }, err => {
        // if promise rejected, remove it from the cache
        globalFetch.cache.delete(key);
    });
    // save this promise (will replace any expired value already in the cache)
    globalFetch.cache.set(key, { promise: p, expires: newExpiration });
    return p;
}

// initialize cache
globalFetch.cache = new Map();

// clean up interval timer to remove expired entries
// does not need to run that often because .expires is already checked above
// this just cleans out old expired entries to avoid memory increasing
// indefinitely
globalFetch.interval = setInterval(() => {
    const now = Date.now();
    for (const [key, value] of globalFetch.cache) {
        if (value.expires < now) {
            globalFetch.cache.delete(key);
        }
    }
}, 10 * 60 * 1000); // run every 10 minutes
Implementation Notes:
Depending upon your situation, you may want to customize the cleanup interval time. This is set to run a cleanup pass every 10 minutes just to keep it from growing unbounded. If you were making millions of requests, you'd probably run that interval more often or cap the number of items in the cache. If you aren't making that many requests, this can be less frequent. It is just to clean up old expired entries sometime so they don't accumulate forever if never re-requested. The check for the expiration time in the main function already keeps it from using expired entries - that's why this doesn't have to run very often.
This looks at response.ok from the fetch() result and at promise rejection to determine a failed request. There could be some situations where you want to customize what is and isn't a failed request with different criteria than that. For example, it might be useful to cache a 404 to prevent repeating it within the expiration time if you don't think the 404 is likely to be transitory. This really depends upon your specific use of the responses and the behavior of the specific host you are targeting. The reason not to cache failed results is for cases where the failure is transitory (either a temporary hiccup or a timing issue), and you want a new, clean request to go out if the previous one failed.
There is a design question for whether you should or should not update the .expires property in the cache when you get a cache hit. If you do update it (like this code does), then an item could stay in the cache a long time if it keeps getting requested over and over before it expires. But, if you really want it to only be cached for a maximum amount of time and then force a new request, you can just remove the update of the expiration time and let the original result expire. I can see arguments for either design depending upon the specifics of your situation. If this is largely invariant data, then you can just let it stay in the cache as long as it keeps getting requested. If it is data that can change regularly, then you may want it to be cached no more than the expiration time, even if it's being requested regularly.
Consider using a ServiceWorker or Workbox to separate caching logic from your application. The Stale-While-Revalidate strategy could apply here.
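If you go the ServiceWorker route, a hand-rolled stale-while-revalidate handler is only a few lines. This is a minimal sketch (not Workbox), assuming you only want it for GET requests and that your data lives under an /api/ path prefix (that prefix is made up):

// service-worker.js
self.addEventListener('fetch', (event) => {
    const url = new URL(event.request.url);
    if (event.request.method !== 'GET' || !url.pathname.startsWith('/api/')) return;

    event.respondWith(
        caches.open('api-cache').then(async (cache) => {
            const cached = await cache.match(event.request);
            const network = fetch(event.request).then((response) => {
                if (response.ok) cache.put(event.request, response.clone());
                return response;
            });
            // serve the cached copy immediately if there is one, otherwise wait for the network
            return cached || network;
        })
    );
});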
I am trying to implement a trigger on an Azure DocumentDb collection which is supposed to auto-increment the version of the document being inserted. The trigger is created as a pre-trigger.
The challenge I am facing is that the collection class doesn't seem to provide a synchronous API for querying data. My plan for the trigger was to query the existing documents, get the top version, increment it, and assign the +1 value to the document being inserted into the collection. But since the result of the query is only available asynchronously, by that time my trigger has already completed and the document is inserted unmodified.
How can I await the query result?
Here is what my current trigger looks like:
// TRIGGER Auto increment version
function autoIncrementVersion() {
    var collection = getContext().getCollection();
    var request = getContext().getRequest();
    var docToCreate = request.getBody();

    // Reject documents that do not have a Version property by throwing an exception.
    if (!docToCreate.Version) {
        throw new Error('Document must include a "Version" property.');
    }

    var lastVersion;
    var filter = "SELECT TOP 1 d.Version FROM CovenantsDocuments d ORDER BY d.Version DESC";
    var result = collection.queryDocuments(collection.getSelfLink(), filter, {},
        function (err, documents, responseOptions) {
            if (err) throw new Error("Error: " + err.message);

            if (documents.length != 1 || !documents[0]) {
                lastVersion = 0;
            } else {
                lastVersion = documents[0].Version;
            }
            //By the time we reach this line, our trigger has already completed?
            docToCreate.Version = lastVersion + 1;
        });

    if (!result) throw "Unable to read last version of the document";
}
UPDATE: The issue was with the way I was submitting the request. It looks like triggers are not fired by default; their names need to be explicitly provided as an argument to the request.
In my case the trigger wasn't firing until I changed the client code to this:
RequestOptions options = new RequestOptions
{
    PreTriggerInclude = new[] { "autoIncrementVersion" }
};

client.CreateDocumentAsync(url, document, options);
What you have is close. The trigger will automatically wait until all pending async operations either complete, fail, or time out before returning. The only thing that I can see missing is that you never call request.setBody(docToCreate) after you alter docToCreate.
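In other words, the end of the query callback would become something like this:

docToCreate.Version = lastVersion + 1;
request.setBody(docToCreate); // without this, the original, unmodified body is what gets inserted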
That said, I'm not 100% certain that this approach is safe. All operations inside of a trigger, sproc, or UDF are atomic, but I'm not sure that the combination of a pre-trigger plus a write operation is atomic. The risk is that two simultaneous writes will both run and complete the trigger part, which would give them the same .Version. You would probably have to ask the DocumentDB Product Managers to confirm this. They hang out here, so they may respond.
If you find that it's not atomic, then you can move everything (read to find latest version and write) into a stored procedure (sproc).
You might also consider creating a single document whose id you hard code to something like 'LAST_VERSION' to hold the last used version. That means that every write will result in a read + two writes (one for the document and one to update this document), but it may be more efficient than your query + one write approach. You could do all of this in one sproc, or you could use a pre-trigger (to fetch 'LAST_VERSION') + write operation + post-trigger (to update the 'LAST_VERSION' document), depending upon what the Product Managers say about atomicity.
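A rough sketch of that single-counter-document idea as one sproc - the 'LAST_VERSION' id, the value property, and the assumption that the counter document already exists are illustrative choices, not something DocumentDB prescribes:

function createWithVersionSketch(doc) {
    var collection = getContext().getCollection();

    var accepted = collection.queryDocuments(
        collection.getSelfLink(),
        "SELECT * FROM root r WHERE r.id = 'LAST_VERSION'",
        {},
        function (err, results) {
            if (err) throw err;
            var counter = results[0];      // assumed to exist; create it once up front
            counter.value += 1;
            doc.Version = counter.value;
            // both writes happen inside the same sproc, so they commit or roll back together
            collection.replaceDocument(counter._self, counter, function (e) {
                if (e) throw e;
                collection.createDocument(collection.getSelfLink(), doc, function (e2) {
                    if (e2) throw e2;
                });
            });
        });

    if (!accepted) throw new Error("Query not accepted; retry from the client.");
}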
One more caution about your current approach... Make sure the precision of the index on the Version field is set to -1 (Maximum precision).
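For reference, a range index with maximum precision on that path was declared roughly like this in the indexing policy; treat the exact shape as an assumption and check it against the SDK/API version you are on:

var indexingPolicy = {
    includedPaths: [
        {
            path: "/Version/?",
            indexes: [
                { kind: "Range", dataType: "Number", precision: -1 } // -1 = maximum precision
            ]
        }
    ]
};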
When I run the following code 9999999+ times, Node returns with:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)
What's the best solution to get around this issue, other than increasing the max alloc size or any command line arguments?
I'd like to improve the code quality rather than hack a solution.
The following is the main bulk of recursion within the application.
The application is a load testing tool.
a.prototype.createClients = function(){
    for(var i = 0; i < 999999999; i++){
        this.recursiveRequest();
    }
}

a.prototype.recursiveRequest = function(){
    var self = this;
    self.hrtime = process.hrtime();
    if(!this.halt){
        self.reqMade++;
        this.http.get(this.options, function(resp){
            resp.on('data', function(){})
                .on("connection", function(){
                })
                .on("end", function(){
                    self.onSuccess();
                });
        })
        .on("error", function(e){
            self.onError();
        });
    }
}

a.prototype.onSuccess = function(){
    var elapsed = process.hrtime(this.hrtime),
        ms = elapsed[0] * 1000 + elapsed[1] / 1e6; // [seconds, nanoseconds] -> milliseconds
    this.times.push(ms);
    this.successful++;
    this.recursiveRequest();
}
Looks like you should really be using a queue instead of recursive calls. async.queue offers a fantastic mechanism for processing asynchronous queues. You should also consider using the request module to make your http client connections simpler.
var async = require('async');
var request = require('request');

var load_test_url = 'http://www.testdomain.com/';
var parallel_requests = 1000;

function requestOne(task, callback) {
    request.get(task.url, function(err, connection, body) {
        if(err) return callback(err);
        q.push({url: load_test_url});
        callback();
    });
}

var q = async.queue(requestOne, parallel_requests);

for(var i = 0; i < parallel_requests; i++){
    q.push({url: load_test_url});
}
You can set the parallel_requests variable according to how many simultaneous requests you want to hit the test server with.
You are launching 1 billion "clients" in parallel, each of which performs an HTTP GET request in an endless recursion.
A few remarks:
while your question mentions 10 million clients, your code creates 1 billion clients.
You should replace the for loop by a recursive function, to get rid of the out-of-memory error.
Something along these lines:
a.prototype.createClients = function(i){
    if (i < 999999999) {
        this.recursiveRequest();
        this.createClients(i+1);
    }
}
Then, you probably want to include some delay between the client creations, or between the calls to recursiveRequest. Use setTimeout (see the sketch after these remarks).
You should have a way to stop the recursion (onSuccess and recursiveRequest keep calling each other).
A flow control library like the async node.js module may help.
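For instance, a staggered version of the loop replacement could look like this (the 10 ms delay is an arbitrary illustrative value):

a.prototype.createClients = function(i){
    if (i < 999999999) {
        this.recursiveRequest();
        var self = this;
        // schedule the next client creation instead of recursing synchronously,
        // which also keeps the call stack flat
        setTimeout(function(){ self.createClients(i + 1); }, 10);
    }
}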
10 million is very large... Assuming that the stack supports any number of calls, it should work, but you are likely asking the JavaScript interpreter to load 10 million times quite a bit of memory... and the result is Out of Memory.
Also, I personally do not see why you'd want to have so many requests at the same time (testing a heavy load on a server?). One way to optimize is to NOT create "floating functions", which you are doing a lot. "Floating functions" use their own set of memory on each instantiation.
this.http.get(this.options, function(resp){ ... });
                            ^^^^^^^^^^^^^^
                            +--- allocates memory x 10 million
Here the function(resp)... declaration allocates more memory on each call. What you want to do is:
// either in the global scope:
function r(resp) {...}
this.http.get(this.options, r ...);

// or as a static member:
a.r = function(resp) {...};
this.http.get(this.options, a.r ...);
At least you'll save on all that function memory. That goes for all the functions you declare within the r function, of course. Especially if they are quite large.
If you want to use the this pointer (make r a prototype function) then you can do that:
a.prototype.r = function(resp) {...};
// note that we have to have a small function to use 'that'... probably not a good idea
var that = this;
this.http.get(this.options, function(resp){ that.r(resp); });
To avoid the that reference, you may use an instance saved in a global variable. That somewhat defeats the purpose of using an object, though:
a.instance = new a;
// r() is static, but can access the object as follow:
a.r = function(resp) { a.instance.<func>(); }
Using the instance you can access the object's functions from the static r function. That could be the actual implementation which could make full use of the this reference:
a.r = function(resp) { a.instance.r_impl(); }
According to a comment by Daniel, your problem is that you misuse a for() to count the total number of requests you want to send. This means you can apply a very simple fix to your code as follow:
a.prototype.createClients = function(){
    this.recursiveRequest();
};

a.prototype.recursiveRequest = function(){
    var self = this;
    self.hrtime = process.hrtime();
    if(!this.halt && this.successful < 10000000){
        ...
Your recursion alone is enough to run the test any number of times.
What you never do, though, is quit. You have a halt variable, but it does not look like you ever set it to true. However, to test 10 million times, you want to check the number of requests you have already sent.
My "fix" supposes that onError() fails (is not recursive). You could also change the code to make use of the halt flag, as in:
a.prototype.onSuccess = function(){
    var elapsed = process.hrtime(this.hrtime),
        ms = elapsed[0] * 1000 + elapsed[1] / 1e6; // [seconds, nanoseconds] -> milliseconds
    this.times.push(ms);
    this.successful++;
    if(this.successful >= 10000000)
    {
        this.halt = true;
    }
    this.recursiveRequest();
}
Note here that you will be pushing ms into the times array 10 million times. That's a big array! You may want to keep a running total instead and compute an average at the end:
this.time += ms;
// at the end:
this.average_time = this.time / this.successful;
I'm having trouble using Mongoskin to perform bulk inserting (MongoDB 2.6+) on Node.
var dbURI = urigoeshere;
var db = mongo.db(dbURI, {safe:true});

var bulk = db.collection('collection').initializeUnorderedBulkOp();

for (var i = 0; i < 200000; i++) {
    bulk.insert({number: i}, function() {
        console.log('bulk inserting: ', i);
    });
}

bulk.execute(function(err, result) {
    res.json('send response statement');
});
The above code gives the following warnings/errors:
(node) warning: possible EventEmitter memory leak detected. 51 listeners added. Use emitter.setMaxListeners() to increase limit.
TypeError: Object #<SkinClass> has no method 'execute'
(node) warning: possible EventEmitter memory leak detected. 51 listeners added. Use emitter.setMaxListeners() to increase limit.
TypeError: Object #<SkinClass> has no method 'execute'
Is it possible to use Mongoskin to perform unordered bulk operations? If so, what am I doing wrong?
You can do it, but you need to change your calling conventions, as only the "callback" form will actually return a collection object from which the .initializeUnorderedBulkOp() method can be called. There are also some differences in usage from how you might think this works:
var dbURI = urigoeshere;
var db = mongo.db(dbURI, {safe:true});

db.collection('collection', function(err, collection) {
    var bulk = collection.initializeUnorderedBulkOp();
    var count = 0;

    for (var i = 0; i < 200000; i++) {
        bulk.insert({number: i});
        count++;

        if ( count % 1000 == 0 )
            bulk.execute(function(err, result) {
                // maybe do something with results
                bulk = collection.initializeUnorderedBulkOp(); // reset after execute
            });
    }

    // If the total count was not an exact multiple of 1000
    if ( count % 1000 != 0 )
        bulk.execute(function(err, result) {
            // maybe do something here
        });
});
So the actual "Bulk" methods themselves don't require callbacks and work exactly as shown in the documentation. The exception is .execute(), which actually sends the statements to the server.
While the driver will sort this out for you somewhat, it probably is not a great idea to queue up too many operations before calling execute. They basically build up in memory, and though the driver will only send in batches of 1000 at a time (this is a server limit, and the complete batch must also be under 16MB), you probably want a little more control here, at least to limit memory usage.
That is the point of the modulo tests as shown, but if memory for building the operations and a possibly really large response object are not a problem for you then you can just keep queuing up operations and call .execute() once.
The "response" is in the same format as given in the documentation for BulkWriteResult.
I am inserting resources into the database in a for loop using this insert function.
I am querying first to check whether there is a similar resource in the database already. If so we don't need to recreate it.
for(var i = 0; i < 20; i++){
    !function(i){
        addResource("url", "type", ...)
    }(i) // for local variable with callback
}
function addResource(url, type, callback){
    getResource({"url": url}, function(tx, result){
        console.log(result.rows.length, result.rows.length === 1);
        if(result.rows.length === 0){ // didn't exist yet -> create it && return id
            insertResource(url, type, function(tx, r){
                callback(r.insertId);
            });
        } else if(result.rows.length === 1){ // already exists -> return id
            callback(result.rows.item(0).id);
        } else { // should not happen -> error
            console.error("addResource: Non unique identifier");
        }
    });
}

function insertResource(url, type, callback){
    var query = "INSERT INTO resource(url, type) VALUES (?, ?);";
    insert(query, [url, type], callback);
}
However, when I run this code, the same resource gets added 20 times instead of only once. I suspect that the delay on the execution of the callbacks makes it so that all the "===0" checks pass before any of them are created.
So is there maybe a way to stop this from happening? When I put constraints on the database, the code just stops running when the constraint is violated, which I don't want to happen.
I suspect that the delay on the execution of the callbacks makes it so that all the "===0" checks pass before any of them are created.
Yeah. You have a race condition going between each round by mixing a synchronous for loop with asynchronous getResource() and insert().
The loop starts all 20 rounds in parallel, which are all looking for duplicates at the same time, before any have actually been inserted. They all find their result set empty, so they each insert.
You'll probably want to use an asynchronous iterator, such as async's timesSeries(), so each round is delayed until those before it have completed.
async.timesSeries(20, function (i, done) {
    addResource("url", "type", function (id) {
        // ...
        done(null);
    });
});
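If your 20 calls actually come from an array of resources rather than a bare count, async.eachSeries does the same job; the resources array below is hypothetical:

var resources = [
    { url: "http://example.com/a", type: "image" },  // made-up entries
    { url: "http://example.com/b", type: "video" }
];

async.eachSeries(resources, function (res, done) {
    addResource(res.url, res.type, function (id) {
        // id of the existing or newly created row
        done(null);
    });
}, function (err) {
    // all inserts finished (or err is set)
});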