I am inserting resources into the database in a for loop using this insert function.
I query first to check whether a similar resource already exists in the database; if so, we don't need to recreate it.
for (var i = 0; i < 20; i++) {
    !function(i){
        addResource("url", "type", ...)
    }(i) // for local variable with callback
}
function addResource(url, type, callback){
    getResource({"url": url}, function(tx, result){
        console.log(result.rows.length, result.rows.length === 1);
        if (result.rows.length === 0) { // didn't exist yet -> create it && return id
            insertResource(url, type, function(tx, r){
                callback(r.insertId);
            });
        } else if (result.rows.length === 1) { // already exists -> return id
            callback(result.rows.item(0).id);
        } else { // should not happen -> error
            console.error("addResource: Non unique identifier");
        }
    });
}

function insertResource(url, type, callback){
    var query = "INSERT INTO resource(url, type) VALUES (?, ?);";
    insert(query, [url, type], callback);
}
However, when I run this code, the same resource gets added 20 times instead of only once. I suspect that the delay on the execution of the callbacks makes it so that all the "===0" checks pass before any of them are created.
So is there maybe a way to stop this from happening? When I put constraints on the database, the code just stops running when the constraint is violated, which I don't want to happen.
I suspect that the delay on the execution of the callbacks makes it so that all the "===0" checks pass before any of them are created.
Yeah. You have a race condition between the rounds, caused by mixing a synchronous for loop with the asynchronous getResource() and insert().
The loop starts all 20 rounds in parallel, which are all looking for duplicates at the same time, before any have actually been inserted. They all find their result set empty, so they each insert.
You'll probably want to use an asynchronous iterator, such as async's timesSeries(), so each round is delayed until those before it have completed.
async.timesSeries(20, function (i, done) {
    addResource("url", "type", function (id) {
        // ...
        done(null);
    });
});
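If you would rather not pull in a library, the same "wait for the previous round" idea can be sketched with a plain recursive helper (this assumes the addResource callback signature from the question; it is an illustration, not drop-in code):

(function addNext(i) {
    if (i >= 20) return; // done with all 20 resources
    addResource("url", "type", function (id) {
        // the id is available here; only now do we start the next round
        addNext(i + 1);
    });
})(0);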
I have a DocumentDB stored procedure that does insert or update (not replace but rather reads and update existing document). The stored procedure does at most two operations:
query by Id and
either insert or update
The document is also not particularly large. However, every now and then I get either a timeout (caused by bounded execution) or a 449 (conflict updating resources, which is a transient error).
IMO this isn't a particularly taxing stored procedure, but it seems that I'm already running into limitations. I could do more work client side, but I love the ACID guarantee in the stored procedure.
Is there any quantitative measure on bounded execution? I'm wondering if I'm simply doing things wrong or I have indeed hit limit of DocumentDB.
My stored procedure is a modified version of https://github.com/Azure/azure-documentdb-js-server/blob/master/samples/stored-procedures/update.js that takes in a document instead of an id. I'm using "$addToSet" in particular, and the code looks like this:
function unique(arr) {
    var uniqueArr = [], map = {};
    for (var i = 0; i < arr.length; i++) {
        var exists = map[arr[i]];
        if (!exists) {
            uniqueArr.push(arr[i]);
            map[arr[i]] = true;
        }
    }
    return uniqueArr;
}

// The $addToSet operator adds elements to an array only if they do not already exist in the set.
function addToSet(document, update) {
    var fields, i;
    if (update.$addToSet) {
        console.log(">addToSet");
        fields = Object.keys(update.$addToSet);
        for (i = 0; i < fields.length; i++) {
            if (!Array.isArray(document[fields[i]])) {
                // Validate the document field; throw an exception if it is not an array.
                throw new Error("Bad $addToSet parameter - field in document must be an array.");
            }
            // convert to array if input is not an array
            var newIds = Array.isArray(update.$addToSet[fields[i]])
                ? update.$addToSet[fields[i]]
                : [update.$addToSet[fields[i]]];
            var finalIds = unique(document[fields[i]].concat(newIds));
            document[fields[i]] = finalIds;
        }
    }
}
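For reference, a call into that helper might look like this (the document shape and the "tags" field are made up for illustration):

var doc = { id: "doc1", tags: ["a"] };
addToSet(doc, { $addToSet: { tags: ["a", "b", "c"] } });
// doc.tags is now ["a", "b", "c"] -- existing values are kept, duplicates are dropped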
DocumentDB stored procedures must complete within 5 seconds. They are also limited by the provisioned throughput of the collection. If you have 5000 RU/s provisioned, then the stored procedure cannot consume more than 5000 * 5 RUs in total.
When a stored procedure reaches its execution time or throughput limit, any further request to perform a database operation (read, write, query) receives a pre-emption signal: the request is not accepted, which is the signal for the stored procedure to wrap up execution and return to the caller. If you check the return code from each call, your stored procedure will never time out. Here's a snippet showing how to do this (full samples are available at https://github.com/Azure/azure-documentdb-js-server/blob/master/samples/stored-procedures/):
var isAccepted = collection.replaceDocument(..., function (err, doc) {
    // additional logic in callback
});
if (!isAccepted) {
    // wrap up execution and return
}
Regarding 449, this is a concurrency error that can be returned if your stored procedure attempts to perform a conflicting write. This is side-effect free and safe to retry on from the client. You can implement a retry until succeeded pattern whenever you run into this error.
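A minimal sketch of such a retry loop on the client follows. The executeStoredProcedure call, the sprocLink/doc variables, and the assumption that the error surfaces the HTTP status as err.code are illustrative; adapt them to whatever client call you are actually making:

function executeWithRetry(operation, retriesLeft, callback) {
    operation(function (err, result) {
        if (err && err.code === 449 && retriesLeft > 0) {
            // 449 is transient and side-effect free, so back off briefly and retry the same write
            return setTimeout(function () {
                executeWithRetry(operation, retriesLeft - 1, callback);
            }, 100);
        }
        callback(err, result);
    });
}

// usage sketch:
executeWithRetry(function (done) {
    client.executeStoredProcedure(sprocLink, [doc], done);
}, 3, function (err, result) {
    // handle the final outcome here
});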
I am trying to implement a trigger on an Azure DocumentDB collection that is supposed to auto-increment the version of a document being inserted. The trigger is created as a pre-trigger.
The challenge I am facing is that the collection class doesn't seem to provide a synchronous API for querying data. My plan for the trigger was to query the existing documents, get the top version, increment it, and assign the +1 value to the document being inserted into the collection. But since the result of the query is only available asynchronously, by the time it arrives my trigger has already completed and the document has been inserted unmodified.
How can I await the query result?
Here is what my current trigger looks like:
// TRIGGER Auto increment version
function autoIncrementVersion() {
    var collection = getContext().getCollection();
    var request = getContext().getRequest();
    var docToCreate = request.getBody();

    // Reject documents that do not have a Version property by throwing an exception.
    if (!docToCreate.Version) {
        throw new Error('Document must include a "Version" property.');
    }

    var lastVersion;
    var filter = "SELECT TOP 1 d.Version FROM CovenantsDocuments d ORDER BY d.Version DESC";
    var result = collection.queryDocuments(collection.getSelfLink(), filter, {},
        function (err, documents, responseOptions) {
            if (err) throw new Error("Error: " + err.message);

            if (documents.length != 1 || !documents[0]) {
                lastVersion = 0;
            } else {
                lastVersion = documents[0].Version; // the query projects d.Version, so read the Version property
            }
            // By the time we reach this line, our trigger has already completed?
            docToCreate.Version = lastVersion + 1;
        });

    if (!result) throw "Unable to read last version of the document";
}
UPDATE: The issue was with the way I was submitting the request. It looks like triggers are not fired by default; their names need to be explicitly provided as an argument to the request.
In my case the trigger wasn't firing until I changed the client code to this:
RequestOptions options = new RequestOptions
{
PreTriggerInclude = new[] { "autoIncrementVersion"}
};
client.CreateDocumentAsync(url, document, options);
The trigger will automatically wait until all pending async operations either complete, fail, or time out before returning. What you have is close. The only thing that I can see missing is that you never call request.setBody(docToCreate) after you alter docToCreate.
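For example, the tail of the query callback would become something like this (the setBody call is the important addition):

docToCreate.Version = lastVersion + 1;
request.setBody(docToCreate); // without this, the original body is what gets inserted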
That said, I'm not 100% certain that this approach is safe. All operations inside of a trigger, sproc, or UDF are atomic, but I'm not sure that the combination of a pre-trigger plus a write operation is atomic. The risk is that two simultaneous writes will both run and complete the trigger part, which would give them the same .Version. You would probably have to ask the DocumentDB Product Managers to confirm this. They hang out here, so they may respond.
If you find that it's not atomic, then you can move everything (read to find latest version and write) into a stored procedure (sproc).
You might also consider creating a single document whose id you hard code to something like 'LAST_VERSION' to hold the last used version. That means that every write will result in a read + two writes (one for the document and one to update this counter document), but it may be more efficient than your query + one write approach. You could do all of this in one sproc, or you could use a pre-trigger (to fetch the 'LAST_VERSION') + the write operation + a post-trigger (to update the 'LAST_VERSION' document), depending upon what the Product Managers say about atomicity.
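A rough sketch of that sproc variant, assuming a counter document with id 'LAST_VERSION' already exists in the collection (the function name and the counter's value field are illustrative, and the isAccepted checks are trimmed for brevity):

function createWithVersion(docToCreate) {
    var collection = getContext().getCollection();
    var response = getContext().getResponse();
    var filter = "SELECT * FROM c WHERE c.id = 'LAST_VERSION'";

    collection.queryDocuments(collection.getSelfLink(), filter, {}, function (err, docs) {
        if (err) throw err;
        if (!docs || docs.length !== 1) throw new Error("LAST_VERSION document not found");

        var counter = docs[0];
        counter.value++;                     // bump the last used version
        docToCreate.Version = counter.value; // stamp the new document

        collection.replaceDocument(counter._self, counter, function (err) {
            if (err) throw err;
            collection.createDocument(collection.getSelfLink(), docToCreate, function (err, created) {
                if (err) throw err;
                response.setBody(created);
            });
        });
    });
}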
One more caution about your current approach... Make sure the precision of the index on the Version field is set to -1 (Maximum precision).
I am working on an API through which I should be able to add a list of users as JIRA watchers (so I am talking to the JIRA REST API).
Here is the function to do this:
for (var i = 0; i < votes.totalValue; i++) {
    var voter = {
        user  : votes.rawVoteData[i].user,
        value : votes.rawVoteData[i].value,
        email : votes.rawVoteData[i].email,
        fname : votes.rawVoteData[i].fname,
        lname : votes.rawVoteData[i].lname
    };
    // Add these users as watchers to the Jira
    jira.jira.addWatcher(issueId, voter.user, function (err, result) {
        // TODO: Return to callback
        rep++;
        console.log('user=' + voter.user);
        console.log(result);
    });
}
The votes object holds the list of users shown above. The problem is that when I execute this function, due to the async nature of Node, the for loop finishes first and then all the async calls fire at once (which doesn't work for me, since the JIRA REST API does not seem to support this).
I want to change the above code so that I call addWatcher for one user at a time, and only make the next addWatcher call for the next user once the previous async call has returned (i.e. move on to the next async call one by one for each user, rather than firing all the addWatcher calls at once).
How can I do this?
Please advise,
thanks!
You should take a look at the async module.
You can use its async.eachSeries() function: it takes an array of items and iterates over them, calling a wrapper function with each item; the next call starts only after the previous one has signalled completion, and once all the calls are complete a final function is called.
Another option is async.each() or async.parallel(), in case you don't mind the calls running in parallel.
Credit for this great tutorial
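For the one-at-a-time requirement in this question, a sketch with async.eachSeries could look like this (reusing the votes/jira objects from the question; error handling kept minimal):

var async = require('async');

async.eachSeries(votes.rawVoteData, function (vote, done) {
    // the next addWatcher call starts only after this one has called done()
    jira.jira.addWatcher(issueId, vote.user, function (err, result) {
        console.log('user=' + vote.user);
        done(err);
    });
}, function (err) {
    if (err) {
        // one of the calls failed; the remaining users were skipped
    }
});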
While waiting for the back end devs to implement a "cancel all" feature, which cancels all tasks tracked by the back end, I am attempting to improvise it by cancelling each individual task. The cancel REST service accepts an ID in the form of a data object {transferID: someID}.
I use a FOR loop to iterate over an array of IDs that I have stored elsewhere. Anticipating that people MAY end up with dozens or hundreds of tasks, I wanted to introduce a small delay so as not to exceed the number of HTTP requests the browser can handle, and to reduce the blast of load on the back end CPU. Here is some code with comments for the purpose of this discussion:
ta.api.cancel = function (taskArray, successCallback, errorCallback) {
    // taskArray is ["task1","task2"]
    // this is just the latest attempt. I had an attempt where I didn't bother
    // with this and the results were the same. I THOUGHT there was a "back image"
    // type issue so I tried to instantiate $.ajax into two different variables.
    // It is not a back image issue, though, but one to do with setTimeout.
    ta.xhrObjs = ta.xhrObjs || {};

    for (var i = 0; i < taskArray.length; i++) {
        console.log(taskArray); // confirm that both task1 and task2 are there.

        var theID = taskArray[i];
        var id = {transferID: theID}; // convert to the format understood by REST
        console.log(id); // I see "task1" and then "task2" consecutively... odd,
        // because I expect to see the "inside the setTimeout" logging line next

        setTimeout(function () {
            console.log('inside the setTimeout, my id is: ');
            console.log(id.transferID);
            // "inside the setTimeout, my id is: task2" twice consecutively! Y NO task1?
            ta.xhrObjs[theID] = doCancel(id);
        }, 20 * i);
    }

    function doCancel(id) {
        // a $.Ajax call for "task2" twice, instead of "task1" then "task2" 20ms
        // later. No point debugging the Ajax (though for the record, cache is
        // false!) because the problem is already seen in the 'setTimeout' and
        // fixed by not setting a timeout.
    }
}
Thing is: I know setTimeout makes the containing function execute asynchronously. If I take out the timeout, and just call doCancel in the iterator, it will call it on task1 and then task2. But although it makes the call async, I don't understand why it just does task2 twice. Can't wrap my head around it.
I am looking for a way to get the iterator to make the Ajax calls with a 20ms delay. But I need it to call on both! Anybody see a glaring error that I can fix, or know of a technique?
You must wrap your setTimeout call in a function and pass the id variable into it, like this:
(function (myId, i) {
    setTimeout(function () {
        console.log('inside the setTimeout, my id is: ', myId);
    }, 20 * i);
}(theID, i));
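Tied back to the original loop, that looks roughly like this (keeping doCancel and ta.xhrObjs from the question):

for (var i = 0; i < taskArray.length; i++) {
    (function (theID, i) {
        var id = {transferID: theID};
        setTimeout(function () {
            // each callback now closes over its own theID/id rather than the loop's last value
            ta.xhrObjs[theID] = doCancel(id);
        }, 20 * i);
    }(taskArray[i], i));
}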
This pattern does not create a unique variable1 for each instance of the loop as one might expect.
function () {
    for (var i = 0; i < length; i++) {
        var variable1;
    }
}
In JavaScript, variables are "hoisted". To quote Mozilla:
"Because variable declarations (and declarations in general) are
processed before any code is executed, declaring a variable anywhere
in the code is equivalent to declaring it at the top."
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/var
So it should be re-written as:
function () {
    var variable1;
    for (var i = 0; i < length; i++) {
    }
}
What this means is that after the loop has finished, any asynchronous callbacks that reference this variable will see the last value of the loop.
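A minimal illustration of the effect (every callback logs the last value the loop assigned):

for (var i = 0; i < 3; i++) {
    var value = i;
    setTimeout(function () {
        console.log(value); // logs 2, 2, 2 -- all three callbacks see the final value
    }, 0);
}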
I just came to this awful situation where I have an array of strings, each representing a possibly existing file (e.g. var files = ['file1', 'file2', 'file3']). I need to loop through these file names and check whether each one exists in the current directory; if it does, stop looping and forget the rest of the remaining files. So basically I want to find the first existing file of those, and fall back to a hard-coded message if nothing was found.
This is what I currently have:
var found = false;
files.forEach(function (file) {
    if (found) return;
    try {
        fs.readFileSync(path + file); // blocking read, used only to test for existence
        found = true;
        continueWithStuff();
    } catch (err) {
        // file doesn't exist (or can't be read); try the next one
    }
});
if (found === false) {
    // Handle this scenario.
}
This is bad: it's blocking (readFileSync), thus it's slow.
I can't just supply callback methods for fs.readFile; it's not that simple, because I need to take the first found item... and the callbacks may be called in any random order. I think one way would be to have a callback that increments a counter and keeps a list of found/not-found information, and when it reaches the files.length count, it checks through the found/not-found info and decides what to do next.
This is painful. I do see the performance benefits of evented IO, but this is unacceptable. What choices do I have?
Don't use sync stuff in a normal server environment -- things are single threaded and this will completely lock things up while it waits for the results of this io bound loop. CLI utility = probably fine, server = only okay on startup.
A common library for asynchronous flow control is
https://github.com/caolan/async
async.filter(['file1', 'file2', 'file3'], path.exists, function (results) {
    // results now equals an array of the existing files
});
And if you want to, say, avoid the extra calls to path.exists, then you could pretty easily write a function 'first' that performs the operations until some test succeeds. Similar to https://github.com/caolan/async#until - but you're interested in the output.
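A rough sketch of such a 'first' helper (untested, using the same old-style path.exists callback as above; it reports the first matching item rather than just a boolean):

function first(items, test, callback) {
    (function next(i) {
        if (i >= items.length) return callback(null); // nothing passed the test
        test(items[i], function (ok) {
            if (ok) return callback(items[i]);        // stop at the first match
            next(i + 1);                              // otherwise move on
        });
    })(0);
}

first(files, path.exists, function (file) {
    if (file) continueWithStuff();
    else { /* none of the files exist; fall back to the hard-coded message */ }
});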
The async library is absolutely what you are looking for. It provides pretty much all the types of iteration that you'd want in a nice asynchronous way. You don't have to write your own 'first' function though. Async already provides a 'some' function that does exactly that.
https://github.com/caolan/async#some
async.some(files, path.exists, function (result) {
    if (result) {
        continueWithStuff();
    }
    else {
        // Handle this scenario
    }
});
If you or someone reading this in the future doesn't want to use Async, you can also do your own basic version of 'some.'
function some(arr, func, cb) {
    var count = arr.length - 1;
    (function loop() {
        if (count == -1) {
            return cb(false); // ran out of items without a match
        }
        // note: this walks the array from the last item to the first
        func(arr[count--], function (result) {
            if (result) cb(true);
            else loop();
        });
    })();
}
some(files, path.exists, function (found) {
    if (found) {
        continueWithStuff();
    }
    else {
        // Handle this scenario
    }
});
You can do this without third-party libraries by using a recursive function. Pass it the array of filenames and a pointer, initially set to zero. The function should check for the existence of the indicated (by the pointer) file name in the array, and in its callback it should either do the other stuff (if the file exists) or increment the pointer and call itself (if the file doesn't exist).
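A sketch of that recursive approach (the names are illustrative; path.exists is the old-style existence check used in the other answers):

function checkNext(files, index) {
    if (index >= files.length) {
        // fell off the end: none of the files exist, fall back to the hard-coded message
        return;
    }
    path.exists(files[index], function (exists) {
        if (exists) continueWithStuff();   // found one, forget the rest
        else checkNext(files, index + 1);  // try the next file name
    });
}

checkNext(files, 0);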
Use async.waterfall to control the async calls in node.js. For example, include the async library and use its waterfall call:
var async = require('async');

async.waterfall([
    function (callback) {
        callback(null, taskFirst(rootRequest, rootRequestFrom, rootRequestTo, callback, res));
    },
    function (arg1, callback) {
        if (arg1 !== undefined) {
            callback(null, taskSecond(arg1, rootRequest, rootRequestFrom, rootRequestTo, callback, res));
        }
    }
]);
(Edit: removed sync suggestion because it's not a good idea, and we wouldn't want anyone to copy/paste it and use it in production code, would we?)
If you insist on using async stuff, I think a simpler way to implement this than what you described is to do the following:
var path = require('path'), fileCounter = 0;

function existCB(fileExists) {
    if (fileExists) {
        global.fileExists = fileCounter;
        continueWithStuff();
        return;
    }
    fileCounter++;
    if (fileCounter >= files.length) {
        // none of the files exist, handle stuff
        return;
    }
    path.exists(files[fileCounter], existCB);
}

path.exists(files[0], existCB);