How to use unordered bulk inserting with Mongoskin? - javascript

I'm having trouble using Mongoskin to perform bulk inserting (MongoDB 2.6+) on Node.
var dbURI = urigoeshere;
var db = mongo.db(dbURI, {safe:true});

var bulk = db.collection('collection').initializeUnorderedBulkOp();

for (var i = 0; i < 200000; i++) {
    bulk.insert({number: i}, function() {
        console.log('bulk inserting: ', i);
    });
}

bulk.execute(function(err, result) {
    res.json('send response statement');
});
The above code gives the following warnings/errors:
(node) warning: possible EventEmitter memory leak detected. 51 listeners added. Use emitter.setMaxListeners() to increase limit.
TypeError: Object #<SkinClass> has no method 'execute'
Is it possible to use Mongoskin to perform unordered bulk operations? If so, what am I doing wrong?

You can do it, but you need to change your calling convention: only the "callback" form of .collection() actually returns a collection object on which the .initializeUnorderedBulkOp() method can be called. There are also some differences in usage from how you seem to think this works:
var dbURI = urigoeshere;
var db = mongo.db(dbURI, {safe:true});

db.collection('collection', function(err, collection) {
    var bulk = collection.initializeUnorderedBulkOp();
    var count = 0;

    for (var i = 0; i < 200000; i++) {
        bulk.insert({number: i});
        count++;

        if (count % 1000 == 0) {
            bulk.execute(function(err, result) {
                // maybe do something with results
            });
            bulk = collection.initializeUnorderedBulkOp(); // reset after execute
        }
    }

    // If your loop total was not an even multiple of 1000
    if (count % 1000 != 0) {
        bulk.execute(function(err, result) {
            // maybe do something here
        });
    }
});
So the "Bulk" methods themselves don't require callbacks and work exactly as shown in the documentation. The exception is .execute(), which actually sends the statements to the server.
While the driver will sort this out for you somewhat, it is probably not a great idea to queue up too many operations before calling .execute(). The operations build up in memory and, although the driver only sends them in batches of 1000 at a time (this is a server limit, and the complete batch must be under 16MB), you probably want a little more control here, at least to limit memory usage.
That is the point of the modulo tests shown above; but if the memory used to build the operations and a possibly very large response object are not a problem for you, then you can just keep queuing up operations and call .execute() once.
The "response" is in the same format as given in the documentation for BulkWriteResult.


Are JavaScript event loop operations on variables blocking?

In the non-blocking event loop of JavaScript, is it safe to read and then alter a variable? What happens if two processes want to change a variable nearly at the same time?
Example A:
Process 1: Get variable A (it is 100)
Process 2: Get variable A (it is 100)
Process 1: Add 1 (it is 101)
Process 2: Add 1 (it is 101)
Result: Variable A is 101 instead of 102
Here is a simplified example, having an Express route. Let's say the route gets called 1000 times per second:
let counter = 0;

const getCounter = () => {
    return counter;
};

const setCounter = (newValue) => {
    counter = newValue;
};

app.get('/counter', (req, res) => {
    const currentValue = getCounter();
    const newValue = currentValue + 1;
    setCounter(newValue);
});
Example B:
What if we do something more complex like Array.findIndex() and then Array.splice()? Could it be that the found index has become outdated because another event-process already altered the array?
Process A findIndex (it is 12000)
Process B findIndex (it is 34000)
Process A splice index 12000
Process B splice index 34000
Result: Process B removed the wrong index, should have removed 33999 instead
const veryLargeArray = [
    // ...
];

app.get('/remove', (req, res) => {
    const id = req.query.id;
    const i = veryLargeArray.findIndex(val => val.id === id);
    veryLargeArray.splice(i, 1);
});
Example C:
What if we add an async operation into Example B?
const veryLargeArray = [
    // ...
];

app.get('/remove', (req, res) => {
    const id = req.query.id;
    const i = veryLargeArray.findIndex(val => val.id === id);
    someAsyncFunction().then(() => {
        veryLargeArray.splice(i, 1);
    });
});
It was hard to find the right words to describe this question. Please feel free to update the title.
As per @ThisIsNoZaku's link, JavaScript has a "Run To Completion" principle:
Each message is processed completely before any other message is processed.
This offers some nice properties when reasoning about your program, including the fact that whenever a function runs, it cannot be pre-empted and will run entirely before any other code runs (and can modify data the function manipulates). This differs from C, for instance, where if a function runs in a thread, it may be stopped at any point by the runtime system to run some other code in another thread.
A downside of this model is that if a message takes too long to complete, the web application is unable to process user interactions like click or scroll. The browser mitigates this with the "a script is taking too long to run" dialog. A good practice to follow is to make message processing short and if possible cut down one message into several messages.
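To illustrate that last point, here is a minimal sketch (names are illustrative) of cutting one long message into several by yielding to the event loop between chunks:
function processInChunks(items, chunkSize, handleItem) {
    let i = 0;
    const nextChunk = () => {
        const end = Math.min(i + chunkSize, items.length);
        for (; i < end; i++) {
            handleItem(items[i]);
        }
        if (i < items.length) {
            setTimeout(nextChunk, 0); // let clicks, scrolls, and other messages run in between
        }
    };
    nextChunk();
}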
Further reading: https://developer.mozilla.org/en-US/docs/Web/JavaScript/EventLoop
So, for:
Example A: This works perfectly fine as a site counter.
Example B: This works perfectly fine as well, but if many requests happen at the same time, then the last request submitted will be waiting quite some time.
Example C: If another call to /remove is sent before someAsyncFunction finishes, then it is entirely possible that your array will be invalid. The way to resolve this would be to move the index finding into the .then clause of the async function, as sketched below.
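A minimal sketch of that fix for Example C, so the index is computed in the same synchronous block as the splice and cannot go stale:
app.get('/remove', (req, res) => {
    const id = req.query.id;
    someAsyncFunction().then(() => {
        // find the index only after the async work resolves
        const i = veryLargeArray.findIndex(val => val.id === id);
        if (i !== -1) veryLargeArray.splice(i, 1);
    });
});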
IMO, at the cost of latency, this solves a lot of potentially painful concurrency problems. If you must optimise the speed of your requests, then my advice would be to look into different architectures (additional caching, etc).

Speeding up IndexedDB search with Multiple Workers

PROBLEM: I am trying to speed up my IndexedDB searches by using multiple web workers and therefore executing multiple read transactions simultaneously, but it's not really working, and my CPU only gets to around 30-35% utilization. I have a 4-core processor and was hoping that spawning 4 web workers would dramatically reduce the search time.
I am using Firefox 53 with a WebExtension; other browsers are not an option.
DATABASE: I have a data store with about 250,000 records, each with about 30 keys, some of them containing paragraphs of text.
TASK: Perform a string search on a given key to find matching values. Currently, this takes about 90 seconds to do on a single thread. Adding an additional worker reduces that time to about 75 seconds. More workers than that have no noticeable effect. An acceptable time to me would be under 10 seconds (somewhat comparable to an SQL database).
CURRENT STRATEGY: Spawn a worker for each processor, and create a Promise that resolves when the worker sends a message. In each worker, open the database, divide the records up evenly, and search for the string. I do that by starting on the first record if you're the first worker, second record for the second, etc. Then advance by the number of workers. So the first worker checks records 1, 5, 9, etc. Second worker checks 2, 6, 10, etc. Of course, I could also have the first worker check 1-50, second worker check 51-100, etc. (but obviously thousands each).
Using getAll() on a single thread took almost double the time and 4GB of memory. Splitting that into 4 ranges significantly reduces the time down to a total of about 40 seconds after merging the results (the 40 seconds varies wildly every time I run the script).
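A sketch of what splitting getAll() into ranges might look like (the numeric bounds here are illustrative and assume ordered numeric primary keys):
var store = db.transaction("Blah").objectStore("Blah");
var bounds = [0, 62500, 125000, 187500, 250000]; // illustrative split points
for (var r = 0; r < 4; r++) {
    // include the lower bound, exclude the upper so ranges do not overlap
    store.getAll(IDBKeyRange.bound(bounds[r], bounds[r + 1], false, true)).onsuccess = e => {
        // merge e.target.result into the combined result set
    };
}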
Any ideas on how I can make this work, or other suggestions for significantly speeding up the search?
background.js:
var key = whatever, val = something;
var proc = navigator.hardwareConcurrency; // Number of processors
var wPromise = []; // Array of promises (one for each worker)
var workers = [];

/* Create a worker for each processor */
for (var pos = 0; pos < proc; pos++) {
    workers[pos] = new Worker("js/dbQuery.js");
    wPromise.push(
        new Promise(resolve => workers[pos].onmessage = resolve)
    );
    workers[pos].postMessage({key: key, val: val, pos: pos, proc: proc});
}

return Promise.all(wPromise); // Do something once all the workers have finished
dbQuery.js:
onmessage = e => {
    var data = e.data;
    var req = indexedDB.open("Blah", 1);

    req.onsuccess = e => {
        var keyArr = [];
        var db = e.currentTarget.result;

        db.transaction("Blah").objectStore("Blah").index(data.key).openKeyCursor().onsuccess = e => {
            var cursor = e.target.result;
            if (cursor) {
                if (data.pos) {
                    cursor.advance(data.pos); // Start searching at a position based on which web worker
                    data.pos = false;
                }
                else {
                    if (cursor.key.includes(data.val)) {
                        keyArr.push(cursor.primaryKey); // Store key if value is a match
                    }
                    cursor.advance(data.proc); // Advance position based on number of processors
                }
            }
            else {
                db.close();
                postMessage(keyArr);
                close();
            }
        };
    };
};
Any ideas on how I can make this work, or other suggestions for significantly speeding up the search?
You can substitute Promise.race() for Promise.all() to get a Promise that resolves as soon as the first worker finds a match, instead of waiting for all of the Promises passed to Promise.all() to be resolved.
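A sketch of that substitution against the background.js above, assuming each worker is changed to post a message as soon as it finds its first match rather than only at the end:
Promise.race(wPromise).then(e => {
    // e is the message event from the first worker to report a match
    workers.forEach(w => w.terminate()); // stop the remaining workers
    console.log(e.data);
});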

DocumentDB - Quantify bounded execution in stored procedure

I have a DocumentDB stored procedure that does an insert or update (not a replace; it reads and updates the existing document). The stored procedure does at most two operations:
query by Id and
either insert or update
The document is also not particularly large. However, every now and then I get either a timeout (caused by bounded execution) or a 449 (conflict updating resources, which is a transient error).
IMO this isn't a particularly taxing stored procedure, but it seems I'm already running into limitations. I could do more work client side, but I love the ACID guarantee of the stored procedure.
Is there any quantitative measure of bounded execution? I'm wondering if I'm simply doing things wrong or have indeed hit the limits of DocumentDB.
My stored procedure is a modified https://github.com/Azure/azure-documentdb-js-server/blob/master/samples/stored-procedures/update.js that takes in a document instead of an id. I'm using "$addToSet" in particular, and the code looks like this:
function unique(arr) {
    var uniqueArr = [], map = {};
    for (var i = 0; i < arr.length; i++) {
        var exists = map[arr[i]];
        if (!exists) {
            uniqueArr.push(arr[i]);
            map[arr[i]] = true;
        }
    }
    return uniqueArr;
}

// The $addToSet operator adds elements to an array only if they do not already exist in the set.
function addToSet(document, update) {
    var fields, i;

    if (update.$addToSet) {
        console.log(">addToSet");
        fields = Object.keys(update.$addToSet);
        for (i = 0; i < fields.length; i++) {
            if (!Array.isArray(document[fields[i]])) {
                // Validate the document field; throw an exception if it is not an array.
                throw new Error("Bad $addToSet parameter - field in document must be an array.");
            }
            // Convert to an array if the input is not an array.
            var newIds = Array.isArray(update.$addToSet[fields[i]])
                ? update.$addToSet[fields[i]]
                : [update.$addToSet[fields[i]]];
            var finalIds = unique(document[fields[i]].concat(newIds));
            document[fields[i]] = finalIds;
        }
    }
}
DocumentDB stored procedures must complete within 5 seconds. They are also limited by the provisioned throughput of the collection. If you have 5000 RU/s provisioned, then the stored procedure cannot consume more than 5000 * 5 RUs in total.
When a stored procedure reaches its execution time or throughput limit, any request to perform a database operation (read, write, query) receives a pre-emption signal, i.e. the request is not accepted; this is the signal for the stored procedure to wrap up execution and return to the caller. If you check the return code from each call, your stored procedure will never time out. Here's a snippet showing how to do this (full samples are available at https://github.com/Azure/azure-documentdb-js-server/blob/master/samples/stored-procedures/):
var isAccepted = collection.replaceDocument(..., function(err, result) {
    // additional logic in callback
});
if (!isAccepted) {
    // wrap up execution and return
}
Regarding 449: this is a concurrency error that can be returned when your stored procedure attempts a conflicting write. It is side-effect free and safe to retry on from the client. You can implement a retry-until-succeeded pattern whenever you run into this error.
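A minimal client-side sketch of that retry pattern; executeSproc here is a hypothetical wrapper around whatever SDK call you use to invoke the stored procedure:
function executeWithRetry(attempt) {
    executeSproc(function (err, result) { // hypothetical wrapper, not a real SDK call
        if (err && err.code === 449 && attempt < 5) {
            // 449 is transient and side-effect free, so back off briefly and retry
            setTimeout(function () { executeWithRetry(attempt + 1); }, 100 * attempt);
        } else if (err) {
            throw err;
        }
        // otherwise: success, use result
    });
}
executeWithRetry(1);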

Recursion - Node out of memory

When I run the following code 9999999+ times, Node returns with:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)
What's the best solution to get around this issue, other than increasing the max allocation size or other command line arguments?
I'd like to improve the code quality rather than hack a solution.
The following is the main bulk of recursion within the application.
The application is a load testing tool.
a.prototype.createClients = function(){
    for (var i = 0; i < 999999999; i++) {
        this.recursiveRequest();
    }
};

a.prototype.recursiveRequest = function(){
    var self = this;
    self.hrtime = process.hrtime();
    if (!this.halt) {
        self.reqMade++;
        this.http.get(this.options, function(resp){
            resp.on('data', function(){})
                .on("connection", function(){
                })
                .on("end", function(){
                    self.onSuccess();
                });
        })
        .on("error", function(e){
            self.onError();
        });
    }
};

a.prototype.onSuccess = function(){
    var elapsed = process.hrtime(this.hrtime),
        ms = elapsed[0] * 1000000 + elapsed[1] / 1000;
    this.times.push(ms);
    this.successful++;
    this.recursiveRequest();
};
Looks like you should really be using a queue instead of recursive calls. async.queue offers a fantastic mechanism for processing asynchronous queues. You should also consider using the request module to make your http client connections simpler.
var async = require('async');
var request = require('request');

var load_test_url = 'http://www.testdomain.com/';
var parallel_requests = 1000;

function requestOne(task, callback) {
    request.get(task.url, function(err, connection, body) {
        if (err) return callback(err);
        q.push({url: load_test_url});
        callback();
    });
}

var q = async.queue(requestOne, parallel_requests);

for (var i = 0; i < parallel_requests; i++) {
    q.push({url: load_test_url});
}
You can set the parallel_requests variable according to how many simultaneous requests you want to hit the test server with.
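Note that as written the queue never empties, since each completed request pushes another; but for a bounded run, async.queue's drain hook (assigned as a property in async versions of this era) tells you when everything queued has finished:
q.drain = function() {
    console.log('all queued requests have completed');
};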
You are launching 1 billion "clients" in parallel, each of which performs an HTTP GET request in an endless recursion.
A few remarks:
while your question mentions 10 million clients, your code creates 1 billion clients.
You should replace the for loop with a recursive function to get rid of the out-of-memory error.
Something along these lines:
a.prototype.createClients = function(i){
    if (i < 999999999) {
        this.recursiveRequest();
        this.createClients(i + 1);
    }
};
Then you probably want to include some delay between client creations, or between the calls to recursiveRequest; use setTimeout for that.
You should also have a way to stop the recursion (onSuccess and recursiveRequest keep calling each other).
A flow control library like the async node.js module may help.
10 million is very large... Assuming that the stack supports any number of calls, it should work, but you are likely asking the JavaScript interpreter to load 10 million times quite a bit of memory... and the result is Out of Memory.
Also, I personally do not see why you'd want to have so many requests at the same time (testing a heavy load on a server?). One way to optimize is to NOT create "floating functions", which you are doing a lot. "Floating functions" use their own set of memory on each instantiation.
this.http.get(this.options, function(resp){ ... });
                            ^^^^
                            ++++--- allocates memory x 10 million
Here the function(resp)... declaration allocates more memory on each call. What you want to do is:
// either in the global scope:
function r(resp) { ... }
this.http.get(this.options, r ...);

// or as a static member:
a.r = function(resp) { ... };
this.http.get(this.options, a.r ...);
At least you'll save on all that function memory. That goes for all the functions you declare within the r function as well, of course, especially if they are quite large.
If you want to use the this pointer (i.e. make r a prototype function), then you can do this:
a.prototype.r = function(resp) { ... };

// note that we have to have a small function to use 'that'... probably not a good idea
var that = this;
this.http.get(this.options, function(){ that.r(); });
To avoid the that reference, you may use an instance saved in a global. That defeats the use of an object as such, though:
a.instance = new a;

// r() is static, but can access the object as follows:
a.r = function(resp) { a.instance.<func>(); };
Using the instance, you can access the object's functions from the static r function. That could be the actual implementation, which could make full use of the this reference:
a.r = function(resp) { a.instance.r_impl(); };
According to a comment by Daniel, your problem is that you misuse a for() loop to count the total number of requests you want to send. This means you can apply a very simple fix to your code, as follows:
a.prototype.createClients = function(){
    this.recursiveRequest();
};

a.prototype.recursiveRequest = function(){
    var self = this;
    self.hrtime = process.hrtime();
    if(!this.halt && this.successful < 10000000){
        ...
Your recursion is enough to run the test any number of times.
What you never do, though, is quit: you have a halt variable, but it does not look like you ever set it to true. To test 10 million times, you instead want to check the number of requests you have already sent.
My "fix" assumes that onError() stops (i.e. is not recursive). You could also change the code to make use of the halt flag, as in:
a.prototype.onSuccess = function(){
    var elapsed = process.hrtime(this.hrtime),
        ms = elapsed[0] * 1000000 + elapsed[1] / 1000;
    this.times.push(ms);
    this.successful++;
    if (this.successful >= 10000000) {
        this.halt = true;
    }
    this.recursiveRequest();
};
Note here that you will be pushing ms into the times buffer 10 million times. That's a big table! You may want to keep a running total instead and compute an average at the end:
this.time += ms;

// at the end:
this.average_time = this.time / this.successful;

Nested queries in Node JS / MongoDB

My userlist table in mongo is set up like this:
email: email@email.com, uniqueUrl: ABC, referredUrl: ...
I have the following code, where I query all of the users in my database and, for each of those users, find out how many other users' referredUrl values equal the current user's uniqueUrl:
exports.users = function(db) {
    return function(req, res) {
        db.collection('userlist').find().toArray(function (err, items) {
            for (var i = 0; i < items.length; i++) {
                var user = items[i];
                console.log(user.email);
                db.collection('userlist').find({referredUrl: user.uniqueUrl}).count(function(err, count) {
                    console.log(count);
                });
            }
        });
    };
};
Right now I'm first logging the user's email, then the count associated with that user. So the console output should look like this:
bob@bob.com
1
chris@chris.com
3
grant@grant.com
2
Instead, it looks like this:
bob@bob.com
chris@chris.com
grant@grant.com
1
3
2
What's going on? Why is the nested query only returning after the first query completes?
Welcome to asynchronous programming and callbacks.
What you are expecting is that everything works in a linear order, but that is not how Node works. The whole subject is a little too broad to cover here, but it is worth reading up on.
Luckily the methods invoked by the driver all key off process.nextTick, which gives you something to look up and search on. But there is a simple way to remedy the code, due to the natural way that things are queued:
db.collection('userlist').find().toArray(function(err, items) {

    var processing = function(user) {
        db.collection('userlist').find({ referredUrl: user.uniqueUrl })
            .count(function(err, count) {
                console.log(user.email);
                console.log(count);
            });
    };

    for (var i = 0; i < items.length; i++) {
        var user = items[i];
        processing(user);
    }
});
Now of course that is a really oversimplified way of explaining this, but understand that here you are passing the parameters through to your repeated .find() and doing all of the output there.
As said, fortunately some of the work is done for you in the API functions, and the event stack is maintained in the order you added the calls. But the main point is that the output calls are now made together, rather than occurring within different sets of events.
For a detailed explanation of event loops and callbacks, I'm sure there are much better write-ups out there than I could give here.
Callbacks are asynchronous in node.js, so your count callback (function(err, count) { console.log(count); }) is not executed immediately after console.log(user.email);. The output is therefore normal; nothing is wrong with it. What is wrong is the coding style: you shouldn't expect consecutive callbacks to produce the same result as calling functions sequentially in a single-threaded language such as Python. To get the desired result, you should do all the work in a single callback. But before doing that, I recommend you take the time to understand how callbacks work in node.js; it will significantly help your coding.
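In that spirit, one minimal sketch is to log the email and count together inside the same callback, and track completion with a simple counter:
db.collection('userlist').find().toArray(function (err, items) {
    var pending = items.length;
    items.forEach(function (user) {
        db.collection('userlist').find({ referredUrl: user.uniqueUrl })
            .count(function (err, count) {
                console.log(user.email); // email and count now stay paired
                console.log(count);
                if (--pending === 0) {
                    // every count has arrived; finish up (e.g. send the response) here
                }
            });
    });
});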
