Query changing strings to ints in mongoDB takes a long time - javascript

I run the following query on my whole data set (approx. 3 million documents) in MongoDB to change user IDs that are strings into ints. This query never seems to complete:
var cursor = db.play_sessions.find();
while (cursor.hasNext()) {
    var play = cursor.next();
    db.play_sessions.update({ _id: play._id }, { $set: { user_id: new NumberInt(play.user_id) } });
}
I run this query on the same data set and it returns relatively quickly:
db.play_sessions.find().forEach(function(play) {
    if (play.score && play.user_id && play.user_attempt_no && play.game_id && play.level && play.training_session_id) {
        print(play.score, ",", play.user_id, ",", play.user_attempt_no, ",", play.game_id, ",", play.level, ",", parseInt(play.training_session_id).toFixed());
    } else if (play.score && play.user_id && play.user_attempt_no && play.game_id && play.level) {
        print(play.score, ",", play.user_id, ",", play.user_attempt_no, ",", play.game_id, ",", play.level);
    }
});
I understand I am writing to the database in the first query, but why does the first query never seem to return, while the second does so relatively quickly? Is there something wrong with the code in the first query?

Three million documents is quite a lot of documents, so the whole operation is going to take a while. But the main thing to consider here is that you are asking to both "send" data to the database and "receive" an acknowledged write response (because that is what happens) three million times. That alone is a lot more waiting in between operations than simply iterating a cursor.
Another reason here is that it is very likely you are running MongoDB 2.6 or a later revision. There is a core difference between earlier versions and 2.6 and above in how this code is processed in the shell. At the heart of it is the Bulk Operations API, whose methods are now used by all the shell helpers for every interaction with the database.
In prior versions, such a "loop" did not wait for a "write concern" acknowledgement on each iteration. The way it is done now (since the helpers actually use the Bulk API), the acknowledgement is returned for every iteration, which slows things down a lot, unless of course you use the Bulk operations directly.
So to "re-cast" your values in modern versions, do this instead:
var bulk = db.play_sessions.initializeUnorderedBulkOp();
var count = 0;

db.play_sessions.find({ "user_id": { "$type": 2 } }).forEach(function(doc) {
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": { "user_id": NumberInt(doc.user_id) }
    });
    count++;

    if ( count % 10000 == 0 ) {
        bulk.execute();
        bulk = db.play_sessions.initializeUnorderedBulkOp();
    }
});

if ( count % 10000 != 0 )
    bulk.execute();
Bulk operations send their whole "batch" in a single request. In actual fact the underlying driver breaks this up into individual batch requests of 1000 items, but 10000 is a reasonable number that does not take up too much memory in most cases.
The other optimization here is that the query selects only the documents whose user_id is presently a "string", using the $type operator to identify this. That can speed things up if some of the data has already been converted.
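For what it's worth, on MongoDB 3.2 or later shells and drivers the same conversion can also be expressed with db.collection.bulkWrite(). A minimal sketch, assuming the same play_sessions collection (the "string" alias for $type also needs 3.2+):
var ops = [];

db.play_sessions.find({ "user_id": { "$type": "string" } }).forEach(function(doc) {
    ops.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$set": { "user_id": NumberInt(doc.user_id) } }
        }
    });

    // flush in batches of 1000 to keep memory usage bounded
    if (ops.length === 1000) {
        db.play_sessions.bulkWrite(ops, { "ordered": false });
        ops = [];
    }
});

if (ops.length > 0)
    db.play_sessions.bulkWrite(ops, { "ordered": false });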
If indeed you have an earlier version of MongoDB and you are running this conversion on a collection that is not on a sharded cluster, then your other option is to use db.eval().
Take care to actually read the content at that link though. This is not a good idea: you should never use it in production, and only as a last resort for a one-off conversion. The code is submitted as JavaScript and actually run on the server, so a high level of locking can and will occur while it runs. You have been warned:
db.eval(function() {
    db.play_sessions.find({ "user_id": { "$type": 2 } }).forEach(function(doc) {
        db.play_sessions.update(
            { "_id": doc._id },
            { "$set": { "user_id": NumberInt( doc.user_id ) } }
        );
    });
});
Use with caution and prefer the "batch" processing or even the basic loop on a machine as close as possible in network terms to the actual database server. Preferably on the server.
Also, where the version permits and you still deem the eval case necessary, try to use the Bulk operations methods anyway, as they are the far more optimized approach.

Related

How to do find all or get all in dynamo db in NodeJS

I want to get all the data from a table in DynamoDB in Node.js; this is my code:
const READ = async (payload) => {
  const params = {
    TableName: payload.TableName,
  };

  let scanResults = [];
  let items;
  do {
    items = await dbClient.scan(params).promise();
    items.Items.forEach((item) => scanResults.push(item));
    params.ExclusiveStartKey = items.LastEvaluatedKey;
  } while (typeof items.LastEvaluatedKey != "undefined");

  return scanResults;
};
I implemented this and it works fine, but our code review tool flags it as not optimized or as causing a memory leak, and I cannot figure out why. I have read elsewhere that the DynamoDB scan API is not the most efficient way to get all the data in Node. Is there something else I am missing to optimize this code?
DO IT LIKE THIS ONLY IF YOUR DATA SET IS VERY SMALL (fewer than 100 items or less than 1 MB of data is what I prefer, and in that case you don't need the do-while loop).
Think about the following scenario: what happens when, in the future, more and more items are added to the DynamoDB table? This code will return all of that data and put it into the scanResults variable, right? That will impact memory. Also, the DynamoDB scan operation is expensive, in terms of both memory and cost.
It's perfectly okay to use the scan operation if the data set is very small. Otherwise, go with pagination (I always prefer this); if there are thousands of items, who is going to look at all of them in a single shot? So use pagination instead.
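A rough sketch of page-by-page scanning, assuming the same dbClient (an AWS SDK v2 DocumentClient) as in the question; the caller keeps the returned nextKey and passes it back in to fetch the following page, so only one page is ever held in memory (READ_PAGE and pageSize are illustrative names, not part of the original code):
const READ_PAGE = async (payload, startKey, pageSize = 50) => {
  const params = {
    TableName: payload.TableName,
    Limit: pageSize, // cap the number of items returned per page
  };
  if (startKey) {
    params.ExclusiveStartKey = startKey; // resume from where the previous page ended
  }

  const result = await dbClient.scan(params).promise();
  return {
    items: result.Items,
    nextKey: result.LastEvaluatedKey, // undefined when there are no more pages
  };
};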
Let's take another scenario: if your requirement is to retrieve all the data for some analytics or aggregation, then it is better to store the aggregated data upfront (in the same or a different DynamoDB table) as an item, or to use an analytics database.
If your requirement is something else, please elaborate on it in the question.

pre-allocation of records using count

I've read that pre-allocation of a record can improve performance, which should be especially beneficial when handling many records of a time-series dataset.
updateRefLog = function(_ref,year,month,day){
var id = _ref + "|" + year + "|" + month;
db.collection('ref_history').count({"_id":id},function(err,count){
// pre-allocate if needed
if(count < 1){
db.collection('ref_history').insert({
"_id":id
,"dates":[{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0},{"count":0}]
});
}
// update
var update={"$inc":inc['dates.'+day+'.count'] = 1;};
db.collection('ref_history').update({"_id":id},update,{upsert: true},
function(err, res){
if(err !== null){
//handle error
}
}
);
});
};
I'm a little concerned that having to go through a promise might slow this down, and possibly that checking the count every time would negate the performance benefit of pre-allocating a record.
Is there a more performant way to handle this?
The general point of "pre-allocation" is about the potential cost of an "update" operation that causes the document to "grow". If that results in a document size greater than the currently allocated space, then the document is "moved" to another location on disk to accommodate the new space. This can be costly, hence the general recommendation to initially write the document befitting its eventual "size".
Honestly, the best way to handle such an operation would be to do an "upsert" initially with all the array elements allocated, and then only update the required element in position. This reduces it to "two" potential writes, and you can further reduce it to a single "over the wire" operation using the Bulk API methods:
var id = _ref + "|" + year + "|" + month;

var bulk = db.collection('ref_history').initializeOrderedBulkOp();

bulk.find({ "_id": id }).upsert().updateOne({
    "$setOnInsert": {
        "dates": Array.apply(null, Array(32)).map(function(el) { return { "count": 0 } })
    }
});

var update = { "$inc": {} };
update["$inc"]["dates." + day + ".count"] = 1;

bulk.find({ "_id": id }).updateOne(update);

bulk.execute(function(err, results) {
    // results would show what was modified or not
});
Or since newer drivers are favouring consistency with one another, the "Bulk" parts have been relegated to regular arrays of WriteOperations instead:
var update={"$inc":inc['dates.'+day+'.count'] = 1;};
db.collection('ref_history').bulkWrite([
{ "updateOne": {
"filter": { "_id": id },
"update": {
"$setOnInsert": {
"dates": Array.apply(null,Array(32)).map(function(el) {
return { "count": 0 }
})
}
},
"upsert": true
}},
{ "updateOne": {
"filter": { "_id": id },
"update": update
}}
],function(err,result) {
// same thing as above really
});
In either case the $setOnInsert, as the sole block in that update, will only do anything if an "upsert" actually occurs. The main point is that the only contact with the server is a single request and response, as opposed to "back and forth" operations waiting on network communication.
This is typically what "Bulk" operations are for. They reduce that network overhead when you might as well send a batch of requests to the server. The result significantly speeds things up, and neither operation is really dependent on the other, with the exception of "ordered" execution, which is the default in the latter case and is explicitly set by the legacy .initializeOrderedBulkOp().
Yes there is a "little" overhead in the "upsert", but there is "less" than in testing with .count() and waiting for that result first.
N.B. I'm not sure about the 32 array entries in your listing; you possibly meant 24 but copy/paste got the better of you. At any rate, there are better ways to do that than hardcoding, as demonstrated.
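If the array is meant to hold one counter per day of the month, a small hedged sketch of sizing it without hardcoding (relying on the fact that new Date(year, month, 0).getDate() returns the number of days in that month when month is 1-12):
var daysInMonth = new Date(year, month, 0).getDate();

var dates = Array.apply(null, Array(daysInMonth)).map(function() {
    return { "count": 0 };
});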

Concurrent beforeSave calls allowing duplicates

In an effort to prevent certain objects from being created, I set a conditional in that type of object's beforeSave cloud function.
However, when two objects are created simultaneously, the conditional does not work as intended.
Here is my code:
Parse.Cloud.beforeSave("Entry", function(request, response) {
var theContest = request.object.get("contest");
theContest.fetch().then(function(contest){
if (contest.get("isFilled") == true) {
response.error('This contest is full.');
} else {
response.success();
});
});
Basically, I don't want an Entry object to be created if a Contest is full. However, if there is 1 spot in the Contest remaining and two entries are saved simultaneously, they both get added.
I know it is an edge-case, but a legitimate concern.
Parse uses MongoDB, a NoSQL database designed to be very scalable, and it therefore provides limited synchronisation features. What you really need here is mutual exclusion, which is unfortunately not supported on a Boolean field. However, Parse provides atomic operations for counters and array fields, which you can use to enforce some control.
See http://blog.parse.com/announcements/new-atomic-operations-for-arrays/
and https://parse.com/docs/js/guide#objects-updating-objects
Solved this by using increment and then doing the check in the save callback (instead of fetching the object and checking a Boolean on it).
Looks something like this:
Parse.Cloud.beforeSave("Entry", function(request, response) {
var theContest = request.object.get("contest");
theContest.increment("entries");
theContest.save().then(function(contest) {
if (contest.get("entries") > contest.get("maxEntries")) {
response.error('The contest is full.');
} else {
response.success();
}
});
}

What is more effective: every time make an ajax request or slice a result in func?

I have a JSON data of news like this:
{
    "news": [
        {"title": "some title #1", "text": "text", "date": "27.12.15 23:45"},
        {"title": "some title #2", "text": "text", "date": "26.12.15 22:35"},
        ...
    ]
}
I need to get a certain number of items from this list, depending on an argument passed to a function. As I understand it, this is called pagination.
I can get the ajax response and slice it immediately, so that every time the function is called it makes an ajax request.
Like this:
function showNews(page) {
    var newsPerPage = 5,
        firstArticle = newsPerPage * (page - 1);

    xhr.onreadystatechange = function() {
        if (xhr.readyState == 4) {
            var newsArr = JSON.parse(xhr.responseText);
            newsArr.news = newsArr.news.slice(firstArticle, newsPerPage * page);
            addNews(newsArr);
        }
    };

    xhr.open("GET", url, true);
    xhr.send();
}
Or I can store the whole result in newsArr and slice it into pages in the additional function addNews.
function addNews(newsArr, newsPerPage) {
    var pages = Math.ceil(newsArr.news.length / newsPerPage), // counts number of pages
        pagesData = {};

    for (var i = 0; i < pages; i++) {
        var min = i * newsPerPage,       // min index of current page in loop
            max = (i + 1) * newsPerPage; // max index of current page in loop
        newsArr.news.forEach(createPageData);
    }

    function createPageData(item, j) {
        if (j + 1 <= max && j >= min) {
            if (!pagesData["page" + (i + 1)]) {
                pagesData["page" + (i + 1)] = { news: [] };
            }
            pagesData["page" + (i + 1)].news.push(item);
        }
    }
}
So, the simple question is: which variant is more effective? The first one loads the server and the second loads the user's memory. What would you choose in my situation? :)
Thanks for the answers. I got what I wanted, but there are so many good answers that I can't choose the best one.
It is actually a primarily opinion-based question.
For me, the pagination approach looks better because it will not produce a "lag" before displaying the news. From the user's POV the page will load faster.
As for me, I would do pagination plus preloading of the next page, i.e. always store the contents of the next page so that you can show it without a delay. When a user moves to the last loaded page, load another one.
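A rough sketch of that idea, assuming a hypothetical loadPage(page, callback) helper that wraps the existing XHR code and calls back with the news for that page:
var pageCache = {};

function showPage(page) {
    if (pageCache[page]) {
        addNews(pageCache[page]);          // already preloaded: show it instantly
    } else {
        loadPage(page, function(news) {    // not cached yet: fetch, then show
            pageCache[page] = news;
            addNews(news);
        });
    }

    // quietly preload the following page so the next click feels instant
    if (!pageCache[page + 1]) {
        loadPage(page + 1, function(news) {
            pageCache[page + 1] = news;
        });
    }
}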
Loading all the news is definitely a bad idea. If you have 1000 news records, then every user will have to load all of them...even if he isn't going to read a single one.
In my opinion, the "less requests == better" rule doesn't apply here. It is not guaranteed that a user will read all the news. If StackOverflow loaded every question it has each time you opened the main page, then both StackOverflow and its users would have huge problems.
If the max number of records that your service returns is around 1000, then I don't think it is going to create a huge payload or memory issue (looking at the nature of your data), so I think option 2 is better because:
the number of service calls will be lower;
since the user will not see any lag while paginating, their experience of using the site will be better.
As a rule of thumb:
less requests == better
but that's not always possible. You may run out of memory/network if the data you store is huge, i.e. you may need pagination on the server side. Actually, server-side pagination should be the default approach (a minimal sketch follows this answer), and then you think about improvements (e.g. local caching) if you really need them.
So what you should do is try all scenarios and see how well they behave in your concrete situation.
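To illustrate the server-side option mentioned above, a minimal sketch; the /news endpoint and its page/perPage parameters are hypothetical, and the server would return only the requested slice:
function fetchNewsPage(page, perPage, callback) {
    var xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function() {
        if (xhr.readyState == 4 && xhr.status == 200) {
            callback(JSON.parse(xhr.responseText)); // only perPage items ever arrive
        }
    };
    xhr.open("GET", "/news?page=" + page + "&perPage=" + perPage, true);
    xhr.send();
}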
I prefer to fetch all the data but only show it on a certain condition: when the user clicks the next button, the data is already there, so just hide and show it conditionally using jQuery.
Calling ajax every time is a bad idea.
But you still need to call ajax for new data if the data changes after some period of time.

Loading large amount of data into memory - most efficient way to do this?

I have a web-based documentation searching/viewing system that I'm developing for a client. Part of this system is a search system that allows the client to search for a term[s] contained in the documentation. I've got the necessary search data files created, but there's a lot of data that needs to be loaded, and it takes anywhere from 8-20 seconds to load all the data. The data is broken into 40-100 files, depending on what documentation needs to be searched. Each file is anywhere from 40-350kb.
Also, this application must be able to run on the local file system, as well as through a webserver.
When the webpage loads up, I can generate a list of what search data files I need load. This entire list must be loaded before the webpage can be considered functional.
With that preface out of the way, let's look at how I'm doing it now.
After I know that the entire webpage is loaded, I call a loadData() function
function loadData() {
    var d = new Date();
    var curr_min = d.getMinutes();
    var curr_sec = d.getSeconds();
    var curr_mil = d.getMilliseconds();
    console.log("test.js started background loading, time is: " + curr_min + ":" + curr_sec + ":" + curr_mil);
    recursiveCall();
}

function recursiveCall() {
    if (file_array.length > 0) {
        var string = file_array.pop();
        setTimeout(function() { $.getScript(string, recursiveCall); }, 1);
    } else {
        var d = new Date();
        var curr_min = d.getMinutes();
        var curr_sec = d.getSeconds();
        var curr_mil = d.getMilliseconds();
        console.log("test.js stopped background loading, time is: " + curr_min + ":" + curr_sec + ":" + curr_mil);
    }
}
What this does is process an array of files sequentially, taking a 1 ms break between files. This helps prevent the browser from being completely locked up during the loading process, but the browser still tends to get bogged down loading the data. Each of the files that I'm loading looks like this:
AddToBookData(0,[0,1,2,3,4,5,6,7,8]);
AddToBookData(1,[0,1,2,3,4,5,6,7,8]);
AddToBookData(2,[0,1,2,3,4,5,6,7,8]);
Where each line is a function call that is adding data to an array. The "AddToBookData" function simply does the following:
function AddToBookData(index1, value1) {
    BookData[BookIndex].push([index1, value1]);
}
This is the existing system. After loading all the data, "AddToBookData" can get called 100,000+ times.
I figured that was pretty inefficient, so I wrote a script to take the test.js file which contains all the function calls above, and processed it to change it into a giant array which is equal to the data structure that BookData is creating. Instead of making all the function calls that the old system did, I simply do the following:
var test_array = [..........(data structure I need).......];
BookData[BookIndex] = test_array;
I was expecting to see a performance increase because I was removing all the function calls above, but this method actually takes slightly more time to create the exact same data structure. I should note that "test_array" holds slightly over 90,000 elements in my real-world test.
It seems that both methods of loading data have roughly the same CPU utilization. I was surprised to find this, since I was expecting the second method to require little CPU time because the data structure is created beforehand.
Please advise?
Looks like there are two basic areas for optimising the data loading that can be considered and tackled separately:
Downloading the data from the server. Rather than one large file, you should gain wins from parallel loads of multiple smaller files. Experiment with the number of simultaneous loads, bearing in mind browser limits and the diminishing returns of having too many parallel connections. See my parallel vs sequential experiments on jsfiddle, but bear in mind that the results will vary due to the vagaries of pulling the test data from GitHub; you're best off testing with your own data under more tightly controlled conditions.
Building your data structure as efficiently as possible. Your result looks like a multi-dimensional array; this interesting article on JavaScript array performance may give you some ideas for experimentation in this area.
But I'm not sure how far you'll really be able to go with optimising the data loading alone. To solve the actual problem with your application (the browser locking up for too long), have you considered options such as the following?
Using Web Workers
Web Workers might not be supported by all your target browsers, but should prevent the main browser thread from locking up while it processes the data.
For browsers without workers, you could consider increasing the setTimeout interval slightly to give the browser time to service the user as well as your JS. This will actually make things slightly slower, but it may increase user happiness when combined with the next point.
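A rough sketch of the worker approach, assuming the data files were converted to plain JSON so the worker can fetch and parse them without executing script (the file names and message shapes here are illustrative, not part of the original code):
// worker.js
self.onmessage = function(e) {
    var parsed = [];
    e.data.files.forEach(function(url) {
        var xhr = new XMLHttpRequest();
        xhr.open("GET", url, false);      // synchronous XHR is acceptable inside a worker
        xhr.send();
        parsed.push(JSON.parse(xhr.responseText));
    });
    self.postMessage(parsed);             // hand the finished structure back to the page
};

// main page
var worker = new Worker("worker.js");
worker.onmessage = function(e) {
    BookData[BookIndex] = e.data;         // main thread only does the final assignment
                                          // (adapt the parsing to whatever shape BookData expects)
};
worker.postMessage({ files: file_array });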
Providing feedback of progress
For both worker-capable and worker-deficient browsers, take some time to update the DOM with a progress bar. You know how many files you have left to load, so progress should be fairly consistent, and although things may actually be slightly slower, users will feel better if they get the feedback and don't think the browser has locked up on them.
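A small sketch of that feedback, assuming a hypothetical <progress id="load-progress" max="100"> element in the page and that the total file count is captured before the loop starts:
var total_files = file_array.length;

function updateProgress() {
    var done = total_files - file_array.length;
    document.getElementById("load-progress").value = (done / total_files) * 100;
}
Calling updateProgress() at the top of recursiveCall() would keep the bar in step with the files already loaded.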
Lazy Loading
As suggested by jira in his comment: if Google Instant can search the entire web as we type, is it really not possible to have the server return a file with all locations of the search keyword within the current book? This file should be much smaller and faster to load than the locations of all words within the book, which is what I assume you are currently trying to load as quickly as you can.
I tested three methods of loading the same 9,000,000 point dataset into Firefox 3.64.
1) Stephen's GetJSON method
2) My function-based push method
3) My pre-processed array-appending method
I ran my tests two ways. In the first iteration of testing I imported 100 files, each containing 10,000 rows of data, with each row containing 9 data elements [0,1,2,3,4,5,6,7,8].
In the second iteration I tried combining files, so that I was importing 1 file with 9 million data points.
This was a lot larger than the dataset I'll be using, but it helps demonstrate the speed of the various import methods.
             Separate files:   Combined file:
JSON:        34 seconds        34 seconds
FUNC-BASED:  17.5 seconds      24 seconds
ARRAY-BASED: 23 seconds        46 seconds
Interesting results, to say the least. I closed out the browser after loading each webpage, and ran the tests 4 times each to minimize the effect of network traffic/variation. (ran across a network, using a file server). The number you see is the average, although the individual runs differed by only a second or two at most.
Instead of using $.getScript to load JavaScript files containing function calls, consider using $.getJSON. This may boost performance. The files would now look like this:
{
    "key" : 0,
    "values" : [0,1,2,3,4,5,6,7,8]
}
After receiving the JSON response, you could then call AddToBookData on it, like this:
function AddToBookData(json) {
    BookData[BookIndex].push([json.key, json.values]);
}
If your files have multiple sets of calls to AddToBookData, you could structure them like this:
[
    {
        "key" : 0,
        "values" : [0,1,2,3,4,5,6,7,8]
    },
    {
        "key" : 1,
        "values" : [0,1,2,3,4,5,6,7,8]
    },
    {
        "key" : 2,
        "values" : [0,1,2,3,4,5,6,7,8]
    }
]
And then change the AddToBookData function to compensate for the new structure:
function AddToBookData(json) {
    $.each(json, function(index, data) {
        BookData[BookIndex].push([data.key, data.values]);
    });
}
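For completeness, a hedged sketch of how recursiveCall() from the question might be adapted to use $.getJSON instead of $.getScript (variable names follow the original code; the per-file JSON is assumed to be the array-of-objects format above):
function recursiveCall() {
    if (file_array.length > 0) {
        var file = file_array.pop();
        setTimeout(function() {
            $.getJSON(file, function(json) {
                AddToBookData(json);   // the array-handling version shown above
                recursiveCall();
            });
        }, 1);
    }
}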
Addendum
I suspect that regardless of what method you use to transport the data from the files into the BookData array, the true bottleneck is the sheer number of requests. Must the files be fragmented into 40-100? If you change to JSON format, you could load a single file that looks like this:
{
    "file1" : [
        {
            "key" : 0,
            "values" : [0,1,2,3,4,5,6,7,8]
        },
        // all the rest...
    ],
    "file2" : [
        {
            "key" : 1,
            "values" : [0,1,2,3,4,5,6,7,8]
        },
        // yadda yadda
    ]
}
Then you could do one request, load all the data you need, and move on... Although the browser may initially lock up (although, maybe not), it would probably be MUCH faster this way.
Here is a nice JSON tutorial, if you're not familiar: http://www.webmonkey.com/2010/02/get_started_with_json/
Fetch all the data as a string, and use split(). This is the fastest way to build an array in JavaScript.
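A minimal sketch of that idea, assuming a hypothetical flat file with one row per line and comma-separated values (the first value being the key, the rest the data elements):
$.ajax({
    url: "bookdata.txt",     // hypothetical single flat data file
    dataType: "text",
    success: function(text) {
        var rows = text.split("\n");
        for (var i = 0; i < rows.length; i++) {
            if (!rows[i]) continue;                    // skip blank lines
            var values = rows[i].split(",").map(Number);
            BookData[BookIndex].push([values[0], values.slice(1)]);
        }
    }
});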
There's an excellent article on a very similar problem from the people who built the Flickr search: http://code.flickr.com/blog/2009/03/18/building-fast-client-side-searches/
