Delete multiple documents - javascript

The following code is working but extremely slow. Up till the search function all goes well. First, the search function returns a sequence and not an array (why?!). Second, the array consists of nodes and I need URI's for the delete. And third, the deleteDocument function takes a string and not an array of URI's.
What would be the better way to do this? I need to delete year+ old documents.
Here I use xdmp.log instead of document.delete, just to be safe.
var now = new Date();
var yearBack = now.setDate(now.getDate() - 365);
var date = new Date(yearBack);
var b = cts.jsonPropertyRangeQuery("Dtm", "<", date);
var c = cts.search(b, ['unfiltered']).toArray();
for (i=0; i<fn.count(c); i++) {
xdmp.log(fn.documentUri(c[i]), "info");
};

Doing the same with cts.uris:
var now = new Date();
var yearBack = now.setDate(now.getDate() - 365);
var date = new Date(yearBack);
var b = cts.jsonPropertyRangeQuery("Dtm", "<", date);
var c = cts.uris("", [], b);
while (true) {
var uri = c.next();
if (uri.done == true){
break;
}
xdmp.log(uri.value, "info");
}
HTH!

Using toArray will work, but it is most likely where your slowness is. The cts.search() function returns an iterator, so all you have to do is loop over it and do your deleting until there are no more items in it. Also, you might want to limit your search to 1,000 items. A transaction with a large number of deletes will take a while and might time out.
Here is an example of looping over the iterator
var now = new Date();
var yearBack = now.setDate(now.getDate() - 365);
var date = new Date(yearBack);
var b = cts.jsonPropertyRangeQuery("Dtm", "<", date);
var c = cts.search(b, ['unfiltered']);
while (true) {
var doc = c.next();
if (doc.done == true){
break;
}
xdmp.log(fn.documentUri(doc.value), "info");
}
Here is an example if you wanted to limit to the first 1,000:
fn.subsequence(cts.search(b, ['unfiltered']), 1, 1000);
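
The batching idea above can be sketched in plain JavaScript. This is only an illustration: takeBatch and fakeUris are hypothetical names, and the generator stands in for the search result. In MarkLogic the iterator would be the cts.search() or cts.uris() result, and each batch would be deleted in its own transaction.

```javascript
// Pull at most `size` values from any iterator that follows the next()/done protocol.
function takeBatch(iterator, size) {
  const batch = [];
  while (batch.length < size) {
    const step = iterator.next();
    if (step.done) break;
    batch.push(step.value);
  }
  return batch;
}

// Stand-in data source with made-up URIs (assumption: not a MarkLogic API).
function* fakeUris(n) {
  for (let i = 1; i <= n; i++) yield `/docs/${i}.json`;
}

const it = fakeUris(2500);
const sizes = [];
let batch;
while ((batch = takeBatch(it, 1000)).length > 0) {
  // A real job would delete the URIs in `batch` here, one transaction per batch.
  sizes.push(batch.length);
}
console.log(sizes); // [ 1000, 1000, 500 ]
```

Keeping each batch small bounds the work done per transaction, which is the point of the 1,000-item limit suggested above.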

Several things to consider.
1) If you are searching for the purpose of deleting, or anything else that doesn't require the document body, using a search that returns URIs instead of nodes can be much faster. If that isn't convenient, then getting the URI as close to the search expression as possible can achieve similar results. You want to avoid having the server fetch and expand the document just to get the URI to delete it.
2) While there is full coverage in the JavaScript APIs for all MarkLogic features, the JavaScript APIs are based on the same underlying functions that the XQuery APIs use. It's useful to understand that, and to take a look at the equivalent XQuery API docs to get the big picture. For example, Arrays vs Iterators - if the JS search APIs returned Arrays it could be a huge performance problem, because the underlying code is based on 'lazy evaluation' of sequences. For example, a search could return 1 million rows, but if you only look at the first one the server can often avoid accessing the remaining 999,999 documents. Similarly, as you iterate, only the in-scope referenced data needs to be available. If the results had to be put into an array, then all of them would have to be pre-fetched and put in memory upfront.
3) Always keep in mind that operations which return lists of things may only be bounded by how big your database is. That is why cts.search() and other functions have built in 'pagination'. You should code for that from the start.
By reading the user guides you can get a better understanding of not only how to do something, but how to do it efficiently - or even at all - once your database becomes larger than memory. In general it's a good idea to always code for paginated results - it is a lot more efficient, and your code will still work just as well after you add 100 docs or a million.
4) Take a look at xdmp.nodeUri https://docs.marklogic.com/xdmp.nodeUri
This function, unlike fn.documentUri(), will work on any node even if it's not a document node. If you can put this right next to the search instead of next to the delete, then the system can optimize much better. The examples in the JavaScript guide are a good start https://docs.marklogic.com/guide/getting-started/javascript#chapter
In your case I suggest something like this to experiment with both pagination and extracting the URIs without having to expand the documents ..
var uris = []
for (var result of fn.subsequence(cts.search( ... ), 1 , 100 ))
uris.push(xdmp.nodeUri(result))
for( i in uris )
xdmp.log( uris[i] )
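
The lazy evaluation point in item 2 can be illustrated without MarkLogic at all. The following is a plain JavaScript sketch, assuming a generator as a stand-in for a lazily evaluated search result; lazyResults and the produced counter are made up for the illustration.

```javascript
let produced = 0; // counts how many results actually had to be built

// Hypothetical stand-in for a lazily evaluated result sequence.
function* lazyResults(total) {
  for (let i = 0; i < total; i++) {
    produced++;
    yield `/docs/${i}.json`;
  }
}

// Looking at the first result builds one item, however large the result set is.
const first = lazyResults(1000000).next().value;
const producedForFirst = produced;

// Forcing everything into an array builds every item up front.
produced = 0;
const all = Array.from(lazyResults(1000));
const producedForArray = produced;

console.log(first, producedForFirst, producedForArray); // /docs/0.json 1 1000
```

This is the same trade-off as calling toArray() on a search result: the array form pays for every result before the first one can be used.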

Related

Find all the subranges for a date range api request with capped max offset

I have to extract data from an API that has a max offset (1000) and limit (20). So it's possible that between two dates (from and to) there are more results than the request can return.
I'd like to get the complete set of data between those two dates so I'm trying to come up with a solution to accomplish this.
My idea is to start by making a request like:
https://api.com/?from=2021-07-20&to=2021-07-24&limit=20&offset=1000
If the request returns fewer than 20 elements, the search is over: I have all the data available in the range. But if the request returns 20 elements, there is probably more data in this range, so I have to find a way to keep splitting ranges until this condition is false.
I've thought about splitting the ranges like:
from = from
to_1 = 2021-07-22
from_1 = 2021-07-22
to = 2021-07-24
And then pass those ranges to the recursive function until finding all the needed subranges.
The output would be something like:
[(2021-07-21,2021-07-22),(2021-07-22,2021-07-23),(2021-07-23, 2021-07-24)]
The problem with this solution is that I'm expanding the from's and to's so I can't manage to use a recursive function and I'm struggling about how to fix this issue.
Edit: I've added the "javascript" tag as the solution is going to be implemented in that language but ideas/pseudocode is welcome too.
Seems like you're looking for binary search.
Your recursive function should look like this (it's JS pseudo-code):
function get_ranges(from, to)
{
const items = request(from, to)
if (items.length < 20) {
return [[from, to]]
}
const date_between = calculate_date_between(from, to)
const left_ranges = get_ranges(from, date_between)
const right_ranges = get_ranges(date_between, to)
return [...left_ranges, ...right_ranges]
}
const result = get_ranges(min_date, max_date)
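
A runnable version of the pseudo-code above might look like the sketch below. Several things are assumptions made for the illustration: the API is replaced by an in-memory array (DATA), ranges are treated as half-open (from inclusive, to exclusive) so the halves do not overlap, dates are millisecond timestamps, and a guard stops the recursion when a range can no longer be split.

```javascript
const LIMIT = 20;

// Hypothetical data standing in for the remote API: 50 timestamps, one per hour.
const DATA = Array.from({ length: 50 }, (_, i) => Date.UTC(2021, 6, 20) + i * 3600e3);

// Stand-in for the API call: at most LIMIT items with from <= t < to.
function request(from, to) {
  return DATA.filter(t => t >= from && t < to).slice(0, LIMIT);
}

// Split [from, to) in half until every sub-range returns fewer than LIMIT items.
function getRanges(from, to) {
  const items = request(from, to);
  const mid = from + Math.floor((to - from) / 2);
  // mid === from means the range cannot be split any further.
  if (items.length < LIMIT || mid === from) {
    return [[from, to]];
  }
  return [...getRanges(from, mid), ...getRanges(mid, to)];
}

const ranges = getRanges(Date.UTC(2021, 6, 20), Date.UTC(2021, 6, 24));
const total = ranges.reduce((n, [f, t]) => n + request(f, t).length, 0);
console.log(ranges.length, total); // 5 50
```

Because each sub-range returns fewer than the limit, requesting every range in the result once yields the complete data set with no duplicates.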

javascript function toString'd with dynamic values embedded?

I'm calling toString() on a function so that I can send it across the wire to a system that will apply it against an object. A simplified example is below. Here I am pushing the function as a string to a remote service, which is checking if the remote object this, which has the property timestamp, is older than cutoff, which is currently always one hour ago...
var functionToSend = function() {
let cutoff = new Date();
cutoff.setHours(autoApproveCutoff.getHours() - 1);
const isMatch = this.timestamp <= cutoff
return isMatch
};
foo.sendFunction(functionToSend.toString())
I'd like to be able to set the cutoff hour (either at build time or at runtime), whilst maintaining functionToSend as an actual function so I can unit test and enjoy autocomplete, colourisation etc. Plus, if I end up with 100 checks for 100 different cutoff values I don't want to have to maintain them all separately.
cutoff.setHours(autoApproveCutoff.getHours() - DYNAMIC-VALUE);
Bear in mind that once the function is stringified you can't reference anything external to the function (except this, which is the remote object).
Might be impossible. Might be more effort than it's worth. But would be interested if anyone has any ideas, short of code-gen.

Call function saved as string on chosen object

I have something like:
var sFunction = 'my_function("param1", "param2")';
var oMyObject = ...;
And I want to combine it so the result would be equal to:
oMyObject.my_function("param1", "param2");
Would much appreciate any tips.
Remark
As many of you suggested to find a root cause and try not to deal with the problematic input here are some pieces of information about the origins of the "problem".
The sFunction comes from the database, hardcoded in one of the columns. It is a custom one which should be called on an object retrieved based on other parameters of sFunction's database record.
So being backed up by your comments I will try suggesting changing data model in hope that it is not too late for that. Thank you all for your help.
I am given that as an input, it may come from db or anywhere else. I just have to deal with it in described way.
As Luca noted, you're probably best off solving the problem that brought you to the point of having code in a string that you feel you need to evaluate at runtime. The number of use cases for doing that is very low.
For instance, instead of
sFunction = 'my_function("param1", "param2")';
perhaps you could have
call = {
f: "my_function",
params: ["param1", "param2"]
};
Then it's:
oMyObject[call.f].apply(oMyObject, call.params);
call could even start life as JSON text you parse -- live example:
var json =
'{' +
'"f": "my_function",' +
'"params": ["param1", "param2"]' +
'}';
var call = JSON.parse(json);
var oMyObject = {
my_function: function(p1, p2) {
console.log(p1, p2);
}
};
oMyObject[call.f].apply(oMyObject, call.params);
That's markedly safer than an arbitrary code execution.
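
If the name/params form is adopted, the lookup can be tightened further. The sketch below is one possible approach, not part of the answer above: invoke and ALLOWED_METHODS are hypothetical names, and the allow-list is an assumption about which methods stored records should be able to reach.

```javascript
// Hypothetical allow-list of method names that stored records may call.
const ALLOWED_METHODS = new Set(['my_function']);

// Call target[call.f](...call.params) only if the method name is allowed.
function invoke(target, call) {
  if (!ALLOWED_METHODS.has(call.f) || typeof target[call.f] !== 'function') {
    throw new Error('Method not allowed: ' + call.f);
  }
  if (!Array.isArray(call.params)) {
    throw new Error('params must be an array');
  }
  return target[call.f](...call.params);
}

const target = {
  my_function: (p1, p2) => p1 + ' ' + p2,
};

const result = invoke(target, JSON.parse('{"f": "my_function", "params": ["param1", "param2"]}'));
console.log(result); // param1 param2
```

With the allow-list in place, a record naming an inherited method such as constructor is rejected instead of being called.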
You can do this with your sFunction (eval("oMyObject." + sFunction)), but consider:
- It lets any arbitrary code in sFunction run.
- If User A supplies the code and then you run it on User B's system, you're compromising User B's privacy. (I am not a lawyer, but you could be doing so in a way that violates a country's data protection or privacy laws.)
Now, if you're loading code from a DB and you know that the code in the DB can only be put there by trusted people (for instance, developers on your team, not end users of the system), that's fine, it's largely like running a script file. But there's almost certainly a better way to do it than delivering the code as a string and evaling it.
But if the code comes from "anywhere else", it's not fine; see bullet points above. The setup is fundamentally broken and better options are available. Take that information to your boss, and if necessary to his/her boss, and if necessary his/her boss, until you find someone who can change the requirement.
Here's a string hack that doesn't use eval(), but as I (and others) have said, this is not a good solution. The better solution would be to return the function name and any arguments as a comma delimited string, which would at least make this kind of solution more straight-forward.
var sFunction = 'my_function("param1", "param2")';
// The object would have to already have the function:
var oMyObject = {
my_function: function(x,y){
return x + y;
}
};
// Remove the last ")" and split the remainder into an array at the "("
var funcParts = sFunction.replace(")","").split("(");
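// Split the second part (the arguments) into its own array, trimming whitespace and the surrounding quotes
var funcArgs = funcParts[1].split(",").map(function (a) { return a.trim().replace(/^"|"$/g, ""); });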
// Pass the function name as a string key to the object and then pass the arguments to that
console.log(oMyObject[funcParts[0]](funcArgs[0], funcArgs[1]));
The bigger question is, what ultimately are you trying to accomplish as there is almost always a better approach than this.
To do a dynamic function call you can of course eval as I did in the comments, which is of course a terrible idea. Here is a quick-and-dirty alternative:
const dynamicCallMethod = (obj, s) => {
try {
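// Capture the method name that precedes the opening parenthesis
const fname = s.match(/([$\w]+)\(/)[1];
// Collect the double-quoted arguments and strip their quotes
const params = (s.match(/"[\w$]+"/g) || []).map(p => p.slice(1, -1));
return obj[fname](...params);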
} catch (e) {
return e;
}
};
Note I still think there's an easier way to do this if you describe the scenario in more detail. The above will fail for any non-ASCII characters, for instance.

node.js change in concatenation?

I'm trying to debug some code that another programmer has left for me to maintain. I've just attempted to upgrade from node.js 5 to node.js 8 and my database queries are for some requests coming back with key not found errors
We're using couchbase for the database and our document keys are "encrypted" for security. So we may have a key that starts like this "User_myemail#gmail.com" but we encrypt it using the following method:
function _GetScrambledKey(dbKey)
{
//select encryption key based on db key content
var eKeyIndex = CalculateEncryptionKeyIndex(dbKey, eKeys.length);
var sha = CalculateSHA512(dbKey + eKeyIndex);
return sha;
}
function CalculateEncryptionKeyIndex(str, max)
{
var hashBuf = CalculateSHA1(str);
var count = 0;
for (var i = 0; i < hashBuf.length; i++)
{
count += hashBuf[i];
count = count % max;
}
return count;
}
We then query couchbase for the document with
cb.get("ECB_"+encryptedKey, opts, callback);
In node5 this worked, but in node8 some documents are returned fine and others come back as missing. I outputted the "ECB_"+encryptedKey as an int array and the results have only confused me more. They are different on node5 to node8 but only by one character right in the middle of the array.
Outputting the encryptedKey as an int array on both versions shows this
188,106,14,227,211,70,94,97,63,130,78,246,155,65,6,148,62,215,47,230,211,109,35,99,21,60,178,74,195,13,233,253,187,142,213,213,104,58,168,60,225,148,25,101,155,91,122,77,2,99,102,235,26,71,157,99,6,47,162,152,58,181,21,175
Then outputting the concatenated string, in the same way, shows slightly different results
This is the node8 output
Node8 key: 69,67,66,95,65533,106,14,65533,65533,70,94,97,63,65533,78,65533,65533,65,6,65533,62,65533,47,65533,65533,109,35,99,21,60,65533,74,65533,13,65533,65533,65533,65533,65533,65533,104,58,65533,60,65533,25,101,65533,91,122,77,2,99,102,65533,26,71,65533,99,6,47,65533,65533,58,65533,21,65533
And this is the node5 output
Node5 key: 69,67,66,95,65533,106,14,65533,65533,70,94,97,63,65533,78,65533,65533,65,6,65533,62,65533,47,65533,65533,109,35,99,21,60,65533,74,65533,13,65533,65533,65533,65533,65533,65533,104,58,65533,60,65533,65533,25,101,65533,91,122,77,2,99,102,65533,26,71,65533,99,6,47,65533,65533,58,65533,21,65533
I had to run it through a diff tool to see the difference
Comparing that to the original pre-append array it looks like the 225 has just been dropped in node8. Is 225 significant? I can't understand how that would be possible otherwise unless it's a bug. Does anyone have any ideas?
Looks like this was a change in v8 5.5 https://github.com/nodejs/node/issues/21278
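
The 65533 values in the output are U+FFFD, the Unicode replacement character, which appears when bytes that are not valid UTF-8 are turned into a string. The sketch below assumes, since the question does not show CalculateSHA512, that the raw digest bytes end up being decoded as UTF-8. It uses five bytes copied from the question's array and shows that a hex encoding keeps every byte, whatever the Node version does with invalid sequences.

```javascript
// Bytes copied from the middle of the digest shown in the question.
const digest = Buffer.from([60, 225, 148, 25, 101]);

// Decoding arbitrary bytes as UTF-8 is lossy: invalid sequences become U+FFFD.
const asText = digest.toString('utf8');
console.log(asText.includes('\uFFFD')); // true

// Hex encoding is lossless and does not depend on the UTF-8 decoder.
const asHex = digest.toString('hex');
console.log(asHex); // 3ce1941965
console.log(Buffer.from(asHex, 'hex').equals(digest)); // true
```

Note that switching the key encoding would change the keys of documents already stored, so this only illustrates why the two versions can disagree.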
A lot of the issues you are facing, including the concatenation can be cleaned up using newer features from ES6 that are available in node 8.
In general, you should avoid doing string concatenations with the + operator and should use template literals instead. In your case, you should replace the "ECB_"+encryptedKey with `ECB_${encryptedKey}`.
Additionally, if you want to output the integer values from this concatenated string, then you are better off using .join, the spread operator (...) and the Buffer class from Node as follows:
let encKey = `ECB_${encryptedKey}`;
let tmpBuff = Buffer.from(encKey);
let buffArrVals = [...tmpBuff];
console.log(buffArrVals.join(','));
Also, if you can help it, you should avoid using var inside functions as your sample code does. A var declaration is hoisted to the top of the enclosing function, so the variable is visible outside the block it was declared in, which is seldom intended. From node 6 onward the recommendation is to use let or const for variable declarations so that they stay scoped to the block they are declared in.
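
A minimal sketch of the scoping difference, with made-up function names:

```javascript
function withVar() {
  for (var i = 0; i < 3; i++) {}
  // var is function-scoped, so i is still visible after the loop.
  return typeof i;
}

function withLet() {
  for (let j = 0; j < 3; j++) {}
  // let is block-scoped, so j does not exist outside the loop.
  return typeof j;
}

console.log(withVar()); // number
console.log(withLet()); // undefined
```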

Store data from an array and use it later

I had this problem which Cooper helped me to solve it (thanks again for that), but now I'm struggling with a different one. The following script will count how many times a client code will appear on another Spreadsheet using as a second condition yesterday date.
function countSheets()
{
var vA = appSh();
var td = Utilities.formatDate(subDaysFromDate(new Date(),2), Session.getScriptTimeZone(), "dd/MM/yyyy");
var mbs=getAllSheets();
//var s='';
for (var i=2;i<vA.length;i++)
{
var d = Utilities.formatDate(new Date(vA[i][12]), Session.getScriptTimeZone(), "dd/MM/yyyy");
for(var key in mbs)
{
if(vA[i][0]==key && d==td)
{
mbs[key]+=1;
}
}
}
return mbs;
}
Then I have the below code which will search in the main spreadsheet (a table) a string and when was found will return row number, also will search for the date yesterday and return the column number. Based on these information I'll get the range where I need to paste the count result from the first script.
function runScript()
{
var ss=SpreadsheetApp.openById('ID');
var mbs=countSheets();
for(var key in mbs)
{
var sh=ss.getSheetByName(key);
var rg=sh.getDataRange();
var vA=rg.getValues();
for(var i=0;i<vA.length;i++)
{
if(vA[i][1]=='Total Number of Applications')
{
var nr=i;
break;//by terminating as soon as we find a match we should get improved performance. Which is something you cant do in a map.
}
}
if(typeof(nr)!='undefined')//If we don't find a match this is undefined
{
var today=subDaysFromDate(new Date(),2).setHours(0,0,0,0);
for(var i=0;i<vA[3].length;i++)
{
if(vA[3][i])//Some cells in this range have no contents
{
if(today.valueOf()==new Date(vA[3][i]).valueOf())
{
sh.getRange(nr+1,i+1,1,1).setValue(Number(mbs[key]));
}
}
}
}
}
return sh;
}
PROBLEM: I have 24 rows on the main Spreadsheet, so I will need to write the same script 24 times. For example, I need to count Total Number of Applications, Total Number of Calls, Number of Live Adverts and so on. If I do this it will exceed the execution time, since each script takes on average 25 seconds to run.
I did some research on this website and the internet and read about storing values and re-using them over and over. At the moment my script has to go through the same file every time and count for each condition.
Q1: Is there any chance to create another array that contains all those strings from the second script?
Q2: How can I use PropertiesService or anything else to store data so that I don't have to run getValues() over and over? I've read the Google documentation but couldn't understand much from it.
I hope it all makes sense and that this problem can be fixed.
My best regards,
Thank you!
My Approach to your Problem
You probably should write it for a couple of rows and then look at the two of them and see what is unique to each one. What is unique about each one is what you have to figure out how to store or access via an external function call.
The issue of time may require that you run these functions separately. I have a dialog which I use to load databases which does exactly that. It loads 800 lines and waits for 10 seconds then loads another 800 lines and wait for ten seconds and keeps doing that until there are no more lines. True it takes about 10 minutes to do this but I can be doing something else while it's working so I don't really care how long it takes. I do care about minimizing my impact to the Google Server though and so I don't run something like this just for fun.
By the way the 10 second delay is external to the gs function.
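
One way to act on the first suggestion is to treat the row label as the only thing that differs between the 24 scripts and to look up all labels in a single pass over the values. The sketch below is plain JavaScript rather than a complete Apps Script: findLabelRows and LABELS are hypothetical names, and the values array stands in for the result of getDataRange().getValues(), with labels in the second column as in the question's code.

```javascript
// The three labels the question mentions; the remaining ones would be added here.
const LABELS = ['Total Number of Applications', 'Total Number of Calls', 'Number of Live Adverts'];

// Scan the values once and record the 0-based row index of each label found in column B.
function findLabelRows(values, labels) {
  const wanted = new Set(labels);
  const rows = {};
  values.forEach((row, i) => {
    const label = row[1];
    if (wanted.has(label) && !(label in rows)) {
      rows[label] = i;
    }
  });
  return rows;
}

// Made-up sheet contents for the illustration.
const values = [
  ['', 'Header'],
  ['', 'Total Number of Calls'],
  ['', 'Total Number of Applications'],
];

console.log(findLabelRows(values, LABELS));
// { 'Total Number of Calls': 1, 'Total Number of Applications': 2 }
```

With the row index of every label known after one read, the per-label work reduces to writing each count to its row, rather than re-reading the sheet 24 times.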
