Removing 'complicated' duplicates - javascript

Test File
Sometimes my lists of emails include duplicate addresses for the same person. For example, Jane's addresses include both "jane.doe@email.com" and "doe.jane@email.com". Other variants replace the "." with "-" or "_". At the moment, my duplicates script (kindly upgraded by @Jordan Running and Ed Nelson) handles 'strict' duplicates, but it cannot detect that "doe.jane@email.com" is a 'complicated' duplicate of "jane.doe@email.com". Is there a way to delete these duplicates too, so that I do not email more than one of Jane's addresses? All of them point to the same inbox, so I only need to include one of her addresses.
Here is my current code:
function removeDuplicates() {
const startTime = new Date();
const newData = [];
const sheet = SpreadsheetApp.getActiveSheet();
const data = sheet.getDataRange().getValues();
const numRows = data.length;
const seen = {};
for (var i = 0, row, key; i < numRows && (row = data[i]); i++) {
key = JSON.stringify(row);
if (key in seen) {
continue;
}
seen[key] = true;
newData.push(row);
};
sheet.clearContents();
sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
// Show summary
const secs = (new Date() - startTime) / 1000;
SpreadsheetApp.getActiveSpreadsheet().toast(
Utilities.formatString('Processed %d rows in %.2f seconds (%.1f rows/sec); %d deleted',
numRows, secs, numRows / secs, numRows - newData.length),
'Remove duplicates', -1);
}

Sample File
Fuzzy match test
Notes:
the data is used without the @email.com part, because including it distorts the result
use the custom function: =removeDuplicatesFuzzy(B2:B12,0.66)
0.66 is the fuzzy-match threshold (a similarity score between 0 and 1).
the right-hand column of the result (Column D) shows the matches found with a similarity above 0.66. A dash (-) means no match was found (a "unique" value)
Background
You may try this library:
https://github.com/Glench/fuzzyset.js
To install it, copy the code from here.
The usage is simple:
function similar_test(string1, string2)
{
  string1 = string1 || 'jane.doe@email.com';
  string2 = string2 || 'doe.jane@email.com';
  // FuzzySet comes from fuzzyset.js, which must be pasted into the project first
  var a = FuzzySet();
  a.add(string1);
  var result = a.get(string2);
  Logger.log(result); // [[0.6666666666666667, jane.doe@email.com]]
  return result[0][0]; // 0.6666666666666667
}
There's also more info here: https://glench.github.io/fuzzyset.js/
Notes:
for more information, search for "javascript fuzzy string match". A related question is "Javascript fuzzy search that makes sense". Note that the solution has to work in Google Sheets (no ECMAScript 6).
this algorithm is not as smart as a human: it compares strings character by character. A string like don.jeans@email.com scores 84% similar to doe.jane@email.com, although a human can tell it belongs to a completely different person.
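For the specific variants described in the question (a different separator, or the two name parts swapped), an exact-match key may be enough and avoids the false positives noted above. The following is only a sketch under assumptions: canonicalEmailKey and dedupeEmails are hypothetical helper names, each address is assumed to contain a single "@", and treating the order of the name parts as irrelevant is itself an assumption that could merge two different people.

```javascript
// Hypothetical helper: builds an order-insensitive key for an address.
// "jane.doe", "doe_jane" and "Doe-Jane" all map to the same key.
function canonicalEmailKey(email) {
  var parts = String(email).trim().toLowerCase().split('@');
  var local = parts[0];
  var domain = parts.length > 1 ? parts[1] : '';
  // Split the local part on ".", "-" or "_" and drop empty tokens.
  var tokens = local.split(/[._-]+/).filter(function (t) { return t !== ''; });
  tokens.sort(); // assumption: the order of the name parts does not matter
  return tokens.join('.') + '@' + domain;
}

// Hypothetical helper: keeps the first address seen for each key.
function dedupeEmails(emails) {
  var seen = {};
  var result = [];
  for (var i = 0; i < emails.length; i++) {
    var key = canonicalEmailKey(emails[i]);
    if (Object.prototype.hasOwnProperty.call(seen, key)) continue;
    seen[key] = true;
    result.push(emails[i]);
  }
  return result;
}
```

In removeDuplicates, the same idea would mean computing the key from the email column of each row instead of from JSON.stringify(row). Unlike a similarity threshold, this keeps don.jeans and doe.jane apart because their name parts differ.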

Search for my Google Sheets add-on called Flookup. It should do what you want.
For your case, you can use this function:
ULIST(colArray, [threshold])
The parameter details are:
colArray: the column from which unique values are to be returned.
threshold: the minimum percentage similarity between the colArray values that are not unique.
Or you can simply use the Highlight duplicates or Remove duplicates from the add-on menu.
The key feature is that you can adjust the level of strictness by changing the percentage similarity.
Bonus: It will easily catch swaps like jane.doe#email.com / doe.jane#email.com
You can find out more at the official website.

Related

Advice on how to optimise Google AppScript code [duplicate]

I've just written my first Google Apps Script, ported from VBA, which formats a column of customer order information (thanks to all of you for your direction).
Description:
The code identifies state codes by their - prefix, then combines the following first name with a last name (if it exists). It then writes "Order complete" where the last name would have been. Finally, it inserts a necessary blank cell if there is no gap between the orders (see image below).
Problem:
The issue is processing time. It cannot handle longer columns of data. I am warned that
Method Range.getValue is heavily used by the script.
Existing Optimizations:
Per the responses to this question, I've tried to keep as many variables outside the loop as possible, and also improved my if statements. @MuhammadGelbana suggests calling the Range.getValue method just once and moving around with its value, but I don't understand how this would or could work.
Code:
function format() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var s = ss.getActiveSheet();
var lastRow = s.getRange("A:A").getLastRow();
var row, range1, cellValue, dash, offset1, offset2, offset3;
//loop through all cells in column A
for (row = 0; row < lastRow; row++) {
range1 = s.getRange(row + 1, 1);
//if cell substring is number, skip it
//because substring cannot process numbers
cellValue = range1.getValue();
if (typeof cellValue === 'number') {continue;};
dash = cellValue.substring(0, 1);
offset1 = range1.offset(1, 0).getValue();
offset2 = range1.offset(2, 0).getValue();
offset3 = range1.offset(3, 0).getValue();
//if -, then merge offset cells 1 and 2
//and enter "Order complete" in offset cell 2.
if (dash === "-") {
range1.offset(1, 0).setValue(offset1 + " " + offset2);
//Translate
range1.offset(2, 0).setValue("Order complete");
};
//The real slow part...
//if - and offset 3 is not blank, then INSERT CELL
if (dash === "-" && offset3) {
//select from three rows down to last
//move selection one more row down (down 4 rows total)
s.getRange(row + 1, 1, lastRow).offset(3, 0).moveTo(range1.offset(4, 0));
};
};
}
Formatting Update:
For guidance on formatting the output with font or background colors, check this follow-up question here. Hopefully you can benefit from the advice these pros gave me :)
Issue:
Usage of .getValue() and .setValue() in a loop resulting in increased processing time.
Documentation excerpts:
Minimize calls to services:
Anything you can accomplish within Google Apps Script itself will be much faster than making calls that need to fetch data from Google's servers or an external server, such as requests to Spreadsheets, Docs, Sites, Translate, UrlFetch, and so on.
Look ahead caching:
Google Apps Script already has some built-in optimization, such as using look-ahead caching to retrieve what a script is likely to get and write caching to save what is likely to be set.
Minimize "number" of read/writes:
You can write scripts to take maximum advantage of the built-in caching, by minimizing the number of reads and writes.
Avoid alternating read/write:
Alternating read and write commands is slow
Use arrays:
To speed up a script, read all data into an array with one command, perform any operations on the data in the array, and write the data out with one command.
Slow script example:
/**
* Really Slow script example
* Get values from A1:D2
* Set values to A3:D4
*/
function slowScriptLikeVBA(){
const ss = SpreadsheetApp.getActive();
const sh = ss.getActiveSheet();
//get A1:D2 and set it 2 rows down
for(var row = 1; row <= 2; row++){
for(var col = 1; col <= 4; col++){
var sourceCellRange = sh.getRange(row, col, 1, 1);
var targetCellRange = sh.getRange(row + 2, col, 1, 1);
var sourceCellValue = sourceCellRange.getValue();//1 read call per loop
targetCellRange.setValue(sourceCellValue);//1 write call per loop
}
}
}
Notice that two calls are made per iteration (the Spreadsheet ss, Sheet sh and getRange calls are excluded; only the expensive get/set value calls are counted). The nested loops run 8 times in total, so 8 read calls and 8 write calls are made in this example for a simple copy and paste of a 2x4 array.
In addition, notice that the read and write calls alternate, making "look-ahead" caching ineffective.
Total calls to services: 16
Time taken: ~5+ seconds
Fast script example:
/**
* Fast script example
* Get values from A1:D2
* Set values to A3:D4
*/
function fastScript(){
const ss = SpreadsheetApp.getActive();
const sh = ss.getActiveSheet();
//get A1:D2 and set it 2 rows down
var sourceRange = sh.getRange("A1:D2");
var targetRange = sh.getRange("A3:D4");
var sourceValues = sourceRange.getValues();//1 read call in total
//modify `sourceValues` if needed
//sourceValues looks like this two dimensional array:
//[//outer array containing rows array
// ["A1","B1","C1","D1"], //row1(inner) array containing column element values
// ["A2","B2","C2","D2"],
//]
//@see https://stackoverflow.com/questions/63720612
targetRange.setValues(sourceValues);//1 write call in total
}
Total calls to services: 2
Time taken: ~0.2 seconds
References:
Best practices
What does the range method getValues() return and setValues() accept?
Using methods like .getValue() and .moveTo() can be very expensive on execution time. An alternative approach is to use a batch operation where you get all the column values and iterate across the data reshaping as required before writing to the sheet in one call. When you run your script you may have noticed the following warning:
The script uses a method which is considered expensive. Each
invocation generates a time consuming call to a remote server. That
may have critical impact on the execution time of the script,
especially on large data. If performance is an issue for the script,
you should consider using another method, e.g. Range.getValues().
Using .getValues() and .setValues() your script can be rewritten as:
function format() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var s = ss.getActiveSheet();
var lastRow = s.getLastRow(); // more efficient way to get last row
var row;
var data = s.getRange("A:A").getValues(); // gets a [][] of all values in the column
var output = []; // we are going to build a [][] to output result
//loop through all cells in column A
for (row = 0; row < lastRow; row++) {
var cellValue = data[row][0];
var dash = false;
if (typeof cellValue === 'string') {
dash = cellValue.substring(0, 1);
} else { // if a number copy to our output array
output.push([cellValue]);
}
// if a dash
if (dash === "-") {
var name = (data[(row+1)][0]+" "+data[(row+2)][0]).trim(); // build name
output.push([cellValue]); // add row -state
output.push([name]); // add row name
output.push(["Order complete"]); // row order complete
output.push([""]); // add blank row
row++; // jump an extra row to speed things up
}
}
s.clear(); // clear all existing data on sheet
// if you need other data in sheet then could
// s.deleteColumn(1);
// s.insertColumns(1);
// set the values we've made in our output [][] array
s.getRange(1, 1, output.length).setValues(output);
}
Testing your script with 20 rows of data showed that it took 4.415 seconds to execute, whereas the code above completes in 0.019 seconds.
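If it helps to check the reshaping logic without touching a sheet, the loop can be pulled out into a plain function that takes and returns a 2D array. This is only a sketch: reshapeOrders is a hypothetical name, the logic mirrors the loop in the code above, and the bounds checks for a state code near the end of the data are an added assumption.

```javascript
// Hypothetical pure version of the reshaping loop.
// `column` is a 2D array with one value per row, as returned by getValues().
function reshapeOrders(column) {
  var output = [];
  for (var row = 0; row < column.length; row++) {
    var cell = column[row][0];
    if (typeof cell !== 'string') { // numbers are copied through unchanged
      output.push([cell]);
      continue;
    }
    if (cell.charAt(0) !== '-') continue; // names and blanks are handled by their state row
    // Guard against reading past the end of the data (added assumption).
    var first = row + 1 < column.length ? column[row + 1][0] : '';
    var last = row + 2 < column.length ? column[row + 2][0] : '';
    output.push([cell]);                        // state code row
    output.push([(first + ' ' + last).trim()]); // combined name
    output.push(['Order complete']);
    output.push(['']);                          // blank row between orders
    row++; // skip the first-name row; the last-name row is ignored on the next pass
  }
  return output;
}
```

In the script, the result could then be written with a single call such as s.getRange(1, 1, output.length, 1).setValues(output), where output is the array returned by reshapeOrders(data).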

How to use correctly setFormula in App Script

I am having trouble using setFormula correctly in Apps Script. I tried to use setFormula on an indeterminate range of cells, but I do not know how to make the row references in the formula change with each row instead of pointing at one fixed range. The script I am trying to write applies a condition: if a cell in a range contains information, then put the formula in a cell on the same row.
function formulas() {
var activeSheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Sheet 1");
var rows = activeSheet.getMaxRows();
for(var i=7; i <= rows; i++){
var workingCell = activeSheet.getRange(i, 3).getValue();
if(workingCell != ""){
activeSheet.getRange(i, 4).setFormula("=$B$5"); //this is fine
activeSheet.getRange(i, 5).setFormula("=((100/H7)*I7)/100"); //but this not
}
}
}
How can I make it so that in row 8 the formula is "=((100/H8)*I8)/100", and so on?
EDIT
The problem is that I am trying to apply it to many cells, and the row references should follow the row in which I am placing the formula. If a formula is added in D9 and D10, then the references should be H9, I9 and H10, I10.
The simplest "fix" , as pointed out in a comment, is to concatenate your i loop variable into the formula, like this:
function formulas() {
var activeSheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Sheet 1");
var rows = activeSheet.getLastRow(); //maxRows consider blank rows, you don't need those
for(var i=7; i <= rows; i++){
var workingCell = activeSheet.getRange(i, 3).getValue();
if(workingCell != ""){
activeSheet.getRange(i, 4).setFormula("=$B$5");
activeSheet.getRange(i, 5).setFormula("=((100/H" +i+ ")*I" +i+ ")/100");
}
}
}
Anyway, this function executes too many gets and sets against the spreadsheet, and this will perform poorly as your sheet grows. You should try to minimize all your sets and gets by issuing them in bulk, that is, against a bigger range rather than cell-by-cell.
Your use-case has a problem with this approach because you have some blank spots in your range (when workingCell is blank). If setting a "blank" formula for those values is not an issue for you, then you can speed your script greatly by using this:
function formulas() {
var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Sheet 1"); //not necessarily active
var workingCells = sheet.getSheetValues(7, 3, -1, 1); //-1 == lastRow
var r1c1formulas = [];
for (var i=0; i < workingCells.length; i++){
if (workingCells[i][0] != "") {
r1c1formulas.push(['=R5C2', '=((100/R[0]C[3])*R[0]C[4])/100']);
} else
r1c1formulas.push(['=""','=""']);
}
sheet.getRange(7, 4, workingCells.length, 2).setFormulasR1C1(r1c1formulas);
}
The 2nd "trick" is to use the formulas in R1C1 notation instead of regular A1 style. Check the setFormulaR1C1 documentation here.
R1C1 notation may seem daunting at first, but it is rather simple; I'd say it is simpler than A1 notation. I'll try to summarize it here. R stands for row and C for column, and each letter is followed by the row or column number (columns are numbered rather than lettered). So =$B$5 is written as =R5C2.
The last difference is the relative reference. In A1 notation you simply leave out the '$' signs, which is not really intuitive and not very flexible when you are trying to set a bunch of formulas at once (exactly your use case). In A1 notation a relative formula is literally a different formula in each cell: =B1 is not the same text as =C1, even though the two could mean "the same" relative reference when set on two consecutive cells in the same row.
Anyway, on the R1C1 notation the relative reference is counted as number of rows and columns from the cell that is the reference.
So, when you set the formula =H7*I7 on cell E7, you count that H is 3 columns ahead of E and I is 4 columns ahead. Both are on the same row, so the row difference is zero. Lastly, to write a relative reference you wrap the number in []. Therefore =H7 * I7 on E7 becomes =R[0]C[3] * R[0]C[4].
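The counting described above can be written down as a small conversion. This is a sketch for illustration only: columnToNumber, parseA1 and toRelativeR1C1 are hypothetical helpers (not part of Apps Script), and only simple references such as H7 are handled, with no "$" signs, sheet names or ranges.

```javascript
// Hypothetical helper: "A" -> 1, "Z" -> 26, "AA" -> 27.
function columnToNumber(letters) {
  var upper = letters.toUpperCase();
  var n = 0;
  for (var i = 0; i < upper.length; i++) {
    n = n * 26 + (upper.charCodeAt(i) - 64); // "A" has char code 65
  }
  return n;
}

// Hypothetical helper: "H7" -> { row: 7, col: 8 }.
function parseA1(a1) {
  var m = /^([A-Za-z]+)(\d+)$/.exec(a1);
  if (!m) throw new Error('Not a simple A1 reference: ' + a1);
  return { row: parseInt(m[2], 10), col: columnToNumber(m[1]) };
}

// Relative R1C1 reference to `referencedCell`, as seen from `formulaCell`.
function toRelativeR1C1(formulaCell, referencedCell) {
  var from = parseA1(formulaCell);
  var to = parseA1(referencedCell);
  return 'R[' + (to.row - from.row) + ']C[' + (to.col - from.col) + ']';
}

// toRelativeR1C1('E7', 'H7') gives 'R[0]C[3]'
// toRelativeR1C1('E7', 'I7') gives 'R[0]C[4]'
```

Because the offsets do not depend on the row, the same formula text can be pushed for every row, which is what lets setFormulasR1C1 write the whole block in one call.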

Google Script Error: "Incorrect range width" when using setValues

Try as I might I CANNOT decipher the problem that I'm having writing new rows to a sheet. I've done this several times and I've debugged this thoroughly using Logger.log, but I just can't solve it. Here's a summary of what I'm doing, a code snippet, and a log:
What I'm doing:
Adding rows to a sheet (below existing rows)
73 new rows are stored in an array: GradeRows
When I attempt to write the new rows to the sheet, I get this error:
Incorrect range width, was 1 should be 26
Here’s the code including some Logger.logs:
var BeginningRow = LastSGRowSheet + 1;
var EndingRow = BeginningRow + SGPushKtr -1;
Logger.log("BeginningRow =>" + BeginningRow + "<=, SGPushKtr =>" + SGPushKtr + "<=, Ending Row =>" + EndingRow + "<=");
var GradesRangeString = 'A' + BeginningRow + ':' + LastStudentGradesColumnLetter + EndingRow;
Logger.log("GradesRangeString =>" + GradesRangeString + "<=");
StudentGradeSheet.getRange(GradesRangeString).setValues(GradeRows);
The error occurs in that last line of code.
Here’s the log:
[17-12-31 11:51:15:763 EST] BeginningRow =>364<=, SGPushKtr =>73<=, Ending Row =>436<=
[17-12-31 11:51:15:764 EST] GradesRangeString =>A364:Z436<=
Let's say that your data array is dA. Then the number of rows in that array is dA.length and, assuming it's a rectangular array, the number of columns is dA[0].length. So your output command has to be something like this.
sheet.getRange(firstRow,firstColumn,dA.length,dA[0].length).setValues(dA);
If you'd like to learn a little more about this problem check this out.
You could also append each row to the current sheet one row at a time in loop.
It's hard to know why GradeRows doesn't match your range without seeing all of your code.
Using Cooper's getRange arguments will likely reveal your problem, and will prevent you from having to update your row and column variables when you make changes to your code. Another issue that gets me sometimes is the fact that the setValues array needs to be exactly the same dimensions as the range. If one row has a different length, it will fail. If the logic I use to create row arrays can result in different lengths, I use the function below to make sure my arrays are symmetric before writing them to a sheet. It is also helpful for debugging.
/**
 * Takes a 2D array whose rows may differ in length and pads the
 * shorter rows (in place) with empty strings so that all rows
 * end up with the same length.
 * @param {Array[]} ar
 * @return {Array[]}
 */
function symmetric2DArray(ar){
  if (!Array.isArray(ar)) return [['not an array']];
  // A return inside forEach only leaves the callback, so check the rows up front.
  if (!ar.every(function(row){ return Array.isArray(row); })) return [['not a 2D array']];
  // Find the longest row.
  var maxLength = 0;
  ar.forEach(function(row){
    if (row.length > maxLength) maxLength = row.length;
  });
  // Pad every shorter row up to that length.
  ar.forEach(function(row){
    while (row.length < maxLength){
      row.push('');
    }
  });
  return ar;
}
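Before padding anything, it can also help to see exactly which rows do not have the width the range expects, since the message "was 1 should be 26" only reports one mismatch. The sketch below is an illustration under assumptions: findMismatchedRows is a hypothetical helper name, and expectedWidth would be the number of columns in the target range (26 for A:Z).

```javascript
// Hypothetical debugging helper: lists the rows whose width differs
// from the width of the range that setValues() will be called on.
function findMismatchedRows(values, expectedWidth) {
  var bad = [];
  for (var i = 0; i < values.length; i++) {
    // -1 marks an element that is not an array at all (e.g. a 1D array was passed)
    var width = Array.isArray(values[i]) ? values[i].length : -1;
    if (width !== expectedWidth) {
      bad.push({ index: i, width: width });
    }
  }
  return bad;
}
```

Logging JSON.stringify(findMismatchedRows(GradeRows, 26)) just before the setValues call would show whether every row is too short or only a few are.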
How about using appendRow()? That way you don't need to do lots of calculations about the range. You can loop through your data and add it row by row. Something like this:
var myDataArr = [[1,2],[3,4],[5,6]];
myDataArr.forEach(function(arrayItem){
  sheet.appendRow([arrayItem[0],arrayItem[1]]);
});
// This will output to the sheet in three rows.
// [1][2]
// [3][4]
// [5][6]

Find the index of a string in Javascript with help of first three characters

I have numerous TSV files, each with a header row. One column name in the header row is age. In a few files the column name is exactly age, while in other files it is followed by an EOL character such as \r or \n.
How can I use str.indexOf('age') so that I get the index of age regardless of whether the column name is followed by an EOL character such as \n or \r?
For example:
tsv file1:
Name Address Age Ph_Number
file 2:
Name Address Age\r
file 3:
Name Address Age\n
I am trying to find the index of the age column in each file's header row.
However, when I do
header.indexOf('age')
it gives me a result only for file 1, because in the other two files age appears as age\r and age\n.
My question is: how should I find the index of age regardless of a \r or \n character attached to age in the header row?
I have the following script now:
var headers = rows[0].split('\t');
if (file.name === 'subjects.tsv'){
for (var i = 0; i < rows.length; i++) {
var ageIdColumn = headers.indexOf("age");
console.log(headers)
As I stated in the comments, indexOf() returns the starting position of the string. It doesn't matter what comes after it:
var csvFile1 = 'column1,column2,column3,age,c1r1';
var csvFile2 = 'column1,column2,column3,age\r,c1r1';
var csvFile3 = 'column1,column2,column3,age\n,c1r1';
console.log(csvFile1.indexOf("age"));
console.log(csvFile2.indexOf("age"));
console.log(csvFile3.indexOf("age"));
If you specifically want to find the versions with the special characters, just look for them explicitly:
var csvFile4 = 'column1,age\r,column2,column3,age\n,c1r1';
console.log(csvFile4.indexOf("age\r"));
console.log(csvFile4.indexOf("age\n"));
Lastly, it may be that you are confused as to what, exactly indexOf() is supposed to do. It is not supposed to tell you where all occurrences of a given string are. It stops looking after the first match. To get all the locations, you'd need a loop similar to this:
var csvFile5 = 'column1,age\r,column2,age, column3,age\n,c1r1';
var results = []; // Found indexes will be stored here.
var pos = null; // Stores the last index position where "age" was found
while (pos !== -1){
// store the index where "age" is found
// If pos is not null, then we've already found age earlier and we
// need to start looking for the next occurrence 3 characters after
// where we found it last time. If pos is null, we haven't found it
// yet and need to start from the beginning.
pos = csvFile5.indexOf("age", pos != null ? pos + 3 : pos );
pos !== -1 ? results.push(pos) : "";
}
// All the positions where "age" was in the string (irrespective of what follows it)
// are recorded in the array:
console.log(results);
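Note that the script in the question calls indexOf on the array produced by split('\t'), and Array.prototype.indexOf only matches whole elements, so "age\r" is not found when searching for "age". A minimal sketch of one way around this is to trim each header before comparing. findColumnIndex is a hypothetical helper name, and the case-insensitive comparison is an assumption based on the sample headers using "Age".

```javascript
// Hypothetical helper: returns the column position of `name` in a
// tab-separated header line, ignoring surrounding whitespace such as
// a trailing \r or \n, and ignoring case (assumption).
function findColumnIndex(headerLine, name) {
  var headers = headerLine.split('\t');
  var wanted = name.trim().toLowerCase();
  for (var i = 0; i < headers.length; i++) {
    if (headers[i].trim().toLowerCase() === wanted) {
      return i;
    }
  }
  return -1; // same convention as indexOf
}

// findColumnIndex('Name\tAddress\tAge\r', 'age') gives 2
```

Trimming works here because String.prototype.trim removes line terminators as well as spaces.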

How do I compare string data between cells in a google spreadsheet?

If I copy/paste the information into both cells my script runs correctly and matches the strings in the cells to the correct row for the user so I can lookup their email. If I let my google form fill the first cell however the data in the two cells no longer matches. I'm probably overlooking something obvious about comparing the strings but hopefully someone can point me in the right direction. Here is the code I have so far.
var rows = SpreadsheetApp.getActiveSheet().getLastRow();
var cell = SpreadsheetApp.getActiveSheet().getRange(rows, 2);
var value = cell.getValue().toString();
var ss = SpreadsheetApp.getActiveSpreadsheet();
ss.setActiveSheet(ss.getSheets()[1]);
var sheet;
var teacher;
var cc = "no match";
for(var h=1; h <= ss.getLastRow(); h++)
{
sheet = ss.getActiveSheet().getRange(h, 2);
teacher = sheet.getValue().toString();
if (value == teacher)
cc = ss.getActiveSheet().getRange(h, 1).getValue().toString();
}
Try setting a breakpoint (click on the line number) or using a message box to see what happens,
i.e. Browser.msgBox(teacher); before testing the value of value.
Also, maybe don't use "value" as a variable name; it could cause problems when the script executes.
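When inspecting the two values, differences such as a trailing space or a line break are invisible in a message box. The sketch below shows one way to make them visible. revealString and sameAfterTrim are hypothetical helper names, and the idea that the form response contains stray whitespace is only a guess, not something the question confirms.

```javascript
// Hypothetical helper: shows a value with quotes, escapes and its length,
// so that trailing spaces or line breaks become visible.
function revealString(value) {
  var s = String(value);
  return JSON.stringify(s) + ' (length ' + s.length + ')';
}

// Hypothetical helper: compares two values while ignoring leading and
// trailing whitespace.
function sameAfterTrim(a, b) {
  return String(a).trim() === String(b).trim();
}

// revealString('Smith ') gives '"Smith " (length 6)'
```

For example, Browser.msgBox(revealString(value) + ' vs ' + revealString(teacher)) would show both strings side by side. If they differ only in whitespace, sameAfterTrim(value, teacher) could replace the == comparison.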
