A more efficient 'remove duplicates' function - javascript

I manage Google Sheet lists that sometimes exceed 10,000 rows. For sheets with up to around 5,000 rows, the remove duplicates function below works fine. But for anything above 5,000, I receive the 'Exceeded maximum execution time' error. I would be grateful for some guidance on how to make the code more efficient so that it runs smoothly even for sheets with 10k+ rows.
function removeDuplicates() {
var sheet = SpreadsheetApp.getActiveSheet();
var data = sheet.getDataRange().getValues();
var newData = new Array();
for(i in data){
var row = data[i];
var duplicate = false;
for(j in newData){
if(row.join() == newData[j].join()){
duplicate = true;
}
}
if(!duplicate){
newData.push(row);
}
}
sheet.clearContents();
sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
}

There are a couple of things that are making your code slow. Let's look at your two for loops:
for (i in data) {
var row = data[i];
var duplicate = false;
for (j in newData){
if (row.join() == newData[j].join()) {
duplicate = true;
}
}
if (!duplicate) {
newData.push(row);
}
}
On the face of it, you're doing the right things: For every row in the original data, check if the new data already has a matching row. If it doesn't, add the row to the new data. In the process, however, you're doing a lot of extra work.
Consider, for example, the fact that at any given time, a row in data will have no more than one matching row in newData. But in your inner for loop, after you find that one match, it still continues checking the rest of the rows in newData. The solution to this would be to add a break; after duplicate = true; to stop iterating.
Consider also that for any given j, the value of newData[j].join() will always be the same. Suppose you have 100 rows in data, and no duplicates (the worst case). By the time your function finishes, you'll have calculated newData[0].join() 99 times, newData[1].join() 98 times... all in all you'll have done almost 5,000 calculations to get the same 99 values. A solution to this is memoization, whereby you store the result of a calculation in order to avoid doing the same calculation again later.
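As a rough sketch of those two changes together (the helper name `dedupeRowsQuadratic` and the `keys` array are made up for illustration, and the function works on a plain 2D array rather than a sheet), each kept row's joined string is stored once and the inner loop stops at the first match:

```javascript
// Hypothetical helper: same algorithm as the question, with the two fixes applied.
function dedupeRowsQuadratic(data) {
  var newData = [];
  var keys = []; // keys[j] caches newData[j].join() so it is computed only once
  for (var i = 0; i < data.length; i++) {
    var key = data[i].join(); // computed once per input row
    var duplicate = false;
    for (var j = 0; j < keys.length; j++) {
      if (key === keys[j]) {
        duplicate = true;
        break; // a row can match at most one kept row, so stop looking
      }
    }
    if (!duplicate) {
      newData.push(data[i]);
      keys.push(key);
    }
  }
  return newData;
}
```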
Even if you make those two changes, though, your code's time complexity is still O(n²). If you have 100 rows of data, in the worst case the inner loop will run 4,950 times. For 10,000 rows that number is around 50 million.
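Those figures come from counting pairs: with n rows and no duplicates, each row is compared against every row kept before it, for n(n-1)/2 comparisons in total. A quick check (the function name is just for illustration):

```javascript
// Worst-case inner-loop iterations for n rows with no duplicates: 0 + 1 + ... + (n - 1).
function worstCaseComparisons(n) {
  return n * (n - 1) / 2;
}

console.log(worstCaseComparisons(100));   // 4950
console.log(worstCaseComparisons(10000)); // 49995000
```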
However, we can do this in O(n) time instead, if we get rid of the inner loop and reformulate the outer loop like so:
var seen = {};
for (var i in data) {
var row = data[i];
var key = row.join();
if (key in seen) {
continue;
}
seen[key] = true;
newData.push(row);
}
Here, instead of checking every row of newData for a row matching row in every iteration, we store every row we've seen so far as a key in the object seen. Then in each iteration we just have to check if seen has a key matching row, an operation we can do in nearly constant time, or O(1).
As a complete function, here's what it looks like:
function removeDuplicates_() {
const startTime = new Date();
const sheet = SpreadsheetApp.getActiveSheet();
const data = sheet.getDataRange().getValues();
const numRows = data.length;
const newData = [];
const seen = {};
for (var i = 0, row, key; i < numRows && (row = data[i]); i++) {
key = JSON.stringify(row);
if (key in seen) {
continue;
}
seen[key] = true;
newData.push(row);
}
sheet.clearContents();
sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
// Show summary
const secs = (new Date() - startTime) / 1000;
SpreadsheetApp.getActiveSpreadsheet().toast(
Utilities.formatString('Processed %d rows in %.2f seconds (%.1f rows/sec); %d deleted',
numRows, secs, numRows / secs, numRows - newData.length),
'Remove duplicates', -1);
}
function onOpen() {
SpreadsheetApp.getActive().addMenu('Scripts', [
{ name: 'Remove duplicates', functionName: 'removeDuplicates_' }
]);
}
You'll see that instead of using row.join() this code uses JSON.stringify(row), because row.join() is fragile (['a,b', 'c'].join() == ['a', 'b,c'].join(), for example). JSON.stringify isn't free, but it's a good compromise for our purposes.
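To see the difference between the two key functions outside a spreadsheet, here is a small sketch that runs on plain arrays. It assumes a runtime that provides `Set` (for example the Apps Script V8 runtime), and `dedupeRows` is a name invented for this example rather than part of the function above:

```javascript
// Hypothetical pure-JavaScript helper; no SpreadsheetApp calls, so it can be tried anywhere.
function dedupeRows(data, makeKey) {
  var seen = new Set();
  return data.filter(function (row) {
    var key = makeKey(row);
    if (seen.has(key)) return false; // a row with this key was already kept
    seen.add(key);
    return true;
  });
}

var rows = [['a,b', 'c'], ['a', 'b,c'], ['a,b', 'c']];

// join() gives all three rows the key "a,b,c", so a distinct row is lost.
console.log(dedupeRows(rows, function (r) { return r.join(); }).length);          // 1
// JSON.stringify keeps the cell boundaries, so only the true duplicate is dropped.
console.log(dedupeRows(rows, function (r) { return JSON.stringify(r); }).length); // 2
```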
In my tests this processes a simple spreadsheet with 50,000 rows and 2 columns in a little over 8 seconds, or around 6,000 rows per second.

Related

While loop not stopping when 2D array cell isn't defined (javascript)

Goal: Run through the columns of a 2D array (comes from an Excel file with uneven column lengths) and put the entries that exist into their own array.
What I did: The length of the longest column is 90 entries, which is the second column in the Excel file, and the shortest is 30, which is the first column. I set up a for loop to go through each column and a while loop to go through each entry while it exists and append it to a new array.
Original(ish) Code:
//read in Excel file into 2D array called "myExcel"
var columnNames = ["shortest", "longest", "irrelevant"];
shortArray = [];
longArray = [];
irrArray = [];
var s
for (var i = 0; i < columnNames.length; i++) {
var columnName = columnNames[i];
s = 0;
while (myExcel[s][columnName]) {
if ((columnName === "shortest")) {
var row = myExcel[s][columnName];
shortArray.append(row);
s++;
} else if ((columnName === "longest")) {
var row = myExcel[s][columnName];
longArray.append(row);
s++;
} else if ((columnName === "irrelevant")) {
var row = myExcel[s][columnName];
irrArray.append(row);
s++;
}
}
}
Problem: It's only half working. It makes it through the first column (30 rows) just fine--it stops when myExcel[s][columnName] no longer exists (when columnName = "shortest" and after s = 29). Then, it makes it all the way through columnName = "longest" and s = 89 before giving me the error "TypeError: Cannot read property 'longest' of undefined". I'm assuming it's because it's trying to go through row 90, which doesn't exist. But I thought that's where my while loop would stop.
What I've Tried:
do while loop
//blah
do {
//blah
} while (myExcel[s][columnName]);
Added additional while loop condition
//blah
while ((myExcel[s][columnName]) && s<myExcel.length) {
//blah
}
Using typeof
//blah
while (typeof (myExcel[s][columnName]) === 'string') { //also used this with !=='undefined' and ==='string' when I added a number to the end of each row in the Excel sheet
//blah
}
And basically every combination of these (and probably much more I'm forgetting). I'm sure it's an easy fix, but I've spent days trying to figure it out so I guess I have to ask for help at this point. I'm also a MATLAB person and recently had to learn both Python and Javascript because of COVID, so it could possibly be a language switch issue (although I don't think so because I've been googling and messing with this for days). Any help would be very appreciated!
In your while loop change the check to,
while(myExcel[s] && myExcel[s][columnName] ) {
If you are writing modern JavaScript, you can use optional chaining instead: while (myExcel[s]?.[columnName])
The problem is that myExcel[s] is undefined once s runs past the last row, and reading a property of undefined throws. Check that the row exists first, then read the column from it.
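A small self-contained illustration of the row-existence check, using made-up data in place of the Excel import (`myExcel` here is a two-row stand-in, and `collectColumn` is a hypothetical helper):

```javascript
// Stand-in for the imported sheet: an array of row objects with uneven columns.
var myExcel = [
  { shortest: 'a', longest: 'x' },
  { longest: 'y' }
];

// Hypothetical helper: collect a column's entries until the first missing one.
function collectColumn(rows, columnName) {
  var out = [];
  var s = 0;
  // rows[s] is undefined once s passes the last row, so test it before
  // reading a property from it; && short-circuits and the loop ends cleanly.
  while (rows[s] && rows[s][columnName]) {
    out.push(rows[s][columnName]);
    s++;
  }
  return out;
}

console.log(collectColumn(myExcel, 'shortest')); // ['a']
console.log(collectColumn(myExcel, 'longest'));  // ['x', 'y']
```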
I don't fully understand your approach, but I think you're looking for this:
var shortArray = [];
var longArray = [];
var irrArray = [];
for(let row of myExcel){
if(!row) continue; // not sure if this check is necessary.
if(row.shortest) shortArray.push(row.shortest); // JavaScript arrays use push(), not append()
if(row.longest) longArray.push(row.longest);
if(row.irrelevant) irrArray.push(row.irrelevant);
}

For Loop deleteRow(i) deleting the wrong rows

I'm using Google Apps Script to write a script to edit Google Sheets for a Mailing List. I'd like it to run through all rows and delete any rows with 'BOUNCED' 'ERROR' or 'NO_RECIPIENT' in a specific cell.
The problem I'm having is the For Loop uses brackets [ ] to designate the rows and columns, which indexes the first row at 0. The deleteRows() action uses curved parenthesis, which indexes the first row at 1. For this reason, I'm having trouble deleting the correct row.
If I program deleteRow(i), it deletes the row following the one being tested by the For loop. If I program deleteRow(i+1), it deletes the correct row the first time, but subsequently deletes the following row. See my code below:
function cleanUp() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet = ss.getActiveSheet();
var data = sheet.getDataRange().getValues();
for ( var i = 1; i < 30; i++) {
if (data[i][9] === 'ERROR' || data[i][9] === 'BOUNCED' || data[i][9] === 'NO_RECIPIENT') {
sheet.deleteRow(i+1);
}
}
}
Once a row is deleted, the rows below it shift up by one, so the indexes from the original data no longer match the sheet. One way to avoid this problem is to loop in reverse order.
In other words, instead of
for(var i = 1; i < 30; i++)
Use
for(var i = 29 ; i > 0; i--)
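
A sketch of the reverse loop applied to the question's function. It reads the row count from the data instead of hard-coding 30, which is an assumption about what was intended, and `isBadStatus` and `rowsToDelete` are helpers invented here:

```javascript
// Hypothetical helper: true for the statuses the question wants removed.
function isBadStatus(value) {
  return value === 'ERROR' || value === 'BOUNCED' || value === 'NO_RECIPIENT';
}

// Returns the 1-based sheet rows to delete, bottom-most first.
// data is the 2D array from getValues(); row 0 is treated as a header.
function rowsToDelete(data, columnIndex) {
  var rows = [];
  for (var i = data.length - 1; i > 0; i--) {
    if (isBadStatus(data[i][columnIndex])) {
      rows.push(i + 1); // array index i is sheet row i + 1
    }
  }
  return rows;
}

function cleanUp() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  var data = sheet.getDataRange().getValues();
  // Deleting from the bottom up means rows not yet visited keep their numbers.
  rowsToDelete(data, 9).forEach(function (row) {
    sheet.deleteRow(row);
  });
}
```

Splitting out `rowsToDelete` keeps the index arithmetic in one place, where it can be checked without touching a sheet.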

Finding an empty cell in a column using google sheet script

I am trying to find an empty cell in a specific column in google sheets. I am familiar with getLastRow(), but it returns the last row in whole sheet. I want to get the first empty cell in a specific column. So, I used a for loop and if. But I don't know why it is not working. The problem is that for loop does not return anything.
I am getting 10 rows of the column H (position 8) in the sheet test (line2). first 5 rows already have content. Data will be added to this cell later using another code. so, I want to be able to find the first empty cell in this column to be able to put the new data.
var sheettest = SpreadsheetApp.getActive().getSheetByName("test");
var columntest = sheettest.getRange(4, 8, 10).getValues();
for(var i=0; columntest[i]=="" ; i++){
if (columntest[i] ==""){
var notationofemptycell = sheettest.getRange(i,8).getA1Notation();
return notationofemptycell
}
}
As I said, I want to find the first empty cell in that column. I defined the for loop to go on as long as the cell is not empty. then if the cell is empty, it should return the A1 notation of that cell.
It seems the for loop does go on until it find the empty cell, because I get no definition for the var "notationofemptycell" in debugging.
This may be faster:
function getFirstEmptyCellIn_A_Column() {
var rng,sh,values,x;
/* Ive tested the code for speed using many different ways to do this and using array.some
is the fastest way - when array.some finds the first true statement it stops iterating -
*/
sh = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('test');
rng = sh.getRange('A:A');//This is the fastest - Its faster than getting the last row and
//getting a specific range that goes only to the last row
values = rng.getValues(); // get all the data in the column - This is a 2D array
x = 0;
values.some(function(ca,i){
//Logger.log(i)
//Logger.log(ca[0])
x = i;//Set the value every time - its faster than first testing for a reason to set the value
return ca[0] == "";//The first time that this is true it stops looping
});
Logger.log('x: ' + x)//x is the index of the value in the array - which is one less than
//the row number
return x + 1;
}
You need to loop through the length of columntest. Try this:
function myFunction() {
var sheettest = SpreadsheetApp.getActive().getSheetByName("test");
var lr=sheettest.getLastRow()
var columntest = sheettest.getRange(4, 8, lr).getValues();
for(var i=0; i<columntest.length ; i++){
if (columntest[i][0] === ""){
// columntest[0] is sheet row 4, so index i corresponds to row i + 4
var notationofemptycell = sheettest.getRange(i + 4, 8).getA1Notation();
Logger.log(notationofemptycell);
break; // stop when the first blank cell in column H is found
}
}
}
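The core of the task is scanning the column values for the first blank and converting the array index back to a sheet row. A minimal sketch of that conversion, assuming the range starts at row 4 of column H as in the question (`firstEmptyRow` and `firstEmptyCellNotation` are names made up for this example):

```javascript
// values is the 2D array from getValues() for a single column;
// startRow is the sheet row of values[0]. Returns -1 if no cell is blank.
function firstEmptyRow(values, startRow) {
  for (var i = 0; i < values.length; i++) {
    if (values[i][0] === '') {
      return startRow + i; // index 0 corresponds to startRow
    }
  }
  return -1;
}

function firstEmptyCellNotation() {
  var sheet = SpreadsheetApp.getActive().getSheetByName('test');
  var startRow = 4, column = 8; // column H, starting at row 4
  var values = sheet.getRange(startRow, column, 10).getValues();
  var row = firstEmptyRow(values, startRow);
  return row === -1 ? null : sheet.getRange(row, column).getA1Notation();
}
```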

Google Script to remove duplicates exceeds processing time

I have a Google Sheet that has 10k+ rows of data. While it should be rare, there could be instances of duplicate data being entered into the tab, and I have written a script to search for and remove those duplicates. For a while, this script has been running nicely and doing exactly what I expected it to do. But now that the tab has grown to over 10k rows, the script is exceeding the 6 minute time limit.
I've based this function on this tutorial.
// remove duplicates on Ship Details Complete
function duplicateShipDetailsComplete() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sourceSheet = ss.getSheetByName("Shipment Details Complete");
var sourceRange = sourceSheet.getRange(2, 1, sourceSheet.getLastRow(), 16)
var sourceData = sourceRange.getValues();
var keepData = new Array();
var deleteCount = 0;
for(i in sourceData) { // look for duplicates
var row = sourceData[i];
var duplicate = false; // initialize as not a duplicate
for(j in keepData) { // compare the current row in data to the rows in newData
if(row[2] == keepData[j][2] // duplicate Partner Invoice?
&& row[4] == keepData[j][4] // duplicate vPO?
&& row[5] == keepData[j][5] // duplicate SKU?
&& row[7] == keepData[j][7]) { // duplicate qty?
duplicate = true; // only if ALL criteria are duplicate, set row as a duplicate
}
}
if(!duplicate) { // If the row is NOT a duplicate
keepData.push(row); // add to newData
} else {
deleteCount++; // keep track of duplicates being deleted
}
}
sourceRange.clear();
sourceSheet.getRange(2, 1, keepData.length, keepData[0].length).setValues(keepData); // paste the keepData into the Working sheet
return deleteCount;
}
I've thought about breaking it up into pieces; process 1/3 of the data in each of 3 different calls. But this is actually being called from a different function that emails the returned deleteCount value, if it's greater than 0.
// Nightly email after checking for duplicates in Ship Details Complete
function sendEmailShipDetails() {
var deleted = duplicateShipDetailsComplete();
var update = parseFloat(SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Dept/Class").getRange(1,6).getValue()).toFixed(2);
if(deleted > 0) {
MailApp.sendEmail(
"me#myoffice.com",
"Shipment Details Cleaned Up",
deleted + " Shipment Detail line(s) were deleted from Complete as duplicates.\n" +
"Updated Value: " + update + "\n"
);
}
}
Even without that email function, it's exceeding the limit when I call duplicateShipDetailsComplete() directly. I suppose I could write three different functions (first 1/3, second 1/3, third 1/3) and update a cell somewhere with the results for each, and then call the email function separately to get that value. I'd feel a little better about that if I could write 1 function and pass parameters to it, but this is all coming from a Time Based Trigger, and you can't pass parms from those. But before I started to do that, I thought I'd check to see if someone had other suggestions on how I could make the existing code more efficient. Or, see if someone had totally different ideas on how I can do this.
Thanks
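One possible direction, sketched under assumptions: build a single string key from the four compared columns (indexes 2, 4, 5 and 7, as in the function above) and track the keys already seen, so each row is checked once instead of against every kept row. `dedupeByColumns` is a hypothetical helper, not part of the original script:

```javascript
// Hypothetical helper: keeps the first row for each distinct combination of
// the given column indexes and counts how many later rows were dropped.
function dedupeByColumns(rows, columnIndexes) {
  var seen = {};
  var keepData = [];
  var deleteCount = 0;
  for (var i = 0; i < rows.length; i++) {
    var row = rows[i];
    // JSON.stringify keeps cell boundaries, so "a,b" + "c" differs from "a" + "b,c".
    var key = JSON.stringify(columnIndexes.map(function (c) { return row[c]; }));
    if (key in seen) {
      deleteCount++;
    } else {
      seen[key] = true;
      keepData.push(row);
    }
  }
  return { keepData: keepData, deleteCount: deleteCount };
}
```

In the question's function this could stand in for the two nested loops, for example `var result = dedupeByColumns(sourceData, [2, 4, 5, 7]);`, followed by the existing clear and setValues calls. One behavioral difference to be aware of: the original compares with `==`, which treats 5 and '5' as equal, whereas the string key does not.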

Google Script for a Sheet - Maximum Execution time Exceeded

I'm writing a script that's going to look through a monthly report and create sheets for each store for a company we do work for and copy data for each to the new sheets. Currently the issue I'm running into is that we have two days of data and 171 lines is taking my script 369.261 seconds to run and it is failing to finish.
function menuItem1() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet1 = ss.getSheetByName("All Stores");
var data = sheet1.getDataRange().getValues();
var CurStore;
var stores = [];
var target_sheet;
var last_row;
var source_range
var target_range;
var first_row = sheet1.getRange("A" + 1 +":I" + 1);
//assign first store number into initial index of array
CurStore = data[1][6].toString();
//add 0 to the string so that all store numbers are four digits.
while (CurStore.length < 4) {CurStore = "0" + CurStore;}
stores[0] = CurStore;
// traverse through every row and add all unique store numbers to the array
for (var row = 2; row <= data.length; row++) {
CurStore = data[row-1][6].toString();
while (CurStore.length < 4) {
CurStore = "0" + CurStore;
}
if (stores.indexOf(CurStore) == -1) {
stores.push(CurStore.toString());
}
}
// sort the store numbers into numerical order
stores.sort();
// traverse through the stores array, creating a sheet for each store, set the master sheet as the active so we can copy values, insert first row (this is for column labels), traverse though every row and when the unique store is found,
// we take the whole row and paste it onto it's newly created sheet
// at the end push a notification to the user letting them know the report is finished.
for (var i = stores.length -1; i >= 0; i--) {
ss.insertSheet(stores[i].toString());
ss.setActiveSheet(sheet1);
target_sheet = ss.getSheetByName(stores[i].toString());
last_row = target_sheet.getLastRow();
target_range = target_sheet.getRange("A"+(last_row+1)+":G"+(last_row+1));
first_row.copyTo(target_range);
for (var row = 2; row <= data.length; row++) {
CurStore = data[row-1][6].toString();
while (CurStore.length < 4) {
CurStore = "0" + CurStore;
}
if (stores[i] == CurStore) {
source_range = sheet1.getRange("A" + row +":I" + row);
last_row = target_sheet.getLastRow();
target_range = target_sheet.getRange("A"+(last_row+1)+":G"+(last_row+1));
source_range.copyTo(target_range);
}
}
for (var j = 1; j <= 9; j++) {
target_sheet.autoResizeColumn(j);
}
}
Browser.msgBox("The report has been finished.");
}
Any help would be greatly appreciated, as I'm still relatively new to this. I'm sure there are plenty of ways to speed this up; if not, I'll find a way to break the function down to divide up the execution. I can also provide some sample data if need be.
Thanks in advance.
The problem is calling SpreadsheetApp methods such as getRange() in each iteration. As stated in the Apps Script best practices guide:
Using JavaScript operations within your script is considerably faster
than calling other services. Anything you can accomplish within Google
Apps Script itself will be much faster than making calls that need to
fetch data from Google's servers or an external server, such as
requests to Spreadsheets, Docs, Sites, Translate, UrlFetch, and so on.
Your scripts will run faster if you can find ways to minimize the
calls the scripts make to those services.
I ran into the same situation and, instead of doing something like for(i=0;i<data.length;i++), I ended up dividing the data.length into 3 separate functions and ran them manually each time one of them ended.
Same as you, I had a complex report to automate and this was the only solution.
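As a sketch of what minimizing service calls could look like for the script in the question: group the rows by store in memory, then write each store's rows with a single setValues call instead of one copyTo per row. The helper names are invented, the store number is assumed to be in column index 6 as in the question, and this copies values only (not formatting), unlike copyTo:

```javascript
// Hypothetical helper: left-pad a store number to four digits.
function padStore(value) {
  var s = String(value);
  while (s.length < 4) s = '0' + s;
  return s;
}

// Hypothetical helper: group data rows (header excluded) by padded store number.
function groupRowsByStore(data, storeColumn) {
  var groups = {};
  for (var r = 1; r < data.length; r++) {
    var store = padStore(data[r][storeColumn]);
    if (!groups[store]) groups[store] = [];
    groups[store].push(data[r]);
  }
  return groups;
}

function splitByStore() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var data = ss.getSheetByName('All Stores').getDataRange().getValues();
  var header = data[0];
  var groups = groupRowsByStore(data, 6);
  Object.keys(groups).sort().forEach(function (store) {
    var rows = [header].concat(groups[store]);
    var sheet = ss.insertSheet(store);
    // One write per store sheet, however many rows it has.
    sheet.getRange(1, 1, rows.length, header.length).setValues(rows);
  });
}
```

With this shape the number of spreadsheet calls grows with the number of stores rather than with stores times rows, which is where the original spends most of its time.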
