This question concerns the Firestore database, but, more generally, it concerns making async requests in parallel.
Simply put, I wish to update multiple Firestore documents as quickly and efficiently as possible by mapping over an array of their document IDs.
The .set() method is async (it returns a promise), so I understand that I can wrap the multiple requests - i.e. the map() - in Promise.all(), which will return a single promise once all of the requests have resolved.
However, it is not at all clear to me whether I should await the set() within the map().
Whichever way I write the code (i.e. with or without await), it does not appear to make a difference to speed, but does it, or should it?
What is the correct way to achieve the greatest speed and efficiency in this instance?
const update_promises = array_of_ids.map(async id => {
  const new_values = {
    new_values_1: "val1",
    last_update: new Date()
  }
  return await db.collection("my_collection").doc(id).set(new_values, { merge: true });
  // OR SHOULD IT BE:
  // return db.collection("my_collection").doc(id).set(new_values, { merge: true });
})

return await Promise.all(update_promises)
When you call set(), the SDK immediately pipelines the write request over a single managed connection. If you want to write a bunch of documents as fast as possible, you should kick them all off, then await the results at the end. You probably don't want to await each one individually, since that makes the code stop for a moment to wait for each result before the next write can be sent. That said, the performance impact is probably negligible overall, unless you have a lot of work to do in between each write.
My general rule is to only await an individual promise if its result is needed right away before moving on. Otherwise, collect all the promises together into an array for a single await with Promise.all().
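Applied to the code in the question, that rule looks roughly like this (a sketch only, reusing the question's db and array_of_ids):

// Kick off every write without awaiting inside the callback;
// map() collects the returned promises.
const update_promises = array_of_ids.map(id =>
  db.collection("my_collection").doc(id).set(
    { new_values_1: "val1", last_update: new Date() },
    { merge: true }
  )
);

// Await once, after all of the writes have been started.
return Promise.all(update_promises);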
Related
Let's say that I want to update a series of documents. I'm doing it using forEach like this.
students.forEach(async (name) => {
  const docRef = doc(db, "students", name);
  await updateDoc(docRef, {
    school: "Some School",
  });
});
And it's working fine, but I was wondering if sending that many requests is bad in some way. Is there another, better/smarter way to do it?
There's nothing particularly "bad" about this given what you've shared, unless you are also observing some behavior that you don't like.
If your intent is to update all of the documents atomically (up to a limit of 500 in a batch), so that any failures or interruptions won't leave the set of documents in an inconsistent state, you are better off using a batch write instead. But that won't necessarily give you any better performance or other improved runtime behavior.
I usually recommend against using await in a scenario where you are using updateDoc on a sequence of documents, as it's actually faster to let the updates run in parallel. For more on this, see What is the fastest way to write a lot of documents to Firestore?
But the await here is harmless, since using await in a forEach has no impact on the other operations in that same forEach. For more on this, see: Using async/await with a forEach loop. If you were to use a for...of loop though, be sure to remove the await for improved throughput.
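For reference, a rough sketch of both alternatives mentioned above, using the modular SDK that the question's code appears to use (db and students are assumed from the question):

import { doc, updateDoc, writeBatch } from "firebase/firestore";

// Atomic variant: all updates commit together or not at all (max 500 writes per batch).
const batch = writeBatch(db);
for (const name of students) {
  batch.update(doc(db, "students", name), { school: "Some School" });
}
await batch.commit();

// Non-atomic parallel variant: kick off every update and wait once at the end.
await Promise.all(
  students.map((name) => updateDoc(doc(db, "students", name), { school: "Some School" }))
);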
I am having a problem where I make a bulk insert of multiple elements into a table and then immediately fetch the last X elements that were just inserted. When I do that, it seems the elements have not yet been fully inserted, even though I am using async/await to wait for the async operations.
I am making a bulk insert like
const createElements = elementsArray => {
  return knex
    .insert(elementsArray)
    .into('elements');
};
Then I have a method to immediately access those X elements that were inserted:
const getLastXInsertedElements = (userId, length, columns = ['*']) => {
  return knex.select(...columns)
    .from('elements')
    .where('userId', userId)
    .orderBy('createdAt', 'desc')
    .limit(length);
};
And finally after getting those elements I get their ids and save them into another table that makes use of element_id of those recently added elements.
So I have something like:
// A simple helper function that handles promises easily
const handleResponse = (promise, message) => {
  return promise
    .then(data => ([data, undefined]))
    .catch(error => {
      if (message) {
        throw new Error(`${message}: ${error}`);
      } else {
        return Promise.resolve([undefined, `${message}: ${error}`]);
      }
    });
};
async function service() {
  await handleResponse(createElements(list), 'error text'); // insert x elements from the list
  const [elements] = await handleResponse(getLastXInsertedElements(userId, list.length), 'error text'); // get last x elements that were recently added
  await handleResponse(useElementsIdAsForeignKey(listMakingUseOfElementsIds), 'error text'); // Here we use the ids of the elements we got from the last query, but we are not getting them properly for some reason
}
So the problem:
Sometimes when I execute getLastXInsertedElements, it seems that the elements have not yet finished inserting, even though I am waiting for the insert with async/await. Any ideas why this is? Maybe something related to bulk inserts that I don't know about? An important note: all the elements are always properly inserted into the table at some point; it just seems that this point is not respected by the promise (the async operation that reports success for knex.insert).
Update 1:
I have tried putting the select inside a setTimeout of 5 seconds after the insert for testing purposes, but the problem persists, which is really strange; one would think 5 seconds between the insert and the select is enough for all the data to be there.
I would like to have all X elements that were just inserted accessible in the select query from getLastXInsertedElements consistently.
Which DB are you using, and how big a list of data are you inserting? You could also test whether running the insert and getLastXInsertedElements inside a transaction hides your problem.
Doing those operations in a transaction also forces knex to use the same connection for both queries, so it might help track down where this is coming from.
Another trick to force both queries to use the same connection is to set the pool's min and max configuration to 1 (just for testing whether parallelism is indeed the problem here).
Also, since you have not provided complete reproduction code, I suspect there is something else in the mix causing this odd behavior. Usually (but not always) these kinds of weird cases that shouldn't happen are caused by user error elsewhere in the use of the library.
I'll update the answer if more information is provided. Complete reproduction code would be the most important piece of information.
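For illustration, a minimal sketch of the transaction suggestion, reusing the names from the question:

// Running the insert and the follow-up select inside one transaction forces
// both queries onto the same connection, which helps rule out pooling effects.
const elements = await knex.transaction(async trx => {
  await trx.insert(list).into('elements');

  return trx.select('*')
    .from('elements')
    .where('userId', userId)
    .orderBy('createdAt', 'desc')
    .limit(list.length);
});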
I am not 100% sure, but I guess the knex functions do not return a promise by default (they return a builder object for the query). That builder has a function called then that transforms the builder into a promise, so you could try adding a call to that:
...
limit(length)
.then(x => x); // required to transform to promise
Maybe try debugging the actual type of the returned value. It might happen that this is still not a promise; in that case you may not be able to use async/await and will need to use the then syntax instead, because it might not be a real JS promise but knex's own implementation.
Also see this issue about standard js promise in knex https://github.com/knex/knex/issues/1588
In theory, it should work.
You say "it seems"... a more clear problem explanation could be helpful.
I can argue the problem is that you have elements.length = list.length - n where n > 0; in your code there are no details about userId property in your list; a possible source of the problem could be that some elements in your list has a no properly set userId property.
I'm using forEach to write over 300 documents with data from an object literal.
It works 80% of the time: all documents get written. The other times it only writes half or so before the response gets sent and the function ends. Is there a way to make it pause and always work correctly?
Object.entries(qtable).forEach(([key, value]) => {
  db.collection("qtable").doc(key).set({
    s: value.s,
    a: value.a
  }).then(function(docRef) {
    console.log("Document written with ID: ", docRef.id);
    res.status(200).send(qtable);
    return null;
  });
});
Would it be bad practice to just put in a 2-second delay?
You are sending the response inside your loop, before the loop is complete. If you are using Cloud Functions (you didn't say), sending the response will terminate the function and clean up any extra work that hasn't completed.
You will need to make sure that you only send the response after all the async work is complete. This means you will have to pay attention to the promises returned by set() and use them to determine when to finally send the response. Learning how promises work in JavaScript is crucial to writing functions that work properly.
You need to wait for the set() calls to conclude. They return a promise that you should deal with.
For instance, you can do this by pushing the return of set() to a promise array and awaiting for them outside the loop (with Promise.all()).
Or you can await each individual call, but in this case you need to change the forEach() to a normal loop, otherwise the await will not work inside a forEach() arrow function.
Also, you should probably set the response status just once, and outside the loop.
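For example, a sketch of the Promise.all() approach (assuming the same db, qtable, and res as in the question):

// Start every write, collect the promises, and only respond once all of them
// have completed.
const writes = Object.entries(qtable).map(([key, value]) =>
  db.collection("qtable").doc(key).set({ s: value.s, a: value.a })
);

Promise.all(writes)
  .then(() => res.status(200).send(qtable))
  .catch((err) => res.status(500).send(err.message));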
I'm refactoring some old node modules into a more functional style. I'm like a second year freshman when it comes to FP :) Where I keep getting hung up is handling large async flows. Here is an example where I'm making a request to a db and then caching the response:
// Some external xhr/promise lib
const fetchFromDb = make => {
  return new Promise(resolve => {
    console.log('Simulate async db request...'); // just simulating an async request/response here.
    setTimeout(() => {
      console.log('Simulate db response...');
      resolve({ make: 'toyota', data: 'stuff' });
    }, 100);
  });
};

// memoized fn
// this caches the response to getCarData(x) so that whenever it is invoked with 'x' again, the same response gets returned.
const getCarData = R.memoizeWith(R.identity, (carMake, response) => response.data);

// Is this function pure? Or is it setting something outside the scope (i.e., getCarData)?
const getCarDataFromDb = (carMake) => {
  return fetchFromDb(carMake).then(getCarData.bind(null, carMake));
  // Note: This return statement is essentially the same as:
  // return fetchFromDb(carMake).then(result => getCarData(carMake, result));
};

// Initialize the request for 'toyota' data
const toyota = getCarDataFromDb('toyota'); // must be called no matter what

// Approach #1 - Just rely on the thenable
console.log(`Value of toyota is: ${toyota.toString()}`);
toyota.then(d => console.log(`Value in thenable: ${d}`)); // -> Value in thenable: stuff

// Approach #2 - Just make sure you do not call this fn before the db response.
setTimeout(() => {
  const car = getCarData('toyota'); // so nice!
  console.log(`later, car is: ${car}`); // -> 'later, car is: stuff'
}, 200);
<script src="https://cdnjs.cloudflare.com/ajax/libs/ramda/0.25.0/ramda.min.js"></script>
I really like memoization for caching large JSON objects and other computed properties. But with a lot of asynchronous requests whose responses are dependent on each other for doing work, I'm having trouble keeping track of what information I have and when. I want to get away from using promises so heavily to manage flow. It's a node app, so making things synchronous to ensure availability was blocking the event loop and really affecting performance.
I prefer approach #2, where I can get the car data simply with getCarData('toyota'). But the downside is that I have to be sure that the response has already been returned. With approach #1 I'll always have to use a thenable which alleviates the issue with approach #2 but introduces its own problems.
Questions:
Is getCarDataFromDb a pure function as it is written above? If not, how is that not a side-effect?
Is using memoization in this way an FP anti-pattern? That is, calling it from a thenable with the response so that future invocations of that same method return the cached value?
Question 1
It's almost a philosophical question whether there are side-effects here. Calling it does update the memoization cache. But that itself has no observable side-effects. So I would say that this is effectively pure.
Update: a comment pointed out that as this calls IO, it can never be pure. That is correct. But that's the essence of this behavior. It's not meaningful as a pure function. My answer above is only about side-effects, and not about purity.
Question 2
I can't speak for the whole FP community, but I can tell you that the Ramda team (disclaimer: I'm a Ramda author) prefers to avoid Promises, preferring more lawful types such as Futures or Tasks. But the same questions you have here would be in play with those types substituted for Promises. (More on these issues below.)
In General
There is a central point here: if you're doing asynchronous programming, it will spread to every bit of the application that touches it. There is nothing you can do that changes this basic fact. Using Promises/Tasks/Futures helps avoid some of the boilerplate of callback-based code, but it requires you to put the post-response/rejection code inside a then/map function. Using async/await helps you avoid some of the boilerplate of Promise-based code, but it requires you to put the post-response/rejection code inside async functions. And if one day we layer something else on top of async/await, it will likely have the same characteristics.
(While I would suggest that you look at Futures or Tasks instead of Promises, below I will only discuss Promises. The same ideas should apply regardless.)
My suggestion
If you're going to memoize anything, memoize the resulting Promises.
However you deal with your asynchrony, you will have to put the code that depends on the result of an asynchronous call into a function. I assume that the setTimeout of your second approach was just for demonstration purposes: using a timeout to wait for a DB result over the network is extremely error-prone. But even with setTimeout, the rest of your code is running from within the setTimeout callback.
So rather than trying to separate the cases for when your data has already been cached and when it hasn't, simply use the same technique everywhere: myPromise.then(... my code ... ). That could look something like this:
// getCarData :: String -> Promise AutoInfo
const getCarData = R.memoizeWith(R.identity, make =>
  new Promise(resolve => {
    console.log('Simulate async db request...')
    setTimeout(() => {
      console.log('Simulate db response...')
      resolve({ make: 'toyota', data: 'stuff' });
    }, 100)
  })
)
getCarData('toyota').then(carData => {
  console.log('now we can go', carData)
  // any code which depends on carData
})

// later
getCarData('toyota').then(carData => {
  console.log('now it is cached', carData)
})
<script src="//cdnjs.cloudflare.com/ajax/libs/ramda/0.25.0/ramda.min.js"></script>
In this approach, whenever you need car data, you call getCarData(make). Only the first time will it actually call the server. After that, the Promise is served out of the cache. But you use the same structures everywhere to deal with it.
I only see one reasonable alternative. I couldn't tell whether your discussion about having to wait for the data before making the remaining calls means that you would be able to pre-fetch your data. If that's the case, then there is one additional possibility, one which would allow you to skip the memoization as well:
// getCarData :: String -> Promise AutoInfo
const getCarData = make => new Promise(resolve => {
  console.log('Simulate async db request...')
  setTimeout(() => {
    console.log('Simulate db response...')
    resolve({ make: 'toyota', data: 'stuff' });
  }, 100)
})

const makes = ['toyota', 'ford', 'audi']

Promise.all(makes.map(getCarData)).then(allAutoInfo => {
  const autos = R.zipObj(makes, allAutoInfo)
  console.log('cooking with gas', autos)
  // remainder of app that depends on auto data here
})
<script src="//cdnjs.cloudflare.com/ajax/libs/ramda/0.25.0/ramda.min.js"></script>
But this one means that nothing will be available until all your data has been fetched. That may or may not be all right with you, depending on all sorts of factors. And for many situations, it's not even remotely possible or desirable. But it is possible that yours is one where it is helpful.
One technical point about your code:
const getCarDataFromDb = (carMake) => {
  return fetchFromDb(carMake).then(getCarData.bind(null, carMake));
};
Is there any reason to use getCarData.bind(null, carMake) instead of result => getCarData(carMake, result)? The arrow function seems much more readable.
Is getCarDataFromDb a pure function as it is written above?
No. Pretty much anything that uses I/O is impure. The data in the DB could change, the request could fail, so it doesn't give any reliable guarantee that it will return consistent values.
Is using memoization in this way an FP anti-pattern? That is, calling it from a thenable with the response so that future invocations of that same method return the cached value?
It's definitely an asynchrony antipattern. In your approach #2 you are creating a race condition where the operation will succeed if the DB query completes in less than 200 ms, and fail if it takes longer than that. You've labeled a line in your code "so nice!" because you're able to retrieve data synchronously. That suggests to me that you're looking for a way to skirt the issue of asynchrony rather than facing it head-on.
The way you're using bind and "tricking" memoizeWith into storing the value you're passing into it after the fact also looks very awkward and unnatural.
It is possible to take advantage of caching and still use asynchrony in a more reliable way.
For example:
// Some external xhr/promise lib
const fetchFromDb = make => {
  return new Promise(resolve => {
    console.log('Simulate async db request...')
    setTimeout(() => {
      console.log('Simulate db response...')
      resolve({ make: 'toyota', data: 'stuff' });
    }, 2000);
  });
};

const getCarDataFromDb = R.memoizeWith(R.identity, fetchFromDb);

// Initialize the request for 'toyota' data
const toyota = getCarDataFromDb('toyota'); // must be called no matter what

// Finishes after two seconds
toyota.then(d => console.log(`Value in thenable: ${d.data}`));

// Wait for 5 seconds before getting Toyota data again.
// This time, there is no 2-second wait before the data comes back.
setTimeout(() => {
  console.log('About to get Toyota data again');
  getCarDataFromDb('toyota').then(d => console.log(`Value in thenable: ${d.data}`));
}, 5000);
<script src="https://cdnjs.cloudflare.com/ajax/libs/ramda/0.25.0/ramda.min.js"></script>
The one potential pitfall here is that if a request should fail, you'll be stuck with a rejected promise in your cache. I'm not sure what would be the best way to address that, but you'd surely need some way of invalidating that part of the cache or implementing some sort of retry logic somewhere.
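One possible way to address that pitfall is to replace memoizeWith with a small hand-rolled cache that evicts rejected Promises. This is only a sketch (R.memoizeWith does not expose its cache for invalidation, so the names below are assumptions):

const carDataCache = new Map();

const getCarDataCached = make => {
  if (!carDataCache.has(make)) {
    const promise = fetchFromDb(make).catch(err => {
      // Evict the failed entry so that a later call can retry the request.
      carDataCache.delete(make);
      throw err;
    });
    carDataCache.set(make, promise);
  }
  return carDataCache.get(make);
};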
In a language with threads and locks it is easy to implement a lazy load by checking the value of a variable: if it's null, lock the next section of code, check the value again, and then load the resource and assign it. This prevents the resource from being loaded multiple times and causes threads after the first to wait for the first thread to complete the action that's needed.
Pseudocode:
if (myvar == null) {
  lock(obj) {
    if (myvar == null) {
      myvar = getData();
    }
  }
}
return myvar;
JavaScript runs in a single thread, however, it still has this type of issue because of asynchronous execution while one call is waiting on a blocking resource. In this Node.js example:
var allRecords;

module.exports = function getAllRecords(callback) {
  if (allRecords) {
    return callback(null, allRecords);
  }
  db.getRecords({}, function(err, records) {
    if (err) {
      return callback(err);
    }
    // Use the existing object if it has been
    // set by another async request to this
    // function
    allRecords = allRecords || records;
    return callback(null, allRecords);
  });
};
I'm lazy loading all the records from a small DB table the first time this function is called and then returning the in-memory records on subsequent calls.
Problem: If multiple async requests are made to this function at the same time then the table is going to be loaded unnecessarily from the DB multiple times.
In order to solve this I could simulate a locking mechanism by creating a var lock; variable and setting it to true while the table is loading. I would then put the other async calls into a setTimeout() loop and check back on this variable every (say) 1 second until the data was available and then allow them to return.
The problems with that solution are:
It's fragile: what if the first async call throws and never unsets the lock?
How many times do we loop back into the timer before giving up?
How long should the timer be set for? In some environments 1 second might be way too long and inefficient.
Is there a best practice for solving this in JavaScript?
On the first call to the service, initialize an array. Start the fetch operation. Create a Promise, store it in the array.
On subsequent calls, if the data is there, return an already-fulfilled Promise. If not, add another Promise to the array and return that.
When the data arrives, resolve all the waiting Promise objects in the list. (You can throw away the list once the data's there.)
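For illustration, a sketch along these lines that caches a single shared Promise rather than managing an array of pending Promises by hand (the db.getRecords call is taken from the question; the rest is an assumption):

let allRecordsPromise = null;

module.exports = function getAllRecords() {
  // Every concurrent caller shares the same in-flight Promise,
  // so the table is loaded from the DB at most once.
  if (!allRecordsPromise) {
    allRecordsPromise = new Promise((resolve, reject) => {
      db.getRecords({}, (err, records) => {
        if (err) {
          allRecordsPromise = null; // allow a retry after a failure
          return reject(err);
        }
        resolve(records);
      });
    });
  }
  return allRecordsPromise;
};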
I really like the promise solution in the other answer -- very clever, very interesting. Promises aren't the dominant methodology, so you may need to educate the team. I'm going to go in another direction, though.
What you're after is a memoize function -- an in-memory key/value cache of expensive results. JavaScript: The Good Parts has a memoize sample towards the end. Lodash has a memoize function. These assume synchronous processing, so they don't account for your scenario -- which is to say they'd hit the database lots of times until one of the "threads" replied.
The async library also has a memoize function that does exactly what you want. In its innards, it keeps a queue array of callbacks, and once it gets the answer, it both caches it and calls all the callbacks.
If you're into inventing, by all means, use promises. If you'd just like a plug-n-play answer, use async#memoize.
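A hedged sketch of what that might look like for the question's getAllRecords (assuming the callback-style async library; since the function takes no arguments besides the callback, the cache effectively holds a single entry):

const async = require('async');

// While the first DB query is in flight, concurrent callers are queued and
// all receive the same result; afterwards the cached records are returned.
const getAllRecords = async.memoize(function (callback) {
  db.getRecords({}, callback);
});

module.exports = getAllRecords;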