Google Cloud Pub/Sub triggers high latency on low message throughput - javascript

I'm running a project which publishes messages to a Pub/Sub topic and triggers a background Cloud Function.
I read that Pub/Sub performs well with high volumes of messages, but for smaller amounts, like hundreds or even tens of messages per second, it may yield high latencies.
Code example to publish a single message:
const {PubSub} = require('@google-cloud/pubsub');
const pubSubClient = new PubSub();

async function publishMessage(data) {
  const topicName = 'my-topic';
  const dataBuffer = Buffer.from(data);
  const messageId = await pubSubClient.topic(topicName).publish(dataBuffer);
  console.log(`Message ${messageId} published.`);
}

publishMessage('Hello, world!').catch(console.error);
Code example of the function triggered by Pub/Sub:
exports.subscribe = async (message) => {
  const name = message.data
    ? Buffer.from(message.data, 'base64').toString()
    : 'World';
  console.log(`Hello, ${name}!`);
};
Cloud Function Environment Details:
Node: 8
@google-cloud/pubsub: 1.6.0
The problem is that when using Pub/Sub with a low throughput of messages (for example, 1 request per second), it sometimes struggles and shows incredibly high latency (up to 7-9s or more).
Is there a way or workaround to make Pub/Sub perform well every time (50ms or less delay), even with a small number of incoming messages?

If you are always publishing to the same topic, you'd be better off keeping the object returned from pubSubClient.topic(topicName) and reusing it, whether you have a small or large number of messages. If you want to minimize latency, you'll also want to set the maxMilliseconds property of the batching settings. By default, it is 10ms. As your code stands, every publish waits 10ms to send a message in hopes of filling the batch. Since you create a new publisher via the topic call on every publish, you are guaranteed to always wait at least 10ms. You can set it when you call topic:
const publisher = pubSubClient.topic(topicName, {
  batching: {
    maxMessages: 100,
    maxMilliseconds: 1,
  },
});
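A minimal sketch of the reuse pattern, assuming the publisher above is created once at module scope:

// Reuses the `publisher` created above for every message,
// instead of calling topic() on each publish.
async function publishLowLatency(data) {
  const messageId = await publisher.publish(Buffer.from(data));
  console.log(`Message ${messageId} published.`);
}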
If, after reusing the object returned from pubSubClient.topic(topicName) and changing maxMilliseconds, you are still experiencing such a delay, then you should reach out to Google Cloud Support so they can look at the specific project and topic you are using, as that kind of latency is definitely not expected.

Related

How do I improve the accuracy of the DelayNode?

I'm using the WebAudio API and doing a simple delay on the input using DelayNode. The code is as follows (might not work in the snippet because it requests microphone permissions):
const player = document.querySelector("#player");

const handleSuccess = stream => {
  const context = new AudioContext();
  const source = context.createMediaStreamSource(stream);
  const delay = context.createDelay(179); // maximum delay time, in seconds
  // source.connect(context.destination); // connect directly
  source.connect(delay);
  delay.connect(context.destination);
  delay.delayTime.value = 1;
};

navigator.mediaDevices.getUserMedia({audio: true, video: false}).then(handleSuccess);
However, when I run a metronome at 60 beats per minute (one click every second), the audio coming from the browser is not delayed by exactly one second, and multiple clicks are heard (the delay is slightly longer than expected). Is there any way to make an exact delay?
I guess the problem is not the DelayNode itself, but rather that there are multiple other hidden delays in the audio pipeline. If you want to hear the previous click from the metronome at the very same time as the current click, you need to reduce your delay time to account for those other delays (also known as latency).
The signal takes a bit of time to travel from your microphone through your A/D converter into the browser. This time is usually available as part of the settings of the MediaStream:
stream.getAudioTracks()[0].getSettings().latency
Piping the MediaStream into the Web Audio API will probably add some more latency, and the AudioContext itself will add some latency due to its internal buffer:
context.baseLatency
It will also take some time to get the signal out of the computer again. It travels from the browser to the OS, which passes it on to the hardware. This value is also exposed on the AudioContext:
context.outputLatency
In theory you would only need to subtract all those values from the delay time and it would just work. However, in reality every hardware/OS/browser combination is a bit different, and you will probably need to make some adjustments to successfully overlay the previous click with the current one, depending on your personal setup.
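Using the values above, something along these lines (a sketch; latency and outputLatency may be undefined in some browsers, hence the fallbacks):

const inputLatency = stream.getAudioTracks()[0].getSettings().latency || 0;
const outputLatency = context.outputLatency || 0;
// Subtract the known pipeline latencies from the desired 1s delay.
const compensated = 1 - inputLatency - context.baseLatency - outputLatency;
delay.delayTime.value = Math.max(compensated, 0);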
You can use the async/await syntax (see the async function documentation); here are two examples, assuming a promise-based sleep helper:

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const foo = async () => {
  await sleep(1000);
  // do something
};

const bar = async evt => {
  await sleep(1000);
  // do something with evt
};

Child process setInterval sporadically not firing

My application places bets on certain sporting events, almost like an automatic betting bot. Part of this is tracking the current status of each event so it can make an informed calculation on whether to place a bet or not. To do this, I poll the status of an event every minute using setInterval. A single sporting event is "watched" by one child process. There could be 100+ sporting events at any one time, meaning there could be 100+ child processes spawned and actively polling.
worker/index.js
const logger = require('../lib/logger/winston')
const utils = {
  updateEventState: require('../utils/update-event-state')
}
const db = require('../db')

let eventStatesInterval

module.exports = async function() {
  try {
    logger.info(`[STARTED FOR EVENT ${process.EVENT_ID} (Process: ${process.pid})]`)
    const event = await db.getEvent(process.EVENT_ID)
    logger.info(`[STARTING STATE POLLING FOR EVENT (${process.EVENT_ID} - Process: ${process.pid})]`)
    eventStatesInterval = setInterval(utils.updateEventState, 60000, event) // 1 min
    process.on('exit', code => {
      clearInterval(eventStatesInterval)
    })
  } catch(err) {
    throw err
  }
}
utils/update-event-state.js
const logger = require('../lib/logger/winston')
const db = require('../db')
const api = require('../api') // assumed path: the original snippet uses `api` without importing it

module.exports = async function(event) {
  try {
    const update = {}
    if (!process.CHECKING_EVENT) {
      process.CHECKING_EVENT = true
      logger.info(`[POLLING STATE (${process.EVENT_ID} - Process: ${process.pid})]`)
      // Some async operations polling APIs to get the full status of an event
      const hasEnded = await api.getHasEventEnded(process.EVENT_ID)
      await db.updateEvent(process.EVENT_ID, update)
      if (hasEnded) {
        process.exit(0)
      }
      process.CHECKING_EVENT = false
    }
  } catch(err) {
    throw err
  }
}
It's also worth noting that a single child process could have more setInterval timers further down the line. For example, if I place a bet that is not fully matched, I poll to check when/if it gets matched, also on a setInterval basis (about every 5 seconds). Checking the logs, some processes are polled correctly every minute, but a couple (an inconsistent number each time) are not being polled at all. Looking at the logs for a specific process, I get:
For reference, the current time was 22:33 when that screenshot was taken, so the interval had not fired in over an hour.
There were only 4 events (child processes) running at the time.
That is an example screenshot. A process can log several interval callbacks and then just... stop. No more logs at all. No errors or anything. It just stops.
This application runs on a DigitalOcean box with 4GB memory and 2 vCPUs. The app is dockerised. When running docker stats, 100% CPU is used constantly. When starting with docker run, I limit the memory usage but not the CPU. Could this be the issue? Could setInterval callbacks fail to be invoked because of CPU constraints?
It's worth noting that this feature to poll the event state is new (5 days old), and I had never had an issue with setInterval before (though I don't know how much CPU was being used then). I was initially polling the state every 30 seconds, which is when I noticed this problem: when I checked docker stats, almost 200% CPU usage was shown. Lowering the rate to 1 minute has helped slightly. I have the process.CHECKING_EVENT global bool there so overlapping polls can't pile up on the stack.
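For reference, the same no-overlap intent could be expressed by chaining setTimeout calls instead of setInterval, so the next poll is only scheduled once the current one finishes (a sketch, not the actual code):

async function pollEventState(event) {
  try {
    await utils.updateEventState(event)
  } finally {
    // schedule the next poll only after the current one has completed
    setTimeout(() => pollEventState(event), 60000)
  }
}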

How to cancel a wasm process from within a webworker

I have a wasm process (compiled from c++) that processes data inside a web application. Let's say the necessary code looks like this:
std::vector<JSONObject> data;
for (size_t i = 0; i < data.size(); i++)
{
    process_data(data[i]);
    if (i % 1000 == 0) {
        bool is_cancelled = check_if_cancelled();
        if (is_cancelled) {
            break;
        }
    }
}
This code basically "runs/processes a query", similar to a SQL query interface.
However, queries may take several minutes to run/process, and at any given time the user may cancel their query. The cancellation would occur in the normal JavaScript/web application, outside of the Web Worker running the wasm.
My question, then, is: how could we know that the user has clicked the 'cancel' button and communicate that to the wasm process, so that it knows it has been cancelled and can exit? Using worker.terminate() is not an option, as we need to keep all the loaded data for that worker and cannot just kill it (it needs to stay alive with its stored data so another query can be run).
What would be an example way to communicate here between the JavaScript and the worker/wasm/C++ application so that we know when to exit, and how to do it properly?
Additionally, let us suppose a typical query takes 60s to run and processes 500MB of data in-browser using C++/wasm.
Update: I think there are the following possible solutions here, based on some research (and the initial answers/comments below), with some feedback on each:
1. Use two workers, with one worker storing the data and another worker processing it. That way the processing worker can be terminated and the data will always remain. Feasible? Not really, as it would take far too much time to copy ~500MB of data over to the processing worker whenever it starts. This could previously have been done using SharedArrayBuffer, but its support is now quite limited/nonexistent due to some security concerns. Too bad, as this seems like by far the best solution if it were supported...
2. Use a single worker with the Emterpreter and emscripten_sleep_with_yield. Feasible? No; it destroys performance when using the Emterpreter (mentioned in the docs above) and slows down all queries by about 4-6x.
3. Always run a second worker and in the UI just display the most recent one. Feasible? No; it would probably run into quite a few OOM errors if the data structure is not shared, since 500MB x 2 = 1GB (500MB seems to be a large though acceptable size when running in a modern desktop browser/computer).
4. Use an API call to a server to store the status, and check whether the query has been cancelled or not. Feasible? Yes, though it seems quite heavy-handed to long-poll with network requests every second from every running query.
5. Use an incremental-parsing approach where only one row at a time is parsed. Feasible? Yes, but it would also require rewriting a tremendous amount of the parsing functions so that every one supports it (the actual data parsing is handled in several functions: filter, search, calculate, group by, sort, etc.).
6. Use IndexedDB and store the state in JavaScript. Allocate a chunk of memory in wasm, then return its pointer to JavaScript; read the database there and fill the pointer, then process the data in C++. Feasible? Not sure, though this seems like the best solution if it can be implemented.
7. [Anything else?]
In the bounty, then, I was wondering three things:
Do the above six analyses seem generally valid?
Are there other (perhaps better) approaches I'm missing?
Would anyone be able to show a very basic example of doing #6? It seems like that would be the best solution if it's possible and works cross-browser.
For Chrome (only), you may use shared memory (a shared buffer as wasm memory) and raise a flag in that memory when you want to halt. I'm not a big fan of this solution (it is complex and supported only in Chrome). It also depends on how your query works and whether there are places where the lengthy query can check the flag.
Instead, you should probably call the C++ function multiple times (e.g. once per stage of the query) and check whether you should halt after each call (just send a message to the worker to halt).
What I mean by multiple times is to run the query in stages (multiple function calls for a single query). This may not be applicable in your case.
Regardless, AFAIK there is no way to send a signal to a WebAssembly execution (e.g. Linux kill). Therefore, you'll have to wait for the operation to finish in order to complete the cancellation.
I'm attaching a code snippet that may explain this idea.
worker.js:
// ... init webassembly
onmessage = function(q) {
  // query received from main thread.
  const result = /* ... call webassembly(q) */;
  postMessage(result);
};
main.js:
const worker = new Worker("worker.js");
let cancel = false;
let processing = false;

worker.onmessage = function(r) {
  // when the worker has finished processing the query.
  // r is the result of the processing.
  processing = false;
  if (cancel === true) {
    // processing is done, but the result is not required.
    // instead of showing the results, update that the query was cancelled.
    cancel = false;
    // ... update UI "cancelled".
    return;
  }
  // ... update UI "results r".
};

function onCancel() {
  // Occurs when the user clicks the cancel button.
  if (cancel) {
    // sanity test - prevent this in the UI.
    throw "already cancelling";
  }
  cancel = true;
  // ... update UI "cancelling".
}

function onQuery(q) {
  if (processing === true) {
    // sanity test - prevent this in the UI.
    throw "already processing";
  }
  processing = true;
  // Send the query to the worker.
  // When the worker receives the message it will process the query via webassembly.
  worker.postMessage(q);
}
An idea from a user-experience perspective:
You could create ~two workers. This takes twice the memory but allows you to "cancel" "immediately" once: in the background, the second worker runs the next query, and when the first worker finishes its cancellation, cancellation becomes immediate again.
Shared Thread
Since the worker and the C++ function that it calls share the same thread, the worker will be blocked until the C++ loop is finished and won't be able to handle any incoming messages. I think a solid option would be to minimize the amount of time the thread is blocked by instead initiating one iteration at a time from the main application.
It would look something like this.
main.js -> worker.js -> C++ function -> worker.js -> main.js
Breaking up the Loop
Below, C++ has a counter initialized at 0, which is incremented at each loop iteration and kept in memory.
The C++ function performs one iteration of the loop, increments the counter to keep track of the loop position, and immediately breaks.
// static so the counter persists across calls and the loop can resume
static size_t x = 0;

std::vector<JSONObject> data;
for (size_t i = x; i < data.size(); i++)
{
    process_data(data[i]);
    x++;   // increment counter
    break; // stop the function until told to iterate again, starting at x
}
Then you should be able to post a message to the web worker, which then sends a message to main.js that the thread is no longer blocked.
Canceling the Operation
From this point, main.js knows that the web worker thread is no longer blocked, and can decide whether or not to tell the web worker to execute the C++ function again (with the C++ variable keeping track of the loop increment in memory.)
let continueOperation = true;
// here you can set it to false at any time, since the thread is not blocked here

worker.expensiveThreadBlockingFunction();
// results in one iteration of the loop running until the message is received below

worker.onmessage = function(e) {
  if (continueOperation) {
    worker.expensiveThreadBlockingFunction();
    // execute the worker function again, ultimately continuing the increment in C++
  } else {
    return false;
    // or send a message to the worker to reset the C++ counter for the next execution
  }
};
Continuing the Operation
Assuming all is well, and the user has not cancelled the operation, the loop should continue until finished. Keep in mind you should also send a distinct message for whether the loop has completed, or needs to continue, so you don't keep blocking the worker thread.
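For completeness, the worker side of this scheme might look roughly like the sketch below. Module._process_chunk and Module._reset_counter are hypothetical exported names; the real exports depend on how the C++ is compiled:

// worker.js (sketch)
onmessage = function(e) {
  if (e.data === 'step') {
    // Run one bounded chunk of the C++ loop; the C++ side resumes
    // from the static counter described above.
    const done = Module._process_chunk(); // hypothetical export
    // Tell main.js the thread is free again and whether the loop finished.
    postMessage(done ? 'finished' : 'ready');
  } else if (e.data === 'reset') {
    Module._reset_counter(); // hypothetical export: rewind the C++ counter
  }
};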

Error: 10 ABORTED: Too much contention on these documents. Please try again

What does this error mean?
In particular, what do they mean by: Please try again?
Does it mean that the transaction failed and I have to re-run it manually?
From what I understood from the documentation:
The transaction read a document that was modified outside of the
transaction. In this case, the transaction automatically runs again.
The transaction is retried a finite number of times.
If so, on which documents?
The error does not indicate which document it is talking about. I just get this stack:
{ Error: 10 ABORTED: Too much contention on these documents. Please try again.
    at Object.exports.createStatusError (\node_modules\grpc\src\common.js:87:15)
    at ClientReadableStream._emitStatusIfDone (\node_modules\grpc\src\client.js:235:26)
    at ClientReadableStream._receiveStatus (\node_modules\grpc\src\client.js:213:8)
    at Object.onReceiveStatus (\node_modules\grpc\src\client_interceptors.js:1256:15)
    at InterceptingListener._callNext (\node_modules\grpc\src\client_interceptors.js:564:42)
    at InterceptingListener.onReceiveStatus (\node_modules\grpc\src\client_interceptors.js:614:8)
    at C:\Users\Tolotra Samuel\PhpstormProjects\CryptOcean\node_modules\grpc\src\client_interceptors.js:1019:24
  code: 10, metadata: Metadata { _internal_repr: {} }, details: 'Too much contention on these documents. Please try again.' }
To recreate this error, just run a for loop over the db.runTransaction method, as indicated in the documentation.
We ran into the same problem with the Firebase Firestore database. Even small counters with fewer than 30 items to count were running into this issue.
Our solution was not to distribute the counter, but to increase the number of tries for the transaction and to add a defer time for those retries.
The first step was to save the transaction action as a const which could be passed to another function.
const taskCountTransaction = async transaction => {
  const taskDoc = await transaction.get(taskRef)
  if (taskDoc.exists) {
    let increment = 0
    if (change.after.exists && !change.before.exists) {
      increment = 1
    } else if (!change.after.exists && change.before.exists) {
      increment = -1
    }
    let newCount = (taskDoc.data()['itemsCount'] || 0) + increment
    return await transaction.update(taskRef, { itemsCount: newCount > 0 ? newCount : 0 })
  }
  return null
}
The second step was to create two helper functions: one to wait a specific amount of time, and the other to run the transaction and catch errors. If the abort error with code 10 occurs, we just run the transaction again, up to a specific number of retries.
const wait = ms => { return new Promise(resolve => setTimeout(resolve, ms)) }

const runTransaction = async (taskCountTransaction, retry = 0) => {
  try {
    await fs.runTransaction(taskCountTransaction)
    return null
  } catch (e) {
    console.warn(e)
    if (e.code === 10) {
      console.log(`Transaction abort error! Running it again after ${retry} retries.`)
      if (retry < 4) {
        await wait(1000)
        return runTransaction(taskCountTransaction, ++retry)
      }
    }
  }
}
Now that we have all we need, we can just call our helper function with await, and our transaction call will run longer than a default one and will defer in time.
await runTransaction(taskCountTransaction)
What I like about this solution is that it doesn't mean more or complicated code, and most of the already-written code can stay as it is. It also uses more time and resources only if the counter gets to the point where it has to count more items; otherwise, the time and resources are the same as with default transactions.
For scaling up to large numbers of items, we can increase either the number of retries or the waiting time. Both also affect the costs for Firebase. For the waiting part, we also need to increase the timeout of our function.
DISCLAIMER: I have not stress-tested this code with thousands or more items. In our specific case, the problems started with 20+ items, and we need up to 50 items for a task. I tested it with 200 items and the problem did not appear again.
The transaction does run several times if needed, but if the values read continue to be updated before the write or writes can occur it will eventually fail, thus the documentation noting the transaction is retried a finite number of times. If you have a value that is updating frequently like a counter, consider other solutions like distributed counters. If you'd like more specific suggestions, I recommend you include the code of your transaction in your question and some information about what you're trying to achieve.
Firestore re-runs the transaction only a finite number of times. As of this writing, this number is hard-coded as 5 and cannot be changed. To avoid contention when many users are using the same document, we would normally use an exponential back-off algorithm (though this results in transactions taking longer to complete, which may be acceptable in some use cases).
However, as of this writing, this has not been implemented in the Firebase SDK yet; transactions are retried right away. Fortunately, we can implement our own exponential back-off inside a transaction:
const createTransactionCollisionAvoider = () => {
  let attempts = 0
  return {
    async avoidCollision() {
      attempts++
      await require('delay')(Math.pow(2, attempts) * 1000 * Math.random())
    }
  }
}
…which can be used like this:
// Each time we run a transaction, create a collision avoider.
const collisionAvoider = createTransactionCollisionAvoider()
db.runTransaction(async transaction => {
  // At the very beginning of the transaction run,
  // introduce a random delay. The delay increases each time
  // the transaction has to be re-run.
  await collisionAvoider.avoidCollision()
  // The rest goes as normal.
  const doc = await transaction.get(...)
  // ...
  transaction.set(...)
})
Note: The above example may cause your transaction to take up to 1.5 minutes to complete. This is fine for my use case. You might have to adjust the backoff algorithm for your use case.
I have implemented a simple back-off solution to share: maintain a global variable that assigns a different "retry slot" to each failed connection. For example, if 5 connections come in at the same time and 4 of them get a contention error, each gets a delay of 500ms, 1000ms, 1500ms, or 2000ms before trying again, so they can potentially all resolve at the same time without any more contention.
My transaction runs in response to Firebase Functions calls. Each Functions compute instance can have a global variable nextRetrySlot that is preserved until it is shut down. If error.code === 10 is caught for a contention issue, the delay time can be (nextRetrySlot + 1) * 500, and then you do, for example, nextRetrySlot = (nextRetrySlot + 1) % 10 so subsequent connections get a different time, round-robin, in the 500ms ~ 5000ms range.
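A sketch of this idea (the names and the 10-slot round-robin mirror the description above; `fs` is the Firestore instance, as in the earlier answer):

let nextRetrySlot = 0 // global, preserved per Functions instance

const wait = ms => new Promise(resolve => setTimeout(resolve, ms))

async function runWithSlotBackoff(fs, transactionFn) {
  try {
    return await fs.runTransaction(transactionFn)
  } catch (e) {
    if (e.code === 10) { // contention
      const delay = (nextRetrySlot + 1) * 500
      nextRetrySlot = (nextRetrySlot + 1) % 10 // round-robin: 500ms ~ 5000ms
      await wait(delay)
      return runWithSlotBackoff(fs, transactionFn)
    }
    throw e
  }
}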
Below are some benchmarks:
My situation is that I would like each new Firebase Auth registration to get a much shorter ID derived from the unique Firebase UID, so there is a risk of collision.
My solution is simply to check all registered short IDs, and if the query returns something, just generate another one until it doesn't. Then we register this new short ID to the database. So the algorithm cannot rely on only the Firebase UID, but it is able to "move to the next one" in a deterministic way (not just random again).
This is my transaction: it first reads a database of all used short IDs, then writes a new one atomically, to prevent the extremely unlikely event that 2 new registrations come in at the same time with different Firebase UIDs that derive into the same short ID, and both see that the short ID is vacant at the same time.
I ran a test that intentionally registers 20 different Firebase UIDs which all derive into the same short ID (an extremely unlikely situation), all running in a burst at the same time. First I tried using the same delay on each retry, so I expected them to clash with each other again and again while slowly resolving some connections.
Same 500ms delay on retry : 45000ms ~ 60000ms
Same 1000ms delay on retry : 30000ms ~ 49000ms
Same 1500ms delay on retry : 43000ms ~ 49000ms
Then with distributed delay time in slots :
500ms * 5 slots on retry : 20000ms ~ 31000ms
500ms * 10 slots on retry : 22000ms ~ 23000ms
500ms * 20 slots on retry : 19000ms ~ 20000ms
1000ms * 5 slots on retry : ~29000ms
1000ms * 10 slots on retry : ~25000ms
1000ms * 20 slots on retry : ~26000ms
This confirms that distributed delay times definitely help.
I found maxAttempts in the runTransaction code, which should let you modify the default of 5 attempts (but I haven't tested it yet).
Anyway, I think random waits (plus possibly a queue) are still the better option.
Firestore now supports the server-side atomic increment() operation.
You can increment or decrement by any amount (a decrement is simply an increment by a negative value). See their blog post for full details. In many cases, this will remove the need for a client-side transaction.
Example:
document("fitness_teams/Team_1").
updateData(["step_counter" : FieldValue.increment(500)])
This is still limited to a sustained write limit of 1 QPS per document, so if you need higher throughput, consider using distributed counters. This increases your read cost (you'll need to read all the shard documents and compute a total) but allows you to scale your throughput by increasing the number of shards. And if you do need to increment a counter as part of a transaction, it is now much less likely to fail due to update contention.
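A sketch of the distributed-counter pattern mentioned above, reusing the admin setup from the previous sketch (the shards subcollection layout and shard count are assumptions, following the pattern in the Firestore documentation):

const NUM_SHARDS = 10

// Write path: increment one randomly chosen shard instead of a single hot document.
function incrementCounter(counterRef) {
  const shardId = Math.floor(Math.random() * NUM_SHARDS)
  return counterRef.collection('shards').doc(String(shardId)).set(
    { count: admin.firestore.FieldValue.increment(1) },
    { merge: true }
  )
}

// Read path: the total is the sum over all shards.
async function getCount(counterRef) {
  const snapshot = await counterRef.collection('shards').get()
  return snapshot.docs.reduce((sum, doc) => sum + (doc.data().count || 0), 0)
}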

Firebase transactions bug

Consider the following:
// no-op completion callback used by both transactions below
const NOOP = () => {};

function useCredits(userId, amount) {
  var userRef = firebase.database().ref().child('users').child(userId);
  userRef.transaction(function(user) {
    if (!user) {
      return user;
    }
    user.credits -= amount;
    return user;
  }, NOOP, false);
}

function notifyUser(userId, message) {
  var notificationId = Math.random();
  var userNotificationRef = firebase.database().ref().child('users').child(userId).child('notifications').child(notificationId);
  userNotificationRef.transaction(function(notification) {
    return message;
  }, NOOP, false);
}
These are called from the same Node.js process.
A user looks like this:
{
  "name": "Alex",
  "age": 22,
  "credits": 100,
  "notifications": {
    "1": "notification 1",
    "2": "notification 2"
  }
}
When I run my stress tests, I notice that sometimes the user object passed to the userRef transaction update function is not the full user; it is only the following:
{
  "notifications": {
    "1": "notification 1",
    "2": "notification 2"
  }
}
This obviously causes an error, because user.credits does not exist.
It is suspicious that the user object passed to the update function of the userRef transaction is the same as the data returned by the userNotificationRef transaction's update function.
Why is this the case? The problem goes away if I run both transactions on the user parent location, but that is a less optimal solution, as I am then effectively locking on and reading the whole user object, which is redundant when adding a write-once notification.
In my experience, you can't rely on the initial value passed into a transaction update function. Even if the data is populated in the datastore, the function might be called with null, a partial value, or a stale old value (in case of a local update in flight). This is not usually a problem as long as you take a defensive approach when writing the function (and you should!), since the bogus update will be refused and the transaction retried.
But beware: if you abort the transaction (by returning undefined) because the data doesn't make sense, then it's not checked against the server and won't get retried. For this reason, I recommend never aborting transactions. I built a monkey patch to apply this fix (and others) transparently; it's browser-only but could be adapted to Node trivially.
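A defensive update function in that spirit might look like this sketch (the field fallbacks are assumptions about the data shape):

userRef.transaction(function(user) {
  // `user` may be null, partial, or stale on the first call; compute a
  // result from whatever was given instead of returning undefined. If the
  // input was bogus, the compare-and-set fails and the function is retried
  // with the real server value.
  user = user || {}
  user.credits = (user.credits || 0) - amount
  return user
})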
Another thing you can do to help a bit is to insert an on('value') call on the same ref just before the transaction and keep it alive until the transaction completes. This will usually cause the transaction to run on the correct data on the first try, doesn't affect bandwidth too much (since the current value would need to be transmitted anyway), and increases local latency a little if you have applyLocally set or defaulting to true. I do this in my NodeFire library, among many other optimizations and tweaks.
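A minimal sketch of that on('value') trick (same ref; the listener is detached once the transaction completes):

const noop = () => {}
userRef.on('value', noop) // prime and keep the local cache in sync
userRef.transaction(updateFn, function(error, committed, snapshot) {
  userRef.off('value', noop) // transaction finished; drop the listener
})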
On top of all the above, as of this writing there's still a bug in the SDK where, very rarely, the wrong base value will get "stuck" and the transaction will retry continuously (failing with maxretry every so often) until you restart the process.
Good luck! I still use transactions on my server, where failures can be retried easily and I have multiple processes running, but I have given up on using them on the client; they're just too unreliable. In my opinion it's often better to redesign your data structures so that transactions aren't needed.
