I have to transcode videos from WebM to MP4 when they're uploaded to Firebase Storage. I have a code demo here that works, but if the uploaded video is too large, Firebase Functions will time out on me before the conversion is finished. I know it's possible to increase the timeout limit for the function, but that seems messy, since I can't ever confirm the process will take less time than the timeout limit.
Is there some way to stop Firebase from timing out without just increasing the maximum timeout limit?
If not, is there a way to complete time-consuming processes (like video conversion) while still having each process start via Firebase Function triggers?
If completing time-consuming processes with Firebase Functions isn't really an option, is there some way to speed up fluent-ffmpeg's conversion without affecting the quality too much? (I realize this part is a lot to ask. I plan on lowering the quality if I absolutely have to, as the reason WebMs are being converted to MP4 is for iOS devices.)
For reference, here's the main portion of the demo I mentioned. As I said before, the full code can be seen here, but this section of the code copied over is the part that creates the Promise that makes sure the transcoding finishes. The full code is only 70-something lines, so it should be relatively easy to go through if needed.
const functions = require('firebase-functions');
const mkdirp = require('mkdirp-promise');
const gcs = require('@google-cloud/storage')();
const Promise = require('bluebird');
const ffmpeg = require('fluent-ffmpeg');
const ffmpeg_static = require('ffmpeg-static');
(There's a bunch of text parsing code here, followed by this next chunk of code inside an onChange event)
function promisifyCommand(command) {
  return new Promise((resolve, reject) => {
    command
      .on('end', () => { resolve(); })
      .on('error', (error) => { reject(error); })
      .run();
  });
}
return mkdirp(tempLocalDir).then(() => {
  console.log('Directory Created');
  // Download item from bucket
  const bucket = gcs.bucket(object.bucket);
  return bucket.file(filePath).download({ destination: tempLocalFile }).then(() => {
    console.log('file downloaded to convert. Location:', tempLocalFile);
    let cmd = ffmpeg({ source: tempLocalFile })
      .setFfmpegPath(ffmpeg_static.path)
      .inputFormat(fileExtension)
      .output(tempLocalMP4File);
    cmd = promisifyCommand(cmd);
    return cmd.then(() => {
      // Getting here takes forever, because video transcoding takes forever!
      console.log('mp4 created at ', tempLocalMP4File);
      return bucket.upload(tempLocalMP4File, {
        destination: MP4FilePath
      }).then(() => {
        console.log('mp4 uploaded at', filePath);
      });
    });
  });
});
Cloud Functions for Firebase is not well suited (and not supported) for long-running tasks that go beyond the maximum timeout. Your only real option for doing very heavy compute work entirely within Cloud Functions is to find a way to split the work into multiple function invocations, then join the results of all that work into a final product. For something like video transcoding, that sounds like a very difficult task.
Instead, consider using a function to trigger a long-running task in App Engine or Compute Engine.
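For illustration, here's a rough sketch of that hand-off pattern (this is not from the asker's repo; the service URL, route, and payload shape are made up, and it assumes the newer onFinalize storage trigger rather than onChange): the function only notifies a long-running service and returns quickly, while the transcoding happens in a service that isn't subject to the function timeout.
const functions = require('firebase-functions');
const fetch = require('node-fetch');

// Sketch only: the App Engine URL, endpoint, and payload are hypothetical.
exports.requestTranscode = functions.storage.object().onFinalize(async (object) => {
  if (!object.name || !object.name.endsWith('.webm')) return null;
  // Hand the heavy work to the long-running service; the function itself
  // finishes almost immediately.
  await fetch('https://your-transcoder-dot-your-project.appspot.com/transcode', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ bucket: object.bucket, filePath: object.name })
  });
  return null;
});
A Pub/Sub topic works just as well as a direct HTTP call here, and is more forgiving if the service happens to be down when the upload arrives.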
As a follow-up for anyone else trying to figure out how to handle transcoding videos or some other long-running process: here's a version of the same code example I gave, rewritten to send an HTTP request to a Google App Engine process, which transcodes the file. There's no documentation for it as of right now, but looking at the Firebase functions/index.js code and the app.js code may help you with your issue.
https://github.com/Scew5145/GCSConvertDemo
Good luck.
Related
I have an Express.js server serving a Remix app. The server-side code sets several timers at startup that do various background jobs every so often, one of which checks whether a remote Jenkins build is finished. If so, it copies several large PDFs from one network path to another network path (both on GSA).
One function creates an array of chained glob+copyFile promises:
import { copyFile } from 'node:fs/promises';
import { promisify } from "util";
import glob from "glob";
...
async function getFiles() {
let result: Promise<void>[] = [];
let globPromise = promisify(glob);
for (let wildcard of wildcards) { // lots of file wildcards here
result.push(globPromise(wildcard).then(
(files: string[]) => {
if (files.length < 1) {
// do error stuff
} else {
for (let srcFile of files) {
let tgtFile = tgtDir + basename(srcFile);
return copyFile(srcFile, tgtFile);
}
}
},
(reason: any) => {
// do error stuff
}));
}
return result;
}
Another async function gets that array and does Promise.allSettled on it:
copyPromises = await getFiles();
console.log("CALLING ALLSETTLED.THEN()...");
return Promise.allSettled(copyPromises).then(
(results) => {
console.log("ALLSETTLED COMPLETE...");
Between the "CALLING" and "COMPLETE" messages, which can take on the order of several minutes, the server no longer responds to browser requests, which time out.
However, during this time my other active backend timers can still be seen running and completing just fine in the server console log (I made one run every 5 seconds for test purposes, and it runs quite smoothly over and over while those file copies are crawling along).
So it's not blocking the server as a whole, it's seemingly just preventing browser requests from being handled. And once the "COMPLETE" message pops up in the log, browser requests are served up normally again.
The Express startup script basically just does this for Remix:
const { createRequestHandler } = require("@remix-run/express");
...
app.all(
"*",
createRequestHandler({
build: require(BUILD_DIR),
mode: process.env.NODE_ENV,
})
);
What's going on here, and how do I solve this?
It's apparent no further discussion is forthcoming, and I've not determined why the async I/O functions are preventing server responses, so I'll go ahead and post an answer that was basically Konrad Linkowski's workaround solution from the comments: to use the OS to do the copies instead of using copyFile(). It boils down to this in place of the glob+copyFile calls inside getFiles:
const exec = util.promisify(require('node:child_process').exec);
...
async function getFiles() {
...
result.push( exec("copy /y " + wildcard + " " + tgtDir) );
...
}
This does not exhibit any of the request-crippling behavior; for the entire time the copies are chugging away (many minutes), browser requests are handled instantly.
It's an OS-specific solution and thus non-portable as-is, but that's fine in our case since we will likely be using a Windows server for this app for many years to come. And certainly if needed, runtime OS-detection could be used to make the commands run on other OSes.
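If portability ever does become a concern, a small platform check should be enough to pick the right shell command. A rough sketch (the wildcard and target directory are placeholders):
const util = require('util');
const exec = util.promisify(require('node:child_process').exec);

// Sketch: pick a shell copy command per OS; wildcard and tgtDir are placeholders.
function copyViaShell(wildcard, tgtDir) {
  const cmd = process.platform === 'win32'
    ? `copy /y ${wildcard} ${tgtDir}`  // Windows built-in copy
    : `cp ${wildcard} ${tgtDir}`;      // POSIX cp, glob expanded by the shell
  return exec(cmd);
}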
I guess that this is due to node's libuv using a threadpool with synchronous access for file system operations, and the pool size is only 4. See https://kariera.future-processing.pl/blog/on-problems-with-threads-in-node-js/ for a demonstration of the problem, or Nodejs - What is maximum thread that can run same time as thread pool size is four? for an explanation of how this is normally not a problem in network-heavy applications.
So if you have a filesystem-access-heavy application, try increasing the thread pool by setting the UV_THREADPOOL_SIZE environment variable.
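For example (a sketch; 64 is just a number to experiment with, and the variable has to be in place before the pool is first used, since libuv creates it lazily):
// Most reliable: set it in the environment before Node starts, e.g.
//   UV_THREADPOOL_SIZE=64 node server.js
// Setting it at the very top of the entry script usually works too, as long
// as no fs/dns/crypto work has been scheduled yet, but verify on your setup.
process.env.UV_THREADPOOL_SIZE = '64';

const { copyFile } = require('node:fs/promises');
// ... rest of the server startup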
Disclaimer: I'm not experienced with programming or with networks in general so I might be missing something quite obvious.
So I'm making a function in Node.js that should go over an array of image links from my database and check if they're still working. There are thousands of links to check, so I can't just fire off several thousand fetch calls at once and wait for results; instead I'm staggering the requests, going 10 by 10, and doing HEAD requests to minimize the bandwidth usage.
I have two issues.
The first one is that after fetching the first 10-20 links quickly, the other requests take quite a bit longer and 9 or 10 out of 10 of them will time out. This might be due to some sort of network mechanism that throttles my requests when there are many being fired at once, but I'm thinking it's likely due to my second issue.
The second issue is that the checking process slows down after a few iterations. Here's an outline of what I'm doing. I'm taking the string array of image links and slicing it 10 by 10, then I check those 10 posts in 10 promises. (Ignore the i and j variables; they're there just to track the individual promises and timeouts for logging/debugging.)
const partialResult = await Promise.all(postsToCheck.map(async (post, j) => await this.checkPostForBrokenLink(post, i + j)));
Within checkPostForBrokenLink I have a race between the fetch and a 10-second timeout, because I don't want to have to wait for the connection to time out on its own every time timing out is a problem; I give it 10 seconds, then flag it as having timed out and move on.
const timeoutPromise = index => {
let timeoutRef;
const promise = new Promise<null>((resolve, reject) => {
const start = new Date().getTime();
console.log('===TIMEOUT INIT===' + index);
timeoutRef = setTimeout(() => {
const end = new Date().getTime();
console.log('===TIMEOUT FIRE===' + index, end - start);
resolve(null);
}, 10 * 1000);
});
return { timeoutRef, promise, index };
};
const fetchAndCancelTimeout = timeout => {
return fetch(post.fileUrl, { method: 'HEAD' })
.then(result => {
return result;
})
.finally(() => {
console.log('===CLEAR===' + index); //index is from the parent function
clearTimeout(timeout);
});
};
const timeout = timeoutPromise(index);
const videoTest = await Promise.race([fetchAndCancelTimeout(timeout.timeoutRef), timeout.promise]);
If fetchAndCancelTimeout finishes before timeout.promise does, it will cancel that timeout, but if the timeout finishes first, the fetch promise is still "resolving" in the background, despite the code having moved on. I'm guessing this is why my code is slowing down. The later timeouts take 20-30 seconds from being set up to firing, despite being set to 10 seconds. As far as I know, this has to be because the main process is busy and doesn't have time to execute the event queue, though I don't really know what it could be doing except waiting for the promises to resolve.
So the question is, first off, am I doing something stupid here that I shouldn't be doing and that's causing everything to be slow? Secondly, if not, can I somehow manually stop the execution of the fetch promise if the timeout fires first, so as not to waste resources on a pointless process? Lastly, is there a better way to check whether a large number of links are valid than what I'm doing here?
I found the problem, and it wasn't, at least not directly, related to promise buildup. The code shown was for checking video links, but for images the fetch call was done by a plugin, and that plugin was causing the slowdown. When I started using the same code for both videos and images, the process suddenly became orders of magnitude quicker. I didn't think to check the plugin at first because it was only supposed to do a HEAD request and format the results, which shouldn't have been an issue.
For anyone looking at this trying to find a way to cancel a fetch, @some provided an idea that seems like it might work. Check out https://www.npmjs.com/package/node-fetch#request-cancellation-with-abortsignal
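A rough sketch of what that could look like with node-fetch and an AbortController (on newer Node versions AbortController is a global; older ones need the abort-controller package; the URL and 10-second limit are placeholders):
const fetch = require('node-fetch');

// Sketch: abort the HEAD request if it hasn't settled within `ms`.
async function headWithTimeout(url, ms = 10 * 1000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    const res = await fetch(url, { method: 'HEAD', signal: controller.signal });
    return res.ok;
  } catch (err) {
    if (err.name === 'AbortError') return null; // timed out
    throw err;
  } finally {
    clearTimeout(timer); // always stop the timer once the race is decided
  }
}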
Something you might want to investigate here is the Bluebird Promise library.
There are two functions in particular that I believe could simplify your implementation regarding rate limiting your requests and handling timeouts.
Bluebird's Promise.map has a concurrency option (link), which allows you to set the number of concurrent requests, and Bluebird promises also have a .timeout method (link), which rejects the promise if it hasn't settled within a certain time.
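To make that concrete, here's a minimal sketch assuming Bluebird and node-fetch are available (checkLink is a placeholder, and the 10-at-a-time / 10-second numbers mirror the question):
const Promise = require('bluebird');
const fetch = require('node-fetch');

// Sketch: cap each HEAD request at 10s and run at most 10 at a time.
function checkLink(url) {
  return Promise.resolve(fetch(url, { method: 'HEAD' }))
    .timeout(10 * 1000)
    .then(res => ({ url, ok: res.ok }))
    .catch(Promise.TimeoutError, () => ({ url, ok: false, timedOut: true }));
}

function checkAllLinks(urls) {
  return Promise.map(urls, checkLink, { concurrency: 10 });
}
Note that .timeout() only rejects the wrapper promise; the underlying request keeps running unless you also abort it, for example with the AbortSignal approach shown above.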
So basically I'm trying to make a ton of GET requests (about 25,000) to an API. I'm using axios as my library for making HTTP calls.
So I have:
dsHistPromises.push(axios.get(url))
And then I'm using:
axios.all(dsHistPromises)
.then(function(results) {
results.forEach(function(response){
if (format === lists.formats.highlow) {
storage.darkskyHistoryHighLow.push(requests.parseDarkskyHighLow(response, city))
}
// parse data here and print it to files...
})
}).catch(err => {
throw err
})
to handle all of my promises.
When I try to run my code, I get errors like
(node:10400) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 2): Error: read ECONNRESET
I have to imagine this is an issue with the API I'm connecting to. The server probably is having a hard time managing my requests, correct?
Are there are any tricks to getting around this?
As you suspect, I'd say it was because of the API you're accessing. Most sane APIs will have some sort of throttle to prevent one user from bringing them down. You're probably running into that.
What you'll need to do is throttle your requests by putting a time delay between calls.
There are a couple of ways to handle this. One would be to do each call in order, one at a time. This will help avoid throttling from the server, but may be slower.
If the server will allow it, you can do multiple requests in parallel.
Basically, it'd look something like this:
const urls = getUrlsToCallFromSomewhere();
const results = [];
const delay = 500; // ms between the start of one call and the next

const next = () => {
  if (!urls.length) {
    return;
  }

  const url = urls.pop();
  const start = Date.now();

  axios.get(url).then(data => {
    results.push(data);
    const now = Date.now();
    if (now - start > delay) {
      next();
    } else {
      setTimeout(next, delay - (now - start));
    }
  });
};
This isn't super perfect, but should give you an idea. Basically, call them one at a time. When it's done, check how long it's been. If you've hit the delay amount already, call it right away. If not, use setTimeout() to wait the amount of time.
You can arrange your code to get a few of these going at the same time to speed things up, and adjust the delay to be as long or as fast as you need.
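For instance (a sketch building on the code above; the worker count is arbitrary), since each loop pops its URL off the shared array, a few of them can run side by side without duplicating work:
// Sketch: start a few of the sequential loops in parallel.
const WORKERS = 3;
for (let i = 0; i < WORKERS; i++) {
  next();
}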
I am new to Node.js and I am building a small app that relies on the filesystem.
Let's say that the goal of my app is to fill a file like this:
1
2
3
4
..
And I want that at each request, a new line is written to the file, and in the right order.
Can I achieve that?
I know I can't leave my question here without any code, so here it is. I am using an Express.js server:
(We imagine that the file contains only 1 at the first code launch)
import express from 'express'
import fs from 'fs'
let app = express();
app.all('*', function (req, res, next) {
// At every request, I want to write my file
writeFile()
next()
})
app.get('/', function(req,res) {
res.send('Hello World')
})
app.listen(3000, function (req,res) {
console.log('listening on port 3000')
})
function writeFile() {
// I get the file
let content = fs.readFileSync('myfile.txt','utf-8')
// I get an array of the numbers
let numbers = content.split('\n').map(item => {
return parseInt(item)
})
// I compute the new number and push it to the list
let new_number = numbers[numbers.length - 1] + 1
numbers.push(new_number)
// I write back the file
fs.writeFileSync('myfile.txt',numbers.join('\n'))
}
My guess was that because the process is synchronous, I could be sure nothing else was being done at the same moment, but I was really not sure...
If I am unclear, please tell me in the comments
If I understood you correctly, what you are scared of is a race condition happening in this case, where if two clients reach the HTTP server at the same time, the file is saved with the same contents where the number is only incremented once instead of twice.
The simple fix for it is to make sure the shared resource is only accessed or modified once at a time. In this case, using synchronous methods fixes your problem, because while they are executing the whole Node process is blocked and will not do anything else.
If you change the synchronous methods with their asynchronous counter-parts without any other concurrency control measures then your code is definitely vulnerable to race conditions or corrupted state.
Now if this is the only thing your application is doing, it's probably best to keep it this way, as it's very simple. But let's say you want to add other functionality to it; in that case you probably want to avoid any synchronous methods, as they block the process and won't let you have any concurrency.
A simple way to add concurrency control is to have a counter which keeps track of how many requests are pending. If nothing is pending (counter === 0), we just read and write the file; either way we bump the counter. Once writing to the file is finished we decrement the counter, and if more requests arrived in the meantime we repeat:
app.all('*', function (req, res, next) {
// At every request, I want to write my file
writeFile();
next();
});
let counter = 0;

function writeFile() {
  if (counter === 0) {
    counter++;
    work(function onWriteFileDone() {
      counter--;
      if (counter > 0) {
        // more requests arrived while we were writing; process the next one
        work(onWriteFileDone);
      }
    });
  } else {
    counter++;
  }

  function work(callback) {
    // I get the file
    fs.readFile('myfile.txt', 'utf-8', function (err, content) {
      // ignore the error because life is too short on stackoverflow questions...
      // I get an array of the numbers
      let numbers = content.split('\n').map(item => parseInt(item, 10));
      // I compute the new number and push it to the list
      let new_number = numbers[numbers.length - 1] + 1;
      numbers.push(new_number);
      // I write back the file
      fs.writeFile('myfile.txt', numbers.join('\n'), callback);
    });
  }
}
Of course this function doesn't take any arguments, but if you want to add some, you'll have to use a queue instead of the counter and store the arguments in the queue.
Now, don't write your own concurrency mechanisms; there are a lot of them in the Node ecosystem. For example you can use the async module, which provides a queue.
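For instance, a rough sketch of the same idea with async.queue (assuming the async package is installed; a concurrency of 1 means the writes run one at a time):
const asyncLib = require('async');
const fs = require('fs');

// Sketch: every request pushes a task; the queue runs them one by one.
const writeQueue = asyncLib.queue(function (task, done) {
  fs.readFile('myfile.txt', 'utf-8', function (err, content) {
    if (err) return done(err);
    let numbers = content.split('\n').map(item => parseInt(item, 10));
    numbers.push(numbers[numbers.length - 1] + 1);
    fs.writeFile('myfile.txt', numbers.join('\n'), done);
  });
}, 1);

// In the middleware: writeQueue.push({});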
Note that if you only have one process, you don't have to worry about multiple threads, since in Node.js there's only one thread of execution at a time within a process. But if there are multiple processes writing to the file, things get more complicated; let's keep that for another question if it's not already covered. Operating systems provide a few different ways to handle it, and you could also use your own lock files, a dedicated process that does the writing, or a message queue process.
At the moment, I'm trying to request a very large JSON object from an API (particularly this one) which, depending on various factors, can be upwards of a few MB. The problem, however, is that Node.js takes forever to do anything and then just runs out of memory: the first line of my response callback never executes.
I could request each item individually, but that is a tremendous number of requests. To quote a dev behind the new API:
Until now, if you wanted to get all the market orders for Tranquility you had to request every type per region individually. That would generally be 50+ regions multiplied by upwards of 13,000 types. Even if it was just 13,000 types and 50 regions, that is 650,000 requests required to get all the market information. And if you wanted to get all the data in the 5-minute cache window, it would require almost 2,200 requests per second.
Obviously, that is not a great idea.
I'm trying to get the array items into redis for use later, then follow the next url and repeat until the last page is reached. Is there any way to do this?
EDIT:
Here's the problem code. Visiting the URL works fine in-browser.
// ...
REGIONS.forEach((region) => {
LOG.info(' * Grabbing data for `' + region.name + '#' + region.id + '`');
var href = url + region.id + '/orders/all/', next = href;
var page = 1;
while (!!next) {
https.get(next, (res) => {
LOG.info(' * * Page ' + page++ + ' responded with ' + res.statusCode);
// ...
The first LOG.info line executes, while the second does not.
It appears that you are doing a while(!!next) loop which is the cause of your problem. If you show more of the server code, we could advise more precisely and even suggest a better way to code it.
JavaScript runs your code single-threaded. That means one thread of execution runs to completion before any other events can be run.
So, if you do:
while(!!next) {
https.get(..., (res) => {
// hoping this will run
});
}
Then, your callback to https.get() will never get called. Your while loop just keeps running forever, and as long as it is running, the callback from the https.get() can never get called. That request has likely long since completed and there's an event sitting in the internal JS event queue to call the callback, but until your while() loop finishes, that event can't be processed. So you have a deadlock: the while() loop is waiting for something else to run to change its condition, but nothing else can run until the while() loop is done.
There are several other ways to do serial async iterations. In general, you can't use .forEach() or while().
Here are several schemes for async looping:
Node.js: How do you handle callbacks in a loop?
While loop with jQuery async AJAX calls
How to synchronize a sequence of promises?
How to use after and each in conjunction to create a synchronous loop in underscore js
Or, the async library which you mentioned also has functions for doing async looping.
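To make that concrete for the paging in the question, here's a rough sketch (not the asker's actual code; getNextUrl and the Redis step are placeholders) where the next page is only requested from inside the previous page's completion handler, so the event loop is never blocked by a waiting loop:
const https = require('https');

// Sketch: serial async paging; nothing advances until the previous
// response has been fully handled.
function fetchAllPages(firstUrl, done) {
  let page = 1;

  function fetchPage(url) {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        LOG.info(' * * Page ' + page++ + ' responded with ' + res.statusCode);
        const orders = JSON.parse(body);
        // ...push `orders` into redis here (placeholder)...
        const next = getNextUrl(res.headers); // placeholder for the API's paging scheme
        if (next) {
          fetchPage(next); // only now move on to the next page
        } else {
          done(null);
        }
      });
    }).on('error', done);
  }

  fetchPage(firstUrl);
}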
First of all, a few MB of JSON payload is not exactly huge, so the route handler code might require some close scrutiny.
However, to actually deal with huge amounts of JSON, you can consume your request as a stream. JSONStream (along with many other similar libraries) allows you to do this in a memory-efficient way. You can specify the paths you need to process using JSONPath (an XPath analog for JSON) and then subscribe to the stream for matching data sets.
Following example from the README of JSONStream illustrates this succinctly:
var request = require('request')
, JSONStream = require('JSONStream')
, es = require('event-stream')
request({url: 'http://isaacs.couchone.com/registry/_all_docs'})
.pipe(JSONStream.parse('rows.*'))
.pipe(es.mapSync(function (data) {
console.error(data)
return data
}))
Use the stream functionality of the request module to process large amounts of incoming data. As data comes through the stream, parse it to a chunk of data that can be worked with, push that data through the pipe, and pull in the next chunk of data.
You might create a transform stream to manipulate a chunk of data that has been parsed and a write stream to store the chunk of data.
For example:
var stream = request({ url: your_url })
  .pipe(parseStream)
  .pipe(transformStream)
  .pipe(writeStream);

stream.on('finish', () => {
  setImmediate(() => process.exit(0));
});
For info on creating streams, try https://bl.ocks.org/joyrexus/10026630