puppeteer-cluster: queue instead of execute - javascript

I'm experimenting with Puppeteer Cluster and I just don't understand how to use queuing properly. Can it only be used for calls where you don't wait for a response? I'm using Artillery to fire a bunch of requests simultaneously. With queue they all fail, while only some fail when I use execute directly.
I've taken the code straight from the examples and replaced execute with queue, which I expected to work, except the code doesn't wait for the result. Is there a way to achieve this anyway?
So this works:
const screen = await cluster.execute(req.query.url);
But this breaks:
const screen = await cluster.queue(req.query.url);
Here's the full example with queue:
const express = require('express');
const app = express();
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    // make a screenshot
    await page.goto('http://' + url);
    const screen = await page.screenshot();
    return screen;
  });

  // setup server
  app.get('/', async function (req, res) {
    if (!req.query.url) {
      return res.end('Please specify url like this: ?url=example.com');
    }
    try {
      const screen = await cluster.queue(req.query.url);
      // respond with image
      res.writeHead(200, {
        'Content-Type': 'image/jpg',
        'Content-Length': screen.length // variable is undefined here
      });
      res.end(screen);
    } catch (err) {
      // catch error
      res.end('Error: ' + err.message);
    }
  });

  app.listen(3000, function () {
    console.log('Screenshot server listening on port 3000.');
  });
})();
What am I doing wrong here? I'd really like to use queuing because without it every incoming request appears to slow down all the other ones.

Author of puppeteer-cluster here.
Quote from the docs:
cluster.queue(..): [...] Be aware that this function only returns a Promise for backward compatibility reasons. This function does not run asynchronously and will immediately return.
cluster.execute(...): [...] Works like Cluster.queue, just that this function returns a Promise which will be resolved after the task is executed. In case an error happens during the execution, this function will reject the Promise with the thrown error. There will be no "taskerror" event fired.
When to use which function:
Use cluster.queue if you want to queue a large number of jobs (e.g. list of URLs). The task function needs to take care of storing the results by printing them to console or storing them into a database.
Use cluster.execute if your task function returns a result. This will still queue the job, so it is like calling queue in addition to waiting for the job to finish. In this scenario, there is most often an "idling cluster" present which is used when a request hits the server (like in your example code).
So, you definitely want to use cluster.execute, as you want to wait for the results of the task function. The reason you do not see any errors is (as quoted above) that errors from cluster.queue jobs are emitted via a taskerror event, while cluster.execute errors are thrown directly (the Promise is rejected). Most likely your jobs fail in both cases, but it is only visible for cluster.execute.
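For illustration, here is a minimal sketch of both error-handling paths, adapted from the question's code (the taskerror listener is the mechanism the docs describe for queue):

// execute: the rejection surfaces at the await, so try/catch works
app.get('/', async (req, res) => {
  try {
    const screen = await cluster.execute(req.query.url);
    res.writeHead(200, { 'Content-Type': 'image/jpg', 'Content-Length': screen.length });
    res.end(screen);
  } catch (err) {
    res.end('Error: ' + err.message);
  }
});

// queue: failures are only reported via the taskerror event
cluster.on('taskerror', (err, data) => {
  console.log('Error while processing ' + data + ': ' + err.message);
});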

Related

Exiting the worker thread in NodeJS

Only 'text' is output to the console; 'text2' and 'text3' are not, because the thread exits first. This is the most simplified version of the real project's structure. I can't figure out why this is happening and what to do about it.
This is the worker's handler:
async function func1() {
  await console.log('text2')
}

async function func() {
  await console.log('text')
  await func1();
  await console.log('text3')
}

async function handler() {
  await func()
  await process.exit(123)
}

handler();
The only thing I can add is that the code above is contained in a handler.js file and is run like this:
const { Worker } = require('worker_threads')
const myWorker = new Worker('./handler.js')

myWorker.on('exit', (data) => {
  console.log('Worker exit: ' + data)
})
And the output is the following:
(console screenshot: only 'text' is printed before the worker exits)
Despite the lack of a true, reproducible example for your use case, this question is interesting and the answer wasn't obvious to find. I am not as read up on the worker_threads API as I'd like to be, but I'll offer my two cents.
After doing some research/testing (and not really having much to go on in terms of your specific use case), I believe it is because you are calling process.exit inside the worker thread. Worker threads forward their console output to the main thread, so it takes some time before the log statements all appear. Calling process.exit here must be cutting off any remaining operations from that worker before they have a chance to run.
When you remove process.exit, all log statements run and the worker exits naturally.
If you need to close the worker thread at a particular point in time (which I think is the real question here), you might be better off sending a message back to the main thread using the parentPort.postMessage() method and then having the main thread terminate the worker:
// handler.js
const { parentPort } = require('worker_threads');

async function func1() {
  await console.log('text2')
}

async function func() {
  await console.log('text')
  await func1();
  await console.log('text3')
}

async function handler() {
  await func()
  parentPort.postMessage({ kill: true });
}

handler();
Then listen for that message event and terminate the worker from the main thread:
// index.js
const { Worker } = require('worker_threads')
const myWorker = new Worker('./handler.js')

myWorker.on('exit', (data) => {
  console.log('Worker exit: ' + data)
})

myWorker.on('message', async msg => {
  if (msg.kill) {
    console.log('killing worker with code');
    await myWorker.terminate();
  }
})
There are some gotchas here: the terminate method terminates the thread "as soon as possible", you are relying on events, and you will need to prevent the worker from continuing on while the main thread executes the terminate method. However, without more information I can't be of much help. From our comments, it might also be worth looking into child processes for this. Hope this helps.
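As a hedge against that race, one option (a sketch only, not tested against your real project) is to make postMessage the worker's last statement, so there is no further work for terminate() to cut off:

// handler.js (variation): postMessage is the last statement, so whether
// the main thread terminates the worker first or the worker simply runs
// to completion, all log statements have already run.
const { parentPort } = require('worker_threads');

async function handler() {
  console.log('text')
  console.log('text2')
  console.log('text3')
  parentPort.postMessage({ kill: true }); // signal the main thread last
}

handler();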

Nodejs : Promise chain terminates after a certain limit without throwing any errors

I am trying to do a simple operation in nodejs using promises. I have an array which contains objects. These objects in turn contain query parameters for a URL that I want to hit with a GET request. I want the GET requests to be sequential, as the number of requests is around 6000. I searched around the internet and stumbled onto this medium article, which shows how to run promises sequentially.
Medium article link
Following that approach, I wrote the following snippet:
let itr = set[Symbol.iterator](); // set contains the objects to be passed to the function that makes the GET request

let runNext = () => {
  fetchLinks(itr.next().value).then(x => {
    // fetchLinks returns a new Promise which wraps the GET request;
    // on a successful response, it resolves with 1
    runNext();
  }).catch(err => {
    console.log("storing");
    // fetchLinks rejects when an undefined object is detected;
    // a global object stores the response data, which is then written to a file
    translateMapAndStore();
  });
}

runNext(); // initiate the recursive promise chain
The error that I face right now is that the process terminates after fetching exactly 11 times. I don't know the reason behind this. No errors are thrown and the process exits gracefully. Is there something that I have missed here?
Why not just do it like below? An async function with a for...of loop runs the requests sequentially, and errors are handled per request:
const urls = ["url0", "url1", "url2"];

(async function run() {
  for (const url of urls) {
    try {
      await fetchLinks(url);
    } catch (err) {
      console.log(err);
      console.log("storing");
      translateMapAndStore();
    }
  }
})();
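If you'd rather keep the iterator from the original snippet, a hypothetical variation is to check the iterator's done flag instead of relying on fetchLinks rejecting for an undefined value (fetchLinks and translateMapAndStore are the question's own functions):

// Sketch only: stops cleanly when the iterator is exhausted
async function runAll(itr) {
  for (let step = itr.next(); !step.done; step = itr.next()) {
    try {
      await fetchLinks(step.value);
    } catch (err) {
      console.log(err); // a single failed fetch no longer ends the whole run
    }
  }
  console.log("storing");
  translateMapAndStore();
}

runAll(set[Symbol.iterator]());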

Setting delay/timeout for axios requests in map() function

I am using node and axios (with TS, but that's not too important) to query an API. I have a suite of scripts that make calls to different endpoints and log the data (sometimes filtering it.) These scripts are used for debugging purposes. I am trying to make these scripts "better" by adding a delay between requests so that I don't "blow up" the API, especially when I have a large array I'm trying to pass. So basically I want it to make a GET request and pause for a certain amount of time before making the next request.
I have played with setTimeout(), but I only end up putting it in places where it adds the delay after the requests have executed; everywhere I have inserted it has had this result. I understand why I am getting this result; I just had to try everything I could to at least increase my understanding of how things are working.
I have thought about trying to set up a queue or using interceptors, but I think I might be straying far from a simpler solution with those ideas.
Additionally, I have another "base script" that I wrote on the fly (sorta the birth point for this batch of scripts) that I constructed with a for loop instead of the map() function and Promise.all. I have played with trying to set the delay in that script as well, but I didn't get anywhere helpful.
var axios = require('axios');
var fs = require('fs');

const Ids = [arrayOfIds];

try {
  // Promise.all takes an array of promises
  Promise.all(Ids.map(id => {
    // Return each request as its own promise
    return axios
      .get(URL + 'endPoint/' + id, config)
  }))
    .then((vals) => {
      // vals is the array of data from the resolved Promise.all
      fs.appendFileSync(`${__dirname}/*responseOutput.txt`,
        vals.map((v) => {
          return `${JSON.stringify(v.data)} \n \r`
        }).toString())
    }).catch((e) => console.log(e)) // was `.catch((e) => console.log)`, which never actually logged
} catch (err) {
  console.log(err);
}
No errors with the above code; just can't figure out how to put the delay in correctly.
You could try Promise.map from Bluebird. It has an option for setting the concurrency:
var axios = require('axios');
var fs = require('fs');
var Promise = require('bluebird');

const Ids = [arrayOfIds];
let concurrency = 3; // at most 3 HTTP requests will run concurrently

try {
  Promise.map(Ids, id => {
    console.log(`starting request`, id);
    return axios.get(URL + 'endPoint/' + id, config)
  }, { concurrency })
    .then(vals => {
      console.log({vals});
    });
} catch (err) {
  console.log(err);
}
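If you'd rather have a fixed pause between requests without pulling in Bluebird, a minimal sketch is a sequential loop with a promise-based sleep (fetchAllWithDelay and delayMs are illustrative names; URL and config come from your snippet):

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchAllWithDelay(ids, delayMs) {
  const vals = [];
  for (const id of ids) {
    // sequential: each request starts only after the previous one finished
    vals.push(await axios.get(URL + 'endPoint/' + id, config));
    await sleep(delayMs); // fixed pause before the next request
  }
  return vals;
}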

What is the fastest way to send a response to the client and then run long/heavy actions in nodejs

I'm trying to figure out the best/fastest way to send the response from expressjs and then log or run long actions on the server, without delaying the response to the client.
I have the following code, but I see that the response to the client is sent only after the loop is finished. I thought the response would be sent first because I'm triggering res.send(html) and then calling longAction.
function longAction () {
  for (let i = 0; i < 1000000000; i++) {}
  console.log('Finish');
}

function myfunction (req, res) {
  res.render(MYPATH + '/index.response.html', {title: 'My Title'}, (err, html) => {
    if (err) {
      re.status(500).json({'error':'Internal Server Error. Error Rendering HTML'});
    }
    else {
      res.send(html);
      longAction();
    }
  });
}
router.post('/getIndex', myfunction);
What is the best way to send the response and then run my long/heavy actions?
Or what am I missing?
I'm trying to figure out the best/fastest way to send the response from expressjs and then log or run long actions on the server, without delaying the response to the client.
The best way to do this is to only call longAction() when express tells you that the response has been sent. Since the response object is a stream, you can use the finish event on that stream to know when all data from the stream has been flushed to the underlying OS.
From the writable stream documentation:
The 'finish' event is emitted after the stream.end() method has been called, and all data has been flushed to the underlying system.
Here's how you could use that in your specific code:
function myfunction (req, res) {
  res.render(MYPATH + '/index.response.html', {title: 'My Title'}, (err, html) => {
    if (err) {
      res.status(500).json({'error':'Internal Server Error. Error Rendering HTML'});
    }
    else {
      res.on('finish', () => {
        longAction();
      });
      res.send(html);
    }
  });
}
For a little more explanation of the finish event, you can start by looking at the Express code for res.send() and see that it ends up calling res.end() to actually send the data. If you then look at the documentation for .end() on a writable stream, it says this:
Calling the writable.end() method signals that no more data will be written to the Writable. The optional chunk and encoding arguments allow one final additional chunk of data to be written immediately before closing the stream. If provided, the optional callback function is attached as a listener for the 'finish' event.
So, since Express doesn't expose access to the callback that .end() offers, we just listen to the finish event ourselves to be notified when the stream is done sending its last bit of data.
Note, there is also a typo in your code where re.status(500) should be res.status(500).
Use setImmediate:
var test = function () {
  for (let i = 0; i < 1000000000; i++) {}
  console.log('Finish');
}

router.get("/a", function (req, res, next) {
  setImmediate(test);
  return res.status(200).json({});
});
Your long function will then be executed at the end of the current event loop cycle: after any I/O operations in the current cycle (in this sample, sending the response with res.status(200).json({})) and before any timers scheduled for the next one.
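To see that ordering for yourself, a standalone sketch (not part of the route above): when both are scheduled from inside an I/O callback, the setImmediate callback runs before a zero-delay timer:

const fs = require('fs');

fs.readFile(__filename, () => {
  setTimeout(() => console.log('timeout'), 0);  // timers phase of the next loop
  setImmediate(() => console.log('immediate')); // check phase of this loop, runs first
});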
Thanks for the answers. After checking, I think the best approach is listening to the finish event:
res.on('finish', () => {
  // Do other stuff after the response has been sent
});

node.js async request with timeout?

Is it possible, in node.js, to make an asynchronous call that times out if it takes too long (or doesn't complete) and triggers a default callback?
The details:
I have a node.js server that receives a request and then makes multiple requests asynchronously behind the scenes before responding. The basic issue is covered by an existing question, but some of these calls are considered "nice to have": if we get the response back, it enhances the response to the client, but if they take too long, it is better to respond to the client in a timely manner than to wait for them.
At the same time, this approach would protect against services that simply aren't completing or are failing, while allowing the main thread of operation to respond.
You can think of this in the same way as a Google search that has one core set of results, but provides extra responses based on other behind the scenes queries.
If it's simple, just use setTimeout:
app.get('/', function (req, res) {
  var result = {};
  // populate object
  http.get('http://www.google.com/index.html', (res) => {
    result.property = response;
    return res.send(result);
  });
  // if we haven't returned within a second, return without data
  setTimeout(function () {
    return res.send(result);
  }, 1000);
});
Edit: as mentioned by peteb, I forgot to check whether we had already sent a response. This can be accomplished by using res.headersSent or by maintaining a "sent" flag yourself. I also noticed the res variable was being reassigned:
app.get('/', function (req, res) {
  var result = {};
  // populate object
  http.get('http://www.google.com/index.html', (httpResponse) => {
    result.property = httpResponse;
    if (!res.headersSent) {
      res.send(result);
    }
  });
  // if we haven't responded within a second, respond without data
  setTimeout(function () {
    if (!res.headersSent) {
      res.send(result);
    }
  }, 1000);
});
Check this example of a timeout callback: https://github.com/jakubknejzlik/node-timeout-callback/blob/master/index.js
You could modify it to perform an action when time runs out, or simply catch the error.
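A minimal promise-based sketch of the same idea (withTimeout and fetchExtraData are illustrative names, not part of that library): Promise.race resolves with a fallback value if the real call takes too long:

function withTimeout(promise, ms, fallback) {
  const timeout = new Promise((resolve) => {
    setTimeout(() => resolve(fallback), ms); // the fallback wins the race after ms
  });
  return Promise.race([promise, timeout]);
}

// usage: give the "nice to have" lookup one second, then fall back to null
// const extra = await withTimeout(fetchExtraData(), 1000, null);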
You can try using a timeout, for example with the setTimeout() method:
Set up a timeout handler: var timeOutX = setTimeout(function…
Inside it, set that variable to null: timeOutX = null (to indicate that the timeout has fired)
Then execute your callback function with one argument (for error handling): callback({error:'The async request timed out'});
Add the time for your timeout function, for example 3 seconds
Something like this:
var timeOutX = setTimeout(function () {
  timeOutX = null;
  yourCallbackFunction({error:'The async request timed out'});
}, 3000);
With that set, you can then call your async function with a check to make sure your timeout handler hasn't fired yet.
Finally, before you run your callback function, you must clear the scheduled timeout handler using the clearTimeout() method.
Something like this:
yourAsyncFunction(yourArguments, function () {
  if (timeOutX) {
    clearTimeout(timeOutX);
    yourCallbackFunction();
  }
});
