Piping the response of an http request to a file is pretty easy:
http.get(url, function (res) {
  var file = fs.createWriteStream(filename)
  res.pipe(file)
  file.on('finish', function () {
    file.close()
  })
})
But when I try to set up a retry system, things get complicated. I can decide whether to retry based on res.statusCode, but if I decide to retry, that means not piping the response to the writable stream, so the response just stays open. This slows down execution considerably when I do many retries. One workaround is to (uselessly) listen to the data and end events just to get the response to close, but then if I decide not to retry, I can no longer pipe to the write stream, as the response has already been consumed.
Anyways, I'm surely not the only one wanting to stream the response of an http request to a file, retrying when I get a 503, but I can't seem to find any code "out there" that does this.
Solved. The slowness happens when a lot of responses are left open (unconsumed). The solution was to response.resume() them, letting them "spew in nothingness" when a retry is necessary. So in pseudo code:
http.get(url, function (response) {
  if (response.statusCode !== 200) {
    response.resume()
    retry()
  } else {
    response.pipe(file)
  }
})
My original problem was that I was checking whether to retry too late, which forced me to "spew into nothingness" before deciding, and made it impossible to decide not to retry because the data from the response stream had already been consumed.
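For completeness, here is a minimal sketch of how the whole retry flow could look; the download() helper, the retry count, and the fixed one-second delay are illustrative assumptions, not code from the original post:

var http = require('http')
var fs = require('fs')

// Sketch: retry on non-200 responses, draining each discarded response with
// resume() so the socket is freed. Names and retry policy are assumptions.
function download (url, filename, retriesLeft, callback) {
  http.get(url, function (response) {
    if (response.statusCode !== 200) {
      response.resume() // let the unwanted response "spew into nothingness"
      if (retriesLeft > 0) {
        setTimeout(function () {
          download(url, filename, retriesLeft - 1, callback)
        }, 1000)
      } else {
        callback(new Error('Giving up after too many retries'))
      }
      return
    }
    var file = fs.createWriteStream(filename)
    response.pipe(file)
    file.on('finish', function () {
      file.close(callback)
    })
  }).on('error', callback)
}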
Related
There are various posts dealing with the general issue of timeouts when using http.get(), but none of them seems to address how to deal with timeouts that occur during the stream itself, after a successful response has already been received.
Take this code for example. It sends a request to a dummy server that responds on time but creates an artificial timeout in the middle of the stream:
// Assumed setup, not shown in the original snippet:
const http = require('http');
const fs = require('fs');
const { promisify } = require('util');
const pipelinePromisified = promisify(require('stream').pipeline); // presumably how pipelinePromisified was defined

(async () => {
  // Send a request to a dummy server that creates a timeout IN THE MIDDLE OF THE STREAM.
  const request = http.get('http://localhost/timeout/', async (res) => {
    const write = fs.createWriteStream('.text.txt');
    try {
      await pipelinePromisified(res, write);
      // The timeout error during the stream is not caught by pipeline,
      // so the promise gets resolved.
      console.log('Everything went fine');
    } catch (error) {
      // The error is NOT caught by pipeline!
      console.log('Error from pipeline', error);
    }
  }).on('error', (e) => {
    // Errors on the request itself are caught here.
    console.log('error from request on error');
  });

  request.setTimeout(4000, () => {
    request.destroy(new Error('request timed out'));
    // This causes the piping of the streams to stop in the middle (resulting in a
    // partial file), but pipeline doesn't treat this as an error.
  });
})();
Note the key issue: the timeout during the stream is recognized, the request is destroyed, and the IncomingMessage (response) stops pumping data, but pipeline doesn't recognize this as an error.
The outcome is that the client code is never told that the file was only partially downloaded, because no error is thrown.
How should this situation be handled? In my testing, calling response.emit('error') seems to solve it, but Node's docs clearly state not to do this.
Any help would be greatly appreciated.
Update: It seems that on Node 12 (I have 10 and 12 installed via nvm) an error is caught by pipeline, but not on Node 10.
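One way to detect the truncated download without manually emitting an error on the response is to check response.complete (a documented IncomingMessage property) after the pipeline settles, and to discard the partial file; a rough sketch, where the URL, file name, and cleanup strategy are just placeholders:

const http = require('http');
const fs = require('fs');
const { promisify } = require('util');
const pipeline = promisify(require('stream').pipeline);

// Sketch: treat an incomplete response as an error even when pipeline resolves,
// which covers the Node 10 behaviour described above.
http.get('http://localhost/timeout/', async (res) => {
  const write = fs.createWriteStream('.text.txt');
  try {
    await pipeline(res, write);
    if (!res.complete) {
      // The connection was terminated before the full body arrived.
      throw new Error('Response ended prematurely; file is partial');
    }
    console.log('Download complete');
  } catch (error) {
    console.log('Download failed:', error.message);
    fs.unlink('.text.txt', () => {}); // discard the partial file
  }
}).on('error', (e) => {
  console.log('Request error:', e.message);
});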
I've now spent countless hours trying to get the Cache API to cache a simple request. I had it working once in between, but forgot to add something to the cache key, and now it's not working anymore. Needless to say, the fact that cache.put() has no return value indicating whether the request was actually cached does not exactly help, and I'm left with trial and error. Can someone give me a hint on what I'm doing wrong and what is actually required? I've read all the documentation more than three times now and I'm at a loss.
Perhaps noteworthy: this REST endpoint sets pragma: no-cache and everything else cache-related to no-cache, but I want to forcibly cache the response anyway, which is why I tried to completely rewrite the headers before caching. It still isn't working (not matching, or not storing, no one knows…).
async function apiTest(token, url) {
  let apiCache = await caches.open("apiResponses");
  let request = new Request(
    new URL("https://api.mysite.com/api/" + url),
    {
      headers: {
        "Authorization": "Bearer " + token,
      }
    }
  );
  // Check if the response is already in the Cloudflare cache
  let response = await apiCache.match(request);
  if (response) {
    console.log("Serving from cache");
  }
  if (!response) {
    // If not, ask the origin if the permission is granted
    response = await fetch(request);
    // Cache the response in the Cloudflare cache
    response = new Response(response.body, {
      status: response.status,
      statusText: response.statusText,
      headers: {
        "Cache-Control": "max-age=900",
        "Content-Type": response.headers.get("Content-Type"),
      }
    });
    await apiCache.put(request, response.clone());
  }
  return response;
}
Thanks in advance for any help. I asked the same question on the Cloudflare community first and haven't received an answer in two weeks.
This might be related to your use of caches.default, instead of opening a private cache with caches.open("whatever"). When you use caches.default, you are sharing the same cache that fetch() itself uses. So when your worker runs, your worker checks the cache, then fetch() checks the cache, then fetch() later writes the cache, and then your worker also writes the same cache entry. Since the write operations in particular happen asynchronously (as the response streams through), it's quite possible that they are overlapping and the cache is getting confused and tossing them all out.
To avoid this, you should open a private cache namespace. So, replace this line:
let cache = caches.default;
with:
let cache = await caches.open("whatever");
(This await always completes immediately; it's only needed because the Cache API standard insists that this method is asynchronous.)
This way, you are reading and writing a completely separate cache entry from the one that fetch() itself reads/writes.
The use case for caches.default is when you intentionally want to operate on exactly the cache entry that fetch() would also use, but I don't think you need to do that here.
EDIT: Based on conversation below, I now suspect that the presence of the Authorization header was causing the cache to refuse to store the response. But, using a custom cache namespace (as described above) means that you can safely cache the value using a Request that doesn't have that header, because you know the cached response can only be accessed by the Worker via the cache API. It sounds like this approach worked in your case.
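As a concrete illustration of that last point, the request used as the cache key can be rebuilt without the Authorization header before calling match() and put(); a rough sketch under the assumptions of the question's code (the key construction and the max-age value are illustrative, not from the answer):

async function apiTest(token, url) {
  const apiCache = await caches.open("apiResponses");

  // Hypothetical: use a cache key WITHOUT the Authorization header, so the
  // cache has no reason to refuse to store the response.
  const cacheKey = new Request("https://api.mysite.com/api/" + url);

  let response = await apiCache.match(cacheKey);
  if (response) {
    console.log("Serving from cache");
    return response;
  }

  // The authenticated request still goes to the origin as before.
  response = await fetch("https://api.mysite.com/api/" + url, {
    headers: { "Authorization": "Bearer " + token }
  });

  // Rewrite the headers so the response is cacheable, then store it under
  // the unauthenticated key.
  response = new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers: {
      "Cache-Control": "max-age=900",
      "Content-Type": response.headers.get("Content-Type"),
    }
  });
  await apiCache.put(cacheKey, response.clone());
  return response;
}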
I'm working on a project that involves a lot of large files where I only need the HTTP headers rather than the entire file, so I'm using the request module to read the headers as soon as they arrive and then abort the request, since I don't need the rest of the file. Currently I assign the request object and then attach a listener for the response event, as in the code below.
const req = request(url);
req.on('response', function (res) {
  if (res.statusCode !== 200) {
    req.abort();
    return this.emit('error', new Error('Bad status code'));
  }
  if (res.headers.hasOwnProperty(headProp)) {
    parseFunc(res.headers);
    req.abort();
  }
});
Ideally, I'd like, if possible, to use Promises to process the request, something like:
const req = request.getAsync(url);
req.on('response', function (res) {
  // whatever logic
});
req.then(parseFunc(res.headers));
But listener events don't really work here, since the request object hasn't been saved to anything. Additionally, a then chained onto request.getAsync() seems to execute only after the file has been fully parsed, which can take 10-11 seconds versus the 250 ms to 1 s of exiting upon the abort.
So, in short: can I get the functionality I desire while avoiding callbacks?
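One possible direction (a sketch, not a definitive answer) is to wrap the request module's 'response' event in a Promise yourself and abort as soon as the headers are available; the getHeaders name is made up, and headProp/parseFunc are the placeholders from the question:

const request = require('request');

// Sketch: resolve with the response headers as soon as they arrive,
// then abort so the body is never downloaded.
function getHeaders(url) {
  return new Promise((resolve, reject) => {
    const req = request(url);
    req.on('response', (res) => {
      req.abort(); // headers are available; no need for the body
      if (res.statusCode !== 200) {
        return reject(new Error('Bad status code'));
      }
      resolve(res.headers);
    });
    req.on('error', reject);
  });
}

// Usage, mirroring the question's headProp/parseFunc placeholders:
getHeaders(url)
  .then((headers) => {
    if (headers.hasOwnProperty(headProp)) {
      parseFunc(headers);
    }
  })
  .catch(console.error);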
I have a node.js process that uses a large number of client requests to pull information from a website. I am using the request package (https://www.npmjs.com/package/request) since, as it says: "It supports HTTPS and follows redirects by default."
My problem is that after a certain period of time, the requests begin to hang. I haven't been able to determine if this is because the server is returning an infinite data stream, or if something else is going on. I've set the timeout, but after some number of successful requests, some of them eventually get stuck and never complete.
var options = { url: 'some url', timeout: 60000 };
request(options, function (err, response, body) {
  // process
});
My questions are: can I shut down a connection after a certain amount of data has been received using this library, and can I stop the request from hanging? Do I need to use the http/https libraries and handle the redirects and protocol switching myself in order to get the kind of control I need? If I do, is there a standardized practice for that?
Edit: Also, if I stop the process and restart it, they pick right back up and start working, so I don't think it is related to the server or the machine the code is running on.
Note that with request(options, callback), the callback is fired only once the request has completed, and there is no way to break the request off early.
You should listen on the 'data' event instead:
var request = require('request');

var stream = request(options);
var len = 0;

stream.on('data', function (data) {
  // TODO: process your data here

  // Abort the stream once more than 1000 bytes have been received
  len += Buffer.byteLength(data);
  if (len > 1000) {
    stream.abort();
  }
});
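If the hanging requests also need a hard stop, the same byte guard can be combined with the package's timeout option and an overall deadline in a small helper; this is only a sketch under the assumption that the request package stays in use, and fetchWithLimits, maxBytes, and maxMillis are made-up names:

var request = require('request');

// Sketch: abort the request if it exceeds a byte limit or a total time limit.
function fetchWithLimits(url, maxBytes, maxMillis, callback) {
  var stream = request({ url: url, timeout: 60000 }); // socket-level timeout
  var chunks = [];
  var len = 0;
  var done = false;

  var timer = setTimeout(function () {
    finish(new Error('Overall time limit exceeded'));
    stream.abort();
  }, maxMillis);

  function finish(err, body) {
    if (done) return; // guard against firing the callback twice
    done = true;
    clearTimeout(timer);
    callback(err, body);
  }

  stream.on('data', function (data) {
    len += Buffer.byteLength(data);
    if (len > maxBytes) {
      stream.abort();
      return finish(new Error('Byte limit exceeded'));
    }
    chunks.push(data);
  });
  stream.on('end', function () {
    finish(null, Buffer.concat(chunks).toString());
  });
  stream.on('error', finish);
}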
If you look at the answer by Casey Chu (answered Nov 30 '10) to this question: How do you extract POST data in Node.js?
You will see that he responds to 'data' events to construct the body of the request. Reproducing the code here:
var qs = require('querystring');

function (request, response) {
  if (request.method == 'POST') {
    var body = '';
    request.on('data', function (data) {
      body += data;
      // Too much POST data, kill the connection!
      if (body.length > 1e6)
        request.connection.destroy();
    });
    request.on('end', function () {
      var post = qs.parse(body);
      // use post['blah'], etc.
    });
  }
}
Suppose I don't care about POST requests, and hence never check whether a request is a POST or attach a 'data' event handler. Is there a risk that someone can block my thread by sending a really large POST request? For example, instead of the above code, what if I just did:
function hearStory(request, response) {
  response.writeHead(200, { "Content-Type": "text/plain" });
  response.write("Cool story bro!");
  response.end();
}
What happens to really large POST requests then? Does the server just ignore the body? Is there any risk to this approach? GET requests, including their headers, must be less than 80 kB, so it seems like a simple way to avoid flooding my server.
Hopefully these kinds of attacks can be detected and averted before they ever reach your server, via a firewall or something else; you shouldn't have to handle DoS attacks in the server itself. However, if a malicious request does reach your server, there needs to be a way to handle it. If you intend to handle POST requests, the code you're referring to will help.
If you just want to avoid POST requests altogether and not listen for them, as the second code snippet demonstrates, you could do something like the following.
function denyPost(request, response) {
  if (request.method == 'POST') {
    console.log('POST denied...'); // this is optional
    request.connection.destroy();  // this kills the connection
  }
}
Of course, this won't work if you plan on handling POST requests somehow. But again, DoS attacks need to be handled before they ever get to your server. If they've gotten there, they've already won.
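For a middle ground between accepting every body and destroying every POST connection, a size guard can also be applied up front using the Content-Length header plus a byte counter; a minimal sketch (the 1e6 limit is simply the same figure used in the quoted answer, and the port is arbitrary):

var http = require('http');

var MAX_BODY = 1e6; // ~1 MB, same cutoff as the quoted snippet

http.createServer(function (request, response) {
  // Reject early when the client declares an oversized body.
  var declared = parseInt(request.headers['content-length'] || '0', 10);
  if (declared > MAX_BODY) {
    response.writeHead(413, { "Content-Type": "text/plain" });
    response.end("Payload too large");
    request.connection.destroy();
    return;
  }

  // Also count actual bytes, since Content-Length can be absent or wrong.
  var received = 0;
  request.on('data', function (chunk) {
    received += chunk.length;
    if (received > MAX_BODY) {
      request.connection.destroy();
    }
  });

  response.writeHead(200, { "Content-Type": "text/plain" });
  response.end("Cool story bro!");
}).listen(8080);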