How to prevent POST spamming in node.js Express using busboy?

I am following the example here: https://www.npmjs.com/package/busboy
I am worried that someone may deliberately try to overload the server. I wonder if there is a convenient way, before the data is uploaded, to prevent spamming by measuring the size of the entire POST body, not just the file(s) uploaded. I tried the following, which apparently didn't work:
if (JSON.stringify(req.body).length > 5 * 1024 * 1024) res.redirect('/');

You cannot rely on Content-Length being set. Even if it were set, a malicious client could send an incorrect Content-Length or use Transfer-Encoding: chunked, in which case there is no way to tell in advance how large the request body is.
Additionally, calling stringify() on req.body for every request could itself easily open the door to a DoS-style attack.
However, busboy does have several options for limiting various aspects of application/x-www-form-urlencoded and multipart/form-data requests (e.g. max file size, max number of files, etc.).
You might also limit the parsing of request bodies to routes where you're expecting request bodies, instead of trying to parse request bodies for all requests.
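Since Content-Length cannot be trusted, one option is to count the bytes that actually arrive and stop once a cap is exceeded. The following is only a minimal sketch of that idea, not busboy's own mechanism: the helper name createByteLimiter and the 5 MB cap are assumptions for illustration, and the wiring in the trailing comment assumes a plain Node.js request handler with req and res in scope.

```javascript
// Hypothetical helper: tracks how many bytes have been seen so far and
// reports whether the running total is still within maxBytes.
function createByteLimiter(maxBytes) {
  let received = 0;
  return function accept(chunk) {
    // Buffer.byteLength handles both Buffers and strings.
    received += Buffer.byteLength(chunk);
    return received <= maxBytes;
  };
}

// Possible wiring inside a request handler (req and res are assumed):
//   const accept = createByteLimiter(5 * 1024 * 1024);
//   req.on('data', (chunk) => {
//     if (!accept(chunk)) {
//       res.writeHead(413, { Connection: 'close' });
//       res.end();
//       req.destroy(); // stop reading the rest of the body
//     }
//   });
```

Counting as the data streams in means the limit applies to chunked uploads as well, and nothing has to be buffered or stringified first.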

Related

How to send http request with body property using fetch in javascript [duplicate]

I'm developing a new RESTful webservice for our application.
When doing a GET on certain entities, clients can request the contents of the entity.
If they want to add some parameters (for example sorting a list) they can add these parameters in the query string.
Alternatively I want people to be able to specify these parameters in the request body.
HTTP/1.1 does not seem to explicitly forbid this. It would allow them to specify more information, and might make it easier to specify complex XML requests.
My questions:
Is this a good idea altogether?
Will HTTP clients have issues with using request bodies within a GET request?
https://www.rfc-editor.org/rfc/rfc2616

Roy Fielding's comment about including a body with a GET request.
Yes. In other words, any HTTP request message is allowed to contain a message body, and thus must parse messages with that in mind. Server semantics for GET, however, are restricted such that a body, if any, has no semantic meaning to the request. The requirements on parsing are separate from the requirements on method semantics.
So, yes, you can send a body with GET, and no, it is never useful to do so.
This is part of the layered design of HTTP/1.1 that will become clear again once the spec is partitioned (work in progress).
....Roy
Yes, you can send a request body with GET but it should not have any meaning. If you give it meaning by parsing it on the server and changing your response based on its contents, then you are ignoring this recommendation in the HTTP/1.1 spec, section 4.3:
...if the request method does not include defined semantics for an entity-body, then the message-body SHOULD be ignored when handling the request.
And the description of the GET method in the HTTP/1.1 spec, section 9.3:
The GET method means retrieve whatever information ([...]) is identified by the Request-URI.
which states that the request-body is not part of the identification of the resource in a GET request, only the request URI.
Update
The RFC2616 referenced as "HTTP/1.1 spec" is now obsolete. In 2014 it was replaced by RFCs 7230-7237. Quote "the message-body SHOULD be ignored when handling the request" has been deleted. It's now just "Request message framing is independent of method semantics, even if the method doesn't define any use for a message body" The 2nd quote "The GET method means retrieve whatever information ... is identified by the Request-URI" was deleted. - From a comment
From the HTTP 1.1 2014 Spec:
A payload within a GET request message has no defined semantics; sending a payload body on a GET request might cause some existing implementations to reject the request.

While you can do that, insofar as it isn't explicitly precluded by the HTTP specification, I would suggest avoiding it simply because people don't expect things to work that way. There are many phases in an HTTP request chain and while they "mostly" conform to the HTTP spec, the only thing you're assured is that they will behave as traditionally used by web browsers. (I'm thinking of things like transparent proxies, accelerators, A/V toolkits, etc.)
This is the spirit behind the Robustness Principle, roughly "be liberal in what you accept, and conservative in what you send": you don't want to push the boundaries of a specification without good reason.
However, if you have a good reason, go for it.

You will likely encounter problems if you ever try to take advantage of caching. Proxies are not going to look in the GET body to see if the parameters have an impact on the response.

Elasticsearch accepts GET requests with a body. It even seems that this is the preferred way: Elasticsearch guide
Some client libraries (like the Ruby driver) can log the equivalent curl command to stdout in development mode, and it uses this syntax extensively.

Neither restclient nor REST console support this but curl does.
The HTTP specification says in section 4.3
A message-body MUST NOT be included in a request if the specification of the request method (section 5.1.1) does not allow sending an entity-body in requests.
Section 5.1.1 redirects us to section 9.x for the various methods. None of them explicitly prohibit the inclusion of a message body. However...
Section 5.2 says
The exact resource identified by an Internet request is determined by examining both the Request-URI and the Host header field.
and Section 9.3 says
The GET method means retrieve whatever information (in the form of an entity) is identified by the Request-URI.
Which together suggest that when processing a GET request, a server is not required to examine anything other than the Request-URI and Host header field.
In summary, the HTTP spec doesn't prevent you from sending a message-body with GET but there is sufficient ambiguity that it wouldn't surprise me if it was not supported by all servers.

GET, with a body!?
Specification-wise you could, but, it's not a good idea to do so injudiciously, as we shall see.
RFC 7231 §4.3.1 states that a body "has no defined semantics", but that's not to say it is forbidden. If you attach a body to the request, what your server/app makes of it is up to you. The RFC goes on to state that GET can be "a programmatic view on various database records". Obviously such a view is often tailored by a large number of input parameters, which are not always convenient or even safe to put in the query component of the request-target.
The good: I like the verbiage. It's clear that one reads/gets a resource without any observable side-effects on the server (the method is "safe"), and the request can be repeated with the same intended effect regardless of the outcome of the first request (the method is "idempotent").
The bad: An early draft of HTTP/1.1 forbade GET to have a body, and - allegedly - some implementations will even up until today drop the body, ignore the body or reject the message. For example, a dumb HTTP cache may construct a cache key out of the request-target only, being oblivious to the presence or content of a body. An even dumber server could be so ignorant that it treats the body as a new request, which effectively is called "request smuggling" (which is the act of sending "a request to one device without the other device being aware of it" - source).
Due to what I believe is primarily a concern with interoperability amongst implementations, work in progress suggests categorizing a GET body as a "SHOULD NOT", "unless [the request] is made directly to an origin server that has previously indicated, in or out of band, that such a request has a purpose and will be adequately supported" (emphasis mine).
The fix: There are a few hacks that can be employed for some of the problems with this approach. For example, body-unaware caches can indirectly become body-aware simply by appending a hash derived from the body to the query component, or caching can be disabled altogether by responding with a cache-control: no-cache header from the server.
Alas, when it comes to the request chain, one is often not in control of, or even aware of, all present and future HTTP intermediaries and how they will deal with a GET body. That's why this approach must be considered generally unreliable.
But POST, is not idempotent!
POST is an alternative. The POST request usually includes a message body (just for the record, body is not a requirement, see RFC 7230 §3.3.2). The very first use case example from RFC 7231 (§4.3.3) is "providing a block of data [...] to a data-handling process". So just like GET with a body, what happens with the body on the back-end side is up to you.
The good: Perhaps a more common method to apply when one wishes to send a request body, for whatever purpose, and so it will likely yield the least amount of noise from your team members (some may still falsely believe that POST must create a resource).
Also, what we often pass parameters to is a search function operating upon constantly evolving data, and a POST response is only cacheable if explicit freshness information is provided in the response.
The bad: POST requests are not defined as idempotent, leading to request retry hesitancy. For example, on page reload, browsers are unwilling to resubmit an HTML form without prompting the user with a nonreadable cryptic message.
The fix: Well, just because POST is not defined to be idempotent doesn't mean it mustn't be. Indeed, RFC 7230 §6.3.1 writes: "a user agent that knows (through design or configuration) that a POST request to a given resource is safe can repeat that request automatically". So, unless your client is an HTML form, this is probably not a real problem.
QUERY is the holy grail
There's a proposal for a new method QUERY which does define semantics for a message body and defines the method as idempotent. See this.
Edit: As a side-note, I stumbled into this StackOverflow question after having discovered a codebase where they solely used PUT requests for server-side search functions. This was their idea for including a body with parameters while also being idempotent. Alas, the problem with PUT is that the request body has very precise semantics. Specifically, the PUT "requests that the state of the target resource be created or replaced with the state [in the body]" (RFC 7231 §4.3.4). Clearly, this excludes PUT as a viable option.

You can either send a GET with a body or send a POST and give up RESTish religiosity (it's not so bad, 5 years ago there was only one member of that faith -- his comments linked above).
Neither is a great decision, but sending a GET body may cause problems for some clients -- and some servers.
Doing a POST might have obstacles with some RESTish frameworks.
Julian Reschke suggested above using a non-standard HTTP method like "SEARCH", which could be an elegant solution, except that it's even less likely to be supported.
It might be most productive to list clients that can and cannot do each of the above.
Clients that cannot send a GET with body (that I know of):
XmlHTTPRequest
Fiddler
Clients that can send a GET with body:
most browsers
Servers & libraries that can retrieve a body from GET:
Apache
PHP
Servers (and proxies) that strip a body from GET:
?

What you're trying to achieve has been done for a long time with a much more common method, and one that doesn't rely on using a payload with GET.
You can simply build your specific search mediatype, or if you want to be more RESTful, use something like OpenSearch, and POST the request to the URI the server instructed, say /search. The server can then generate the search result or build the final URI and redirect using a 303.
This has the advantage of following the traditional PRG method, helps cache intermediaries cache the results, etc.
That said, URIs are encoded anyway for anything that is not ASCII, and so are application/x-www-form-urlencoded and multipart/form-data. I'd recommend using this rather than creating yet another custom json format if your intention is to support ReSTful scenarios.

I put this question to the IETF HTTP WG. The comment from Roy Fielding (author of http/1.1 document in 1998) was that
"... an implementation would be broken to do anything other than to parse and discard that body if received"
RFC 7231 (HTTPbis) states:
"A payload within a GET request message has no defined semantics;"
It seems clear now that the intention was that semantic meaning on GET request bodies is prohibited, which means that the request body can't be used to affect the result.
There are proxies out there that will definitely break your request in various ways if you include a body on GET.
So in summary, don't do it.

From RFC 2616, section 4.3, "Message Body":
A server SHOULD read and forward a message-body on any request; if the
request method does not include defined semantics for an entity-body,
then the message-body SHOULD be ignored when handling the request.
That is, servers should always read any provided request body from the network (check Content-Length or read a chunked body, etc). Also, proxies should forward any such request body they receive. Then, if the RFC defines semantics for the body for the given method, the server can actually use the request body in generating a response. However, if the RFC does not define semantics for the body, then the server should ignore it.
This is in line with the quote from Fielding above.
Section 9.3, "GET", describes the semantics of the GET method, and doesn't mention request bodies. Therefore, a server should ignore any request body it receives on a GET request.

Which server will ignore it? – fijiaaron Aug 30 '12 at 21:27
Google for instance is doing worse than ignoring it, it will consider it an error!
Try it yourself with a simple netcat:
$ netcat www.google.com 80
GET / HTTP/1.1
Host: www.google.com
Content-length: 6
1234
(the 1234 content is followed by CR-LF, so that is a total of 6 bytes)
and you will get:
HTTP/1.1 400 Bad Request
Server: GFE/2.0
(....)
Error 400 (Bad Request)
400. That’s an error.
Your client has issued a malformed or illegal request. That’s all we know.
You do also get 400 Bad Request from Bing, Apple, etc... which are served by AkamaiGhost.
So I wouldn't advise using GET requests with a body entity.

According to XMLHttpRequest, it's not valid. From the standard:
4.5.6 The send() method
client . send([body = null])
Initiates the request. The optional argument provides the request
body. The argument is ignored if request method is GET or HEAD.
Throws an InvalidStateError exception if either state is not
opened or the send() flag is set.
The send(body) method must run these steps:
If state is not opened, throw an InvalidStateError exception.
If the send() flag is set, throw an InvalidStateError exception.
If the request method is GET or HEAD, set body to null.
If body is null, go to the next step.
Although I don't think it should be that way, because a GET request might need a large body.
So, if you rely on XMLHttpRequest of a browser, it's likely it won't work.

If you really want to send a cacheable JSON/XML body to a web application, the only reasonable place to put your data is the query string, encoded with RFC4648: Base 64 Encoding with URL and Filename Safe Alphabet. Of course you could just urlencode the JSON and put it in a URL parameter's value, but Base64 gives a smaller result. Keep in mind that there are URL size restrictions, see What is the maximum length of a URL in different browsers? .
You may think that Base64's padding = character may be bad for URL's param value, however it seems not - see this discussion: http://mail.python.org/pipermail/python-bugs-list/2007-February/037195.html . However you shouldn't put encoded data without param name because encoded string with padding will be interpreted as param key with empty value.
I would use something like ?_b64=<encodeddata>.
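As a rough sketch of that approach in Node.js: the helper names encodeParam and decodeParam and the example URL are assumptions for illustration, while the _b64 parameter name comes from the suggestion above. It relies on the 'base64url' Buffer encoding, which is only available in reasonably recent Node.js versions.

```javascript
// Encode a JSON-serializable value with the URL-safe Base64 alphabet
// from RFC 4648 section 5.
function encodeParam(value) {
  return Buffer.from(JSON.stringify(value), 'utf8').toString('base64url');
}

// Reverse of encodeParam: decode the parameter back into a value.
function decodeParam(encoded) {
  return JSON.parse(Buffer.from(encoded, 'base64url').toString('utf8'));
}

// Hypothetical usage: carry sort options in a single query parameter.
const query = { sortby: 'itemname', order: 'asc' };
const url = 'http://example.com/items?_b64=' + encodeParam(query);
```

Node's 'base64url' output omits the = padding, so the padding concern discussed above does not arise with this particular encoder; the URL length limits still apply.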

I wouldn't advise this, it goes against standard practices, and doesn't offer that much in return. You want to keep the body for content, not options.

You have a list of options which are far better than using a request body with GET.
Let's assume you have categories and items for each category, both identified by an id ("catid" / "itemid" for the sake of this example). You want to sort according to another parameter "sortby" in a specific "order". You want to pass parameters for "sortby" and "order":
You can:
Use query strings, e.g.
example.com/category/{catid}/item/{itemid}?sortby=itemname&order=asc
Use mod_rewrite (or similar) for paths:
example.com/category/{catid}/item/{itemid}/{sortby}/{order}
Use individual HTTP headers you pass with the request
Use a different method, e.g. POST, to retrieve a resource.
All have their downsides, but are far better than using a GET with a body.

What about nonconforming base64 encoded headers? "SOMETHINGAPP-PARAMS:sdfSD45fdg45/aS"
Length restrictions hm. Can't you make your POST handling distinguish between the meanings? If you want simple parameters like sorting, I don't see why this would be a problem. I guess it's certainty you're worried about.

I'm upset that REST as a protocol doesn't support OOP, and the GET method is proof. As a solution, you can serialize your DTO to JSON and then create a query string. On the server side you'll be able to deserialize the query string into the DTO.
Take a look at:
Message-based design in ServiceStack
Building RESTful Message Based Web Services with WCF
A message-based approach can help you work around the GET method restriction. You'll be able to send any DTO as you would with a request body.
The Nelibur web service framework provides functionality which you can use:
var client = new JsonServiceClient(Settings.Default.ServiceAddress);
var request = new GetClientRequest
{
    Id = new Guid("17239b0e-b35b-4d32-95c7-5db43e2bd573")
};
var response = client.Get<GetClientRequest, ClientResponse>(request);
As you can see, the GetClientRequest was encoded to the following query string:
http://localhost/clients/GetWithResponse?type=GetClientRequest&data=%7B%22Id%22:%2217239b0e-b35b-4d32-95c7-5db43e2bd573%22%7D

IMHO you could just send the JSON encoded (ie. encodeURIComponent) in the URL, this way you do not violate the HTTP specs and get your JSON to the server.

For example, it works with Curl, Apache and PHP.
PHP file:
<?php
echo $_SERVER['REQUEST_METHOD'] . PHP_EOL;
echo file_get_contents('php://input') . PHP_EOL;
Console command:
$ curl -X GET -H "Content-Type: application/json" -d '{"the": "body"}' 'http://localhost/test/get.php'
Output:
GET
{"the": "body"}

Even if a popular tool uses this, as cited frequently on this page, I think it is still quite a bad idea, being too exotic, despite not being forbidden by the spec.
Many intermediate infrastructures may just reject such requests.
For example, forget about using some of the available CDNs in front of your web site, like this one:
If a viewer GET request includes a body, CloudFront returns an HTTP status code 403 (Forbidden) to the viewer.
And yes, your client libraries may also not support emitting such requests, as reported in this comment.

If you want to allow a GET request with a body, one way is to support a POST request with the header "X-HTTP-Method-Override: GET". It is described here : https://en.wikipedia.org/wiki/List_of_HTTP_header_fields. This header means that while the method is POST, the request should be treated as if it were a GET. A body is allowed for POST, so you can be sure nobody will drop the payload of your GET requests.
This header is often used to make PATCH or HEAD requests through some proxies that do not recognize those methods and replace them with GET (always fun to debug!).
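On the server side, honouring the header could look roughly like the sketch below. The helper name effectiveMethod is an assumption for illustration, and the headers object is assumed to have lower-cased keys, as Node's req.headers does.

```javascript
// Hypothetical helper: returns the method the server should act on.
// `headers` is assumed to use lower-cased keys, like Node's req.headers.
function effectiveMethod(method, headers) {
  const override = headers['x-http-method-override'];
  // Only honour the override on POST, so that a plain GET cannot be
  // turned into another method by a header alone.
  if (method === 'POST' && typeof override === 'string') {
    return override.toUpperCase();
  }
  return method;
}
```

A router would then dispatch on the returned value instead of the raw request method, so a POST carrying the override reaches the same handler as a GET.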

An idea on an old question:
Add the full content on the body, and a short hash of the body on the querystring, so caching won't be a problem (the hash will change if body content is changed) and you'll be able to send tons of data when needed :)
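A minimal sketch of that idea in Node.js, using the built-in crypto module. The helper name withBodyHash, the bodyhash parameter name, and the 16-character truncation are all assumptions chosen for illustration.

```javascript
const crypto = require('crypto');

// Hypothetical helper: appends a short SHA-256 digest of the body to the
// URL so that URL-keyed caches treat different bodies as different resources.
function withBodyHash(url, body) {
  const hash = crypto
    .createHash('sha256')
    .update(body)
    .digest('hex')
    .slice(0, 16); // shortened digest; a longer one lowers collision risk
  const u = new URL(url);
  u.searchParams.set('bodyhash', hash);
  return u.toString();
}
```

The same body always yields the same URL, so repeated requests can still be cached, while any change to the body changes the URL. Intermediaries that drop or reject GET bodies are not helped by this.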

Create a Requestfactory class
import java.net.URI;
import javax.annotation.PostConstruct;
import org.apache.http.client.methods.HttpEntityEnclosingRequestBase;
import org.apache.http.client.methods.HttpUriRequest;
import org.springframework.http.HttpMethod;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;
@Component
public class RequestFactory {
    private RestTemplate restTemplate = new RestTemplate();
    @PostConstruct
    public void init() {
        this.restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestWithBodyFactory());
    }
    private static final class HttpComponentsClientHttpRequestWithBodyFactory extends HttpComponentsClientHttpRequestFactory {
        @Override
        protected HttpUriRequest createHttpUriRequest(HttpMethod httpMethod, URI uri) {
            if (httpMethod == HttpMethod.GET) {
                return new HttpGetRequestWithEntity(uri);
            }
            return super.createHttpUriRequest(httpMethod, uri);
        }
    }
    private static final class HttpGetRequestWithEntity extends HttpEntityEnclosingRequestBase {
        public HttpGetRequestWithEntity(final URI uri) {
            super.setURI(uri);
        }
        @Override
        public String getMethod() {
            return HttpMethod.GET.name();
        }
    }
    public RestTemplate getRestTemplate() {
        return restTemplate;
    }
}
Then @Autowired it wherever you need it. Here is sample code for a GET request with a RequestBody:
@RestController
@RequestMapping("/v1/API")
public class APIServiceController {
    @Autowired
    private RequestFactory requestFactory;
    @RequestMapping(method = RequestMethod.GET, path = "/getData")
    public ResponseEntity<APIResponse> getLicenses(@RequestBody APIRequest2 APIRequest){
        APIResponse response = new APIResponse();
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        Gson gson = new Gson();
        try {
            StringBuilder createPartUrl = new StringBuilder(PART_URL).append(PART_URL2);
            HttpEntity<String> entity = new HttpEntity<String>(gson.toJson(APIRequest),headers);
            ResponseEntity<APIResponse> storeViewResponse = requestFactory.getRestTemplate().exchange(createPartUrl.toString(), HttpMethod.GET, entity, APIResponse.class); //.getForObject(createLicenseUrl.toString(), APIResponse.class, entity);
            if(storeViewResponse.hasBody()) {
                response = storeViewResponse.getBody();
            }
            return new ResponseEntity<APIResponse>(response, HttpStatus.OK);
        }catch (Exception e) {
            e.printStackTrace();
            return new ResponseEntity<APIResponse>(response, HttpStatus.INTERNAL_SERVER_ERROR);
        }
    }
}

Is it possible for an HTTP `GET` request with `Cache-Control: no-cache` to not hit the server exactly once? (Levering out idempotency of `GET`.)

In theory, one should use the HTTP GET method only for idempotent requests.
But, for some intricate reasons, I cannot use any other method than GET and my requests are not idempotent (they mutate the database). So my idea is to use the Cache-Control: no-cache header to ensure that any GET request actually hits the database. Also, I cannot change the URLs which means I cannot append a random URL argument to bust caches.
Am I safe or shall I implement some kind of mechanism to ensure that the GET request was received exactly once? (The client is the browser and the server is Node.js.)
What about a GET request that gets duplicated by some middle-man resulting in the same GET request being received twice by the server? I believe the spec allows such situation but does this ever happen in "real life"?
I've never seen a middle man, such as Cloudflare or NGINX, preventing or duplicating a GET request with Cache-Control: no-cache.

Let's start by saying what you've already pointed out -- GET requests should be idempotent. That is, they should not modify the resource and therefore should return the same thing every time (barring any other methods being used to modify it in the meantime.)
It's worth pointing out, as restcookbook.com notes, that this doesn't mean nothing can change as a result of the request. Rather, the resource's representation should not change. So for instance, your database might log the request, but shouldn't return a different value in the response.
The main concern you've listed is middleware caching.
The danger isn't that the middleware sends the request to your server more than once (you mentioned 'duplicating' a request), but rather that (a) it sends an old, cached, no-longer-accurate response to whatever is making the request, and (b) the request does not reach the server.
For instance, imagine a response returning a count property that starts at 0 and increments when the GET endpoint is hit. Request #1 will return "1" as the count. Request #2 should now return "2" as the count, but if it's cached, it might still show as 1, and not hit the server to increase the count to 2. That's 2 separate problems you have (caching, and not updating).
So, will a middleware prevent a request from reaching the server and serve a cached copy instead? We don't know. It depends on the middleware. You can absolutely write one right now that does just that. You can also write one that doesn't.
If you don't know what will be consuming your API, then it's not a great option. But whether it's "safe" depends on the specifics.
As you know, it's always best to follow the set of expectations that comes with the grammar of HTTP requests. Deviating from them sets you up for failure in many ways. (For instance, there are different security expectations for requests based on method. A browser may treat a GET request as "simple" from a CORS perspective, while it would never treat a PATCH request as such.)
I would go to great lengths to not break this convention, but if I were forced by circumstances to break this expectation, I would definitely note it in my API's documentation.

One workaround to ensure that your GET request is only called once is to allow caching of responses and use the Vary header. The spec for the Vary header can be found here.
In summary, a Vary header tells any HTTP cache which request header fields to take into account when trying to find the cached object.
For example, you have an endpoint /api/v1/something that accepts GET requests and does the required database updates. Let's say that when successful, this endpoint returns the following response.
HTTP/1.1 200 OK
Content-Length: 3458
Cache-Control: max-age=86400
Vary: X-Unique-ID
Notice the Vary header has a value of X-Unique-ID. This means that if you include the X-Unique-ID header in your request, any HTTP caching layer (be it the browser, CDN, or other middleware) will use the value in this header to determine whether to use a previously cached response or not.
Say you make a first request that includes an X-Unique-ID header with the value id_1, then you make a subsequent request with an X-Unique-ID value of id_2. The caching layer will not use a previously cached response for the second request because the value of the X-Unique-ID header is different.
However, if you make another request that contains the X-Unique-ID value of id_1 again, the caching layer will not make a request to the backend but instead reuse the cached response for the first request assuming that the cache hasn't expired yet.
One thing you have to consider though is this will only work if the caching layer actually respects the specifications for the Vary header.
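To make the lookup behaviour concrete, here is a simplified model of how a Vary-aware cache might derive its key. It is a sketch only: real caches are more involved, the function name cacheKey and the key layout are assumptions, and the request headers are assumed to have lower-cased names.

```javascript
// Hypothetical helper: builds a cache key from the URL plus the values of
// the request headers named in the response's Vary header.
function cacheKey(url, requestHeaders, varyHeader) {
  const names = varyHeader
    .split(',')
    .map((name) => name.trim().toLowerCase())
    .filter(Boolean)
    .sort();
  // A missing header is treated as an empty value.
  const parts = names.map((name) => name + '=' + (requestHeaders[name] || ''));
  return [url].concat(parts).join('|');
}

// Requests with different X-Unique-ID values map to different cache entries.
const first = cacheKey('/api/v1/something', { 'x-unique-id': 'id_1' }, 'X-Unique-ID');
const second = cacheKey('/api/v1/something', { 'x-unique-id': 'id_2' }, 'X-Unique-ID');
```

Under this model, first and second differ, so the second request is forwarded to the backend, while a later request that reuses id_1 produces the same key as first and can be answered from the cache.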

The Hypertext Transfer Protocol (HTTP) is designed to enable communication between clients and servers, and the GET method is used to request data from a specified resource.
With Cache-Control: no-cache, a cache must not reuse a stored response without first revalidating it with the origin server (it is no-store that forbids storing the request or response at all). The request therefore reaches the server each and every time.

This depends a lot on what's sat in the middle and where the retry logic sits, if there is any. Almost all of your problems will be in failure handling and retry handling - not the basic requests.
Let's say, for example that Alice talks to Bob via a proxy. Let's assume for the sake of simplicity that the requests are small and the proxy logic is pure store-and-forward. i.e. most of the time a request either gets through or doesn't but is unlikely to get stalled half-way through. There's no guarantee this is the case and some proxies will stop requests part-way through by design.
Alice -> Proxy GET
Proxy -> Bob GET
Bob -> Proxy 200
Proxy -> Alice 200
So far so good. Now imagine Bob dies before responding to the proxy. Does the proxy retry? If so, we have this:
Alice -> Proxy GET
Proxy -> Bob GET
Bob manipulates database then dies
Proxy -> Bob GET (retry)
Now we have a dupe
Unlikely, but possible.
Now imagine (much more likely) that the proxy (or even more likely, some bit of the network between the proxy and the client) dies. Does the client retry? If so, we have this:
Alice -> Proxy GET
Proxy -> Bob GET
Bob -> Proxy 200
Proxy or network dies
Alice -> Proxy GET (retry)
Proxy -> Bob GET
Is this a dupe or not? Depends on your point of view
Plus, for completeness there's also the degenerate case where the server receives the request zero times.

Create a Batch HTTP API With multipart response

I've created a Batch HTTP API that receives a JSON array containing many different requests to our backend server. The Batch API simply sends all of these requests to a load balancer, waits for all of them to return, and returns a new JSON response to the client.
The client receives a huge JSON array response whose indices match the positions of the requests, so it is easy to know which response belongs to which request.
The motivation for this API was to work around the browser limit of 5 simultaneous connections and to improve performance, as the Batch API has much more direct access to the server (we do not have a reverse proxy or an SSL server between them).
The service is running pretty well, but now that it is seeing more use I have some new requirements. First, the service can use a lot of memory, as it keeps a buffer for each request that is only flushed when all responses are ready (I am using an ordered JSON array). Moreover, since it can take some time for all requests to complete, the client has to wait for everything to be processed before receiving a single byte.
I am planning to change the service to return each response as soon as it is available (solving both issues above), and I would like to share and validate my ideas with you:
I will change the response from a JSON response to a multipart response.
The server will include, for every part, the index of the response
The server will flush each response as soon as it is available
The client XHR will need to understand multipart content type response and be able to process each part as soon as it is available.
I will create a PoC to validate every step, but at this moment I would like to validate the idea and hear some thoughts about it. Here are some doubts I have about the solution:
From what I have read, I am unsure which content type is right for the response. multipart/mixed? multipart/digest?
Can I use an Accept request header to identify whether the client is able to handle the new service implementation? If so, what is the right Accept header for this? My plan is to use the same endpoint but vary the Accept header.
How can I develop an XHR client that is able to process the many parts of a single response as soon as they are available? I found some ideas on the Web but I am not entirely confident in them.

I will change the response from a JSON response to a multipart response.
The server will include, for every part, the index of the response
The server will flush each response as soon as it is available
The client XHR will need to understand multipart content type response and be able to process each part as soon as it is available.
XHR will not support this workflow through a single request from the client. Since XHR relies heavily on HTTP for communication, it follows the HTTP connection rules. The first and most important rule: HTTP connections are always initiated by the client. Another rule: XHR returns the entire content body or fails.
The implication for your workflow is that each part of the multipart response must be requested individually by the client.
From what I read, I am in doubt about which content type is right for the response. multipart/mixed? multipart/digest?
You should be in doubt, as there is no provision in the specification to do this. The responseType attribute is limited to the empty string (default), "arraybuffer", "blob", "document", "json", and "text". It is possible to override the MIME type, but that does not change the response type. Even in that case, the XHR spec is very clear about what it will send back: one of the types listed above, as documented here.
Can I use an Accept request header to identify whether the client is able to handle the new service implementation? If so, what is the right Accept header for this? My plan is to use the same endpoint but vary the Accept header.
Custom HTTP headers are designed to help us tell the server what the client's capabilities are. This is easily done. It doesn't necessarily have to be in the Accept header (as that also is a defined list of MIME types).
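As a rough illustration of the custom-header idea, here is a minimal sketch of a server-side check. The header name X-Response-Mode and the value per-part are hypothetical, chosen only for this example; nothing in the question defines them.

```javascript
// Hypothetical header name. Node's http module lowercases incoming
// header names, so the lookup key is lowercase.
var CAPABILITY_HEADER = 'x-response-mode'

// Returns true when the client has announced (via the hypothetical
// header) that it can handle the new per-part behaviour.
function wantsPerPartResponses(headers) {
  var value = headers[CAPABILITY_HEADER]
  return typeof value === 'string' && value.trim().toLowerCase() === 'per-part'
}
```

On the client, the header could be set with xhr.setRequestHeader('X-Response-Mode', 'per-part') before calling send(). In an Express route, wantsPerPartResponses(req.headers) would then choose between the old and the new behaviour on the same endpoint.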
How can I develop an XHR client that is able to process the many parts of a single response as soon as they are available? I found some ideas on the Web but I am not entirely confident in them.
XHR is processed natively by the client and cannot be overridden for all sorts of security reasons. So this is unlikely to be available as a solution for this reason.
Note: ordinarily one might suggest the use of a custom version of Chromium, but your constraints do not allow for that.

Enabling http compression with express

I'm messing around with the Darksky API and under one of the query parameters it states:
extend=hourly optional
When present, return hour-by-hour data for the next 168 hours, instead
of the next 48. When using this option, we strongly recommend enabling
HTTP compression.
I'm using Express as a node proxy which hits the Darksky api (i.e. localhost:3000/api/forecast/LATITUDE, LONGITUDE).
What does "HTTP compression" mean and how would I go about enabling it?

Here, compression means gzip compression on the Express server. You can use the compression middleware to add gzip compression to your server easily.
You can read more about installing that middleware here:
https://github.com/expressjs/compression
An example implementation looks like this:
var compression = require('compression')
var express = require('express')
var app = express()
// compress all responses
app.use(compression())
// add all routes

To quote from https://darksky.net/dev/docs
The Forecast Data API supports HTTP compression. We heartily recommend using it, as it will make responses much smaller over the wire. To enable it, simply add an Accept-Encoding: gzip header to your request. (Most HTTP client libraries wrap this functionality for you, please consult your library’s documentation for details.)
I'm not familiar with the Dark Sky API but I would imagine it returns a large amount of highly redundant data, which is ideal for compression. HTTP requests have a compression mechanism built in via Accept-Encoding, as mentioned above.
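To make the Accept-Encoding step concrete, here is a minimal sketch using only Node's built-in zlib module. The helper names buildRequestOptions and decodeBody are made up for this example, and the host and path are whatever your proxy already uses; this is not taken from the Dark Sky documentation.

```javascript
var zlib = require('zlib')

// Build request options that ask the upstream API for a gzip-compressed
// response. host and path are supplied by the caller (placeholders here).
function buildRequestOptions(host, path) {
  return {
    host: host,
    path: path,
    headers: { 'Accept-Encoding': 'gzip' }
  }
}

// Decode a fully buffered response body according to the value of the
// response's Content-Encoding header. Anything other than gzip is
// treated as uncompressed text in this sketch.
function decodeBody(buffer, contentEncoding) {
  if (contentEncoding === 'gzip') {
    return zlib.gunzipSync(buffer).toString('utf8')
  }
  return buffer.toString('utf8')
}
```

The options object can be passed to https.get; once the response chunks have been collected with Buffer.concat, decodeBody(body, res.headers['content-encoding']) gives back the JSON text. Many HTTP client libraries do this for you, as the quoted documentation notes.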
In your case that data will be travelling across the wire twice, once from Dark Sky to your server and then again from your server to your end user. You could compress just one of these two transmissions or both. It's up to you, but it's likely you'd want both unless the end user is on the same local network as your server.
There are various SO questions about making compressed requests, such as:
node.js - easy http requests with gzip/deflate compression
The key decision for you is whether you want to decompress and recompress the data in your proxy or just stream it through. If you don't need a decompressed copy of the data in the server then it would be more efficient to skip the extra steps. You'd need to be careful to ensure all the headers are set correctly but if you just pass on the relevant headers that you receive (in both directions) it should be relatively simple to pipe through the response from Dark Sky.

HTML SSE request body

When using the EventSource API in JavaScript, is there any way to send a request body along with the HTTP request initiating the polling?
I need to send a large blob of JSON to the server at the SSE request so that the server can calculate what events to send to the client. It seems daft to do web-sockets when I don't need it or do weird things with cookies or multiple requests.
I worry I'll run into length limits on query strings if I bundle the data into that, which may be likely.
Thanks in advance!

The initial SSE request is a fairly ordinary HTTP GET request, so:
Given that SSE is only supported by modern browsers, the maximum URL length should not be assumed to be the old 255 bytes "for old browsers". Most modern browsers allow longer URLs, with IE providing the lowest cap of ~2k. (granted EventSource is not supported on IE anyway, but there's an XHR polyfill...) However, if by large blob you mean several kilobytes, the URL is not reliable. Proxies could also potentially cause problems.
See:
What is the maximum length of a URL in different browsers?,
Is there any limitation on Url's length in Android's WebView.loadUrl method?,
http://www.benzado.com/blog/post/28/iphone-openurl-limit
You could also store information in one or more cookies, which will be sent along with the GET request. This could include a cookie you set on the request for the page that uses SSE, or a cookie you set in JavaScript (prior to creating your EventSource object). The max size for a cookie is specified as being at least 4096 bytes (which is the whole cookie, so somewhat less for your actual data portion) with at least 20 cookies per hostname supported. Empirical testing appears to bear this out: http://browsercookielimits.x64.me/ Worst case, you could possibly chunk the information across multiple cookies.
Larger than that, and I think you need an initial request that uploads the JSON and sends back an ID that is referenced by the SSE request.
It is technically possible, but (strongly) discouraged, to send a body with a GET request. See HTTP GET with request body. The EventSource constructor only takes a URL and so does not directly support this.
As dandavis pointed out, you can compress your JSON.
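The "upload first, then reference an ID" option above could be sketched roughly as follows. This is only an illustration: createPayloadStore, buildEventSourceUrl, and the id query parameter are names invented for this example, and the store is a plain in-memory Map with no expiry.

```javascript
var crypto = require('crypto')

// In-memory store for uploaded JSON payloads, keyed by a random ID.
// A real service would probably want expiry and a size limit.
function createPayloadStore() {
  var payloads = new Map()
  return {
    // Called by the handler of the initial upload request; returns the new ID.
    save: function (payload) {
      var id = crypto.randomBytes(16).toString('hex')
      payloads.set(id, payload)
      return id
    },
    // Called by the SSE handler; returns the payload and forgets it.
    take: function (id) {
      var payload = payloads.get(id)
      payloads.delete(id)
      return payload
    }
  }
}

// Build the URL the browser would pass to new EventSource(...).
function buildEventSourceUrl(basePath, id) {
  return basePath + '?id=' + encodeURIComponent(id)
}
```

The client would POST the JSON to an upload route, read the returned ID, and then open new EventSource(buildEventSourceUrl('/events', id)). The SSE handler looks the payload up with take(id) before deciding which events to send.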
