Encoding user input to be stored in MongoDB


I'm trying to determine the best practices for storing and displaying user input in MongoDB. Obviously, in SQL databases, all user input needs to be encoded to prevent injection attacks. However, my understanding is that with MongoDB we need to be more worried about XSS attacks, so does user input need to be encoded on the server before being stored in mongo? Or, is it enough to simply encode the string immediately before it is displayed on the client side using a template library like handlebars?
Here's the flow I'm talking about:
1. On the client side, the user updates their name to "<script>alert('hi');</script>".
   Does this need to be escaped to "&lt;script&gt;alert('hi');&lt;/script&gt;" before sending it to the server?
2. The updated string is passed to the server in a JSON document via an ajax request.
3. The server stores the string in mongodb under "user.name".
   Does the server need to escape the string in the same way just to be safe? Would it have to first un-escape the string before fully escaping so as to not double up on the '&'?
4. Later, user info is requested by the client, and the name string is sent in a JSON ajax response.
5. Immediately before display, the user name is encoded using something like _.escape(name).
Would this flow display the correct information and be safe from XSS attacks? What about Unicode characters such as Chinese characters?
This also could change how text search would need to be done, as the search term may need to be encoded before starting the search if all user text is encoded.
Thanks a lot!

Does this need to be escaped to "&lt;script&gt;alert('hi');&lt;/script&gt;" before sending it to the server?
No, it has to be escaped like that just before it ends up in an HTML page - step (5) above.
The right type of escaping has to be applied when text is injected into a new surrounding context. That means you HTML-encode data at the moment you include it in an HTML page. Ideally you are using a modern templating system that will do that escaping for you automatically.
(Similarly, if you include data in a JavaScript string literal in a <script> block, you have to JS-encode it; if you include data in a stylesheet rule you have to CSS-encode it, and so on. If we were using SQL queries with data injected into their strings then we would need to do SQL-escaping, but luckily Mongo queries are typically done with JavaScript objects rather than a string language, so there is no escaping to worry about.)
The database is not an HTML context so HTML-encoding input data on the way to the database is not the right thing to do.
(There are also other sources of XSS than injections, most commonly unsafe URL schemes.)
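As a rough sketch of "store raw, escape on output": the helper below is a hand-rolled stand-in for what a templating system (or something like _.escape) does for you. The names escapeHtml and storedName are only illustrative and are not from the question.

```javascript
// Minimal HTML-escaping helper (illustrative; a templating engine
// would normally do this automatically).
function escapeHtml(text) {
  const replacements = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return String(text).replace(/[&<>"']/g, (ch) => replacements[ch]);
}

// The raw name is what gets stored in the database and sent over JSON.
const storedName = "<script>alert('hi');</script>";

// It is escaped only at the moment it is placed into HTML.
const html = "<p>Hello, " + escapeHtml(storedName) + "</p>";
console.log(html);
```

Non-ASCII text such as Chinese characters passes through this helper unchanged, since only the five HTML-special characters are replaced.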

The short answer is yes, you should still encode all user input.
Whenever you do string concatenation, you need to escape the data correctly. MongoDB supports converting JavaScript queries to its native query language expressions in BSON. When doing this there are two contexts to be aware of:
Inside a Javascript string
Everywhere else
If you are concatenating user input outside a string, you really need to be careful. It's really hard to get the escaping right unless the datatype of the variable is an integer or similar where the possible values are known and limited.
The best practice would be to avoid string concatenation whenever possible. You can read more about how MongoDB addresses SQL-Injection here.
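A minimal sketch of what avoiding concatenation can look like: the user value is passed as data inside a query object instead of being spliced into a JavaScript query string. buildNameQuery is a hypothetical helper, and the type check is an extra assumption on my part (it guards against a JSON body supplying an object such as {"$ne": null} where a string is expected). Passing the result to a driver's find call is not shown.

```javascript
// Hypothetical helper: builds a query object for a "name" lookup.
// The value is never concatenated into a query string.
function buildNameQuery(name) {
  // Assumption: names are always plain strings. Rejecting anything else
  // stops objects like { $ne: null } from being treated as operators.
  if (typeof name !== "string") {
    throw new TypeError("name must be a string");
  }
  return { name: name };
}

// Quote characters in the value are harmless because it stays data.
console.log(buildNameQuery("x' || '1'=='1"));
```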

Related

How to efficiently handle the line break exploit when implementing server sent events?

When implementing Server Sent Events on your application server, you can terminate a message and have it sent by ending it with two line breaks: \n\n, as demonstrated on this documentation page.
So, what if you're receiving user input and forwarding it to all interested parties (as is typical in a chat application)? Could a malicious user not insert two line breaks in their payload to terminate the message early? Even more, could they not then set special fields such as the id and retry fields, now that they have access to the first characters of a line?
It seems that the only alternative is to instead scan their entire payload, and then replace instances of \n with something like \ndata:, such that their entire message payload has to maintain its position in the data tag.
However, is this not very inefficient? Having to scan the entire message payload for each message and then potentially do replacements involves not only scanning each entire payload, but also reallocating in the case of maleficence.
Or is there an alternative? I'm currently trying to decide between WebSockets and SSE, as they are quite similar, and this issue is making me lean more towards WebSockets, because it feels as if they would be more efficient if they are able to avoid this potential vulnerability.
Edit: To clarify, I'm mostly ignorant as to whether or not there is a way around having to scan each message in its entirety for \n\n. And if not, does WebSockets have the same issue where you need to scan each message in its entirety? Because if it does, then no matter. But if that's not the case, then it seems to be a point in favor of using websockets over SSE.

It shouldn't be necessary to scan the payload if you're encoding the user data correctly. With JSON it is safe to use the "data" field in server-sent events, because JSON encoding escapes newlines and control characters by default. As the RFC says:
The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with
quotation marks. All Unicode characters may be placed within the
quotation marks, except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).
https://www.rfc-editor.org/rfc/rfc7159#page-8
The important thing is that nobody sneaks in a newline character, but this isn't new to server-sent events: HTTP headers are separated by a single newline and can be tampered with too (if not correctly encoded); see https://www.owasp.org/index.php/HTTP_Response_Splitting
Here's an example of a server-sent events application with JSON encoding:
https://repl.it/#BlackEspresso/PointedWelloffCircles
You shouldn't be able to tamper with the data field even when newline characters are allowed.
Encoding shouldn't stop you from using server-sent events, but there are major differences between WebSockets and SSE. For a comparison see this answer: https://stackoverflow.com/a/5326159/1749420
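To illustrate the JSON point, here is a small sketch (not taken from the linked repl). formatSseMessage is a hypothetical helper name. Because JSON.stringify escapes line feeds and other control characters inside strings, the only real line breaks in the frame are the terminating pair.

```javascript
// Hypothetical helper: wraps a payload in a single SSE "data" field.
function formatSseMessage(payload) {
  // JSON.stringify turns a newline inside a string into the two
  // characters backslash + n, so the payload stays on one line.
  return "data: " + JSON.stringify(payload) + "\n\n";
}

// A chat message that tries to end the event early and set other fields.
const frame = formatSseMessage({
  user: "mallory",
  text: "hi\n\nid: 999\nretry: 1",
});

console.log(frame);
```

On the receiving side, JSON.parse(event.data) gives back the original text, line breaks included.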

Unless I'm missing something obvious, sanitizing input is a common thing in web development.
Since the source that you shared explicitly mentioned a PHP example, I just did some research and lookie here:
https://www.php.net/manual/en/filter.filters.sanitize.php
FILTER_SANITIZE_SPECIAL_CHARS
HTML-escape '"<>& and characters with ASCII value less than 32,
optionally strip or encode other special characters.
and:
'\n' = 10 = 0x0A = line feed
So I'm not sure I understand why you would assume that converting certain input to character entities would necessarily be a bad thing.
Preventing users from abusing the system by submitting unwanted input is what sanitization is for.

XSS preventing for parameters passed to JS code file

I have the following flow: A URL with query parameters, that runs some logic on the server side, and then generates using a template engine a stub HTML page with
A javascript file included, that does the main logic.
<script> tag that includes a JS object, that has parameters to this JS code, partially taken from the query parameters before.
Now I want to sanitize the parameters I receive, to prevent XSS injection. The issue is that one of the parameters is a token, so I don't want to be too strict with the validation (simply disallowing all possible XSS characters sounds too strict), yet most of the libraries I've found deal with pure HTML, not JS code (within a <script> tag). I also feel a bit uneasy when I read all the regex solutions, because I'm used to trusting open source libraries when dealing with security (ones that have unit tests, not a bunch of regexes).
Any advice on libraries & possible approach? We run in JVM environment.

The easiest, simplest, and therefore more secure approach is to use data attributes to represent the dynamic, user supplied values.
This way you only need to worry about HTML encoding, and none of the complex hex entity encoding (\x00) that OWASP recommends.
For example, you could have:
<body data-token="#param.token" />
Where #param.token will output an HTML encoded version of the query string parameter. e.g. page?token=xyz" would output
<body data-token="xyz&quot;" />
This will mitigate your XSS vulnerability concern.
Then you can use something like jQuery to easily retrieve the data attribute values in your JavaScript:
var token = $("body").data("token");
Simple and secure.

Imagining you want to assign your parameter as a string, as such:
{
...
x: '[PARAMETER]'
}
You want to make sure that [PARAMETER] does not break out of the quoted string.
In this case what you need to escape is the ' character and the closing </script> tag. Note: take into consideration "escape-the-escape" attacks, where the attacker sends the string \', which is naively escaped as \\', which turns back into an unescaped ' (and you are back where you started).
It's generally simply safer, as OWASP notes, to
escape all characters less than 256 with the \xHH format
I invite you to read the OWASP page on XSS attacks, and in particular https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet#RULE_.233_-_JavaScript_Escape_Before_Inserting_Untrusted_Data_into_JavaScript_Data_Values
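A minimal sketch of that \xHH rule, written in JavaScript for illustration (on the JVM you would want an established encoder library rather than this). escapeForJsString is a hypothetical name. Alphanumerics are left alone, and every other character below 256 is hex-escaped, which covers quotes, backslashes and the characters of </script>.

```javascript
// Hypothetical helper: escapes a value for use inside a quoted
// JavaScript string literal, following the \xHH rule quoted above.
function escapeForJsString(value) {
  const text = String(value);
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const ch = text[i];
    if (code < 256 && !/[A-Za-z0-9]/.test(ch)) {
      // Non-alphanumeric character below 256: emit \xHH.
      out += "\\x" + code.toString(16).toUpperCase().padStart(2, "0");
    } else {
      // Alphanumerics and characters >= 256 are passed through as-is.
      out += ch;
    }
  }
  return out;
}

console.log(escapeForJsString("\\'</script>"));
```

Since the backslash itself is escaped too, the "escape-the-escape" input \' cannot reintroduce a bare quote.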

When outputting JSON content via Javascript, should I HTML escape on the server or client side?

I have an application that consists of a server-side REST API written in PHP, and some client-side Javascript that consumes this API and uses the JSON it produces to render a page. So, a pretty typical setup.
The data provided by the REST API is "untrusted", in the sense that it is fetching user-provided content from a database. So, for example, it might fetch something like:
{
"message": "<script>alert(\"Gotcha!\")</script>"
}
Obviously, if my client-side code were to render this directly into the page's DOM, I've created an XSS vulnerability. So, this content needs to be HTML-escaped first.
The question is, when outputting untrusted content, should I escape the content on the server side, or the client side? I.e., should my API return the raw content, and then make it the client Javascript code's responsibility to escape the special characters, or should my API return "safe" content:
{
"message": "&lt;script&gt;alert(&#039;Gotcha!&#039;);&lt;\/script&gt;"
}
that has been already escaped?
On one hand, it seems to me that the client should not have to worry about unsafe data from my server. On the other hand, one could argue that output should always be escaped at the last minute possible, when we know exactly how the data is to be consumed.
Which approach is correct?
Note: There are plenty of questions about handling input and yes, I am aware that client-side code can always be manipulated. This question is about outputting data from my server which may not be trustable.
Update: I looked into what other people are doing, and it does seem that some REST APIs tend to send "unsafe" JSON. Gitter's API actually sends both, which is an interesting idea:
[
{
"id":"560ab5d0081f3a9c044d709e",
"text":"testing the API: <script>alert('hey')</script>",
"html":"testing the API: &lt;script&gt;alert(&#39;hey&#39;)&lt;/script&gt;",
"sent":"2015-09-29T16:01:19.999Z",
"fromUser":{
...
},"unread":false,
"readBy":0,
"urls":[],
"mentions":[],
"issues":[],
"meta":[],
"v":1
}
]
Notice that they send the raw content in the text key, and then the HTML-escaped version in the html key. Not a bad idea, IMO.
I have accepted an answer, but I don't believe this is a cut-and-dry problem. I would like to encourage further discussion on this topic.

Escape on the client side only.
The reason to escape on the client side is security: the server's output is the client's input, and so the client should not trust it. If you assume that the input is already escaped, then you potentially open yourself to client attacks via, for example, a malicious reverse-proxy. This is not so different from why you should always validate input on the server side, even if you also include client-side validation.
The reason not to escape on the server side is separation of concerns: the server should not assume that the client intends to render the data as HTML. The server's output should be as media-neutral as possible (given the constraints of JSON and the data structure, of course), so that the client can most easily transform it into whatever format is needed.

For escaping on output:
I suggest reading this XSS Filter Evasion Cheat Sheet.
To protect users properly, you should not only escape, but also filter the data with an appropriate anti-XSS library before escaping, such as htmLawed, HTML Purifier, or any from this thread.
IMHO sanitizing should be done on user-entered data whenever you are going to show it back in a web project.
should I escape the content on the server side, or the client side? I.e., should my API return the raw content, and then make it the client Javascript code's responsibility to escape the special characters, or should my API return "safe" content:
It's better to return already escaped, and xss purified content, so:
Take the raw data and purify it from XSS on the server
Escape it
Return to JavaScript
And also, you should consider one important thing: the load on your site and its read/write balance. For example, if your client enters data once and you are going to show this data to 1M users, what do you prefer: running the protection logic once before the write (protect on input), or a million times, once on each read (protect on output)?
If you are going to show something like 1K posts on a page and escape each on the client, how well will it work on the client's mobile phone? This last point will help you choose where to protect data: on the client or on the server.

This answer is more focused on arguing whether to do client-side escaping vs server-side, since OP seems aware of the argument against escaping on input vs output.
Why not escape client-side?
I would argue that escaping at the javascript level is not a good idea. Just an issue off the top of my head would be if there was an error in the sanitizing script, it would not run, and then the dangerous script would be allowed to run. So you have introduced a vector where an attacker can try to craft input to break the JS sanitizer, so that their plain script is allowed to run. I also do not know of any built-in AntiXSS libraries that run in JS. I am sure someone has made one, or could make one, but there are established server-side examples that are a little more trust-worthy. It is also worth mentioning that writing a sanitizer in JS that works for all browsers is not a trivial task.
OK, what if you escape on both?
Escaping server-side and client-side is just kind of confusing to me, and shouldn't provide any additional security. You mentioned the difficulties with double-escaping, and I have experienced that pain before.
Why is server-side good enough?
Escaping server-side should be sufficient. Your point about doing it as late as possible makes some sense, but I think the drawbacks of escaping client-side are outweighed by whatever tiny benefit you may get by doing it. Where is the threat? If an attacker exists between your site and the client, then the client is already compromised since they can just send a blank html file with their script if they want. You need to do your best to send something safe, not just send the tools to deal with your dangerous data.

TLDR; If your API is to convey formatting information, it should output HTML encoded strings. Caveat: Any consumer will need to trust your API not to output malicious code. A Content Security Policy can help with this too.
If your API is to output only plain text, then HTML encode on the client-side (as < in the plain text also means < in any output).
Not too long, not done reading:
If you own both the API and the web application, either way is acceptable. As long as you are not outputting JSON to HTML pages without hex entity encoding like this:
<%
payload = "[{ foo: '" + foo + "'}]"
%>
<script><%= payload %></script>
then it doesn't matter whether the code on your server changes & to &amp; or the code in the browser changes & to &amp;.
Let's take the example from your question:
[
{
"id":"560ab5d0081f3a9c044d709e",
"text":"testing the API: <script>alert('hey')</script>",
"html":"testing the API: &lt;script&gt;alert(&#39;hey&#39;)&lt;/script&gt;",
"sent":"2015-09-29T16:01:19.999Z",
If the above is returned from api.example.com and you call it from www.example.com, as you control both sides you can decide whether you want to take the plain text, "text", or the formatted text, "html".
It is important to remember though that any variables inserted into html have been HTML encoded server-side here. And also assume that correct JSON encoding has been carried out which stops any quote characters from breaking, or changing the context of the JSON (this is not shown in the above for simplicity).
text would be inserted into the document using Node.textContent and html as Element.innerHTML. Using Node.textContent will cause the browser to ignore any HTML formatting and script that may be present because characters like < are literally taken to be output as < on the page.
Note your example shows user content being input as script. i.e. a user has typed <script>alert('hey')</script> into your application, it is not API generated. If your API actually wanted to output tags as part of its function, then it'd have to put them in the JSON:
"html":"<u>Underlined</u>"
And then your text would have to only output the text without formatting:
"text":"Underlined"
Therefore, your API while sending information to your web application consumer is no longer transmitting rich text, only plain text.
If, however, a third party was consuming your API, then they may wish to get the data from your API as plain text because then they can set Node.textContent (or HTML encode it) on the client-side themselves, knowing that it is safe. If you return HTML then your consumer needs to trust you that your HTML does not contain any malicious script.
So if the above content is from api.example.com, but your consumer is a third party site, say, www.example.edu, then they may feel more comfortable taking in text rather than HTML. Your output may need to be more granularly defined in this case, so rather than outputting
"text":"Thank you Alice for signing up."
You would output
[{ "name": "alice",
"messageType": "thank_you" }]
Or similar so you are not defining the layout in your JSON any longer, you are just conveying the information for the client-side to interpret and format using their own style. To further clarify what I mean, if all your consumer got was
"text":"Thank you Alice for signing up."
and they wanted to show names in bold, it would be very tricky for them to accomplish this without complex parsing. However, with defining API outputs on a granular level, the consumer can take the relevant pieces of output like variables, and then apply their own HTML formatting, without having to trust your API to only output bold tags (<b>) and not to output malicious JavaScript (either from the user or from you, if you were indeed malicious, or if your API had been compromised).
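As a sketch of how a consumer might use such granular output: the client picks its own markup and HTML-encodes the variable parts itself. renderMessage and escapeHtml are hypothetical names, and the only message type handled is the "thank_you" example from above.

```javascript
// Minimal HTML-escaping helper (illustrative).
function escapeHtml(text) {
  const replacements = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return String(text).replace(/[&<>"']/g, (ch) => replacements[ch]);
}

// Hypothetical consumer-side renderer: the layout lives here, not in the API.
function renderMessage(message) {
  if (message.messageType === "thank_you") {
    // The consumer decides that names are bold; the name is escaped
    // because it is untrusted plain text.
    return "Thank you <b>" + escapeHtml(message.name) + "</b> for signing up.";
  }
  // Unknown message types render nothing in this sketch.
  return "";
}

console.log(renderMessage({ name: "alice", messageType: "thank_you" }));
```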

What is best practice for sending html special characters to server and back?

What is the best practice for sending possible html/javascript/code to server and then receiving it?
I have a chat where users should be able to send code, i.e. html. Should I send this html and other text as plain text, or should I change the <> and some other characters into html entities before saving it to the database?
I was thinking of sending and saving plain text and then converting characters to html entities using JavaScript regex when displaying it to a user.
What is the best practice? I also have to use this data in angular.js where plain text is easier to handle.

You should always submit and store the original input. Storing the input already prepared for output on a particular device or piece of software (i.e., an HTML interpreter in your case) makes it difficult to use the data for other purposes. This might be, for example, a chat app for mobile devices which you might add later.

How to avoid "Cross-Site Script Attacks"

How do you avoid cross-site script attacks?
Cross-site script attacks (or cross-site scripting) occur when, for example, you have a guestbook on your homepage and a client posts some JavaScript code which, e.g., redirects you to another website or sends your cookies in an email to a malicious user. It could also do a lot of other things that can prove really harmful to you and the people visiting your page.
I'm sure it can be done, e.g., in PHP by validating forms, but I'm not experienced enough to, e.g., ban JavaScript or other things which can harm you.
I hope you understand my question and that you are able to help me.

I'm sure it can be done, e.g., in PHP by validating forms
Not really. The input stage is entirely the wrong place to be addressing XSS issues.
If the user types, say <script>alert(document.cookie)</script> into an input, there is nothing wrong with that in itself. I just did it in this message, and if StackOverflow didn't allow it we'd have great difficulty talking about JavaScript on the site! In most cases you want to allow any input(*), so that users can use a < character to literally mean a less-than sign.
The thing is, when you write some text into an HTML page, you must escape it correctly for the context it's going into. For PHP, that means using htmlspecialchars() at the output stage:
<p> Hello, <?php echo htmlspecialchars($name); ?>! </p>
[PHP hint: you can define yourself a function with a shorter name to do echo htmlspecialchars, since this is quite a lot of typing to do every time you want to put a variable into some HTML.]
This is necessary regardless of where the text comes from, whether it's from a user-submitted form or not. Whilst user-submitted data is the most dangerous place to forget your HTML-encoding, the point is really that you're taking a string in one format (plain text) and inserting it into a context in another format (HTML). Any time you throw text into a different context, you're going to need an encoding/escaping scheme appropriate to that context.
For example if you insert text into a JavaScript string literal, you would have to escape the quote character, the backslash and newlines. If you insert text into a query component in a URL, you will need to convert most non-alphanumerics into %xx sequences. Every context has its own rules; you have to know which is the right function for each context in your chosen language/framework. You cannot solve these problems by mangling form submissions at the input stage—though many naïve PHP programmers try, which is why so many apps mess up your input in corner cases and still aren't secure.
(*: well, almost any. There's a reasonable argument for filtering out the ASCII control characters from submitted text. It's very unlikely that allowing them would do any good.
Plus of course you will have application-specific validations that you'll want to do, like making sure an e-mail field looks like an e-mail address or that numbers really are numeric. But this is not something that can be blanket-applied to all input to get you out of trouble.)
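A short JavaScript sketch of the "every context has its own rules" point, using two built-in functions. The variable names are only illustrative.

```javascript
const userText = 'He said "hi" & left?';

// URL query component context: percent-encode with encodeURIComponent.
const url = "/search?q=" + encodeURIComponent(userText);

// JavaScript/JSON string literal context: JSON.stringify adds the
// surrounding quotes and escapes quotes, backslashes and newlines.
// (Inside an HTML <script> block, "</script>" would still need handling.)
const jsLiteral = JSON.stringify(userText);

console.log(url);
console.log(jsLiteral);
```

The same input comes out differently in each case, which is why no single transformation applied at the input stage can cover every output context.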

Cross-site scripting attacks (XSS) happen when a server accepts input from the client and then blindly writes that input back to the page. Most of the protection from these attacks involves escaping the output, so that any injected JavaScript is displayed as plain text instead of being executed.
One thing to keep in mind is that it is not only data coming directly from the client that may contain an attack. A Stored XSS attack involves writing malicious JavaScript to a database, whose contents are then queried by the web application. If the database can be written separately from the client, the application may not be able to be sure that the data had been escaped properly. For this reason, the web application should treat ALL data that it writes to the client as if it may contain an attack.
See this link for a thorough resource on how to protect yourself: http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet
