Sanitizing input that will later appear in HTML

I've got a <textarea> whose value is sent off to the server and stored in a database. This value is then later rendered on different pages in HTML.
What do I need to do to sanitize this? Just remove the HTML tags? (It's already SQL-injection safe because I'm using a stored procedure and parameters.)
Does anyone have a sanitize routine?

Do not sanitize input. Instead, encode it when you output it. This is easy to enforce with the .NET 4 auto-encoding syntax (<%: "" %>) or by code review.
Data should be stored in its native format. Human-readable text has as its native format just text, not some encoded version of it. You cannot easily manipulate encoded text (say doing highlighting of words or replaces).
Not encoding text in the database even saves a little storage space.
Sanitizing input is hard anyway. It is very hard to do more than just encoding everything. Blacklisting HTML tags is a certain way to forget something so don't do it.
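As a minimal sketch of "store raw, encode on output" in JavaScript: the helper name escapeHtml and the sample comment are assumptions made up for illustration, and most server-side frameworks and template engines already ship an equivalent.

```javascript
// Hypothetical helper: HTML-encode text at the moment it is written into markup.
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

// The stored value stays exactly as the user typed it.
const storedComment = "<b>5 > 3</b> & that's it";

// Encoding happens only here, where the value enters an HTML context.
const html = "<p>" + escapeHtml(storedComment) + "</p>";
console.log(html);
// <p>&lt;b&gt;5 &gt; 3&lt;/b&gt; &amp; that&#39;s it</p>
```

Because the database holds the plain text, searching, highlighting, or exporting it to a non-HTML format needs no decoding step.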

Either remove the tags completely, or replace any special characters such as < and > with their HTML entities (&lt; and &gt;). Whatever server-side language you're using probably already has a function to do this. PHP's htmlspecialchars or strip_tags will do the trick, for example.

Related

How to safely display user's server side data in a form?

I'm using mustache to escape user data before displaying it in html via javascript.
However, when I use it to display their data in form input fields (pre populating the form after retrieving their user data from the server via AJAX), it displays entities instead.
eg
Shaun's place
displays as
Shaun&#39;s place
I'd use the three bracket trick in Mustache, but doesn't that mean the data won't be escaped, therefore making the page vulnerable?
the data stored in MySQL is user profiles.
when I get it back from the server, I run the data through the following mustache code before saving it in a local object - the idea being I don't have to keep running it through multiple lots of mustache each time I want to display it in various places on the site. So here's an example of one of the keys, location, being processed:
var t="{{x}}",o={x:serverObjectReturn.location},x=Mustache.render(t, o);
localUserDataObject.location=x;
not 100% sure what you mean by content-type - I'm using AJAX calls, and sending the data back as JSON. The AJAX function dataType is "json". Let me know if you needed something else and I'll edit
I update the form input field so they have their profile data loaded ready for editing like so:
$("#location").val(userData.location);
That's when I get:
Shaun&#39;s place
...whereas:
$("#testerDivLocation").html(userData.location);
shows
Shaun's place
Thank you.

Here is the documentation on the subject:
All variables are HTML escaped by default. If you want to return
unescaped HTML, use the triple mustache: {{{name}}}.
So you're right that it may open you up to vulnerabilities (XSS in particular). The triple mustache is useful for data that you already know is safe to render in HTML, perhaps because you've escaped and/or sanitized it elsewhere.
Since input field values are a little different because they are more raw than plain HTML, you may need to do some custom escaping either on the server-side, or in javascript just prior to the render call. This means figuring out the set of all possible characters that put an input field at risk (quotes are obvious, but what about angle brackets <, >?)
An alternative is to leave the form field values blank in HTML-land and use JavaScript to set their values through the DOM. That feels much safer to me since you are no longer trying to embed the raw data into HTML code.
var t="<input type='text' name='firstname' value=''>";
var x=Mustache.render(t);
// x is an HTML string; wrap it so the new element itself receives the value
var $input=$(x);
$input.val(serverObjectReturn.firstname);
(Assumes jQuery is loaded for the $ call).

What is best practice for sending html special characters to server and back?

What is the best practice for sending possible html/javascript/code to server and then receiving it?
I have a chat where users should be able to send code, e.g. HTML. Should I send this HTML and other text as plain text, or should I change the <> and some other characters into HTML entities before saving it to the database?
I was thinking of sending and saving plain text and then converting characters to HTML entities using a JavaScript regex when displaying it to a user.
What is the best practice? I also have to use this data in angular.js where plain text is easier to handle.

You should always submit and store the original input. Storing the input already prepared for output on a particular device or piece of software (i.e., an HTML interpreter in your case) makes it difficult to use the data for other purposes, for example a chat app for mobile devices that you might add later.

Encoding user input to be stored in MongoDB

I'm trying to determine the best practices for storing and displaying user input in MongoDB. Obviously, in SQL databases, all user input needs to be encoded to prevent injection attacks. However, my understanding is that with MongoDB we need to be more worried about XSS attacks, so does user input need to be encoded on the server before being stored in mongo? Or, is it enough to simply encode the string immediately before it is displayed on the client side using a template library like handlebars?
Here's the flow I'm talking about:
1. On the client side, user updates their name to "<script>alert('hi');</script>".
   Does this need to be escaped to "&lt;script&gt;alert(&#39;hi&#39;);&lt;/script&gt;" before sending it to the server?
2. The updated string is passed to the server in a JSON document via an ajax request.
3. The server stores the string in mongodb under "user.name".
   Does the server need to escape the string in the same way just to be safe? Would it have to first un-escape the string before fully escaping so as to not double up on the '&'?
4. Later, user info is requested by client, and the name string is sent in JSON ajax response.
5. Immediately before display, user name is encoded using something like _.escape(name).
Would this flow display the correct information and be safe from XSS attacks? What about unicode characters like Chinese characters?
This also could change how text search would need to be done, as the search term may need to be encoded before starting the search if all user text is encoded.
Thanks a lot!

Does this need to be escaped to "&lt;script&gt;alert(&#39;hi&#39;);&lt;/script&gt;" before sending it to the server?
No, it has to be escaped like that just before it ends up in an HTML page - step (5) above.
The right type of escaping has to be applied when text is injected into a new surrounding context. That means you HTML-encode data at the moment you include it in an HTML page. Ideally you are using a modern templating system that will do that escaping for you automatically.
(Similarly, if you include data in a JavaScript string literal in a <script> block, you have to JS-encode it; if you include data in a stylesheet rule you have to CSS-encode it, and so on. If we were using SQL queries with data injected into their strings then we would need to do SQL-escaping, but luckily Mongo queries are typically done with JavaScript objects rather than a string language, so there is no escaping to worry about.)
The database is not an HTML context so HTML-encoding input data on the way to the database is not the right thing to do.
(There are also other sources of XSS than injections, most commonly unsafe URL schemes.)
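To make "the right escaping for each context" concrete, here is a rough sketch with one encoder per context. All four function names are hypothetical, and the link check is a deliberately simple allowlist rather than a complete URL validator.

```javascript
// HTML text or attribute value.
function encodeForHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

// JavaScript string literal inside an inline <script> block.
// JSON.stringify yields a quoted literal; "<" is escaped as well so that
// "</script>" inside the data cannot end the block early.
function encodeForScript(text) {
  return JSON.stringify(String(text)).replace(/</g, "\\u003c");
}

// Value of a URL query parameter.
function encodeForUrlParam(text) {
  return encodeURIComponent(String(text));
}

// Accept only http(s) links, so schemes such as javascript: are rejected.
function isSafeLink(url) {
  return /^https?:\/\//i.test(String(url).trim());
}

const name = "<script>alert('hi');</script>";
console.log(encodeForHtml(name));
// &lt;script&gt;alert(&#39;hi&#39;);&lt;/script&gt;
console.log(encodeForUrlParam("a&b=c"));
// a%26b%3Dc
console.log(isSafeLink("javascript:alert(1)"));
// false
```

The same stored string goes through a different encoder depending on where it is about to be written, which is why none of them belongs on the way into the database.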

The short answer is yes, you should still encode all user input.
Whenever you do string concatenation, you need to escape the data correctly. MongoDB supports converting Javascript queries to its native query language expression in BSON. When doing this there are two contexts to be aware of:
Inside a Javascript string
Everywhere else
If you are concatenating user input outside a string, you really need to be careful. It's really hard to get the escaping right unless the datatype of the variable is an integer or similar where the possible values are known and limited.
The best practice would be to avoid string concatenation whenever possible. You can read more about how MongoDB addresses SQL-Injection here.
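A small sketch of what avoiding concatenation can look like: the user's value is passed as data inside a filter object instead of being spliced into a JavaScript expression string. The helper name buildNameFilter and the field name are assumptions for illustration; the type check is an extra precaution I'd add, since a parsed JSON body can carry an object where a string was expected.

```javascript
// Hypothetical helper: build the filter for a lookup by user name.
function buildNameFilter(name) {
  // A parsed JSON request body may contain an object such as { "$ne": null };
  // accepting only strings keeps query operators out of the filter.
  if (typeof name !== "string") {
    throw new TypeError("name must be a string");
  }
  // The value is data, never code, so quotes in it need no escaping.
  return { name: name };
}

// Risky alternative, shown only for contrast (not executed here):
//   { $where: "this.name == '" + name + "'" }

const filter = buildNameFilter("x' || '1'=='1");
console.log(filter);
// With the Node.js driver this would be passed as collection.findOne(filter).
```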

What is the best way to store a field that supports markdown in my database when I need to render both HTML and "simple text" views?

I have a database and I have a website front end. I have a field in my front end that is text now but I want it to support markdown. I am trying to figure out the right way to store it in my database because I have various views that need to be supported (PDF reports, web pages, excel files, etc).
My concern is that since some of those views don't support HTML, I don't just want to have an HTML version of this field.
Should I store 2 copies (one text only and one HTML?), or should I store HTML and on the fly try to remove the HTML tags when I am rendering out to Excel for example?
I need to figure out correct format (or formats) to store in the database to be able to render both:
HTML, and
Regular text (with no markdown or HTML syntax)
Any suggestions would be appreciated as I don't want to go down the wrong path. My point is that I don't want to show any HTML tags or markdown syntax in my Excel output.

Decide like this:
Store the original data (text with markdown).
Generate the derived data (HTML and plaintext) on the fly.
Measure the performance:
If it's acceptable, you're done, woohoo!
If not, cache the derived data.
Caching can be done in many ways... you can generate the derived data immediately, and store it in the database, or you can initially store NULLs and do the generation lazily (when and if it's needed). You can even cache it outside the database.
But whatever you do, make sure the cache is never "stale" - i.e. when the original data changes, the derived data in the cache must be re-generated or at least marked as "dirty" somehow. One way to do that is via triggers.
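Here is a rough in-memory sketch of the "store the original, derive lazily, never serve stale" idea. The Note class and the two converters are made up for illustration: the converters handle only **bold** and do no HTML escaping, so a real markdown library would take their place.

```javascript
// Toy converters standing in for a real markdown library (assumption).
const converters = {
  html: (md) => md.replace(/\*\*(.+?)\*\*/g, "<strong>$1</strong>"),
  text: (md) => md.replace(/\*\*(.+?)\*\*/g, "$1"),
};

class Note {
  constructor(markdown) {
    this.markdown = markdown; // the original data, stored as-is
    this.cache = new Map();   // derived formats, filled lazily
  }

  setMarkdown(markdown) {
    this.markdown = markdown;
    this.cache.clear();       // the derived data is now stale, so drop it
  }

  render(format) {
    if (!this.cache.has(format)) {
      this.cache.set(format, converters[format](this.markdown));
    }
    return this.cache.get(format);
  }
}

const note = new Note("Total is **42** units");
console.log(note.render("html")); // Total is <strong>42</strong> units
console.log(note.render("text")); // Total is 42 units
note.setMarkdown("Total is **43** units");
console.log(note.render("text")); // Total is 43 units
```

In a database the same role would be played by nullable derived columns that are reset whenever the markdown column changes.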

You need to store your data in a canonical format. That is, in one true format within your database. It sounds like this format should be a text column that contains markdown. That answers the database-design part of your question.
Then, depending on what format you need to export, you should take the canonical format and convert it to the required output format. This might be just outputting the markdown text, or running it through some sort of parser to remove the markdown or convert it to HTML.

Most everyone seems to be saying to just store the data as HTML in the database and then process it to turn it into plain text. In my opinion there are some downsides to that:
You will likely need application code to strip the HTML and extract the plain text. Imagine if you did this in SQL Server. What if you want to write a stored procedure/query that has the plain text version? How do you extract plain text in SQL? It's possible with a function, but it's a lot of work.
Processing the HTML blob can be slow. I would imagine for small HTML blobs it will be very fast, but there is certainly more overhead than just reading a plain text field.
HTML parsers don't always work well/they can be complex. The idea is that your users can be very creative and insert blobs that won't work well with your parser. I know from experience that it's not always trivial to extract plain text from HTML well.
I would propose what most email providers do:
Store a rich text/HTML version and a plain text version. Two fields in the database.
As is the use case with email providers, the users might want those two fields to have different content.
You can write a UI function that lets the user enter in HTML and then transforms it via the application into a plain text version. This gives the user a nice starting point and they can massage/edit the plain text version before saving to the database.

Always store the source, in your case it is markdown.
Also store the formats that are frequently used.
Use on-demand conversion/rendering for less frequently used formats.
Explanation:
Always have the source. You may need it for various purpose, e.g. the same input can be edited, audit trail, debugging etc etc.
No processor/RAM overhead when the same format is requested frequently; you are trading it for disk storage, which is cheap compared to the former.
Occasional overhead, see the #2

I would suggest storing it in the HTML format, since it is the richest one in this case, and removing the tags when obtaining the data for other formats (such as PDF, LaTeX or whatever). In the following question you'll find a way to remove tags easily.
Regular expression to remove HTML tags
From my point of view, storing data (original and downgraded) in two separate fields is a waste of space, but also an integrity problem, since one of the fields could be -in theory- modified without changing the second one.
Good luck!

I think that what I'd do - if storage is not an issue - would be store the canonical version, but automatically generate from it, in persisted, computed fields, whatever other versions one might need. You want the fields to be persisted because it's pointless doing the conversion every time you need the data. And you want them to be computed because you don't want them to get out of synch with the canonical version.
In essence this is using the database as a cache for the other versions, but a cache that guarantees you data integrity.

HTML-encoding/decoding as it pertains to textboxes

I'm making the transition from the Microsoft stack (i.e. WPF) to HTML5 so apologies in advance for the rather amateurish nature of this question.
The topic at hand is HTML encoding and decoding.
Consider an HTML5 app making AJAX calls to a C# back-end via HTTP. The server returns JSON-formatted data exclusively, always making sure to HTML-encode the JSON value fields using HttpUtility.HtmlEncode().
The HTML5 client performs the same process in reverse. All data posted to the server is HTML-decoded first using a simple JavaScript helper function.
All potentially displayable string data in my HTML5 app is stored and passed from place to place in its HTML-encoded form. This scheme is working well for me. But today I discovered HTML5 text boxes and in doing so, noticed something odd. Text boxes don't seem to like encoded text.
If I have a text box defined as such:
<input id="festus" type="text"/>
and update it as follows:
$("#festus").val(someEncodedString)
…the text box shows the actual codes that are embedded into someEncodedString instead of converting those codes to the appropriate characters. I was surprised by this behavior as I assumed that browsers perform the proper escape code interpretation for all DOM elements.
I've tried to abstract away the problem by writing a helper/wrapper for val() called val2():
$.prototype.val2 = function (newVal) {
    return (newVal === undefined)
        ? iHub.Utils.encodeHTML(this.val())           // getting value
        : this.val(iHub.Utils.decodeHTML(newVal));    // setting value
};
[iHub.Utils is a library of helper functions that I wrote]
The idea here is that val2() will appropriately encode the data retrieved from my text box when getting the value, and decode it prior to setting the value. This seems to work but I can't shake the feeling that I must have a fundamental misunderstanding of how encoding/decoding is supposed to work in HTML5.
Is it standard practice to encode/decode data when using text boxes? Are text boxes special somehow in so far as they, unlike other common elements like <p> and <select>, don't perform standard decoding when displaying an encoded input string?
Again, sorry if this is too basic. HTML5 and JavaScript are fairly new to me and my "Intro to HTML5"-type books don't really discuss this topic in any depth.

HTML encoding is for HTML documents. If you were including your value in the HTML document itself, e.g. <input value="10 > 5" />, you would encode it, to make sure that things like > in your value aren't confused with the > that closes the tag.
But when you use JavaScript to set a field's value, there's no room for confusion. You're not modifying a tag like <input.../>; you're modifying a JavaScript object. So you shouldn't HTML-encode the value. If you're using a string variable, like in your example, you don't need to do any encoding at all.
On the other hand, if you're using a string literal to specify the value, you need to encode it as a JavaScript string, e.g. by escaping the ' in $("#festus").val('can\'t'). This is exactly the same reason you do HTML encoding; to avoid confusion with the ' that closes the string.
The only time you'd do HTML-encoding in JavaScript is when you're using it to generate HTML code, e.g. el.innerHTML = '<input value="10 > 5" />';.
Because of this, I would suggest that you not HTML-encode strings in your AJAX responses or requests. Instead, avoid encoding until you're actually generating the kind of code that requires the encoding. So only HTML-encode strings when you're writing HTML, only JavaScript-encode strings when you're writing JavaScript, and so on.
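To illustrate the difference between writing HTML and setting a property, here is a short sketch. The helper names escapeHtml and inputMarkup are hypothetical, and the jQuery call appears only in a comment because it needs a browser.

```javascript
// Hypothetical helper: HTML-encode a value that is about to be embedded in markup.
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Generating HTML code: the value becomes part of a tag, so it is HTML-encoded.
function inputMarkup(id, value) {
  return '<input id="' + escapeHtml(id) + '" type="text" value="' + escapeHtml(value) + '"/>';
}

const rawValue = '10 > 5 "ok"';
console.log(inputMarkup("festus", rawValue));
// <input id="festus" type="text" value="10 &gt; 5 &quot;ok&quot;"/>

// Setting a DOM property: no markup is parsed, so the raw string is used as-is.
// In a browser with jQuery loaded this would be:
//   $("#festus").val(rawValue);
```

With this split, the AJAX responses can carry the raw string, and a wrapper that decodes before setting and encodes after getting is no longer needed.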
