User equals untrustworthy. Never trust an untrustworthy user's input. I get that. However, I am wondering when the best time to sanitize input is. For example, do you blindly store user input and then sanitize it whenever it is accessed/used, or do you sanitize the input immediately and then store this "cleaned" version? Maybe there are also some other approaches I haven't thought of in addition to these. I am leaning more towards the first method, because any data that came from user input must still be approached cautiously, since even the "cleaned" data might still unknowingly or accidentally be dangerous. Either way, what method do people think is best, and for what reasons?
Unfortunately, almost none of the participants clearly understands what they are talking about. Literally. Only Kibbee managed to get it straight.
This topic is all about sanitization. But the truth is that the broadly termed "general purpose sanitization" everyone is so eager to talk about just doesn't exist.
There are a zillion different mediums, and each requires its own, distinct data formatting. Moreover, even a single medium requires different formatting for its different parts. Say, HTML formatting is useless for JavaScript embedded in an HTML page. Or, string formatting is useless for the numbers in an SQL query.
As a matter of fact, "sanitization as early as possible", as suggested in the most upvoted answers, is just impossible, because one cannot tell in advance in which medium or part of a medium the data will be used. Say, we are preparing to defend against SQL injection, escaping everything that moves. But whoops! Some required fields weren't filled in, and we have to put the data back into the form instead of the database... with all the slashes added.
On the other hand, we diligently escaped all the "user input"... but in the SQL query we have no quotes around it, as it is a number or an identifier. And no "sanitization" ever helped us.
On the third hand, okay, we did our best in sanitizing the terrible, untrustworthy and disdained "user input"... but in some inner process we used this very data without any formatting (as we did our best already!), and whoops! We have got a second order injection in all its glory.
So, from the real life usage point of view, the only proper way would be:
formatting, not whatever "sanitization",
right before use,
according to the rules of the particular medium,
and even following the sub-rules required for this medium's different parts.
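As a rough illustration of the "number in an SQL query" case, here is a minimal Python sketch using the standard sqlite3 module. The users table, its rows and the escape_string helper are all made up for the example; the helper stands in for a generic escape-everything routine and is not any real library's API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])


def escape_string(value):
    # Hypothetical "sanitizer": doubles single quotes, which only means
    # something inside a quoted SQL string literal.
    return value.replace("'", "''")


user_input = "0 OR 1=1"  # contains no quote, so escaping changes nothing

# Wrong formatting for this part of the query: the value lands in a numeric
# position, so the "escaped" text is still parsed as SQL.
unsafe_sql = "SELECT name FROM users WHERE id = " + escape_string(user_input)
leaked = conn.execute(unsafe_sql).fetchall()

# Formatting that fits this part of the medium: a bound parameter is
# always treated as data, never as SQL.
safe = conn.execute(
    "SELECT name FROM users WHERE id = ?", (user_input,)
).fetchall()

print(len(leaked))  # both rows come back
print(len(safe))    # no row matches
```

The escaping step is not wrong in itself; it is simply the formatting for a different part of the query.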
It depends on what kind of sanitizing you are doing.
For protecting against SQL injection, don't do anything to the data itself. Just use prepared statements, and that way, you don't have to worry about messing with the data that the user entered, and having it negatively affect your logic. You have to sanitize a little bit, to ensure that numbers are numbers, and dates are dates, since everything is a string as it comes from the request, but don't try to do any checking to do things like block keywords or anything.
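A small sketch of this approach in Python with the standard sqlite3 module, assuming a hypothetical people table and form fields named name, age and born. The conversions only make sure numbers are numbers and dates are dates; the name is stored exactly as entered.

```python
import sqlite3
from datetime import date


def parse_signup(form):
    # Everything arrives as a string; convert to the types the logic expects.
    # int() and date.fromisoformat() raise ValueError for anything else.
    age = int(form["age"])
    born = date.fromisoformat(form["born"])
    # The name is passed through untouched: no keyword blocking, no escaping.
    return form["name"], age, born.isoformat()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, born TEXT)")

form = {"name": "Robert'); DROP TABLE people;--", "age": "42", "born": "1980-02-29"}

# The placeholders keep the values separate from the SQL text.
conn.execute("INSERT INTO people VALUES (?, ?, ?)", parse_signup(form))

stored = conn.execute("SELECT name, age, born FROM people").fetchone()
print(stored)
```

The table survives and the odd-looking name comes back byte for byte, which is the "pristine copy" the data logic can rely on.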
For protecting against XSS attacks, it would probably be easier to fix the data before it's stored. However, as others mentioned, sometimes it's nice to have a pristine copy of exactly what the user entered, because once you change it, it's lost forever. It's almost too bad there's no foolproof way to ensure your application only puts out sanitized HTML, the way you can ensure you don't get caught by SQL injection by using prepared queries.
I sanitize my user data much like Radu...
First, client-side, using both regexes and control over the allowable characters
input into given form fields, using JavaScript or jQuery tied to events such as
onChange or onBlur, which removes any disallowed input before it can even be
submitted. Realize, however, that this really only has the effect of letting those
users in the know that the data is going to be checked server-side as well. It's
more a warning than any actual protection.
Second, and I rarely see this done these days anymore, the first check
done server-side is to check the location the form is being submitted from.
By only allowing form submission from a page that you have designated as a valid
location, you can kill the script BEFORE you have even read in any data. Granted,
that in itself is insufficient, as a good hacker with their own server can 'spoof'
both the domain and the IP address to make it appear to your script that it is coming
from a valid form location.
Next, and I shouldn't even have to say this, but always, and I mean ALWAYS, run
your scripts in taint mode. This forces you to not get lazy, and to be diligent about
step number 4.
Sanitize the user data as soon as possible using well-formed regexes appropriate to
the data that is expected from any given field on the form. Don't take shortcuts like
the infamous 'magic horn of the unicorn' to blow through your taint checks...
or you may as well just turn off taint checking in the first place for all the good
it will do for your security. That's like giving a psychopath a sharp knife, baring
your throat, and saying 'You really won't hurt me with that, will you?'
And here is where I differ from most others in this fourth step, as I only sanitize
the user data that I am going to actually USE in a way that may present a security
risk, such as any system calls, assignments to other variables, or any writing to
store data. If I am only using the data input by a user to make a comparison to data
I have stored on the system myself (therefore knowing that data of my own is safe),
then I don't bother to sanitize the user data, as I am never going to use it in a way
that presents itself as a security problem. For instance, take a username input as
an example. I use the username input by the user only to check it against a match in
my database, and if true, after that I use the data from the database to perform
all other functions I might call for it in the script, knowing it is safe, and never
use the user's data again after that.
Last is to filter out all the attempted auto-submits by robots these days with a
'human authentication' system, such as Captcha. This is important enough these days
that I took the time to write my own 'human authentication' scheme that uses photos
and an input for the 'human' to enter what they see in the picture. I did this because
I've found that Captcha-type systems really annoy users (you can tell by their
squinted-up eyes from trying to decipher the distorted letters... usually over and
over again). This is especially important for scripts that use either SendMail or SMTP
for email, as these are favorites for your hungry spam-bots.
To wrap it up in a nutshell, I'll explain it as I do to my wife... your server is like a popular nightclub, and the more bouncers you have, the less trouble you are likely to have in the nightclub. I have two bouncers outside the door (client-side validation and human authentication), one bouncer right inside the door (checking for valid form submission location... 'Is that really you on this ID?'), and several more bouncers in close proximity to the door (running taint mode and using good regexes to check the user data).
I know this is an older post, but I felt it important enough for anyone who may read it after my visit here to realize there is no 'magic bullet' when it comes to security, and it takes all of these working in conjunction with one another to make your user-provided data secure. Just using one or two of these methods alone is practically worthless, as their power only exists when they all team together.
Or in summary, as my Mum would often say... 'Better safe than sorry'.
UPDATE:
One more thing I am doing these days, is Base64 encoding all my data, and then encrypting the Base64 data that will reside on my SQL Databases. It takes about a third more total bytes to store it this way, but the security benefits outweigh the extra size of the data in my opinion.
I like to sanitize it as early as possible, which means the sanitizing happens when the user tries to enter invalid data. If there's a TextBox for their age, and they type in anything other than a number, I don't let the keypress for the letter go through.
Then, in whatever is reading the data (often a server), I do a sanity check when I read in the data, just to make sure that nothing slips in due to a more determined user (such as hand-editing files, or even modifying packets!).
Edit: Overall, sanitize early and sanitize any time you've lost sight of the data for even a second (e.g. File Save -> File Open)
The most important thing is to always be consistent in when you escape. Accidental double sanitizing is lame and not sanitizing is dangerous.
For SQL, just make sure your database access library supports bind variables, which keep the values out of the SQL text so they are never parsed as SQL. Anyone who manually concatenates user input onto SQL strings should know better.
For HTML, I prefer to escape at the last possible moment. If you keep the user's original input and they made a mistake, they can edit and fix it later. If you destroy their original input, it's gone forever.
Early is good, definitely before you try to parse it. Anything you're going to output later, or especially pass to other components (e.g., shell, SQL, etc.), must be sanitized.
But don't go overboard - for instance, passwords are hashed before you store them (right?). Hash functions can accept arbitrary binary data. And you'll never print out a password (right?). So don't parse passwords - and don't sanitize them.
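For illustration, a sketch of hashing a password as-is with Python's standard hashlib. The iteration count and the function names are assumptions for this example, not a recommendation for any particular system.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor for this sketch


def hash_password(password, salt=None):
    # The raw password goes straight into the hash: no stripping, no escaping.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected):
    _, digest = hash_password(password, salt)
    # Constant-time comparison of the two digests.
    return hmac.compare_digest(digest, expected)


# Characters that would matter in HTML or SQL are just bytes to the hash.
salt, stored = hash_password("p@ss'<script>; DROP TABLE--")
print(verify_password("p@ss'<script>; DROP TABLE--", salt, stored))
```

Only the salt and the digest are stored, so nothing the user typed ever reaches a query or a page in its original form.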
Also, make sure that you're doing the sanitizing from a trusted process - JavaScript/anything client-side is worse than useless security/integrity-wise. (It might provide a better user experience to fail early, though - just do it both places.)
My opinion is to sanitize user input as soon as possible, both client side and server side. I'm doing it like this:
(client side) Allow the user to enter just specific keys in the field.
(client side) When the user goes to the next field, use onblur to test the input they entered against a regexp, and notify the user if something is not good.
(server side) Test the input again: if the field should be an INTEGER, check for that (in PHP you can use is_numeric()); if the field has a well-known format, check it against a regexp; all others (like text comments), just escape them. If anything is suspicious, stop script execution and return a notice to the user that the data they entered is invalid.
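The server-side step could look roughly like this in Python. The field names (age, zip, comment) and their formats are assumptions for the sketch, and html.escape stands in for "just escape them" under the assumption that the comment is destined for an HTML page.

```python
import html
import re

ZIP_RE = re.compile(r"[0-9]{5}")  # assumed format for a hypothetical zip field


def check_form(form):
    clean = {}
    errors = {}

    # Integer field: accept plain ASCII digits only.
    if re.fullmatch(r"[0-9]{1,3}", form.get("age", "")):
        clean["age"] = int(form["age"])
    else:
        errors["age"] = "age must be a whole number"

    # Field with a well-known format: test it against a regular expression.
    if ZIP_RE.fullmatch(form.get("zip", "")):
        clean["zip"] = form["zip"]
    else:
        errors["zip"] = "zip must be five digits"

    # Free text: nothing to validate, so escape it for the HTML page it is
    # shown on (other destinations would need their own escaping).
    clean["comment_html"] = html.escape(form.get("comment", ""))

    return clean, errors


clean, errors = check_form({"age": "30", "zip": "12345", "comment": "<b>hi</b>"})
print(clean, errors)
```

A caller would stop and send the errors back to the user whenever the second value is not empty.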
If something really looks like a possible attack, the script sends a mail and an SMS to me, so I can check and maybe prevent it as soon as possible. I just need to check the log where I'm logging all user inputs, and the steps the script made before accepting the input or rejecting it.
Perl has a taint option which considers all user input "tainted" until it's been checked with a regular expression. Tainted data can be used and passed around, but it taints any data that it comes in contact with until untainted. For instance, if user input is appended to another string, the new string is also tainted. Basically, any expression that contains tainted values will output a tainted result.
Tainted data can be thrown around at will (tainting data as it goes), but as soon as it is used by a command that has effect on the outside world, the perl script fails. So if I use tainted data to create a file, construct a shell command, change working directory, etc, Perl will fail with a security error.
I'm not aware of another language that has something like "taint", but using it has been very eye opening. It's amazing how quickly tainted data gets spread around if you don't untaint it right away. Things that are natural and normal for a programmer, like setting a variable based on user data or opening a file, seem dangerous and risky with tainting turned on. So the best strategy for getting things done is to untaint as soon as you get some data from the outside.
And I suspect that's the best way in other languages as well: validate user data right away so that bugs and security holes can't propagate too far. Also, it ought to be easier to audit code for security holes if the potential holes are in one place. And you can never predict which data will be used for what purpose later.
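Python has no taint mode, but the propagation rule described above can be imitated in a few lines. This is only a toy model (the Tainted class, untaint and report_path are invented for the example) and it covers nothing beyond string concatenation.

```python
import re


class Tainted(str):
    """A string that came from outside the program."""

    def __add__(self, other):
        # Anything built from tainted data is tainted as well.
        return Tainted(str(self) + str(other))

    def __radd__(self, other):
        return Tainted(str(other) + str(self))


def untaint(value, pattern):
    # As in Perl, only a successful regex match yields clean data.
    match = re.fullmatch(pattern, value)
    if match is None:
        raise ValueError("value does not match the expected format")
    return str(match.group(0))  # a plain str, no longer Tainted


def report_path(name):
    # Stand-in for an operation that affects the outside world.
    if isinstance(name, Tainted):
        raise PermissionError("refusing to use tainted data in a file name")
    return "reports/" + name + ".txt"


user = Tainted("2024-q1")
combined = "prefix-" + user  # still tainted after concatenation
clean = untaint(user, r"[0-9]{4}-q[1-4]")
print(report_path(clean))
```

Calling report_path(user) or report_path(combined) raises instead of building the path, which is roughly the experience the answer describes.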
Clean the data before you store it. Generally you shouldn't be performing ANY SQL actions without first cleaning up input. You don't want to subject yourself to a SQL injection attack.
I sort of follow these basic rules.
Only do modifying SQL actions, such as INSERT, UPDATE, DELETE, through POST. Never GET.
Escape everything.
If you are expecting user input to be something, make sure you check that it is that something. For example, if you are requesting a number, then make sure it is a number. Use validations.
Use filters. Clean up unwanted characters.
Users are evil!
Well, perhaps not always, but my approach is to always sanitize immediately to ensure nothing risky goes anywhere near my backend.
The added benefit is that you can provide feedback to the user if you sanitize at the point of input.
Assume all users are malicious.
Sanitize all input as soon as possible.
Full stop.
I sanitize my data right before I do any processing on it. I may need to take the First and Last name fields and concatenate them into a third field that gets inserted to the database. I'm going to sanitize the input before I even do the concatenation so I don't get any kind of processing or insertion errors. The sooner the better. Even using Javascript on the front end (in a web setup) is ideal because that will occur without any data going to the server to begin with.
The scary part is that you might even want to start sanitizing data coming out of your database as well. The recent surge of ASPRox SQL injection attacks that have been going around is doubly lethal because they infect all database tables in a given database. If your database is hosted somewhere where there are multiple accounts being hosted in the same database, your data becomes corrupted because of somebody else's mistake, and now you've joined the ranks of those hosting malware for their visitors due to no initial fault of your own.
Sure this makes for a whole lot of work up front, but if the data is critical, then it is a worthy investment.
User input should always be treated as malicious before it makes its way down into the lower layers of your application. Always handle sanitizing input as soon as possible, and it should not for any reason be stored in your database before checking for malicious intent.
I find that cleaning it immediately has two advantages. One, you can validate against it and provide feedback to the user. Two, you do not have to worry about consuming the data in other places.
In my application, there is a comment box. If someone enters a comment like
<script>alert("hello")</script>
then an alert appears when I load that page.
Is there any way to prevent this?
There are several ways to address this, but since you haven't mentioned which back-end technology you are using, it is hard to give anything but rough answers.
Also, you haven't mentioned if you want to allow, or deny, the ability to enter regular HTML in the box.
Method 1:
Sanitize inputs on the way in. When you accept something at the server, look for the script tags and remove them.
This is actually far more difficult to get right than might be expected.
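A quick Python sketch of why. The strip_script_tags function is a hypothetical naive filter of the kind described, and each of the inputs after the first gets past it.

```python
def strip_script_tags(text):
    # Naive filter: removes only the exact, lower-case tags.
    return text.replace("<script>", "").replace("</script>", "")


# The obvious case is handled...
plain = strip_script_tags('<script>alert("hello")</script>')

# ...but a different letter case is not touched at all,
upper = strip_script_tags('<SCRIPT>alert("hello")</SCRIPT>')

# removing the inner tags reassembles the outer ones,
nested = strip_script_tags('<scr<script>ipt>alert("hello")</scr</script>ipt>')

# and script can run without any script tag in the first place.
no_tag = strip_script_tags('<img src=x onerror="alert(1)">')

for result in (plain, upper, nested, no_tag):
    print(result)
```

These are only three of the many ways around such a filter, which is why the approach is so hard to get right.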
Method 2:
Escape the data on the way back out to the browser. In PHP there is a function called
htmlentities which will turn all HTML into HTML entities, which render as literally what was typed.
The words <script>alert("hello")</script> would appear on your page.
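The same effect can be seen with Python's standard html.escape, used here only as a stand-in for PHP's htmlentities (the two do not cover exactly the same set of characters).

```python
import html

comment = '<script>alert("hello")</script>'

# Escape at output time; the stored comment itself is left unchanged.
escaped = html.escape(comment)
page = "<p>" + escaped + "</p>"

print(page)
# <p>&lt;script&gt;alert(&quot;hello&quot;)&lt;/script&gt;</p>
```

The browser shows the entities as the characters that were typed, and because the original comment is untouched it can still be edited or shown in another medium later.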
Method 3:
Whitelist
This is far beyond the scope of a single post and really requires knowing your back-end system, but it is possible to allow some HTML while disallowing the rest.
This is insanely difficult to get right and you really are best using a library package that has been very well tested.
You should treat user input as plain text rather than HTML. By correctly escaping HTML entities, you can render what looks like valid HTML text without having the browser try to execute it. This is good practice in general, for your client-side code as well as any user provided values passed to your back-end. Issues arising from this are broadly referred to as script injection or cross-site scripting.
Practically on the client-side this is pretty easy since you're using jQuery. When updating the DOM based on user input, rely on the text method in place of the html method. You can see a simple example of the difference in this jsFiddle.
The best way is to replace <script> with another string. For example, in C# use:
str = str.Replace("<script>", "O_o");
The other options have a lot of disadvantages.
1. Block JavaScript: this disables some validation too, namely the validation that is done in the front end. Also, after retrieval from the database it works again. I mean, an attacker can inject a script as input in a form and it is saved in the database; after you return the records from the database on another page, it is rendered as a script!
2. Render as text: in some technologies this needs third-party packages, which is a risk in itself. Maybe these packages have a backdoor!
Convert the value into a string; that solved it in my case.
Example:
var anything
So I learned earlier that using the data attribute in html5 you could insert values to be handled in a javascript file. e.g
Hey
the handling javascript file will have a line to handle that link tag which might do this
var value=$('.check').data('name');
window.location.href = 'http://www.example.com/' + value;
Now I was wondering, can a malicious coder exploit this? Do you need to sanitize the value before using it for a redirect?
It really depends.
An attacker can modify anything he wants in his browser, so it doesn't matter how much sanitization you put in the front-end, an attacker can work his way around all your javascript functions and the like to circumvent your front-end code.
I'm not saying that you shouldn't sanitize your input in the front-end because it will always help in terms of usability and experience for a legitimate user.
If the address that you're redirecting your user to uses that data attribute to do something with the server, then yes, by all means sanitize it in both places: front and back end. Otherwise, you shouldn't worry; the worst case scenario is that a malicious user (or a knowledgeable one) will end up on a 404 page.
** EDIT **
After reading your comment in this answer, here's my updated answer:
The dangers reside in how you're using that piece of information. Take the Google Analytics script as an example:
Google provides you with a script that will help you track your visitors' actions and behaviors through the Google Analytics interface.
If you change any value in Google's script, Google Analytics won't work, and there's no way you can hack Google through the analytics script.
How does Google achieve this? They put all their security in the back end, and they sanitize any modifiable user input that will be rendered in a website, stored in a database or that somehow interacts with the server.
Back to your case:
If you're going to use that data attribute to do a document.write(), an eval, do a database lookup or any sensitive operation (delete, update, retrieve data) then yes by all means: sanitize it.
How are you going to sanitize it? That's problem specific and more than likely you should ask a new question.
If the HTML is taken from user input or generated from user input, yes, you should definitely perform sanitization. However, if you're asking if data attributes are somehow vulnerable in a way other attributes aren't, the answer is no.
A user with access to the browser (e.g. via XSS) can insert anything into a data attribute. But (s)he can just redirect anywhere at anytime, so this trivial case is irrelevant.
If the value is set by a user via some other means, then the link could be set somewhere other than intended within the same domain. That might be annoying but it shouldn't be a security risk.
If you're doing something else, like including a javascript string for eval in the attribute and that comes from a user (e.g. via a database value), then you will create an XSS vulnerability. But you should never, ever, ever, trust user supplied values anyway. Nothing special about html data attributes there.
Do you need to sanitize the value before using it for a redirect?
No need to sanitize before, but you need to sanitize after.
In your example, if you are not sanitizing the data, you can become a victim of classic XSS.
E.g.: http://www.example.com/ + value, where value is search?q=<script>alert(1)</script>, and where the search page actually outputs the raw query to the browser.
p.s.: this is not specific to data-attributes. It will work the same with normal attributes.
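If the value is only ever meant to be one path segment (an assumption, the question does not say), the formatting for that spot in a URL is percent-encoding. A sketch in Python; in the browser, JavaScript's encodeURIComponent plays a similar role. The redirect_target name is made up for the example.

```python
from urllib.parse import quote

BASE = "http://www.example.com/"


def redirect_target(value):
    # Percent-encode everything that is not an unreserved character, so the
    # value cannot add a query string, extra path segments or markup.
    return BASE + quote(value, safe="")


print(redirect_target("profile"))
print(redirect_target("search?q=<script>alert(1)</script>"))
```

This only covers building the URL. The page that receives the request still has to escape whatever it echoes back, as the answer above points out.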
I've been asked at work whether it is possible to write, on purpose or by accident, JavaScript that will remove specific characters from an HTML document and thus break the HTML. An example would be adding some JavaScript that removes the < symbol in the page. I've tried searching online and I know JavaScript can replace strings, but my knowledge of the language is negligible.
I've been asked to look into it as a way of hopefully addressing why a site I work on needs to have controls over who can add bespoke functionality to the page. I'm hoping it's not possible but would be grateful for the peace of mind!
Yes, and in fact you can do things far more insidious with javascript as well.
http://en.wikipedia.org/wiki/Cross-site_scripting
Yes, that's possible. The easiest example is
var body = document.getElementsByTagName('body')[0];
body.innerHTML = 'destroyed';
which will remove the whole page and just write "destroyed" instead. To get back to your example: in the same way it's possible to replace <:
var body = document.getElementsByTagName('body')[0];
// the g flag replaces every "<", not just the first one
body.innerHTML = body.innerHTML.replace(/</g, 'some other character');
Such "extreme" cases are very unlikely to happen by accident, but it's absolutely possible (particularly for inexperienced JavaScript developers) to break things on a site that usually shouldn't be affected by JavaScript.
Note that this will only mess up the displayed page in the client's browser and doesn't change your HTML file on the server in any way. Just find and remove/fix the "bad" lines of code and everything is fine again.
Any client/browser can manipulate how the page is viewed at any time, for instance in chrome hit F12 and then you can write whatever you want in the html and you will see the changes immediately. But that's not to worry about...
The scary part is when JavaScript on the site communicates with the back-end server and supplies it with some input parameters that are not being sanitized on the server side before it is processed in some way. SQL Injection can also happen this way if the back-end utilizes a database which they almost always do, and so on...
A webpage can be manipulated in two ways: either non-persistently or persistently.
[non-persistent]: this way you can manipulate your own access to a webpage, but this won't affect other users in itself; you can still do harm once you're in.
[persistent]: this way the server-side code will be permanently affected by the injected code, and it will most likely affect other users.
The key thing here is to always sanitize the input a back-end server receives before it processes anything.
You could definitely write some javascript function to modify the contents of a file. If that file is your HTML page, then sure.
If you want to prevent this from happening, you can just set the permissions of that HTML file to be read-only, though.
you could:
Overwrite the page,
Mess with the innerHTML of the body tag (almost the same),
Insert illegal elements.
Yes. At the least, you could use it to write CSS that sets any element, class, ID... even the body to display:none;
How do you avoid cross-site script attacks?
Cross-site script attacks (or cross-site scripting) happen if you, for example, have a guestbook on your homepage and a client posts some JavaScript code which, e.g., redirects you to another website or sends your cookies in an email to a malicious user, or it could be a lot of other stuff which can prove to be really harmful to you and the people visiting your page.
I'm sure it can be done, e.g., in PHP by validating forms, but I'm not experienced enough to, e.g., ban JavaScript or other things which can harm you.
I hope you understand my question and that you are able to help me.
I'm sure it can be done, e.g., in PHP by validating forms
Not really. The input stage is entirely the wrong place to be addressing XSS issues.
If the user types, say <script>alert(document.cookie)</script> into an input, there is nothing wrong with that in itself. I just did it in this message, and if StackOverflow didn't allow it we'd have great difficulty talking about JavaScript on the site! In most cases you want to allow any input(*), so that users can use a < character to literally mean a less-than sign.
The thing is, when you write some text into an HTML page, you must escape it correctly for the context it's going into. For PHP, that means using htmlspecialchars() at the output stage:
<p> Hello, <?php echo htmlspecialchars($name); ?>! </p>
[PHP hint: you can define yourself a function with a shorter name to do echo htmlspecialchars, since this is quite a lot of typing to do every time you want to put a variable into some HTML.]
This is necessary regardless of where the text comes from, whether it's from a user-submitted form or not. Whilst user-submitted data is the most dangerous place to forget your HTML-encoding, the point is really that you're taking a string in one format (plain text) and inserting it into a context in another format (HTML). Any time you throw text into a different context, you're going to need an encoding/escaping scheme appropriate to that context.
For example if you insert text into a JavaScript string literal, you would have to escape the quote character, the backslash and newlines. If you insert text into a query component in a URL, you will need to convert most non-alphanumerics into %xx sequences. Every context has its own rules; you have to know which is the right function for each context in your chosen language/framework. You cannot solve these problems by mangling form submissions at the input stage—though many naïve PHP programmers try, which is why so many apps mess up your input in corner cases and still aren't secure.
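To see how different the rules are, here is the same text formatted for three of the contexts mentioned, using only Python's standard library. json.dumps is used on the assumption that a JSON string is acceptable as a JavaScript string literal; text placed inside an inline script block has further rules that this sketch ignores.

```python
import html
import json
from urllib.parse import quote

text = 'She said "hi" & left\n'

# HTML text or attribute value: markup characters become entities.
as_html = html.escape(text)

# JavaScript string literal: quotes and newlines get backslash escapes.
as_js = json.dumps(text)

# URL query component: most non-alphanumerics become %xx sequences.
as_query = quote(text, safe="")

print(as_html)
print(as_js)
print(as_query)
```

None of the three results is usable in the other two contexts, which is the reason a single pass at the input stage cannot work.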
(*: well, almost any. There's a reasonable argument for filtering out the ASCII control characters from submitted text. It's very unlikely that allowing them would do any good.
Plus of course you will have application-specific validations that you'll want to do, like making sure an e-mail field looks like an e-mail address or that numbers really are numeric. But this is not something that can be blanket-applied to all input to get you out of trouble.)
Cross-site scripting attacks (XSS) happen when a server accepts input from the client and then blindly writes that input back to the page. Most of the protection from these attacks involves escaping the output, so that the JavaScript is rendered as plain text instead of being executed.
One thing to keep in mind is that it is not only data coming directly from the client that may contain an attack. A Stored XSS attack involves writing malicious JavaScript to a database, whose contents are then queried by the web application. If the database can be written separately from the client, the application may not be able to be sure that the data had been escaped properly. For this reason, the web application should treat ALL data that it writes to the client as if it may contain an attack.
See this link for a thorough resource on how to protect yourself: http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet