User equals untrustworthy. Never trust an untrustworthy user's input. I get that. However, I am wondering when the best time to sanitize input is. For example, do you blindly store user input and then sanitize it whenever it is accessed/used, or do you sanitize the input immediately and then store this "cleaned" version? Maybe there are also some other approaches I haven't thought of in addition to these. I am leaning more towards the first method, because any data that came from user input must still be approached cautiously, and the "cleaned" data might still unknowingly or accidentally be dangerous. Either way, what method do people think is best, and for what reasons?
Unfortunately, almost none of the participants clearly understands what they are talking about. Literally. Only Kibbee managed to get it straight.
This topic is all about sanitization. But the truth is, the broadly defined "general purpose sanitization" everyone is so eager to talk about just doesn't exist.
There are a zillion different mediums, and each requires its own distinct data formatting. Moreover, even a single medium requires different formatting for its different parts. Say, HTML formatting is useless for JavaScript embedded in an HTML page. Or, string formatting is useless for the numbers in an SQL query.
As a matter of fact, "sanitization as early as possible", as suggested in the most upvoted answers, is just impossible, because one cannot tell in advance in which medium or medium part the data will be used. Say, we are preparing to defend against "SQL injection", escaping everything that moves. But whoops! Some required fields weren't filled in, and we have to put the data back into the form instead of the database... with all the slashes added.
On the other hand, we diligently escaped all the "user input"... but in the SQL query we have no quotes around it, as it is a number or an identifier. And no "sanitization" ever helped us.
On the third hand: okay, we did our best in sanitizing the terrible, untrustworthy and disdained "user input"... but in some inner process we used this very data without any formatting (as we had done our best already!), and whoops! We have got a second-order injection in all its glory.
So, from the real-life usage point of view, the only proper way would be:
formatting, not whatever "sanitization"
right before use
according to the certain medium rules
and even following sub-rules required for this medium's different parts.
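A minimal sketch of what "formatting right before use, according to the rules of the medium" can look like, using only Python's standard library. The sample string and the helper names (for_html_text, for_url_component, for_shell_argument) are made up for illustration; the point is only that one raw value needs a different treatment for each destination.

```python
import html
import shlex
from urllib.parse import quote

# Stored exactly as the user typed it; nothing is "sanitized" up front.
raw = "Tom & Jerry's <b>show</b>"


def for_html_text(value):
    """Format for HTML text or a quoted attribute value."""
    return html.escape(value, quote=True)


def for_url_component(value):
    """Format for a single query-string or path component."""
    return quote(value, safe="")


def for_shell_argument(value):
    """Format as one word of a POSIX shell command line."""
    return shlex.quote(value)
```

Each helper is called at the moment the value is written into its medium, so the same stored string can safely go back into a form, into a link, or onto a command line without carrying the wrong medium's escaping with it.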
It depends on what kind of sanitizing you are doing.
For protecting against SQL injection, don't do anything to the data itself. Just use prepared statements, and that way, you don't have to worry about messing with the data that the user entered, and having it negatively affect your logic. You have to sanitize a little bit, to ensure that numbers are numbers, and dates are dates, since everything is a string as it comes from the request, but don't try to do any checking to do things like block keywords or anything.
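As a rough illustration of that advice, here is a sketch using Python's built-in sqlite3 module. The people table and the add_person and parse_age helpers are assumptions made up for this example. The name is passed through untouched as a bound parameter, while the age and date are only checked for being what they claim to be.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, joined TEXT)")


def parse_age(text):
    """Make sure a number really is a number (raises ValueError otherwise)."""
    age = int(text)
    if not 0 <= age <= 150:
        raise ValueError("age out of range")
    return age


def add_person(conn, name, age_text, joined_text):
    age = parse_age(age_text)
    joined = date.fromisoformat(joined_text)  # raises ValueError if not a date
    # The ? placeholders keep the values out of the SQL text entirely,
    # so the name needs no escaping and no keyword blocking.
    conn.execute(
        "INSERT INTO people (name, age, joined) VALUES (?, ?, ?)",
        (name, age, joined.isoformat()),
    )
```

A name such as "Robert'); DROP TABLE people;--" is simply stored as that literal string.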
For protecting against XSS attacks, it would probably be easier to fix the data before it's stored. However, as others mentioned, sometimes it's nice to have a pristine copy of exactly what the user entered, because once you change it, it's lost forever. It's almost too bad there's not a foolproof way to ensure your application only puts out sanitized HTML, the way you can ensure you don't get caught by SQL injection by using prepared queries.
I sanitize my user data much like Radu...
First client-side using both regex's and taking control over allowable characters
input into given form fields using javascript or jQuery tied to events, such as
onChange or OnBlur, which removes any disallowed input before it can even be
submitted. Realize however, that this really only has the effect of letting those
users in the know, that the data is going to be checked server-side as well. It's
more a warning than any actual protection.
Second, and I rarely see this done anymore, the first check done server-side is to
verify the location the form is being submitted from. By only allowing form
submission from a page that you have designated as a valid location, you can kill
the script BEFORE you have even read in any data. Granted, that in itself is
insufficient, as a good hacker with their own server can 'spoof' both the domain and
the IP address to make it appear to your script that it is coming from a valid form
location.
Next, and I shouldn't even have to say this, but always, and I mean ALWAYS, run
your scripts in taint mode. This forces you to not get lazy, and to be diligent about
step number 4.
Sanitize the user data as soon as possible using well-formed regexes appropriate to
the data that is expected from any given field on the form. Don't take shortcuts like
the infamous 'magic horn of the unicorn' to blow through your taint checks...
or you may as well just turn off taint checking in the first place for all the good
it will do for your security. That's like giving a psychopath a sharp knife, baring
your throat, and saying 'You really won't hurt me with that, will you?'
And here is where I differ from most others in this fourth step, as I only sanitize
the user data that I am going to actually USE in a way that may present a security
risk, such as any system calls, assignments to other variables, or any writing to
store data. If I am only using the data input by a user to make a comparison to data
I have stored on the system myself (therefore knowing that data of my own is safe),
then I don't bother to sanitize the user data, as I am never going to use it in a way
that presents itself as a security problem. For instance, take a username input as
an example. I use the username input by the user only to check it against a match in
my database, and if true, after that I use the data from the database to perform
all other functions I might call for it in the script, knowing it is safe, and never
use the users data again after that.
Last, is to filter out all the attempted auto-submits by robots these days, with a
'human authentication' system, such as Captcha. This is important enough these days
that I took the time to write my own 'human authentication' schema that uses photos
and an input for the 'human' to enter what they see in the picture. I did this because
I've found that Captcha type systems really annoy users (you can tell by their
squinted-up eyes from trying to decipher the distorted letters... usually over and
over again). This is especially important for scripts that use either SendMail or SMTP
for email, as these are favorites for your hungry spam-bots.
To wrap it up in a nutshell, I'll explain it as I do to my wife... your server is like a popular nightclub, and the more bouncers you have, the less trouble you are likely to have in the nightclub. I have two bouncers outside the door (client-side validation and human authentication), one bouncer right inside the door (checking for valid form submission location... 'Is that really you on this ID?'), and several more bouncers in close proximity to the door (running taint mode and using good regexes to check the user data).
I know this is an older post, but I felt it important enough for anyone that may read it after my visit here to realize there is no 'magic bullet' when it comes to security, and it takes all these working in conjunction with one another to make your user-provided data secure. Just using one or two of these methods alone is practically worthless, as their power only exists when they all team together.
Or in summary, as my Mum would often say... 'Better safe than sorry'.
UPDATE:
One more thing I am doing these days, is Base64 encoding all my data, and then encrypting the Base64 data that will reside on my SQL Databases. It takes about a third more total bytes to store it this way, but the security benefits outweigh the extra size of the data in my opinion.
I like to sanitize it as early as possible, which means the sanitizing happens when the user tries to enter invalid data. If there's a TextBox for their age, and they type in anything other than a number, I don't let the keypress for the letter go through.
Then, in whatever is reading the data (often a server), I do a sanity check as I read it in, just to make sure that nothing slips in from a more determined user (for example, by hand-editing files or even modifying packets!).
Edit: Overall, sanitize early and sanitize any time you've lost sight of the data for even a second (e.g. File Save -> File Open)
The most important thing is to always be consistent in when you escape. Accidental double sanitizing is lame and not sanitizing is dangerous.
For SQL, just make sure your database access library supports bind variables, which keep values separate from the SQL text so that no manual escaping is needed. Anyone who manually concatenates user input onto SQL strings should know better.
For HTML, I prefer to escape at the last possible moment. If you destroy user input, you can never get it back, and if they make a mistake they can edit and fix later. If you destroy their original input, it's gone forever.
Early is good, definitely before you try to parse it. Anything you're going to output later, or especially pass to other components (e.g., shell, SQL, etc.) must be sanitized.
But don't go overboard - for instance, passwords are hashed before you store them (right?). Hash functions can accept arbitrary binary data. And you'll never print out a password (right?). So don't parse passwords - and don't sanitize them.
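A small sketch of that point, assuming PBKDF2 from Python's hashlib as the password hash (the answer does not name a particular algorithm, and the iteration count here is only a placeholder). The password is encoded to bytes and hashed as-is; no characters are stripped or escaped.

```python
import hashlib
import hmac
import os


def hash_password(password, iterations=600_000):
    """Hash the password exactly as typed; any characters are acceptable."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(password, salt, iterations, digest):
    """Recompute the hash and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, digest)
```

Only the salt, iteration count, and digest are stored, so quotes, angle brackets, or SQL keywords in the password never reach a parser.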
Also, make sure that you're doing the sanitizing from a trusted process - JavaScript/anything client-side is worse than useless security/integrity-wise. (It might provide a better user experience to fail early, though - just do it both places.)
My opinion is to sanitize user input as soon as possible, both client side and server side. I do it like this:
(client side) Allow the user to enter just specific keys in the field.
(client side) When the user moves to the next field, use onblur to test the input he entered against a regexp, and notify the user if something is not good.
(server side) Test the input again: if the field should be an INTEGER, check for that (in PHP you can use is_numeric(), though note that it also accepts decimals and exponent notation such as "1e5"); if the field has a well-known format, check it against a regexp; all others (like text comments), just escape them. If anything is suspicious, stop script execution and return a notice to the user that the data he entered is invalid.
If something really looks like a possible attack, the script sends a mail and an SMS to me, so I can check and maybe prevent it as soon as possible. I just need to check the log where I'm logging all user inputs and the steps the script took before accepting or rejecting the input.
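The server-side step could be sketched roughly as follows. The field names (age, postal_code, comment), the patterns, and the length limit are hypothetical. Unlike the description above, this sketch only length-checks free text and leaves its escaping to whatever later writes it into HTML, SQL, or another medium.

```python
import re

# One whole-string pattern per field with a well-known format.
# [0-9] is used instead of \d so that non-ASCII digits are rejected.
VALIDATORS = {
    "age": re.compile(r"[0-9]{1,3}"),
    "postal_code": re.compile(r"[0-9]{5}"),
}
MAX_COMMENT_LENGTH = 2000


def validate_form(form):
    """Return (accepted_values, errors) for a dict of submitted strings."""
    accepted, errors = {}, {}
    for field, pattern in VALIDATORS.items():
        value = form.get(field, "")
        if pattern.fullmatch(value):
            accepted[field] = value
        else:
            errors[field] = "invalid value"
    comment = form.get("comment", "")
    if len(comment) > MAX_COMMENT_LENGTH:
        errors["comment"] = "too long"
    else:
        accepted["comment"] = comment
    return accepted, errors
```

If errors is non-empty, the request would be rejected and the messages shown to the user, which is the point where logging or alerting could also hook in.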
Perl has a taint option which considers all user input "tainted" until it's been checked with a regular expression. Tainted data can be used and passed around, but it taints any data that it comes in contact with until untainted. For instance, if user input is appended to another string, the new string is also tainted. Basically, any expression that contains tainted values will output a tainted result.
Tainted data can be thrown around at will (tainting data as it goes), but as soon as it is used by a command that has effect on the outside world, the perl script fails. So if I use tainted data to create a file, construct a shell command, change working directory, etc, Perl will fail with a security error.
I'm not aware of another language that has something like "taint", but using it has been very eye opening. It's amazing how quickly tainted data gets spread around if you don't untaint it right away. Things that are natural and normal for a programmer, like setting a variable based on user data or opening a file, seem dangerous and risky with tainting turned on. So the best strategy for getting things done is to untaint as soon as you get some data from the outside.
And I suspect that's the best way in other languages as well: validate user data right away so that bugs and security holes can't propagate too far. Also, it ought to be easier to audit code for security holes if the potential holes are in one place. And you can never predict which data will be used for what purpose later.
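To make the idea concrete for languages without a taint mode, here is a toy imitation in Python. It is not how Perl implements tainting; the Tainted class and the untaint and open_report functions are invented for this sketch, and only the + operator propagates taint (formatting, slicing, join and so on are not covered).

```python
import re


class Tainted(str):
    """Marks a string that came from outside and has not been checked yet."""

    def __add__(self, other):
        # tainted + anything stays tainted
        return Tainted(str.__add__(self, other))

    def __radd__(self, other):
        # anything + tainted stays tainted
        return Tainted(str.__add__(other, self))


def untaint(value, pattern):
    """Return a plain str only if the whole value matches the pattern."""
    if re.fullmatch(pattern, value) is None:
        raise ValueError("value does not match the expected format")
    return str(value)


def open_report(filename):
    """Stand-in for an operation that affects the outside world."""
    if isinstance(filename, Tainted):
        raise PermissionError("refusing to use tainted data")
    return "would open " + filename
```

With this in place, "/tmp/" + Tainted("report_7") is itself tainted and open_report refuses it, while the result of untaint(..., r"[a-z0-9_]+") is an ordinary string that passes.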
Clean the data before you store it. Generally you shouldn't be performing ANY SQL actions without first cleaning up input. You don't want to subject yourself to a SQL injection attack.
I sort of follow these basic rules.
Only do modifying SQL actions, such as, INSERT, UPDATE, DELETE through POST. Never GET.
Escape everything.
If you are expecting user input to be something, make sure you check that it is that something. For example, if you are requesting a number, then make sure it is a number. Use validations.
Use filters. Clean up unwanted characters.
Users are evil!
Well, perhaps not always, but my approach is to always sanitize immediately to ensure nothing risky goes anywhere near my backend.
The added benefit is that you can provide feedback to the user if you sanitize at the point of input.
Assume all users are malicious.
Sanitize all input as soon as possible.
Full stop.
I sanitize my data right before I do any processing on it. I may need to take the First and Last name fields and concatenate them into a third field that gets inserted to the database. I'm going to sanitize the input before I even do the concatenation so I don't get any kind of processing or insertion errors. The sooner the better. Even using Javascript on the front end (in a web setup) is ideal because that will occur without any data going to the server to begin with.
The scary part is that you might even want to start sanitizing data coming out of your database as well. The recent surge of ASPRox SQL Injection attacks that have been going around are doubly lethal because it will infect all database tables in a given database. If your database is hosted somewhere where there are multiple accounts being hosted in the same database, your data becomes corrupted because of somebody else's mistake, but now you've joined the ranks of hosting malware to your visitors due to no initial fault of your own.
Sure this makes for a whole lot of work up front, but if the data is critical, then it is a worthy investment.
User input should always be treated as malicious before it makes its way down into the lower layers of your application. Always sanitize input as soon as possible, and never, for any reason, store it in your database before checking it for malicious intent.
I find that cleaning it immediately has two advantages. One, you can validate against it and provide feedback to the user. Two, you do not have to worry about consuming the data in other places.
I have an application that consists of a server-side REST API written in PHP, and some client-side Javascript that consumes this API and uses the JSON it produces to render a page. So, a pretty typical setup.
The data provided by the REST API is "untrusted", in the sense that it is fetching user-provided content from a database. So, for example, it might fetch something like:
{
"message": "<script>alert(\"Gotcha!\")</script>"
}
Obviously, if my client-side code were to render this directly into the page's DOM, I've created an XSS vulnerability. So, this content needs to be HTML-escaped first.
The question is, when outputting untrusted content, should I escape the content on the server side, or the client side? I.e., should my API return the raw content, and then make it the client Javascript code's responsibility to escape the special characters, or should my API return "safe" content:
{
"message": "&lt;script&gt;alert(&#039;Gotcha!&#039;);&lt;\/script&gt;"
}
that has been already escaped?
On one hand, it seems to me that the client should not have to worry about unsafe data from my server. On the other hand, one could argue that output should always be escaped at the last minute possible, when we know exactly how the data is to be consumed.
Which approach is correct?
Note: There are plenty of questions about handling input and yes, I am aware that client-side code can always be manipulated. This question is about outputting data from my server which may not be trustable.
Update: I looked into what other people are doing, and it does seem that some REST APIs tend to send "unsafe" JSON. Gitter's API actually sends both, which is an interesting idea:
[
{
"id":"560ab5d0081f3a9c044d709e",
"text":"testing the API: <script>alert('hey')</script>",
"html":"testing the API: &lt;script&gt;alert(&#39;hey&#39;)&lt;/script&gt;",
"sent":"2015-09-29T16:01:19.999Z",
"fromUser":{
...
},"unread":false,
"readBy":0,
"urls":[],
"mentions":[],
"issues":[],
"meta":[],
"v":1
}
]
Notice that they send the raw content in the text key, and then the HTML-escaped version in the html key. Not a bad idea, IMO.
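A rough sketch of how an API could produce both forms, written in Python rather than the PHP of the actual application. The message_payload function is hypothetical, and html.escape stands in for whatever escaping Gitter really uses (it writes the apostrophe as &#x27; rather than &#39;).

```python
import html
import json


def message_payload(text):
    """Serialize a message with both the raw text and an HTML-escaped copy."""
    return json.dumps(
        {
            "text": text,                           # exactly what the user typed
            "html": html.escape(text, quote=True),  # safe to insert as HTML
        }
    )
```

A client that renders HTML can use the html key directly, while any other consumer can take text and format it for its own medium.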
I have accepted an answer, but I don't believe this is a cut-and-dried problem. I would like to encourage further discussion on this topic.
Escape on the client side only.
The reason to escape on the client side is security: the server's output is the client's input, and so the client should not trust it. If you assume that the input is already escaped, then you potentially open yourself to client attacks via, for example, a malicious reverse-proxy. This is not so different from why you should always validate input on the server side, even if you also include client-side validation.
The reason not to escape on the server side is separation of concerns: the server should not assume that the client intends to render the data as HTML. The server's output should be as media-neutral as possible (given the constraints of JSON and the data structure, of course), so that the client can most easily transform it into whatever format is needed.
For escaping on output:
I suggest reading this XSS Filter Evasion Cheat Sheet.
To protect users properly, you had better not only escape, but also filter the data with an appropriate anti-XSS library, such as htmLawed or HTML Purifier, or any from this thread, before escaping it.
IMHO sanitizing should be done on user-entered data whenever you are going to show it back in a web project.
should I escape the content on the server side, or the client side? I.e., should my API return the raw content, and then make it the client Javascript code's responsibility to escape the special characters, or should my API return "safe" content:
It's better to return already escaped and XSS-purified content, so:
Take the raw data and purify it of XSS on the server
Escape it
Return it to JavaScript
You should also take one important thing into account: the load on your site and its read/write balance. For example, if your client enters data once and you are going to show this data to 1M users, which do you prefer: running the protection logic once before the write (protect on input), or a million times, once on each read (protect on output)?
If you are going to show something like 1K posts on a page and escape each one on the client, how well will that work on the client's mobile phone? This last point will help you choose whether to protect the data on the client or on the server.
This answer is more focused on arguing whether to do client-side escaping vs server-side, since OP seems aware of the argument against escaping on input vs output.
Why not escape client-side?
I would argue that escaping at the javascript level is not a good idea. Just an issue off the top of my head would be if there was an error in the sanitizing script, it would not run, and then the dangerous script would be allowed to run. So you have introduced a vector where an attacker can try to craft input to break the JS sanitizer, so that their plain script is allowed to run. I also do not know of any built-in AntiXSS libraries that run in JS. I am sure someone has made one, or could make one, but there are established server-side examples that are a little more trust-worthy. It is also worth mentioning that writing a sanitizer in JS that works for all browsers is not a trivial task.
OK, what if you escape on both?
Escaping server-side and client-side is just kind of confusing to me, and shouldn't provide any additional security. You mentioned the difficulties with double-escaping, and I have experienced that pain before.
Why is server-side good enough?
Escaping server-side should be sufficient. Your point about doing it as late as possible makes some sense, but I think the drawbacks of escaping client-side outweigh whatever tiny benefit you may get by doing it. Where is the threat? If an attacker exists between your site and the client, then the client is already compromised since they can just send a blank html file with their script if they want. You need to do your best to send something safe, not just send the tools to deal with your dangerous data.
TLDR; If your API is to convey formatting information, it should output HTML encoded strings. Caveat: Any consumer will need to trust your API not to output malicious code. A Content Security Policy can help with this too.
If your API is to output only plain text, then HTML encode on the client-side (as < in the plain text also means < in any output).
Not too long, not done reading:
If you own both the API and the web application, either way is acceptable. As long as you are not outputting JSON to HTML pages without hex entity encoding like this:
<%
payload = "[{ foo: '" + foo + "'}]"
%>
<script><%= payload %></script>
then it doesn't matter whether the code on your server changes & to &amp; or the code in the browser changes & to &amp;.
Let's take the example from your question:
[
{
"id":"560ab5d0081f3a9c044d709e",
"text":"testing the API: <script>alert('hey')</script>",
"html":"testing the API: &lt;script&gt;alert(&#39;hey&#39;)&lt;/script&gt;",
"sent":"2015-09-29T16:01:19.999Z",
If the above is returned from api.example.com and you call it from www.example.com, as you control both sides you can decide whether you want to take the plain text, "text", or the formatted text, "html".
It is important to remember though that any variables inserted into html have been HTML encoded server-side here. And also assume that correct JSON encoding has been carried out which stops any quote characters from breaking, or changing the context of the JSON (this is not shown in the above for simplicity).
text would be inserted into the document using Node.textContent and html as Element.innerHTML. Using Node.textContent will cause the browser to ignore any HTML formatting and script that may be present because characters like < are literally taken to be output as < on the page.
Note your example shows user content being input as script. i.e. a user has typed <script>alert('hey')</script> into your application, it is not API generated. If your API actually wanted to output tags as part of its function, then it'd have to put them in the JSON:
"html":"<u>Underlined</u>"
And then your text would have to only output the text without formatting:
"text":"Underlined"
Therefore, your API while sending information to your web application consumer is no longer transmitting rich text, only plain text.
If, however, a third party was consuming your API, then they may wish to get the data from your API as plain text because then they can set Node.textContent (or HTML encode it) on the client-side themselves, knowing that it is safe. If you return HTML then your consumer needs to trust you that your HTML does not contain any malicious script.
So if the above content is from api.example.com, but your consumer is a third party site, say, www.example.edu, then they may feel more comfortable taking in text rather than HTML. Your output may need to be more granularly defined in this case, so rather than outputting
"text":"Thank you Alice for signing up."
You would output
[{ "name": "alice",
"messageType": "thank_you" }]
Or similar so you are not defining the layout in your JSON any longer, you are just conveying the information for the client-side to interpret and format using their own style. To further clarify what I mean, if all your consumer got was
"text":"Thank you Alice for signing up."
and they wanted to show names in bold, it would be very tricky for them to accomplish this without complex parsing. However, with defining API outputs on a granular level, the consumer can take the relevant pieces of output like variables, and then apply their own HTML formatting, without having to trust your API to only output bold tags (<b>) and not to output malicious JavaScript (either from the user or from you, if you were indeed malicious, or if your API had been compromised).
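From the consumer's side, this could look roughly like the following Python sketch. The TEMPLATES table and the render function are hypothetical; they only illustrate a consumer applying its own markup around values it has escaped itself.

```python
import html

# The consumer owns the wording and the markup; the API supplies only data.
TEMPLATES = {
    "thank_you": "Thank you <b>{name}</b> for signing up.",
}


def render(item):
    """Build an HTML fragment from one granular API item."""
    template = TEMPLATES[item["messageType"]]  # unknown types raise KeyError
    return template.format(name=html.escape(item["name"]))
```

Because the name is escaped by the consumer, it does not matter whether the API (or a user behind it) tried to smuggle markup into that field.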
In terms of jQuery (or Javascript), what happens behind the scenes when a person posts a comment on Facebook, Twitter, or a blog?
For instance, do they sanitize the text first, and then pattern match URL's into an actual link? Are there other items of concern that the client-side should check in addition to doing some checks on the backend?
I have found a few regex's for turning URL's into links, but I'm not sure if there are better solutions.
I'm trying to wrap my head around the problem, but I'm having a difficult time knowing where to start. Any guidance you can provide is greatly appreciated!
This is a matter of opinion (in my opinion) so I'll CW this answer. Here's my opinion as a bona-fide citizen of the Internet:
1. There are two broad kinds of "sanitization": one is semantic sanitization, where input is checked to make sure it's what it's supposed to be (phone number, postal code, currency amount, whatever). The other is defensive sanitization, which is (again, in my opinion) a generally misguided, user-hostile activity.
2. Really, input is never really scary until it touches something: the database server, an HTML renderer, a JavaScript interpreter, and so on. The list is long.
As to point 1, I think that defensive sanitization is misguided because it ignores point 2 above: without knowing what environment you're defending from malicious input, you can't really sanitize it without greatly restricting the input alphabet, and even then the process may be fighting against itself. It's user-hostile because it needlessly restricts what legitimate users can do with the data they want to keep in their account. Who is to say that I shouldn't be allowed to include in my "comments" or "nickname" or "notes" fields characters that look like XML, or SQL, or any other language's special characters? If there's no semantic reason to filter inputs, why do that to your users?
Point 2 is really the crux of this. User input can be dangerous because server-side code (or client-side code, for that matter) can hand it over directly to unsuspecting interpretation environments where meta-characters important to each distinct environment can cause unexpected behavior. If you hand untouched user input directly to SQL by pasting it directly into a query template, then special SQL meta-characters like quotes can be used by a malicious user to control the database in ways you definitely don't want. However, that alone is no reason to prevent me from telling you that my name is "O'Henry".
The key issue with point 2 is that there are many different interpretation environments, and each of them is completely distinct as far as the threat posed by user input. Let's list a few:
SQL - quote marks in user input are a big potential problem; specific DB servers may have other exploitable syntax conventions
HTML - when user input is dropped straight into HTML, the browser's HTML parser will happily obey whatever embedded markup tells it to do, including run scripts, load tracker images, and whatever else. The key meta-characters are "<", ">", and "&" (the latter not so much because of attacks, but because of the mess they cause). It's probably also good to worry about quotes here too because user input may need to go inside HTML element attributes.
JavaScript - if a page template needs to put some user input directly into some running JavaScript code, the things to worry about are probably quotes (if the input is to be treated as a JavaScript string). If the user input needs to go into a regular expression, then a lot more scrubbing is necessary.
Logfiles - yes, logfiles. How do you look at logfiles? I do it on a simple command-line window on my Linux box. Such command-line "console" applications generally obey ancient "escape sequences" that date back to old ASCII terminals, for controlling cursor position and various other things. Well, embedded escape sequences in cleverly crafted user input can be used for crazy attacks that leverage those escape sequences; the general idea is to have some user input get dropped into some log file (maybe as part of a page error log) and trick an administrator into scrolling through the logfile in an xterm window. Wild, huh?
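For the logfile case, one possible treatment is to make control characters visible before the value is written. This is only a sketch; the for_log helper and its \xNN notation are choices made for this example, not an established convention.

```python
import re

# C0 and C1 control characters, including ESC (0x1b), CR and LF.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def for_log(value):
    """Replace control characters with a visible \\xNN form before logging."""
    return _CONTROL_CHARS.sub(
        lambda match: "\\x{:02x}".format(ord(match.group())), value
    )
```

Terminal escape sequences then show up as harmless text, and embedded newlines can no longer be used to fake extra log lines.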
The key point here is that the exact techniques necessary to protect those environments from malformed or malicious input differ significantly from one to the next. Protecting your SQL server from malicious quotes is a completely different problem from guarding those quotes in HTML or JavaScript (and note that both of those are totally different from each other too!).
The bottom line: my opinion, therefore, is that the proper focus of attention when worrying about potentially malformed or malicious input is the process of writing user data, not reading it. As each fragment of user-supplied data is used by your software in cooperation with each interpreting environment, a "quoting" or "escaping" operation has to be done, and it has to be an operation specific to the target environment. How exactly that's arranged may vary all over the place. Traditionally in SQL, for example, one uses prepared statements, though there are times when the deficiencies of prepared statements make that approach difficult. When spitting out HTML, most server-side frameworks have all sorts of built-in hooks for HTML or XML escaping with entity notation (like &amp; for "&"). Nowadays, the simplest way to protect things for JavaScript is to leverage a JSON serializer, though of course there are other ways to go.
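As an example of the JSON serializer approach, here is a sketch in Python. The js_literal helper is hypothetical. One caveat it tries to handle: plain JSON output can still contain "</script>", so when the literal is placed inside an inline script block the characters <, > and & are additionally written as \u escapes, which JavaScript reads back as the same characters.

```python
import json


def js_literal(value):
    """Serialize a value as a JavaScript literal safe inside a <script> block."""
    text = json.dumps(value)  # handles quotes, backslashes, and newlines
    return (
        text.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )
```

A template would then emit something like var comment = followed by js_literal(user_comment) and a semicolon, never the raw string between hand-written quotes.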
Greetings Everyone
Being a newbie to javascript attacks and all other sort of attack prevention and care, I would like some input as to how I can make my website more secure to attacks.
We are launching a new website which is in Arabic (utf8). There is an input box on our website that takes a search string from users and displays this string with the results. Also this string is inserted into our mysql database to keep track of what people are searching for.
What I've done on the backend is put strip_tags , mysql_real_escape_string on the search string. I tried using javascript's escape function on the input search string but this totally messes up the arabic text and can't use this string to search in the backend or even display searched string in the front end. Is there anything more I can do on the front end or the back end to make website more secure from attacks?
Thanking You
Imran
I would strongly advise against doing any sort of escaping on the client-side. You can't rely on the fact that the escaping actually happens, both because malicious users could modify the script on their end to bypass the escaping and because older browsers (or browsers with script blockers) might end up preventing the script from running. As a general rule, never trust any data that comes from the client, even if you've made an effort to sanitize it on the client end.
What you're doing on the server end seems well-intentioned. Without more access to the code I don't think we can confirm that you're using those functions correctly, but you're at least on the right track for using them.
Best of luck with the site, by the way!
You can avoid the problem by changing the design.
Leave the input box untouched, and do not give back the search string in the result.
For the list that you store in your DB, as long as you do not show it to a user in a web page there is no risk for script injection.
I'm planning on making a web app that will allow users to post entire web pages on my website. I'm thinking of using HTML Purifier but I'm not sure because HTML Purifier edits the HTML and it's important that the HTML is maintained just how it was posted. So I was thinking of making some regex to get rid of all script tags and all the javascript attributes like onload, onclick, etc.
I saw a Google video a while ago that had a solution for this. Their solution was to use another website to post javascript in so the original website cannot be accessed by it. But I don't wanna purchase a new domain just for this.
Be careful with homebrew regexes for this kind of thing.
A regex like
s/(<.*?)onClick=['"].*?['"](.*?>)/$1 $2/
looks like it might get rid of onclick events, but you can circumvent it with
<a onClick<a onClick="malicious()">="malicious()">
running the regex on that will get you something like
<a onClick ="malicious()">
You can fix it by repeatedly running the regex on that string until it doesn't match, but that's just one example of how easy it is to get around simple regex sanitizers.
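To make the nesting trick concrete, here is a small Python sketch. The pattern below is a simplified stand-in for the regex above, not the same expression, and the payload is a hypothetical one built for this pattern: a single pass removes the inner attribute and thereby assembles a working one, while repeating until the string stops changing removes it.

```python
import re

# Simplified stand-in sanitizer: remove one onclick="..." attribute per match.
ONCLICK = re.compile(r"""\sonclick\s*=\s*(['"]).*?\1""", re.IGNORECASE)


def strip_once(markup):
    # Single pass: removes every non-overlapping match found in the original string.
    return ONCLICK.sub("", markup)


def strip_until_stable(markup):
    # Keep stripping until a pass changes nothing.
    previous = None
    while previous != markup:
        previous = markup
        markup = strip_once(markup)
    return markup


# Hypothetical payload: the inner attribute splits the outer attribute name in two.
payload = '<a oncl onclick="x"ick="malicious()">'

after_one_pass = strip_once(payload)      # '<a onclick="malicious()">'
after_fixpoint = strip_until_stable(payload)  # '<a>'
```

Even the fixed-point version only covers this one attribute spelled this one way, which is the broader point about regex-based sanitizers.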
The most critical error people make when doing this is validating things on input.
Instead, you should validate on display.
The context matters when determining what is XSS and what isn't. Therefore, you can happily accept any input, as long as you pass it through the appropriate cleaning functions when displaying it.
Consider that what constitutes 'XSS' is different when the input is placed in '<a href="HERE">' as opposed to '<a>here!</a>'.
Thus, all you need to do is make sure that any time you write user data, you consider very carefully where you are displaying it, and make sure that it can't escape the context you are writing it to.
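A short Python sketch of why the two contexts differ (the helper names as_text and as_href are hypothetical, and the scheme allow-list is an assumption of this sketch). HTML-escaping is enough for element content, but a javascript: URL contains no characters that escaping would change, so the href context needs its own check.

```python
import html
from urllib.parse import urlsplit


def as_text(value):
    # Context: element content, e.g. <a>HERE</a>. HTML-escaping is sufficient.
    return html.escape(value)


def as_href(value):
    # Context: <a href="HERE">. Escaping alone is not enough, because a
    # "javascript:" URL has nothing to escape. This sketch allows only
    # absolute http(s) URLs and falls back to "#" for everything else
    # (relative URLs are rejected too, to keep the example simple).
    value = value.strip()
    scheme = urlsplit(value).scheme.lower()
    if scheme not in ("http", "https"):
        return "#"
    return html.escape(value, quote=True)


hostile = "javascript:alert(1)"
text_out = as_text("<b>hi</b>")
escaped_only = html.escape(hostile)   # unchanged: escaping does not help here
href_out = as_href(hostile)
good_href = as_href("https://example.com/?q=a&b")
```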
If you can find any other way of letting users post content, that does not involve HTML, do that. There are plenty of user-side light markup systems you can use to generate HTML.
So I was thinking of making some regex to get rid of all script tags and all the JavaScript attributes like onload, onclick, etc.
Forget it. You cannot process HTML with regex in any useful way. Let alone when security is involved and attackers might be deliberately throwing malformed markup at you.
If you can convince your users to input XHTML, that's much easier to parse. You still can't do it with regex, but you can throw it into a simple XML parser, and walk over the resulting node tree to check that every element and attribute is known-safe, and delete any that aren't, then re-serialise.
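The parse, walk, and re-serialise idea might look roughly like this in Python, using the standard xml.etree.ElementTree module. The allow-list is a made-up example, the input is assumed to be a well-formed XHTML fragment without namespaces, and a real deployment would want a parser hardened against hostile XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list: element name -> attribute names permitted on it.
ALLOWED = {"div": set(), "p": set(), "b": set(), "i": set(), "a": {"href"}}


def _remove_keep_tail(parent, child):
    # Removing an element also drops the text that follows it (its "tail"),
    # so hand that text to the previous sibling or to the parent first.
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def _clean(elem):
    for name, value in list(elem.attrib.items()):
        ok = name in ALLOWED[elem.tag]
        # Even an allowed href must point at an http(s) URL in this sketch.
        if name == "href" and not value.strip().lower().startswith(("http://", "https://")):
            ok = False
        if not ok:
            del elem.attrib[name]
    for child in list(elem):
        if child.tag in ALLOWED:
            _clean(child)
        else:
            _remove_keep_tail(elem, child)


def sanitize_xhtml(fragment):
    root = ET.fromstring(fragment)  # raises ParseError on malformed markup
    if root.tag not in ALLOWED:
        raise ValueError("root element is not allowed")
    _clean(root)
    return ET.tostring(root, encoding="unicode")


dirty = ('<div><p onclick="evil()">Hi <b>there</b><script>alert(1)</script> friend</p>'
         '<a href="javascript:alert(1)">x</a></div>')
clean = sanitize_xhtml(dirty)
```

Anything not explicitly listed is dropped, so new attack vectors fail closed instead of needing a new rule.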
HTML Purifier edits the HTML and it's important that the HTML is maintained just how it was posted.
Why?
If it's so they can edit it in their original form, then the answer is simply to purify it on the way out to be displayed in the browser, not on the way in at submit-time.
If you must let users input their own free-form HTML — and in general I'd advise against it — then HTML Purifier, with a whitelist approach (ban all elements/attributes that aren't known-safe) is about as good as it gets. It's very very complicated and you may have to keep it up to date when hacks are found, but it's streets ahead of anything you're going to hack up yourself with regexes.
But I don't wanna purchase a new domain just for this.
You can use a subdomain, as long as any authentication tokens (in particular, cookies) can't cross between subdomains. (Which for cookies they can't by default as the domain parameter is set to only the current hostname.)
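As a sketch of what a host-only cookie looks like, here is how the Set-Cookie value could be built with Python's standard http.cookies module (the cookie name session and the helper build_session_cookie are hypothetical). The point is simply that no Domain attribute is set, so the browser scopes the cookie to the exact host that issued it.

```python
from http.cookies import SimpleCookie


def build_session_cookie(token):
    cookie = SimpleCookie()
    cookie["session"] = token
    cookie["session"]["path"] = "/"
    cookie["session"]["httponly"] = True   # not readable from page scripts
    cookie["session"]["secure"] = True     # only sent over HTTPS
    # Deliberately no "domain" attribute: the cookie stays host-only and is
    # not sent to sibling subdomains such as one serving user content.
    return cookie["session"].OutputString()


header_value = build_session_cookie("abc123")
```

Setting Domain to the parent domain would undo the separation, since the cookie would then be sent to every subdomain.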
Do you trust your users with scripting capability? If not, don't let them have it, or you'll get attack scripts and iframes to Russian exploit/malware sites all over the place...
Make sure that user content doesn't contain anything that could cause JavaScript to be run on your page.
You can do this by using an HTML stripping function that gets rid of all HTML tags (like strip_tags from PHP), or by using another similar tool. There are actually many reasons besides XSS to do this. If you have user submitted content, you want to make sure that it doesn't break the site layout.
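For comparison, a tag-stripping helper in the spirit of strip_tags could be sketched in Python with the standard html.parser module (the names strip_all_tags and render_comment are hypothetical). It keeps only text nodes, then escapes the result before putting it back into a page.

```python
import html
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_all_tags(markup):
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def render_comment(markup):
    # Strip the tags, then still escape for the HTML context on output.
    return "<p>" + html.escape(strip_all_tags(markup)) + "</p>"


rendered = render_comment('<b>Hello</b><script>alert(1)</script><div style="position:fixed">x')
```

Note that the text inside a script element survives as plain text; it is harmless once escaped, but it does show up in the output.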
I believe you can simply use a subdomain of your current domain to host the JavaScript, and you will get the same security benefits for AJAX. Not for cookies, however.
In your specific case, filtering out the <script> tag and Javascript actions is probably going to be your best bet.
1) Use clean simple directory based URIs to serve user feed data.
When you dynamically create URIs to address the user's uploaded data, service account, or anything else off your domain, make sure you don't pass information as parameters to the URI. That is an extremely easy point of manipulation that could be used to expose flaws in your server security and possibly even to inject code onto your server.
2) Patch your server.
Ensure you keep your server up to date on all the latest security patches for all the services running on that server.
3) Take all possible server-side protections against SQL injection.
If somebody can inject code into your SQL database that can execute from services on your box, that person will own your box. At that point they can install malware onto your web server to be fed back to your users, or simply record data from the server and send it out to a malicious party.
4) Force all new uploads into a protected sandboxed area to test for script execution.
No matter how you try to remove script tags from submitted code there will be a way to circumvent your safeguards to execute script. Browsers are sloppy and do all kinds of stupid crap they are not supposed to do. Test your submissions in a safe area before you publish them for public consumption.
5) Check for beacons in submitted code.
This step requires the previous step and can be very complicated, because a beacon can occur in script code that requires a browser plugin to execute, such as ActionScript, but it is just as much a vulnerability as allowing JavaScript to execute from user-submitted code. If a user can submit code that can beacon out to a third party, then your users, and possibly your server, are completely exposed to data loss to a malicious third party.
You should filter ALL HTML and whitelist only the tags and attributes that are safe and semantically useful. WordPress is great at this and I assume that you will find the regular expressions used by WordPress if you search their source code.