Sanitizing HTML input value - javascript

Do you have to convert anything besides the double quote (") to (&quot;) inside of:
<input type="text" value="$var">
I personally do not see how you can possibly break out of that without using " on*=....
Is this correct?
Edit: Apparently some people think my question is too vague;
<input type="text" value="<script>alert(0)</script>"> does not execute, which suggests that it is impossible to break out of the attribute without using ".
Is this correct?

There really are two questions that you're asking (or at least can be interpreted):
Can the quoted value attribute of input[type="text"] be injected if quotes are disallowed?
Can an arbitrary quoted attribute of an element be injected if quotes are disallowed?
The second is trivially demonstrated by the following:
<a href="javascript:alert(123);">Foo</a>
Or
<div onmousemove="alert(123);">...
The first is a bit more complicated.
HTML5
According to the HTML5 spec:
Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
Which is further refined in quoted attributes to:
The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single """ (U+0022) character, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK characters ("), and finally followed by a second single """ (U+0022) character.
So in short, any character except an "ambiguous ampersand" (&[a-zA-Z0-9]+; when the result is not a valid character reference) and a quote character is valid inside of an attribute.
HTML 4.01
HTML 4.01 is less descriptive than HTML5 about the syntax (one of the reasons HTML5 was created in the first place). However, it does say this:
When script or style data is the value of an attribute (either style or the intrinsic event attributes), authors should escape occurrences of the delimiting single or double quotation mark within the value according to the script or style language convention. Authors should also escape occurrences of "&" if the "&" is not meant to be the beginning of a character reference.
Note, this is saying what an author should do, not what a parser should do. So a parser could technically accept or reject invalid input (or mangle it to be valid).
XML 1.0
The XML 1.0 Spec defines an attribute as:
Attribute ::= Name Eq AttValue
where AttValue is defined as:
AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
The & is similar to the concept of an "ambiguous ampersand" from HTML5, however it's basically saying "any unencoded ampersand".
Note though that it explicitly denies < from attribute values.
So while HTML5 allows it, XML1.0 explicitly denies it.
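The AttValue production above can be approximated with a regular expression. This is only a sketch: the function name isValidXmlAttValue is made up for illustration, and the entity Name part is simplified to ASCII, whereas the real XML Name production allows many more characters.

```javascript
// Reference ::= EntityRef | CharRef, with Name simplified to ASCII (assumption).
const REFERENCE = "&(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z_:][A-Za-z0-9_.:-]*);";

// AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
const ATT_VALUE = new RegExp(
  "^(?:\"(?:[^<&\"]|" + REFERENCE + ")*\"|'(?:[^<&']|" + REFERENCE + ")*')$"
);

// Hypothetical helper: expects the attribute value including its quotes.
function isValidXmlAttValue(quotedValue) {
  return ATT_VALUE.test(quotedValue);
}

console.log(isValidXmlAttValue('"a &amp; b"'));               // true
console.log(isValidXmlAttValue('"<script>alert(0)</script>"')); // false: raw < is not allowed
console.log(isValidXmlAttValue('"a & b"'));                   // false: bare ampersand
```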
What Does It Mean
It means that for a compliant and bug-free parser, HTML5 will treat < characters in an attribute as plain text, and XML will error.
It also means that for a compliant and bug free parser, HTML 4.01 will behave in unspecified and potentially odd ways (since the specification doesn't detail the behavior).
And this gets down to the crux of the issue. In the past, HTML was such a loose spec that every browser had slightly different rules for how it would deal with malformed HTML. Each would try to "fix" it, or "interpret" what you meant. So while an HTML5-compliant browser wouldn't execute the JS in <input type="text" value="<script>alert(0)</script>">, there's nothing to say that an HTML 4.01-compliant browser wouldn't. And there's nothing to say that a bug may not exist in the XML or HTML5 parser that causes it to be executed (though that would be a pretty significant problem).
THAT is why OWASP (and most security experts) recommend you encode either all non-alpha-numeric characters or &<" inside of an attribute value. There's no cost in doing so, only the added security of knowing how the browser's parser will interpret the value.
Do you have to? no. But defense in depth suggests that, since there's no cost to doing so, the potential benefit is worth it.
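A minimal sketch of that kind of encoding in JavaScript. The function name escapeAttributeValue is hypothetical, and encoding > and ' in addition to &<" is an extra precaution assumed here, not something the recommendation above requires.

```javascript
// Hypothetical helper: encode the characters that matter inside a quoted attribute value.
function escapeAttributeValue(value) {
  return String(value)
    .replace(/&/g, "&amp;")  // must run first so the entities added below are not re-encoded
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")   // extra precaution (assumption)
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;"); // extra precaution for single-quoted attributes (assumption)
}

const payload = '"><script>alert(0)</script>';
const html = '<input type="text" value="' + escapeAttributeValue(payload) + '">';
console.log(html);
// <input type="text" value="&quot;&gt;&lt;script&gt;alert(0)&lt;/script&gt;">
```

The ampersand is replaced first on purpose: doing it last would turn the freshly written &lt; into &amp;lt;.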

If your question is "what types of XSS attacks are possible", then you had better google it. I'll just leave some examples of why you should sanitize your inputs.
If input is generated by echo '<input type="text" value="$var">', then simple ' breaks it.
If input is plain HTML in PHP page then value=<?php deadly_php_script ?> breaks it
If this is plain HTML input in HTML file - then converting doublequotes should be enough.
That said, converting other special symbols (like <, > and so on) is good practice. Inputs exist to collect data that will be stored on the server or passed on to another page or script, so you need to check what could break those files. Let's say we have this setup:
index.html:
<form method=post action=getinput.php>
<input type="text" name="xss">
<input type="submit"></form>
getinput.php:
echo $_POST['xss'];
Input value ;your_deadly_php_script breaks it totally (you can also sanitize server-side in that case)
If that's not enough - provide more info on your question, add more examples of your code.

I believe the person is referring to cross site scripting attacks. They tagged this as php, security, and xss
take for example
<input type="text" value=""><script>alert(0)</script><"">
The above code will execute the alert box code;
<?php $var= "\"><script>alert(0)</script><\""; ?>
<input type="text" value="<?php echo $var ?>">
This will also execute the alert box.
To solve this you need to escape ", < >, and a few more to be safe. PHP has a couple of functions worth looking into and each have their ups and downs!
htmlentities() - Convert all applicable characters to HTML entities
htmlspecialchars() - Convert special characters to HTML entities
get_html_translation_table() - Returns the translation table used by htmlspecialchars and htmlentities
urldecode() - Decodes URL-encoded string
What you have to be careful of is that you are passing in a variable, and there are ways to trigger errors that cause the value to break out of its context. Your best bet is to make sure the data is never formatted in an executable manner, even when errors occur. You are right that without quotes you can't break out of the attribute, but there may be techniques that neither you nor I know about yet that would allow it.

$var = "><script>alert(0);</script> would work... If you can close the quotes you can then close the tag and open another one... But I think you are right, without closing the quotes no injection is possible...

Related

How to bypass the PHP strip_tags function [duplicate]

Is there a known XSS or other attack that makes it past a
$content = "some HTML code";
$content = strip_tags($content);
echo $content;
?
The manual has a warning:
This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.
but that is related to using the allowable_tags parameter only.
With no allowed tags set, is strip_tags() vulnerable to any attack?
Chris Shiflett seems to say it's safe:
Use Mature Solutions
When possible, use mature, existing solutions instead of trying to create your own. Functions like strip_tags() and htmlentities() are good choices.
is this correct? Please if possible, quote sources.
I know about HTML purifier, htmlspecialchars() etc.- I am not looking for the best method to sanitize HTML. I just want to know about this specific issue. This is a theoretical question that came up here.
Reference: strip_tags() implementation in the PHP source code
As its name suggests, strip_tags should remove all HTML tags. The only way we can prove it is by analyzing the source code. The following analysis applies to a strip_tags('...') call without a second argument for whitelisted tags.
First of all, some theory about HTML tags: a tag starts with a < followed by non-whitespace characters. If this string starts with a ?, it should not be parsed. If this string starts with !--, it's considered a comment and the following text should not be parsed either. A comment is terminated with -->; inside such a comment, characters like < and > are allowed. Attributes can occur in tags, and their values may optionally be surrounded by a quote character (' or "). If such a quote exists, it must be closed; otherwise, if a > is encountered, the tag is not closed.
The code text is interpreted in Firefox as:
text
The PHP function strip_tags is referenced in line 4036 of ext/standard/string.c. That function calls the internal function php_strip_tags_ex.
Two buffers exist, one for the output, the other for "inside HTML tags". A counter named depth holds the number of open angle brackets (<).
The variable in_q contains the quote character (' or ") if any, and 0 otherwise. The last character is stored in the variable lc.
The function has five states; three are mentioned in the description above the function. Based on this information and the function body, the following states can be derived:
State 0 is the output state (not in any tag)
State 1 means we are inside a normal html tag (the tag buffer contains <)
State 2 means we are inside a php tag
State 3: we came from the output state and encountered the < and ! characters (the tag buffer contains <!)
State 4: inside HTML comment
We just need to be careful that no tag can be inserted, that is, a < followed by a non-whitespace character. Line 4326 checks the case of the < character, which is handled as described below:
If inside quotes (e.g. <a href="inside quotes">), the < character is ignored (removed from the output).
If the next character is a whitespace character, < is added to the output buffer.
If outside an HTML tag, the state becomes 1 ("inside HTML tag") and the last character lc is set to <.
Otherwise, if inside an HTML tag, the counter named depth is incremented and the character is ignored.
If > is met while the tag is open (state == 1), in_q becomes 0 ("not in a quote") and state becomes 0 ("not in a tag"). The tag buffer is discarded.
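The rules above can be sketched as a small state machine. This is a simplified JavaScript illustration, not PHP's actual implementation: it covers only states 0 and 1 (no PHP tags, comments or whitelist), and the name stripTagsSketch is made up for this example.

```javascript
// Simplified sketch of the described rules; states 2-4 are deliberately left out.
function stripTagsSketch(input) {
  let out = "";
  let state = 0; // 0 = output state, 1 = inside an HTML tag
  let depth = 0; // number of extra "<" seen while already inside a tag
  let inQ = "";  // quote character currently open inside a tag, "" if none

  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    const next = input[i + 1];
    if (ch === "<") {
      if (inQ) continue;                        // inside quotes: ignored
      if (next !== undefined && /\s/.test(next)) {
        if (state === 0) out += ch;             // "<" followed by whitespace is plain text
      } else if (state === 0) {
        state = 1;                              // a tag starts
      } else {
        depth++;                                // nested "<" inside a tag
      }
    } else if (ch === ">") {
      if (depth > 0) depth--;                   // closes a nested "<"
      else if (inQ) continue;                   // inside quotes: ignored
      else if (state === 1) state = 0;          // tag ends, tag buffer is discarded
      else out += ch;                           // stray ">" in output state
    } else if ((ch === '"' || ch === "'") && state === 1) {
      if (!inQ) inQ = ch;                       // open a quote
      else if (inQ === ch) inQ = "";            // close the matching quote
    } else if (state === 0) {
      out += ch;                                // ordinary character outside tags
    }
  }
  return out;
}

console.log(stripTagsSketch("<b>hi</b>"));         // hi
console.log(stripTagsSketch('<a href=">">x</a>')); // x
console.log(stripTagsSketch("a < b"));             // a < b
```

Note that the output can still contain < and > characters, so it is not safe to treat it as already escaped.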
Attribute checks (for characters like ' and ") are done on the tag buffer which is discarded. So the conclusion is:
strip_tags without a tag whitelist is safe for inclusion outside tags, no tag will be allowed.
By "outside tags", I mean not in tags, as in <a href="in tag">outside tag</a>. Text may still contain < and >, as in >< a>>. The result is not valid HTML, though: <, > and & still need to be escaped, especially the &. That can be done with htmlspecialchars().
The description for strip_tags without a whitelist argument would be:
Makes sure that no HTML tag exists in the returned string.
I cannot predict future exploits, especially since I haven't looked at the PHP source code for this. However, there have been exploits in the past due to browsers accepting seemingly invalid tags (like <s\0cript>). So it's possible that in the future someone might be able to exploit odd browser behavior.
That aside, sending the output directly to the browser as a full block of HTML should never be insecure:
echo '<div>'.strip_tags($foo).'</div>'
However, this is not safe:
echo '<input value="'.strip_tags($foo).'" />';
because one could easily end the quote via " and insert a script handler.
I think it's much safer to always convert stray < into &lt; (and the same with quotes).
According to this online tool, this string will be "perfectly" escaped, but
the result is another malicious one!
<<a>script>alert('ciao');<</a>/script>
In the string the "real" tags are <a> and </a>, since < and script> alone aren't tags.
I hope I'm wrong or that it's just because of an old version of PHP, but it's better to check in your environment.
YES, strip_tags() is vulnerable to scripting attacks, right through to (at least) PHP 8. Do not use it to prevent XSS. Instead, you should use filter_input().
The reason that strip_tags() is vulnerable is because it does not run recursively. That is to say, it does not check whether or not valid tags will remain after valid tags have been stripped. For example, the string
<<a>script>alert(XSS);<</a>/script> will strip the <a> tag successfully, yet fail to see this leaves
<script>alert(XSS);</script>.
This can be seen (in a safe environment) here.
Strip tags is perfectly safe - if all that you are doing is outputting the text to the html body.
It is not necessarily safe to put it into mysql or url attributes.

javascript generating invalid HTML5 attributes in Firefox

I am noticing some very strange behavior in firefox and I'm wondering if anyone has a strategy for how to normalize or work around this behavior.
Specifically if you provide firefox a basic anchor containing html entities it will unescape those entities, fail to re-escape them and hand you back invalid html.
For example firefox mishandles the following url:
My Original Link
If this url is parsed by firefox it will unescape the &gt;&lt;&quot; and start handling a url like:
My Original Link
This same operation appears to work fine elsewhere, even safari and edge.
I tried quite a few different ways of handing the html to firefox to avoid this problem. Tried manually invoking the parser, tried setting innerHTML, tried jQuery html(), tried giving jQuery constructor a giant string, etc. All methods produced the same broken result.
See a fiddle here:
https://jsfiddle.net/kamelkev/hfd2b6sn/
I am a little mystified by how broken this handling seems to be. There must be a way to work around this issue, but I can't seem to find a way.
My application is an html manipulation tool, so I typically normalize around issues like this by dropping down to XML and handling the problems there before persisting to a dumb key-value store, but in this particular case the <> characters are preventing me from processing this document as XML.
Ideas?
A < or a > is valid inside of an attribute value, unescaped. It's not best practice, but it is valid.
What's happening is that Firefox is parsing the original HTML and making elements out of it. At that point, the original HTML no longer exists. When you call .outerHTML, the HTML is reconstructed from the element.
Firefox then generates it using a different set of rules than Chrome does.
It isn't clear what exactly you need to do this for... really you should edit the DOM and export the HTML for the whole DOM when done. Constantly re-interpreting HTML isn't necessary.
The > and < are unescaped when the parser parses the source to construct the DOM. When you serialize an element back to a string, you are not guaranteed to obtain the same text as the source.
In this case, innerHTML and outerHTML use the HTML fragment serialization algorithm, which escapes attribute values using attribute mode:
Escaping a string (for the purposes of the algorithm above) consists
of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any
occurrences of the ">" character by the string "&gt;".
That's why " is escaped to &quot;, but < and > remain.
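As a sketch, the escaping steps quoted above translate to JavaScript roughly as follows. The function name escapeForSerialization is made up for illustration, and the code follows the steps as quoted here rather than any particular browser's implementation.

```javascript
// Hypothetical helper mirroring the quoted steps; attributeMode selects the branch.
function escapeForSerialization(text, attributeMode) {
  let out = text
    .replace(/&/g, "&amp;")        // step 1: ampersands
    .replace(/\u00A0/g, "&nbsp;"); // step 2: no-break spaces
  if (attributeMode) {
    out = out.replace(/"/g, "&quot;");                     // step 3: attribute mode only
  } else {
    out = out.replace(/</g, "&lt;").replace(/>/g, "&gt;"); // step 4: text mode only
  }
  return out;
}

console.log(escapeForSerialization('><"', true));  // ><&quot;
console.log(escapeForSerialization('><"', false)); // &gt;&lt;"
```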
This is OK, because < and > are allowed in HTML double-quoted attribute values:
U+0022 QUOTATION MARK ("): Switch to the after attribute value (quoted) state.
U+0026 AMPERSAND (&): Switch to the character reference in attribute value state [...]
U+0000 NULL: Parse error [...]
EOF: Parse error [...]
Anything else: Append the current input character to the current attribute's value.
However, XML does not allow < in attribute values. If you want to get valid XHTML, use an XML serializer:
var s = new XMLSerializer();
var str = s.serializeToString(document.querySelector('a'));
console.log(str);
My Original Link

"&lang" misinterpreted in URL

I am developing for javascript disabled phones. My code looks like this
<a href="someurl?var=a&lang=english">Link 1</a>
<a href="someurl?lang=english&var=a">Link 2</a>
But the browser interprets the URL as -
someurl?var=a%e2%8c%a9=english (Link 1, incorrect)
someurl?lang=english&var=a (Link 2 works just fine !)
It seems like &lang=english is being converted to a%e2%8c%a9=english
Could someone explain why this is happening?
In HTML, the & character represents the start of a character reference.
If you try to specify an invalid character reference, then browsers will perform error recovery and treat it as an ampersand instead.
From the HTML DTD:
<!ENTITY lang CDATA "&#9001;" -- left-pointing angle bracket = bra,
U+2329 ISOtech -->
… so &lang is not an invalid character reference.
To include an ampersand character as data, use the character reference for an ampersand: &amp;
By HTML 4.01 rules, the &lang entity reference denotes the character U+2329 LEFT-POINTING ANGLE BRACKET “〈”. In UTF-8 encoding, that character is represented as 0xE2 0x8C 0xA9, and therefore in a URL, it gets %-encoded as a%e2%8c%a9.
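The byte sequence can be checked with a short JavaScript snippet. It only illustrates the UTF-8 and percent-encoding step, not the browser's entity handling; note that encodeURIComponent emits uppercase hex digits, which are equivalent to the lowercase form seen in the question.

```javascript
// U+2329 LEFT-POINTING ANGLE BRACKET, the character that &lang denotes in HTML 4.01.
const langChar = "\u2329";

// UTF-8 bytes of the character.
const bytes = Array.from(new TextEncoder().encode(langChar), (b) => b.toString(16));
console.log(bytes.join(" ")); // e2 8c a9

// Percent-encoded form, as it would appear in a URL.
const encoded = encodeURIComponent(langChar);
console.log(encoded); // %E2%8C%A9

// Rebuilding the broken URL from the question.
console.log("someurl?var=a" + encoded.toLowerCase() + "=english");
// someurl?var=a%e2%8c%a9=english
```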
Nowadays, most browsers don’t work that way. Specifically, in a URL, the reference &lang is not recognized when followed by an equals sign = (even though it is valid HTML 4.01 in that context).
To deal with browsers that may follow the old rules, as well as to comply with syntax rules independently of HTML version, escape each occurrence of the ampersand “&” as &amp;. It is safest to do this for all occurrences of “&” as a data character, in attribute values and elsewhere.
Depending on the server-side software that processes the URL when it has been followed, you might be able to use an unproblematic character like “;” instead of “&” as a separator.
http://www.htmlhelp.com/tools/validator/problems.html#amp (linked by w3 from http://validator.w3.org/docs/help.html) explains it.
& marks the start of a so-called entity. Entities are, for example, &euro; (€) and &lt; (<).
If you now put &lang in the URL, this throws an error in any validator, because it's not a valid entity. The browser then escapes this sequence.
Solution:
You have to escape the & by its own entity, &amp;, so the URL will look like:
<a href="someurl?var=a&amp;lang=english">Link 1</a>

How to handle possibly HTML encoded values in javascript

I have a situation where I'm not sure if the input I get is HTML encoded or not. How do I handle this? I also have jQuery available.
function someFunction(userInput){
$someJqueryElement.text(userInput);
}
// userInput "<script>" returns "&lt;script&gt;", which is fine
// userInput "&lt;script&gt;" returns "&amp;lt;script&amp;gt;", which is bad
I could avoid escaping ampersands (&), but what are the risks in that? Any help is very much appreciated!
Important note: This user input is not in my control. It returns from a external service, and it is possible for someone to tamper with it and avoid the html escaping provided by that service itself.
You really need to make sure you avoid these situations as it introduces really difficult conditions to predict.
Try adding an additional variable input to the function.
function someFunction(userInput, isEncoded){
//Add some conditional logic based on isEncoded
$someJqueryElement.text(userInput);
}
If you look at products like fckEditor, you can choose to edit source or use the rich text editor. This prevents the need for automatic encoding detection.
If you still insist on automatically detecting HTML-encoded characters, I would recommend using indexOf to verify that certain key sequences exist.
str.indexOf('&lt;') !== -1
The example above will detect the &lt; sequence.
~~~New text added after edit below this line.~~~
Finally, I would suggest looking at this answer. They suggest using the decode function and detecting lengths.
var string = "Your encoded & decoded string here";
// Decode URL escapes, then turn the &lt; and &gt; entities back into < and >.
function decode(str){
    return decodeURIComponent(str).replace(/&lt;/g,'<').replace(/&gt;/g,'>');
}
if(string.length == decode(string).length){
    // The string does not contain any encoded html.
}else{
    // The string contains encoded html.
}
Again, this still has the problem of a user faking out the process by entering those specially encoded characters, but that is what html encoding is. So it would be proper to assume html encoding as soon as one of these character sequences comes up.
You must always correctly encode untrusted input before concatenating it into a structured language like HTML.
Otherwise, you'll enable injection attacks like XSS.
If the input is supposed to contain HTML formatting, you should use a sanitizer library to strip all potentially unsafe tags & attributes.
You can also use the regex /<|>|&(?![a-z]+;)/ to check whether a string has any non-encoded characters; however, you cannot distinguish a string that has been encoded from an unencoded string that talks about encoding.
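A small sketch of that check, using the regex with its closing delimiter in place. The function name hasUnencodedChars is hypothetical. Because the pattern only recognises lowercase named entities, numeric references such as &#39; are reported as unencoded, which is a limitation of the pattern rather than something the code tries to fix.

```javascript
// Matches a raw <, a raw >, or an & that does not start a lowercase named entity.
const UNENCODED = /<|>|&(?![a-z]+;)/;

// Hypothetical helper: true if the string contains characters that look unencoded.
function hasUnencodedChars(s) {
  return UNENCODED.test(s);
}

console.log(hasUnencodedChars("<script>"));       // true
console.log(hasUnencodedChars("&lt;script&gt;")); // false
console.log(hasUnencodedChars("Tom & Jerry"));    // true
console.log(hasUnencodedChars("&#39;"));          // true (numeric references are not recognised)
```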

XSS and other ways to terminate JavaScript strings

Is there a different way to terminate strings in JavaScript?
I'm testing a server for XSS vulnerabilities, and I'm seeing the following code in the HTTP response:
<script>
var myVar = "USER CONTROLLED STRING";
</script>
The user-controlled string comes from the URL and all double quotes are removed before the response is generated. Besides that, all other characters are allowed.
Is XSS possible?
(And yes, I know the proper thing to do here would be to hex encode (\xHH) all non-alphanumeric characters, as recommended by the OWASP XSS Prevention Cheat Sheet, but from a tester's perspective I want to know if I could exploit this.)
Yes, you can perform an attack: http://jsfiddle.net/vTmq6/1/
<script>
var myVar = "</script><script>alert('hacked');</script>";
</script>
A </script> would denote the script element’s end tag regardless of whether the resulting JavaScript code is valid or not. This is due to the restrictions on the contents of raw text elements, which the script element belongs to:
The text in raw text and RCDATA elements must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F).
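On the defensive side, the \xHH encoding mentioned in the question can be sketched like this. The function name escapeForJsString is hypothetical, and the \uHHHH and \u{...} fallbacks for characters above 0xFF are an assumption added here so that the output is still a valid JavaScript string literal body.

```javascript
// Hypothetical helper: escape everything except ASCII letters and digits.
function escapeForJsString(value) {
  return String(value).replace(/[^A-Za-z0-9]/gu, (ch) => {
    const code = ch.codePointAt(0);
    const hex = code.toString(16).toUpperCase();
    if (code < 0x100) return "\\x" + hex.padStart(2, "0");   // \xHH
    if (code <= 0xffff) return "\\u" + hex.padStart(4, "0"); // \uHHHH (assumption)
    return "\\u{" + hex + "}";                               // astral code points (assumption)
  });
}

const userInput = "</script><script>alert('hacked');</script>";
console.log('var myVar = "' + escapeForJsString(userInput) + '";');
```

Since the encoded output contains neither < nor /, the sequence </script can no longer appear inside the script element.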
