I've read a lot of the HTML encoding posts over the last day trying to solve this. I have only just managed to locate the problem.
Basically I have set an attribute on an embed tag with jQuery. It all works fine in the browser.
Now I want to read the HTML itself to add the result as a value for an input field to let the user copy & paste it.
The PROBLEM is that the .html() function (and also plain JS .innerHTML) converts the '&' char into '&amp;'. Using different HTML encoder functions doesn't make a difference. I need the '&' char in the embed code.
Here is the code:
HTML:
<div id="preview_small">
<object><embed src="main.swf?XY=xyz&YXX=xyzz"></embed>
</object></div>
jQuery:
$("#preview_small object").clone().html();
returns
... src=main.swf?XY=xyz&amp;YXX=xyzz ...
When I use:
$("#preview_small object").clone().children("embed").attr("src");
returns
main.swf?XY=xyz&YXX=xyzz
Any ideas how I can get the '&' char directly, without using a regex after I get the string with .html()?
I need the & char in the embed code.
No you don't. This:
<embed src="xyz&YXX=xyz"></embed>
is invalid HTML. It'll work in browsers since they try to fix up mistakes like this, but only as long as the string YXX doesn't happen to match an HTML entity name. You don't want to rely on that.
This:
<embed src="xyz&amp;YXX=xyz"></embed>
is correct, works everywhere, and is the version you should be telling your users to copy and paste.
attr("src") returns xyz&YXX=xyz
Yes, that's the underlying value of that attribute. Attribute values and text content can contain almost any character directly. It's only the HTML serialisation of them where they have to be encoded:
<div title="a&lt;b&quot;&amp;c&gt;d">
$('div').attr('title') -> a<b"&c>d
I want to read the HTML itself to add the result as a value for an input field
<textarea id="foo"></textarea>
$('#foo').val($('#preview_small object').html());
However note that the serialised output of innerHTML/html() is not in any particular fixed dialect of HTML, and in particular IE may give you code that, though generally understandable by browsers, is also not technically valid:
$('#somediv').html('<div title="a/b"></div>');
$('#somediv').html() -> '<DIV title=a/b></DIV>' - missing quotes
So if you know the particular format of HTML you want to present to the user, you may be better off generating it yourself:
function encodeHTML(s) {
    return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/"/g, '&quot;');
}
var src= 'XY=xyz&YXX=xyzz';
$('#foo').val('<embed src="'+encodeHTML(src)+'"><\/embed>');
(The \/ in the close tag is just so that doesn't get mistaken as the end of a <script> block, in case you're in one.)
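As a rough check of what the user would end up copying, the same idea can be run outside the browser. This is only a sketch: buildEmbedCode is a hypothetical helper name, and the URL is the one from the question.

```javascript
// Hypothetical helper: builds the snippet shown to the user, encoding the
// attribute value so that the generated markup is valid HTML.
function buildEmbedCode(src) {
  const encoded = src
    .replace(/&/g, "&amp;")   // must run first, or later entities get double-encoded
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
  return '<embed src="' + encoded + '"></embed>';
}

const embedCode = buildEmbedCode("main.swf?XY=xyz&YXX=xyzz");
console.log(embedCode); // <embed src="main.swf?XY=xyz&amp;YXX=xyzz"></embed>
```

When a browser parses the pasted snippet, the src attribute value it sees is main.swf?XY=xyz&YXX=xyzz again, which is why the &amp; form is the one to hand out.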
Is there a known XSS or other attack that makes it past a
$content = "some HTML code";
$content = strip_tags($content);
echo $content;
?
The manual has a warning:
This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.
but that is related to using the allowable_tags parameter only.
With no allowed tags set, is strip_tags() vulnerable to any attack?
Chris Shiflett seems to say it's safe:
Use Mature Solutions
When possible, use mature, existing solutions instead of trying to create your own. Functions like strip_tags() and htmlentities() are good choices.
is this correct? Please if possible, quote sources.
I know about HTML purifier, htmlspecialchars() etc.- I am not looking for the best method to sanitize HTML. I just want to know about this specific issue. This is a theoretical question that came up here.
Reference: strip_tags() implementation in the PHP source code
As its name may suggest, strip_tags should remove all HTML tags. The only way we can prove it is by analyzing the source code. The following analysis applies to a strip_tags('...') call, without a second argument for whitelisted tags.
First of all, some theory about HTML tags: a tag starts with a < followed by non-whitespace characters. If this string starts with a ?, it should not be parsed. If it starts with !--, it is considered a comment and the following text should not be parsed either. A comment is terminated by -->; inside such a comment, characters like < and > are allowed. Attributes can occur in tags, and their values may optionally be surrounded by a quote character (' or "). If such a quote exists, it must be closed; otherwise, if a > is encountered, the tag is not closed.
The code text is interpreted in Firefox as:
text
The PHP function strip_tags is referenced in line 4036 of ext/standard/string.c. That function calls the internal function php_strip_tags_ex.
Two buffers exist, one for the output, the other for "inside HTML tags". A counter named depth holds the number of open angle brackets (<).
The variable in_q contains the quote character (' or ") if any, and 0 otherwise. The last character is stored in the variable lc.
The function has five states, three of which are mentioned in the description above the function. Based on this information and the function body, the following states can be derived:
State 0 is the output state (not in any tag)
State 1 means we are inside a normal html tag (the tag buffer contains <)
State 2 means we are inside a php tag
State 3: we came from the output state and encountered the < and ! characters (the tag buffer contains <!)
State 4: inside HTML comment
We just need to be careful that no tag can be inserted, that is, a < followed by a non-whitespace character. Line 4326 checks a case with the < character, which is described below:
If inside quotes (e.g. <a href="inside quotes">), the < character is ignored (removed from the output).
If the next character is a whitespace character, < is added to the output buffer.
If outside an HTML tag, the state becomes 1 ("inside HTML tag") and the last character lc is set to <
Otherwise, if inside an HTML tag, the counter named depth is incremented and the character is ignored.
If > is met while the tag is open (state == 1), in_q becomes 0 ("not in a quote") and state becomes 0 ("not in a tag"). The tag buffer is discarded.
Attribute checks (for characters like ' and ") are done on the tag buffer which is discarded. So the conclusion is:
strip_tags without a tag whitelist is safe for inclusion outside tags, no tag will be allowed.
By "outside tags", I mean the text between tags rather than anything inside a tag itself, such as an attribute value. The text may still contain < and >, as in >< a>>. The result is not valid HTML, though: <, > and & still need to be escaped, especially the &. That can be done with htmlspecialchars().
The description for strip_tags without a whitelist argument would be:
Makes sure that no HTML tag exists in the returned string.
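The handling of states 0 and 1 described above can be sketched in a few lines of JavaScript. This is a simplification for illustration, not PHP's actual php_strip_tags_ex: PHP tags, comments and the whitelist are left out, and stripTagsSketch and its variable names are made up for this example.

```javascript
// Simplified sketch of states 0 (output) and 1 (inside a tag) only.
function stripTagsSketch(input) {
  let out = "";
  let state = 0;  // 0 = output state, 1 = inside an HTML tag
  let depth = 0;  // number of extra '<' seen while inside a tag
  let inQ = "";   // current quote character, or "" when not inside quotes

  for (let i = 0; i < input.length; i++) {
    const c = input[i];
    if (state === 0) {
      if (c !== "<") {
        out += c;                       // ordinary text is copied to the output
      } else if (i + 1 < input.length && /\s/.test(input[i + 1])) {
        out += c;                       // '<' followed by whitespace is not a tag
      } else {
        state = 1;                      // a tag starts here
      }
    } else if (c === '"' || c === "'") {
      if (inQ === "") inQ = c;          // opening quote
      else if (inQ === c) inQ = "";     // matching closing quote
    } else if (c === "<") {
      if (inQ === "") depth++;          // nested '<' outside quotes
    } else if (c === ">" && inQ === "") {
      if (depth > 0) depth--;           // closes a nested '<'
      else state = 0;                   // tag is finished, its content is discarded
    }
  }
  return out;
}

console.log(stripTagsSketch("<b>hi</b>"));                           // hi
console.log(stripTagsSketch("a < b"));                               // a < b
console.log(stripTagsSketch('<a href="x>y">text</a>'));              // text
console.log(stripTagsSketch("<<a>script>alert(1);<</a>/script>"));   // alert(1);
```

In this sketch everything between the first < and the > that brings depth back to zero is thrown away, so nested input does not leave a reassembled tag behind.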
I cannot predict future exploits, especially since I haven't looked at the PHP source code for this. However, there have been exploits in the past due to browsers accepting seemingly invalid tags (like <s\0cript>). So it's possible that in the future someone might be able to exploit odd browser behavior.
That aside, sending the output directly to the browser as a full block of HTML should never be insecure:
echo '<div>'.strip_tags($foo).'</div>'
However, this is not safe:
echo '<input value="'.strip_tags($foo).'" />';
because one could easily end the quote via " and insert a script handler.
I think it's much safer to always convert stray < into &lt; (and the same with quotes).
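The attribute case can be illustrated with a short JavaScript sketch (JavaScript rather than PHP only to keep the example runnable on its own). The payload contains no tags at all, so tag stripping has nothing to remove; escapeAttr is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: escapes the characters that matter inside a
// double-quoted attribute value.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const payload = '" onmouseover="alert(1)';

// Concatenating the raw payload lets it close the quote and add a handler.
const unsafe = '<input value="' + payload + '" />';
// Escaping the quotes keeps the whole payload inside the attribute value.
const safe = '<input value="' + escapeAttr(payload) + '" />';

console.log(unsafe); // <input value="" onmouseover="alert(1)" />
console.log(safe);   // <input value="&quot; onmouseover=&quot;alert(1)" />
```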
According to this online tool, this string will be "perfectly" escaped, but
the result is another malicious one!
<<a>script>alert('ciao');<</a>/script>
In the string the "real" tags are <a> and </a>, since < and script> alone aren't tags.
I hope I'm wrong or that it's just because of an old version of PHP, but it's better to check in your environment.
YES, strip_tags() is vulnerable to scripting attacks, right through to (at least) PHP 8. Do not use it to prevent XSS. Instead, you should use filter_input().
The reason that strip_tags() is vulnerable is because it does not run recursively. That is to say, it does not check whether or not valid tags will remain after valid tags have been stripped. For example, the string
<<a>script>alert(XSS);<</a>/script> will strip the <a> tag successfully, yet fail to see this leaves
<script>alert(XSS);</script>.
This can be seen (in a safe environment) here.
Strip tags is perfectly safe - if all that you are doing is outputting the text to the html body.
It is not necessarily safe to put it into mysql or url attributes.
I am noticing some very strange behavior in firefox and I'm wondering if anyone has a strategy for how to normalize or work around this behavior.
Specifically if you provide firefox a basic anchor containing html entities it will unescape those entities, fail to re-escape them and hand you back invalid html.
For example firefox mishandles the following url:
My Original Link
If this url is parsed by firefox it will unescape the &gt;, &lt; and &quot; entities and start handling a url like:
My Original Link
This same operation appears to work fine elsewhere, even safari and edge.
I tried quite a few different ways of handing the html to firefox to avoid this problem. Tried manually invoking the parser, tried setting innerHTML, tried jQuery html(), tried giving jQuery constructor a giant string, etc. All methods produced the same broken result.
See a fiddle here:
https://jsfiddle.net/kamelkev/hfd2b6sn/
I am a little mystified by how broken this handling seems to be. There must be a way to work around this issue, but I can't seem to find a way.
My application is an html manipulation tool, so I typically normalize around issues like this by dropping down to XML and handling the problems there before persisting to a dumb key-value store, but in this particular case the <> characters are preventing me from processing this document as XML.
Ideas?
A < or a > is valid inside of an attribute value, unescaped. It's not best practice, but it is valid.
What's happening is that Firefox is parsing the original HTML and making elements out of it. At that point, the original HTML no longer exists. When you call .outerHTML, the HTML is reconstructed from the element.
Firefox then generates it using a different set of rules than Chrome does.
It isn't clear what exactly you need to do this for... really you should edit the DOM and export the HTML for the whole DOM when done. Constantly re-interpreting HTML isn't necessary.
The > and < are unescaped when the parser parses the source to construct the DOM. When you serialize an element back to a string, you are not guaranteed to obtain the same text as the source.
In this case, innerHTML and outerHTML use the HTML fragment serialization algorithm, which escapes attribute values using attribute mode:
Escaping a string (for the purposes of the algorithm above) consists
of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any
occurrences of the ">" character by the string "&gt;".
That's why " is escaped to &quot;, but < and > remain.
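The quoted steps are small enough to sketch directly. This follows the text as quoted here, not necessarily what any particular browser version does today, and escapeForSerialization is a hypothetical name.

```javascript
// Sketch of the escaping steps quoted above.
function escapeForSerialization(s, attributeMode) {
  let out = s
    .replace(/&/g, "&amp;")        // step 1: ampersands first
    .replace(/\u00A0/g, "&nbsp;"); // step 2: no-break spaces
  if (attributeMode) {
    out = out.replace(/"/g, "&quot;");                       // attribute mode: only quotes
  } else {
    out = out.replace(/</g, "&lt;").replace(/>/g, "&gt;");   // text mode: only < and >
  }
  return out;
}

const raw = 'a<b"&c>d';
const asAttribute = escapeForSerialization(raw, true);
const asText = escapeForSerialization(raw, false);

console.log(asAttribute); // a<b&quot;&amp;c>d
console.log(asText);      // a&lt;b"&amp;c&gt;d
```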
This is OK, because < and > are allowed in HTML double-quoted attribute values:
U+0022 QUOTATION MARK ("): Switch to the after attribute value (quoted) state.
U+0026 AMPERSAND (&): Switch to the character reference in attribute value state [...]
U+0000 NULL: Parse error [...]
EOF: Parse error [...]
Anything else: Append the current input character to the current attribute's value.
However, XML does not allow < and > in attribute values. If you want to get valid XHTML, use a XML serializer:
var s = new XMLSerializer();
var str = s.serializeToString(document.querySelector('a'));
console.log(str);
My Original Link
I am using AJAX to handle a form submission. The AJAX request returns a javascript script with text string arguments. I run into a problem when I try to add the AJAX returned script to the existing page.
Here are the different things I've tried to accomplish this already:
newAjaxBlock.appendChild(document.createTextNode(ajaxRequest.responseText));
newAjaxBlock.innerHTML = ajaxRequest.responseText;
newAjaxBlock.textContent = ajaxRequest.responseText;
The problem is that if I use .innerHTML to insert the returned script, it converts the escaped characters in the argument text string to their HTML equivalent and the script will throw errors because of single quotes and other characters in the string.
I expected .innerHTML to take the text and write it exactly as PHP provides it without unexpected conversions from escaped characters to their HTML equivalents.
For example I would generate a script in PHP and run it through htmlspecialchars() and make a text string exactly as follows:
<script type='text/javascript' id='layerScript'>Lib.alertFunction(arg1, $arg2, '&lt;p&gt;You changed THING from &quot;value1&quot; to &quot;newValue&quot;.&lt;/p&gt;');</script>
But instead .innerHTML converts it to this:
<script type='text/javascript' id='layerScript'>Lib.alertFunction(arg1, $arg2, '<p>You changed THING from "value1" to "newValue".</p>');</script>
and as you can see, the script won't work with single quotes and other characters messing up the argument list.
In contrast, when I tried using the createTextNode or .textContent options it creates a text node that ignores the HTML tags and shows it ALL as text instead of interpreting the HTML. This is not a surprise to me but leaves me with no option that actually just puts the HTML code in as it's written without converting the escaped characters.
All of the code works exactly as I expect and need it to except when the script argument contains single quotes or lt and gt symbols so I know I have narrowed the problem down to this single issue. I don't want jquery suggestions and I know I could code for an extra few days to make a function that does what I need but I want to know if there's something that does what .innerHTML does without converting escaped characters before I waste that time.
This exact question was already asked and was answered with "use .textContent" which as I mentioned doesn't work to insert formatted HTML with AJAX.
I have the below code in my JSP. UI displays every character correctly other than "&".
<c:out value="<script>var escapedData=unescape('${column}');
$('div').html(escapedData);</script>" escapeXml="false" /> </div>
E.g. 1) working case
input = ni!er#
Value in my escapedData variable is ni%21er%40. Now when I put it in my div using
$('div').html(escapedData); then o/p on html is as expected
E.g. 2) Issue case
input = nice&
Value in my escapedData variable is nice%26. Now when I put it in my div using
$('div').html(escapedData); then also it displays below
$('#test20').html('nice%26');
However, when output is displayed in JSP, it just prints "nice". It truncates everything after &.
Any suggestions?
It looks like you have some misunderstandings about what unescape(val)/escape(val) do and where you need them, and about what you need to pay attention to when you use .html().
HTML and URIs have certain characters that have special meanings. The most important ones are:
HTML: <, >, &
URI: /,?,%,&
If you want to use one of those characters in HTML or URI you need to escape them.
The escaping for URI and for HTML are different.
The functions unescape/escape (deprecated) and decodeURI/encodeURI are for URIs. But what you want is to escape your data into the HTML format.
There is no built-in function in JS that does this, but you could, for example, use the code from the answer to this question: Can I escape html special chars in javascript?
But as it seems that you use jQuery you could think of just using .text instead of .html as this will do the escaping for you.
An additional note:
I'm pretty sure that the var escapedData=unescape('${column}'); does not do anything. I assume that ${column} already is ni!er#/nice&.
So please check your source code. If var escapedData=unescape('${column}'); will look like var escapedData=unescape('ni!er#'); then you should remove the unescape otherwise you would not get the expected result if the ${column} contains something like e.g. %23.
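The difference between URI escaping and HTML escaping described above can be seen in a small sketch. It uses the standard encodeURIComponent/decodeURIComponent functions instead of the deprecated escape/unescape; escapeHtml is a hypothetical helper, since JS has no built-in one.

```javascript
// Hypothetical helper: HTML escaping for text that goes into element content.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const input = "nice&";

// URI escaping: meant for values placed inside a URL.
const uriEscaped = encodeURIComponent(input);
const uriRoundTrip = decodeURIComponent(uriEscaped);

// HTML escaping: meant for values placed inside markup.
const htmlEscaped = escapeHtml(input);

console.log(uriEscaped);   // nice%26
console.log(uriRoundTrip); // nice&
console.log(htmlEscaped);  // nice&amp;
```

Passing htmlEscaped to .html(), or the raw input to .text(), should both display nice& on the page.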
I am trying to populate a DOM element with ID 'myElement'. The content which I'm populating is a mix of text and HTML elements.
Assume following is the content I wish to populate in my DOM element.
var x = "<b>Success</b> is a matter of hard work &luck";
I tried using innerHTML as follows,
document.getElementById("myElement").innerHTML=x;
This resulted in chopping off of the last word in my sentence.
Apparently, the problem is due to the '&' character present in the last word. I played around with the '&' and innerHTML and following are my observations.
If the last word of the content is less than 10 characters and if it has a '&' character present in it, innerHTML chops off the sentence at '&'.
This problem does not happen in firefox.
If I use innerText the last word is intact, but then all the HTML tags which are part of the content become plain text.
I tried populating through jQuery's #html method,
$("#myElement").html(x);
This approach solves the problem in IE but not in chrome.
How can I insert a HTML content with a last word containing '&' without it being chopped off in all browsers?
Update : 1. I tried html encoding the content which I am trying to insert into the DOM. When I encode the content, the html tags which are part of the content becomes plain string.
For the above mentioned content, I expect the result to be rendered as,
Success is a matter of hard work &luck
but when I encode what I actually get in the rendered page is,
<b>Success</b> is a matter of hard work &luck
You should replace your & with &amp;.
The & (ampersand) character is used within HTML to represent various special characters. For example, &quot; = ", &lt; = <, etcetera. Now, &luck clearly is not a valid HTML entity (for one it is missing the semicolon). However, various browsers may, due to combinations of error correcting (the semicolon), and the fact that it looks somewhat like an HTML entity (& followed by four characters) try to parse it as such.
Because &luck; is not a valid HTML entity, the original text is lost. Because of this, when using an ampersand in your HTML, always use &amp;.
Update: When this text is entered by a user, it is up to you to escape this character properly. In PHP for example, you would call htmlentities on the text before displaying it to the user. This has the added benefit of filtering out malicious user code such as <script> tags.
The ampersand is a special character in HTML that indicates the start of a character entity reference or numeric character reference, you need to escape it like so:
var x = "<b>Success</b> is a matter of hard work &amp;luck";
Try using this instead:
var x = "<b>Success</b> is a matter of hard work &amp;luck";
By HTML encoding the ampersand, you are ensuring that there is no ambiguity in what you mean when you write "&luck".
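If the text part comes from a variable, one way to keep the markup intact while still encoding the ampersand is to escape only the interpolated values. The following is a minimal sketch using a tagged template; html and escapeText are hypothetical names, not part of any library.

```javascript
// Hypothetical helper: escapes text destined for element content.
function escapeText(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Tagged template: literal parts are kept as markup, interpolated values are escaped.
function html(strings, ...values) {
  return strings.reduce(
    (acc, str, i) => acc + str + (i < values.length ? escapeText(values[i]) : ""),
    ""
  );
}

const lastWord = "&luck";
const x = html`<b>Success</b> is a matter of hard work ${lastWord}`;

console.log(x); // <b>Success</b> is a matter of hard work &amp;luck
```

The resulting string can then be assigned to innerHTML: the <b> tag stays markup, and only the user-supplied part is encoded.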