I'm investigating a bug in our system where a link's title attribute is being set to something akin to click if value > 400 but the actual tooltip being displayed is click if value > 400. This title value is defined by user input and so the original engineer escaped the text so it wouldn't cause a XSS vulnerability. click if value > 400 becomes click if value > 400.
This extra escaping step seems to cause HTML special characters to be escaped too much so their escaped values are being rendered literally.
To be extra thorough I checked the HTML spec and according to this line it appears that the setAttribute function must automatically escape the attribute's value string.
https://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-F68F082
"If an attribute with that name is already present in the element, its value is changed to be that of the value parameter. This value is a simple string; it is not parsed as it is being set. So any markup (such as syntax to be recognized as an entity reference) is treated as literal text, and needs to be appropriately escaped by the implementation when it is written out."
As I understand it, this line means that the setAttribute function should escape HTML special characters. Is that the correct interpretation?
The plain English interpretation of that quote is that setAttribute() does not parse the value as HTML. The reason for that is because you're not writing HTML at all; the value is in plain text, not HTML, so what would normally be special characters in HTML have no special meaning in plain text, and escaping them as though they were HTML would actually be destructive.
> is the HTML representation of >. You only need to encode it in HTML, not in plain text.
Not exactly.
HTML is a data format.
Browsers will parse HTML and generate a DOM from it. It is at this point that character references (like >) get converted to the characters they represent (like >).
When you use setAttribute, you directly change the DOM.
This bypasses the HTML data format entirely so the HTML foo="&" and the JavaScript setAttribute("foo", "&") will give you the same end result.
Related
I am noticing some very strange behavior in firefox and I'm wondering if anyone has a strategy for how to normalize or work around this behavior.
Specifically if you provide firefox a basic anchor containing html entities it will unescape those entities, fail to re-escape them and hand you back invalid html.
For example firefox mishandles the following url:
My Original Link
If this url is parsed by firefox it will unescape the ><" and start handling a url like:
My Original Link
This same operation appears to work fine elsewhere, even safari and edge.
I tried quite a few different ways of handing the html to firefox to avoid this problem. Tried manually invoking the parser, tried setting innerHTML, tried jQuery html(), tried giving jQuery constructor a giant string, etc. All methods produced the same broken result.
See a fiddle here:
https://jsfiddle.net/kamelkev/hfd2b6sn/
I am a little mystified by how broken this handling seems to be. There must be a way to work around this issue, but I can't seem to find a way.
My application is an html manipulation tool, so I typically normalize around issues like this by dropping down to XML and handling the problems there before persisting to a dumb key-value store, but in this particular case the <> characters are preventing me from processing this document as XML.
Ideas?
A < or a > is valid inside of an attribute value, unescaped. It's not best practice, but it is valid.
What's happening is that Firefox is parsing the original HTML and making elements out of it. At that point, the original HTML no longer exists. When you call .outerHTML, the HTML is reconstructed from the element.
Firefox then generates it using a different set of rules than Chrome does.
It isn't clear what exactly you need to do this for... really you should edit the DOM and export the HTML for the whole DOM when done. Constantly re-interpreting HTML isn't necessary.
The > and < are unescaped when the parser parses the source to construct the DOM. When you serialize an element back to a string, you are not guaranteed to obtain the same text as the source.
In this case, innerHTML and outerHTML use the HTML fragment serialization algorithm, which escapes attribute values using attribute mode:
Escaping a string (for the purposes of the algorithm above) consists
of running the following steps:
Replace any occurrence of the "&" character by the string "&".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string " ".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string """.
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "<", and any
occurrences of the ">" character by the string ">".
That's why " is escaped to ", but < and > remain.
This is OK, because < and > are allowed in HTML double-quoted attribute values:
U+0022 QUOTATION MARK ("): Switch to the after attribute value (quoted) state.
U+0026 AMPERSAND (&): Switch to the character reference in attribute value state [...]
U+0000 NULL: Parse error [...]
EOF: Parse error [...]
Anything else: Append the current input character to the current attribute's value.
However, XML does not allow < and > in attribute values. If you want to get valid XHTML, use a XML serializer:
var s = new XMLSerializer();
var str = s.serializeToString(document.querySelector('a'));
console.log(str);
My Original Link
I am trying to include some symbols into a div using JavaScript.
It should look like this:
x ∈ ℝ
, but all I get is: x ∈ ℝ.
var div=document.getElementById("text");
var textnode = document.createTextNode("x ∈ ℝ");
div.appendChild(textnode);
<div id="text"></div>
I had tried document.getElementById("something").innerHTML="x ∈ ℝ" and it worked, so I have no clue why createTextNode method did not.
What should I do in order to output the right thing?
You are including HTML escapes ("entities") in what needs to be text. According to the docs for createTextNode:
data is a string containing the data to be put in the text node
That's it. It's the data to be put in the text node. The DOM spec is just as clear:
Creates a Text node given the specified string.
You want to include Unicode in this string. To include Unicode in a JavaScript string, use Unicode escapes, in the format \uXXXX.
var textnode = document.createTextNode("x \u2208 \u211D");
Or, you could simply include the actual Unicode character and avoid all the trouble:
var textnode = document.createTextNode("x ∈ ℝ");
In this case, just make sure that the JS file is served as UTF-8, you are saving the file as UTF-8, etc.
The reason that setting .innerHTML works with HTML entities is that it sets the content as HTML, meaning it interprets it as HTML, in all regards, including markup, special entities, etc. It may be easier to understand this if you consider the difference between the following:
document.createTextNode("<div>foo</div>");
document.createElement("div").textContent = "<div>foo</div";
document.createElement("div").innerHTML = "<div>foo</div>";
The first creates a text node with the literal characters "<div>foo</div>". The second sets the content of the new element literally to "<div>foo</div>". The third, on the other hand, creates an actual div element inside the new element containing the text "foo".
Every character has a hexadecimal name (for example 0211D). if you want to transform it into a HTML entity, add &#x => ℝ or use the entity name ℝ or the decimal name ℝ which can be found all here: http://www.w3schools.com/charsets/ref_html_entities_4.asp
But when you use JavaScript, in order to make the browser understand that you want to output a unicode symbol and not a string, escape entities are required. To do that, add \u before the hexadecimal name =>\u211D;.
document.createTextNode will automatically html-escape the needed characters. You have to provide those texts as JavaScript strings, either escaped or not:
document.body.appendChild(document.createTextNode("x ∈ ℝ"));
document.body.appendChild(document.createElement("br"));
document.body.appendChild(document.createTextNode("x \u2208 \u211d"));
EDIT: It's not true that the createTextNode function will do actual html escaping here as it doesn't need to. #deceze gave a very good explanation about the connection between the dom and html: html is a textual representation of the dom, thus you don't need any html-related escaping when directly manipulating the dom.
The crux of my problem comes down to this issue:
$('<video allowfullscreen></video>').prop('outerHTML') === '<video allowfullscreen></video>' //Is False
$('<video allowfullscreen></video>').prop('outerHTML') === '<video allowfullscreen=""></video>' //Is True
The input I'm giving to jQuery gets partially mangled and transformed in an unwanted way.
My goal is that I have (trusted) html coming in that I want to modify by adding some attributes and wrapping it in other elements before converting it back to a String and passing it to the user as text they can copy.
So an expected output might be something like:
<div><video class="myClass" allowfullscreen></video></div>
Since the input html is coming from elsewhere I'd like to make as little assumptions about it as possible. So ideally I don't want to take the string and parse over it to fix specific attributes or remove instances of ="" (in case there's a reason at some point to specifically set a property to "").
Even if I don't care about having a value set on these properties the correct value would be allowfullscreen="allowfullscreen" anyways. I don't have control over the html coming in so I need to take it as-is. So I can't simply 'fix' the html to pass along something like allowfullscreen="allowfullscreen".
Are there any options or ways to preserve valueless properties when I go from string->jQuery->string?
I'm even open to other technology suggestions that would be better suited to this sort of DOM manipulation, but jQuery would otherwise be ideal because of how concise its syntax is. Vanilla Javascript can do it properly, but the syntax makes the code more brittle which I would like to avoid.
See HTML5 - 8.1.2.3 Attributes
8.1.2.3 Attributes
Attributes for an element are expressed inside the element's start
tag.
Attributes have a name and a value. Attribute names must consist of
one or more characters other than the space characters, U+0000 NULL,
U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/"
(U+002F), and "=" (U+003D) characters, the control characters, and any
characters that are not defined by Unicode. In the HTML syntax,
attribute names, even those for foreign elements, may be written with
any mix of lower- and uppercase letters that are an ASCII
case-insensitive match for the attribute's name.
Attribute values are a mixture of text and character references,
except with the additional restriction that the text cannot contain an
ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax
Just the attribute name. The value is implicitly the empty string.
In the following scenario:
var evil_string = "...";
$('#mytextarea').val(evil_string);
Do I have to escape an untrusted string before using it as the value of a textarea element?
I understand that I will have to handle the string with care if I want to do anything with it later on, but is the act of putting the string in a textarea without escaping inherently dangerous?
I have done some basic testing and the usual special characters &'"< seem to be successfully added to the textarea without interpretation.
No, you don't need to do that. When you assign directly to property of DOM element (which jQuery's .val does under the hood), the data is interpreted verbatim. You only need to quote text with methods that explicitly treat input as HTML - i.e. outer/innerHTML and like.
Putting unescaped strings as values of textboxes or textareas is fine. You only need to worry about it when you are putting strings in your HTML that could potentially be interpreted as other HTML. Generally speaking, this means you should escape the strings when the text could be a child of some HTML DOM Element. This could be done on the server (as lolka_bolka suggested), or on the client before adding the potentially dangerous string to the DOM.
I am trying to populate a DOM element with ID 'myElement'. The content which I'm populating is a mix of text and HTML elements.
Assume following is the content I wish to populate in my DOM element.
var x = "<b>Success</b> is a matter of hard work &luck";
I tried using innerHTML as follows,
document.getElementById("myElement").innerHTML=x;
This resulted in chopping off of the last word in my sentence.
Apparently, the problem is due to the '&' character present in the last word. I played around with the '&' and innerHTML and following are my observations.
If the last word of the content is less than 10 characters and if it has a '&' character present in it, innerHTML chops off the sentence at '&'.
This problem does not happen in firefox.
If I use innerText the last word is in tact but then all the HTML tags which are part of the content becomes plain text.
I tried populating through jQuery's #html method,
$("#myElement").html(x);
This approach solves the problem in IE but not in chrome.
How can I insert a HTML content with a last word containing '&' without it being chopped off in all browsers?
Update : 1. I tried html encoding the content which I am trying to insert into the DOM. When I encode the content, the html tags which are part of the content becomes plain string.
For the above mentioned content, I expect the result to be rendered as,
Success is a matter of hard work &luck
but when I encode what I actually get in the rendered page is,
<b>Success</b> is a matter of hard work &luck
You should replace your & with &.
The & (ampersand) character is used within HTML to represent various special characters. For example, " = ", < = <, etcetera. Now, &luck clearly is not a valid HTML entity (for one it is missing the semicolon). However, various browsers may, due to combinations of error correcting (the semicolon), and the fact that it looks somewhat like an HTML entity (& followed by four characters) try to parse it as such.
Because &luck; is not a valid HTML entity, the original text is lost. Because of this, when using an ampersand in your HTML, always use &.
Update: When this text is entered by a user, it is up to you to escape this character properly. In PHP for example, you would call htmlentities on the text before displaying it to the user. This has the added benefit of filtering out malicious user code such as <script> tags.
The ampersand is a special character in HTML that indicates the start of a character entity reference or numeric character reference, you need to escape it like so:
var x = "<b>Success</b> is a matter of hard work &luck";
Try using this instead:
var x = "<b>Success</b> is a matter of hard work &luck";
By HTML encoding the ampersand, you are ensuring that there is no ambiguity in what you mean when you write "&luck".