The crux of my problem comes down to this issue:
$('<video allowfullscreen></video>').prop('outerHTML') === '<video allowfullscreen></video>' //Is False
$('<video allowfullscreen></video>').prop('outerHTML') === '<video allowfullscreen=""></video>' //Is True
The input I'm giving to jQuery gets partially mangled and transformed in an unwanted way.
My goal is that I have (trusted) html coming in that I want to modify by adding some attributes and wrapping it in other elements before converting it back to a String and passing it to the user as text they can copy.
So an expected output might be something like:
<div><video class="myClass" allowfullscreen></video></div>
Since the input html is coming from elsewhere I'd like to make as few assumptions about it as possible. So ideally I don't want to take the string and parse over it to fix specific attributes or remove instances of ="" (in case there's a reason at some point to specifically set a property to "").
Even if I don't care about having a value set on these properties the correct value would be allowfullscreen="allowfullscreen" anyways. I don't have control over the html coming in so I need to take it as-is. So I can't simply 'fix' the html to pass along something like allowfullscreen="allowfullscreen".
Are there any options or ways to preserve valueless properties when I go from string->jQuery->string?
I'm even open to other technology suggestions that would be better suited to this sort of DOM manipulation, but jQuery would otherwise be ideal because of how concise its syntax is. Vanilla Javascript can do it properly, but the syntax makes the code more brittle which I would like to avoid.
See HTML5 - 8.1.2.3 Attributes
8.1.2.3 Attributes
Attributes for an element are expressed inside the element's start
tag.
Attributes have a name and a value. Attribute names must consist of
one or more characters other than the space characters, U+0000 NULL,
U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/"
(U+002F), and "=" (U+003D) characters, the control characters, and any
characters that are not defined by Unicode. In the HTML syntax,
attribute names, even those for foreign elements, may be written with
any mix of lower- and uppercase letters that are an ASCII
case-insensitive match for the attribute's name.
Attribute values are a mixture of text and character references,
except with the additional restriction that the text cannot contain an
ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax
Just the attribute name. The value is implicitly the empty string.
I'm investigating a bug in our system where a link's title attribute is being set to something akin to click if value > 400 but the actual tooltip being displayed is click if value &gt; 400. This title value is defined by user input and so the original engineer escaped the text so it wouldn't cause an XSS vulnerability. click if value > 400 becomes click if value &gt; 400.
This extra escaping step seems to cause HTML special characters to be escaped too much so their escaped values are being rendered literally.
To be extra thorough I checked the HTML spec and according to this line it appears that the setAttribute function must automatically escape the attribute's value string.
https://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-F68F082
"If an attribute with that name is already present in the element, its value is changed to be that of the value parameter. This value is a simple string; it is not parsed as it is being set. So any markup (such as syntax to be recognized as an entity reference) is treated as literal text, and needs to be appropriately escaped by the implementation when it is written out."
As I understand it, this line means that the setAttribute function should escape HTML special characters. Is that the correct interpretation?
The plain English interpretation of that quote is that setAttribute() does not parse the value as HTML. That is because you're not writing HTML at all; the value is plain text, not HTML, so characters that would normally be special in HTML have no special meaning, and escaping them as though they were HTML would actually be destructive.
&gt; is the HTML representation of >. You only need to encode it in HTML, not in plain text.
Not exactly.
HTML is a data format.
Browsers will parse HTML and generate a DOM from it. It is at this point that character references (like &gt;) get converted to the characters they represent (like >).
When you use setAttribute, you directly change the DOM.
This bypasses the HTML data format entirely, so the HTML foo="&amp;" and the JavaScript setAttribute("foo", "&") will give you the same end result.
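To make the double-escaping bug concrete, here is a minimal sketch that models the two paths in plain JavaScript, with no DOM required. The helpers `escapeHtml` and `decodeOnce` are hypothetical names introduced for illustration; `decodeOnce` handles only four named references and merely stands in for the single decoding pass an HTML parser performs on attribute values.

```javascript
// Hypothetical helper: escape text for embedding in HTML markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical stand-in for the HTML parser: decodes character references once.
// Only the four references produced by escapeHtml are handled here.
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"' };
function decodeOnce(markup) {
  return markup.replace(/&(amp|lt|gt|quot);/g, (_, name) => NAMED[name]);
}

const userText = "click if value > 400";

// Path 1: the text travels through HTML markup, so it is escaped, then parsed.
const viaMarkup = decodeOnce(escapeHtml(userText));

// Path 2: setAttribute stores the string as-is; no parser is involved.
const viaSetAttribute = userText;

// The bug: escaping first and then calling setAttribute stores the escaped
// text, and nothing ever decodes it.
const escapedThenSet = escapeHtml(userText);

console.log(viaMarkup === viaSetAttribute); // true
console.log(escapedThenSet);                // click if value &gt; 400
```

Under these assumptions, the escaped string only makes sense when a parser will decode it afterwards; handing it to setAttribute leaves the reference visible in the tooltip.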
I am trying to include some symbols into a div using JavaScript.
It should look like this:
x ∈ ℝ
but all I get is: x &isin; &reals;.
var div=document.getElementById("text");
var textnode = document.createTextNode("x &isin; &reals;");
div.appendChild(textnode);
<div id="text"></div>
I had tried document.getElementById("something").innerHTML="x &isin; &reals;" and it worked, so I have no clue why the createTextNode method did not.
What should I do in order to output the right thing?
You are including HTML escapes ("entities") in what needs to be text. According to the docs for createTextNode:
data is a string containing the data to be put in the text node
That's it. It's the data to be put in the text node. The DOM spec is just as clear:
Creates a Text node given the specified string.
You want to include Unicode in this string. To include Unicode in a JavaScript string, use Unicode escapes, in the format \uXXXX.
var textnode = document.createTextNode("x \u2208 \u211D");
Or, you could simply include the actual Unicode character and avoid all the trouble:
var textnode = document.createTextNode("x ∈ ℝ");
In this case, just make sure that the JS file is served as UTF-8, you are saving the file as UTF-8, etc.
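As a quick check that the two spellings produce the same string, the following sketch can be run in any JavaScript engine. The variable names are only for illustration, and the file is assumed to be saved as UTF-8.

```javascript
// The same text written two ways.
const escaped = "x \u2208 \u211D"; // Unicode escapes
const literal = "x ∈ ℝ";           // literal characters (file must be UTF-8)

console.log(escaped === literal); // true

// Both symbols fit in a single UTF-16 code unit, so the string has 5 units.
console.log(literal.length);                      // 5
console.log(literal.codePointAt(2).toString(16)); // 2208
console.log(literal.codePointAt(4).toString(16)); // 211d
```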
The reason that setting .innerHTML works with HTML entities is that it sets the content as HTML, meaning it interprets it as HTML, in all regards, including markup, special entities, etc. It may be easier to understand this if you consider the difference between the following:
document.createTextNode("<div>foo</div>");
document.createElement("div").textContent = "<div>foo</div>";
document.createElement("div").innerHTML = "<div>foo</div>";
The first creates a text node with the literal characters "<div>foo</div>". The second sets the content of the new element literally to "<div>foo</div>". The third, on the other hand, creates an actual div element inside the new element containing the text "foo".
Every character has a hexadecimal code (for example 0211D). If you want to write it as an HTML entity, add &#x in front and a semicolon after it => &#x211D;, or use the entity name &reals; or the decimal form &#8477;. These can all be found here: http://www.w3schools.com/charsets/ref_html_entities_4.asp
But JavaScript does not interpret HTML entities inside strings, so to make the browser understand that you want to output a Unicode symbol and not the literal text, use a JavaScript escape instead. To do that, add \u before the four-digit hexadecimal code => \u211D.
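The relationship between a code point and its different spellings can be sketched in plain JavaScript. The helper name `spellings` is hypothetical, and the sketch assumes a code point at or below U+FFFF so that a four-digit \u escape is enough.

```javascript
// Hypothetical helper: list the ways to write one code point (U+0000..U+FFFF).
function spellings(codePoint) {
  const hex = codePoint.toString(16).toUpperCase().padStart(4, "0");
  return {
    htmlHex: "&#x" + hex + ";",                // for HTML source
    htmlDecimal: "&#" + codePoint + ";",       // for HTML source
    jsEscape: "\\u" + hex,                     // text to type inside a JS string literal
    character: String.fromCharCode(codePoint), // the actual character
  };
}

const reals = spellings(0x211d);
console.log(reals.htmlHex);                // &#x211D;
console.log(reals.htmlDecimal);            // &#8477;
console.log(reals.jsEscape);               // \u211D
console.log(reals.character === "\u211D"); // true
```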
document.createTextNode will automatically html-escape the needed characters. You have to provide those texts as JavaScript strings, either escaped or not:
document.body.appendChild(document.createTextNode("x ∈ ℝ"));
document.body.appendChild(document.createElement("br"));
document.body.appendChild(document.createTextNode("x \u2208 \u211d"));
EDIT: It's not true that the createTextNode function will do actual html escaping here as it doesn't need to. #deceze gave a very good explanation about the connection between the dom and html: html is a textual representation of the dom, thus you don't need any html-related escaping when directly manipulating the dom.
In the following scenario:
var evil_string = "...";
$('#mytextarea').val(evil_string);
Do I have to escape an untrusted string before using it as the value of a textarea element?
I understand that I will have to handle the string with care if I want to do anything with it later on, but is the act of putting the string in a textarea without escaping inherently dangerous?
I have done some basic testing and the usual special characters &'"< seem to be successfully added to the textarea without interpretation.
No, you don't need to do that. When you assign directly to a property of a DOM element (which jQuery's .val does under the hood), the data is interpreted verbatim. You only need to escape text passed to methods that explicitly treat their input as HTML, i.e. outerHTML/innerHTML and the like.
Putting unescaped strings as values of textboxes or textareas is fine. You only need to worry about it when you are putting strings in your HTML that could potentially be interpreted as other HTML. Generally speaking, this means you should escape the strings when the text could be a child of some HTML DOM Element. This could be done on the server (as lolka_bolka suggested), or on the client before adding the potentially dangerous string to the DOM.
I'd like to use an array as a data-* attribute, and a lot of StackOverflow answers suggest that I should use JSON.stringify():
How to pass an array into jQuery .data() attribute
Store and use an array using the HTML data tag and jQuery
https://gist.github.com/charliepark/4266921
etc.
So, if I have this array: ['something', 'some\'thing', 'some"thing'] it will be serialized to "["something","some'thing","some\"thing"]" and therefore it will fit neither data-*='' nor data-*="", because either ' or " will break the HTML tag.
Am I missing something, or is encodeURIComponent() the true solution to encoding arrays like that? Why has nobody noticed this in other StackOverflow answers?
The reasoning that JSON.stringify is not guaranteed to be safe in HTML attributes when the text is part of the HTML markup itself is valid. However, there is no escaping issue if using one of the access methods (eg. .data or .attr) to assign the value as these do not directly manipulate raw HTML text.
While encodeURIComponent would "work" as it escapes all the problematic characters, it both results in overly ugly values/markup and requires a manual decodeURIComponent step when consuming the values - yuck!
Instead, if inserting the data directly into the HTML, simply "html encode" the value and use the result as the attribute value. Such a function comes with most server-side languages, although an equivalent is not supplied natively with JavaScript.
Assuming the attribute values are quoted, the problematic characters that need to be replaced with the appropriate HTML entities are:
& - &amp;, escape-the-escape, applied first
" - &quot;, for double-quoted attribute
' - &#39;, for single-quoted attribute
Optional (required for XML): < and >
Using the above approach relies on the parsing of the HTML markup, and the automatic decoding of HTML entities therein, such that the actual (non-encoded) result is stored as the data-attribute value in the DOM.
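A minimal sketch of that approach in JavaScript might look like the following. The function name `escapeAttribute` and the attribute name `data-items` are hypothetical, and the sketch assumes the attribute value is always wrapped in quotes in the markup.

```javascript
// Hypothetical helper: HTML-encode a string for use as a quoted attribute value.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;")  // escape-the-escape, applied first
    .replace(/"/g, "&quot;") // for double-quoted attributes
    .replace(/'/g, "&#39;")  // for single-quoted attributes
    .replace(/</g, "&lt;")   // optional in HTML, required for XML
    .replace(/>/g, "&gt;");
}

const items = ["something", "some'thing", 'some"thing'];
const json = JSON.stringify(items);
const markup = '<div data-items="' + escapeAttribute(json) + '"></div>';

console.log(json);
// ["something","some'thing","some\"thing"]
console.log(markup);
// <div data-items="[&quot;something&quot;,&quot;some&#39;thing&quot;,&quot;some\&quot;thing&quot;]"></div>
```

When a browser parses this markup it decodes the references, so reading the attribute back and passing it to JSON.parse should yield the original array without any manual decoding step.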
I have a div that contains a settings icon, which is an HTML miscellaneous symbol:
<span class="settings-icon">&#9881;</span>
I have a Jasmine test that checks the div contents to make sure that they have not changed.
it("the settings div should contain &#9881;", function() {
  var settingsIconDiv = $('.settings-icon');
  expect(settingsIconDiv.text())
    .toContain('&#9881;');
});
It will not pass, because the text is evaluated as the gear glyph ⚙.
How do I decode the glyph in order to pass the test?
To get the actual character from its Unicode code point, so you can compare it with the text from the HTML, you can use String.fromCharCode(), e.g.
.toContain(String.fromCharCode(9881))
You should check against the string '⚙' or, if you do not know how to enter it in your code, the escape notation \u2699. There are other, clumsier ways to construct a string containing the character, but simplicity is best.
No matter how the character is written in HTML source code (e.g., as the reference &#9881;), it appears in the DOM as the character itself, U+2699. In JavaScript, a string like &#9881; is just a sequence of seven ASCII characters (though you can pass it to a function that parses it as an HTML character reference, or you can assign it e.g. to the innerHTML property, causing HTML parsing, but this is rather pointless and confusing).
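These points can be checked without a browser. The sketch below uses only built-in string functions; the variable names are illustrative.

```javascript
const gear = "\u2699";       // escape notation for U+2699
const reference = "&#9881;"; // what the HTML source contains

console.log(String.fromCharCode(9881) === gear); // true: 9881 is 0x2699
console.log(reference.length);                   // 7
console.log(reference.includes(gear));           // false: JavaScript strings do not decode references
```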
To match the browser's behavior (because you don't know how the character is encoded in the HTML or in the text), I would try the following:
.toContain($("<span>&#9881;</span>").text()) instead of .toContain('&#9881;').
That way it should match how it is stored in the DOM.
The String.fromCharCode(9881) approach mentioned by Yuriy Galanter will definitely also work reliably. But because the DOM engine and the JS engine are two different components that could behave differently, I would test with both techniques.