Do I need to do string sanitation before adding to DOM?


Our team assumed that strings have to be sanitized before they are added to the DOM. We expected at least that double quotes would cause trouble in setAttribute, and that < and > would cause trouble in node content.
Our first tests showed something different. We are using innerHTML to set a node's content, and it appears to escape all unsafe characters on its own. Even setAttribute escapes < and >.
Is this always the case? I couldn't find anything on Google, and I don't know whether there are browsers out there that would fail.

innerHTML edits the HTML inside an element and generates DOM nodes from it, so you need to write HTML according to the normal rules (e.g. you can't use a < character unless it is followed by a non-name character). Browsers will perform their usual error recovery, though.
I don't understand why your experience of innerHTML differs from that.
createTextNode, setAttribute, etc edit the DOM directly. HTML is not involved, so you don't have to deal with characters that have special meaning in HTML.
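If a string does have to pass through innerHTML, its special characters need escaping first. The following is a minimal sketch, assuming a hypothetical helper named escapeHtml (it is not a built-in). With createTextNode, textContent or setAttribute no such step is needed, for the reason given above.

```javascript
// Hypothetical helper: escape the characters that have special meaning in HTML.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;"
};

function escapeHtml(text) {
  // Replace each special character with its entity in a single pass.
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Browser-only usage (assumes `node` is an existing element):
//   node.innerHTML = "<p>" + escapeHtml(userInput) + "</p>"; // parsed as HTML, so escape
//   node.textContent = userInput;                            // no HTML parsing, no escaping
```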

Related

Can't create some kinds of nodes from HTML strings using innerHTML

The following javascript (run in the chrome console) does not do what I'd expect:
> var elem = document.createElement("foo");
undefined
> elem.innerHTML = "<tr></tr>"
"<tr></tr>"
> elem.outerHTML
"<foo></foo>"
The <tr> tag has disappeared!
This seems specific to table-related elements. Using <div> or <span> works as expected.
I expect what I'm doing is invalid, as "foo" is not a known element, and presumably table-related elements can only appear within a <table>. Interestingly, the following code works just fine:
> var elem = document.createElement("foo"), tr = document.createElement("tr");
> elem.appendChild(tr);
> elem.outerHTML
"<foo><tr></tr></foo>"
So it seems like the construction itself (a <tr> not within a <table>) is allowable, but using innerHTML to place it there does not work. Perhaps innerHTML goes through some HTML cleanup that removes anything not strictly valid, while creating DOM nodes directly is not subject to the same validation.
My question: is there any way to populate an arbitrary DOM node from a string without running into such cleanup / validation issues? My use case will end up with a perfectly valid structure (I plan to place this as the child of a <table> sometime later), but the browser is stopping me while I'm trying to build the individual parts.
It sounds a little like DocumentFragment should be what I'm looking for, but as far as I can tell those are only constructable programmatically - they don't support innerHTML.
Some background on why I want to do this:
My use case is JavaScript-based live templating (i.e. not outputting HTML strings, but actual DOM nodes). So the requirements are:
template input must be allowed to be arbitrary HTML (this is why I'm using innerHTML and not constructing nodes programmatically)
it must be possible to create sub-templates that are then attached into a larger document (that's why I can't just create the whole <table> at once).
The second point is how I encountered this bug. My template contains a sub-template.
var row = Html("<tr></tr>");
var table = Html(["<table><thead>", row, "</thead></table>"]);
I will later add code like:
row.append(Html(["<td>", column.header, "</td>"]));
to actually populate the columns. So when it's fully constructed, the html will be valid. But in the intermediate stages, each template / snippet is constructed under a single element. That means that templates like:
Html(["Hello <span>", name, "</span>"]);
still come out as a single node (so that they can be manipulated as a single entity):
<foo>Hello <span>bob</span></foo>
When the template results in only a single child inside the <foo>, the outer node is removed. But during construction, the row template above should look like <foo><tr></tr></foo>. Due to the validation behaviour I'm seeing when using innerHTML it just ends up as <foo></foo>.
I've checked all code works the same in both firefox & chrome, so I don't expect I'm just hitting a browser bug.

Unfortunately the answer to your general question is no, there is no way to use innerHTML to add arbitrarily incomplete HTML fragments. I know this is not the answer you want to hear but that's the way it is.
One of the most misunderstood things about innerHTML stems from the way the API is designed. It uses assignment (= and +=) to perform DOM insertion. This tricks programmers into thinking that it is merely doing string operations, when in fact innerHTML behaves more like a function than a variable. It would be less confusing if innerHTML had been designed like this:
element.innerHTML('some <b>html</b> here');
Unfortunately it's too late to change the API, so we must instead understand that it is really an API rather than merely an attribute/variable.
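To illustrate the point that assigning to innerHTML acts like a function call, here is a toy sketch in plain JavaScript. The object fakeElement and its parseCalls counter are hypothetical stand-ins, not a real DOM element, and the trim() call merely stands in for the browser's parsing and re-serialising.

```javascript
// Toy stand-in for a DOM element (hypothetical, for illustration only).
const fakeElement = {
  parseCalls: 0,
  stored: "",
  // The setter runs code on every assignment, just like a function call would.
  set innerHTML(html) {
    this.parseCalls += 1;
    this.stored = String(html).trim(); // stand-in for "parse, then serialise"
  },
  // The getter returns the processed result, not the string that was assigned.
  get innerHTML() {
    return this.stored;
  }
};

fakeElement.innerHTML = "  <b>html</b>  ";  // one setter call
fakeElement.innerHTML += "<i>more</i>";     // getter call, concatenation, second setter call
```

Note that += is not an in-place append: it reads the processed value back, concatenates, and runs the setter again on the whole string.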
Now, to understand the so-called "validation" behavior of innerHTML: when you modify innerHTML, it triggers a call to the browser's HTML compiler. It's the same compiler that compiles your HTML file/document. There's nothing special about the HTML compiler that innerHTML calls. Therefore, whatever you can put in an HTML file you can pass to innerHTML (the one exception being that embedded JavaScript doesn't get executed, probably for security reasons).
This makes sense from the point of view of a browser developer. Why include two separate HTML compilers in the browser? Especially considering the fact that HTML compilers are huge, complex beasts.
The down side to this is that incomplete HTML will be handled the same way it is handled for html documents. In the case of <td> elements not inside a table most browsers will simply strip it away (as you've observed for yourself). That is essentially what you're trying to do - create invalid/incomplete HTML.
There are two workarounds for this:
Extract the table from the page, then use string processing (regex etc.) to insert the <td> into the table string, then innerHTML the whole table back into the page.
Parse the inserted HTML string and, if you find any <td> or <tr> (or <option>), extract the HTML element and insert it using DOM methods.
Unfortunately both are quite painful.

Mihai Stancu's comment about jquery made me think: surely jquery manages this if you call $("<tr></tr>"). I know jquery has a shortcut for strings that look like single tags, but it must work for complex HTML as well.
So I took a dive into the jquery source code, and found just the ticket:
https://github.com/jquery/jquery/blob/6a0ee2d9ed34b81d4ad0662423bf815a3110990f/src/manipulation.js#L450
It's using a regex to detect just the name of the first tag in the string, then using this info to figure out what "context" it needs to wrap it in for the innerHTML process to treat it as valid. I think this technique should work for all well-formed inputs.
I've adopted this code into a standalone function which will turn an arbitrary string into a DOM node:
https://gist.github.com/gfxmonk/5299096
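The wrapping idea can be sketched roughly as follows. This is a simplified illustration and not the jQuery source or the gist linked above; the names WRAPPERS, firstTagName, wrapperFor and htmlToNodes are hypothetical, and the wrapper table covers only a few table-related tags plus <option>.

```javascript
// Hypothetical wrapper table: which context each tag needs, and how deep it nests.
const TABLE = { depth: 1, open: "<table>", close: "</table>" };
const CELL = { depth: 3, open: "<table><tbody><tr>", close: "</tr></tbody></table>" };
const WRAPPERS = {
  thead: TABLE,
  tbody: TABLE,
  tfoot: TABLE,
  tr: { depth: 2, open: "<table><tbody>", close: "</tbody></table>" },
  td: CELL,
  th: CELL,
  option: { depth: 1, open: "<select multiple>", close: "</select>" }
};
const NO_WRAPPER = { depth: 0, open: "", close: "" };

// Find the name of the first tag in the string, lower-cased, or null if there is none.
function firstTagName(html) {
  const match = /<([a-zA-Z][^\s/>]*)/.exec(html);
  return match ? match[1].toLowerCase() : null;
}

// Pick the wrapper that makes the fragment valid for the HTML parser.
function wrapperFor(html) {
  const name = firstTagName(html);
  const known = name !== null && Object.prototype.hasOwnProperty.call(WRAPPERS, name);
  return known ? WRAPPERS[name] : NO_WRAPPER;
}

// Browser-only: parse inside the wrapper, then descend back to the real content.
// `doc` is assumed to be a Document such as window.document.
function htmlToNodes(html, doc) {
  const wrap = wrapperFor(html);
  let container = doc.createElement("div");
  container.innerHTML = wrap.open + html + wrap.close;
  for (let i = 0; i < wrap.depth; i++) {
    container = container.firstChild;
  }
  return Array.from(container.childNodes);
}
```

In a browser, htmlToNodes("<tr></tr>", document) would be expected to return the <tr> element itself rather than an empty list, because the row is parsed inside a temporary table.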

How can I get the actual displayed text from an HTML TextNode instead of the HTML markup?

I'm trying to turn a DOM node and all its children into a plain text markup of my design. I can use node.childNodes to get a list of all the content and recursively turn it into my string format.
However, when I take text out of a TextNode, it includes newlines and spaces that aren't visible on the page. For plain text I want to get the same appearance that was on the HTML - so there shouldn't be lots of indentations before the text or newlines after it, even if they were in the HTML markup, because my browser stripped those out when it rendered the HTML.
The obvious answer would be to .trim() the string myself - except that this can take out spaces that are supposed to exist in the text, in the case of something like <em>text.</em> moretext. The latter textnode loses the space before it.
Even if that worked, it's also philosophically unappealing. I want this algorithm to be based on the text presented to the user. The webpage conceals implementation details like spaces, tabs, and newlines in the underlying markup, and I would like to remain within that abstraction, using whatever the browser used to trim them down rather than the approximation granted by trim(). Ideally there would be an equivalent of node.textContent that has a list of both plain text and child elements somehow.
I haven't been able to find anything about this and I can't see a good way to code it to be smart about those spaces (short of comparing the .textContent and .nodeValue strings or parsing innerHTML myself or something). Help?

document.getElementById("someid").innerText.replace(/\s+/g," ")
The trim method only removes whitespace at the start and end of a string, not in the middle.
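The difference between trimming each text node and collapsing whitespace across the joined text can be sketched without a DOM. The helper names collapseWhitespace and approximateDisplayedText are hypothetical, and this is only an approximation: it ignores CSS such as white-space: pre and the line breaks that block-level elements introduce.

```javascript
// Collapse every run of whitespace (spaces, tabs, newlines) into one space.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, " ");
}

// Join the raw text of consecutive text nodes first, then collapse and trim once.
// Trimming each piece separately would lose the space between "text." and "moretext".
function approximateDisplayedText(pieces) {
  return collapseWhitespace(pieces.join("")).trim();
}

// Raw text as it might come from the markup "<em>text.</em> moretext" with indentation.
const pieces = ["\n    ", "text.", " moretext\n  "];
const perNodeTrim = pieces.map((p) => p.trim()).join("");
const joined = approximateDisplayedText(pieces);
```

Here perNodeTrim loses the space after the full stop, while joined keeps it.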

I have written an implementation of exactly this as part of my Rangy library's TextRange module, but it's a lot of code to include for just this.
var displayedText = rangy.innerText(node);

regex to change text inside a html tag

First of all I'm new to stackoverflow so I'm sorry if I posted this in the wrong section.
I need a regex to search within the html tag and replace the - with a _
e.g:
<TAG-NAME>-100</TAG-NAME>
would become
<TAG_NAME>-100</TAG_NAME>
note that the value inside the tag wasn't affected.
Can anyone help?
Thanks.

Since JavaScript is the language for DOM manipulation, you should generally consider parsing the XML properly and using JavaScript's DOM traversal functions instead of regular expressions.
Here is some example code on how to parse an XML document so that you can use the DOM traversal functions. Then you can traverse all elements and change their names. This will automatically exclude text nodes, attributes, comments and all the other annoying things you don't want to change.
If it has to be a regex, here is a makeshift solution. Note that it will fail badly if you have tags (or even only >) inside attribute values or comments (in fact it will also apply the replacement to comments):
str = str.replace(/-(?=[^<>]*>)/g, '_');
This will match a - if it is followed by a > without encountering a < before. The concept is called a lookahead (a positive one, since it asserts that the pattern does follow). The g modifier makes sure that all occurrences are replaced.
Note that this will apply the replacement to anything in front of a >. Even attribute values. If you don't want that you could also make sure that there is an even number of quotes between the hyphen and the closing >, like this:
str = str.replace(/-(?=[^<>"]*(?:"[^<>"]*"[^<>"]*)*>)/g, '_');
This will still change attribute names though.
Here is a regexpal demo that shows what works and what doesn't work. Especially the comment behavior is quite horrible. Of course this could be taken care of with an even more complex regex, but I guess you see where this is going? You should really, really use an XML parser!
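If a regex has to be used anyway, one alternative is to match only the tag name and rewrite it in a replace callback, which leaves attribute names, attribute values and element text alone. This is a minimal sketch with a hypothetical function name, renameTags; like the patterns above it still has no notion of comments, CDATA sections, or a < inside an attribute value.

```javascript
// Replace hyphens with underscores in element names only.
// Matches "<" or "</" followed by a name, and rewrites just that name.
function renameTags(markup) {
  return markup.replace(/<(\/?)([A-Za-z][\w:.-]*)/g, (whole, slash, name) => {
    return "<" + slash + name.replace(/-/g, "_");
  });
}

const simple = renameTags("<TAG-NAME>-100</TAG-NAME>");
const withAttribute = renameTags('<a-b data-x="1-2">3-4</a-b>');
```

With the inputs above, simple comes out as <TAG_NAME>-100</TAG_NAME>, and withAttribute keeps data-x="1-2" and the text 3-4 unchanged while the element becomes a_b.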

s/(\<[^\>]+\>)\-([^\<]+\<\/)/\1_\2/
I am not familiar with JS libraries, but I am pretty sure there are better tools for this, such as libraries that parse HTML.

Javascript syntax highlighter that plays nicely with Markdown

I've looked at a few JavaScript programs that add syntax highlighting to code blocks on a page, but all the ones I've found require setting an attribute on the code block to tell it what language is being used. I am generating the HTML with Markdown, so I have no way of setting these attributes. Are there any that will do this automatically and will not need an attribute to be set?
The only way I can think of this working is with a shebang line;
#!/usr/bin/ruby
def foo(bar)
  bar
end
And it will know it's Ruby, and maybe even not display the shebang line (having a shebang for a one or two line fragment will get tiring).
I won't need it to handle any very obscure languages, but it would be great if I could easily write new definitions.
Thanks.

Google Prettifier should do the job. StackOverflow uses it, too (with the markup generated by Markdown). It determines the language automatically.

It's my understanding that the Markdown spec allows for the presence of actual markup as a fallback:
For any markup that is not covered by Markdown's syntax, you simply use HTML itself. There's no need to preface it or delimit it to indicate that you're switching from Markdown to HTML; you just use the tags.
The only restrictions are that block-level HTML elements -- e.g. <div>, <table>, <pre>, <p>, etc. -- must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces.
So, if you've got a syntax highlighter you really like that doesn't auto-detect, you could simply throw a literal <code> block with the appropriate attribute into your Markdown. I don't think it particularly violates the goals of Markdown, either... it's a fairly straightforward and readable indicator.
It also might not be that hard to roll your own script that executes first after the DOM is ready, finds code blocks, and inserts appropriate attributes for the syntax highlighter of your choice into them depending on a few heuristics that you devise for their contents, but if there's a library out there that already does it, obviously that has some advantages. :)
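One such heuristic, the shebang line suggested in the question, can be sketched as a plain function. The name detectFromShebang is hypothetical, the returned language is simply the interpreter name, and mapping that name to whatever attribute a particular highlighter expects is left open.

```javascript
// Inspect the first line of a code block for a shebang such as "#!/usr/bin/ruby"
// or "#!/usr/bin/env python3". Returns the interpreter name and the code with the
// shebang line removed, or language: null when there is no shebang.
function detectFromShebang(source) {
  const match = /^#![ \t]*(\S+)(?:[ \t]+(\S+))?[^\n]*\n?/.exec(source);
  if (!match) {
    return { language: null, code: source };
  }
  // Take the last path component, e.g. "/usr/bin/ruby" -> "ruby".
  let interpreter = match[1].split("/").pop();
  // "env" only launches the real interpreter, which is its first argument.
  if (interpreter === "env" && match[2]) {
    interpreter = match[2];
  }
  return { language: interpreter, code: source.slice(match[0].length) };
}

const ruby = detectFromShebang("#!/usr/bin/ruby\ndef foo(bar)\n  bar\nend");
const python = detectFromShebang("#!/usr/bin/env python3\nprint(1)");
const unknown = detectFromShebang("def foo(bar)\n  bar\nend");
```

A script run after the DOM is ready could call this on each code block's text, set the highlighter's language attribute from the result, and replace the block's text with the shebang-free code.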

innerHTML alternative for retrieving contents of page?

I'm currently using innerHTML to retrieve the contents of an HTML element and I've discovered that in some browsers it doesn't return exactly what is in the source.
For example, using innerHTML in Firefox on the following line:
<div id="test"><strong>Bold text</strong></strong></div>
Will return:
<strong>Bold text</strong>
In IE, it returns the original string, with two closing strong tags. I'm assuming in most cases it's not a problem (and may be a benefit) that Firefox cleans up the incorrect code. However, for what I'm trying to accomplish, I need the exact code as it appears in the original HTML source.
Is this at all possible? Is there another JavaScript function I can use?

I don't think you can get the incorrect HTML code back in modern browsers. And that is the right behaviour, because dynamically generated HTML has no source to return. For example, Firefox's innerHTML returns a part of the DOM tree serialised as a string, not the HTML source. This is not a problem, because the second </strong> tag is ignored by the browser anyway.

innerHTML is generated not from the actual source of the document (i.e. the HTML file) but from the DOM that the browser built from it. So if IE somehow shows you the incorrect HTML code, it's probably some kind of bug. There is no method that retrieves the invalid HTML code in every browser.

You can't in general get the original invalid HTML for the reasons Ivan and Andris said.
IE is also “fixing” your code just like Firefox does, albeit in a way you don't notice on serialisation, by creating an Element node with the tagName /strong to correspond to the bogus end-tag. There is no guarantee at all that IE will happen to preserve other invalid markup structures through a parse/serialise cycle.
In fact, even for valid code the output of innerHTML won't be exactly the same as the input. Attribute order isn't maintained, tagName case isn't maintained (IE gives you <STRONG>), whitespace in various places is lost, entity references aren't maintained, and so on. If you “need the exact code”, you will have to keep a copy of the exact code, for example in a JavaScript variable in a <script> block written after the content in question.

If you don't need the HTML to render (e.g., you're going to use it as a JS template or something) you can put it in a textarea and retrieve the contents from its value. (Reading innerHTML instead gives the text back with < and > escaped as entities in current browsers.)
<textarea id="myTemplate"><div id="test"><strong>Bold text</strong></strong></div></textarea>
And then:
$('#myTemplate').val() === '<div id="test"><strong>Bold text</strong></strong></div>'
Other than that, the browser gets to decide how to interpret the HTML, and it will only return its interpretation, not the original.

innerText? Or does that have the same effect?

You must use innerXML property. It does exactly what you want to achieve.
