I am trying to avoid a cross-site scripting vulnerability on my server. Before any user-inputted string is embedded within HTML or sent to client-side JavaScript code, it is escaped ('<' replaced with '&lt;', '&' replaced with '&amp;', etc.). When embedding into HTML this works mostly fine; the HTML produced does not contain any HTML elements inside the user-provided string. However, when the client-side JavaScript inserts HTML into the document, the escape sequences get expanded back into their special characters, which can result in user-inputted tags appearing in the document HTML. Here's approximately what I'm doing, JavaScript client-side:
// response_data received from XMLHttpRequest and parsed as JSON
var s = "";
for (var i = 0; i < response_data.length; ++i) {
    s += "<p>";
    s += response_data[i];
    s += "</p>";
}
console.log(s);
elem.innerHTML = s;
Suppose the user inputted the string "abcde <script>alert("Hello!");</script>" earlier. Then response_data could be ["abcde &lt;script&gt;alert(&quot;Hello!&quot;);&lt;/script&gt;"]. The print to console shows s to be "<p>abcde &lt;script&gt;alert(&quot;Hello!&quot;);&lt;/script&gt;</p>". However, when I assign elem.innerHTML, I can see in Inspect Element that the inner HTML of the element is actually <p>abcde <script>alert("Hello!");</script></p>! I don't think it executed, probably because of some browser security features regarding script tags within p tags, but it's obviously not very good. How do I work around this?
Code snippet (run and inspect element over the text created, it shows a script tag within the p tag):
var div_elem = document.querySelector("div");
div_elem.innerHTML = "<p>&lt;script&gt;alert(&quot;Hello!&quot;);&lt;/script&gt;</p>";
<html>
<head></head>
<body>
<div></div>
</body>
</html>
Use innerText. It's like innerHTML, but the value is treated as plain text, so HTML entities are not decoded.
Edit:
Set innerHTML to the p tags, then set the actual text using innerText on the p element:
elem.innerHTML = "<p></p>";
elem.childNodes[0].innerText = s;
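As a sketch of the same idea without innerHTML at all: each paragraph can be created with createElement and filled through textContent, so the string is never parsed as markup. The helper name renderParagraphs and its doc and container parameters are hypothetical, and the sketch assumes response_data holds the strings exactly as they should be displayed.

```javascript
// Hypothetical helper: appends one <p> per string to `container`.
// `doc` is the Document used to create elements (normally `document`).
function renderParagraphs(doc, container, items) {
    for (var i = 0; i < items.length; ++i) {
        var p = doc.createElement("p");
        // textContent stores the string as a text node; it is never parsed as HTML
        p.textContent = items[i];
        container.appendChild(p);
    }
    return container;
}

// Usage in a browser (elem and response_data as in the question):
// renderParagraphs(document, elem, response_data);
```

Passing the document in as a parameter is only a design choice to keep the helper independent of globals.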
In our project, we are getting a response from the DB. We are using the same string in two ways.
We have to display the text part alone in one line
We are putting the entire content as an HTML.
We are getting a response similar to this.
"<html><head><title>SomeTitle</title></head><style>a.hover{color:green}cc.a{color:red},pq.a{text-decoration:underline}</style> <body> Some content </body></html>"
I need to get the content only from the body using string manipulation. I need to filter out all the contents of the other tags as well.
For example
Final result should be
Some content
I used text() in some cases, but at times the content inside <style> is also getting displayed. That is not allowed for me.
Note: There are times where I don't get a <body> tag, so there should be a check for that as well.
Is there any solution for this?
At times we are getting <style> inside body as well. So is there any way to remove that part?
for example
var str = "<html><head><title>SomeTitle</title></head><style>a.hover{color:green}cc.a{color:red},pq.a{text-decoration:underline}</style> <body> <style>.hello12{color:green}</style>Some content </body></html>";
and I should get just "Some content".
Use DOMParser to parse the string, then read the text content of the body tag. querySelector can be used to get the body element, and its textContent property holds the text.
var str = "<html><head><title>SomeTitle</title></head><style>a.hover{color:green}cc.a{color:red},pq.a{text-decoration:underline}</style> <body> Some content </body></html>";
var parser = new DOMParser();
var doc = parser.parseFromString(str, "text/html");
console.log(
    doc.querySelector('body').textContent
);
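FYI: innerText skips script and style tag content only for elements that are being rendered. For a document produced by DOMParser, which is not rendered, remove the script and style elements before reading textContent.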
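A minimal sketch of removing style and script elements explicitly before reading the text, which does not depend on how innerText treats them. The helper name bodyTextOf is hypothetical; it takes the document returned by parseFromString and returns an empty string when there is no body element.

```javascript
// Hypothetical helper: returns the visible text of a parsed document's body.
function bodyTextOf(doc) {
    if (!doc.body) {
        return ""; // no body element at all
    }
    // Drop elements whose text should never be shown
    var unwanted = doc.body.querySelectorAll("style, script");
    for (var i = 0; i < unwanted.length; ++i) {
        unwanted[i].remove();
    }
    return doc.body.textContent.trim();
}

// Usage in a browser (str as in the question):
// var doc = new DOMParser().parseFromString(str, "text/html");
// console.log(bodyTextOf(doc));
```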
I have the following HTML:
<p>This contains an HTML space entity &#160;.</p>
I need to serialize this HTML to text along with HTML entities as their existing code (spaces added to prevent SO from rendering literal characters):
< p >This contains an HTML space entity & #160;.< / p >
When serializing the HTML the HTML entities are rendered instead of converted to their code/text form:
new XMLSerializer().serializeToString(element)
I've looked into other methods of converting HTML code to text, including innerHTML, though I haven't managed to find any direct means of outputting the HTML code as it exists without it being modified by the browser.
I'm also open to replacing HTML entities with a createTreeWalker if need be though I'd prefer a more direct approach. No frameworks. Suggestions please?
Please see this SO answer: https://stackoverflow.com/a/3700369/3218479.
You can use code like:
// Prepare element
var myEl = document.createElement("p");
myEl.innerText = "This contains an HTML space entity &#160;.";
// Convert to string: outerHTML escapes '&' as '&amp;', and the textarea decodes it back
var textArea = document.createElement("textarea");
textArea.innerHTML = myEl.outerHTML;
var myElText = textArea.value;
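If the goal is to get numeric codes back after serialization, one option is to post-process the string returned by XMLSerializer, which contains the literal characters. The sketch below is an assumption about what is wanted: the function name encodeNonAsciiAsEntities is hypothetical, and it rewrites every non-ASCII character as a decimal reference, so it cannot tell whether the source originally used &#160; or a named entity.

```javascript
// Hypothetical helper: replaces each non-ASCII character with a decimal
// character reference such as &#160;.
function encodeNonAsciiAsEntities(html) {
    // The "u" flag makes the class match whole code points, including astral ones
    return html.replace(/[\u0080-\u{10FFFF}]/gu, function (ch) {
        return "&#" + ch.codePointAt(0) + ";";
    });
}

// Usage in a browser:
// var text = encodeNonAsciiAsEntities(new XMLSerializer().serializeToString(element));
```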
When a user creates a message there is a multibox, and this multibox is connected to a design panel which lets users change fonts, color, size, etc. When the message is submitted, it will be displayed with HTML tags if the user has changed the color, size, etc. of the font.
Note: I need the design panel. I know it's possible to remove it, but that is not an option here :)
It's a SharePoint standard. The only solution I have is to use JavaScript to strip these tags when the message is displayed. The user should only be able to insert links, images and line breaks.
Which means that all html tags should be stripped except <a></a>, <img> and <br> tags.
It's also important that the attributes inside the <img> tag are not removed. It could be displayed like this:
<img src="/image/Penguins.jpg" alt="Penguins.jpg" style="margin:5px;width:331px;">
How can I accomplish this with javascript?
I used to use the following code-behind C# code, which worked perfectly, but it strips all HTML tags except the <br> tag.
public string Strip(string text)
{
    return Regex.Replace(text, @"<(?!br[\x20/>])[^<>]+>", string.Empty);
}
Any kind of help is appreciated a lot.
Does this do what you want? http://jsfiddle.net/smerny/r7vhd/
$("body").find("*").not("a,img,br").each(function() {
    $(this).replaceWith(this.innerHTML);
});
Basically select everything except a, img, br and replace them with their content.
Smerny's answer works well, except when the HTML structure is like this:
var s = '<div><div>Link<span> Span</span><li></li></div></div>';
var $s = $(s);
$s.find("*").not("a,img,br").each(function() {
    $(this).replaceWith(this.innerHTML);
});
console.log($s.html());
The live code is here: http://jsfiddle.net/btvuut55/1/
This happens when there are two or more wrapper elements outside (two divs in the example above).
jQuery processes the outermost matched div first, and the innerHTML it is replaced with still contains the span, so the span is retained.
The answer using $('#container').find('*:not(br,a,img)').contents().unwrap() fails to handle tags with empty content.
A working solution is simple: loop from the innermost element outwards:
var $elements = $s.find("*").not("a,img,br");
for (var i = $elements.length - 1; i >= 0; i--) {
    var e = $elements[i];
    $(e).replaceWith(e.innerHTML);
}
The working copy is: http://jsfiddle.net/btvuut55/3/
With jQuery you can find all the elements you don't want, then use unwrap to strip the tags:
$('#container').find('*:not(br,a,img)').contents().unwrap()
I think it would be better to extract the good tags. It is easier to match a few tags than to remove all the other elements and every HTML possibility. Try something like this; I tested it and it works fine:
// the following regex matches the good tags with attributes and inner content
var ptt = new RegExp("<(?:img|a|br){1}.*/?>(?:(?:.|\n)*</(?:img|a|br){1}>)?", "g");
var input = "<this string would contain the html input to clean>";
var result = "";
var match = ptt.exec(input);
while (match) {
    result += match[0]; // match[0] is the full matched text
    match = ptt.exec(input);
}
// result will contain the clean HTML with only the good tags
console.log(result);
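For comparison, the C# pattern from the question can be carried over to JavaScript and widened to keep a, img and br (including closing tags) while leaving the text between tags in place. This is only a sketch: the function name stripDisallowedTags is hypothetical, and a regex like this is a formatting filter rather than a security sanitizer, since attributes on the kept tags pass through untouched.

```javascript
// Hypothetical helper: removes every tag except <a>, </a>, <img> and <br>.
// Text between tags and attributes of the kept tags are left as they are.
function stripDisallowedTags(html) {
    // The negative lookahead skips tags whose name is exactly a, img or br,
    // with or without a leading slash
    return html.replace(/<(?!\/?(?:a|img|br)[\s\/>])[^<>]+>/gi, "");
}

var sample = '<div><b>Hi</b><br><a href="/x">link</a><span style="color:red">!</span></div>';
console.log(stripDisallowedTags(sample));
// prints: Hi<br><a href="/x">link</a>!
```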
I want to display text containing CRs and tabs (let's say the code is in a var shtml) in an iframe without losing those ASCII characters.
<!--var shtml-->
<HEAD>
</HEAD>
<BODY style="FONT-SIZE: 12.5pt">
mmm
</BODY>
My iframe
<iframe rows="5" cols="60" id="tahtml">
</iframe >
My JS
document.getElementById('tahtml').textContent = shtml; //innerText = shtml;
If I use .innerText then the code (shtml) is interpreted in Firefox. If I use .textContent the code (shtml) is displayed without the ASCII characters. The jQuery .text() does the same as .textContent.
Just like an <input>, a <textarea>'s DOM interface (HTMLTextAreaElement) has a value property. It looks like you want to set this property to shtml.
document.getElementById('tahtml').value = shtml;
For an <iframe>, make the MIME type of the page loaded inside it text/plain. This can be done by, for example, fetching an empty .txt file or setting the src to data:text/plain,. Then you can do the following:
// don't collapse whitespace, only needed to be done once
ifrmDoc.documentElement.style.whiteSpace = 'pre';
// set text
ifrmDoc.documentElement.innerHTML = shtml;
Where
var ifrm = document.getElementById('tahtml'),
ifrmDoc = ifrm.contentDocument || ifrm.contentWindow.document;
Of course, you could also do it by
writing the whole thing as a dataURI and pass it as the src
ifrm.src = 'data:text/plain,' + window.encodeURIComponent(shtml);
appending your text by using DOM methods and text nodes
ifrmDoc.body.appendChild(ifrmDoc.createTextNode(shtml))
(this would still require whiteSpace: pre;)
making a <textarea> or <pre> in your <iframe>, into which you put shtml as value or a text node, respectively.
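The data URI option from the list above can be wrapped in a small function. This is a sketch: the name plainTextDataUri is hypothetical, and the charset parameter is an addition not present in the answer.

```javascript
// Hypothetical helper: builds a text/plain data URI from arbitrary text.
// encodeURIComponent keeps tabs, newlines and markup characters intact
// by percent-encoding them.
function plainTextDataUri(text) {
    return "data:text/plain;charset=utf-8," + encodeURIComponent(text);
}

console.log(plainTextDataUri("a\tb\n<c>"));
// prints: data:text/plain;charset=utf-8,a%09b%0A%3Cc%3E

// Usage in a browser (ifrm and shtml as in the answer):
// ifrm.src = plainTextDataUri(shtml);
```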
I have an entire HTML document contained in a JavaScript string variable and I need to render it into a portion of my page. Do I have to use frames?
How to do that?
With an iframe:
<iframe id="myiframe"></iframe>
var frame= document.getElementById('myiframe');
var doc= frame.contentDocument? frame.contentDocument : frame.contentWindow.document; // IE compatibility
doc.open('text/html');
doc.write(documenthtml);
doc.close();
Or, if you can cut off the bits you don't want (like any DOCTYPE and <head> element), you can just write innerHTML to any element. Normally handling [X][HT]ML with regexp or string processing is a really bad idea, but if you know that the body will always be contained within the exact strings ‘<body>...</body>’ and there will never be eg. any ‘<body>’ sequence hidden in a comment or script section, you might be able to get away with it.
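Under exactly that assumption (a literal '<body>' opening tag with no attributes, and no stray '<body>' or '</body>' in comments or scripts), the string processing could look like the sketch below. The function name extractBodyHtml is hypothetical, and it returns null when the markers are not found rather than guessing.

```javascript
// Hypothetical helper: returns the markup between the exact strings
// "<body>" and "</body>", or null if either marker is missing.
function extractBodyHtml(html) {
    var openTag = "<body>";
    var start = html.indexOf(openTag);
    var end = html.lastIndexOf("</body>");
    if (start === -1 || end === -1 || end < start) {
        return null;
    }
    return html.slice(start + openTag.length, end);
}

console.log(extractBodyHtml("<html><head></head><body><p>Hi</p></body></html>"));
// prints: <p>Hi</p>
```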
To be honest, browsers at the moment are so forgiving they will typically even let you write a whole HTML document to a div's innerHTML, complete with doctype and ‘<head>’ in there! But that would be a bit naughty:
<div id="mycontent"></div>
document.getElementById('mycontent').innerHTML= htmldocument;
Here's a nasty hack combining both methods, to extract the body content without the use of regex:
<div id="mycontent"></div>
var frame= document.createElement('iframe');
frame.style.display= 'none';
document.body.appendChild(frame);
var doc= frame.contentDocument? frame.contentDocument : frame.contentWindow.document;
doc.open('text/html');
doc.write(documenthtml);
doc.close();
document.getElementById('mycontent').innerHTML= doc.body.innerHTML;
document.body.removeChild(frame);
document.getElementById('container').innerHTML = string;
This will load the contents of the string inside of an element (probably a div) with the id of "container".
myHtmlString = 'some stuff'; // whatever your big html string is
el = document.getElementById("myTarget"); // where you'd like the html to end up
el.innerHTML = myHtmlString; // set the HTML of el to be your big string.