Using Javascript to format an HTML string to display properly


The back end hands off a string that gets displayed like:
"Hello, <br><br> This notice is to inform you that you are in violation of <font color=red><b>HR POLICY XXXXX</b></font>."
The point of this page is to let you easily copy-paste pre-generated emails, but spewing out a bunch of html tags through the sentences is unwanted.
The string in question is inside of a <td> with an id of "textBlock".
The back end is Java with an Oracle DB. I can edit the java to some extent and I can't touch the DB at all. I've used the console to play around with the string and editing it in any way seems to make it display properly once I finish editing. The innerText includes tags like in my summary, the innerHTML displays the tags like <br>.
So far I've attempted to give the <td> an onload attribute that calls a function named formatText() that does:
var temp = document.getElementById("textBlock").innerText;
document.getElementById("textBlock").innerText = temp;
as well as the above function with innerHTML instead of innerText. I've also tried using document.write(), but that clears the rest of the page. Finally, I've added some random characters in front of the string and tried to use replace("!##","") to remove them, in an effort to mimic the "editing it in any way seems to make it display properly" behavior that I noticed.
java
out.println("<td align=left id=textBlock onload=formatText();> !##" + strTemp + "</td>" );
Expected:
Hello,
This notice is to inform you that you are in violation of HR POLICY XXXXX.
Actual:
Hello, <br><br> This notice is to inform you that you are in violation of <font color=red><b>HR POLICY XXXXX</b></font>.

What you want, if I understood correctly, is a function that strips HTML tags. You can use a regex:
var str = "Hello, <br><br> This notice is to inform you that you are in violation of <font color=red><b>HR POLICY XXXXX</b></font>."
console.log(str)
var str2 = str.replace(/<[^>]*>?/gm, '')
console.log(str2)
If you want the HTML element to render your HTML, you need to use the DOM property innerHTML:
var str = "Hello, <br><br> This notice is to inform you that you are in violation of <font color=red><b>HR POLICY XXXXX</b></font>."
document.getElementById('myDiv').innerHTML = str
<div id="myDiv">Hi</div>
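If the goal is the plain-text result shown under "Expected" in the question, the regex approach can be extended so that <br> tags become line breaks before the remaining tags are dropped. This is only a minimal sketch: htmlToPlainText is a hypothetical helper name, and it assumes simple, trusted markup from your own back end (a regex is not a safe HTML sanitizer).

```javascript
// Hypothetical helper: convert simple, trusted HTML into plain text.
function htmlToPlainText(html) {
  return String(html)
    .replace(/\s*<br\s*\/?>\s*/gi, "\n") // each <br> (and the spaces around it) becomes a newline
    .replace(/<[^>]*>/g, "")             // drop every remaining tag
    .replace(/\n{2,}/g, "\n");           // collapse runs of newlines left by repeated <br>
}

var notice = "Hello, <br><br> This notice is to inform you that you are in violation of <font color=red><b>HR POLICY XXXXX</b></font>.";
console.log(htmlToPlainText(notice));
```

The result could then be assigned to the element's innerText or textContent, so nothing in it is parsed as markup.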

(resolved in comments, answer added for completeness)
When HTML tags are visible in the browser, the content is usually encoded with HTML entities, which prevents it from being parsed as HTML. In this case a post-processing script was replacing the < and > characters with their entity counterparts &lt; and &gt;.
Disabling these replacements resolved the issue.
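To illustrate the cause: once the angle brackets have been replaced by entities, the browser treats the markup as literal text. In the sketch below, encodeAngleBrackets is a hypothetical stand-in for the post-processing script (its real implementation is not shown in the question), and decodeAngleBrackets is an equally hypothetical helper that reverses it.

```javascript
// Hypothetical stand-in for the post-processing step that caused the problem.
function encodeAngleBrackets(text) {
  return String(text).replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Reverses the step above; only useful when the encoding cannot be disabled at the source.
function decodeAngleBrackets(text) {
  return String(text).replace(/&lt;/g, "<").replace(/&gt;/g, ">");
}

var encoded = encodeAngleBrackets("Hello, <br><br> This notice");
console.log(encoded);                      // Hello, &lt;br&gt;&lt;br&gt; This notice
console.log(decodeAngleBrackets(encoded)); // Hello, <br><br> This notice
```

Decoding on the client and assigning the result to innerHTML is only reasonable for content you fully control. For anything user-supplied, the encoding is doing useful work and should stay.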

Related

XSS prevention and .innerHTML

When I allow users to insert data as an argument to the JS innerHTML function like this:
element.innerHTML = "User provided variable";
I understood that in order to prevent XSS, I have to HTML encode, and then JS encode the user input because the user could insert something like this:
<img src=a onerror='alert();'>
Only HTML or only JS encoding would not help because the .innerHTML method as I understood decodes the input before inserting it into the page. With HTML+JS encoding, I noticed that the .innerHTML decodes only the JS, but the HTML encoding remains.
But I was able to achieve the same by double encoding into HTML.
My question is: Could somebody provide an example of why I should HTML encode and then JS encode, and not double encode in HTML when using the .innerHTML method?

Could somebody provide an example of why I should HTML encode and then
JS encode, and not double encode in HTML when using the .innerHTML
method?
Sure.
Assuming the "user provided data" is populated in your JavaScript by the server, then you will have to JS encode to get it there.
The following is pseudocode on the server side, but JavaScript on the front end:
var userProdividedData = "<%=serverVariableSetByUser %>";
element.innerHTML = userProdividedData;
As in ASP.NET, <%= %> outputs the server-side variable without encoding. If the user is "good" and supplies the value foo, then this results in the following JavaScript being rendered:
var userProdividedData = "foo";
element.innerHTML = userProdividedData;
So far no problems.
Now say a malicious user supplies the value "; alert("xss attack!");//. This would be rendered as:
var userProdividedData = ""; alert("xss attack!");//";
element.innerHTML = userProdividedData;
which would result in an XSS exploit where the code is actually executed in the first line of the above.
To prevent this, as you say you JS encode. The OWASP XSS prevention cheat sheet rule #3 says:
Except for alphanumeric characters, escape all characters less than
256 with the \xHH format to prevent switching out of the data value
into the script context or into another attribute.
So to secure against this your code would be
var userProdividedData = "<%=JsEncode(serverVariableSetByUser) %>";
element.innerHTML = userProdividedData;
where JsEncode encodes as per the OWASP recommendation.
This would prevent the above attack as it would now render as follows:
var userProdividedData = "\x22\x3b\x20alert\x28\x22xss\x20attack\x21\x22\x29\x3b\x2f\x2f";
element.innerHTML = userProdividedData;
Now you have secured your JavaScript variable assignment against XSS.
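JsEncode itself is not defined in this answer. As a rough illustration of the OWASP rule quoted above, here is a sketch written in JavaScript. In practice the encoding would run in whatever language renders the page on the server, and a maintained encoder library is preferable to hand-rolled code. The name jsEncode and the \uHHHH handling for code units of 256 and above are assumptions, not part of the original answer.

```javascript
// Sketch of the quoted rule: keep alphanumerics, escape every other
// character below 256 as \xHH. Code units >= 256 are written as \uHHHH
// (an assumption; the quoted rule only covers characters below 256).
function jsEncode(value) {
  var text = String(value);
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var ch = text[i];
    var code = text.charCodeAt(i);
    if (/[A-Za-z0-9]/.test(ch)) {
      out += ch;
    } else if (code < 256) {
      out += "\\x" + code.toString(16).padStart(2, "0");
    } else {
      out += "\\u" + code.toString(16).padStart(4, "0");
    }
  }
  return out;
}

console.log(jsEncode('"; alert("xss attack!");//'));
// \x22\x3b\x20alert\x28\x22xss\x20attack\x21\x22\x29\x3b\x2f\x2f
```

Because the quote and backslash characters never appear unescaped in the output, the value cannot terminate the surrounding string literal.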
However, what if a malicious user supplied <img src="xx" onerror="alert('xss attack')" /> as the value? This would be fine for the variable assignment part as it would simply get converted into the hex entity equivalent like above.
However the line
element.innerHTML = userProdividedData;
would cause alert('xss attack') to be executed when the browser renders the inner HTML. This would be like a DOM Based XSS attack as it is using rendered JavaScript rather than HTML; however, as it passes through the server, it is still classed as reflected or stored XSS depending on where the value is initially set.
This is why you would need to HTML encode too. This can be done via a function such as:
function escapeHTML (unsafe_str) {
return unsafe_str
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/\"/g, '&quot;')
.replace(/\'/g, '&#39;')
.replace(/\//g, '&#x2F;')
}
making your code
element.innerHTML = escapeHTML(userProdividedData);
or could be done via JQuery's text() function.
Update regarding question in comments
I just have one more question: You mentioned that we must JS encode
because an attacker could enter "; alert("xss attack!");//. But if we
would use HTML encoding instead of JS encoding, wouldn't that also
HTML encode the " sign and make this attack impossible because we
would have: var userProdividedData ="&quot;; alert(&quot;xss attack!&quot;);//";
I'm taking your question to mean the following: rather than JS encoding followed by HTML encoding, why don't we just HTML encode in the first place and leave it at that?
Well because they could encode an attack such as <img src="xx" onerror="alert('xss attack')" /> all encoded using the \xHH format to insert their payload - this would achieve the desired HTML sequence of the attack without using any of the characters that HTML encoding would affect.
There are some other attacks too: If the attacker entered \ then they could force the browser to miss the closing quote (as \ is the escape character in JavaScript).
This would render as:
var userProdividedData = "\";
which would trigger a JavaScript error because it is not a properly terminated statement. This could cause a Denial of Service to the application if it is rendered in a prominent place.
Additionally say there were two pieces of user controlled data:
var userProdividedData = "<%=serverVariableSetByUser1 %>" + ' - ' + "<%=serverVariableSetByUser2 %>";
the user could then enter \ in the first and ;alert('xss');// in the second. This would change the string concatenation into one big assignment, followed by an XSS attack:
var userProdividedData = "\" + ' - ' + ";alert('xss');//";
Because of edge cases like these it is recommended to follow the OWASP guidelines as they are as close to bulletproof as you can get. You might think that adding \ to the list of HTML encoded values solves this, however there are other reasons to use JS followed by HTML when rendering content in this manner because this method also works for data in attribute values:
<a href="javascript:void(0)" onclick="myFunction('<%=JsEncode(serverVariableSetByUser) %>'); return false">
Regardless of whether it is single or double quoted:
<a href='javascript:void(0)' onclick='myFunction("<%=JsEncode(serverVariableSetByUser) %>"); return false'>
Or even unquoted:
<a href=javascript:void(0) onclick=myFunction("<%=JsEncode(serverVariableSetByUser) %>");return false;>
If you HTML encoded an entity value as mentioned in your comment:
onclick='var userProdividedData ="&quot;;"' (shortened version)
the code is actually run via the browser's HTML parser first, so userProdividedData would be
";;
instead of
&quot;;
so when you add it to the innerHTML call you would have XSS again. Note that <script> blocks are not processed via the browser's HTML parser, except for the closing </script> tag, but that's another story.
It is always wise to encode as late as possible, as shown above. Then, if you need to output the value in anything other than a JavaScript context (e.g. an actual alert box, which does not render HTML), it will still display correctly.
That is, with the above I can call
alert(serverVariableSetByUser);
just as easily as setting HTML
element.innerHTML = escapeHTML(userProdividedData);
In both cases it will be displayed correctly, without certain characters disrupting output or causing undesirable code execution.

A simple way to make sure the contents of your element is properly encoded (and will not be parsed as HTML) is to use textContent instead of innerHTML:
element.textContent = "User provided variable with <img src=a>";
Another option is to use innerHTML only after you have encoded (preferably on the server if you get the chance) the values you intend to use.

I have faced this issue in my ASP.NET Webforms application. The fix to this is relatively simple.
Install HtmlSanitizationLibrary from NuGet Package Manager and reference it in your application. In the code-behind, use the Sanitizer class in the following way.
For example, if the current code looks something like this,
YourHtmlElement.InnerHtml = "Your HTML content" ;
Then, replace this with the following:
string unsafeHtml = "Your HTML content";
YourHtmlElement.InnerHtml = Sanitizer.GetSafeHtml(unsafeHtml);
This fix will remove the Veracode vulnerability and make sure that the string gets rendered as HTML. Encoding the string at code behind will render it as 'un-encoded string' rather than RAW HTML as it is encoded before the render begins.

Detect line break in text with javascript

I'm able to set line break with javascript in an element, but I can't read them.
Example:
<div id="text"></div>
Js:
document.getElementById("text").innerHTML = "Hello <br /> World";
var x = document.getElementById("text").innerHTML;
if(x == "Hello <br /> World")
{
alert('Match');
}
There is no match in my case...

It is pretty much never a good thing to get the .innerHTML property from an object and then try to compare it to some exact string. This is because the only contract the browser has is to return to you equivalent HTML, not necessarily the exact same HTML. This may not be a problem if there are no nested DOM elements in what you are requesting, but can often be a problem if there are nested HTML elements.
For example, some versions of IE will change the order of attributes, change the quoting, change the spacing, etc...
Instead, you should either search the actual DOM for what you want, look for only a smaller piece of the HTML which you know can't change, or use a search algorithm that is tolerant of changes in the HTML.
As Niels mentioned in a comment, Chrome, IE11 and Firefox all return "Hello <br> World" which you can see for yourself with a simple debugging statement like this:
console.log("'" + x + "'");
Working demo to see for yourself what it shows: http://jsfiddle.net/jfriend00/zc59x2bL/
FYI, your code also contains an error. You need to pass document.getElementById() a string which would be document.getElementById("test"), not document.getElementById(test).
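One way to apply the advice above is to test only for the part that matters (whether a <br> is present) instead of comparing the whole serialized string. A minimal sketch follows; containsLineBreak is a hypothetical helper name, and in a browser you could equally call querySelector("br") on the element itself and check for a non-null result.

```javascript
// Hypothetical helper: true if the markup contains a <br> in any common
// serialization (<br>, <br/>, <br />, <BR>), however the browser wrote it.
function containsLineBreak(html) {
  return /<br\s*\/?>/i.test(String(html));
}

console.log(containsLineBreak("Hello <br /> World")); // true
console.log(containsLineBreak("Hello <br> World"));   // true
console.log(containsLineBreak("Hello World"));        // false
```

This keeps the check independent of whether the browser serializes the element as <br> or <br />.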

jQuery replacing spans in node+jade combo

Perhaps this is expected, but I found it odd since I am now starting with jQuery.
So, I am writing an application using node and jade. In the index.jade I have a statement of the form
p Welcome subscriber
span(id="subscriber") someID
Now once the connection is established between the client and the server, the server sends a welcome JSON message with some data. One of them is the id of the client which I want to replace above. Once the client receives the welcome JSON message it initializes the appropriate structures and then I make a call to a function loadStats:
function loadStats() {
var myText = "" + myData.id + ".";
$('#subscriber').text(myText);
$('#subscriber').html(myText);
};
On the screen I can see that the text "someID" is replaced by the ID of the client. However, when I actually inspect the HTML code of the page that I am looking at, I see a statement of the form:
<p>Welcome subscriber <span id="subscriber">someID</span></p>
In other words in the actual HTML code the text "someID" has not been replaced. Is this something expected? How was the replacement done? Moreover, it appears that working with either of the statements
$('#subscriber').text(myText);
$('#subscriber').html(myText);
gives the replication on the screen but not on the actual html content of what is presented on screen. Is this the correct behavior? From what I understood (and expect) the .text() replaces the visual data of the element with the specific id and the .html() replaces the content. Am I missing something?
Thanks in advance. jQuery rookie here.

Two rules for expressions in pug:
In attributes you use quotes to output literal text and you leave the quotes out when you want to use a variable, and
For the content of a tag you use an equals sign when you want pug to evaluate an expression, or don't put anything if you want literal text
So with those rules in mind, looking at your code, you will output the attribute "subscriber" as a literal and "someID" as a literal.
span(id="subscriber") someID
Results in:
<span id="subscriber">someID</span>
You wanted both to be dynamic so remove the quotes in the attribute and put an equals sign after the element:
span(id= subscriber)= someID
This will dynamically replace both with variables.

Javascript onclick function call issue: won't pass a string

So the basic rundown is that I'm trying to create a rudimentary means of flagging inappropriate content on our web mapping application.
Within a function that dynamically creates content for the sidebar of the webmap when the user clicks on a point I have this piece of code that should generate an image of a flag.
When the user clicks the flag, I want to run the function flagContent which should pass a url string into the function. From within this function I would then be able to
write it to a database later on (though I haven't made it this far yet).
Here are some code snippets I have been working with.:
1. This is where the flag image is generated
content += "<p class='info'><img id='flag' onclick='flagContent(" + attachmentInfo.url + ")
'src='assets/flag.png' style='height:15px'>Flag as inappropriate...</p>";
2. This is the connected function
function flagContent(imageUrl){ console.log(imageUrl)}
So basically the url is a string and I'd like to be able to manipulate it within the flagContent function. Unfortunately I can't get it to work. When I pass a numerical parameter such as attachmentInfo.objectID I do not run into the same problem.
For what it's worth I also get this error:
Uncaught SyntaxError: Unexpected token :
Any help would be greatly appreciated. Let me know if there is additional information that could help to solve this. Thanks!

I'm assuming that attachmentInfo.url would return a URL, which should be a string and it just needs to be surrounded by quotes. Since you've already used both types of quotes, you will have to escape some quotes.
content += "<p class='info'>";
content += "<img id='flag' onclick=\"flagContent('" + attachmentInfo.url + "')\" src='file.png'/>";
content += "Flag as inappropriate...";
content += "</p>";
Doing this makes the final output look like this:
<p class='info'>
<img id='flag' onclick="flagContent('http://example.com')" src='file.png'/>
Flag as inappropriate...
</p>
The problem you had was that the URL was not surrounded by quotes, and it saw flagContent(http://example.com) and didn't know what to do with those bare words not in a string.
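Quoting by hand, as above, still breaks if the URL itself contains a quote character. A more defensive sketch is shown below. The helper names escapeAttribute and buildFlagMarkup are hypothetical; flagContent and the image path come from the question.

```javascript
// Escape characters that are special inside a double-quoted HTML attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: builds the flag markup for one attachment URL.
function buildFlagMarkup(url) {
  // JSON.stringify produces a quoted JavaScript string literal with its own
  // escaping, so quotes or backslashes in the URL cannot end the string early.
  var call = "flagContent(" + JSON.stringify(String(url)) + ")";
  return "<p class='info'>" +
    "<img id='flag' onclick=\"" + escapeAttribute(call) + "\" src='assets/flag.png' style='height:15px'>" +
    "Flag as inappropriate...</p>";
}

console.log(buildFlagMarkup("http://example.com"));
```

The browser decodes the &quot; entities in the attribute before running the handler, so flagContent receives the original URL as a string. Another option is to skip the inline handler and attach a click listener with addEventListener after the markup has been inserted.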

Remove formatting tags from string body of email

How do you remove all formatting tags when calling:
GmailApp.getInboxThreads()[0].getMessages()[0].getBody()
such that the only remainder of text is that which can be read.
Formatting can be destroyed; the text in the body is only needed to be parsed, but tags such as:
"&"
<br>
and possibly others, need to be removed.

Even though there's no DOM in Apps Script, you can parse out HTML and get the plain text this way:
function getTextFromHtml(html) {
return getTextFromNode(Xml.parse(html, true).getElement());
}
function getTextFromNode(x) {
switch(x.toString()) {
case 'XmlText': return x.toXmlString();
case 'XmlElement': return x.getNodes().map(getTextFromNode).join('');
default: return '';
}
}
calling
getTextFromHtml("hello <div>foo</div>& world <br /><div>bar</div>!");
will return
"hello foo& world bar!".
To explain, Xml.parse with the second param as "true" parses the document as an HTML page. We then walk the document (which will be patched up with missing HTML and BODY elements, etc. and turned into a valid XHTML page), turning text nodes into text and expanding all other nodes.
This is admittedly poorly documented; I wrote this by playing around with the Xml object and logging intermediate results until I got it to work. We need to document the Xml stuff better.

I noticed you are writing a Google Apps Script. There's no DOM in Google Apps Script, nor can you create elements and get the innerText property.
getBody() gives you the email's body in HTML. You can replace tags with this code:
var html = GmailApp.getInboxThreads()[0].getMessages()[0].getBody();
html=html.replace(/<\/div>/ig, '\n');
html=html.replace(/<\/li>/ig, '\n');
html=html.replace(/<li>/ig, ' *');
html=html.replace(/<\/ul>/ig, '\n');
html=html.replace(/<\/p>/ig, '\n');
html=html.replace(/<br\/?>/ig, '\n');
html=html.replace(/<[^>]+>/ig, '');
Maybe you can find more tags to replace. Remember this code isn't for any HTML, but for the getBody() HTML. Gmail has its own way of formatting the body and doesn't use every possible HTML tag, only a subset, so this Gmail-specific code can be shorter.
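The replacements above leave HTML entities such as &amp; in the text. A possible extension, sketched under the same assumption that the input is Gmail-style HTML, wraps the steps in one function and decodes a handful of common entities afterwards. The name gmailHtmlToText and the particular entity list are assumptions for illustration.

```javascript
// Hypothetical helper: turn Gmail-style HTML into readable plain text.
function gmailHtmlToText(html) {
  var text = String(html)
    .replace(/<\/(div|li|ul|p)>/gi, "\n") // closing block tags end a line
    .replace(/<br\s*\/?>/gi, "\n")        // explicit line breaks
    .replace(/<li[^>]*>/gi, " *")         // list items get a bullet marker
    .replace(/<[^>]+>/g, "");             // drop every remaining tag
  // Decode a few common entities; &amp; goes last so "&amp;lt;" becomes "&lt;", not "<".
  return text
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, "\"")
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

console.log(gmailHtmlToText("<div>Tom &amp; Jerry</div><ul><li>one</li><li>two</li></ul>"));
```

Entities are decoded only after the tags are gone, so an encoded &lt;b&gt; in the message text survives as literal text instead of being stripped as a tag.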

I found an easier way to accomplish this task.
Use the htmlBody advanced argument within the arguments of sendEmail(). Here's an example:
var threads = GmailApp.search('is:unread'); // searches for unread messages
var messages = GmailApp.getMessagesForThreads(threads); // gets messages in a 2D array
for (var i = 0; i < messages.length; ++i)
{
var j = messages[i].length; // message count; j-1 indexes the most recent message, which also quotes the earlier ones (reduces redundancy)
var messageBody = messages[i][j-1].getBody(); // gets body of message in HTML
var messageSubject = messages[i][j-1].getSubject();
GmailApp.sendEmail("dummyuser@dummysite.com", messageSubject, "", {htmlBody: messageBody});
}
First I find all the threads containing unread messages. Then I get the messages contained within the threads into a two-dimensional array using the getMessagesForThreads() method of GmailApp. Then I run a for loop over all of the threads I found. I set j equal to the thread's message count so I can send only the most recent message on the thread (j-1). I get the HTML body of the message with getBody() and the subject with getSubject(). I use sendEmail(recipients, subject, body, optAdvancedArgs) to send the email and process the HTML body. The result is an email sent properly formatted with all features of HTML included. The documentation for these methods can be found here: https://developers.google.com/apps-script/service_gmail
I hope this helps. The manual parsing method does work, but I still found bits and pieces of HTML left hanging around, so I thought I would give this a try. It worked for me; if I find any issues in the long run I will update this post. So far so good!

Google now has the getPlainBody() function that will get the plain text from the body of an email. It is a method of the GmailMessage class.
I had been using a script to send emails to convert them to tasks and google broke it with a change to the functionality of Corey's answer above. I've replaced it with the following.
var taskNote = ((thread.getMessages()[0]).getPlainBody()).substring(0,1000);

I am not sure what you mean by .getBody() - is this supposed to return a DOM body element?
However, the simplest solution for removing HTML tags is probably to let the browser render the HTML and ask it for the text content:
var myHTMLContent = "hello & world <br />!";
var tempDiv = document.createElement('div');
tempDiv.innerHTML = myHTMLContent;
// retrieve the cleaned content:
var textContent = tempDiv.innerText;
With the above example, the textContent variable will contain the text
"hello & world
!"
(Note the line break due to the <br /> tag.)
