Decoding HTML entities in a JavaScript object [duplicate]

I have some JavaScript code that communicates with an XML-RPC backend.
The XML-RPC returns strings of the form:
&lt;img src='myimage.jpg'&gt;
However, when I use the JavaScript to insert the strings into HTML, they render literally. I don't see an image, I literally see the string:
<img src='myimage.jpg'>
My guess is that the HTML is being escaped over the XML-RPC channel.
How can I unescape the string in JavaScript? I tried the techniques on this page, unsuccessfully: http://paulschreiber.com/blog/2008/09/20/javascript-how-to-unescape-html-entities/
What are other ways to diagnose the issue?

Most answers given here have a huge disadvantage: if the string you are trying to convert isn't trusted then you will end up with a Cross-Site Scripting (XSS) vulnerability. For the function in the accepted answer, consider the following:
htmlDecode("<img src='dummy' onerror='alert(/xss/)'>");
The string here contains an unescaped HTML tag, so instead of decoding anything the htmlDecode function will actually run JavaScript code specified inside the string.
This can be avoided by using DOMParser which is supported in all modern browsers:
function htmlDecode(input) {
    var doc = new DOMParser().parseFromString(input, "text/html");
    return doc.documentElement.textContent;
}
console.log( htmlDecode("&lt;img src='myimage.jpg'&gt;") )
// "<img src='myimage.jpg'>"
console.log( htmlDecode("<img src='dummy' onerror='alert(/xss/)'>") )
// ""
This function is guaranteed to not run any JavaScript code as a side-effect. Any HTML tags will be ignored, only text content will be returned.
Compatibility note: Parsing HTML with DOMParser requires at least Chrome 30, Firefox 12, Opera 17, Internet Explorer 10, Safari 7.1 or Microsoft Edge. So all browsers without support are way past their EOL and as of 2017 the only ones that can still be seen in the wild occasionally are older Internet Explorer and Safari versions (usually these still aren't numerous enough to bother).

Do you need to decode all encoded HTML entities or just &amp; itself?
If you only need to handle &amp; then you can do this:
var decoded = encoded.replace(/&amp;/g, '&');
If you need to decode all HTML entities then you can do it without jQuery:
var elem = document.createElement('textarea');
elem.innerHTML = encoded;
var decoded = elem.value;
Please take note of Mark's comments below which highlight security holes in an earlier version of this answer and recommend using textarea rather than div to mitigate against potential XSS vulnerabilities. These vulnerabilities exist whether you use jQuery or plain JavaScript.
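If only a handful of entities matter and no DOM is available (for example in Node.js), a single-pass replace is one option. The sketch below is only an illustration: decodeBasicEntities and BASIC_ENTITIES are made-up names, and the entity list is an assumption about what your data contains. Doing everything in one pass matters because chained replaces can decode the same text twice, e.g. turning &amp;lt; into < instead of &lt;.

```javascript
// Hypothetical helper (not from any library): decodes a fixed, small set of entities.
const BASIC_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", '#39': "'" };

function decodeBasicEntities(encoded) {
    // A single pass over the input: the result of one replacement is never scanned again,
    // so "&amp;lt;" becomes "&lt;" rather than "<".
    return String(encoded).replace(/&(amp|lt|gt|quot|apos|#39);/g, function (match, name) {
        return BASIC_ENTITIES[name];
    });
}

console.log(decodeBasicEntities('&lt;b&gt; &amp;lt;')); // "<b> &lt;"
```

Anything outside the list (such as &copy;) is left as it is, so this is not a replacement for a full decoder.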

EDIT: You should use the DOMParser API as Wladimir suggests, I edited my previous answer since the function posted introduced a security vulnerability.
The following snippet is the old answer's code with a small modification: using a textarea instead of a div reduces the XSS vulnerability, but it is still problematic in IE9 and Firefox.
function htmlDecode(input){
    var e = document.createElement('textarea');
    e.innerHTML = input;
    // handle case of empty input
    return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
htmlDecode("&lt;img src='myimage.jpg'&gt;");
// returns "<img src='myimage.jpg'>"
Basically I create a DOM element programmatically, assign the encoded HTML to its innerHTML and retrieve the nodeValue from the text node created on the innerHTML insertion. Since it just creates an element but never adds it, no site HTML is modified.
It will work cross-browser (including older browsers) and accept all the HTML Character Entities.
EDIT: The old version of this code did not work on IE with blank inputs, as evidenced here on jsFiddle (view in IE). The version above works with all inputs.
UPDATE: It appears this doesn't work with large strings, and it also introduces a security vulnerability; see the comments.

A more modern option for interpreting HTML (text and otherwise) from JavaScript is the HTML support in the DOMParser API (see here in MDN). This allows you to use the browser's native HTML parser to convert a string to an HTML document. It has been supported in new versions of all major browsers since late 2014.
If we just want to decode some text content, we can put it as the sole content in a document body, parse the document, and pull out its .body.textContent.
var encodedStr = 'hello &amp; world';
var parser = new DOMParser();
var dom = parser.parseFromString(
    '<!doctype html><body>' + encodedStr,
    'text/html');
var decodedString = dom.body.textContent;
console.log(decodedString);
We can see in the draft specification for DOMParser that JavaScript is not enabled for the parsed document, so we can perform this text conversion without security concerns.
The parseFromString(str, type) method must run these steps, depending on type:
"text/html"
Parse str with an HTML parser, and return the newly created Document.
The scripting flag must be set to "disabled".
NOTE
script elements get marked unexecutable and the contents of noscript get parsed as markup.
It's beyond the scope of this question, but please note that if you're taking the parsed DOM nodes themselves (not just their text content) and moving them to the live document DOM, it's possible that their scripting would be reenabled, and there could be security concerns. I haven't researched it, so please exercise caution.

Mathias Bynens has a library for this: https://github.com/mathiasbynens/he
Example:
console.log(
    he.decode("J&#246;rg &amp J&#xFC;rgen rocked to &amp; fro ")
);
// Logs "Jörg & Jürgen rocked to & fro"
I suggest favouring it over hacks involving setting an element's HTML content and then reading back its text content. Such approaches can work, but are deceptively dangerous and present XSS opportunities if used on untrusted user input.
If you really can't bear to load in a library, you can use the textarea hack described in this answer to a near-duplicate question, which, unlike various similar approaches that have been suggested, has no security holes that I know of:
function decodeEntities(encodedString) {
    var textArea = document.createElement('textarea');
    textArea.innerHTML = encodedString;
    return textArea.value;
}
console.log(decodeEntities('1 &amp; 2')); // '1 & 2'
But take note of the security issues, affecting similar approaches to this one, that I list in the linked answer! This approach is a hack, and future changes to the permissible content of a textarea (or bugs in particular browsers) could lead to code that relies upon it suddenly having an XSS hole one day.

If you're using jQuery:
function htmlDecode(value){
return $('<div/>').html(value).text();
}
Otherwise, use Strictly Software's Encoder Object, which has an excellent htmlDecode() function.

You can use Lodash's unescape / escape functions: https://lodash.com/docs/4.17.5#unescape
import unescape from 'lodash/unescape';
const str = unescape('fred, barney, &amp; pebbles');
str will become 'fred, barney, & pebbles'

var htmlEnDeCode = (function() {
    var charToEntityRegex,
        entityToCharRegex,
        charToEntity,
        entityToChar;
    function resetCharacterEntities() {
        charToEntity = {};
        entityToChar = {};
        // add the default set
        addCharacterEntities({
            '&amp;'  : '&',
            '&gt;'   : '>',
            '&lt;'   : '<',
            '&quot;' : '"',
            '&#39;'  : "'"
        });
    }
    function addCharacterEntities(newEntities) {
        var charKeys = [],
            entityKeys = [],
            key, echar;
        for (key in newEntities) {
            echar = newEntities[key];
            entityToChar[key] = echar;
            charToEntity[echar] = key;
            charKeys.push(echar);
            entityKeys.push(key);
        }
        charToEntityRegex = new RegExp('(' + charKeys.join('|') + ')', 'g');
        entityToCharRegex = new RegExp('(' + entityKeys.join('|') + '|&#[0-9]{1,5};' + ')', 'g');
    }
    function htmlEncode(value){
        var htmlEncodeReplaceFn = function(match, capture) {
            return charToEntity[capture];
        };
        return (!value) ? value : String(value).replace(charToEntityRegex, htmlEncodeReplaceFn);
    }
    function htmlDecode(value) {
        var htmlDecodeReplaceFn = function(match, capture) {
            // known entity: look it up; otherwise treat it as a decimal reference such as &#169;
            return (capture in entityToChar) ? entityToChar[capture] : String.fromCharCode(parseInt(capture.substr(2), 10));
        };
        return (!value) ? value : String(value).replace(entityToCharRegex, htmlDecodeReplaceFn);
    }
    resetCharacterEntities();
    return {
        htmlEncode: htmlEncode,
        htmlDecode: htmlDecode
    };
})();
This is from ExtJS source code.

The trick is to use the power of the browser to decode the special HTML characters, but not allow the browser to execute the results as if it was actual html... This function uses a regex to identify and replace encoded HTML characters, one character at a time.
function unescapeHtml(html) {
    var el = document.createElement('div');
    return html.replace(/\&[#0-9a-z]+;/gi, function (enc) {
        el.innerHTML = enc;
        return el.innerText;
    });
}

element.innerText also does the trick.

In case you're looking for it, like me: there is now a nice and safe jQuery method.
https://api.jquery.com/jquery.parsehtml/
For example, you can type this in your console:
var x = "test &amp;";
> undefined
$.parseHTML(x)[0].textContent
> "test &"
So $.parseHTML(x) returns an array, and if you have HTML markup within your text, the array.length will be greater than 1.

jQuery will encode and decode for you. However, you need to use a textarea tag, not a div.
var str1 = 'One & two & three';
var str2 = "One &amp; two &amp; three";
$(document).ready(function() {
    $("#encoded").text(htmlEncode(str1));
    $("#decoded").text(htmlDecode(str2));
});
function htmlDecode(value) {
    return $("<textarea/>").html(value).text();
}
function htmlEncode(value) {
    return $('<textarea/>').text(value).html();
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<div id="encoded"></div>
<div id="decoded"></div>

CMS' answer works fine, unless the HTML you want to unescape is very long (longer than 65536 characters). In that case Chrome splits the inner HTML into many child nodes, each at most 65536 characters long, and you need to concatenate them. This function also works for very long strings:
function unencodeHtmlContent(escapedHtml) {
var elem = document.createElement('div');
elem.innerHTML = escapedHtml;
var result = '';
// Chrome splits innerHTML into many child nodes, each one at most 65536.
// Whereas FF creates just one single huge child node.
for (var i = 0; i < elem.childNodes.length; ++i) {
result = result + elem.childNodes[i].nodeValue;
}
return result;
}
See this answer about innerHTML max length for more info: https://stackoverflow.com/a/27545633/694469

To unescape HTML entities* in JavaScript you can use the small html-escaper library: npm install html-escaper
import {unescape} from 'html-escaper';
unescape('escaped string');
Or use the unescape function from Lodash or Underscore, if you are already using one of them.
*) Please note that these functions don't cover all HTML entities, only the most common ones, i.e. &, <, >, ', ". To unescape all HTML entities you can use the he library.
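To judge whether such a small subset is enough for your data, you could first list which entity references in a sample string fall outside it. This is a rough sketch under assumptions: findOtherEntities is a made-up name, and the default "handled" list is my guess at a typical subset, not the documented behaviour of html-escaper or Lodash.

```javascript
// Hypothetical diagnostic helper: reports entity references that a small unescaper would skip.
function findOtherEntities(str, handled) {
    // Assumed default subset; pass your own list to match the library you actually use.
    var known = handled || ['amp', 'lt', 'gt', 'quot', '#39'];
    var found = [];
    String(str).replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, function (match, name) {
        // Collect each unhandled reference once, in order of first appearance.
        if (known.indexOf(name) === -1 && found.indexOf(match) === -1) {
            found.push(match);
        }
        return match;
    });
    return found;
}

console.log(findOtherEntities('Tom &amp; J&eacute;r&ocirc;me &copy; 2020'));
// three entries: "&eacute;", "&ocirc;", "&copy;"
```

If the result is non-empty for realistic input, a full decoder such as he is probably the better fit.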

First create a <span id="decodeIt" style="display:none;"></span> somewhere in the body
Next, assign the string to be decoded as innerHTML to this:
document.getElementById("decodeIt").innerHTML=stringtodecode
Finally,
stringtodecode=document.getElementById("decodeIt").innerText
Here is the overall code:
var stringtodecode="&lt;B&gt;Hello&lt;/B&gt; world&lt;br&gt;";
document.getElementById("decodeIt").innerHTML=stringtodecode;
stringtodecode=document.getElementById("decodeIt").innerText

The question doesn't specify the origin of x but it makes sense to defend, if we can, against malicious (or just unexpected, from our own application) input. For example, suppose x has a value of &amp; <script>alert('hello');</script>. A safe and simple way to handle this in jQuery is:
var x = "&amp; <script>alert('hello');</script>";
var safe = $('<div />').html(x).text();
// => "& alert('hello');"
Found via https://gist.github.com/jmblog/3222899. I can't see many reasons to avoid using this solution given it is at least as short, if not shorter than some alternatives and provides defence against XSS.
(I originally posted this as a comment, but am adding it as an answer since a subsequent comment in the same thread requested that I do so).

Not a direct response to your question, but wouldn't it be better for your RPC to return some structure (be it XML or JSON or whatever) with those image data (urls in your example) inside that structure?
Then you could just parse it in your javascript and build the <img> using javascript itself.
The structure you receive from RPC could look like:
{"img" : ["myimage.jpg", "myimage2.jpg"]}
I think it's better this way, as injecting code that comes from an external source into your page doesn't look very secure. Imagine someone hijacking your XML-RPC script and putting something you wouldn't want in there (even some JavaScript...)
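As a rough sketch of that idea, assuming the response is a JSON string shaped like the example above: the function names buildImageTags and escapeAttribute are made up for illustration, and only the file names from the response ever reach the markup.

```javascript
// Hypothetical helper: escapes characters that are significant inside a double-quoted attribute.
function escapeAttribute(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/"/g, '&quot;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

// Builds one <img> tag per entry of the "img" array in the (assumed) JSON response.
function buildImageTags(responseText) {
    var data = JSON.parse(responseText);
    return data.img.map(function (url) {
        return '<img src="' + escapeAttribute(url) + '">';
    });
}

console.log(buildImageTags('{"img" : ["myimage.jpg", "myimage2.jpg"]}').join('\n'));
// <img src="myimage.jpg">
// <img src="myimage2.jpg">
```

In a browser you could equally create the elements with document.createElement('img') and assign src directly, which avoids building HTML strings at all.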

For one-line guys:
const htmlDecode = innerHTML => Object.assign(document.createElement('textarea'), {innerHTML}).value;
console.log(htmlDecode('Complicated - Dimitri Vegas &amp; Like Mike'));

You're welcome...just a messenger...full credit goes to ourcodeworld.com, link below.
window.htmlentities = {
    /**
     * Converts every character of a string to its numeric HTML entity.
     *
     * @param {String} str String with unescaped HTML characters
     **/
    encode : function(str) {
        var buf = [];
        for (var i=str.length-1;i>=0;i--) {
            buf.unshift(['&#', str[i].charCodeAt(), ';'].join(''));
        }
        return buf.join('');
    },
    /**
     * Converts decimal numeric HTML entities back into their original characters.
     *
     * @param {String} str String containing entities such as &#38;
     **/
    decode : function(str) {
        return str.replace(/&#(\d+);/g, function(match, dec) {
            return String.fromCharCode(dec);
        });
    }
};
Full Credit: https://ourcodeworld.com/articles/read/188/encode-and-decode-html-entities-using-pure-javascript
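One limitation worth knowing: a decode based on /&#(\d+);/ and String.fromCharCode only covers decimal references up to 65535, so hexadecimal references (&#x26;) and characters outside the Basic Multilingual Plane (most emoji) are not handled. A possible variant, sketched here with a made-up name (decodeNumericEntities) and without the special cases a real HTML parser applies to a few code points, could look like this:

```javascript
// Hypothetical variant: decodes decimal (&#38;) and hexadecimal (&#x26;) references.
function decodeNumericEntities(str) {
    return String(str).replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, function (match, hex, dec) {
        var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
        // Leave out-of-range references untouched instead of letting fromCodePoint throw.
        return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
    });
}

console.log(decodeNumericEntities('&#72;i &#x1F600;')); // "Hi 😀"
```

Named entities such as &amp; are deliberately out of scope here; they pass through unchanged.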

I know there are a lot of good answers here, but since I have implemented a bit different approach, I thought to share.
This approach is safe security-wise, because the escaping is handled by the browser rather than by the function itself. So if a new vulnerability is discovered in the future, the browser's fix will cover this solution as well.
const decodeHTMLEntities = text => {
// Create a new element or use one from cache, to save some element creation overhead
const el = decodeHTMLEntities.__cache_data_element
= decodeHTMLEntities.__cache_data_element
|| document.createElement('div');
const enc = text
// Prevent any mixup of existing pattern in text
.replace(/⪪/g, '⪪#')
// Encode entities in special format. This will prevent native element encoder to replace any amp characters
.replace(/&([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+);/gi, '⪪$1⪫');
// Encode any HTML tags in the text to prevent script injection
el.textContent = enc;
// Decode entities from special format, back to their original HTML entities format
el.innerHTML = el.innerHTML
.replace(/⪪([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+)⪫/gi, '&$1;')
.replace(/#⪫/g, '⪫');
// Get the decoded HTML entities
const dec = el.textContent;
// Clear the element content, in order to preserve a bit of memory (it is just the text may be pretty big)
el.textContent = '';
return dec;
}
// Example
console.log(decodeHTMLEntities("<script>alert('&awconint;&CounterClockwiseContourIntegral;∳∳⪪#x02233⪫');</script>"));
// Prints: <script>alert('∳∳∳∳⪪##x02233⪫');</script>
By the way, I have chosen to use the characters ⪪ and ⪫, because they are rarely used, so the chance of impacting the performance by matching them is significantly lower.

Chris' answer is nice and elegant, but it fails if value is undefined. A simple improvement makes it solid:
function htmlDecode(value) {
    return (typeof value === 'undefined') ? '' : $('<div/>').html(value).text();
}

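A JavaScript solution that handles the common ones (entities that are not in the map are left untouched):
var map = {amp: '&', lt: '<', gt: '>', quot: '"', '#039': "'"}
str = str.replace(/&([^;]+);/g, (m, c) => Object.prototype.hasOwnProperty.call(map, c) ? map[c] : m)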
this is the reverse of https://stackoverflow.com/a/4835406/2738039

I tried everything to remove &amp; from a JSON array. None of the above examples worked for me, but https://stackoverflow.com/users/2030321/chris gave a great solution that led me to fix my problem.
var stringtodecode="&lt;B&gt;Hello&lt;/B&gt; world&lt;br&gt;";
document.getElementById("decodeIt").innerHTML=stringtodecode;
stringtodecode=document.getElementById("decodeIt").innerText
I did not use it, because I did not understand how to insert it into a modal window that was pulling JSON data into an array, but I did try this based on the example, and it worked:
var modal = document.getElementById('demodal');
$('#ampersandcontent').text(replaceAll(data[0],"&amp;", "&"));
I like it because it is simple and it works, but I am not sure why it's not widely used. I searched high and low to find a simple solution.
I continue to seek understanding of the syntax, and whether there is any risk in using this. I have not found anything yet.

I was crazy enough to go through and make this function that should be pretty, if not completely, exhaustive:
function removeEncoding(string) {
    return string
        .replace(/&Agrave;/g, "À").replace(/&Aacute;/g, "Á").replace(/&Acirc;/g, "Â").replace(/&Atilde;/g, "Ã").replace(/&Auml;/g, "Ä").replace(/&Aring;/g, "Å")
        .replace(/&agrave;/g, "à").replace(/&acirc;/g, "â").replace(/&atilde;/g, "ã").replace(/&auml;/g, "ä").replace(/&aring;/g, "å")
        .replace(/&AElig;/g, "Æ").replace(/&aelig;/g, "æ").replace(/&szlig;/g, "ß").replace(/&Ccedil;/g, "Ç").replace(/&ccedil;/g, "ç")
        .replace(/&Egrave;/g, "È").replace(/&Eacute;/g, "É").replace(/&Ecirc;/g, "Ê").replace(/&Euml;/g, "Ë")
        .replace(/&egrave;/g, "è").replace(/&eacute;/g, "é").replace(/&ecirc;/g, "ê").replace(/&euml;/g, "ë").replace(/&fnof;/g, "ƒ")
        .replace(/&Igrave;/g, "Ì").replace(/&Iacute;/g, "Í").replace(/&Icirc;/g, "Î").replace(/&Iuml;/g, "Ï")
        .replace(/&igrave;/g, "ì").replace(/&iacute;/g, "í").replace(/&icirc;/g, "î").replace(/&iuml;/g, "ï")
        .replace(/&Ntilde;/g, "Ñ").replace(/&ntilde;/g, "ñ")
        .replace(/&Ograve;/g, "Ò").replace(/&Oacute;/g, "Ó").replace(/&Ocirc;/g, "Ô").replace(/&Otilde;/g, "Õ").replace(/&Ouml;/g, "Ö")
        .replace(/&ograve;/g, "ò").replace(/&oacute;/g, "ó").replace(/&ocirc;/g, "ô").replace(/&otilde;/g, "õ").replace(/&ouml;/g, "ö")
        .replace(/&Oslash;/g, "Ø").replace(/&oslash;/g, "ø").replace(/&OElig;/g, "Œ").replace(/&oelig;/g, "œ").replace(/&Scaron;/g, "Š").replace(/&scaron;/g, "š")
        .replace(/&Ugrave;/g, "Ù").replace(/&Uacute;/g, "Ú").replace(/&Ucirc;/g, "Û").replace(/&Uuml;/g, "Ü")
        .replace(/&ugrave;/g, "ù").replace(/&uacute;/g, "ú").replace(/&ucirc;/g, "û").replace(/&uuml;/g, "ü")
        .replace(/&micro;/g, "µ").replace(/&times;/g, "×").replace(/&Yacute;/g, "Ý").replace(/&Yuml;/g, "Ÿ").replace(/&yacute;/g, "ý").replace(/&yuml;/g, "ÿ")
        .replace(/&deg;/g, "°").replace(/&dagger;/g, "†").replace(/&Dagger;/g, "‡").replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&plusmn;/g, "±")
        .replace(/&laquo;/g, "«").replace(/&raquo;/g, "»").replace(/&iquest;/g, "¿").replace(/&iexcl;/g, "¡").replace(/&middot;/g, "·").replace(/&bull;/g, "•")
        .replace(/&trade;/g, "™").replace(/&copy;/g, "©").replace(/&reg;/g, "®").replace(/&sect;/g, "§").replace(/&para;/g, "¶")
        .replace(/&Alpha;/g, "Α").replace(/&Beta;/g, "Β").replace(/&Gamma;/g, "Γ").replace(/&Delta;/g, "Δ").replace(/&Epsilon;/g, "Ε").replace(/&Zeta;/g, "Ζ")
        .replace(/&Eta;/g, "Η").replace(/&Theta;/g, "Θ").replace(/&Iota;/g, "Ι").replace(/&Kappa;/g, "Κ").replace(/&Lambda;/g, "Λ").replace(/&Mu;/g, "Μ")
        .replace(/&Nu;/g, "Ν").replace(/&Xi;/g, "Ξ").replace(/&Omicron;/g, "Ο").replace(/&Pi;/g, "Π").replace(/&Rho;/g, "Ρ").replace(/&Sigma;/g, "Σ")
        .replace(/&Tau;/g, "Τ").replace(/&Upsilon;/g, "Υ").replace(/&Phi;/g, "Φ").replace(/&Chi;/g, "Χ").replace(/&Psi;/g, "Ψ").replace(/&Omega;/g, "Ω")
        .replace(/&alpha;/g, "α").replace(/&beta;/g, "β").replace(/&gamma;/g, "γ").replace(/&delta;/g, "δ").replace(/&epsilon;/g, "ε").replace(/&zeta;/g, "ζ")
        .replace(/&eta;/g, "η").replace(/&theta;/g, "θ").replace(/&iota;/g, "ι").replace(/&kappa;/g, "κ").replace(/&lambda;/g, "λ").replace(/&mu;/g, "μ")
        .replace(/&nu;/g, "ν").replace(/&xi;/g, "ξ").replace(/&omicron;/g, "ο").replace(/&pi;/g, "π").replace(/&rho;/g, "ρ").replace(/&sigmaf;/g, "ς")
        .replace(/&sigma;/g, "σ").replace(/&tau;/g, "τ").replace(/&phi;/g, "φ").replace(/&chi;/g, "χ").replace(/&psi;/g, "ψ").replace(/&omega;/g, "ω")
        .replace(/&bull;/g, "•").replace(/&hellip;/g, "…").replace(/&prime;/g, "′").replace(/&Prime;/g, "″").replace(/&oline;/g, "‾").replace(/&frasl;/g, "⁄")
        .replace(/&weierp;/g, "℘").replace(/&image;/g, "ℑ").replace(/&real;/g, "ℜ").replace(/&trade;/g, "™").replace(/&alefsym;/g, "ℵ")
        .replace(/&larr;/g, "←").replace(/&uarr;/g, "↑").replace(/&rarr;/g, "→").replace(/&darr;/g, "↓").replace(/&harr;/g, "↔").replace(/&crarr;/g, "↵")
        .replace(/&lArr;/g, "⇐").replace(/&uArr;/g, "⇑").replace(/&rArr;/g, "⇒").replace(/&dArr;/g, "⇓").replace(/&hArr;/g, "⇔")
        .replace(/&forall;/g, "∀").replace(/&part;/g, "∂").replace(/&exist;/g, "∃").replace(/&empty;/g, "∅").replace(/&nabla;/g, "∇")
        .replace(/&isin;/g, "∈").replace(/&notin;/g, "∉").replace(/&ni;/g, "∋").replace(/&prod;/g, "∏").replace(/&sum;/g, "∑")
        .replace(/&minus;/g, "−").replace(/&lowast;/g, "∗").replace(/&radic;/g, "√").replace(/&prop;/g, "∝").replace(/&infin;/g, "∞")
        .replace(/&OElig;/g, "Œ").replace(/&oelig;/g, "œ").replace(/&Yuml;/g, "Ÿ")
        .replace(/&spades;/g, "♠").replace(/&clubs;/g, "♣").replace(/&hearts;/g, "♥").replace(/&diams;/g, "♦")
        .replace(/&thetasym;/g, "ϑ").replace(/&upsih;/g, "ϒ").replace(/&piv;/g, "ϖ").replace(/&Scaron;/g, "Š").replace(/&scaron;/g, "š")
        .replace(/&ang;/g, "∠").replace(/&and;/g, "∧").replace(/&or;/g, "∨").replace(/&cap;/g, "∩").replace(/&cup;/g, "∪").replace(/&int;/g, "∫")
        .replace(/&there4;/g, "∴").replace(/&sim;/g, "∼").replace(/&cong;/g, "≅").replace(/&asymp;/g, "≈").replace(/&ne;/g, "≠").replace(/&equiv;/g, "≡")
        .replace(/&le;/g, "≤").replace(/&ge;/g, "≥").replace(/&sub;/g, "⊂").replace(/&sup;/g, "⊃").replace(/&nsub;/g, "⊄").replace(/&sube;/g, "⊆").replace(/&supe;/g, "⊇")
        .replace(/&oplus;/g, "⊕").replace(/&otimes;/g, "⊗").replace(/&perp;/g, "⊥").replace(/&sdot;/g, "⋅")
        .replace(/&lceil;/g, "⌈").replace(/&rceil;/g, "⌉").replace(/&lfloor;/g, "⌊").replace(/&rfloor;/g, "⌋")
        .replace(/&lang;/g, "⟨").replace(/&rang;/g, "⟩").replace(/&loz;/g, "◊")
        // &amp; is decoded last so that input such as "&amp;lt;" is not decoded twice
        .replace(/&#39;/g, "'").replace(/&amp;/g, "&").replace(/&quot;/g, "\"");
}
Used like so:
let decodedText = removeEncoding("Ich hei&szlig;e David");
console.log(decodedText);
Prints: Ich heiße David
P.S. this took like an hour and a half to make.

This is the most comprehensive solution I've tried so far:
const STANDARD_HTML_ENTITIES = {
nbsp: String.fromCharCode(160),
amp: "&",
quot: '"',
lt: "<",
gt: ">"
};
const replaceHtmlEntities = plainTextString => {
return plainTextString
.replace(/&#(\d+);/g, (match, dec) => String.fromCharCode(dec))
.replace(
/&(nbsp|amp|quot|lt|gt);/g,
(a, b) => STANDARD_HTML_ENTITIES[b]
);
};

Closures can avoid creating unnecessary objects.
const decodingHandler = (() => {
const element = document.createElement('div');
return text => {
element.innerHTML = text;
return element.textContent;
};
})();
A more concise way
const decodingHandler = (() => {
const element = document.createElement('div');
return text => ((element.innerHTML = text), element.textContent);
})();

I use this in my project. It is inspired by other answers but takes an extra safeEscape parameter, which can be useful when you deal with decorated characters:
var decodeEntities=(function(){
    var el=document.createElement('div');
    return function(str, safeEscape){
        if(str && typeof str === 'string'){
            // escape "<" first so the browser never parses tags from the input
            str=str.replace(/\</g, '&lt;');
            el.innerHTML=str;
            if(el.innerText){
                str=el.innerText;
                el.innerText='';
            }
            else if(el.textContent){
                str=el.textContent;
                el.textContent='';
            }
            if(safeEscape)
                str=str.replace(/\</g, '&lt;');
        }
        return str;
    }
})();
And it's usable like:
var label='safe <b> character &eacute;ntity</b>';
var safehtml='<div title="'+decodeEntities(label)+'">'+decodeEntities(label, true)+'</div>';

var encodedStr = 'hello &amp; world';
var parser = new DOMParser();
var dom = parser.parseFromString(
'<!doctype html><body>' + encodedStr,
'text/html');
var decodedString = dom.body.textContent;
console.log(decodedString);

// decode-html.js v1
function decodeHtml(html) {
const textarea = document.createElement('textarea');
textarea.innerHTML = html;
const decodedHtml = textarea.textContent;
textarea.remove();
return decodedHtml;
};
// encode-html.js v1
function encodeHtml(html) {
const textarea = document.createElement('textarea');
textarea.textContent = html;
const encodedHtml = textarea.innerHTML;
textarea.remove();
return encodedHtml;
};
// example of use:
let htmlDecoded = 'one & two & three';
let htmlEncoded = 'one &amp; two &amp; three';
console.log(1, htmlDecoded);
console.log(2, encodeHtml(htmlDecoded));
console.log(3, htmlEncoded);
console.log(4, decodeHtml(htmlEncoded));

All of the other answers here have problems.
The document.createElement('div') methods (including those using jQuery) execute any javascript passed into it (a security issue) and the DOMParser.parseFromString() method trims whitespace. Here is a pure javascript solution that has neither problem:
function htmlDecode(html) {
var textarea = document.createElement("textarea");
html= html.replace(/\r/g, String.fromCharCode(0xe000)); // Replace "\r" with reserved unicode character.
textarea.innerHTML = html;
var result = textarea.value;
return result.replace(new RegExp(String.fromCharCode(0xe000), 'g'), '\r');
}
TextArea is used specifically to avoid executing JS code. It passes these:
htmlDecode('&lt;&amp;&nbsp;&gt;'); // returns "<& >" with non-breaking space.
htmlDecode(' '); // returns " "
htmlDecode('<img src="dummy" onerror="alert(\'xss\')">'); // Does not execute alert()
htmlDecode('\r\n') // returns "\r\n", doesn't lose the \r like other solutions.

Related

How to decode/replace all charCodes with characters in a string using javascript? [duplicate]

I have some JavaScript code that communicates with an XML-RPC backend.
The XML-RPC returns strings of the form:
<img src='myimage.jpg'>
However, when I use the JavaScript to insert the strings into HTML, they render literally. I don't see an image, I literally see the string:
<img src='myimage.jpg'>
My guess is that the HTML is being escaped over the XML-RPC channel.
How can I unescape the string in JavaScript? I tried the techniques on this page, unsuccessfully: http://paulschreiber.com/blog/2008/09/20/javascript-how-to-unescape-html-entities/
What are other ways to diagnose the issue?
Most answers given here have a huge disadvantage: if the string you are trying to convert isn't trusted then you will end up with a Cross-Site Scripting (XSS) vulnerability. For the function in the accepted answer, consider the following:
htmlDecode("<img src='dummy' onerror='alert(/xss/)'>");
The string here contains an unescaped HTML tag, so instead of decoding anything the htmlDecode function will actually run JavaScript code specified inside the string.
This can be avoided by using DOMParser which is supported in all modern browsers:
function htmlDecode(input) {
var doc = new DOMParser().parseFromString(input, "text/html");
return doc.documentElement.textContent;
}
console.log( htmlDecode("<img src='myimage.jpg'>") )
// "<img src='myimage.jpg'>"
console.log( htmlDecode("<img src='dummy' onerror='alert(/xss/)'>") )
// ""
This function is guaranteed to not run any JavaScript code as a side-effect. Any HTML tags will be ignored, only text content will be returned.
Compatibility note: Parsing HTML with DOMParser requires at least Chrome 30, Firefox 12, Opera 17, Internet Explorer 10, Safari 7.1 or Microsoft Edge. So all browsers without support are way past their EOL and as of 2017 the only ones that can still be seen in the wild occasionally are older Internet Explorer and Safari versions (usually these still aren't numerous enough to bother).
Do you need to decode all encoded HTML entities or just & itself?
If you only need to handle & then you can do this:
var decoded = encoded.replace(/&/g, '&');
If you need to decode all HTML entities then you can do it without jQuery:
var elem = document.createElement('textarea');
elem.innerHTML = encoded;
var decoded = elem.value;
Please take note of Mark's comments below which highlight security holes in an earlier version of this answer and recommend using textarea rather than div to mitigate against potential XSS vulnerabilities. These vulnerabilities exist whether you use jQuery or plain JavaScript.
EDIT: You should use the DOMParser API as Wladimir suggests, I edited my previous answer since the function posted introduced a security vulnerability.
The following snippet is the old answer's code with a small modification: using a textarea instead of a div reduces the XSS vulnerability, but it is still problematic in IE9 and Firefox.
function htmlDecode(input){
var e = document.createElement('textarea');
e.innerHTML = input;
// handle case of empty input
return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
htmlDecode("<img src='myimage.jpg'>");
// returns "<img src='myimage.jpg'>"
Basically I create a DOM element programmatically, assign the encoded HTML to its innerHTML and retrieve the nodeValue from the text node created on the innerHTML insertion. Since it just creates an element but never adds it, no site HTML is modified.
It will work cross-browser (including older browsers) and accept all the HTML Character Entities.
EDIT: The old version of this code did not work on IE with blank inputs, as evidenced here on jsFiddle (view in IE). The version above works with all inputs.
UPDATE: appears this doesn't work with large string, and it also introduces a security vulnerability, see comments.
A more modern option for interpreting HTML (text and otherwise) from JavaScript is the HTML support in the DOMParser API (see here in MDN). This allows you to use the browser's native HTML parser to convert a string to an HTML document. It has been supported in new versions of all major browsers since late 2014.
If we just want to decode some text content, we can put it as the sole content in a document body, parse the document, and pull out the its .body.textContent.
var encodedStr = 'hello & world';
var parser = new DOMParser;
var dom = parser.parseFromString(
'<!doctype html><body>' + encodedStr,
'text/html');
var decodedString = dom.body.textContent;
console.log(decodedString);
We can see in the draft specification for DOMParser that JavaScript is not enabled for the parsed document, so we can perform this text conversion without security concerns.
The parseFromString(str, type) method must run these steps, depending on type:
"text/html"
Parse str with an HTML parser, and return the newly created Document.
The scripting flag must be set to "disabled".
NOTE
script elements get marked unexecutable and the contents of noscript get parsed as markup.
It's beyond the scope of this question, but please note that if you're taking the parsed DOM nodes themselves (not just their text content) and moving them to the live document DOM, it's possible that their scripting would be reenabled, and there could be security concerns. I haven't researched it, so please exercise caution.
Matthias Bynens has a library for this: https://github.com/mathiasbynens/he
Example:
console.log(
he.decode("Jörg &amp Jürgen rocked to & fro ")
);
// Logs "Jörg & Jürgen rocked to & fro"
I suggest favouring it over hacks involving setting an element's HTML content and then reading back its text content. Such approaches can work, but are deceptively dangerous and present XSS opportunities if used on untrusted user input.
If you really can't bear to load in a library, you can use the textarea hack described in this answer to a near-duplicate question, which, unlike various similar approaches that have been suggested, has no security holes that I know of:
function decodeEntities(encodedString) {
var textArea = document.createElement('textarea');
textArea.innerHTML = encodedString;
return textArea.value;
}
console.log(decodeEntities('1 & 2')); // '1 & 2'
But take note of the security issues, affecting similar approaches to this one, that I list in the linked answer! This approach is a hack, and future changes to the permissible content of a textarea (or bugs in particular browsers) could lead to code that relies upon it suddenly having an XSS hole one day.
If you're using jQuery:
function htmlDecode(value){
return $('<div/>').html(value).text();
}
Otherwise, use Strictly Software's Encoder Object, which has an excellent htmlDecode() function.
You can use Lodash unescape / escape function https://lodash.com/docs/4.17.5#unescape
import unescape from 'lodash/unescape';
const str = unescape('fred, barney, & pebbles');
str will become 'fred, barney, & pebbles'
var htmlEnDeCode = (function() {
var charToEntityRegex,
entityToCharRegex,
charToEntity,
entityToChar;
function resetCharacterEntities() {
charToEntity = {};
entityToChar = {};
// add the default set
addCharacterEntities({
'&' : '&',
'>' : '>',
'<' : '<',
'"' : '"',
''' : "'"
});
}
function addCharacterEntities(newEntities) {
var charKeys = [],
entityKeys = [],
key, echar;
for (key in newEntities) {
echar = newEntities[key];
entityToChar[key] = echar;
charToEntity[echar] = key;
charKeys.push(echar);
entityKeys.push(key);
}
charToEntityRegex = new RegExp('(' + charKeys.join('|') + ')', 'g');
entityToCharRegex = new RegExp('(' + entityKeys.join('|') + '|&#[0-9]{1,5};' + ')', 'g');
}
function htmlEncode(value){
var htmlEncodeReplaceFn = function(match, capture) {
return charToEntity[capture];
};
return (!value) ? value : String(value).replace(charToEntityRegex, htmlEncodeReplaceFn);
}
function htmlDecode(value) {
var htmlDecodeReplaceFn = function(match, capture) {
return (capture in entityToChar) ? entityToChar[capture] : String.fromCharCode(parseInt(capture.substr(2), 10));
};
return (!value) ? value : String(value).replace(entityToCharRegex, htmlDecodeReplaceFn);
}
resetCharacterEntities();
return {
htmlEncode: htmlEncode,
htmlDecode: htmlDecode
};
})();
This is from ExtJS source code.
The trick is to use the power of the browser to decode the special HTML characters, but not allow the browser to execute the results as if it was actual html... This function uses a regex to identify and replace encoded HTML characters, one character at a time.
function unescapeHtml(html) {
var el = document.createElement('div');
return html.replace(/\&[#0-9a-z]+;/gi, function (enc) {
el.innerHTML = enc;
return el.innerText
});
}
element.innerText also does the trick.
In case you're looking for it, like me - meanwhile there's a nice and safe JQuery method.
https://api.jquery.com/jquery.parsehtml/
For example, you can type this in your console:
var x = "test &amp;";
> undefined
$.parseHTML(x)[0].textContent
> "test &"
So $.parseHTML(x) returns an array, and if you have HTML markup within your text, the array.length will be greater than 1.
jQuery will encode and decode for you. However, you need to use a textarea tag, not a div.
var str1 = 'One & two & three';
var str2 = "One &amp; two &amp; three";
$(document).ready(function() {
$("#encoded").text(htmlEncode(str1));
$("#decoded").text(htmlDecode(str2));
});
function htmlDecode(value) {
return $("<textarea/>").html(value).text();
}
function htmlEncode(value) {
return $('<textarea/>').text(value).html();
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<div id="encoded"></div>
<div id="decoded"></div>
CMS' answer works fine unless the HTML you want to unescape is very long (longer than 65536 characters). In that case Chrome splits the inner HTML into many child nodes, each at most 65536 characters long, and you need to concatenate them. This function also works for very long strings:
function unencodeHtmlContent(escapedHtml) {
var elem = document.createElement('div');
elem.innerHTML = escapedHtml;
var result = '';
// Chrome splits innerHTML into many child nodes, each one at most 65536.
// Whereas FF creates just one single huge child node.
for (var i = 0; i < elem.childNodes.length; ++i) {
result = result + elem.childNodes[i].nodeValue;
}
return result;
}
See this answer about innerHTML max length for more info: https://stackoverflow.com/a/27545633/694469
To unescape HTML entities* in JavaScript you can use the small html-escaper library: npm install html-escaper
import {unescape} from 'html-escaper';
unescape('escaped string');
Or use the unescape function from Lodash or Underscore, if you are already using one of them.
*) Please note that these functions don't cover all HTML entities, only the most common ones, i.e. &, <, >, ', ". To unescape all HTML entities you can use the he library.
First create a <span id="decodeIt" style="display:none;"></span> somewhere in the body
Next, assign the string to be decoded as innerHTML to this:
document.getElementById("decodeIt").innerHTML=stringtodecode
Finally,
stringtodecode=document.getElementById("decodeIt").innerText
Here is the overall code:
var stringtodecode="&lt;B&gt;Hello&lt;/B&gt; world&lt;br&gt;";
document.getElementById("decodeIt").innerHTML=stringtodecode;
stringtodecode=document.getElementById("decodeIt").innerText
The question doesn't specify the origin of x but it makes sense to defend, if we can, against malicious (or just unexpected, from our own application) input. For example, suppose x has a value of &amp; <script>alert('hello');</script>. A safe and simple way to handle this in jQuery is:
var x = "&amp; <script>alert('hello');</script>";
var safe = $('<div />').html(x).text();
// => "& alert('hello');"
Found via https://gist.github.com/jmblog/3222899. I can't see many reasons to avoid using this solution given it is at least as short, if not shorter than some alternatives and provides defence against XSS.
(I originally posted this as a comment, but am adding it as an answer since a subsequent comment in the same thread requested that I do so).
Not a direct response to your question, but wouldn't it be better for your RPC to return some structure (be it XML or JSON or whatever) with those image data (urls in your example) inside that structure?
Then you could just parse it in your javascript and build the <img> using javascript itself.
The structure you receive from RPC could look like:
{"img" : ["myimage.jpg", "myimage2.jpg"]}
I think it's better this way, as injecting code that comes from an external source into your page doesn't look very secure. Imagine someone hijacking your XML-RPC script and putting something you wouldn't want in there (even some JavaScript...)
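A rough sketch of that idea, assuming the payload has the shape shown above. The function name buildImages and its doc parameter are made up for illustration; in a page you would pass the browser's document. The URL is assigned as a property, so nothing from the server is ever parsed as HTML.

```javascript
// Hypothetical helper: turns {"img": ["a.jpg", ...]} into <img> elements.
// `doc` is expected to be the browser's `document`; it is a parameter only
// so the function does not depend on a global.
function buildImages(jsonText, doc) {
  const data = JSON.parse(jsonText) || {};
  const urls = Array.isArray(data.img) ? data.img : [];
  return urls
    .filter((url) => typeof url === 'string') // ignore anything that is not a string
    .map((url) => {
      const img = doc.createElement('img');
      img.src = url; // set as a property, never parsed as markup
      return img;
    });
}

// In a browser (assumed usage):
// buildImages(responseText, document).forEach((img) => document.body.appendChild(img));
```

This only avoids parsing the response as HTML; whether the URLs themselves should be trusted is a separate question.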
For one-line guys:
const htmlDecode = innerHTML => Object.assign(document.createElement('textarea'), {innerHTML}).value;
console.log(htmlDecode('Complicated - Dimitri Vegas &amp; Like Mike'));
You're welcome...just a messenger...full credit goes to ourcodeworld.com, link below.
window.htmlentities = {
/**
* Converts a string to its html characters completely.
*
* @param {String} str String with unescaped HTML characters
**/
encode : function(str) {
var buf = [];
for (var i=str.length-1;i>=0;i--) {
buf.unshift(['&#', str[i].charCodeAt(), ';'].join(''));
}
return buf.join('');
},
/**
* Converts an html characterSet into its original character.
*
* @param {String} str htmlSet entities
**/
decode : function(str) {
return str.replace(/&#(\d+);/g, function(match, dec) {
return String.fromCharCode(dec);
});
}
};
Full Credit: https://ourcodeworld.com/articles/read/188/encode-and-decode-html-entities-using-pure-javascript
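One caveat with the snippet above, as far as I can tell: charCodeAt and String.fromCharCode work on UTF-16 code units, so a character outside the Basic Multilingual Plane (an emoji, for example) is encoded as two surrogate references, which an HTML parser does not turn back into the original character. The decoder also ignores hexadecimal references such as &#x1F600;. Here is a minimal sketch of a code-point-aware variant; the function names are my own and are not part of the linked article.

```javascript
// Encode every character as a decimal numeric reference, one per code point.
function encodeCodePoints(str) {
  let out = '';
  for (const ch of str) { // for...of iterates by code point, not by UTF-16 unit
    out += '&#' + ch.codePointAt(0) + ';';
  }
  return out;
}

// Decode decimal (&#128512;) and hexadecimal (&#x1F600;) numeric references.
function decodeCodePoints(str) {
  return str.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, (match, hex, dec) => {
    const code = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range values and surrogates untouched instead of throwing.
    const valid = code <= 0x10ffff && !(code >= 0xd800 && code <= 0xdfff);
    return valid ? String.fromCodePoint(code) : match;
  });
}

console.log(encodeCodePoints('a\u{1F600}'));             // "&#97;&#128512;"
console.log(decodeCodePoints('&#x1F600;') === '\u{1F600}'); // true
```

Named entities such as &amp; are deliberately out of scope here, just as in the original snippet.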
I know there are a lot of good answers here, but since I have implemented a bit different approach, I thought to share.
This approach should be safe security-wise, because the escaping is handled by the browser rather than by the function itself. So if a new vulnerability is discovered in the future, a browser fix will also cover this solution.
const decodeHTMLEntities = text => {
// Create a new element or use one from cache, to save some element creation overhead
const el = decodeHTMLEntities.__cache_data_element
= decodeHTMLEntities.__cache_data_element
|| document.createElement('div');
const enc = text
// Prevent any mixup of existing pattern in text
.replace(/⪪/g, '⪪#')
// Encode entities in special format. This will prevent native element encoder to replace any amp characters
.replace(/&([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+);/gi, '⪪$1⪫');
// Encode any HTML tags in the text to prevent script injection
el.textContent = enc;
// Decode entities from special format, back to their original HTML entities format
el.innerHTML = el.innerHTML
.replace(/⪪([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+)⪫/gi, '&$1;')
// Undo the escaping of literal ⪪ characters applied at the start
.replace(/⪪#/g, '⪪');
// Get the decoded HTML entities
const dec = el.textContent;
// Clear the element content, in order to free some memory (the text may be pretty big)
el.textContent = '';
return dec;
}
// Example
console.log(decodeHTMLEntities("<script>alert('&awconint;&CounterClockwiseContourIntegral;&#x02233;&#8755;⪪#x02233⪫');</script>"));
// Prints: <script>alert('∳∳∳∳⪪#x02233⪫');</script>
By the way, I have chosen to use the characters ⪪ and ⪫, because they are rarely used, so the chance of impacting the performance by matching them is significantly lower.
Chris's answer is nice and elegant, but it fails if value is undefined. A simple improvement makes it more robust:
function htmlDecode(value) {
return (typeof value === 'undefined') ? '' : $('<div/>').html(value).text();
}
A JavaScript solution that catches the common ones (entities that are not in the map are left untouched):
var map = {amp: '&', lt: '<', gt: '>', quot: '"', '#039': "'"};
str = str.replace(/&([^;]+);/g, (m, c) => Object.prototype.hasOwnProperty.call(map, c) ? map[c] : m);
This is the reverse of https://stackoverflow.com/a/4835406/2738039
I tried everything to remove &amp; from a JSON array. None of the above examples worked, but https://stackoverflow.com/users/2030321/chris gave a great solution that led me to fix my problem.
var stringtodecode="&lt;B&gt;Hello&lt;/B&gt; world&lt;br&gt;";
document.getElementById("decodeIt").innerHTML=stringtodecode;
stringtodecode=document.getElementById("decodeIt").innerText
I did not use it, because I did not understand how to insert it into a modal window that was pulling JSON data into an array, but I did try this based upon the example, and it worked:
var modal = document.getElementById('demodal');
$('#ampersandcontent').text(data[0].replaceAll("&amp;", "&"));
I like it because it is simple and it works, but I am not sure why it's not widely used. I searched high and low to find a simple solution.
I continue to seek understanding of the syntax, and of whether there is any risk in using this. I have not found anything yet.
I was crazy enough to go through and make this function that should be pretty, if not completely, exhaustive:
function removeEncoding(string) {
return string.replace(/À/g, "À").replace(/Á/g, "Á").replace(/Â/g, "Â").replace(/Ã/g, "Ã").replace(/Ä/g, "Ä").replace(/Å/g, "Å").replace(/à/g, "à").replace(/â/g, "â").replace(/ã/g, "ã").replace(/ä/g, "ä").replace(/å/g, "å").replace(/Æ/g, "Æ").replace(/æ/g, "æ").replace(/ß/g, "ß").replace(/Ç/g, "Ç").replace(/ç/g, "ç").replace(/È/g, "È").replace(/É/g, "É").replace(/Ê/g, "Ê").replace(/Ë/g, "Ë").replace(/è/g, "è").replace(/é/g, "é").replace(/ê/g, "ê").replace(/ë/g, "ë").replace(/ƒ/g, "ƒ").replace(/Ì/g, "Ì").replace(/Í/g, "Í").replace(/Î/g, "Î").replace(/Ï/g, "Ï").replace(/ì/g, "ì").replace(/í/g, "í").replace(/î/g, "î").replace(/ï/g, "ï").replace(/Ñ/g, "Ñ").replace(/ñ/g, "ñ").replace(/Ò/g, "Ò").replace(/Ó/g, "Ó").replace(/Ô/g, "Ô").replace(/Õ/g, "Õ").replace(/Ö/g, "Ö").replace(/ò/g, "ò").replace(/ó/g, "ó").replace(/ô/g, "ô").replace(/õ/g, "õ").replace(/ö/g, "ö").replace(/Ø/g, "Ø").replace(/ø/g, "ø").replace(/Œ/g, "Œ").replace(/œ/g, "œ").replace(/Š/g, "Š").replace(/š/g, "š").replace(/Ù/g, "Ù").replace(/Ú/g, "Ú").replace(/Û/g, "Û").replace(/Ü/g, "Ü").replace(/ù/g, "ù").replace(/ú/g, "ú").replace(/û/g, "û").replace(/ü/g, "ü").replace(/µ/g, "µ").replace(/×/g, "×").replace(/Ý/g, "Ý").replace(/Ÿ/g, "Ÿ").replace(/ý/g, "ý").replace(/ÿ/g, "ÿ").replace(/°/g, "°").replace(/†/g, "†").replace(/‡/g, "‡").replace(/</g, "<").replace(/>/g, ">").replace(/±/g, "±").replace(/«/g, "«").replace(/»/g, "»").replace(/¿/g, "¿").replace(/¡/g, "¡").replace(/·/g, "·").replace(/•/g, "•").replace(/™/g, "™").replace(/©/g, "©").replace(/®/g, "®").replace(/§/g, "§").replace(/¶/g, "¶").replace(/Α/g, "Α").replace(/Β/g, "Β").replace(/Γ/g, "Γ").replace(/Δ/g, "Δ").replace(/Ε/g, "Ε").replace(/Ζ/g, "Ζ").replace(/Η/g, "Η").replace(/Θ/g, "Θ").replace(/Ι/g, "Ι").replace(/Κ/g, "Κ").replace(/Λ/g, "Λ").replace(/Μ/g, "Μ").replace(/Ν/g, "Ν").replace(/Ξ/g, "Ξ").replace(/Ο/g, "Ο").replace(/Π/g, "Π").replace(/Ρ/g, "Ρ").replace(/Σ/g, "Σ").replace(/Τ/g, "Τ").replace(/Υ/g, "Υ").replace(/Φ/g, "Φ").replace(/Χ/g, 
"Χ").replace(/Ψ/g, "Ψ").replace(/Ω/g, "Ω").replace(/α/g, "α").replace(/β/g, "β").replace(/γ/g, "γ").replace(/δ/g, "δ").replace(/ε/g, "ε").replace(/ζ/g, "ζ").replace(/η/g, "η").replace(/θ/g, "θ").replace(/ι/g, "ι").replace(/κ/g, "κ").replace(/λ/g, "λ").replace(/μ/g, "μ").replace(/ν/g, "ν").replace(/ξ/g, "ξ").replace(/ο/g, "ο").replace(/&piρ;/g, "ρ").replace(/ρ/g, "ς").replace(/ς/g, "ς").replace(/σ/g, "σ").replace(/τ/g, "τ").replace(/φ/g, "φ").replace(/χ/g, "χ").replace(/ψ/g, "ψ").replace(/ω/g, "ω").replace(/•/g, "•").replace(/…/g, "…").replace(/′/g, "′").replace(/″/g, "″").replace(/‾/g, "‾").replace(/⁄/g, "⁄").replace(/℘/g, "℘").replace(/ℑ/g, "ℑ").replace(/ℜ/g, "ℜ").replace(/™/g, "™").replace(/ℵ/g, "ℵ").replace(/←/g, "←").replace(/↑/g, "↑").replace(/→/g, "→").replace(/↓/g, "↓").replace(/&barr;/g, "↔").replace(/↵/g, "↵").replace(/⇐/g, "⇐").replace(/⇑/g, "⇑").replace(/⇒/g, "⇒").replace(/⇓/g, "⇓").replace(/⇔/g, "⇔").replace(/∀/g, "∀").replace(/∂/g, "∂").replace(/∃/g, "∃").replace(/∅/g, "∅").replace(/∇/g, "∇").replace(/∈/g, "∈").replace(/∉/g, "∉").replace(/∋/g, "∋").replace(/∏/g, "∏").replace(/∑/g, "∑").replace(/−/g, "−").replace(/∗/g, "∗").replace(/√/g, "√").replace(/∝/g, "∝").replace(/∞/g, "∞").replace(/&OEig;/g, "Œ").replace(/œ/g, "œ").replace(/Ÿ/g, "Ÿ").replace(/♠/g, "♠").replace(/♣/g, "♣").replace(/♥/g, "♥").replace(/♦/g, "♦").replace(/ϑ/g, "ϑ").replace(/ϒ/g, "ϒ").replace(/ϖ/g, "ϖ").replace(/Š/g, "Š").replace(/š/g, "š").replace(/∠/g, "∠").replace(/∧/g, "∧").replace(/∨/g, "∨").replace(/∩/g, "∩").replace(/∪/g, "∪").replace(/∫/g, "∫").replace(/∴/g, "∴").replace(/∼/g, "∼").replace(/≅/g, "≅").replace(/≈/g, "≈").replace(/≠/g, "≠").replace(/≡/g, "≡").replace(/≤/g, "≤").replace(/≥/g, "≥").replace(/⊂/g, "⊂").replace(/⊃/g, "⊃").replace(/⊄/g, "⊄").replace(/⊆/g, "⊆").replace(/⊇/g, "⊇").replace(/⊕/g, "⊕").replace(/⊗/g, "⊗").replace(/⊥/g, "⊥").replace(/⋅/g, "⋅").replace(/&lcell;/g, "⌈").replace(/&rcell;/g, "⌉").replace(/⌊/g, "⌊").replace(/⌋/g, "⌋").replace(/〈/g, 
"⟨").replace(/〉/g, "⟩").replace(/◊/g, "◊").replace(/'/g, "'").replace(/&/g, "&").replace(/"/g, "\"");
}
Used like so:
let decodedText = removeEncoding("Ich hei&szlig;e David");
console.log(decodedText);
Prints: Ich heiße David
P.S. this took like an hour and a half to make.
This is the most comprehensive solution I've tried so far:
const STANDARD_HTML_ENTITIES = {
nbsp: String.fromCharCode(160),
amp: "&",
quot: '"',
lt: "<",
gt: ">"
};
const replaceHtmlEntities = plainTextString => {
return plainTextString
.replace(/&#(\d+);/g, (match, dec) => String.fromCharCode(dec))
.replace(
/&(nbsp|amp|quot|lt|gt);/g,
(a, b) => STANDARD_HTML_ENTITIES[b]
);
};
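One thing to watch with two chained replace calls like these: the output of the first pass is scanned again by the second, so an input such as &#38;lt; ends up decoded twice. Below is a minimal sketch of a single-pass variant covering the same five named entities. NAMED, decodeOnce and decodeTwice are names I made up for the comparison, and code points that are out of range are simply left as they are.

```javascript
// The same five named entities as STANDARD_HTML_ENTITIES above.
const NAMED = { nbsp: '\u00A0', amp: '&', quot: '"', lt: '<', gt: '>' };

// One regex, one pass: text produced by a replacement is never rescanned.
const decodeOnce = (text) =>
  text.replace(/&(?:#(\d+)|(nbsp|amp|quot|lt|gt));/g, (match, dec, name) => {
    if (name !== undefined) return NAMED[name];
    const code = Number(dec);
    // Out-of-range numbers are left untouched instead of throwing.
    return code <= 0x10ffff ? String.fromCodePoint(code) : match;
  });

// Two passes in the same order as the answer above, for comparison.
const decodeTwice = (text) =>
  text
    .replace(/&#(\d+);/g, (match, dec) => String.fromCharCode(dec))
    .replace(/&(nbsp|amp|quot|lt|gt);/g, (match, name) => NAMED[name]);

console.log(decodeOnce('&#38;lt;'));  // "&lt;"
console.log(decodeTwice('&#38;lt;')); // "<"
```

Whether the difference matters depends on your input, but when the decoded text is later inserted as HTML, the extra round of decoding can turn harmless text into markup.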
Closures can avoid creating unnecessary objects.
const decodingHandler = (() => {
const element = document.createElement('div');
return text => {
element.innerHTML = text;
return element.textContent;
};
})();
A more concise way
const decodingHandler = (() => {
const element = document.createElement('div');
return text => ((element.innerHTML = text), element.textContent);
})();
I use this in my project. It is inspired by other answers but takes an extra safeEscape parameter, which can be useful when you deal with decorated characters:
var decodeEntities=(function(){
var el=document.createElement('div');
return function(str, safeEscape){
if(str && typeof str === 'string'){
str=str.replace(/\</g, '&lt;');
el.innerHTML=str;
if(el.innerText){
str=el.innerText;
el.innerText='';
}
else if(el.textContent){
str=el.textContent;
el.textContent='';
}
if(safeEscape)
str=str.replace(/\</g, '&lt;');
}
return str;
}
})();
And it's usable like:
var label='safe <b> character &eacute;ntity</b>';
var safehtml='<div title="'+decodeEntities(label)+'">'+decodeEntities(label, true)+'</div>';
var encodedStr = 'hello &amp; world';
var parser = new DOMParser;
var dom = parser.parseFromString(
'<!doctype html><body>' + encodedStr,
'text/html');
var decodedString = dom.body.textContent;
console.log(decodedString);
// decode-html.js v1
function decodeHtml(html) {
const textarea = document.createElement('textarea');
textarea.innerHTML = html;
const decodedHtml = textarea.textContent;
textarea.remove();
return decodedHtml;
};
// encode-html.js v1
function encodeHtml(html) {
const textarea = document.createElement('textarea');
textarea.textContent = html;
const encodedHtml = textarea.innerHTML;
textarea.remove();
return encodedHtml;
};
// example of use:
let htmlDecoded = 'one & two & three';
let htmlEncoded = 'one &amp; two &amp; three';
console.log(1, htmlDecoded);
console.log(2, encodeHtml(htmlDecoded));
console.log(3, htmlEncoded);
console.log(4, decodeHtml(htmlEncoded));
All of the other answers here have problems.
The document.createElement('div') methods (including those using jQuery) execute any JavaScript passed into them (a security issue), and the DOMParser.parseFromString() method trims whitespace. Here is a pure JavaScript solution that has neither problem:
function htmlDecode(html) {
var textarea = document.createElement("textarea");
html= html.replace(/\r/g, String.fromCharCode(0xe000)); // Replace "\r" with reserved unicode character.
textarea.innerHTML = html;
var result = textarea.value;
return result.replace(new RegExp(String.fromCharCode(0xe000), 'g'), '\r');
}
A textarea is used specifically to avoid executing JS code. It passes these:
htmlDecode('&lt;&amp;&nbsp;&gt;'); // returns "<& >" with non-breaking space.
htmlDecode(' '); // returns " "
htmlDecode('<img src="dummy" onerror="alert(\'xss\')">'); // Does not execute alert()
htmlDecode('\r\n') // returns "\r\n", doesn't lose the \r like other solutions.

How to convert a character to an HTML Entities using Javascript? [duplicate]

Do you need to decode all encoded HTML entities or just &amp; itself?
If you only need to handle &amp; then you can do this:
var decoded = encoded.replace(/&amp;/g, '&');
If you need to decode all HTML entities then you can do it without jQuery:
var elem = document.createElement('textarea');
elem.innerHTML = encoded;
var decoded = elem.value;
Please take note of Mark's comments below which highlight security holes in an earlier version of this answer and recommend using textarea rather than div to mitigate against potential XSS vulnerabilities. These vulnerabilities exist whether you use jQuery or plain JavaScript.
EDIT: You should use the DOMParser API as Wladimir suggests, I edited my previous answer since the function posted introduced a security vulnerability.
The following snippet is the old answer's code with a small modification: using a textarea instead of a div reduces the XSS vulnerability, but it is still problematic in IE9 and Firefox.
function htmlDecode(input){
var e = document.createElement('textarea');
e.innerHTML = input;
// handle case of empty input
return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}
htmlDecode("&lt;img src='myimage.jpg'&gt;");
// returns "<img src='myimage.jpg'>"
Basically I create a DOM element programmatically, assign the encoded HTML to its innerHTML and retrieve the nodeValue from the text node created on the innerHTML insertion. Since it just creates an element but never adds it, no site HTML is modified.
It will work cross-browser (including older browsers) and accept all the HTML Character Entities.
EDIT: The old version of this code did not work on IE with blank inputs, as evidenced here on jsFiddle (view in IE). The version above works with all inputs.
UPDATE: appears this doesn't work with large string, and it also introduces a security vulnerability, see comments.
A more modern option for interpreting HTML (text and otherwise) from JavaScript is the HTML support in the DOMParser API (see here in MDN). This allows you to use the browser's native HTML parser to convert a string to an HTML document. It has been supported in new versions of all major browsers since late 2014.
If we just want to decode some text content, we can put it as the sole content in a document body, parse the document, and pull out its .body.textContent.
var encodedStr = 'hello &amp; world';
var parser = new DOMParser;
var dom = parser.parseFromString(
'<!doctype html><body>' + encodedStr,
'text/html');
var decodedString = dom.body.textContent;
console.log(decodedString);
We can see in the draft specification for DOMParser that JavaScript is not enabled for the parsed document, so we can perform this text conversion without security concerns.
The parseFromString(str, type) method must run these steps, depending on type:
"text/html"
Parse str with an HTML parser, and return the newly created Document.
The scripting flag must be set to "disabled".
NOTE
script elements get marked unexecutable and the contents of noscript get parsed as markup.
It's beyond the scope of this question, but please note that if you're taking the parsed DOM nodes themselves (not just their text content) and moving them to the live document DOM, it's possible that their scripting would be reenabled, and there could be security concerns. I haven't researched it, so please exercise caution.
Mathias Bynens has a library for this: https://github.com/mathiasbynens/he
Example:
console.log(
he.decode("J&#246;rg &amp J&#xFC;rgen rocked to &amp; fro ")
);
// Logs "Jörg & Jürgen rocked to & fro"
I suggest favouring it over hacks involving setting an element's HTML content and then reading back its text content. Such approaches can work, but are deceptively dangerous and present XSS opportunities if used on untrusted user input.
If you really can't bear to load in a library, you can use the textarea hack described in this answer to a near-duplicate question, which, unlike various similar approaches that have been suggested, has no security holes that I know of:
function decodeEntities(encodedString) {
var textArea = document.createElement('textarea');
textArea.innerHTML = encodedString;
return textArea.value;
}
console.log(decodeEntities('1 & 2')); // '1 & 2'
But take note of the security issues, affecting similar approaches to this one, that I list in the linked answer! This approach is a hack, and future changes to the permissible content of a textarea (or bugs in particular browsers) could lead to code that relies upon it suddenly having an XSS hole one day.
If you're using jQuery:
function htmlDecode(value){
return $('<div/>').html(value).text();
}
Otherwise, use Strictly Software's Encoder Object, which has an excellent htmlDecode() function.
You can use Lodash unescape / escape function https://lodash.com/docs/4.17.5#unescape
import unescape from 'lodash/unescape';
const str = unescape('fred, barney, & pebbles');
str will become 'fred, barney, & pebbles'
var htmlEnDeCode = (function() {
var charToEntityRegex,
entityToCharRegex,
charToEntity,
entityToChar;
function resetCharacterEntities() {
charToEntity = {};
entityToChar = {};
// add the default set
addCharacterEntities({
'&' : '&',
'>' : '>',
'<' : '<',
'"' : '"',
''' : "'"
});
}
function addCharacterEntities(newEntities) {
var charKeys = [],
entityKeys = [],
key, echar;
for (key in newEntities) {
echar = newEntities[key];
entityToChar[key] = echar;
charToEntity[echar] = key;
charKeys.push(echar);
entityKeys.push(key);
}
charToEntityRegex = new RegExp('(' + charKeys.join('|') + ')', 'g');
entityToCharRegex = new RegExp('(' + entityKeys.join('|') + '|&#[0-9]{1,5};' + ')', 'g');
}
function htmlEncode(value){
var htmlEncodeReplaceFn = function(match, capture) {
return charToEntity[capture];
};
return (!value) ? value : String(value).replace(charToEntityRegex, htmlEncodeReplaceFn);
}
function htmlDecode(value) {
var htmlDecodeReplaceFn = function(match, capture) {
return (capture in entityToChar) ? entityToChar[capture] : String.fromCharCode(parseInt(capture.substr(2), 10));
};
return (!value) ? value : String(value).replace(entityToCharRegex, htmlDecodeReplaceFn);
}
resetCharacterEntities();
return {
htmlEncode: htmlEncode,
htmlDecode: htmlDecode
};
})();
This is from ExtJS source code.
The trick is to use the power of the browser to decode the special HTML characters, but not allow the browser to execute the results as if it was actual html... This function uses a regex to identify and replace encoded HTML characters, one character at a time.
function unescapeHtml(html) {
var el = document.createElement('div');
return html.replace(/\&[#0-9a-z]+;/gi, function (enc) {
el.innerHTML = enc;
return el.innerText
});
}
element.innerText also does the trick.
In case you're looking for it, like me - meanwhile there's a nice and safe JQuery method.
https://api.jquery.com/jquery.parsehtml/
You can f.ex. type this in your console:
var x = "test &";
> undefined
$.parseHTML(x)[0].textContent
> "test &"
So $.parseHTML(x) returns an array, and if you have HTML markup within your text, the array.length will be greater than 1.
jQuery will encode and decode for you. However, you need to use a textarea tag, not a div.
var str1 = 'One & two & three';
var str2 = "One & two & three";
$(document).ready(function() {
$("#encoded").text(htmlEncode(str1));
$("#decoded").text(htmlDecode(str2));
});
function htmlDecode(value) {
return $("<textarea/>").html(value).text();
}
function htmlEncode(value) {
return $('<textarea/>').text(value).html();
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<div id="encoded"></div>
<div id="decoded"></div>
CMS' answer works fine, unless the HTML you want to unescape is very long, longer than 65536 chars. Because then in Chrome the inner HTML gets split into many child nodes, each one at most 65536 long, and you need to concatenate them. This function works also for very long strings:
function unencodeHtmlContent(escapedHtml) {
var elem = document.createElement('div');
elem.innerHTML = escapedHtml;
var result = '';
// Chrome splits innerHTML into many child nodes, each one at most 65536.
// Whereas FF creates just one single huge child node.
for (var i = 0; i < elem.childNodes.length; ++i) {
result = result + elem.childNodes[i].nodeValue;
}
return result;
}
See this answer about innerHTML max length for more info: https://stackoverflow.com/a/27545633/694469
To unescape HTML entities* in JavaScript you can use small library html-escaper: npm install html-escaper
import {unescape} from 'html-escaper';
unescape('escaped string');
Or unescape function from Lodash or Underscore, if you are using it.
*) please note that these functions don't cover all HTML entities, but only the most common ones, i.e. &, <, >, ', ". To unescape all HTML entities you can use he library.
First create a <span id="decodeIt" style="display:none;"></span> somewhere in the body
Next, assign the string to be decoded as innerHTML to this:
document.getElementById("decodeIt").innerHTML=stringtodecode
Finally,
stringtodecode=document.getElementById("decodeIt").innerText
Here is the overall code:
var stringtodecode="<B>Hello</B> world<br>";
document.getElementById("decodeIt").innerHTML=stringtodecode;
stringtodecode=document.getElementById("decodeIt").innerText
The question doesn't specify the origin of x but it makes sense to defend, if we can, against malicious (or just unexpected, from our own application) input. For example, suppose x has a value of & <script>alert('hello');</script>. A safe and simple way to handle this in jQuery is:
var x = "& <script>alert('hello');</script>";
var safe = $('<div />').html(x).text();
// => "& alert('hello');"
Found via https://gist.github.com/jmblog/3222899. I can't see many reasons to avoid using this solution given it is at least as short, if not shorter than some alternatives and provides defence against XSS.
(I originally posted this as a comment, but am adding it as an answer since a subsequent comment in the same thread requested that I do so).
Not a direct response to your question, but wouldn't it be better for your RPC to return some structure (be it XML or JSON or whatever) with those image data (urls in your example) inside that structure?
Then you could just parse it in your javascript and build the <img> using javascript itself.
The structure you recieve from RPC could look like:
{"img" : ["myimage.jpg", "myimage2.jpg"]}
I think it's better this way, as injecting a code that comes from external source into your page doesn't look very secure. Imaging someone hijacking your XML-RPC script and putting something you wouldn't want in there (even some javascript...)
For one-line guys:
const htmlDecode = innerHTML => Object.assign(document.createElement('textarea'), {innerHTML}).value;
console.log(htmlDecode('Complicated - Dimitri Vegas & Like Mike'));
You're welcome...just a messenger...full credit goes to ourcodeworld.com, link below.
window.htmlentities = {
/**
* Converts a string to its html characters completely.
*
* #param {String} str String with unescaped HTML characters
**/
encode : function(str) {
var buf = [];
for (var i=str.length-1;i>=0;i--) {
buf.unshift(['&#', str[i].charCodeAt(), ';'].join(''));
}
return buf.join('');
},
/**
* Converts an html characterSet into its original character.
*
* #param {String} str htmlSet entities
**/
decode : function(str) {
return str.replace(/&#(\d+);/g, function(match, dec) {
return String.fromCharCode(dec);
});
}
};
Full Credit: https://ourcodeworld.com/articles/read/188/encode-and-decode-html-entities-using-pure-javascript
I know there are a lot of good answers here, but since I have implemented a bit different approach, I thought to share.
This code is a perfectly safe security-wise approach, as the escaping handler dependant on the browser, instead on the function. So, if a new vulnerability will be discovered in the future, this solution will be covered.
const decodeHTMLEntities = text => {
// Create a new element or use one from cache, to save some element creation overhead
const el = decodeHTMLEntities.__cache_data_element
= decodeHTMLEntities.__cache_data_element
|| document.createElement('div');
const enc = text
// Prevent any mixup of existing pattern in text
.replace(/⪪/g, '⪪#')
// Encode entities in special format. This will prevent native element encoder to replace any amp characters
.replace(/&([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+);/gi, '⪪$1⪫');
// Encode any HTML tags in the text to prevent script injection
el.textContent = enc;
// Decode entities from special format, back to their original HTML entities format
el.innerHTML = el.innerHTML
.replace(/⪪([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+)⪫/gi, '&$1;')
.replace(/#⪫/g, '⪫');
// Get the decoded HTML entities
const dec = el.textContent;
// Clear the element content, in order to preserve a bit of memory (it is just the text may be pretty big)
el.textContent = '';
return dec;
}
// Example
console.log(decodeHTMLEntities("<script>alert('&awconint;&CounterClockwiseContourIntegral;∳∳⪪#x02233⪫');</script>"));
// Prints: <script>alert('∳∳∳∳⪪##x02233⪫');</script>
By the way, I have chosen to use the characters ⪪ and ⪫, because they are rarely used, so the chance of impacting the performance by matching them is significantly lower.
Chris answer is nice & elegant but it fails if value is undefined. Just simple improvement makes it solid:
function htmlDecode(value) {
return (typeof value === 'undefined') ? '' : $('<div/>').html(value).text();
}
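A JavaScript solution that catches the common ones:
var map = {amp: '&', lt: '<', gt: '>', quot: '"', '#039': "'"}
// Leave unknown entities unchanged instead of replacing them with "undefined"
str = str.replace(/&([^;]+);/g, (m, c) => map.hasOwnProperty(c) ? map[c] : m)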
this is the reverse of https://stackoverflow.com/a/4835406/2738039
I tried everything to remove &amp; from a JSON array. None of the above examples worked, but https://stackoverflow.com/users/2030321/chris gave a great solution that led me to fix my problem.
var stringtodecode="<B>Hello</B> world<br>";
document.getElementById("decodeIt").innerHTML=stringtodecode;
stringtodecode=document.getElementById("decodeIt").innerText
I did not use it, because I did not understand how to insert it into a modal window that was pulling JSON data into an array. I did try this, based on the example, and it worked:
var modal = document.getElementById('demodal');
$('#ampersandcontent').text(replaceAll(data[0],"&amp;", "&"));
I like it because it was simple, and it works, but not sure why it's not widely used. Searched hi & low to find a simple solution.
I continue to seek understanding of the syntax, and if there is any risk to using this. Have not found anything yet.
I was crazy enough to go through and make this function that should be pretty, if not completely, exhaustive:
function removeEncoding(string) {
return string.replace(/À/g, "À").replace(/Á/g, "Á").replace(/Â/g, "Â").replace(/Ã/g, "Ã").replace(/Ä/g, "Ä").replace(/Å/g, "Å").replace(/à/g, "à").replace(/â/g, "â").replace(/ã/g, "ã").replace(/ä/g, "ä").replace(/å/g, "å").replace(/Æ/g, "Æ").replace(/æ/g, "æ").replace(/ß/g, "ß").replace(/Ç/g, "Ç").replace(/ç/g, "ç").replace(/È/g, "È").replace(/É/g, "É").replace(/Ê/g, "Ê").replace(/Ë/g, "Ë").replace(/è/g, "è").replace(/é/g, "é").replace(/ê/g, "ê").replace(/ë/g, "ë").replace(/ƒ/g, "ƒ").replace(/Ì/g, "Ì").replace(/Í/g, "Í").replace(/Î/g, "Î").replace(/Ï/g, "Ï").replace(/ì/g, "ì").replace(/í/g, "í").replace(/î/g, "î").replace(/ï/g, "ï").replace(/Ñ/g, "Ñ").replace(/ñ/g, "ñ").replace(/Ò/g, "Ò").replace(/Ó/g, "Ó").replace(/Ô/g, "Ô").replace(/Õ/g, "Õ").replace(/Ö/g, "Ö").replace(/ò/g, "ò").replace(/ó/g, "ó").replace(/ô/g, "ô").replace(/õ/g, "õ").replace(/ö/g, "ö").replace(/Ø/g, "Ø").replace(/ø/g, "ø").replace(/Œ/g, "Œ").replace(/œ/g, "œ").replace(/Š/g, "Š").replace(/š/g, "š").replace(/Ù/g, "Ù").replace(/Ú/g, "Ú").replace(/Û/g, "Û").replace(/Ü/g, "Ü").replace(/ù/g, "ù").replace(/ú/g, "ú").replace(/û/g, "û").replace(/ü/g, "ü").replace(/µ/g, "µ").replace(/×/g, "×").replace(/Ý/g, "Ý").replace(/Ÿ/g, "Ÿ").replace(/ý/g, "ý").replace(/ÿ/g, "ÿ").replace(/°/g, "°").replace(/†/g, "†").replace(/‡/g, "‡").replace(/</g, "<").replace(/>/g, ">").replace(/±/g, "±").replace(/«/g, "«").replace(/»/g, "»").replace(/¿/g, "¿").replace(/¡/g, "¡").replace(/·/g, "·").replace(/•/g, "•").replace(/™/g, "™").replace(/©/g, "©").replace(/®/g, "®").replace(/§/g, "§").replace(/¶/g, "¶").replace(/Α/g, "Α").replace(/Β/g, "Β").replace(/Γ/g, "Γ").replace(/Δ/g, "Δ").replace(/Ε/g, "Ε").replace(/Ζ/g, "Ζ").replace(/Η/g, "Η").replace(/Θ/g, "Θ").replace(/Ι/g, "Ι").replace(/Κ/g, "Κ").replace(/Λ/g, "Λ").replace(/Μ/g, "Μ").replace(/Ν/g, "Ν").replace(/Ξ/g, "Ξ").replace(/Ο/g, "Ο").replace(/Π/g, "Π").replace(/Ρ/g, "Ρ").replace(/Σ/g, "Σ").replace(/Τ/g, "Τ").replace(/Υ/g, "Υ").replace(/Φ/g, "Φ").replace(/Χ/g, 
"Χ").replace(/Ψ/g, "Ψ").replace(/Ω/g, "Ω").replace(/α/g, "α").replace(/β/g, "β").replace(/γ/g, "γ").replace(/δ/g, "δ").replace(/ε/g, "ε").replace(/ζ/g, "ζ").replace(/η/g, "η").replace(/θ/g, "θ").replace(/ι/g, "ι").replace(/κ/g, "κ").replace(/λ/g, "λ").replace(/μ/g, "μ").replace(/ν/g, "ν").replace(/ξ/g, "ξ").replace(/ο/g, "ο").replace(/&piρ;/g, "ρ").replace(/ρ/g, "ς").replace(/ς/g, "ς").replace(/σ/g, "σ").replace(/τ/g, "τ").replace(/φ/g, "φ").replace(/χ/g, "χ").replace(/ψ/g, "ψ").replace(/ω/g, "ω").replace(/•/g, "•").replace(/…/g, "…").replace(/′/g, "′").replace(/″/g, "″").replace(/‾/g, "‾").replace(/⁄/g, "⁄").replace(/℘/g, "℘").replace(/ℑ/g, "ℑ").replace(/ℜ/g, "ℜ").replace(/™/g, "™").replace(/ℵ/g, "ℵ").replace(/←/g, "←").replace(/↑/g, "↑").replace(/→/g, "→").replace(/↓/g, "↓").replace(/&barr;/g, "↔").replace(/↵/g, "↵").replace(/⇐/g, "⇐").replace(/⇑/g, "⇑").replace(/⇒/g, "⇒").replace(/⇓/g, "⇓").replace(/⇔/g, "⇔").replace(/∀/g, "∀").replace(/∂/g, "∂").replace(/∃/g, "∃").replace(/∅/g, "∅").replace(/∇/g, "∇").replace(/∈/g, "∈").replace(/∉/g, "∉").replace(/∋/g, "∋").replace(/∏/g, "∏").replace(/∑/g, "∑").replace(/−/g, "−").replace(/∗/g, "∗").replace(/√/g, "√").replace(/∝/g, "∝").replace(/∞/g, "∞").replace(/&OEig;/g, "Œ").replace(/œ/g, "œ").replace(/Ÿ/g, "Ÿ").replace(/♠/g, "♠").replace(/♣/g, "♣").replace(/♥/g, "♥").replace(/♦/g, "♦").replace(/ϑ/g, "ϑ").replace(/ϒ/g, "ϒ").replace(/ϖ/g, "ϖ").replace(/Š/g, "Š").replace(/š/g, "š").replace(/∠/g, "∠").replace(/∧/g, "∧").replace(/∨/g, "∨").replace(/∩/g, "∩").replace(/∪/g, "∪").replace(/∫/g, "∫").replace(/∴/g, "∴").replace(/∼/g, "∼").replace(/≅/g, "≅").replace(/≈/g, "≈").replace(/≠/g, "≠").replace(/≡/g, "≡").replace(/≤/g, "≤").replace(/≥/g, "≥").replace(/⊂/g, "⊂").replace(/⊃/g, "⊃").replace(/⊄/g, "⊄").replace(/⊆/g, "⊆").replace(/⊇/g, "⊇").replace(/⊕/g, "⊕").replace(/⊗/g, "⊗").replace(/⊥/g, "⊥").replace(/⋅/g, "⋅").replace(/&lcell;/g, "⌈").replace(/&rcell;/g, "⌉").replace(/⌊/g, "⌊").replace(/⌋/g, "⌋").replace(/〈/g, 
"⟨").replace(/〉/g, "⟩").replace(/◊/g, "◊").replace(/'/g, "'").replace(/&/g, "&").replace(/"/g, "\"");
}
Used like so:
let decodedText = removeEncoding("Ich hei&szlig;e David");
console.log(decodedText);
Prints: Ich heiße David
P.S. this took like an hour and a half to make.
This is the most comprehensive solution I've tried so far:
const STANDARD_HTML_ENTITIES = {
nbsp: String.fromCharCode(160),
amp: "&",
quot: '"',
lt: "<",
gt: ">"
};
const replaceHtmlEntities = plainTextString => {
return plainTextString
.replace(/&#(\d+);/g, (match, dec) => String.fromCharCode(dec))
.replace(
/&(nbsp|amp|quot|lt|gt);/g,
(a, b) => STANDARD_HTML_ENTITIES[b]
);
};
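The function above covers decimal references and five named entities only. As a rough sketch of how the same idea might be extended to hexadecimal references and to code points above U+FFFF, something like the following could work. The names NAMED_ENTITIES and decodeBasicEntities are made up for this illustration, and the named-entity table is deliberately tiny rather than complete.

```javascript
// Hypothetical helper names; the named-entity table is intentionally small.
const NAMED_ENTITIES = { nbsp: "\u00A0", amp: "&", quot: '"', lt: "<", gt: ">", apos: "'" };

const decodeBasicEntities = text =>
  // A single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are returned unchanged.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
```

Because it works purely on strings and never touches the DOM, a decoder like this cannot execute markup, at the cost of supporting far fewer named entities than a browser does.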
Closures can avoid creating unnecessary objects.
const decodingHandler = (() => {
const element = document.createElement('div');
return text => {
element.innerHTML = text;
return element.textContent;
};
})();
A more concise way:
const decodingHandler = (() => {
const element = document.createElement('div');
return text => ((element.innerHTML = text), element.textContent);
})();
I use this in my project. It is inspired by other answers but takes an extra safeEscape parameter, which can be useful when you deal with decorated characters:
var decodeEntities=(function(){
var el=document.createElement('div');
return function(str, safeEscape){
if(str && typeof str === 'string'){
str=str.replace(/\</g, '&lt;');
el.innerHTML=str;
if(el.innerText){
str=el.innerText;
el.innerText='';
}
else if(el.textContent){
str=el.textContent;
el.textContent='';
}
if(safeEscape)
str=str.replace(/\</g, '&lt;');
}
return str;
}
})();
And it's usable like:
var label='safe <b> character &eacute;ntity</b>';
var safehtml='<div title="'+decodeEntities(label)+'">'+decodeEntities(label, true)+'</div>';
var encodedStr = 'hello &amp; world';
var parser = new DOMParser;
var dom = parser.parseFromString(
'<!doctype html><body>' + encodedStr,
'text/html');
var decodedString = dom.body.textContent;
console.log(decodedString);
// decode-html.js v1
function decodeHtml(html) {
const textarea = document.createElement('textarea');
textarea.innerHTML = html;
const decodedHtml = textarea.textContent;
textarea.remove();
return decodedHtml;
};
// encode-html.js v1
function encodeHtml(html) {
const textarea = document.createElement('textarea');
textarea.textContent = html;
const encodedHtml = textarea.innerHTML;
textarea.remove();
return encodedHtml;
};
// example of use:
let htmlDecoded = 'one & two & three';
let htmlEncoded = 'one &amp; two &amp; three';
console.log(1, htmlDecoded);
console.log(2, encodeHtml(htmlDecoded));
console.log(3, htmlEncoded);
console.log(4, decodeHtml(htmlEncoded));
All of the other answers here have problems.
The document.createElement('div') methods (including those using jQuery) execute any JavaScript passed into them (a security issue), and the DOMParser.parseFromString() method trims whitespace. Here is a pure JavaScript solution that has neither problem:
function htmlDecode(html) {
var textarea = document.createElement("textarea");
html= html.replace(/\r/g, String.fromCharCode(0xe000)); // Replace "\r" with reserved unicode character.
textarea.innerHTML = html;
var result = textarea.value;
return result.replace(new RegExp(String.fromCharCode(0xe000), 'g'), '\r');
}
A textarea is used specifically to avoid executing JS code. It passes these:
htmlDecode('&lt;&amp;&nbsp;&gt;'); // returns "<& >" with non-breaking space.
htmlDecode(' '); // returns " "
htmlDecode('<img src="dummy" onerror="alert(\'xss\')">'); // Does not execute alert()
htmlDecode('\r\n') // returns "\r\n", doesn't lose the \r like other solutions.

Vue - decodeURI before it gets added to Vuex state [duplicate]

Say I get some JSON back from a service request that looks like this:
{
"message": "We&#39;re unable to complete your request at this time."
}
I'm not sure why that apostrophe is encoded like that (&#39;); all I know is that I want to decode it.
Here's one approach using jQuery that popped into my head:
function decodeHtml(html) {
return $('<div>').html(html).text();
}
That seems (very) hacky, though. What's a better way? Is there a "right" way?
This is my favourite way of decoding HTML characters. The advantage of using this code is that tags are also preserved.
function decodeHtml(html) {
var txt = document.createElement("textarea");
txt.innerHTML = html;
return txt.value;
}
Example: http://jsfiddle.net/k65s3/
Input:
Entity: Bad attempt at XSS:<script>alert('new\nline?')</script><br>
Output:
Entity: Bad attempt at XSS:<script>alert('new\nline?')</script><br>
Don’t use the DOM to do this if you care about legacy compatibility. Using the DOM to decode HTML entities (as suggested in the currently accepted answer) leads to differences in cross-browser results on non-modern browsers.
For a robust & deterministic solution that decodes character references according to the algorithm in the HTML Standard, use the he library. From its README:
he (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports all standardized named character references as per HTML, handles ambiguous ampersands and other edge cases just like a browser would, has an extensive test suite, and — contrary to many other JavaScript solutions — he handles astral Unicode symbols just fine. An online demo is available.
Here’s how you’d use it:
he.decode("We&#39;re unable to complete your request at this time.");
→ "We're unable to complete your request at this time."
Disclaimer: I'm the author of the he library.
See this Stack Overflow answer for some more info.
If you don't want to use html/dom, you could use regex. I haven't tested this; but something along the lines of:
function parseHtmlEntities(str) {
return str.replace(/&#([0-9]{1,3});/gi, function(match, numStr) {
var num = parseInt(numStr, 10); // read num as normal number
return String.fromCharCode(num);
});
}
[Edit]
Note: this would only work for numeric html-entities, and not stuff like &oring;.
[Edit 2]
Fixed the function (some typos), test here: http://jsfiddle.net/Be2Bd/1/
There's a JS function to deal with &#xxxx-style entities:
function at GitHub
// encode(decode) html text into html entity
var decodeHtmlEntity = function(str) {
return str.replace(/&#(\d+);/g, function(match, dec) {
return String.fromCharCode(dec);
});
};
var encodeHtmlEntity = function(str) {
var buf = [];
for (var i=str.length-1;i>=0;i--) {
buf.unshift(['&#', str[i].charCodeAt(), ';'].join(''));
}
return buf.join('');
};
var entity = '&#39640;&#32423;&#31243;&#24207;&#35774;&#35745;';
var str = '高级程序设计';
let element = document.getElementById("testFunct");
element.innerHTML = (decodeHtmlEntity(entity));
console.log(decodeHtmlEntity(entity) === str);
console.log(encodeHtmlEntity(str) === entity);
// output:
// true
// true
<div><span id="testFunct"></span></div>
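One caveat worth noting: encodeHtmlEntity above works on UTF-16 code units, so a character outside the Basic Multilingual Plane (an emoji, for instance) comes out as two surrogate references. Below is a small sketch of a per-code-point variant, assuming one reference per character is what you want. The names encodeByCodePoint and decodeByCodePoint are illustrative and are not part of the linked GitHub function.

```javascript
// Illustrative variants (hypothetical names) that work per code point.
// Array.from iterates a string by code point, not by UTF-16 code unit.
const encodeByCodePoint = str =>
  Array.from(str, ch => `&#${ch.codePointAt(0)};`).join("");

const decodeByCodePoint = str =>
  str.replace(/&#(\d+);/g, (match, dec) => {
    const code = Number(dec);
    // Leave references beyond the Unicode range untouched.
    return code <= 0x10ffff ? String.fromCodePoint(code) : match;
  });
```

For plain BMP text such as the Chinese example above, the result should be the same as with the original functions; the difference only shows up for astral characters.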
jQuery will encode and decode for you.
function htmlDecode(value) {
return $("<textarea/>").html(value).text();
}
function htmlEncode(value) {
return $('<textarea/>').text(value).html();
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>
$(document).ready(function() {
$("#encoded")
.text(htmlEncode("<img src onerror='alert(0)'>"));
$("#decoded")
.text(htmlDecode("<img src onerror='alert(0)'>"));
});
</script>
<span>htmlEncode() result:</span><br/>
<div id="encoded"></div>
<br/>
<span>htmlDecode() result:</span><br/>
<div id="decoded"></div>
_.unescape does what you're looking for
https://lodash.com/docs/#unescape
This is such a good answer. You can use it with Angular like this:
moduleDefinitions.filter('sanitize', ['$sce', function($sce) {
return function(htmlCode) {
var txt = document.createElement("textarea");
txt.innerHTML = htmlCode;
return $sce.trustAsHtml(txt.value);
}
}]);

XSS safe html decode for Javascript

I need to decode html in javascript. e.g.:
var str = 'apple &amp; banana';
var strDecoded = htmlDecode(str); // I expect 'apple & banana'
There is no guarantee that the given str is already encoded, and the common jQuery and DOM tricks are vulnerable to XSS:
var attackStr = '&amp;</textarea><img src=x onerror=alert(1)>ハローワールド'; // if you see 1 alerted, it means it is XSS vulnerable
var strDecoded; // I wish to get: &</textarea><img src=x onerror=alert(1)>ハローワールド
strDecoded = $('<div/>').html(attackStr).text(); // vulnerable in all browsers
strDecoded = $('<textarea/>').html(attackStr).text(); // vulnerable in ie 9 and firefox
var dv = document.createElement('div');
dv.innerHTML = attackStr; // vulnerable in all browsers
strDecoded = dv.innerText;
var ta = document.createElement('textarea');
ta.innerHTML = attackStr; // vulnerable in ie 9 and firefox
strDecoded = ta.value;
Is there any XSS-safe way to html-decode?
Taking a mix of your code and the highest-voted (not the accepted) answer at HTML Entity Decode, how about this:
var decodeEntities = (function() {
// this prevents any overhead from creating the object each time
var element = document.createElement('textarea');
function decodeHTMLEntities (str) {
if(str && typeof str === 'string') {
str = str.replace(/</g,"&lt;");
str = str.replace(/>/g,"&gt;");
element.innerHTML = str;
str = element.textContent;
element.textContent = '';
}
return str;
}
return decodeHTMLEntities;
})();
Fiddle here: http://jsfiddle.net/ursu67z6/
You could also have a look at https://github.com/mathiasbynens/he maybe. I haven't gone through it myself, but it might deal with some cases better. I expect that if you are only decoding rather than encoding, the dom-based approach is better.
DOMPurify is a DOM-only, super-fast, uber-tolerant XSS sanitizer for
HTML, MathML and SVG. It's written in JavaScript and works in all
modern browsers (Safari, Opera (15+), Internet Explorer (9+), Firefox
and Chrome - as well as almost anything else using Blink or WebKit).
It doesn't break on IE6 or other legacy browsers. It simply does
nothing there.
DOMPurify is written by security people who have vast background in
web attacks and XSS. Fear not.
I've tested and used DOMPurify, and it's really good at sanitizing untrusted data on the client side. Using it is very simple.
Import purify.js:
<script type="text/javascript" src="purify.js"></script>
Then pass your untrusted string to DOMPurify.sanitize:
var attackStr = '</textarea><img src=x onerror=alert(1)>'
var clean = DOMPurify.sanitize(attackStr );
The output will look like the following:
<img src="x">
You can test your XSS payload here: https://cure53.de/purify
Source code, examples and documentation can be found here: https://github.com/cure53/DOMPurify
If you want to safely display the content.
Use innerText or jQuery.text() method instead of innerHTML/.html()
You can use jQuery function like below, to encode or decode the input String
function htmlEncode(value){
return $('<div/>').text(value).html();
}
function htmlDecode(value){
return $('<div/>').html(value).text();
}
htmlDecode('&lt;b&gt;test&lt;/b&gt;')
// result "<b>test</b>"
htmlDecode('test')
// result "test"
In this code:
I create a div that is not actually present on the page.
The input string is passed to the htmlDecode function.
jQuery automatically encodes or decodes the string.
The new HTML or text is returned.
Hope this helps!
Here is a clean solution that does not involve injecting the HTML anywhere. Copy both these functions somewhere in your code:
http://phpjs.org/functions/html_entity_decode/ and
http://phpjs.org/functions/get_html_translation_table/
You'll have to remove "this" in "html_entity_decode" on line 26.
console.log( html_entity_decode('&amp;</textarea><img src=x onerror=alert(1)>') );
// &</textarea><img src=x onerror=alert(1)>
Cheers.
-- EDIT --
Your textarea trick looks good, did it cover all your use cases ?
The only other JavaScript solution I can think of is to use a sandboxed, same-domain iframe. It gives me good results but would only work in recent web browsers. I'm posting the code just in case.
function safeHtmlDecode(str, callback)
{
var sameDomainBlankPage = document.location.href; // This should be a blank html page located on same domain
var $iframe = $('<iframe sandbox="allow-same-origin"/>').attr("src", sameDomainBlankPage); // declared with var to avoid an implicit global
$iframe.on("load", function() {
var body = $iframe.contents()[0].body;
body.innerHTML = str;
callback(body.innerText);
});
$("body").append($iframe);
}
$(document).ready(function(){
var attackStr = '&amp;</textarea><img src=x onerror=alert(1)>ハローワールド';
safeHtmlDecode(attackStr, function(htmlString) {
console.log( htmlString );
});
});
The best I could get so far:
function htmlDecode(str){
if(typeof str != "string") return str;
str = str.replace(/</g,"&lt;");
str = str.replace(/>/g,"&gt;");
var ta = document.createElement("textarea");
ta.innerHTML = str;
return ta.value;
}
//test:
var attackStr = '&amp;</textarea><img src=x onerror=alert(1)>ハローワールド';
alert(htmlDecode(attackStr)); // &</textarea><img src=x onerror=alert(1)>ハローワールド

Node.innerHTML giving tag names in lower case

I am iterating over a NodeList to get Node data, but when using Node.innerHTML I am getting the tag names in lowercase.
Actual Tags
<Panel><Label>test</Label></Panel>
giving as
<panel><label>test</label></panel>
I need these tags as they are. Is it possible to get them with a regular expression? I am using it with dojo (is there any way to do this in dojo?).
var xhrArgs = {
url: "./user/"+Runtime.userName+"/ws/workspace/"+Workbench.getProject()+"/lib/custom/"+(first.type).replace(".","/")+".html",
content: {},
sync:true,
load: function(data){
var test = domConstruct.toDom(data);
dojo.forEach(dojo.query("[id]",test),function(node){
domAttr.remove(node,"id");
});
var childEle = "";
dojo.forEach(test.childNodes,function(node){
if(node.innerHTML){
childEle+=node.innerHTML;
}
});
command.add(new ModifyCommand(newWidget,{},childEle,context));
}
};
You cannot count on .innerHTML preserving the exact nature of your original HTML. In fact, in some browsers, it's significantly different (though it generates the same results), with different quotation, case, order of attributes, etc.
It is much better to not rely on the preservation of case and adjust your javascript to deal with uncertain case.
It is certainly possible to use a regular expression to do a case insensitive search (the "i" flag designates its searches as case insensitive), though it is generally much, much better to use direct DOM access/searching rather than innerHTML searching. You'd have to tell us more about what exactly you're trying to do before we could offer some code.
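As a rough illustration of dealing with uncertain case, the sketch below compares tag names case-insensitively instead of relying on the exact casing that innerHTML returns. The helper names sameTagName and countTag are made up for this example. In a browser, comparing node.nodeName values this way is usually preferable to searching the serialized string.

```javascript
// Compare two tag names without caring about case.
const sameTagName = (a, b) => a.toLowerCase() === b.toLowerCase();

// Count opening tags with the given name in an HTML string, ignoring case.
// The "i" flag makes the search case-insensitive; \b keeps "Panel" from
// matching a longer name such as "<PanelGroup>".
const countTag = (html, name) => {
  // Escape regex metacharacters in the tag name before building the pattern.
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const matches = html.match(new RegExp(`<${escaped}\\b`, "gi"));
  return matches ? matches.length : 0;
};
```

With helpers like these, code that looks for a Panel element keeps working whether the serialized markup says Panel or panel.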
It would take me a bit to figure that out with a regex, but you can use this:
var str = '<panel><label>test</label></panel>';
chars = str.split("");
for (var i = 0; i < chars.length; i++) {
if (chars[i] === '<' || chars[i] === '/') {
chars[i + 1] = chars[i + 1].toUpperCase();
}
}
str = chars.join("");
jsFiddle
I hope it helps.
If you are trying to just capitalise the first character of the tag name, you can use:
var s = 'panel';
s.replace(/(^.)(.*)/,function(m, a, b){return a.toUpperCase() + b.toLowerCase()}); // Panel
Alternatively you can use string manipulation (probably more efficient than a regular expression):
s.charAt(0).toUpperCase() + s.substring(1).toLowerCase(); // Panel
The above will output any input string with the first character in upper case and everything else lower case.
This is not thoroughly tested and is highly inefficient, but it worked quite quickly in the console:
(Also, it's jQuery, but it can be converted to pure JavaScript/DOM easily.)
in jsFiddle
function tagString (element) {
return $(element).
clone().
contents().
remove().
end()[0].
outerHTML.
replace(/(^<\s*\w)|(<\/\s*\w(?=\w*\s*>$))/g,
function (a) {
return a.
toUpperCase();
}).
split(/(?=<\/\s*\w*\s*>$)/);
}
function capContents (element) {
return $(element).
contents().
map(function () {
return this.nodeType === 3 ? $(this).text() : capitalizeHTML(this);
})
}
function capitalizeHTML (selector) {
var e = $(selector).first();
var wrap = tagString(e);
return wrap[0] + capContents(e).toArray().join("") + wrap[1];
}
capitalizeHTML('body');
Also, besides being a nice exercise (in my opinion), do you really need to do this?
