How do I escape quotes in HTML attribute values? - javascript

I'm building up a row to insert in a table using jQuery by creating an HTML string, i.e.
var row = "";
row += "<tr>";
row += "<td>Name</td>";
row += "<td><input value='"+data.name+"'/></td>";
row += "</tr>";
data.name is a string returned from an ajax call which could contain any characters. If it contains a single quote, ', it will break the HTML by defining the end of the attribute value.
How can I ensure that the string is rendered correctly in the browser?

Actually, you may need one of these two functions, depending on the context of use. They handle all kinds of string quotes and also protect against HTML/XML syntax.
(1) The quoteattr() function for embedding text into HTML/XML:
The quoteattr() function is used in a context, where the result will not be evaluated by javascript but must be interpreted by an XML or HTML parser, and it must absolutely avoid breaking the syntax of an element attribute.
Newlines are natively preserved when generating the content of a text element. However, if you're generating the value of an attribute, the assigned value will be normalized by the DOM as soon as it is set: all whitespace (SPACE, TAB, CR, LF) is compressed, stripping leading and trailing whitespace and reducing every inner run of whitespace to a single SPACE.
But there's an exception: the CR character will be preserved and not treated as whitespace, only if it is represented with a numeric character reference! The result will be valid for all element attributes, with the exception of attributes of type NMTOKEN or ID, or NMTOKENS: the presence of the referenced CR will make the assigned value invalid for those attributes (for example the id="..." attribute of HTML elements): this value being invalid, will be ignored by the DOM. But in other attributes (of type CDATA), all CR characters represented by a numeric character reference will be preserved and not normalized. Note that this trick will not work to preserve other whitespaces (SPACE, TAB, LF), even if they are represented by NCR, because the normalization of all whitespaces (with the exception of the NCR to CR) is mandatory in all attributes.
Note that this function itself does not perform any HTML/XML normalization of whitespaces, so it remains safe when generating the content of a text element (don't pass the second preserveCR parameter for such case).
So if you pass an optional second parameter (whose default will be treated as if it was false) and if that parameter evaluates as true, newlines will be preserved using this NCR, when you want to generate a literal attribute value, and this attribute is of type CDATA (for example a title="..." attribute) and not of type ID, IDLIST, NMTOKEN or NMTOKENS (for example an id="..." attribute).
function quoteattr(s, preserveCR) {
    preserveCR = preserveCR ? '&#13;' : '\n';
    return ('' + s) /* Forces the conversion to string. */
        .replace(/&/g, '&amp;') /* This MUST be the 1st replacement. */
        .replace(/'/g, '&apos;') /* The 4 other predefined entities, required. */
        .replace(/"/g, '&quot;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        /*
        You may add other replacements here for HTML only
        (but it's not necessary).
        Or for XML, only if the named entities are defined in its DTD.
        */
        .replace(/\r\n/g, preserveCR) /* Must be before the next replacement. */
        .replace(/[\r\n]/g, preserveCR);
}
Warning! This function still does not check the source string (which is just, in Javascript, an unrestricted stream of 16-bit code units) for its validity in a file that must be a valid plain text source and also as valid source for an HTML/XML document.
It should be updated to detect and reject (by an exception):
any code units representing code points assigned to non-characters (like \uFFFE and \uFFFF): this is a Unicode requirement only for valid plain text;
any surrogate code units which are not correctly paired to form a valid pair for a UTF-16-encoded code point: this is a Unicode requirement for valid plain text;
any valid pair of surrogate code units representing a valid Unicode code point in the supplementary planes, but which is assigned to a non-character (like U+10FFFE or U+10FFFF): this is a Unicode requirement only for valid plain text;
most C0 and C1 controls (in the ranges \u0000..\u001F and \u007F..\u009F, with the exception of TAB and the newline controls): this is not a Unicode requirement but an additional requirement for valid HTML/XML.
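The checks listed above could be sketched roughly as follows. This is only an illustration: the name assertPlainText is hypothetical, and treating TAB, LF, CR and NEL (U+0085) as the allowed "newline controls" is an assumption based on the list given later in the unquoteattr() comments.

```javascript
// Hypothetical helper, not part of the original answer: throws if `s`
// contains code units that the list above says should be rejected.
function assertPlainText(s) {
  s = '' + s;
  for (var i = 0; i < s.length; i++) {
    var start = i;
    var cp = s.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      var low = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (low < 0xDC00 || low > 0xDFFF) {
        throw new Error('unpaired high surrogate at index ' + start);
      }
      cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
      i++; // skip the low surrogate that was just consumed
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      throw new Error('unpaired low surrogate at index ' + start);
    }
    // Non-characters: U+FDD0..U+FDEF and the last two code points of each plane.
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) === 0xFFFE) {
      throw new Error('non-character at index ' + start);
    }
    // C0/C1 controls, except the ones assumed to be allowed (TAB, LF, CR, NEL).
    var isControl = cp < 0x20 || (cp >= 0x7F && cp <= 0x9F);
    var isAllowed = cp === 0x09 || cp === 0x0A || cp === 0x0D || cp === 0x85;
    if (isControl && !isAllowed) {
      throw new Error('control character at index ' + start);
    }
  }
  return s;
}
```

One possible use is to call it first, as in quoteattr(assertPlainText(data.name)), so that invalid input is rejected before any markup is generated.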
Despite this limitation, the code above is almost what you'll want to do. Normally, a modern JavaScript engine should provide this function natively in the default system object, but in most cases it does not completely ensure strict plain-text validity, nor HTML/XML validity. The HTML/XML document object from which your JavaScript code is called should redefine this native function.
This limitation is usually not a problem, because the source strings are usually computed from strings that come from the HTML/XML DOM.
But this may fail if the JavaScript extracts substrings and breaks pairs of surrogates, or if it generates text from computed numeric sources (converting any 16-bit code value into a string containing that one code unit, and appending those short strings, or inserting them via replacement operations). If you try to insert the encoded string into an HTML/XML DOM text element, attribute value or element name, the DOM will itself reject this insertion and throw an exception; if your JavaScript inserts the resulting string in a local binary file or sends it via a binary network socket, no exception will be thrown. Such non-plain-text strings would also result from reading a binary file (such as a PNG, GIF or JPEG image file) or from reading a binary-safe network socket (such that the I/O stream passes 16-bit code units rather than just 8-bit units; most binary I/O streams are byte-based anyway, and text I/O streams require you to specify a charset to decode files into plain text, so that invalid encodings found in the text stream will throw an I/O exception in your script).
Note that this function, as implemented (and once augmented to correct the limitations noted in the warning above), can also safely be used to quote the content of a literal text element in HTML/XML (to avoid leaving interpretable HTML/XML elements from the source string value), not just the content of a literal attribute value! So it would be better named quoteml(); the name quoteattr() is kept only by tradition.
This is the case in your example:
data.value = "It's just a \"sample\" <test>.\n\tTry & see yourself!";
var row = '';
row += '<tr>';
row += '<td>Name</td>';
row += '<td><input value="' + quoteattr(data.value) + '" /></td>';
row += '</tr>';
Alternative to quoteattr(), using only the DOM API:
The alternative, if the HTML code you generate will be part of the current HTML document, is to create each HTML element individually using the DOM methods of the document, so that you can set attribute values directly through the DOM API instead of inserting the full HTML content through the innerHTML property of a single element:
data.value = "It's just a \"sample\" <test>.\n\tTry & see yourself!";
var row = document.createElement('tr');
var cell = document.createElement('td');
cell.innerText = 'Name';
row.appendChild(cell);
cell = document.createElement('td');
var input = document.createElement('input');
input.setAttribute('value', data.value);
cell.appendChild(input);
row.appendChild(cell);
/*
The HTML code is generated automatically and is now accessible in the
row.innerHTML property, which you are not required to insert in the
current document.
But you can continue by appending row into a 'tbody' element object, and then
insert this into a new 'table' element object, which you can append or insert
as a child of a DOM object of your document.
*/
Note that this alternative does not attempt to preserve newlines present in the data.value, because you're generating the content of a text element, not an attribute value here. If you really want to generate an attribute value preserving newlines using &#13;, see the start of section 1, and the code within quoteattr() above.
(2) The escape() function for embedding into a javascript/JSON literal string:
In other cases, you'll use the escape() function below when the intent is to quote a string that will be part of a generated javascript code fragment, that you also want to be preserved (that may optionally also be first parsed by an HTML/XML parser in which a larger javascript code could be inserted):
function escape(s) {
return ('' + s) /* Forces the conversion to string. */
.replace(/\\/g, '\\\\') /* This MUST be the 1st replacement. */
.replace(/\t/g, '\\t') /* These 2 replacements protect whitespaces. */
.replace(/\n/g, '\\n')
.replace(/\u00A0/g, '\\u00A0') /* Useful but not absolutely necessary. */
.replace(/&/g, '\\x26') /* These 5 replacements protect from HTML/XML. */
.replace(/'/g, '\\x27')
.replace(/"/g, '\\x22')
.replace(/</g, '\\x3C')
.replace(/>/g, '\\x3E')
;
}
Warning! This source code does not check for the validity of the encoded document as a valid plain-text document. However it should never raise an exception (except for out of memory condition): Javascript/JSON source strings are just unrestricted streams of 16-bit code units and do not need to be valid plain-text or are not restricted by HTML/XML document syntax. This means that the code is incomplete, and should also replace:
all other code units representing C0 and C1 controls (with the exception of TAB and LF, handled above, but that may be left intact without substituting them) using the \xNN notation;
all code units that are assigned to non-characters in Unicode, which should be replaced using the \uNNNN notation (for example \uFFFE or \uFFFF);
all code units usable as Unicode surrogates in the range \uD800..\uDFFF, like this:
if they are not correctly paired into a valid UTF-16 pair representing a valid Unicode code point in the full range U+0000..U+10FFFF, these surrogate code units should be individually replaced using the notation \uDNNN;
else if the code point that the code unit pair represents is not valid in Unicode plain text, because the code point is assigned to a non-character, the two code units should be replaced using the notation \U00NNNNNN;
finally, if the code point represented by the code unit (or by the pair of code units representing a code point in a supplementary plane), regardless of whether that code point is assigned or reserved/unassigned, is also invalid in HTML/XML source documents (see their specification), the code point should be replaced using the \uNNNN notation (if the code point is in the BMP) or the \U00NNNNNN notation (if the code point is in a supplementary plane);
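A rough sketch of some of these extra replacements follows, done in a single pass so that no replacement can interfere with another. The name escapeJsString is hypothetical. The sketch covers only C0/C1 controls, BMP non-characters, the line separators U+2028/U+2029 and unpaired surrogates; well-formed surrogate pairs are left untouched, so supplementary-plane non-characters are not handled.

```javascript
// Hypothetical variant of escape(), not the original function.
function escapeJsString(s) {
  var named = { '\\': '\\\\', '\t': '\\t', '\n': '\\n' };
  // First alternative: a well-formed surrogate pair (kept as is).
  // Second alternative: every single code unit that needs escaping.
  var unsafe = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\\'"&<>\u0000-\u001F\u007F-\u00A0\u2028\u2029\uD800-\uDFFF\uFDD0-\uFDEF\uFFFE\uFFFF]/g;
  return ('' + s).replace(unsafe, function (m) {
    if (m.length === 2) return m;          // valid pair: leave intact
    if (named[m]) return named[m];         // backslash, TAB, LF
    var code = m.charCodeAt(0);
    var hex = code.toString(16).toUpperCase();
    return code < 0x100
      ? '\\x' + ('0' + hex).slice(-2)      // \xNN for 8-bit values
      : '\\u' + ('000' + hex).slice(-4);   // \uNNNN otherwise
  });
}

console.log(escapeJsString('a"b'));        // a\x22b
console.log(escapeJsString('\uD800'));     // \uD800 (six ASCII characters)
```

Unlike the original escape(), this sketch emits \xA0 rather than \u00A0 for the no-break space; both denote the same character in a JavaScript string literal.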
Note also that the last 5 replacements are not strictly necessary. But if you don't include them, you'll sometimes need to use the <![CDATA[ ... ]]> compatibility "hack", for example when the generated JavaScript is later included in HTML or XML (see the example below where this "hack" is used in a <script>...</script> HTML element).
The escape() function has the advantage that it does not insert any HTML/XML character reference, the result will be first interpreted by Javascript and it will keep later at runtime the exact string length when the resulting string will be evaluated by the javascript engine. It saves you from having to manage mixed context throughout your application code (see the final section about them and about the related security considerations). Notably because if you use quoteattr() in this context, the javascript evaluated and executed later would have to explicitly handle character references to re-decode them, something that would not be appropriate. Usage cases include:
when the replaced string will be inserted in a generated JavaScript event handler surrounded by some other HTML code, where the JavaScript fragment will contain attributes surrounded by literal quotes.
when the replaced string will be part of a setTimeout() parameter which will later be eval()ed by the JavaScript engine.
Example 1 (generating only JavaScript, no HTML content generated):
var title = "It's a \"title\"!";
var msg = "Both strings contain \"quotes\" & 'apostrophes'...";
setTimeout(
'__forceCloseDialog("myDialog", "' +
escape(title) + '", "' +
escape(msg) + '")',
2000);
Example 2 (generating valid HTML):
var msg =
"It's just a \"sample\" <test>.\n\tTry & see yourself!";
/* This is similar to the above, but this JavaScript code will be reinserted below: */
var scriptCode =
'alert("' +
escape(msg) + /* important here!, because part of a JS string literal */
'");';
/* First case (simple when inserting in a text element): */
document.write(
'<script type="text/javascript">' +
'\n//<![CDATA[\n' + /* (not really necessary but improves compatibility) */
scriptCode +
'\n//]]>\n' + /* (not really necessary but improves compatibility) */
'</script>');
/* Second case (more complex when inserting in an HTML attribute value): */
document.write(
'<span onclick="' +
quoteattr(scriptCode) + /* important here, because part of an HTML attribute */
'">Click here !</span>');
In this second example, both encoding functions are used together: the text embedded in a JavaScript string literal is encoded with escape(), and in the second case the generated JavaScript code (containing that string literal) is itself re-encoded with quoteattr(), because that JavaScript code is inserted in an HTML attribute.
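To make the double encoding concrete, here is a small sketch that prints the intermediate strings. The functions escapeJs and quoteAttr are condensed stand-ins for the escape() and quoteattr() functions above (hypothetical names, newline handling of quoteattr() omitted), and the sample message is made up for illustration.

```javascript
// Condensed stand-in for escape(): JavaScript string-literal context.
function escapeJs(s) {
  return ('' + s)
    .replace(/\\/g, '\\\\')
    .replace(/\t/g, '\\t')
    .replace(/\n/g, '\\n')
    .replace(/&/g, '\\x26')
    .replace(/'/g, '\\x27')
    .replace(/"/g, '\\x22')
    .replace(/</g, '\\x3C')
    .replace(/>/g, '\\x3E');
}

// Condensed stand-in for quoteattr(): HTML attribute-value context.
function quoteAttr(s) {
  return ('' + s)
    .replace(/&/g, '&amp;')
    .replace(/'/g, '&apos;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

var msg = 'Say "hi" & <go>';
// Inner context first: the text becomes part of a JS string literal.
var scriptCode = 'alert("' + escapeJs(msg) + '");';
// Outer context second: the JS code becomes an HTML attribute value.
var html = '<span onclick="' + quoteAttr(scriptCode) + '">Click here!</span>';

console.log(scriptCode);
// alert("Say \x22hi\x22 \x26 \x3Cgo\x3E");
console.log(html);
// <span onclick="alert(&quot;Say \x22hi\x22 \x26 \x3Cgo\x3E&quot;);">Click here!</span>
```

The order matters: the innermost context is encoded first, and each enclosing context is encoded afterwards, so that decoding happens in the reverse order when the browser parses the attribute and then runs the handler.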
(3) General considerations for safely encoding texts to embed in syntactic contexts:
So in summary,
the quoteattr() function must be used when generating the content of an HTML/XML attribute literal, where the surrounding quotes are added externally within a concatenation to produce a complete HTML/XML code.
the escape() function must be used when generating the content of a JavaScript string constant literal, where the surrounding quotes are added externally within a concatenation to produce a complete JavaScript code fragment.
If used carefully, everywhere you have variable contents to insert safely into another context, and only under these rules (with the functions implemented exactly as above, which take care of the "special characters" used in both contexts), you may mix both via multiple escaping: the transform will still be safe and will not require additional code to decode them in the application using those literals. Do not use these functions in any other context:
Those functions are only safe in those strict contexts (i.e. only HTML/XML attribute values for quoteattr(), and only Javascript string literals for escape()).
There are other contexts using different quoting and escaping mechanisms (e.g. SQL string literals, or Visual Basic string literals, or regular expression literals, or text fields of CSV data files, or MIME header values), which will each require their own distinct escaping function used only in these contexts:
Never assume that quoteattr() or escape() will be safe or will not alter the semantic of the escaped string, before checking first, that the syntax of (respectively) HTML/XML attribute values or JavaScript string literals will be natively understood and supported in those contexts.
For example the syntax of Javascript string literals generated by escape() is also appropriate and natively supported in the two other contexts of string literals used in Java programming source code, or text values in JSON data.
But the reverse is not always true. For example:
Interpreting encoded literals that were initially generated for contexts other than JavaScript string literals (for example, string literals in PHP source code) is not always safe for direct use as JavaScript literals, for instance by passing them through the JavaScript eval() function to decode string literals that were not escaped using escape(). Those other string literals may contain special characters generated specifically for their initial contexts, which JavaScript will interpret incorrectly. This could include additional escapes such as "\Uxxxxxxxx", "\e", "${var}" and "$$", the inclusion of additional concatenation operators such as ' + " which change the quoting style, or "transparent" delimiters such as "<!--" and "-->" or "<![CDATA[" and "]]>" (which may be found, and be safe, only within a different, more complex context supporting multiple escaping syntaxes: see the last paragraph of this section about mixed contexts).
The same applies to the interpretation/decoding of encoded literals that were initially generated for contexts other than HTML/XML attribute values in documents using their standard textual representation (for example, trying to interpret string literals that were generated for embedding in a non-standard binary representation of HTML/XML documents!)
This also applies to the interpretation/decoding with the JavaScript function eval() of string literals that were only safely generated for inclusion in HTML/XML attribute literals using quoteattr(), which will not be safe, because the contexts have been incorrectly mixed.
This will also apply to the interpretation/decoding with an HTML/XML text document parser of attribute value literals that were only safely generated for inclusion in a Javascript string literal using escape(), which will not be safe, because the contexts have also been incorrectly mixed.
(4) Safely decoding the value of embedded syntactic literals:
If you want to decode or interpret string literals in contexts where the decoded string values will be used interchangeably and without change in another context, so-called mixed contexts (for example: naming some identifiers in HTML/XML with string literals initially safely encoded with quoteattr(); naming some programming variables for JavaScript from strings initially safely encoded with escape(); and so on), you'll need to prepare and use a new escaping function (which will also check the validity of the string value before encoding it, or reject it, or truncate/simplify/filter it), as well as a new decoding function. The decoding function must carefully avoid interpreting sequences that are valid but unsafe, accepted internally but not acceptable from unsafe external sources. This means that decoding functions such as eval() in JavaScript must be absolutely avoided for decoding JSON data sources, for which you'll need to use a safer native JSON decoder; a native JSON decoder will not interpret valid JavaScript sequences, such as quoting delimiters inside the literal expression, operators, or sequences like "{$var}".
These last considerations about decoding literals in mixed contexts, which were only safely encoded for transport into a single, more restrictive context, are absolutely critical for the security of your application or web service. Never mix those contexts between the encoding place and the decoding place if those places do not belong to the same security realm (and even in that case, using mixed contexts is always very dangerous, because it is very difficult to track precisely in your code).
For this reason I recommend you never use or assume mixed contexts anywhere in your application: instead write a safe encoding and decoding function for a single precise context that has precise length and validity rules on the decoded string values, and precise length and validity rules on the encoded string literals. Ban those mixed contexts: for each change of context, use another matching pair of encoding/decoding functions (which function is used in this pair depends on which context is embedded in the other context; and the pair of matching functions is also specific to each pair of contexts).
This means that:
To safely decode an HTML/XML attribute value literal that has been initially encoded with quoteattr(), you must not assume that it has been encoded using other named entities whose value will depend on a specific DTD defining it. You must instead initialize the HTML/XML parser to support only the few default named character entities generated by quoteattr() and optionally the numeric character entities (which are also safe in such a context: the quoteattr() function only generates a few of them but could generate more of these numeric character references, but must not generate other named character entities which are not predefined in the default DTD). All other named entities must be rejected by your parser, as being invalid in the source string literal to decode. Alternatively you'll get better performance by defining an unquoteattr function (which will reject any presence of literal quotes within the source string, as well as unsupported named entities).
To safely decode a Javascript string literal (or JSON string literal) that has been initially encoded with escape(), you must use the safe JavaScript unescape() function, but not the unsafe Javascript eval() function!
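As a side illustration of why a native decoder is preferable to eval(), the following sketch uses the standard JSON.stringify() and JSON.parse() functions. This is not the answer's escape()/unescape() pair: JSON has no \xNN escapes, so JSON.parse() would reject the output of escape() above. The sample strings are made up.

```javascript
var payload = 'It\'s a "title" with a \\ backslash\nand a newline';

// Encoding: JSON.stringify produces a complete string literal, quotes included.
var literal = JSON.stringify(payload);

// Decoding: JSON.parse only accepts data, never executable code.
var decoded = JSON.parse(literal);
console.log(decoded === payload); // true

// An input that eval() would execute is simply rejected by JSON.parse.
var hostile = '"x" + (function () { return "injected"; })()';
var rejected = false;
try {
  JSON.parse(hostile);
} catch (e) {
  rejected = e instanceof SyntaxError;
}
console.log(rejected); // true
```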
Examples for these two associated safe decoding functions follow.
(5) The unquoteattr() function to parse text embedded in HTML/XML text elements or attribute values literals:
function unquoteattr(s) {
/*
Note: this can be implemented more efficiently by a loop searching for
ampersands, from start to end of the source string, and parsing the
character(s) found immediately after the ampersand.
*/
s = ('' + s); /* Forces the conversion to string type. */
/*
You may optionally start by detecting CDATA sections (like
`<![CDATA[` ... `]]>`), whose contents must not be reparsed by the
following replacements, but separated, filtered out of the CDATA
delimiters, and then concatenated into an output buffer.
The following replacements are only for sections of source text
found *outside* such CDATA sections, that will be concatenated
in the output buffer only after all the following replacements and
security checkings.
This will require a loop starting here.
The following code is only for the alternate sections that are
not within the detected CDATA sections.
*/
/* Decode by reversing the initial order of replacements. */
s = s
.replace(/\r\n/g, '\n') /* To do before the next replacement. */
.replace(/[\r\n]/g, '\n')
.replace(/&#13;&#10;/g, '\n') /* These 3 replacements keep whitespaces. */
.replace(/&#1[03];/g, '\n')
.replace(/&#9;/g, '\t')
.replace(/&gt;/g, '>') /* The 4 other predefined entities required. */
.replace(/&lt;/g, '<')
.replace(/&quot;/g, '"')
.replace(/&apos;/g, "'")
;
/*
You may add other replacements here for predefined HTML entities only
(but it's not necessary). Or for XML, only if the named entities are
defined in *your* assumed DTD.
But you can add these replacements only if these entities will *not*
be replaced by a string value containing *any* ampersand character.
Do not decode the '&amp;' sequence here !
If you choose to support more numeric character entities, their
decoded numeric value *must* be assigned characters or unassigned
Unicode code points, but *not* surrogates or assigned non-characters,
and *not* most C0 and C1 controls (except a few ones that are valid
in HTML/XML text elements and attribute values: TAB, LF, CR, and
NL='\x85').
If you find valid Unicode code points that are invalid characters
for XML/HTML, this function *must* reject the source string as
invalid and throw an exception.
In addition, the four possible representations of newlines (CR, LF,
CR+LF, or NL) *must* be decoded only as if they were '\n' (U+000A).
See the XML/HTML reference specifications !
*/
/* Required check for security! */
/* Every remaining ampersand must start an '&amp;' reference. */
var found = s.match(/&[^;]*;?/g) || [];
for (var i = 0; i < found.length; i++)
    if (found[i] != '&amp;')
        throw 'unsafe entity found in the attribute literal content';
/* This MUST be the last replacement. */
s = s.replace(/&amp;/g, '&');
/*
The loop needed to support CDATA sections will end here.
This is where you'll concatenate the replaced sections (CDATA or
not), if you have split the source string to detect and support
these CDATA sections.
Note that all backslashes found in CDATA sections do NOT have the
semantic of escapes, and are *safe*.
On the opposite, CDATA sections not properly terminated by a
matching `]]>` section terminator are *unsafe*, and must be rejected
before reaching this final point.
*/
return s;
}
Note that this function does not parse the surrounding quote delimiters which are used to surround HTML attribute values. This function can in fact decode any HTML/XML text element content as well, possibly containing literal quotes, which are safe. It's your responsibility to parse the HTML code to extract quoted strings used in HTML/XML attributes, and to strip those matching quote delimiters before calling the unquoteattr() function.
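The loop-based approach mentioned in the comments of unquoteattr() could be sketched as a single regular-expression pass, which also guarantees that decoded text is never decoded a second time. This is only an illustration under assumptions: decodeAttr is a hypothetical name, CDATA sections are not handled, and only the five predefined named entities plus numeric references are accepted.

```javascript
// Hypothetical single-pass decoder, not the original unquoteattr().
function decodeAttr(s) {
  var named = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };
  // Alternatives, in order: an encoded CR+LF, a decimal, hexadecimal or
  // named reference, and finally a bare ampersand (always an error).
  var ref = /&#13;&#10;|&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));|&/g;
  return ('' + s).replace(ref, function (m, dec, hex, name) {
    if (m === '&#13;&#10;') return '\n';
    if (name !== undefined) {
      if (!Object.prototype.hasOwnProperty.call(named, name)) {
        throw new Error('unsupported named entity: ' + m);
      }
      return named[name];
    }
    if (dec === undefined && hex === undefined) {
      throw new Error('bare ampersand in attribute literal content');
    }
    var cp = dec !== undefined ? parseInt(dec, 10) : parseInt(hex, 16);
    if (cp === 0x0D) return '\n'; // newlines are normalized to LF
    var ok = cp === 0x09 || cp === 0x0A ||
      (cp >= 0x20 && cp < 0x7F) ||
      (cp > 0x9F && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (!ok) throw new Error('invalid character reference: ' + m);
    return String.fromCodePoint(cp);
  });
}

console.log(decodeAttr('It&apos;s &lt;b&gt; &amp;amp; &quot;x&quot;'));
// It's <b> &amp; "x"
```

Because every reference is decoded exactly once, the input &amp;amp; yields the text &amp; and not a bare ampersand.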
(6) The unescape() function to parse text contents embedded in Javascript/JSON literals:
function unescape(s) {
/*
Note: this can be implemented more efficiently by a loop searching for
backslashes, from start to end of source string, and parsing and
dispatching the character found immediately after the backslash, if it
must be followed by additional characters such as an octal or
hexadecimal 7-bit ASCII-only encoded character, or an hexadecimal Unicode
encoded valid code point, or a valid pair of hexadecimal UTF-16-encoded
code units representing a single Unicode code point.
8-bit encoded code units for non-ASCII characters should not be used, but
if they are, they should be decoded into a 16-bit code units keeping their
numeric value, i.e. like the numeric value of an equivalent Unicode
code point (which means ISO 8859-1, not Windows 1252, including C1 controls).
Note that Javascript or JSON does NOT require code units to be paired when
they encode surrogates; and Javascript/JSON will also accept any Unicode
code point in the valid range representable as UTF-16 pairs, including
NULL, all controls, and code units assigned to non-characters.
This means that all code points in \U00000000..\U0010FFFF are valid,
as well as all 16-bit code units in \u0000..\uFFFF, in any order.
It's up to your application to restrict these valid ranges if needed.
*/
s = ('' + s) /* Forces the conversion to string. */
/* Decode by reversing the initial order of replacements */
.replace(/\\x3E/g, '>')
.replace(/\\x3C/g, '<')
.replace(/\\x22/g, '"')
.replace(/\\x27/g, "'")
.replace(/\\x26/g, '&') /* These 5 replacements protect from HTML/XML. */
.replace(/\\u00A0/g, '\u00A0') /* Useful but not absolutely necessary. */
.replace(/\\n/g, '\n')
.replace(/\\t/g, '\t') /* These 2 replacements protect whitespaces. */
;
/*
You may optionally add here support for other numerical or symbolic
character escapes.
But you can add these replacements only if these entities will *not*
be replaced by a string value containing *any* backslash character.
Do not decode to any doubled backslashes here !
*/
/* Required check for security! */
/* Every remaining backslash must be part of a doubled backslash. */
var found = s.match(/\\[\s\S]?/g) || [];
for (var i = 0; i < found.length; i++)
    if (found[i] != '\\\\')
throw 'Unsafe or unsupported escape found in the literal string content';
/* This MUST be the last replacement. */
return s.replace(/\\\\/g, '\\');
}
Note that this function does not parse the surrounding quote delimiters which are used to surround JavaScript or JSON string literals. It's your responsibility to parse the JavaScript or JSON source code to extract quoted string literals, and to strip those matching quote delimiters before calling the unescape() function.
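The loop-based decoder mentioned in the comments of unescape() could look roughly like the sketch below. A single pass matters here: with chained replacements, an escaped backslash followed by the characters x3E can be mistaken for the escape \x3E. The name decodeJsEscapes is hypothetical, and the accepted escapes (\\, \t, \n, \r, quotes, \xNN, \uNNNN) are an assumption about what the application wants to allow.

```javascript
// Hypothetical single-pass decoder, not the original unescape().
function decodeJsEscapes(s) {
  var simple = { '\\': '\\', t: '\t', n: '\n', r: '\r', '"': '"', "'": "'" };
  // A backslash followed by \xNN, \uNNNN, or at most one other character.
  var esc = /\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|([\s\S]?))/g;
  return ('' + s).replace(esc, function (m, x, u, ch) {
    if (x !== undefined) return String.fromCharCode(parseInt(x, 16));
    if (u !== undefined) return String.fromCharCode(parseInt(u, 16));
    if (Object.prototype.hasOwnProperty.call(simple, ch)) return simple[ch];
    // Anything else (unknown escape, or a trailing lone backslash).
    throw new Error('Unsafe or unsupported escape: ' + m);
  });
}

// The encoded form of: a"b, a newline, then a literal backslash and "x3E".
console.log(decodeJsEscapes('a\\x22b\\n\\\\x3E') === 'a"b\n\\x3E'); // true
```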

You just need to swap any ' characters with the equivalent HTML entity character code:
data.name.replace(/'/g, "&#39;");
Alternatively, you could create the whole thing using jQuery's DOM manipulation methods:
var row = $("<tr>").append("<td>Name</td><td></td>");
$("<input>", { value: data.name }).appendTo(row.children("td:eq(1)"));

" = &quot; or &#34;
' = &#39;
Examples:
<div attr="Tim &quot;The Toolman&quot; Taylor"
<div attr='Tim "The Toolman" Taylor'
<div attr="Tim 'The Toolman' Taylor"
<div attr='Tim &#39;The Toolman&#39; Taylor'
In JavaScript strings, you use \ to escape the quote character:
var s = "Tim \"The Toolman\" Taylor";
var s = 'Tim \'The Toolman\' Taylor';
So, quote your attribute values with " and use a function like this:
function escapeAttrNodeValue(value) {
return value.replace(/(&)|(")|(\u00A0)/g, function(match, amp, quote) {
if (amp) return "&amp;";
if (quote) return "&quot;";
return "&nbsp;";
});
}

My answer is partially based on Andy E's, and I still recommend reading what verdy_p wrote, but here it is:
$("<a>", { href: 'very<script>\'b"ad' }).text('click me')[0].outerHTML
Disclaimer: this is not an answer to the exact question, just to "how to escape an attribute".

Using Lodash:
const serialised = _.escape("Here's a string that could break HTML");
// Add it into a data attribute in the HTML
const html = '<a data-value-serialised="' + serialised + '" onclick="callback(event)">link</a>';
// and then in the JS where this value will be read:
function callback(e) {
$(e.currentTarget).data('valueSerialised'); // with a bit of help from jQuery
const originalString = _.unescape(serialised); // can be used as part of a payload or whatever.
}

The given answers seem rather complicated, so for my use case I tried the built-in encodeURIComponent and decodeURIComponent and found that they worked well. As noted in the comments, encodeURIComponent does not escape ', but for that you can use the escape() and unescape() methods instead.
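A minimal sketch of this approach, assuming the value is stored in an attribute and decoded again by your own code when it is read back (the data-raw attribute name and the sample value are made up). The apostrophe is handled here with an explicit replacement rather than escape():

```javascript
var value = 'It\'s a "sample" <test> & more';

// encodeURIComponent leaves ' untouched, so it is percent-encoded by hand.
var encoded = encodeURIComponent(value).replace(/'/g, '%27');
console.log(encoded);
// It%27s%20a%20%22sample%22%20%3Ctest%3E%20%26%20more

// The encoded text contains no quotes, ampersands or angle brackets.
var html = "<input data-raw='" + encoded + "'>";

// decodeURIComponent also decodes %27, so the round trip is lossless.
var decoded = decodeURIComponent(encoded);
console.log(decoded === value); // true
```

Note that the attribute then holds the percent-encoded text, so this only suits values that your own script decodes, not values the browser should display directly.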

I think you could do:
var row = "";
row += "<tr>";
row += "<td>Name</td>";
row += "<td><input value=\""+data.name+"\"/></td>";
row += "</tr>";
This covers the case where data.name contains a single quote, although a double quote in data.name would then break the attribute instead.
In the best case, you could create an INPUT element and then set its value to data.name directly.

Related

changing string delimiters to backticks : possible impact?

ES6 introduced template strings delimited by backticks `.
In which cases would replacing the single ' or double " quotes around a string with backticks yield a different result, or otherwise be unsafe?
Escaping of existing backticks inside the code is performed as part of the operation.
// before
var message = "Display backtick ` on screen";
// after
var message = `Display backtick \` on screen`;
I understand that any string containing ${...} would fail as it would be (mis)interpreted as a placeholder. Are there any other relevant patterns?
Context: this is for the development of a JS compression tool that automatically processes input code. The code, and the strings it contains, are user-provided, hence I have no control over their contents. The only assumption one can make is that it is valid JavaScript.
The execution environment can be any recent browser or Node.js.
You can check the grammar of string literals and no-substitution-templates.
Apart from the ` vs. ' vs. " and the special meaning of ${ that you already mentioned, only line breaks differ between the two forms. We can ignore line continuations ("escaped linebreaks") as your minifier would strip them away anyway, but plain line breaks are valid inside templates while not inside string literals, so if you convert back and forth you'll have to care about those as well. You can even use them to save another byte if your string literal contains \n.
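A minimal sketch of such a conversion, working on the raw source text found between the quotes of a string literal. The name toTemplateLiteral is hypothetical. The sketch only handles the backtick and ${ cases discussed above; it copies existing escape sequences unchanged and does not deal with legacy octal escapes such as \1, which, as far as I know, template literals reject.

```javascript
// Hypothetical helper: `rawBody` is the source text between the quotes of
// a string literal, exactly as written in the input program.
function toTemplateLiteral(rawBody) {
  var out = '';
  for (var i = 0; i < rawBody.length; i++) {
    var ch = rawBody[i];
    if (ch === '\\' && i + 1 < rawBody.length) {
      // Existing escape sequence: copy the backslash and the next character.
      out += ch + rawBody[i + 1];
      i++;
    } else if (ch === '`') {
      out += '\\`';   // a bare backtick would end the template
    } else if (ch === '$' && rawBody[i + 1] === '{') {
      out += '\\$';   // "${" would otherwise start a placeholder
    } else {
      out += ch;
    }
  }
  return '`' + out + '`';
}

console.log(toTemplateLiteral('Display backtick ` on screen'));
// `Display backtick \` on screen`
console.log(toTemplateLiteral('cost: ${x}'));
// `cost: \${x}`
```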

JavaScript:output symbols and special characters

I am trying to include some symbols into a div using JavaScript.
It should look like this:
x ∈ ℝ
, but all I get is: x &isin; &reals;.
var div=document.getElementById("text");
var textnode = document.createTextNode("x &isin; &reals;");
div.appendChild(textnode);
<div id="text"></div>
I had tried document.getElementById("something").innerHTML="x &isin; &reals;" and it worked, so I have no clue why the createTextNode method did not.
What should I do in order to output the right thing?
You are including HTML escapes ("entities") in what needs to be text. According to the docs for createTextNode:
data is a string containing the data to be put in the text node
That's it. It's the data to be put in the text node. The DOM spec is just as clear:
Creates a Text node given the specified string.
You want to include Unicode in this string. To include Unicode in a JavaScript string, use Unicode escapes, in the format \uXXXX.
var textnode = document.createTextNode("x \u2208 \u211D");
Or, you could simply include the actual Unicode character and avoid all the trouble:
var textnode = document.createTextNode("x ∈ ℝ");
In this case, just make sure that the JS file is served as UTF-8, you are saving the file as UTF-8, etc.
The reason that setting .innerHTML works with HTML entities is that it sets the content as HTML, meaning it interprets it as HTML, in all regards, including markup, special entities, etc. It may be easier to understand this if you consider the difference between the following:
document.createTextNode("<div>foo</div>");
document.createElement("div").textContent = "<div>foo</div>";
document.createElement("div").innerHTML = "<div>foo</div>";
The first creates a text node with the literal characters "<div>foo</div>". The second sets the content of the new element literally to "<div>foo</div>". The third, on the other hand, creates an actual div element inside the new element containing the text "foo".
Every character has a hexadecimal code point (for example 211D). If you want to write it as an HTML character reference, put &#x before it and ; after it => &#x211D; or use the entity name &reals; or the decimal form &#8477;. They can all be found here: http://www.w3schools.com/charsets/ref_html_entities_4.asp
But a JavaScript string does not interpret HTML entities, so to make the browser output the Unicode symbol rather than the literal entity text, a JavaScript escape sequence is required. To do that, add \u before the four-digit hexadecimal value => \u211D.
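To make the relationship between these notations concrete, here is a small sketch that derives each of them from a character. The helper name describeCharacter is made up for illustration, and the \uXXXX form it produces is only meaningful for code points up to FFFF.

```javascript
// Hypothetical helper: list the notations discussed above for one character.
function describeCharacter(ch) {
  const codePoint = ch.codePointAt(0);
  const hex = codePoint.toString(16).toUpperCase().padStart(4, "0");
  return {
    hex: hex,                            // hexadecimal code point, e.g. 211D
    htmlHex: "&#x" + hex + ";",          // HTML hexadecimal character reference
    htmlDecimal: "&#" + codePoint + ";", // HTML decimal character reference
    jsEscape: "\\u" + hex,               // JavaScript escape (code points up to FFFF only)
  };
}

const reals = describeCharacter("\u211D");
console.log(reals.htmlHex, reals.htmlDecimal, reals.jsEscape);
```

The first two forms only mean something to an HTML parser (markup or innerHTML); the last one only means something inside JavaScript source code.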
document.createTextNode will automatically html-escape the needed characters. You have to provide those texts as JavaScript strings, either escaped or not:
document.body.appendChild(document.createTextNode("x ∈ ℝ"));
document.body.appendChild(document.createElement("br"));
document.body.appendChild(document.createTextNode("x \u2208 \u211d"));
EDIT: It's not true that the createTextNode function will do actual HTML escaping here, as it doesn't need to. @deceze gave a very good explanation of the connection between the DOM and HTML: HTML is a textual representation of the DOM, thus you don't need any HTML-related escaping when directly manipulating the DOM.

check if javascript string is valid UTF-8

A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from a rtf file that contains tabs.
How can I check if a string is a valid UTF-8?
Exposition
I think you misunderstand what "UTF-8 characters" means; UTF-8 is an encoding of Unicode which can represent any character, glyph, and grapheme that is defined in the (ever growing) Unicode standard. There are fewer Unicode code points than there are possible UTF-8 byte sequences, so the only "invalid UTF-8 characters" are UTF-8 byte sequences that don't map to any Unicode code point, but I assume this is not what you're referring to.
for example, a copy and paste from a rtf file that contains tabs.
RTF is a formatting system which works independently of the underlying encoding scheme - you can use RTF with ASCII, UTF-8, UTF-16 and other encodings. With respect to the HTML textboxes in your post, both the <input type="text"> and <textarea> elements in HTML only respect plaintext, so any RTF formatting will be automatically stripped when pasted by a user. That is why JS-heavy "rich-edit" and contenteditable components are not uncommon in web applications (though in this answer I assume you're not using a rich-edit component in a web page).
Tabs in RTF files are not an RTF feature: they're just normal ASCII-style tab characters, i.e. \t or 0x09, which also appear in Unicode, and thus, can also appear in UTF-8 encoded text; furthermore, it's perfectly valid for web-browsers to allow users to paste those into <input> and <textarea>.
Javascript (ECMAScript) itself is Unicode-native; that is, the ECMAScript specification does require JS engines to use UTF-16 representations in some places, such as in the abstract operation IsStringWellFormedUnicode:
7.2.9 Static Semantics: IsStringWellFormedUnicode
The abstract operation IsStringWellFormedUnicode takes argument string (a String) and returns a Boolean. It interprets string as a sequence of UTF-16 encoded code points, as described in 6.1.4, and determines whether it is a well formed UTF-16 sequence.
...but that part of the specification is intended for JS engine programmers, and not people who write JS for use in browsers - in fact, I'd say it's safe to assume that within a web-browser, any-and-all JS string values will always be valid strings that can always be serialized out to UTF-8 and UTF-16, and also that JS scripts should not be concerned with the actual in-memory encoding of the string's content.
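If you would rather verify that assumption than rely on it, the check the specification describes can be approximated in a few lines: a string is well formed when every surrogate code unit is part of a high/low pair. The function name isWellFormedUtf16 below is my own; recent engines also ship a built-in String.prototype.isWellFormed() that answers the same question.

```javascript
// Sketch: returns false if the string contains a lone (unpaired) surrogate.
function isWellFormedUtf16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate without a preceding high surrogate.
      return false;
    }
  }
  return true;
}
```

Tabs, line breaks and other control characters are ordinary code points, so they pass this check; it only flags strings that could not be encoded as UTF-8 without loss.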
Your question
So given that your question is written as this:
A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from a rtf file that contains tabs.
How can I check if a string is a valid UTF-8?
I'm going to interpret it as this:
A user can copy RTF text from a program like WordPad and paste it into a HTML <textarea> or <input type="text"> in a web-browser, and when it's pasted the plaintext representation of the RTF still contains certain characters that my application should not accept such as whitespace like tabs.
How can I detect these unwanted characters and inform the user - or remove those unwanted characters?
...to which my answer is:
I suggest just stripping-out unwanted characters using a regular-expression that matches non-visible characters (from here: Match non printable/non ascii characters and remove from text )
let textBoxContent = document.getElementById( 'myTextarea' ).value;
textBoxContent = textBoxContent.replace( /[^\x20-\x7E]+/g, '' );
The expression [^\x20-\x7E] matches any character NOT in the code point range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character). All other characters will be removed, including tabs, line breaks and non-Latin text.
The g switch at the end makes it a global find-and-replace operation; without the g, only the first run of unwanted characters would be removed.
The range 0x20-0x7E works because Unicode's first 128 code points are identical to ASCII and can be seen here: http://www.asciitable.com/
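If removing non-Latin text is too aggressive for your application, a narrower variant is possible. The sketch below is an assumption about what "unwanted" means here: it strips only the C0 and C1 control characters (which includes tabs and carriage returns) and keeps line feeds and all printable text. The function name stripControlCharacters is hypothetical.

```javascript
// Sketch: remove control characters except LF (\n), keep everything else.
function stripControlCharacters(text) {
  // U+0000-U+0009 and U+000B-U+001F are the C0 controls without LF;
  // U+007F is DEL and U+0080-U+009F are the C1 controls.
  return text.replace(/[\u0000-\u0009\u000B-\u001F\u007F-\u009F]/g, "");
}
```

It would be applied the same way as the expression above, for example to the value read from the textarea before it is stored or submitted.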
Just an idea:
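function checkUTF8(text) {
  var utf8Text = text;
  try {
    // escape() (deprecated) percent-encodes characters below U+0100 as single
    // %XX bytes; decodeURIComponent() then decodes those bytes as UTF-8.
    utf8Text = decodeURIComponent(escape(text));
    // Success: text was plain ASCII, or UTF-8 bytes misread one byte per
    // character, and utf8Text now holds the properly decoded string.
  } catch (e) {
    // console.log(e.message); // URI malformed
    // Failure: text was not a misread UTF-8 byte sequence, so it is kept as-is.
  }
  return utf8Text;
}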

Special characters not displaying correctly in a Javascript string

I have a function that assigns a string containing special characters to a variable, then passes that variable to a DOM element via the innerHTML property, but it prints strange characters. Let's say I code this...
someText = "äêíøù";
document.getElementById("someElement").innerHTML = someText;
It prints the following text...
äêíøù
I know how to use the entity names to prevent this, but when I use them to pass the value through a Javascript method, they print literally.
This means that you have a conflict of encodings. Your JavaScript and your HTML are being served to the browser with different encodings/character sets. Ensure that they're encoded in and served with the same encoding / character set (UTF8 is a good choice) to make sure that characters are correctly interpreted.
Obligatory link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
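To see what such an encoding conflict does to a string, the following sketch reproduces it on purpose: it encodes text as UTF-8 and then reads each byte back as if it were one Latin-1 character, which is the typical result of a file saved as UTF-8 but served or declared as ISO-8859-1. Both function names are made up for this illustration, and it assumes TextEncoder and TextDecoder are available (current browsers and Node.js).

```javascript
// Simulate the bug: UTF-8 bytes interpreted as one Latin-1 character each.
function misreadUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text); // always UTF-8
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Reverse it: treat each character as a byte again and decode as UTF-8.
function repairMojibake(garbled) {
  const bytes = Uint8Array.from(garbled, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

const garbled = misreadUtf8AsLatin1("\u00E4\u00EA\u00ED\u00F8\u00F9"); // the sample text from the question
console.log(garbled, "->", repairMojibake(garbled));
```

Each accented letter turns into two characters, because UTF-8 stores it in two bytes. Repairing it after the fact is only a diagnostic aid; the real fix remains serving every file with the same, correctly declared encoding.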

How to decode html miscellaneous symbol from its glyph?

I have a div that contains a settings icon that is a html miscellaneous symbol
<span class="settings-icon">&#9881;</span>
I have a jasmine test that checks the div contents to make sure that it is not changed.
it("the settings div should contain ⚙", function() {
var settingsIconDiv = $('.settings-icon');
expect(settingsIconDiv.text())
.toContain('&#9881;');
});
It will not pass, as the content is evaluated as its glyph, the gear icon ⚙.
How do I decode the glyph in order to pass the test?
To get actual character from Unicode to compare it to a literal in HTML you can use String.fromCharCode() e.g.
.toContain(String.fromCharCode(9881))
You should check against the string '⚙' or, if you do not know how to enter it in your code, the escape notation \u2699. There are other, clumsier ways to construct a string containing the character, but simplicity is best.
No matter how the character is written in HTML source code (e.g., as the reference &#9881;), it appears in the DOM as the character itself, U+2699. In JavaScript, a string like &#9881; is just a sequence of seven ASCII characters (though you can pass it to a function that parses it as an HTML character reference, or you can assign it e.g. to the innerHTML property, causing HTML parsing, but this is rather pointless and confusing).
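The difference between the reference and the character can be shown without a DOM. The sketch below parses a single numeric character reference by hand; decodeNumericReference is a hypothetical helper written for this illustration, not a browser or jQuery API, and it handles only the &#NNNN; and &#xHHHH; forms (named entities are returned unchanged).

```javascript
// Hypothetical helper: decode one numeric HTML character reference.
function decodeNumericReference(ref) {
  const match = /^&#(?:x([0-9a-f]+)|([0-9]+));$/i.exec(ref);
  if (!match) return ref; // not a numeric reference: return the input as-is
  const codePoint = match[1] ? parseInt(match[1], 16) : parseInt(match[2], 10);
  return String.fromCodePoint(codePoint); // throws RangeError above 0x10FFFF
}

const reference = "&#9881;";
const gear = decodeNumericReference(reference);
console.log(reference.length, gear.length, gear === "\u2699");
```

In a test, comparing against '\u2699' directly is still the simplest option; the helper only makes explicit what the HTML parser does before the text reaches the DOM.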
To match the browser behavior (because you don't know how it is encoded in HTML or in text) I would try the following:
.toContain($("<span>&#9881;</span>").text()) instead of .toContain('&#9881;').
That way it should match how it is stored in the DOM.
The String.fromCharCode(9881); mentioned by Yuriy Galanter will definitely also work reliably. But because the DOM engine and the JS engine are two different parts that could behave differently, I would test with both techniques.
