How to decode html miscellaneous symbol from its glyph?

I have a div that contains a settings icon, written as an HTML miscellaneous symbol:
<span class="settings-icon">&#9881;</span>
I have a Jasmine test that checks the div contents to make sure they have not changed.
it("the settings div should contain &#9881;", function() {
  var settingsIconDiv = $('.settings-icon');
  expect(settingsIconDiv.text())
    .toContain('&#9881;');
});
It does not pass, because the reference is evaluated to its glyph, the gear icon ⚙.
How do I decode the glyph so that the test passes?

To get the actual character from its Unicode code point, so that you can compare it with the text in the HTML, you can use String.fromCharCode(), e.g.
.toContain(String.fromCharCode(9881))

You should check against the string '⚙' or, if you do not know how to enter it in your code, the escape notation \u2699. There are other, clumsier ways to construct a string containing the character, but simplicity is best.
No matter how the character is written in HTML source code (e.g., as the reference &#9881;), it appears in the DOM as the character itself, U+2699. In JavaScript, a string like '&#9881;' is just a sequence of seven ASCII characters (though you can pass it to a function that parses it as an HTML character reference, or you can assign it e.g. to the innerHTML property, causing HTML parsing, but this is rather pointless and confusing).
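To make the distinction concrete, here is a minimal sketch in plain JavaScript (no DOM or test framework needed). The variable names are illustrative only and do not come from the question.

```javascript
// Three ways to write the same one-character string, U+2699 (gear).
const fromEscape = '\u2699';
const fromCharCode = String.fromCharCode(9881); // 9881 is 0x2699 in decimal
const fromCodePoint = String.fromCodePoint(0x2699);

console.log(fromEscape === fromCharCode);   // true
console.log(fromEscape === fromCodePoint);  // true
console.log(fromEscape.length);             // 1

// The HTML character reference is NOT decoded by JavaScript itself:
// it stays a plain string of seven ASCII characters.
const reference = '&#9881;';
console.log(reference.length);              // 7
console.log(reference === fromEscape);      // false
```

This is why a test that compares DOM text against '&#9881;' fails while one that compares against '\u2699' passes.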

To match the browser behavior (because you don't know how it is encoded in HTML or in text), I would try the following:
.toContain($("<span>&#9881;</span>").text()) instead of .toContain('&#9881;').
That way it should match how it is stored in the DOM.
The String.fromCharCode(9881) mentioned by Yuriy Galanter will definitely also work reliably. But because the DOM engine and the JS engine are two different parts that could behave differently, I would test with both techniques.

Related

javascript generating invalid HTML5 attributes in Firefox

I am noticing some very strange behavior in firefox and I'm wondering if anyone has a strategy for how to normalize or work around this behavior.
Specifically if you provide firefox a basic anchor containing html entities it will unescape those entities, fail to re-escape them and hand you back invalid html.
For example firefox mishandles the following url:
My Original Link
If this url is parsed by firefox it will unescape the ><" and start handling a url like:
My Original Link
This same operation appears to work fine elsewhere, even safari and edge.
I tried quite a few different ways of handing the html to firefox to avoid this problem. Tried manually invoking the parser, tried setting innerHTML, tried jQuery html(), tried giving jQuery constructor a giant string, etc. All methods produced the same broken result.
See a fiddle here:
https://jsfiddle.net/kamelkev/hfd2b6sn/
I am a little mystified by how broken this handling seems to be. There must be a way to work around this issue, but I can't seem to find a way.
My application is an html manipulation tool, so I typically normalize around issues like this by dropping down to XML and handling the problems there before persisting to a dumb key-value store, but in this particular case the <> characters are preventing me from processing this document as XML.
Ideas?

A < or a > is valid inside of an attribute value, unescaped. It's not best practice, but it is valid.
What's happening is that Firefox is parsing the original HTML and making elements out of it. At that point, the original HTML no longer exists. When you call .outerHTML, the HTML is reconstructed from the element.
Firefox then generates it using a different set of rules than Chrome does.
It isn't clear what exactly you need to do this for... really you should edit the DOM and export the HTML for the whole DOM when done. Constantly re-interpreting HTML isn't necessary.

The > and < are unescaped when the parser parses the source to construct the DOM. When you serialize an element back to a string, you are not guaranteed to obtain the same text as the source.
In this case, innerHTML and outerHTML use the HTML fragment serialization algorithm, which escapes attribute values using attribute mode:
Escaping a string (for the purposes of the algorithm above) consists
of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any
occurrences of the ">" character by the string "&gt;".
That's why " is escaped to &quot;, but < and > remain.
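The escaping steps quoted above can be sketched as a small pure function. This is only an illustration of those steps as quoted, not the browser's actual implementation; the function name escapeString and the sample value are hypothetical.

```javascript
// Sketch of the quoted escaping steps. `attributeMode` selects which
// characters are escaped in addition to "&" and U+00A0.
function escapeString(text, attributeMode) {
  let out = text
    .replace(/&/g, '&amp;')        // step 1: ampersands first
    .replace(/\u00A0/g, '&nbsp;'); // step 2: no-break spaces
  if (attributeMode) {
    out = out.replace(/"/g, '&quot;'); // step 3: quotes, attribute mode only
  } else {
    out = out.replace(/</g, '&lt;').replace(/>/g, '&gt;'); // step 4: text mode only
  }
  return out;
}

const value = 'a.html?x=<1>&y="2"';
console.log(escapeString(value, true));  // a.html?x=<1>&amp;y=&quot;2&quot;
console.log(escapeString(value, false)); // a.html?x=&lt;1&gt;&amp;y="2"
```

In attribute mode the angle brackets pass through untouched, which matches the serialized output described in the question.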
This is OK, because < and > are allowed in HTML double-quoted attribute values:
U+0022 QUOTATION MARK ("): Switch to the after attribute value (quoted) state.
U+0026 AMPERSAND (&): Switch to the character reference in attribute value state [...]
U+0000 NULL: Parse error [...]
EOF: Parse error [...]
Anything else: Append the current input character to the current attribute's value.
However, XML does not allow < and > in attribute values. If you want to get valid XHTML, use an XML serializer:
var s = new XMLSerializer();
var str = s.serializeToString(document.querySelector('a'));
console.log(str);
My Original Link

JavaScript:output symbols and special characters

I am trying to include some symbols into a div using JavaScript.
It should look like this:
x ∈ ℝ
, but all I get is: x &isin; &reals;.
var div=document.getElementById("text");
var textnode = document.createTextNode("x &isin; &reals;");
div.appendChild(textnode);
<div id="text"></div>
I tried document.getElementById("something").innerHTML="x &isin; &reals;" and it worked, so I have no clue why the createTextNode method did not.
What should I do in order to output the right thing?

You are including HTML escapes ("entities") in what needs to be text. According to the docs for createTextNode:
data is a string containing the data to be put in the text node
That's it. It's the data to be put in the text node. The DOM spec is just as clear:
Creates a Text node given the specified string.
You want to include Unicode in this string. To include Unicode in a JavaScript string, use Unicode escapes, in the format \uXXXX.
var textnode = document.createTextNode("x \u2208 \u211D");
Or, you could simply include the actual Unicode character and avoid all the trouble:
var textnode = document.createTextNode("x ∈ ℝ");
In this case, just make sure that the JS file is served as UTF-8, you are saving the file as UTF-8, etc.
The reason that setting .innerHTML works with HTML entities is that it sets the content as HTML, meaning it interprets it as HTML, in all regards, including markup, special entities, etc. It may be easier to understand this if you consider the difference between the following:
document.createTextNode("<div>foo</div>");
document.createElement("div").textContent = "<div>foo</div>";
document.createElement("div").innerHTML = "<div>foo</div>";
The first creates a text node with the literal characters "<div>foo</div>". The second sets the content of the new element literally to "<div>foo</div>". The third, on the other hand, creates an actual div element inside the new element containing the text "foo".

Every character has a hexadecimal code (for example 211D). If you want to write it as an HTML character reference, prefix it with &#x and end it with a semicolon => &#x211D;, or use the entity name &reals;, or the decimal code &#8477;. They can all be found here: http://www.w3schools.com/charsets/ref_html_entities_4.asp
But when you use JavaScript, a Unicode escape is required for the engine to understand that you want a Unicode symbol and not a literal string of characters. To do that, add \u before the four-digit hexadecimal code => \u211D.
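As a rough illustration of how these notations relate, the following sketch derives each of them from a character. The helper name toReferences is hypothetical, and the \uXXXX form it prints only applies to code points up to U+FFFF.

```javascript
// Derive the HTML reference forms and the JavaScript escape for one character.
function toReferences(ch) {
  const codePoint = ch.codePointAt(0);
  const hex = codePoint.toString(16).toUpperCase();
  return {
    hexReference: '&#x' + hex + ';',        // for HTML source
    decimalReference: '&#' + codePoint + ';', // for HTML source
    jsEscape: '\\u' + hex.padStart(4, '0'), // for JS string literals (BMP only)
  };
}

const refs = toReferences('\u211D'); // the double-struck R
console.log(refs.hexReference);     // &#x211D;
console.log(refs.decimalReference); // &#8477;
console.log(refs.jsEscape);         // \u211D
```

The two reference forms are only meaningful to an HTML parser (for example via innerHTML), while the \u211D form is resolved by the JavaScript parser and so also works with createTextNode.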

document.createTextNode will automatically html-escape the needed characters. You have to provide those texts as JavaScript strings, either escaped or not:
document.body.appendChild(document.createTextNode("x ∈ ℝ"));
document.body.appendChild(document.createElement("br"));
document.body.appendChild(document.createTextNode("x \u2208 \u211d"));
EDIT: It's not true that the createTextNode function does any actual HTML escaping here, as it doesn't need to. @deceze gave a very good explanation of the connection between the DOM and HTML: HTML is a textual representation of the DOM, so you don't need any HTML-related escaping when manipulating the DOM directly.

Preserving attributes without value when manipulating with JQuery

The crux of my problem comes down to this issue:
$('<video allowfullscreen></video>').prop('outerHTML') === '<video allowfullscreen></video>' //Is False
$('<video allowfullscreen></video>').prop('outerHTML') === '<video allowfullscreen=""></video>' //Is True
The input I'm giving to jQuery gets partially mangled and transformed in an unwanted way.
My goal is that I have (trusted) html coming in that I want to modify by adding some attributes and wrapping it in other elements before converting it back to a String and passing it to the user as text they can copy.
So an expected output might be something like:
<div><video class="myClass" allowfullscreen></video></div>
Since the input html is coming from elsewhere I'd like to make as few assumptions about it as possible. So ideally I don't want to take the string and parse over it to fix specific attributes or remove instances of ="" (in case there's a reason at some point to specifically set a property to "").
Even if I don't care about having a value set on these properties the correct value would be allowfullscreen="allowfullscreen" anyways. I don't have control over the html coming in so I need to take it as-is. So I can't simply 'fix' the html to pass along something like allowfullscreen="allowfullscreen".
Are there any options or ways to preserve valueless properties when I go from string->jQuery->string?
I'm even open to other technology suggestions that would be better suited to this sort of DOM manipulation, but jQuery would otherwise be ideal because of how concise its syntax is. Vanilla Javascript can do it properly, but the syntax makes the code more brittle which I would like to avoid.

See HTML5 - 8.1.2.3 Attributes
8.1.2.3 Attributes
Attributes for an element are expressed inside the element's start
tag.
Attributes have a name and a value. Attribute names must consist of
one or more characters other than the space characters, U+0000 NULL,
U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/"
(U+002F), and "=" (U+003D) characters, the control characters, and any
characters that are not defined by Unicode. In the HTML syntax,
attribute names, even those for foreign elements, may be written with
any mix of lower- and uppercase letters that are an ASCII
case-insensitive match for the attribute's name.
Attribute values are a mixture of text and character references,
except with the additional restriction that the text cannot contain an
ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax
Just the attribute name. The value is implicitly the empty string.

Must I escape strings before I set them as the value of a textarea?

In the following scenario:
var evil_string = "...";
$('#mytextarea').val(evil_string);
Do I have to escape an untrusted string before using it as the value of a textarea element?
I understand that I will have to handle the string with care if I want to do anything with it later on, but is the act of putting the string in a textarea without escaping inherently dangerous?
I have done some basic testing and the usual special characters &'"< seem to be successfully added to the textarea without interpretation.

No, you don't need to do that. When you assign directly to a property of a DOM element (which jQuery's .val does under the hood), the data is interpreted verbatim. You only need to escape text passed to methods that explicitly treat their input as HTML, i.e. outerHTML/innerHTML and the like.

Putting unescaped strings as values of textboxes or textareas is fine. You only need to worry about it when you are putting strings in your HTML that could potentially be interpreted as other HTML. Generally speaking, this means you should escape the strings when the text could be a child of some HTML DOM Element. This could be done on the server (as lolka_bolka suggested), or on the client before adding the potentially dangerous string to the DOM.
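For the case where a string does end up being interpreted as HTML, a minimal client-side escaping helper might look like the sketch below. The name escapeHtml is hypothetical and this is not a complete sanitizer; it only covers text placed inside element content or quoted attribute values.

```javascript
// Replace the characters that an HTML parser would treat as markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const evilString = '<img src=x onerror="alert(1)">';
console.log(escapeHtml(evilString));
// &lt;img src=x onerror=&quot;alert(1)&quot;&gt;
```

Nothing like this is needed for $('#mytextarea').val(evilString), since that path never parses the string as HTML.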

Special characters not displaying correctly in a Javascript string

I have a function that assigns a string containing special characters to a variable, then passes that variable to a DOM element via the innerHTML property, but it prints strange characters. Let's say I code this...
someText = "äêíøù";
document.getElementById("someElement").innerHTML = someText;
It prints the following text...
Ã¤ÃªÃÃ¸Ã¹
I know how to use the entity names to prevent this, but when I use them to pass the value through a Javascript method, they print literally.

This means that you have a conflict of encodings. Your JavaScript and your HTML are being served to the browser with different encodings/character sets. Ensure that they're saved in and served with the same encoding / character set (UTF-8 is a good choice) to make sure that characters are correctly interpreted.
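The garbled output in the question can be reproduced without a browser. The sketch below assumes the file was saved as UTF-8 but read as a single-byte Latin-1 style encoding, which is one common way to get this result; it is not a diagnosis of the asker's actual server setup.

```javascript
// The intended text, written with escapes so this file's own encoding cannot interfere.
const original = '\u00E4\u00EA\u00ED\u00F8\u00F9'; // "äêíøù"

// Encode as UTF-8: each of these characters becomes two bytes.
const utf8Bytes = new TextEncoder().encode(original);
console.log(utf8Bytes.length); // 10

// Misread: treat every byte as its own character (Latin-1 style decoding).
const misread = Array.from(utf8Bytes, (byte) => String.fromCharCode(byte)).join('');
console.log(misread.length); // 10 characters instead of 5

// Correct read: decode the same bytes as UTF-8.
const decoded = new TextDecoder('utf-8').decode(utf8Bytes);
console.log(decoded === original); // true
```

The misread string starts with "Ã¤Ãª", the same pattern as in the question; the third pair contains an invisible soft hyphen, which is why only one visible character appears there.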
Obligatory link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
