Detect whether HTML element contains a specific character entity

If I have markup like this:
<div id="foo">&#xf067;</div>
and I want to detect later whether div#foo still contains that same character entity, I'd like to be able to do so by comparing it to &#xf067; rather than to the literal character it decodes to (which in my code base is rather obtuse for maintenance purposes).
I've tried things like this (using jQuery):
console.log($('<textarea />').html($('#foo').html()).val());
But that still seems to output the nice little square "what you talkin' 'bout" character rather than the desired &#xf067;.
I'm open to plain JavaScript or jQuery-specific solutions.

You can use a Unicode entity in JavaScript. For example:
(HTML: <div id='foo'>&#xf067;</div>)
JavaScript:
console.log($('#foo').html().charCodeAt(0).toString(16));
//=> f067
console.log($('#foo').html().indexOf('\uf067'));
//=> 0
Here's a JSFiddle.
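The same check can be wrapped in a small helper that takes the hexadecimal code from the entity. This is only a sketch under assumptions: containsCodePoint is a hypothetical name, not a jQuery or DOM API, and the content string stands in for whatever $('#foo').html() returns.

```javascript
// Hypothetical helper: does `text` contain the character that the numeric
// entity &#x<hex>; decodes to?
function containsCodePoint(text, hex) {
  const ch = String.fromCodePoint(parseInt(hex, 16));
  return text.includes(ch);
}

// Stand-in for the decoded content of div#foo (assumption for illustration).
const content = '\uf067';
console.log(containsCodePoint(content, 'f067'));      // true
console.log(containsCodePoint('plain text', 'f067')); // false
```

This keeps the readable hex code ("f067") in the source instead of an invisible private-use character.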

Related

How to use 'toHaveClass' to match a substring in Jest?

There's an element to which I'm giving a class success. But React-Mui rewrites the class name in the DOM to something like mui-AbcXYZ-success. So when I'm doing this
expect(getByTestId('thirdCheck')).toHaveClass("success")
the test doesn't pass, because it expects the complete name mui-AbcXYZ-success. It only passes when I provide the exact name (with the hyphenated text and random identifier that MUI adds).
How do I test that?
I tried the following without result:
expect(getByTestId('thirdCheck')).toHaveClass(/success/)
I also tried applying .className or .classList but that doesn’t give me the list of classes on the element.

After hours of frustration, this is how I did it. If anyone has a better solution, feel free to post it and I will accept it.
let classes = getByTestId('thirdCheck').getAttribute('class'); //returns something like "Mui-root mui-AbcXYZ-success"
classes=classes.split(' ')[1].split('-'); //first split on spaces, then split the second class name on hyphens
expect(classes.includes('success')).toBe(true); //a matcher is required, otherwise nothing is asserted
It looks a bit verbose for such a trivial-looking thing, but that's how I did it.
UPDATE:
@Fyodor has a great point in the comments.
We can simply do this as follows:
expect(getByTestId('thirdCheck').getAttribute('class')).toMatch(/success/gi)

With CSS-in-JS, classNames will sometimes get appended with generic identifiers. So it is nice to have a quick way of verifying your custom class is on an element. This is a loose one line approach for verifying the substring of a custom class without regex.
Note: It does not ignore casing. If you need that, then the answer with toMatch is a better solution for you.
expect(yourSelectedElement.getAttribute("class")).toContain("yourClassSubstring");
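If you would rather check each class name separately instead of the whole attribute string, a small helper can do it. This is a sketch, not part of Jest or Testing Library: hasClassContaining is a hypothetical name, and the class string is the example value from the question.

```javascript
// Hypothetical helper: true if any single class name in a class attribute
// contains the given substring (case-sensitive).
function hasClassContaining(classAttr, substring) {
  return (classAttr || '')
    .split(/\s+/)
    .filter(Boolean)
    .some((name) => name.includes(substring));
}

const classAttr = 'Mui-root mui-AbcXYZ-success';
console.log(hasClassContaining(classAttr, 'success')); // true
console.log(hasClassContaining(classAttr, 'error'));   // false
```

In a test this could be used as expect(hasClassContaining(element.getAttribute('class'), 'success')).toBe(true). Unlike toContain on the full string, it never matches across the space between two class names.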

Hyperlink href incorrectly quoted in innerHTML?

Take this very simple example HTML:
<html>
<body>This is okay &amp; fine, but the encoding of <a href="http://example.com?a=1&b=2">this link</a> seems wrong.</body>
</html>
On examining document.body.innerHTML (e.g. in the browser's JS console, in JS itself, etc.), this is the value I see:
This is okay &amp; fine, but the encoding of <a href="http://example.com?a=1&amp;b=2">this link</a> seems wrong.
This behaviour is the same across browsers, but I can't understand it; it seems wrong.
Specifically, the link in the original document is to http://example.com?a=1&b=2, whereas if the value of innerHTML is treated as HTML then it links to http://example.com?a=1&amp;b=2, which is NOT the same (e.g. if I created a new document which actually had innerHTML as its inner HTML, and I clicked on the link, then the browser would be sent to a materially different URL as far as I can see).
(EDIT #3: I'm wrong about the above. Firstly, yes, those two URLs are different; but secondly, the innerHTML which I thought was wrong is right, and it correctly represents the first URL, not the second! See the end of my own answer below.)
This is different from the issue discussed in the question "innerHTML gives me & as &amp; !". In my case (which is the opposite to the case in that question) the original HTML is correct and it looks to me as if it is the innerHTML which is wrong (i.e. because it is HTML which does not represent what the original HTML represented).
(EDIT #2: I was wrong about this, too: it's not really different. But I think it is not widely known that &amp; is the correct way to represent & inside an href, not just within body text. Once you realise that, then you can see that these are the same issue really.)
Can anyone explain this?
(EDIT #1+4: This only occurred to me a bit late, after writing my original question, but: "is &amp; actually correct within the href text, and & technically incorrect?" As I said when I first wrote those words, that "seems very unlikely! I've certainly never seen HTML written that way." But however 'unlikely', or not, that is the case, and is the main part of what I wasn't understanding!)
Also related, and useful to know: can anyone explain how to cleanly get HTML which does correctly represent the target of document links? You definitely can't just un-encode all HTML character references within innerHTML, because (as shown in the example I've used, and also as discussed in "innerHTML gives me & as &amp; !") the ones in the main run of text should be encoded, and just un-encoding everything would make these wrong.
I originally thought this was not a duplicate of "innerHTML gives me & as &amp; !" (as discussed above; and in a way it still isn't, if it's agreed that it's not as obvious or widely known that the same issues apply inside href as in body text). It's still definitely not a duplicate of "A href in innerHTML" (which somewhat unclearly asks about how to set innerHTML using JS).

Most browser tools don't show the actual HTML because it wouldn't be of much help:
HTML is often generated dynamically after page load with the help of CSS and JavaScript.
HTML is often broken and the browser needs to repair it in order to generate the memory representation needed for rendering and other stuff.
So the HTML you see is not the actual source; it is generated on the fly from the current state of the document, which of course includes all the fixes applied (in your case, the invalid HTML entities).
The following example hopefully illustrates all the combinations:
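const section = document.querySelector("section");
const invalid = document.createElement("p");
// Bare & inside the href: invalid markup that the parser repairs.
invalid.innerHTML = '<a href="http://example.com/?a=1&b=2">Invalid HTML (dynamic)</a>';
const valid = document.createElement("p");
// & written as &amp;: valid markup for the same URL.
valid.innerHTML = '<a href="http://example.com/?a=1&amp;b=2">Valid HTML (dynamic)</a>';
section.appendChild(valid);
section.appendChild(invalid);
const paragraphs = document.querySelectorAll("p");
for (const p of paragraphs) {
  console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (const a of links) {
  console.log(a.getAttribute("href"));
}
<section>
<p><a href="http://example.com/?a=1&b=2">Invalid HTML (static)</a></p>
<p><a href="http://example.com/?a=1&amp;b=2">Valid HTML (static)</a></p>
</section>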
Is &amp; actually correct within the href text, and & technically incorrect? It seems very unlikely! I've certainly never seen HTML written that way.
There's no such thing as "technically correct", let alone today when HTML is pretty well standardised. (Well, yes, there're two competing standards bodies and specs are continuously evolving, but the basics were set up long ago.)
The & symbol starts a character entity and &b is an invalid character entity. Period.
But it works! Doesn't that mean it's technically correct?
It works because browsers are explicitly designed to deal with completely broken markup, what's known as tag soup, because it was thought that it would ease usage:
<p><strong>Hello, World!</u>
<body><br itspartytime="yeah">
<pink>It works!!!</red>
But HTML entities are just an encoding artefact. That doesn't mean that URLs are not allowed to contain literal ampersands, it just means that, when in HTML context, they need to be represented as &amp;. It's the same as when you type a backslash in a JavaScript string to escape some quotes: the backslash does not become part of your data.
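To illustrate the "encoding artefact" point, here is a minimal sketch of encoding a URL before placing it in an href. escapeAttribute is a hypothetical helper written for this example, not a built-in, and the URL is the example one from this thread.

```javascript
// Hypothetical helper: encode the characters that are special inside a
// double-quoted HTML attribute value. & must be replaced first so the
// ampersands of the other entities are not encoded twice.
function escapeAttribute(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

const url = 'http://example.com/?a=1&b=2';
const markup = '<a href="' + escapeAttribute(url) + '">link</a>';
console.log(markup);
// <a href="http://example.com/?a=1&amp;b=2">link</a>
```

The URL itself still contains a plain &; only its HTML representation uses &amp;, which is what innerHTML reports.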

Having thought up a possible (but I thought 'unlikely') explanation - which I put in as an edit in the original question - I've realised that it is the answer:
Using & to represent & inside an href is technically incorrect, and &amp; is technically correct
I gathered this initially from this SO answer https://stackoverflow.com/a/16168585/795690, and I think it is relevant that (as it also says in that answer) the idea that &amp; is the correct way to represent & in an href is not as widely understood as the idea that &amp; is the correct way to represent & in body text.
Once you do understand this, it makes sense that what the browser is doing is right, and that the innerHTML value which comes back represents the link correctly.
EDIT:
@ÁlvaroGonzález gives a much longer answer, and it took me a while to see how everything he says applies, so I thought I'd try to explain what I didn't understand starting from where I started from, in case it helps someone else!
If you start with raw HTML with <a href="http://example.com/?a=1&b=1"> and then you inspect the DOM in the browser, or look at the value of the href attribute in JS then you see "http://example.com/?a=1&b=1" everywhere. So it looks as if nothing has changed, and nothing was wrong. What I didn't understand is that actually the browser has parsed a technically incorrect href (with invalid entities) to be able to display this to you! (Yes, LOTS of people use this 'broken' format!)
To see this first hand, load this longer HTML example into your browser:
<html>
<body style="font-family: sans-serif">
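<p>Now & then <a href="http://example.com/?a=1&b=2">http://example.com/?a=1&b=2</a></p>
<p>Now &amp; then <a href="http://example.com/?a=1&amp;b=2">http://example.com/?a=1&amp;b=2</a></p>
<p>Now &amp;amp; then <a href="http://example.com/?a=1&amp;amp;b=2">http://example.com/?a=1&amp;amp;b=2</a></p>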
</body>
</html>
then in your JavaScript console try running this code taken from @ÁlvaroGonzález's answer:
const paragraphs = document.querySelectorAll("p");
for (const p of paragraphs) {
  console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (const a of links) {
  console.log(a.getAttribute("href"));
}
Also try clicking on the links to see where they go.
Once you've made sense of everything that you see there, it is no longer surprising how innerHTML works!

Convert textNode content to a string

I'm having a problem with a text node that I can't convert to a string.
I'm trying to scrape a site and extract certain information from it, and when I use an XPath to find the text I'm after I get a text node back.
When I look in Chrome's developer tools, I can see that the text node itself contains the text I'm after, but how do I convert the text node to plain text?
Here is the line of code I use:
abstracts = ZU.xpath(doc, '//*[@id="abstract"]/div/div/par/text()');
I have tried things like .innerHTML, toString and textContent, but nothing has worked so far.

I usually use Text.wholeText when I want the content of a text node as a string. A text node is an object rather than a string, so toString or innerHTML will not give you its text.
Example: from https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText
The Text.wholeText read-only property returns the full text of all Text nodes logically adjacent to the node. The text is concatenated in document order. This allows to specify any text node and obtain all adjacent text as a single string.
Syntax
str = textnode.wholeText;
Notes and example:
Suppose you have the following simple paragraph within your webpage (with some whitespace added to aid formatting throughout the code samples here), whose DOM node is stored in the variable para:
<p>Thru-hiking is great! <strong>No insipid election coverage!</strong>
However, <a href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You decide you don’t like the middle sentence, so you remove it:
para.removeChild(para.childNodes[1]);
Later, you decide to rephrase things to, “Thru-hiking is great, but casting a ballot is tricky.” while preserving the hyperlink. So you try this:
para.firstChild.data = "Thru-hiking is great, but ";
All set, right? Wrong! What happened was you removed the strong element, but the removed sentence’s element separated two text nodes. One for the first sentence, and one for the first word of the last. Instead, you now effectively have this:
<p>Thru-hiking is great, but However, <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You’d really prefer to treat all those adjacent text nodes as a single one. That’s where wholeText comes in: if you have multiple adjacent text nodes, you can access the contents of all of them using wholeText. Let’s pretend you never made that last mistake. In that case, we have:
assert(para.firstChild.wholeText == "Thru-hiking is great! However, ");
wholeText is just a property of text nodes that returns the string of data making up all the adjacent (i.e. not separated by an element boundary) text nodes combined.
Now let’s return to our original problem. What we want is to be able to replace the whole text with new text. That’s where replaceWholeText() comes in:
para.firstChild.replaceWholeText("Thru-hiking is great, but ");
We’re removing every adjacent text node (all the ones that constituted the whole text) but the one on which replaceWholeText() is called, and we’re changing the remaining one to the new text. What we have now is this:
<p>Thru-hiking is great, but <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
Some uses of the whole-text functionality may be better served by using Node.textContent, or the longstanding Element.innerHTML; that’s fine and probably clearer in most circumstances. If you have to work with mixed content within an element, as seen here, wholeText and replaceWholeText() may be useful.
More info: https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText

... (three dots) in jQuery?

I was looking at the documentation page for jScroll plugin for jQuery (http://demos.flesler.com/jquery/scrollTo) and I noticed this :
$(...).scrollTo( $('ul').get(2).childNodes[20], 800 );
So, what do the three dots in jQuery mean? I have never seen this selector before.
EDIT :
DOM Element
This is from the source HTML. Viewing the source for the following links :
Relative selector
jQuery object
DOM Element
Absolute number
Absolute
all give the same implementation.
EDIT: I didn't look at the attribute closely; it's the title attribute. I assumed it was the href attribute. I feel silly asking this question now :) Thanks for the answers.

I am fairly certain that he was using that as an example.
$( ... ) would be akin to $( your-selector-here ).
In other words, I have never seen any implementation of that.

Typically ... is used in various docs to shorten the example, and it means that you put something in place of the dots, or that what you would put there was omitted (to shorten the example)
It's not actually valid JS syntax.

It has no meaning. They meant just write your own selector.
Check out the source code:
$('div.pane').scrollTo( 0 );

They are not syntactically correct. They are just the author's way of saying "scroll to some element whose name I won't bother to write here, so I just write dots". Check the source code of the page if in doubt.

Three dots in JavaScript is spread syntax; see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax
It allows an iterable such as an array expression or string to be expanded in places where zero or more arguments (for function calls) or elements (for array literals) are expected.
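For comparison with the placeholder dots in the documentation, here is a short sketch of what real spread syntax looks like. The variable names are made up for the example. Note that the dots must be followed by an expression, so $(...) on its own is still not valid JavaScript.

```javascript
const parts = [1, 2, 3];

// Spread into a function call: same as Math.max(1, 2, 3).
console.log(Math.max(...parts)); // 3

// Spread into an array literal.
const extended = [0, ...parts, 4];
console.log(extended.join(',')); // 0,1,2,3,4

// Spread a string into its characters.
const letters = [...'abc'];
console.log(letters.join('|')); // a|b|c
```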

How safe is it use document.body.innerHTML.replace?

Is running something like:
document.body.innerHTML = document.body.innerHTML.replace('old value', 'new value')
dangerous?
I'm worried that some browsers might break the whole page. This JS code will be placed on sites outside my control, which might be visited by who knows what browsers.
My goal is only to look for an occurrence of a string in the whole body and replace it.

Definitely potentially dangerous - particularly if your HTML code is complex, or if it's someone else's HTML code (i.e. it's a CMS or you're creating reusable JavaScript). Also, it will destroy any event listeners you have set on elements on the page.
Find the text-node with XPath, and then do a replace on it directly.
Something like this (not tested at all):
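var i=0, ii, matches=xpath('//*[contains(text(),"old value")]/text()');
ii=matches.snapshotLength||matches.length;
for(;i<ii;++i){
  // XPathResult has snapshotItem(); the selectNodes() fallback is indexed directly
  var el=matches.snapshotItem?matches.snapshotItem(i):matches[i];
  // assign back to the text node: replace() alone only returns a new string
  el.nodeValue=el.nodeValue.replace('old value','new value');
}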
Where xpath() is a custom cross-browser xpath function along the lines of:
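function xpath(str){
  if(document.evaluate){
    // standards-based browsers: return an unordered node snapshot (numeric value 6)
    return document.evaluate(str,document,null,XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,null);
  }else{
    // legacy fallback for browsers without document.evaluate
    return document.selectNodes(str);
  }
}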
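If XPath is not available or feels heavy, the same idea can be sketched as a plain recursive walk over text nodes. This is an illustration rather than a tested drop-in: replaceInTextNodes is a hypothetical name, and the small object tree below only imitates DOM nodes so the snippet runs outside a browser.

```javascript
// Numeric node type of text nodes (Node.TEXT_NODE in browsers).
const TEXT_NODE = 3;

// Hypothetical helper: replace every occurrence of `search` in the text
// nodes under `node`, leaving elements, attributes and listeners untouched.
function replaceInTextNodes(node, search, replacement) {
  if (node.nodeType === TEXT_NODE) {
    node.data = node.data.split(search).join(replacement);
    return;
  }
  // Copy childNodes first; only text data changes, never the tree itself.
  for (const child of Array.from(node.childNodes || [])) {
    replaceInTextNodes(child, search, replacement);
  }
}

// Plain objects standing in for a tiny DOM tree (assumption for the demo).
const tree = {
  nodeType: 1,
  childNodes: [
    { nodeType: TEXT_NODE, data: 'old value here' },
    { nodeType: 1, childNodes: [{ nodeType: TEXT_NODE, data: 'again: old value' }] },
  ],
};
replaceInTextNodes(tree, 'old value', 'new value');
console.log(tree.childNodes[0].data);               // new value here
console.log(tree.childNodes[1].childNodes[0].data); // again: new value
```

In a page you would call replaceInTextNodes(document.body, 'old value', 'new value'). Be aware that this also visits the text inside script and style elements, and that it cannot find a string that is split across several elements.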

I agree with lucideer, you should find the node containing the text you're looking for, and then do a replace. JS frameworks make this very easy. jQuery for example has the powerful :contains('your text') selector
http://api.jquery.com/contains-selector/

If you want a rock-solid solution, you should iterate over the DOM and find the value to replace that way.
However, if 'old value' is a long string that could never be mixed up with a tag, attribute or attribute value, you are relatively safe just doing the replace.
