Hyperlink href incorrectly quoted in innerHTML? - javascript

Take this very simple example HTML:
<html>
<body>This is okay & fine, but the encoding of this link seems wrong.</body>
<html>
On examining document.body.innerHTML (e.g. in the browser's JS console, in JS itself, etc.), this is the value I see:
This is okay & fine, but the encoding of this link seems wrong.
This behaviour is the same across browsers but I can't understand it, it seems wrong.
Specifically, the link in the orginal document is to http://example.com?a=1&b=2, whereas if the value of innerHTML is treated as HTML then it links to http://example.com?a=1&b=2 which is NOT the same (e.g. If I created a new document, which actually had innerHTML as its inner HTML, and I clicked on the link then the browser would be sent to a materially different URL as far as I can see).
(EDIT #3: I'm wrong about the above. Firstly, yes, those two URLs are different; but secondly, the innerHTML which I thought was wrong is right, and it correctly represents the first URL, not the second! See the end of my own answer below.)
This is different from the issue discussed in question innerHTML gives me & as & !. In my case (which is the opposite to the case in that question) the original HTML is correct and it looks to me as if it is the innerHTML which is wrong (i.e. because it is HTML which does not represent what the original HTML represented).
(EDIT #2: I was wrong about this, too: it's not really different. But I think it is not widely known that & is the correct way to represent & inside an href, not just within body text. Once you realise that, then you can see that these are the same issue really.)
Can anyone explain this?
(EDIT #1+4: This only occurred to me a bit late, after writing my original question, but: "is & actually correct within the href text, and & technically incorrect?" As I said when I first wrote those words, that "seems very unlikely! I've certainly never seen HTML written that way." But however 'unlikely', or not, that is the case, and is the main part of what I wasn't understanding!)
Also related and would be useful, can anyone explain how to cleanly get HTML which does correctly represent the target of document links? You definitely can't just un-encode all HTML character references within innerHTML, because (as shown in the example I've used, and also as discussed in innerHTML gives me & as & !) the ones in the main run of text should be encoded, and just un-encoding everything would make these wrong.
I originally thought this was not a duplicate of innerHTML gives me & as & ! (as discussed above; and in a way it still isn't, if it's agreed that it's not as obvious or widely known that the same issues apply inside href as in body text). It's still definitely not a duplicate of A href in innerHTML (which somehwat unclearly asks about how to set innerHTML using JS).

Most browser tools don't show the actual HTML because it wouldn't be of much help:
HTML is often generated dynamically after page load with the help of CSS and JavaScript.
HTML is often broken and the browser needs to repair it in order to generate the memory representation needed for rendering and other stuff.
So the HTML you see is not the actual source but it's generated on the fly from the current status of the document, which of course includes all the fixed applied (in your case, the invalid HTML entities).
The following example hopefully illustrates all the combinations:
const section = document.querySelector("section");
const invalid = document.createElement("p");
invalid.innerHTML = 'Invalid HTML (dynamic)';
const valid = document.createElement("p");
valid.innerHTML = 'Valid HTML (dynamic)';
section.appendChild(valid);
section.appendChild(invalid);
const paragraphs = document.querySelectorAll("p");
for (p of paragraphs) {
console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (a of links) {
console.log(a.getAttribute("href"));
}
<section>
<p>Invalid HTML (static)</p>
<p>Valid HTML (static)</p>
<section>
Is & actually correct within the href text, and & technically incorrect? It seems very unlikely! I've certainly never seen HTML written that way.
There's no such thing as "technically correct", let alone today when HTML is pretty well standardised. (Well, yes, there're two competing standards bodies and specs are continuously evolving, but the basics were set up long ago.)
The & symbol starts a character entity and &b is an invalid character entity. Period.
But it works! Doesn't that mean it's technically correct?
It works because browsers are explicitly designed to deal with completely broken markup, what's known as tag soup, because it was thought that it would ease usage:
<p><strong>Hello, World!</u>
<body><br itspartytime="yeah">
<pink>It works!!!</red>
But HTML entities are just an encoding artefact. That doesn't mean that URLs are not allowed to contain literal ampersands, it just means that —when in HTML context— they need to be represented as &. It's the same as when you type a backslash in a JavaScript string to escape some quotes: the backslash does not become part of your data.

Having thought up a possible (but I thought 'unlikely') explanation - which I put in as an edit in the original question - I've realised that it is the answer:
Using & to represent & inside an href is technically incorrect, and & is technically correct
I gathered this initially from this SO answer https://stackoverflow.com/a/16168585/795690, and I think it is relevant that (as it also says in that answer) the idea that & is the correct way to represent & in an href is not as widely understood as the idea that & is the correct way to represent & in body text.
Once you do understand this, it makes sense that what the browser is doing is right, and that the innerHTML value which comes back represents the link correctly.
EDIT:
#ÁlvaroGonzález gives a much longer answer, and it took me a while to see how everything he says applies, so I thought I'd try to explain what I didn't understand starting from where I started from, in case it helps someone else!
If you start with raw HTML with <a href="http://example.com/?a=1&b=1"> and then you inspect the DOM in the browser, or look at the value of the href attribute in JS then you see "http://example.com/?a=1&b=1" everywhere. So it looks as if nothing has changed, and nothing was wrong. What I didn't understand is that actually the browser has parsed a technically incorrect href (with invalid entities) to be able to display this to you! (Yes, LOTS of people use this 'broken' format!)
To see this first hand, load this longer HTML example into your browser:
<html>
<body style="font-family: sans-serif">
<p>Now & then http://example.com/?a=1&b=2</p>
<p>Now & then http://example.com/?a=1&b=2</p>
<p>Now &amp; then http://example.com/?a=1&amp;b=2</p>
</body>
</html>
then in your javascript console try running this code taken from #ÁlvaroGonzález's answer:
const paragraphs = document.querySelectorAll("p");
for (p of paragraphs) {
console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (a of links) {
console.log(a.getAttribute("href"));
}
Also try clicking on the links to see where they go.
Once you've made sense of everything that you see there, it is no longer surprising how innerHTML works!

Related

Convert Html to document.write javascript

I've see a web page that when I try to view the source, I only see a single JavaScript statement. I searched and could not figure out how they have done that. you can do and see yourself.
The web page is: Weird Page
when I view the source I only see something like below which also looks like a comment:
<script type='text/javascript' language='Javascript'>
<!--document.write(unescape('%3C%2........................
................
How they have done that? How I can change my web page so when someone view the source it look like that?
This article may help to clarify what's happening there.
Basically, this company has encoded their entire page using standard escape characters. If you copy the bit from unencode(...) all the way to the next to last parens, and paste it into your browser console, you'll see the actual HTML content.
If you look at the source code for the page I linked, you'll see the function that converts normal text to all escape characters:
// CONVERTS *ALL* CHARACTERS INTO ESCAPED VERSIONS.
function escapeTxt(os){
var ns='';
var t;
var chr='';
var cc='';
var tn='';
for(i=0;i<256;i++){
tn=i.toString(16);
if(tn.length<2)tn="0"+tn;
cc+=tn;
chr+=unescape('%'+tn);
}
cc=cc.toUpperCase();
os.replace(String.fromCharCode(13)+'',"%13");
for(q=0;q<os.length;q++){
t=os.substr(q,1);
for(i=0;i<chr.length;i++){
if(t==chr.substr(i,1)){
t=t.replace(chr.substr(i,1),"%"+cc.substr(i*2,2));
i=chr.length;
}
}
ns+=t;
}
return ns;
}
What is happening is pretty much what it sounds like. The server is serving a single <script> element, which contains an HTML comment to make it easier for browser that don't support JS (read more here: Why does Javascript ignore single line HTML Comments?), and in the comment it's writing to the document the unescaped version of the string, which is sent escaped.

HTML entities in CSS content (convert entities to escape-string at runtime)

I know that html-entities like or ö or ð can not be used inside a css like this:
div.test:before {
content:"text with html-entities like ` ` or `ö` or `ð`";
}
There is a good question with good answers dealing with this problem: Adding HTML entities using CSS content
But I am reading the strings that are put into the css-content from a server via AJAX. The JavaScript running at the users client receives text with embedded html-entities and creates style-content from it instead of putting it as a text-element into an html-element's content. This method helps against thieves who try to steal my content via copy&paste. Text that is not part of the html-document (but part of css-content) is really hard to copy. This method works fine. There is only this nasty problem with that html-entities.
So I need to convert html-entities into unicode escape-sequences at runtime. I can do this either on the server with a perl-script or on the client with JavaScript, But I don't want to write a subroutine that contains a complete list of all existing named entities. There are more than 2200 named entities in html5, as listed here: http://www.w3.org/TR/2011/WD-html5-20110113/named-character-references.html And I don't want to change my subroutine every time this list gets changed. (Numeric entities are no problem.)
Is there any trick to perfom this conversion with javascript? Maybe by adding, reading and removing content to the DOM? (I am using jQuery)
I've found a solution:
var text = 'Text that contains html-entities';
var myDiv = document.createElement('div');
$(myDiv).html(text);
text = $(myDiv).text();
$('#id_of_a_style-element').html('#id_of_the_protected_div:before{content:"' + text + '"}');
Writing the Question was half way to get this answer. I hope this answer helps others too.

Adding Javascript variables to HTML elements

So, I have some code that should do four things:
remove the ".mp4" extension from every title
change my video category
put the same description in all of the videos
put the same keywords in all of the videos
Note: All of this would be done on the YouTube upload page. I'm using Greasemonkey in Mozilla Firefox.
I wrote this, but my question is: how do I change the HTML title in the actual HTML page to the new title (which is a Javascript variable)?
This is my code:
function remove_mp4()
{
var title = document.getElementsByName("title").value;
var new_title = title.replace(title.match(".mp4"), "");
}
function add_description()
{
var description = document.getElementsByName("description").value;
var new_description = "Subscribe."
}
function add_keywords()
{
var keywords = document.getElementsByName("keywords").value;
var new_keywords = prompt("Enter keywords.", "");
}
function change_category()
{
var category = document.getElementsByName("category").value;
var new_category = "<option value="27">Education</option>"
}
remove_mp4();
add_description();
add_keywords();
change_category();
Note: If you see any mistakes in the JavaScript code, please let me know.
Note 2: If you wonder why I stored the current HTML values in variables, that's because I think I will have to use them in order to replace HTML values (I may be wrong).
A lot of things have been covered already, but still i would like to remind you that if you are looking for cross browser compatibility innerHTML won't be enough, as you may need innerText too or textContent to tackle some old versions of IE or even using some other way to modify the content of an element.
As a side note innerHTML is considered from a great majority of people as deprecated though some others still use it. (i'm not here to debate about is it good or not to use it but this is just a little remark for you to checkabout)
Regarding remarks, i would suggest minimizing the number of functions you create by creating some more generic versions for editing or adding purposes, eg you could do the following :
/*
* #param $affectedElements the collection of elements to be changed
* #param $attribute here means the attribute to be added to each of those elements
* #param $attributeValue the value of that attribute
*/
function add($affectedElements, $attribute, $attributeValue){
for(int i=0; i<$affectedElements.length; i++){
($affectedElements[i]).setAttribute($attribute, $attributeValue);
}
}
If you use a global function to do the work for you, not only your coce is gonna be easier to maintain but also you'll avoid fetching for elements in the DOM many many times, which will considerably make your script run faster. For example, in your previous code you fetch the DOM for a set of specific elements before you can add a value to them, in other words everytime your function is executed you'll have to go through the whole DOM to retrieve your elements, while if you just fetch your elements once then store in a var and just pass them to a function that's focusing on adding or changing only, you're clearly avoiding some repetitive tasks to be done.
Concerning the last function i think code is still incomplete, but i would suggest you use the built in methods for manipulating HTMLOption stuff, if i remember well, using plain JavaScript you'll find yourself typing this :
var category = document.getElem.... . options[put-index-here];
//JavaScript also lets you create <option> elements with the Option() constructor
Anyway, my point is that you would better use JavaScript's available methods to do the work instead of relying on innerHTML fpr anything you may need, i know innerHTML is the simplest and fastest way to get your work done, but if i can say it's like if you built a whole HTML page using and tags only instead of using various semantic tags that would help make everything clearer.
As a last point for future use, if you're interested by jQuery, this will give you a different way to manipulate your DOM through CSS selectors in a much more advanced way than plain JavaScript can do.
you can check out this link too :
replacement for innerHTML
I assume that your question is only about the title changing, and not about the rest; also, I assume you mean changing all elements in the document that have "title" as name attribute, and not the document title.
In that case, you could indeed use document.getElementsByName("title").
To handle the name="title" elements, you could do:
titleElems=document.getElementsByName("title");
for(i=0;i<titleElems.length;i++){
titleInner=titleElems[i].innerHTML;
titleElems[i].innerHTML=titleInner.replace(titleInner.match(".mp4"), "");
}
For the name="description" element, use this: (assuming there's only one name="description" element on the page, or you want the first one)
document.getElementsByName("description")[0].value="Subscribe.";
I wasn't really sure about the keywords (I haven't got a YouTube page in front of me right now), so this assumes it's a text field/area just like the description:
document.getElementsByName("keywords")[0].value=prompt("Please enter keywords:","");
Again, based on your question which just sets the .value of the category thingy:
document.getElementsByName("description")[0].value="<option value='27'>Education</option>";
At the last one, though, note that I changed the "27" into '27': you can't put double quotes inside a double-quoted string assuming they're handled just like any other character :)
Did this help a little more? :)
Sry, but your question is not quite clear. What exactly is your HTML title that you are referring to?
If it's an element that you wish to modify, use this :
element.setAttribute('title', 'new-title-here');
If you want to modify the window title (shown in the browser tab), you can do the following :
document.title = "the new title";
You've reading elements from .value property, so you should write back it too:
document.getElementsByName("title").value = new_title
If you are refering to changing text content in an element called title try using innerHTML
var title = document.getElementsByName("title").value;
document.getElementsByName("title").innerHTML = title.replace(title.match(".mp4"), "");
source: https://developer.mozilla.org/en-US/docs/DOM/element.innerHTML
The <title> element is an invisible one, it is only displayed indirectly - in the window or tab title. This means that you want to change whatever is displayed in the window/tab title and not the HTML code itself. You can do this by changing the document.title property:
function remove_mp4()
{
document.title = document.title.replace(title.match(".mp4"), "");
}

Problem with regexp in userscript for chrome

This might be a noob question, but I have tried to find an answere here and on other sites and I have still not find the answere. At least not so that I understand enough to fix the problem.
This is used in a userscript for chrome.
I'm trying to select a date from a string. The string is the innerHTML from a tag that I have managed to select. The html structure, and also the string, is something like this: (the div is the selected tag so everything within is the content of the string)
<div id="the_selected_tag">
link
" 2011-02-18 23:02"
thing
</div>
If you have a solution that helps me select the date without this fuzz, it would also be great.
The javascript:
var pattern = /\"\s[\d\s:-]*\"/i;
var tag = document.querySelector('div.the_selected_tag');
var date_str = tag.innerHTML.match(pattern)[0]
When I use this script as ordinary javascript on a html document to test it, it works perfectly, but when I install it as a userscript in chrome, it doesn't find the pattern.
I can't figure out how to get around this problem.
Dump innerHTML into console. If it looks fine then start building regexp from more generic (/\d+/) to more specific ones and output everything into a console. There is a bunch of different quote characters in different encodings, many different types of dashes.
[\d\s:-]* is not a very good choice because it would match " 1", " ". I would rather write something as specific as possible:
/" \d{4}-\d{2}-\d{2} \d{2}:\d{2}"/
(Also document.querySelector('div.the_selected_tag') would return null on your sample but you probably wanted to write class instead of id)
It's much more likely that tag.innerHTML doesn't contain what you think it contains.

This javascript works in every browser EXCEPT for internet explorer!

The webpage is here:
http://develop.macmee.com/testdev/
I'm talking about when you click the ? on the left, it is supposed to open up a box with more content in it. It does that in every browser except IE!
function question()
{
$('.rulesMiddle').load('faq.php?faq=rules_main',function(){//load page into .rulesMiddle
var rulesa = document.getElementById('rulesMiddle').innerHTML;
var rules = rulesa.split('<div class="blockbody">');//split to chop off the top above rules
var rulesT = rules[1].split('<form class="block');//split to chop off below rules
rulesT[0] = rulesT[0].replace('class=','vbclass');//get rid of those nasty vbulletin defined classes
document.getElementById('rulesMiddle').innerHTML = rulesT[0];//readd the content back into the DIV
$('.rulesMain').slideToggle();//display the DIV
$('.rulesMain').center();//center DIV
$('.rulesMain').css('top','20px');//align with top
});
}
IE converts innerHTML contents into upper case, so you probably are not able to split the string this way, as string operations are case sensitive. Check what the contents really looks like by running
alert(rulesa);
Andris is right. And that's not all. It'll also throw away the quotes in attributes.
It is completely unreliable to make any assumptions about the format of the string you get from innerHTML; the browser may output it in a variety of forms — some of which, in IE's case, are not even valid HTML. The chances of you getting back the same string that was originally parsed are very low.
In general: HTML-string-hacking is a shonky waste of time. Modify HTML elements using their node objects instead. You seem to be using jQuery, so you've got loads of utility functions to help you.
In any case you should not be loading the whole HTML page into #rulesMiddle. It includes a load of scripts and stylesheets and other header nonsense that can't go in there. jQuery allows you to pick which part of the document to insert; you seem to just want the first .blockbody element, so pick that:
$('#rulesMiddle').load('faq.php?faq=rules_main .blockbody:first', function(){
$('#rulesMiddle .blockrow').attr('class', '');
$('.rulesMain').slideToggle();
$('.rulesMain').css('top', '20px');
});
My IE debugger throws an error on your script when I click that button. On this line:
var rulesT = rules[1].split('<form class="block');//split to chop off below rules
IE stops processing the Javascript and says '1' is null or not an object
Don't know if you solve it, but it work's on my Ugly IE ... (its an v8)
Btw: It's me, or does pop-up widows wen open are really, really, really slowing down that platform ?

Categories