jQuery: html() function does not match real HTML - javascript

I am trying to get the EXACT html content of a div.
When using the html() function from jQuery, the result does not match the actual content.
Please check this fiddle and click on the black square:
http://jsfiddle.net/qRska/6/
The code:
<div id="mydiv" style="width:100px; height: 100px; background-color:#000000; cursor:pointer;">
<div id="INSIDE" style="background-color:#ffffff; border-style:none;"></div>
</div>
$('#mydiv').click(function() {
alert($(this).html());
});
jQuery changes the color to RGB format and removes the border-style attribute.
How can I solve this problem?

The browser consumes the HTML, generates a DOM, then discards the HTML. innerHTML (which is what .html() eventually hits) gives a serialisation of the DOM back to HTML.
If you want to get the raw HTML, then you'll need to use XMLHttpRequest to fetch the source code of the current URL and then process it yourself.

What you want to do is unfortunately not possible. The original HTML is not available after it is parsed by the browser, so you have to jump through some hoops to prevent the browser from processing it.
One possible solution that I've used before is to wrap the HTML in a comment, which the browser leaves unchanged. You can then read the comment back with jQuery's .html() method (.text() skips comment nodes); strip out the comment delimiters with string replacement; make the necessary changes to the markup; and then inject it back into the document.
The other alternative is to use AJAX to load the HTML. Make sure you set the dataType to 'text' so jQuery hands you the raw response string instead of parsing it.
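Once the commented markup has been read back as a string, the comment-wrapping approach comes down to a small string operation. The following is only a sketch, assuming the markup sits inside a single comment; the helper name unwrapComment is made up for illustration and is not part of jQuery:

```javascript
// Hypothetical helper: strips one pair of comment delimiters from a string
// and returns the markup inside exactly as it was written.
function unwrapComment(commented) {
  var match = /^\s*<!--([\s\S]*?)-->\s*$/.exec(commented);
  // Return null when the input is not a single comment
  return match ? match[1] : null;
}

var raw = unwrapComment('<!--<div style="background-color:#ffffff; border-style:none;"></div>-->');
console.log(raw);
```

Because the browser never parses the comment's content, the hex color and the border-style declaration come back untouched.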

Related

jQuery html() method a security risk like innerHTML?

I was recently reading a JavaScript book and learned that using innerHTML to insert plain text poses a security risk, so I was wondering whether jQuery's html() method poses the same risk. I tried to research it but could not find anything.
For Example:
$("#saveContact").html("Save"); //change text to Save
var saveContact = document.getElementById("saveContact");
saveContact.innerHTML = "Save"; //change text to Save
These do the same thing from what I know, but do they both pose the same security risk of someone being able to inject some JavaScript and execute it?
I am not very knowledgeable in security, so I apologize in advance if anything is incorrect or explained incorrectly.
From the jQuery documentation:
Additional Notes:
By design, any jQuery constructor or method that
accepts an HTML string — jQuery(), .append(), .after(), etc. — can
potentially execute code. This can occur by injection of script tags
or use of HTML attributes that execute code (for example, <img onload="">). Do not use these methods to insert strings obtained from
untrusted sources such as URL query parameters, cookies, or form
inputs. Doing so can introduce cross-site-scripting (XSS)
vulnerabilities. Remove or escape any user input before adding content
to the document.
So, for example, if the user were to pass an HTML string that contains a <script> element, then that script would be executed:
$("#input").focus();
$("#input").on("blur", function(){
$("#output").html($("#input").val());
});
textarea { width:300px; height: 100px; }
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea id="input"><script>alert("The HTML in this element contains a script element that was processed! What if the script contained malicious content?!")</script></textarea>
<div id="output">Press TAB</div>
But, if we escape the string's contents before we pass it, we're safer:
$("#input").focus();
$("#input").on("blur", function(){
$("#output").html($("#input").val().replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;"));
});
textarea { width:300px; height: 100px; }
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea id="input"><script>alert("This time the < and > characters (which signify an HTML tag) are escaped into their HTML entity codes, so they won't be processed as HTML.")</script></textarea>
<div id="output">Press TAB</div>
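Note that replace() with a plain string pattern only replaces the first match, so a reusable escaping helper should use a global regular expression. A minimal sketch follows; escapeHtml is a hypothetical name, not a jQuery function, and a proven library routine is still preferable in production:

```javascript
// Hypothetical helper: escapes the five characters that are significant in HTML.
function escapeHtml(value) {
  var entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  // The g flag makes replace() process every occurrence, not just the first
  return String(value).replace(/[&<>"']/g, function (ch) {
    return entities[ch];
  });
}

console.log(escapeHtml('<script>alert("x")</script>'));
// prints: &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```

The result could then be passed to .html() without the markup being interpreted, for example $("#output").html(escapeHtml($("#input").val())).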
Finally, the best way to avoid processing a string as HTML is not to pass it to .innerHTML or .html() in the first place. That's why we have .textContent and .text() - they do the escaping for us:
$("#input").focus();
$("#input").on("blur", function(){
// Using .text() escapes the HTML automatically
$("#output").text($("#input").val());
});
textarea { width:300px; height: 100px; }
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea id="input"><script>alert("This time nothing will be processed as HTML.")</script></textarea>
<div id="output">Press TAB</div>
From the .html() docs:
By design, any jQuery constructor or method that accepts an HTML string — jQuery(), .append(), .after(), etc. — can potentially execute code. This can occur by injection of script tags or use of HTML attributes that execute code (for example, <img onload="">). Do not use these methods to insert strings obtained from untrusted sources such as URL query parameters, cookies, or form inputs. Doing so can introduce cross-site-scripting (XSS) vulnerabilities. Remove or escape any user input before adding content to the document.
This is why .innerHTML is bad and why .html() is also not good to use on strings from untrusted sources, say if you make an ajax request to get some data from an untrusted third party. You should use one of the numerous methods here or better still, a proven library function.

jQuery get html in div without any markup

I have some script written using the jQuery framework.
var site = {
link: $('#site-link').html()
}
This gets the html in the div site-link and assigns it to link. I later save link to the DB.
My issue is that I don't want the HTML, as I see this as being too dangerous, maybe?
I have tried:
link: $('#site-link').val()
... but this just gives me a blank value.
How can I get the value inside the div without any markup?
Try doing this:
$('#site-link').text()
From the jQuery API Documentation:
Get the combined text contents of each element in the set of matched
elements, including their descendants, or set the text contents of the
matched elements.
Use the .text() jquery method like this:
var site = {
link: $('#site-link').text()
}
Here is an example of what .val(), .html() and .text() do: jsfiddle example
Use the text() method.
Get the combined text contents of each element in the set of matched elements, including their descendants, or set the text contents of the matched elements.
Use jQuery's .text() function to get only the text.
var site = {
link: $('#site-link').text()
}
To avoid HTML, use jQuery's text() method.
var site = {
link: $('#site-link').text()
}
http://api.jquery.com/text/
If you are planning to store the result in the database and you are concerned about HTML, then using something like .text() rather than .html() is just an illusion of security.
NEVER EVER trust anything that comes from the client side!
Everything on the client side can be replaced or hijacked by the client rather easily. With the Tamper Data Firefox plugin, for example, even my mother could change the data sent to the server. She could send in anything in place of the link: malicious scripts, whole websites, etc.
It is important that you validate the "link" on the server side before saving it to the database. You can write a regex to check whether a string is a valid URL, or just strip out everything that is HTML.
It's also a good idea to HTML-encode it before outputting. This way, even if HTML gets into your database, after encoding it will be just a harmless string (there is other stuff to be aware of, like UTF-7, but the web is a dangerous place).
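The server-side check described above could look roughly like this. It is a sketch only, assuming a JavaScript runtime on the server such as Node.js (the original post does not say which server language is used); isSafeHttpUrl is a hypothetical name, and it relies on the standard URL constructor rather than a hand-written regex:

```javascript
// Hypothetical server-side check: accept only absolute http(s) URLs.
function isSafeHttpUrl(input) {
  try {
    var url = new URL(String(input).trim());
    // Reject schemes such as javascript: or data:
    return url.protocol === "http:" || url.protocol === "https:";
  } catch (e) {
    return false; // not parseable as an absolute URL
  }
}

console.log(isSafeHttpUrl("https://example.org/page"));
console.log(isSafeHttpUrl("javascript:alert(1)"));
console.log(isSafeHttpUrl("<script>alert(1)</script>"));
```

This only decides whether the value is stored at all; the stored value should still be HTML-encoded when it is written back into a page.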

jQuery parse HTML without loading images

I load HTML from other pages to extract and display data from that page:
$.get('http://example.org/205.html', function (html) {
console.log( $(html).find('#c1034') );
});
That does work but because of the $(html) my browser tries to load images that are linked in 205.html. Those images do not exist on my domain so I get a lot of 404 errors.
Is there a way to parse the page like $(html) but without loading the whole page into my browser?
Actually if you look in the jQuery documentation it says that you can pass the "owner document" as the second argument to $.
So what we can then do is create a virtual document so that the browser does not automatically load the images present in the supplied HTML:
var ownerDocument = document.implementation.createHTMLDocument('virtual');
$(html, ownerDocument).find('.some-selector');
Use regex and remove all <img> tags
html = html.replace(/<img[^>]*>/g,"");
Sorry for resuscitating an old question, but this is the first result when searching for how to try to stop parsed html from loading external assets.
I took Nik Ahmad Zainalddin's answer, but it has a weakness: anything that sits between two separate <script> elements gets wiped out along with them.
<script>
</script>
Inert text
<script>
</script>
In the above example Inert text would be removed along with the script tags. I ended up doing the following instead:
html = html.replace(/<\s*(script|iframe)[^>]*>(?:[^<]*<)*?\/\1>/g, "").replace(/(<(\b(img|style|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g, "");
Additionally I added the capability to remove iframes.
Hope this helps someone.
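The weakness described above comes from a pattern that runs from the first opening tag to the last closing tag. A reduced illustration with a lazy quantifier, handling script elements only, is sketched below; stripScripts is a hypothetical helper, and a regex remains an approximation rather than a real HTML parser:

```javascript
// Hypothetical helper: removes each <script>...</script> element separately.
function stripScripts(html) {
  // [\s\S]*? is lazy, so each match stops at the first closing tag
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

var sample = "<script>a()</script>Inert text<script>b()</script>";
console.log(stripScripts(sample)); // prints: Inert text
```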
Using the following way to parse html will load images automatically.
var wrapper = document.createElement('div'),
html = '.....';
wrapper.innerHTML = html;
If you use DOMParser to parse the HTML, the images will not be loaded automatically. See https://github.com/panzi/jQuery-Parse-HTML/blob/master/jquery.parsehtml.js for details.
You could either use jQuery's remove() method to select and remove the image elements
console.log( $(html).find('img').remove().end().find('#c1034') );
or remove them from the HTML string. Something like
console.log( $(html.replace(/<img[^>]*>/g,"")) );
Regarding background images, you could do something like this:
$(html).filter(function() {
return $(this).css('background-image') !== '';
}).remove();
The following regex replaces all occurrences of <img>, <head>, <link>, <script> and <style> elements, as well as background and style attributes, in the data string returned by the ajax load.
html = html.replace(/(<(\b(img|style|script|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g,"");
Test regex: https://regex101.com/r/nB1oP5/1
I wish there were a better way to work around this (other than using regex replace).
Instead of removing all img elements altogether, you can use the following regex to delete all src attributes instead:
html = html.replace(/src="[^"]*"/ig, "");

Getting the HTML content of an iframe using jQuery

I'm currently trying to customize OpenCms (a Java-based open source CMS), which embeds FCKEditor, and I'm trying to access the editor using JS / jQuery.
I try to fetch the HTML content of the iframe, but I always get null in return.
This is how I try to fetch the html content from the iframe:
var editFrame = document.getElementById('ta_OpenCmsHtml.LargeNews_1_.Teaser_1_.0___Frame');
alert( $(editFrame).attr('id') ); // returns the correct id
alert( $(editFrame).contents().html() ); // returns null (!!)
Looking at the screenshot, what I want to access is the 'LargeNews1/Teaser' HTML section, which currently holds the values "Newsline en...".
Below you can also see the html structure in Firebug.
However, $(editFrame).contents().html() returns null and I can't figure out why, whereas $(editFrame).attr('id') returns the correct id.
The iframe content / FCKEditor is on the same site/domain, no cross-site issues.
HTML code of iframe is at http://pastebin.com/hPuM7VUz
Updated:
Here's a solution that works:
var editArea = document.getElementById('ta_OpenCmsHtml.LargeNews_1_.Teaser_1_.0___Frame').contentWindow.document.getElementById('xEditingArea');
$(editArea).find('iframe:first').contents().find('html:first').find('body:first').html('some <b>new</b><br/> value');
.contents().html() doesn't work to get the HTML code of an IFRAME. You can do the following to get it:
$(editFrame).contents().find("html").html();
That should return all the HTML in the IFRAME for you. Or you can use "body" or "head" instead of "html" to get those sections too.
You can get the content like this:
$('#iframeID').contents().find('#someID').html();
but the frame must be on the same domain. See http://simple.procoding.net/2008/03/21/how-to-access-iframe-in-jquery/
I suggest replacing the first line with:
// Dots in the id must be escaped, otherwise jQuery parses them as class selectors
var editFrame = $('#ta_OpenCmsHtml\\.LargeNews_1_\\.Teaser_1_\\.0___Frame');
...and the 2nd alert expression with:
editFrame.html()
If, on the other hand, you prefer to accomplish the same without jQuery (much cooler, IMHO), you could use plain JavaScript:
var editFrame = document.getElementById('ta_OpenCmsHtml.LargeNews_1_.Teaser_1_.0___Frame');
alert(editFrame.innerHTML);
After trying a number of jQuery solutions that recommended using the option below, I discovered I was unable to get the actual <html> content including the parent tags.
$("#iframeId").contents().find("html").html()
This worked much better for me and I was able to fetch the entire <html>...</html> iframe content as a string.
document.getElementById('iframeId').contentWindow.document.documentElement.outerHTML
I think FCKEditor has its own API; see http://cksource.com/forums/viewtopic.php?f=6&t=8368
Looks like jQuery doesn't provide a method to fetch the entire HTML of an iFrame, however since it provides access to the native DOM element, a hybrid approach is possible:
$("iframe")[0].contentWindow.document.documentElement.outerHTML;
This will return the iframe's HTML including <HTML>, <HEAD> and <BODY>.
Your iframe:
<iframe style="width: 100%; height: 100%;" frameborder="0" aria-describedby="cke_88" title="Rich text editor, content" src="" tabindex="-1" allowtransparency="true"></iframe>
We can get the data from this iframe as:
var content=$("iframe").contents().find('body').html();
alert(content);
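The plain-DOM variants from the answers above can be folded into one small function. This is a sketch only: getFrameHtml is a hypothetical name, and it assumes a same-origin iframe whose document has already loaded:

```javascript
// Hypothetical helper: returns the full serialized document of a same-origin
// iframe, including the <html> element itself, or null if it is not accessible.
function getFrameHtml(frame) {
  try {
    var doc = frame.contentDocument || frame.contentWindow.document;
    return doc.documentElement.outerHTML;
  } catch (e) {
    return null; // cross-origin frame, or no document available
  }
}

// In a browser this would be called with a real element, for example:
//   getFrameHtml(document.getElementById('iframeId'));
//   getFrameHtml($('iframe')[0]);
```

Unlike .contents().find("html").html(), which returns only what is inside the root element, outerHTML includes the <html> tag as well.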

BODY tag disappear when using Jquery.Load()

I'm trying to make a pop-up-like window using jQuery and its modal box. First I load the content from an HTML file:
$("#test").load("test.htm");
Then I load the popup:
$("#test").dialog("open");
This works like it should: the content of test.htm is injected into the modal pop-up. There is only one thing wrong: the BODY tags are gone from the source of the pop-up. I need the BODY tag to be there because I do some formatting based on it.
Does anyone know why jQuery.Load() removes the BODY tag? And are there any workarounds?
A page can only have one body tag. If you already have one on the page, the second will be ignored.
In your case, it sounds like the browser is ignoring the duplicate body (nothing specific to jquery). Rather than use the body for styling, use a containing <div> with an id or class which will be retained.
It probably removes the body tag because it's not allowed! Each document can only have one body. Rather than force everyone to redo all their HTML pages, jQuery probably just grabs the contents of the body to use when you call load().
Have you thought about perhaps wrapping everything in a containing element? eg: <div class="body"> You can then apply the exact same styles to that element.
/* change this: */
body { color: #f0f; /* etc. */ }
/* to this: */
body, div.body { color: #f0f; }
You are loading the HTML into an existing document that already has a body tag. A document can only have one so it automatically filters anything and extracts only the HTML inside the body tag when using load. You should wrap your HTML in a div with a specific class and do your formatting based on that class.
From the load docs (emphasis mine):
In jQuery 1.2 you can now specify a
jQuery selector in the URL. Doing so
will filter the incoming HTML
document, only injecting the elements
that match the selector. The syntax
looks something like "url #some >
selector". Default selector "body>*"
always applies. If the URL contains a
space it should be escape()d. See the
examples for more information.
You might dynamically create the body tag using document.write of js as an alternative.
I had the same issue, and solved it more or less as follows:
Instead of using load(), you can use $.get() and do some smart string replacement:
$.get("test.htm", function (content) {
  // Wrap the body's content in a div so the styling hook survives parsing
  content = content
    .replace("<body>", "<body><div class='body'>")
    .replace("</body>", "</div></body>");
  $("#test").empty().append($(content).filter(".body"));
}, "text");
