Possible to create custom "DOMs" by loading HTML from string in Javascript? - javascript

I'm trying to parse HTML in the browser. The browser receives 2 HTML files as strings, eg. HTML1 and HTML2.
I now need to parse these "documents" just as one would parse the current document. This is why I was wondering if it is possible to create custom documents based on these HTML strings (these strings are provided by the server or user).
So that for example the following would be valid:
$(html1Document).$("#someDivID")...
If anything is unclear, please ask me to clarify more.
Thanks.

var $docFragment = $(htmlString);
$docFragment.find("a"); // all anchors in the HTML string
Note that this ignores any document structure tags (<html>, <head> and <body>), but any contained tags will be available.

With jQuery you can do this:
$(your_document_string).someParsingMethod().another();

You can always append your HTML to some hidden div (through innerHTML or jQuery .html(..)). It won't be treated exactly as a new document, but you will still be able to search its contents.
It has a few side-effects, though. For example, if your html defines any script tags, they'll be loaded. Also, browser may (and probably will) remove html, body and similar tags.
edit
If you specifically need title and similar tags, you may try iframe loading content from your server.
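Browsers also provide DOMParser, which none of the answers above mention. The following is a minimal sketch, assuming a modern browser; parseHtmlDocument and its injectable Parser parameter are hypothetical names used only for illustration. Unlike $(htmlString), the result is a complete Document, so <title> and <body> survive, and scripts contained in the string are not executed.

```javascript
// Hypothetical helper: turn an HTML string into a separate, inert Document.
// Parser defaults to the browser's DOMParser; it is a parameter only so the
// function can be exercised outside a browser.
function parseHtmlDocument(htmlString, Parser = globalThis.DOMParser) {
  if (typeof Parser !== 'function') {
    throw new Error('DOMParser is not available in this environment');
  }
  return new Parser().parseFromString(htmlString, 'text/html');
}

// Browser usage (html1 is assumed to hold the first HTML string):
//   const doc1 = parseHtmlDocument(html1);
//   doc1.title;                        // contents of <title>, if any
//   doc1.querySelector('#someDivID');  // query it like the live document
//   $(doc1).find('#someDivID');        // or wrap it with jQuery
```

Each string gets its own document, so ids in HTML1 and HTML2 cannot collide with each other or with the page that is doing the parsing.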

Related

How does a browser render this inline JavaScript within an encoded tag?

I was trying to perform a reflected XSS attack on a tutorial website. The webpage basically consists of a form with an input field and a submit button. On submitting the form, the contents of the input field are displayed on the same webpage.
I figured out that the website is blacklisting script tag and some of the JavaScript methods in order to prevent an XSS attack. So, I decided to encode my input and then tried submitting the form. I tried 2 different inputs and one of them worked and the other one didn't.
When I tried:
<body onload="&#97lert('Hi')"></body>
It worked and an alert box was displayed. However, when I encoded some characters in the HTML tag, something like:
&#60body onload="&#97lert('Hi')"&#62&#60/body&#62
It didn't work! It simply printed <body onload="alert('Hi')"></body> as it is on the webpage!
I know that the browsers execute inline JavaScript as they parse an HTML document (please correct me if I'm wrong). But, I'm not able to understand why did the browser show different behavior for the different inputs that I've mentioned.
-------------------------------------------------------------Edit---------------------------------------------------------
I tried the same with a more basic XSS tutorial with no XSS protection. Again:
<script>alert("Hi")</script> -> Worked!
&#60s&#99ript&#62&#97lert("Hi")&#60/s&#99ript&#62 -> Didn't work! (Got printed as string on the Web Page)
So basically, if I encode anything in JavaScript, it works. But if I'm encoding anything that is HTML, it's not executing the JavaScript within that HTML!
I can't come up with the words to describe this properly, so I'll just give you an example. Let's say we have this string:
<div>Hello World! &lt;span id="foo"&gt;Foobar&lt;/span&gt;</div>
When this gets parsed, you end up with a div element that contains the text:
Hello World! <span id="foo">Foobar</span>
Note, while there is something that looks like html inside the text, it is still just text, not html. For that text to become html, it would have to be parsed again.
Attributes work a little bit differently, html entities in attributes do get parsed the first time.
tl;dr:
if the service you are using is stripping out tags, there's nothing you can do about it unless the script is poorly written in a way that results in the string getting parsed twice.
Demo: http://jsfiddle.net/W6UhU/ note how after setting the div's inner HTML equal to its inner text, the span becomes an HTML element rather than a string.
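The escaping that makes markup show up as text instead of being parsed can be sketched as follows. This is only an illustration, assuming the site encodes user input before echoing it; escapeHtml is a hypothetical helper, not the code of the tutorial site in the question.

```javascript
// Hypothetical helper: replace the characters that are significant to the
// HTML parser with entities, so the input is displayed rather than parsed.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;') // must run first, or later entities get re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const echoed = escapeHtml('<span id="foo">Foobar</span>');
console.log(echoed); // &lt;span id=&quot;foo&quot;&gt;Foobar&lt;/span&gt;
```

A page that inserts the escaped string parses it once and ends up with text. It only turns into an element if something parses the decoded text a second time.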
When an HTML page says &#60body, it treats it the same as if it said &lt;body
That is, it just displays the encoded characters, doesn't parse them as HTML. So you're not creating a new tag with onload attributes http://jsfiddle.net/SSfNw/1/
alert(document.body.innerHTML);
// When an HTML page says &lt;body It treats it the same as if it said &lt;body
So in your case, you're never creating a body tag, just content that ends up getting moved into the body tag http://jsfiddle.net/SSfNw/2/
alert(document.body.innerHTML)
// &lt;body onload="alert('Hi')"&gt;&lt;/body&gt;
In the case <body onload="&#97lert('Hi')"></body>, the parser is able to create the body tag, once within the body tag, it's also able to create the onload attribute. Once within the attribute, everything gets parsed as a string.
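The decoding that happens inside the attribute value can be approximated in plain JavaScript. This is only a sketch: decodeNumericRefs is a hypothetical helper that handles decimal numeric references such as &#97 (with or without a trailing semicolon) and nothing else, whereas a real HTML parser follows many more rules.

```javascript
// Hypothetical helper: decode decimal numeric character references.
function decodeNumericRefs(text) {
  return text.replace(/&#(\d+);?/g, (match, code) =>
    String.fromCodePoint(Number(code))
  );
}

// The attribute value from the question decodes to runnable JavaScript:
console.log(decodeNumericRefs("&#97lert('Hi')")); // alert('Hi')

// The same decoding applied to "&#60body" yields the characters "<body",
// but in a real page this happens after the tag boundaries have already
// been determined, so the result is text, not a tag.
console.log(decodeNumericRefs('&#60body')); // <body
```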

jQuery: How to replace the whole DOM with another HTML using .load

I have a problem.
We are doing a Captive Portal.
Go to any site, for example www.php.net
Then in Chrome's console, use this:
$("html").load( "https://www.ccc.co.il/Suspend.aspx" );
You will notice, the DOM is replaced, but not quite the way it should be:
The wrapper elements of the loaded webpage (title, body for example) are missing!
This causes problems of course on the injected page.
How do I replace the entire initial DOM?
And please don't suggest using a link or a normal redirect.
Those are the restrictions, I need to replace the entire DOM tree please.
Thanks!
This is fundamentally a feature of browsers.
Here's a snip from the jQuery docs for .load():
jQuery uses the browser's .innerHTML property to parse the retrieved document and insert it into the current document. During this process, browsers often filter elements from the document such as <html>, <title>, or <head> elements. As a result, the elements retrieved by .load() may not be exactly the same as if the document were retrieved directly by the browser.
While I don't recommend what you're suggesting at all, I will attempt to answer your question:
Using a server-side language (like PHP, for example), return documents as parsed json:
{
"head": [head string],
"body": [body string]
}
Then your JavaScript can individually replace each element.
You'll need to switch from .load() to something more configurable, like .ajax()
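The client side of that idea could look roughly like this. It is a sketch under assumptions: the server is assumed to return JSON with string head and body fields as described above, applyDocumentParts is a hypothetical helper, and '/suspend.json' is a made-up endpoint.

```javascript
// Hypothetical helper: apply a { head, body } payload to a document.
// `doc` is a parameter (normally window.document) so the function can be
// exercised without a browser.
function applyDocumentParts(doc, parts) {
  doc.head.innerHTML = parts.head;
  doc.body.innerHTML = parts.body;
}

// Browser usage with jQuery; '/suspend.json' is a made-up endpoint:
//   $.ajax({ url: '/suspend.json', dataType: 'json' })
//     .done(function (parts) { applyDocumentParts(document, parts); });
```

Keep in mind that <script> elements inserted through innerHTML are not executed, so scripts in the injected page would need separate handling.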
I think you would have to use an iframe in this case as I don't think that you can replace an entire DOM with another.
$('body').html("<iframe height=100% width=100% frameBorder=0 src='https://www.ccc.co.il/Suspend.aspx'></iframe>");
http://jsfiddle.net/c7EbY/
$.get("https://www.ccc.co.il/Suspend.aspx", function(html){$("html").html(html)});
I'm using a regular AJAX function here because it shouldn't strip anything.
Sorry about those 4 htmls in a row. :P
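Another commonly used way to swap the whole page, not taken from the answers above, is to rewrite the document with document.open(), document.write() and document.close(). A minimal sketch, assuming the HTML can be fetched at all (a cross-origin $.get only succeeds if the other server allows it); replaceDocument is a hypothetical name.

```javascript
// Hypothetical helper: replace the entire document with the given HTML.
// `doc` is normally window.document.
function replaceDocument(doc, html) {
  doc.open();      // discards the current document contents
  doc.write(html); // parsed as a full document, so <head>, <title>, <body> are kept
  doc.close();     // signals that writing is finished
}

// Browser usage:
//   $.get('https://www.ccc.co.il/Suspend.aspx', function (html) {
//     replaceDocument(document, html);
//   });
```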

When should one use .innerHTML and when document.write in JavaScript

Is there a general rule, when one should use document.write to change the website content and when to use .innerHTML?
So far my rules were:
1) Use document.write when adding new content
2) Use .innerHTML when changing existing content
But I got confused, since someone told me that on the one hand .innerHTML is a strange Microsoft standard, but on the other hand I read that document.write is not allowed in XHTML.
Which structures should I use to manipulate my source code with JavaScript?
innerHTML can be used to change the contents of the DOM by string munging. So if you wanted to add a paragraph with some text at the end of a selected element you could do something like
document.getElementById( 'some-id' ).innerHTML += '<p>here is some text</p>'
Though I'd suggest using as much DOM manipulation specific API as possible (e.g. document.createElement, document.createDocumentFragment, <element>.appendChild, etc.). But that's just my preference.
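The same paragraph can be added with the DOM API instead of string concatenation. This is a minimal sketch; appendParagraph is a hypothetical helper, and the document is passed in as a parameter only so the function can be tried outside a browser.

```javascript
// Hypothetical helper: append <p>text</p> to `parent` without string munging.
function appendParagraph(doc, parent, text) {
  const p = doc.createElement('p');
  p.textContent = text; // treated as plain text, never parsed as HTML
  parent.appendChild(p);
  return p;
}

// Browser usage:
//   appendParagraph(document, document.getElementById('some-id'), 'here is some text');
```

Unlike innerHTML +=, this does not re-parse the element's existing children, so event handlers attached to them stay in place.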
The only time I've seen applicable use of document.write is in the HTML5 Boilerplate (look at how it checks if jQuery was loaded properly). Other than that, I would stay away from it.
innerHTML and document.write are not really comparable methods to dynamically change/insert content, since their usage is different and for different purposes.
document.write should be reserved for specific use cases. Once the page has loaded and the DOM is ready, you can no longer use it safely: calling it at that point replaces the whole document. That's why it is mostly used inside conditional statements, to synchronously load an external JavaScript file (such as a library) by writing a <script> block (e.g. the fallback for when jQuery fails to load from the CDN in HTML5 Boilerplate).
What you read about this method and XHTML is true when the page is served along with the application/xhtml+xml mime type: From w3.org
document.write (like document.writeln) does not work in XHTML documents (you'll get a "Operation is not supported" (NS_ERROR_DOM_NOT_SUPPORTED_ERR) error on the error console). This is the case if opening a local file with a .xhtml file extension or for any document served with an application/xhtml+xml MIME type
Another difference between these approaches concerns the insertion point: with .innerHTML you choose which element receives the content, while with document.write the content is always inserted at the point in the document where the calling script appears.
1) document.write() puts the contents directly into the page, where the user can see it.
This method writes HTML expressions or JavaScript code to a document.
The below example will just print ‘Hello World’ into the document
<html>
<body>
<script>
document.write("Hello World!");
</script>
</body>
</html>
2) element.innerHTML changes the inner content of an element (innerHTML is a property of elements, not of document)
It changes the existing content of an element
The below code will change the content of p tag
<html>
<body>
<p id="test" onclick="myFun()">Click me to change my HTML content or my inner HTML</p>
<script>
function myFun() {
document.getElementById("test").innerHTML = "I'm replaced by existing element";
}
</script>
</body>
</html>
You could use document.write() without any connected HTML, but if you already have HTML that you want to change, then an element's innerHTML would be the obvious choice.
I agree with the above comments. Basically:
document.write can be useful while the page is loading, to output new HTML tags or content while the browser is building the document object model. That content is output precisely where the JavaScript statement is embedded.
.innerHTML is useful at any time to insert new HTML tags/content as a string, and can be more easily directed to specific elements in the DOM regardless of when/where the JavaScript is run.
A couple of additional notes...
When document.write is called from a script outside of the body element, its output will be appended to the body element if called while the page is loading; but once the page is loaded, that same document.write will overwrite the entire document object model, effectively erasing your page. It all depends on the timing of document.write with the page load.
If you are using document.write to append new content to the end of the body element, you may be better off using this:
document.body.innerHTML += "A string of new content!";
It's a bit safer.
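The timing rule described above can be sketched as a small guard. This assumes the call comes from a synchronous inline script; writeOrAppend is a hypothetical helper, and insertAdjacentHTML is swapped in for innerHTML += because it appends without re-parsing the existing body content.

```javascript
// Hypothetical helper: use document.write only while the page is still
// being parsed, and append to the body afterwards.
function writeOrAppend(doc, html) {
  if (doc.readyState === 'loading') {
    doc.write(html); // still parsing: output lands at the current position
    return 'written';
  }
  // Page already loaded: document.write would wipe it, so append instead.
  doc.body.insertAdjacentHTML('beforeend', html);
  return 'appended';
}

// Browser usage:
//   writeOrAppend(document, '<p>A string of new content!</p>');
```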

how to pass as parameter to JS function from HTML large chunk of HTML code?

I want to pass a large chunk of HTML code (maybe 2-3 paragraphs of HTML-formatted code) as an argument to a JavaScript function call from HTML. The problem is that the formatted HTML keeps appearing in the page itself, which shouldn't be the case! I am assuming there's some problem with single/double quotes!
And, I am working on Facebook tab page.
Can anyone please help me ?
Thanks.
-
ahsan
One way is to have a hidden div (something with display:none), and populate that with your 2-3 paragraphs of html formatted code. Then, you can just pass the innerHTML of the div into your function. Quotes (of any kind) won't cause a problem in this method.
Some libraries like icanhaz.js also do something like this:
<script type="text/html" id="someHTMLTemplate">
<div>You can put whatever html you want here</div>
<p>And the browser just ignores it</p>
</script>
I use the same technique with mustache.js: I look up the script tag by its DOM id and read the template from its innerHTML. This has the advantage that the browser doesn't have to parse your extra HTML while loading the page; it is only parsed when you need to display it in another node on the page.
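Reading such a template back out takes only a few lines. This is a minimal sketch that does not use the icanhaz.js or mustache.js APIs; readTemplate is a hypothetical helper and 'target' is a made-up element id.

```javascript
// Hypothetical helper: read the raw template text out of a
// <script type="text/html" id="..."> block.
function readTemplate(doc, id) {
  const node = doc.getElementById(id);
  if (!node) {
    throw new Error('No template with id "' + id + '"');
  }
  return node.innerHTML;
}

// Browser usage, with the template shown above and a made-up target id:
//   const html = readTemplate(document, 'someHTMLTemplate');
//   document.getElementById('target').innerHTML = html;
```

Because the HTML never passes through a quoted function argument, single and double quotes inside it need no escaping.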
Another way is to encode the HTML and then decode it in the JS. Here's an example using the JS escape info:
console.log(escape("<hello></hello>")); // %3Chello%3E%3C/hello%3E
console.log(unescape("%3Chello%3E%3C/hello%3E")); // <hello></hello>
Mind you, if you have an issue with your string quotations to begin with, then there will still be a problem encoding.
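escape and unescape are deprecated; the standard replacements are encodeURIComponent and decodeURIComponent. The sketch below assumes the HTML is encoded before it is placed in the page; encodeForAttribute is a hypothetical helper. Note that encodeURIComponent leaves single quotes untouched, so they are replaced separately here.

```javascript
// Hypothetical helper: encode HTML so it contains no quotes or angle brackets.
function encodeForAttribute(html) {
  // encodeURIComponent does not encode ', so handle it explicitly.
  return encodeURIComponent(html).replace(/'/g, '%27');
}

const snippet = '<p class="note">It\'s "quoted" & safe</p>';
const encodedSnippet = encodeForAttribute(snippet);

console.log(encodedSnippet);
console.log(decodeURIComponent(encodedSnippet) === snippet); // true
console.log(encodeURIComponent('<hello></hello>')); // %3Chello%3E%3C%2Fhello%3E
```

Unlike escape, encodeURIComponent also encodes the slash, which is why the last output differs slightly from the example above.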

Read IFrame content using JavaScript

OK, this is my first time dealing seriously with IFrames and I can't seem to understand a few things:
First the sample code I am testing with:
<head>
<script type="text/javascript">
function init(){
console.log("IFrame content: " + window.frames['i1'].document.getElementsByTagName('body')[0].innerHTML);
}
</script>
</head>
<body onload="init();">
<iframe name="i1" src="foo.txt"></iframe>
</body>
the file "foo.txt" looks like this:
sample text file
Questions:
1) The iframe seems to be behaving as a HTML document and the file text is actually part of the body instead. Why ? Is it a rule for an IFrame to be a HTML document. Is it not possible for the content of an iframe to be just plain text ??
2) The file content gets wrapped inside a pre tag for some reason. Why is this so ? Is it always the case?
3) My access method in the javascript is working but is there any other alternative? [native js solutions please] If the content is wrapped in a pre tag always then I will actually have to lookup inside the pre tag rather than lookup the innerHTML
I was having a hard time getting the contents of a TXT file that was the src of an iframe.
This is my solution:
document.getElementById( 'iframeID' ).contentWindow.document.body.innerText
innerHTML does not return the exact content of an element. It's a property (historically non-standardized) that returns HTML equivalent to the real content, and in HTML the equivalent of plain text is <pre>foo...</pre>.
You might have better luck with the innerText property..
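A native way to get at the text without caring about the <pre> wrapper can be sketched like this. readFrameText is a hypothetical helper; it uses textContent, the standard DOM counterpart of innerText, and assumes the iframe is same-origin and has finished loading.

```javascript
// Hypothetical helper: return the plain text shown inside an iframe,
// or null if the frame's document cannot be accessed.
function readFrameText(frame) {
  const doc = frame.contentDocument; // null when the frame is cross-origin
  if (!doc || !doc.body) {
    return null;
  }
  return doc.body.textContent;
}

// Browser usage, once the iframe's load event has fired:
//   const frame = document.getElementsByName('i1')[0];
//   console.log('IFrame content: ' + readFrameText(frame));
```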
1) The iframe seems to be behaving as a HTML document and the file text is actually part of the body instead. Why ?
You're using the DOM/JS interface. This will only work if the content is treated as HTML/XML.
That's how browsers treat text files, because they 'look better' this way (not only inside an iframe). Browsers can process lots of file types, and it's unreasonable to expect them to show everything in raw form, right? Because browser pages (and iframes) are about presentation, nobody really uses iframes for configuration or to read raw data from the disk.
If you want to have full control over presentation, just change file type to html and it will be treated like html. (in particular, it will solve 'pre' problem)
Will this answer your questions?
