jQuery parse HTML without loading images - javascript

I load HTML from other pages to extract and display data from that page:
$.get('http://example.org/205.html', function (html) {
console.log( $(html).find('#c1034') );
});
That works, but because of the $(html) my browser tries to load the images that are linked in 205.html. Those images do not exist on my domain, so I get a lot of 404 errors.
Is there a way to parse the page like $(html) but without loading the whole page into my browser?

Actually, if you look in the jQuery documentation, it says that you can pass the "owner document" as the second argument to $().
So what we can then do is create a virtual document so that the browser does not automatically load the images present in the supplied HTML:
var ownerDocument = document.implementation.createHTMLDocument('virtual');
$(html, ownerDocument).find('.some-selector');
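For completeness, here is a sketch of how this fits the original question (same URL and selector as above); the detached document keeps the browser from requesting the images:
$.get('http://example.org/205.html', function (html) {
    // parse into a detached document so <img> elements never hit the network
    var ownerDocument = document.implementation.createHTMLDocument('virtual');
    console.log($(html, ownerDocument).find('#c1034'));
});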

Use a regex and remove all <img> tags:
html = html.replace(/<img[^>]*>/g,"");

Sorry for resuscitating an old question, but this is the first result when searching for how to stop parsed HTML from loading external assets.
I took Nik Ahmad Zainalddin's answer; however, it has a weakness in that any content in between <script> tags gets wiped out.
<script>
</script>
Inert text
<script>
</script>
In the above example, "Inert text" would be removed along with the script tags. I ended up doing the following instead:
html = html.replace(/<\s*(script|iframe)[^>]*>(?:[^<]*<)*?\/\1>/g, "").replace(/(<(\b(img|style|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g, "");
Additionally I added the capability to remove iframes.
Hope this helps someone.

Parsing HTML the following way will load the images automatically:
var wrapper = document.createElement('div'),
    html = '.....';
wrapper.innerHTML = html;
If you use DOMParser to parse the HTML, the images will not be loaded automatically. See https://github.com/panzi/jQuery-Parse-HTML/blob/master/jquery.parsehtml.js for details.
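If the linked plugin is not an option, a minimal sketch with the native DOMParser looks like this (note that 'text/html' support is missing in some older browsers, which is what the plugin works around):
// elements in this detached document do not trigger image requests
var doc = (new DOMParser()).parseFromString(html, 'text/html');
var target = doc.getElementById('c1034');
console.log($(target)); // wrap in jQuery if needed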

You could either use jQuery's remove() method to remove the image elements
console.log( $(html).find('img').remove().end().find('#c1034') );
or remove them from the HTML string, something like
console.log( $(html.replace(/<img[^>]*>/g,"")) );
Regarding background images, you could do something like this:
$(html).filter(function() {
    return $(this).css('background-image') !== '';
}).remove();

The following regex replaces all occurrences of <head>, <link>, <script>, and <style> tags, as well as background and style attributes, in the data string returned by the ajax load.
html = html.replace(/(<(\b(img|style|script|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g,"");
Test regex: https://regex101.com/r/nB1oP5/1
I wish there were a better way to work around this (other than using a regex replace).

Instead of removing the img elements altogether, you can use the following regex to delete all src attributes:
html = html.replace(/src="[^"]*"/ig, "");

Related

Find body tag in an ajax HTML response

I'm making an ajax call to fetch content and append this content like this:
$(function(){
    var site = $('input').val();
    $.get('file.php', { site: site }, function(data){
        mas = $(data).find('a');
        mas.map(function(elem, index) {
            divs = $(this).html();
            $('#result').append('' + divs + '');
        })
    }, 'html');
});
The problem is that when I change 'a' to 'body' I get nothing (no error, just no HTML). I'm assuming body is a tag just like 'a' is? What am I doing wrong?
So this works for me:
mas = $(data).find('a');
But this doesn't:
mas = $(data).find('body');
I ended up with this simple solution:
var body = data.substring(data.indexOf("<body>")+6,data.indexOf("</body>"));
$('body').html(body);
Works also with head or any other tag.
(A solution with xml parsing would be nicer but with an invalid XML response you have to do some "string parsing".)
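As a sketch of that "any other tag" remark, here is a small helper built on the same indexOf approach (the helper name is made up, and it naively assumes the tag is written without attributes and occurs only once):
function extractTag(data, tag) {
    // find the first opening and closing tag and return what is between them
    var open = '<' + tag + '>', close = '</' + tag + '>';
    var start = data.indexOf(open), end = data.indexOf(close);
    if (start === -1 || end === -1) return '';
    return data.substring(start + open.length, end);
}
$('body').html(extractTag(data, 'body'));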
Parsing the returned HTML through a jQuery object (i.e. $(data)) in order to get the body tag is doomed to fail, I'm afraid.
The reason is that the returned data is a string (try console.log(typeof(data))). Now, according to the jQuery documentation, when creating a jQuery object from a string containing complex HTML markup, tags such as body are likely to get stripped. This happens because, in order to create the object, the HTML markup is actually inserted into the DOM, which does not allow such additional tags there.
Relevant quote from the documentation:
If a string is passed as the parameter to $(), jQuery examines the string to see if it looks like HTML.
[...]
If the HTML is more complex than a single tag without attributes, as it is in the above example, the actual creation of the elements is handled by the browser's innerHTML mechanism. In most cases, jQuery creates a new element and sets the innerHTML property of the element to the HTML snippet that was passed in. When the parameter has a single tag (with optional closing tag or quick-closing) — $( "<img />" ) or $( "<img>" ), $( "<a></a>" ) or $( "<a>" ) — jQuery creates the element using the native JavaScript createElement() function.
When passing in complex HTML, some browsers may not generate a DOM that exactly replicates the HTML source provided. As mentioned, jQuery uses the browser's .innerHTML property to parse the passed HTML and insert it into the current document. During this process, some browsers filter out certain elements such as <html>, <title>, or <head> elements. As a result, the elements inserted may not be representative of the original string passed.
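A quick way to see that stripping in action (a tiny sketch; the logged numbers are what I'd expect from the behaviour described above):
var data = '<html><head></head><body><p id="p1">hello</p></body></html>';
console.log($(data).find('body').length);  // 0, the body tag was stripped
console.log($(data).filter('#p1').length); // 1, only the inner elements survive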
I experimented a little and have identified the cause to a point, so, pending a real answer (which I would be interested in), here is a hack to help understand the issue:
$.get('/', function(d){
    // replace the `HTML` tags with `NOTHTML` tags
    // and the `BODY` tags with `NOTBODY` tags
    d = d.replace(/(<\/?)html( .+?)?>/gi, '$1NOTHTML$2>')
    d = d.replace(/(<\/?)body( .+?)?>/gi, '$1NOTBODY$2>')
    // select the `notbody` tag and log it for testing
    console.log($(d).find('notbody').html())
})
Edit: further experimentation
It seems it is possible: if you load the content into an iframe, you can then access the frame content through the DOM object hierarchy...
// get a page using AJAX
$.get('/', function(d){
    // create a temporary `iframe`, make it hidden, and attach it to the DOM
    var frame = $('<iframe id="frame" src="/" style="display: none;"></iframe>').appendTo('body')
    // wait for the frame to load its content
    $(frame).load(function(){
        // grab the HTML from the body, using the raw DOM node (frame[0])
        // and more specifically, its `contentDocument` property
        var html = $('body', frame[0].contentDocument).html()
        // check the HTML
        console.log(html)
        // remove the temporary iframe
        $("#frame").remove()
    })
})
Edit: more research
It seems that contentDocument is the standards-compliant way to get hold of the window.document element of an iFrame, but of course IE doesn't really care for standards, so this is how to get a reference to the iFrame's window.document.body object in a cross-platform way...
var iframeDoc = iframe.contentDocument || iframe.contentWindow.document;
var iframeBody = iframeDoc.body;
// or for extra caution, to support even more obsolete browsers
// var iframeBody = iframeDoc.getElementsByTagName("body")[0]
See: contentDocument for an iframe
I FIGURED OUT SOMETHING WONDERFUL (I think!)
Got your html as a string?
var results = //probably an ajax response
Here's a jquery object that will work exactly like the elements currently attached to the DOM:
var superConvenient = $($.parseXML(results)).children('html');
Nothing will be stripped from superConvenient! You can do stuff like superConvenient.find('body') or even
superConvenient.find('head > script');
superConvenient works exactly like the jquery elements everyone is used to!!!!
NOTE
In this case the string results needs to be valid XML because it is fed to JQuery's parseXML method. A common feature of an HTML response may be a <!DOCTYPE> tag, which would invalidate the document in this sense. <!DOCTYPE> tags may need to be stripped before using this approach! Also watch out for features such as <!--[if IE 8]>...<![endif]-->, tags without closing tags, e.g.:
<ul>
<li>content...
<li>content...
<li>content...
</ul>
... and any other features of HTML that will be interpreted leniently by browsers, but will crash the XML parser.
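A rough sketch of that clean-up, assuming the only offenders are the doctype and IE conditional comments (the helper name is made up, and anything else that is not well-formed XML will still crash the parser):
function toXmlSafe(html) {
    return html
        .replace(/<!DOCTYPE[^>]*>/i, '')                               // the doctype is not valid XML here
        .replace(/<!--\[if [^\]]*\]>[\s\S]*?<!\[endif\]-->/gi, '');    // IE conditional comments
}
var superConvenient = $($.parseXML(toXmlSafe(results))).children('html');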
Regex solution that worked for me:
var head = res.match(/<head.*?>.*?<\/head.*?>/s);
var body = res.match(/<body.*?>.*?<\/body.*?>/s);
Detailed explanation: https://regex101.com/r/kFkNeI/1

JS Regexp: get the inline javascripts from html

I need to get all script tags from an HTML string, separating the inline scripts and the "linked" scripts. By inline scripts I mean script tags without the src attribute.
Here is how I get the "linked scripts":
<script(.)+src=(.)+(/>|</script>)
so: <script followed by one or more characters, followed by src=, followed by one or more characters, and finally /> or </script>.
This works as expected.
Now I want to get all the script tags without the src attribute, having some javascript code between <script .....> and </script>, but I can't figure out how to do that. I just started understanding regular expressions, so the help of a more experienced r.e. guru is needed :)
UPDATE
Ok, so dear downvoters. I have the html code for a whole html page in a variable. I want to extract script tags from it. How to do it, using jquery for example?
var dom = $(html);
console.log(dom.find('script'));
will not work. So, what is the way to accomplish that?
UPDATE 2
I don't need to solve this problem with regex, but because now I am learning about them, I thought I will try it. I am opened for any other solution.
Create a DOM element using document.createElement, then set its innerHTML to the contents of your HTML string. This will automatically parse your HTML using the browser's built-in parser and fill your newly-created element with children.
dummyDoc = document.createElement("html");
dummyDoc.innerHTML = "<body><script>alert('foo');</script></body>"; // or myInput.value
var dom = $(dummyDoc);
var scripts = dom.find('script');
(I only use jQuery because you do so in your question. This is certainly also possible without jQuery.)
If you are in a position where no DOM access is available (Node.js?), you'd be forced to use a regex. Here is a solution that worked for me in similar circumstances:
function scrapeInlineScripts(sHtml) {
    // normalise every opening <script ...> tag to </script>, then split on </script>;
    // the odd-indexed chunks are the inline script bodies
    var a = sHtml.split(/<script[^>]*>/).join('</script>').split('</script>'),
        s = '';
    for (var n = 1; n < a.length; n += 2) {
        s += a[n];
    }
    return s;
}

Most secure javascript JSON Inline technique

I'm using Varnish+ESI to return external JSON content from a RESTful API.
This technique allows me to manage requests and refresh data without using web server resources for each request.
e.g:
<head>
....
<script>
var data = <esi:include src='apiurl/data'>;
</script>
...
After including the ESI, Varnish will return:
var data = {attr:1, attr2:'martin'};
This works fine, but if the API returns an error, this technique will generate a parse error.
var data = <html><head><script>...api js here...</script></head><body><h1 ... api html ....
I solved this problem using a hidden div to parse and catch the error:
...
<b id=esi-data style=display:none;><esi:include src='apiurl/data'></b>
<script>
try {
    var data = $.parseJSON($('#esi-data').html());
} catch (e) { alert('manage the error here'); }
....
I've also tried using a script type text/esi, but the browser renders the html inside the script tag (wtf), e.g:
<script id=esi-data type='text/esi'><esi:include src='apiurl/data'></script>
Question:
Is there any way to wrap the tag so that the browser does not parse its contents?
Let me expand upon the iframe suggestion I made in my comment—it's not quite what you think!
The approach is almost exactly the same as what you're doing already, but instead of using a normal HTML element like a div, you use an iframe.
<iframe id="esi-data" src="about:blank"><esi:include src="apiurl/data"></iframe>
var $iframe = $('#esi-data');
try {
    var data = $.parseJSON($iframe.html());
} catch (e) { ... }
$iframe.remove();
#esi-data { display: none; }
How is this any different from your solution? Two ways:
The data/error page are truly hidden from your visitors. An iframe has an embedded content model, meaning that any content within the <iframe>…</iframe> tags gets completely replaced in the DOM—but you can still retrieve the original content using innerHTML.
It's valid HTML5… sort-of. In HTML5, markup inside iframe elements is treated as text. Sure, you're meant to be able to parse it as a fragment, and it's meant to contain only phrasing content (and no script elements!), but it's essentially just treated as text by the validator—and by browsers.
Scripts from the error page won't run. The content gets parsed as text and replaced in the DOM with another document—no chance for any script elements to be processed.
Take a look at it in action. If you comment out the line where I remove the iframe element and inspect the DOM, you can confirm that the HTML content is being replaced with an empty document. Also note that the embedded script tag never runs.
Important: this approach could still break if the third party added an iframe element into their error page for some reason. Unlikely as this may be, you can bulletproof the approach a little more by combining your technique with this one: surround the iframe with a hidden div that you remove when you're finished parsing.
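A sketch of that combination (the ids and error handling here are only illustrative): wrap the iframe in a hidden div, parse, then throw the whole wrapper away:
<div id="esi-wrapper" style="display: none;">
    <iframe id="esi-data" src="about:blank"><esi:include src="apiurl/data"></iframe>
</div>
<script>
try {
    var data = $.parseJSON($('#esi-data').html());
} catch (e) { /* manage the error here */ }
// removing the wrapper also disposes of any iframe the error page itself might contain
$('#esi-wrapper').remove();
</script>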
Here I go with another attempt.
Although I believe you already have possibly the best solution for this, the only workaround I could imagine is the fairly low-performance method of calling the esi:include in a separate HTML document, then retrieving its contents server-side as if you were using AJAX. Perhaps similar to this? Then check the contents you retrieved, maybe by using json_decode, and if that fails, generate an error JSON string instead.
The greatest downside I see to this is that I believe it would be very resource-consuming and would most likely even delay your requests, as the separate page is called as if your server itself were a client, parsed, then sent back.
I'd honestly stick to your current solution.
This is a rather tricky problem with no really elegant solution, if any solution at all.
I asked you if it was an HTML(5) or XHTML(5) document, because in the latter case a CDATA section can be used to wrap the content, changing your solution slightly to something like this:
...
<b id='esi-data' style='display:none;'>
<![CDATA[ <esi:include src='apiurl/data'> ]]>
</b>
<script>
try {
    var data = $.parseJSON($('#esi-data').html());
} catch (e) { alert('manage the error here'); }
....
Of course, this solution only works if:
you're using XHTML5 and
the error contains no CDATA section (because CDATA section nesting is impossible).
I don't know if switching from one serialization to the other is an option, but I wanted to clarify the intent of my question. It will hopefully help you out :).
Can't you simply change your API to return JSON { "error": "error_code_or_text" } on error? You can even do something meaningful in your interface to alert the user about the error if you do it that way.
<script>var data = 999;</script>
<script>
data = <esi:include src='apiurl/data'>;
</script>
<script>
if(data == 999) alert("there was an error");
</script>
If there is an error and "data" is not JSON, then a javascript error will be thrown. The next script block will pick that up.

Getting the HTML content of an iframe using jQuery

I'm currently trying to customize OpenCms (a Java-based open source CMS) a bit, which embeds FCKEditor; that is what I'm trying to access using JS/jQuery.
I try to fetch the HTML content of the iframe, but I always get null in return.
This is how I try to fetch the html content from the iframe:
var editFrame = document.getElementById('ta_OpenCmsHtml.LargeNews_1_.Teaser_1_.0___Frame');
alert( $(editFrame).attr('id') ); // returns the correct id
alert( $(editFrame).contents().html() ); // returns null (!!)
Looking at the screenshot, what I want to access is the 'LargeNews1/Teaser' HTML section, which currently holds the value "Newsline en...".
Below you can also see the html structure in Firebug.
However, $(editFrame).contents().html() returns null and I can't figure out why, whereas $(editFrame).attr('id') returns the correct id.
The iframe content / FCKEditor is on the same site/domain, no cross-site issues.
HTML code of iframe is at http://pastebin.com/hPuM7VUz
Updated:
Here's a solution that works:
var editArea = document.getElementById('ta_OpenCmsHtml.LargeNews_1_.Teaser_1_.0___Frame').contentWindow.document.getElementById('xEditingArea');
$(editArea).find('iframe:first').contents().find('html:first').find('body:first').html('some <b>new</b><br/> value');
.contents().html() doesn't work to get the HTML code of an IFRAME. You can do the following to get it:
$(editFrame).contents().find("html").html();
That should return all the HTML in the IFRAME for you. Or you can use "body" or "head" instead of "html" to get those sections too.
you can get the content as
$('#iframeID').contents().find('#someID').html();
but frame should be in the same domain refer http://simple.procoding.net/2008/03/21/how-to-access-iframe-in-jquery/
I suggest replacing the first line with:
var editFrame = $('#ta_OpenCmsHtml.LargeNews_1_.Teaser_1_.0___Frame');
...and the 2nd alert expression with:
editFrame.html()
If, on the other hand, you prefer to accomplish the same without jQuery (much cooler, IMHO), you could use plain JavaScript:
var editFrame = document.getElementById('ta_OpenCmsHtml.LargeNews_1_.Teaser_1_.0___Frame');
alert(editFrame.innerHTML);
After trying a number of jQuery solutions that recommended using the option below, I discovered I was unable to get the actual <html> content including the parent tags.
$("#iframeId").contents().find("html").html()
This worked much better for me and I was able to fetch the entire <html>...</html> iframe content as a string.
document.getElementById('iframeId').contentWindow.document.documentElement.outerHTML
I think the FCKEditor has its own API see http://cksource.com/forums/viewtopic.php?f=6&t=8368
Looks like jQuery doesn't provide a method to fetch the entire HTML of an iFrame, however since it provides access to the native DOM element, a hybrid approach is possible:
$("iframe")[0].contentWindow.document.documentElement.outerHTML;
This will return the iFrame's HTML, including <HTML>, <HEAD> and <BODY>.
Your iframe:
<iframe style="width: 100%; height: 100%;" frameborder="0" aria-describedby="cke_88" title="Rich text editor, content" src="" tabindex="-1" allowtransparency="true"/>
We can get the data from this iframe as:
var content=$("iframe").contents().find('body').html();
alert(content);

jQuery selector loads images from server

here is the code:
<script type="text/javascript">
var ajax_data =
    '<ul id="b-cmu-rgt-list-videos"><li><a href="{video.url}" '+
    'title="{video.title.strip}"><img src="{video.image}" '+
    'alt="{video.title.strip}" /><span>{video.title}</span></a></li></ul>';
var my_img = $(ajax_data).find('img');
</script>
ajax_data is data from a JS template engine, from which I need to extract some part. The problem is that jQuery does a GET on the
img src={video.image}: GET /test/%7Bvideo.image%7D HTTP/1.1
(on Firefox Live HTTP headers).
This GET generates a 404 from the server.
Any clues on how to solve this?
Thanks a lot :)
When you create a jQuery object from HTML, it's immediately evaluated (because the document fragment is created), so this:
$("<img src='bob.jpg' />")
Immediately causes a fetch of the image. The way I see it, you have 3 quick options (and probably others, but it's hard to say without more context to your question):
Replace {video.image} before creating the jQuery object (see the sketch after this list).
Remove src="{video.image}", just find the <img> via the selector you already have and set the src attribute later, like this: $(ajax_data).find('img').attr('src','myImage.jpg');
Do everything you want via regex before inserting anything into the DOM.
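A sketch of option 1, assuming a video object is available from your template data; once the placeholders are filled in, the src points at a real URL and no 404 is triggered:
// fill in the template placeholders before the string ever becomes a jQuery object
var rendered = ajax_data
    .replace(/\{video\.url\}/g, video.url)
    .replace(/\{video\.image\}/g, video.image)
    .replace(/\{video\.title(\.strip)?\}/g, video.title);
var my_img = $(rendered).find('img');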
