JavaScript repair bad html tag - javascript

I'm working on a SharePoint website. I don't have access to the web parts' code; I can only change the master pages with JavaScript.
One of the web parts has a bug: it writes the <img> tag with a bad src value.
Example:
It should produce
<img alt="img" src="http://www.apicture.png" style="margin:5px" /><br /><br />
but it produces
<img alt="img" src="<a href="http://www.apicture.png">http://www.apicture.png" style="margin:5px" /><br /><br />
I tried to match and replace, but assigning to innerHTML broke the other scripts.
How can I repair this with JavaScript?
Edit:
I have the code:
var markup = document.documentElement.innerHTML;
markup = markup.replace(/src=\".*?(http:\/\/[^\"]+)\"/g,'src=\"$1\"');
document.documentElement.innerHTML = markup;
but it broke my webpage.

Since the DOM has already been broken, you need to take a step back and try to salvage the HTML.
1) Find the parents of the broken elements. While search&replace inside the document.body.innerHTML would probably work, you shouldn't really let regexes anywhere near large chunks of HTML. Performance is a concern as well, albeit a lesser one.
<img alt="img" src="<a href="http://... will get parsed by the browser as an image with the source "<a href=".
With jQuery, you can simply ask $('img[src^="<a href"]') to get the images. Except in IE<8, you can use querySelectorAll with the same selector. If you don't have jQuery, and want to support IE7, you need to use getElementsByTagName with manual filtering.
If you are really lucky, you can find the parent via getElementById (or the equivalent jQuery).
This is the easy part.
2) Your HTML doesn't validate, and the browser had already made some effort to fix it. You need to reverse the process. Predicting the browser actions is problematic, but let's attempt to.
Let's see what the browser does with
<img src="<a href="http://www.test.com/img/image-20x20.png">http://www.test.com/img/image-20x20.png" style="margin:5px" />
This is how Chrome and Firefox fix it:
<img src="<a href=" http:="" www.test.com="" img="" image-20x20.png"="">http://www.test.com/img/image-20x20.png" style="margin:5px" />
IE9 sorts the attributes within img alphabetically in innerHTML (o_0) and doesn't HTML-escape the < within src. IE7-8 additionally strip ="" from the attributes.
The image attributes will be hard to salvage, but the text content is unharmed. Anyways the pattern can be seen:
everything starting at <img and until src= should be preserved. Unfortunately, in IE, the attributes are rearranged, so you have to preserve the incorrect tags as well. src="..." itself must be removed. Everything past that is [incorrect] in modern browsers, but in IE, proper attributes could have crept there (and vice versa). Then the image tag ends.
Everything that follows is the real URL, up until the double quote. From the double quote up until the HTML-escaped /> are attributes that belong to the image tag. Let's hope they don't contain HTML. CSS is fine (for our purposes).
3) Let's build the regex: an opening IMG tag, any attributes (let's hope they don't contain HTML) (captured), the src attribute and its specific value (escaped or unescaped), any other attributes (captured), the end of tag, the URL (captured), some more attributes (captured) and the HTML-escaped closing tag.
/<img([^>]*?)src="(?:<|\&lt\;)a href="([^>]*?)>([^"]+?)"(.*?)\/(?:>|\&gt\;)/gi
You might be interested in how it's seen by RegexPal.com.
What it should be replaced by: the image with the proper attributes concatenated, and with the src salvaged. It might be worthwhile to filter the attributes, so let's opt for a callback-replace. Normal attributes contain only word-characters in their keys. More importantly, normal attributes are usually non-empty strings (IMG tags don't have boolean attributes, unless you are using server-side maps). This will match all empty attributes but not valid attribute keys: /\S+(?:="")?(?!=)/
Here is the code:
//forEach, indexOf, map need shimming in IE<9
//querySelectorAll cannot be reliably shimmed, so I'm not using that.
//author: Jan Dvorak
// https://stackoverflow.com/a/14157761/499214
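// Collect the unique parents of every image whose src was mangled.
var images = document.getElementsByTagName("img");
var parents = [];
[].forEach.call(images, function(img){
    if(
        /(?:<|\&lt\;)a href=/.test(img.getAttribute("src"))
        && parents.indexOf(img.parentNode) === -1
    ){
        parents.push(img.parentNode);
    }
});
// innerHTML serializes the stray "/>" text as "/&gt;", so accept both forms.
var re = /<img([^>]*?)src="(?:<|\&lt\;)a href="([^>]*?)>([^"]+?)"(.*?)\/(?:>|\&gt\;)/gi;
parents.forEach(function(p){
    // Rewrite only the affected parents, not the whole document.
    p.innerHTML = p.innerHTML.replace(
        re,
        function(match, attr1, attr2, url, attr3){
            // Apply the attribute filter described above, then rebuild the tag.
            var attrs = [attr1, attr2, attr3].map(function(a){
                return a.replace(/\S+(?:="")?(?!=)/g,"");
            }).join(" ");
            return '<img '+attrs+' src="'+url+'" />';
        }
    );
});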
fiddle: http://jsfiddle.net/G2yj3/1/
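The replacement step can also be tried on a plain string, without a DOM. The following is only a sketch under a few assumptions: the function names repairImageMarkup and keepRealAttributes are made up for illustration, the input is assumed to look like the Chrome/Firefox serialization shown earlier (with the stray closing "/>" escaped as "/&gt;"), and the attribute filter is a stricter variant rather than the one from the answer. The reason for the variant: \S+ also consumes = and quote characters, so the original filter pattern can match a whole space-free token such as alt="img" and drop it.

```javascript
// Matches a mangled <img> in serialized markup:
// 1: attributes before src, 2: junk attributes created by the parser,
// 3: the real URL, 4: attributes that ended up in the text after the tag.
var BROKEN_IMG = /<img([^>]*?)src="(?:<|&lt;)a href="([^>]*?)>([^"]+?)"(.*?)\/(?:>|&gt;)/gi;

// Assumption: a real attribute looks like key="non-empty value";
// junk tokens are bare words or end in ="".
function keepRealAttributes(chunk) {
  var kept = chunk.match(/[\w-]+="[^"]+"/g);
  return kept ? kept.join(" ") : "";
}

// Hypothetical helper: repairs a string such as a parent's innerHTML.
function repairImageMarkup(html) {
  return html.replace(BROKEN_IMG, function (match, before, junk, url, after) {
    var attrs = [before, junk, after]
      .map(keepRealAttributes)
      .filter(function (s) { return s !== ""; })
      .join(" ");
    return "<img " + (attrs ? attrs + " " : "") + 'src="' + url + '" />';
  });
}

// Usage on a string shaped like the browser's serialization:
var serialized = '<img alt="img" src="<a href=" http:="" www.apicture.png"="">' +
  'http://www.apicture.png" style="margin:5px" /&gt;<br><br>';
console.log(repairImageMarkup(serialized));
```

In the page itself, the result would still be assigned back to each affected parent's innerHTML, as in the code above; IE's reordered attributes are not covered by this sketch.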

You can repair the src attribute with a regex, but that won't repair the entire page. The reason is that the web browser tries to parse the bad HTML and produces weird output (extra elements etc.) before any JS is executed. Since you cannot interfere with the HTML parsing/rendering engine, there's no reasonable way to fix this other than changing the original content.

Related

How to find a unique string within html and wrap it with a tag, but exclude links and urls

I'm looking for a way to find a specific string within the visible text of a page and then wrap that string in <em> tags. I have tried using HTML Agility Pack and had some success with Regex.Replace, but if the string is part of a URL it also gets replaced, which I do not want; if it's within an image name, it gets replaced too, and this obviously breaks the link or image URL.
An example attempt:
var markup = Encoding.UTF8.GetString(buffer);
var replaced = Regex.Replace(markup, "product-xs", " <em>product</em>-xs", RegexOptions.IgnoreCase);
var output = Encoding.UTF8.GetBytes(replaced);
_stream.Write(output, 0, output.Length);
This does not work as it would replace a <a href="product/product-xs"> with <a href="product/<em>product</em>-xs"> - which I don't want.
The string is coming from a text string value within a CMS so the user can't wrap the words there and ideally, I want to catch all instances of the word that are already published.
Ideally I would want to exclude <title> tags, <img> tags and <a> tags, everything else should get the wrapped tag.
Before I used the HTML Agility Pack, a fellow front end dev tried it with JavaScript but that had an unexpected impact on dropdown menus.
If you need any more info, just ask.
You can use HTML Agility Pack to select only the text nodes (i.e. the text that exists between any two tags) with a bit of XPath and modify them like this.
Looking only in body excludes <title>, <meta>, etc. The not(self::script) predicate excludes script tags; you can exclude other elements in the same way (or check the parent node in the loop).
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//body//*[not(self::script)]/text()"))
{
    var newNode = htmlDoc.CreateTextNode(node.InnerText.Replace("product-xs", "<em>product</em>-xs"));
    node.ParentNode.ReplaceChild(newNode, node);
}
I've used a simple replace; a regex will work fine too. It's probably best to check the performance of each approach and choose the one that works best for your use case.

"Bad value for attribute src on element img: Must be non-empty", for dynamically generated img src

I have a website with an image slider. I keep some of the image tags empty, because the images load when their slide comes into view, for a faster page load. The image tags are defined as follows:
<img data-src="img/portfolio-desktop1-small.png" src="" alt=""/>
In the slide function, I set src to the value of data-src with a jQuery animation. The slider works great. My problem is that when I validate the page with the W3C validation tool, it gives the following error:
Line 131, Column 179: Bad value for attribute src on element img: Must be non-empty.
...data-src="img/portfolio-desktop1-small.jpg" src="" alt=""/>
Syntax of URL:
Any URL. For example: /hello, #canvas, or http://example.org/. Characters should be represented in NFC and spaces should be escaped as %20.
Is there any way to fix this without altering the JavaScript or CSS? If I leave it like this, what harm could it cause?
Set the image src attribute to #:
<img data-src="img/portfolio-desktop1-small.png"
src="#" alt="Thumbnail">
The HTML passes the W3C validator just fine, and modern browsers know not to try to load the non-existent image.*
For contrast, using a src value that references a non-existent file results in an unnecessary HTTP request and an error:
<img data-src="img/portfolio-desktop1-small.png"
src="bogus.png" alt="Thumbnail">
Failed to load resource: The requested URL was not found on this server.
*Note: I've read conflicting information on how browsers handle the #. If anyone has definitive information, please add a comment.
Also see related answer by sideshowbarker about the action attribute: https://stackoverflow.com/a/32491636
Update: November 2022
It seems the src="#" trick used to be a decent workaround but it's no longer a good idea to send that to the browser. 
So, I created a build tool to convert occurrences of src="#" in source HTML to inline data URLs of a tiny invisible one pixel SVG appropriate for the browser.
Build tool img-src-placeholder:
https://github.com/center-key/img-src-placeholder (MIT License)
The interesting bits are:
const onePixelSvg =
'<svg xmlns="http://www.w3.org/2000/svg" width="1" height="1"></svg>';
const dataImage = 'data:image/svg+xml;base64,' +
Buffer.from(onePixelSvg).toString('base64');
html.replace(/src=["']?#["']?/gm, `src="${dataImage}"`);
The resulting HTML will have <img> tags like:
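<img src="data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSIxIiBoZWlnaHQ9IjEiPjwvc3ZnPg=="
alt=avatar>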
The advantage of using a build tool is that:
Source remains uncluttered
HTML always validates
The inline data URL prevents the browser from making an unnecessary and invalid network request
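For anyone who wants to try the substitution without the build tool, here is a minimal Node.js sketch built from the snippet above. The wrapper name replaceImgPlaceholders is an assumption made for illustration and is not part of img-src-placeholder; Buffer is the Node.js global, so this is meant to run at build time, not in the browser.

```javascript
// One-pixel transparent SVG, taken from the snippet above.
const onePixelSvg =
  '<svg xmlns="http://www.w3.org/2000/svg" width="1" height="1"></svg>';

// Inline data URL that can stand in for src="#".
const dataImage = 'data:image/svg+xml;base64,' +
  Buffer.from(onePixelSvg).toString('base64');

// Hypothetical wrapper: swaps every src="#" (quoted or unquoted) for the data URL.
function replaceImgPlaceholders(html) {
  return html.replace(/src=["']?#["']?/gm, `src="${dataImage}"`);
}

// Usage: the source HTML keeps the short placeholder, the output gets the data URL.
console.log(replaceImgPlaceholders('<img src="#" alt=avatar>'));
```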
What happens if you just remove the src attribute and then add it on the fly when you need it? The src attribute isn't required. And in my opinion, I wouldn't worry about what the W3C validation tool says anyway, as long as you test it in the necessary browsers and it works.
Update Jan 2021. The src="#" trick works now on the validator at https://www.w3.org/TR/html-media-capture/
If anyone is still looking for an answer: src="/" resolves the W3C validator's complaint and doesn't produce an additional request, unlike the solution with the # character.

JQuery and frame(set)s - why do they disappear?

When I use the console and I try to create an element with tag frameset, I get no result:
$('<div id="content" data-something="hello" />')
=> [<div id="content" data-something="hello"></div>]
$('<frameset frameborder="0" framespacing="0" marginwidth="0" marginheight="0" framespacing="0" border="0"></frameset>')
=> []
This behaviour persists across multiple JQuery versions (like 1.10.2, 2.0.0 and 1.2.6 etc.)
How can I read the 'frameborder' attribute (for example) from this frameset without having to build a parser myself?
P.S. (If you wonder why I use frames) This line (or a line like this) is a response from an external (more or less) API that I cannot change. I would just like to read the information and go on.
The frameset tag is to be used as the body of framed documents in place of the body tag, and in conjunction with the frameset document type declaration. It is considered obsolete since HTML5.
To solve your issue, your best bet is to parse the portions you require from the string on your own or use an HTML parsing library such as htmlparser.js
The attributes on your frameset element are not valid HTML, so jQuery does not create it. It will work if you take them out. You can then add the attributes one at a time using .attr.
var x = $('<frameset></frameset>');
x.attr('frameborder', '0');
x.attr('framespacing', '0');
x.attr('border', '0');
But the added code and resource cost of creating an element is not necessary just to find an attribute value in the string. You can find the substring you are looking for with the match method like this:
var frameborder = '<frameset frameborder="0" border="0"></frameset>'.match(/frameborder="(.+?)"/)[1]
Just replace 'frameborder' in the regex with the name of another attribute to get its value. Simple.
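Two caveats with the one-liner: match returns null when the attribute is missing, so indexing with [1] throws, and a pattern for 'border' would also match inside 'frameborder'. A small sketch that guards against both; the helper name getAttributeValue is made up for illustration, and it assumes double-quoted attribute values as in the question:

```javascript
// Hypothetical helper: returns the value of a double-quoted attribute,
// or null when the attribute is not present in the markup string.
function getAttributeValue(markup, name) {
  // Escape regex metacharacters in the attribute name.
  var safeName = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Require whitespace before the name so "border" does not match "frameborder".
  var match = markup.match(new RegExp("\\s" + safeName + '="([^"]*)"', "i"));
  return match ? match[1] : null;
}

// Usage with a string like the one returned by the external API:
var source = '<frameset frameborder="0" framespacing="0" border="1"></frameset>';
console.log(getAttributeValue(source, "frameborder"));
console.log(getAttributeValue(source, "border"));
console.log(getAttributeValue(source, "marginwidth"));
```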
I used a workaround:
var sourceWithFrames = ...
sourceWithFrames = sourceWithFrames.replace(/<frame/g, '<xyzFrame')
// e.g.
var frameborder = $(sourceWithFrames).find('xyzFrameset').attr('frameborder')
// and so on
This is, in my opinion, the best way to approach this (and probably the only one...)

Need to escape /> (forward slash and greater than) with jQuery or Javascript

I'm working on a web page that will display code inside pre tags, and need to render characters used to form HTML tags within those pre tags. I'm able to escape the greater-than and less-than symbols via jQuery/Javascript per my code below.
However, the combination of a forward slash and a greater-than symbol (/>) is problematic. Additionally, I'm getting unexpected results rendered in the final output when the page runs.
The contents of the pre tag are simple.
<pre>
<abc />
<xyz />
</pre>
Here is my jQuery code.
$(function(){
    $('pre').html(function() {
        //this.innerHTML = this.innerHTML.replace(new RegExp(['/>'],"g"), "#");
        //this.innerHTML = this.innerHTML.replace(new RegExp(['/'],"g"), "*");
        this.innerHTML = this.innerHTML.replace(new RegExp(['<'],"g"), "&lt;");
        this.innerHTML = this.innerHTML.replace(new RegExp(['>'],"g"), "&gt;");
    });
});
When this runs, what I expect to happen is the page will render the following:
<abc/><xyz/>
Pretty simple. Instead, here is what gets rendered in Chrome, Firefox, and IE.
<abc>
<xyz>
</xyz></abc>
The tags get duplicated, and the forward slashes get moved after the less-than symbols. Presently I'm learning jQuery, so there may be something more fundamental wrong with my function. Thanks for your help.
You have some invalid HTML. The browser then tries to turn the invalid HTML into a DOM. jQuery then asks the browser to turn the DOM back into HTML. What it gets is a serialisation of the DOM at that stage. The original source is lost.
You can't use jQuery to recover the original broken source of an HTML document (short of making a new HTTP request and forcing it to treat the response as plain text).
Fix the HTML on the server before you send it to the client.
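As a rough illustration of escaping before the browser parses anything, here is a minimal sketch. It assumes the code sample is available as a plain string (for example on a Node.js server, or fetched as text); the function name escapeHtml is made up for illustration.

```javascript
// Hypothetical helper: escapes the characters that HTML treats specially,
// so the text can be placed inside <pre> and shown literally.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")   // must come first, or later entities get double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Usage: escape the raw sample, then wrap it in <pre>.
var sample = "<abc />\n<xyz />";
var markup = "<pre>" + escapeHtml(sample) + "</pre>";
console.log(markup);
```

If the raw text is only available in the browser, assigning it to textContent instead of innerHTML has the same effect, since the browser then never parses it as markup.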

How can I use literal DOM markup as a Prototype Template?

Prototype's Template class allows you to easily substitute values into a string template. Instead of declaring the Template source-string in my code, I want to extract the source-string from the DOM.
For example, in my markup I have an element:
<div id="template1">
<img src="#{src}" title="#{title}" />
</div>
I want to create the template with the inner contents of the div element, so I've tried something like this:
var template = new Template($('template1').innerHTML);
The issue is that Internet Explorer's representation of the innerHTML omits the quotes around the attribute value when the value has no spaces. I've also attempted to use Element#inspect, but in Internet Explorer I get back a non-recursive representation of the element / sub-tree.
Is there another way to get a Template-friendly representation of the sub-tree's contents?
Looks like you can embed the template source inside a textarea tag instead of a div and retrieve it using Element#value.
Certainly makes the markup a little weird, but it still seems reasonably-friendly to designers.
Additionally, as Jason pointed out in a comment to the original question, including the img tag in the textarea prevents a spurious request for an invalid image.
Resig to the rescue:
You can also inline script:
<script type="text/html" id="user_tmpl">
<% for ( var i = 0; i < users.length; i++ ) { %>
<li><%=users[i].name%></li>
<% } %>
</script>
Quick tip: Embedding scripts in your page that have a unknown content-type (such is the case here - the browser doesn't know how to execute a text/html script) are simply ignored by the browser - and by search engines and screenreaders. It's a perfect cloaking device for sneaking templates into your page. I like to use this technique for quick-and-dirty cases where I just need a little template or two on the page and want something light and fast.
and you would use it from script like so:
var results = document.getElementById("results");
results.innerHTML = tmpl("item_tmpl", dataObject);
