In an app I receive some HTML text: since the app can't display (interpret) HTML, I need to remove any HTML tag and entity from the string I receive from the server.
I tried the following, which removes HTML tags but not entities (e.g. &nbsp;):
stringFromServer.replace(/(<([^>]+)>)/ig,"");
Any help is appreciated.
Disclaimer: I need a pure JavaScript solution (no jQuery, Underscore, etc.).
[UPDATE] I'm reading all your answers now and I forgot to mention that I'm using JavaScript BUT the environment is not a web page, so I have no DOM.
You can try something like this:
var placeholder = document.createElement('div');
placeholder.innerHTML = stringFromServer;
var theText = placeholder.innerText;
.innerText only grabs text content from the element.
However, since it appears you don't have access to any DOM manipulation at all, you're probably going to have to use some kind of HTML parser, like these:
https://www.npmjs.org/package/htmlparser
http://ejohn.org/blog/pure-javascript-html-parser/
A solution without using regexes or phantom divs can be found on Mozilla's MDN.
I put the code in a JSfiddle here:
var sMyString = "<a id=\"a\"><b id=\"b\">hey!<\/b><\/a>";
var oParser = new DOMParser();
var oDOM = oParser.parseFromString(sMyString, "text/xml");
// print the name of the root element or error message
alert(oDOM.documentElement.nodeName == "parsererror" ?
"error while parsing" : oDOM.documentElement.textContent);
Alternatively, parse the HTML snippet in a new document and do your dom manipulations from that (if you'd rather keep it separate from the current document):
var tmpDoc=document.implementation.createHTMLDocument("");
tmpDoc.body.innerHTML="<a href='#'>some text</a><p style=''> more text</p>";
tmpDoc.body.textContent;
tmpDoc.body.textContent evaluates to:
some text more text
stringFromServer.replace(/(<([^>]+)>|&[^;]+;)/ig, "")
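Note that the pattern above deletes entities instead of decoding them (so "Tom &amp; Jerry" loses its ampersand), and a stray & followed later by a ; can swallow the text in between. A minimal DOM-free sketch that strips tags and then decodes a few entities could look like the following. The names stripHtml and NAMED_ENTITIES are hypothetical, and the entity table is deliberately tiny; this is an assumption-laden illustration, not a full HTML parser.

```javascript
// Hypothetical helper table: only a handful of named entities are known here.
var NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: "\"", apos: "'", nbsp: "\u00a0" };

// Strips anything that looks like a tag, then decodes entities in one pass.
// Unknown named entities are dropped (an assumption made for this sketch).
function stripHtml(input) {
  return String(input)
    .replace(/<[^>]+>/g, "")
    .replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, function (match, body) {
      if (body.charAt(0) === "#") {
        var isHex = body.charAt(1).toLowerCase() === "x";
        var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        // Ignore values outside the Unicode range instead of throwing.
        return code <= 0x10FFFF ? String.fromCodePoint(code) : "";
      }
      var key = body.toLowerCase();
      return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, key)
        ? NAMED_ENTITIES[key]
        : "";
    });
}
```

Decoding in a single pass keeps an input such as "&amp;lt;" from being decoded twice.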
Pretty simple question that I couldn't find an answer to, maybe because it's a non-issue, but I'm wondering if there is a difference between creating an HTML object using Javascript or using a string to build an element. Like, is it a better practice to declare any HTML elements in JS as JS objects or as strings and let the browser/library/etc parse them? For example:
jQuery('<div />', {'class': 'example'});
vs
jQuery('<div class="example"></div>');
(Just using jQuery as an example, but same question applies for vanilla JS as well.)
It seems like a non-issue to me but I'm no JS expert, and I want to make sure I'm doing it right. Thanks in advance!
They're both "correct". And both are useful at different times for different purposes.
For instance, in terms of page-speed, these days it's faster to just do something like:
document.body.innerHTML = "<header>....big string o' html text</footer>";
The browser will spit it out in an instant.
As a matter of safety, when dealing with user-input, it's safer to build elements, attach them to a documentFragment and then append them to the DOM (or replace a DOM node with your new version, or whatever).
Consider:
var userPost = "My name is Bob.<script src=\"//bad-place.com/awful-things.js\"></script>",
paragraph = "<p>" + userPost + "</p>";
commentList.innerHTML += paragraph;
Versus:
var userPost = "My name is Bob.<script src=\"//bad-place.com/awful-things.js\"></script>",
paragraph = document.createElement("p");
paragraph.appendChild( document.createTextNode(userPost) );
commentList.appendChild(paragraph);
One does bad things and one doesn't.
Of course, you don't have to create textNodes, you could use innerText or textContent or whatever (the browser will create the text node on its own).
But it's always important to consider what you're sharing and how.
If it's coming from anywhere other than a place you trust (which should be approximately nowhere, unless you're serving static pages, in which case, why are you building html?), then you should keep injection in mind -- only the things you WANT to be injected should be.
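If you do end up building markup as a string, one common mitigation is to escape the untrusted text before concatenating it. A minimal sketch, assuming a hypothetical helper named escapeHtml (it is not part of any library mentioned above):

```javascript
// Hypothetical helper: escapes the characters that are significant in HTML
// text and attribute values. The ampersand must be replaced first.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

var userPost = "My name is Bob.<script src=\"//bad-place.com/awful-things.js\"></script>";
// The script tag is now inert text rather than markup.
var paragraph = "<p>" + escapeHtml(userPost) + "</p>";
```

This only covers text and quoted attribute values; URLs, inline scripts, and styles need their own handling, which is one reason building nodes is the safer default.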
Either can be preferable depending on your particular scenario; i.e., if everything is hard-coded, option 2 is probably better, as @camus said.
One limitation with the first option though, is that this
$("<div data-foo='X' />", { 'class': 'example' });
will not work. That overload expects a naked tag as the first parameter with no attributes at all.
This was reported here
Option 1 is better if your attributes depend on variables set before calling the $ function, since you don't have to concatenate strings and variables. Aside from that, since you can do both, and it's just some JS code somebody else wrote, not a C++ DOM API hardcoded in the browser...
The following code [jsfiddle]...
var div = document.createElement("div");
div.innerHTML = "<foo>This is a <bar /> test. <br> Another test.</foo>";
alert(div.innerHTML);
...shows this parsed structure:
<foo>This is a <bar> test. <br> Another test.</bar></foo>
i.e. the browser knows that <br> has no closing tag, but since <bar> is an unknown tag to the browser, it assumes that it needs a closing tag.
I know that the /> (solidus) syntax is ignored in HTML5 and invalid in HTML4, but I would still like to somehow teach the browser that <bar> does not need an end tag so that I can omit it. Is that possible?
Yes, I'm trying to (temporarily) misuse the HTML code for custom tags and I have my specific reasons to do that. After all, browsers should ignore unknown tags and treat them just like unstyled inline tags, so I should not break anything as long as I can make sure the tag names won't ever be used in real HTML standards.
You'd have to use Object.defineProperty on HTMLElement.prototype to override the innerHTML setter and getter with your own innerHTML implementation that treats the elements you want as void. Look here for how innerHTML and the HTML parser are implemented by default.
Note though that Firefox sucks at inheritance when it comes to defining stuff on HTMLElement.prototype where it filters down to HTMLDivElement for example. Things should work fine in Opera though.
In other words, what elements are void depends on the HTML parser. The parser follows this list and innerHTML uses the same rules mostly.
So, in other words, unless you want to create your own innerHTML implementation in JS, you probably should just forget about this.
You can use the live DOM viewer though to show others how certain markup is parsed. You'll then probably notice that some end tags will implicitly close the open element.
I have some outdated innerHTML getter (not setter though) code here that uses a void element list. That may give you some ideas. But, writing a setter implementation might be more difficult.
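To illustrate the idea of a getter driven by a void element list, here is a minimal DOM-free sketch. It is not the code referred to above: the node shape ({ tag, children }), the function name serialize, and the VOID_ELEMENTS table are all hypothetical, and attributes and text escaping are left out.

```javascript
// Hypothetical void list: a few standard void elements plus the custom "bar".
var VOID_ELEMENTS = { br: true, hr: true, img: true, input: true, bar: true };

// Serializes a plain-object tree; strings are treated as text nodes.
// Elements found in VOID_ELEMENTS get no children and no end tag.
function serialize(node) {
  if (typeof node === "string") {
    return node; // text is emitted as-is (no escaping in this sketch)
  }
  var open = "<" + node.tag + ">";
  if (VOID_ELEMENTS[node.tag] === true) {
    return open;
  }
  var inner = (node.children || []).map(serialize).join("");
  return open + inner + "</" + node.tag + ">";
}

var tree = {
  tag: "foo",
  children: ["This is a ", { tag: "bar" }, " test. ", { tag: "br" }, " Another test."]
};
var markup = serialize(tree);
```

A matching setter would need a parser that consults the same list, which is the harder half of the job.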
On the other hand, if you use createElement() and appendChild() etc. instead of innerHTML, you shouldn't have to worry about this and the native innerHTML getter will output the unknown elements with end tags.
Note though, you can treat the unknown element as xml and use XMLSerializer() and DOMParser() to do things:
var x = document.createElement("test");
var serializer = new XMLSerializer();
alert(serializer.serializeToString(x));
var parser = new DOMParser();
var doc = parser.parseFromString("<test/>", "application/xml");
var div = document.createElement("div");
div.appendChild(document.importNode(doc.documentElement, true));
alert(serializer.serializeToString(div));
It's not exactly what you want, but something you can play with. (Test that in Opera instead of Firefox to see the difference with xmlns attributes. Also note that Chrome doesn't do like Opera and Firefox.)
We are using jQuery to generate an XML fragment and then convert it to a string using the html() function. But as we just found out, and if anyone doesn't know, the html() JavaScript function as implemented in IE is broken, broken, broken. Basically, it capitalizes some tags, adds attributes to others "helpfully" (in our case, ), and generally doesn't do the Right Thing.
I would like to use something like this to generate the XML string instead:
http://www.stainlessvision.com/jquery-html-vs-innerxhtml
However, this library won't play nicely with jQuery out of the box, e.g.:
var $dummyRoot = $('<dummyroot/>'); // since html() doesn't generate the outer element
var $foo = $('<foo></foo>');
var $font = $('<font ></font >');
$foo.append($font);
$dummyRoot.append($foo);
var $s = innerXHTML($dummyRoot); // <-- Doesn't work
I think it wants a more W3C DOM-ish object.
How can I get jQuery to talk to this innerXHTML() function; or, alternatively, is there another function I can use (maybe something built into jQuery or a jQuery plugin)?
Edit: Follow up for DDaviesBrackett's question. I also have a "body" element in my XML; look how it picks up CSS styling (and not just a element).
Is there an unwritten rule to not generate XML inside the DOM whose elements have names like body, font, head, etc.?
var $dummyRoot = $('<dummyroot/>');
var $foo = $('<foo></foo>');
var $body = $('<body></body>');
var $font = $('<font></font>');
$body.append($font);
$foo.append($body);
$dummyRoot.append($foo);
var $s = innerXHTML($dummyRoot[0]);
// $s equals "<foo><body bottommargin="15" leftmargin="10" rightmargin="10" topmargin="15"><font size="+0"></font></body></foo>"
The jQuery object wraps its contents, but exposes them via an array indexer. What do you get when you use
var $s = innerXHTML($dummyRoot[0]);
instead of your example?
Is there an unwritten rule to not generate XML inside the DOM whose elements have names like body, font, head, etc.?
jQuery relies on the innerHTML property to parse a given piece of text and construct the DOM from that. It was never meant to parse or generate XML as colliding names can give totally unpredictable results depending on how the browser sees it.
See
jQuery won’t parse xml with nodes called option
How do I parse xml with jQuery?
Parse content like XML, with jQuery
I have given a similar answer for generating proper XML in fewer steps using a recursive approach. To create the following XML:
<foo>
<body>
<font></font>
</body>
</foo>
you would write:
Σ('foo',
Σ('body',
Σ('font', '')
)
);
Σ just looks cooler, but you can change the function name to whatever you want :)
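The Σ function itself is not reproduced here. As a rough, string-based sketch of the same recursive idea (the name xmlTag is hypothetical, and attributes and escaping are ignored):

```javascript
// Hypothetical stand-in for the recursive builder described above:
// the first argument is the element name, the rest are already-built children.
function xmlTag(name) {
  var children = Array.prototype.slice.call(arguments, 1);
  return "<" + name + ">" + children.join("") + "</" + name + ">";
}

var xml = xmlTag('foo',
  xmlTag('body',
    xmlTag('font', '')
  )
);
```

Because this builds plain strings, element names such as body or font never pass through the browser's HTML parser, which sidesteps the name collisions described above.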
Recently I have been reading more and more about people using custom attributes in their HTML tags, mainly for the purpose of embedding some extra bits of data for use in javascript code.
I was hoping to gather some feedback on whether or not using custom attributes is a good practice, and also what some alternatives are.
It seems like it can really simplify both server side and client side code, but it also isn't W3C compliant.
Should we be making use of custom HTML attributes in our web apps? Why or why not?
For those who think custom attributes are a good thing: what are some things to keep in mind when using them?
For those who think custom attributes are bad thing: what alternatives do you use to accomplish something similar?
Update: I'm mostly interested in the reasoning behind the various methods, as well as points as to why one method is better than another. I think we can all come up with 4-5 different ways to accomplish the same thing. (hidden elements, inline scripts, extra classes, parsing info from ids, etc).
Update 2: It seems that the HTML 5 data- attribute feature has a lot of support here (and I tend to agree, it looks like a solid option). So far I haven't seen much in the way of rebuttals for this suggestion. Are there any issues/pitfalls to worry about using this approach? Or is it simply a 'harmless' invalidation of the current W3C specs?
HTML 5 explicitly allows custom attributes that begin with data-. So, for example, <p data-date-changed="Jan 24 5:23 p.m.">Hello</p> is valid. Since it's officially supported by a standard, I think this is the best option for custom attributes. And it doesn't require you to overload other attributes with hacks, so your HTML can stay semantic.
Source: http://www.w3.org/TR/html5/dom.html#embedding-custom-non-visible-data-with-the-data-*-attributes
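In browsers that support it, such an attribute is exposed through the element's dataset property under a camelCased key (data-date-changed becomes dataset.dateChanged); getAttribute('data-date-changed') works everywhere. The name conversion can be sketched without a DOM as follows; toDatasetKey is a hypothetical helper written for illustration, not a browser API.

```javascript
// Hypothetical helper mirroring how a data-* attribute name maps to a
// dataset key: drop the "data-" prefix, then turn each "-x" into "X".
function toDatasetKey(attributeName) {
  return attributeName
    .replace(/^data-/, "")
    .replace(/-([a-z])/g, function (match, letter) {
      return letter.toUpperCase();
    });
}

var key = toDatasetKey("data-date-changed");
```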
Here's a technique I've been using recently:
<div id="someelement">
<!-- {
someRandomData: {a:1,b:2},
someString: "Foo"
} -->
<div>... other regular content...</div>
</div>
The comment-object ties to the parent element (i.e. #someelement).
Here's the parser: http://pastie.org/511358
To get the data for any particular element simply call parseData with a reference to that element passed as the only argument:
var myElem = document.getElementById('someelement');
var data = parseData( myElem );
data.someRandomData.a; // <= Access the object straight away
It can be more succinct than that:
<li id="foo">
<!--{specialID:245}-->
... content ...
</li>
Access it:
parseData( document.getElementById('foo') ).specialID; // <= 245
The only disadvantage of using this is that it cannot be used with self-closing elements (e.g. <img/>), since the comments must be within the element to be considered as that element's data.
EDIT:
Notable benefits of this technique:
Easy to implement
Does not invalidate HTML/XHTML
Easy to use/understand (basic JSON notation)
Unobtrusive and semantically cleaner than most alternatives
Here's the parser code (copied from the http://pastie.org/511358 hyperlink above, in case it ever becomes unavailable on pastie.org):
var parseData = (function(){
var getAllComments = function(context) {
var ret = [],
node = context.firstChild;
if (!node) { return ret; }
do {
if (node.nodeType === 8) {
ret[ret.length] = node;
}
if (node.nodeType === 1) {
ret = ret.concat( getAllComments(node) );
}
} while( node = node.nextSibling );
return ret;
},
cache = [0],
expando = 'data' + +new Date(),
data = function(node) {
var cacheIndex = node[expando],
nextCacheIndex = cache.length;
if(!cacheIndex) {
cacheIndex = node[expando] = nextCacheIndex;
cache[cacheIndex] = {};
}
return cache[cacheIndex];
};
return function(context) {
context = context || document.documentElement;
if ( data(context) && data(context).commentJSON ) {
return data(context).commentJSON;
}
var comments = getAllComments(context),
len = comments.length,
comment, cData;
while (len--) {
comment = comments[len];
cData = comment.data.replace(/\r\n|\n/g, '');
if ( /^\s*?\{.+\}\s*?$/.test(cData) ) {
try {
data(comment.parentNode).commentJSON =
(new Function('return ' + cData + ';'))();
} catch(e) {}
}
}
return data(context).commentJSON || true;
};
})();
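The parser above evaluates the comment text with new Function, which executes whatever the comment contains. If the comments can be written as strict JSON (quoted keys, so {"specialID":245} instead of {specialID:245}), a variant based on JSON.parse avoids executing code. A minimal sketch of just that step, with the hypothetical name parseCommentData:

```javascript
// Hypothetical replacement for the "new Function" step: accepts only strict JSON.
// Returns the parsed object, or null when the comment is not a JSON object.
function parseCommentData(commentText) {
  var cleaned = String(commentText).replace(/\r\n|\n/g, "");
  if (!/^\s*\{.+\}\s*$/.test(cleaned)) {
    return null; // an ordinary comment, not a data block
  }
  try {
    return JSON.parse(cleaned);
  } catch (e) {
    return null; // looked like an object but was not valid JSON
  }
}

var parsed = parseCommentData(' {"specialID": 245} ');
```

The trade-off is that the looser object-literal notation used in the examples above would no longer be accepted.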
You can create any attribute if you specify a schema for your page.
For example:
Addthis
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:addthis="http://www.addthis.com/help/api-spec">
...
<a addthis:title="" addthis:url="" ...>
Facebook (even tags)
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
...
<fb:like href="http://developers.facebook.com/" width="450" height="80"/>
The easiest way to avoid use of custom attributes is to use existing attributes.
Use meaningful, relevant class names.
For example, instead of something like type='book' and type='cd'
to represent books and CDs, use classes. Classes are much better for representing what something IS,
e.g. class='book'.
I have used custom attributes in the past, but honestly, there really isn't a need to for them if you make use of existing attributes in a semantically meaningful way.
To give a more concrete example, let's say you have a site giving links to different kinds of stores. You could use the following:
<a href='wherever.html' id='bookstore12' class='book store'>Molly's books</a>
<a href='whereverelse.html' id='cdstore3' class='cd store'>James' Music</a>
css styling could use classes like:
.store { }
.cd.store { }
.book.store { }
In the above example we see that both are links to stores (as opposed to the other unrelated links on the site) and one is a cd store, and the other is a book store.
Embed the data in the dom and use metadata for jQuery.
All the good plug-ins support the metadata plugin(allowing per tag options).
It also allows infinitely complex data/data structures, as well as key-value pairs.
<li class="someclass {'some': 'random', 'json': 'data'} anotherclass">...</li>
OR
<li class="someclass" data="{'some':'random', 'json': 'data'}">...</li>
OR
<li class="someclass"><script type="data">{"some":"random","json":"data"}</script> ...</li>
Then get the data like so:
var data = $('li.someclass').metadata();
if ( data.some && data.some == 'random' )
alert('It Worked!');
I see no problem in using existing XHTML features without breaking anything or extending your namespace. Let's take a look at a small example:
<div id="some_content">
<p>Hi!</p>
</div>
How to add additional information to some_content without additional attributes? What about adding another tag like the following?
<div id="some_content">
<div id="some_content_extended" class="hidden"><p>Some alternative content.</p></div>
<p>Hi!</p>
</div>
It keeps the relation via a well defined id/extension "_extended" of your choice and by its position in the hierarchy. I often use this approach together with jQuery and without actually using Ajax like techniques.
Nay. Try something like this instead:
<div id="foo"/>
<script type="text/javascript">
document.getElementById('foo').myProperty = 'W00 H00! I can add JS properties to DOM nodes without using custom attributes!';
</script>
I'm not using custom attributes, because I'm outputting XHTML, because I want the data to be machine-readable by 3rd-party software (although, I could extend the XHTML schema if I wanted to).
As an alternative to custom attributes, mostly I'm finding the id and class attributes (e.g. as mentioned in other answers) sufficient.
Also, consider this:
If the extra data is to be human-readable as well as machine-readable, then it needs to be encoded using (visible) HTML tags and text instead of as custom attributes.
If it doesn't need to be human readable, then perhaps it can be encoded using invisible HTML tags and text.
Some people make an exception: they allow custom attributes, added to the DOM by Javascript on the client side at run-time. They reckon this is OK: because the custom attributes are only added to the DOM at run-time, the HTML contains no custom attributes.
We've made a web-based editor that understands a subset of HTML - a very strict subset (that understood nearly universally by mail clients). We need to express things like <td width="#INSWIDTH_42#"> in the database, but we can't have that in the DOM, otherwise the browser where the editor runs, freaks out (or is more likely to freak out than it is likely to freak out over custom attributes). We wanted drag-and-drop, so putting it purely in the DOM was out, as was jquery's .data() (the extra data didn't get copied properly). We probably also needed the extra data to come along for the ride in .html(). In the end we settled on using <td width="1234" rs-width="#INSWIDTH_42#"> during the editing process, and then when we POST it all, we remove width and do a regex search-and-destroy s/rs-width=/width=/g.
At first the guy writing most of this was the validation-nazi on this issue and tried everything to avoid our custom attribute, but in the end acquiesced when nothing else seemed to work for ALL our requirements. It helped when he realized that the custom attribute would never appear in an email. We did consider encoding our extra data in class, but decided that would be the greater of two evils.
Personally, I prefer to have things clean and passing validators etc., but as a company employee I have to remember that my primary responsibility is advancing the company's cause (making as much money as quickly as possible), not that of my egotistical desire for technical purity. Tools should work for us; not us for them.
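The width/rs-width swap described above can be sketched in JavaScript roughly as follows. The function name restoreWidthPlaceholders is hypothetical, and the sketch assumes double-quoted attribute values as in the example markup.

```javascript
// Hypothetical POST-time cleanup: drop the editing-time width attribute,
// then rename rs-width back to width.
function restoreWidthPlaceholders(html) {
  return html
    // "\s" before "width" keeps this from matching the tail of "rs-width".
    .replace(/\swidth="[^"]*"/g, "")
    .replace(/\srs-width=/g, " width=");
}

var stored = restoreWidthPlaceholders('<td width="1234" rs-width="#INSWIDTH_42#">');
```

Like the original search-and-replace, this works on the markup as text, so it relies on the editor emitting attributes in a predictable form.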
I know people are against it, but I came up with a super short solution for this. If you want to use a custom attribute like "mine" so for example:
Test
Then you can run this code to get an object back just like jquery.data() does.
var custom_props = {};
$.each($(".selector")[0].attributes, function(i, x) {
    // Collect every attribute whose name contains the "mine-" prefix
    if (this.specified && x.name.indexOf("mine-") !== -1)
        custom_props[x.name.replace("mine-", "")] = x.value;
});
For complex web apps, I drop custom attributes all over the place.
For more public facing pages I use the "rel" attribute and dump all my data there in JSON and then decode it with MooTools or jQuery:
<a rel="{color:red, awesome:true, food: tacos}">blah</a>
I'm trying to stick with HTML 5 data attribute lately just to "prepare", but it hasn't come naturally yet.
Spec: Create an ASP.NET TextBox control which dynamically auto-formats its text as a number, according to properties "DecimalSeparator" and "ThousandsSeparator", using JavaScript.
One way to transfer these properties from the control to JavaScript is to have the control render out custom properties:
<input type="text" id="" decimalseparator="." thousandsseparator="," />
Custom properties are easily accessible by JavaScript. And whilst a page using elements with custom properties won't validate, the rendering of that page won't be affected.
I only use this approach when I want to associate simple types like strings and integers to HTML elements for use with JavaScript. If I want to make HTML elements easier to identify, I'll make use of the class and id properties.
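For illustration, the client-side formatting that would consume those two properties might look like the sketch below. The function name formatNumber is hypothetical, and the input is assumed to be a plain numeric string that uses "." as its decimal point.

```javascript
// Hypothetical formatter: groups the integer part in threes and swaps in
// the configured separators.
function formatNumber(value, decimalSeparator, thousandsSeparator) {
  var parts = String(value).split(".");
  // Insert a separator at every position followed by a multiple of three digits.
  var integerPart = parts[0].replace(/\B(?=(\d{3})+(?!\d))/g, function () {
    return thousandsSeparator;
  });
  return parts.length > 1
    ? integerPart + decimalSeparator + parts[1]
    : integerPart;
}

var english = formatNumber("1234567.89", ".", ",");
var german = formatNumber("1234567.89", ",", ".");
```

In the page, the two separators would be read from the rendered attributes, for example with getAttribute("decimalseparator").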
I use custom fields all the time, for example <a i="" .... Then I reference i with jQuery. Invalid HTML, yes. It works well, yes.
Contrary to answers which say custom attributes won't validate:
Custom attributes will validate.
So will custom tags, as long as the custom tags are lowercase and hyphenated.
Try this in any validator. It will validate.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Custom Test</title>
</head>
<body>
<dog-cat PIANO="yellow">test</dog-cat>
</body>
</html>
Some validators:
https://appdevtools.com/html-validator
https://www.freeformatter.com/html-validator.html
https://validator.w3.org/nu/
The question is: Is it safe? Will it break later?
Custom Tags
No hyphenated tags exist. I believe that W3C will never use a hyphenated tag. And if they did, as long as you use an uncommon prefix, you'll never see a conflict. E.g. <johny-mytag>.
Custom Attributes
There are hyphenated HTML attributes. But the HTML spec promises never to use an attribute starting with data-. So data-myattrib is guaranteed to be safe. However, I believe that W3C will never introduce any attribute that starts with johny-. As long as your prefix is unusual, you'll never see a conflict.
Custom attributes, in my humble opinion, should not be used as they do not validate. Alternative to that, you can define many classes for a single element like:
<div class='class1 class2 class3'>
Lorem ipsum
</div>