DOMstring parser - javascript

I have a DOMString object, the text of a web page that I get from a server using XMLHttpRequest. I need to cut out a substring that lies between specific tags. Is there an easy way to do this? Methods such as substring() or slice() won't work in my case because the content of the web page is dynamic, so I can't specify where the substring begins and ends (I only know that it's surrounded by <tag> and </tag>).

yourString.substring(yourString.indexOf('<tag>') + 5, yourString.indexOf('</tag>'));
This should work, assuming you know the name of the surrounding tags.
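A slightly more defensive version of the same idea, as a minimal sketch: the helper name textBetweenTags is hypothetical, and it assumes the opening tag carries no attributes. It returns null when either tag is missing (indexOf returns -1 in that case) instead of a misleading substring.

```javascript
// Hypothetical helper: returns the text between the first <tagName>
// and the next </tagName>, or null if either tag is missing.
function textBetweenTags(source, tagName) {
  var open = '<' + tagName + '>';
  var close = '</' + tagName + '>';

  var start = source.indexOf(open);
  if (start === -1) return null;
  start += open.length; // skip past the opening tag itself

  // Search for the closing tag only after the opening tag.
  var end = source.indexOf(close, start);
  if (end === -1) return null;

  return source.substring(start, end);
}

console.log(textBetweenTags('<p>x</p><tag>Foo</tag>', 'tag'));
```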

A DOMString is just implemented as a string in most (all?) JavaScript browser environments so you can use any parsing technique you like, including regular expressions, DOMParser, and the HTML parser provided by libraries such as jQuery. For example:
function extractText(domString) {
var m = (''+domString).match(/<tag>(.*?)<\/tag>/i);
return (m) ? m[1] : null; // m[1] is the captured text between the tags
}
Of course, this is a terrible idea; you should really use a DOM parser, for example, with jQuery:
$('tag', htmlString).html();
[Edit] To clarify the above jQuery example, it's the equivalent of doing something like below:
function extractText2(tagName, htmlString) {
var div = document.createElement('div'); // Build a DOM element.
div.innerHTML = htmlString; // Set its contents to the HTML string.
var el = div.getElementsByTagName(tagName); // Find the target tag.
return (el.length > 0) ? el[0].textContent : null; // Return its contents.
}
extractText2('tag', '<tag>Foo</tag>'); // => "Foo"
extractText2('x', '<x><y>Bar</y></x>'); // => "Bar"
extractText2('y', '<x><y>Bar</y></x>'); // => "Bar"
This solution is better than a regex solution since it will handle the HTML syntax nuances on which the regex solution would fail. Of course, it likely needs some cross-browser testing, hence the recommendation to use a library like jQuery (or Prototype, ExtJS, etc).

Assuming the surrounding tag is unique in the string...
domString.match(/.*<tag>(.*)<\/tag>.*/)[1]
or
/.*<tag>(.*)<\/tag>.*/.exec(domString)[1]
Seems like it should do the trick

Like @Gus's answer, but improved for the case where the tags contain only text and are repeated:
"<tag>asd</tag>".match(/<tag>[^<]+<\/tag>/);
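If the tag occurs several times, a global regex with a capture group can collect every occurrence. This is a minimal sketch, assuming the tags contain only text (no nested markup); the function name allTagContents is hypothetical.

```javascript
// Hypothetical helper: collects the text of every <tag>...</tag> pair.
function allTagContents(source) {
  var re = /<tag>([^<]*)<\/tag>/g;
  var results = [];
  var m;
  // With the g flag, exec() resumes after the previous match each time.
  while ((m = re.exec(source)) !== null) {
    results.push(m[1]); // m[1] is the text between the tags
  }
  return results;
}

console.log(allTagContents('<tag>asd</tag> and <tag>qwe</tag>'));
```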

Related

wrap HTML tags in plain string with another HTML tag

I want to wrap a HTML tag with another HTML tag in a string (so not a DOM element, a plain string). I created this function but I wonder if I could do it in one go without a forEach loop.
This is the working function:
function style(content) {
var tempStyledContent = content;
var imgMatches = tempStyledContent.match(/(<img.*?src=[\"'](.+?)[\"'].*?>)/g);
imgMatches.forEach(function (imgMatch) {
var imgTag = imgMatch;
var imgSrc = imgMatch.match(/src\s*=\s*["'](.+?)["']/)[1];
tempStyledContent = tempStyledContent.replace(imgTag,
"<a href=\"" + imgSrc + "\" data-fancybox>" + imgTag + "</a>");
});
return tempStyledContent;
}
The parameter content is a string with HTML code in it. The function above outputs the same html as the input but with the (fancybox) a tags surrounding all the child img tags.
So an input string like
"<div><img src='example.jpg'/></div>"
will output
"<div><a href='example.jpg' data-fancybox><img src='example.jpg'/></a></div>"
Can anyone improve this? I know too little about regex's to make this better.
Manipulating HTML with regex is notoriously problematic. Changes that would be trivial in a DOM parser can be very difficult to create a robust regex for; and when regex fails, it fails silently, which makes errors easy to miss. When working in regex you also have to be careful to handle all possible variations in markup such as whitespace, attribute order, quoting style, tag closing style, attribute contents that resemble html but which you don't want modified, etc.
As discussed exhaustively in the comment thread below, given enough time and effort it's certainly possible to handle all of these things in regex; but it leads to a complex, difficult to maintain regex -- and most importantly it's difficult to be certain your regex accommodates every possible valid markup variation. DOM parsing handles all of this stuff automatically, and lets you work with the structured data directly instead of having to cope with all the possible variations in its string representation.
Therefore, if you need to make nontrivial changes to an HTML string, it's almost always best to convert your HTML into a true DOM tree, manipulate that using standard DOM methods, then (if necessary) convert it back into a string. Fortunately it doesn't take a lot of code to do so. Here's a simple vanilla JS demo:
var htmlToElement = function(html) {
var template = document.createElement('template');
template.innerHTML = html.trim();
return template.content.firstChild;
};
var elementToHtml = function(el) {
return el.outerHTML;
};
// Usage demo:
var string = "<div>This <b>is some</b> <i>html</i><img src='http://example.com'></div>";
var foo = htmlToElement(string);
// perform your DOM manipulation as needed on foo here. This would look much simpler if I wasn't so stubborn about avoiding jQuery these days, but here we are anyway:
foo.querySelectorAll('img').forEach(function(img) {
var link = document.createElement('a');
link.setAttribute('data-fancybox',true);
link.setAttribute('href', img.getAttribute('src'));
img.parentNode.insertBefore(link,img);
link.appendChild(img);
});
// back to a string:
var bar = elementToHtml(foo);
console.log(bar);
OK, I'm probably going to do DOM manipulation as @DanielBeck suggested. Once Knockout has finished binding, I will use $.wrap (http://api.jquery.com/wrap/) to do my manipulation. I just hoped there was an easy way without using jQuery, so if there are other suggestions, please comment.
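For the narrow case in the question, a single replace call with backreferences avoids both the loop and jQuery. This is only a sketch: wrapImages is a hypothetical name, and the pattern assumes quoted src attributes and no > inside attribute values, so the caveats above about regex and HTML still apply.

```javascript
// Hypothetical one-pass version of the question's style() function.
function wrapImages(content) {
  // Group 1 captures the opening quote, group 2 the src value;
  // \1 requires the same quote character to close the value.
  // In the replacement, $& stands for the whole matched <img> tag.
  return content.replace(
    /<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1[^>]*>/gi,
    '<a href="$2" data-fancybox>$&</a>'
  );
}

console.log(wrapImages("<div><img src='example.jpg'/></div>"));
```

Note that the generated href always uses double quotes, whatever quoting the original img tag used.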

javascript - string starts with an html tag

Can anyone suggest a way to tell if the first word in a string is an html tag?
At the moment, I am doing this:
var text = model.get('message');
try {
$(text)[0];
} catch (_error) {
text = text.replace(/\n/g, '<br />');
}
But this seems terribly inefficient.
I'd suggest doing exactly what the library you're using (jQuery) does. Here's an extract from its source:
// A simple way to check for HTML strings
// Prioritize #id over <tag> to avoid XSS via location.hash (#9521)
// Strict HTML recognition (#11290: must start with <)
rquickExpr = /^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]*))$/,
So you could simply do
if (rquickExpr.test(text)) { // it's HTML
Note that there's no guarantee it's really valid HTML.
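One caveat: the same expression also matches #id selectors, so checking the first capture group is safer than a bare test(). A minimal sketch follows; the function names startsWithTag and formatMessage are hypothetical.

```javascript
var rquickExpr = /^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]*))$/;

// True only when the <tag> branch of the expression matched.
function startsWithTag(text) {
  var m = rquickExpr.exec(text);
  // m[1] is undefined when only the #id branch matched.
  return m !== null && m[1] !== undefined;
}

// Mirrors the question's logic: convert newlines only for plain text.
function formatMessage(text) {
  return startsWithTag(text) ? text : text.replace(/\n/g, '<br />');
}

console.log(formatMessage('line one\nline two'));
```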

regular expression to unlink html code with javascript

I'm sorry, I can't believe this question hasn't already been answered on Stack Overflow, but I've searched a lot and can't find a solution.
I want to change HTML code with regular expressions in this way:
testing anchor
to
testing anchor
I only want to unlink the text without using DOM functions. The code is in a string, not in the document, and I don't want to remove any tags other than the a tags.
If you really don't want to use DOM functions (why ?) you might do
str = str.replace(/<[^>]*>/g, '')
You can use it if you're fairly confident you don't have a more complex HTML but it will fail in many cases, for example some nested tags, or > in an attribute. You might fix some of the problems with more complex regular expressions but they aren't the right tool for this job in the general case.
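A quick illustration of that failure mode, using a made-up input string with a > inside an attribute value:

```javascript
var input = '<img alt="a > b" src="x">text';
var stripped = input.replace(/<[^>]*>/g, '');

// The match stops at the first '>', which sits inside the alt attribute,
// so the rest of the tag leaks into the result.
console.log(stripped); // ' b" src="x">text'
```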
If you don't want to remove other tags than a, do this :
str = str.replace(/<\/?a( [^>]*)?>/g, '')
This changes
<a>testing</a> <b>a</b>nchor<div>test</div><aaa>E</aaa>
to
testing <b>a</b>nchor<div>test</div><aaa>E</aaa>
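Wrapped in a function, with two small and optional changes (\s instead of a literal space, and the i flag for uppercase tags), the same replacement might look like the following sketch. The name unlinkAnchors is hypothetical.

```javascript
// Hypothetical helper: removes opening and closing <a> tags only,
// keeping their contents and every other tag.
function unlinkAnchors(str) {
  // After "<a" or "</a" the pattern requires either ">" or whitespace,
  // so tags such as <aaa> or <abbr> are left untouched.
  return str.replace(/<\/?a(\s[^>]*)?>/gi, '');
}

console.log(unlinkAnchors('<a>testing</a> <b>a</b>nchor<div>test</div><aaa>E</aaa>'));
```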
I know you only want regex, but for future viewers, here is a trivial solution using DOM methods.
var a = document.createElement("div");
a.innerHTML = 'testing anchor';
var wordsOnly = a.textContent || a.innerText;
This will not fail on complicated use cases, allows nested tags and it's perfectly clear what's happening:
Hey browser! Create an element
Put that HTML in it
Give me back just the text, that's what I want now.
NOTE:
The element we're creating will not be added to the actual DOM since we're not adding it anywhere, it'll stay invisible. Here is a fiddle to illustrate how this works.
As has been mentioned, you cannot parse HTML with regular expressions. The principal reason is that HTML elements nest and regular expressions cannot handle that.
That said, with a few restrictions which I will mention, you can do the following :
string.replace (/(\b\w+\s*)<a\s+href="([^"]*)">(.*)<\/a>/g, '$1 $3')
This requires a word before the tag (spacing between the word and the tag is optional) and no attributes in the <a> tag other than href; it accepts anything between the <a> and the </a>.
You can create a DOM object from the string and use DOM methods to parse it, without ever appending that DOM object to the document.

Modify HTML source on-the-fly with xPath using Javascript

Let's say I have an HTML source, something like:
Google
Yahoo
MSN
Is there any way for me to modify this HTML source with XPath using JavaScript: find all anchors, prepend text to them, and show the new HTML source in an alert box?
Visit Google
Visit Yahoo
Visit MSN
If you need full power of XSLT for making several transformations, you could use something like sarissa.
I think you might be confusing xPath expressions with CSS selectors, so for that case I would recommend to use the following jQuery code:
// Put a script tag including jquery.js here
<div id="container">
Google
Yahoo
MSN
</div>
<script>
$("a").prepend("Visit ");
alert($("#container").html());
</script>
The short answer is "no".
XPath is a method for selecting elements in a DOM. It can also be used to read attributes and calculate values, but it can't be used to modify the DOM. You might be getting confused with XSLT, which uses XPath expressions to select elements and can return a modified document. You could use a generic XML document, then use different XSL style sheets using XSLT to generate different documents in various languages, say HTML, XML, postscript, and so on.
In any case, why would you bother with XPath in this case? There is a document.links collection that requires simple property access, no function calls or evaluating XPath expressions. You can change simple text content by assigning to the W3C textContent or proprietary MS innerText property (again, simple property access rather than function calls):
function modLinks() {
var links = document.links;
var i = links.length;
while (i--) {
setText(links[i], 'Visit ' + getText(links[i]) );
}
}
// Simple helper functions, can be made faster and more robust
// but sufficient for an example.
function getText(el) {
if (typeof el.textContent == 'string') {
return el.textContent;
} else if (typeof el.innerText == 'string') {
return el.innerText;
}
}
function setText(el, text) {
if (typeof el.textContent == 'string') {
el.textContent = text;
} else if (typeof el.innerText == 'string') {
el.innerText = text;
}
}
As mentioned above, XPath doesn't work effectively with unparsed strings.
So one approach would be to set the innerHTML of some element (e.g. an invisible <div>) to your HTML source string.
This would cause the source to be parsed into a DOM tree. Then you could use myDiv.getElementsByTagName('a') or jQuery $('a', myDiv) to find the links. (You could even use XPath .//a, but why use a more complex tool when a simpler one will do?)
Then once you've modified the strings, e.g. as somebody said using jQuery $('a', myDiv).prepend("Visit "); you could output the modified HTML by retrieving the innerHTML property of the invisible div.

Transform URL into a link unless there already was a link

I know this has been discussed here before, but no solution was offered for this exact problem. Please take a look...
I'm using a function to transform plain-text URLs into clickable links. This is what I have:
<script type='text/javascript' language='javascript'>
window.onload = autolink;
function autolink(text) {
var exp = /(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gim;
document.body.innerHTML = document.body.innerHTML.replace(exp,"<a href='$1'>$1</a>");
}
</script>
This makes
https://stackoverflow.com/
look like:
https://stackoverflow.com/
It works, but it also replaces existing HTML links with nested links.
So, a valid HTML link like
StackOverflow
Becomes something messy like:
StackOverflow">StackOverflow</a>...
How can I fix the expression to ignore the content of link tags? Thanks!
I'm a newbie... I barely understand the regex code. Please be gentle :) Thanks again.
Using the jQuery JavaScript library, this would look like (demo at http://jsfiddle.net/BRPRH/4):
function autolink() {
var exp = /(\b(https?|ftp):\/\/[-A-Z0-9+\u0026@#\/%?=~_|!:,.;]*[-A-Z0-9+\u0026@#\/%=~_|])/gi,
lt = '\u003c',
gt = '\u003e';
$('*:not(a, script, style, textarea)').contents().each(function() {
if (this.nodeType == Node.TEXT_NODE) {
var textNode = $(this);
var span = $(lt + 'span/' + gt).text(this.nodeValue);
span.html(span.html().replace(exp, lt + 'a href=\'$1\'' + gt + '$1' + lt + '/a' + gt));
textNode.replaceWith(span);
}
});
}
$(autolink);
Edit: Excluded textareas, scripts, and embedded CSS. I note that this can also be done using pure DOM's splitText, which has the advantage of not adding extra span elements.
Edit 2: Eliminated all ampersands and double quotes.
Edit 3: Got rid of < and > characters as well.
This problem is beyond the power of regular expressions. You might be able to write a regex that could avoid some links, but you wouldn't be able to avoid every existing link.
The good news is that a different approach will make the job much easier. Right now you are using document.body.innerHTML to manipulate the HTML as plain text. To do it correctly that way, you would basically need to parse the HTML yourself. But you don't have to, because the browser has already parsed it for you!
The web browser lets you access an HTML document as a tree of objects. It's called the Document Object Model (DOM), and if you do some reading on it, you should be able to learn how to traverse the HTML, skipping over anything inside an A element and using the regex you have on plain text only.
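The "plain text only" step can be kept separate from the DOM walk. The sketch below covers just that step: it splits the text of a single text node into URL and non-URL parts. The name splitAtUrls is hypothetical, and the pattern assumes the character classes in the question's regex originally contained @#. A DOM walker would call this for each text node outside an A element and build a link element for every part flagged as a URL.

```javascript
// Assumed URL pattern, same shape as the one in the question.
var urlPattern = /\b(?:https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/gi;

// Hypothetical helper: splits plain text into { text, isUrl } parts.
function splitAtUrls(text) {
  var parts = [];
  var last = 0;
  var m;
  urlPattern.lastIndex = 0; // reset the shared global regex before each scan
  while ((m = urlPattern.exec(text)) !== null) {
    if (m.index > last) {
      parts.push({ text: text.slice(last, m.index), isUrl: false });
    }
    parts.push({ text: m[0], isUrl: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    parts.push({ text: text.slice(last), isUrl: false });
  }
  return parts;
}

console.log(splitAtUrls('See https://stackoverflow.com/ now.'));
```

Because the function never sees markup, it cannot create nested links; skipping existing links is left entirely to the code that chooses which text nodes to pass in.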
