I know that counting the number of tags in a document can be done with something like the following:
var tableCount = $('body table tr').length;
Now I presume that this only counts the opening tags. What I want to know is whether there is the same number of closing tags. So if the code above reports 72 tr tags, I now want something to tell me that there are also 72 closing tr tags.
Is this possible?
Thanks
Ideally, you would use a function like this:
function checkTable(tableElement) {
    // Get inner HTML
    var html = tableElement.innerHTML;
    // Count <tr> (match() returns null when nothing matches, hence the fallback)
    var count1 = (html.match(/<tr/g) || []).length;
    // Count </tr>
    var count2 = (html.match(/<\/tr/g) || []).length;
    // Equals?
    return count1 === count2;
}
However, the browser's HTML parser auto-corrects mismatched tags (for example by auto-closing them) while it builds the DOM. A running page therefore cannot validate itself this way. Here is a proof of concept: JS Bin.
Explanation: The second table has a typo (opening tag instead of a closing tag), but the function returns true in both cases. If one inspects the generated HTML (the one that is accessible through DOM), one can see that the browser auto-corrected the mismatched tags (there is an additional empty table row).
Luckily, there is another way. To obtain the pure (i.e. not modified by the browser) HTML code, you can make an AJAX request to the current page URL. Yes, you read correctly: the page loads itself again. But no worries, there is no recursion and no risk of a stack overflow here, since the fetched page is only treated as text and is never rendered or executed.
The JS code for this is:
var selfUrl = document.location.href;
function checkHTML(html) {
    // Count <tr>
    var count1 = (html.match(/<tr/g) || []).length;
    console.log(count1);
    // Count </tr>
    var count2 = (html.match(/<\/tr/g) || []).length; // </tr (do not remove this comment!)
    console.log(count2);
    // Equals?
    return count1 === count2;
}
$.get(selfUrl, function(html) {
console.log(checkHTML(html));
});
But beware of one pitfall. If you include this code in the HTML itself (usually discouraged), then you must not remove that one comment. The reason is the following: one regex contains <tr, while the other has the forward slash escaped and therefore does not contain </tr. Since you fetch the whole HTML code (including the JS code), the counts would be mismatched. To even this out, I have added an additional </tr inside a comment.
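As a rough sketch of the same counting idea, the patterns can be assembled from string pieces so that the script source never contains a literal opening or closing tr tag. That would make the balancing comment unnecessary. The function name countRowTags and the shape of its return value are assumptions made for illustration, not part of the answer above.

```javascript
// Hypothetical helper: counts opening and closing tr tags in a raw HTML string.
// The patterns are built from concatenated pieces, so this source text does not
// itself contain the tags being counted (useful if the script is inline).
function countRowTags(html) {
  var openRe = new RegExp('<' + 'tr\\b', 'gi');     // opening tag, with or without attributes
  var closeRe = new RegExp('<' + '/tr\\s*>', 'gi'); // closing tag
  // match() returns null when there is no match, so fall back to an empty array
  var open = (html.match(openRe) || []).length;
  var close = (html.match(closeRe) || []).length;
  return { open: open, close: close, balanced: open === close };
}
```

It could be called from the $.get callback in place of checkHTML, for example console.log(countRowTags(html).balanced). Being regex-based, it would still be misled by tags that appear inside comments or string literals.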
Your question reminds me of the idea behind a SAX parser, since HTML is structurally similar to XML. A SAX parser reports start and end tags as it reads them, as well as element attributes and content.
Some time ago I used a simple SAX-style parser library from: http://ejohn.org/blog/pure-javascript-html-parser/
Available at: http://ejohn.org/files/htmlparser.js
Using this library you can do the following:
$(document).ready(function(){
    var htmlString = $('#myTable').html(),
        countStart = 0,
        countEnd = 0;
    HTMLParser(htmlString, {
        start: function(tag, attrs, unary) {
            countStart += 1; // add a check such as tag === 'tr' to count only specific tags
            console.log("start: " + tag);
        },
        end: function(tag) {
            countEnd += 1; // add a check such as tag === 'tr' to count only specific tags
            console.log("end: " + tag);
        },
        chars: function(text) {},
        comment: function(text) {}
    });
});
There are also modern Node-based approaches like: https://github.com/isaacs/sax-js/blob/master/examples/example.js which can be used for the same task.
I need to create a JavaScript function that downloads the HTML source code of a web page and returns the number of times a CSS class is mentioned.
var str = document.body.innerHTML;
function getFrequency(str) {
var freq = {};
for (var i=0; i<string.length;i++) {
var css_class = "ENTER CLASS HERE";
if (freq[css_class]) {
freq[css_class]++;
} else {
freq[css_class] = 1;
}
}
return freq;
};
What am I doing wrong here?
What am I doing wrong here?
I hate to say it, but fundamentally... everything. Getting information about HTML does not involve string functions or regular expressions. HTML cannot be dealt with this way; its rules are far too complex.
HTML needs to be parsed by an HTML parser.
In the browser there are two possible scenarios:
If you work with the current document (as you seem to do), then the parsing is already done by the browser.
Counting the number of times a CSS class is used actually is the same thing as finding out how many HTML elements have that class. And that is easily done via document.querySelectorAll() and a CSS selector.
var elements = document.querySelectorAll(".my-css-class");
alert("There are " + elements.length + " occurrences of the class.");
If you have an HTML string that you loaded from somewhere, you need to parse it first. In JavaScript you can make the browser parse the HTML for you very easily:
var html = '<div class="my-css-class">some random HTML</div>';
var div = document.createElement("div");
div.innerHTML = html; // parsing happens here
Now you can employ the same strategy as above, only with div as your selector context:
var elements = div.querySelectorAll(".my-css-class");
alert("There are " + elements.length + " occurrences of the class.");
Possible Duplicate:
JSON pretty print using JavaScript
I'd like to display my raw JSON data on an HTML page just as JSONView does. For example, my raw JSON data is:
{
"hey":"guy",
"anumber":243,
"anobject":{
"whoa":"nuts",
"anarray":[
1,
2,
"thr<h1>ee"
],
"more":"stuff"
},
"awesome":true,
"bogus":false,
"meaning":null,
"japanese":"明日がある。",
"link":"http://jsonview.com",
"notLink":"http://jsonview.com is great"
}
It comes from http://jsonview.com/, and what I want to achieve is like http://jsonview.com/example.json if you use Chrome and have installed the JSONView plugin.
I've tried but failed to understand how it works. I'd like to use a JS script (with CSS for highlighting) to custom-format my raw JSON data, which is retrieved by AJAX, and finally put it on an HTML page at any position, for example inside a div element. Are there any existing JS libraries that can achieve this? Or how can I do it?
I think all you need to display the data on an HTML page is JSON.stringify.
For example, if your JSON is stored like this:
var jsonVar = {
text: "example",
number: 1
};
Then you need only do this to convert it to a string:
var jsonStr = JSON.stringify(jsonVar);
And then you can insert into your HTML directly, for example:
document.body.textContent = jsonStr; // textContent keeps characters such as "<" in the data literal
Of course you will probably want to replace body with some other element via getElementById.
As for the CSS part of your question, you could use RegExp to manipulate the stringified object before you put it into the DOM. For example, this code (also on JSFiddle for demonstration purposes) should take care of indenting of curly braces.
var jsonVar = {
text: "example",
number: 1,
obj: {
"more text": "another example"
},
obj2: {
"yet more text": "yet another example"
}
}, // THE RAW OBJECT
jsonStr = JSON.stringify(jsonVar), // THE OBJECT STRINGIFIED
regeStr = '', // AN EMPTY STRING TO EVENTUALLY HOLD THE FORMATTED STRINGIFIED OBJECT
f = {
brace: 0
}; // AN OBJECT FOR TRACKING INCREMENTS/DECREMENTS,
// IN PARTICULAR CURLY BRACES (OTHER PROPERTIES COULD BE ADDED)
regeStr = jsonStr.replace(/({|}[,]*|[^{}:]+:[^{}:,]*[,{]*)/g, function (m, p1) {
var rtnFn = function() {
return '<div style="text-indent: ' + (f['brace'] * 20) + 'px;">' + p1 + '</div>';
},
rtnStr = 0;
if (p1.lastIndexOf('{') === (p1.length - 1)) {
rtnStr = rtnFn();
f['brace'] += 1;
} else if (p1.indexOf('}') === 0) {
f['brace'] -= 1;
rtnStr = rtnFn();
} else {
rtnStr = rtnFn();
}
return rtnStr;
});
document.body.innerHTML += regeStr; // appends the result to the body of the HTML document
This code simply looks for sections of the object within the string and separates them into divs (though you could change the HTML part of that). Every time it encounters a curly brace, it increments or decrements the indentation, depending on whether the brace is opening or closing (behaviour similar to the space argument of JSON.stringify). You could use this as a basis for other types of formatting.
Note that the link you provided is not an HTML page, but rather a JSON document. The formatting is done by the browser.
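For comparison, here is a minimal sketch that relies on the space argument of JSON.stringify for the indentation and escapes the result before it is placed in a pre element. Escaping matters for data like the "thr<h1>ee" value in the question. The helper names escapeHtml and jsonToPreHtml are made up for this illustration.

```javascript
// Hypothetical helper: make text safe to use as element content.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: pretty-print a value with two-space indentation
// (the third argument of JSON.stringify) and wrap it in a <pre> element.
function jsonToPreHtml(value) {
  return '<pre>' + escapeHtml(JSON.stringify(value, null, 2)) + '</pre>';
}
```

The resulting string could then be assigned to the innerHTML of a container element. Syntax colouring would still need a highlighting step on top of this.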
You have to decide if:
You want to show the raw JSON (not an HTML page), as in your example
Show an HTML page with formatted JSON
If you want 1., just have your application return the JSON as the response body, with the MIME type set to application/json.
In this case, formatting is handled by the browser (and/or browser plugins).
If 2., it's a matter of rendering a simple minimal HTML page with the JSON where you can highlight it in several ways:
server-side, depending on your stack. There are solutions for almost every language
client-side with Javascript highlight libraries.
If you give more details about your stack, it's easier to provide examples or resources.
EDIT: For client-side JS highlighting you can try highlight.js, for instance.
JSON placed in any HTML tag other than <script> is just text, no different from adding a story to your HTML page.
However, about formatting, that's another matter. I guess you should change the title of your question.
Take a look at this question. Also see this page.
We have a glossary with up to 2000 terms, where each glossary term may
consist of one, two or three words (separated by either whitespace
or a dash).
Now we are looking for a solution for highlighting all terms inside a
(longer) HTML document (up to 100 KB of HTML markup) in order to
generate a static HTML page with the highlighted terms.
The constraints for a working solution are: large number of glossary terms
and long HTML documents...what would be the blueprint for an efficient solution
(within Python).
Right now I am thinking about parsing the HTML document using lxml, iterating over all text nodes and then matching the contents within each text node against all glossary terms.
Client-side (browser) highlighting on the fly is not an option since IE will complain about long running scripts with a script timeout...so unusable for production use.
Any better idea?
You could use a parser to walk the tree recursively and replace only the text nodes.
In doing so, there are still several things you will need to account for:
- Not all text needs to be replaced (ex. Inline javascript)
- Some elements of the document might not need parsing (ex. Headings, etc.)
Here's a quick and non-production ready example of how you could achieve this :
html = """The HTML you need to parse"""
from bs4 import BeautifulSoup, NavigableString  # BeautifulSoup 4
IGNORE_TAGS = ['script', 'style']
def parse_content(item, replace_what, replace_with, ignore_tags=IGNORE_TAGS):
    # iterate over a copy, because replace_with() modifies item.contents
    for content in list(item.contents):
        if isinstance(content, NavigableString):
            # plain text node: do the replacement on its text
            content.replace_with(content.replace(replace_what, replace_with))
        elif content.name not in ignore_tags:
            # element node: recurse unless it is one of the ignored tags
            parse_content(content, replace_what, replace_with, ignore_tags)
    return item
soup = BeautifulSoup(html, 'html.parser')
body = soup.body or soup  # fall back to the whole tree if there is no <body>
replaced_content = parse_content(body, 'a', 'b')
This should replace every occurrence of an "a" with a "b", while leaving alone content that is:
- Inside inline javascript or css (Although inline JS or CSS should not appear in a document's body).
- A reference in a tag such as img, a...
- A tag itself
Of course, depending on your glossary, you will then need to make sure that you don't replace only part of a word with something else; to do this it makes sense to use a regex instead of content.replace.
I think highlighting with client-side javascript is the best option. It saves your server processing time and bandwidth, and more important, keeps html clean and usable for those who don't need unnecessary markup, for example, when printing or converting to other formats.
To avoid timeouts, just split the job into chunks and process them one by one in a function that reschedules itself with setTimeout. Here's an example of this approach:
function hilite(terms, chunkSize) {
// prepare stuff
var terms = new RegExp("\\b(" + terms.join("|") + ")\\b", "gi");
// collect all text nodes in the document
var textNodes = [];
$("body").find("*").contents().each(function() {
if (this.nodeType == 3)
textNodes.push(this)
});
// process N text nodes at a time, surround terms with text "markers"
function step() {
for (var i = 0; i < chunkSize; i++) {
if (!textNodes.length)
return done();
var node = textNodes.shift();
node.nodeValue = node.nodeValue.replace(terms, "\x1e$&\x1f");
}
setTimeout(step, 100);
}
// when done, replace "markers" with html
function done() {
$("body").html($("body").html().
replace(/\x1e/g, "<b>").
replace(/\x1f/g, "</b>")
);
}
// let's go
step()
}
Use it like this:
$(function() {
hilite(["highlight", "these", "words"], 100)
})
Let me know if you have questions.
How about going through each term in the glossary and then, for each term, using regex to find all occurrences in the HTML? You could replace each of those occurrences with the term wrapped in a span with a class "highlighted" that will be styled to have a background color.
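A rough sketch of that idea follows. Instead of one pass per term, it builds a single alternation so each piece of text is scanned once, which may matter with 2000 terms. It is meant to be applied to the text of individual text nodes, not to raw markup. The function names, the longest-first ordering, and the handling of whitespace or a dash between words are assumptions for illustration.

```javascript
// Hypothetical helpers; none of these names come from the answers above.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// One regex for all terms: longest terms first so "hash table" wins over "hash";
// words inside a term may be separated by whitespace or a dash.
function buildGlossaryRegex(terms) {
  var sorted = terms.filter(Boolean).sort(function (a, b) { return b.length - a.length; });
  var alternatives = sorted.map(function (term) {
    return term.split(/[\s-]+/).map(escapeRegExp).join('[\\s-]+');
  });
  return new RegExp('\\b(?:' + alternatives.join('|') + ')\\b', 'gi');
}

// Takes the plain text of one text node and returns HTML with matches wrapped.
function highlightText(text, terms) {
  if (!terms.filter(Boolean).length) return escapeHtml(text);
  var re = buildGlossaryRegex(terms);
  var out = '';
  var last = 0;
  var m;
  while ((m = re.exec(text)) !== null) {
    out += escapeHtml(text.slice(last, m.index)) +
      '<span class="highlighted">' + escapeHtml(m[0]) + '</span>';
    last = m.index + m[0].length;
  }
  return out + escapeHtml(text.slice(last));
}
```

Note that \b only treats ASCII letters, digits and the underscore as word characters, so terms with accented letters would need a different boundary check.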
I'm making a highlighting plugin for a client to find things in a page, and I decided to test it with a help viewer I'm still building, but I'm having an issue that'll (probably) require some regex.
I do not want to parse HTML, and I'm totally open to doing this differently; this just seems like the best/right way.
http://oscargodson.com/labs/help-viewer
http://oscargodson.com/labs/help-viewer/js/jquery.jhighlight.js
Type something in the search... ok, refresh the page, now type, like, class or class=" or type <a you'll notice it'll search the actual HTML (as expected). How can I only search the text?
If i do .text() it'll vaporize all the HTML and what i get back will just be a big blob of text, but i still want the HTML so I dont lose formatting, links, images, etc. I want this to work like CMD/CTRL+F.
You'd use this plugin like:
$('article').jhighlight({find:'class'});
To remove them:
.jhighlight('remove')
==UPDATE==
While Mike Samuel's idea below does in fact work, it's a tad heavy for this plugin. It's mainly for a client looking to erase bad words and/or MS Word characters during a "publishing" process of a form. I'm looking for a more lightweight fix, any ideas?
You really don't want to use eval, mess with innerHTML or parse the markup "manually". The best way, in my opinion, is to deal with text nodes directly and keep a cache of the original html to erase the highlights. Quick rewrite, with comments:
(function($){
$.fn.jhighlight = function(opt) {
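// copy the defaults instead of mutating them; 'remove' is a command, not an options object
var options = $.extend({}, $.fn.jhighlight.defaults, typeof opt === 'object' ? opt : {})
, txtProp = this[0].textContent ? 'textContent' : 'innerText';
if (opt !== 'remove' && $.trim(options.find).length < 1) return this;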
return this.each(function(){
var self = $(this);
// use a cache to clear the highlights
if (!self.data('htmlCache'))
self.data('htmlCache', self.html());
if(opt === 'remove'){
return self.html( self.data('htmlCache') );
}
// create Tree Walker
// https://developer.mozilla.org/en/DOM/treeWalker
var walker = document.createTreeWalker(
this, // walk only on target element
NodeFilter.SHOW_TEXT,
null,
false
);
var node
, matches
, flags = 'g' + (!options.caseSensitive ? 'i' : '')
, exp = new RegExp('('+options.find+')', flags) // capturing
, expSplit = new RegExp(options.find, flags) // no capturing
, highlights = [];
// walk this wayy
// and save matched nodes for later
while(node = walker.nextNode()){
if (matches = node.nodeValue.match(exp)){
highlights.push([node, matches]);
}
}
// must replace stuff after the walker is finished
// otherwise replacing a node will halt the walker
for(var nn=0,hln=highlights.length; nn<hln; nn++){
var node = highlights[nn][0]
, matches = highlights[nn][1]
, parts = node.nodeValue.split(expSplit) // split on matches
, frag = document.createDocumentFragment(); // temporary holder
// add text + highlighted parts in between
// like a .join() but with elements :)
for(var i=0,ln=parts.length; i<ln; i++){
// non-highlighted text
if (parts[i].length)
frag.appendChild(document.createTextNode(parts[i]));
// highlighted text
// skip last iteration
if (i < ln-1){
var h = document.createElement('span');
h.className = options.className;
h[txtProp] = matches[i];
frag.appendChild(h);
}
}
// replace the original text node
node.parentNode.replaceChild(frag, node);
};
});
};
$.fn.jhighlight.defaults = {
find:'',
className:'jhighlight',
color:'#FFF77B',
caseSensitive:false,
wrappingTag:'span'
};
})(jQuery);
If you're doing any other manipulation on the page, you might want to replace the caching with another clean-up mechanism, although that is not trivial.
You can see the code working here: http://jsbin.com/anace5/2/
You also need to add display:block to your new HTML elements, because the layout is broken in a few browsers.
In the javascript code prettifier, I had this problem. I wanted to search the text but preserve tags.
What I did was start with HTML, and decompose that into two bits.
The text content
Pairs of (index into text content where a tag occurs, the tag content)
So given
Lorem <b>ipsum</b>
I end up with
text = 'Lorem ipsum'
tags = [6, '<b>', 11, '</b>']
which allows me to search on the text, and then based on the result start and end indices, produce HTML including only the tags (and only balanced tags) in that range.
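The decomposition step could be sketched roughly as follows. This is not the prettifier's actual code: the function name decompose and the simple tokenizing regex are assumptions, and entities and comments are not handled. Each tag is recorded together with the number of text characters that precede it.

```javascript
// Hypothetical sketch: split an HTML string into its text content and a flat
// list of (offset into text, tag source) pairs.
function decompose(html) {
  var text = '';
  var tags = [];
  var tokenRe = /<[^>]*>|[^<]+/g; // either a tag or a run of text
  var m;
  while ((m = tokenRe.exec(html)) !== null) {
    if (m[0].charAt(0) === '<') {
      // a tag: remember where in the text it occurs
      tags.push(text.length, m[0]);
    } else {
      // a run of text: append it to the text content
      text += m[0];
    }
  }
  return { text: text, tags: tags };
}
```

A search can then run against the text property, and the offsets in tags indicate which tags fall inside or around a match.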
Have a look here: getElementsByTagName() equivalent for textNodes.
You can probably adapt one of the proposed solutions to your needs (i.e. iterate over all text nodes, replacing the words as you go - this won't work in cases such as <tag>wo</tag>rd but it's better than nothing, I guess).
I believe you could just do:
$('#article :not(:has(*))').jhighlight({find : 'class'});
Since it grabs all leaf nodes in the article it would require valid xhtml, that is, it would only match link in the following example:
<p>This is some paragraph content with a <a>link</a></p>
DOM traversal / selector application could slow things down a bit so it might be good to do:
article_nodes = article_nodes || $('#article :not(:has(*))');
article_nodes.jhighlight({find : 'class'});
Maybe something like this could help:
>+[^<]*?(s(<[\s\S]*?>)?e(<[\s\S]*?>)?e)[^>]*?<+
The first part >+[^<]*? finds > of the last preceding tag
The third part [^>]*?<+ finds < of the first subsequent tag
In the middle we have (<[\s\S]*?>)? between characters of our search phrase (in this case - "see").
After regular expression searching you could use the result of the middle part to highlight search phrase for user.
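A small sketch of how such a pattern might be generated for an arbitrary search phrase, rather than written by hand for "see". The function name buildTagTolerantRegex is an assumption, and the optional tag groups are made non-capturing here so that the phrase is always group 1.

```javascript
// Hypothetical helper: escape one character for use inside a regex.
function escapeRegExpChar(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Builds the pattern described above: the phrase's characters, each pair
// optionally separated by a tag, located between the end of one tag and
// the start of the next.
function buildTagTolerantRegex(phrase) {
  var optionalTag = '(?:<[\\s\\S]*?>)?';
  var middle = phrase.split('').map(escapeRegExpChar).join(optionalTag);
  return new RegExp('>+[^<]*?(' + middle + ')[^>]*?<+');
}
```

For example, buildTagTolerantRegex('see').exec(html) would return the matched phrase, including any tags inside it, as element 1 of the result. Like the original pattern, it only finds text that sits between two tags.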
I have an issue with jQuery where elements are not found when the query string has a '$' char in it. Is there a known issue? Unfortunately, search engines make it very hard to search for symbols in threads.
I have an html such as this:
<TD id="ctl00$m$g_cd3cd7fd_df51_4f95_9057_d98f0c1e1d60$ctl00$ctl00_5"
class="MenuItem"
onclick="setSelectedTab('ctl00$m$g_cd3cd7fd_df51_4f95_9057_d98f0c1e1d60$ctl00$ctl00_5');"
tabsrowid="ctl00$m$g_cd3cd7fd_df51_4f95_9057_d98f0c1e1d60$ctl00$ctl00_"
nohide="false">...
and my JavaScript goes something like:
function setSelectedTab(selection) {
var ids = selection.split('/');
for (var i = 0; i<ids.length; i++) {
var item = $("#" + ids[i]);
item.addClass("selected");
$("#" + item.attr("tabsrowid")).show();
}
}
While analyzing in firebug, I see that 'item' is an empty set. If I query $('.MenuItem') for example, it correctly returns a result set with 25 matching items in the page; it appears like $(s) doesn't work when s contains $ chars in it?
What's the solution to it? Sorry if it is a dumb question or a well-known issue; as I said, I tried to google around, but without success.
Note: It's not an issue with javascript itself, or duplicate ids, or jquery not loaded, or anything like that. The function does get called onclick, and if I replace $('#' + ids[i]) with document.getElementById(ids[i]), it does return the correct element.
fyi, the string passed to the function setSelectedTab usually contains a hierarchical path to the TD element; though in the example TD above, the ids.length is 1.
Thanks,
Raja.
Perhaps try escaping them with backslashes
<TD id="ctl00\$m\$g_cd3cd7fd_df51_4f95_9057_d98f0c1e1d60\$ctl00\$ctl00_5"
class="MenuItem"
onclick="setSelectedTab('ctl00\$m\$g_cd3cd7fd_df51_4f95_9057_d98f0c1e1d60\$ctl00\$ctl00_5');"
tabsrowid="ctl00\$m\$g_cd3cd7fd_df51_4f95_9057_d98f0c1e1d60\$ctl00\$ctl00_"
nohide="false">...
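jQuery's selector documentation describes escaping meta-characters such as $ with two backslashes inside the selector string. One option, then, is to leave the markup unchanged and escape the id in the script before it is passed to $(). The helper below is a sketch under that assumption; the name escapeForSelector is made up for this example.

```javascript
// Hypothetical helper: prefix every CSS selector meta-character in an id
// with a backslash so the id can be used literally in a selector string.
function escapeForSelector(id) {
  return id.replace(/([!"#$%&'()*+,.\/:;<=>?@\[\\\]^`{|}~])/g, '\\$1');
}
```

Inside setSelectedTab this could be used as $("#" + escapeForSelector(ids[i])), and likewise for the tabsrowid value. jQuery 3.0 and later also ship $.escapeSelector, which may make a hand-written helper unnecessary.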