Highlighting glossary terms inside a HTML document - javascript

We have a glossary with up to 2000 terms, where each glossary term may
consist of one, two or three words (separated by either whitespace
or a dash).
Now we are looking for a solution for highlighting all terms inside a
(longer) HTML document (up to 100 KB of HTML markup) in order to
generate a static HTML page with the highlighted terms.
The constraints for a working solution are a large number of glossary terms
and long HTML documents. What would be the blueprint for an efficient solution
(within Python)?
Right now I am thinking about parsing the HTML document using lxml, iterating over all text nodes and then matching the contents within each text node against all glossary terms.
Client-side (browser) highlighting on the fly is not an option since IE will complain about long-running scripts with a script timeout, which makes it unusable in production.
Any better idea?

You could use a parser to walk the tree recursively and replace only the nodes that consist of text.
In doing so, there are still several things you will need to account for:
- Not all text needs to be replaced (e.g. inline JavaScript)
- Some elements of the document might not need parsing (e.g. headings)
Here's a quick, non-production-ready example of how you could achieve this:
import BeautifulSoup  # BeautifulSoup 3 module layout, as in the original answer
html = """The HTML you need to parse"""
IGNORE_TAGS = ['script', 'style']
def parse_content(item, replace_what, replace_with, ignore_tags=IGNORE_TAGS):
    # iterate over a copy, since replaceWith() modifies item.contents
    for content in list(item.contents):
        if isinstance(content, BeautifulSoup.NavigableString):
            # text node: str.replace takes only the old and the new string
            content.replaceWith(content.replace(replace_what, replace_with))
        elif content.name not in ignore_tags:
            # element node: recurse unless its tag is in the ignore list
            parse_content(content, replace_what, replace_with, ignore_tags)
    return item
soup = BeautifulSoup.BeautifulSoup(html)
body = soup.html.body
replaced_content = parse_content(body, 'a', 'b')
This should replace every occurrence of "a" with "b", while leaving alone content that is:
- Inside inline JavaScript or CSS (although inline JS or CSS should not appear in a document's body).
- A reference in a tag such as img, a...
- A tag itself
Of course, depending on your glossary, you will then need to make sure that you don't replace only part of a word with something else; to do this it makes sense to use a regex instead of content.replace.

I think highlighting with client-side JavaScript is the best option. It saves your server processing time and bandwidth and, more importantly, keeps the HTML clean and usable for those who don't need the extra markup, for example when printing or converting to other formats.
To avoid timeouts, just split the job into chunks and process them one by one, scheduling each chunk with setTimeout. Here's an example of this approach:
function hilite(terms, chunkSize) {
    // prepare stuff: one case-insensitive pattern for all terms
    // (terms are assumed to contain no regex metacharacters)
    var pattern = new RegExp("\\b(" + terms.join("|") + ")\\b", "gi");
    // collect all text nodes in the document
    var textNodes = [];
    $("body").find("*").contents().each(function() {
        if (this.nodeType == 3)
            textNodes.push(this);
    });
    // process N text nodes at a time, surround terms with text "markers"
    function step() {
        for (var i = 0; i < chunkSize; i++) {
            if (!textNodes.length)
                return done();
            var node = textNodes.shift();
            node.nodeValue = node.nodeValue.replace(pattern, "\x1e$&\x1f");
        }
        setTimeout(step, 100);
    }
    // when done, replace "markers" with html
    function done() {
        $("body").html($("body").html().
            replace(/\x1e/g, "<b>").
            replace(/\x1f/g, "</b>")
        );
    }
    // let's go
    step();
}
Use it like this:
$(function() {
hilite(["highlight", "these", "words"], 100)
})
Let me know if you have questions.

How about going through each term in the glossary and then, for each term, using regex to find all occurrences in the HTML? You could replace each of those occurrences with the term wrapped in a span with a class "highlighted" that will be styled to have a background color.
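Rather than running one pass per term, the whole glossary can be folded into a single alternation so the text is scanned only once. The following is just a sketch under assumptions: the helper names (escapeRegExp, buildGlossaryPattern, highlightText) are made up for illustration, the input is the plain text of a single text node (not raw HTML, and not HTML-escaped here), and the words of a multi-word term may be joined by either whitespace or a dash.

```javascript
// Escape characters that are special inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Fold the whole glossary into one case-insensitive pattern.
// Longer terms are listed first so that "text node" wins over "text".
function buildGlossaryPattern(terms) {
  var alternatives = terms
    .slice()
    .sort(function (a, b) { return b.length - a.length; })
    .map(function (term) {
      // the words of a term may be joined by whitespace or a dash
      return term.split(/[\s-]+/).map(escapeRegExp).join("[\\s-]+");
    });
  return new RegExp("\\b(?:" + alternatives.join("|") + ")\\b", "gi");
}

// Wrap every glossary hit found in the text of one text node.
function highlightText(text, pattern) {
  return text.replace(pattern, function (match) {
    return '<span class="highlighted">' + match + "</span>";
  });
}
```

The pattern is built once and reused for every text node, so the cost per node does not grow with a separate regex per term. The word boundaries (\b) keep a term like "text" from matching inside "context".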

Related

Scraping data from HTML using JavaScript RegExp [duplicate]

Splitting a long phrase into an array

I need to take the phrase
It’s that time of year when you clean out your closets, dust off shelves, and spruce up your floors. Once you’ve taken care of the dust and dirt, what about some digital cleaning? Going through all your files and computers may seem like a daunting task, but we found ways to make the process fairly painless.
and, upon pressing a button:
- split it into an array
- iterate over that array at each step
- build SPAN elements as you go, along with the attributes
- add the SPAN elements to the original DIV
- add a click handler to the SPAN elements, or to the DIV, which causes the style on the SPAN to change on mouseover.
So far I had
function splitString(stringToSplit, separator) {
    var arrayOfStrings = stringToSplit.split(separator);
    print('The original string is: "' + stringToSplit + '"');
    print('The separator is: "' + separator + '"');
    print("The array has " + arrayOfStrings.length + " elements: ");
    for (var i=0; i < arrayOfStrings.length; i++)
        print(arrayOfStrings[i] + " / ");
}
var space = " ";
var comma = ",";
splitString(tempestString, space);
splitString(tempestString);
splitString(monthString, comma);
for (var i=0; i < myArray.length; i++)
{
}
var yourSpan = document.createElement('span');
yourSpan.innerHTML = "Hello";
var yourDiv = document.getElementById('divId');
yourDiv.appendChild(yourSpan);
yourSpan.onmouseover = function () {
    alert("On MouseOver");
}
and for html I have
<p>The DIV that will serve as your input (and output) is here, with
id="transcriptText":</p>
<div id="transcriptText"> It’s that time of year when you clean out your
closets, dust off shelves, and spruce up your floors. Once you’ve taken
care of the dust and dirt, what about some digital cleaning? Going
through all your files and computers may seem like a daunting task, but
we found ways to make the process fairly painless.</div>
<br>
<div id="divideTranscript" class="button"> Transform the
Transcript! </div>
Any help on how to move on? I have been stuck for quite some time.
Well, first off this looks like homework.
That said, I'll try to help without giving you the actual code, since we're not supposed to give actual working solutions to homework. You're splitting the string too many times (once is all that's needed based on the instructions you gave) and you have to actually store the result of the split call somewhere that your other code can use it.
Your instructions say to add attributes to the span, but not which attributes nor what their contents should be.
Your function should follow the instructions:
1) Split the string. Since it doesn't specify on what, I'd assume words. So split it on spaces only and leave the punctuation where it is.
2) With the array of words returned from the split() function, iterate over it as you attempt to, but inside the braces that scope the loop is where you want to wrap the <span> starting and ending tags around the original word.
3) Use document.createElement() to make the current span into a DOM element. Attach the mouseover and click handlers to it, then appendChild() it to the div.
4) Add the handler to your button to call the above function.
Note that it's possibly more efficient to use the innerHTML property to insert all the spans at once, but then you have to loop again to add the hover/click handlers.

Finding the DOM element with specific text and modify it

I'm trying to figure out how to, in raw javascript (no jQuery, etc.), find an element with specific text and modify that text.
My first incarnation of the solution... is less than adequate. What I did was basically:
var x = document.body.innerHTML;
x.replace(/regular-expression/,"text");
document.body.innerHTML = x;
Naively I thought I succeeded with flying colors, especially since it was so simple. So then I added an image to my example and thought I could check every 5 seconds (because this string may enter the DOM dynamically)... and the image flickered every 5 seconds.
Oops.
So, there has to be a correct way to do this. A way that specifically singles out a specific DOM element and updates the text portion of that DOM element.
Now, there's always "recursively search through the children till you find the deepest child with the string" approach, which I want to avoid. And even then, I'm skeptical about "changing the innerHTML to something different" being the correct way to update a DOM element.
So, what's the correct way to search through the DOM for a string? And what's the correct way to update a DOM element's text?
Now, there's always "recursively search through the children till you find the deepest child with the string" approach, which I want to avoid.
I want to search for an element in an unordered random list. Now, there's a "go through all the elements till you find what you're looking for approach", which I want to avoid.
Old-timer magno tape, record, listen, meditate.
Btw, see: Find and replace text with JavaScript on James Padolsey's GitHub
(also his blog articles explaining it)
Edit: Changed querySelectorAll to getElementsByTagName from RobG's suggestion.
You can use the getElementsByTagName function to grab all of the tags on the page. From there, you can check their children and see if they have any Text Nodes as children. If they do, you'd then look at their text and see if it matches what you need. Here is an example that will print out the text of every Text Node in your document with the console object:
var elms = document.getElementsByTagName("*"),
    len = elms.length;
for (var ii = 0; ii < len; ii++) {
    var myChildren = elms[ii].childNodes,
        len2 = myChildren.length;
    for (var jj = 0; jj < len2; jj++) {
        if (myChildren[jj].nodeType === 3) { // 3 is a text node
            console.log(myChildren[jj].nodeValue);
            // example of updating a text node's value
            myChildren[jj].nodeValue = myChildren[jj].nodeValue.replace(/test/, "123");
        }
    }
}
To update a DOM element's text, simply update the nodeValue property of the Text Node.
Don't use innerHTML with a regular expression; it will almost certainly fail for non-trivial content. Also, there are still differences in how browsers generate it from the live DOM. Replacing the innerHTML will also remove any event listeners attached to the replaced elements (e.g. element.onclick = fn).
It is best if you can have the string enclosed in an element with an attribute or property you can search on (id, class, etc.) but failing that, a search of text nodes is the best approach.
Edit
Attempting a general purpose text selection function for an HTML document may result in a very complex algorithm since the string could be part of a complex structure, e.g.:
<h1>Some <span class="foo"><em>s</em>pecial</span> heading</h1>
Searching for the string "special heading" is tricky as it is split over 2 elements. Wrapping it in another element (say for highlighting) is also not trivial since the resulting DOM structure must be valid. For example, the text matching "some special" in the above could be wrapped in a span but not a div.
Any such function must be accompanied by documentation stating its limitations and most appropriate use.
Forget regular expressions.
Iterate over each text node (and doing it recursively will be the most elegant) and modify the text nodes if the text is found. If just looking for a string, you can use indexOf().
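The recursive walk might look roughly like the sketch below. It is an illustration rather than anyone's posted code: the function name replaceInTextNodes is invented here, and it relies only on the standard nodeType, nodeValue and childNodes properties, so it can be tried on a real element in a browser or on plain objects with the same shape.

```javascript
var TEXT_NODE = 3; // numeric value of Node.TEXT_NODE

// Recursively visit every text node below `node` and replace each
// occurrence of the plain string `search` with `replacement`.
// Returns the number of text nodes that were changed.
function replaceInTextNodes(node, search, replacement) {
  if (!search) return 0; // nothing sensible to look for
  if (node.nodeType === TEXT_NODE) {
    if (node.nodeValue.indexOf(search) === -1) return 0;
    // split/join replaces all occurrences without building a regex
    node.nodeValue = node.nodeValue.split(search).join(replacement);
    return 1;
  }
  var changed = 0;
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    changed += replaceInTextNodes(children[i], search, replacement);
  }
  return changed;
}
```

In a page this would be called as replaceInTextNodes(document.body, "test", "123"). Because only nodeValue is assigned, elements are never recreated, so images do not flicker and event handlers stay attached.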
x.replace(/regular-expression/,"text");
will return a value so
var y = x.replace(/regular-expression/,"text");
now you can assign new value.
document.body.innerHTML = y;
But you want to think about this: you don't want to get the whole body just to change one small piece of it. Why not get the content of a div or any other element instead?
Example:
<p id='paragraph'>
... some text here ...
</p>
now you can use javascript
var para = document.getElementById('paragraph');
var newPara = para.innerHTML.replace(/regex/,'new content');
para.innerHTML = newPara;
This should be the simplest way.

Regex to search html return, but not actual html jQuery

I'm making a highlighting plugin for a client to find things in a page, and I decided to test it with a help viewer I'm still building, but I'm having an issue that'll (probably) require some regex.
I do not want to parse HTML, and I'm totally open to doing this differently; this just seems like the best/right way.
http://oscargodson.com/labs/help-viewer
http://oscargodson.com/labs/help-viewer/js/jquery.jhighlight.js
Type something in the search... ok, refresh the page, now type something like class or class=" or <a and you'll notice it searches the actual HTML (as expected). How can I search only the text?
If I do .text() it'll vaporize all the HTML and what I get back will just be a big blob of text, but I still want the HTML so I don't lose formatting, links, images, etc. I want this to work like CMD/CTRL+F.
You'd use this plugin like:
$('article').jhighlight({find:'class'});
To remove them:
.jhighlight('remove')
==UPDATE==
While Mike Samuel's idea below does in fact work, it's a tad heavy for this plugin. It's mainly for a client looking to erase bad words and/or MS Word characters during a "publishing" process of a form. I'm looking for a more lightweight fix, any ideas?
You really don't want to use eval, mess with innerHTML or parse the markup "manually". The best way, in my opinion, is to deal with text nodes directly and keep a cache of the original html to erase the highlights. Quick rewrite, with comments:
(function($){
    $.fn.jhighlight = function(opt) {
        var isRemove = (opt === 'remove')
            // copy the defaults so one call's options don't leak into the next
          , options = $.extend({}, $.fn.jhighlight.defaults, isRemove ? {} : opt)
          , txtProp = this[0].textContent ? 'textContent' : 'innerText';
        // nothing to search for (trim the string, then check its length)
        if (!isRemove && $.trim(options.find).length < 1) return this;
        return this.each(function(){
            var self = $(this);
            // use a cache to clear the highlights
            if (!self.data('htmlCache'))
                self.data('htmlCache', self.html());
            if (isRemove){
                return self.html( self.data('htmlCache') );
            }
            // create Tree Walker
            // https://developer.mozilla.org/en/DOM/treeWalker
            var walker = document.createTreeWalker(
                this, // walk only on target element
                NodeFilter.SHOW_TEXT,
                null,
                false
            );
            var node
              , matches
              , flags = 'g' + (!options.caseSensitive ? 'i' : '')
              , exp = new RegExp('('+options.find+')', flags) // capturing
              , expSplit = new RegExp(options.find, flags) // no capturing
              , highlights = [];
            // walk this wayy
            // and save matched nodes for later
            while(node = walker.nextNode()){
                if (matches = node.nodeValue.match(exp)){
                    highlights.push([node, matches]);
                }
            }
            // must replace stuff after the walker is finished
            // otherwise replacing a node will halt the walker
            for(var nn=0,hln=highlights.length; nn<hln; nn++){
                var node = highlights[nn][0]
                  , matches = highlights[nn][1]
                  , parts = node.nodeValue.split(expSplit) // split on matches
                  , frag = document.createDocumentFragment(); // temporary holder
                // add text + highlighted parts in between
                // like a .join() but with elements :)
                for(var i=0,ln=parts.length; i<ln; i++){
                    // non-highlighted text
                    if (parts[i].length)
                        frag.appendChild(document.createTextNode(parts[i]));
                    // highlighted text
                    // skip last iteration
                    if (i < ln-1){
                        var h = document.createElement('span');
                        h.className = options.className;
                        h[txtProp] = matches[i];
                        frag.appendChild(h);
                    }
                }
                // replace the original text node
                node.parentNode.replaceChild(frag, node);
            }
        });
    };
    $.fn.jhighlight.defaults = {
        find:'',
        className:'jhighlight',
        color:'#FFF77B',
        caseSensitive:false,
        wrappingTag:'span'
    };
})(jQuery);
If you're doing any manipulation on the page, you might want to replace the caching with another clean-up mechanism, not trivial though.
You can see the code working here: http://jsbin.com/anace5/2/
You also need to add display:block to your new HTML elements; the layout is broken in a few browsers.
In the javascript code prettifier, I had this problem. I wanted to search the text but preserve tags.
What I did was start with HTML, and decompose that into two bits.
- The text content
- Pairs of (index into text content where a tag occurs, the tag content)
So given
Lorem <b>ipsum</b>
I end up with
text = 'Lorem ipsum'
tags = [6, '<b>', 11, '</b>']
which allows me to search on the text, and then based on the result start and end indices, produce HTML including only the tags (and only balanced tags) in that range.
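To make that data layout concrete, here is a rough sketch of the decomposition step. It is not the prettifier's actual code: decompose is a hypothetical helper, and it assumes simple markup (no comments, no ">" inside attribute values) and does not decode entities.

```javascript
// Split an HTML string into its text content and a flat list of
// (offset into the text, tag source) pairs.
function decompose(html) {
  var tagPattern = /<[^>]*>/g;
  var text = "";
  var tags = [];
  var last = 0;
  var m;
  while ((m = tagPattern.exec(html)) !== null) {
    text += html.slice(last, m.index); // text that precedes this tag
    tags.push(text.length, m[0]);      // the tag sits after that many text characters
    last = m.index + m[0].length;
  }
  text += html.slice(last);            // trailing text after the last tag
  return { text: text, tags: tags };
}
```

Each offset counts the text characters that precede the tag, so for 'Lorem <b>ipsum</b>' the opening tag lands at 6 and the closing tag at 11, just past the last character of "ipsum". A search then runs on text alone, and the offsets tell you which tags fall inside a matched range.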
Have a look here: getElementsByTagName() equivalent for textNodes.
You can probably adapt one of the proposed solutions to your needs (i.e. iterate over all text nodes, replacing the words as you go - this won't work in cases such as <tag>wo</tag>rd but it's better than nothing, I guess).
I believe you could just do:
$('#article :not(:has(*))').jhighlight({find : 'class'});
Since it grabs all leaf nodes in the article, it would require valid XHTML; that is, it would only match link in the following example:
<p>This is some paragraph content with a <a>link</a></p>
DOM traversal / selector application could slow things down a bit so it might be good to do:
article_nodes = article_nodes || $('#article :not(:has(*))');
article_nodes.jhighlight({find : 'class'});
Maybe something like this could be helpful:
>+[^<]*?(s(<[\s\S]*?>)?e(<[\s\S]*?>)?e)[^>]*?<+
The first part >+[^<]*? finds > of the last preceding tag
The third part [^>]*?<+ finds < of the first subsequent tag
In the middle we have (<[\s\S]*?>)? between characters of our search phrase (in this case - "see").
After the regular expression search, you can use the result of the middle part to highlight the search phrase for the user.
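For an arbitrary phrase, a pattern of this shape could be generated instead of written by hand. This is only a sketch of the idea described above: tagTolerantPattern and escapeRegExp are names invented here, and the usual caveats about running regular expressions over HTML still apply (for instance, text that is not preceded by any tag is not matched).

```javascript
// Escape a character that is special inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a pattern that finds `phrase` in the text between tags, allowing
// one optional tag between any two consecutive characters of the phrase.
function tagTolerantPattern(phrase) {
  var optionalTag = "(?:<[\\s\\S]*?>)?";
  var middle = phrase.split("").map(function (ch) {
    return escapeRegExp(ch);
  }).join(optionalTag);
  // group 1 is the phrase as it appears in the markup, tags included
  return new RegExp(">+[^<]*?(" + middle + ")[^>]*?<+", "i");
}
```

Calling tagTolerantPattern("see").exec(html) returns a match whose group 1 holds the phrase together with any tags that interrupt it, which is the part to wrap for highlighting. Text inside attributes is skipped because the match has to start after a closing ">".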

Javascript: Changing color of every "r" in html document

How can I change the color of every R and r in my HTML document with JavaScript?
I'd use the highlight plugin for jQuery. Then do something like:
$('*').highlight('r'); // Not sure if it's case-insensitive or not
and in CSS:
.highlight { background-color: yellow; }
Doable, but not super easy. There's no CSS way to do it.
Basically, you'll need to use JavaScript and iterate through all the nodes. If it's a text node, you can search it for "R" and then replace the R with a <span style="color:red">R</span>
I am obviously simplifying this a bit; it's probably better to dynamically add a "highlight" class, rather than hard-code a style, and have that defined in CSS. Similarly, I'm sure you'll want to parameterize the search string. Also, this doesn't take into account what kind of node the text is in: I have special handling to skip comments, but you'll probably find there are other things (script nodes?) you also need to skip.
function updateNodes(node) {
    if (node.nextSibling)
        updateNodes(node.nextSibling);
    if (node.nodeType == 8) return; // Don't update comments
    if (node.firstChild)
        updateNodes(node.firstChild);
    if (node.nodeValue) { // update me
        if (node.nodeValue.search(/[Rr]/) > -1) { // does the text node have an R
            var span = document.createElement("span");
            var remainingText = node.nodeValue;
            var newValue = '';
            while (remainingText.search(/[Rr]/) > -1) { // Crawl through the node finding each R
                var rPos = remainingText.search(/[Rr]/);
                var bit = remainingText.substr(0, rPos);
                var r = remainingText.substr(rPos, 1);
                remainingText = remainingText.substr(rPos + 1);
                newValue += bit;
                newValue += '<span style="color:red">';
                newValue += r;
                newValue += '</span>';
            }
            newValue += remainingText;
            span.innerHTML = newValue;
            node.parentNode.insertBefore(span, node);
            node.parentNode.removeChild(node);
        }
    }
}
function replace() {
    updateNodes(document.body);
}
Yes this is possible with a little Javascript, a smattering of CSS and some regex.
First, you need to define a style which provides the colour you require (in my example below I refer to a CSS class called "new-colour"), and then run some regex over your HTML content which does a search and replace. You are looking to change all 'r' and 'R' characters into something like this (as an example):
<span class="new-colour">r</span>
If you don't know regex, there are oodles of resources out there to get you started. You will be pleased to know that your requirement is very simple, so no worries there. Here are a couple of links:
regexlib.com
8 regular expressions you should know
You would need to use the DOM (or jQuery) to iterate through every text node in the document. Whenever you find the letter R, apply a transformation that wraps the letter in an appropriate element.
e.g. Transform the text node "art" into "a<span class="colored">r</span>t". This adds two new text nodes, "r" and "t", and the new span element.
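The splitting step behind that transformation can be sketched without touching the DOM at all. The helper below is hypothetical (splitOnMatches is not part of any library mentioned here); it only decides which pieces of a text node's value are matches and which are plain text.

```javascript
// Split `text` into consecutive segments, flagging the ones that match.
function splitOnMatches(text, pattern) {
  // work on a global copy so exec() advances through the text
  var flags = pattern.flags.indexOf("g") === -1 ? pattern.flags + "g" : pattern.flags;
  var re = new RegExp(pattern.source, flags);
  var segments = [];
  var last = 0;
  var m;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) { re.lastIndex++; continue; } // skip empty matches
    if (m.index > last) {
      segments.push({ text: text.slice(last, m.index), match: false });
    }
    segments.push({ text: m[0], match: true });
    last = re.lastIndex;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), match: false });
  }
  return segments;
}
```

For "art" and /r/i this gives three segments: "a", "r" (a match) and "t". In the browser, each plain segment would become a document.createTextNode(...) and each match a span whose textContent is set, all appended to a fragment that replaces the original text node. Setting textContent rather than innerHTML means characters such as "<" in the text cannot turn into markup.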
The highlight plugin for jQuery is one option. Another option, especially since tomorrow you might want to extend your highlighting to keywords or other terms, is to use Google Closure's goog.dom.annotate. The beauty of it is that it actually parses the DOM tree properly and annotates ONLY the relevant terms. It also allows you to EXCLUDE elements or elements with certain classes.
A common problem with annotations is that you can mess up your HTML if you are not careful.
For example, the 'simple solution' posted in another answer:
var body = document.getElementsByTagName("body")[0];
var html = body.innerHTML
.replace(/(^|>[^<rR]*)([rR])/g, "$1<em>$2</em>");
body.innerHTML = html;
will surely also capture terms in any style attributes. If you had this:
<p class="red">text......</p>
It will become
<p class="<span class="red">r</span>ed .....
that will break your html.
In general DOM parsing is 'slow', so try to avoid annotating the whole body of a web page, and ask yourself why you need to highlight only the R's. Actually, I am curious: why do you want to annotate the r's? :)
Plain JS solution without need of any 20kB JS library:
var body = document.getElementsByTagName("body")[0];
var html = body.innerHTML
.replace(/(^|>[^<rR]*)([rR])/g, "$1<em>$2</em>");
body.innerHTML = html; // note that you will lose all
// event handlers in this step...
