I am writing a jQuery plugin that will do a browser-style find-on-page search. I need to improve the search, but don't want to get into parsing the HTML quite yet.
At the moment my approach is to take an entire DOM element and all nested elements and simply run a regex find/replace for a given term. In the replace I will simply wrap a span around the matched term and use that span as my anchor to do highlighting, scrolling, etc. It is vital that no characters inside any html tags are matched.
This is as close as I have gotten:
(?<=^|>)([^><].*?)(?=<|$)
It does a very good job of capturing all characters that are not in an html tag, but I'm having trouble figuring out how to insert my search term.
Input: Any html element (this could be quite large, eg <body>)
Search Term: 1 or more characters
Replace Txt: <span class='highlight'>$1</span>
UPDATE
The following regex does what I want when I'm testing with http://gskinner.com/RegExr/...
Regex: (?<=^|>)(.*?)(SEARCH_STRING)(?=.*?<|$)
Replacement: $1<span class='highlight'>$2</span>
However I am having some trouble using it in my JavaScript. With the following code Chrome is giving me the error "Invalid regular expression: /(?<=^|>)(.*?)(Mary)(?=.*?<|$)/: Invalid group".
var origText = $('#'+opt.targetElements).data('origText');
var regx = new RegExp("(?<=^|>)(.*?)(" + $this.val() + ")(?=.*?<|$)", 'gi');
$('#'+opt.targetElements).each(function() {
var text = origText.replace(regx, '$1<span class="' + opt.resultClass + '">$2</span>');
$(this).html(text);
});
It's breaking on the group (?<=^|>) - is this something clumsy or a difference in the Regex engines?
UPDATE
The reason this regex is breaking on that group is that JavaScript does not support regex lookbehinds (engines predating ES2018 reject them; newer engines accept them). For reference & possible solutions: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript.
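One way around the missing lookbehind, sketched here rather than taken from the linked article: let the regex consume whole tags as one alternative and the search term as another, and decide in a replacer callback which one matched. The helper names `escapeRegExp` and `highlightOutsideTags` are made up for this illustration, and the sketch assumes every `<` opens a tag and no attribute value contains `>`.

```javascript
// Escape a user-supplied term so it is matched literally inside a RegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

// Hypothetical helper: wrap every occurrence of `term` that lies outside a tag.
function highlightOutsideTags(html, term, className) {
  if (!term) return html; // an empty term would match everywhere
  // Alternative 1 consumes a whole tag; alternative 2 captures the term.
  var re = new RegExp('<[^>]*>|(' + escapeRegExp(term) + ')', 'gi');
  return html.replace(re, function (match, hit) {
    // `hit` is undefined when the tag alternative matched: keep the tag as-is.
    return hit === undefined
      ? match
      : '<span class="' + className + '">' + hit + '</span>';
  });
}

var result = highlightOutsideTags('<p class="mary">Mary had a lamb</p>', 'mary', 'highlight');
console.log(result);
// <p class="mary"><span class="highlight">Mary</span> had a lamb</p>
```

Because the tag alternative comes first and consumes the whole tag, the term inside `class="mary"` is never offered to the second alternative.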
Just use jQuery's built-in text() method. It returns the combined text of a selected DOM element and its descendants.
For the DOM approach (docs for the Node interface): Run over all child nodes of an element. If the child is an element node, recurse into it. If it's a text node, search in the text (node.data), and if you want to highlight or change something, shorten the text of the node to end at the found position, then insert a highlight span with the matched text and another text node for the rest of the text.
Example code (adjusted, origin is here):
(function iterate_node(node) {
    if (node.nodeType === 3) { // Node.TEXT_NODE
        var text = node.data,
            match = /any regular expression/.exec(text); // indexOf also applicable
        if (match && match[0].length > 0) {
            var pos = match.index,
                length = match[0].length; // length of what was actually found
            node.data = text.substr(0, pos); // split into a part before...
            var rest = document.createTextNode(text.substr(pos + length)); // a part after
            var highlight = document.createElement("span"); // and a part between
            highlight.className = "highlight";
            highlight.appendChild(document.createTextNode(match[0]));
            node.parentNode.insertBefore(rest, node.nextSibling); // insert after
            node.parentNode.insertBefore(highlight, node.nextSibling);
            iterate_node(rest); // maybe there are more matches
        }
    } else if (node.nodeType === 1 && node.className !== "highlight") { // Node.ELEMENT_NODE
        // spans inserted above are skipped, otherwise their text would match again
        for (var i = 0; i < node.childNodes.length; i++) {
            iterate_node(node.childNodes[i]); // run recursive on DOM
        }
    }
})(content); // any dom node
There's also highlight.js, which might be exactly what you want.
Related
I'm making an online text editor for a website I'm building, and I use custom tags for the markup.
To make it easier to read, the markup is highlighted in blue, which I do by using the following function:
var imgOccurences = (informationText.match(/\[img/gi)).length;
for(var i = 0; i < imgOccurences; i++){
var imgLocation = informationText.indexOf('[img');
var endImgLocation = informationText.indexOf(']', imgLocation+1);
if(imgLocation != -1 && endImgLocation != -1){
var informationTextTemp1 = informationText.slice(0, imgLocation);
var informationTextTemp2 = informationText.slice(endImgLocation+1, -1);
var informationTextTemp3 = informationText.slice(imgLocation, endImgLocation+1);
informationTextTemp3 = "<span class='highlightWord'>"+informationTextTemp3+"</span>";
informationText = informationTextTemp1 + informationTextTemp3 + informationTextTemp2;
}
}
However the problem I face is that, when normalizing the text to HTML, I cannot use regex expressions, which I was previously using with the other tags, on the [img] tag, due to the fact that I wanted to highlight the image tag, and all of its contents, which includes a URL.
So I decided to count up all the occurrences of just the '[img' part of the [img] tag and then look for the next occurrence of ']', then slice it out of the normal text, then highlight it using a span, and then add it back to the normal text, while I put it in a for loop.
However only the first occurrence of the [img] tag is highlighted, and I am unsure as to how I should deal with this. Any help would be appreciated.
Basically I need to get everything which looks like: [img src='www.example.com/image.png'] and make it look like: <span class='highlightWord'>[img src='example.com/image.png']</span> and then put it into the .innerHTML of the div called textHighlights.
You can do it much simpler since the .replace method accepts a regular expression as a parameter for the matching string.
informationText = informationText.replace(/(\[img.+?\])/gi, '<span class="highlightWord">$1</span>');
The above will replace all matches directly (by wrapping them in the span you want)
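To see the difference from the loop in the question, here is a small sketch that runs the same regex over a made-up input containing two tags (the sample text is an assumption, not from the question). The `g` flag is what makes `.replace` wrap every occurrence instead of only the first.

```javascript
// Sample input invented for this illustration.
var informationText = "Intro [img src='a.png'] middle [IMG src='b.png'] end";

// Same pattern as above: lazily match from "[img" up to the next "]".
var highlighted = informationText.replace(
  /(\[img.+?\])/gi,
  '<span class="highlightWord">$1</span>'
);

// Count how many spans were inserted.
var spanCount = highlighted.split('<span').length - 1;
console.log(highlighted);
console.log(spanCount);
```

Both tags end up wrapped, and the trailing text is kept intact because nothing is sliced by hand.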
I'm designing a rudimentary spell checker of sorts. Suppose I have a div with the following content:
<div>This is some text with xyz and other text</div>
My spell checker correctly identifies the div (returning a jQuery object entitled current_object) and an index for the word (in the case of the example, 5 (due to starting at zero)).
What I need to do now, is surround this word with a span e.g.
<span class="spelling-error">xyz</span>
Leaving me with the final structure like this:
<div>
This is some text with
<span class="spelling-error">xyz</span>
and other text
</div>
However, I need to do this without altering the existing user selection / moving the caret / invoking methods that do so e.g.
window.getSelection().getRangeAt(0).cloneRange().surroundContents();
In other words, if the user is working on the 4th div in the contenteditable document, my code would identify issues in the other divs (1st - 3rd) while not removing focus from the 4th div.
Many thanks!
You've tagged this post as jQuery but I don't think it's particularly necessary to use it. I've written you an example.
https://jsfiddle.net/so0jrj2b/2/
// Redefine the innerHTML for our spellcheck target
spellcheck.innerHTML = (function(text)
{
// We're using an IIFE here to keep namespaces tidy.
// words is each word in the text, split apart on spaces
var words = text.split(" ");
// newWords is our array of words after spellchecking.
var newWords = [];
// Loop through the words.
for (var i = 0; i < words.length; ++i)
{
// Pull the word from our array.
var word = words[i];
if (i === 5) // spellcheck logic here.
{
// Push this content to the array.
newWords.push("<span class=\"mistake\">" + word + "</span>");
}
else
{
// Push the word back to the array.
newWords.push(word);
}
}
// Return the rejoined text block.
return newWords.join(" ");
})(spellcheck.innerHTML);
Worth noting that my usage of an IIFE here can easily be replaced by moving that logic into its own function declaration, which makes it reusable.
Be aware you also need to account for punctuation in your spellchecking instances.
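As a rough illustration of the punctuation point, the sketch below wraps the word at a given index but keeps leading and trailing punctuation outside the span. `wrapWordAt` is a hypothetical helper, not part of the fiddle, and it assumes words are separated by single spaces and that `\W` is an acceptable definition of punctuation (it treats accented letters as punctuation too).

```javascript
// Hypothetical helper: wrap the word at `wordIndex` (0-based, space separated)
// in a span, keeping leading/trailing punctuation outside the span.
function wrapWordAt(text, wordIndex, className) {
  var words = text.split(' ');
  if (wordIndex < 0 || wordIndex >= words.length) return text;
  // Split the token into leading punctuation, the word, trailing punctuation.
  var parts = /^(\W*)([\s\S]*?)(\W*)$/.exec(words[wordIndex]);
  if (parts[2] === '') return text; // token is punctuation only
  words[wordIndex] = parts[1] +
    '<span class="' + className + '">' + parts[2] + '</span>' +
    parts[3];
  return words.join(' ');
}

var fixed = wrapWordAt('This is some text with xyz, and other text', 5, 'spelling-error');
console.log(fixed);
// This is some text with <span class="spelling-error">xyz</span>, and other text
```

Like the IIFE above, this works on a string, so assigning the result back to innerHTML has the same effect on the DOM as before.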
UPDATE: I am no longer specifically in need of the answer to this question - I was able to solve the (larger) problem I had in an entirely different way (see my comment). However, I'll check in occasionally, and if a viable answer arrives, I'll accept it. (It may take a week or three, though, as I'm only here sporadically.)
I have a string. It may or may not have HTML tags in it. So, it could be:
'This is my unspanned string'
or it could be:
'<span class="someclass">This is my spanned string</span>'
or:
'<span class="no-text"></span><span class="some-class"><span class="other-class">This is my spanned string</span></span>'
or:
'<span class="no-text"><span class="silly-example"></span></span><span class="some-class">This is my spanned string</span>'
I want to find the index of a substring, but only in the portion of the string that, if the string were turned into a DOM element, would be (a) TEXT node(s). In the example, only in the part of the string that has the plain text This is my string.
However, I need the location of the substring in the whole string, not only in the plain text portion.
So, if I'm searching for "span" in each of the strings above:
searching the first one will return 13 (0-based),
searching the second will skip the opening span tag in the string and return 35 for the string span in the word spanned
searching the third will skip the empty span tag and the openings of the two nested span tags, and return 91
searching the fourth will skip the nested span tags and the opening of the second span tag, and return 100
I don't want to remove any of the HTML tags, I just don't want them included in the search.
I'm aware that attempting to use regex is almost certainly a bad idea, probably even for simplistic strings as my code will be encountering, so please refrain from suggesting it.
I'm guessing I will need to use an HTML parser (something I've never done before). Is there one with which I can access the original parsed strings (or at least their lengths) for each node?
Might there be a simpler solution than that?
I did search around and wasn't able to find anyone asking this particular question before, so if someone knows of something I missed, I apologize for faulty search skills.
The search could loop through the string character by character. While inside a tag, skip ahead to the end of the tag, so that only text outside tags is searched. To catch text that is matched partially and then interrupted by a tag, remember the partial match and continue it after the tag.
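A minimal sketch of that idea, swapping the "remember the partial match" bookkeeping for an index map: collect the characters outside tags, record where each one came from, then search the collected text. `indexOfOutsideTags` is a name invented here, and the sketch assumes every `<` opens a tag, no attribute value contains `>`, and entities are left undecoded.

```javascript
// Hypothetical helper: find `needle` in the text outside tags and return the
// index of the match start in the ORIGINAL string, or -1 if there is none.
function indexOfOutsideTags(haystack, needle) {
  var text = '';   // concatenated characters found outside tags
  var map = [];    // map[i] = position in haystack of text.charAt(i)
  var inTag = false;
  for (var i = 0; i < haystack.length; i++) {
    var ch = haystack.charAt(i);
    if (ch === '<') {
      inTag = true;
    } else if (ch === '>' && inTag) {
      inTag = false;
    } else if (!inTag) {
      text += ch;
      map.push(i);
    }
  }
  var pos = text.indexOf(needle);
  return (pos === -1 || pos >= map.length) ? -1 : map[pos];
}

console.log(indexOfOutsideTags('This is my unspanned string', 'span'));
console.log(indexOfOutsideTags('<span class="someclass">This is my spanned string</span>', 'span'));
console.log(indexOfOutsideTags('un<b>spa</b>nned', 'span'));
```

Because the search runs over the concatenated text, a match that is interrupted by a tag (the last call above) is still found, and the map translates its start back to the original string.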
Here is a little function I came up with:
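function customSearch(haystack, needle) {
    var start = 0;
    var a = haystack.indexOf(needle, start); // next candidate match
    var b = haystack.indexOf('<', start);    // next tag opening
    while (b < a && b != -1) {
        var close = haystack.indexOf('>', b);
        if (close == -1) return -1; // unclosed tag: nothing searchable remains
        start = close + 1; // resume right after the tag
        a = haystack.indexOf(needle, start);
        b = haystack.indexOf('<', start);
    }
    return a;
}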
It returns the results you expected based on your examples. Here is a JSFiddle where the results are logged in the console.
Let's start with your third example:
var desiredSubString = 'span';
var entireString = '<span class="no-text"></span><span class="some-class"><span class="other-class">This is my spanned string</span></span>';
Strip all HTML tags from entireString, above, to establish textString:
var textString = entireString.replace(/(data-([^"]+"[^"]+"))/ig,"");
textString = textString.replace(/(<([^>]+)>)/ig,"");
You can then find the index of the start of the textString within the entireString:
var indexOfTextString = entireString.indexOf(textString);
Then you can find the index of the start of the substring you're looking for within the textString:
var indexOfSubStringWithinTextString = textString.indexOf(desiredSubString);
Finally you can add indexOfTextString and indexOfSubStringWithinTextString together:
var indexOfSubString = indexOfTextString + indexOfSubStringWithinTextString;
Putting it all together:
var entireString = '<span class="no-text"></span><span class="some-class"><span class="other-class">This is my spanned string</span></span>';
var desiredSubString = 'span';
var textString = entireString.replace(/(data-([^"]+"[^"]+"))/ig,"");
textString = textString.replace(/(<([^>]+)>)/ig,"");
var indexOfTextString = entireString.indexOf(textString);
var indexOfSubStringWithinTextString = textString.indexOf(desiredSubString);
var indexOfSubString = indexOfTextString + indexOfSubStringWithinTextString;
You could use the browser's own HTML parser and XPath engine to search only inside the text nodes and do whatever processing you need.
Here's a partial solution:
var haystack = ' <span class="no-text"></span><span class="some-class"><span class="other-class">This is my spanned string</span></span>';
var needle = 'span';
var elt = document.createElement('elt');
elt.innerHTML = haystack;
var iter = document.evaluate('.//text()[contains(., "' + needle + '")]', elt).iterateNext();
if (iter) {
var position = iter.textContent.indexOf(needle);
var range = document.createRange();
range.setStart(iter, position);
range.setEnd(iter, position + needle.length);
// At this point, range points at the first occurrence of `needle`
// in `haystack`. You can now delete it, replace it with something
// else, and so on, and after that, set your original string to the
// innerHTML of the document fragment representing the range.
console.log(range);
}
JSFiddle.
I'm making a highlighting plugin for a client to find things in a page, and I decided to test it with a help viewer I'm still building, but I'm having an issue that'll (probably) require some regex.
I do not want to parse HTML, and I'm totally open to doing this differently; this just seems like the best/right way.
http://oscargodson.com/labs/help-viewer
http://oscargodson.com/labs/help-viewer/js/jquery.jhighlight.js
Type something in the search... ok, refresh the page, now type, like, class or class=" or type <a you'll notice it'll search the actual HTML (as expected). How can I only search the text?
If I do .text() it'll vaporize all the HTML and what I get back will just be a big blob of text, but I still want the HTML so I don't lose formatting, links, images, etc. I want this to work like CMD/CTRL+F.
You'd use this plugin like:
$('article').jhighlight({find:'class'});
To remove them:
.jhighlight('remove')
==UPDATE==
While Mike Samuel's idea below does in fact work, it's a tad heavy for this plugin. It's mainly for a client looking to erase bad words and/or MS Word characters during a "publishing" process of a form. I'm looking for a more lightweight fix, any ideas?
You really don't want to use eval, mess with innerHTML or parse the markup "manually". The best way, in my opinion, is to deal with text nodes directly and keep a cache of the original html to erase the highlights. Quick rewrite, with comments:
(function($){
$.fn.jhighlight = function(opt) {
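var options = $.extend({}, $.fn.jhighlight.defaults, typeof opt === 'object' ? opt : {}) // copy, so the defaults are not overwritten
, txtProp = this[0].textContent ? 'textContent' : 'innerText';
if (opt !== 'remove' && $.trim(options.find).length < 1) return this;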
return this.each(function(){
var self = $(this);
// use a cache to clear the highlights
if (!self.data('htmlCache'))
self.data('htmlCache', self.html());
if(opt === 'remove'){
return self.html( self.data('htmlCache') );
}
// create Tree Walker
// https://developer.mozilla.org/en/DOM/treeWalker
var walker = document.createTreeWalker(
this, // walk only on target element
NodeFilter.SHOW_TEXT,
null,
false
);
var node
, matches
, flags = 'g' + (!options.caseSensitive ? 'i' : '')
, exp = new RegExp('('+options.find+')', flags) // capturing
, expSplit = new RegExp(options.find, flags) // no capturing
, highlights = [];
// walk this wayy
// and save matched nodes for later
while(node = walker.nextNode()){
if (matches = node.nodeValue.match(exp)){
highlights.push([node, matches]);
}
}
// must replace stuff after the walker is finished
// otherwise replacing a node will halt the walker
for(var nn=0,hln=highlights.length; nn<hln; nn++){
var node = highlights[nn][0]
, matches = highlights[nn][1]
, parts = node.nodeValue.split(expSplit) // split on matches
, frag = document.createDocumentFragment(); // temporary holder
// add text + highlighted parts in between
// like a .join() but with elements :)
for(var i=0,ln=parts.length; i<ln; i++){
// non-highlighted text
if (parts[i].length)
frag.appendChild(document.createTextNode(parts[i]));
// highlighted text
// skip last iteration
if (i < ln-1){
var h = document.createElement('span');
h.className = options.className;
h[txtProp] = matches[i];
frag.appendChild(h);
}
}
// replace the original text node
node.parentNode.replaceChild(frag, node);
};
});
};
$.fn.jhighlight.defaults = {
find:'',
className:'jhighlight',
color:'#FFF77B',
caseSensitive:false,
wrappingTag:'span'
};
})(jQuery);
If you're doing any manipulation on the page, you might want to replace the caching with another clean-up mechanism, not trivial though.
You can see the code working here: http://jsbin.com/anace5/2/
You also need to add display:block to your new html elements, the layout is broken on a few browsers.
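The "join with elements" step in the plugin above can be looked at in isolation, without the DOM. The sketch below is not the plugin's code: it uses an `exec` loop instead of the `split`/`match` pair, escapes the search term (the plugin passes it to RegExp unescaped), and returns plain objects where the plugin creates text nodes and spans. `escapeRegExp` and `segmentText` are names invented for this illustration.

```javascript
// Escape a user-supplied term so it is matched literally inside a RegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

// Hypothetical helper: cut `text` into alternating plain and highlighted
// segments, in document order.
function segmentText(text, find, caseSensitive) {
  if (!find) return [{ text: text, highlight: false }];
  var re = new RegExp(escapeRegExp(find), 'g' + (caseSensitive ? '' : 'i'));
  var segments = [];
  var last = 0; // end of the previous match
  var m;
  while ((m = re.exec(text)) !== null) {
    if (m.index > last) {
      segments.push({ text: text.slice(last, m.index), highlight: false });
    }
    segments.push({ text: m[0], highlight: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), highlight: false });
  }
  return segments;
}

var segments = segmentText('A class and a Class', 'class', false);
console.log(JSON.stringify(segments));
```

Each highlighted segment keeps the casing found in the text, which is why the plugin takes the span content from the matches rather than from the search term.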
In the javascript code prettifier, I had this problem. I wanted to search the text but preserve tags.
What I did was start with HTML, and decompose that into two bits.
The text content
Pairs of (index into text content where a tag occurs, the tag content)
So given
Lorem <b>ipsum</b>
I end up with
text = 'Lorem ipsum'
tags = [6, '<b>', 11, '</b>']
which allows me to search on the text, and then based on the result start and end indices, produce HTML including only the tags (and only balanced tags) in that range.
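A rough sketch of that decomposition, not the prettifier's actual code: `decompose` and `recompose` are hypothetical names, tags are recognised with a simple `<[^>]*>` pattern (so comments, CDATA and `>` inside attribute values are not handled), and `recompose` only does the full round trip rather than the balanced-tags-in-a-range step.

```javascript
// Split HTML into its text content and a flat list of
// (index into the text, tag source) pairs.
function decompose(html) {
  var text = '';
  var tags = [];
  var re = /<[^>]*>/g;
  var last = 0; // end of the previous tag in `html`
  var m;
  while ((m = re.exec(html)) !== null) {
    text += html.slice(last, m.index);
    tags.push(text.length, m[0]); // the tag sits at the current text length
    last = re.lastIndex;
  }
  text += html.slice(last);
  return { text: text, tags: tags };
}

// Put the tags back at their recorded text positions.
function recompose(text, tags) {
  var out = '';
  var last = 0;
  for (var i = 0; i < tags.length; i += 2) {
    out += text.slice(last, tags[i]) + tags[i + 1];
    last = tags[i];
  }
  return out + text.slice(last);
}

var parts = decompose('a <i>b</i> c');
console.log(parts.text);                        // a b c
console.log(JSON.stringify(parts.tags));        // [2,"<i>",3,"</i>"]
console.log(recompose(parts.text, parts.tags)); // a <i>b</i> c
```

A search then runs on `parts.text` only, and the recorded positions say which tags fall inside or around the matched range.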
Have a look here: getElementsByTagName() equivalent for textNodes.
You can probably adapt one of the proposed solutions to your needs (i.e. iterate over all text nodes, replacing the words as you go - this won't work in cases such as <tag>wo</tag>rd but it's better than nothing, I guess).
I believe you could just do:
$('#article :not(:has(*))').jhighlight({find : 'class'});
Since it grabs all leaf nodes in the article it would require valid xhtml, that is, it would only match link in the following example:
<p>This is some paragraph content with a <a href="#">link</a></p>
DOM traversal / selector application could slow things down a bit so it might be good to do:
article_nodes = article_nodes || $('#article :not(:has(*))');
article_nodes.jhighlight({find : 'class'});
Maybe something like this could be helpful:
>+[^<]*?(s(<[\s\S]*?>)?e(<[\s\S]*?>)?e)[^>]*?<+
The first part >+[^<]*? finds > of the last preceding tag
The third part [^>]*?<+ finds < of the first subsequent tag
In the middle we have (<[\s\S]*?>)? between characters of our search phrase (in this case - "see").
After regular expression searching you could use the result of the middle part to highlight search phrase for user.
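For an arbitrary search phrase the same pattern can be generated instead of written by hand. This is only a sketch: `escapeRegExp` and `buildTagTolerantPattern` are invented names, the optional gap is written as `(?:<[^>]*>)?` (non-capturing, one tag at most between two characters), and, like the pattern above, it only matches text that has a tag on both sides.

```javascript
// Escape a character so it is matched literally inside a RegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

// Hypothetical helper: build ">...(p<tag>?h<tag>?r...)...<" for a phrase.
function buildTagTolerantPattern(phrase) {
  var gap = '(?:<[^>]*>)?'; // at most one tag between two characters
  var middle = phrase.split('').map(escapeRegExp).join(gap);
  return new RegExp('>[^<]*?(' + middle + ')[^>]*?<', 'i');
}

var pattern = buildTagTolerantPattern('see');
var hit = pattern.exec('<p>You s<b>e</b>e it</p>');
console.log(hit[1]); // s<b>e</b>e

// The phrase inside an attribute value is not matched.
var miss = pattern.exec('<a title="see">x</a>');
console.log(miss);   // null
```

Capture group 1 holds the middle part, including any tags that interrupt the phrase, which is the piece to highlight.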
Why would the code below eliminate the whitespace around matched keyword text when replacing it with an anchor link? Note, this error only occurs in Chrome, and not in Firefox.
For complete context, the file is located at: http://seox.org/lbp/lb-core.js
To view the code in action (no errors found yet), the demo page is at http://seox.org/test.html. Copy/Pasting the first paragraph into a rich text editor (ie: dreamweaver, or gmail with rich text editor turned on) will reveal the problem, with words bunched together. Pasting it into a plain text editor will not.
// Find page text (not in links) -> doxdesk.com
function findPlainTextExceptInLinks(element, substring, callback) {
for (var childi= element.childNodes.length; childi-->0;) {
var child= element.childNodes[childi];
if (child.nodeType===1) {
if (child.tagName.toLowerCase()!=='a')
findPlainTextExceptInLinks(child, substring, callback);
} else if (child.nodeType===3) {
var index= child.data.length;
while (true) {
index= child.data.lastIndexOf(substring, index);
if (index===-1 || limit.indexOf(substring.toLowerCase()) !== -1)
break;
// don't match an alphanumeric char
var dontMatch =/\w/;
if(child.nodeValue.charAt(index - 1).match(dontMatch) || child.nodeValue.charAt(index+keyword.length).match(dontMatch))
break;
// alert(child.nodeValue.charAt(index+keyword.length + 1));
callback.call(window, child, index)
}
}
}
}
// Linkup function, call with various type cases (below)
function linkup(node, index) {
node.splitText(index+keyword.length);
var a= document.createElement('a');
a.href= linkUrl;
a.appendChild(node.splitText(index));
node.parentNode.insertBefore(a, node.nextSibling);
limit.push(keyword.toLowerCase()); // Add the keyword to memory
urlMemory.push(linkUrl); // Add the url to memory
}
// lower case (already applied)
findPlainTextExceptInLinks(lbp.vrs.holder, keyword, linkup);
Thanks in advance for your help. I'm nearly ready to launch the script, and will gladly comment in kudos to you for your assistance.
It's not anything to do with the linking functionality; it happens to copied links that are already on the page too, and the credit content, even if the processSel() call is commented out.
It seems to be a weird bug in Chrome's rich text copy function. The content in the holder is fine; if you cloneContents the selected range and alert its innerHTML at the end, the whitespaces are clearly there. But whitespaces just before, just after, and at the inner edges of any inline element (not just links!) don't show up in rich text.
Even if you add new text nodes to the DOM containing spaces next to a link, Chrome swallows them. I was able to make it look right by inserting non-breaking spaces:
var links= lbp.vrs.holder.getElementsByTagName('a');
for (var i= links.length; i-->0;) {
links[i].parentNode.insertBefore(document.createTextNode('\xA0 '), links[i]);
links[i].parentNode.insertBefore(document.createTextNode(' \xA0'), links[i].nextSibling);
}
but that's pretty ugly, should be unnecessary, and doesn't fix up other inline elements. Bad Chrome!
var keyword = links[i].innerHTML.toLowerCase();
It's unwise to rely on innerHTML to get text from an element, as the browser may escape or not-escape characters in it. Most notably &, but there's no guarantee over what characters the browser's innerHTML property will output.
As you seem to be using jQuery already, grab the content with text() instead.
var isDomain = new RegExp(document.domain, 'g');
if (isDomain.test(linkUrl)) { ...
That'll fail every second time, because global regexps remember their previous state (lastIndex): when used with methods like test, you're supposed to keep calling repeatedly until they return no match.
You don't seem to need g (multiple matches) here... but then you don't seem to need regexp here either as a simple String indexOf would be more reliable. (In a regexp, each . in the domain would match any character in the link.)
Better still, use the URL decomposition properties on Location to do a direct comparison of hostnames, rather than crude string-matching over the whole URL:
if (location.hostname===links[i].hostname) { ...
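The lastIndex behaviour described above is easy to reproduce outside the page. In this sketch the domain and URL are made-up stand-ins for document.domain and linkUrl.

```javascript
// Made-up values standing in for document.domain and linkUrl.
var url = 'http://example.com/page';

// A global regexp keeps its position between calls to test().
var globalRe = new RegExp('example\\.com', 'g');
var first = globalRe.test(url);   // match found, lastIndex moves past it
var second = globalRe.test(url);  // search resumes after the match: no match, lastIndex resets
var third = globalRe.test(url);   // starts from 0 again and matches

// Without the g flag every call starts from the beginning.
var plainRe = /example\.com/;
var stable = [plainRe.test(url), plainRe.test(url)];

// A plain substring check has no state at all.
var byIndexOf = url.indexOf('example.com') !== -1;

console.log(first, second, third); // true false true
console.log(stable, byIndexOf);
```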
// don't match an alphanumeric char
var dontMatch =/\w/;
if(child.nodeValue.charAt(index - 1).match(dontMatch) || child.nodeValue.charAt(index+keyword.length).match(dontMatch))
break;
If you want to match words on word boundaries, and case insensitively, I think you'd be better off using a regex rather than plain substring matching. That'd also save doing four calls to findText for each keyword as it is at the moment. You can grab the inner bit (in if (child.nodeType==3) { ...) of the function in this answer and use that instead of the current string matching.
The annoying thing about making regexps from string is adding a load of backslashes to the punctuation, so you'll want a function for that:
// Backslash-escape string for literal use in a RegExp
//
function RegExp_escape(s) {
return s.replace(/([/\\^$*+?.()|[\]{}])/g, '\\$1');
}
var keywordre= new RegExp('\\b'+RegExp_escape(keyword)+'\\b', 'gi');
You could even do all the keyword replacements in one go for efficiency:
var keywords= [];
var hrefs= [];
for (var i=0; i<links.length; i++) {
...
var text= $(links[i]).text();
keywords.push('(\\b'+RegExp_escape(text)+'\\b)');
hrefs.push(links[i].href);
}
var keywordre= new RegExp(keywords.join('|'), 'gi');
and then for each match in linkup, check which match group actually matched and link with the entry of hrefs at the corresponding index (group n maps to hrefs[n-1]).
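Here is how the group lookup could work, sketched on plain strings. The `links` array is hypothetical data standing in for the anchors collected from the page, and `escapeRegExp` and `hrefForMatch` are names invented for this illustration; the group that participated in a match is the one that is not undefined.

```javascript
// Backslash-escape a string for literal use in a RegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

// Hypothetical data standing in for the links collected from the page.
var links = [
  { text: 'regex', href: 'http://example.com/regex' },
  { text: 'DOM', href: 'http://example.com/dom' }
];

var keywords = [];
var hrefs = [];
for (var i = 0; i < links.length; i++) {
  keywords.push('(\\b' + escapeRegExp(links[i].text) + '\\b)');
  hrefs.push(links[i].href);
}
var keywordre = new RegExp(keywords.join('|'), 'gi');

// Capture group n belongs to keyword n-1, so look for the group that matched.
function hrefForMatch(match) {
  for (var g = 1; g < match.length; g++) {
    if (match[g] !== undefined) return hrefs[g - 1];
  }
  return null;
}

var found = [];
var m;
while ((m = keywordre.exec('Walk the dom with a Regex, not regexes.')) !== null) {
  found.push(m[0] + ' -> ' + hrefForMatch(m));
}
console.log(found);
```

The word boundaries keep "regexes" from being linked, and the `i` flag lets "dom" and "Regex" match regardless of case.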
I'd like to help you more, but it's hard to guess without being able to test it. I suppose you can get around it by adding space-like characters around your links, e.g. &nbsp;.
By the way, this feature of yours that adds helpful links on copying is really interesting.