Why would the code below eliminate the whitespace around matched keyword text when replacing it with an anchor link? Note that this error occurs only in Chrome, not Firefox.
For complete context, the file is located at: http://seox.org/lbp/lb-core.js
To view the code in action (no errors found yet), the demo page is at http://seox.org/test.html. Copying and pasting the first paragraph into a rich text editor (e.g. Dreamweaver, or Gmail with the rich text editor turned on) will reveal the problem, with words bunched together. Pasting it into a plain text editor will not.
// Find page text (not in links) -> doxdesk.com
function findPlainTextExceptInLinks(element, substring, callback) {
for (var childi= element.childNodes.length; childi-->0;) {
var child= element.childNodes[childi];
if (child.nodeType===1) {
if (child.tagName.toLowerCase()!=='a')
findPlainTextExceptInLinks(child, substring, callback);
} else if (child.nodeType===3) {
var index= child.data.length;
while (true) {
index= child.data.lastIndexOf(substring, index);
if (index===-1 || limit.indexOf(substring.toLowerCase()) !== -1)
break;
// don't match an alphanumeric char
var dontMatch =/\w/;
if(child.nodeValue.charAt(index - 1).match(dontMatch) || child.nodeValue.charAt(index+keyword.length).match(dontMatch))
break;
// alert(child.nodeValue.charAt(index+keyword.length + 1));
callback.call(window, child, index)
}
}
}
}
// Linkup function, call with various type cases (below)
function linkup(node, index) {
node.splitText(index+keyword.length);
var a= document.createElement('a');
a.href= linkUrl;
a.appendChild(node.splitText(index));
node.parentNode.insertBefore(a, node.nextSibling);
limit.push(keyword.toLowerCase()); // Add the keyword to memory
urlMemory.push(linkUrl); // Add the url to memory
}
// lower case (already applied)
findPlainTextExceptInLinks(lbp.vrs.holder, keyword, linkup);
Thanks in advance for your help. I'm nearly ready to launch the script, and will gladly comment in kudos to you for your assistance.
It's not anything to do with the linking functionality; it happens to copied links that are already on the page too, and the credit content, even if the processSel() call is commented out.
It seems to be a weird bug in Chrome's rich text copy function. The content in the holder is fine; if you cloneContents the selected range and alert its innerHTML at the end, the whitespaces are clearly there. But whitespaces just before, just after, and at the inner edges of any inline element (not just links!) don't show up in rich text.
Even if you add new text nodes to the DOM containing spaces next to a link, Chrome swallows them. I was able to make it look right by inserting non-breaking spaces:
var links= lbp.vrs.holder.getElementsByTagName('a');
for (var i= links.length; i-->0;) {
links[i].parentNode.insertBefore(document.createTextNode('\xA0 '), links[i]);
links[i].parentNode.insertBefore(document.createTextNode(' \xA0'), links[i].nextSibling);
}
but that's pretty ugly, should be unnecessary, and doesn't fix up other inline elements. Bad Chrome!
var keyword = links[i].innerHTML.toLowerCase();
It's unwise to rely on innerHTML to get text from an element, as the browser may or may not escape characters in it; most notably &, but there's no guarantee over which characters the browser's innerHTML property will output escaped.
As you seem to be using jQuery already, grab the content with text() instead.
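For instance (a minimal sketch, reusing the links collection from the snippet above):
// innerHTML may come back entity-escaped ("Fish &amp; Chips"),
// whereas text() gives you the actual characters ("Fish & Chips")
var keyword = $(links[i]).text().toLowerCase();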
var isDomain = new RegExp(document.domain, 'g');
if (isDomain.test(linkUrl)) { ...
That'll fail every second time, because global regexps remember their previous state (lastIndex): when used with methods like test, you're supposed to keep calling repeatedly until they return no match.
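A quick illustration of that statefulness (not from the original script, just a demonstration):
var isDomain = new RegExp('example\\.com', 'g');
isDomain.test('http://example.com/a'); // true  - lastIndex now points past the match
isDomain.test('http://example.com/b'); // false - the search resumes from lastIndex
isDomain.test('http://example.com/c'); // true  - lastIndex was reset to 0 after the miss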
You don't seem to need g (multiple matches) here... but then you don't seem to need a regexp at all, as a simple String indexOf would be more reliable. (In a regexp, each . in the domain would match any character in the link.)
Better still, use the URL decomposition properties on Location to do a direct comparison of hostnames, rather than crude string-matching over the whole URL:
if (location.hostname===links[i].hostname) { ...
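Anchor elements expose the same URL decomposition properties (hostname, pathname and so on), so the check can be a direct comparison. A sketch, assuming the links collection from earlier:
var links = lbp.vrs.holder.getElementsByTagName('a');
for (var i = links.length; i-- > 0;) {
    // the browser has already parsed the URL, so dots and query strings
    // can't cause the false positives a regexp over the full href could
    if (links[i].hostname === location.hostname) {
        // internal link
    } else {
        // external link
    }
}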
// don't match an alphanumeric char
var dontMatch =/\w/;
if(child.nodeValue.charAt(index - 1).match(dontMatch) || child.nodeValue.charAt(index+keyword.length).match(dontMatch))
break;
If you want to match words on word boundaries, and case-insensitively, I think you'd be better off using a regex rather than plain substring matching. That'd also save making four calls to findPlainTextExceptInLinks for each keyword, as the script does at the moment to cover the different case variants. You can replace the inner bit of the function (inside if (child.nodeType===3) { ...) with regex matching instead of the current string matching (a sketch follows after the escaping helper below).
The annoying thing about making regexps from string is adding a load of backslashes to the punctuation, so you'll want a function for that:
// Backslash-escape string for literal use in a RegExp
//
function RegExp_escape(s) {
return s.replace(/([/\\^$*+?.()|[\]{}])/g, '\\$1')
};
var keywordre= new RegExp('\\b'+RegExp_escape(keyword)+'\\b', 'gi');
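With that regex in hand, the text-node branch of findPlainTextExceptInLinks could be swapped for something like this (a sketch only; the match positions are collected first and handed to the callback back to front, so that splitting the node inside linkup doesn't invalidate the earlier indexes):
} else if (child.nodeType === 3) {
    var m, positions = [];
    keywordre.lastIndex = 0;              // global regexps are stateful (see above)
    while ((m = keywordre.exec(child.data)) !== null)
        positions.push(m.index);
    for (var p = positions.length; p-- > 0;)
        callback.call(window, child, positions[p]);
}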
You could even do all the keyword replacements in one go for efficiency:
var keywords= [];
var hrefs= [];
for (var i=0; i<links.length; i++) {
...
var text= $(links[i]).text();
keywords.push('(\\b'+RegExp_escape(text)+'\\b)');
hrefs.push(links[i].href);
}
var keywordre= new RegExp(keywords.join('|'), 'gi');
and then for each match in linkup, check which match group has non-zero length and link it with the hrefs entry at the same index.
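A rough sketch of that check (this assumes linkup gets handed the match array from the combined regex; group numbering is 1-based, so group n lines up with hrefs[n - 1]):
for (var g = 1; g < match.length; g++) {
    if (match[g]) {                    // the one alternation group that matched
        var linkUrl = hrefs[g - 1];    // parallel array built in the loop above
        // wrap match.index .. match.index + match[g].length in an <a href=linkUrl>
        break;
    }
}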
I'd like to help you more, but it's hard to guess without being able to test it. I suppose you can get around it by adding space-like characters around your links, e.g. non-breaking spaces.
By the way, this feature of yours that adds helpful links on copying is really interesting.
I've been working on creating an HTML parser and formatter, and I've just added a feature to optionally render whitespace visible, by replacing spaces with · (middle dot) characters, adding arrows for tabs and newlines, etc.
The full in-progress source code is here: https://github.com/kshetline/html-parser, and the most relevant file that's doing the CSS styling of HTML is here: https://github.com/kshetline/html-parser/blob/master/src/stylizer.ts.
While it's nice to be able to visualize whitespace when you want to, you wouldn't want spaces to be turned into middle dot characters if you select and copy the text. But that's just what happens, at least without some intervention.
I have found a crude way to fix the problem with a bit of JavaScript, which I've put into a Code Pen here: https://codepen.io/kshetline/pen/NWKYZJg.
document.body.addEventListener('copy', (event) => {
let selection = document.getSelection().toString();
selection = selection.replace(/·|↵\n|↵/g, ch => ch === '·' ? ' ' : '\n');
event.clipboardData.setData('text/plain', selection);
event.preventDefault();
});
I'm wondering, however, if there's a better way to do this.
My first choice would be something that didn't rely on JavaScript at all, like if there's some way via CSS or perhaps some accessibility-related HTML attribute that would essentially say, "this is the real text that should be copied, not what you see on the screen".
My second choice would be if someone can point me to more detailed documentation of the JavaScript clipboard feature than I've been able to find, because if I have to rely on JavaScript, I'd at least like my JavaScript to be smarter. The quick-and-dirty solution turns every middle dot character into a space, even if it was truly supposed to be a middle dot in the first place.
Is there enough info in the clipboard object to figure out which of the selected text has what CSS styling, so I could know to convert only the text that's inside <span>s which have my whitespace class, and still also find the rest of the non-whitespace text, in proper order, to piece it all back together again?
I still couldn't find much documentation on how selection objects work, but I played around with them in the web console, and eventually figured out enough to get by.
This is the JavaScript I came up with:
function restoreWhitespaceStrict(s) {
return s.replace(/·|[\u2400-\u241F]|\S/g, ch => ch === '·' ? ' ' :
ch.charCodeAt(0) >= 0x2400 ? String.fromCharCode(ch.charCodeAt(0) - 0x2400) : '');
}
const wsReplacements = {
'·': ' ',
'→\t': '\t',
'↵\n': '\n',
'␍\r': '\r',
'␍↵\r\n': '\r\n'
}
function restoreWhitespace(s) {
return s.replace(/·|→\t|↵\n|␍\r|␍↵\r\n|→|↵|␍|[\u2400-\u241F]/g, ws =>
wsReplacements[ws] || (ws.charCodeAt(0) >= 0x2400 ? String.fromCharCode(ws.charCodeAt(0) - 0x2400) : ''));
}
document.body.addEventListener('copy', (event) => {
const selection = document.getSelection();
let newSelection;
let copied = false;
if (selection.anchorNode && selection.getRangeAt) {
try {
const nodes = selection.getRangeAt(0).cloneContents().childNodes;
let parts = [];
// nodes isn't a "real" array - no forEach!
for (let i = 0; i < nodes.length; ++i) {
const node = nodes[i];
if (node.classList && node.classList.contains('whitespace'))
parts.push(restoreWhitespaceStrict(node.innerText));
else if (node.localName === 'span')
parts.push(node.innerText);
else
parts.push(node.nodeValue);
}
newSelection = parts.join('');
copied = true;
}
catch (err) {}
}
if (!copied)
newSelection = restoreWhitespace(selection.toString());
event.clipboardData.setData('text/plain', newSelection);
event.preventDefault();
});
I've tried this on three browsers (Chrome, Firefox, and Safari), and it's working on all of them, but I still took the precaution of both testing for the presence of some of the expected object properties, and then also using try/catch, just in case I hit an incompatible browser, in which case the not-so-smart version of fixing the clipboard takes over.
It looks like the selection is handled as a list of regular DOM nodes. Chrome's selection object has both an anchorNode and an extentNode to mark the start and end of the selection, but Firefox only has the anchorNode (I didn't check Safari for extentNode). I couldn't find any way to get the full list of nodes directly, however. I could only get the full list using the cloneContents() method. The first and last nodes obtained this way are altered from the original start and end nodes by being limited to the portion of the text content that was selected in each node.
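If you want to see this for yourself, here's a small sketch (not part of the solution above) that logs what cloneContents() returns as the selection changes:
document.addEventListener('selectionchange', () => {
  const sel = document.getSelection();
  if (!sel.rangeCount) return;
  const nodes = sel.getRangeAt(0).cloneContents().childNodes;
  // The copies at either end contain only the selected portion of their text,
  // not the full nodeValue of the original boundary nodes.
  console.log(nodes.length, nodes[0] && nodes[0].textContent,
              nodes[nodes.length - 1] && nodes[nodes.length - 1].textContent);
});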
I'm retrieving tweets from Twitter with the Twitter API and displaying them in my own client.
However, I'm having some difficulty properly highlighting the right search terms; the effect I'm after is the matched keywords standing out (bolded) within each tweet.
The way I'm trying to do this in JS is with a function called highlightSearchTerms(), which takes the text of the tweet and an array of keywords to bold as arguments. It returns the text of the fixed tweet. I'm bolding keywords by wrapping them in a <span> that has the class .search-term.
I'm having a lot of problems, which include:
Running a simple replace doesn't preserve case
There is a lot of conflict with the keyword being in href tags
If I try to do a for loop with a replace, I don't know how to only modify search terms that aren't in an href, and that I haven't already wrapped with the span above
An example tweet I want to be able to handle for:
Input:
This is a keyword. This is a <a href="http://search.twitter.com/q=%23keyword">
#keyword</a> with a hashtag. This is a link with kEyWoRd:
http://thiskeyword.com.
Expected Output:
This is a
<span class="search-term">keyword</span>
. This is a <a href="http://search.twitter.com/q=%23keyword"> #
<span class="search-term">keyword</span>
</a> with a hashtag. This is a link with
<span class="search-term">kEyWoRd</span>
:<a href="http://thiskeyword.com">http://this
<span class="search-term>keyword.com</span>
</a>.
I've tried many things, but unfortunately I can't quite find out the right way to tackle the problem. Any advice at all would be greatly appreciated.
Here is my code that works for some cases but ultimately doesn't do what I want. It fails to handle cases where the keyword is in the later half of the link (e.g. http://twitter.com/this_keyword). Sometimes it also strangely highlights the two characters before a keyword. I doubt the best solution would resemble my code too much.
function _highlightSearchTerms(text, keywords){
for (var i=0;i<keywords.length;i++) {
// create regex to find all instances of the keyword, catch the links that potentially come before so we can filter them out in the next step
var searchString = new RegExp("[http://twitter.com/||q=%23]*"+keywords[i], "ig");
// create an array of all the matched keyword terms in the tweet, we can't simply run a replace all as we need them to retain their initial case
var keywordOccurencesInitial = text.match(searchString);
// create an array of the keyword occurences we want to actually use, I'm sure there's a better way to create this array but rather than try to optimize, I just worked with code I know should work because my problem isn't centered around this block
var keywordOccurences = [];
if (keywordOccurencesInitial != null) {
for(var i3=0;i3<keywordOccurencesInitial.length;i3++){
if (keywordOccurencesInitial[i3].indexOf("http://twitter.com/") > -1 || keywordOccurencesInitial[i3].indexOf("q=%23") > -1)
continue;
else
keywordOccurences.push(keywordOccurencesInitial[i3]);
}
}
// replace our matches with search term
// the regex should ensure to NOT catch terms we've already wrapped in the span
// i took the negative lookbehind workaround from http://stackoverflow.com/a/642746/1610101
if (keywordOccurences != null) {
for(var i2=0;i2<keywordOccurences.length;i2++){
var searchString2 = new RegExp("(q=%23||http://twitter.com/||<span class='search-term'>)?"+keywordOccurences[i2].trim(), "g"); // don't replace what we've alrdy replaced
text = text.replace(searchString2,
function($0,$1){
return $1?$0:"<span class='search-term'>"+keywordOccurences[i2].trim()+"</span>";
});
}
}
return text;
}
Here's something you can probably work with:
var getv = document.getElementById('tekt').value;
var keywords = "keyword,big elephant"; // comma delimited keyword list
var rekeywords = "(" + keywords.replace(/\, ?/ig,"|") + ")"; // wraps keywords in ( and ), and changes , to a pipe (character for regex alternation)
var keyrex = new RegExp("(#?\\b" + rekeywords + "\\b)(?=[^>]*?<[^>]*>|(?![^>]*>))","igm")
alert(keyrex);
document.getElementById('tekt').value = document.getElementById('tekt').value.replace(keyrex,"<span class=\"search-term\">$1</span>");
And here is a variation that attempts to deal with word forms. If the keyword ends in ed, es, s, ing, etc., it chops the suffix off, and when looking for the word boundary at the end of the word it also accepts words ending in those common suffixes. It's not perfect; for instance, the past tense of ride is rode, and accounting for that with regex is nigh impossible without opening yourself up to tons of false positives.
var getv = document.getElementById('tekt').value;
var keywords = "keywords,big elephant";
var rekeywords = "(" + keywords.replace(/(es|ing|ed|d|s|e)?\b(\s*,\s*|$)/ig,"(es|ing|ed|d|s|e)?$2").replace(/,/g,"|") + ")";
var keyrex = new RegExp("(#?\\b" + rekeywords + "\\b)(?=[^>]*?<[^>]*>|(?![^>]*>))","igm")
console.log(keyrex);
document.getElementById('tekt').value = document.getElementById('tekt').value.replace(keyrex,"<span class=\"search-term\">$1</span>");
Edit
This is just about perfect. Do you know how to slightly modify it so the keyword in thiskeyword.com would also be highlighted?
Change this line
var keyrex = new RegExp("(#?\\b" + rekeywords + "\\b)(?=[^>]*?<[^>]*>|(?![^>]*>))","igm")
to (All I did was remove both \\b's):
var keyrex = new RegExp("(#?" + rekeywords + ")(?=[^>]*?<[^>]*>|(?![^>]*>))","igm")
But be warned, you'll have problems like smiles ending up as s<span class="search-term">mile</span>s (if a user searches for mile), and there's nothing regex can do about that. Regex's definition of a word is alphanumeric characters; it has no dictionary to check.
I need your help,
The JavaScript code below finds a given match in the UL LI list by comparing each item to the value of the input box. My question is twofold: while the code works, it seems to ignore any values that are enclosed in brackets, and
it leaves the search results (matches) highlighted even after the input box is cleared. The mark tags should be removed if the user is not searching for anything.
How can the code be modified so as to accommodate the two above conditions?
Here is a fiddle: http://jsfiddle.net/z4TMC/4/
The code in question:
$("#refdocs").keyup(function(){
var search = $(this).val();
$("#refdocs_list li").each(function(){
var val = $(this).text();
if (val.toLowerCase().indexOf(search) >= 0 && search.length) {
//$(this).addClass("selected");
$(this).html(val.replace(RegExp("("+search.replace(/[\-$*{}()]/g,"\\$1")+")","ig"), "<mark>$1</mark>" ));
}
else {
$(this).removeClass("selected");
}
});
});
Your regex isn't perfect just yet. You aren't replacing/escaping square brackets ([]), nor are you properly escaping any backslashes that might have been entered. I've set up a working fiddle.
All I had to do to get it working was add a listener to handle mouse input events, and change the regex a tad:
//add this, change triggers when focus is lost, though, but handles empty fields
$("#refdocs").on('change', function()
{
if (this.value == '')
$("#refdocs_list li").each(function()
{
this.innerHTML = $(this).text();
});
else
$(this).trigger('keyup');//run keyup handler if required
});
I've also noticed that, for one of the list items (XAF-2014-123456), your code fails to match XAF if the user input is in upper-case. That's because of your if condition:
if (val.toLowerCase().indexOf(search) >= 0 && search.length)
{//val.toLowerCase() -> also lowerCase the user input to ensure same case!
I've simply changed it to:
if (val.toLowerCase().indexOf(search.toLowerCase()) >= 0 && search.length)
{//lower-case to lower-case comparison, XAF and xaf both work now.
And now for the regex:
search.replace(/[-\\$*{}()[]]/g,"\\$1")
Notice how I moved the - to the front of the character-class. That's simply so that it doesn't need escaping.
Next, to filter out any backslashes that might have been put in by the user, I match a \\ (an escaped backslash matches a literal \).
The rest of the character class is pretty much the same, but right near the end (the closing ]), I've also added [], to match square brackets. Normally, you'd escape them, but in this case, there is no ambiguity. Compare this:
/ambi[a-z[]0-9]/g
Here, there are a number of ways to interpret the regex: a char class of a-z plus 0-9 and the [] literals, or a malformed pattern with a character class of a-z[, a digit, and a trailing ]... However, [{}[]] is clearly a char-class that matches either curly or square brackets. If you want to play it safe, though, adding backslashes doesn't make any difference:
/[-\\$*{}()\[\]]/g
Update
Though this isn't something you were looking for: you commented that you wanted to support IE10. Since IE9, IE has slowly but surely been conforming more and more to the standards upheld by other browsers. It's therefore perfectly doable, and pretty easy, to write your code in a cross-browser compatible way.
It's not a big secret that I don't really like jQ (I find it bulky, slow, and used far more than it should be). So I've gone ahead and put together this fiddle, which shows you a VanillaJS approach to what you're trying to do. It's still a tad rough around the edges, but you should be able to get it up and running in no time.
In this code, I've used "clever" (as in efficient) techniques such as a closure/IIFE (to store the DOM references, avoiding too many DOM lookups) and event delegation. Google these terms and work out how this code ticks, if you want to. I promise you though: it's worth it, and you'll find that most of the jQ code out there is quite ghastly.
Anyway, for those of you that care: here's the code:
(function(refdoc)
{
var listItems = document.querySelectorAll('#refdocs_list li'),
wrapper = document.querySelector('#refdocs_list'),
classPattern = /\bselected/,
callback = function(e)
{
var i;
for (i=0;i<listItems.length;++i)
listItems[i].innerHTML = listItems[i].textContent;
if (!this.value)
return e;
for (i=0;i<listItems.length;++i)
if (listItems[i].textContent.toLowerCase().indexOf(this.value.toLowerCase()) >= 0)
listItems[i].innerHTML = listItems[i].textContent.replace(new RegExp("("+this.value.replace(/[-\\$*{}()[]]/g,"\\$1")+")","ig"), "<mark>$1</mark>" );
};
wrapper.addEventListener('click', function(e)
{
var i, t = (e = e || window.event).target || e.srcElement;
if (t.tagName.toLowerCase() === 'li')
{
for (i=0;i<listItems.length;++i)
{
listItems[i].innerHTML = listItems[i].textContent;
listItems[i].className = listItems[i].className.replace(classPattern, '');
}
t.className += ' selected';
refdoc.value = t.textContent;
}
}, false);
refdoc.addEventListener('keyup', callback, false);
refdoc.addEventListener('change', callback, false);
}(document.querySelector('#refdocs')));
All you need to do is put this in a window.addEventListener('load', function(){ /*code here */}, false); callback, and that's it.
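That is, something along these lines (just a sketch):
window.addEventListener('load', function () {
    (function (refdoc) {
        /* ...the IIFE body from above... */
    }(document.querySelector('#refdocs')));
}, false);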
I have written this regexp: <(a*)\b[^>]*>.*?</\1>
and is tested on this regexp testing site: http://gskinner.com/RegExr/?2tntr
The point of the regexp is to go through a site's HTML and find all of the links. It should then return these in an array for me to manipulate.
On the regexp testing site it works perfectly, but when put in action with JavaScript on my site it returns null.
JavaScript looks like this:
var data = $('#mainDivOnMiddleOfPage').html();
var pattern = "<(a*).*href=.*>.*</a>";
var modi = "g";
var patt = new RegExp(pattern, modi);
var result = patt.exec(data);
jQuery gets the content of the page. This is tested and verified.
Question is, why does this return null in JavaScript but what it is supposed to return in the regexp tester?
All <a> links:
<a[^>]*?\bhref=['\"](.*?)['\"]
Absolute links only (starting with http):
<a[^>]*?\bhref=['\"](http.*?)['\"]
JavaScript code:
var html = '<a href="test.html">';
var m = html.match(/<a[^>]*?\bhref=['"](.*?)['"]/);
console.log(m[1]);
See and test the code here.
I use the following code to do the same thing and it works for me, try it out
var data = document.getElementById('mainDivOnMiddleOfPage').innerHTML;
var result = data.match(/<(a*).*href=.*>.*<\/a>/);
Going to go ahead and post this here, since I think it's what you want -- it is not a RegEx solution, however.
$(function(){
$.ajax({
url: "test.htm",
success: function(data){
var array_of_links = $.makeArray($("a",data));
// do your stuff here
}
});
});
I'm conscious an answer has been chosen. However it's worth mentioning that the current REGEX solutions match the tags but not the actual HREFs in isolation.
This is where JavaScript falls down, since its somewhat simplistic regex support does not give you the captured sub-groups from String.match when the global g flag is specified.
One way round this is to exploit the REGEX replacement callback. This will get just the link HREFs, not the tags.
var html = document.body.innerHTML,
links = [];
html.replace(/<a[^>]*?href=('|")(.*?)\1/gi, function($0, $1, $2) {
links.push($2);
});
//links is now an array of hrefs
It also uses a back-reference to close the href attribute, i.e. making sure the opening and closing quotes are both single or both double, not mixed.
Sidenote: as others have mentioned, where possible, you'd want to DOM this rather than REGEX.
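For comparison, a DOM-based sketch that collects the same hrefs without any regex (assuming you have the document itself rather than an HTML string):
var links = [];
var anchors = document.getElementsByTagName('a');
for (var i = 0; i < anchors.length; i++) {
    // skip <a> elements that are just named anchors with no href
    if (anchors[i].getAttribute('href'))
        links.push(anchors[i].href);   // .href is the resolved, absolute URL
}
// links is now an array of hrefs, same as the regex version above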
"The point of the regexp is to go through a sites HTML and find all of the links. It should then return these in an Array for me to manipulate."
I won't add another regex answer, but just want to point out that if you have hold of the document (not just the html) then it's easier to walk through the links collection. That contains all <a href="">'s but also all <area> elements:
for (var link, links = document.links, n = links.length, i=0; i<n; i++){
link = links[i];
switch (link.tagName){
case "A":
//do something with the link
break;
case "AREA":
//do something with the area.
break;
}
}
Your problem is that you are not compiling your regex:
patt.compile();
You have to call it before using it with the exec() method.
I'm making a highlighting plugin for a client to find things in a page, and I decided to test it with a help viewer I'm still building, but I'm having an issue that'll (probably) require some regex.
I do not want to parse HTML, and I'm totally open to doing this differently; this just seems like the best/right way.
http://oscargodson.com/labs/help-viewer
http://oscargodson.com/labs/help-viewer/js/jquery.jhighlight.js
Type something in the search... OK, now refresh the page and type something like class or class=" or <a, and you'll notice it searches the actual HTML (as expected). How can I search only the text?
If I do .text() it'll vaporize all the HTML and what I get back will just be a big blob of text, but I still want the HTML so I don't lose formatting, links, images, etc. I want this to work like CMD/CTRL+F.
You'd use this plugin like:
$('article').jhighlight({find:'class'});
To remove them:
.jhighlight('remove')
==UPDATE==
While Mike Samuel's idea below does in fact work, it's a tad heavy for this plugin. It's mainly for a client looking to erase bad words and/or MS Word characters during a form's "publishing" process. I'm looking for a more lightweight fix; any ideas?
You really don't want to use eval, mess with innerHTML or parse the markup "manually". The best way, in my opinion, is to deal with text nodes directly and keep a cache of the original html to erase the highlights. Quick rewrite, with comments:
(function($){
$.fn.jhighlight = function(opt) {
var options = $.extend($.fn.jhighlight.defaults, opt)
, txtProp = this[0].textContent ? 'textContent' : 'innerText';
if ($.trim(options.find).length < 1) return this;
return this.each(function(){
var self = $(this);
// use a cache to clear the highlights
if (!self.data('htmlCache'))
self.data('htmlCache', self.html());
if(opt === 'remove'){
return self.html( self.data('htmlCache') );
}
// create Tree Walker
// https://developer.mozilla.org/en/DOM/treeWalker
var walker = document.createTreeWalker(
this, // walk only on target element
NodeFilter.SHOW_TEXT,
null,
false
);
var node
, matches
, flags = 'g' + (!options.caseSensitive ? 'i' : '')
, exp = new RegExp('('+options.find+')', flags) // capturing
, expSplit = new RegExp(options.find, flags) // no capturing
, highlights = [];
// walk this wayy
// and save matched nodes for later
while(node = walker.nextNode()){
if (matches = node.nodeValue.match(exp)){
highlights.push([node, matches]);
}
}
// must replace stuff after the walker is finished
// otherwise replacing a node will halt the walker
for(var nn=0,hln=highlights.length; nn<hln; nn++){
var node = highlights[nn][0]
, matches = highlights[nn][1]
, parts = node.nodeValue.split(expSplit) // split on matches
, frag = document.createDocumentFragment(); // temporary holder
// add text + highlighted parts in between
// like a .join() but with elements :)
for(var i=0,ln=parts.length; i<ln; i++){
// non-highlighted text
if (parts[i].length)
frag.appendChild(document.createTextNode(parts[i]));
// highlighted text
// skip last iteration
if (i < ln-1){
var h = document.createElement('span');
h.className = options.className;
h[txtProp] = matches[i];
frag.appendChild(h);
}
}
// replace the original text node
node.parentNode.replaceChild(frag, node);
};
});
};
$.fn.jhighlight.defaults = {
find:'',
className:'jhighlight',
color:'#FFF77B',
caseSensitive:false,
wrappingTag:'span'
};
})(jQuery);
If you're doing any manipulation on the page, you might want to replace the caching with another clean-up mechanism, though that's not trivial.
You can see the code working here: http://jsbin.com/anace5/2/
You also need to add display:block to your new HTML elements; the layout is broken in a few browsers.
In the JavaScript code prettifier, I had this problem. I wanted to search the text but preserve tags.
What I did was start with HTML, and decompose that into two bits.
The text content
Pairs of (index into text content where a tag occurs, the tag content)
So given
Lorem <b>ipsum</b>
I end up with
text = 'Lorem ipsum'
tags = [6, '<b>', 10, '</b>']
which allows me to search on the text, and then based on the result start and end indices, produce HTML including only the tags (and only balanced tags) in that range.
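A rough sketch of that decomposition using the browser's own parser (an illustration of the idea only, not the prettifier's actual implementation; attributes are dropped for brevity):
// Walk a parsed fragment, accumulating plain text and recording
// (text offset, tag markup) pairs as tags are encountered.
function decompose(html) {
    var div = document.createElement('div');
    div.innerHTML = html;
    var text = '', tags = [];
    (function walk(node) {
        for (var child = node.firstChild; child; child = child.nextSibling) {
            if (child.nodeType === 3) {
                text += child.nodeValue;
            } else if (child.nodeType === 1) {
                var name = child.tagName.toLowerCase();
                tags.push(text.length, '<' + name + '>');   // offset + opening tag
                walk(child);
                tags.push(text.length, '</' + name + '>');  // offset + closing tag
            }
        }
    })(div);
    return { text: text, tags: tags };
}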
Have a look here: getElementsByTagName() equivalent for textNodes.
You can probably adapt one of the proposed solutions to your needs (i.e. iterate over all text nodes, replacing the words as you go - this won't work in cases such as <tag>wo</tag>rd but it's better than nothing, I guess).
I believe you could just do:
$('#article :not(:has(*))').jhighlight({find : 'class'});
Since it grabs all leaf nodes in the article it would require valid XHTML; that is, it would only match the link in the following example:
<p>This is some paragraph content with a <a>link</a></p>
DOM traversal / selector application could slow things down a bit so it might be good to do:
article_nodes = article_nodes || $('#article :not(:has(*))');
article_nodes.jhighlight({find : 'class'});
Maybe something like this could be helpful:
>+[^<]*?(s(<[\s\S]*?>)?e(<[\s\S]*?>)?e)[^>]*?<+
The first part >+[^<]*? finds the > of the last preceding tag.
The third part [^>]*?<+ finds the < of the first subsequent tag.
In the middle we have (<[\s\S]*?>)? between characters of our search phrase (in this case - "see").
After the regular-expression search, you could use the result of the middle group to highlight the search phrase for the user.
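For example, a hedged sketch of how that middle group could be used to wrap the phrase, assuming an element to search in and the search-term class from the earlier answers (the phrase here is "see", as in the pattern above):
var html = element.innerHTML;
var re = />+[^<]*?(s(<[\s\S]*?>)?e(<[\s\S]*?>)?e)[^>]*?<+/g;
element.innerHTML = html.replace(re, function (whole, middle) {
    // wrap just the middle group (the phrase, possibly interleaved with tags)
    return whole.replace(middle, '<span class="search-term">' + middle + '</span>');
});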