I have a challenging problem to solve. I'm working on a script which takes a regex as an input. This script then finds all matches for this regex in a document and wraps each match in its own <span> element. The hard part is that the text is a formatted html document, so my script needs to navigate through the DOM and apply the regex across multiple text nodes at once, while figuring out where it has to split text nodes if needed.
For example, with a regex that captures full sentences starting with a capital letter and ending with a period, this document:
<p>
<b>HTML</b> is a language used to make <b>websites.</b>
It was developed by <i>CERN</i> employees in the early 90s.
</p>
Would ideally be turned into this:
<p>
<span><b>HTML</b> is a language used to make <b>websites.</b></span>
<span>It was developed by <i>CERN</i> employees in the early 90s.</span>
</p>
The script should then return the list of all created spans.
I already have some code which finds all the text nodes and stores them in a list along with their position across the whole document and their depth. You don't really need to understand that code to help me and its recursive structure can be a bit confusing. The first part I'm not sure how to do is figure out which elements should be included within the span.
function findTextNodes(node, depth = -1, start = 0) {
let list = [];
if (node.nodeType === Node.TEXT_NODE) {
list.push({ node, depth, start });
} else {
for (let i = 0; i < node.childNodes.length; ++i) {
const found = findTextNodes(node.childNodes[i], depth+1, start);
// Advance by the length of every text node found under this child
for (const item of found) start += item.node.nodeValue.length;
list = list.concat(found);
}
}
return list;
}
I figure I'll build a single string from the whole document, run the regex over it, use the list to find which nodes correspond to which regex matches, and then split the text nodes accordingly.
But an issue arises when I have a document like this:
<p>
This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
</p>
There's a sentence which starts outside of the <a> tag but ends inside it. Now I don't want the script to split that link in two tags. In a more complex document, it could ruin the page if it did. The code could either wrap two sentences together:
<p>
<span>This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a></span>
</p>
Or just wrap each part in its own element:
<p>
<span>This program is </span>
<a href="beta.html">
<span>not stable yet.</span>
<span>Do not use this in production yet.</span>
</a>
</p>
There could be a parameter to specify what it should do. I'm just not sure how to figure out when an impossible cut is about to happen, and how to recover from it.
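One way to spot such a cut before it happens, sketched here without a real DOM: record each text node's parent next to its position, then compare the parent of the node where the match starts with the parent of the node where it ends. If they are the same, everything in between is a fully contained sibling and one span is enough. The record shape, `buildRecords`, and `classifyMatch` are hypothetical names for illustration, and the check is deliberately conservative (it also flags matches that could still be wrapped by widening the range).

```javascript
// Hypothetical record shape: { start, length, parent }. In a real DOM,
// `parent` would be textNode.parentNode; plain string labels are used here.
function buildRecords(pieces) {
  let offset = 0;
  return pieces.map(([text, parent]) => {
    const record = { start: offset, length: text.length, parent };
    offset += text.length;
    return record;
  });
}

function classifyMatch(records, matchStart, matchEnd) {
  // First text node that overlaps the start of the match.
  const first = records.find(r => r.start + r.length > matchStart);
  // First text node that reaches the (exclusive) end of the match.
  const last = records.find(r => r.start + r.length >= matchEnd);
  if (!first || !last) return "out-of-range";
  // Same parent: both ends are siblings, so no element gets cut in two.
  return first.parent === last.parent ? "single-span" : "crosses-boundary";
}

// Model of the link example: the first text node sits in <p>, the second in <a>.
const records = buildRecords([
  ["This program is ", "p"],
  ["not stable yet. Do not use this in production yet.", "a"]
]);
const text = "This program is not stable yet. Do not use this in production yet.";

const results = [];
for (const m of text.matchAll(/[A-Z].*?\./g)) {
  results.push(classifyMatch(records, m.index, m.index + m[0].length));
}
console.log(results); // [ 'crosses-boundary', 'single-span' ]
```

When a match comes back as crossing a boundary, the parameter could then decide between merging it with the neighbouring match and wrapping each part separately.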
Another issue comes when I have whitespace inside a child element like this:
<p>This is a <b>sentence. </b></p>
Technically, the regex match would end right after the period, before the end of the <b> tag. However, it would be much better to consider the space as part of the match and wrap it like this:
<p><span>This is a <b>sentence. </b></span></p>
Than this:
<p><span>This is a </span><b><span>sentence.</span> </b></p>
But that's a minor issue. After all, I could just allow extra white-space to be included within the regex.
I know this might sound like a "do it for me" question and it's not the kind of quick question we see on SO on a daily basis, but I've been stuck on this for a while and it's for an open-source library I'm working on. Solving this problem is the last obstacle. If you think another SE site is better suited for this question, please redirect me.
Here are two ways to deal with this.
I don't know if the following will exactly match your needs. It's a simple enough solution to the problem, but at least it doesn't use RegEx to manipulate HTML tags. It performs pattern matching against the raw text and then uses the DOM to manipulate the content.
First approach
This approach creates only one <span> tag per match, leveraging some less common browser APIs.
(See the main problem of this approach below the demo, and if not sure, use the second approach).
The Range class represents a text fragment. It has a surroundContents function that lets you wrap a range in an element. Except it has a caveat:
This method is nearly equivalent to newNode.appendChild(range.extractContents()); range.insertNode(newNode). After surrounding, the boundary points of the range include newNode.
An exception will be thrown, however, if the Range splits a non-Text node with only one of its boundary points. That is, unlike the alternative above, if there are partially selected nodes, they will not be cloned and instead the operation will fail.
Well, MDN provides the workaround (the two calls quoted above), so all's good.
So here's an algorithm:
Make a list of Text nodes and keep their start indices in the text
Concatenate these nodes' values to get the text
Find matches over the text, and for each match:
Find the start and end nodes of the match, comparing the nodes' start indices to the match position
Create a Range over the match
Let the browser do the dirty work using the trick above
Rebuild the node list since the last action changed the DOM
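The lookup step in that list (finding the start and end nodes of a match) can be tried out on plain data first. This is only a sketch: `buildNodes` and `locateMatch` are hypothetical helpers, and the `{ text, start, length }` objects stand in for real Text nodes. The returned offsets are the values you would pass to `range.setStart` and `range.setEnd`.

```javascript
// Build stand-ins for text nodes, each with its start index in the full text.
function buildNodes(texts) {
  let offset = 0;
  return texts.map(text => {
    const node = { text, start: offset, length: text.length };
    offset += text.length;
    return node;
  });
}

// Return the nodes (and offsets inside them) where a match starts and ends.
function locateMatch(nodes, index, length) {
  const end = index + length;
  let startNode = null;
  for (const node of nodes) {
    const nodeEnd = node.start + node.length;
    if (nodeEnd <= index) continue;        // node lies entirely before the match
    if (!startNode) startNode = node;      // first node overlapping the match
    if (nodeEnd >= end) {
      return {
        startNode,
        startOffset: index - startNode.start,
        endNode: node,
        endOffset: end - node.start
      };
    }
  }
  return null; // match runs past the last node
}

const nodes = buildNodes(["HTML", " is a language used to make ", "websites."]);
const fullText = nodes.map(n => n.text).join("");
const match = /[A-Z].*?\./.exec(fullText);
const located = locateMatch(nodes, match.index, match[0].length);
console.log(located.startNode.text, located.startOffset, located.endNode.text, located.endOffset);
```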
Here's my implementation with a demo:
function highlight(element, regex) {
var document = element.ownerDocument;
var getNodes = function() {
var nodes = [],
offset = 0,
node,
nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);
while (node = nodeIterator.nextNode()) {
nodes.push({
textNode: node,
start: offset,
length: node.nodeValue.length
});
offset += node.nodeValue.length
}
return nodes;
}
var nodes = getNodes();
if (!nodes.length)
return;
var text = "";
for (var i = 0; i < nodes.length; ++i)
text += nodes[i].textNode.nodeValue;
var match;
while (match = regex.exec(text)) {
// Prevent empty matches causing infinite loops
if (!match[0].length)
{
regex.lastIndex++;
continue;
}
// Find the start and end text node
var startNode = null, endNode = null;
for (i = 0; i < nodes.length; ++i) {
var node = nodes[i];
if (node.start + node.length <= match.index)
continue;
if (!startNode)
startNode = node;
if (node.start + node.length >= match.index + match[0].length)
{
endNode = node;
break;
}
}
var range = document.createRange();
range.setStart(startNode.textNode, match.index - startNode.start);
range.setEnd(endNode.textNode, match.index + match[0].length - endNode.start);
var spanNode = document.createElement("span");
spanNode.className = "highlight";
spanNode.appendChild(range.extractContents());
range.insertNode(spanNode);
nodes = getNodes();
}
}
// Test code
var testDiv = document.getElementById("test-cases");
var originalHtml = testDiv.innerHTML;
function test() {
testDiv.innerHTML = originalHtml;
try {
var regex = new RegExp(document.getElementById("regex").value, "g");
highlight(testDiv, regex);
}
catch(e) {
testDiv.innerText = e;
}
}
document.getElementById("runBtn").onclick = test;
test();
.highlight {
background-color: yellow;
border: 1px solid orange;
border-radius: 5px;
}
.section {
border: 1px solid gray;
padding: 10px;
margin: 10px;
}
<form class="section">
RegEx: <input id="regex" type="text" value="[A-Z].*?\." /> <button id="runBtn">Highlight</button>
</form>
<div id="test-cases" class="section">
<div>foo bar baz</div>
<p>
<b>HTML</b> is a language used to make <b>websites.</b>
It was developed by <i>CERN</i> employees in the early 90s.
</p>
<p>
This program is not stable yet. Do not use this in production yet.
</p>
<div>foo bar baz</div>
</div>
OK, that was the lazy approach which, unfortunately, doesn't work in some cases. It works well if you only highlight across inline elements, but it breaks when there are block elements along the way, because of the following property of the extractContents function:
Partially selected nodes are cloned to include the parent tags necessary to make the document fragment valid.
That's bad. It'll just duplicate block-level nodes. Try the previous demo with the baz\s+HTML regex if you want to see how it breaks.
Second approach
This approach iterates over the matching nodes, creating <span> tags along the way.
The overall algorithm is straightforward as it just wraps each matching node in its own <span>. But this means we have to deal with partially matching text nodes, which requires some more effort.
If a text node matches partially, it's split with the splitText function:
After the split, the current node contains all the content up to the specified offset point, and a newly created node of the same type contains the remaining text. The newly created node is returned to the caller.
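To see where those splits land, here is a small DOM-free sketch that computes, for one match, the slice of each text node that falls inside it. `segmentsForMatch` and the `{ text, start }` objects are hypothetical stand-ins; in the browser, a `from` greater than 0 or a `to` smaller than the node length is where `splitText` would be called.

```javascript
// For each node overlapping the match, report the covered slice [from, to).
function segmentsForMatch(nodes, index, length) {
  const end = index + length;
  const segments = [];
  nodes.forEach((node, i) => {
    const nodeEnd = node.start + node.text.length;
    if (nodeEnd <= index || node.start >= end) return; // no overlap with the match
    segments.push({
      node: i,
      from: Math.max(index, node.start) - node.start,
      to: Math.min(end, nodeEnd) - node.start
    });
  });
  return segments;
}

// Model of <p>This is a <b>sentence. </b></p>
const nodes = [
  { text: "This is a ", start: 0 },
  { text: "sentence. ", start: 10 }
];
const fullText = nodes.map(n => n.text).join("");
const match = /[A-Z].*?\./.exec(fullText);
const segments = segmentsForMatch(nodes, match.index, match[0].length);
console.log(segments);
// The second node is only partially covered: the trailing space stays outside.
```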
function highlight(element, regex) {
var document = element.ownerDocument;
var nodes = [],
text = "",
node,
nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);
while (node = nodeIterator.nextNode()) {
nodes.push({
textNode: node,
start: text.length
});
text += node.nodeValue
}
if (!nodes.length)
return;
var match;
while (match = regex.exec(text)) {
var matchLength = match[0].length;
// Prevent empty matches causing infinite loops
if (!matchLength)
{
regex.lastIndex++;
continue;
}
for (var i = 0; i < nodes.length; ++i) {
node = nodes[i];
var nodeLength = node.textNode.nodeValue.length;
// Skip nodes before the match
if (node.start + nodeLength <= match.index)
continue;
// Break after the match
if (node.start >= match.index + matchLength)
break;
// Split the start node if required
if (node.start < match.index) {
nodes.splice(i + 1, 0, {
textNode: node.textNode.splitText(match.index - node.start),
start: match.index
});
continue;
}
// Split the end node if required
if (node.start + nodeLength > match.index + matchLength) {
nodes.splice(i + 1, 0, {
textNode: node.textNode.splitText(match.index + matchLength - node.start),
start: match.index + matchLength
});
}
// Highlight the current node
var spanNode = document.createElement("span");
spanNode.className = "highlight";
node.textNode.parentNode.replaceChild(spanNode, node.textNode);
spanNode.appendChild(node.textNode);
}
}
}
// Test code
var testDiv = document.getElementById("test-cases");
var originalHtml = testDiv.innerHTML;
function test() {
testDiv.innerHTML = originalHtml;
try {
var regex = new RegExp(document.getElementById("regex").value, "g");
highlight(testDiv, regex);
}
catch(e) {
testDiv.innerText = e;
}
}
document.getElementById("runBtn").onclick = test;
test();
.highlight {
background-color: yellow;
}
.section {
border: 1px solid gray;
padding: 10px;
margin: 10px;
}
<form class="section">
RegEx: <input id="regex" type="text" value="[A-Z].*?\." /> <button id="runBtn">Highlight</button>
</form>
<div id="test-cases" class="section">
<div>foo bar baz</div>
<p>
<b>HTML</b> is a language used to make <b>websites.</b>
It was developed by <i>CERN</i> employees in the early 90s.
</p>
<p>
This program is not stable yet. Do not use this in production yet.
</p>
<div>foo bar baz</div>
</div>
This should be good enough for most cases I hope. If you need to minimize the number of <span> tags it can be done by extending this function, but I wanted to keep it simple for now.
function parseText( element ){
var stack = [ element ];
var group = false;
var re = /(?!\s|$).*?(\.|$)/;
while ( stack.length > 0 ){
var node = stack.shift();
if ( node.nodeType === Node.TEXT_NODE )
{
if ( node.textContent.trim() != "" )
{
var match;
while( node && (match = re.exec( node.textContent )) )
{
var start = group ? 0 : match.index;
var length = match[0].length + match.index - start;
if ( start > 0 )
{
node = node.splitText( start );
}
var wrapper = document.createElement( 'span' );
var next = null;
if ( match[1].length > 0 ){
if ( node.textContent.length > length )
next = node.splitText( length );
group = false;
wrapper.className = "sentence sentence-end";
}
else
{
wrapper.className = "sentence";
group = true;
}
var parent = node.parentNode;
var sibling = node.nextSibling;
wrapper.appendChild( node );
if ( sibling )
parent.insertBefore( wrapper, sibling );
else
parent.appendChild( wrapper );
node = next;
}
}
}
else if ( node.nodeType === Node.ELEMENT_NODE || node.nodeType === Node.DOCUMENT_NODE )
{
stack.unshift.apply( stack, node.childNodes );
}
}
}
parseText( document.body );
.sentence {
text-decoration: underline wavy red;
}
.sentence-end {
border-right: 1px solid red;
}
<p>This is a sentence. This is another sentence.</p>
<p>This sentence has <strong>emphasis</strong> inside it.</p>
<p><span>This sentence spans</span><span> two elements.</span></p>
I would use a "flat DOM" representation for such a task.
In a flat DOM, this paragraph
<p>abc <a href="beta.html">def. ghij.</a></p>
will be represented by two vectors:
chars: "abc def. ghij.",
props: ....aaaaaaaaaa,
You will use normal regexp on chars to mark span areas on props vector:
chars: "abc def. ghij."
props: ssssaaaaaaaaaa
ssss sssss
I am using a schematic representation here; its real structure is an array of arrays:
props: [
[s],
[s],
[s],
[s],
[a,s],
[a,s],
...
]
The conversion tree-DOM <-> flat-DOM can be done with a simple state automaton.
At the end you will convert flat DOM to tree DOM that will look like:
<p><s>abc </s><a href="beta.html"><s>def.</s> <s>ghij.</s></a></p>
Just in case: I am using this approach in my HTML WYSIWYG editors.
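A minimal sketch of the marking step on the two vectors, assuming `props` is an array holding one array of tag names per character. `markSpans` is a hypothetical helper and the sentence regex is only an example; the conversion back to a tree is not shown.

```javascript
// Add `tag` to the props entry of every character covered by a regex match.
function markSpans(chars, props, regex, tag) {
  const marked = props.map(p => p.slice()); // copy so the input stays untouched
  let m;
  while ((m = regex.exec(chars)) !== null) {
    if (m[0].length === 0) { regex.lastIndex++; continue; } // skip empty matches
    for (let i = m.index; i < m.index + m[0].length; i++) {
      marked[i].push(tag);
    }
  }
  return marked;
}

const chars = "abc def. ghij.";
// "abc " is plain text, everything from index 4 on is inside the <a>.
const props = Array.from(chars, (_, i) => (i >= 4 ? ["a"] : []));
const marked = markSpans(chars, props, /[a-z][^.]*\./g, "s");

console.log(marked[0]); // [ 's' ]
console.log(marked[4]); // [ 'a', 's' ]
console.log(marked[8]); // [ 'a' ]  (the space between the two sentences)
```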
As everyone has already said, this is more of an academic question since this shouldn't really be the way you do it. That being said, it seemed like fun so here's one approach.
EDIT: I think I got the gist of it now.
function myReplace(str) {
myRegexp = /((^<[^>*]>)+|([^<>\.]*|(<[^\/>]*>[^<>\.]+<\/[^>]*>)+)*[^<>\.]*\.\s*|<[^>]*>|[^\.<>]+\.*\s*)/g;
arr = str.match(myRegexp);
var out = "";
for (i in arr) {
var node = arr[i];
if (node.indexOf("<")===0) out += node;
else out += "<span>"+node+"</span>"; // Here is where you would run whichever
// regex you want to match by
}
document.write(out.replace(/</g, "&lt;").replace(/>/g, "&gt;")+"<br>");
console.log(out);
}
myReplace('<p>This program is not stable yet. Do not use this in production yet.</p>');
myReplace('<p>This is a <b>sentence. </b></p>');
myReplace('<p>This is a <b>another</b> and <i>more complex</i> even <b>super complex</b> sentence.</p>');
myReplace('<p>This is a <b>a sentence</b>. Followed <i>by</i> another one.</p>');
myReplace('<p>This is a <b>an even</b> more <i>complex sentence. </i></p>');
/* Will output:
<p><span>This program is </span><span>not stable yet. </span><span>Do not use this in production yet.</span></p>
<p><span>This is a </span><b><span>sentence. </span></b></p>
<p><span>This is a <b>another</b> and <i>more complex</i> even <b>super complex</b> sentence.</span></p>
<p><span>This is a <b>a sentence</b>. </span><span>Followed <i>by</i> another one.</span></p>
<p><span>This is a </span><b><span>an even</span></b><span> more </span><i><span>complex sentence. </span></i></p>
*/
I have spent a long time implementing all of the approaches given in this thread:
Node iterator
Html parsing
Flat Dom
For any of these approaches you have to come up with a technique to split the entire HTML into sentences and wrap them in spans (some might want words in spans). As soon as we do this, we run into performance issues (I should say a beginner like me will run into performance issues).
Performance Bottleneck
I couldn't scale any of these approaches to 70k - 200k words and still do it in milliseconds. Wrapping time keeps increasing as the number of words on the page increases.
On complex HTML pages with combinations of text nodes and different elements we soon run into trouble, and the technical debt keeps increasing with it.
Best approach : Mark.js (according to me)
Note: if you do this right you can process any number of words in milliseconds.
Just use ranges. I want to recommend Mark.js and the following example:
var instance = new Mark(document.body);
instance.markRanges([{
start: 15,
length: 5
}, {
start: 25,
length: 8
}]);
With this we can treat the entire body.textContent as a string and just keep highlighting substrings.
No DOM structure is modified here. You can easily handle complex use cases, and technical debt doesn't increase with more if and else.
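To get from a regex to the `{ start, length }` objects used above, something like the following could work. It is a sketch: `matchesToRanges` is a hypothetical helper, and it assumes the ranges are offsets into the same string the regex ran over, as described in this answer.

```javascript
// Turn every regex match into a { start, length } range.
// Note: String.prototype.matchAll requires a regex with the g flag.
function matchesToRanges(text, regex) {
  const ranges = [];
  for (const m of text.matchAll(regex)) {
    if (m[0].length > 0) {
      ranges.push({ start: m.index, length: m[0].length });
    }
  }
  return ranges;
}

const text = "HTML is a language. It was developed by CERN.";
const ranges = matchesToRanges(text, /[A-Z][^.]*\./g);
for (const r of ranges) {
  console.log(r, JSON.stringify(text.slice(r.start, r.start + r.length)));
}
```

In the page, the result would presumably be passed on as `instance.markRanges(matchesToRanges(document.body.textContent, regex))`; check the Mark.js documentation for how it counts offsets before relying on that.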
Additionally, once text is highlighted with the HTML5 mark tag, you can post-process these tags to find their bounding rectangles.
Also look into Splitting.js if you just want to split HTML documents into words/chars/lines and more. One drawback of this approach is that Splitting.js collapses additional spaces in the document, so we lose a little bit of info.
Thanks.
Edit: StackOverFlow is replacing Japanese Characters with translations upon saving my question.
This makes it look like I'm replacing the same text, with the same text.
The first item(of the dupes, below) should be Japanese text.
Using the scripts described here:
Find all instances of 'old' in a webpage and replace each with 'new', using a javascript bookmarklet
I've gone about trying to translate Yahoo Japan Auction pages
(yes, i know translation engines exist, but I have my reasons...)
example page:
http://auctions.search.yahoo.co.jp/search?auccat=&p=bose&tab_ex=commerce&ei=UTF-8&fr=bzr-prop
I have tried a couple of scripts, and while they work, I must wait and click the "Unresponsive Script" dialog a couple of times before the changes occur (10-20 seconds).
While I'm certain my implementation is buggy, also uncertain how to proceed.
The script can contain over 200 change items.
These below are culled for space considerations.
Version 1 Script:
function newTheOlds(node) {
node = node || document.body;
if(node.nodeType == 3) {
// Text node
node.nodeValue = node.nodeValue.split('Car,Bike').join('Car,Bike');
node.nodeValue = node.nodeValue.split('Current $').join('Current $');
node.nodeValue = node.nodeValue.split('Buy it Now').join('Buy it Now');
node.nodeValue = node.nodeValue.split('Bid').join('Bid');
node.nodeValue = node.nodeValue.split('Remaining Time').join('Remaining Time');
node.nodeValue = node.nodeValue.split('Popular-Newest').join('Popular-Newest');
} else {
var nodes = node.childNodes;
if(nodes) {
var i = nodes.length;
while(i--) newTheOlds(nodes[i]);
}
}
}
newTheOlds();
Version 2 Script:
function htmlreplace(a, b, element) {
if (!element) element = document.body;
var nodes = element.childNodes;
for (var n=0; n<nodes.length; n++) {
if (nodes[n].nodeType == Node.TEXT_NODE) {
var r = new RegExp(a, 'gi');
nodes[n].textContent = nodes[n].textContent.replace(r, b);
} else {
htmlreplace(a, b, nodes[n]);
}
}
}
htmlreplace('Car,Bike', 'Car,Bike');
htmlreplace('Current $', 'Current $');
htmlreplace('Buy it Now', 'Buy it Now');
htmlreplace('Bid', 'Bid');
htmlreplace('Remaining Time', 'Remaining Time');
htmlreplace('Popular-Newest', 'Popular-Newest');
htmlreplace('Display', 'Display');
htmlreplace('Music', 'Music');
htmlreplace('Hobby', 'Hobby');
htmlreplace('Books/Mags', 'Books/Mags');
htmlreplace('Antiques', 'Antiques');
htmlreplace('Comics/Anime', 'Comics/Anime');
htmlreplace('Movie/Video', 'Movie/Video');
htmlreplace('Computers', 'Computers');
htmlreplace('Others', 'Others');
Should I be trying another technique?
Thanks,
Woody
While I'm certain my implementation is buggy, also uncertain how to proceed.
Replace the multiple node.nodeValue = assignments with a single assignment to a documentFragment
Move the strings in the multiple htmlreplace calls into a key/value object literal
Replace the loop with an Array.prototype.map call over the childNodes of the documentFragment
Replace matches using a replacer callback which references the object literal
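The object-literal and replacer-callback parts of that list can be sketched without touching the DOM. The dictionary entries below are made up (the real terms did not survive in the question), and `escapeRegExp` and `buildReplacer` are hypothetical helper names. The point is that each text node is scanned once with one combined regex instead of once per term.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a function that replaces every dictionary key in a single pass.
function buildReplacer(dictionary) {
  // Longest keys first, so "Buy it Now" wins over a shorter key like "Buy".
  const keys = Object.keys(dictionary).sort((a, b) => b.length - a.length);
  if (keys.length === 0) return text => text; // nothing to replace
  const pattern = new RegExp(keys.map(escapeRegExp).join("|"), "g");
  // The callback looks the matched text up in the dictionary.
  return text => text.replace(pattern, match => dictionary[match]);
}

// Made-up entries for illustration only.
const translate = buildReplacer({
  "Current $": "Price $",
  "Bid": "Offer",
  "Buy it Now": "Instant purchase"
});

const sample = "Current $ 100, Bid or Buy it Now";
const translated = translate(sample);
console.log(translated);
```

Inside the tree walk, the body of the text-node branch would then shrink to a single `node.nodeValue = translate(node.nodeValue)`.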
References
createDocumentFragment
DOM documentFragments
Alternatives to innerHTML
The tiny table sorter - or - you can write LINQ in JavaScript
OK, I have difficulty understanding regex and how it works. I'm trying to build a basic dictionary/glossary for a web site and I have spent too much time on it already.
Here is my code:
// MY MULTIPLE ARRAY
var myDictionnary = new Object();
myDictionnary.myDefinition = new Array();
myDictionnary.myDefinition.push({'term':"*1", 'definition':"My description here"});
myDictionnary.myDefinition.push({'term':"word", 'definition':"My description here"});
myDictionnary.myDefinition.push({'term':"my dog doesn't move", 'definition':"My description here"});
// MY FUNCTION
$.each(myDictionnary.myDefinition, function(){
var myContent = $('#content_right').html();
var myTerm = this.term;
var myRegTerm = new RegExp(myTerm,'g');
$('#content_right').html(myContent.replace(myRegTerm,'<span class="tooltip" title="'+this.definition+'"> *'+this.term+'</span>'));
});
I create my array, and for each entry I search my div#content_right for the same content and replace it with a span with a title and tooltip. I left my regex empty so as not to confuse things with what I tried before.
In my array, 'term' shows the kind of text I will search for. It works for normal text like 'word'.
But for a regular-expression character like the asterisk it breaks, and when I find a way around that, it gets added to the text. I tried many ways to have my asterisk interpreted as a normal character, but it doesn't work.
Here is what I want:
Interpret the variable literally, whatever the text inside my var 'myTerm'. Is this possible? If not, what kind of solution should I use?
Thanks in advance,
P.S. sorry for my poor english...(im french)
Alex
* is a special character in a regex, meaning "0 or more of the previous character". Since it's the first character, it is invalid and you get an error.
Consider implementing preg_quote from PHPJS to escape such special characters and make them valid for use in a regex.
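As a rough equivalent of that idea, the usual JavaScript idiom is a one-line escaping function. This is a sketch, not the PHPJS code; `escapeRegExp` is just a conventional name.

```javascript
// Prefix every regex metacharacter with a backslash so it matches literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const term = "*1";
// new RegExp(term) would throw a SyntaxError here; the escaped form is valid.
const pattern = new RegExp(escapeRegExp(term), "g");
const result = "See note *1 and *1 again".replace(pattern, "[$&]");
console.log(result); // See note [*1] and [*1] again
```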
HOWEVER, the way you are doing this is extremely bad. Not only because you're using innerHTML multiple times, but because what if you come across something like this?
Blah blah blah <img src="blah/word/blah.png" />
Your resulting output would be:
Blah blah blah <img src="blah/<span class="tooltip" title="My description here"> *word</span>/blah.png" />
The only real way to do what you're attempting is to specifically scan through the text nodes, then use .splitText() on the text node to extract the word match, then put that into its own <span> tag.
Here's an example implementation of the above explanation:
function definitions(rootnode) {
if( !rootnode) rootnode = document.body;
var dictionary = [
{"term":"*1", "definition":"My description here"},
{"term":"word", "definition":"My description here"},
{"term":"my dog doesn't move", "definition":"My description here"}
], dl = dictionary.length, recurse = function(node) {
var c = node.childNodes, cl = c.length, i, j, s, io;
for( i=0; i<cl; i++) {
if( c[i].nodeType == 1) { // ELEMENT NODE
if( c[i].className.match(/\btooltip\b/)) continue;
// ^ Exclude elements that are already tooltips
recurse(c[i]);
continue;
}
if( c[i].nodeType == 3) { // TEXT NODE
for( j=0; j<dl; j++) {
if( (io = c[i].nodeValue.indexOf(dictionary[j].term)) > -1) {
c[i].splitText(io+dictionary[j].term.length);
c[i].splitText(io);
cl += 2; // we created two new nodes
i++; // go to next sibling, the matched word
s = document.createElement('span');
s.className = "tooltip";
s.title = dictionary[j].definition;
node.insertBefore(s,c[i]);
i++;
s.appendChild(c[i]); // place the text node inside the span
// Note that now when i++ it will point to the tail,
// so multiple matches in the same node are possible.
}
}
}
}
};
recurse(rootnode);
}
Now you can call definitions() to replace all matches in the document, or something like definitions(document.getElementById("myelement")) to only replace matches in a certain container.
By adding another variable, it will fix the problem:
for( j=0; j<dl; j++) {
debutI =i;
if((io =c[i].nodeValue.indexOf(dictionary[j].term)) != -1) {
c[i].splitText(io+dictionary[j].term.length);
c[i].splitText(io);
cl += 2; // we created two new nodes
i++; // go to next sibling, the matched word
s = document.createElement('span');
s.className = "tooltip";
s.title = dictionary[j].definition;
node.insertBefore(s,c[i]);
i++;
s.appendChild(c[i]); // place the text node inside the span
// Note that now when i++ it will point to the tail,
// so multiple matches in the same node are possible.
}
i = debutI;
}
That way, we won't have to put all the spans everywhere on the site by hand. >_>
First off, don't link to the "Don't parse HTML with Regex" post :)
I've got the following HTML, which is used to display prices in various currencies, inc and ex tax:
<span id="price_break_12345" name="1">
<span class="price">
<span class="inc" >
<span class="GBP">£25.00</span>
<span class="USD" style="display:none;">$34.31</span>
<span class="EUR" style="display:none;">27.92 €</span>
</span>
<span class="ex" style="display:none;">
<span class="GBP">£20.83</span>
<span class="USD" style="display:none;">$34.31</span>
<span class="EUR" style="display:none;">23.27 €</span>
</span>
</span>
<span style="display:none" class="raw_price">25.000</span>
</span>
An AJAX call returns a single string of HTML, containing multiple copies of the above HTML, with the prices varying. What I'm trying to match with regex is:
Each block of the above HTML (as mentioned, it occurs multiple times in the return string)
The value of the name attribute on the outermost span
What I have so far is this:
var price_regex = new RegExp(/(<span([\s\S]*?)><span([\s\S]*?)>([\s\S]*?)<\/span><\/span\>)/gm);
console && console.log(price_regex.exec(product_price));
It matches the first price break once for each price break that occurs (so if there's name=1, name=5 and name=15, it matches name=1 three times).
Whereabouts am I going wrong?
So, if you can count on the format of that first span in each block like this:
<span id="price_break_12345" name="1">
Then, how about you use code like this to cycle through all the matches. This code identifies the price_break_xxxx id value in that first span and then picks out the following name attribute:
var re = /id="price_break_\d+"\s+name="([^"]+)"/gm;
var match;
while (match = re.exec(str)) {
console.log(match[1]);
}
You can see it work here: http://jsfiddle.net/jfriend00/G39ne/.
I used a converter to make three of your blocks of HTML into a single javascript string (to simulate what you get back from your ajax call) so I could run the code on it.
A more robust way to do this is to just use the browser's HTML parser to do all the work for you. Assuming you have the HTML in a string variable named str, you can use the browser's parser like this:
function getElementChildren(parent) {
var elements = [];
var children = parent.childNodes;
for (var i = 0, len = children.length; i < len; i++) {
// collect element nodes only
if (children[i].nodeType == 1) {
elements.push(children[i]);
}
}
return(elements);
}
var div = document.createElement("div");
div.innerHTML = str;
var priceBlocks = getElementChildren(div);
for (var i = 0; i < priceBlocks.length; i++) {
console.log(priceBlocks[i].id + ", " + priceBlocks[i].getAttribute("name") + "<br>");
}
Demo here: http://jsfiddle.net/jfriend00/F6D8d/
This will leave you with all the DOM traversal functions for these elements rather than using (the somewhat brittle) regular expressions on HTML.
Thanks in large part to jfriend for making me realise why my regex was matching in a strange way (while (price_break = regex.exec(string)) instead of just exec'ing it once), I've got it working:
var price_regex = new RegExp(/<span[\s\S]*?name="([0-9]+)"[\s\S]*?><span[\s\S]*?>[\s\S]*?<\/span><\/span\>/gm);
var price_break;
while (price_break = price_regex.exec(strProductPrice))
{
console && console.log(price_break);
}
I had a ton of useless () which were just clogging up the result set, so stripping them out made things a lot simpler.
The other thing, as mentioned above was that originally I was just doing
price_break = price_regex.exec(strProductPrice)
which runs the regex once and returns the first match only (which I mistook for returning 3 copies of the first match, due to the ()s). Looping keeps evaluating the regex until all the matches have been exhausted, which I had assumed it did normally, similar to PHP's preg_match_all.
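For anyone else tripping over this, here is a small standalone illustration of the difference. The HTML string is a shortened, made-up stand-in for the real AJAX response. With the g flag, each `exec` call resumes from `lastIndex`, which is what makes the loop advance.

```javascript
// Shortened stand-in for the returned HTML (three price-break blocks).
const html = '<span name="1"></span><span name="5"></span><span name="15"></span>';
const re = /name="(\d+)"/g;

// A single exec call returns only the first match, plus its capture groups.
const first = re.exec(html);
console.log(first[1]); // 1

// Rewind, then loop: each call continues where the previous match ended.
re.lastIndex = 0;
const names = [];
let m;
while ((m = re.exec(html)) !== null) {
  names.push(m[1]);
}
console.log(names); // [ '1', '5', '15' ]
```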
What I want to do is replace all instances of 'old' in a webpage with 'new' in a JS bookmarklet or a greasemonkey script. How can I do this? I suppose jQuery or other frameworks are okay, as there're hacks to include them in both bookmarklets as well as greasemonkey scripts.
A function that is clobber-proof. That means it won't touch any tags or attributes, only text.
function htmlreplace(a, b, element) {
if (!element) element = document.body;
var nodes = element.childNodes;
for (var n=0; n<nodes.length; n++) {
if (nodes[n].nodeType == Node.TEXT_NODE) {
var r = new RegExp(a, 'gi');
nodes[n].textContent = nodes[n].textContent.replace(r, b);
} else {
htmlreplace(a, b, nodes[n]);
}
}
}
htmlreplace('a', 'r');
Bookmarklet version:
javascript:function htmlreplace(a,b,element){if(!element)element=document.body;var nodes=element.childNodes;for(var n=0;n<nodes.length;n++){if(nodes[n].nodeType==Node.TEXT_NODE){nodes[n].textContent=nodes[n].textContent.replace(new RegExp(a,'gi'),b);}else{htmlreplace(a,b,nodes[n]);}}}htmlreplace('old','new');
If you replace the innerHTML, you will destroy any DOM event handlers you have on the page. Try traversing the document to replace text:
function newTheOlds(node) {
node = node || document.body;
if(node.nodeType == 3) {
// Text node
node.nodeValue = node.nodeValue.split('old').join('new');
} else {
var nodes = node.childNodes;
if(nodes) {
var i = nodes.length;
while(i--) newTheOlds(nodes[i]);
}
}
}
newTheOlds();
The split/join is faster than doing "replace" if you do not need pattern matching. If you need pattern matching then use "replace" and a regex:
node.nodeValue = node.nodeValue.replace(/(?:dog|cat)(s?)/g, 'buffalo$1');
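One detail worth seeing side by side (a quick sketch on a plain string): `replace` with a string pattern only touches the first occurrence, while split/join and a regex with the g flag replace all of them.

```javascript
const text = "old and old";

// String pattern: only the first occurrence is replaced.
const firstOnly = text.replace("old", "new");

// split/join: every occurrence, no pattern matching involved.
const viaSplit = text.split("old").join("new");

// Regex with the g flag: every occurrence, with pattern matching available.
const viaRegex = text.replace(/old/g, "new");

console.log(firstOnly); // new and old
console.log(viaSplit);  // new and new
console.log(viaRegex);  // new and new
```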
As a bookmarklet:
javascript:function newTheOlds(node){node=node||document.body;if(node.nodeType==3){node.nodeValue=node.nodeValue.split('old').join('new');}else{var nodes=node.childNodes;if(nodes){var i=nodes.length;while(i--)newTheOlds(nodes[i]);}}}newTheOlds();
Older browsers will need Node.TEXT_NODE changed to 3 and node.textContent changed to node.nodeValue, so the final function should read:
function htmlreplace(a, b, element) {
if (!element) element = document.body;
var nodes = element.childNodes;
for (var n=0; n<nodes.length; n++) {
if (nodes[n].nodeType == 3) { //Node.TEXT_NODE == 3
var r = new RegExp(a, 'gi');
nodes[n].nodeValue = nodes[n].nodeValue.replace(r, b);
} else {
htmlreplace(a, b, nodes[n]);
}
}
}
A simple line that works along jQuery:
`javascript:var a = function(){$("body").html($("body").html().replace(/old/g,'new'));return;}; a();`
Without jQuery:
`javascript:function a (){document.body.innerHTML=document.body.innerHTML.replace(/old/g, "new" );return;}; a();`
The function returning nothing is very important, so the browser is not redirected anywhere after executing the bookmarklet.
Yet another recursive approach:
function replaceText(oldText, newText, node){
node = node || document.body;
var childs = node.childNodes, i = 0;
while(node = childs[i]){
if (node.nodeType == Node.TEXT_NODE){
node.textContent = node.textContent.replace(oldText, newText);
} else {
replaceText(oldText, newText, node);
}
i++;
}
}
Minified bookmarklet:
javascript:function replaceText(ot,nt,n){n=n||document.body;var cs=n.childNodes,i=0;while(n=cs[i]){if(n.nodeType==Node.TEXT_NODE){n.textContent=n.textContent.replace(ot,nt);}else{replaceText(ot,nt,n);};i++;}};replaceText('old','new');
Okay, I'm just consolidating some of the great stuff that people are putting up in one answer.
Here is sixthgear's jQuery code, but made portable (I source jQuery from the big G) and minified into a bookmarklet:
javascript:var scrEl=document.createElement('script');scrEl.setAttribute('language','javascript');scrEl.setAttribute('type','text/javascript');scrEl.setAttribute('src','http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js');function htmlreplace(a,b,element){if(!element)element=document.body;var nodes=$(element).contents().each(function(){if(this.nodeType==Node.TEXT_NODE){var r=new RegExp(a,'gi');this.textContent=this.textContent.replace(r,b);}else{htmlreplace(a,b,this);}});}htmlreplace('old','new');
NOTE that 'old' can be either a 'string literal', or a 'reg[Ee]x'.
Actually, now that I think about it, sixthgear's is the best answer, especially with my enhancements. I can't find anything that the other answers add over it, using jQuery achieves incredible X-browser compatibility. Plus, I'm just too damn lazy. community wiki, Enjoy!
Hey, you could try this. The problem is that it searches the entire body, so even attributes and such get changed.
javascript:document.body.innerHTML=document.body.innerHTML.replace( /old/g, "new" );
I'm trying to slightly modify this so that it prompts for the text to search for, followed by the text to replace with, and when all done processing, show a dialog box letting me know it's done.
I plan to use it on a phpmyadmin database edit page that'll have any number of textboxes filled with text (which is what I need it to search and replace in). Also, the text to search for and replace may or may not be multi-line, so I've added the 'm' param in the regex, and also, since I'll be doing searches/replaces that may contain html, they'll often have quotes/double quotes in them. ex:
Search for:
<img height="76" width="92" src="http://www.gifs.net/Animation11/Hobbies_and_Entertainment/Games_and_Gambling/Slot_machine.gif" /></div>
<div class="rtecenter"> <strong><em><font color="#ff0000">Vegas Baby!<br />
</font></em></strong></div>
and maybe replace with nothing (just to erase all that code), or some other html. So far this is the bookmarklet I've come up with, (javascript, and especially bookmarklets aren't something I mess with often) however, it does nothing as far as finding/replacing, although it does do the prompting correctly.
javascript:var%20scrEl=document.createElement('script');scrEl.setAttribute('language','javascript');scrEl.setAttribute('type','text/javascript');scrEl.setAttribute('src','http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js');function%20htmlreplace(a,b,element){if(!element)element=document.body;var%20nodes=$(element).contents().each(function(){if(this.nodeType==Node.TEXT_NODE){var%20r=new%20RegExp(a,'gim');this.textContent=this.textContent.replace(r,b);}else{htmlreplace(a,b,this);alert('Done%20processing.');}});}htmlreplace(prompt('Text%20to%20find:',''),prompt('Replace%20with:',''));
Anyone have any ideas?