Trying to replace HTML by cloning nodes but getting strange results - javascript

I'm trying to add <abbr> tags to acronyms found on a website. I'm running this as a Chrome Extension but I'm fairly certain the problem is within the javascript itself and doesn't have much to do with the Chrome stuff (I'll include the source just in case anyways)
I should mention that I'm using a lot of code from this link which was suggested on another answer. Unfortunately I'm getting unexpected results as my end goal differs a bit from what is discussed there.
First I have an array of acronyms (shortened here, I included the whole thing on JSFiddle)
"ITPHR": "Inside-the-park home run:hits on which the batter successfully touched all four bases, without the contribution of a fielding error or the ball going outside the ball park.",
"pNERD": "Pitcher's NERD: expected aesthetic pleasure of watching an individual pitcher",
"RISP": "Runner In Scoring Position: a breakdown of the batter's batting average with runners in scoring position, which include runners at second and third bases.",
"SBA/ATT": "Stolen base attempts: total number of times the player has attempted to steal a base (SB+CS)",
then the matchText() function from the previously linked article
var matchText = function (node, regex, callback, excludeElements) {
excludeElements || (excludeElements = ['script', 'style', 'iframe', 'canvas']);
var child = node.firstChild;
do {
switch (child.nodeType) {
case 1:
if (excludeElements.indexOf(child.tagName.toLowerCase()) > -1) {
continue;
}
matchText(child, regex, callback, excludeElements);
break;
case 3:
child.data.replace(regex, function (all) {
var args = [].slice.call(arguments),
offset = args[args.length - 2],
newTextNode = child.splitText(offset);
newTextNode.data = newTextNode.data.substr(all.length);
callback.apply(window, [child].concat(args));
child = newTextNode;
});
break;
}
} while (child = child.nextSibling);
return node;
}
and finally my code that cycles through the array of acronyms and searches all the terms one by one (this might not be the optimal way of doing things, please let me know if you have a better idea)
var abbrList = Object.keys(acronyms);
for (var i = 0; i < abbrList.length; i++) {
var abbrev = abbrList[i];
abbrevSearch = abbrev.replace('%', '\\%').replace('+', '\\+').replace('/', '\\/');
console.log("Looking for " + abbrev);
matchText(document.body.getElementsByTagName("*"), new RegExp("\\b" + abbrevSearch + "\\b", "g"), function (node, match, offset) {
var span = document.createElement("abbr");
// span.className = "sabrabbr"; // If someone decides to style them
span.setAttribute("title", acronyms[abbrev].replace('&#39;', '\'')); // decode the apostrophe entity (assumed to be &#39; in the data)
span.textContent = match;
node.parentNode.insertBefore(span, node.nextSibling);
});
}
As a reference here are the Chrome-specific files:
manifest.json
{
"name": "SABR Acronyms",
"version": "0.1",
"manifest_version": 2,
"description": "Adds tooltips with a definition to commonly used acronyms in baseball.",
"icons": {
"16" : "images/16.png",
"48" : "images/48.png",
"128" : "images/128.png"
},
"permissions": [
"activeTab"
],
"browser_action": {
"default_icon": "images/16.png",
"default_title": "SABR Acronyms"
},
"content_scripts": [
{
"matches": ["http://*/*"],
"js": ["content.js","jquery.min.js"],
"css": ["sabr.css"]
}
],
"web_accessible_resources": ["content.js", "sabr.js", "sabr.css","jquery.min.js","jquery-2.0.3.min.map"]
}
content.js
var s = document.createElement('script');
s.src = chrome.extension.getURL('sabr.js');
(document.head||document.documentElement).appendChild(s);
s.onload = function() {
s.parentNode.removeChild(s);
};
I uploaded everything on JSFiddle since it's the easiest way to see the code in action. I copied the <body>...</body> of a page containing an article with a few of the acronyms being used. A lot of them should be picked up but aren't. Exact matches are also picked up, but not all the time. There also seems to be a problem with single/2-letter acronyms (such as IP in the table). The regular expression is quite simple; I thought \b would do the trick.
Thanks!

There were a couple of issues with your code (or maybe a little more).
1. \b does not work the way your pattern expects. It only matches between a word character ([A-Za-z0-9_]) and a non-word character, so an acronym that ends in a symbol (e.g. ERA+) never matches when it is followed by a space or punctuation.
2. You were using the global modifier, so the offsets of all the matches were computed against the original child.data. But while handling each match you modified child.data, so the offsets of the later matches no longer pointed at the right place. This problem only comes up when there is more than one match in a single TextNode. (Note that once this error caused an exception to be raised, execution was aborted, so no further TextNodes were processed.)
3. The acronyms were searched for (and replaced) in their order of appearance in the acronym list. This could lead to cases where only a substring of an acronym was recognised as another acronym and incorrectly replaced. E.g. if ERA was searched for before ERA+, all ERA+ occurrences in the DOM would be replaced by <abbr ...>ERA</abbr>+ and would not be recognised as ERA+ occurrences later on.
4. Similarly to the above problem, a substring of an already processed acronym could subsequently be recognised as another acronym and partially replaced. E.g. if ERA+ was searched for before ERA, the following would happen:
ERA+
-> <abbr (title_for_ERA+)>ERA+</abbr>
-> <abbr (title_for_ERA+)><abbr (title_for_ERA)>ERA</abbr>+</abbr>
5. Your one-letter "acronyms" would also match characters they shouldn't (e.g. E in E-mail, G in Paul G. etc).
(Among many possible ways) I chose to address the above problems like this:
For (1):
Instead of using \b...\b I used (^|[^A-Za-z0-9_])(...)([^A-Za-z0-9_]|$).
This will look for one character that is not a word character before and after our acronym under search (or settle for string start (^) or end ($) respectively). Since the matched characters (if any) before and after the actual acronym match need to be put back in the regular TextNodes, 3 backreferences are created and handled appropriately in the replace callback (see code below).
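To make the three groups concrete, here is a small sketch of how such a pattern can be built and what the replace callback gets to work with. The helper names escapeRegExp and buildAcronymRegex are made up for this illustration and are not part of the code further down, which escapes only %, + and / by hand.

```javascript
// Escape every regex metacharacter in a literal string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the three-group pattern described above:
//   group 1 = the non-word character (or string start) before the acronym
//   group 2 = the acronym itself
//   group 3 = the non-word character (or string end) after the acronym
function buildAcronymRegex(acronym) {
  return new RegExp('(^|[^A-Za-z0-9_])(' + escapeRegExp(acronym) + ')([^A-Za-z0-9_]|$)');
}

var sample = 'His ERA+ was 150.';
var m = buildAcronymRegex('ERA+').exec(sample);

// The acronym starts at m.index + m[1].length; that is the offset
// to hand to splitText() so that group 1 stays in the preceding text node.
console.log(m[2], m.index + m[1].length);   // ERA+ 4

// For comparison: \b fails because there is no boundary between "+" and " ".
console.log(/\bERA\+\b/.test(sample));      // false
```

Note that group 3 is consumed by the match, which is one more reason to search the remaining text again after each replacement instead of relying on the global flag.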
For (2):
I removed the global modifier and matched one occurrence at a time.
This also required a slight modification, so that the new TextNode, created with the part of child.data after the current match, is subsequently searched as well.
For (3):
Before starting the search and replace operations I ordered the array of acronyms by decreasing length, so longer acronyms are searched for (and replaced) before shorter acronyms (which could possibly be a substring of the former). E.g. ERA+ is always replaced before ERA, IP/GS is always replaced before IP etc.
(Note that this solves problem (3), but we still have to deal with (4).)
For (4):
Every time I create a new <abbr> node I add a class to it. Later on, when I encounter an element with that special class, I skip it (as I don't want any replacements to happen in a substring of an already matched acronym).
For (5):
Well, I am good, but I am not Jon Skeet :)
There is not much you can do about it, unless you want to bring on some AI, but I suppose it is not much of a problem either (i.e. you can live with it).
(As already mentioned, the above solutions are neither the only ones available nor, probably, the optimal ones.)
That said, here is my version of the code (with a few more minor, for the most part stylistic, changes):
var matchText = function (node, regex, callback, excludeElements) {
excludeElements
|| (excludeElements = ['script', 'style', 'iframe', 'canvas']);
var child = node.firstChild;
if (!child) {
return;
}
do {
switch (child.nodeType) {
case 1:
if ((child.className === 'sabrabbr') ||
(excludeElements.indexOf(
child.tagName.toLowerCase()) > -1)) {
continue;
}
matchText(child, regex, callback, excludeElements);
break;
case 3:
child.data.replace(regex, function (fullMatch, g1, g2, g3, idx,
original) {
var offset = idx + g1.length;
var newTextNode = child.splitText(offset); // declared locally to avoid leaking a global
newTextNode.data = newTextNode.data.substr(g2.length);
callback.apply(window, [child, g2]);
child = child.nextSibling;
});
break;
}
} while (child = child.nextSibling);
return node;
}
var abbrList = Object.keys(acronyms).sort(function(a, b) {
return b.length - a.length;
});
for (var i = 0; i < abbrList.length; i++) {
var abbrev = abbrList[i];
var abbrevSearch = abbrev.replace('%', '\\%').replace('+', '\\+').replace('/', '\\/');
console.log("Looking for " + abbrev);
var regex = new RegExp("(^|[^A-Za-z0-9_])(" + abbrevSearch
+ ")([^A-Za-z0-9_]|$)", "");
matchText(document.body, regex, function (node, match) {
var span = document.createElement("abbr");
span.className = "sabrabbr";
span.title = acronyms[abbrev].replace('&#39;', '\''); // decode the apostrophe entity (assumed to be &#39; in the data)
span.textContent = match;
node.parentNode.insertBefore(span, node.nextSibling);
});
}
For the noble few that made it this far, there is, also, this short demo.

Related

How to wrap part of a text in a node with JavaScript

I have a challenging problem to solve. I'm working on a script which takes a regex as an input. This script then finds all matches for this regex in a document and wraps each match in its own <span> element. The hard part is that the text is a formatted html document, so my script needs to navigate through the DOM and apply the regex across multiple text nodes at once, while figuring out where it has to split text nodes if needed.
For example, with a regex that captures full sentences starting with a capital letter and ending with a period, this document:
<p>
<b>HTML</b> is a language used to make <b>websites.</b>
It was developed by <i>CERN</i> employees in the early 90s.
</p>
Would ideally be turned into this:
<p>
<span><b>HTML</b> is a language used to make <b>websites.</b></span>
<span>It was developed by <i>CERN</i> employees in the early 90s.</span>
</p>
The script should then return the list of all created spans.
I already have some code which finds all the text nodes and stores them in a list along with their position across the whole document and their depth. You don't really need to understand that code to help me and its recursive structure can be a bit confusing. The first part I'm not sure how to do is figure out which elements should be included within the span.
function findTextNodes(node, depth = -1, start = 0) {
let list = [];
if (node.nodeType === Node.TEXT_NODE) {
list.push({ node, depth, start });
} else {
for (let i = 0; i < node.childNodes.length; ++i) {
list = list.concat(findTextNodes(node.childNodes[i], depth+1, start));
if (list.length) {
start += list[list.length-1].node.nodeValue.length;
}
}
}
return list;
}
I figure I'll make a string out of the whole document, run the regex through it and use the list to find which nodes correspond to which regex matches, and then split the text nodes accordingly.
But an issue arrives when I have a document like this:
<p>
This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
</p>
There's a sentence which starts outside of the <a> tag but ends inside it. Now I don't want the script to split that link in two tags. In a more complex document, it could ruin the page if it did. The code could either wrap two sentences together:
<p>
<span>This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a></span>
</p>
Or just wrap each part in its own element:
<p>
<span>This program is </span>
<a href="beta.html">
<span>not stable yet.</span>
<span>Do not use this in production yet.</span>
</a>
</p>
There could be a parameter to specify what it should do. I'm just not sure how to figure out when an impossible cut is about to happen, and how to recover from it.
Another issue comes when I have whitespace inside a child element like this:
<p>This is a <b>sentence. </b></p>
Technically, the regex match would end right after the period, before the end of the <b> tag. However, it would be much better to consider the space as part of the match and wrap it like this:
<p><span>This is a <b>sentence. </b></span></p>
Than this:
<p><span>This is a </span><b><span>sentence.</span> </b></p>
But that's a minor issue. After all, I could just allow extra white-space to be included within the regex.
I know this might sound like a "do it for me" question and it's not the kind of quick question we see on SO on a daily basis, but I've been stuck on this for a while and it's for an open-source library I'm working on. Solving this problem is the last obstacle. If you think another SE site is better suited for this question, please redirect me.
Here are two ways to deal with this.
I don't know if the following will exactly match your needs. It's a simple enough solution to the problem, but at least it doesn't use RegEx to manipulate HTML tags. It performs pattern matching against the raw text and then uses the DOM to manipulate the content.
First approach
This approach creates only one <span> tag per match, leveraging some less common browser APIs.
(See the main problem of this approach below the demo, and if not sure, use the second approach).
The Range class represents a text fragment. It has a surroundContents function that lets you wrap a range in an element. Except it has a caveat:
This method is nearly equivalent to newNode.appendChild(range.extractContents()); range.insertNode(newNode). After surrounding, the boundary points of the range include newNode.
An exception will be thrown, however, if the Range splits a non-Text node with only one of its boundary points. That is, unlike the alternative above, if there are partially selected nodes, they will not be cloned and instead the operation will fail.
Well, the workaround is provided in the MDN, so all's good.
So here's an algorithm:
Make a list of Text nodes and keep their start indices in the text
Concatenate these nodes' values to get the text
Find matches over the text, and for each match:
Find the start and end nodes of the match, comparing the nodes' start indices to the match position
Create a Range over the match
Let the browser do the dirty work using the trick above
Rebuild the node list since the last action changed the DOM
Here's my implementation with a demo:
function highlight(element, regex) {
var document = element.ownerDocument;
var getNodes = function() {
var nodes = [],
offset = 0,
node,
nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);
while (node = nodeIterator.nextNode()) {
nodes.push({
textNode: node,
start: offset,
length: node.nodeValue.length
});
offset += node.nodeValue.length
}
return nodes;
}
var nodes = getNodes();
if (!nodes.length)
return;
var text = "";
for (var i = 0; i < nodes.length; ++i)
text += nodes[i].textNode.nodeValue;
var match;
while (match = regex.exec(text)) {
// Prevent empty matches causing infinite loops
if (!match[0].length)
{
regex.lastIndex++;
continue;
}
// Find the start and end text node
var startNode = null, endNode = null;
for (i = 0; i < nodes.length; ++i) {
var node = nodes[i];
if (node.start + node.length <= match.index)
continue;
if (!startNode)
startNode = node;
if (node.start + node.length >= match.index + match[0].length)
{
endNode = node;
break;
}
}
var range = document.createRange();
range.setStart(startNode.textNode, match.index - startNode.start);
range.setEnd(endNode.textNode, match.index + match[0].length - endNode.start);
var spanNode = document.createElement("span");
spanNode.className = "highlight";
spanNode.appendChild(range.extractContents());
range.insertNode(spanNode);
nodes = getNodes();
}
}
// Test code
var testDiv = document.getElementById("test-cases");
var originalHtml = testDiv.innerHTML;
function test() {
testDiv.innerHTML = originalHtml;
try {
var regex = new RegExp(document.getElementById("regex").value, "g");
highlight(testDiv, regex);
}
catch(e) {
testDiv.innerText = e;
}
}
document.getElementById("runBtn").onclick = test;
test();
.highlight {
background-color: yellow;
border: 1px solid orange;
border-radius: 5px;
}
.section {
border: 1px solid gray;
padding: 10px;
margin: 10px;
}
<form class="section">
RegEx: <input id="regex" type="text" value="[A-Z].*?\." /> <button id="runBtn">Highlight</button>
</form>
<div id="test-cases" class="section">
<div>foo bar baz</div>
<p>
<b>HTML</b> is a language used to make <b>websites.</b>
It was developed by <i>CERN</i> employees in the early 90s.
</p>
<p>
This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
</p>
<div>foo bar baz</div>
</div>
OK, that was the lazy approach which, unfortunately, doesn't work in some cases. It works well if you only highlight across inline elements, but breaks when there are block elements along the way because of the following property of the extractContents function:
Partially selected nodes are cloned to include the parent tags necessary to make the document fragment valid.
That's bad. It'll just duplicate block-level nodes. Try the previous demo with the baz\s+HTML regex if you want to see how it breaks.
Second approach
This approach iterates over the matching nodes, creating <span> tags along the way.
The overall algorithm is straightforward as it just wraps each matching node in its own <span>. But this means we have to deal with partially matching text nodes, which requires some more effort.
If a text node matches partially, it's split with the splitText function:
After the split, the current node contains all the content up to the specified offset point, and a newly created node of the same type contains the remaining text. The newly created node is returned to the caller.
function highlight(element, regex) {
var document = element.ownerDocument;
var nodes = [],
text = "",
node,
nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);
while (node = nodeIterator.nextNode()) {
nodes.push({
textNode: node,
start: text.length
});
text += node.nodeValue
}
if (!nodes.length)
return;
var match;
while (match = regex.exec(text)) {
var matchLength = match[0].length;
// Prevent empty matches causing infinite loops
if (!matchLength)
{
regex.lastIndex++;
continue;
}
for (var i = 0; i < nodes.length; ++i) {
node = nodes[i];
var nodeLength = node.textNode.nodeValue.length;
// Skip nodes before the match
if (node.start + nodeLength <= match.index)
continue;
// Break after the match
if (node.start >= match.index + matchLength)
break;
// Split the start node if required
if (node.start < match.index) {
nodes.splice(i + 1, 0, {
textNode: node.textNode.splitText(match.index - node.start),
start: match.index
});
continue;
}
// Split the end node if required
if (node.start + nodeLength > match.index + matchLength) {
nodes.splice(i + 1, 0, {
textNode: node.textNode.splitText(match.index + matchLength - node.start),
start: match.index + matchLength
});
}
// Highlight the current node
var spanNode = document.createElement("span");
spanNode.className = "highlight";
node.textNode.parentNode.replaceChild(spanNode, node.textNode);
spanNode.appendChild(node.textNode);
}
}
}
// Test code
var testDiv = document.getElementById("test-cases");
var originalHtml = testDiv.innerHTML;
function test() {
testDiv.innerHTML = originalHtml;
try {
var regex = new RegExp(document.getElementById("regex").value, "g");
highlight(testDiv, regex);
}
catch(e) {
testDiv.innerText = e;
}
}
document.getElementById("runBtn").onclick = test;
test();
.highlight {
background-color: yellow;
}
.section {
border: 1px solid gray;
padding: 10px;
margin: 10px;
}
<form class="section">
RegEx: <input id="regex" type="text" value="[A-Z].*?\." /> <button id="runBtn">Highlight</button>
</form>
<div id="test-cases" class="section">
<div>foo bar baz</div>
<p>
<b>HTML</b> is a language used to make <b>websites.</b>
It was developed by <i>CERN</i> employees in the early 90s.
</p>
<p>
This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
</p>
<div>foo bar baz</div>
</div>
This should be good enough for most cases I hope. If you need to minimize the number of <span> tags it can be done by extending this function, but I wanted to keep it simple for now.
function parseText( element ){
var stack = [ element ];
var group = false;
var re = /(?!\s|$).*?(\.|$)/;
while ( stack.length > 0 ){
var node = stack.shift();
if ( node.nodeType === Node.TEXT_NODE )
{
if ( node.textContent.trim() != "" )
{
var match;
while( node && (match = re.exec( node.textContent )) )
{
var start = group ? 0 : match.index;
var length = match[0].length + match.index - start;
if ( start > 0 )
{
node = node.splitText( start );
}
var wrapper = document.createElement( 'span' );
var next = null;
if ( match[1].length > 0 ){
if ( node.textContent.length > length )
next = node.splitText( length );
group = false;
wrapper.className = "sentence sentence-end";
}
else
{
wrapper.className = "sentence";
group = true;
}
var parent = node.parentNode;
var sibling = node.nextSibling;
wrapper.appendChild( node );
if ( sibling )
parent.insertBefore( wrapper, sibling );
else
parent.appendChild( wrapper );
node = next;
}
}
}
else if ( node.nodeType === Node.ELEMENT_NODE || node.nodeType === Node.DOCUMENT_NODE )
{
stack.unshift.apply( stack, node.childNodes );
}
}
}
parseText( document.body );
.sentence {
text-decoration: underline wavy red;
}
.sentence-end {
border-right: 1px solid red;
}
<p>This is a sentence. This is another sentence.</p>
<p>This sentence has <strong>emphasis</strong> inside it.</p>
<p><span>This sentence spans</span><span> two elements.</span></p>
I would use "flat DOM" representation for such task.
In flat DOM this paragraph
<p>abc <a href="beta.html">def. ghij.</p>
will be represented by two vectors:
chars: "abc def. ghij.",
props:  ....aaaaaaaaaa,
You will use normal regexp on chars to mark span areas on props vector:
chars: "abc def. ghij."
props:  ssssaaaaaaaaaa
            ssss sssss
I am using a schematic representation here; its real structure is an array of arrays:
props: [
[s],
[s],
[s],
[s],
[a,s],
[a,s],
...
]
Conversion between tree DOM and flat DOM (in both directions) can be done with a simple state machine.
At the end you will convert flat DOM to tree DOM that will look like:
<p><s>abc </s><a href="beta.html"><s>def.</s> <s>ghij.</s></p>
Just in case: I am using this approach in my HTML WYSIWYG editors.
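A rough sketch of the flat DOM idea, to show the moving parts. To keep it runnable outside a browser, the tree here is a hypothetical structure of plain objects ({ tag, children }) rather than real DOM nodes, and flatten, markMatches and toHtml are names invented for this illustration. Attributes such as href are dropped, and two matches that touch each other would merge into one <s>, so a real implementation needs more bookkeeping.

```javascript
// Tree -> flat: one entry in props per character, listing the enclosing tags.
function flatten(node, path, out) {
  out = out || { chars: '', props: [] };
  path = path || [];
  if (typeof node === 'string') {
    for (var i = 0; i < node.length; i++) {
      out.chars += node[i];
      out.props.push(path.slice());
    }
  } else {
    var childPath = path.concat(node.tag);
    node.children.forEach(function (child) { flatten(child, childPath, out); });
  }
  return out;
}

// Run a global regex over chars and add an innermost "s" to every matched character.
function markMatches(flat, regex) {
  if (!regex.global) throw new Error('regex needs the g flag');
  var match;
  while ((match = regex.exec(flat.chars)) !== null) {
    if (match[0].length === 0) { regex.lastIndex++; continue; }
    for (var i = match.index; i < match.index + match[0].length; i++) {
      flat.props[i].push('s');
    }
  }
  return flat;
}

// Flat -> tree (serialized): a small state machine that closes and opens
// tags whenever the tag path changes between two neighbouring characters.
function toHtml(flat) {
  var open = [], html = '';
  function sync(path) {
    var common = 0;
    while (common < open.length && common < path.length && open[common] === path[common]) common++;
    while (open.length > common) html += '</' + open.pop() + '>';
    while (open.length < path.length) {
      open.push(path[open.length]);
      html += '<' + open[open.length - 1] + '>';
    }
  }
  for (var i = 0; i < flat.chars.length; i++) {
    sync(flat.props[i]);
    html += flat.chars[i];
  }
  sync([]);
  return html;
}

var tree = { tag: 'p', children: ['abc ', { tag: 'a', children: ['def. ghij.'] }] };
var flat = markMatches(flatten(tree), /[^.\s][^.]*\./g);
console.log(toHtml(flat));
// <p><s>abc </s><a><s>def.</s> <s>ghij.</s></a></p>
```

The first sentence crosses the <a> boundary, yet no element is ever cut in half: the marker simply shows up as two separate <s> elements, one on each side of the boundary.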
As everyone has already said, this is more of an academic question since this shouldn't really be the way you do it. That being said, it seemed like fun so here's one approach.
EDIT: I think I got the gist of it now.
function myReplace(str) {
myRegexp = /((^<[^>*]>)+|([^<>\.]*|(<[^\/>]*>[^<>\.]+<\/[^>]*>)+)*[^<>\.]*\.\s*|<[^>]*>|[^\.<>]+\.*\s*)/g;
arr = str.match(myRegexp);
var out = "";
for (i in arr) {
var node = arr[i];
if (node.indexOf("<")===0) out += node;
else out += "<span>"+node+"</span>"; // Here is where you would run whichever
// regex you want to match by
}
document.write(out.replace(/</g, "&lt;").replace(/>/g, "&gt;")+"<br>");
console.log(out);
}
myReplace('<p>This program is not stable yet. Do not use this in production yet.</p>');
myReplace('<p>This is a <b>sentence. </b></p>');
myReplace('<p>This is a <b>another</b> and <i>more complex</i> even <b>super complex</b> sentence.</p>');
myReplace('<p>This is a <b>a sentence</b>. Followed <i>by</i> another one.</p>');
myReplace('<p>This is a <b>an even</b> more <i>complex sentence. </i></p>');
/* Will output:
<p><span>This program is </span><span>not stable yet. </span><span>Do not use this in production yet.</span></p>
<p><span>This is a </span><b><span>sentence. </span></b></p>
<p><span>This is a <b>another</b> and <i>more complex</i> even <b>super complex</b> sentence.</span></p>
<p><span>This is a <b>a sentence</b>. </span><span>Followed <i>by</i> another one.</span></p>
<p><span>This is a </span><b><span>an even</span></b><span> more </span><i><span>complex sentence. </span></i></p>
*/
I have spent a long time implementing all of the approaches given in this thread:
Node iterator
Html parsing
Flat Dom
For any of these approaches you have to come up with a technique to split the entire HTML into sentences and wrap each one in a span (some might want every word in a span). As soon as we do this we run into performance issues (I should say a beginner like me will run into performance issues).
Performance Bottleneck
I couldn't scale any of these approaches to 70k - 200k words and still finish in milliseconds. Wrapping time keeps increasing as the number of words on the page increases.
With complex HTML pages that combine text nodes and different elements we soon run into trouble, and the technical debt keeps increasing.
Best approach : Mark.js (according to me)
Note: if you do this right you can process any number of words in milliseconds.
Just use ranges. I want to recommend Mark.js and the following example:
var instance = new Mark(document.body);
instance.markRanges([{
start: 15,
length: 5
}, {
start: 25,
length: 8
}]);
With this we can treat the entire body.textContent as a string and just keep highlighting substrings.
No DOM structure is modified here. And you can easily fix complex use cases, and the technical debt doesn't increase with more if and else.
Additionally, once text is highlighted with the HTML5 mark tag you can post-process these tags to find their bounding rectangles.
Also look into Splitting.js if you just want to split HTML documents into words/chars/lines and more. One drawback of that approach is that Splitting.js collapses additional spaces in the document, so we lose a little bit of information.
Thanks.

Find, Change multiple instances of text on page is Slow and Unresponsive, but Works

Edit: StackOverFlow is replacing Japanese Characters with translations upon saving my question.
This makes it look like I'm replacing the same text, with the same text.
The first item(of the dupes, below) should be Japanese text.
Using the scripts described here:
Find all instances of 'old' in a webpage and replace each with 'new', using a javascript bookmarklet
I've gone about trying to translate Yahoo Japan Auction pages
(yes, I know translation engines exist, but I have my reasons...)
example page:
http://auctions.search.yahoo.co.jp/search?auccat=&p=bose&tab_ex=commerce&ei=UTF-8&fr=bzr-prop
I have tried a couple of scripts, and while the scripts work, I must wait and click the "Unresponsive Script" dialog a couple of times before the changes occur (10-20 seconds).
While I'm certain my implementation is buggy, also uncertain how to proceed.
The script can contain over 200 change items.
These below are culled for space considerations.
Version 1 Script:
function newTheOlds(node) {
node = node || document.body;
if(node.nodeType == 3) {
// Text node
node.nodeValue = node.nodeValue.split('Car,Bike').join('Car,Bike');
node.nodeValue = node.nodeValue.split('Current $').join('Current $');
node.nodeValue = node.nodeValue.split('Buy it Now').join('Buy it Now');
node.nodeValue = node.nodeValue.split('Bid').join('Bid');
node.nodeValue = node.nodeValue.split('Remaining Time').join('Remaining Time');
node.nodeValue = node.nodeValue.split('Popular-Newest').join('Popular-Newest');
} else {
var nodes = node.childNodes;
if(nodes) {
var i = nodes.length;
while(i--) newTheOlds(nodes[i]);
}
}
}
newTheOlds();
Version 2 Script:
function htmlreplace(a, b, element) {
if (!element) element = document.body;
var nodes = element.childNodes;
for (var n=0; n<nodes.length; n++) {
if (nodes[n].nodeType == Node.TEXT_NODE) {
var r = new RegExp(a, 'gi');
nodes[n].textContent = nodes[n].textContent.replace(r, b);
} else {
htmlreplace(a, b, nodes[n]);
}
}
}
htmlreplace('Car,Bike', 'Car,Bike');
htmlreplace('Current $', 'Current $');
htmlreplace('Buy it Now', 'Buy it Now');
htmlreplace('Bid', 'Bid');
htmlreplace('Remaining Time', 'Remaining Time');
htmlreplace('Popular-Newest', 'Popular-Newest');
htmlreplace('Display', 'Display');
htmlreplace('Music', 'Music');
htmlreplace('Hobby', 'Hobby');
htmlreplace('Books/Mags', 'Books/Mags');
htmlreplace('Antiques', 'Antiques');
htmlreplace('Comics/Anime', 'Comics/Anime');
htmlreplace('Movie/Video', 'Movie/Video');
htmlreplace('Computers', 'Computers');
htmlreplace('Others', 'Others');
Should I be trying another technique?
Thanks,
Woody
While I'm certain my implementation is buggy, also uncertain how to proceed.
Replace the multiple node.nodeValue = assignments with a single assignment to a documentFragment
Move the strings in the multiple htmlreplace calls into a key/value object literal
Replace the loop with an Array.prototype.map call over the childNodes of the documentFragment
Replace matches using a replacer callback which references the object literal
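A minimal sketch of the last three points, under some assumptions: the dictionary entries below are placeholders (the real list would hold the 200 or so pairs), and the helper names escapeRegExp and buildReplacer are invented for this example. It covers the object literal and the replacer callback, not the documentFragment part. The idea is to compile all keys into one regular expression so that each text node is visited and rewritten once, instead of walking the whole tree once per pair.

```javascript
// Placeholder dictionary: keys are the strings to find, values the replacements.
var translations = {
  'old title': 'new title',
  'old': 'new',
  'price (USD)': 'price in dollars'
};

// Escape regex metacharacters so keys are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function buildReplacer(dictionary) {
  // Longest keys first, so "old title" wins over "old".
  var keys = Object.keys(dictionary).sort(function (a, b) { return b.length - a.length; });
  if (keys.length === 0) {
    return function (text) { return text; };
  }
  var pattern = new RegExp(keys.map(escapeRegExp).join('|'), 'g');
  return function (text) {
    // The replacer callback looks each match up in the dictionary.
    return text.replace(pattern, function (found) { return dictionary[found]; });
  };
}

var translate = buildReplacer(translations);
console.log(translate('old title, old price (USD)'));
// new title, new price in dollars

// In the page: a single pass over the text nodes (skipped when there is no DOM).
if (typeof document !== 'undefined') {
  var walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
  var textNode;
  while ((textNode = walker.nextNode())) {
    var updated = translate(textNode.nodeValue);
    // Only write when something changed, to avoid needless DOM updates.
    if (updated !== textNode.nodeValue) textNode.nodeValue = updated;
  }
}
```

The matching here is case-sensitive; adding the i flag would require normalising the looked-up key as well.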
References
createDocumentFragment
DOM documentFragments
Alternatives to innerHTML
The tiny table sorter - or - you can write LINQ in JavaScript

jQuery / JavaScript Parsing strings the proper way

Recently, I've been attempting to emulate a small language in jQuery and JavaScript, yet I've come across what I believe is an issue. I think that I may be parsing everything completely wrong.
In the code:
#name Testing
#inputs
#outputs
#persist
#trigger
print("Test")
The current way I am separating and parsing the string is by splitting all of the code into lines, and then reading through this lines array using searches and splits. For example, I would find the name using something like:
if(typeof lines[line] === 'undefined')
{
}
else
{
if(lines[line].search('#name') == 0)
{
name = lines[line].split(' ')[1];
}
}
But I think that I may be largely wrong on how I am handling parsing.
While reading through examples on how other people are handling parsing of code blocks like this, it appeared that people parsed the entire block, instead of splitting it into lines as I do. I suppose the question of the matter is, what is the proper and conventional way of parsing things like this, and how do you suggest I use it to parse something such as this?
In simple cases like this, regular expressions are your tool of choice:
matches = code.match(/#name\s+(\w+)/)
name = matches[1]
To parse "real" programming languages, regexps are not powerful enough. You'll need a parser, either hand-written or automatically generated with a tool like PEG.
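For the header block from the question, the same idea extends to all the # lines at once. This is only a sketch: readDirectives is a made-up name, and it assumes every directive sits on its own line in the form "#word optional-value".

```javascript
var code = [
  '#name Testing',
  '#inputs',
  '#outputs',
  '#persist',
  '#trigger',
  'print("Test")'
].join('\n');

// Collect every "#directive value" line into an object.
// The m flag lets ^ and $ match at line breaks, the g flag lets exec() iterate.
function readDirectives(source) {
  var directives = {};
  var pattern = /^#(\w+)[ \t]*(.*)$/gm;
  var match;
  while ((match = pattern.exec(source)) !== null) {
    directives[match[1]] = match[2].trim();
  }
  return directives;
}

var directives = readDirectives(code);
console.log(directives.name);           // Testing
console.log(Object.keys(directives));   // [ 'name', 'inputs', 'outputs', 'persist', 'trigger' ]
```

Everything that does not start with # (the print line here) is left for whatever handles the actual program body.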
A general approach to parsing, that I like to take often is the following:
Loop through the complete block of text, character by character.
If you find a character that signals the start of a unit, call a specialized subfunction to parse the next characters.
Within each subfunction, call additional subfunctions if you find certain characters.
Return from each subfunction when you find a character that signals that the unit has ended.
Here is a small example:
var text = "#func(arg1,arg2)"
function parse(text) {
var i, max_i, ch, funcRes;
for (i = 0, max_i = text.length; i < max_i; i++) {
ch = text.charAt(i);
if (ch === "#") {
funcRes = parseFunction(text, i + 1);
i = funcRes.index;
}
}
console.log(funcRes);
}
function parseFunction(text, i) {
var max_i, ch, name, argsRes;
name = [];
for (max_i = text.length; i < max_i; i++) {
ch = text.charAt(i);
if (ch === "(") {
argsRes = parseArguments(text, i + 1);
return {
name: name.join(""),
args: argsRes.arr,
index: argsRes.index
};
}
name.push(ch);
}
}
function parseArguments(text, i) {
var max_i, ch, args, arg;
arg = [];
args = [];
for (max_i = text.length; i < max_i; i++) {
ch = text.charAt(i);
if (ch === ",") {
args.push(arg.join(""));
arg = [];
continue;
} else if (ch === ")") {
args.push(arg.join(""));
return {
arr: args,
index: i
};
}
arg.push(ch);
}
}
FIDDLE
This example just parses function expressions that follow the syntax "#functionName(argumentName1, argumentName2, ...)". The general idea is to visit every character exactly once without the need to save current states like "hasSeenAtCharacter" or "hasSeenOpeningParentheses", which can get pretty messy when you parse large structures.
Please note that this is a very simplified example and it lacks all error handling and the like, but I hope the general idea comes across. Note also that I'm not saying you should use this approach all the time. It's a very general approach that can be used in many scenarios. That doesn't mean it can't be combined with regular expressions, for instance, if at some point in your text that makes more sense than parsing each individual character.
And one last remark: you can save yourself the trouble if you put the specialized parsing function inside the main parsing function, so that all functions have access to the same variable i.
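As a rough sketch of that last remark (not the original fiddle; the nesting layout and the fallback returns for unterminated input are assumptions added for illustration), the helpers can live inside parse() and share a single cursor i, so no index has to be passed around or returned:

```javascript
// Sketch: same "#name(arg1,arg2)" syntax as above, but the helpers are nested
// inside parse() and share one cursor variable `i` through closure.
function parse(text) {
  var i = 0;          // shared cursor, advanced by every helper
  var result = null;

  function parseArguments() {
    var args = [], arg = [];
    for (; i < text.length; i++) {
      var ch = text.charAt(i);
      if (ch === ",") {
        args.push(arg.join(""));
        arg = [];
      } else if (ch === ")") {
        args.push(arg.join(""));
        return args;   // `i` is left pointing at ")"
      } else {
        arg.push(ch);
      }
    }
    return args;       // unterminated input: return what was collected (assumption)
  }

  function parseFunction() {
    var name = [];
    for (; i < text.length; i++) {
      var ch = text.charAt(i);
      if (ch === "(") {
        i++;           // step past "("
        return { name: name.join(""), args: parseArguments() };
      }
      name.push(ch);
    }
    return { name: name.join(""), args: [] };
  }

  for (; i < text.length; i++) {
    if (text.charAt(i) === "#") {
      i++;             // step past "#"
      result = parseFunction();
    }
  }
  return result;
}

var parsed = parse("#func(arg1,arg2)");
console.log(parsed.name, parsed.args);
```

Because every helper reads and advances the same i, the return values only need to carry the parsed data, not the position.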

Node.innerHTML giving tag names in lower case

I am iterating over a NodeList to get node data, but when I use Node.innerHTML I get the tag names in lowercase.
Actual tags:
<Panel><Label>test</Label></Panel>
What innerHTML gives:
<panel><label>test</label></panel>
I need these tags as they are. Is it possible to get them with a regular expression? I am using this with dojo (is there any way to do it in dojo?).
var xhrArgs = {
url: "./user/"+Runtime.userName+"/ws/workspace/"+Workbench.getProject()+"/lib/custom/"+(first.type).replace(".","/")+".html",
content: {},
sync:true,
load: function(data){
var test = domConstruct.toDom(data);
dojo.forEach(dojo.query("[id]",test),function(node){
domAttr.remove(node,"id");
});
var childEle = "";
dojo.forEach(test.childNodes,function(node){
if(node.innerHTML){
childEle+=node.innerHTML;
}
});
command.add(new ModifyCommand(newWidget,{},childEle,context));
}
};
You cannot count on .innerHTML preserving the exact form of your original HTML. In fact, in some browsers the serialized markup is significantly different (though it produces the same result), with different quotation, case, attribute order, etc.
It is much better not to rely on case being preserved and to adjust your JavaScript to deal with uncertain case.
It is certainly possible to use a regular expression to do a case-insensitive search (the "i" flag makes the search case insensitive), though it is generally much better to use direct DOM access/searching rather than searching innerHTML. You'd have to tell us more about what exactly you're trying to do before we could offer some code.
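As a rough illustration of dealing with uncertain case (a sketch only; sameTagName and hasTag are hypothetical helper names, and tag names are assumed to be plain alphanumeric strings with no regex metacharacters):

```javascript
// Hypothetical helper: compare two tag names without caring about case.
function sameTagName(a, b) {
  return a.toLowerCase() === b.toLowerCase();
}

// Hypothetical helper: true if `html` contains an opening tag named `tagName`,
// whatever case the serializer happened to use. The "i" flag makes the
// regular expression case insensitive.
function hasTag(html, tagName) {
  var re = new RegExp("<" + tagName + "(\\s|>|/)", "i");
  return re.test(html);
}

var serialized = "<panel><label>test</label></panel>";
console.log(sameTagName("Panel", "panel")); // true
console.log(hasTag(serialized, "Panel"));   // true
console.log(hasTag(serialized, "Pan"));     // false
```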
It would take me a bit to figure that out with a regex, but you can use this:
var str = '<panel><label>test</label></panel>';
var chars = str.split(""); // declared with var to avoid creating an implicit global
for (var i = 0; i < chars.length; i++) {
if (chars[i] === '<' || chars[i] === '/') {
chars[i + 1] = chars[i + 1].toUpperCase();
}
}
str = chars.join("");
I hope it helps.
If you are trying to just capitalise the first character of the tag name, you can use:
var s = 'panel';
s.replace(/(^.)(.*)/,function(m, a, b){return a.toUpperCase() + b.toLowerCase()}); // Panel
Alternatively you can use string manipulation (probably more efficient than a regular expression):
s.charAt(0).toUpperCase() + s.substring(1).toLowerCase(); // Panel
The above will output any input string with the first character in upper case and everything else lower case.
This is not thoroughly tested and is highly inefficient, but it worked quite quickly in the console.
(Also, it's jQuery, but it can easily be converted to pure JavaScript/DOM.)
function tagString (element) {
return $(element).
clone().
contents().
remove().
end()[0].
outerHTML.
replace(/(^<\s*\w)|(<\/\s*\w(?=\w*\s*>$))/g,
function (a) {
return a.
toUpperCase();
}).
split(/(?=<\/\s*\w*\s*>$)/);
}
function capContents (element) {
return $(element).
contents().
map(function () {
return this.nodeType === 3 ? $(this).text() : capitalizeHTML(this);
})
}
function capitalizeHTML (selector) {
var e = $(selector).first();
var wrap = tagString(e);
return wrap[0] + capContents(e).toArray().join("") + wrap[1];
}
capitalizeHTML('body');
Also, besides being a nice exercise (in my opinion), do you really need to do this?

Finding keywords in texts

I have an array of incidents that have happened. They are written in free text and therefore don't follow any pattern except for some keywords, e.g. "robbery", "murderer", "housebreaking", "car accident" etc. Those keywords can be anywhere in the text, and I want to find them and assign the incidents to categories, e.g. "Robberies".
In the end, when I have checked all the incidents I want to have a list of categories like this:
Robberies: 14
Murder attempts: 2
Car accidents: 5
...
The array elements can look like this:
incidents[0] = "There was a robbery on Amest Ave last night...";
incidents[1] = "There has been a report of a murder attempt...";
incidents[2] = "Last night there was a housebreaking in...";
...
I guess the best approach here is to use regular expressions to find the keywords in the texts, but I'm really bad at regexes and therefore need some help.
The regular expressions below are not correct, but I guess this structure would work?
Is there a better way of doing this that avoids repetition (stays DRY)?
var trafficAccidents = 0,
robberies = 0,
...
function FindIncident(incident) {
if (incident.match(/car accident/g)) {
trafficAccidents += 1;
}
else if (incident.match(/robbery/g)) {
robberies += 1;
}
...
}
Thanks a lot in advance!
The following code shows an approach you can take.
var INCIDENT_MATCHES = {
trafficAccidents: /(traffic|car) accident(?:s){0,1}/ig,
robberies: /robbery|robberies/ig,
murder: /murder(?:s){0,1}/ig
};
function FindIncidents(incidentReports) {
var incidentCounts = {};
var incidentTypes = Object.keys(INCIDENT_MATCHES);
incidentReports.forEach(function(incident) {
incidentTypes.forEach(function(type) {
if(typeof incidentCounts[type] === 'undefined') {
incidentCounts[type] = 0;
}
var matchFound = incident.match(INCIDENT_MATCHES[type]);
if(matchFound){
incidentCounts[type] += matchFound.length;
};
});
});
return incidentCounts;
}
Regular expressions make sense, since you'll have a number of strings that meet your 'match' criteria, even if you only consider the differences in plural and singular forms of 'robbery'. You also want to ensure that your matching is case-insensitive.
You need to use the 'global' modifier on your regexes so that you match strings like "Murder, Murder, murder" and increment your count by 3 instead of just 1.
This approach keeps your match criteria and your incident counters together. It also avoids the need for global counters (granted, INCIDENT_MATCHES is a global variable here, but you can readily move it elsewhere and take it out of the global scope).
Actually, I would kind of disagree with you here . . . I think string functions like indexOf will work perfectly fine.
I would use JavaScript's indexOf method which takes 2 inputs:
string.indexOf(value,startPos);
So one thing you can do is define a simple temporary variable as your cursor as such . . .
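function FindIncident(phrase, word) {
var cursor = 0;
var wordCount = 0;
var found;
// an empty search word would match at every position, so bail out early
if (!word) { return 0; }
// indexOf returns -1 once there are no more occurrences
while ((found = phrase.indexOf(word, cursor)) > -1) {
// move the cursor past this match so the next search does not find it again
cursor = found + word.length;
++wordCount;
}
return wordCount;
}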
I have not tested the code but hopefully you get the idea . . .
Be particularly careful of the starting position if you do use it.
RegEx makes my head hurt too. ;) If you're looking for exact matches and aren't worried about typos and misspellings, I'd search the incident strings for substrings containing the keywords you're looking for.
incident = incident.toLowerCase();
// search() returns -1 when nothing is found; index 0 is a valid match position
if (incident.search("car accident") > -1) {
trafficAccidents += 1;
}
else if (incident.search("robbery") > -1) {
robberies += 1;
}
...
Use an array of objects to store all the different categories you're searching for, each with an appropriate regular expression and a count member, and you can write the whole thing in four lines.
var categories = [
{
regexp: /\brobbery\b/i
, display: "Robberies"
, count: 0
}
, {
regexp: /\bcar accidents?\b/i
, display: "Car Accidents"
, count: 0
}
, {
regexp: /\bmurder\b/i
, display: "Murders"
, count: 0
}
];
var incidents = [
"There was a robbery on Amest Ave last night..."
, "There has been a report of an murder attempt..."
, "Last night there was a housebreaking in..."
];
for(var x = 0; x<incidents.length; x++)
for(var y = 0; y<categories.length; y++)
if (incidents[x].match(categories[y].regexp))
categories[y].count++;
Now, no matter what you need, you can simply edit one section of code, and it will propagate through your code.
This code has the potential to categorize each incident in multiple categories. To prevent that, just add a 'break' statement to the if block.
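A minimal sketch of that 'break' variant (the two-category list and the sample incident texts here are made up for illustration; note that the if needs braces once it holds two statements):

```javascript
var categories = [
  { regexp: /\brobbery\b/i, display: "Robberies", count: 0 },
  { regexp: /\bmurder\b/i, display: "Murders", count: 0 }
];
var incidents = [
  "A robbery ended in murder on Amest Ave", // mentions both keywords
  "There has been a report of a murder attempt"
];

for (var x = 0; x < incidents.length; x++) {
  for (var y = 0; y < categories.length; y++) {
    if (categories[y].regexp.test(incidents[x])) {
      categories[y].count++;
      break; // stop at the first matching category for this incident
    }
  }
}

categories.forEach(function (category) {
  console.log(category.display + ": " + category.count);
});
```

With the break, the order of the categories array decides which category wins when an incident mentions several keywords.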
You could do something like this which will grab all words found on each item in the array and it will return an object with the count:
var words = ['robbery', 'murderer', 'housebreaking', 'car accident'];
function getAllIncidents( incidents ) {
var re = new RegExp('('+ words.join('|') +')', 'i')
, result = {};
incidents.forEach(function( txt ) {
var match = ( re.exec( txt ) || [,0] )[1];
match && (result[ match ] = ++result[ match ] || 1);
});
return result;
}
console.log( getAllIncidents( incidents ) );
//^= { housebreaking: 1, car accident: 2, robbery: 1, murderer: 2 }
This is more of a quick prototype, but it could be improved to handle plurals and multiple keywords.
Demo: http://jsbin.com/idesoc/1/edit
Use an object to store your data.
var events = [
{ exp : /\b(?:robbery|robberies)\b/i,
// \b word boundary
// (?: non-capturing group, so both \b apply to both alternatives
// robbery singular
// | or
// robberies plural
// ) end of group
// \b word boundary
// /i case insensitive
name : "robbery",
count: 0
},
// other objects here
]
var i = events.length;
while( i-- ) {
var j = incidents.length;
while( j-- ) {
// only checks a particular event exists in incident rather than no. of occurrences
if( events[i].exp.test( incidents[j] ) ) {
events[i].count++;
}
}
}
Yes, that's one way to do it, although matching plain words with a regex is a bit of overkill; for those, you could use indexOf as rbtLong suggested.
You can refine it further by:
appending the i flag (matches both lowercase and uppercase characters).
adding possible word variations to your expression. robbery could be written as robber(y|ies), thus matching both the singular and plural forms of the word. car accident could be (car|truck|vehicle|traffic) accident.
Word boundaries \b
Don't use these. They require the match to start and end at a word boundary, which will prevent matching typos. You should make your queries as comprehensive as possible.
if (incident.match(/(car|truck|vehicle|traffic) accident/i)) {
trafficAccidents += 1;
}
else if (incident.match(/robber(y|ies)/i)) {
robberies += 1;
}
Notice how I discarded the g flag; it stands for "global match" and makes the parser continue searching the string after the first match. This seems unnecessary as just one confirmed occurrence is enough for your needs.
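To make the effect of the g flag concrete, here is a small sketch (the sample string is only an illustration): with g, match() returns every occurrence, while without it only the first one is returned. A regex object that has the g flag also becomes stateful when used with test(), which is one more reason to leave it out for simple yes/no checks.

```javascript
var text = "Murder, Murder, murder";

// With g: match() returns every occurrence.
var allMatches = text.match(/murder/ig);
// Without g: match() returns only the first occurrence (plus any capture groups).
var firstMatch = text.match(/murder/i);

console.log(allMatches.length);               // 3
console.log(firstMatch[0], firstMatch.index); // Murder 0

// A regex with the g flag remembers where the last search ended (lastIndex):
var stateful = /murder/ig;
var firstCall = stateful.test("murder");  // true, lastIndex is now 6
var secondCall = stateful.test("murder"); // false, the search resumed at index 6
console.log(firstCall, secondCall);
```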
This website offers an excellent introduction to regular expressions
http://www.regular-expressions.info/tutorial.html
