TEXT_NODE: returns ONLY text? - javascript

I'm using JavaScript to extract all text from a DOM object. My algorithm walks the DOM object and its descendants; if a node is of type TEXT_NODE, I accumulate its nodeValue.
For some weird reason I also get things like:
#hdr-editions a { text-decoration:none; }
#cnn_hdr-editionS { text-align:left;clear:both; }
#cnn_hdr-editionS a { text-decoration:none;font-size:10px;top:7px;line-height:12px;font-weight:bold; }
#hdr-prompt-text b { display:inline-block;margin:0 0 0 20px; }
#hdr-editions li { padding:0 10px; }
How do I filter this? Do I need to use something else? I want ONLY text.

From the looks of things, you're also collecting the text from <style> elements. You might want to run a check for those:
var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 }
if (element.tagName in ignore)
continue;
You can add any other elements to the object map to ignore them.

You want to skip over style elements.
In your loop, you could do this...
if (element.tagName == 'STYLE') {
continue;
}
You also probably want to skip over script, textarea, etc.

This is text as far as the DOM is concerned. You'll have to filter out (skip) <script> and <style> tags.

[Answer added after reading OP's comments to Andy's excellent answer]
The problem is that you see the text nodes inside elements whose content is normally not rendered by browsers - such as STYLE and SCRIPT tags.
When you scan the DOM tree (using depth-first search, I assume), your scan should skip over the content of such tags.
For example - a recursive depth-first DOM tree walker might look like this:
function walker(domObject, extractorCallback) {
    if (domObject == null) return; // fail fast
    extractorCallback(domObject);
    // descend only into elements and the document node itself
    if (domObject.nodeType != Node.ELEMENT_NODE && domObject.nodeType != Node.DOCUMENT_NODE) return;
    var childs = domObject.childNodes;
    for (var i = 0; i < childs.length; i++)
        walker(childs[i], extractorCallback); // pass the callback down
}
var textvalue = "";
walker(document, function(node) {
    if (node.nodeType == Node.TEXT_NODE)
        textvalue += node.nodeValue;
});
In that case, if your walker encounters a tag whose content you know you don't want, it should simply not descend into that part of the tree. So walker() has to be adapted like this:
var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 };
function walker(domObject, extractorCallback) {
    if (domObject == null) return; // fail fast
    extractorCallback(domObject);
    // descend only into elements and the document node itself
    if (domObject.nodeType != Node.ELEMENT_NODE && domObject.nodeType != Node.DOCUMENT_NODE) return;
    if (domObject.tagName in ignore) return; // <--- HERE
    var childs = domObject.childNodes;
    for (var i = 0; i < childs.length; i++)
        walker(childs[i], extractorCallback); // pass the callback down
}
That way, if we see a tag that you don't like, we simply skip it and all its children, and your extractor will never be exposed to the text nodes inside such tags.
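As a compact illustration of the same idea, here is a minimal sketch that returns the collected text directly instead of going through a callback. The names extractText and IGNORED_TAGS are hypothetical and not part of any answer above. The function only reads nodeType, tagName, nodeValue and childNodes, so it should work on real DOM nodes as well as on plain objects with the same shape.

```javascript
// Node type values as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: collects the text under `root`, skipping the whole
// subtree of any element whose tagName is in the `ignoredTags` set.
function extractText(root, ignoredTags) {
    const parts = [];
    (function visit(node) {
        if (node == null) return;
        if (node.nodeType === TEXT_NODE) {
            parts.push(node.nodeValue);
            return;
        }
        // Do not descend into ignored elements such as STYLE or SCRIPT.
        if (node.nodeType === ELEMENT_NODE && ignoredTags.has(node.tagName)) return;
        const children = node.childNodes || [];
        for (let i = 0; i < children.length; i++) visit(children[i]);
    })(root);
    return parts.join("");
}

// Assumption: tag names are upper case, as they are for HTML elements.
const IGNORED_TAGS = new Set(["STYLE", "SCRIPT", "NOSCRIPT", "IFRAME", "OBJECT"]);
```

In a browser this would presumably be called as extractText(document.body, IGNORED_TAGS).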

Related

Is there a way to test a css-selector query to an unappended element?

I have this code:
Element.prototype.queryTest = function(strQuery) {
var _r;
if (this.parentElement == null) {
_r = Array.prototype.slice.call(document.querySelectorAll(strQuery)).indexOf(this);
} else {
_r = Array.prototype.slice.call(this.parentElement.querySelectorAll(strQuery)).indexOf(this);
}
return !!(_r+1);
}
I am searching for some way to test a query to an unappended element.
I want to change the first code to make this work:
var t = document.createElement("span");
t.classList.add("asdfg");
console.log(t.queryTest("span.asdfg"));
If there is a way to detect that the element isn't appended, I could create a temporary unappended element and append the target to it in order to test the CSS selector query.
Is there a way to detect whether the element hasn't been appended yet? Would the target element still be accessible after the temporary parent is freed? I have tested it in Chrome and it is accessible, but I don't know if that is the case in Firefox.
I know I can use document.querySelectorAll("*") to get a list of nodes, but isn't converting this NodeList to an Array too CPU-demanding? This is why I prefer not to do it that way.
Thanks in advance.
There is already a native Element.prototype.matches method which does that:
const el = document.createElement('span');
el.classList.add('test');
console.log(el.matches('span.test'));
Note that to check if a node is connected or not, there is the Node.prototype.isConnected getter.
I did it.
Element.prototype.querySelectorTest = function(strQuery) {
var _r;
if (this.parentElement != null) {
_r = Array.prototype.indexOf.call(this.parentElement.querySelectorAll(strQuery),this);
} else if (this == document.documentElement) {
_r = ((document.querySelector(strQuery) == this)-1);
} else {
_r = ((this == document.createElement("i").appendChild(this).parentElement.querySelector(strQuery))-1);
}
return !!(_r+1);
}
I changed the way it checks the NodeList.
I renamed the function to a more fitting name.
If the target element is the root one, there's no need to call querySelectorAll.
If you append the unappended element to a temporary one to test it, you don't lose the reference (the variable's value, in case there is one).
This is not my native language so please consider that.

Remove Text from Element Without Removing Reference

I would like to replace all the text in some element (including text in children) with some other text. For example, the html
<div id="myText">
This is some text.
This is some other text.
<p id="toHide">
This is even more text.
Click this text to hide it.
</p>
</div>
should become
<div id="myText">
That is some text.
That is some other text.
<p id="toHide">
That is even more text.
Click That text to hide it.
</p>
</div>
Essentially, I've replaced all of /this/gi with "That". However, I cannot use the following:
$("#myText").innerHTML = $("#myText").innerHTML.replace(/this/gi, "That");
This is because I keep a lot of references to the children of myText. These references would be lost. I realize that in simple cases I can just update these references, but I have a fairly large file and many references (and it would be troublesome and error-prone to have to update every reference every time this function is called).
I also store some data not visible to innerHTML. For example, I use
$("#toHide").test = "test";
This is lost when writing to innerHTML.
How can I replace text in a div without innerHTML (preferably without jquery)?
Jsfiddle http://jsfiddle.net/prankol57/ZEfM7/
Here's a solution:
var n, walker = document.createTreeWalker(document.getElementById("myText"), NodeFilter.SHOW_TEXT);
while (n = walker.nextNode()) {
n.nodeValue = n.nodeValue.replace(/this/ig, "that");
}
Basically, walk all the text nodes, and substitute their values.
For better compatibility, here's some reusable code:
function visitTextNodes(el, callback) {
if (el.nodeType === 3) {
callback(el);
}
for (var i=0; i < el.childNodes.length; ++i) {
visitTextNodes(el.childNodes[i], callback);
}
}
Then you can do:
visitTextNodes(document.getElementById("myText"), function(el) {
el.nodeValue = el.nodeValue.replace(/this/ig, "that");
});
You can use DOM methods (a.k.a. the old and safe way)
function replaceText(el, pattern, txt) {
    for (var i = 0; i < el.childNodes.length; ++i) {
        var node = el.childNodes[i];
        switch (node.nodeType) {
            case 1: // Element: recurse into its children
                replaceText(node, pattern, txt);
                break;
            case 3: // Text node: replace in place, keeping the node itself
                node.nodeValue = node.nodeValue.replace(pattern, txt);
                break;
        }
    }
}
Demo
Here is my version of replaceText:
function replaceText(elem) {
if(elem.nodeType === Node.TEXT_NODE) {
elem.nodeValue = elem.nodeValue.replace(/this/gi, 'that')
return
}
var children = elem.childNodes
for(var i = 0, len = children.length; i < len; ++i)
replaceText(children[i]);
}
NB: this takes a node as the first parameter and traverses all of its children, hence it works even with complex elements.
Here is the updated fiddle: http://jsfiddle.net/ZEfM7/6/
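To make the point about references concrete, here is a minimal sketch of the same text-node walk that also reports how many text nodes it changed. The name replaceInTextNodes is hypothetical. Because it only assigns to nodeValue and never creates or removes nodes, existing references to child elements, and any custom properties set on them, should stay intact.

```javascript
// Hypothetical helper: rewrites every text node under `root` in place and
// returns how many text nodes were actually changed.
function replaceInTextNodes(root, pattern, replacement) {
    let changed = 0;
    (function visit(node) {
        if (node.nodeType === 3) { // 3 is the nodeType of a text node
            const updated = node.nodeValue.replace(pattern, replacement);
            if (updated !== node.nodeValue) {
                node.nodeValue = updated;
                changed++;
            }
            return;
        }
        const children = node.childNodes || [];
        for (let i = 0; i < children.length; i++) visit(children[i]);
    })(root);
    return changed;
}
```

For the markup in the question, a call would presumably look like replaceInTextNodes(document.getElementById("myText"), /this/gi, "That").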

Replacing content of html document

I am trying to replace a word in an html document with selected word using javascript.
JavaScript
var node=document.body;
var childs=node.childNodes;
var n=childs.length,i=0;
while (i < n) {
node=childs[i];
if (node.nodeType == 3) {
if (node.textContent) {
node.nodeValue=node.nodeValue.replace("injected","hai");
}
}
i++;
}
But the string is not getting replaced. Please help.
Add document.body=node; at the end. When you set node to equal body you are copying the value, not editing it by reference.
I'm not sure why you're trying to work with the text node directly. console.log on nodeValue shows that the textContent of displayed tags is neither retrieved nor set in your code.
This works great. Live demo here (click).
<p>something to be replaced.</p>
and the js:
var childs = document.body.childNodes;
var len = childs.length;
for (var i=0; i<len; ++i) {
var node=childs[i];
if (node.nodeName === 'P') {
node.textContent = node.textContent.replace("to be replaced","was replaced");
}
}
There is a much simpler method using the String replace method. For example, you can convert the body of the page into a string and use regular expressions to replace the word. This means that you can avoid having to traverse the entire DOM and node lists, which is unnecessarily slow for your task.
document.body.innerHTML = document.body.innerHTML.replace(/injected/g, "hai");
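One detail that applies to every approach in this thread: String.prototype.replace with a plain string pattern only replaces the first occurrence, while a regular expression with the g flag replaces all of them. A small sketch with a made-up input string:

```javascript
// Made-up sample input for illustration.
const source = "injected text, then injected again";

// A string pattern replaces only the first occurrence.
const firstOnly = source.replace("injected", "hai");

// A regular expression with the g flag replaces every occurrence.
const everywhere = source.replace(/injected/g, "hai");

console.log(firstOnly);  // "hai text, then injected again"
console.log(everywhere); // "hai text, then hai again"
```

Also note that replace returns a new string, so the result has to be assigned back to nodeValue, textContent, or innerHTML to have any effect.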

Broken HTML tags when using .innerHTML

As part of a larger script, I've been trying to make a page that would take a block of text from another function and "type" it out onto the screen:
function typeOut(page,nChar){
var txt = document.getElementById("text");
if (nChar<page.length){
txt.innerHTML = txt.innerHTML + page[nChar];
setTimeout(function () {typeOut(page,nChar+1);},20);
}
}
This basically works the way I want it to, but if the block of text I pass it has any html tags in it (like links), those show up as plain-text instead of being interpreted. Is there any way to get around that and force it to display the html elements correctly?
The problem is that you will create invalid HTML in the process, which the browser will try to correct. So apparently when you add < or >, it will automatically encode that character to not break the structure.
A proper solution would not work literally with every character of the text, but would process the HTML element by element. I.e. whenever you encounter an element in the source HTML, you would clone the element and add it to target element. Then you would process its text nodes character by character.
Here is a solution I hacked together (meaning, it can probably be improved a lot):
function typeOut(html, target) {
var d = document.createElement('div');
d.innerHTML = html;
var source = d.firstChild;
var i = 0;
(function process() {
if (source) {
if (source.nodeType === 3) { // process text node
if (i === 0) { // create new text node
target = target.appendChild(document.createTextNode(''));
target.nodeValue = source.nodeValue.charAt(i++);
// stop and continue to next node
} else if (i === source.nodeValue.length) {
if (source.nextSibling) {
source = source.nextSibling;
target = target.parentNode;
}
else {
source = source.parentNode.nextSibling;
target = target.parentNode.parentNode;
}
i = 0;
} else { // add to text node
target.nodeValue += source.nodeValue.charAt(i++);
}
} else if (source.nodeType === 1) { // clone element node
var clone = source.cloneNode();
clone.innerHTML = '';
target.appendChild(clone);
if (source.firstChild) {
source = source.firstChild;
target = clone;
} else {
source = source.nextSibling;
}
}
setTimeout(process, 20);
}
}());
}
DEMO
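A simpler alternative, which is a different technique from the element-by-element walk above, is to split the HTML string into typing steps where each tag counts as a single step, and to re-render the growing prefix on every tick. The sketch below uses hypothetical names (splitForTyping, typedPrefix) and a deliberately naive regular expression that assumes no ">" character appears inside attribute values.

```javascript
// Hypothetical helper: splits an HTML string into typing steps. Each tag and
// each character reference (such as &amp;) is one step; every other character
// is its own step. Assumes attribute values never contain ">".
function splitForTyping(html) {
    return html.match(/<[^>]+>|&[^;\s]+;|[\s\S]/g) || [];
}

// Returns the markup that should be visible after `count` steps.
function typedPrefix(steps, count) {
    return steps.slice(0, count).join("");
}
```

In the question's typeOut function, each timer tick could then set txt.innerHTML = typedPrefix(steps, n) for increasing n. A prefix may end inside an open element, but since no tag is ever cut in half, the HTML parser closes the still-open element on its own.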
Your code should work. Example here: http://jsfiddle.net/hqKVe/2/
The issue is probably that the content of page[nChar] has its HTML characters escaped.
The easiest solution is to use the html() function of jQuery (if you use jQuery). There is a good example given by Canavar here: How to decode HTML entities using jQuery?
If you are not using jQuery, you have to unescape the string yourself. In practice, just do the opposite of what is described here: Fastest method to escape HTML tags as HTML entities?

in pure Javascript, how do I get all elements inside the body tag excluding a certain div and its children?

I'm trying to find all the elements inside the body tag, but there is one element (a div) with a certain "hidden" class that I want to exclude, along with its children, from my array of elements.
Here is my variable that contains all the elements in the body:
allTagsInBody = document.body.getElementsByTagName('*');
and here is the div that I want to exclude from this list:
<div class="myHiddenElement">
<button>Click here</button>
<div> <button>Click here</button> </div>
<button>Click here</button>
</div>
The problem is that I don't know how many elements there are inside that div or how deeply nested they are.
As you iterate through each element, you need to check not only whether it has your hidden class but also whether any of its parent elements have the class. Thus you need to recursively check each element's parents. This can be very expensive depending on the number of elements on the page and how deeply nested they are, but here's how it's done:
var arr = [];
var len;
var i;
var nodes = document.querySelectorAll('body *');
function checkNode(node) {
if (node.classList.contains('myHiddenElement')) {
return true;
} else if (node.parentNode.nodeType === 1) {
return checkNode(node.parentNode);
}
return false;
};
for (i = 0, len = nodes.length; i < len; i++) {
if (checkNode(nodes[i])) {
continue;
} else {
arr.push(nodes[i]);
}
}
Here's a JSFiddle example: http://jsfiddle.net/xzCfs/5/
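Unfortunately, I don't think there is a way to do this with CSS selectors in Selectors Level 3, since the :not() selector only accepts a simple selector, not a complex one (e.g., :not(.myHiddenElement *) would be awesome if that worked). Selectors Level 4 allows complex selectors inside :not(), so this form works in browsers that implement it.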
document.querySelectorAll( '*:not(.myHiddenElement)' );
The .querySelectorAll method along with the CSS3 :not() selector excludes the div itself, though its descendants still match.
Try this
var elems = document.body.childNodes;
var filtered = []; // holds elements that don't have the 'myHiddenElement' class
for (var i = 0; i < elems.length; i++)
{
    if (elems[i].className != 'myHiddenElement')
        filtered.push(elems[i]);
}
If all else fails you can always recursively traverse the DOM (it's what all the libraries do anyway):
Here's a generic DOM traverse function:
// Note: Even though this function accepts a callback it is synchronous:
function traverse (node, callback) {
// The callback function must return true to continue processing
// otherwise stop processing down this branch:
if (callback(node)) {
for (var i=0;i < node.childNodes.length; i++) {
traverse(node.childNodes[i],callback);
}
}
}
So, to build up your collection:
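var elements = [];
// Start at document.body: the document node itself is not an element, so
// starting at document would make the callback return false immediately.
// Note that body itself is collected too.
traverse(document.body,function(node){
    // We only care about element nodes, ignore comments, attributes etc:
    if (node.nodeType == 1 && !node.classList.contains("myHiddenElement")) {
        elements.push(node);
        return true; // continue parsing this branch
    }
    return false; // ignore this branch and its children
});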
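The pruning idea can also be written as a single function that returns the collected elements. This is only a sketch: collectElements is a hypothetical name, and it assumes className is a plain string, as it is on HTML elements. It reads only nodeType, className and childNodes, so it can be tried on plain objects as well as on real DOM nodes.

```javascript
// Hypothetical helper: returns every element below `root` (not `root` itself),
// leaving out any element that carries `excludedClass` and all its descendants.
function collectElements(root, excludedClass) {
    const found = [];
    (function visit(parent) {
        const children = parent.childNodes || [];
        for (let i = 0; i < children.length; i++) {
            const node = children[i];
            if (node.nodeType !== 1) continue; // skip text, comments, etc.
            const classes = String(node.className || "").split(/\s+/);
            if (classes.includes(excludedClass)) continue; // prune this subtree
            found.push(node);
            visit(node);
        }
    })(root);
    return found;
}
```

For the question's markup this would presumably be called as collectElements(document.body, "myHiddenElement"). Each node is visited once, so there is no need to re-check every element's ancestors.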
