Extract text from HTML while preserving block-level element newlines


Background
Most questions about extracting text from HTML (i.e., stripping the tags) use:
jQuery( htmlString ).text();
While this abstracts browser inconsistencies (such as innerText vs. textContent), the function call also ignores the semantic meaning of block-level elements (such as li).
Problem
Preserving newlines of block-level elements (i.e., the semantic intent) across various browsers entails no small effort, as Mike Wilcox describes.
A seemingly simpler solution would be to emulate pasting HTML content into a <textarea>, which strips HTML while preserving block-level element newlines. However, JavaScript-based inserts do not trigger the same HTML-to-text routines that browsers employ when users paste content into a <textarea>.
I also tried integrating Mike Wilcox's JavaScript code. The code works in Chromium, but not in Firefox.
Question
What is the simplest cross-browser way to extract text from HTML while preserving semantic newlines for block-level elements using jQuery (or vanilla JavaScript)?
Example
Consider:
Select and copy this entire question.
Open the textarea example page.
Paste the content into the textarea.
The textarea preserves the newlines for ordered lists, headings, preformatted text, and so forth. That is the result I would like to achieve.
To further clarify, given any HTML content, such as:
<h1>Header</h1>
<p>Paragraph</p>
<ul>
<li>First</li>
<li>Second</li>
</ul>
<dl>
<dt>Term</dt>
<dd>Definition</dd>
</dl>
<div>Div with <span>span</span>.<br />After the break.</div>
How would you produce:
Header
Paragraph
First
Second
Term
Definition
Div with span.
After the break.
Note: Neither indentation nor non-normalized whitespace is relevant.

Consider:
var _this = this; // Stable reference used by the recursive calls below.
/**
 * Returns the style for a node.
 *
 * @param n The node to check.
 * @param p The property to retrieve (usually 'display').
 * @link http://www.quirksmode.org/dom/getstyles.html
 */
this.getStyle = function( n, p ) {
  return n.currentStyle ?
    n.currentStyle[p] :
    document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
};
/**
 * Converts HTML to text, preserving semantic newlines for block-level
 * elements.
 *
 * @param node - The HTML node to perform text extraction.
 */
this.toText = function( node ) {
  var result = '';
  if( node.nodeType == document.TEXT_NODE ) {
    // Replace repeated spaces, newlines, and tabs with a single space.
    result = node.nodeValue.replace( /\s+/g, ' ' );
  }
  else {
    for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
      result += _this.toText( node.childNodes[i] );
    }
    // Let the computed display value decide whether the node ends a line.
    var d = _this.getStyle( node, 'display' );
    if( d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
        node.tagName == 'BR' || node.tagName == 'HR' ) {
      result += '\n';
    }
  }
  return result;
};
http://jsfiddle.net/3mzrV/2/
That is to say, with an exception or two, iterate through each node and print its contents, letting the browser's computed style tell you when to insert newlines.
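Because nested blocks each append their own newline (a closing li followed by its closing ul, for example), the returned string can contain runs of newlines and stray spaces. A minimal sketch of a cleanup pass, assuming you want at most one newline between lines and no leading or trailing whitespace; normalizeNewlines is a hypothetical helper name, not part of the code above:

```javascript
// Hypothetical helper: tidy the string returned by a function such as toText().
function normalizeNewlines(text) {
  return text
    .replace(/[ \t]+/g, ' ')    // collapse runs of spaces and tabs
    .replace(/ ?\n\s*/g, '\n')  // drop whitespace around newlines, merge blank lines
    .trim();                    // remove leading and trailing whitespace
}
```

You would call it on the result, as in normalizeNewlines(toText(node)).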

This seems to be (nearly) doing what you want:
function getText($node) {
return $node.contents().map(function () {
if (this.nodeName === 'BR') {
return '\n';
} else if (this.nodeType === 3) {
return this.nodeValue;
} else {
return getText($(this));
}
}).get().join('');
}
DEMO
It just recursively concatenates the values of all text nodes and replaces <br> elements with line breaks.
But there are no semantics in this; it relies completely on the original HTML formatting (the leading and trailing whitespace seems to come from how jsFiddle embeds the HTML, but you can easily trim it). For example, notice how it indents the definition term.
If you really want to do this on a semantic level, you need a list of block-level elements, then recursively iterate over the elements and indent them accordingly. You treat different block elements differently with respect to indentation and line breaks around them. This should not be too difficult.
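The tag-list idea described above could be sketched as follows. This is only an illustration: BLOCK_TAGS is a hand-picked, incomplete set, blockAwareText and htmlNodeToLines are hypothetical names, and indentation is ignored entirely. The functions read only nodeType, nodeName, nodeValue, and childNodes, so they accept DOM nodes as well as plain look-alike objects.

```javascript
// Assumption: a hand-picked, incomplete set of block-level tag names.
const BLOCK_TAGS = new Set([
  'ADDRESS', 'BLOCKQUOTE', 'DD', 'DIV', 'DL', 'DT', 'H1', 'H2', 'H3',
  'H4', 'H5', 'H6', 'HR', 'LI', 'OL', 'P', 'PRE', 'TABLE', 'TR', 'UL'
]);
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers

// Recursively collect text, wrapping block-level content in newlines.
function blockAwareText(node) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue.replace(/\s+/g, ' ');
  }
  if (node.nodeName === 'BR') {
    return '\n';
  }
  let result = '';
  for (const child of Array.from(node.childNodes || [])) {
    result += blockAwareText(child);
  }
  // Duplicate newlines produced here are merged in htmlNodeToLines().
  return BLOCK_TAGS.has(node.nodeName) ? '\n' + result + '\n' : result;
}

// Trim each line and drop the empty ones, leaving one line per block.
function htmlNodeToLines(node) {
  return blockAwareText(node)
    .split('\n')
    .map(line => line.trim())
    .filter(line => line !== '')
    .join('\n');
}
```

Unlike the computed-style approach, this does not depend on the node being rendered, but it also cannot see CSS that changes an element's display value.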

Based on https://stackoverflow.com/a/20384452/3338098,
fixed to support TEXT1<div>TEXT2</div> => TEXT1\nTEXT2 and to allow non-DOM nodes:
/**
* Returns the style for a node.
*
* @param n The node to check.
* @param p The property to retrieve (usually 'display').
* @link http://www.quirksmode.org/dom/getstyles.html
*/
function getNodeStyle( n, p ) {
return n.currentStyle ?
n.currentStyle[p] :
document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}
// If the node is not actually in the DOM, this will not take styles such as
// <div style="display: inline;">text</div> into account.
// That is sufficient for simple cases like `contenteditable`, but not for arbitrary HTML.
function isNodeBlock(node) {
if (node.nodeType == document.TEXT_NODE) {return false;}
var d = getNodeStyle( node, 'display' );//this is irrelevant if the node isn't currently in the current DOM.
if (d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
node.tagName == 'BR' || node.tagName == 'HR' ||
node.tagName == 'DIV' // div,p,... add as needed to support non-DOM nodes
) {
return true;
}
return false;
}
/**
* Converts HTML to text, preserving semantic newlines for block-level
* elements.
*
* @param htmlOrNode - An HTML string, or a node when isNode is true.
* @param isNode - True if htmlOrNode is already a node.
*/
function htmlToText( htmlOrNode, isNode ) {
var node = htmlOrNode;
if (!isNode) {node = jQuery("<span>"+htmlOrNode+"</span>")[0];}
//TODO: inject "unsafe" HTML into current DOM while guaranteeing that it won't
// change the visible DOM so that `isNodeBlock` will work reliably
var result = '';
if( node.nodeType == document.TEXT_NODE ) {
// Replace repeated spaces, newlines, and tabs with a single space.
result = node.nodeValue.replace( /\s+/g, ' ' );
} else {
for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
result += htmlToText( node.childNodes[i], true );
if (i < j-1) {
if (isNodeBlock(node.childNodes[i])) {
result += '\n';
} else if (isNodeBlock(node.childNodes[i+1]) &&
node.childNodes[i+1].tagName != 'BR' &&
node.childNodes[i+1].tagName != 'HR') {
result += '\n';
}
}
}
}
return result;
}
The main change was
if (i < j-1) {
if (isNodeBlock(node.childNodes[i])) {
result += '\n';
} else if (isNodeBlock(node.childNodes[i+1]) &&
node.childNodes[i+1].tagName != 'BR' &&
node.childNodes[i+1].tagName != 'HR') {
result += '\n';
}
}
to check neighboring blocks to determine the appropriateness of adding a newline.

I would like to suggest a small edit to svidgen's code:
function getText(n, isInnerNode) {
  var rv = '';
  if (n.nodeType == 3) {
    rv = n.nodeValue;
  } else {
    var partial = "";
    var d = getComputedStyle(n).getPropertyValue('display');
    // Only inner nodes get a leading newline; the root element never does.
    if (isInnerNode && (d.match(/^block/) || d.match(/list/) || n.tagName == 'BR')) {
      partial += "\n";
    }
    for (var i = 0; i < n.childNodes.length; i++) {
      partial += getText(n.childNodes[i], true);
    }
    rv = partial;
  }
  return rv;
}
I moved the line break before the for loop, so the newline comes before the block, and added a parameter that suppresses the newline for the root element.
Call the function like this:
getText(document.getElementById("divElement"))

Use element.innerText
This does not return the extra nodes added by contenteditable elements.
If you use element.innerHTML, the text will contain additional markup, but innerText will return what you see rendered as the element's contents.
<div id="txt" contenteditable="true"></div>
<script>
var txt=document.getElementById("txt");
var withMarkup=txt.innerHTML;
var textOnly=txt.innerText;
console.log(withMarkup);
console.log(textOnly);
</script>

Related

JavaScript getting innerHtml for all text nodes in nested elements [duplicate]

<div class="title">
I am text node
<a class="edit">Edit</a>
</div>
I wish to get the "I am text node", do not wish to remove the "edit" tag, and need a cross browser solution.

var text = $(".title").contents().filter(function() {
return this.nodeType == Node.TEXT_NODE;
}).text();
This gets the contents of the selected element, and applies a filter function to it. The filter function returns only text nodes (i.e. those nodes with nodeType == Node.TEXT_NODE).

You can get the nodeValue of the first childNode using
$('.title')[0].childNodes[0].nodeValue
http://jsfiddle.net/TU4FB/

Another native JS solution that can be useful for "complex" or deeply nested elements is to use NodeIterator. Put NodeFilter.SHOW_TEXT as the second argument ("whatToShow"), and iterate over just the text node children of the element.
var root = document.querySelector('p'),
iter = document.createNodeIterator(root, NodeFilter.SHOW_TEXT),
textnode;
// print all text nodes
while (textnode = iter.nextNode()) {
console.log(textnode.textContent)
}
<p>
<br>some text<br>123
</p>
You can also use TreeWalker. The difference between the two is that NodeIterator is a simple linear iterator, while TreeWalker allows you to navigate via siblings and ancestors as well.

ES6 version that returns the content of the first #text node:
const extract = (node) => {
const text = [...node.childNodes].find(child => child.nodeType === Node.TEXT_NODE);
return text && text.textContent.trim();
}

If you mean get the value of the first text node in the element, this code will work:
var oDiv = document.getElementById("MyDiv");
var firstText = "";
for (var i = 0; i < oDiv.childNodes.length; i++) {
var curNode = oDiv.childNodes[i];
if (curNode.nodeName === "#text") {
firstText = curNode.nodeValue;
break;
}
}
You can see this in action here: http://jsfiddle.net/ZkjZJ/

Pure JavaScript: Minimalist
First off, always keep this in mind when looking for text in the DOM.
MDN - Whitespace in the DOM
This issue will make you pay attention to the structure of your XML / HTML.
In this pure JavaScript example, I account for the possibility of multiple text nodes that could be interleaved with other kinds of nodes. However, initially, I do not pass judgment on whitespace, leaving that filtering task to other code.
In this version, I pass a NodeList in from the calling / client code.
/**
* Gets strings from text nodes. Minimalist. Non-robust. Pre-test loop version.
* Generic, cross platform solution. No string filtering or conditioning.
*
* @author Anthony Rutledge
* @param nodeList The child nodes of a Node, as in node.childNodes.
* @param target A positive whole number >= 1
* @return String The text you targeted.
*/
function getText(nodeList, target)
{
var trueTarget = target - 1,
length = nodeList.length; // Because you may have many child nodes.
for (var i = 0; i < length; i++) {
if ((nodeList[i].nodeType === Node.TEXT_NODE) && (i === trueTarget)) {
return nodeList[i].nodeValue; // Done! No need to keep going.
}
}
return null;
}
Of course, by testing node.hasChildNodes() first, there would be no need to use a pre-test for loop.
/**
* Gets strings from text nodes. Minimalist. Non-robust. Post-test loop version.
* Generic, cross platform solution. No string filtering or conditioning.
*
* @author Anthony Rutledge
* @param nodeList The child nodes of a Node, as in node.childNodes.
* @param target A positive whole number >= 1
* @return String The text you targeted.
*/
function getText(nodeList, target)
{
var trueTarget = target - 1,
length = nodeList.length,
i = 0;
do {
if ((nodeList[i].nodeType === Node.TEXT_NODE) && (i === trueTarget)) {
return nodeList[i].nodeValue; // Done! No need to keep going.
}
i++;
} while (i < length);
return null;
}
Pure JavaScript: Robust
Here the function getTextById() uses two helper functions: getStringsFromChildren() and filterWhitespaceLines().
getStringsFromChildren()
/**
* Collects strings from child text nodes.
* Generic, cross platform solution. No string filtering or conditioning.
*
* @author Anthony Rutledge
* @version 7.0
* @param parentNode An instance of the Node interface, such as an Element object.
* @return Array of strings, or null.
* @throws TypeError if the parentNode is not a Node object.
*/
function getStringsFromChildren(parentNode)
{
var strings = [],
nodeList,
length,
i = 0;
if (!(parentNode instanceof Node)) { // Parentheses are required: ! binds tighter than instanceof.
throw new TypeError("The parentNode parameter expects an instance of a Node.");
}
if (!parentNode.hasChildNodes()) {
return null; // We are done. Node may resemble <element></element>
}
nodeList = parentNode.childNodes;
length = nodeList.length;
do {
if ((nodeList[i].nodeType === Node.TEXT_NODE)) {
strings.push(nodeList[i].nodeValue);
}
i++;
} while (i < length);
if (strings.length > 0) {
return strings;
}
return null;
}
filterWhitespaceLines()
/**
* Filters an array of strings to remove whitespace lines.
* Generic, cross platform solution.
*
* @author Anthony Rutledge
* @version 6.0
* @param textArray An Array of strings, such as the one returned by getStringsFromChildren().
* @return Array of strings that are not lines of whitespace, or null.
* @throws TypeError if the textArray param is not of type Array.
*/
function filterWhitespaceLines(textArray)
{
var filteredArray = [],
whitespaceLine = /(?:^\s+$)/; // Non-capturing Regular Expression.
if (!(textArray instanceof Array)) { // Parentheses are required: ! binds tighter than instanceof.
throw new TypeError("The textArray parameter expects an instance of an Array.");
}
for (var i = 0; i < textArray.length; i++) {
if (!whitespaceLine.test(textArray[i])) { // If it is not a line of whitespace.
filteredArray.push(textArray[i].trim()); // Trimming here is fine.
}
}
if (filteredArray.length > 0) {
return filteredArray ; // Leave selecting and joining strings for a specific implementation.
}
return null; // No text to return.
}
getTextById()
/**
* Gets strings from text nodes. Robust.
* Generic, cross platform solution.
*
* @author Anthony Rutledge
* @version 6.0
* @param id A String associated with the id property of an Element.
* @return Array of strings, or null.
* @throws TypeError if the id param is not of type String.
* @throws TypeError if the id param cannot be used to find a node by id.
*/
function getTextById(id)
{
var textArray = null; // The hopeful output.
var idDatatype = typeof id; // Only used in a TypeError message.
var node; // The parent node being examined.
try {
if (idDatatype !== "string") {
throw new TypeError("The id argument must be of type String! Got " + idDatatype);
}
node = document.getElementById(id);
if (node === null) {
throw new TypeError("No element found with the id: " + id);
}
textArray = getStringsFromChildren(node);
if (textArray === null) {
return null; // No text nodes found. Example: <element></element>
}
textArray = filterWhitespaceLines(textArray);
if (textArray !== null && textArray.length > 0) { // filterWhitespaceLines() may return null.
return textArray; // Leave selecting and joining strings for a specific implementation.
}
} catch (e) {
console.log(e.message);
}
return null; // No text to return.
}
Next, the return value (Array, or null) is sent to the client code where it should be handled. Hopefully, the array should have string elements of real text, not lines of whitespace.
Empty strings ("") are not returned because you need a text node to properly indicate the presence of valid text. Returning ("") may give the false impression that a text node exists, leading someone to assume that they can alter the text by changing the value of .nodeValue. This is false, because a text node does not exist in the case of an empty string.
Example 1:
<p id="bio"></p> <!-- There is no text node here. Return null. -->
Example 2:
<p id="bio">
</p> <!-- There is one text node ("\n") here. -->
The problem comes in when you want to make your HTML easy to read by spacing it out. Now, even though there is no human-readable text, there is still a text node with a newline ("\n") character in its .nodeValue property. If the element also contained child elements or comments, each gap between them could hold another whitespace-only text node.
Humans see examples one and two as functionally equivalent: empty elements waiting to be filled. The DOM does not reason the way humans do. This is why the getStringsFromChildren() function must determine if text nodes exist and gather the .nodeValue values into an array.
for (var i = 0; i < length; i++) {
if (nodeList[i].nodeType === Node.TEXT_NODE) {
strings.push(nodeList[i].nodeValue);
}
}
In example two, a text node does exist and getStringsFromChildren() will return its .nodeValue ("\n"). However, filterWhitespaceLines() uses a regular expression to filter out lines of pure whitespace characters.
Is returning null instead of newline ("\n") characters a form of lying to the client / calling code? In human terms, no. In DOM terms, yes. However, the issue here is getting text, not editing it. There is no human text to return to the calling code.
One can never know how many newline characters might appear in someone's HTML. Creating a counter that looks for the "second" newline character is unreliable. It might not exist.
Of course, further down the line, the issue of editing text in an empty <p></p> element with extra whitespace (example 2) might mean destroying (maybe, skipping) all but one text node between a paragraph's tags to ensure the element contains precisely what it is supposed to display.
Regardless, except for cases where you are doing something extraordinary, you will need a way to determine which text node's .nodeValue property has the true, human readable text that you want to edit. filterWhitespaceLines gets us half way there.
var whitespaceLine = /(?:^\s+$)/; // Non-capturing Regular Expression.
for (var i = 0; i < textArray.length; i++) {
if (!whitespaceLine.test(textArray[i])) { // If it is not a line of whitespace.
filteredArray.push(textArray[i].trim()); // Trimming here is fine.
}
}
At this point you may have output that looks like this:
["Dealing with text nodes is fun.", "Some people just use jQuery."]
There is no guarantee that these two strings are adjacent to each other in the DOM, so joining them with .join() might make an unnatural composite. Instead, in the code that calls getTextById(), you need to choose which string you want to work with.
Test the output.
try {
var strings = getTextById("bio");
if (strings === null) {
// Do something.
} else if (strings.length === 1) {
// Do something with strings[0]
} else { // Could be another else if
// Do something. It all depends on the context.
}
} catch (e) {
console.log(e.message);
}
One could add .trim() inside of getStringsFromChildren() to get rid of leading and trailing whitespace (or to turn a bunch of spaces into a zero-length string, ""), but how can you know a priori what every application may need to have happen to the text (string) once it is found? You don't, so leave that to a specific implementation, and let getStringsFromChildren() be generic.
There may be times when this level of specificity (the target and such) is not required. That is great. Use a simple solution in those cases. However, a generalized algorithm enables you to accommodate simple and complex situations.
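One detail of the whitespace filter is worth illustrating: the pattern /^\s+$/ requires at least one whitespace character, so an empty string is not treated as a whitespace line and passes through as "". The sketch below is a hypothetical variant, not the author's function: nonBlankStrings is an assumed name, it trims first and then drops anything empty, and it returns an empty array instead of null.

```javascript
// Hypothetical variant of filterWhitespaceLines(): returns [] rather than null.
function nonBlankStrings(strings) {
  if (!Array.isArray(strings)) {
    throw new TypeError('Expected an array of strings.');
  }
  return strings
    .map(s => s.trim())
    .filter(s => s !== ''); // drops "", "\n", "   ", and similar
}

console.log(nonBlankStrings(['\n', ' Dealing with text nodes is fun. ', '', '\t']));
```

Returning an empty array lets the caller test .length without a separate null check; whether that is preferable to null depends on how the calling code distinguishes "no text" from "no element".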

.text() - for jquery
$('.title').clone() //clone the element
.children() //select all the children
.remove() //remove all the children
.end() //again go back to selected element
.text(); //get the text of element

This version also skips whitespace-only text nodes, so you never get blank text nodes. It uses plain JavaScript:
var oDiv = document.getElementById("MyDiv");
var firstText = "";
var whitespace = /^\s*$/; // Matches empty or whitespace-only strings.
for (var i = 0; i < oDiv.childNodes.length; i++) {
  var curNode = oDiv.childNodes[i];
  if (curNode.nodeName === "#text" && !(whitespace.test(curNode.nodeValue))) {
    firstText = curNode.nodeValue;
    break;
  }
}
Check it on jsfiddle : - http://jsfiddle.net/webx/ZhLep/

Simply via Vanilla JavaScript:
const el = document.querySelector('.title');
const text = el.firstChild.textContent.trim();

You can also use XPath's text() node test to get the text nodes only. For example
var target = document.querySelector('div.title');
var iter = document.evaluate('text()', target, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE);
var node;
var want = '';
while (node = iter.iterateNext()) {
want += node.data;
}

There are some overcomplicated solutions here but the operation is as straightforward as using .childNodes to get children of all node types and .filter to extract e.nodeType === Node.TEXT_NODEs. Optionally, we may want to do it recursively and/or ignore "empty" text nodes (all whitespace).
These examples convert the nodes to their text content for display purposes, but this is technically a separate step from filtering.
const immediateTextNodes = el =>
[...el.childNodes].filter(e => e.nodeType === Node.TEXT_NODE);
const immediateNonEmptyTextNodes = el =>
[...el.childNodes].filter(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
);
const firstImmediateTextNode = el =>
[...el.childNodes].find(e => e.nodeType === Node.TEXT_NODE);
const firstImmediateNonEmptyTextNode = el =>
[...el.childNodes].find(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
);
// example usage:
const text = el => el.textContent;
const p = document.querySelector("p");
console.log(immediateTextNodes(p).map(text));
console.log(immediateNonEmptyTextNodes(p).map(text));
console.log(text(firstImmediateTextNode(p)));
console.log(text(firstImmediateNonEmptyTextNode(p)));
// if you want to trim whitespace:
console.log(immediateNonEmptyTextNodes(p).map(e => text(e).trim()));
<p>
<span>IGNORE</span>
<b>IGNORE</b>
foo
<br>
bar
</p>
Recursive alternative to a NodeIterator:
const deepTextNodes = el => [...el.childNodes].flatMap(e =>
e.nodeType === Node.TEXT_NODE ? e : deepTextNodes(e)
);
const deepNonEmptyTextNodes = el =>
[...el.childNodes].flatMap(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
? e : deepNonEmptyTextNodes(e)
);
// example usage:
const text = el => el.textContent;
const p = document.querySelector("p");
console.log(deepTextNodes(p).map(text));
console.log(deepNonEmptyTextNodes(p).map(text));
<p>
foo
<span>bar</span>
baz
<span><b>quux</b></span>
</p>
Finally, feel free to join the text node array into a string if you wish using .join(""). But as with trimming and text content extraction, I'd probably not bake this into the core filtering function and leave it to the caller to handle as needed.

Best way to extract unformatted text from DOM preserving line breaks?

Let's say I have the following element TEXT in HTML:
<div id="TEXT">
<p>First <strong>Line</strong></p>
<p>Second <em>Line</em></p>
</div>
How should one extract the raw text from this element, without HTML tags, but preserving the line breaks?
I know about the following two options but neither of them seems to be perfect:
document.getElementById("TEXT").textContent
returns
First LineSecond Line
problem: ignores the line break that should be included between paragraphs
document.getElementById("TEXT").innerText
returns
First Line
Second Line
problem: is not part of W3C standard and is not guaranteed to work in all browsers

Here's a handy function for getting text contents of any element and it works well on all platforms, and yes, it preserves line breaks.
function text(e){
var t = "";
e = e.childNodes || e;
for(var i = 0;i<e.length;i++){
t+= e[i].nodeType !=1 ? e[i].nodeValue : text(e[i].childNodes);
}
return t;
}

You can check how jQuery does it. It uses Sizzle.js. Here is the function that you can use.
<div id="TEXT">
<p>First <strong>Line</strong></p>
<p>Second <em>Line</em></p>
</div>
<script>
var getText = function( elem ) {
var node,
ret = "",
i = 0,
nodeType = elem.nodeType;
if ( !nodeType ) {
// If no nodeType, this is expected to be an array
while ( (node = elem[i++]) ) {
// Do not traverse comment nodes
ret += getText( node );
}
} else if ( nodeType === 1 || nodeType === 9 || nodeType === 11 ) {
// Use textContent for elements
// innerText usage removed for consistency of new lines (jQuery #11153)
if ( typeof elem.textContent === "string" ) {
return elem.textContent;
} else {
// Traverse its children
for ( elem = elem.firstChild; elem; elem = elem.nextSibling ) {
ret += getText( elem );
}
}
} else if ( nodeType === 3 || nodeType === 4 ) {
return elem.nodeValue;
}
// Do not include comment or processing instruction nodes
return ret;
};
console.log(getText(document.getElementById('TEXT')));
</script>

Converting html to textual representation with preserved whitespace meaning of tags -- how?

Consider such html piece:
<p>foo</p><p>bar</p>
If you run (for example) jQuery text for it you will get "foobar" -- so it is raw text actually, not textual representation.
I am looking for some ready to use library to get textual representation, in this case it should be -- "foo\nbar". Or clever hints how to make this as easy as possible ;-).
NOTE: I am not looking for beautiful output text, but just preserved meaning of whitespaces, so for:
<tr><td>foo</td><td>bar</td></tr>
<tr><td>1</td><td>2</td></tr>
I will be happy with
foo bar
1 2
it does NOT have to be:
foo bar
1 2
(but of course no harm done).

Have you looked at the innerText or textContent properties?
function getText(element){
var s = "";
if(element.innerText){
s = element.innerText;
}else if(element.textContent){
s = element.textContent;
}
return s;
}
Example
Adds a PRE tag to the body and appends the body text.
document.body.appendChild(
document.createElement('pre')
)
.appendChild(
document.createTextNode(
getText(document.body)
)
);
Edit
Does using a range work with firefox?
var r = document.createRange();
r.selectNode(document.body);
console.log(r.toString());
Edit
It looks like you're stuck with a parsing function like this then.
var parse = function(element){
var s = "";
for(var i = 0; i < element.childNodes.length; i++){
if(/^(iframe|noscript|script|style)$/i.test(element.childNodes[i].nodeName)){
continue;
}else if(/^(tr|br|p|hr)$/i.test(element.childNodes[i].nodeName)){
s+='\n';
}else if(/^(td|th)$/i.test(element.childNodes[i].nodeName)){ // nodeName is uppercase in HTML documents
s+='\t';
}
if(element.childNodes[i].nodeType == 3){
s+=element.childNodes[i].nodeValue.replace(/[\r\n]+/g, ""); // strip every line break, not just the first run
}else{
s+=parse(element.childNodes[i]);
}
}
return s;
}
console.log(parse(document.body));
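For the table case in the question, the same idea can be narrowed to "one line per row, cells separated by a tab". The following is a rough sketch under stated assumptions: plainText, findAll, and tableToText are hypothetical names, nested tables get no special treatment, and only nodeType, nodeName, nodeValue, and childNodes are read, so DOM nodes and plain look-alike objects both work.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers

// Concatenate all descendant text of a node (DOM node or look-alike object).
function plainText(node) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  return Array.from(node.childNodes || []).map(plainText).join('');
}

// Collect descendants whose nodeName is in `names`, in document order.
// Matching nodes are not searched further.
function findAll(node, names) {
  const found = [];
  for (const child of Array.from(node.childNodes || [])) {
    if (names.includes(child.nodeName)) {
      found.push(child);
    } else {
      found.push(...findAll(child, names));
    }
  }
  return found;
}

// One line per row, cells separated by a tab.
function tableToText(table) {
  return findAll(table, ['TR'])
    .map(row => findAll(row, ['TD', 'TH'])
      .map(cell => plainText(cell).replace(/\s+/g, ' ').trim())
      .join('\t'))
    .join('\n');
}
```

In a browser you might call tableToText(document.querySelector('table')); for the two-row example in the question this approach is meant to yield the rows "foo", "bar" and "1", "2" joined by tabs and a newline.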

I started writing my own function probably at the same time as Zapthedingbat, so just for the record:
var NodeTypeEnum = { Element : 1,Attribute : 2, Text: 3, Comment :8,Document :9};
function doTextualRepresentation(elem)
{
if (elem.nodeType==NodeTypeEnum.Text)
return elem.nodeValue;
else if (elem.nodeType==NodeTypeEnum.Element || elem.nodeType==NodeTypeEnum.Document)
{
var s = "";
var child = elem.firstChild;
while (child!=null)
{
s += doTextualRepresentation(child);
child = child.nextSibling;
}
if (['P','DIV','TABLE','TR','BR','HR'].indexOf(elem.tagName)>-1)
s = "\n"+s+"\n";
else if (['TD','TH'].indexOf(elem.tagName)>-1)
s = "\t"+s+"\t";
return s;
}
return "";
}
function TextualRepresentation(elem)
{
return doTextualRepresentation(elem).replace(/\n[\s]+/g,"\n").replace(/\t{2,}/g,"\t");
}
One thing I am surprised with -- I couldn't get
for (var child in elem.childNodes)
working, and it is a pity, because I spend most time in C# and I like this syntax, theoretically it should work in JS, but it doesn't.
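The reason is that JavaScript's for...in enumerates property names rather than values, so on a node list it yields index strings (and typically inherited names such as length and item as well). The closer match to C#'s foreach is for...of or a plain indexed loop. A small sketch of the difference, using an array-like object as a stand-in for elem.childNodes so that it runs without a browser; the object and its contents are made up for illustration:

```javascript
// Stand-in for elem.childNodes: an array-like object with an extra property.
const childNodes = { 0: 'first', 1: 'second', length: 2, note: 'not a node' };

const keys = [];
for (const key in childNodes) {
  keys.push(key); // for...in yields property names as strings
}

const values = [];
for (const value of Array.from(childNodes)) {
  values.push(value); // for...of yields the indexed values only
}

console.log(keys);   // [ '0', '1', 'length', 'note' ]
console.log(values); // [ 'first', 'second' ]
```

In current browsers a NodeList is itself iterable, so for (const child of elem.childNodes) works directly; Array.from is used here only because the stand-in object is not iterable.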

How to get the text node of an element?

<div class="title">
I am text node
<a class="edit">Edit</a>
</div>
I wish to get the "I am text node", do not wish to remove the "edit" tag, and need a cross browser solution.

var text = $(".title").contents().filter(function() {
return this.nodeType == Node.TEXT_NODE;
}).text();
This gets the contents of the selected element, and applies a filter function to it. The filter function returns only text nodes (i.e. those nodes with nodeType == Node.TEXT_NODE).

You can get the nodeValue of the first childNode using
$('.title')[0].childNodes[0].nodeValue
http://jsfiddle.net/TU4FB/

Another native JS solution that can be useful for "complex" or deeply nested elements is to use NodeIterator. Put NodeFilter.SHOW_TEXT as the second argument ("whatToShow"), and iterate over just the text node children of the element.
var root = document.querySelector('p'),
iter = document.createNodeIterator(root, NodeFilter.SHOW_TEXT),
textnode;
// print all text nodes
while (textnode = iter.nextNode()) {
console.log(textnode.textContent)
}
<p>
<br>some text<br>123
</p>
You can also use TreeWalker. The difference between the two is that NodeIterator is a simple linear iterator, while TreeWalker allows you to navigate via siblings and ancestors as well.

ES6 version that return the first #text node content
const extract = (node) => {
const text = [...node.childNodes].find(child => child.nodeType === Node.TEXT_NODE);
return text && text.textContent.trim();
}

If you mean get the value of the first text node in the element, this code will work:
var oDiv = document.getElementById("MyDiv");
var firstText = "";
for (var i = 0; i < oDiv.childNodes.length; i++) {
var curNode = oDiv.childNodes[i];
if (curNode.nodeName === "#text") {
firstText = curNode.nodeValue;
break;
}
}
You can see this in action here: http://jsfiddle.net/ZkjZJ/

Pure JavaScript: Minimalist
First off, always keep this in mind when looking for text in the DOM.
MDN - Whitespace in the DOM
This issue will make you pay attention to the structure of your XML / HTML.
In this pure JavaScript example, I account for the possibility of multiple text nodes that could be interleaved with other kinds of nodes. However, initially, I do not pass judgment on whitespace, leaving that filtering task to other code.
In this version, I pass a NodeList in from the calling / client code.
/**
* Gets strings from text nodes. Minimalist. Non-robust. Pre-test loop version.
* Generic, cross platform solution. No string filtering or conditioning.
*
* #author Anthony Rutledge
* #param nodeList The child nodes of a Node, as in node.childNodes.
* #param target A positive whole number >= 1
* #return String The text you targeted.
*/
function getText(nodeList, target)
{
var trueTarget = target - 1,
length = nodeList.length; // Because you may have many child nodes.
for (var i = 0; i < length; i++) {
if ((nodeList[i].nodeType === Node.TEXT_NODE) && (i === trueTarget)) {
return nodeList[i].nodeValue; // Done! No need to keep going.
}
}
return null;
}
Of course, by testing node.hasChildNodes() first, there would be no need to use a pre-test for loop.
/**
* Gets strings from text nodes. Minimalist. Non-robust. Post-test loop version.
* Generic, cross platform solution. No string filtering or conditioning.
*
* #author Anthony Rutledge
* #param nodeList The child nodes of a Node, as in node.childNodes.
* #param target A positive whole number >= 1
* #return String The text you targeted.
*/
function getText(nodeList, target)
{
var trueTarget = target - 1,
length = nodeList.length,
i = 0;
do {
if ((nodeList[i].nodeType === Node.TEXT_NODE) && (i === trueTarget)) {
return nodeList[i].nodeValue; // Done! No need to keep going.
}
i++;
} while (i < length);
return null;
}
Pure JavaScript: Robust
Here the function getTextById() uses two helper functions: getStringsFromChildren() and filterWhitespaceLines().
getStringsFromChildren()
/**
* Collects strings from child text nodes.
* Generic, cross platform solution. No string filtering or conditioning.
*
* #author Anthony Rutledge
* #version 7.0
* #param parentNode An instance of the Node interface, such as an Element. object.
* #return Array of strings, or null.
* #throws TypeError if the parentNode is not a Node object.
*/
function getStringsFromChildren(parentNode)
{
var strings = [],
nodeList,
length,
i = 0;
if (!parentNode instanceof Node) {
throw new TypeError("The parentNode parameter expects an instance of a Node.");
}
if (!parentNode.hasChildNodes()) {
return null; // We are done. Node may resemble <element></element>
}
nodeList = parentNode.childNodes;
length = nodeList.length;
do {
if ((nodeList[i].nodeType === Node.TEXT_NODE)) {
strings.push(nodeList[i].nodeValue);
}
i++;
} while (i < length);
if (strings.length > 0) {
return strings;
}
return null;
}
filterWhitespaceLines()
/**
* Filters an array of strings to remove whitespace lines.
* Generic, cross platform solution.
*
* #author Anthony Rutledge
* #version 6.0
* #param textArray a String associated with the id attribute of an Element.
* #return Array of strings that are not lines of whitespace, or null.
* #throws TypeError if the textArray param is not of type Array.
*/
function filterWhitespaceLines(textArray)
{
var filteredArray = [],
whitespaceLine = /(?:^\s+$)/; // Non-capturing Regular Expression.
if (!(textArray instanceof Array)) { // Parentheses needed: ! binds tighter than instanceof.
throw new TypeError("The textArray parameter expects an instance of an Array.");
}
for (var i = 0; i < textArray.length; i++) {
if (!whitespaceLine.test(textArray[i])) { // If it is not a line of whitespace.
filteredArray.push(textArray[i].trim()); // Trimming here is fine.
}
}
if (filteredArray.length > 0) {
return filteredArray ; // Leave selecting and joining strings for a specific implementation.
}
return null; // No text to return.
}
getTextById()
/**
* Gets strings from text nodes. Robust.
* Generic, cross platform solution.
*
* @author Anthony Rutledge
* @version 6.0
* @param id A String associated with the id property of an Element.
* @return Array of strings, or null.
* @throws TypeError if the id param is not of type String.
* @throws TypeError if the id param cannot be used to find a node by id.
*/
function getTextById(id)
{
var textArray = null; // The hopeful output.
var idDatatype = typeof id; // Only used in a TypeError message.
var node; // The parent node being examined.
try {
if (idDatatype !== "string") {
throw new TypeError("The id argument must be of type String! Got " + idDatatype);
}
node = document.getElementById(id);
if (node === null) {
throw new TypeError("No element found with the id: " + id);
}
textArray = getStringsFromChildren(node);
if (textArray === null) {
return null; // No text nodes found. Example: <element></element>
}
textArray = filterWhitespaceLines(textArray);
if (textArray !== null) { // filterWhitespaceLines() returns null when nothing is left.
return textArray; // Leave selecting and joining strings for a specific implementation.
}
} catch (e) {
console.log(e.message);
}
return null; // No text to return.
}
Next, the return value (an Array, or null) goes back to the client code, which must handle both cases. When an array is returned, its elements are strings of real text, not lines of whitespace.
Empty strings ("") are not returned because you need a text node to properly indicate the presence of valid text. Returning ("") may give the false impression that a text node exists, leading someone to assume that they can alter the text by changing the value of .nodeValue. This is false, because a text node does not exist in the case of an empty string.
Example 1:
<p id="bio"></p> <!-- There is no text node here. Return null. -->
Example 2:
<p id="bio">
</p> <!-- There is a text node ("\n") here, even though it holds only whitespace. -->
The problem comes in when you want to make your HTML easy to read by spacing it out. Now, even though there is no human readable valid text, there are still text nodes with newline ("\n") characters in their .nodeValue properties.
Humans see examples one and two as functionally equivalent--empty elements waiting to be filled. The DOM is different than human reasoning. This is why the getStringsFromChildren() function must determine if text nodes exist and gather the .nodeValue values into an array.
for (var i = 0; i < length; i++) {
if (nodeList[i].nodeType === Node.TEXT_NODE) {
textNodes.push(nodeList[i].nodeValue);
}
}
In example two, a text node does exist, and getStringsFromChildren() will return its .nodeValue ("\n"). However, filterWhitespaceLines() uses a regular expression to filter out lines of pure whitespace characters.
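The difference between the two examples can be made concrete with a small sketch. The helper name describeTextContent is hypothetical, and plain objects stand in for the two <p> elements so that the snippet also runs outside a browser; in a page you would pass a real element instead.

```javascript
const TEXT_NODE = 3; // Same numeric value as Node.TEXT_NODE in browsers.

// Hypothetical helper: reports what kind of text an element's children hold.
function describeTextContent(parentNode) {
  const texts = Array.from(parentNode.childNodes)
    .filter(child => child.nodeType === TEXT_NODE)
    .map(child => child.nodeValue);

  if (texts.length === 0) {
    return "no text nodes";     // Example 1: nothing between the tags.
  }
  if (texts.every(value => /^\s*$/.test(value))) {
    return "whitespace only";   // Example 2: only a newline between the tags.
  }
  return "readable text";
}

// Plain objects standing in for the two <p id="bio"> elements (an assumption).
const exampleOne = { childNodes: [] };
const exampleTwo = { childNodes: [{ nodeType: TEXT_NODE, nodeValue: "\n" }] };

console.log(describeTextContent(exampleOne)); // "no text nodes"
console.log(describeTextContent(exampleTwo)); // "whitespace only"
```

Both cases end up as null in getTextById(), but only the second one has a text node whose .nodeValue could be edited.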
Is returning null instead of newline ("\n") characters a form of lying to the client / calling code? In human terms, no. In DOM terms, yes. However, the issue here is getting text, not editing it. There is no human text to return to the calling code.
One can never know how many newline characters might appear in someone's HTML. Creating a counter that looks for the "second" newline character is unreliable. It might not exist.
Of course, further down the line, the issue of editing text in an empty <p></p> element with extra whitespace (example 2) might mean destroying (maybe, skipping) all but one text node between a paragraph's tags to ensure the element contains precisely what it is supposed to display.
Regardless, except for cases where you are doing something extraordinary, you will need a way to determine which text node's .nodeValue property has the true, human readable text that you want to edit. filterWhitespaceLines gets us half way there.
var whitespaceLine = /(?:^\s+$)/; // Non-capturing Regular Expression.
for (var i = 0; i < textArray.length; i++) { // Loop over the input array, not the (still empty) output.
if (!whitespaceLine.test(textArray[i])) { // If it is not a line of whitespace.
filteredTextArray.push(textArray[i].trim()); // Trimming here is fine.
}
}
At this point you may have output that looks like this:
["Dealing with text nodes is fun.", "Some people just use jQuery."]
There is no guarantee that these two strings are adjacent to each other in the DOM, so joining them with .join() might make an unnatural composite. Instead, in the code that calls getTextById(), you need to choose which string you want to work with.
Test the output.
try {
var strings = getTextById("bio");
if (strings === null) {
// Do something.
} else if (strings.length === 1) {
// Do something with strings[0]
} else { // Could be another else if
// Do something. It all depends on the context.
}
} catch (e) {
console.log(e.message);
}
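Because the returned strings are not guaranteed to be adjacent in the DOM, calling code may want to know where each one came from. Here is a minimal sketch, assuming a hypothetical helper getIndexedStrings (not part of the functions above) that keeps each string's position in childNodes. Plain objects stand in for DOM nodes so that the snippet also runs outside a browser.

```javascript
const TEXT_NODE = 3; // Same numeric value as Node.TEXT_NODE in browsers.

// Hypothetical helper: keeps non-blank text, trimmed, together with the
// index of its text node within parentNode.childNodes.
function getIndexedStrings(parentNode) {
  const result = [];
  Array.from(parentNode.childNodes).forEach((child, index) => {
    if (child.nodeType === TEXT_NODE && child.nodeValue.trim() !== "") {
      result.push({ index: index, text: child.nodeValue.trim() });
    }
  });
  return result;
}

// Stand-in for an element holding two text nodes separated by an element.
const bio = {
  childNodes: [
    { nodeType: TEXT_NODE, nodeValue: "Dealing with text nodes is fun." },
    { nodeType: 1, nodeValue: null }, // an element between the text nodes
    { nodeType: TEXT_NODE, nodeValue: " Some people just use jQuery.\n" }
  ]
};

console.log(getIndexedStrings(bio));
// [ { index: 0, text: "Dealing with text nodes is fun." },
//   { index: 2, text: "Some people just use jQuery." } ]
```

A gap between the indexes (0 and 2 here) tells the caller that something sits between the two strings, so joining them blindly would skip over it.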
One could add .trim() inside of getStringsFromChildren() to get rid of leading and trailing whitespace (or to turn a bunch of spaces into a zero-length string, ""), but how can you know a priori what every application may need to have happen to the text (string) once it is found? You cannot, so leave that to a specific implementation, and let getStringsFromChildren() be generic.
There may be times when this level of specificity (the target and such) is not required. That is great. Use a simple solution in those cases. However, a generalized algorithm enables you to accommodate simple and complex situations.

.text() - for jquery
$('.title').clone() //clone the element
.children() //select all the children
.remove() //remove all the children
.end() //again go back to selected element
.text(); //get the text of element

The following core JavaScript version also ignores whitespace, so you never get blank text nodes.
var oDiv = document.getElementById("MyDiv");
var firstText = "";
var whitespace = /^\s*$/; // Declared once with var, instead of as an implicit global inside the loop.
for (var i = 0; i < oDiv.childNodes.length; i++) {
    var curNode = oDiv.childNodes[i];
    if (curNode.nodeName === "#text" && !(whitespace.test(curNode.nodeValue))) {
        firstText = curNode.nodeValue; // First text node that is not blank.
        break;
    }
}
Check it on jsfiddle : - http://jsfiddle.net/webx/ZhLep/

Simply via Vanilla JavaScript:
const el = document.querySelector('.title');
const text = el.firstChild.textContent.trim();
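Note that the two lines above assume the element's first child exists and is the text node you want. Here is a slightly more defensive sketch, using a hypothetical helper firstText and a plain object standing in for the element so that it also runs outside a browser:

```javascript
const TEXT_NODE = 3; // Same numeric value as Node.TEXT_NODE in browsers.

// Hypothetical helper: trimmed text of the first non-blank direct text node,
// or null when the element has none.
function firstText(el) {
  const node = Array.from(el.childNodes).find(
    child => child.nodeType === TEXT_NODE && child.nodeValue.trim() !== ""
  );
  return node ? node.nodeValue.trim() : null;
}

// Stand-in for an element whose first child is another element (an assumption).
const title = {
  childNodes: [
    { nodeType: 1, nodeValue: null },
    { nodeType: TEXT_NODE, nodeValue: " Title text " }
  ]
};

console.log(firstText(title));              // "Title text"
console.log(firstText({ childNodes: [] })); // null
```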

You can also use XPath's text() node test to get the text nodes only. For example
var target = document.querySelector('div.title');
var iter = document.evaluate('text()', target, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
var node;
var want = '';
while (node = iter.iterateNext()) {
want += node.data;
}

There are some overcomplicated solutions here, but the operation is as straightforward as using .childNodes to get children of all node types and .filter to keep those whose nodeType is Node.TEXT_NODE. Optionally, we may want to do it recursively and/or ignore "empty" text nodes (all whitespace).
These examples convert the nodes to their text content for display purposes, but this is technically a separate step from filtering.
const immediateTextNodes = el =>
[...el.childNodes].filter(e => e.nodeType === Node.TEXT_NODE);
const immediateNonEmptyTextNodes = el =>
[...el.childNodes].filter(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
);
const firstImmediateTextNode = el =>
[...el.childNodes].find(e => e.nodeType === Node.TEXT_NODE);
const firstImmediateNonEmptyTextNode = el =>
[...el.childNodes].find(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
);
// example usage:
const text = el => el.textContent;
const p = document.querySelector("p");
console.log(immediateTextNodes(p).map(text));
console.log(immediateNonEmptyTextNodes(p).map(text));
console.log(text(firstImmediateTextNode(p)));
console.log(text(firstImmediateNonEmptyTextNode(p)));
// if you want to trim whitespace:
console.log(immediateNonEmptyTextNodes(p).map(e => text(e).trim()));
<p>
<span>IGNORE</span>
<b>IGNORE</b>
foo
<br>
bar
</p>
Recursive alternative to a NodeIterator:
const deepTextNodes = el => [...el.childNodes].flatMap(e =>
e.nodeType === Node.TEXT_NODE ? e : deepTextNodes(e)
);
const deepNonEmptyTextNodes = el =>
[...el.childNodes].flatMap(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
? e : deepNonEmptyTextNodes(e)
);
// example usage:
const text = el => el.textContent;
const p = document.querySelector("p");
console.log(deepTextNodes(p).map(text));
console.log(deepNonEmptyTextNodes(p).map(text));
<p>
foo
<span>bar</span>
baz
<span><b>quux</b></span>
</p>
Finally, feel free to join the text node array into a string if you wish using .join(""). But as with trimming and text content extraction, I'd probably not bake this into the core filtering function and leave it to the caller to handle as needed.
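For the original goal of keeping block-level newlines, a plain .join("") is not enough, because the element boundaries are lost. Below is one possible caller-side approach, offered as a rough sketch rather than a definitive implementation. The names BLOCK_TAGS, collectLines and textWithBlockNewlines are hypothetical, the list of block-level tags is a partial assumption, and plain objects stand in for DOM nodes so that the snippet also runs outside a browser.

```javascript
const ELEMENT_NODE = 1; // Same numeric value as Node.ELEMENT_NODE.
const TEXT_NODE = 3;    // Same numeric value as Node.TEXT_NODE.

// Assumption: a partial list of block-level tag names; extend as needed.
const BLOCK_TAGS = new Set([
  "BLOCKQUOTE", "DD", "DIV", "DL", "DT", "H1", "H2", "H3", "H4", "H5", "H6",
  "LI", "OL", "P", "PRE", "TABLE", "TR", "UL"
]);

// Walks the subtree in document order, starting a new line at every <br>
// and before and after every block-level element.
function collectLines(node, lines) {
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === TEXT_NODE) {
      lines[lines.length - 1] += child.nodeValue;
    } else if (child.nodeType === ELEMENT_NODE) {
      const tag = child.tagName.toUpperCase();
      const isBlock = BLOCK_TAGS.has(tag);
      if (isBlock || tag === "BR") {
        lines.push("");
      }
      collectLines(child, lines);
      if (isBlock) {
        lines.push("");
      }
    }
  }
  return lines;
}

// Hypothetical entry point: normalizes whitespace and drops empty lines.
function textWithBlockNewlines(root) {
  return collectLines(root, [""])
    .map(line => line.replace(/\s+/g, " ").trim())
    .filter(line => line !== "")
    .join("\n");
}

// Plain objects standing in for:
// <div>Div with <span>span</span>.<br>After the break.</div>
const textNode = value => ({ nodeType: TEXT_NODE, nodeValue: value, childNodes: [] });
const element = (tagName, ...childNodes) => ({ nodeType: ELEMENT_NODE, tagName, childNodes });
const sample = element("DIV",
  textNode("Div with "), element("SPAN", textNode("span")), textNode("."),
  element("BR"), textNode("After the break."));

console.log(textWithBlockNewlines(sample));
// Div with span.
// After the break.
```

The sketch looks only at tag names and ignores CSS, so an element made block-level by a stylesheet is treated as inline.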

TEXT_NODE: returns ONLY text?

I'm using JavaScript to extract all text from a DOM object. My algorithm goes over the DOM object itself and its descendants; if a node is of type TEXT_NODE, it accumulates its nodeValue.
For some weird reason I also get things like:
#hdr-editions a { text-decoration:none; }
#cnn_hdr-editionS { text-align:left;clear:both; }
#cnn_hdr-editionS a { text-decoration:none;font-size:10px;top:7px;line-height:12px;font-weight:bold; }
#hdr-prompt-text b { display:inline-block;margin:0 0 0 20px; }
#hdr-editions li { padding:0 10px; }
How do I filter this? Do I need to use something else? I want ONLY text.

From the looks of things, you're also collecting the text from <style> elements. You might want to run a check for those:
var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 }
if (element.tagName in ignore)
continue;
You can add any other elements to the object map to ignore them.
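Here is a small variation on the same idea, sketched with a Set and an upper-cased tag name. The helper name shouldIgnore is hypothetical, and the upper-casing is an assumption meant to cover documents (XHTML, for example) where tagName is not already upper case.

```javascript
// Tag names whose text content is normally not rendered as page text.
const IGNORED_TAGS = new Set(["STYLE", "SCRIPT", "NOSCRIPT", "IFRAME", "OBJECT"]);

// Hypothetical helper: true when the loop should skip this element.
// Nodes without a tagName (text nodes, comments) are never ignored here.
function shouldIgnore(element) {
  return typeof element.tagName === "string" &&
    IGNORED_TAGS.has(element.tagName.toUpperCase());
}

// Plain objects standing in for DOM elements (an assumption for illustration).
console.log(shouldIgnore({ tagName: "style" })); // true
console.log(shouldIgnore({ tagName: "P" }));     // false
```

Unlike the in operator on an object literal, a Set does not match inherited property names such as "constructor".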

You want to skip over style elements.
In your loop, you could do this...
if (element.tagName == 'STYLE') {
continue;
}
You also probably want to skip over script, textarea, etc.

This is text as far as the DOM is concerned. You'll have to filter out (skip) <script> and <style> tags.

[Answer added after reading OP's comments to Andy's excellent answer]
The problem is that you see the text nodes inside elements whose content is normally not rendered by browsers - such as STYLE and SCRIPT tags.
When you scan the DOM tree (using a depth-first search, I assume), your scan should skip over the content of such tags.
For example - a recursive depth-first DOM tree walker might look like this:
function walker(domObject, extractorCallback) {
    if (domObject == null) return; // fail fast
    extractorCallback(domObject);
    if (domObject.nodeType != Node.ELEMENT_NODE) return; // only elements are descended into
    var childs = domObject.childNodes;
    for (var i = 0; i < childs.length; i++)
        walker(childs[i], extractorCallback); // pass the callback along to each child
}
var textvalue = "";
// Start at the root element: document itself is not an ELEMENT_NODE.
walker(document.documentElement, function(node) {
    if (node.nodeType == Node.TEXT_NODE)
        textvalue += node.nodeValue;
});
In such a case, if your walker encounters tags whose content you know you don't want to see, you should just skip going into that part of the tree. So walker() will have to be adapted like this:
var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 };
function walker(domObject, extractorCallback) {
    if (domObject == null) return; // fail fast
    extractorCallback(domObject);
    if (domObject.nodeType != Node.ELEMENT_NODE) return; // only elements are descended into
    if (domObject.tagName in ignore) return; // <--- HERE
    var childs = domObject.childNodes;
    for (var i = 0; i < childs.length; i++)
        walker(childs[i], extractorCallback); // pass the callback along to each child
}
That way, if we see a tag that you don't like, we simply skip it and all its children, and your extractor will never be exposed to the text nodes inside such tags.
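For very deep trees, the same skip logic can also be written without recursion. This is a sketch of an alternative technique (an explicit stack), not the walker above. The name collectVisibleText is hypothetical, and plain objects stand in for DOM nodes so that it also runs outside a browser.

```javascript
const ELEMENT_NODE = 1; // Same numeric value as Node.ELEMENT_NODE.
const TEXT_NODE = 3;    // Same numeric value as Node.TEXT_NODE.
const IGNORED_TAGS = new Set(["STYLE", "SCRIPT", "NOSCRIPT", "IFRAME", "OBJECT"]);

// Hypothetical helper: concatenates text nodes in document order, skipping
// the subtrees of ignored elements. Pass an element (or a text node) as root.
function collectVisibleText(root) {
  let result = "";
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node.nodeType === TEXT_NODE) {
      result += node.nodeValue;
      continue;
    }
    if (node.nodeType !== ELEMENT_NODE) continue;              // comments etc.
    if (IGNORED_TAGS.has(node.tagName.toUpperCase())) continue; // skip subtree
    // Push children in reverse so that the first child is popped first.
    const children = Array.from(node.childNodes);
    for (let i = children.length - 1; i >= 0; i--) {
      stack.push(children[i]);
    }
  }
  return result;
}

// Stand-in for: <body><style>a { color: red; }</style><p>Hello</p></body>
const body = {
  nodeType: ELEMENT_NODE, tagName: "BODY", childNodes: [
    { nodeType: ELEMENT_NODE, tagName: "STYLE", childNodes: [
      { nodeType: TEXT_NODE, nodeValue: "a { color: red; }" }
    ] },
    { nodeType: ELEMENT_NODE, tagName: "P", childNodes: [
      { nodeType: TEXT_NODE, nodeValue: "Hello" }
    ] }
  ]
};

console.log(collectVisibleText(body)); // "Hello"
```

In a page, a call such as collectVisibleText(document.body) would be the equivalent starting point, since the sketch only descends into element nodes.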
