Best way to extract unformatted text from DOM preserving line breaks?


Let's say I have the following element TEXT in HTML:
<div id="TEXT">
<p>First <strong>Line</strong></p>
<p>Second <em>Line</em></p>
</div>
How should one extract the raw text from this element, without HTML tags, but preserving the line breaks?
I know about the following two options but neither of them seems to be perfect:
document.getElementById("TEXT").textContent
returns
First LineSecond Line
problem: ignores the line break that should be included between paragraphs
document.getElementById("TEXT").innerText
returns
First Line
Second Line
problem: is not part of W3C standard and is not guaranteed to work in all browsers

Here's a handy function for getting the text contents of any element. It works on all platforms and keeps the line breaks that exist as whitespace between tags in the source markup.
function text(e){
var t = "";
e = e.childNodes || e;
for(var i = 0;i<e.length;i++){
t+= e[i].nodeType !=1 ? e[i].nodeValue : text(e[i].childNodes);
}
return t;
}

You can check how jQuery does it. It uses sizzle js. Here is the function that you can use.
<div id="TEXT">
<p>First <strong>Line</strong></p>
<p>Second <em>Line</em></p>
</div>
<script>
var getText = function( elem ) {
var node,
ret = "",
i = 0,
nodeType = elem.nodeType;
if ( !nodeType ) {
// If no nodeType, this is expected to be an array
while ( (node = elem[i++]) ) {
// Do not traverse comment nodes
ret += getText( node );
}
} else if ( nodeType === 1 || nodeType === 9 || nodeType === 11 ) {
// Use textContent for elements
// innerText usage removed for consistency of new lines (jQuery #11153)
if ( typeof elem.textContent === "string" ) {
return elem.textContent;
} else {
// Traverse its children
for ( elem = elem.firstChild; elem; elem = elem.nextSibling ) {
ret += getText( elem );
}
}
} else if ( nodeType === 3 || nodeType === 4 ) {
return elem.nodeValue;
}
// Do not include comment or processing instruction nodes
return ret;
};
console.log(getText(document.getElementById('TEXT')));
</script>

Related

Is element.empty() equivalent to element.innerHTML = ""?

Like the title says. If it's not then what would be the same as .innerHTML = "" ?

It's nearly the same. If you look at the source for the method, you'll see that it's:
empty: function() {
var elem,
i = 0;
for ( ; ( elem = this[ i ] ) != null; i++ ) {
if ( elem.nodeType === 1 ) {
// Prevent memory leaks
jQuery.cleanData( getAll( elem, false ) );
// Remove any remaining nodes
elem.textContent = "";
}
}
return this;
},
And assigning the empty string to the .textContent of an element is the same as assigning the empty string to the .innerHTML of an element.
The only difference is that .empty calls .cleanData, which removes a number of jQuery-specific data/events associated with the element, if there happen to be any.
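For the vanilla JavaScript side of the question, removing the children one at a time is another long-standing option alongside assigning the empty string to .innerHTML or .textContent. A minimal sketch follows; the name emptyElement is hypothetical, and like the other native approaches it does not perform jQuery's cleanData step.

```javascript
// Remove every child node of `el` using only DOM Level 1 members.
function emptyElement(el) {
  while (el.firstChild) {
    el.removeChild(el.firstChild);
  }
  return el; // returned for chaining, similar in spirit to jQuery's .empty()
}

// Usage in a browser (assumes an element with id "TEXT" exists):
//   emptyElement(document.getElementById('TEXT'));
```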

Extract text from HTML while preserving block-level element newlines

Background
Most questions about extracting text from HTML (i.e., stripping the tags) use:
jQuery( htmlString ).text();
While this abstracts browser inconsistencies (such as innerText vs. textContent), the function call also ignores the semantic meaning of block-level elements (such as li).
Problem
Preserving newlines of block-level elements (i.e., the semantic intent) across various browsers entails no small effort, as Mike Wilcox describes.
A seemingly simpler solution would be to emulate pasting HTML content into a <textarea>, which strips HTML while preserving block-level element newlines. However, JavaScript-based inserts do not trigger the same HTML-to-text routines that browsers employ when users paste content into a <textarea>.
I also tried integrating Mike Wilcox's JavaScript code. The code works in Chromium, but not in Firefox.
Question
What is the simplest cross-browser way to extract text from HTML while preserving semantic newlines for block-level elements using jQuery (or vanilla JavaScript)?
Example
Consider:
Select and copy this entire question.
Open the textarea example page.
Paste the content into the textarea.
The textarea preserves the newlines for ordered lists, headings, preformatted text, and so forth. That is the result I would like to achieve.
To further clarify, given any HTML content, such as:
<h1>Header</h1>
<p>Paragraph</p>
<ul>
<li>First</li>
<li>Second</li>
</ul>
<dl>
<dt>Term</dt>
<dd>Definition</dd>
</dl>
<div>Div with <span>span</span>.<br />After the break.</div>
How would you produce:
Header
Paragraph
First
Second
Term
Definition
Div with span.
After the break.
Note: Neither indentation nor non-normalized whitespace are relevant.

Consider:
/**
* Returns the style for a node.
*
* @param n The node to check.
* @param p The property to retrieve (usually 'display').
* @link http://www.quirksmode.org/dom/getstyles.html
*/
this.getStyle = function( n, p ) {
return n.currentStyle ?
n.currentStyle[p] :
document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}
/**
* Converts HTML to text, preserving semantic newlines for block-level
* elements.
*
* @param node - The HTML node to perform text extraction.
*/
this.toText = function( node ) {
var result = '';
if( node.nodeType == document.TEXT_NODE ) {
// Replace repeated spaces, newlines, and tabs with a single space.
result = node.nodeValue.replace( /\s+/g, ' ' );
}
else {
for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
result += _this.toText( node.childNodes[i] );
}
var d = _this.getStyle( node, 'display' );
if( d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
node.tagName == 'BR' || node.tagName == 'HR' ) {
result += '\n';
}
}
return result;
}
http://jsfiddle.net/3mzrV/2/
That is to say, with an exception or two, iterate through each node and print its contents, letting the browser's computed style tell you when to insert newlines.

This seems to be (nearly) doing what you want:
function getText($node) {
return $node.contents().map(function () {
if (this.nodeName === 'BR') {
return '\n';
} else if (this.nodeType === 3) {
return this.nodeValue;
} else {
return getText($(this));
}
}).get().join('');
}
It just recursively concatenates the values of all text nodes and replaces <br> elements with line breaks.
But there are no semantics in this; it relies entirely on the original HTML formatting (the leading and trailing white space seems to come from how jsFiddle embeds the HTML, but you can easily trim it). For example, notice how it indents the definition term.
If you really want to do this on a semantic level, you need a list of block level elements, recursively iterate over the elements and indent them accordingly. You treat different block elements differently with respect to indentation and line breaks around them. This should not be too difficult.
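The tag-list approach described above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the names BLOCK_TAGS, collectText and textWithBlockBreaks are assumptions, the tag list is deliberately incomplete, and indentation is ignored. It relies only on nodeType, nodeValue, tagName and childNodes, so it does not depend on computed styles.

```javascript
// Hypothetical, incomplete list of tags treated as block-level.
var BLOCK_TAGS = { P: 1, DIV: 1, H1: 1, H2: 1, H3: 1, LI: 1, DT: 1, DD: 1, PRE: 1, TR: 1 };

function collectText(node) {
  if (node.nodeType === 3) {
    // Text node: collapse runs of whitespace into a single space.
    return node.nodeValue.replace(/\s+/g, ' ');
  }
  if (node.nodeType !== 1) {
    return ''; // skip comments, processing instructions, etc.
  }
  if (node.tagName === 'BR') {
    return '\n';
  }
  var out = '';
  for (var i = 0; i < node.childNodes.length; i++) {
    out += collectText(node.childNodes[i]);
  }
  // Surround block-level content with newlines; duplicates are cleaned up later.
  return BLOCK_TAGS[node.tagName] ? '\n' + out + '\n' : out;
}

function textWithBlockBreaks(root) {
  return collectText(root)
    .replace(/ *\n */g, '\n')   // drop spaces around line breaks
    .replace(/\n+/g, '\n')      // collapse consecutive line breaks
    .replace(/^\s+|\s+$/g, ''); // trim both ends
}

// Usage in a browser (assumes an element with id "TEXT" exists):
//   console.log(textWithBlockBreaks(document.getElementById('TEXT')));
```

Collapsing consecutive line breaks at the end keeps the traversal simple, at the cost of losing intentional blank lines (for example, two <br> elements in a row).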

based on https://stackoverflow.com/a/20384452/3338098
and fixed to support TEXT1<div>TEXT2</div>=>TEXT1\nTEXT2 and allow non-DOM nodes
/**
* Returns the style for a node.
*
* @param n The node to check.
* @param p The property to retrieve (usually 'display').
* @link http://www.quirksmode.org/dom/getstyles.html
*/
function getNodeStyle( n, p ) {
return n.currentStyle ?
n.currentStyle[p] :
document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}
// If the node is not actually in the DOM, this won't take into account cases like <div style="display: inline;">text</div>.
// That is sufficient for simple things like `contenteditable`, but it will not work for arbitrary HTML.
function isNodeBlock(node) {
if (node.nodeType == document.TEXT_NODE) {return false;}
var d = getNodeStyle( node, 'display' );//this is irrelevant if the node isn't currently in the current DOM.
if (d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
node.tagName == 'BR' || node.tagName == 'HR' ||
node.tagName == 'DIV' // div,p,... add as needed to support non-DOM nodes
) {
return true;
}
return false;
}
/**
* Converts HTML to text, preserving semantic newlines for block-level
* elements.
*
* @param node - The HTML node to perform text extraction.
*/
function htmlToText( htmlOrNode, isNode ) {
var node = htmlOrNode;
if (!isNode) {node = jQuery("<span>"+htmlOrNode+"</span>")[0];}
//TODO: inject "unsafe" HTML into current DOM while guaranteeing that it won't
// change the visible DOM so that `isNodeBlock` will work reliably
var result = '';
if( node.nodeType == document.TEXT_NODE ) {
// Replace repeated spaces, newlines, and tabs with a single space.
result = node.nodeValue.replace( /\s+/g, ' ' );
} else {
for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
result += htmlToText( node.childNodes[i], true );
if (i < j-1) {
if (isNodeBlock(node.childNodes[i])) {
result += '\n';
} else if (isNodeBlock(node.childNodes[i+1]) &&
node.childNodes[i+1].tagName != 'BR' &&
node.childNodes[i+1].tagName != 'HR') {
result += '\n';
}
}
}
}
return result;
}
the main change was
if (i < j-1) {
if (isNodeBlock(node.childNodes[i])) {
result += '\n';
} else if (isNodeBlock(node.childNodes[i+1]) &&
node.childNodes[i+1].tagName != 'BR' &&
node.childNodes[i+1].tagName != 'HR') {
result += '\n';
}
}
to check neighboring blocks to determine the appropriateness of adding a newline.

I would like to suggest a little edit from the code of svidgen:
function getText(n, isInnerNode) {
var rv = '';
if (n.nodeType == 3) {
rv = n.nodeValue;
} else {
var partial = "";
var d = getComputedStyle(n).getPropertyValue('display');
if (isInnerNode && (d.match(/^block/) || d.match(/list/) || n.tagName == 'BR')) {
partial += "\n";
}
for (var i = 0; i < n.childNodes.length; i++) {
partial += getText(n.childNodes[i], true);
}
rv = partial;
}
return rv;
};
I just moved the line break before the for loop, so that a newline precedes each block, and added a flag to avoid the newline for the root element.
The code should be invoked like this:
getText(document.getElementById("divElement"))

Use element.innerText
This does not return the extra nodes added by contenteditable elements.
If you use element.innerHTML the text will contain additional markup, but innerText will return what you see on the element's contents.
<div id="txt" contenteditable="true"></div>
<script>
var txt=document.getElementById("txt");
var withMarkup=txt.innerHTML;
var textOnly=txt.innerText;
console.log(withMarkup);
console.log(textOnly);
</script>

Native javascript equivalent of jQuery :contains() selector

I am writing a UserScript that will remove elements from a page that contain a certain string.
If I understand jQuery's contains() function correctly, it seems like the correct tool for the job.
Unfortunately, since the page I'll be running the UserScript on does not use jQuery, I can't use :contains(). Any of you lovely people know what the native way to do this is?
http://codepen.io/coulbourne/pen/olerh

This should do in modern browsers:
function contains(selector, text) {
var elements = document.querySelectorAll(selector);
return [].filter.call(elements, function(element){
return RegExp(text).test(element.textContent);
});
}
Then use it like so:
contains('p', 'world'); // find "p" that contain "world"
contains('p', /^world/); // find "p" that start with "world"
contains('p', /world$/i); // find "p" that end with "world", case-insensitive
...
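One caveat with the function above: RegExp(text) interprets a string argument as a pattern, so searching for something like "1.5" or "(beta)" may match more than intended or throw. A small sketch of a literal-match variant follows. The names escapeRegExp and filterByText are hypothetical, and it takes an array-like of elements rather than a selector.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(str) {
  return String(str).replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Keep the elements whose textContent matches `text`.
// Strings are matched literally; RegExp objects are used as given
// (avoid the g flag, since test() is stateful with it).
function filterByText(elements, text) {
  var pattern = text instanceof RegExp ? text : new RegExp(escapeRegExp(text));
  return Array.prototype.filter.call(elements, function (element) {
    return pattern.test(element.textContent);
  });
}

// Usage in a browser:
//   filterByText(document.querySelectorAll('p'), 'world');
//   filterByText(document.querySelectorAll('p'), /^world/);
```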

Super modern one-line approach with optional chaining operator
[...document.querySelectorAll('*')].filter(element => element.childNodes?.[0]?.nodeValue?.match('❤'));
And better way is to search in all child nodes
[...document.querySelectorAll("*")].filter(e => e.childNodes && [...e.childNodes].find(n => n.nodeValue?.match("❤")))

If you want to implement the contains method exactly as jQuery does, this is what you need:
function contains(elem, text) {
return (elem.textContent || elem.innerText || getText(elem)).indexOf(text) > -1;
}
function getText(elem) {
var node,
ret = "",
i = 0,
nodeType = elem.nodeType;
if ( !nodeType ) {
// If no nodeType, this is expected to be an array
for ( ; (node = elem[i]); i++ ) {
// Do not traverse comment nodes
ret += getText( node );
}
} else if ( nodeType === 1 || nodeType === 9 || nodeType === 11 ) {
// Use textContent for elements
// innerText usage removed for consistency of new lines (see #11153)
if ( typeof elem.textContent === "string" ) {
return elem.textContent;
} else {
// Traverse its children
for ( elem = elem.firstChild; elem; elem = elem.nextSibling ) {
ret += getText( elem );
}
}
} else if ( nodeType === 3 || nodeType === 4 ) {
return elem.nodeValue;
}
// Do not include comment or processing instruction nodes
return ret;
};
SOURCE: Sizzle.js

The original question is from 2013
Here is an even older solution, and the fastest solution because the main workload is done by the Browser Engine NOT the JavaScript Engine
The TreeWalker API has been around for ages, IE9 was the last browser to implement it... in 2011
All those 'modern' and 'super-modern' querySelectorAll("*") need to process all nodes and do string comparisons on every node.
The TreeWalker API gives you only the #text Nodes, and then you do what you want with them.
You could also use the NodeIterator API, but TreeWalker is faster
function textNodesContaining(txt, root = document.body) {
let nodes = [],
node,
tree = document.createTreeWalker(
root,
4, // NodeFilter.SHOW_TEXT
{
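// The filter callback must be named acceptNode
acceptNode: node => RegExp(txt).test(node.data) ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_REJECT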
});
while (node = tree.nextNode()) { // only return accepted nodes
nodes.push(node);
}
return nodes;
}
Usage
textNodesContaining(/Overflow/);
textNodesContaining("Overflow").map(x=>console.log(x.parentNode.nodeName,x));
// get "Overflow" IN A parent
textNodesContaining("Overflow")
.filter(x=>x.parentNode.nodeName == 'A')
.map(x=>console.log(x));
// get "Overflow" IN A ancestor
textNodesContaining("Overflow")
.filter(x=>x.parentNode.closest('A'))
.map(x=>console.log(x.parentNode.closest('A')));

This is the modern approach
function get_nodes_containing_text(selector, text) {
const elements = [...document.querySelectorAll(selector)];
return elements.filter(
(element) =>
element.childNodes[0]
&& element.childNodes[0].nodeValue
&& RegExp(text, "u").test(element.childNodes[0].nodeValue.trim())
);
}

Well, jQuery comes equipped with a DOM traversing engine that operates a lot better than the one I'm about to show you, but this will do the trick.
var items = document.getElementsByTagName("*");
for (var i = 0; i < items.length; i++) {
if (items[i].innerHTML.indexOf("word") != -1) {
// Do your magic
}
}
Wrap it in a function if you will, but I would strongly recommend using jQuery's implementation.

getElementById doesn't work on a node

In this simple script I get the error "obj.parentNode.getElementById is not a function", and I have no idea what is wrong.
<script type="text/javascript">
function dosomething (obj) {
sibling=obj.parentNode.getElementById("2");
alert(sibling.getAttribute("attr"));
}
</script>
<body>
<div>
<a id="1" onclick="dosomething(this)">1</a>
<a id="2" attr="some attribute">2</a>
</div>
</body>

.getElementById() is on document, like this:
document.getElementById("2");
Since IDs are supposed to be unique, there's no need for a method that finds an element by ID relative to any other element (in this case, inside that parent). Also, IDs shouldn't start with a number in HTML4; a numeric ID is valid in HTML5.

replace .getElementById(id) with .querySelector('#'+id);
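Note that '#' + id is only a valid selector when the ID is a valid CSS identifier. The ID "2" from the question starts with a digit, so querySelector('#2') throws a SyntaxError. Browsers offer CSS.escape() for this; an attribute selector is another option, sketched below. The helper name idSelector is hypothetical.

```javascript
// Build an attribute selector that matches an element by ID,
// even when the ID starts with a digit.
function idSelector(id) {
  // Escape backslashes and double quotes so the value stays inside the string.
  return '[id="' + String(id).replace(/["\\]/g, '\\$&') + '"]';
}

// Usage in a browser (assumes `obj` is the clicked element from the question):
//   var sibling = obj.parentNode.querySelector(idSelector('2'));
```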

document.getElementById() won't work if the node was created on the fly and not yet attached into the main document dom.
For example with Ajax, not all nodes are attached at any given point. In this case, you'd either need to explicitly track a handle to each node (generally best for performance), or use something like this to look the objects back up:
function domGet( id , rootNode ) {
if ( !id ) return null;
if ( rootNode === undefined ) {
// rel to doc base
var o = document.getElementById( id );
return o;
} else {
// rel to current node
var nodes = [];
nodes.push(rootNode);
while ( nodes && nodes.length > 0 ) {
var children = [];
for ( var i = 0; i<nodes.length; i++ ) {
var node = nodes[i];
if ( node && node['id'] !== undefined ) {
if ( node.id == id ) {
return node; // found!
}
}
// else keep searching
var childNodes = node.childNodes;
if ( childNodes && childNodes.length > 0 ) {
for ( var j = 0 ; j < childNodes.length; j++ ) {
children.push( childNodes[j] );
}
}
}
nodes = children;
}
// nothing found
return null;
}
}
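The search relative to a root node can also be written as a short recursive depth-first walk. This is a sketch under the same assumptions as the function above (nodes expose id and childNodes); the name findById is hypothetical.

```javascript
// Depth-first search for the root itself or a descendant with the given id.
function findById(root, id) {
  if (!root || !id) return null;
  if (root.id === id) return root;
  var children = root.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    var found = findById(children[i], id);
    if (found) return found;
  }
  return null; // nothing found
}

// Usage in a browser with a detached subtree:
//   var wrapper = document.createElement('div');
//   wrapper.innerHTML = '<a id="2" attr="some attribute">2</a>';
//   findById(wrapper, '2');
```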

How to check if element has any children in Javascript?

Simple question: I have an element which I am grabbing via .getElementById(). How do I check if it has any children?

A couple of ways:
if (element.firstChild) {
// It has at least one
}
or the hasChildNodes() function:
if (element.hasChildNodes()) {
// It has at least one
}
or the length property of childNodes:
if (element.childNodes.length > 0) { // Or just `if (element.childNodes.length)`
// It has at least one
}
If you only want to know about child elements (as opposed to text nodes, attribute nodes, etc.) on all modern browsers (and IE8 — in fact, even IE6) you can do this: (thank you Florian!)
if (element.children.length > 0) { // Or just `if (element.children.length)`
// It has at least one element as a child
}
That relies on the children property, which wasn't defined in DOM1, DOM2, or DOM3, but which has near-universal support. (It works in IE6 and up and Chrome, Firefox, and Opera at least as far back as November 2012, when this was originally written.) If supporting older mobile devices, be sure to check for support.
If you don't need IE8 and earlier support, you can also do this:
if (element.firstElementChild) {
// It has at least one element as a child
}
That relies on firstElementChild. Like children, it wasn't defined in DOM1-3 either, but unlike children it wasn't added to IE until IE9. The same applies to childElementCount:
if (element.childElementCount !== 0) {
// It has at least one element as a child
}
If you want to stick to something defined in DOM1 (maybe you have to support really obscure browsers), you have to do more work:
var hasChildElements, child;
hasChildElements = false;
for (child = element.firstChild; child; child = child.nextSibling) {
if (child.nodeType == 1) { // 1 == Element
hasChildElements = true;
break;
}
}
All of that is part of DOM1, and nearly universally supported.
It would be easy to wrap this up in a function, e.g.:
function hasChildElement(elm) {
var child, rv;
if (elm.children) {
// Supports `children`
rv = elm.children.length !== 0;
} else {
// The hard way...
rv = false;
for (child = elm.firstChild; !rv && child; child = child.nextSibling) {
if (child.nodeType == 1) { // 1 == Element
rv = true;
}
}
}
return rv;
}

As slashnick & bobince mention, hasChildNodes() will return true for whitespace (text nodes). However, I didn't want this behaviour, and this worked for me :)
element.getElementsByTagName('*').length > 0
Edit: for the same functionality, this is a better solution:
element.children.length > 0
children[] is a subset of childNodes[], containing elements only.

You could also do the following:
if (element.innerHTML.trim() !== '') {
// It has at least one
}
This uses the trim() method to treat empty elements which have only whitespaces (in which case hasChildNodes returns true) as being empty.
NB: The above method doesn't filter out comments (so a comment would be classified as a child).
To filter out comments as well, we can make use of the read-only Node.nodeType property, where Node.COMMENT_NODE (a Comment node, such as <!-- … -->) has the constant value 8:
if (element.firstChild?.nodeType !== 8 && element.innerHTML.trim() !== '') {
// It has at least one
}
let divs = document.querySelectorAll('div');
for (const element of divs) {
if (element.firstChild?.nodeType !== 8 && element.innerHTML.trim() !== '') {
console.log('has children')
} else { console.log('no children') }
}
<div><span>An element</span></div>
<div>some text</div>
<div> </div> <!-- whitespace -->
<div><!-- A comment --></div>
<div></div>
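The check above only inspects the first child, so an element whose first child is a comment followed by real content is reported as having no children. A sketch that walks all child nodes instead, ignoring comments and whitespace-only text, might look like this (the name hasMeaningfulChild is hypothetical):

```javascript
// True if `element` has at least one child that is neither a comment
// nor a text node consisting only of whitespace.
function hasMeaningfulChild(element) {
  for (var child = element.firstChild; child; child = child.nextSibling) {
    if (child.nodeType === 8) continue; // comment node
    if (child.nodeType === 3 && child.nodeValue.trim() === '') continue; // whitespace-only text
    return true;
  }
  return false;
}

// Usage in a browser:
//   document.querySelectorAll('div').forEach(function (div) {
//     console.log(hasMeaningfulChild(div) ? 'has children' : 'no children');
//   });
```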

You can check whether the element has child nodes with element.hasChildNodes(). Beware: in Mozilla this will return true if there is whitespace after the tag, so you will need to verify the node type.
https://developer.mozilla.org/En/DOM/Node.hasChildNodes

Try the childElementCount property:
if ( element.childElementCount !== 0 ){
alert('i have children');
} else {
alert('no kids here');
}

Late but document fragment could be a node:
function hasChild(el){
var child = el && el.firstChild;
while (child) {
if (child.nodeType === 1 || child.nodeType === 11) {
return true;
}
child = child.nextSibling;
}
return false;
}
// or
function hasChild(el){
for (var i = 0; el && el.childNodes[i]; i++) {
if (el.childNodes[i].nodeType === 1 || el.childNodes[i].nodeType === 11) {
return true;
}
}
return false;
}
See:
https://github.com/k-gun/so/blob/master/so.dom.js#L42
https://github.com/k-gun/so/blob/master/so.dom.js#L741

A reusable isEmpty( <selector> ) function.
You can also run it toward a collection of elements (see example)
const isEmpty = sel =>
![... document.querySelectorAll(sel)].some(el => el.innerHTML.trim() !== "");
console.log(
isEmpty("#one"), // false
isEmpty("#two"), // true
isEmpty(".foo"), // false
isEmpty(".bar") // true
);
<div id="one">
foo
</div>
<div id="two">
</div>
<div class="foo"></div>
<div class="foo"><p>foo</p></div>
<div class="foo"></div>
<div class="bar"></div>
<div class="bar"></div>
<div class="bar"></div>
.some() stops iterating (returning true) as soon as one element has any kind of content besides spaces or newlines, so isEmpty returns false in that case.

<script type="text/javascript">
function uwtPBSTree_NodeChecked(treeId, nodeId, bChecked)
{
//debugger;
var selectedNode = igtree_getNodeById(nodeId);
var ParentNodes = selectedNode.getChildNodes();
var length = ParentNodes.length;
if (bChecked)
{
/* if (length != 0) {
for (i = 0; i < length; i++) {
ParentNodes[i].setChecked(true);
}
}*/
}
else
{
if (length != 0)
{
for (i = 0; i < length; i++)
{
ParentNodes[i].setChecked(false);
}
}
}
}
</script>
<ignav:UltraWebTree ID="uwtPBSTree" runat="server"..........>
<ClientSideEvents NodeChecked="uwtPBSTree_NodeChecked"></ClientSideEvents>
</ignav:UltraWebTree>
