Convert textNode content to a string - javascript

I'm having a problem with a text node that I can't convert to a string.
I'm trying to scrape a site and extract certain information from it, and when I use an XPath to find the text I'm after, I get a text node back.
When I look in Chrome's developer tools, I can see that the text node itself contains the text I'm after, but how do I convert the text node to plain text?
Here is the line of code I use:
abstracts = ZU.xpath(doc, '//*[@id="abstract"]/div/div/par/text()');
I have tried things like .innerHTML, toString and textContent, but nothing has worked so far.

I usually use Text.wholeText when I want to see the content string of a text node. A text node is an object, not the string itself, so toString or innerHTML will not give you its text.
Example: from https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText
The Text.wholeText read-only property returns the full text of all Text nodes logically adjacent to the node. The text is concatenated in document order. This allows to specify any text node and obtain all adjacent text as a single string.
Syntax
str = textnode.wholeText;
Notes and example:
Suppose you have the following simple paragraph within your webpage (with some whitespace added to aid formatting throughout the code samples here), whose DOM node is stored in the variable para:
<p>Thru-hiking is great! <strong>No insipid election coverage!</strong>
However, <a href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You decide you don’t like the middle sentence, so you remove it:
para.removeChild(para.childNodes[1]);
Later, you decide to rephrase things to, “Thru-hiking is great, but casting a ballot is tricky.” while preserving the hyperlink. So you try this:
para.firstChild.data = "Thru-hiking is great, but ";
All set, right? Wrong! What happened was you removed the strong element, but the removed sentence’s element separated two text nodes. One for the first sentence, and one for the first word of the last. Instead, you now effectively have this:
<p>Thru-hiking is great, but However, <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You’d really prefer to treat all those adjacent text nodes as a single one. That’s where wholeText comes in: if you have multiple adjacent text nodes, you can access the contents of all of them using wholeText. Let’s pretend you never made that last mistake. In that case, we have:
assert(para.firstChild.wholeText == "Thru-hiking is great! However, ");
wholeText is just a property of text nodes that returns the string of data making up all the adjacent (i.e. not separated by an element boundary) text nodes combined.
Now let’s return to our original problem. What we want is to be able to replace the whole text with new text. That’s where replaceWholeText() comes in:
para.firstChild.replaceWholeText("Thru-hiking is great, but ");
We’re removing every adjacent text node (all the ones that constituted the whole text) but the one on which replaceWholeText() is called, and we’re changing the remaining one to the new text. What we have now is this:
<p>Thru-hiking is great, but <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
Some uses of the whole-text functionality may be better served by using Node.textContent, or the longstanding Element.innerHTML; that’s fine and probably clearer in most circumstances. If you have to work with mixed content within an element, as seen here, wholeText and replaceWholeText() may be useful.
More info: https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText
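Applied to the question's case, a minimal sketch might look like the following. It assumes that ZU.xpath returns an array-like list of DOM nodes (the question implies this but does not show it). nodesToText is a hypothetical helper name, not part of any library. It reads textContent, which both text nodes and elements expose, and falls back to nodeValue. Plain objects stand in for the real nodes so the snippet runs outside a browser.

```javascript
// Hypothetical helper: collect the string content of a list of DOM nodes.
function nodesToText(nodes) {
  return Array.from(nodes, (node) => {
    // textContent is a plain string on text nodes and elements;
    // nodeValue is a fallback for anything that only exposes that.
    const text = node.textContent ?? node.nodeValue ?? "";
    return text.trim();
  })
    .filter((text) => text.length > 0) // drop whitespace-only nodes
    .join(" ");
}

// Stand-ins for the text nodes an XPath ending in text() would return.
const fakeNodes = [
  { nodeType: 3, textContent: "  First sentence. " },
  { nodeType: 3, textContent: "\n" },
  { nodeType: 3, textContent: "Second sentence." },
];

console.log(nodesToText(fakeNodes)); // prints "First sentence. Second sentence."
```

In the scraper this would presumably be called on the result of the ZU.xpath line from the question.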

Related

Does `elt.outerHTML` really capture the whole HTML string representation accurately?

This question stems from "How to get html string representation for all types of Dom nodes?"
Question
Does elt.outerHTML really capture the whole HTML string representation accurately?
I think I know how to get the HTML string representation for Node.TEXT_NODE (textContent) and for Node.ELEMENT_NODE (outerHTML),
but does this really capture all of the HTML string of such a node accurately?
Here is what I mean by "all". Consider the following cases:
Is there no special case where I call elt.outerHTML and get an HTML string that has less data (less HTML code) than the element actually has?
For example, would there be a special case where
you have <div>a special node contains special <&&xx some hypothetic special node syntax, which this may not be captured? xx&&> data</div>
and, if I call elt.outerHTML,
I only get back <div>a special node contains special data</div>?
For example:
if I call node.textContent on a Node.ELEMENT_NODE,
it returns only the text content.
For example:
if I call node.textContent on a Node.COMMENT_NODE,
it returns only the text content, without the <!-- -->.
What about node types other than Node.ELEMENT_NODE?
Clarity
What exactly am I trying to do?
There is no specific purpose.
-- It is just a common operation: you want to get the exact HTML string representation of a node.
-- A programmer needs the exact, accurate input data to do their job.
And so I ask:
when I access a property like .outerHTML or .textContent, do I get the exact, accurate input data I want? In what cases won't I?
It is a simple question (though I know the answer or reason may be complex),
and I provided the examples above to show what I mean.
If I really have to be more specific (though it is still just a general operation), what I am trying to do is:
I am given an HTML file (or a DOM element);
I can get all the nodes in that HTML file (or element);
I want to accurately get the HTML string of those nodes;
with those HTML strings, I do some business logic: I add additional info to them, and there is no deletion of any original data or HTML string;
then I put the modified HTML back into that node.
The node should now contain no less than the original data.

Javascript: how to convert an element into an HTML-evaluated string?

Here's a simplified example of what I'd like to do:
var footnote = somewhere.innerHTML // This is <q>the note</q>.
var result = ???(footnote)
target.setAttribute("title", result) // This is "the note".
I've tried various methods and functions for the "???", but end up with either the raw tags displayed in the title, or with plain text and no quotation marks.
Other than processing all the inner tags myself, is there a way to convert an element into a string that contains how it would appear when HTML expanded?
Clarification:
I thought it was obvious from the "I have" and "I want" values indicated in the code comments, but this is what I want to do:
I have an element (say a <p> if you need a specific type)
that has content "This is <q>the note</q>."
I want something that will convert it into a string suitable for use in a title="..." attribute in some other element.
Displayable internal tags (in this specific example <q>) need to be HTML-interpreted so that they display as actual quotation marks, ideally handling nested quotations.
innerHTML conversion to string leaves the raw tags in place.
innerText conversion to string ignores the tags and produces no quotation marks.
Is there some other way of doing the HTML interpretation other than by writing my own function to process it?
When you add the q tag you're actually adding a text node (this is what you get with textContent, innerText, etc.) that has two CSS pseudo-elements around it: the opening and closing quotation marks.
Neither pseudo-elements nor pseudo-classes appear in the document source or document tree. They basically don't actually exist in the DOM and are therefore not selectable/won't show up in any values of the element properties.
In short, using <q></q> is more semantic mark-up, but if you're looking to represent those quotations outside the scope of the view you may want to use the traditional "
example:
let p = document.querySelector("p"), div = document.querySelector("div");
div.title = p.textContent;
console.log(div.title);
<p>"Example Text"</p>
<div></div>
Additionally, though I will say that I don't recommend it, if you really wanted to keep what you have and you're not too concerned with optimization you could simply use a replace:
let p = document.querySelector("p"), div = document.querySelector("div");
div.title = p.innerHTML.replace(/<q>|<\/q>/gmi, '"');
console.log(div.title);
<p><q>Example Text</q></p>
<div></div>
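If nested quotations matter, the replace approach above could be extended along these lines. This is only a sketch under several assumptions: flattenQuotes is a hypothetical name, the choice of curly double quotes for the outer level and curly single quotes for the inner level is mine, every tag other than q is simply dropped, and character entities such as &amp; are not decoded. It works on a markup string (for example the value of innerHTML), so it inherits the usual fragility of processing HTML with a regex.

```javascript
// Hypothetical helper: turn <q>...</q> markup into literal quotation marks,
// alternating between double and single quotes by nesting depth.
function flattenQuotes(html) {
  const open = ["\u201C", "\u2018"];  // opening double, opening single
  const close = ["\u201D", "\u2019"]; // closing double, closing single
  let depth = 0;
  return html.replace(/<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi, (tag, slash, name) => {
    if (name.toLowerCase() !== "q") {
      return ""; // assumption: any other tag is just removed
    }
    if (slash) {
      depth = Math.max(0, depth - 1);
      return close[depth % 2];
    }
    return open[depth++ % 2];
  });
}

console.log(flattenQuotes("This is <q>the note</q>."));
console.log(flattenQuotes("<q>She said <q>hello</q> twice</q>"));
```

The result could then be assigned to the title attribute with setAttribute, as in the question.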

How can I separately retrieve the HTML that's before and after a child element inside a parent element?

We're writing a web app that relies on Javascript/jQuery. It involves users filling out individual words in a large block of text, kind of like Mad Libs. We've created a sort of HTML format that we use to write the large block of text, which we then manipulate with jQuery as the user fills it out.
Part of a block of text might look like this:
<span class="fillmeout">This is a test of the <span>NOUN</span> Broadcast System.</span>
Given that markup, I need to separately retrieve and manipulate the text before and after the inner <span>; we're calling those the "prefix" and "suffix".
I know that you can't parse HTML with simple string manipulation, but I tried anyway; I tried using split() on the <span> and </span> tags. It seemed simple enough. Unfortunately, Internet Explorer casts all HTML tags to uppercase, so that technique fails. I could write a special case, but the error has taught me to do this the right way.
I know I could simply use extra HTML tags to manually denote the prefix and suffix, but that seems ugly and redundant; I'd like to keep our markup format as lean and readable and writable as possible.
I've looked through the jQuery docs, and can't find a function that does exactly what I need. There are all sorts of functions to add stuff before and after and around and inside elements, but none that I can find to retrieve what's already there. I could remove the inner <span>, but then I don't know how I can tell what came before the deleted element apart from what came after it.
Is there a "right" way to do what I'm trying to do?
With simple string manipulations you can also use Regex.
That should solve your problem.
var array = $('.fillmeout').html().split(/<\/?span>/i);
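To make the result of that split concrete, here is a small sketch on a plain string. It assumes the input is the inner HTML of the .fillmeout element, with exactly one inner span that has no attributes. splitAroundSpan is a hypothetical name. The i flag is what copes with the upper-case tags mentioned in the question.

```javascript
// Hypothetical helper: split inner HTML around a single attribute-free <span>.
function splitAroundSpan(html) {
  // The regex matches both the opening and the closing tag, in any letter case,
  // so the array has three items: prefix, span content, suffix.
  const [prefix, word, suffix] = html.split(/<\/?span>/i);
  return { prefix, word, suffix };
}

// Upper-case tags, as old Internet Explorer versions would serialize them.
const parts = splitAroundSpan("This is a test of the <SPAN>NOUN</SPAN> Broadcast System.");
console.log(parts.prefix); // prints "This is a test of the "
console.log(parts.word);   // prints "NOUN"
console.log(parts.suffix); // prints " Broadcast System."
```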
Use your jQuery API! $('.fillmeout').children() and then you can manipulate that element as required.
http://api.jquery.com/children/
For completeness, I thought I should point out that the cleanest answer is to put the prefix and suffix text each in its own <span> like this, and then you can use jQuery selectors and methods to directly access the desired text:
<span class="fillmeout">
<span class="prefix">This is a test of the </span>
<span>NOUN</span>
<span class="suffix"> Broadcast System.</span>
</span>
Then, the code would be as simple as:
var fillme = $(".fillmeout").eq(0);
var prefix = fillme.find(".prefix").text();
var suffix = fillme.find(".suffix").text();
FYI, I would not call this level of simplicity "ugly and redundant" as you theorized. You're using HTML markup to delineate the text into separate elements that you want to separately access. That's just smart, not redundant.
By way of analogy, imagine you have toys of three separate colors (red, white and blue) and they are initially organized by color and you know that sometime in the future you are going to need to have them separated by color again. You also have three boxes to store them in. You can either put them all in one box now and manually sort them out by color again later or you can just take the already separated colors and put them each into their own box so there's no separation work to do later. Which is easier? Which is smarter?
HTML elements are like the boxes. They are containers for your text. If you want the text separated out in the future, you might as well put each piece of text into its own named container so it's easy to access just that piece of text in the future.
Several of these answers almost got me what I needed, but in the end I found a function not mentioned here: .contents(). It returns an array of all child nodes, including text nodes, that I can then iterate over (recursively if needed) to find what I need.
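The iteration described here might look roughly like the sketch below, written against the plain childNodes list rather than jQuery. describeChildren and the "text"/"blank" labels are hypothetical names chosen for illustration, and plain objects stand in for real DOM nodes so the snippet runs on its own. Because it keeps every child in order, it also covers blocks with more than one blank.

```javascript
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE

// Hypothetical helper: list the direct children of a node as ordered segments.
function describeChildren(parent) {
  const segments = [];
  for (const child of Array.from(parent.childNodes)) {
    if (child.nodeType === TEXT_NODE) {
      segments.push({ kind: "text", value: child.nodeValue });
    } else if (child.nodeType === ELEMENT_NODE) {
      segments.push({ kind: "blank", value: child.textContent });
    }
    // other node types (comments, for instance) are skipped
  }
  return segments;
}

// Stand-in for the .fillmeout span from the question.
const fakeParent = {
  childNodes: [
    { nodeType: 3, nodeValue: "This is a test of the " },
    { nodeType: 1, tagName: "SPAN", textContent: "NOUN" },
    { nodeType: 3, nodeValue: " Broadcast System." },
  ],
};

console.log(describeChildren(fakeParent).map((segment) => segment.kind).join(","));
// prints "text,blank,text"
```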
I'm not sure if this is the 'right' way either, but you could replace the SPANs with an element you could consistently split the string on:
jQuery('.fillmeout span').replaceWith('|');
http://api.jquery.com/replaceWith/
http://jsfiddle.net/mdarnell/P24se/
You could use
$('.fillmeout span').get(0).previousSibling.textContent
$('.fillmeout span').get(0).nextSibling.textContent
This works in IE9, but sadly not in IE versions earlier than 9.
Based on your example, you could use your target as a delimiter to split the sentence.
var str = $('.fillmeout').html();
str = str.split('<span>NOUN</span>');
This would return an array of ["This is a test of the ", " Broadcast System."]. Here's a jsFiddle example.
You could just use the nextSibling and previousSibling native JavaScript (coupled with jQuery selectors):
$('.fillmeout span').each(
    function(){
        var prefix = this.previousSibling.nodeValue,
            suffix = this.nextSibling.nodeValue;
    });
JS Fiddle proof of concept.
References:
each().
node.nextSibling.
node.previousSibling.
If you want to use the DOM instead of parsing the HTML yourself and you can't put the desired text in its own elements, then you will need to look through the DOM for text nodes and find the text nodes before and after the span tag.
jQuery isn't a whole lot of help when dealing with text nodes instead of element nodes so the work is mostly done in plain javascript like this:
$(".fillmeout").each(function() {
    var node = this.firstChild, prefix = "", suffix = "", foundSpan = false;
    while (node) {
        if (node.nodeType == 3) {
            // if text node
            if (!foundSpan) {
                prefix += node.nodeValue;
            } else {
                suffix += node.nodeValue;
            }
        } else if (node.nodeType == 1 && node.tagName == "SPAN") {
            // if element and span tag
            foundSpan = true;
        }
        node = node.nextSibling;
    }
    // here prefix and suffix are the text before and after the first
    // <span> tag in the HTML
    // You can do with them what you want here
});
Note: This code does not assume that all text before the span is located in one text node and one text node only. It might be, but it also might not be so it collates all the text nodes together that are before and after the span tag. The code would be simpler if you could just reference one text node on each side, but it isn't 100% certain that that is a safe assumption.
This code also handles the case where there is no text before or after the span.
You can see it work here: http://jsfiddle.net/jfriend00/P9YQ6/

Javascript .replace command replace page text?

Can the JavaScript command .replace replace text in any webpage? I want to create a Chrome extension that replaces specific words in any webpage to say something else (example cake instead of pie).
The .replace method is a string operation, so it's not immediately simple to run the operation on HTML documents, which are composed of DOM Node objects.
Use TreeWalker API
The best way to go through every node in a DOM and replace text in it is to use the document.createTreeWalker method to create a TreeWalker object. This is a practice that is used in a number of Chrome extensions!
// create a TreeWalker of all text nodes
var allTextNodes = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT),
    // some temp references for performance
    tmptxt,
    tmpnode,
    // compile the RE and cache the replace string, for performance
    cakeRE = /cake/g,
    replaceValue = "pie";
// iterate through all text nodes
while (allTextNodes.nextNode()) {
    tmpnode = allTextNodes.currentNode;
    tmptxt = tmpnode.nodeValue;
    tmpnode.nodeValue = tmptxt.replace(cakeRE, replaceValue);
}
To replace parts of text with another element or to add an element in the middle of text, use DOM splitText, createElement, and insertBefore methods, example.
See also how to replace multiple strings with multiple other strings.
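For several word pairs at once, the string step inside the loop could be sketched as follows. makeReplacer and escapeRegExp are hypothetical helper names, and matching whole words only (via \b) and case-sensitively is an assumption about what the extension should do. The returned function takes and returns a plain string, so in the TreeWalker loop it would be applied to each text node's nodeValue.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build one function that applies every replacement
// in a { from: to } map in a single pass over the text.
function makeReplacer(replacements) {
  const words = Object.keys(replacements)
    .sort((a, b) => b.length - a.length) // try longer words first
    .map(escapeRegExp);
  if (words.length === 0) {
    return (text) => text;
  }
  const pattern = new RegExp("\\b(" + words.join("|") + ")\\b", "g");
  return (text) => text.replace(pattern, (match) => replacements[match]);
}

const replaceWords = makeReplacer({ pie: "cake", tea: "coffee" });
console.log(replaceWords("tea and pie, but not pies")); // prints "coffee and cake, but not pies"
```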
Don't use innerHTML or innerText or jQuery .html()
// the innerHTML property of any DOM node is a string
document.body.innerHTML = document.body.innerHTML.replace(/cake/g,'pie')
It's generally slower (especially on mobile devices).
It effectively removes and replaces the entire DOM, which is not awesome and could have some side effects: it destroys all event listeners attached in JavaScript code (via addEventListener or .onxxxx properties) thus breaking the functionality partially/completely.
This is, however, a common, quick, and very dirty way to do it.
Ok, so the createTreeWalker method is the RIGHT way of doing this and it's a good way. I unfortunately needed to do this to support IE8 which does not support document.createTreeWalker. Sad Ian is sad.
If you want to do this with a .replace on the page text using a non-standard innerHTML call like a naughty child, you need to be careful because it WILL replace text inside a tag, leading to XSS vulnerabilities and general destruction of your page.
What you need to do is only replace text OUTSIDE of tag, which I matched with:
var search_re = new RegExp("(?:>[^<]*)(" + stringToReplace + ")(?:[^>]*<)", "gi");
gross, isn't it. you may want to mitigate any slowness by replacing some results and then sticking the rest in a setTimeout call like so:
// replace some chunk of stuff, the first section of your page works nicely
// if you happen to have that organization
//
setTimeout(function() { /* replace the rest */ }, 10);
which will return immediately after replacing the first chunk, letting your page continue with its happy life. for your replace calls, you're also going to want to replace large chunks in a temp string
var tmp = element.innerHTML.replace(search_re, whatever);
/* more replace calls, maybe this is in a for loop, i don't know what you're doing */
element.innerHTML = tmp;
so as to minimize reflows (when the page recalculates positioning and re-renders everything). for large pages, this can be slow unless you're careful, hence the optimization pointers. again, don't do this unless you absolutely need to. use the createTreeWalker method zetlen has kindly posted above..
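The "only replace text outside of tags" idea can also be sketched without a single large regex, by splitting the markup string into tag and non-tag pieces and touching only the latter. This is an illustration under stated assumptions, not a robust HTML parser: replaceOutsideTags is a hypothetical name, the search term is treated as a literal string, and the split will be wrong if an attribute value contains ">" or if the markup includes script, style or comment content.

```javascript
// Hypothetical helper: replace a literal string in markup, skipping anything
// that sits inside <...>.
function replaceOutsideTags(html, search, replacement) {
  return html
    // With a capturing group, split() keeps the tags: the result alternates
    // text, tag, text, tag, ... so odd indexes are always tags.
    .split(/(<[^>]*>)/)
    .map((piece, index) =>
      index % 2 === 1 ? piece : piece.split(search).join(replacement)
    )
    .join("");
}

const before = '<a href="pie.html">pie</a> and more pie';
console.log(replaceOutsideTags(before, "pie", "cake"));
// prints '<a href="pie.html">cake</a> and more cake'
```

The href keeps its original value because it is part of a tag piece, which is the behaviour the regex in this answer is aiming for.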
Have you tried something like this?
$('body').html($('body').html().replace('pie','cake'));
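One thing to keep in mind with a line like this: when the first argument to .replace is a string rather than a regular expression with the g flag, only the first occurrence is replaced. A short illustration on a plain string (the sample text is made up):

```javascript
const sampleText = "pie, more pie";

// A string pattern replaces the first occurrence only.
const firstOnly = sampleText.replace("pie", "cake");

// A regular expression with the g flag replaces every occurrence.
const everyMatch = sampleText.replace(/pie/g, "cake");

console.log(firstOnly);  // prints "cake, more pie"
console.log(everyMatch); // prints "cake, more cake"
```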

Best way to pick up text in a HTML element that is in the parent node only

I have, for example, markup like this
<div id="content">
<p>Here is some wonderful text, and here is a link. All links should have a `href` attribute.</p>
</div>
Now I want to be able to perform some regex replace on the text inside the p element, but not in any HTML, i.e. be able to match the href within backticks, but not inside the anchor element.
I thought about regex, but as the general consensus is, I shouldn't be using them to parse HTML.
My current method of doing this is like so: I've got a bunch of words in an array, and I am looping through them and making an object of data like so:
termsData[term] = {
    regex: new RegExp('(\\b' + term + '\\b)', 'gmi'),
    replaceWith: '<span>{TERM}</span>'
};
I then loop through it again, making the replacements like so:
var html = obj.html();
$.each(terms, function(i, term) {
    // Replace each word in the HTML with the span
    html = html.replace(termsData[term].regex, termsData[term].replaceWith.replace(/{TERM}/, '$1'));
});
obj.html(html);
Now I did a lot of this last night at an ungodly hour, and copying and pasting it into here makes me think I should refactor some of this.
So, as you should be able to tell, I want to be able to replace plain text, but not anything inside an HTML tag.
What would be the best way to do it?
Note: The source code is coming from here if you'd like a better look.
You're right to not want to be processing HTML with regex. It's also bad news to be assigning huge chunks of .html(); apart from the performance drawbacks of serialising and reparsing a large amount of HTML, you'll also lose unserialisable data like event listeners, form data and JS properties/references.
See the findText function in this answer and call something like (assuming obj is a jQuery wrapper over your topmost node to search in):
findText(obj[0], /\b(term1|term2|term3)\b/g, function(node, match) {
    var span= document.createElement('span');
    node.splitText(match.index+match[0].length);
    span.appendChild(node.splitText(match.index));
    node.parentNode.insertBefore(span, node.nextSibling);
});
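To see what the two splitText calls carve out, the same offsets can be applied to a plain string. splitText(offset) cuts a text node at the given offset, keeps the first part in the original node and returns a new node holding the rest. So the first call cuts off everything after the match, and the second returns the match itself, which is what ends up inside the span. The sketch below mirrors that with slice; splitAroundMatch is a hypothetical name and the sample sentence is adapted from the question.

```javascript
// Hypothetical helper: the three pieces that the two splitText calls produce,
// computed on a plain string from a regex match object.
function splitAroundMatch(text, match) {
  const start = match.index;
  const end = start + match[0].length;
  return [
    text.slice(0, start),   // what stays in the original text node
    text.slice(start, end), // the node that gets wrapped in the <span>
    text.slice(end),        // the trailing text node
  ];
}

const sentence = "All links should have a href attribute.";
const hrefMatch = /\bhref\b/.exec(sentence);
const pieces = splitAroundMatch(sentence, hrefMatch);
console.log(pieces); // prints [ 'All links should have a ', 'href', ' attribute.' ]
```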
