Does `elt.outerHTML` really capture the whole HTML string representation accurately? - javascript

This question stems from How to get html string representation for all types of Dom nodes?

Question

Does elt.outerHTML really capture the whole HTML string representation accurately?
I think I know how to get the HTML string representation for Node.TEXT_NODE (via textContent) and Node.ELEMENT_NODE (via outerHTML);
but do these really capture all the HTML source of such a node, accurately?
By "all" I mean: is there no special case where I call elt.outerHTML and get back an HTML string that has less data (less HTML source) than the element actually contains?
For example, would there be a special case where you have
<div>a special node contains special <&&xx some hypothetical special-node syntax, which may not be captured xx&&> data</div>
and calling elt.outerHTML returns only
<div>a special node contains special data</div>?
Similarly:
if I call node.textContent on a Node.ELEMENT_NODE, it returns only the text content;
if I call node.textContent on a Node.COMMENT_NODE, it returns only the comment text, without the <!-- --> delimiters.
What about node types other than Node.ELEMENT_NODE?
Clarification

What exactly am I trying to do? There is no specific purpose:
it is just a common operation, wanting the exact HTML string representation of a node;
a programmer needs exact, accurate input data to do their job.
So I ask: when I access a property like .outerHTML or .textContent, do I get the exact, accurate input data I want? In what cases won't I?
It is a simple question (though I know the answer/reason may be complex), and I provided the examples above to show what I mean.
If I really have to be more specific (though it is still just a general operation):

I am given an HTML file (or a DOM element);
I can get all the nodes in that file (or element);
I want to accurately get the HTML string of those nodes;
I do some business logic based on those HTML strings: I add additional information to them, with no deletion of any original data;
Then I put the modified HTML back into the node.
The node should then contain no less than the original data.
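To make the question concrete, here is a minimal sketch (the helper name `serializeNode` is my own invention, not a DOM API) of how per-node-type serialization might be dispatched; in a browser, XMLSerializer covers all node types uniformly and is likely the safer fallback:

```javascript
// Hypothetical helper: serialize a DOM node back to an HTML string.
// Dispatches on nodeType; falls back to XMLSerializer when available.
function serializeNode(node) {
  switch (node.nodeType) {
    case 1: // Node.ELEMENT_NODE
      return node.outerHTML;
    case 3: // Node.TEXT_NODE -- but note textContent does NOT re-escape
            // characters: a text node parsed from "&amp;" comes back as "&"
      return node.textContent;
    case 8: // Node.COMMENT_NODE -- textContent drops the <!-- --> delimiters,
            // so they must be re-added by hand
      return '<!--' + node.data + '-->';
    default:
      // Doctype, CDATA, etc.: let the browser's serializer handle it.
      return new XMLSerializer().serializeToString(node);
  }
}
```

The text-node comment points at one genuine "less data than the source" case: entity references are decoded at parse time, so round-tripping the exact original source bytes is not guaranteed even for plain text.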

Related

Extract elements from html-string in ClojureScript

For a ClojureScript project I'm looking for a concise way to extract content from an external HTML document on the client side. The content is actually received via an AJAX call in Markdown format, which is subsequently parsed to HTML. So an HTML string is the point of departure.
(def html-string "<p>Something, that <a>was</a> Markdown before</p>")
The libraries Enlive and Garden for instance use vectors to express CSS Selectors, which are needed here. Enlive has a front-end sister, called Enfocus, which provides similar semantics.
Here's an enfocus example which extracts some content from the current DOM:
(require '[enfocus.core :as ef])

(ef/from js/document.head :something [:title]
  (ef/get-text))
;; {:something "My Title"}
If there were more matches, the value of :something would become a vector. I could not figure out how to apply this function to arbitrary HTML strings. The closest I could get was by using this function:
(defn html->node [h]
  (doto (.createElement js/document "div")
    (aset "innerHTML" h)))
and then:
(ef/from (html->node html-string) :my-link [:a]
  (ef/get-text))
;; {:my-link "was"}
However, this is not quite clean, since now there's a div wrapping everything, which might cause trouble in some situations.
Inserting HTML content into a div causes your arbitrary HTML to be evaluated automatically. What you should do instead is parse your HTML string with something like this:
(defn parse-html
  "Parse an HTML string into a document."
  [html]
  (let [doc (.createHTMLDocument js/document.implementation "mydoc")]
    (set! (.-innerHTML (.-documentElement doc)) html)
    doc))
This is plain and simple ClojureScript; using the Google Closure library might be more elegant. If you want, you can dig into the goog.dom namespace or elsewhere in the Google Closure API.
Then you parse your HTML string:
(def html-doc (parse-html "<body><div><p>some of my stuff</p></div></body>"))
So you can call Enfocus on your document:
(ef/from html-doc :mystuff [:p] (ef/get-text))
Result:
{:mystuff "some of my stuff"}

Convert textNode content to a string

I'm having a problem with a textNode that I can't convert to a string.
I'm trying to scrape a site and get certain information out of it, and when I use an XPath to find the text I'm after, I get a textNode back.
When I look in the Chrome development tools, I can see that the textNode itself contains the text I'm after, but how do I convert the textNode to plain text?
Here is the line of code I use:
abstracts = ZU.xpath(doc, '//*[@id="abstract"]/div/div/par/text()');
I have tried things like .innerHTML, toString, and textContent, but nothing has worked so far.
I usually use Text.wholeText when I want the content string of a textNode: a textNode is an object, not the string itself, so things like innerHTML will not work on it.
Example: from https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText
The Text.wholeText read-only property returns the full text of all Text nodes logically adjacent to the node. The text is concatenated in document order. This lets you specify any text node and obtain all adjacent text as a single string.
Syntax
str = textnode.wholeText;
Notes and example:
Suppose you have the following simple paragraph within your webpage (with some whitespace added to aid formatting throughout the code samples here), whose DOM node is stored in the variable para:
<p>Thru-hiking is great! <strong>No insipid election coverage!</strong>
However, <a href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You decide you don’t like the middle sentence, so you remove it:
para.removeChild(para.childNodes[1]);
Later, you decide to rephrase things to, “Thru-hiking is great, but casting a ballot is tricky.” while preserving the hyperlink. So you try this:
para.firstChild.data = "Thru-hiking is great, but ";
All set, right? Wrong! What happened is that you removed the strong element, but that element had separated two text nodes: one for the first sentence and one for the first word of the last. Instead, you now effectively have this:
<p>Thru-hiking is great, but However, <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You’d really prefer to treat all those adjacent text nodes as a single one. That’s where wholeText comes in: if you have multiple adjacent text nodes, you can access the contents of all of them using wholeText. Let’s pretend you never made that last mistake. In that case, we have:
assert(para.firstChild.wholeText == "Thru-hiking is great! However, ");
wholeText is just a property of text nodes that returns the string of data making up all the adjacent (i.e. not separated by an element boundary) text nodes combined.
Now let’s return to our original problem. What we want is to be able to replace the whole text with new text. That’s where replaceWholeText() comes in:
para.firstChild.replaceWholeText("Thru-hiking is great, but ");
We’re removing every adjacent text node (all the ones that constituted the whole text) but the one on which replaceWholeText() is called, and we’re changing the remaining one to the new text. What we have now is this:
<p>Thru-hiking is great, but <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
Some uses of the whole-text functionality may be better served by using Node.textContent, or the longstanding Element.innerHTML; that’s fine and probably clearer in most circumstances. If you have to work with mixed content within an element, as seen here, wholeText and replaceWholeText() may be useful.
More info: https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText
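The adjacency rule described above can be sketched in plain JavaScript (using mock node objects with sibling links rather than a real DOM; `wholeTextOf` is a name I made up for illustration):

```javascript
// Sketch of wholeText's adjacency rule: walk back to the first adjacent
// text node (nodeType 3), then concatenate forward in document order.
function wholeTextOf(textNode) {
  var first = textNode;
  while (first.previousSibling && first.previousSibling.nodeType === 3) {
    first = first.previousSibling;
  }
  var result = '';
  for (var n = first; n && n.nodeType === 3; n = n.nextSibling) {
    result += n.data;
  }
  return result;
}
```

Called on either of two adjacent text nodes, it returns the same combined string, mirroring the `para.firstChild.wholeText` assertion above.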

javascript parser for specific purpose

I am trying to create a tool that looks for missing translations in .html files. Some of our translations are done at runtime in JS code. I would like to map these together. Below is an example.
<select id="dropDown"></select>
// js
bindings: {
"dropDown": function() {
translate(someValue);
// translate then set option
}
}
Above you can see that I have a drop-down whose values are created and translated at runtime. I was thinking that an AST would be the right way to accomplish this: basically I need to go through the .html file looking for tags that are missing in-line translations (done using {{t value}}) and search the corresponding .js file for runtime translations. Is there a better way to accomplish this? Any advice on a tool for creating the AST?
I think you want to hunt for patterns in the code. In particular, I think you want to determine, for each HTML select construct, if there is a corresponding, properly shaped JavaScript fragment with the right embedded ID name.
You can do that with an AST, right. In your case, you need an AST for the HTML file (nodes are essentially HTML tags) with sub-ASTs for the script chunks (<script> tags) containing the parsed JavaScript.
For this you need two parsers: one to parse the HTML (a nasty job, because HTML is a mess), producing a tree whose script nodes contain just text. Then you need a JavaScript parser that you can apply to the text blobs under the script tags to produce JavaScript ASTs; ideally, you splice these into the HTML tree to replace the text-blob nodes. Now you have a mixed tree, with some nodes being HTML and some subtrees being JavaScript. Ideally the HTML nodes are marked as HTML, and the JavaScript nodes are marked as JavaScript.
Now you can search the tree for select nodes, pick up the id, and then search all the JavaScript subtrees for the expected structure.
You can code the matching procedurally, but it will be messy:

for all nodes
  if node is HTML and nodetype is Select
  then
    functionname = node.getchild("ID").text
    for all nodes
      if node is JavaScript and node.parent is HTML and nodetype is pair
      then if node.getchild(left).text is "bindings"
           then if node.getchild(right) = structure
                then ... (lots more ...)
There's a lot of hair in this. Technically it's just sweat. You have to know (and encode) the precise detail of the tree, climbing up and down its links correctly and checking the node types one by one. If the grammar changes a tiny bit, this code breaks too; it knows too much about the grammar.
You can finish this by coding your own parsers from scratch. Lots more sweat.
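As a rough, self-contained illustration of the check itself, here is a naive regex-based sketch (my own function name; a regex "parser" like this breaks on messy HTML and is only meant to show the shape of the task, which is exactly why real parsers are worth the sweat):

```javascript
// Collect <select id="..."> ids from the HTML, collect binding keys
// from the JS, and report ids that have no matching runtime binding.
function findMissingBindings(html, js) {
  var ids = [];
  var selectRe = /<select[^>]*\bid="([^"]+)"/g;
  var m;
  while ((m = selectRe.exec(html)) !== null) ids.push(m[1]);

  // Match entries shaped like  "key": function
  var boundKeys = {};
  var keyRe = /"([^"]+)"\s*:\s*function/g;
  while ((m = keyRe.exec(js)) !== null) boundKeys[m[1]] = true;

  return ids.filter(function (id) { return !boundKeys[id]; });
}
```

This handles only the exact textual shape shown in the question; any variation (unquoted keys, single quotes, computed ids) silently escapes it, which is the fragility the AST approach avoids.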
There are tools that can make this a lot easier; see Program Transformation Systems. Such tools let you define language grammars, and they generate parsers and AST builders for such grammars. (As a general rule, they are pretty good at defining working grammars, because they are designed to be applied to many languages). That at least puts lots of structure into the process, and they provide a lot of machinery to make this work. But the good part is that you can express patterns, usually in source language surface syntax, that can make this a lot easier to express.
One of these tools is our DMS Software Reengineering Toolkit (I'm the architect).
DMS has dirty HTML and full JavaScript parsers already, so those don't need to be built.
You would have to write a bit of code for DMS to invoke the HTML parser, find the subtree for script nodes, and apply the JavaScript parser. DMS makes this practical by allowing you parse a blob of text as an arbitrary nonterminal in a grammar; in this case, you want to parse those blobs as an expression nonterminal.
With all that in place, you can now write patterns that will support the check:
pattern select_node(property: string): HTML~dirty.HTMLform =
  " <select ID=\property></select> ";

pattern script(code: string): HTML~dirty.HTMLform =
  " <script>\code</script> ";

pattern js_bindings(s: string, e: expression): JavaScript.expression =
  " bindings : { \s : function ()
                 { translate(\e);
                 }
               } ";
While these patterns look like text, they are parsed by DMS into ASTs with placeholder nodes for the parameter-list elements, denoted by "\nnnn" inside the (meta)quotes "..." that surround the program text of interest. Such AST patterns can be matched against ASTs; they match if the pattern tree matches, and the pattern-variable leaves are then captured as bindings. (See Registry:PatternMatch below, and the resulting match argument with slots matched (a boolean) and bindings (an array of bound subtrees resulting from the match).) A big win for the tool builder: he doesn't have to know much about the fine detail of the grammar, because he writes the pattern, and the tool produces all the tree nodes for him, implicitly.
With these patterns, you can write procedural PARLANSE (DMS's Lisp-style programming language) code to implement the check (liberties taken to shorten the presentation):
(;; `Parse HTML file':
    (= HTML_tree (HTML:ParseFile .... ))
    `Find script nodes and replace by ASTs for same':
    (AST:FindAllMatchingSubtrees HTML_tree
       (lambda (function boolean [html_node AST:Node])
          (let (= [match Registry:Match]
                  (Registry:PatternMatch html_node "script"))
             (ifthenelse match:boolean
                (value (;; (AST:ReplaceNode html_node
                              (JavaScript:ParseStream
                                 "expression" ; desired nonterminal
                                 (make Streams:Stream
                                    (AST:GetString match:bindings:1))))
                       );;
                       ~f ; false: don't visit subtree
                )value
                ~t ; true: continue scanning into subtree
             )ifthenelse
          )let
       )lambda )
    `Now find select nodes, and check sanity':
    (AST:FindAllMatchingSubtrees HTML_tree
       (lambda (function boolean [html_node AST:Node])
          (let (;; (= [select_match Registry:Match] ; capture match data
                      (Registry:PatternMatch html_node "select")) ; hunt for this pattern
                   [select_function_name string]
               );;
             (ifthenelse select_match:boolean
                (value (;; `Found <select> node.
                            Get name of function...':
                           (= select_function_name
                              (AST:GetString select_match:bindings:1))
                           `... and search for matching script fragment':
                           (ifthen
                              (~ (AST:FindFirstMatchingSubtree HTML_tree
                                    (lambda (function boolean [js_node AST:Node])
                                       (let (= [match Registry:Match] ; capture match data
                                               (Registry:PatternMatch js_node "js_bindings")) ; hunt for this pattern
                                          (&& match:boolean
                                              (== (AST:GetString match:bindings:1)
                                                  select_function_name)
                                          )&& ; is true if we found a matching function
                                       )let
                                    )lambda ) )~
                              (;; `Complain if we can't find matching script fragment'
                                 (Format:SNN `Select #S with missing translation at line #D column #D'
                                    select_function_name
                                    (AST:GetLineNumber select_match:source_position)
                                    (AST:GetColumnNumber select_match:source_position)
                                 )
                              );;
                           )ifthen
                       );;
                       ~f ; don't visit subtree
                )value
                ~t ; continue scanning into subtree
             )ifthenelse
          )let
       )lambda )
);;
This procedural code first parses an HTML source file, producing an HTML tree. All these nodes are stamped as being from the "HTML~dirty" language.
It then scans that tree to find SCRIPT nodes, and replaces them with an AST obtained from a JavaScript-expression-parse of the text content of the script nodes encountered. Finally, it finds all SELECT nodes, picks out the name of the function mentioned in the ID clause, and checks all JavaScript ASTs for a matching "bindings" expression as specified by OP. All of this leans on the pattern matching machinery, which in turn leans on top of the low-level AST library that provides a variety of means to inspect/navigate/change tree nodes.
I've left out some detail (esp. error handling code) and obviously haven't tested this. But this gives the flavor of how you might do it with DMS.
Similar patterns and matching processes are available in other program transformation systems.

HTML entities in CSS content (convert entities to escape-string at runtime)

I know that HTML entities like &nbsp; or &ouml; or &eth; cannot be used inside CSS content like this:
div.test:before {
  content: "text with html-entities like &nbsp; or &ouml; or &eth;";
}
There is a good question with good answers dealing with this problem: Adding HTML entities using CSS content
But I am reading the strings that are put into the CSS content from a server via AJAX. The JavaScript running on the user's client receives text with embedded HTML entities and creates style content from it, instead of putting it as a text element into an HTML element's content. This method helps against thieves who try to steal my content via copy&paste: text that is not part of the HTML document (but part of CSS content) is really hard to copy. This method works fine. There is only this nasty problem with the HTML entities.
So I need to convert HTML entities into Unicode escape sequences at runtime. I can do this either on the server with a Perl script or on the client with JavaScript, but I don't want to write a subroutine that contains a complete list of all existing named entities. There are more than 2200 named entities in HTML5, as listed here: http://www.w3.org/TR/2011/WD-html5-20110113/named-character-references.html and I don't want to change my subroutine every time this list changes. (Numeric entities are no problem.)
Is there any trick to perform this conversion with JavaScript? Maybe by adding, reading, and removing content in the DOM? (I am using jQuery.)
I've found a solution:
var text = 'Text that contains html-entities';
var myDiv = document.createElement('div');
$(myDiv).html(text);    // let the browser decode the entities
text = $(myDiv).text(); // read back the decoded plain text
$('#id_of_a_style-element').html('#id_of_the_protected_div:before{content:"' + text + '"}');
Writing the question got me halfway to the answer. I hope this answer helps others too.
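One refinement worth noting (a sketch with a function name of my own, not part of the answer above): text dropped into content:"…" still needs CSS string escaping, or a quote or backslash in the decoded text will break the rule. Non-ASCII characters can also be written as explicit CSS hex escapes:

```javascript
// Escape a decoded string for use inside a double-quoted CSS string.
// Backslash and quote get a leading backslash; characters outside
// printable ASCII become \XXXXXX hex escapes terminated by a space.
function cssEscape(text) {
  var out = '';
  for (var ch of text) {
    var cp = ch.codePointAt(0);
    if (ch === '\\' || ch === '"') {
      out += '\\' + ch;
    } else if (cp < 0x20 || cp > 0x7e) {
      out += '\\' + cp.toString(16).toUpperCase() + ' ';
    } else {
      out += ch;
    }
  }
  return out;
}
```

For example, `ö` (U+00F6) becomes the CSS escape `\F6 `, and embedded double quotes come out backslash-escaped, so the generated rule stays well-formed.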

Replace comment in JavaScript AST with subtree derived from the comment's content

I'm the author of doctest, quick and dirty doctests for JavaScript and CoffeeScript. I'd like to make the library less dirty by using a JavaScript parser rather than regular expressions to locate comments.
I'd like to use Esprima or Acorn to do the following:
Create an AST
Walk the tree, and for each comment node:
Create an AST from the comment node's text
Replace the comment node in the main tree with this subtree
Input:
!function() {
  // > toUsername("Jesper Nøhr")
  // "jespernhr"
  var toUsername = function(text) {
    return ('' + text).replace(/\W/g, '').toLowerCase()
  }
}()
Output:
!function() {
  doctest.input(function() {
    return toUsername("Jesper Nøhr")
  });
  doctest.output(4, function() {
    return "jespernhr"
  });
  var toUsername = function(text) {
    return ('' + text).replace(/\W/g, '').toLowerCase()
  }
}()
I don't know how to do this. Acorn provides a walker which takes a node type and a function, and walks the tree invoking the function each time a node of the specified type is encountered. This seems promising, but doesn't apply to comments.
With Esprima I can use esprima.parse(input, {comment: true, loc: true}).comments to get the comments, but I'm not sure how to update the tree.
Most AST-producing parsers throw away comments. I don't know what Esprima or Acorn do, but that might be the issue.
.... in fact, Esprima lists comment capture as a current bug:
http://code.google.com/p/esprima/issues/detail?id=197
... Acorn's code is right there in GitHub. It appears to throw comments away, too.
So, looks like you get to fix either parser to capture the comments first, at which point your task should be straightforward, or, you're stuck.
Our DMS Software Reengineering Toolkit has JavaScript parsers that capture comments, in the tree. It also has language substring parsers, that could be used to parse the comment text into JavaScript ASTs of whatever type the comment represents (e.g, function declaration, expression, variable declaration, ...), and the support machinery to graft such new ASTs into the main tree. If you are going to manipulate ASTs, this substring capability is likely important: most parsers won't parse arbitrary language fragments, they are wired only to parse "whole programs". For DMS, there are no comment nodes to replace; there are comments associated with ASTs nodes, so the grafting process is a little trickier than just "replace comment nodes". Still pretty easy.
I'll observe that most parsers (including these) read the source and break it into tokens by using, or applying the equivalent of, regular expressions. So, if you are already using regexes to locate comments (which means using them to locate *non*-comments to throw away as well, e.g. recognizing string literals that contain comment-like text and ignoring them), you are doing as well as the parsers would in terms of finding the comments. And if all you want to do is replace them exactly with their content, echoing the source stream with the comment prefix/suffix /* */ stripped will apparently do exactly what you want, so all this parsing machinery seems like overkill.
You can already use Esprima to achieve what you want:
Parse the code, get the comments (as an array).
Iterate over the comments, see if each is what you are interested in.
If you need to transform the comment, note its range. Collect all transformations.
Apply the transformations from last to first so that the ranges are not shifted.
The trick here is not to change the AST. Simply apply the text changes as if you were doing a typical search-and-replace on the source string directly. Because each replacement might shift the positions after it, you need to collect everything and then apply from the last one. For an example of how to carry out such a transformation, take a look at my blog post "From double-quotes to single-quotes" (it deals with string quotes, but the principle remains the same).
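The last-to-first replacement step can be sketched with plain string operations (assuming you already have a list of {range, text} edits, e.g. built from Esprima's comment ranges; `applyEdits` is my own name for the helper):

```javascript
// Apply {range: [start, end], text} edits to a source string.
// Sorting by start offset descending means replacements near the end
// are applied first, so earlier offsets never shift.
function applyEdits(source, edits) {
  var sorted = edits.slice().sort(function (a, b) {
    return b.range[0] - a.range[0];
  });
  return sorted.reduce(function (src, e) {
    return src.slice(0, e.range[0]) + e.text + src.slice(e.range[1]);
  }, source);
}
```

Each edit's range refers to offsets in the original string, which is exactly why applying them front-to-back would corrupt later ranges.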
Last but not least, you might want to use a slightly higher-level utility such as Rocambole.
