Extract elements from html-string in ClojureScript

Extract elements from html-string in ClojureScript - javascript

For a ClojureScript project I'm looking for a concise way to extract content from an external HTML document on the client side. The content is actually received via an ajax call in the Markdown format, which is parsed to HTML subsequently. So a HTML string is the point of departure.
(def html-string "<p>Something, that <a>was</a> Markdown before</p>")
The libraries Enlive and Garden for instance use vectors to express CSS Selectors, which are needed here. Enlive has a front-end sister, called Enfocus, which provides similar semantics.
Here's an enfocus example which extracts some content from the current DOM:
(require '[enfocus.core :as ef])
(ef/from js/document.head :something [:title]
(ef/get-text))
;;{:something "My Title"}
If there were more matches the value of :something would become a vector. I could not figure out, how to apply this function on arbitrary HTML strings. The closest I could get was by using this function:
(defn html->node [h]
(doto (.createElement js/document "div")
(aset "innerHTML" h)))
and then:
(ef/from (html->node html-string) :my-link [:a]
(ef/get-text))
;;{:my-link "was"}
However, this is not quite clean, since now there's a div wrapping everything, which might cause trouble in some situations.

Inserting some HTML content into a div makes your arbitrary HTML automatically evaluated. What you should do instead is to parse your HTML string with something like this.
(defn parse-html [html]
"Parse an html string into a document"
(let [doc (.createHTMLDocument js/document.implementation "mydoc")]
(set! (.-innerHTML doc.documentElement) html)
doc
))
This is plain and simple ClojureScript, maybe using the Google Closure library would be more elegant if you want you can dig into the goog.dom namespace or someplace else inside the Google Closure API.
Then you parse your HTML string:
(def html-doc (parse-html "<body><div><p>some of my stuff</p></div></body>"))
So you can call Enfocus on your document:
(ef/from html-doc :mystuff [:p] (ef/get-text))
Result:
{:mystuff "some of my stuff"}

Related

javascript parser for specific purpose

I am trying to create a tool that looks for missing translations in .html files. some of our translations are done at runtime in JS code. I would like to map these together. below is an example.
<select id="dropDown"></select>
// js
bindings: {
"dropDown": function() {
translate(someValue);
// translate then set option
}
}
above you can see that I have a drop down where values are created & translated at runtime. I was thinking that an AST would be the right way to accomplish this. basically I need to go through the .html file looking for tags that are missing in-line translations (done using {{t value}}) and search the corresponding .js file for run time translations. Is there a better way to accomplish this? any advice on a tool for creating the AST?

I think you want to hunt for patterns in the code. In particular, I think you want to determine, for each HTML select construct, if there is a corresponding, properly shaped JavaScript fragment with the right embedded ID name.
You can do that with an AST, right. In your case, you need an AST for the HTML file (nodes are essentially HTML tags) with sub-ASTs for script chunks () tags containing the parsed JavaScript.
For this you need two parsers: one to parse the HTML (a nasty just because HTML is a mess), that produces a tree containing script nodes having just text. Then you need a JavaScript parser that you can apply to the text blobs under the Script tags, to produce JavaScript ASTs; ideally, you splice these into the HTML tree to replace the text blob nodes. Now you have a mixed tree, with some nodes being html, with some subtrees that are JavaScript. Ideally the HMTL nodes are marked as being HTML, and the JavaScript nodes are marked as Javascript.
Now you can search the tree for select nodes, pick up the id, and the search all the javascript subtrees for expected structure.
You can code the matching procedurally, but it will be messy:
for all node
if node is HTML and nodetype is Select
then
functionname=node.getchild("ID").text
for all node
if node is JavaScript and node.parent is HTML and nodetype is pair
then if node.getchild(left).ext is "bindings"
then if node.getchild(right)=structure
then... (lots more....)
There's a lot of hair in this. Technically its just sweat. You have to know (and encode) the precise detail of the tree, in climbing up and downs its links correctly and checking the node types one by one. If the grammar changes a tiny bit, this code breaks too; it knows too much about the grammar.
You can finish this by coding your own parsers from scratch. Lots more sweat.
There are tools that can make this a lot easier; see Program Transformation Systems. Such tools let you define language grammars, and they generate parsers and AST builders for such grammars. (As a general rule, they are pretty good at defining working grammars, because they are designed to be applied to many languages). That at least puts lots of structure into the process, and they provide a lot of machinery to make this work. But the good part is that you can express patterns, usually in source language surface syntax, that can make this a lot easier to express.
One of these tools is our DMS Software Reengineering Toolkit (I'm the architect).
DMS has dirty HTML and full JavaScript parsers already, so those don't need to be built.
You would have to write a bit of code for DMS to invoke the HTML parser, find the subtree for script nodes, and apply the JavaScript parser. DMS makes this practical by allowing you parse a blob of text as an arbitrary nonterminal in a grammar; in this case, you want to parse those blobs as an expression nonterminal.
With all that in place, you can now write patterns that will support the check:
pattern select_node(property: string): HTML~dirty.HTMLform =
" <select ID=\property></select> ";
pattern script(code: string): HTML~dirty.HTMLform =
" <script>\code</script> ";
pattern js_bindings(s: string, e:expression):JavaScript.expression =
" bindings : { \s : function ()
{ translate(\e);
}
} ";
While these patterns look like text, they are parsed by DMS into ASTs with placeholder nodes for the parameter list elements, denoted by "\nnnn" inside
the (meta)quotes "..." that surround the program text of interest. Such ASTs patterns can be pattern matched against ASTs; they match if the pattern tree matches, and the pattern variable leaves are then captured as bindings.
(See Registry:PatternMatch below, and resulting match argument with slots matched (a boolean) and bindings (an array of bound subtrees resulting from the match). A big win for the tool builder: he doesn't have to know much about the fine detail of the grammar, because he writes the pattern, and the tool produces all the tree nodes for him, implicitly.
With these patterns, you can write procedural PARLANSE (DMS's Lisp-style programming language) code to implement the check (liberties taken to shorten the presentation):
(;; `Parse HTML file':
(= HTML_tree (HMTL:ParseFile .... ))
`Find script nodes and replace by ASTs for same':
(AST:FindAllMatchingSubtrees HTML_tree
(lambda (function boolean [html_node AST:Node])
(let (= [match Registry:Match]
(Registry:PatternMatch html_node "script"))
(ifthenelse match:boolean
(value (;; (AST:ReplaceNode node
(JavaScript:ParseStream
"expression" ; desired nonterminal
(make Streams:Stream
(AST:GetString match:bindings:1))))
);;
~f ; false: don't visit subtree
)value
~t ; true: continue scanning into subtree
)ifthenelse
)let
)lambda )
`Now find select nodes, and check sanity':
(AST:FindAllMatchingSubtrees HTML_tree
(lambda (function boolean [html_node AST:node])
(let (;; (= [select_match Registry:Match] ; capture match data
(Registry:PatternMatch "select" html_node)) ; hunt for this pattern
[select_function_name string]
);;
(ifthenelse select_match:boolean
(value (;; `Found <select> node.
Get name of function...':
(= select_function_name
(AST:GetString select_match:bindings:1))
`... and search for matching script fragment':
(ifthen
(~ (AST:FindFirstMatchingSubtree HTML_tree
(lambda (function boolean [js_node AST:Node])
(let (;; (= [match Registry:Match] ; capture match data
(Registry:PatternMatch js_node "js_bindings")) ; hunt for this pattern
(&& match:boolean
(== select_match:bindings:1
select_function_name)
)&& ; is true if we found matching function
)let
)lambda ) )~
(;; `Complain if we cant find matching script fragment'
(Format:SNN `Select #S with missing translation at line #D column #D'
select_function_name
(AST:GetLineNumber select_match:source_position)
(AST:GetColumnNumber select_match:source_position)
)
);;
)ifthen
);;
~f ; don't visit subtree
)value
~t ; continue scanning into subtree
)ifthenelse
)let
)lambda )
);;
This procedural code first parses an HTML source file producing an HTML tree. All these nodes are stamped as being from the "HTML~dirty" langauge.
It then scans that tree to find SCRIPT nodes, and replaces them with an AST obtained from a JavaScript-expression-parse of the text content of the script nodes encountered. Finally, it finds all SELECT nodes, picks out the name of the function mentioned in the ID clause, and checks all JavaScript ASTs for a matching "bindings" expression as specified by OP. All of this leans on the pattern matching machinery, which in turn leans on top of the low-level AST library that provides a variety of means to inspect/navigate/change tree nodes.
I've left out some detail (esp. error handling code) and obviously haven't tested this. But this gives the flavor of how you might do it with DMS.
Similar patterns and matching processes are available in other program transformation systems.

HTML entities in CSS content (convert entities to escape-string at runtime)

I know that html-entities like or ö or ð can not be used inside a css like this:
div.test:before {
content:"text with html-entities like ` ` or `ö` or `ð`";
}
There is a good question with good answers dealing with this problem: Adding HTML entities using CSS content
But I am reading the strings that are put into the css-content from a server via AJAX. The JavaScript running at the users client receives text with embedded html-entities and creates style-content from it instead of putting it as a text-element into an html-element's content. This method helps against thieves who try to steal my content via copy&paste. Text that is not part of the html-document (but part of css-content) is really hard to copy. This method works fine. There is only this nasty problem with that html-entities.
So I need to convert html-entities into unicode escape-sequences at runtime. I can do this either on the server with a perl-script or on the client with JavaScript, But I don't want to write a subroutine that contains a complete list of all existing named entities. There are more than 2200 named entities in html5, as listed here: http://www.w3.org/TR/2011/WD-html5-20110113/named-character-references.html And I don't want to change my subroutine every time this list gets changed. (Numeric entities are no problem.)
Is there any trick to perfom this conversion with javascript? Maybe by adding, reading and removing content to the DOM? (I am using jQuery)

I've found a solution:
var text = 'Text that contains html-entities';
var myDiv = document.createElement('div');
$(myDiv).html(text);
text = $(myDiv).text();
$('#id_of_a_style-element').html('#id_of_the_protected_div:before{content:"' + text + '"}');
Writing the Question was half way to get this answer. I hope this answer helps others too.

Techniques to avoid building HTML strings in JavaScript?

It is very often I come across a situation in which I want to modify, or even insert whole blocks of HTML into a page using JavaScript. Usually it also involves changing several parts of the HTML dynamically depending on certain parameters.
However, it can make for messy/unreadable code, and it just doesn't seem right to have these little snippets of HTML in my JavaScript code, dammit.
So, what are some of your techniques to avoid mixing HTML and JavaScript?

The Dojo toolkit has a quite useful system to deal with HTML fragments/templates. Let's say the HTML snippet mycode/mysnippet.tpl.html is something like the following
<div>
<span dojoAttachPoint="foo"></span>
</div>
Notice the dojoAttachPoint attribute. You can then make a widget mycode/mysnippet.js using the HTML snippet as its template:
dojo.declare("mycode.mysnippet", [dijit._Widget, dijit._Templated], {
templateString: dojo.cache("mycode", "mysnippet.tpl.html"),
construct: function(bar){
this.bar = bar;
},
buildRendering: function() {
this.inherited(arguments);
this.foo.innerHTML = this.bar;
}
});
The HTML elements given attach point attributes will become class members in the widget code. You can then use the templated widget like so:
new mycode.mysnippet("A cup of tea would restore my normality.").placeAt(someOtherDomElement);
A nice feature is that if you use dojo.cache and Dojo's build system, it will insert the HTML template text into the javascript code, so that the client doesn't have to make a separate request.
This may of course be way too bloated for your use case, but I find it quite useful - and since you asked for techniques, there's mine. Sitepoint has a nice article on it too.

There are many possible techniques. Perhaps the most obvious is to have all elements on the page but have them hidden - then your JS can simply unhide them/show them as required. This may not be possible though for certain situations. What if you need to add a number (unspecified) of duplicate elements (or groups of elements)? Then perhaps have the elements in question hidden and using something like jQuery's clone function insert them as required into the DOM.
Alternatively if you really have to build HTML on the fly then definitely make your own class to handle it so you don't have snippets scattered through your code. You could employ jQuery literal creators to help do this.

I'm not sure if it qualifies as a "technique", but I generally tend to avoid constructing blocks of HTML in JavaScript by simply loading the relevant blocks from the back-end via AJAX and using JavaScript to swap them in and out/place them as required. (i.e.: None of the low-level text shuffling is done in JavaScript - just the DOM manipulation.)
Whilst you of course need to allow for this during the design of the back-end architecture, I can't help but think to leads to a much cleaner set up.

Sometimes I utilise a custom method to return a node structure based on provided JSON argument(s), and add that return value to the DOM as required. It ain't accessible once JS is unavailable like some backend solutions could be.

After reading some of the responses I managed to come up with my own solution using Python/Django and jQuery.
I have the HTML snippet as a Django template:
<div class="marker_info">
<p> _info_ </p>
more info...
</div>
In the view, I use the Django method render_to_string to load the templates as strings stored in a dictionary:
snippets = { 'marker_info': render_to_string('templates/marker_info_snippet.html')}
The good part about this is I can still use the template tags, for example, the url function. I use simplejson to dump it as JSON and pass it into the full template. I still wanted to dynamically replace strings in the JavaScript code, so I wrote a function to replace words surrounded by underscores with my own variables:
function render_snippet(snippet, dict) {
for (var key in dict)
{
var regex = new RegExp('_' + key + '_', 'gi');
snippet = snippet.replace(regex, dict[key]);
}
return snippet;
}

JavaScript multiline strings and templating?

I have been wondering if there is a way to define multiline strings in JavaScript like you can do in languages like PHP:
var str = "here
goes
another
line";
Apparently this breaks up the parser. I found that placing a backslash \ in front of the line feed solves the problem:
var str = "here\
goes\
another\
line";
Or I could just close and reopen the string quotes again and again.
The reason why I am asking because I am making JavaScript based UI widgets that utilize HTML templates written in JavaScript. It is painful to type HTML in strings especially if you need to open and close quotes all the time. What would be a good way to define HTML templates within JavaScript?
I am considering using separate HTML files and a compilation system to make everything easier, but the library is distributed among other developers so that HTML templates have to be easy to include for the developers.

No thats basically what you have to do to do multiline strings.
But why define the templates in javascript anwyay? why not just put them into a file and have a ajax call load them up in a variable when you need them?
For instantce (using jquery)
$.get('/path/to/template.html', function(data) {
alert(data); //will alert the template code
});

#slebetman, Thanks for the detailed example.
Quick comment on the substitute_strings function.
I had to revise
str.replace(n,substitutions[n]);
to be
str = str.replace(n,substitutions[n]);
to get it to work. (jQuery version 1.5? - it is pure javascript though.)
Also when I had below situation in my template:
$CONTENT$ repeated twice $CONTENT$ like this
I had to do additional processing to get it to work.
str = str.replace(new RegExp(n, 'g'), substitutions[n]);
And I had to refrain from $ (regex special char) as the delimiter and used # instead.
Thought I would share my findings.

There are several templating systems in javascript. However, my personal favorite is one I developed myself using ajax to fetch XML templates. The templates are XML files which makes it easy to embed HTML cleanly and it looks something like this:
<title>This is optional</title>
<body><![CDATA[
HTML content goes here, the CDATA block prevents XML errors
when using non-xhtml html.
<div id="more">
$CONTENT$ may be substituted using replace() before being
inserted into $DOCUMENT$.
</div>
]]></body>
<script><![CDATA[
/* javascript code to be evaled after template
* is inserted into document. This is to get around
* the fact that this templating system does not
* have its own turing complete programming language.
* Here's an example use:
*/
if ($HIDE_MORE$) {
document.getElementById('more').display = 'none';
}
]]></script>
And the javascript code to process the template goes something like this:
function insertTemplate (url_to_template, insertion_point, substitutions) {
// Ajax call depends on the library you're using, this is my own style:
ajax(url_to_template, function (request) {
var xml = request.responseXML;
var title = xml.getElementsByTagName('title');
if (title) {
insertion_point.innerHTML += substitute_strings(title[0],substitutions);
}
var body = xml.getElementsByTagName('body');
if (body) {
insertion_point.innerHTML += substitute_strings(body[0],substitutions);
}
var script = xml.getElementsByTagName('script');
if (script) {
eval(substitute_strings(script[0],substitutions));
}
});
}
function substitute_strings (str, substitutions) {
for (var n in substitutions) {
str.replace(n,substitutions[n]);
}
return str;
}
The way to call the template would be:
insertTemplate('http://path.to.my.template', myDiv, {
'$CONTENT$' : "The template's content",
'$DOCUMENT$' : "the document",
'$HIDE_MORE$' : 0
});
The $ sign for substituted strings is merely a convention, you may use % of # or whatever delimiters you prefer. It's just there to make the part to be substituted unambiguous.
One big advantage to using substitutions on the javascript side instead of server side processing of the template is that this allows the template to be plain static files. The advantage of that (other than not having to write server side code) is that you can then set the caching policy for the template to be very aggressive so that the browser only needs to fetch the template the first time you load it. Subsequent use of the template would come from cache and would be very fast.
Also, this is a very simple example of the implementation to illustrate the mechanism. It's not what I'm using. You can modify this further to do things like multiple substitution, better handling of script block, handle multiple content blocks by using a for loop instead of just using the first element returned, properly handling HTML entities etc.
The reason I really like this is that the HTML is simply HTML in a plain text file. This avoids quoting hell and horrible string concatenation performance issues that you'll usually find if you directly embed HTML strings in javascript.

I think I found a solution I like.
I will store templates in files and fetch them using AJAX. This works for development stage only. For production stage, the developer has to run a compiler once that compiles all templates with the source files. It also compiles JavaScript and CSS to be more compact and it compiles them to a single file.
The biggest problem now is how to educate other developers doing that. I need to build it so that it is easy to do and understand why and what are they doing.

You could also use \n to generate newlines. The html would however be on a single line and difficult to edit. But if you generate the JS using PHP or something it might be an alternative

Easy way to drop XML Namespaces with JavaScript?

Is there an easy way, for example, to drop an XML name space, but keep the tag as is with jQuery or JavaScript? For example:
<html:a href="#an-example" title="Go to the example">Just an Example</html:a>
And change it to:
Just an Example
On the fly with jQuery or JavaScript and not knowing the elements and or attributes inside?

If there is no <script> tag in the code to be replaced you can try (demo):
container.innerHTML = container.innerHTML
.replace(/<(\/?)([^:>\s]*:)?([^>]+)>/g, "<$1$3>")

This isn't really an answer, but you'd be better off handling this server-side. JavaScript comes into the picture a bit too late for this kind of task... events may have already been attached to the existing nodes and you'd be processing each element twice.
What is the purpose of the namespace?
If these are static html files and the namespace serves no purpose I would strip all the namespaces with a regex.
If they're not static and the namespace has a purpose when served as xml, you could do some server-side detection to serve with the right doctype and the namespace (when appropriate).

It's easy when you know how ... just use \\ to escape the colon so it's not parsed as an action delimiter by the jquery parsing engine.
$('html\\:a').each( function(){
var temp = $(this).html();
$(this).replaceWith("<a>"+temp+"</a>");
});
This should iterate between each of those elements and replace them with normal tags. Since these are being served up to an ajax callback function, otherwise I don't see why you'd want to do it on the fly ... then the top line would change to:
$.post('...',{}, function(dat){
$(dat).find('html\\:a').each( blah blah ....
.
.
});
NB, i'm one of those terrible people who only really tests things in FF ...

We Keep Coding

JavaScript is the programming language of the Web.

Extract elements from html-string in ClojureScript - javascript

Related

javascript parser for specific purpose

HTML entities in CSS content (convert entities to escape-string at runtime)

Techniques to avoid building HTML strings in JavaScript?

JavaScript multiline strings and templating?

Easy way to drop XML Namespaces with JavaScript?

Categories

Resources