This question stems from How to get html string representation for all types of Dom nodes?
Does elt.outerHTML really capture the whole HTML string representation accurately?
I (think I) know how to get the HTML string representation for Node.TEXT_NODE (textContent) and Node.ELEMENT_NODE (outerHTML), but does this really capture all of the HTML string of such a node accurately?
By "all" I mean cases like the following:
Is there a special case where calling elt.outerHTML returns an HTML string with less data (less HTML source) than the element actually has? For example, given
<div>a special node contains special <&&xx some hypothetical special node syntax, which may not be captured xx&&> data</div>
would calling elt.outerHTML give back only <div>a special node contains special data</div>?
Similarly:
if I call node.textContent on a Node.ELEMENT_NODE, it returns only the text content;
if I call node.textContent on a Node.COMMENT_NODE, it returns only the text content, but without the <!-- -->.
What about node types other than Node.ELEMENT_NODE?
Clarity
What exactly am I trying to do?
There is no specific purpose -- it is just a common operation: you want the exact HTML string representation of a node, because a programmer needs exact, accurate input data to do their job.
So I ask: when I access a property like .outerHTML / .textContent, do I get the exact, accurate data I want? In what cases won't I?
It is a simple question (though I know the answer/reason may be complex), and I provided the examples above to show what I mean.
If I really have to be more specific (though it is still just a general operation), what I am trying to do is:
I am given an HTML file (or a DOM HTML element);
I can get all the nodes in that file (or element);
I want to accurately get the HTML string of those nodes;
With those HTML strings, I do some business logic -- I add additional info to them; none of the original data / HTML string is deleted;
Then I put the modified HTML back into the node.
The node should then contain no less than the original data.
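To illustrate the kind of per-node-type handling I have in mind, here is a sketch (XMLSerializer is just one candidate I am considering, and I am not sure it covers every case):
// Sketch: serialize any DOM node back to markup.
// Elements use outerHTML; text, comment, and other nodes go through XMLSerializer,
// which keeps the <!-- --> delimiters that textContent drops.
function nodeToHtmlString(node) {
  if (node.nodeType === Node.ELEMENT_NODE) {
    return node.outerHTML;
  }
  return new XMLSerializer().serializeToString(node);
}

// Hypothetical usage:
// const div = document.createElement('div');
// div.innerHTML = 'text <!-- a comment --> more text';
// div.childNodes.forEach(n => console.log(nodeToHtmlString(n)));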
I'm building an app which has a feature for embedding expressions/rules in a config yaml file. So for example a user can reference a variable defined in the yaml file like ${variables.name == 'John'} or ${is_equal(variables.name, 'John')}. I could probably get by with simple expressions, but I want to support complex rules/expressions such as ${variables.name == 'John'} and (${variables.age > 18} OR ${variables.adult == true}).
I'm looking for a parsing/DSL/rules-engine library that can support these types of expressions and normalize them. I'm open to using Ruby, JavaScript, Java, or Python if anyone knows of a library for those languages.
One option I thought of was to just support JavaScript as conditions/rules and basically pass it through eval with the right context set up, with access to variables and other reference-able vars.
I don't know whether you use Go, but if you do, I recommend https://github.com/antonmedv/expr.
I have used it for parsing bot strategies (a stock options bot). This is from my unit tests:
// Note: the original snippet comes from a larger test file; the package
// clause and imports below are added here so it compiles on its own.
package main

import (
    "regexp"
    "strings"
    "testing"
)

func TestPattern(t *testing.T) {
    a := "pattern('asdas asd 12dasd') && lastdigit(23asd) < sma(50) && sma(14) > sma(12) && ( macd(5,20) > macd_signal(12,26,9) || macd(5,20) <= macd_histogram(12,26,9) )"
    r, _ := regexp.Compile(`(\w+)(\s+)?[(]['\d.,\s\w]+[)]`)
    indicator := r.FindAllString(a, -1)
    t.Logf("%v\n", indicator)
    t.Logf("%v\n", len(indicator))

    for _, i := range indicator {
        t.Logf("%v\n", i)
        if strings.HasPrefix(i, "pattern") {
            r, _ = regexp.Compile(`pattern(\s+)?\('(.+)'\)`)
            check1 := r.ReplaceAllString(i, "$2")
            t.Logf("%v\n", check1)
            r, _ = regexp.Compile(`[^du]`)
            check2 := r.FindAllString(check1, -1)
            t.Logf("%v\n", len(check2))
        } else if strings.HasPrefix(i, "lastdigit") {
            r, _ = regexp.Compile(`lastdigit(\s+)?\((.+)\)`)
            args := r.ReplaceAllString(i, "$2")
            r, _ = regexp.Compile(`[^\d]`)
            parameter := r.FindAllString(args, -1)
            t.Logf("%v\n", parameter)
        } else {
        }
    }
}
Combine it with regexes and you have a good (if not great) string translator.
And for Java, I personally use https://github.com/ridencww/expression-evaluator, though not in production. It has similar features to the library above.
It supports many kinds of conditions and you don't have to worry about parentheses and brackets.
Assignment    =
Operators     + - * / DIV MOD % ^
Logical       < <= == != >= > AND OR NOT
Ternary       ? :
Shift         << >>
Property      ${<id>}
DataSource    #<id>
Constants     NULL PI
Functions     CLEARGLOBAL, CLEARGLOBALS, DIM, GETGLOBAL, SETGLOBAL, NOW, PRECISION
Hope it helps.
You might be surprised to see how far you can get with a syntax parser and 50 lines of code!
Check this out. The Abstract Syntax Tree (AST) on the right represents the code on the left in nice data structures. You can use these data structures to write your own simple interpreter.
I wrote a little example of one:
https://codesandbox.io/s/nostalgic-tree-rpxlb?file=/src/index.js
Open up the console (button in the bottom), and you'll see the result of the expression!
This example can only handle (||) and (>), but looking at the code (line 24), you can see how you could make it support any other JS operator. Just add a case to the branch, evaluate the sides, and do the calculation in JS.
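As a rough illustration of the same idea outside the sandbox (this sketch is mine, not the linked code; it assumes Esprima and standard ESTree node shapes, and only whitelists a few operators):
const esprima = require('esprima');

// Walk an ESTree expression and evaluate it against a plain object of variables.
function evaluate(node, vars) {
  switch (node.type) {
    case 'Literal':
      return node.value;
    case 'Identifier':
      return vars[node.name];
    case 'LogicalExpression':
      if (node.operator === '||') return evaluate(node.left, vars) || evaluate(node.right, vars);
      if (node.operator === '&&') return evaluate(node.left, vars) && evaluate(node.right, vars);
      throw new Error('operator not supported yet: ' + node.operator);
    case 'BinaryExpression': {
      const l = evaluate(node.left, vars);
      const r = evaluate(node.right, vars);
      switch (node.operator) {
        case '>': return l > r;
        case '<': return l < r;
        case '==': return l == r;
        default: throw new Error('operator not supported yet: ' + node.operator);
      }
    }
    default:
      throw new Error('node type not supported yet: ' + node.type);
  }
}

const ast = esprima.parseScript("age > 18 || adult == true");
console.log(evaluate(ast.body[0].expression, { age: 16, adult: true })); // true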
Parenthesis and operator precedence are all handled by the parser for you.
I'm not sure if this is the solution for you, but it will for sure be fun ;)
One option I thought of was to just support JavaScript as
conditions/rules and basically pass it through eval with the right
context set up, with access to variables and other reference-able vars.
I would personally lean towards something like this. If you are getting into complexities such as logic comparisons, a DSL can become a beast, since at that point you are basically writing a compiler and a language. You might want to not have a config at all, and instead have the configurable file just be JavaScript (or whatever language) that can be evaluated and loaded. Then whoever your target audience is for this "config" file can just supply logical expressions as needed.
The only reason I would not do this is if this configuration file was being exposed to the public or something, but in that case security for a parser would also be quite difficult.
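If you go down that road, a minimal sketch of "eval with the right context" could look like the following (my illustration; it uses new Function instead of eval so the rule only sees the names you pass in explicitly, and it is of course still arbitrary code execution, so only for trusted config authors):
// Evaluate a JavaScript rule string against an explicit set of variables.
function evalRule(expression, variables) {
  const fn = new Function('variables', '"use strict"; return (' + expression + ');');
  return fn(variables);
}

// Usage with the kind of rule from the question:
evalRule("variables.name == 'John' && (variables.age > 18 || variables.adult == true)",
         { name: 'John', age: 17, adult: true }); // => true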
I did something like that once; you can probably pick it up and adapt it to your needs.
TL;DR: thanks to Python's eval, doing this is a breeze.
The problem was to parse dates and durations in textual form. What I did was to create a yaml file mapping regex patterns to results. The mapping itself was a Python expression that would be evaluated with the match object, and had access to other functions and variables defined elsewhere in the file.
For example, the following self-contained snippet would recognize times like "l'11 agosto del 1993" (Italian for "August 11th, 1993").
__meta_vars__:
  month: (gennaio|febbraio|marzo|aprile|maggio|giugno|luglio|agosto|settembre|ottobre|novembre|dicembre)
  prep_art: (il\s|l\s?'\s?|nel\s|nell\s?'\s?|del\s|dell\s?'\s?)
  schema:
    date: http://www.w3.org/2001/XMLSchema#date

__meta_func__:
  - >
    def month_to_num(month):
        """ gennaio -> 1, febbraio -> 2, ..., dicembre -> 12 """
        try:
            return index_in_or(meta_vars['month'], month) + 1
        except ValueError:
            return month

Tempo:
  - \b{prep_art}(?P<day>\d{{1,2}}) (?P<month>{month}) {prep_art}?\s*(?P<year>\d{{4}}): >
      '"{}-{:02d}-{:02d}"^^<{schema}>'.format(match.group('year'),
                                              month_to_num(match.group('month')),
                                              int(match.group('day')),
                                              schema=schema['date'])
__meta_func__ and __meta_vars__ (not the best names, I know) define functions and variables that are accessible to the match transformation rules. To make the rules easier to write, the pattern is formatted using the meta-variables, so that {month} is replaced with the pattern matching all months. The transformation rule calls the meta-function month_to_num to convert the month to a number from 1 to 12, and reads from the schema meta-variable. In the example above, the match results in the string "1993-08-11"^^<http://www.w3.org/2001/XMLSchema#date>, but some other rules would produce a dictionary.
Doing this is quite easy in Python, as you can use exec to evaluate strings as Python code (obligatory warning about security implications). The meta-functions and meta-variables are evaluated and stored in a dictionary, which is then passed to the match transformation rules.
The code is on GitHub; feel free to ask if you need any clarification. Relevant parts, slightly edited:
class DateNormalizer:
    def _meta_init(self, specs):
        """ Reads the meta variables and the meta functions from the specification

        :param dict specs: The specifications loaded from the file
        :return: None
        """
        self.meta_vars = specs.pop('__meta_vars__')

        # compile meta functions in a dictionary
        self.meta_funcs = {}
        for f in specs.pop('__meta_funcs__'):
            exec f in self.meta_funcs

        # make meta variables available to the meta functions just defined
        self.meta_funcs['__builtins__']['meta_vars'] = self.meta_vars

        self.globals = self.meta_funcs
        self.globals.update(self.meta_vars)

    def normalize(self, expression):
        """ Find the first matching part in the given expression

        :param str expression: The expression in which to search the match
        :return: Tuple with (start, end), category, result
        :rtype: tuple
        """
        expression = expression.lower()
        for category, regexes in self.regexes.iteritems():
            for regex, transform in regexes:
                match = regex.search(expression)
                if match:
                    result = eval(transform, self.globals, {'match': match})
                    start, end = match.span()
                    return (first_position + start, first_position + end), category, result
Here are some categorized Ruby options and resources:
Insecure
Pass expression to eval in the language of your choice.
It must be mentioned that eval is technically an option, but extraordinary trust must exist in its inputs and it is safer to avoid it altogether.
Heavyweight
Write a parser for your expressions and an interpreter to evaluate them
A cost-intensive solution would be implementing your own expression language. That is, to design a lexicon for your expression language, implement a parser for it, and an interpreter to execute the code that's parsed.
Some Parsing Options (ruby)
Parslet
TreeTop
Citrus
Roll-your-own with StringScanner
Medium Weight
Pick an existing language to write expressions in and parse / interpret those expressions.
This route assumes you can pick a known language to write your expressions in. The benefit is that a parser likely already exists for that language to turn it into an Abstract Syntax Tree (data structure that can be walked for interpretation).
A ruby example with the Parser gem
require 'parser/current'

class MyInterpreter
  # https://whitequark.github.io/ast/AST/Processor/Mixin.html
  include ::Parser::AST::Processor::Mixin

  def on_str(node)
    node.children.first
  end

  def on_int(node)
    node.children.first.to_i
  end

  def on_if(node)
    expression, truthy, falsey = *node.children
    if process(expression)
      process(truthy)
    else
      process(falsey)
    end
  end

  def on_true(_node)
    true
  end

  def on_false(_node)
    false
  end

  def on_lvar(node)
    # look up a variable by name=node.children.first
  end

  def on_send(node, &block)
    # allow things like ==, string methods? whatever
  end

  # ... etc
end

ast = Parser::CurrentRuby.parse(<<~RUBY)
  name == 'John' && adult
RUBY

MyInterpreter.new.process(ast)
# => true
The benefit here is that the parser and syntax are predetermined, and you can interpret only what you need to (and prevent malicious code from executing by controlling what on_send and on_const allow).
Templating
This is more markup-oriented and possibly doesn't apply, but you could find some use in a templating library, which parses expressions and evaluates them for you. Controlling which variables are supplied to the expressions is possible, depending on the library you use. The output of the expression could then be checked for truthiness.
Liquid
Jinja
Some thoughts and things you should consider.
1. Unified Expression Language (EL)
Another option is EL, specified as part of the JSP 2.1 standard (JSR-245). Official documentation.
They have some nice examples that can give you a good overview of the syntax. For example:
El Expression: `${100.0 == 100}` Result= `true`
El Expression: `${4 > 3}` Result= `true`
You can use this to evaluate small script-like expressions. And there are some implementations: Juel is one open source implementation of the EL language.
2. Audience and Security
All the answers recommend using various interpreters or parser generators, and all of them are valid ways to add functionality to process complex data. But I would like to add an important note here.
Every interpreter has a parser, and injection attacks target those parsers, tricking them into interpreting data as commands. You should have a clear understanding of how the interpreter's parser works, because that is the key to reducing the chances of a successful injection attack. Real-world parsers have many corner cases and flaws that may not match the spec, so be clear about the measures you take to mitigate possible flaws.
Even if your application is not facing the public, internal or external actors can still abuse this feature.
I'm building an app which has a feature for embedding expressions/rules in a config yaml file.
I'm looking for a parsing/DSL/rules-engine library that can support these types of expressions and normalize them. I'm open to using Ruby, JavaScript, Java, or Python if anyone knows of a library for those languages.
One possibility might be to embed a rule interpreter such as ClipsRules inside your application. You could then code your application in C++ (perhaps inspired by my clips-rules-gcc project) and link to it some C++ YAML library such as yaml-cpp.
Another approach could be to embed some Python interpreter inside a rule interpreter (perhaps the same ClipsRules) and some YAML library.
A third approach could be to use Guile (or SBCL or Javascript v8) and extend it with some "expert system shell".
Before starting to code, be sure to read several books such as the Dragon Book, the Garbage Collection handbook, Lisp In Small Pieces, Programming Language Pragmatics. Be aware of various parser generators such as ANTLR or GNU bison, and of JIT compilation libraries like libgccjit or asmjit.
You might need to contact a lawyer about legal compatibility of various open source licenses.
For a ClojureScript project I'm looking for a concise way to extract content from an external HTML document on the client side. The content is actually received via an ajax call in Markdown format, which is subsequently parsed to HTML. So an HTML string is the point of departure.
(def html-string "<p>Something, that <a>was</a> Markdown before</p>")
The libraries Enlive and Garden for instance use vectors to express CSS Selectors, which are needed here. Enlive has a front-end sister, called Enfocus, which provides similar semantics.
Here's an enfocus example which extracts some content from the current DOM:
(require '[enfocus.core :as ef])
(ef/from js/document.head :something [:title]
  (ef/get-text))
;;{:something "My Title"}
If there were more matches, the value of :something would become a vector. I could not figure out how to apply this function to arbitrary HTML strings. The closest I could get was by using this function:
(defn html->node [h]
  (doto (.createElement js/document "div")
    (aset "innerHTML" h)))
and then:
(ef/from (html->node html-string) :my-link [:a]
  (ef/get-text))
;;{:my-link "was"}
However, this is not quite clean, since now there's a div wrapping everything, which might cause trouble in some situations.
Inserting HTML content into a div causes your arbitrary HTML to be evaluated automatically. What you should do instead is parse your HTML string with something like this:
(defn parse-html
  "Parse an HTML string into a document"
  [html]
  (let [doc (.createHTMLDocument js/document.implementation "mydoc")]
    (set! (.-innerHTML (.-documentElement doc)) html)
    doc))
This is plain and simple ClojureScript; using the Google Closure library might be more elegant. If you want, you can dig into the goog.dom namespace or elsewhere in the Google Closure API.
Then you parse your HTML string:
(def html-doc (parse-html "<body><div><p>some of my stuff</p></div></body>"))
So you can call Enfocus on your document:
(ef/from html-doc :mystuff [:p] (ef/get-text))
Result:
{:mystuff "some of my stuff"}
I'm the author of doctest, quick and dirty doctests for JavaScript and CoffeeScript. I'd like to make the library less dirty by using a JavaScript parser rather than regular expressions to locate comments.
I'd like to use Esprima or Acorn to do the following:
Create an AST
Walk the tree, and for each comment node:
Create an AST from the comment node's text
Replace the comment node in the main tree with this subtree
Input:
!function() {
  // > toUsername("Jesper Nøhr")
  // "jespernhr"
  var toUsername = function(text) {
    return ('' + text).replace(/\W/g, '').toLowerCase()
  }
}()
Output:
!function() {
  doctest.input(function() {
    return toUsername("Jesper Nøhr")
  });
  doctest.output(4, function() {
    return "jespernhr"
  });
  var toUsername = function(text) {
    return ('' + text).replace(/\W/g, '').toLowerCase()
  }
}()
I don't know how to do this. Acorn provides a walker which takes a node type and a function, and walks the tree invoking the function each time a node of the specified type is encountered. This seems promising, but doesn't apply to comments.
With Esprima I can use esprima.parse(input, {comment: true, loc: true}).comments to get the comments, but I'm not sure how to update the tree.
Most AST-producing parsers throw away comments. I don't know what Esprima or Acorn do, but that might be the issue.
.... in fact, Esprima lists comment capture as a current bug:
http://code.google.com/p/esprima/issues/detail?id=197
... Acorn's code is right there in GitHub. It appears to throw comments away, too.
So, looks like you get to fix either parser to capture the comments first, at which point your task should be straightforward, or, you're stuck.
Our DMS Software Reengineering Toolkit has JavaScript parsers that capture comments in the tree. It also has language substring parsers that could be used to parse the comment text into JavaScript ASTs of whatever type the comment represents (e.g., function declaration, expression, variable declaration, ...), and the support machinery to graft such new ASTs into the main tree. If you are going to manipulate ASTs, this substring capability is likely important: most parsers won't parse arbitrary language fragments, they are wired only to parse "whole programs". For DMS, there are no comment nodes to replace; there are comments associated with AST nodes, so the grafting process is a little trickier than just "replace comment nodes". Still pretty easy.
I'll observe that most parsers (including these) read the source and break it into tokens by applying the equivalent of regular expressions. So, if you are already using regexes to locate comments (which also means using them to locate *non*-comments to throw away, e.g., you need to recognize string literals that contain comment-like text and ignore them), you are doing as well as the parsers would anyway in terms of finding the comments. And if all you want to do is replace them exactly with their content, echoing the source stream with the comment prefix/suffix /* */ stripped will apparently do exactly what you want, so all this parsing machinery seems like overkill.
You can already use Esprima to achieve what you want:
Parse the code, get the comments (as an array).
Iterate over the comments, see if each is what you are interested in.
If you need to transform the comment, note its range. Collect all transformations.
Apply the transformation back-to-first so that the ranges are not shifted.
The trick here is not to change the AST. Simply apply the text changes as if you were doing a typical search-and-replace on the source string directly. Because the position of a replacement might shift, you need to collect everything and then apply the changes from the last one. For an example of how to carry out such a transformation, take a look at my blog post "From double-quotes to single-quotes" (it deals with string quotes but the principle remains the same).
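A rough sketch of that flow (my illustration, assuming Esprima 4.x, where parsing with comment: true and range: true yields a comments array whose entries carry source ranges):
const esprima = require('esprima');

// Rewrite comments by replacing their source ranges, applied last-to-first
// so that earlier offsets stay valid.
function rewriteComments(source, transform) {
  const comments = esprima.parseScript(source, { comment: true, range: true }).comments;
  const edits = [];
  for (const c of comments) {
    const replacement = transform(c);   // return null/undefined to leave the comment alone
    if (replacement != null) edits.push({ range: c.range, text: replacement });
  }
  edits.sort((a, b) => b.range[0] - a.range[0]);
  let out = source;
  for (const e of edits) {
    out = out.slice(0, e.range[0]) + e.text + out.slice(e.range[1]);
  }
  return out;
}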
Last but not least, you might want to use a slightly higher-level utility such as Rocambole.
I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.
My specific questions are:
As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible because I am relying on creating the HTML from a parse tree and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:
[
  [
    "[",
    "img",
    "",
    [
      "/",
      "f",
      "o",
      "o",
      "/",
      "b",
      "a",
      "r"
    ],
    "",
    "]"
  ],
  [
    "[/",
    "img",
    "]"
  ]
]
given a grammar like:
document
  = (open_tag / close_tag / new_line / text)*

open_tag
  = ("[" tag_name "="? tag_data? tag_attributes? "]")

close_tag
  = ("[/" tag_name "]")

text
  = non_tag+

non_tag
  = [^\n\[\]]

new_line
  = ("\r\n" / "\n")
I'm abbreviating the grammar, of course, but you get the idea. So, if you notice, there is no contextual information in the array of arrays that tells me what kind of a node I have, and I'm left to do the string comparisons again even though the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.
Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?
Thanks
First question (grammar for incomplete texts):
You can add
incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
// the closing bracket is omitted ---^
after open_tag and change document to include an incomplete tag at the end. The trick is that you provide the parser with all needed productions to always parse, but the valid ones come first. You then can ignore incomplete_tag during the live preview.
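For example (a sketch reusing the rule names from the question's grammar), the document rule could then end with an optional incomplete tag:
document
  = (open_tag / close_tag / new_line / text)* incomplete_tag?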
Second question (how to include actions):
You write so-called actions after expressions. An action is JavaScript code enclosed by braces and is allowed after any pegjs expression, i.e. also in the middle of a production!
In practice, actions like { return result.join("") } are almost always necessary because pegjs splits matches into single characters. Complicated nested arrays can also be returned. Therefore I usually write helper functions in the pegjs initializer at the head of the grammar to keep the actions small. If you choose the function names carefully, the actions are self-documenting.
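A small sketch of what that can look like (the rule and helper names here are made up for illustration):
{
  // initializer: helpers defined here are visible to all actions below
  function join(chars) { return chars.join(""); }
}

tag_name
  = chars:[a-z]+ { return join(chars); }

open_tag
  = "[" name:tag_name "]" { return { type: "open", name: name }; }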
For an example see PEG for Python style indentation. Disclaimer: this is an answer of mine.
Regarding your first question, I have to say that a live preview is going to be difficult. The problems you pointed out, namely that the parser won't understand that the input is "work in progress", are correct. Peg.js tells you at which point the error is, so maybe you could take that info and go a few words back and parse again, or if an end tag is missing, try adding it at the end.
The second part of your question is easier but your grammar won't look so nice afterwards. Basically what you do is put callbacks on every rule, so for example
text
  = text:non_tag+ {
      // we captured the text in an array and can manipulate it now
      return text.join("");
    }
At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pull request to peg.js to fix that. But I'm not sure when I'll find the time to do this.
Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.
text
  = result:non_tag+ { return result.join(''); }