javascript parser for specific purpose - javascript

I am trying to create a tool that looks for missing translations in .html files. Some of our translations are done at runtime in JS code. I would like to map these together. Below is an example.
<select id="dropDown"></select>
// js
bindings: {
    "dropDown": function() {
        translate(someValue);
        // translate then set option
    }
}
Above you can see that I have a drop-down whose values are created and translated at runtime. I was thinking that an AST would be the right way to accomplish this. Basically I need to go through the .html file looking for tags that are missing in-line translations (done using {{t value}}) and search the corresponding .js file for runtime translations. Is there a better way to accomplish this? Any advice on a tool for creating the AST?
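(For reference, here is a rough sketch of the kind of check I have in mind, written for Node.js with the cheerio, acorn and acorn-walk packages -- the package choices, file names and the "binding key equals element id" convention are just illustrative assumptions, not something we actually have:)

// sketch.js -- illustration only
const fs = require('fs');
const cheerio = require('cheerio');
const acorn = require('acorn');
const walk = require('acorn-walk');

const html = fs.readFileSync('page.html', 'utf8');
const js = fs.readFileSync('page.js', 'utf8');

// 1. Collect <select> ids that have no inline {{t ...}} translation.
const $ = cheerio.load(html);
const untranslatedIds = [];
$('select').each((_, el) => {
  const id = $(el).attr('id');
  const hasInline = /\{\{\s*t\b/.test($.html(el)); // crude check for {{t value}}
  if (id && !hasInline) untranslatedIds.push(id);
});

// 2. Collect object-literal keys from the JS file (e.g. the keys under "bindings").
const ast = acorn.parse(js, { ecmaVersion: 2020 });
const jsKeys = new Set();
walk.simple(ast, {
  Property(node) {
    if (node.key.type === 'Literal') jsKeys.add(node.key.value);
    if (node.key.type === 'Identifier') jsKeys.add(node.key.name);
  }
});

// 3. Anything with neither an inline translation nor a runtime binding is suspect.
for (const id of untranslatedIds) {
  if (!jsKeys.has(id)) console.log('No runtime translation found for #' + id);
}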

I think you want to hunt for patterns in the code. In particular, I think you want to determine, for each HTML select construct, if there is a corresponding, properly shaped JavaScript fragment with the right embedded ID name.
You can indeed do that with an AST. In your case, you need an AST for the HTML file (nodes are essentially HTML tags) with sub-ASTs for the script chunks (<script> tags) containing the parsed JavaScript.
For this you need two parsers: one to parse the HTML (nasty, just because HTML is a mess), producing a tree containing script nodes that hold just text. Then you need a JavaScript parser that you can apply to the text blobs under the script tags to produce JavaScript ASTs; ideally, you splice these into the HTML tree to replace the text-blob nodes. Now you have a mixed tree, with some nodes being HTML and some subtrees being JavaScript. Ideally the HTML nodes are marked as HTML, and the JavaScript nodes are marked as JavaScript.
Now you can search the tree for select nodes, pick up the id, and then search all the JavaScript subtrees for the expected structure.
You can code the matching procedurally, but it will be messy:
for all nodes
    if node is HTML and nodetype is Select
    then
        functionname = node.getchild("ID").text
        for all nodes
            if node is JavaScript and node.parent is HTML and nodetype is pair
            then if node.getchild(left).text is "bindings"
                 then if node.getchild(right) = structure
                      then ... (lots more ...)
There's a lot of hair in this. Technically it's just sweat. You have to know (and encode) the precise detail of the tree, climbing up and down its links correctly and checking the node types one by one. If the grammar changes a tiny bit, this code breaks too; it knows too much about the grammar.
You can finish this by coding your own parsers from scratch. Lots more sweat.
There are tools that can make this a lot easier; see Program Transformation Systems. Such tools let you define language grammars, and they generate parsers and AST builders for such grammars. (As a general rule, they are pretty good at defining working grammars, because they are designed to be applied to many languages). That at least puts lots of structure into the process, and they provide a lot of machinery to make this work. But the good part is that you can express patterns, usually in source language surface syntax, that can make this a lot easier to express.
One of these tools is our DMS Software Reengineering Toolkit (I'm the architect).
DMS has dirty HTML and full JavaScript parsers already, so those don't need to be built.
You would have to write a bit of code for DMS to invoke the HTML parser, find the subtree for script nodes, and apply the JavaScript parser. DMS makes this practical by allowing you to parse a blob of text as an arbitrary nonterminal in a grammar; in this case, you want to parse those blobs as an expression nonterminal.
With all that in place, you can now write patterns that will support the check:
pattern select_node(property: string): HTML~dirty.HTMLform =
    " <select ID=\property></select> ";

pattern script(code: string): HTML~dirty.HTMLform =
    " <script>\code</script> ";

pattern js_bindings(s: string, e: expression): JavaScript.expression =
    " bindings : { \s : function ()
          { translate(\e);
          }
      } ";
While these patterns look like text, they are parsed by DMS into ASTs with placeholder nodes for the parameter-list elements, denoted by "\nnnn" inside the (meta)quotes "..." that surround the program text of interest. Such AST patterns can be pattern-matched against ASTs; they match if the pattern tree matches, and the pattern-variable leaves are then captured as bindings.
(See Registry:PatternMatch below, and the resulting match argument with slots matched (a boolean) and bindings (an array of bound subtrees resulting from the match).) A big win for the tool builder: he doesn't have to know much about the fine detail of the grammar, because he writes the pattern, and the tool produces all the tree nodes for him, implicitly.
With these patterns, you can write procedural PARLANSE (DMS's Lisp-style programming language) code to implement the check (liberties taken to shorten the presentation):
(;; `Parse HTML file':
    (= HTML_tree (HTML:ParseFile .... ))
    `Find script nodes and replace by ASTs for same':
    (AST:FindAllMatchingSubtrees HTML_tree
       (lambda (function boolean [html_node AST:Node])
          (let (= [match Registry:Match]
                  (Registry:PatternMatch html_node "script"))
             (ifthenelse match:boolean
                (value (;; (AST:ReplaceNode node
                              (JavaScript:ParseStream
                                 "expression" ; desired nonterminal
                                 (make Streams:Stream
                                    (AST:GetString match:bindings:1))))
                       );;
                       ~f ; false: don't visit subtree
                )value
                ~t ; true: continue scanning into subtree
             )ifthenelse
          )let
       )lambda )
    `Now find select nodes, and check sanity':
    (AST:FindAllMatchingSubtrees HTML_tree
       (lambda (function boolean [html_node AST:Node])
          (let (;; (= [select_match Registry:Match] ; capture match data
                      (Registry:PatternMatch "select" html_node)) ; hunt for this pattern
                   [select_function_name string]
               );;
             (ifthenelse select_match:boolean
                (value (;; `Found <select> node.
                            Get name of function...':
                           (= select_function_name
                              (AST:GetString select_match:bindings:1))
                           `... and search for matching script fragment':
                           (ifthen
                              (~ (AST:FindFirstMatchingSubtree HTML_tree
                                    (lambda (function boolean [js_node AST:Node])
                                       (let (;; (= [match Registry:Match] ; capture match data
                                                   (Registry:PatternMatch js_node "js_bindings")) ; hunt for this pattern
                                             (&& match:boolean
                                                 (== select_match:bindings:1
                                                     select_function_name)
                                             )&& ; is true if we found matching function
                                       )let
                                    )lambda ) )~
                              (;; `Complain if we can't find matching script fragment'
                                  (Format:SNN `Select #S with missing translation at line #D column #D'
                                      select_function_name
                                      (AST:GetLineNumber select_match:source_position)
                                      (AST:GetColumnNumber select_match:source_position)
                                  )
                              );;
                           )ifthen
                       );;
                       ~f ; don't visit subtree
                )value
                ~t ; continue scanning into subtree
             )ifthenelse
          )let
       )lambda )
);;
This procedural code first parses an HTML source file, producing an HTML tree. All these nodes are stamped as being from the "HTML~dirty" language.
It then scans that tree to find SCRIPT nodes, and replaces them with an AST obtained from a JavaScript-expression parse of the text content of the script nodes encountered. Finally, it finds all SELECT nodes, picks out the name of the function mentioned in the ID clause, and checks all JavaScript ASTs for a matching "bindings" expression as specified by the OP. All of this leans on the pattern matching machinery, which in turn leans on the low-level AST library that provides a variety of means to inspect/navigate/change tree nodes.
I've left out some detail (esp. error handling code) and obviously haven't tested this. But this gives the flavor of how you might do it with DMS.
Similar patterns and matching processes are available in other program transformation systems.

Related

Does `elt.outerHTML` really capture the whole html string representation accurately?

This question stems from How to get html string representation for all types of Dom nodes?
Question: does elt.outerHTML really capture the whole html string representation accurately?
(I think) I know how to get the html string representation for Node.TEXT_NODE -- textContent -- and Node.ELEMENT_NODE -- outerHTML;
but does this really accurately capture all the html string of such a node?
For what I mean by "all", think about the following cases:
i.e.: is there any special case where I call elt.outerHTML and get back an html string that has less data (less html code) than the elt actually has?
i.e./e.g.: would there be a special case where:
you have <div>a special node contains special <&&xx some hypothetical special node syntax, which may not be captured xx&&> data</div>
and if I call elt.outerHTML, I only get back <div>a special node contains special data</div>?
i.e./e.g.: if I call node.textContent on a Node.ELEMENT_NODE, it returns only the text content.
i.e./e.g.: if I call node.textContent on a Node.COMMENT_NODE, it returns only the text content, but without the <!-- -->.
What about node types other than Node.ELEMENT_NODE?
Clarity
What exactly am I trying to do?
There is no specific purpose
-- it is just a common operation: you want to get the exact html string representation of a node
-- a programmer needs the exact, accurate input data to do their job.
And so I ask:
when I access a property like .outerHTML / .textContent, do I get the exact, accurate input data I want? In what cases won't I?
Simple question (though I know the answer/reason may be complex).
And I provided examples above to show what I mean.
What I am trying to do is (if I really have to be more specific, though it's still just a general operation):
I am given an html file (or a DOM html element);
I can get all the nodes in that html file (or element);
I want to accurately get the html string of those nodes;
with those html strings, I do some business logic -- I am adding additional info to them; there is no deletion of any original data / html string;
then I put the modified html back into that node.
The node should now contain no less than the original data.
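(For concreteness, the cases above can be checked in a browser console -- a quick sketch of the behaviors in question; the markup is made up:)

const div = document.createElement('div');
div.innerHTML = 'hello <!-- note --> <b>world</b>';
console.log(div.outerHTML);    // "<div>hello <!-- note --> <b>world</b></div>" -- comments are serialized
console.log(div.textContent);  // "hello  world" -- tags and comments are dropped

const comment = document.createComment('note');
console.log(comment.textContent); // "note" -- without the <!-- --> delimiters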

Creating a DSL expressions parser / rules engine

I'm building an app which has a feature for embedding expressions/rules in a config yaml file. So, for example, a user can reference a variable defined in the yaml file like ${variables.name == 'John'} or ${is_equal(variables.name, 'John')}. I can probably get by with simple expressions, but I want to support complex rules/expressions such as ${variables.name == 'John'} and (${variables.age > 18} OR ${variables.adult == true})
I'm looking for a parsing/dsl/rules-engine library that can support these types of expressions and normalize them. I'm open to using ruby, javascript, java, or python if anyone knows of a library for those languages.
One option I thought of was to just support javascript as conditions/rules and basically pass it through eval with the right context set up, with access to variables and other reference-able vars.
I don't know if you use Golang or not, but if you do, I recommend https://github.com/antonmedv/expr.
I have used it for parsing a bot strategy (a stock options bot). This is from my test unit:
func TestPattern(t *testing.T) {
    a := "pattern('asdas asd 12dasd') && lastdigit(23asd) < sma(50) && sma(14) > sma(12) && ( macd(5,20) > macd_signal(12,26,9) || macd(5,20) <= macd_histogram(12,26,9) )"
    r, _ := regexp.Compile(`(\w+)(\s+)?[(]['\d.,\s\w]+[)]`)
    indicator := r.FindAllString(a, -1)
    t.Logf("%v\n", indicator)
    t.Logf("%v\n", len(indicator))
    for _, i := range indicator {
        t.Logf("%v\n", i)
        if strings.HasPrefix(i, "pattern") {
            r, _ = regexp.Compile(`pattern(\s+)?\('(.+)'\)`)
            check1 := r.ReplaceAllString(i, "$2")
            t.Logf("%v\n", check1)
            r, _ = regexp.Compile(`[^du]`)
            check2 := r.FindAllString(check1, -1)
            t.Logf("%v\n", len(check2))
        } else if strings.HasPrefix(i, "lastdigit") {
            r, _ = regexp.Compile(`lastdigit(\s+)?\((.+)\)`)
            args := r.ReplaceAllString(i, "$2")
            r, _ = regexp.Compile(`[^\d]`)
            parameter := r.FindAllString(args, -1)
            t.Logf("%v\n", parameter)
        } else {
        }
    }
}
Combine it with regex and you have a good (if not great) string translator.
And for Java, I personally use https://github.com/ridencww/expression-evaluator, though not in production. It has similar features to the library above.
It supports many conditions and you don't have to worry about parentheses and brackets.
Assignment    =
Operators     + - * / DIV MOD % ^
Logical       < <= == != >= > AND OR NOT
Ternary       ? :
Shift         << >>
Property      ${<id>}
DataSource    #<id>
Constants     NULL PI
Functions     CLEARGLOBAL, CLEARGLOBALS, DIM, GETGLOBAL, SETGLOBAL, NOW, PRECISION
Hope it helps.
You might be surprised to see how far you can get with a syntax parser and 50 lines of code!
Check this out. The Abstract Syntax Tree (AST) on the right represents the code on the left in nice data structures. You can use these data structures to write your own simple interpreter.
I wrote a little example of one:
https://codesandbox.io/s/nostalgic-tree-rpxlb?file=/src/index.js
Open up the console (button in the bottom), and you'll see the result of the expression!
This example can only handle (||) and (>), but looking at the code (line 24), you can see how you could make it support any other JS operator. Just add a case to the branch, evaluate the sides, and do the calculation in JS.
Parentheses and operator precedence are all handled by the parser for you.
I'm not sure if this is the solution for you, but it will for sure be fun ;)
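For a rough idea of what such an interpreter looks like, here is a sketch using the acorn parser (my choice for illustration; any ESTree-producing parser works the same way) that handles just enough node types for the rules in the question:

const acorn = require('acorn');

// Evaluate a restricted expression AST against a context object.
// Sketch only: extend the switches to support more node types and operators.
function evaluate(node, context) {
  switch (node.type) {
    case 'Literal':
      return node.value;
    case 'Identifier':
      return context[node.name];
    case 'MemberExpression': // e.g. variables.age
      return evaluate(node.object, context)[node.property.name];
    case 'LogicalExpression': {
      const left = evaluate(node.left, context);
      return node.operator === '&&'
        ? left && evaluate(node.right, context)
        : left || evaluate(node.right, context);
    }
    case 'BinaryExpression': {
      const l = evaluate(node.left, context);
      const r = evaluate(node.right, context);
      switch (node.operator) {
        case '>': return l > r;
        case '<': return l < r;
        case '==': return l == r;
        default: throw new Error('Unsupported operator: ' + node.operator);
      }
    }
    default:
      throw new Error('Unsupported node type: ' + node.type);
  }
}

const source = "variables.name == 'John' && (variables.age > 18 || variables.adult)";
const expr = acorn.parse(source, { ecmaVersion: 2020 }).body[0].expression;
console.log(evaluate(expr, { variables: { name: 'John', age: 17, adult: true } })); // true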
One option I thought of was to just support javascript as conditions/rules and basically pass it through eval with the right context setup with access to variables and other reference-able vars.
I would personally lean towards something like this. If you are getting into complexities such as logic comparisons, a DSL can become a beast since you are basically almost writing a compiler and a language at that point. You might want to just not have a config, and instead have the configurable file just be JavaScript (or whatever language) that can then be evaluated and then loaded. Then whoever your target audience is for this "config" file can just supplement logical expressions as needed.
The only reason I would not do this is if this configuration file was being exposed to the public or something, but in that case security for a parser would also be quite difficult.
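If you do go that way, the whole thing can stay tiny. Here is a sketch of the eval-style approach (with the usual caveat that this executes whatever is in the rule string, so it is only acceptable for trusted input):

// Build a function from the rule text and hand it the context explicitly.
function evaluateRule(ruleSource, context) {
  const fn = new Function('variables', 'return (' + ruleSource + ');');
  return fn(context.variables);
}

const rule = "variables.name == 'John' && (variables.age > 18 || variables.adult == true)";
console.log(evaluateRule(rule, { variables: { name: 'John', age: 20, adult: false } })); // true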
I did something like that once; you can probably pick it up and adapt it to your needs.
TL;DR: thanks to Python's eval, doing this is a breeze.
The problem was to parse dates and durations in textual form. What I did was to create a yaml file mapping regex patterns to results. The mapping itself was a Python expression that would be evaluated with the match object, and it had access to other functions and variables defined elsewhere in the file.
For example, the following self-contained snippet would recognize times like "l'11 agosto del 1993" (Italian for "August 11th, 1993").
__meta_vars__:
  month: (gennaio|febbraio|marzo|aprile|maggio|giugno|luglio|agosto|settembre|ottobre|novembre|dicembre)
  prep_art: (il\s|l\s?'\s?|nel\s|nell\s?'\s?|del\s|dell\s?'\s?)
  schema:
    date: http://www.w3.org/2001/XMLSchema#date
__meta_func__:
  - >
    def month_to_num(month):
        """ gennaio -> 1, febbraio -> 2, ..., dicembre -> 12 """
        try:
            return index_in_or(meta_vars['month'], month) + 1
        except ValueError:
            return month
Tempo:
  - \b{prep_art}(?P<day>\d{{1,2}}) (?P<month>{month}) {prep_art}?\s*(?P<year>\d{{4}}): >
      '"{}-{:02d}-{:02d}"^^<{schema}>'.format(match.group('year'),
                                              month_to_num(match.group('month')),
                                              int(match.group('day')),
                                              schema=schema['date'])
__meta_func__ and __meta_vars__ (not the best names, I know) define functions and variables that are accessible to the match transformation rules. To make the rules easier to write, the pattern is formatted using the meta-variables, so that {month} is replaced with the pattern matching all months. The transformation rule calls the meta-function month_to_num to convert the month to a number from 1 to 12, and reads from the schema meta-variable. In the example above, the match results in the string "1993-08-11"^^<http://www.w3.org/2001/XMLSchema#date>, but some other rules would produce a dictionary.
Doing this is quite easy in Python, as you can use exec to evaluate strings as Python code (obligatory warning about security implications). The meta-functions and meta-variables are evaluated and stored in a dictionary, which is then passed to the match transformation rules.
The code is on github, feel free to ask any questions if you need clarifications. Relevant parts, slightly edited:
class DateNormalizer:
    def _meta_init(self, specs):
        """ Reads the meta variables and the meta functions from the specification
        :param dict specs: The specifications loaded from the file
        :return: None
        """
        self.meta_vars = specs.pop('__meta_vars__')
        # compile meta functions in a dictionary
        self.meta_funcs = {}
        for f in specs.pop('__meta_funcs__'):
            exec f in self.meta_funcs
        # make meta variables available to the meta functions just defined
        self.meta_funcs['__builtins__']['meta_vars'] = self.meta_vars
        self.globals = self.meta_funcs
        self.globals.update(self.meta_vars)

    def normalize(self, expression):
        """ Find the first matching part in the given expression
        :param str expression: The expression in which to search the match
        :return: Tuple with (start, end), category, result
        :rtype: tuple
        """
        expression = expression.lower()
        for category, regexes in self.regexes.iteritems():
            for regex, transform in regexes:
                match = regex.search(expression)
                if match:
                    result = eval(transform, self.globals, {'match': match})
                    start, end = match.span()
                    return (first_position + start, first_position + end), category, result
Here are some categorized Ruby options and resources:
Insecure
Pass expression to eval in the language of your choice.
It must be mentioned that eval is technically an option, but extraordinary trust must exist in its inputs and it is safer to avoid it altogether.
Heavyweight
Write a parser for your expressions and an interpreter to evaluate them
A cost-intensive solution would be implementing your own expression language. That is, to design a lexicon for your expression language, implement a parser for it, and an interpreter to execute the code that's parsed.
Some Parsing Options (ruby)
Parslet
TreeTop
Citrus
Roll-your-own with StringScanner
Medium Weight
Pick an existing language to write expressions in and parse / interpret those expressions.
This route assumes you can pick a known language to write your expressions in. The benefit is that a parser likely already exists for that language to turn it into an Abstract Syntax Tree (data structure that can be walked for interpretation).
A ruby example with the Parser gem
require 'parser'

class MyInterpreter
  # https://whitequark.github.io/ast/AST/Processor/Mixin.html
  include ::Parser::AST::Processor::Mixin

  def on_str(node)
    node.children.first
  end

  def on_int(node)
    node.children.first.to_i
  end

  def on_if(node)
    expression, truthy, falsey = *node.children
    if process(expression)
      process(truthy)
    else
      process(falsey)
    end
  end

  def on_true(_node)
    true
  end

  def on_false(_node)
    false
  end

  def on_lvar(node)
    # lookup a variable by name=node.children.first
  end

  def on_send(node, &block)
    # allow things like ==, string methods? whatever
  end

  # ... etc
end

ast = Parser::CurrentRuby.parse(<<~RUBY)
  name == 'John' && adult
RUBY

MyInterpreter.new.process(ast)
# => true
The benefit here is that a parser and syntax are predetermined and you can interpret only what you need to (and prevent malicious code from executing by controlling what on_send and on_const allow).
Templating
This is more markup-oriented and possibly doesn't apply, but you could find some use in a templating library, which parses expressions and evaluates for you. Control and supplying variables to the expressions would be possible depending on the library you use for this. The output of the expression could be checked for truthiness.
Liquid
Jinja
Some thoughts and things you should consider.
1. Unified Expression Language (EL)
Another option is EL, specified as part of the JSP 2.1 standard (JSR-245). Official documentation.
They have some nice examples that can give you a good overview of the syntax. For example:
EL expression: `${100.0 == 100}` Result = `true`
EL expression: `${4 > 3}` Result = `true`
You can use this to evaluate small script-like expressions. And there are some implementations: Juel is one open source implementation of the EL language.
2. Audience and Security
All the answers recommend using different interpreters or parser generators. And all are valid ways to add functionality to process complex data. But I would like to add an important note here.
Every interpreter has a parser, and injection attacks target those parsers, tricking them into interpreting data as commands. You should have a clear understanding of how the interpreter's parser works, because that's the key to reducing the chances of a successful injection attack. Real-world parsers have many corner cases and flaws that may not match the spec. And have clear measures in place to mitigate possible flaws.
And even if your application is not facing the public, you can have external or internal actors that can abuse this feature.
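As a contrived illustration of that point (a sketch, not taken from any particular library; JavaScript's eval stands in for whatever evaluator you embed), a "rule" that looks like data can smuggle in extra behavior:

const variables = { name: 'John' };
// The tail of the rule is attacker-controlled code, not a comparison.
const rule = "variables.name == 'John', console.log('injected: arbitrary code ran here'), true";
console.log(eval(rule)); // logs the injected message, then evaluates to true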
I'm building an app which has a feature for embedding expressions/rules in a config yaml file.
I'm looking for a parsing/dsl/rules-engine library that can support these type of expressions and normalize it. I'm open using ruby, javascript, java, or python if anyone knows of a library for that languages.
One possibility might be to embed a rule interpreter such as ClipsRules inside your application. You could then code your application in C++ (perhaps inspired by my clips-rules-gcc project) and link it to some C++ YAML library such as yaml-cpp.
Another approach could be to embed some Python interpreter inside a rule interpreter (perhaps the same ClipsRules) and some YAML library.
A third approach could be to use Guile (or SBCL or Javascript v8) and extend it with some "expert system shell".
Before starting to code, be sure to read several books such as the Dragon Book, the Garbage Collection handbook, Lisp In Small Pieces, Programming Language Pragmatics. Be aware of various parser generators such as ANTLR or GNU bison, and of JIT compilation libraries like libgccjit or asmjit.
You might need to contact a lawyer about legal compatibility of various open source licenses.

Extract elements from html-string in ClojureScript

For a ClojureScript project I'm looking for a concise way to extract content from an external HTML document on the client side. The content is actually received via an ajax call in Markdown format, which is subsequently parsed to HTML. So an HTML string is the point of departure.
(def html-string "<p>Something, that <a>was</a> Markdown before</p>")
The libraries Enlive and Garden for instance use vectors to express CSS Selectors, which are needed here. Enlive has a front-end sister, called Enfocus, which provides similar semantics.
Here's an enfocus example which extracts some content from the current DOM:
(require '[enfocus.core :as ef])
(ef/from js/document.head :something [:title]
(ef/get-text))
;;{:something "My Title"}
If there were more matches, the value of :something would become a vector. I could not figure out how to apply this function to arbitrary HTML strings. The closest I could get was by using this function:
(defn html->node [h]
  (doto (.createElement js/document "div")
    (aset "innerHTML" h)))
and then:
(ef/from (html->node html-string) :my-link [:a]
(ef/get-text))
;;{:my-link "was"}
However, this is not quite clean, since now there's a div wrapping everything, which might cause trouble in some situations.
Inserting some HTML content into a div makes your arbitrary HTML automatically evaluated. What you should do instead is to parse your HTML string with something like this.
(defn parse-html [html]
  "Parse an html string into a document"
  (let [doc (.createHTMLDocument js/document.implementation "mydoc")]
    (set! (.-innerHTML doc.documentElement) html)
    doc))
This is plain and simple ClojureScript. Using the Google Closure library might be more elegant; if you want, you can dig into the goog.dom namespace or someplace else inside the Google Closure API.
Then you parse your HTML string:
(def html-doc (parse-html "<body><div><p>some of my stuff</p></div></body>"))
So you can call Enfocus on your document:
(ef/from html-doc :mystuff [:p] (ef/get-text))
Result:
{:mystuff "some of my stuff"}
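For comparison, the same extraction in plain JavaScript (a sketch using the standard DOMParser API, which is close to what createHTMLDocument does above) would be:

const htmlString = "<p>Something, that <a>was</a> Markdown before</p>";
const doc = new DOMParser().parseFromString(htmlString, "text/html");
console.log(doc.querySelector("a").textContent); // "was"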

Replace comment in JavaScript AST with subtree derived from the comment's content

I'm the author of doctest, quick and dirty doctests for JavaScript and CoffeeScript. I'd like to make the library less dirty by using a JavaScript parser rather than regular expressions to locate comments.
I'd like to use Esprima or Acorn to do the following:
Create an AST
Walk the tree, and for each comment node:
Create an AST from the comment node's text
Replace the comment node in the main tree with this subtree
Input:
!function() {
  // > toUsername("Jesper Nøhr")
  // "jespernhr"
  var toUsername = function(text) {
    return ('' + text).replace(/\W/g, '').toLowerCase()
  }
}()
Output:
!function() {
  doctest.input(function() {
    return toUsername("Jesper Nøhr")
  });
  doctest.output(4, function() {
    return "jespernhr"
  });
  var toUsername = function(text) {
    return ('' + text).replace(/\W/g, '').toLowerCase()
  }
}()
I don't know how to do this. Acorn provides a walker which takes a node type and a function, and walks the tree invoking the function each time a node of the specified type is encountered. This seems promising, but doesn't apply to comments.
With Esprima I can use esprima.parse(input, {comment: true, loc: true}).comments to get the comments, but I'm not sure how to update the tree.
Most AST-producing parsers throw away comments. I don't know what Esprima or Acorn do, but that might be the issue.
.... in fact, Esprima lists comment capture as a current bug:
http://code.google.com/p/esprima/issues/detail?id=197
... Acorn's code is right there in GitHub. It appears to throw comments away, too.
So, looks like you get to fix either parser to capture the comments first, at which point your task should be straightforward, or, you're stuck.
Our DMS Software Reengineering Toolkit has JavaScript parsers that capture comments, in the tree. It also has language substring parsers, that could be used to parse the comment text into JavaScript ASTs of whatever type the comment represents (e.g, function declaration, expression, variable declaration, ...), and the support machinery to graft such new ASTs into the main tree. If you are going to manipulate ASTs, this substring capability is likely important: most parsers won't parse arbitrary language fragments, they are wired only to parse "whole programs". For DMS, there are no comment nodes to replace; there are comments associated with ASTs nodes, so the grafting process is a little trickier than just "replace comment nodes". Still pretty easy.
I'll observe that most parsers (including these) read the source and break it into tokens by using or applying the equivalent of regular expressions. So, if you are already using these to locate comments (that means using them to locate *non*-comments to throw away as well, e.g., you need to recognize string literals that contain comment-like text and ignore them), you are doing as well as the parsers would do anyway in terms of finding the comments. And if all you want to do is to replace them exactly with their content, echoing the source stream with the comment prefix/suffix /* */ stripped will do apparently exactly what you want, so all this parsing machinery seems like overkill.
You can already use Esprima to achieve what you want:
Parse the code, get the comments (as an array).
Iterate over the comments, see if each is what you are interested in.
If you need to transform the comment, note its range. Collect all transformations.
Apply the transformation back-to-first so that the ranges are not shifted.
The trick here is not to change the AST. Simply apply the text change as if you were doing a typical search-and-replace on the source string directly. Because the position of the replacement might shift, you need to collect everything and then do it from the last one. For an example of how to carry out such a transformation, take a look at my blog post "From double-quotes to single-quotes" (it deals with string quotes, but the principle remains the same).
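A minimal sketch of those four steps (the file name, the comment filtering and the generated doctest.input call are simplified placeholders, not the real doctest transformation):

const fs = require('fs');
const esprima = require('esprima');

let source = fs.readFileSync('example.js', 'utf8');

// 1. Parse, keeping comments and their character ranges.
const ast = esprima.parse(source, { comment: true, range: true, loc: true });

// 2. Pick the comments of interest (here: line comments starting with ">").
const targets = ast.comments.filter(c => c.type === 'Line' && c.value.trim().startsWith('>'));

// 3. Build replacements, remembering the original ranges.
const edits = targets.map(c => ({
  range: c.range,
  text: 'doctest.input(function() { return ' + c.value.trim().slice(1).trim() + ' });'
}));

// 4. Apply them last-to-first so earlier ranges stay valid.
edits.sort((a, b) => b.range[0] - a.range[0]);
for (const edit of edits) {
  source = source.slice(0, edit.range[0]) + edit.text + source.slice(edit.range[1]);
}
console.log(source);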
Last but not least, you might want to use a slightly higher-level utility such as Rocambole.

Using PEG Parser for BBCode Parsing: pegjs or ... what?

I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.
My specific questions are:
As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible because I am relying on creating the HTML from a parse tree and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:
[
  [
    "[",
    "img",
    "",
    [
      "/",
      "f",
      "o",
      "o",
      "/",
      "b",
      "a",
      "r"
    ],
    "",
    "]"
  ],
  [
    "[/",
    "img",
    "]"
  ]
]
given a grammar like:
document
  = (open_tag / close_tag / new_line / text)*

open_tag
  = ("[" tag_name "="? tag_data? tag_attributes? "]")

close_tag
  = ("[/" tag_name "]")

text
  = non_tag+

non_tag
  = [^\n\[\]]

new_line
  = ("\r\n" / "\n")
I'm abbreviating the grammar, of course, but you get the idea. So, if you notice, there is no contextual information in the array of arrays that tells me what kind of node I have, and I'm left to do the string comparisons again even though the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.
Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?
Thanks
First question (grammar for incomplete texts):
You can add
incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
// the closing bracket is omitted ---^
after open_tag and change document to include an incomplete tag at the end. The trick is that you provide the parser with all needed productions to always parse, but the valid ones come first. You then can ignore incomplete_tag during the live preview.
Second question (how to include actions):
You write so-called actions after expressions. An action is JavaScript code enclosed by braces and is allowed after a pegjs expression, i.e. also in the middle of a production!
In practice, actions like { return result.join("") } are almost always necessary because pegjs splits into single characters. Also, complicated nested arrays can be returned. Therefore I usually write helper functions in the pegjs initializer at the head of the grammar to keep actions small. If you choose the function names carefully, the action is self-documenting.
For an example see PEG for Python style indentation. Disclaimer: this is an answer of mine.
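To make the initializer-plus-actions idea concrete, here is a tiny sketch in PEG.js syntax (the rule and helper names are made up for illustration): the block at the top is plain JavaScript visible to every action, and the action on open_tag returns a labelled node instead of a raw character array.

{
  // Initializer: helper functions shared by all actions below.
  function node(type, props) { return Object.assign({ type: type }, props); }
}

open_tag
  = "[" name:tag_name "]" { return node("open_tag", { name: name }); }

tag_name
  = chars:[a-z]+ { return chars.join(""); }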
Regarding your first question, I have to say that a live preview is going to be difficult. The problems you pointed out regarding the parser not understanding that the input is "work in progress" are correct. Peg.js tells you at which point the error is, so maybe you could take that info and go a few words back and parse again, or, if an end tag is missing, try adding it at the end.
The second part of your question is easier, but your grammar won't look so nice afterwards. Basically what you do is put callbacks on every rule, so for example:
text
  = text:non_tag+ {
      // we captured the text in an array and can manipulate it now
      return text.join("");
    }
At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pull request to peg.js to fix that. But I'm not sure when I'll find the time to do this.
Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.
text
  = result:non_tag+ { return result.join(''); }
