Creating a DSL expressions parser / rules engine - javascript

I'm building an app which has a feature for embedding expressions/rules in a config YAML file. So for example a user can reference a variable defined in the YAML file like ${variables.name == 'John'} or ${is_equal(variables.name, 'John')}. I can probably get by with simple expressions, but I want to support complex rules/expressions such as ${variables.name == 'John'} and (${variables.age > 18} OR ${variables.adult == true}).
I'm looking for a parsing/DSL/rules-engine library that can support these types of expressions and normalize them. I'm open to using Ruby, JavaScript, Java, or Python if anyone knows of a library for those languages.
One option I thought of was to just support JavaScript as conditions/rules and basically pass it through eval with the right context set up, with access to variables and other referenceable vars.

I don't know if you use Golang, but if you do, I recommend https://github.com/antonmedv/expr.
I have used it for parsing bot strategies (a stock-options bot). This is from one of my unit tests:
func TestPattern(t *testing.T) {
    a := "pattern('asdas asd 12dasd') && lastdigit(23asd) < sma(50) && sma(14) > sma(12) && ( macd(5,20) > macd_signal(12,26,9) || macd(5,20) <= macd_histogram(12,26,9) )"

    r, _ := regexp.Compile(`(\w+)(\s+)?[(]['\d.,\s\w]+[)]`)
    indicator := r.FindAllString(a, -1)

    t.Logf("%v\n", indicator)
    t.Logf("%v\n", len(indicator))

    for _, i := range indicator {
        t.Logf("%v\n", i)
        if strings.HasPrefix(i, "pattern") {
            r, _ = regexp.Compile(`pattern(\s+)?\('(.+)'\)`)
            check1 := r.ReplaceAllString(i, "$2")
            t.Logf("%v\n", check1)
            r, _ = regexp.Compile(`[^du]`)
            check2 := r.FindAllString(check1, -1)
            t.Logf("%v\n", len(check2))
        } else if strings.HasPrefix(i, "lastdigit") {
            r, _ = regexp.Compile(`lastdigit(\s+)?\((.+)\)`)
            args := r.ReplaceAllString(i, "$2")
            r, _ = regexp.Compile(`[^\d]`)
            parameter := r.FindAllString(args, -1)
            t.Logf("%v\n", parameter)
        } else {
        }
    }
}
Combine it with regexes and you have a good (if not great) string translator.
And for Java, I personally use https://github.com/ridencww/expression-evaluator, though not in production. It has similar features to the library above.
It supports many kinds of conditions, and you don't have to worry about parentheses and brackets:
Assignment   =
Operators    + - * / DIV MOD % ^
Logical      < <= == != >= > AND OR NOT
Ternary      ? :
Shift        << >>
Property     ${<id>}
DataSource   #<id>
Constants    NULL PI
Functions    CLEARGLOBAL, CLEARGLOBALS, DIM, GETGLOBAL, SETGLOBAL, NOW, PRECISION
Hope it helps.

You might be surprised to see how far you can get with a syntax parser and 50 lines of code!
Check this out. The Abstract Syntax Tree (AST) on the right represents the code on the left in nice data structures. You can use these data structures to write your own simple interpreter.
I wrote a little example of one:
https://codesandbox.io/s/nostalgic-tree-rpxlb?file=/src/index.js
Open up the console (button in the bottom), and you'll see the result of the expression!
This example can only handle (||) and (>), but looking at the code (line 24), you can see how you could make it support any other JS operator: just add a case to the branch, evaluate the sides, and do the calculation in JS.
Parentheses and operator precedence are all handled by the parser for you.
I'm not sure if this is the solution for you, but it will for sure be fun ;)
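Here is a minimal sketch of that kind of interpreter, separate from the sandbox code (it assumes the acorn parser is installed; the node types are standard ESTree):

const acorn = require("acorn");

// Evaluate the ESTree nodes that acorn produces.
function evaluate(node, vars) {
    switch (node.type) {
        case "Program":
            return evaluate(node.body[0], vars);
        case "ExpressionStatement":
            return evaluate(node.expression, vars);
        case "LogicalExpression":
            return node.operator === "||"
                ? evaluate(node.left, vars) || evaluate(node.right, vars)
                : evaluate(node.left, vars) && evaluate(node.right, vars);
        case "BinaryExpression": {
            const l = evaluate(node.left, vars);
            const r = evaluate(node.right, vars);
            switch (node.operator) {
                case ">": return l > r;
                case "==": return l == r;
                // ...one case per operator you want to support
            }
            throw new Error("unsupported operator: " + node.operator);
        }
        case "Identifier":
            return vars[node.name];
        case "Literal":
            return node.value;
        default:
            throw new Error("unsupported node: " + node.type);
    }
}

const ast = acorn.parse("name == 'John' && (age > 18 || adult)", { ecmaVersion: 2020 });
console.log(evaluate(ast, { name: "John", age: 16, adult: true })); // true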

One option I thought of was to just support JavaScript as conditions/rules and basically pass it through eval with the right context set up, with access to variables and other referenceable vars.
I would personally lean towards something like this. If you are getting into complexities such as logical comparisons, a DSL can become a beast, since at that point you are almost writing a compiler and a language. You might want to forgo the config entirely and instead have the configurable file just be JavaScript (or whatever language) that can be evaluated and then loaded. Then whoever your target audience is for this "config" file can just supply logical expressions as needed.
The only reason I would not do this is if this configuration file were exposed to the public or something, but in that case securing a parser would also be quite difficult.
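To make the eval idea concrete, here is a rough sketch (evaluateRule is a hypothetical helper; the Function constructor carries exactly the same trust requirements as eval, so this is only viable when config authors are trusted):

// Evaluate a rule string with only the supplied variables in scope.
// Same security caveats as eval: never run untrusted input this way.
function evaluateRule(rule, variables) {
    const names = Object.keys(variables);
    const fn = new Function(...names, `"use strict"; return (${rule});`);
    return fn(...names.map((name) => variables[name]));
}

console.log(evaluateRule("name == 'John' && (age > 18 || adult == true)", {
    name: "John",
    age: 16,
    adult: true,
})); // true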

I did something like that once; you can probably pick it up and adapt it to your needs.
TL;DR: thanks to Python's eval, doing this is a breeze.
The problem was to parse dates and durations in textual form. What I did was create a YAML file mapping regex patterns to results. The mapping itself was a Python expression that would be evaluated with the match object, and had access to other functions and variables defined elsewhere in the file.
For example, the following self-contained snippet would recognize times like "l'11 agosto del 1993" (Italian for "August 11th, 1993").
__meta_vars__:
  month: (gennaio|febbraio|marzo|aprile|maggio|giugno|luglio|agosto|settembre|ottobre|novembre|dicembre)
  prep_art: (il\s|l\s?'\s?|nel\s|nell\s?'\s?|del\s|dell\s?'\s?)
  schema:
    date: http://www.w3.org/2001/XMLSchema#date

__meta_func__:
  - >
    def month_to_num(month):
        """ gennaio -> 1, febbraio -> 2, ..., dicembre -> 12 """
        try:
            return index_in_or(meta_vars['month'], month) + 1
        except ValueError:
            return month

Tempo:
  - \b{prep_art}(?P<day>\d{{1,2}}) (?P<month>{month}) {prep_art}?\s*(?P<year>\d{{4}}): >
      '"{}-{:02d}-{:02d}"^^<{schema}>'.format(match.group('year'),
                                              month_to_num(match.group('month')),
                                              int(match.group('day')),
                                              schema=schema['date'])
__meta_func__ and __meta_vars__ (not the best names, I know) define functions and variables that are accessible to the match transformation rules. To make the rules easier to write, the pattern is formatted using the meta-variables, so that {month} is replaced with the pattern matching all months. The transformation rule calls the meta-function month_to_num to convert the month to a number from 1 to 12, and reads from the schema meta-variable. In the example above, the match results in the string "1993-08-11"^^<http://www.w3.org/2001/XMLSchema#date>, but some other rules produce a dictionary.
Doing this is quite easy in Python, as you can use exec and eval to evaluate strings as Python code (obligatory warning about security implications). The meta-functions and meta-variables are evaluated and stored in a dictionary, which is then passed to the match transformation rules.
The code is on github, feel free to ask any questions if you need clarifications. Relevant parts, slightly edited:
class DateNormalizer:
    def _meta_init(self, specs):
        """ Reads the meta variables and the meta functions from the specification

        :param dict specs: The specifications loaded from the file
        :return: None
        """
        self.meta_vars = specs.pop('__meta_vars__')

        # compile meta functions in a dictionary
        self.meta_funcs = {}
        for f in specs.pop('__meta_funcs__'):
            exec f in self.meta_funcs

        # make meta variables available to the meta functions just defined
        self.meta_funcs['__builtins__']['meta_vars'] = self.meta_vars

        self.globals = self.meta_funcs
        self.globals.update(self.meta_vars)

    def normalize(self, expression):
        """ Find the first matching part in the given expression

        :param str expression: The expression in which to search the match
        :return: Tuple with (start, end), category, result
        :rtype: tuple
        """
        expression = expression.lower()
        for category, regexes in self.regexes.iteritems():
            for regex, transform in regexes:
                match = regex.search(expression)
                if match:
                    result = eval(transform, self.globals, {'match': match})
                    start, end = match.span()
                    return (first_position + start, first_position + end), category, result

Here are some categorized Ruby options and resources:
Insecure
Pass the expression to eval in the language of your choice.
It must be mentioned that eval is technically an option, but extraordinary trust must exist in its inputs and it is safer to avoid it altogether.
Heavyweight
Write a parser for your expressions and an interpreter to evaluate them.
A cost-intensive solution would be implementing your own expression language: designing a lexicon for your expression language, implementing a parser for it, and writing an interpreter to execute the parsed code.
Some Parsing Options (ruby)
Parslet
TreeTop
Citrus
Roll-your-own with StringScanner
Medium Weight
Pick an existing language to write expressions in and parse / interpret those expressions.
This route assumes you can pick a known language to write your expressions in. The benefit is that a parser likely already exists for that language to turn it into an Abstract Syntax Tree (a data structure that can be walked for interpretation).
A ruby example with the Parser gem
require 'parser/current'

class MyInterpreter
  # https://whitequark.github.io/ast/AST/Processor/Mixin.html
  include ::Parser::AST::Processor::Mixin

  def on_str(node)
    node.children.first
  end

  def on_int(node)
    node.children.first.to_i
  end

  def on_if(node)
    expression, truthy, falsey = *node.children
    if process(expression)
      process(truthy)
    else
      process(falsey)
    end
  end

  def on_true(_node)
    true
  end

  def on_false(_node)
    false
  end

  def on_lvar(node)
    # look up a variable by name=node.children.first
  end

  def on_send(node, &block)
    # allow things like ==, string methods? whatever
  end

  # ... etc
end

ast = Parser::CurrentRuby.parse(<<~RUBY)
  name == 'John' && adult
RUBY

MyInterpreter.new.process(ast)
# => true
The benefit here is that the parser and syntax are predetermined, and you can interpret only what you need to (and prevent malicious code from executing by controlling what on_send and on_const allow).
Templating
This is more markup-oriented and possibly doesn't apply, but you could find some use in a templating library, which parses expressions and evaluates them for you. Controlling and supplying variables to the expressions is possible, depending on the library you use. The output of the expression could be checked for truthiness.
Liquid
Jinja

Some thoughts and things you should consider.
1. Unified Expression Language (EL)
Another option is EL, specified as part of the JSP 2.1 standard (JSR-245). Official documentation.
They have some nice examples that can give you a good overview of the syntax. For example:
EL expression: `${100.0 == 100}` Result: `true`
EL expression: `${4 > 3}` Result: `true`
You can use this to evaluate small script-like expressions, and there are some implementations: Juel is one open-source implementation of the EL language.
2. Audience and Security
All the answers recommend using various interpreters and parser generators, and all are valid ways to add the ability to process complex data. But I would like to add an important note here.
Every interpreter has a parser, and injection attacks target those parsers, tricking them into interpreting data as commands. You should have a clear understanding of how the interpreter's parser works, because that's the key to reducing the chances of a successful injection attack. Real-world parsers have many corner cases and flaws that may not match the spec, so have clear measures in place to mitigate possible flaws.
And even if your application is not facing the public, you can have external or internal actors that can abuse this feature.

I'm building an app which has a feature for embedding expressions/rules in a config yaml file.
I'm looking for a parsing/DSL/rules-engine library that can support these types of expressions and normalize them. I'm open to using Ruby, JavaScript, Java, or Python if anyone knows of a library for those languages.
One possibility might be to embed a rule interpreter such as ClipsRules inside your application. You could then code your application in C++ (perhaps inspired by my clips-rules-gcc project) and link it to some C++ YAML library such as yaml-cpp.
Another approach could be to embed a Python interpreter inside a rule interpreter (perhaps the same ClipsRules), together with some YAML library.
A third approach could be to use Guile (or SBCL, or JavaScript's V8) and extend it with some "expert system shell".
Before starting to code, be sure to read several books such as the Dragon Book, the Garbage Collection handbook, Lisp In Small Pieces, Programming Language Pragmatics. Be aware of various parser generators such as ANTLR or GNU bison, and of JIT compilation libraries like libgccjit or asmjit.
You might need to contact a lawyer about legal compatibility of various open source licenses.

Related

Methods for de-obfuscating javascript that uses string concatenation for property names

I am trying to puzzle out a way to de-obfuscate javascript that looks like this:
https://jsfiddle.net/douglasg14b/4951br9f/2/
var testString = 'Test | String'

var wf6 = {
    fq4: 'su',
    k8d: 'bs',
    l8z: 'tri',
    cy1: 'ng',
    t5j: 'te',
    ol: 'stS',
    x3q: 'tri',
    l9x: 'ng',
    gh: 'xO'
};

// Obfuscated
let test1 = testString[wf6.fq4 + wf6.k8d + wf6.l8z + wf6.cy1](4, 11);
// Normal
let test2 = testString.substring(4, 11);

let test3;

// More complex obfuscation
(function moreComplex() {
    let h = "i",
        w = "nde",
        T0 = "f",
        hj = '|',
        a = eval(wf6.t5j + wf6.ol + wf6.x3q + wf6.l9x).length;

    // Obfuscated
    test3 = testString[wf6.fq4 + wf6.k8d + wf6.l8z + wf6.cy1](testString[h + w + wf6.gh + T0](hj), a);
    // Normal
    let test4 = testString.substring(testString.indexOf('|'), testString.length);
})();

$('.span1').text(test1);
$('.span2').text(test3);

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<span class="span1"></span><br>
<span class="span2"></span>
This is a small example; the file I'm working with is ~60k lines long and is full of this kind of obfuscation. Everywhere a string can be used as a property name, this kind of obfuscation is used.
The way I can think of doing this is to evaluate all the string concatenations so they are turned into a readable equivalent. Though, I am not sure how to go about this while ignoring all the other working code that exists between the concatenations.
Thoughts?
Bonus question: Is there a commonly used name for this kind of obfuscation that might make searches a bit easier?
Edit: Added a more complex example.
You have the basic idea right: you have to partially-evaluate the program and precompute all the constant computations. In your case, the constant computations of main interest are the concatenation steps over values which don't change.
To do this, you need a program transformation system (PTS). This is a tool that will read/parse source code for a specified language and build an abstract syntax tree, let you specify transformations and analyses over the AST and run them, and then spit out the modified AST as source code again.
In your case, you obviously want a PTS that is wired to know JavaScript out of the box (rare) or is willing to accept a description of JavaScript and then read JavaScript (more typical), with the hope that you can build or get a JavaScript description easily. [I build a PTS that has JavaScript descriptions available; see my bio.]
With that in hand, you need to:
Code an analyzer that inspects each variable found in an expression to see if that variable is constant (e.g., "wf6"). To demonstrate it is constant, you will have to find the variable's definition and check that all the values used in the definition are themselves constants. If there is more than one definition, you might have to check that all definitions produce the same value. You need to check for side effects on the variable (e.g., that there are no function calls "foo(..., wf6, ...)" which would allow the variable's value to be modified). You also need to worry about whether an eval command exists that could accomplish such a side effect [this is virtually impossible to verify, so you often have to just ignore evals and assume they do not do such things]. Many PTSes will have a way to let you build such analyzers; some are easier than others.
For every constant-valued variable, substitute the value of that variable in the code.
For every constant-valued sub-expression after such substitutions, "fold" (calculate) the result of that expression, substitute that value for the subexpression, and repeat until no more folding is possible. Obviously you want to do this for at least all "+" operators. [OP just modified his example; he'll want to do it for "eval" operators too when all their operands are constant.]
You may have to iterate this process, as folding an expression may make it obvious that a variable now has a constant value.
The above process is called "constant propagation" in the compiler literature and is a feature of many compilers.
In your case, you could restrict the constant folding to just string concatenation. However, once you have adequate machinery to do constant-value propagation, handling all or most operators on constants isn't that hard. You may need this to undo other obfuscations involving constants, since that seems to be the obfuscation style used on the code you are working on.
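As a small illustration of constant propagation plus folding (a sketch using Babel rather than a full PTS; it assumes wf6 is never reassigned, i.e., it skips exactly the side-effect analysis described above):

const parser = require("@babel/parser");
const traverse = require("@babel/traverse").default;
const generate = require("@babel/generator").default;
const t = require("@babel/types");

const code = `
var wf6 = { fq4: 'su', k8d: 'bs', l8z: 'tri', cy1: 'ng' };
var test1 = testString[wf6.fq4 + wf6.k8d + wf6.l8z + wf6.cy1](4, 11);
`;

const ast = parser.parse(code);
const consts = {};

// Pass 1: collect objects whose properties are constant strings.
traverse(ast, {
    VariableDeclarator(path) {
        if (t.isIdentifier(path.node.id) && t.isObjectExpression(path.node.init)) {
            const map = {};
            for (const prop of path.node.init.properties) {
                if (t.isObjectProperty(prop) && t.isIdentifier(prop.key) &&
                        t.isStringLiteral(prop.value)) {
                    map[prop.key.name] = prop.value.value;
                }
            }
            consts[path.node.id.name] = map;
        }
    },
});

// Pass 2: substitute the constants, then fold string "+" bottom-up.
traverse(ast, {
    MemberExpression(path) {
        const { object, property, computed } = path.node;
        if (!computed && t.isIdentifier(object) && t.isIdentifier(property) &&
                consts[object.name] && property.name in consts[object.name]) {
            path.replaceWith(t.stringLiteral(consts[object.name][property.name]));
        }
    },
    BinaryExpression: {
        exit(path) {
            const { left, operator, right } = path.node;
            if (operator === "+" && t.isStringLiteral(left) && t.isStringLiteral(right)) {
                path.replaceWith(t.stringLiteral(left.value + right.value));
            }
        },
    },
});

console.log(generate(ast).code);
// ...var test1 = testString["substring"](4, 11);  (the wf6 declaration is left in place)

Rewriting testString["substring"] back to testString.substring is the separate final rule described below.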
You'll need a special rule that transforms
var['string'](args)
into
var.string(args)
as a final step.
You have another complication: knowing that you have all the JavaScript relevant to producing constant-valued variables. A single web page may have many included chunks of JavaScript; you will need all of them to demonstrate there are no side effects on a variable. I assume in your case you are sure you have it all.
With respect to producing known-constant values, you may have to worry about a tricky case: an expression that produces constant values from non-constant operands. Imagine the obfuscated expression was:
x=random(); // produce a value between 0 and 1
one=x+(1-x); // not constant by constant propagation, but constant by algebraic relations
teststring['st'[one]+'vu'[one+1]+'bz'[one]+...](4,11)
You can see it always computes 'substring' as a property. You can add a transformation rule that understands the trick used to compute "one", e.g., a rule for each algebraic trick used to compute known constants. Unfortunately for you, there's an infinite number of algebra theorems one can use to manufacture constants; how many are really used in your example bit of code? [Welcome to the problem of reverse engineering with a smart adversary].
Nope, none of this is "easy". Presumably that's why the obfuscation method used was chosen.

jQuery / Javascript substitution 'Syntax error, unrecognized expression'

I am implementing jQuery chaining - using Mika Tuupola's Chained plugin - in my rails project (using nested form_for partials) and need to dynamically change the chaining attribute:
The code that works without substitution:
$(".employee_title_2").remoteChained({
    parents : ".employee_title_1",
    url     : "titles/employee_title_2",
    loading : "Loading...",
    clear   : true
});
The attributes being substituted are .employee_title_1 and .employee_title_2:
var t2 = new Date().getTime();
var A1 = ".employee_title_1A_" + t2;
var B2 = ".employee_title_2B_" + t2;
In ruby speak, I'm namespacing the variables by adding datetime.
Here's the code I'm using for on-the-fly substitution:
$(`"${B2}"`).remoteChained({
    parents : `"${A1}"`,
    url     : "titles/employee_title_2",
    loading : "Loading...",
    clear   : true
});
Which throws this error:
Uncaught Error: Syntax error, unrecognized expression:
".employee_title_2B_1462463848339"
The issue appears to be the leading '.'. How do I escape it, assuming that's the issue? Researching the error message "Syntax error, unrecognized expression" led to SO question #14347611, which suggests "a string is only considered to be HTML if it starts with a less-than ('<') character". Unfortunately, I don't understand how to implement the solution. My JavaScript skills are weak!
Incidentally, while new Date().getTime() isn't in date format, it works for my purpose, i.e., it increments as new nested form fields are added to the page.
Thanks in advance for your assistance.
$(`"${B2b}"`).remoteChained({
// ^      ^
// These quotes should not be here
As it is evaluated to a string containing something like:
".my_class"
and to tie it together:
$('".my_class"')...
Same goes for the other place you use backtick notation. In your case you could simply use:
$(B2).remoteChained({
    parents : A1,
    url     : "titles/employee_title_2",
    loading : "Loading...",
    clear   : true
});
The backtick (`) syntax is new to JavaScript and provides a templating feature, similar to the way that Ruby provides interpolated strings. For instance, this JavaScript code:
var who = "men";
var what = "country";
var famous_quote = `Now is the time for all good ${who} to come to the aid of their ${what}`;
is interpolated in exactly the same way as this Ruby code:
who = "men"
what = "country"
famous_quote = "Now is the time for all good #{who} to come to the aid of their #{what}"
In both cases, the quote ends up reading, "Now is the time for all good men to come to the aid of their country". Similar feature, slightly different syntax.
Moving on to jQuery selectors, you have some flexibility in how you specify them. For instance, this code:
$(".my_class").show();
is functionally equivalent to this code:
var my_class_name = ".my_class";
$(my_class_name).show();
This is a great thing, because that means that you can store the name of jQuery selectors in variables and use them instead of requiring string literals. You can also build them from components, as you will find in this example:
var mine_or_yours = (user_selection == "me") ? "my" : "your";
var my_class_name = "." + mine_or_yours + "_class";
$(my_class_name).show();
This is essentially the behavior that you're trying to get working. Using the two features together (interpolation and dynamic jQuery selectors), you have this:
$(`"${B2}"`).remoteChained(...);
which produces this code through string interpolation:
$("\".employee_title_2B_1462463848339\"").remoteChained(...);
which is not correct, and is actually the cause of the error message from jQuery: the embedded double quotes in the value of the string are the problem. jQuery is specifically complaining about the extra double quotes surrounding the value that you're passing to the selector.
What you actually want is the equivalent of this:
$(".employee_title_2B_1462463848339").remoteChained(...);
which could either be written this way:
$(`${B2}`).remoteChained(...);
or, much more simply and portably, like so:
$(B2).remoteChained(...);
Try this little sample code to prove the equivalence it to yourself:
if (`${B2}` == B2) {
    alert("The world continues to spin on its axis...");
} else if (`"${B2}"` == B2) {
    alert("Lucy, you've got some 'splain' to do!");
} else {
    alert("Well, back to the drawing board...");
}
So, we've established the equivalency of interpolation to the original strings. We've also established the equivalency of literal jQuery selectors to dynamic selectors. Now, it's time to put the techniques together in the original code context.
Try this instead of the interpolation version:
$(B2).remoteChained({
    parents : A1,
    url     : "titles/employee_title_2",
    loading : "Loading...",
    clear   : true
});
We already know that $(B2) is a perfectly acceptable dynamic jQuery selector, so that works. The value passed to the parents key in the remoteChained hash simply requires a string, and A1 already fits the bill, so there's no need to introduce interpolation in that case, either.
Realistically, nothing about this issue is related to Chained; it just happens to be included in the statement that's failing. So, that means that you can easily isolate the failing code (building and using the jQuery selectors), which makes it far easier to debug.
Note that this JavaScript syntax was codified just last year with ECMAScript version 6, so support for it is still a mixed bag. Check your browser support to make sure that you can use it reliably.

Proper way to use visitors in ANTLR4 (javascript target)

I am having trouble understanding how to properly use Visitors in ANTLR4, Javascript target.
I have prepared a very basic grammar, it accepts INT + INT or INT - INT operations.
grammar PlusMinus;

INT   : [0-9]+;
WS    : [ \t\r]+ -> skip;
PLUS  : '+';
MINUS : '-';

input : plusOrMinus
      ;

plusOrMinus
    : numberLeft PLUS numberRight  # Plus
    | numberLeft MINUS numberRight # Minus
    ;

numberLeft  : INT;
numberRight : INT;
From this grammar ANTLR will generate a visitor that has three functions: visitInput, visitPlus, and visitMinus. I start from visitInput, where I can fetch the operation with operation = ctx.plusOrMinus().
This is where I get stuck: how do I know whether operation is of type Plus or Minus? In other words, where do I pass ctx.plusOrMinus(), to visitPlus() or visitMinus()?
I managed to create a visitor that does work, but it's very ugly. I am posting it here because perhaps it will help to better understand my question. Lines 20-29 are where the problem is.
First of all... PLUS and MINUS are lexer rules. You don't visit tokens (the result of lexer rules).
It rather looks like you're expecting this to work like a listener, where you set up a function that gets called when the tree walker reaches that node. (You can be called on entry to or exit from the node, depending on whether you want the node before or after its children have been processed.) Visitors expect you to handle your own tree navigation, which is sometimes useful, but listeners are cleaner where they suit the purpose. With nesting, you'll probably want to listen after the child nodes are processed, so you'll want to implement an exitPlusOrMinus() function on your listener. I'd suggest stopping your code in the debugger inside this function to take a look at the objects available to you (in the ctx object).
(You also need to rethink your numberLeft and numberRight parser rules. Something more like:
plusOrMinus: lexpr=INT (op=PLUS | op=MINUS) rexpr=INT;
would give you a pretty close equivalent to what you have so far. What you have will work with a recursive descent parser like ANTLR (as far as this example goes), but you're headed in the wrong direction making them different parser rules. Specifically, by making them two alternative parser rules, you're giving PLUS a higher precedence than MINUS, and PLUS and MINUS should have the same precedence in order of evaluation; as a result, they need to be in the same parser rule. When you place alternatives like this in a parser rule, you're also establishing precedence, so be careful about the order of these rules.)
To get further than adding or subtracting integers, though, you'll need lexpr and rexpr to actually be expressions themselves (you should read up on expression parsing in the ANTLR book; it's covered very nicely).
With that rule, your exitPlusOrMinus can parse the int values of lexpr and rexpr and then evaluate the value of the op to determine whether to add or subtract.
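A minimal sketch of that listener in the JavaScript target might look like this (a sketch only; it assumes a recent antlr4 runtime with ES modules and generated PlusMinusLexer/PlusMinusParser/PlusMinusListener files):

import antlr4 from "antlr4";
import PlusMinusLexer from "./PlusMinusLexer.js";
import PlusMinusParser from "./PlusMinusParser.js";
import PlusMinusListener from "./PlusMinusListener.js";

class EvalListener extends PlusMinusListener {
    // Called after both operands of plusOrMinus have been processed.
    exitPlusOrMinus(ctx) {
        const left = parseInt(ctx.lexpr.text, 10);  // the lexpr=INT label
        const right = parseInt(ctx.rexpr.text, 10); // the rexpr=INT label
        this.result = ctx.op.text === "+" ? left + right : left - right;
    }
}

const chars = new antlr4.InputStream("2 + 3");
const tokens = new antlr4.CommonTokenStream(new PlusMinusLexer(chars));
const tree = new PlusMinusParser(tokens).input();

const listener = new EvalListener();
antlr4.tree.ParseTreeWalker.DEFAULT.walk(listener, tree);
console.log(listener.result); // 5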

Replace comment in JavaScript AST with subtree derived from the comment's content

I'm the author of doctest, quick and dirty doctests for JavaScript and CoffeeScript. I'd like to make the library less dirty by using a JavaScript parser rather than regular expressions to locate comments.
I'd like to use Esprima or Acorn to do the following:
Create an AST
Walk the tree, and for each comment node:
Create an AST from the comment node's text
Replace the comment node in the main tree with this subtree
Input:
!function() {

    // > toUsername("Jesper Nøhr")
    // "jespernhr"
    var toUsername = function(text) {
        return ('' + text).replace(/\W/g, '').toLowerCase()
    }

}()
Output:
!function() {

    doctest.input(function() {
        return toUsername("Jesper Nøhr")
    });
    doctest.output(4, function() {
        return "jespernhr"
    });
    var toUsername = function(text) {
        return ('' + text).replace(/\W/g, '').toLowerCase()
    }

}()
I don't know how to do this. Acorn provides a walker which takes a node type and a function, and walks the tree invoking the function each time a node of the specified type is encountered. This seems promising, but doesn't apply to comments.
With Esprima I can use esprima.parse(input, {comment: true, loc: true}).comments to get the comments, but I'm not sure how to update the tree.
Most AST-producing parsers throw away comments. I don't know what Esprima or Acorn do, but that might be the issue.
.... in fact, Esprima lists comment capture as a current bug:
http://code.google.com/p/esprima/issues/detail?id=197
... Acorn's code is right there in GitHub. It appears to throw comments away, too.
So, looks like you get to fix either parser to capture the comments first, at which point your task should be straightforward, or, you're stuck.
Our DMS Software Reengineering Toolkit has JavaScript parsers that capture comments in the tree. It also has language substring parsers that could be used to parse the comment text into JavaScript ASTs of whatever type the comment represents (e.g., function declaration, expression, variable declaration, ...), and the support machinery to graft such new ASTs into the main tree. If you are going to manipulate ASTs, this substring capability is likely important: most parsers won't parse arbitrary language fragments; they are wired only to parse "whole programs". For DMS, there are no comment nodes to replace; there are comments associated with AST nodes, so the grafting process is a little trickier than just "replace comment nodes". Still pretty easy.
I'll observe that most parsers (including these) read the source and break it into tokens by using or applying the equivalent of regular expressions. So, if you are already using regexes to locate comments (which means using them to locate *non*comments to throw away as well, e.g., you need to recognize string literals that contain comment-like text and ignore them), you are doing as well as the parsers would do anyway in terms of finding the comments. And if all you want to do is to replace them exactly with their content, echoing the source stream with the comment prefix/suffix /* */ stripped will apparently do exactly what you want, so all this parsing machinery seems like overkill.
You can already use Esprima to achieve what you want:
Parse the code, get the comments (as an array).
Iterate over the comments, see if each is what you are interested in.
If you need to transform the comment, note its range. Collect all transformations.
Apply the transformations from last to first so that the ranges are not shifted.
The trick here is not to change the AST. Simply apply the text change as if you were doing a typical search-and-replace on the source string directly. Because the position of the replacement might shift, you need to collect everything and then apply from the last one. For an example of how to carry out such a transformation, take a look at my blog post "From double-quotes to single-quotes" (it deals with string quotes, but the principle remains the same).
Last but not least, you might want to use a slightly higher-level utility such as Rocambole.
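A minimal sketch of that range-based rewrite (the transform callback is a hypothetical placeholder; esprima collects the comments array when called with comment: true, as in the question):

const esprima = require("esprima");

// Replace each comment for which `transform` returns a string.
function rewriteComments(source, transform) {
    const ast = esprima.parse(source, { comment: true, range: true });
    const edits = [];
    for (const c of ast.comments) {
        const replacement = transform(c); // return null/undefined to skip
        if (replacement != null) edits.push({ range: c.range, replacement });
    }
    // Apply edits from last to first so earlier ranges stay valid.
    edits.sort((a, b) => b.range[0] - a.range[0]);
    for (const { range, replacement } of edits) {
        source = source.slice(0, range[0]) + replacement + source.slice(range[1]);
    }
    return source;
}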

Using PEG Parser for BBCode Parsing: pegjs or ... what?

I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.
My specific questions are:
As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible, because I am relying on creating the HTML from a parse tree, and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:
[
    [
        "[",
        "img",
        "",
        [
            "/",
            "f",
            "o",
            "o",
            "/",
            "b",
            "a",
            "r"
        ],
        "",
        "]"
    ],
    [
        "[/",
        "img",
        "]"
    ]
]
given a grammar like:
document
    = (open_tag / close_tag / new_line / text)*

open_tag
    = ("[" tag_name "="? tag_data? tag_attributes? "]")

close_tag
    = ("[/" tag_name "]")

text
    = non_tag+

non_tag
    = [^\n\[\]]

new_line
    = ("\r\n" / "\n")
I'm abbreviating the grammar, of course, but you get the idea. So, if you notice, there is no contextual information in the array of arrays that tells me what kind of node I have, and I'm left to do the string comparisons again even though the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.
Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?
Thanks
First question (grammar for incomplete texts):
You can add
incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
// the closing bracket is omitted ---^
after open_tag and change document to include an incomplete tag at the end. The trick is that you provide the parser with all the productions needed to always parse, but the valid ones come first, as sketched below. You can then ignore incomplete_tag during the live preview.
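For instance, the document rule might then read (a sketch based on the abbreviated grammar above):

document
    = (open_tag / close_tag / new_line / text)* incomplete_tag?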
Second question (how to include actions):
You write so-called actions after expressions. An action is JavaScript code enclosed in braces and is allowed after a pegjs expression, i.e. also in the middle of a production!
In practice, actions like { return result.join("") } are almost always necessary because pegjs splits the input into single characters. Complicated nested arrays can also be returned. Therefore I usually write helper functions in the pegjs initializer at the head of the grammar to keep the actions small. If you choose the function names carefully, the actions are self-documenting.
For an example see PEG for Python style indentation. Disclaimer: this is an answer of mine.
Regarding your first question, I have to say that a live preview is going to be difficult. The problems you pointed out, namely that the parser won't understand that the input is a work in progress, are correct. Peg.js tells you at which point the error is, so maybe you could take that info, go a few words back, and parse again, or if an end tag is missing, try adding it at the end.
The second part of your question is easier, but your grammar won't look so nice afterwards. Basically, what you do is put callbacks on every rule, for example:
text
    = text:non_tag+ {
        // we captured the text in an array and can manipulate it now
        return text.join("");
    }
At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pull request to peg.js to fix that. But I'm not sure when I'll find the time to do this.
Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.
text
    = result:non_tag+ { return result.join(''); }
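In the same spirit, actions can return tagged objects instead of bare arrays, which restores the contextual information the question is missing (a simplified sketch of open_tag; the real rule keeps the optional parts):

open_tag
    = "[" name:tag_name "]" { return { type: "open_tag", name: name }; }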
