I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.
My specific questions are:
As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible because I am relying on creating the HTML from a parse tree and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:
[
[
"[",
"img",
"",
[
"/",
"f",
"o",
"o",
"/",
"b",
"a",
"r"
],
"",
"]"
],
[
"[/",
"img",
"]"
]
]
given a grammar like:
document
= (open_tag / close_tag / new_line / text)*
open_tag
= ("[" tag_name "="? tag_data? tag_attributes? "]")
close_tag
= ("[/" tag_name "]")
text
= non_tag+
non_tag
= [^\n\[\]]
new_line
= ("\r\n" / "\n")
I'm abbreviating the grammar, of course, but you get the idea. So, if you notice, there is no contextual information in the array of arrays that tells me what kind of node I have, and I'm left to do the string comparisons again even though the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.
Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?
Thanks
First question (grammar for incomplete texts):
You can add
incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
// the closing bracket is omitted ---^
after open_tag and change document to include an incomplete tag at the end. The trick is that you provide the parser with enough productions that it always succeeds on work-in-progress input, with the valid alternatives tried first. You can then ignore incomplete_tag during the live preview.
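Sketched against the abbreviated grammar above (untested; rule names follow the grammar in the question), the document rule might become:

```pegjs
document
  = (open_tag / close_tag / new_line / text)* incomplete_tag?

incomplete_tag
  = ("[" tag_name "="? tag_data? tag_attributes?)
```

Because incomplete_tag is tried last and only at the end, complete input never matches it.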
Second question (how to include actions):
You write so-called actions after expressions. An action is JavaScript code enclosed in braces, and is allowed after any pegjs expression, i.e. also in the middle of a production!
In practice, actions like { return result.join("") } are almost always necessary because pegjs splits matches into single characters. Complicated nested arrays can also be returned. I therefore usually write helper functions in the pegjs initializer at the head of the grammar to keep actions small. If you choose the function names carefully, the actions are self-documenting.
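For instance, a small sketch (the helper name makeTag is made up for illustration; pegjs labels such as name:tag_name bind sub-results for use in the action):

```pegjs
{
  // initializer: helpers visible to all actions
  function makeTag(name, data) {
    return { type: "open_tag", name: name, data: data };
  }
}

open_tag
  = "[" name:tag_name "="? data:tag_data? "]" { return makeTag(name, data); }

text
  = chars:non_tag+ { return chars.join(""); }
```

The parse result is then a tree of tagged objects instead of raw character arrays, so no string comparisons are needed afterwards.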
For an example, see PEG for Python style indentation. Disclaimer: this is an answer of mine.
Regarding your first question, I have to say that a live preview is going to be difficult. The problems you pointed out (that the parser won't understand that the input is "work in progress") are correct. Peg.js tells you at which point the error is, so maybe you could take that info and go a few words back and parse again, or, if an end tag is missing, try adding it at the end.
The second part of your question is easier, but your grammar won't look so nice afterwards. Basically what you do is put callbacks on every rule, so for example:
text
= text:non_tag+ {
// we captured the text in an array and can manipulate it now
return text.join("");
}
At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pull request to peg.js to fix that. But I'm not sure when I'll find the time to do this.
Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.
text
= result:non_tag+ { return result.join(''); }
Related
I am implementing jQuery chaining - using Mika Tuupola's Chained plugin - in my rails project (using nested form_for partials) and need to dynamically change the chaining attribute:
The code that works without substitution:
$(".employee_title_2").remoteChained({
parents : ".employee_title_1",
url : "titles/employee_title_2",
loading : "Loading...",
clear : true
});
The attributes being substituted are .employee_title_1 and .employee_title_2:
var t2 = new Date().getTime();
var A1 = ".employee_title_1A_" + t2;
var B2 = ".employee_title_2B_" + t2;
In ruby speak, I'm namespacing the variables by adding datetime.
Here's the code I'm using for on-the-fly substitution:
$(`"${B2}"`).remoteChained({
parents : `"${A1}"`,
url : "titles/employee_title_2",
loading : "Loading...",
clear : true
});
Which throws this error:
Uncaught Error: Syntax error, unrecognized expression:
".employee_title_2B_1462463848339"
The issue appears to be the leading '.' How do I escape it, assuming that's the issue? Researching the error message Syntax error, unrecognized expression led me to SO question #14347611, which suggests "a string is only considered to be HTML if it starts with a less-than ('<') character". Unfortunately, I don't understand how to implement the solution. My javascript skills are weak!
Incidentally, while new Date().getTime(); isn't in date format, it works for my purpose, i.e., it increments as new nested form fields are added to the page
Thanks in advance for your assistance.
$(`"${B2}"`).remoteChained({
// ^     ^
// These quotes should not be here
As it is evaluated to a string containing something like:
".my_class"
and to tie it together:
$('".my_class"')...
Same goes for the other place you use backtick notation. In your case you could simply use:
$(B2).remoteChained({
parents : A1,
url : "titles/employee_title_2",
loading : "Loading...",
clear : true
});
The backtick (`) syntax, called a template literal, is new to JavaScript and provides string interpolation similar to the way that Ruby provides interpolated strings. For instance, this JavaScript code:
var who = "men";
var what = "country";
var famous_quote = `Now is the time for all good ${who} to come to the aid of their ${what}`;
is interpolated in exactly the same way as this Ruby code:
who = "men"
what = "country"
famous_quote = "Now is the time for all good #{who} to come to the aid of their #{what}"
In both cases, the quote ends up reading, "Now is the time for all good men to come to the aid of their country". Similar feature, slightly different syntax.
Moving on to jQuery selectors, you have some flexibility in how you specify them. For instance, this code:
$(".my_class").show();
is functionally equivalent to this code:
var my_class_name = ".my_class";
$(my_class_name).show();
This is a great thing, because that means that you can store the name of jQuery selectors in variables and use them instead of requiring string literals. You can also build them from components, as you will find in this example:
var mine_or_yours = (user_selection == "me") ? "my" : "your";
var my_class_name = "." + mine_or_yours + "_class";
$(my_class_name).show();
This is essentially the behavior that you're trying to get working. Using the two features together (interpolation and dynamic jQuery selectors), you have this:
$(`"${B2}"`).remoteChained(...);
which produces this code through string interpolation:
$("\".employee_title_2B_1462463848339\"").remoteChained(...);
which is not correct, and is actually the cause of the error message from jQuery: the embedded double quotes become part of the string's value. jQuery is specifically complaining about the extra double quotes surrounding the value that you're passing to the selector.
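A plain JavaScript sketch of the difference (the class name is taken from the error message above):

```javascript
var B2 = ".employee_title_2B_1462463848339";

// Backticks with embedded quotes: the quotes become part of the string.
var wrong = `"${B2}"`;
console.log(wrong === '"' + B2 + '"'); // true: the selector is wrapped in literal quotes

// Interpolation alone just reproduces the original string.
var right = `${B2}`;
console.log(right === B2); // true
```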
What you actually want is the equivalent of this:
$(".employee_title_2B_1462463848339").remoteChained(...);
which could either be written this way:
$(`${B2}`).remoteChained(...);
or, much more simply and portably, like so:
$(B2).remoteChained(...);
Try this little sample to prove the equivalence to yourself:
if (`${B2}` == B2) {
alert("The world continues to spin on its axis...");
} else if (`"${B2}"` == B2) {
alert("Lucy, you've got some 'splain' to do!");
} else {
alert("Well, back to the drawing board...");
}
So, we've established the equivalency of interpolation to the original strings. We've also established the equivalency of literal jQuery selectors to dynamic selectors. Now, it's time to put the techniques together in the original code context.
Try this instead of the interpolation version:
$(B2).remoteChained({
parents : A1,
url : "titles/employee_title_2",
loading : "Loading...",
clear : true
});
We already know that $(B2) is a perfectly acceptable dynamic jQuery selector, so that works. The value passed to the parents key in the remoteChained hash simply requires a string, and A1 already fits the bill, so there's no need to introduce interpolation in that case, either.
Realistically, nothing about this issue is related to Chained; it just happens to be included in the statement that's failing. So, that means that you can easily isolate the failing code (building and using the jQuery selectors), which makes it far easier to debug.
Note that this JavaScript syntax was codified just last year with ECMAScript version 6, so support for it is still a mixed bag. Check your browser support to make sure that you can use it reliably.
I am trying to create a tool that looks for missing translations in .html files. Some of our translations are done at runtime in JS code. I would like to map these together. Below is an example.
<select id="dropDown"></select>
// js
bindings: {
"dropDown": function() {
translate(someValue);
// translate then set option
}
}
Above you can see that I have a drop-down whose values are created & translated at runtime. I was thinking that an AST would be the right way to accomplish this. Basically I need to go through the .html file looking for tags that are missing in-line translations (done using {{t value}}) and search the corresponding .js file for runtime translations. Is there a better way to accomplish this? Any advice on a tool for creating the AST?
I think you want to hunt for patterns in the code. In particular, I think you want to determine, for each HTML select construct, if there is a corresponding, properly shaped JavaScript fragment with the right embedded ID name.
You can do that with an AST, right. In your case, you need an AST for the HTML file (nodes are essentially HTML tags) with sub-ASTs for the script chunks (script tags) whose content is the parsed JavaScript.
For this you need two parsers: one to parse the HTML (nasty, just because HTML is a mess) and produce a tree containing script nodes holding just text. Then you need a JavaScript parser that you can apply to the text blobs under the script tags to produce JavaScript ASTs; ideally, you splice these into the HTML tree to replace the text-blob nodes. Now you have a mixed tree, with some nodes being HTML and some subtrees being JavaScript. Ideally the HTML nodes are marked as being HTML, and the JavaScript nodes are marked as JavaScript.
Now you can search the tree for select nodes, pick up the id, and then search all the JavaScript subtrees for the expected structure.
You can code the matching procedurally, but it will be messy:
for all nodes
   if node is HTML and nodetype is Select
   then
      functionname = node.getchild("ID").text
      for all nodes
         if node is JavaScript and node.parent is HTML and nodetype is pair
         then if node.getchild(left).text is "bindings"
              then if node.getchild(right) matches structure
                   then ... (lots more ...)
There's a lot of hair in this. Technically it's just sweat. You have to know (and encode) the precise detail of the tree, climbing up and down its links correctly and checking the node types one by one. If the grammar changes a tiny bit, this code breaks too; it knows too much about the grammar.
You can finish this by coding your own parsers from scratch. Lots more sweat.
There are tools that can make this a lot easier; see Program Transformation Systems. Such tools let you define language grammars, and they generate parsers and AST builders for such grammars. (As a general rule, they are pretty good at defining working grammars, because they are designed to be applied to many languages). That at least puts lots of structure into the process, and they provide a lot of machinery to make this work. But the good part is that you can express patterns, usually in source language surface syntax, that can make this a lot easier to express.
One of these tools is our DMS Software Reengineering Toolkit (I'm the architect).
DMS has dirty HTML and full JavaScript parsers already, so those don't need to be built.
You would have to write a bit of code for DMS to invoke the HTML parser, find the subtree for script nodes, and apply the JavaScript parser. DMS makes this practical by allowing you parse a blob of text as an arbitrary nonterminal in a grammar; in this case, you want to parse those blobs as an expression nonterminal.
With all that in place, you can now write patterns that will support the check:
pattern select_node(property: string): HTML~dirty.HTMLform =
" <select ID=\property></select> ";
pattern script(code: string): HTML~dirty.HTMLform =
" <script>\code</script> ";
pattern js_bindings(s: string, e:expression):JavaScript.expression =
" bindings : { \s : function ()
{ translate(\e);
}
} ";
While these patterns look like text, they are parsed by DMS into ASTs with placeholder nodes for the parameter-list elements, denoted by "\name" inside the (meta)quotes "..." that surround the program text of interest. Such AST patterns can be matched against ASTs; they match if the pattern tree matches, and the pattern-variable leaves are then captured as bindings.
(See Registry:PatternMatch below, and the resulting match argument with slots matched (a boolean) and bindings (an array of bound subtrees resulting from the match).) A big win for the tool builder: he doesn't have to know much about the fine detail of the grammar, because he writes the pattern, and the tool produces all the tree nodes for him, implicitly.
With these patterns, you can write procedural PARLANSE (DMS's Lisp-style programming language) code to implement the check (liberties taken to shorten the presentation):
(;; `Parse HTML file':
(= HTML_tree (HTML:ParseFile .... ))
`Find script nodes and replace by ASTs for same':
(AST:FindAllMatchingSubtrees HTML_tree
(lambda (function boolean [html_node AST:Node])
(let (= [match Registry:Match]
(Registry:PatternMatch html_node "script"))
(ifthenelse match:boolean
(value (;; (AST:ReplaceNode node
(JavaScript:ParseStream
"expression" ; desired nonterminal
(make Streams:Stream
(AST:GetString match:bindings:1))))
);;
~f ; false: don't visit subtree
)value
~t ; true: continue scanning into subtree
)ifthenelse
)let
)lambda )
`Now find select nodes, and check sanity':
(AST:FindAllMatchingSubtrees HTML_tree
(lambda (function boolean [html_node AST:node])
(let (;; (= [select_match Registry:Match] ; capture match data
(Registry:PatternMatch "select" html_node)) ; hunt for this pattern
[select_function_name string]
);;
(ifthenelse select_match:boolean
(value (;; `Found <select> node.
Get name of function...':
(= select_function_name
(AST:GetString select_match:bindings:1))
`... and search for matching script fragment':
(ifthen
(~ (AST:FindFirstMatchingSubtree HTML_tree
(lambda (function boolean [js_node AST:Node])
(let (;; (= [match Registry:Match] ; capture match data
(Registry:PatternMatch js_node "js_bindings")) ; hunt for this pattern
(&& match:boolean
(== select_match:bindings:1
select_function_name)
)&& ; is true if we found matching function
)let
)lambda ) )~
(;; `Complain if we can't find matching script fragment'
(Format:SNN `Select #S with missing translation at line #D column #D'
select_function_name
(AST:GetLineNumber select_match:source_position)
(AST:GetColumnNumber select_match:source_position)
)
);;
)ifthen
);;
~f ; don't visit subtree
)value
~t ; continue scanning into subtree
)ifthenelse
)let
)lambda )
);;
This procedural code first parses an HTML source file, producing an HTML tree. All these nodes are stamped as being from the "HTML~dirty" language.
It then scans that tree to find SCRIPT nodes, and replaces them with an AST obtained from a JavaScript-expression-parse of the text content of the script nodes encountered. Finally, it finds all SELECT nodes, picks out the name of the function mentioned in the ID clause, and checks all JavaScript ASTs for a matching "bindings" expression as specified by OP. All of this leans on the pattern matching machinery, which in turn leans on top of the low-level AST library that provides a variety of means to inspect/navigate/change tree nodes.
I've left out some detail (esp. error handling code) and obviously haven't tested this. But this gives the flavor of how you might do it with DMS.
Similar patterns and matching processes are available in other program transformation systems.
We write js programs for clients which allow them to craft the display text. Here is what we did
We have a raw js file in which those strings are replaced with tokens, for example
month = [_MonthToken_];
name = '_NameToken_';
and an xml file to allow the user to specify the text, like
<xml>
<token name="MonthToken">'Jan','Feb','March'</token>
<token name="NameToken">Alice</token>
</xml>
and have a generator to replace the token with the text and generate the final js file.
month = ['Jan','Feb','March'];
name = 'Alice';
However, I found there is a bug in this scenario. When somebody specifies the name to be "D'Angelo" (for example), the js will run into an error because the name variable will become
name='D'Angelo'
We have thought of several ways to fix the problem but none of which are perfect.
We may ask our clients to escape the characters themselves, but that seems inappropriate given that they may not know js, and there are more cases to escape (", for example), which could make them unhappy :|
We have also thought of changing the generator to escape ', but sometimes the text replaces an array, where the single quotes should not be escaped. (There are other cases; we may detect them case by case, but it is tedious.)
We may have done something wrong in the whole scenario/architecture, but we don't want to change that unless we have confirmed that it is definitely necessary.
So, is there any solution? I will look into every idea. Thank you in advance!
(I may also need a better title :P)
I think your xml schema is poorly designed, and this is the root cause of your problems.
Basically, you are forcing the author of the xml to put JavaScript code inside the name="MonthToken" element, while you pretend that she can do this without knowledge of JavaScript syntax. I guess that you are planning to use eval on the parsed element content to build the month and name variables.
The problem you discovered isn't the only one: you are also subject to JavaScript code injection: what if a user forges an element such as:
<token name="MonthToken">alert('put some evil instruction here')</token>
I would suggest to change the xml schema in this way:
<xml>
<token name="MonthToken">Jan</token>
<token name="MonthToken">Feb</token>
<token name="MonthToken">March</token>
<token name="NameToken">Alice</token>
</xml>
Then in your generator, you'll have to parse each MonthToken element content, and add it to the month array. Do the same for the name variable.
In this way:
You don't use eval, so you have no possibility of code injection
Your users no longer have to know how to quote month names
You automatically handle quotes and apostrophes in names, because you are not using them as js code.
If you want the month variable to become a string when the user enters just one month, then simply transform the variable with something like this:
if (month.length == 1) {
month = month[0];
}
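As a sketch of the generator side under this schema (the function name and output shape are made up for illustration), using JSON.stringify guarantees correct escaping regardless of quotes or apostrophes in the text:

```javascript
// Emit a JS assignment for token values collected from the XML.
// JSON.stringify produces a valid JS literal, escaping quotes for us.
function emitAssignment(name, values) {
  var literal = values.length === 1
    ? JSON.stringify(values[0])   // single value -> string literal
    : JSON.stringify(values);     // several values -> array literal
  return name + " = " + literal + ";";
}

console.log(emitAssignment("name", ["D'Angelo"]));
// name = "D'Angelo";
console.log(emitAssignment("month", ["Jan", "Feb", "March"]));
// month = ["Jan","Feb","March"];
```

Since the apostrophe in D'Angelo needs no escaping inside a double-quoted literal, the generated code is valid as-is.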
I'm the author of doctest, quick and dirty doctests for JavaScript and CoffeeScript. I'd like to make the library less dirty by using a JavaScript parser rather than regular expressions to locate comments.
I'd like to use Esprima or Acorn to do the following:
Create an AST
Walk the tree, and for each comment node:
Create an AST from the comment node's text
Replace the comment node in the main tree with this subtree
Input:
!function() {
// > toUsername("Jesper Nøhr")
// "jespernhr"
var toUsername = function(text) {
return ('' + text).replace(/\W/g, '').toLowerCase()
}
}()
Output:
!function() {
doctest.input(function() {
return toUsername("Jesper Nøhr")
});
doctest.output(4, function() {
return "jespernhr"
});
var toUsername = function(text) {
return ('' + text).replace(/\W/g, '').toLowerCase()
}
}()
I don't know how to do this. Acorn provides a walker which takes a node type and a function, and walks the tree invoking the function each time a node of the specified type is encountered. This seems promising, but doesn't apply to comments.
With Esprima I can use esprima.parse(input, {comment: true, loc: true}).comments to get the comments, but I'm not sure how to update the tree.
Most AST-producing parsers throw away comments. I don't know what Esprima or Acorn do, but that might be the issue.
.... in fact, Esprima lists comment capture as a current bug:
http://code.google.com/p/esprima/issues/detail?id=197
... Acorn's code is right there in GitHub. It appears to throw comments away, too.
So, it looks like you get to fix either parser to capture the comments first (at which point your task should be straightforward), or you're stuck.
Our DMS Software Reengineering Toolkit has JavaScript parsers that capture comments in the tree. It also has language substring parsers that could be used to parse the comment text into JavaScript ASTs of whatever type the comment represents (e.g., function declaration, expression, variable declaration, ...), and the support machinery to graft such new ASTs into the main tree. If you are going to manipulate ASTs, this substring capability is likely important: most parsers won't parse arbitrary language fragments; they are wired to parse only "whole programs". For DMS, there are no comment nodes to replace; there are comments associated with AST nodes, so the grafting process is a little trickier than just "replace comment nodes". Still pretty easy.
I'll observe that most parsers (including these) read the source and break it into tokens by using, or applying the equivalent of, regular expressions. So, if you are already using these to locate comments (which also means using them to locate *non*comments to throw away, e.g., recognizing string literals that contain comment-like text and ignoring them), you are doing as well as the parsers would anyway in terms of finding the comments. And if all you want to do is replace them exactly with their content, echoing the source stream with the comment prefix/suffix /* */ stripped will apparently do exactly what you want, so all this parsing machinery seems like overkill.
You can already use Esprima to achieve what you want:
Parse the code, get the comments (as an array).
Iterate over the comments, see if each is what you are interested in.
If you need to transform the comment, note its range. Collect all transformations.
Apply the transformations from last to first so that the ranges are not shifted.
The trick here is not to change the AST. Simply apply the text change as if you were doing a typical search-and-replace on the source string directly. Because the positions of later replacements would shift, you need to collect everything first and then apply from the last one. For an example of how to carry out such a transformation, take a look at my blog post "From double-quotes to single-quotes" (it deals with string quotes, but the principle remains the same).
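A minimal sketch of that last-to-first application in plain JavaScript (the { range, text } shape mirrors Esprima's [start, end] offsets; the sample replacements are made up):

```javascript
// Apply { range: [start, end], text } replacements to a source string,
// from the last offset to the first so earlier ranges stay valid.
function applyReplacements(source, replacements) {
  var sorted = replacements.slice().sort(function (a, b) {
    return b.range[0] - a.range[0]; // descending by start offset
  });
  sorted.forEach(function (r) {
    source = source.slice(0, r.range[0]) + r.text + source.slice(r.range[1]);
  });
  return source;
}

var src = "aa /*one*/ bb /*two*/";
var out = applyReplacements(src, [
  { range: [3, 10], text: "ONE" },
  { range: [14, 21], text: "TWO" }
]);
console.log(out); // aa ONE bb TWO
```

Sorting a copy (rather than mutating the caller's array) keeps the helper side-effect free.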
Last but not least, you might want to use a slightly higher-level utility such as Rocambole.
var str = '<div part="1">
<div>
...
<p class="so">text</p>
...
</div>
</div><span></span>';
I got a long string stored in var str, and I need to extract the strings inside div part="1". Can you help me please?
You could create a DOM element and set its innerHTML to your string.
Then you can iterate through its element children and read the attributes you want ;)
example
var str = "<your><html>";
var node = document.createElement("div");
node.innerHTML = str;
for(var i = 0; i < node.children.length; i++){
console.log(node.children[i].getAttribute("part"));
}
If you're using a library like JQuery, this is trivially easy without having to go through the horrors of parsing HTML with regex.
Simply load the string into a JQuery object; then you'll be able to query it using selectors. It's as simple as this:
var so = $(str).find('.so');
to get the class='so' element.
If you want to get all the text in part='1', then it would be this:
var part1 = $(str).find('[part=1]').text();
Similar results can be achieved with Prototype library, or others. Without any library, you can still do the same thing using the DOM, but it'll be much harder work.
Just to clarify why it's a bad idea to do this sort of thing in regex:
Yes, it can be done. It is possible to scan a block of HTML code with regex and find things within the string.
However, the issue is that HTML is too variable; it is defined as a non-regular language (bear in mind that the 'reg' in 'regex' is for 'regular').
If you know that your HTML structure is always going to look the same, it's relatively easy. However if it's ever going to be possible that the incoming HTML might contain elements or attributes other than the exact ones you're expecting, suddenly writing the regex becomes extremely difficult, because regex is designed for searching in predictable strings. When you factor in the possibility of being given invalid HTML code to parse, the difficulty factor increases even more.
With a lot of effort and good understanding of the more esoteric parts of regex, it can be done, with a reasonable degree of reliability. But it's never going to be perfect -- there's always going to be the possibility of your regex not working if it's fed with something it doesn't expect.
By contrast, parsing it with the DOM is much much simpler -- as demonstrated, with the right libraries, it can be a single line of code (and very easy to read, unlike the horrific regex you'd need to write). It'll also be much more efficient to run, and gives you the ability to do other search operations on the same chunk of HTML, without having to re-parse it all again.