I am working with reflect.js (a nice JavaScript parser) by Zach Carter on GitHub; I am trying to modify the behavior of his parser to handle comments as normal tokens that should be parsed like anything else. The default behavior of reflect.js is to keep track of all comments (the lexer grabs them as tokens) and then append a list of them to the end of the AST (Abstract Syntax Tree) it creates.
However, I would like these comments to be included in-place in the AST. I believe this change will involve adding grammar rules to the grammar.y file. There are currently no rules for comments; if my understanding is correct, that is why they are ignored by the main parsing code.
How do you write rules to include comments in an AST?
The naive version modifies each rule of the original grammar:
LHS = RHS1 RHS2 ... RHSN ;
to be:
LHS = RHS1 COMMENTS RHS2 COMMENTS ... COMMENTS RHSN ;
While this works in the abstract, this will likely screw up your parser generator if it is LL or LALR based, because now it can't see far enough ahead with just the next token to decide what to do. So you'd have to switch to a more powerful parser generator such as GLR.
A smarter version replaces every terminal T (and only the terminals) with a nonterminal:
T = COMMENTS t ;
and modifies the original lexer to trivially emit t instead of T. You still have lookahead troubles.
But this gives us the basis for a real solution.
A more sophisticated version of this is to have the lexer collect comments seen before a token and attach them to the next token it emits; in essence, we are implementing the terminal-rule modification of the grammar in the lexer.
Now the parser (you don't have to switch technologies) just sees the tokens it originally saw; the tokens carry the comments as annotations. You'll find it useful to divide comments into those that attach to the previous token, and those that attach to the next, but you won't be able to make this any better than a heuristic, because there is no practical way to decide to which token the comments really belong.
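A minimal sketch of that lexer-level trick, in JavaScript. The token shape ({ type, value }) and the lexer interface here are hypothetical, not reflect.js's actual API:

// Wrap an existing lexer so comment tokens are never emitted on their own;
// instead they ride along as annotations on the next real token.
function makeCommentAttachingLexer(lexer) {
  return {
    next() {
      const pending = [];
      let tok = lexer.next();                 // assumed to return { type, value } or null
      while (tok && tok.type === "COMMENT") {
        pending.push(tok);                    // collect comments seen before a token
        tok = lexer.next();
      }
      if (tok) tok.leadingComments = pending; // attach as annotations
      return tok;
    }
  };
}

The parser then consumes tokens exactly as before, and any tree node built from a token can copy the annotations off it.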
You'll find it fun to figure out how to capture the positioning information on the tokens and the comments, to enable regeneration of the original text ("comments in their proper locations"). You'll find it more fun to actually regenerate the text with appropriate radix values, character string escapes, etc., in a way that doesn't break the language syntax rules.
We do this with our general language processing tools and it works reasonably well. It is amazing how much work it is to get it all straight, so that you can focus on your transformation task. People underestimate this a lot.
JavaScript source code can be converted to an AST. I am using the Shift AST parser to create an AST from JavaScript code.
Now I want to convert the generated AST back to source code.
I am very much confused here and am trying to understand the fundamentals. I am hearing from my colleagues that an AST can't be converted back to the source code. But for what reason?
One of my colleagues told me an AST does not preserve spacing, so indentation will be lost when converting the AST back to source code.
Is it the only reason?
First of all, it depends on what you mean by "original source" code.
If you mean the exact same file on the exact same file system that you were editing when you wrote the software, the answer is that you can't. Obviously.
If you mean code which is character for character identical to the code you wrote, it is technically possible but unlikely in practice.
If you mean code that works exactly the same and "looks mostly the same", then yes, you can. (Depending on "mostly".)
The answer also depends on what AST implementation you are talking about.
Some AST implementations don't preserve comments or spacing / indentation.
Other AST implementations apparently can preserve comments; e.g. as decorations on the tree nodes.
It is theoretically possible for an AST implementation to preserve absolutely everything needed to reconstruct identical source code. (But I don't know of an example that does. It would be memory expensive and kind of pointless.)
What is the harm in not being able to recover comments?
Well it depends on what you want to use the regenerated source code for. If you intend to be able to replace the original code, then there are clear problems:
You have lost any (hopefully) useful comments that the programmer included to help people to understand the code.
It is common to embed formal API documentation in the form of stylized source code comments that are then extracted, formatted, etc. If those comments are lost, it becomes harder to keep the API documentation up to date.
Some 3rd-party tools use stylized comments for specific purposes. For example, a comment could be used to suppress a false positive from a static code analyzer; e.g. a # noqa comment in Python code suppresses a PEP 8 style error.
On the other hand ... this kind of thing may not be relevant for your use-case.
Now from the tags I deduce that you are using Shift-AST. From a brief scan of the documentation and source code, I don't think this preserves either comments or indentation / spacing.
So that means that you cannot recover source code that is character for character identical with the original code. If that is what you want ... your colleague is 100% correct.
However, character for character identical code may not be necessary, so this may not be a limitation. It depends on your use-case.
And you could investigate Babel as an alternative. Apparently it can preserve comments.
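A sketch of that round trip with Babel, assuming @babel/parser and @babel/generator (the parser attaches comments to AST nodes by default, and the generator emits them by default):

const { parse } = require("@babel/parser");
const generate = require("@babel/generator").default;

const source = "// add two numbers\nfunction add(a, b) {\n  return a + b; // sum\n}\n";

const ast = parse(source);                  // comments end up attached to AST nodes
const { code } = generate(ast, { comments: true }, source);
console.log(code);                          // regenerated source with comments intact

Note that this preserves the comments but not the exact original whitespace; the generator reformats the code.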
One of my colleagues told me an AST does not preserve spacing, so indentation will be lost when converting the AST back to source code. Is it the only reason?
Clearly, No. (As my answer explains.)
Nope. An abstract syntax tree is abstract in the sense that it abstracts away concrete syntax details, such as whitespace and possibly also comments (when these are irrelevant to further processing). As there is usually no purpose in storing this information, it is worth dropping during parsing.
While one can't go back to "original source code", one can still go back to an equivalent representation which is usually called the canonical form.
Well, it's possible. If you're using shift-ast you can do it.
Step 1:
npm install shift-codegen
Step 2:
import codegen from "shift-codegen";
let programSource = codegen(/* Shift format AST */);
programSource will be a string. Write it to your file and use Prettier to format your code.
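Putting both steps together, a sketch of the full round trip (assuming shift-parser for the parsing side):

import codegen from "shift-codegen";
import { parseScript } from "shift-parser";

const original = "const x = 1; function f(y) { return x + y; }";
const ast = parseScript(original);  // source -> Shift AST
const regenerated = codegen(ast);   // Shift AST -> source (formatting is not preserved)
console.log(regenerated);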
An alternative to shift-ast is Babel, which offers many benefits, including transforms and template features. It also supports TypeScript, Flow, JSX, comment preservation, and minification.
Is there a reliable way to tell whether a single-line or block comment is merely code that was commented out, or an actual comment?
e.g.
// console.log('foo');
Should be detected as commented-out code.
// This does stuff
Should be detected as an actual comment.
Current Solution:
Parse the comment contents into an AST and check whether the result is valid code, kind of like a validator (sketched after the assumptions below).
Assumptions:
I already have the original code parsed into an AST, and I have access to a comment node.
This is going to be a Node script.
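A minimal sketch of that validator approach, assuming Esprima (any parser whose parse function throws on invalid input would do):

const esprima = require("esprima");

function looksLikeCode(commentText) {
  try {
    esprima.parseScript(commentText);
    return true;   // parsed cleanly: plausibly commented-out code
  } catch (e) {
    return false;  // syntax error: probably prose
  }
}

looksLikeCode("console.log('foo');"); // true
looksLikeCode("This does stuff");     // false

Beware of false positives: an English fragment like "TODO: cleanup" happens to parse as a labeled statement, so a bare parse check is only a heuristic, which is exactly the issue the answer below gets into.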
You need to collect the comment text, and run it through a language substring recognizer. You might have:
/* X=2.7*Y^3+9.3^Y2+2.7* */
That looks like code to me, even if it is incomplete.
So in general you want to detect substrings of the language, as opposed to arbitrarily chosen language structures. (Even if you choose to recognize expansions of just the nonterminals defined by a grammar, do you include all 1000 of the nonterminals in your complex grammar, or just "statement" or "expression"?)
Your first problem will be deciding where the "comment" begins or ends. Is
// X=X+1;
/* foo(bar);
bar(baz);
*/
one code block or two (or three)? What if the apparent code is split across comments?
// X=X+
/* 1; */
I'd guess your biggest problem is finding a language substring parser. Just because you have a parser for the full language doesn't mean you can easily build a substring recognizer with it. (We have done that by bending GLR parsers for our tools; see my bio if you want to know more.)
Your hardest problem is intention: did the programmer really comment out actual code, or was she just sketching a computation in a comment? You can't know unless you can read long-gone minds.
I'm trying to understand how JS is actually parsed. But my searches either return someone's vaguely documented "parser generator" project (I don't even know what that means), or show how to parse JS with a JS engine's magical "parse" method. I don't want to scan through a bunch of code and spend forever trying to understand it (although I could, it would take too long).
I want to know how an arbitrary string of JS code is actually turned into objects, functions, variables, etc. I also want to know the procedures and techniques that turn that string into stuff, and how it gets stored, referenced, and executed.
Are there any documentation/references for this?
Parsers probably work in all sorts of ways, but fundamentally they first go through a stage of tokenisation, then give the result to the compiler, which turns it into a program if it can. For example, given:
function foo(a) {
  alert(a);
}
the parser will remove any leading whitespace to the first character, the letter "f". It will collect characters until it gets something that doesn't belong, the whitespace, that indicates the end of the token. It starts again with the "f" of "foo" until it gets to the "(", so it now has the tokens "function" and "foo". It knows "(" is a token on its own, so that's 3 tokens. It then gets the "a" followed by ")" which are two more tokens to make 5, and so on.
The only need for whitespace is between tokens that are otherwise ambiguous (e.g. there must be either whitespace or another token between "function" and "foo").
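A toy version of that tokenisation stage, to make it concrete (real JS lexers additionally handle strings, comments, regex literals, multi-character operators, and so on):

// Very rough sketch: repeatedly match one pattern at the front of the source.
function tokenise(src) {
  const rules = [
    ["whitespace", /^\s+/],
    ["identifier", /^[A-Za-z_$][A-Za-z0-9_$]*/],
    ["number",     /^\d+(\.\d+)?/],
    ["punctuator", /^[(){};,=+\-*/]/]
  ];
  const tokens = [];
  while (src.length > 0) {
    const rule = rules.find(([, re]) => re.test(src));
    if (!rule) throw new SyntaxError("Unexpected character: " + src[0]);
    const [type, re] = rule;
    const text = src.match(re)[0];
    if (type !== "whitespace") tokens.push({ type, text }); // whitespace only separates
    src = src.slice(text.length);
  }
  return tokens;
}

tokenise("function foo(a) { alert(a); }");
// [{ type: "identifier", text: "function" }, { type: "identifier", text: "foo" },
//  { type: "punctuator", text: "(" }, ...]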
Once tokenisation is complete, it goes to the compiler, which sees "function" as an identifier, and interprets it as the keyword "function". It then gets "foo", an identifier that the language grammar tells it is the function name. Then the "(" indicates an opening grouping operator and hence the start of a formal parameter list, and so on.
Compilers may deal with tokens one at a time, or may grab them in chunks, or do all sorts of weird things to make them run faster.
You can also read How do C/C++ parsers work?, which gives a few more clues. Or just use Google.
While it doesn't correspond closely to the way real JS engines work, you might be interested in reading Douglas Crockford's article on Top Down Operator Precedence, which includes code for a small working lexer and parser written in the Javascript subset it parses. It's very readable and concise code (with good accompanying explanations) which at least gives you an outline of how a real implementation might work.
A more common technique than Crockford's "Top Down Operator Precedence" is recursive descent parsing, which is used in Narcissus, a complete implementation of JS in JS.
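To make "recursive descent" concrete, here is a toy parser for additive expressions; each grammar rule becomes a function that consumes tokens and calls the functions for its sub-rules. This illustrates the technique only; it is not how Narcissus itself is structured:

// Grammar:  expr -> term (("+" | "-") term)*     term -> NUMBER
function parseExpr(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function term() {
    const tok = next();
    if (!/^\d+$/.test(tok)) throw new SyntaxError("Expected number, got " + tok);
    return { type: "Number", value: Number(tok) };
  }

  function expr() {
    let node = term();
    while (peek() === "+" || peek() === "-") {
      node = { type: "Binary", op: next(), left: node, right: term() };
    }
    return node;
  }

  const result = expr();
  if (pos !== tokens.length) throw new SyntaxError("Trailing input");
  return result;
}

parseExpr(["1", "+", "2", "-", "3"]);
// { type: "Binary", op: "-", left: { type: "Binary", ... }, right: { type: "Number", value: 3 } }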
Maybe Esprima will help you understand how the JS grammar is parsed. It's available online.
I have a grammar for a domain specific language, and I need to create a javascript code editor for that language. Are there any tools that would allow me to generate
a) a javascript incremental parser
b) a javascript auto-complete / auto-suggest engine?
Thanks!
An example of implementing content assist (auto-complete) using the Chevrotain JavaScript parsing DSL:
https://github.com/SAP/chevrotain/tree/master/examples/parser/content_assist
Chevrotain was designed specifically to build parsers used (as part of) language services tools in Editors/IDEs.
Some of the relevant features are:
Automatic Error Recovery / Fault tolerance because editors and IDEs need to be able to handle 'mostly valid' inputs.
Every Grammar rule may be used as the starting rule as an Editor/IDE may only want to implement incremental parsing for performance reasons.
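A condensed sketch of what that looks like with Chevrotain's content-assist API (simplified from the linked example; the tiny grammar here is made up):

const { createToken, CstParser, Lexer } = require("chevrotain");

// Tiny grammar: SELECT <identifier>
const Select = createToken({ name: "Select", pattern: /SELECT/ });
const Identifier = createToken({ name: "Identifier", pattern: /[a-zA-Z]\w*/ });
const WhiteSpace = createToken({ name: "WhiteSpace", pattern: /\s+/, group: Lexer.SKIPPED });
const allTokens = [WhiteSpace, Select, Identifier];

class TinyParser extends CstParser {
  constructor() {
    super(allTokens);
    this.RULE("statement", () => {
      this.CONSUME(Select);
      this.CONSUME(Identifier);
    });
    this.performSelfAnalysis();
  }
}

const parser = new TinyParser();
const lexer = new Lexer(allTokens);

// Which token types may legally follow the input typed so far?
const typedSoFar = lexer.tokenize("SELECT").tokens;
const suggestions = parser.computeContentAssist("statement", typedSoFar);
console.log(suggestions.map((s) => s.nextTokenType.name)); // ["Identifier"]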
You may want Jison, a JS parser generator. As for auto-complete / auto-suggest, most of the tools out there that I know of are based on word completion rather than code completion. But once you have a parser running, I don't think that part is too difficult.
This is difficult. I'm doing the same sort of thing myself.
One approach is:
You need a parser which will give you an array of the currently possible ASTs for the text up to the token before the current cursor position.
From there you can see which token types may come next (usually just one), and do the completion based on the partial text.
If I ever get my incremental parser working, I'll send a link.
Good luck, and let me know if you find a package which does this.
Chris.
Does anyone know of a JavaScript lexical analyzer or tokenizer (preferably in Python)?
Basically, given an arbitrary JavaScript file, I want to grab the tokens.
e.g.
foo = 1
becomes something like:
variable name : "foo"
whitespace
operator : equals
whitespace
integer : 1
http://code.google.com/p/pynarcissus/ has one.
Also, I made one, but it doesn't support automatic semicolon insertion, so it is pretty useless for JavaScript that you have no control over (as almost all real-life JavaScript programs lack at least one semicolon). :) Here is mine:
http://bitbucket.org/santagada/jaspyon/src/tip/jaspyon/
The grammar is in jsgrammar.txt. It is parsed by the PyPy parsing lib (which you will have to download and extract from the PyPy source), and it builds a parse tree which I walk in astbuilder.py.
But if you don't have licensing problems, I would go with pynarcissus. Here's a direct link to the code (ported from Narcissus):
http://code.google.com/p/pynarcissus/source/browse/trunk/jsparser.py
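If a JavaScript-based tokenizer is also acceptable, Esprima exposes exactly this as an API (not mentioned in the answers above, so treat it as one more option); its tokenize function returns the kind of list the question asks for:

const esprima = require("esprima");

console.log(esprima.tokenize("foo = 1"));
// [ { type: "Identifier", value: "foo" },
//   { type: "Punctuator", value: "=" },
//   { type: "Numeric", value: "1" } ]

Note that it does not emit whitespace tokens.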