Distinguish a code comment from a text comment

Is there an optimal way to tell whether a single-line or block comment is merely code that was commented out, or an actual comment?
For example,
// console.log('foo');
should be classified as a code comment, while
// This does stuff
should not.
Current Solution:
Parse the comment contents through to an AST and see if it's code or not, kind of like a validator.
Assumptions:
I already have the original code parsed into an AST and have access to a comment node.
This is going to be a Node script.

You need to collect the comment text, and run it through a language substring recognizer. You might have:
/* X=2.7*Y^3+9.3^Y2+2.7* */
That looks like code to me, even if it is incomplete.
So in general you want to detect substrings of the language rather than arbitrarily chosen language structures. (Even if you restrict yourself to expansions of nonterminals as defined by a grammar, do you include all 1000 of the nonterminals in your complex grammar? Just "statement" or "expression"?)
Your first problem will be deciding where the "comment" begins or ends. Is
// X=X+1;
/* foo(bar);
bar(baz);
*/
one code block or two (or three)? What if the apparent code is split across comments?
// X=X+
/* 1; */
I'd guess your biggest problem is finding a language substring parser. Just because you have a parser for the full language doesn't mean you can easily build a substring recognizer with it. (We have done that by bending GLR parsers for our tools; see my bio if you want to know more.)
Your hardest problem is intention: did the programmer really comment out actual code, or was she just sketching a computation in a comment? You can't know unless you can read long-gone minds.
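As a rough illustration of the "parse the comment and see" idea, here is a minimal sketch. It uses `new Function` purely as a syntax check standing in for a real parser (it compiles the text but never runs it), and the helper names `stripCommentMarkers`, `parsesAsCode` and `looksLikeCode` are made up for this example. Adjacent comments are joined before testing, which is one possible answer to the "one block or two" question above, not the only one.

```javascript
// Remove the comment delimiters so only the comment body is left.
function stripCommentMarkers(raw) {
  return raw
    .replace(/^\s*\/\//, "")   // leading "//"
    .replace(/^\s*\/\*+/, "")  // leading "/*"
    .replace(/\*+\/\s*$/, "")  // trailing "*/"
    .trim();
}

// Syntax check only: new Function compiles the text but does not run it.
// Assumption: this stands in for whatever real parser you already use.
function parsesAsCode(text) {
  try {
    new Function(text);
    return true;
  } catch (err) {
    return false;
  }
}

// rawComments: an array of adjacent comments, in source order.
function looksLikeCode(rawComments) {
  // Join the comments so code split across them is tested as one unit.
  const text = rawComments.map(stripCommentMarkers).join("\n");
  if (text === "") return false;
  // A lone word such as "TODO" is a valid expression statement, so also
  // require at least one character that plain prose rarely needs.
  if (!/[()=;{}]/.test(text)) return false;
  return parsesAsCode(text);
}

console.log(looksLikeCode(["// console.log('foo');"])); // true
console.log(looksLikeCode(["// This does stuff"]));     // false
console.log(looksLikeCode(["// X=X+", "/* 1; */"]));    // true
```

Note that this only accepts complete, valid code. An incomplete fragment such as the X=2.7*Y^3+... example fails the syntax check, which is exactly why a substring recognizer is the harder and more accurate tool.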

Related

Is it always possible to go from AST to original source code? [closed]

JavaScript source code can be converted to an AST. I am using SHIFT AST Parser to create an AST from JavaScript code.
Now I want to convert the generated AST back to source code.
I am very much confused here and trying to understand the fundamentals. I am hearing from my colleagues that AST can't be converted back to the source code. But for what reason?
One of my colleagues told me that an AST does not preserve spacing, so indentation will be lost when converting the AST back to source code.
Is it the only reason?

First of all, it depends on what you mean by "original source" code.
If you mean the exact same file on the exact same file system that you were editing when you wrote the software, the answer is that you can't. Obviously.
If you mean code which is character for character identical to the code you wrote, it is technically possible but unlikely in practice.
If you mean code that works exactly the same and "looks mostly the same", then yes, you can. (Depending on "mostly".)
The answer also depends on what AST implementation you are talking about.
Some AST implementations don't preserve comments or spacing / indentation.
Other AST implementations apparently can preserve comments; e.g. as decorations on the tree nodes.
It is theoretically possible for an AST implementation to preserve absolutely everything needed to reconstruct identical source code. (But I don't know of an example that does. It would be memory expensive and kind of pointless.)
What is the harm in not being able to recover comments?
Well it depends on what you want to use the regenerated source code for. If you intend to be able to replace the original code, then there are clear problems:
You have lost any (hopefully) useful comments that the programmer included to help people to understand the code.
It is common to embed formal API documentation in the form of stylized source code comments that are then extracted, formatted, etc. If those comments are lost, it becomes harder to keep the API documentation up to date.
Some 3rd-party tools use stylized comments for specific purposes. For example, a comment could be used to suppress a false positive from a static code analyzer; e.g. a # noqa comment in Python code suppresses a pep8 style error.
On the other hand ... this kind of thing may not be relevant for your use-case.
Now from the tags I deduce that you are using Shift-AST. From a brief scan of the documentation and source code, I don't think this preserves either comments or indentation / spacing.
So that means that you cannot recover source code that is character for character identical with the original code. If that is what you want ... your colleague is 100% correct.
However, character for character identical code may not be necessary, so this may not be a limitation. It depends on your use-case.
And you could investigate Babel as an alternative. Apparently it can preserve comments.
One of my colleagues told me that an AST does not preserve spacing, so indentation will be lost when converting the AST back to source code. Is it the only reason?
Clearly, No. (As my answer explains.)

Nope. An abstract syntax tree is abstract in that it abstracts away syntactic detail that does not affect meaning, such as whitespace and possibly also comments (if these are irrelevant to further processing). As there is usually no purpose in storing this information, it is dropped during parsing.
While one can't go back to "original source code", one can still go back to an equivalent representation which is usually called the canonical form.
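To make the "canonical form" point concrete, here is a small sketch using a toy language (integers, identifiers and "+") invented for this example; it is not Shift or Babel. The tokenizer throws away whitespace and block comments, so differently formatted inputs produce the same tree, and the generator can only print one normalized spelling of it.

```javascript
// Toy language for illustration only: integers, identifiers, and "+".
function tokenize(src) {
  const re = /\s+|\/\*[\s\S]*?\*\/|(\d+|[A-Za-z_]\w*|\+)/y;
  const tokens = [];
  let pos = 0;
  while (pos < src.length) {
    re.lastIndex = pos;
    const m = re.exec(src);
    if (m === null) throw new SyntaxError("Unexpected character at offset " + pos);
    if (m[1] !== undefined) tokens.push(m[1]); // whitespace and comments are dropped here
    pos = re.lastIndex;
  }
  return tokens;
}

// Build a left-associative tree of Add nodes.
function parse(src) {
  const tokens = tokenize(src);
  let i = 0;
  function operand() {
    const t = tokens[i++];
    if (t === undefined || t === "+") throw new SyntaxError("Expected operand");
    return /^\d/.test(t) ? { type: "Num", value: Number(t) } : { type: "Id", name: t };
  }
  let node = operand();
  while (tokens[i] === "+") {
    i++;
    node = { type: "Add", left: node, right: operand() };
  }
  if (i < tokens.length) throw new SyntaxError("Unexpected token " + tokens[i]);
  return node;
}

// Print the tree in one fixed style: this is the canonical form.
function generate(node) {
  switch (node.type) {
    case "Num": return String(node.value);
    case "Id": return node.name;
    case "Add": return generate(node.left) + " + " + generate(node.right);
    default: throw new Error("Unknown node type " + node.type);
  }
}

const original = "x   +/* offset */ 1\n   + y";
const regenerated = generate(parse(original));
console.log(regenerated);                                  // x + 1 + y
console.log(regenerated === generate(parse(regenerated))); // true
```

The regenerated text is equivalent to the input but not identical to it: the comment and the original spacing are gone because the tree never held them.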

Well, it's possible. If you're using shift-ast you can do it.
Step 1:
npm install shift-codegen
Step 2:
import codegen from "shift-codegen";
let programSource = codegen(/* Shift format AST */);
programSource will be a string. Write it to your file and use Prettier to format your code.
An alternative to shift-ast is Babel, which offers many benefits, including transforms and template features. It also supports TypeScript, Flow, JSX, comments, and minification.

Grammar rules for comments

I am working with reflect.js (a nice Javascript parser) from Zach Carter on github; I am trying to modify the behavior of his parser to handle comments as normal tokens that should be parsed like anything else. The default behavior of reflect.js is to keep track of all comments (the lexer grabs them as tokens) and then append a list of them to the end of the AST (Abstract Syntax Tree) it creates.
However, I would like these comments to be included in-place in the AST. I believe this change will involve adding grammar rules to the grammar.y file. There are currently no rules for comments -- if my understanding is correct, that is why they are ignored by the main parsing code.
How do you write rules to include comments in an AST?

The naive version modifies each rule of the original grammar:
LHS = RHS1 RHS2 ... RHSN ;
to be:
LHS = RHS1 COMMENTS RHS2 COMMENTS ... COMMENTS RHSN ;
While this works in the abstract, this will likely screw up your parser generator if it is LL or LALR based, because now it can't see far enough ahead with just the next token to decide what to do. So you'd have to switch to a more powerful parser generator such as GLR.
A smarter version replaces (only and) every terminal T with a nonterminal:
T = COMMENTS t ;
and modifies the original lexer to trivially emit t instead of T. You still have lookahead troubles.
But this gives us the basis for a real solution.
A more sophisticated version of this is to have the lexer collect the comments seen before a token and attach them to the next token it emits; in essence, we are implementing the terminal rule modification of the grammar in the lexer.
Now the parser (you don't have to switch technologies) just sees the tokens it originally saw; the tokens carry the comments as annotations. You'll find it useful to divide comments into those that attach to the previous token, and those that attach to the next, but you won't be able to make this any better than a heuristic, because there is no practical way to decide to which token the comments really belong.
You'll find it fun to figure out how to capture the positioning information on the tokens and the comments, to enable regeneration of the original text ("comments in their proper locations"). You'll find it more fun to actually regenerate the text with appropriate radix values, character string escapes, etc., in a way that doesn't break the language syntax rules.
We do this with our general language processing tools and it works reasonably well. It is amazing how much work it is to get it all straight, so that you can focus on your transformation task. People underestimate this a lot.
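The lexer-side approach can be sketched roughly as follows. This is a toy lexer written for this example (it knows nothing about string or regex literals, so a "//" inside a string would fool it), and the names `lexWithComments`, `leadingComments` and `trailingComments` are illustrative, not part of reflect.js. Comments are buffered and hung on the next real token together with their offsets, so the parser still sees the token stream it always saw.

```javascript
// Toy lexer: attaches each run of comments to the token that follows it.
// Assumption: no string or regex literals in the input.
function lexWithComments(src) {
  // group 1: whitespace, group 2: comment, group 3: identifier, number, or punctuation
  const re = /(\s+)|(\/\/[^\n]*|\/\*[\s\S]*?\*\/)|([A-Za-z_$][\w$]*|\d+|[^\s\w])/y;
  const tokens = [];
  let pending = []; // comments seen since the last real token
  let pos = 0;
  while (pos < src.length) {
    re.lastIndex = pos;
    const m = re.exec(src);
    if (m === null) throw new SyntaxError("Unexpected character at offset " + pos);
    if (m[2] !== undefined) {
      pending.push({ text: m[2], start: pos });
    } else if (m[3] !== undefined) {
      // The parser sees the same token as before; comments ride along as annotations.
      tokens.push({ value: m[3], start: pos, leadingComments: pending });
      pending = [];
    }
    pos = re.lastIndex;
  }
  // Comments after the last token have no "next token" to attach to.
  return { tokens, trailingComments: pending };
}

const { tokens, trailingComments } = lexWithComments(
  "// counter\nx = /* start */ 1; // done"
);
console.log(tokens.map((t) => t.value).join(" "));  // x = 1 ;
console.log(tokens[0].leadingComments[0].text);     // prints the "// counter" comment
console.log(tokens[2].leadingComments[0].text);     // prints the "/* start */" comment
console.log(trailingComments[0].text);              // prints the "// done" comment
```

Here "// done" arguably belongs to the statement before it rather than to whatever comes next, which is the previous-token versus next-token heuristic described above; this sketch sidesteps it by always attaching forward.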

Javascript lexer / tokenizer (in Python?)

Does anyone know of a Javascript lexical analyzer or tokenizer (preferably in Python?)
Basically, given an arbitrary Javascript file, I want to grab the tokens.
e.g.
foo = 1
becomes something like:
variable name : "foo"
whitespace
operator : equals
whitespace
integer : 1
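For a sense of what producing that output involves, here is a minimal table-driven sketch, written in JavaScript rather than Python. It covers only the four token kinds in the example above (names, integers, "=" and whitespace) and is nowhere near a full JavaScript lexer; the `TOKEN_TYPES` table and the `tokenize` name are made up for this illustration.

```javascript
// Each entry is [token type, sticky regex]. The first pattern that matches wins.
const TOKEN_TYPES = [
  ["whitespace", /\s+/y],
  ["integer", /\d+/y],
  ["variable name", /[A-Za-z_$][\w$]*/y],
  ["operator", /=/y],
];

function tokenize(src) {
  const out = [];
  let pos = 0;
  while (pos < src.length) {
    let matched = false;
    for (const [type, re] of TOKEN_TYPES) {
      re.lastIndex = pos; // sticky regexes match only at lastIndex
      const m = re.exec(src);
      if (m !== null) {
        out.push({ type, text: m[0] });
        pos = re.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError("Unexpected character at offset " + pos);
  }
  return out;
}

console.log(tokenize("foo = 1").map((t) => t.type));
// [ 'variable name', 'whitespace', 'operator', 'whitespace', 'integer' ]
```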

http://code.google.com/p/pynarcissus/ has one.
Also I made one but it doesn't support automatic semicolon insertion so it is pretty useless for javascript that you have no control over (as almost all real life javascript programs lack at least one semicolon) :) Here is mine:
http://bitbucket.org/santagada/jaspyon/src/tip/jaspyon/
The grammar is in jsgrammar.txt; it is parsed by the PyPy parsing lib (which you will have to download and extract from the PyPy source), and it builds a parse tree which I walk in astbuilder.py.
But if you don't have licensing problems I would go with pynarcissus. Here's a direct link to the code (ported from Narcissus):
http://code.google.com/p/pynarcissus/source/browse/trunk/jsparser.py

Parse JavaScript to instrument code

I need to split a JavaScript file into single instructions. For example
a = 2;
foo()
function bar() {
b = 5;
print("spam");
}
has to be separated into three instructions. (assignment, function call and function definition).
Basically I need to instrument the code, injecting code between these instructions to perform checks. Splitting on ";" obviously wouldn't work, because instructions can also end with a newline, and I may not want to instrument code inside function and class definitions (I don't know yet). I took a course on grammars with flex/Bison, but in this case the semantic action for the rule would be "print all the descendants in the parse tree and put my code at the end", which I don't think can be done with basic Bison. How do I do this? I also need to split the code because I need to interface with Python through python-spidermonkey.
Or... is there a library out there already which saves me from reinventing the wheel? It doesn't have to be in Python.

Why not use a JavaScript parser? There are lots, including a Python API for ANTLR and a Python wrapper around SpiderMonkey.

JavaScript is tricky to parse; you need a full JavaScript parser.
The DMS Software Reengineering Toolkit can parse full JavaScript and build a corresponding AST.
AST operators can then be used to walk over the tree to "split it". Even easier, however, is to apply source-to-source transformations that look for one surface syntax (JavaScript) pattern and replace it with another. You can use such transformations to insert the instrumentation into the code, rather than splitting the code to make holes in which to do the insertions. After the transformations are complete, DMS can regenerate valid JavaScript code (complete with the original comments if unaffected).
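Whichever parser you pick, the insertion step itself can be small. The sketch below is not DMS; it assumes you already have the top-level statement nodes with `start`/`end` character offsets, which is how ESTree-style parsers such as acorn report positions. The nodes are built by hand here so the example runs without a parser installed, and `instrument`, `makeProbe` and `__check` are hypothetical names.

```javascript
// Insert a probe after every top-level statement, leaving the statement text untouched.
// Assumption: `statements` is sorted by position and covers top-level nodes only.
function instrument(source, statements, makeProbe) {
  let out = "";
  let cursor = 0;
  statements.forEach((node, index) => {
    out += source.slice(cursor, node.end); // original text up to the end of this statement
    out += ";" + makeProbe(node, index);   // explicit ";" in case the statement relied on ASI
    cursor = node.end;
  });
  return out + source.slice(cursor);
}

const source = "a = 2;\nfoo()\nfunction bar() {\n  b = 5;\n}\n";

// Stand-in for parser output: locate each statement's text to get its offsets.
const node = (type, text) => {
  const start = source.indexOf(text);
  return { type, start, end: start + text.length };
};
const statements = [
  node("ExpressionStatement", "a = 2;"),
  node("ExpressionStatement", "foo()"),
  node("FunctionDeclaration", "function bar() {\n  b = 5;\n}"),
];

// __check is a hypothetical function your test harness would provide.
const result = instrument(source, statements, (n, i) => ` __check(${i});`);
console.log(result);
```

Because only top-level nodes are passed in, the body of bar is left alone; instrumenting inside functions would mean walking the tree and collecting nested statements as well.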

Why not use an existing JavaScript interpreter like Rhino (Java) or python-spidermonkey (not sure whether this one is still alive)? It will parse the JS and then you can examine the resulting parse tree. I'm not sure how easy it will be to recreate the original code but that mostly depends on how readable the instrumented code must be. If no one ever looks at it, just generate a really compact form.
pyjamas might also be of interest; this is a Python to JavaScript transpiler.
[EDIT] While this doesn't solve your problem at first glance, you might use it for a different approach: Instead of instrumenting JavaScript, write your code in Python instead (which can be easily instrumented; all the tools are already there) and then convert the result to JavaScript.
Lastly, if you want to solve your problem in Python but can't find a parser: Use a Java engine to add comments to the code which you can then search for in Python to instrument the code.

Why not try a javascript beautifier?
For example http://jsbeautifier.org/
Or see Command line JavaScript code beautifier that works on Windows and Linux

Forget my parser. https://bitbucket.org/mvantellingen/pyjsparser is a great and complete parser. I've fixed a couple of its bugs here: https://bitbucket.org/nullie/pyjsparser

Do you recommend using semicolons after every statement in JavaScript?

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
In many situations, JavaScript parsers will insert semicolons for you if you leave them out. My question is, do you leave them out?
If you're unfamiliar with the rules, there's a description of semicolon insertion on the Mozilla site. Here's the key point:
If the first through the nth tokens of a JavaScript program form are grammatically valid but the first through the n+1st tokens are not and there is a line break between the nth tokens and the n+1st tokens, then the parser tries to parse the program again after inserting a virtual semicolon token between the nth and the n+1st tokens.
That description may be incomplete, because it doesn't explain #Dreas's example. Anybody have a link to the complete rules, or see why the example gets a semicolon? (I tried it in JScript.NET.)
This stackoverflow question is related, but only talks about a specific scenario.

Yes, you should use semicolons after every statement in JavaScript.

An ambiguous case that breaks in the absence of a semicolon:
// define a function
var fn = function () {
//...
} // semicolon missing at this line
// then execute some code inside a closure
(function () {
//...
})();
This will be interpreted as:
var fn = function () {
//...
}(function () {
//...
})();
We end up passing the second function as an argument to the first function and then trying to call the result of that call as a function. Unless the first function happens to return a function, this fails at runtime with a "... is not a function" error.
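If you want to see this for yourself, here is a small runnable sketch. It compiles the two snippets with `new Function` so that the failing version can be caught instead of crashing the script; `runAndCatch` is just a helper name for this example.

```javascript
// Run a snippet as a function body and report the name of the error it throws, if any.
function runAndCatch(source) {
  try {
    new Function(source)();
    return null;
  } catch (err) {
    return err.name;
  }
}

const withoutSemicolon = [
  "var fn = function () { return 'first'; }",
  "(function () { return 'second'; })();",
].join("\n");

// Same code with a semicolon after the first function expression.
const withSemicolon = withoutSemicolon.replace("}\n", "};\n");

console.log(runAndCatch(withoutSemicolon)); // TypeError
console.log(runAndCatch(withSemicolon));    // null
```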

Yes, you should always use semicolons. Why? Because if you end up using a JavaScript compressor, all your code will be on one line, which will break your code.
Try http://www.jslint.com/; it will hurt your feelings, but show you many ways to write better JavaScript (and one of the ways is to always use semicolons).

What everyone seems to miss is that the semi-colons in JavaScript are not statement terminators but statement separators. It's a subtle difference, but it is important to the way the parser is programmed. Treat them like what they are and you will find leaving them out will feel much more natural.
I've programmed in other languages where the semi-colon is a statement separator and also optional as the parser does 'semi-colon insertion' on newlines where it does not break the grammar. So I was not unfamiliar with it when I found it in JavaScript.
I don't like noise in a language (which is one reason I'm bad at Perl) and semi-colons are noise in JavaScript. So I omit them.

I'd say consistency is more important than saving a few bytes. I always include semicolons.
On the other hand, I'd like to point out there are many places where the semicolon is not syntactically required, even if a compressor is nuking all available whitespace, e.g. at the end of a block.
if (a) { b() }

JavaScript automatically inserts semicolons whilst interpreting your code, so if you put the value of the return statement below the line, it won't be returned:
Your Code:
return
5
JavaScript Interpretation:
return;
5;
Thus, nothing is returned (the function returns undefined), because of JavaScript's automatic semicolon insertion.
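A quick runnable check of this behavior, with one common way to keep a multi-line return value intact (the function names are just for the example):

```javascript
function broken() {
  return
  5; // never reached: a semicolon is inserted right after `return`
}

function fixed() {
  // An opening parenthesis on the `return` line keeps the statement going.
  return (
    5
  );
}

console.log(broken()); // undefined
console.log(fixed());  // 5
```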

I think this is similar to what the last podcast discussed. The "Be liberal in what you accept" principle means that extra work had to be put into the JavaScript parser to fix cases where semicolons were left out. Now we have a boatload of pages floating around with bad syntax that might break one day in the future when some browser decides to be a little more stringent about what it accepts. This type of rule should also apply to HTML and CSS. You can write broken HTML and CSS, but don't be surprised when you get weird and hard-to-debug behaviors when some browser doesn't properly interpret your incorrect code.

The article Semicolons in JavaScript are optional makes some really good points about not using semicolons in JavaScript. It deals with all the points that have been brought up by the answers to this question.

This is the very best explanation of automatic semicolon insertion that I've found anywhere. It will clear away all your uncertainty and doubt.

I use semicolon, since it is my habit.
Now I understand why I can't split a string across two lines... it puts a semicolon at the end of each line.

No, only use semicolons when they're required.
