Does anyone know of a JavaScript lexical analyzer or tokenizer (preferably in Python)?
Basically, given an arbitrary Javascript file, I want to grab the tokens.
e.g.
foo = 1
becomes something like:
variable name : "foo"
whitespace
operator : equals
whitespace
integer : 1
http://code.google.com/p/pynarcissus/ has one.
Also, I made one, but it doesn't support automatic semicolon insertion, so it is pretty useless for JavaScript that you have no control over (almost all real-life JavaScript programs lack at least one semicolon). :) Here is mine:
http://bitbucket.org/santagada/jaspyon/src/tip/jaspyon/
The grammar is in jsgrammar.txt. It is parsed by the PyPy parsing library (which you will have to download and extract from the PyPy source), and it builds a parse tree that I walk in astbuilder.py.
But if you don't have licensing problems, I would go with pynarcissus. Here's a direct link to the code (ported from Narcissus):
http://code.google.com/p/pynarcissus/source/browse/trunk/jsparser.py
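If you only need to grab tokens roughly like the example in the question, a hand-rolled scanner gets you part of the way. Here is a sketch (in JavaScript, though the same approach ports directly to Python's re module); it is a toy only: no regex literals, no automatic semicolon insertion, and only basic strings. The rule names are made up.

// Toy token grabber: try each rule at the start of the remaining source
// and emit { type, value } pairs.
var rules = [
  ["whitespace", /^\s+/],
  ["number",     /^\d+(?:\.\d+)?/],
  ["name",       /^[A-Za-z_$][\w$]*/],
  ["string",     /^"(?:[^"\\]|\\.)*"|^'(?:[^'\\]|\\.)*'/],
  ["operator",   /^(?:===|!==|==|!=|<=|>=|[-+*\/%=<>!(){}\[\];,.])/]
];

function tokenize(src) {
  var tokens = [];
  while (src.length) {
    var matched = false;
    for (var i = 0; i < rules.length; i++) {
      var m = rules[i][1].exec(src);
      if (m) {
        tokens.push({ type: rules[i][0], value: m[0] });
        src = src.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("Unexpected character: " + src[0]);
  }
  return tokens;
}

// tokenize("foo = 1") yields, in order:
// name "foo", whitespace " ", operator "=", whitespace " ", number "1"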
Related
There seem to be a lot of resources for XML pull parsing, but is it possible to build a pull parser for JavaScript? Why is this not something people pursue? Pull parsing makes it possible to stream the file while parsing it, which allows for arbitrarily large files (for example) and concurrent use.
The problem I encounter is that I need to divide the code into small units, and I thought statements would be a good way to split it: each call to the pull parser would yield another statement (or function declaration). However, this goes wrong with function expressions, which force statements to be split further, because each statement can contain a function with more statements inside it.
How would I go about implementing such a parser? Or do you think this is an unwise design?
I'm trying to build a fast minifier.
EDIT: see http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java-V2 for more info on sequential access parsers. They only describe JSON and XML...
Also see https://github.com/qfox/zeparser2 for a fast streaming JS parser.
EDIT2:
I can think of a few options:
Return each grammar production, even nested ones. Most tokens will then be returned multiple times as part of different productions (like an expression inside a statement). For example, you would first return the statement 'var a = b + c;' and then return the expression 'b + c'. As the caller, you can check whether the returned production is a var statement and do something with it...
Work with event callbacks; this is push parsing. For example, call a var-statement handler or an expression handler.
Full-blown AST generation with early return?
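For what it's worth, here is a rough sketch of the kind of pull interface I have in mind: a generator that yields one top-level statement's tokens at a time by tracking nesting depth. It assumes a separate tokenize() step (not shown) and is only a heuristic splitter, not a real parser:

// Sketch only: yields the tokens of one top-level statement per pull.
// Nested function bodies stay inside the statement they belong to.
function* pullStatements(tokens) {
  let depth = 0;
  let current = [];
  for (const tok of tokens) {
    current.push(tok);
    if (tok.value === "(" || tok.value === "{" || tok.value === "[") depth++;
    if (tok.value === ")" || tok.value === "}" || tok.value === "]") depth--;
    // a top-level ";" or "}" (block, function declaration) ends a statement
    if (depth === 0 && (tok.value === ";" || tok.value === "}")) {
      yield current;
      current = [];
    }
  }
  if (current.length) yield current; // trailing statement without ";" (ASI territory)
}

// The caller drives the parse one statement at a time:
// for (const stmt of pullStatements(tokenize(source))) {
//   minify(stmt); // hypothetical consumer
// }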
I am working with reflect.js (a nice JavaScript parser) by Zach Carter on GitHub; I am trying to modify the behavior of his parser to handle comments as normal tokens that should be parsed like anything else. The default behavior of reflect.js is to keep track of all comments (the lexer grabs them as tokens) and then append a list of them to the end of the AST (Abstract Syntax Tree) it creates.
However, I would like these comments to be included in place in the AST. I believe this change will involve adding grammar rules to the grammar.y file. There are currently no rules for comments; if my understanding is correct, that is why they are ignored by the main parsing code.
How do you write rules to include comments in an AST?
The naive version modifies each rule of the original grammar:
LHS = RHS1 RHS2 ... RHSN ;
to be:
LHS = RHS1 COMMENTS RHS2 COMMENTS ... COMMENTS RHSN ;
While this works in the abstract, it will likely screw up your parser generator if it is LL- or LALR-based, because it can no longer see far enough ahead with just the next token to decide what to do. So you'd have to switch to a more powerful parser generator such as GLR.
A smarter version replaces every terminal T (and only the terminals) with a nonterminal:
T = COMMENTS t ;
and modifies the original lexer to trivially emit t instead of T. You still have lookahead troubles.
But this gives us the basis for a real solution.
A more sophisticated version of this is to have the lexer collect comments seen before a token and attach them to the next token it emits; in essence, we are implementing the terminal-rule modification of the grammar in the lexer.
Now the parser (you don't have to switch technologies) just sees the tokens it originally saw; the tokens carry the comments as annotations. You'll find it useful to divide comments into those that attach to the previous token, and those that attach to the next, but you won't be able to make this any better than a heuristic, because there is no practical way to decide to which token the comments really belong.
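In rough JavaScript, the lexer-side trick looks something like the sketch below. nextRawToken is a stand-in for whatever the real lexer exposes (reflect.js's actual API will differ), and the comment type name is made up:

// Wrap the raw lexer: buffer comment tokens and hang them off the next
// real token as an annotation, so the grammar never has to mention comments.
function makeCommentAttachingLexer(nextRawToken) {
  return function nextToken() {
    var pending = [];
    var tok = nextRawToken();
    while (tok && tok.type === "COMMENT") {
      pending.push(tok);
      tok = nextRawToken();
    }
    if (tok) {
      tok.leadingComments = pending; // comments seen before this token
    }
    return tok; // the parser sees exactly the token stream it saw before
  };
}
// Deciding which comments really "trail" the previous token instead is the
// heuristic part mentioned above.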
You'll find it fun to figure out how to capture the positioning information on the tokens and the comments, to enable regeneration of the original text ("comments in their proper locations"). You'll find it more fun to actually regenerate the text with appropriate radix values, character string escapes, etc., in a way that doesn't break the language syntax rules.
We do this with our general language processing tools and it works reasonably well. It is amazing how much work it is to get it all straight, so that you can focus on your transformation task. People underestimate this a lot.
I'm trying to understand how JS is actually parsed. But my searches either return someone's very vaguely documented "parser generator" project (I don't even know what that means), or explain how to parse JS with a JS engine's magical "parse" method. I don't want to scan through a bunch of code and spend forever trying to understand it (although I could, it would take too long).
I want to know how an arbitrary string of JS code is actually turned into objects, functions, variables, etc. I also want to know the procedures and techniques that turn that string into things that get stored, referenced, and executed.
Are there any documentation/references for this?
Parsers probably work in all sorts of ways, but fundamentally they first go through a stage of tokenisation, then give the result to the compiler, which turns it into a program if it can. For example, given:
function foo(a) {
    alert(a);
}
the parser will skip any leading whitespace up to the first character, the letter "f". It collects characters until it gets something that doesn't belong (the whitespace), which marks the end of the token. It then starts again with the "f" of "foo" and continues until it reaches the "(", so it now has the tokens "function" and "foo". It knows "(" is a token on its own, so that makes 3 tokens. It then gets the "a" followed by ")", which are two more tokens, making 5, and so on.
The only need for whitespace is between tokens that are otherwise ambiguous (e.g. there must be either whitespace or another token between "function" and "foo").
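A toy version of that character-collecting loop (only a sketch, nowhere near a real lexer) might look like this:

// Skip whitespace, then keep taking characters of the same class until
// one doesn't belong; single characters like "(" are tokens on their own.
function scan(src) {
  var tokens = [];
  var i = 0;
  function isWordChar(c) { return /[A-Za-z0-9_$]/.test(c); }
  while (i < src.length) {
    var c = src[i];
    if (/\s/.test(c)) { i++; continue; }   // whitespace only separates tokens
    if (isWordChar(c)) {                   // "function", "foo", "a", "alert", ...
      var start = i;
      while (i < src.length && isWordChar(src[i])) i++;
      tokens.push(src.slice(start, i));
    } else {                               // "(", ")", "{", "}", ";", ...
      tokens.push(c);
      i++;
    }
  }
  return tokens;
}

// scan("function foo(a) { alert(a); }")
// -> ["function", "foo", "(", "a", ")", "{", "alert", "(", "a", ")", ";", "}"]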
Once tokenisation is complete, it goes to the compiler, which sees "function" as an identifier, and interprets it as the keyword "function". It then gets "foo", an identifier that the language grammar tells it is the function name. Then the "(" indicates an opening grouping operator and hence the start of a formal parameter list, and so on.
Compilers may deal with tokens one at a time, or may grab them in chunks, or do all sorts of weird things to make them run faster.
You can also read How do C/C++ parsers work?, which gives a few more clues. Or just use Google.
While it doesn't correspond closely to the way real JS engines work, you might be interested in reading Douglas Crockford's article on Top Down Operator Precedence, which includes code for a small working lexer and parser written in the Javascript subset it parses. It's very readable and concise code (with good accompanying explanations) which at least gives you an outline of how a real implementation might work.
A more common technique than Crockford's "Top Down Operator Precedence" is recursive descent parsing, which is used in Narcissus, a complete implementation of JS in JS.
Maybe Esprima will help you understand how the JS grammar is parsed; it's available online.
Hey folks, thanks for reading
I am currently attempting to do a Google-style calculator. You input a string, it determines if it can be calculated and returns the result.
I began slowly with the basics: + - / * and parenthesis handling.
I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.
This kind of work is easy with tools such as Lex and Yacc, but I am developing a JavaScript-only application.
I tried to translate the idea into JavaScript, but I can't figure out how to handle everything in a clean and elegant way, especially nested parentheses.
Analysis
Let's define what a calculator query is:
// NON TERMINAL EXPRESSIONS //
query -> statement
query -> ε // means end of query
statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number
number -> integer
number -> float
// TERMINAL EXPRESSIONS //
operator -> [+*/%^-]
prefix -> -
integer -> [0-9]+
float -> [0-9]+[.,][0-9]+
Javascript
Lexical analysis consists of verifying that there is nothing that doesn't look like one of the terminal expressions: operator, prefix, integer, and float. This can be condensed into one regular expression:
(I added spaces to make it more readable)
var calcPat =
/^ (\s*
( ([+/*%^-]) | ([0-9]+[.,][0-9]+) | ([0-9]+) | (\() | (\)) )
)+ \s* $/;
If this test passes, the query is lexically correct and needs to be grammar-checked to determine whether it can be calculated. This is the tricky part.
I am not going to paste code because it is neither clean nor easily understandable, but I will explain the process I followed and why I'm stuck:
I created a method called isStatement(string) that is supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check whether they really are statements and together form one.
Process is the following:
- If the first two tokens are a number followed by an operator:
  - If the remainder is just one token and it is a number:
    - Then this is a statement.
  - Else, check whether the remaining tokens form a statement (recursive call).
- Else, if the first token is an opening parenthesis:
  - Find the matching closing parenthesis and check whether what's inside is a statement (recursion).
  - Also check whether there is something after the closing parenthesis and whether it forms a statement when combined with the parenthesized part.
What's the problem?
My problem is that I cannot find the matching parenthesis when there are nested structures. How can I do that? Also, as you can see, this is not a particularly generic or clean grammar-checking algorithm. Do you have any ideas for improving this approach?
Thank you so much for having taken the time to read everything.
Gael
(PS: As you probably noticed, I am not a native English speaker! Sorry for any mistakes!)
You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.
The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.
The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.
You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.
Now, because you've got Javascript regular expressions to work with, you could come up with a regular expression designed to match a string of tokens. The trick with that will be coming up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.
edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very complicated, and it'd be pretty fun to have around.
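Here is a rough sketch of that idea, with a tiny recursive-descent check bolted on top to show how nested parentheses stop being a problem once you work on tokens instead of characters. All of the names are illustrative; the token grammar is the one from the question (note the float pattern is tried before the integer pattern, so "1.5" can match), and prefix minus is left out to keep the sketch short.

// Build a tokenizer from a list of [type, regex] pairs; each regex is tried
// at the start of the remaining input and the first match wins.
function makeTokenizer(rules) {
  return function (input) {
    var tokens = [];
    var s = input;
    while (s.length) {
      if (/^\s/.test(s)) { s = s.replace(/^\s+/, ""); continue; }
      var matched = false;
      for (var i = 0; i < rules.length; i++) {
        var m = rules[i][1].exec(s);
        if (m) {
          tokens.push({ type: rules[i][0], value: m[0] });
          s = s.slice(m[0].length);
          matched = true;
          break;
        }
      }
      if (!matched) return null; // lexical error
    }
    return tokens;
  };
}

var tokenize = makeTokenizer([
  ["float",    /^[0-9]+[.,][0-9]+/],  // before integer, so "1.5" works
  ["integer",  /^[0-9]+/],
  ["operator", /^[+*\/%^-]/],
  ["lparen",   /^\(/],
  ["rparen",   /^\)/]
]);

// statement -> ( statement ) | number, optionally followed by: operator statement
// Nesting comes for free because parseStatement calls itself for whatever
// sits inside the parentheses.
function parseStatement(tokens, pos) {
  var t = tokens[pos];
  if (!t) return -1;
  if (t.type === "lparen") {
    pos = parseStatement(tokens, pos + 1);
    if (pos < 0 || !tokens[pos] || tokens[pos].type !== "rparen") return -1;
    pos++; // consume ")"
  } else if (t.type === "integer" || t.type === "float") {
    pos++;
  } else {
    return -1;
  }
  if (tokens[pos] && tokens[pos].type === "operator") {
    return parseStatement(tokens, pos + 1);
  }
  return pos; // index just past this statement
}

function isStatement(str) {
  var tokens = tokenize(str);
  return tokens !== null && parseStatement(tokens, 0) === tokens.length;
}

// isStatement("1 + (2 * (3 - 4))") -> true
// isStatement("1 + (2 * (3 - 4)")  -> false (unbalanced parenthesis)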
It is valid JavaScript to write something like this:
function example(x) {
"Here is a short doc what I do.";
// code of the function
}
The string actually does nothing. Is there any reason why one shouldn't document functions in JavaScript this way?
Two points I could think of while writing the question:
The string literal must be instantiated, which could be costly in the long run
The string literal will not be recognized as removable by JS minifiers
Any other points?
Edit: Why I brought up this topic: I found something like this on John Resig's blog, where the new ECMAScript 5 standard uses an unassigned string literal to enable "strict mode". I just wanted to evaluate whether there could be uses or dangers in documenting functions this way.
There's really no point in doing this in Javascript. In Python, the string is made available as the __doc__ member of the function, class, or module. So these docstrings are available for introspection, etc.
If you create strings like this in Javascript, you get no benefit over using a comment, plus you get some disadvantages, like the string always being present.
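The one case where such a bare string literal actually does something is the directive prologue mentioned in the edit; anything else is evaluated and thrown away. A quick illustration:

function strictExample() {
  "use strict"; // directive prologue: switches this function to ES5 strict mode
  // undeclared = 1; // would now throw a ReferenceError instead of creating a global
}

function docExample() {
  "I describe this function."; // evaluated and discarded; there is nothing
  return 42;                   // like Python's __doc__ to read it back from
}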
I was looking for a way to add multi-line strings to my code without littering it with \n's. It looks like this module fits the bill:
https://github.com/monolithed/doc
Unfortunately, the comments won't survive minification, but I suppose you could write a compile task to convert docstrings to "\n" format.