How does a JavaScript parser work?


I'm trying to understand how JS is actually parsed. But my searches either return someone's very vaguely documented "parser/generator" project (I don't even know what that means), or explain how to parse JS with a JS engine using the magical "parse" method. I don't want to scan through a bunch of code and spend ages trying to understand it (I could, but it would take too long).
I want to know how an arbitrary string of JS code is actually turned into objects, functions, variables, etc. I also want to know the procedures and techniques by which that string is turned into things that get stored, referenced, and executed.
Are there any documentation/references for this?

Parsers probably work in all sorts of ways, but fundamentally they first go through a stage of tokenisation, then give the result to the compiler, which turns it into a program if it can. For example, given:
function foo(a) {
    alert(a);
}
the parser will remove any leading whitespace to the first character, the letter "f". It will collect characters until it gets something that doesn't belong, the whitespace, that indicates the end of the token. It starts again with the "f" of "foo" until it gets to the "(", so it now has the tokens "function" and "foo". It knows "(" is a token on its own, so that's 3 tokens. It then gets the "a" followed by ")" which are two more tokens to make 5, and so on.
The only need for whitespace is between tokens that are otherwise ambiguous (e.g. there must be either whitespace or another token between "function" and "foo").
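The tokenisation steps described above can be sketched in a few lines. This is only a toy illustration, not how any real engine does it: the `tokenize` function and its two token categories are names made up for the example, and it only understands identifiers, whitespace, and single-character punctuators.

```javascript
// Toy tokenizer: splits source text into identifier and punctuator tokens.
// `tokenize` is a hypothetical helper for illustration, not an engine API.
function tokenize(source) {
  const tokens = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (/\s/.test(ch)) {
      i++; // whitespace only separates tokens, so skip it
    } else if (/[A-Za-z_$]/.test(ch)) {
      // collect characters until one does not belong to an identifier
      const start = i;
      while (i < source.length && /[A-Za-z0-9_$]/.test(source[i])) i++;
      tokens.push({ type: "identifier", value: source.slice(start, i) });
    } else {
      // everything else is treated as a one-character token
      tokens.push({ type: "punctuator", value: ch });
      i++;
    }
  }
  return tokens;
}

const tokens = tokenize("function foo(a) {\n  alert(a);\n}");
console.log(tokens.map(t => t.value).join(" "));
// function foo ( a ) { alert ( a ) ; }
```

Note that at this stage "function" is just another identifier-shaped token; deciding that it is a keyword is left to the next stage.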
Once tokenisation is complete, it goes to the compiler, which sees "function" as an identifier, and interprets it as the keyword "function". It then gets "foo", an identifier that the language grammar tells it is the function name. Then the "(" indicates an opening grouping operator and hence the start of a formal parameter list, and so on.
Compilers may deal with tokens one at a time, or may grab them in chunks, or do all sorts of weird things to make them run faster.
You can also read How do C/C++ parsers work?, which gives a few more clues. Or just use Google.

While it doesn't correspond closely to the way real JS engines work, you might be interested in reading Douglas Crockford's article on Top Down Operator Precedence, which includes code for a small working lexer and parser written in the Javascript subset it parses. It's very readable and concise code (with good accompanying explanations) which at least gives you an outline of how a real implementation might work.
A more common technique than Crockford's "Top Down Operator Precedence" is recursive descent parsing, which is used in Narcissus, a complete implementation of JS in JS.
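As a rough illustration of recursive descent (not taken from Narcissus or from Crockford's article), here is a tiny parser for arithmetic expressions. Each grammar rule becomes one function, and nested parentheses fall out of the recursion. The grammar, the tree shape, and the function names are all assumptions chosen for this sketch.

```javascript
// Grammar assumed for this sketch:
//   expr   -> term (("+" | "-") term)*
//   term   -> factor (("*" | "/") factor)*
//   factor -> NUMBER | "(" expr ")" | "-" factor
function parseExpression(text) {
  // Crude tokenisation: characters that match nothing are silently skipped.
  const tokens = text.match(/\d+(?:\.\d+)?|[-+*\/()]/g) || [];
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function expr() {
    let node = term();
    while (peek() === "+" || peek() === "-") {
      node = { op: next(), left: node, right: term() };
    }
    return node;
  }

  function term() {
    let node = factor();
    while (peek() === "*" || peek() === "/") {
      node = { op: next(), left: node, right: factor() };
    }
    return node;
  }

  function factor() {
    const tok = next();
    if (tok === "(") {
      const node = expr(); // the recursive call is what handles nesting
      if (next() !== ")") throw new SyntaxError("expected )");
      return node;
    }
    if (tok === "-") return { op: "neg", arg: factor() };
    if (/^\d/.test(tok || "")) return { num: Number(tok) };
    throw new SyntaxError("unexpected token: " + tok);
  }

  const tree = expr();
  if (pos < tokens.length) throw new SyntaxError("unexpected token: " + peek());
  return tree;
}

// Walk the tree to compute a value.
function evaluate(node) {
  if ("num" in node) return node.num;
  if (node.op === "neg") return -evaluate(node.arg);
  const l = evaluate(node.left);
  const r = evaluate(node.right);
  switch (node.op) {
    case "+": return l + r;
    case "-": return l - r;
    case "*": return l * r;
    default: return l / r;
  }
}

console.log(evaluate(parseExpression("2 * (3 + (4 - 1))"))); // 12
```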

Maybe Esprima will help you understand how a JS parser handles the grammar. It has an online demo.

Related

Replacement of "-" with "_" using a loop in javascript

I have applied the same method to replace "-" with "_" in c++ and it is working properly but in javascript, it is not replacing at all.
function replace(str)
{
    for(var i=0;i<str.length;i++)
    {
        if(str.charAt(i)=="-")
        {
            str.charAt(i) = "_";
        }
    }
    return str;
}

It's simple enough in javascript, it's not really worth making a new function:
function replace(str){
return str.replace(/-/g, '_') // uses regular expression with 'g' flag for 'global'
}
console.log(replace("hello-this-is-a-string"))
This does not alter the original string, however, because strings in javascript are immutable.
If you are dead set on avoiding the builtin (maybe you want to do more complex processing), reduce() can be useful:
function replace(str){
    return [...str].reduce((s, letter) => s + (letter === '-' ? '_' : letter), '')
}
console.log(replace("hello-this-is-a-string"))

This is yet another case of "make sure you read the error message". In the case of
str.charAt(i) = "_";
the correct description of what happens is not "it is not replacing at all", as you would have it; it is "it generates a run-time JavaScript error", which in Chrome is worded as
Uncaught ReferenceError: Invalid left-hand side in assignment
In Firefox, it is
ReferenceError: invalid assignment left-hand side
That should have given you the clue you needed to track down your problem.
I repeat: read error messages closely. Then read them again, and again. In the great majority of cases, they tell you exactly what the problem is (if you only take the time to try to understand them), or at least point you in the right direction.
Of course, reading the error message assumes you know how to view the error message. Do you? In most browsers, a development tools window is available--often via F12--which contains a "console", displaying error messages. If you don't know about devtools, or the console, then make sure you understand them before you write a single line of code.
Or, you could have simply searched on Stack Overflow, since (almost) all questions have already been asked there, especially those from newcomers. For example, I searched in Google for
can i use charat in javascript to replace a character in a string?
and the first result was How do I replace a character at a particular index in JavaScript?, which has over 400 upvotes, as does the first and accepted answer, which reads:
In JavaScript, strings are immutable, which means the best you can do is create a new string with the changed content, and assign the variable to point to it.
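A minimal sketch of what "create a new string and reassign" looks like in practice. The helper names `replaceAt` and `replaceDashes` are made up for this example; they are not built-in methods.

```javascript
// Hypothetical helper: returns a NEW string with the character at `index` replaced.
function replaceAt(str, index, replacement) {
  return str.slice(0, index) + replacement + str.slice(index + 1);
}

// The loop from the question, rewritten to build a new string
// instead of assigning to the result of charAt().
function replaceDashes(str) {
  let result = "";
  for (let i = 0; i < str.length; i++) {
    const ch = str.charAt(i);
    result += ch === "-" ? "_" : ch;
  }
  return result;
}

let s = "a-b";
s = replaceAt(s, 1, "_"); // reassign the variable to point at the new string
console.log(s);                                       // a_b
console.log(replaceDashes("hello-this-is-a-string")); // hello_this_is_a_string
```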
As you learn to program, or learn a new language, you will inevitably run into things you don't know, or things that confuse you. What to do? Posting to Stack Overflow is almost always the worst alternative. After all, as you know, it's not a tutorial site, and it's not a help site, and it's not a forum. It's a Q&A site for interesting programming questions.
In the best case, you'll get snarky comments and answers which will ruin your day; in the worst case, you'll get down-voted, and close-voted, which is not just embarrassing, but may actually prevent you from asking questions in the future. Since you want to make sure you are able to ask questions when you really need to, you are best off taking much more time doing research, including Google and SO searches, on simple beginner questions before posting. Or find a forum which is designed to help new folks. Or ask the person next to you if there is one. Or run through one or more tutorials.
But why write it yourself at all?
However, unless you are working on this problem as a way of teaching yourself JavaScript, as a kind of training exercise, there is no reason to write it at all. It has already been written hundreds, or probably thousands, of times in the history of computing. And the overwhelming majority of those implementations are going to be better, faster, cleaner, less buggy, and more featureful than whatever you will write. So your job as a "programmer" is not to write something that converts dashes to underscores; it's to find and use something that does.
As the wise man said, today we don't write algorithms any more; we string together API calls. Our job is to find, and understand, the APIs to call.
Finding the API is not at all hard with Google. In this case, it could be helpful if you knew that strings with underscores are sometimes called "snake-cased", but even without knowing that you can find something on the first page of Google results with a query such as "javascript convert string to use underscores library".
The very first result, and the one you should take a look at, is underscore.string, a collection of string utilities written in the spirit of the versatile "underscore" library, and designed to be used with it. It provides an API called underscored. In addition to dealing with "dasherized" input (your case), it also handles other string/identifier formats such as "camelCase".
Even if you don't want to import this particular library and use it (which you probably should), you would be much better off stealing or cloning its code, which in simplified form is:
str
.replace(/([a-z\d])([A-Z]+)/g, '$1_$2')
.replace(/[-\s]+/g, '_')
This is not as complicated as it looks. The first line deals with camelCase input, so that "catsAndDogs" becomes "cats_And_Dogs" (the library itself also lowercases the result, giving "cats_and_dogs"). The second line is what handles the dashes and spaces. Look closely--it replaces runs of multiple dashes with a single underscore. That could easily be something that you want to do too, depending on who is doing what with the transformed string downstream. That's a perfect example of something that someone else writing a professional-level library for this task has thought of that you might not when writing your home-grown solution.
Note that this well-regarded, finely-tuned library does not use the split/join trick to do global replaces in strings, as another answer suggests. That approach went out of use almost a decade ago.
Besides saving yourself the trouble of writing it yourself, and ending up with better code, if you take the time to understand what it's doing, you will also end up knowing more about regular expressions.
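To see what the two replace() calls do on their own, here is a small runnable sketch. `underscoredSketch` is a name invented for this example and only mirrors the simplified snippet quoted above; it is not the library's exact source, and it applies no lowercasing, so letter case is preserved.

```javascript
// Simplified sketch modelled on the snippet above; not the library's real code.
function underscoredSketch(str) {
  return str
    .replace(/([a-z\d])([A-Z]+)/g, "$1_$2") // camelCase boundary -> underscore
    .replace(/[-\s]+/g, "_");               // runs of dashes/spaces -> one underscore
}

console.log(underscoredSketch("hello-this-is-a-string")); // hello_this_is_a_string
console.log(underscoredSketch("a--b  c"));                // a_b_c
console.log(underscoredSketch("catsAndDogs"));            // cats_And_Dogs
```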

You can easily replace every occurrence in the string using .split() and .join().
function replace(str){
    return str.split("-").join("_");
}
console.log(replace("hello-this-is-a-string"))

ECMAScript pull parser

There seem to be a lot of resources for XML pull parsing, but is it possible to build a pull parser for JavaScript? Why is this not something people pursue? Pull parsing makes it possible to stream the file while parsing it, which allows for arbitrarily large files (for example) and concurrent use.
The problem I encounter is that I need to divide the code into certain small units. I thought statements would be a good way to split the code. Each call to the pull parser would yield another statement (or function declaration). However this goes wrong with function expressions. They require to split the statements up because each statement could contain a function with more statements.
How would I go about implementing such a parser? Or do you think this is an unwise design?
I'm trying to build a fast minifier.
EDIT: see http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java-V2 for more info on sequential access parsers. They only describe JSON and XML...
Also see https://github.com/qfox/zeparser2 for a fast streaming JS parser.
EDIT2:
I can think of a few options:
Return each grammar type, even nested ones. So (most) tokens will be returned multiple times in different grammars (like an expression inside a statement). So for example you first return the statement 'var a = b + c;' and then return the expression 'b + c'. So as the caller you can check if the returned grammar is a var-statement and do something with that...
Work with event functions (this is push parsing): call the var-statement handler, or the expression handler.
Full-blown AST generation with early return?
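One way to picture the pull model in JavaScript is a generator: the caller asks for the next unit only when it wants one. The sketch below pulls tokens rather than statements, which sidesteps the nesting problem described above but is of course a much smaller unit. The `pullTokens` name and the token pattern are assumptions for illustration; strings, comments, and regex literals are ignored, and a real JS tokenizer has to handle all of them.

```javascript
// Hypothetical pull-style tokenizer: each next() call scans just one more token.
function* pullTokens(source) {
  // Sticky ("y") regex: it only matches at lastIndex, so every exec()
  // continues exactly where the previous token ended.
  const pattern = /\s*([A-Za-z_$][\w$]*|\d+|[^\s\w])/y;
  let match;
  while ((match = pattern.exec(source)) !== null) {
    yield match[1];
  }
}

const it = pullTokens("var a = b + c;");
console.log(it.next().value); // var
console.log(it.next().value); // a
console.log([...it]);         // [ '=', 'b', '+', 'c', ';' ]
```

Because nothing is scanned until the consumer asks for it, the consumer can stop early without the rest of the input ever being tokenised.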

What is the difference between semicolons in JavaScript and in Python?

Python and JavaScript both allow developers to use or to omit semicolons. However, I've often seen it suggested (in books and blogs) that I should not use semicolons in Python, while I should always use them in JavaScript.
Is there a technical difference between how the languages use semicolons or is this just a cultural difference?

Semicolons in Python are totally optional (unless you want to have multiple statements in a single line, of course). I personally think Python code with semicolons at the end of every statement looks very ugly.
Now in Javascript, if you don't write a semicolon, one may be automatically inserted¹ at the end of the line. And this can cause problems. Consider:
function add(a, b) {
    return
        a + b
}
You'd think this returns a + b, but Javascript just outsmarted you and sees this as:
function add(a, b) {
    return;
    a + b;
}
Returning undefined instead.
1 See page 27, item 7.9 - Automatic Semicolon Insertion on ECMAScript Language Specification for more details and caveats.
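The behaviour is easy to check for yourself. In this small sketch the function names `addBroken` and `addFixed` are made up for the comparison; wrapping the expression in parentheses is one common way to keep a multi-line return intact.

```javascript
function addBroken(a, b) {
  return        // a semicolon is inserted here, ending the statement
    a + b;      // never reached
}

function addFixed(a, b) {
  return (      // the open parenthesis keeps the statement going
    a + b
  );
}

console.log(addBroken(1, 2)); // undefined
console.log(addFixed(1, 2));  // 3
```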

This had me confused for the longest time. I thought it was just a cultural difference, and that everyone complaining about semicolon insertion being the worst feature in the language was an idiot. The oft-repeated example from NullUserException's answer didn't sway me because, disregarding indentation, Python behaves the same as JavaScript in that case.
Then one day, I wrote something vaguely like this:
alert(2)
(x = $("#foo")).detach()
I expected it to be interpreted like this:
alert(2);
(x = $("#foo")).detach();
It was actually interpreted like this:
alert(2)(x = $("#foo")).detach();
I now use semicolons.
JavaScript will only¹ treat a newline as a semicolon in these cases:
It's a syntax error not to.
The newline is between the throw or return keyword and an expression.
The newline is between the continue or break keyword and an identifier.
The newline is between a variable and a postfix ++ or -- operator.
This leaves cases like this where the behaviour is not what you'd expect. Some people² have adopted conventions that only use semicolons where necessary. I prefer to follow the standard convention of always using them, now that I know it's not pointless.
1 I've omitted a few minor details, consult ECMA-262 5e Section 7.9 for the exact description.
2 Twitter Bootstrap is one high-profile example.
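The "line starts with a parenthesis" case can be reproduced without jQuery. In this sketch `record` is a made-up function that logs its argument and returns another function, so the accidental call succeeds instead of throwing and you can see how the two lines were joined.

```javascript
const calls = [];

// Hypothetical stand-in for alert(): logs its argument and returns a function.
function record(n) {
  calls.push(n);
  return function (m) {
    calls.push(m);
    return { detach() { return "detached"; } };
  };
}

let x;
// No semicolon is inserted after the first line, because the next line
// starts with "(" and the result still parses: record(2)(x = 3).detach()
const result = record(2)
(x = 3).detach()

console.log(calls);  // [ 2, 3 ]
console.log(result); // detached
```

Had the two lines been separate statements, the second would have thrown, since the number 3 has no detach method.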

Aside from the syntactical issues, it is partly cultural. In Python culture any extraneous characters are an anathema, and those that are not white-space or alphanumeric, doubly so.
So things like leading $ signs, semi-colons, and curly braces, are not liked. What you do in your code though, is up to you, but to really understand a language it is not enough just to learn the syntax.

JavaScript is designed to "look like C", so semicolons are part of the culture. Python syntax is different enough to not make programmers feel uncomfortable if the semicolons are "missing".

The answer why you don't see them in Python code is: no one needs them, and the code looks cleaner without them.
Generally speaking, semicolons are just a tradition. Many new languages have just dropped them for good (take Python, Ruby, Scala, Go, Groovy, and Io for example). Programmers don't need them, and neither do compilers. If a language lets you not type an extra character you never needed, you will want to take advantage of that, won't you?
It's just that JavaScript's attempt to drop them wasn't very successful, and many prefer the convention to always use them, because that makes code less ambiguous.

It is mostly that Python looks nothing like Java, and JavaScript does, which leads people to treat it that way. It is very simple to not get into trouble using semicolons with JavaScript (Semicolons in JavaScript are optional), and anything else is FUD.

Both languages are dynamically typed, which is meant to improve readability.
Python Enhancement Proposal 8, or PEP 8, is a style guide for Python code. In 2001, Guido van Rossum, Barry Warsaw, and Nick Coghlan created PEP 8 to help Python programmers write consistent and readable code. Reference.
So in JavaScript we have the ECMAScript specification that describes how, if a statement is not explicitly terminated with a semicolon, sometimes a semicolon will be automatically inserted by the JavaScript engine (called “automatic semicolon insertion” (ASI)). Reference.
See this article from Google talking about JavaScript too.

How to implement Lexical Analysis in Javascript

Hey folks, thanks for reading
I am currently attempting to do a Google-style calculator. You input a string, it determines if it can be calculated and returns the result.
I began slowly with the basics : + - / * and parenthesis handling.
I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.
This kind of work is easily done with tools such as Lex and Yacc, except that I am developing a Javascript-only application.
I tried to translate the idea into Javascript, but I can't figure out how to handle everything in a clean and beautiful way, especially nested parentheses.
Analysis
Let's define what a calculator query is:
// NON TERMINAL EXPRESSIONS //
query -> statement
query -> ε // means end of query
statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number
number -> integer
number -> float
// TERMINAL EXPRESSIONS //
operator -> [+*/%^-]
prefix -> -
integer -> [0-9]+
float -> [0-9]+[.,][0-9]+
Javascript
Lexical Analysis consists in verifying there is nothing that doesn't look like one of the terminal expressions : operator, prefixes, integer and float. Which can be shortened to one regular expression:
(I added spaces to make it more readable)
var calcPat =
/^ (\s*
( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;
If this test passes, query is lexically correct and needs to be grammar-checked to determine if it can be calculated. This is the tricky part
I am not going to paste code because it is not clean nor easily understandable, but I am going to explain the process I followed and why I'm stuck:
I created a method called isStatement(string) that's supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check if they really are statements and form one altogether.
Process is the following:
-If the first two tokens are a number followed by an operator:
-Then,
-- If the remaining is just one token and it is a number:
--- Then this is a statement.
--- Else, check if the remaining tokens form a statement (recursive call)
-Else, If the first token is a parenthesis
-Then, Find matching closing parenthesis and check if what's inside is a statement (recursion)
-- Also check if there is something after closing parenthesis and if it forms a statement when associated with the parenthesis structure.
What's the problem ?
My problem is that I cannot find the matching parenthesis when there are nested structures. How can I do that? Also, as you can see, this is not a particularly generic or clean grammar-checking algorithm. Do you have any ideas for improving this pattern?
Thank you so much for having taken the time to read everything.
Gael
(PS: As you probably noticed, I am not a native english speaker ! Sorry for mistakes and all !)

You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.
The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.
The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.
You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.
Now, because you've got Javascript regular expressions to work with, you could come up with a regular expression designed to match a string of tokens. The trick with that will be coming up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.
edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very complicated, and it'd be pretty fun to have around.
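A rough sketch of such a tokenizer builder, under a few assumptions: `buildTokenizer` is a name invented here, the rule patterns must not contain capture groups of their own (that would shift the group numbering), and no rule may match the empty string. It combines the per-token patterns into one sticky regex, matches at the current position only, and uses the capture group that matched to tell which token it found.

```javascript
// Hypothetical builder: takes [name, RegExp] pairs and returns a tokenizer function.
function buildTokenizer(rules) {
  // Wrap each pattern in its own capture group so we can tell which one matched.
  const source = rules.map(([, re]) => "(" + re.source + ")").join("|");
  return function tokenize(input) {
    const pattern = new RegExp(source, "y"); // sticky: match at lastIndex only
    const tokens = [];
    let pos = 0;
    while (pos < input.length) {
      pattern.lastIndex = pos;
      const m = pattern.exec(input);
      if (m === null) throw new SyntaxError("unexpected character at " + pos);
      const index = m.slice(1).findIndex(g => g !== undefined);
      const type = rules[index][0];
      if (type !== "space") tokens.push({ type, text: m[0] });
      pos = pattern.lastIndex;
    }
    return tokens;
  };
}

// Token rules loosely based on the terminals in the question.
const tokenizeCalc = buildTokenizer([
  ["space", /\s+/],
  ["float", /[0-9]+[.,][0-9]+/], // listed before integer so "1.5" is not split
  ["integer", /[0-9]+/],
  ["operator", /[+*\/%^-]/],
  ["lparen", /\(/],
  ["rparen", /\)/],
]);

console.log(tokenizeCalc("(1.5 + 2) * 3").map(t => t.type + ":" + t.text).join(" "));
// lparen:( float:1.5 operator:+ integer:2 rparen:) operator:* integer:3
```

The resulting token list is what a separate grammar-level parser would then consume; checking that parentheses nest correctly belongs to that stage, not to the regular expressions.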

Javascript lexer / tokenizer (in Python?)

Does anyone know of a Javascript lexical analyzer or tokenizer (preferably in Python?)
Basically, given an arbitrary Javascript file, I want to grab the tokens.
e.g.
foo = 1
becomes something like:
variable name : "foo"
whitespace
operator : equals
whitespace
integer : 1

http://code.google.com/p/pynarcissus/ has one.
Also I made one but it doesn't support automatic semicolon insertion so it is pretty useless for javascript that you have no control over (as almost all real life javascript programs lack at least one semicolon) :) Here is mine:
http://bitbucket.org/santagada/jaspyon/src/tip/jaspyon/
The grammar is in jsgrammar.txt; it is parsed by the PyPy parsing lib (which you will have to download and extract from the pypy source), and it builds a parse tree, which I walk in astbuilder.py.
But if you don't have licensing problems I would go with pynarcissus. Here's a direct link to look at the code (ported from narcissus):
http://code.google.com/p/pynarcissus/source/browse/trunk/jsparser.py
