How does atom text editor parse / tokenise code? (syntax-highlighting)


So CodeMirror uses modes to tokenise its code.
It breaks up the document into lines and makes each line a stream, which is then fed into the pre-defined mode. A mode can span multiple lines by using its state parameter.
It seems ACE has a similar method.
Neither of these methods uses RegExp inherently (but obviously whoever creates the mode can use RegExp inside it).
From what I've read of Atom's code and style, it calls its syntax highlighters grammars, and they closely resemble the grammars from TextMate.
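To make the "line stream plus state" idea concrete, here is a minimal sketch. It is a self-contained imitation and not CodeMirror's or ACE's real API: toyMode, matchAt, highlight and the KEYWORDS set are all hypothetical names. The state object is what lets a block comment carry over from one line to the next.

```javascript
// Hypothetical imitation of a mode-based tokeniser; not CodeMirror's real API.
const KEYWORDS = new Set(['var', 'if', 'else', 'return', 'function']);

// Match a sticky pattern exactly at `pos`, returning the matched text or null.
function matchAt(pattern, line, pos) {
    pattern.lastIndex = pos;
    const m = pattern.exec(line);
    return m ? m[0] : null;
}

const toyMode = {
    startState() {
        return { inComment: false };
    },
    // Consume one token starting at `pos`; return [style, positionAfterToken].
    token(line, pos, state) {
        if (state.inComment || line.startsWith('/*', pos)) {
            const searchFrom = state.inComment ? pos : pos + 2;
            const close = line.indexOf('*/', searchFrom);
            state.inComment = close === -1; // still open: continue on the next line
            return ['comment', close === -1 ? line.length : close + 2];
        }
        const space = matchAt(/\s+/y, line, pos);
        if (space) {
            return [null, pos + space.length];
        }
        const word = matchAt(/[A-Za-z_$][\w$]*/y, line, pos);
        if (word) {
            return [KEYWORDS.has(word) ? 'keyword' : 'variable', pos + word.length];
        }
        const number = matchAt(/\d+/y, line, pos);
        if (number) {
            return ['number', pos + number.length];
        }
        return [null, pos + 1]; // anything else: one unstyled character
    }
};

// Feed the document to the mode line by line, sharing one state object.
function highlight(text) {
    const state = toyMode.startState();
    const tokens = [];
    text.split('\n').forEach((line, lineNumber) => {
        let pos = 0;
        while (pos < line.length) {
            const [style, next] = toyMode.token(line, pos, state);
            tokens.push({ line: lineNumber, text: line.slice(pos, next), style });
            pos = next;
        }
    });
    return tokens;
}
```

Calling highlight('var x = 1; /* a\nb */ y') marks text on both lines as a comment, because inComment is still set when the second line starts.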
These grammars resemble JSON objects which contain classnames and RegExps (see how to write a TextMate grammar).
I can't figure out for the life of me how exactly Atom Text Editor actually performs the parsing of code, keeping its state and also extending through various scopes.
If someone could point me in the right direction that would be great.

You're probably better off asking your question in the Atom forums, since they are frequented by the Atom developers.

The question was answered here.
Atom uses its first-mate module, which relies on oniguruma for parsing Regular Expressions.
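As a rough illustration of how a TextMate-style grammar can keep state and nest scopes, here is a minimal sketch. It is not first-mate's actual code: it uses JavaScript's built-in RegExp instead of oniguruma, and toyGrammar, tokenizeLine and all scope names are made up for the example. The idea is that each line is tokenised against a stack of open rules, and the stack returned for one line is passed in as the starting state for the next.

```javascript
// Hypothetical grammar in the TextMate style: match rules and begin/end rules.
const toyGrammar = {
    scopeName: 'source.toy',
    patterns: [
        { name: 'keyword.control.toy', match: '\\b(if|else|return)\\b' },
        { name: 'constant.numeric.toy', match: '\\b\\d+\\b' },
        {
            name: 'string.quoted.double.toy',
            begin: '"',
            end: '"',
            patterns: [{ name: 'constant.character.escape.toy', match: '\\\\.' }]
        }
    ]
};

// Tokenise one line. `ruleStack[0]` is the grammar root; deeper entries are
// begin/end rules that are still open (for example an unterminated string).
function tokenizeLine(line, ruleStack) {
    const stack = ruleStack.slice();
    const tokens = [];
    let pos = 0;
    const scopes = () => [toyGrammar.scopeName].concat(stack.slice(1).map(r => r.name));
    const emit = (start, end, extra) => {
        if (end > start) {
            tokens.push({ value: line.slice(start, end), scopes: scopes().concat(extra || []) });
        }
    };
    while (pos <= line.length) {
        const top = stack[stack.length - 1];
        const candidates = (top.patterns || []).map(rule => ({
            rule: rule,
            source: rule.match || rule.begin,
            isEnd: false
        }));
        if (top.end) {
            candidates.unshift({ rule: top, source: top.end, isEnd: true });
        }
        // Pick the candidate whose match starts earliest; the end rule wins ties.
        let best = null;
        for (const candidate of candidates) {
            const re = new RegExp(candidate.source, 'g');
            re.lastIndex = pos;
            const m = re.exec(line);
            if (m && m[0].length > 0 && (!best || m.index < best.m.index)) {
                best = { candidate: candidate, m: m };
            }
        }
        if (!best) {
            emit(pos, line.length);
            break;
        }
        emit(pos, best.m.index);
        const end = best.m.index + best.m[0].length;
        if (best.candidate.isEnd) {
            emit(best.m.index, end);      // closing delimiter keeps the inner scope
            stack.pop();
        } else if (best.candidate.rule.begin) {
            stack.push(best.candidate.rule);
            emit(best.m.index, end);      // opening delimiter gets the inner scope
        } else {
            emit(best.m.index, end, [best.candidate.rule.name]);
        }
        pos = end;
    }
    return { tokens: tokens, ruleStack: stack };
}
```

Tokenising 'return "a' leaves the string rule on the returned stack; passing that stack into tokenizeLine('b" 42', ...) closes the string and then scopes 42 as a number.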

Related

Is it always possible to go from AST to original source code? [closed]

JavaScript source code can be converted to an AST. I am using the Shift AST parser to create an AST from JavaScript code.
Now I want to convert the generated AST back to source code.
I am very much confused here and trying to understand the fundamentals. I am hearing from my colleagues that AST can't be converted back to the source code. But for what reason?
One of my colleagues told me that an AST does not preserve spacing, so indentation will be lost when converting the AST back to source code.
Is it the only reason?

First of all, it depends on what you mean by "original source" code.
If you mean the exact same file on the exact same file system that you were editing when you wrote the software, the answer is that you can't. Obviously.
If you mean code which is character for character identical to the code you wrote, it is technically possible but unlikely in practice.
If you mean code that works exactly the same and "looks mostly the same", then yes, you can. (Depending on "mostly".)
The answer also depends on what AST implementation you are talking about.
Some AST implementations don't preserve comments or spacing / indentation.
Other AST implementations apparently can preserve comments; e.g. as decorations on the tree nodes.
It is theoretically possible for an AST implementation to preserve absolutely everything needed to reconstruct identical source code. (But I don't know of an example that does. It would be memory expensive and kind of pointless.)
What is the harm in not being able to recover comments?
Well it depends on what you want to use the regenerated source code for. If you intend to be able to replace the original code, then there are clear problems:
You have lost any (hopefully) useful comments that the programmer included to help people to understand the code.
It is common to embed formal API documentation in the form of stylized source code comments that are then extracted, formatted, etc. If those comments are lost, it becomes harder to keep the API documentation up to date.
Some 3rd-party tools use stylized comments for specific purposes. For example, a comment could be used to suppress a false positive from a static code analyzer; e.g. a # noqa comment in Python code suppresses a pep8 style error.
On the other hand ... this kind of thing may not be relevant for your use-case.
Now from the tags I deduce that you are using Shift-AST. From a brief scan of the documentation and source code, I don't think this preserves either comments or indentation / spacing.
So that means that you cannot recover source code that is character for character identical with the original code. If that is what you want ... your colleague is 100% correct.
However, character for character identical code may not be necessary, so this may not be a limitation. It depends on your use-case.
And you could investigate Babel as an alternative. Apparently it can preserve comments.
One of my colleagues told me that an AST does not preserve spacing, so indentation will be lost when converting the AST back to source code. Is it the only reason?
Clearly, No. (As my answer explains.)

Nope. An abstract syntax tree is abstract in the way that it abstracts away ambiguous grammar, such as whitespace and possibly also comments (if these are irrelevant to further processing). As there is usually no purpose in storing this information, it is worth dropping during parsing.
While one can't go back to "original source code", one can still go back to an equivalent representation which is usually called the canonical form.
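A small sketch of what "canonical form" means in practice. This is a toy arithmetic language, not Shift or Babel, and parse and generate are hypothetical helpers written for this example. The parser throws away whitespace, comments and redundant parentheses, so the generator can only produce one normalised spelling of each tree.

```javascript
// Toy grammar: expr := term (('+'|'-') term)*
//              term := factor (('*'|'/') factor)*
//              factor := number | '(' expr ')'
function parse(source) {
    // Comments and whitespace never reach the tree.
    const withoutComments = source.replace(/\/\*[\s\S]*?\*\//g, '');
    const tokens = withoutComments.match(/\d+|[-+*/()]/g) || [];
    let i = 0;

    function factor() {
        const t = tokens[i++];
        if (t === '(') {
            const inner = expr();
            i++; // skip ')'
            return inner;
        }
        return { type: 'Num', value: Number(t) };
    }

    function term() {
        let left = factor();
        while (tokens[i] === '*' || tokens[i] === '/') {
            const op = tokens[i++];
            left = { type: 'Bin', op: op, left: left, right: factor() };
        }
        return left;
    }

    function expr() {
        let left = term();
        while (tokens[i] === '+' || tokens[i] === '-') {
            const op = tokens[i++];
            left = { type: 'Bin', op: op, left: left, right: term() };
        }
        return left;
    }

    return expr();
}

// Print the tree with fixed spacing and only the parentheses it needs.
function generate(node, parentPrec, isRight) {
    if (node.type === 'Num') {
        return String(node.value);
    }
    const prec = node.op === '+' || node.op === '-' ? 1 : 2;
    const text = generate(node.left, prec, false) + ' ' + node.op + ' ' +
        generate(node.right, prec, true);
    const needsParens = prec < (parentPrec || 0) || (isRight && prec === parentPrec);
    return needsParens ? '(' + text + ')' : text;
}
```

With this sketch, generate(parse('1 +   (2*3)  /* note */')) gives '1 + 2 * 3': the meaning is the same, but the original spacing, the comment and the redundant parentheses are gone.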

Well, it's possible. If you're using shift-ast you can do it.
Step 1:
npm install shift-codegen
Step 2:
import codegen from "shift-codegen";
let programSource = codegen(/* Shift format AST */);
programSource will be a string. Write it to your file and use Prettier to format the code.
An alternative to shift-ast is Babel, which offers many benefits, including transforms and template features. It also supports TypeScript, Flow, JSX, comments and minification.

Determine all ISO 15924 script codes in JavaScript string

I'm looking for an efficient way to take a JavaScript string and return all of the scripts which occur in that string.
Full UTF-16, including the "astral" plane / non-BMP characters which require surrogate pairs, must be handled correctly. This is possibly the main problem, since JavaScript string operations work on UTF-16 code units rather than code points.
It only has to deal with codepoints so no fancy awareness of complex scripts or grapheme clusters is necessary. (This will be obvious to some of you anyway.)
Example:
stringToIso15924("παν語");
would return something like:
[ "Grek", "Hani" ]
I'm using node.js and some Unicode libraries such as XRegExp and unorm already so I don't mind adding other libraries that might already handle or ease such a feature.
I'm not aware of a JavaScript library that can look up character properties such as script codes, so this is probably the second part of the problem.
The third part of the problem is just to avoid inefficiencies.

I answered a similar, or at least related, question. In this pastebin you will find a (looooong) function that returns the script name for a character. It should be easy to modify it to accommodate a string.
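On engines that support Unicode property escapes in regular expressions (ES2018 and later), a lookup table is not strictly needed. The following is a minimal sketch under that assumption; SCRIPT_TABLE is a deliberately incomplete, hand-picked mapping from ISO 15924 codes to Unicode Script values, and you would extend it with the scripts you care about. Iterating with for...of walks the string by code point, so surrogate pairs are handled.

```javascript
// Assumed, incomplete mapping: ISO 15924 code -> test for the Unicode Script.
const SCRIPT_TABLE = {
    Latn: /\p{Script=Latin}/u,
    Grek: /\p{Script=Greek}/u,
    Cyrl: /\p{Script=Cyrillic}/u,
    Hani: /\p{Script=Han}/u,
    Arab: /\p{Script=Arabic}/u
};

function stringToIso15924(text) {
    const found = new Set();
    // for...of yields whole code points, including non-BMP characters.
    for (const ch of text) {
        for (const [code, pattern] of Object.entries(SCRIPT_TABLE)) {
            if (pattern.test(ch)) {
                found.add(code);
                break; // a code point has exactly one Script value
            }
        }
    }
    return [...found]; // in order of first appearance
}
```

stringToIso15924("παν語") returns [ "Grek", "Hani" ]. Code points whose script is missing from the table, and characters with Script=Common such as digits or spaces, are silently skipped.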

What is the 'standard' concerning style guidelines in JavaScript?

First of all, I'd like to say that I'm not trying to start a discussion on what is the best coding style.
Rather, I was wondering what is actually the global standard when it comes to styling your code. I've seen different websites and mainly open source organisations which have their own guideline page, which for example says that you should put } else { on the same line.
Are there some (un)written rules concerning code style which apply to all JavaScript being written? Is there a common preference for specific coding styles? Or is this really on a per-organisation basis?

These are widely accepted*:
Variable names contain only characters a-zA-Z_ (and sometimes $0-9)
Indent by 4 spaces or a tab character (Never mix!)
Constructor functions begin with an uppercase letter
Terminate every statement with a semicolon
Egyptian bracing
Always use blocks after if, else, etc., even for a single statement
One space after a comma, no space before
Assignment/comparison operators are surrounded by spaces
Avoid lines containing multiple statements
Use ' as a string delimiter
From my experience, most conventions are subject to heated discussions.
So, no, there is no general rule. Some people even try to avoid semicolons completely.
* or are they? ;)
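For illustration only, here is a short snippet written to follow the conventions listed above. Counter, increment and label are made-up names, and the snippet endorses none of these choices over the alternatives.

```javascript
// Constructor functions begin with an uppercase letter.
function Counter(start) {
    this.count = start;
}

Counter.prototype.increment = function (step) {
    // Egyptian bracing, 4-space indent, blocks even for single statements.
    if (step === undefined) {
        step = 1;
    } else {
        step = Number(step);
    }
    // Operators surrounded by spaces; one statement per line; semicolons.
    this.count = this.count + step;
    return this.count;
};

var counter = new Counter(0);
var label = 'count: ' + counter.increment(2); // ' as the string delimiter
```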

There isn't one standard. Are there any guidelines out there that you can follow if you want to keep your code consistent? How about Google's coding style? http://google-styleguide.googlecode.com/svn/trunk/javascriptguide.xml
We use that as basic guidelines at our company

Douglas Crockford's JavaScript: The Good Parts is widely used as a basis for coding guidelines.
His JSLint tool can be used to check whether code meets his recommendations.

Standard is the new standard.
I've been using it in all my projects.

Is metaprogramming possible in Javascript?

During my routine work, I happened to write a chained JavaScript function, something like a LINQ expression, to query a JSON result.
var Result = from(obj1).as("x").where("x.id=5").groupby("x.status").having(count("x.status") > 5).select("x.status");
It works perfectly and gives the expected result.
I was wondering whether the code could be written like this instead (in a more readable way):
var Result = from obj1 as x where x.status
groupby x.status having count(x.status) > 5
select x.status;
Is there a way to achieve this?
Cheers
Ramesh Vel

No. JavaScript doesn't support this.
But this looks quite good too:
var Result = from(obj1)
    .as("x")
    .where("x.id=5")
    .groupby("x.status")
    .having(count("x.status") > 5)
    .select("x.status");

Most people insist on trying to metaprogram from inside their favorite language. That doesn't work if the language doesn't support metaprogramming well; other answers have observed that JavaScript does not.
A way around this is to do metaprogramming from outside the language, using program transformation tools. Such tools can parse source code, carry out arbitrary transformations on it (that's what metaprogramming does anyway) and then spit out the revised program.
If you have a general purpose program transformation system that can parse arbitrary languages, you can then do metaprogramming on/with whatever language you like. See our DMS Software Reengineering Toolkit for such a tool, which has robust front ends for C, C++, Java, C#, COBOL, PHP, ECMAScript and a number of other programming languages, and has been used for metaprogramming on all of these.
In your case, you want to extend the JavaScript grammar with new syntax for SQL queries, and then transform them to plain JavaScript. (This is a lot like Intentional Programming)
DMS will easily let you build a JavaScript dialect with additional rules, and then you can use its program transformation capabilities to produce the equivalent standard Javascript.
Having said that, I'm not a great fan of "custom syntax for every programmer on the planet", which is where Intentional Programming leads IMHO.
This is a good thing to do if there is a large community of users that would find this valuable. This idea may or may not be one of them; part of the problem is you don't get to find out without doing the experiment, and it might fail to gain enough social traction to matter.

Although it is not quite what you wanted, it is possible to write parsers in JavaScript, parse the query (stored as a string) and then execute it. With libraries like http://jscc.jmksf.com/ (no doubt there are others out there) it shouldn't be too hard to implement.
But what you have in the question looks great already; I'm not sure why you'd want it to look the way you suggested.

Considering that this question was asked some years ago, I will try to add more to it based on current technologies.
As of ECMAScript 6, metaprogramming is now supported in a sense via Symbol, Reflect and Proxy objects.
Searching the web, I found a series of very interesting articles on the subject, written by Keith Cirkel:
Metaprogramming in ES6: Symbols and why they're awesome
In short, Symbols are new primitives that can be added inside an object (without practically being properties) and are very handy for passing metaprogramming properties to it among others. Symbols are all about changing the behavior of existing classes by modifying them (Reflection within implementation).
Metaprogramming in ES6: Part 2 - Reflect
In short, Reflect is effectively a collection of all of those “internal methods” that were available exclusively through the JavaScript engine internals, now exposed in one single, handy object. Its usage is analogous to the Reflection capabilities of Java and C#. They are used to discover very low level information about your code (Reflection through introspection).
Metaprogramming in ES6: Part 3 - Proxies
In short, Proxies are handler objects, responsible for wrapping objects and intercepting their behaviors through traps (Reflection through intercession).
Of course, these objects provide specific metaprogramming capabilities, much more restrictive compared to metaprogramming languages, but still can provide handy ways of basic metaprogramming, mainly through Reflection practices, in fact.
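A brief sketch of the three features in action, using only standard ES6 APIs. The objects range, logged and accessLog are made up for the example.

```javascript
// Symbol: the well-known Symbol.iterator changes how an object behaves
// in for...of loops and spread syntax.
const range = {
    from: 1,
    to: 3,
    [Symbol.iterator]() {
        let current = this.from;
        const last = this.to;
        return {
            next: () => current <= last
                ? { value: current++, done: false }
                : { value: undefined, done: true }
        };
    }
};
const numbers = [...range];

// Reflect: introspection through plain function calls.
const hasFrom = Reflect.has(range, 'from');
const keys = Reflect.ownKeys(range); // string keys first, then symbol keys

// Proxy: intercept property reads through a `get` trap (intercession).
const accessLog = [];
const logged = new Proxy({ status: 'open' }, {
    get(target, key, receiver) {
        accessLog.push(String(key));
        return Reflect.get(target, key, receiver); // fall back to default behavior
    }
});
const status = logged.status;
```

After this runs, numbers holds 1, 2 and 3, and accessLog holds the single entry 'status', recorded by the trap when logged.status was read.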
Finally, it is worth mentioning that there is some noteworthy ongoing research on staged metaprogramming in JavaScript.

Well, in your code sample:
var Result = from(obj1)
    .as("x")
    .where("x.id=5")
    .groupby("x.status")
    .having(count("x.status") > 5)
    .select("x.status");
The only problem I see (other than select used as an identifier) is that you embed a predicate as a function argument. You'd have to make it a function instead:
.having(function(x){ return x.status > 5; })
JavaScript has closures and dynamic typing, so you can do some really nifty and elegant things in it. Just letting you know.
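To show how far closures get you, here is a minimal sketch of such a chain that takes functions instead of strings. It is not the asker's implementation: from, where, groupby, having and select are hypothetical helpers, and the grouping semantics (having sees the array of rows in a group, select receives the group key and its rows) are an assumption made for the example.

```javascript
// Hypothetical fluent query builder; every clause takes a function.
function from(rows) {
    let rowFilter = () => true;
    let keyOf = row => row;
    let groupFilter = () => true;
    const query = {
        where(predicate) {
            rowFilter = predicate;
            return query;
        },
        groupby(selector) {
            keyOf = selector;
            return query;
        },
        having(predicate) {
            groupFilter = predicate;
            return query;
        },
        // select() runs the query: filter rows, group them, filter groups.
        select(projection) {
            const groups = new Map();
            for (const row of rows.filter(rowFilter)) {
                const key = keyOf(row);
                if (!groups.has(key)) {
                    groups.set(key, []);
                }
                groups.get(key).push(row);
            }
            return [...groups]
                .filter(([key, members]) => groupFilter(members, key))
                .map(([key, members]) => projection(key, members));
        }
    };
    return query;
}

const tickets = [
    { id: 1, status: 'open' },
    { id: 2, status: 'open' },
    { id: 3, status: 'closed' }
];

const result = from(tickets)
    .where(x => x.id > 0)
    .groupby(x => x.status)
    .having(group => group.length > 1)
    .select(status => status);
```

Here result is ['open'], since only the 'open' group has more than one row.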

In pure JS, no, you cannot. But with the right preprocessor it is possible.
You can do something similar with sweet.js macros or (God forgive me) GPP.

What you want is to change the JavaScript parser into an SQL parser. It wasn't created to do that, and the JavaScript syntax doesn't allow it.
What you have is 90% like SQL (it maps straight onto it) and 100% valid JavaScript, which is a great achievement. My answer to the question in the title is: YES, metaprogramming is possible, but NO, it won't give you an SQL parser, since it's bound to use JavaScript grammar.

Maybe you want something like JSONPath if you've got JSON data. I found this at http://www.json.org/. Lots of other tools linked to from there if it's not exactly what you need.
(this is being worked on as well: http://docs.dojocampus.org/dojox/json/query)

Parse JavaScript to instrument code

I need to split a JavaScript file into single instructions. For example
a = 2;
foo()
function bar() {
    b = 5;
    print("spam");
}
has to be separated into three instructions. (assignment, function call and function definition).
Basically I need to instrument the code, injecting code between these instructions to perform checks. Splitting by ";" obviously wouldn't work, because you can also end instructions with newlines, and maybe I don't want to instrument code inside function and class definitions (I don't know yet). I took a course about grammars with flex/Bison, but in this case the semantic action for this rule would be "print all the descendants in the parse tree and put my code at the end", which can't be done with basic Bison I think. How do I do this? I also need to split the code because I need to interface with Python with python-spidermonkey.
Or... is there a library out there already which saves me from reinventing the wheel? It doesn't have to be in Python.

Why not use a JavaScript parser? There are lots, including a Python API for ANTLR and a Python wrapper around SpiderMonkey.
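Once a parser has given you the top-level statements with their source offsets, the instrumentation step itself is small. The sketch below assumes nodes with start and end character offsets, which some JavaScript parsers provide. To stay self-contained, it fakes the parser output with the hypothetical helper fakeNode instead of calling a real parser. instrument and __check are also made-up names.

```javascript
const source = [
    'a = 2;',
    'foo()',
    'function bar() {',
    '    b = 5;',
    '    print("spam");',
    '}'
].join('\n');

// Stand-in for real parser output: offsets are located with indexOf purely
// for this demo. A real parser would compute them while parsing.
function fakeNode(type, text) {
    const start = source.indexOf(text);
    return { type: type, start: start, end: start + text.length };
}

const program = {
    type: 'Program',
    body: [
        fakeNode('ExpressionStatement', 'a = 2;'),
        fakeNode('ExpressionStatement', 'foo()'),
        fakeNode('FunctionDeclaration', source.slice(source.indexOf('function bar')))
    ]
};

// Copy the source through unchanged and add a probe after each top-level
// statement. Function bodies are left alone because only ast.body is walked.
function instrument(code, ast, makeProbe) {
    let out = '';
    let cursor = 0;
    ast.body.forEach((node, index) => {
        out += code.slice(cursor, node.end) + '\n' + makeProbe(node, index);
        cursor = node.end;
    });
    return out + code.slice(cursor);
}

const instrumented = instrument(source, program, (node, index) =>
    '__check(' + index + ', "' + node.type + '");');
```

The result contains the three original instructions, each followed by a __check(...) call on its own line. Because the original text is copied by offset, formatting and comments inside each statement survive untouched.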

JavaScript is tricky to parse; you need a full JavaScript parser.
The DMS Software Reengineering Toolkit can parse full JavaScript and build a corresponding AST.
AST operators can then be used to walk over the tree to "split it". Even easier, however, is to apply source-to-source transformations that look for one surface syntax (JavaScript) pattern and replace it with another. You can use such transformations to insert the instrumentation into the code, rather than splitting the code to make holes in which to do the insertions. After the transformations are complete, DMS can regenerate valid JavaScript code (complete with the original comments if unaffected).

Why not use an existing JavaScript interpreter like Rhino (Java) or python-spidermonkey (not sure whether this one is still alive)? It will parse the JS and then you can examine the resulting parse tree. I'm not sure how easy it will be to recreate the original code but that mostly depends on how readable the instrumented code must be. If no one ever looks at it, just generate a really compact form.
pyjamas might also be of interest; this is a Python to JavaScript transpiler.
[EDIT] While this doesn't solve your problem at first glance, you might use it for a different approach: Instead of instrumenting JavaScript, write your code in Python instead (which can be easily instrumented; all the tools are already there) and then convert the result to JavaScript.
Lastly, if you want to solve your problem in Python but can't find a parser: Use a Java engine to add comments to the code which you can then search for in Python to instrument the code.

Why not try a javascript beautifier?
For example http://jsbeautifier.org/
Or see Command line JavaScript code beautifier that works on Windows and Linux

Forget my parser. https://bitbucket.org/mvantellingen/pyjsparser is a great and complete parser. I've fixed a couple of its bugs here: https://bitbucket.org/nullie/pyjsparser
