Learning Regular Expressions Backtracking in Google Chrome - javascript

I am trying to learn and understand regular expressions in JavaScript, and I would like to understand the concept of backtracking. Can anyone point me to the source code, or refer me to the algorithm, that JavaScript in Google Chrome (the V8 engine) uses to parse regular expressions, so I can check how it backtracks? As Google's V8 engine is open source, this should not be difficult.

The source code of V8 Engine is not exactly a friendly place to start learning about backtracking.
At first glance, and from my experience reading Java's implementation of the Pattern class, the file trunk/src/x87/regexp-macro-assembler-x87.cc contains the source code of the JS RegExp engine. Be warned that you are basically reading assembly at the level present in the source code.
You might be interested in trunk/src/runtime/runtime-regexp.cc, which contains the implementation of methods of RegExp. The code doesn't contain anything about the inner working of RegExp, though.
The concept of backtracking is similar to search algorithms. Since you don't know the result, you would perform a brute force search, but depending on how you define the search order, you can arrive at the result faster or slower.
For concatenation, each node connects to the next one in a linear manner. There is no branching, so there is no backtracking.
For a branch P1|P2|...|Pn, you can think of it as a search tree with n nodes, where you try node P1 first, then P2, ..., and finally Pn. All n nodes in the branch connect to the sequel atom, so if any of the nodes succeeds, the engine moves on to the sequel atom, and it only backtracks to the next node in the branch when all possibilities in the sequel atom have been exhausted.
For (greedy) quantifier 0 or more A*, you can think of it as a node with 2 branches, one branch to A then back to itself, and the other branch going to the next atom. The branch to A will be tried first. Note that this is a simplified description, since the engine also has to deal with 0-length matches.
For (lazy) quantifier 0 or more A*?, it is basically the same as above, except that the branch to the next atom will be tried first.
When you add an upper bound and a lower bound to a quantifier, you can imagine a counter being added to record how many times A has been matched. Depending on the value of the counter, one of the two branches may be unavailable.
So you perform a search using the search tree above, and every time you get stuck (can't reach the goal state of the tree), you backtrack and try the other branch.
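A tiny matcher makes the "search and backtrack" idea concrete. This is a sketch in the spirit of Rob Pike's classic matcher, not how V8 actually implements RegExp; it supports only literal characters, "." and a greedy "*" on the previous character, and tests whether the pattern matches a prefix of the text:

```javascript
// Returns true if `pattern` matches a prefix of `text` starting at position 0.
function matchHere(pattern, text) {
  if (pattern.length === 0) return true;             // goal state reached
  if (pattern[1] === "*") return matchStar(pattern[0], pattern.slice(2), text);
  if (text.length > 0 && (pattern[0] === "." || pattern[0] === text[0])) {
    return matchHere(pattern.slice(1), text.slice(1));
  }
  return false;                                      // dead end: caller backtracks
}

// Greedy "c*": consume as many matching characters as possible, then give
// characters back one at a time (backtracking) until the rest of the pattern
// matches the rest of the text.
function matchStar(c, pattern, text) {
  let i = 0;
  while (i < text.length && (c === "." || text[i] === c)) i++;
  for (; i >= 0; i--) {                              // backtrack from the longest try
    if (matchHere(pattern, text.slice(i))) return true;
  }
  return false;
}
```

For example, matchHere("a*ab", "aaab") first lets a* consume "aaa", fails, gives one "a" back, and then succeeds, which is exactly the backtracking step described above.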
Hope this helps get you started on the concept of backtracking.

Related

How to get the changes between two strings (insertion, deletion or same)?

I used an implementation of the Levenshtein algorithm to get the distance between two strings, but what I really need is where (at which indexes) in the second string an insertion or deletion happened, or the text stayed the same.
Is there any implementation that does this in JavaScript (or C#)?
Google offers the Diff Match and Patch libraries, which contain robust algorithms to perform the tasks you asked for. These libraries can be used in Java, JavaScript, Python, C++, C#, Objective-C, Lua and Dart.
Diff: Diff takes two texts and finds the differences. This implementation works on a character by character basis. The result of any diff may contain 'chaff', irrelevant small commonalities which complicate the output. A post-diff cleanup algorithm factors out these trivial commonalities.
Match: Match looks for a pattern within a larger text. This implementation of match is fuzzy, meaning it can find a match even if the pattern contains errors and doesn't exactly match what is found in the text. This implementation also accepts an expected location, near which the match should be found. The candidate matches are scored based on a) the number of spelling differences between the pattern and the text and b) the distance between the candidate match and the expected location. The match distance parameter sets the relative importance of these two metrics.
Patch: Two texts can be diff-ed against each other, generating a list of patches. These patches can then be applied against a third text. If the third text has edits of its own, this version of patch will apply its changes on a best-effort basis, reporting which patches succeeded and which failed.
You can learn more about it from here.
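If you only need the per-index operations and don't want a library, a minimal LCS-based diff is enough for short strings. This is a sketch of the general technique, not the Diff Match and Patch algorithm:

```javascript
// Character-level diff via longest common subsequence (LCS).
// Returns a list of operations: "same", "insert" (index in b), "delete" (index in a).
function diff(a, b) {
  // Build the LCS length table.
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  // Walk back through the table to recover the operations.
  const ops = [];
  let i = a.length, j = b.length;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
      ops.unshift({ op: "same", char: a[i - 1], indexB: j - 1 });
      i--; j--;
    } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
      ops.unshift({ op: "insert", char: b[j - 1], indexB: j - 1 });
      j--;
    } else {
      ops.unshift({ op: "delete", char: a[i - 1], indexA: i - 1 });
      i--;
    }
  }
  return ops;
}
```

For example, diff("cat", "cart") reports same/same/insert/same, with the insertion of "r" at index 2 of the second string.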

Mutual recursion in Regex

I'd like to know how it is possible to get mutual recursion with regexes in JavaScript.
For instance, a regex recognising the language {a^n b^n} would be /a$0b/, where $0 stands for the current regex. That would be basic recursion.
Mutual recursion is the same, but with different regexes.
How can we get it done in JavaScript?
Thanks for your answers!
Unfortunately, Javascript doesn't have support for recursive regular expressions.
You can implement a basic recursive function with checks for the first and last letter (to ensure they're a and b respectively) and a recursive call for the interior string (the $0 in your example).
For one, you realize that this language produces phrases that have strictly an even number of letters, so you can implement a check to break out early.
Secondly, your only other failure case is if the letters at each end don't match.
You can map any other regular expression to such a tester function. To implement a mutually recursive one, simply call one function from the other.
To process the input the way a state machine would, go token by token when you do the checks: for the example above, first check that you have an a at the beginning, then recurse on the interior, then check for the b at the end.
The a^n b^n language you are describing isn't even a regular language, so a mutually recursive construction is further still from what regular expressions can handle. You cannot use JavaScript's regular expression engine to recognise such a language, which is why you need to write your own recogniser (effectively a pushdown automaton, since the recursion gives you the stack) that takes a token at a time and proceeds to the next state when it encounters a match.
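The hand-written tester described above might look like this; the mutually recursive pair isA/isB recognises a made-up toy language, purely to show the pattern of one function calling the other:

```javascript
// Recursive tester for {a^n b^n}: check the outer characters, then
// recurse on the interior. The empty string is the n = 0 case.
function matchesAnBn(s) {
  if (s.length === 0) return true;
  if (s.length % 2 !== 0) return false;        // odd length: bail out early
  if (s[0] !== "a" || s[s.length - 1] !== "b") return false;
  return matchesAnBn(s.slice(1, -1));
}

// Mutual recursion sketch: two "rules" that call each other.
// isA accepts "a" X "b" where X is recognised by isB, and isB accepts
// "b" Y "a" where Y is recognised by isA (a hypothetical toy grammar).
function isA(s) {
  if (s === "") return true;
  return s[0] === "a" && s[s.length - 1] === "b" && isB(s.slice(1, -1));
}
function isB(s) {
  if (s === "") return true;
  return s[0] === "b" && s[s.length - 1] === "a" && isA(s.slice(1, -1));
}
```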

Will this regex be slow? Is there any way to optimize it?

This will be run in javascript multiple times on bits of HTML. Will all of the or expressions make it slow? Can it be optimized?
\<[^\>]*?(abbr|acronym|address|applet|area|article|aside|audio|base|basefont|bdi|bdo|big|blockquote|body|button|canvas|caption|center|cite|code|col|colgroup|command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|head|header|hr|html|iframe|img|input|ins|kbd|keygen|label|legend|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|optgroup|option|output|param|pre|progress|q|rp|rt|ruby|samp|script|section|select|small|source|strike|style|sub|summary|sup|textarea|time|title|track|tt|var|video|wbr)[^\>]*?\>/g
You might try moving element names found very frequently in your source (a, div) to the front of the list:
… (a|div|abbr| …
Also, I think your pattern will match, e.g., < notanabbreviation >. If that's not what you want, try
<\b(a|abbr|…)\b[^>]*?>
The \b preceding the alternations helps because it lets the engine exit early without trying all of the alternations.
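A quick check of the false positive mentioned above, with the alternation shortened to three names for readability ("abbr" occurs inside "notanabbreviation", so the original pattern matches a tag that isn't really an abbr):

```javascript
// Original style: the alternation can match anywhere inside the tag.
const loose = /<[^>]*?(abbr|acronym|address)[^>]*?>/;
// With \b anchors: the name must appear as a whole word right after "<".
const strict = /<\b(abbr|acronym|address)\b[^>]*?>/;

console.log(loose.test("<notanabbreviation>"));   // true  (false positive)
console.log(strict.test("<notanabbreviation>"));  // false
console.log(strict.test("<abbr title='x'>"));     // true
```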
But you just have to test to see. I made a jsperf test using nytimes.com as an example.
You can use this tool to compare different regexes; it took 2.4 seconds to execute over the source code of Yahoo's front page. This is not a scientific test, but it doesn't look very efficient. PS: a Silverlight plugin is required.
Adding i after the g will make it case insensitive.
Also, since it is JavaScript, maybe you could use a hash instead of a giant regex.
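The hash idea could look like this sketch: pull out each tag name with one simple regex and test membership in a Set (the tag list is truncated here for readability):

```javascript
// Tag names of interest, looked up in O(1) instead of a giant alternation.
const blocked = new Set(["abbr", "acronym", "address", "div", "script"]);

function findBlockedTags(html) {
  const found = [];
  // Capture the name of each opening or closing tag.
  const tagRe = /<\s*\/?\s*([a-zA-Z][a-zA-Z0-9]*)/g;
  let m;
  while ((m = tagRe.exec(html)) !== null) {
    const name = m[1].toLowerCase();
    if (blocked.has(name)) found.push(name);
  }
  return found;
}
```

For example, findBlockedTags("&lt;div&gt;&lt;span&gt;&lt;/span&gt;&lt;abbr&gt;") returns only the names present in the Set.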

Is it possible to enumerate computer programs?

Suppose you have to write a program that will test all programs in search of one that completes a specific task. For example, consider this JavaScript function:
function find_truth() {
    for (var n = 0; ; ++n) {
        try {
            var fn = Function(string(n));
            if (fn() == 42)
                return fn;
        } catch (e) {
            continue;
        }
    }
}
As long as string(n) returns the nth possible string ("a", "b", "c", ... "aa", "ab" ...), this program would eventually output a function that evaluates to 42. The problem with this method is that it enumerates strings that may or may not be valid programs. My question is: is it possible to enumerate the programs themselves? How?
Yes, there are ways that this is possible. A few months ago I made a little experimental project to do something like it, which works reasonably well considering what it's doing. You give it a type and some tests to pass, and it searches for a suitable function that passes the tests. I never put in the work to make it mature, but you can see that it does in fact eventually generate functions that pass the tests in the case of the examples. Requiring the type was an essential component for the practicality of this search -- it drastically reduced the number of programs that needed to be tried.
It is important to have a firm grasp of the theory here before making assertions about what is and is not possible -- there are many who will jump to the halting problem and tell you that it's impossible, when the truth is quite a bit more nuanced than that.
You can easily generate all syntactically valid programs in a given language. Naively, you can generate all strings and keep only the ones that parse/typecheck; but there are better ways.
If the language is not turing complete -- e.g. the simply-typed lambda calculus, or even something very powerful like Agda, you can enumerate all programs that generate 42, and given any program you can decide "this generates 42" or "this does not generate 42". (The language I used in my experimental project is in this class). The tension here is that in any such language, there will be some computable functions which you cannot write.
Even if the language is turing complete, you can still enumerate all programs which eventually generate 42 (do this by running them all in parallel, and reporting success as they finish).
What you cannot do with a turing complete language is decide that a program will definitely never generate 42 -- it could run forever trying, and you won't be able to tell whether it will eventually succeed until it does. There are some programs which you can tell will loop forever, though, just not all of them. So if you have a table of programs, you might expect your enumerator program in the case of a non-turing-complete language to look like this:
Program | P1 | P2 | P3 | P4 | P5 | P6 | P7 | ...
42? | No | No | No | Yes | No | No | No | ...
Whereas in a turing complete language, your output (at a given time) would be like this:
Program | P1 | P2 | P3 | P4 | P5 | P6 | P7 | ...
42? | No | No | Loop | Yes | No | Dunno | No | ...
And later that Dunno might turn into a Yes or a No, or it might stay dunno forever -- you just dunno.
This is all very theoretical, and actually generating programs in a turing complete language to search for ones that do a certain thing is pretty hard and takes a long time. Certainly you don't want to do it one by one, because the space is so big; you probably want to use a genetic search algorithm or something.
Another subtle point in the theory: talking about programs which "generate 42" is missing some important detail. Usually when we talk about generating programs, we want it to generate a certain function, not just output something specific. And this is when things you might want to do get more impossible. If you have an infinite space of possible inputs -- say, inputting a single number, then
You can't in general tell whether a program computes a given function. You can't check every input manually because infinity is too many to check. You can't search for proofs that your function does the right thing, because for any computable function f, for any axiom system A, there are programs that compute f such that A has no proof that they compute f.
You can't tell whether a program is going to give any kind of answer for every possible input. It might work for the first 400,000,000 of them but infinite loop on the 400,000,001st.
Similarly, you can't tell whether two programs compute the same function (i.e. same inputs -> same outputs).
There you have it, a daily dose of decidability theory phenomenology.
If you are interested in doing it for real, look into genetic algorithms, and put timeouts on the functions you try and/or use a decidable (non-turing-complete) language.
It is certainly possible to enumerate all programs in a given language that are syntactically valid (and in a statically typed language even only those that type check): You could simply enumerate all strings like you proposed, try to feed each one into a parser for the language and then reject those that can't be parsed. So if that's your definition of a valid program, then yes, it's possible.
It is however not true that your program would necessarily output a function that returns 42 eventually - even if you replace string with a function that only returns syntactically valid programs. If a returned function contains an infinite loop, it will run forever and thus your program will never get to try another function. Thus you might never reach a function that returns 42.
To avoid this issue, you might say that the string(n) function should only produce code that is syntactically valid and does not contain an infinite loop (while still being able to return all such functions). That, however, is not possible because that would entail deciding the Halting Problem (which is, of course, undecidable).
As has been noted, you can trivially turn a "generate all strings" program into a "generate all valid programs in language X" by plugging in a parser/compiler for language X. Generally if you can write a program which takes text as input and returns true/false indicating whether the text represents a valid program, then you can use it as a filter on the "generate all strings" program.
You could also design a programming language in which every string of characters is a valid program (perl springs to mind).
Probably more interesting is that given a formal grammar for a language, you could use that to generate programs in the language instead of parsing them. You just need to do a breadth-first traversal of the grammar to be sure that every finite-length program will be reached in some finite time (if you do a depth-first traversal you'll get stuck exploring all programs consisting solely of a variable whose name is an ever-longer sequence of 'A's, or something).
Most grammars actually used in parsing programming languages wouldn't work directly for this though; they're usually slightly over-permissive. For example, a grammar may define identifiers as anything matching a regex [_A-Za-z][0-9_A-Za-z]*, which allows variable names of unbounded length, but many language implementations will actually choke on programs with variable names that are gigabytes long. But you could in principle find out about all of these kind of gotchas and write an enumerable grammar that exactly covers all of the valid programs in some language of interest.
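The breadth-first traversal can be sketched with a toy grammar; the grammar and its representation here are illustrative, not a real JavaScript grammar:

```javascript
// Nonterminals are the keys of this object; everything else is a terminal.
const grammar = {
  E: [["N"], ["E", "+", "N"]],   // expressions: N, or E+N
  N: [["0"], ["1"]],             // numbers: 0 or 1
};

// Lazily yields every terminal string derivable from `start`, breadth-first,
// so every finite program is reached after finitely many steps.
function* generate(start) {
  const queue = [[start]];                 // sentential forms still to expand
  while (queue.length > 0) {
    const form = queue.shift();            // FIFO queue -> breadth-first order
    const i = form.findIndex((sym) => grammar[sym]);
    if (i === -1) {
      yield form.join("");                 // all terminals: a finished program
      continue;
    }
    for (const production of grammar[form[i]]) {
      queue.push([...form.slice(0, i), ...production, ...form.slice(i + 1)]);
    }
  }
}
```

The first few strings produced are "0", "1", "0+0", "0+1", and so on; a depth-first version would instead disappear down the E -> E+N -> E+N+N -> ... branch forever.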
So that lets you enumerate all programs. That's not actually sufficient to let you run your find_truth program and find a program that returns 42 though, because it'll get stuck infinitely evaluating the first program that happens to contain an infinite loop.
But it's still actually possible to do this! You just need to pick an order in which to examine all the possibilities so that everything is eventually reached in some finite time. You've got two infinite "dimensions" to search in; the enumeration of all the programs, and the depth of evaluation of each program. You can make sure you cover all bases by doing something like the following strategy:
Run all programs of length up to 1 for up to 1 step
Run all programs of length up to 2 for up to 2 steps
Run all programs of length up to 3 for up to 3 steps
...
And so on. This guarantees that whatever the length of the program and number of "steps" needed, you'll eventually hit them without needing to have done an infinite amount of work "first" (so long as a program with your desired result actually exists).
If you've got unbounded storage available you can avoid repeating work between these phases (you store all programs of length N that haven't finished in N steps, along with their state, and then in the next round you run the new programs up to N+1 steps, and run all the "pending" programs for one more step each). The definition of "step" doesn't matter much, so long as it doesn't admit infinite loops. Some finite number of bytecodes, or CPU instructions, or even seconds; you'd need a suspendable implementation of the language, of course.
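The run-N-programs-for-N-steps strategy can be sketched by modelling programs as generator functions, where each next() call is one "step" and a finished generator's value is its output. The three toy programs here are made up for illustration; the second one loops forever, and the dovetailing still finds the answer:

```javascript
function* quick()   { return 7; }                             // finishes, wrong answer
function* forever() { while (true) yield; }                   // never finishes
function* slow()    { for (let i = 0; i < 5; i++) yield; return 42; }
const programs = [quick, forever, slow];

// Phase `bound`: run the first `bound` programs for up to `bound` steps each.
// For simplicity each phase restarts the programs instead of saving state.
// Note: this loops forever if no program ever returns `target`.
function findProgramReturning(target) {
  for (let bound = 1; ; bound++) {
    for (let p = 0; p < Math.min(bound, programs.length); p++) {
      const run = programs[p]();
      for (let step = 0; step < bound; step++) {
        const r = run.next();
        if (r.done) {
          if (r.value === target) return p;  // found a program returning target
          break;                             // finished with the wrong output
        }
      }
    }
  }
}
```

Here findProgramReturning(42) finds the third program even though an earlier program never halts, because no single program is ever run for unboundedly many steps within one phase.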
Assuming I am correctly interpreting your phrase "is it possible to enumerate programs themselves?", then yes, you can enumerate programs. This is essentially a genetic programming problem. See:
http://en.wikipedia.org/wiki/Genetic_programming
Here programs themselves are created, run, and evaluated (with a resulting fitness value, where the optimum value = 42), just as with genetic algorithms, so this would provide your enumeration. Furthermore, over several generations you could have it evolve your program to produce 42.
It is impossible. The problem is that some programs may take a long time to finish computing, and some programs may be stuck in an infinite loop. Ideally you would like to abort the programs that are stuck in infinite loops while keeping the merely long-running ones. You could implement a timer, but what if you had a program that ran longer than the timer, yet would still return the correct answer?
In general, the only way to see if a computer program will terminate is to run it and see, with the risk of entering an infinite loop. Of course, you could implement some heuristics to recognize common forms of infinite loops, but in general it is impossible. This is known as the halting problem.
EDIT:
I realize that I only partially answer your question. You ask if it is possible to enumerate programs themselves. This is certainly possible. You already have a way of generating all possible strings in order. Then you could just see which strings parse correctly as a javascript program, and just keep those.
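A concrete string(n) for that enumeration could use shortlex order over a lowercase alphabet (the alphabet is an arbitrary choice); this is bijective base-26 numbering:

```javascript
// Returns the nth string in the order "a", "b", ..., "z", "aa", "ab", ...
function string(n) {
  const alphabet = "abcdefghijklmnopqrstuvwxyz";
  const k = alphabet.length;
  let s = "";
  n += 1;                         // bijective numbering: n = 0 maps to "a"
  while (n > 0) {
    n -= 1;
    s = alphabet[n % k] + s;      // prepend the lowest "digit"
    n = Math.floor(n / k);
  }
  return s;
}
```

Feeding these strings through a parser, and keeping only the ones that parse, then yields an enumeration of syntactically valid programs.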
I would suggest starting from a grammar of JavaScript, for example the one from ANTLR.
https://github.com/antlr/grammars-v4/blob/master/javascript/javascript/JavaScriptParser.g4
The grammar defines a directed graph similar to this one:
grammar Exp;
/* This is the entry point of our parser. */
eval
: additionExp
;
/* Addition and subtraction have the lowest precedence. */
additionExp
: multiplyExp
( '+' multiplyExp
| '-' multiplyExp
)*
;
/* Multiplication and division have a higher precedence. */
multiplyExp
: atomExp
( '*' atomExp
| '/' atomExp
)*
;
Using this structure you can create a program that generates all grammatically valid programs of different depths 1, 2, 3, 4, ... and runs them in an emulator. These would be syntactically valid programs, although many would raise errors (think division by zero or accessing an undefined variable), and for some you would not be able to prove whether they finish or not. But generating as many grammatically correct programs as you like is possible with a properly defined grammar like the one provided by ANTLR.

How does a JavaScript parser work?

I'm trying to understand how JS is actually parsed. But my searches either return someone's very vaguely documented "parser/generator" project (I don't even know what that means), or explain how to parse JS using a JS engine's magical "parse" method. I don't want to scan through a bunch of code and spend all my life trying to understand it (although I could, it would take too long).
I want to know how an arbitrary string of JS code is actually turned into objects, functions, variables, etc. I also want to know the procedures and techniques that turn that string into those things, and how they get stored, referenced, and executed.
Are there any documentation/references for this?
Parsers probably work in all sorts of ways, but fundamentally they first go through a stage of tokenisation, then give the result to the compiler, which turns it into a program if it can. For example, given:
function foo(a) {
alert(a);
}
the parser will remove any leading whitespace to the first character, the letter "f". It will collect characters until it gets something that doesn't belong, the whitespace, that indicates the end of the token. It starts again with the "f" of "foo" until it gets to the "(", so it now has the tokens "function" and "foo". It knows "(" is a token on its own, so that's 3 tokens. It then gets the "a" followed by ")" which are two more tokens to make 5, and so on.
The only need for whitespace is between tokens that are otherwise ambiguous (e.g. there must be either whitespace or another token between "function" and "foo").
Once tokenisation is complete, it goes to the compiler, which sees "function" as an identifier, and interprets it as the keyword "function". It then gets "foo", an identifier that the language grammar tells it is the function name. Then the "(" indicates an opening grouping operator and hence the start of a formal parameter list, and so on.
Compilers may deal with tokens one at a time, or may grab them in chunks, or do all sorts of weird things to make them run faster.
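The tokenisation walkthrough above can be sketched like this; a real lexer also handles numbers, string literals, comments, regex literals, multi-character operators and much more:

```javascript
// Toy tokeniser: skip whitespace, greedily collect identifier characters,
// and emit everything else as single-character punctuation tokens.
function tokenise(source) {
  const tokens = [];
  const isWord = (c) => /[A-Za-z0-9_$]/.test(c);
  let i = 0;
  while (i < source.length) {
    if (/\s/.test(source[i])) { i++; continue; }     // whitespace ends a token
    if (isWord(source[i])) {
      let j = i;
      while (j < source.length && isWord(source[j])) j++;
      tokens.push(source.slice(i, j));               // identifier or keyword
      i = j;
    } else {
      tokens.push(source[i]);                        // "(" , ")" , "{" , ...
      i++;
    }
  }
  return tokens;
}
```

Running it on the example function yields the token stream the answer describes: "function", "foo", "(", "a", ")", and so on, which the compiler then interprets against the grammar.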
You can also read How do C/C++ parsers work?, which gives a few more clues. Or just use Google.
While it doesn't correspond closely to the way real JS engines work, you might be interested in reading Douglas Crockford's article on Top Down Operator Precedence, which includes code for a small working lexer and parser written in the Javascript subset it parses. It's very readable and concise code (with good accompanying explanations) which at least gives you an outline of how a real implementation might work.
A more common technique than Crockford's "Top Down Operator Precedence" is recursive descent parsing, which is used in Narcissus, a complete implementation of JS in JS.
Maybe Esprima will help you understand how a JS parser handles the grammar. It's available online.
