Mutual recursion in Regex

I'd like to know how it is possible to get mutual recursion with regex in JavaScript.
For instance, a regex recognising the language {a^n b^n} would be /a$0b/, where $0 stands for the current regex. That would be basic recursion.
Mutual recursion is the same, but between different regexes.
How can we get it done in JavaScript?
Thanks for your answers!

Unfortunately, JavaScript doesn't support recursive regular expressions.
Instead, you can implement a basic recursive function that checks the first and last letters (to ensure they're a and b respectively) and makes a recursive call on the interior string (the $0 in your example).
For one, every string in this language has an even number of letters, so you can check the length and bail out early.
Secondly, the only other failure case is when the letters at the two ends don't match.
You can map any other recursive pattern to such a tester function. To implement a mutually recursive one, simply call one function from the other.
To work more like a real parser, go token by token when you do the checks; for the example above, first check that you have an a at the beginning, then recurse, then check for the b at the end.
The a^n b^n language you are describing isn't even a regular language, and mutually recursive grammars generally aren't either. You cannot use JavaScript's regular expression engine to recognise such a language, which is why you need to write the recogniser yourself: a recursive-descent function (in effect a pushdown automaton rather than a finite-state machine, since a finite-state machine cannot count the a's) that takes a token at a time and proceeds only if it encounters a match.
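A minimal sketch of the recursive tester described above, assuming n = 0 (the empty string) is allowed. The names isAnBn, isS and isT are hypothetical, and the second grammar is made up purely to show one function calling the other.

```javascript
// Recognises { a^n b^n | n >= 0 }, the job the hypothetical /a$0b/ would do.
function isAnBn(s) {
  if (s.length === 0) return true;        // base case: n = 0
  if (s.length % 2 !== 0) return false;   // odd length can never match
  if (s[0] !== "a" || s[s.length - 1] !== "b") return false;
  return isAnBn(s.slice(1, -1));          // recurse on the interior string
}

// Mutual recursion over an invented grammar:
//   S -> "a" T "b" | ""
//   T -> "c" S "d" | ""
function isS(s) {
  if (s === "") return true;
  return s.length >= 2 && s[0] === "a" && s[s.length - 1] === "b" &&
    isT(s.slice(1, -1));
}

function isT(s) {
  if (s === "") return true;
  return s.length >= 2 && s[0] === "c" && s[s.length - 1] === "d" &&
    isS(s.slice(1, -1));
}
```

With this, isAnBn("aabb") is accepted and isAnBn("abab") is rejected; isS("acdb") is accepted because S wraps a T, which wraps an empty S.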

Related

Is there a way to expand a deterministic regular expression in javascript?

I want to know if there is a way to see whether part of a regular expression is deterministic. For example, the regular expression 0{3} is deterministic, i.e. there is only one string that matches it: "000". So if we had the regular expression \d0{3} along with the string "1", is there a way to get the string "1000" from that? It seems technically possible, since once you have the first digit you know that the remaining digits are all 0s and that there can only be 3 of them. I don't know if I am missing something, though.

A sufficient condition for a regular expression to be deterministic is that it does not contain:
The | operator.
Any quantifiers (+,*,?,{n,m}) other than fixed repetition ({n}).
Any character class matching more than one character (\w, [a-z])
These conditions are not necessary because of zero-width assertions. For example, the expression (?!x)(x|y) only matches y. So this simple approach will not cover all cases, though it may suffice for your application.
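As a rough sketch of that sufficient condition, here is a deliberately tiny expander. It assumes the pattern consists only of plain letters and digits, each optionally followed by a fixed {n}; anything else (escapes, classes, |, other quantifiers, assertions) makes it give up and return null. The name expandFixedPattern is hypothetical.

```javascript
// Returns the single string a pattern matches, or null when the pattern
// uses anything beyond literal letters/digits and fixed {n} repetition.
function expandFixedPattern(source) {
  let out = "";
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (!/[A-Za-z0-9]/.test(ch)) return null; // not a plain literal: give up
    i += 1;
    let count = 1;
    const rep = /^\{(\d+)\}/.exec(source.slice(i)); // only the {n} form
    if (rep) {
      count = Number(rep[1]);
      i += rep[0].length;
    }
    out += ch.repeat(count);
  }
  return out;
}
```

For the example in the question, the deterministic tail 0{3} expands to "000", so "1" + expandFixedPattern("0{3}") gives "1000", while expandFixedPattern("a{1,2}") returns null.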
At least for the case of true regular expressions without backreferences, it should be possible to determine whether they are singular, i.e. match exactly one string. Simply use the standard construction to turn the expression into a nondeterministic finite automaton, then a deterministic finite automaton, then minimize it. After discarding the dead state (the non-accepting sink, if there is one) and the edges into it, the minimal DFA is singular if and only if there is exactly one accepting state, the accepting state has no edges coming from it, and every nonaccepting state has exactly one edge coming from it.
To handle lookahead assertions, you might need to turn the expression into an alternating finite automaton, then use an approach similar to Thompson's construction to get the NFA, then proceed from there. Note that the worst-case here could have doubly exponential blowup. You can take \b and ^ and similar and translate them to one-character lookbehind assertions, then do some fiddly stuff to get those to work.

Learning Regular Expressions Backtracking in Google Chrome

I am trying to learn regular expressions in JavaScript and would like to understand how backtracking works in them. Can anyone point me to the source code, or to the algorithm, that JavaScript in Google Chrome (the V8 engine) uses to parse regular expressions, so that I can see how it backtracks? As Google's V8 engine is open source, this should not be difficult.

The source code of V8 Engine is not exactly a friendly place to start learning about backtracking.
At first glance and from my experience reading Java's implementation of Pattern class, the file trunk/src/x87/regexp-macro-assembler-x87.cc contains the source code of JS RegExp. You are basically reading assembly at the level present in the source code.
You might be interested in trunk/src/runtime/runtime-regexp.cc, which contains the implementation of methods of RegExp. The code doesn't contain anything about the inner working of RegExp, though.
The concept of backtracking is similar to search algorithms. Since you don't know the result, you would perform a brute force search, but depending on how you define the search order, you can arrive at the result faster or slower.
For concatenation, each node connects to the next one in a linear manner. There is no branching, so there is no backtracking.
For a branch P1|P2|...|Pn, you can think of it as a search tree with n nodes, where you will try the node P1 first, then P2, ... and finally Pn. All n nodes in the branch connect to the sequel atom, so if any of the nodes succeeds, the engine moves on to the sequel atom, and it only backtracks to the next node in the branch when all possibilities on the sequel atom have been exhausted.
For the (greedy) 0-or-more quantifier A*, you can think of it as a node with 2 branches, one branch to A then back to itself, and the other branch going to the next atom. The branch to A will be tried first. Note that this is a simplified description, since the engine also has to deal with 0-length matches.
For the (lazy) 0-or-more quantifier A*?, it is basically the same as above, except that the branch to the next atom will be tried first.
When you add an upper bound and a lower bound to a quantifier, you can imagine a counter being added to record how many times A has been matched. Depending on the counter's value, one of the two branches may be unavailable.
So you will perform a search using the search tree above, and every time you get stuck (can't reach the goal state of the tree), you will backtrack and try the other branch.
Hope this helps you get started on the concept of backtracking.
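To make the search-tree picture concrete, here is a toy backtracking matcher. It is only a sketch and not how V8 implements anything: it assumes a well-formed pattern made of literal characters, `.` and greedy `x*`, and it matches the whole string. The name backtrackMatch and the backtracks counter are hypothetical.

```javascript
// Toy whole-string matcher: literals, "." and greedy "x*" only.
function backtrackMatch(pattern, text) {
  let backtracks = 0;

  function matchHere(pi, ti) {
    // Goal state: pattern used up and text fully consumed.
    if (pi === pattern.length) return ti === text.length;
    const atom = pattern[pi];
    const fits = (k) => k < text.length && (atom === "." || atom === text[k]);

    if (pattern[pi + 1] === "*") {
      // Greedy: follow the "match A again" branch as far as it goes...
      let end = ti;
      while (fits(end)) end += 1;
      // ...then try the "next atom" branch, giving back one repetition
      // each time the rest of the pattern gets stuck.
      for (let k = end; k >= ti; k -= 1) {
        if (matchHere(pi + 2, k)) return true;
        backtracks += 1;
      }
      return false;
    }
    // Plain concatenation: no branching, so nothing to backtrack to here.
    return fits(ti) && matchHere(pi + 1, ti + 1);
  }

  const matched = matchHere(0, 0);
  return { matched, backtracks };
}
```

For backtrackMatch("a*ab", "aaab"), the a* first swallows all three a's, the following a then fails against b, and the matcher backtracks once to hand one a back before succeeding.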

is there any way to get all the possible outcomes of a regular expression pattern?

Is there any way to get all the possible outcomes of a regular expression pattern? Everything I've seen refers to a pattern that is evaluated against a string, but what I need is to take a pattern like this:
^EM1650S(B{1,2}|L{1,2})?$
generate all possible matches:
EM1650S
EM1650SB
EM1650SBB
EM1650SL
EM1650SLL

In the general case, no. In this case, you have almost no solution space.
There's a section covering this in Higher Order Perl (PDF) and a Perl module. I never re-implemented it in anything else, but I had a similar problem and this solution was adequate for similarly-limited needs.

There are tools that can display all possible matches of a regex.
Here is one written in Haskell: https://github.com/audreyt/regex-genex
and here is a Perl module: http://metacpan.org/pod/Regexp::Genex
Unfortunately I couldn't find anything for JavaScript

In this particular case, yes. The regex matches a finite number of strings, so they can all be enumerated.
You'll just have to parse the regex. Part of it (EM1650S) is mandatory, so concentrate on the rest. Split it at the | (or) symbol, enumerate the strings for each side, and then combine them with the mandatory prefix to get all possible results.
Some regexes (those containing * or +) can match an infinite number of strings, so those cannot be listed exhaustively.
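Here is a small sketch of that enumeration in JavaScript. To keep it short it skips the parsing step: the pattern from the question is written out by hand using a few helper functions (lit, alt, seq, repeat and opt are hypothetical names), each of which returns the list of strings its sub-pattern can match. It only works for bounded patterns, so there is no * or +.

```javascript
// Each helper returns an array of every string its sub-pattern matches.
function lit(s) { return [s]; }

// P1|P2|...: the union of the alternatives.
function alt(...options) { return options.flat(); }

// Concatenation: every combination of one string from each part.
function seq(...parts) {
  return parts.reduce(
    (acc, part) => acc.flatMap((a) => part.map((b) => a + b)),
    [""]
  );
}

// X{min,max}: X repeated min..max times.
function repeat(set, min, max) {
  const out = [];
  for (let n = min; n <= max; n += 1) {
    out.push(...seq(...Array(n).fill(set)));
  }
  return out;
}

// X?: either nothing or X.
function opt(set) { return ["", ...set]; }

// ^EM1650S(B{1,2}|L{1,2})?$ written out by hand.
const all = seq(
  lit("EM1650S"),
  opt(alt(repeat(lit("B"), 1, 2), repeat(lit("L"), 1, 2)))
);
console.log(all.join("\n"));
```

This produces the five strings listed in the question, in the same order. A general solution would still need a regex parser in front of these helpers.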

From a computational theoretic standpoint, regular expressions are equivalent to finite state machines. This is part of "automata theory." You could create a finite state machine that is equivalent to a regular expression and then use graph traversal algorithms to traverse all paths of the FSM. In the general case a countably infinite number of strings may match a regular expression, so your program may never terminate depending on the input regular expression.

How to let user enter a JavaScript function securely?

What I'm doing is creating a truth table generator. Using a function provided by the user (e.g. a && b || c), I'm trying to make JavaScript display all combinations of a, b and c with the result of the function.
The point is that I'm not entirely sure how to parse the function provided by the user. The user could namely basically put everything he wants in a function, which could have the effect of my website being changed etc.
eval() is not secure at all; neither is new Function(), as both can make the user put everything to his liking in the function. Usually JSON.parse() is a great alternative to eval(), however functions don't exist in JSON.
So I was wondering how I can parse a custom boolean operator string like a && b || c to a function, whilst any malicious code strings are ignored. Only boolean operators (&&, ||, !) should be allowed inside the function.

Even if you check for boolean expressions, I could do this:
(a && b && (function() { ruinYourShit(); return true; })())
This is to illustrate that your problem is not solvable in the general case.
To make it work for your situation, you'd have to impose severe restrictions on variable naming, like requiring that all variables are a single alphabet letter, and then use a regular expression to kick it back if anything else is found. Also, to "match" a boolean expression, you'd actually have to develop a grammar for that expression. You are attempting to parse a non-regular language (javascript) and therefore you cannot write a regular expression that could match every possible boolean expression of arbitrary complexity. Basically, the problem you are trying to solve is really, really hard.
Assuming you aren't expecting the world to come crashing down if someone ruins your shit, you can develop a good enough solution by simply checking for the function keyword, and disallowing any logical code blocks contained in { and }.

I don't see the problem with using eval() and letting the user do whatever (s)he wants, as long as you have proper validation/filters on any of your server-side scripts that expect input. Sure, the user could do something that ruins your page, but (s)he can do that already. The user can easily run any JavaScript (s)he wants on your page, using any number of built-in browser features or add-ons.

What I would do would be to first split the user input into tokens (variable names, operators and possibly parentheses for grouping), then build an expression tree from that and then generate the relevant output from the expression tree (possibly after running some simplification on it).
So, say, break the string "a & !b" into the token sequence "a" "&" "!" "b", then go through and (eventually) build something like: booland(boolvar("a"), boolnot(boolvar("b"))) and you then have a suitable data structure to run your (custom, hopefully injection-free) evaluator over.
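A rough sketch of that pipeline, with no eval() anywhere. It assumes single lowercase letters as variable names and the operators &&, ||, ! plus parentheses, as in the question; the function names (tokenize, parse, evaluate, truthTable) and the node shapes are hypothetical.

```javascript
// 1. Tokens: &&, ||, !, parentheses, single lowercase letters.
function tokenize(input) {
  const tokens = [];
  const re = /\s*(&&|\|\||[!()]|[a-z])/y; // sticky: must match at lastIndex
  let pos = 0;
  while (pos < input.length) {
    re.lastIndex = pos;
    const m = re.exec(input);
    if (!m) {
      if (input.slice(pos).trim() === "") break; // only trailing whitespace
      throw new SyntaxError(`Unexpected input at position ${pos}`);
    }
    tokens.push(m[1]);
    pos = re.lastIndex;
  }
  return tokens;
}

// 2. Expression tree via recursive descent: or > and > not > primary.
function parse(tokens) {
  let i = 0;
  const peek = () => tokens[i];
  const next = () => tokens[i++];
  function parseOr() {
    let node = parseAnd();
    while (peek() === "||") { next(); node = { op: "or", left: node, right: parseAnd() }; }
    return node;
  }
  function parseAnd() {
    let node = parseNot();
    while (peek() === "&&") { next(); node = { op: "and", left: node, right: parseNot() }; }
    return node;
  }
  function parseNot() {
    if (peek() === "!") { next(); return { op: "not", arg: parseNot() }; }
    return parsePrimary();
  }
  function parsePrimary() {
    const t = next();
    if (t === "(") {
      const node = parseOr();
      if (next() !== ")") throw new SyntaxError("Expected ')'");
      return node;
    }
    if (t !== undefined && /^[a-z]$/.test(t)) return { op: "var", name: t };
    throw new SyntaxError(`Unexpected token ${t}`);
  }
  const tree = parseOr();
  if (i < tokens.length) throw new SyntaxError(`Unexpected token ${peek()}`);
  return tree;
}

// 3. Custom evaluator: only ever walks the tree built above.
function evaluate(node, env) {
  switch (node.op) {
    case "var": return Boolean(env[node.name]);
    case "not": return !evaluate(node.arg, env);
    case "and": return evaluate(node.left, env) && evaluate(node.right, env);
    case "or": return evaluate(node.left, env) || evaluate(node.right, env);
    default: throw new Error(`Unknown node ${node.op}`);
  }
}

// 4. One row per combination of variable values.
function truthTable(source) {
  const tokens = tokenize(source);
  const tree = parse(tokens);
  const names = [...new Set(tokens.filter((t) => /^[a-z]$/.test(t)))].sort();
  const rows = [];
  for (let bits = 0; bits < 2 ** names.length; bits += 1) {
    const env = {};
    names.forEach((n, k) => { env[n] = Boolean(bits & (1 << (names.length - 1 - k))); });
    rows.push({ ...env, result: evaluate(tree, env) });
  }
  return rows;
}
```

truthTable("a && b || c") returns eight rows, while something like "a && (function(){})()" is rejected with a SyntaxError at the parsing stage, because the user's text is never executed as JavaScript.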

I ended up with a regular expression:
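if (!/^[a-zA-Z|&!() ]+$/.test(str)) {
    // Only letters, |, &, !, parentheses and spaces are accepted.
    // Note: this restricts the character set only; a name followed by () still passes.
    throw new Error("The function is not a combination of Boolean operators.");
}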

How to implement Lexical Analysis in Javascript

Hey folks, thanks for reading
I am currently attempting to do a Google-style calculator. You input a string, it determines if it can be calculated and returns the result.
I began slowly with the basics: + - / * and parentheses handling.
I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.
This kind of work is easy to do with tools such as Lex and Yacc, except that I am developing a JavaScript-only application.
I tried to translate the idea into JavaScript, but I can't figure out how to handle everything in a clean and beautiful way, especially nested parentheses.
Analysis
Let's define what a calculator query is:
// NON TERMINAL EXPRESSIONS //
query -> statement
query -> ε // means end of query
statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number
number -> integer
number -> float
// TERMINAL EXPRESSIONS //
operator -> [+*/%^-]
prefix -> -
integer -> [0-9]+
float -> [0-9]+[.,][0-9]+
Javascript
Lexical Analysis consists in verifying there is nothing that doesn't look like one of the terminal expressions : operator, prefixes, integer and float. Which can be shortened to one regular expression:
(I added spaces to make it more readable)
var calcPat =
/^ (\s*
( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;
If this test passes, query is lexically correct and needs to be grammar-checked to determine if it can be calculated. This is the tricky part
I am not going to paste code because it is not clean nor easily understandable, but I am going to explain the process I followed and why I'm stuck:
I created a method called isStatement(string) that's supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check if they really are statements and form one altogether.
Process is the following:
-If the first two tokens are a number followed by an operator:
-Then,
-- If the remaining is just one token and it is a number:
--- Then this is a statement.
--- Else, check if the remaining tokens form a statement (recursive call)
-Else, If the first token is a parenthesis
-Then, Find matching closing parenthesis and check if what's inside is a statement (recursion)
-- Also check if there is something after closing parenthesis and if it forms a statement when associated with the parenthesis structure.
What's the problem ?
My problem is that I cannot find the matching parenthesis when there are nested structures. How can I do that? Also, as you can see, this is not a particularly generic or clean grammar-checking algorithm. Do you have any ideas for improving this pattern?
Thank you so much for having taken the time to read everything.
Gael
(PS: As you probably noticed, I am not a native english speaker ! Sorry for mistakes and all !)

You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.
The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.
The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.
You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.
Now, because you've got Javascript regular expressions to work with, you could come up with a regular expression designed to match a string of tokens. The trick with that will be coming up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.
edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very complicated, and it'd be pretty fun to have around.
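One possible shape for such a tokenizer builder, as a sketch only. It assumes an engine with sticky regexes and named capture groups (ES2018), rule names that are valid group names, and rule patterns that contain no named groups of their own. buildTokenizer, calcTokenize and the rule names are hypothetical; the patterns follow the terminals in the question.

```javascript
// rules: array of [name, RegExp], tried in the order given.
function buildTokenizer(rules) {
  // One alternation of named groups, so the match says which rule fired.
  const source = rules
    .map(([name, re]) => `(?<${name}>${re.source})`)
    .join("|");
  const combined = new RegExp(source, "y"); // sticky: match at lastIndex only

  return function tokenize(input) {
    const tokens = [];
    let pos = 0;
    while (pos < input.length) {
      combined.lastIndex = pos;
      const m = combined.exec(input);
      if (!m || m[0].length === 0) {
        throw new SyntaxError(`Unexpected character at position ${pos}`);
      }
      // The rule whose named group took part in the match.
      const name = rules.find(([n]) => m.groups[n] !== undefined)[0];
      if (name !== "skip") tokens.push({ type: name, text: m[0] });
      pos = combined.lastIndex;
    }
    return tokens;
  };
}

// "float" is listed before "integer" so that 3.5 is not split at the dot.
const calcTokenize = buildTokenizer([
  ["skip", /\s+/],
  ["float", /[0-9]+[.,][0-9]+/],
  ["integer", /[0-9]+/],
  ["operator", /[+*\/%^-]/],
  ["lparen", /\(/],
  ["rparen", /\)/],
]);
```

calcTokenize("3.5 * (2 - 1)") yields seven tokens (float, operator, lparen, integer, operator, integer, rparen), which a separate parser for the language grammar can then consume; nesting of parentheses is that parser's job, not the tokenizer's.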
