Using Jison to create / translate simple script to another language - javascript

Have been playing with Jison to try to create an "interpreter" for a very simple scripting syntax (this is just for a personal messing around project, no business case!)
It's been about 20 years since I had to create a compiler, and I think I'm just not grasping some of the concepts.
What I am thinking of doing is to give a program of very simple statements, one per line, to Jison, and get back a stream of JavaScript statements that then perform the actions.
I may be looking at this wrong - maybe I need to actually perform the actions during the parse? This doesn't sound right though.
Anyway, what I've got is (I'm trying this online btw http://zaach.github.io/jison/try/)
/* lexical grammar */
%lex
%options case-insensitive
%%
\s+ /* skip whitespace */
is\s+a\b return 'OCREATE'
is\s+some\b return 'ACREATE'
[_a-zA-Z]+[_a-zA-Z0-9]*\b return 'IDENTIFIER'
<<EOF>> return 'EOF'
/lex
/* operator associations and precedence */
%start input
%% /* language grammar */
input
: /* empty */
| program EOF
{ return $1; }
;
program
: expression
{ $$ = $1; }
| program expression
{ $$ = $1; }
;
expression
: IDENTIFIER OCREATE IDENTIFIER
{ $$ = 'controller.createOne(\'' + $1 + '\', \'' + $3 + '\');' }
| IDENTIFIER ACREATE IDENTIFIER
{ $$ = 'controller.createSeveral(\'' + $1 + '\', \'' + $3 + '\');' }
;
So, for the input:
basket is some apples
orange is a fruit
...I want:
controller.createSeveral('basket', 'apples');
controller.createOne('orange', 'fruit');
What I am getting is:
controller.createSeveral('basket', 'apples');
This kind of makes sense to me to get a single result, but I have no idea what to do to progress with building my output.

The problem is in your second production for program:
program
: expression
{ $$ = $1; }
| program expression
{ $$ = $1; }
What the second production is saying, basically, is "a program can be a (shorter) program followed by an expression, but its semantic value is the value of the shorter program."
You evidently want the value of the program to be augmented by the value of the expression, so you need to say that:
program
: expression
{ $$ = $1; }
| program expression
{ $$ = $1.concat($2); }
(or $$ = $1 + $2 if you prefer. And you might want a newline for readability.)
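To see why the original action loses the second statement, the left-recursive rule can be imitated in plain JavaScript as a fold over the expression values. This is only a sketch, not Jison output: `expressionValues`, `first`, `original`, `fixed` and `reduceProgram` are hypothetical names standing in for the values and actions of the `program` rule.

```javascript
// Hypothetical stand-ins for the values the `expression` rule would produce.
const expressionValues = [
  "controller.createSeveral('basket', 'apples');",
  "controller.createOne('orange', 'fruit');",
];

// program : expression          { $$ = $1; }
const first = (e) => e;
// program : program expression  { $$ = $1; }              (original: $2 is dropped)
const original = (program, e) => program;
// program : program expression  { $$ = $1 + '\n' + $2; }  (fixed: $2 is appended)
const fixed = (program, e) => program + "\n" + e;

// A left-recursive rule reduces left to right, like Array.prototype.reduce.
function reduceProgram(values, action) {
  const [head, ...rest] = values;
  return rest.reduce(action, first(head));
}

console.log(reduceProgram(expressionValues, original)); // only the first statement
console.log(reduceProgram(expressionValues, fixed));    // both statements, one per line
```

With the original action every later expression is parsed but its value is thrown away, which matches the single line of output reported in the question.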


Why does decodeURI decode more characters than it should?

I was just reading about decodeURI (MDN, ES6 spec) and something caught my eye:
Escape sequences that could not have been introduced by encodeURI are not replaced.
So, it should only decode characters that encodeURI encodes.
// None of these should be escaped by `encodeURI`.
const unescaped = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.!~*'();/?:@&=+$,#";
const data = [...unescaped].map(char => ({
"char": char,
"encodeURI(char)": encodeURI(char),
"encodePercent(char)": encodePercent(char),
"decodeURI(encodePercent(char))": decodeURI(encodePercent(char))
}));
console.table( data );
console.log( "Check the browser's console." );
function encodePercent(string) {
return string.replace(/./g, char => "%" + char.charCodeAt(0).toString(16));
}
Why is this only true for ; / ? : @ & = + $ , #?
The specification states the following step:
Let unescapedURISet be a String containing one instance of each code unit valid in uriReserved and uriUnescaped plus "#"
Let’s take a look at uriReserved, and voilà:
uriReserved ::: one of
; / ? : @ & = + $ ,
The following step is then:
Return Encode(uriString, unescapedURISet).
Encode encodes everything in the string except for the characters in unescapedURISet, which include ; / ? : @ & = + $ ,.
This means that encodeURI can never introduce escape sequences for anything in uriReserved and uriUnescaped.
Interestingly enough, decodeURI is defined like this:
Let reservedURISet be a String containing one instance of each code unit valid in uriReserved plus "#".
Return Decode(uriString, reservedURISet).
Decode works similarly to Encode and decodes everything except for the characters in reservedURISet. So only the characters of uriReserved (plus "#") are excluded from being decoded. And those happen to be ; / ? : @ & = + $ ,!
The question remains why the standard specifies this. If they had included uriUnescaped in reservedURISet, the behaviour would be exactly what the introduction states. Probably a mistake?
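The asymmetry can be checked directly with a few hand-written escape sequences. The snippet below is a small sketch; `reservedEscapes`, `reservedKept` and `unescapedDecoded` are names chosen here for illustration.

```javascript
// Escapes of the uriReserved characters ; / ? : @ & = + $ , and of "#".
const reservedEscapes = ["%3B", "%2F", "%3F", "%3A", "%40", "%26", "%3D", "%2B", "%24", "%2C", "%23"];

// decodeURI leaves every one of these untouched.
const reservedKept = reservedEscapes.every((seq) => decodeURI(seq) === seq);

// Escapes of uriUnescaped characters (A - _ . ! ~ * ' ( )) are decoded,
// although encodeURI would never have produced them.
const unescapedDecoded = decodeURI("%41%2D%5F%2E%21%7E%2A%27%28%29");

console.log(reservedKept);
console.log(unescapedDecoded);
console.log(encodeURI(unescapedDecoded) === unescapedDecoded);
```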

JavaScript Regex to match balanced constructs without caring about imbalanced constructs

I'm working on a JavaScript-based project that involves a rudimentary Bash-inspired scripting system, and I'm using a regex to separate lines into (a number of types of) tokens.
One such token class is of course the recursive $() construct. This construct can be nested arbitrarily. I am trying to devise a JavaScript regular expression to match this type of token without accidentally leaving parts behind or grabbing parts of other tokens.
The problem, more specifically:
Given a string such as this example:
"$(foo $(bar) fizz $(buzz)) $(something $(else))"
which is made up of individual tokens, each delimited by an outer $() and followed either by whitespace or end-of-string,
match the first such token in the string from and including its open $( to and including its final closing )
In the event that an unbalanced construct occurs anywhere in the string, the behavior of this regex is considered undefined and does not matter.
So in the above example the regex should match "$(foo $(bar) fizz $(buzz))"
Further usage details:
Both the input string and the returned match are passed through String.prototype.trim(), so leading and trailing whitespace doesn't matter
I am able to deem unbalanced constructs an undefined case because the code that consumes this type of token once extracted does its own balance checking. Even if the regex returns a match that is surrounded by an outer $(), the error will eventually be caught elsewhere.
What I've tried so far
For a while I've been using this regex:
/\$\(.*?\)(?!(?:(?!\$\().)*\))(?:\s+|$)/
which seemed to work for quite some time. It matches arbitrarily nested balanced constructs so long as they do not have multiple nests at the same level. That case just slipped past my mind somehow when testing initially. It consumes the contents of the token with lazy repetition, and asserts that after the closing paren there is not another close paren before there is an open $(. This is of course broken by tokens such as the example above.
I know that the traditional "balanced constructs regex problem" is not solvable without subroutines/recursion. However I'm hoping that since I only need to match balanced constructs and not fail to match imbalanced ones, that there's some clever way to cheat here.
So in the above example the regex should match $(foo $(bar) fizz $(buzz))
The solution as I see it is almost the same as I posted today (based on Matching Nested Constructs in JavaScript, Part 2 by Steven Levithan), but all you need is to add the delimiters since they are known.
Example usage:
matchRecursiveRegExp("$(foo $(bar) fizz $(buzz)) $(something $(else))", "\\$\\(", "\\)");
Code:
function matchRecursiveRegExp (str, left, right, flags) {
var f = flags || "",
g = f.indexOf("g") > -1,
x = new RegExp(left + "|" + right, "g" + f),
l = new RegExp(left, f.replace(/g/g, "")),
a = [],
t, s, m;
do {
t = 0;
while (m = x.exec(str)) {
if (l.test(m[0])) {
if (!t++) s = x.lastIndex;
} else if (t) {
if (!--t) {
a.push(str.slice(s, m.index));
if (!g) return a;
}
}
}
} while (t && (x.lastIndex = s));
return a;
}
document.write("$(" + matchRecursiveRegExp("$(foo $(bar) fizz $(buzz)) $(something $(else))", "\\$\\(", "\\)") + ")");
Note this solution does not support global matching.
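If a regex is not strictly required, the same first-token extraction can be sketched with a plain depth counter. This is an alternative to the function above, not a rewrite of it; `firstBalancedToken` is a hypothetical name, and the sketch assumes (as the delimiters passed above do) that only "$(" opens a level while any ")" closes one.

```javascript
// Return the first balanced $( ... ) token, including its delimiters,
// or null if there is none or the string is unbalanced.
function firstBalancedToken(str) {
  const start = str.indexOf("$(");
  if (start === -1) return null;
  let depth = 0;
  for (let i = start; i < str.length; i++) {
    if (str.startsWith("$(", i)) {
      depth++;
      i++; // skip the "(" that belongs to this "$("
    } else if (str[i] === ")") {
      depth--;
      if (depth === 0) return str.slice(start, i + 1);
    }
  }
  return null; // ran out of input before the outer token closed
}

console.log(firstBalancedToken("$(foo $(bar) fizz $(buzz)) $(something $(else))"));
```

Unlike the function above, this returns the token with its outer $( and ) still attached, which is the form the question asks for.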

Detecting _vars_with_underscores_; why does this not work?

I am trying to write a PEGjs rule to convert
Return _a_b_c_.
to
Return <>a_b_c</>.
My grammar is
root = atoms:atom+
{ return atoms.join(''); }
atom = variable
/ normalText
variable = "_" first:variableSegment rest:$("_" variableSegment)* "_"
{ return '<>' + first + rest + '</>'; }
variableSegment = $[^\n_ ]+
normalText = $[^\n]
This works for
Return _a_b_c_ .
and
Return _a_b_c_
but something is going wrong with the
Return _a_b_c_.
example.
I can't quite understand why this is breaking, and would love an explanation of why it's behaving as it does. (I don't even need a solution to the problem, necessarily; the biggest issue is that my mental model of PEGjs grammars is deficient.)
Rearranging the grammar slightly makes it work:
root = atoms:atom+
{ return atoms.join(''); }
atom = variable
/ normalText
variable = "_" first:$(variableSegment "_") rest:$(variableSegment "_")*
{ return '<>' + first + rest + '</>'; }
variableSegment = seg:$[^\n_ ]+
normalText = normal:$[^\n]
I'm not sure I understand why, exactly. In this one, the parser gets to the "." and matches it as a "variableSegment", but then backtracks just one step in the greedy "*" lookahead, decides it's got a "variable", and then re-parses the "." as "normal". (Note that this picks up the trailing _, which if not desired can be snipped off by a hack in action, or something like that; see below.)
In the original version, after failing because of the missing trailing underscore, the very next step the parser takes is back to the leading underscore, opting for the "normal" interpretation.
I added some action code with console.log() calls to trace the parser behavior.
edit — I think the deal is this. In your original version, the parse is failing on a rule that's of the form
expr1 expr2 expr3 ... exprN
The first sub-expression is the literal _. The next is for the first variable segment. The third is for the sequence of variable expressions preceded by _, and the last is the trailing _. While working through that rule on the problematic input, the last expression fails. The others have all succeeded, however, so the only place to start over is at the alternative point in the "atom" rule.
In the revised version, the parser can unwind the operation of the greedy * by one step. It then has a successful match of the third expression, so the rule succeeds.
Thus another revision, closer to the original, will also work:
root = atoms:atom+
{ return atoms.join(''); }
atom = variable
/ normalText
variable = "_" first:variableSegment rest:$("_" variableSegment & "_")* "_"
{ return '<>' + first + rest + '</>'; }
variableSegment = $[^\n_ ]+
normalText = $[^\n]
Now that greedy * group will backtrack when it fails in peeking forward at an _.
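The difference can be reproduced without PEG.js using a few hand-rolled PEG-style combinators. This is a simplified model, not PEG.js itself: `lit`, `noneOf`, `seq`, `star`, `plus` and `and` are hypothetical helpers, and each parser just returns the new position or -1 on failure.

```javascript
// Each parser takes (input, pos) and returns the new position, or -1 on failure.
const lit = (s) => (input, pos) => (input.startsWith(s, pos) ? pos + s.length : -1);
const noneOf = (chars) => (input, pos) =>
  pos < input.length && !chars.includes(input[pos]) ? pos + 1 : -1;
const seq = (...ps) => (input, pos) => {
  for (const p of ps) {
    pos = p(input, pos);
    if (pos === -1) return -1;
  }
  return pos;
};
// Greedy repetition: takes as many iterations as it can and never gives one back.
const star = (p) => (input, pos) => {
  for (;;) {
    const next = p(input, pos);
    if (next === -1) return pos;
    pos = next;
  }
};
const plus = (p) => seq(p, star(p));
// Positive lookahead (the PEG "&" operator): succeeds without consuming input.
const and = (p) => (input, pos) => (p(input, pos) === -1 ? -1 : pos);

const segment = plus(noneOf("\n_ "));
// variable = "_" segment ("_" segment)* "_"
const originalVariable = seq(lit("_"), segment, star(seq(lit("_"), segment)), lit("_"));
// variable = "_" segment ("_" segment & "_")* "_"
const lookaheadVariable = seq(
  lit("_"), segment, star(seq(lit("_"), segment, and(lit("_")))), lit("_")
);

console.log(originalVariable("_a_b_c_.", 0));  // fails: the star swallowed "_."
console.log(lookaheadVariable("_a_b_c_.", 0)); // succeeds, consuming "_a_b_c_"
console.log(originalVariable("_a_b_c_ .", 0)); // succeeds: the space stops the star
```

In the original rule the repetition consumes "_." as one more segment, so the final "_" has nothing left to match and the whole sequence fails. The lookahead makes that last iteration fail instead, so the repetition stops one step earlier.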
The parser interprets the last _. as a variableSegment. If you exclude the dot from the variableSegment RegExp, your code will work as expected.

Issue with a Jison Grammar, Strange error from generated parser

I am writing a simple Jison grammar in order to get some experience before starting a more complex project. I tried a simple grammar for a comma-separated list of numeric ranges, where a range whose beginning and ending values are the same can be written as a single number. However, when running the generated parser on some test input I get an error which does not make a lot of sense to me. Here is the grammar I came up with:
/* description: Parses end executes mathematical expressions. */
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+ {return 'NUMBER'}
"-" {return '-'}
"," {return ','}
<<EOF>> {return 'EOF'}
. {return 'INVALID'}
/lex
/* operator associations and precedence */
%start ranges
%% /* language grammar */
ranges
: e EOF
{return $1;}
;
e : rng { $$ = $1;}
| e ',' e {alert('e,e');$$ = new Array(); $$.push($1); $$.push($3);}
;
rng
: NUMBER '-' NUMBER
{$$ = new Array(); var rng = {Start:$1, End: $3; }; $$.push(rng); }
| NUMBER
{$$ = new Array(); var rng = {Start:$1, End: $1; }; $$.push(rng);}
;
NUMBER: {$$ = Number(yytext);};
The Test input is this:
5-10,12-16
The output is:
Parse error on line 1:
5-10,12-16
^
Expecting '-', 'EOF', ',', got '8'
If I put an 'a' at the front I get an expected error about finding "INVALID", but I don't have an "8" in the input string, so I am wondering if this is an internal state?
I am using the online parser generator at: http://zaach.github.io/jison/try/
thoughts?
This production is confusing Jison (and it confused me, too :) ):
NUMBER: {$$ = Number(yytext);};
NUMBER is supposed to be a terminal, but the above production declares it as a non-terminal with an empty body. Since it can match nothing, it immediately matches, and your grammar doesn't allow two consecutive NUMBERs. Hence the error.
Also, your grammar is ambiguous, although I suppose Jison's default will solve the issue. It would be better to be explicit, though, since it's easy. Your rule:
e : rng
| e ',' e
does not specify how , "associates": in other words, whether rng , rng , rng should be considered as e , rng or rng , e. The first one is probably better for you, so you should write it explicitly:
e : rng
| e ',' rng
One big advantage of the above is that you don't need to create a new array in the second production; you can just push $3 onto the end of $1 and set $$ to $1.
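What those two productions would build can be imitated in plain JavaScript. This is a sketch of the semantic actions only, not a Jison grammar: `rng`, `eFirst`, `eAppend` and `buildRanges` are hypothetical names, the splitting on "," and "-" stands in for the real parser, and it assumes rng yields a single object (not a one-element array) with the Number() conversion done in its action.

```javascript
// rng : NUMBER '-' NUMBER  { $$ = {Start: Number($1), End: Number($3)}; }
//     | NUMBER             { $$ = {Start: Number($1), End: Number($1)}; }
const rng = (start, end = start) => ({ Start: Number(start), End: Number(end) });

// e : rng          { $$ = [$1]; }
const eFirst = (r) => [r];
// e : e ',' rng    { $1.push($3); $$ = $1; }
const eAppend = (list, r) => {
  list.push(r);
  return list;
};

// Stand-in for the parser: apply the actions left to right over the input.
function buildRanges(text) {
  const items = text.split(",").map((part) => rng(...part.trim().split("-")));
  const [head, ...rest] = items;
  return rest.reduce(eAppend, eFirst(head));
}

console.log(buildRanges("5-10,12-16,20"));
```

Because the rule is left-recursive, the same array is carried along and extended at each comma, so no nested arrays are created.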

Javascript RegEx non-capturing prefix

I am trying to do some string replacement with RegEx in JavaScript. The scenario is a single-line string containing a long comma-delimited list of numbers, in which duplicates are possible.
An example string is: 272,2725,2726,272,2727,297,272 (The string may or may not end in a comma)
In this example, I am trying to match each occurrence of the whole number 272. (3 matches expected)
The example regex I'm trying to use is: (?:^|,)272(?=$|,)
The problem I am having is that the second and third matches are including the leading comma, which I do not want. I am confused because I thought (?:^|,) would match, but not capture. Can someone shed light on this for me? An interesting bit is that the trailing comma is excluded from the result, which is what I want.
For what it is worth, if I were using C# there is syntax for prefix matching that does what I want: (?<=^|,)
However, it appears to be unsupported in JavaScript.
Lastly, I know I could workaround it using string splitting, array manipulation and rejoining, but I want to learn.
Use word boundaries instead:
\b272\b
ensures that only 272 matches, but not 2725.
(?:...) matches and doesn't capture - but whatever it matches will be part of the overall match.
A lookaround assertion like (?=...) is different: It only checks if it is possible (or impossible) to match the enclosed regex at the current point, but it doesn't add to the overall match.
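The difference shows up when the patterns are run against the example string. The snippet below is a small sketch; the variable names are chosen here for illustration, and the last pattern assumes an engine that implements ES2018 lookbehind.

```javascript
const list = "272,2725,2726,272,2727,297,272";

// Non-capturing group: the comma is not captured, but it is still part of the match.
const withGroup = list.match(/(?:^|,)272(?=$|,)/g);

// Word boundaries are zero-width, so nothing besides 272 is consumed.
const withBoundary = list.match(/\b272\b/g);

// For replacement, the prefix can be captured and put back.
const replaced = list.replace(/(^|,)272(?=$|,)/g, "$1X");

// Assumption: the engine supports ES2018 lookbehind.
const withLookbehind = list.match(/(?<=^|,)272(?=$|,)/g);

console.log(withGroup);      // [ '272', ',272', ',272' ]
console.log(withBoundary);   // [ '272', '272', '272' ]
console.log(replaced);       // X,2725,2726,X,2727,297,X
console.log(withLookbehind); // [ '272', '272', '272' ]
```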
Here is a way to create a JavaScript lookbehind that has worked in all cases I needed. This is an example; one can do many more complex and flexible things. The main point here is that in some cases it is possible to create a RegExp non-capturing prefix (lookbehind) construct in JavaScript.
This example is designed to extract all fields that are surrounded by braces '{...}'. The braces are not returned with the field. This is just an example to show the idea at work, not necessarily a prelude to an application.
function testGetSingleRepeatedCharacterInBraces()
{
var leadingHtmlSpaces = ' ' ;
// The '(?:\b|\B(?={))' acts as a prefix non-capturing group.
// That is, this works (?:\b|\B(?=WhateverYouLike))
var regex = /(?:\b|\B(?={))(([0-9a-zA-Z_])\2{4})(?=})/g ;
var string = '' ;
string = 'Message has no fields' ;
document.write( 'String => "' + string
+ '"<br>' + leadingHtmlSpaces + 'fields => '
+ getMatchingFields( string, regex )
+ '<br>' ) ;
string = '{LLLLL}Message {11111}{22222} {ffffff}abc def{EEEEE} {_____} {4444} {666666} {55555}' ;
document.write( 'String => "' + string
+ '"<br>' + leadingHtmlSpaces + 'fields => '
+ getMatchingFields( string, regex )
+ '<br>' ) ;
} ;
function getMatchingFields( stringToSearch, regex )
{
var matches = stringToSearch.match( regex ) ;
return matches ? matches : [] ;
} ;
Output:
String => "Message has no fields"
fields =>
String => "{LLLLL}Message {11111}{22222} {ffffff}abc def{EEEEE} {_____} {4444} {666666} {55555}"
fields => LLLLL,11111,22222,EEEEE,_____,55555
