I have a grammar in ANTLR4 around which I am writing an application. A snippet of the pertinent grammar is shown below:
grammar SomeGrammar;
// ... a bunch of other parse rules
operand
: id | literal ;
id
: ID ;
literal
: LITERAL ;
// A bunch of other lexer rules
LITERAL : NUMBER | BOOLEAN | STRING;
NUMBER : INTEGER | FLOAT ;
INTEGER : [0-9]+ ;
FLOAT : INTEGER '.' INTEGER | '.' INTEGER ;
BOOLEAN : 'TRUE' | 'FALSE' ;
ID : [A-Za-z]+[A-Za-z0-9_]* ;
STRING : '"' .*? '"' ;
I generate the antlr4 JavaScript Lexer and Parser like so:
$ antlr4 -o . -Dlanguage=JavaScript -listener -visitor
and then I overload the exitLiteral () prototype to check if an operand is a literal. The issue is that if I pass
a
it is forcibly parsed as a literal, and an error is thrown (shown below with grun):
$ grun YARL literal -gui -tree
a
line 1:0 mismatched input 'a' expecting LITERAL
(literal a)
I get the same error when I use the JavaScript parser, which I overloaded like so:
SomeGrammarLiteralPrinter.prototype.exitLiteral = function (ctx) {
debug ("Literal is " + ctx.getText ()); // Literal is a
};
I would like to catch the error so that I can decide that it is an ID, and not a LITERAL. How do I do that?
Any help is appreciated.
A better solution is to adjust the grammar so that it accurately describes the intended syntax to begin with:
startRule : ruleA ruleB EOF ;
ruleA : something operand anotherthing ;
ruleB : id assign literal ;
operand : ID | LITERAL ;
id : ID ;
literal : LITERAL ;
The parser performs a top-down graph evaluation of the parser rules, starting with the startRule. That is, the parser will evaluate the listed startRule elements in order, sequentially descending through the named sub-rules (and just those sub-rules). Consequently, ruleA will not encounter/consider the id and literal rules.
In this limited example then, there is no conflict in the seemingly overlapping definition of the operand, id, and literal rules.
Update
The OperandContext class will contain ID() and LITERAL() methods returning TerminalNode. The one that does not return null represents the symbol that was actually matched in that specific context. Look at the generated code.
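As a rough illustration of that check, a listener callback could test which accessor returns a non-null node. This is only a sketch: classifyOperand is a hypothetical helper name, and ctx is assumed to be the generated OperandContext for the adjusted rule operand : ID | LITERAL ; with the ID() and LITERAL() accessors described above.

```javascript
// Hypothetical helper: report which alternative of `operand` was matched.
// Assumes ctx exposes ID() and LITERAL(), each returning a terminal node
// or null, as described for the generated OperandContext.
function classifyOperand(ctx) {
    var idNode = ctx.ID();
    var literalNode = ctx.LITERAL();
    if (idNode != null) {
        return { kind: "ID", text: idNode.getText() };
    }
    if (literalNode != null) {
        return { kind: "LITERAL", text: literalNode.getText() };
    }
    // Neither accessor returned a node (e.g. after a syntax error).
    return { kind: "UNKNOWN", text: ctx.getText() };
}
```

A helper like this could be called from an exitOperand callback in the listener, instead of relying on exitLiteral to detect literals.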
I'm trying out the Node REPL, and parsing from a string fails like this:
$node
> var str="{'a':1,'b':2}"
undefined
> var js=JSON.parse(str)
SyntaxError: Unexpected token ' in JSON at position 1
But the reversed parse seems OK:
> var json = {a : ' 1 ',b : ' 2'};
undefined
> var str = JSON.stringify(json);
undefined
> str
'{"a":" 1 ","b":" 2"}'
Where did I go wrong?
You have a syntax error in your JSON:
{'a':1,'b':2}
 ^
 |
 '--- invalid syntax. Illegal character (')
JSON is not the same thing as JavaScript object literals. JSON is a file/data format which is compatible with object literal syntax but is stricter. The JSON format was specified by Douglas Crockford and documented at http://json.org/
Some of the differences between JSON and object literals:
Property names must be strings (written in double quotes)
Strings must start and end with double quotes (")
Hexadecimal numbers (e.g. 0x1234) are not supported
etc.
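The differences listed above can be checked directly in the REPL. The following is a small sketch; the variable names are only for illustration.

```javascript
// Valid JSON: property names and strings use double quotes.
var validText = '{"a":1,"b":2}';
var parsed = JSON.parse(validText);
console.log(parsed.a + parsed.b); // 3

// The string from the question uses single quotes, so JSON.parse throws.
var invalidText = "{'a':1,'b':2}";
var quoteError = null;
try {
    JSON.parse(invalidText);
} catch (e) {
    quoteError = e;
}
console.log(quoteError instanceof SyntaxError); // true

// Hexadecimal number literals are rejected as well.
var hexError = null;
try {
    JSON.parse('{"a":0x1234}');
} catch (e) {
    hexError = e;
}
console.log(hexError instanceof SyntaxError); // true
```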
In the following string:
(my name is zeeze :) and I am very happy ;))
I need to replace all the ) with __BR__ that are part of the pattern satisfied by regex:
[8|:|;|\*]{1}[-c^;\*]?\)
Reference: Regex playground
I cannot replace ending ) because it is not part of the pattern.
What could be a way to achieve this?
You may do it in the callback method:
var s = "(my name is zeeze :) and I am very happy ;))";
console.log(
s.replace(/[8:;*][-c^;*]?\)/g, function($0) {
return $0.replace(/\)/g, "__BR__")
})
)
// => (my name is zeeze :__BR__ and I am very happy ;__BR__)
Note that | inside a character class such as [8|:|;|\*] is treated as a literal pipe symbol, so it is most likely a mistake in the original pattern. {1} is redundant because an atom is matched exactly once by default. There is no need to escape * inside a character class; it is parsed as a literal asterisk there.
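To see the effect of the stray pipes, the two patterns can be compared on a string that starts with a | character. This is just a quick check; the variable names are made up for the example.

```javascript
// Pattern from the question: the | characters are part of the character class.
var originalPattern = /[8|:|;|\*]{1}[-c^;\*]?\)/;
// Simplified pattern used in the answer.
var simplifiedPattern = /[8:;*][-c^;*]?\)/;

// "|)" is accepted by the original pattern only, because | is a literal there.
console.log(originalPattern.test("|)"));   // true
console.log(simplifiedPattern.test("|)")); // false

// Both patterns accept an ordinary emoticon.
console.log(originalPattern.test(":-)"));   // true
console.log(simplifiedPattern.test(":-)")); // true
```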
Have been playing with Jison to try to create an "interpreter" for a very simple scripting syntax (this is just for a personal messing around project, no business case!)
It's been about 20 years since I had to create a compiler, and I think I'm just not grasping some of the concepts.
What I am thinking of doing is give a program of very simple statements, one per line, to Jison, and get a stream of Javascript statements back that then perform the actions.
I may be looking at this wrong - maybe I need to actually perform the actions during the parse? This doesn't sound right though.
Anyway, what I've got is (I'm trying this online btw http://zaach.github.io/jison/try/)
/* lexical grammar */
%lex
%options case-insensitive
%%
\s+ /* skip whitespace */
is\s+a\b return 'OCREATE'
is\s+some\b return 'ACREATE'
[_a-zA-Z]+[_a-zA-Z0-9]*\b return 'IDENTIFIER'
<<EOF>> return 'EOF'
/lex
/* operator associations and precedence */
%start input
%% /* language grammar */
input
: /* empty */
| program EOF
{ return $1; }
;
program
: expression
{ $$ = $1; }
| program expression
{ $$ = $1; }
;
expression
: IDENTIFIER OCREATE IDENTIFIER
{ $$ = 'controller.createOne(\'' + $1 + '\', \'' + $3 + '\');' }
| IDENTIFIER ACREATE IDENTIFIER
{ $$ = 'controller.createSeveral(\'' + $1 + '\', \'' + $3 + '\');' }
;
So, for the input:
basket is some apples
orange is a fruit
...I want:
controller.createSeveral('basket', 'apples');
controller.createOne('orange', 'fruit');
What I am getting is:
controller.createSeveral('basket', 'apples');
This kind of makes sense to me to get a single result, but I have no idea what to do to progress with building my output.
The problem is in your second production for program:
program
: expression
{ $$ = $1; }
| program expression
{ $$ = $1; }
What the second production is saying, basically, is "a program can be a (shorter) program followed by an expression, but its semantic value is the value of the shorter program."
You evidently want the value of the program to be augmented by the value of the expression, so you need to say that:
program
: expression
{ $$ = $1; }
| program expression
{ $$ = $1.concat($2); }
(or $$ = $1 + $2 if you prefer. And you might want a newline for readability.)
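The difference between the two actions can be illustrated outside of Jison with a plain JavaScript simulation. This is only a sketch of how the semantic values are folded together, not Jison's actual implementation; reduceProgram, originalAction and fixedAction are hypothetical names, and the input values are the strings the two expression rules would produce for the sample input.

```javascript
// Values the `expression` rule would yield for the two sample lines.
var expressionValues = [
    "controller.createSeveral('basket', 'apples');",
    "controller.createOne('orange', 'fruit');"
];

// program : program expression { $$ = $1; }   (original action)
function originalAction(program, expression) {
    return program;
}

// program : program expression { $$ = $1.concat($2); }, with a newline added
function fixedAction(program, expression) {
    return program.concat("\n", expression);
}

// Mimics the left-recursive reduction: start with the first expression
// ($$ = $1), then apply the second production once per further expression.
function reduceProgram(values, action) {
    var program = values[0];
    for (var i = 1; i < values.length; i++) {
        program = action(program, values[i]);
    }
    return program;
}

console.log(reduceProgram(expressionValues, originalAction));
console.log(reduceProgram(expressionValues, fixedAction));
```

With the original action only the first statement survives; with the fixed action both statements appear, one per line.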
I am writing a simple Jison grammar in order to get some experience before starting a more complex project. I tried a simple grammar which is a comma-separated list of numeric ranges, where a range whose beginning and ending values are the same can be written as a single number. However, when running the generated parser on some test input, I get an error which does not make a lot of sense to me. Here is the grammar I came up with:
/* description: Parses end executes mathematical expressions. */
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+ {return 'NUMBER'}
"-" {return '-'}
"," {return ','}
<<EOF>> {return 'EOF'}
. {return 'INVALID'}
/lex
/* operator associations and precedence */
%start ranges
%% /* language grammar */
ranges
: e EOF
{return $1;}
;
e : rng { $$ = $1;}
| e ',' e {alert('e,e');$$ = new Array(); $$.push($1); $$.push($3);}
;
rng
: NUMBER '-' NUMBER
{$$ = new Array(); var rng = {Start:$1, End: $3; }; $$.push(rng); }
| NUMBER
{$$ = new Array(); var rng = {Start:$1, End: $1; }; $$.push(rng);}
;
NUMBER: {$$ = Number(yytext);};
The Test input is this:
5-10,12-16
The output is:
Parse error on line 1:
5-10,12-16
^
Expecting '-', 'EOF', ',', got '8'
If I put an 'a' at the front, I get an expected error about finding "INVALID", but I don't have an "8" in the input string, so I am wondering if this is an internal state?
I am using the online parser generator at: http://zaach.github.io/jison/try/
thoughts?
This production is confusing Jison (and it confused me, too :) ):
NUMBER: {$$ = Number(yytext);};
NUMBER is supposed to be a terminal, but the above production declares it as a non-terminal with an empty body. Since it can match nothing, it immediately matches, and your grammar doesn't allow two consecutive NUMBERs. Hence the error.
Also, your grammar is ambiguous, although I suppose Jison's default will solve the issue. It would be better to be explicit, though, since it's easy. Your rule:
e : rng
| e ',' e
does not specify how , "associates": in other words, whether rng , rng , rng should be considered as e , rng or rng , e. The first one is probably better for you, so you should write it explicitly:
e : rng
| e ',' rng
One big advantage of the above is that you don't need to create a new array in the second production; you can just push $3 onto the end of $1 and set $$ to $1.
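The append-in-place idea can be sketched in plain JavaScript as a simulation of the semantic actions. The assumptions here go beyond the original grammar and are mine: rng is taken to yield a single {Start, End} object, e : rng wraps it in an array, and the function names (rangeAction, firstAction, appendAction) are hypothetical.

```javascript
// rng : NUMBER '-' NUMBER   (assumed to yield one object; the token text
// is converted with Number() since NUMBER is a plain terminal)
function rangeAction(startText, endText) {
    return { Start: Number(startText), End: Number(endText) };
}

// e : rng   (assumed action: $$ = [$1])
function firstAction(rng) {
    return [rng];
}

// e : e ',' rng   (action: $1.push($3); $$ = $1)
function appendAction(list, rng) {
    list.push(rng);
    return list;
}

// Simulated reductions for the test input 5-10,12-16.
var ranges = firstAction(rangeAction("5", "10"));
ranges = appendAction(ranges, rangeAction("12", "16"));
console.log(JSON.stringify(ranges));
// [{"Start":5,"End":10},{"Start":12,"End":16}]
```

Because the list is left-recursive, each reduction only appends one element to the array built so far, so no intermediate arrays are created.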
I am trying to write a function that builds a regular expression that can test whether a string starts with a string and contains another string.
function buildRegExp(startsWith,contains){
return new RegExp( ????? )
}
For example:
buildRegExp('abc','fg').test('abcdefg')
The above expression should evaluate to true, since the string 'abcdefg' starts with 'abc' and contains 'fg'.
The 'startsWith' and 'contains' strings may overlap each other, so the regular expression cannot simply search for the 'startsWith' string and then search for the 'contains' string.
The following should also evaluate to true:
buildRegExp('abc','bcd').test('abcdefg')
I cannot use simple string functions. It needs to be a regular expression because I am passing this regular expression to a MongoDB query.
A pattern like this would handle cases where the startsWith / contains substrings overlap in the matched string:
/(?=.*bcd)^abc/
i.e.
return new RegExp("(?=.*" + contains + ")^" + startsWith);
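Filling in the function from the question with this pattern might look like the sketch below. One addition that is not part of the answer above: if the inputs can contain regex metacharacters, they probably need escaping first, so a hypothetical escapeRegExp helper is included. Also note that . does not match line breaks by default, so this assumes single-line strings.

```javascript
// Hypothetical helper: escape characters that are special in a regex.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// The lookahead checks that `contains` occurs somewhere in the string,
// without consuming input, so it may overlap with `startsWith`.
function buildRegExp(startsWith, contains) {
    return new RegExp(
        "(?=.*" + escapeRegExp(contains) + ")^" + escapeRegExp(startsWith)
    );
}

console.log(buildRegExp("abc", "fg").test("abcdefg"));  // true
console.log(buildRegExp("abc", "bcd").test("abcdefg")); // true (overlap)
console.log(buildRegExp("abc", "xyz").test("abcdefg")); // false
console.log(buildRegExp("bcd", "abc").test("abcdefg")); // false (wrong start)
```

The resulting pattern is available as a string through the source property of the returned RegExp, should it need to be passed along in that form.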
Try this regexp
(^X).*Y
E.g. in JavaScript (note that this pattern does not match when the two substrings overlap):
/(^ab).*bc/.test('abc') => false
/(^ab).*bc/.test('abcbc') => true