Antlr4 lexer rule should only match if at the beginning of line - javascript

I am using Antlr4 with JavaScript and C#. I have a rule which should match only if it is at the start of a line. If an expression starts with REM, it should be recognized as a comment and therefore be hidden. Otherwise it is not a comment.
REM 1+1 is a comment
1+1 REM 2 is not a comment
The code below in my Lexer is the best I could do so far. The problem is that it only works if I have a newline before it, which is actually not that good.
START_COMMENT : ('\r\n' | '\n' | '\f') ([R][E][M] | [;] | [#][ ]) ~[\r\n]* -> channel(HIDDEN);
I am interested to know if there is some kind of trick to tell the Lexer that a rule should only match if it is at the beginning of the line and nowhere else ?

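You could use a predicate that tests whether the lexer's current column is 0 (indicating the start of a line). Note that the character index (getCharIndex()) is the offset into the whole input, so comparing it with 0 would only match at the very start of the input. The disadvantage of a predicate is that you add target specific code to your grammar. For JavaScript, this should work (assuming the column property that the JavaScript runtime's Lexer exposes):
REMARK
: {this.column === 0}? ( R E M | ';' | '# ' ) ~[\r\n]*
;
fragment R : [rR];
fragment E : [eE];
fragment M : [mM];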

This is probably a hacky way to do it, but when I had this problem, I simply told the grammar to look for a newline before the token. Then, in my code, I checked whether the input string starts with it and, if so, prepended a newline. So I basically did this:
comment: NEWLINE '//';
NEWLINE: [\r\n] -> skip;
then in my code:
if (content.startsWith("//")) {
content = "\n" + content;
}
Of course this has the disadvantage of any errors being one line off, but it may work in your case.

Remove all ANSI colors/styles from strings

I use a library that adds ANSI colors / styles to strings. For example:
> "Hello World".rgb(255, 255, 255)
'\u001b[38;5;231mHello World\u001b[0m'
> "Hello World".rgb(255, 255, 255).bold()
'\u001b[1m\u001b[38;5;231mHello World\u001b[0m\u001b[22m'
When I do:
console.log('\u001b[1m\u001b[38;5;231mHello World\u001b[0m\u001b[22m')
a "Hello World" white and bold message will be output.
Having a string like '\u001b[1m\u001b[38;5;231mHello World\u001b[0m\u001b[22m' how can these elements be removed?
foo('\u001b[1m\u001b[38;5;231mHello World\u001b[0m\u001b[22m') //=> "Hello World"
Maybe a good regular expression? Or is there any built-in feature?
The workaround I was thinking of was to create a child process:
require("child_process")
.exec("node -pe \"console.error('\u001b[1m\u001b[38;5;231mHello World\u001b[0m\u001b[22m')\""
, function (err, stderr, stdout) { console.log(stdout);
});
But the output is the same...
The regex you should be using is
/[\u001b\u009b][[()#;?]*(?:[0-9]{1,4}(?:;[0-9]{0,4})*)?[0-9A-ORZcf-nqry=><]/g
This matches most of the ANSI escape codes, beyond just colors, including the extended VT100 codes, archaic/proprietary printer codes, etc.
Note that the \u001b in the above regex may not work for your particular library (even though it should); check out my answer to a similar question regarding acceptable escape characters if it doesn't.
If you don't like regexes, you can always use the strip-ansi package.
For instance, the string jumpUpAndRed below contains ANSI codes for jumping to the previous line, writing some red text, and then going back to the beginning of the next line, some of which require suffixes other than m.
var jumpUpAndRed = "\x1b[F\x1b[31;1mHello, there!\x1b[m\x1b[E";
var justText = jumpUpAndRed.replace(
/[\u001b\u009b][[()#;?]*(?:[0-9]{1,4}(?:;[0-9]{0,4})*)?[0-9A-ORZcf-nqry=><]/g, '');
console.log(justText);
The escape character is \u001b, and the sequence from [ until first m is encountered is the styling. You just need to remove that. So, replace globally using the following pattern:
/\u001b\[.*?m/g
Thus,
'\u001b[1m\u001b[38;5;231mHello World\u001b[0m\u001b[22m'.replace(/\u001b\[.*?m/g, '')
The colors are in the ESC[39m format, so the shortest regexp for them is /\u001b[^m]*?m/g
where \u001b is the ESC character,
[^m]*? is any character(s) up to m (non-greedy pattern),
m is the terminator itself, and /g is for a global (all) replace.
Example:
var line="\x1B[90m2021-02-03 09:35:50.323\x1B[39m\t\x1B[97mFinding: \x1B[39m\x1B[97m»\x1B[39m\x1B[33m42125121242\x1B[39m\x1B[97m«\x1B[39m\x1B[0m\x1B[0m\t\x1B[92mOK\x1B[39m";
console.log(line.replace(/\u001b[^m]*?m/g,""));
// -> 2021-02-03 09:35:50.323 Finding: »42125121242« OK ( without colors )
console.log(line);
// -> 2021-02-03 09:35:50.323 Finding: »42125121242« OK ( colored )
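To compare the patterns above, here is a minimal sketch. The helper names stripAnsi and stripColorsOnly and the sample strings are made up for illustration; the regular expressions are the ones quoted in the answers. It suggests that the m-terminated pattern is fine for color codes but can swallow text when a sequence ends in another letter, such as the erase-line code ESC[2K.

```javascript
// Broad pattern from the first answer: also handles sequences that end in letters other than "m".
var ANSI_PATTERN = /[\u001b\u009b][[()#;?]*(?:[0-9]{1,4}(?:;[0-9]{0,4})*)?[0-9A-ORZcf-nqry=><]/g;
// Narrow pattern: assumes every sequence ends in "m" (colors and styles only).
var COLOR_ONLY_PATTERN = /\u001b\[.*?m/g;

// Hypothetical helper names, not part of any library.
function stripAnsi(text) {
  return text.replace(ANSI_PATTERN, '');
}

function stripColorsOnly(text) {
  return text.replace(COLOR_ONLY_PATTERN, '');
}

var styled = '\u001b[1m\u001b[38;5;231mHello World\u001b[0m\u001b[22m';
var erased = '\u001b[2Kmessage'; // ESC[2K (erase line) followed by plain text

console.log(JSON.stringify(stripAnsi(styled)));       // "Hello World"
console.log(JSON.stringify(stripColorsOnly(styled))); // "Hello World"
console.log(JSON.stringify(stripAnsi(erased)));       // "message"
console.log(JSON.stringify(stripColorsOnly(erased))); // "essage" (the first "m" of the text is eaten)
```

If the input only ever contains color and style codes, either pattern gives the same result; the broad one is the safer default when the source of the string is not under your control.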

Find and alter text in a string based on ln # and character #

How can I prepend characters to a word located at a particular Ln # and character # in text?
Example:
The use case is when a person enters code into a textarea (like jsFiddle), I find and replace some of their variables. I know the line # and character location of the start and end of these variables.
Example Text:
var usersCode = $('textarea').val();
console logging usersCode:
print "My first name is: " + first_name
print "This is awesome."
print "My last name is: " + last_name
How could I find the word starting at Ln 0, Char 29 and ending at Ln 0, Char 39 (first_name) and turn it into MyObj.first_name.value?
print "My first name is: " + MyObj.first_name.value
print "This is awesome."
print "My last name is: " + last_name
Maybe I can use a regex that translates "line number" into counting the number of \n occurrences? And then moving the pointer in by the number of characters?
I have to use Ln # and Char # for many details that I won't go into here. I am aware of many simpler alternatives if I wasn't constrained to using Ln # and Ch #.
You can save the lines of the textarea into an array:
var lines = $('#textarea').val().split(/\n/);
And from there you take the substring of a particular line and assign it to your object:
MyObj.first_name.value = lines[0].substring(29,39)
Hope that helps!
If you're just trying to replace first_name and last_name the simplest solution would be to use replace(), for example:
// replace() returns a new string, so assign the result back
usersCode = usersCode.replace("first_name", MyObj.first_name.value);
usersCode = usersCode.replace("last_name", MyObj.last_name.value);
If you're insistent on doing the line number and character number specific thing, we can arrange that, but it's not at all necessary for what it seems like you're asking for. Perhaps I'm mistaken, though.
Update:
Okay so you want to replace all instances? Finding line numbers and all that is still unnecessary, just use this:
// replace() returns a new string, so assign the result back
usersCode = usersCode.replace(/first_name/g, MyObj.first_name.value);
usersCode = usersCode.replace(/last_name/g, MyObj.last_name.value);
g is a RegEx flag for global searching.
Update 2:
usersCode.split("\n")[lineNumber - 1].replace(usersCode.split("\n")[lineNumber - 1].substring(29, 39), MyObj.first_name.value);
You can replace 29 and 39 with variables startChar and endChar respectively. lineNumber will also need to be provided by you.
RegEx can easily search based on characters but not based on position. You can still do it using RegEx, but the solution will become more and more complex. For this case you don't need a pure RegEx answer. The code below is what you need.
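var k = $('#regex_string').val();
function findAndReplaceWord(lineNo, startPos, endPos, replaceWith) {
  var lines = k.split(/\n/);
  var line = lines[lineNo];
  if (line) {
    // Rebuild the line by position, so that only the text at the given
    // coordinates changes (not an earlier occurrence of the same word).
    lines[lineNo] = line.substring(0, startPos) + replaceWith + line.substring(endPos);
    return lines.join('\n');
  }
}
findAndReplaceWord(0, 29, 39, MyObj.first_name.value)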
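Since the question mentions replacing several variables, the position-based idea can also be sketched as a standalone function that takes a list of edits. The names replaceRange and applyEdits and the shape of the edit objects are assumptions made for this illustration, and the coordinates for last_name were counted from the sample text. Edits are applied from the last position to the first, so replacing one range never shifts the coordinates of a range that still has to be processed.

```javascript
// Replace the characters [startCh, endCh) on line lineNo (all zero-based).
function replaceRange(text, lineNo, startCh, endCh, replacement) {
  var lines = text.split('\n');
  if (lineNo < 0 || lineNo >= lines.length) {
    return text; // unknown line: leave the text unchanged
  }
  var line = lines[lineNo];
  lines[lineNo] = line.slice(0, startCh) + replacement + line.slice(endCh);
  return lines.join('\n');
}

// Apply several edits, sorted from the last position to the first.
function applyEdits(text, edits) {
  return edits
    .slice() // do not reorder the caller's array
    .sort(function (a, b) { return (b.line - a.line) || (b.start - a.start); })
    .reduce(function (acc, e) {
      return replaceRange(acc, e.line, e.start, e.end, e.replacement);
    }, text);
}

var usersCode = [
  'print "My first name is: " + first_name',
  'print "This is awesome."',
  'print "My last name is: " + last_name'
].join('\n');

var result = applyEdits(usersCode, [
  { line: 0, start: 29, end: 39, replacement: 'MyObj.first_name.value' },
  { line: 2, start: 28, end: 37, replacement: 'MyObj.last_name.value' }
]);

console.log(result);
```

Here the replacement is the literal text MyObj.first_name.value, which is what the question's expected output shows; pass a value instead if that is what you need.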

JavaScript replace() method not deleting correct lines

I have an SDP with multiple lines. I want to replace one line with " " or remove it. I tried:
obj.sdp = obj.sdp.replace(/a=line5:[\w\W]*\n|\r/gi, "" );
to delete line 5, but it deletes line 5 and also the other lines that come after line 5. I used \n|\r to delete only up to the end of the line. Also, when I use
sdp = sdp.replace(/a=line5:0.*$/mg, "");
NetBeans gives me "Insecure '.' error".
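The OR | has the lowest precedence in your RegExp, so the \r alternative stands on its own without a=line5:. Used with the global flag, that \r matches every \r in your String. In addition, [\w\W]* is greedy and also matches line breaks, so it runs on to the last \n. You probably want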
/(a=line5:[^\r\n]*)(?:\r|\n)+/gi
"$1"
I fixed it with:
str.replace(/(a=line5:[\w\W]*?(:\r|\n))/, "" );
Thanks!
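As a further sketch of the same idea, the pattern below anchors at the start of a line and stops at the first line break, so only the one line is removed. The function name removeLine5 and the sample SDP text are made up for illustration; a=line5: is the placeholder used in the question.

```javascript
// ^ with the m flag anchors at the start of any line; [^\r\n]* stays on that line;
// the optional group removes the line terminator (CRLF, CR or LF) as well.
var LINE5_PATTERN = /^a=line5:[^\r\n]*(?:\r\n|\r|\n)?/m;

// Hypothetical helper: removes the first "a=line5:" line from the text.
function removeLine5(sdp) {
  return sdp.replace(LINE5_PATTERN, '');
}

var sdp = 'v=0\r\na=line4:keep\r\na=line5:0 drop\r\na=line6:keep\r\n';

console.log(JSON.stringify(removeLine5(sdp)));
// "v=0\r\na=line4:keep\r\na=line6:keep\r\n"
```

Add the g flag if every matching line should be removed rather than only the first one.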

Issue with a Jison Grammar, Strange error from generated parser

I am writing a simple Jison grammar in order to get some experience before starting a more complex project. I tried a simple grammar for a comma-separated list of numeric ranges, where a range whose beginning and ending values are the same can be written as a single number. However, when running the generated parser on some test input I get an error which does not make a lot of sense to me. Here is the grammar I came up with:
/* description: Parses end executes mathematical expressions. */
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+ {return 'NUMBER'}
"-" {return '-'}
"," {return ','}
<<EOF>> {return 'EOF'}
. {return 'INVALID'}
/lex
/* operator associations and precedence */
%start ranges
%% /* language grammar */
ranges
: e EOF
{return $1;}
;
e : rng { $$ = $1;}
| e ',' e {alert('e,e');$$ = new Array(); $$.push($1); $$.push($3);}
;
rng
: NUMBER '-' NUMBER
{$$ = new Array(); var rng = {Start:$1, End: $3; }; $$.push(rng); }
| NUMBER
{$$ = new Array(); var rng = {Start:$1, End: $1; }; $$.push(rng);}
;
NUMBER: {$$ = Number(yytext);};
The Test input is this:
5-10,12-16
The output is:
Parse error on line 1:
5-10,12-16
^
Expecting '-', 'EOF', ',', got '8'
If I put an 'a' at the front I get an expected error about finding "INVALID", but I don't have an "8" in the input string, so I am wondering if this is an internal state?
I am using the online parser generator at: http://zaach.github.io/jison/try/
Thoughts?
This production is confusing Jison (and it confused me, too :) ):
NUMBER: {$$ = Number(yytext);};
NUMBER is supposed to be a terminal, but the above production declares it as a non-terminal with an empty body. Since it can match nothing, it immediately matches, and your grammar doesn't allow two consecutive NUMBERs. Hence the error.
Also, your grammar is ambiguous, although I suppose Jison's default will solve the issue. It would be better to be explicit, though, since it's easy. Your rule:
e : rng
| e ',' e
does not specify how , "associates": in other words, whether rng , rng , rng should be considered as e , rng or rng , e. The first one is probably better for you, so you should write it explicitly:
e : rng
| e ',' rng
One big advantage of the above is that you don't need to create a new array in the second production; you can just push $3 onto the end of $1 and set $$ to $1.
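To make the effect of the left-recursive rule concrete, here is a hand-written JavaScript stand-in for the grammar's actions. It is not Jison output, the function names parseRange and parseRanges are invented for this sketch, and it assumes each rng yields a plain {Start, End} object rather than a one-element array. Each step appends one range to the list built so far, which is what pushing $3 onto $1 amounts to.

```javascript
// rng : NUMBER '-' NUMBER | NUMBER
function parseRange(text) {
  var m = /^\s*([0-9]+)\s*(?:-\s*([0-9]+)\s*)?$/.exec(text);
  if (!m) {
    throw new Error('Invalid range: ' + JSON.stringify(text));
  }
  var start = Number(m[1]);
  // A single number is shorthand for a range with equal start and end.
  var end = m[2] === undefined ? start : Number(m[2]);
  return { Start: start, End: end };
}

// e : rng | e ',' rng
function parseRanges(input) {
  return input.split(',').reduce(function (list, piece) {
    list.push(parseRange(piece)); // append to the existing list, no new array per step
    return list;
  }, []);
}

console.log(JSON.stringify(parseRanges('5-10,12-16,20')));
// [{"Start":5,"End":10},{"Start":12,"End":16},{"Start":20,"End":20}]
```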

Javascript Regular Expressions Functionality

I've spent a few hours on this and I can't seem to figure this one out.
In the code below, I'm trying to understand exactly what and how the regular expressions in the url.match are working.
As the code is below, it doesn't work. However if I remove (?:&toggle=|&ie=utf-8|&FORM=|&aq=|&x=|&gwp) it seems to give me the output that I want.
However, I don't want to remove this without understanding what it is doing.
I found a pretty useful resource, but after a few hours I still can't precisely determine what these expressions are doing:
https://developer.mozilla.org/en-US/docs/JavaScript/Guide/Regular_Expressions#Using_Parenthesized_Substring_Matches
Could someone break this down for me and explain how exactly it is parsing the strings. The expressions themselves and the placement of the parentheses is not really clear to me and frankly very confusing.
Any help is appreciated.
(function($) {
$(document).ready(function() {
function parse_keywords(url){
var matches = url.match(/.*(?:\?p=|\?q=|&q=|\?s=)([a-zA-Z0-9 +]*)(?:&toggle=|&ie=utf-8|&FORM=|&aq=|&x=|&gwp)/);
return matches ? matches[1].split('+') : [];
}
myRefUrl = "http://www.google.com/url?sa=f&rct=j&url=https://www.mydomain.com/&q=my+keyword+from+google&ei=fUpnUaage8niAKeiICgCA&usg=AFQjCNFAlKg_w5pZzrhwopwgD12c_8z_23Q";
myk1 = (parse_keywords(myRefUrl));
kw="";
for (i=0;i<myk1.length;i++) {
if (i == (myk1.length - 1)) {
kw = kw + myk1[i];
}
else {
kw = kw + myk1[i] + '%20';
}
}
console.log (kw);
if (kw != null && kw != "" && kw != " " && kw != "%20") {
orighref = $('a#applynlink').attr('href');
$('a#applynlink').attr('href', orighref + '&scbi=' + kw);
}
});
})(jQuery);
Let's break this regex down.
/
Begin regex.
.*
Match zero or more of any character (except line breaks) - basically, we're willing to match this regex at any point in the string.
(?:\?p=
|\?q=
|&q=
|\?s=)
In this, the ?: means 'do not capture anything inside of this group'. See http://www.regular-expressions.info/refadv.html
The \? means take ? literally, which is normally a character meaning 'match 0 or 1 copies of the previous token' but we want to match an actual ?.
Other than that, it's just looking for a multitude of different options to select (| means 'the regex is valid if I match either what's before me or what's after me').
([a-zA-Z0-9 +]*)
Now we match zero or more of any of the following characters in any arrangement: a-z, A-Z, 0-9, space and +. And since it is inside a () with no ?: we DO capture it.
(?:&toggle=
|&ie=utf-8
|&FORM=
|&aq=
|&x=
|&gwp)
We see another ?: so this is another non-capturing group.
Other than that, it is just full of literal characters separated by |s, so it is not doing any fancy logic.
/
End regex.
In summary, this regex looks through the string for an instance of the first non-capturing group (the last one, because the leading .* is greedy), captures the run of letters, digits, spaces and + signs that follows it, and then requires an instance of the second non-capturing group immediately afterwards to 'cap' it off. It returns what was between those two non-capturing groups. (Think of it as a 'sandwich': we look for the header and footer and capture what is in between that we're interested in.) If the footer is missing, the whole match fails.
After the regex runs, we do this:
return matches ? matches[1].split('+') : [];
Which grabs the captured group and splits it on + into an array of strings.
For situations like this, it's really helpful to visualize it with www.debuggex.com (which I built). It immediately shows you the structure of your regex and allows you to walk through step-by-step.
In this case, the reason it works when you remove the last part of your regex is because none of the strings &toggle=, &ie=utf-8, etc are in your sample url. To see this, drag the grey slider above the test string on debuggex and you'll see that it never makes it past the & in that last group.
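The behaviour can be checked directly with a small sketch. The regex and the Google URL are the ones from the question; the keywords helper, the variable names, and the second URL (search.example.com) are hypothetical and only there for illustration.

```javascript
// Pattern from the question: header group, captured keywords, footer group.
var suffixed = /.*(?:\?p=|\?q=|&q=|\?s=)([a-zA-Z0-9 +]*)(?:&toggle=|&ie=utf-8|&FORM=|&aq=|&x=|&gwp)/;
// Same pattern without the footer group.
var unsuffixed = /.*(?:\?p=|\?q=|&q=|\?s=)([a-zA-Z0-9 +]*)/;

function keywords(url, pattern) {
  var matches = url.match(pattern);
  return matches ? matches[1].split('+') : [];
}

var googleUrl = "http://www.google.com/url?sa=f&rct=j&url=https://www.mydomain.com/&q=my+keyword+from+google&ei=fUpnUaage8niAKeiICgCA&usg=AFQjCNFAlKg_w5pZzrhwopwgD12c_8z_23Q";
var otherUrl = "http://search.example.com/?q=my+keyword&ie=utf-8"; // hypothetical URL with a footer

console.log(keywords(googleUrl, suffixed));   // [] because "&ei=" is not one of the footer options
console.log(keywords(googleUrl, unsuffixed)); // [ 'my', 'keyword', 'from', 'google' ]
console.log(keywords(otherUrl, suffixed));    // [ 'my', 'keyword' ]
```

So removing the footer group is reasonable if the keywords may be followed by arbitrary parameters; the capture still stops at the first character that is not a letter, digit, space or +.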
