Detecting regular expression in content during parse

Detecting regular expression in content during parse - javascript

I am writing a simple parser for C. I was just running it with some other language files (for fun - to see the extent of C-likeness and laziness - don't wanna really write separate parsers for each language if I can avoid it).
However the parser seems to break down for JavaScript if the code being parsed contains regular expressions...
Case 1:
For example, while parsing the JavaScript code snippet,
var phone="(304)434-5454"
phone=phone.replace(/[\(\)-]/g, "")
//Returns "3044345454" (removes "(", ")", and "-")
The '(', '[' etc get matched as starters of new scopes, which may never be closed.
Case 2:
And, for the Perl code snippet,
# Replace backslashes with two forward slashes
# Any character can be used to delimit the regex
$FILE_PATH =~ s#\\#//#g;
The // gets matched as a comment...
How can I detect a regular expression within the content text of a "C-like" program-file?

It is impossible.
Take this, for example:
m =~ s/a/b/g;
Could be both C or perl.
One minute's thinking reveals, that the number of perl style regular expressions that are also sntyctically valid C expressions is infinite.
Another example:
m+foo *bar[index]+i
The best you can get is some extreme vague guesswork. The difficulty stems from the fact that a regular expression is a sequence of characters that can be virtually everything.
You better clean up your error handling. A parser should not "break down" if some parenthesis are missing or superfluous ones are seen.

Well, your token grammar has to take regex syntax into consideration. Classic parsers consist of two layers: something to tokenize the input, and then something to parse the grammar. The syntax of the language is generally expressed in terms of tokens, so the job of the tokenizer is to feed a stream of those to the parser. Generally the tokens them selves are regular expressions, or more properly a great big regex of things ORed together. At each character position on the input, one of the token regexes must match or else the character is invalid.
Now, there are other parsing techniques that sort-of squish together the tokenization with the parsing. ("PEG" parsers for example)
edit — another note: you can't parse languages like Javascript or Perl with just a regular expression.

Related

Removing javascript reserved words from string

So, I have a string which is actually a javascript script. I need to remove first reserved javascript word from it, but only if it actually has the meaning of the reserved word. That means:
it can't be inside string literals ("" or '', like "return that thing to me");
it has to be preceded and followed by whitespace, linebreak and such;
any other cases where it's not a reserved word.
I have the hard time trying to write RegExp for this, as there always seems to be at least one case it doesn't work as intended.
Any help, please?

You have to use a more powerful method than regex - such as syntax analyzer to break your string into an abstract syntax tree. Then look for any keyword you want.
Try using the parser API of the Spider Money.
Or a library like UglifyJS or Esprima

Is it safe to use UTF-8 character literals in JavaScript source code?

Is it save to write JavaScript source code (to be executed in the browser) which includes UTF-8 character literals?
For example, I would like to use an ellipses literal in a string as such:
var foo = "Oops… Something went wrong";
Do "modern" browsers support this? Is there a published browser support matrix somewhere?

JavaScript is by specification a Unicode language, so Unicode characters in strings should be safe. You can use hex escapes (\u8E24) as an alternative. Make sure your script files are served with proper content type headers.
Note that characters beyond one- and two-byte sequences are problematic, and that JavaScript regular expressions are terrible with characters beyond the first codepage. (Well maybe not "terrible", but primitive at best.)
You can also use Unicode letters, Unicode combining marks, and Unicode connector punctuation characters in identifiers, in case you want to impress your friends. Thus
var wavy﹏line = "wow";
is perfectly good JavaScript (but good luck with your bug report if you find a browser where it doesn't work).
Read all about it in the spec, or use it to fall asleep at night :)

is there any way to get all the possible outcomes of a regular expression pattern?

is there any way to get all the possible outcomes of a regular expression pattern?. everything I've seen refers to a pattern that is evaluated against a string. but what I need is to have a pattern like this:
^EM1650S(B{1,2}|L{1,2})?$
generate all possible matches:
EM1650S
EM1650SB
EM1650SBB
EM1650SL
EM1650SLL

In the general case, no. In this case, you have almost no solution space.
There's a section covering this in Higher Order Perl (PDF) and a Perl module. I never re-implemented it in anything else, but I had a similar problem and this solution was adequate for similarly-limited needs.

There are tools that can display all possible matches of a regex.
Here is one written in Haskell: https://github.com/audreyt/regex-genex
and here is a Perl module: http://metacpan.org/pod/Regexp::Genex
Unfortunately I couldn't find anything for JavaScript

In this particular case, yes. The regex generates a finite number of valid string, so they can be counted up.
You'll just have to parse the regex. Some part of that (EM1650S) is mandatory, so think for the rest. Parse by the | (or) symbol. Then enumerate the strings for both sides of it. Then you can get all possible combinations of them.
Some regex (containing * or + symbols) can represent an infinite number of strings, so they cannot be counted.

From a computational theoretic standpoint, regular expressions are equivalent to finite state machines. This is part of "automata theory." You could create a finite state machine that is equivalent to a regular expression and then use graph traversal algorithms to traverse all paths of the FSM. In the general case a countably infinite number of strings may match a regular expression, so your program may never terminate depending on the input regular expression.

How to implement Lexical Analysis in Javascript

Hey folks, thanks for reading
I am currently attempting to do a Google-style calculator. You input a string, it determines if it can be calculated and returns the result.
I began slowly with the basics : + - / * and parenthesis handling.
I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.
This kind of work is easily applicable with languages such as Lex and Yacc, except I am developping a Javascript-only application.
I tried to transcript the idea into Javascript but I can't figure out how to handle everything in a clean and beautiful way, especially nested parenthesis.
Analysis
Let's define what a calculator query is:
// NON TERMINAL EXPRESSIONS //
query -> statement
query -> ε // means end of query
statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number
number -> integer
number -> float
// TERMINAL EXPRESSIONS //
operator -> [+*/%^-]
prefix -> -
integer -> [0-9]+
float -> [0-9]+[.,][0-9]+
Javascript
Lexical Analysis consists in verifying there is nothing that doesn't look like one of the terminal expressions : operator, prefixes, integer and float. Which can be shortened to one regular expression:
(I added spaces to make it more readable)
var calcPat =
/^ (\s*
( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;
If this test passes, query is lexically correct and needs to be grammar-checked to determine if it can be calculated. This is the tricky part
I am not going to paste code because it is not clean nor easily understandable, but I am going to explain the process I followed and why I'm stuck:
I created a method called isStatement(string) that's supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check if they really are statements and form one altogether.
Process is the following:
-If the first two tokens are a number followed by an operator:
-Then,
-- If the remaining is just one token and it is a number:
--- Then this is a statement.
--- Else, check if the remaining tokens form a statement (recursive call)
-Else, If the first token is a parenthesis
-Then, Find matching closing parenthesis and check if what's inside is a statement (recursion)
-- Also check if there is something after closing parenthesis and if it forms a statement when associated with the parenthesis structure.
What's the problem ?
My problem is that I cannot find matching parenthesis when there is nested structures. How can I do that ? Also, as you can see, this is not a particurlarly generic and clean grammar-checking algorithm. Do you have any idea to improve this pattern ?
Thank you so much for having taken the time to read everything.
Gael
(PS: As you probably noticed, I am not a native english speaker ! Sorry for mistakes and all !)

You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.
The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.
The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.
You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.
Now, because you've got Javascript regular expressions to work with, you could come up with a regular expression designed to match a string of tokens. The trick with that will be coming up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.
edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very complicated, and it'd be pretty fun to have around.

Finding beginning and end quotations

I'm starting to write a code syntax highlighter in JavaScript, and I want to highlight text that is in quotes (both "s and 's) in a certain color. I need it be able to not be messed up by one of one type of quote being in the middle of a pair of the other quotes as well, but i'm really not sure where to even start. I'm not sure how I should go about finding the quotes and then finding the correct end quote.

Unless you're doing this for the challenge, have a look at Google Code Prettify.
For your problem, you could read up on parsing (and lexers) at Wikipedia. It's a huge topic and you'll find that you'll come upon bigger problems than parsing strings.
To start, you could use regular expressions (although they rarely have the accuracy of a true lexer.) A typical regular expression for matching a string is:
/"(?:[^"\\]+|\\.)*"/
And then the same for ' instead of ".
Otherwise, for a character-by-character parser, you would set some kind of state that you're in a string once you hit ", then when you hit " that is not preceded by an uneven amount of backslashes (an even amount of backslashes would escape eachother), you exit the string.

You can find quotes using regular expressions but if you're writing a syntax highlighter then the only reliable way is to step through the code, character by character, and decide what to do from there.
E.g. of a Regex
/("|')((?:\\\1|.)+?)\1/g
(matches "this" and 'this' and "thi\"s")

use stack.. if unmatched quote found push it.. if match found pop

I did it with a single regular expression in php using backwards references. JS does not support it and i think that's what you need if you really want to detect undefined backslashes.

We Keep Coding

JavaScript is the programming language of the Web.