Division/RegExp conflict while tokenizing Javascript [duplicate] - javascript

This question already has answers here:
When parsing Javascript, what determines the meaning of a slash?
(5 answers)
Closed 8 years ago.
I'm writing a simple JavaScript tokenizer which detects basic types: Word, Number, String, RegExp, Operator, Comment and Newline. Everything is going fine, but I can't work out how to detect whether the current character is a RegExp delimiter or a division operator. I'm not using regular expressions because they are too slow. Does anybody know a mechanism for detecting this? Thanks.

You can tell by what the preceding token is in the stream. Go through each token that your lexer emits and ask whether it can reasonably be followed by a division sign or a regexp; you'll find that the two resulting sets of tokens are very nearly disjoint. For example, (, [, {, ;, and all of the binary operators can only be followed by a regexp. Likewise, ), ], }, identifiers, and string/number literals are normally followed by a division sign. The exceptions are a ) that closes an if/while/for head and a } that closes a block, both of which can be followed by a regexp, so those two tokens need a little extra context.
See Section 7 of the ECMAScript spec for more details.
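The previous-token rule above can be sketched roughly as follows. This is only a heuristic under stated assumptions, not a full implementation: the token shape ({ type, value }), the function name slashStartsRegExp and the keyword list are all hypothetical, and the caller is assumed to pass the last significant token (skipping Comment and Newline tokens).

```javascript
// Hypothetical token shape: { type: 'Word' | 'Number' | 'String' | 'RegExp' | 'Operator', value: string }

// Keywords after which an expression (and so a regexp literal) may begin.
// The list is an assumption and may be incomplete.
const KEYWORDS_BEFORE_REGEXP = new Set([
  'return', 'typeof', 'instanceof', 'in', 'of', 'new', 'delete',
  'void', 'throw', 'case', 'do', 'else', 'yield', 'await',
]);

// Closing punctuators that usually end an expression.
const CLOSERS = new Set([')', ']', '}']);

// Returns true if a "/" seen right after `prev` should start a regexp literal,
// false if it should be treated as a division operator.
function slashStartsRegExp(prev) {
  if (!prev) return true; // start of input: nothing to divide
  switch (prev.type) {
    case 'Number':
    case 'String':
    case 'RegExp':
      return false; // a value was just completed, so "/" divides it
    case 'Word':
      // Identifiers end an expression; the listed keywords begin one.
      return KEYWORDS_BEFORE_REGEXP.has(prev.value);
    case 'Operator':
      // Heuristic only: ")" after an if/while head and "}" after a block
      // are misclassified here and need parser context.
      return !CLOSERS.has(prev.value);
    default:
      return true;
  }
}

console.log(slashStartsRegExp({ type: 'Operator', value: '(' })); // true
console.log(slashStartsRegExp({ type: 'Word', value: 'x' }));     // false
```

Keeping only the last significant token avoids any lookahead and keeps the tokenizer single-pass, at the cost of the edge cases noted in the comments.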

You have to check the context when you encounter the slash. If the slash comes after an expression, it must be division; otherwise it is the start of a regexp.
To recognize the context, you may need a syntax parser.
For example:
function f() {}
/1/g
// In this case the slash follows a function declaration, so it starts a regexp.
var a = {}
/1/g;
// In this case the slash follows an object expression, so it is division.
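One way to check the two cases above is to let the engine's own parser decide and look at what comes back. A small sketch: completionValue is a hypothetical helper, eval is used purely for demonstration, and g is given a value here so that the division can actually run.

```javascript
// Hypothetical helper: evaluates a snippet and returns its completion value.
// Direct eval is used only so that the engine's parser makes the decision.
function completionValue(source) {
  return eval(source);
}

// After a function declaration, "/1/g" is parsed as a regexp literal.
const afterDeclaration = completionValue('function f() {}\n/1/g');

// After an object expression, the same characters are parsed as ({} / 1) / g.
const afterObject = completionValue('var g = 2; var a = {}\n/1/g; a');

console.log(afterDeclaration instanceof RegExp); // true
console.log(Number.isNaN(afterObject));          // true, because {} / 1 is NaN
```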

Related

Solve Catastrophic Backtracking in my regex detecting Email [duplicate]

This question already has an answer here:
Email validation Regular expression is causing catastrophic backtracking
(1 answer)
Closed 7 months ago.
I have regex
/^\w+([.-]?\w+)*#\w+([.-]?\w+)*(\.\w{2,4})+$/
for checking valid Email.
It works, but GitHub's code scanner shows this error
This Part of the Regular Expression May Cause Exponential Backtracking on Strings Starting With 'A#a' and Containing Many Repetitions of 'A'.
I got the error, however, I'm not sure how to solve it.
A good place to start is this: How can I recognize an evil regex?
As one of the answers there says, the key is to avoid "repetition of a repetition". For instance, given (\w+)* and the input aaa, it could match as (aaa), or (a)(aa), or (aa)(a), or (a)(a)(a); and as the input gets longer, the number of possibilities goes up exponentially. If instead you just write (\w*), it will match all the same strings, but only in one way.
In your case, you have two places where you write ([.-]?\w+)* and because you've made the [.-] optional, it can match in all the ways that (\w+)* can. But text without a dot or dash is already matched by the \w+ just before, so you can have ([.-]\w+)* instead.
The string .aaa can now only match one way, because (.a)(aa) doesn't have a dot or dash at the start of the second group. Other strings like aaa or ..a can be ruled out because you need exactly one dot or dash, and at least one character matching \w (which doesn't include . or -).
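Putting that change into the pattern from the question gives something like the sketch below. It keeps the # separator exactly as the question writes it; the sample inputs are made up for illustration, and this only removes the optional [.-], it does not claim to validate every real address.

```javascript
// Pattern from the question: the optional [.-] inside a repeated group
// lets a plain run of word characters be split up in many different ways.
const original = /^\w+([.-]?\w+)*#\w+([.-]?\w+)*(\.\w{2,4})+$/;

// Same pattern with [.-] made mandatory inside each repeated group.
const fixed = /^\w+([.-]\w+)*#\w+([.-]\w+)*(\.\w{2,4})+$/;

// A well-formed sample still matches both patterns.
console.log(original.test('john.doe#example.com')); // true
console.log(fixed.test('john.doe#example.com'));    // true

// A long non-matching input of the kind the scanner warns about.
// Only the fixed pattern is run on it here, since the original may take
// a very long time to give up.
const hostile = 'a#' + 'a'.repeat(40) + '!';
console.log(fixed.test(hostile)); // false
```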

What is the meaning of forward slash in a JavaScript expression? [duplicate]

This question already has answers here:
Meaning of javascript text between two slashes
(3 answers)
Closed 2 years ago.
Stumbling upon a piece of JavaScript in a library I found this:
let useBlobFallback = /constructor/i.test(window.HTMLElement) || !!window.safari || !!window.WebKitPoint
but I can't find the meaning of the /constructor/i. Even searching online produces meaningless results because of the word 'constructor' and/or because the slash is also used in regular expressions, which I believe is not the case in this code snippet.
This is a RegExp literal. It's equivalent to new RegExp('constructor', 'i').test(window.HTMLElement).
Have a look at this maybe?
Simple patterns are constructed of characters for which you want to find a direct match. For example, the pattern /abc/ matches character combinations in strings only when the exact sequence "abc" occurs (all characters together and in that order). Such a match would succeed in the strings "Hi, do you know your abc's?" and "The latest airplane designs evolved from slabcraft." In both cases the match is with the substring "abc". There is no match in the string "Grab crab" because while it contains the substring "ab c", it does not contain the exact substring "abc".
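A quick way to convince yourself of both points is to run them. This is a small sketch: the sample strings come from the quote above, and fakeConstructor is a made-up stand-in for whatever window.HTMLElement converts to in a given browser.

```javascript
// The literal and the constructor call describe the same pattern and flags.
const literal = /constructor/i;
const constructed = new RegExp('constructor', 'i');
console.log(literal.source === constructed.source); // true
console.log(literal.flags === constructed.flags);   // true

// The /abc/ examples from the quoted documentation.
const abc = /abc/;
const results = [
  abc.test("Hi, do you know your abc's?"),
  abc.test('The latest airplane designs evolved from slabcraft.'),
  abc.test('Grab crab'),
];
console.log(results); // [ true, true, false ]

// test() converts its argument to a string first, which is why a
// non-string such as window.HTMLElement can be passed to it.
// fakeConstructor is a hypothetical object, not a real browser value.
const fakeConstructor = { toString() { return '[object SomethingConstructor]'; } };
const coerced = literal.test(fakeConstructor);
console.log(coerced); // true
```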

JavaScript regex: why is alternation not ordered? [duplicate]

This question already has answers here:
Why order matters in this RegEx with alternation?
(3 answers)
Order of regular expression operator (..|.. ... ..|..)
(1 answer)
Closed 2 years ago.
Given this code:
const regex = /graph|photograph/;
'A photograph'.match(regex);
// Output: [ 'photograph', index: 2, input: 'A photograph', groups: undefined ]
Why is the engine not finding graph first? After looking at similar SO questions and the ECMAScript docs, I can see that
The | regular expression operator separates two alternatives. The pattern first tries to match the left Alternative (followed by the sequel of the regular expression); if it fails, it tries to match the right Disjunction (followed by the sequel of the regular expression).
Now, the above quote covers the case /photo|photograph/ where the alternatives share a common beginning, but the case where they share a common ending appears to be governed by a different rule.
I am content with the result I am getting, as in my use case I prefer to get the longest match, not the earliest one, but I would like to know why this happens, so I can be sure this isn't just a coincidence that is bound to change in the future.
The alternative graph does not match starting at the third character, but the alternative photograph does. The engine proceeds through the string from left to right.
The ordering you refer to in the question applies when alternatives match from a common starting point in the string. Otherwise, while proceeding through the "haystack" string, the alternatives are all considered. If there's a single match starting from a particular character, then the rest of the regex will proceed with that (and may of course backtrack later).
Whether the engine prefers longer matches from a set of alternatives when there are multiple matches from the same character in the source, I can't say off the top of my head. I would guess it would try the longer one first, to consume more of the string optimistically, because it can always backtrack. However, I don't know that to be actual specified behavior and just thinking about reading the regex semantics in the spec makes my head hurt.
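For what it's worth, a quick experiment is consistent with the spec text quoted in the question: when both alternatives can match at the same position, the left one is tried first, whether or not it is the longer one. A minimal sketch using the strings from the question (the variable names are just for illustration):

```javascript
const text = 'A photograph';

// Common ending: "graph" cannot match at index 2, "photograph" can,
// and the engine reports the leftmost position where any alternative matches.
const commonEnding = text.match(/graph|photograph/);
console.log(commonEnding[0], commonEnding.index); // photograph 2

// Common beginning: both alternatives match at index 2, the left one wins.
const commonStart = text.match(/photo|photograph/);
console.log(commonStart[0], commonStart.index); // photo 2

// Swapping the order changes the result, so it is order, not length, that decides.
const reversed = text.match(/photograph|photo/);
console.log(reversed[0], reversed.index); // photograph 2
```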

Workaround for negative look-behind in javascript [duplicate]

This question already has answers here:
javascript regex - look behind alternative?
(8 answers)
Closed 6 years ago.
I'm converting a python script I wrote to javascript. My python script has the following regex to match single instances of '\':
re.sub(r'(?<!\\)(\\{1})(?!\\)', r'\\', word)
I got a compiler error when trying to run this in js:
"Invalid regular expression: /(?<!\\)(\\{1})(?!\\)/: Invalid group"
After some searching, I found out that regex in JS does not support lookbehinds.
I looked at this answer and they used:
^(?!filename).+\.js
in the form of a negative look-ahead from the start of the string, which does not help me as I need to change '\' to '\\' anywhere in the string.
I do not think this is a duplicate question as my question is trying to determine how to avoid and match the same character at different points in a string, while the linked question seeks to avoid a specific phrase from being matched.
I need to match '\' characters that do not have '\' either before or after them.
You can always use a capture group instead of a lookbehind:
string.match(/(^|[^\\])(\\{1})(?!\\)/)[2]
For a replacement, put the captured preceding character back:
let replaced = "a\\b\\\\".replace(/(^|[^\\])(\\{1})(?!\\)/g, (match, before) => before + 'value')
console.log(replaced) // avalueb\\
This matches the same backslashes as (?<!\\)(\\{1})(?!\\).
Just match without assertions (^|[^\\])\\([^\\]|$) then substitute them back.
Note that this will tell you nothing about if it is escaping anything or not.
That regex is more complex.
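Applied to the original goal of turning a lone '\' into '\\', the capture-group idea might look like the sketch below. The function names are made up for illustration. The second version uses a real lookbehind, which was added to the language in ES2018, so it only works on engines that support it.

```javascript
// Capture-group workaround: match the character before the backslash
// (or the start of the string) and put it back with $1.
function doubleLoneBackslashes(text) {
  return text.replace(/(^|[^\\])\\(?!\\)/g, '$1\\\\');
}

// Same idea with a lookbehind, for ES2018+ engines only.
function doubleLoneBackslashesLookbehind(text) {
  return text.replace(/(?<!\\)\\(?!\\)/g, '\\\\');
}

// In the source below each "\\" is one backslash in the actual string.
console.log(doubleLoneBackslashes('a\\b'));   // a\\b  (lone backslash doubled)
console.log(doubleLoneBackslashes('a\\\\b')); // a\\b  (already doubled, unchanged)
console.log(doubleLoneBackslashes('\\a\\b')); // \\a\\b
```

Note that neither version says anything about what the backslash is escaping, as the answer above points out.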

Match all Inside Parenthesis but not Outside [duplicate]

This question already has answers here:
Recursive matching with regular expressions in Javascript
(6 answers)
Closed 8 years ago.
I'm trying to use regular expressions to match certain groups of strings which correspond to functions. Right now it looks like this:
(Spreadsheet.[^)\)]+\))
Where it finds the variable Spreadsheet which has the function as an attribute. The expression keeps going until it gets to the end parenthesis. For simple functions such as
Spreadsheet.ADD(1,2)
the regular expression will work fine.
However, if I try to do any sort of nesting, the expression does not work because it will stop at the inside parenthesis instead of going to the last parenthesis.
Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)
Thus, the ", 3)" isn't identified and ends up being ignored. Of course, due to the way my code processes it, this unusual string ends up causing an error.
Does anyone with more knowledge of regular expressions know how it could be changed such that it will stop only when it is at the last parenthesis and not the first?
Thanks.
I'm assuming that you only want to match functions in the form that you state in the question. If you want to match any type of function (including operators, nested comments, etc.) then what you want is going to be difficult with regex, see here. Anyway, to match up to the last bracket you can use:
(Spreadsheet\..+\))
This will match
Spreadsheet.ADD(1,2)
Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)
Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)foo
(foo not part of the match)
Your regex did not match the full string because [^)\)]+ only matches characters that are not ), so it stops at the first ) it finds. Also, as an aside, the unescaped . in Spreadsheet. matches any character, so it will also match Spreadsheeta, Spreadsheetb, Spreadsheetc. To match a literal dot you need \..
In my regex, .+\) includes the last bracket because + is greedy, so it takes the longest match it can. As an aside, you would specify a non-greedy match using +?.
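One caveat with the greedy version is that it runs to the last ) on the line, so two separate calls on one line come back as a single match. If that matters, a small depth counter is one possible alternative. This is only a sketch: extractCall is a hypothetical helper, and it assumes parentheses never appear inside string literals or comments.

```javascript
// Hypothetical helper: returns the first complete call that starts with
// `prefix`, found by counting parenthesis depth instead of using a regex.
// Assumes no parentheses inside string literals or comments.
function extractCall(source, prefix = 'Spreadsheet.') {
  const start = source.indexOf(prefix);
  if (start === -1) return null;
  let depth = 0;
  for (let i = start; i < source.length; i++) {
    const ch = source[i];
    if (ch === '(') {
      depth++;
    } else if (ch === ')') {
      depth--;
      if (depth === 0) return source.slice(start, i + 1); // balanced again
      if (depth < 0) return null; // a ")" appeared before any "("
    }
  }
  return null; // ran out of input before the call was closed
}

const nested = 'Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)foo';
console.log(extractCall(nested)); // Spreadsheet.ADD(Spreadsheet.ADD(1, 2), 3)

const twoCalls = 'Spreadsheet.ADD(1,2) + Spreadsheet.ADD(3,4)';
const greedy = twoCalls.match(/(Spreadsheet\..+\))/)[1];
console.log(greedy);                // Spreadsheet.ADD(1,2) + Spreadsheet.ADD(3,4)
console.log(extractCall(twoCalls)); // Spreadsheet.ADD(1,2)
```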
