How do I combine these two regular expressions into one? - javascript

I'm writing a rudimentary lexer using regular expressions in JavaScript and I have two regular expressions (one for single quoted strings and one for double quoted strings) which I wish to combine into one. These are my two regular expressions (I added the ^ and $ characters for testing purposes):
var singleQuotedString = /^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$/gi;
var doubleQuotedString = /^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$/gi;
Now I tried to combine them into a single regular expression as follows:
var string = /^(["'])(?:[^\1\\]|\\\1|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*\1$/gi;
However when I test the input "Hello"World!" it returns true instead of false:
alert(string.test('"Hello"World!"')); //should return false as a double quoted string must escape double quote characters
I figured that the problem is in [^\1\\] which should match any character besides matching group \1 (which is either a single or a double quote - the delimiter of the string) and \\ (which is the backslash character).
The regular expression correctly filters out backslashes and matches the delimiters, but it doesn't filter out the delimiter within the string. Any help will be greatly appreciated. Note that I referred to Crockford's railroad diagrams to write the regular expressions.

You can't refer to a matched group inside a character class: (['"])[^\1\\]. Try something like this instead:
(['"])((?!\1|\\).|\\[bnfrt]|\\u[a-fA-F\d]{4}|\\\1)*\1
(you'll need to add some more escapes, but you get my drift...)
A quick explanation:
(['"]) # match a single or double quote and store it in group 1
( # start group 2
(?!\1|\\). # if group 1 or a backslash isn't ahead, match any non-line break char
| # OR
\\[bnfrt] # match an escape sequence
| # OR
\\u[a-fA-F\d]{4} # match a Unicode escape
| # OR
\\\1 # match an escaped quote
)* # close group 2 and repeat it zero or more times
\1 # match whatever group 1 matched
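To see the lookahead version reject the problematic input, here is a minimal sketch. It is an assumed test harness rather than part of the answer: the anchors, the non-capturing group, the `i` flag and the extra `\\` and `\/` escapes are added to mirror the patterns in the question, and the name `quotedString` is hypothetical. The `g` flag is left out on purpose, because `test()` on a global regex keeps state in `lastIndex` between calls.

```javascript
// Sketch only: the backreference \1 sits inside a negative lookahead,
// where it is a real backreference, instead of inside a character class.
const quotedString =
  /^(["'])(?:(?!\1|\\).|\\[\\\/bfnrt]|\\u[0-9A-F]{4}|\\\1)*\1$/i;

// Each sample is the raw text the lexer would see.
const samples = [
  '"Hello"World!"',    // unescaped delimiter inside the string
  '"Hello\\"World!"',  // escaped delimiter
  "'it\\'s'",          // escaped single quote in a single quoted string
  '"it\'s"',           // the other quote type needs no escape
  '"abc\'',            // mismatched delimiters
  '"a\\tb\\u00E9"',    // \t and \u00E9 escapes
  '"bad\\q"',          // unknown escape
];

const results = samples.map((s) => quotedString.test(s));
console.log(results);
// [ false, true, true, true, false, true, false ]
```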

This should work too (raw regex).
If speed is a factor, this is the 'unrolled' method, said to be the fastest for this kind of thing.
(['"])(?:(?!\\|\1).)*(?:\\(?:[\/bfnrt]|u[0-9A-F]{4}|\1)(?:(?!\\|\1).)*)*\1
Expanded
(['"]) # Capture a quote
(?:
(?!\\|\1). # As many non-escape and non-quote chars as possible
)*
(?:
\\ # escape plus,
(?:
[\/bfnrt] # /,b,f,n,r,t or u[0-9A-F]{4} or captured quote
| u[0-9A-F]{4}
| \1
)
(?:
(?!\\|\1). # As many non-escape and non-quote chars as possible
)*
)*
\1 # Captured quote

Well, you can always just create a larger regex by just using the alternation operator on the smaller regexes
/(?:single-quoted-regex)|(?:double-quoted-regex)/
Or explicitly:
var string = /(?:^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$)|(?:^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$)/gi;
Finally, if you want to avoid the code duplication, you can build up this regex dynamically, using the RegExp constructor.
var quoted_string = function(delimiter){
return ('^' + delimiter + '(?:[^' + delimiter + '\\]|\\' + delimiter + '|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*' + delimiter + '$').replace(/\\/g, '\\\\');
//in the general case you could consider using a regex escaping function to avoid backslash hell.
};
var string = new RegExp( '(?:' + quoted_string("'") + ')|(?:' + quoted_string('"') + ')' , 'gi' );

Related

Regex, Get sequence if not preceded by symbols [duplicate]

I'm using the math.js library and I need to take the exponent of some variables. I have the following strings:
//Ok
pow(y,2)
pow(y,2+2)
pow(y,2-3)
pow(y,2.2)
pow(y,(23)/(2))+23123
pow(y,pow(2,pow(2,4)))-932
pow(y,pow(2,1*pow(2,0.5)))+23
//Error
pow(y,2)*pow(2,2)
pow(y,3)-pow(2,2)
pow(y,4)+pow(2,2)
pow(y,pow(2,1*pow(2,0.5)))+pow(1,1)
I'm having trouble implementing this search using regex. The pow(a,b) function is composed of two arguments "a" is the base and "b" the exponent.
In the last four strings of the code above, I need to capture only "2", "3", "4" and "pow(2,1*pow(2,0.5))". I don't want to take the part after "*", "+" and "-".
Since it is possible to chain the pow() function and both "a" and "b" can have arithmetic operators and functions like pow() and sqrt(), this turned out to be very complex. Is there any way to resolve this using regex?
The closest I got is in this regex: https://regex101.com/r/hB1cg4/4
As stated in the comments, matching balanced parentheses is hard with regex, though the .NET regex flavor supports this feature. Please see this answer: https://stackoverflow.com/a/35271017/8031896
Nevertheless, there is a workaround that uses the common regex flavors. However, please note that you may need to modify it according to how deeply the parentheses are nested in your mathematical notation.
((?<=^pow)\(([^()]*|\(([^()]*|\([^()]*\))*\))*\))
demo: https://regex101.com/r/hB1cg4/6
For more detailed explanation, please see this answer: https://stackoverflow.com/a/18703444/8031896
The following regex matches all of the "Error" strings, and one variant, but unfortunately fails to match two of the "OK" strings. Perhaps some tweaking is possible. The regex contains a single capture group that captures the information of interest.
^pow\([^,]+,(\d[^()]*|pow\(\d+,\d+(?:\)|[^()]*\([^()]*\)\)))\).*
Javascript demo
To match the "Error" strings I assumed that pow(2,1*pow(2,0.5)) in pow(y,pow(2,1*pow(2,0.5)))+23 represented the maximum number of nested "pow"'s.
The regex performs the following operations.
^ # match beginning of line
pow\( # match 'pow('
[^,]+ # match 1+ chars other than ','
, # match ','
( # begin capture group 1
\d[^()]* # match a digit, 0+ chars other than '(' and ')'
| # or
pow\(\d+,\d+ # match 'pow(', 1+ digits, ',' 1+ digits
(?: # begin non-cap grp
\) # match ')'
| # or
[^()]* # match 0+ chars other than '(' and ')'
\( # match '('
[^()]* # match 0+ chars other than '(' and ')'
\)\) # match '))'
) # end non-cap grp
) # end cap grp 1
\) # match ')'
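JavaScript regular expressions have no recursion or balancing groups, so one alternative to the patterns above is to leave the nesting to a short loop that counts parentheses. The sketch below is that swapped-in technique, not what either answer proposes. `extractPowExponent` is a hypothetical helper name, and the code assumes the input starts with `pow(` and has balanced parentheses.

```javascript
// Sketch: return the second argument of the leading pow(a,b) call,
// or null if the input does not start with a well-formed pow(...).
function extractPowExponent(expr) {
  if (!expr.startsWith('pow(')) return null;
  let depth = 0;        // current parenthesis nesting depth
  let commaIndex = -1;  // position of the comma separating a and b
  for (let i = 3; i < expr.length; i++) {
    const ch = expr[i];
    if (ch === '(') {
      depth++;
    } else if (ch === ')') {
      depth--;
      if (depth === 0) {
        // Closing parenthesis of the leading pow(...): stop here,
        // so anything after it (*, +, -) is ignored.
        return commaIndex === -1 ? null : expr.slice(commaIndex + 1, i);
      }
    } else if (ch === ',' && depth === 1 && commaIndex === -1) {
      commaIndex = i;   // first top-level comma inside the leading pow(
    }
  }
  return null;          // unbalanced input
}

console.log(extractPowExponent('pow(y,2)*pow(2,2)'));                    // 2
console.log(extractPowExponent('pow(y,pow(2,1*pow(2,0.5)))+pow(1,1)'));  // pow(2,1*pow(2,0.5))
console.log(extractPowExponent('pow(y,(23)/(2))+23123'));                // (23)/(2)
```

Unlike the fixed-depth patterns, this does not need adjusting when the nesting gets deeper, at the cost of not being a single regex.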

Regex remove string in url

I have a URL like https://randomsitename-dd555959b114a0.mydomain.com and want to remove the -dd555959b114a0 part of the URL.
Here randomsitename is a random name and the domain is a static domain name.
Is it possible to remove that part with jQuery or JavaScript?
Look at this code that is using regex
var url = "https://randomsitename-dd555959b114a0.mydomain.com";
var res = url.replace(/ *\-[^.]*\. */g, ".");
http://jsfiddle.net/VYw9Y/
It's usually best to code for all possible cases, and since hyphens are allowed within any part of a domain name, you'll more than likely want to use a more specific RegExp such as:
^ # start of string
( # start first capture group
[a-z]+ # one or more letters
) # end first capture group
:// # literal separator
( # start second capture group
[^.-]+ # one or more chars except dot or hyphen
) # end second capture group
(?: # start optional non-capture group
- # literal hyphen
[^.]+ # one or more chars except dot
)? # end optional non-capture group
( # start third capture group
.+ # one or more chars
) # end third capture group
$ # end of string
Or without comments:
^([a-z]+)://([^.-]+)(?:-[^.]+)?(.+)$
(Remember to escape slashes if you use the literal form for RegExps rather than creating them as objects, i.e. /literal\/form/ vs. new RegExp('object/form'))
Used in a string replacement, the second argument should then be: $1://$2$3
Previous answers will fail for URLs like http://foo.bar-baz.com or http://foo-bar.baz-blarg.com.
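As a quick check of that claim, here is a small sketch that runs the more specific pattern and the simpler `/ *\-[^.]*\. */g` replacement side by side. The function names `stripHostSuffix` and `stripAnyHyphenPart` are hypothetical, and the pattern is written in literal form, so the slashes are escaped.

```javascript
// The commented pattern from above, in literal form.
const specific = /^([a-z]+):\/\/([^.-]+)(?:-[^.]+)?(.+)$/;

// Removes a "-suffix" only from the first label of the host name.
function stripHostSuffix(url) {
  return url.replace(specific, '$1://$2$3');
}

// The simpler replacement: removes every "-...." run up to the next dot.
function stripAnyHyphenPart(url) {
  return url.replace(/ *\-[^.]*\. */g, '.');
}

console.log(stripHostSuffix('https://randomsitename-dd555959b114a0.mydomain.com'));
// https://randomsitename.mydomain.com
console.log(stripHostSuffix('http://foo.bar-baz.com'));
// http://foo.bar-baz.com (unchanged)
console.log(stripAnyHyphenPart('http://foo.bar-baz.com'));
// http://foo.bar.com (the domain itself is damaged)
```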
You could try this regex,
(.*)(-[^\.]*)(.*$)
Your code should be,
var url = "https://randomsitename-dd555959b114a0.mydomain.com";
var res = url.replace(/(.*)(-[^\.]*)(.*$)/, "$1$3");
//=>https://randomsitename.mydomain.com
Explanation:
(.*) matches any character 0 or more times and stores the result in group 1, because the pattern is enclosed in parentheses. Since .* is greedy, group 1 extends up to the last - in the string.
(-[^\.]*) stores everything from that - up to, but not including, the next literal dot in group 2.
(.*$) stores everything from that literal dot up to the last character in group 3.
$1$3 at the replacement part prints only the stored group1 and 3.
OR
(.*)(?:-[^\.]*)(.*$)
If you use this regex, in the replacement part you need to put only $1 and $2.

Replace spaces but not when between parentheses

I guess I can do this with multiple regexs fairly easily, but I want to replace all the spaces in a string, but not when those spaces are between parentheses.
For example:
Here is a string (that I want to) replace spaces in.
After the regex I want the string to be
Hereisastring(that I want to)replacespacesin.
Is there an easy way to do this with lookahead or lookbehind operators?
I'm a little confused on how they work, and not real sure they would work in this situation.
Try this:
replace(/\s+(?=[^()]*(\(|$))/g, '')
A quick explanation:
\s+ # one or more white-space chars
(?= # start positive look ahead
[^()]* # zero or more chars other than '(' and ')'
( # start group 1
\( # a '('
| # OR
$ # the end of input
) # end group 1
) # end positive look ahead
In plain English: it matches one or more white space chars if either a ( or the end-of-input can be seen ahead without encountering any parentheses in between.
An online Ideone demo: http://ideone.com/jaljw
The above will not work if:
there are nested parentheses
parentheses can be escaped
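A short sketch of the replacement in use, including one nested input to illustrate the first limitation. The helper name `stripOutsideParens` and the nested sample string are made up for illustration.

```javascript
// Remove whitespace unless the next parenthesis ahead is a closing one.
function stripOutsideParens(text) {
  return text.replace(/\s+(?=[^()]*(\(|$))/g, '');
}

console.log(stripOutsideParens('Here is a string (that I want to) replace spaces in.'));
// Hereisastring(that I want to)replacespacesin.

// Nested parentheses: the space between "b" and the inner "(" is removed
// even though it sits inside the outer pair.
console.log(stripOutsideParens('a (b (c d) e f) g'));
// a(b(c d) e f)g
```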

Need help with a regular expression in Javascript

The box should allow:
Uppercase and lowercase letters (case insensitive)
The digits 0 through 9
The characters, ! # $ % & ' * + - / = ? ^ _ ` { | } ~
The character "." provided that it is not the first or last character
Try
^(?!\.)(?!.*\.$)[\w.!#$%&'*+\/=?^`{|}~-]*$
Explanation:
^ # Anchor the match at the start of the string
(?!\.) # Assert that the first character isn't a dot
(?!.*\.$) # Assert that the last character isn't a dot
[\w.!#$%&'*+\/=?^`{|}~-]* # Match any number of allowed characters
$ # Anchor the match at the end of the string
Try something like this:
// the '.' is not included in this:
var temp = "\\w,!#$%&'*+/=?^`{|}~-";
var regex = new RegExp("^["+ temp + "]([." + temp + "]*[" + temp + "])?$");
//                                      ^
//                                      |
//                                      +---- the '.' included here
Looking at your comments, it's clear you don't know exactly what a character class does. You don't need to separate the characters with commas. The character class:
[0-9,a-z]
matches a single (ASCII) digit or lower case letter OR a comma. Note that \w is a "shorthand class" that equals [a-zA-Z0-9_].
More information on character classes can be found here:
http://www.regular-expressions.info/charclass.html
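You can do something like this (the hyphen is placed last in each character class so that it is a literal hyphen rather than part of the range +-/, which would also admit the dot):
^[a-zA-Z0-9,!#$%&'*+/=?^_`{|}~-][a-zA-Z0-9,!#$%&'*+/=?^_`{|}~.-]*[a-zA-Z0-9,!#$%&'*+/=?^_`{|}~-]$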
Here's how I would do it:
/^[\w!#$%&'*+\/=?^`{|}~-]+(?:\.[\w!#$%&'*+\/=?^`{|}~-]+)*$/
The first part is required to match at least one non-dot character, but everything else is optional, allowing it to match a string with only one (non-dot) character. Whenever a dot is encountered, at least one non-dot character must follow, so it won't match a string that begins or ends with a dot.
It also won't match a string with two or more consecutive dots in it. You didn't specify that, but it's usually one of the requirements when people ask for patterns like this. If you want to permit consecutive dots, just change the \. to \.+.
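The difference between this pattern and the lookahead pattern from the first answer shows up on consecutive dots and on the empty string. A minimal sketch comparing the two (the variable names `segmented` and `lookahead` are hypothetical):

```javascript
// Pattern from this answer: dot-separated runs of non-dot characters.
const segmented = /^[\w!#$%&'*+\/=?^`{|}~-]+(?:\.[\w!#$%&'*+\/=?^`{|}~-]+)*$/;
// Pattern from the first answer: lookaheads forbid a leading or trailing dot.
const lookahead = /^(?!\.)(?!.*\.$)[\w.!#$%&'*+\/=?^`{|}~-]*$/;

const inputs = ['john.doe', '.john', 'john.', 'a..b', 'a', '', 'a b'];
for (const input of inputs) {
  console.log(JSON.stringify(input), segmented.test(input), lookahead.test(input));
}
// "john.doe" true  true
// ".john"    false false
// "john."    false false
// "a..b"     false true
// "a"        true  true
// ""         false true
// "a b"      false false
```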

What does the regular expression \|(?=\w=>) mean?

I am an amateur in JavaScript. I saw this other (now deleted) question, and it made me wonder. Can you tell me what does the below regular expression exactly mean?
split(/\|(?=\w=>)/)
Does it split the string with |?
The regular expression is contained in the slashes.
It means
\| # A pipe symbol. It needs to be escaped with a backslash
# because otherwise it means "OR"
(?= # a so-called lookahead group. It checks if its contents match
# at the current position without actually advancing in the string
\w=> # a word character (a-z, A-Z, 0-9, _) followed by =>
) # end of lookahead group.
It splits the string on | but only if it's followed by a char in [a-zA-Z0-9_] and =>
Example:
It will split a|b=> on the |
It will not split a|b on the |
It splits the string on every '|' that is followed by a word character (\w, shorthand for [a-zA-Z0-9_]) and then the character sequence '=>'.
Breakdown of the regular expression:
/ regular expression literal start delimiter
\| match | in the string, | is a special character in regex, so \ is used to escape it
(?= is a lookahead expression; it checks whether the text that follows matches its contents, without consuming it
\w=> matches a single word character (a letter, a digit or _), followed by =>
)/ marks the end of the lookahead expression and the end of the regex
In short, the string will be split on | if it is followed by any alphanumeric character or underscore and then =>.
In this case, the pipe character is escaped so it's treated as a literal pipe. The split occurs on pipes that are followed by any alphanumeric and '=>'.
The '|' is also used in regular expressions as a sort of OR operator. For example:
split(/k|i|tt|y/)
Would split on a 'k', an 'i', a 'tt' or a 'y'.
Trimming the delimiting characters, we get \|(?=\w=>)
| is a special character in regex, so it should be escaped with a backslash as \|
(?=REGEX) is syntax for positive look ahead: matches only if REGEX matches, but doesn't consume the substring that matches REGEX. The match to the REGEX doesn't become part of the matched result. Had it been mere \|\w=>, the parent string would be split around |a=> instead of |.
Thus /\|(?=\w=>)/ matches only those | characters that are followed by \w=>. It matches the | in |a=>, but not the one in |a> or ||.
Consider the example string from the linked question: a=>aa|b=>b||b|c=>cc. If it wasn't for the lookahead, split will yield an array of [a=>aa, b||b, cc]. With lookahead, you'll get [a=>aa, b=>b||b, c=>cc], which is the desired output.
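The two results described above can be reproduced directly. This is a minimal sketch using the example string; the variable names are made up.

```javascript
const input = 'a=>aa|b=>b||b|c=>cc';

// With the lookahead, only the pipe itself is consumed by the split.
const withLookahead = input.split(/\|(?=\w=>)/);
console.log(withLookahead);     // [ 'a=>aa', 'b=>b||b', 'c=>cc' ]

// Without it, the matched "|b=>" and "|c=>" are removed as separators.
const withoutLookahead = input.split(/\|\w=>/);
console.log(withoutLookahead);  // [ 'a=>aa', 'b||b', 'cc' ]
```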
