What does the regular expression \|(?=\w=>) mean? - javascript

I am an amateur in JavaScript. I saw this other (now deleted) question, and it made me wonder. Can you tell me exactly what the regular expression below means?
split(/\|(?=\w=>)/)
Does it split the string with |?

The regular expression is contained in the slashes.
It means
\| # A pipe symbol. It needs to be escaped with a backslash
# because otherwise it means "OR"
(?= # a so-called lookahead group. It checks if its contents match
# at the current position without actually advancing in the string
\w=> # a word character (a-z, A-Z, 0-9, _) followed by =>
) # end of lookahead group.

It splits the string on | but only if it's followed by a char in [a-zA-Z0-9_] and =>
Example:
It will split a|b=> on the |
It will not split a|b on the |

It splits the string on every '|' followed by (?) an alphanumerical character (\w, shorthand for [a-zA-Z0-9_]) + the character sequence '=>'.
Here's a link that can help you understand regular expressions in javascript

Breakdown of the regular expression:
/ regular expression literal start delimiter
\| match | in the string, | is a special character in regex, so \ is used to escape it
(?= Is a lookahead expression, it checks to see if a string follows the expression without matching it
\w=> matches a single alphanumeric character (or _), followed by =>
)/ marks the end of the lookahead expression and the end of the regex
In short, the string will be split on | if it is followed by any alphanumeric character or underscore and then =>.

In this case, the pipe character is escaped so it's treated as a literal pipe. The split occurs on pipes that are followed by any alphanumeric and '=>'.
The '|' is also used in regular expressions as a sort of OR operator. For example:
split(/k|i|tt|y/)
Would split on either a 'k', an 'i', a 'tt' or a 'y' character.

Trimming the delimiting characters, we get \|(?=\w=>)
| is a special character in regex, so it should be escaped with a backslash as \|
(?=REGEX) is syntax for positive look ahead: matches only if REGEX matches, but doesn't consume the substring that matches REGEX. The match to the REGEX doesn't become part of the matched result. Had it been mere \|\w=>, the parent string would be split around |a=> instead of |.
Thus /\|(?=\w=>)/ matches only those | characters that are followed by \w=>. It matches |a=> but not |a>, || etc.
Consider the example string from the linked question: a=>aa|b=>b||b|c=>cc. If it wasn't for the lookahead, split will yield an array of [a=>aa, b||b, cc]. With lookahead, you'll get [a=>aa, b=>b||b, c=>cc], which is the desired output.
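To see the difference concretely, the two splits can be run side by side. This is only a small sketch using the example string quoted above; the variable names are illustrative.

```javascript
// Example string taken from the answer above.
const input = 'a=>aa|b=>b||b|c=>cc';

// With the lookahead, only the pipe itself is consumed as the separator.
const withLookahead = input.split(/\|(?=\w=>)/);

// Without it, the whole "|x=>" sequence is consumed as the separator.
const withoutLookahead = input.split(/\|\w=>/);

console.log(withLookahead);    // [ 'a=>aa', 'b=>b||b', 'c=>cc' ]
console.log(withoutLookahead); // [ 'a=>aa', 'b||b', 'cc' ]
```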

Related

Javascript Regex to match only zero/even number of backslashes anywhere in a string

I need a regular expression that matches the complete string with a zero/even number of backslashes anywhere in the string. If the string contains an odd number of backslashes, it should not match the complete string.
Example:
\\ -> match
\\\ -> does not match
test\\test -> match
test\\\test -> does not match
test\\test\ -> does not match
test\\test\\ -> match
and so on...
Note: We can assume any string of any length in place of 'test' in the above example
I am using this ^[^\\]*(\\\\)*[^\\]*$ regular expression, but it does not match the backslashes after the second test.
For example:
test\\test(doesn't match anything after this)
Thanks for any help in advance.
You may use this regex:
^(?:(?:[^\\]*\\){2})*[^\\]*$
RegEx Breakdown:
^: Start
(?:: Start non-capture group #1
(?:: Start non-capture group #2
[^\\]*: Match 0 or more of any char except a \
\\: Match a \
){2}: End non-capture group #2. Repeat this group 2 times.
)*: End non-capture group #1. Repeat this group 0 or more times.
[^\\]*: Match 0 or more of any char except a \
$: End
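As a quick check, the regex above can be run against the examples from the question. This is a sketch, not part of the original answer; the names evenBackslashes, cases and results are made up for illustration. Note that in a JavaScript string literal every backslash has to be written as two.

```javascript
const evenBackslashes = /^(?:(?:[^\\]*\\){2})*[^\\]*$/;

// Each "\\" in these literals is ONE backslash in the actual text.
const cases = [
  ['\\\\', true],              // 2 backslashes
  ['\\\\\\', false],           // 3 backslashes
  ['test\\\\test', true],      // 2 backslashes between the words
  ['test\\\\\\test', false],   // 3 backslashes between the words
  ['test\\\\test\\', false],   // 2 in the middle, 1 at the end
  ['test\\\\test\\\\', true],  // 2 in the middle, 2 at the end
];

const results = cases.map(([text, expected]) => ({
  text,
  expected,
  actual: evenBackslashes.test(text),
}));

console.table(results);
```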
The current regular expression ^[^\\]*(\\\\)*[^\\]*$ can be interpreted as Any(\\)*Any, Where Any means any character except backslash.
The expected language shall be Any\\Any\\Any\\..., which can be obtained by containing the current regular expression in Kleene closure operator. That is (Any(\\)*Any)*
The original regular expression after modification:
^([^\\]*(\\\\)*[^\\]*)*$
It can be further optimized as:
^((\\\\)*[^\\]*)*$
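One thing worth checking before choosing between the two answers: the first regex counts the total number of backslashes, while the last one above only accepts backslashes that sit next to each other in pairs. Which behaviour is wanted depends on how "even number of backslashes" is read. A small sketch (variable names are illustrative):

```javascript
const countsAll = /^(?:(?:[^\\]*\\){2})*[^\\]*$/; // even total number of backslashes
const pairsOnly = /^((\\\\)*[^\\]*)*$/;            // backslashes must appear as adjacent pairs

const separated = 'a\\b\\c'; // text: a, backslash, b, backslash, c
const adjacent = 'a\\\\bc';  // text: a, backslash, backslash, b, c

const comparison = {
  separated: [countsAll.test(separated), pairsOnly.test(separated)],
  adjacent: [countsAll.test(adjacent), pairsOnly.test(adjacent)],
};

console.log(comparison.separated); // [ true, false ]
console.log(comparison.adjacent);  // [ true, true ]
```

The nested quantifiers in the second pattern may also backtrack heavily on long inputs that do not match, so it is probably worth benchmarking on realistic data.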

Regular expression to fetch beginning of string or a symbol

I am writing a function to find an attribute's value, given a string and an attribute name.
The input strings look like those below:
sip:+19999999999#trunkgroup2:5060;user=phone
<sip:+19999999999;tgrp=0180401;trunk-context=aaaa.aaaa.ca#10.10.10.100:8000;user=phone;transport=udp>
<sip:19999999999;tgrp=0306001;trunk-context=aaaa.aaaa.ca#10.10.10.100:8000;transport=udp>
<sip:+19999999999;tgrp=SMPPDIN;trunk-context=aaaa.aaaa.ca#10.10.10.100:8000;transport=udp>
After a few hours I came up with this regular expression: /(\Wsip[:,+,=]+)(\w+)/g, but it does not work for the first example, as there is no non-word character before the attribute name.
How can I fix this expression to handle both cases, <sip... and sip..., the latter only when it is at the beginning of the string?
I use this function to extract both sip and tgrp values.
Replace \W with \b, and use
\b(sip[:+=]+)(\w+)
Or, to match at the beginning of a string:
^\W?(sip[:+=]+)(\w+)
See the first regex demo and the second regex demo.
As \W is a consuming pattern matching any non-word char (a char other than a letter/digit/_) you won't have a match at the start of the string. A \b word boundary will match at the start of the string and in case there is a non-word char before s.
If you literally need to find a match at the beginning of a string after an optional non-word char, the \W must be replaced with ^\W? where ^ matches the start of a string, and \W? matches 1 or 0 non-word chars.
Also, note that , inside a character class is matched as a literal ,. If you mean to use it to enumerate chars, you should remove it.
Pattern details:
\b - a word boundary
OR
^ - start of string
\W? - 1 or 0 (due to the ? quantifier) non-word chars (i.e. chars other than letters/digits and _)
(sip[:+=]+) - Group 1: sip substring followed with one or more :, + or = chars
(\w+) - Group 2: one or more word chars.
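A short sketch of both suggested patterns against the inputs from the question. The helper sipValue is hypothetical and only returns what Group 2 captured; the g flag is left off so that exec starts from the beginning on every call.

```javascript
const inputs = [
  'sip:+19999999999#trunkgroup2:5060;user=phone',
  '<sip:+19999999999;tgrp=0180401;trunk-context=aaaa.aaaa.ca#10.10.10.100:8000;user=phone;transport=udp>',
  '<sip:19999999999;tgrp=0306001;trunk-context=aaaa.aaaa.ca#10.10.10.100:8000;transport=udp>',
];

const original = /(\Wsip[:,+,=]+)(\w+)/;  // pattern from the question (without g)
const anywhere = /\b(sip[:+=]+)(\w+)/;    // word-boundary version
const atStart = /^\W?(sip[:+=]+)(\w+)/;   // start-of-string version

// Hypothetical helper: returns the text captured by group 2, or null if there is no match.
function sipValue(regex, text) {
  const match = regex.exec(text);
  return match ? match[2] : null;
}

const originalOnFirst = sipValue(original, inputs[0]); // null: nothing precedes "sip"
const values = inputs.map((text) => [sipValue(anywhere, text), sipValue(atStart, text)]);

console.log(originalOnFirst);
console.log(values); // every entry is '19999999999'
```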
For the beginning of the line use ^, and to make < optional use ?
^<?(sip[:,+,=]+)(\w+)

Regex match character before and after underscore

I have to write a regex that matches the following:
String should start with alphabets - [a-zA-Z]
String can contain alphabets, spaces, numbers, _ and - (underscore and hyphen)
String should not end with _ or - (underscore and hyphen)
Underscore character should not have space before and after.
I came up with the following regex, but it doesn't seem to work
/^[a-zA-Z0-9]+(\b_|_\b)[a-zA-Z0-9]+$/
Test case:
HelloWorld // Match
Hello_World //Match
Hello _World // doesn't match
Hello_ World // doesn't match
Hello _ World // doesn't match
Hello_World_1 // Match
He110_W0rld // Match
Hello - World // Match
Hello-World // Match
_HelloWorld // doesn't match
Hello_-_World // match
You may use
^(?!.*(?:[_-]$|_ | _))[a-zA-Z][\w -]*$
See the regex demo
Explanation:
^ - start of string
(?!.*(?:[_-]$|_ | _)) - after some chars (.*) there must not appear ((?!...)) a _ or - at the end of string ([_-]$), nor space+_ or _+space
[a-zA-Z] - the first char matched and consumed must be an ASCII letter
[\w -]* - 0+ word (\w = [a-zA-Z0-9_]) chars or space or -
$ - end of string
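The pattern can be checked against the test cases listed in the question. This is a sketch only; validName, listedCases and failures are illustrative names.

```javascript
const validName = /^(?!.*(?:[_-]$|_ | _))[a-zA-Z][\w -]*$/;

// Test cases from the question, with the expected outcome for each.
const listedCases = {
  'HelloWorld': true,
  'Hello_World': true,
  'Hello _World': false,
  'Hello_ World': false,
  'Hello _ World': false,
  'Hello_World_1': true,
  'He110_W0rld': true,
  'Hello - World': true,
  'Hello-World': true,
  '_HelloWorld': false,
  'Hello_-_World': true,
};

// Collect every case whose actual result differs from the expected one.
const failures = Object.entries(listedCases)
  .filter(([text, expected]) => validName.test(text) !== expected)
  .map(([text]) => text);

console.log(failures.length === 0 ? 'all listed cases behave as expected' : failures);
```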
You could use this one:
^(?!^[ _-]|.*[ _-]$|.* _|.*_ )[\w -]*$
For the test cases I used modifier gm to match each line individually.
If an empty string should not be considered acceptable, then change the final * to a +:
^(?!^[ _-]|.*[ _-]$|.* _|.*_ )[\w -]+$
Meaning of each part
^ and $ match the beginning/ending of the input
(?! ): list of things that should not match:
|: logical OR
^[ _-]: starts with any of these three characters
.*[ _-]$: ends with any of these three characters
.* _: has space followed by underscore anywhere
.*_ : has underscore followed by space anywhere
[\w -]: any alphanumeric character or underscore (also matched by \w) or space or hyphen
*: zero or more times
+: one or more times
What about this?
^[a-zA-Z](\B_\B|[a-zA-Z0-9 -])*[a-zA-Z0-9 ]$
Broken down:
^
[a-zA-Z] allowed characters at beginning
(
\B_\B underscore with no word boundary on either side
| or
[a-zA-Z0-9 -] other allowed characters
)*
[a-zA-Z0-9 ] allowed characters at end
$
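A small sketch of how \B behaves around the underscore in this pattern (variable names are illustrative). One thing it suggests: since - is not a word character, there is a word boundary between _ and -, so Hello_-_World from the question's test cases does not appear to match here.

```javascript
const noBoundary = /^[a-zA-Z](\B_\B|[a-zA-Z0-9 -])*[a-zA-Z0-9 ]$/;

const boundaryChecks = {
  // "o_W": word characters on both sides of _, so \B holds on both sides.
  plain: noBoundary.test('Hello_World'),
  // " _": a space before _ creates a word boundary, so \B fails.
  spaceBefore: noBoundary.test('Hello _World'),
  // "_-": a hyphen after _ also creates a word boundary, so \B fails.
  hyphenAfter: noBoundary.test('Hello_-_World'),
};

console.log(boundaryChecks); // { plain: true, spaceBefore: false, hyphenAfter: false }
```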
Oh! I love me some regex!
Would this work? /^[a-z]$|^[a-z](?:_(?=[^ ]))?(?:[a-z\d -][^ ]_[^ ])*[a-z\d -]*[^_-]$/i
I was a tad unsure of rule 4--do you mean underscores can have a space before or after or neither, but not before and after?

Regex match a dollar, without a backslash before it

I want to match only a dollar symbol without a backslash immediately before, as demonstrated below:
$not\$yes $
^.........^
So far, I have [^\\]\$, but this doesn't match any dollar that begins a line. The dollar could be the first symbol in the document, so matching a newline would not work. How do I match this? Is the regex I have so far even right?
You could use an alternation with the ^ anchor in order to match the $ character literally if it is the first character in the string or if it follows a character that is not a backslash.
/(?:^|[^\\])\$/
Explanation:
(?: - Start of a non-capturing group that is used to group the alternation.
^|[^\\] - Alternation that matches the start of the string using the ^ anchor or match a non-\ character
) - Close the non-capturing group that was used to group ^|[^\\]
\$ - The $ character literally
In other words, the ^ anchor will match the start of the string; while [^\\] will match anything but a backslash. The pipe | acts as an "or" operator that will match the start of the string or anything but a backslash (i.e., ^|[^\\]).
So in the string you provided, the first/last $ character would be matched.
Use a negative lookbehind assertion
(?<!\\)\$
In Action: https://regex101.com/r/dA8aA1/1
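The two approaches find the same dollar signs but report different match text, which matters when replacing. The sketch below assumes a JavaScript engine with lookbehind support (ES2018) and matchAll (ES2020); the names are illustrative.

```javascript
const text = '$not\\$yes $'; // the sample from the question; "\\" is one backslash

const alternation = /(?:^|[^\\])\$/g;
const lookbehind = /(?<!\\)\$/g;

// Lists the position and text of every match of `regex` in `text`.
const describe = (regex) =>
  [...text.matchAll(regex)].map((m) => ({ index: m.index, matched: m[0] }));

const viaAlternation = describe(alternation);
const viaLookbehind = describe(lookbehind);

console.log(viaAlternation); // the second match is ' $' at index 9 (the space is consumed)
console.log(viaLookbehind);  // the second match is '$' at index 10
```

With the alternation, the character before the dollar is part of the match, so a replacement would have to put it back (for example with a capturing group).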

How do I combine these two regular expressions into one?

I'm writing a rudimentary lexer using regular expressions in JavaScript and I have two regular expressions (one for single quoted strings and one for double quoted strings) which I wish to combine into one. These are my two regular expressions (I added the ^ and $ characters for testing purposes):
var singleQuotedString = /^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$/gi;
var doubleQuotedString = /^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$/gi;
Now I tried to combine them into a single regular expression as follows:
var string = /^(["'])(?:[^\1\\]|\\\1|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*\1$/gi;
However when I test the input "Hello"World!" it returns true instead of false:
alert(string.test('"Hello"World!"')); //should return false as a double quoted string must escape double quote characters
I figured that the problem is in [^\1\\] which should match any character besides matching group \1 (which is either a single or a double quote - the delimiter of the string) and \\ (which is the backslash character).
The regular expression correctly filters out backslashes and matches the delimiters, but it doesn't filter out the delimiter within the string. Any help will be greatly appreciated. Note that I referred to Crockford's railroad diagrams to write the regular expressions.
You can't refer to a matched group inside a character class: (['"])[^\1\\]. Try something like this instead:
(['"])((?!\1|\\).|\\[bnfrt]|\\u[a-fA-F\d]{4}|\\\1)*\1
(you'll need to add some more escapes, but you get my drift...)
A quick explanation:
(['"]) # match a single or double quote and store it in group 1
( # start group 2
(?!\1|\\). # if group 1 or a backslash isn't ahead, match any non-line break char
| # OR
\\[bnfrt] # match an escape sequence
| # OR
\\u[a-fA-F\d]{4} # match a Unicode escape
| # OR
\\\1 # match an escaped quote
)* # close group 2 and repeat it zero or more times
\1 # match whatever group 1 matched
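A sketch of the suggested pattern with the ^ and $ anchors from the question added for testing. The last two lines illustrate why [^\1\\] did not help: as far as I know, in non-Unicode mode JavaScript treats \1 inside a character class as an octal escape for the character U+0001, not as a backreference.

```javascript
const quoted = /^(['"])((?!\1|\\).|\\[bnfrt]|\\u[a-fA-F\d]{4}|\\\1)*\1$/;

const checks = {
  unescapedInner: quoted.test('"Hello"World!"'),  // delimiter inside, not escaped
  escapedInner: quoted.test('"Hello\\"World!"'),  // delimiter inside, escaped
  singleQuoted: quoted.test("'it\\'s'"),          // single-quoted with escaped '
  otherQuote: quoted.test('"it\'s"'),             // the other quote type needs no escape
};

// Assumption: legacy (non-u) mode, where \1 in a class means U+0001.
const classEscapeIsOctal = /[\1]/.test('\u0001');
const quoteAllowedByClass = /[^\1\\]/.test('"');

console.log(checks); // { unescapedInner: false, escapedInner: true, singleQuoted: true, otherQuote: true }
console.log(classEscapeIsOctal, quoteAllowedByClass); // true true
```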
This should work too (raw regex).
If speed is a factor, this is the 'unrolled' method, said to be the fastest for this kind of thing.
(['"])(?:(?!\\|\1).)*(?:\\(?:[\/bfnrt]|u[0-9A-F]{4}|\1)(?:(?!\\|\1).)*)*\1
Expanded
(['"]) # Capture a quote
(?:
(?!\\|\1). # As many non-escape and non-quote chars as possible
)*
(?:
\\ # escape plus,
(?:
[\/bfnrt] # /,b,f,n,r,t or u[0-9A-F]{4} or captured quote
| u[0-9A-F]{4}
| \1
)
(?:
(?!\\|\1). # As many non-escape and non-quote chars as possible
)*
)*
\1 # Captured quote
Well, you can always just create a larger regex by just using the alternation operator on the smaller regexes
/(?:single-quoted-regex)|(?:double-quoted-regex)/
Or explicitly:
var string = /(?:^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$)|(?:^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$)/gi;
Finally, if you want to avoid the code duplication, you can build up this regex dynamically, using the RegExp constructor.
var quoted_string = function(delimiter){
    return ('^' + delimiter + '(?:[^' + delimiter + '\\]|\\' + delimiter + '|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*' + delimiter + '$').replace(/\\/g, '\\\\');
    // In the general case you could consider using a regex escaping function to avoid backslash hell.
};
var string = new RegExp( '(?:' + quoted_string("'") + ')|(?:' + quoted_string('"') + ')' , 'gi' );
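One possible way to reduce the backslash doubling is String.raw, which keeps backslashes in a template literal as written. The sketch below is an alternative formulation, not the answer's code: quotedStringSource is a hypothetical helper, and the escape list is folded into a character class. The g flag is left off because, with g, test() keeps state in lastIndex between calls.

```javascript
// Hypothetical helper: regex source for a string delimited by `d` (a quote character).
function quotedStringSource(d) {
  // String.raw leaves every backslash exactly as typed, so this reads like regex source.
  return String.raw`${d}(?:[^${d}\\]|\\${d}|\\[\\\/bfnrt]|\\u[0-9A-F]{4})*${d}`;
}

const quotedString = new RegExp(
  `^(?:${quotedStringSource("'")}|${quotedStringSource('"')})$`,
  'i'
);

const outcomes = {
  unescapedInner: quotedString.test('"Hello"World!"'),  // false
  escapedInner: quotedString.test('"Hello\\"World!"'),  // true
  singleQuoted: quotedString.test("'it\\'s'"),          // true
  unicodeEscape: quotedString.test('"\\u00e9"'),        // true
  unknownEscape: quotedString.test('"bad\\x"'),         // false
};

console.log(outcomes);
```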
