Regex windows path validator - javascript

I've tried to find a Windows file path validator for JavaScript, but none fulfilled my requirements, so I decided to build one myself.
The requirements are the following:
the path should not be empty
may begin with x:\, x:\\, \, // and followed by a filename (no file
extension required)
filenames cannot include the following special characters: <>:"|?*
filenames cannot end with dot or space
Here is the regex I came up with:
/^([a-z]:((\\|\/|\\\\|\/\/))|(\\\\|\/\/))[^<>:"|?*]+/i
But there are some issues:
it validates also filenames that include the special characters
mentioned in the rules
it doesn't include the last rule (cannot end with: . or space)
var reg = new RegExp(/^([a-z]:((\\|\/|\\\\|\/\/))|(\\\\|\/\/))[^<>:"|?*]+/i);
var startList = [
'C://test',
'C://te?st.html',
'C:/test',
'C://test.html',
'C://test/hello.html',
'C:/test/hello.html',
'//test',
'/test',
'//test.html',
'//10.1.1.107',
'//10.1.1.107/test.html',
'//10.1.1.107/test/hello.html',
'//10.1.1.107/test/hello',
'//test/hello.txt',
'/test/html',
'/tes?t/html',
'/test.html',
'test.html',
'//',
'/',
'\\\\',
'\\',
'/t!esrtr',
'C:/hel**o'
];
startList.forEach(item => {
document.write(reg.test(item) + ' >>> ' + item);
document.write("<br>");
});

Unfortunately, the JavaScript flavour of regex did not support lookbehinds
when this was written (they arrived with ES2018), but it does support
lookaheads, and that is the key to constructing the regex.
Let's start from some observations:
1. After a dot, slash, backslash or space there cannot occur another
dot, slash or backslash. The set of "forbidden" chars also includes
\n, because none of these chars can be the last char of the file name
or its segment (between dots or (back-)slashes).
2. The other chars allowed in the path are the ones you mentioned
(other than ...), but the "exclusion list" must also include a dot,
slash, backslash, space and \n (the chars mentioned in point 1).
3. After the "initial part" (C:\) there can be multiple instances of
the chars mentioned in point 1 or 2.
Taking these points into account, I built the regex from 3 parts:
1. "Starting" part, matching the drive letter, a colon and up to 2
slashes (forward or backward).
2. The first alternative - either a dot, slash, backslash or a space,
with negative lookahead - a list of "forbidden" chars after each of
the above chars (see point 1).
3. The second alternative - chars mentioned in point 2.
Both the above alternatives can occur multiple times (+ quantifier).
So the regex is as follows:
^ - Start of the string.
(?:[a-z]:)? - Drive letter and a colon, optional.
[\/\\]{0,2} - Either a backslash or a slash, between 0 and 2 times.
(?: - Start of the non-capturing group, needed due to the +
quantifier after it.
[.\/\\ ] - The first alternative.
(?![.\/\\\n]) - Negative lookahead - "forbidden" chars.
| - Or.
[^<>:"|?*.\/\\ \n] - The second alternative.
)+ - End of the non-capturing group, may occur multiple times.
$ - End of the string.
If you match each path separately, use only the i option.
But if you have multiple paths on separate lines and match them
globally in one go, also add the g and m options.
For a working example see https://regex101.com/r/4JY31I/1
Note: I suppose that ! should also be treated as a forbidden
character. If you agree, add it to the second alternative, e.g. after *.
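As a rough sketch of how the assembled pattern could be used from JavaScript, the pieces listed above combine into a single literal. The constant and function names are illustrative only and are not part of the answer:

```javascript
// Hypothetical names; the pattern itself is the one assembled above.
const WINDOWS_PATH_RE =
  /^(?:[a-z]:)?[\/\\]{0,2}(?:[.\/\\ ](?![.\/\\\n])|[^<>:"|?*.\/\\ \n])+$/i;

function looksLikeWindowsPath(path) {
  return WINDOWS_PATH_RE.test(path);
}

const samples = [
  'C:/test',
  'C:/test/hello.html',
  'C://te?st.html',
  'C:/hel**o',
  'C:/test//hello',
];
for (const sample of samples) {
  console.log(looksLikeWindowsPath(sample) + ' >>> ' + sample);
}
```

One thing to watch: when a single path is tested and no newline follows it, the negative lookahead is also satisfied at the very end of the input, so a path such as C:/test. still passes. An extra check on the last character may be needed to enforce that rule.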

This may work for you: ^(?!.*[\\\/]\s+)(?!(?:.*\s|.*\.|\W+)$)(?:[a-zA-Z]:)?(?:(?:[^<>:"\|\?\*\n])+(?:\/\/|\/|\\\\|\\)?)+$
You have a demo here
Explained:
^
(?!.*[\\\/]\s+) # Disallow files beginning with spaces
(?!(?:.*\s|.*\.|\W+)$) # Disallow paths made only of bars, or ending with a dot/space
(?:[a-zA-Z]:)? # Drive letter (optional)
(?:
(?:[^<>:"\|\?\*\n])+ # Word (one or more characters other than the forbidden ones)
(?:\/\/|\/|\\\\|\\)? # Bars (// or / or \\ or \); Optional
)+ # Repeated one or more
$
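A minimal sketch of this pattern in use, assuming it is wrapped in a small helper (the names below are made up for illustration):

```javascript
// Hypothetical wrapper around the pattern from this answer.
const PATH_RE =
  /^(?!.*[\\\/]\s+)(?!(?:.*\s|.*\.|\W+)$)(?:[a-zA-Z]:)?(?:(?:[^<>:"\|\?\*\n])+(?:\/\/|\/|\\\\|\\)?)+$/;

function isValidPath(path) {
  return PATH_RE.test(path);
}

// Valid paths first, then one sample per rule that should be rejected.
const checks = ['C:/test', 'C:/test/hello.html', 'C:/te?st', 'C:/test.', 'C:/test ', 'C:/ test', '//'];
for (const item of checks) {
  console.log(isValidPath(item) + ' >>> ' + JSON.stringify(item));
}
```

The two leading lookaheads do the rejecting for spaces after a bar and for a trailing dot or space, so the main body only has to exclude the forbidden characters.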

Since this post seems to be (one of) the top result(s) in a search for a RegEx Windows path validation pattern, and given the caveats / weaknesses of the above proposed solutions, I'll include the solution that I use for validating Windows paths (and which, I believe, addresses all of the points raised previously in that use-case).
I could not come up with a single viable REGEX, with or without look-aheads and look behinds that would do the job, but I could do it with two, without any look-aheads, or -behinds!
Note, though, that successive relative paths (i.e. "..\..\folder\file.exe") will not pass this pattern (though using "..\" or ".\" at the beginning of the string will). Periods and spaces before and after slashes, or at the end of the line are failed, as well as any character not permitted according to Microsoft's short-filename specification:
https://learn.microsoft.com/en-us/windows/win32/msi/filename
First Pattern:
^ (?# <- Start at the beginning of the line #)
(?# validate the opening drive or path delimiter, if present -> #)
(?: (?# "C:", "C:\", "C:..\", "C:.\" -> #)
(?:[A-Z]:(?:\.{1,2}[\/\\]|[\/\\])?)
| (?# or "\", "..\", ".\", "\\" -> #)
(?:[\/\\]{1,2}|\.{1,2}[\/\\])
)?
(?# validate the form and content of the body -> #)
(?:[^\x00-\x1A|*?\v\r\n\f+\/,;"'`\\:<>=[\]]+[\/\\]?)+
$ (?# <- End at the end of the line. #)
This will generally validate the path structure and character validity, but it also allows problematic things like double periods, double backslashes, and periods or backslashes that are preceded and/or followed by spaces or periods. Paths that end with spaces and/or periods are also permitted.
To address these problems I perform a second test with another (similar) pattern:
^ (?# <- Start at the beginning of the line #)
(?# validate the opening drive or path delimiter, if present -> #)
(?: (?# "C:", "C:\", "C:..\", "C:.\" -> #)
(?:[A-Z]:(?:\.{1,2}[\/\\]|[\/\\])?)
| (?# or "\", "..\", ".\", "\\" -> #)
(?:[\/\\]{1,2}|\.{1,2}[\/\\])
)?
(?# ensure that undesired patterns aren't present in the string -> #)
(?:([^\/\\. ]|[^\/. \\][\/. \\][^\/. \\]|[\/\\]$)*
[^\x00-\x1A|*?\s+,;"'`:<.>=[\]]) (?# <- Ensure that the last character is valid #)
$ (?# <- End at the end of the line. #)
This validates that, within the path body, no multiple-periods, multiple-slashes, period-slashes, space-slashes, slash-spaces or slash-periods occur, and that the path doesn't end with an invalid character. Annoyingly, I have to re-validate the <root> group because it's the one place where some of these combinations are allowed (i.e. ".\", "\\", and "..\") and I don't want those to invalidate the pattern.
Here is an implementation of my test (in C#):
/// <summary>Performs pattern testing on a string to see if it's in a form recognizable as an absolute path.</summary>
/// <param name="test">The string to test.</param>
/// <param name="testExists">If TRUE, this also verifies that the specified path exists.</param>
/// <returns>TRUE if the contents of the passed string are valid, and, if requested, the path exists.</returns>
public bool ValidatePath( string test, bool testExists = false )
{
    // Regex.IsMatch throws on a null input, so bail out before matching.
    if (string.IsNullOrWhiteSpace(test)) return false;
    string
        drivePattern = /* language=regex */
            @"^(([A-Z]:(?:\.{1,2}[\/\\]|[\/\\])?)|([\/\\]{1,2}|\.{1,2}[\/\\]))?",
        pattern = drivePattern + /* language=regex */
            @"([^\x00-\x1A|*?\t\v\f\r\n+\/,;""'`\\:<>=[\]]+[\/\\]?)+$";
    // First test: overall structure and permitted characters.
    bool result = Regex.IsMatch( test, pattern, RegexOptions.ExplicitCapture );
    pattern = drivePattern + /* language=regex */
        @"(([^\/\\. ]|[^\/. \\][\/. \\][^\/. \\]|[\/\\]$)*[^\x00-\x1A|*?\s+,;""'`:<.>=[\]])$";
    // Second test: no doubled or misplaced periods, spaces or slashes.
    result &= Regex.IsMatch( test, pattern, RegexOptions.ExplicitCapture );
    return result && (!testExists || Directory.Exists( test ));
}
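Since the question is about JavaScript, here is a rough port of the same two-pass idea. It is only a sketch: the names are invented, the inline (?# ...) comments are dropped because JavaScript does not support them, and the Directory.Exists step is left out.

```javascript
// Shared optional root: "C:", "C:\", "C:..\", "\", "\\", ".\", "..\".
const ROOT = /^(?:[A-Z]:(?:\.{1,2}[\/\\]|[\/\\])?|[\/\\]{1,2}|\.{1,2}[\/\\])?/.source;

// Pass 1: overall structure and permitted characters.
const FIRST_PASS = new RegExp(
  ROOT + /(?:[^\x00-\x1A|*?\t\v\f\r\n+\/,;"'`\\:<>=[\]]+[\/\\]?)+$/.source
);

// Pass 2: reject doubled or misplaced periods, spaces and slashes,
// and make sure the last character is acceptable.
const SECOND_PASS = new RegExp(
  ROOT + /(?:(?:[^\/\\. ]|[^\/. \\][\/. \\][^\/. \\]|[\/\\]$)*[^\x00-\x1A|*?\s+,;"'`:<.>=[\]])$/.source
);

function validatePath(test) {
  if (typeof test !== 'string' || test.trim() === '') return false;
  return FIRST_PASS.test(test) && SECOND_PASS.test(test);
}

console.log(validatePath('C:\\folder\\file.exe'));   // well-formed path
console.log(validatePath('C:\\folder \\file.exe'));  // space before a backslash
```

As in the C# version, the drive letter is matched case-sensitively because no ignore-case option is set.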

Related

Regular Expression to only get a specific line

I am attempting to only extract a specific line without any other characters after. For example:
permit ip any any
permit oped any any eq 10.52.5.15
permit top any any (sdfg)
permit sdo any host 10.51.86.17 eq sdg
I would like to match only the first line permit ip any any and not the others. A thing to take note is that the second word ip can be any word.
Meaning, I find only permit (anyword) any any and if there was a character after the second any, do not match.
I tried to do \bpermit.\w+.(?:any.any).([$&+,:;=?@#|'<>.^*()%!-\w].+) but that finds the other lines except the permit ip any any. I did attempt to do a reverse lookup, but to no success.
Use the $ end of line anchor after the final "any" and the m multiline regexp flag.
/^permit \w+ any any$/gm
https://regex101.com/r/FfOp5k/2
If you are using Java based regex, you can include the multiline flag in the expression. This syntax is not supported by JavaScript regex.
(?m)^permit \w+ any any$
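For the JavaScript form with the g and m flags, a small sketch against the sample lines from the question might look like this (the variable names are illustrative):

```javascript
const lines = [
  'permit ip any any',
  'permit oped any any eq 10.52.5.15',
  'permit top any any (sdfg)',
  'permit sdo any host 10.51.86.17 eq sdg',
].join('\n');

// g collects every match; m lets ^ and $ anchor at each line break.
const exactPermits = lines.match(/^permit \w+ any any$/gm);
console.log(exactPermits);
```

Without the m flag, $ would only match at the end of the whole input, so the first line would not be found.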
I tried to do \bpermit.\w+.(?:any.any).([$&+,:;=?@#|'<>.^*()%!-\w].+) but that finds the other lines except the permit ip any any. I did attempt to do a reverse lookup, but to no success.
Let's take apart your regex to see what it says:
\b # starting on a word boundary (space to non space or reverse)
permit # look for the literal characters "permit" in that order
. # followed by any character
\w+ # followed by word characters (letters, numbers, underscores)
. # followed by any character
(?: # followed by a non-capturing group that contains
any # the literal characters 'any'
. # any character
any # the literal characters 'any'
)
. # followed by any character <-- ERROR HERE!
( # followed by a capturing group
[$&+,:;=?@#|'<>.^*()%!-\w] # any one of these many characters or word characters
.+ # then any character, one or more times
)
The behavior you describe...
but that finds the other lines except the permit ip any any.
matches what you've specified. Specifically, the regex above requires that there be characters after the 'any any'. Because permit \w+ any any does not have any characters after the any any part, the regex fails at the <-- ERROR HERE! mark in my breakdown above.
If that last part must be captured (using a capturing group) but it may not exist, you can make that entire last part optional using the ? character.
This would look like:
permit \w+ any any(?: (.+))?
for a breakdown of:
permit # the word permit
[ ] # a literal space
\w+ # one or more word characters
[ ] # a literal space
any # the word any
[ ] # another literal space
any # another any; all of this is required.
(?: # a non-capturing group to start the "optional" part
[ ] # a literal space after the any
(.+) # everything else, including spaces, and capture it in a group
)? # end non-capturing group, but make it optional
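A short sketch of how the optional group behaves in JavaScript, assuming a hypothetical helper that returns the captured tail:

```javascript
const OPTIONAL_TAIL = /permit \w+ any any(?: (.+))?/;

function trailingPart(line) {
  const match = OPTIONAL_TAIL.exec(line);
  if (!match) return null; // not a "permit <word> any any" line at all
  // When the optional group did not take part, match[1] is undefined.
  return match[1] === undefined ? '' : match[1];
}

console.log(trailingPart('permit ip any any'));
console.log(trailingPart('permit oped any any eq 10.52.5.15'));
console.log(trailingPart('permit sdo any host 10.51.86.17 eq sdg'));
```

The helper distinguishes three cases: no match at all, a match with nothing after the second any, and a match with a captured remainder.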

How to extract an optional query parameter using regex in Javascript

I'd like to construct a regex that will check for a "path" and a "foo" parameter (non-negative integer). "foo" is optional. It should:
MATCH
path?foo=67 # path found, foo = 67
path?foo=67&bar=hello # path found, foo = 67
path?bar=bye&foo=1&baz=12 # path found, foo = 1
path?bar=123 # path found, foo = ''
path # path found, foo = ''
DO NOT MATCH
path?foo=37signals # foo is not integer
path?foo=-8 # foo cannot be negative
something?foo=1 # path not found
Also, I'd like to get the value of foo, without performing an additional match.
What would be the simplest regex to achieve this?
The Answer
Screw your hard work, I just want the answer! Okay, here you go...
var regex = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/,
URIs = [
'path', // valid!
'pathbreak', // invalid path
'path?foo=123', // valid!
'path?foo=-123', // negative
'invalid?foo=1', // invalid path
'path?foo=123&bar=abc', // valid!
'path?bar=abc&foo=123', // valid!
'path?bar=foo', // valid!
'path?foo', // valid!
'path#anchor', // valid!
'path#foo=bar', // valid!
'path?foo=123#bar', // valid!
'path?foo=123abc', // not an integer
];
for(var i = 0; i < URIs.length; i++) {
var URI = URIs[i],
match = regex.exec(URI);
if(match) {
var foo = match[1] ? match[1] : 'null';
console.log(URI + ' matched, foo = ' + foo);
} else {
console.log(URI + ' is invalid...');
}
}
Research
Your bounty request asks for "credible and/or official sources", so I'll quote the RFC on query strings.
The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.
This seems pretty vague on purpose: a query string starts with the first ? and is terminated by a # (start of anchor) or the end of the URI (or string/line in our case). They go on to mention that most data sets are in key=value pairs, which seems to be what you expect to parse (so let's assume that is the case).
However, as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.
With all this in mind, let's assume a few things about your URIs:
Your examples start with the path, so the path will be from the beginning of the string until a ? (query string), # (anchor), or the end of the string.
The query string is the iffy part, since the RFC doesn't really define a "norm". A browser tends to expect a query string to be generated from a form submission and be a list of key=value pairs joined by & characters. Keeping this mentality:
A key cannot be null, will be preceded by a ? or &, and cannot contain a =, & or #.
A value is optional, will be preceded by key=, and cannot contain a & or #.
Anything after a # character is the anchor.
Let's Begin!
Let's start by mapping out our basic URI structure. You have a path, which is characters starting at the string and up until a ?, #, or the end of the string. You have an optional query string, which starts at a ? and goes until a # or the end of the string. And you have an optional anchor, which starts at a # and goes until the end of the string.
^
([^?#]+)
(?:
\?
([^#]+)
)?
(?:
#
(.*)
)?
$
Let's do some clean up before digging into the query string. You can easily require the path to equal a certain value by replacing the first capture group. Whatever you replace it with (path), will have to be followed by an optional query string, an optional anchor, and the end of the string (no more, no less). Since you don't need to parse the anchor, the capturing group can be replaced by ending the match at either a # or the end of the string (which is the end of the query parameter).
^path
(?:
\?
([^#]+)
)?
(?=#|$)
Stop Messing Around
Okay, I've been doing a lot of setup without really worrying about your specific example. The next example will match a specific path (path) and optionally match a query string while capturing the value of a foo parameter. This means you could stop here and check for a valid match..if the match is valid, then the first capture group must be null or a non-negative integer. But that wasn't your question, was it. This got a lot more complicated, so I'm going to explain the expression inline:
^ (?# match beginning of the string)
path (?# match path literally)
(?: (?# begin optional non-capturing group)
(?=\?) (?# lookahead for a literal ?)
(?: (?# begin repeating non-capturing group)
[?&] (?# keys are preceded by ? or &)
foo (?# match key literally)
(?: (?# begin optional non-capturing group)
= (?# values are preceded by =)
([^&#]*) (?# values are 0+ length and do not contain & or #)
)? (?# end optional non-capturing group)
| (?# OR)
[^#] (?# query strings are non-# characters)
)+ (?# end repeating non-capturing group)
)? (?# end optional non-capturing group)
(?=#|$) (?# lookahead for a literal # or end of the string)
Some key takeaways here:
Javascript doesn't support lookbehinds, meaning you can't look behind for a ? or & before the key foo, meaning you actually have to match one of those characters, meaning the start of your query string (which looks for a ?) has to be a lookahead so that you don't actually match the ?. This also means that your query string will always be at least one character (the ?), so you want to repeat the query string [^#] 1+ times.
The query string now repeats one character at a time in a non-capturing group..unless it sees the key foo, in which case it captures the optional value and continues repeating.
Since this non-capture query string group repeats all the way until the anchor or end of the URI, a second foo value (path?foo=123&foo=bar) would overwrite the initial captured value..meaning you wouldn't 100% be able to rely on the above solution.
Final Solution?
Okay..now that I've captured the foo value, it's time to kill the match on values that are not non-negative integers.
^ (?# match beginning of the string)
path (?# match path literally)
(?: (?# begin optional non-capturing group)
(?=\?) (?# lookahead for a literal ?)
(?: (?# begin repeating non-capturing group)
[?&] (?# keys are preceded by ? or &)
foo (?# match key literally)
= (?# values are preceded by =)
(\d*) (?# value must be a non-negative integer)
(?= (?# begin lookahead)
[&#] (?# literally match & or #)
| (?# OR)
$ (?# match end of the string)
) (?# end lookahead)
| (?# OR)
(?! (?# begin negative lookahead)
[?&] (?# literally match ? or &)
foo= (?# literally match foo=)
) (?# end negative lookahead)
[^#] (?# query strings are non-# characters)
)+ (?# end repeating non-capturing group)
)? (?# end optional non-capturing group)
(?=#|$) (?# lookahead for a literal # or end of the string)
Let's take a closer look at some of the juju that went into that expression:
After finding foo=\d*, we use a lookahead to ensure that it is followed by a &, #, or the end of the string (the end of a query string value).
However..if there is more to foo=\d*, the regex would be kicked back by the alternator to a generic [^#] match right at the [?&] before foo. This isn't good, because it will continue to match! So before you look for a generic query string ([^#]), you must make sure you are not looking at a foo (that must be handled by the first alternation). This is where the negative lookahead (?![?&]foo=) comes in handy.
This will work with multiple foo keys, since they will all have to equal non-negative integers. This lets foo be optional (or equal null) as well.
Disclaimer: Most Regex101 demos use PHP for better syntax highlighting and include \n in negative character classes since there are multiple lines of examples.
Nice question! Seems fairly simple at first...but there are a lot of gotchas. Would advise checking any claimed solution will handle the following:
ADDITIONAL MATCH TESTS
path? # path found, foo = ''
path#foo # path found, foo = ''
path#bar # path found, foo = ''
path?foo= # path found, foo = ''
path?bar=1&foo= # path found, foo = ''
path?foo=&bar=1 # path found, foo = ''
path?foo=1#bar # path found, foo = 1
path?foo=1&foo=2 # path found, foo = 2
path?foofoo=1 # path found, foo = ''
path?bar=123&foofoo=1 # path found, foo = ''
ADDITIONAL DO NOT MATCH TESTS
pathbar? # path not found
pathbar?foo=1 # path not found
pathbar?bar=123&foo=1 # path not found
path?foo=a&foofoo=1 # not an integer
path?foofoo=1&foo=a # not an integer
The simplest regex I could come up with that works for all these additional cases is:
path(?=(\?|$|#))(\?(.+&)?foo=(\d*)(&|#|$)|((?![?&]foo=).)*$)
However, would advise adding ?: to the unused capturing groups so they are ignored and you can easily get the foo value from Group 1 - see Debuggex Demo
path(?=(?:\?|$|#))(?:\?(?:.+&)?foo=(\d*)(?:&|#|$)|(?:(?![?&]foo=).)*$)
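To show how the foo value could be read from that version, here is a minimal sketch. The helper name and the returned object shape are assumptions made for the example, and the pattern is left unanchored exactly as written above:

```javascript
const FOO_RE =
  /path(?=(?:\?|$|#))(?:\?(?:.+&)?foo=(\d*)(?:&|#|$)|(?:(?![?&]foo=).)*$)/;

function parseFoo(uri) {
  const match = FOO_RE.exec(uri);
  if (!match) return null; // path not found, or foo is not a non-negative integer
  // Group 1 is undefined when the second alternative (no foo) matched.
  return { foo: match[1] === undefined ? '' : match[1] };
}

const uris = [
  'path?foo=67',
  'path?bar=bye&foo=1&baz=12',
  'path?bar=123',
  'path',
  'path?foo=37signals',
  'path?foo=-8',
  'something?foo=1',
];
for (const uri of uris) {
  console.log(uri + ' -> ' + JSON.stringify(parseFoo(uri)));
}
```

Because only one capturing group remains, the value is always in match[1] and no second match is needed.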
^path\b(?!.*[?&]foo=(?!\d+(?=&|#|$)))(?:.*[?&]foo=(\d+)(?=&|#|$))?
Basically I just broke it down into three parts
^path\b # starts with path
(?!.*[?&]foo=(?!\d+(?=&|#|$))) # not followed by foo with an invalid value
(?:.*[?&]foo=(\d+)(?=&|#|$))? # possibly followed by foo with a valid value
see validation here http://regexr.com/39i7g
Caveats:
will match path#bar=1&foo=27
will not match path?foo=
The OP didn't mention these requirements and since he wants a simple regex (oxymoron?) I did not attempt to solve them.
path.+?(?:foo=(\d+))(?![a-zA-Z\d])|path((?!foo).)*$
You can try this. See demo.
http://regex101.com/r/jT3pG3/10
You can try the following regex:
path(?:.*?foo=(\d+)\b|()(?!.*foo))
regex101 demo
There are two possible matches after path:
.*?foo=(\d+)\b i.e. foo followed by digits.
OR
()(?!.*foo) an empty string if there is no foo ahead.
Add some word boundaries (\b) if you don't want the regex to interpret other words (e.g. another parameter named barfoobar) around the foos.
path(?:.*?\bfoo=(\d+)\b|()(?!.*\bfoo\b))
You can check for the existence of the 3rd matched group. If it is not there, the foo value would be null; otherwise, it is the group itself:
/^(path)(?:$|\?(?:(?=.*\b(foo=)(\d+)\b.*$)|(?!foo=).*?))/gm
An example on regex101: http://regex101.com/r/oP6lU7/1
Writing regular expressions for the JavaScript engine is somehow enjoyable, despite everything it lacks compared with PCRE!
I made this RegEx, simple and understandable:
^(?=path\?).*foo=(\d*)(?:&|$)|path$
Explanations
^(?=path\?) # A positive lookahead to ensure we have "path?" at the very beginning
.*foo=(\d*)(?:&|$) # Look for foo= followed by zero or more digits, then a "&" character or the end of the string
| # OR
path$ # Just "path" itself
Runnable snippet:
var re = /^(?=path\?).*foo=(\d*)(?:&|$)|path$/gm;
var str = 'path?foo=67\npath?foo=67&bar=hello\npath?bar=bye&foo=1&baz=12\npath\npathtest\npath?foo=37signals\npath?foo=-8\nsomething?foo=1';
var m, n = [];
while ((m = re.exec(str)) != null) {
if (m.index === re.lastIndex) {
re.lastIndex++;
}
n.push(m[0]);
}
alert( JSON.stringify(n) );
Or a Live demo for more details
path(?:\?(?:[^&]*&)*foo=([0-9]+)(?:[&#]|$))?
This is as short as most, and reads more straightforwardly, since things that appear once in the string appear once in the RE.
We match:
the initial path
a question mark, (or skip to end)
some blocks terminated by ampersands
our parameter assignment
a closing confirmation, either starting the next syntactic element, or ending the line
Unfortunately it gives foo as None rather than '' when the foo parameter is omitted, but in Python (my language of choice) that is considered more appropriate. You could complain if you wanted, or just apply "or ''" to the result afterwards.
Based on the OP's data, here is my attempt at a pattern:
^(path)\b(?:[^f]+|f(?!oo=))(?!\bfoo=(?!\d+\b))(?:\bfoo=(\d+)\b)?
if path is found: sub-pattern #1 will contain "path"
if foo is valid: sub-pattern #2 will contain the foo value, if any
Demo
^(path)\b "path"
(?:[^f]+|f(?!oo=)) followed by anything but "foo="
(?!\bfoo=(?!\d+\b)) if "foo=" is found it must not see anything but \d+\b
(?:\bfoo=(\d+)\b)? if valid "foo=" is found, capture "foo" value
t = 'path?foo=67&bar=hello';
console.log(t.match(/\b(foo|path)\=\d+\b/))
regex /\b(foo|path)\=\d+\b/

Regex remove string in url

I have a URL like https://randomsitename-dd555959b114a0.mydomain.com and want to remove the -dd555959b114a0 part of it.
Here randomsitename is a random name and the domain is a static domain name.
Is it possible to remove that part with jQuery or JavaScript?
Look at this code that is using regex
var url = "https://randomsitename-dd555959b114a0.mydomain.com";
var res = url.replace(/ *\-[^.]*\. */g, ".");
http://jsfiddle.net/VYw9Y/
It's usually best to code for all possible cases and since hyphens are allowed within any part of domain names, you'll more than likely want to use a more specific RegExp such as:
^ # start of string
( # start first capture group
[a-z]+ # one or more letters
) # end first capture group
:// # literal separator
( # start second capture group
[^.-]+ # one or more chars except dot or hyphen
) # end second capture group
(?: # start optional non-capture group
- # literal hyphen
[^.]+ # one or more chars except dot
)? # end optional non-capture group
( # start third capture group
.+ # one or more chars
) # end third capture group
$ # end of string
Or without comments:
^([a-z]+)://([^.-]+)(?:-[^.]+)?(.+)$
(Remember to escape slashes if you use the literal form for RegExps rather than creating them as objects, i.e. /literal\/form/ vs. new RegExp('object/form'))
Used in a string replacement, the second argument should then be: $1://$2$3
Previous answers will fail for URLs like http://foo.bar-baz.com or http://foo-bar.baz-blarg.com.
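Put together as a replacement call, the pattern and the $1://$2$3 argument could be used like this. It is a sketch with an invented function name, using the literal form with escaped slashes:

```javascript
// Capture scheme, first host label up to a hyphen, and the rest of the URL;
// the optional middle group swallows the "-suffix" so it is dropped.
const SUFFIX_RE = /^([a-z]+):\/\/([^.-]+)(?:-[^.]+)?(.+)$/;

function stripHostSuffix(url) {
  return url.replace(SUFFIX_RE, '$1://$2$3');
}

console.log(stripHostSuffix('https://randomsitename-dd555959b114a0.mydomain.com'));
console.log(stripHostSuffix('http://foo.bar-baz.com'));
console.log(stripHostSuffix('http://foo-bar.baz-blarg.com'));
```

Only a hyphenated suffix in the first host label is removed; hyphens in later labels are left alone.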
You could try this regex,
(.*)(-[^\.]*)(.*$)
Your code should be,
var url = "https://randomsitename-dd555959b114a0.mydomain.com";
var res = url.replace(/(.*)(-[^\.]*)(.*$)/, "$1$3");
//=>https://randomsitename.mydomain.com
Explanation:
(.*) matches any character 0 or more times and stores the result in group 1 because the pattern is enclosed in parentheses. Because it is greedy, group 1 ends just before the last - that can still be followed by the rest of the pattern.
(-[^\.]*) From the - up to a literal . is stored in group 2. It stops storing when it finds a literal dot.
(.*$) From the literal dot up to the last character is stored in group 3.
$1$3 at the replacement part prints only the stored group1 and 3.
OR
(.*)(?:-[^\.]*)(.*$)
If you use this regex, in the replacement part you need to put only $1 and $2.
DEMO

Facebook registration form email validation pattern

When searching for RegExp patterns to validate an email address in Javascript, I found a pattern that Facebook uses from here.
function is_email(a){return /^([\w!.%+\-])+@([\w\-])+(?:\.[\w\-]+)+$/.test(a);}
Can someone please explain to me how this pattern works? I understand that it is looking for 'word characters' in three positions along with a '@' character. But a nice explanation will help a lot for me to understand this.
There are two websites (that I know of), which generate explanations for regex patterns.
regex101.com explains the pattern in words
regexper.com does so graphically
Here is my own explanation for the pattern:
^ # anchor the pattern to the beginning of the string; this ensures that
# there are no undesired characters before the email address, as regex
# matches might well be substrings otherwise
( # starts a group (which is unnecessary and incurs overhead)
[\w!.%+\-]
# matches a letter, digit, underscore or one of the explicitly mentioned
# characters (note that the backslash is used to escape the hyphen
# although that is not required if the hyphen is the last character)
)+ # end group; repeat one or more times
@ # match a literal @
( # starts another group (again unnecessary and incurs overhead)
[\w\-] # match a letter, digit, underscore or hyphen
)+ # end group; repeat one or more times
(?: # starts a non-capturing group (this one is necessary and, because
# capturing is suppressed, this one does not incur any overhead)
\. # match a literal period
[\w\-] # match a letter, digit, underscore or hyphen
+ # one or more of those
)+ # end group; repeat one or more times
$ # anchor the pattern to the end of the string; analogously to ^
So, this would be a slightly optimised version:
/^[\w!.%+\-]+@[\w\-]+(?:\.[\w\-]+)+$/
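A quick sketch of the optimised version in use, assuming the literal separator is @ and using an illustrative function name:

```javascript
// Local part, "@", first domain label, then one or more ".label" groups.
const EMAIL_RE = /^[\w!.%+\-]+@[\w\-]+(?:\.[\w\-]+)+$/;

function isEmail(address) {
  return EMAIL_RE.test(address);
}

const addresses = [
  'user@example.com',
  'first.last+tag@mail.example.org',
  'user@example',
  'user name@example.com',
  '@example.com',
];
for (const address of addresses) {
  console.log(isEmail(address) + ' >>> ' + address);
}
```

The final non-capturing group is what forces at least one dot after the @, which is why user@example is rejected.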

Matching optional groups with lookahead in JavaScript regex

I'm trying to solve a string matching problem with regexes. I need to match URLs of this form:
http://soundcloud.com/okapi23/dont-turn-your-back/
And I need to "reject" URL of this form:
http://soundcloud.com/okapi23/sets/happily-reversed/
The trailing '/' is obviously optional.
So basically:
After the hostname, there can be 2 or 3 groups, and if the second one is equal to "sets", then the regex should not match.
"sets" can be contained anywhere else in the URL
"sets" needs to be an exact match
What I came up with so far is http(s)?://(www\.)?soundcloud\.com/.+/(?!sets)\b(/.+)?, which fails.
Any suggestions? Are there any libraries that would simplify the task (for example, making trailing slashes optional)?
Assuming that the OP wants to test to see if a given string contains a URL which meets the following requirements:
URL scheme must be either http: or https:.
URL authority must be either //soundcloud.com or //www.soundcloud.com.
URL path must exist and must contain 2 or 3 path segments.
The second path segment must not be: "sets".
Each path segment must consist of one or more "words" consisting of only alphanumeric characters ([A-Za-z0-9]) and multiple words are separated by exactly one dash or underscore.
The URL must have no query or fragment component.
The URL path may end with an optional "/".
The URL should match case insensitively.
Here is a tested JavaScript function (with a fully commented regex) which does the trick:
function isValidCustomUrl(text) {
/* Here is the regex commented in free-spacing mode:
# Match specific URL having non-"sets" 2nd path segment.
^ # Anchor to start of string.
https?: # URL Scheme (http or https).
// # Begin URL Authority.
(?:www\.)? # Optional www subdomain.
soundcloud\.com # URL DNS domain.
/ # 1st path segment (can be: "sets").
[A-Za-z0-9]+ # 1st word-portion (required).
(?: # Zero or more extra word portions.
[-_] # only if separated by one - or _.
[A-Za-z0-9]+ # Additional word-portion.
)* # Zero or more extra word portions.
(?!/sets(?:/|$)) # Assert 2nd segment not "sets".
(?: # 2nd and 3rd path segments.
/ # Additional path segment.
[A-Za-z0-9]+ # 1st word-portion.
(?: # Zero or more extra word portions.
[-_] # only if separated by one - or _.
[A-Za-z0-9]+ # Additional word-portion.
)* # Zero or more extra word portions.
){1,2} # 2nd path segment required, 3rd optional.
/? # URL may end with optional /.
$ # Anchor to end of string.
*/
// Same regex in javascript syntax:
var re = /^https?:\/\/(?:www\.)?soundcloud\.com\/[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*(?!\/sets(?:\/|$))(?:\/[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*){1,2}\/?$/i;
return re.test(text);
}
Instead of . use [a-zA-Z][\w-]* which means "match a letter followed by any number of letters, numbers, underscores or hyphens".
^https?://(www\.)?soundcloud\.com/[a-zA-Z][\w-]*/(?!sets(/|$))[a-zA-Z][\w-]*(/[a-zA-Z][\w-]*)?/?$
To get the optional trailing slash, use /?$.
In a Javascript regular expression literal all the forward slashes must be escaped.
I suggest going with this regex pattern:
^https?:\/\/soundcloud\.com(?!\/[^\/]+\/sets(?:\/|$))(?:\/[^\/]+){2,3}\/?$
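As a sketch of how this last pattern behaves on the two URLs from the question (the function name is made up for the example):

```javascript
// The negative lookahead rejects ".../<first>/sets" followed by "/" or the end;
// the body then requires 2 or 3 path segments and an optional trailing slash.
const TRACK_URL_RE =
  /^https?:\/\/soundcloud\.com(?!\/[^\/]+\/sets(?:\/|$))(?:\/[^\/]+){2,3}\/?$/;

function isTrackUrl(url) {
  return TRACK_URL_RE.test(url);
}

console.log(isTrackUrl('http://soundcloud.com/okapi23/dont-turn-your-back/'));
console.log(isTrackUrl('http://soundcloud.com/okapi23/sets/happily-reversed/'));
console.log(isTrackUrl('http://soundcloud.com/okapi23/setsail'));
```

Because the lookahead requires a / or the end of the string right after sets, a second segment that merely starts with "sets" is still accepted.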
