Matching optional groups with lookahead in JavaScript regex - javascript

I'm trying to solve a string matching problem with regexes. I need to match URLs of this form:
http://soundcloud.com/okapi23/dont-turn-your-back/
And I need to "reject" URLs of this form:
http://soundcloud.com/okapi23/sets/happily-reversed/
The trailing '/' is obviously optional.
So basically:
After the hostname, there can be 2 or 3 groups, and if the second one is equal to "sets", then the regex should not match.
"sets" can be contained anywhere else in the URL
"sets" needs to be an exact match
What I have come up with so far is http(s)?://(www\.)?soundcloud\.com/.+/(?!sets)\b(/.+)?, which fails.
Any suggestions? Are there any libraries that would simplify the task (for example, making trailing slashes optional)?

Assuming that the OP wants to test to see if a given string contains a URL which meets the following requirements:
URL scheme must be either http: or https:.
URL authority must be either //soundcloud.com or //www.soundcloud.com.
URL path must exist and must contain 2 or 3 path segments.
The second path segment must not be: "sets".
Each path segment must consist of one or more "words" consisting of only alphanumeric characters ([A-Za-z0-9]) and multiple words are separated by exactly one dash or underscore.
The URL must have no query or fragment component.
The URL path may end with an optional "/".
The URL should match case insensitively.
Here is a tested JavaScript function (with a fully commented regex) which does the trick:
function isValidCustomUrl(text) {
/* Here is the regex commented in free-spacing mode:
# Match specific URL having non-"sets" 2nd path segment.
^ # Anchor to start of string.
https?: # URL Scheme (http or https).
// # Begin URL Authority.
(?:www\.)? # Optional www subdomain.
soundcloud\.com # URL DNS domain.
/ # 1st path segment (can be: "sets").
[A-Za-z0-9]+ # 1st word-portion (required).
(?: # Zero or more extra word portions.
[-_] # only if separated by one - or _.
[A-Za-z0-9]+ # Additional word-portion.
)* # Zero or more extra word portions.
(?!/sets(?:/|$)) # Assert 2nd segment not "sets".
(?: # 2nd and 3rd path segments.
/ # Additional path segment.
[A-Za-z0-9]+ # 1st word-portion.
(?: # Zero or more extra word portions.
[-_] # only if separated by one - or _.
[A-Za-z0-9]+ # Additional word-portion.
)* # Zero or more extra word portions.
){1,2} # 2nd path segment required, 3rd optional.
/? # URL may end with optional /.
$ # Anchor to end of string.
*/
// Same regex in javascript syntax:
var re = /^https?:\/\/(?:www\.)?soundcloud\.com\/[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*(?!\/sets(?:\/|$))(?:\/[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*){1,2}\/?$/i;
return re.test(text);
}

Instead of . use [a-zA-Z][\w-]* which means "match a letter followed by any number of letters, numbers, underscores or hyphens".
^https?://(www\.)?soundcloud\.com/[a-zA-Z][\w-]*/(?!sets(/|$))[a-zA-Z][\w-]*(/[a-zA-Z][\w-]*)?/?$
To get the optional trailing slash, use /?$.
In a JavaScript regular expression literal, all the forward slashes must be escaped.
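For reference, here is a minimal sketch of that pattern written as a regex literal with the slashes escaped. The variable names and the last three sample URLs are made up for illustration; only the first two URLs come from the question.

```javascript
// Regex literal form of the pattern above: every "/" is written as "\/".
const trackUrl =
  /^https?:\/\/(www\.)?soundcloud\.com\/[a-zA-Z][\w-]*\/(?!sets(\/|$))[a-zA-Z][\w-]*(\/[a-zA-Z][\w-]*)?\/?$/;

const samples = [
  "http://soundcloud.com/okapi23/dont-turn-your-back/",   // from the question: should match
  "http://soundcloud.com/okapi23/sets/happily-reversed/", // from the question: should be rejected
  "http://soundcloud.com/okapi23/sets",                   // "sets" as the whole second segment
  "http://soundcloud.com/sets/some-track",                // "sets" elsewhere is allowed
  "http://soundcloud.com/okapi23/setsuna",                // not an exact match for "sets"
];

const results = samples.map((url) => trackUrl.test(url));
console.log(results); // [ true, false, false, true, true ]
```

Note that this pattern requires each segment to start with a letter, so a user name beginning with a digit would be rejected.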

I suggest the following regex pattern:
^https?:\/\/soundcloud\.com(?!\/[^\/]+\/sets(?:\/|$))(?:\/[^\/]+){2,3}\/?$
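A quick way to try this pattern is sketched below. The function name isTrackUrl is hypothetical. Here the lookahead runs right after the host name, before any path segment is consumed. As written, the pattern does not accept a www. prefix.

```javascript
// The lookahead is checked immediately after the host, then 2 or 3 segments are consumed.
const pattern = /^https?:\/\/soundcloud\.com(?!\/[^\/]+\/sets(?:\/|$))(?:\/[^\/]+){2,3}\/?$/;

// Hypothetical helper name, used only for this example.
function isTrackUrl(url) {
  return pattern.test(url);
}

console.log(isTrackUrl("http://soundcloud.com/okapi23/dont-turn-your-back/"));     // true
console.log(isTrackUrl("http://soundcloud.com/okapi23/sets/happily-reversed/"));   // false
console.log(isTrackUrl("http://www.soundcloud.com/okapi23/dont-turn-your-back/")); // false (no www support)
```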

Related

How to match a filename pattern in JavaScript?

How can I match a filename that is exactly (capitals included) in the following format/pattern:
yymmdd_Name1_Data_Prices,
yymmdd_Name1_Data_Contact,
yymmdd_Name1_Data_Address.
I have files that need to be uploaded, and the filenames are saved in a database. I want to match the given filename against the pattern from the database, but I am unsure how to do that.
You could use the following regular expression.
\b\d{6}(?:_[A-Z][a-z]+){3}\b
Javascript's regex engine performs the following operations.
\b # match word break
\d{6} # match 6 digits
(?: # begin non-capture group
_[A-Z][a-z]+ # match '_', one upper-case letter, 1+ lower-case letters
) # end non-capture group
{3} # execute non-capture group 3 times
\b # match word break
Matching the first 6 characters, which correspond to a date, could be more precise than simply matching 6 digits. For example, assuming the year is 2000-2020, one could replace \d{6} with
(?:[01]\d|20)(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|30|31)
but that still would not ensure the date is valid.
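If the date really has to be valid, one option is to let the regex check only the shape and then check the calendar date in code. This is a minimal sketch under my own assumptions: the helper name hasValidDatePrefix is hypothetical, the two-digit year is taken to mean 2000-2099, and the whole string is anchored with ^ and $ instead of \b.

```javascript
// Hypothetical helper: shape check by regex, calendar check by Date.
function hasValidDatePrefix(filename) {
  const m = /^(\d{2})(\d{2})(\d{2})(?:_[A-Z][a-z]+){3}$/.exec(filename);
  if (!m) return false;

  const year = 2000 + Number(m[1]); // assumption: yy means 20yy
  const month = Number(m[2]);
  const day = Number(m[3]);

  // Date.UTC rolls invalid dates over (e.g. Feb 30 becomes Mar 1),
  // so the date is valid only if all three parts survive the round trip.
  const d = new Date(Date.UTC(year, month - 1, day));
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}

console.log(hasValidDatePrefix("200229_Acme_Data_Prices")); // true  (2020 is a leap year)
console.log(hasValidDatePrefix("210229_Acme_Data_Prices")); // false (2021-02-29 does not exist)
console.log(hasValidDatePrefix("201301_Acme_Data_Prices")); // false (month 13)
```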

Regex windows path validator

I've tried to find a windows file path validation for Javascript, but none seemed to fulfill the requirements I wanted, so I decided to build it myself.
The requirements are the following:
the path should not be empty
may begin with x:\, x:\\, \, // and followed by a filename (no file extension required)
filenames cannot include the following special characters: <>:"|?*
filenames cannot end with dot or space
Here is the regex I came up with:
/^([a-z]:((\\|\/|\\\\|\/\/))|(\\\\|\/\/))[^<>:"|?*]+/i
But there are some issues:
it also validates filenames that include the special characters mentioned in the rules
it doesn't include the last rule (cannot end with: . or space)
var reg = new RegExp(/^([a-z]:((\\|\/|\\\\|\/\/))|(\\\\|\/\/))[^<>:"|?*]+/i);
var startList = [
'C://test',
'C://te?st.html',
'C:/test',
'C://test.html',
'C://test/hello.html',
'C:/test/hello.html',
'//test',
'/test',
'//test.html',
'//10.1.1.107',
'//10.1.1.107/test.html',
'//10.1.1.107/test/hello.html',
'//10.1.1.107/test/hello',
'//test/hello.txt',
'/test/html',
'/tes?t/html',
'/test.html',
'test.html',
'//',
'/',
'\\\\',
'\\',
'/t!esrtr',
'C:/hel**o'
];
startList.forEach(item => {
document.write(reg.test(item) + ' >>> ' + item);
document.write("<br>");
});
Unfortunately, JavaScript regex engines before ES2018 do not support lookbehinds,
but fortunately lookaheads are supported everywhere, and this is the key factor
in how to construct the regex.
Let's start from some observations:
1. After a dot, slash, backslash or a space there cannot occur another
dot, slash or backslash. The set of "forbidden" chars also includes
\n, because none of these chars can be the last char of the file name
or its segment (between dots or (back-)slashes).
2. Other chars allowed in the path are the chars which you mentioned
(other than ...), but the "exclusion list" must also include a dot,
slash, backslash, space and \n (the chars mentioned in point 1).
3. After the "initial part" (C:\) there can be multiple instances of
chars mentioned in point 1 or 2.
Taking these points into account, I built the regex from 3 parts:
1. "Starting" part, matching the drive letter, a colon and up to 2
slashes (forward or backward).
2. The first alternative - either a dot, slash, backslash or a space,
with negative lookahead - a list of "forbidden" chars after each of
the above chars (see point 1).
3. The second alternative - chars mentioned in point 2.
Both of the above alternatives can occur multiple times (+ quantifier).
So the regex is as follows:
^ - Start of the string.
(?:[a-z]:)? - Drive letter and a colon, optional.
[\/\\]{0,2} - Either a backslash or a slash, between 0 and 2 times.
(?: - Start of the non-capturing group, needed due to the +
quantifier after it.
[.\/\\ ] - The first alternative.
(?![.\/\\\n]) - Negative lookahead - "forbidden" chars.
| - Or.
[^<>:"|?*.\/\\ \n] - The second alternative.
)+ - End of the non-capturing group, may occur multiple times.
$ - End of the string.
If you attempt to match each path separately, use only i option.
But if you have multiple paths in separate rows, and match them
globally in one go, add also g and m options.
For a working example see https://regex101.com/r/4JY31I/1
Note: I suppose that ! should also be treated as a forbidden
character. If you agree, add it to the second alternative, e.g. after *.
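One thing worth checking when the regex is applied to a single string with no trailing newline: at the end of the input the negative lookahead has no character left to reject, so a final dot, space or slash still passes. The sketch below compares the regex above with a variant that also puts $ inside the lookahead. That variant is an assumption of this example, not part of the answer, and the names original and strict are made up.

```javascript
// Regex as described above (single-string use, only the i flag).
const original = /^(?:[a-z]:)?[\/\\]{0,2}(?:[.\/\\ ](?![.\/\\\n])|[^<>:"|?*.\/\\ \n])+$/i;

// Assumed variant: "|$" in the lookahead also forbids a dot, slash,
// backslash or space as the very last character of the input.
const strict = /^(?:[a-z]:)?[\/\\]{0,2}(?:[.\/\\ ](?![.\/\\\n]|$)|[^<>:"|?*.\/\\ \n])+$/i;

const paths = ["C:/test/hello.html", "C://te?st.html", "C:/test.", "test ", "/"];

const report = paths.map((p) => ({
  path: p,
  original: original.test(p),
  strict: strict.test(p),
}));
console.log(report);
```

With these inputs both versions accept C:/test/hello.html and reject C://te?st.html, while only the original accepts the last three paths.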
This may work for you: ^(?!.*[\\\/]\s+)(?!(?:.*\s|.*\.|\W+)$)(?:[a-zA-Z]:)?(?:(?:[^<>:"\|\?\*\n])+(?:\/\/|\/|\\\\|\\)?)+$
Explained:
^
(?!.*[\\\/]\s+) # Disallow names beginning with whitespace (whitespace right after a slash)
(?!(?:.*\s|.*\.|\W+)$) # Disallow paths made only of non-word characters, or ending with a dot/whitespace
(?:[a-zA-Z]:)? # Drive letter (optional)
(?:
(?:[^<>:"\|\?\*\n])+ # Name (one or more characters that are not forbidden)
(?:\/\/|\/|\\\\|\\)? # Separator (// or / or \\ or \); optional
)+ # Repeated one or more times
$
Since this post seems to be (one of) the top result(s) in a search for a RegEx Windows path validation pattern, and given the caveats / weaknesses of the above proposed solutions, I'll include the solution that I use for validating Windows paths (and which, I believe, addresses all of the points raised previously in that use-case).
I could not come up with a single viable regex, with or without lookaheads and lookbehinds, that would do the job, but I could do it with two, without any lookaheads or lookbehinds!
Note, though, that successive relative paths (i.e. "..\..\folder\file.exe") will not pass this pattern (though using "..\" or ".\" at the beginning of the string will). Periods and spaces before and after slashes, or at the end of the line are failed, as well as any character not permitted according to Microsoft's short-filename specification:
https://learn.microsoft.com/en-us/windows/win32/msi/filename
First Pattern:
^ (?# <- Start at the beginning of the line #)
(?# validate the opening drive or path delimiter, if present -> #)
(?: (?# "C:", "C:\", "C:..\", "C:.\" -> #)
(?:[A-Z]:(?:\.{1,2}[\/\\]|[\/\\])?)
| (?# or "\", "..\", ".\", "\\" -> #)
(?:[\/\\]{1,2}|\.{1,2}[\/\\])
)?
(?# validate the form and content of the body -> #)
(?:[^\x00-\x1A|*?\v\r\n\f+\/,;"'`\\:<>=[\]]+[\/\\]?)+
$ (?# <- End at the end of the line. #)
This will generally validate the path structure and character validity, but it also allows problematic things like double periods, double backslashes, and periods and backslashes that are preceded or followed by spaces or periods. Paths that end with spaces and/or periods are also permitted.
To address these problems I perform a second test with another (similar) pattern:
^ (?# <- Start at the beginning of the line #)
(?# validate the opening drive or path delimiter, if present -> #)
(?: (?# "C:", "C:\", "C:..\", "C:.\" -> #)
(?:[A-Z]:(?:\.{1,2}[\/\\]|[\/\\])?)
| (?# or "\", "..\", ".\", "\\" -> #)
(?:[\/\\]{1,2}|\.{1,2}[\/\\])
)?
(?# ensure that undesired patterns aren't present in the string -> #)
(?:([^\/\\. ]|[^\/. \\][\/. \\][^\/. \\]|[\/\\]$)*
[^\x00-\x1A|*?\s+,;"'`:<.>=[\]]) (?# <- Ensure that the last character is valid #)
$ (?# <- End at the end of the line. #)
This validates that, within the path body, no multiple-periods, multiple-slashes, period-slashes, space-slashes, slash-spaces or slash-periods occur, and that the path doesn't end with an invalid character. Annoyingly, I have to re-validate the <root> group because it's the one place where some of these combinations are allowed (i.e. ".\", "\\", and "..\") and I don't want those to invalidate the pattern.
Here is an implementation of my test (in C#):
/// <summary>Performs pattern testing on a string to see if it's in a form recognizable as an absolute path.</summary>
/// <param name="test">The string to test.</param>
/// <param name="testExists">If TRUE, this also verifies that the specified path exists.</param>
/// <returns>TRUE if the contents of the passed string are valid, and, if requested, the path exists.</returns>
public bool ValidatePath( string test, bool testExists = false )
{
bool result = !string.IsNullOrWhiteSpace(test);
if (!result) return false; // Regex.IsMatch throws on a null input.
string
drivePattern = /* language=regex */
@"^(([A-Z]:(?:\.{1,2}[\/\\]|[\/\\])?)|([\/\\]{1,2}|\.{1,2}[\/\\]))?",
pattern = drivePattern + /* language=regex */
@"([^\x00-\x1A|*?\t\v\f\r\n+\/,;""'`\\:<>=[\]]+[\/\\]?)+$";
result &= Regex.IsMatch( test, pattern, RegexOptions.ExplicitCapture );
pattern = drivePattern + /* language=regex */
@"(([^\/\\. ]|[^\/. \\][\/. \\][^\/. \\]|[\/\\]$)*[^\x00-\x1A|*?\s+,;""'`:<.>=[\]])$";
result &= Regex.IsMatch( test, pattern, RegexOptions.ExplicitCapture );
return result && (!testExists || Directory.Exists( test ));
}

Regular Expression to only get a specific line

I am attempting to only extract a specific line without any other characters after. For example:
permit ip any any
permit oped any any eq 10.52.5.15
permit top any any (sdfg)
permit sdo any host 10.51.86.17 eq sdg
I would like to match only the first line, permit ip any any, and not the others. Note that the second word, ip, can be any word.
Meaning, I find only permit (anyword) any any and if there was a character after the second any, do not match.
I tried to do \bpermit.\w+.(?:any.any).([$&+,:;=?@#|'<>.^*()%!-\w].+) but that finds the other lines except the permit ip any any. I did attempt to do a reverse lookup, but to no success.
Use the $ end of line anchor after the final "any" and the m multiline regexp flag.
/^permit \w+ any any$/gm
https://regex101.com/r/FfOp5k/2
If you are using Java based regex, you can include the multiline flag in the expression. This syntax is not supported by JavaScript regex.
(?m)^permit \w+ any any$
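In JavaScript, the g and m flags version can be tried against the sample lines from the question like this (the variable names are just for illustration):

```javascript
// Sample lines from the question, joined into one multi-line string.
const acl = [
  "permit ip any any",
  "permit oped any any eq 10.52.5.15",
  "permit top any any (sdfg)",
  "permit sdo any host 10.51.86.17 eq sdg",
].join("\n");

// With the m flag, ^ and $ match at the start and end of every line.
const matches = acl.match(/^permit \w+ any any$/gm);
console.log(matches); // [ 'permit ip any any' ]
```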
I tried to do \bpermit.\w+.(?:any.any).([$&+,:;=?@#|'<>.^*()%!-\w].+) but that finds the other lines except the permit ip any any. I did attempt to do a reverse lookup, but to no success.
Let's take apart your regex to see what it says:
\b # starting on a word boundary (space to non space or reverse)
permit # look for the literal characters "permit" in that order
. # followed by any character
\w+ # followed by word characters (letters, numbers, underscores)
. # followed by any character
(?: # followed by a non-capturing group that contains
any # the literal characters 'any'
. # any character
any # the literal characters 'any'
)
. # followed by any character <-- ERROR HERE!
( # followed by a capturing group
[$&+,:;=?@#|'<>.^*()%!-\w] # any one of these many characters or word characters
.+ # then any character, one or more times
)
The behavior you describe...
but that finds the other lines except the permit ip any any.
matches what you've specified. Specifically, the regex above requires that there be characters after the 'any any'. Because permit \w+ any any does not have any characters after the any any part, the regex fails at the <-- ERROR HERE! mark in my breakdown above.
If that last part must be captured (using a capturing group) but it may not exist, you can make that entire last part optional using the ? character.
This would look like:
permit \w+ any any(?: (.+))?
for a breakdown of:
permit # the word permit
[ ] # a literal space
\w+ # one or more word characters
[ ] # a literal space
any # the word any
[ ] # another literal space
any # another any; all of this is required.
(?: # a non-capturing group to start the "optional" part
[ ] # a literal space after the any
(.+) # everything else, including spaces, and capture it in a group
)? # end non-capturing group, but make it optional
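Here is a small sketch of how the optional group behaves in JavaScript. The ^ and $ anchors and the helper name parseRule are my own additions for the example. When the optional part is absent, the capture group is simply undefined.

```javascript
// Pattern from above, anchored (an assumption) so the whole line must match.
const rule = /^permit \w+ any any(?: (.+))?$/;

// Hypothetical helper: returns null for non-matching lines,
// otherwise the captured trailing text (or null if there was none).
function parseRule(line) {
  const m = rule.exec(line);
  if (!m) return null;
  return { trailing: m[1] === undefined ? null : m[1] };
}

console.log(parseRule("permit ip any any"));                      // { trailing: null }
console.log(parseRule("permit oped any any eq 10.52.5.15"));      // { trailing: 'eq 10.52.5.15' }
console.log(parseRule("permit sdo any host 10.51.86.17 eq sdg")); // null
```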

Regex remove string in url

I have a URL like https://randomsitename-dd555959b114a0.mydomain.com and want to remove the -dd555959b114a0 part of the URL.
So randomsitename is a random name and the domain is a static domain name.
Is it possible to remove that part with jQuery or JavaScript?
Look at this code that is using regex
var url = "https://randomsitename-dd555959b114a0.mydomain.com";
var res = url.replace(/ *\-[^.]*\. */g, ".");
http://jsfiddle.net/VYw9Y/
It's usually best to code for all possible cases, and since hyphens are allowed within any part of domain names, you'll more than likely want to use a more specific RegExp such as:
^ # start of string
( # start first capture group
[a-z]+ # one or more letters
) # end first capture group
:// # literal separator
( # start second capture group
[^.-]+ # one or more chars except dot or hyphen
) # end second capture group
(?: # start optional non-capture group
- # literal hyphen
[^.]+ # one or more chars except dot
)? # end optional non-capture group
( # start third capture group
.+ # one or more chars
) # end third capture group
$ # end of string
Or without comments:
^([a-z]+)://([^.-]+)(?:-[^.]+)?(.+)$
(Remember to escape slashes if you use the literal form for RegExps rather than creating them as objects, i.e. /literal\/form/ vs. new RegExp('object/form'))
Used in a string replacement, the second argument should then be: $1://$2$3
Previous answers will fail for URLs like http://foo.bar-baz.com or http://foo-bar.baz-blarg.com.
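As a rough illustration, this is the commented pattern (with [^.-]+ for the second group, as in the breakdown) used in a replacement. The function name stripSuffix is hypothetical.

```javascript
// Literal form, so the slashes after the scheme are escaped.
const hostPattern = /^([a-z]+):\/\/([^.-]+)(?:-[^.]+)?(.+)$/;

// Hypothetical helper: drops a "-suffix" from the first host label only.
function stripSuffix(url) {
  return url.replace(hostPattern, "$1://$2$3");
}

console.log(stripSuffix("https://randomsitename-dd555959b114a0.mydomain.com"));
// https://randomsitename.mydomain.com
console.log(stripSuffix("http://foo.bar-baz.com"));
// http://foo.bar-baz.com (hyphen in a later label is left alone)
console.log(stripSuffix("http://foo-bar.baz-blarg.com"));
// http://foo.baz-blarg.com
```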
You could try this regex,
(.*)(-[^\.]*)(.*$)
Your code should be,
var url = "https://randomsitename-dd555959b114a0.mydomain.com";
var res = url.replace(/(.*)(-[^\.]*)(.*$)/, "$1$3");
//=>https://randomsitename.mydomain.com
Explanation:
(.*) matches any character 0 or more times and stores it in group 1, because those characters are enclosed in parentheses. Because .* is greedy, group 1 extends up to the last - in the string that still lets the rest of the pattern match.
(-[^\.]*) From that - up to a literal . is stored in group 2. It stops storing when it finds a literal dot.
(.*$) From the literal dot up to the last character is stored in group 3.
$1$3 at the replacement part prints only the stored group1 and 3.
OR
(.*)(?:-[^\.]*)(.*$)
If you use this regex, in the replacement part you need to put only $1 and $2.
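One point to keep in mind is that (.*) is greedy, so with more than one hyphen in the URL it is the last hyphenated part that gets removed. The sketch below shows this next to a lazy (.*?) variant. The lazy variant and the second sample URL are my own additions for comparison.

```javascript
const greedy = /(.*)(-[^\.]*)(.*$)/;
const lazy = /(.*?)(-[^\.]*)(.*$)/; // assumed variant: stops at the first hyphen

const single = "https://randomsitename-dd555959b114a0.mydomain.com";
const multi = "http://foo-bar.baz-blarg.com";

console.log(single.replace(greedy, "$1$3")); // https://randomsitename.mydomain.com
console.log(multi.replace(greedy, "$1$3"));  // http://foo-bar.baz.com
console.log(multi.replace(lazy, "$1$3"));    // http://foo.baz-blarg.com
```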

Facebook registration form email validation pattern

When searching for RegExp patterns to validate an email address in Javascript, I found a pattern that Facebook uses from here.
function is_email(a){return /^([\w!.%+\-])+@([\w\-])+(?:\.[\w\-]+)+$/.test(a);}
Can someone please explain to me how this pattern works? I understand that it is looking for 'word characters' in three positions along with a '@' character. But a nice explanation would help me a lot to understand this.
There are two websites (that I know of), which generate explanations for regex patterns.
regex101.com explains the pattern in words
regexper.com does so graphically
Here is my own explanation for the pattern:
^ # anchor the pattern to the beginning of the string; this ensures that
# there are no undesired characters before the email address, as regex
# matches might well be substrings otherwise
( # starts a group (which is unnecessary and incurs overhead)
[\w!.%+\-]
# matches a letter, digit, underscore or one of the explicitly mentioned
# characters (note that the backslash is used to escape the hyphen
# although that is not required if the hyphen is the last character)
)+ # end group; repeat one or more times
@ # match a literal @
( # starts another group (again unnecessary and incurs overhead)
[\w\-] # match a letter, digit, underscore or hyphen
)+ # end group; repeat one or more times
(?: # starts a non-capturing group (this one is necessary and, because
# capturing is suppressed, this one does not incur any overhead)
\. # match a literal period
[\w\-] # match a letter, digit, underscore or hyphen
+ # one or more of those
)+ # end group; repeat one or more times
$ # anchor the pattern to the end of the string; analogously to ^
So, this would be a slightly optimised version:
/^[\w!.%+\-]+@[\w\-]+(?:\.[\w\-]+)+$/
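To see the simplified pattern in action, here is a short sketch written with the literal @ separator that an email address requires. The function name isEmail and the sample addresses are made up for the example.

```javascript
// Simplified pattern without the unnecessary capturing groups.
const emailPattern = /^[\w!.%+\-]+@[\w\-]+(?:\.[\w\-]+)+$/;

// Hypothetical wrapper, mirroring is_email from the question.
function isEmail(text) {
  return emailPattern.test(text);
}

console.log(isEmail("user.name+tag@example.co.uk")); // true
console.log(isEmail("user@localhost"));              // false (no dot-separated part after the host)
console.log(isEmail("user@@example.com"));           // false
```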
