I have a website, say example.com, with an account page.
The URL may have GET parameters, which still count as part of the account page.
It may also have a URL fragment. If the fragment is home.html, it is still the account page; any other fragment is a different sub-page of the account page.
So I need a RegEx (JS) to match this case. This is what I have managed to build so far:
example.com\/account\/(|.*\#home\.html|(\?(?!.*#.*)))$
https://regex101.com/r/ihjCIg/1
The first 4 rows are the cases I need to match. As you can see, the second row is not matched by my RegEx.
What am I missing here?
You could use 2 optional groups: one that matches ? followed by any characters except #, and another that matches #home.html.
Remember to escape the dot to match it literally.
^example\.com\/account\/(?:\?[^#\r\n]*)?(?:#home\.html)?$
^ Start of string
example\.com\/account\/ Match start
(?: Non capturing group
\?[^#\r\n]* Match ? and 0+ times any char except # or a newline
)? Close group and make it optional
(?: Non capturing group
#home\.html Match #home.html
)? Close group and make it optional
$ End of string
Regex demo
let pattern = /^example\.com\/account\/(?:\?[^#\r\n]*)?(?:#home\.html)?$/;
[
"example.com/account/",
"example.com/account/?brand=mine",
"example.com/account/#home.html",
"example.com/account/?brand=mine#home.html",
"example.com/account/#other.html",
"example.com/account/?brand=mine#other.html"
].forEach(url => console.log(url + " --> " + pattern.test(url)));
The third alternative in your group has a negative lookahead that rejects any text containing a #, but nothing after it matches the rest of the content up to the end of the line. Check this updated regex demo:
https://regex101.com/r/ihjCIg/3
If you notice, I have escaped your first dot just before com and have added .* after the negative look ahead part so it matches your second sample.
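For reference, applying those two changes to the pattern from the question would presumably give something like the following. This is a reconstruction from the description above, not a copy of the linked demo, and the name accountPage is made up for the example.

```javascript
// Reconstructed from the description: the dot before "com" is escaped and
// ".*" is added after the negative lookahead so the query text is consumed.
const accountPage = /example\.com\/account\/(|.*#home\.html|(\?(?!.*#.*).*))$/;

const samples = [
  "example.com/account/",
  "example.com/account/?brand=mine",
  "example.com/account/#home.html",
  "example.com/account/?brand=mine#home.html",
  "example.com/account/#other.html",
  "example.com/account/?brand=mine#other.html",
];

// One boolean per sample, in the same order as the list above.
const results = samples.map((url) => accountPage.test(url));
samples.forEach((url, i) => console.log(url + " --> " + results[i]));
```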
example\.com\/account\/((\??[^#\r\n]+)?(#?home\.html)?)?$
This matches your first four strings
example.com/account/
example.com/account/?brand=mine
example.com/account/#home.html
example.com/account/?brand=mine#home.html
and excludes your last two
example.com/account/#other.html
example.com/account/?brand=mine#other.html
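A quick check of this pattern against the six samples, as a sketch. Because the ? is optional (\??), it also appears to accept text after /account/ that is not a query string, which the last call illustrates; whether that matters depends on the inputs you expect. The names loosePattern and check are made up for the example.

```javascript
const loosePattern = /example\.com\/account\/((\??[^#\r\n]+)?(#?home\.html)?)?$/;

// Small wrapper so each sample can be checked on its own.
const check = (url) => loosePattern.test(url);

console.log(check("example.com/account/"));
console.log(check("example.com/account/?brand=mine"));
console.log(check("example.com/account/#home.html"));
console.log(check("example.com/account/?brand=mine#home.html"));
console.log(check("example.com/account/#other.html"));
console.log(check("example.com/account/?brand=mine#other.html"));

// Not a query string, but still accepted because "?" is optional.
console.log(check("example.com/account/settings"));
```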
I'd like to construct a regex that will check for a "path" and a "foo" parameter (non-negative integer). "foo" is optional. It should:
MATCH
path?foo=67 # path found, foo = 67
path?foo=67&bar=hello # path found, foo = 67
path?bar=bye&foo=1&baz=12 # path found, foo = 1
path?bar=123 # path found, foo = ''
path # path found, foo = ''
DO NOT MATCH
path?foo=37signals # foo is not integer
path?foo=-8 # foo cannot be negative
something?foo=1 # path not found
Also, I'd like to get the value of foo, without performing an additional match.
What would be the simplest regex to achieve this?
The Answer
Screw your hard work, I just want the answer! Okay, here you go...
var regex = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/,
URIs = [
'path', // valid!
'pathbreak', // invalid path
'path?foo=123', // valid!
'path?foo=-123', // negative
'invalid?foo=1', // invalid path
'path?foo=123&bar=abc', // valid!
'path?bar=abc&foo=123', // valid!
'path?bar=foo', // valid!
'path?foo', // valid!
'path#anchor', // valid!
'path#foo=bar', // valid!
'path?foo=123#bar', // valid!
'path?foo=123abc', // not an integer
];
for(var i = 0; i < URIs.length; i++) {
var URI = URIs[i],
match = regex.exec(URI);
if(match) {
var foo = match[1] ? match[1] : 'null';
console.log(URI + ' matched, foo = ' + foo);
} else {
console.log(URI + ' is invalid...');
}
}
Research
Your bounty request asks for "credible and/or official sources", so I'll quote the RFC on query strings.
The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.
This seems pretty vague on purpose: a query string starts with the first ? and is terminated by a # (start of anchor) or the end of the URI (or string/line in our case). They go on to mention that most data sets are in key=value pairs, which seems to be what you expect to parse (so let's assume that is the case).
However, as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.
With all this in mind, let's assume a few things about your URIs:
Your examples start with the path, so the path will be from the beginning of the string until a ? (query string), # (anchor), or the end of the string.
The query string is the iffy part, since the RFC doesn't really define a "norm". A browser tends to expect a query string to be generated from a form submission and be a list of key=value pairs joined by & characters. Keeping this mentality:
A key cannot be null, will be preceded by a ? or &, and cannot contain a =, & or #.
A value is optional, will be preceded by key=, and cannot contain a & or #.
Anything after a # character is the anchor.
Let's Begin!
Let's start by mapping out our basic URI structure. You have a path, which is the characters from the start of the string up until a ?, #, or the end of the string. You have an optional query string, which starts at a ? and goes until a # or the end of the string. And you have an optional anchor, which starts at a # and goes until the end of the string.
^
([^?#]+)
(?:
\?
([^#]+)
)?
(?:
#
(.*)
)?
$
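As a rough sketch of how that structure behaves in JavaScript, the pattern above can be written on one line and its three groups read back. The names uriParts and splitUri are made up for this example.

```javascript
// path, optional query string, optional anchor (the pattern above on one line)
const uriParts = /^([^?#]+)(?:\?([^#]+))?(?:#(.*))?$/;

function splitUri(uri) {
  const m = uriParts.exec(uri);
  if (!m) return null;
  // Groups that did not take part in the match are undefined in JavaScript.
  return { path: m[1], query: m[2], anchor: m[3] };
}

console.log(splitUri("path?foo=123&bar=abc#top"));
console.log(splitUri("path#top"));
console.log(splitUri("path"));
```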
Let's do some clean up before digging into the query string. You can easily require the path to equal a certain value by replacing the first capture group. Whatever you replace it with (path), will have to be followed by an optional query string, an optional anchor, and the end of the string (no more, no less). Since you don't need to parse the anchor, the capturing group can be replaced by ending the match at either a # or the end of the string (which is the end of the query parameter).
^path
(?:
\?
([^#]+)
)?
(?=#|$)
Stop Messing Around
Okay, I've been doing a lot of setup without really worrying about your specific example. The next example will match a specific path (path) and optionally match a query string while capturing the value of a foo parameter. This means you could stop here and check for a valid match..if the match is valid, then the first capture group must be null or a non-negative integer. But that wasn't your question, was it. This got a lot more complicated, so I'm going to explain the expression inline:
^ (?# match beginning of the string)
path (?# match path literally)
(?: (?# begin optional non-capturing group)
(?=\?) (?# lookahead for a literal ?)
(?: (?# begin repeating non-capturing group)
[?&] (?# keys are preceded by ? or &)
foo (?# match key literally)
(?: (?# begin non-capturing group)
= (?# values are preceded by =)
([^&#]*) (?# values are 0+ length and do not contain & or #)
) (?# end non-capturing group)
| (?# OR)
[^#] (?# query strings are non-# characters)
)+ (?# end repeating non-capturing group)
)? (?# end optional non-capturing group)
(?=#|$) (?# lookahead for a literal # or end of the string)
Some key takeaways here:
At the time of writing, JavaScript didn't support lookbehinds (they were only added in ES2018), so you can't look behind for a ? or & before the key foo. You actually have to match one of those characters, which means the start of your query string (which looks for a ?) has to be a lookahead so that you don't consume the ?. This also means that your query string will always be at least one character (the ?), so you want to repeat the query string [^#] 1+ times.
The query string now repeats one character at a time in a non-capturing group..unless it sees the key foo, in which case it captures the optional value and continues repeating.
Since this non-capture query string group repeats all the way until the anchor or end of the URI, a second foo value (path?foo=123&foo=bar) would overwrite the initial captured value..meaning you wouldn't 100% be able to rely on the above solution.
Final Solution?
Okay..now that I've captured the foo value, it's time to kill the match on values that are not non-negative integers.
^ (?# match beginning of the string)
path (?# match path literally)
(?: (?# begin optional non-capturing group)
(?=\?) (?# lookahead for a literal ?)
(?: (?# begin repeating non-capturing group)
[?&] (?# keys are preceded by ? or &)
foo (?# match key literally)
= (?# values are preceded by =)
(\d*) (?# value must be a non-negative integer)
(?= (?# begin lookahead)
[&#] (?# literally match & or #)
| (?# OR)
$ (?# match end of the string)
) (?# end lookahead)
| (?# OR)
(?! (?# begin negative lookahead)
[?&] (?# literally match ? or &)
foo= (?# literally match foo=)
) (?# end negative lookahead)
[^#] (?# query strings are non-# characters)
)+ (?# end repeating non-capturing group)
)? (?# end optional non-capturing group)
(?=#|$) (?# lookahead for a literal # or end of the string)
Let's take a closer look at some of the juju that went into that expression:
After finding foo=\d*, we use a lookahead to ensure that it is followed by a &, #, or the end of the string (the end of a query string value).
However..if there is more to foo=\d*, the regex would be kicked back by the alternator to a generic [^#] match right at the [?&] before foo. This isn't good, because it will continue to match! So before you look for a generic query string ([^#]), you must make sure you are not looking at a foo (that must be handled by the first alternation). This is where the negative lookahead (?![?&]foo=) comes in handy.
This will work with multiple foo keys, since they will all have to equal non-negative integers. This lets foo be optional (or equal null) as well.
Disclaimer: Most Regex101 demos use PHP for better syntax highlighting and include \n in negative character classes since there are multiple lines of examples.
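One JavaScript-specific detail to be aware of when reading match[1]: captures inside a repeated group are reset at the start of every repetition, so the value is only reported when the foo pair is the last thing the repeating group matched. The sketch below runs the final pattern to show this; fooOf and finalPattern are names made up for the example.

```javascript
const finalPattern = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/;

// Returns null when the URI is rejected, otherwise the reported capture
// (which may be undefined even though a foo pair was matched).
function fooOf(uri) {
  const m = finalPattern.exec(uri);
  return m ? m[1] : null;
}

console.log(fooOf("path?bar=abc&foo=123")); // foo pair is matched last
console.log(fooOf("path?foo=123&bar=abc")); // later repetitions clear the capture
console.log(fooOf("path?foo=123abc"));      // rejected by the pattern
```

Whether a URI is accepted or rejected is unaffected; only the reported value is. If the value is needed regardless of parameter order, a pattern that captures outside the repeated group avoids the issue.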
Nice question! Seems fairly simple at first...but there are a lot of gotchas. Would advise checking any claimed solution will handle the following:
ADDITIONAL MATCH TESTS
path? # path found, foo = ''
path#foo # path found, foo = ''
path#bar # path found, foo = ''
path?foo= # path found, foo = ''
path?bar=1&foo= # path found, foo = ''
path?foo=&bar=1 # path found, foo = ''
path?foo=1#bar # path found, foo = 1
path?foo=1&foo=2 # path found, foo = 2
path?foofoo=1 # path found, foo = ''
path?bar=123&foofoo=1 # path found, foo = ''
ADDITIONAL DO NOT MATCH TESTS
pathbar? # path not found
pathbar?foo=1 # path not found
pathbar?bar=123&foo=1 # path not found
path?foo=a&foofoo=1 # not an integer
path?foofoo=1&foo=a # not an integer
The simplest regex I could come up with that works for all these additional cases is:
path(?=(\?|$|#))(\?(.+&)?foo=(\d*)(&|#|$)|((?![?&]foo=).)*$)
However, would advise adding ?: to the unused capturing groups so they are ignored and you can easily get the foo value from Group 1 - see Debuggex Demo
path(?=(?:\?|$|#))(?:\?(?:.+&)?foo=(\d*)(?:&|#|$)|(?:(?![?&]foo=).)*$)
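A small sketch for running this pattern against some of the cases listed above. The wrapper parseFoo and the name fooPattern are made up for the example; parseFoo maps an absent or empty foo to '' and a rejected input to null.

```javascript
const fooPattern = /path(?=(?:\?|$|#))(?:\?(?:.+&)?foo=(\d*)(?:&|#|$)|(?:(?![?&]foo=).)*$)/;

// Returns null when rejected, otherwise foo as a string ('' when absent or empty).
function parseFoo(uri) {
  const m = fooPattern.exec(uri);
  return m ? (m[1] || "") : null;
}

[
  "path?foo=1&foo=2",
  "path?foo=1#bar",
  "path?foofoo=1",
  "path?foo=a&foofoo=1",
  "pathbar?foo=1",
].forEach((uri) => console.log(uri + " --> " + JSON.stringify(parseFoo(uri))));
```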
^path\b(?!.*[?&]foo=(?!\d+(?=&|#|$)))(?:.*[?&]foo=(\d+)(?=&|#|$))?
Basically I just broke it down into three parts
^path\b # starts with path
(?!.*[?&]foo=(?!\d+(?=&|#|$))) # not followed by foo with an invalid value
(?:.*[?&]foo=(\d+)(?=&|#|$))? # possibly followed by foo with a valid value
see validation here http://regexr.com/39i7g
Caveats:
will match path#bar=1&foo=27
will not match path?foo=
The OP didn't mention these requirements and since he wants a simple regex (oxymoron?) I did not attempt to solve them.
path.+?(?:foo=(\d+))(?![a-zA-Z\d])|path((?!foo).)*$
You can try this. See demo.
http://regex101.com/r/jT3pG3/10
You can try the following regex:
path(?:.*?foo=(\d+)\b|()(?!.*foo))
regex101 demo
There are two possible matches after path:
.*?foo=(\d+)\b i.e. foo followed by digits.
OR
()(?!.*foo) an empty string if there is no foo ahead.
Add some word boundaries (\b) if you don't want the regex to interpret other words (e.g. another parameter named barfoobar) around the foos.
path(?:.*?\bfoo=(\d+)\b|()(?!.*\bfoo\b))
You can check for the existence of the 3rd matched group. If it is not there, the foo value is null; otherwise, it is the group itself:
/^(path)(?:$|\?(?:(?=.*\b(foo=)(\d+)\b.*$)|(?!foo=).*?))/gm
An example on regex101: http://regex101.com/r/oP6lU7/1
Working with the JavaScript regex engine, despite everything it lacks compared with PCRE, is somehow enjoyable!
I made this RegEx, simple and understandable:
^(?=path\?).*foo=(\d*)(?:&|$)|path$
Explanations
^(?=path\?) # A positive lookahead to ensure we have "path" at the very beginning
.*foo=(\d*)(?:&|$) # Look for foo=(zero or more digits) followed by a "&" character or the end of the string
| # OR
path$ # Just "path" itself
Runnable snippet:
var re = /^(?=path\?).*foo=(\d*)(?:&|$)|path$/gm;
var str = 'path?foo=67\npath?foo=67&bar=hello\npath?bar=bye&foo=1&baz=12\npath\npathtest\npath?foo=37signals\npath?foo=-8\nsomething?foo=1';
var m, n = [];
while ((m = re.exec(str)) != null) {
if (m.index === re.lastIndex) {
re.lastIndex++;
}
n.push(m[0]);
}
alert( JSON.stringify(n) );
Or a Live demo for more details
path(?:\?(?:[^&]*&)*foo=([0-9]+)(?:[&#]|$))?
This is as short as most, and reads more straightforwardly, since things that appear once in the string appear once in the RE.
We match:
the initial path
a question mark, (or skip to end)
some blocks terminated by ampersands
our parameter assignment
a closing confirmation, either starting the next syntactic element, or ending the line
Unfortunately it matches foo to None rather than '' when the foo parameter is omitted, but in Python (my language of choice) that is considered more appropriate. You could complain if you wanted, or simply append or '' to the result afterwards.
Based on the OP's data here is my attempt pattern
^(path)\b(?:[^f]+|f(?!oo=))(?!\bfoo=(?!\d+\b))(?:\bfoo=(\d+)\b)?
if path is found: sub-pattern #1 will contain "path"
if foo is valid: sub-pattern #2 will contain "foo value if any"
Demo
^(path)\b "path"
(?:[^f]+|f(?!oo=)) followed by anything but "foo="
(?!\bfoo=(?!\d+\b)) if "foo=" is found it must not see anything but \d+\b
(?:\bfoo=(\d+)\b)? if valid "foo=" is found, capture "foo" value
t = 'path?foo=67&bar=hello';
console.log(t.match(/\b(foo|path)\=\d+\b/))
regex /\b(foo|path)\=\d+\b/
I need to match the below type of strings using a regex pattern in javascript.
E.g. /this/<one or more than one word with hyphen>/<one or more than one word with hyphen>/<one or more than one word with hyphen>/<one or more than one word with hyphen>
So this single pattern should match both these strings:
1. /this/is/single-word
2. /this/is-more-than/single/word-patterns/to-match
Only the slash (/) and the 'this' string at the beginning are consistent, and the segments contain only letters.
You can use:
\/this\/[a-zA-Z ]+\/[a-zA-Z ]+\/[a-zA-Z ]+
Working Demo
I think you want something like this maybe?
(\/this\/(\w+\s?){1,}\/\w+\/(\w+\s?)+)
break down:
\/ # divider
this # keyword
\/ # divider
( # begin section
\w+ # one or more word characters
\s? # possibly followed by a space
) # end section
{1,} # match previous section at least 1 time, more if possible.
\/ # divider
\w+ # one or more word characters
\/ # divider
( # begin section
\w+ # one or more word characters
\s? # possible space
) # end section
Working example
This might be obvious, but to match each pattern as a separate result, I believe you want to place parentheses around the whole expression, like so:
(\/[a-zA-Z ]+\/[a-zA-Z ]+\/[a-zA-Z ]+\/[a-zA-Z ]+)
This makes sure that TWO results are returned, not just one big group.
Also, your question did not state that "this" would be static, as the other answers assumed... it says only the slashes are static. This should work for any text combo (no word this required).
Edit - actually looking back at your attempt, I see you used /this/ in your expression, so I assume that's why others did as well.
Demo: http://rubular.com/r/HGYp2qtmAM
Modified question samples:
/this/is/single-word
/this/is-more-than/single/word-patterns/to-match
Modified again The sections may have hyphen (no spaces) and there may be 3 or 4 sections beyond '/this/'
Modified pattern /^\/this(?:\/[a-zA-Z]+(?:-[a-zA-Z]+)*){3,4}$/
^
/this
(?:
/ [a-zA-Z]+
(?: - [a-zA-Z]+ )*
){3,4}
$
I'm trying to solve a string matching problem with regexes. I need to match URLs of this form:
http://soundcloud.com/okapi23/dont-turn-your-back/
And I need to "reject" URL of this form:
http://soundcloud.com/okapi23/sets/happily-reversed/
The trailing '/' is obviously optional.
So basically:
After the hostname, there can be 2 or 3 groups, and if in the second one is equal to "sets", then the regex should not match.
"sets" can be contained anywhere else in the URL
"sets" needs to be an exact match
What I came up with so far is http(s)?://(www\.)?soundcloud\.com/.+/(?!sets)\b(/.+)?, which fails.
Any suggestions? Are there any libraries that would simplify the task (for example, making trailing slashes optional)?
Assuming that the OP wants to test to see if a given string contains a URL which meets the following requirements:
URL scheme must be either http: or https:.
URL authority must be either //soundcloud.com or //www.soundcloud.com.
URL path must exist and must contain 2 or 3 path segments.
The second path segment must not be: "sets".
Each path segment must consist of one or more "words" consisting of only alphanumeric characters ([A-Za-z0-9]) and multiple words are separated by exactly one dash or underscore.
The URL must have no query or fragment component.
The URL path may end with an optional "/".
The URL should match case insensitively.
Here is a tested JavaScript function (with a fully commented regex) which does the trick:
function isValidCustomUrl(text) {
/* Here is the regex commented in free-spacing mode:
# Match specific URL having non-"sets" 2nd path segment.
^ # Anchor to start of string.
https?: # URL Scheme (http or https).
// # Begin URL Authority.
(?:www\.)? # Optional www subdomain.
soundcloud\.com # URL DNS domain.
/ # 1st path segment (can be: "sets").
[A-Za-z0-9]+ # 1st word-portion (required).
(?: # Zero or more extra word portions.
[-_] # only if separated by one - or _.
[A-Za-z0-9]+ # Additional word-portion.
)* # Zero or more extra word portions.
(?!/sets(?:/|$)) # Assert 2nd segment not "sets".
(?: # 2nd and 3rd path segments.
/ # Additional path segment.
[A-Za-z0-9]+ # 1st word-portion.
(?: # Zero or more extra word portions.
[-_] # only if separated by one - or _.
[A-Za-z0-9]+ # Additional word-portion.
)* # Zero or more extra word portions.
){1,2} # 2nd path segment required, 3rd optional.
/? # URL may end with optional /.
$ # Anchor to end of string.
*/
// Same regex in javascript syntax:
var re = /^https?:\/\/(?:www\.)?soundcloud\.com\/[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*(?!\/sets(?:\/|$))(?:\/[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*){1,2}\/?$/i;
if (re.test(text)) return true;
return false;
}
Instead of . use [a-zA-Z][\w-]* which means "match a letter followed by any number of letters, numbers, underscores or hyphens".
^https?://(www\.)?soundcloud\.com/[a-zA-Z][\w-]*/(?!sets(/|$))[a-zA-Z][\w-]*(/[a-zA-Z][\w-]*)?/?$
To get the optional trailing slash, use /?$.
In a Javascript regular expression literal all the forward slashes must be escaped.
I suggest going with this regex pattern:
^https?:\/\/soundcloud\.com(?!\/[^\/]+\/sets(?:\/|$))(?:\/[^\/]+){2,3}\/?$
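A minimal check of this pattern against the two URLs from the question plus a few edge cases, as a sketch. The names trackUrl and isTrack are made up for the example. Note that, as written, the pattern has no optional www. part.

```javascript
const trackUrl = /^https?:\/\/soundcloud\.com(?!\/[^\/]+\/sets(?:\/|$))(?:\/[^\/]+){2,3}\/?$/;

const isTrack = (url) => trackUrl.test(url);

console.log(isTrack("http://soundcloud.com/okapi23/dont-turn-your-back/"));
console.log(isTrack("http://soundcloud.com/okapi23/sets/happily-reversed/"));

// "sets" only counts as the whole second segment.
console.log(isTrack("http://soundcloud.com/okapi23/setsy/track"));
console.log(isTrack("http://soundcloud.com/sets/track"));
```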
I'm writing a rudimentary lexer using regular expressions in JavaScript and I have two regular expressions (one for single quoted strings and one for double quoted strings) which I wish to combine into one. These are my two regular expressions (I added the ^ and $ characters for testing purposes):
var singleQuotedString = /^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$/gi;
var doubleQuotedString = /^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$/gi;
Now I tried to combine them into a single regular expression as follows:
var string = /^(["'])(?:[^\1\\]|\\\1|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*\1$/gi;
However when I test the input "Hello"World!" it returns true instead of false:
alert(string.test('"Hello"World!"')); //should return false as a double quoted string must escape double quote characters
I figured that the problem is in [^\1\\] which should match any character besides matching group \1 (which is either a single or a double quote - the delimiter of the string) and \\ (which is the backslash character).
The regular expression correctly filters out backslashes and matches the delimiters, but it doesn't filter out the delimiter within the string. Any help will be greatly appreciated. Note that I referred to Crockford's railroad diagrams to write the regular expressions.
You can't refer to a matched group inside a character class: (['"])[^\1\\]. Try something like this instead:
(['"])((?!\1|\\).|\\[bnfrt]|\\u[a-fA-F\d]{4}|\\\1)*\1
(you'll need to add some more escapes, but you get my drift...)
A quick explanation:
(['"]) # match a single or double quote and store it in group 1
( # start group 2
(?!\1|\\). # if group 1 or a backslash isn't ahead, match any non-line break char
| # OR
\\[bnfrt] # match an escape sequence
| # OR
\\u[a-fA-F\d]{4} # match a Unicode escape
| # OR
\\\1 # match an escaped quote
)* # close group 2 and repeat it zero or more times
\1 # match whatever group 1 matched
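A sketch of how this could look as a JavaScript literal. Two things here are assumptions that go beyond the pattern above: the \\ and \/ escapes are folded into the escape class (the "more escapes" mentioned earlier), and ^ and $ are added so whole strings can be tested. The name quoted is made up for the example.

```javascript
// Assumption: \\ and \/ are added to the escape class, and ^...$ anchors
// are added for whole-string testing; the rest follows the pattern above.
const quoted = /^(['"])(?:(?!\1|\\).|\\[\\\/bfnrt]|\\u[a-fA-F\d]{4}|\\\1)*\1$/;

console.log(quoted.test('"Hello"World!"'));   // unescaped delimiter inside
console.log(quoted.test('"Hello\\"World!"')); // escaped delimiter inside
console.log(quoted.test("'it\\'s'"));         // same rule for single quotes
console.log(quoted.test('"it\'s"'));          // the other quote needs no escape
```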
This should work too (raw regex).
If speed is a factor, this is the 'unrolled' method, said to be the fastest for this kind of thing.
(['"])(?:(?!\\|\1).)*(?:\\(?:[\/bfnrt]|u[0-9A-F]{4}|\1)(?:(?!\\|\1).)*)*\1
Expanded
(['"]) # Capture a quote
(?:
(?!\\|\1). # As many non-escape and non-quote chars as possible
)*
(?:
\\ # escape plus,
(?:
[\/bfnrt] # /,b,f,n,r,t or u[0-9A-F]{4} or captured quote
| u[0-9A-F]{4}
| \1
)
(?:
(?!\\|\1). # As many non-escape and non-quote chars as possible
)*
)*
\1 # Captured quote
Well, you can always just create a larger regex by just using the alternation operator on the smaller regexes
/(?:single-quoted-regex)|(?:double-quoted-regex)/
Or explicitly:
var string = /(?:^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$)|(?:^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$)/gi;
Finally, if you want to avoid the code duplication, you can build up this regex dynamically, using the RegExp constructor.
var quoted_string = function(delimiter){
return ('^' + delimiter + '(?:[^' + delimiter + '\\]|\\' + delimiter + '|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*' + delimiter + '$').replace(/\\/g, '\\\\');
//in the general case you could consider using a regex escaping function to avoid backslash hell.
};
var string = new RegExp( '(?:' + quoted_string("'") + ')|(?:' + quoted_string('"') + ')' , 'gi' );
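An alternative sketch of the same idea using String.raw, which keeps backslashes as written and so avoids the extra replace step. The names quotedSource and stringToken are made up for the example. The g flag is left out on purpose here: with g, test() advances lastIndex between calls, which can make repeated tests on an anchored pattern give inconsistent results.

```javascript
// Hypothetical helper: builds the pattern source for one delimiter.
// String.raw keeps backslashes as written, so no double escaping is needed.
const quotedSource = (q) =>
  String.raw`${q}(?:[^${q}\\]|\\[${q}\\/bfnrt]|\\u[0-9A-F]{4})*${q}`;

// No "g" flag: with it, test() advances lastIndex between calls.
const stringToken = new RegExp(
  `^(?:${quotedSource("'")}|${quotedSource('"')})$`,
  "i"
);

console.log(stringToken.test('"Hello"World!"'));
console.log(stringToken.test('"Hello\\"World!"'));
console.log(stringToken.test("'single'"));
```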