Regex to match specific URL fragment and not all other URL possibilities - javascript

I have a website, say example.com, with an account page.
The URL may have GET parameters, which still count as part of the account page.
It may also have a URL fragment. If the fragment is home.html, it is still the account page; any other fragment means a different sub-page of the account page.
So I need a RegEx (JS) to match this case. This is what I managed to build so far:
example.com\/account\/(|.*\#home\.html|(\?(?!.*#.*)))$
https://regex101.com/r/ihjCIg/1
The first 4 rows are the cases I need to match. As you can see, the second row is not matched by my RegEx.
What am I missing here?

You could create 2 optional groups: one that matches ? followed by any characters except #, and another that matches #home.html
Note that the dot must be escaped to match it literally.
^example\.com\/account\/(?:\?[^#\r\n]*)?(?:#home\.html)?$
^ Start of string
example\.com\/account\/ Match start
(?: Non capturing group
\?[^#\r\n]* Match ? and 0+ times any char except # or a newline
)? Close group and make it optional
(?: Non capturing group
#home\.html Match #home.html
)? Close group and make it optional
$ End of string
let pattern = /^example\.com\/account\/(?:\?[^#\r\n]*)?(?:#home\.html)?$/;
[
"example.com/account/",
"example.com/account/?brand=mine",
"example.com/account/#home.html",
"example.com/account/?brand=mine#home.html",
"example.com/account/#other.html",
"example.com/account/?brand=mine#other.html"
].forEach(url => console.log(url + " --> " + pattern.test(url)));

The third alternative in your group has a negative lookahead that rejects any text containing a #, but nothing after it consumes the rest of the content up to the end of the line, so the $ cannot match. Check this updated regex demo:
https://regex101.com/r/ihjCIg/3
Notice that I have escaped your first dot (just before com) and added .* after the negative lookahead so that it matches your second sample.
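The updated pattern itself is only available behind the link. A minimal sketch of what the description above amounts to (first dot escaped, .* appended after the negative lookahead) might look like the following; the exact pattern and the names accountPattern, accountSamples and accountResults are assumptions, not a copy of the linked demo.

```javascript
// Assumed reconstruction: escape the dot before "com" and let .* consume
// the query string once the lookahead has ruled out any "#".
const accountPattern = /example\.com\/account\/(|.*#home\.html|\?(?!.*#).*)$/;

const accountSamples = [
  "example.com/account/",
  "example.com/account/?brand=mine",
  "example.com/account/#home.html",
  "example.com/account/?brand=mine#home.html",
  "example.com/account/#other.html",
  "example.com/account/?brand=mine#other.html",
];

// The first four samples should pass, the last two should not.
const accountResults = accountSamples.map(url => accountPattern.test(url));
accountSamples.forEach((url, i) => console.log(url + " --> " + accountResults[i]));
```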

example\.com\/account\/((\??[^#\r\n]+)?(#?home\.html)?)?$
This matches your first four strings
example.com/account/
example.com/account/?brand=mine
example.com/account/#home.html
example.com/account/?brand=mine#home.html
and excludes your last two
example.com/account/#other.html
example.com/account/?brand=mine#other.html

Related

Regex for getting region and lang from url /xx/xx in two groups

I have a url structure where the first subdirectory is the region and then the second optional one is the language overide:
https://example.com/no/en
I'm trying to get the two parts out in a group each. This way, in the JS, I can do the following to get each part of the url:
const pathname = window.location.pathname // '/no/en/mypage'
const match = pathname.match('xxx')
const region = match[1] // 'no' or '/no'
const language = match[2] // 'en' or '/en'
I have tried creating multiple regexes with no luck in nailing all of my requirements below:
This is the closest I have come, but it is prone to error because it also matches "/do" from /donotmatch:
(\/[a-z]{2})(\/[a-z]{2})?
The problem with this one is that it's also matching cases like /noada.
I then tried to match first two a-z and then followed by either a forward slash or no characters like this: (\/[a-z]{2}\/|[^.])([a-z]{2}\/|[^.])? I think I am not getting the syntax correct for the not part.
The regex I am trying to create has to meet these criteria in order not to break:
/no - group 1 match(no), group 2 undefined
/no/ - group 1 match(no), group 2 undefined
/nona - no matches
/no/en - group 1 match(no), group 2 match(en)
/no/en/ - group 1 match(no), group 2 match(en)
/no/enen - group 1 match(no), group 2 undefined
/no/en/something - group 1 match(no), group 2 match(en)
/no/en/jp - group 1 match(no), group 2 match(en) (jp is not going to be matched)
I feel I am really close to a working solution, but all my tries so far have been off in a slight way.
If the group part is not possible, I suppose also getting /xx/xx and then splitting by / is also an option.
You may use this regex with an optional 2nd capture group:
\/(\w{2})(?:\/(\w{2}))?(?:\/|$)
RegEx Explanation:
\/: Match starting /
(\w{2}): First capture group to match 2 word characters
(?:\/(\w{2}))?: Optional non-capture group that starts with a / followed by the second capture group to match 2 word characters.
(?:\/|$): Match closing / or end of line
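As a rough check, the pattern can be run against the paths listed in the question. This is only a sketch: the helper name parseRegionLang is hypothetical, and the leading ^ is an addition (not part of the pattern above) so that only the start of the pathname is considered.

```javascript
// Pattern from the answer, with an assumed ^ anchor prepended.
const regionLangPattern = /^\/(\w{2})(?:\/(\w{2}))?(?:\/|$)/;

// Hypothetical helper: returns null when the path does not start with /xx.
function parseRegionLang(pathname) {
  const match = pathname.match(regionLangPattern);
  if (!match) return null;
  return { region: match[1], language: match[2] };
}

["/no", "/no/", "/nona", "/no/en", "/no/en/", "/no/enen", "/no/en/something", "/no/en/jp"]
  .forEach(p => console.log(p, JSON.stringify(parseRegionLang(p))));
```

Because the second group is optional, language is undefined whenever the second segment is missing or is not exactly two word characters.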
Follow each capture with (?=$|/), which is a look ahead to assert that what comes next is either end of input or a slash.
https?://[^/]+/(\w\w)(?=$|/)(?:/(\w\w)(?=$|/))?
The second capture is wrapped in an optional non-capture group via (?:…)?
To be more strict to allow only letters, replace \w with [a-z] but \w may be enough for your needs.
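The lookahead variant works on the full URL rather than the pathname. A small sketch of how it could be used in JS follows; the helper name regionLangFromUrl is hypothetical, and the slashes are escaped only because a regex literal is used.

```javascript
// Same pattern as above, written as a JS regex literal.
const urlRegionLang = /https?:\/\/[^\/]+\/(\w\w)(?=$|\/)(?:\/(\w\w)(?=$|\/))?/;

// Hypothetical helper: returns [region, language] or null.
function regionLangFromUrl(url) {
  const m = url.match(urlRegionLang);
  return m ? [m[1], m[2]] : null;
}

console.log(regionLangFromUrl("https://example.com/no/en"));   // [ 'no', 'en' ]
console.log(regionLangFromUrl("https://example.com/no/enen")); // [ 'no', undefined ]
console.log(regionLangFromUrl("https://example.com/nona"));    // null
```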
I just saw this new way of getting the same result using the new URL Pattern API.
The API is quite new as of the writing of this answer, but there is a polyfill you can use to add support for it right now.
const pattern = new URLPattern({ pathname: '/:region(\\w{2})/:lang(\\w{2})?' })
const result = pattern.exec('https://example.com/no/en')?.pathname?.groups
const region = result?.region // no or undefined
const lang = result?.lang // en or undefined
To solve the issue with trailing slashes, one could strip them before passing the URL string to the exec method.
// ...
const urlWithoutTrailingSlashes = 'https://example.com/no/en/'.replace(/\/+$/, '')
const result = pattern.exec(urlWithoutTrailingSlashes)?.pathname?.groups
// ...
I have not yet found a way to handle the optional trailing slashes in the regex inside the pattern, because of the limitations around lookaheads and end-of-string matching. If anyone finds a way, please edit this answer or add a comment to it.

JS regex: one correct match out of three and one false match

This JS regex error is killing me - one correct match out of three and one false match.
If it makes a difference I am writing my script in Google Apps Script.
I have an XML-formatted string in which I want to match three date nodes, as follows:
<dateCreated>1619155581543</dateCreated>
<dispatchDate>1619478000000</dispatchDate>
<deliveryDate>1619564400000</deliveryDate>
I don't care about the tags so much - I just need enough to reliably replace them. I am using this regular expression:
var regex = new RegExp('[dD]ate(.{1,})?>[0-9]{13,}</');
These are the matches:
dateCreated>1619155581543</
Created
Obviously I understand number 1 - I wanted that. But I do not understand how 2 was matched. Also why were dispatchDate and deliveryDate not matched? All three targets are matched if I use the above regex in BBEdit and on https://ihateregex.io/playground and neither of those match "Created".
I've also tried this regular expression without success:
var regex = new RegExp('[dD]ate.{0,}>[0-9]{13,}</');
If you can't answer why my regex fails but you can offer a working solution I'd still be happy with that.
The first pattern that you tried [dD]ate(.{1,})?>[0-9]{13,}</ matches:
[dD]ate Match date or Date
(.{1,})? Optional capture group, match 1+ times any char (This group will capture Created)
> Match literally
[0-9]{13,} Match 13 or more digits 0-9
</ Match literally
What you will get are partial matches from date till </ and the first capture group will contain Created
The second pattern is almost the same, except for {0,} which matches 0 or more times, and there is no capture group.
Still this will give you partial matches.
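To make the two observations concrete: without the g flag, String.prototype.match returns the first match followed by its capture groups, which is where "Created" comes from, and it stops after that first match. The sketch below assumes the three nodes sit on separate lines (the names xmlSnippet, firstOnly and allMatches are illustrative).

```javascript
// Assumed input layout: one node per line, as shown in the question.
const xmlSnippet = [
  "<dateCreated>1619155581543</dateCreated>",
  "<dispatchDate>1619478000000</dispatchDate>",
  "<deliveryDate>1619564400000</deliveryDate>",
].join("\n");

// No g flag: the array is [full match, capture group 1] for the first match only.
const firstOnly = xmlSnippet.match(new RegExp("[dD]ate(.{1,})?>[0-9]{13,}</"));
console.log(firstOnly[0]); // dateCreated>1619155581543</
console.log(firstOnly[1]); // Created

// With the g flag: every full match is returned and capture groups are omitted.
const allMatches = xmlSnippet.match(new RegExp("[dD]ate(.{1,})?>[0-9]{13,}</", "g"));
console.log(allMatches);
```

With g, the second and third entries start at "Date" rather than at the tag name, which is the partial matching described above.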
What you could do to match the whole element is either harness the power of an XML parser (which would be the recommended way) or use a pattern that assumes only digits between the tags and no < > chars between the opening and closing.
Note that this is a brittle solution.
<([^<>]*[dD]ate[^<>]*)>\d{13}<\/\1>
< Match literally
( Capture group 1 (this group is used for the backreference \1 at the end of the pattern)
[^<>]* Match 0+ times any character except < or >
[dD]ate[^<>]* Match either date or Date followed 0+ times any char except < or >
) Close group 1
> Match literally
\d{13} Match 13 digits (or \d{13,} for 13 or more)
<\/\1> Match </ then a backreference to the exact text that is captured in group 1 (to match the name of the closing tag) and then match >
A bit more restricted pattern could be allowing only word characters \w around matching date
<(\w*[dD]ate\w*)>\d{13}<\/\1>
const regex = /<([^<>]*[dD]ate[^<>]*)>\d{13}<\/\1>/;
[
"<dateCreated>1619155581543</dateCreated>",
"<dispatchDate>1619478000000</dispatchDate>",
"<deliveryDate>1619564400000</deliveryDate>",
"<thirteendigits>1619564400000</thirteendigits>",
].forEach(str => {
const match = str.match(regex);
console.log(match ? `Match --> ${str}` : `No match --> ${str}`)
});
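Since the goal in the question is to replace the nodes, here is one possible sketch using the stricter \w variant with the g flag. It assumes the digits are millisecond timestamps and adds a second capture group for them; the function name timestampsToIso and the ISO output format are illustrative choices, not something the question asks for.

```javascript
// Assumed variation of the pattern: group 1 is the tag name, group 2 the digits.
const dateNode = /<(\w*[dD]ate\w*)>(\d{13})<\/\1>/g;

// Hypothetical helper: rewrites every matching node, leaving other nodes untouched.
function timestampsToIso(xml) {
  return xml.replace(dateNode, (whole, tag, millis) =>
    `<${tag}>${new Date(Number(millis)).toISOString()}</${tag}>`);
}

console.log(timestampsToIso(
  "<dateCreated>1619155581543</dateCreated><thirteendigits>1619564400000</thirteendigits>"
));
```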

How to form regex to match everything up to a "("

In javascript, how can a regular expression be formed to match everything up to and NOT including an opening parenthesis "("?
example input:
"12(pm):00"
"12(am):))"
"8(am):00"
I've found /^(.*?)\(/ to be successful with the "up to" part, but the match returned includes the "("
On regex101.com, it says the first capturing group is what I'm looking for. Is there a way to return only the captured group?
There are three ways to deal with this. The first is to restrict the characters you match to not include the parenthesis:
let match = "12(pm):00".match(/[^(]*/);
console.log(match[0]);
The second is to only get the part of the match you are interested in, using capture groups:
let match = "12(pm):00".match(/(.*?)\(/);
console.log(match[1]);
The third is to use lookahead to explicitly exclude the parenthesis from the match:
let match = "12(pm):00".match(/.*?(?=\()/);
console.log(match[0]);
As in the OP, note the non-greedy modifier in the second and third cases: it is necessary to restrict the quantifier in case there is another open parenthesis further inside the string. This is not necessary in the first case, since the quantifier is explicitly forbidden to gobble up the parenthesis.
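One practical difference between the three approaches shows up when the input contains no "(" at all. The sketch below puts them side by side; the function name upToParen is hypothetical.

```javascript
// Hypothetical helper that runs all three approaches on the same input.
function upToParen(input) {
  const negated = input.match(/[^(]*/);       // always matches, possibly the whole string
  const captured = input.match(/(.*?)\(/);    // null when there is no "("
  const lookahead = input.match(/.*?(?=\()/); // null when there is no "("
  return {
    negated: negated ? negated[0] : null,
    captured: captured ? captured[1] : null,
    lookahead: lookahead ? lookahead[0] : null,
  };
}

console.log(upToParen("12(pm):00")); // all three give "12"
console.log(upToParen("12:00"));     // negated gives "12:00", the other two give null
```

So the negated character class silently returns the whole string when there is nothing to stop at, while the other two report no match.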
Try
^\d+
^ asserts position at start of a line
\d matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
https://regex101.com/r/C9XNT4/1

Match group before nth character and after that

I want to match everything before the nth character (except the first character) and everything after it. So for the following string
/firstname/lastname/some/cool/name
I want to match
Group 1: firstname/lastname
Group 2: some/cool/name
With the following regex, I'm nearly there, but I can't find the correct regex to also correctly match the first group and ignore the first /:
([^\/]*\/){3}([^.]*)
Note that I always want to match the 3rd forward slash. Everything after that can be any character that is valid in an URL.
Your regex groups are not giving the proper result because with ([^\/]*\/){3} you are repeating a capture group, and each iteration overwrites the previously captured value.
You can use
^.([^/]+\/[^/]+)\/(.*)$
let str = `/firstname/lastname/some/cool/name`
let op = str.match(/^.([^/]+\/[^/]+)\/(.*)$/)
console.log(op)
Ignoring the first /, then capturing the first two words, then capturing the rest of the phrase after the /.
^(?:\/)([^\/]+\/[^\/]+)\/(.+)
The quantifier {3} repeats 3 times the capturing group, which will have the value of the last iteration.
The first iteration will match /, the second firstname/ and the third (the last iteration) lastname/ which will be the value of the group.
The second group uses [^.]*, which matches 0+ characters other than a literal dot and so does not take the structure of the data into account.
If you want to match the full pattern, you could use:
^\/([^\/]+\/[^\/]+)\/([^\/]+(?:\/[^\/]+)+)$
Explanation
^ Start of string
\/ Match the leading /
( Capture group 1
[^\/]+\/[^\/]+ Match 1+ chars other than / using a negated character class, then a /, then again 1+ chars other than /
) Close group
\/ Match /
( Capture group 2
[^\/]+ Match 1+ times not /
(?:\/[^\/]+)+ Repeat 1+ times matching / and 1+ times not / to match the pattern of the rest of the string.
) Close group
$ End of string
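A short sketch that contrasts the two patterns on the sample string from the question (the variable names pathSample, repeated and full are illustrative):

```javascript
const pathSample = "/firstname/lastname/some/cool/name";

// Repeated capture group: group 1 keeps only the value of the last iteration.
const repeated = pathSample.match(/([^\/]*\/){3}([^.]*)/);
console.log(repeated[1]); // lastname/
console.log(repeated[2]); // some/cool/name

// Full pattern: group 1 spans the first two segments, group 2 the rest.
const full = pathSample.match(/^\/([^\/]+\/[^\/]+)\/([^\/]+(?:\/[^\/]+)+)$/);
console.log(full[1]); // firstname/lastname
console.log(full[2]); // some/cool/name
```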

Regex remove string in url

I have an url like https://randomsitename-dd555959b114a0.mydomain.com and want to remove the -dd555959b114a0 part of the url.
So randomsitename is a random name and the domain is static domain name.
Is this possible to remove the part with jquery or javascript?
Look at this code that is using regex
var url = "https://randomsitename-dd555959b114a0.mydomain.com";
var res = url.replace(/ *\-[^.]*\. */g, ".");
http://jsfiddle.net/VYw9Y/
It's usually best to code for all possible cases, and since hyphens are allowed within any part of a domain name, you'll more than likely want to use a more specific RegExp such as:
^ # start of string
( # start first capture group
[a-z]+ # one or more letters
) # end first capture group
:// # literal separator
( # start second capture group
[^.-]+ # one or more chars except dot or hyphen
) # end second capture group
(?: # start optional non-capture group
- # literal hyphen
[^.]+ # one or more chars except dot
)? # end optional non-capture group
( # start third capture group
.+ # one or more chars
) # end third capture group
$ # end of string
Or without comments:
^([a-z]+)://([^.-]+)(?:-[^.]+)?(.+)$
(Remember to escape slashes if you use the literal form for RegExps rather than creating them as objects, i.e. /literal\/form/ vs. new RegExp('object/form'))
Used in a string replacement, the second argument should then be: $1://$2$3
Previous answers will fail for URLs like http://foo.bar-baz.com or http://foo-bar.baz-blarg.com.
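Put together in JS, following the commented version of the pattern (with [^.-]+ in the second group), this could look like the sketch below. The names hostSuffixPattern and stripHostSuffix are hypothetical.

```javascript
// Literal form of the commented pattern; the slashes are escaped as noted above.
const hostSuffixPattern = /^([a-z]+):\/\/([^.-]+)(?:-[^.]+)?(.+)$/;

// Hypothetical helper: drops a "-suffix" from the first host label only.
function stripHostSuffix(url) {
  return url.replace(hostSuffixPattern, "$1://$2$3");
}

console.log(stripHostSuffix("https://randomsitename-dd555959b114a0.mydomain.com"));
// https://randomsitename.mydomain.com
console.log(stripHostSuffix("http://foo.bar-baz.com"));
// http://foo.bar-baz.com (hyphen outside the first label is left alone)
```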
You could try this regex,
(.*)(-[^\.]*)(.*$)
Your code should be,
var url = "https://randomsitename-dd555959b114a0.mydomain.com";
var res = url.replace(/(.*)(-[^\.]*)(.*$)/, "$1$3");
//=>https://randomsitename.mydomain.com
Explanation:
(.*) matches any character 0 or more times and stores it in group 1, because those characters are enclosed in parentheses. Since .* is greedy, the engine backtracks until the following - can match, so group 1 ends just before the last - in the string.
(-[^\.]*) From that - up to a literal . is stored in group 2. It stops when it finds a literal dot.
(.*$) From the literal dot up to the last character is stored in group 3.
$1$3 at the replacement part prints only the stored group1 and 3.
OR
(.*)(?:-[^\.]*)(.*$)
If you use this regex, in the replacement part you need to put only $1 and $2.
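One thing worth checking with this pattern is how the greedy (.*) behaves when the URL contains more than one hyphen: the part that gets removed starts at the last hyphen. A small sketch (the names greedyPattern and removeLastHyphenPart are hypothetical):

```javascript
const greedyPattern = /(.*)(-[^\.]*)(.*$)/;

// Hypothetical helper wrapping the replace call from the answer.
function removeLastHyphenPart(url) {
  return url.replace(greedyPattern, "$1$3");
}

console.log(removeLastHyphenPart("https://randomsitename-dd555959b114a0.mydomain.com"));
// https://randomsitename.mydomain.com
console.log(removeLastHyphenPart("http://foo-bar.baz-blarg.com"));
// http://foo-bar.baz.com (the last hyphenated part is removed, not the first)
```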