Match optional domain within string - javascript

I've racked my brain over this JS regex and have so far only managed to get parts of it to work or the whole thing to work in certain circumstances.
I have a string like this:
Some string<br>http://anysubdomain.particulardomain.com<br>Rest of string
The goal is to move the domain part to the end of the string, if it's there. The http part is also optional and can also be https. The domain is always particulardomain.com; the subdomain can be anything.
I've managed to get everything into capture groups when the domain with protocol is present with this regex:
(.*)(https?\:\/\/[a-z\d\-]*\.particulardomain\.com)(.*)
But any attempt at making the domain part and the protocol part within it optional has resulted in no or the wrong matches.
The end result I'm looking for is to have the three parts of the string – beginning, domain, end – in separate capture groups so I can move capture group 2 (the domain part) to the end, or, if there's no domain present, the whole string in the first capture group.
To clarify, here are some examples with the expected output/capture groups:
INPUT:
Some string<br>http://anysubdomain.particulardomain.com<br>Rest of string
OR (no protocol):
Some string<br>anysubdomain.particulardomain.com<br>Rest of string
OUTPUT:
$1: Some string<br>
$2: http://anysubdomain.particulardomain.com
$3: <br>Rest of string
INPUT:
Some string<br>Rest of string
OUTPUT:
$1: Some string<br>Rest of string
$2: empty
$3: empty

One mistake in your regex is that it contains only particular whereas
the source text contains particulardomain, but this is a detail.
Now let's move to the protocol part. You put only one ? (after s),
which means that only s is optional, but both http and :
are still required.
To make the whole protocol optional, you must:
enclose it with a group (either capturing or not),
make this group optional (put ? after it).
And now maybe the most important thing: your regex starts with (.*).
Note that this is the greedy version, which:
initially tries to capture the whole rest of the source string,
then moves back one char at a time, to allow the following part
of the regex to match.
Change it to the reluctant version (.*?) and then the optional
group (https?:)? will match as expected.
Another detail: the \ before : is not needed. It does not do
any harm either, but following the principle "Keep It Simple...",
I recommend deleting it (as I did above).
One more detail: after [a-z\d\-] (the subdomain part) you should put
+, not *, as this part may not be empty.
So the whole regex can be:
(.*?)((https?:)?\/\/[a-z\d\-]+\.particulardomain\.com)(.*)
And the last remark: I am in doubt, whether you really need three
capturing groups. Maybe it would be enough to leave only the content
of the middle capturing group, i.e.:
(https?:)?\/\/[a-z\d\-]+\.particulardomain\.com
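A minimal sketch of how the points above could be combined, under two assumptions that are not in the answer itself: the `//` is grouped together with the protocol, so the whole `http://` or `https://` prefix is optional (the question's second example has no `//` at all), and the domain-plus-rest part sits in an optional group, so a string without the domain still matches with everything in `$1`. The names `DOMAIN_PARTS`, `splitAroundDomain` and `moveDomainToEnd` are made up for illustration.

```javascript
// Group 1: text before the domain (lazy, so it stops at the first domain).
// Group 2: optional protocol + subdomain + particulardomain.com.
// Group 3: everything after the domain.
// Groups 2 and 3 sit inside an optional non-capturing group, so a string
// without the domain still matches and ends up entirely in group 1.
const DOMAIN_PARTS = /^(.*?)(?:((?:https?:\/\/)?[a-z\d-]+\.particulardomain\.com)(.*))?$/;

function splitAroundDomain(str) {
  const m = DOMAIN_PARTS.exec(str);
  if (!m) return null; // only happens if str contains a line break
  return { before: m[1], domain: m[2] || '', after: m[3] || '' };
}

function moveDomainToEnd(str) {
  const parts = splitAroundDomain(str);
  return parts ? parts.before + parts.after + parts.domain : str;
}

console.log(splitAroundDomain('Some string<br>anysubdomain.particulardomain.com<br>Rest of string'));
console.log(moveDomainToEnd('Some string<br>Rest of string'));
```

The lazy `(.*?)` only works here because the pattern is anchored with `^` and `$`; without the anchors the optional group could simply be skipped at position 0.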

Found a solution. Since, as stated, the goal is to move the domain to the end of the string, if it's present, I'm just matching the domain and anything after it. If there's no domain, nothing matches and hence nothing gets replaced. The problem was the two .* both at the beginning and the end of the regex. Only the one at the end is needed.
REGEX:
([a-z\d\-:\/]+\.particulardomain\.com)(.*)
Works for the following strings:
Domain present:
Start of string 1234<br>https://subdomain.particulardomain.com<br>End of string 999
Domain without protocol:
Start of string 1234<br>subdomain.particulardomain.com<br>End of string 999
No domain:
Start of string 1234<br>End of string 999
Thanks everyone for helping me rethink the problem!
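For anyone wanting to try this, here is a small sketch of the replacement step. The replacement string `'$2$1'` and the function name `moveDomainToEndBySwap` are my assumptions; the post above only shows the pattern.

```javascript
// The pattern from above: group 1 is the domain (with an optional protocol,
// since ':' and '/' are part of the character class), group 2 is everything
// after it.
const DOMAIN_AND_REST = /([a-z\d\-:\/]+\.particulardomain\.com)(.*)/;

function moveDomainToEndBySwap(str) {
  // Swapping the two groups puts the domain last. Without a match,
  // replace() returns the string unchanged.
  return str.replace(DOMAIN_AND_REST, '$2$1');
}

console.log(moveDomainToEndBySwap('Start of string 1234<br>https://subdomain.particulardomain.com<br>End of string 999'));
console.log(moveDomainToEndBySwap('Start of string 1234<br>End of string 999'));
```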

I see good answers here. As you explained, you need three groups and want to move the domain to the back of the string (it is not clear to me whether you mean the entire URL or only the domain, e.g. particulardomain.com).
You can do this:
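//I don't know whether the <br> tags matter for your problem; I'll suppose they don't.
//This is your input:
let str = "Start of string 1234<br>https://subdomain.particulardomain.com<br>End of string 99";
let group = str.split("<br>");
//Find the part that is the domain with a built-in array function and the regExp you made
//(anchored here, so the domain is validated separately).
let indexOfDomain = group.findIndex(part => /^[a-z\d\-:\/]+\.particulardomain\.com$/.test(part));
//indexOfDomain is -1 when there is no domain; otherwise move that part to the end.
if (indexOfDomain !== -1) group.push(group.splice(indexOfDomain, 1)[0]);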
TO KEEP IN MIND:
Your solution will not work 100% of the time. Why?
Your regExp:
([a-z\d\-:\/]+\.particulardomain\.com)(.*)
will match http, https, or anything else built from those characters that is not a protocol, and it will not match anything in this input (you can test it and leave a comment if you like):
Start of string 1234<br>End of string 999
The regExp from @Valdi_Bo's answer:
(.*?)((https?:)?\/\/[a-z\d\-]+\.particulardomain\.com)(.*)
fits what you described in the question.
It does not fit all of your inputs, though; maybe he did not test it against all of them, since your question did not explain them the way your own answer did.
In conclusion, in the end you need to extract the domain (I don't know whether that means the entire URL, as you mix up the two ideas). If you are not going to use it, do a split and then validate with the regExp; it will be easier.

Related

Regex: Replace last segment of url

I try to figure out the correct regex to replace the last segment of an url with a modified version of that very last segment. (I know that there are similar threads out there, but none seemed to help...)
Example:
https://www.test.com/one/two/three/mypost/
--->
one/two/three?id=mypost
https://www.test.com/one/mypost/
--->
one?id=mypost
Now I am stuck here:
https://regex101.com/r/9GqYaU/1
I can get the last segment in capturing group 2 but how would I replace it?
I think I will have to do something like this:
const url = 'https://www.test.com/one/two/three/mypost/'
const regex = /(http[s]?:\/\/)([^\/]+\/)*(?=\/$|$)/
const path = url.replace(regex, `${myUrlWithoutTheLastSegmentAnd WithoutHTTPS}?id=$2`)
return path
But I have no idea how to get the url without the last segment. I have currently only access to the whole string or group 1 (which is useless in this case) and then group 2, but not the string without group 2.
I would be very glad for any help here. Sometimes I just lack the knowledge of what is possible with regex and how to achieve it.
Thank you in advance.
Cheers
You could use the URL class to extract the pathname and substring to remove the first '/'.
Then, you could put the last part of the pathname in a group and use it as a reference $1 for the replacement.
const url = new URL('https://www.test.com/one/two/three/mypost/').pathname.substring(1)
console.log(url.replace(/\/([^/]*)\/$/, '?id=$1'))
I came across your question yesterday and agree with going down the route of parsing the URL. Once you get there you could even use JavaScript array methods which I prefer to string methods like:
pathname.split("/").filter(p => p.length).pop()
This would separate each folder, ignore any with no length (i.e. handle a trailing slash) and return the last one (mypost).
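To make that concrete, here is a rough sketch of the whole rewrite done with array methods instead of a regex. The function name `rewriteLastSegment` and the choice to return an empty string for a bare `/` path are my own assumptions.

```javascript
function rewriteLastSegment(urlString) {
  // URL is a global in browsers and Node.js; it throws on invalid input.
  const segments = new URL(urlString).pathname
    .split('/')
    .filter(p => p.length); // drops the empty pieces around leading/trailing slashes
  const last = segments.pop(); // undefined when the path is just "/"
  if (last === undefined) return '';
  return segments.join('/') + '?id=' + last;
}

console.log(rewriteLastSegment('https://www.test.com/one/two/three/mypost/')); // one/two/three?id=mypost
console.log(rewriteLastSegment('https://www.test.com/one/mypost/'));           // one?id=mypost
```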
Anyway, I am also learning regex so sometimes when I find a question like this I just try to find the answer anyway as the best way of learning is doing. It took 24 hours 😂 I came up with this:
/(https?:\/\/).+?([a-z-]*)\/?$/gm
(https?:\/\/) you know what this does. Small correction, you don't need the square brackets. Question mark matches 0 or 1 of the preceding character. As we're only matching s this just works. If you wanted to match s or z you would use [sz]?. I think.
.+? this is the cool one I think I will use in future now I found it. The question mark here has a different meaning - it makes .+ (which means one or more of any character) non-greedy. That means it matches as few characters as possible and stops as soon as the rest of the pattern can match. Which is...
([a-z-]*) any number of letters or a hyphen. You should maybe change this to include numbers and upper case.
\/? Optional slash
$ all this must apply at the end of the string.
Here is a demo
https://regex101.com/r/mQNkIS/1
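A quick way to see what the non-greedy `.+?` buys you is to run the same pattern with and without the extra question mark. This is only a sketch: the helper name `lastSegment` is made up, and I dropped the g and m flags because a single match does not need them.

```javascript
function lastSegment(url, lazy) {
  // Same pattern as above; only the quantifier after the protocol differs.
  const pattern = lazy
    ? /(https?:\/\/).+?([a-z-]*)\/?$/
    : /(https?:\/\/).+([a-z-]*)\/?$/;
  const m = pattern.exec(url);
  return m ? m[2] : null;
}

console.log(lastSegment('https://www.test.com/one/two/three/mypost/', true));  // "mypost"
console.log(lastSegment('https://www.test.com/one/two/three/mypost/', false)); // ""
```

In the greedy version `.+` swallows the rest of the string, and since `([a-z-]*)` and `\/?` are both allowed to match nothing, the engine never has to give anything back, so group 2 stays empty.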

How do I match URLs with regular expressions?

We want to check if a URL matches mail.google.com or mail.yahoo.com (also a subdomain of them is accepted) but not a URL which contains this string after a question mark. We also want the strings "mail.google.com" and "mail.yahoo.com" to come before the third slash of the URL, for example https://mail.google.com/ is accepted, https://www.facebook.com/mail.google.com/ is not accepted, and https://www.facebook.com/?mail=https://mail.google.com/ is also not accepted. https://mail.google.com.au/ is also not accepted. Is it possible to do it with regular expressions?
var possibleURLs = /^[^\?]*(mail\.google\.com|mail\.yahoo\.com)\//gi;
var url;
// assign a value to var url.
if (url.match(possibleURLs) !== null) {
// Do something...
}
Currently this will match both https://mail.google.com/ and https://www.facebook.com/mail.google.com/ , but we don't want to match https://www.facebook.com/mail.google.com/.
Edit: I want to match any protocol (any string which doesn't contain "?" and "/") followed by a slash "/" twice (the string and the slash can both be twice), then any string which doesn't contain "?" and "/" (if it's not empty, it must end with a dot "."), and then (mail\.google\.com|mail\.yahoo\.com)\/. Case insensitive.
Not being funny - but why must it be a regular expression?
Is there a reason why you couldn't simplify the process using URL (or webkitURL in Chrome and Safari) - the URL constructor simply takes a string and then contains properties for each part of the URL. Whether it supports all the host types that you want to support, I don't know.
Granted, you might still need a regex after that (although really you'd just be checking that the hostname ends with either yahoo.com or google.com), but you would just be running it against the hostname of the URL object rather than the whole URI.
The API is not ubiquitous, but seems reasonably well supported and, anyway, if this is client-side validation then I hope you're checking it on the server, too, because sidestepping javascript validation is easy.
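A rough sketch of that idea, assuming the global `URL` constructor is available. The names `MAIL_HOSTS` and `isMailHost` are invented for the example, and it compares the hostname with plain string methods rather than a regex.

```javascript
const MAIL_HOSTS = ['mail.google.com', 'mail.yahoo.com'];

function isMailHost(urlString) {
  let hostname;
  try {
    hostname = new URL(urlString).hostname.toLowerCase();
  } catch (err) {
    return false; // not parsable as an absolute URL
  }
  // Accept the host itself or any subdomain of it.
  return MAIL_HOSTS.some(h => hostname === h || hostname.endsWith('.' + h));
}

console.log(isMailHost('https://mail.google.com/'));                  // true
console.log(isMailHost('https://www.facebook.com/mail.google.com/')); // false
```

Because only the hostname is inspected, anything in the path or after the question mark cannot cause a false positive.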
How about
^[a-z]+:\/\/([^.\/]+\.)*mail\.(google|yahoo)\.com\/
Regex Example Link
^ Anchors the regex at the start of the string
[a-z]+ Matches the protocol. If you want a specific set of protocols, then (https?|ftp) may do the work
([^.\/]+\.)* matches the subdomain part
^([-a-z]+://|^cid:|^//)([^/\?]+\.)?mail\.(google|yahoo)\.com/
Should do the trick
The first ^ means "match beginning of line", the second negates the allowed characters, thus making a slash / not allowed.
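Nb. In a regex literal you still have to escape the slashes. Alternatively, pass the pattern as a string to new RegExp(string); the slashes then need no escaping, but every backslash has to be doubled:
new RegExp('^([-a-z]+://|^cid:|^//)([^/?]+\\.)?mail\\.(google|yahoo)\\.com/')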
OK, I found that it works with:
var possibleURLs = /^([^\/\?]*\/){2}([^\.\/\?]+\.)*(mail\.google\.com|mail\.yahoo\.com)\//gi;
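Here is a small sketch that runs this pattern against the URLs from the question. The one change I made is dropping the g flag: with g, `test()` remembers `lastIndex` between calls, which can make repeated checks on the same regex object alternate between true and false. The names `MAIL_URL` and `isMailUrl` are just for the example.

```javascript
// Two slash-terminated chunks (protocol and the empty part between the
// slashes), then any number of subdomain labels, then the mail host and "/".
const MAIL_URL = /^([^\/\?]*\/){2}([^\.\/\?]+\.)*(mail\.google\.com|mail\.yahoo\.com)\//i;

function isMailUrl(url) {
  return MAIL_URL.test(url);
}

console.log(isMailUrl('https://mail.google.com/'));                  // true
console.log(isMailUrl('https://www.facebook.com/mail.google.com/')); // false
```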

Capturing optional part of URL with RegExp

While writing an API service for my site, I realized that String.split() won't do it much longer, and decided to try my luck with regular expressions. I have almost done it but I can't find the last bit. Here is what I want to do:
The URL represents a function call:
/api/SECTION/FUNCTION/[PARAMS]
This last part, including the slash, is optional. Some functions display a JSON reply without having to receive any arguments. Example: /api/sounds/getAllSoundpacks prints a list of available sound packs. Though, /api/sounds/getPack/8Bit prints the detailed information.
Here is the expression I have tried:
req.url.match(/\/(.*)\/(.*)\/?(.*)/);
What am I missing to make the last part optional - or capture it in whole?
This will capture everything after FUNCTION/ in your URL, independent of the appearance of any further / after FUNCTION/:
FUNCTION\/(.+)$
The RegExp will not match if there is no part after FUNCTION.
This regex should work by making last slash and part after optional:
/^\/[^/]*\/[^/]*(?:\/.*)?$/
This matches all of these strings:
/api/SECTION/FUNCTION/abc
/api/SECTION
/api/SECTION/
/api/SECTION/FUNCTION
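If you also need the individual pieces, a variant with capture groups could look like the sketch below. It goes a bit beyond the pattern above: I assume a literal `/api/` prefix and non-empty SECTION and FUNCTION parts, and the name `parseApiUrl` is made up.

```javascript
function parseApiUrl(url) {
  // SECTION and FUNCTION must be non-empty and contain no slash;
  // "/PARAMS" is optional and captured without its leading slash.
  const m = /^\/api\/([^/]+)\/([^/]+)(?:\/(.*))?$/.exec(url);
  if (!m) return null;
  return { section: m[1], fn: m[2], params: m[3] }; // params is undefined when absent
}

console.log(parseApiUrl('/api/sounds/getAllSoundpacks'));
console.log(parseApiUrl('/api/sounds/getPack/8Bit'));
```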
Your pattern /(.*)/(.*)/?(.*) was almost correct, it's just a bit too short - it allows 2 or 3 slashes, but you want to accept anything with 3 or 4 slashes. And if you want to capture the last (optional) slash AND any text behind it as a whole, you simply need to create a group around that section and make it optional:
/.*/.*/.*(?:/.+)?
should do the trick.
Demo. (The pattern looks different because multiline mode is enabled, but it still works. It's also a little "better" because it won't match garbage like "///".)

Regexp javascript - url match with localhost

I'm trying to find a simple regexp for url validation, but I'm not very good with regexes.
Currently I have such regexp: (/^https?:\/\/\w/).test(url)
So it's allowing to validate urls as http://localhost:8080 etc.
What I want to do is NOT to validate urls if they have some long special characters at the end like: http://dodo....... or http://dododo&&&&&
Could you help me?
How about this?
/^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?(\/[.\w]*)*$/
Will match: http://domain.com:port/path or just http://domain or http://domain:port
/^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?$/
match URLs without path
Some explanations of regex blocks:
Domain: \w+(\.\w+)* to match text with dots: localhost or www.yahoo.com (it extends until the Port or Path section begins)
Port: (:[0-9]+)? to match or to not match a number starting with a colon: :8000 (and there can be only one)
Path: \/?(\/[.\w]*)* to match any alphanums with slashes and dots: /user/images/0001.jpg (until the end of the line)
(The path is the interesting part: as written it allows lone or adjacent dots, i.e. paths such as /. or /./ or /.../ are possible. If you'd like dots in the path to behave as in the domain section, with no leading, trailing or adjacent dots, then use \/?(\/\w+(\.\w+)*)* instead, which is similar to the domain part.)
* UPDATED *
Also, if you would like to allow - characters (which are valid) in your URL, or any other character, you should simply expand the character class used for "URL text matching", i.e. \w+ should become [\-\w]+ and so on.
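The blocks described above can be assembled from named pieces, which makes them easier to tweak. This is only a sketch under a few assumptions of mine: it accepts https as well as http, it already uses the `[-\w]` class for hyphens, it leaves out the extra `\/?` before the path, and the names `DOMAIN`, `PORT`, `PATH` and `isSimpleUrl` are invented.

```javascript
// String.raw keeps the backslashes as written, so no double escaping is needed.
const DOMAIN = String.raw`[-\w]+(?:\.[-\w]+)*`; // localhost or www.yahoo.com
const PORT = String.raw`(?::[0-9]+)?`;          // optional :8000
const PATH = String.raw`(?:/[.\w-]*)*`;         // optional /user/images/0001.jpg

const SIMPLE_URL = new RegExp('^https?://' + DOMAIN + PORT + PATH + '$');

function isSimpleUrl(url) {
  return SIMPLE_URL.test(url);
}

console.log(isSimpleUrl('http://localhost:8080')); // true
console.log(isSimpleUrl('http://dodo.......'));    // false
console.log(isSimpleUrl('http://dododo&&&&&'));    // false
```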
If you want to match ABCD, then you may leave out the start part.
For example, to match http://localhost:8080
just write
/(localhost)/
If you want to match a specific thing, then please focus on the term that you want to search for, not on the start and end of the sentence.
Regular expressions are for searching for terms, until we have a rigid rule for the same. :)
I hope this will do.
It depends on how complex you need the Regex to be. A simple way would be to just accept words (and the port/domain):
^https?:\/\/\w+(:[0-9]*)?(\.\w+)?$
Remember you need to use the + character to match one or more characters.
Of course, there are far better & more complicated solutions out there.
^https?:\/\/localhost:[0-9]{1,5}\/([-a-zA-Z0-9()@:%_\+.~#?&\/=]*)
match:
https://localhost:65535/file-upload-svc/files/app?query=abc#next
not match:
https://localhost:775535/file-upload-svc/files/app?query=abc#next
explanation:
it can only be used for localhost
it also limits the port to at most five digits; since a valid port must not exceed 65535, you probably need to add additional logic for that
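One way the additional port logic might look, as a sketch only: capture the digits and check the numeric range in code. Unlike the pattern above, this version makes the path optional and does not restrict its characters; both choices, and the name `isLocalhostUrl`, are my assumptions.

```javascript
function isLocalhostUrl(url) {
  // Capture the 1-5 port digits; the path (if any) must start with "/".
  const m = /^https?:\/\/localhost:([0-9]{1,5})(?:\/.*)?$/.exec(url);
  if (!m) return false;
  const port = Number(m[1]);
  return port >= 1 && port <= 65535; // the regex alone would accept 70000
}

console.log(isLocalhostUrl('https://localhost:65535/file-upload-svc/files/app?query=abc#next')); // true
console.log(isLocalhostUrl('https://localhost:70000/'));                                         // false
```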
You can use this. This will allow localhost and live domain as well.
^https?:\/\/\w+(\.\w+)*(:[0-9]+)?(\/.*)?$
I'm pretty late to the party, but now you should consider validating your URL with the URL class. Avoid the headache of regex and rely on the standard:
let isValid;
try {
new URL(endpoint); // Will throw if URL is invalid
isValid = true;
} catch (err) {
isValid = false;
}
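One thing to be aware of: the constructor also accepts schemes other than http, for instance mailto: addresses. If only http and https endpoints should pass, a small wrapper like the following sketch could check the parsed protocol as well (the name `isHttpUrl` is made up):

```javascript
function isHttpUrl(value) {
  try {
    const { protocol } = new URL(value); // throws a TypeError on unparsable input
    return protocol === 'http:' || protocol === 'https:';
  } catch (err) {
    return false;
  }
}

console.log(isHttpUrl('http://localhost:8080'));      // true
console.log(isHttpUrl('mailto:someone@example.com')); // false
```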
^https?:\/\/(localhost:([0-9]+\.)+[a-zA-Z0-9]{1,6})?$
Will match the following cases :
http://localhost:3100/api
http://localhost:3100/1
http://localhost:3100/AP
http://localhost:310
Will NOT match the following cases :
http://localhost:3100/
http://localhost:
http://localhost
http://localhost:31

Regex to convert URL to Links

I 'borrowed' a regex from this website: http://daringfireball.net/2010/07/improved_regex_for_matching_urls that is almost complete, but I also want it to match example.com.
I know that Stack Overflow is not doyourhomework.com, but I spent a long time thinking without results. Here is a fiddle to test: http://jsfiddle.net/BGnMm/25/ and you can see at the end that example.com is not turned into a link.
var reg=/\b((?:[a-z][\w-]+:(?:\/*)|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;
var allurl="http:foo.com/blah_blah http://foo.com/blah_blah/ (Something like http://foo.com/blah_blah) http://foo.com/blah_blah_(wikipedia) http://foo.com/more_(than)_one_(parens) (Something like http://foo.com/blah_blah_(wikipedia)) http://foo.com/blah_(wikipedia)#cite-1 http://foo.com/blah_(wikipedia)_blah#cite-1 http://foo.com/unicode_(✪)_in_parens http://foo.com/(something)?after=parens http://foo.com/blah_blah. http://foo.com/blah_blah/. <http://foo.com/blah_blah> <http://foo.com/blah_blah/> http://foo.com/blah_blah, http://www.extinguishedscholar.com/wpglob/?p=364. http://✪df.ws/1234 rdar://1234 rdar:/1234 x-yojimbo-item://6303E4C1-6A6E-45A6-AB9D-3A908F59AE0E message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff#mail.gmail.com%3e http://➡.ws/䨹 www.c.ws/䨹 <tag>http://example.com</tag> Just a www.example.com link. http://example.com/something?with,commas,in,url, but not at end What about <mailto:gruber#daringfireball.net?subject=TEST> (including brokets). mailto:name#example.com bit.ly/foo “is.gd/foo/” WWW.EXAMPLE.COM http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))/Web_ENG/View_DetailPhoto.aspx?PicId=752 http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55)) http://lcweb2.loc.gov/cgi-bin/query/h?pp/horyd:#field(NUMBER+#band(thc+5a46634)) 6:00p filename.txt http://example.com/quotes-are-“part” ✪df.ws/1234 example.com example.com/";
document.write(allurl.replace(reg,"<a href='$1' >$1</a><br />"));
Add an alternation operator (|) after the {2,4}\/, i.e.
var reg=/\b((?:[a-z][\w-]+:(?:\/*)|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/|)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;
There's something you should understand about this. The first non-captured group, (?: … ), looks for "indicators" of URLs. One indicator, for example, is the www (followed by up to 3 digits of numbers). You however are asking for a way to identify URLs without any indicator at all. So, what we've done above is we've added a clause, "or an empty match," as a "valid" indicator. The consequence of this is that your regular expression is less selective now: all sorts of strings, not only example.com but also filename.txt, 3.141593, and omg...really are identified as URLs! Your only other (readily available) option is to be more selective about suffixes, e.g. require specific suffixes (com|org|net), but then this takes away from the generality of the original regex, which doesn't specify any suffixes at all.
In other words, you are probably faced with a limitation of logic, not a limitation of regex-writing skills or the regex language itself.
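To illustrate the "be more selective about suffixes" option, here is a deliberately simplified sketch that is separate from the big regex above. It only looks for bare domains ending in an allowed suffix, so it should be run on text that contains no scheme-prefixed URLs or e-mail addresses, since it would match the domain inside those too. The function name `linkifyBareDomains`, the default suffix list and the `http://` prefix in the href are all assumptions.

```javascript
function linkifyBareDomains(text, suffixes = ['com', 'org', 'net']) {
  // One or more dot-separated labels, ending in an allowed suffix,
  // optionally followed by a path.
  const pattern = new RegExp(
    String.raw`\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:` + suffixes.join('|') + String.raw`)\b(?:/[^\s<>]*)?`,
    'gi'
  );
  return text.replace(pattern, match => `<a href='http://${match}'>${match}</a>`);
}

console.log(linkifyBareDomains('See example.com, filename.txt and 3.141593.'));
// See <a href='http://example.com'>example.com</a>, filename.txt and 3.141593.
```

The price is exactly the one described above: a domain such as example.io is no longer recognised unless its suffix is added to the list.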
Please check if
var reg=/\b((?:[a-z][\w-]+:(?:\/*)|(?:www\d{0,3}[.])|[a-z0-9.\-]+[.][a-z]{2,4}\/{0,1})(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))*(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;
suits your needs. www(anyNumber) has just been put to appear one or zero times. Sorry for the first answer, did not notice the texts.
