Regex to retrieve domain.extension from a url - javascript

I need to come up with a regex that extracts only domainname.extension from a URL. Right now I have a regex that strips "www." from the host name, but I need to update it to remove any subdomain strings from the hostname:
This strips off www.:
window.location.hostname.replace(/^www\./i, '')
But I need to detect any subdomain info on abc.def.test.com or ghi.test.com, replace it with an empty string, and always return "test.com".

You could achieve the same result with the replace method, but match is somehow more suitable:
console.log(
window.location.hostname.match(/[^\s.]+\.[^\s.]+$/)[0]
);
[^\s.]+ Match non-whitespace characters except dot
$ Assert end of input string
Doing so with replace method according to comments:
console.log(
window.location.hostname.replace(/[^\s.]+\.(?=[^\s.]+\.)/g, '')
);
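Both approaches can be sketched as plain functions that take the hostname as a string, so they also run outside a browser. The function names are illustrative and not part of the answer; the match version returns null when the hostname has fewer than two labels instead of throwing.

```javascript
// Hypothetical helpers; the names are illustrative only.

// Keep the last two dot-separated labels, or null if there are fewer than two.
function lastTwoLabels(hostname) {
  const m = hostname.match(/[^\s.]+\.[^\s.]+$/);
  return m ? m[0] : null;
}

// Same result via replace: remove every "label." that is still followed
// by another "label." further to the right.
function stripSubdomains(hostname) {
  return hostname.replace(/[^\s.]+\.(?=[^\s.]+\.)/g, "");
}

console.log(lastTwoLabels("abc.def.test.com")); // test.com
console.log(stripSubdomains("ghi.test.com"));   // test.com
console.log(lastTwoLabels("localhost"));        // null
```

Note that neither version knows about suffixes such as co.uk; "www.test.co.uk" comes back as "co.uk".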

Well, that depends mainly on what you define as a domain and how you define a subdomain. I'll use the most general approach and treat the last two components as the top domain (as you do with test.com). In that case you can proceed as:
([a-zA-Z0-9-]+\.)*([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+) ==> $2
As you can see, the regexp is divided into two groups, and we only keep the second in the output, which is the last two domain components. The [a-zA-Z0-9-] subexpression deserves some explanation, as it appears three times in the regexp: it is the set of characters allowed in a domain component, including the hyphen. See [1] for a working demo.
If you want to cope with the co.uk example posted in the last demo, so that www.test.co.uk is matched as test.co.uk, you have to anchor your regexp to the end (with $ or, if you are in the middle of a URL, with the next : or / that can follow the domain name). This keeps prefixes from being detected as valid domains, as shown in [2]:
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)$ ==> $3
or [3]
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)(?=[:/]|$) ==> $3
Of course, you have to put in the list all the countries that use top-level domains as prefixes under their own structure. You have to be careful here, as not all countries follow this approach. I've used the non-greedy *? operator because otherwise the groups don't match as desired (the first group gets greedy, and the match is again co.uk instead of test.co.uk).
But since you ultimately have to anchor your regexp (mainly because domain names can also appear in the query string or in the path of the URL), it is best to anchor it to the whole URL.
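A minimal sketch of the anchored, non-greedy variant, applied to a bare hostname rather than a full URL. The function name, the constant names, and the four-entry country list are assumptions for illustration; a real list would need many more entries.

```javascript
// Illustrative only: SECOND_LEVEL_CC is a deliberately tiny, hypothetical list.
const SECOND_LEVEL_CC = ["uk", "au", "tw", "cn"];

const DOMAIN_RE = new RegExp(
  "^(?:[a-zA-Z0-9-]+\\.)*?" +                        // lazily skip leading subdomain labels
  "([a-zA-Z0-9-]+\\.[a-zA-Z0-9-]+" +                 // two labels, e.g. test.com or test.co
  "(?:\\.(?:" + SECOND_LEVEL_CC.join("|") + "))?)$"  // optional country suffix, anchored at the end
);

// Returns the captured domain, or null when the hostname has fewer than two labels.
function registrableDomain(hostname) {
  const m = DOMAIN_RE.exec(hostname);
  return m ? m[1] : null;
}

console.log(registrableDomain("www.test.co.uk"));   // test.co.uk
console.log(registrableDomain("abc.def.test.com")); // test.com
```

The caveat about countries that do not follow the convention shows up directly: with this list, "shop.example.uk" is returned unchanged, because the lazy prefix stops as soon as two labels plus ".uk" reach the end of the string.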

Related

Optional regex string/pattern sections with and without non-capturing groups

Here's what I'm trying to do:
http://i.imgur.com/Xqrf8Wn.png
Simply take a URL with 3 groups: $1 is not so important, $2 and $3 are, but $2 is totally optional, including (obviously) the corresponding slash when present, which is all I am trying to make optional. I get that it can (should?) be in a non-capturing group, but does it HAVE to be? I've seen enough by now that seems to indicate it does not HAVE to be. If possible, I'd really like someone to explain it so I can fully understand it, rather than just being handed one working answer to copy.
Here are the regex strings I've tried; at best they only match the second URL, the one with the optional part present:
^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)(.*)\/required\/B([A-Z0-9]{9}).*
^https:\/\/([a-z]{0,2})\.?blah\.com(\/)?(.*)\/required\/B([A-Z0-9]{9}).*
^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)?(.*)?\/required\/B([A-Z0-9]{9}).*
Here are the two URLs I want to match, capturing groups 2 and 3, with groups 1 and 2 being optional; $2 is the problem. I've tried all the strings above and have yet to match the URL where the optional part is NOT present, and I believe it must be due to the backslashes?
https://blah.com/required/B7BG0Z0GU1A
https://blah.com/optional/required/B7BG0Z0GU1A
Making a part of the pattern optional is as simple as adding ?, and your last two attempts both work: https://regex101.com/r/RIKvYY/1
Your mistake is that your test is wrong - you are using ^ which matches the beginning of the string. You need to add the /m flag (multiline) to make it match the beginning of each line. This is the reason your patterns never match the second line...
Note that you're allowing two slashes (//required, for example). You can solve this by putting the first slash and the optional part into the same optional group (of course, as long as you are using .* you can still match multiple slashes):
https:\/\/([a-z]{0,2})\.?blah\.com(?:\/(.*))?\/required\/B([A-Z0-9]{9}).*
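As a quick check, the pattern can be run against both URLs from the question. The sketch below adds a leading ^ and variable names of my own choosing; it also shows the effect of the m flag when both URLs sit in one multi-line string.

```javascript
// The optional part and its leading slash live in one optional non-capturing group.
const pattern = /^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/(.*))?\/required\/B([A-Z0-9]{9}).*/;

const urlWithout = "https://blah.com/required/B7BG0Z0GU1A";
const urlWith = "https://blah.com/optional/required/B7BG0Z0GU1A";

const withoutOptional = urlWithout.match(pattern);
const withOptional = urlWith.match(pattern);

// A group that did not take part in the match is reported as undefined.
console.log(withoutOptional[2], withoutOptional[3]); // undefined 7BG0Z0GU1
console.log(withOptional[2], withOptional[3]);       // optional 7BG0Z0GU1

// Both URLs in one string: ^ only matches at each line start with the m flag.
const text = urlWithout + "\n" + urlWith;
const perLine = new RegExp(pattern.source, "gm");
const wholeString = new RegExp(pattern.source, "g");
console.log([...text.matchAll(perLine)].length);     // 2
console.log([...text.matchAll(wholeString)].length); // 1
```

The group does not have to be non-capturing; making it capturing would work too, but it would shift the numbering of the groups that follow.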

How to match a URL in this string?

I've seen various articles which show how to match a URL. But my situation is a bit different from the usual URL matching.
This was one such regex that didn't work for me
/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/
My requirement:
My requirement is that I've a string like this
userlist.2011.text_mediafire.com,
userlist.2011.text_http://www.mediafire.com",
userlist.2011.text_http://mediafire.com",
userlist.2011.text.www.mediafire.com
Now, I want to match mediafire.com along with (if they exist) "http://www." and "www.". So the constraint I wish to set is that everything to the left of a TLD (in this case '.com') should be captured up to a list of special characters like '"_- etc.
I wasn't able to get any further than the basic /(.*)\.(com|net|org|info)/, which is clearly wrong.
Use the below regex and get the string you want from group index 1.
(?:http:\/\/)?(?:www\.)?([^'"_.-]*\.(?:com|net|org|info)\b)
You need the '$' to match the end of string. If you care about capturing the entire string before the special character you will also need to match the beginning of the string '^'.
/^(.*)\.(([^\.]+)\.(com|net|org|info))$/
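To see how the first pattern behaves on the four sample strings from the question, here is a small sketch. The variable and function names are my own; group 1 holds the bare domain and the full match additionally includes any "http://" or "www." prefix.

```javascript
// First answer's pattern: optional prefixes, then one label made of
// non-special characters, then one of the listed TLDs.
const DOMAIN_IN_TEXT = /(?:http:\/\/)?(?:www\.)?([^'"_.-]*\.(?:com|net|org|info)\b)/;

const samples = [
  'userlist.2011.text_mediafire.com,',
  'userlist.2011.text_http://www.mediafire.com",',
  'userlist.2011.text_http://mediafire.com",',
  'userlist.2011.text.www.mediafire.com',
];

// Returns { full, domain } for the first hit, or null when nothing matches.
function findDomain(text) {
  const m = text.match(DOMAIN_IN_TEXT);
  return m ? { full: m[0], domain: m[1] } : null;
}

for (const s of samples) {
  console.log(findDomain(s));
}
```

Every sample yields "mediafire.com" as the domain; the second one, for instance, has "http://www.mediafire.com" as its full match.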

URL Pattern Matching issue, .+ matches all after

I am matching up stored URLs to the current URL and having a little bit of an issue - the regex works fine when being matched against the URL itself, but for some reason all sub-directories match too (when I want a direct match only of course).
Say the user stores www.facebook.com, this should match both http://www.facebook.com and https://www.facebook.com and it does
The problem is it is also matching sub-directories such as https://www.facebook.com/events/upcoming etc.
The regex for example:
/.+:\/\/www\.facebook\.com/
Matches the following:
https://www.facebook.com/events/upcoming
When it should just be matching
http://www.facebook.com/
https://www.facebook.com/
How can I fix this seemingly broken regex?
If you're being really specific about what you want to match, why not reflect that in your RegExp?
/^https?:\/\/(?:(?:www|m)\.)?facebook\.com\/?$/
http or https
www., m. or no subdomain
facebook.com
Edit: updated to include an optional trailing slash.
Put an end marker $, like:
/.+:\/\/www\.facebook\.com\/$/
but really should have a start marker ^ too, like:
/^https?:\/\/www\.facebook\.com\/$/
also if you're matching the current domain, you may as well just match the location.host rather than location.href
Try adding a $ at the end of your regex. It's the symbol for end of string.
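Since the stored value is a host such as www.facebook.com, one way to apply the anchoring advice is to build the pattern from that string. This is only a sketch: escapeRegExp and buildExactMatcher are hypothetical helper names, and the rule "http or https, optional trailing slash, nothing else" is an assumption about what a direct match means here.

```javascript
// Escape regex metacharacters so the stored host is matched literally
// (otherwise the dots in "www.facebook.com" would match any character).
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Anchored at both ends: scheme, the stored host, an optional slash, then end of string.
function buildExactMatcher(storedHost) {
  return new RegExp("^https?://" + escapeRegExp(storedHost) + "/?$");
}

const matcher = buildExactMatcher("www.facebook.com");
console.log(matcher.test("http://www.facebook.com"));                 // true
console.log(matcher.test("https://www.facebook.com/"));               // true
console.log(matcher.test("https://www.facebook.com/events/upcoming")); // false
```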

getting user and tweet ID from url using JavaScript regex

So I have tweet url for example https://twitter.com/ESPNFC/status/423771542627966976.
This url in my website gets automatically parsed to
<a href="https://twitter.com/ESPNFC/status/423771542627966976">https://twitter.com/ESPNFC/status/423771542627966976</a>
I need to match this pattern and also get username and tweet ID.
I did it that way
/<a href="(http|https):\/\/twitter.com\/([^\/]*)\/status\/([^\/]*)">.+<\/a>/g
Everything works when I have 1 tweet per line, but if there are 2 or more tweets on one line, the regex matches them all at once and groups them as one match, and I need to separate them.
Example:
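<a href="https://twitter.com/ESPNFC/status/423771542627966976">https://twitter.com/ESPNFC/status/423771542627966976</a>
<a href="https://twitter.com/ESPNFC/status/423771542627966976">https://twitter.com/ESPNFC/status/423771542627966976</a>
returns 2 matches, but
<a href="https://twitter.com/ESPNFC/status/423771542627966976">https://twitter.com/ESPNFC/status/423771542627966976</a><a href="https://twitter.com/ESPNFC/status/423771542627966976">https://twitter.com/ESPNFC/status/423771542627966976</a>
returns 1 match including both URLs. How can I separate them or, for example, have everything after each link interpreted as a new line?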
It's best to avoid parsing HTML with regex when possible. Having said that the problem with your expression is the greedy .+ which will match as much as possible. Instead you could use .+? to make it ungreedy (match as few characters as possible). Or you could restrict what . matches, for example use [^\s<>]+ instead of .+.
Also you probably want to change those [^\/]* to maybe [^\/"\s]* to make them more effective.
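A sketch of the restricted-character-class variant, run against two links on the same line. The names TWEET_LINK and extractTweets are illustrative, and the sample markup assumes the link text is the URL itself.

```javascript
// [^\/"\s]* keeps each captured part inside its own path segment,
// and [^\s<>]+ keeps the link text from running past the closing </a>.
const TWEET_LINK =
  /<a href="https?:\/\/twitter\.com\/([^\/"\s]*)\/status\/([^\/"\s]*)">[^\s<>]+<\/a>/g;

// Returns one { user, id } object per tweet link found in the markup.
function extractTweets(html) {
  return [...html.matchAll(TWEET_LINK)].map((m) => ({ user: m[1], id: m[2] }));
}

const link =
  '<a href="https://twitter.com/ESPNFC/status/423771542627966976">' +
  'https://twitter.com/ESPNFC/status/423771542627966976</a>';

console.log(extractTweets(link + link).length); // 2
console.log(extractTweets(link)[0]);
```

The ID is kept as a string on purpose: tweet IDs of this size exceed Number.MAX_SAFE_INTEGER, so converting them to a number would lose digits.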

The inverse of [^\/:] | Regular Expression Improvement

This character set
[^\/:] // all characters except / or :
is weak per JSLint because I should be specifying the characters that can be used, not the characters that cannot be used, per this SO post.
This is for a simple, non-production-level domain tester that looks like this:
domain: /:\/\/(www\.)?([^\/:]+)/,
I'm just looking for some direction on how to think about this. The post mentions that allowing the myriad of Unicode characters is not a good thing...How do I formulate a plan to write this a tad better?
I am not concerned with the completeness of my domain checker ( it is just a prototype )...I am concerned with how to write reg-exes differently.
According to http://en.wikipedia.org/wiki/Domain_name#Internationalized_domain_names
the character set allowed in the Domain Name System is based on ASCII
and as per http://www.netregister.biz/faqit.htm#1
to name your domain you can use any letter, numbers between 0 and 9, and the symbol "-" [as long as the first character is not "-"]
and considering that your domain must end with .something, you are looking for
([a-zA-Z0-9][a-zA-Z0-9-]*\.)+[a-zA-Z0-9][a-zA-Z0-9-]*
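As a rough illustration, the expression can be anchored at both ends and wrapped in a test function. The function name and the anchoring are my additions; the check only covers the character rules quoted above, not label lengths or whether the name actually exists.

```javascript
// One or more "label." groups followed by a final label; every label starts
// with a letter or digit and may continue with letters, digits or hyphens.
const DOMAIN = /^([a-zA-Z0-9][a-zA-Z0-9-]*\.)+[a-zA-Z0-9][a-zA-Z0-9-]*$/;

function looksLikeDomain(name) {
  return DOMAIN.test(name);
}

console.log(looksLikeDomain("example.com"));  // true
console.log(looksLikeDomain("-bad.com"));     // false (label starts with "-")
console.log(looksLikeDomain("exa_mple.com")); // false (underscore not allowed)
console.log(looksLikeDomain("localhost"));    // false (no ".something" part)
```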
"I should be specifying the characters that can be used not he characters that can not be use"
No, that's nonsense, just JSLint being JSLint.
When you see [^\/:] in a regex it's immediately obvious what it is doing. If you tried to list all possible allowed characters the resulting regex would be horrendously difficult to read and it would be easy to accidentally forget to include some characters.
If you have a specific set of allowed characters then fine, list them. That's easier and more reliable than trying to list all possible invalid characters.
But if you have a specific set of invalid characters the [^] syntax is the appropriate way to do it.
Here's a regex for characters you can have:
mycharactersarecool[^shouldnothavethesechars](oneoftwooptions|anotheroption)
Is this what you're talking about ?
This is a great question for Google, you know... but just to wet your beak: Matthew O'Riordan has written a regular expression that matches links with or without a protocol.
Here's link to his blog post
But for future reference let me provide the regular expression from the post here as well:
/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[.\!\/\\w]*))?)/
And as nicely broken down by the blog's author, Matthew, himself:
(
( # brackets covering match for protocol (optional) and domain
([A-Za-z]{3,9}:(?:\/\/)?) # match protocol, allow in format http:// or mailto:
(?:[\-;:&=\+\$,\w]+@)? # allow something@ for email addresses
[A-Za-z0-9\.\-]+ # anything looking at all like a domain, non-unicode domains
| # or instead of above
(?:www\.|[\-;:&=\+\$,\w]+@) # starting with something@ or www.
[A-Za-z0-9\.\-]+ # anything looking at all like a domain
)
( # brackets covering match for path, query string and anchor
(?:\/[\+~%\/\.\w\-]*) # allow optional /path
?\??(?:[\-\+=&;%@\.\w]*) # allow optional query string starting with ?
#?(?:[\.\!\/\\\w]*) # allow optional anchor #anchor
)? # make URL suffix optional
)
What about your particular example
But in your case of matching URL domains, the inverse of [^\/:] could simply be:
[-0-9a-zA-Z_.]
And that should match everything after // and before the first /. But what happens when your URLs don't end with a slash? What will you do in that case?
The regular expression above (a simplification) only matches one character, just like your negated character set does. So it simply replaces your negated set in the complete regex you're using.
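Dropped into the domain tester from the question, the positive character class might look like the sketch below (the names are illustrative). Because the class is followed by +, the match simply stops at the first character outside the set, so a URL without a trailing slash or with a port still yields the host.

```javascript
// Positive character class in place of [^\/:]: letters, digits, "-", "_" and ".".
const DOMAIN_IN_URL = /:\/\/(www\.)?([-0-9a-zA-Z_.]+)/;

// Returns the host without a leading "www.", or null if there is no "://".
function domainOf(url) {
  const m = url.match(DOMAIN_IN_URL);
  return m ? m[2] : null;
}

console.log(domainOf("https://www.example.com/path")); // example.com
console.log(domainOf("http://example.com:8080/x"));    // example.com
console.log(domainOf("http://example.com"));           // example.com
```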
