What is the best way to parse a URL with JavaScript? [duplicate]


If there is one thing I just can't get my head around, it's regex.
So after a lot of searching I finally found this one that suits my needs:
function get_domain_name()
{
aaaa="http://www.somesite.se/blah/sdgsdgsdgs";
//aaaa="http://somesite.se/blah/sese";
domain_name_parts = aaaa.match(/:\/\/(.[^/]+)/)[1].split('.');
if(domain_name_parts.length >= 3){
domain_name_parts[0] = '';
}
var domain = domain_name_parts.join('.');
if(domain.indexOf('.') == 0)
alert("1"+ domain.substr(1));
else
alert("2"+ domain);
}
It basically gives me back the domain name. Is there any way I can also get all the stuff after the domain name? In this case it would be /blah/sdgsdgsdgs from the aaaa variable.

EDIT (2020): In modern browsers, you can use the built-in URL Web API.
https://developer.mozilla.org/en-US/docs/Web/API/URL/URL
var url = new URL("http://www.somesite.se/blah/sdgsdgsdgs");
var pathname = url.pathname; // returns /blah/sdgsdgsdgs
Instead of relying on a potentially unreliable* regex, you should instead use the built-in URL parser that the JavaScript DOM API provides:
var url = document.createElement('a');
url.href = "http://www.example.com/some/path?name=value#anchor";
That's all you need to do to parse the URL. Everything else is just accessing the parsed values:
url.protocol; //(http:)
url.hostname; //(www.example.com)
url.pathname; //(/some/path)
url.search; // (?name=value)
url.hash; //(#anchor)
In this case, if you're looking for /blah/sdgsdgsdgs, you'd access it with url.pathname
Basically, you're just creating a link (technically, anchor element) in JavaScript, and then you can make calls to the parsed pieces directly. (Since you're not adding it to the DOM, it doesn't add any invisible links anywhere.) It's accessed in the same way that values on the location object are.
(Inspired by this wonderful answer.)
EDIT: An important note: it appears that Internet Explorer has a bug where it omits the leading slash on the pathname attribute on objects like this. You could normalize it by doing something like:
url.pathname = url.pathname.replace(/(^\/?)/,"/");
Note:
*: I say "potentially unreliable", since it can be tempting to try to build or find an all-encompassing URL parser, but there are many, many conditions, edge cases and forgiving parsing techniques that might not be considered or properly supported; browsers are probably best at implementing (since parsing URLs is critical to their proper operation) this logic, so we should keep it simple and leave it to them.
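As a minimal sketch of "all the stuff after the domain name" with the URL API: the helper name restAfterHost is hypothetical, and it assumes the input is an absolute URL (new URL() throws a TypeError otherwise, which is caught here).

```javascript
// Hypothetical helper: returns path + query + fragment, or null when the
// input cannot be parsed as an absolute URL.
function restAfterHost(input) {
  let url;
  try {
    url = new URL(input);
  } catch (e) {
    return null; // new URL() throws a TypeError on unparsable input
  }
  return url.pathname + url.search + url.hash;
}

console.log(restAfterHost("http://www.somesite.se/blah/sdgsdgsdgs")); // "/blah/sdgsdgsdgs"
console.log(restAfterHost("http://www.example.com/some/path?name=value#anchor")); // "/some/path?name=value#anchor"
console.log(restAfterHost("not a url")); // null
```

Returning null instead of letting the exception escape is just one design choice; rethrowing may suit code that should fail loudly on bad input.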

The RFC (see appendix B) provides a regular expression to parse the URI parts:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
where
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
Example:
function parse_url(url) {
var pattern = RegExp("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");
var matches = url.match(pattern);
return {
scheme: matches[2],
authority: matches[4],
path: matches[5],
query: matches[7],
fragment: matches[9]
};
}
console.log(parse_url("http://www.somesite.se/blah/sdgsdgsdgs"));
gives
Object
authority: "www.somesite.se"
fragment: undefined
path: "/blah/sdgsdgsdgs"
query: undefined
scheme: "http"
DEMO
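The same RFC pattern can also be sketched with named capture groups, which avoids counting group numbers. This is only a variation on the example above: the names URI_PATTERN and parseUri are made up for illustration, and the non-captured wrapper groups of the RFC expression are written as (?:...) here.

```javascript
// RFC 3986 appendix B pattern, rewritten with named groups (illustrative names).
const URI_PATTERN =
  /^(?:(?<scheme>[^:\/?#]+):)?(?:\/\/(?<authority>[^\/?#]*))?(?<path>[^?#]*)(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?/;

function parseUri(uri) {
  // Every part of the pattern is optional, so exec() always matches a string.
  const match = URI_PATTERN.exec(uri);
  return { ...match.groups }; // parts that are absent come back as undefined
}

console.log(parseUri("http://www.example.com/some/path?name=value#anchor"));
```

Like the numbered version, this only splits the string; it does not validate that the result is a usable URL.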

Please note that this solution is not the best. I made this just to match the requirements of the OP. I personally would suggest looking into the other answers.
The following regexp will give you back the domain and the rest, :\/\/(.[^\/]+)(.*), for example:
www.google.com
/goosomething
I suggest you study the RegExp documentation here: http://www.regular-expressions.info/reference.html
Using your function:
function get_domain_name()
{
aaaa="http://www.somesite.se/blah/sdgsdgsdgs";
//aaaa="http://somesite.se/blah/sese";
var matches = aaaa.match(/:\/\/(?:www\.)?(.[^/]+)(.*)/);
alert(matches[1]);
alert(matches[2]);
}

You just need to modify your regex a bit. For example:
var aaaa="http://www.somesite.se/blah/sdgsdgsdgs";
var m = aaaa.match(/^[^:]*:\/\/([^\/]+)(\/.*)$/);
m will then contain the following parts:
["http://www.somesite.se/blah/sdgsdgsdgs", "www.somesite.se", "/blah/sdgsdgsdgs"]
Here is the same example, but modified so that it will split out the "www." part. I think the regular expression should be written so that the match will work whether or not you have the "www." part. So check this out:
var aaaa="http://www.somesite.se/blah/sdgsdgsdgs";
var m = aaaa.match(/^[^:]*:\/\/(www\.)?([^\/]+)(\/.*)$/);
m will then contain the following parts:
["http://www.somesite.se/blah/sdgsdgsdgs", "www.", "somesite.se", "/blah/sdgsdgsdgs"]
Now check out the same regular expression but with a url that does not start with "www.":
var bbbb="http://somesite.se/blah/sdgsdgsdgs";
var m = bbbb.match(/^[^:]*:\/\/(www\.)?([^\/]+)(\/.*)$/);
Now your match looks like this:
["http://somesite.se/blah/sdgsdgsdgs", undefined, "somesite.se", "/blah/sdgsdgsdgs"]
So as you can see it will do the right thing in both cases.
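One thing to watch with this pattern: match() returns null when the URL has no path at all (for example "http://somesite.se"), because the (\/.*) group is required. A small sketch that guards against that, assuming you are happy to treat a missing path as "/"; the function name splitUrl and the returned property names are hypothetical:

```javascript
// Hypothetical wrapper around the regex above, with the path group made optional.
function splitUrl(input) {
  const m = input.match(/^[^:]*:\/\/(www\.)?([^\/]+)(\/.*)?$/);
  if (m === null) {
    return null; // no "://" found, so this is not a URL the pattern understands
  }
  return {
    hasWww: m[1] !== undefined,
    domain: m[2],
    path: m[3] || "/" // assumption: a missing path is reported as "/"
  };
}

console.log(splitUrl("http://www.somesite.se/blah/sdgsdgsdgs"));
console.log(splitUrl("http://somesite.se"));
console.log(splitUrl("nope"));
```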

There is a nice jQuery plugin for parsing URLs: Purl.
All the regex stuff is hidden inside, and you get something like:
> url = $.url("http://markdown.com/awesome/language/markdown.html?show=all#top");
> url.attr('source');
"http://markdown.com/awesome/language/markdown.html?show=all#top"
> url.attr('protocol');
"http"
> url.attr('host');
"markdown.com"
> url.attr('relative');
"/awesome/language/markdown.html?show=all#top"
> url.attr('path');
"/awesome/language/markdown.html"
> url.attr('directory');
"/awesome/language/"
> url.attr('file');
"markdown.html"
> url.attr('query');
"show=all"
> url.attr('fragment');
"top"

Browsers have come a long way since this question was first asked. You can now use the native URL interface to accomplish this:
const url = new URL('http://www.somesite.se/blah/sdgsdgsdgs')
console.log(url.host) // "www.somesite.se"
console.log(url.href) // "http://www.somesite.se/blah/sdgsdgsdgs"
console.log(url.origin) // "http://www.somesite.se"
console.log(url.pathname) // "/blah/sdgsdgsdgs"
console.log(url.protocol) // "http:"
// etc.
Be aware that IE does not support this API. But, you can easily polyfill it with polyfill.io:
<script crossorigin="anonymous" src="https://polyfill.io/v3/polyfill.min.js?flags=gated&features=URL"></script>

Related

Checking for a specific URL regex

I need to check for a specific URL pattern. I'm not sure what the best approach is, but the case does not seem too complex, so I think a regex would be the preferred solution. I just need to check that the exact strings #, shares and assets are in the appropriate slots, for example:
http://some-domain.com/#/shares/a454-rte3-445f-4543/assets
Everything in the URL can be variable (protocol, domain, port, share id) except the exact strings I'm looking for and the slots (slash positions) at which they appear.
Thanks for your help!

You can use
/^https?:\/\/some-domain\.com\/#\/shares\/[^/]+\/assets/i
let url = `http://some-domain.com/#/shares/a454-rte3-445f-4543/assets`
let matched = /^https?:\/\/some-domain\.com\/#\/shares\/[^/]+\/assets/i.test(url)
console.log(matched)

Decided to avoid regex and do it this way instead.
const urlParts = window.location.href.split('/');
if (urlParts[3] === '#' && urlParts[4] === 'shares' && urlParts[6] === 'assets') {
// code goes here...
}
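A variation on the split approach, sketched with the URL API so that the protocol, domain and port never enter the comparison. The function name isShareAssetsUrl is hypothetical, and unlike the regex above this version assumes nothing may follow "assets":

```javascript
// Hypothetical helper: true when the fragment looks like "#/shares/<id>/assets".
function isShareAssetsUrl(href) {
  let url;
  try {
    url = new URL(href);
  } catch (e) {
    return false; // not an absolute URL
  }
  // "#/shares/<id>/assets" splits into ["#", "shares", "<id>", "assets"]
  const parts = url.hash.split("/");
  return parts.length === 4 &&
    parts[0] === "#" &&
    parts[1] === "shares" &&
    parts[2] !== "" &&
    parts[3] === "assets";
}

console.log(isShareAssetsUrl("http://some-domain.com/#/shares/a454-rte3-445f-4543/assets")); // true
console.log(isShareAssetsUrl("https://other.example:8080/#/shares/x/assets")); // true
console.log(isShareAssetsUrl("http://some-domain.com/#/shares/assets")); // false
```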

Remove plus sign (+) in URL query string

I am trying get the string in the following URL to display on my webpage.
http://example.com?ks4day=Friday+September+13th
EDIT: The date in the URL will change from person to person as it's merged in by my CRM program.
I can get it to display on my webpage using the code below, the problem is the plus signs (+) come through as well.
eg. Friday+September+13th
What I need it to do is replace the plus signs (+) with spaces so it looks like this:
eg. Friday September 13th
I'm new to this so I'm having some trouble working it out.
Any help would be appreciated.
This is the code I'm using in a .js file:
function qs(search_for) {
var query = window.location.search.substring(1);
var parms = query.split('&');
for (var i=0; i<parms.length; i++) {
var pos = parms[i].indexOf('=');
if (pos > 0 && search_for == parms[i].substring(0,pos)) {
return parms[i].substring(pos+1);
}
}
return "";
}
This is the code I'm using on my webpage to make it display:
<script type="text/javascript">document.write(qs("ks4day"));</script>

Although Bibhu's answer will work for this one case, you'll need to add decodeURIComponent if you have encoded characters in your URI string. You also want to make sure you do the replace before the decode in case you have a legitimate + in your URI string (as %2B).
I believe this is the best general way to do it:
var x = qs("ks4day"); // 'Friday+September+13th'
x = x.replace(/\+/g, '%20'); // 'Friday%20September%2013th'
x = decodeURIComponent(x); // 'Friday September 13th'
Here's an example of when it might be useful:
var x = '1+%2B+1+%3D+2';
x = x.replace(/\+/g, '%20'); // '1%20%2B%201%20%3D%202'
x = decodeURIComponent(x); // '1 + 1 = 2'
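The two steps can be wrapped in one function. This is a sketch only: the name decodeQueryValue is made up, and falling back to the raw string when decodeURIComponent throws a URIError (for example on a stray "%") is an assumption about what you want in that case.

```javascript
// Hypothetical helper: "+" becomes a space first, then percent-escapes are decoded.
function decodeQueryValue(raw) {
  try {
    return decodeURIComponent(raw.replace(/\+/g, "%20"));
  } catch (e) {
    return raw; // assumption: hand back malformed input unchanged
  }
}

console.log(decodeQueryValue("Friday+September+13th")); // "Friday September 13th"
console.log(decodeQueryValue("1+%2B+1+%3D+2"));         // "1 + 1 = 2"
console.log(decodeQueryValue("100%"));                  // "100%"
```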

You can use replace() for this purpose
var dateString = 'Friday+September+13th';
var s = dateString.replace(/\+/g, ' ');

Parsing strings with regex is error-prone. Thankfully, all modern browsers provide URLSearchParams to handle parameters from URL strings properly:
var params = new URLSearchParams(window.location.search);
var value = params.get('ks4day');
// "Friday September 13th"
Ps: There is also a good polyfill for old browsers.
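To try this outside a page, a literal query string can stand in for window.location.search (that substitution, and the extra sum parameter, are assumptions for the sake of the example). It also shows that an encoded plus (%2B) survives as a real "+":

```javascript
// A literal query string used in place of window.location.search.
const params = new URLSearchParams("?ks4day=Friday+September+13th&sum=1+%2B+1");

console.log(params.get("ks4day"));  // "Friday September 13th"
console.log(params.get("sum"));     // "1 + 1"
console.log(params.get("missing")); // null
```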

Have you tried https://www.npmjs.com/package/querystring ?
import { parse } from 'querystring';
parse('ks4day=Friday+September+13th')
returns
{ 'ks4day': 'Friday September 13th' }
Assuming you are using something like Webpack that knows how to process import statements

If that's what you are doing, the plus sign will not be the only one that is going to give you a hard time. The apostrophe ('), equals (=), plus (+) and basically anything not in the permitted URL characters (see Percent-encoding on Wikipedia) is going to get escaped.
You are most likely looking for the decodeURIComponent function.
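A quick round trip shows what decodeURIComponent undoes. The sample string is made up; note that the built-in encodeURIComponent leaves the apostrophe alone, although other encoders may escape it.

```javascript
const original = "it's 1+1=2 & more";

// Spaces, "+", "=" and "&" are percent-encoded; the apostrophe is not.
const encoded = encodeURIComponent(original);
console.log(encoded); // "it's%201%2B1%3D2%20%26%20more"

// decodeURIComponent restores the original string.
const decoded = decodeURIComponent(encoded);
console.log(decoded === original); // true
```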

How to get the domain 'name' of a url in Javascript? [duplicate]

How can I fetch a domain name from a URL String?
Examples:
+----------------------+------------+
| input | output |
+----------------------+------------+
| www.google.com | google |
| www.mail.yahoo.com | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk | abc |
+----------------------+------------+
Related:
Matching a web address through regex

I once had to write such a regex for a company I worked for. The solution was this:
Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but lacks ac.uk for example so for this it is not really usable.
Join the list like the example below. A warning: Ordering is important! If org.uk would appear after uk then example.org.uk would match org instead of example.
Example regex:
.*([^\.]+)(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$
This worked really well and also matched weird, unofficial top-levels like de.com and friends.
The upside:
Very fast if regex is optimally ordered
The downside of this solution is of course:
Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
Very large regex so not very readable.

A little late to the party, but:
const urls = [
'www.abc.au.uk',
'https://github.com',
'http://github.ca',
'https://www.google.ru',
'http://www.google.co.uk',
'www.yandex.com',
'yandex.ru',
'yandex'
]
urls.forEach(url => console.log(url.replace(/.+\/\/|www\.|\..+/g, '')))

Extracting the Domain name accurately can be quite tricky mainly because the domain extension can contain 2 parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there. Listing all domain extensions is not an option because there are hundreds of these. EuroDNS.com for example lists over 800 domain name extensions.
I therefore wrote a short php function that uses 'parse_url()' and some observations about domain extensions to accurately extract the url components AND the domain name. The function is as follows:
function parse_url_all($url){
$url = substr($url,0,4)=='http'? $url: 'http://'.$url;
$d = parse_url($url);
$tmp = explode('.',$d['host']);
$n = count($tmp);
if ($n>=2){
if ($n==4 || ($n==3 && strlen($tmp[($n-2)])<=3)){
$d['domain'] = $tmp[($n-3)].".".$tmp[($n-2)].".".$tmp[($n-1)];
$d['domainX'] = $tmp[($n-3)];
} else {
$d['domain'] = $tmp[($n-2)].".".$tmp[($n-1)];
$d['domainX'] = $tmp[($n-2)];
}
}
return $d;
}
This simple function will work in almost every case. There are a few exceptions, but these are very rare.
To demonstrate / test this function you can use the following:
$urls = array('www.test.com', 'test.com', 'cp.test.com' .....);
echo "<div style='overflow-x:auto;'>";
echo "<table>";
echo "<tr><th>URL</th><th>Host</th><th>Domain</th><th>Domain X</th></tr>";
foreach ($urls as $url) {
$info = parse_url_all($url);
echo "<tr><td>".$url."</td><td>".$info['host'].
"</td><td>".$info['domain']."</td><td>".$info['domainX']."</td></tr>";
}
echo "</table></div>";
The output will be as follows for the URL's listed:
As you can see, the domain name and the domain name without the extension are consistently extracted whatever the URL that is presented to the function.
I hope that this helps.

/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/

There are two ways
Using split
Then just parse that string
var domain;
//find & remove protocol (http, ftp, etc.) and get domain
if (url.indexOf('://') > -1) {
domain = url.split('/')[2];
} else if (url.indexOf('//') === 0) {
domain = url.split('/')[2];
} else {
domain = url.split('/')[0];
}
//find & remove port number
domain = domain.split(':')[0];
Using Regex
var r = /:\/\/(.[^/]+)/;
"http://stackoverflow.com/questions/5343288/get-url".match(r)[1]
=> stackoverflow.com
Hope this helps

I don't know of any libraries, but the string manipulation of domain names is easy enough.
The hard part is knowing if the name is at the second or third level. For this you will need a data file you maintain (e.g. for .uk it is not always the third level; some organisations (e.g. bl.uk, jet.uk) exist at the second level).
The source of Firefox from Mozilla has such a data file, check the Mozilla licensing to see if you could reuse that.

from urllib.parse import urlparse  # Python 3; on Python 2 this lived in the urlparse module
GENERIC_TLDS = [
    'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs',
    'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
]
def get_domain(url):
    hostname = urlparse(url.lower()).netloc
    if hostname == '':
        # Force the recognition as a full URL
        hostname = urlparse('http://' + url.lower()).netloc
    # Remove the 'user:passw', 'www.' and ':port' parts
    hostname = hostname.split('@')[-1].split(':')[0]
    if hostname.startswith('www.'):
        # lstrip('www.') would strip any leading 'w' or '.' characters, not the prefix
        hostname = hostname[len('www.'):]
    hostname = hostname.split('.')
    num_parts = len(hostname)
    if (num_parts < 3) or (len(hostname[-1]) > 2):
        return '.'.join(hostname[:-1])
    if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
        return '.'.join(hostname[:-1])
    if num_parts >= 3:
        return '.'.join(hostname[:-2])
This code isn't guaranteed to work with all URLs and doesn't filter those that are grammatically correct but invalid like 'example.uk'.
However it'll do the job in most cases.

It is not possible without a TLD list to compare against, as there exist many cases like http://www.db.de/ or http://bbc.co.uk/ that will be interpreted by a regex as the domains db.de (correct) and co.uk (wrong).
But even with that you won't succeed if your list does not contain SLDs, too. URLs like http://big.uk.com/ and http://www.uk.com/ would both be interpreted as uk.com (the first domain is big.uk.com).
Because of that all browsers use Mozilla's Public Suffix List:
https://en.wikipedia.org/wiki/Public_Suffix_List
You can use it in your code by importing it through this URL:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
Feel free to extend my function to extract the domain name, only. It won't use regex and it is fast:
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm#3471878
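To illustrate the idea of a suffix-list lookup, here is a minimal sketch. The SUFFIXES set is a tiny hand-picked sample, not the real Public Suffix List, the function name registrableDomain is hypothetical, and the list's wildcard and exception rules are ignored entirely:

```javascript
// Tiny illustrative sample; the real list has thousands of entries.
const SUFFIXES = new Set(["com", "de", "uk", "co.uk", "uk.com", "co.in"]);

// Returns the suffix plus one more label, or null if no known suffix fits.
function registrableDomain(hostname) {
  const labels = hostname.toLowerCase().split(".");
  // Walk from the longest candidate suffix to the shortest, so "co.uk" wins over "uk".
  for (let i = 0; i < labels.length; i++) {
    const suffix = labels.slice(i).join(".");
    if (SUFFIXES.has(suffix)) {
      // i === 0 means the whole hostname is itself a suffix (e.g. "co.uk").
      return i === 0 ? null : labels.slice(i - 1).join(".");
    }
  }
  return null;
}

console.log(registrableDomain("www.db.de"));        // "db.de"
console.log(registrableDomain("bbc.co.uk"));        // "bbc.co.uk"
console.log(registrableDomain("big.uk.com"));       // "big.uk.com"
console.log(registrableDomain("mail.yahoo.co.in")); // "yahoo.co.in"
console.log(registrableDomain("co.uk"));            // null
```

Checking the longest candidate first is what keeps bbc.co.uk from being reported as co.uk.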

Basically, what you want is:
google.com -> google.com -> google
www.google.com -> google.com -> google
google.co.uk -> google.co.uk -> google
www.google.co.uk -> google.co.uk -> google
www.google.org -> google.org -> google
www.google.org.uk -> google.org.uk -> google
Optional:
www.google.com -> google.com -> www.google
images.google.com -> google.com -> images.google
mail.yahoo.co.uk -> yahoo.co.uk -> mail.yahoo
mail.yahoo.com -> yahoo.com -> mail.yahoo
www.mail.yahoo.com -> yahoo.com -> mail.yahoo
You don't need to construct an ever-changing regex as 99% of domains will be matched properly if you simply look at the 2nd last part of the name:
(co|com|gov|net|org)
If it is one of these, then you need to match 3 dots, else 2. Simple. Now, my regex wizardry is no match for that of some other SO'ers, so the best way I've found to achieve this is with some code, assuming you've already stripped off the path:
my @d=split /\./,$domain; # split the domain part into an array
$c=@d; # count how many parts
$dest=$d[$c-2].'.'.$d[$c-1]; # use the last 2 parts
if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
$dest=$d[$c-3].'.'.$dest; # if so, add a third part
};
print $dest; # show it
To just get the name, as per your question:
my @d=split /\./,$domain; # split the domain part into an array
$c=@d; # count how many parts
if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
$dest=$d[$c-3]; # if so, give the third last
$dest=$d[$c-4].'.'.$dest if ($c>3); # optional bit
} else {
$dest=$d[$c-2]; # else the second last
$dest=$d[$c-3].'.'.$dest if ($c>2); # optional bit
};
print $dest; # show it
I like this approach because it's maintenance-free. Unless you want to validate that it's actually a legitimate domain, but that's kind of pointless because you're most likely only using this to process log files and an invalid domain wouldn't find its way in there in the first place.
If you'd like to match "unofficial" subdomains such as bozo.za.net, or bozo.au.uk, bozo.msf.ru just add (za|au|msf) to the regex.
I'd love to see someone do all of this using just a regex, I'm sure it's possible.

/[^w{3}\.]([a-zA-Z0-9]([a-zA-Z0-9\-]{0,65}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}/gim
Using this JavaScript regex ignores www and the following dot while keeping the domain intact. It also properly matches hosts without www and ccTLDs.

Could you just look for the word before .com (or another suffix) and take the first matching group? (The order of the list of other suffixes would be the opposite of their frequency; see here.)
i.e.
window.location.host.match(/(\w|-)+(?=(\.(com|net|org|info|coop|int|co|ac|ie|co|ai|eu|ca|icu|top|xyz|tk|cn|ga|cf|nl|us|eu|de|hk|am|tv|bingo|blackfriday|gov|edu|mil|arpa|au|ru)(\.|\/|$)))/g)[0]
You can test it by copying this line into the developers' console on any tab.
This example works in the following cases:

So if you just have a string and not a window.location you could use...
String.prototype.toUrl = function(){
if(!this || this.length === 0)
{
return undefined;
}
var original = this.toString();
var s = original;
if(!original.toLowerCase().startsWith('http'))
{
s = 'http://' + original;
}
s = s.split('/');
var protocol = s[0];
var host = s[2];
var relativePath = '';
if(s.length > 3){
for(var i=3;i< s.length;i++)
{
relativePath += '/' + s[i];
}
}
s = host.split('.');
var domain = s[s.length-2] + '.' + s[s.length-1];
return {
original: original,
protocol: protocol,
domain: domain,
host: host,
relativePath: relativePath,
getParameter: function(param)
{
return this.getParameters()[param];
},
getParameters: function(){
var vars = [], hash;
var hashes = this.original.slice(this.original.indexOf('?') + 1).split('&');
for (var i = 0; i < hashes.length; i++) {
hash = hashes[i].split('=');
vars.push(hash[0]);
vars[hash[0]] = hash[1];
}
return vars;
}
};};
How to use.
var str = "http://en.wikipedia.org/wiki/Knopf?q=1&t=2";
var url = str.toUrl();
var host = url.host;
var domain = url.domain;
var original = url.original;
var relativePath = url.relativePath;
var paramQ = url.getParameter('q');
var paramT = url.getParameter('t');

For a certain purpose I did this quick Python function yesterday. It returns domain from URL. It's quick and doesn't need any input file listing stuff. However, I don't pretend it works in all cases, but it really does the job I needed for a simple text mining script.
Output looks like this :
http://www.google.co.uk => google.co.uk
http://24.media.tumblr.com/tumblr_m04s34rqh567ij78k_250.gif => tumblr.com
import re
def getDomain(url):
    parts = re.split(r"\/", url)
    match = re.match(r"([\w\-]+\.)*([\w\-]+\.\w{2,6}$)", parts[2])
    if match is not None:
        if re.search(r"\.uk", parts[2]):
            match = re.match(r"([\w\-]+\.)*([\w\-]+\.[\w\-]+\.\w{2,6}$)", parts[2])
        return match.group(2)
    else: return ''
Seems to work pretty well.
However, it has to be modified to remove domain extensions on output as you wished.

how is this
=((?:(?:(?:http)s?:)?\/\/)?(?:(?:[a-zA-Z0-9]+)\.?)*(?:(?:[a-zA-Z0-9]+))\.[a-zA-Z0-9]{2,3})
(You may want to add "\/" to the end of the pattern.)
If your goal is to strip URLs passed in as a parameter, you may add the equals sign as the first character, like:
=((?:(?:(?:http)s?:)?//)?(?:(?:[a-zA-Z0-9]+).?)*(?:(?:[a-zA-Z0-9]+)).[a-zA-Z0-9]{2,3}/)
and replace with "/".
The goal of this example is to get rid of any domain name regardless of the form it appears in
(i.e. to ensure URL parameters don't include domain names, to avoid XSS attacks).

All the answers here are very nice, but each of them will fail sometimes.
I know it is not common to link to something already answered elsewhere, but you will find that you should not waste your time on an impossible task.
The reason is that, for domains like mydomain.co.uk, there is no way to know whether an extracted domain is correct.
If you are extracting from URLs that always have http or https in front, you can get by with something simple like the code below. If the URLs may have nothing in front, you have to remove the check
filter_var($url, FILTER_VALIDATE_URL)
from it, because FILTER_VALIDATE_URL does not recognize a string that does not begin with http as a URL:
$url = strtolower('hTTps://www.example.com/w3/forum/index.php');
if( filter_var($url, FILTER_VALIDATE_URL) && substr($url, 0, 4) == 'http' )
{
// array order is !important
$domain = str_replace(array("http://www.","https://www.","http://","https://"), array("","","",""), $url);
$spos = strpos($domain,'/');
if($spos !== false)
{
$domain = substr($domain, 0, $spos);
} } else { $domain = "can't extract a domain"; }
echo $domain;
Check FILTER_VALIDATE_URL default behavior here
But if you want to check a domain for validity and ALWAYS be sure that the extracted value is correct, then you have to check against an array of valid top-level domains, as explained here:
https://stackoverflow.com/a/70566657/6399448
Otherwise you will NEVER be sure that the extracted string is the correct domain. Unfortunately, all the answers here will fail sometimes.
P.S. The only answer here that makes sense to me is this one (sorry, I had not read it before). It provides the same solution, even though it does not provide an example like the one mentioned or linked above:
https://stackoverflow.com/a/569219/6399448

I know you actually asked for a regex and were not specific about a language, but in JavaScript you can do it like this. Maybe other languages can parse URLs in a similar way.
Easy JavaScript solution:
const domain = (new URL(str)).hostname.replace(/^www\./, "");
Leaving this solution in JS for completeness.

In Javascript, the best way to do this is using the tld-extract npm package. Check out an example at the following link.
Below is the code for the same:
var tldExtract = require("tld-extract")
const urls = [
'http://www.mail.yahoo.co.in/',
'https://mail.yahoo.com/',
'https://www.abc.au.uk',
'https://github.com',
'http://github.ca',
'https://www.google.ru',
'https://google.co.uk',
'https://www.yandex.com',
'https://yandex.ru',
]
const tldList = [];
urls.forEach(url => tldList.push(tldExtract(url)))
console.log({tldList})
which results in the following output:
0: Object {tld: "co.in", domain: "yahoo.co.in", sub: "www.mail"}
1: Object {tld: "com", domain: "yahoo.com", sub: "mail"}
2: Object {tld: "uk", domain: "au.uk", sub: "www.abc"}
3: Object {tld: "com", domain: "github.com", sub: ""}
4: Object {tld: "ca", domain: "github.ca", sub: ""}
5: Object {tld: "ru", domain: "google.ru", sub: "www"}
6: Object {tld: "co.uk", domain: "google.co.uk", sub: ""}
7: Object {tld: "com", domain: "yandex.com", sub: "www"}
8: Object {tld: "ru", domain: "yandex.ru", sub: ""}

Found a custom function which works in most of the cases:
function getDomainWithoutSubdomain(url) {
const urlParts = new URL(url).hostname.split('.')
return urlParts
.slice(0)
.slice(-(urlParts.length === 4 ? 3 : 2))
.join('.')
}

You need a list of what domain prefixes and suffixes can be removed. For example:
Prefixes:
www.
Suffixes:
.com
.co.in
.au.uk

#!/usr/bin/perl -w
use strict;
my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+)\.[^\/]+/g) {
print $3;
}

/^(?:https?:\/\/)?(?:www\.)?([^\/]+)/i

Just for knowledge:
'http://api.livreto.co/books'.replace(/^(https?:\/\/)([a-z]{3}[0-9]?\.)?(\w+)(\.[a-zA-Z]{2,3})(\.[a-zA-Z]{2,3})?.*$/, '$3$4$5');
// returns livreto.co

I know the question is seeking a regex solution, but no single attempt will cover everything.
I decided to write this method in Python which only works with urls that have a subdomain (i.e. www.mydomain.co.uk) and not multiple level subdomains like www.mail.yahoo.com
def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

Let's say we have this: http://google.com
and you only want the domain name
let url = "http://google.com";
let domainName = url.split("://")[1];
console.log(domainName);

Use this
(.)(.*?)(.)
then just extract the leading and end points.
Easy, right?

Regular expression to determine website root

I have the following URLs, all of which are considered the root of the website. How can I use a regex against JavaScript's location.pathname to detect the pattern below? As you'll notice, the word "site" repeats in this pattern.
http://www.somehost.tv/sitedev/
http://www.somehost.tv/sitetest/
http://www.somehost.tv/site/
http://www.somehost.tv/sitedev/index.html
http://www.somehost.tv/sitetest/index.html
http://www.somehost.tv/site/index.html
I am attempting to display a jQuery dialog if and only if the user is at the root of the website.

Simply use the DOM to parse this. No need to invoke a regex parser.
var url = 'http://www.somesite.tv/foobar/host/site';
var urlLocation = document.createElement('a');
urlLocation.href = url;
alert(urlLocation.hostname); // alerts 'www.somesite.tv'

A complete pattern, including protocol and domain, could be like this:
/^http:\/\/www\.somehost\.tv\/site(test|dev)?\/(index\.html)?$/
but, if you're matching against location.pathname just try
/^\/site(test|dev)?\/(index\.html)?$/.test(location.pathname)
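To check the pathname pattern against the URLs from the question without a browser, the pathname can be taken from new URL() instead of location.pathname. That substitution, the two extra negative samples, and the names ROOT_PATTERN and isSiteRoot are assumptions made for this sketch:

```javascript
const ROOT_PATTERN = /^\/site(test|dev)?\/(index\.html)?$/;

// Hypothetical helper: true when the pathname is one of the "root" forms.
function isSiteRoot(pathname) {
  return ROOT_PATTERN.test(pathname);
}

const roots = [
  "http://www.somehost.tv/sitedev/",
  "http://www.somehost.tv/sitetest/",
  "http://www.somehost.tv/site/",
  "http://www.somehost.tv/sitedev/index.html",
  "http://www.somehost.tv/sitetest/index.html",
  "http://www.somehost.tv/site/index.html"
];

// Every URL from the question should count as a root.
roots.forEach(u => console.log(u, isSiteRoot(new URL(u).pathname)));

// Anything deeper, or missing the trailing slash, should not.
console.log(isSiteRoot("/site/about.html")); // false
console.log(isSiteRoot("/sitedev"));         // false
```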

If you do not explicitly need a regular expression for this, you could also, for example:
Fill an array with your URLs.
Loop over a decreasing substring of the shortest element,
comparing it against the longest element,
until they match.
var urls = ["http://www.somehost.tv/sitedev/",
"http://www.somehost.tv/sitetest/",
"http://www.somehost.tv/site/",
"http://www.somehost.tv/sitedev/index.html",
"http://www.somehost.tv/sitetest/index.html",
"http://www.somehost.tv/site/index.html"]
function getRepeatedSub(arr) {
var srt = arr.concat().sort();
var a = srt[0];
var b = srt.pop();
var s = a.length;
while (!~b.indexOf(a.substr(0, s))) {
s--
};
return a.substr(0, s);
}
console.log(getRepeatedSub(urls)); //http://www.somehost.tv/site
Heres an example on JSBin

