simple regex: multiple dots(.) - javascript

I am trying to delete everything after the .com (or other TLD) in a URL to make it more readable,
so
sub.domain.com/324fr9?=awerf?=awrf
turns to
sub.domain.com/
except the same regex doesn't work for
noSubDomain.com/crap?=yes123456789timesOver
because it only has one dot, not two!
Here's my regex (JavaScript):
/.*:\/\/.*\..*\.com/g

If you are segmenting the URL and are willing to do it in Perl, CPAN's URI module might be a better option. $uri->host is what you want, but the URI module can do many other things as well.

JavaScript has the window.location object, which is a better way to get URL information for the current page - http://www.w3schools.com/jsref/obj_location.asp
Otherwise, you could just design your regex to delete everything after the first / instead.
url = url.replace(/\/.*/g, "/");
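A quick check of the replace call above, as a minimal sketch. The helper name stripAfterFirstSlash is made up for illustration. Note that the approach assumes the input has no scheme, since otherwise the first slash of http:// becomes the cut point.

```javascript
// Hypothetical helper wrapping the replace call from the answer.
function stripAfterFirstSlash(url) {
  // The first "/" and everything after it collapse into a single "/".
  return url.replace(/\/.*/, "/");
}

console.log(stripAfterFirstSlash("sub.domain.com/324fr9?=awerf?=awrf"));
// sub.domain.com/
console.log(stripAfterFirstSlash("noSubDomain.com/crap?=yes123456789timesOver"));
// noSubDomain.com/

// With a scheme, the cut lands inside "http://", which is probably not wanted.
console.log(stripAfterFirstSlash("http://sub.domain.com/324fr9"));
// http:/
```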

/.*:\/\/(.*\.)?.*\.com/g
Here's the part that matters: (.*\.)? The question mark says that everything in that group is optional.
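To see the optional group at work, here is a small sketch. The helper name trimToHost is hypothetical, and the sample URLs are the ones from the question with an http:// scheme added, because the pattern requires ://. Since .* is greedy, a later ".com" in the path or query string would extend the match, so this is only as robust as the input is simple.

```javascript
// The pattern from the answer: the subdomain group "(.*\.)?" is optional.
const hostPattern = /.*:\/\/(.*\.)?.*\.com/;

// Hypothetical helper: keep scheme and host, drop everything after ".com".
function trimToHost(url) {
  const match = url.match(hostPattern);
  // Leave the input untouched when the pattern does not apply.
  return match ? match[0] + "/" : url;
}

console.log(trimToHost("http://sub.domain.com/324fr9?=awerf?=awrf"));
// http://sub.domain.com/
console.log(trimToHost("http://noSubDomain.com/crap?=yes123456789timesOver"));
// http://noSubDomain.com/
```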


Regular Expression - Extract subdomain & domain

I'm trying to form a regular expression (javascript/node.js) which will extract the sub-domain & domain part from any given URL. This is what I ended up with:
[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)
Right now, I'm just considering http, https for protocol & exclude "www." portion from the subdomain+domain portion of an URL. I checked the expression & it almost works. But, here is the issue:
Success
'http://mplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)
'http://lplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)
Failure
'http://play.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)
'http://tplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)
I just use the first element of the result array. I can't understand why "play." and "tplay." don't work. Could anyone please help me with this?
Do "/p" and "/t" have any special meaning for the regular expression evaluator?
Is there any other way of extracting the sub-domain and domain from a given URL using a regular expression?
Edit -
Example:
https://play.google.com/store/apps/details?id=com.skgames.trafficracer => play.google.com
https://mail.google.com/mail/u/0/#inbox => mail.google.com
Your regex doesn't seem correct. Try this regex:
/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/img
You are about the one millionth person to try to parse URLs in JavaScript. I'm a little bit surprised you didn't see any of the existing questions on SO dating back years. The last thing you want to do is write yet another broken regexp, with all due respect to those that provided answers to your question.
There are many well documented libraries and approaches to handling this. Google it. The simplest way is to create an a element in memory, assign it an href, and then access its hostname and other properties. See http://tutorialzine.com/2013/07/quick-tip-parse-urls/. If that does not float your boat, then use a library like uri.js.
If you really don't want to use a library, and insist on reinventing the wheel, then at least do something like the following:
function get_domain_from_url(url) {
var a = document.createElement('a');
a.setAttribute('href', url);
return a.hostname;
}
Essentially, you are delegating the extraction of the subdomain/domain part of the URL to the browser's URL parsing logic, which is MUCH better than anything you will ever write.
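Outside the browser there is no document to create an anchor element from. A minimal sketch of the same delegation idea, assuming a runtime that provides the standard URL class (modern browsers and Node.js); the function name getHostname is made up here. Unlike the anchor trick, the URL constructor throws on input it cannot parse as an absolute URL.

```javascript
// Assumes the WHATWG URL class is available as a global.
function getHostname(url) {
  // The constructor throws a TypeError for relative or malformed input.
  return new URL(url).hostname;
}

console.log(getHostname("https://play.google.com/store/apps/details?id=com.skgames.trafficracer"));
// play.google.com
console.log(getHostname("https://mail.google.com/mail/u/0/#inbox"));
// mail.google.com
```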
Also see Parse URL with jquery/ javascript?, Parse URL with Javascript, How do I parse a URL into hostname and path in javascript?, or parse URL with JavaScript or jQuery. How did you miss those? Sorry, I have to vote to close this as a duplicate.
The same RegExp as in anubhava's answer, only added support for protocol-relative URLs like //google.com:
/^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im
Here's a solution ignoring everything before ://
.*\://?([^\/]+)
In case you want to ignore www.
.*\://(?:www\.)?([^\/]+)
Your regular expression works pretty well. You only need to remove the square brackets, which turn the group into a negated character class, and anchor the pattern at the start. The final expression is:
^(?:http:\/\/|www\.|https:\/\/)([^\/]+)
Hope it's useful!
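To illustrate why "play." and "tplay." failed: inside [^...] the characters are not an alternation but a set of single characters to exclude, here ( ? : h t p / | w . s ). The bracketed pattern therefore starts matching at the first character that is not in that set, and "p" and "t" are both in it. A small sketch comparing the two patterns on the URLs from the question:

```javascript
// Pattern from the question: [^...] is a negated character class.
const bracketed = /[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i;
// Pattern from the answer: a real non-capturing group, anchored at the start.
const fixed = /^(?:http:\/\/|www\.|https:\/\/)([^\/]+)/i;

const urls = [
  "http://mplay.google.co.in/sadfask/asdkfals?dk=10",
  "http://play.google.co.in/sadfask/asdkfals?dk=10",
  "http://tplay.google.co.in/sadfask/asdkfals?dk=10",
];

for (const url of urls) {
  // The asker read element 0 of the bracketed result; the fix reads group 1.
  console.log(url.match(bracketed)[0], "|", url.match(fixed)[1]);
}
// mplay.google.co.in | mplay.google.co.in
// lay.google.co.in | play.google.co.in
// lay.google.co.in | tplay.google.co.in
```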
I know I am late to the party but I want to answer the question with some extra useful info.
Get the domain name from a link using regex.
^(https?:\/\/)?(www\.)?([^\/]+)
If you want to get the subdomain, split the host captured by the regex above at the first occurrence of "."
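A minimal sketch of that split, with a made-up helper name splitSubdomain. It only cuts at the first dot, so for a host without a subdomain (google.co.in, say) the first label is the registered name rather than a subdomain; telling the two apart reliably needs a public suffix list, which is out of scope here.

```javascript
// Hypothetical helper: split a host at its first dot.
function splitSubdomain(host) {
  const dot = host.indexOf(".");
  if (dot === -1) {
    // No dot at all, e.g. "localhost".
    return { subdomain: "", domain: host };
  }
  return { subdomain: host.slice(0, dot), domain: host.slice(dot + 1) };
}

console.log(splitSubdomain("play.google.com"));
// { subdomain: 'play', domain: 'google.com' }
```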
Note: in the single-run timings below, the regex came out roughly 15x faster than the built-in module. The second timing also includes the cost of require('url'), so treat the ratio as a rough indication only.
JavaScript example with regex:
console.time('time2');
const pttrn = /^(https?:\/\/)?(www\.)?([^\/]+)/gm
const urlInfo = pttrn.exec("https://www.google.co.in/imghp");
console.timeEnd('time2');
//time2: 0.055ms
console.log(urlInfo[0]) // https://www.google.co.in
console.log(urlInfo[1]) // https://
console.log(urlInfo[2]) // www.
console.log(urlInfo[3]) // google.co.in
Node.js with the built-in url module:
console.time('time');
const url = require('url');
const urlInfo = url.parse("https://www.google.co.in/imghp");
console.timeEnd('time');
//time: 0.840ms;
console.log(urlInfo.hostname) //www.google.co.in
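A single timed call mostly measures warm-up. A slightly fairer comparison, as a sketch only, repeats each approach many times and keeps setup out of the timed loop. This assumes Node.js (for process.hrtime.bigint) and swaps url.parse for the standard URL class; the helper names are made up. The regex is used without the g flag because exec on a global regex keeps state in lastIndex between calls. No timings are quoted here since they vary by machine and Node version.

```javascript
// No g flag: exec on a global regex is stateful across calls.
const hostPattern = /^(https?:\/\/)?(www\.)?([^\/]+)/;

function hostByRegex(link) {
  const m = hostPattern.exec(link);
  return m ? m[3] : null;
}

function hostByUrl(link) {
  // Strip a leading "www." so both functions return the same thing.
  return new URL(link).hostname.replace(/^www\./, "");
}

// Average nanoseconds per call over the given number of iterations.
function timeIt(fn, arg, iterations) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn(arg);
  const elapsedNs = process.hrtime.bigint() - start;
  return Number(elapsedNs) / iterations;
}

const link = "https://www.google.co.in/imghp";
console.log("regex:", timeIt(hostByRegex, link, 100000), "ns/call");
console.log("URL:  ", timeIt(hostByUrl, link, 100000), "ns/call");
```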

URL RegExp WITHOUT http:// or www

I'm trying to construct URL RegExp. The base expression looks like:
/^(((http(?:s)?\:\/\/)|www\.)[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*\.[a-zA-Z]{2,6}(?:\/?|(?:\/[\w\-]+)*)(?:\/?|\/\w+((\.[a-zA-Z]{2,4})?)(?:\?[\w]+\=[\w\-]+)?)?(?:\&[\w]+\=[\w\-]+)*)$/
It looks good to me, because it matches these:
http://gmail.com
http://www.gmail.com
www.gmail.com
But I would like to modify it to match this:
gmail.com
I will appreciate any help.
Just add a ? after the prefix group to make http:// and www. optional; then it will match gmail.com as well.
Use this:
^(((http(?:s)?\:\/\/)|www\.)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*\.[a-zA-Z]{2,6}(?:\/?|(?:\/[\w\-]+)*)(?:\/?|\/\w+((\.[a-zA-Z]{2,4})?)(?:\?[\w]+\=[\w\-]+)?)?(?:\&[\w]+\=[\w\-]+)*)$
Or, if you want to match only gmail.com and not http://gmail.com, use this:
^([a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*\.[a-zA-Z]{2,6}(?:\/?|(?:\/[\w\-]+)*)(?:\/?|\/\w+((\.[a-zA-Z]{2,4})?)(?:\?[\w]+\=[\w\-]+)?)?(?:\&[\w]+\=[\w\-]+)*)$
Please note that this will match any string made up of letters and dots.
IMO you would be better off using a regex like this:
^(http:\/\/|www\.)?[\w\.]+\.(com|net|co\.cc|co\.in)$
You can modify it according to your needs.
Check out a demo here and play around with the regex:
http://regex101.com/r/tS4aB3
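As a quick sanity check of the first pattern in this answer (the one with the optional prefix), the sketch below runs it against the inputs from the question plus two made-up strings. The variable names are illustrative only.

```javascript
// The question's pattern with the http(s):// or www. prefix made optional.
const urlPattern = /^(((http(?:s)?\:\/\/)|www\.)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*\.[a-zA-Z]{2,6}(?:\/?|(?:\/[\w\-]+)*)(?:\/?|\/\w+((\.[a-zA-Z]{2,4})?)(?:\?[\w]+\=[\w\-]+)?)?(?:\&[\w]+\=[\w\-]+)*)$/;

const samples = [
  "http://gmail.com",
  "http://www.gmail.com",
  "www.gmail.com",
  "gmail.com",
  "foo.bar",   // made-up input: accepted, since any letters.letters string passes
  "not a url", // made-up input: rejected because of the spaces
];

const results = samples.map((s) => urlPattern.test(s));
console.log(results);
// [ true, true, true, true, true, false ]
```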
The easiest way is to treat 'www' as just another subdomain (because that's all it is).
So:
/^(((http(?:s)?\:\/\/))?([a-zA-Z0-9\-]+\.?)+(?:\.[a-zA-Z0-9\-]+)*\.[a-zA-Z]{2,6}(?:\/?|(?:\/[\w\-]+)*)(?:\/?|\/\w+((\.[a-zA-Z]{2,4})?)(?:\?[\w]+\=[\w\-]+)?)?(?:\&[\w]+\=[\w\-]+)*)$/
Edit: as a side note, the tld (i.e. the ".com" part) is... quite complicated these days. There are a lot of them, and they may not fit easily in 2-6 chars.

trouble using string.replace with regex

Given a regex like this:
http://rubular.com/r/ai1LFT5jvK
I want to use string.replace to replace "subdir" with a string of my choosing.
Doing myStr.replace(/^.*\/\/.*\.net\/.*\/(.*)\/.*\z/,otherStr)
only returns the same string, as shown here: http://jsfiddle.net/nLmbV/
If you view the Rubular link, the regex appears to capture what I want it to capture, but in the fiddle it doesn't replace it.
I'd like to know why this happens, and what I'm doing wrong. A correct regex or a correct implementation of the replace call would be nice, but most of all, I want to understand what I'm doing wrong so that I can avoid it in the future.
EDIT
I've updated the fiddle to change my regex from:
/^.*\/\/.*\.net\/.*\/(.*)\/.*\z/
to
/^.*\/\/.*\.net\/.*\/(.*)\/.*$/
And according to the fiddle, it just returns hello instead of https://xxxxxxxxxxx.cloudfront.net/dir/hello/Slide1_v2.PNG
It's that little \z in your regex.
You probably forgot to replace it with a $ sign. JavaScript uses ^ and $ as anchors, while Ruby uses \A and \z. In a JavaScript regex, \z simply matches a literal "z", so the pattern never matches your URL.
To answer your edit:
The match is always replaced as a whole. You'll want to group both the left side and the right side of the to-be-replaced part and reinsert it in the replacement:
url.replace(/^(.*\/\/.*\.net\/.*\/).*(\/.*)$/,"$1hello$2")
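The three variants side by side, as a small sketch. The input URL is an assumption, reconstructed from the expected output in the question with "subdir" as the segment to replace.

```javascript
// Assumed input, reconstructed from the expected output in the question.
const url = "https://xxxxxxxxxxx.cloudfront.net/dir/subdir/Slide1_v2.PNG";

// Ruby-style \z: JavaScript reads it as a literal "z", so nothing matches
// and replace returns the string unchanged.
const withRubyAnchor = url.replace(/^.*\/\/.*\.net\/.*\/(.*)\/.*\z/, "hello");

// With $ the pattern matches the whole string, so the whole string is replaced.
const wholeMatch = url.replace(/^.*\/\/.*\.net\/.*\/(.*)\/.*$/, "hello");

// Capturing both sides and reinserting them swaps only the middle segment.
const grouped = url.replace(/^(.*\/\/.*\.net\/.*\/).*(\/.*)$/, "$1hello$2");

console.log(withRubyAnchor === url); // true
console.log(wholeMatch);             // hello
console.log(grouped);
// https://xxxxxxxxxxx.cloudfront.net/dir/hello/Slide1_v2.PNG
```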
Before I get marked down, I know the question asks about regexps. The reason for this answer is that URLs are nearly impossible to process reliably with a regexp without writing something fiendishly complex. It can be done, but it makes your head hurt!
If you are doing this in a browser, you can use an A tag in your script to make things much simpler. The A tag knows how to parse them into pieces, and it lets you modify the pieces independently, so you only need to deal with the pathname:
//make a temporary a tag
var a = document.createElement('a');
//set the href property to the url you want to process
a.href = "scheme://host.domain/path/to/the/file?querystring"
//grab the path part of the url, and chop up into an array of directories
var dirs = a.pathname.split('/');
//set 2nd dir name - array is ['','path','to','the','file']
dirs[2]='hello';
//put the path back together
a.pathname = dirs.join('/');
a.href now contains the URL you want.
More lines, but also more hair left when you come back to change the code later.
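The same idea also works without a DOM. A sketch assuming a runtime with the standard URL class (modern browsers, Node.js); the function name replacePathSegment and the sample URL are made up for illustration.

```javascript
// Hypothetical helper: replace one directory in the path of an absolute URL.
function replacePathSegment(href, index, replacement) {
  const u = new URL(href);
  // "/dir/subdir/file" splits into ['', 'dir', 'subdir', 'file'].
  const dirs = u.pathname.split("/");
  dirs[index] = replacement;
  u.pathname = dirs.join("/");
  return u.href;
}

console.log(
  replacePathSegment("https://xxxxxxxxxxx.cloudfront.net/dir/subdir/Slide1_v2.PNG", 2, "hello")
);
// https://xxxxxxxxxxx.cloudfront.net/dir/hello/Slide1_v2.PNG
```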

Getting http URL using regex when there are multiple http URLs

I want to extract different .swf files from different sites for a project. Different sites use different source methods so I can't use src= or data= in my regex.
I'm able to match the file name with /[\w-]+.swf/g , but when I try to match the full path( http(.*?).swf ) starting with http it matches another http before the path (the first one in the code). Also I can't use src= or data= etc, it must be only the link.
Basically, is there a way to limit the match to the first http found when searching backwards?
If anyone cares to take a look then here's the code: http://pastebin.com/kT20UqqJ .
And here's a good place to test regex: http://regex.larsolavtorvik.com/
Try the following one:
var regex = /http:[\.\/\w-%]+\.swf/g
You need to escape the . else it will match an arbitrary character and the / since it is the expression delimiter.
You can see the working Example here.
If you have url encoded characters (like white space) you would have also a % in your url.
Here is an example which will work in this case: /http:[\./\w%-]+\.swf/g
Here is a tool where you can test the regex: http://regexpal.com/
And one where you can check its performance: http://regexter.com/
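To show why the character class helps where the lazy quantifier does not, here is a small sketch on a made-up HTML fragment. The lazy pattern from the question still starts at the first "http" it finds, while the class cannot cross quotes, spaces or angle brackets, so a match has to begin at the "http:" that belongs to the .swf file.

```javascript
// Made-up fragment with an unrelated link before the .swf URL.
const html =
  '<a href="http://example.com/page">site</a> ' +
  '<embed src="http://example.com/media/movie-1.swf">';

// Lazy match from the question: begins at the FIRST "http" in the text.
const lazy = html.match(/http(.*?).swf/)[0];

// Restricted character class from the answer (hyphen last, so it is literal).
const restricted = html.match(/http:[.\/\w%-]+\.swf/g);

console.log(lazy.startsWith("http://example.com/page")); // true
console.log(restricted);
// [ 'http://example.com/media/movie-1.swf' ]
```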

How to identify all URLs that contain a (domain) substring?

If I am correct, the following code will only match a URL that is exactly as presented.
However, what would it look like if you wanted to identify subdomains as well as urls that contain various different query strings - in other words, any address that contains this domain:
var url = /test.com/
if (window.location.href.match(url)){
alert("match!");
}
If you want this regex to match the literal text "test.com", you need to escape the ".", which otherwise means "any character" in regex syntax. To match the surrounding slashes literally as well, escape both "/" characters too.
Escaped : \/test\.com\/
Take a look here for more info.
No, your pattern will actually match on all strings containing test.com.
The regular expression /test.com/ says to match test[ANY CHARACTER]com anywhere in the string.
It is better to use example.com for example links, so I replaced test with example.
Some example matches could be
http://example.com
http://examplexcom.xyz
http://example!com.xyz
http://example.com?q=123
http://sub.example.com
http://fooexample.com
http://example.com/asdf/123
http://stackoverflow.com/?site=example.com
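If the goal is "this domain or any of its subdomains, whatever the path or query string", comparing the parsed hostname avoids most of the false positives in the list above. A sketch assuming the standard URL class is available; the function name isOnDomain is made up for illustration.

```javascript
// Unescaped dot: matches any character between "example" and "com".
console.log(/example.com/.test("http://examplexcom.xyz"));  // true
// Escaped dot: only a literal "." is accepted.
console.log(/example\.com/.test("http://examplexcom.xyz")); // false

// Hypothetical helper: true for the domain itself and for its subdomains.
function isOnDomain(href, domain) {
  const host = new URL(href).hostname;
  return host === domain || host.endsWith("." + domain);
}

console.log(isOnDomain("http://sub.example.com/asdf/123?q=123", "example.com"));      // true
console.log(isOnDomain("http://fooexample.com", "example.com"));                      // false
console.log(isOnDomain("http://stackoverflow.com/?site=example.com", "example.com")); // false
```

In a browser, window.location.hostname could be compared the same way instead of parsing window.location.href.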
I think you need to use /g. The g flag enables "global" matching. When using the replace() method, specify this modifier to replace all matches, rather than only the first one:
var url = /test.com/g;
If you want to test whether a URL is valid, this is the one I use. It is fairly complex, because it also takes care of numeric domains and a few other peculiarities:
var urlMatcher = /(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?/;
It takes care of parameters, anchors, etc. Please don't ask me to explain the details.
