I am new at making regular expressions, and so this might just be a stupid oversight, but my regex (that aims to match URLs) is not working. My goal was to have it match any URLs like:
http://www.somewhere.com
somewhere.com
https://ww3.some_where-hi.com
www.goop.go/herp/derp.lol
The regex I built is below; however, it does not fully match a URL like http://t.co/GZhtBh6c: it stops matching at the number 6 (as determined by www.regexpal.com).
((http|https)://)?([a-z0-9]+\.)?[a-z0-9\-_]+.[a-z]+(/[a-z0-9\-_]*)*([a-z0-9\-_]*\.[a-z]+){0,1}
Can anyone tell me why this is not working? Also, I'm sure this is not the best solution. If you have a more elegant regex for this, I would love to see it.
P.S. This regex will be used with javascript.
Validate if a string holds a URL as specified in RFC 3986. Both absolute and relative URLs are supported.
This matches your provided samples and more. It also lets you extract the different parts of the URL.
^
(# Scheme
[a-z][a-z0-9+\-.]*:
(# Authority & path
//
([a-z0-9\-._~%!$&'()*+,;=]+@)? # User
([a-z0-9\-._~%]+ # Named host
|\[[a-f0-9:.]+\] # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\]) # IPvFuture host
(:[0-9]+)? # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Path
|# Path without authority
(/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
)
|# Relative URL (no scheme or authority)
([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Relative path
|(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?) # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
In javascript this becomes
if (/^([a-z][a-z0-9+\-.]*:(\/\/([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|\[[a-f0-9:.]+\]|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])(:[0-9]+)?(\/[a-z0-9\-._~%!$&'()*+,;=:@]+)*\/?|(\/?[a-z0-9\-._~%!$&'()*+,;=:@]+(\/[a-z0-9\-._~%!$&'()*+,;=:@]+)*\/?)?)|([a-z0-9\-._~%!$&'()*+,;=@]+(\/[a-z0-9\-._~%!$&'()*+,;=:@]+)*\/?|(\/[a-z0-9\-._~%!$&'()*+,;=:@]+)+\/?))(\?[a-z0-9\-._~%!$&'()*+,;=:@\/?]*)?(#[a-z0-9\-._~%!$&'()*+,;=:@\/?]*)?$/im.test(subject)) {
// Successful match
} else {
// Match attempt failed
}
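Use [a-zA-Z] instead of [a-z] (or add the i flag). Avoid [A-z], which also matches the punctuation characters between Z and a ([, \, ], ^, _ and `).
Your lowercase a-z only matches lowercase letters.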
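A quick check suggests that letter case is only part of the story. Even with the i flag, the unescaped . in the question's pattern can consume the / after the host, and the following [a-z]+ then stops at the first digit, which matches the "stops at 6" symptom. The sketch below compares the question's pattern with a variant where that dot is escaped; the escaped variant is an assumed fix, not something proposed in the thread.

```javascript
// The pattern from the question, with the i flag added and "/" escaped
// as required inside a JavaScript regex literal.
const original = /((http|https):\/\/)?([a-z0-9]+\.)?[a-z0-9\-_]+.[a-z]+(\/[a-z0-9\-_]*)*([a-z0-9\-_]*\.[a-z]+){0,1}/i;

// Same pattern, but the dot before the top-level domain is escaped
// (assumed fix) so that it can only match a literal ".".
const escaped = /((http|https):\/\/)?([a-z0-9]+\.)?[a-z0-9\-_]+\.[a-z]+(\/[a-z0-9\-_]*)*([a-z0-9\-_]*\.[a-z]+){0,1}/i;

const url = "http://t.co/GZhtBh6c";

const originalMatch = original.exec(url)[0];
const escapedMatch = escaped.exec(url)[0];

console.log(originalMatch); // http://t.co/GZhtBh
console.log(escapedMatch);  // http://t.co/GZhtBh6c
```

Neither pattern is anchored, so both report the longest prefix they can match rather than rejecting the string.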
I'm using a linkify function, which detects link-like patterns by using regex and replaces those with a-tags to reveal a clickable link.
The regex looks like that:
// http://, https://, ftp://
var urlPattern = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;
/* Some explanations:
(?! # Negative lookahead start (will cause match to fail if contents match)
[^<]* # Any number of non-'<' characters
> # A > character
| # Or
[^<>]* # Any number of non-'<' and non-'>' characters
</ # The characters < and /
) # End negative lookahead.
*/
and replaces the link like this:
return textInput.replace(urlPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>')
The regex works perfectly for in-text links. However, I am using it in HTML-Code also, such as
<ul><li>Link: https://www.link.com</li></ul> //linkify not working
<ul><li>Link: https://www.link.com <br/></li></ul> //linkify working
where just the second example is working. I don't know why the behavior is different and would be very glad to get some help from you. What should my regex look like to linkify without the break in list elements?
If I understood your issue correctly, I think this regex should detect the links in both scenarios:
\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)
Essentially, the first part (the word boundary and the negative lookahead) rejects positions that are inside a tag.
Then we grab the parts of interest: the protocol is a non-capturing group, as in your original expression, so that it can be stripped later if it is not needed. The capturing group takes the remaining part of the URL.
Because of the way the regex is built, we can decide whether to take the entire URL (the full match) or just the part after the protocol (group 1).
To log the two parts, you can use this snippet:
const str = '<ul><li>Link: https://www.link.com</li></ul>';
var myRegexp = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;
var match = myRegexp.exec(str);
console.log(match[0]);
console.log(match[1]);
Possible variations:
in a situation like the one presented above, you can simplify your regex further to:
(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)
and get the same output
if the full URL is enough, you can remove the parentheses of the second group:
(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*
PS - I'm assuming that your examples were meant to be:
<ul><li>Link: https://www.link.com</li></ul>
<ul><li>Link: https://www.link.com <br/></li></ul>
i.e. with https, http or ftp which makes the second case work with your original regex
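To see why the two list items behave differently, it may help to run the question's pattern next to a variant that keeps only the "not inside a tag" half of the lookahead. In the sketch below, `strict` is the question's pattern (assuming its character classes contained `@`), `relaxed` is a hypothetical variant, and `linkify` is a small helper written for this illustration. The `[^<>]*<\/` alternative rejects any URL that is followed directly by a closing tag, which is exactly the first list item.

```javascript
// Pattern from the question: rejects URLs inside a tag AND URLs that are
// followed by text up to a closing tag such as </li> or </a>.
const strict = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;

// Hypothetical variant: only rejects URLs that sit inside a tag.
const relaxed = /\b(?![^<]*>)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;

// Wrap every match of the given pattern in an anchor element.
function linkify(text, pattern) {
  return text.replace(pattern, '<a target="_blank" rel="noopener" href="$&">$&</a>');
}

const html = "<ul><li>Link: https://www.link.com</li></ul>";
const strictResult = linkify(html, strict);   // unchanged: the URL is followed by </li>
const relaxedResult = linkify(html, relaxed); // URL wrapped in <a ...>

console.log(strictResult);
console.log(relaxedResult);
```

The trade-off is that the relaxed variant will also wrap a URL that already appears as the text of an existing link, so it is only safe on input that has not been linkified before.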
We want to check if a URL matches mail.google.com or mail.yahoo.com (also a subdomain of them is accepted) but not a URL which contains this string after a question mark. We also want the strings "mail.google.com" and "mail.yahoo.com" to come before the third slash of the URL, for example https://mail.google.com/ is accepted, https://www.facebook.com/mail.google.com/ is not accepted, and https://www.facebook.com/?mail=https://mail.google.com/ is also not accepted. https://mail.google.com.au/ is also not accepted. Is it possible to do it with regular expressions?
var possibleURLs = /^[^\?]*(mail\.google\.com|mail\.yahoo\.com)\//gi;
var url;
// assign a value to var url.
if (url.match(possibleURLs) !== null) {
// Do something...
}
Currently this will match both https://mail.google.com/ and https://www.facebook.com/mail.google.com/ , but we don't want to match https://www.facebook.com/mail.google.com/.
Edit: I want to match any protocol (any string which doesn't contain "?" and "/") followed by a slash "/" twice (the string and the slash can both be twice), then any string which doesn't contain "?" and "/" (if it's not empty, it must end with a dot "."), and then (mail\.google\.com|mail\.yahoo\.com)\/. Case insensitive.
Not being funny - but why must it be a regular expression?
Is there a reason why you couldn't simplify the process using URL (or webkitURL in Chrome and Safari)? The URL constructor simply takes a string and then exposes properties for each part of the URL. Whether it supports all the host types that you want to support, I don't know.
Granted, you might still need a regex after that (although really you'd just be checking that the hostname ends with either yahoo.com or google.com), but you would just be running it against the hostname of the URL object rather than the whole URI.
The API is not ubiquitous, but seems reasonably well supported and, anyway, if this is client-side validation then I hope you're checking it on the server, too, because sidestepping javascript validation is easy.
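A minimal sketch of that approach, assuming the standard URL constructor is available (modern browsers and Node.js). The names `ALLOWED_HOSTS` and `isMailUrl` are made up for this example; the rule it applies is "the hostname is one of the two mail hosts, or ends with a dot followed by one of them".

```javascript
// Hosts accepted by the question; subdomains of them are accepted too.
const ALLOWED_HOSTS = ["mail.google.com", "mail.yahoo.com"];

// Hypothetical helper: parse the string and compare only its hostname.
function isMailUrl(input) {
  let hostname;
  try {
    hostname = new URL(input).hostname.toLowerCase();
  } catch (err) {
    return false; // not a parseable absolute URL
  }
  return ALLOWED_HOSTS.some(
    (host) => hostname === host || hostname.endsWith("." + host)
  );
}

console.log(isMailUrl("https://mail.google.com/"));                  // true
console.log(isMailUrl("https://www.facebook.com/mail.google.com/")); // false
console.log(isMailUrl("https://mail.google.com.au/"));               // false
```

Because only the hostname is inspected, anything in the path or after the question mark cannot cause a false positive.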
How about
^[a-z]+:\/\/([^.\/]+\.)*mail\.(google|yahoo)\.com\/
^ Anchors the regex at the start of the string
[a-z]+ Matches the protocol. If you want a specific set of protocols, then (https?|ftp) may do the work
([^.\/]+\.)* matches the subdomain part
^([-a-z]+://|^cid:|^//)([^/\?]+\.)?mail\.(google|yahoo)\.com/
Should do the trick
The first ^ means "match beginning of line", the second negates the allowed characters, thus making a slash / not allowed.
Nb. You still have to escape the slashes in a regex literal, or build the regex from a string with new RegExp(string), in which case every backslash must be doubled:
new RegExp('^([-a-z]+://|^cid:|^//)([^/?]+\\.)?mail\\.(google|yahoo)\\.com/')
OK, I found that it works with:
var possibleURLs = /^([^\/\?]*\/){2}([^\.\/\?]+\.)*(mail\.google\.com|mail\.yahoo\.com)\//gi;
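The pattern above can be checked against the URLs from the question. This sketch drops the g flag, which is an assumption on my part: with g, repeated `test()` calls on the same regex object depend on `lastIndex` and can give surprising results.

```javascript
// Same pattern as above, case-insensitive, without the g flag.
const possibleURLs = /^([^\/\?]*\/){2}([^\.\/\?]+\.)*(mail\.google\.com|mail\.yahoo\.com)\//i;

const samples = [
  "https://mail.google.com/",
  "https://www.facebook.com/mail.google.com/",
  "https://www.facebook.com/?mail=https://mail.google.com/",
  "https://mail.google.com.au/",
];

// One boolean per sample, in the same order.
const results = samples.map((url) => possibleURLs.test(url));

console.log(results); // [ true, false, false, false ]
```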
I'm trying to find a simple regexp for URL validation, but I'm not very good with regexes.
Currently I have such regexp: (/^https?:\/\/\w/).test(url)
So it's allowing to validate urls as http://localhost:8080 etc.
What I want to do is NOT to validate urls if they have some long special characters at the end like: http://dodo....... or http://dododo&&&&&
Could you help me?
How about this?
/^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?(\/[.\w]*)*$/
Will match: http://domain.com:port/path or just http://domain or http://domain:port
/^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?$/
match URLs without path
Some explanations of regex blocks:
Domain: \w+(\.\w+)* to match text with dots: localhost or www.yahoo.com (continues until the port or path section begins)
Port: (:[0-9]+)? to match or to not match a number starting with a colon: :8000 (and there can be only one)
Path: \/?(\/[.\w]*)* to match any alphanums with slashes and dots: /user/images/0001.jpg (until the end of the line)
(The path is a very interesting part. As written, it allows lone or adjacent dots, i.e. expressions such as /. or /./ or /.../ are possible. If you'd like dots in the path to behave as in the domain section, without leading, trailing or adjacent dots, then use \/?(\/\w+(\.\w+)*)* instead, similar to the domain part.)
* UPDATED *
Also, if you would like to have (it is valid) - characters in your URL (or any other), you should simply expand character class for "URL text matching", i.e. \w+ should become [\-\w]+ and so on.
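The behaviour described above can be checked with a few inputs from the question. In this sketch `withPath` is the first pattern from this answer, while `extended` is an assumed variant that applies the hyphen note and also accepts https; it is an illustration, not part of the original answer.

```javascript
// Pattern from the answer: http only, word characters and dots.
const withPath = /^http:\/\/\w+(\.\w+)*(:[0-9]+)?\/?(\/[.\w]*)*$/;

// Assumed variant: also allows https and "-" in host and path segments.
const extended = /^https?:\/\/[\-\w]+(\.[\-\w]+)*(:[0-9]+)?\/?(\/[.\-\w]*)*$/;

const checks = {
  localhost: withPath.test("http://localhost:8080"),                          // true
  fullPath: withPath.test("http://domain.com:8000/user/images/0001.jpg"),     // true
  trailingDots: withPath.test("http://dodo......."),                          // false
  trailingAmpersands: withPath.test("http://dododo&&&&&"),                    // false
  hyphenStrict: withPath.test("https://my-site.example.com/a-b"),             // false
  hyphenExtended: extended.test("https://my-site.example.com/a-b"),           // true
};

console.log(checks);
```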
If you want to match ABCD, then you may leave out the start part.
For example, to match http://localhost:8080
just write
/(localhost)/
If you want to match a specific thing, focus on the term you want to search for, not on the start and end of the sentence.
Regular expression is for searching the terms, until we have a rigid rule for the same. :)
i hope this will do..
It depends on how complex you need the Regex to be. A simple way would be to just accept words (and the port/domain):
^https?:\/\/\w+(:[0-9]*)?(\.\w+)?$
Remember you need to use the + character to match one or more characters.
Of course, there are far better & more complicated solutions out there.
^https?:\/\/localhost:[0-9]{1,5}\/([-a-zA-Z0-9()@:%_\+.~#?&\/=]*)
match:
https://localhost:65535/file-upload-svc/files/app?query=abc#next
not match:
https://localhost:775535/file-upload-svc/files/app?query=abc#next
explanation
it can only be used for localhost
it limits the port to at most five digits, which rejects obviously invalid ports; since a valid port must also be no greater than 65535, you will probably need additional logic to check the value itself
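The additional logic for the port could look roughly like this. The function name `isLocalhostUrl` is invented for the example, and treating port 0 as invalid is an assumption: the regex captures the digits and the numeric comparison does the range check.

```javascript
// Hypothetical helper: accept http(s)://localhost:<port> with an optional
// path, where <port> is a number from 1 to 65535.
function isLocalhostUrl(url) {
  const match = /^https?:\/\/localhost:([0-9]{1,5})(?:\/|$)/.exec(url);
  if (match === null) return false;
  const port = Number(match[1]);
  return port >= 1 && port <= 65535;
}

console.log(isLocalhostUrl("https://localhost:65535/file-upload-svc/files/app?query=abc#next")); // true
console.log(isLocalhostUrl("https://localhost:775535/file-upload-svc/files/app?query=abc#next")); // false
console.log(isLocalhostUrl("https://localhost:70000/")); // false
```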
You can use this. This will allow localhost and live domain as well.
^https?:\/\/\w+(\.\w+)*(:[0-9]+)?(\/.*)?$
I'm pretty late to the party, but nowadays you should consider validating your URL with the URL class. Avoid the headache of regex and rely on the standard.
let isValid;
try {
new URL(endpoint); // Will throw if URL is invalid
isValid = true;
} catch (err) {
isValid = false;
}
^https?:\/\/localhost:[0-9]{3,5}(\/[a-zA-Z0-9]{1,6})?$
Will match the following cases :
http://localhost:3100/api
http://localhost:3100/1
http://localhost:3100/AP
http://localhost:310
Will NOT match the following cases :
http://localhost:3100/
http://localhost:
http://localhost
http://localhost:31
Could you update my regex to meet the following requirements:
It must match URLs without www and http
If the URL contains a query string, match it too
The URL ends at a space, a comma (,) or the end of the string
Match only top-level domains from a list
var srg = new RegExp(/(^|[\s])([\w\.]+\.(com|cc|net))/ig);
For example, it must match:
jsfiddle.net
jmitty.cc:8080/test3s.html
www.ru,sample.com,google.com/?l=en
very.secure.dotster.com/i?ewe
As a result I need:
<a>jsfiddle.net</a>
<a>jmitty.cc:8080/test3s.html</a>
<a>www.ru</a>,<a>sample.com</a>,<a>google.com/?l=en</a>
<a>very.secure.dotster.com/i?ewe</a>
Fiddle http://jsfiddle.net/tYnU7/
Well, I guess you can change some little things in your regex:
([\w\.]+\.(?:com|cc|net|ru)[^,\s]*)
Replace by:
$1
I'm not sure why you were having (^|[\s]) at the beginning and it didn't seem useful to me, so I removed it. If you had your reasons, you can put it back.
I added ru to the extensions to match www.ru as you required and added [^,\s]* to continue matching until a comma or space is encountered.
Your updated fiddle is here.
This is a very complex problem with no perfect answer, but if you don't need perfection, check out Jeff Roberson's Linkify page and this post by Jan Goyvaerts discussing Jeff Atwood's blog post, "The Problem with URLs".
/
(?:^|\b) # match word boundary or beginning of line
( # begin capture
[\w.]+ # domain part
\.[a-z]{2,3} # domain suffix
(?:\:[0-9]{1,5})? # optional port
(?:\/.*)? # path details
) # end capture
(?:[,\s]|$) # comma, space or eol
/ig
Some details:
[\w.]+ may need more work depending on what you classify as acceptable domain characters (I've heard they're accepting unicode characters now?)
You can change [a-z]{2,3} into a list of acceptable top-level domains (e.g. (?:com|org|net|info|edu)). In your example you only list com, cc & net, but your result shows www.ru as captured.
(?:\/.*)? is greedy by default, but should be okay since you want query information.
And the fiddle
Oh, and if you want your links clickable (because those without a protocol don't work):
var r = t.replace(srg, function(match,b,m,e){
return b + '' + m + '' + e;
});
Which is demonstrated here
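For readers without access to the fiddles, here is a small self-contained sketch of the same idea. It reuses the pattern from the first answer (TLD list com, cc, net plus ru) and, as an assumption, prefixes http:// in the href so that protocol-less links are clickable. `linkifyText` is a name chosen for this example.

```javascript
// Host with a TLD from the list, followed by everything up to a comma,
// whitespace or the end of the string.
const urlPattern = /([\w.]+\.(?:com|cc|net|ru)[^,\s]*)/gi;

// Wrap each match in an anchor. Assumption: bare hosts get an http:// prefix.
function linkifyText(text) {
  return text.replace(urlPattern, '<a href="http://$1">$1</a>');
}

const linked = linkifyText("www.ru,sample.com,google.com/?l=en");
console.log(linked);
```

Note that the TLD alternatives are not followed by a boundary check, so a host such as example.community would be matched as example.com followed by the rest; tightening that is left out of this sketch.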
I've looked all over and have yet to find a single solution to address my need for a regular expression pattern that will match a generic URL. I need to support multiple protocols (with verification), localhost and/or IP addressing, ports and query strings. Some examples:
http://localhost/mysite
https://localhost:55000
ftp://192.1.1.1
telnet://somesite/page.htm?a=1&b=2
Ideally, I'd like the pattern to also support extracting the various elements (protocol, host, port, query string, etc.) but this is not a requirement.
(Also, for the purposes of myself and future readers, if you could explain the pattern, it would be helpful.)
Appendix B of RFC 3986/STD 0066 (Uniform Resource Identifier (URI): Generic Syntax) provides the regular expression you need:
Appendix B. Parsing a URI Reference with a Regular Expression
As the "first-match-wins" algorithm is identical to the "greedy"
disambiguation method used by POSIX regular expressions, it is
natural and commonplace to use a regular expression for parsing the
potential five components of a URI reference.
The following line is the regular expression for breaking-down a
well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
<n> as $<n>. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
where <undefined> indicates that the component is not present, as is
the case for the query component in the above example. Therefore, we
can determine the value of the five components as
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
Going in the opposite direction, we can recreate a URI reference from
its components by using the algorithm of Section 5.3.
As for validating a URI against a particular scheme, you'll need to look at the RFC(s) describing the scheme(s) in which you are interested to get the detail required to validate that a URI is valid for the scheme it purports to be. The URI scheme registry is located at http://www.iana.org/assignments/uri-schemes.html.
And even then, you're doomed to some sort of failure. Consider the file: scheme. You can't validate that it represents a valid path in the file system of the authority (unless you are the authority). The best that you can do is validate that it represents something that looks like a valid path. And even then, a windows file: url like file:///C:/foo/bar/baz/bat.txt is (would be) invalid for anything but a server running some flavor of Windows. Any server running *nix would likely choke on it (what's a drive letter anyway?).
Nicholas Carey is correct to steer you towards RFC-3986. The regex he points out will match a generic URI, but it will not validate it (and this regex is not good for picking URLs out of "the wild" - it is too loose and matches just about any string including an empty string).
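Both points can be seen by running the Appendix B expression in JavaScript. The sketch below maps the numbered groups to the five components as the RFC text describes; `parseUri` is a name chosen for this example, not an existing API.

```javascript
// RFC 3986 Appendix B expression, with "/" escaped for a regex literal.
const URI_PARTS = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Split a URI reference into its five components. The match never fails,
// because every part of the expression is optional.
function parseUri(text) {
  const m = URI_PARTS.exec(text);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

console.log(parseUri("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
console.log(parseUri("telnet://somesite/page.htm?a=1&b=2"));
console.log(parseUri("")); // still "matches": path is "", everything else undefined
```

Components that are absent come back as undefined, so any validation (allowed schemes, non-empty host, port range) has to happen after the split.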
Regarding the validation requirement, you may want to take a look at an article I wrote on the subject, which takes from Appendix A all the ABNF syntax definitions of all the various components and provides regex equivalents:
Regular Expression URI Validation
Regarding the subject of picking out URLs from the "wild", take a look at Jeff Atwood's "The Problem With URLs" and John Gruber's "An Improved Liberal, Accurate Regex Pattern for Matching URLs" blog posts to get a glimpse of some of the subtle problems which can arise. Also, you may want to take a look at a project I started last year: URL Linkification. This picks out unlinked HTTP and FTP URLs from text which may already have some links.
That said, the following is a PHP function which uses a slightly modified version of the RFC-3986 "Absolute URI" regex to validate HTTP and FTP URL's (with this regex, the named host portion must not be empty). All the various components of the URI are isolated and captured into named groups which allows for easy manipulation and validation of the parts within the program code:
function url_valid($url)
{
if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
^
(?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
(?P<authority>
(?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
(?P<host>
(?P<IP_literal>
\[
(?:
(?P<IPV6address>
(?: (?:[0-9A-Fa-f]{1,4}:){6}
| ::(?:[0-9A-Fa-f]{1,4}:){5}
| (?: [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?:: [0-9A-Fa-f]{1,4}:
| (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
)
(?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
| (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
)
| (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?:: [0-9A-Fa-f]{1,4}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
)
| (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
)
\]
)
| (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
| (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
)
(?::(?P<port>[0-9]*))?
)
(?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
(?:\?(?P<query> (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
(?:\#(?P<fragment> (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
$
/mx', $url, $m)) return FALSE;
switch ($m['scheme'])
{
case 'https':
case 'http':
if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
break;
case 'ftps':
case 'ftp':
break;
default:
return FALSE; // Unrecognised URI scheme. Default to FALSE.
}
// Validate host name conforms to DNS "dot-separated-parts".
if ($m['regname']) // If host regname specified, check for DNS conformance.
{
if (!preg_match('/# HTTP DNS host name.
^ # Anchor to beginning of string.
(?!.{256}) # Overall host length is less than 256 chars.
(?: # Group dot separated host part alternatives.
[0-9A-Za-z]\. # Either a single alphanum followed by dot
| # or... part has more than one char (63 chars max).
[0-9A-Za-z] # Part first char is alphanum (no dash).
[\-0-9A-Za-z]{0,61} # Internal chars are alphanum plus dash.
[0-9A-Za-z] # Part last char is alphanum (no dash).
\. # Each part followed by literal dot.
)* # Zero or more parts before top level domain.
(?: # Explicitly specify top level domains.
com|edu|gov|int|mil|net|org|biz|
info|name|pro|aero|coop|museum|
asia|cat|jobs|mobi|tel|travel|
[A-Za-z]{2}) # Country codes are exactly two alpha chars.
$ # Anchor to end of string.
/ix', $m['host'])) return FALSE;
}
$m['url'] = $url;
for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
return $m; // return TRUE == array of useful named $matches plus the valid $url.
}
The first regex validates the string as an absolute (has a non-empty host portion) generic URI. A second regex is used to validate the (named) host portion (when it is not an IP literal or IPv4 address) with regard to the DNS lookup system (where each dot-separated subdomain is 63 chars or less, consisting of digits, letters and dashes, with an overall length of at most 255 chars).
Note that the structure of this function allows easy expansion to include other schemes.
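Since most of the other snippets on this page are JavaScript, here is a rough JavaScript sketch of the second check only (the DNS "dot-separated parts" rule). It is not a port of the PHP function: `isDnsHostName` is a hypothetical helper, and unlike the PHP regex it does not restrict the top-level domain to a fixed list.

```javascript
// One DNS label: 1 to 63 chars, letters, digits and dashes,
// starting and ending with a letter or digit.
const DNS_LABEL = /^[0-9a-z](?:[0-9a-z-]{0,61}[0-9a-z])?$/i;

// Hypothetical helper: every dot-separated part must be a valid label and
// the whole name must be at most 255 characters long.
function isDnsHostName(host) {
  if (host.length === 0 || host.length > 255) return false;
  return host.split(".").every((part) => DNS_LABEL.test(part));
}

console.log(isDnsHostName("www.example.com"));  // true
console.log(isDnsHostName("-bad.example.com")); // false
console.log(isDnsHostName("example..com"));     // false
```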
Would this be in Perl by any chance?
Try:
use strict;
my $url = "http://localhost/test";
if ($url =~ m/^(.+):\/\/(.+)\/(.+)/) {
my $protocol = $1;
my $domain = $2;
my $dir = $3;
print "$protocol $domain $dir \n";
}