How to identify all URLs that contain a (domain) substring?

If I am correct, the following code will only match a URL that is exactly as presented.
However, what would it look like if you wanted to identify subdomains as well as URLs that contain various query strings? In other words, any address that contains this domain:
var url = /test.com/;
if (window.location.href.match(url)) {
    alert("match!");
}

If you want this regex to match the literal "test.com", you need to escape the ".", which means any character in regex syntax. If the slashes around it should be matched literally as well, escape both "/" too.
Escaped: \/test\.com\/

No, your pattern will actually match on all strings containing test.com.

The regular expression /test.com/ says to match test[ANY CHARACTER]com anywhere in the string.
It is better to use example.com for example links, so I replaced test with example.
Some example matches could be
http://example.com
http://examplexcom.xyz
http://example!com.xyz
http://example.com?q=123
http://sub.example.com
http://fooexample.com
http://example.com/asdf/123
http://stackoverflow.com/?site=example.com
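A minimal sketch of the point above, comparing the unescaped pattern with an escaped one. The variable names loose, strict, urls and results are only for illustration and are not part of the original question.

```javascript
var loose = /example.com/;   // "." matches any single character
var strict = /example\.com/; // "\." matches a literal dot only

var urls = [
    "http://example.com",
    "http://examplexcom.xyz",
    "http://sub.example.com",
    "http://stackoverflow.com/?site=example.com"
];

// Record what each pattern says about each URL.
var results = urls.map(function (u) {
    return { url: u, loose: loose.test(u), strict: strict.test(u) };
});

console.log(results);
```

Both patterns are unanchored, so they still match the domain anywhere in the string, including inside a query string.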

I think you need to use /g. The g flag enables "global" matching. When using the replace() method, specify this flag to replace all matches, rather than only the first one:
var url = /test.com/g;
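To illustrate what the g flag changes, here is a small sketch with a made-up input string (text is an assumption for the demo). It also shows that a plain yes/no check does not need the flag at all.

```javascript
var text = "test.com and sub.test.com";

// Without g, replace() stops after the first match.
var first = text.replace(/test\.com/, "example.com");

// With g, every match is replaced.
var all = text.replace(/test\.com/g, "example.com");

// For a simple containment check the flag is unnecessary.
var found = /test\.com/.test(text);

console.log(first); // example.com and sub.test.com
console.log(all);   // example.com and sub.example.com
console.log(found); // true
```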

If you want to test whether a URL is valid, this is the one I use. It is fairly complex, because it also takes care of numeric domains and a few other peculiarities:
var urlMatcher = /(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?/;
It takes care of parameters, anchors, etc. Please don't ask me to explain the details.

Related

How do I match URLs with regular expressions?

We want to check if a URL matches mail.google.com or mail.yahoo.com (also a subdomain of them is accepted) but not a URL which contains this string after a question mark. We also want the strings "mail.google.com" and "mail.yahoo.com" to come before the third slash of the URL, for example https://mail.google.com/ is accepted, https://www.facebook.com/mail.google.com/ is not accepted, and https://www.facebook.com/?mail=https://mail.google.com/ is also not accepted. https://mail.google.com.au/ is also not accepted. Is it possible to do it with regular expressions?
var possibleURLs = /^[^\?]*(mail\.google\.com|mail\.yahoo\.com)\//gi;
var url;
// assign a value to var url.
if (url.match(possibleURLs) !== null) {
// Do something...
}
Currently this will match both https://mail.google.com/ and https://www.facebook.com/mail.google.com/ , but we don't want to match https://www.facebook.com/mail.google.com/.
Edit: I want to match any protocol (any string which doesn't contain "?" and "/") followed by a slash "/" twice (the string and the slash can both be twice), then any string which doesn't contain "?" and "/" (if it's not empty, it must end with a dot "."), and then (mail\.google\.com|mail\.yahoo\.com)\/. Case insensitive.

Not being funny, but why must it be a regular expression?
Is there a reason why you couldn't simplify the process using URL (or webkitURL in Chrome and Safari)? The URL constructor simply takes a string, and the resulting object has properties for each part of the URL. Whether it supports all the host types that you want to support, I don't know.
Granted, you might still need a regex after that (although really you'd just be checking that the hostname ends with either yahoo.com or google.com), but you would just be running it against the hostname of the URL object rather than the whole URI.
The API is not ubiquitous, but seems reasonably well supported and, anyway, if this is client-side validation then I hope you're checking it on the server too, because sidestepping JavaScript validation is easy.
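A rough sketch of the URL-based approach described above, assuming an environment where the URL constructor is available globally. The names ALLOWED_HOSTS and isAllowedMailUrl are hypothetical and only used for illustration; the comparison is done with plain string checks instead of a regex.

```javascript
// Hypothetical helper; the list mirrors the two hosts named in the question.
var ALLOWED_HOSTS = ["mail.google.com", "mail.yahoo.com"];

function isAllowedMailUrl(input) {
    var hostname;
    try {
        hostname = new URL(input).hostname.toLowerCase();
    } catch (e) {
        return false; // not a parseable absolute URL
    }
    // Accept the host itself or any subdomain of it.
    return ALLOWED_HOSTS.some(function (host) {
        return hostname === host || hostname.endsWith("." + host);
    });
}

console.log(isAllowedMailUrl("https://mail.google.com/"));                  // true
console.log(isAllowedMailUrl("https://www.facebook.com/mail.google.com/")); // false
console.log(isAllowedMailUrl("https://mail.google.com.au/"));               // false
```

Because only the hostname is compared, anything in the path or query string is ignored automatically.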

How about
^[a-z]+:\/\/([^.\/]+\.)*mail\.(google|yahoo)\.com\/
^ Anchors the regex at the start of the string
[a-z]+ Matches the protocol. If you want a specific set of protocols, then (https?|ftp) may do the work
([^.\/]+\.)* matches the subdomain part

^([-a-z]+://|^cid:|^//)([^/\?]+\.)?mail\.(google|yahoo)\.com/
Should do the trick
The first ^ means "match beginning of line"; the ^ inside the character class [^/\?] negates the allowed characters, thus making a slash / (or a question mark) not allowed there.
N.B. In a regex literal you still have to escape the slashes. Alternatively, pass the pattern as a string to new RegExp(string), where each backslash has to be doubled:
new RegExp('^([-a-z]+://|^cid:|^//)([^/?]+\\.)?mail\\.(google|yahoo)\\.com/')
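One detail worth checking when a pattern is passed to new RegExp as a string: a single backslash before a dot is consumed by the string literal, so the regex engine never sees it. The following sketch uses a shortened, made-up pattern (single and doubled are illustrative names) to show the difference.

```javascript
// '\.' in a string literal is just '.', so these dots match any character.
var single = new RegExp('mail\.google\.com');

// '\\.' reaches the regex engine as '\.', a literal dot.
var doubled = new RegExp('mail\\.google\\.com');

console.log(single.source);                   // mail.google.com
console.log(single.test("mailxgooglexcom"));  // true
console.log(doubled.test("mailxgooglexcom")); // false
console.log(doubled.test("mail.google.com")); // true
```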

OK, I found that it works with:
var possibleURLs = /^([^\/\?]*\/){2}([^\.\/\?]+\.)*(mail\.google\.com|mail\.yahoo\.com)\//gi;
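As a quick check, the pattern can be run against the URLs from the question. This sketch drops the g flag, because test() on a global regex keeps state in lastIndex between calls; that change and the names cases and outcome are assumptions made for the demo.

```javascript
// Same pattern as above, without the g flag.
var possibleURLs = /^([^\/\?]*\/){2}([^\.\/\?]+\.)*(mail\.google\.com|mail\.yahoo\.com)\//i;

var cases = [
    "https://mail.google.com/",
    "https://sub.mail.yahoo.com/inbox",
    "https://www.facebook.com/mail.google.com/",
    "https://www.facebook.com/?mail=https://mail.google.com/",
    "https://mail.google.com.au/"
];

// Map each URL to true (accepted) or false (rejected).
var outcome = cases.map(function (u) {
    return possibleURLs.test(u);
});

console.log(outcome); // [true, true, false, false, false]
```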

Regex for hostname

I have the following code (currHost is hostname):
if (currHost.match(/(alpha|beta|test|dev|load|local)\./))
but I need to add additional conditions such as
file:
.od*. (* is wildcard)
dev-wa.
.sq*. (* is wildcard)
.hbox.
Example MATCHING URLS:
file://c:blahblahblah
www.sqc.mydomain.com
www.sqa.mydomain.com
www.odd.mydomain.com
www.odp.mydomain.com
www.hbox.mydomain.com
dev-wa.mydomain.com
Example NOT MATCHING URLS:
www.sqcmydomain.com
www.sqamydomain.com
www.oddmydomain.com
www.odpmydomain.com
www.hboxx.mydomain.com
dev-waa.mydomain.com
not sure how to approach this?
Thanks!

For your file matches you can simply use
document.location.protocol == 'file:'
and the expression you're looking for is
currHost.match(/(alpha|beta|test|dev|load|local|\.od.*|dev-wa|\.sq.*|\.hbox)\./)
Just remember that this expression will also match www.od.mydomain.com and www.oddmydomain.com, since those are valid matches too. If you don't want this, you need to either specify a full expression (with the .com part) or specify the number of characters after the od/sq part. For example
currHost.match(/(alpha|beta|test|dev|load|local|\.od.{1}|dev-wa|\.sq.{1}|\.hbox)\./)
for a one-letter match.
If you want to specifically match those strings only at the beginning of the domain, with or without www., you can use
currHost.match(/^(www\.)?(alpha|beta|test|dev|load|local|od.+|dev-wa|sq.+|hbox)\./)
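Putting the pieces together, here is one possible sketch that is checked against the sample hostnames from the question. It assumes the wildcard after od and sq stands for exactly one letter or digit (as in the samples) and that each keyword must be a complete hostname label; envLabel and isSpecialHost are hypothetical names.

```javascript
// Each keyword must start at the beginning of the hostname or right after a dot,
// and must be followed by a dot. The one-character wildcard is an assumption.
var envLabel = /(^|\.)(alpha|beta|test|dev|load|local|dev-wa|od[a-z0-9]|sq[a-z0-9]|hbox)\./i;

function isSpecialHost(protocol, currHost) {
    // file: URLs have no useful hostname, so check the protocol separately.
    return protocol === "file:" || envLabel.test(currHost);
}

console.log(isSpecialHost("http:", "www.sqc.mydomain.com"));   // true
console.log(isSpecialHost("http:", "www.sqcmydomain.com"));    // false
console.log(isSpecialHost("http:", "dev-waa.mydomain.com"));   // false
console.log(isSpecialHost("file:", ""));                       // true
```

In a browser the two arguments would typically come from document.location.protocol and document.location.hostname.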

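if (currHost.match(/(alpha|beta|test|dev|load|local|file:|dev-wa|od\*|hbox)?\./))
Just add them like above? Or do you mean something else? I've added a backslash in front of the asterisk, which makes it match a literal *, and a question mark after the group. Note that the question mark makes the whole group optional rather than ungreedy, so this version matches any hostname that contains a dot.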

What's wrong with this regular expression to find URLs?

I'm working on a JavaScript to extract a URL from a Google search URL, like so:
http://www.google.com/search?client=safari&rls=en&q=thisisthepartiwanttofind.org&ie=UTF-8&oe=UTF-8
Right now, my code looks like this:
var checkForURL = /[\w\d](.org)/i;
var findTheURL = checkForURL.exec(theURL);
I've run this through a couple of regex testers and it seems to work, but in practice the string I get returned looks like this:
thisisthepartiwanttofind.org,.org
So where's that trailing ,.org coming from?
I know my pattern isn't super robust but please don't suggest better patterns to use. I'd really just like advice on what in particular I did wrong with this one. Thanks!

Remove the parentheses in the regex if you do not process the .org (unlikely since it is a literal). As per @Mark's comment, add a + to match one or more characters of the class [\w\d]. Also, I would escape the dot:
var checkForURL = /[\w\d]+\.org/i;

What you're actually getting is an array of 2 results, the first being the whole match, the second - the group you defined by using parens (.org).
Compare with:
/([\w\d]+)\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl"]
/[\w\d]+\.org/.exec('thisistheurl.org')
→ ["thisistheurl.org"]
/([\w\d]+)(\.org)/.exec('thisistheurl.org')
→ ["thisistheurl.org", "thisistheurl", ".org"]
The result of an .exec of a JS regex is an Array of strings, the first being the whole match and the subsequent representing groups that you defined by using parens. If there are no parens in the regex, there will only be one element in this array - the whole match.

You should escape the dot (.) in the (.org) group, or it matches any character. So your regex would become:
/[\w\d]+(\.org)/
To match the url in your example you can use something like this:
https?://([0-9a-zA-Z_.?=&\-]+/?)+
or something more accurate like this (you should choose the right regex according to your needs):
^https?://([0-9a-zA-Z_\-]+\.)+(com|org|net|WhatEverYouWant)(/[0-9a-zA-Z_\-?=&.]+)$

JavaScript negative lookbehind issue

I've got some JavaScript that looks for Amazon ASINs within an Amazon link, for example
http://www.amazon.com/dp/B00137QS28
For this I use the following regex: /([A-Z0-9]{10})
However, I don't want it to match artist links which look like:
http://www.amazon.com/Artist-Name/e/B000AQ1JZO
So I need to exclude any links where there's a '/e' before the slash and the 10-character alphanumeric code. I thought the following would do that: (?<!/e)([A-Z0-9]{10}), but it turns out negative lookbehinds don't work in JavaScript. Is that right? Is there another way to do this instead?
Any help would be much appreciated!
As a side note, be aware there are plenty of Amazon link formats, which is why I want to blacklist rather than whitelist, eg, these are all the same page:
http://www.amazon.com/gp/product/B00137QS28/
http://www.amazon.com/dp/B00137QS28
http://www.amazon.com/exec/obidos/ASIN/B00137QS28/
http://www.amazon.com/Product-Title-Goes-Here/dp/B00137QS28/

In your case an expression like this would work:
/(?!\/e)..\/([A-Z0-9]{10})/
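A short sketch of how that expression behaves on the URLs from the question. The wrapper function extractAsin is a hypothetical name; it returns the captured 10-character code or null.

```javascript
// The lookahead rejects positions where the two characters before the
// final slash are "/e"; the capture group holds the ASIN itself.
var asinPattern = /(?!\/e)..\/([A-Z0-9]{10})/;

function extractAsin(url) {
    var m = asinPattern.exec(url);
    return m ? m[1] : null;
}

console.log(extractAsin("http://www.amazon.com/dp/B00137QS28"));            // B00137QS28
console.log(extractAsin("http://www.amazon.com/gp/product/B00137QS28/"));   // B00137QS28
console.log(extractAsin("http://www.amazon.com/Artist-Name/e/B000AQ1JZO")); // null
```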

([A-Z0-9]{10}) will work equally well on the reverse of its input, so you can
reverse the string,
use a lookahead (a negative one, to exclude the reversed "e/"),
reverse it back.
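The reversal idea can be sketched roughly as follows. The helper names reverse and findAsinReversed are made up for this example, and the lookahead used here is a negative one, since the goal is to exclude the reversed "/e" prefix.

```javascript
function reverse(s) {
    return s.split("").reverse().join("");
}

function findAsinReversed(url) {
    // In the reversed string the code comes first, then "/", and an artist
    // link continues with "e/". The negative lookahead rejects that case.
    var m = /([A-Z0-9]{10})\/(?!e\/)/.exec(reverse(url));
    return m ? reverse(m[1]) : null;
}

console.log(findAsinReversed("http://www.amazon.com/dp/B00137QS28"));            // B00137QS28
console.log(findAsinReversed("http://www.amazon.com/Artist-Name/e/B000AQ1JZO")); // null
```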

You need to use a lookahead to filter the /e/* ones out. Then trim the leading three characters (the two arbitrary characters and the slash) from each of the matches.
var source; // assign the text you're matching against the RegExp
var matches = source.match(/(?!\/e)..\/[A-Z0-9]{10}/g) || [];
var ids = matches.map(function (match) {
    // drop the two leading characters and the slash
    return match.slice(3);
});

Regex not operator

I need to use a regex to pull a value out of a URL domain that will exclude everything but the host (ex: wordpress) and domain type (ex: .com). The URLs are dynamic and contain 2-3 values for each result (www.example.com or example.org). I am trying to use this expression, but I am only getting back the first letter of every item I am attempting to exclude:
Expression
(?!wordpress|com|www)(\w+|\d+)
String
example.wordpress.com
Results
example
ordpress
om
Desired Result
example
Any assistance would be greatly appreciated

Anchor your regular expression:
\b(?!wordpress|com|www)(\w+|\d+)\b
You might also want to consider whether (\w+|\d+) is really what you mean. \w already includes digits. Also, there are other characters allowed in URLs such as -. Do you need to handle this?
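A small sketch comparing the original expression with the anchored one on the string from the question. The variable names are illustrative, and the g flag is added here only so that match() returns every match.

```javascript
var unanchored = /(?!wordpress|com|www)(\w+|\d+)/g;
var anchored = /\b(?!wordpress|com|www)(\w+|\d+)\b/g;

var input = "example.wordpress.com";

// Without \b the engine simply retries one character later,
// which is where "ordpress" and "om" come from.
var before = input.match(unanchored);

// With \b a match may only start at the beginning of a word.
var after = input.match(anchored);

console.log(before); // ["example", "ordpress", "om"]
console.log(after);  // ["example"]
```

Note that the lookahead rejects any word that merely starts with one of the listed strings, so a label such as "company" would be skipped as well.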

If I were to do something like that, I would take advantage of the format of the URL: anything (dot) 2nd-level-domain (dot) 1st-level-domain:
^(?:(?<level3>.*)[.])?(?<level2>[^.]+)[.](?<level1>[^.]+)$

Or are you only after what comes after the domain part?
(/\/(?!\/).*?\/(.*)/).exec("http://www.google.com/sdfsdf/fdsff")[1]
// returns sdfsdf/fdsff
