URL Pattern Matching issue, .+ matches all after - javascript

I am matching stored URLs against the current URL and have run into an issue: the regex works fine when matched against the URL itself, but all sub-directories match too, when I want a direct match only.
Say the user stores www.facebook.com. This should match both http://www.facebook.com and https://www.facebook.com, and it does.
The problem is that it also matches sub-directories such as https://www.facebook.com/events/upcoming.
The regex for example:
/.+:\/\/www\.facebook\.com/
Matches the following:
https://www.facebook.com/events/upcoming
When it should just be matching
http://www.facebook.com/
https://www.facebook.com/
How can I fix this seemingly broken regex?

If you're being really specific about what you want to match, why not reflect that in your RegExp?
/^https?:\/\/(?:(?:www|m)\.)?facebook\.com\/?$/
http or https
www., m. or no subdomain
facebook.com
Demo
Edit: included an optional trailing slash.
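As a rough sketch of how a stored host could be turned into such an anchored pattern (the helper names escapeRegExp and exactHostPattern are hypothetical and not part of the answer above):

```javascript
// Hypothetical helper: escape regex metacharacters in a stored host name,
// so that "." in "www.facebook.com" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a pattern anchored at both ends, allowing
// http or https and an optional trailing slash.
function exactHostPattern(storedHost) {
  return new RegExp("^https?:\\/\\/" + escapeRegExp(storedHost) + "\\/?$");
}

const pattern = exactHostPattern("www.facebook.com");
const results = [
  "http://www.facebook.com",
  "https://www.facebook.com/",
  "https://www.facebook.com/events/upcoming",
].map((url) => pattern.test(url));

console.log(results); // the first two should match, the sub-directory should not
```

Because of the ^ and $ anchors, anything after the optional trailing slash makes the test fail.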

Put an end marker $, like:
/.+:\/\/www\.facebook\.com\/$/
but really should have a start marker ^ too, like:
/^https?:\/\/www\.facebook\.com\/$/
Also, if you're matching the current domain, you may as well match against location.host rather than location.href.
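A minimal sketch of that idea, using the standard URL class so it also runs outside a browser (in a page, location.host and location.pathname carry the same information). The function name isStoredSiteRoot is hypothetical:

```javascript
// Hypothetical helper: true only when currentUrl is the root page of storedHost.
// Query strings and fragments are ignored in this sketch.
function isStoredSiteRoot(currentUrl, storedHost) {
  const parsed = new URL(currentUrl);
  return parsed.host === storedHost && parsed.pathname === "/";
}

console.log(isStoredSiteRoot("https://www.facebook.com/", "www.facebook.com"));
console.log(isStoredSiteRoot("https://www.facebook.com/events/upcoming", "www.facebook.com"));
```

Comparing the parsed host avoids regex escaping entirely, at the cost of requiring a full, parseable URL.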

Try adding a $ at the end of your regex. It's the symbol for end of string.

Related

Regex for capturing all the urls in a paragraph except for a specific domain

I need to capture all the URLs in a paragraph except those from a specific domain or subdomain. For example, in the paragraph below I need to capture all the URLs apart from those on example.com.
"This is a paragraph name.url.com it contains random urls name-dev.url.com name-qa.url.com www.example.com test.example.com http://TestCaSeSensetivEUrl.com http://www.test.com https://www.example.com test.com"
Urls I need to capture
name.url.com
name-dev.url.com
name-qa.url.com
http://TestCaSeSensetivEUrl.com
http://www.test.com
test.com
URLs I don't need to capture:
www.example.com
test.example.com
https://www.example.com
I have tried the below regex using negative look behind method, but it's not working as I need.
/(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?([a-z0-9]+(?<!example)[\-\.]{1}[a-z0-9+]+(?<!example)\.[a-z]{2,5})/gi
This should be sufficient for your use case:
/(?<!\S)(?:https?:\/\/)?(?:(?:(?!example)\w+[.-])+[a-z]{2,11})(?!\S)/gi
See https://regex101.com/r/NdOxKt/1 for a demonstration of the regex at work. Below is a rough explanation of what the regex is doing:
The leading (?<!\S) and trailing (?!\S) essentially split the string into segments at whitespace characters, including spaces and newlines
The ?: syntax makes each set of parentheses it appears in a non-capturing group, so the engine does not have to store those submatches
(?:https?:\/\/)? optionally matches http:// or https:// at the start of a URL, so the characters : and / are not accepted anywhere else in it
(?:(?!example)\w+[.-])+ looks for one or more words that do not start with example, each followed by either a hyphen or a period
[a-z]{2,11} matches the final domain extension, e.g. com, org, or enterprises
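To try the pattern outside regex101, here is a short sketch that runs it against the paragraph from the question. It assumes an engine with lookbehind support (for example a recent Node.js or Chromium-based browser); the variable names are illustrative only:

```javascript
const paragraph =
  "This is a paragraph name.url.com it contains random urls " +
  "name-dev.url.com name-qa.url.com www.example.com test.example.com " +
  "http://TestCaSeSensetivEUrl.com http://www.test.com " +
  "https://www.example.com test.com";

// Same pattern as in the answer above.
const urlPattern =
  /(?<!\S)(?:https?:\/\/)?(?:(?:(?!example)\w+[.-])+[a-z]{2,11})(?!\S)/gi;

// match() with the g flag returns all matches, or null when there are none.
const captured = paragraph.match(urlPattern) || [];
console.log(captured);
```

The three example.com entries are skipped because the negative lookahead rejects any label that starts with example.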
This could be a solution:
^(https?:\/\/)?(?!(?:www\.)?google\.*)([\da-zA-Z.-]+)\.([a-zA-Z\.]{2,6})([\/\w .-]*)*\/?$
For example, here google is excluded from being captured.

Regex to retrieve domain.extension from a url

I need to come up with a regex to extract only domainname.extension from a URL. Right now I have a regex that strips "www." from the host name, but I need to update it to remove any subdomain strings from the hostname:
This strips off www.:
window.location.hostname.replace(/^www\./i, '')
But I need to detect any subdomain info on abc.def.test.com or ghi.test.com, replace it with an empty string, and always return "test.com".
You could achieve the same result with the replace method, but match is somewhat more suitable:
console.log(
window.location.hostname.match(/[^\s.]+\.[^\s.]+$/)[0]
);
[^\s.]+ Match non-whitespace characters except dot
$ Assert end of input string
Doing the same with the replace method, as requested in the comments:
console.log(
window.location.hostname.replace(/[^\s.]+\.(?=[^\s.]+\.)/g, '')
);
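Since window.location is only available in a browser, the match approach can be tried on plain strings with a small wrapper. This is a sketch; the function name lastTwoLabels is hypothetical, and the null check is an addition for hostnames without a dot:

```javascript
// Hypothetical helper: return the last two dot-separated labels of a hostname,
// or the hostname unchanged when it has fewer than two labels.
function lastTwoLabels(hostname) {
  const match = hostname.match(/[^\s.]+\.[^\s.]+$/);
  return match ? match[0] : hostname;
}

console.log(lastTwoLabels("abc.def.test.com"));
console.log(lastTwoLabels("ghi.test.com"));
console.log(lastTwoLabels("localhost"));
```

Note that this treats the last two labels as the domain, so a host such as www.test.co.uk would come back as co.uk.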
Well, that depends mainly on what you define as a domain and how you define a subdomain. I'll use the most general approach of treating the top domain as the last two components (as you do with test.com). In that case you can proceed as:
([a-zA-Z0-9-]+\.)*([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+) ==> $2
As you see, the regexp is divided into two groups, and we only keep the second in the output, which is the last two domain components. The [a-zA-Z0-9-] subexpression deserves some explanation, as it appears three times in the regexp: it is the set of characters allowed in a domain component, including the hyphen. See [1] for a working demo.
If you want to cope with the co.uk example shown in the last demo, matching www.test.co.uk as test.co.uk, then you have to anchor your regexp to the end (with $ or, if you are in the middle of a URL, with the next : or / that can follow the domain name), so that prefixes are not detected as valid domains, as shown in [2]:
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)$ ==> $3
or [3]:
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)(?=[:/]|$) ==> $3
Of course, you have to list all the countries that follow the convention of using top domains as prefixes under their structure. Be careful here, as not all countries follow this approach. I've used the non-greedy *? operator because otherwise the group matching doesn't come out as desired (the first group gets greedy, and the match is again co.uk instead of test.co.uk).
But since you ultimately have to anchor your regexp (mainly because domain names can also appear in the query string or in the path part of the URL), the best option is to anchor it to the whole URL.
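The end-anchored variant can be sketched in JavaScript as follows, with [a-zA-Z0-9-] used in every group. The function name registrableDomain is hypothetical, and the country list is just the four codes from the answer, not a complete list:

```javascript
// End-anchored pattern: group 3 holds the domain, optionally with one of the
// listed country codes appended (e.g. test.co.uk).
const domainPattern =
  /(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)$/;

// Hypothetical helper: return group 3, or null when the hostname does not match.
function registrableDomain(hostname) {
  const match = hostname.match(domainPattern);
  return match ? match[3] : null;
}

console.log(registrableDomain("abc.def.test.com"));
console.log(registrableDomain("www.test.co.uk"));
```

The lazy *? on the first group is what lets group 3 claim test.co.uk before the subdomain prefix grows any further.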

regex - how to select all double slashes except followed by colon

I need some help with RegEx, it may be a basic stuff but I cannot find a correct way how to do it. Please help!
So, here's my question:
I have a list of URLs that are invalid because of a double slash, like this:
http://website.com//wp-content/folder/file.jpg. To fix it I need to remove all double slashes except the first one, which follows the colon (http://), so the fixed URL is: http://website.com/wp-content/folder/file.jpg.
I need to do it with RegExp.
Variant 1
url.replace(/\/\//g,'/'); // => http:/website.com/wp-content/folder/file.jpg
will replace all double slashes (//), including the first one, which is not correct.
example here:
https://regex101.com/r/NhCVMz/2
You may use
url = url.replace(/(https?:\/\/)|(\/){2,}/g, "$1$2")
See the regex demo
Note: a ^ anchor at the beginning of the pattern might be used if the strings are entire URLs.
This pattern matches and captures http:// or https:// and restores it in the result via the $1 backreference. Every other run of two or more / is matched by (\/){2,}, and only one slash is put back, because the capturing group does not include the quantifier.
Find (^|[^:])/{2,}
Replace $1/
delimited: /(^|[^:])\/{2,}/
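Both suggestions can be compared side by side with a short sketch. The function names are hypothetical, and the g flag is added to the second pattern so that every run of slashes is handled:

```javascript
// First suggestion: keep the scheme via $1, collapse other slash runs via $2.
function collapseWithSchemeGroup(url) {
  return url.replace(/(https?:\/\/)|(\/){2,}/g, "$1$2");
}

// Second suggestion: only collapse slash runs not preceded by a colon.
function collapseUnlessAfterColon(url) {
  return url.replace(/(^|[^:])\/{2,}/g, "$1/");
}

const broken = "http://website.com//wp-content/folder/file.jpg";
console.log(collapseWithSchemeGroup(broken));
console.log(collapseUnlessAfterColon(broken));
```

For the sample URL, both calls should print http://website.com/wp-content/folder/file.jpg.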

Match between occurrence of last slash in URL and hash

I'm trying to match between two characters, specifically the last part of a url but only between the last slash (/) and a hash (#).
http://example.com/path/to/thing#name
The match should return
thing
I'm partly there now. I can either get the whole string of the last part of the url, or everything before hash (#) but not both.
/([^\/]*?.*(?=#))/
Please see my regex101 for testing.
You are close but still overthinking it. This suffices:
/[^\/#]+(?=#|$)/
This matches a sequence of characters that are neither / nor #, which must be followed by # or the end of the string. You don't need to add parentheses to capture it as a separate group; the match itself is correct. The final lookahead (?=#|$) makes it stop at either an intervening # or the end of the URL.
See regex101.
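A quick way to check the pattern on the URL from the question, wrapped in a hypothetical helper named lastSegment:

```javascript
// Hypothetical helper: return the path segment right before "#" (or before
// the end of the string when there is no fragment), or null if none is found.
function lastSegment(url) {
  const match = url.match(/[^\/#]+(?=#|$)/);
  return match ? match[0] : null;
}

console.log(lastSegment("http://example.com/path/to/thing#name"));
console.log(lastSegment("http://example.com/path/to/thing"));
```

Both calls should print thing: in the first the lookahead stops at #, in the second at the end of the string.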

RegEx matching for G+ Profile URL

I have been trying to match just the user ID or vanity part of the URI for Google+ accounts. I am using GAS (Google Apps Script), into which I've loaded XRegExp to help match Unicode characters.
So far I have this: ((https?://)?(plus\.)?google\.com/)?(.*/)?([a-zA-Z0-9._]*)($|\?.*) but, as the regex tests (external site) show, it still doesn't match just the right parts.
I've tried using \p{L} inside of [a-zA-Z0-9._] but no luck with that. Also, I end up with an extra forward slash at the end of the profile name when it does match.
UPDATE #1: I am trying to fix some G+ URLs in a spreadsheet copied from a Google Form. The links are not all the same, and the simplest profile link is "https://plus.google.com/" + user ID OR vanity name.
UPDATE #2: So far I have ([+]\w+|[0-9]{21})(?:\/)?(?:\w+)?$ which uses @demrks's simplified version of @guest271314's response. However, there are two problems:
1) Google Vanity URLs can have unicode in them. Example: https://plus.google.com/u/0/+JoseManuelGarcía_ertatto which fails. I have tried to use \p{L} but can't seem to get it right.
2) GAS doesn't seem to like it, even though the regex tests work on this site. =(
UPDATE #3: It seems GAS just hates using \w so I've had to expand it. So I have this so far:
/([+][A-Za-z0-9-_]+|[0-9]{21})(?:\/)?(?:[A-Za-z0-9-_]+)?$/
This matches even with "/about" or "/posts" at end of the URL. However still doesn't match UNICODE. =( I am still working on that.
UPDATE #4: So this seems to work:
/([+][\\w-_\\p{L}]+|[\\d]{21})(?:\/)?(?:[\\w-_]+)?$/
Looks like I needed to use double backslashes inside the character classes. So this seems to work so far. Not sure if there is a shorter way to write this, however.
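One plausible explanation for the double backslashes, assuming the pattern was passed to a regex constructor as a string rather than written as a literal (this is an assumption about the setup, illustrated here with the built-in RegExp rather than XRegExp): inside a string, a backslash has to be escaped before the regex engine ever sees it.

```javascript
// In a regex literal, a single backslash is enough.
const fromLiteral = /\d{21}/;

// In a string, "\\d" is needed to produce the two characters \d.
const fromString = new RegExp("\\d{21}");

// With a single backslash, the string "\d{21}" is just "d{21}",
// which looks for 21 letter d's instead of 21 digits.
const singleBackslash = new RegExp("\d{21}");

const id = "104645458102703754878";
console.log(fromLiteral.test(id), fromString.test(id), singleBackslash.test(id));
```

The same doubling applies to \w and \p{L} when the pattern is built from a string.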
Edit, updated
Try (v4)
document.URL.match(/\++\w+.*|\d+\d|\/+\w+$/).toString()
.replace(/\/+|posts|about|photos|videos|plusones|reviews/g, "")
e.g.,
var urls = [
  "https://plus.google.com/+google/posts",
  "https://plus.google.com/+google/about",
  "https://plus.google.com/+google/photos",
  "https://plus.google.com/+google/videos",
  "https://plus.google.com/+google/plusones",
  "https://plus.google.com/+google/reviews",
  "https://plus.google.com/communities/104645458102703754878",
  "https://plus.google.com/u/0/LONGIDHERE",
  "https://plus.google.com/u/0/+JoseManuelGarcía_ertatto"
];
var _urls = [];
// Extract the "+name", numeric ID, or trailing path segment,
// then strip slashes and the known tab names.
urls.forEach(function(item) {
  _urls.push(item.match(/\++\w+.*|\d+\d|\/+\w+$/).toString()
    .replace(/\/+|posts|about|photos|videos|plusones|reviews/g, ""));
});
// Render each extracted ID in its own <div>.
_urls.forEach(function(id) {
  var _id = document.createElement("div");
  _id.innerHTML = id;
  document.body.appendChild(_id);
});
jsfiddle http://jsfiddle.net/guest271314/o4kvftwh/
This solution should match both IDs and usernames (with unicode characters):
/\+[^/]+|\d{21}/
http://regexr.com/39ds0
Explanation: As an alternative to \w (which doesn't match Unicode characters), I used a negated character class, [^/], which matches anything but "/".
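A small sketch applying this pattern to a few of the URLs mentioned in this thread. The function name extractProfileId is hypothetical:

```javascript
// Hypothetical helper: return "+vanityName" or a 21-digit ID, or null if
// neither is present in the URL.
function extractProfileId(url) {
  const match = url.match(/\+[^/]+|\d{21}/);
  return match ? match[0] : null;
}

console.log(extractProfileId("https://plus.google.com/+google/posts"));
console.log(extractProfileId("https://plus.google.com/communities/104645458102703754878"));
console.log(extractProfileId("https://plus.google.com/u/0/+JoseManuelGarcía_ertatto"));
```

Because [^/] accepts any character except the slash, the accented í in the last URL needs no special handling.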
Here is a possible solution:
(?:\+)(\w+)|(?:\/)(\w+)$
Explanation:
1st alternative: (?:\+)(\w+)
(?:\+) is a non-capturing group: \+ matches the character + literally. (\w+) is a capturing group: \w+ matches any word character [a-zA-Z0-9_] between one and unlimited times.
2nd alternative: (?:\/)(\w+)$
(?:\/) is a non-capturing group: \/ matches the character / literally. (\w+) is a capturing group: \w+ matches any word character [a-zA-Z0-9_] between one and unlimited times. $ asserts the position at the end of the string.
Hope it's useful!
