I have a tweet URL, for example https://twitter.com/ESPNFC/status/423771542627966976.
On my website this URL is automatically converted into an HTML link (an <a href="..."> element):
https://twitter.com/ESPNFC/status/423771542627966976
I need to match this pattern and also get username and tweet ID.
I did it this way:
/<a href="(http|https):\/\/twitter.com\/([^\/]*)\/status\/([^\/]*)">.+<\/a>/g
Everything works when there is one tweet per line, but if there are two or more tweets on one line, the regex matches all of them at once and groups them as a single match, whereas I need them separated.
Example:
https://twitter.com/ESPNFC/status/423771542627966976
https://twitter.com/ESPNFC/status/423771542627966976
returns 2 matches, but
https://twitter.com/ESPNFC/status/423771542627966976https://twitter.com/ESPNFC/status/423771542627966976
returns 1 match that includes both URLs. How can I separate them, or, for example, treat everything after the closing </a> as a new line?
It's best to avoid parsing HTML with regex when possible. Having said that, the problem with your expression is the greedy .+, which matches as much as possible. Instead you could use .+? to make it ungreedy (match as few characters as possible). Or you could restrict what . matches, for example by using [^\s<>]+ instead of .+.
Also, you probably want to change those [^\/]* to something like [^\/"\s]* so that they cannot run past the closing quote of the href.
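As a rough sketch of the difference, the snippet below runs the original greedy pattern and a lazy variant over one line that contains two links. The sample HTML string is an assumption reconstructed from the pattern in the question, and the variable names are only illustrative.

```javascript
// Hypothetical input: two auto-linked tweets on a single line.
const tweetUrl = "https://twitter.com/ESPNFC/status/423771542627966976";
const line =
  `<a href="${tweetUrl}">${tweetUrl}</a>` +
  `<a href="${tweetUrl}">${tweetUrl}</a>`;

// Greedy .+ runs to the LAST </a> on the line, so both links collapse into one match.
const greedy = /<a href="(http|https):\/\/twitter.com\/([^\/]*)\/status\/([^\/]*)">.+<\/a>/g;

// Lazy .+? stops at the FIRST </a>; the tighter classes keep the groups inside the href.
const lazy = /<a href="(https?):\/\/twitter\.com\/([^\/"\s]*)\/status\/([^\/"\s]*)">.+?<\/a>/g;

const greedyMatches = [...line.matchAll(greedy)];
const lazyMatches = [...line.matchAll(lazy)];

// Group 2 is the username, group 3 is the tweet ID.
const tweets = lazyMatches.map((m) => ({ user: m[2], id: m[3] }));

console.log(greedyMatches.length, lazyMatches.length);
console.log(tweets);
```

With the lazy version each link produces its own match, so the username and ID can be read per tweet.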
Related
I need to come up with a regex that extracts only domainname.extension from a URL. Right now I have a regex that strips "www." from the hostname, but I need to update it to remove any subdomain strings from the hostname:
This strips off www.:
window.location.hostname.replace(/^www\./i, '')
But I need to detect any subdomain info on abc.def.test.com or ghi.test.com to replace it with an empty string and always return "test.com"
You could achieve the same result with the replace method, but match is somehow more suitable:
console.log(
window.location.hostname.match(/[^\s.]+\.[^\s.]+$/)[0]
);
[^\s.]+ Match non-whitespace characters except dot
$ Assert end of input string
Doing so with replace method according to comments:
console.log(
window.location.hostname.replace(/[^\s.]+\.(?=[^\s.]+\.)/g, '')
);
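For trying the two approaches outside a browser, here is a minimal sketch that takes the hostname as a plain string instead of reading window.location. The function names are made up for illustration, and the lookahead is written as [^\s.]+\. so that labels longer than one character are accepted.

```javascript
// Hypothetical helpers: keep only the last two dot-separated labels of a hostname.
function lastTwoLabelsByMatch(hostname) {
  // Two runs of non-dot, non-whitespace characters at the very end of the string.
  const m = hostname.match(/[^\s.]+\.[^\s.]+$/);
  return m ? m[0] : null; // null when the hostname contains no dot at all
}

function lastTwoLabelsByReplace(hostname) {
  // Remove every "label." that is still followed by another "label.".
  return hostname.replace(/[^\s.]+\.(?=[^\s.]+\.)/g, "");
}

for (const host of ["abc.def.test.com", "ghi.test.com", "test.com"]) {
  console.log(host, lastTwoLabelsByMatch(host), lastTwoLabelsByReplace(host));
}
```

Note that the match version needs a null check, because match returns null for a hostname such as localhost, whereas replace simply returns the input unchanged.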
Well, that depends mainly on how you define a domain and how you define a subdomain. I'll use the most general approach and treat the last two components as the top domain (as you do with test.com). In that case you can proceed as follows:
([a-zA-Z0-9-]+\.)*([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+) ==> $2
As you see, the regexp is divided into two groups, and we only take the second one in the output, which is the last two domain components. The [a-zA-Z0-9-] subexpression demands some explanation, as it appears thrice in the regexp: it is the set of characters allowed in a domain component, including the - hyphen. See [1] for a working demo.
If you want to cope with the co.uk example posted in the last demo, so that www.test.co.uk is matched as test.co.uk, then you have to anchor your regexp to the end (with $, or, if you are in the middle of a URL, with the next : or / that can follow the domain name), to prevent prefixes from being detected as valid domains, as shown in [2]:
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)$ ==> $3
or [3]
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)(?=[:/]|$) ==> $3
Of course, you have to put in the list all the countries that follow the convention of using top domains as prefixes under their structure. You have to be careful here, as not all countries follow this approach. I've used the non-greedy *? operator here because otherwise the groups don't match as desired (the first group gets greedy, and the match is again co.uk instead of test.co.uk).
But since you finally have to anchor your regexp anyway (mainly because domain names can also appear in the query string or in the subpath part of the URL), the best approach is to anchor it to the whole URL.
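A small sketch of the lookahead-anchored variant in JavaScript, with the character class written as [a-zA-Z0-9-] throughout. The function name is hypothetical, and the suffix list (uk|au|tw|cn) is only the short list from the answer, not a complete one.

```javascript
// Group 3 holds the domain; the lookahead requires ":", "/" or end of input right after it.
const domainRe =
  /(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)(?=[:\/]|$)/;

// Hypothetical helper: returns the base domain found in a hostname or URL, or null.
function baseDomain(input) {
  const m = input.match(domainRe);
  return m ? m[3] : null;
}

console.log(baseDomain("abc.def.test.com"));
console.log(baseDomain("www.test.co.uk"));
console.log(baseDomain("http://www.test.co.uk:8080/path"));
```

This only looks at the first place in the input where the pattern fits, so for full URLs it relies on the host appearing before any dotted text in the path or query string.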
I need to send a different response for different URLs, but the regexes I am using are not working.
The two regexes in question are
"/v1/users/[^/]+/permissions/domain/HTTP/"
(Eg: http://localhost:4544/v1/users/10feec20-afd9-46a0-a3fc-9b2f18c1d363/permissions/domain/HTTP)
and
"/v1/users/[^/]+/"
(Eg: http://localhost:4544/v1/users/10feec20-afd9-46a0-a3fc-9b2f18c1d363)
I am not able to figure out how to stop the regex from matching beyond "[^/]+/". Both patterns return the same result, as if the regexes treated the two URLs as the same. The pattern matching happens in the mountebank mocking server, using a matches predicate. Any help would be appreciated. Thanks.
The regular expression "/v1/users/[^/]+/" matches both URLs. You are asking it to match '/v1/users/' plus anything except '/', followed by a slash. This occurs in both the longer URL and the shorter one, which is why it matches both.
A couple of options:
You can match the longer url and not the shorter one with:
"/v1/users/[^/]+/.+"
This matches http://localhost:4544/v1/users/10feec20-afd9-46a0-a3fc-9b2f18c1d363/permissions/domain/HTTP, but not http://localhost:4544/v1/users/10feec20-afd9-46a0-a3fc-9b2f18c1d363/
You could also match just the short one by anchoring the end:
"/v1/users/[^/]+/$"
This matches the short URL but not the long one.
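To check the three patterns quickly, here is a sketch using plain JavaScript RegExp objects outside mountebank. It assumes the predicate is applied to the request path and that the short path ends with a trailing slash, as in the example above; the variable names are illustrative.

```javascript
const shortPath = "/v1/users/10feec20-afd9-46a0-a3fc-9b2f18c1d363/";
const longPath = "/v1/users/10feec20-afd9-46a0-a3fc-9b2f18c1d363/permissions/domain/HTTP";

const loose = new RegExp("/v1/users/[^/]+/");      // original pattern: matches both paths
const longOnly = new RegExp("/v1/users/[^/]+/.+"); // needs at least one character after the id's slash
const shortOnly = new RegExp("/v1/users/[^/]+/$"); // the id's slash must be the last character

for (const re of [loose, longOnly, shortOnly]) {
  console.log(re.source, re.test(shortPath), re.test(longPath));
}
```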
Here's what I'm trying to do:
http://i.imgur.com/Xqrf8Wn.png
Simply take a URL with 3 groups: $1 is not so important, $2 and $3 are, but $2 is totally optional, including (obviously) the corresponding slash when present, which is all I am trying to make optional. I get that it can (should?) be in a non-capturing group, but does it HAVE to be? What I've seen so far seems to indicate it does not HAVE to be. If possible, I'd really like someone to explain it so I can try to fully understand it, and not just get one possible working answer handed to me to copy.
Here are the regex strings I've tried; at best they currently match only the second URL, where the optional part is present:
^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)(.*)\/required\/B([A-Z0-9]{9}).*
^https:\/\/([a-z]{0,2})\.?blah\.com(\/)?(.*)\/required\/B([A-Z0-9]{9}).*
^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)?(.*)?\/required\/B([A-Z0-9]{9}).*
Here are the two URLs from which I want to capture groups 2 and 3, with 1 and 2 being optional, but $2 being the problem. I've tried all the patterns above and have yet to get a match when the optional part is NOT present, and I believe it must be due to the slashes?
https://blah.com/required/B7BG0Z0GU1A
https://blah.com/optional/required/B7BG0Z0GU1A
Making a part of the pattern optional is as simple as adding ?, and your last two attempts both work: https://regex101.com/r/RIKvYY/1
Your mistake is that your test is wrong - you are using ^ which matches the beginning of the string. You need to add the /m flag (multiline) to make it match the beginning of each line. This is the reason your patterns never match the second line...
Note that you're allowing two slashes (//required, for example). You can solve it by joining the first slash and the optional part to the same capturing group (of course, as long as you are using .* you can still match multiple slashes):
https:\/\/([a-z]{0,2})\.?blah\.com(?:\/(.*))?\/required\/B([A-Z0-9]{9}).*
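To illustrate the point about the multiline flag, the sketch below runs the combined pattern over both test URLs as one two-line string, once with only the g flag and once with gm. The ^ anchor is added back in front of the pattern so that the effect of /m is visible; the variable names are illustrative.

```javascript
// Both test URLs in a single string, one per line, as in a regex tester.
const text =
  "https://blah.com/required/B7BG0Z0GU1A\n" +
  "https://blah.com/optional/required/B7BG0Z0GU1A";

// Without m, ^ matches only at the very start of the whole string.
const withoutM = /^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/(.*))?\/required\/B([A-Z0-9]{9}).*/g;
// With m, ^ also matches right after each line break.
const withM = /^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/(.*))?\/required\/B([A-Z0-9]{9}).*/gm;

const matchesWithoutM = [...text.matchAll(withoutM)];
const matchesWithM = [...text.matchAll(withM)];

console.log(matchesWithoutM.length, matchesWithM.length);
// Group 2 is undefined when the optional segment is absent.
console.log(matchesWithM.map((m) => [m[2], m[3]]));
```

Because the slash sits inside the optional non-capturing group, the group is skipped as a whole for the first URL and group 2 is simply left unset.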
I've seen various articles which show how to match a URL. But my situation is a bit different from the usual URL matching.
This was one such regex that didn't work for me
/https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9#:%_\+.~#?&//=]*)/
My requirement:
I have strings like these:
userlist.2011.text_mediafire.com,
userlist.2011.text_http://www.mediafire.com",
userlist.2011.text_http://mediafire.com",
userlist.2011.text.www.mediafire.com
Now I want to match mediafire.com along with "http://www." and "www." (if they exist). So the constraint I wish to set is that everything to the left of a TLD (in this case '.com') should be captured, up to any of a list of special characters like '"_- etc.
I wasn't able to get any further than the basic /(.*)\.(com|net|org|info)/, which is clearly wrong.
Use the below regex and get the string you want from group index 1.
(?:http:\/\/)?(?:www\.)?([^'"_.-]*\.(?:com|net|org|info)\b)
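As a quick check, the pattern above can be applied to the four sample strings from the question. This is only a sketch; the array and function names are made up for illustration.

```javascript
const samples = [
  'userlist.2011.text_mediafire.com',
  'userlist.2011.text_http://www.mediafire.com"',
  'userlist.2011.text_http://mediafire.com"',
  'userlist.2011.text.www.mediafire.com',
];

// Group 1 is the domain; an optional http:// and www. prefix is matched but not captured.
const domainPattern = /(?:http:\/\/)?(?:www\.)?([^'"_.-]*\.(?:com|net|org|info)\b)/;

// Hypothetical helper: returns group 1, or null when nothing matches.
function extractDomain(s) {
  const m = s.match(domainPattern);
  return m ? m[1] : null;
}

const extracted = samples.map(extractDomain);
console.log(extracted);
```

If the "http://www." prefix should be part of the result as well, the full match (index 0) can be read instead of group 1.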
You need the '$' to match the end of string. If you care about capturing the entire string before the special character you will also need to match the beginning of the string '^'.
/^(.*)\.(([^\.]+)\.(com|net|org|info))$/
I have been trying to match just the user ID or vanity part of the URI for Google+ accounts. I am using GAS (Google Apps Script), into which I've loaded XRegExp to help match Unicode characters.
So far I have this: ((https?://)?(plus\.)?google\.com/)?(.*/)?([a-zA-Z0-9._]*)($|\?.*) but, as the regex tests (external site) show, it still doesn't match just the right parts.
I've tried using \p{L} inside [a-zA-Z0-9._] but had no luck with that. Also, I end up with an extra forward slash at the end of the profile name when it does match.
UPDATE #1: I am trying to fix some G+ URLs in a spreadsheet copied from a Google Form. The links are not all the same, and the simplest profile link is "https://plus.google.com/" + user ID OR vanity name.
UPDATE #2: So far I have ([+]\w+|[0-9]{21})(?:\/)?(?:\w+)?$ which uses #demrks simplified version of #guest271314's response. However, there are two problems:
1) Google vanity URLs can have Unicode in them. Example: https://plus.google.com/u/0/+JoseManuelGarcía_ertatto which fails. I have tried to use \p{L} but can't seem to get it right.
2) GAS doesn't seem to like it even though the regex tests work on this site. =(
UPDATE #3: It seems GAS just hates using \w so I've had to expand it. So I have this so far:
/([+][A-Za-z0-9-_]+|[0-9]{21})(?:\/)?(?:[A-Za-z0-9-_]+)?$/
This matches even with "/about" or "/posts" at end of the URL. However still doesn't match UNICODE. =( I am still working on that.
UPDATE #4: So this seems to work:
/([+][\\w-_\\p{L}]+|[\\d]{21})(?:\/)?(?:[\\w-_]+)?$/
Looks like I needed to use double backslashes inside the character classes. So this seems to work so far. Not sure if there is a shorter way to write this, however.
Edit, updated
Try (v4)
document.URL.match(/\++\w+.*|\d+\d|\/+\w+$/).toString()
.replace(/\/+|posts|about|photos|videos|plusones|reviews/g, "")
e.g.,
var urls = ["https://plus.google.com/+google/posts"
, "https://plus.google.com/+google/about"
, "https://plus.google.com/+google/photos"
, "https://plus.google.com/+google/videos"
, "https://plus.google.com/+google/plusones"
, "https://plus.google.com/+google/reviews"
, "https://plus.google.com/communities/104645458102703754878"
, "https://plus.google.com/u/0/LONGIDHERE"
, "https://plus.google.com/u/0/+JoseManuelGarcía_ertatto"];
var _urls = [];
urls.forEach(function(item) {
_urls.push(item.match(/\++\w+.*|\d+\d|\/+\w+$/).toString()
.replace(/\/+|posts|about|photos|videos|plusones|reviews/g, ""));
});
_urls.forEach(function(id) {
var _id = document.createElement("div");
_id.innerHTML = id;
document.body.appendChild(_id)
});
jsfiddle http://jsfiddle.net/guest271314/o4kvftwh/
This solution should match both IDs and usernames (with unicode characters):
/\+[^/]+|\d{21}/
http://regexr.com/39ds0
Explanation: As an alternative to \w (which doesn't match Unicode characters) I used a negated character class [^/] (matches anything but "/").
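A short sketch of that pattern applied to a few of the URL shapes mentioned in this thread. The helper name is hypothetical, and the slash inside the class is escaped only to keep the regex literal unambiguous.

```javascript
// "+" followed by everything up to the next "/", or a run of exactly 21 digits.
const profileRe = /\+[^\/]+|\d{21}/;

// Hypothetical helper: returns the vanity name (with its "+") or the numeric ID, else null.
function profileId(url) {
  const m = url.match(profileRe);
  return m ? m[0] : null;
}

console.log(profileId("https://plus.google.com/+google/posts"));
console.log(profileId("https://plus.google.com/communities/104645458102703754878"));
console.log(profileId("https://plus.google.com/u/0/+JoseManuelGarcía_ertatto"));
```

Because the class excludes only "/", trailing segments such as /posts or /about are left out without listing them, and non-ASCII letters in the vanity name are kept.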
The following is a possible solution:
(?:\+)(\w+)|(?:\/)(\w+)$
Explanation:
1st alternative: (?:\+)(\w+)
(?:\+) Non-capturing group: \+ matches the character + literally.
(\w+) Capturing group: \w+ matches any word character [a-zA-Z0-9_], between one and unlimited times.
2nd alternative: (?:\/)(\w+)$
(?:\/) Non-capturing group: \/ matches the character / literally.
(\w+) Capturing group: \w+ matches any word character [a-zA-Z0-9_], between one and unlimited times.
$ asserts the position at the end of the string.
Hope it's useful!