How to match a URL in this string?

I've seen various articles which show how to match a URL. But my situation is a bit different from the usual URL matching.
This is one such regex that didn't work for me:
/https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9#:%_\+.~#?&//=]*)/
My requirement:
I have strings like these:
userlist.2011.text_mediafire.com,
userlist.2011.text_http://www.mediafire.com",
userlist.2011.text_http://mediafire.com",
userlist.2011.text.www.mediafire.com
Now, I want to match mediafire.com along with "http://www." and "www." if they exist. The constraint I want to set is that everything to the left of a TLD (in this case '.com') should be captured up to one of a list of special characters such as ' " _ - etc.
I wasn't able to get any further than the basic /(.*)\.(com|net|org|info)/, which is clearly wrong.

Use the below regex and get the string you want from group index 1.
(?:http:\/\/)?(?:www\.)?([^'"_.-]*\.(?:com|net|org|info)\b)
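As a quick check, the pattern above can be run against the four sample strings from the question. This is only a sketch: the samples array and the extractDomain helper are names made up for illustration, and the stray quote and comma characters in the samples are copied from the question as-is.

```javascript
// Pattern from the answer above; group 1 holds the bare domain.
const URL_RE = /(?:http:\/\/)?(?:www\.)?([^'"_.-]*\.(?:com|net|org|info)\b)/;

// Hypothetical helper: returns the full match and group 1, or null if nothing matched.
function extractDomain(text) {
  const m = URL_RE.exec(text);
  return m ? { full: m[0], domain: m[1] } : null;
}

// Sample strings copied from the question.
const samples = [
  'userlist.2011.text_mediafire.com,',
  'userlist.2011.text_http://www.mediafire.com",',
  'userlist.2011.text_http://mediafire.com",',
  'userlist.2011.text.www.mediafire.com',
];

for (const s of samples) {
  console.log(s, '=>', extractDomain(s));
}
```

The full match (index 0) includes the optional http:// and www. prefixes when they are present, while group 1 is the domain without them. Note that the pattern only looks for http://, not https://.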

You need the '$' to match the end of the string. If you care about capturing the entire string before the special character, you will also need to match the beginning of the string with '^'.
/^(.*)\.(([^\.]+)\.(com|net|org|info))$/
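To see what the anchored pattern actually captures, here is a small sketch. The splitAtDomain helper is a made-up name, and the inputs are the question's samples with the trailing quote and comma removed, which is an assumption about how the strings would be cleaned first.

```javascript
// Anchored pattern from the answer above.
const ANCHORED_RE = /^(.*)\.(([^\.]+)\.(com|net|org|info))$/;

// Hypothetical helper: splits a string into "everything before" and "name.tld".
function splitAtDomain(text) {
  const m = ANCHORED_RE.exec(text);
  return m ? { before: m[1], domain: m[2], name: m[3], tld: m[4] } : null;
}

console.log(splitAtDomain('userlist.2011.text.www.mediafire.com'));
console.log(splitAtDomain('userlist.2011.text_mediafire.com'));
// Because of the $ anchor, trailing characters prevent any match:
console.log(splitAtDomain('userlist.2011.text_mediafire.com,'));
```

Since [^\.]+ only stops at a dot, the second call yields text_mediafire.com as the domain; the underscore is not treated as a separator by this pattern.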

Related

How can I use regex to determine if a string includes one substring but not a different substring?

I'm trying to find a regular expression that will tell me if a string includes one word but not another.
Basically, I need to find a way to return true for the following
base?param1=value1&param2=value2
but I want it to return false if any of the values (or really, any part of the string) contains Debug.
So for example:
collect?e=checkout --> true
collect?e=Debug --> false
I need this as a clean regular expression as I'm just trying to use it in Chrome network filter.
I tried
/(collect).+(?!Debug.)*/ but that doesn't work.
The closest I can get is simply /(?!Debug.)*/, which omits anything with Debug in the string but does not limit the results to strings that contain the word "collect".

I think you can get your negative lookahead to work in the way you would like by moving the ".+" so that it's after the "?!"
In other words, since you're trying to match the word collect when it isn't followed by the word Debug after any (non-zero) number of characters, you can get that with the following expression:
/collect(?!.+Debug).*/
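A minimal sketch of how this expression behaves, using the two examples from the question plus two made-up edge cases. The includesCollectNotDebug name is hypothetical.

```javascript
// Expression from the answer above: "collect" not followed, after at least
// one other character, by "Debug".
const COLLECT_RE = /collect(?!.+Debug).*/;

// Hypothetical wrapper around RegExp.prototype.test.
function includesCollectNotDebug(text) {
  return COLLECT_RE.test(text);
}

console.log(includesCollectNotDebug('collect?e=checkout')); // true
console.log(includesCollectNotDebug('collect?e=Debug'));    // false

// Edge cases (not from the question):
// .+ needs at least one character between "collect" and "Debug" ...
console.log(includesCollectNotDebug('collectDebug'));             // true
// ... and a "Debug" that appears before "collect" is not checked at all.
console.log(includesCollectNotDebug('Debug/collect?e=checkout')); // true
```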

Try:
/(?<!Debug.*)collect(?!.*Debug)/
Basically, it's saying that you should only match collect if it doesn't follow a string containing Debug and if it isn't followed by a string containing Debug.
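Here is a short sketch of that expression in use. It relies on lookbehind, which was added to JavaScript in ES2018, so engines without lookbehind support will reject the pattern with a SyntaxError. The function name and the inputs other than the question's two examples are made up for illustration.

```javascript
// Expression from the answer above: "collect" with no "Debug" anywhere
// before it (lookbehind) and no "Debug" anywhere after it (lookahead).
const COLLECT_NO_DEBUG_RE = /(?<!Debug.*)collect(?!.*Debug)/;

// Hypothetical wrapper around RegExp.prototype.test.
function collectWithoutDebug(text) {
  return COLLECT_NO_DEBUG_RE.test(text);
}

console.log(collectWithoutDebug('collect?e=checkout'));       // true
console.log(collectWithoutDebug('collect?e=Debug'));          // false
console.log(collectWithoutDebug('Debug/collect?e=checkout')); // false
console.log(collectWithoutDebug('collectDebug'));             // false
```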

Using (collect).+(?!Debug.) will not work as intended because you first match collect and then .+ will match any char 1+ times until the end of the string.
Then when you are at the end, this assertion (?!Debug.) will be true because there is nothing at the right as you are at the end.
What you could do is match collect?e= and then assert that what is directly to the right is not Debug.
collect\?e=(?!Debug\b)
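A brief sketch of this narrower pattern. It only inspects the value of the e parameter that directly follows collect?, so a Debug elsewhere in the string is not detected. The isNonDebugCollect name and the last three inputs are made up for illustration.

```javascript
// Pattern from the answer above: "collect?e=" not directly followed by the
// whole word "Debug".
const COLLECT_E_RE = /collect\?e=(?!Debug\b)/;

// Hypothetical wrapper around RegExp.prototype.test.
function isNonDebugCollect(text) {
  return COLLECT_E_RE.test(text);
}

console.log(isNonDebugCollect('collect?e=checkout'));   // true
console.log(isNonDebugCollect('collect?e=Debug'));      // false
console.log(isNonDebugCollect('collect?e=Debug&x=1'));  // false
// \b requires "Debug" to end at a word boundary, so "Debugger" is allowed:
console.log(isNonDebugCollect('collect?e=Debugger'));   // true
// Debug in a later parameter is not checked by this pattern:
console.log(isNonDebugCollect('collect?e=pay&m=Debug')); // true
```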

Regex to retrieve domain.extension from a url

I need to come up with a regex to extract only domainname.extension from a URL. Right now I have a regex that strips out "www." from the host name, but I need to update the regex to remove any subdomain strings from the hostname:
This strips off www.:
window.location.hostname.replace(/^www\./i, '')
But I need to detect any subdomain info on abc.def.test.com or ghi.test.com to replace it with an empty string and always return "test.com"

You could achieve the same result with the replace method, but match is somewhat more suitable:
console.log(
window.location.hostname.match(/[^\s.]+\.[^\s.]+$/)[0]
);
[^\s.]+ Matches one or more characters that are neither whitespace nor a dot
$ Asserts the end of the input string
Doing so with the replace method, as requested in the comments:
console.log(
window.location.hostname.replace(/[^\s.]+\.(?=[^\s.]+\.)/g, '')
);
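The two approaches can be wrapped in functions that take the hostname as a parameter, which makes them testable outside a browser. This is a sketch under a few assumptions: the function names are made up, the match version falls back to the input when there is no dot (calling [0] on a null match would otherwise throw), and the lookahead in the replace version uses [^\s.]+ (one or more characters) so that labels longer than one character are handled.

```javascript
// Keep only the last two dot-separated labels of a hostname.
function lastTwoLabels(hostname) {
  const m = hostname.match(/[^\s.]+\.[^\s.]+$/);
  // Assumption: return the input unchanged when it has no dot (e.g. "localhost").
  return m ? m[0] : hostname;
}

// Remove every label that is still followed by at least two more labels.
function stripSubdomains(hostname) {
  return hostname.replace(/[^\s.]+\.(?=[^\s.]+\.)/g, '');
}

console.log(lastTwoLabels('abc.def.test.com'));   // "test.com"
console.log(lastTwoLabels('ghi.test.com'));       // "test.com"
console.log(stripSubdomains('abc.def.test.com')); // "test.com"
console.log(stripSubdomains('test.com'));         // "test.com"

// Both treat the last two labels as the domain, so a name under a
// two-part suffix is cut too short:
console.log(lastTwoLabels('www.test.co.uk'));     // "co.uk"
```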

Well, that depends mainly on how you define a domain and how you define a subdomain. I'll use the most general approach and treat the last two components as the domain (as you do with test.com). In that case you can proceed as follows:
([a-zA-Z0-9-]+\.)*([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+) ==> $2
As you can see, the regexp is divided into two groups, and only the second one, which holds the last two domain components, ends up in the output. The [a-zA-Z0-9-] subexpression deserves some explanation, as it appears three times in the regexp: it is the set of characters allowed in a domain component, including the hyphen. See [1] for a working demo.
If you want to cope with the co.uk example posted in the last demo, so that www.test.co.uk yields test.co.uk, then you have to anchor your regexp to the end (with $, or, if you are in the middle of a URL, with the next : or / that can follow the domain name) so that prefixes are not detected as valid domains, as shown in [2]:
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)$ ==> $3
or [3]:
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)(?=[:/]|$) ==> $3
Of course, you have to put in the list every country code that follows the convention of placing another level (such as co) in front of it. Be careful here, as not all countries follow this approach. I've used the non-greedy *? operator because otherwise the groups don't match as desired (the first group gets greedy, and the match is again co.uk instead of test.co.uk).
But since you ultimately have to anchor your regexp (mainly because domain names can also appear in the query string or in the path part of the URL), it is best to anchor it to the whole URL.
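The end-anchored variant can be sketched as a function over a bare hostname. This is an adaptation rather than the exact expression above: a ^ anchor is added, the helper groups are made non-capturing so the result is in group 1, and every character class is written as [a-zA-Z0-9-]. The domainOf name is hypothetical, and the country list is just the four codes used in the answer.

```javascript
// Lazy prefix (*?) so that the optional country suffix gets a chance to match.
const DOMAIN_RE =
  /^(?:[a-zA-Z0-9-]+\.)*?([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(?:\.(?:uk|au|tw|cn))?)$/;

// Hypothetical helper: returns the domain part, or null if the input
// does not look like a dotted hostname.
function domainOf(hostname) {
  const m = DOMAIN_RE.exec(hostname);
  return m ? m[1] : null;
}

console.log(domainOf('abc.def.test.com')); // "test.com"
console.log(domainOf('ghi.test.com'));     // "test.com"
console.log(domainOf('www.test.co.uk'));   // "test.co.uk"
console.log(domainOf('localhost'));        // null
```

Because the suffix list only names country codes, a hostname such as foo.test.uk would come back whole; a real deployment would need a fuller list of suffixes.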

Parsing units with javascript regex

Say I have a string which contains some units (which may or may not have prefixes) that I want to break into the individual units. For example the string may contain "Btu(th)" or "Btu(th).ft" or even "mBtu(th).ft" where mBtu(th) is the bastardised unit milli thermochemical BTU's (this is purely an example).
I currently have the following (simplified) regex however it fails for the case "mBtu(th).ft":
/(m|k)??(Btu\(th\)|ft|m)(?:\b|\s|$)/g
Currently this does not correctly detect the boundary between the end of 'Btu(th)' and the start of 'ft'. I understand JavaScript regex does not support lookbehind, so how do I accurately parse the string?
Additional notes
The regex presented above is greatly simplified around the prefixes and units groups. The prefixes could span multiple characters like 'Ki' and therefore character sets are not suitable.
The desire is for each match to capture the prefix as group 1 and the unit as group 2, i.e. for 'mBtu(th).ft' match one would be ['m','Btu(th)'] and match two would be ['','ft'].
The prefix match needs to be lazy so that the string 'm' would be matched as the unit metres rather than the prefix milli. Likewise the match for 'mm' would need to be the prefix milli and the unit metres.

I would try with:
/((m)|(k)|(Btu(\(th\))?)|(ft)|(m)|(?:\.))+/g
At least with the example above, it matches all the units merged into one string.
EDIT
Another try:
/(?:(m)|(k)|(Btu)|(th)|(ft)|[\.\(\)])/g
This one again matches only one part at a time, but if you use $1, $2, $3, $4, etc., you can extract the other fragments. It ignores the ., ( and ) characters. The problem is keeping track of which groups matched, but it works to some degree.
Or, if you accept multiple separate matches, I think a simple alternative is:
/(m|k|Btu|th|ft)/g

A word boundary will not separate two non-word characters. So, you don't actually want a word boundary since the parentheses and period are not valid word characters. Instead, you want the string to not be followed by a word character, so you can use this instead:
[mk]??(Btu\(th\)|ft|m)(?!\w)
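A sketch of how the negative lookahead can be combined with the capture layout the question asks for. It assumes the prefix is kept as a capturing group, (m|k)??, instead of the character class shown above, so that multi-character prefixes could be added as alternatives; parseUnits is a made-up name, and String.prototype.matchAll requires an ES2020-capable engine.

```javascript
// Lazy optional prefix, then a unit that must not be followed by a word character.
const UNIT_RE = /(m|k)??(Btu\(th\)|ft|m)(?!\w)/g;

// Hypothetical helper: returns [prefix, unit] pairs, with '' for "no prefix".
function parseUnits(text) {
  return Array.from(text.matchAll(UNIT_RE), (m) => [m[1] || '', m[2]]);
}

console.log(parseUnits('mBtu(th).ft')); // [['m', 'Btu(th)'], ['', 'ft']]
console.log(parseUnits('m'));           // [['', 'm']]   metres, not milli
console.log(parseUnits('mm'));          // [['m', 'm']]  milli + metres
```

The lazy ?? makes the engine try "no prefix" first; the lookahead then rejects the unit m when another word character follows, which forces the engine to backtrack and read the leading m as a prefix.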

I believe you're after something like this, if I understood correctly that you want to match any kind of element, possibly preceded by the m or k character and separated by parentheses or dots.
/[\s\.\(]*(m|k?)(\w+)[\s\.\)]*/g
https://regex101.com/r/eQ5nR4/2
If you don't care about being able to match the parentheses but just return the elements you can just do
/(m|k?)(\w+)/g
https://regex101.com/r/oC1eP5/1

getting user and tweet ID from url using JavaScript regex

So I have tweet url for example https://twitter.com/ESPNFC/status/423771542627966976.
On my website this URL is automatically turned into a link (an <a href="..."> tag), which renders as
https://twitter.com/ESPNFC/status/423771542627966976
I need to match this pattern and also get username and tweet ID.
I did it this way:
/<a href="(http|https):\/\/twitter.com\/([^\/]*)\/status\/([^\/]*)">.+<\/a>/g
Everything works when I have 1 tweet per line, but if there are 2 or more tweets in one line, that regex matches both of them at the same time and groups them as one, but I need to separate them.
Example:
https://twitter.com/ESPNFC/status/423771542627966976
https://twitter.com/ESPNFC/status/423771542627966976
returns 2 matches, but
https://twitter.com/ESPNFC/status/423771542627966976https://twitter.com/ESPNFC/status/423771542627966976
returns 1 match including both URLs. How can I separate them, or, for example, treat everything after a closing </a> as a new line?

It's best to avoid parsing HTML with regex when possible. Having said that, the problem with your expression is the greedy .+, which will match as much as possible. Instead you could use .+? to make it lazy (match as few characters as possible). Or you could restrict what . matches, for example by using [^\s<>]+ instead of .+.
Also, you probably want to change those [^\/]* to something like [^\/"\s]* so that they cannot run past the end of the attribute value.
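A minimal sketch of both suggestions applied together, assuming the links have exactly the shape used in the question's regex (an <a> tag with only an href attribute). The extractTweets name is made up, and the link text is assumed to be the URL itself.

```javascript
// Lazy .+? for the link text, and [^\/"\s]* for the user name and tweet ID.
function extractTweets(html) {
  const re =
    /<a href="https?:\/\/twitter\.com\/([^\/"\s]*)\/status\/([^\/"\s]*)">.+?<\/a>/g;
  return Array.from(html.matchAll(re), (m) => ({ user: m[1], id: m[2] }));
}

const url = 'https://twitter.com/ESPNFC/status/423771542627966976';
const link = `<a href="${url}">${url}</a>`;

// Two links on the same line, with nothing between them.
console.log(extractTweets(link + link));
// [ { user: 'ESPNFC', id: '423771542627966976' },
//   { user: 'ESPNFC', id: '423771542627966976' } ]
```

With the original greedy .+ the same input produces a single match that runs from the first <a to the last </a>.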

Help with a regex

I've got the following sequence I'm attempting to detect: #hl=b&xhr=a, where b can be anything and a can be anything.
I've got the following, but it doesn't appear to be working: (#hl=.+&xhr=). Does anyone know why?
I'm using JavaScript, and the values a and b are letters of the alphabet.

(#hl=.+&xhr=.+): you missed the second .+. Depending on your regex engine, you should also check its escaping rules; often the parentheses or the + have to be escaped. If you just want to match the whole string, the parentheses are not needed anyway, by the way.

You'll need to be more specific to get a better answer:
what programming language are you using RegEx in?
what values can a and b have? Anything implies that newlines are included, which . doesn't match
do you want to get the values of a and b?
Now that that's all been said, let's move on to a regex with some assumptions:
/#hl=(.+)&xhr=(.+)/
This will match a string #hl=a&xhr=b and capture the a and b values from the string. It is greedy, so if more key-value pairs follow in the pseudo-URL (I assume it's a URL-encoded string in the hash fragment), they will be captured as part of b. For example, with
#hl=a&xhr=b&foo=bar
the second group will capture b&foo=bar.
The regex also assumes #hl= comes before &xhr=.

Assuming #, & and = are special characters, how about this regular expression:
#hl=([^#&=]+)&xhr=([^#&=]+)
Are you sure your key/value pairs (?) are always in this order without anything in between?
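The difference between the greedy .+ and the negated character class can be seen side by side in a short sketch. It uses the key name #hl= as written in the question; the parseGreedy and parseStrict names and the &foo=bar tail are made up for illustration.

```javascript
// Greedy version: the second group runs to the end of the string.
function parseGreedy(text) {
  const m = /#hl=(.+)&xhr=(.+)/.exec(text);
  return m ? { hl: m[1], xhr: m[2] } : null;
}

// Negated-class version: each value stops at the next #, & or =.
function parseStrict(text) {
  const m = /#hl=([^#&=]+)&xhr=([^#&=]+)/.exec(text);
  return m ? { hl: m[1], xhr: m[2] } : null;
}

console.log(parseGreedy('#hl=a&xhr=b&foo=bar')); // { hl: 'a', xhr: 'b&foo=bar' }
console.log(parseStrict('#hl=a&xhr=b&foo=bar')); // { hl: 'a', xhr: 'b' }

// The pattern from the question has no value after "xhr=", so it matches
// even when the value is missing:
console.log(/(#hl=.+&xhr=)/.test('#hl=a&xhr=')); // true
```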
