javascript regex that gets all subdomains - javascript

I have the following RegEx:
[!?\.](.*)\.example\.com
and this sample string:
test foo abc.def.example.com bar ghi.jkl.example.com def
I want the RegEx to produce the following matches: def.example.com and jkl.example.com.
What do I have to change? Should be working on all subdomains of example.com. If possible it should only take the first subdomain-level (abc.def.example.com -> def.example.com).
Tested it on regexpal, not fully working :(

You may use the following expression : [^.\s]+\.example\.com.
Explanation
[^.\s]+ : match anything except a dot or whitespace one or more times
\.example\.com : match example.com
Note that you don't need to escape a dot in a character class
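As a quick check, the expression can be run against the sample string with String.prototype.match and the g flag. This is only a sketch; the variable names are made up for illustration.

```javascript
// Sample string from the question
const sample = "test foo abc.def.example.com bar ghi.jkl.example.com def";

// g flag: return every match instead of stopping at the first one
const subdomainPattern = /[^.\s]+\.example\.com/g;

// match() returns null when nothing matches, so fall back to an empty array
const found = sample.match(subdomainPattern) || [];

console.log(found); // [ 'def.example.com', 'jkl.example.com' ]
```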

Just on a side note, while HamZa's answer works for your current sample code, if you need to make sure that the domain names are also valid, you might want to try a different approach, since [^.\s]+ will match ANY character that is not a space or a . (for example, that regex will match jk&^%&*(l.example.com as a "valid" subdomain).
Since there are far fewer valid characters for domain names than invalid ones, you might consider an "additive" approach to the regex rather than a subtractive one. This pattern is probably what you want for valid domain names: /(?:[\s.])([a-z0-9][a-z0-9-]+[a-z0-9]\.example\.com)/gi
To break it down a little more . . .
(?:[\s.]) - matches the space or . that would mark the beginning of the lowest-level subdomain
([a-z0-9][a-z0-9-]+[a-z0-9]\.example\.com) - this captures a group of letters, numbers or dashes, that must begin and end with a letter or a number (domain name rules), and then the example.com domain. As written, the subdomain label must be at least three characters long.
gi - makes the regex pattern global (find every match, not just the first) and case insensitive
At this point, it is simply a question of grabbing the matches. Since .match() with the g flag returns only the full matches and drops the captured groups, use .exec() instead:
var domainString = "test foo abc.def.example.com bar ghi.jkl.example.com def";
var regDomainPattern = /(?:[\s.])([a-z0-9][a-z0-9-]+[a-z0-9]\.example\.com)/gi;
var aMatchedDomainStrings = [];
var patternMatch;
// loop as long as .exec() still finds a match, and take index 1 of the result (the captured group, which excludes the leading space or dot)
while (null != (patternMatch = regDomainPattern.exec(domainString))) {
  aMatchedDomainStrings.push(patternMatch[1]);
}
At that point aMatchedDomainStrings should contain all of your valid, first-level, sub-domains.
var domainString = "test foo abc.def.example.com bar ghi.jkl.example.com def";
. . . should get you: def.example.com and jkl.example.com, while:
var domainString = "test foo abc.def.example.com bar ghi.jk&^%&*(l.example.com def";
. . . should get you only: def.example.com
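If your environment supports String.prototype.matchAll (ES2020), the same captures can be collected without the explicit loop. A minimal sketch, assuming the same pattern as above; the helper name extractSubdomains is hypothetical:

```javascript
// Hypothetical helper: collects capture group 1 from every match
function extractSubdomains(text) {
  // Same pattern as above; matchAll() requires the g flag
  const pattern = /(?:[\s.])([a-z0-9][a-z0-9-]+[a-z0-9]\.example\.com)/gi;
  // Each match is an array: [0] is the full match, [1] the captured group
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

const valid = extractSubdomains("test foo abc.def.example.com bar ghi.jkl.example.com def");
const partlyInvalid = extractSubdomains("test foo abc.def.example.com bar ghi.jk&^%&*(l.example.com def");

console.log(valid);         // [ 'def.example.com', 'jkl.example.com' ]
console.log(partlyInvalid); // [ 'def.example.com' ]
```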

Related

How to capture a subdomain in regex?

I'm trying to extract the username from a tumblr. That is, the regex should match asdf in the following test string:
https://asdf.tumblr.com/
http://asdf.tumblr.com/faq
www.asdf.tumblr.com/
asdf.tumblr.com
Basically, I think I need to do something like, match from either a dot or a slash until the next dot, but I'm having trouble making it work in every case. Currently I have this:
.*[\/|\.](.*)\.tumblr\.com.*
However, this fails to capture the last group (asdf.tumblr.com). I tried modifying it to no avail. Can this be done?
You may use this regex in Javascript:
/[^.\/]+(?=\.tumblr\.com)/i
RegEx Details:
[^.\/]+: Match 1 or more of any character that is not . and /
(?=\.tumblr\.com): Positive lookahead to ensure we have .tumblr.com at next position
Code:
let x = /([^.\/]+)(?=\.tumblr\.com)/;
let y = "https://asdf.tumblr.com";
console.log( y.match(x)[1] );
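To check the pattern against all four test strings from the question, something like the following sketch can be used. Here the full match is read with [0], so no capture group is needed; the variable names are illustrative only.

```javascript
// Lookahead pattern from the answer above
const tumblrUser = /[^.\/]+(?=\.tumblr\.com)/i;

const urls = [
  "https://asdf.tumblr.com/",
  "http://asdf.tumblr.com/faq",
  "www.asdf.tumblr.com/",
  "asdf.tumblr.com",
];

// match() returns null if the string has no .tumblr.com host, so guard for that
const names = urls.map((url) => {
  const m = url.match(tumblrUser);
  return m ? m[0] : null;
});

console.log(names); // [ 'asdf', 'asdf', 'asdf', 'asdf' ]
```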

How to write regexp for finding :smile: in javascript?

I want to write a regular expression, in JavaScript, for finding the string starting and ending with :.
For example "hello :smile: :sleeping:" from this string I need to find the strings which are starting and ending with the : characters. I tried the expression below, but it didn't work:
^:.*\:$
My guess is that you not only want to find the string, but also replace it. For that you should look at using a capture in the regexp combined with a replacement function.
const emojiPattern = /:(\w+):/g
function replaceEmojiTags(text) {
return text.replace(emojiPattern, function (tag, emotion) {
// The emotion will be the captured word between your tags,
// so either "smile" or "sleeping" in your example
//
// In this function you would take that emotion and return
// whatever you want based on the input parameter and the
// whole tag would be replaced
//
// As an example, let's say you had a bunch of GIF images
// for the different emotions:
return '<img src="/img/emoji/' + emotion + '.gif" />';
});
}
With that code you could then run your function on any input string and replace the tags to get the HTML for the actual images in them. As in your example:
replaceEmojiTags('hello :smile: :sleeping:')
// 'hello <img src="/img/emoji/smile.gif" /> <img src="/img/emoji/sleeping.gif" />'
EDIT: To support hyphens within the emotion, as in "big-smile", the pattern needs to be changed since it is only looking for word characters. For this there is probably also a restriction such that the hyphen must join two words so that it shouldn't accept "-big-smile" or "big-smile-". For that you need to change the pattern to:
const emojiPattern = /:(\w+(-\w+)*):/g
That pattern is looking for any word that is then followed by zero or more instances of a hyphen followed by a word. It would match any of the following: "smile", "big-smile", "big-smile-bigger".
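A short sketch of how the hyphen-aware pattern behaves on accepted and rejected tags. The helper findEmojiNames is hypothetical and only lists the captured names instead of replacing them:

```javascript
const emojiPattern = /:(\w+(-\w+)*):/g;

// Hypothetical helper: returns the names found between colons
function findEmojiNames(text) {
  return Array.from(text.matchAll(emojiPattern), (m) => m[1]);
}

console.log(findEmojiNames("hello :smile: :big-smile: :big-smile-bigger:"));
// [ 'smile', 'big-smile', 'big-smile-bigger' ]
console.log(findEmojiNames(":-big-smile:")); // []
console.log(findEmojiNames(":big-smile-:")); // []
```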
The ^ and $ are anchors (start and end respectively). These cause your regex to explicitly match an entire string which starts with : has anything between it and ends with :.
If you want to match characters within a string you can remove the anchors.
Your * indicates zero or more so you'll be matching :: as well. It'll be better to change this to + which means one or more. In fact if you're just looking for text you may want to use a range [a-z0-9] with a case insensitive modifier.
If we put it all together we'll have a regex like this: /:([a-z0-9]+):/gmi
It matches a string beginning with :, followed by one or more alphanumeric characters, and ending in :. The modifiers are g (global), m (multi-line) and i (case insensitive, for things like :FacePalm:). The m modifier has no effect here, since the pattern no longer contains ^ or $.
Using it in JavaScript we can end up with:
var mytext = 'Hello :smile: and jolly :wave:';
var matches = mytext.match(/:([a-z0-9]+):/gmi);
// matches = [':smile:', ':wave:'];
You'll have an array with each match found.
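Note that with the g flag, match() returns the full matches including the colons, and null when there are none. If only the names are needed, one option (a sketch; the variable names are illustrative) is to slice the colons off afterwards:

```javascript
const text = "Hello :smile: and jolly :wave:";

// With g, match() yields full matches only; fall back to [] when it returns null
const tags = text.match(/:([a-z0-9]+):/gi) || [];

// Drop the leading and trailing colon from each tag
const tagNames = tags.map((tag) => tag.slice(1, -1));

console.log(tags);     // [ ':smile:', ':wave:' ]
console.log(tagNames); // [ 'smile', 'wave' ]
```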

JavaScript to remove whatever is after the tld and before the whitespace

I have a bunch of functions that are filtering a page down to the domains that are attached to email addresses. It's all working great except for one small thing, some of the links are coming out like this:
EXAMPLE.COM
EXAMPLE.ORG.
EXAMPLE.ORG>.
EXAMPLE.COM"
EXAMPLE.COM".
EXAMPLE.COM).
EXAMPLE.COM(COMMENT)"
DEPT.EXAMPLE.COM
EXAMPLE.ORG
EXAMPLE.COM.
I want to figure out one last filter (regex or not) that will remove everything after the TLD. All of these items are in an array.
EDIT
The function I'm using:
function filterByDomain(array) {
var regex = new RegExp("([^.\n]+\.[a-z]{2,6}\b)", 'gi');
return array.filter(function(text){
return regex.test(text);
});
}
You can probably use this regex to match your TLD for each case:
/^[^.\n]+\.[a-z]{2,63}$/gim
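Your validation function can be:
function filterByDomain(array) {
  // No g flag here: a global regex remembers lastIndex between test() calls,
  // which makes it miss matches when it is reused on different strings
  var regex = /^[^.\n]+\.[a-z]{2,63}$/im;
  return array.filter(function(text){
    return regex.test(text);
  });
}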
PS: Do read this Q & A to see that up to 63 characters are allowed in TLD.
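The filter above keeps or drops whole entries; it does not trim them. If the goal is to cut off whatever follows the TLD, one possible sketch is to map each entry to its leading domain-like part. The helper name trimAfterTld and the exact pattern are assumptions for illustration, not part of the answer above:

```javascript
// Hypothetical helper: keeps the leading "label.label.tld" part of each entry
function trimAfterTld(entries) {
  // Dot-separated labels, ending in an alphabetic TLD of 2 to 63 letters
  const domainPrefix = /^[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,63}/i;
  return entries
    .map((entry) => {
      const m = entry.match(domainPrefix);
      return m ? m[0] : null; // null marks entries with no domain at the start
    })
    .filter((domain) => domain !== null);
}

const trimmed = trimAfterTld(["EXAMPLE.ORG>.", 'EXAMPLE.COM(COMMENT)"', "DEPT.EXAMPLE.COM", "EXAMPLE.COM)."]);
console.log(trimmed);
// [ 'EXAMPLE.ORG', 'EXAMPLE.COM', 'DEPT.EXAMPLE.COM', 'EXAMPLE.COM' ]
```

Entries that do not start with something domain-like are dropped rather than passed through, which may or may not be what you want.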
I'd match all leading [\w.] and omit the last dot, if any:
var result = url.match(/^[\w\.]+/).join("");
if(result.slice(-1)==".") result = result.slice(0,-1);
Note that \w should be replaced with something more sophisticated:
_ is part of \w set but should not be in url path
- is not part of \w but can be in url not adjacent to . or -
To keep the regexp simple and the code readable, I'd do it this way
substitute _ for # in url (both # and _ can be only after TLD)
substitute - for _ (_ is part of \w)
after the regexp test, substitute _ back for -
URL like www.-example-.com would still pass, can be detected by searching for [.-]{2,}

Split window.location on / and join with : if ends in / don't add :

I am trying to write an if statement to look at the URL and based upon what it is execute certain code blocks. The issue I am having is with looking for when the url ends with a / and not to add a :. For example www.site.com/folder/page/ = folder:page: I want folder:page
var winLoc = window.location.href
if(winLoc.indexOf('error') > -1){
winLoc = 'Error:' + document.referrer.replace(/^.+.com/,'').split('/').join(':')
}
else{
winLoc.replace(/^.+.com/,'').split('/').join(':')
}
Also, I am not a RegEx person so explanations of answers appreciated :]
Put a
.replace(/:$/,'')
at the end. That’s easy, intuitive and obvious, so I guess it qualifies as one of the “best” options. I think you also forgot to assign the result to winLoc…
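Applied to the example URL from the question, the chain could look like the sketch below. It uses a plain string instead of window.location.href so it also runs outside a browser, and it strips the leading colon as well, which is an assumption about the desired output (folder:page):

```javascript
// Stand-in for window.location.href
const href = "http://www.site.com/folder/page/";

const joined = href
  .replace(/^.+\.com/, "") // drop everything up to and including ".com"
  .split("/")              // "/folder/page/" -> ["", "folder", "page", ""]
  .join(":")               // -> ":folder:page:"
  .replace(/^:|:$/g, "");  // trim one colon at each end

console.log(joined); // folder:page
```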
I’d actually recommend reducing code repetition like so:
var winLoc=(~location.href.indexOf('error')
? 'Error:'+document.referrer
: location.href)
.replace(/^(Error:)?.+\.com/,'$1') // Remove URL but keep optional “Error:”
.replace(/(?:\/)/g,':')
.replace(/:$/,'');
Note that the . before com has to be escaped: \..
Anyway, you can also write /…\.com(?:\/)/ instead of /…\.com/ if you wanna exclude a starting colon as well. The (?:) are just non-capturing groups (i. e. groups with no special meaning) that prevent JS from interpreting certain parts as a comment (//).
Testing snippet:
document.getElementById('tester').addEventListener('change',function(){
document.getElementById('out').value=separateURLByColons(
document.getElementById('in1').value,
document.getElementById('in2').value);
});
document.getElementById('out').value=separateURLByColons(
document.getElementById('in1').value,
document.getElementById('in2').value);
function separateURLByColons(input,fallbackReferrer){
var location_href=input,
document_referrer=fallbackReferrer;
var winLoc=(~location_href.indexOf('error')
? 'Error:'+document_referrer
: location_href)
.replace(/^(Error:)?.+\.com/,'$1') // Remove URL but keep optional “Error:”
.replace(/(?:\/)/g,':')
.replace(/:$/,'');
return winLoc;
}
<div id="tester">
<code>location.href:</code> <input id="in1" type="text" value="http://example.com/test/path/there/" size="30"/>, <code>document.referrer:</code> <input id="in2" type="text" value="http://example.com/somewhere/" size="30"/>
<br/>
Result: <input id="out" type="text"/>
</div>
You can just use the following:
winLoc = 'Error:' + document.referrer.replace(/^.+\.com\/([^\/]+)\/([^\/]+)\/?/,'$1:$2')
Again, it's questionable that this is "better", but it should handle what you are trying to do:
var winLoc = window.location;
if (winLoc.href.indexOf('error') > -1) {
winLoc = 'Error:' + document.referrer.replace(/^.+:\/+[^\/]+\//, "").match(/([^\/]+)/g).join(":");
}
else {
winLoc = winLoc.pathname.match(/([^\/]+)/g).join(":");
}
Sadly, document.referrer is a String rather than a URLUtils type, so you are forced to strip the resource type (protocol) and domain manually (i.e., .replace(/^.+:\/+[^\/]+\//, "")).
That regex matches the start of the string, (^), followed by one or more characters (.+), followed by a colon (:), followed by one or more slashes (\/+ . . . note, the slash is escaped), followed by one or more non-slash characters ([^\/]+), followed by a single slash (\/). This will match multiple resource types ("http", "https", "file", etc.) as well as multiple domain types (".com", ".org", ".biz", etc.).
However, window.location IS an object that comes from URLUtils, so you can simply reference the pathname property to get everything between the domain (the hostname property) and any parameters (the search property). More on URLUtils here: https://developer.mozilla.org/en-US/docs/Web/API/URLUtils
In either case, "www.site.com/folder/page/" becomes "folder/page/", in the end.
After getting the directory string, you can then apply a single regex match, which will return all of the values that match the regex pattern that you provide. In this case, that is /([^\/]+)/g . . . that means "any continuous group of one or more non-slash characters". And, the g flag at the end of the pattern means that the code should capture all of those directory levels.
"folder/page/" becomes ["folder", "page"]
All of those will be returned in an array, so you then just have to join the array elements together, and you have your string.
["folder", "page"] becomes "folder:page"
Now, there are a few caveats . . . some websites use some pretty strange values at the end of their search URLs that might cause unexpected results (e.g., take a look at Yahoo! redirects) and, if there is a page in the URL (e.g., "blah/blah/blah/page.html"), the page will be included in the result, as well.
Further tweaking can remove those, but, since it wasn't part of your question, I didn't go down that path.
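For plain strings such as document.referrer, the standard URL constructor can do the parsing that the manual regex does above. A minimal sketch, assuming an environment where the global URL class is available (modern browsers, Node.js); the function name pathToColons is hypothetical:

```javascript
// Hypothetical helper: turns the path of an absolute URL into "a:b:c"
function pathToColons(href) {
  // new URL() throws on relative or malformed input, so href must be absolute
  const pathname = new URL(href).pathname;
  // match() returns null for a path with no segments ("/"), hence the fallback
  const segments = pathname.match(/[^\/]+/g) || [];
  return segments.join(":");
}

console.log(pathToColons("http://www.site.com/folder/page/")); // folder:page
console.log(JSON.stringify(pathToColons("http://www.site.com/"))); // ""
```

The fallback to an empty array matters: calling .join() directly on the result of match() would throw for a URL whose path is just "/".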

RegEx - Get All Characters After Last Slash in URL

I'm working with a Google API that returns IDs in the below format, which I've saved as a string. How can I write a Regular Expression in javascript to trim the string to only the characters after the last slash in the URL.
var id = 'http://www.google.com/m8/feeds/contacts/myemail%40gmail.com/base/nabb80191e23b7d9'
Don't write a regex! This is trivial to do with string functions instead:
var final = id.substr(id.lastIndexOf('/') + 1);
It's even easier if you know that the final part will always be 16 characters:
var final = id.substr(-16);
A slightly different regex approach:
var afterSlashChars = id.match(/\/([^\/]+)\/?$/)[1];
Breaking down this regex:
\/ match a slash
( start of a captured group within the match
[^\/] match a non-slash character
+ match one or more of the non-slash characters
) end of the captured group
\/? allow one optional / at the end of the string
$ match to the end of the string
The [1] then retrieves the first captured group within the match
Working snippet:
var id = 'http://www.google.com/m8/feeds/contacts/myemail%40gmail.com/base/nabb80191e23b7d9';
var afterSlashChars = id.match(/\/([^\/]+)\/?$/)[1];
// display result
document.write(afterSlashChars);
Just in case someone else comes across this thread and is looking for a simple JS solution:
id.split('/').pop()
this is easy to understand (?!.*/).+
let me explain:
first, let's match everything that has a slash at the end, ok?
that's the part we don't want
.*/ matches everything until the last slash
then, we make a "Negative lookahead" (?!) to say "I don't want this, discard it"
(?!.*) this is "Negative lookahead"
Now we can happily take whatever is next to what we don't want with this
.+
YOU MAY NEED TO ESCAPE THE / SO IT BECOMES:
(?!.*\/).+
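To see the lookahead at work: the engine tries each starting position in turn, and the first position with no slash anywhere after it is the start of the last segment. A small sketch using the ID from the question:

```javascript
const id = "http://www.google.com/m8/feeds/contacts/myemail%40gmail.com/base/nabb80191e23b7d9";

// (?!.*\/) rejects every start position that still has a "/" somewhere ahead,
// so .+ can only begin right after the last slash
const lastPart = id.match(/(?!.*\/).+/)[0];

console.log(lastPart); // nabb80191e23b7d9
```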
this regexp: [^\/]+$ - works like a champ:
var id = ".../base/nabb80191e23b7d9"
result = id.match(/[^\/]+$/)[0];
// results -> "nabb80191e23b7d9"
This should work:
last = id.match(/\/([^/]*)$/)[1];
//=> nabb80191e23b7d9
Don't know JS, using others examples (and a guess) -
id = id.match(/[^\/]*$/)[0]; // [0] picks the matched text out of the result array
Why not use replace?
"http://google.com/aaa".replace(/(.*\/)*/,"")
yields "aaa"
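The approaches above differ mainly in how they treat a trailing slash. The sketch below compares a few of them on the ID from the question and on a variant with a trailing slash (the variant and the function names are made up for illustration):

```javascript
const plain = "http://www.google.com/m8/feeds/contacts/myemail%40gmail.com/base/nabb80191e23b7d9";
const withSlash = plain + "/"; // hypothetical variant, not from the question

// String methods: everything after the last "/"
const bySlice = (s) => s.slice(s.lastIndexOf("/") + 1);
const bySplit = (s) => s.split("/").pop();

// Regex that tolerates one optional trailing slash
const byRegex = (s) => {
  const m = s.match(/\/([^\/]+)\/?$/);
  return m ? m[1] : "";
};

console.log(bySlice(plain), bySplit(plain), byRegex(plain));
// all three print nabb80191e23b7d9
console.log(JSON.stringify(bySlice(withSlash))); // ""
console.log(JSON.stringify(bySplit(withSlash))); // ""
console.log(byRegex(withSlash)); // nabb80191e23b7d9
```

If the IDs can ever arrive with a trailing slash, the regex with the optional \/? is the more forgiving choice; otherwise the string methods are simpler.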
