How to avoid capturing groups if the captured match is empty? - javascript

I would like to prepend the word "custom" to a list of host-names whose subdomains can be separated by some separator.
Examples:
news.google.com -> custom.news.google.com
news/google/com -> custom/news/google/com
dev.maps.yahoo.fr -> custom.dev.maps.yahoo.fr
dev/maps/yahoo/fr -> custom/dev/maps/yahoo/fr
These strings appear inside a document with more content, so I am trying to solve this problem using regular expressions and JavaScript's string replace function.
The list of hostnames and separators is predefined and known in advance. For the sake of this example, I only showed 2 hostnames (news.google.com and dev.maps.yahoo.fr) and 2 separators (. and /), but there are more.
The separator within a single string will always be the same, i.e. there won't be cases like dev/maps.yahoo/fr.
I want to be consistent and use the correct separator when prepending "custom".
I built this long regular expression:
const myRegex = /news\.google\.com|news\/google\/com|dev\.maps\.yahoo\.fr|dev\/maps\/yahoo\/fr/
(For readability purposes, this is the expression:
/news\.google\.com/
OR
/news\/google\/com/
OR
/dev\.maps\.yahoo\.fr/
OR
/dev\/maps\/yahoo\/fr/
)
(Note: It is important to emphasize that the list of hostnames is predefined and well known in advance, that's why I am 'hardcoding' the hostnames and not using tokens such as \w+ or \S+. For example, I might want to replace news.google.com, but leave news2.google.com intact).
However, I am not sure how to capture the separator (whether ., /, or any other separator). I tried using capture groups like this:
const myRegex = /news(\.)google\.com|news(\/)google\/com|dev(\.)maps\.yahoo\.fr|dev(\/)maps\/yahoo\/fr/
However, by doing this, I am creating 4 capture groups, and there's only one separator (and this is just a simple example). 3 of the capture groups will be empty, and one of them will contain the separator. How can I know which capture group it is?
Ideally, I would like something like this:
const myString = 'I navigated to news.google.com'; // For example
const myCustomString = myString.replace(
myRegex,
(match, <SEPARATOR>) => `custom${SEPARATOR}${match}`,
);
console.log(myCustomString);
// will log 'I navigated to custom.news.google.com'
Is there a way to skip captured groups if they are empty?

Use \1 to refer to the separator captured in the first (\.|\/) group so we don't have to keep writing it over and over.
const text = `I navigated to news.google.com
I navigated to news/google/com
I navigated to dev.maps.yahoo.fr
I navigated to dev/maps/yahoo/fr`;
const re = /\w+(\.|\/)(\w+\1)?(google|yahoo)\1\w+/g;
console.log(text.replace(re, (url, separator) => `custom${separator}${url}`));
Here's an alternate solution given the new requirement described in the comments:
const text = `I navigated to news.google.com
I navigated to news/google/com
I navigated to dev.maps.yahoo.fr
I navigated to dev/maps/yahoo/fr`;
const re = /(news|dev)(\.|\/)(google|maps)\2(com|yahoo)(\2fr)?/g;
console.log(text.replace(re, (url, prefix, separator) => `custom${separator}${url}`));
Yet another alternate solution:
const text = `I navigated to news.google.com
I navigated to news/google/com
I navigated to dev.maps.yahoo.fr
I navigated to dev/maps/yahoo/fr`;
const re = /news(\.)google\.com|news(\/)google\/com|dev(\.)maps\.yahoo\.fr|dev(\/)maps\/yahoo\/fr/g;
console.log(text.replace(re, url => 'custom' + url.match(/\.|\//)[0] + url));

One solution that I believe would be acceptable to you is to find the separator among the capture groups inside the callback:
const myCustomString = myString.replace(
  myRegex,
  (match, ...rest) => {
    // With no named groups, the last two args are the match offset and the whole input string
    const sep = rest.slice(0, -2)
      .find(sep => !!sep); // the first truthy capture group contains the separator
    return `custom${sep}${match}`;
  },
);
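Since the hostnames and separators are known in advance, the pattern could also be generated instead of written by hand. The following is a minimal sketch, not code from the question or the answers: `hosts`, `separators`, `escapeRegExp`, and `prependCustom` are hypothetical names, and it assumes every hostname label consists of word characters only. A lookahead captures the separator once as group 1 and each alternative reuses it through `\1`, so there is only ever one capture group to read.

```javascript
// Hypothetical inputs: hostnames written in dot form, plus the allowed separators.
const hosts = ['news.google.com', 'dev.maps.yahoo.fr'];
const separators = ['.', '/'];

// Escape characters that are special inside a regular expression.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');

// Alternation of separators, e.g. \.|\/
const sepPattern = separators.map(escapeRegExp).join('|');

// Each hostname becomes label\1label\1label, where \1 is the captured separator.
const hostPattern = hosts
  .map((host) => host.split('.').map(escapeRegExp).join('\\1'))
  .join('|');

// The lookahead captures the separator that follows the first label as group 1.
const re = new RegExp(`\\b(?=\\w+(${sepPattern}))(?:${hostPattern})\\b`, 'g');

function prependCustom(text) {
  return text.replace(re, (match, sep) => `custom${sep}${match}`);
}

console.log(prependCustom('I navigated to news.google.com and dev/maps/yahoo/fr'));
// I navigated to custom.news.google.com and custom/dev/maps/yahoo/fr
```

Because every alternative must repeat the same captured separator, a mixed string such as dev/maps.yahoo/fr is left untouched, as is news2.google.com.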

Related

Replace characters of a string matched by regex

I need to find the domain names of all valid URLs in an HTML page and replace them with another domain name, but within each original domain name I also need to do a second replacement. For example, if the URL https://www.example.com/path/to/somewhere appears in the HTML page, I need to eventually transform it into something like www-example-com.another.domain/path/to/somewhere.
I can do the first match and replace with the following code:
const regex = new RegExp('(https?:\/\/([^:\/\n\"\'?]+))', 'g');
txt = txt.replace(regex, "$1.another.domain");
but I have no idea how to do the second match and replace to replace the . into -. I wonder if there is any efficient way to finish this task. I tried to do something like the following but it does not work:
const regex = new RegExp('(https?:\/\/([^:\/\n\"\'?]+))', 'g');
txt = txt.replace(regex, "$1".replace(/'.'/g, '-') + ".another.domain");
Ok - I think I know what you're looking for. I'll explain what it's doing.
You have 2 capture groups: the scheme and host (everything before the first / of the path), and the path from that / onward.
You take the first capture group and convert each . to -
You then append .another.domain as a string, followed by the 2nd capture group
const address1 = 'https://www.example.com/path/to/somewhere';
const newDomain = "another.domain"
const pattern = /(https?:\/\/[^:\/\n\"\'?]+)(\/.*)/;
const matches = pattern.exec(address1);
const converted = matches[1].replace(/\./g, "-") + `.${newDomain}${matches[2]}`;
console.log(converted);
You can use the function version of String.prototype.replace() to have some more control over the specific replacements.
For example...
const txt = 'URL is https://www.example.com/path/to/somewhere'
const newTxt = txt.replace(/(https?:\/\/)([\w.]+)/g, (_, scheme, domain) =>
`${scheme}${domain.replace(/\./g, '-')}.another.domain`)
console.log(newTxt)
Here, scheme is the first capture group (https?:\/\/) and domain is the second ([\w.]+).
If you need a fancier domain matcher (as per your question), just substitute that part of the regex.
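As a rough sketch of that substitution, the callback approach can be combined with the host character class from the question. `rewriteHosts` is a hypothetical helper name, and the sample HTML is made up for illustration; the rest of each URL (port, path) is simply left in place because it is not part of the match.

```javascript
// Hypothetical helper: rewrites the host of every http(s) URL found in `html`.
function rewriteHosts(html, newDomain) {
  // Group 1: scheme. Group 2: host, using the character class from the question.
  const urlPattern = /(https?:\/\/)([^:\/\n"'?]+)/g;
  return html.replace(urlPattern, (_, scheme, host) =>
    `${scheme}${host.replace(/\./g, '-')}.${newDomain}`);
}

const html = '<a href="https://www.example.com/path/to/somewhere">link</a>';
console.log(rewriteHosts(html, 'another.domain'));
// <a href="https://www-example-com.another.domain/path/to/somewhere">link</a>
```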

How to match bold markdown if it isn't preceded with a backslash?

I'm looking to match bolded markdown. Here are some examples:
qwer *asdf* zxcv matches *asdf*
qwer*asdf*zxcv matches *asdf*
qwer \*asdf* zxcv does not match
*qwer* asdf zxcv matches *qwer*
A negative look behind like this (?<!\\)\*(.*)\* works.
Except there is no browser support in Firefox, so I cannot use it.
Similarly, I can get very close with (^|[^\\])\*(.*)\*
The issue is that there are two capture groups, and I need the index of the second capture group, and Javascript only returns the index of the first capture group. I can bandaid it in this case by just adding 1, but in other cases this hack will not work.
My reasoning for doing this is that I'm trying to replace a small subset of Markdown with React components. As an example, I'm trying to convert this string:
qwer *asdf* zxcv *123*
Into this array:
[ "qwer ", <strong>asdf</strong>, " zxcv ", <strong>123</strong> ]
Where the second and fourth elements are created via JSX and included as array elements.
You will also need to take into account that when a backslash occurs before an asterisk, it may be one that is itself escaped by a backslash, and in that case the asterisk should be considered the start of bold markup. Except if that one is also preceded by a backslash,...etc.
So I would suggest this regular expression:
((?:^|[^\\])(?:\\.)*)\*((\\.|[^*])*)\*
If the purpose is to replace these with tags, like <strong> ... </strong>, then just use JavaScript's replace as follows:
let s = String.raw`now *this is bold*, and \\*this too\\*, but \\\*this\* not`;
console.log(s);
let regex = /((?:^|[^\\])(?:\\.)*)\*((\\.|[^*])*)\*/g;
let res = s.replace(regex, "$1<strong>$2</strong>");
console.log(res);
If the bolded words should be converted to a React component and stored in an array with the other pieces of plain text, then you could use split and map:
let s = String.raw`now *this is bold*, and \\*this too\\*, but \\\*this\* not`;
console.log(s);
let regex = /((?:^|[^\\])(?:\\.)*)\*((?:\\.|[^*])*)\*/g;
let res = s.split(regex).map((s, i) =>
i%3 === 2 ? React.createElement("strong", {}, s) : s
);
Since there are two capture groups in the "delimiter" for the split call, one having the preceding character(s) and the second the word itself, every third item in the split result is a word to be bolded, hence the i%3 expression.
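On the question's point about needing the index of the second capture group: with the simpler (^|[^\\]) pattern, that index can be derived as the match index plus the length of the first group. The sketch below uses that to build the array of pieces. `splitBold` is a hypothetical name, plain `{ bold: ... }` objects stand in for the React elements, and the pattern deliberately ignores escaped backslashes and back-to-back bold spans.

```javascript
// Hypothetical helper: splits text into plain strings and { bold } placeholders.
function splitBold(text) {
  // Group 1: start of string or one non-backslash character. Group 2: the bold text.
  const re = /(^|[^\\])\*([^*]+)\*/g;
  const parts = [];
  let last = 0;
  for (const m of text.matchAll(re)) {
    // Position of the opening asterisk: match start plus the length of group 1.
    const boldStart = m.index + m[1].length;
    if (boldStart > last) parts.push(text.slice(last, boldStart));
    parts.push({ bold: m[2] });
    last = m.index + m[0].length;
  }
  if (last < text.length) parts.push(text.slice(last));
  return parts;
}

console.log(splitBold('qwer *asdf* zxcv *123*'));
// [ 'qwer ', { bold: 'asdf' }, ' zxcv ', { bold: '123' } ]
```

Each `{ bold }` placeholder could then be mapped to a `<strong>` element in a later step.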
This should do the trick:
/(?:^|[^\\])(\*[^*]+[^\\]\*)/
The only capturing group there is the string surrounded by *'s.

How can I get a specific part of a URL using RegEx?

I am trying to get a part of a file download URL using RegEx (or other methods). I have pasted below the link that I am trying to parse; the part I am trying to select is the version number, 1.7.0.13.
https://minecraft.azureedge.net/bin-linux/bedrock-server-1.7.0.13.zip
I have looked around and thought about trying Named Capture Groups, however I couldn't figure it out. I would like to be able to do this in JavaScript/Node.js, even if it requires a module 👻.
You can use Node.js built-in modules to simplify the match:
URL and path to isolate the filename, and then a simple regexp.
const { URL } = require('url')
const path = require('path')
const test = new URL(
'https://minecraft.azureedge.net/bin-linux/bedrock-server-1.7.0.13.zip'
)
/*
test.pathname = '/bin-linux/bedrock-server-1.7.0.13.zip'
path.parse(test.pathname) = { root: '/',
dir: '/bin-linux',
base: 'bedrock-server-1.7.0.13.zip',
ext: '.zip',
name: 'bedrock-server-1.7.0.13' }
match = [ '1.7.0.13', index: 15, input: 'bedrock-server-1.7.0.13' ]
*/
const match = path.parse(test.pathname)
.name
.match(/[0-9.]*$/)
You could use the below regex:
[\d.]+(?=\.\w+$)
This matches dots and digits that are followed by a file extension. You could also make it more accurate:
\d+(?:\.\d+)*(?=\.\w+$)
I'd stick with this:
-(\d+(?:\.\d+)*)(?:\.\w+)$
It matches a dash before the numbers
The parentheses create a capture group
Then, \d+ matches one or more digits
(?: ... ) creates a group without capturing it
Inside this group, \.\d+ matches a dot followed by one or more digits
Thanks to *, that group repeats zero or more times
After that, (?:\.\w+)$ is a non-capturing group that matches the extension at the end of the string
So, basically, this format would allow you to capture all the numbers that are after the dash and before the extension, be it 1, 1.7, 1.7.0, 1.7.0.13, 1.7.0.13.5 etc. On the match array, at index [0] you will have the entire regex match, and on [1] you will have your captured group, the number you're looking for.
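A minimal sketch of how that pattern might be wrapped for use, assuming a hypothetical helper name `extractVersion` and a `null` return when no version is present:

```javascript
// Hypothetical helper: returns the dotted version before the extension, or null.
function extractVersion(url) {
  const match = url.match(/-(\d+(?:\.\d+)*)(?:\.\w+)$/);
  // match[0] is the whole match (dash, version, extension); match[1] is the version.
  return match ? match[1] : null;
}

console.log(extractVersion('https://minecraft.azureedge.net/bin-linux/bedrock-server-1.7.0.13.zip'));
// 1.7.0.13
```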
Perhaps a regular expression like this is what you need?
var url = 'https://minecraft.azureedge.net/bin-linux9.9.9/bedrock-server-1.7.0.13.zip'
var match = url.match(/(\d+[.\d+]*)(?=\.\w+$)/gi)
console.log( match )
The way this pattern /(\d+[.\d+]*)(?=\.\w+$)/gi works is to basically say that we want a substring match that:
first contains one or more digit characters, i.e. \d+
immediately following this, there can be zero or more characters from the class [.\d+], i.e. dots, digits, or a literal +
and finally, (?=\.\w+$) requires a file extension like .zip to follow immediately after our matched string
For more information on special characters like + and *, see this documentation. Hope that helps!

JavaScript to remove whatever is after the tld and before the whitespace

I have a bunch of functions that are filtering a page down to the domains that are attached to email addresses. It's all working great except for one small thing, some of the links are coming out like this:
EXAMPLE.COM
EXAMPLE.ORG.
EXAMPLE.ORG>.
EXAMPLE.COM"
EXAMPLE.COM".
EXAMPLE.COM).
EXAMPLE.COM(COMMENT)"
DEPT.EXAMPLE.COM
EXAMPLE.ORG
EXAMPLE.COM.
I want to figure out one last filter (regex or not) that will remove everything after the TLD. All of these items are in an array.
EDIT
The function I'm using:
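function filterByDomain(array) {
  // A regex literal avoids string-escaping pitfalls: inside new RegExp("..."), "\b" is a backspace character.
  // No g flag: test() on a global regex keeps lastIndex between calls.
  var regex = /([^.\n]+\.[a-z]{2,6}\b)/i;
  return array.filter(function(text){
    return regex.test(text);
  });
}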
You can probably use this regex to match your TLD for each case:
/^[^.\n]+\.[a-z]{2,63}$/im
Your validation function can be:
function filterByDomain(array) {
  // No g flag: test() on a global regex keeps lastIndex between calls, which would skip entries.
  var regex = /^[^.\n]+\.[a-z]{2,63}$/im;
  return array.filter(function(text){
    return regex.test(text);
  });
}
PS: Do read this Q & A to see that up to 63 characters are allowed in TLD.
I'd match all leading [\w.] and omit the last dot, if any:
var result = url.match(/^[\w\.]+/).join("");
if(result.slice(-1)==".") result = result.slice(0,-1);
Note that \w should be replaced with something more sophisticated:
_ is part of the \w set but should not appear in a domain name
- is not part of \w but can appear in a domain name when not adjacent to . or -
To keep the regexp simple and the code readable, I'd do it this way:
replace _ with # in the url (both # and _ can only appear after the TLD)
replace - with _ (_ is part of \w)
after the regexp test, replace _ back with -
A URL like www.-example-.com would still pass; this can be detected by searching for [.-]{2,}
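For the original goal of dropping everything after the TLD, one possible sketch is to keep only the leading hostname of each array entry. `stripAfterTld` is a hypothetical name; the pattern assumes labels of letters, digits, or hyphens and an alphabetic TLD, and it does not check the TLD against any real list.

```javascript
// Hypothetical helper: keeps only the leading hostname of each entry.
function stripAfterTld(entries) {
  // Dot-separated labels of letters, digits, or hyphens, ending in an alphabetic TLD.
  const hostPattern = /^[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,63}/i;
  return entries
    .map((entry) => entry.match(hostPattern))
    .filter((match) => match !== null) // drop entries with no hostname at the start
    .map((match) => match[0]);
}

console.log(stripAfterTld(['EXAMPLE.ORG>.', 'EXAMPLE.COM(COMMENT)"', 'DEPT.EXAMPLE.COM']));
// [ 'EXAMPLE.ORG', 'EXAMPLE.COM', 'DEPT.EXAMPLE.COM' ]
```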

Javascript - Regex finding multiple parentheses matches

So currently, my code works for inputs that contain one set of parentheses.
var re = /^.*\((.*\)).*$/;
var inPar = userIn.replace(re, '$1');
...meaning when the user enters the chemical formula Cu(NO3)2, alerting inPar returns NO3) , which I want.
However, if Cu(NO3)2(CO2)3 is the input, only CO2) is being returned.
I'm not too knowledgeable in RegEx, so why is this happening, and is there a way I could put NO3) and CO2) into an array after they are found?
You want to use String.match instead of String.replace. You'll also want your regex to match multiple strings in parentheses, so you can't have ^ (start of string) and $ (end of string). And we can't be greedy when matching inside the parentheses, so we'll use .*?
Stepping through the changes, we get:
// Use Match
"Cu(NO3)2(CO2)3".match(/^.*\((.*\)).*$/);
["Cu(NO3)2(CO2)3", "CO2)"]
// Lets stop including the ) in our match
"Cu(NO3)2(CO2)3".match(/^.*\((.*)\).*$/);
["Cu(NO3)2(CO2)3", "CO2"]
// Instead of matching the entire string, lets search for just what we want
"Cu(NO3)2(CO2)3".match(/\((.*)\)/);
["(NO3)2(CO2)", "NO3)2(CO2"]
// Oops, we're being a bit too greedy, and capturing everything in a single match
"Cu(NO3)2(CO2)3".match(/\((.*?)\)/);
["(NO3)", "NO3"]
// Looks like we're only searching for a single result. Lets add the Global flag
"Cu(NO3)2(CO2)3".match(/\((.*?)\)/g);
["(NO3)", "(CO2)"]
// Global returns the entire matches and ignores our capture groups, so let's remove them
"Cu(NO3)2(CO2)3".match(/\(.*?\)/g);
["(NO3)", "(CO2)"]
// Now to remove the parentheses. We can use Array.prototype.map for that!
var elements = "Cu(NO3)2(CO2)3".match(/\(.*?\)/g);
elements = elements.map(function(match) { return match.slice(1, -1); })
["NO3", "CO2"]
// And if you want the closing parenthesis as Fabrício Matté mentioned
var elements = "Cu(NO3)2(CO2)3".match(/\(.*?\)/g);
elements = elements.map(function(match) { return match.substr(1); })
["NO3)", "CO2)"]
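As an alternative to stripping the parentheses afterwards, `String.prototype.matchAll` keeps capture groups even with the global flag, so the inner text can be read directly from group 1. This is a small sketch with a hypothetical function name, `insideParens`, and it assumes the parentheses are not nested.

```javascript
// Hypothetical helper: returns the text inside each pair of parentheses.
function insideParens(formula) {
  // matchAll requires the g flag and yields one match object per occurrence.
  return Array.from(formula.matchAll(/\(([^)]*)\)/g), (m) => m[1]);
}

console.log(insideParens('Cu(NO3)2(CO2)3'));
// [ 'NO3', 'CO2' ]
```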
Your regex has anchors to match beginning and end of the string, so it won't suffice to match multiple occurrences. Updated code using String.match with the RegExp g flag (global modifier):
var userIn = 'Cu(NO3)2(CO2)3';
var inPar = userIn.match(/\([^)]*\)/g).map(function(s){ return s.substr(1); });
inPar; //["NO3)", "CO2)"]
In case you need old IE support: Array.prototype.map polyfill
Or without polyfills:
var userIn = 'Cu(NO3)2(CO2)3';
var inPar = [];
userIn.replace(/\(([^)]*\))/g, function(s, m) { inPar.push(m); });
inPar; //["NO3)", "CO2)"]
Above matches a ( and captures a sequence of zero or more non-) characters, followed by a ) and pushes it to the inPar array.
The first regex does essentially the same, but uses the entire match including the opening ( parenthesis (which is later removed by mapping the array) instead of a capturing group.
From the question I assume the closing ) parenthesis is expected to be in the resulting strings, otherwise here are the updated solutions without the closing parenthesis:
For the first solution (using s.slice(1, -1)):
var inPar = userIn.match(/\([^)]*\)/g).map(function(s){ return s.slice(1, -1);});
For the second solution (\) outside of capturing group):
userIn.replace(/\(([^)]*)\)/g, function(s, m) { inPar.push(m); });
You could try the below:
"Cu(NO3)2".match(/(\S\S\d)/gi) // returns ["NO3"]
"Cu(NO3)2(CO2)3".match(/(\S\S\d)/gi) // returns ["NO3", "CO2"]
