Conditional matching complex regex in capturing group - javascript

I am writing a regex for my git commit-msg hook and can't deal with the last part.
This is my regex
/^GRP\-[0-9]+\s(FIX|CHANGE|HOTFIX|FEATURE){1}\s(CORE|SHARED|ADM|CSR|BUS|OTHER){1}\s-\s.+/
My commit messages can have 2 variations.
GRP-0888 FIX OTHER - (jest.config.js) : Fix testMatch option issue
GRP-0888 FIX OTHER - Fix testMatch option issue
My current regex accepts both, because it stops checking after the -.
So after the dash it does not validate the format at all.
I want it to check these 2 conditions and match them respectively.
If a ( follows the dash, continue with that pattern and do all the checks:
both the opening and closing brackets must be present, and the
closing bracket must be followed by a space, a :, another space and the rest of
the commit description. This should match the 1st pattern.
If an alphanumeric character follows the -, it should match the 2nd pattern.
I tried an alternation inside a capturing group, but it fails. My guess is that it fails because the second alternative can always match everything.
/^GRP\-[0-9]+\s(FIX|CHANGE|HOTFIX|FEATURE){1}\s(CORE|SHARED|ADM|CSR|BUS|OTHER){1}\s-\s(\(.+\)\s:\s.+|.+)/
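To make the guess above concrete, here is a small sketch that runs the attempted pattern against two valid and two invalid messages from the question. The names `attempt`, `samples` and `attemptResults` are only illustrative. Because the second alternative `.+` accepts any non-empty text, all four messages pass.

```javascript
// The attempted pattern from the question, unchanged.
const attempt = /^GRP\-[0-9]+\s(FIX|CHANGE|HOTFIX|FEATURE){1}\s(CORE|SHARED|ADM|CSR|BUS|OTHER){1}\s-\s(\(.+\)\s:\s.+|.+)/;

const samples = [
  "GRP-0888 FIX OTHER - (jest.config.js) : Fix testMatch option issue", // should pass
  "GRP-0888 FIX OTHER - Fix testMatch option issue",                    // should pass
  "GRP-0988 FIX CORE - (Some change",                                   // should fail
  "GRP-0988 FIX CORE - ()",                                             // should fail
];

// When the first alternative fails, the engine falls back to `.+`,
// which matches whatever follows the dash.
const attemptResults = samples.map((s) => attempt.test(s));
console.log(attemptResults); // [ true, true, true, true ]
```

So the alternation itself is fine; the fallback branch has to be restricted (for example by excluding parentheses) for the invalid messages to be rejected.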
UPDATED
These commit message patterns are invalid and shouldn't pass
GRP-0988 FIX CORE - (Some change)
GRP-0988 FIX CORE - (Some change) - Some description
GRP-0988 FIX CORE - (
GRP-0988 FIX CORE - ()
GRP-0988 FIX CORE - (Some change
GRP-0988 FIX CORE - Some change)

This is the regex after the dash: (?:(\(.*?\))(?:\s:\s))?(?!\(.*?\)(?:\s-\s)?)(.+)
(?:(\(.*?\))(?:\s:\s))? - Optionally match the following two parts, without capturing the group as a whole
(\(.*?\)) - Match and capture the part within parentheses. Use the non-greedy quantifier *? to prevent leaking into the rest of the string if there is another ) after the part you want to match.
(?:\s:\s) - Match and discard the colon and surrounding spaces
(?!\(.*?\)(?:\s-\s)?) - Negative lookahead to ensure it does not match messages such as (Some change) and (Some change) - Some description
\(.*?\) - Match stuff within parentheses
(?:\s-\s)? - Optionally match a hyphen surrounded by spaces
(.+) - Match and capture the rest of the commit message
let formats = [
"GRP-0888 FIX OTHER - (jest.config.js) : Fix testMatch option issue",
"GRP-0888 FIX OTHER - Fix testMatch option issue",
"GRP-0900 FIX CORE - (Some change) - Some change. If there are (Some text)",
"GRP-0988 FIX CORE - (Some change)",
"GRP-0988 FIX CORE - (Some change) - Some description",
]
let regex = /^GRP\-[0-9]+\s(FIX|CHANGE|HOTFIX|FEATURE)\s(CORE|SHARED|ADM|CSR|BUS|OTHER)\s-\s(?:(\(.*?\))(?:\s:\s))?(?!\(.*?\)(?:\s-\s)?)(.+)/
for (let format of formats) {
console.log(format.match(regex))
}

You might use:
^GRP-[0-9]+\s(FIX|CHANGE|HOTFIX|FEATURE)\s(CORE|SHARED|ADM|CSR|BUS|OTHER)\s-\s(?=[^a-zA-Z0-9]*[a-zA-Z0-9])(?:\([^()]*\)\s:\s)?[^()]*$
Explanation
^ Start of string
GRP-[0-9]+\s Match GRP- 1+ digits and a whitespace char
(FIX|CHANGE|HOTFIX|FEATURE) Capture one of the alternatives in group 1
\s Match a single whitespace char
(CORE|SHARED|ADM|CSR|BUS|OTHER) Capture one of the alternatives in group 2
\s-\s Match - between 2 whitespace chars
(?=[^a-zA-Z0-9]*[a-zA-Z0-9]) Positive lookahead, assert an alphanumeric to the right
(?:\([^()]*\)\s:\s)? Optionally match (...) followed by :
[^()]* Match optional chars other than ( or )
$ End of string
See a regex101 demo
const regex = /^GRP-[0-9]+\s(FIX|CHANGE|HOTFIX|FEATURE)\s(CORE|SHARED|ADM|CSR|BUS|OTHER)\s-\s(?=[^a-zA-Z0-9]*[a-zA-Z0-9])(?:\([^()]*\)\s:\s)?[^()]*$/;
[
"GRP-0888 FIX OTHER - (jest.config.js) : Fix testMatch option issue",
"GRP-0888 FIX OTHER - Fix testMatch option issue",
"GRP-0988 FIX CORE - (Some change)",
"GRP-0988 FIX CORE - (Some change) - Some description",
"GRP-0988 FIX CORE - (",
"GRP-0988 FIX CORE - ()",
"GRP-0988 FIX CORE - (Some change",
"GRP-0988 FIX CORE - Some change)"
].forEach(s =>
console.log(`${regex.test(s)} ---> ${s}`)
)
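Since the question is about a commit-msg hook, here is a minimal sketch of how the pattern above could be wrapped for that purpose. The function name `checkCommitMessage` is hypothetical, and validating only the first line of the message is an assumption, not something the question states. In a real hook the text would be read from the file whose path git passes to commit-msg as its first argument, and the hook would exit with a non-zero status when the check fails.

```javascript
// Pattern taken from the answer above.
const commitPattern = /^GRP-[0-9]+\s(FIX|CHANGE|HOTFIX|FEATURE)\s(CORE|SHARED|ADM|CSR|BUS|OTHER)\s-\s(?=[^a-zA-Z0-9]*[a-zA-Z0-9])(?:\([^()]*\)\s:\s)?[^()]*$/;

// Hypothetical helper: validates the subject line of a commit message.
function checkCommitMessage(message) {
  // Assumption: only the first line is checked, so a free-form body
  // (which may contain parentheses) does not affect the result.
  const subject = message.split(/\r?\n/)[0].trimEnd();
  return commitPattern.test(subject);
}

console.log(checkCommitMessage("GRP-0888 FIX OTHER - Fix testMatch option issue")); // true
console.log(checkCommitMessage("GRP-0988 FIX CORE - (Some change)"));               // false
```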

Neither of the {1} quantifiers that follow the grouped alternations is necessary.
As for the only 2 patterns ... either (<fileName>) : <fileChangeMessage> or <changeMessage> ... which are allowed to follow the OP's opening sequence of ... GRP-<version> <type> <target> - ... one has to target precisely this alternation, which ...
... either is a character sequence enclosed by parentheses, followed by a colon enclosed by whitespace, followed by at least one more character ... (\(.*\)\s\:\s.{1,}) ...
... or is a parentheses-free character sequence running to the end of the line ... ([^()]+$).
Therefore something like /^GRP ... \s-\s((\(.*\)\s\:\s.{1,})|([^()]+$))/ matches only the allowed lines among the examples provided by the OP, which are ...
GRP-0988 FIX CORE - (Some change)
GRP-0988 FIX CORE - (Some change) - Some description
GRP-0888 FIX OTHER - (jest.config.js) : Fix testMatch option issue
GRP-0988 FIX CORE - (
GRP-0988 FIX CORE - ()
GRP-0888 FIX OTHER - Fix testMatch option issue
GRP-0988 FIX CORE - (Some change
GRP-0988 FIX CORE - Some change)
The shortened regex above looks like this in its entirety ...
/^GRP-[0-9]+\s(FIX|CHANGE|HOTFIX|FEATURE)\s(CORE|SHARED|ADM|CSR|BUS|OTHER)\s-\s((\(.*\)\s\:\s.{1,})|([^()]+$))/gm
... and in case one wants to also capture the above named data in detail, one could make use of named capturing groups, as well as of matchAll and mapping, all based on the following pattern ... ^GRP-(?<version>[0-9]+)\s(?<type>...)\s(?<target>...)\s-\s(?:(?:(?<file>\(.*\))\s\:\s(?<fileChangeMessage>.{1,}))|(?<changeMessage>[^()]+$)).
The named-group pattern above looks like this in its entirety ...
/^GRP-(?<version>[0-9]+)\s(?<type>FIX|CHANGE|HOTFIX|FEATURE)\s(?<target>CORE|SHARED|ADM|CSR|BUS|OTHER)\s-\s(?:(?:(?<file>\(.*\))\s\:\s(?<fileChangeMessage>.{1,}))|(?<changeMessage>[^()]+$))/gm
Both regular expressions placed into some example code then lead to ...
const multilineSampleData =
`GRP-0988 FIX CORE - (Some change)
GRP-0988 FIX CORE - (Some change) - Some description
GRP-0888 FIX OTHER - (jest.config.js) : Fix testMatch option issue
GRP-0988 FIX CORE - (
GRP-0988 FIX CORE - ()
GRP-0888 FIX OTHER - Fix testMatch option issue
GRP-0988 FIX CORE - (Some change
GRP-0988 FIX CORE - Some change)`;
const regXMatch =
/^GRP-[0-9]+\s(FIX|CHANGE|HOTFIX|FEATURE)\s(CORE|SHARED|ADM|CSR|BUS|OTHER)\s-\s((\(.*\)\s\:\s.{1,})|([^()]+$))/gm;
const regXNamedGroups =
/^GRP-(?<version>[0-9]+)\s(?<type>FIX|CHANGE|HOTFIX|FEATURE)\s(?<target>CORE|SHARED|ADM|CSR|BUS|OTHER)\s-\s(?:(?:(?<file>\(.*\))\s\:\s(?<fileChangeMessage>.{1,}))|(?<changeMessage>[^()]+$))/gm;
console.log(
'multiline sample data (just two expected matches) ...\n',
multilineSampleData
);
console.log(
'matching lines only ...',
multilineSampleData
.match(regXMatch)
);
console.log(
'existing group properties only of each matching line ...',
[
...multilineSampleData
.matchAll(regXNamedGroups)
]
.map(({ groups: { file, fileChangeMessage, changeMessage, ...rest } }) => ({
...rest,
...(file && { file, fileChangeMessage } || { changeMessage }),
}))
);

Related

Regex pattern to extract name from emails

I need some help to find a regex expression to extract user names from these emails:
(Regex newbie here)
john.stewartcompany1#example.com
bruce.williamscompany1#example.com
richard.weiss#example.com
julia.palermocompany2#example.com
edward.philipscompany3#example.com
As you can see from the emails, almost all of them have the company name following the name. (company1, company2, company3)
But some emails have no company inserted. (See richard.weiss)
All of them will have #example.com
So I need to extract only the names, without the company, like this:
john.stewart
bruce.williams
richard.weiss
julia.palermo
edward.philips
I've come up with this pattern so far:
/(.+)(?=#example.com)/g
This only solves half of the problem, as it keeps the company name in the names.
john.stewartcompany1
bruce.williamscompany1
richard.weiss
julia.palermocompany2
edward.philipscompany3
I still need to remove the company names from the user names.
Is there a way to accomplish this with a single regex pattern?
Any help appreciated.
PS:
Thanks for the replies. I forgot to mention...
The company names are limited.
We can safely assume from my example that there will be only
company1, company2 and company3.
Thanks.
You can use
^.*?(?=(?:company1|company2|company3)?#)
See the regex demo.
Details:
^ - start of string
.*? - any zero or more chars other than line break chars as few as possible
(?=(?:company1|company2|company3)?#) - a positive lookahead that requires the following subpatterns to match immediately to the right of the current location:
(?:company1|company2|company3)? - an optional company1, company2 or company3 char sequence
# - a # char.
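A short sketch of the pattern above applied to the addresses from the question. The variable names are only illustrative, and the # separator is kept exactly as it appears in the question.

```javascript
const emails = [
  "john.stewartcompany1#example.com",
  "bruce.williamscompany1#example.com",
  "richard.weiss#example.com",
  "julia.palermocompany2#example.com",
  "edward.philipscompany3#example.com",
];

// The lazy .*? stops at the first position where an optional
// company name followed by # can be matched.
const namePattern = /^.*?(?=(?:company1|company2|company3)?#)/;

const names = emails.map((email) => {
  const match = email.match(namePattern);
  return match ? match[0] : null;
});

console.log(names);
// [ 'john.stewart', 'bruce.williams', 'richard.weiss', 'julia.palermo', 'edward.philips' ]
```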

Regular Expression to find merge conflicts in file

This is the file which contains merge conflicts,
<<<<<<< HEAD
$conf['some_unit_id'] = '4-qw-gg-ds-sometext';
=======
// Some Snippets Site Info
$conf['site_info'] = array(
'customer_service_phone' => '+1 323223232
'logo_path' => 'https://www.google.com/img/icons/src/logo.svg',
'currency' => 'CAD',
'https://www.youtube.com/user/somewebsite/ogog',
'https://www.instagram.com/somewebsite/',
),
);
>>>>>>> ff6df3435231fdff78fwsd83e7dffa0732eft554
// Somes code
$done['rules'] = TRUE;
I am trying to find the best regular expression to detect merge conflicts in the file. Initially I tried:
/(<* HEAD)/
This detects only HEAD preceded by some < characters.
There are other markers as well, such as:
1. ======
2. >>>>> ff6df3435231fdff78fwsd83e7dffa0732eft554
These two markers must be detected along with the HEAD marker. And if a developer removes only the <<<<<<< HEAD marker and leaves the rest, i.e. ======= and >>>>>>> ff6df3435231fdff78fwsd83e7dffa0732eft554, the regular expression should detect that as well.
I am using this regular expression in a pre-commit hook: if any one of the patterns is detected in a file, the commit should be aborted. I need an exact regex to detect merge conflict markers.
Any solution would be appreciated.
Since the markers are all the same length, you can use a character class:
/^[<=>]{7}( .+)?$/mg
(make sure to use the multiline flag)
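As a rough sketch of how this pattern could be used in a check, the snippet below lists every marker line in a string. The helper name `findConflictMarkers` is hypothetical and the sample text is an abbreviated version of the file in the question. Because each marker line is matched on its own, a file where only the HEAD marker was removed is still flagged.

```javascript
const markerPattern = /^[<=>]{7}( .+)?$/mg;

// Hypothetical helper: returns every conflict-marker line found in the text.
function findConflictMarkers(text) {
  // With the g flag, match() returns all matches, or null when there are none.
  return text.match(markerPattern) || [];
}

// Abbreviated sample based on the file in the question.
const conflicted = [
  "<<<<<<< HEAD",
  "$conf['some_unit_id'] = '4-qw-gg-ds-sometext';",
  "=======",
  "$conf['currency'] = 'CAD';",
  ">>>>>>> ff6df3435231fdff78fwsd83e7dffa0732eft554",
  "$done['rules'] = TRUE;",
].join("\n");

const markers = findConflictMarkers(conflicted);
console.log(markers.length); // 3

// Only the HEAD marker removed: the two remaining markers are still reported.
const partial = conflicted.replace("<<<<<<< HEAD\n", "");
console.log(findConflictMarkers(partial).length); // 2
```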
You can use:
^<{7} HEAD(?:(?!={7})[\s\S])*={7}(?:(?!>{7} \w+)[\s\S])*>{7} \w+
Demo & explanation
You might also match the block line by line, checking the start of each line, to avoid some of the unnecessary backtracking caused by [\s\S].
First match the <<<<<<< HEAD part, then match all following lines that do not start with =======, and then match the ======= line itself.
Then match all lines that do not start with >>>>>>>, followed by the >>>>>>> line itself and one or more chars from [a-z0-9].
^<{7} HEAD(?:\r?\n(?!={7}\r?\n).*)*\r?\n={7}(?:\r?\n(?!>{7} ).*)*\r?\n>{7} [a-z0-9]+
Regex demo
If you want to highlight the markers, you could use a capturing group:
^(<{7} HEAD)(?:\r?\n(?!={7}\r?\n).*)*\r?\n(={7})(?:\r?\n(?!>{7} ).*)*\r?\n(>{7} [a-z0-9]+)
Regex demo
If I understand you correctly, you want to find the blocks of code whose conflicts still need to be resolved. I hope my suggestion helps.
/^<{7}\sHEAD[\s\S]+?>{7}\s\w+$/gm
Details:
Mode: multiline
^<{7}\sHEAD: the block starts with <<<<<<< HEAD
[\s\S]+?: match any character as few times as possible (line breaks included)
>{7}\s\w+$: the block ends with >>>>>>> followed by the commit hash
Demo

Javascript regex to match all & character in the text ignoring encodings like ), etc

My requirement is given a string like this,
Edit the Expression &1 Text to & se&e matches ). Roll & over ma&tches & or t
I need to select all '&' characters, ignoring the ones that belong to an encoding. I have managed to select all the encoded characters. Here is a demo. Now I need to ignore them and select the other '&' characters.
Your regex may be considered as work in progress, e.g. to match & you also may write your current regex as &(?:#x?)?(?:\d{2}|\w{4});. To generalize it a bit, you may even change it to /&(?:#x?)?\w{1,4};/.
Your question is how to negate these entities, and match & in all other locations. It is easy to achieve with a capturing group and a bit of code.
var s = "Edit the Expression & Text to & see matches ). Roll & over ma&tches & or the expression for details. Undo mistakes with ctrl-z. Save Favorites & Share expressions with friends or the Community. Explore & your results with & Tools. A full & Reference & Help is ava&ilable in & the Library, or watch the video Tutorial. & or & or &";
var re = /(&(?:#x?)?\w{1,4};)|&/g;
var result = s.replace(re, function($0,$1) {return $1 ? $1 : "&amp;";});
console.log(result);
Here, the pattern is /(&(?:#x?)?\w{1,4};)|&/g - (<YOUR_NEGATED_PATTERN>)|&. Your pattern is captured into Group 1 and when a match is found, Group 1 value is checked: if Group 1 matched, the entity is put back into the resulting string. All other & are turned into &amp;.
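A compact sketch of the same capture-or-match idea on a short input where the entities are written out explicitly. The sample text and the names `entityOrAmp`, `bareOffsets` and `escaped` are made up for illustration, and replacing a bare & with `&amp;` is an assumption about what should happen to it.

```javascript
// Hypothetical sample: two bare ampersands and three entities.
const text = "Tom & Jerry &amp; friends &#41; R&D &#x26; more";

// Group 1 captures an entity; the second alternative matches a bare &.
const entityOrAmp = /(&(?:#x?)?\w{1,4};)|&/g;

const bareOffsets = [];
const escaped = text.replace(entityOrAmp, (match, entity, offset) => {
  if (entity) return entity; // an entity: put it back unchanged
  bareOffsets.push(offset);  // a bare &: remember where it was
  return "&amp;";            // assumed replacement for a bare &
});

console.log(escaped);
// Tom &amp; Jerry &amp; friends &#41; R&amp;D &#x26; more
```

Note that this pattern treats anything of the form & plus one to four word characters plus ; as an entity, so text such as `&D;` would also be skipped.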

Javascript Regex capture multiline content from each newline entry

I am using a JavaScript regex to process and transform some raw data into a 2D array.
Task Briefing (JS only):
Transforming raw string data to 2D array.
Raw Data Input :
Here is a sample with 4 entries; each new entry starts on a new line. Entry 3 has multiline content.
2012/12/1, AM12:21 - user1‬: entry1_wasehhjdsaj
2012/12/2, AM9:42 - user2‬: entry2_bahbahbah_dsdeead
2012/12/2, AM9:44 - user3‬: entry3_Line1_ContdWithFollowingLine_bahbahbah
entry3_Line2_ContdWithABoveLine_bahbahbah_erererw
entry3_Line3_ContdWithABoveLine_bahbahbah_dsff
2012/12/4, AM11:48 - user7‬: entry4_bahbahbah_fggf
(raw string data, without empty line. )
Updated: Sorry for the confusion; the content does not necessarily end with the same END pattern, just a line break.
How does the pattern actually end? (Thanks to @Tim Pietzcker's comment.)
The content ends with a line break, followed by the start of the next entry's timestamp. (You can assume the entry contents do not contain any similar timestamp pattern.)
I understand this may be a troublesome regex question, so ANY OTHER JS METHOD ACHIEVING THE SAME GOAL WILL ALSO BE ACCEPTED.
My current regex with capture group:
/^([0-9]{4}|[0-9]{2})[\/]([0]?[1-9]|[1][0-2])[\/]([0]?[1-9]|[1|2][0-9]|[3][0|1]), ([A|P])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (.*)/gm
Desired capture result:
MATCH 1
2012
12
1
A
12
21
user1‬
entry1_wasehhjdsaj
MATCH 2
2012
12
2
A
9
42
user2‬
entry2_bahbahbah_dsdeead
MATCH 3
2012
12
2
A
9
44
user3‬
entry3_Line1_ContdWithFollowingLine_bahbahbah entry3_Line2_ContdWithABoveLine_bahbahbah_erererw entry3_Line3_ContdWithABoveLine_bahbahbah_dsff
MATCH 4
(to be skipped...)
Problem:
There is a problem when I capture Entry 3: I can't capture the 2nd and 3rd lines of its content. If an entry contains only ONE line of content, the regex works fine.
How can I capture Entry 3 with its multi-line content? I tried the m modifier, but I have no idea how to deal with multi-line content and entries starting on new lines at the same time.
If this is impossible to achieve with a JS regex, please suggest another JS approach to transform the raw data into a 2D array, which is the ultimate goal.
THANKS!
the end of contents do not necessary come with same END pattern, but just a line break.
Testing: https://regex101.com/r/eS9pY5/1
Multiline doesn't work that way in JavaScript, but you can work around it with [\s\S]. This class matches every character, including \n. Note the *? instead of * after it, which stops it from being greedy so that it only goes as far as the first END:
^([0-9]{4}|[0-9]{2})[\/]([0]?[1-9]|[1][0-2])[\/]([0]?[1-9]|[1|2][0-9]|[3][0|1]), ([A|P])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ([\s\S]*?END)$
See: https://regex101.com/r/mT8rI4/3
Dots (.) don't match newline characters. There is a character class that matches everything ([\S\s]), but you don't want to use that without precautions - otherwise [\S\s]* would match all the entries at once.
So you need to tell the regex engine to stop matching when the next match begins. We can use a negative lookahead assertion for that, and we'll just feed the timestamp pattern into that:
/^([0-9]{4}|[0-9]{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - ([^:]*): ((?:(?!^([0-9]{4}|[0-9]{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d))[\S\s])*)/gm
Test it live on regex101.com.
Here is a single regex that will match the strings you have the way you need:
^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ((?:(?!(?:\d{4}|\d{2})\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))[\s\S])*)(?=\n|$)
See demo
The last capturing group is no longer a greedy dot matching .* but a tempered greedy token (?:(?!([0-9]{4}|[0-9]{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01]))[\s\S])* matching everything up to the end of string or the date pattern.
If we unroll it to make it more efficient:
^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (\D*(?:\d(?!(?:\d{3}|\d)\/(?:0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]))\D*)*)(?=\n|$)
See another demo
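Since the question also accepts non-regex-only approaches, here is a hedged sketch that first splits the text wherever a line starts with a timestamp and then parses each entry on its own. The timestamp check is simplified (an assumption; the stricter date ranges from the answers above can be substituted), the sample is abbreviated from the question's data, and all names are illustrative.

```javascript
// Abbreviated sample based on the question's data.
const raw = [
  "2012/12/1, AM12:21 - user1: entry1_wasehhjdsaj",
  "2012/12/2, AM9:42 - user2: entry2_bahbahbah_dsdeead",
  "2012/12/2, AM9:44 - user3: entry3_Line1",
  "entry3_Line2",
  "entry3_Line3",
  "2012/12/4, AM11:48 - user7: entry4_bahbahbah_fggf",
].join("\n");

// Zero-width match at the start of every line that begins with a timestamp.
const entryStart = /^(?=\d{2,4}\/\d{1,2}\/\d{1,2}, [AP]M\d{1,2}:\d{2} - )/m;

// Applied to a single entry, so [\s\S]* can safely take all remaining lines.
const entryPattern = /^(\d{4}|\d{2})\/(\d{1,2})\/(\d{1,2}), ([AP])M(\d{1,2}):(\d{2}) - ([^:]*): ([\s\S]*)$/;

const rows = raw
  .split(entryStart)                                  // one string per entry
  .map((entry) => entry.trimEnd().match(entryPattern))
  .filter(Boolean)                                    // drop anything that did not parse
  .map((m) => [...m.slice(1, 8), m[8].replace(/\r?\n/g, " ")]); // join content lines with spaces

console.log(rows[2][7]); // entry3_Line1 entry3_Line2 entry3_Line3
```

Splitting first keeps the per-entry pattern simple, because the end of an entry no longer has to be detected with a lookahead inside the content group.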

reg should start with $, from then on match every full word seperated with a comma

I'm trying to create a regex for this:
.domain.com$object-subrequest,third-party,domain=domain.com|domain2.com|domain3.com|domain4.com
The best result would be to match it into a result like this:
- object-subrequest
- third-party
- domain
-- domain.com
-- domain2.com
-- domain3.com
-- domain4.com
But I don't know if that's even possible. A result like this would be okay as well:
- object-subrequest
- third-party
- domain
And then another regex to filter out all domains like this:
-- domain.com
-- domain2.com
-- domain3.com
-- domain4.com
So far I've only been able to come up with this:
https://regex101.com/r/wP8cY7/1
/(script|image|stylesheet|object|xmlhttprequest|subdocument|document|elemhide|other|third-party|domain|sitekey|match-case|collapse|donottrack),*/g
As you can see, this matches everything containing one of the words, I only need everything after the $.
I use only Javascript (no jQuery).
If you can get rid of everything before the $ in some other way, this regexp is near what you want I think:
/[$,](script|image|stylesheet|object|xmlhttprequest|subdocument|document|elemhide|other|third-party|domain|sitekey|match-case|collapse|donottrack)/gi
Using R, split the input x at dollar signs and commas and drop the first component (the part before the $). Then split each remaining component at equals signs and pipe characters:
s <- strsplit(x, "[$,]")[[1]][-1]
strsplit(s, "[=|]")
giving:
[[1]]
[1] "object-subrequest"
[[2]]
[1] "third-party"
[[3]]
[1] "domain" "domain.com" "domain2.com" "domain3.com" "domain4.com"
Note: We used this as the input x:
x <- ".domain.com$object-subrequest,third-party,domain=domain.com|domain2.com|domain3.com|domain4.com"
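Because the question asks for JavaScript, here is a rough equivalent of the split approach above. It is a sketch, not a translation of any particular library, and the variable names are illustrative. It assumes the options start after the first $ in the rule.

```javascript
const rule = ".domain.com$object-subrequest,third-party,domain=domain.com|domain2.com|domain3.com|domain4.com";

// Assumption: everything after the first "$" is the comma-separated option list.
const optionsPart = rule.slice(rule.indexOf("$") + 1);

// Split into options, then split each option at "=" and "|".
const options = optionsPart.split(",").map((option) => option.split(/[=|]/));

console.log(options);
// [
//   [ 'object-subrequest' ],
//   [ 'third-party' ],
//   [ 'domain', 'domain.com', 'domain2.com', 'domain3.com', 'domain4.com' ]
// ]
```

The first element of each inner array is the option name, and any remaining elements are its values.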
