Regex how to capture repeating values without capturing spaces around text - javascript

I am trying to capture multiple values that will be in the following format:
prof:
prof1
prof2
prof3
...
I don't know how many values there will be in the list, and there may be none at all. What I want to capture are prof1, prof2, prof3, etc., without the whitespace on either side. I have a starter regex:
prof:\s*([\w-]*)
This captures the first prof value, but none of the others. If I add a * at the end of the capture group, none of them are captured. If I add [] on either side of the capture group, it results in an error where it can't figure out what the closing parenthesis is for.
Basically, the pattern is, some amount of whitespace, capture text, some amount of whitespace, capture text, etc. But I can't figure out the proper regex for that to work.

I'm guessing that this expression in m mode might be an option, not sure though:
([\s\S]*?)(prof:)|([\w-]*)
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.

Another option could be to match prof: and capture everything after it in a capturing group, making sure that there are 1+ empty lines between prof1, prof2, etc.
Then split that group on 1+ whitespace chars using \s+
\bprof:[ \t]*((?:(?:\n[ \t]*$)+\n[ \t]+[\w-]+)*)
Explanation
\bprof:[ \t]* Word boundary, match prof: followed by 0+ tab/spaces
( Capture group 1
(?: Non capturing group
(?:\n[ \t]*$)+ Match 1+ times a newline, 0+ tabs/spaces and assert the end of the line (with the m flag, $ matches at line ends)
\n[ \t]+[\w-]+ Match newline, 1+ tabs/spaces, 1+ wordchars/hyphen
)* Close non capturing group and repeat 0+ times
) Close capture group 1
Regex demo
const regex = /\bprof:[ \t]*((?:(?:\n[ \t]*$)+\n[ \t]+[\w-]+)*)/m;
const str = `prof:
prof1
prof2
prof3
...`;
let res = str.match(regex)[1].split(/\s+/).filter(Boolean);
console.log(res);
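The pattern above only matches values that sit on indented lines preceded by at least one empty line, and the sample string as reproduced here has neither. Below is a minimal sketch that assumes that layout (the blank lines and indentation in `sample` are an assumption about the original input). The helper name `extractProfs` is hypothetical, and the null check covers input without a prof: header.

```javascript
// Hypothetical helper; the layout of `sample` (blank line, then an indented
// value) is an assumption about the input, not something the question confirms.
function extractProfs(text) {
  const regex = /\bprof:[ \t]*((?:(?:\n[ \t]*$)+\n[ \t]+[\w-]+)*)/m;
  const match = text.match(regex);
  if (!match) return []; // no "prof:" header at all
  // Group 1 holds all values plus surrounding whitespace; split and drop empties.
  return match[1].split(/\s+/).filter(Boolean);
}

const sample = "prof:\n\n    prof1\n\n    prof2\n\n    prof3\n";
console.log(extractProfs(sample));           // [ 'prof1', 'prof2', 'prof3' ]
console.log(extractProfs("prof:"));          // []
console.log(extractProfs("no header here")); // []
```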

Related

Regex pattern matching budget numbers issue

I'm having an issue and I'm hoping there is someone who is more knowledgeable with Regex that can help me out.
I'm trying to extract data from a PDF file which contains budget line items. I'm using this regex pattern to get the index of the first number so I can then extract the numbers to the right.
Regex pattern:
(([(]?[0-9]+[)]? )|([(]?[0-9]+[)]?)|(- )|(-))+$
Line item: 'Modernization and improvement (note 9) 260 (180) 640 - 155'
This works well for 99% of the line items except this one I came across. The problem is the pattern matches the '9)' in what is the text portion.
Is there any way with this Regex pattern to say if there are brackets, the inside must contain numbers only?
Thanks!
You can repeat all possible options until the end of the string:
(?:\(\d+\)|\d+(?:\s*-\s*\d+)?)(?:\s+(?:\(\d+\)|\d+(?:\s*-\s*\d+)?))*$
Explanation
(?: Non capture group
\(\d+\) Match 1+ digits between parentheses
| Or
\d+(?:\s*-\s*\d+)? Match 1+ digits and optionally match - and 1+ digits
) Close the non capture group
(?: Non capture group to repeat as a whole part
\s+ Match 1+ whitespace chars
(?:\(\d+\)|\d+(?:\s*-\s*\d+)?) Same as the first pattern
)* Close the non capture group and optionally repeat it
$ End of string
Regex demo
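As a quick check, here is a sketch that runs the pattern against the line item from the question. JavaScript is an assumption, since the question does not name a language, and the variable names are illustrative.

```javascript
// Pattern copied from the answer; everything else is illustrative.
const trailingNumbers = /(?:\(\d+\)|\d+(?:\s*-\s*\d+)?)(?:\s+(?:\(\d+\)|\d+(?:\s*-\s*\d+)?))*$/;
const line = "Modernization and improvement (note 9) 260 (180) 640 - 155";

const m = line.match(trailingNumbers);
// m.index is the position of the first number in the numeric tail.
const label = m ? line.slice(0, m.index).trim() : line;
const numbers = m ? m[0] : "";

console.log(label);   // Modernization and improvement (note 9)
console.log(numbers); // 260 (180) 640 - 155
```

The "9)" in the text portion is no longer picked up, because a digit run must either stand alone or be fully wrapped in parentheses, and the whole tail has to reach the end of the string.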

Replace second last occurrence of a char(dot) in an email string using regex

Please, I would love to replace the second last occurrence of a char in a string. The length of the strings can vary, but the delimiter is always the same. I will give some examples below, along with what I have tried.
Input 1: james.sam.uri.stackoverflow.com
Output 1: james.sam.uri#stackoverflow.com
Input 2: noman.stackoverflow.com
Output 2: noman#stackoverflow.com
Input 3: queen.elizabeth.empire.co.uk
Output 3: queen.elizabeth#empire.co.uk
My solution
//This works but I don't want this as it's not a regex solution
const e = "noman.stackoverflow.com"
var index = e.lastIndexOf(".", e.lastIndexOf(".")-1)
return `${e.substring(0,index)}#${e.substring(index+1)}`
Regex
e.replace(/\.(\.*)/, "#$1")
//this works for Input 2 but not Input 1; I need a regex that works for both, as it only matches the first dot
The issue with replacing the second last dot in the example data is that the last example ends with .co.uk
One option for these specific examples could be using a pattern to exclude that specific part.
(\S+)\.(?!co\.uk$)(\S*?\.[^\s.]+)$
(\S+) Capture group 1, match 1+ non whitespace chars
\.(?!co\.uk$) Match a . followed by a negative lookahead asserting directly to the right is not co.uk
( Capture group 2
\S*?\. Match 0+ times a non-whitespace char, non-greedy, and then a .
[^\s.]+ Match 1+ times a non whitespace char except a .
) Close group 2
$ End of string
See a regex demo.
[
"james.sam.uri.stackoverflow.com",
"noman.stackoverflow.com",
"queen.elizabeth.empire.co.uk"
].forEach(s =>
console.log(s.replace(/(\S+)\.(?!co\.uk$)(\S*?\.[^\s.]+)$/, "$1#$2"))
);
Here's another approach:
(\S+)\.(\S+\.\S{3,}?)$
( )$ At the end of the string, capture by
\S{3,}? lazily matching 3+ non-whitespace characters
\S+\. and any non-whitespace characters with period in front.
(\S+)\. Also capture anything before the separating period.
Notably, it would fail for an email like test.stackoverflow.co.net. If that format is a requirement, I'd recommend a different approach.
[
"james.sam.uri.stackoverflow.com",
"noman.stackoverflow.com",
"queen.elizabeth.empire.co.uk",
"test.stackoverflow.co.net"
].forEach(s =>
console.log(s.replace(/(\S+)\.(\S+\.\S{3,}?)$/, "$1#$2"))
);
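One possible "different approach" (not necessarily what the answer had in mind) is to skip the single regex and compare the last two labels against a list of known two-part suffixes. The list and the helper name `markDomain` below are hypothetical, and a real list would need to be maintained by hand.

```javascript
// Hypothetical suffix list; which two-part suffixes to support is an assumption.
const MULTI_PART_SUFFIXES = ["co.uk", "co.net"];

function markDomain(s) {
  const labels = s.split(".");
  const lastTwo = labels.slice(-2).join(".");
  const suffixLen = MULTI_PART_SUFFIXES.includes(lastTwo) ? 2 : 1;
  // The domain part is the suffix plus the one label in front of it.
  const cut = labels.length - (suffixLen + 1);
  if (cut < 1) return s; // nothing in front of the domain to separate
  return labels.slice(0, cut).join(".") + "#" + labels.slice(cut).join(".");
}

[
  "james.sam.uri.stackoverflow.com",
  "noman.stackoverflow.com",
  "queen.elizabeth.empire.co.uk",
  "test.stackoverflow.co.net"
].forEach(s => console.log(markDomain(s)));
```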

Replace last word with asterisk, or last two words

I need to hide the surnames of persons. For persons with three words in their name, just hide the last word, e.g.:
Laura Torres Bermudez
should be
Laura Torres ********
and for
Maria Fernanda Gonzales Lopez
should be
Maria Fernanda ******** *****
I think two regexes are needed, because which one applies depends on the number of words.
I know \w+ replaces each word with a single asterisk, and with (?!\s). I can replace chars except spaces. I hope you can help me. Thanks.
This is my example:
https://regex101.com/r/yW4aZ3/942
Try this:
(?<=\w+\s+\w+\s+.*)[^\s]
Explanation:
?<= is a positive lookbehind - match only occurrences preceded by the specified pattern
[^\s] - match everything except whitespace (what you used - (?!\s). - is actually a weird use of lookahead - "look at the next character; if it is not whitespace, then match any character")
summary: replace any non-whitespace character preceded by at least two sequences of word characters (\w) and whitespace (\s).
Just note that it won't hide anything for persons with only two words in their name (which is common in many countries).
Also, the regex has to be slightly modified for that testing tool to match one name per line - see https://regex101.com/r/yW4aZ3/943 (^ was added to match from start of each line and a "multi line" flag was set).
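For a single name per string, the lookbehind pattern can be tried in JavaScript as sketched below. This assumes an engine with ES2018 lookbehind support, and the helper name `maskWithLookbehind` is hypothetical.

```javascript
// Requires lookbehind support (ECMAScript 2018 and later).
// Hypothetical helper name; the pattern is the one from the answer plus the g flag.
const maskWithLookbehind = (name) =>
  name.replace(/(?<=\w+\s+\w+\s+.*)[^\s]/g, "*");

console.log(maskWithLookbehind("Laura Torres Bermudez"));
// Laura Torres ********
console.log(maskWithLookbehind("Maria Fernanda Gonzales Lopez"));
// Maria Fernanda ******** *****
console.log(maskWithLookbehind("Laura Torres"));
// Laura Torres (unchanged, only two words)
```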
A JavaScript solution that does not rely on the ECMAScript 2018 extended regex features is
s = s.replace(/^(\S+\s+\S+)([\s\S]*)/, function($0, $1, $2) {return $1 + $2.replace(/\S/g, '*');})
Details:
^ - start of string
(\S+\s+\S+) - Group 1: one or more non-whitespaces, 1 or more whitespaces and then 1 or more non-whitespaces
([\s\S]*) - Group 2: any 0 or more chars.
The replacement is Group 1 contents and the contents of Group 2 with each non-whitespace char replaced with an asterisk.
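The callback version can be wrapped and tried on the names from the question. This is only a sketch, and the function name `maskSurnames` is hypothetical.

```javascript
// Hypothetical wrapper around the replace call shown above.
function maskSurnames(name) {
  return name.replace(/^(\S+\s+\S+)([\s\S]*)/, (match, firstTwo, rest) =>
    // Keep the first two words, star out every non-whitespace char after them.
    firstTwo + rest.replace(/\S/g, "*"));
}

console.log(maskSurnames("Laura Torres Bermudez"));         // Laura Torres ********
console.log(maskSurnames("Maria Fernanda Gonzales Lopez")); // Maria Fernanda ******** *****
console.log(maskSurnames("Laura"));                         // Laura (no match, unchanged)
```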
Java solution:
s = s.replaceAll("(\\G(?!^)\\s*|^\\S+\\s+\\S+\\s+)\\S", "$1*");
See the regex demo
Details
(\G(?!^)\s*|^\S+\s+\S+\s+) - Group 1: either the end of the previous match (\G(?!^)) and 0 or more whitespaces or (|) 1+ non-whitespaces, 1+ whitespaces and again 1+ non-whitespaces, 1+ whitespaces at the start of the string
\S - a non-whitespace char.
Interested if this can be done in JavaScript without a callback, I came up with
str = str.replace(/^(\w+\W+\w+\W+\b)\w?|(?!^)(\W*)\w/gy, '$1$2*');
See this demo at regex101
The idea might look a bit confusing, but it seems to work fine. It fails to match on one or two words and starts matching as soon as a word character appears after the first two words. It is important to use the sticky flag y, which is similar to the \G anchor (continue at the end of the last match): every match has to begin exactly at lastIndex, which starts at 0.
To avoid adding an extra asterisk, the ...\b)\w?... part after the first two words is essential. The word boundary forces a third word to start, but the first capturing group is closed after \b, so the first character of the third word is consumed without being captured and the asterisk count comes out right.
The second capturing group on the right side of the alternation captures any optional non-word characters that appear between the words from the third one onwards.
var strs = ['Foo', 'Foo Bar B', 'Laura Torres Bermudez', 'Maria Fernanda Gonzales Lopez'];
strs = strs.map(str => str.replace(/^(\w+\W+\w+\W+\b)\w?|(?!^)(\W*)\w/gy, '$1$2*'));
console.log(strs);

Match group before nth character and after that

I want to match everything before the nth character (except the first character) and everything after it. So for the following string
/firstname/lastname/some/cool/name
I want to match
Group 1: firstname/lastname
Group 2: some/cool/name
With the following regex, I'm nearly there, but I can't find the correct regex to also correctly match the first group and ignore the first /:
([^\/]*\/){3}([^.]*)
Note that I always want to match the 3rd forward slash. Everything after that can be any character that is valid in an URL.
Your regex groups are not giving the proper result because in ([^\/]*\/){3} you're repeating a capturing group, and each iteration overwrites the previously captured value.
You can use
^.([^/]+\/[^/]+)\/(.*)$
let str = `/firstname/lastname/some/cool/name`
let op = str.match(/^.([^/]+\/[^/]+)\/(.*)$/)
console.log(op)
Ignoring the first /, then capturing the first two words, then capturing the rest of the phrase after the /.
^(?:\/)([^\/]+\/[^\/]+)\/(.+)
See example
The quantifier {3} repeats 3 times the capturing group, which will have the value of the last iteration.
The first iteration will match /, the second firstname/ and the third (the last iteration) lastname/ which will be the value of the group.
The second group captures [^.]*, which matches 0+ characters other than a literal dot and does not take the structure of the data into account.
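The overwriting behaviour described above can be observed directly with the pattern from the question (a small sketch, variable names are illustrative):

```javascript
const path = "/firstname/lastname/some/cool/name";
const m = path.match(/([^\/]*\/){3}([^.]*)/);

// The repeated group only keeps what its last iteration matched.
console.log(m[1]); // lastname/
console.log(m[2]); // some/cool/name
```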
If you want to match the full pattern, you could use:
^\/([^\/]+\/[^\/]+)\/([^\/]+(?:\/[^\/]+)+)$
Explanation
^ Start of string
( Capture group 1
[^\/]+\/[^\/]+ Match 1+ times not a / using a negated character class, then a /, then again 1+ times not a /
) Close group
\/ Match /
( Capture group 2
[^\/]+ Match 1+ times not /
(?:\/[^\/]+)+ Repeat 1+ times matching / and 1+ times not / to match the pattern of the rest of the string.
) Close group
$ End of string
Regex demo
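A short sketch of the full pattern in JavaScript, using the string from the question (variable names are illustrative). Note that group 2 requires at least two path segments after the third slash, so a shorter path does not match at all.

```javascript
const re = /^\/([^\/]+\/[^\/]+)\/([^\/]+(?:\/[^\/]+)+)$/;

const full = "/firstname/lastname/some/cool/name".match(re);
console.log(full[1]); // firstname/lastname
console.log(full[2]); // some/cool/name

// Only one segment after the third slash: group 2 cannot match.
const short = "/firstname/lastname/some".match(re);
console.log(short); // null
```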

Not able to match multiline content in a SWIFT message (RegEx)

I want to go over a SWIFT message using RegEx. I have the following excerpt from it:
:16R:FIN
:35B:ISIN CH0117044708
ANTEILE -DT USD- SWISSCANTO (CH)
INDEX FUND V - SWISSCANTO (CH)
INDEX EQUITY FUND USA
:16R:FIA
I am trying to fit the complete information in group 3:
ISIN CH0117044708
ANTEILE -DT USD- SWISSCANTO (CH)
INDEX FUND V - SWISSCANTO (CH)
INDEX EQUITY FUND USA
Instead, I am getting: ISIN CH0117044708 only.
My RegEx doesn't work and I am trying to debug and can't find the solution. This is the RegEx expression: /:([0-9]{2}[A-Z]){1}(::|:)((.*\r\n){1,4}|.*)/gm
Here to play around with it:
https://regex101.com/r/qX9cET/2
Edit:
How would we go about matching this pattern (optional):
([A-Z]*)(?:\/\/)?(.*(?:\/)?){0,2}
No // and / in line
// and a single /
// and two /
Included in the old one (https://regex101.com/r/Ubci69/5):
:16R:FIN
:97A::SAFE//0123-456789-11-020
:35B:ISIN CH0117044708
ANTEILE -DT USD- SWISSCANTO (CH)
INDEX FUND V - SWISSCANTO (CH)
INDEX EQUITY FUND USA
:16R:FIA
:93B::AGGR//UNIT/0,117
:19A::HOLD//CHF237,15
:92B::EXCH//JPY/CHF/0,0087535442107
One way to capture everything in the third capturing group is to use [\s\S] instead of the dot so that line breaks are matched as well, and to use a negative lookahead (?! to assert that what is on the right does not match :[0-9]{2}[A-Z]:, the tag you match at the beginning.
Note that you can omit {1}, and if you don't use the first and second capturing groups you could make them non-capturing to get your values in the first capturing group.
:([0-9]{2}[A-Z])(::|:)((?:[\s\S](?!:[0-9]{2}[A-Z]:))*)
Regex Demo
Explanation
: Match literally
([0-9]{2}[A-Z]) Match in the first capturing group 2 times a digit followed by an uppercase character
(::|:) Capture in the second capturing group two or one times a colon
( Start third capturing group
(?: Non capturing group
[\s\S] Match any character including line breaks
(?! Negative lookahead to assert what is on the right does not match
:[0-9]{2}[A-Z]: A colon, 2 times a digit, an uppercase character and a colon
) Close negative lookahead
)* Close non capturing group and repeat zero or more times
) Close third capturing group
Update: A more efficient version of the above regex using the dot. It matches the pattern with the colons at the start and then any character until the end of the line, followed by an optional line break. It then uses a negative lookahead to assert that the next line does not start with the colon pattern and matches that whole line, repeating this for every following line.
:([0-9]{2}[A-Z])(::|:)(.*(?:\r?\n)?(?:(?!:[0-9]{2}[A-Z]:).*(?:\r?\n)?)*)
Regex demo
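To see what the updated pattern returns, here is a sketch in JavaScript using the excerpt from the question. The g flag, `matchAll`, and the `trim()` call are additions for the demo, since group 3 includes the trailing line break, and the variable names are illustrative.

```javascript
const swift = [
  ":16R:FIN",
  ":35B:ISIN CH0117044708",
  "ANTEILE -DT USD- SWISSCANTO (CH)",
  "INDEX FUND V - SWISSCANTO (CH)",
  "INDEX EQUITY FUND USA",
  ":16R:FIA",
].join("\n");

// Pattern from the update above, with the g flag so matchAll can iterate.
const fieldRe = /:([0-9]{2}[A-Z])(::|:)(.*(?:\r?\n)?(?:(?!:[0-9]{2}[A-Z]:).*(?:\r?\n)?)*)/g;

const fields = [...swift.matchAll(fieldRe)].map((m) => ({
  tag: m[1],
  value: m[3].trim(), // drop the line break captured at the end of the field
}));

console.log(fields[1].tag);   // 35B
console.log(fields[1].value); // all four lines of the 35B field
```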
