Not able to match multiline content in a SWIFT message (RegEx) - javascript

I want to go over a SWIFT message using RegEx. I have the following excerpt from it:
:16R:FIN
:35B:ISIN CH0117044708
ANTEILE -DT USD- SWISSCANTO (CH)
INDEX FUND V - SWISSCANTO (CH)
INDEX EQUITY FUND USA
:16R:FIA
I am trying to fit the complete information in group 3:
ISIN CH0117044708
ANTEILE -DT USD- SWISSCANTO (CH)
INDEX FUND V - SWISSCANTO (CH)
INDEX EQUITY FUND USA
Instead, I am getting: ISIN CH0117044708 only.
My RegEx doesn't work and I am trying to debug and can't find the solution. This is the RegEx expression: /:([0-9]{2}[A-Z]){1}(::|:)((.*\r\n){1,4}|.*)/gm
Here to play around with it:
https://regex101.com/r/qX9cET/2
Edit:
How would we go about matching this pattern (optional):
([A-Z]*)(?:\/\/)?(.*(?:\/)?){0,2}
No // and / in line
// and a single /
// and two /
Included in the old one (https://regex101.com/r/Ubci69/5):
:16R:FIN
:97A::SAFE//0123-456789-11-020
:35B:ISIN CH0117044708
ANTEILE -DT USD- SWISSCANTO (CH)
INDEX FUND V - SWISSCANTO (CH)
INDEX EQUITY FUND USA
:16R:FIA
:93B::AGGR//UNIT/0,117
:19A::HOLD//CHF237,15
:92B::EXCH//JPY/CHF/0,0087535442107

One way to capture everything in the third capturing group is to use [\s\S] instead of the dot, so that line breaks are matched as well, combined with a negative lookahead (?!...) asserting that what is on the right is not the start of the next field, :[0-9]{2}[A-Z]:, which is what you match at the beginning.
Note that you can omit {1}, and if you don't need the first and the second capturing group you could drop those so that your value ends up in the first capturing group.
:([0-9]{2}[A-Z])(::|:)((?:[\s\S](?!:[0-9]{2}[A-Z]:))*)
Explanation
: Match a colon literally
([0-9]{2}[A-Z]) First capturing group: two digits followed by an uppercase character
(::|:) Second capturing group: two colons or one colon
( Start third capturing group
(?: Start non-capturing group
[\s\S] Match any character, including line breaks
(?! Negative lookahead to assert that what is on the right is not
:[0-9]{2}[A-Z]: a colon, two digits, an uppercase character and a colon
) Close negative lookahead
)* Close non-capturing group and repeat zero or more times
) Close third capturing group
Update: A more efficient version of the above regex uses the dot. It matches the tag with the colons at the start and then any characters up to the end of the line, followed by an optional line break. After that, a negative lookahead asserts that the next line does not start with the colon-delimited tag, and the whole line is matched. This repeats for as long as the lookahead succeeds.
:([0-9]{2}[A-Z])(::|:)(.*(?:\r?\n)?(?:(?!:[0-9]{2}[A-Z]:).*(?:\r?\n)?)*)
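As a quick check, the dot-based pattern can be run in JavaScript against the excerpt from the question. This is only a minimal sketch: the variable names are illustrative, and the sample is assumed to use \n line breaks (the pattern also accepts \r\n).

```javascript
// Excerpt from the question, joined with \n (an assumption about the input).
const swiftMessage = [
  ':16R:FIN',
  ':35B:ISIN CH0117044708',
  'ANTEILE -DT USD- SWISSCANTO (CH)',
  'INDEX FUND V - SWISSCANTO (CH)',
  'INDEX EQUITY FUND USA',
  ':16R:FIA',
].join('\n');

// Group 1: tag, group 2: colon(s), group 3: value, possibly spanning lines.
const fieldPattern =
  /:([0-9]{2}[A-Z])(::|:)(.*(?:\r?\n)?(?:(?!:[0-9]{2}[A-Z]:).*(?:\r?\n)?)*)/g;

const swiftFields = [...swiftMessage.matchAll(fieldPattern)].map((m) => ({
  tag: m[1],
  value: m[3].trimEnd(), // drop the trailing line break
}));

// Prints the four lines of the 35B field.
console.log(swiftFields[1].value);
```

Trimming the trailing line break is a choice made for this sketch; group 3 itself includes it.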

Related

Regex: select until first space or comma occurrence

I have following example of american addresses.
6301 Stonewood Dr Apt-728, Plano TX-75024
13323 Maham Road, Apt # 1621, Dallas, TX 75240
17040 Carlson Drive, #1027 Parker, CO 80134
3465 25th St., San Francisco, CA 94110
I want to extract the city from each address using a regex:
Plano, Dallas, Parker, San Francisco
I am using the following regex, which works for the first example:
(?<=[,.|•]).*\s+(?=[\s,.]?CA?[\s,.-]?[\d]{4,})
Can you help me make it work for the others as well?
You can match the comma, then all except A-Z and capture from the first occurrence of A-Z.
,[^A-Z,]*?\b([A-Z][^,]*?),?\s*[A-Z]{2}[-\s]\d{4,}\s*$
Explanation
,[^A-Z,]*?\b Match a comma, then any chars except A-Z or a comma, as few as possible, up to a word boundary
([A-Z][^,]*?) Capture group 1: match A-Z and then any chars except a comma, as few as possible
,?\s*[A-Z]{2} Match an optional comma, optional whitespace chars and 2 uppercase chars A-Z
[-\s]\d{4,}\s* Match either - or a whitespace char and then 4 or more digits followed by optional whitespace chars
$ End of string
You can use
,(?:\s*#\d+)?\s*([^\s,][^,]*)(?=\W+[A-Z]{2}\W*\d{4,}\s*$)
See the regex demo. The necessary value is in Group 1.
Details:
, - a comma
(?:\s*#\d+)? - an optional sequence of zero or more whitespaces, # and then one or more digits
\s* - zero or more whitespaces
([^\s,][^,]*) - Group 1: a char other than whitespace and comma and then zero or more non-comma chars
(?=\W+[A-Z]{2}\W*\d{4,}\s*$) - a positive lookahead that requires (immediately on the right)
\W+ - one or more non-word chars
[A-Z]{2} - two uppercase ASCII letters
\W* - zero or more non-word chars
\d{4,} - four or more digits
\s* - zero or more whitespaces
$ - end of string.
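A short JavaScript sketch of how this pattern could be applied to the four sample addresses. The variable names are illustrative, and returning null for a non-matching address is an assumption of this example.

```javascript
const addresses = [
  '6301 Stonewood Dr Apt-728, Plano TX-75024',
  '13323 Maham Road, Apt # 1621, Dallas, TX 75240',
  '17040 Carlson Drive, #1027 Parker, CO 80134',
  '3465 25th St., San Francisco, CA 94110',
];

// Group 1 holds the city; the lookahead ties it to the state and ZIP at the end.
const cityPattern = /,(?:\s*#\d+)?\s*([^\s,][^,]*)(?=\W+[A-Z]{2}\W*\d{4,}\s*$)/;

const cities = addresses.map((address) => {
  const match = address.match(cityPattern);
  return match ? match[1] : null; // null when no city is found
});

console.log(cities); // [ 'Plano', 'Dallas', 'Parker', 'San Francisco' ]
```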
Another approach (assuming the structure of ending is more or less fixed)
.+\s(\w+?),?.{4}\d{4,}
The best I could come up with starts from the end of the string and looks for a run of non-whitespace characters (the part you are looking for), followed by an optional comma, a space, a run of capital letters, an optional space or dash and finally a run of digits.
([^\s]+?)\,?\s[A-Z]+[\s\-]?\d+$
The first group holds the value you are after.
This is a live example with your use case embedded:
https://regexr.com/6nkq5
(As a side note, the demo on regexr may tell you the expression took more than 250ms and can't be rendered. Slightly editing the test case makes it update and show the actual result.)
As long as the city always comes right before the (exactly) two state letters, you can use that simple condition to match your city.
(?<= )[A-Za-z ]+(?=,? [A-Z]{2})
Your match [A-Za-z ]+ will be found between
(?<= ): a space and
(?=,? [A-Z]{2}): an optional comma + a space + two uppercase letters
Check the demo here.

Replace last word with asterisk, or last two words

I need to hide the surnames of persons. For persons with three words in their name, just hide the last word, e.g.:
Laura Torres Bermudez
should be
Laura Torres ********
and for
Maria Fernanda Gonzales Lopez
should be
Maria Fernanda ******** *****
I think two regexes are needed, since which one is applied depends on the number of words.
I know that replacing \w+ turns each whole word into a single asterisk, and that with (?!\s). I can replace every char except spaces. I hope you can help me. Thanks.
This is my example:
https://regex101.com/r/yW4aZ3/942
Try this:
(?<=\w+\s+\w+\s+.*)[^\s]
Explanation:
(?<=...) is a positive lookbehind - match only occurrences preceded by the specified pattern
[^\s] - match any character except whitespace (what you used, (?!\s)., is an unusual use of a lookahead: "look at the next character, and if it is not whitespace, match any character")
summary: replace any non-whitespace character preceded by at least two sequences of word characters (\w) and whitespace (\s).
Just note that it won't hide anything for persons with only two words in their name (which is common in many countries).
Also, the regex has to be slightly modified for that testing tool to match one name per line - see https://regex101.com/r/yW4aZ3/943 (^ was added to match from start of each line and a "multi line" flag was set).
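A minimal JavaScript sketch of the lookbehind approach, applied to one name at a time. The function name maskSurnames is hypothetical, the ^ anchor is added so the count always starts at the beginning of the name, and the engine is assumed to support lookbehind (ECMAScript 2018 or later).

```javascript
// Hypothetical helper; requires lookbehind support (ECMAScript 2018+).
function maskSurnames(fullName) {
  // Replace every non-whitespace char that comes after the first two words.
  return fullName.replace(/(?<=^\w+\s+\w+\s+.*)\S/g, '*');
}

console.log(maskSurnames('Laura Torres Bermudez'));         // Laura Torres ********
console.log(maskSurnames('Maria Fernanda Gonzales Lopez')); // Maria Fernanda ******** *****
console.log(maskSurnames('Ana Lopez'));                     // Ana Lopez (unchanged)
```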
A JavaScript solution that does not rely on the ECMAScript 2018 extended regex features is
s = s.replace(/^(\S+\s+\S+)([\s\S]*)/, function($0, $1, $2) {return $1 + $2.replace(/\S/g, '*');})
Details:
^ - start of string
(\S+\s+\S+) - Group 1: one or more non-whitespaces, 1 or more whitespaces and then 1 or more non-whitespaces
([\s\S]*) - Group 2: any 0 or more chars.
The replacement is Group 1 contents and the contents of Group 2 with each non-whitespace char replaced with an asterisk.
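Wrapped in a function, the callback version could be used as sketched here. The name maskAfterTwoWords is hypothetical and only serves this example.

```javascript
// Hypothetical helper; works without lookbehind support.
function maskAfterTwoWords(fullName) {
  return fullName.replace(/^(\S+\s+\S+)([\s\S]*)/, (whole, firstTwo, rest) =>
    // Keep the first two words, mask every non-whitespace char in the rest.
    firstTwo + rest.replace(/\S/g, '*')
  );
}

console.log(maskAfterTwoWords('Laura Torres Bermudez'));         // Laura Torres ********
console.log(maskAfterTwoWords('Maria Fernanda Gonzales Lopez')); // Maria Fernanda ******** *****
```

Names with one or two words are returned unchanged, since either the pattern does not match or Group 2 is empty.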
Java solution:
s = s.replaceAll("(\\G(?!^)\\s*|^\\S+\\s+\\S+\\s+)\\S", "$1*");
See the regex demo
Details
(\G(?!^)\s*|^\S+\s+\S+\s+) - Group 1: either the end of the previous match (\G(?!^)) and 0 or more whitespaces or (|) 1+ non-whitespaces, 1+ whitespaces and again 1+ non-whitespaces, 1+ whitespaces at the start of the string
\S - a non-whitespace char.
Interested if this can be done in JavaScript without a callback, I came up with
str = str.replace(/^(\w+\W+\w+\W+\b)\w?|(?!^)(\W*)\w/gy, '$1$2*');
See this demo at regex101
The idea might look a bit confusing, but it seems to work fine. It fails on one or two words and starts matching as soon as a word character appears after the first two words. It is important to use the sticky flag y, which is similar to the \G anchor (continue at the last match): each match must begin exactly where the previous one ended, starting at the beginning of the string.
To avoid adding an extra asterisk, the ...\b)\w?... part after the first two words is essential. The word boundary forces a third word to start; the first capturing group is closed after \b, so the first character of the third word is consumed but not captured, which keeps the asterisk count correct.
The second capturing group, on the right side of the alternation, captures any non-word characters appearing between the words from the third one onward.
var strs = ['Foo', 'Foo Bar B', 'Laura Torres Bermudez', 'Maria Fernanda Gonzales Lopez'];
strs = strs.map(str => str.replace(/^(\w+\W+\w+\W+\b)\w?|(?!^)(\W*)\w/gy, '$1$2*'));
console.log(strs);

Match group before nth character and after that

I want to match everything before the nth character (except the first character) and everything after it. So for the following string
/firstname/lastname/some/cool/name
I want to match
Group 1: firstname/lastname
Group 2: some/cool/name
With the following regex, I'm nearly there, but I can't find the correct regex to also correctly match the first group and ignore the first /:
([^\/]*\/){3}([^.]*)
Note that I always want to match the 3rd forward slash. Everything after that can be any character that is valid in an URL.
Your regex groups do not give the expected result because in ([^\/]*\/){3} you are repeating a capturing group, and each iteration overwrites the previously captured value.
You can use
^.([^/]+\/[^/]+)\/(.*)$
let str = `/firstname/lastname/some/cool/name`
let op = str.match(/^.([^/]+\/[^/]+)\/(.*)$/)
console.log(op)
Ignore the first /, then capture the first two words, then capture the rest of the string after the next /.
^(?:\/)([^\/]+\/[^\/]+)\/(.+)
See example
The quantifier {3} repeats 3 times the capturing group, which will have the value of the last iteration.
The first iteration will match /, the second firstname/ and the third (the last iteration) lastname/ which will be the value of the group.
The second group captures whatever [^.]* matches, which is 0+ chars other than a literal dot, and that does not take the structure of the data into account.
If you want to match the full pattern, you could use:
^\/([^\/]+\/[^\/]+)\/([^\/]+(?:\/[^\/]+)+)$
Explanation
^ Start of string
( Capture group 1
[^\/]+\/[^\/]+ Match 1+ chars other than / using a negated character class, then a /, then again 1+ chars other than /
) Close group
\/ Match /
( Capture group 2
[^\/]+ Match 1+ times not /
(?:\/[^\/]+)+ Repeat 1+ times matching / and 1+ times not / to match the pattern of the rest of the string.
) Close group
$ End of string
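The difference between the repeated capturing group and the full pattern can be seen in a short JavaScript sketch. The variable names are illustrative only.

```javascript
const urlPath = '/firstname/lastname/some/cool/name';

// Repeated capturing group: only the last iteration is kept in group 1.
const repeated = urlPath.match(/([^\/]*\/){3}([^.]*)/);
console.log(repeated[1]); // lastname/
console.log(repeated[2]); // some/cool/name

// Full pattern: both parts are captured as wanted.
const fixed = urlPath.match(/^\/([^\/]+\/[^\/]+)\/([^\/]+(?:\/[^\/]+)+)$/);
console.log(fixed[1]); // firstname/lastname
console.log(fixed[2]); // some/cool/name
```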

Update regex pattern to allow .xx instead of just 0.xx

I have a regular expression that I use to test against user input that expects currency. This statement allows an optional dollar sign, allows optional commas (as long as they are correctly placed), and allows a single decimal point as long as it's followed by at least another number.
^\$?\d{1,3}(,?\d{3})*(\.\d{1,2})?$
Examples like
$12.12
0.34
12,000
12,000000
are all allowed by design. There is one, however, that doesn't match and that I would like to allow. If a user wants to enter a number like .34, it must be preceded by a zero. So 0.34 matches, but .34 doesn't.
Here's how I updated the statement to fix this.
^(\$?\d{1,3}(,?\d{3})*)?(\.\d{1,2})?$
I've made the entire statement before the decimal point a capturing group and made it optional. What I'm worried about now, though, is that my entire regex statement is enclosed by two capturing groups which are both optional. I don't want a blank input to match this pattern, and I think it will. Is there a better option for what I'm trying to accomplish?
Edit: My original statement doesn't match .12. The second, updated statement does; however, because the entire statement is wrapped in optional capturing groups, a blank input would also match this pattern, and that is not desired.
Your optional group is the correct way to proceed. Note that non-capturing groups that are only used to group sequences of subpatterns are more efficient when you do not have to access the captured subvalues later.
The only thing missing is a guard against matching an empty string. You can add one with a positive lookahead (?=.) or a negative lookahead (?!$):
^(?!$)(?:\$?\d{1,3}(?:,?\d{3})*)?(?:\.\d{1,2})?$
See the regex demo
Details
^ - start of string
(?!$) - no end of string right after the start of string
(?: - start of an optional non-capturing group
\$? - 1 or 0 $ symbols
\d{1,3} - 1 to 3 digits
(?: - start of a non-capturing group repeated 0+ times, matching sequences of
,? - 1 or 0 commas
\d{3} - 3 digits
)* - end of the non-capturing group
)? - end of the optional non-capturing group
(?:\.\d{1,2})? - an optional non-capturing group matching 1 or 0 sequences of
\. - a dot
\d{1,2} - 1 or 2 digits
$ - end of string.
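A small JavaScript sketch to check the pattern against the inputs discussed in the question. The sample list and variable names are illustrative.

```javascript
const currencyPattern = /^(?!$)(?:\$?\d{1,3}(?:,?\d{3})*)?(?:\.\d{1,2})?$/;

// The first five inputs should be accepted, the last three rejected.
const samples = ['$12.12', '0.34', '12,000', '12,000000', '.34', '', ' ', '12.'];

const results = samples.map((input) => currencyPattern.test(input));
console.log(results);
// [ true, true, true, true, true, false, false, false ]
```

The empty string is rejected by (?!$), while a single space and a trailing dot fail because nothing in the pattern can consume them.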
You can use the "or" syntax |
^\$?(\d{1,3}(,?\d{3})*|0)(\.\d{1,2})?$
I would also suggest that you don't need to capture the inner group (,?\d{3})*
^\$?(\d{1,3}(?:,?\d{3})*|0)(\.\d{1,2})?$
(\d{1,3}(?:,?\d{3})*|0) either an amount \d{1,3}(?:,?\d{3})* or 0

why do these characters belong to the first group in this JS regex match?

I am trying to write a regex to find two meaningful groups within a substring that's part of a text I'm working with.
The text and my attempt are here:
https://regex101.com/r/6Sc3aM/1
The complete regex:
Artikelnummer(?:(?:&&&))(.*)(?:\s*.*)\W?(?:Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite)(&&&([^(&&&)]+)&&&([^(&&&)]+)&&&(\d+))+
The test string:
%5B"Deckblatt: Anlagendokumentation&&&Produktdaten&&&KKS-Nummer&&&Hersteller&&&Typ&&&Artikelnummer&&&MA-KF1&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF11&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF12&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF13&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF14&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF15&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF16&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF17&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF18&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF19&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF20&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF21&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF22&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF23&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF24&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF25&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF26&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF27&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF28&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF29&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF30&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF31&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF32&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF33&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF34&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF35&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF36&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF37&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF38&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF39&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF40&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF41&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&all&&&Vorwort&&&6&&&all&&&Produktübersicht&&&7&&&all&&&Grundlagen&&&8&&&all&&&Montage und Verdrahtung&&&9&&&all&&&Inbetriebnahme%2FAnwendungshinweise&&&10&&&all&&&Fehlerbehandlung und Diagnose&&&11&&&all&&&Anhang 1&&&12&&&all&&&Anhang 2&&&13&&&all&&&Anhang 3&&&14&&&all&&&Anhang 4&&&15&&&all&&&Anhang 5&&&16&&&all&&&Anhang 6&&&17&&&all&&&Anhang 7&&&18&&&all&&&Anhang 
8&&&19&&&all&&&Anhang 9&&&20&&&all&&&Anhang 10&&&21&&&all&&&Anhang 11&&&22&&&all&&&Anhang 12&&&23&&&all&&&Anhang 13&&&24&&&all&&&Anhang 14&&&25&&&all&&&Anhang 15&&&26&&&all&&&Anhang 16&&&27&&&all&&&Anhang 17&&&28&&&all&&&Anhang 18&&&29&&&all&&&Anhang 19&&&30&&&all&&&Anhang 20&&&31&&&all&&&Anhang 21&&&32&&&all&&&Anhang 22&&&33&&&all&&&Anhang 23&&&34&&&all&&&Anhang 24&&&35&&&all&&&Anhang 25&&&36&&&all&&&Anhang 26&&&37&&&all&&&Anhang 27&&&38&&&all&&&Anhang 28&&&39&&&all&&&Anhang 29&&&40&&&all&&&Anhang 30&&&41&&&all&&&Anhang 31&&&42&&&all&&&Anhang 32&&&43&&&all&&&Anhang 33&&&44&&&all&&&Anhang 34&&&45&&&all&&&Anhang 35&&&46&&&all&&&Anhang 36&&&47&&&all&&&Anhang 37&&&48&&&all&&&Anhang 38&&&49&&&all&&&Anhang 39&&&50&&&all&&&Anhang 40&&&51&&&all&&&Anhang 41&&&52&&&all&&&Anhang 42&&&53"%5D
The regex I wrote should get a first group, which appears after /Artikelnummer/ and before /Dokumentation&&&/ (etc), as well as a second group, which is what I'm having trouble with:
It should consist of repetitions of this pattern: (&&&([^(&&&)]+)&&&([^(&&&)]+)&&&(\d+))+
By my reckoning, that should capture the entire substring:
&&&all&&&Vorwort&&&6&&&all&&&Produktübersicht&&&7&&&all&&&Grundlagen&&&8&&&all&&&Montage und Verdrahtung&&&9&&&all&&&Inbetriebnahme%2FAnwendungshinweise&&&10&&&all&&&Fehlerbehandlung und Diagnose&&&11&&&all&&&Anhang 1&&&12&&&all&&&Anhang 2&&&13&&&all&&&Anhang 3&&&14&&&all&&&Anhang 4&&&15&&&all&&&Anhang 5&&&16&&&all&&&Anhang 6&&&17&&&all&&&Anhang 7&&&18&&&all&&&Anhang 8&&&19&&&all&&&Anhang 9&&&20&&&all&&&Anhang 10&&&21&&&all&&&Anhang 11&&&22&&&all&&&Anhang 12&&&23&&&all&&&Anhang 13&&&24&&&all&&&Anhang 14&&&25&&&all&&&Anhang 15&&&26&&&all&&&Anhang 16&&&27&&&all&&&Anhang 17&&&28&&&all&&&Anhang 18&&&29&&&all&&&Anhang 19&&&30&&&all&&&Anhang 20&&&31&&&all&&&Anhang 21&&&32&&&all&&&Anhang 22&&&33&&&all&&&Anhang 23&&&34&&&all&&&Anhang 24&&&35&&&all&&&Anhang 25&&&36&&&all&&&Anhang 26&&&37&&&all&&&Anhang 27&&&38&&&all&&&Anhang 28&&&39&&&all&&&Anhang 29&&&40&&&all&&&Anhang 30&&&41&&&all&&&Anhang 31&&&42&&&all&&&Anhang 32&&&43&&&all&&&Anhang 33&&&44&&&all&&&Anhang 34&&&45&&&all&&&Anhang 35&&&46&&&all&&&Anhang 36&&&47&&&all&&&Anhang 37&&&48&&&all&&&Anhang 38&&&49&&&all&&&Anhang 39&&&50&&&all&&&Anhang 40&&&51&&&all&&&Anhang 41&&&52&&&all&&&Anhang 42&&&53
But, for some reason, the only string in group 2 is:
&&&Anhang 42&&&53
Why is this happening?
You get &&&all&&&Anhang 42&&&53 in Group 2 because the (pattern)+ is a repeated capturing group that stores only the value captured at the last iteration.
It seems you need
/Artikelnummer&&&([\s\S]*?)&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite((?:(?:&&&[^&]*(?:&&?[^&]+)*){2}&&&\d+)+)/g
See the regex demo
The first capturing group just matches any 0+ chars from Artikelnummer&&& till the first occurrence of &&&Dokumentation..., and the second one grabs 1+ occurrences of &&&...&&&...&&& + digit(s).
Details
Artikelnummer&&& - a literal substring
([\s\S]*?) - Group 1 matching any 0+ chars, as few as possible up to the
&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite - literal substring
((?:(?:&&&[^&]*(?:&&?[^&]+)*){2}&&&\d+)+) - Group 2 matching 1+ occurrences of:
(?:&&&[^&]*(?:&&?[^&]+)*){2} - two occurrences of:
&&& - a literal substring
[^&]*(?:&&?[^&]+)* - any 0+ chars other than & and then 0+ sequences of & or && followed with 1+ chars other than &
&&& - a literal substring
\d+ - 1+ digits.
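Since a repeated group keeps only its last iteration, one way to get at the individual rows is a second pass over Group 2. The JavaScript sketch below is an illustration only: the sample is a shortened stand-in for the test string, and rowPattern and the variable names are assumptions of this example.

```javascript
// Shortened stand-in for the test string (assumption: same layout, fewer rows).
const sample =
  'Typ&&&Artikelnummer&&&MA-KF1&&&Beckhoff&&&EK1100&&&BECK%2EEK1100' +
  '&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite' +
  '&&&all&&&Vorwort&&&6&&&all&&&Montage und Verdrahtung&&&9"%5D';

const blockPattern =
  /Artikelnummer&&&([\s\S]*?)&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite((?:(?:&&&[^&]*(?:&&?[^&]+)*){2}&&&\d+)+)/;
const block = sample.match(blockPattern);

// Second pass: one match per row, so no iteration is overwritten.
const rowPattern = /&&&([^&]*(?:&&?[^&]+)*)&&&([^&]*(?:&&?[^&]+)*)&&&(\d+)/g;
const rows = [...block[2].matchAll(rowPattern)].map((m) => m.slice(1));

console.log(block[1]); // MA-KF1&&&Beckhoff&&&EK1100&&&BECK%2EEK1100
console.log(rows);     // [ [ 'all', 'Vorwort', '6' ], [ 'all', 'Montage und Verdrahtung', '9' ] ]
```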
Notes on performance: the first capturing group pattern should be made more specific if you need better performance. Right now, the lazy [\s\S]*? pattern is slow, and if the substring between the first and second delimiter grows, there might be performance issues.
