Regex pattern matching budget numbers issue - javascript

I'm having an issue and I'm hoping there is someone who is more knowledgeable with Regex that can help me out.
I'm trying to extract data from a PDF file that contains budget line items. I'm using this regex pattern to get the index of the first number so I can then extract the numbers to the right.
Regex pattern:
(([(]?[0-9]+[)]? )|([(]?[0-9]+[)]?)|(- )|(-))+$
Line item: 'Modernization and improvement (note 9) 260 (180) 640 - 155'
This works well for 99% of the line items except this one I came across. The problem is the pattern matches the '9)' in what is the text portion.
Is there any way with this Regex pattern to say if there are brackets, the inside must contain numbers only?
Thanks!

You can repeat all possible options until the end of the string:
(?:\(\d+\)|\d+(?:\s*-\s*\d+)?)(?:\s+(?:\(\d+\)|\d+(?:\s*-\s*\d+)?))*$
Explanation
(?: Non capture group
\(\d+\) Match 1+ digits between parentheses
| Or
\d+(?:\s*-\s*\d+)? Match 1+ digits and optionally match - and 1+ digits
) Close the non capture group
(?: Non capture group to repeat as a whole part
\s+ Match 1+ whitespace chars
(?:\(\d+\)|\d+(?:\s*-\s*\d+)?) Same as the first pattern
)* Close the non capture group and optionally repeat it
$ End of string
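A minimal JavaScript sketch of how the pattern above could be applied to the line item from the question. The variable names are illustrative only. Note that the pattern treats `640 - 155` as one item, so the columns here are recovered by splitting the matched text on whitespace.

```javascript
// Pattern from the answer above: a run of (n) or n items anchored at the end of the line.
const pattern = /(?:\(\d+\)|\d+(?:\s*-\s*\d+)?)(?:\s+(?:\(\d+\)|\d+(?:\s*-\s*\d+)?))*$/;
const line = 'Modernization and improvement (note 9) 260 (180) 640 - 155';

const match = line.match(pattern);
// match.index is where the numeric columns begin; everything before it is text.
const label = line.slice(0, match.index).trim();
const columns = match[0].split(/\s+/);

console.log(label);   // Modernization and improvement (note 9)
console.log(columns); // [ '260', '(180)', '640', '-', '155' ]
```

The `9)` inside `(note 9)` is no longer picked up, because a closing parenthesis is only accepted as part of a complete `(digits)` item.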


Regex match any js number

So as an exercise I wanted to match any JS number. This is the one I could come up with:
/^(-?)(0|([1-9]\d*?|0)(\.\d+)?)$/
This however doesn't match the new syntax with underscore separators (1_2.3_4). I tried a couple of things but I couldn't come up with something that would work. How could I express all JS numbers in one regex?
For the format in the question, you could use:
^-?\d+(?:_\d+)*(?:\.\d+(?:_\d+)*)?$
See a regex demo.
Or allowing only a single leading zero:
^-?(?:0|[1-9]\d*)(?:_\d+)*(?:\.\d+(?:_\d+)*)?$
The pattern matches:
^ Start of string
-? Match an optional -
(?:0|[1-9]\d*) Match either 0 or 1-9 and optional digits
(?:_\d+)* Optionally repeat matching _ and 1+ digits
(?: Non capture group
\.\d+(?:_\d+)* Match . and 1+ digits and optionally repeat matching _ and 1+ digits
)? Close non capture group
$ End of string
See another regex demo.
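As a rough check, the second pattern can be run against a few sample strings. This is only a sketch covering the decimal forms discussed in the question; JavaScript also accepts literals such as `.5`, `1e3`, and `0xFF`, which this pattern does not attempt to handle. The sample list is made up for illustration.

```javascript
// Second pattern from the answer: optional sign, no extra leading zeros,
// underscore separators only between digits, optional fraction.
const jsNumber = /^-?(?:0|[1-9]\d*)(?:_\d+)*(?:\.\d+(?:_\d+)*)?$/;

const samples = ['0', '-10', '1_2.3_4', '0.5', '01', '1__2', '1_', '.5'];
const results = samples.map((s) => [s, jsNumber.test(s)]);

// The first four samples are accepted, the last four are rejected.
console.log(results);
```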
How about this?
^(-?)(0|([1-9]*?(\_)?(\d)|0|)(\.\d+)?(\_)?(\d))$

Regex how to capture repeating values without capturing spaces around text

I am trying to capture multiple values that will be in the following format:
prof:
prof1
prof2
prof3
...
I don't know how many there will be in the list, it's also possible there will be no values either, but what I want to capture are prof1, prof2, prof3, etc without the whitespace on either side. I have a starter regex:
prof:\s*([\w-]*)
This captures the first prof value, but none of the others. If I add a * at the end of the capture group, none of them are captured. If I add [] on either side of the capture group, it results in an error where it can't figure out what the closing parentheses is for.
Basically, the pattern is, some amount of whitespace, capture text, some amount of whitespace, capture text, etc. But I can't figure out the proper regex for that to work.
I'm guessing that this expression in m mode might be an option, not sure though:
([\s\S]*?)(prof:)|([\w-]*)
Another option is to match prof: and capture everything after it in a capturing group, making sure there are 1+ empty lines between prof1, prof2, etc.
Then split that group on 1+ whitespace chars (\s+):
\bprof:[ \t]*((?:(?:\n[ \t]*$)+\n[ \t]+[\w-]+)*)
Explanation
\bprof:[ \t]* Word boundary, match prof: followed by 0+ tab/spaces
( Capture group 1
(?: Non capturing group
(?:\n[ \t]*$)+ Match 1+ times a newline, 0+ tab/spaces and assert the end of the line (the m flag makes $ match at line ends)
\n[ \t]+[\w-]+ Match newline, 1+ tabs/spaces, 1+ wordchars/hyphen
)* Close non capturing group and repeat 0+ times
) Close capture group 1
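const regex = /\bprof:[ \t]*((?:(?:\n[ \t]*$)+\n[ \t]+[\w-]+)*)/m;
// Sample input. The blank lines and indentation are assumed here, since the pattern requires them.
const str = `prof:

    prof1

    prof2

    prof3

...`;
// Group 1 holds everything after "prof:"; split it on whitespace and drop empty strings.
let res = str.match(regex)[1].split(/\s+/).filter(Boolean);
console.log(res);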

why do these characters belong to the first group in this JS regex match?

I am trying to write a regex to find two meaningful groups within a substring that's part of a text I'm working with.
The text and my attempt are here:
https://regex101.com/r/6Sc3aM/1
The complete regex:
Artikelnummer(?:(?:&&&))(.*)(?:\s*.*)\W?(?:Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite)(&&&([^(&&&)]+)&&&([^(&&&)]+)&&&(\d+))+
The test string:
%5B"Deckblatt: Anlagendokumentation&&&Produktdaten&&&KKS-Nummer&&&Hersteller&&&Typ&&&Artikelnummer&&&MA-KF1&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF11&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF12&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF13&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF14&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF15&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF16&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF17&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF18&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF19&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF20&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF21&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF22&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF23&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF24&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF25&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF26&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF27&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF28&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF29&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF30&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF31&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF32&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF33&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF34&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF35&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF36&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF37&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF38&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF39&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF40&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&MA-KF41&&&Beckhoff&&&EK1100&&&BECK%2EEK1100&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&all&&&Vorwort&&&6&&&all&&&Produktübersicht&&&7&&&all&&&Grundlagen&&&8&&&all&&&Montage und Verdrahtung&&&9&&&all&&&Inbetriebnahme%2FAnwendungshinweise&&&10&&&all&&&Fehlerbehandlung und Diagnose&&&11&&&all&&&Anhang 1&&&12&&&all&&&Anhang 2&&&13&&&all&&&Anhang 3&&&14&&&all&&&Anhang 4&&&15&&&all&&&Anhang 5&&&16&&&all&&&Anhang 6&&&17&&&all&&&Anhang 7&&&18&&&all&&&Anhang 
8&&&19&&&all&&&Anhang 9&&&20&&&all&&&Anhang 10&&&21&&&all&&&Anhang 11&&&22&&&all&&&Anhang 12&&&23&&&all&&&Anhang 13&&&24&&&all&&&Anhang 14&&&25&&&all&&&Anhang 15&&&26&&&all&&&Anhang 16&&&27&&&all&&&Anhang 17&&&28&&&all&&&Anhang 18&&&29&&&all&&&Anhang 19&&&30&&&all&&&Anhang 20&&&31&&&all&&&Anhang 21&&&32&&&all&&&Anhang 22&&&33&&&all&&&Anhang 23&&&34&&&all&&&Anhang 24&&&35&&&all&&&Anhang 25&&&36&&&all&&&Anhang 26&&&37&&&all&&&Anhang 27&&&38&&&all&&&Anhang 28&&&39&&&all&&&Anhang 29&&&40&&&all&&&Anhang 30&&&41&&&all&&&Anhang 31&&&42&&&all&&&Anhang 32&&&43&&&all&&&Anhang 33&&&44&&&all&&&Anhang 34&&&45&&&all&&&Anhang 35&&&46&&&all&&&Anhang 36&&&47&&&all&&&Anhang 37&&&48&&&all&&&Anhang 38&&&49&&&all&&&Anhang 39&&&50&&&all&&&Anhang 40&&&51&&&all&&&Anhang 41&&&52&&&all&&&Anhang 42&&&53"%5D
The regex I wrote should get a first group, which appears after /Artikelnummer/ and before /Dokumentation&&&/ (etc), as well as a second group, which is what I'm having trouble with:
It should consist of repetitions of this pattern: (&&&([^(&&&)]+)&&&([^(&&&)]+)&&&(\d+))+
By my reckoning, that should capture the entire substring:
&&&all&&&Vorwort&&&6&&&all&&&Produktübersicht&&&7&&&all&&&Grundlagen&&&8&&&all&&&Montage und Verdrahtung&&&9&&&all&&&Inbetriebnahme%2FAnwendungshinweise&&&10&&&all&&&Fehlerbehandlung und Diagnose&&&11&&&all&&&Anhang 1&&&12&&&all&&&Anhang 2&&&13&&&all&&&Anhang 3&&&14&&&all&&&Anhang 4&&&15&&&all&&&Anhang 5&&&16&&&all&&&Anhang 6&&&17&&&all&&&Anhang 7&&&18&&&all&&&Anhang 8&&&19&&&all&&&Anhang 9&&&20&&&all&&&Anhang 10&&&21&&&all&&&Anhang 11&&&22&&&all&&&Anhang 12&&&23&&&all&&&Anhang 13&&&24&&&all&&&Anhang 14&&&25&&&all&&&Anhang 15&&&26&&&all&&&Anhang 16&&&27&&&all&&&Anhang 17&&&28&&&all&&&Anhang 18&&&29&&&all&&&Anhang 19&&&30&&&all&&&Anhang 20&&&31&&&all&&&Anhang 21&&&32&&&all&&&Anhang 22&&&33&&&all&&&Anhang 23&&&34&&&all&&&Anhang 24&&&35&&&all&&&Anhang 25&&&36&&&all&&&Anhang 26&&&37&&&all&&&Anhang 27&&&38&&&all&&&Anhang 28&&&39&&&all&&&Anhang 29&&&40&&&all&&&Anhang 30&&&41&&&all&&&Anhang 31&&&42&&&all&&&Anhang 32&&&43&&&all&&&Anhang 33&&&44&&&all&&&Anhang 34&&&45&&&all&&&Anhang 35&&&46&&&all&&&Anhang 36&&&47&&&all&&&Anhang 37&&&48&&&all&&&Anhang 38&&&49&&&all&&&Anhang 39&&&50&&&all&&&Anhang 40&&&51&&&all&&&Anhang 41&&&52&&&all&&&Anhang 42&&&53
But, for some reason, the only string in group 2 is:
&&&Anhang 42&&&53
Why is this happening?
You get &&&all&&&Anhang 42&&&53 in Group 2 because the (pattern)+ is a repeated capturing group that stores only the value captured at the last iteration.
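The behavior of a repeated capturing group can be seen on a much shorter string. The following is a sketch with a made-up input, not the data from the question:

```javascript
const repeatedInput = '&&&a&&&b&&&1&&&c&&&d&&&2';

// Repeated capturing group: group 1 is overwritten on every iteration,
// so only the last iteration survives.
const repeated = /(&&&[^&]+&&&[^&]+&&&\d+)+/.exec(repeatedInput);
console.log(repeated[1]); // &&&c&&&d&&&2

// Capturing group wrapped around a repeated non-capturing group:
// group 1 spans all iterations.
const wrapped = /((?:&&&[^&]+&&&[^&]+&&&\d+)+)/.exec(repeatedInput);
console.log(wrapped[1]); // &&&a&&&b&&&1&&&c&&&d&&&2
```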
It seems you need
/Artikelnummer&&&([\s\S]*?)&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite((?:(?:&&&[^&]*(?:&&?[^&]+)*){2}&&&\d+)+)/g
See the regex demo
The first capturing group just matches any 0+ chars from Artikelnummer&&& till the first occurrence of &&&Dokumentation..., and the second one grabs 1+ occurrences of &&&...&&&...&&& + digit(s).
Details
Artikelnummer&&& - a literal substring
([\s\S]*?) - Group 1 matching any 0+ chars, as few as possible up to the
&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite - literal substring
((?:(?:&&&[^&]*(?:&&?[^&]+)*){2}&&&\d+)+) - Group 2 matching 1+ occurrences of:
(?:&&&[^&]*(?:&&?[^&]+)*){2} - two occurrences of:
&&& - a literal substring
[^&]*(?:&&?[^&]+)* - any 0+ chars other than &, then 0+ sequences of & or && followed by 1+ chars other than &
&&& - a literal substring
\d+ - 1+ digits.
Notes on performance: the first capturing group pattern needs to be made more precise if you need better performance. Right now, the lazy [\s\S]*? pattern can be slow, and if the substring between the first and second delimiter grows, there might be performance issues.
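A sketch of how the two groups might be post-processed in JavaScript, using a shortened version of the test string. The field names `kks`, `title`, and `page` are assumptions derived from the column headers (KKS-Nummer, Beschreibung, Seite), and the code assumes that no field contains `&&&` itself.

```javascript
const docPattern = /Artikelnummer&&&([\s\S]*?)&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite((?:(?:&&&[^&]*(?:&&?[^&]+)*){2}&&&\d+)+)/;

// Shortened sample in the same layout as the question's test string.
const docSample =
  'Typ&&&Artikelnummer&&&MA-KF1&&&Beckhoff&&&EK1100&&&BECK%2EEK1100' +
  '&&&Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite' +
  '&&&all&&&Vorwort&&&6&&&all&&&Anhang 1&&&12';

const docMatch = docPattern.exec(docSample);

// Group 1: the product data fields, separated by &&&.
const products = docMatch[1].split('&&&');

// Group 2 starts with &&&, so drop the leading empty field, then read triples.
const fields = docMatch[2].split('&&&').slice(1);
const rows = [];
for (let i = 0; i < fields.length; i += 3) {
  rows.push({ kks: fields[i], title: fields[i + 1], page: Number(fields[i + 2]) });
}

console.log(products); // [ 'MA-KF1', 'Beckhoff', 'EK1100', 'BECK%2EEK1100' ]
console.log(rows);     // two rows: Vorwort (page 6) and Anhang 1 (page 12)
```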

Capture between pattern of digits

I'm stuck trying to capture a structure like this:
1:1 wefeff qwefejä qwefjk
dfjdf 10:2 jdskjdksdjö
12:1 qwe qwe: qwertyå
I would want to match everything between the digits, followed by a colon, followed by another set of digits. So the expected output would be:
match 1 = 1:1 wefeff qwefejä qwefjk dfjdf
match 2 = 10:2 jdskjdksdjö
match 3 = 12:1 qwe qwe: qwertyå
Here's what I have tried:
\d+\:\d+.+
But that fails if there are word characters spanning two lines.
I'm using a javascript based regex engine.
You may use a regex based on a tempered greedy token:
/\d+:\d+(?:(?!\d+:\d)[\s\S])*/g
The \d+:\d+ part will match one or more digits, a colon, one or more digits and (?:(?!\d+:\d)[\s\S])* will match any char, zero or more occurrences, that do not start a sequence of one or more digits followed with a colon and a digit. See this regex demo.
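A short sketch applying the tempered greedy token to the sample from the question. Collapsing whitespace afterwards is an extra step added here so the output resembles the expected matches; it is not part of the regex itself.

```javascript
const verses = `1:1 wefeff qwefejä qwefjk
dfjdf 10:2 jdskjdksdjö
12:1 qwe qwe: qwertyå`;

const tempered = /\d+:\d+(?:(?!\d+:\d)[\s\S])*/g;

// Each match runs up to the next digits:digit marker, line breaks included.
// Collapse whitespace runs and trim so each match prints on one line.
const temperedMatches = verses
  .match(tempered)
  .map((m) => m.replace(/\s+/g, ' ').trim());

console.log(temperedMatches);
// [ '1:1 wefeff qwefejä qwefjk dfjdf', '10:2 jdskjdksdjö', '12:1 qwe qwe: qwertyå' ]
```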
As the tempered greedy token is a resource consuming construct, you can unroll it into a more efficient pattern like
/\d+:\d+\D*(?:\d(?!\d*:\d)\D*)*/g
See another regex demo.
Now, the tempered greedy token (?:(?!\d+:\d)[\s\S])* is turned into a pattern that matches strings linearly:
\D* - 0+ non-digit symbols
(?: - start of a non-capturing group matching zero or more sequences of:
\d - a digit that is...
(?!\d*:\d) - not followed with 0+ digits, : and a digit
\D* - 0+ non-digit symbols
)* - end of the non-capturing group.
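To sanity-check the unrolled pattern, it can be compared with the tempered version on the sample from the question. This sketch only shows that the two agree on this particular input; it makes no claim about performance or about every possible string.

```javascript
const input = `1:1 wefeff qwefejä qwefjk
dfjdf 10:2 jdskjdksdjö
12:1 qwe qwe: qwertyå`;

const temperedRe = /\d+:\d+(?:(?!\d+:\d)[\s\S])*/g;
const unrolled = /\d+:\d+\D*(?:\d(?!\d*:\d)\D*)*/g;

const unrolledMatches = input.match(unrolled);

// On this input both patterns return exactly the same substrings.
const sameAsTempered =
  JSON.stringify(unrolledMatches) === JSON.stringify(input.match(temperedRe));

console.log(unrolledMatches.length, sameAsTempered); // 3 true
```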
You can include ñ/Ñ or not, but this should work:
\d+?:\d+? [a-zñA-ZÑ ]*
Edited:
If you want to include line breaks, you can add \n or \r to the set:
\d+?:\d+? [a-zñA-ZÑ\n ]*
\d+?:\d+? [a-zñA-ZÑ\r ]*
Give it a try! Also tested at https://regex101.com/
For more chars:
^[a-zA-Z0-9!##\$%\^\&*)(+=._-]+$

How to build a custom regex that matches dashes/alphanumeric characters and '.' dot characters that are not consecutive?

I need to build a regex that only matches words meeting these requirements:
at least 3 characters
maximum 32 characters
only a-z0-9_-.
dots: . ok, .. nope
This is what I did:
/[0-9a-zA-Z\-\_\.]{3,32}/
The problem is that I can insert more than one . in a row and I don't know how to fix it.
You could use the following expression:
/(?:[\w-]|\.(?!\.)){3,32}/
Explanation:
(?: - Start of a non-capturing group
[\w-] - Character set to match [a-zA-Z0-9_-]
| - Alternation, or..
\.(?!\.) - Negative lookahead to match a . character literally if it isn't followed by another . character.
) - Close the non-capturing group
{3,32} - Match the group 3 to 32 times
You may also want to add anchors if you want to match the entire string against the expression:
/^(?:[\w-]|\.(?!\.)){3,32}$/
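The difference the anchors make can be seen in a small sketch. The sample strings are made up for illustration. Note that `\w` also accepts uppercase letters, as the attempt in the question did.

```javascript
const loose = /(?:[\w-]|\.(?!\.)){3,32}/;
const strict = /^(?:[\w-]|\.(?!\.)){3,32}$/;

// Without anchors, a valid run inside an invalid string is enough to match:
// here the run ".bcd" after the double dot satisfies the pattern.
console.log(loose.test('a..bcd'));  // true
console.log(strict.test('a..bcd')); // false

const checks = ['abc', 'user.name-01', 'ab', 'a..b', 'a'.repeat(32), 'a'.repeat(33)]
  .map((s) => strict.test(s));
console.log(checks); // [ true, true, false, false, true, false ]
```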
