Regex for repeated sub strings in a lengthy string [duplicate]


This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
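A minimal JavaScript sketch of the difference, assuming the log is held as one string with the entries separated by single spaces (the variable names are illustrative and not part of the question):

```javascript
// Sample log from the question, joined into one string (an assumption).
const log = [
  "J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM",
  "J0000010: Project name: E:\\foo.pf",
  "J0000011: Job name: MBiek Direct Mail Test",
  "J0000020: Document 1 - Completed successfully",
].join(" ");

// Greedy: .* runs to the end, then backtracks to the LAST "J#######:" marker.
const greedy = /Project name:\s+(.*)\s+J[0-9]{7}:/;
// Lazy: .*? grows one character at a time and stops at the FIRST marker.
const lazy = /Project name:\s+(.*?)\s+J[0-9]{7}:/;

const greedyCapture = log.match(greedy)[1];
const lazyCapture = log.match(lazy)[1];

console.log(greedyCapture); // E:\foo.pf J0000011: Job name: MBiek Direct Mail Test
console.log(lazyCapture);   // E:\foo.pf
```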

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means "anything except whitespace", and this is exactly what you want.
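One trade-off worth checking before relying on \S: it only works while the project path contains no whitespace. The sketch below uses the path from the question plus a hypothetical path with a space in it (that second input is an assumption made for illustration):

```javascript
const pathPattern = /Project name:\s+(\S*)\s+J[0-9]{7}:/;

// Path without spaces, as in the question.
const plainEntry = "J0000010: Project name: E:\\foo.pf J0000011: Job name: Test";
// Hypothetical path containing a space.
const spacedEntry = "J0000010: Project name: E:\\my files\\foo.pf J0000011: Job name: Test";

const plainMatch = plainEntry.match(pathPattern);
const spacedMatch = spacedEntry.match(pathPattern);

console.log(plainMatch[1]); // E:\foo.pf
console.log(spacedMatch);   // null, because \S* cannot cross the space in the path
```

If paths with spaces are possible, the non-greedy (.*?) version is the safer choice.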

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, each time the regex engine is about to consume another character with the ".", it first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
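For comparison, a rough JavaScript counterpart of the same idea, using a named capture group (supported since ES2018). The input string and variable names are assumptions for illustration; like the pattern above, it expects a single space on each side of the project name:

```javascript
// One-line excerpt of the log from the question (an assumption about its layout).
const logLine =
  "J0000010: Project name: E:\\foo.pf J0000011: Job name: MBiek Direct Mail Test";

// (?<name>...) is a named group; .*? keeps it as short as possible.
const nameMatch = logLine.match(/Project name: (?<name>.*?) J\d+/);

// match() returns null when nothing matches, so guard before reading groups.
const projectName = nameMatch ? nameMatch.groups.name : null;

console.log(projectName); // E:\foo.pf
```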

I would also recommend you experiment with regular expressions using "Expresso", a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes these new concepts easy to learn.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

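(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+\.[a-zA-Z]+ in place of .* makes the pattern more restrictive: it only accepts a drive letter followed by backslash-separated path segments and a file extension.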
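A quick JavaScript check of this stricter approach. The pattern here is written with the path separator as an escaped backslash (\\) followed by \w+, and with the dot before the extension escaped; that reading of the intent is an assumption, as are the sample inputs:

```javascript
// Drive letter, one or more "\segment" parts, a dot, an extension, then the J marker.
const strict = /(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)/;

const pathEntry =
  "J0000010: Project name: E:\\foo.pf J0000011: Job name: MBiek Direct Mail Test";
// Hypothetical entry whose project name is not a Windows-style path.
const nonPathEntry = "J0000010: Project name: foo J0000011: Job name: Test";

const strictMatch = pathEntry.match(strict);
const nonPathAccepted = strict.test(nonPathEntry);

console.log(strictMatch[1]);  // Project name: E:\foo.pf J0000011
console.log(nonPathAccepted); // false
```

Note that group 1 spans the whole match, including the "Project name:" label and the following J marker, so the path itself still has to be cut out of it.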

Related

Sonarqube catastrophic backtracking with regex that I use

In one of my views I am using a regex that looks like this:
RegExp('^(?!(.|\n)*{\/?.+})(.|\n)*$')
SonarLint is giving me a warning about catastrophic backtracking, saying I should make sure to use a regex that cannot lead to denial of service.
From what I have read, it mostly happens with regexes that are not too complex and use a lot of "any character" constructs like . or +
This is the first time I have seen a security hotspot like this. Should I try to rewrite this regex, or is it complex enough that it won't trigger catastrophic backtracking?

There are a couple of points to note here:
The (.|\n)* construct causes very poor performance due to excessive backtracking. See "Why (?:\s|.)* is a bad pattern to match any character including line breaks" YouTube video with detailed explanation of why this is a very bad pattern. All you need is a [^] construct to match any chars.
Your regex is basically /^(?![^]*{[^}]*})[^]*$/, or even /^(?![^]*{[^}]*})/ since JavaScript regex functions do not require a full string match. All you need here is to actually use the {[^}]*} pattern and negate the result with !.
So you can use
if (!/{[^}]*}/.test(text)) {
  return true;
}
where {[^}]*} matches a {, then any zero or more chars other than } and then a } char.
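Wrapped into a small helper, the check might look like the sketch below. The function name and sample inputs are hypothetical; the braces are escaped here only for readability, which does not change what the pattern matches:

```javascript
// Returns true when the text contains no "{...}" token.
// [^}]* also matches line breaks, so no (.|\n)* alternation is needed.
function hasNoBracedToken(text) {
  return !/\{[^}]*\}/.test(text);
}

const results = [
  hasNoBracedToken("plain text\nsecond line"), // true
  hasNoBracedToken("hello {name}"),            // false
  hasNoBracedToken("multi {line\ntoken}"),     // false
  hasNoBracedToken("only an opening { brace"), // true
];

console.log(results);
```

Because the negated class and the closing } can never match the same character, the engine has no overlapping choices to retry, which is what avoids the backtracking warning.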

Regex to process only certain files [duplicate]

This question already has answers here:
Regex: match everything but a specific pattern
(6 answers)
Closed 3 years ago.
I need a regex (javascript) that will process all files except for two.
The file names look like this-
Epic_IKDH_Appt_Phone_Reminders_20191030.txt
Epic_NAMMI_Appt_Phone_Reminders_20191031.txt
QCNIL-Recall_Phone_Reminders_20191029.txt
Epic_SNA_Appt_No_Show_Reminders_20191029.txt
I want to process all files that don't start with QCNIL or Epic_SNA.
I tried this regex, but it doesn't seem to work:
^((?!QCNIL).)*$|^((?!Epic_SNA).)*$
One or the other seems to work, but not both together.
Then tried this:
^((?!Epic_SNA)(?!QCNIL).)*$
This seems to work, but with my limited knowledge of regular expressions, I'm afraid I might be missing something. Basically, if new file names are generated, I want them to be processed as well. The only files I don't want to process are the SNA and QCNIL files.

The second pattern ^((?!Epic_SNA)(?!QCNIL).)*$ would work, but it is a tempered greedy token: it runs both assertions before matching every single character, which can be costly in the number of steps. It also rejects any name that contains QCNIL or Epic_SNA anywhere, not only at the start.
You might simplify the pattern to use a negative lookahead at the start asserting what is directly to the right is not QCNIL or Epic_SNA.
Then match any char except a newline 1+ times to prevent matching an empty string.
^(?!QCNIL|Epic_SNA).+$
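Applied to the file names from the question, the pattern could be used as a filter like this (a minimal sketch; the variable names are illustrative):

```javascript
const fileNames = [
  "Epic_IKDH_Appt_Phone_Reminders_20191030.txt",
  "Epic_NAMMI_Appt_Phone_Reminders_20191031.txt",
  "QCNIL-Recall_Phone_Reminders_20191029.txt",
  "Epic_SNA_Appt_No_Show_Reminders_20191029.txt",
];

// The lookahead is checked once, at the start of the name.
const shouldProcess = /^(?!QCNIL|Epic_SNA).+$/;

const toProcess = fileNames.filter((name) => shouldProcess.test(name));

console.log(toProcess);
// [ 'Epic_IKDH_Appt_Phone_Reminders_20191030.txt',
//   'Epic_NAMMI_Appt_Phone_Reminders_20191031.txt' ]
```

Any new file name that does not begin with one of the two excluded prefixes passes the filter without changes to the pattern.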

You need to perform the or inside the exclusion.
Otherwise you are saying "it is not this" or "it is not that". You need to say it is not "this or that".
Also, did you intend the whole expression to be repeated by *, or only the .? Try the following, though there are more brackets than necessary:
^(?!((Epic_SNA)|(QCNIL))).*$

How can I write a regex that will match any string?

Yes, I know this sounds counter-intuitive. But I need it for a JavaScript mashup written by someone else. There is a regex value used to select project names to which the mashup will apply. I want it to apply to all projects.

As others have mentioned, the . doesn't match line feeds.
If you must match absolutely every character including \n then you can use this instead...
[\s\S]*
That's whitespace characters, and non-whitespace characters. In other words, that'll match everything.
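A short JavaScript comparison on a two-line string (the sample text is an assumption). The last pattern uses the s (dotAll) flag, which needs an ES2018-capable engine:

```javascript
const twoLines = "first line\nsecond line";

// "." stops at the line break, so the anchored pattern cannot reach the end.
const dotOnly = /^.*$/.test(twoLines);

// [\s\S] matches every character, including "\n".
const anyChar = /^[\s\S]*$/.test(twoLines);

// With the s flag, "." matches line breaks as well.
const dotAll = /^.*$/s.test(twoLines);

// The * quantifier also accepts the empty string.
const emptyOk = /^[\s\S]*$/.test("");

console.log(dotOnly, anyChar, dotAll, emptyOk); // false true true true
```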

OK, folks. I'm not strong with regex, but I think I figured this out. I'm just using the following regex:
'.*'
This is working fine for me. The dot allows any character, and the asterisk allows it to be repeated any number of times.
If anybody knows how to limit this to a string that is a single line (ASCII Character set) I'm all ears.

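Your answer (.*) works, and by default will match only a single line. If you wanted it to match across multiple lines, you could enable the mode that lets . match newlines in your particular regex implementation (usually called single-line or dot-all mode, not multi-line mode, which only changes how ^ and $ behave), but nobody enables it by default AFAIK.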

You're correct that .* will match any character; the exception is newline characters ('\n', etc.). What you could try is grouping. This: (.*) should work for you. If you need help with accessing the matched group, more info can be found here: How do you access the matched groups in a JavaScript regular expression?
EDIT: If you are using something other than JavaScript, the implementation may differ slightly with multi- vs single-line mode. JavaScript originally had no single-line mode; ES2018 added the s (dotAll) flag.
