Extracting data using regex - Optional capturing group causing trouble [duplicate]


This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this test string, which I want to extract data from using Regex:
"-> Single-row index lookup on using <auto_distinct_key> (actor_id=actor.actor_id) (cost=901.66 rows=5478) (actual time=0.001..0.001 rows=0 loops=200)"
The six fields I want to extract are the operation description, the cost, the estimated row count, the actual time, the actual row count, and the loop count. The (cost=901.66 rows=5478) segment is the optional part of the statement.
This is the pattern which I have arrived at thus far:
-> (.+) (\(cost=([\d\.]+) rows=(\d+)\))? \(actual time=([\d\.]+) rows=(\d+) loops=(\d+)\)
This gives me six groups with all the data I want. However, when I omit the optional part of the string it does not match at all. I suspected this was due to superfluous whitespaces, so I thought it might work to move the whitespace into the optional group, like this:
-> (.+)( \(cost=([\d\.]+) rows=(\d+)\))? \(actual time=([\d\.]+) rows=(\d+) loops=(\d+)\)
That did not work either.
The optional group seems to be matched as part of the first group, which is not what I want. I want them captured separately, and I'm not sure how to do that.

You have to make the first (.+) lazy by changing it to (.+?), so that it stops at the first position where the rest of the pattern can match.
https://regex101.com/r/fbE0tW/1
# (.+?)[ ]?((?<=[ ])\(cost=([\d\.]+)[ ]rows=(\d+)\))?[ ]\(actual[ ]time=([\d\.]+)[ ]rows=(\d+)[ ]loops=(\d+)\)
( .+? ) # (1)
[ ]?
( # (2 start)
(?<= [ ] )
\( cost=
( [\d\.]+ ) # (3)
[ ] rows=
( \d+ ) # (4)
\)
)? # (2 end)
[ ]
\(
actual [ ] time=
( [\d\.]+ ) # (5)
[ ]
rows=
( \d+ ) # (6)
[ ]
loops=
( \d+ ) # (7)
\)
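In JavaScript, the lazy-quantifier fix can be sketched as follows. This is a minimal example rather than the exact pattern above: it keeps the question's "-> " prefix, makes the first group lazy, moves the leading space into the optional group, and makes that group non-capturing so that exactly six fields are captured. The names planLinePattern and parsePlanLine are made up for illustration.

```javascript
// Lazy first group plus an optional, non-capturing "(cost=... rows=...)" segment.
const planLinePattern =
  /-> (.+?)(?: \(cost=([\d.]+) rows=(\d+)\))? \(actual time=([\d.]+) rows=(\d+) loops=(\d+)\)/;

// Hypothetical helper: returns the six fields, or null if the line does not match.
// Fields from the optional segment are undefined when that segment is absent.
function parsePlanLine(line) {
  const m = planLinePattern.exec(line);
  if (!m) return null;
  const [, operation, cost, estimatedRows, actualTime, actualRows, loops] = m;
  return { operation, cost, estimatedRows, actualTime, actualRows, loops };
}

const withCost = parsePlanLine(
  '-> Single-row index lookup on using <auto_distinct_key> (actor_id=actor.actor_id) ' +
  '(cost=901.66 rows=5478) (actual time=0.001..0.001 rows=0 loops=200)'
);
const withoutCost = parsePlanLine(
  '-> Single-row index lookup on using <auto_distinct_key> (actor_id=actor.actor_id) ' +
  '(actual time=0.001..0.001 rows=0 loops=200)'
);
console.log(withCost, withoutCost);
```

Because the first group is lazy, it grows one character at a time and stops as soon as the optional segment (if present) and the mandatory "(actual time=...)" segment can match right after it.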

Related

Javascript regex to match non repeating 9 digit number [duplicate]

I want to find 10 digit numbers with no repeat digits, for example:
1123456789 //fail, there are two 1's
6758951230 //fail, there are two 5's
6789012345 //pass, each digit occurs once only.
At the moment I am using a regex, but it only matches 10-digit numbers (it doesn't check for duplicates). I am using this regex:
[0-9]{10}
Can this be done with regex or is there a better way to achieve this?

This regex works:
^(?!.*(.).*\1)\d{10}$
This uses an anchored negative look ahead with a back reference to assert that there are no repeating characters.
See a live demo working with your examples.
In java:
if (str.matches("^(?!.*(.).*\\1)\\d{10}"))
// number passes
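Since the question is about JavaScript, here is a minimal sketch of the same anchored negative lookahead in that language, next to a Set-based check as one possible non-regex alternative. The function names are hypothetical, and the Set variant is an illustration added here, not something taken from the answers.

```javascript
// Same pattern as above: the lookahead rejects any string in which some
// character appears twice; \d{10} then requires exactly ten digits.
const uniqueTenDigits = /^(?!.*(.).*\1)\d{10}$/;

function hasTenUniqueDigits(value) {
  // Pass strings rather than numbers so that leading zeros are preserved.
  return uniqueTenDigits.test(String(value));
}

// Alternative sketch: check the shape with a simple regex, then count
// distinct characters with a Set.
function hasTenUniqueDigitsWithSet(value) {
  const s = String(value);
  return /^\d{10}$/.test(s) && new Set(s).size === 10;
}

for (const sample of ['1123456789', '6758951230', '6789012345']) {
  console.log(sample, hasTenUniqueDigits(sample), hasTenUniqueDigitsWithSet(sample));
}
```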

Try this one: (?:([0-9])(?!.*\1)){10}. It will work if you're validating numbers one at a time.
This should work to search for each occurrence of a unique 10-digit sequence: (?:([0-9])(?!\d*\1)){10}. However, it fails with 12345678901234567890: it finds the last valid part, 1234567890, instead of ignoring it.
Source and explanations: https://stackoverflow.com/a/12870549/1366360

Here's a short and efficient regex; the lazy ? keeps backtracking low.
Works for any length of input:
!/(.).*?\1/.test(number)
Examples:
!/(.).*?\1/.test(1234567890) // true
!/(.).*?\1/.test(1234567490) // false - note that it also works for repeated chars which are not adjacent.
Demo
- checks for repeated digits
- opposite of what you want, because rubular doesn't allow a !

The regex reference from lancemanfv (https://stackoverflow.com/a/12870549/1366360) is a great one, but the suggested regex is slightly off.
Instead try
^(?:([0-9])(?!.*\1)){10}$
This matches any string that consists of exactly 10 digits, all of them different.
If you want to check whether a longer string contains a 10-digit number in which every digit is different (and extract it), use this
((?:([0-9])(?!.*\2)){10})*
You can then use a numbered reference to extract the matching number.

Works every time (I see this question) -
Revised so that group 10 is opened before the (?! \10 ) assertion. \1-\9 are always treated as backreferences, but for \10 and above the group's opening parenthesis must come before the reference.
So I made all ten alternatives follow the same layout.
Note: this can be used to find a floating (substring) number of 10 unique digits. It requires no anchors.
FYI: with Perl, the \g{#} (or \k'name') syntax can be used before the group is defined, whatever the group number is.
# "(?:((?!\\1)1)|((?!\\2)2)|((?!\\3)3)|((?!\\4)4)|((?!\\5)5)|((?!\\6)6)|((?!\\7)7)|((?!\\8)8)|((?!\\9)9)|((?!\\10)0)){10}"
(?:
( # (1)
(?! \1 )
1
)
| ( # (2)
(?! \2 )
2
)
| ( # (3)
(?! \3 )
3
)
| ( # (4)
(?! \4 )
4
)
| ( # (5)
(?! \5 )
5
)
| ( # (6)
(?! \6 )
6
)
| ( # (7)
(?! \7 )
7
)
| ( # (8)
(?! \8 )
8
)
| ( # (9)
(?! \9 )
9
)
| ( # (10)
(?! \10 )
0
)
){10}

Regex, Get sequence if not preceded by symbols [duplicate]

This question already has answers here:
RegEx for a^b instead of pow(a,b)
(6 answers)
Closed 2 years ago.
I'm using the math.js library and I need to take the exponent of some variables. I have the following strings:
//Ok
pow(y,2)
pow(y,2+2)
pow(y,2-3)
pow(y,2.2)
pow(y,(23)/(2))+23123
pow(y,pow(2,pow(2,4)))-932
pow(y,pow(2,1*pow(2,0.5)))+23
//Error
pow(y,2)*pow(2,2)
pow(y,3)-pow(2,2)
pow(y,4)+pow(2,2)
pow(y,pow(2,1*pow(2,0.5)))+pow(1,1)
I'm having trouble implementing this search using regex. The pow(a,b) function is composed of two arguments "a" is the base and "b" the exponent.
In the last four strings of the code above, I need to capture only "2", "3", "4" and "pow(2,1*pow(2,0.5))". I don't want to take the part after "*", "+" and "-".
Since it is possible to chain the pow() function and both "a" and "b" can have arithmetic operators and functions like pow() and sqrt(), this turned out to be very complex. Is there any way to resolve this using regex?
The closest I got is in this regex: https://regex101.com/r/hB1cg4/4

As stated in the comments, doing balanced match is hard in regex, though the .NET regex flavor supports this feature. Please see this answer: https://stackoverflow.com/a/35271017/8031896
Nevertheless, there is a workaround that uses the common regex flavors. Note, however, that you may need to modify it according to the number of nested parenthesis levels in your mathematical notation.
((?<=^pow)\(([^()]*|\(([^()]*|\([^()]*\))*\))*\))
demo: https://regex101.com/r/hB1cg4/6
For a more detailed explanation, please see this answer: https://stackoverflow.com/a/18703444/8031896
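Because the regex workaround is tied to a fixed nesting depth, a small depth-counting scanner is one way to handle arbitrary nesting. The sketch below is a different technique from the regex above, not a version of it: it walks the string, tracks parenthesis depth, and returns the second argument of the first pow(...) call. The name firstPowExponent is hypothetical.

```javascript
// Returns the exponent (second argument) of the first pow(...) call in expr,
// or null if there is no such call or the parentheses are unbalanced.
function firstPowExponent(expr) {
  const start = expr.indexOf('pow(');
  if (start === -1) return null;

  let depth = 0;   // current parenthesis nesting depth
  let comma = -1;  // index of the comma separating base and exponent

  // start + 3 is the index of the opening parenthesis of the call.
  for (let i = start + 3; i < expr.length; i++) {
    const ch = expr[i];
    if (ch === '(') {
      depth++;
    } else if (ch === ')') {
      depth--;
      if (depth === 0) {
        // Closing parenthesis of the outer pow(...) call.
        return comma === -1 ? null : expr.slice(comma + 1, i);
      }
    } else if (ch === ',' && depth === 1 && comma === -1) {
      comma = i; // first top-level comma inside the call
    }
  }
  return null; // ran out of input before the call was closed
}

console.log(firstPowExponent('pow(y,2)*pow(2,2)'));
console.log(firstPowExponent('pow(y,pow(2,1*pow(2,0.5)))+pow(1,1)'));
```

Anything after the closing parenthesis of the first call (the parts following "*", "+" or "-") is never inspected, which is the behavior the question asks for.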

The following regex matches all of the "Error" strings, and one variant, but unfortunately fails to match two of the "OK" strings. Perhaps some tweaking is possible. The regex contains a single capture group that captures the information of interest.
^pow\([^,]+,(\d[^()]*|pow\(\d+,\d+(?:\)|[^()]*\([^()]*\)\)))\).*
Javascript demo
To match the "Error" strings I assumed that pow(2,1*pow(2,0.5)) in pow(y,pow(2,1*pow(2,0.5)))+23 represented the maximum depth of nested "pow" calls.
The regex performs the following operations.
^ # match beginning of line
pow\( # match 'pow('
[^,]+ # match 1+ chars other than ','
, # match ','
( # begin capture group 1
\d[^()]* # match a digit, 0+ chars other than '(' and ')'
| # or
pow\(\d+,\d+ # match 'pow(', 1+ digits, ',' 1+ digits
(?: # begin non-cap grp
\) # match ')'
| # or
[^()]* # match 0+ chars other than '(' and ')'
\( # match '('
[^()]* # match 0+ chars other than '(' and ')'
\)\) # match '))'
) # end non-cap grp
) # end cap grp 1
\) # match ')'

Validate timestamp step by step with regex

This regular expression validates timestamps e.g. 2018-02-12 00:55:22:
[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) (2[0-3]|[01][0-9]):[0-5][0-9]:[0-5][0-9]
However, the timestamp should be validated step by step:
201 => true
201a => false
2018- => true
20189 => false
Is there a nice (short) regex extension?
......

Because your question has the javascript tag I am going to assume you are doing "step-by-step" validation like "onkeyup" or similar. The following pattern will validate your datetime string as it is being constructed (I'm including an empty string as valid so that no flag is triggered when empty; but you could change to \d{1,4} if you want to act on empty strings).
I am using \d whenever possible to reduce pattern length.
The x pattern modifier is in play with my dumped pattern, for easier reading. When you apply this to your project, you can compact it all and remove the x flag.
I am using non-capturing groups out of habit; since you are probably only matching, you can use capturing groups if you like.
Pattern Demo
Pattern:
~
^
(?:
\d{0,4}|
\d{4}-|
\d{4}-[01]|
\d{4}-(?:0[1-9]|1[0-2])|
\d{4}-(?:0[1-9]|1[0-2])-|
\d{4}-(?:0[1-9]|1[0-2])-[0-3]|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])\s|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])\s[0-2]|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])\s(?:2[0-3]|[01]\d)|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])\s(?:2[0-3]|[01]\d):|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])\s(?:2[0-3]|[01]\d):[0-5]|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])\s(?:2[0-3]|[01]\d):[0-5]\d|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])\s(?:2[0-3]|[01]\d):[0-5]\d:|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])\s(?:2[0-3]|[01]\d):[0-5]\d:[0-5]|
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[0-1])\s(?:2[0-3]|[01]\d):[0-5]\d:[0-5]\d
)
$
~
x
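JavaScript regular expressions have no x flag, so the pattern above would have to be compacted by hand. As a rough sketch of another way to get the same alternation, the snippet below builds it programmatically from a list of stages. The stages array, the partial/full property names, and isValidPartialTimestamp are all made up for this example.

```javascript
// Each stage has a "full" pattern and, for two-digit fields, a "partial"
// pattern that accepts the field while only its first digit has been typed.
const stages = [
  { full: '\\d{4}', partial: '\\d{0,4}' },
  { full: '-' },
  { full: '(?:0[1-9]|1[0-2])', partial: '[01]' },
  { full: '-' },
  { full: '(?:0[1-9]|[12]\\d|3[01])', partial: '[0-3]' },
  { full: '\\s' },
  { full: '(?:2[0-3]|[01]\\d)', partial: '[0-2]' },
  { full: ':' },
  { full: '[0-5]\\d', partial: '[0-5]' },
  { full: ':' },
  { full: '[0-5]\\d', partial: '[0-5]' },
];

// Build one alternative per typing step: everything completed so far,
// followed by either the partial or the full form of the current stage.
const alternatives = [];
let prefix = '';
for (const stage of stages) {
  if (stage.partial) alternatives.push(prefix + stage.partial);
  prefix += stage.full;
  alternatives.push(prefix);
}
const partialTimestamp = new RegExp('^(?:' + alternatives.join('|') + ')$');

function isValidPartialTimestamp(input) {
  return partialTimestamp.test(input);
}

for (const sample of ['201', '201a', '2018-', '20189']) {
  console.log(JSON.stringify(sample), isValidPartialTimestamp(sample));
}
```

As in the pattern above, the empty string counts as valid; changing the first partial to '\\d{1,4}' would reject it.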

You can combine everything into one pattern that yields two blocks of information.
It incrementally matches the form delimiters (-, -, space, :, :)
while still allowing badly formed segments to match.
In the end you get information on the form's progress
as well as on the individual segments.
You test the form's progress via capture groups 2, 4, 6, 8 and 10.
You test the date/time elements via groups 1, 3, 5, 7, 9 and 11.
You only need to test the elements up to the highest progress group
that matched.
^(?:(?:([0-9]{4})|\d*)(-(?:(0[1-9]|1[0-2])|\d*)(-(?:(0[1-9]|[1-2][0-9]|3[0-1])|\d*)([ ]+(?:(2[0-3]|[01][0-9])|\d*)(:(?:([0-5][0-9])|\d*)(:(?:([0-5][0-9])|\d*))?)?)?)?)?)$
Formatted
^
(?:
(?:
( [0-9]{4} ) # (1)
| \d*
)
( # (2 start)
-
(?:
( 0 [1-9] | 1 [0-2] ) # (3)
| \d*
)
( # (4 start)
-
(?:
( 0 [1-9] | [1-2] [0-9] | 3 [0-1] ) # (5)
| \d*
)
( # (6 start)
[ ]+
(?:
( 2 [0-3] | [01] [0-9] ) # (7)
| \d*
)
( # (8 start)
:
(?:
( [0-5] [0-9] ) # (9)
| \d*
)
( # (10 start)
:
(?:
( [0-5] [0-9] ) # (11)
| \d*
)
)? # (10 end)
)? # (8 end)
)? # (6 end)
)? # (4 end)
)? # (2 end)
)
$
Both the progress and the segments are tested by checking whether the corresponding capture groups matched.
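A minimal JavaScript sketch of how those groups might be read follows. It uses the compact pattern from this answer unchanged; the helper name inspectTimestamp and the shape of the returned object are assumptions made for illustration.

```javascript
const stepwise =
  /^(?:(?:([0-9]{4})|\d*)(-(?:(0[1-9]|1[0-2])|\d*)(-(?:(0[1-9]|[1-2][0-9]|3[0-1])|\d*)([ ]+(?:(2[0-3]|[01][0-9])|\d*)(:(?:([0-5][0-9])|\d*)(:(?:([0-5][0-9])|\d*))?)?)?)?)?)$/;

// Hypothetical helper: returns null when the input cannot be a timestamp
// prefix at all; otherwise reports how many delimiters were reached and
// which date/time segments matched their strict form.
function inspectTimestamp(input) {
  const m = stepwise.exec(input);
  if (!m) return null;
  // Progress groups 2, 4, 6, 8, 10 are set once their delimiter was matched.
  const delimitersSeen = [2, 4, 6, 8, 10].filter((g) => m[g] !== undefined).length;
  // Element groups 1, 3, 5, 7, 9, 11 are set only for well-formed segments.
  const segments = [1, 3, 5, 7, 9, 11].map((g) => m[g]);
  return { delimitersSeen, segments };
}

console.log(inspectTimestamp('2018-02-1'));
console.log(inspectTimestamp('2018-02-12 00:55:22'));
console.log(inspectTimestamp('201a'));
```

For '2018-02-1', two delimiters have been reached and the year and month segments are set, while the day segment is still undefined because only one of its digits has been entered.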

Regex is not the way to do this.
Here's a simple function. Take a known-good date in the correct format, strip off as many characters from the front as have been entered, append the remainder to the value entered, then check whether the result is valid.
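function validateDate($date)
{
    // A known-good timestamp in the target format.
    $fakedate = "2018-02-12 00:55:22";
    // Pad the partial input with the remaining characters of the good timestamp.
    $date .= substr($fakedate, strlen($date));
    $format = 'Y-m-d H:i:s';
    $d = DateTime::createFromFormat($format, $date);
    // Valid only if parsing succeeds and formatting gives back the same string.
    return $d && $d->format($format) == $date;
}
var_dump(validateDate('201'));   // bool(true)
var_dump(validateDate('201a'));  // bool(false)
var_dump(validateDate('2018-')); // bool(true)
var_dump(validateDate('20189')); // bool(false)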

Javascript regex: Ignore closing bracket when enclosed in parentheses

I have a regex with a couple of (optional) capture groups. I'm trying to add a new feature that allows a user to add content to one of the capture groups that matches the closing bracket of the regex. I'm struggling to ignore this match.
The current regex is /\[(.+?)=(.+?)(?:\|(.+?))?(?:\:(.+?))?\]/g
This allows a user to target data according to:
[key=value|filter:filtervalue]
where filter and filtervalue are optional.
The problem is that for the value it should now be possible to target indexes in an array. For example:
[data=example(products[0].id)]
However, the regex matches only up to .id so the second capture group is example(products[0. I would like it to be example(products[0].id). I think I should be fine if I can ignore the closing bracket when it is wrapped by parentheses, but I've been unable to figure out how.
Examples that should be matched:
[data=example(products[0].id)]
[data=example(products[index].id)]
[data=regular]
[data=test|number:2]
I created a regex101. Any help is appreciated.

You may use
/\[([^=\]]+)=((?:\[[^\][]*]|[^\]])+?)(?:\|([^:\]]+):((?:\[[^\][]*]|[^\]])+?))?]/g
See the regex demo
Note that the lazy dot matching patterns are replaced with more restrictive negated character classes to make sure there is no overflow from one part of the bracketed string to another. To allow matching first-level nested [...] substrings, a \[[^\][]*]|[^\]] pattern is used, which matches either a [...] substring or any single char other than ].
Details:
\[ - a literal [
([^=\]]+) - Group 1 capturing 1+ chars other than = and ]
= - a = symbol
((?:\[[^\][]*]|[^\]])+?) - Group 2 capturing 1+ occurrences (as few as possible) of:
\[[^\][]*] - [, then 0+ chars other than ] and [, and then ]
| - or
[^\]] - any char other than ]
(?:\|([^:\]]+):((?:\[[^\][]*]|[^\]])+?))? - an optional group matching:
\| - a | char
([^:\]]+) - Group 3 capturing 1+ chars other than : and ]
: - a colon
((?:\[[^\][]*]|[^\]])+?) - Group 4 capturing the same pattern as Group 2.
] - a literal ] char.
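To see the groups in action, here is a short JavaScript sketch that runs the regex above over the question's four examples. The names tagPattern and parseTags and the property names of the result objects are hypothetical.

```javascript
const tagPattern =
  /\[([^=\]]+)=((?:\[[^\][]*]|[^\]])+?)(?:\|([^:\]]+):((?:\[[^\][]*]|[^\]])+?))?]/g;

// Returns one object per [key=value|filter:filtervalue] occurrence in text.
// filter and filterValue are undefined when the optional part is absent.
function parseTags(text) {
  return Array.from(
    text.matchAll(tagPattern),
    ([, key, value, filter, filterValue]) => ({ key, value, filter, filterValue })
  );
}

const tags = parseTags(
  '[data=example(products[0].id)] [data=example(products[index].id)] ' +
  '[data=regular] [data=test|number:2]'
);
console.log(tags);
```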

I would probably break it down to two separate regular expressions and "or" them.
Below I have used your expression to match the first kind of string:
\[(?:(.+?)=(.+)(?:\|(.+?)?\:(.+?)))\]
[key=value|filter:filtervalue]
And another for the second one:
\[(.+?)=(.+)\]
[data=example(products[0].id)]
Then concatenating them to:
\[(?:(.+?)=(.+)(?:\|(.+?)?\:(.+?))|(.+?)=(.+))\]
Where it first tries to match the tricky part and if that fails, resorts to the more general one.
https://regex101.com/r/d6LwEt/1

regex to match simple URLs does not work properly

I'm trying to write a simple regular expression to match simple URLs (without URL parameters etc.).
It seems to work, but there are still some problems.
This is my regex:
/(https|http|ftp):\/\/((-|[a-z0-9])+\.)+(com|org|net)\/?((-|[a-z0-9]\/?)+(-|[a-z0-9])*\.(css|js))?/ig
In this little list you can see what does not work properly:
HTTP://q-2Ud.a.q-2Ud.com/
https://q-2Ud.q-2Ud.q-2Ud.com
http://www.q-2Ud.q-2Ud.q-2Ud.com
http://www.q-2Ud.q-2Ud.q-2Ud.com/c ------------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/cs -----------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/css ----------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/csss ---------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/csss/css -----------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/css/yuyuyu/gyygug.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/h/.css -------------------------------> NOT WORK
http://www.q-2Ud.q-2Ud.q-2Ud.com/.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/k.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/kk.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/kkk.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/f-1.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/o/o.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/d-1/d-2/d-3/d-4/f-1.css
http://www.q-2Ud.q-2Ud.q-2Ud.com/q-2Ud/q-2Ud/q-2Ud/q-2Ud/q-2Ud.js
Demo Here

Your regex matches a path only when it ends in .css or .js.
Remove \.(css|js) and it should work.

/(https|http|ftp):\/\/((-|[a-z0-9])+\.)+(com|org|net)\/?\.?((-|[a-z0-9]\/?)+(-|[a-z0-9])*\/?(\.css|\.js)?)?/ig
This may catch all the ones that you are missing

You just need to arrange the groups a little better while maintaining validity.
This is trimmed to capture just the 4 main parts, without delimiters.
Edit: if you don't want to match .js or .css without a filename, use this regex:
(?i)(https|http|ftp)://((?:[a-z0-9-]+\.)+(?:com|org|net))(?:/(?:([a-z0-9-]+(?:/?[a-z0-9-])*(?:\.(css|js))?))?)?
otherwise use this one ->
# /(?i)(https|http|ftp):\/\/((?:[a-z0-9-]+\.)+(?:com|org|net))(?:\/(?:([a-z0-9-]+(?:\/?[a-z0-9-])*)\/?)?(?:\.(css|js))?)?/
(?i)
( https | http | ftp ) # (1)
://
( # (2 start)
(?:
[a-z0-9-]+
\.
)+
(?: com | org | net )
) # (2 end)
(?:
/
(?:
( # (3 start)
[a-z0-9-]+
(?:
/?
[a-z0-9-]
)*
) # (3 end)
/?
)?
(?:
\.
( css | js ) # (4)
)?
)?
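JavaScript does not accept the inline (?i) modifier, so the first regex of this answer (the one that rejects .css or .js without a filename) would use the i flag instead. The sketch below also adds ^ and $ anchors, which are an assumption made here so that each string is tested as a whole; the names simpleUrl and parseSimpleUrl are made up.

```javascript
// Case-insensitive via the i flag; anchors added so the whole string must match.
const simpleUrl =
  /^(https|http|ftp):\/\/((?:[a-z0-9-]+\.)+(?:com|org|net))(?:\/(?:([a-z0-9-]+(?:\/?[a-z0-9-])*(?:\.(css|js))?))?)?$/i;

// Returns the four captured parts, or null when the URL does not match.
// In this variant the path (group 3) includes the extension when present.
function parseSimpleUrl(url) {
  const m = simpleUrl.exec(url);
  if (!m) return null;
  const [, scheme, host, path, extension] = m;
  return { scheme, host, path, extension };
}

console.log(parseSimpleUrl('HTTP://q-2Ud.a.q-2Ud.com/'));
console.log(parseSimpleUrl('http://www.q-2Ud.q-2Ud.q-2Ud.com/c'));
console.log(parseSimpleUrl('http://www.q-2Ud.q-2Ud.q-2Ud.com/css/yuyuyu/gyygug.css'));
console.log(parseSimpleUrl('http://www.q-2Ud.q-2Ud.q-2Ud.com/.css'));
```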
