Optional regex string/pattern sections with and without non-capturing groups - javascript

Here's what I'm trying to do:
http://i.imgur.com/Xqrf8Wn.png
I want to match a URL with 3 groups. $1 is not so important; $2 and $3 are, but $2 is entirely optional, including (obviously) the corresponding slash when present, and that slash-plus-segment is all I am trying to make optional. I get that it can (should?) be in a non-capturing group, but does it HAVE to be? I've seen enough by now to suggest that it does not HAVE to be. If possible, I'd really like someone to explain it so I can fully understand it, rather than just being handed one working answer to copy.
Here are the regex strings I've tried; at best they currently match only the second URL string, the one with the optional part present:
^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)(.*)\/required\/B([A-Z0-9]{9}).*
^https:\/\/([a-z]{0,2})\.?blah\.com(\/)?(.*)\/required\/B([A-Z0-9]{9}).*
^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)?(.*)?\/required\/B([A-Z0-9]{9}).*
Here are the two URLs from which I want to capture groups 2 and 3, with 1 and 2 being optional, but $2 being the problem. I've tried all the strings above and have yet to get a match when the optional part is NOT present; I believe it must be due to the slashes?
https://blah.com/required/B7BG0Z0GU1A
https://blah.com/optional/required/B7BG0Z0GU1A

Making a part of the pattern optional is as simple as adding ?, and your last two attempts both work: https://regex101.com/r/RIKvYY/1
Your mistake is that your test is wrong - you are using ^ which matches the beginning of the string. You need to add the /m flag (multiline) to make it match the beginning of each line. This is the reason your patterns never match the second line...
Note that you're allowing two slashes (//required, for example). You can solve it by putting the first slash and the optional part into the same optional non-capturing group (of course, as long as you are using .* you can still match multiple slashes):
https:\/\/([a-z]{0,2})\.?blah\.com(?:\/(.*))?\/required\/B([A-Z0-9]{9}).*
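As a rough check of the points above, the sketch below runs the final pattern against both URLs from the question. The ^ anchor and the g and m flags are added here so that each line of a multi-line test string is tried separately; the property names in the result objects are made up for illustration.

```javascript
// Pattern from the answer, anchored with ^ and run with the m (multiline) flag
// so that ^ matches at the start of every line, not only at the start of the string.
const pattern = /^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/(.*))?\/required\/B([A-Z0-9]{9}).*/gm;

const input = [
  "https://blah.com/required/B7BG0Z0GU1A",
  "https://blah.com/optional/required/B7BG0Z0GU1A",
].join("\n");

// Hypothetical result shape: one object per matched line.
const results = [...input.matchAll(pattern)].map((m) => ({
  subdomain: m[1], // "" when there is no subdomain
  optional: m[2],  // undefined when the optional group did not participate
  id: m[3],
}));

console.log(results);
```

When the optional group does not take part in the match, its inner capture is reported as undefined rather than as an empty string, which is worth checking for before using $2.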

Javascript -- Regex -- Blacklist of multiple words to END with a partial match

I've read many Questions on StackOverflow, including this one, this one, and even read Rexegg's Best Trick, which is also in a question here. I found this one, which works on entire lines, but not "everything up to the bad word". None of these have helped me, so here I go:
In Javascript, I have a long regex pattern. I'm trying to match a sequence in similar sentence structures, like follows:
1 UniquePrefixA [some-token] and [some-token] want to take [some-token] to see some monkeys.
2 UniqueC [some-token] wants to take [some-token] to the store. UniqueB, [some-token] is in the pattern once more.
3 UniquePrefixA [some-token] is using [some-token] to [some-token].
Notice that each pattern starts with a unique prefix. Encountering that prefix signals the start of a pattern. If I encounter a prefix again during capture, I should not capture the second occurrence, and STOP THERE. I'll have captured everything up to that prefix.
If I don't encounter the prefix later in the pattern, I need to continue matching that pattern.
I'm also using capture groups (not repeating, since Capture Groups only return the last matched of that group). The capture group contents need to be returned, so I'm using match, non-greedy.
Here's my pattern and a working example
/(?:UniquePrefixA|UniqueB|UniqueC)\s*(\[some-token\])(?:and|\s)*(\[some-token\])?(\s|[^\[\]])*(\[some-token\])? --->(\s|[^\[\]])*<--- (\[some-token\])?(\s|[^\[\]])*/i
It's basically 2 repeating patterns in a specific order:
(\s|[^\[\]])* // Basically .*, but excluding brackets
(\[some-token\]) // A token [some-token]
How can I prevent the match from continuing past a blacklist of words?
I want this to happen where I drew three arrows, for context. The equivalent of Any character, but not the contents of this list: (UniquePrefixA|UniqueB|UniqueC) (as seen in capture group 1).
It's possible I need a better understanding of negative lookahead, or of whether it can work with a group of things. Most importantly, I'm looking to know if a negative lookahead approach can support a list of options. Or is there a better way altogether? If the answer is "you can't do that," that's cool too.
I think an easier-to-maintain solution is to divide your task into 2 parts:
1. Find each chunk of text starting from any of your unique prefixes, up to the next prefix or to the end of the string.
2. Process each such chunk, looking for your [some-token]s and maybe also the content between them.
The regex performing the first task should include 3 parts:
(?:UniquePrefixA|UniqueB|UniqueC) - A non-capturing group looking
for any unique prefix.
((?:.|\n)+?) - A capturing group - the fragment to catch for further
processing (see the note below).
(?=UniquePrefixA|UniqueB|UniqueC|$) - A positive lookahead, looking
for either any unique prefix or the end of the string (a stop criterion
you are looking for).
To sum up, the whole regex looks like below:
/(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi
Note: Unfortunately, the JavaScript flavour of regex has historically not implemented
the single-line (s) option (engines that support ES2018 do provide an s flag).
So, instead of just . in the capturing group above, this pattern uses (?:.|\n), meaning:
either any char other than \n (.),
or just \n.
Both these variants are "enveloped" into a non-capturing group,
to put limits of variants (both sides of |), because the repetition
marker (+?) pertains to both variants.
Note ? after +, meaning the reluctant version.
So this part of regex (the capturing group) will match any sequence of chars
including \n, ending before the next unique prefix (if any),
just as you expect.
The second task is to apply another regex to the captured chunk (group 1),
looking for [some-token]s and possibly the content between them.
You didn't specify exactly what you want to do with each chunk,
so I'm not sure what this second regex should include.
Maybe it will be enough just to match [some-token]?
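The two-step approach described above could be sketched as follows. The sample text is taken from the question; the variable names and the shape of the result objects are assumptions made for illustration, and the second step simply collects every [some-token] in a chunk.

```javascript
// Step 1: split the text into chunks, each starting at a unique prefix and
// ending before the next prefix or at the end of the string.
const prefixes = "UniquePrefixA|UniqueB|UniqueC";
const chunkPattern = new RegExp(
  `(?:${prefixes})((?:.|\\n)+?)(?=${prefixes}|$)`,
  "gi"
);

// Step 2: a second, much simpler regex applied to each captured chunk.
const tokenPattern = /\[some-token\]/g;

const text =
  "UniqueC [some-token] wants to take [some-token] to the store. " +
  "UniqueB, [some-token] is in the pattern once more.";

const chunks = [...text.matchAll(chunkPattern)].map((m) => ({
  body: m[1],                              // text after the prefix
  tokens: m[1].match(tokenPattern) || [],  // every token found in the chunk
}));

console.log(chunks.map((c) => c.tokens.length)); // [2, 1]
```

Keeping the prefix list in one string means the stop criterion in the lookahead stays in sync with the prefixes that start a chunk.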
To ensure a pattern does not occur inside a repeated character sequence such as (\s|[^\[\]])* (note that \s is already included in [^\[\]], so this may be just [^\[\]]*), prepend a negative lookahead (a zero-length assertion, like ^) inside the repeated pattern, at its left, so that it is checked before every character:
((?!UniquePrefixA)(\s|[^\[\]]))*
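A negative lookahead accepts an alternation, so the same idea extends to the whole list of prefixes. The sketch below is a shortened version of the question's pattern (only two tokens), written under that assumption; stopWords and filler are names invented here.

```javascript
// The list of words the repeated "filler" part must never run past.
const stopWords = "UniquePrefixA|UniqueB|UniqueC";

// Before consuming each non-bracket character, assert that none of the
// stop words starts at the current position.
const filler = `(?:(?!${stopWords})[^\\[\\]])*`;

const pattern = new RegExp(
  `(?:${stopWords})\\s*(\\[some-token\\])${filler}(\\[some-token\\])?${filler}`,
  "i"
);

const text =
  "UniqueC [some-token] wants to take [some-token] to the store. " +
  "UniqueB, [some-token] is in the pattern once more.";

const match = text.match(pattern);
console.log(match[0]);
// "UniqueC [some-token] wants to take [some-token] to the store. "
```

The match stops right before UniqueB because the lookahead fails there and the repetition cannot consume the U.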

Making part of JavaScript regex optional

I have the following original strings:
# Original strings
js/main_dev.js # This one is just to check for false positives
js/blog/foo.js
js/blog/foo.min.js
I'd like to run a substitution regex on each one and end up with this:
# Desired result
js/main_dev.js
js/blog/foo_dev.js
js/blog/foo_dev.min.js
So basically, add _dev before .min.js if it exists, if not, before .js. That should only apply on strings that start with js/blog or js/calculator and end in .js.
I initially started out with (js\/(?:blog|calculators)\/.+)(\.min\.js)$ as a regex and then $1_dev$2 as substitution. That works for my third string, but not my second obviously as I am "hard" looking for .min.js. So I figured I'd make .min optional by throwing it in a non capture group with a ? like this: (js\/(?:blog|calculators)\/.+)((?:\.min)?\.js)$. Now this works for my second string but my third string is all out of whack:
js/main_dev.js # Good
js/blog/foo_dev.js # Good
js/blog/foo.min_dev.js # Bad! _dev should be before the .min
I've tried many permutations on this, including simply doing (\.min\.js|\.js) but to no avail, all result in the same behavior. Here is a Regex 101 paste with the "bad" regex: https://regex101.com/r/bH3yP6/1
What am I doing wrong here?
Try throwing a ? after the + at the end of your first group (the capturing one) to make it lazy (non-greedy):
(js\/(?:blog|calculators)\/.+?)((?:\.min)?\.js)$
(?:\.min)? is optional and .+ is greedy, so .+ was capturing .min.
https://regex101.com/r/bH3yP6/3
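To see the difference, here is a small sketch that applies both the greedy and the lazy version to the three strings from the question, using the same $1_dev$2 substitution. The variable names are made up for illustration.

```javascript
const paths = ["js/main_dev.js", "js/blog/foo.js", "js/blog/foo.min.js"];

// Greedy .+ swallows ".min", leaving only ".js" for the second group.
const greedy = /(js\/(?:blog|calculators)\/.+)((?:\.min)?\.js)$/;
// Lazy .+? stops as early as possible, so the optional ".min" can match.
const lazy = /(js\/(?:blog|calculators)\/.+?)((?:\.min)?\.js)$/;

const withGreedy = paths.map((p) => p.replace(greedy, "$1_dev$2"));
const withLazy = paths.map((p) => p.replace(lazy, "$1_dev$2"));

console.log(withGreedy); // [ 'js/main_dev.js', 'js/blog/foo_dev.js', 'js/blog/foo.min_dev.js' ]
console.log(withLazy);   // [ 'js/main_dev.js', 'js/blog/foo_dev.js', 'js/blog/foo_dev.min.js' ]
```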

Regex - Don't match a group if it starts with a string in javascript

I'm struggling with some regex, in javascript which doesn't have a typical lookbehind option, to only match a group if it's not preceded with a string:
(^|)(www\.[\S]+?(?= |[,;:!?]|\.( )|$))
so in the following
hello http:/www.mytestwebsite.com is awesome
I'm trying to detect if the www.mytestwebsite.com is preceded by
/
and if it is I don't want to match, otherwise match away. I tried using a look ahead but it looked to be conflicting with the look ahead I already had.
I've been playing around with placing (?!&#x2f) in different areas with no success.
(^|)((?!&#x2f)www\.[\S]+?(?= |[,;:!?]|\.( )|$))
A look ahead to not match if the match is preceded
Due to lack of lookbehinds in JS, the only way to accomplish your goal
is to match those web sites that contain the errant / as well.
This is because a lookahead won't advance the current position.
Only a match on consumable text will advance the position.
But, a good workaround has always been to include the errant text as an option
within the regex. You'd put some capture groups around it, then test the
group for a match. If it matched, skip, go on to next match.
This requires sitting in a while loop checking each successful match.
In the below regex, if group 1 matched, don't store the group 2 url,
If it didn't, store the group 2 url.
(/)?(www\.\S+?(?= |[,;:!?]|\.( )|$))
Formatted:
 ( &\#x2f; )?             # (1)
 (                        # (2 start)
      www\. \S+?
      (?=
           &\#x20;
        |  [,;:!?]
        |  \.
           ( &\#x20; )    # (3)
        |  $
      )
 )                        # (2 end)
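The loop described above might look like the sketch below. It uses the single-line form of the regex with the g flag; the second URL in the sample text (www.example.org) is a made-up addition so that there is something to keep.

```javascript
// Group 1 optionally consumes the unwanted "/"; group 2 is the URL itself.
const pattern = /(\/)?(www\.\S+?(?= |[,;:!?]|\.( )|$))/g;

const text =
  "hello http:/www.mytestwebsite.com is awesome, see www.example.org too";

const urls = [];
let match;
while ((match = pattern.exec(text)) !== null) {
  // If group 1 took part in the match, the URL was preceded by "/": skip it.
  if (match[1] === undefined) {
    urls.push(match[2]);
  }
}

console.log(urls); // [ 'www.example.org' ]
```

Because the slash is consumed as part of the match, the engine cannot start a fresh match at the www that follows it, which is what makes the skip reliable.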
Another option (and I've done zero performance testing) would be to use string.replace() with a regex and a callback as the second parameter.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace
Then, inside the replace callback, you can check for the characters you don't want before or after the matched string: the offset parameter passed to the callback (see the docs above) gives each match's position, so you can decide whether to replace the text or not.
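A minimal sketch of that callback approach, assuming the goal is to wrap accepted URLs in angle brackets (the wrapping and the www.example.org sample are invented for illustration). With one capturing group in the regex, the callback receives the match, the group, the offset, and the whole string, in that order.

```javascript
const pattern = /www\.\S+?(?= |[,;:!?]|\.( )|$)/g;
const text = "hello http:/www.mytestwebsite.com and www.example.org";

const result = text.replace(pattern, (matched, group1, offset, whole) => {
  // Look at the character just before the match to emulate a lookbehind.
  const precededBySlash = offset > 0 && whole[offset - 1] === "/";
  // Returning the match unchanged leaves that part of the text as it was.
  return precededBySlash ? matched : `<${matched}>`;
});

console.log(result);
// "hello http:/www.mytestwebsite.com and <www.example.org>"
```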

Understanding replacement using regex

I want to remove all trailing and leading dashes (-) and replace any repeating dashes with one dash otherwise in JavaScript. I've developed a regex to do it:
"----asdas----asd-as------q---".replace(/^-+()|()-+$|(-)+/g,'$3')
And it works:
asdas-asd-as-q
But I don't understand the $3 part (obtained through desperate experiment). Why not $1?
You can actually use this without any capturing groups:
"----asdas----asd-as------q---".replace(/^-+|-+$|-+(?=-)/g, '');
//=> "asdas-asd-as-q"
Here -+(?=-) uses a positive lookahead so that it matches 1 or more hyphens but leaves the last - of the run out of the match.
Because there are 3 capturing groups (two redundant empty ones and (-)), $3 is replaced with the string that matched the third group.
If you remove the first two empty capturing groups, you can use $1.
"----asdas----asd-as------q---".replace(/^-+|-+$|(-)+/g, '$1')
// => "asdas-asd-as-q"
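One way to see why $3 is the group that matters is to pass a callback to replace and record what each group holds for every match. This is only an inspection sketch; the seen array is a name made up here.

```javascript
const input = "----asdas----asd-as------q---";
const seen = [];

const output = input.replace(/^-+()|()-+$|(-)+/g, (match, g1, g2, g3) => {
  seen.push([match, g1, g2, g3]);
  // Same effect as the '$3' replacement string: a group that did not
  // take part in the match contributes an empty string.
  return g3 === undefined ? "" : g3;
});

console.log(seen);
// [ [ '----', '', undefined, undefined ],
//   [ '----', undefined, undefined, '-' ],
//   [ '-', undefined, undefined, '-' ],
//   [ '------', undefined, undefined, '-' ],
//   [ '---', undefined, '', undefined ] ]
console.log(output); // "asdas-asd-as-q"
```

Only the third alternative ever captures a dash, so $1 and $2 are always empty and $3 is a single - for the inner runs.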
As other answers say, $3 indicates the third captured subpattern, ie. third set of parentheses.
Personally, however, I would see that as two operations, and do it as such:
Trim leading and trailing -s
Condense duplicate -s
Like so:
"----asdas----asd-as------q---".replace(/^-+|-+$/g,"").replace(/--+/g,"-");
This kind of concept may mean more code, but I believe it makes it much easier to read and understand what's going on here, because you're doing one thing at a time instead of trying to do everything at once.
$1, $2 and $3 refer to the capture groups formed by the parentheses.
See demo.
http://regex101.com/r/pP3pN1/25
On the right side you can see the groups being generated by ().
Try the replacement and see: $1 is blank in your case.

regular expression for ends with some word

I want to build a regular expression for the series
cd1_inputchk,rd_inputchk,optinputchk where inputchk is common (the ending characters).
Please guide me on this.
Very simply, it's:
/inputchk$/
This works on a per-word basis (testing a single word with /inputchk$/.test(word) ? 'matches' : 'doesn\'t match';). It works because it matches "inputchk" only when it comes at the end of a string (hence the $).
As for a list of words, it starts becoming more complicated.
Are there spaces in the list?
Are they needed?
I'm going to assume no is the answer to both questions, and also assume that the list is comma-separated.
There are then a couple of ways you could proceed. You could use list.split() to get an array of each word, and test each to see if it ends in inputchk, or you could use a modified regular expression:
/[^,]*inputchk(?:,|$)/g
This one's much more complicated.
[^,] says to match non-, characters
* then says to match 0 or more of those non-, chars. (it will be greedy)
inputchk matches inputchk
(?:...) is a non-capturing parenthesis. It says to match the characters, but not store them as a separate group in the result (they are still part of the overall match, so a matched , is included).
, matches the , character
| says match one side or the other
$ says to match the end of the string
Hopefully all of this together will select the strings that you're looking for, but it's very easy to make a mistake, so I'd suggest doing some rigorous testing to make sure there aren't any edge-conditions that are being missed.
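Both approaches can be compared side by side with a short sketch. The list below adds a made-up non-matching entry (other), and the regex variant wraps the word in a capturing group so that the trailing comma is not part of what gets collected; that extra group is an addition to the pattern above, not part of it.

```javascript
// Comma-separated list with no spaces, as assumed above.
const list = "cd1_inputchk,rd_inputchk,optinputchk,other";

// Approach 1: split, then test each word with the simple end-anchored regex.
const viaSplit = list.split(",").filter((word) => /inputchk$/.test(word));

// Approach 2: one regex over the whole list; group 1 holds the word
// without the comma that the non-capturing group consumes.
const viaRegex = [...list.matchAll(/([^,]*inputchk)(?:,|$)/g)].map((m) => m[1]);

console.log(viaSplit); // [ 'cd1_inputchk', 'rd_inputchk', 'optinputchk' ]
console.log(viaRegex); // [ 'cd1_inputchk', 'rd_inputchk', 'optinputchk' ]
```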
This one should work (dollar sign basically means "end of string"):
/inputchk$/
