Non-capturing group ignored by regex101.com - javascript

This question follows on from my previous question about regular expressions. I am struggling to understand why I get different results, and I am worried that there may be a bug in a parsing library or something else.
The initial question was how to replace every :/ in a given string, except those that appear inside tags in that string. The initial string is:
not feeling well today :/ check out this link http://example.com
I tried the following regexp to replace only the first :/ in the example. A non-capturing group is used to skip occurrences inside tags:
/(?:<[^\/]*?.*?<\/.*?>)|(:\/)/g
What surprised me most is that this regexp gives different results depending on the tool or language used. Here is a brief summary of the results I got:
regex101.com shows 1 match!!
regexpal.com shows 2 matches
regexr.com shows 2 matches
regextester.com shows 2 matches
Below is also a JavaScript snippet that checks the same regexp. As you can see, its result also differs from what I expected: there are 2 matches, so 2 replacements occur.
var s = 'not feeling well today :/ check out this link http://example.com';
var replaced = s.replace(/(?:<[^\/]*?.*?<\/.*?>)|(:\/)/g, "smiley_image_here");
document.querySelector("pre").textContent = replaced;
<pre></pre>
It seems that the non-capturing group is simply ignored.
So what is wrong, why do the results differ, and what is the correct regexp to solve the initial question?

regex101 also returns 2 matches, as you can see on the label:
and the 2 different colors in the text
It is indeed a bit confusing if you look at the MATCH INFORMATION section. However, that is only intended to show you captures, not necessarily matches:
You can also test this by replacing each match with some string:
https://regex101.com/r/kY6vI5/2
The non-capturing group is not ignored. It simply doesn't create a capture, but it is in fact matched.
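A minimal sketch of what this means in practice, assuming the link in the original string was wrapped in an <a> tag (the markup in `s` below is an assumption made for illustration). Both alternatives produce a match, but only the :/ alternative fills capture group 1, so a replacement callback can leave the tag matches untouched:

```javascript
// Assumed input: the <a> tag is hypothetical, added so the first alternative has something to match.
const s = 'not feeling well today :/ check out this link <a href="http://example.com">http://example.com</a>';
const re = /(?:<[^\/]*?.*?<\/.*?>)|(:\/)/g;

// Every match is listed, whether or not it produced a capture.
const matches = [...s.matchAll(re)];
console.log(matches.length); // 2
console.log(matches[0][1]);  // ":/" (the smiley, captured by group 1)
console.log(matches[1][1]);  // undefined (the tag matched, but nothing was captured)

// Replace only when group 1 participated; otherwise put the matched tag back unchanged.
const replaced = s.replace(re, (match, smiley) =>
  smiley === undefined ? match : 'smiley_image_here'
);
console.log(replaced);
```

With a plain string as the second argument to replace, both matches would be replaced, which is the behaviour described in the question.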

Related

RegEx group characters into a word

I want a regex to match (or not match) only when a set of characters appears together in a sequence I define (like a word), and not as separate characters. (I am using JavaScript for this particular example.)
I am making a regex for Discord IDs following the rules set out in https://discord.com/developers/docs/resources/user and here is what I have so far:
/^(.)[^##]+[#][0-9]{4}$/
For those who don't want to open the page, the rules are:
1- The first part can contain any number of any characters except #, #, and ''' (the third is not handled yet).
2- The second part can only be a # character.
3- The third part should be a 4-digit number.
Everything works, except that I want my regex to allow ', '' or even '''''' but not ''', so that only the entire "word" or set of characters is detected. How can I make it work?
Edited:
Adding this since the question seems to be vague and has caused confusion: the answer to the main question would be to add a negative lookahead ((?!''')) for the word you want to exclude to the relevant part of the regex. However, for '''''' to be allowed as I asked in my question, since '''''' itself contains ''', it is no longer just a matter of finding the word but also of checking what comes before and after it, in which case the accepted answer is correct.
I explained my real situation, but another example would be to allow # and # but not ##.
(Also, for those wondering: I changed the ``` character sequence defined by the Discord developers to ''' because the former would have interfered with Stack Overflow's code formatting. The length is controlled via JS, not the regex, and I am ignoring spaces for the sake of simplicity in this case.)
To disallow a run of exactly three ' characters (while still allowing longer runs), and provided lookbehind is supported, you can use a negative lookahead.
The single capture group at the start, (.), can be folded into the negated character class [^##\n]+ if you don't need to reuse its value afterwards.
^(?!.*(?<!')'''(?!'))[^##\n]+#[0-9]{4}$
Regex demo
^ Start of string
(?!.*(?<!')'''(?!')) Negative lookahead, assert that the string does not contain ''' that is neither preceded nor followed by another ' (a run of exactly three)
[^##\n]+ Match 1+ times any char except the listed
#[0-9]{4} match # and 4 digits
$ End of string
Note that # does not have to be in a character class ([#]), and if you don't want the match to cross newlines, you can add \n to the negated character class.
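A quick way to check the pattern above is to run it against a few IDs. This is only a sketch: the sample IDs are made up for illustration, the pattern is copied as written, and it assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Pattern copied from the answer above; [^##\n] is kept exactly as written.
const idPattern = /^(?!.*(?<!')'''(?!'))[^##\n]+#[0-9]{4}$/;

// Hypothetical sample IDs, chosen to exercise runs of 2, 3, 4 and 6 quotes.
const samples = ["foobar#1234", "f''bar#1234", "f'''ar#1234", "f''''r#1234", "''''''#1234"];

const results = samples.map((id) => idPattern.test(id));
samples.forEach((id, i) => console.log(id, results[i]));
// Only "f'''ar#1234" is rejected; the runs of 2, 4 and 6 quotes are accepted.
```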
This should suit your needs:
^('(?!'')|[^##'])+#\d{4}$
The first part was your issue, '(?!'')|[^##'] means:
either ' if not followed by ''
or any char except #, # and ' (as already handled above)
See demo.
For the sake of completeness, the following will forbid any multiple of 3 consecutive ', so ''', '''''', etc.:
'(?!'')|'''(?=')|[^##']
'''(?='): ''' as long as followed by another '
See demo.
The following will forbid exactly 3 consecutive ', but will allow any other occurrence (including '''''' for example):
'(?!'')|''''+|[^##']
''''+: four or more ' (could be rewritten '{4,})
See demo.
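The three alternations above differ only in which runs of ' they tolerate. The sketch below wraps each one in the same ^( ... )+#\d{4}$ frame as the first pattern (that wrapping, the variable names, and the sample IDs are my assumptions) so the difference can be seen side by side:

```javascript
// Each pattern reuses the frame of the first regex in this answer.
const noRunOfThreeOrMore = /^('(?!'')|[^##'])+#\d{4}$/;
const noMultipleOfThree = /^('(?!'')|'''(?=')|[^##'])+#\d{4}$/;
const noExactlyThree = /^('(?!'')|''''+|[^##'])+#\d{4}$/;

// Hypothetical IDs with runs of 2, 3, 4 and 6 quotes.
const ids = ["f''bar#1234", "f'''ar#1234", "f''''r#1234", "''''''#1234"];

const table = ids.map((id) => ({
  id,
  first: noRunOfThreeOrMore.test(id),
  second: noMultipleOfThree.test(id),
  third: noExactlyThree.test(id),
}));
console.log(table);
```

Note that the first pattern rejects any run of three or more quotes, so only the third variant accepts '''''' as asked in the question.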
Keep in mind that, while regexes can be very entertaining, in practice an extremely complex regex is usually a sign that someone got fixated on regex and didn't consider an easier approach.
Consider this advice from Jeff Atwood:
Regular expressions are like a particularly spicy hot sauce – to be used in moderation and with restraint only when appropriate. Should you try to solve every problem you encounter with a regular expression? Well, no. Then you'd be writing Perl, and I'm not sure you need those kind of headaches. If you drench your plate in hot sauce, you're going to be very, very sorry later.
...
Let me be very clear on this point: If you read an incredibly complex, impossible to decipher regular expression in your codebase, they did it wrong. If you write regular expressions that are difficult to read into your codebase, you are doing it wrong.
I don't know your situation, but it sounds like it would be much easier to look for a bad ID than to try to define a good ID. If you can break this into two steps, then the logic will be easier to read and maintain.
Verify that the final part of the ID is as expected (/#\d{4}/)
Verify that the first part of the ID does not have any invalid characters or sequences
function isValid(id) {
  const idPrefix = /(.+)#\d{4}/.exec(id)?.[1];
  if (idPrefix === undefined) return false; // The #\d{4} postfix was missing
  // If we find an illegal character or sequence, then the id is not valid:
  return !(/[##]|(^|[^'])(''')($|[^'])/.test(idPrefix));
}
That second regex is a bit long, but here's how it breaks down:
If the ID contains a # or # then it's not legal.
Check for a sequence of ''' that IS NOT surrounded by a fourth '. Also take the beginning and end of the string into account. If we find a sequence of exactly three ', then it's not legal.
The result:
isValid("foobar#1234") // true
isValid("f#obar#1234") // false
isValid("f#obar#1234") // false
isValid("f''bar#1234") // true
isValid("f'''ar#1234") // false
isValid("f''''r#1234") // true

RegEx start with, contains these, not end with?

I need a regular expression to find all comparison parts in the following example.
var magicalRegex = "";
var example = '({Color}=="red"|{Color}=="yellow")&{Size}>32';
example.replace(magicalRegex, function (matchedBlock) {
  console.log(matchedBlock);
});
//so i want to see the following result on console
//{Color}=="red"
//{Color}=="yellow"
//{Size}>32
I have made an attempt but couldn't complete it. Here is the pattern I have so far:
\{.*?(==|>)
https://regex101.com/r/aodDeX/1
Thanks
Answer
According to the example you have on regex101 as well as the string you have in your code snippet (two different strings) the following regex will do exactly what you want.
Answer 1
({.*?}(?:==|>)(?:\d+|(?:(["']?).*?\2)))
You can see this regex in use here
Answer 2
Note that I've added both single and double quotes in the above regex. If you only need double quotes, use the following regex.
({.*?}(?:==|>)(?:\d+|".*?"))
You can see this regex in use here
Explanation
These regular expressions work as follows:
Match {, followed by any character (except newline) any number of times, but as few matches as possible, followed by }
Match == or >
Match a digit one to unlimited times or match a quoted string (any character any number of times, but as few matches as possible) e.g. "something"
The regex captures the entire section and if you look at the examples on regex101 as presented, you can see what each capture group is matching. You can remove the capture groups if this is not the intended use.
Expected Results
Input
Note that the two strings below were used for testing purposes. One string is present in the question and the other is present in the link provided by the OP.
({Renk}=="kirmizi"or{Renk}=="sari")or{Size}>32
({Color}=="red"|{Color}=="yellow")&{Size}>32
Output
Note that the output below shows what is matched, which is also capture group 1 (since the whole regex is in a capture group). Any other groups are disregarded, as they are not important to the overall question/answer.
{Renk}=="kirmizi"
{Renk}=="sari"
{Size}>32
{Color}=="red"
{Color}=="yellow"
{Size}>32
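The output above can be reproduced with a short sketch based on the double-quote-only regex from Answer 2. The outer capture group is dropped and the braces are escaped here, which is my own choice and not required by the answer; `findComparisons` is a hypothetical helper name.

```javascript
// Answer 2's pattern without the outer capture group, with the global flag added.
const comparison = /\{.*?\}(?:==|>)(?:\d+|".*?")/g;

// Hypothetical helper: returns every comparison found in the expression.
function findComparisons(expression) {
  return expression.match(comparison) || [];
}

const fromQuestion = findComparisons('({Color}=="red"|{Color}=="yellow")&{Size}>32');
const fromLink = findComparisons('({Renk}=="kirmizi"or{Renk}=="sari")or{Size}>32');
console.log(fromQuestion); // [ '{Color}=="red"', '{Color}=="yellow"', '{Size}>32' ]
console.log(fromLink);     // [ '{Renk}=="kirmizi"', '{Renk}=="sari"', '{Size}>32' ]
```

Using match with the g flag returns the matched text directly, which avoids relying on replace purely for its callback as in the question's snippet.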

Regex - why empty parenthesis?

I have a regex to update, and there is an empty pair of parentheses in it. I am wondering what its purpose is, as I can't find anything about it.
The regex:
(DE)()([0-9]{1,12})
If it is useless, I can remove it.
There is one possible application for empty parentheses that I'm aware of, and that is if you plan to use a regex to determine if a certain string matches a permutation of sub-regexes.
For example,
^(?:A()|B()|C()){3}\1\2\3$
will match ABC or CBA or BCA but not AAA or BCC etc.
But it doesn't look like that's what the author of your regex was going for.
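One caveat if this is tried in JavaScript (an observation about ECMAScript regex semantics, not something the answer above claims): captures inside a repeated group are reset on every iteration, and a backreference to a group that did not participate matches the empty string, so the pattern does not enforce a permutation there. A small sketch showing this, together with a lookahead-based alternative (the `permutation` pattern is my own illustration):

```javascript
// The empty-group trick from the answer, evaluated by a JavaScript engine.
const trick = /^(?:A()|B()|C()){3}\1\2\3$/;
console.log(trick.test('ABC')); // true
console.log(trick.test('AAA')); // also true in JavaScript, so no permutation check happens

// Alternative sketch: require each letter at least once, then exactly three letters.
const permutation = /^(?=.*A)(?=.*B)(?=.*C)[ABC]{3}$/;
console.log(permutation.test('CBA')); // true
console.log(permutation.test('AAA')); // false
console.log(permutation.test('BCC')); // false
```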
Maybe (and only maybe) other code uses the capturing groups by their numbers.
It once happened to me that I changed the parentheses in a regex, which changed the numbering of the capturing groups, and the rest of the code stopped working because it depended on those group numbers.
I recommend that you check whether this is the case before removing the parentheses.
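To illustrate the renumbering risk with the regex from the question, here is a small sketch (the input string is a made-up sample). With the empty group in place the digits are group 3; once it is removed they become group 2, so any code reading group 3 would silently get undefined:

```javascript
const withEmptyGroup = /(DE)()([0-9]{1,12})/;
const withoutEmptyGroup = /(DE)([0-9]{1,12})/;

const input = 'DE12345'; // hypothetical sample value

const before = withEmptyGroup.exec(input);
const after = withoutEmptyGroup.exec(input);

console.log(before[2], before[3]); // "" "12345"
console.log(after[2], after[3]);   // "12345" undefined
```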

regular expression for ends with some word

I want to build a regular expression for the series
cd1_inputchk, rd_inputchk, optinputchk, where inputchk is common (the ending characters).
Please guide me on how to do this.
Very simply, it's:
/inputchk$/
That works on a per-word basis, for example /inputchk$/.test(word) ? 'matches' : 'doesn\'t match';. It works because it matches "inputchk" only when it comes at the end of a string (hence the $).
As for a list of words, it starts becoming more complicated.
Are there spaces in the list?
Are they needed?
I'm going to assume no is the answer to both questions, and also assume that the list is comma-separated.
There are then a couple of ways you could proceed. You could use list.split(',') to get an array of the words and test each one to see if it ends in inputchk, or you could use a modified regular expression:
/[^,]*inputchk(?:,|$)/g
This one's much more complicated.
[^,] says to match non-, characters
* then says to match 0 or more of those non-, chars. (it will be greedy)
inputchk matches inputchk
(?:...) is a non-capturing group. Its contents are matched as usual and are part of the overall match, but they are not stored as a separate captured group.
, matches the , character
| says match one side or the other
$ says to match the end of the string
Hopefully all of this together will select the strings that you're looking for, but it's very easy to make a mistake, so I'd suggest doing some rigorous testing to make sure there aren't any edge-conditions that are being missed.
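Both approaches can be sketched as follows, assuming a comma-separated list without spaces (the "other" entry is a made-up non-matching item). The regex variant here swaps the non-capturing group for a lookahead, (?=,|$), so the trailing comma is not included in each match; that substitution is mine and not part of the answer above.

```javascript
// Hypothetical list: "other" is added to show an entry that should be skipped.
const list = 'cd1_inputchk,rd_inputchk,other,optinputchk';

// Approach 1: split the list and test each word on its own.
const viaSplit = list.split(',').filter((word) => /inputchk$/.test(word));

// Approach 2: a single global regex; the lookahead keeps the comma out of the match.
const viaRegex = list.match(/[^,]*inputchk(?=,|$)/g) || [];

console.log(viaSplit); // [ 'cd1_inputchk', 'rd_inputchk', 'optinputchk' ]
console.log(viaRegex); // [ 'cd1_inputchk', 'rd_inputchk', 'optinputchk' ]
```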
This one should work (dollar sign basically means "end of string"):
/inputchk$/

Struggling with regex to match only two of a character, not three

I need to match all occurrences of // in a string with a JavaScript regex.
It must not match /// or /
So far I have (.*[^\/])\/{2}([^\/].*)
which is basically "something that isn't /, followed by // followed by something that isn't /"
The approach seems to work apart from when the string I want to match starts with //
This doesn't work:
//example
This does
stuff // example
How do I solve this problem?
Edit: A bit more context - I am trying to replace // with !, so I am then using:
result = result.replace(myRegex, "$1 ! $2");
Replace two slashes that either begin the string or do not follow a slash,
and are followed by anything not a slash or the end of the string.
s=s.replace(/(^|[^/])\/{2}([^/]|$)/g,'$1!$2');
It looks like it wouldn't work for example// either.
The problem is that you're matching // preceded and followed by at least one non-slash character. This can be solved by anchoring the regex, and then you can make the preceding/following text optional:
^(.*[^\/])?\/{2}([^\/].*)?$
Use negative lookahead/lookbehind assertions:
(.*)(?<!/)//(?!/)(.*)
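In JavaScript the lookaround idea can be applied directly with replace, without the surrounding (.*) groups. This is a sketch assuming an engine that supports lookbehind (ES2018 or later); `replaceDoubleSlash` is a hypothetical helper name. Because lookarounds do not consume the neighbouring characters, adjacent occurrences such as a//b//c are both replaced.

```javascript
// Replace every run of exactly two slashes with "!".
function replaceDoubleSlash(text) {
  // (?<!\/) : not preceded by a slash; (?!\/) : not followed by a slash.
  return text.replace(/(?<!\/)\/\/(?!\/)/g, '!');
}

console.log(replaceDoubleSlash('//example'));        // "!example"
console.log(replaceDoubleSlash('stuff // example')); // "stuff ! example"
console.log(replaceDoubleSlash('example//'));        // "example!"
console.log(replaceDoubleSlash('a///b'));            // "a///b" (unchanged)
console.log(replaceDoubleSlash('a//b//c'));          // "a!b!c"
```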
Use this:
/([^/]*)(\/{2})([^/]*)/g
e.g.
alert("///exam//ple".replace(/([^/]*)(\/{2})([^/]*)/g, "$1$3"));
EDIT: Updated the expression as per the comment.
/[/]{2}/
e.g:
alert("//example".replace(/[/]{2}/, ""));
This does not answer the OP's question about using regex. However, some of the original comments suggested using .replaceAll, not everyone who reads this question in the future will want to use regex, people might mistakenly assume that regex is the only option, and these details do not fit in a comment. So here's a poor man's non-regex approach:
Temporarily replace the three contiguous characters with something that would never naturally occur — really important when dealing with user-entered values.
Replace the remaining two contiguous characters using .replaceAll().
Restore the original three contiguous characters.
For instance, let's say you wanted to remove all instances of ".." without affecting occurrences of "...".
var cleansedText = $(this).text().toString()
    .replaceAll("...", "☰☸☧")
    .replaceAll("..", "")
    .replaceAll("☰☸☧", "...")
;
$(this).text(cleansedText);
Perhaps not as fast as regex for longer strings, but works great for short ones.
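The same three steps can be written as a plain function without jQuery. This is a sketch: `removeDoubleDots` and its `placeholder` parameter are hypothetical names, and it uses split/join in place of .replaceAll so that it also runs on engines without String.prototype.replaceAll. The default placeholder is assumed never to occur in the input.

```javascript
// Remove ".." while leaving "..." intact, using a temporary placeholder.
function removeDoubleDots(text, placeholder = '\u0000') {
  return text
    .split('...').join(placeholder) // 1. hide every "..."
    .split('..').join('')           // 2. remove the remaining ".."
    .split(placeholder).join('...'); // 3. restore the hidden "..."
}

console.log(removeDoubleDots('wait... what..now')); // "wait... whatnow"
console.log(removeDoubleDots('a..b...c..d'));       // "ab...cd"
```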
