Regex explanation - javascript

I am looking at the code in the tumblr bookmarklet and was curious what the code below did.
try{
if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
throw(0);
tstbklt();
}
Can anyone tell me what the if line is testing? I have tried to decode the regex but have been unable to do so.

Initially excluding the specifics of the regex, this code is:
if ( ! /.../.test(l.host) )
"if not regex.matches(l.host)" or "if l.host does not match this regex"
So, the regex must correctly describe the contents of l.host text for the conditional to fail and thus avoid throwing the error.
On to the regex itself:
^(.*\.)?tumblr[^.]*$
This is checking for the existence of tumblr but only after any string ending in . that might exist:
^ # start of line
( # begin capturing group 1
.* # match any (non-newline) character, as many times as possible, but zero allowed
\. # match a literal .
) # end capturing group 1
? # make whole preceding item optional
tumblr # match literal text tumblr
[^.]* # match any non . character, as many times as possible, but zero allowed
$ # match end of line
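To see what the breakdown above means in practice, here is a small sketch that runs the regex against a few hostnames. The hostnames are made up for illustration, and it assumes l.host holds a plain hostname string.

```javascript
// The regex from the bookmarklet, unchanged.
const tumblrHost = /^(.*\.)?tumblr[^.]*$/;

// Hypothetical hostnames, chosen only to exercise each part of the pattern.
const samples = ["tumblr", "dev.tumblr", "a.b.tumblrfoo", "www.tumblr.com", "nottumblr"];

for (const host of samples) {
  console.log(host, tumblrHost.test(host));
}
// tumblr true
// dev.tumblr true
// a.b.tumblrfoo true
// www.tumblr.com false  (".com" contains a dot, which [^.]* cannot match)
// nottumblr false       ("tumblr" must start the string or follow a dot)
```

So the pattern only accepts hosts whose last dot-separated label starts with tumblr.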
I thought it was testing to see if the host was tumblr
Yeah, it looked like it might be intended to check that, but if so it's the wrong way to do it.
For that, the first bit should be something like ^(?:[\w-]+\.)? to capture an alphanumeric subdomain (the ?: is a non-capturing group, the [\w-]+ is at least 1 alphanumeric, underscore or hyphen) and the last bit should be either \.(?:com|net|org)$ or perhaps like (?:\.[a-zA-Z]+)+$ depending on how flexible the tld section might need to be.
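As a rough sketch of that suggestion, the two pieces can be joined around the literal tumblr like this. The exact combination, the name isTumblrHost and the sample hosts are assumptions for illustration, not something taken from the bookmarklet.

```javascript
// Optional single subdomain label, then "tumblr", then one of three TLDs.
const strictTumblrHost = /^(?:[\w-]+\.)?tumblr\.(?:com|net|org)$/;

// Hypothetical helper name.
function isTumblrHost(host) {
  return strictTumblrHost.test(host);
}

console.log(isTumblrHost("www.tumblr.com"));         // true
console.log(isTumblrHost("tumblr.com"));             // true
console.log(isTumblrHost("evil-tumblr.com"));        // false
console.log(isTumblrHost("tumblr.com.example.net")); // false
```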

My attempt to break it down. I'm no expert with regex however:
if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
The leading ! isn't really regex; it tells us to only execute the if() body if this test does not match.
if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
The (.*\.)? part allows for any characters before the tumblr word as long as they are followed by a . But it is all optional (see the ? at the end).
if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
Next, [^.] matches any character except the . and the * extends that to match any number of such characters afterwards (so it doesn't stop after 1), and the $ means this has to hold until the end of the string.
Finally, the .test() looks to test it against the current hostname or whatever l.host contains (I'm not familiar with the tumblr bookmarklet)
So basically, it looks like that part is checking to see that if the host is not part of tumblr, then throw that exception.
Looking forward to see how wrong I am :)

Related

RegEx group characters into a word

I want to tell RegEx to match/not match when a set of characters exists all together in the format I design (like a word) and not as separate characters. (Using JavaScript for this particular example.)
I am making a RegEx for Discord IDs following the rules set in https://discord.com/developers/docs/resources/user and here's what I've got so far:
/^(.)[^##]+[#][0-9]{4}$/
For those who don't want to open the page, the rule is:
1- the first part can contain (any number of) any characters except #, #, and ''' (the third is not added yet).
2- the second part can only be a # character.
3- the third part should be a 4-digit number.
All works, except that I want my regex to allow ', '' or even '''''' but not ''', so that only the entire "word" or set of characters is found. How can I make it work?
Edited:
Adding this since the question seems to be vague and causes confusion: the answer to the main question would be to add a lookahead ((?!''')) of the word you want to exclude to the part of the regex you want. Yet for '''''' to be allowed as I've asked in my question, since '''''' does include ''' in itself, it's no longer just a matter of finding the word, but also of checking what comes before/after it, in which case the accepted answer is correct.
I explained my real situation, but another example would be for it to allow # and # but not ##.
(Also, for those wondering, I changed the ``` character set defined by the Discord devs to ''' because the former would have interfered with Stack Overflow's code formatting. The length is being controlled via JS, not regex, and I'm ignoring spaces for the sake of simplicity in this case.)
To disallow a run of exactly 3 ' characters, and if lookbehind support is available, you might use a negative lookahead that contains a lookbehind.
The single capture group at the start (.) can be folded into the negated character class [^##\n]+ if you don't need to reuse its value for later processing.
^(?!.*(?<!')'''(?!'))[^##\n]+#[0-9]{4}$
Regex demo
^ Start of string
(?!.*(?<!')'''(?!')) Negative lookahead, assert that the string contains no run of 3 ' chars that is neither preceded nor followed by another '
[^##\n]+ Match 1+ times any char except the listed
#[0-9]{4} match # and 4 digits
$ End of string
Note that this char [#] does not have to be in a character class, and if you don't want to cross newlines, you can add \n to the character class.
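A quick way to check that pattern is a sketch like the one below. It assumes an engine with lookbehind support (for example a recent Node.js); the name idPattern and the sample IDs are invented for the test.

```javascript
// Pattern from the answer above, unchanged (the doubled # is kept as given).
const idPattern = /^(?!.*(?<!')'''(?!'))[^##\n]+#[0-9]{4}$/;

// Hypothetical sample IDs.
console.log(idPattern.test("foobar#1234"));  // true
console.log(idPattern.test("f''bar#1234"));  // true  (two quotes)
console.log(idPattern.test("f'''ar#1234"));  // false (exactly three quotes)
console.log(idPattern.test("f''''r#1234"));  // true  (four quotes)
console.log(idPattern.test("f#obar#1234"));  // false (# in the first part)
```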
This should suit your needs:
^('(?!'')|[^##'])+#\d{4}$
The first part was your issue, '(?!'')|[^##'] means:
either ' if not followed by ''
or any char except #, # and ' (as already handled above)
See demo.
For the sake of completeness, the following will forbid any multiple of 3 consecutive ', so ''', '''''', etc.:
'(?!'')|'''(?=')|[^##']
'''(?='): ''' as long as followed by another '
See demo.
The following will forbid exactly 3 consecutive ', but will allow any other occurrence (including '''''' for example):
'(?!'')|''''+|[^##']
''''+: four or more ' (could be rewritten '{4,})
See demo.
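To compare the three variants side by side, here is a small sketch. Wrapping the last two alternations in the same ^(...)+#\d{4}$ frame as the first regex is an assumption, and the names and test IDs are made up.

```javascript
// The three variants above, each in the same frame as the first regex.
const variants = {
  noThreeOrMore:    /^('(?!'')|[^##'])+#\d{4}$/,
  noMultipleOfThree: /^('(?!'')|'''(?=')|[^##'])+#\d{4}$/,
  noExactlyThree:   /^('(?!'')|''''+|[^##'])+#\d{4}$/,
};

// Hypothetical ID with n consecutive quotes in the first part.
const idWithQuotes = (n) => "f" + "'".repeat(n) + "r#1234";

// For each variant, test runs of 1 to 6 quotes.
function quoteRunResults(re) {
  return [1, 2, 3, 4, 5, 6].map((n) => re.test(idWithQuotes(n)));
}

for (const [name, re] of Object.entries(variants)) {
  console.log(name, quoteRunResults(re));
}
// noThreeOrMore     [ true, true, false, false, false, false ]
// noMultipleOfThree [ true, true, false, true, true, false ]
// noExactlyThree    [ true, true, false, true, true, true ]
```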
Keep in mind that, while regexes can be very entertaining, in practice an extremely complex regex is usually a sign that someone got fixated on regex and didn't consider an easier approach.
Consider this advice from Jeff Atwood:
Regular expressions are like a particularly spicy hot sauce – to be used in moderation and with restraint only when appropriate. Should you try to solve every problem you encounter with a regular expression? Well, no. Then you'd be writing Perl, and I'm not sure you need those kind of headaches. If you drench your plate in hot sauce, you're going to be very, very sorry later.
...
Let me be very clear on this point: If you read an incredibly complex, impossible to decipher regular expression in your codebase, they did it wrong. If you write regular expressions that are difficult to read into your codebase, you are doing it wrong.
I don't know your situation, but it sounds like it would be much easier to look for a bad ID than to try to define a good ID. If you can break this into two steps, then the logic will be easier to read and maintain.
Verify that the final part of the ID is as expected (/#\d{4}/)
Verify that the first part of the ID does not have any invalid characters or sequences
function isValid(id) {
  // Capture everything before the trailing "#" + 4 digits; anchored so nothing may follow the digits.
  const idPrefix = /^(.+)#\d{4}$/.exec(id)?.[1];
  if (idPrefix === undefined) return false; // The #\d{4} postfix was missing
  // If we find an illegal character or sequence, then the id is not valid:
  return !(/[##]|(^|[^'])(''')($|[^'])/.test(idPrefix));
}
That second regex is a bit long, but here's how it breaks down:
If the Id contains a # or # then it's not legal.
Check for a sequence of ''' that IS NOT surrounded by a fourth '. Also take the beginning and end of the string into account. If we find a sequence of exactly three ', then it's not legal.
The result:
isValid("foobar#1234") // true
isValid("f#obar#1234") // false
isValid("f#obar#1234") // false
isValid("f''bar#1234") // true
isValid("f'''ar#1234") // false
isValid("f''''r#1234") // true

Regex for matching string in parentheses including when opening or closing parenthesis is missing

I want to match strings in parentheses (including the parens themselves) and also match strings when a closing or opening parenthesis is missing.
From looking around my ideal solution would involve conditional regex however I need to work within the limitations of javascript's regex engine.
My current solution that almost works: /\(?[^()]+\)|\([^()]+/g. I could split this up (might be better for readability) but am curious to know if there is a way to achieve it without being overly verbose with multiple |'s.
Examples
Might help to understand what I'm trying to achieve through examples (highlighted sections are the parts I want to match):
(paren without closing
(paren in start) of string
paren (in middle) of string
paren (at end of string)
paren without opening)
string without any parens
(string with only paren)
string (with multiple) parens (in a row)
Here's a link to the tests I set up in regexr.com.
You can match the following regular expression.
^\([^()]*$|^[^()]*\)$|\([^()]*\)
Javascript Demo
Javascript's regex engine performs the following operations.
^ # match the beginning of the string
\( # match '('
[^()]* # match zero or more chars other than parentheses,
# as many as possible
$ # match the end of the string
| # or
^ # match the beginning of the string
[^()]* # match zero or more chars other than parentheses,
# as many as possible
\) # match ')'
$ # match the end of the string
| # or
\( # match '('
[^()]* # match zero or more chars other than parentheses,
# as many as possible
\) # match ')'
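The three alternatives can be tried against the example strings from the question with a sketch like this. The helper name findParenSpans is invented, and each string is tested on its own, so no multiline flag is needed.

```javascript
// The expression from the answer, with the global flag to collect every match.
const parenSpans = /^\([^()]*$|^[^()]*\)$|\([^()]*\)/g;

// Hypothetical helper: returns all matched spans, or an empty array.
function findParenSpans(line) {
  return line.match(parenSpans) ?? [];
}

console.log(findParenSpans("(paren without closing"));    // [ '(paren without closing' ]
console.log(findParenSpans("paren (in middle) of string")); // [ '(in middle)' ]
console.log(findParenSpans("paren without opening)"));    // [ 'paren without opening)' ]
console.log(findParenSpans("string without any parens")); // []
console.log(findParenSpans("string (with multiple) parens (in a row)"));
// [ '(with multiple)', '(in a row)' ]
```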
As of the question date (May 14th 2020), Regexr's test mechanism was not working as expected: it matched (with multiple) but not (in a row). This seems to be a bug in the test mechanism. If you copy and paste the 8 items in the "text" mode of Regexr and test your expression, you'll see it matches (in a row). The expression also works as expected in Regex101.
I think you've done alright. The issue is that you need to match [()] in two places and only one of them needs to be true but both can't be false and regex isn't so smart as to keep state like that. So you need to check if there is 0 or 1 opening or 0 or 1 closing in alternatives like you have.
Update:
I stand corrected. Since all you seem to care about is whether there is an opening or closing parenthesis, you could just do something like this:
.*[\(?\)]+.*
In English: any number of characters, then either a ( or a ), followed by any number of characters. (As written, the character class also accepts a literal ?.) This will match them in any order though, so if you need ( to come before ), even though you don't seem to care if both are present, this won't work.

Optional regex string/pattern sections with and without non-capturing groups

Here's what I'm trying to do:
http://i.imgur.com/Xqrf8Wn.png
Simply take a URL with 3 groups: $1 is not so important, $2 and $3 are, but $2 is totally optional, including (obviously) the corresponding slash when present, which is all I am trying to make optional. I get that it can (should?) be in a non-capturing group, but does it HAVE to be? I've seen enough now that seems to indicate it does not HAVE to be. If possible, I'd really like someone to explain it so I can try to fully understand it, rather than just being handed one possible working answer to copy.
Here are the regex strings I've tried; at best they currently match only the second URL string, the one with the optional part present:
^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)(.*)\/required\/B([A-Z0-9]{9}).*
^https:\/\/([a-z]{0,2})\.?blah\.com(\/)?(.*)\/required\/B([A-Z0-9]{9}).*
^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)?(.*)?\/required\/B([A-Z0-9]{9}).*
Here are the two URLs from which I want to capture groups 2 and 3, with 1 and 2 being optional, but $2 being the problem. I've tried all the strings above and have yet to get any of them to match the string when the optional part is NOT present, and I believe it must be due to the slashes?
https://blah.com/required/B7BG0Z0GU1A
https://blah.com/optional/required/B7BG0Z0GU1A
Making a part of the pattern optional is as simple as adding ?, and your last two attempts both work: https://regex101.com/r/RIKvYY/1
Your mistake is that your test is wrong - you are using ^ which matches the beginning of the string. You need to add the /m flag (multiline) to make it match the beginning of each line. This is the reason your patterns never match the second line...
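The effect of the multiline flag can be reproduced with a short sketch. It uses the third pattern from the question and the two URLs joined by a newline; the variable names are made up for the example.

```javascript
// Both test URLs in one string, one per line, as in a regex tester.
const urls = [
  "https://blah.com/required/B7BG0Z0GU1A",
  "https://blah.com/optional/required/B7BG0Z0GU1A",
].join("\n");

// Same pattern twice: without /m, ^ only matches at the very start of the input.
const startOfStringOnly = /^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)?(.*)?\/required\/B([A-Z0-9]{9}).*/g;
// With /m, ^ also matches right after each newline.
const startOfEachLine = /^https:\/\/([a-z]{0,2})\.?blah\.com(?:\/)?(.*)?\/required\/B([A-Z0-9]{9}).*/gm;

const withoutM = urls.match(startOfStringOnly) ?? [];
const withM = urls.match(startOfEachLine) ?? [];

console.log(withoutM.length, withM.length); // 1 2
```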
Note that you're allowing two slashes (//required, for example). You can solve it by joining the first slash and the optional part to the same capturing group (of course, as long as you are using .* you can still match multiple slashes):
https:\/\/([a-z]{0,2})\.?blah\.com(?:\/(.*))?\/required\/B([A-Z0-9]{9}).*
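Here is a minimal sketch of how the groups come out with that last pattern. The function name parseUrl and the returned property names are invented for the example.

```javascript
// Pattern from the answer: the slash and the optional segment share one optional group.
const urlPattern = /https:\/\/([a-z]{0,2})\.?blah\.com(?:\/(.*))?\/required\/B([A-Z0-9]{9}).*/;

// Hypothetical helper that names the three capture groups.
function parseUrl(url) {
  const m = urlPattern.exec(url);
  if (m === null) return null;
  const [, sub, optional, id] = m;
  return { sub, optional, id };
}

console.log(parseUrl("https://blah.com/required/B7BG0Z0GU1A"));
// { sub: '', optional: undefined, id: '7BG0Z0GU1' }
console.log(parseUrl("https://blah.com/optional/required/B7BG0Z0GU1A"));
// { sub: '', optional: 'optional', id: '7BG0Z0GU1' }
```

When the optional segment is absent, the whole (?:\/(.*))? group is skipped, so group 2 is undefined rather than an empty string.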

Making part of JavaScript regex optional

I have the following original strings:
# Original strings
js/main_dev.js # This one is just to check for false positives
js/blog/foo.js
js/blog/foo.min.js
I'd like to run a substitution regex on each one and end up with this:
# Desired result
js/main_dev.js
js/blog/foo_dev.js
js/blog/foo_dev.min.js
So basically, add _dev before .min.js if it exists, if not, before .js. That should only apply on strings that start with js/blog or js/calculator and end in .js.
I initially started out with (js\/(?:blog|calculators)\/.+)(\.min\.js)$ as a regex and then $1_dev$2 as substitution. That works for my third string, but not my second obviously as I am "hard" looking for .min.js. So I figured I'd make .min optional by throwing it in a non capture group with a ? like this: (js\/(?:blog|calculators)\/.+)((?:\.min)?\.js)$. Now this works for my second string but my third string is all out of whack:
js/main_dev.js # Good
js/blog/foo_dev.js # Good
js/blog/foo.min_dev.js # Bad! _dev should be before the .min
I've tried many permutations on this, including simply doing (\.min\.js|\.js) but to no avail, all result in the same behavior. Here is a Regex 101 paste with the "bad" regex: https://regex101.com/r/bH3yP6/1
What am I doing wrong here?
Try throwing a ? after the + at the end of your first capturing group (right after the non-capturing one) to make it lazy (non-greedy):
(js\/(?:blog|calculators)\/.+?)((?:\.min)?\.js)$
(?:\.min)? is optional and .+ is greedy, so .+ was capturing .min.
https://regex101.com/r/bH3yP6/3
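The difference between the greedy and lazy versions shows up in a quick sketch like this one. The helper name addDev is made up; the regexes and the $1_dev$2 substitution are the ones from the question and answer.

```javascript
// Greedy: .+ swallows ".min", leaving only ".js" for the second group.
const greedy = /(js\/(?:blog|calculators)\/.+)((?:\.min)?\.js)$/;
// Lazy: .+? stops as early as possible, so ".min" lands in the second group.
const lazy = /(js\/(?:blog|calculators)\/.+?)((?:\.min)?\.js)$/;

// Hypothetical helper applying the substitution from the question.
const addDev = (path, re) => path.replace(re, "$1_dev$2");

console.log(addDev("js/blog/foo.min.js", greedy)); // js/blog/foo.min_dev.js
console.log(addDev("js/blog/foo.min.js", lazy));   // js/blog/foo_dev.min.js
console.log(addDev("js/blog/foo.js", lazy));       // js/blog/foo_dev.js
console.log(addDev("js/main_dev.js", lazy));       // js/main_dev.js (unchanged)
```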

Regex - Don't match a group if it starts with a string in javascript

I'm struggling with some regex, in javascript which doesn't have a typical lookbehind option, to only match a group if it's not preceded with a string:
(^|)(www\.[\S]+?(?= |[,;:!?]|\.( )|$))
so in the following
hello http:/www.mytestwebsite.com is awesome
I'm trying to detect if the www.mytestwebsite.com is preceded by / and if it is I don't want to match, otherwise match away. I tried using a lookahead but it looked to be conflicting with the lookahead I already had.
I've been playing around with placing (?!&#x2f) in different areas with no success.
(^|)((?!&#x2f)www\.[\S]+?(?= |[,;:!?]|\.( )|$))
A look ahead to not match if the match is preceded
Due to lack of lookbehinds in JS, the only way to accomplish your goal
is to match those web sites that contain the errant / as well.
This is because a lookahead won't advance the current position.
Only a match on consumable text will advance the position.
But, a good workaround has always been to include the errant text as an option
within the regex. You'd put some capture groups around it, then test the
group for a match. If it matched, skip, go on to next match.
This requires sitting in a while loop checking each successful match.
In the regex below, if group 1 matched, don't store the group 2 URL;
if it didn't, store the group 2 URL.
(/)?(www\.\S+?(?= |[,;:!?]|\.( )|$))
Formatted:
( &\#x2f; )? # (1)
( # (2 start)
www\. \S+?
(?=
&\#x20;
| [,;:!?]
| \.
( &\#x20; ) # (3)
| $
)
) # (2 end)
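The while loop described above could look roughly like this sketch. It uses the decoded form of the regex (a plain / and plain spaces); the function name findBareUrls and the sample text are assumptions.

```javascript
// Collect URLs that are NOT preceded by "/", using the optional group 1 as a marker.
function findBareUrls(text) {
  const re = /(\/)?(www\.\S+?(?= |[,;:!?]|\.( )|$))/g;
  const urls = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    // Group 1 is undefined when no "/" was consumed before the URL.
    if (m[1] === undefined) urls.push(m[2]);
  }
  return urls;
}

console.log(findBareUrls("hello http:/www.mytestwebsite.com is awesome"));
// []
console.log(findBareUrls("visit www.example.com, or http:/www.skipped.com today"));
// [ 'www.example.com' ]
```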
Another option (and I've done zero performance testing) would be to use string.replace() with a regex and a callback as the second parameter.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace
Then, inside the replace callback, you can prepend/append the illegal characters you don't want to match to the matched string. Using the offset parameter passed to the callback (see the docs above), you can determine each match and its position, and decide whether to replace the text or not.
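One possible reading of that idea, as an untested-for-performance sketch: let the callback look at the character just before each match via the offset argument. The function name markBareUrls and the square-bracket marking are placeholders for whatever replacement is really wanted, and the inner group is made non-capturing so the callback arguments are simply (match, offset, string).

```javascript
// Wrap URLs in [ ] unless the character right before them is "/".
function markBareUrls(text) {
  const re = /www\.\S+?(?= |[,;:!?]|\.(?: )|$)/g;
  return text.replace(re, (match, offset, whole) =>
    // offset is the index of the match within the whole string.
    offset > 0 && whole[offset - 1] === "/" ? match : `[${match}]`
  );
}

console.log(markBareUrls("see http:/www.a.com and www.b.com!"));
// see http:/www.a.com and [www.b.com]!
```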
