I've been trying to work this out for almost an hour now, and I can't see myself getting much further with it without any help or explanation. I've used regex before, but only ones that are very simple or had already been made.
This time, I'm trying to work out how to write a regex that achieves the following:
Email address must contain one # character and at least one dot (.) at least one position after the # character.
So far, this is all I've been able to work out, and it still matches email addresses that, for example, have more than one # symbol.
.*?#?[^#]*\.+.*
It would be helpful if you could show me how to construct a regular expression that checks for a single # and at least one full stop one or more characters after the #. If you could break down the regex and explain what each bit does, that would be really helpful.
I want to keep it simple for now, so it doesn't have to be a full-on super-accurate email validation expression.
With the help of ClasG's comment, I now have a fairly straightforward and suitable regex for my problem. For the sake of anyone learning regex who might come across this question in the future, I'll break the expression down below.
Expression: ^[^#]+#[^#]+\.[^#]+$
^ Matches the beginning of the string (or line if multiline)
[^#] Match any character that is not in this set (i.e. not "#")
+ Match one or more of this
# Match "#" character
[^#] Match any character that is not in this set
+ Match one or more
\. Match "." (full stop) character (backslash escapes the full stop)
[^#] Match any character that is not in this set
+ Match one or more
$ Matches the end of the string (or line if multiline)
And in plain language:
Start at beginning of string or line
Include all characters except # until the # sign
Include the # sign
Include all characters except # after the # sign until the full stop
Include all characters except # after the full stop
Stop at the end of the string or line
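As a quick check of the breakdown above, here is a minimal sketch in JavaScript. The pattern is the one from this answer; the helper name hasOneSeparator and the sample addresses are made up for illustration.

```javascript
// Pattern from the answer above: exactly one "#", with a "." somewhere after it.
const oneSeparator = /^[^#]+#[^#]+\.[^#]+$/;

// Hypothetical helper name, used only for this illustration.
function hasOneSeparator(address) {
  return oneSeparator.test(address);
}

console.log(hasOneSeparator("user#example.com"));      // true: one "#", dot after it
console.log(hasOneSeparator("user#name#example.com")); // false: two "#" characters
console.log(hasOneSeparator("user.name#example"));     // false: the only dot is before the "#"
console.log(hasOneSeparator("user#.com"));             // false: nothing between "#" and "."
```

Because every character class is [^#], the single literal # in the middle is the only place a # can appear, which is what rules out the second one.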
Email address must contain one # character
No, it doesn't have to. An email address with no '#' character is perfectly valid. An email address with multiple '#' characters before an IP address is perfectly valid (as long as all but 1 are outside the ADDR_SPEC or are quoted/escaped within the mailbox name).
I suspect you're not trying to validate an email address but rather an ADDR_SPEC. The answer linked by Máté Safranka describes how to validate an ADDR_SPEC (not an email address). Unless you expect to be validating records which don't have a valid internet MX record, and more than one '#' is more likely to be a typo than a valid address...
/[a-z0-9\._%+!$&*=^|~#%'`?{}/\-]+#([a-z0-9\-]+\.){1,}([a-z]{2,16})/
^[^\W_]+\w*(?:[.-]\w*)*[^\W_]+#[^\W_]+(?:[.-]?\w*[^\W_]+)*(?:\.[^\W_]{2,})$
I am a newbie to regex and would like to create a regular expression to check usernames. These are the conditions:
username must have between 4 and 20 characters
username must not contain anything but letters a-z, digits 0-9 and special characters -._
the special characters -._ must not be used successively in order to avoid confusion
the username must not contain whitespaces
Examples
any.user.13 => valid
any..user13 => invalid (two dots successively)
anyuser => valid
any => invalid (too short)
anyuserthathasasupersuperlonglongname => invalid (too many characters)
any username => invalid because of the whitespace
I've tried to create my own regex and only got to the point where I specify the allowed characters:
[a-z0-9.-_]{4,20}
Unfortunately, it still matches a string if there's a whitespace in between and it's possible to have two special chars .-_ successively:
If anybody would be able to provide me with help on this issue, I would be extremely grateful. Please keep in mind that I'm a newbie on regex and still learning it. Therefore, an explanation of your regex would be great.
Thanks in advance :)
Sometimes writing a regular expression can be almost as challenging as finding a user name. But here you were quite close to making it work. I can point out three reasons why your attempt fails.
First of all, we need to match all of the input string, not just a part of it, because we don't want to ignore things like white spaces and other characters that appear in the input. For that, one will typically use the anchors ^ (match start) and $ (match end) respectively.
Another point is that we need to prevent two special characters from appearing next to each other. This is best done with a negative lookahead.
Finally, I can see that the tool you are using to test your regex is adding the flags gmi, which is not what we want. Particularly, the i flag says that the regex should be case insensitive, so it should match capital letters like small ones. Remove that flag.
The final regex looks like this:
/^([a-z0-9]|[-._](?![-._])){4,20}$/
There is nothing really cryptic here, except maybe for the group [-._](?![-._]) which means any of -._ not followed by any of -._.
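To see the regex against the examples from the question, here is a small JavaScript sketch. The pattern is the one given above; the helper name isValidUsername is an assumption made for illustration.

```javascript
// Pattern from the answer: 4 to 20 characters, each either a letter/digit
// or one of "-._" that is not directly followed by another of "-._".
const usernamePattern = /^([a-z0-9]|[-._](?![-._])){4,20}$/;

// Hypothetical helper name, used only for this illustration.
function isValidUsername(name) {
  return usernamePattern.test(name);
}

console.log(isValidUsername("any.user.13"));  // true
console.log(isValidUsername("any..user13"));  // false: two dots in a row
console.log(isValidUsername("anyuser"));      // true
console.log(isValidUsername("any"));          // false: too short
console.log(isValidUsername("anyuserthathasasupersuperlonglongname")); // false: too long
console.log(isValidUsername("any username")); // false: whitespace
console.log(isValidUsername("AnyUser"));      // false: no i flag, so capitals are rejected
```

Note that the lookahead only forbids a second special character, so a name ending in one of -._ (for example "user.") still passes; whether that is acceptable depends on your requirements.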
I have this simple regular expression for Emails.
/^[a-z]+([\.-_]?[a-z0-9]+)*#([a-z]{3,})+(\.[a-z]{2,3})+$/i;
But when I use this example: first#last#example.com, it still works. And when I remove the # character from the expression:
/^[a-z]+([\.-_]?[a-z0-9]+)*([a-z]{3,})+(\.[a-z]{2,3})+$/i
it gives the same result.
This expression allows an infinite number of at signs (i.e. #) between at least 2 characters in the email !!
Where is the problem with this expression?
Your pattern is rather restrictive, you might think of other options of validating an email address, like type="email" if it is an input field validation.
As to why the regex matches # even if you take it out, or matches a string with two # symbols, that is caused by [.-_] that matches a lot of chars as the hyphen creates a range that includes #. You need to use [._-] instead.
You may "fix" the regex as
/^[a-z]+([._-]?[a-z0-9]+)*#[a-z]{3,}(\.[a-z]{2,3})+$/i
However, this regex is not good to use in real life scenarios.
Do you want something like this?
/^[a-z\.\-_]+#([a-z]{3,})+(\.[a-z]{2,3})+$/
With \.-_ you probably wanted to allow ".", "-" or "_" inside the regex, but you forgot to escape the hyphen.
Or you can use your own but with escape:
^[a-z]+([\.\-_]?[a-z0-9]+)*#([a-z]{3,})+(\.[a-z]{2,3})+$
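The effect of escaping the hyphen can be checked with a short JavaScript sketch. The variable names and sample strings are assumptions for illustration; the last pattern is the escaped version shown above.

```javascript
// Unescaped: "." to "_" is read as a range of character codes (46 through 95).
const asRange = /^[\.-_]$/;
// Escaped hyphen: only the three literal characters ".", "-" and "_".
const asLiterals = /^[\.\-_]$/;

console.log(asRange.test("5"));    // true: "5" lies inside the range
console.log(asLiterals.test("5")); // false
console.log(asRange.test("-"));    // false: "-" (code 45) sits just below the range
console.log(asLiterals.test("-")); // true

// The escaped version of the full expression from this answer.
const emailPattern = /^[a-z]+([\.\-_]?[a-z0-9]+)*#([a-z]{3,})+(\.[a-z]{2,3})+$/;

console.log(emailPattern.test("first.last#example.com")); // true
console.log(emailPattern.test("first#last#example.com")); // false
```

Placing the hyphen first or last in the class, as in [._-], has the same effect as escaping it.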
PS: Remember that a truly valid email address can look completely different and needs a huge regex to match; moreover, each server defines what is allowed in an address and what is not.
I am building a JSON validator from scratch, but I am quite stuck with the string part. My hope was building a regex which would match the following sequence found on JSON.org:
My regex so far is:
/^\"((?=\\)\\(\"|\/|\\|b|f|n|r|t|u[0-9a-f]{4}))*\"$/
It does match a backslash followed by one of the allowed characters, as well as an empty string. But I'm not sure how to handle the UNICODE part.
Is there a regex to match any UNICODE character except " or \ or a control character? And will it match a newline or horizontal tab?
The last question is because the regex matches the string "\t", but not " " (four spaces, but the idea is to be a tab). Otherwise I will need to expand the regex with it, which is not a problem, but my guess is the horizontal tab is a UNICODE character.
Thanks to Jaeger Kor, I now have the following regex:
/^\"((?=\\)\\(\"|\/|\\|b|f|n|r|t|u[0-9a-f]{4})|[^\\"]*)*\"$/
It appears to be correct, but is there any way to check for control characters or is this unneeded as they appear on the non-printable characters on regular-expressions.info? The input to validate is always text from a textarea.
Update: the regex is as follows in case anyone needs it:
/^("(((?=\\)\\(["\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^"\\\0-\x1F\x7F]+)*")$/
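A minimal JavaScript sketch of how this final pattern behaves, including the tab case asked about earlier. The pattern is copied from the update; the helper name isJsonString and the sample inputs are assumptions for illustration.

```javascript
// Pattern copied from the update above.
const jsonStringPattern = /^("(((?=\\)\\(["\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^"\\\0-\x1F\x7F]+)*")$/;

// Hypothetical helper name, used only for this illustration.
function isJsonString(text) {
  return jsonStringPattern.test(text);
}

console.log(isJsonString('"plain text"'));   // true
console.log(isJsonString('""'));             // true: empty string
console.log(isJsonString('"tab\\tescape"')); // true: backslash followed by t
console.log(isJsonString('"\\u00e9"'));      // true: \u plus four hex digits
console.log(isJsonString('"raw\ttab"'));     // false: a literal tab is a control character
console.log(isJsonString('"bad\\xescape"')); // false: \x is not an allowed escape
```

The negated class [^"\\\0-\x1F\x7F] is what rejects the raw tab: it excludes the quote, the backslash and the control range, so a tab has to be written as the two-character escape \t.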
For your exact question, create a character class:
# Matches any character that isn't a \ or "
/[^\\"]/
And then you can just add * on the end to get 0 or unlimited number of them or alternatively 1 or an unlimited number with +
/[^\\"]*/
or
/[^\\"]+/
Also there is this below, found at https://regex101.com/ under the library tab when searching for json
/(?(DEFINE)
# Note that everything is atomic, JSON does not need backtracking if it's valid
# and this prevents catastrophic backtracking
(?<json>(?>\s*(?&object)\s*|\s*(?&array)\s*))
(?<object>(?>\{\s*(?>(?&pair)(?>\s*,\s*(?&pair))*)?\s*\}))
(?<pair>(?>(?&STRING)\s*:\s*(?&value)))
(?<array>(?>\[\s*(?>(?&value)(?>\s*,\s*(?&value))*)?\s*\]))
(?<value>(?>true|false|null|(?&STRING)|(?&NUMBER)|(?&object)|(?&array)))
(?<STRING>(?>"(?>\\(?>["\\\/bfnrt]|u[a-fA-F0-9]{4})|[^"\\\0-\x1F\x7F]+)*"))
(?<NUMBER>(?>-?(?>0|[1-9][0-9]*)(?>\.[0-9]+)?(?>[eE][+-]?[0-9]+)?))
)
\A(?&json)\z/x
This should match any valid json, you can also test it at the website above
Use this, works also with array jsons [{...},{...}]:
((\[[^\}]{3,})?\{s*[^\}\{]{3,}?:.*\}([^\{]+\])?)
Demo:
https://regex101.com/r/aHAnJL/1
This character set
[^\/:] // all characters except / or :
is weak according to JSLint, because I should be specifying the characters that can be used, not the characters that cannot be used, per this SO post.
This is for a simple, non-production-level domain tester that looks like this:
domain: /:\/\/(www\.)?([^\/:]+)/,
I'm just looking for some direction on how to think about this. The post mentions that allowing the myriad of Unicode characters is not a good thing...How do I formulate a plan to write this a tad better?
I am not concerned with the completeness of my domain checker ( it is just a prototype )...I am concerned with how to write reg-exes differently.
According to http://en.wikipedia.org/wiki/Domain_name#Internationalized_domain_names
the character set allowed in the Domain Name System is based on ASCII
and as per http://www.netregister.biz/faqit.htm#1
to name your domain you can use any letter, numbers between 0 and 9, and the symbol "-" [as long as the first character is not "-"]
and considering that your domain must end with .something, you are looking for
([a-zA-Z0-9][a-zA-Z0-9-]*\.)+[a-zA-Z0-9][a-zA-Z0-9-]*
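Here is a minimal JavaScript sketch of that pattern in use. The ^ and $ anchors are an addition of mine so that the whole string is tested, and the helper name looksLikeDomain and the sample names are assumptions for illustration.

```javascript
// Pattern from above, wrapped in ^...$ (an assumption) to test the whole string.
const domainPattern = /^([a-zA-Z0-9][a-zA-Z0-9-]*\.)+[a-zA-Z0-9][a-zA-Z0-9-]*$/;

// Hypothetical helper name, used only for this illustration.
function looksLikeDomain(name) {
  return domainPattern.test(name);
}

console.log(looksLikeDomain("example.com"));       // true
console.log(looksLikeDomain("sub.example.co.uk")); // true
console.log(looksLikeDomain("-example.com"));      // false: label starts with "-"
console.log(looksLikeDomain("exa_mple.com"));      // false: "_" is not in the allowed set
console.log(looksLikeDomain("localhost"));         // false: no ".something" ending
```

This is the "list what is allowed" style: every character that may appear is spelled out, so anything else is rejected by default.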
"I should be specifying the characters that can be used, not the characters that cannot be used"
No, that's nonsense, just JSLint being JSLint.
When you see [^\/:] in a regex it's immediately obvious what it is doing. If you tried to list all possible allowed characters the resulting regex would be horrendously difficult to read and it would be easy to accidentally forget to include some characters.
If you have a specific set of allowed characters then fine, list them. That's easier and more reliable than trying to list all possible invalid characters.
But if you have a specific set of invalid characters the [^] syntax is the appropriate way to do it.
Here's a regex for characters you can have:
mycharactersarecool[^shouldnothavethesechars](oneoftwooptions|anotheroption)
Is this what you're talking about ?
This is a great question for Google, you know... but just to wet your beak: Matthew O'Riordan has written a regular expression that matches a link with or without a protocol.
Here's a link to his blog post
But for future reference, let me provide the regular expression from the post here as well:
/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[.\!\/\\w]*))?)/
And as nicely broken down by blog writer Matthew himself:
(
( # brackets covering match for protocol (optional) and domain
([A-Za-z]{3,9}:(?:\/\/)?) # match protocol, allow in format http:// or mailto:
(?:[\-;:&=\+\$,\w]+#)? # allow something# for email addresses
[A-Za-z0-9\.\-]+ # anything looking at all like a domain, non-unicode domains
| # or instead of above
(?:www\.|[\-;:&=\+\$,\w]+#) # starting with something# or www.
[A-Za-z0-9\.\-]+ # anything looking at all like a domain
)
( # brackets covering match for path, query string and anchor
(?:\/[\+~%\/\.\w\-]*) # allow optional /path
?\??(?:[\-\+=&;%#\.\w]*) # allow optional query string starting with ?
#?(?:[\.\!\/\\\w]*) # allow optional anchor #anchor
)? # make URL suffix optional
)
What about your particular example
But in your case of matching URL domains, the positive counterpart of [^\/:] could simply be:
[-0-9a-zA-Z_.]
And that should match everything after // and before the first /. But what happens when your URLs don't end with a slash? What will you do in that case?
The regular expression above (a simplification) only matches one character, just like your negative character set does. So it just replaces your negative set in the complete regex you're using.
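Dropping the positive set into the domain tester from the question might look like the following JavaScript sketch. The helper name extractHost and the sample URLs are assumptions for illustration.

```javascript
// The question's domain tester with [^\/:] swapped for the positive set above.
const domainTester = /:\/\/(www\.)?([-0-9a-zA-Z_.]+)/;

// Hypothetical helper name; returns the captured host or null when nothing matches.
function extractHost(url) {
  const match = domainTester.exec(url);
  return match ? match[2] : null;
}

console.log(extractHost("http://www.example.com/path"));    // "example.com"
console.log(extractHost("https://example.org:8080/index")); // "example.org"
console.log(extractHost("http://example.com"));             // "example.com"
console.log(extractHost("example.com"));                    // null: no "://"
```

With the + quantifier, the match simply stops at the first character outside the set or at the end of the string, so a URL without a trailing slash is handled the same way.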
I am looking at the code in the tumblr bookmarklet and was curious what the code below did.
try{
if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
throw(0);
tstbklt();
}
Can anyone tell me what the if line is testing? I have tried to decode the regex but have been unable to do so.
Initially excluding the specifics of the regex, this code is:
if ( ! /.../.test(l.host) )
"if not regex.matches(l.host)" or "if l.host does not match this regex"
So, the regex must correctly describe the contents of l.host text for the conditional to fail and thus avoid throwing the error.
On to the regex itself:
^(.*\.)?tumblr[^.]*$
This is checking for the existence of tumblr but only after any string ending in . that might exist:
^ # start of line
( # begin capturing group 1
.* # match any (non-newline) character, as many times as possible, but zero allowed
\. # match a literal .
) # end capturing group 1
? # make whole preceding item optional
tumblr # match literal text tumblr
[^.]* # match any non . character, as many times as possible, but zero allowed
$ # match end of line
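Running the pattern against a few host names makes the breakdown concrete. This is a minimal JavaScript sketch; the variable name and the sample hosts are assumptions for illustration, not values taken from the bookmarklet.

```javascript
// The regex from the bookmarklet.
const tumblrHost = /^(.*\.)?tumblr[^.]*$/;

console.log(tumblrHost.test("tumblr"));         // true
console.log(tumblrHost.test("www.tumblr"));     // true: optional "anything." prefix
console.log(tumblrHost.test("my.tumblrblog"));  // true: non-dot characters may follow
console.log(tumblrHost.test("nottumblr"));      // false: prefix must end in "."
console.log(tumblrHost.test("tumblr.com"));     // false: a "." follows tumblr
console.log(tumblrHost.test("www.tumblr.com")); // false: same reason
```

Read literally, [^.]*$ means no dot may appear after the last tumblr, so a host such as tumblr.com does not match this pattern.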
I thought it was testing to see if the host was tumblr
Yeah, it looked like it might be intended to check that, but if so it's the wrong way to do it.
For that, the first bit should be something like ^(?:[\w-]+\.)? to capture an alphanumeric subdomain (the ?: is a non-capturing group, the [\w-]+ is at least 1 alphanumeric, underscore or hyphen) and the last bit should be either \.(?:com|net|org)$ or perhaps like (?:\.[a-zA-Z]+)+$ depending on how flexible the tld section might need to be.
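Putting those suggested pieces together gives something like the JavaScript sketch below. The assembled pattern is my own combination of the fragments above (using the com|net|org variant), and the variable name and sample hosts are assumptions for illustration.

```javascript
// Assembled from the suggested fragments: optional single subdomain,
// the literal "tumblr", then one of a fixed list of TLDs.
const strictTumblrHost = /^(?:[\w-]+\.)?tumblr\.(?:com|net|org)$/;

console.log(strictTumblrHost.test("tumblr.com"));      // true
console.log(strictTumblrHost.test("www.tumblr.com"));  // true
console.log(strictTumblrHost.test("nottumblr.com"));   // false
console.log(strictTumblrHost.test("tumblr.evil.com")); // false
console.log(strictTumblrHost.test("a.b.tumblr.com"));  // false: only one subdomain level allowed
```

If deeper subdomains should be accepted, the optional group could be repeated with * instead of ?, at the cost of being less strict.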
My attempt to break it down. I'm no expert with regex however:
if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
The leading if(! isn't really regex; it tells us to execute the body of the if() only when the test does not succeed.
if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
The (.*\.)? part allows for any characters before the word tumblr as long as they are followed by a . But it is all optional (see the ? at the end).
if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
Next, [^.] matches any character except the . and the * extends that to any number of such characters (so it doesn't stop after 1), while the $ makes it run until the end of the string.
Finally, the .test() looks to test it against the current hostname or whatever l.host contains (I'm not familiar with the tumblr bookmarklet)
So basically, it looks like that part is checking to see that if the host is not part of tumblr, then throw that exception.
Looking forward to seeing how wrong I am :)