The inverse of [^\/:] | Regular Expression Improvement - javascript

This character set
[^\/:] // all characters except / or :
is weak per JSLint because I should be specifying the characters that can be used, not the characters that cannot be used, per this SO post.
This is for a simple, non-production domain tester that looks like this:
domain: /:\/\/(www\.)?([^\/:]+)/,
I'm just looking for some direction on how to think about this. The post mentions that allowing the myriad of Unicode characters is not a good thing. How do I formulate a plan to write this a bit better?
I am not concerned with the completeness of my domain checker (it is just a prototype). I am concerned with how to write regexes differently.

According to http://en.wikipedia.org/wiki/Domain_name#Internationalized_domain_names
the character set allowed in the Domain Name System is based on ASCII
and as per http://www.netregister.biz/faqit.htm#1
to name your domain you can use any letter, numbers between 0 and 9, and the symbol "-" [as long as the first character is not "-"]
and considering that your domain must end with .something, you are looking for
([a-zA-Z0-9][a-zA-Z0-9-]*\.)+[a-zA-Z0-9][a-zA-Z0-9-]*
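As a rough illustration, the pattern above can be anchored with ^ and $ and tried against a few host names. The anchors, the helper name looksLikeDomain, and the sample hosts are assumptions added for this sketch; they are not part of the original answer.

```javascript
// Pattern from the answer, anchored so the whole string must match (the anchors are an assumption).
const DOMAIN = /^([a-zA-Z0-9][a-zA-Z0-9-]*\.)+[a-zA-Z0-9][a-zA-Z0-9-]*$/;

// Hypothetical helper: true when `host` consists only of dot-separated labels
// made of letters, digits and hyphens, with no label starting with a hyphen.
function looksLikeDomain(host) {
  return DOMAIN.test(host);
}

for (const host of ["example.com", "sub.example.co.uk", "-bad.com", "exa_mple.com", "localhost"]) {
  console.log(host, looksLikeDomain(host));
}
```

Because the pattern requires at least one dot, a bare name such as localhost is rejected along with the names containing disallowed characters.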

"I should be specifying the characters that can be used not he characters that can not be use"
No, that's nonsense, just JSLint being JSLint.
When you see [^\/:] in a regex it's immediately obvious what it is doing. If you tried to list all possible allowed characters the resulting regex would be horrendously difficult to read and it would be easy to accidentally forget to include some characters.
If you have a specific set of allowed characters then fine, list them. That's easier and more reliable than trying to list all possible invalid characters.
But if you have a specific set of invalid characters the [^] syntax is the appropriate way to do it.
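To make the trade-off concrete, here is a small sketch comparing the negated set from the question with an explicit allow list. The allow list [a-zA-Z0-9.-], the helper name hostWith, and the sample URLs are assumptions chosen for illustration.

```javascript
// Negated set from the question: anything except "/" or ":" counts as part of the host.
const denyList = /:\/\/(www\.)?([^\/:]+)/;
// Allow-list variant (an assumption for this sketch): only letters, digits, dots and hyphens.
const allowList = /:\/\/(www\.)?([a-zA-Z0-9.-]+)/;

// Hypothetical helper: returns capture group 2 (the host) or null when nothing matches.
function hostWith(re, url) {
  const m = re.exec(url);
  return m ? m[2] : null;
}

const urls = ["http://www.example.com/page", "http://exa mple.com/page", "http://bücher.example/"];
for (const url of urls) {
  console.log(url, "| negated set:", hostWith(denyList, url), "| allow list:", hostWith(allowList, url));
}
```

The negated set accepts the space and the non-ASCII letter, while the allow list stops at the first character it does not know.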

Here's a regex for characters you can have:
mycharactersarecool[^shouldnothavethesechars](oneoftwooptions|anotheroption)
Is this what you're talking about?

This is a great question for Google, but just to whet your appetite: Matthew O'Riordan has written a regular expression that matches links with or without a protocol.
Here's a link to his blog post.
But for future reference, let me provide the regular expression from the post here as well:
/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www\.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[.\!\/\\\w]*))?)/
And as nicely broken down by blog writer Matthew himself:
(
( # brackets covering match for protocol (optional) and domain
([A-Za-z]{3,9}:(?:\/\/)?) # match protocol, allow in format http:// or mailto:
(?:[\-;:&=\+\$,\w]+@)? # allow something@ for email addresses
[A-Za-z0-9\.\-]+ # anything looking at all like a domain, non-unicode domains
| # or instead of above
(?:www\.|[\-;:&=\+\$,\w]+@) # starting with something@ or www.
[A-Za-z0-9\.\-]+ # anything looking at all like a domain
)
( # brackets covering match for path, query string and anchor
(?:\/[\+~%\/\.\w\-]*)? # allow optional /path
\??(?:[\-\+=&;%@\.\w]*) # allow optional query string starting with ?
#?(?:[\.\!\/\\\w]*) # allow optional anchor #anchor
)? # make URL suffix optional
)
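JavaScript regex literals have no free-spacing mode, so the commented breakdown cannot be pasted as is. One possible way to keep the comments is to build the pattern from named pieces; the sketch below does that with the .source property of smaller literals. The variable names and the sample text are assumptions, and the separator in the user-info pieces is assumed to be the usual @ of email-style addresses.

```javascript
// Each piece mirrors one commented line of the breakdown.
const protocol = /[A-Za-z]{3,9}:(?:\/\/)?/; // http:// or mailto:
const userInfo = /[\-;:&=\+\$,\w]+@/;       // something@
const host     = /[A-Za-z0-9\.\-]+/;        // anything looking like a domain
const path     = /\/[\+~%\/\.\w\-]*/;       // /path
const query    = /\??[\-\+=&;%@\.\w]*/;     // optional query string starting with ?
const anchor   = /#?[\.\!\/\\\w]*/;         // optional #anchor

// Glue the pieces together; non-capturing groups keep the alternation and the optional parts scoped.
const urlPattern = new RegExp(
  "(?:" +
    protocol.source + "(?:" + userInfo.source + ")?" + host.source +
    "|" +
    "(?:www\\.|" + userInfo.source + ")" + host.source +
  ")" +
  "(?:(?:" + path.source + ")?" + query.source + anchor.source + ")?"
);

const sample = "see http://www.example.com/path?x=1#top for details";
const found = urlPattern.exec(sample);
console.log(found ? found[0] : null);
```

Splitting the pattern this way trades a little verbosity for pieces that can be read and tested one at a time.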
What about your particular example
But in your case of matching URL domains, the inverse of [^\/:] could simply be:
[-0-9a-zA-Z_.]
And that should match everything after // and before the first /. But what happens when your URLs don't end with a slash? What will you do in that case?
The regular expression above (a simplification) matches only one character, just like your negated character set does. So it simply replaces your negated set in the complete regex you're using.
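One way to see what happens without a trailing slash: because the quantified set is not followed by a required /, the match simply ends at the first character outside the set or at the end of the string. A minimal sketch, assuming the allow-list set is dropped into the question's pattern; the function name extractDomain and the URLs are made up for the example.

```javascript
// The question's pattern with the negated set swapped for the allow list.
const domain = /:\/\/(www\.)?([-0-9a-zA-Z_.]+)/;

// Hypothetical helper: returns the host without a leading "www.", or null when nothing matches.
function extractDomain(url) {
  const match = domain.exec(url);
  return match ? match[2] : null;
}

console.log(extractDomain("http://www.example.com/path")); // ends at the first "/"
console.log(extractDomain("http://example.com"));          // ends at the end of the string
console.log(extractDomain("https://example.com:8080/x"));  // ends at the ":" before the port
```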

Related

The dynamic route parameter does not support Unicode languages other than English

Official Documentation of NextJS 13 (beta) describes dynamic route parameters in this link.
Explanation:
app/shop/[slug]/page.js
To access slug, we use params.slug.
Example:
app/shop/this-is-shop/page.js
Here params.slug returns -> this-is-shop.
If we use a language other than English, for example Hindi:
app/shop/यह-दुकान-है/page.js
params.slug returns -> %E0%A4%AF%E0%A4%B9-%E0%A4%A6%E0%A5%81%E0%A4%95%E0%A4%BE%E0%A4%A8-%E0%A4%B9%E0%A5%88.
But it should return -> यह-दुकान-है.
That's the issue. How can I fix this, or is there a way to get the actual text?
This is not an issue related to Next.js. It's how URLs work. Only a limited set of characters is allowed in a URL; all others must be encoded. Here is a quote from an article from URLEncoder.io:
A URL is composed of a limited set of characters belonging to the US-ASCII character set. These characters include digits (0-9), letters(A-Z, a-z), and a few special characters ("-", ".", "_", "~").
ASCII control characters (e.g. backspace, vertical tab, horizontal tab, line feed etc), unsafe characters like space, \, <, >, {, } etc, and any character outside the ASCII charset is not allowed to be placed directly within URLs.
Moreover, there are some characters that have special meanings within URLs. These characters are called reserved characters. Some examples of reserved characters are ?, /, #, : etc. Any data transmitted as part of the URL, whether in query string or path segment, must not contain these characters.
So, what do we do when we need to transmit any data in the URL that contain these disallowed characters? Well, we encode them!
URL Encoding converts reserved, unsafe, and non-ASCII characters in URLs to a format that is universally accepted and understood by all web browsers and servers...
What you are seeing is the encoded version of "यह-दुकान-है" as you can see in this example:
console.log(encodeURI("यह-दुकान-है"))
If you want the initial value, use decodeURI(), like so:
console.log(decodeURI('%E0%A4%AF%E0%A4%B9-%E0%A4%A6%E0%A5%81%E0%A4%95%E0%A4%BE%E0%A4%A8-%E0%A4%B9%E0%A5%88'));
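A minimal sketch of decoding the parameter before use. It uses decodeURIComponent, the counterpart intended for a single URL component such as one path segment, and falls back to the raw value when the input is not valid percent-encoding. The helper name decodeSlug is hypothetical, and how params reaches your page component is not shown here.

```javascript
// Hypothetical helper: turn a percent-encoded route segment back into readable text.
function decodeSlug(slug) {
  try {
    return decodeURIComponent(slug);
  } catch (err) {
    // decodeURIComponent throws a URIError on malformed sequences such as "%E0%A4".
    return slug;
  }
}

const encoded = "%E0%A4%AF%E0%A4%B9-%E0%A4%A6%E0%A5%81%E0%A4%95%E0%A4%BE%E0%A4%A8-%E0%A4%B9%E0%A5%88";
console.log(decodeSlug(encoded));        // the original Hindi slug
console.log(decodeSlug("this-is-shop")); // plain ASCII slugs pass through unchanged
```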

Regex to validate URL/URI without special characters

I need a regex to validate that a given value is a URL or URI without non-ASCII or special characters.
So valid scenarios :
https://stackoverflow.com/questions/ask
/questions/ask
Invalid scenarios:
https://stackoverflow.com/questions/ask/##$ds
questions/ask#*
https://stackoverflow.com/questions/ask/##$ds ### (without spaces)
How could this be achieved using a regex?
I'd say this one is a good place to start:
^(https?:\/\/)?(www\.)?[-a-zA-Z0-9.\/]+$
It's not entirely clear what your constraints are, but this pattern covers:
Optional protocol
Optional www. prefix
A limited set of accepted characters
Validation from start to finish (^ to $)
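To check the pattern against the scenarios from the question, a small sketch follows. It assumes the optional protocol is written as a group, (https?:\/\/)?, so that http:// and https:// are matched as whole prefixes rather than as single characters from a set; the function name isPlainUrl is made up for the example.

```javascript
// Optional protocol, optional "www.", then only letters, digits, "-", "." and "/".
const plainUrl = /^(https?:\/\/)?(www\.)?[-a-zA-Z0-9.\/]+$/;

function isPlainUrl(value) {
  return plainUrl.test(value);
}

const samples = [
  "https://stackoverflow.com/questions/ask",
  "/questions/ask",
  "https://stackoverflow.com/questions/ask/##$ds",
  "questions/ask#*",
];
for (const s of samples) {
  console.log(s, "->", isPlainUrl(s));
}
```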

Simple regex pattern for email

I've been trying to work this out for almost an hour now, and I can't see myself getting much further with it without any help or explanation. I've used regex before, but only ones that are very simple or had already been made.
This time, I'm trying to work out how to write a regex that achieves the following:
Email address must contain one @ character and at least one dot (.) at least one position after the @ character.
So far, this is all I've been able to work out, and it still matches email addresses that, for example, have more than one @ symbol.
.*?@?[^@]*\.+.*
It would be helpful if you could show me how to construct a regular expression that checks for a single @ and at least one full stop one or more positions after the @. If you could break down the regex and explain what each bit does, that would be really helpful.
I want to keep it simple for now, so it doesn't have to be a full-on super-accurate email validation expression.
With the help of ClasG's comment, I now have a fairly straightforward and suitable regex for my problem. For the sake of anyone learning regex who might come across this question in the future, I'll break the expression down below.
Expression: ^[^@]+@[^@]+\.[^@]+$
^ Matches the beginning of the string (or line if multiline)
[^@] Match any character that is not in this set (i.e. not "@")
+ Match one or more of this
@ Match "@" character
[^@] Match any character that is not in this set
+ Match one or more
\. Match "." (full stop) character (backslash escapes the full stop)
[^@] Match any character that is not in this set
+ Match one or more
$ Matches the end of the string (or line if multiline)
And in plain language:
Start at beginning of string or line
Include all characters except @ until the @ sign
Include the @ sign
Include all characters except @ after the @ sign until the full stop
Include all characters except @ after the full stop
Stop at the end of the string or line
Email address must contain one @ character
No they don't. An email address with no '@' character is perfectly valid. An email address with multiple '@' characters before an IP address is perfectly valid (as long as all but 1 are outside the ADDR_SPEC or are quoted/escaped within the mailbox name).
I suspect you're not trying to validate an email address but rather an ADDR_SPEC. The answer linked by Máté Safranka describes how to validate an ADDR_SPEC (not an email address). Unless you expect to be validating records which don't have a valid internet MX record, and more than one '@' is more likely to be a typo than a valid address....
/[a-z0-9\._%+!$&*=^|~#%'`?{}/\-]+@([a-z0-9\-]+\.){1,}([a-z]{2,16})/
^[^\W_]+\w*(?:[.-]\w*)*[^\W_]+@[^\W_]+(?:[.-]?\w*[^\W_]+)*(?:\.[^\W_]{2,})$
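The simple broken-down expression from earlier in this thread can be tried out directly. This is a sketch for the rule stated in the question (exactly one separator, with a dot somewhere after it), assuming the separator is the usual @; it is not a full address validator, and the helper name and sample addresses are made up.

```javascript
// One or more non-@ characters, a single @, then non-@ characters containing a dot
// that has at least one character on each side.
const simpleEmail = /^[^@]+@[^@]+\.[^@]+$/;

function looksLikeEmail(value) {
  return simpleEmail.test(value);
}

const candidates = ["user@example.com", "user@@example.com", "user@example", "user.name@mail.example.org", "@example.com"];
for (const value of candidates) {
  console.log(value, looksLikeEmail(value));
}
```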

Regex to retrieve domain.extension from a url

I need to come up with a regex to extract only domainname.extension from a URL. Right now I have a regex that strips out "www." from the host name, but I need to update the regex to remove any subdomain strings from the hostname:
This strips off www.:
window.location.hostname.replace(/^www\./i, '')
But I need to detect any subdomain info on abc.def.test.com or ghi.test.com to replace it with an empty string and always return "test.com"
You could achieve the same result with the replace method, but match is somewhat more suitable:
console.log(
window.location.hostname.match(/[^\s.]+\.[^\s.]+$/)[0]
);
[^\s.]+ Match non-whitespace characters except dot
$ Assert end of input string
Doing so with the replace method, as discussed in the comments:
console.log(
window.location.hostname.replace(/[^\s.]+\.(?=[^\s.]+\.)/g, '')
);
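Both variants can be exercised on plain strings instead of window.location.hostname. A sketch, assuming the lookahead in the replace variant is written with + so that labels of any length are handled; the function names and host names are illustrative.

```javascript
// match variant: keep the last two dot-separated labels.
function lastTwoLabelsByMatch(hostname) {
  const m = hostname.match(/[^\s.]+\.[^\s.]+$/);
  return m ? m[0] : hostname; // fall back to the input when there is no dot at all
}

// replace variant: drop every label that is still followed by another "label." further on.
function lastTwoLabelsByReplace(hostname) {
  return hostname.replace(/[^\s.]+\.(?=[^\s.]+\.)/g, "");
}

for (const host of ["abc.def.test.com", "ghi.test.com", "test.com"]) {
  console.log(host, "->", lastTwoLabelsByMatch(host), lastTwoLabelsByReplace(host));
}
```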
Well, that depends mainly on what you define as a domain and how you define a subdomain. I'll use the most general approach and treat the last two components as the top domain (as you do with test.com). In that case you can proceed as:
([a-zA-Z0-9-]+\.)*([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+) ==> $2
As you can see, the regexp is divided into two groups, and we only keep the second in the output, which is the last two domain components. The [a-zA-Z0-9-] subexpression deserves some explanation, as it appears three times in the regexp: it is the set of characters allowed in a domain component, including the hyphen. See [1] for a working demo.
If you want to cope with the co.uk example posted in the last demo, so that www.test.co.uk is matched as test.co.uk, then you have to anchor your regexp to the end (with $, or, if you are in the middle of a URL, with the next : or / that can follow the domain name), to avoid prefixes being detected as valid domains, as shown in [2]:
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)$ ==> $3
or [3]
(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)(?=[:/]|$) ==> $3
Of course, you have to list all the countries that follow the convention of using top domains as prefixes under their structure. Be careful here, as not all countries follow this approach. I've used the non-greedy *? operator here because without it the groups do not match as desired (the first group gets greedy, and the match is again co.uk instead of test.co.uk).
But as you ultimately have to anchor your regexp (mainly because domain names can also appear in the query string or in the subpath of the URL), the best approach is to anchor it to the whole URL.
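The end-anchored variant can be wrapped in a small function. This is a sketch under the same assumption as the answer, namely a hand-maintained list of country suffixes (uk, au, tw, cn here), with the character class written as [a-zA-Z0-9-] throughout; the name registrableDomain is hypothetical.

```javascript
// Lazy subdomain prefix, then two labels, then an optional listed country suffix, anchored at the end.
const DOMAIN_RE = /(([a-zA-Z0-9-]+\.)*?)([a-zA-Z0-9-]+\.[a-zA-Z0-9-]+(\.(uk|au|tw|cn))?)$/;

function registrableDomain(hostname) {
  const m = DOMAIN_RE.exec(hostname);
  return m ? m[3] : null; // group 3 corresponds to the "$3" of the answer
}

console.log(registrableDomain("www.test.co.uk"));
console.log(registrableDomain("abc.def.test.com"));
console.log(registrableDomain("ghi.test.com"));
```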

Regex explanation

I am looking at the code in the tumblr bookmarklet and was curious what the code below did.
try{
if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
throw(0);
tstbklt();
}
Can anyone tell me what the if line is testing? I have tried to decode the regex but have been unable to do so.
Initially excluding the specifics of the regex, this code is:
if ( ! /.../.test(l.host) )
"if not regex.matches(l.host)" or "if l.host does not match this regex"
So, the regex must correctly describe the contents of l.host text for the conditional to fail and thus avoid throwing the error.
On to the regex itself:
^(.*\.)?tumblr[^.]*$
This is checking for the existence of tumblr but only after any string ending in . that might exist:
^ # start of line
( # begin capturing group 1
.* # match any (non-newline) character, as many times as possible, but zero allowed
\. # match a literal .
) # end capturing group 1
? # make whole preceding item optional
tumblr # match literal text tumblr
[^.]* # match any non . character, as many times as possible, but zero allowed
$ # match end of line
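Running the pattern against a few host names shows what the breakdown implies in practice. The sample hosts are made up. Note that, as written here, [^.]*$ allows no further dot after tumblr, so a host that continues with another label such as .com does not match.

```javascript
const pattern = /^(.*\.)?tumblr[^.]*$/;

// Hosts whose last label starts with "tumblr" match; anything with a dot after "tumblr" does not.
const hosts = ["tumblr", "www.tumblr", "a.b.tumblrfoo", "tumblr.com", "www.tumblr.com", "nottumblr"];
for (const host of hosts) {
  console.log(host, pattern.test(host));
}
```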
I thought it was testing to see if the host was tumblr
Yeah, it looked like it might be intended to check that, but if so it's the wrong way to do it.
For that, the first bit should be something like ^(?:[\w-]+\.)? to capture an alphanumeric subdomain (the ?: is a non-capturing group, the [\w-]+ is at least 1 alphanumeric, underscore or hyphen) and the last bit should be either \.(?:com|net|org)$ or perhaps like (?:\.[a-zA-Z]+)+$ depending on how flexible the tld section might need to be.
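Putting the suggested pieces together might look like the sketch below. The literal tumblr in the middle and the fixed com|net|org list are assumptions based on the suggestion, not taken from the bookmarklet itself, and the function name isTumblrHost is made up.

```javascript
// Optional single subdomain label, then "tumblr", then one of a fixed set of TLDs.
const strictTumblr = /^(?:[\w-]+\.)?tumblr\.(?:com|net|org)$/;

function isTumblrHost(host) {
  return strictTumblr.test(host);
}

const hosts = ["tumblr.com", "www.tumblr.com", "staff.tumblr.com", "tumblr.com.example.org", "nottumblr.com"];
for (const host of hosts) {
  console.log(host, isTumblrHost(host));
}
```

Because the optional group allows only one label, a host with two subdomain levels would be rejected; whether that is desirable depends on the hosts you expect.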
My attempt to break it down (I'm no expert with regex, however):
if(!/.../.test(l.host))
This part isn't really regex; it tells us to execute the body of the if() only when the test fails.
^(.*\.)?
This part allows any characters before the word tumblr, as long as they are followed by a . But it is all optional (see the ? at the end).
[^.]*$
Next, it matches any character except the . and the *$ extends that to any number of such characters (so it doesn't stop after one), up to the end of the string.
Finally, .test() tests the regex against the current hostname, or whatever l.host contains (I'm not familiar with the tumblr bookmarklet).
So basically, it looks like that part is checking whether the host is part of tumblr and, if it is not, throwing that exception.
Looking forward to seeing how wrong I am :)
