Call a function if a valid URL is found in a textarea - javascript

What would be the best way to immediately call a function (myFunction()) as soon as a valid URL has been typed into a textfield? I've googled around but I haven't found anything that helps. Using a regular expression would probably be best but I need one that recognizes all sorts of URLs:
http://google.tld, www.google.tld, http://www.google.tld
But still doesn't consider things like "index.php" to be a URL. Does anyone know about such an expression?

^((?:https?|ftp):\/\/)?([\w\.]+.)([a-z]{2,4})$
Supports ftp as well ;)

You might struggle a bit as there would be so many different possibilities. This one will match any url that is technically a valid http or https path (which includes any character in the path after the domain name, any number of subdomains, etc)
((http)s?(://))?[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)*(/(.*))?
Of if you'd like to exclude intranets, you can force a tld using the following:
((http)s?(://))?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9]+)*(\.[a-zA-Z0-9]{2,4})+(/(.*))?

this will match any URL that ends with something like '.com' or '.ch' (you have to maintain the list of valid TLDs)
^(http:\/\/)?([\w\.]+\.)((com)|(ch))$
with javascript the forwardslash doesn't need escaping and TLDs could be less strict e.g. just something with 2-4 characters.
^(https?://)?([\w\.]+\.)([a-z]{2,4})$
Considering a comment to this question by CanSpice the idea of allowing TLDs with different length is difficult to cover as the event trigger may fire too early. A time delay upon the onchange trigger may solve this kind of issues.
Precise requirements and pros/cons of each solution should be weighted.
example at rubular

Related

Custom - Javascript regular expression to validate custom email addresses

I want it to be correct my JavaScript regex pattern to validate below email address scenarios
msekhar#yahoo.com
msekhar#cs.aau.edu
ms.sekhar#yahoo.com
ms_sekhar#yahoo.com
msekhar#cs2.aau.edu
msekhar#autobots.ai
msekhar#interior.homeland1.myanmar.mm
msekhar1922#yahoo.com
msekhar#21#autobots.com
\u001\u002#autobots.com
I have tried the following regex pattern but it's not validating all the above scenarios
/^[_a-z0-9]+(\.[_a-z0-9]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$/
Could any one please help me with this where am doing wrong?
The following regex should do:
^(([^<>()\[\]\.,;:\s#\"]+(\.[^<>()\[\]\.,;:\s#\"]+)*)|(\".+\"))#(([^<>()[\]\.,;:\s#\"]+\.)+[^<>()[\]\.,;:\s#\"]{2,})$
Test it: https://regex101.com/r/7gH0BR/2
EDIT: I have added all your test cases
I have always used this one but note it doesn't trigger on escaped unicode:
^([\w\d._\-#])+#([\w\d._\-#]+[.][\w\d._\-#]+)+$
You can see how it works here: https://regex101.com/r/caa7b2/4
First off [_a-z0-9]+ is going to match the username fields for the majority of those testcases. Anything further testing of username field content will result in a mismatch. If you write a pattern that expects two .-delimitered fields, it'll match when you provide two .-delimitered fields and only then, not anything else. Make a mental note of that. I think you probably meant to put the . in the first character set, and omit this part here: (.[_a-z0-9]+)...
As for the domain part of the email address, similar story there... if you're trying to match domains containing two labels (yahoo and com) against a pattern that expects three... it's going to fail because there's one less label, right? There are domain names that only contain one label which you might want to recognise as email addresses, too, like localhost...
You know, there is a point to where you can dig yourself down a very deep rabbit hole trying to parse email addresses, much to the effect of this question and answer sequence. If you're making this complex using regular expressions... I think maybe a better tool is a proper parser generator... otherwise, write the following:
A pattern that matches anything up until an # character
A pattern that matches the # character (this will help you learn how to avoid your .-related error)
A pattern that matches everything (this will help you understand your .-related error)
Combine the three above in the order presented.

javascript encodeURIComponent and escape?

I use JS to sent encodeURIComponent string to a PHP file write and has been working fine for years; until recently I met with a strange effect that the text need to be further encoded with escape in order to get it to work! The sympton start to show only when I use an open source wysiwyg editor !
What could be the offending characters in URI that need escape to fix it? I used to think URI only reserve ? & = for its syntax to work.
The situation you describe could possibly be explained--although there's no way of knowing without you telling us what the string is, and how it's being used--by a URL which involves two levels of nested URL-like values.
Consider a URL taking a query parameter which is another URL:
http://me.com?url=http://you.com?qp=1
That URL is subject to misinterpretation, so we would normally URL-encode the you.com URL, giving us:
http://me.com?url=http%3A%2F%2Fyou.com%3Fqp%3D1
Whoever is working with this URL can now extract the query parameter named url with the value http%3A%2F%2Fyou.com%3Fqp%3D1, decode it (often a framework or library will decode it for you), and then use it to jump to or call that URL.
Consider, however, the case where the you.com URL itself has a query parameter, not ?qp=1 as given in the first example, but rather something that itself needs to be URL-encoded. To keep things simple, we'll just use "cat?pictures". We'd need to encode that, making the query parameter
In other words, the URL in question is going to be
?qp=cat%3Fpictures
If we just use that as is, then our entire URL becomes
http://me.com?url=http%3A%2F%2Fyou.com%3Fqp=cat%3Fpictures
Unfortunately, if we now decode that in a naive way, we get
http://me.com?url=http://you.com?qp=cat?pictures
In other words, the nested URL has been decoded as well, meaning that it will think the URL has two query paramters, namely url and qp. To successfully deal with this problem, we need to encode the second query parameter a second time, yielding
http://me.com?url=http%3A%2F%2Fyou.com%3Fqp%3Dcat%253Fpictures
Please note, however, that if you use your language or environment's built-in tools and libraries for handling query parameters, most of this will happen automatically and prevent you from having to worry about it.
The symptom start to show only when I use an open source wysiwyg editor
An editor merely places characters in a file. It's very hard to imagine that an editor is causing the problem you refer to, unless perhaps one editor is configured to use smart quotes, for example, which would pretty much break everything that involved quotes.

How to approach string length constraints when localization is brought into the equation?

Once there was a search input.
It was responsible for filtering data in a table based on user input.
But this search input was special: it would not do anything unless a minimum of 3 characters was entered.
Not because it was lazy, but because it didn't make sense otherwise.
Everything was good until a new and strange (compared to English) language came to town.
It was Japanese and now the minimum string length of 3 was stupid and useless.
I lost the last few pages of that story. Does anyone remember how it ends?
In order to fix the issue, you obviously need to determine if user's input belongs to certain script(s). The most obvious way to do this is to use Unicode Regular Expressions:
var regexPattern = "[\\p{Katakana}\\p{Hiragana}\\p{Han}]+";
The only issue would be, that JavaScript does not support this kind of regular expressions out of the box. Anyway, you are lucky - there is a JS library called XRegExp and its Scripts add-on seems to exactly what you need. Now, the question is, whether you want to require at least three characters for non-Japanese or non-Chinese users, or do it otherwise - require at least three characters for certain scripts (Latin, Common, Cyrillic, Greek and Hebrew) while allowing any other to be searched on one character. I'd suggest the second solution:
if (XRegExp('[\\p{Latin}\\p{Common}\\p{Cyrillic}\\p{Greek}\\p{Hebrew}]+').test(input)) {
// test for string length and call AJAX if the string is long enough
} else {
// call AJAX search method
}
You might want to pre-compile the regular expression for better performance, but that's basically it.
I guess it mainly depends on where you get that min length variable from. If it's hardcoded, you'd probably better use a dynamic internationalization module:
int.getMinStringLength(int.getCurrentLanguage())
Either you have a dynamic bindings framework such as AngularJS, or you update that module when the user changes the language.
Now maybe you'd want to sort your supported languages by using grouping attributes such as "verbose" and "condensed".

Should I worry that using GET in a form element doesn't automatically URL-encode angle brackets?

So I decided to use GET in my form element, point it to my cshtml page, and found (as expected) that it automatically URL encodes any passed form values.
I then, however, decided to test if it encodes angle brackets and surprisingly found that it did not when the WebMatrix validator threw a server error warning me about a potentially dangerous value being passed.
I said to myself, "Okay, then I guess I'll use Request.Unvalidated["searchText"] instead of Request.QueryString["searchText"]. Then, as any smart developer who uses Request.Unvalidated does, I tried to make sure that I was being extra careful, but I honestly don't know much about inserting JavaScript into URLs so I am not sure if I should worry about this or not. I have noticed that it encodes apostrophes, quotations, parenthesis, and many other JavaScript special characters (actually, I'm not even sure if an angle bracket even has special meaning in JavaScript OR URLs, but it probably does in one, if not both. I know it helps denote a List in C#, but in any event you can write script tags with it if you could find a way to get it on the HTML page, so I guess that's why WebMatrix's validator screams at me when it sees them).
Should I find another way to submit this form, whereas I can intercept and encode the user data myself, or is it okay to use Request.Unvalidated in this instance without any sense of worry?
Please note, as you have probably already noticed, my question comes from a WebMatrix C#.net environment.
Bonus question (if you feel like saving me some time and you already know the answer off the top of your head): If I use Request.Unvalidated will I have to URL-decode the value, or does it do that automatically like Request.QueryString does?
---------------------------UPDATE----------------------------
Since I know I want neither a YSOD nor a custom error page to appear simply because a user included angle brackets in their "searchText", I know I have to use Request.Unvalidated either way, and I know I can encode whatever I want once the value reaches the cshtml page.
So I guess the question really becomes: Should I worry about possible XSS attacks (or any other threat for that matter) inside the URL based on angle brackets alone?
Also, in case this is relevant:
Actually, the value I am using (i.e. "searchText") goes straight to a cshtml page where the value is ran through a (rather complex) SQL query that queries many tables in a database (using both JOINS and UNIONS, as well as Aliases and function-based calculations) to determine the number of matches found against "searchText" in each applicable field. Then I remember the page locations of all of these matches, determine a search results order based on relevance (determined by type and number of matches found) and finally use C# to write the search results (as links, of course) to a page.
And I guess it is important to note that the database values could easily contain angle brackets. I know it's safe so far (thanks to HTML encoding), but I suppose it may not be necessary to actually "search" against them. I am confused as to how to proceed to maximum security and functional expecations, but if I choose one way or the other, I may not know I chose the wrong decision until it is much too late...
URL and special caracters
The url http://test.com/?param="><script>alert('xss')</script> is "benign" until it is read and ..
print in a template : Hello #param. (Potential reflected/persisted XSS)
or use in Javascript : divContent.innerHTML = '<a href="' + window.location.href + ... (Potential DOM XSS)
Otherwise, the browser doesn't evaluate the query string as html/script.
Request.Unvalidated/Request.QueryString
You should use Request.Unvalidated["searchText"] if you are expecting to receive special caracters.
For example : <b>User content</b><p>Some text...</p>
If your application is working as expected with QueryString["searchText"], you should keep it since it validate for potential XSS.
Ref: http://msdn.microsoft.com/en-us/library/system.web.httprequest.unvalidated.aspx

Encoding uri component, do not automatically decode

I have users with slahes in their usernames. I want to give them easy urls such as /user/username even if their username is problematic. ie /user/xXx/superboy.
I'm using client side routing and I don't think there's any wildcard support. One obvious way to fix this would be to encode their username. href="/user/xXx%2Fsuperboy". But the browser automatically decodes the url when going to the link and then my router ends up not matching anyway. Is there some way to keep the browser from automatically decoding the url or any other way to solve my problem (perhaps a different decoding scheme?). Thanks.
I'm using angularjs with angular ui-router for routing.
Part 1.
Automatic decoding of URIs can be encounted in many situations, such as it being interpreted once then the interpretation passed on (to be re-interpreted).
Part 2.
In a path in a URI, / has a special meaning, so you can't use it as the name of a file or directory. This means if you're mapping something that isn't a real path to a path, you may end up with unexpected characters causing problems. To solve this the characters need to be encoded.
As you want to map usernames to a URI, you have to consider this might happen, so you have to encode in a way that allows for this. From your question, it looks like this happens once, so you'll need to double encode any part of the URI that isn't a "real URI path".
Also maybe you can explain how reliable this is and whether it's advisable
If you always have it used in the same way, it should be reliable. As for advisable, it would be much better to use the query part, rather than the path for this. href="/user?xXx/superboy" is a valid URI and you can get the query string easily (everything after first ?, or an inbuilt method). The only character you'd have to watch for is #, which has special meaning again.

Categories