Related
I'm trying to implement an asymmetrical search for a dictionary web app, so searching for ü, for example, will return only tokens that actually contain ü, but searching for u will return both u and ü. (This is so users who don't know how to type special characters can still search for them, but users who do know how to type them won't be inundated with the plain character forms unnecessarily.)
It has to all be client-side JavaScript without any external libraries.
I've managed to make the second search type work by running both the search term and the text I'm searching through the following function, effectively merging special characters with their plain counterparts:
function cleanUp(dirty) {
cleaned = dirty.replace(/[áàâãäāă]/ig,"a");
cleaned = cleaned.replace(/đ/ig,"d");
cleaned = cleaned.replace(/[éèêẽëēĕ]/ig,"e");
cleaned = cleaned.replace(/[íìîĩïīĭ]/ig,"i");
cleaned = cleaned.replace(/ñ/ig,"n");
cleaned = cleaned.replace(/[óòôõöōŏ]/ig,"o");
cleaned = cleaned.replace(/[úùûũüūŭ]/ig,"u");
return cleaned;
}
I then compare the strings to get my results with something like:
var search_term = cleanup(search_input.value);
var text_to_search = cleanup(main_text);
if (text_to_search.indexOf(search_term) > -1) ... //do something
It's not elegant, but it works. After cleaning up both strings the user can search for i.e. uber and get über even if they don't know how to type ü. But if they do know how, searching for über directly also returns things like uber, which is what I don't want.
I've already thought of things like checking for each special character separately for each search term or duplicating every dictionary entry that has a special character to produce a special-character and a plain-character version, but all of my ideas would seriously slow down the processing time for the search.
Any ideas are greatly appreciated.
The answer you posted sounds quite reasonable.
I would just like to suggest a cleaner way (pun intended) to code your cleanup() function and similar functions that do a series of string operations:
function cleanUp(dirty) {
return dirty
.replace(/[áàâãäāă]/ig,"a")
.replace(/đ/ig,"d")
.replace(/[éèêẽëēĕ]/ig,"e")
.replace(/[íìîĩïīĭ]/ig,"i")
.replace(/ñ/ig,"n")
.replace(/[óòôõöōŏ]/ig,"o")
.replace(/[úùûũüūŭ]/ig,"u");
}
I ended up checking to see if the search term contained any special characters, and if it did, I didn't run it through cleanup(), and compared it to the original dictionary entry instead of the cleaned one. Thanks for the comments everyone.
So I've made this search that does what its supposed to do front-end wise. However, when submitting I'd like the query to ignore commas.
Right now I'm using commas to make a comma separated search. The whole thing is, when I submit; the comma's are included and thus messes up my search values.
Is there any way to ignore comma's upon submit?
Example: Searching [Example][Test] will actually return Example,Test.
I've made a fiddle here
Any suggestions and help is greatly appreciated.
var firster = true;
//capture form submit
$('form.nice').submit(function(e){
if(firster){
// if its the first submit prevent default
e.preventDefault();
// update input value to have no commas
var val = $('input').val();
val = val.replace(/,/g, ' ');
$('input').val(val);
// let submit go through and submit
firster = false;
$(this).submit();
}
});
DEMO
Looking at your profile, I'm guessing you're using python as a server-side language. The issue you're trying to solve is best dealt with server-side: never rely on front-end code to escape or format data that is being used in a query... check Bobby Tables for more info
Anyhow, in python, you could try this:
ajaxString.replace(",","\", \"")
Thiis will replace all commas witIh " OR ", so a string like some, keywords is translated into some", "keywords, just add some_field IN (" and the closing ") to form a valid query.
Alternatively, you can split the keywords, and deal with them separately (which could come in handy when sorting the results depending on how relevant the results might be.
searchTerms = ajaxString.split(",")
>>>['some','keywords']
That should help you on your way, I hope.
Lastly, I'd suggest just not bothering with developing your own search function at all. Just add a google search to your site, they're the experts. There is just no way you, by yourself, can do better. Or even if you could, just imagine how long it'd take you!
Yes, sometimes a company will create their own search-engine, but only if they have a good reason to do so, and have the resources such an endevour requires. Programming is often all about being "cleverly lazy": Don't reinvent the wheel.
I want to find anything that comes after s= and before & or the end of the string. For example, if the string is
t=qwerty&s=hello&p=3
I want to get hello. And if the string is
t=qwerty&s=hello
I also want to get hello
Thank you!
\bs=([^&]+) and grabbing $1should be good enough, no?
edit: added word anchor! Otherwise it would also match for herpies, dongles...
Why don't you try something that was generically aimed at parsing query strings? That way, you can assume you won't run into the obvious next hurdle while reinventing the wheel.
jQuery has the query object for that (see JavaScript query string)
Or you can google a bit:
function getQuerystring(key, default_)
{
if (default_==null) default_="";
key = key.replace(/[\[]/,"\\\[").replace(/[\]]/,"\\\]");
var regex = new RegExp("[\\?&]"+key+"=([^&#]*)");
var qs = regex.exec(window.location.href);
if(qs == null)
return default_;
else
return qs[1];
}
looks useful; for example with
http://www.bloggingdeveloper.com?author=bloggingdeveloper
you want to get the "author" querystring's value:
var author_value = getQuerystring('author');
The simplest way to do this is with a selector s=([^&]*)&. The inside of the parentheses has [^&] to prevent it from grabbing hello&p=3 of there were another field after p.
You can also use the following expression, based on the solution provided here, which finds all characters between the two given strings:
(?<=s=)(.*)(?=&)
In your case you may need to slightly modify it to account for the "end of the string" option (there are several ways to do it, especially when you can use simple code manipulations such as manually adding a & character to the end of the string before running the regex).
According to JSHint, a Javascript programmer should not add a space after the first parenthesis and before the last one.
I have seen a lot of good Javascript libraries that add spaces, like this:
( foo === bar ) // bad according to JSHint
instead of this way:
(foo === bar) // good according to JSHint
Frankly, I prefer the first way (more spaces) because it makes the code more readable. Is there a strong reason to prefer the second way, which is recommended by JSHint?
This is my personal preference with reasons as to why.
I will discuss the following items in the accepted answer but in reverse order.
note-one not picking on Alnitak, these comments are common to us all...
note-two Code examples are not written as code blocks, because syntax highlighting deters from the actual question of whitespace only.
I've always done it that way.
Not only is this never a good reason to defend a practice in programming, but it also is never a good reason to defend ANY idea opposing change.
JS file download size matters [although minification does of course fix that]
Size will always matter for Any file(s) that are to be sent over-the-wire, which is why we have minification to remove unnecessary whitespace. Since JS files can now be reduced, the debate over whitespace in production code is moot.
moot: of little or no practical value or meaning; purely academic.
moot definition
Now we move on to the core issue of this question. The following ideas are mine only, and I understand that debate may ensue. I do not profess that this practice is correct, merely that it is currently correct for me. I am willing to discuss alternatives to this idea if it is sufficiently shown to be a poor choice.
It's perfectly readable and follows the vast majority of formatting conventions in Javascript's ancestor languages
There are two parts to this statement: "It's perfectly readable,"; "and follows the vast majority of formatting conventions in Javascript's ancestor languages"
The second item can be dismissed as to the same idea of I've always done it that way.
So let's just focus on the first part of the statement It's perfectly readable,"
First, let's make a few statements regarding code.
Programming languages are not for computers to read, but for humans to read.
In the English language, we read left to right, top to bottom.
Following established practices in English grammar will result in more easily read code by a larger percentage of programmers that code in English.
NOTE: I am establishing my case for the English language only, but may apply generally to many Latin-based languages.
Let's reduce the first statement by removing the adverb perfectly as it assumes that there can be no improvement. Let's instead work on what's left: "It's readable". In fact, we could go all JS on it and create a variable: "isReadable" as a boolean.
THE QUESTION
The question provides two alternatives:
( foo === bar )
(foo === bar)
Lacking any context, we could fault on the side of English grammar and go with the second option, which removes the whitespace. However, in both cases "isReadable" would easily be true.
So let's take this a step further and remove all whitespace...
(foo===bar)
Could we still claim isReadable to be true? This is where a boolean value might not apply so generally. Let's move isReadable to an Float where 0 is unreadable and 1 is perfectly readable.
In the previous three examples, we could assume that we would get a collection of values ranging from 0 - 1 for each of the individual examples, from each person we asked: "On a scale of 0 - 1, how would you rate the readability of this text?"
Now let's add some JS context to the examples...
if ( foo === bar ) { } ;
if(foo === bar){};
if(foo===bar){};
Again, here is our question: "On a scale of 0 - 1, how would you rate the readability of this text?"
I will make the assumption here that there is a balance to whitespace: too little whitespace and isReadable approaches 0; too much whitespace and isReadable approaches 0.
example: "Howareyou?" and "How are you ?"
If we continued to ask this question after many JS examples, we may discover an average limit to acceptable whitespace, which may be close to the grammar rules in the English language.
But first, let's move on to another example of parentheses in JS: the function!
function isReadable(one, two, three){};
function examineString(string){};
The two function examples follow the current standard of no whitespace between () except after commas. The next argument below is not concerned with how whitespace is used when declaring a function like the examples above, but instead the most important part of the readability of code: where the code is invoked!
Ask this question regarding each of the examples below...
"On a scale of 0 - 1, how would you rate the readability of this text?"
examineString(isReadable(string));
examineString( isReadable( string ));
The second example makes use of my own rule
whitespace in-between parentheses between words, but not between opening or closing punctuation.
i.e. not like this examineString( isReadable( string ) ) ;
but like this examineString( isReadable( string ));
or this examineString( isReadable({ string: string, thing: thing });
If we were to use English grammar rules, then we would space before the "(" and our code would be...
examineString (isReadable (string));
I am not in favor of this practice as it breaks apart the function invocation away from the function, which it should be part of.
examineString(); // yes; examineString (): // no;
Since we are not exactly mirroring proper English grammar, but English grammar does say that a break is needed, then perhaps adding whitespace in-between parentheses might get us closer to 1 with isReadable?
I'll leave it up to you all, but remember the basic question:
"Does this change make it more readable, or less?"
Here are some more examples in support of my case.
Assume functions and variables have already been declared...
input.$setViewValue(setToUpperLimit(inputValue));
Is this how we write a proper English sentence?
input.$setViewValue( setToUpperLimit( inputValue ));
closer to 1?
config.urls['pay-me-now'].initialize(filterSomeValues).then(magic);
or
config.urls[ 'pay-me-now' ].initialize( fitlerSomeValues ).then( magic );
(spaces just like we do with operators)
Could you imagine no whitespace around operators?
var hello='someting';
if(type===undefined){};
var string="I"+"can\'t"+"read"+"this";
What I do...
I space between (), {}, and []; as in the following examples
function hello( one, two, three ){
return one;
}
hello( one );
hello({ key: value, thing1: thing2 });
var array = [ 1, 2, 3, 4 ];
array.slice( 0, 1 );
chain[ 'things' ].together( andKeepThemReadable, withPunctuation, andWhitespace ).but( notTooMuch );
There are few if any technical reasons to prefer one over the other - the reasons are almost entirely subjective.
In my case I would use the second format, simply because:
It's perfectly readable, and follows the vast majority of formatting conventions in Javascript's ancestor languages
JS file download size matters [although minification does of course fix that]
I've always done it that way.
Quoting Code Conventions for the JavaScript Programming Language:
All binary operators except . (period) and ( (left parenthesis) and [ (left bracket) should be separated from their operands by a space.
and:
There should be no space between the name of a function and the ( (left parenthesis) of its parameter list.
I prefer the second format. However there are also coding style standards out there that insist on the first. Given the fact that javascript is often transmitted as source (e.g. any client-side code), one could see a slightly stronger case with it than with other languages, but only marginally so.
I find the second more readable, you find the first more readable, and since we aren't working on the same code we should each stick as we like. Were you and I to collaborate then it would probably be better that we picked one rather than mixed them (less readable than either), but while there have been holy wars on such matters since long before javascript was around (in other languages with similar syntax such as C), both have their merits.
I use the second (no space) style most of the time, but sometimes I put spaces if there are nested brackets - especially nested square brackets which for some reason I find harder to read than nested curved brackets (parentheses). Or to put that another way, I'll start any given expression without spaces, but if I find it hard to read I insert a few spaces to compare, and leave 'em in if they helped.
Regarding JS Hint, I wouldn't worry- this particular recommendation is more a matter of opinion. You're not likely to introduce bugs because of this one.
I used JSHint to lint this code snippet and it didn't give such an advice:
if( window )
{
var me = 'me';
}
I personally use no spaces between the arguments in parentheses and the parentheses themselves for one reason: I use keyboard navigation and keyboard shortcuts. When I navigate around the code, I expect the cursor to jump to the next variable name, symbol etc, but adding spaces messes things up for me.
It's just personal preference as it all gets converted to the same bytecode/binary at the end of the day!
Standards are important and we should follow them, but not blindly.
To me, this question is about that syntax styling should be all about readability.
this.someMethod(toString(value),max(value1,value2),myStream(fileName));
this.someMethod( toString( value ), max( value1, value2 ), myStream( fileName ) );
The second line is clearly more readable to me.
In the end, it may come down to personal preference, but I would ask those who prefer the 1st line if they really make their choice because "they are used it" or because they truly believe it's more readable.
If it's something you are used to, then a short time investment into a minor discomfort for a long term benefit might be worth the switch.
In my web app I've got a form field where the user can enter an URL. I'm already doing some preliminary client-side validation and I was wondering if I could use a regexp to validate if the entered string is a valid URL. So, two questions:
Is it safe to do this with a regexp? A URL is a complex beast, and just like you shouldn't use a regexp for parsing HTML, I'm worried that it might be unsuitable for a URL as well.
If it can be done, what would be a good regexp for the task? (I know that Google turns up countless regexps, but I'm worried about their quality).
My goal is to prevent a situation where the URL appears in the web page and is unusable by the browser.
Well... maybe. People often ask a similar question about email addresses, and with those you would need a horrendously complicated regular expression (i.e. a couple pages long, at least) to correctly validate them. I don't think URLs are quite as complicated (the W3C has a document describing their format) but still, any reasonably short regexp you come up with will probably block some valid URLs.
I would suggest thinking about what kinds of URLs you need to be accepting. Maybe for your purposes, blocking the occasional valid-but-weird submission is fine, and in that case you can use a simple regex that matches most URLs, like the one in Dobiatowski's answer. Or you could use a regex that accepts all valid URLs and a few invalid ones, if that works for you. But I'd be wary of trying to find a regular expression that accepts exactly all valid URLs and no invalid ones. If you want to have 100% foolproof verification in that way, I'd suggest using a client-side validation of the second type I mentioned (that accepts a few invalid URLs) and doing a more comprehensive check on the server side, using some library in whatever language you are using to process the form data.
As stated by Crescent Fresh, in the comments, there are similar questions to this one which I didn't find. One of them also supplies the full standards-compliant regex for validating an URL:
(?:http://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.
)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)
){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F
\d]{2}))|[;:#&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:#&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:#&=])*))?)?)|(?:ftp://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?
:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-
fA-F\d]{2}))|[;?&=])*))?#)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-
)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?
:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!
*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[?:#&=])*))*)(?:;type=[AIDaid])?)?)|(?:news:(?:
(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;/?:&=])+#(?:(?:(
?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[
a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3})))|(?:[a-zA-Z](
?:[a-zA-Z\d]|[_.+-])*)|\*))|(?:nntp://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[
a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d
])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:[a-zA-Z](?:[a-zA-Z
\d]|[_.+-])*)(?:/(?:\d+))?)|(?:telnet://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+
!*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[;?&=])*))?#)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a
-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d]
)?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))/?)|(?:gopher://(?:(?:
(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:
(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+
))?)(?:/(?:[a-zA-Z\d$\-_.+!*'(),;/?:#&=]|(?:%[a-fA-F\d]{2}))(?:(?:(?:[
a-zA-Z\d$\-_.+!*'(),;/?:#&=]|(?:%[a-fA-F\d]{2}))*)(?:%09(?:(?:(?:[a-zA
-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:#&=])*)(?:%09(?:(?:[a-zA-Z\d$
\-_.+!*'(),;/?:#&=]|(?:%[a-fA-F\d]{2}))*))?)?)?)?)|(?:wais://(?:(?:(?:
(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:
[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?
)/(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)(?:(?:/(?:(?:[a-zA
-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)/(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(
?:%[a-fA-F\d]{2}))*))|\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]
{2}))|[;:#&=])*))?)|(?:mailto:(?:(?:[a-zA-Z\d$\-_.+!*'(),;/?:#&=]|(?:%
[a-fA-F\d]{2}))+))|(?:file://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]
|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:
(?:\d+)(?:\.(?:\d+)){3}))|localhost)?/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[?:#&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(
?:%[a-fA-F\d]{2}))|[?:#&=])*))*))|(?:prospero://(?:(?:(?:(?:(?:[a-zA-Z
\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)
*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:(?:(?:(?
:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&=])*)(?:/(?:(?:(?:[a-
zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&=])*))*)(?:(?:;(?:(?:(?:[
a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&])*)=(?:(?:(?:[a-zA-Z\d
$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&])*)))*)|(?:ldap://(?:(?:(?:(?:
(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:
[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?
))?/(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d])
)|(?:%20))+|(?:OID|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%2
0)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F
\d]{2}))*))(?:(?:(?:%0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?
:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID
|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])
?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*)(?:(
?:(?:(?:%0[Aa])?(?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))(?:(?:(?:(?:(
?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID|o
id)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(
?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*))(?:(?:(?:
%0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?:(?:(?:[a-zA-Z\d]|%(
?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID|oid)\.(?:(?:\d+)(?:
\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a
-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*))*(?:(?:(?:%0[Aa])?(?:%2
0)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))?)(?:\?(?:(?:(?:(?:[a-zA-Z\d$\-_.+
!*'(),]|(?:%[a-fA-F\d]{2}))+)(?:,(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-f
A-F\d]{2}))+))*)?)(?:\?(?:base|one|sub)(?:\?(?:((?:[a-zA-Z\d$\-_.+!*'(
),;/?:#&=]|(?:%[a-fA-F\d]{2}))+)))?)?)?)|(?:(?:z39\.50[rs])://(?:(?:(?
:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?
:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))
?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+)(?:\+(?:(?:
[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))*(?:\?(?:(?:[a-zA-Z\d$\-_
.+!*'(),]|(?:%[a-fA-F\d]{2}))+))?)?(?:;esn=(?:(?:[a-zA-Z\d$\-_.+!*'(),
]|(?:%[a-fA-F\d]{2}))+))?(?:;rs=(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA
-F\d]{2}))+)(?:\+(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))*)
?))|(?:cid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:#&=
])*))|(?:mid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:#
&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:#&=]
)*))?)|(?:vemmi://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z
\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\
.(?:\d+)){3}))(?::(?:\d+))?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a
-fA-F\d]{2}))|[/?:#&=])*)(?:(?:;(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a
-fA-F\d]{2}))|[/?:#&])*)=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d
]{2}))|[/?:#&])*))*))?)|(?:imap://(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+
!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+)(?:(?:;[Aa][Uu][Tt][Hh]=(?:\*|(?:(
?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+))))?)|(?:(?:;[
Aa][Uu][Tt][Hh]=(?:\*|(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2
}))|[&=~])+)))(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[
&=~])+))?))#)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])
?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:
\d+)){3}))(?::(?:\d+))?))/(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:
%[a-fA-F\d]{2}))|[&=~:#/])+)?;[Tt][Yy][Pp][Ee]=(?:[Ll](?:[Ii][Ss][Tt]|
[Ss][Uu][Bb])))|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))
|[&=~:#/])+)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[
&=~:#/])+))?(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-
9]\d*)))?)|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~
:#/])+)(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-9]\d*
)))?(?:/;[Uu][Ii][Dd]=(?:[1-9]\d*))(?:(?:/;[Ss][Ee][Cc][Tt][Ii][Oo][Nn
]=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~:#/])+)))?))
)?)|(?:nfs:(?:(?://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-
Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:
\.(?:\d+)){3}))(?::(?:\d+))?)(?:(?:/(?:(?:(?:(?:(?:[a-zA-Z\d\$\-_.!~*'
(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*)(?:/(?:(?:(?:[a-zA-Z\d\$\-_.!~*'(),
])|(?:%[a-fA-F\d]{2})|[:#&=+])*))*)?)))?)|(?:/(?:(?:(?:(?:(?:[a-zA-Z\d
\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*)(?:/(?:(?:(?:[a-zA-Z\d\$\
-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*))*)?))|(?:(?:(?:(?:(?:[a-zA-
Z\d\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*)(?:/(?:(?:(?:[a-zA-Z\d
\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*))*)?)))
Obviously this is as insane as I feared it would be, so I'll be rethinking the whole thing.
So, the answer is - yes, it can be done, but you should REALLY think twice whether you want to do it this way. Or accept that the regex will be imperfect.
Regex's are safe for lexical validity, but it doesn't mean the site is going to be there. You will actually have to test the connection to see if it's a valid URL by checking the response returned. In all, it depends on your user requirements to say what is valid and what is not - how safe/secure it is depends on you. If you have something like http://foo.com/?referral=http://bar.com/, it'll break some scripts because users don't expect another protocol/path combo as a parameter. Also, some other special characters and null byte hacks have been known to do fishy things with parameters, but I don't think there have been any successful Regex hacks - perhaps memory overflows?
If this is something that needs to be secure, it should probably be handled server side. Although you can perform server side Javascript, I would probably recommend Perl, since it was designed/developed as a text-based parser.
The Regex is only as advanced, or as limited, as you make it. Humans are logical, therefore the problems they solve can be solved by a logic engine (computers), provided that accurate instructions (in this case the regular expression pattern) are given to follow.
#Kerry:
In Javascript you don't "have to" put '/'s around the regular expression. There are also conditions where you can put it in quotes: var re = new Regexp("\w+");
Web Examples:
I think Kerry's was pretty nice, though I only gave it a glance w/o checking it, but here are some simpler examples found from the web
http://www.javascriptkit.com/script/script2/acheck.shtml:
// Email Check
var filter=/^([\w-]+(?:\.[\w-]+)*)#((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$/i
if (filter.test("email address or variable here")){...}
http://snippets.dzone.com/posts/show/452:
// URL Validation
function isUrl(s) {
var regexp = /(ftp|http|https):\/\/(\w+:{0,1}\w*#)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%#!\-\/]))?/
return regexp.test(s);
}
Finally, this looks promising.
http://www.weberdev.com/get_example-4569.html:
// URL Validation
function isValidURL(url){
var RegExp = /^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/;
if(RegExp.test(url)){
return true;
}else{
return false;
}
}
// Email Validation
function isValidEmail(email){
var RegExp = /^((([a-z]|[0-9]|!|#|$|%|&|'|\*|\+|\-|\/|=|\?|\^|_|`|\{|\||\}|~)+(\.([a-z]|[0-9]|!|#|$|%|&|'|\*|\+|\-|\/|=|\?|\^|_|`|\{|\||\}|~)+)*)#((((([a-z]|[0-9])([a-z]|[0-9]|\-){0,61}([a-z]|[0-9])\.))*([a-z]|[0-9])([a-z]|[0-9]|\-){0,61}([a-z]|[0-9])\.)[\w]{2,4}|(((([0-9]){1,3}\.){3}([0-9]){1,3}))|(\[((([0-9]){1,3}\.){3}([0-9]){1,3})\])))$/
if(RegExp.test(email)){
return true;
}else{
return false;
}
}
i think that its safe
here is one sample
http://snippets.dzone.com/posts/show/452
I do believe it is safe, and on #1, that isn't a steadfast rule but more a guideline, speaking from personal experience.
This is the one I use to validate a URL:
(([\w]+:)?\/\/)(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
Realize that in javascript you have to put '/'s around the regular expression.