I'm sort of building an AI for a Telegram Bot, and currently I'm trying to process the text and respond to the user almost like a human does.
For example:
"I want to register"
As a human we understand that the user wants to register.
So I'd process this text using JavaScript's indexOf to look for "want" and "register":
var user_text = message.text;

if (user_text.indexOf('want') >= 0) {
    if (user_text.indexOf('register') >= 0) {
        console.log('He wants to register?')
    }
}
But what if the text contains "not" somewhere in the string? Of course I'd have like a zillion conditions for a zillion cases. It'd be tiring to write this kind of logic.
My question is: is there a more elegant way to do this? I don't really know what keyword to Google for...
The concept you're looking for is natural language processing (NLP), and it is a very broad field. Full NLP is intricate and complicated, with all kinds of edge cases.
I would suggest starting with a much simpler solution: split your input into words. You can do that using the String.prototype.split method, with some tweaks. Filter out tokens that don't contribute to the command, like "the", "a", "an". Take the remaining tokens and look for negation ("not", "don't") and keywords. You may need to combine adjacent tokens if you have some two-word commands.
That could look something like:
var user_text = message.text;
var tokens = user_text.split(' '); // split on spaces, a very simple "word boundary"

tokens = tokens.map(function (token) {
    return token.toLowerCase();
});

var remove = ['the', 'a', 'an'];
tokens = tokens.filter(function (token) {
    return remove.indexOf(token) === -1; // keep the token if the remove array does *not* contain it
});

if (tokens.indexOf('register') !== -1) {
    // User wants to register
} else if (tokens.indexOf('enable') !== -1) {
    if (tokens.indexOf('not') !== -1) {
        // User does not want to enable
    } else {
        // User does want to enable
    }
}
This is not a full solution: you will eventually want to run the string through a real tokenizer and potentially even a full parser, and may want to employ a rule engine to simplify the logic.
If you can restrict the inputs you need to understand (a limited number of sentence forms and nouns/verbs), you can probably just use a simple parser with a few rules to handle most commands. Enforcing a predictable sentence structure with articles removed will make your life much easier.
You could also take the example above and replace the filter with a whitelist (only include words that are known). That would leave you with a small set of known tokens, but introduces the potential to strip useful words and misinterpret the command, so you should confirm with the user before running anything.
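For example, the whitelist version of that filter might look like this rough sketch (the list of known words is just an example, not a complete command vocabulary):
var known = ['register', 'enable', 'disable', 'not', 'want'];

tokens = tokens.filter(function (token) {
    return known.indexOf(token) !== -1; // keep only words we recognise
});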
If you really want to parse and understand sentences expressed in natural language, you should look into the topic of natural language processing. This is usually done with some kind of neural network trained to "understand" different variations of sentences (aka machine learning), because specifying all of the different syntactic and semantic rules of the language appears to be an overwhelming task.
If, however, the number of variations of these sentences is limited, then you could specify some rules in the form of commonly used word combinations; even regular expressions would do in the simplest case.
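A rough sketch of that rule-based idea might look like this (the patterns and intent names here are made-up examples, not a complete command set):
var rules = [
    { intent: 'register', pattern: /\b(want|would like|need)\b.*\bregister\b/i },
    { intent: 'enable',   pattern: /\benable\b/i },
    { intent: 'disable',  pattern: /\b(don'?t|do not|not)\b.*\benable\b/i }
];

function detectIntent(text) {
    var matched = null;
    rules.forEach(function (rule) {
        // later rules overwrite earlier ones, so put the more specific rules last
        if (rule.pattern.test(text)) {
            matched = rule.intent;
        }
    });
    return matched;
}

detectIntent('I want to register');                   // 'register'
detectIntent("I don't want to enable notifications"); // 'disable'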
Related
I'm really new to JavaScript, kinda just learned a little earlier today and been messing around with it, but I'm running into a few issues here and there. I'd appreciate help from some people that know their way around the code.
What's the best way to search a string for multiple words? I'm not completely sure how to explain what I mean, so I'll include my current test code and try to explain. I'm making an attached script that pulls text from a text-based game online, converts it to lowercase, and defines variables for a money system that changes the input text. Once changes are made, I re-input the modified text into the game as a return.
let money = 0;

const modifier = (text) => {
    let modifiedText = text;
    const lowered = text.toLowerCase();
    let moneyChange = 0;

    // The text passed in is either the user's input or players output to modify.
    if (lowered.includes('take their money') || lowered.includes('take ' + 'money')) {
        moneyChange = Math.floor(Math.random() * 500);
        if (moneyChange > 1) {
            console.log(moneyChange);
            money += moneyChange;
            modifiedText = `You find ${moneyChange} Credits. You now have ${money} Credits`;
        } else {
            modifiedText = 'You find nothing.';
            console.log(modifiedText);
        }
    }

    console.log(modifiedText);
    // You must return an object with the text property defined.
    return {text: modifiedText};
}

modifier(text);
Currently, as you can see, I have to specifically type "Take their money" or "Take money" as an action before the text pulled is recognized as me taking money from someone or taking some in general. My main issue is that with how the game works, it's somewhat impossible to guess exactly how the input or output is going to come out. The way it works is that the game takes your character's action or speech that you type out, processes it via AI into its own action or dialogue, and generates a procedural story to make more sense with the setting, so that the player only has to type a vague idea of what's going to happen.
Here's an example:
There's a dead man on the street in front of you.
>loot him
You loot the man, digging through his pockets. You take some money from his wallet, but find nothing else.
The > is my only input and the rest is completely AI generated. My script looks through the AI result, so I could look for every possible result, from "take his money" to "take her money" and so forth, but that's a little too much to bother with if there's an easier way. What I'd like is to have it search the result for specific words that may not be in the normal order and/or have other words in between. For example, it must contain the words "take" and "money", so that if the game says "You find some money, along with a gun. You take both", it recognizes that I'm taking the money. On top of that, I still need to write code for every single other time I do anything with money, such as buying things, and if I have to write out every possible phrasing it's going to be a pain.
I know that it would be easier if this code was integrated into the game, but due to AI limitations that kind of breaks how it works and it goes a little crazy... Any sort of help you can give me would be appreciated.
If you're looking for a way to search a string which includes multiple sub-phrases, you can use string.includes() in a loop, as shown below:
function containsWords(string, words) {
    // loop over the words array (not the string's characters)
    for (let i = 0, len = words.length; i < len; i++) {
        if (!string.includes(words[i])) {
            return false;
        }
    }
    return true;
}
However you also mentioned
search the result for specific words that may not be in the normal order and/or with other words in between
which immediately brings to mind regex (regular expressions), a text- and string-matching tool. You can easily find tutorials for regex online, and a live regex tester is handy too.
As a quick introduction and example, I'll build a pattern to match "take *** money", where *** can be any word:
/take .+ money/g
Here the pattern matches the literal text "take ", then .+ matches one or more characters (the middle pronoun, e.g. him/her), and then it matches " money".
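For instance, a quick usage sketch against a made-up output string (the game text here is an assumption):
const output = 'You loot the man and take some money from his wallet.';

if (/take .+ money/.test(output.toLowerCase())) {
    // treat this as the player gaining money
    console.log('Money was taken.');
}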
I have an idea for a game where people can type in some simple instructions for their character like player.goLeft() or player.attackInFront() and for that I have people type their code into a text box and then I parse it into eval(). This works well but it also allows people to change their own character object by typing things like player.health = Infinity; or something similar. I have a list of functions I want to allow people to use, but I am unsure how to restrict it to only use them.
I understand that the whole point of not letting people use eval is to avoid accidental cross-site scripting but I am unsure on how else to do this. If you have a suggestion please leave a comment about that.
I asked some people around about what to do, and most suggested somehow changing scope (which is something I was not able to figure out) or adding some odd parameter to each function in my code that would have to be a specific string for the function to execute, but that seems hacky, and since I am making the game in the browser with p5js it would be easy to just inspect element and see what the password is.
Basically every character has a variable called "instruction", which is just a string of JavaScript. Then every frame of the game I execute it by doing eval(playerList[i].instruction);
tl;dr: how can I allow only specific functions to be executed and not allow any others?
EDIT: I forgot to mention that I am also planning to provide the player with information so that people can make code that adapts to the situation. For example, there will be a parameter called vision that has vision.front, vision.left, etc. These variables would just say if there is an enemy, wall, flower, etc. around them in a grid. Some people suggested that I just replace some functions with keywords, but then that compromises the idea of using if statements and making it act differently.
EDIT 2: Sorry for the lack of code in this post, but because of the way I am making it, half of the logic is written on the server side and half of it works on the client side. It would be a little large, and to be completely honest I am not sure how readable my code is. Still, so far I am getting great help and I am very thankful for it. Thank you to everybody who is answering.
Do NOT use eval() to execute arbitrary user input as code! There's no way to allow your code to run a function but prevent eval() from doing the same.
Instead, what you should do is make a map of commands the player can use, mapping them to functions. That way, you run the function based on the map lookup, but if it's not in the map, it can't be run. You can even allow arguments by splitting the string at spaces and spreading the array over the function parameters. Something like this:
const instructions = {
    goLeft: player.goLeft.bind(player),
    goRight: player.goRight.bind(player),
    attackInFront: player.attackInFront.bind(player)
};

function processInstruction(instruction_string) {
    const pieces = instruction_string.split(' ');
    const command = pieces[0];
    const args = pieces.slice(1);

    if (instructions[command]) {
        instructions[command](...args);
    } else {
        // Notify the user their command is not recognized.
    }
}
With that, the player can enter things like goLeft 5 6 and it will call player.goLeft(5,6), but if they try to enter otherFunction 20 40 it will just say it's unrecognized, since otherFunction isn't in the map.
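For instance (note that the split arguments come through as strings, so convert them with Number() if your methods expect numbers):
processInstruction('goLeft 5 6');          // calls instructions.goLeft('5', '6')
processInstruction('otherFunction 20 40'); // falls into the "not recognized" branch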
This issue sounds similar to the SQL Injection problem. I suggest you use a similar solution. Create an abstraction layer between the users input and your execution, similar to using parameters with stored procedures.
Let the users type keywords such as 'ATTACK FRONT', then pass that input to a function which parses the string, looks for keywords, then passes back 'player.attackInFront()' to be evaluated.
With this approach you simplify the syntax for the users, and limit the possible actions to those you allow.
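A minimal sketch of that layer might look like the following; the keyword table and the player methods are assumptions, and it calls the mapped function directly rather than passing a code string back to eval():
const actions = {
    'ATTACK FRONT': function () { player.attackInFront(); },
    'GO LEFT': function () { player.goLeft(); }
};

function runKeywordCommand(input) {
    const keyword = input.trim().toUpperCase();
    if (actions[keyword]) {
        actions[keyword](); // only whitelisted actions can run
    } else {
        console.log('Unknown command: ' + keyword);
    }
}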
I hope this isn't too vague. Good luck!
From your edit, it sounds like you're looking for an object-oriented approach to players. I'm not sure of your existing implementation needs, but it would look like this.
function Player() {
    this.vision = {
        left: '',
        // and so on
    };
}

Player.prototype.updateVisibilities = function() {
    // to modify the values of this.vision for each player
};

Player.prototype.moveLeft = function() {
};
Don't give the user an arbitrary interface (such as an input textfield that uses eval) to modify their attributes. Make a UI layer to control this logic. Things like buttons, inputs which explicitly run functions/methods that operate on the player. It shouldn't be up to the player as to what attributes they should have.
Registering a new stemmer function in lunr for Greek words doesn't work as expected. Here is my code on CodePen. I am not receiving any errors, and the function stemWord() works fine when used separately, but it fails to stem the words in lunr.
Below is a sample of the code:
function stemWord(w) {
    // code that returns the stemmed word
}

// create the new function
greekStemmer = function (token) {
    return stemWord(token);
};

// register it with lunr.Pipeline, this allows you to still serialise the index
lunr.Pipeline.registerFunction(greekStemmer, 'greekStemmer')

var index = lunr(function () {
    this.field('title', {boost: 10})
    this.field('body')
    this.ref('id')

    this.pipeline.remove(lunr.trimmer) // it doesn't work well with non-latin characters
    this.pipeline.add(greekStemmer)
})

index.add({
    id: 1,
    title: 'ΚΑΠΟΙΟΣ',
    body: 'Foo foo foo!'
})

index.add({
    id: 2,
    title: 'ΚΑΠΟΙΕΣ',
    body: 'Bar bar bar!'
})

index.add({
    id: 3,
    title: 'ΤΙΠΟΤΑ',
    body: 'Bar bar bar!'
})
In lunr a stemmer is implemented as a pipeline function. A pipeline function is executed against each word in a document when indexing the document, and each word in a search query when searching.
For a function to work in a pipeline it has to implement a very simple interface. It needs to accept a single string as input, and it must respond with a string as its output.
So a very simple (and useless) pipeline function would look like the following:
var simplePipelineFunction = function (word) {
    return word
}
To actually make use of this pipeline function we need to do two things:
Register it as a pipeline function; this allows lunr to correctly serialise and deserialise your pipeline.
Add it to your index's pipeline.
That would look something like this:
// registering our pipeline function with the name 'simplePipelineFunction'
lunr.Pipeline.registerFunction(simplePipelineFunction, 'simplePipelineFunction')

var idx = lunr(function () {
    // adding the pipeline function to our index's pipeline
    // when defining the pipeline
    this.pipeline.add(simplePipelineFunction)
})
Now, you can take the above, and swap out the implementation of our pipeline function. So, instead of just returning the word unchanged, it could use the greek stemmer you have found to stem the word, maybe like this:
var myGreekStemmer = function (word) {
    // I don't know how to use the greek stemmer, but I think
    // it's safe to assume it won't be that different than this
    return greekStem(word)
}
Adapting lunr to work with a language other than English requires more than just adding your stemmer though. The default language of lunr is English, and so, by default, it includes pipeline functions that are specialised for English. English and Greek are different enough that you will probably run into issues trying to index Greek words with the English defaults, so we need to do the following:
Replace the default stemmer with our language specific stemmer
Remove the default trimmer which doesn't play so nice with non-latin characters
Replace/remove the default stop word filter; it's unlikely to be of much use on a language other than English.
The trimmer and stop word filter are implemented as pipeline functions, so implementing language specific ones would be similar for the stemmer.
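For example, rough sketches of a Greek trimmer and stop word filter could look like this (the character ranges and the stop word list are assumptions you would need to adapt; a pipeline function that returns nothing drops the token, which is how lunr's own stop word filter works):
// keep only Greek (and basic Latin) letters at the edges of each token
var greekTrimmer = function (token) {
    return token
        .replace(/^[^A-Za-z\u0370-\u03FF\u1F00-\u1FFF]+/, '')
        .replace(/[^A-Za-z\u0370-\u03FF\u1F00-\u1FFF]+$/, '')
}

// example stop words only; lunr lowercases tokens before they reach the pipeline
var greekStopWords = ['και', 'το', 'να']

var greekStopWordFilter = function (token) {
    if (greekStopWords.indexOf(token) === -1) return token
}

lunr.Pipeline.registerFunction(greekTrimmer, 'greekTrimmer')
lunr.Pipeline.registerFunction(greekStopWordFilter, 'greekStopWordFilter')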
So, to set up lunr for Greek you would have this:
var idx = lunr(function () {
    this.pipeline.after(lunr.stemmer, greekStemmer)
    this.pipeline.remove(lunr.stemmer)

    this.pipeline.after(lunr.trimmer, greekTrimmer)
    this.pipeline.remove(lunr.trimmer)

    this.pipeline.after(lunr.stopWordFilter, greekStopWordFilter)
    this.pipeline.remove(lunr.stopWordFilter)

    // define the index as normal
    this.ref('id')
    this.field('title')
    this.field('body')
})
For some more inspiration you can take a look at the excellent lunr-languages project; it has many examples of creating language extensions for lunr. You could even submit one for Greek!
EDIT: Looks like I don't know the lunr.Pipeline API as well as I thought. There is no replace function; instead we insert the replacement after the function we want to remove, and then remove that function.
EDIT: Adding this to help others in the future... It turns out the problem was down to the casing of the tokens within lunr. lunr wants to treat all tokens as lowercase, and this is done, without any configurability, in the tokenizer. For most language processing functions this is not a problem; indeed, most assume words are lower cased. In this case, the Greek stemmer only stems uppercase words, due to the complexity of stemming in Greek (I'm not a Greek speaker, so I can't comment on how much more complex that stemming is). A solution is to convert to uppercase before calling the Greek stemmer, then convert back to lowercase before passing the tokens on to the rest of the pipeline.
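That workaround could look something like this untested sketch, reusing the stemWord function from the question:
greekStemmer = function (token) {
    // lunr has already lowercased the token; the stemmer expects uppercase,
    // so convert up, stem, then lowercase again for the rest of the pipeline
    return stemWord(token.toUpperCase()).toLowerCase()
}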
I'm trying to implement an asymmetrical search for a dictionary web app, so searching for ü, for example, will return only tokens that actually contain ü, but searching for u will return both u and ü. (This is so users who don't know how to type special characters can still search for them, but users who do know how to type them won't be inundated with the plain character forms unnecessarily.)
It has to all be client-side JavaScript without any external libraries.
I've managed to make the second search type work by running both the search term and the text I'm searching through the following function, effectively merging special characters with their plain counterparts:
function cleanUp(dirty) {
    var cleaned = dirty.replace(/[áàâãäāă]/ig, "a");
    cleaned = cleaned.replace(/đ/ig, "d");
    cleaned = cleaned.replace(/[éèêẽëēĕ]/ig, "e");
    cleaned = cleaned.replace(/[íìîĩïīĭ]/ig, "i");
    cleaned = cleaned.replace(/ñ/ig, "n");
    cleaned = cleaned.replace(/[óòôõöōŏ]/ig, "o");
    cleaned = cleaned.replace(/[úùûũüūŭ]/ig, "u");
    return cleaned;
}
I then compare the strings to get my results with something like:
var search_term = cleanUp(search_input.value);
var text_to_search = cleanUp(main_text);

if (text_to_search.indexOf(search_term) > -1) ... // do something
It's not elegant, but it works. After cleaning up both strings, the user can search for e.g. uber and get über even if they don't know how to type ü. But if they do know how, searching for über directly also returns things like uber, which is what I don't want.
I've already thought of things like checking for each special character separately for each search term or duplicating every dictionary entry that has a special character to produce a special-character and a plain-character version, but all of my ideas would seriously slow down the processing time for the search.
Any ideas are greatly appreciated.
The answer you posted sounds quite reasonable.
I would just like to suggest a cleaner way (pun intended) to code your cleanup() function and similar functions that do a series of string operations:
function cleanUp(dirty) {
    return dirty
        .replace(/[áàâãäāă]/ig, "a")
        .replace(/đ/ig, "d")
        .replace(/[éèêẽëēĕ]/ig, "e")
        .replace(/[íìîĩïīĭ]/ig, "i")
        .replace(/ñ/ig, "n")
        .replace(/[óòôõöōŏ]/ig, "o")
        .replace(/[úùûũüūŭ]/ig, "u");
}
I ended up checking whether the search term contained any special characters; if it did, I didn't run it through cleanUp() and compared it to the original dictionary entry instead of the cleaned one. Thanks for the comments, everyone.
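A rough sketch of that check (the character class and the function names are assumptions, and cleanUp() is the function from above):
var specialChars = /[áàâãäāăđéèêẽëēĕíìîĩïīĭñóòôõöōŏúùûũüūŭ]/i;

function entryMatches(searchTerm, entry) {
    if (specialChars.test(searchTerm)) {
        // the user typed a special character: match against the original entry only
        return entry.indexOf(searchTerm) > -1;
    }
    // plain input: clean both sides so "uber" still finds "über"
    return cleanUp(entry).indexOf(cleanUp(searchTerm)) > -1;
}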
In my web app I've got a form field where the user can enter an URL. I'm already doing some preliminary client-side validation and I was wondering if I could use a regexp to validate if the entered string is a valid URL. So, two questions:
Is it safe to do this with a regexp? A URL is a complex beast, and just like you shouldn't use a regexp for parsing HTML, I'm worried that it might be unsuitable for a URL as well.
If it can be done, what would be a good regexp for the task? (I know that Google turns up countless regexps, but I'm worried about their quality).
My goal is to prevent a situation where the URL appears in the web page and is unusable by the browser.
Well... maybe. People often ask a similar question about email addresses, and with those you would need a horrendously complicated regular expression (a couple of pages long, at least) to correctly validate them. I don't think URLs are quite as complicated (the W3C has a document describing their format), but still, any reasonably short regexp you come up with will probably block some valid URLs.
I would suggest thinking about what kinds of URLs you need to be accepting. Maybe for your purposes, blocking the occasional valid-but-weird submission is fine, and in that case you can use a simple regex that matches most URLs, like the one in Dobiatowski's answer. Or you could use a regex that accepts all valid URLs and a few invalid ones, if that works for you. But I'd be wary of trying to find a regular expression that accepts exactly all valid URLs and no invalid ones. If you want to have 100% foolproof verification in that way, I'd suggest using a client-side validation of the second type I mentioned (that accepts a few invalid URLs) and doing a more comprehensive check on the server side, using some library in whatever language you are using to process the form data.
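For example, a deliberately lenient client-side check could be as simple as this sketch; it accepts plenty of technically invalid strings by design and leaves strict validation to the server:
function looksLikeUrl(s) {
    // scheme, then "://", then at least one non-space character
    return /^https?:\/\/\S+$/i.test(s);
}

looksLikeUrl('http://example.com/page?x=1'); // true
looksLikeUrl('not a url');                   // false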
As stated by Crescent Fresh, in the comments, there are similar questions to this one which I didn't find. One of them also supplies the full standards-compliant regex for validating an URL:
(?:http://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.
)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)
){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F
\d]{2}))|[;:#&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:#&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:#&=])*))?)?)|(?:ftp://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?
:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-
fA-F\d]{2}))|[;?&=])*))?#)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-
)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?
:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!
*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[?:#&=])*))*)(?:;type=[AIDaid])?)?)|(?:news:(?:
(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;/?:&=])+#(?:(?:(
?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[
a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3})))|(?:[a-zA-Z](
?:[a-zA-Z\d]|[_.+-])*)|\*))|(?:nntp://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[
a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d
])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:[a-zA-Z](?:[a-zA-Z
\d]|[_.+-])*)(?:/(?:\d+))?)|(?:telnet://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+
!*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[;?&=])*))?#)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a
-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d]
)?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))/?)|(?:gopher://(?:(?:
(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:
(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+
))?)(?:/(?:[a-zA-Z\d$\-_.+!*'(),;/?:#&=]|(?:%[a-fA-F\d]{2}))(?:(?:(?:[
a-zA-Z\d$\-_.+!*'(),;/?:#&=]|(?:%[a-fA-F\d]{2}))*)(?:%09(?:(?:(?:[a-zA
-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:#&=])*)(?:%09(?:(?:[a-zA-Z\d$
\-_.+!*'(),;/?:#&=]|(?:%[a-fA-F\d]{2}))*))?)?)?)?)|(?:wais://(?:(?:(?:
(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:
[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?
)/(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)(?:(?:/(?:(?:[a-zA
-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)/(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(
?:%[a-fA-F\d]{2}))*))|\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]
{2}))|[;:#&=])*))?)|(?:mailto:(?:(?:[a-zA-Z\d$\-_.+!*'(),;/?:#&=]|(?:%
[a-fA-F\d]{2}))+))|(?:file://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]
|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:
(?:\d+)(?:\.(?:\d+)){3}))|localhost)?/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[?:#&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(
?:%[a-fA-F\d]{2}))|[?:#&=])*))*))|(?:prospero://(?:(?:(?:(?:(?:[a-zA-Z
\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)
*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:(?:(?:(?
:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&=])*)(?:/(?:(?:(?:[a-
zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&=])*))*)(?:(?:;(?:(?:(?:[
a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&])*)=(?:(?:(?:[a-zA-Z\d
$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&])*)))*)|(?:ldap://(?:(?:(?:(?:
(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:
[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?
))?/(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d])
)|(?:%20))+|(?:OID|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%2
0)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F
\d]{2}))*))(?:(?:(?:%0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?
:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID
|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])
?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*)(?:(
?:(?:(?:%0[Aa])?(?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))(?:(?:(?:(?:(
?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID|o
id)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(
?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*))(?:(?:(?:
%0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?:(?:(?:[a-zA-Z\d]|%(
?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID|oid)\.(?:(?:\d+)(?:
\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a
-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*))*(?:(?:(?:%0[Aa])?(?:%2
0)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))?)(?:\?(?:(?:(?:(?:[a-zA-Z\d$\-_.+
!*'(),]|(?:%[a-fA-F\d]{2}))+)(?:,(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-f
A-F\d]{2}))+))*)?)(?:\?(?:base|one|sub)(?:\?(?:((?:[a-zA-Z\d$\-_.+!*'(
),;/?:#&=]|(?:%[a-fA-F\d]{2}))+)))?)?)?)|(?:(?:z39\.50[rs])://(?:(?:(?
:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?
:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))
?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+)(?:\+(?:(?:
[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))*(?:\?(?:(?:[a-zA-Z\d$\-_
.+!*'(),]|(?:%[a-fA-F\d]{2}))+))?)?(?:;esn=(?:(?:[a-zA-Z\d$\-_.+!*'(),
]|(?:%[a-fA-F\d]{2}))+))?(?:;rs=(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA
-F\d]{2}))+)(?:\+(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))*)
?))|(?:cid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:#&=
])*))|(?:mid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:#
&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:#&=]
)*))?)|(?:vemmi://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z
\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\
.(?:\d+)){3}))(?::(?:\d+))?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a
-fA-F\d]{2}))|[/?:#&=])*)(?:(?:;(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a
-fA-F\d]{2}))|[/?:#&])*)=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d
]{2}))|[/?:#&])*))*))?)|(?:imap://(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+
!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+)(?:(?:;[Aa][Uu][Tt][Hh]=(?:\*|(?:(
?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+))))?)|(?:(?:;[
Aa][Uu][Tt][Hh]=(?:\*|(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2
}))|[&=~])+)))(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[
&=~])+))?))#)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])
?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:
\d+)){3}))(?::(?:\d+))?))/(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:
%[a-fA-F\d]{2}))|[&=~:#/])+)?;[Tt][Yy][Pp][Ee]=(?:[Ll](?:[Ii][Ss][Tt]|
[Ss][Uu][Bb])))|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))
|[&=~:#/])+)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[
&=~:#/])+))?(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-
9]\d*)))?)|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~
:#/])+)(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-9]\d*
)))?(?:/;[Uu][Ii][Dd]=(?:[1-9]\d*))(?:(?:/;[Ss][Ee][Cc][Tt][Ii][Oo][Nn
]=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~:#/])+)))?))
)?)|(?:nfs:(?:(?://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-
Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:
\.(?:\d+)){3}))(?::(?:\d+))?)(?:(?:/(?:(?:(?:(?:(?:[a-zA-Z\d\$\-_.!~*'
(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*)(?:/(?:(?:(?:[a-zA-Z\d\$\-_.!~*'(),
])|(?:%[a-fA-F\d]{2})|[:#&=+])*))*)?)))?)|(?:/(?:(?:(?:(?:(?:[a-zA-Z\d
\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*)(?:/(?:(?:(?:[a-zA-Z\d\$\
-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*))*)?))|(?:(?:(?:(?:(?:[a-zA-
Z\d\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*)(?:/(?:(?:(?:[a-zA-Z\d
\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*))*)?)))
Obviously this is as insane as I feared it would be, so I'll be rethinking the whole thing.
So, the answer is: yes, it can be done, but you should REALLY think twice about whether you want to do it this way, or accept that the regex will be imperfect.
Regexes are fine for checking lexical validity, but that doesn't mean the site is actually going to be there. To know whether it's a live URL you would actually have to test the connection and check the response returned. In the end, it depends on your user requirements to say what is valid and what is not; how safe/secure it is depends on you. If you have something like http://foo.com/?referral=http://bar.com/, it'll break some scripts, because users don't expect another protocol/path combo as a parameter. Also, some other special characters and null byte hacks have been known to do fishy things with parameters, but I don't think there have been any successful regex hacks; perhaps memory overflows?
If this is something that needs to be secure, it should probably be handled server side. Although you can run server-side JavaScript, I would probably recommend Perl, since it was designed/developed as a text-based parser.
The Regex is only as advanced, or as limited, as you make it. Humans are logical, therefore the problems they solve can be solved by a logic engine (computers), provided that accurate instructions (in this case the regular expression pattern) are given to follow.
@Kerry:
In JavaScript you don't "have to" put '/'s around the regular expression. There are also cases where you can put it in quotes: var re = new RegExp("\\w+");
Web Examples:
I think Kerry's was pretty nice, though I only gave it a glance without checking it, but here are some simpler examples found on the web.
http://www.javascriptkit.com/script/script2/acheck.shtml:
// Email Check
var filter=/^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$/i
if (filter.test("email address or variable here")){...}
http://snippets.dzone.com/posts/show/452:
// URL Validation
function isUrl(s) {
    var regexp = /(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/
    return regexp.test(s);
}
Finally, this looks promising.
http://www.weberdev.com/get_example-4569.html:
// URL Validation
function isValidURL(url){
    var RegExp = /^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/;
    if(RegExp.test(url)){
        return true;
    }else{
        return false;
    }
}

// Email Validation
function isValidEmail(email){
    var RegExp = /^((([a-z]|[0-9]|!|#|$|%|&|'|\*|\+|\-|\/|=|\?|\^|_|`|\{|\||\}|~)+(\.([a-z]|[0-9]|!|#|$|%|&|'|\*|\+|\-|\/|=|\?|\^|_|`|\{|\||\}|~)+)*)@((((([a-z]|[0-9])([a-z]|[0-9]|\-){0,61}([a-z]|[0-9])\.))*([a-z]|[0-9])([a-z]|[0-9]|\-){0,61}([a-z]|[0-9])\.)[\w]{2,4}|(((([0-9]){1,3}\.){3}([0-9]){1,3}))|(\[((([0-9]){1,3}\.){3}([0-9]){1,3})\])))$/
    if(RegExp.test(email)){
        return true;
    }else{
        return false;
    }
}
I think that it's safe.
Here is one sample:
http://snippets.dzone.com/posts/show/452
I do believe it is safe, and on #1, that isn't a steadfast rule but more a guideline, speaking from personal experience.
This is the one I use to validate a URL:
(([\w]+:)?\/\/)(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
Realize that in JavaScript you have to put '/'s around the regular expression.