I'm really new to JavaScript, kinda just learned a little earlier today and been messing around with it, but I'm running into a few issues here and there. I'd appreciate help from some people that know their way around the code.
What's the best way to search a string for multiple words? I'm not completely sure how to explain what I mean, so I'll include my current test code and try to explain. I'm making an attached script to pull text from a text based game online, converting it to lowercase, and defining variables for the use of a money system that changes the input text. Once changes are made, I'm re-inputting the modified text into the game as a return.
let money = 0;
const modifier = (text) => {
let modifiedText = text;
const lowered = text.toLowerCase();
let moneyChange = 0;
// The text passed in is either the user's input or players output to modify.
if(lowered.includes('take their money') || lowered.includes('take ' + 'money')) {
moneyChange = (Math.floor(Math.random() * 500));
if ((moneyChange) > 1) {
console.log(moneyChange);
money += moneyChange;
modifiedText = `You find ${moneyChange} Credits. You now have ${money} Credits`;
} else {
modifiedText = 'You find nothing.';
console.log(modifiedText);
}
}
console.log(modifiedText);
// You must return an object with the text property defined.
return {text: modifiedText};
}
modifier(text);
Currently, as you can see, I have to specifically type "Take their money" or "Take money" as an action before the text pulled is recognized as me taking money from someone or taking some in general. My main issue is that with how the game works, it's somewhat impossible to guess exactly how the input or output is going to come out. The way it works is that the game takes your character's action or speech that you type out, processes it via AI into its own action or dialogue and generates procedural story to make more sense with the setting so that the player only has to type a vague idea of what's going to happen.
Here's an example:
There's a dead man on the street in front of you.
>loot him
You loot the man, digging through his pockets. You take some money from his wallet, but find nothing else.
The > is my only input and the rest is completely AI generated. My script looks through the AI result, so I could look for every possible result, from "take his money" to "take her money" and so forth, but that's a little too much to bother with if there's an easier way. Ideally, I could have it search the result for specific words that may not be in the normal order and/or with other words in between. Like, it must contain the words "take" and "money" so that if the game says "You find some money, along with a gun. You take both", it recognizes that I'm taking the money. On top of that, I still need to write code for every single other time I do anything with money, such as buying things, and if I have to write every possible thing it's going to be a pain.
I know that it would be easier if this code was integrated into the game, but due to AI limitations, that kinda breaks how it works and it goes a little crazy... Any sort of help you can give me will be a help.
If you're looking for a way to search a string which includes multiple sub-phrases, you can use string.includes() in a loop as shown below:
function containsWords(string, words) {
// Loop over the list of words (not over the characters of the string)
for (let i=0, len=words.length; i<len; i++) {
if (!string.includes(words[i])) {
return false;
}
}
return true;
}
However you also mentioned
search the result for specific words that may not be in the normal order and/or with other words in between
Which immediately brings to mind regex, a text and string matching technology. You can easily find tutorials for regex online, and this live tester is nice too.
I'll quickly build a search string to match "take *** money", where any word can be *** as a quick introduction and example to regex:
/take .+ money/g
Here it matches the specific string take , then .+ matches one or more characters (the middle pronoun eg him/her), then matches money.
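For the any-order case from the question (the text must contain both "take" and "money", in either order, with anything in between), one option is to test each word separately with a word-boundary regex. This is only a minimal sketch: containsAllWords and escapeRegExp are names made up for this example, not built-ins.

```javascript
// Escape characters that have a special meaning inside a regex (hypothetical helper).
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// True when every word occurs somewhere in text as a whole word, in any order.
function containsAllWords(text, words) {
  return words.every(function (word) {
    return new RegExp('\\b' + escapeRegExp(word) + '\\b', 'i').test(text);
  });
}

console.log(containsAllWords('You find some money, along with a gun. You take both', ['take', 'money'])); // true
console.log(containsAllWords('You loot the man, digging through his pockets.', ['take', 'money'])); // false
```

Note that whole-word matching means "takes" or "took" will not count as "take"; you would have to list those forms yourself.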
I'm sort of building an AI for a Telegram Bot, and currently I'm trying to process the text and respond to the user almost like a human does.
For example;
"I want to register"
As a human we understand that the user wants to register.
So I'd process this text using javascript's indexOf to look for want and register
var user_text = message.text;
if (user_text.indexOf('want') >= 0) {
if (user_text.indexOf('register') >= 0) {
console.log('He wants to register?')
}
}
But what if the text contains not somewhere in the string? Of course I'd have like a zillion of conditions for a zillion of cases. It'd be tiring to write this kind of logic.
My question is — Is there any other elegant way to do this? I don't really know the keyword to Google this...
The concept you're looking for is natural language processing and is a very broad field. Full NLP is very intricate and complicated, with all kinds of issues.
I would suggest starting with a much simpler solution, by splitting your input into words. You can do that using the String.prototype.split method with some tweaks. Filter out tokens you don't care about and don't contribute to the command, like "the", "a", "an". Take the remaining tokens, look for negation ("not", "don't") and keywords. You may need to combine adjacent tokens, if you have some two-word commands.
That could look something like:
var user_text = message.text;
var tokens = user_text.split(' '); // split on spaces, very simple "word boundary"
tokens = tokens.map(function (token) {
return token.toLowerCase();
});
var remove = ['the', 'a', 'an'];
tokens = tokens.filter(function (token) {
return remove.indexOf(token) === -1; // if remove array does *not* contain token
});
if (tokens.indexOf('register') !== -1) {
// User wants to register
} else if (tokens.indexOf('enable') !== -1) {
if (tokens.indexOf('not') !== -1) {
// User does not want to enable
} else {
// User does want to enable
}
}
This is not a full solution: you will eventually want to run the string through a real tokenizer and potentially even a full parser, and may want to employ a rule engine to simplify the logic.
If you can restrict the inputs you need to understand (a limited number of sentence forms and nouns/verbs), you can probably just use a simple parser with a few rules to handle most commands. Enforcing a predictable sentence structure with articles removed will make your life much easier.
You could also take the example above and replace the filter with a whitelist (only include words that are known). That would leave you with a small set of known tokens, but introduces the potential to strip useful words and misinterpret the command, so you should confirm with the user before running anything.
If you really want to parse and understand sentences expressed in natural language, you should look into the topic of natural language processing. This is usually done with some kind of neural network trained to "understand" different variations of sentences (aka machine learning), because specifying all of different syntactic and semantic rules of the language appears to be an overwhelming task.
If however the amount of variations of these sentences is limited, then you could specify some rules in the form of commonly used word combinations, probably even regular expressions would do in the simplest case.
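The "rules in the form of commonly used word combinations" idea could be sketched roughly like this. The rule list, the negation list and the function names are all assumptions made for illustration; a real bot would need a much richer set.

```javascript
// Each rule lists the words that must all be present (hypothetical rule set).
var rules = [
  { intent: 'register', words: ['want', 'register'] },
  { intent: 'enable', words: ['enable'] }
];
var negations = ['not', "don't", 'never'];

function tokenize(text) {
  // Lowercase, then keep runs of letters and apostrophes so punctuation is dropped.
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

function detectIntent(text) {
  var tokens = tokenize(text);
  for (var i = 0; i < rules.length; i++) {
    var matched = rules[i].words.every(function (w) {
      return tokens.indexOf(w) !== -1;
    });
    if (matched) {
      // Very crude negation check: any negation word anywhere in the sentence.
      var negated = tokens.some(function (t) {
        return negations.indexOf(t) !== -1;
      });
      return { intent: rules[i].intent, negated: negated };
    }
  }
  return null; // no rule matched
}

console.log(detectIntent('I want to register'));
console.log(detectIntent("I don't want to register!"));
```

The first matching rule wins, so more specific rules should come earlier in the list.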
So I've made this search that does what it's supposed to do front-end wise. However, when submitting I'd like the query to ignore commas.
Right now I'm using commas to make a comma-separated search. The problem is that when I submit, the commas are included and mess up my search values.
Is there any way to ignore commas upon submit?
Example: Searching [Example][Test] will actually return Example,Test.
I've made a fiddle here
Any suggestions and help is greatly appreciated.
var firster = true;
//capture form submit
$('form.nice').submit(function(e){
if(firster){
// if its the first submit prevent default
e.preventDefault();
// update input value to have no commas
var val = $('input').val();
val = val.replace(/,/g, ' ');
$('input').val(val);
// let submit go through and submit
firster = false;
$(this).submit();
}
});
DEMO
Looking at your profile, I'm guessing you're using python as a server-side language. The issue you're trying to solve is best dealt with server-side: never rely on front-end code to escape or format data that is being used in a query... check Bobby Tables for more info
Anyhow, in python, you could try this:
ajaxString.replace(",","\", \"")
This will replace every comma with ", " so a string like some,keywords is translated into some", "keywords. Then just add some_field IN (" at the start and the closing ") at the end to form a valid query.
Alternatively, you can split the keywords and deal with them separately (which could come in handy when sorting the results depending on how relevant the results might be).
searchTerms = ajaxString.split(",")
>>>['some','keywords']
That should help you on your way, I hope.
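In keeping with the Bobby Tables warning, the split terms are probably better passed as query parameters than pasted into the SQL text. A rough sketch in JavaScript (to match the rest of the code in this thread): buildInClause is a made-up helper, the ? placeholder style is an assumption that depends on your database driver, and the field name is assumed to come from your own code, never from user input.

```javascript
// Turn "some, keywords" into a placeholder list plus a separate parameter array.
function buildInClause(field, rawInput) {
  var terms = rawInput.split(',')
    .map(function (t) { return t.trim(); })          // drop surrounding spaces
    .filter(function (t) { return t.length > 0; });  // drop empty entries
  if (terms.length === 0) {
    return null; // nothing to search for
  }
  var placeholders = terms.map(function () { return '?'; }).join(', ');
  return { sql: field + ' IN (' + placeholders + ')', params: terms };
}

console.log(buildInClause('some_field', 'some, keywords'));
// { sql: 'some_field IN (?, ?)', params: [ 'some', 'keywords' ] }
```

The sql string and the params array would then be handed to the driver together, so the keywords never become part of the query text.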
Lastly, I'd suggest just not bothering with developing your own search function at all. Just add a google search to your site, they're the experts. There is just no way you, by yourself, can do better. Or even if you could, just imagine how long it'd take you!
Yes, sometimes a company will create their own search engine, but only if they have a good reason to do so, and have the resources such an endeavour requires. Programming is often all about being "cleverly lazy": Don't reinvent the wheel.
var str = '<div part="1">
<div>
...
<p class="so">text</p>
...
</div>
</div><span></span>';
I've got a long string stored in var str, and I need to extract the strings inside div part="1". Can you help me please?
You could create a DOM element and set its innerHTML to your string.
Then you can iterate through its child elements and read the attributes you want ;)
example
var str = "<your><html>";
var node = document.createElement("div");
node.innerHTML = str;
// children holds element nodes only; text nodes in childNodes have no getAttribute
for(var i = 0; i < node.children.length; i++){
console.log(node.children[i].getAttribute("part"));
}
If you're using a library like jQuery, this is trivially easy without having to go through the horrors of parsing HTML with regex.
Simply load the string into a jQuery object; then you'll be able to query it using selectors. It's as simple as this:
var so = $(str).find('.so');
to get the class='so' element.
If you want to get all the text in part='1', note that this div is a top-level element of the parsed string, so select it with filter (find only searches descendants):
var part1 = $(str).filter('[part="1"]').text();
Similar results can be achieved with Prototype library, or others. Without any library, you can still do the same thing using the DOM, but it'll be much harder work.
Just to clarify why it's a bad idea to do this sort of thing in regex:
Yes, it can be done. It is possible to scan a block of HTML code with regex and find things within the string.
However, the issue is that HTML is too variable -- it is defined as a non-regular language (bear in mind that the 'reg' in 'regex' is for 'regular').
If you know that your HTML structure is always going to look the same, it's relatively easy. However if it's ever going to be possible that the incoming HTML might contain elements or attributes other than the exact ones you're expecting, suddenly writing the regex becomes extremely difficult, because regex is designed for searching in predictable strings. When you factor in the possibility of being given invalid HTML code to parse, the difficulty factor increases even more.
With a lot of effort and good understanding of the more esoteric parts of regex, it can be done, with a reasonable degree of reliability. But it's never going to be perfect -- there's always going to be the possibility of your regex not working if it's fed with something it doesn't expect.
By contrast, parsing it with the DOM is much much simpler -- as demonstrated, with the right libraries, it can be a single line of code (and very easy to read, unlike the horrific regex you'd need to write). It'll also be much more efficient to run, and gives you the ability to do other search operations on the same chunk of HTML, without having to re-parse it all again.
In my web app I've got a form field where the user can enter a URL. I'm already doing some preliminary client-side validation and I was wondering if I could use a regexp to validate whether the entered string is a valid URL. So, two questions:
Is it safe to do this with a regexp? A URL is a complex beast, and just like you shouldn't use a regexp for parsing HTML, I'm worried that it might be unsuitable for a URL as well.
If it can be done, what would be a good regexp for the task? (I know that Google turns up countless regexps, but I'm worried about their quality).
My goal is to prevent a situation where the URL appears in the web page and is unusable by the browser.
Well... maybe. People often ask a similar question about email addresses, and with those you would need a horrendously complicated regular expression (i.e. a couple pages long, at least) to correctly validate them. I don't think URLs are quite as complicated (the W3C has a document describing their format) but still, any reasonably short regexp you come up with will probably block some valid URLs.
I would suggest thinking about what kinds of URLs you need to be accepting. Maybe for your purposes, blocking the occasional valid-but-weird submission is fine, and in that case you can use a simple regex that matches most URLs, like the one in Dobiatowski's answer. Or you could use a regex that accepts all valid URLs and a few invalid ones, if that works for you. But I'd be wary of trying to find a regular expression that accepts exactly all valid URLs and no invalid ones. If you want to have 100% foolproof verification in that way, I'd suggest using a client-side validation of the second type I mentioned (that accepts a few invalid URLs) and doing a more comprehensive check on the server side, using some library in whatever language you are using to process the form data.
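As an example of the "accepts all valid URLs and a few invalid ones" style of check, here is a sketch that swaps the regex for the built-in URL constructor available in current browsers and Node.js. This is a different technique from the regexes discussed in this thread; isHttpUrl is a hypothetical name, and restricting to http/https is an assumption about what the form should accept.

```javascript
// Let the platform's URL parser decide, then restrict the scheme.
function isHttpUrl(input) {
  var parsed;
  try {
    parsed = new URL(input); // throws a TypeError when the string cannot be parsed
  } catch (e) {
    return false;
  }
  return parsed.protocol === 'http:' || parsed.protocol === 'https:';
}

console.log(isHttpUrl('http://foo.com/?referral=http://bar.com/')); // true
console.log(isHttpUrl('not a url')); // false
console.log(isHttpUrl('javascript:alert(1)')); // false
```

This only says the string parses as an http(s) URL; it says nothing about whether the site exists, and the server should still run its own check.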
As stated by Crescent Fresh in the comments, there are similar questions to this one which I didn't find. One of them also supplies the full standards-compliant regex for validating a URL:
(?:http://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.
)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)
){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F
\d]{2}))|[;:#&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:#&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:#&=])*))?)?)|(?:ftp://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?
:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-
fA-F\d]{2}))|[;?&=])*))?#)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-
)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?
:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!
*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[?:#&=])*))*)(?:;type=[AIDaid])?)?)|(?:news:(?:
(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;/?:&=])+#(?:(?:(
?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[
a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3})))|(?:[a-zA-Z](
?:[a-zA-Z\d]|[_.+-])*)|\*))|(?:nntp://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[
a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d
])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:[a-zA-Z](?:[a-zA-Z
\d]|[_.+-])*)(?:/(?:\d+))?)|(?:telnet://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+
!*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[;?&=])*))?#)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a
-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d]
)?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))/?)|(?:gopher://(?:(?:
(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:
(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+
))?)(?:/(?:[a-zA-Z\d$\-_.+!*'(),;/?:#&=]|(?:%[a-fA-F\d]{2}))(?:(?:(?:[
a-zA-Z\d$\-_.+!*'(),;/?:#&=]|(?:%[a-fA-F\d]{2}))*)(?:%09(?:(?:(?:[a-zA
-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:#&=])*)(?:%09(?:(?:[a-zA-Z\d$
\-_.+!*'(),;/?:#&=]|(?:%[a-fA-F\d]{2}))*))?)?)?)?)|(?:wais://(?:(?:(?:
(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:
[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?
)/(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)(?:(?:/(?:(?:[a-zA
-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)/(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(
?:%[a-fA-F\d]{2}))*))|\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]
{2}))|[;:#&=])*))?)|(?:mailto:(?:(?:[a-zA-Z\d$\-_.+!*'(),;/?:#&=]|(?:%
[a-fA-F\d]{2}))+))|(?:file://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]
|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:
(?:\d+)(?:\.(?:\d+)){3}))|localhost)?/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[?:#&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(
?:%[a-fA-F\d]{2}))|[?:#&=])*))*))|(?:prospero://(?:(?:(?:(?:(?:[a-zA-Z
\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)
*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)/(?:(?:(?:(?
:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&=])*)(?:/(?:(?:(?:[a-
zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&=])*))*)(?:(?:;(?:(?:(?:[
a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&])*)=(?:(?:(?:[a-zA-Z\d
$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:#&])*)))*)|(?:ldap://(?:(?:(?:(?:
(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:
[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?
))?/(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d])
)|(?:%20))+|(?:OID|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%2
0)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F
\d]{2}))*))(?:(?:(?:%0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?
:(?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID
|oid)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])
?(?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*)(?:(
?:(?:(?:%0[Aa])?(?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))(?:(?:(?:(?:(
?:(?:[a-zA-Z\d]|%(?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID|o
id)\.(?:(?:\d+)(?:\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(
?:%20)*))?(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*))(?:(?:(?:
%0[Aa])?(?:%20)*)\+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?:(?:(?:[a-zA-Z\d]|%(
?:3\d|[46][a-fA-F\d]|[57][Aa\d]))|(?:%20))+|(?:OID|oid)\.(?:(?:\d+)(?:
\.(?:\d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a
-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))*)))*))*(?:(?:(?:%0[Aa])?(?:%2
0)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))?)(?:\?(?:(?:(?:(?:[a-zA-Z\d$\-_.+
!*'(),]|(?:%[a-fA-F\d]{2}))+)(?:,(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-f
A-F\d]{2}))+))*)?)(?:\?(?:base|one|sub)(?:\?(?:((?:[a-zA-Z\d$\-_.+!*'(
),;/?:#&=]|(?:%[a-fA-F\d]{2}))+)))?)?)?)|(?:(?:z39\.50[rs])://(?:(?:(?
:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?
:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))
?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+)(?:\+(?:(?:
[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))*(?:\?(?:(?:[a-zA-Z\d$\-_
.+!*'(),]|(?:%[a-fA-F\d]{2}))+))?)?(?:;esn=(?:(?:[a-zA-Z\d$\-_.+!*'(),
]|(?:%[a-fA-F\d]{2}))+))?(?:;rs=(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA
-F\d]{2}))+)(?:\+(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))+))*)
?))|(?:cid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:#&=
])*))|(?:mid:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:#
&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?:#&=]
)*))?)|(?:vemmi://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z
\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\
.(?:\d+)){3}))(?::(?:\d+))?)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a
-fA-F\d]{2}))|[/?:#&=])*)(?:(?:;(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a
-fA-F\d]{2}))|[/?:#&])*)=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d
]{2}))|[/?:#&])*))*))?)|(?:imap://(?:(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+
!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+)(?:(?:;[Aa][Uu][Tt][Hh]=(?:\*|(?:(
?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~])+))))?)|(?:(?:;[
Aa][Uu][Tt][Hh]=(?:\*|(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2
}))|[&=~])+)))(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[
&=~])+))?))#)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])
?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:
\d+)){3}))(?::(?:\d+))?))/(?:(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:
%[a-fA-F\d]{2}))|[&=~:#/])+)?;[Tt][Yy][Pp][Ee]=(?:[Ll](?:[Ii][Ss][Tt]|
[Ss][Uu][Bb])))|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))
|[&=~:#/])+)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[
&=~:#/])+))?(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-
9]\d*)))?)|(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~
:#/])+)(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-9]\d*
)))?(?:/;[Uu][Ii][Dd]=(?:[1-9]\d*))(?:(?:/;[Ss][Ee][Cc][Tt][Ii][Oo][Nn
]=(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[&=~:#/])+)))?))
)?)|(?:nfs:(?:(?://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-
Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:
\.(?:\d+)){3}))(?::(?:\d+))?)(?:(?:/(?:(?:(?:(?:(?:[a-zA-Z\d\$\-_.!~*'
(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*)(?:/(?:(?:(?:[a-zA-Z\d\$\-_.!~*'(),
])|(?:%[a-fA-F\d]{2})|[:#&=+])*))*)?)))?)|(?:/(?:(?:(?:(?:(?:[a-zA-Z\d
\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*)(?:/(?:(?:(?:[a-zA-Z\d\$\
-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*))*)?))|(?:(?:(?:(?:(?:[a-zA-
Z\d\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*)(?:/(?:(?:(?:[a-zA-Z\d
\$\-_.!~*'(),])|(?:%[a-fA-F\d]{2})|[:#&=+])*))*)?)))
Obviously this is as insane as I feared it would be, so I'll be rethinking the whole thing.
So, the answer is - yes, it can be done, but you should REALLY think twice whether you want to do it this way. Or accept that the regex will be imperfect.
Regex's are safe for lexical validity, but it doesn't mean the site is going to be there. You will actually have to test the connection to see if it's a valid URL by checking the response returned. In all, it depends on your user requirements to say what is valid and what is not - how safe/secure it is depends on you. If you have something like http://foo.com/?referral=http://bar.com/, it'll break some scripts because users don't expect another protocol/path combo as a parameter. Also, some other special characters and null byte hacks have been known to do fishy things with parameters, but I don't think there have been any successful Regex hacks - perhaps memory overflows?
If this is something that needs to be secure, it should probably be handled server side. Although you can perform server side Javascript, I would probably recommend Perl, since it was designed/developed as a text-based parser.
The Regex is only as advanced, or as limited, as you make it. Humans are logical, therefore the problems they solve can be solved by a logic engine (computers), provided that accurate instructions (in this case the regular expression pattern) are given to follow.
@Kerry:
In JavaScript you don't "have to" put '/'s around the regular expression. You can also pass it as a quoted string to the constructor, with each backslash doubled: var re = new RegExp("\\w+");
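A quick illustration of the two forms side by side; the variable names are arbitrary. It also shows what happens when the backslash is not doubled in the string form.

```javascript
var literal = /\w+/;                   // regex literal: delimited by slashes
var constructed = new RegExp('\\w+');  // string form: no slashes, backslash doubled

console.log(literal.source === constructed.source); // true
console.log(constructed.test('hello'));             // true

// With a single backslash the string is just "w+", so the \w class is lost.
console.log(new RegExp('\w+').source); // w+
```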
Web Examples:
I think Kerry's was pretty nice, though I only gave it a glance w/o checking it, but here are some simpler examples found from the web
http://www.javascriptkit.com/script/script2/acheck.shtml:
// Email Check
var filter=/^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$/i
if (filter.test("email address or variable here")){...}
http://snippets.dzone.com/posts/show/452:
// URL Validation
function isUrl(s) {
var regexp = /(ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/
return regexp.test(s);
}
Finally, this looks promising.
http://www.weberdev.com/get_example-4569.html:
// URL Validation
function isValidURL(url){
// Named "pattern" so the built-in RegExp constructor is not shadowed
var pattern = /^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?$/;
return pattern.test(url);
}
// Email Validation
function isValidEmail(email){
// Named "pattern" so the built-in RegExp constructor is not shadowed
var pattern = /^((([a-z]|[0-9]|!|#|\$|%|&|'|\*|\+|\-|\/|=|\?|\^|_|`|\{|\||\}|~)+(\.([a-z]|[0-9]|!|#|\$|%|&|'|\*|\+|\-|\/|=|\?|\^|_|`|\{|\||\}|~)+)*)@((((([a-z]|[0-9])([a-z]|[0-9]|\-){0,61}([a-z]|[0-9])\.))*([a-z]|[0-9])([a-z]|[0-9]|\-){0,61}([a-z]|[0-9])\.)[\w]{2,4}|(((([0-9]){1,3}\.){3}([0-9]){1,3}))|(\[((([0-9]){1,3}\.){3}([0-9]){1,3})\])))$/
return pattern.test(email);
}
I think it's safe.
Here is one sample:
http://snippets.dzone.com/posts/show/452
I do believe it is safe, and on #1, that isn't a steadfast rule but more a guideline, speaking from personal experience.
This is the one I use to validate a URL:
(([\w]+:)?\/\/)(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?
Realize that in javascript you have to put '/'s around the regular expression.