I have a string and I need to make sure that it contains only a regular expression and no javascript because I'm creating a new script with the string so a javascript snippet would be a security risk.
Exact scenario:
JS in mozilla addon loads configuration as json through HTTPrequest (json contains {"something": "^(?:http|https)://(?:.*)"}
JS creates a pac file(proxy configuration script) that uses the "something" regex from the configuration
Any ideas how to escape the string without destroying the regex in it?
It seems that most of the standard JavaScript functionality is available (source), so you can just do:
try {
RegExp(json.something+'');
pacFile += 'RegExp(' + JSON.stringify(json.something+'') + ')';
} catch(e) {/*handle invalid regexp*/}
And not worry, because a RegExp("console.log('test')") will only produce a valid /console.log('test')/ regexp and execute nothing.
You can use a regular expression to pull apart a JavaScript regular expression.
Then you should convert the regex to a lexically simpler subset of JavaScript that avoids all the non-context-free weirdness about what / means, and any irregularities in the input regex.
var REGEXP_PARTS = "(?:"
// A regular character
+ "[^/\r\n\u2028\u2029\\[\\\\]"
// An escaped character, charset reference or backreference
+ "|\\\\[^\r\n\u2028\u2029]"
// A character set
+ "|\\[(?!\\])(?:[^\\]\\\\]|\\\\[^\r\n\u2028\u2029])+\\]"
+ ")";
var REGEXP_REGEXP = new RegExp(
// A regex starts with a slash
"^[/]"
// It cannot be lexically ambiguous with a line or block comemnt
+ "(?![*/])"
// Capture the body in group 1
+ "(" + REGEXP_PARTS + "+)"
// The body is terminated by a slash
+ "[/]"
// Capture the flags in group 2
+ "([gmi]{0,3})$");
var match = myString.match(REGEXP_REGEXP);
if (match) {
var ctorExpression =
"(new RegExp("
// JSON.stringify escapes special chars in the body, so will
// preserve token boundaries.
+ JSON.stringify(match[1])
+ "," + JSON.stringify(match[2])
+ "))";
alert(ctorExpression);
}
which will result in an expression that is in a well-understood subset of JavaScript.
The complex regex above is not in the TCB. The only part that needs to function correctly for security to hold is the ctorExpression including the use of JSON.stringify.
I am trying to "intelligently" pre-fill a form, I want to prefill the firstname and lastname inputs based on a user email address, so for example,
jon.doe#email.com RETURNS Jon Doe
jon_doe#email.com RETURN Jon Doe
jon-doe#email.com RETURNS Jon Doe
I have managed to get the string before the #,
var email = letters.substr(0, letters.indexOf('#'));
But cant work out how to split() when the separator can be multiple values, I can do this,
email.split("_")
but how can I split on other email address valid special characters?
JavaScript's string split method can take a regex.
For example the following will split on ., -, and _.
"i-am_john.doe".split(/[.\-_]/)
Returning the following.
["i", "am", "john", "doe"]
You can use a regular expression for what you want to split on. You can for example split on anything that isn't a letter:
var parts = email.split(/[^A-Za-z]/);
Demo: http://jsfiddle.net/Guffa/xt3Lb9e6/
You can split a string using a regular expression. To match ., _ or -, you can use a character class, for example [.\-_]. The syntax for regular expressions in JavaScript is /expression/, so your example would look like:
email.split(/[\.\-_]/);
Note that the backslashes are to prevent . and - being interpreted as special characters. . is a special character class representing any character. In a character class, - can be used to specify ranges, such as [a-z].
If you require a dynamic list of characters to split on, you can build a regular expression using the RegExp constructor. For example:
var specialChars = ['.', '\\-', '_'];
var specialRegex = new RegExp('[' + specialChars.join('') + ']');
email.split(specialRegex);
More information on regular expressions in JavaScript can be found on MDN.
Regular Expressions --
email.split(/[_\.-]/)
This one matches (therefore splits at) any of (a character set, indicated by []) _, ., or -.
Here's a good resource for learning regular expressions: http://qntm.org/files/re/re.html
You can use regex to do it, just provide a list of the characters in square brackets and escape if necessary.
email.split("[_-\.]");
Is that what you mean?
You are correct that you need to use the split function.
Split function works by taking an argument to split the string on. Multiple values can be split via regular expression. For you usage, try something like
var re = /[\._\-]/;
var split = email.split(re, 2);
This should result in an array with two values, first/second name. The second argument is the number of elements returned.
I created a jsFiddle to show how this could be done :
function printName(email){
var name = email.split('#')[0];
// source : http://stackoverflow.com/questions/650022/how-do-i-split-a-string-with-multiple-separators-in-javascript
var returnVal = name.split(/[._-]/g);
return returnVal;
}
http://jsfiddle.net/ts6nx9tt/1/
If you define your seperators, below code can return all alternatives for you.
var arr = ["_",".","-"];
var email = letters.substr(0, letters.indexOf('#'));
arr.map(function(val,index,rest){
var r = email.split(val);
if(r.length > 1){
return r.join(' ');
}
return "";
}
);
I have a string as follows
var company = "Microst+Apple+Google";
And I want replace all the + signs with %2B
But when I use this code. It returns 0
var company = company.replace(/+/g, "%2B");
I think JavaScript thinks that + is an arithmetic operation. Is there a special symbol to be used? or can user a variable except directly using the + sign?
If so please mention. Any Idea how to do this ?
You need escape it :
var company = company.replace(/\+/g, "%2B");
It is because + is special symbol used to indicate that preceding character should match 1 or more times.
You can read more about regular expressions syntax here: https://developer.mozilla.org/en/docs/Web/JavaScript/Guide/Regular_Expressions
No, JavaScript doesn't think it's an arithmetic operation but + is a quantifier in regular expressions and the regular expression parser doesn't understand yours.
You must escape the + :
var company = company.replace(/\+/g, "%2B");
You can use this :
var company = company.replace(/\+/g, "%2B");
Or an easier way :
var company = encodeURIComponent(company);
which will do the same operation as the regex one. Also, it encodes all URI chars like &, " (quotes), % etc... if there is one like it in the string given.
In both cases, the output is :
Microst%2BApple%2BGoogle
I am writing a Javascript code to parse some grammar files, it is quite some code but I will post relevant information here. I am using Javascript Regexp in order to match a duplicate line held within a string. The string contains, for example (assume the string name is lines):
if
else
;
print
{
}
test1
test1
=
+
-
*
/
(
)
num
string
comment
id
test2
test2
What should happen, is a match found on 'test1' and 'test2'. It should then delete the duplicate, leaving 1 instance of test1 and test2. What is happening is no match at all. I am confident in my regex but javascript may be doing something I am not expecting. Here is the code doing the work on the string given above:
var rex = new RegExp("(.*)(\r?\n\1)+","g");
var re = '/(.*)(\r?\n\1)+/g';
rex.lastIndex = 0;
var m = rex.exec(lines);
if (m) {
alert("Found Duplicate");
var linenum = lines.search(re); //Get line number of error
alert("Error: Symbol Defined twice\n");
alert("Error occured on line: " + linenum);
lines = lines.replace(rex,""); //Gets rid of the duplicate
}
It never gets into the if(m) statement. Therefore no match is found. I tested the regex here: http://regexpal.com/ using the regex in my code as well as the example text provided. It matches just fine, so I am at kind of a loss. If anyone can help, it would be great.
Thank you.
Edit:
Forgot to add, I am testing this in firefox, and it only has to work in firefox. Not sure if that matters.
First error: \ in a JS string is also an escape character.
var rex = new RegExp("(.*)(\r?\n\1)+","g");
should be written
var rex = new RegExp("(.*)(\\r?\\n\\1)+","g");
// or, shorter:
var rex = /(.*)(\r?\n\1)+/g;
if you want to make it work. In the case of the RegExp constructor, you’re passing the pattern as a string to the constructor function. This means you need to escape each \ backslash that occurs in the pattern. If you use a regexp literal, you don’t need to escape them, since they’re not in a string, but retain their ‘normal’ properties in the regexp pattern.
Second error, your expression
var re = '/(.*)(\r?\n\1)+/g';
is wrong. What you’re doing here is assigning a string literal to a variable. I’m assuming you meant to assign a regular expression literal, which should be written like this:
var re = /(.*)(\r?\n\1)+/g;
Third error: the last line
lines = lines.replace(rex,""); //Gets rid of the duplicate
removes both instances of all duplicate lines! If you want to keep the first instance of each duplicate, you should use
lines = lines.replace(rex, "$1");
And finally, this method only detects two consecutive identical lines. Is that what you want, or do you need to detect any duplicates, wherever they may be?
var str = 'if\nelse\n;\nprint\n{\n}\ntest1\ntest1\n=\n+\n-\n*\n/\n(\n)\nnum\nstring\ncomment\nid\ntest2\ntest2\ntest2\ntest2\ntest2';
console.log(str);
str = str.replace(/\r\n?/g,'');
// I prefer replacing all the newline characters with \n's here
str = str.replace(/(^|\n)([^\n]*)(\n\2)+/g,function(m0,m1,m2,m3,ind) {
var line = str.substr(0,ind).split(/\n/).length + 1;
var msg = '[Found duplicate]';
msg += '\nFollowing symbol defined more than once';
msg += '\n\tsymbol: ' + m2;
msg += '\n\ton line ' + line;
console.log(msg);
return m1 + m2;
});
console.log(str);
Otherwise you can skip the first line and change the pattern into
/(^|\r\n?|\n)([^\r\n]*)((?:\r\n?|\n)\2)+/g
Note that [^\n]* will also catch multiple empty lines. If you want to make sure it matches (and replaces) non-empty lines then you might want to use [^\n]+.
[EDIT]
For the record, each m represents each arguments object, so m0 is the whole match, m1 is the 1st subgroup ((^|\n)), m2 is the 2nd subgroup (([^\n]*)) and m3 is the last subgroup ((\n\2)). I could have used arguments[n] instead but these are shorter.
As with the return value, due to lack of lookbehind in the regex flavor used by Javascript, this pattern is catching a possible preceding newline (unless it is the first line) so it needs to return the match and that preceding newline if any. That's why it shouldn't be returning m2 only.
I have a very long regular expression, which I wish to split into multiple lines in my JavaScript code to keep each line length 80 characters according to JSLint rules. It's just better for reading, I think.
Here's pattern sample:
var pattern = /^(([^<>()[\]\\.,;:\s#\"]+(\.[^<>()[\]\\.,;:\s#\"]+)*)|(\".+\"))#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
Extending #KooiInc answer, you can avoid manually escaping every special character by using the source property of the RegExp object.
Example:
var urlRegex= new RegExp(''
+ /(?:(?:(https?|ftp):)?\/\/)/.source // protocol
+ /(?:([^:\n\r]+):([^#\n\r]+)#)?/.source // user:pass
+ /(?:(?:www\.)?([^\/\n\r]+))/.source // domain
+ /(\/[^?\n\r]+)?/.source // request
+ /(\?[^#\n\r]*)?/.source // query
+ /(#?[^\n\r]*)?/.source // anchor
);
or if you want to avoid repeating the .source property you can do it using the Array.map() function:
var urlRegex= new RegExp([
/(?:(?:(https?|ftp):)?\/\/)/ // protocol
,/(?:([^:\n\r]+):([^#\n\r]+)#)?/ // user:pass
,/(?:(?:www\.)?([^\/\n\r]+))/ // domain
,/(\/[^?\n\r]+)?/ // request
,/(\?[^#\n\r]*)?/ // query
,/(#?[^\n\r]*)?/ // anchor
].map(function(r) {return r.source}).join(''));
In ES6 the map function can be reduced to:
.map(r => r.source)
[Edit 2022/08] Created a small github repository to create regular expressions with spaces, comments and templating.
You could convert it to a string and create the expression by calling new RegExp():
var myRE = new RegExp (['^(([^<>()[\]\\.,;:\\s#\"]+(\\.[^<>(),[\]\\.,;:\\s#\"]+)*)',
'|(\\".+\\"))#((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.',
'[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\\.)+',
'[a-zA-Z]{2,}))$'].join(''));
Notes:
when converting the expression literal to a string you need to escape all backslashes as backslashes are consumed when evaluating a string literal. (See Kayo's comment for more detail.)
RegExp accepts modifiers as a second parameter
/regex/g => new RegExp('regex', 'g')
[Addition ES20xx (tagged template)]
In ES20xx you can use tagged templates. See the snippet.
Note:
Disadvantage here is that you can't use plain whitespace in the regular expression string (always use \s, \s+, \s{1,x}, \t, \n etc).
(() => {
const createRegExp = (str, opts) =>
new RegExp(str.raw[0].replace(/\s/gm, ""), opts || "");
const yourRE = createRegExp`
^(([^<>()[\]\\.,;:\s#\"]+(\.[^<>()[\]\\.,;:\s#\"]+)*)|
(\".+\"))#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|
(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$`;
console.log(yourRE);
const anotherLongRE = createRegExp`
(\byyyy\b)|(\bm\b)|(\bd\b)|(\bh\b)|(\bmi\b)|(\bs\b)|(\bms\b)|
(\bwd\b)|(\bmm\b)|(\bdd\b)|(\bhh\b)|(\bMI\b)|(\bS\b)|(\bMS\b)|
(\bM\b)|(\bMM\b)|(\bdow\b)|(\bDOW\b)
${"gi"}`;
console.log(anotherLongRE);
})();
Using strings in new RegExp is awkward because you must escape all the backslashes. You may write smaller regexes and concatenate them.
Let's split this regex
/^foo(.*)\bar$/
We will use a function to make things more beautiful later
function multilineRegExp(regs, options) {
return new RegExp(regs.map(
function(reg){ return reg.source; }
).join(''), options);
}
And now let's rock
var r = multilineRegExp([
/^foo/, // we can add comments too
/(.*)/,
/\bar$/
]);
Since it has a cost, try to build the real regex just once and then use that.
Thanks to the wonderous world of template literals you can now write big, multi-line, well-commented, and even semantically nested regexes in ES6.
//build regexes without worrying about
// - double-backslashing
// - adding whitespace for readability
// - adding in comments
let clean = (piece) => (piece
.replace(/((^|\n)(?:[^\/\\]|\/[^*\/]|\\.)*?)\s*\/\*(?:[^*]|\*[^\/])*(\*\/|)/g, '$1')
.replace(/((^|\n)(?:[^\/\\]|\/[^\/]|\\.)*?)\s*\/\/[^\n]*/g, '$1')
.replace(/\n\s*/g, '')
);
window.regex = ({raw}, ...interpolations) => (
new RegExp(interpolations.reduce(
(regex, insert, index) => (regex + insert + clean(raw[index + 1])),
clean(raw[0])
))
);
Using this you can now write regexes like this:
let re = regex`I'm a special regex{3} //with a comment!`;
Outputs
/I'm a special regex{3}/
Or what about multiline?
'123hello'
.match(regex`
//so this is a regex
//here I am matching some numbers
(\d+)
//Oh! See how I didn't need to double backslash that \d?
([a-z]{1,3}) /*note to self, this is group #2*/
`)
[2]
Outputs hel, neat!
"What if I need to actually search a newline?", well then use \n silly!
Working on my Firefox and Chrome.
Okay, "how about something a little more complex?"
Sure, here's a piece of an object destructuring JS parser I was working on:
regex`^\s*
(
//closing the object
(\})|
//starting from open or comma you can...
(?:[,{]\s*)(?:
//have a rest operator
(\.\.\.)
|
//have a property key
(
//a non-negative integer
\b\d+\b
|
//any unencapsulated string of the following
\b[A-Za-z$_][\w$]*\b
|
//a quoted string
//this is #5!
("|')(?:
//that contains any non-escape, non-quote character
(?!\5|\\).
|
//or any escape sequence
(?:\\.)
//finished by the quote
)*\5
)
//after a property key, we can go inside
\s*(:|)
|
\s*(?={)
)
)
((?:
//after closing we expect either
// - the parent's comma/close,
// - or the end of the string
\s*(?:[,}\]=]|$)
|
//after the rest operator we expect the close
\s*\}
|
//after diving into a key we expect that object to open
\s*[{[:]
|
//otherwise we saw only a key, we now expect a comma or close
\s*[,}{]
).*)
$`
It outputs /^\s*((\})|(?:[,{]\s*)(?:(\.\.\.)|(\b\d+\b|\b[A-Za-z$_][\w$]*\b|("|')(?:(?!\5|\\).|(?:\\.))*\5)\s*(:|)|\s*(?={)))((?:\s*(?:[,}\]=]|$)|\s*\}|\s*[{[:]|\s*[,}{]).*)$/
And running it with a little demo?
let input = '{why, hello, there, "you huge \\"", 17, {big,smelly}}';
for (
let parsed;
parsed = input.match(r);
input = parsed[parsed.length - 1]
) console.log(parsed[1]);
Successfully outputs
{why
, hello
, there
, "you huge \""
, 17
,
{big
,smelly
}
}
Note the successful capturing of the quoted string.
I tested it on Chrome and Firefox, works a treat!
If curious you can checkout what I was doing, and its demonstration.
Though it only works on Chrome, because Firefox doesn't support backreferences or named groups. So note the example given in this answer is actually a neutered version and might get easily tricked into accepting invalid strings.
There are good answers here, but for completeness someone should mention Javascript's core feature of inheritance with the prototype chain. Something like this illustrates the idea:
RegExp.prototype.append = function(re) {
return new RegExp(this.source + re.source, this.flags);
};
let regex = /[a-z]/g
.append(/[A-Z]/)
.append(/[0-9]/);
console.log(regex); //=> /[a-z][A-Z][0-9]/g
The regex above is missing some black slashes which isn't working properly. So, I edited the regex. Please consider this regex which works 99.99% for email validation.
let EMAIL_REGEXP =
new RegExp (['^(([^<>()[\\]\\\.,;:\\s#\"]+(\\.[^<>()\\[\\]\\\.,;:\\s#\"]+)*)',
'|(".+"))#((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.',
'[0-9]{1,3}\])|(([a-zA-Z\\-0-9]+\\.)+',
'[a-zA-Z]{2,}))$'].join(''));
To avoid the Array join, you can also use the following syntax:
var pattern = new RegExp('^(([^<>()[\]\\.,;:\s#\"]+' +
'(\.[^<>()[\]\\.,;:\s#\"]+)*)|(\".+\"))#' +
'((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|' +
'(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$');
You can simply use string operation.
var pattenString = "^(([^<>()[\]\\.,;:\s#\"]+(\.[^<>()[\]\\.,;:\s#\"]+)*)|"+
"(\".+\"))#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|"+
"(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$";
var patten = new RegExp(pattenString);
I tried improving korun's answer by encapsulating everything and implementing support for splitting capturing groups and character sets - making this method much more versatile.
To use this snippet you need to call the variadic function combineRegex whose arguments are the regular expression objects you need to combine. Its implementation can be found at the bottom.
Capturing groups can't be split directly that way though as it would leave some parts with just one parenthesis. Your browser would fail with an exception.
Instead I'm simply passing the contents of the capture group inside an array. The parentheses are automatically added when combineRegex encounters an array.
Furthermore quantifiers need to follow something. If for some reason the regular expression needs to be split in front of a quantifier you need to add a pair of parentheses. These will be removed automatically. The point is that an empty capture group is pretty useless and this way quantifiers have something to refer to. The same method can be used for things like non-capturing groups (/(?:abc)/ becomes [/()?:abc/]).
This is best explained using a simple example:
var regex = /abcd(efghi)+jkl/;
would become:
var regex = combineRegex(
/ab/,
/cd/,
[
/ef/,
/ghi/
],
/()+jkl/ // Note the added '()' in front of '+'
);
If you must split character sets you can use objects ({"":[regex1, regex2, ...]}) instead of arrays ([regex1, regex2, ...]). The key's content can be anything as long as the object only contains one key. Note that instead of () you have to use ] as dummy beginning if the first character could be interpreted as quantifier. I.e. /[+?]/ becomes {"":[/]+?/]}
Here is the snippet and a more complete example:
function combineRegexStr(dummy, ...regex)
{
return regex.map(r => {
if(Array.isArray(r))
return "("+combineRegexStr(dummy, ...r).replace(dummy, "")+")";
else if(Object.getPrototypeOf(r) === Object.getPrototypeOf({}))
return "["+combineRegexStr(/^\]/, ...(Object.entries(r)[0][1]))+"]";
else
return r.source.replace(dummy, "");
}).join("");
}
function combineRegex(...regex)
{
return new RegExp(combineRegexStr(/^\(\)/, ...regex));
}
//Usage:
//Original:
console.log(/abcd(?:ef[+A-Z0-9]gh)+$/.source);
//Same as:
console.log(
combineRegex(
/ab/,
/cd/,
[
/()?:ef/,
{"": [/]+A-Z/, /0-9/]},
/gh/
],
/()+$/
).source
);
Personally, I'd go for a less complicated regex:
/\S+#\S+\.\S+/
Sure, it is less accurate than your current pattern, but what are you trying to accomplish? Are you trying to catch accidental errors your users might enter, or are you worried that your users might try to enter invalid addresses? If it's the first, I'd go for an easier pattern. If it's the latter, some verification by responding to an e-mail sent to that address might be a better option.
However, if you want to use your current pattern, it would be (IMO) easier to read (and maintain!) by building it from smaller sub-patterns, like this:
var box1 = "([^<>()[\]\\\\.,;:\s#\"]+(\\.[^<>()[\\]\\\\.,;:\s#\"]+)*)";
var box2 = "(\".+\")";
var host1 = "(\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])";
var host2 = "(([a-zA-Z\-0-9]+\\.)+[a-zA-Z]{2,})";
var regex = new RegExp("^(" + box1 + "|" + box2 + ")#(" + host1 + "|" + host2 + ")$");
#Hashbrown's great answer got me on the right track. Here's my version, also inspired by this blog.
function regexp(...args) {
function cleanup(string) {
// remove whitespace, single and multi-line comments
return string.replace(/\s+|\/\/.*|\/\*[\s\S]*?\*\//g, '');
}
function escape(string) {
// escape regular expression
return string.replace(/[-.*+?^${}()|[\]\\]/g, '\\$&');
}
function create(flags, strings, ...values) {
let pattern = '';
for (let i = 0; i < values.length; ++i) {
pattern += cleanup(strings.raw[i]); // strings are cleaned up
pattern += escape(values[i]); // values are escaped
}
pattern += cleanup(strings.raw[values.length]);
return RegExp(pattern, flags);
}
if (Array.isArray(args[0])) {
// used as a template tag (no flags)
return create('', ...args);
}
// used as a function (with flags)
return create.bind(void 0, args[0]);
}
Use it like this:
regexp('i')`
//so this is a regex
//here I am matching some numbers
(\d+)
//Oh! See how I didn't need to double backslash that \d?
([a-z]{1,3}) /*note to self, this is group #2*/
`
To create this RegExp object:
/(\d+)([a-z]{1,3})/i