JavaScript Regex to match balanced constructs without caring about imbalanced constructs

I'm working on a JavaScript-based project that involves a rudimentary Bash-inspired scripting system, and I'm using a regex to separate lines into (a number of types of) tokens.
One such token class is of course the recursive $() construct. This construct can be nested arbitrarily. I am trying to devise a JavaScript regular expression to match this type of token without accidentally leaving parts behind or grabbing parts of other tokens.
The problem, more specifically:
Given a string such as this example:
"$(foo $(bar) fizz $(buzz)) $(something $(else))"
which is made up of individual tokens, each delimited by an outer $() and followed either by whitespace or end-of-string,
match the first such token in the string from and including its open $( to and including its final closing )
In the event that an unbalanced construct occurs anywhere in the string, the behavior of this regex is considered undefined and does not matter.
So in the above example the regex should match "$(foo $(bar) fizz $(buzz))"
Further usage details:
Both the input string and the returned match are passed through String.prototype.trim(), so leading and trailing whitespace doesn't matter
I am able to deem unbalanced constructs an undefined case because the code that consumes this type of token once extracted does its own balance checking. Even if the regex returns a match that is surrounded by an outer $(), the error will eventually be caught elsewhere.
What I've tried so far
For a while I've been using this regex:
/\$\(.*?\)(?!(?:(?!\$\().)*\))(?:\s+|$)/
which seemed to work for quite some time. It matches arbitrarily nested balanced constructs so long as they do not have multiple nests at the same level. That case just slipped past my mind somehow when testing initially. It consumes the contents of the token with lazy repetition, and asserts that after the closing paren there is not another close paren before there is an open $(. This is of course broken by tokens such as the example above.
I know that the traditional "balanced constructs regex problem" is not solvable without subroutines/recursion. However I'm hoping that since I only need to match balanced constructs and not fail to match imbalanced ones, that there's some clever way to cheat here.

So in the above example the regex should match $(foo $(bar) fizz $(buzz))
The solution, as I see it, is almost the same as the one I posted earlier today (based on "Matching Nested Constructs in JavaScript, Part 2" by Steven Levithan); all you need to do is add the delimiters back, since they are known.
Example usage:
matchRecursiveRegExp("$(foo $(bar) fizz $(buzz)) $(something $(else))", "\\$\\(", "\\)");
Code:
function matchRecursiveRegExp (str, left, right, flags) {
    var f = flags || "",
        g = f.indexOf("g") > -1,
        x = new RegExp(left + "|" + right, "g" + f),
        l = new RegExp(left, f.replace(/g/g, "")),
        a = [],
        t, s, m;
    do {
        t = 0;
        while (m = x.exec(str)) {
            if (l.test(m[0])) {
                if (!t++) s = x.lastIndex;
            } else if (t) {
                if (!--t) {
                    a.push(str.slice(s, m.index));
                    if (!g) return a;
                }
            }
        }
    } while (t && (x.lastIndex = s));
    return a;
}
document.write("$(" + matchRecursiveRegExp("$(foo $(bar) fizz $(buzz)) $(something $(else))", "\\$\\(", "\\)") + ")");
Note this solution does not support global matching.
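For comparison, the same depth-counting idea can be sketched without any regular expression. This is only an illustration, not the answer's method: the helper name firstBalancedToken is hypothetical, and it assumes that only "$(" opens a level and that every ")" closes one (so plain parentheses inside a token are not handled). Unlike the function above, it returns the token with its delimiters included.

```javascript
// Hypothetical helper (not from the answer above): returns the first
// balanced $( ... ) token, delimiters included, or null if none closes.
function firstBalancedToken(str) {
  const start = str.indexOf("$(");
  if (start === -1) return null;
  let depth = 0;
  for (let i = start; i < str.length; i++) {
    if (str.startsWith("$(", i)) {
      depth++;
      i++; // skip the "(" that belongs to this "$("
    } else if (str[i] === ")") {
      depth--;
      if (depth === 0) return str.slice(start, i + 1);
    }
  }
  return null; // unbalanced input: left to the caller, as in the question
}

console.log(firstBalancedToken("$(foo $(bar) fizz $(buzz)) $(something $(else))"));
// prints "$(foo $(bar) fizz $(buzz))"
```

Because unbalanced input is declared undefined in the question, returning null there is just one possible choice.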

Related

3 While Loops into a Single Loop?

I have to remove the commas, periods, and hyphens from an HTML text value. I do not want to write all 3 of these while loops, instead I only want one loop (any) to do all of this.
I already tried a single while loop with multiple && conditions and a nested if/else inside, but only the commas were ever removed.
while (beg.indexOf(',') > -1)
{
    beg = beg.replace(',', '');
    document.twocities.begins.value = beg;
}
while (beg.indexOf('-') > -1)
{
    beg = beg.replace('-', '');
    document.twocities.begins.value = beg;
}
while (beg.indexOf('.') > -1)
{
    beg = beg.replace('.', '');
    document.twocities.begins.value = beg;
}

You can do all this without loops by using regex.
Here is an example of removing all those characters using a single regex:
let str = "abc,d-e.fg,hij,1-2,34.56.7890"
str = str.replace(/[,.-]/g, "")
console.log(str)

No loops are necessary for this in the first place.
You can replace characters in a string with String.replace() and you can determine which characters and patterns to replace using regular expressions.
let sampleString = "This, is. a - test - - of, the, code. ";
console.log(sampleString.replace(/[,.-]/g, ""));

A single call to the replace function and using a regular expression suffice:
document.twocities.begins.value = beg = beg.replace(/[,.-]/g, "");
Regular expressions are a pattern-matching language. The pattern employed here basically says "every occurrence of one of the characters ., , or -". Note that the slashes / delimit the pattern, while the suffix consists of flags controlling the matching process; in this case it is g (global), which tells the engine to replace every occurrence (as opposed to only the first one, which is what happens without the flag).
This site provides lots of info about regular expressions, their use in programming and implementations in different programming environments.
There are several online sites to test actual regular expression and what they match (including explanations), eg. Regex 101.
Even more details ... ;): You may use the .replace function with a string as the first argument (as you did in your code sample). However, only the first occurrence of the string searched for will be replaced - thus you would have to resort to loops. Specs of the .replace function (and of JS in general) can be found here.
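The difference between a string pattern and a global regex can be seen in a short sketch. The sample value and variable names are made up for illustration.

```javascript
const input = "1,234-567.89";

// A string pattern replaces only the first occurrence it finds.
const firstOnly = input.replace(",", "");
console.log(firstOnly); // "1234-567.89"

// A regex with the g flag replaces every occurrence in a single call.
const allRemoved = input.replace(/[,.-]/g, "");
console.log(allRemoved); // "123456789"
```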

Use regex like below.
let example = "This- is a,,., string.,";
console.log(example.replace(/[-.,]+/g, ""));

For a regex pattern, how to determine the length of the longest string that matches the pattern?

Given a regex pattern regexPattern, how can I determine the length of the longest string that matches it?
The imaginary int LongestLength(string pattern) should work like this:
Assert.Equals(LongestLength("[abc]"), 1);
Assert.Equals(LongestLength("(a|b)c?"), 2);
Assert.Equals(LongestLength("d*"), int.MaxValue); // Or throws NoLongestLengthException
Although the question is in C#, both C# and JavaScript answers are good.

It's pretty straightforward for a proper regex using just the operators ?, * and + and |, plus parentheses, character classes and of course ordinary characters. In fact even \1-style backreferences (which aren't part of the formal definition of a regex, and do complicate some questions about regexes) can be handled without problems.
A regex is just a compact encoding of a tree structure (similar to how mathematical formulas are compact encodings of tree structures describing arithmetic). Between every adjacent pair of characters there is an implicit "follows" operator that corresponds to a node with 2 children, one being the subregex just to its left, and the other being the entire rest of the regex; a sequence of subregexes separated by | characters corresponds to a single "alt" node with as many children as there are alternatives (i.e., one more than the number of | characters), while every other operator has just a single child (namely the subregex it operates on), and every ordinary character or character class has no children at all. To calculate the maximum-length matching string, you can just do a bottom-up traversal of this tree structure, at each node greedily assigning the length of the longest string that would match that node, given knowledge of the longest strings that would match its children.
The rules for deciding the length of the longest string that matches any given node in this tree are:
follows(x, y) (xy): maxlen(x) + maxlen(y)
alt(a, b, ..., z) (a|b|...|z): max(maxlen(a), maxlen(b), ..., maxlen(z))
maybe(x) (x?): maxlen(x)
rep(x) (x*) or posrep(x) (x+): infinity
Any other single character or character class ([...]): 1
\1-style backreferences: the maxlen for the corresponding parenthesised expression
One consequence is that the presence of * or + anywhere (unless escaped or part of a character class, obviously) will cause infinity to propagate up the tree until it hits the root.
Examples
Regex: abcd
"Function call syntax": follows(a, follows(b, follows(c, d)))
As a tree:
follows
 /    \
a    follows
      /    \
     b    follows
           /    \
          c      d
A second example:
Regex: (a|b|de)c?
"Function call" syntax: follows(alt(a, b, follows(d, e)), maybe(c))
As a tree:
        follows
        /     \
     alt       maybe
   /  |  \        \
  a   b  follows   c
          /   \
         d     e
For this second regex/tree, a bottom-up traversal will assign a maxlen of 1 for the leaf nodes a, b, d, e and c; then the maxlen for the bottom follows() node is 1 + 1 = 2; then the maxlen for the alt() node is max(1, 1, 2) = 2; the maxlen for the maybe node is 1; the maxlen for the topmost follows() node, and thus for the entire regex, is 2 + 1 = 3.
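The bottom-up rules above can be sketched as a small recursive function. The node shapes used here ({ type, children } and { type, child }) are assumptions made for this illustration, the tree is built by hand rather than parsed from a pattern, and backreferences are left out.

```javascript
// Hypothetical node shapes, chosen for this sketch only:
//   { type: "char" }                        ordinary character or class
//   { type: "follows", children: [x, y] }   concatenation
//   { type: "alt", children: [a, b, ...] }  alternation
//   { type: "maybe", child: x }             x?
//   { type: "rep", child: x }               x* or x+
function maxlen(node) {
  switch (node.type) {
    case "char":
      return 1;
    case "follows":
      return node.children.reduce((sum, c) => sum + maxlen(c), 0);
    case "alt":
      return Math.max(...node.children.map((c) => maxlen(c)));
    case "maybe":
      return maxlen(node.child);
    case "rep":
      return Infinity;
    default:
      throw new Error("unknown node type: " + node.type);
  }
}

const ch = { type: "char" };

// Hand-built tree for (a|b|de)c?
const tree = {
  type: "follows",
  children: [
    { type: "alt", children: [ch, ch, { type: "follows", children: [ch, ch] }] },
    { type: "maybe", child: ch },
  ],
};

console.log(maxlen(tree));                       // 3
console.log(maxlen({ type: "rep", child: ch })); // Infinity
```

Using Infinity for the unbounded case is one option; a caller could map it to int.MaxValue or an exception, as the question suggests.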
If you mean regexes that allow other Perl-style enhanced features, it might get much more complicated, because a locally optimal choice of length may lead to a globally suboptimal one. (I had thought that it might even be possible that Perl-style extensions make regexes Turing complete, meaning that it would be in general impossible to decide whether there is any matching string -- but the discussion here seems to indicate this is not the case, unless of course you allow in the ?{...} construct.)

I would approach this function by first creating a key-value data structure and filling it with every kind of regex syntax. Each key is a piece of regex syntax (for example "*"), and its value is how much it adds to the possible length of a matching string; the key "*" would have the value int.MaxValue. You then search the pattern that was passed in for each piece of syntax, sum the values you find, and return the total. Keep in mind that some syntax may be escaped, in which case it must not be counted. Also, some syntax ("*", "+", etc.) automatically makes the possible length int.MaxValue, so check for those first and return int.MaxValue as soon as you find one.

How to parse and capture any measurement unit

In my application, users can customize measurement units, so if they want to work in decimeters instead of inches or in full-turns instead of degrees, they can. However, I need a way to parse a string containing multiple values and units, such as 1' 2" 3/8. I've seen a few regular expressions on SO and didn't find any which matched all cases of the imperial system, let alone allowing any kind of unit. My objective is to have the most permissive input box possible.
So my question is: how can I extract multiple value-unit pairs from a string in the most user-friendly way?
I came up with the following algorithm:
Check for illegal characters and throw an error if needed.
Trim leading and trailing spaces.
Split the string into parts every time there's a non-digit character followed by a digit character, except for .,/ which are used to identify decimals and fractions.
Remove all spaces from parts, check for character misuse (multiple decimal points or fraction bars) and replace '' with ".
Split value and unit-string for each part. If a part has no unit:
If it is the first part, use the default unit.
Else if it is a fraction, consider it as the same unit as the previous part.
Else if it isn't, consider it as in, cm or mm based on the previous part's unit.
If it isn't the first part and there's no way to guess the unit, throw an error.
Check if units mean something, are all of the same system (metric/imperial) and follow a descending order (ft > in > fraction or m > cm > mm > fraction), throw an error if not.
Convert and sum all parts, performing division in the process.
I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex.
I came up with a regex:
((\d+('|''|"|m|cm|mm|\s|$) *)+(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *)?)|((\d+('|''|"|m|cm|mm|\s) *)*(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *))
It only allows fractions at the end and allows to place spaces between values. I've never used regex capturing though, so I'm not so sure how I'll manage to extract the values out of this mess. I'll work again on this tomorrow.

My objective is to have the most permissive input box possible.
Careful, more permissive doesn't always mean more intuitive. An ambiguous input should warn the user, not pass silently, as that might lead them to make multiple mistakes before they realize their input wasn't interpreted like they hoped.
How can I extract multiple value-unit pairs from a string? I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex.
Regular expressions are a powerful tool, especially since they work in many programming languages, but be warned. When you're holding a hammer everything starts to look like a nail. Don't try to use a regular expression to solve every problem just because you recently learned how they work.
Looking at the pseudocode you wrote, you are trying to solve two problems at once: splitting up a string (which we call tokenization) and interpreting input according to a grammar (which we call parsing). You should first try to split the input into a list of tokens, or maybe unit-value pairs. You can start making sense of these pairs once you're done with string manipulation. Separation of concerns will spare you a headache, and your code will be much easier to maintain as a result.
I've never used regex capturing though, so I'm not so sure how I'll manage to extract the values out of this mess.
If a regular expression has the global (g) flag, it can be used to find multiple matches in the same string. That would be useful if you had a regular expression that finds a single unit-value pair. In JavaScript, you can retrieve a list of matches using string.match(regex). However, that function ignores capture groups on global regular expressions.
If you want to use capture groups, you need to call regex.exec(string) inside a loop. For each successful match, the exec function will return an array where item 0 is the entire match and items 1 and onwards are the captured groups.
For example, /(\d+) ([a-z]+)/g will look for an integer followed by a space and a word. If you made successive calls to regex.exec("1 hour 30 minutes") you would get:
["1 hour", "1", "hour"]
["30 minutes", "30", "minutes"]
null
Successive calls work like this because the regex object keeps an internal cursor you can get or set with regex.lastIndex. You should set it back to 0 before using the regex again with a different input.
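The exec loop described above might look like the following sketch. The function name collectPairs and the shape of the returned objects are assumptions for illustration.

```javascript
const pairRx = /(\d+) ([a-z]+)/g;

// Hypothetical helper: collects every value-unit pair found in the input.
function collectPairs(input) {
  pairRx.lastIndex = 0; // reset the cursor before scanning a new input
  const pairs = [];
  let match;
  // exec returns null once no further match is found, which ends the loop.
  while ((match = pairRx.exec(input)) !== null) {
    pairs.push({ value: Number(match[1]), unit: match[2] });
  }
  return pairs;
}

console.log(collectPairs("1 hour 30 minutes"));
// [ { value: 1, unit: "hour" }, { value: 30, unit: "minutes" } ]
```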
You've been using parentheses to isolate OR clauses such as a|b and to apply quantifiers to a character sequence such as (abc)+. If you want to do that without creating capture groups, you can use (?: ) instead. This is called a non-capturing group. It does the same thing as regular parentheses in a regex, but what's inside it won't create an entry in the returned array.
Is there a better way to approach this?
A previous version of this answer concluded with a regular expression even more incomprehensible than the one posted in the question because I didn't know better at the time, but today this would be my recommendation. It's a regular expression that only extracts one token at a time from the input string.
/ (\s+)                           // 1      whitespace
| (\d+)\/(\d+)                    // 2,3    fraction
| (\d*)([.,])(\d+)                // 4,5,6  decimal
| (\d+)                           // 7      integer
| (km|cm|mm|m|ft|in|pi|po|'|")    // 8      unit
/gi
Sorry about the weird syntax highlighting. I used whitespace to make this more readable but properly formatted it becomes:
/(\s+)|(\d+)\/(\d+)|(\d*)([.,])(\d+)|(\d+)|(km|cm|mm|m|ft|in|pi|po|'|")/gi
This regular expression makes clever use of capture groups separated by OR clauses. Only the capture groups of one type of token will contain anything; groups that did not take part in the match are undefined. For example, on the string "10 ft", successive calls to exec would return:
["10", undefined, undefined, undefined, undefined, undefined, undefined, "10", undefined] (because "10" is an integer)
[" ", " ", undefined, undefined, undefined, undefined, undefined, undefined, undefined] (because " " is whitespace)
["ft", undefined, undefined, undefined, undefined, undefined, undefined, undefined, "ft"] (because "ft" is a unit)
null
A tokenizer function can then do something like this to treat each individual token:
function tokenize (input) {
    // tokenRx is the /.../gi expression shown above; copying it gives
    // this tokenizer its own lastIndex cursor
    const localTokenRx = new RegExp(tokenRx);
    return function next () {
        const startIndex = localTokenRx.lastIndex;
        if (startIndex >= input.length) {
            // end of input reached
            return undefined;
        }
        const match = localTokenRx.exec(input);
        if (!match) {
            localTokenRx.lastIndex = input.length;
            // there is leftover garbage at the end of the input
            return ["garbage", input.slice(startIndex)];
        }
        if (match.index !== startIndex) {
            localTokenRx.lastIndex = match.index;
            // the regex skipped over some garbage
            return ["garbage", input.slice(startIndex, match.index)];
        }
        const [
            text,
            whitespace,
            numerator, denominator,
            integralPart, decimalSeparator, fractionalPart,
            integer,
            unit
        ] = match;
        if (whitespace) {
            return ["whitespace", undefined];
            // or return next(); if we want to ignore it
        }
        if (denominator) {
            return ["fraction", Number(numerator) / Number(denominator)];
        }
        if (decimalSeparator) {
            return ["decimal", Number(integralPart + "." + fractionalPart)];
        }
        if (integer) {
            return ["integer", Number(integer)];
        }
        if (unit) {
            return ["unit", unit];
        }
    };
}
This function can do all the necessary string manipulation and type conversion all in one place, letting another piece of code do proper analysis of the sequence of tokens. But that would be out of scope for this Stack Overflow answer, especially since the question doesn't specify the rules of the grammar we are willing to accept.
But this is most likely too generic and complex of a solution if all you're trying to do is accept imperial lengths and metric lengths. For that, I'd probably only write a different regular expression for each acceptable format, then test the user's input to see which one matches. If two different expressions match, then the input is ambiguous and we should warn the user.
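The one-regex-per-format idea can be sketched as follows. The three formats and the classify function are hypothetical, since the question does not fix a grammar; the unit suffix is made optional in two of them on purpose so that a bare number demonstrates the ambiguous case.

```javascript
// Hypothetical formats for illustration; the question does not fix a grammar.
const formats = [
  { name: "feet-inches", rx: /^(\d+)'\s*(\d+)"$/ },
  { name: "metres", rx: /^\d+(?:[.,]\d+)?\s*m?$/i },
  { name: "millimetres", rx: /^\d+\s*(?:mm)?$/i },
];

// Tests the input against every format and reports how many accepted it.
function classify(input) {
  const text = input.trim();
  const hits = formats.filter((f) => f.rx.test(text));
  if (hits.length === 0) return { status: "invalid" };
  if (hits.length > 1) {
    return { status: "ambiguous", formats: hits.map((f) => f.name) };
  }
  return { status: "ok", format: hits[0].name };
}

console.log(classify("5' 3\"")); // { status: "ok", format: "feet-inches" }
console.log(classify("12 mm"));  // { status: "ok", format: "millimetres" }
console.log(classify("12"));     // ambiguous: metres and millimetres both match
console.log(classify("abc"));    // { status: "invalid" }
```

None of the expressions carries the g flag, so test does not keep a cursor between calls.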

Shared part in RegEx matched string

In following code:
"a sasas b".match(/sas/g) //returns ["sas"]
The string actually includes two sas strings, a [sas]as b and a sa[sas] b.
How can I modify RegEx to match both?
Another example:
"aaaa".match(/aa/g); //actually include [aa]aa,a[aa]a,aa[aa]
Please consider the issue in general not just above instances.
A pure RegEx solution is preferred.

If you want to match at least one such "merged" occurrence, then you could do something like:
"a sasas b".match(/s(as)+/g)
If you want to retrieve the matches as separate results, then you have a bit more work to do; this is not a case that regular expressions are designed to handle. The basic algorithm would be:
Attempt a match. If it was unsuccessful, stop.
Extract the match you are interested in and do whatever you want with it.
Take the substring of the original target string, starting from one character following the first character in your match.
Start over, using this substring as the new input.
(To be more efficient, you could match with an offset instead of using substrings; that technique is discussed in this question.)
For example, you would start with "a sasas b". After the first match, you have "sas". Taking the substring that starts one character after the match starts, we would have "asas b". The next match would find the "sas" here, and you would again repeat the process with "as b". This would fail to match, so you would be done.

This significantly-improved answer owes itself to @EliGassert.
String.prototype.match_overlap = function(re)
{
    if (!re.global)
        re = new RegExp(re.source,
                        'g' + (re.ignoreCase ? 'i' : '')
                            + (re.multiline  ? 'm' : ''));
    var matches = [];
    var result;
    while (result = re.exec(this))
        matches.push(result),
        re.lastIndex = result.index + 1;
    return matches.length ? matches : null;
}
@EliGassert points out that there is no need to walk through the entire string character by character; instead we can find a match anywhere (i.e. do without the anchor), and then continue one character after the index of the found match. While researching how to retrieve said index, I found that the re.lastIndex property, used by exec to keep track of where it should continue its search, is in fact settable! This works rather nicely with what we intend to do.
The only bit needing further explanation might be the beginning. In the absence of the g flag, exec may never return null (always returning its one match, if it exists), thus possibly going into an infinite loop. Since, however, match_overlap by design seeks multiple matches, we can safely recompile any non-global RegExp as a global RegExp, importing the i and m options as well if set.
Here is a new jsFiddle: http://jsfiddle.net/acheong87/h5MR5/.
document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");
Output:
sas,sas
aa,aa,aa
my1na,me2is,is3pi

var match = "a sasas b".match(/s(?=as)/g);
for(var i =0; i != match.length; ++i)
alert(match[i]);
Going off of the comment by Q. Sheets and the response by cdhowie, I came up with the above solution: it consumes ONE character in the regular expression and does a lookahead for the rest of the match string. With these two pieces, you can construct all the positions and matching strings in your regular expression.
I wish there was an "inspect but don't consume" operator that you could use to actually include the rest of the matching (lookahead) string in the results, but there unfortunately isn't -- at least not in JS.
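As a possible workaround for the limitation mentioned above, a capture group placed inside the lookahead does seem to expose the text that was inspected but not consumed. The sketch below assumes String.prototype.matchAll (ES2020) is available; the helper name overlappingMatches is hypothetical, and a pattern that already contains numbered groups or backreferences would have its numbering shifted by the extra group.

```javascript
// Hypothetical helper; assumes String.prototype.matchAll (ES2020).
function overlappingMatches(str, source) {
  // Wrap the pattern in a capturing group inside a lookahead: each match is
  // zero-length, so the search moves on one position at a time, while
  // group 1 still holds the text the lookahead saw.
  const rx = new RegExp("(?=(" + source + "))", "g");
  return Array.from(str.matchAll(rx), (m) => m[1]);
}

console.log(overlappingMatches("a sasas b", "sas")); // [ "sas", "sas" ]
console.log(overlappingMatches("aaaa", "aa"));       // [ "aa", "aa", "aa" ]
```

This yields at most one match per starting position.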

Here's a generic way to do it:
String.prototype.match_overlap = function(regexp)
{
    regexp = regexp.toString().replace(/^\/|\/$/g, '');
    var re = new RegExp('^' + regexp);
    var matches = [];
    var result;
    for (var i = 0; i < this.length; i++)
        if (result = re.exec(this.substr(i)))
            matches.push(result);
    return matches.length ? matches : null;
}
Usage:
var results = 'sasas'.match_overlap(/sas/);
Returns:
An array of (overlapping) matches, or null.
Example:
Here's a jsFiddle in which this:
document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");
returns this:
sas,sas
aa,aa,aa
my1na,me2is,is3pi
Explanation:
To explain a little bit, we intend for the user to pass a RegExp object to this new function, match_overlap, as he or she would do normally with match. From this we want to create a new RegExp object anchored at the beginning (to prevent duplicate overlapped matches—this part probably won't make sense unless you encounter the issue yourself—don't worry about it). Then, we simply match against each substring of the subject string this and push the results to an array, which is returned if non-empty (otherwise returning null). Note that if the user passes in an expression that is already anchored, this is inherently wrong—at first I stripped anchors out, but then I realized I was making an assumption in the user's stead, which we should avoid. Finally one could go further and somehow merge the resulting array of matches into a single match result resembling what would normally occur with the //g option; and one could go even further and make up a new flag, e.g. //o that gets parsed to do overlap-matching, but this is getting a little crazy.

How to split a long regular expression into multiple lines in JavaScript?

I have a very long regular expression, which I wish to split into multiple lines in my JavaScript code to keep each line length 80 characters according to JSLint rules. It's just better for reading, I think.
Here's pattern sample:
var pattern = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;

Extending @KooiInc's answer, you can avoid manually escaping every special character by using the source property of the RegExp object.
Example:
var urlRegex = new RegExp(''
    + /(?:(?:(https?|ftp):)?\/\/)/.source     // protocol
    + /(?:([^:\n\r]+):([^@\n\r]+)@)?/.source  // user:pass
    + /(?:(?:www\.)?([^\/\n\r]+))/.source     // domain
    + /(\/[^?\n\r]+)?/.source                 // request
    + /(\?[^#\n\r]*)?/.source                 // query
    + /(#?[^\n\r]*)?/.source                  // anchor
);
or if you want to avoid repeating the .source property you can do it using the Array.map() function:
var urlRegex = new RegExp([
    /(?:(?:(https?|ftp):)?\/\/)/      // protocol
    ,/(?:([^:\n\r]+):([^@\n\r]+)@)?/  // user:pass
    ,/(?:(?:www\.)?([^\/\n\r]+))/     // domain
    ,/(\/[^?\n\r]+)?/                 // request
    ,/(\?[^#\n\r]*)?/                 // query
    ,/(#?[^\n\r]*)?/                  // anchor
].map(function(r) {return r.source}).join(''));
In ES6 the map function can be reduced to:
.map(r => r.source)

[Edit 2022/08] Created a small github repository to create regular expressions with spaces, comments and templating.
You could convert it to a string and create the expression by calling new RegExp():
var myRE = new RegExp (['^(([^<>()[\]\\.,;:\\s@\"]+(\\.[^<>(),[\]\\.,;:\\s@\"]+)*)',
                        '|(\\".+\\"))@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.',
                        '[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\\.)+',
                        '[a-zA-Z]{2,}))$'].join(''));
Notes:
when converting the expression literal to a string you need to escape all backslashes as backslashes are consumed when evaluating a string literal. (See Kayo's comment for more detail.)
RegExp accepts modifiers as a second parameter
/regex/g => new RegExp('regex', 'g')
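The two notes above can be checked with a short sketch. The pattern is an arbitrary example and the variable names are made up.

```javascript
// The same pattern written as a literal and as a string.
const literal = /\d+\.\d+/g;
const fromString = new RegExp("\\d+\\.\\d+", "g");

console.log(literal.source === fromString.source); // true
console.log(fromString.flags);                     // "g"

// Forgetting to double the backslashes silently changes the pattern:
// "\d" inside an ordinary string literal is just "d".
const broken = new RegExp("\d+\.\d+");
console.log(broken.source);       // "d+.d+"
console.log(broken.test("3.14")); // false
```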
[Addition ES20xx (tagged template)]
In ES20xx you can use tagged templates. See the snippet.
Note:
Disadvantage here is that you can't use plain whitespace in the regular expression string (always use \s, \s+, \s{1,x}, \t, \n etc).
(() => {
    const createRegExp = (str, opts) =>
        new RegExp(str.raw[0].replace(/\s/gm, ""), opts || "");
    const yourRE = createRegExp`
        ^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|
        (\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|
        (([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$`;
    console.log(yourRE);
    const anotherLongRE = createRegExp`
        (\byyyy\b)|(\bm\b)|(\bd\b)|(\bh\b)|(\bmi\b)|(\bs\b)|(\bms\b)|
        (\bwd\b)|(\bmm\b)|(\bdd\b)|(\bhh\b)|(\bMI\b)|(\bS\b)|(\bMS\b)|
        (\bM\b)|(\bMM\b)|(\bdow\b)|(\bDOW\b)
        ${"gi"}`;
    console.log(anotherLongRE);
})();

Using strings in new RegExp is awkward because you must escape all the backslashes. You may write smaller regexes and concatenate them.
Let's split this regex
/^foo(.*)\bar$/
We will use a function to make things more beautiful later
function multilineRegExp(regs, options) {
    return new RegExp(regs.map(
        function(reg){ return reg.source; }
    ).join(''), options);
}
And now let's rock
var r = multilineRegExp([
    /^foo/,  // we can add comments too
    /(.*)/,
    /\bar$/
]);
Since it has a cost, try to build the real regex just once and then use that.

Thanks to the wondrous world of template literals you can now write big, multi-line, well-commented, and even semantically nested regexes in ES6.
//build regexes without worrying about
// - double-backslashing
// - adding whitespace for readability
// - adding in comments
let clean = (piece) => (piece
    .replace(/((^|\n)(?:[^\/\\]|\/[^*\/]|\\.)*?)\s*\/\*(?:[^*]|\*[^\/])*(\*\/|)/g, '$1')
    .replace(/((^|\n)(?:[^\/\\]|\/[^\/]|\\.)*?)\s*\/\/[^\n]*/g, '$1')
    .replace(/\n\s*/g, '')
);
window.regex = ({raw}, ...interpolations) => (
    new RegExp(interpolations.reduce(
        (regex, insert, index) => (regex + insert + clean(raw[index + 1])),
        clean(raw[0])
    ))
);
Using this you can now write regexes like this:
let re = regex`I'm a special regex{3} //with a comment!`;
Outputs
/I'm a special regex{3}/
Or what about multiline?
'123hello'
.match(regex`
//so this is a regex
//here I am matching some numbers
(\d+)
//Oh! See how I didn't need to double backslash that \d?
([a-z]{1,3}) /*note to self, this is group #2*/
`)
[2]
Outputs hel, neat!
"What if I need to actually search a newline?", well then use \n silly!
Working on my Firefox and Chrome.
Okay, "how about something a little more complex?"
Sure, here's a piece of an object destructuring JS parser I was working on:
regex`^\s*
(
//closing the object
(\})|
//starting from open or comma you can...
(?:[,{]\s*)(?:
//have a rest operator
(\.\.\.)
|
//have a property key
(
//a non-negative integer
\b\d+\b
|
//any unencapsulated string of the following
\b[A-Za-z$_][\w$]*\b
|
//a quoted string
//this is #5!
("|')(?:
//that contains any non-escape, non-quote character
(?!\5|\\).
|
//or any escape sequence
(?:\\.)
//finished by the quote
)*\5
)
//after a property key, we can go inside
\s*(:|)
|
\s*(?={)
)
)
((?:
//after closing we expect either
// - the parent's comma/close,
// - or the end of the string
\s*(?:[,}\]=]|$)
|
//after the rest operator we expect the close
\s*\}
|
//after diving into a key we expect that object to open
\s*[{[:]
|
//otherwise we saw only a key, we now expect a comma or close
\s*[,}{]
).*)
$`
It outputs /^\s*((\})|(?:[,{]\s*)(?:(\.\.\.)|(\b\d+\b|\b[A-Za-z$_][\w$]*\b|("|')(?:(?!\5|\\).|(?:\\.))*\5)\s*(:|)|\s*(?={)))((?:\s*(?:[,}\]=]|$)|\s*\}|\s*[{[:]|\s*[,}{]).*)$/
And running it with a little demo?
let input = '{why, hello, there, "you huge \\"", 17, {big,smelly}}';
for (
let parsed;
parsed = input.match(r);
input = parsed[parsed.length - 1]
) console.log(parsed[1]);
Successfully outputs
{why
, hello
, there
, "you huge \""
, 17
,
{big
,smelly
}
}
Note the successful capturing of the quoted string.
I tested it on Chrome and Firefox, works a treat!
If curious you can checkout what I was doing, and its demonstration.
Though it only works on Chrome, because Firefox doesn't support backreferences or named groups. So note the example given in this answer is actually a neutered version and might get easily tricked into accepting invalid strings.

There are good answers here, but for completeness someone should mention Javascript's core feature of inheritance with the prototype chain. Something like this illustrates the idea:
RegExp.prototype.append = function(re) {
    return new RegExp(this.source + re.source, this.flags);
};
let regex = /[a-z]/g
.append(/[A-Z]/)
.append(/[0-9]/);
console.log(regex); //=> /[a-z][A-Z][0-9]/g

The regex above is missing some backslashes, so it doesn't work properly. I edited it accordingly. Please consider this regex, which works for 99.99% of email validation cases.
let EMAIL_REGEXP =
    new RegExp (['^(([^<>()[\\]\\\.,;:\\s@\"]+(\\.[^<>()\\[\\]\\\.,;:\\s@\"]+)*)',
                 '|(".+"))@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.',
                 '[0-9]{1,3}\])|(([a-zA-Z\\-0-9]+\\.)+',
                 '[a-zA-Z]{2,}))$'].join(''));

To avoid the Array join, you can also use the following syntax:
var pattern = new RegExp('^(([^<>()[\]\\.,;:\s@\"]+' +
    '(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@' +
    '((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|' +
    '(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$');

You can simply use string operation.
var pattenString = "^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|"+
    "(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|"+
    "(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$";
var patten = new RegExp(pattenString);

I tried improving korun's answer by encapsulating everything and implementing support for splitting capturing groups and character sets - making this method much more versatile.
To use this snippet you need to call the variadic function combineRegex whose arguments are the regular expression objects you need to combine. Its implementation can be found at the bottom.
Capturing groups can't be split directly that way though as it would leave some parts with just one parenthesis. Your browser would fail with an exception.
Instead I'm simply passing the contents of the capture group inside an array. The parentheses are automatically added when combineRegex encounters an array.
Furthermore quantifiers need to follow something. If for some reason the regular expression needs to be split in front of a quantifier you need to add a pair of parentheses. These will be removed automatically. The point is that an empty capture group is pretty useless and this way quantifiers have something to refer to. The same method can be used for things like non-capturing groups (/(?:abc)/ becomes [/()?:abc/]).
This is best explained using a simple example:
var regex = /abcd(efghi)+jkl/;
would become:
var regex = combineRegex(
  /ab/,
  /cd/,
  [
    /ef/,
    /ghi/
  ],
  /()+jkl/ // Note the added '()' in front of '+'
);
If you must split character sets, you can use objects ({"": [regex1, regex2, ...]}) instead of arrays ([regex1, regex2, ...]). The key can be anything, as long as the object contains only one key. Note that instead of () you have to use ] as the dummy prefix if the first character could be interpreted as a quantifier. For example, /[+?]/ becomes {"": [/]+?/]}.
Here is the snippet and a more complete example:
function combineRegexStr(dummy, ...regex)
{
  return regex.map(r => {
    if(Array.isArray(r))
      // Array: contents of a group, wrapped in parentheses
      return "("+combineRegexStr(dummy, ...r).replace(dummy, "")+")";
    else if(Object.getPrototypeOf(r) === Object.getPrototypeOf({}))
      // Plain object: contents of a character set, wrapped in brackets
      return "["+combineRegexStr(/^\]/, ...(Object.entries(r)[0][1]))+"]";
    else
      // RegExp: take its source and strip the dummy prefix
      return r.source.replace(dummy, "");
  }).join("");
}
function combineRegex(...regex)
{
  return new RegExp(combineRegexStr(/^\(\)/, ...regex));
}
//Usage:
//Original:
console.log(/abcd(?:ef[+A-Z0-9]gh)+$/.source);
//Same as:
console.log(
  combineRegex(
    /ab/,
    /cd/,
    [
      /()?:ef/,
      {"": [/]+A-Z/, /0-9/]},
      /gh/
    ],
    /()+$/
  ).source
);

Personally, I'd go for a less complicated regex:
/\S+@\S+\.\S+/
Sure, it is less accurate than your current pattern, but what are you trying to accomplish? Are you trying to catch accidental errors your users might enter, or are you worried that your users might try to enter invalid addresses? If it's the first, I'd go for an easier pattern. If it's the latter, some verification by responding to an e-mail sent to that address might be a better option.
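
For the first case, catching accidental typos, a sketch along these lines may be enough. The function name looksLikeEmail is made up for illustration, and the ^ and $ anchors are an addition that the loose pattern itself does not have, so that the whole input is checked instead of any substring.

```javascript
// Hypothetical helper: a loose sanity check, not a full address validator.
// ^ and $ are added here so the entire trimmed input must fit the shape.
function looksLikeEmail(input) {
  return /^\S+@\S+\.\S+$/.test(input.trim());
}

console.log(looksLikeEmail('jane@example.com')); // true
console.log(looksLikeEmail('jane@example'));     // false
console.log(looksLikeEmail('jane example.com')); // false
```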
However, if you want to use your current pattern, it would be (IMO) easier to read (and maintain!) by building it from smaller sub-patterns, like this:
// Backslashes are doubled throughout so they survive the string literal
var box1 = "([^<>()[\\]\\\\.,;:\\s@\"]+(\\.[^<>()[\\]\\\\.,;:\\s@\"]+)*)";
var box2 = "(\".+\")";
var host1 = "(\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])";
var host2 = "(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,})";
var regex = new RegExp("^(" + box1 + "|" + box2 + ")@(" + host1 + "|" + host2 + ")$");

@Hashbrown's great answer got me on the right track. Here's my version, also inspired by this blog.
function regexp(...args) {
  function cleanup(string) {
    // remove whitespace, single and multi-line comments
    return string.replace(/\s+|\/\/.*|\/\*[\s\S]*?\*\//g, '');
  }
  function escape(string) {
    // escape regular expression
    return string.replace(/[-.*+?^${}()|[\]\\]/g, '\\$&');
  }
  function create(flags, strings, ...values) {
    let pattern = '';
    for (let i = 0; i < values.length; ++i) {
      pattern += cleanup(strings.raw[i]); // strings are cleaned up
      pattern += escape(values[i]); // values are escaped
    }
    pattern += cleanup(strings.raw[values.length]);
    return RegExp(pattern, flags);
  }
  if (Array.isArray(args[0])) {
    // used as a template tag (no flags)
    return create('', ...args);
  }
  // used as a function (with flags)
  return create.bind(void 0, args[0]);
}
Use it like this:
regexp('i')`
//so this is a regex
//here I am matching some numbers
(\d+)
//Oh! See how I didn't need to double backslash that \d?
([a-z]{1,3}) /*note to self, this is group #2*/
`
To create this RegExp object:
/(\d+)([a-z]{1,3})/i
