Escape string for use in Javascript regex [duplicate] - javascript

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Is there a RegExp.escape function in Javascript?
I am trying to build a javascript regex based on user input:
function FindString(input) {
var reg = new RegExp('' + input + '');
// [snip] perform search
}
But the regex will not work correctly when the user input contains a ? or * because they are interpreted as regex specials. In fact, if the user puts an unbalanced ( or [ in their string, the regex isn't even valid.
What is the javascript function to correctly escape all special characters for use in regex?

Short 'n Sweet (Updated 2021)
To escape the RegExp itself:
function escapeRegExp(string) {
return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
}
To escape a replacement string:
function escapeReplacement(string) {
return string.replace(/\$/g, '$$$$');
}
Example
All escaped RegExp characters:
escapeRegExp("All of these should be escaped: \ ^ $ * + ? . ( ) | { } [ ]");
>>> "All of these should be escaped: \\ \^ \$ \* \+ \? \. \( \) \| \{ \} \[ \] "
Find & replace a string:
var haystack = "I love $x!";
var needle = "$x";
var safeNeedle = escapeRegExp(needle); // "\\$x"
var replacement = "$100 bills"
var safeReplacement = escapeReplacement(replacement); // "$$100 bills"
haystack.replace(
new RegExp(safeNeedle, 'g'),
escapeReplacement(safeReplacement),
);
// "I love $100 bills!"
(NOTE: the above is not the original answer; it was edited to show the one from MDN. This means it does not match what you will find in the code in the below npm, and does not match what is shown in the below long answer. The comments are also now confusing. My recommendation: use the above, or get it from MDN, and ignore the rest of this answer. -Darren,Nov 2019)
Install
Available on npm as escape-string-regexp
npm install --save escape-string-regexp
Note
See MDN: Javascript Guide: Regular Expressions
Other symbols (~`!## ...) MAY be escaped without consequence, but are not required to be.
.
.
.
.
Test Case: A typical url
escapeRegExp("/path/to/resource.html?search=query");
>>> "\/path\/to\/resource\.html\?search=query"
The Long Answer
If you're going to use the function above at least link to this stack overflow post in your code's documentation so that it doesn't look like crazy hard-to-test voodoo.
var escapeRegExp;
(function () {
// Referring to the table here:
// https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/regexp
// these characters should be escaped
// \ ^ $ * + ? . ( ) | { } [ ]
// These characters only have special meaning inside of brackets
// they do not need to be escaped, but they MAY be escaped
// without any adverse effects (to the best of my knowledge and casual testing)
// : ! , =
// my test "~!##$%^&*(){}[]`/=?+\|-_;:'\",<.>".match(/[\#]/g)
var specials = [
// order matters for these
"-"
, "["
, "]"
// order doesn't matter for any of these
, "/"
, "{"
, "}"
, "("
, ")"
, "*"
, "+"
, "?"
, "."
, "\\"
, "^"
, "$"
, "|"
]
// I choose to escape every character with '\'
// even though only some strictly require it when inside of []
, regex = RegExp('[' + specials.join('\\') + ']', 'g')
;
escapeRegExp = function (str) {
return str.replace(regex, "\\$&");
};
// test escapeRegExp("/path/to/res?search=this.that")
}());

Related

replace words that starts with "(uuid: )"

I would like to replace text using javascript/regex
"TV "my-samsung" (UUID: a1c3bbc1d27c5be8:8baabe2fa7f5d9ca) is already switched off."
with
TV 'my-samsung' is already switched off.
by removing text (UUID: ) and replace " with '
Looks like regex can be used
\([\s\S]*?\)
https://regex101.com/r/xXDncn/1
or have also tried using replace method in JS
str = str.replace("(UUID", "");
You can use
const str = '" "Tv "my-samsung" (UUID: a1c3bbc1d27c5be8:8baabe2fa7f5d9ca) is already switched-off""';
console.log(
str.replace(/\s*\(UUID:[^()]*\)/g, '').replace(/^[\s"]+|[\s"]+$/g, '').replaceAll('"', "'")
)
See the first regex demo. It matches
\s* - zero or more whitespaces
\(UUID: - (UUID: string
[^()]* - zero or more chars other than ( and )
\) - a ) char.
The g flag makes it replace all occurrences.
The second regex removes trailing and leading whitespace and double quotation marks:
^[\s"]+ - one or more whitespaces and double quotes at the start of string
| - or
[\s"]+$ - one or more whitespaces and double quotes at the end of string.
The .replaceAll('"', "'") is necessary to replace all " with ' chars.
It is not a good idea to merge these two operations into one as the replacements are different. Here is how it could be done, just for learning purposes:
const str = '" "Tv "my-samsung" (UUID: a1c3bbc1d27c5be8:8baabe2fa7f5d9ca) is already switched-off""';
console.log(
str.replace(/^[\s"]+|[\s"]+$|\s*\(UUID:[^()]*\)|(")/g, (x,y) => y ? "'" : "")
)
That is, " is captured into Group 1, the replacement is now a callable, where x is the whole match and y is the Group 1 contents. If Group 1 matched, the replacement is ', else, the replacement is an empty string (to remove the match found).
you can try this
str.replace(/\(.*?\)/, "")
str.replace(/\(.*?\)/, "with")
--- update ---
const str = `"TV "my-samsung" (UUID: a1c3bbc1d27c5be8:8baabe2fa7f5d9ca) is already switched off."`;
const a = str.replace(/"(.*?)\(.*\)(.*)"/, (a, b, c) => {
return b.replace(/"/g, "'") + c
});
console.log(a); //TV 'my-samsung' is already switched off.

Is there a javascript method to recognize a string even if the words are out of order? [duplicate]

I would like to find all the matches of given strings (divided by spaces) in a string.
(The way for example, iTunes search box works).
That, for example, both "ab de" and "de ab" will return true on "abcde" (also "bc e a" or any order should return true)
If I replace the white space with a wild card, "ab*de" would return true on "abcde", but not "de*ab".
[I use * and not Regex syntax just for this explanation]
I could not find any pure Regex solution for that.
The only solution I could think of is spliting the search term and run multiple Regex.
Is it possible to find a pure Regex expression that will cover all these options ?
Returns true when all parts (divided by , or ' ') of a searchString occur in text. Otherwise false is returned.
filter(text, searchString) {
const regexStr = '(?=.*' + searchString.split(/\,|\s/).join(')(?=.*') + ')';
const searchRegEx = new RegExp(regexStr, 'gi');
return text.match(searchRegEx) !== null;
}
I'm pretty sure you could come up with a regex to do what you want, but it may not be the most efficient approach.
For example, the regex pattern (?=.*bc)(?=.*e)(?=.*a) will match any string that contains bc, e, and a.
var isMatch = 'abcde'.match(/(?=.*bc)(?=.*e)(?=.*a)/) != null; // equals true
var isMatch = 'bcde'.match(/(?=.*bc)(?=.*e)(?=.*a)/) != null; // equals false
You could write a function to dynamically create an expression based on your search terms, but whether it's the best way to accomplish what you are doing is another question.
Alternations are order insensitive:
"abcde".match(/(ab|de)/g); // => ['ab', 'de']
"abcde".match(/(de|ab)/g); // => ['ab', 'de']
So if you have a list of words to match you can build a regex with an alternation on the fly like so:
function regexForWordList(words) {
return new RegExp('(' + words.join('|') + ')', 'g');
}
'abcde'.match(['a', 'e']); // => ['a', 'e']
Try this:
var str = "your string";
str = str.split( " " );
for( var i = 0 ; i < str.length ; i++ ){
// your regexp match
}
This is script which I use - it works also with single word searchStrings
var what="test string with search cool word";
var searchString="search word";
var search = new RegExp(searchString, "gi"); // one-word searching
// multiple search words
if(searchString.indexOf(' ') != -1) {
search="";
var words=searchString.split(" ");
for(var i = 0; i < words.length; i++) {
search+="(?=.*" + words[i] + ")";
}
search = new RegExp(search + ".+", "gi");
}
if(search.test(what)) {
// found
} else {
// notfound
}
I assume you are matching words, or parts of words. You want space-separated search terms to limit search results, and it seems you intend to return only those entries which have all the words that the user supplies. And you intend a wildcard character * to stand for 0 or more characters in a matching word.
For example, if the user searches for the words term1 term2, you intend to return only those items which have both words term1 and term2. If the user searches for the word term*, it would match any word beginning with term.
There are suitable regular expressions which are equivalent to this search language and can be generated from it.
A simple example, the word term, can be asserted in regex by converting to \bterm\b. But two or more words which must match in any order require lookahead assertions. Using extended syntax, the equivalent regex is:
(?= .* \b term1 \b )
(?= .* \b term2 \b )
The asterisk wildcard can be asserted in regex with a character class followed by asterisk. The character class identifies which letters you consider to be part of word. For example, you might find that [A-Za-z0-9]* fits the bill.
In short, you might be satisfied if you convert an expression such as:
foo ba* quux
to:
(?= .* \b foo \b )
(?= .* \b ba[A-Za-z0-9]* \b )
(?= .* \b quux \b )
That is a simple matter of search and replace. But do be careful to sanitize the input string to avoid injection attacks by removing punctuation, etc.
I think you may be barking up the wrong tree with RegEx. What you might want to look at is the Levenshtein distance of two input strings.
There's a Javascript implementation here and a usage example here.

Use Regex to split string at array keeping all words of array [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
^ ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quickreference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by #jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl 1 2 3 4
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.
In the above diagram, S1 and S2 are two states where S1 is the starting and final step. So if we try with the string 0110 , the transition goes as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the no of opening parenthesis; this means it has to be stored at some place. But since FSAs can not do that, a regular expression can not be written.
However, an algorithm can be written to do this task. Algorithms are generally falls under Pushdown Automata (PDA). PDA is one level above of FSA. PDA has an additional stack to store some additional information. PDAs can be used to solve the above problem, because we can 'push' the opening parenthesis in the stack and 'pop' them once we encounter a closing parenthesis. If at the end, stack is empty, then opening parenthesis and closing parenthesis matches. Otherwise not.
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not a commonly available.
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns and regular-expressions is the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions. - see #dehmann
If it's just first open to last close see #Zach
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile("""
( .*? )
( \( | \) | \[ | \] | \{ | \} | \< | \> |
\' | \" | BEGIN | END | $ )
( .* )
""", re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
lst = []
while True:
m = pat.match(s)
if m.group(1) != "":
lst.append(m.group(1))
if m.group(2) == term:
return lst, m.group(3)
if m.group(2) in matching:
item, s = matchnested(m.group(3), matching[m.group(2)])
lst.append(m.group(2))
lst.append(item)
lst.append(matching[m.group(2)])
else:
raise ValueError("After <<%s %s>> expected %s not %s" %
(lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
for s in ("simple string",
""" "double quote" """,
""" 'single quote' """,
"one'two'three'four'five'six'seven",
"one(two(three(four)five)six)seven",
"one(two(three)four)five(six(seven)eight)nine",
"one(two)three[four]five{six}seven<eight>nine",
"one(two[three{four<five>six}seven]eight)nine",
"oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
"ERROR testing ((( mismatched ))] parens"):
print "\ninput", s
try:
lst, s = matchnested(s)
print "output", lst
except ValueError as e:
print str(e)
print "done"
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you first occurrence
str.lastIndexOf(')'); - last one
So you need a string between,
String searchedString = str.substring(str1.indexOf('('),str1.lastIndexOf(')');
because js regex doesn't support recursive match, i can't make balanced parentheses matching work.
so this is a simple javascript for loop version that make "method(arg)" string into array
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more # here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
""" returns an array of code snippets from a string (data)"""
start_pos = None
end_pos = None
count_open = 0
count_close = 0
code_snippets = []
for i,v in enumerate(data):
if v =='{':
count_open+=1
if not start_pos:
start_pos= i
if v=='}':
count_close +=1
if count_open == count_close and not end_pos:
end_pos = i+1
if start_pos and end_pos:
code_snippets.append((start_pos,end_pos))
start_pos = None
end_pos = None
return code_snippets
I used this to extract code snippets from a text file.
This do not fully address the OP question but I though it may be useful to some coming here to search for nested structure regexp:
Parse parmeters from function string (with nested structures) in javascript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* #return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

Javascript dynamic regex, match string between brackets that have prefix [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
^ ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quickreference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by #jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl 1 2 3 4
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.
In the above diagram, S1 and S2 are two states where S1 is the starting and final step. So if we try with the string 0110 , the transition goes as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the no of opening parenthesis; this means it has to be stored at some place. But since FSAs can not do that, a regular expression can not be written.
However, an algorithm can be written to do this task. Algorithms are generally falls under Pushdown Automata (PDA). PDA is one level above of FSA. PDA has an additional stack to store some additional information. PDAs can be used to solve the above problem, because we can 'push' the opening parenthesis in the stack and 'pop' them once we encounter a closing parenthesis. If at the end, stack is empty, then opening parenthesis and closing parenthesis matches. Otherwise not.
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not a commonly available.
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns and regular-expressions is the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions. - see #dehmann
If it's just first open to last close see #Zach
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile("""
( .*? )
( \( | \) | \[ | \] | \{ | \} | \< | \> |
\' | \" | BEGIN | END | $ )
( .* )
""", re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
lst = []
while True:
m = pat.match(s)
if m.group(1) != "":
lst.append(m.group(1))
if m.group(2) == term:
return lst, m.group(3)
if m.group(2) in matching:
item, s = matchnested(m.group(3), matching[m.group(2)])
lst.append(m.group(2))
lst.append(item)
lst.append(matching[m.group(2)])
else:
raise ValueError("After <<%s %s>> expected %s not %s" %
(lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
for s in ("simple string",
""" "double quote" """,
""" 'single quote' """,
"one'two'three'four'five'six'seven",
"one(two(three(four)five)six)seven",
"one(two(three)four)five(six(seven)eight)nine",
"one(two)three[four]five{six}seven<eight>nine",
"one(two[three{four<five>six}seven]eight)nine",
"oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
"ERROR testing ((( mismatched ))] parens"):
print "\ninput", s
try:
lst, s = matchnested(s)
print "output", lst
except ValueError as e:
print str(e)
print "done"
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you first occurrence
str.lastIndexOf(')'); - last one
So you need a string between,
String searchedString = str.substring(str1.indexOf('('),str1.lastIndexOf(')');
because js regex doesn't support recursive match, i can't make balanced parentheses matching work.
so this is a simple javascript for loop version that make "method(arg)" string into array
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more # here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
""" returns an array of code snippets from a string (data)"""
start_pos = None
end_pos = None
count_open = 0
count_close = 0
code_snippets = []
for i,v in enumerate(data):
if v =='{':
count_open+=1
if not start_pos:
start_pos= i
if v=='}':
count_close +=1
if count_open == count_close and not end_pos:
end_pos = i+1
if start_pos and end_pos:
code_snippets.append((start_pos,end_pos))
start_pos = None
end_pos = None
return code_snippets
I used this to extract code snippets from a text file.
This do not fully address the OP question but I though it may be useful to some coming here to search for nested structure regexp:
Parse parmeters from function string (with nested structures) in javascript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* #return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

Javascript/Regex: Expression works in one environment and not another

I am trying to only allow alphanumeric entry or these characters:'()-_. (with the "." included)
Using regexpal.com I entered this regular expression: [^a-zA-Z0-9()\.'\-\_ ]
It is correctly identifying * and # as a match. What's baffling is that I have that same exact expression in my javascript on an .aspx page and it is not catching * or #. I have confirmed that is indeed entering that function and that the expression evaluates. Here is that code:
$(".validateText").keyup(function (e) {
var matchPattern = "[^a-zA-Z0-9()\.'\-\_ ]";
var regEx = new RegExp(matchPattern);
console.log("Regex: " + regEx + "\nValue of " + e.target.id + " is: " + e.target.value);
if (regEx.test(e.target.value)) {
console.log("Found invalid data.");//I don't get here with # or *
var failingChar = e.target.value.length - 1;
e.target.value = e.target.value.substring(0, failingChar);
}
});
Rather than using string literals to define regexes, use regex literals.
var regEx = /[^a-zA-Z0-9()\.'\-\_ ]/;
String literals interpret backslashes as escape characters, so they need to be escaped. Regex literals don't require this.
As per Bergi's suggestion, you wouldn't even need to escape all those characters.
/[^a-zA-Z0-9().'_ -]/
You could probably even use the general \w character.
/[^\w().' -]/
var matchPattern = "[^a-zA-Z0-9()\\.'\\-\\_ ]";
Would work.

Categories