RegExp sub-pattern reuse for different open-close conditions? - javascript

Is it possible to reuse a matching RegExp sub-pattern for a variety of opening and closing conditions of the containing pattern?
I have a complex/long RegExp sub-pattern for a certain expression X, which I expect to reside within any of the open-close statements, defined as: ${...}, $(...), $[...], $/.../, etc., which in combination makes the whole pattern (mixing open-close conditions is not accepted, or it would have been trivial).
What I want is to avoid repeating the same long X sub-pattern for each of the open-close conditions (using |) when defining the whole pattern, as it becomes too long and unreadable, even though it is mostly just the X sub-pattern repeated.
My question - is this achievable within the RegExp syntax? And if yes, then how?
Environments: Node 0.12 for ES5 and IO.js 2.0 for ES6.
P.S. Strictly speaking, we are talking RegExp optimization here, for better code readability, and, possibly, performance.

You can use an extremely hacky way of matching specific opening and closing braces when used together:
\$(?:(\[)|(\()|({)|(\/)).*?(?:(?=\2)(?=\3)(?=\4)\]|(?=\1)(?=\3)(?=\4)\)|(?=\1)(?=\2)(?=\4)}|(?=\1)(?=\2)(?=\3)\/)
                        ^^^ Inner Match Here
It basically looks for all groups except one specific one to be empty and happens to only work in JavaScript regex. The .*? section pointed out in the above code just needs to be replaced with the regex to be matched inside of the braces to match an arbitrary pattern.
Demo: https://regex101.com/r/aX7rH1/1
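The trick relies on a JavaScript quirk: a backreference to a group that did not take part in the match matches the empty string instead of failing. A minimal sketch of that behaviour in isolation (the pattern and variable names are illustrative, not taken from the answer):

```javascript
// Group 1 only participates when the input starts with "a".
var re = /(?:(a)|b)\1c/;

// \1 refers to a group that never matched, so it matches the empty string.
var viaUnsetGroup = re.test('bc');

// \1 must repeat the captured "a" before the "c".
var viaSetGroup = re.test('aac');

// Group 1 captured "a", but no second "a" follows, so the match fails.
var mismatch = re.test('ac');

console.log(viaUnsetGroup, viaSetGroup, mismatch);
```

This is why each closing alternative can list lookaheads for "all the other groups": they pass trivially when those groups are unset.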
// Matches
${...}
$(...)
$[...]
$/.../
// Does Not Match
${...)
${...]
${.../
$(...}
$(...]
$(.../
$[...}
$[...)
$[.../
$/...}
$/...)
$/...]
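If the main goal is to avoid writing X out several times by hand, a different route (not the lookahead trick above) is to keep X in a string and assemble the alternation with the RegExp constructor. A rough sketch, where X, pairs and innerOf are made-up names and the X pattern is only a stand-in for the real sub-pattern:

```javascript
// Hypothetical stand-in for the long sub-pattern X (dotted lowercase names).
var X = '[a-z]+(?:\\.[a-z]+)*';

// Each entry is [escaped opener, escaped closer]; mixing is impossible
// because every alternative carries its own matching pair.
var pairs = [['\\{', '\\}'], ['\\(', '\\)'], ['\\[', '\\]'], ['\\/', '\\/']];

var alternatives = pairs.map(function (pair) {
  return pair[0] + '(' + X + ')' + pair[1];
});

// Result: \$(?:\{(X)\}|\((X)\)|\[(X)\]|\/(X)\/)
var whole = new RegExp('\\$(?:' + alternatives.join('|') + ')');

// Returns the text matched by X, or null when nothing matches.
function innerOf(text) {
  var m = whole.exec(text);
  if (!m) return null;
  for (var i = 1; i < m.length; i++) {
    if (m[i] !== undefined) return m[i];
  }
  return null;
}

console.log(innerOf('${a.b}'), innerOf('${a.b)'));
```

The compiled pattern still contains X four times, so this helps readability of the source rather than the work the regex engine does.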

Related

Allowing new line characters in javascript str.replace

This question is similar to "Allowing new line characters in javascript regex",
but the /m solution does not work with str.replace. You can test the code below on this page:
<p id="demo"><i>I need to TRIM the italics here,
despite this line.</i>
</p>
<button onclick="myFunction()">Try it</button>
<script>
function myFunction()
{
var str=document.getElementById("demo").innerHTML;
var n=str.replace(/^(\s*)<i>(.+)<\/i>(\s*)$/m,"$1$2$3"); //tested also /s
alert(str)
document.getElementById("demo").innerHTML=n;
}
</script>
This answer is mostly to give you some insight into why your current approach does not work, and how you generally solve it.
The reason m doesn't help is that the other answer is wrong. This is not what m does. m simply makes the anchors match line beginnings and endings in addition to the string beginnings and endings. Some regex flavors have s for what you want to accomplish, but ECMAScript did not (the s flag only arrived in ES2018). The simplest thing (and general solution) is to replace . (which matches everything except line breaks) with [\s\S] (which matches whitespace and non-whitespace, i.e. everything).
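A small sketch of the difference, using a shortened version of the markup from the question (the variable names are made up for illustration):

```javascript
// The <i> element spans two lines, as in the question.
var html = '<i>I need to TRIM the italics here,\ndespite this line.</i>';

// "." stops at the line break, so the pattern never reaches </i>
// and replace() returns the string unchanged.
var withDot = html.replace(/<i>(.+)<\/i>/, '$1');

// [\s\S] also matches line breaks, so the tags are stripped.
var withClass = html.replace(/<i>([\s\S]+)<\/i>/, '$1');

console.log(withDot === html, withClass);
```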
However, Casimir's approach is better in your case, as it avoids some other problems like greediness. Of course, as Casimir said, if there are tags in between the opening and closing <i> tags, then the approach will not work. In that case, something like <i>([\s\S]+?)</i> might be an option, but that's still not the full solution, in case you have nested i-tags or attributes in the opening tag, or capitalized I-tags and whatnot.
All in all, using regex to parse HTML is wrong! You should really use DOM manipulation. Especially, since you are using Javascript - THE language for DOM manipulation. What you should really do is traverse the DOM for all i tags in your demo element, and replace them with their inner HTML.
A way to avoid problems with newlines is to not use the dot, example:
var n=str.replace(/<i>([^<]+)<\/i>/,"$1");
I have replaced the dot with [^<] (anything that is not a <, which includes newlines).
The m modifier is not needed here, and you don't need to capture the whitespace either.
Note that my solution assumes that you don't have any < between <i> and </i>.
Otherwise, when you have nested tags for example, you can use this trick to avoid a lazy quantifier:
var n=str.replace(/<i>((?:[^<]+|<+(?!\/i>))+)<\/i>/,"$1");
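A sketch of both patterns run against small inputs. The nested variant is written here with its groups balanced, and the sample strings are made up:

```javascript
// Simple variant: no "<" allowed between <i> and </i>.
var simple = /<i>([^<]+)<\/i>/;

// Nested variant: allows "<" as long as it does not start "</i>".
var nested = /<i>((?:[^<]+|<+(?!\/i>))+)<\/i>/;

var plain = '<i>a\nb</i>';
var withTag = '<i>a <b>bold</b>\nb</i>';

var plainResult = plain.replace(simple, '$1');         // tags removed
var simpleOnNested = withTag.replace(simple, '$1');    // no match, unchanged
var nestedResult = withTag.replace(nested, '$1');      // outer <i> removed

console.log(plainResult, simpleOnNested, nestedResult);
```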

RegEx inner content

Using JavaScript, I'm looking to pinpoint text that's inside two other strings WITHOUT including those strings. For example:
input: ONE example TWO
regular expression: (?=ONE).+(?=TWO)
matches: ONE example
I want: example
I'm really surprised that the question mark (which is supposed to just include that string in the query but not the result) works at the end of the string, but not at the start.
Ah-ha! I figured it out.
For example, here's how to get the text inside parentheses without the parentheses:
(?<=\().+(?=\))
Here's a nice reference: http://www.regular-expressions.info/lookaround.html
Part of my confusion was JavaScript's fault. At the time it evidently didn't support "lookbehinds" natively (they were added in ES2018). I found this workaround though:
http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
(I use Python's re module to show the examples -- exactly how to do this depends on your regexp implementation [some don't have groups, for example -- or backreferences])
Use a backwards assertion, not a forward assertion, for the first assertion.
>>> re.search(r"(?<=ONE).+(?=TWO)", "ONE x a b TWO").group()
' x a b '
The problem is that the zero width assertion (?=ONE) matches the text "ONE", but doesn't "consume" it -- i.e. it just checks that it's there, but leaves the string as-is. Then the .+ starts reading text, and does consume it.
Backwards assertions don't look ahead, they look behind, so .+ doesn't get run until whatever is behind it is "ONE".
It is probably better not to bother with these at all, but use groups. Consider:
>>> re.search(r"ONE(.+)TWO", "ONE x a b TWO").group(1)
' x a b '
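The same group-based idea carries over to JavaScript without needing lookbehind. A minimal sketch using the input from the question (variable names are illustrative):

```javascript
var input = 'ONE example TWO';

// Lookahead on both sides: (?=ONE) consumes nothing, so ".+" starts at "O"
// and the match still includes "ONE".
var lookaheadOnly = /(?=ONE).+(?=TWO)/.exec(input)[0];

// Capturing group: match the delimiters, keep only what sits between them.
var m = /ONE(.+)TWO/.exec(input);
var inner = m ? m[1] : null;

console.log(JSON.stringify(lookaheadOnly), JSON.stringify(inner));
```

Note that the captured text keeps the surrounding spaces; call trim() on it if they are unwanted.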

The space character as a punctuator in JavaScript

In chapter 7.7 (Punctuators) of the ECMAScript spec ( http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf ) the grid of punctuators appears to have a gap in row 3 of the last column. This is in fact the space character punctuator, correct?
I understand that space characters may be inserted optionally between tokens in the JavaScript code (in order to improve readability), however, I was wondering where they are actually required...
In order to find this out, I searched for space characters in the minified version of the jQuery library. These are my results:
A space is required... (see Update below)
... between a keyword and an identifier:
function x(){}
var x;
return x;
typeof x;
new X();
... between two keywords:
return false;
if(x){}else if(y){}else{}
These are the two cases that I identified. Are there any other cases?
Note: Space characters inside string literals are not regarded as punctuator tokens (obviously).
Update: As it turns out, a space character is not required in those cases. For example, a keyword token and an identifier token have to be separated by something, but that something does not have to be a space character. It could be any input element which is not a token (WhiteSpace, LineTerminator or Comment).
Also... It seems that the space character is regarded as a WhiteSpace input element, and not a token at all, which would mean that it's not a punctuator.
Update (2021): The spec is much clearer now, and space is definitely not in the list of punctuators. Space is whitespace, which is covered in the White Space section.
Answer from 2010:
I don't think that gap is meant to be a space, no, I think it's just a gap (an unfortunate one). If they really meant to be listing a space, I expect they'd use "Whitespace" as they have elsewhere in the document. But whitespace as a punctuator doesn't really make sense.
I believe spaces (and other forms of whitespace) are delimiters. The spec sort of defines them by omission rather than explicitly. The space is required between function and x because otherwise you have the token functionx, which is not of course a keyword (though it could be a name token — e.g., a variable, property, or function name).
You need delimiters around some tokens (Identifiers and ReservedWords), because that's how we recognize where those tokens begin and end — an IdentifierName starts with an IdentifierStart followed by zero or more IdentifierParts, a class which doesn't include whitespace or any of the characters used for punctuators. Other tokens (Punctuators for instance) we can recognize without delimiters. I think that's about it, and so your two rules are pretty much just two examples of the same rule: IdentifierNames must be delimited (by whitespace, by punctuators, by beginning or end of file, ...).
Somewhat off-topic, but of course not all delimiters are equal. Line-breaking delimiters are sometimes treated specially by the grammar for the horror that is "semicolon insertion".
Whitespaces are not required in any of these cases. You just have to write a syntax that is understandable for the parser. In other words: the machine has to know whether you're using a keyword like function or new or just defining another variable like newFunction.
Each keyword has to be delimited somehow - whitespaces are the most sensible and readable, however they can be replaced:
return/**/false;
return(false);
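These variants can be checked by compiling small function bodies from strings. A sketch, assuming the Function constructor is acceptable for a quick experiment (the variable names are made up); the last case shows that a line break is not interchangeable with the other delimiters, because of semicolon insertion:

```javascript
// A comment, parentheses, or a tab all separate "return" from "false".
var viaComment = new Function('return/**/false;')();
var viaParens = new Function('return(false);')();
var viaTab = new Function('return\tfalse;')();

// A line break after "return" ends the statement, so nothing is returned
// and "false;" becomes a separate, unreachable statement.
var viaNewline = new Function('return\nfalse;')();

console.log(viaComment, viaParens, viaTab, viaNewline);
```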
This is just a guess, but I would say that spaces aren't actually required anywhere. They are used just as one of many alternatives to generate word boundaries between keywords. This means you could just as well replace them with other characters.
If what you want is to remove the unnecessary spaces from some code I would say that spaces (white-space to be more exact, tabs will work just as well) are mandatory only where there are no other means of separating keywords and/or variable identifiers. I.e. where by removing the white-space you no longer have the same keywords and identifiers in the resulting code.
What follows is not exactly relevant to your needs but you may find it interesting. You can write your examples so that they no longer have those spaces. I hope none of the examples are wrong.
x=function(){} instead of function x(){}
this.x=null; instead of var x;
return(x); instead of return x;
typeof(x); instead of typeof x;
y=X(); instead of y = new X();
return(false) instead of return false
if(x){}else{if(y){}else{}} instead of if(x){}else if(y){}else{}

Why is my RegExp construction not accepted by JavaScript?

I'm using a RegExp to validate some user input on an ASP.NET web page. It's meant to enforce the construction of a password (i.e. between 8 and 20 long, at least one upper case character, at least one lower case character, at least one number, at least one of the characters ##!$% and no use of letters L or O (upper or lower) or numbers 0 and 1. This RegExp works fine in my tester (Expresso) and in my C# code.
This is how it looks:
(?-i)^(?=.{8,20})(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])
(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]*$
(Line break added for formatting)
However, when I run the code it lives in in IE6 or IE7 (haven't tried other browsers as this is an internal app and we're a Microsoft shop), I get a runtime error saying 'Syntax error in regular expression'. That's it - no further information in the error message aside from the line number.
What is it about this that JavaScript doesn't like?
Well, there are two ways of defining a Regex in Javascript:
a. Through a Regexp object constructor:
var re = new RegExp("pattern","flags");
re.test(myTestString);
b. Using a regular expression literal:
var re = /pattern/flags;
You should also note that JS does not support some of the tenets of Regular Expressions. For a non-comprehensive list of features unsupported in JS, check out the regular-expressions.info site.
Specifically speaking, you appear to be setting some flags on the expression (for example, the case insensitive flag). I would suggest that you use the /i flag (as indicated by the syntax above) instead of using (?-i)
That would make your Regex as follows (Positive Lookahead appears to be supported):
/^(?=.{8,20})(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]*$/i;
For a very good article on the subject, check out Regular Expressions in JavaScript.
Edit (after Howard's comment)
If you are simply assigning this Regex pattern to a RegularExpressionValidator control, then you will not have the ability to set Regex options (such as ignore case). Also, you will not be able to use the Regex literal syntax supported by Javascript. Therefore, the only option that remains is to make your pattern intrinsically case insensitive. For example, [a-h] would have to be written as [A-Ha-h]. This would make your Regex quite long-winded, I'm sorry to say.
Here is a solution to this problem, though I cannot vouch for its legitimacy. Some other options that come to mind may be to turn off client-side validation altogether and validate exclusively on the server. This will give you access to the full Regex flavour implemented by the System.Text.RegularExpressions.Regex object. Alternatively, use a CustomValidator and create your own JS function which applies the Regex match using the patterns that I (and others) have suggested.
I'm not familiar with C#'s regular expression syntax, but is this (at the start)
(?-i)
meant to turn the case insensitivity pattern modifier on? If so, that's your problem. JavaScript doesn't support specifying the pattern modifiers in the expression. There are two ways to do this in JavaScript:
var re = /pattern/i
var re = new RegExp('pattern','i');
Give one of those a try, and your expression should be happy.
As Cerberus mentions, (?-i) is not supported in JavaScript regexps. So, you need to get rid of that and use /i. Something to keep in mind is that there is no standard for regular expression syntax; it is different in each language, so testing in something that uses the .NET regular expression engine is not a valid test of how it will work in JavaScript. Instead, try and look for a reference on JavaScript regular expressions, such as this one.
Your match that looks for 8-20 characters is also invalid. This will ensure that there are at least 8 characters, but it does not limit the string to 20, since the character class with the kleene-closure (* operator) at the end can match as many characters as provided. What you want instead is to replace the * at the end with the {8,20}, and eliminate it from the beginning.
var re = /^(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]{8,20}$/i;
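The length point can be seen with a stripped-down pair of patterns. This sketch uses a plain [a-z] class instead of the full password class to keep it readable; the variable names are made up:

```javascript
// Lookahead checks that 8 to 20 characters exist, but "*" then consumes any number.
var lookaheadOnly = /^(?=.{8,20})[a-z]*$/;

// The quantifier on the consuming class is what actually caps the length.
var quantified = /^[a-z]{8,20}$/;

var thirtyChars = new Array(31).join('a'); // 30 copies of "a"

var passesLookahead = lookaheadOnly.test(thirtyChars);
var passesQuantified = quantified.test(thirtyChars);

console.log(thirtyChars.length, passesLookahead, passesQuantified);
```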
On the other hand, I'm not really sure why you would want to restrict the length of passwords, unless there's a hard database limit (which there shouldn't be, since you shouldn't be storing passwords in plain text in the database, but instead hashing them down to something fixed size using a secure hash algorithm with a salt). And as mentioned, I don't see a reason to be so restrictive on the set of characters you allow. I'd recommend something more like this:
var re = /^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##!$%])[a-zA-Z0-9##!$%]{8,}$/i;
Also, why would you forbid 1, 0, L and O from your passwords (and it looks like you're trying to forbid I as well, which you forgot to mention)? This will make it very hard for people to construct good passwords, and since you never see a password as you type it, there's no reason to worry about letters which look confusingly similar. If you want to have a more permissive regexp:
var re = /^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##!$%]).{8,}$/i;
Are you enclosing the regexp in / / characters?
var regexp = /[]/;
return regexp.test();
(?-i)
Doesn't exist in JS Regexp. Flags can be specified as “new RegExp('pattern', 'i')”, or literal syntax “/pattern/i”.
(?=
Exists in modern implementations of JS Regexp, but is dangerously buggy in IE. Lookahead assertions should be avoided in JS for this reason.
between 8 and 20 long, at least one upper case character, at least one lower case character, at least one number, at least one of the characters ##!$% and no use of letters L or O (upper or lower) or numbers 0 and 1.
Do you have to do this in RegExp, and do you have to put all the conditions in one RegExp? Because those are easy conditions to match using multiple RegExps, or even simple string matching:
if (
    s.length<8 || s.length>20 ||
    s==s.toLowerCase() || s==s.toUpperCase() ||
    s.indexOf('0')!=-1 || s.indexOf('1')!=-1 ||
    s.toLowerCase().indexOf('l')!=-1 || s.toLowerCase().indexOf('o')!=-1 ||
    (s.indexOf('#')==-1 && s.indexOf('#')==-1 && s.indexOf('!')==-1 && s.indexOf('$')==-1 && s.indexOf('%')==-1)
)
    alert('Bad password!');
(These are really cruel and unhelpful password rules if meant for end-users BTW!)
I would use this regular expression:
/(?=[^2-9]*[2-9])(?=[^a-hj-km-np-z]*[a-hj-km-np-z])(?=[^A-HJ-KM-NP-Z]*[A-HJ-KM-NP-Z])(?=[^##!$%]*[##!$%])^[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]{8,}$/
The [^a-z]*[a-z] will make sure that the match is made as early as possible instead of expanding the .* and doing backtracking.
(?-i) is supposed to turn case-insensitivity off. Everybody seems to be assuming you're trying to turn it on, but that would be (?i). Anyway, you don't want it to be case-insensitive, since you need to ensure that there are both uppercase and lowercase letters. Since case-sensitive matching is the default, prefacing a regex with (?-i) is pointless even in those flavors (like .NET) that support inline modifiers.
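A quick way to see why the i flag defeats the mixed-case requirement: build the same pattern with and without it. This is a sketch using the character classes from the question; the sample passwords and variable names are invented:

```javascript
// Case-sensitive version: separate lookaheads for lower and upper case.
var sensitive = /^(?=.*[2-9])(?=.*[a-hj-km-np-z])(?=.*[A-HJ-KM-NP-Z])(?=.*[##!$%])[2-9a-hj-km-np-zA-HJ-KM-NP-Z##!$%]{8,20}$/;

// Same source with the i flag: both letter lookaheads now accept either case.
var insensitive = new RegExp(sensitive.source, 'i');

var allLower = 'abcdefg2#';   // no uppercase letter
var mixedCase = 'Abcdefg2#';

var lowerStrict = sensitive.test(allLower);
var lowerLoose = insensitive.test(allLower);
var mixedStrict = sensitive.test(mixedCase);

console.log(lowerStrict, lowerLoose, mixedStrict);
```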

Processing Javascript RegEx submatches

I am trying to write a JavaScript RegEx to replace user-entered tags with real HTML tags, so [b] will become <b> and so forth. The RegEx I am using looks like this:
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
with the following JavaScript
s.replace(exptags,"<$1>$2</$1>");
this works fine for single nested tags, for example:
[b]hello[/b] [u]world[/u]
but if the tags are nested inside each other it will only match the outer tags, for example
[b]foo [u]to the[/u] bar[/b]
This will only match the b tags. How can I fix this? Should I just loop until the starting string is the same as the outcome? I have a feeling that the ((.){1,}?) pattern is wrong as well.
Thanks
The easiest solution would be to replace all the tags, whether they are closed or not, and let .innerHTML work out whether they are matched; it will be much more resilient that way.
var tagreg = /\[(\/?)(b|u|i|s|center|code)]/ig
div.innerHTML="[b][i]helloworld[/b]".replace(tagreg, "<$1$2>") //no closing i
//div.innerHTML=="<b><i>helloworld</i></b>"
AFAIK you can't express recursion with regular expressions.
You can however do that with .NET's System.Text.RegularExpressions using balanced matching. See more here: http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
If you're using .NET you can probably implement what you need with a callback.
If not, you may have to roll your own little javascript parser.
Then again, if you can afford to hit the server you can use the full parser. :)
What do you need this for, anyway? If it is for anything other than a preview I highly recommend doing the processing server-side.
You could just repeatedly apply the regexp until it no longer matches. That would do odd things like "[b][b]foo[/b][/b]" => "<b>[b]foo</b>[/b]" => "<b><b>foo</b></b>", but as far as I can see the end result will still be a sensible string with matching (though not necessarily properly nested) tags.
Or if you want to do it 'right', just write a simple recursive descent parser. Though people might expect "[b]foo[u]bar[/b]baz[/u]" to work, which is tricky to recognise with a parser.
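The "repeat until it no longer matches" idea can be sketched as a loop that stops once a pass changes nothing. The function name bbToHtml is made up, and the regex is a tidied form of the one from the question:

```javascript
// Replace matched [tag]...[/tag] pairs, re-running until the string is stable.
function bbToHtml(s) {
  var exptags = /\[(b|u|i|s|center|code)\](.+?)\[\/\1\]/ig;
  var previous;
  do {
    previous = s;
    s = s.replace(exptags, '<$1>$2</$1>');
  } while (s !== previous);
  return s;
}

var nestedDifferent = bbToHtml('[b]foo [u]to the[/u] bar[/b]');
var nestedSame = bbToHtml('[b][b]foo[/b][/b]');

console.log(nestedDifferent);
console.log(nestedSame);
```

Comparing against the previous string, rather than testing the regex again, also guarantees the loop ends if a replacement ever leaves the text unchanged.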
The reason the nested block doesn't get replaced is because the match, for [b], places the position after [/b]. Thus, everything that ((.){1,}?) matches is then ignored.
It is possible to write a recursive parser in server-side -- Perl uses qr// and Ruby probably has something similar.
Though, you don't necessarily need true recursion. You can use a relatively simple loop to handle the string equivalently:
var s = '[b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]';
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
while (s.match(exptags)) {
s = s.replace(exptags, "<$1>$2</$1>");
}
document.writeln('<div>' + s + '</div>'); // after
In this case, it'll make 2 passes:
0: [b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]
1: <b>hello</b> <u>world</u> <b>foo [u]to the[/u] bar</b>
2: <b>hello</b> <u>world</u> <b>foo <u>to the</u> bar</b>
Also, a few suggestions for cleaning up the RegEx:
var exptags = /\[(b|u|i|s|center|code)\](.+?)\[\/(\1)\]/ig;
{1} is assumed when no other count specifiers exist
{1,} can be shortened to +
Agree with Richard Szalay, but his regex didn't get quoted right:
var exptags = /\[(b|u|i|s|center|code)](.*)\[\/\1]/ig;
is cleaner. Note that I also change .+? to .*. There are two problems with .+?:
you won't match [u][/u], since there isn't at least one character between them (+)
a non-greedy match won't deal as nicely with the same tag nested inside itself (?)
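Both quantifier choices have trade-offs, which a few single-pass replacements make visible. This is only a sketch with made-up sample strings; note that the greedy form spans from the first opening tag to the last matching closing tag when the same tag appears twice side by side:

```javascript
var lazy = /\[(b|u|i|s|center|code)\](.+?)\[\/\1\]/ig;
var greedy = /\[(b|u|i|s|center|code)\](.*)\[\/\1\]/ig;
var replacement = '<$1>$2</$1>';

// Empty element: ".+?" needs at least one character, ".*" does not.
var emptyLazy = '[u][/u]'.replace(lazy, replacement);
var emptyGreedy = '[u][/u]'.replace(greedy, replacement);

// Same tag nested in itself: lazy stops at the first [/b].
var selfNestedLazy = '[b][b]foo[/b][/b]'.replace(lazy, replacement);
var selfNestedGreedy = '[b][b]foo[/b][/b]'.replace(greedy, replacement);

// Two sibling elements: greedy swallows everything between the outermost pair.
var siblingsGreedy = '[b]a[/b] x [b]c[/b]'.replace(greedy, replacement);

console.log(emptyLazy, emptyGreedy, selfNestedLazy, selfNestedGreedy, siblingsGreedy);
```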
Yes, you will have to loop. Alternatively, since your tags look so much like HTML ones, you could replace [b] with <b> and [/b] with </b> separately. (.){1,}? is the same as (.*?) - that is, any symbols, least possible sequence length.
Updated: Thanks to MrP, (.){1,}? is (.)+?, my bad.
How about:
tagreg=/\[(.?)?(b|u|i|s|center|code)\]/gi;
"[b][i]helloworld[/i][/b]".replace(tagreg, "<$1$2>");
"[b]helloworld[/b]".replace(tagreg, "<$1$2>");
For me the above produces:
<b><i>helloworld</i></b>
<b>helloworld</b>
This appears to do what you want, and has the advantage of needing only a single pass.
Disclaimer: I don't code often in JS, so if I made any mistakes please feel free to point them out :-)
You are right about the inner pattern being troublesome.
((.){1,}?)
That is doing a captured match at least once and then the whole thing is captured. Every character inside your tag will be captured as a group.
You are also capturing your closing element name when you don't need it and are using {1} when that is implied. Below is a cleaned-up version:
/\[(b|u|i|s|center|code)](.+?)\[\/\1]/ig
Not sure about the other problem.
