Javascript regular expression (unbroken repetitions of a pattern)


Let's say that I have a given string in javascript - e.g., var s = "{{1}}SomeText{{2}}SomeText"; It may be very long (e.g., 25,000+ chars).
NOTE: I'm using "SomeText" here as a placeholder to refer to any number of characters of plain text. In other words, "SomeText" could be any plain text string which doesn't include {{1}} or {{2}}. So the above example could be var s = "{{1}}Hi there. This is a string with one { curly bracket{{2}}Oh, very nice to meet you. I also have one } curly bracket!"; And that would be perfectly valid.
The rules for it are simple:
It does not need to have any instances of {{2}}. However, if it does, then after that instance we cannot encounter another {{2}} unless we find a {{1}} first.
Valid examples:
"{{2}}SomeText"
"{{1}}SomeText{{2}}SomeText"
"{{1}}SomeText{{1}}SomeText{{2}}SomeText"
"{{1}}SomeText{{1}}SomeText{{2}}SomeText{{1}}SomeText"
"{{1}}SomeText{{1}}SomeText{{2}}SomeText{{1}}SomeText{{1}}SomeText"
"{{1}}SomeText{{1}}SomeText{{2}}SomeText{{1}}SomeText{{1}}SomeText{{2}}SomeText"
etc...
Invalid examples:
"{{2}}SomeText{{2}}SomeText"
"{{1}}SomeText{{2}}SomeText{{2}}SomeText"
"{{1}}SomeText{{2}}SomeText{{2}}SomeText{{1}}SomeText"
etc...
This seems like a relatively easy problem to solve - and indeed I could easily solve it without regular expressions, but I'm keen to learn how to do something like this with regular expressions. Unfortunately, I'm not even sure if "conditionals and lookaheads" is a correct description of the issue in this case.
NOTE: If a workable solution is presented that doesn't involve "conditionals and lookaheads" then I will edit the title.

It's probably easier to invert the condition. Try to match any text that contains two consecutive instances of {{2}}, and if it doesn't match that, it's good.
Using this strategy, your pattern can be as simple as:
/{\{2}}([^{]*){\{2}}/
Demonstration
This will match a literal {{2}}, followed by zero or more characters other than {, followed by a literal {{2}}.
Notice that the second { needs to be escaped; otherwise, the regex engine will treat the {2} as a quantifier on the previous { (i.e. {{2} matches exactly two { characters).
If you need to allow characters like { between the two {{2}}, you can use a pattern like this:
/{\{2}}((?!{\{1}}).)*{\{2}}/
Demonstration
This will match a literal {{2}}, followed by zero or more of any character, so long as those characters do not create a sequence like {{1}}, followed by a literal {{2}}.
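A minimal sketch of the inverted check in JavaScript, based on the second pattern above. The helper name isValid is made up for illustration, the braces are fully escaped for readability, and [\s\S] stands in for . on the assumption that the text may contain newlines:

```javascript
// Hypothetical helper: a string is valid when no {{2}} is followed by
// another {{2}} without a {{1}} somewhere in between.
function isValid(s) {
  // {{2}}, then any characters that do not start a {{1}}, then {{2}} again.
  const twoInARow = /\{\{2\}\}(?:(?!\{\{1\}\})[\s\S])*?\{\{2\}\}/;
  return !twoInARow.test(s);
}

console.log(isValid("{{1}}SomeText{{2}}SomeText"));              // true
console.log(isValid("{{1}}SomeText{{2}}SomeText{{2}}SomeText")); // false
```

The string is rejected as soon as the "two {{2}} in a row" pattern is found, which is usually easier to reason about than describing every valid shape.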

(({{1}}SomeText)+({{2}}SomeText)?)*
Broken down:
({{1}}SomeText)+ - 1 to many {{1}} instances (greedy match)
({{2}}SomeText)? - followed by an optional {{2}} instance
Then the whole thing is wrapped in ()* such that the sequence can appear 0 to many times in a row.
No conditionals or lookaheads needed.

You said you can have one instance of {2} first, right?
^(.(?!{2}))(.{2})?(?!{2})((.(?!{2})){1}(.(?!{2}))({2})?)$
Note if {2} is one letter replace all dots with [^{2}]

Related

Exclude list of string in validation - regex [duplicate]

I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?

A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
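As a quick illustration in JavaScript (the variable name noBar and the sample strings are made up for this sketch):

```javascript
// Matches a whole line only if "bar" does not occur anywhere in it.
const noBar = /^(?!.*bar).*$/;

console.log(noBar.test("foo baz"));   // true
console.log(noBar.test("foobarbaz")); // false
console.log(noBar.test("b a r"));     // true, the letters are not adjacent
```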

Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.

Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx rejected (as desired)
xxxSTRING2xxx rejected (as desired)
xxxSTRING3xxx rejected (as desired)

You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
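These are all meant to match anything that does not contain bar. Be careful with the last one, though: after a b it consumes the following character, so it wrongly accepts input such as bbar.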

The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]

I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'".

The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exists. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.

If it's truly a word, bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar that is on a word boundary, that is to say, separated from non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
^(?!.*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo

Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
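A rough JavaScript sketch of building such a pattern from a word list. Both helper names (escapeRegExp, notAnyOf) are invented for this example; the escaping step is only there so that words containing regex metacharacters are treated literally:

```javascript
// Escape regex metacharacters so each word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build ^(?!word1$|word2$|...).* from a list of excluded words.
function notAnyOf(words) {
  const alternatives = words.map((w) => escapeRegExp(w) + "$").join("|");
  return new RegExp("^(?!" + alternatives + ").*");
}

const allowed = notAnyOf(["bar", "foo", "banana"]);
console.log(allowed.test("bar"));     // false
console.log(allowed.test("barrier")); // true
console.log(allowed.test("disbar"));  // true
```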

I wish to complement the accepted answer and contribute to the discussion with my late answer.
@ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}

Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.

I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

Matching variable-term equations

I am trying to develop a regular expression to match the following equations:
(Price+10%+100+200)
(Price+20%+200)
(Price+30%)
(Price+100)
(Price-10%-100-200)
(Price-20%-200)
(Price-30%)
(Price-100)
My regex so far is...
/([(])+([P])+([r])+([i])+([c])+([e])+([+]|[-]){1}([\d])+([+]|[-])?([\d])+([%])?([)])/g
..., but it only matches the following equations:
(Price+100+10%)
(Price+100+100)
(Price+200)
(Price-100-10%)
(Price-100-100)
(Price-200)
Can someone help me understand how to make my pattern match the full set of equations provided?
Note: Parentheses and 'Price' are musts in the equations that the pattern must match.

Try this, which matches all the input strings provided in the question:
/\(Price([+-]\d+%?){1,3}\)/g
You can test it in a regex fiddle.
Things to note:
Only use parentheses where you want to group. Parentheses around single-possibility, fixed-quantity matches (e.g. ([P])) provide no value.
Use character classes (opened with [ and closed with ]) for multiple characters that can match at a position in the pattern (e.g. [+-]). Single-possibility character classes (e.g. [P]) similarly provide no value.
Yes, character classes (generally) implicitly escape regex special characters within them (e.g. ( in [(] vs. equivalent \( outside a character class), but to just escape regex special characters (i.e. to match them literally), you are better off not using a character class and just escaping them (e.g. \() – unless multiple characters should match at a position in the pattern (per the previous point to note).
The quantifier {1} is (almost) always useless: drop it.
The quantifier + means "one or more" as you probably know. However, in a series of cases where you used it (i.e. ([(])+([P])+([r])+([i])+([c])+([e])+), it would match many values that I doubt you expect (e.g. ((((((PPPrriiiicccceeeeee): basically, don't overuse it. Stop to consider whether you really want to match one or more of the character (class) or group to which + applies in the pattern.
To match a literal string without any regex special characters like Price, just use the literal string at the appropriate position in the pattern – e.g. Price in \(Price.
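To see the suggested pattern in action, here is a small JavaScript sketch. The variable names and the input string are made up; the last two entries in the input are deliberately malformed to show what is rejected:

```javascript
const pattern = /\(Price([+-]\d+%?){1,3}\)/g;

// Three well-formed equations, then one with no term and one with stray characters.
const input = "(Price+10%+100+200) (Price+30%) (Price-100) (Price) ((PPrice+1)";

const found = input.match(pattern);
console.log(found); // ["(Price+10%+100+200)", "(Price+30%)", "(Price-100)"]
```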

/\(Price[+-](\d)+(%)?([+-]\d+%?)?([+-]\d+%?)?\)/g
works on http://www.regexr.com/

/^[(Price]+\d+\d+([%]|[)])&/i
try at your own risk!

Negate random regular expression

Is there a way to negate any regular expression? I'm using regular expressions to validate input on a form. I'm now trying to create a button that sanitizes my input. Is there a way so I can use the regular expression used for the validating also for stripping the invalid characters?
I'm using this regex for validation of illegal characters
<input data-val-regex-pattern="[^|<>:\?'\*\[\]\=%\$\+,;~&\{\}]*" type="text" />
When clicking on a button next to it, I'm calling this function:
$('#button').click(function () {
var inputElement = $(this).prev();
var regex = new RegExp(inputElement.attr('data-val-regex-pattern'), 'g');
var value = inputElement.val();
inputElement.val(value.replace(regex, ''));
});
At the moment the javascript is doing the exact opposite of what I'm trying to accomplish. I need to find a way to 'reverse' the regex.
Edit: I'm trying to reverse the regex in the javascript function. The regex in the data-val-regex-pattern attribute is doing its job for validation.

To find the invalid characters, just remove the ^ from your regex. A caret at the start of a character class negates everything inside the brackets.
data-val-regex-pattern="[|<>:\?'\*\[\]\=%\$\+,;~&\{\}]*"
This will return the undesired characters so you can replace them.
Also, as you want to take off a lot of non-word characters, you could try a simpler regex. If you want to keep only word characters and spaces, you could match everything else with something like this:
data-val-regex-pattern="[^\w\s]*"

Your regex is:
[^|<>:\?'\*\[\]\=%\$\+,;~&\{\}]*
That means, it matches any non-invalid character multiple times.
Then you replace this for empty, so you leave only the bad characters.
Try this instead, without the negation (hat moved somewhere else):
[|^<>:\?'\*\[\]\=%\$\+,;~&\{\}]*
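Applied to the replace call from the question, the un-negated character class could be used roughly like this. This is only a sketch: the names illegalChars and sanitize are invented, the DOM/jQuery part is left out, and ^ is kept out of the class on the assumption that it should remain a legal character as in the original validation pattern. The trailing * is dropped because the g flag already removes every occurrence:

```javascript
// Same characters as the validation pattern, but without the leading ^,
// so the class now matches the illegal characters themselves.
const illegalChars = /[|<>:?'*\[\]=%$+,;~&{}]/g;

// Hypothetical helper: strip every illegal character from the input value.
function sanitize(value) {
  return value.replace(illegalChars, "");
}

console.log(sanitize("na<me>|va*lue")); // "namevalue"
console.log(sanitize("plain text"));    // "plain text"
```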

The following answer is to the general question of negating a regular expression. In your specific case you just need to negate a character group, or more precisely remove the negation of a character group - which is detailed in other answers.
Regular languages – those consisting of all strings matched entirely by some RE – are in fact closed under negation: there is another RE which matches exactly those strings the original RE does not. It is however not trivial to construct, which perhaps explains why RE implementations often do not offer a negation operator.
However the Javascript regexp language has extensions that make it more expressive than regular languages; in particular there is the construct of negative lookahead.
If R1 is a regexp then
^(?!.*(R1))
matches precisely the strings that do not contain a match for R1.
And
^(?!R1$)
matches precisely the strings where the whole string is not a match for R1.
I.e. negation.
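A small JavaScript sketch of both forms, using \d+ as a stand-in for R1 (the variable names are made up). Note the extra (?:...) around R1 in the second pattern: it is my addition, needed so that the $ applies to all of R1 when R1 contains an alternation:

```javascript
// R1 is a placeholder sub-pattern; \d+ is just an example.
const r1 = "\\d+";

// True for strings that contain no match for R1 anywhere.
const containsNoMatch = new RegExp("^(?!.*(" + r1 + "))");

// True for strings that are not, as a whole, a match for R1.
const wholeIsNotMatch = new RegExp("^(?!(?:" + r1 + ")$)");

console.log(containsNoMatch.test("abc"));  // true
console.log(containsNoMatch.test("a1c"));  // false
console.log(wholeIsNotMatch.test("123"));  // false
console.log(wholeIsNotMatch.test("123a")); // true
```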
For rewriting any substring not matching a given regexp, the above is insufficient. One would have to do something like
((?!R1).)*
Which would catch any substring not containing a subsubstring that matches R1. - But consideration of the edge cases show that this does not quite do what we are after. For example ((?!ab).)* matches "b" in "ab", because "ab" is not a substring of "b".
One can cheat, and make your regexp like;
(.*)(R1|$)
And rewrite to T1$2
Where T1 is the target string you want to rewrite to.
This should rewrite any portion of the string not matching R1 to T1. However I would be very careful about any edge cases for this. So much so that it might be better to write the regexp from scratch rather than trying a general approach.

In a regular expression, match one thing or another, or both

In a regular expression, I need to know how to match one thing or another, or both (in order). But at least one of the things needs to be there.
For example, the following regular expression
/^([0-9]+|\.[0-9]+)$/
will match
234
and
.56
but not
234.56
While the following regular expression
/^([0-9]+)?(\.[0-9]+)?$/
will match all three of the strings above, but it will also match the empty string, which we do not want.
I need something that will match all three of the strings above, but not the empty string. Is there an easy way to do that?
UPDATE:
Both Andrew's and Justin's below work for the simplified example I provided, but they don't (unless I'm mistaken) work for the actual use case that I was hoping to solve, so I should probably put that in now. Here's the actual regexp I'm using:
/^\s*-?0*(?:[0-9]+|[0-9]{1,3}(?:,[0-9]{3})+)(?:\.[0-9]*)?(\s*|[A-Za-z_]*)*$/
This will match
45
45.988
45,689
34,569,098,233
567,900.90
-9
-34 banana fries
0.56 points
but it WON'T match
.56
and I need it to do this.

The fully general method, given regexes /^A$/ and /^B$/ is:
/^(A|B|AB)$/
i.e.
/^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/
Note the others have used the structure of your example to make a simplification. Specifically, they (implicitly) factorised it, to pull out the common [0-9]* and [0-9]+ factors on the left and right.
The working for this is:
all the elements of the alternation end in [0-9]+, so pull that out: /^(|\.|[0-9]+\.)[0-9]+$/
Now we have the possibility of the empty string in the alternation, so rewrite it using ? (i.e. use the equivalence (|a|b) = (a|b)?): /^(\.|[0-9]+\.)?[0-9]+$/
Again, an alternation with a common suffix (\. this time): /^((|[0-9]+)\.)?[0-9]+$/
the pattern (|a+) is the same as a*, so, finally: /^([0-9]*\.)?[0-9]+$/
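The factorisation can be sanity-checked by running both ends of the derivation over the same inputs. This is only a spot check on a handful of made-up samples, not a proof of equivalence:

```javascript
const general = /^([0-9]+|\.[0-9]+|[0-9]+\.[0-9]+)$/;
const factored = /^([0-9]*\.)?[0-9]+$/;

const samples = ["234", ".56", "234.56", "", ".", "234.", "1.2.3", "abc"];

// Print each sample with the verdict of both patterns; the two columns should agree.
for (const s of samples) {
  console.log(JSON.stringify(s), general.test(s), factored.test(s));
}
```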

Nice answer by huon (and a bit of brain-twister to follow it along to the end). For anyone looking for a quick and simple answer to the title of this question, 'In a regular expression, match one thing or another, or both', it's worth mentioning that even (A|B|AB) can be simplified to:
A|A?B
Handy if B is a bit more complex.
Now, as c0d3rman's observed, this, in itself, will never match AB. It will only match A and B. (A|B|AB has the same issue.) What I left out was the all-important context of the original question, where the start and end of the string are also being matched. Here it is, written out fully:
^(A|A?B)$
Better still, just switch the order as c0d3rman recommended, and you can use it anywhere:
A?B|A
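The effect of the ordering is easy to see without anchors. In this sketch A is \d+ and B is \.\d+ (both chosen just for illustration), and A is wrapped in (?:...) so that the ? makes the whole of A optional instead of turning \d+ into a lazy \d+?:

```javascript
const a = "\\d+";
const b = "\\.\\d+";

const aFirst = new RegExp(`${a}|(?:${a})?${b}`); // A|A?B
const bFirst = new RegExp(`(?:${a})?${b}|${a}`); // A?B|A

const text = "x 234.56 y";
console.log(text.match(aFirst)[0]); // "234"
console.log(text.match(bFirst)[0]); // "234.56"
```

With A first, the engine is satisfied as soon as A matches and never tries to extend the match to AB.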

Yes, you can match all of these with such an expression:
/^[0-9]*\.?[0-9]+$/
Note, it also doesn't match the empty string (your last condition).

Sure. You want the optional quantifier, ?.
/^(?=.)([0-9]+)?(\.[0-9]+)?$/
The above is slightly awkward-looking, but I wanted to show you your exact pattern with some ?s thrown in. In this version, (?=.) makes sure it doesn't accept an empty string, since I've made both clauses optional. A simpler version would be this:
/^\d*\.?\d+$/
This satisfies your requirements, including preventing an empty string.
Note that there are many ways to express this. Some are long and some are very terse, but they become more complex depending on what you're trying to allow/disallow.
Edit:
If you want to match this inside a larger string, I recommend splitting on whitespace and testing the results with /^\d*\.?\d+$/. Otherwise, you'll risk either matching stuff like aaa.123.456.bbb or missing matches (trust me, you will. JavaScript's lack of lookbehind support ensures that it will be possible to break any pattern I can think of).
If you know for a fact that you won't get strings like the above, you can use word breaks instead of ^$ anchors, but it will get complicated because there's no word break between . and (a space).
/(\b\d+|\B\.)?\d*\b/g
That ought to do it. It will block stuff like aaa123.456bbb, but it will allow 123, 456, or 123.456. It will allow aaa.123.456.bbb, but as I've said, you'll need two steps if you want to comprehensively handle that.
Edit 2: Your use case
If you want to allow whitespace at the beginning, negative/positive marks, and words at the end, those are actually fairly strict rules. That's a good thing. You can just add them on to the simplest pattern above:
/^\s*[-+]?\d*\.?\d+[a-z_\s]*$/i
Allowing thousands groups complicates things greatly, and I suggest you take a look at the answer I linked to. Here's the resulting pattern:
/^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i
The \b ensures that the numeric part ends with a digit, and is followed by at least one whitespace.
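Here is a quick check of that last pattern against the examples from the question. The variable names and the three negative samples are my own additions:

```javascript
const numberWithWords = /^\s*[-+]?(\d+|\d{1,3}(,\d{3})*)?(\.\d+)?\b(\s[a-z_\s]*)?$/i;

const shouldMatch = [
  "45", "45.988", "45,689", "34,569,098,233", "567,900.90",
  "-9", "-34 banana fries", "0.56 points", ".56",
];
const shouldNotMatch = ["", "banana", "12abc"];

// Every line in the first loop should print true, every line in the second false.
for (const s of shouldMatch) console.log(JSON.stringify(s), numberWithWords.test(s));
for (const s of shouldNotMatch) console.log(JSON.stringify(s), numberWithWords.test(s));
```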

Maybe this helps (to give you the general idea):
(?:((?(digits).^|[A-Za-z]+)|(?<digits>\d+))){1,2}
This pattern matches characters, digits, or digits following characters, but not characters following digits.
The pattern matches aa, aa11, and 11, but not 11aa, aa11aa, or the empty string.
Don't be puzzled by the ".^", which means "a character followed by line start"; it is intended to prevent any match at all.
Be warned that this does not work with all flavors of regex, your version of regex must support (?(named group)true|false).

Struggling with regex to match only two of a character, not three

I need to match all occurrences of // in a string in a Javascript regex
It can't match /// or /
So far I have (.*[^\/])\/{2}([^\/].*)
which is basically "something that isn't /, followed by // followed by something that isn't /"
The approach seems to work apart from when the string I want to match starts with //
This doesn't work:
//example
This does
stuff // example
How do I solve this problem?
Edit: A bit more context - I am trying to replace // with !, so I am then using:
result = result.replace(myRegex, "$1 ! $2");

Replace two slashes that either begin the string or do not follow a slash,
and are followed by anything not a slash or the end of the string.
s=s.replace(/(^|[^/])\/{2}([^/]|$)/g,'$1!$2');
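One thing to watch with the pattern above is that the character after the // is consumed, so two pairs separated by a single character (a//b//c) are not both replaced. A possible variant, sketched here with a made-up helper name, swaps the trailing group for a negative lookahead so nothing after the slashes is consumed:

```javascript
// Hypothetical helper: replace every exact "//" (not "/" or "///") with "!".
function replaceDoubleSlash(s) {
  // Start of string or a non-slash, then "//", then anything but a third slash.
  return s.replace(/(^|[^/])\/\/(?!\/)/g, "$1!");
}

console.log(replaceDoubleSlash("//example"));        // "!example"
console.log(replaceDoubleSlash("stuff // example")); // "stuff ! example"
console.log(replaceDoubleSlash("a//b//c"));          // "a!b!c"
console.log(replaceDoubleSlash("a///b"));            // "a///b"
```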

It looks like it wouldn't work for example// either.
The problem is because you're matching // preceded and followed by at least one non-slash character. This can be solved by anchoring the regex, and then you can make the preceding/following text optional:
^(.*[^\/])?\/{2}([^\/].*)?$

Use negative lookahead/lookbehind assertions:
(.*)(?<!/)//(?!/)(.*)

Use this:
/([^/]*)(\/{2})([^/]*)/g
e.g.
alert("///exam//ple".replace(/([^/]*)(\/{2})([^/]*)/g, "$1$3"));
EDIT: Updated the expression as per the comment.
/[/]{2}/
e.g:
alert("//example".replace(/[/]{2}/, ""));

This does not answer the OP's question about using regex. However, some of the original comments suggested using .replaceAll, not everyone who reads the question in the future will want to use regex, people might mistakenly assume that regex is the only alternative, and these details don't fit in a comment, so here's a poor man's non-regex approach:
Temporarily replace the three contiguous characters with something that would never naturally occur — really important when dealing with user-entered values.
Replace the remaining two contiguous characters using .replaceAll().
Return the original three contiguous characters.
For instance, let's say you wanted to remove all instances of ".." without affecting occurrences of "...".
var cleansedText = $(this).text().toString()
.replaceAll("...", "☰☸☧")
.replaceAll("..", "")
.replaceAll("☰☸☧", "...")
;
$(this).text(cleansedText);
Perhaps not as fast as regex for longer strings, but works great for short ones.
