Regex to replace single backslashes, excluding those followed by certain chars - javascript

I have a regex expression which removes any backslashes from a string if not followed by one of these characters: \ / or }.
It should turn this string:
foo\bar\\batz\/hi
Into this:
foobar\\batz\/hi
But the problem is that it is dealing with each backslash as it goes along. So it follows the rule in that it removes that first backslash, and ignores the 2nd one because it is followed by another backslash. But when it gets to the 3rd one, it removes it, because it isn't followed by another.
My current code looks like this: str.replace(/\\(?!\\|\/|\})/g,"")
But the resulting string looks like this: foobar\batz\/hi
How do I get it to skip the 3rd backslash? Or is it a case of doing some sort of explicit negative search & replace type thing? Eg. replace '\', but don't replace '\\', '\/' or '\}'?
Please help! :)
EDIT
Sorry, I should have explained - I am using javascript, so I don't think I can do negative lookbehinds...

You need to watch out for an escaped backslash followed by a single backslash. Or better: an odd number of successive backslashes. In that case, you need to keep the even number of backslashes intact and only remove the last one (if it is not followed by a \, / or }).
You can do that with the following regex:
(?<!\\)(?:((\\\\)*)\\)(?![\\/}])
and replace it with:
$1
where the first match group holds the even number of backslashes at the start of the match.
A short explanation:
(?<!\\) # looking behind, there can't be a '\'
(?:((\\\\)*)\\) # match an odd number of backslashes and store the even number in group 1
(?![\\/}]) # looking ahead, there can't be a '\', '/' or '}'
In plain English that would read:
match an odd number of backslashes, (?:((\\\\)*)\\), not followed by \ or } or /, (?![\\/}]), and not preceded by a backslash, (?<!\\).
A demo in Java (remember that the backslashes are double escaped!):
String s = "baz\\\\\\foo\\bar\\\\batz\\/hi";
System.out.println(s);
System.out.println(s.replaceAll("(?<!\\\\)(?:((\\\\\\\\)*)\\\\)(?![\\\\/}])", "$1"));
which will print:
baz\\\foo\bar\\batz\/hi
baz\\foobar\\batz\/hi
EDIT
And a solution that does not need look-behinds would look like:
([^\\])((\\\\)*)\\(?![\\/}])
and is replaced by:
$1$2
where $1 is the non-backslash char at the start, and $2 is the even (or zero) number of backslashes following that non-backslash char. Note that because ([^\\]) must consume a character, this version does not match a backslash at the very start of the string.
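Since the question is about JavaScript, here is a minimal sketch of the lookbehind-free approach as a JavaScript replace call. The function name stripLoneBackslashes is made up for illustration, the lookahead uses } as in the question, and the leading group is written as (^|[^\\]) rather than ([^\\]) so that a backslash at the very start of the string is also handled; that last change is an assumption on my part, not part of the answer above.

```javascript
// Hypothetical helper: removes a backslash unless it is followed by \, / or }.
// Group 1 is the preceding non-backslash char (or the start of the string),
// group 2 is the even run of backslashes that must be kept.
function stripLoneBackslashes(str) {
  return str.replace(/(^|[^\\])((?:\\\\)*)\\(?![\\\/}])/g, "$1$2");
}

// In JS source every literal backslash is doubled:
// this is the string foo\bar\\batz\/hi
const input = "foo\\bar\\\\batz\\/hi";
console.log(stripLoneBackslashes(input)); // foobar\\batz\/hi

// Three backslashes in a row: the escaped pair stays, the third one goes.
console.log(stripLoneBackslashes("baz\\\\\\foo")); // baz\\foo
```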

The required regex is as simple as \\.
You need to know however, that the second argument to replace() can be a function like so:
result = string.replace(/\\./g, function (ab) { // ab is the matched portion of the input string
  var b = ab.charAt(1);
  switch (b) { // if char after backslash
    case '\\': case '}': case '/': // ..is one of these
      return ab; // keep original string
    default: // else
      return b; // replace by second char
  }
});
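To see the callback approach against the string from the question, here is a small self-contained sketch. The wrapper name unescapeLoose is hypothetical; the logic is the same as in the snippet above. Because /\\./ consumes the backslash together with the character after it, an escaped pair \\ is matched as one unit and kept, which is why the third backslash is no longer a problem. A lone backslash at the very end of the string has no following character, so it is not matched and stays in place.

```javascript
// Hypothetical wrapper around the callback-based replace.
function unescapeLoose(str) {
  return str.replace(/\\./g, function (ab) {
    var b = ab.charAt(1); // the character right after the backslash
    // Keep \\, \} and \/ as they are; otherwise drop the backslash.
    return (b === "\\" || b === "}" || b === "/") ? ab : b;
  });
}

// The JS literal below is the string foo\bar\\batz\/hi
console.log(unescapeLoose("foo\\bar\\\\batz\\/hi")); // foobar\\batz\/hi
```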

You need a lookahead, like you have, and also a lookbehind, to ensure that you don't delete the second backslash (which clearly doesn't have a special character after it). Try this as your regex:
(?<![\\])[\\](?![\\\/\}])
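Lookbehind assertions were added to JavaScript in ES2018, so on current engines the lookbehind patterns in this thread can be used directly. The following is a sketch, assuming such an engine; the variable names are made up. It compares the simple lookbehind pattern above with the fuller one from the first answer (written here with } in the lookahead, as in the question). Both give the same result for the question's input, but only the fuller one handles an odd run of three backslashes.

```javascript
// Requires an engine with lookbehind support (ES2018 or later).
const simple = /(?<!\\)\\(?![\\\/}])/g;
const full = /(?<!\\)((?:\\\\)*)\\(?![\\\/}])/g;

const input = "foo\\bar\\\\batz\\/hi"; // the string foo\bar\\batz\/hi
console.log(input.replace(simple, ""));  // foobar\\batz\/hi
console.log(input.replace(full, "$1"));  // foobar\\batz\/hi

const odd = "baz\\\\\\foo"; // the string baz\\\foo
console.log(odd.replace(simple, ""));    // baz\\\foo (third backslash is skipped)
console.log(odd.replace(full, "$1"));    // baz\\foo
```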


Remove Last Instance Of Character From String - Javascript - Revisited

According to the accepted answer from this question, the following is the syntax for removing the last instance of a certain character from a string (In this case I want to remove the last &):
function remove (string) {
  string = string.replace(/&([^&]*)$/, '$1');
  return string;
}
console.log(remove("height=74&width=12&"));
But I'm trying to fully understand why it works.
According to regex101.com,
/&([^&]*)$/
& matches the character & literally (case sensitive)
1st Capturing Group ([^&]*)
Match a single character not present in the list below [^&]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
& matches the character & literally (case sensitive)
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
So if we're matching the character & literally with the first &:
Then why are we also "matching a single character not present in the following list"?
Seems counter productive.
And then, "$ asserts position at the end of the string" - what does this mean? That it starts searching for matches from the back of the string first?
And finally, what is the $1 doing in the replaceValue? Why is it $1 instead of an empty string? ""
1- The solution for that problem I think is different to the solution you want:
That regex will replace the last "&" no matter where it is, in the middle or in the end of the string.
If you apply this regex to these two examples you will see that the first gets "incorrectly" replaced:
height=74&width=12&test=1
height=74&width=12&test=1&
They get replaced as :
height=74&width=12test=1
height=74&width=12&test=1
So to really replace the last "&" the only thing you need to do is :
string.replace(/&$/, '');
Now, if you want to replace the last occurrence of "&" no matter where it is, I will explain that regex:
$1 represents a capturing group: everything matched inside ([^&]*) is captured into $1. This is an oversimplification.
&([^&]*)$
& will match a literal "&". Then, in the following capturing group, the regex will look for any number (0 to infinite) of characters (NOT EQUAL TO "&", explained later) until the end of the string or line (depending on the flag you use in the regex; /m makes $ match at line ends). Anything captured in this capturing group will go to $1 when you apply the replacement.
So, if you apply this logic in your mind you will see that it will always match the last & and replace it with everything to its right, which cannot contain a "&".
&(<nothing-like-a-&>*)<until-we-reach-the-end> is replaced by whatever was found inside (<nothing-like-a-&>*), that is, $1. Because * means 0 or more times, the capturing group $1 will sometimes be empty.
NOT EQUAL TO part:
The regex uses a [^], in simple terms [] represents a group of independent characters, example: [ab] or [ba] represents the same, it will always look for "a" or "b". Inside this you can also look for ranges like 0 to 9 like this [0-9ba], it will always match anything from 0 to 9, a or b.
The "^" here [^] represents a negation of the content, so, it will match anything not in this group, like [^0-9] will always match anything that is not a number. In your regex [^&] it was used for looking for anything that is not a "&"
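The difference between the two patterns discussed in this answer can be checked with a short sketch. The variable names are made up; the sample strings are the two from the answer.

```javascript
const lastAmp = /&([^&]*)$/; // last & anywhere, keeping what follows it via $1
const trailingAmp = /&$/;    // only an & at the very end of the string

const a = "height=74&width=12&test=1";
const b = "height=74&width=12&test=1&";

console.log(a.replace(lastAmp, "$1"));   // height=74&width=12test=1
console.log(b.replace(lastAmp, "$1"));   // height=74&width=12&test=1
console.log(a.replace(trailingAmp, "")); // height=74&width=12&test=1 (unchanged)
console.log(b.replace(trailingAmp, "")); // height=74&width=12&test=1

// With "" instead of "$1", the text after the last & would be dropped as well.
console.log(a.replace(lastAmp, ""));     // height=74&width=12
```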

unable to parse - in Regular expression in Javascript

I am a bit new to the regular expressions in Javascript.
I am trying to write a function called parseRegExpression()
which parses the attributes passed and generates a key/value pairs
It works fine with the input:
"iconType:plus;iconPosition:bottom;"
But it is not able to parse the input:
"type:'date';locale:'en-US';"
Basically the - sign is being ignored. The code is at:
http://jsfiddle.net/visibleinvisibly/ZSS5G/
The Regular Expression key value pair is as below
/[a-z|A-Z|-]*\s*:\s*[a-z|A-Z|'|"|:|-|_|\/|\.|0-9]*\s*;|[a-z|A-Z|-]*\s*:\s*[a-z|A-Z|'|"|:|-|_|\/|\.|0-9]*\s*$/gi;
There are a few problems:
A | inside a character class means a literal | character, not an alternation.
A . inside a character class means a literal . character, so there's no need to escape it.
A - as the first or last character inside a character class means a literal - character, otherwise it means a character range.
There's no need to use [a-zA-Z] when you use the case-insensitive modifier (i); [a-z] is enough.
The only difference between your alternatives is the last bit; this can be simplified significantly by just limiting your alternation to that part which is different.
This should be equivalent to your original pattern:
/[a-z-]*\s*:\s*[a-z0-9'":_\/.-]*\s*(?:;|$)/gi
You can avoid the regex:
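var test1 = "iconType:plus;iconPosition:bottom;";
var test2 = "type:'date';locale:'en-US';";
function toto(str) {
  var result = [];
  var temp = str.split(';');
  // The input ends with ';', so the last element of temp is empty and is skipped.
  for (var i = 0; i < temp.length - 1; i++) {
    // Split at the first ':' only, so that a value may itself contain ':'.
    var pos = temp[i].indexOf(':');
    result[i] = [temp[i].slice(0, pos), temp[i].slice(pos + 1)];
  }
  return result;
}
console.log(toto(test1));
console.log(toto(test2));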
Inside a character set atom [...] the pipe char | is just a regular char and doesn't mean "or".
A character set atom lists characters or ranges you want to accept (or exclude if the character set starts with ^) and "or" is implicit.
You can use a backslash in a character set if you need to include/exclude a close bracket ], the ^ sign, the dash - that is used for ranges, the backslash \ itself, an unprintable character or if you want to use a non-ASCII unicode char specifying the code instead of literally.
Regular expression syntax, however, also lets you avoid backslash-escaping in a character set atom by placing the character in a position where it cannot have the special meaning, for example a dash - as first or last in the set (it cannot mean a range there).
Note also that if the values you need to match can be quoted strings, including backslash escaping, the regular expression is more complex, for example
'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*"
matches a single-quoted or double-quoted string including backslash escaping, the meaning being:
A single quote '
Zero or more of either:
Any char except the single quote ' or the backslash \
A pair composed of a backslash \ followed by any char
A single quote '
or the same with double quotes " instead.
Note that the groups have been delimited with (?:...) instead of plain (...) to avoid capture
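As a quick check of the quoted-string pattern, here is a sketch in JavaScript. The sample input is invented for illustration and contains one single-quoted and one double-quoted value, each with escaped quotes inside.

```javascript
const quoted = /'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*"/g;

// Hypothetical input; as plain text it reads:
//   type:'it\'s';label:"a \"b\" c";
const sample = "type:'it\\'s';label:\"a \\\"b\\\" c\";";

const matches = sample.match(quoted);
console.log(matches.length); // 2
console.log(matches[0]);     // 'it\'s'
console.log(matches[1]);     // "a \"b\" c"
```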
It doesn't match hyphens because it is interpreting |-| as a range that starts at | and ends at |. (I would have expected that to be treated as a syntax error, but there you have it. It works the same in every regex flavor I've tried, too.)
Have a look at this regex:
/(?:^|;)([a-z-]*)\s*:\s*([a-z'":_\/.0-9-]*)\s*(?=;|$)/ig
As suggested by the other responders, I collapsed it to one alternative, removed the unneeded pipes, and escaped the hyphen by moving it to the end. I also anchored it at the beginning as well as the end. Or anchored it as well as I can, anyway. I used a lookahead to match the trailing semicolon so it will still be there when the next match starts. It's far from foolproof, but it should work okay as long as the input is well formed.
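Both points can be tried out in a few lines. This is only a sketch: the name parsePairs is made up, and it relies on String.prototype.matchAll, which needs an ES2020 engine. The first two lines show the |-| range effect on a reduced character class; the rest applies the regex above to the two inputs from the question.

```javascript
// Inside [...], |-| is a range from | to |, so the hyphen is not in the class.
console.log(/^[a|-|b]$/.test("-")); // false
console.log(/^[ab|-]$/.test("-"));  // true: a trailing - is a literal hyphen

const pairRe = /(?:^|;)([a-z-]*)\s*:\s*([a-z'":_\/.0-9-]*)\s*(?=;|$)/ig;

// Hypothetical helper: returns an array of [key, value] pairs.
function parsePairs(str) {
  return Array.from(str.matchAll(pairRe), (m) => [m[1], m[2]]);
}

console.log(parsePairs("iconType:plus;iconPosition:bottom;"));
// [ [ 'iconType', 'plus' ], [ 'iconPosition', 'bottom' ] ]
console.log(parsePairs("type:'date';locale:'en-US';"));
// [ [ 'type', "'date'" ], [ 'locale', "'en-US'" ] ]
```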
Replace the regular expressions in your code as follows:
regExpKeyValuePair = /[-a-z]*\s*:\s*[-a-z'":_\/.0-9]*\s*;|[-a-z]*\s*:\s*[-a-z'":_\/.0-9]*\s*$/gi;
regExpKey = /[-a-z]*/gi;
regExpValue = /[-a-z:_\/.0-9]*/gi;
You don't need to escape . inside [].
There is no need to put | between elements inside [].
Because you are using the /i flag, [A-Z] is not needed.
- should be at the beginning or at the end of the class; otherwise it denotes a range.

can someone help to explain this regular expression in javascript?

This code is used to get rid of the MIME type in the raw data, but I cannot understand how it works:
content.replace(/^[^,]*,/ , '')
It seems quite different from Java. Any help will be appreciated.
Your MIME type is probably at the beginning of your raw data and separated from the rest by a comma.
This regex says take everything from the beginning (^) that is NOT a comma ([^,]*) (the star makes it as many characters until there is a comma) and take the comma itself (,). Then replace it by nothing ('').
This only affects the first occurrence, because the leading ^ requires the match to start at the beginning of the string.
The first thing you need to know is that there are regex literals in JavaScript, constructed by pairs of slashes. So like "..." is a string, /.../ is a regex. That's actually the only difference your code shows as compared to a Java regex.
Then, [abc] within a regex is called a character class, meaning "one character out of a, b or c". Conversely, [^abc] is a negated character class, meaning "one character except a, b or c".
So your sample means:
/ # Start of regex literal
^ # Start the match at the start of the string
[^,]* # Match any number of characters except commas
, # Match a comma
/ # End of regex literal
The regular expression is the text between the two forward slashes. The first caret (^) means at the beginning of the string. The brackets mean a character class, and the caret inside the brackets negates it, so the class matches any character except a comma. The asterisk after the closing bracket means match zero or more of the characters defined by the character class (which again is any character except the comma), and finally the last comma means match the comma after all this. It is then used in a replace function, so the matching result will be replaced with the second parameter, in your case an empty string.
Basically it matches the first characters up to and including the first comma in the 'content' variable and then replaces it with an empty string.
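A small sketch of what that looks like in practice. The sample value is an assumption: the question does not show the raw data, so a base64 data URL is used here as a plausible input where the MIME type precedes the first comma.

```javascript
// Hypothetical sample: a data URL whose MIME type and encoding come before the first comma.
const content = "data:image/png;base64,iVBORw0KGgo=";
const payload = content.replace(/^[^,]*,/, "");
console.log(payload); // iVBORw0KGgo=

// Only the prefix up to the first comma is removed; later commas are untouched.
console.log("a,b,c".replace(/^[^,]*,/, "")); // b,c

// With no comma at all there is no match and the string is returned unchanged.
console.log("no comma here".replace(/^[^,]*,/, "")); // no comma here
```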

Regular Expression - Match any character except +, empty string should also be matched

I am having a bit of trouble with one part of a regular expression that will be used in JavaScript. I need a way to match any character other than the + character, an empty string should also match.
[^+] is almost what I want except it does not match an empty string. I have tried [^+]* thinking: "any character other than +, zero or more times", but this matches everything including +.
Add a {0,1} to it so that it will only match zero or one times, no more no less:
[^+]{0,1}
Or, as FailedDev pointed out, ? works too:
[^+]?
As expected, testing with Chrome's JavaScript console shows no match for "+" but does match other characters:
x = "+"
y = "A"
x.match(/[^+]{0,1}/)
[""]
y.match(/[^+]{0,1}/)
["A"]
x.match(/[^+]?/)
[""]
y.match(/[^+]?/)
["A"]
[^+] means "match any single character that is not a +"
[^+]* means "match any number of characters that are not a +" - which almost seems like what I think you want, except that it will match zero characters if the first character (or even all of the characters) are +.
Use anchors to make sure that the expression validates the ENTIRE STRING:
^[^+]*$
means:
^ # assert at the beginning of the string
[^+]* # any character that is not '+', zero or more times
$ # assert at the end of the string
If you're just testing the string to see if it doesn't contain a +, then you should use:
^[^+]*$
This will match only if the ENTIRE string has no +.
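The effect of the anchors can be verified with a few test calls. This is a sketch with a made-up variable name:

```javascript
const noPlus = /^[^+]*$/;

console.log(noPlus.test(""));    // true: the empty string contains no +
console.log(noPlus.test("abc")); // true
console.log(noPlus.test("a+b")); // false: a + appears somewhere in the string
console.log(noPlus.test("+"));   // false

// Without anchors, the pattern succeeds on any input by matching zero characters.
console.log(/[^+]*/.test("+"));  // true
```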

Help interpreting a javascript Regex

I have found the following expression which is intended to modify the id of a cloned html element e.g. change contactDetails[0] to contactDetails[1]:
var nel = 1;
var s = $(this).attr(attribute);
s.replace(/([^\[]+)\[0\]/, "$1["+nel+"]");
$(this).attr(attribute, s);
I am not terribly familiar with regex. I have tried to interpret it with the help of The Regex Coach, but I am still struggling. It appears that ([^\[]+) matches one or more characters which are not '[' and \[0\]/ matches [0]. The / in the middle I interpret as an 'include both', so I don't understand why the author has even included the first expression.
I don't understand what the $1 in the replace string is. If I use the Regex Coach replace functionality with simply [0] as the search and 1 as the replace, I get the correct result; however, if I change the javascript to s.replace(/\[0\]/, "["+nel+"]"); the string s remains unchanged.
I would be grateful for any advice as to what the original author intended and help in finding a solution which will successfully replace a number in square brackets anywhere within a search string.
Find
/ # Signifies the start of a regex expression like " for a string
([^\[]+) # Capture the character that isn't [ 1 or more times into $1
\[0\] # Find [0]
/ # Signifies the end of a regex expression
Replace
"$1[" # Insert the item captured above And [
+nel+ # New index
"]" # Close with ]
To create an expression that captures any digit, you can replace the 0 with \d+ which will match a digit 1 or more times.
s.replace(/([^\[]+)\[\d+\]/, "$1["+nel+"]");
The $1 is a backreference to the first group in the regex. Groups are the pieces inside (). So, in this case $1 will be replaced by whatever the ([^\[]+) part matched.
If the string was contactDetails[0] the resulting string would be contactDetails[1].
Note that this regex only replaces 0s inside square brackets. If you want to replace any number you will need something like:
([^\[]+)\[\d+\]
The \d matches any digit character. \d+ then becomes any sequence of at least one digit.
But your code will still not work, because Javascript strings are immutable. That means they can't be changed once created. The replace method returns a new string, instead of changing the original one. You should use:
s = s.replace(...)
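A short sketch of the immutability point, without jQuery so that it runs on its own. The strings are sample values; the g flag in the last call is an addition for the "anywhere within a string" case from the question.

```javascript
const nel = 1;
let s = "contactDetails[0]";

// replace() returns a new string; calling it without using the result changes nothing.
s.replace(/\[\d+\]/, "[" + nel + "]");
console.log(s); // contactDetails[0]

// Assigning the result back gives the intended update.
s = s.replace(/\[\d+\]/, "[" + nel + "]");
console.log(s); // contactDetails[1]

// With the g flag, every bracketed number in the string is replaced.
const nested = "contacts[0].phones[0]".replace(/\[\d+\]/g, "[" + nel + "]");
console.log(nested); // contacts[1].phones[1]
```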
It looks like it replaces an index of 0 with 1.
For example: array[0] goes to array[1]
Explanation:
([^\[]+) - This part means save everything that is not a [ into variable $1
\[0\] - This part requires that the text saved by the first part is followed by a literal [0]
"$1["+nel+"]" - Print out the contents of $1 (loaded from part 1) and add the brackets with the value of nel. (in your example nel = 1)
Square braces define a set of characters to match. [abc] will match the letters a, b or c.
By adding the caret you are now specifying that you want characters not in the set. [^abc] will match any character that is not an a, b or c.
Because square braces have special meaning in RegExps you need to escape them with a backslash if you want to match one. [ starts a character set, \[ matches a brace. (Same concept for closing braces.)
So, [^\[]+ matches 1 or more characters that are not [.
Wrapping that in parentheses "captures" the matched portion of the string (in this case "contactDetails") so that you can use it in the replacement.
$1 uses the "captured" string (i.e. "contactDetails") in the replacement string.
This regex matches "something" followed by a [0].
"something" is identified by the expression [^\[]+ which matches all characters that are not a [. You can see the () around this expression, because the match is reused with $1, later. The rest of your regex, that is \[0\], just matches the index [0]. The author had to write \[ and \] because [ and ] are special characters for regular expressions and have to be escaped.
$1 is a reference to the value of the first pair of parentheses. In your case the value of
[^\[]+
which matches one or more characters which are not a '['
The remaining part of the regexp matches string '[0]'.
So if s is 'foobar[0]' the result will be 'foobar[1]'.
[^\[] will match any character that is not [, the '+' means one or more times. So [^\[]+ will match contactDetails. The parentheses will capture this for later use. The '\' is an escape symbol so the end \[0\] will match [0]. The replace string will use $1, which is what was captured in the parentheses, and add the new index.
Your interpretation of the regular expression is correct. It is intended to match one or more characters which are not [, followed by a literal [0]. And used in the replace method, the match would be replaced with the match of the first grouping (that’s what $1 is replaced with) together with the sequence [ followed by the value of nel and ] (that’s how "$1["+nel+"]" is to be interpreted).
And again, a simple s.replace(/\[0\]/, "["+nel+"]") does the same. Except if there is nothing in front of [0], because in that case the first regex wouldn’t find a match.
