Javascript regex to remove punctuation [duplicate] - javascript

This question already has answers here:
How can I strip all punctuation from a string in JavaScript using regex?
(16 answers)
Closed 7 years ago.
I'm having trouble with my regex. I'm sure something is not escaping properly.
function regex(str) {
str = str.replace(/(~|`|!|#|#|$|%|^|&|*|\(|\)|{|}|\[|\]|;|:|\"|'|<|,|\.|>|\?|\/|\\|\||-|_|+|=)/g,"")
document.getElementById("innerhtml").innerHTML = str;
}
<div id="innerhtml"></div>
<p><input type="button" value="Click Me" onclick="regex('test # . / | ) this');">

* and + need to be escaped.
function regex (str) {
return str.replace(/(~|`|!|#|#|$|%|^|&|\*|\(|\)|{|}|\[|\]|;|:|\"|'|<|,|\.|>|\?|\/|\\|\||-|_|\+|=)/g,"")
}
var testStr = 'test # . / | ) this'
document.write('<strong>before: </strong>' + testStr)
document.write('<br><strong>after: </strong>' + regex(testStr))

The accepted answer on the proposed duplicate question doesn't cover all the punctuation characters in the ASCII range. (The comment on the accepted answer does, though.)
A better way to write this regex is to put the characters into a character class.
/[~`!##$%^&*(){}\[\];:"'<,.>?\/\\|_+=-]/g
In a character class, to match the literal characters:
^ does not need escaping, unless it is at the beginning of the character class.
- should be placed at the beginning of the character class (after the ^ in a negated character class) or at the end of a character class.
] has to be escaped to be specified as a literal character. [ does not need to be escaped (but I escape it anyway, as a habit, since some languages require [ to be escaped inside a character class).
$, *, +, ?, (, ), {, }, |, . lose their special meaning inside a character class.
In a RegExp literal, / has to be escaped.
In RegExp, since \ is the escape character, if you want to specify a literal \, you need to escape it as \\.
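As a quick check of the rules above, here is a minimal sketch of the character-class approach. The names PUNCTUATION and stripPunctuation are made up for illustration, and the @ in the set is my own addition on the assumption that it was intended alongside #.

```javascript
// Character class covering ASCII punctuation.
// Inside the class only ], \ and / (because this is a literal) are escaped;
// the hyphen sits at the very end so it is taken literally.
// Note: "@" is an assumed addition, not taken from the pattern above.
const PUNCTUATION = /[~`!@#$%^&*(){}[\];:"'<,.>?\/\\|_+=-]/g;

// Hypothetical helper: returns str with every matched character removed.
function stripPunctuation(str) {
  return str.replace(PUNCTUATION, "");
}

console.log(JSON.stringify(stripPunctuation("test # . / | ) this")));
console.log(stripPunctuation("a-b_c! $5 ^ok^"));
```

Only the punctuation is removed, so the spaces that surrounded it in the first input are all kept.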

Related

Remove excessive blank lines [duplicate]

This question already has answers here:
Why do regex constructors need to be double escaped?
(5 answers)
Closed 4 years ago.
Here is an attempt to remove any excessive blank lines in a string.
I'm trying to understand why the second approach doesn't work for lines which contain whitespace.
Demo.
var string = `
foo
bar (there are whitespaced lines between bar and baz. I replaced them with dots)
....................
.......................
...........
baz
`;
// It works
string = string.replace(/^(\s*\n){2,}/gm, '\n');
// Why it doesn't work?
var EOL = string.match(/\r\n/gm) ? '\r\n' : '\n';
var regExp = new RegExp('^(\s*' + EOL + '){2,}', 'gm');
string = string.replace(regExp, EOL);
alert(string);
Your \s needs to be changed to \\s. Just putting \s is the same as s.
In strings (enclosed in quotes), the backslash has a special meaning. For example, \n is the newline character. There are a few others that you may or may not have heard of, e.g. \b, \t, \v. It would be a bad language design choice to make only the few defined ones special and treat the non-existent \s as an actual backslash followed by an s, because that would be inconsistent, a source of errors, and not future-proof. That's why, when you want a backslash in a string, you escape the backslash as \\.
In your first example, you use / characters to delimit the regular expression. This is not considered a string bound by the above rules.
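To see the difference concretely, here is a small sketch. The sample text and the names broken and fixed are mine, chosen only for illustration.

```javascript
// In a string literal, "\s" is not a recognised escape, so the backslash is dropped.
console.log('\s' === 's');   // true
console.log('\\s'.length);   // 2: a backslash followed by "s"

const EOL = '\n';
// The pattern text here is ^(s*<newline>){2,} because the string literal ate the backslash.
const broken = new RegExp('^(\s*' + EOL + '){2,}', 'gm');
// The pattern text here is ^(\s*<newline>){2,}, which is what was intended.
const fixed = new RegExp('^(\\s*' + EOL + '){2,}', 'gm');

// Sample input (assumed): whitespace-only lines between "foo" and "bar".
const text = 'foo\n   \n\t\n\nbar';
console.log(JSON.stringify(text.replace(broken, EOL))); // unchanged
console.log(JSON.stringify(text.replace(fixed, EOL)));  // whitespace-only lines collapsed
```

The broken version only recognises lines made of the letter s, which is why lines containing spaces or tabs are left alone.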

Do I need to escape dash character in regex? [duplicate]

This question already has answers here:
Regex - Should hyphens be escaped? [duplicate]
(3 answers)
Closed 7 years ago.
I'm trying to understand whether the dash character - needs to be escaped with a backslash in a regex.
Consider this:
var url = '/user/1234-username';
var pattern = /\/(\d+)\-/;
var match = pattern.exec(url);
var id = match[1]; // 1234
As you can see in the above regex, I'm trying to extract the id number from the url. I also escaped the - character in my regex using a backslash \. But when I remove that backslash, everything still works fine. In other words, both of these are fine:
/\/(\d+)\-/
/\/(\d+)-/
Now I want to know, which one is correct (standard)? Do I need to escape dash character in regex?
You only need to escape the dash character if it could otherwise be interpreted as a range indicator (which can be the case inside a character class).
/-/ # matches "-"
/[a-z]/ # matches any letter in the range between ASCII a and ASCII z
/[a\-z]/ # matches "a", "-" or "z"
/[a-]/ # matches "a" or "-"
/[-z]/ # matches "-" or "z"
- has a special meaning only inside a character class [], so outside of one you don't need to escape it.
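The cases listed above can be checked directly in a console. This is just a sketch; the checks object is a name I made up to collect the results.

```javascript
// Each entry pairs one of the patterns above with an input that shows its behaviour.
const checks = {
  rangeMatchesM: /[a-z]/.test("m"),         // "m" lies in the range a..z
  escapedRejectsM: /[a\-z]/.test("m"),      // only "a", "-" or "z" can match
  escapedMatchesDash: /[a\-z]/.test("-"),
  trailingDash: /[a-]/.test("-"),           // hyphen at the end is literal
  leadingDash: /[-z]/.test("-"),            // hyphen at the start is literal
  // Outside a character class the hyphen needs no escape at all.
  idFromUrl: /\/(\d+)-/.exec("/user/1234-username")[1],
};

console.log(checks);
```

Every boolean comes out true except escapedRejectsM, and idFromUrl is the string "1234".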

Why have two '\' in Regex? [duplicate]

This question already has answers here:
Why do regex constructors need to be double escaped?
(5 answers)
Extra backslash needed in PHP regexp pattern
(4 answers)
Regex to replace single backslashes, excluding those followed by certain chars
(3 answers)
Closed 7 years ago.
function trim(str) {
var trimer = new RegExp("(^[\\s\\t\\xa0\\u3000]+)|([\\u3000\\xa0\\s\\t]+\x24)", "g");
return String(str).replace(trimer, "");
}
why have two '\' before 's' and 't'?
and what's this "[\s\t\xa0\u3000]" mean?
You're using a literal string.
In a literal string, the \ character is used to escape some other chars, for example \n (a new line) or \" (a double quote), and it must be escaped itself as \\. So when you want your string to have \s, you must write \\s in your string literal.
Thankfully JavaScript provides a better solution, Regular expression literals:
var trimer = /(^[\s\t\xa0\u3000]+)|([\u3000\xa0\s\t]+\x24)/g
why have two '\' before 's' and 't'?
In regex the \ is an escape which tells regex that a special character follows. Because you are using it in a string literal you need to escape the \ with \.
and what's this "[\s\t\xa0\u3000]" mean?
It means to match one of the following characters:
\s white space.
\t tab character.
\xa0 non breaking space.
\u3000 wide space.
This function is inefficient because each time it is called it is converting a string to a regex and then it is compiling that regex. It would be more efficient to use a Regex literal not a string and compile the regex outside the function like the following:
var trimRegex = /(^[\s\t\xa0\u3000]+)|([\u3000\xa0\s\t]+$)/g;
function trim(str) {
return String(str).replace(trimRegex, "");
}
Furthermore, \s matches any whitespace, including tabs, the wide space and the non-breaking space, so you could simplify the regex to the following:
var trimRegex = /(^\s+)|(\s+$)/g;
Browsers now implement a trim function, so you can use that and add a polyfill for older browsers.
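As a rough comparison of the simplified regex with the built-in String.prototype.trim, here is a sketch. The names regexTrim and sample are mine.

```javascript
// Simplified pattern from above, compiled once outside the function.
const trimRegex = /(^\s+)|(\s+$)/g;

// Hypothetical helper equivalent to the regex-based trim discussed above.
function regexTrim(str) {
  return String(str).replace(trimRegex, "");
}

// Sample padded with a wide space (\u3000), a non-breaking space (\xa0), a tab and a space.
const sample = "\u3000\xa0\t hello world \t\xa0\u3000";

console.log(JSON.stringify(regexTrim(sample)));
console.log(JSON.stringify(sample.trim()));
console.log(regexTrim(sample) === sample.trim());
```

Both approaches strip the same characters here, since \s already covers the tab, the non-breaking space and the wide space.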

RegExp (^|\\?|&) in javascript

Could you please help me understand this javascript RegExp :
cbreg = new RegExp('((^|\\?|&)' + cbkey + ')=([^&]+)')
// where cbkey is a string
I am confused by the (^|\\?|&) portion. What could that mean?
Thanks !
Well, first of all, given that the regex is created from a string literal, the double backslashes become only a single backslash in the resulting regex (because that's how escaping works in a string literal):
(^|\?|&)
The | means OR, so then you have:
^ - start of line, or
\? - a question mark, or
& - an ampersand
A question mark on its own has special meaning within a regex, but an escaped question mark matches an actual question mark.
The parentheses mean it matches one of those choices before matching the next part of the regex. Without the parentheses, the third choice would include the next part of the expression (whatever is in cbkey).
| means "OR". So that means: ^ (start of line) OR ? OR &.
It means either (|) the start of the string (^), a literal question mark (\? because the question mark needs to be escaped in regexes, and \\? because the backslash needs to be escaped in strings) or an ampersand (&).
It searches for a group (the parentheses form a group) that matches either the start of the string (^), or (|) the character '?', or (|) the character '&'.
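To make that concrete, here is a sketch of how such a regex might be used to pull a parameter out of a query string. The function name getParam and the sample inputs are my own, and it assumes cbkey contains no regex special characters.

```javascript
// Hypothetical helper built around the regex from the question.
// Assumes cbkey is a plain word with no regex special characters.
function getParam(query, cbkey) {
  const cbreg = new RegExp('((^|\\?|&)' + cbkey + ')=([^&]+)');
  const match = cbreg.exec(query);
  // Group 2 is the (^|\?|&) alternative, group 3 is the value after "=".
  return match ? match[3] : null;
}

console.log(getParam('?callback=foo&x=1', 'callback')); // key preceded by "?"
console.log(getParam('x=1&callback=bar', 'callback'));  // key preceded by "&"
console.log(getParam('callback=baz', 'callback'));      // key at the start of the string
console.log(getParam('?mycallback=foo', 'callback'));   // no match: key is part of a longer name
```

The (^|\?|&) group is what stops the key from matching in the middle of a longer parameter name, as the last call shows.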

List of all characters that should be escaped before put in to RegEx?

Could someone please give a complete list of special characters that should be escaped?
I fear I don't know some of them.
Take a look at PHP.JS's implementation of PHP's preg_quote function, that should do what you need:
http://phpjs.org/functions/preg_quote:491
The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
According to this site, the list of characters to escape is
[, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket ).
In addition to that, you need to escape characters that are interpreted by the Javascript interpreter as end of the string, that is either ' or ".
Based off of Tatu Ulmanen's answer, my solution in C# took this form:
private static List<string> RegexSpecialCharacters = new List<string>
{
"\\",
".",
"+",
"*",
"?",
"[",
"^",
"]",
"$",
"(",
")",
"{",
"}",
"=",
"!",
"<",
">",
"|",
":",
"-"
};
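// Start from the input and accumulate the replacements, so each pass builds on the previous one.
rgxPattern = input;
foreach (var rgxSpecialChar in RegexSpecialCharacters)
rgxPattern = rgxPattern.Replace(rgxSpecialChar, "\\" + rgxSpecialChar);
Note that I have moved '\' ahead of '.' so that it is processed first; failing to process the backslashes first will double up the '\'s that were added for the other characters.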
Edit
Here is a javascript translation
var regexSpecialCharacters = [
"\\", ".", "+", "*", "?",
"[", "^", "]", "$", "(",
")", "{", "}", "=", "!",
"<", ">", "|", ":", "-"
];
regexSpecialCharacters.forEach(rgxSpecChar =>
input = input.replace(new RegExp("\\" + rgxSpecChar,"gm"), "\\" +
rgxSpecChar))
Inside a character set, to match a literal hyphen -, it needs to be escaped when not positioned at the start or the end. For example, given the position of the last hyphen in the following pattern, it needs to be escaped:
[a-z0-9\-_]+
But it doesn't need to be escaped here:
[a-z0-9_-]+
If you fail to escape a hyphen, the engine will attempt to interpret it as a range between the preceding character and the next character (just like a-z matches any character between a and z).
Additionally, /s do not need to be escaped inside a character set (though they do need to be escaped when outside a character set). So, the following syntax is valid:
const pattern = /[/]/;
I was looking for this list in regards to ESLint's "no-useless-escape" setting for regular expressions, and found that some of the characters mentioned do not need to be escaped in a JS regular expression. The longer list in the other answer here is for PHP, which does require the additional characters to be escaped.
In this github issue for ESLint, about halfway down, user not-an-aardvark explains why the character referenced in the issue is a character that should maybe be escaped.
In JavaScript, a character that NEEDS to be escaped is a syntax character, that is, one of these:
^ $ \ . * + ? ( ) [ ] { } |
The response to the github issue I linked to above includes explanation about "Annex B" semantics (which I don't know much about) which allows 4 of the above mentioned characters to be UNescaped: ) ] { }.
Another thing to note is that escaping a character that doesn't require escaping won't do any harm (except maybe if you're trying to escape the escape character). So, my personal rule of thumb is: "When in doubt, escape"
The problem:
const character = '+'
new RegExp(character, 'gi') // error
Smart solutions:
// with babel-polyfill
// Warning: will be removed from babel-polyfill v7
const character = '+'
const escapeCharacter = RegExp.escape(character)
new RegExp(escapeCharacter, 'gi') // /\+/gi
// ES5
const character = '+'
const escapeCharacter = escapeRegExp(character)
new RegExp(escapeCharacter, 'gi') // /\+/gi
function escapeRegExp(string){
return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}
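For what it's worth, here is a sketch of how an escape function like this might be used to search for arbitrary user-supplied text. The helper countOccurrences and the sample strings are my own; escapeRegExp is repeated so the snippet runs on its own.

```javascript
// Same escape function as above, repeated so this snippet is self-contained.
function escapeRegExp(string) {
  return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: counts literal occurrences of a non-empty needle in text.
function countOccurrences(text, needle) {
  const pattern = new RegExp(escapeRegExp(needle), 'g');
  const matches = text.match(pattern);
  return matches ? matches.length : 0;
}

console.log(escapeRegExp('(?:hello)'));            // every syntax character gets a backslash
console.log(countOccurrences('1+1=2, 2+2=4', '+')); // "+" is matched literally
console.log(countOccurrences('a.b.c axb', '.'));    // "." no longer matches "x"
```

Without the escaping step, the '+' needle would throw and the '.' needle would match every character.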
The answer here has become a bit more complicated with the introduction of Unicode regular expressions in JavaScript (that is, regular expressions constructed with the u flag). In particular:
Non-unicode regular expressions support "identity" escapes; that is, if a character does not have a special interpretation in the regular expression pattern, then escaping it does nothing. This implies that /a/ and /\a/ will match in an identical way.
Unicode regular expressions are more strict -- attempting to escape a character not considered "special" is an error. For example, /\a/u is not a valid regular expression.
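The difference between the two modes can be probed at runtime without triggering a parse error in your own source by going through the RegExp constructor. This is a sketch; isValidPattern is a name I made up.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags.
function isValidPattern(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything other than a pattern error is unexpected
  }
}

console.log(isValidPattern('\\a', ''));  // identity escape is tolerated without the u flag
console.log(isValidPattern('\\a', 'u')); // rejected: "a" is not a syntax character
console.log(isValidPattern('\\!', 'u')); // rejected: "!" is not a syntax character
console.log(isValidPattern('\\+', 'u')); // accepted: "+" is a syntax character
```

Only the two middle calls report false, which matches the stricter behaviour described for the u flag.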
The set of specially-interpreted characters can be divined from the ECMAScript standard; for example, with ECMAScript 2021, https://262.ecma-international.org/12.0/#sec-patterns, we see the following "syntax" characters:
SyntaxCharacter :: one of
^ $ \ . * + ? ( ) [ ] { } |
In particular, in contrast to other answers, note that the !, <, >, : and - are not considered syntax characters. Instead, these characters might only have a special interpretation in specific contexts.
For example, the < and > characters only have a special interpretation when used as a capturing group name; e.g. as in
/(?<name>\w+)/
And because < and > are not considered syntax characters, escaping them is an error in unicode regular expressions.
> /\</
/\</
> /\</u
Uncaught SyntaxError: Invalid regular expression: /\</: Invalid escape
Additionally, the - character is only specially interpreted within a character class, when used to express a character range, as in e.g.
/[a-z]/
It is valid to escape a - within a character class, but not outside a character class, for unicode regular expressions.
> /\-/
/\-/
> /\-/u
Uncaught SyntaxError: Invalid regular expression: /\-/: Invalid escape
> /[-]/
/[-]/
> /[\-]/u
/[\-]/u
For a regular expression constructed using the / / syntax (as opposed to new RegExp()), interior slashes (/) would need to be escaped, but this is required for the JavaScript parser rather than the regular expression itself, to avoid ambiguity between a / acting as the end marker for a pattern versus a literal / in the pattern.
> /\//.test("/")
true
> new RegExp("/").test("/")
true
Ultimately though, if your goal is to escape characters so they are not specially interpreted within a regular expression, it should suffice to escape only the syntax characters. For example, if we wanted to match the literal string (?:hello), we might use:
> /\(\?:hello\)/.test("(?:hello)")
true
> /\(\?:hello\)/u.test("(?:hello)")
true
Note that the : character is not escaped. It might seem necessary to escape the : character because it has a special interpretation in the pattern (?:hello), but because it is not considered a syntax character, escaping it is unnecessary. (Escaping the preceding ( and ? characters is enough to ensure : is not interpreted specially.)
Above code snippets were tested with:
$ node -v
v16.14.0
$ node -p process.versions.v8
9.4.146.24-node.20
