I've seen the various posts regarding this topic, but I'm getting a strange result when I do the following:
var dirtyString = '<>I\really|\re\ad?"the/wh\ole*:da|\y?.'
var cleanString = dirtyString.replace(/[\/:*?"<>|.]/g, "");
console.log(cleanString);
It removes all the illegal characters, but the "r" letters are also removed. In the console log I'm getting "Ieallyeadthewholeday". It seems that "\" before "r" erases the "r". "\" isn't erasing other letters it comes before. Am I missing something?
If you try console.log(dirtyString), you will see that the "r" characters are "missing" there too.
This is because '\r' is actually an escape sequence for the Carriage Return character (code 13). Your replace() does nothing to it. It is still there; it just isn't displayed. Try playing with String.charAt() and String.charCodeAt() and you will see that the character is still there.
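A minimal sketch of that check, using a shortened version of the string from the question (the variable names are only illustrative). It also shows String.raw, which is one way to keep the backslash and the "r" as separate characters:

```javascript
// '\r' inside an ordinary string literal is ONE character: carriage return (code 13).
const dirty = 'I\really';
const codes = Array.from(dirty, (ch) => ch.charCodeAt(0));

console.log(dirty.length); // 7 characters, not 8
console.log(codes[1]);     // 13

// String.raw leaves escape sequences unprocessed, so the backslash survives.
const raw = String.raw`I\really`;
console.log(raw.length);   // 8
```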
As a side note, you are trying to remove "blacklisted" characters, and blacklisting is almost never the right approach. As you can see in your own case, you forgot to blacklist the '\r' character (and many others). Whitelisting is much safer. For example, you may decide that you accept only Latin letters and digits, then remove everything not whitelisted: var cleanString = dirtyString.replace(/[^a-z0-9]/gi, "");
\r is the Carriage Return character. If you want a backslash followed by an r then you need to escape the backslash: \\r.
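\y is not a recognised escape sequence, so JavaScript drops the backslash and treats it as a plain y (which is why the other backslashes vanish from your output without taking a letter with them). Other programming languages, like C#, will instead raise a compiler error about an unrecognised escape sequence.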
Further confounding things: most regular-expression syntaxes have their own backslash escape sequences that are distinct from the hosting language's, such as the character classes \W, \d etc. Inside a JavaScript regex literal (/.../) they work as written, because the literal is not processed as a string. When the pattern is supplied as a string (for example to new RegExp()), "\d" would silently become "d", so in this author's opinion it makes sense to escape the backslashes there ("\\d") to make things really clear to the reader, or if you're wanting to make your regexes portable between languages.
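A small sketch of how the two layers of escaping interact in JavaScript, comparing a regex literal with patterns supplied as strings (the variable names are illustrative):

```javascript
// A regex literal hands \d to the regex engine untouched.
const literal = /\d+/;

// In a string literal, "\d" collapses to "d" before RegExp ever sees it,
// so the backslash has to be doubled.
const wrong = new RegExp('\d+');  // same as /d+/
const right = new RegExp('\\d+'); // same as /\d+/

console.log(literal.test('42')); // true
console.log(wrong.test('42'));   // false
console.log(right.test('42'));   // true
console.log(wrong.source);       // d+
```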
How to rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match hyphen along with the existing characters?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
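The four cases above can be checked quickly. This sketch uses JavaScript regex literals purely for illustration; the question itself does not say which regex flavor is in use:

```javascript
// Each entry: [pattern, input, expected result of pattern.test(input)]
const cases = [
  [/^[-]$/, '-', true],
  [/^[abc-]$/, '-', true],
  [/^[-abc]$/, '-', true],
  [/^[ab-d]$/, 'c', true],  // b-d is a range, so c is included
  [/^[ab-d]$/, '-', false], // the hyphen only denotes the range here
];

const results = cases.map(([re, input]) => re.test(input));
console.log(results);
```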
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
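For comparison, here is a sketch of the same character class in JavaScript rather than C♯. It assumes an engine with Unicode property escapes (ES2018 or later), where the "u" flag is required for \p{...}. Add \s inside the brackets if whitespace should be accepted too:

```javascript
// The "u" flag enables \p{...} property escapes in JavaScript.
const unicodeClass = /^[\p{L}\p{Nd}\p{Pd}!$*]+$/u;

console.log(unicodeClass.test('é'));      // letter outside ASCII
console.log(unicodeClass.test('\u0663')); // ARABIC-INDIC DIGIT THREE
console.log(unicodeClass.test('\u2013')); // EN DASH, dash punctuation
console.log(unicodeClass.test('-'));      // HYPHEN-MINUS is dash punctuation too
console.log(unicodeClass.test('_'));      // underscore is not in the set
```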
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all the same. A hyphen between two ranges is treated as a literal character. The regex [a-z0-9-+()]+ also allows a hyphen.
Use "\p{Pd}" without quotes to match any type of hyphen. The '-' character is just one type of hyphen which also happens to be a special character in Regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");
NB. I only want to know if it's a valid application of an unescaped hyphen in the regex definition. It's not a question about matching email, the meaning of hyphen or backslash, quantifiers or anything else. Also, please note that the linked answer doesn't really discuss the validity issue between escaped/unescaped hyphen.
Usually I declare the regex for matching email addresses like this.
var emailPattern = /^[a-z.\-_]+#[a-z]+[.]{1}[a-z]{2,3}$/;
emailPattern.test('ss.a_a-#ass.com');
Now, by mistake, a colleague of mine forgot the escape character and still made it work, which surprised me, because of the interval meaning of the hyphen. It looks like this.
var weirdPattern = /^[a-z._-]+#[a-z]+[.]{1}[a-z]{2,3}$/;
weirdPattern.test('ss.a_a-#ass.com');
Apparently, it works because the hyphen is the last character in the brackets. My question is if this is just a happy coincidence or if it's a valid syntax? Have I been regexing wrong my whole life?
Hyphens inside a character class are used to denote a range. However, when a hyphen is placed at the beginning or at the end of the character class, there is no need to escape it.
Note that, in some browsers, hyphens at any position in the character class are still considered as range metacharacters, so it is best practice to always escape it.
Quoting from regular-expressions.info
The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen. [^-x] and [^x-] match any character that is not an x or a hyphen. Hyphens at other positions in character classes where they can't form a range may be interpreted as literals or as errors. Regex flavors are quite inconsistent about this.
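A quick way to convince yourself in a given engine is to run both patterns from the question over a few inputs and compare the results. This is only a sketch; the sample strings other than the first are made up for illustration:

```javascript
const escaped = /^[a-z.\-_]+#[a-z]+[.]{1}[a-z]{2,3}$/;
const trailing = /^[a-z._-]+#[a-z]+[.]{1}[a-z]{2,3}$/;

// Hypothetical sample inputs, some valid and some not.
const samples = ['ss.a_a-#ass.com', 'a-b#c.de', 'A#b.com', 'a+b#c.com'];

// true when both patterns give the same verdict for every sample
const agree = samples.every((s) => escaped.test(s) === trailing.test(s));
console.log(agree);

// A hyphen right after the negating caret is literal as well.
console.log(/^[^-x]$/.test('-')); // false: the hyphen is excluded
```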
I'm trying to construct a regular expression to treat delimited speech marks (\") as a single character.
The following code compiles fine, but terminates on trying to initialise rgx, throwing the error Abort trap: 6 using libc++.
std::regex rgx("[[.\\\\\".]]");
std::smatch results;
std::string test_str("\\\"");
std::regex_search(test_str, results, rgx);
If I remove the [[. .]], it runs fine, results[0] returning \" as intended, but as said, I'd like for this sequence to be usable as a character class.
Edit: Ok, I realise now that my previous understanding of collated sequences was incorrect, and the reason it wouldn't work is that \\\\\" is not defined as a sequence. So my new question: is it possible to define collated sequences?
So I figured out where I was going wrong and thought I'd leave this here in case anyone stumbles across it.
You can specify a passive group of characters with (?:sequence), allowing quantifiers to be applied as with a character class. Perhaps not exactly what I'd originally asked, but fulfils the same purpose, in my case at least.
To match a string beginning and ending with double quotation marks (including these characters in the results), but allowing delimited quotation marks within the string, I used the expression
\"(?:[^\"\\\\]+|(?:\\\\\\\\)+|\\\\\")*\"
which says to grab as many characters as possible, provided the characters are not quotation marks or backslashes, then, if this does not match, to firstly attempt to match an even number of backslashes (to allow delimiting of this character), or secondly a delimited quotation mark. This non-capturing group is matched as many times as possible, stopping only when it reaches a \".
I couldn't comment on the efficiency of this, but it definitely works.
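For readers who find the doubled C++ escaping hard to follow, here is a sketch of the same idea as a JavaScript regex literal, where only the regex-level escaping remains. It assumes the first alternative is meant to exclude just quotation marks and backslashes; the sample inputs are made up:

```javascript
// " then any mix of: plain characters, pairs of backslashes, or \" then a closing "
const quoted = /"(?:[^"\\]+|(?:\\\\)+|\\")*"/;

// String.raw keeps the backslashes in the sample exactly as typed.
const input = String.raw`say "a \"b\" c" end`;
const match = input.match(quoted);
console.log(match && match[0]); // the quoted part, escaped quotes included

const doubled = String.raw`"x\\"`; // ends with an escaped backslash, then the closing quote
console.log(quoted.test(doubled));
```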
I need a JS regular expression which should allow only the word having alphanumeric, dot and hyphen.
Let me know if this is correct.
var regex = /^[a-zA-Z_0-9/.-]+$/;
Almost. That will also allow underscores and slashes. Remove those from your character class:
var regex = /^[a-zA-Z0-9.-]+$/;
This will also not match the empty string. That may be what you want, but it also may not be what you want. If it's not what you want, change + to *.
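A short sketch of how the corrected pattern behaves, including the empty-string case (the sample inputs are illustrative):

```javascript
const strict = /^[a-zA-Z0-9.-]+$/;     // at least one character
const allowEmpty = /^[a-zA-Z0-9.-]*$/; // empty string accepted too

console.log(strict.test('file-1.txt')); // true
console.log(strict.test('a_b'));        // false: underscore no longer allowed
console.log(strict.test('a/b'));        // false: slash no longer allowed
console.log(strict.test(''));           // false
console.log(allowEmpty.test(''));       // true
```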
The first simplification I'd make is to use the "word character" shorthand '\w'. In JavaScript it is equivalent to 'a-zA-Z0-9_', so it is shorter but also admits the underscore, and it already covers what the "digit character" shorthand '\d' matches. In some other languages '\w' also includes accented alphabetic characters.
Also, although dot is special in most places in regular expressions, it's not special inside square brackets, and shouldn't be quoted there. (Besides, the single-character quote character is back-slash, not forward-slash. That forward-slash of yours inside the brackets is the same character that begins and ends the RE. ES5 and later engines accept it unescaped inside square brackets, where it simply matches a literal slash, but older engines may treat it as prematurely terminating the RE and report a parse error!) Since we're completely throwing it away, it no longer matters whether it should be forward-slash or back-slash, quoted or bare.
And as you've noticed, hyphen has a special meaning of "range" inside brackets (ex: a-z), so if you want a literal hyphen you have to do something a little different. By convention that something is to put the literal hyphen first inside the brackets.
So my result would be var regex = /^[-.\w\d]+$/;
(As you've probably noticed, there's almost always more than one way to express a regular expression so it works, and RE weenies spend as much time on a) economy of expression and b) run-time performance as they do on getting it "correct". In other words, you can ignore much of what I've just said, as it doesn't really matter to you. I think all that really matters is a) getting rid of that extraneous forward-slash and b) moving the literal hyphen to be the very first character inside the square brackets.)
(Another thought: very frequently when accepting alphabetic characters and hyphens, underscore is acceptable too ...so did you really mean to have that underscore after all?)
(Yet another thought: sometimes the very first character of an identifier must be an alpha, in which case what you probably want is var regex = /^[a-zA-Z][-.\w\d]*$/; since \w in first position would also accept a digit or an underscore. You may want a different rule for the very first character in any case, as the naive recipe above would allow "-" and "." as legitimate words of length one.)
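A quick check of what \w actually accepts in JavaScript, plus a variant that insists on a letter in first position. The pattern names and the letter-first variant are illustrative assumptions, not part of the original answer:

```javascript
const word = /^[-.\w]+$/;                // hyphen first, so it is literal
const letterFirst = /^[a-zA-Z][-.\w]*$/; // hypothetical "must start with a letter" rule

console.log(/^\w$/.test('7')); // true: \w covers digits
console.log(/^\w$/.test('_')); // true: \w covers the underscore
console.log(/^\w$/.test('é')); // false: \w is ASCII-only in JavaScript

console.log(word.test('-'));            // true: a lone hyphen passes
console.log(letterFirst.test('-'));     // false
console.log(letterFirst.test('a-1.b')); // true
```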
I'm writing a function that takes a prospective filename and validates it in order to ensure that no system disallowed characters are in the filename. These are the disallowed characters: / \ | * ? " < >
I could obviously just use string.indexOf() to search for each special char one by one, but that's a lot longer than it would be to just use string.search() using a regular expression to find any of those characters in the filename.
The problem is that most of these characters are considered to be part of describing a regular expression, so I'm unsure how to include those characters as actually being part of the regex itself. For example, the / character in a Javascript regex tells Javascript that it is the beginning or end of the regex. How would one write a JS regex that functionally behaves like so: filename.search(\ OR / OR | OR * OR ? OR " OR < OR >)
Put your stuff in a character class like so:
[/\\|*?"<>]
You're gonna have to escape the backslash, but the other characters lose their special meaning. Also, RegExp's test() method is more appropriate than String.search in this case.
filenameIsInvalid = /[/\\|*?"<>]/.test(filename);
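Wrapped in a function, the check might look like the sketch below. The function name and sample filenames are illustrative, and the forward slash is escaped here only as a precaution for older engines:

```javascript
// Returns true when the name contains any of: / \ | * ? " < >
function filenameIsInvalid(filename) {
  return /[\/\\|*?"<>]/.test(filename);
}

console.log(filenameIsInvalid('report.txt')); // false
console.log(filenameIsInvalid('a/b'));        // true
console.log(filenameIsInvalid('a\\b'));       // true (one backslash in the actual string)
console.log(filenameIsInvalid('what?'));      // true
```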
Include a backslash before the special characters [\^$.|?*+(){}, for instance, like \$
You can also search for a character by specified ASCII/ANSI value. Use \xFF where FF are 2 hexadecimal digits. Here is a hex table reference. http://www.asciitable.com/ Here is a regex reference http://www.regular-expressions.info/reference.html
The correct syntax of the regex is:
/^[^\/\\|\*\?"<>]+$/
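[^...] is a negated character class: it matches any single character except the ones listed. Because the whole pattern is anchored with ^ and $, String.match() returns null as soon as the string contains one of the listed characters. So validating a string means comparing the result against null.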
Demo: jsFiddle.
Demo #2: Comparing against null.
The first string is valid; the second is invalid, hence null.
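A comparison against null along those lines could be sketched as follows; the two sample strings are made up for illustration and are not taken from the original demos:

```javascript
const validName = /^[^\/\\|\*\?"<>]+$/;

const good = 'notes.txt'.match(validName);
const bad = 'no*tes.txt'.match(validName);

console.log(good !== null); // true: no disallowed character found
console.log(bad === null);  // true: the asterisk makes the match fail
```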
But obviously, you need to escape regex characters that are used in the matching. To escape a character that has a special meaning in a regex, put a backslash before it, e.g. \*, \/, \$, \?.
You'll need to escape the special characters. In JavaScript this is done by using the \ (backslash) character.
I'd recommend, however, using something like XRegExp, which will handle the escaping for you if you wish to match a string literal (something that is lacking in JavaScript's native regex support).
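If pulling in a library is not an option, a small hand-written helper is a common alternative. The sketch below is not XRegExp's API; escapeRegExp is a hypothetical name for a helper that backslash-escapes every regex metacharacter:

```javascript
// Hypothetical helper: prefixes each regex metacharacter with a backslash.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const needle = 'a.b*c';
const re = new RegExp(escapeRegExp(needle));

console.log(escapeRegExp(needle)); // a\.b\*c
console.log(re.test('xa.b*cx'));   // true: matched literally
console.log(re.test('aXbbc'));     // false: . and * are no longer special
```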