Regular expression in JS for alphanumeric, dot and hyphen - javascript

I need a JS regular expression which should allow only the word having alphanumeric, dot and hyphen.
Let me know if this is correct.
var regex = /^[a-zA-Z_0-9/.-]+$/;

Almost. That will also allow underscores and slashes. Remove those from your range:
var regex = /^[a-zA-Z0-9.-]+$/;
This will also not match the empty string. That may be what you want, but it also may not be what you want. If it's not what you want, change + to *.
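A quick way to check the corrected pattern is to run it against a few inputs. This is a minimal sketch; the helper name isValidWord and the sample strings are illustrative and not part of the question.

```javascript
// Allows only ASCII letters, digits, dots and hyphens; rejects the empty string.
const wordPattern = /^[a-zA-Z0-9.-]+$/;

// Hypothetical helper name, used only for illustration.
function isValidWord(input) {
  return wordPattern.test(input);
}

console.log(isValidWord("file-name.v2")); // true
console.log(isValidWord("under_score"));  // false: underscore is not in the class
console.log(isValidWord("a/b"));          // false: slash is not in the class
console.log(isValidWord(""));             // false: "+" needs at least one character
```

Swapping + for * in wordPattern would make the last call return true.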

The first simplification I'd make is to use the "word character" shorthand '\w'. In JavaScript it is exactly '[A-Za-z0-9_]', so it is shorter than spelling out 'a-zA-Z0-9', but it also admits the underscore, and it already covers everything the "digit character" shorthand '\d' matches. Note that in JavaScript '\w' does not match accented alphabetic characters, although it does in some other regex flavors.
Also, although dot is special in most places in regular expressions, it's not special inside square brackets, and needn't be escaped there. (Besides, the escape character is back-slash, not forward-slash. That forward-slash of yours inside the brackets is the same character that begins and ends the RE; some older engines could treat it as the end of the RE and report a parse error, while current JavaScript engines accept it inside brackets as a literal slash, which you don't want to allow.) Since we're completely throwing it away, it no longer matters whether it should be forward-slash or back-slash, quoted or bare.
And as you've noticed, hyphen has a special meaning of "range" inside brackets (ex: a-z), so if you want a literal hyphen you have to do something a little different. By convention that something is to put the literal hyphen first inside the brackets.
So my result would be var regex = /^[-.\w\d]+$/;
(As you've probably noticed, there's almost always more than one way to express a regular expression so it works, and RE weenies spend as much time on a) economy of expression and b) run-time performance as they do on getting it "correct". In other words, you can ignore much of what I've just said, as it doesn't really matter to you. I think all that really matters is a) getting rid of that extraneous forward-slash and b) moving the literal hyphen to be the very first character inside the square brackets.)
(Another thought: very frequently when accepting alphabetic characters and hyphens, underscore is acceptable too ...so did you really mean to have that underscore after all?)
(Yet another thought: sometimes the very first character of an identifier must be an alpha, in which case what you probably want is var regex = /^[a-zA-Z][-.\w\d]*$/; since \w on its own would also let a digit or an underscore come first. You may want a different rule for the very first character in any case, as the naive recipe above would allow "-" and "." as legitimate words of length one.)
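To see what \w actually covers in JavaScript, and how a separate rule for the first character behaves, here is a small sketch. The isIdentifier helper and the "first character must be an ASCII letter" rule are assumptions made for illustration; adjust the classes to your own requirements.

```javascript
// In JavaScript, \w is exactly [A-Za-z0-9_].
const singleWordChar = /^\w$/;
console.log(singleWordChar.test("_"));      // true: underscore is included
console.log(singleWordChar.test("7"));      // true: digits are included, so \d adds nothing
console.log(singleWordChar.test("\u00e9")); // false: accented letters are not included

// Assumed rule: an ASCII letter first, then letters, digits, dots or hyphens.
const identifierPattern = /^[a-zA-Z][-.a-zA-Z0-9]*$/;

// Hypothetical helper name, used only for illustration.
function isIdentifier(input) {
  return identifierPattern.test(input);
}

console.log(isIdentifier("abc-1.2")); // true
console.log(isIdentifier("-"));       // false: must start with a letter
console.log(isIdentifier("9lives"));  // false: must start with a letter
```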

Split String Mathematical Operators Regex [duplicate]

How to rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match hyphen along with the existing characters?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
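The four cases above can be checked mechanically. The sketch below uses JavaScript, whose character-class rules agree with the list for these patterns; the matched helper and the sample characters are illustrative.

```javascript
const samples = ["-", "a", "b", "c", "d"];

// Returns the sample characters that the given single-character pattern accepts.
function matched(pattern) {
  return samples.filter((ch) => pattern.test(ch)).join("");
}

console.log(matched(/^[-]$/));    // "-"
console.log(matched(/^[abc-]$/)); // "-abc"
console.log(matched(/^[-abc]$/)); // "-abc"
console.log(matched(/^[ab-d]$/)); // "abcd" (here the hyphen forms the range b-d)
```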
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
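The answer above is written with C♯ in mind. As a rough JavaScript counterpart, the same property escapes are available in regex literals that carry the u flag (ES2018 and later). The variable names and sample strings below are illustrative assumptions.

```javascript
// Unicode-aware class: any letter, decimal digit, dash punctuation, or one of ! $ *
// The u flag is required for \p{...} to be recognised in JavaScript.
const unicodeClass = /^[\p{L}\p{Nd}\p{Pd}!$*]+$/u;

// ASCII-only class with the same intent, for comparison.
const asciiClass = /^[a-zA-Z0-9!$*-]+$/;

// "café", an EN DASH (U+2013), then "2024".
const sample = "caf\u00e9\u20132024";

console.log(unicodeClass.test(sample)); // true
console.log(asciiClass.test(sample));   // false: é and the en dash are not ASCII
console.log(unicodeClass.test("a-b"));  // true: HYPHEN-MINUS is dash punctuation too
console.log(unicodeClass.test("a b"));  // false: whitespace is not in this class
```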
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all equivalent. A hyphen placed between two ranges is treated as a literal character. The regex [a-z0-9-+()]+ also allows a hyphen.
Use \p{Pd} to match any kind of dash punctuation. The '-' character is just one member of that category, and it also happens to be a special character inside a regex character class.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

My javascript regular expression does not work sometimes [duplicate]

NB. I only want to know if it's a valid application of an unescaped hyphen in the regex definition. It's not a question about matching email, the meaning of the hyphen or backslash, quantifiers or anything else. Also, please note that the linked answer doesn't really discuss the validity issue between escaped and unescaped hyphens.
Usually I declare the regex for matching email addresses like this.
var emailPattern = /^[a-z.\-_]+#[a-z]+[.]{1}[a-z]{2,3}$/;
emailPattern.test('ss.a_a-#ass.com');
Now, by mistake, a colleague of mine forgot the escape character and still made it work, which surprised me, because of the interval meaning of the hyphen. It looks like this.
var weirdPattern = /^[a-z._-]+#[a-z]+[.]{1}[a-z]{2,3}$/;
weirdPattern.test('ss.a_a-#ass.com');
Apparently, it works because the hyphen is the last character in the brackets. My question is if this is just a happy coincidence or if it's a valid syntax? Have I been regexing wrong my whole life?
Hyphens inside a character class are used to denote ranges. However, when a hyphen is placed at the beginning or at the end of the character class, there is no need to escape it.
Note that, in some browsers, hyphens at any position in the character class are still considered as range metacharacters, so it is best practice to always escape it.
Quoting from regular-expressions.info
The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen. [^-x] and [^x-] match any character that is not an x or a hyphen. Hyphens at other positions in character classes where they can't form a range may be interpreted as literals or as errors. Regex flavors are quite inconsistent about this.
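A short sketch of both points in JavaScript: the escaped and unescaped forms of a trailing hyphen behave the same, while a hyphen that sits between two characters silently becomes a range. The character classes are taken from the local part of the question's patterns; the helper name and inputs are illustrative.

```javascript
const escapedHyphen = /^[a-z.\-_]+$/;
const trailingHyphen = /^[a-z._-]+$/;

// True when both patterns give the same verdict for the input.
function patternsAgree(input) {
  return escapedHyphen.test(input) === trailingHyphen.test(input);
}

const inputs = ["ss.a_a-", "abc", "a-b", "A", "a+b"];
console.log(inputs.every(patternsAgree)); // true

// A hyphen between two characters forms a range instead:
// [.-_] covers every code unit from "." (46) to "_" (95).
const accidentalRange = /^[.-_]$/;
console.log(accidentalRange.test("A")); // true: "A" (65) lies inside the range
console.log(accidentalRange.test("-")); // false: "-" (45) lies outside the range
```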

remove illegal characters from file name javascript

I've seen the various posts regarding this topic, but I'm getting a strange result when I do the following:
var dirtyString = '<>I\really|\re\ad?"the/wh\ole*:da|\y?.'
var cleanString = dirtyString.replace(/[\/:*?"<>|.]/g, "");
console.log(cleanString);
It removes all the illegal characters, but the "r" letters are also removed. In the console log I'm getting "Ieallyeadthewholeday". It seems that "\" before "r" erases the "r". "\" isn't erasing other letters it comes before. Am I missing something?
If you would try console.log(dirtyString) you would also see that your "r" are "missing" too.
This is because '\r' is actually an escape sequence for the Carriage Return character (code 13). Your replace() does nothing to it. It is still there; it just isn't displayed. Try playing with String.charAt() and String.charCodeAt() and you will see that the character is still there.
As a side note, you are trying to remove "blacklisted" characters, and blacklisting is almost never the right approach. As you can see in your own case, you forgot to blacklist the '\r' character (and many others). Whitelisting is much safer. For example, you may decide that you accept only Latin letters and digits, then remove everything not whitelisted: var cleanString = dirtyString.replace(/[^a-z0-9]/gi, "");.
\r is the Carriage Return character. If you want a backslash followed by an r then you need to escape the backslash: \\r.
\y is not a recognised escape sequence, so JavaScript drops the backslash and keeps the y (the same happens with \a and \o in your string). Other programming languages, like C#, will instead raise a compiler error about an unrecognised escape sequence.
Further confounding things: most regular-expression syntaxes have their own backslash escape sequences that are distinct from the hosting language's, such as the character classes \W, \d etc. Inside a regex literal (/.../) they work as written, because the literal is not processed as a string. Inside a string passed to the RegExp constructor, however, '\d' silently becomes 'd', so there the backslashes must be doubled ('\\d'). Doubling is also what you need if you want to carry the pattern as a string between languages.
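The string-escape behaviour described in these answers is easy to observe directly. This is a minimal sketch; the variable names, the keepAlphanumerics helper and the shortened sample strings are illustrative.

```javascript
// '\r' is a single Carriage Return character, not a backslash followed by "r".
const withCarriageReturn = "a\rb";
console.log(withCarriageReturn.length);        // 3
console.log(withCarriageReturn.charCodeAt(1)); // 13

// Doubling the backslash keeps both characters.
const withBackslash = "a\\rb";
console.log(withBackslash.length); // 4

// An unrecognised escape such as \y simply loses its backslash.
console.log("\y" === "y"); // true

// Whitelisting: keep only ASCII letters and digits.
function keepAlphanumerics(input) {
  return input.replace(/[^a-z0-9]/gi, "");
}
console.log(keepAlphanumerics("<>I\really?")); // "Ieally" (the \r was a carriage return)

// In a string passed to RegExp, "\d" collapses to "d"; "\\d" is needed.
console.log(new RegExp("\d").test("5"));  // false
console.log(new RegExp("\\d").test("5")); // true
```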

What is this "/\,$/"?

Tried to search for /\,$/ online, but couldn't find anything.
I have:
coords = coords.replace(/\,$/, "");
I'm guessing it returns the coords string index number. What should I search for online so I can learn more?
/\,$/ finds the comma character (,) at the end of a string (denoted by the $) and replaces it with empty (""). You sometimes see this in regex code aiming to clean up excerpts of text.
It's a regular expression to remove a trailing comma.
That thing is a Regular Expression, also known as regex or regexp. It is a way to "match" strings using some rules. If you want to learn how to use it in JavaScript, read the Mozilla Developer Network page about RegExp.
By the way, regular expressions are also available on most languages and in some tools. It is a very useful thing to learn.
That's a regular expression that finds a comma at the end of a string. That code removes the comma.
// defines a JavaScript regular expression, used to match a pattern within a string.
\,$ is the pattern
In this case \, translates to ,. A backslash is used to escape special characters, but in this case, it's not necessary. An example where it would be necessary would be to remove trailing periods. If you tried to do that with /.$/, the period would have a different meaning: it is a wildcard that matches [almost] any character (everything except line terminators). So to match a literal "." (period character) you would have to escape the wildcard (/\.$/).
When $ is placed at the end of the pattern, it anchors the match to the end of the string. This means that you can't mistakenly find a comma anywhere in the middle of the string (e.g., not after help in help, me,), only at the end (trailing). It can also speed up the regular expression search. If you wanted to match characters only at the beginning of the string, you would start the pattern with a caret (^); for instance, /^,/ would find a comma at the start of a string if one existed.
It's also important to note that you're only removing one comma, whereas if you use the plus (+) after the comma, you'd be replacing one or more: /,+$/.
Without the +; trailing commas,, becomes trailing commas,
With the +; no trailing comma,, becomes no trailing comma
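The variants discussed in these answers can be compared side by side. This is a small sketch; the function names and sample strings are illustrative.

```javascript
// Removes a single trailing comma, if present.
function stripOneTrailingComma(text) {
  return text.replace(/,$/, "");
}

// Removes every comma in a trailing run of commas.
function stripAllTrailingCommas(text) {
  return text.replace(/,+$/, "");
}

console.log(stripOneTrailingComma("1,2,3,"));               // "1,2,3"
console.log(stripOneTrailingComma("help, me"));             // "help, me" (inner comma untouched)
console.log(stripOneTrailingComma("trailing commas,,"));    // "trailing commas,"
console.log(stripAllTrailingCommas("no trailing comma,,")); // "no trailing comma"

// The dot must be escaped to mean a literal period.
console.log("end.".replace(/\.$/, "")); // "end"
console.log("end!".replace(/.$/, ""));  // "end" (unescaped dot removes any last character)
```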

Trouble with word-boundary (\b)

I have an array of keywords, and I want to know whether at least one of the keywords is found within some string that has been submitted. I further want to be absolutely sure that it is the keyword that has been matched, and not something that is very similar to the word.
Say, for example, that our keywords are [English, Eng, En] because we are looking for some variation of English.
Now, say that the input from a user is i h8 eng class, or something equally provocative and illiterate - then the eng should be matched. It should also fail to match a word like england or some odd thing chen, even though it's got the en bit.
So, in my infinite lack of wisdom I believed I could do something along the lines of this in order to match one of my array items with the input:
.match(RegExp('\b('+array.join('|')+')\b','i'))
With the thinking that the regular expression would look for matches from the array, now presented like (English|Eng|En) and then look to see whether there were zero-width word bounds on either side.
You need to double the backslashes.
When you create a regex with the RegExp() constructor, you're passing in a string. JavaScript string constant syntax also treats the backslash as a meta-character, for quoting quotes etc. Thus, the backslashes will be effectively stripped out before the RegExp() code even runs!
By doubling them, the step of parsing the string will leave one backslash behind. Then the RegExp() parser will see the single backslash before the "b" and do the right thing.
You need to double the backslashes in a JavaScript string or you'll encode a Backspace character:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))
You need to double-escape \b, because it has a special meaning in strings:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))
\b is an escape sequence inside string literals (see table 2.1 on this page). You should escape it by adding one extra backslash:
.match(RegExp('\\b('+array.join('|')+')\\b','i'))
You do not need to escape \b when used inside a regular expression literal:
/\b(english|eng|en)\b/i
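Putting the answers together, a keyword matcher might look like the sketch below. The function names are illustrative, and the escapeRegExp step is an addition that is not in the original question: it is assumed here so that keywords containing regex metacharacters are treated literally.

```javascript
// Escapes regex metacharacters so a keyword is matched literally (assumed extra step).
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds /\b(kw1|kw2|...)\b/i from an array of keywords.
// Note the doubled backslashes: "\\b" in the string becomes \b in the regex.
function buildKeywordPattern(keywords) {
  const alternatives = keywords.map(escapeRegExp).join("|");
  return new RegExp("\\b(" + alternatives + ")\\b", "i");
}

const keywordPattern = buildKeywordPattern(["English", "Eng", "En"]);

console.log(keywordPattern.test("i h8 eng class")); // true
console.log(keywordPattern.test("england"));        // false
console.log(keywordPattern.test("chen"));           // false

// With single backslashes, "\b" is a Backspace character and nothing matches.
const brokenPattern = new RegExp("\b(eng)\b", "i");
console.log(brokenPattern.test("i h8 eng class")); // false
```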
