I'm trying to learn from reading Mozilla documentation for regular expressions, but there's one thing I don't get. For the special character \s it gives the following example:
/\s\w*/ matches ' bar' in "foo bar."
I understand that \s is the special character for white space, but why is there a \w* in the example?
Doesn't /\s/ also match ' bar' in "foo bar."?
What's with the \w*?
/\s\w*/ is whitespace character followed by 0 or more word characters.
/\s/ would only find the whitespace in the example.
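To see the difference concretely, here is a small sketch that runs both patterns against the example string from the docs. The variable names are only for illustration.

```javascript
// Compare the two patterns from the question on the MDN example string.
const text = "foo bar.";

// \s consumes the space, then \w* consumes as many word characters as it can.
const spaceThenWord = text.match(/\s\w*/)[0];

// \s on its own consumes exactly one whitespace character.
const spaceOnly = text.match(/\s/)[0];

console.log(JSON.stringify(spaceThenWord)); // " bar"
console.log(JSON.stringify(spaceOnly));     // " "
```

The trailing "." is not part of the first match because it is not a word character, so \w* stops in front of it.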
\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]).
It's a character escape.
\w is all word characters (letters, digits, and underscores)
Check this link for more documentation on such shorthand
I want a regex matching a specific word that is not surrounded by any alphanumeric character. My thought was to include a negation before and after:
[^a-zA-Z\d]myspecificword[^a-zA-Z\d]
So it would match:
myspecificword
_myspecificword_
-myspecificword
And not match:
notmyspecificword
myspecificword123
But this simple regex won't match the word by itself unless it is preceded by a whitespace:
myspecificword // no match
myspecificword // match
Using the flags "gmi" and testing with JavaScript. What am I doing wrong? Shouldn't it be as simple as that?
https://regex101.com/r/BCkbVQ/3
Try using:
(?<![^\s_-])myspecificword(?![^\s_-])
This says to match myspecificword when it is surrounded, on both sides, by either the start/end of the input, whitespace, underscore, or dash.
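A quick way to check that lookaround pattern against the strings from the question. This is only a sketch: the sample list and variable names are mine, and it assumes an engine with lookbehind support (recent Node or browsers).

```javascript
// The lookaround pattern from this answer, with the "i" flag from the question.
const lookaround = /(?<![^\s_-])myspecificword(?![^\s_-])/i;

const samples = [
  "myspecificword",     // alone: start and end of input are allowed
  "_myspecificword_",   // underscores are allowed neighbours
  "-myspecificword",    // dash is an allowed neighbour
  "notmyspecificword",  // preceded by a letter
  "myspecificword123",  // followed by a digit
  "(myspecificword)",   // parentheses are not in the allowed set
];

const lookaroundResults = samples.map((s) => lookaround.test(s));
console.log(lookaroundResults); // [ true, true, true, false, false, false ]
```

Note the last case: this pattern only tolerates whitespace, underscore and dash as neighbours, so punctuation such as parentheses blocks the match.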
It is not whitespace that is required but any character that matches [^a-zA-Z\d].
You should use:
(?:^|[^a-zA-Z\d])myspecificword(?:[^a-zA-Z\d]|$)
The main benefit is that it works in all regex engines, including those without lookbehind support.
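Here is a small side-by-side sketch of the pattern from the question and the alternation version. The input list and names are my own. I left out the "g" flag on purpose, since a global regex keeps state in lastIndex between test() calls.

```javascript
// Pattern from the question: needs an actual character on BOTH sides.
const original = /[^a-zA-Z\d]myspecificword[^a-zA-Z\d]/i;

// Alternation version: start/end of input also count as valid neighbours.
const anchored = /(?:^|[^a-zA-Z\d])myspecificword(?:[^a-zA-Z\d]|$)/i;

const inputs = [
  "myspecificword",
  " myspecificword ",
  "-myspecificword",
  "notmyspecificword",
  "myspecificword123",
];

const originalResults = inputs.map((s) => original.test(s));
const anchoredResults = inputs.map((s) => anchored.test(s));

console.log(originalResults); // [ false, true, false, false, false ]
console.log(anchoredResults); // [ true, true, true, false, false ]
```

One thing to keep in mind: unlike the lookaround version, this pattern consumes the neighbouring characters, so they end up inside the matched text.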
If you truly mean "not surrounded by alphanumerics" (and in your attempted regex you seem to be willing to match anything that isn't a letter or digit, including _), then any of the following should be acceptable:
'myspecificword'
'_myspecificword_'
' myspecificword '
'-myspecificword-'
'(myspecificword)'
And the regex should be:
(?<![^_\W])myspecificword(?![^_\W])
let tests = ['myspecificword',
'_myspecificword_',
' myspecificword ',
'-myspecificword-',
'(myspecificword)',
'amyspecificword',
'1myspecificword'
];
let regex = /(?<![^_\W])myspecificword(?![^_\W])/;
for (let test of tests) {
console.log(regex.test(test));
}
The "accepted" answer will not match (myspecificword), for example.
The title of this question is
Regex for word not surrounded by alphanumeric characters
The other answers have all addressed a different question (which may well be the one intended):
Regex for word neither preceded nor followed by alphanumeric characters
I will refer to these statements as #1 and #2 respectively.
If the specified word were 'cat' and the string were '9cat', 'cat' is not surrounded by alphanumeric characters in the string, so there is a match with #1, but not with #2.
For #1, one could use the regex:
/cat(?!\p{Alnum})|(?<!\p{Alnum})cat/
("match 'cat' not followed by a Unicode alphanumeric character or 'cat' not preceded by a Unicode alphanumeric character"), though it's easier to test for the negation:
/(?<=\p{Alnum})cat(?=\p{Alnum})/
The test passes if the string does not match this regex.
With interpretation #2, the regex is:
/(?<!\p{Alnum})cat(?!\p{Alnum})/
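Since the question was tested in JavaScript: as far as I know, \p{Alnum} is not a property name that JavaScript accepts, so the sketch below swaps in [\p{L}\p{N}] (any Unicode letter or number) with the "u" flag as an approximation. That substitution and the names surrounded, isolated and notSurrounded are my assumptions, not part of the answer above.

```javascript
// Approximation of \p{Alnum} for JavaScript: Unicode letters and numbers.
// Negation of #1: 'cat' with an alphanumeric on BOTH sides.
const surrounded = /(?<=[\p{L}\p{N}])cat(?=[\p{L}\p{N}])/u;

// #2: 'cat' with an alphanumeric on NEITHER side.
const isolated = /(?<![\p{L}\p{N}])cat(?![\p{L}\p{N}])/u;

// Statement #1 holds when the "surrounded" pattern does not match.
const notSurrounded = (s) => !surrounded.test(s);

console.log(notSurrounded("9cat"), isolated.test("9cat"));     // true false
console.log(notSurrounded("9cats"), isolated.test("9cats"));   // false false
console.log(notSurrounded("a cat."), isolated.test("a cat.")); // true true
```

The '9cat' row shows where the two readings differ: the word is not surrounded, yet it is preceded by an alphanumeric.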
I think this will work:
/[^a-z0-9]?myspecificword[^a-z0-9]?/i
I'm reading Ionic's source code. I came across this regex, and I'm pretty baffled by it.
([\s\S]+?)
Ok, it's grouping on every char that is either a white space, or non white space???
Why didn't they just do
(.+?)
Am I missing something?
The . matches any character but a newline. To make it match a newline, most languages offer a modifier (dotall, singleline). JS long had no such modifier; the s (dotAll) flag was only added in ES2018.
Thus, a work-around is to use a [\s\S] character class that will match any character, including a newline, because \s will match all whitespace and \S will match all non-whitespace characters. Similarly, one could use [\d\D] or [\w\W].
Also, there is a [^] pattern to match the same thing in JS, but since it is JavaScript-specific, the regexes containing this pattern are not portable between regex flavors.
The +? lazy quantifier matches 1 or more characters conforming to the preceding subpattern, but as few as possible. Thus, it will match just 1 character if used like this, at the end of the pattern.
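A minimal sketch of the points above, using a made-up multi-line string. The variable names are mine, and the last pattern relies on the s (dotAll) flag, so it assumes an ES2018-capable engine.

```javascript
// A string with newlines between the delimiters.
const source = "<b>\nhello\n</b>";

// . does not match "\n", so the lazy group can never reach "</b>".
const dotMatch = source.match(/<b>(.+?)<\/b>/);

// [\s\S] and [^] match any character, newlines included.
const classMatch = source.match(/<b>([\s\S]+?)<\/b>/);
const emptyNegationMatch = source.match(/<b>([^]+?)<\/b>/);

// With the s flag, . matches newlines too.
const dotAllMatch = source.match(/<b>(.+?)<\/b>/s);

// A lazy quantifier at the very end of a pattern stops after one character.
const lazyAtEnd = "abc".match(/[\s\S]+?/)[0];

console.log(dotMatch);                           // null
console.log(JSON.stringify(classMatch[1]));      // "\nhello\n"
console.log(JSON.stringify(dotAllMatch[1]));     // "\nhello\n"
console.log(lazyAtEnd);                          // a
```

In the Ionic case the group is presumably followed by more pattern, which is what forces the lazy quantifier to expand past a single character.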
In many regex implementations "." doesn't match newlines. So they use "[\s\S]" as a little hack =)
A . matches everything but the newline character. This is a well-known and documented quirk of JavaScript. The \s (whitespace match) alongside its negation \S (non-whitespace match) provides a dotall match including the newline. Thus [\s\S] is often used in place of . when newlines must be matched as well.
The RegEx they used includes more characters (essentially everything).
\s matches any whitespace character.
\S matches any character that is not whitespace.
As Casimir notes:
. matches any character except newline (\n)
. matches any char except carriage return \r and new line \n
The shortest way to write [\s\S] (whitespace and non-whitespace) is [^] (not nothing)
I am confused about /\w\b\w/. I think it should match "e w" in "we we", since:
\w is a word character, which is "e"
\b is a word boundary, which is " " (space)
\w is another word character, which is "w"
So the match is "e w" in "we we".
But...
/\w\b\w/ will never match anything, because a word character can never
be followed by both a non-word and a word character.
I got this one from MDN:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions?redirectlocale=en-US&redirectslug=JavaScript%2FGuide%2FRegular_Expressions
I can't understand their explanation. Can you help me by explaining it in baby steps? Thank you!
Nick
The space character isn't the word boundary. A word boundary isn't a character itself, it's the place "in between characters" where a word character transitions to a non-word character.
So "e w".match(/\w\b/) only matches "e", not "e ".
/\w\b\w/ never matches anything because it would require that a word character be immediately followed by a non-word character and also by a word character, which is of course not possible.
The key is the \b meaning. \b matches a word boundary. A word boundary matches the position where a word-character is not followed or preceded by another word-character. Note that a matched word boundary is not included in the match. In other words, the length of a matched word boundary is zero.
So \b itself doesn't consume anything; it's just a condition, like ^, $ and so on. Just as /^\w/ means "starts with a word character", /\w\b/ means "a word character not followed by a word character".
In "e w", /\w\b/ only matches "e" (a word character that is followed by a space rather than by another word character), not "e ".
/\w\W/ does match "e " in "e w". \b is just a condition and doesn't match any characters.
/\w\b\w/ asks for a word character that is followed by both a non-word character and a word character, which is contradictory, so it will never match anything.
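The claims above are easy to check in a console. A small sketch on the "we we" string from the question (variable names are only for illustration):

```javascript
const input = "we we";

// \b consumes nothing, so the second \w would have to sit right after the first
// while also being across a boundary from it: impossible.
const impossible = input.match(/\w\b\w/);

// Without the trailing \w, only the "e" before the space is matched.
const beforeBoundary = input.match(/\w\b/)[0];

// \W actually consumes the space, so these match real characters.
const wordThenNonWord = input.match(/\w\W/)[0];
const acrossSpace = input.match(/\w\W\w/)[0];

console.log(impossible);                       // null
console.log(JSON.stringify(beforeBoundary));   // "e"
console.log(JSON.stringify(wordThenNonWord));  // "e "
console.log(JSON.stringify(acrossSpace));      // "e w"
```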
\w\b\w means match:
an alphanumeric character (\w); followed by
a transition from alphanumeric to non-alphanumeric characters (or vice-versa) ('\b'). But not any actual character; followed by
an alphanumeric character (\w).
The key point is that \b doesn't consume any characters, it checks which characters are adjacent to the tested position. So \w\b\w matches only two characters, both must be alphanumeric (\w) and the imaginary point between them must have an alphanumeric on one side and non-alphanumeric on the other, which is therefore not possible to match.
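To make the "imaginary point" idea visible, here is a sketch that lists where \b matches in "we we" and marks those positions with a pipe. The names are hypothetical, and matchAll assumes a reasonably recent engine.

```javascript
const phrase = "we we";

// Indices of every zero-width word-boundary position in the string.
const boundaryPositions = [...phrase.matchAll(/\b/g)].map((m) => m.index);

// Insert a visible marker at each boundary; no characters are replaced
// because \b matches an empty string.
const marked = phrase.replace(/\b/g, "|");

console.log(boundaryPositions); // [ 0, 2, 3, 5 ]
console.log(marked);            // |we| |we|
```

Every marked position has a word character on exactly one side, which is why a \w can never sit on both sides of a \b.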
Hope this helps.
Your regular expression would fail for the input "we we" because a word boundary in most dialects is a position between \w and a non-word character (\W), or at the beginning or end of a string if it begins or ends with a word character.
Your regular expression is doing this:
\w word characters (a-z, A-Z, 0-9, _)
\b the boundary between a word char (\w) and not a word char
\w word characters (a-z, A-Z, 0-9, _)
Therefore, it's saying: look for a word character immediately following the position of your word boundary. If you were to remove the ending \w, it would match the e in your input.
console.log("we we".match(/\w\b/));
// => [ 'e', index: 1, input: 'we we' ]
I had the same question, and reading this post I finally figured it out. The difficulty may be that we imagine the \b in \w\b\w as a symbol for a space. But here and everywhere else, \b only states that the character before or after this position must be a non-word character; it does not stand for that character. Given that assertion, in \w\b\w the last \w says "No! Here is a word character", so the last \w contradicts \b. Keep in mind that \b is a pointer to a position, not a character class. As an exercise, show that the same reasoning applies to the first \w in \w\b\w :)
Use \w\s\w to match what you need. Note that \s and \b are different.
I'm using the following regular expression to match one or more special characters for a password strength test.
if (password.match(/\W+/)) points++;
This doesn't seem to match the underscore '_' as a special character. Why is this and how can I fix it?
It is because \W is the same as [^\w], while \w contains a-z, A-Z, 0-9, and _ as well.
In order to fix it just add _ character separately:
if (password.match(/[\W_]+/)) points++;
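A small sketch of the difference, with a hypothetical helper specialPoints standing in for the points++ check from the question:

```javascript
// Hypothetical helper: returns 1 if the pattern finds a "special" character.
function specialPoints(password, pattern) {
  return pattern.test(password) ? 1 : 0;
}

// Underscore is a word character, so \W does not see it.
const plainNonWord = specialPoints("pass_word", /\W+/);

// Adding _ to the class makes it count as special.
const nonWordOrUnderscore = specialPoints("pass_word", /[\W_]+/);

// Letters and digits alone still score nothing.
const noSpecials = specialPoints("password1", /[\W_]+/);

console.log(plainNonWord, nonWordOrUnderscore, noSpecials); // 0 1 0
```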
\W (uppercase) means not \w, so anything except word characters.
Word characters (\w) includes letters, digits, and underscore.
Perhaps you should use /[^a-z0-9]+/i to match anything that is not a letter or digit.
Are you sure you don't want the \w? The \W is the negation of \w.
\w matches (letters, digits, and underscores), so \W does NOT match letters, digits, and underscores. See here: http://www.regular-expressions.info/reference.html
The match fails because underscore is treated as a word character. From the MDN documentation for \W:
Matches any non-word character. Equivalent to [^A-Za-z0-9_]
You can fix this by grouping underscore and \W:
if (password.match(/[\W_]+/)) points++;
A regex tool such as Javascript Regex Tester can be especially helpful for debugging this sort of thing.
I am using this expression: /\W+/g to match all characters that are not numbers, letters and spaces. It seems to be including spaces. How would I build a regex that did not include spaces?
/[^a-z0-9\s]+/ig
Explanation:
[^ Character class which matches characters NOT in the following class
a-z All lowercase letters of the alphabet
0-9 All numbers
\s Whitespace characters
] End of the character class
i Case-insensitivity to match uppercase letters
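For comparison, a quick sketch running both patterns over a made-up sample string (the string and names are mine):

```javascript
const sample = "Hello, world_ 42!";

// \W+ treats spaces as non-word characters, so they show up in the matches.
const nonWordRuns = sample.match(/\W+/g);

// The explicit class excludes whitespace from the matches.
const nonAlnumNonSpace = sample.match(/[^a-z0-9\s]+/gi);

console.log(nonWordRuns);      // [ ', ', ' ', '!' ]
console.log(nonAlnumNonSpace); // [ ',', '_', '!' ]
```

Note one side effect: the explicit class also matches the underscore, which \W does not, because \w includes _.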
A more accurate wording for \W is any Non-Alphanumeric character.
\s is for Any Whitespace.
So, it would be something like this:
[^\s]
\W means "non-word characters", the inverse of \w, so it will match spaces as well. I'm a bit surprised it doesn't match numbers, though.