regular expression - /\w\b\w/ - javascript

I am confused about /\w\b\w/. I think it should match "e w" in "we we", since:
\w is word character which is "e"
\b is word boundary which is " " (space)
\w is another word character which is "w"
So the match is "e w" in "we we".
But...
/\w\b\w/ will never match anything, because a word character can never
be followed by both a non-word and a word character.
I got this one from MDN:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions?redirectlocale=en-US&redirectslug=JavaScript%2FGuide%2FRegular_Expressions
I can't understand their explanation. Can you help explain it in baby steps? Thank you!
Nick

The space character isn't the word boundary. A word boundary isn't a character itself, it's the place "in between characters" where a word character transitions to a non-word character.
So "e w".match(/\w\b/) only matches "e", not "e ".
/\w\b\w/ never matches anything because it would require that a word character be immediately followed by a non-word character and also by a word character, which is of course not possible.
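A minimal sketch of the three cases discussed above, using the asker's input "we we" (the variable names are only illustrative):

```javascript
// \w\b matches a word character that sits just before a word boundary.
// The boundary itself has no width, so the space is not part of the match.
const beforeBoundary = "e w".match(/\w\b/);
console.log(beforeBoundary[0]); // "e"

// To actually consume the space, a real character class such as \W is needed.
const withNonWord = "we we".match(/\w\W\w/);
console.log(withNonWord[0]); // "e w"

// \w\b\w asks for a boundary between two word characters, which cannot exist.
const impossible = "we we".match(/\w\b\w/);
console.log(impossible); // null
```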

The key is the meaning of \b. \b matches a word boundary: a position where a word character is not followed or preceded by another word character. Note that a matched word boundary is not included in the match. In other words, the length of a matched word boundary is zero.
So \b itself doesn't consume anything; it's just a condition, like ^ and $. Just as /^\w/ means "a word character at the start", /\w\b/ means "a word character not followed by another word character".
In "e w", /\w\b/ matches only "e" (a word character not followed by a word character; here it is followed by a space), not "e ".
/\w\W/ does match "e " in "e w". \b is just a condition and doesn't match any character.
/\w\b\w/ asks for a word character followed by both a non-word character and a word character, which is contradictory, so it will never match anything.
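One way to see that a word boundary has zero length is to mark every \b position in the string. This is only an illustrative sketch; the variable names are made up for the example:

```javascript
// Insert a "|" at every word-boundary position; no characters are removed.
const marked = "we we".replace(/\b/g, "|");
console.log(marked); // "|we| |we|"

// A bare \b match is an empty string located at some index.
const firstBoundary = "we we".match(/\b/);
console.log(firstBoundary[0].length); // 0
console.log(firstBoundary.index); // 0
```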

\w\b\w means match:
an alphanumeric character (\w); followed by
a transition from alphanumeric to non-alphanumeric characters or vice versa (\b), which is not an actual character; followed by
an alphanumeric character (\w).
The key point is that \b doesn't consume any characters, it checks which characters are adjacent to the tested position. So \w\b\w matches only two characters, both must be alphanumeric (\w) and the imaginary point between them must have an alphanumeric on one side and non-alphanumeric on the other, which is therefore not possible to match.
Hope this helps.
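The "imaginary point" test can be written out by hand. The sketch below is not how a regex engine is implemented; isWordChar and isBoundaryAt are hypothetical helpers that restate the definition of \b for the default (non-Unicode) \w:

```javascript
// Hypothetical helper: true when ch is a word character ([A-Za-z0-9_]).
// Positions outside the string (undefined) count as non-word.
const isWordChar = (ch) => ch !== undefined && /\w/.test(ch);

// Hypothetical helper: position pos is a boundary when exactly one side is a word character.
const isBoundaryAt = (text, pos) =>
  isWordChar(text[pos - 1]) !== isWordChar(text[pos]);

const text = "we we";
const boundaries = [];
for (let pos = 0; pos <= text.length; pos++) {
  if (isBoundaryAt(text, pos)) boundaries.push(pos);
}
console.log(boundaries); // [0, 2, 3, 5]

// \w\b\w would need a position with a word character on BOTH sides that is
// also a boundary; by the definition above no such position can exist.
const contradictory = [];
for (let pos = 1; pos < text.length; pos++) {
  const bothWord = isWordChar(text[pos - 1]) && isWordChar(text[pos]);
  if (bothWord && isBoundaryAt(text, pos)) contradictory.push(pos);
}
console.log(contradictory); // []
```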

Your regular expression would fail for the input "we we" because a word boundary in most dialects is a position between \w and a non-word character (\W), or at the beginning or end of a string if it begins or ends with a word character.
Your regular expression is doing this:
\w word characters (a-z, A-Z, 0-9, _)
\b the boundary between a word char (\w) and not a word char
\w word characters (a-z, A-Z, 0-9, _)
Therefore, it's saying: look for a word character immediately after the position of your word boundary. If you were to remove the trailing \w, it would match the first e in your input.
console.log("we we".match(/\w\b/));
// => [ 'e', index: 1, input: 'we we' ]

I had the same question. Reading this post, I finally figured it out. The difficulty may be that we imagine \b in \w\b\w as a symbol for the space. But here, and everywhere, \b only asserts that a non-word character must come "after or before" this position (it does not represent the non-word character itself). Given that assertion, in \w\b\w the last \w says "No! There is a word character here." So the last \w contradicts \b. Keep in mind that \b is a pointer, not a character class. As an exercise, prove that the same holds for the first \w in \w\b\w :)

Use \w\s\w to match what you need. Note that \s and \b are different: \s consumes a whitespace character, while \b consumes nothing.

Related

How to do lookbehind and lookforward at the same time around a regex?

The input is this:
*Word. Word.* Word word. *…*
"…" Word word. "…"
"…" word. "…"
The following is matching the empty space on the right side of a sentence.
(?<=["*]*[A-Z].+?\.["*]*)\s
If I want to match the empty space on the left side, I have to do this:
\s(?=["*]*[A-Z].+?\.["*]*)
The output should be this (the [] symbolize the matches):
*Word.[]Word.*[]Word word.[]*…*
"…"[]Word word.[]"…"
"…" word.[]"…"
How to modify this regex so it matches the empty spaces on both sides of a sentence at the same time?
https://regexr.com/5tddc
For the examples shown, you may be able to use this regex with look arounds to match spaces:
(?<=\.\*?) |(?<!\w) (?=[A-Z])
RegEx Details:
(?<=\.\*?) : Match a space if that is preceded by a dot and optional *
|: OR
(?<!\w) (?=[A-Z]): Match a space that must be followed by an uppercase letter and must not be preceded by a word character
Perhaps you can match a non word boundary and assert either an uppercase char A-Z or one of " * at the right.
\B[ ](?=[A-Z"*])
The pattern matches:
\B A position where \b does not match
[ ] Match a space (The brackets are for clarity only)
(?= Positive lookahead, assert what is at the right is
[A-Z"*] Match one of A-Z or " or *
) Close lookahead
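Both suggested patterns can be checked against the question's sample input. The sketch below replaces each matched space with "[]" so the result can be compared with the expected output; the sample lines are copied from the question, and the variable names are only illustrative. The first pattern relies on lookbehind, which needs an ES2018-capable engine.

```javascript
const lines = [
  '*Word. Word.* Word word. *…*',
  '"…" Word word. "…"',
  '"…" word. "…"',
];

// Pattern 1: lookbehind for ". " or ".* ", or a space not preceded by a
// word character and followed by an uppercase letter.
const viaLookbehind = lines.map((line) =>
  line.replace(/(?<=\.\*?) |(?<!\w) (?=[A-Z])/g, "[]")
);

// Pattern 2: a space at a non-word-boundary position, followed by A-Z, " or *.
const viaNonBoundary = lines.map((line) =>
  line.replace(/\B[ ](?=[A-Z"*])/g, "[]")
);

console.log(viaLookbehind);
console.log(viaNonBoundary);
// Both print:
// [ '*Word.[]Word.*[]Word word.[]*…*', '"…"[]Word word.[]"…"', '"…" word.[]"…"' ]
```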

\b regex for end of word [duplicate]

I'm attempting to match the last character in a WORD.
A WORD is a sequence of non-whitespace characters
'[^\n\r\t\f ]', or an empty line matching ^$.
The expression I made to do this is:
"[^ \n\t\r\f]\(?:[ \$\n\t\r\f]\)"
The regex matches a non-whitespace character that is followed by a whitespace character or the end of the line.
But I don't know how to make it exclude the following whitespace character from the result, or why it doesn't seem to capture a character preceding the end of the line.
Using the string "Hi World!", I would expect: the "i" and "!" to be captured.
Instead I get: "i ".
What steps can I take to solve this problem?
"Word" that is a sequence of non-whitespace characters scenario
Note that the non-capturing group (?:...) in [^ \n\t\r\f](?:[ \$\n\t\r\f]) still matches (consumes) the whitespace char, so it becomes part of the match. The pattern also does not match at the end of the string, because inside a character class the $ symbol is not an end-of-string anchor; it is parsed as a literal $ symbol.
You may use
\S(?!\S)
The \S matches a non-whitespace char that is not followed with a non-whitespace char (due to the (?!\S) negative lookahead).
General "word" case
If a word consists of just letters, digits and underscores, that is, if it is matched with \w+, you may simply use
\w\b
Here, \w matches a "word" char, and the word boundary asserts there is no word char right after.
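The question does not name a regex flavor, so the following is only a JavaScript sketch of the same ideas, using the "Hi World!" string from the question (variable names are illustrative):

```javascript
const sample = "Hi World!";

// The original approach: the group consumes the trailing space, and "$"
// inside a character class is a literal dollar sign, not an anchor.
const consuming = sample.match(/[^ \n\t\r\f](?:[ $\n\t\r\f])/g);
console.log(consuming); // [ 'i ' ]

// Last character of each run of non-whitespace characters.
const lastNonSpace = sample.match(/\S(?!\S)/g);
console.log(lastNonSpace); // [ 'i', '!' ]

// Last character of each \w+ word; "!" is not a word character,
// so the second match is "d".
const lastWordChar = sample.match(/\w\b/g);
console.log(lastWordChar); // [ 'i', 'd' ]
```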
In Word text, if I want to highlight the last a in para. I search for all the words that have [space][para][space] to make sure I only have the word I want, then when it is found it should be highlighted.
Next, I search for the last [a ] space added, in the selection and I will get only the last [a] and I will highlight it or color it differently.

JavaScript Regex splitting string into words

I have the following Regex
console.log("Test #words 100-200-300".toLowerCase().match(/(?:\B#)?\w+/g))
From the above you can see it is splitting "100-200-300". I want it to ignore "-" and keep the word in full as below:
--> ["test", "#words", "100-200-300"]
I need the Regex to keep the same rules, with the addition of not splitting words connected with "-"
For your current example, you could match an optional #, 1+ word chars and repeat 0+ times a part that matches a - and 1+ word chars again.
#?\w+(?:-\w+)*
#? Optional #
\w+ 1+ word characters
(?:-\w+)* Repeat as a group 0+ times matching - and 1+ word chars
console.log("Test #words 100-200-300".toLowerCase().match(/#?\w+(?:-\w+)*/g));
About the \B anchor (following text taken from the link)
\B is the negated version of \b. \B matches at every position where \b
does not. Effectively, \B matches at any position between two word
characters as well as at any position between two non-word characters.
If you do want to use that anchor, see for example some difference in matches with \B and without \B
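As a small illustration of that difference, here is a sketch on a made-up input string, "a#b #c" (not from the question), chosen so that one # follows a word character and the other follows a space:

```javascript
const input = "a#b #c";

// With \B the "#" must NOT sit at a word boundary. Between "a" and "#" there
// is a boundary, so "#b" is skipped; between " " and "#" there is none.
const withNonBoundary = input.match(/\B#\w+/g);
console.log(withNonBoundary); // [ '#c' ]

// Without \B every "#" followed by word characters matches.
const withoutNonBoundary = input.match(/#\w+/g);
console.log(withoutNonBoundary); // [ '#b', '#c' ]
```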

Regex: I want to match only words without a dot at the end

For example: George R.R. Martin
I want to match only George and Martin.
I have tried: \w+\b. But it doesn't work!
The \w+\b. matches 1+ word chars that are followed with a word boundary, and then any char that is a non-word char (as \b restricts the following . subpattern). Note that this way is not negating anything and you miss an important thing: a literal dot in the regex pattern must be escaped.
You may use a negative lookahead (?!\.):
var s = "George R.R. Martin";
console.log(s.match(/\b\w+\b(?!\.)/g));
Details:
\b - leading word boundary
\w+ - 1+ word chars
\b - trailing word boundary
(?!\.) - there must be no . after the last word char matched.
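To see why the original attempt behaves differently, the sketch below runs both patterns side by side on the example string (variable names are illustrative):

```javascript
const fullName = "George R.R. Martin";

// Unescaped "." matches ANY character after the boundary, so the space and
// the dots are consumed, and "Martin" fails because nothing follows it.
const unescapedDot = fullName.match(/\w+\b./g);
console.log(unescapedDot); // [ 'George ', 'R.', 'R.' ]

// The negative lookahead rejects words followed by a literal dot
// and consumes nothing after the word.
const notBeforeDot = fullName.match(/\b\w+\b(?!\.)/g);
console.log(notBeforeDot); // [ 'George', 'Martin' ]
```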

JavaScript special characters/regular expressions

I'm trying to learn from reading Mozilla documentation for regular expressions, but there's one thing I don't get. For the special character \s it gives the following example
/\s\w*/ matches ' bar' in "foo bar."
I understand that \s is the special character for white space, but why is there a w* in the example?
doesn't /\s/ also match ' bar' in "foo bar."?
What's with the w*?
/\s\w*/ is whitespace character followed by 0 or more word characters.
/\s/ would only find the whitespace in the example.
\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]).
It's a character escape.
\w is all word characters (letters, digits, and underscores)
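A short sketch comparing the two patterns on the MDN example string; the last case uses a made-up input, "foo ", to show that \w* may match zero characters:

```javascript
const sentence = "foo bar.";

// Whitespace followed by zero or more word characters.
const spaceAndWord = sentence.match(/\s\w*/);
console.log(JSON.stringify(spaceAndWord[0])); // " bar"

// Whitespace alone.
const spaceOnly = sentence.match(/\s/);
console.log(JSON.stringify(spaceOnly[0])); // " "

// Because * allows zero repetitions, a trailing space still matches.
const trailingSpace = "foo ".match(/\s\w*/);
console.log(JSON.stringify(trailingSpace[0])); // " "
```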
