JavaScript Regex splitting string into words - javascript

I have the following Regex
console.log("Test #words 100-200-300".toLowerCase().match(/(?:\B#)?\w+/g))
From the above you can see it is splitting "100-200-300". I want it to ignore "-" and keep the word in full as below:
--> ["test", "#words", "100-200-300"]
I need the Regex to keep the same rules, with the addition of not splitting words connected with "-"

For your current example, you could match an optional #, 1+ word chars, and then repeat 0+ times a group that matches a - and 1+ word chars.
#?\w+(?:-\w+)*
#? Optional #
\w+ 1+ word characters
(?:-\w+)* Repeat as a group 0+ times matching - and 1+ word chars
console.log("Test #words 100-200-300".toLowerCase().match(/#?\w+(?:-\w+)*/g));
About the \B anchor (following text taken from the link)
\B is the negated version of \b. \B matches at every position where \b
does not. Effectively, \B matches at any position between two word
characters as well as at any position between two non-word characters.
If you do want to use that anchor, see for example some difference in matches with \B and without \B
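As a rough illustration of that difference, the sketch below runs both patterns on a made-up sample string (the sample is an assumption, not taken from the question). With \B, a # that directly follows a word character is not picked up.

```javascript
// Hypothetical sample: "#" appears once glued to a word and once after a space.
const sample = "foo#bar #baz 100-200";

// With \B, "#" is only included when it is NOT directly preceded by a word char.
const withAnchor = sample.match(/(?:\B#)?\w+(?:-\w+)*/g);

// Without \B, any "#" directly in front of a word is included.
const withoutAnchor = sample.match(/#?\w+(?:-\w+)*/g);

console.log(withAnchor);    // [ 'foo', 'bar', '#baz', '100-200' ]
console.log(withoutAnchor); // [ 'foo', '#bar', '#baz', '100-200' ]
```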

Related

How to do lookbehind and lookforward at the same time around a regex?

The input is this:
*Word. Word.* Word word. *…*
"…" Word word. "…"
"…" word. "…"
The following is matching the empty space on the right side of a sentence.
(?<=["*]*[A-Z].+?\.["*]*)\s
If I want to match the empty space on the left side, I have to do this:
\s(?=["*]*[A-Z].+?\.["*]*)
The output should be this (the [] symbolize the matches):
*Word.[]Word.*[]Word word.[]*…*
"…"[]Word word.[]"…"
"…" word.[]"…"
How to modify this regex so it matches the empty spaces on both sides of a sentence at the same time?
https://regexr.com/5tddc
For the examples shown, you may be able to use this regex with look arounds to match spaces:
(?<=\.\*?) |(?<!\w) (?=[A-Z])
RegEx Details:
(?<=\.\*?) : Match a space if that is preceded by a dot and optional *
|: OR
(?<!\w) (?=[A-Z]): Match a space that must be followed by an uppercase letter and must not be preceded by a word character
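A quick way to check this pattern against the input from the question is to replace every matched space with []. This is only a sketch; the variable names are made up, and lookbehind needs an ES2018+ engine such as a current Node.js or browser.

```javascript
// Input lines from the question.
const lines = [
  '*Word. Word.* Word word. *…*',
  '"…" Word word. "…"',
  '"…" word. "…"',
];

// Space after ". " or ".* ", OR a space not preceded by a word char
// and followed by an uppercase letter.
const sentenceGap = /(?<=\.\*?) |(?<!\w) (?=[A-Z])/g;

// Replace each matched space with [] to make the matches visible.
const marked = lines.map((line) => line.replace(sentenceGap, "[]"));
console.log(marked.join("\n"));
// *Word.[]Word.*[]Word word.[]*…*
// "…"[]Word word.[]"…"
// "…" word.[]"…"
```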
Perhaps you can match a non word boundary and assert either an uppercase char A-Z or one of " * at the right.
\B[ ](?=[A-Z"*])
The pattern matches:
\B A position where \b does not match
[ ] Match a space (The brackets are for clarity only)
(?= Positive lookahead, assert what is at the right is
[A-Z"*] Match one of A-Z or " or *
) Close lookahead
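The breakdown above can be tried on the question's input with a small sketch like the following (names are illustrative only). Since it uses no lookbehind, it should also run on older JavaScript engines.

```javascript
// Input lines from the question.
const inputLines = [
  '*Word. Word.* Word word. *…*',
  '"…" Word word. "…"',
  '"…" word. "…"',
];

// A space that does not sit at a word boundary on its left side
// and is followed by A-Z, a double quote, or an asterisk.
const gapPattern = /\B[ ](?=[A-Z"*])/g;

const markedLines = inputLines.map((line) => line.replace(gapPattern, "[]"));
console.log(markedLines.join("\n"));
// *Word.[]Word.*[]Word word.[]*…*
// "…"[]Word word.[]"…"
// "…" word.[]"…"
```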

\b regex for end of word [duplicate]

I'm attempting to match the last character in a WORD.
A WORD is a sequence of non-whitespace characters
'[^\n\r\t\f ]', or an empty line matching ^$.
The expression I made to do this is:
"[^ \n\t\r\f]\(?:[ \$\n\t\r\f]\)"
The regex is meant to match a non-whitespace character that is followed by a whitespace character or the end of the line.
But I don't know how to stop it from including the following whitespace character in the result, or why it doesn't seem to capture a character preceding the end of the line.
Using the string "Hi World!", I would expect: the "i" and "!" to be captured.
Instead I get: "i ".
What steps can I take to solve this problem?
"Word" that is a sequence of non-whitespace characters scenario
Note that a non-capturing group (?:...) in [^ \n\t\r\f](?:[ \$\n\t\r\f]) still matches (consumes) the whitespace char, so it becomes a part of the match. It also does not match at the end of the string, because inside a character class $ is not a string end anchor; it is parsed as a literal $ symbol.
You may use
\S(?!\S)
See the regex demo
The \S matches a non-whitespace char that is not followed with a non-whitespace char (due to the (?!\S) negative lookahead).
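To see both behaviours side by side, here is a minimal sketch in JavaScript (the question does not name its engine, so the flavor is an assumption) using the string from the question.

```javascript
const text = "Hi World!";

// Original attempt: "$" inside a character class is a literal dollar sign,
// and the trailing whitespace is consumed as part of the match.
const original = text.match(/[^ \n\t\r\f](?:[ $\n\t\r\f])/g);
console.log(original); // [ 'i ' ]

// Lookahead version: the next char is checked but not consumed,
// and the lookahead also succeeds at the end of the string.
const lastChars = text.match(/\S(?!\S)/g);
console.log(lastChars); // [ 'i', '!' ]
```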
General "word" case
If a word consists of just letters, digits and underscores, that is, if it is matched with \w+, you may simply use
\w\b
Here, \w matches a "word" char, and the word boundary asserts there is no word char right after.
See another regex demo.
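Note that the two definitions of a "word" give different results on the same input. A small JavaScript sketch (the second sample string is made up for illustration):

```javascript
// "!" is not a word char, so \w\b stops at "d" rather than "!".
const wordEnds = "Hi World!".match(/\w\b/g);
console.log(wordEnds); // [ 'i', 'd' ]

// Digits and underscores count as word chars.
const moreEnds = "foo_bar baz9".match(/\w\b/g);
console.log(moreEnds); // [ 'r', '9' ]
```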
In Word text, if I want to highlight the last a in para. I search for all the words that have [space][para][space] to make sure I only have the word I want, then when it is found it should be highlighted.
Next, I search for the last [a ] space added, in the selection and I will get only the last [a] and I will highlight it or color it differently.

Optionally match after a space

I'm looking for a regular expression that can match both of these lines:
foo/bar
foo/bar baz
And capture foo, bar, and baz into separate match groups.
I've tried with this regex:
^([^\/]+)\/([^\/#]+)? (\w+)$
You can use the regex below
^(\w+)\/(\w+)\s*(\w+)?$
^: Starts with anchor
(\w+): Match one or more word characters(alphabets, numbers and underscore) and add them to capturing group
\/: Match forward slash
\s*: Match any number of spaces
(\w+)?: Optional alphanumeric+underscore match
$: Ends with anchor
Here's demo on RegEx101.com.
This captures the word before / in the first group (accessible as $1), the word after / in the second group ($2), and the optional trailing word in the third group ($3).
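As a quick check in JavaScript (an assumed engine; the helper name parseLine is made up for this sketch), the groups come out like this for the two lines from the question:

```javascript
const slashPattern = /^(\w+)\/(\w+)\s*(\w+)?$/;

// Returns the three capture groups, or null when the line does not match.
function parseLine(line) {
  const m = line.match(slashPattern);
  return m ? [m[1], m[2], m[3]] : null;
}

console.log(parseLine("foo/bar"));     // [ 'foo', 'bar', undefined ]
console.log(parseLine("foo/bar baz")); // [ 'foo', 'bar', 'baz' ]
```

The third group is undefined rather than an empty string when the optional word is absent, so callers may want to check for that.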
If the words may contain characters other than \w (i.e. [a-zA-Z0-9_]), you can use the regex below
^([^\/]+)\/(\S+)\s*(\S+)?$
[^\/]+ will match one or more characters except /. \S+ will match one or more non-space characters.
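For instance, with a made-up input that contains characters outside \w (the sample string is hypothetical), a JavaScript sketch of this looser pattern gives:

```javascript
const loosePattern = /^([^\/]+)\/(\S+)\s*(\S+)?$/;

// Hypothetical input with "-", "." and "!" which \w would not match.
const looseMatch = "foo-1/bar.x baz!".match(loosePattern);
const looseGroups = looseMatch.slice(1);
console.log(looseGroups); // [ 'foo-1', 'bar.x', 'baz!' ]
```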
Try using this ^([^\/]+)\/([^\/#]+)\s*(\w*)$ with g and m flags.

Capture between pattern of digits

I'm stuck trying to capture a structure like this:
1:1 wefeff qwefejä qwefjk
dfjdf 10:2 jdskjdksdjö
12:1 qwe qwe: qwertyå
I would want to match everything between the digits, followed by a colon, followed by another set of digits. So the expected output would be:
match 1 = 1:1 wefeff qwefejä qwefjk dfjdf
match 2 = 10:2 jdskjdksdjö
match 3 = 12:1 qwe qwe: qwertyå
Here's what I have tried:
\d+\:\d+.+
But that fails if there are word characters spanning two lines.
I'm using a javascript based regex engine.
You may use a regex based on a tempered greedy token:
/\d+:\d+(?:(?!\d+:\d)[\s\S])*/g
The \d+:\d+ part will match one or more digits, a colon, and one or more digits. Then (?:(?!\d+:\d)[\s\S])* will match any char, zero or more occurrences, as long as that char does not start a sequence of one or more digits followed with a colon and a digit. See this regex demo.
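Applied to the sample from the question, a short sketch looks like this. The whitespace clean-up at the end is an extra step added here to reproduce the expected output; it is not part of the regex itself.

```javascript
const input = [
  "1:1 wefeff qwefejä qwefjk",
  "dfjdf 10:2 jdskjdksdjö",
  "12:1 qwe qwe: qwertyå",
].join("\n");

const tempered = /\d+:\d+(?:(?!\d+:\d)[\s\S])*/g;

// Raw matches can contain line breaks and trailing whitespace,
// so collapse whitespace runs and trim before printing.
const entries = input.match(tempered).map((m) => m.replace(/\s+/g, " ").trim());
console.log(entries);
// [ '1:1 wefeff qwefejä qwefjk dfjdf', '10:2 jdskjdksdjö', '12:1 qwe qwe: qwertyå' ]
```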
As the tempered greedy token is a resource consuming construct, you can unroll it into a more efficient pattern like
/\d+:\d+\D*(?:\d(?!\d*:\d)\D*)*/g
See another regex demo.
Now, the tempered greedy token is turned into a pattern that matches strings linearly:
\D* - 0+ non-digit symbols
(?: - start of a non-capturing group matching zero or more sequences of:
\d - a digit that is...
(?!\d*:\d) - not followed with 0+ digits, : and a digit
\D* - 0+ non-digit symbols
)* - end of the non-capturing group.
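One way to gain some confidence that the unrolled pattern behaves like the tempered greedy token is to compare their matches on a few inputs. The sketch below does that for the question's sample plus one made-up line containing stray digits; it is a spot check, not a proof of equivalence.

```javascript
const temperedRe = /\d+:\d+(?:(?!\d+:\d)[\s\S])*/g;
const unrolledRe = /\d+:\d+\D*(?:\d(?!\d*:\d)\D*)*/g;

const samples = [
  "1:1 wefeff qwefejä qwefjk\ndfjdf 10:2 jdskjdksdjö\n12:1 qwe qwe: qwertyå",
  "1:1 abc 5 def\n2:2 x 30 y", // hypothetical input with digits that are not markers
];

// Compare the raw match lists of both patterns for every sample.
const sameResults = samples.map(
  (s) => JSON.stringify(s.match(temperedRe)) === JSON.stringify(s.match(unrolledRe))
);
console.log(sameResults); // [ true, true ]
```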
You can include ñ/Ñ or leave them out, but something like this should work:
\d+?:\d+? [a-zñA-ZÑ ]*
Edited:
If you want to include line breaks, you can add \n or \r to the set:
\d+?:\d+? [a-zñA-ZÑ\n ]*
\d+?:\d+? [a-zñA-ZÑ\r ]*
Give it a try! Also tested on https://regex101.com/
For more chars:
^[a-zA-Z0-9!##\$%\^\&*)(+=._-]+$

Regex: I want to match only words without a dot at the end

For example: George R.R. Martin
I want to match only George and Martin.
I have tried \w+\b. but it doesn't work!
The \w+\b. matches 1+ word chars that are followed with a word boundary, and then any char that is a non-word char (as \b restricts the following . subpattern). Note that this way is not negating anything and you miss an important thing: a literal dot in the regex pattern must be escaped.
You may use a negative lookahead (?!\.):
var s = "George R.R. Martin";
console.log(s.match(/\b\w+\b(?!\.)/g));
See the regex demo
Details:
\b - leading word boundary
\w+ - 1+ word chars
\b - trailing word boundary
(?!\.) - there must be no . after the last word char matched.
See more about how negative lookahead works here.
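To contrast the original attempt with the lookahead version, here is a small sketch on the same string (variable names are illustrative):

```javascript
const author = "George R.R. Martin";

// Unescaped "." matches any char, so the match swallows the char after each word,
// and "Martin" is lost because nothing follows it.
const attempt = author.match(/\w+\b./g);
console.log(attempt); // [ 'George ', 'R.', 'R.' ]

// The negative lookahead rejects words followed by a literal dot
// without consuming any extra char.
const fixed = author.match(/\b\w+\b(?!\.)/g);
console.log(fixed); // [ 'George', 'Martin' ]
```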
