Regex-Groups in Javascript - javascript

I have a problem with a JavaScript regexp.
This is a heavily simplified regexp that demonstrates my problem:
(?:\s(\+\d\w*))|(\w+)
This regex should only match strings that don't contain forbidden characters (anything that is not a word character).
The only exception is the symbol +.
A match may start with this symbol if a digit [0-9] follows it.
And a + must not appear within words (44+44 is not a valid match, but +4ad is).
In order to allow the + only at the beginning, I said that there must be a whitespace preceding. However, I don't want the whitespace to be part of the match.
I tested my regex with this tool: http://regex101.com/#javascript and the resulting matches look fine.
There are 2 Issues with that regexp:
If I use it in my JS-Code, the space is always part of the match
If +42 appears at the beginning of a line, it won't be matched
My Questions:
What should the regex look like?
Why does this regex add the space to the matches?
Here's my JS-Code:
var input = "+5ad6 +5ad6 sd asd+as +we";
var regexp = /(?:\s(\+\d\w*))|(\w+)/g;
var tokens = input.match(regexp);
console.log(tokens);

How should the regex look like?
You've got multiple choices to reach your goal:
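It's fine as you have it. You might allow the beginning of the string in place of the whitespace as well, though, e.g. with (?:^|\s). Just read the capturing groups (groups 1 and 2 of each match) instead of the whole match; they will not include the whitespace. Note that input.match() with the g flag returns only the whole matches, so you need regexp.exec() in a loop to get at the groups.
If you weren't using JavaScript, a lookbehind could help. Unfortunately, JavaScript did not support it when this was written (lookbehind was only added in ES2018).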
Require a non-word-boundary before the +, which would make every \w character before the + prevent the match:
/\B\+\d\w*|\w+/
Why does this regex add the space to the matches?
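Because the regex does match the whitespace: the non-capturing group (?:...) still consumes the \s as part of the overall match. The whitespace is just not added to the captured groups; group 1 holds only the \+\d\w* part.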
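A minimal sketch of the first and third options, run against the input from the question. The (?:^|\s) prefix is my reading of "allow the string beginning", the quantifier \w* is taken from the question's regex, and the function names are made up for illustration:

```javascript
const input = "+5ad6 +5ad6 sd asd+as +we";

// Option 1: let the match include the whitespace, but collect only the groups.
function tokenizeWithGroups(text) {
  const regexp = /(?:^|\s)(\+\d\w*)|(\w+)/g;
  const tokens = [];
  let match;
  // exec() with the g flag walks through the string one match at a time.
  while ((match = regexp.exec(text)) !== null) {
    // Exactly one of the two groups takes part in each match.
    tokens.push(match[1] !== undefined ? match[1] : match[2]);
  }
  return tokens;
}

// Option 3: \B before the + fails whenever a word character precedes it.
function tokenizeWithNonBoundary(text) {
  return text.match(/\B\+\d\w*|\w+/g) || [];
}

console.log(tokenizeWithGroups(input));
console.log(tokenizeWithNonBoundary(input));
// both log: [ '+5ad6', '+5ad6', 'sd', 'asd', 'as', 'we' ]
```

Both variants split 44+44 into two plain 44 tokens, since the + is neither at the start of the string nor preceded by whitespace.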

Related

Regex including all special characters except space

I have a regex which checks all the special characters except space but that looks weird and too long.
const specialCharsRegex = new RegExp(/#|#|\$|!|%|&|\^|\*|-|\+|_|=|{|}|\[|\]|\(|\)|~|`|\.|\?|\<|\>|,|\/|:|;|"|'|\\/).
This looks too long, and if I use the regex (\W) it also includes the space.
Is there any way I can achieve this?
Well you could use:
[^\w ]
This matches non word characters except for space. You may blacklist anything else you also might want by adding it to the above character class.
To match anything that is not a word character nor a whitespace character (cr, lf, ff, space, tab)
const specialCharsRegex = new RegExp(/[^\w\s]+|_+/, 'g');
See this demo at regex101 or a JS replace demo at tio.run (used g flag for all occurrences)
The underscore belongs to word characters [A-Za-z0-9_] and needs to be matched separately.
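As a rough illustration of the two suggestions above, the following sketch checks for and strips special characters. The helper names hasSpecialChar and stripSpecialChars are invented for this example:

```javascript
// Any non-word character other than a literal space (tabs still count as special).
const nonWordExceptSpace = /[^\w ]/;

// Runs of characters that are neither word nor whitespace characters,
// plus runs of underscores, which \w would otherwise let through.
const specialOrUnderscore = /[^\w\s]+|_+/g;

function hasSpecialChar(text) {
  return nonWordExceptSpace.test(text);
}

function stripSpecialChars(text) {
  return text.replace(specialOrUnderscore, "");
}

console.log(hasSpecialChar("hello world"));   // false
console.log(hasSpecialChar("hello, world"));  // true
console.log(hasSpecialChar("snake_case"));    // false, "_" is a word character
console.log(stripSpecialChars("a_b c-d!"));   // "ab cd"
```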
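Try a negated character class over A-Z, a-z and 0-9 (Java syntax shown). As written it also matches the space; add a space inside the brackets, [^A-Za-z0-9 ], to exclude it.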
Pattern regex = Pattern.compile("[^A-Za-z0-9]");

RegEx matching help: won't match on each appearance

I need to write a little RegEx matcher which will match any occurrence of strings in the form of
[a-zA-Z]+(_[a-zA-Z0-9]+)?
If I use the regex above it does match the sections needed but would also match onto the abc part of 4_abc which is not intended. I tried to exclude it with:
(?:[^a-zA-Z0-9_]|^)([a-zA-Z]+(_[a-zA-Z0-9]+)?)(?:[^a-zA-Z0-9_]|$)
The problem is that the 'not' matches at the beginning and end are not really working like I hoped they would. If I use them on the example
a_d Dd_da 4_d d_4
they would block matching the second token, Dd_da, because the space was already consumed by the first match. Sadly, I can't use lookarounds because I am using JS.
So the input:
a_d Dd_da 4_d d_4
should match: a_d, Dd_da and d_4
but matches: a_d (there is a space at the end)
Is there another way to match the needed sections, or to not consume the 'anchor' matches?
I really appreciate your help.
You can make use of \b:
\b[a-zA-Z]+(_[a-zA-Z0-9]+)?\b
\b matches the (zero-width) point where either the preceding character or following character is a letter, digit or underscore, but not both. It also matches with the start/end of the string if the first/last character is a letter, digit or underscore.
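A short sketch of the \b version applied to the example input from the question (the function name findIdentifiers is just for illustration). Because \b is zero-width, the separating space is never consumed and stays available for the next match:

```javascript
const identifier = /\b[a-zA-Z]+(_[a-zA-Z0-9]+)?\b/g;

function findIdentifiers(text) {
  // match() with the g flag returns all whole matches, or null if there are none.
  return text.match(identifier) || [];
}

console.log(findIdentifiers("a_d Dd_da 4_d d_4"));
// [ 'a_d', 'Dd_da', 'd_4' ]
```

The d in 4_d is skipped because the underscore before it is a word character, so there is no boundary in front of the d.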

Why does the "." not get caught by the regex?

I want to have a regular expression that checks whether a given email address is correct. First, it checks whether a specific provider is present, in this case (#test.de); this is not the problem. However, the email names that are allowed must consist only of letters or dots, so .#test.de is valid. This specific case does not get accepted, though. My regex looks like the following:
[A-Za-z\.]{1,}\b#test\.de\b
It works fine for all other cases, but if there is only a "." in front of the #, it does not match.
Any pointers what I am doing wrong?
The first word boundary \b in your pattern requires a word character immediately before the #. Thus, a dot cannot appear there, and the match fails.
You need to remove the word boundary, use
[A-Za-z.]+#test\.de\b
Note you do not need to escape a dot inside a character class, it already denotes a literal dot.
If you still want to match "whole" words after removing \b, you might use lookbehinds (if the regex engine supports them):
(?<!\w)[A-Za-z.]+#test\.de\b
or to only match after whitespace/start of string:
(?<!\S)[A-Za-z.]+#test\.de\b
Or just use a word boundary if the name starts with a letter, and a non-word boundary if it starts with a dot:
(?:\b[A-Za-z]|\B\.)[A-Za-z.]*#test\.de\b
See this demo
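To see the difference between the patterns, here is a small sketch that runs three of them (the original one and the two variants without lookbehind) against a few sample addresses. The variable names and the sample inputs are made up for illustration, and the patterns are used unanchored, exactly as written above:

```javascript
// Original pattern: the \b before # needs a word character right before the #.
const withBoundary = /[A-Za-z\.]{1,}\b#test\.de\b/;

// First fix: drop that word boundary.
const withoutBoundary = /[A-Za-z.]+#test\.de\b/;

// Whole-word variant: \b if the name starts with a letter, \B if it starts with a dot.
const wholeWord = /(?:\b[A-Za-z]|\B\.)[A-Za-z.]*#test\.de\b/;

for (const sample of [".#test.de", "a.b#test.de", "1.#test.de"]) {
  console.log(
    sample,
    withBoundary.test(sample),
    withoutBoundary.test(sample),
    wholeWord.test(sample)
  );
}
// .#test.de   false true true
// a.b#test.de true  true true
// 1.#test.de  false true false
```

The last row shows why the whole-word variant can still be useful: without any boundary check, the pattern happily matches the tail of a name that starts with a disallowed character.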

RegExp to match hashtag at the beginning of the string or after a space

I have looked through previous questions and answers, however they do not solve the following:
https://stackoverflow.com/questions/ask#notHashTag
The closest I got to is this: (^#|(?:\s)#)(\w+), which finds the hashtag in half the necessary cases and also includes the leading space in the returned text. Here are all the cases that need to be matched:
#hashtag
a #hashtag
a #hashtag world
cool.#hashtag
##hashtag, but only until the comma and starting at second hash
#hashtag#hashtag two separate matches
And these should be skipped:
https://stackoverflow.com/questions/ask#notHashTag
Word#notHashTag
#ab is too short to be a hashtag, 3 characters minimum
This should work for everything but #hashtag#duplicates. Because JS didn't support lookbehind when this was written, it's probably not possible to match that case with a single pattern.
\B#\w{3,}
\B is designed to match only between two word characters or two non-word characters. Since # is a non-word character, this forces the match to be preceded by a space or punctuation, or the beginning of the string.
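A quick sketch that runs \B#\w{3,} over the cases listed in the question (findHashtags is a made-up helper name). As noted, the second tag in #hashtag#hashtag is not found, because the g before its # creates a word boundary:

```javascript
const hashtag = /\B#\w{3,}/g;

function findHashtags(text) {
  return text.match(hashtag) || [];
}

console.log(findHashtags("a #hashtag world"));     // [ '#hashtag' ]
console.log(findHashtags("cool.#hashtag"));        // [ '#hashtag' ]
console.log(findHashtags("##hashtag, and more"));  // [ '#hashtag' ]
console.log(findHashtags("#hashtag#hashtag"));     // [ '#hashtag' ], only the first one
console.log(findHashtags("Word#notHashTag"));      // []
console.log(findHashtags("#ab is too short"));     // []
```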
Try this regex:
(?:^|[\s.])(#+\w{3,})(#+\w{3,})?
Online Demo: http://regex101.com/r/kG1nD5

Regex expression using word boundary for matching alphanumeric and non alphanumeric characters in javascript

I am trying to highlight a set of keywords using JavaScript and regex, but I am facing one problem: my keywords may contain alphanumeric as well as special characters, as in #text, #number etc. I am using word boundaries to match and replace whole words only, not partial words (contained within another word).
var pattern = new RegExp('\\b(' + keyword + ')\\b', 'gi');
This expression matches whole keywords and highlights them; however, a keyword like "number:" does not get highlighted.
I am aware that \bword\b relies on word boundaries and that special characters are non-alphanumeric, so such keywords are not matched by the above expression.
Can you let me know what regular expression I can use to accomplish the above?
==Update==
For the above I tried Tim Pietzcker's suggestion for the below regex,
expr: (?:^|\\b|\\s)(" + keyword + ")(?:$|\\b|\\s)
The above seems to work for matching whole words with alphanumeric and non-alphanumeric characters. However, whenever an HTML tag immediately precedes or follows the keyword without a space, the keyword is not highlighted (e.g. social security *number:< br >*)
I tried the following regex, but it replaces the HTML tag preceding the keyword
expr: (?:^|\b|\s|<[^>]+>)number:(?:$|\b|\s|<[^>]+>)
Here, for the keyword number:, the < br > tag that follows it without a space in between (the space inside the br tag was added intentionally to keep the browser from interpreting it) gets highlighted together with the keyword.
Can you suggest an expression that ignores an adjacent HTML tag while still matching whole words with both alphanumeric and non-alphanumeric characters?
2021 update: JS now supports lookbehind so this answer is a little outdated.
OK, so you have two problems: JavaScript doesn't support lookbehind, and \b only finds boundaries between alphanumeric and non-alphanumeric characters.
The first question: What exactly does constitute a word boundary for your keywords? My guess is that it must be either a \b boundary or whitespace. If that is the case, you could search for
"(?:^|\\b|\\s)(" + keyword + ")(?:$|\\b|\\s)"
Of course, the whitespace characters around keywords like #number# would also become part of the match, but perhaps highlighting those isn't such a problem. In other cases, i.e. if there is an actual word boundary that can match, the spaces won't be part of the match, so it should work fine in the majority of cases.
The actual word you're interested in will be in backreference #1, so if you can highlight that separately, even better.
EDIT:
If other characters than space may occur after/before a keyword, then I think the only thing you can do (if you're stuck with JavaScript) is:
Check if your keyword starts with an alnum character.
If so, prepend \b to your regex.
Check if your keyword ends with an alnum character.
If so, append \b to your regex.
So, for keyword, use \bkeyword\b; for number:, use \bnumber:; for #twitter, use #twitter\b.
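The four steps above could be sketched like this. I am assuming that "alnum" means a \w character (which also includes the underscore, matching what \b itself looks at). The helpers escapeForRegExp and buildKeywordRegex are hypothetical names, and escaping the keyword is an addition that the steps do not mention:

```javascript
// Escape regex metacharacters so the keyword is matched literally (assumed helper).
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function buildKeywordRegex(keyword) {
  // Only add \b on a side where the keyword has a word character,
  // otherwise \b could never match next to whitespace or punctuation.
  const prefix = /^\w/.test(keyword) ? "\\b" : "";
  const suffix = /\w$/.test(keyword) ? "\\b" : "";
  return new RegExp(prefix + escapeForRegExp(keyword) + suffix, "gi");
}

console.log(buildKeywordRegex("keyword").source);   // \bkeyword\b
console.log(buildKeywordRegex("number:").source);   // \bnumber:
console.log(buildKeywordRegex("#twitter").source);  // #twitter\b

console.log("social security number:<br>".replace(buildKeywordRegex("number:"), "[$&]"));
// social security [number:]<br>
```

Since nothing is required after number:, the following HTML tag is neither needed for the match nor included in it.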
We need to look for a substring that has a whitespace character on both sides. If JavaScript supported lookbehind, this would look like:
var re = new RegExp('(?<!\\S)' + keyword + '(?!\\S)', 'gi');
That won't work though (but would in Perl and other scripting languages). Instead, we need to include the leading whitespace character (or beginning of string) as the beginning part of the match (and optionally capture what we are really looking for into $1):
var re = new RegExp('(?:^|\\s)(' + keyword + ')(?!\\S)', 'gi');
Just keep in mind that the real start of the keyword may be one character after the .index reported by re.exec(string) (whenever a leading whitespace character was consumed), and that if you need the matched string, you either have to strip that leading character or simply use what was captured.
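A minimal sketch of that bookkeeping, assuming the keyword should be matched literally (hence the escaping step, which is my addition). The function name findKeyword and the shape of the returned objects are invented for this example:

```javascript
function findKeyword(text, keyword) {
  // Escape regex metacharacters in the keyword (assumed to be wanted here).
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp("(?:^|\\s)(" + escaped + ")(?!\\S)", "gi");
  const hits = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    // m[0] may start with one whitespace character; m[1] is the keyword itself.
    const start = m.index + (m[0].length - m[1].length);
    hits.push({ start: start, text: m[1] });
  }
  return hits;
}

console.log(findKeyword("number: 12 and number:", "number:"));
// [ { start: 0, text: 'number:' }, { start: 15, text: 'number:' } ]
```

Computing the offset from the length difference avoids an off-by-one when the match begins at the start of the string, where no whitespace was consumed.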
Maybe what you're trying to do is
'\b\W*(' + keyword + ')\W*\b'
Lookahead and lookbehind are your answer: "(?<=\\s|^)" + keyword + "(?=\\s|$)". The lookaround groups aren't included in the match, so put whatever characters aren't permitted in the keywords in there.
As Tim correctly points out, \b are tricky things that work differently than the way people often think they work. Read this answer for more details about this matter, and what you can do about it.
In brief, this is a boundary to the left:
(?(?=\w)(?<!\w)|(?<!\W))
and this is a boundary to the right:
(?(?<=\w)(?!\w)|(?!\W))
People always think there are spaces involved, but there aren't. However, now that you know the real definitions, it's easy to build that into them. One could swap out \w and \W in exchange for \s and \S in the two patterns above. Or one could add whitespace awareness to the else blocks.
Try this it should work...
var pattern = new regex(#"\b"+Regex.escape(keyword)+#"\b",gi);
