I have this regexp:
(\b)(emozioni|gioia|felicità)(\b)
In a string like the one below:
emozioni emozioniamo felicità felicitàs
it should match the first and the third word. Instead it matches the first and the last. I assume it is because of the accented character. I tried this alternative:
(\b)(emozioni|gioia|felicità\s)(\b)
but it matched "felicità" only if there is another word after it. To be specific, it matches only in this context:
emozioni emozioniamo felicità felicitàs
and not in this other:
emozioni emozioniamo felicitàs felicità
I've found an article about accented characters in French (so at the beginning of the word) here, and I followed the second answer. If anyone knows a better solution, it is very welcome.
A word boundary \b works only with characters that are in the \w character class, i.e. [0-9a-zA-Z_], so \b does not match between an accented character like à and a following space or the end of the string.
You can solve the problem in your case using a lookahead:
felicità(?=\s|$)
or shorter:
felicità(?!\S)
(or \W in place of \s, as @Sniffer suggested, but then you risk matching the start of something like felicitàà)
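The difference between the two patterns can be checked with a short JavaScript sketch. It assumes the engine's default ASCII-only \b and a precomposed à (a single code point); the helper name matchOffsets is made up for illustration.

```javascript
// Hypothetical helper: returns [index, matchedText] pairs for every match.
// The regex passed in must carry the g flag.
function matchOffsets(regex, text) {
  const hits = [];
  let m;
  regex.lastIndex = 0;
  while ((m = regex.exec(text)) !== null) {
    hits.push([m.index, m[0]]);
  }
  return hits;
}

const sample = "emozioni emozioniamo felicità felicitàs";

// \b needs a \w character on exactly one side, and à is not in \w,
// so the standalone "felicità" fails while the one inside "felicitàs" matches.
const withBoundary = /\b(emozioni|gioia|felicità)\b/g;

// (?!\S) only requires that no non-whitespace character follows.
const withLookahead = /\b(emozioni|gioia|felicità)(?!\S)/g;

console.log(matchOffsets(withBoundary, sample));  // matches at index 0 and 30
console.log(matchOffsets(withLookahead, sample)); // matches at index 0 and 21
```

Index 21 is the standalone felicità and index 30 is the start of felicitàs, which is the behaviour described in the question.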
Try the following alternative:
\b(emozioni|gioia|felicità)(?=\W|$)
This will match any of your listed words, as long as any of those words is followed by either a non-word character \W or end-of-string $.
Regex101 Demo
I'm building on a regular expression I found that works well for my use case. The purpose is to check for what I consider valid hashtags (I know there's a ton of hashtag regex posts on SO but this question is specific).
Here's the regex I'm using
/(^|\B)#(?![0-9_]+\b)([a-zA-Z0-9_]{1,20})(\b|\r)/g
The only problem I'm having is I can't figure out how to check if the second character is a-z (the first character would be the hashtag). I only want the first character after the hashtag to be a-z or A-Z. No numbers or non-alphanumeric.
Any help is much appreciated; I'm a novice when it comes to regular expressions.
As I mentioned in the comments, you can replace [a-zA-Z0-9_]{1,20} with [a-zA-Z][a-zA-Z0-9_]{0,19} so that the first character is guaranteed to be a letter and then followed by 0 to 19 word characters (alphanumeric or underscore).
However, there are other unnecessary parts in your pattern. It appears that all you need is something like this:
/(?:^|\B)#[a-zA-Z][a-zA-Z0-9_]{0,19}\b/g
Demo.
Breakdown of (?:^|\B):
(?: # Start of a non-capturing group (don't use a capturing group unless needed).
^ # Beginning of the string/line.
| # Alternation (OR).
\B # The opposite of `\b`. In other words, it makes sure that
# the `#` is not preceded by a word character.
) # End of the non-capturing group.
Note: You may also replace [a-zA-Z0-9_] with \w.
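As a quick check of the simplified pattern, here is a minimal JavaScript sketch. The helper name findHashtags and the sample text are assumptions made up for illustration.

```javascript
const HASHTAG = /(?:^|\B)#[a-zA-Z][a-zA-Z0-9_]{0,19}\b/g;

// Hypothetical helper: returns every hashtag the pattern accepts.
function findHashtags(text) {
  return text.match(HASHTAG) || [];
}

// "#1nope" and "#_no" fail because the first character is not a letter;
// "word#glued" fails because a word character precedes the "#".
console.log(findHashtags("#valid #1nope #_no word#glued #a1_b"));
// → [ '#valid', '#a1_b' ]
```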
References:
Word Boundaries.
Difference between \b and \B in regex.
The below should work.
(^|\B)#(?![0-9_]+\b)([a-zA-Z][a-zA-Z0-9_]{0,19})(\b|\r)
If you only want to accept hashtags of two or more characters, change {0,19} to {1,19}.
You can test it here
In your pattern you use (?![0-9_]+\b), which only asserts that what is directly to the right is not made up entirely of digits and underscores. It still lets through first characters other than an upper or lower case a-z, for example a digit or underscore followed by letters.
If you want, you can keep [a-zA-Z0-9_]{1,20}, but then you have to use a positive lookahead, (?=[a-zA-Z]), to assert that what is directly to the right is an upper or lower case a-z.
(?:^|\B)#(?=[a-zA-Z])[a-zA-Z0-9_]{1,20}\b
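To see how the two lookaheads differ, the sketch below runs the question's pattern and the positive-lookahead variant over a few made-up tags. The variable names are illustrative only.

```javascript
// The pattern from the question (negative lookahead).
const original = /(^|\B)#(?![0-9_]+\b)([a-zA-Z0-9_]{1,20})(\b|\r)/;
// The variant with a positive lookahead for a leading letter.
const withPositiveLookahead = /(?:^|\B)#(?=[a-zA-Z])[a-zA-Z0-9_]{1,20}\b/;

// Neither regex uses the g flag, so test() is stateless here.
const results = ["#abc", "#123", "#1abc", "#_x"].map((tag) => [
  tag,
  original.test(tag),
  withPositiveLookahead.test(tag),
]);

console.log(results);
// → [ [ '#abc', true, true ],
//     [ '#123', false, false ],
//     [ '#1abc', true, false ],
//     [ '#_x', true, false ] ]
```

The negative lookahead only rejects tags that consist solely of digits and underscores, while the positive lookahead rejects anything that does not start with a letter.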
Regex demo
I have come so far:
1) Run regex / /g to match all spaces.
2) Run a new call to regex /\b( )\b/g to match the spaces that need to be excluded.
Now I need them both fused in one statement. All spaces except the ones returned by the second. Any help?
Live regex for testing: https://regex101.com/r/26w2WR/1
EDIT: Although good answers are already available, I found that trying to match "words" with \b or \B is not always a good idea, because many printable characters, such as dots and quotes, are not treated as word characters by the regex engine. Another problem arises when you loop through DOM nodes: sometimes you encounter inline styling tags like <strong>, which should also count as the beginning or end of a word, but a #text node simply ends before the tag. So you may want to include the start and end of the string in the regex too. For anyone wishing to address these cases, I ended up with this regex:
/(\S|^)( )(?=\S|$)/g
This uses \S (not whitespace), includes the start/end of the string, and uses groups so the match can be replaced. The replacement in JS looks like this:
yourTextNode.replace(/(\S|^)( )(?=\S|$)/g, '$1'+ yourreplacement)
To match non-breaking space (&nbsp;) characters, you can use (\u00A0) instead of ( ).
Hope this helps.
You can use negative look-ahead:
(?!\b \b)( )
Without any look-around you can use this regex with \B and alternation:
\B +| +\B
Updated RegEx Demo
\B asserts a position where \b does not match.
The pattern above matches spaces that are preceded or followed by \B.
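Both suggestions can be tried on a small sample. This is only a sketch: the sample string and the "_" replacement are assumptions chosen to make the matched spaces visible.

```javascript
const sample = "foo bar  baz , qux";

// Negative lookahead: skip a space that has a word boundary on both sides.
const viaLookahead = sample.replace(/(?!\b \b)( )/g, "_");

// Alternation with \B: spaces preceded or followed by a non-boundary.
const viaAlternation = sample.replace(/\B +| +\B/g, "_");

console.log(viaLookahead);   // → foo bar__baz_,_qux
console.log(viaAlternation); // → foo bar__baz_,_qux
```

On this sample the single space between foo and bar is left alone by both patterns, while the double space and the spaces around the comma are replaced.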
I have a regex here at scriptular.com
/(?=.*net)(?=.*income)(?=.*total)(?=.*depreciation)/i
How do I make the regex successfully match the string?
Without the newline characters in the string, the regex would succeed. I could remove them... but I'd rather not.
1.) The dot matches any character except a newline, so it won't cross newlines when the desired words appear on lines after the first one. Many regex flavors have a dotall (single-line) s flag that makes the dot also match newlines, but JavaScript only gained it in ES2018, so older engines don't support it.
A workaround is to replace the dot with a character class that matches any character: [\s\S] (any whitespace character together with any non-whitespace character), [\w\W] (any word character together with any non-word character), or even [^] (not nothing, which works in JavaScript only).
2.) Anchor the lookaheads to ^ (start of string), since there is no need to retry the lookaheads at every position in the string. This will drastically improve performance.
3.) Use lazy matching so that each lookahead is satisfied by the first occurrence of its word.
/^(?=[\s\S]*?net)(?=[\s\S]*?income)(?=[\s\S]*?total)(?=[\s\S]*?depreciation)/i
See demo at regex101 (dunno why this doesn't work in your demo tool)
Additionally, you can put \b word boundaries around the words to make sure that, for example, net won't be matched in brunet or network, so the regex becomes ^(?=[\s\S]*?\bnet\b)...
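Putting the three points and the word boundaries together gives something like the following JavaScript sketch. The multi-line report text is invented for illustration, since the original string from the question is not shown here.

```javascript
// Original approach: the dot cannot cross newlines.
const dotOnly = /(?=.*net)(?=.*income)(?=.*total)(?=.*depreciation)/i;

// Anchored, lazy, [\s\S] instead of the dot, plus \b around each word.
const anyChar =
  /^(?=[\s\S]*?\bnet\b)(?=[\s\S]*?\bincome\b)(?=[\s\S]*?\btotal\b)(?=[\s\S]*?\bdepreciation\b)/i;

// Made-up sample with the four words spread over several lines.
const report = "Total depreciation\nwas deducted from\nNet Income";

console.log(dotOnly.test(report)); // → false
console.log(anyChar.test(report)); // → true

// With the boundaries in place, "brunet" no longer counts as "net".
console.log(anyChar.test("brunet income total depreciation")); // → false
```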
Now, I know I can check whether a string contains a particular substring using this:
if(str.indexOf("substr") > -1){
}
With my substring 'GRE', I want to match these entries in my autocomplete list:
GRE:Math
GRE-Math
But I don't want to match:
CONGRES
and I particularly need to match:
NON-WORD-CHAR GRE NON-WORD-CHAR
and also
GRE
What should be the perfect regex in my case?
Maybe you want to use \b word boundaries:
(\bGRE\b)
Here is the explanation
Demo: http://regex101.com/r/hJ3vL6
MD, if I understood your spec, this simple regex should work for you:
\W?GRE(?!\w)(?:\W\w+)?
But I would prefer something like [:-]?GRE(?!\w)(?:[:-]\w+)? if you are able to specify which non-word characters you are willing to allow (see explanation below).
This will match
GRE
GRE:Math
GRE-Math
but not CONGRES
Ideally, though, I would replace the \W (non-word character) with a list of allowable characters, for instance [-:]. Why? Because \W would also match non-word characters you do not want, such as spaces and carriage returns. So what goes in that list is for you to decide.
How does this work?
\W? optionally matches one single non-word character as you specified. Then we match the literal GRE. Then the lookahead (?!\w) asserts that the next character cannot be a word character. Then, optionally, we match a non-word character (as per your spec) followed by any number of word characters.
Depending on where you see this appearing, you could add boundaries.
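A minimal JavaScript sketch of the [:-] variant follows. The helper firstMatch and the extra input XGRE are assumptions added for illustration.

```javascript
const GRE_PATTERN = /[:-]?GRE(?!\w)(?:[:-]\w+)?/;

// Hypothetical helper: returns the matched text, or null when nothing matches.
function firstMatch(text) {
  const m = text.match(GRE_PATTERN);
  return m ? m[0] : null;
}

console.log(firstMatch("GRE"));      // → GRE
console.log(firstMatch("GRE:Math")); // → GRE:Math
console.log(firstMatch("GRE-Math")); // → GRE-Math
console.log(firstMatch("CONGRES"));  // → null
console.log(firstMatch("XGRE"));     // → GRE
```

The last case shows why a leading boundary may be worth adding: nothing in the pattern stops GRE from matching at the end of a longer word.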
You can use the regex: /\bGRE\b/g
if(/\bGRE\b/g.test("string")){
// the string has GRE
}
else{
// it doesn't have GRE
}
\b means word boundary
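Applied to an autocomplete list, the same test could look like the sketch below. The entries array and variable names are assumptions. The regex is stored without the g flag because a reused global regex keeps its lastIndex between test() calls, which can make consecutive calls give inconsistent results.

```javascript
// Made-up autocomplete entries based on the examples in the question.
const entries = ["GRE:Math", "GRE-Math", "CONGRES", "GRE", "(GRE)"];

// No g flag: test() stays stateless when the regex is reused.
const hasWholeWordGRE = /\bGRE\b/;

const suggestions = entries.filter((entry) => hasWholeWordGRE.test(entry));

console.log(suggestions);
// → [ 'GRE:Math', 'GRE-Math', 'GRE', '(GRE)' ]
```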
I have this regex that detects hashtags. It shouldn't match things with letters before them, so we've got a space character at the beginning of the regex:
/( #[a-zA-Z_]+)/gm
The issue is it no longer matches words at the beginning of sentences. How can I modify this regex so that instead of matching with spaces, it simply DOESN'T match things with letters before them.
Thanks!
Use \b at the start to indicate a word boundary.
\b won't work, since # isn't a word character: a \b directly before # only matches when a word character precedes it.
Just check for the start of the string or a space before: (?:^|\s)(\#[a-zA-Z_]+)
Also, if your regex flavor runs in free-spacing (extended) mode, make sure you escape the #, so it doesn't get interpreted as the start of a comment.
Without lookbehind:
pattern = /(?:^|[^a-zA-Z])#[a-zA-Z]+/
With lookbehind (supported in JavaScript only from ES2018 onward):
pattern = "(?:^|(?<![a-zA-Z]))#[a-zA-Z]+"
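Both variants can be compared in JavaScript, assuming an engine with ES2018 lookbehind support for the second one. The sample text and the added capture group are illustrative choices; the group is needed in the first variant because the character before the # is consumed by the match.

```javascript
const text = "#start of line, mid #tag and abc#nope";

// Without lookbehind: the preceding character is part of the match,
// so the hashtag itself is pulled out with a capture group.
const consuming = /(?:^|[^a-zA-Z])(#[a-zA-Z]+)/g;
const tagsWithoutLookbehind = [];
let m;
while ((m = consuming.exec(text)) !== null) {
  tagsWithoutLookbehind.push(m[1]);
}

// With lookbehind: nothing before the "#" is consumed.
const tagsWithLookbehind = text.match(/(?<![a-zA-Z])#[a-zA-Z]+/g) || [];

console.log(tagsWithoutLookbehind); // → [ '#start', '#tag' ]
console.log(tagsWithLookbehind);    // → [ '#start', '#tag' ]
```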