Is there a way to achieve the equivalent of a negative lookbehind in JavaScript regular expressions? I need to match a string that does not start with a specific set of characters.
It seems I am unable to find a regex that does this without failing if the matched part is found at the beginning of the string. Negative lookbehinds seem to be the only answer, but JavaScript doesn't has one.
This is the regex that I would like to work, but it doesn't:
(?<!([abcdefg]))m
So it would match the 'm' in 'jim' or 'm', but not 'jam'
Since 2018, Lookbehind Assertions are part of the ECMAScript language specification.
// positive lookbehind
(?<=...)
// negative lookbehind
(?<!...)
Answer pre-2018
As Javascript supports negative lookahead, one way to do it is:
reverse the input string
match with a reversed regex
reverse and reformat the matches
const reverse = s => s.split('').reverse().join('');
const test = (stringToTests, reversedRegexp) => stringToTests
.map(reverse)
.forEach((s,i) => {
const match = reversedRegexp.test(s);
console.log(stringToTests[i], match, 'token:', match ? reverse(reversedRegexp.exec(s)[0]) : 'Ø');
});
Example 1:
Following #andrew-ensley's question:
test(['jim', 'm', 'jam'], /m(?!([abcdefg]))/)
Outputs:
jim true token: m
m true token: m
jam false token: Ø
Example 2:
Following #neaumusic comment (match max-height but not line-height, the token being height):
test(['max-height', 'line-height'], /thgieh(?!(-enil))/)
Outputs:
max-height true token: height
line-height false token: Ø
Lookbehind Assertions got accepted into the ECMAScript specification in 2018.
Positive lookbehind usage:
console.log(
"$9.99 €8.47".match(/(?<=\$)\d+\.\d*/) // Matches "9.99"
);
Negative lookbehind usage:
console.log(
"$9.99 €8.47".match(/(?<!\$)\d+\.\d*/) // Matches "8.47"
);
Platform support:
✔️ V8
✔️ Google Chrome 62.0
✔️ Microsoft Edge 79.0
✔️ Node.js 6.0 behind a flag and 9.0 without a flag
✔️ Deno (all versions)
✔️ SpiderMonkey
✔️ Mozilla Firefox 78.0
🛠️ JavaScriptCore: support in beta as of 2023-02-16: the feature has been merged
✔️ Apple Safari 16.4 (in beta as of 2023-02-16)
✔️ iOS 16.4 beta WebView (all browsers on iOS + iPadOS)
✔️ Bun 0.2.2
❌ Chakra: Microsoft was working on it but Chakra is now abandoned in favor of V8
❌ Internet Explorer
❌ Edge versions prior to 79 (the ones based on EdgeHTML+Chakra)
Let's suppose you want to find all int not preceded by unsigned :
With support for negative look-behind:
(?<!unsigned )int
Without support for negative look-behind:
((?!unsigned ).{9}|^.{0,8})int
Basically idea is to grab n preceding characters and exclude match with negative look-ahead, but also match the cases where there's no preceeding n characters. (where n is length of look-behind).
So the regex in question:
(?<!([abcdefg]))m
would translate to:
((?!([abcdefg])).|^)m
You might need to play with capturing groups to find exact spot of the string that interests you or you want to replace specific part with something else.
Mijoja's strategy works for your specific case but not in general:
js>newString = "Fall ball bill balll llama".replace(/(ba)?ll/g,
function($0,$1){ return $1?$0:"[match]";});
Fa[match] ball bi[match] balll [match]ama
Here's an example where the goal is to match a double-l but not if it is preceded by "ba". Note the word "balll" -- true lookbehind should have suppressed the first 2 l's but matched the 2nd pair. But by matching the first 2 l's and then ignoring that match as a false positive, the regexp engine proceeds from the end of that match, and ignores any characters within the false positive.
Use
newString = string.replace(/([abcdefg])?m/, function($0,$1){ return $1?$0:'m';});
You could define a non-capturing group by negating your character set:
(?:[^a-g])m
...which would match every m NOT preceded by any of those letters.
This is how I achieved str.split(/(?<!^)#/) for Node.js 8 (which doesn't support lookbehind):
str.split('').reverse().join('').split(/#(?!$)/).map(s => s.split('').reverse().join('')).reverse()
Works? Yes (unicode untested). Unpleasant? Yes.
following the idea of Mijoja, and drawing from the problems exposed by JasonS, i had this idea; i checked a bit but am not sure of myself, so a verification by someone more expert than me in js regex would be great :)
var re = /(?=(..|^.?)(ll))/g
// matches empty string position
// whenever this position is followed by
// a string of length equal or inferior (in case of "^")
// to "lookbehind" value
// + actual value we would want to match
, str = "Fall ball bill balll llama"
, str_done = str
, len_difference = 0
, doer = function (where_in_str, to_replace)
{
str_done = str_done.slice(0, where_in_str + len_difference)
+ "[match]"
+ str_done.slice(where_in_str + len_difference + to_replace.length)
len_difference = str_done.length - str.length
/* if str smaller:
len_difference will be positive
else will be negative
*/
} /* the actual function that would do whatever we want to do
with the matches;
this above is only an example from Jason's */
/* function input of .replace(),
only there to test the value of $behind
and if negative, call doer() with interesting parameters */
, checker = function ($match, $behind, $after, $where, $str)
{
if ($behind !== "ba")
doer
(
$where + $behind.length
, $after
/* one will choose the interesting arguments
to give to the doer, it's only an example */
)
return $match // empty string anyhow, but well
}
str.replace(re, checker)
console.log(str_done)
my personal output:
Fa[match] ball bi[match] bal[match] [match]ama
the principle is to call checker at each point in the string between any two characters, whenever that position is the starting point of:
--- any substring of the size of what is not wanted (here 'ba', thus ..) (if that size is known; otherwise it must be harder to do perhaps)
--- --- or smaller than that if it's the beginning of the string: ^.?
and, following this,
--- what is to be actually sought (here 'll').
At each call of checker, there will be a test to check if the value before ll is not what we don't want (!== 'ba'); if that's the case, we call another function, and it will have to be this one (doer) that will make the changes on str, if the purpose is this one, or more generically, that will get in input the necessary data to manually process the results of the scanning of str.
here we change the string so we needed to keep a trace of the difference of length in order to offset the locations given by replace, all calculated on str, which itself never changes.
since primitive strings are immutable, we could have used the variable str to store the result of the whole operation, but i thought the example, already complicated by the replacings, would be clearer with another variable (str_done).
i guess that in terms of performances it must be pretty harsh: all those pointless replacements of '' into '', this str.length-1 times, plus here manual replacement by doer, which means a lot of slicing...
probably in this specific above case that could be grouped, by cutting the string only once into pieces around where we want to insert [match] and .join()ing it with [match] itself.
the other thing is that i don't know how it would handle more complex cases, that is, complex values for the fake lookbehind... the length being perhaps the most problematic data to get.
and, in checker, in case of multiple possibilities of nonwanted values for $behind, we'll have to make a test on it with yet another regex (to be cached (created) outside checker is best, to avoid the same regex object to be created at each call for checker) to know whether or not it is what we seek to avoid.
hope i've been clear; if not don't hesitate, i'll try better. :)
Using your case, if you want to replace m with something, e.g. convert it to uppercase M, you can negate set in capturing group.
match ([^a-g])m, replace with $1M
"jim jam".replace(/([^a-g])m/g, "$1M")
\\jiM jam
([^a-g]) will match any char not(^) in a-g range, and store it in first capturing group, so you can access it with $1.
So we find im in jim and replace it with iM which results in jiM.
As mentioned before, JavaScript allows lookbehinds now. In older browsers you still need a workaround.
I bet my head there is no way to find a regex without lookbehind that delivers the result exactly. All you can do is working with groups. Suppose you have a regex (?<!Before)Wanted, where Wanted is the regex you want to match and Before is the regex that counts out what should not precede the match. The best you can do is negate the regex Before and use the regex NotBefore(Wanted). The desired result is the first group $1.
In your case Before=[abcdefg] which is easy to negate NotBefore=[^abcdefg]. So the regex would be [^abcdefg](m). If you need the position of Wanted, you must group NotBefore too, so that the desired result is the second group.
If matches of the Before pattern have a fixed length n, that is, if the pattern contains no repetitive tokens, you can avoid negating the Before pattern and use the regular expression (?!Before).{n}(Wanted), but still have to use the first group or use the regular expression (?!Before)(.{n})(Wanted) and use the second group. In this example, the pattern Before actually has a fixed length, namely 1, so use the regex (?![abcdefg]).(m) or (?![abcdefg])(.)(m). If you are interested in all matches, add the g flag, see my code snippet:
function TestSORegEx() {
var s = "Donald Trump doesn't like jam, but Homer Simpson does.";
var reg = /(?![abcdefg])(.{1})(m)/gm;
var out = "Matches and groups of the regex " +
"/(?![abcdefg])(.{1})(m)/gm in \ns = \"" + s + "\"";
var match = reg.exec(s);
while(match) {
var start = match.index + match[1].length;
out += "\nWhole match: " + match[0] + ", starts at: " + match.index
+ ". Desired match: " + match[2] + ", starts at: " + start + ".";
match = reg.exec(s);
}
out += "\nResulting string after statement s.replace(reg, \"$1*$2*\")\n"
+ s.replace(reg, "$1*$2*");
alert(out);
}
This effectively does it
"jim".match(/[^a-g]m/)
> ["im"]
"jam".match(/[^a-g]m/)
> null
Search and replace example
"jim jam".replace(/([^a-g])m/g, "$1M")
> "jiM jam"
Note that the negative look-behind string must be 1 character long for this to work.
Related
Right now I'm using this piece of code :
public static bool ContainsEmoji(this string text)
{
Regex rgx = new Regex(#"\p{Cs}");
return rgx.IsMatch(text);
}
And it's being somewhat helpful.
Most of them appear to be detected, but some aren't.
Here's a reference list to help : http://unicode.org/emoji/charts/full-emoji-list.html
All the smiley faces appear to be fine, but these specific emojis do not get caught by the Regex :
1920 U+2614 ☔ umbrella with rain drops
1921 U+26F1 ⛱ umbrella on ground
1922 U+26A1 ⚡ high voltage
1923 U+2744 ❄ snowflake
On the keyboard these are not close to each other, but in the list they are following each other, so I just assumed that there was a point where it would start not working in the emoji list, and it's not really verifying. From 1905 (weather-like emojis), going down, some are caught in the regex, some aren't. There does not seem to be any rule.
I can't afford to just go full ASCII because I need people to enter characters such as cyrillic, but I can't accept emojis specifically. I have no clue how to go forward from here.
I read the MSDN docs about surrogates high/low pairs, but at this stage this is very confusing to me, and I think some push in the right direction would go a long way.
Thank you very much for your time :)
You may use the following regex to match all 4702 emoji characters defined in the Emoji Keyboard/Display Test Data for UTR #51 (Version: 14.0) file:
var EmojiPattern = #"[#*0-9]\uFE0F?\u20E3|©\uFE0F?|[®\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA]\uFE0F?|[\u231A\u231B]|[\u2328\u23CF]\uFE0F?|[\u23E9-\u23EC]|[\u23ED-\u23EF]\uFE0F?|\u23F0|[\u23F1\u23F2]\uFE0F?|\u23F3|[\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB\u25FC]\uFE0F?|[\u25FD\u25FE]|[\u2600-\u2604\u260E\u2611]\uFE0F?|[\u2614\u2615]|\u2618\uFE0F?|\u261D(?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)?|[\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2640\u2642]\uFE0F?|[\u2648-\u2653]|[\u265F\u2660\u2663\u2665\u2666\u2668\u267B\u267E]\uFE0F?|\u267F|\u2692\uFE0F?|\u2693|[\u2694-\u2697\u2699\u269B\u269C\u26A0]\uFE0F?|\u26A1|\u26A7\uFE0F?|[\u26AA\u26AB]|[\u26B0\u26B1]\uFE0F?|[\u26BD\u26BE\u26C4\u26C5]|\u26C8\uFE0F?|\u26CE|[\u26CF\u26D1\u26D3]\uFE0F?|\u26D4|\u26E9\uFE0F?|\u26EA|[\u26F0\u26F1]\uFE0F?|[\u26F2\u26F3]|\u26F4\uFE0F?|\u26F5|[\u26F7\u26F8]\uFE0F?|\u26F9(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?|\uFE0F(?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\u26FA\u26FD]|\u2702\uFE0F?|\u2705|[\u2708\u2709]\uFE0F?|[\u270A\u270B](?:\uD83C[\uDFFB-\uDFFF])?|[\u270C\u270D](?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)?|\u270F\uFE0F?|[\u2712\u2714\u2716\u271D\u2721]\uFE0F?|\u2728|[\u2733\u2734\u2744\u2747]\uFE0F?|[\u274C\u274E\u2753-\u2755\u2757]|\u2763\uFE0F?|\u2764(?:\u200D(?:\uD83D\uDD25|\uD83E\uDE79)|\uFE0F(?:\u200D(?:\uD83D\uDD25|\uD83E\uDE79))?)?|[\u2795-\u2797]|\u27A1\uFE0F?|[\u27B0\u27BF]|[\u2934\u2935\u2B05-\u2B07]\uFE0F?|[\u2B1B\u2B1C\u2B50\u2B55]|[\u3030\u303D\u3297\u3299]\uFE0F?|\uD83C(?:[\uDC04\uDCCF]|[\uDD70\uDD71\uDD7E\uDD7F]\uFE0F?|[\uDD8E\uDD91-\uDD9A]|\uDDE6\uD83C[\uDDE8-\uDDEC\uDDEE\uDDF1\uDDF2\uDDF4\uDDF6-\uDDFA\uDDFC\uDDFD\uDDFF]|\uDDE7\uD83C[\uDDE6\uDDE7\uDDE9-\uDDEF\uDDF1-\uDDF4\uDDF6-\uDDF9\uDDFB\uDDFC\uDDFE\uDDFF]|\uDDE8\uD83C[\uDDE6\uDDE8\uDDE9\uDDEB-\uDDEE\uDDF0-\uDDF5\uDDF7\uDDFA-\uDDFF]|\uDDE9\uD83C[\uDDEA\uDDEC\uDDEF\uDDF0\uDDF2\uDDF4\uDDFF]|\uDDEA\uD83C[\uDDE6\uDDE8\uDDEA\uDDEC\uDDED\uDDF7-\uDDFA]|\uDDEB\uD83C[\uDDEE-\uDDF0\uDDF2\uDDF4\uDDF7]|\uDDEC\uD83C[\uDDE6\uDDE7\uDDE9-\uDDEE\uDDF1-\uDDF3\uDDF5-\uDDFA\uDDFC\uDDFE]|\uDDED\uD83C[\uDDF0\uDDF2\uDDF3\uDDF7\uDDF9\uDDFA]|\uDDEE\uD83C[\uDDE8-\uDDEA\uDDF1-\uDDF4\uDDF6-\uDDF9]|\uDDEF\uD83C[\uDDEA\uDDF2\uDDF4\uDDF5]|\uDDF0\uD83C[\uDDEA\uDDEC-\uDDEE\uDDF2\uDDF3\uDDF5\uDDF7\uDDFC\uDDFE\uDDFF]|\uDDF1\uD83C[\uDDE6-\uDDE8\uDDEE\uDDF0\uDDF7-\uDDFB\uDDFE]|\uDDF2\uD83C[\uDDE6\uDDE8-\uDDED\uDDF0-\uDDFF]|\uDDF3\uD83C[\uDDE6\uDDE8\uDDEA-\uDDEC\uDDEE\uDDF1\uDDF4\uDDF5\uDDF7\uDDFA\uDDFF]|\uDDF4\uD83C\uDDF2|\uDDF5\uD83C[\uDDE6\uDDEA-\uDDED\uDDF0-\uDDF3\uDDF7-\uDDF9\uDDFC\uDDFE]|\uDDF6\uD83C\uDDE6|\uDDF7\uD83C[\uDDEA\uDDF4\uDDF8\uDDFA\uDDFC]|\uDDF8\uD83C[\uDDE6-\uDDEA\uDDEC-\uDDF4\uDDF7-\uDDF9\uDDFB\uDDFD-\uDDFF]|\uDDF9\uD83C[\uDDE6\uDDE8\uDDE9\uDDEB-\uDDED\uDDEF-\uDDF4\uDDF7\uDDF9\uDDFB\uDDFC\uDDFF]|\uDDFA\uD83C[\uDDE6\uDDEC\uDDF2\uDDF3\uDDF8\uDDFE\uDDFF]|\uDDFB\uD83C[\uDDE6\uDDE8\uDDEA\uDDEC\uDDEE\uDDF3\uDDFA]|\uDDFC\uD83C[\uDDEB\uDDF8]|\uDDFD\uD83C\uDDF0|\uDDFE\uD83C[\uDDEA\uDDF9]|\uDDFF\uD83C[\uDDE6\uDDF2\uDDFC]|\uDE01|\uDE02\uFE0F?|[\uDE1A\uDE2F\uDE32-\uDE36]|\uDE37\uFE0F?|[\uDE38-\uDE3A\uDE50\uDE51\uDF00-\uDF20]|[\uDF21\uDF24-\uDF2C]\uFE0F?|[\uDF2D-\uDF35]|\uDF36\uFE0F?|[\uDF37-\uDF7C]|\uDF7D\uFE0F?|[\uDF7E-\uDF84]|\uDF85(?:\uD83C[\uDFFB-\uDFFF])?|[\uDF86-\uDF93]|[\uDF96\uDF97\uDF99-\uDF9B\uDF9E\uDF9F]\uFE0F?|[\uDFA0-\uDFC1]|\uDFC2(?:\uD83C[\uDFFB-\uDFFF])?|[\uDFC3\uDFC4](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDFC5\uDFC6]|\uDFC7(?:\uD83C[\uDFFB-\uDFFF])?|[\uDFC8\uDFC9]|\uDFCA(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDFCB\uDFCC](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?|\uFE0F(?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDFCD\uDFCE]\uFE0F?|[\uDFCF-\uDFD3]|[\uDFD4-\uDFDF]\uFE0F?|[\uDFE0-\uDFF0]|\uDFF3(?:\u200D(?:\u26A7\uFE0F?|\uD83C\uDF08)|\uFE0F(?:\u200D(?:\u26A7\uFE0F?|\uD83C\uDF08))?)?|\uDFF4(?:\u200D\u2620\uFE0F?|\uDB40\uDC67\uDB40\uDC62\uDB40(?:\uDC65\uDB40\uDC6E\uDB40\uDC67|\uDC73\uDB40\uDC63\uDB40\uDC74|\uDC77\uDB40\uDC6C\uDB40\uDC73)\uDB40\uDC7F)?|[\uDFF5\uDFF7]\uFE0F?|[\uDFF8-\uDFFF])|\uD83D(?:[\uDC00-\uDC07]|\uDC08(?:\u200D\u2B1B)?|[\uDC09-\uDC14]|\uDC15(?:\u200D\uD83E\uDDBA)?|[\uDC16-\uDC3A]|\uDC3B(?:\u200D\u2744\uFE0F?)?|[\uDC3C-\uDC3E]|\uDC3F\uFE0F?|\uDC40|\uDC41(?:\u200D\uD83D\uDDE8\uFE0F?|\uFE0F(?:\u200D\uD83D\uDDE8\uFE0F?)?)?|[\uDC42\uDC43](?:\uD83C[\uDFFB-\uDFFF])?|[\uDC44\uDC45]|[\uDC46-\uDC50](?:\uD83C[\uDFFB-\uDFFF])?|[\uDC51-\uDC65]|[\uDC66\uDC67](?:\uD83C[\uDFFB-\uDFFF])?|\uDC68(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D(?:\uDC66(?:\u200D\uD83D\uDC66)?|\uDC67(?:\u200D\uD83D[\uDC66\uDC67])?|[\uDC68\uDC69]\u200D\uD83D(?:\uDC66(?:\u200D\uD83D\uDC66)?|\uDC67(?:\u200D\uD83D[\uDC66\uDC67])?)|[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|\uD83E[\uDDAF-\uDDB3\uDDBC\uDDBD])|\uD83C(?:\uDFFB(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFC-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFC(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFB\uDFFD-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFD(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFB\uDFFC\uDFFE\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFE(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFB-\uDFFD\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFF(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFB-\uDFFE]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?))?|\uDC69(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?[\uDC68\uDC69]|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D(?:\uDC66(?:\u200D\uD83D\uDC66)?|\uDC67(?:\u200D\uD83D[\uDC66\uDC67])?|\uDC69\u200D\uD83D(?:\uDC66(?:\u200D\uD83D\uDC66)?|\uDC67(?:\u200D\uD83D[\uDC66\uDC67])?)|[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|\uD83E[\uDDAF-\uDDB3\uDDBC\uDDBD])|\uD83C(?:\uDFFB(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF]|\uDC8B\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF])|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFC-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFC(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF]|\uDC8B\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF])|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB\uDFFD-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFD(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF]|\uDC8B\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF])|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB\uDFFC\uDFFE\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFE(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF]|\uDC8B\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF])|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFD\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFF(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D\uD83D(?:[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF]|\uDC8B\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF])|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFE]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?))?|\uDC6A|[\uDC6B-\uDC6D](?:\uD83C[\uDFFB-\uDFFF])?|\uDC6E(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDC6F(?:\u200D[\u2640\u2642]\uFE0F?)?|[\uDC70\uDC71](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDC72(?:\uD83C[\uDFFB-\uDFFF])?|\uDC73(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDC74-\uDC76](?:\uD83C[\uDFFB-\uDFFF])?|\uDC77(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDC78(?:\uD83C[\uDFFB-\uDFFF])?|[\uDC79-\uDC7B]|\uDC7C(?:\uD83C[\uDFFB-\uDFFF])?|[\uDC7D-\uDC80]|[\uDC81\uDC82](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDC83(?:\uD83C[\uDFFB-\uDFFF])?|\uDC84|\uDC85(?:\uD83C[\uDFFB-\uDFFF])?|[\uDC86\uDC87](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDC88-\uDC8E]|\uDC8F(?:\uD83C[\uDFFB-\uDFFF])?|\uDC90|\uDC91(?:\uD83C[\uDFFB-\uDFFF])?|[\uDC92-\uDCA9]|\uDCAA(?:\uD83C[\uDFFB-\uDFFF])?|[\uDCAB-\uDCFC]|\uDCFD\uFE0F?|[\uDCFF-\uDD3D]|[\uDD49\uDD4A]\uFE0F?|[\uDD4B-\uDD4E\uDD50-\uDD67]|[\uDD6F\uDD70\uDD73]\uFE0F?|\uDD74(?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)?|\uDD75(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?|\uFE0F(?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDD76-\uDD79]\uFE0F?|\uDD7A(?:\uD83C[\uDFFB-\uDFFF])?|[\uDD87\uDD8A-\uDD8D]\uFE0F?|\uDD90(?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)?|[\uDD95\uDD96](?:\uD83C[\uDFFB-\uDFFF])?|\uDDA4|[\uDDA5\uDDA8\uDDB1\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDE8\uDDEF\uDDF3\uDDFA]\uFE0F?|[\uDDFB-\uDE2D]|\uDE2E(?:\u200D\uD83D\uDCA8)?|[\uDE2F-\uDE34]|\uDE35(?:\u200D\uD83D\uDCAB)?|\uDE36(?:\u200D\uD83C\uDF2B\uFE0F?)?|[\uDE37-\uDE44]|[\uDE45-\uDE47](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDE48-\uDE4A]|\uDE4B(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDE4C(?:\uD83C[\uDFFB-\uDFFF])?|[\uDE4D\uDE4E](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDE4F(?:\uD83C[\uDFFB-\uDFFF])?|[\uDE80-\uDEA2]|\uDEA3(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDEA4-\uDEB3]|[\uDEB4-\uDEB6](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDEB7-\uDEBF]|\uDEC0(?:\uD83C[\uDFFB-\uDFFF])?|[\uDEC1-\uDEC5]|\uDECB\uFE0F?|\uDECC(?:\uD83C[\uDFFB-\uDFFF])?|[\uDECD-\uDECF]\uFE0F?|[\uDED0-\uDED2\uDED5-\uDED7]|[\uDEE0-\uDEE5\uDEE9]\uFE0F?|[\uDEEB\uDEEC]|[\uDEF0\uDEF3]\uFE0F?|[\uDEF4-\uDEFC\uDFE0-\uDFEB])|\uD83E(?:\uDD0C(?:\uD83C[\uDFFB-\uDFFF])?|[\uDD0D\uDD0E]|\uDD0F(?:\uD83C[\uDFFB-\uDFFF])?|[\uDD10-\uDD17]|[\uDD18-\uDD1C](?:\uD83C[\uDFFB-\uDFFF])?|\uDD1D|[\uDD1E\uDD1F](?:\uD83C[\uDFFB-\uDFFF])?|[\uDD20-\uDD25]|\uDD26(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDD27-\uDD2F]|[\uDD30-\uDD34](?:\uD83C[\uDFFB-\uDFFF])?|\uDD35(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDD36(?:\uD83C[\uDFFB-\uDFFF])?|[\uDD37-\uDD39](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDD3A|\uDD3C(?:\u200D[\u2640\u2642]\uFE0F?)?|[\uDD3D\uDD3E](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDD3F-\uDD45\uDD47-\uDD76]|\uDD77(?:\uD83C[\uDFFB-\uDFFF])?|[\uDD78\uDD7A-\uDDB4]|[\uDDB5\uDDB6](?:\uD83C[\uDFFB-\uDFFF])?|\uDDB7|[\uDDB8\uDDB9](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDDBA|\uDDBB(?:\uD83C[\uDFFB-\uDFFF])?|[\uDDBC-\uDDCB]|[\uDDCD-\uDDCF](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDDD0|\uDDD1(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1|[\uDDAF-\uDDB3\uDDBC\uDDBD]))|\uD83C(?:\uDFFB(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFC-\uDFFF]|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFC(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFB\uDFFD-\uDFFF]|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFD(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFB\uDFFC\uDFFE\uDFFF]|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFE(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFB-\uDFFD\uDFFF]|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?|\uDFFF(?:\u200D(?:[\u2695\u2696\u2708]\uFE0F?|\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFB-\uDFFE]|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])))?))?|[\uDDD2\uDDD3](?:\uD83C[\uDFFB-\uDFFF])?|\uDDD4(?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|\uDDD5(?:\uD83C[\uDFFB-\uDFFF])?|[\uDDD6-\uDDDD](?:\u200D[\u2640\u2642]\uFE0F?|\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?)?|[\uDDDE\uDDDF](?:\u200D[\u2640\u2642]\uFE0F?)?|[\uDDE0-\uDDFF\uDE70-\uDE74\uDE78-\uDE7A\uDE80-\uDE86\uDE90-\uDEA8\uDEB0-\uDEB6\uDEC0-\uDEC2\uDED0-\uDED6])";
See the regex demo (although the JS option is used, the regex yields the same result in C#.)
The pattern is created dynamically from the list of emojis and contracted using a regex trie, with a couple of post-process steps to shrink the pattern even more.
If you need to check if the whole string is solely composed of emojis, wrap the above pattern with a group, append anchors on both ends, and add + (one or more) / * (zero or more) quantifier to the group.
// If the string should contain at least one emoji:
var rx = $#"^(?:{EmojiPattern})+\z";
var rx = $#"^(?:{EmojiPattern})+$";
// Or, to allow empty strings
var rx = $#"^(?:{EmojiPattern})*\z";
var rx = $#"^(?:{EmojiPattern})*$";
Here, ^ matches the start of string, $ matches the end of string (but there may be a final \n at the end of string) and \z matches the very end of string.
The Unicode Technical Standard #51 defines a regular expression pattern for detecting Emoji based on the Unicode character properties. It needs translation into your language and RegExp library of choice and relies on Unicode support in those patterns.
\p{RI} \p{RI}
| \p{Emoji}
( \p{EMod}
| \x{FE0F} \x{20E3}?
| [\x{E0020}-\x{E007E}]+ \x{E007F} )?
(\x{200D} \p{Emoji}
( \p{EMod}
| \x{FE0F} \x{20E3}?
| [\x{E0020}-\x{E007E}]+ \x{E007F} )?
)*
By relying on these character properties it won't be as necessary to maintain a custom database or table of known characters and combinations. Unicode specification updates should trickle down through the provided framework, language, or platform.
I've done this for JavaScript but alas not for C#, still I'm hoping that this answer helps out.
const emoji = new RegExp(
'(?:' + /*** Proper "Emoji" ***/
'\\p{RI}\\p{RI}' + // Regional indicators like 🇦 + 🇸 = 🇦🇸
'|' +
'\\p{Emoji}' + // Emoji character
'(?:' + // optionally followed…
'\\p{Emoji_Modifier}' + // …with a modifier like skin-tone, hair style, etc…
'|' +
'\\u{FE0F}\\u{20E3}?' + // …with "visual representation" marker ☂ vs ☂️
'|' + // "combining visual keycap" a vs. a⃣
'[\\u{E0020}-\\u{E007E}]+\\u{E007F}' + // …with Emoji tag sequences for flags
')?' +
'(?:' +
'\\u{200D}\\p{Emoji}' + // …with zero or more ZWJ sequences
'(?:' + // …each of which might also have modifier
'\\p{Emoji_Modifier}' + // sequences as on the base Emoji above
'|' +
'\\u{FE0F}\\u{20E3}?' +
'|' +
'[\\u{E0020}-\\u{E007E}]+\\u{E007F}' +
')?' +
')*' +
')',
'gu' // Global to find all matches and persist `lastIndex` if necessary
);
and also as a one-liner with a regex101 link
/(?:\p{RI}\p{RI}|\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?(?:\u{200D}\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?)*)|[\u{1f900}-\u{1f9ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}]/gu
I've taken Wiktor Stribiżew's idea of generating a regular expression from the Emoji Keyboard/Display Test Data file and implemented it here. An UTF-16 based regular expression that matches exactly the emojis listed in the test file for UNICODE EMOJI 13.1 looks like this:
\uD83D(?:\uDC68(?:\uD83C(?:\uDFFB(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFC-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFC(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFB\uDFFD-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFD(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFB\uDFFC\uDFFE\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFE(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFB-\uDFFD\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFF(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFB-\uDFFE]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?)|\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68|\uD83D(?:(?:[\uDC68\uDC69]\u200D\uD83D)?(?:\uDC66(?:\u200D\uD83D\uDC66)?|\uDC67(?:\u200D\uD83D[\uDC66\uDC67])?)|[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83E[\uDDAF-\uDDB3\uDDBC\uDDBD]))?|\uDC69(?:\uD83C(?:\uDFFB(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D[\uDC68\uDC69]|[\uDC68\uDC69])\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFC-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFC(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D[\uDC68\uDC69]|[\uDC68\uDC69])\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB\uDFFD-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFD(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D[\uDC68\uDC69]|[\uDC68\uDC69])\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB\uDFFC\uDFFE\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFE(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D[\uDC68\uDC69]|[\uDC68\uDC69])\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFD\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFF(?:\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D[\uDC68\uDC69]|[\uDC68\uDC69])\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFE]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?)|\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D[\uDC68\uDC69]|[\uDC68\uDC69])|\uD83D(?:(?:\uDC69\u200D\uD83D)?(?:\uDC66(?:\u200D\uD83D\uDC66)?|\uDC67(?:\u200D\uD83D[\uDC66\uDC67])?)|[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83E[\uDDAF-\uDDB3\uDDBC\uDDBD]))?|(?:\uDD75(?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)?|\uDC6F)(?:\u200D[\u2640\u2642]\uFE0F?)?|[\uDC6E\uDC70\uDC71\uDC73\uDC77\uDC81\uDC82\uDC86\uDC87\uDE45-\uDE47\uDE4B\uDE4D\uDE4E\uDEA3\uDEB4-\uDEB6](?:\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?|\u200D[\u2640\u2642]\uFE0F?)?|\uDC41(?:\uFE0F(?:\u200D\uD83D\uDDE8\uFE0F?)?|\u200D\uD83D\uDDE8\uFE0F?)?|\uDE36(?:\u200D\uD83C\uDF2B\uFE0F?)?|\uDC15(?:\u200D\uD83E\uDDBA)?|\uDC3B(?:\u200D\u2744\uFE0F?)?|\uDE2E(?:\u200D\uD83D\uDCA8)?|\uDE35(?:\u200D\uD83D\uDCAB)?|[\uDC42\uDC43\uDC46-\uDC50\uDC66\uDC67\uDC6B-\uDC6D\uDC72\uDC74-\uDC76\uDC78\uDC7C\uDC83\uDC85\uDC8F\uDC91\uDCAA\uDD7A\uDD95\uDD96\uDE4C\uDE4F\uDEC0\uDECC](?:\uD83C[\uDFFB-\uDFFF])?|[\uDD74\uDD90]\uD83C[\uDFFB-\uDFFF]|\uDC08(?:\u200D\u2B1B)?|[\uDC3F\uDCFD\uDD49\uDD4A\uDD6F\uDD70\uDD73\uDD74\uDD76-\uDD79\uDD87\uDD8A-\uDD8D\uDD90\uDDA5\uDDA8\uDDB1\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDE8\uDDEF\uDDF3\uDDFA\uDECB\uDECD-\uDECF\uDEE0-\uDEE5\uDEE9\uDEF0\uDEF3]\uFE0F?|[\uDC00-\uDC07\uDC09-\uDC14\uDC16-\uDC3A\uDC3C-\uDC3E\uDC40\uDC44\uDC45\uDC51-\uDC65\uDC6A\uDC79-\uDC7B\uDC7D-\uDC80\uDC84\uDC88-\uDC8E\uDC90\uDC92-\uDCA9\uDCAB-\uDCFC\uDCFF-\uDD3D\uDD4B-\uDD4E\uDD50-\uDD67\uDDA4\uDDFB-\uDE2D\uDE2F-\uDE34\uDE37-\uDE44\uDE48-\uDE4A\uDE80-\uDEA2\uDEA4-\uDEB3\uDEB7-\uDEBF\uDEC1-\uDEC5\uDED0-\uDED2\uDED5-\uDED7\uDEEB\uDEEC\uDEF4-\uDEFC\uDFE0-\uDFEB])|\uD83E(?:\uDDD1(?:\uD83C(?:\uDFFB(?:\u200D(?:\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFC-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFC(?:\u200D(?:\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFB\uDFFD-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFD(?:\u200D(?:\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFB\uDFFC\uDFFE\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFE(?:\u200D(?:\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFB-\uDFFD\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|\uDFFF(?:\u200D(?:\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFB-\uDFFE]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?)|\u200D(?:\uD83E(?:\uDD1D\u200D\uD83E\uDDD1|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|[\uDD26\uDD35\uDD37-\uDD39\uDD3D\uDD3E\uDDB8\uDDB9\uDDCD-\uDDCF\uDDD4\uDDD6-\uDDDD](?:\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?|\u200D[\u2640\u2642]\uFE0F?)?|[\uDD3C\uDDDE\uDDDF](?:\u200D[\u2640\u2642]\uFE0F?)?|[\uDD0C\uDD0F\uDD18-\uDD1C\uDD1E\uDD1F\uDD30-\uDD34\uDD36\uDD77\uDDB5\uDDB6\uDDBB\uDDD2\uDDD3\uDDD5](?:\uD83C[\uDFFB-\uDFFF])?|[\uDD0D\uDD0E\uDD10-\uDD17\uDD1D\uDD20-\uDD25\uDD27-\uDD2F\uDD3A\uDD3F-\uDD45\uDD47-\uDD76\uDD78\uDD7A-\uDDB4\uDDB7\uDDBA\uDDBC-\uDDCB\uDDD0\uDDE0-\uDDFF\uDE70-\uDE74\uDE78-\uDE7A\uDE80-\uDE86\uDE90-\uDEA8\uDEB0-\uDEB6\uDEC0-\uDEC2\uDED0-\uDED6])|\uD83C(?:\uDFF4(?:\uDB40\uDC67\uDB40\uDC62\uDB40(?:\uDC65\uDB40\uDC6E\uDB40\uDC67|\uDC73\uDB40\uDC63\uDB40\uDC74|\uDC77\uDB40\uDC6C\uDB40\uDC73)\uDB40\uDC7F|\u200D\u2620\uFE0F?)?|[\uDFC3\uDFC4\uDFCA](?:\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?|\u200D[\u2640\u2642]\uFE0F?)?|[\uDFCB\uDFCC](?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)(?:\u200D[\u2640\u2642]\uFE0F?)?|\uDFF3(?:\uFE0F(?:\u200D(?:\u26A7\uFE0F?|\uD83C\uDF08))?|\u200D(?:\u26A7\uFE0F?|\uD83C\uDF08))?|(?:[\uDFCB\uDFCC]\u200D[\u2640\u2642]|[\uDD70\uDD71\uDD7E\uDD7F\uDE02\uDE37\uDF21\uDF24-\uDF2C\uDF36\uDF7D\uDF96\uDF97\uDF99-\uDF9B\uDF9E\uDF9F\uDFCD\uDFCE\uDFD4-\uDFDF\uDFF5\uDFF7])\uFE0F?|[\uDF85\uDFC2\uDFC7](?:\uD83C[\uDFFB-\uDFFF])?|\uDDE6\uD83C[\uDDE8-\uDDEC\uDDEE\uDDF1\uDDF2\uDDF4\uDDF6-\uDDFA\uDDFC\uDDFD\uDDFF]|\uDDE7\uD83C[\uDDE6\uDDE7\uDDE9-\uDDEF\uDDF1-\uDDF4\uDDF6-\uDDF9\uDDFB\uDDFC\uDDFE\uDDFF]|\uDDE8\uD83C[\uDDE6\uDDE8\uDDE9\uDDEB-\uDDEE\uDDF0-\uDDF5\uDDF7\uDDFA-\uDDFF]|\uDDE9\uD83C[\uDDEA\uDDEC\uDDEF\uDDF0\uDDF2\uDDF4\uDDFF]|\uDDEA\uD83C[\uDDE6\uDDE8\uDDEA\uDDEC\uDDED\uDDF7-\uDDFA]|\uDDEB\uD83C[\uDDEE-\uDDF0\uDDF2\uDDF4\uDDF7]|\uDDEC\uD83C[\uDDE6\uDDE7\uDDE9-\uDDEE\uDDF1-\uDDF3\uDDF5-\uDDFA\uDDFC\uDDFE]|\uDDED\uD83C[\uDDF0\uDDF2\uDDF3\uDDF7\uDDF9\uDDFA]|\uDDEE\uD83C[\uDDE8-\uDDEA\uDDF1-\uDDF4\uDDF6-\uDDF9]|\uDDEF\uD83C[\uDDEA\uDDF2\uDDF4\uDDF5]|\uDDF0\uD83C[\uDDEA\uDDEC-\uDDEE\uDDF2\uDDF3\uDDF5\uDDF7\uDDFC\uDDFE\uDDFF]|\uDDF1\uD83C[\uDDE6-\uDDE8\uDDEE\uDDF0\uDDF7-\uDDFB\uDDFE]|\uDDF2\uD83C[\uDDE6\uDDE8-\uDDED\uDDF0-\uDDFF]|\uDDF3\uD83C[\uDDE6\uDDE8\uDDEA-\uDDEC\uDDEE\uDDF1\uDDF4\uDDF5\uDDF7\uDDFA\uDDFF]|\uDDF4\uD83C\uDDF2|\uDDF5\uD83C[\uDDE6\uDDEA-\uDDED\uDDF0-\uDDF3\uDDF7-\uDDF9\uDDFC\uDDFE]|\uDDF6\uD83C\uDDE6|\uDDF7\uD83C[\uDDEA\uDDF4\uDDF8\uDDFA\uDDFC]|\uDDF8\uD83C[\uDDE6-\uDDEA\uDDEC-\uDDF4\uDDF7-\uDDF9\uDDFB\uDDFD-\uDDFF]|\uDDF9\uD83C[\uDDE6\uDDE8\uDDE9\uDDEB-\uDDED\uDDEF-\uDDF4\uDDF7\uDDF9\uDDFB\uDDFC\uDDFF]|\uDDFA\uD83C[\uDDE6\uDDEC\uDDF2\uDDF3\uDDF8\uDDFE\uDDFF]|\uDDFB\uD83C[\uDDE6\uDDE8\uDDEA\uDDEC\uDDEE\uDDF3\uDDFA]|\uDDFC\uD83C[\uDDEB\uDDF8]|\uDDFD\uD83C\uDDF0|\uDDFE\uD83C[\uDDEA\uDDF9]|\uDDFF\uD83C[\uDDE6\uDDF2\uDDFC]|[\uDC04\uDCCF\uDD8E\uDD91-\uDD9A\uDE01\uDE1A\uDE2F\uDE32-\uDE36\uDE38-\uDE3A\uDE50\uDE51\uDF00-\uDF20\uDF2D-\uDF35\uDF37-\uDF7C\uDF7E-\uDF84\uDF86-\uDF93\uDFA0-\uDFC1\uDFC5\uDFC6\uDFC8\uDFC9\uDFCB\uDFCC\uDFCF-\uDFD3\uDFE0-\uDFF0\uDFF8-\uDFFF])|\u26F9(?:(?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)(?:\u200D[\u2640\u2642]\uFE0F?)?|\u200D[\u2640\u2642]\uFE0F?)?|\u2764(?:\uFE0F(?:\u200D(?:\uD83D\uDD25|\uD83E\uDE79))?|\u200D(?:\uD83D\uDD25|\uD83E\uDE79))?|[\#\*0-9]\uFE0F?\u20E3|[\u261D\u270C\u270D]\uD83C[\uDFFB-\uDFFF]|[\u270A\u270B](?:\uD83C[\uDFFB-\uDFFF])?|[\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u2328\u23CF\u23ED-\u23EF\u23F1\u23F2\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB\u25FC\u2600-\u2604\u260E\u2611\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2640\u2642\u265F\u2660\u2663\u2665\u2666\u2668\u267B\u267E\u2692\u2694-\u2697\u2699\u269B\u269C\u26A0\u26A7\u26B0\u26B1\u26C8\u26CF\u26D1\u26D3\u26E9\u26F0\u26F1\u26F4\u26F7\u26F8\u2702\u2708\u2709\u270C\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2733\u2734\u2744\u2747\u2763\u27A1\u2934\u2935\u2B05-\u2B07\u3030\u303D\u3297\u3299]\uFE0F?|[\u231A\u231B\u23E9-\u23EC\u23F0\u23F3\u25FD\u25FE\u2614\u2615\u2648-\u2653\u267F\u2693\u26A1\u26AA\u26AB\u26BD\u26BE\u26C4\u26C5\u26CE\u26D4\u26EA\u26F2\u26F3\u26F5\u26FA\u26FD\u2705\u2728\u274C\u274E\u2753-\u2755\u2757\u2795-\u2797\u27B0\u27BF\u2B1B\u2B1C\u2B50\u2B55]
If you are okay with a few skin tone combinations that do not exist in the test file being matches too, the regex can be compressed to about ~66% (7524 characters)
\uD83D(?:\uDC68(?:\uD83C(?:[\uDFFB-\uDFFF]\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D\uDC68\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|[\uDFFB-\uDFFF])|\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D)?\uDC68|\uD83D(?:(?:[\uDC68\uDC69]\u200D\uD83D)?(?:\uDC66(?:\u200D\uD83D\uDC66)?|\uDC67(?:\u200D\uD83D[\uDC66\uDC67])?)|[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83E[\uDDAF-\uDDB3\uDDBC\uDDBD]))?|\uDC69(?:\uD83C(?:[\uDFFB-\uDFFF]\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D[\uDC68\uDC69]|[\uDC68\uDC69])\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83D[\uDC68\uDC69]\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|[\uDFFB-\uDFFF])|\u200D(?:\u2764\uFE0F?\u200D\uD83D(?:\uDC8B\u200D\uD83D[\uDC68\uDC69]|[\uDC68\uDC69])|\uD83D(?:(?:\uDC69\u200D\uD83D)?(?:\uDC66(?:\u200D\uD83D\uDC66)?|\uDC67(?:\u200D\uD83D[\uDC66\uDC67])?)|[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83E[\uDDAF-\uDDB3\uDDBC\uDDBD]))?|(?:\uDD75(?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)?|\uDC6F)(?:\u200D[\u2640\u2642]\uFE0F?)?|[\uDC6E\uDC70\uDC71\uDC73\uDC77\uDC81\uDC82\uDC86\uDC87\uDE45-\uDE47\uDE4B\uDE4D\uDE4E\uDEA3\uDEB4-\uDEB6](?:\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?|\u200D[\u2640\u2642]\uFE0F?)?|\uDC41(?:\uFE0F(?:\u200D\uD83D\uDDE8\uFE0F?)?|\u200D\uD83D\uDDE8\uFE0F?)?|\uDE36(?:\u200D\uD83C\uDF2B\uFE0F?)?|\uDC15(?:\u200D\uD83E\uDDBA)?|\uDC3B(?:\u200D\u2744\uFE0F?)?|\uDE2E(?:\u200D\uD83D\uDCA8)?|\uDE35(?:\u200D\uD83D\uDCAB)?|[\uDC42\uDC43\uDC46-\uDC50\uDC66\uDC67\uDC6B-\uDC6D\uDC72\uDC74-\uDC76\uDC78\uDC7C\uDC83\uDC85\uDC8F\uDC91\uDCAA\uDD7A\uDD95\uDD96\uDE4C\uDE4F\uDEC0\uDECC](?:\uD83C[\uDFFB-\uDFFF])?|[\uDD74\uDD90]\uD83C[\uDFFB-\uDFFF]|\uDC08(?:\u200D\u2B1B)?|[\uDC3F\uDCFD\uDD49\uDD4A\uDD6F\uDD70\uDD73\uDD74\uDD76-\uDD79\uDD87\uDD8A-\uDD8D\uDD90\uDDA5\uDDA8\uDDB1\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDE8\uDDEF\uDDF3\uDDFA\uDECB\uDECD-\uDECF\uDEE0-\uDEE5\uDEE9\uDEF0\uDEF3]\uFE0F?|[\uDC00-\uDC07\uDC09-\uDC14\uDC16-\uDC3A\uDC3C-\uDC3E\uDC40\uDC44\uDC45\uDC51-\uDC65\uDC6A\uDC79-\uDC7B\uDC7D-\uDC80\uDC84\uDC88-\uDC8E\uDC90\uDC92-\uDCA9\uDCAB-\uDCFC\uDCFF-\uDD3D\uDD4B-\uDD4E\uDD50-\uDD67\uDDA4\uDDFB-\uDE2D\uDE2F-\uDE34\uDE37-\uDE44\uDE48-\uDE4A\uDE80-\uDEA2\uDEA4-\uDEB3\uDEB7-\uDEBF\uDEC1-\uDEC5\uDED0-\uDED2\uDED5-\uDED7\uDEEB\uDEEC\uDEF4-\uDEFC\uDFE0-\uDFEB])|\uD83E(?:\uDDD1(?:\uD83C(?:[\uDFFB-\uDFFF]\u200D(?:\u2764\uFE0F?\u200D(?:\uD83D\uDC8B\u200D)?\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|\uD83E(?:\uDD1D\u200D\uD83E\uDDD1\uD83C[\uDFFB-\uDFFF]|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|[\uDFFB-\uDFFF])|\u200D(?:\uD83E(?:\uDD1D\u200D\uD83E\uDDD1|[\uDDAF-\uDDB3\uDDBC\uDDBD])|[\u2695\u2696\u2708]\uFE0F?|\uD83C[\uDF3E\uDF73\uDF7C\uDF84\uDF93\uDFA4\uDFA8\uDFEB\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92]))?|[\uDD26\uDD35\uDD37-\uDD39\uDD3D\uDD3E\uDDB8\uDDB9\uDDCD-\uDDCF\uDDD4\uDDD6-\uDDDD](?:\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?|\u200D[\u2640\u2642]\uFE0F?)?|[\uDD3C\uDDDE\uDDDF](?:\u200D[\u2640\u2642]\uFE0F?)?|[\uDD0C\uDD0F\uDD18-\uDD1C\uDD1E\uDD1F\uDD30-\uDD34\uDD36\uDD77\uDDB5\uDDB6\uDDBB\uDDD2\uDDD3\uDDD5](?:\uD83C[\uDFFB-\uDFFF])?|[\uDD0D\uDD0E\uDD10-\uDD17\uDD1D\uDD20-\uDD25\uDD27-\uDD2F\uDD3A\uDD3F-\uDD45\uDD47-\uDD76\uDD78\uDD7A-\uDDB4\uDDB7\uDDBA\uDDBC-\uDDCB\uDDD0\uDDE0-\uDDFF\uDE70-\uDE74\uDE78-\uDE7A\uDE80-\uDE86\uDE90-\uDEA8\uDEB0-\uDEB6\uDEC0-\uDEC2\uDED0-\uDED6])|\uD83C(?:\uDFF4(?:\uDB40\uDC67\uDB40\uDC62\uDB40(?:\uDC65\uDB40\uDC6E\uDB40\uDC67|\uDC73\uDB40\uDC63\uDB40\uDC74|\uDC77\uDB40\uDC6C\uDB40\uDC73)\uDB40\uDC7F|\u200D\u2620\uFE0F?)?|[\uDFC3\uDFC4\uDFCA](?:\uD83C[\uDFFB-\uDFFF](?:\u200D[\u2640\u2642]\uFE0F?)?|\u200D[\u2640\u2642]\uFE0F?)?|[\uDFCB\uDFCC](?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)(?:\u200D[\u2640\u2642]\uFE0F?)?|\uDFF3(?:\uFE0F(?:\u200D(?:\u26A7\uFE0F?|\uD83C\uDF08))?|\u200D(?:\u26A7\uFE0F?|\uD83C\uDF08))?|(?:[\uDFCB\uDFCC]\u200D[\u2640\u2642]|[\uDD70\uDD71\uDD7E\uDD7F\uDE02\uDE37\uDF21\uDF24-\uDF2C\uDF36\uDF7D\uDF96\uDF97\uDF99-\uDF9B\uDF9E\uDF9F\uDFCD\uDFCE\uDFD4-\uDFDF\uDFF5\uDFF7])\uFE0F?|[\uDF85\uDFC2\uDFC7](?:\uD83C[\uDFFB-\uDFFF])?|\uDDE6\uD83C[\uDDE8-\uDDEC\uDDEE\uDDF1\uDDF2\uDDF4\uDDF6-\uDDFA\uDDFC\uDDFD\uDDFF]|\uDDE7\uD83C[\uDDE6\uDDE7\uDDE9-\uDDEF\uDDF1-\uDDF4\uDDF6-\uDDF9\uDDFB\uDDFC\uDDFE\uDDFF]|\uDDE8\uD83C[\uDDE6\uDDE8\uDDE9\uDDEB-\uDDEE\uDDF0-\uDDF5\uDDF7\uDDFA-\uDDFF]|\uDDE9\uD83C[\uDDEA\uDDEC\uDDEF\uDDF0\uDDF2\uDDF4\uDDFF]|\uDDEA\uD83C[\uDDE6\uDDE8\uDDEA\uDDEC\uDDED\uDDF7-\uDDFA]|\uDDEB\uD83C[\uDDEE-\uDDF0\uDDF2\uDDF4\uDDF7]|\uDDEC\uD83C[\uDDE6\uDDE7\uDDE9-\uDDEE\uDDF1-\uDDF3\uDDF5-\uDDFA\uDDFC\uDDFE]|\uDDED\uD83C[\uDDF0\uDDF2\uDDF3\uDDF7\uDDF9\uDDFA]|\uDDEE\uD83C[\uDDE8-\uDDEA\uDDF1-\uDDF4\uDDF6-\uDDF9]|\uDDEF\uD83C[\uDDEA\uDDF2\uDDF4\uDDF5]|\uDDF0\uD83C[\uDDEA\uDDEC-\uDDEE\uDDF2\uDDF3\uDDF5\uDDF7\uDDFC\uDDFE\uDDFF]|\uDDF1\uD83C[\uDDE6-\uDDE8\uDDEE\uDDF0\uDDF7-\uDDFB\uDDFE]|\uDDF2\uD83C[\uDDE6\uDDE8-\uDDED\uDDF0-\uDDFF]|\uDDF3\uD83C[\uDDE6\uDDE8\uDDEA-\uDDEC\uDDEE\uDDF1\uDDF4\uDDF5\uDDF7\uDDFA\uDDFF]|\uDDF4\uD83C\uDDF2|\uDDF5\uD83C[\uDDE6\uDDEA-\uDDED\uDDF0-\uDDF3\uDDF7-\uDDF9\uDDFC\uDDFE]|\uDDF6\uD83C\uDDE6|\uDDF7\uD83C[\uDDEA\uDDF4\uDDF8\uDDFA\uDDFC]|\uDDF8\uD83C[\uDDE6-\uDDEA\uDDEC-\uDDF4\uDDF7-\uDDF9\uDDFB\uDDFD-\uDDFF]|\uDDF9\uD83C[\uDDE6\uDDE8\uDDE9\uDDEB-\uDDED\uDDEF-\uDDF4\uDDF7\uDDF9\uDDFB\uDDFC\uDDFF]|\uDDFA\uD83C[\uDDE6\uDDEC\uDDF2\uDDF3\uDDF8\uDDFE\uDDFF]|\uDDFB\uD83C[\uDDE6\uDDE8\uDDEA\uDDEC\uDDEE\uDDF3\uDDFA]|\uDDFC\uD83C[\uDDEB\uDDF8]|\uDDFD\uD83C\uDDF0|\uDDFE\uD83C[\uDDEA\uDDF9]|\uDDFF\uD83C[\uDDE6\uDDF2\uDDFC]|[\uDC04\uDCCF\uDD8E\uDD91-\uDD9A\uDE01\uDE1A\uDE2F\uDE32-\uDE36\uDE38-\uDE3A\uDE50\uDE51\uDF00-\uDF20\uDF2D-\uDF35\uDF37-\uDF7C\uDF7E-\uDF84\uDF86-\uDF93\uDFA0-\uDFC1\uDFC5\uDFC6\uDFC8\uDFC9\uDFCB\uDFCC\uDFCF-\uDFD3\uDFE0-\uDFF0\uDFF8-\uDFFF])|\u26F9(?:(?:\uD83C[\uDFFB-\uDFFF]|\uFE0F)(?:\u200D[\u2640\u2642]\uFE0F?)?|\u200D[\u2640\u2642]\uFE0F?)?|\u2764(?:\uFE0F(?:\u200D(?:\uD83D\uDD25|\uD83E\uDE79))?|\u200D(?:\uD83D\uDD25|\uD83E\uDE79))?|[\#\*0-9]\uFE0F?\u20E3|[\u261D\u270C\u270D]\uD83C[\uDFFB-\uDFFF]|[\u270A\u270B](?:\uD83C[\uDFFB-\uDFFF])?|[\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u2328\u23CF\u23ED-\u23EF\u23F1\u23F2\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB\u25FC\u2600-\u2604\u260E\u2611\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2640\u2642\u265F\u2660\u2663\u2665\u2666\u2668\u267B\u267E\u2692\u2694-\u2697\u2699\u269B\u269C\u26A0\u26A7\u26B0\u26B1\u26C8\u26CF\u26D1\u26D3\u26E9\u26F0\u26F1\u26F4\u26F7\u26F8\u2702\u2708\u2709\u270C\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2733\u2734\u2744\u2747\u2763\u27A1\u2934\u2935\u2B05-\u2B07\u3030\u303D\u3297\u3299]\uFE0F?|[\u231A\u231B\u23E9-\u23EC\u23F0\u23F3\u25FD\u25FE\u2614\u2615\u2648-\u2653\u267F\u2693\u26A1\u26AA\u26AB\u26BD\u26BE\u26C4\u26C5\u26CE\u26D4\u26EA\u26F2\u26F3\u26F5\u26FA\u26FD\u2705\u2728\u274C\u274E\u2753-\u2755\u2757\u2795-\u2797\u27B0\u27BF\u2B1B\u2B1C\u2B50\u2B55]
Here is a demo matching all 4590 listed emoji, plus the 35 added ones.
Regular is not working corectelly.
const reg = /(?:(?:\p{RI}\p{RI}|\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?(?:\u{200D}\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?)*)|[\u{1f900}-\u{1f9ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}])*/gu;
'♠John♠'.replace(reg, ''); // output "John" - it`s ok
'John 123'.replace(reg, ''); // output John - Numbers is missing, why?
Is possible fix regular for keep numbers in string?
Is there a way to achieve the equivalent of a negative lookbehind in JavaScript regular expressions? I need to match a string that does not start with a specific set of characters.
It seems I am unable to find a regex that does this without failing if the matched part is found at the beginning of the string. Negative lookbehinds seem to be the only answer, but JavaScript doesn't has one.
This is the regex that I would like to work, but it doesn't:
(?<!([abcdefg]))m
So it would match the 'm' in 'jim' or 'm', but not 'jam'
Since 2018, Lookbehind Assertions are part of the ECMAScript language specification.
// positive lookbehind
(?<=...)
// negative lookbehind
(?<!...)
Answer pre-2018
As Javascript supports negative lookahead, one way to do it is:
reverse the input string
match with a reversed regex
reverse and reformat the matches
const reverse = s => s.split('').reverse().join('');
const test = (stringToTests, reversedRegexp) => stringToTests
.map(reverse)
.forEach((s,i) => {
const match = reversedRegexp.test(s);
console.log(stringToTests[i], match, 'token:', match ? reverse(reversedRegexp.exec(s)[0]) : 'Ø');
});
Example 1:
Following #andrew-ensley's question:
test(['jim', 'm', 'jam'], /m(?!([abcdefg]))/)
Outputs:
jim true token: m
m true token: m
jam false token: Ø
Example 2:
Following #neaumusic comment (match max-height but not line-height, the token being height):
test(['max-height', 'line-height'], /thgieh(?!(-enil))/)
Outputs:
max-height true token: height
line-height false token: Ø
Lookbehind Assertions got accepted into the ECMAScript specification in 2018.
Positive lookbehind usage:
console.log(
"$9.99 €8.47".match(/(?<=\$)\d+\.\d*/) // Matches "9.99"
);
Negative lookbehind usage:
console.log(
"$9.99 €8.47".match(/(?<!\$)\d+\.\d*/) // Matches "8.47"
);
Platform support:
✔️ V8
✔️ Google Chrome 62.0
✔️ Microsoft Edge 79.0
✔️ Node.js 6.0 behind a flag and 9.0 without a flag
✔️ Deno (all versions)
✔️ SpiderMonkey
✔️ Mozilla Firefox 78.0
🛠️ JavaScriptCore: support in beta as of 2023-02-16: the feature has been merged
✔️ Apple Safari 16.4 (in beta as of 2023-02-16)
✔️ iOS 16.4 beta WebView (all browsers on iOS + iPadOS)
✔️ Bun 0.2.2
❌ Chakra: Microsoft was working on it but Chakra is now abandoned in favor of V8
❌ Internet Explorer
❌ Edge versions prior to 79 (the ones based on EdgeHTML+Chakra)
Let's suppose you want to find all int not preceded by unsigned :
With support for negative look-behind:
(?<!unsigned )int
Without support for negative look-behind:
((?!unsigned ).{9}|^.{0,8})int
Basically idea is to grab n preceding characters and exclude match with negative look-ahead, but also match the cases where there's no preceeding n characters. (where n is length of look-behind).
So the regex in question:
(?<!([abcdefg]))m
would translate to:
((?!([abcdefg])).|^)m
You might need to play with capturing groups to find exact spot of the string that interests you or you want to replace specific part with something else.
Mijoja's strategy works for your specific case but not in general:
js>newString = "Fall ball bill balll llama".replace(/(ba)?ll/g,
function($0,$1){ return $1?$0:"[match]";});
Fa[match] ball bi[match] balll [match]ama
Here's an example where the goal is to match a double-l but not if it is preceded by "ba". Note the word "balll" -- true lookbehind should have suppressed the first 2 l's but matched the 2nd pair. But by matching the first 2 l's and then ignoring that match as a false positive, the regexp engine proceeds from the end of that match, and ignores any characters within the false positive.
Use
newString = string.replace(/([abcdefg])?m/, function($0,$1){ return $1?$0:'m';});
You could define a non-capturing group by negating your character set:
(?:[^a-g])m
...which would match every m NOT preceded by any of those letters.
This is how I achieved str.split(/(?<!^)#/) for Node.js 8 (which doesn't support lookbehind):
str.split('').reverse().join('').split(/#(?!$)/).map(s => s.split('').reverse().join('')).reverse()
Works? Yes (unicode untested). Unpleasant? Yes.
following the idea of Mijoja, and drawing from the problems exposed by JasonS, i had this idea; i checked a bit but am not sure of myself, so a verification by someone more expert than me in js regex would be great :)
var re = /(?=(..|^.?)(ll))/g
// matches empty string position
// whenever this position is followed by
// a string of length equal or inferior (in case of "^")
// to "lookbehind" value
// + actual value we would want to match
, str = "Fall ball bill balll llama"
, str_done = str
, len_difference = 0
, doer = function (where_in_str, to_replace)
{
str_done = str_done.slice(0, where_in_str + len_difference)
+ "[match]"
+ str_done.slice(where_in_str + len_difference + to_replace.length)
len_difference = str_done.length - str.length
/* if str smaller:
len_difference will be positive
else will be negative
*/
} /* the actual function that would do whatever we want to do
with the matches;
this above is only an example from Jason's */
/* function input of .replace(),
only there to test the value of $behind
and if negative, call doer() with interesting parameters */
, checker = function ($match, $behind, $after, $where, $str)
{
if ($behind !== "ba")
doer
(
$where + $behind.length
, $after
/* one will choose the interesting arguments
to give to the doer, it's only an example */
)
return $match // empty string anyhow, but well
}
str.replace(re, checker)
console.log(str_done)
my personal output:
Fa[match] ball bi[match] bal[match] [match]ama
the principle is to call checker at each point in the string between any two characters, whenever that position is the starting point of:
--- any substring of the size of what is not wanted (here 'ba', thus ..) (if that size is known; otherwise it must be harder to do perhaps)
--- --- or smaller than that if it's the beginning of the string: ^.?
and, following this,
--- what is to be actually sought (here 'll').
At each call of checker, there will be a test to check if the value before ll is not what we don't want (!== 'ba'); if that's the case, we call another function, and it will have to be this one (doer) that will make the changes on str, if the purpose is this one, or more generically, that will get in input the necessary data to manually process the results of the scanning of str.
here we change the string so we needed to keep a trace of the difference of length in order to offset the locations given by replace, all calculated on str, which itself never changes.
since primitive strings are immutable, we could have used the variable str to store the result of the whole operation, but i thought the example, already complicated by the replacings, would be clearer with another variable (str_done).
i guess that in terms of performances it must be pretty harsh: all those pointless replacements of '' into '', this str.length-1 times, plus here manual replacement by doer, which means a lot of slicing...
probably in this specific above case that could be grouped, by cutting the string only once into pieces around where we want to insert [match] and .join()ing it with [match] itself.
the other thing is that i don't know how it would handle more complex cases, that is, complex values for the fake lookbehind... the length being perhaps the most problematic data to get.
and, in checker, in case of multiple possibilities of nonwanted values for $behind, we'll have to make a test on it with yet another regex (to be cached (created) outside checker is best, to avoid the same regex object to be created at each call for checker) to know whether or not it is what we seek to avoid.
hope i've been clear; if not don't hesitate, i'll try better. :)
Using your case, if you want to replace m with something, e.g. convert it to uppercase M, you can negate set in capturing group.
match ([^a-g])m, replace with $1M
"jim jam".replace(/([^a-g])m/g, "$1M")
\\jiM jam
([^a-g]) will match any char not(^) in a-g range, and store it in first capturing group, so you can access it with $1.
So we find im in jim and replace it with iM which results in jiM.
As mentioned before, JavaScript allows lookbehinds now. In older browsers you still need a workaround.
I bet my head there is no way to find a regex without lookbehind that delivers the result exactly. All you can do is working with groups. Suppose you have a regex (?<!Before)Wanted, where Wanted is the regex you want to match and Before is the regex that counts out what should not precede the match. The best you can do is negate the regex Before and use the regex NotBefore(Wanted). The desired result is the first group $1.
In your case Before=[abcdefg] which is easy to negate NotBefore=[^abcdefg]. So the regex would be [^abcdefg](m). If you need the position of Wanted, you must group NotBefore too, so that the desired result is the second group.
If matches of the Before pattern have a fixed length n, that is, if the pattern contains no repetitive tokens, you can avoid negating the Before pattern and use the regular expression (?!Before).{n}(Wanted), but still have to use the first group or use the regular expression (?!Before)(.{n})(Wanted) and use the second group. In this example, the pattern Before actually has a fixed length, namely 1, so use the regex (?![abcdefg]).(m) or (?![abcdefg])(.)(m). If you are interested in all matches, add the g flag, see my code snippet:
function TestSORegEx() {
var s = "Donald Trump doesn't like jam, but Homer Simpson does.";
var reg = /(?![abcdefg])(.{1})(m)/gm;
var out = "Matches and groups of the regex " +
"/(?![abcdefg])(.{1})(m)/gm in \ns = \"" + s + "\"";
var match = reg.exec(s);
while(match) {
var start = match.index + match[1].length;
out += "\nWhole match: " + match[0] + ", starts at: " + match.index
+ ". Desired match: " + match[2] + ", starts at: " + start + ".";
match = reg.exec(s);
}
out += "\nResulting string after statement s.replace(reg, \"$1*$2*\")\n"
+ s.replace(reg, "$1*$2*");
alert(out);
}
This effectively does it
"jim".match(/[^a-g]m/)
> ["im"]
"jam".match(/[^a-g]m/)
> null
Search and replace example
"jim jam".replace(/([^a-g])m/g, "$1M")
> "jiM jam"
Note that the negative look-behind string must be 1 character long for this to work.
Is there a way to achieve the equivalent of a negative lookbehind in JavaScript regular expressions? I need to match a string that does not start with a specific set of characters.
It seems I am unable to find a regex that does this without failing if the matched part is found at the beginning of the string. Negative lookbehinds seem to be the only answer, but JavaScript doesn't has one.
This is the regex that I would like to work, but it doesn't:
(?<!([abcdefg]))m
So it would match the 'm' in 'jim' or 'm', but not 'jam'
Since 2018, Lookbehind Assertions are part of the ECMAScript language specification.
// positive lookbehind
(?<=...)
// negative lookbehind
(?<!...)
Answer pre-2018
As Javascript supports negative lookahead, one way to do it is:
reverse the input string
match with a reversed regex
reverse and reformat the matches
const reverse = s => s.split('').reverse().join('');
const test = (stringToTests, reversedRegexp) => stringToTests
.map(reverse)
.forEach((s,i) => {
const match = reversedRegexp.test(s);
console.log(stringToTests[i], match, 'token:', match ? reverse(reversedRegexp.exec(s)[0]) : 'Ø');
});
Example 1:
Following #andrew-ensley's question:
test(['jim', 'm', 'jam'], /m(?!([abcdefg]))/)
Outputs:
jim true token: m
m true token: m
jam false token: Ø
Example 2:
Following #neaumusic comment (match max-height but not line-height, the token being height):
test(['max-height', 'line-height'], /thgieh(?!(-enil))/)
Outputs:
max-height true token: height
line-height false token: Ø
Lookbehind Assertions got accepted into the ECMAScript specification in 2018.
Positive lookbehind usage:
console.log(
"$9.99 €8.47".match(/(?<=\$)\d+\.\d*/) // Matches "9.99"
);
Negative lookbehind usage:
console.log(
"$9.99 €8.47".match(/(?<!\$)\d+\.\d*/) // Matches "8.47"
);
Platform support:
✔️ V8
✔️ Google Chrome 62.0
✔️ Microsoft Edge 79.0
✔️ Node.js 6.0 behind a flag and 9.0 without a flag
✔️ Deno (all versions)
✔️ SpiderMonkey
✔️ Mozilla Firefox 78.0
🛠️ JavaScriptCore: support in beta as of 2023-02-16: the feature has been merged
✔️ Apple Safari 16.4 (in beta as of 2023-02-16)
✔️ iOS 16.4 beta WebView (all browsers on iOS + iPadOS)
✔️ Bun 0.2.2
❌ Chakra: Microsoft was working on it but Chakra is now abandoned in favor of V8
❌ Internet Explorer
❌ Edge versions prior to 79 (the ones based on EdgeHTML+Chakra)
Let's suppose you want to find all int not preceded by unsigned :
With support for negative look-behind:
(?<!unsigned )int
Without support for negative look-behind:
((?!unsigned ).{9}|^.{0,8})int
Basically idea is to grab n preceding characters and exclude match with negative look-ahead, but also match the cases where there's no preceeding n characters. (where n is length of look-behind).
So the regex in question:
(?<!([abcdefg]))m
would translate to:
((?!([abcdefg])).|^)m
You might need to play with capturing groups to find exact spot of the string that interests you or you want to replace specific part with something else.
Mijoja's strategy works for your specific case but not in general:
js>newString = "Fall ball bill balll llama".replace(/(ba)?ll/g,
function($0,$1){ return $1?$0:"[match]";});
Fa[match] ball bi[match] balll [match]ama
Here's an example where the goal is to match a double-l but not if it is preceded by "ba". Note the word "balll" -- true lookbehind should have suppressed the first 2 l's but matched the 2nd pair. But by matching the first 2 l's and then ignoring that match as a false positive, the regexp engine proceeds from the end of that match, and ignores any characters within the false positive.
Use
newString = string.replace(/([abcdefg])?m/, function($0,$1){ return $1?$0:'m';});
You could define a non-capturing group by negating your character set:
(?:[^a-g])m
...which would match every m NOT preceded by any of those letters.
This is how I achieved str.split(/(?<!^)#/) for Node.js 8 (which doesn't support lookbehind):
str.split('').reverse().join('').split(/#(?!$)/).map(s => s.split('').reverse().join('')).reverse()
Works? Yes (unicode untested). Unpleasant? Yes.
following the idea of Mijoja, and drawing from the problems exposed by JasonS, i had this idea; i checked a bit but am not sure of myself, so a verification by someone more expert than me in js regex would be great :)
var re = /(?=(..|^.?)(ll))/g
// matches empty string position
// whenever this position is followed by
// a string of length equal or inferior (in case of "^")
// to "lookbehind" value
// + actual value we would want to match
, str = "Fall ball bill balll llama"
, str_done = str
, len_difference = 0
, doer = function (where_in_str, to_replace)
{
str_done = str_done.slice(0, where_in_str + len_difference)
+ "[match]"
+ str_done.slice(where_in_str + len_difference + to_replace.length)
len_difference = str_done.length - str.length
/* if str smaller:
len_difference will be positive
else will be negative
*/
} /* the actual function that would do whatever we want to do
with the matches;
this above is only an example from Jason's */
/* function input of .replace(),
only there to test the value of $behind
and if negative, call doer() with interesting parameters */
, checker = function ($match, $behind, $after, $where, $str)
{
if ($behind !== "ba")
doer
(
$where + $behind.length
, $after
/* one will choose the interesting arguments
to give to the doer, it's only an example */
)
return $match // empty string anyhow, but well
}
str.replace(re, checker)
console.log(str_done)
my personal output:
Fa[match] ball bi[match] bal[match] [match]ama
the principle is to call checker at each point in the string between any two characters, whenever that position is the starting point of:
--- any substring of the size of what is not wanted (here 'ba', thus ..) (if that size is known; otherwise it must be harder to do perhaps)
--- --- or smaller than that if it's the beginning of the string: ^.?
and, following this,
--- what is to be actually sought (here 'll').
At each call of checker, there will be a test to check if the value before ll is not what we don't want (!== 'ba'); if that's the case, we call another function, and it will have to be this one (doer) that will make the changes on str, if the purpose is this one, or more generically, that will get in input the necessary data to manually process the results of the scanning of str.
here we change the string so we needed to keep a trace of the difference of length in order to offset the locations given by replace, all calculated on str, which itself never changes.
since primitive strings are immutable, we could have used the variable str to store the result of the whole operation, but i thought the example, already complicated by the replacings, would be clearer with another variable (str_done).
i guess that in terms of performances it must be pretty harsh: all those pointless replacements of '' into '', this str.length-1 times, plus here manual replacement by doer, which means a lot of slicing...
probably in this specific above case that could be grouped, by cutting the string only once into pieces around where we want to insert [match] and .join()ing it with [match] itself.
the other thing is that i don't know how it would handle more complex cases, that is, complex values for the fake lookbehind... the length being perhaps the most problematic data to get.
and, in checker, in case of multiple possibilities of nonwanted values for $behind, we'll have to make a test on it with yet another regex (to be cached (created) outside checker is best, to avoid the same regex object to be created at each call for checker) to know whether or not it is what we seek to avoid.
hope i've been clear; if not don't hesitate, i'll try better. :)
Using your case, if you want to replace m with something, e.g. convert it to uppercase M, you can negate set in capturing group.
match ([^a-g])m, replace with $1M
"jim jam".replace(/([^a-g])m/g, "$1M")
\\jiM jam
([^a-g]) will match any char not(^) in a-g range, and store it in first capturing group, so you can access it with $1.
So we find im in jim and replace it with iM which results in jiM.
As mentioned before, JavaScript allows lookbehinds now. In older browsers you still need a workaround.
I bet my head there is no way to find a regex without lookbehind that delivers the result exactly. All you can do is working with groups. Suppose you have a regex (?<!Before)Wanted, where Wanted is the regex you want to match and Before is the regex that counts out what should not precede the match. The best you can do is negate the regex Before and use the regex NotBefore(Wanted). The desired result is the first group $1.
In your case Before=[abcdefg] which is easy to negate NotBefore=[^abcdefg]. So the regex would be [^abcdefg](m). If you need the position of Wanted, you must group NotBefore too, so that the desired result is the second group.
If matches of the Before pattern have a fixed length n, that is, if the pattern contains no repetitive tokens, you can avoid negating the Before pattern and use the regular expression (?!Before).{n}(Wanted), but still have to use the first group or use the regular expression (?!Before)(.{n})(Wanted) and use the second group. In this example, the pattern Before actually has a fixed length, namely 1, so use the regex (?![abcdefg]).(m) or (?![abcdefg])(.)(m). If you are interested in all matches, add the g flag, see my code snippet:
function TestSORegEx() {
var s = "Donald Trump doesn't like jam, but Homer Simpson does.";
var reg = /(?![abcdefg])(.{1})(m)/gm;
var out = "Matches and groups of the regex " +
"/(?![abcdefg])(.{1})(m)/gm in \ns = \"" + s + "\"";
var match = reg.exec(s);
while(match) {
var start = match.index + match[1].length;
out += "\nWhole match: " + match[0] + ", starts at: " + match.index
+ ". Desired match: " + match[2] + ", starts at: " + start + ".";
match = reg.exec(s);
}
out += "\nResulting string after statement s.replace(reg, \"$1*$2*\")\n"
+ s.replace(reg, "$1*$2*");
alert(out);
}
This effectively does it
"jim".match(/[^a-g]m/)
> ["im"]
"jam".match(/[^a-g]m/)
> null
Search and replace example
"jim jam".replace(/([^a-g])m/g, "$1M")
> "jiM jam"
Note that the negative look-behind string must be 1 character long for this to work.
I have this structure of text :
1.6.1 Members................................................................ 12
1.6.2 Accessibility.......................................................... 13
1.6.3 Type parameters........................................................ 13
1.6.4 The T generic type aka <T>............................................. 13
I need to create JS objects :
{
num:"1.6.1",
txt:"Members"
},
{
num:"1.6.2",
txt:"Accessibility"
} ...
That's not a problem.
The problem is that I want to extract values via Regex split via positive lookahead :
Split via the first time you see that next character is a letter
What have i tried :
'1.6.1 Members........... 12'.split(/\s(?=(?:[\w\. ])+$)/i)
This is working fine :
["1.6.1", "Members...........", "12"] // I don't care about the 12.
But If I have 2 words or more :
'1.6.3 Type parameters................ 13'.split(/\s(?=(?:[\w\. ])+$)/i)
The result is :
["1.6.3", "Type", "parameters................", "13"] //again I don't care about 13.
Of course I can join them , but I want the words to be together.
Question :
How can I enhance my regex NOT to split words ?
Desired result :
["1.6.3", "Type parameters"]
or
["1.6.3", "Type parameters........"] // I will remove extras later
or
["1.6.3", "Type parameters........13"]// I will remove extras later
NB
I know I can do split via " " or by other simpler solution but I'm seeking ( for pure knowledge) for an enhancement for my solution which uses positive lookahead split.
Full online example :
nb2 :
The text can contain capital letter in the middle also.
You can use this regex:
/^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm
And get your desired matches using matched group #1 and matched group #2.
Online Regex Demo
Update: For String#split you can use this regex:
/ +(?=[A-Z\d])/g
Regex Demo
Update 2: With the possibility of having capital letters also in chapter names following more complex regex is needed:
var re = /(\D +(?=[a-z]))| +(?=[a-z\d])/gmi;
var str = '1.6.3 Type Foo Bar........................................................ 13';
var m = str.split( re );
console.log(m[0], ',', m.slice(1, -1).join(''), ',', m.pop() );
//=> 1.6.3 , Type Foo Bar........................................................ , 13
EDIT: Since you added 1.6.1 The .net 4.5 framework.... to the requirements, we can tweak the answer to this:
^([\d.]+) ((?:[^.]|\.(?!\.))+)
And if you want to allow sequences of up to three dots in the title, as in 1.6.1 She said... Boo!..........., it's an easy tweak from there ({3} quantifier):
^([\d.]+) ((?:[^.]|\.(?!\.{3}))+)
Original:
^([\d.]+) ([^.]+)
In the regex demo, see the Groups in the right pane.
To retrieve Groups 1 and 2, something like:
var myregex = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/mg;
var theMatchObject = myregex.exec(yourString);
while (theMatchObject != null) {
// the numbers: theMatchObject[1]
// the title: theMatchObject[1]
theMatchObject = myregex.exec(yourString);
}
OUTPUT
Group 1 Group 2
1.6.1 Members
1.6.2 Accessibility
1.6.3 Type parameters
1.6.4 The T generic type aka <T>**
1.6.1 The .net 4.5 framework
Explanation
^ asserts that we are a the beginning of the line
The parentheses in ([\d.]+) capture digits and dots to Group 1
The parentheses in ((?:[^.]|\.(?!\.))+) capture to Group 2...
[^.] one char that is not a dot, | OR...
\.(?!\.) a dot that is not followed by a dot...
+ one or more times
You can use this pattern too:
var myStr = "1.6.1 Members................................................................ 12\n1.6.2 Accessibility.......................................................... 13\n1.6.3 Type parameters........................................................ 13\n1.6.4 The T generic type aka <T>............................................. 13";
console.log(myStr.split(/ (.+?)\.{2,} ?\d+$\n?/m));
About a way with a lookahead :
I don't think it is possible. Because the only way to skip a character (here a space between two words), is to match it on the occasion of the previous occurence of a space (between the number and the first word). In other words, you use the fact that characters can not be matched more than one time.
But if, except the space where you want to split, all the pattern is enclosed in a lookahead, and since the substring matched by this subpattern in the lookahead isn't a part of the match result (in other words, it's only a check and the corresponding characters are not eaten by the regex engine), you can't skip the next spaces, and the regex engine will continue his road until the next space character.
Here is what i've got so far:
/(netscape)|(navigator)\/(\d+)(\.(\d+))?/.test(UserAgentString.toLowerCase()) ? ' netscape'+RegExp.$3+RegExp.$4 : ''
I'm trying to do several different things here.
(1). I want to match either netscape or navigator, and it must be followed by a single slash and one or more digits.
(2). It can optionally follow those digits with up to one of: one period and one or more digits.
The expression should evaluate to an empty string if (1) is not true.
The expression should return ' netscape8' if UserAgentString is Netscape/8 or Navigator/8.
The expression should return ' netscape8.4' if UserAgentString is Navigator/8.4.2.
The regex is not working. In particular (this is an edited down version for my testing, and it still doesn't work):
// in Chrome this produces ["netscape", "netscape", undefined, undefined]
(/(netscape)|(navigator)\/(\d+)/.exec("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.7.5) Gecko/20060912 Netscape/8.1.2".toLowerCase()))
Why does the 8 not get matched? Is it supposed to show up in the third entry or the fourth?
There are a couple things that I want to figure out if they are supported. Notice how I have 5 sets of capture paren groups. group #5 \d+ is contained within group #4: \.(\d+). Is it possible to retrieve the matched groups?
Also, what happens if I specify a group like this? /(\.\d+)*/ This matches any number of "dot-number" strings contatenated together (like in a version number). What's RegExp.$1 supposed to match here?
Your "or" expression is not doing what you think.
Simplified, you're doing this:
(a)|(b)cde
Which matches either a or bcde.
Put parentheses around your "or" expression: ((a)|(b))cde and that will match either acde or bcde.
I find http://regexpal.com/ to be a very useful tool for quickly checking my regex syntax.
Regex (netscape|navigator)\/(\d+(?:\.\d+)?) will return 2 groups (if match found):
netscape or navigator
number behind the name
var m = /(netscape|navigator)\/(\d+(?:\.\d+)?)/.exec(text);
if (m != null) {
var r = m[1] + m[2];
}
(....) Creates a group. Everything inside that group is returned with that group's variable.
The following will match netscape or navigator and the first two numbers of the version separated by a period.
$1 $2
|------------------| |------------|
/(netscape|navigator)[^\/]*\/((\d+)\.(\d+))/
The final code looks like this:
/(netscape|navigator)[^\/]*\/((\d+)\.(\d+))/.test(
navigator.userAgent.toLowerCase()
) ? 'netscape'+RegExp.$2 : ''
Which will give you
netscape5.0
Check out these great tuts (there are many more):
http://perldoc.perl.org/perlrequick.html
http://perldoc.perl.org/perlre.html
http://perldoc.perl.org/perlretut.html