How can I split a string containing emoji into an array?

I want to take a string of emoji and do something with the individual characters.
In JavaScript, "😴😄😃⛔🎠🚓🚇".length == 13 because "⛔" has length 1 and each of the others has length 2. So a plain split breaks the emoji apart, and we can't do
var string = "😴😄😃⛔🎠🚓🚇";
var s = string.split("");
console.log(s);

JavaScript ES6 has a solution for a real split:
[..."😴😄😃⛔🎠🚓🚇"] // ["😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇"]
Yay? Except that it might not work once it has been run through your transpiler (see @brainkim's comment). It only works when run natively in an ES6-compliant browser. Luckily that covers most browsers (Safari, Chrome, Firefox), but if you need broad browser compatibility this is not the solution for you.

Edit: see Orlin Georgiev's answer for a proper solution in a library: https://github.com/orling/grapheme-splitter
Thanks to this answer I made a function that takes a string and returns an array of emoji:
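var emojiStringToArray = function (str) {
  // Split on surrogate pairs; the capture group keeps each pair in the result.
  var parts = str.split(/([\uD800-\uDBFF][\uDC00-\uDFFF])/);
  var arr = [];
  for (var i = 0; i < parts.length; i++) {
    var chunk = parts[i];
    // Drop the empty strings that split() leaves between adjacent pairs.
    if (chunk !== "") {
      arr.push(chunk);
    }
  }
  return arr;
};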
So
emojiStringToArray("😴😄😃⛔🎠🚓🚇")
// => Array [ "😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇" ]

The grapheme-splitter library does just that. It is fully compatible even with old browsers and works not just with emoji but with all sorts of exotic characters:
https://github.com/orling/grapheme-splitter
You are likely to miss edge cases in any home-brew solution. This one is actually based on the UAX-29 Unicode standard.

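The modern way to split a string into characters is to use Array.from(str) instead of str.split(''). JavaScript strings are UTF-16 rather than UTF-8; Array.from iterates by code point, so surrogate pairs stay together.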
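As a rough illustration (the variable names are only for this sketch), Array.from keeps surrogate pairs together, but it still breaks emoji that are built from more than one code point:

```javascript
// Array.from iterates by code point, so surrogate pairs are kept whole.
const simple = Array.from("😴😄😃⛔🎠🚓🚇");
console.log(simple.length); // 7

// An emoji with a skin-tone modifier is two code points: base + modifier.
const withModifier = Array.from("👦🏾");
console.log(withModifier.length); // 2
```

So this is enough for simple emoji, while sequences with modifiers or joiners need a grapheme-aware approach.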

With the upcoming Intl.Segmenter, you can do this:
const splitEmoji = (string) => [...new Intl.Segmenter().segment(string)].map(x => x.segment)
splitEmoji("😴😄😃⛔🎠🚓🚇") // ['😴', '😄', '😃', '⛔', '🎠', '🚓', '🚇']
This also solves the problem with "👨‍👨‍👧‍👧" and "👦🏾".
splitEmoji("👨‍👨‍👧‍👧👦🏾") // ['👨‍👨‍👧‍👧', '👦🏾']
According to CanIUse, this is currently usable for 84.17% of users globally; IE and Firefox do not support it.
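Where support is uncertain, one option is to feature-detect Intl.Segmenter and fall back to code points. A minimal sketch, assuming a degraded code-point split is acceptable in older environments (splitGraphemes is a hypothetical helper name):

```javascript
// Split a string into user-perceived characters (grapheme clusters).
// Falls back to code points where Intl.Segmenter is unavailable.
function splitGraphemes(str) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return Array.from(segmenter.segment(str), (part) => part.segment);
  }
  // Assumption: code-point splitting is an acceptable degraded result.
  return Array.from(str);
}

console.log(splitGraphemes("😴😄😃⛔🎠🚓🚇").length); // 7
```

In the fallback path, emoji made of several code points (skin tones, ZWJ families) will still be split apart.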

It can be done using the u flag of a regular expression. The regular expression is:
/.*?/u
split() cuts the string wherever this pattern matches. The pattern is lazy, so it always matches zero characters, and the u flag makes the engine step through the string by code point, so the cuts never fall inside a surrogate pair.
Lazy, match as few as possible (here zero characters): ?
Zero or more: *
Any character except a line break: .
Treat the string as Unicode code points rather than UTF-16 code units: /u
The question mark ? forces a cut at every zero-length match. Without it, /.*/u is greedy: it matches every character up to the next line break, and split() removes all of them as a single separator.
var string = "😴😄😃⛔🎠🚓🚇"
var c = string.split(/.*?/u)
console.log(c)
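To see the effect of the u flag in isolation, here is a small comparison sketch. It uses an explicitly empty pattern, (?:), in place of .*? (an assumption made for readability; both match zero characters at each position):

```javascript
const emoji = "😴😄😃⛔🎠🚓🚇";

// Splitting on the empty string cuts between UTF-16 code units.
console.log(emoji.split("").length); // 13

// An empty pattern with the u flag cuts between code points instead.
const byCodePoint = emoji.split(/(?:)/u);
console.log(byCodePoint.length); // 7
```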

The Grapheme Splitter library by Orlin Georgiev is pretty amazing, although it hasn't been updated in a while and presently (Sep 2020) only supports Unicode 10 and below.
For an updated version of Grapheme Splitter built in TypeScript with Unicode 13 support, have a look at: https://github.com/flmnt/graphemer
Here is a quick example:
import Graphemer from 'graphemer';
const splitter = new Graphemer();
const string = "😴😄😃⛔🎠🚓🚇";
splitter.countGraphemes(string); // returns 7
splitter.splitGraphemes(string); // returns array of characters
The library also works with the latest emojis.
For example "👩🏻‍🦰".length === 7 but splitter.countGraphemes("👩🏻‍🦰") === 1.
Full disclosure: I created the library and did the work to update to Unicode 13. The API is identical to Grapheme Splitter and is entirely based on that work, just updated to the latest version of Unicode as the original library hasn't been updated for a couple of years and seems to be no longer maintained.

Related

Get emoji from url parameter [duplicate]


Javascript .replaceAll() is not a function type error

The documentation page: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replaceAll
let string = ":insertx: :insertx: :inserty: :inserty: :insertz: :insertz:";
let newstring = string.replaceAll(":insertx:", 'hello!');
When I run this, I receive Uncaught TypeError: string.replaceAll is not a function. Maybe I'm misunderstanding what a prototype is, but the function appears to be a string method that is available for use.
I'm using Chrome.

Use replace with a regular expression with the global modifier for better browser support. (Check the browser compatibility table on MDN to see which version of each browser started supporting the replaceAll method.)
let string = ":insertx: :insertx: :inserty: :inserty: :insertz: :insertz:";
let newstring = string.replace(/:insertx:/g, 'hello!');
console.log(newstring);
For a more generic solution, we can escape regular expression metacharacters and use the RegExp constructor. You could also add the function to String.prototype as a polyfill.
(It is necessary to escape the string to replace so that characters that have special meanings in regular expressions will be interpreted literally, e.g. . will refer only to actual dots rather than any character.)
//Taken from https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
function escapeRegExp(string) {
return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
}
function replaceAll(str, match, replacement){
return str.replace(new RegExp(escapeRegExp(match), 'g'), ()=>replacement);
}
console.log(replaceAll('a.b.c.d.e', '.', '__'));
console.log(replaceAll('a.b.c.d.e', '.', '$&'));
A specification-compliant shim can be found here.

.replaceAll will be available starting on Chrome 85. The current version is 83.
If you download Google Chrome Canary (which is on version 86), you'll be able to see that your code runs fine. Firefox is on version 78, and since .replaceAll has been available starting version 77, it works there too. It will work on current Safari as well. Microsoft Edge has it as unsupported.
You'll find supported browser versions at the bottom of the article in your question.

If you want neither to upgrade Chrome nor to use regular expressions (since they can be less performant), you can also do this:
let string = ":insertx: :insertx: :inserty: :inserty: :insertz: :insertz:";
let newstring = string.split(":insertx:").join('hello!');
And you can, of course, attach it to the String prototype if you'd like it everywhere. But since the real replaceAll has more features (it supports regex), it is safer to use a different name:
String.prototype.replaceAllTxt = function replaceAll(search, replace) { return this.split(search).join(replace); }
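If you would rather not touch the prototype at all, a standalone helper can use the native method when it exists and fall back to split/join otherwise. A minimal sketch, assuming search is a non-empty string (replaceAllText is a hypothetical name):

```javascript
// Hypothetical helper: literal (non-regex) replace-all with a fallback.
function replaceAllText(str, search, replacement) {
  if (typeof String.prototype.replaceAll === "function") {
    // A function replacement keeps "$&" and similar patterns literal.
    return str.replaceAll(search, () => replacement);
  }
  // split/join also treats both arguments as plain text.
  return str.split(search).join(replacement);
}

console.log(replaceAllText(":insertx: :insertx: :inserty:", ":insertx:", "hello!"));
// => "hello! hello! :inserty:"
console.log(replaceAllText("a.b.c", ".", "$&"));
// => "a$&b$&c"
```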

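You can define it yourself easily:
if (typeof String.prototype.replaceAll === "undefined") {
  String.prototype.replaceAll = function (match, replace) {
    // Escape regex metacharacters so that match is treated as literal text.
    var escaped = match.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    return this.replace(new RegExp(escaped, 'g'), () => replace);
  };
}
And use it:
"fafa".replaceAll("a", "o");
// => "fofo"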

String.prototype.replaceAll was added in ES2021 (ES12), which is why it is not defined in older browsers and older versions of Node.js.

Although this is a bit off-topic, I stumbled here for a use case that is not listed, so here it is for someone like me. I needed to remove a word if it starts with certain characters, e.g. dev. The pattern .* (any character, zero or more times) helped me do it.
'Fullstack developer'.replace(/dev.*/g, '') // => 'Fullstack ' (note the trailing space)
Note: notice the dot before the asterisk.

I was also getting the same TypeError with replaceAll. I resolved it by using the replace method with a regular expression that has the global flag set.
let unformattedDate = "06=07=2022";
const formattedString = unformattedDate.replace(/=/g, ':');
console.log(formattedString);

As I am not that familiar with regex, I wrote a simple workaround that gives the same result. In this example, I want to replace every space with an "X":
while (myString.includes(' '))
{
myString = myString.replace(' ', 'X')
}
So just keep replacing as long as the substring you want to replace is found. Note that this loops forever if the replacement itself contains that substring.
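A loop like the one above rescans the string from the start on every pass. A minimal sketch of a single-pass variant that also terminates when the replacement contains the search text (replaceEvery is a hypothetical helper name):

```javascript
// Hypothetical helper: replace every occurrence without using a regex.
function replaceEvery(str, search, replacement) {
  if (search === "") return str; // nothing sensible to search for
  let result = "";
  let from = 0;
  let at;
  // Walk forward from the end of the previous match, never backwards.
  while ((at = str.indexOf(search, from)) !== -1) {
    result += str.slice(from, at) + replacement;
    from = at + search.length;
  }
  return result + str.slice(from);
}

console.log(replaceEvery("a b c", " ", "X")); // "aXbXc"
console.log(replaceEvery("aaa", "a", "aa")); // "aaaaaa"
```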

Extracting hashtags out of a string.

If I had a string as such
var comment = "Mmmm #yummy #donut at #CZ"
How can I get a list of hash tags that exist in the string variable?
I tried using JavaScript split() method but I have to keep splitting all strings created from the initial split string.
Is there a simpler way of doing this?

Just use a regular expression to find occurrences of a hash followed by word characters.
"Mmmm #yummy #donut at #CZ".match(/#\w+/g)
// evaluates to ["#yummy", "#donut", "#CZ"]

This will do it for anything with alphabetic characters; you can extend the regexp for other characters if you want:
myString.match(/#[a-z]+/gi);

Do you care about Unicode or non-English hashtags?
"Mmmm #yummy #donut at #CZ #中文 #.dou #。#？#♥️ #にほ".match(/#[\p{L}]+/ugi)
=> (5) ["#yummy", "#donut", "#CZ", "#中文", "#にほ"]
As explained by this answer: https://stackoverflow.com/a/35112226/515585
\p{L} matches any Unicode letter
u is the Unicode flag; in JavaScript it is required for \p{...} property escapes to be recognized.
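If digits and underscores should count as part of a hashtag too, the character class can be widened. A minimal sketch, assuming that letters, numbers and "_" are the characters you want to allow (extractHashtags is a hypothetical helper name):

```javascript
// Hypothetical helper: collect hashtags made of Unicode letters, digits and "_".
function extractHashtags(text) {
  // \p{L} = any letter, \p{N} = any number; both need the u flag.
  return text.match(/#[\p{L}\p{N}_]+/gu) || [];
}

console.log(extractHashtags("Mmmm #yummy #donut at #CZ #中文 #にほ"));
// => ["#yummy", "#donut", "#CZ", "#中文", "#にほ"]
console.log(extractHashtags("no tags here")); // => []
```

The `|| []` is there because match() returns null, not an empty array, when nothing matches.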

If you care about readability:
yourText.split(' ').filter(v=> v.startsWith('#'))
will return ["#yummy", "#donut", "#CZ"]

Here is another very simple regex that allows emojis and numbers in hashtags and also splits hashtags that are not separated by white space. Most of the time this should be more than sufficient:
"Mmmm #yummy #donut at #CZ#efrefg #:) #cool😎#r234#FEGERGR#fegergr".match(/#[^\s#]*/gmi);
// => ["#yummy", "#donut", "#CZ", "#efrefg", "#:)", "#cool😎", "#r234", "#FEGERGR", "#fegergr"]
There is a small downside, though: this regex keeps trailing punctuation as part of the hashtag, e.g.:
"Mmmm #yummy.#donut#cool😎#r234#FEGERGR;#fegergr".match(/#[^\s#]*/gmi);
// => ["#yummy.", "#donut", "#cool😎", "#r234", "#FEGERGR;", "#fegergr"]
But you can extend the regex with the punctuation characters that you want to omit, like this:
"Mmmm #yummy.#donut#cool😎#r234#FEGERGR;#fegergr".match(/#[^\s#\.\;]*/gmi);
// => ["#yummy", "#donut", "#cool😎", "#r234", "#FEGERGR", "#fegergr"]

If you need a character of any alphabet within a hashtag, I'd go with something like this:
let text = "улетные #выходные // #holiday in the countryside";
const hashtags = []
if (text.length) {
  let preHashtags = text.split('#')
  // preHashtags[0] is whatever precedes the first '#' (an empty string
  // if the text starts with '#'), so it is never a hashtag: start at 1.
  for (let i = 1; i < preHashtags.length; i++) {
    let item = preHashtags[i]
    // Keep only the part before the first space; the rest is ordinary text.
    hashtags.push(item.split(' ')[0])
  }
}
console.log(hashtags) // outputs [ 'выходные', 'holiday' ]
The loop starts at index 1 because the first element of the preHashtags array is always the text before the first '#'. If the string itself starts with '#', that first element is an empty string, so skipping it is still correct.
Take note that you may need to validate the resulting hashtags array.
The benefit of this approach is that it includes any possible symbol (hence the need for sanity checks), including all non-Latin alphabets, and that it is simpler to understand, especially for beginners. Performance is superior in Chrome (and thus probably in other Chromium-derived browsers, as well as Node.js), but 6-7% worse in Firefox and 13% worse in Safari, judging by this test: https://jsben.ch/VuhEi.
Thus, the choice depends on whether you are going to run your code in Node.js or in the browser and, if it is the latter, whether you have a lot of mobile clients using Mobile Safari.

content.split(/\s+/).filter(tag => tag.startsWith('#'))

while trying to reverse string in javascript, getting NaN in the beginning of the reversed string

var input="string to be reversed";
function reverse(reversestring)
{
var result;
for(var i=reversestring.length-1;i>=0;i--)
{
result+=reversestring[i];
}
return result;
}
console.log(reverse(input));
Can you please guide me with the above code?

Initialize your result variable with an empty string, like
var result="";
If you don't initialize it, the variable starts out as undefined, and the first += puts the text "undefined" in front of your reversed string.
You can also shorten your loop, if you want:
for(var i=reversestring.length;i--;)
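Put together, a corrected version of the function from the question might look like the sketch below. It reverses UTF-16 code units, so it is only meant for plain text without emoji or combining marks:

```javascript
function reverse(str) {
  var result = ""; // start from an empty string, not undefined
  for (var i = str.length - 1; i >= 0; i--) {
    result += str[i];
  }
  return result;
}

console.log(reverse("string to be reversed")); // "desrever eb ot gnirts"
```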

It would be easier to just split the string into an array of the parts, and then javascript has a reverse() method to reverse the order of the array, and then you can join it back together again:
var input = "string to be reversed";
var output = input.split('').reverse().join('');

The following technique (or similar) is commonly used to reverse a string in JavaScript:
// Don’t use this!
var naiveReverse = function(string) {
return string.split('').reverse().join('');
}
In fact, all the answers posted so far are a variation of this pattern. However, there are some problems with this solution. For example:
naiveReverse('foo 𝌆 bar');
// → 'rab �� oof'
// Where did the `𝌆` symbol go? Whoops!
If you’re wondering why this happens, read up on JavaScript’s internal character encoding. (TL;DR: 𝌆 is an astral symbol, and JavaScript exposes it as two separate code units.)
But there’s more:
// To see which symbols are being used here, check:
// http://mothereff.in/js-escapes#1ma%C3%B1ana%20man%CC%83ana
naiveReverse('mañana mañana');
// → 'anãnam anañam'
// Wait, so now the tilde is applied to the `a` instead of the `n`? WAT.
A good string to test string reverse implementations is the following:
'foo 𝌆 bar mañana mañana'
Why? Because it contains an astral symbol (𝌆) (which are represented by surrogate pairs in JavaScript) and a combining mark (the ñ in the last mañana actually consists of two symbols: U+006E LATIN SMALL LETTER N and U+0303 COMBINING TILDE).
The order in which surrogate pairs appear cannot be reversed, else the astral symbol won’t show up anymore in the ‘reversed’ string. That’s why you saw those �� marks in the output for the previous example.
Combining marks always get applied to the previous symbol, so you have to treat both the main symbol (U+006E LATIN SMALL LETTER N) and the combining mark (U+0303 COMBINING TILDE) as a whole. Reversing their order will cause the combining mark to be paired with another symbol in the string. That’s why the example output had ã instead of ñ.
Hopefully, this explains why all the answers posted so far are wrong.
To answer your initial question — how to [properly] reverse a string in JavaScript —, I’ve written a small JavaScript library that is capable of Unicode-aware string reversal. It doesn’t have any of the issues I just mentioned. The library is called Esrever; its code is on GitHub, and it works in pretty much any JavaScript environment. It comes with a shell utility/binary, so you can easily reverse strings from your terminal if you want.
var input = 'foo 𝌆 bar mañana mañana';
esrever.reverse(input);
// → 'anañam anañam rab 𝌆 oof'
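Where Intl.Segmenter is available, a similar result can be approximated without a library by reversing grapheme clusters instead of code units. This is a rough sketch and not Esrever's implementation; reverseGraphemes is a hypothetical name, and the fallback to code points is an assumption that handles astral symbols but not combining marks:

```javascript
// Hypothetical helper, not Esrever's implementation.
function reverseGraphemes(str) {
  let parts;
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    // Grapheme segmentation keeps a base letter and its combining marks together.
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    parts = Array.from(segmenter.segment(str), (part) => part.segment);
  } else {
    // Fallback: code points keep surrogate pairs intact, nothing more.
    parts = Array.from(str);
  }
  return parts.reverse().join("");
}

console.log(reverseGraphemes("foo 𝌆 bar")); // "rab 𝌆 oof"
```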

Javascript RegExp quantifier issue

I have some JavaScript that uses replace with regular expressions to modify content on a page. I'm having a problem with a specific regex quantifier, though. All the documentation I've read (and I know it works in regex in other languages, too) says that JavaScript supports the {N}, {N,} and {N,M} quantifiers. That is, you can specify a particular number of matches you want, or a range of matches. E.g. (zz){5,} matches at least 10 z's in a row, and z{5,10} would match any number of z's from 5 to 10, no more and no less.
The problem is, I can match an exact number (e.g. z{5}) but not a range. The nearest I can figure is that it has something to do with the comma in the regex string, but I don't understand why and can't get around this. I have tried escaping the comma and even using the Unicode hexadecimal escape for the comma (\u002C), but to no avail.
To clear up any possible misunderstandings, and to address some of the questions asked in the comments, here is some additional information (also found in the comments): I have tried creating the array in all possible ways, including var = [/z{5,}/gi,/a{4,5}/gi];, var = [new RegExp('z{5,}', 'gi'), new RegExp('a{4,5}', 'gi')];, as well as var[0] = new RegExp('z{5,}'), 'gi');, var[1] = /z{5,}/gi;, etc. The array is used in a for-loop as somevar.replace(regex[i], subst[i]);.

Perhaps I'm misunderstanding the question, but it seems like the JavaScript implementation of the {n} quantifiers is pretty good:
"foobar".match(/o{2,4}/); // => matches 'oo'
"fooobar".match(/o{2,4}/); // => matches 'ooo'
"foooobar".match(/o{2,4}/); // => matches 'oooo'
"fooooooobar".match(/o{2,4}/); // => matches 'oooo'
"fooooooobar".match(/o{2,4}?/); // => lazy, matches 'oo'
"foooobar".match(/(oo){2}/); // => matches 'oooo', and captures 'oo'
"fobar".match(/[^o](o{2,3})[^o]/); // => no match
"foobar".match(/[^o](o{2,3})[^o]/); // => matches 'foob' and captures 'oo'
"fooobar".match(/[^o](o{2,3})[^o]/); // => matches 'fooob' and captures 'ooo'
"foooobar".match(/[^o](o{2,3})[^o]/); // => no match

It works for me.
var regex = [/z{5,}/gi,/a{4,5}/gi];
var subst = ['ZZZZZ','AAAAA'];
var somevar = 'zzzzz aaaaa aaaaaaa zzzzzzzzzz aaazzzaaaaaa';
print(somevar);
for (var i=0; i<2; i++) {
somevar = somevar.replace(regex[i], subst[i]);
}
print(somevar);
output:
zzzzz aaaaa aaaaaaa zzzzzzzzzz aaazzzaaaaaa
ZZZZZ AAAAA AAAAAaa ZZZZZ aaazzzAAAAAa
The constructor version works, too:
var regex = [new RegExp('z{5,}','gi'),new RegExp('a{4,5}','gi')];
See it in action on ideone.com.

I think I've figured it out. I was building the array various ways to get it to work, but what I think made the difference was using single-quotes around the regex string, instead of leaving it open like [/z{5,}/,/t{7,9}/gi]. So when I did ['/z{5,}/','/t{7,9}/gi'] that seems to have fixed it. Even though, like in Alan's example, it does sometimes work fine without them. Just not in my case I guess.
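One thing worth checking here: String.prototype.replace treats a string first argument as literal text, not as a pattern, so quoting the regex would normally stop the quantifier from being applied at all. A small sketch (the sample string is made up for illustration):

```javascript
const text = "zzzzzz";

// A string pattern is matched literally, so nothing is replaced here.
console.log(text.replace("/z{5,}/", "Z")); // "zzzzzz"

// A regex literal or a RegExp object applies the quantifier.
console.log(text.replace(/z{5,}/, "Z")); // "Z"
console.log(text.replace(new RegExp("z{5,}"), "Z")); // "Z"
```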
