I've read about how Zalgo text works, and I'm looking to learn how a chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to:
a) either be stripped, assuming chat participants are to use only languages that don't require combining marks (i.e. you could write "fiancé" with a combining mark, but you'd be a bit Zalgo'ed yourself if you insisted on doing so); or,
b) be reduced to a maximum of 8 consecutive characters (the maximum encountered in actual languages)?
EDIT: In the meantime I found a completely differently phrased question ("How to protect against... diacritics?"), which is essentially the same as this one. I made its title more explicit so others will find it as well.
Assuming you're very serious about this and want a technical solution, you could do as follows:
1. Split the incoming text into smaller units (words or sentences);
2. Render each unit on the server with your font of choice (with a huge line height and lots of space below the baseline where the Zalgo "noise" would go);
3. Train a machine learning algorithm to judge whether it looks too "dark" and "busy";
4. If the algorithm's confidence is low, defer to human moderators.
This could be fun to implement, but in practice it would likely be better to go to step four straight away.
Edit: Here's a more practical, if blunt, solution in Python 2.7. Unicode characters classified as "Mark, nonspacing" and "Mark, enclosing" appear to be the main tools used to create the Zalgo effect. Unlike the above idea this won't try to determine the "aesthetics" of the text but will instead simply remove all such characters. (Needless to say, this will trash text in many, many languages. Read on for a better solution.) To filter out more character categories add them to ZALGO_CHAR_CATEGORIES.
#!/usr/bin/env python
import unicodedata
import codecs
ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print ''.join([c for c in unicodedata.normalize('NFD', line) if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES]),
Example input:
1
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
2
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
3
Output:
1
How does Zalgo text work?
2
How does Zalgo text work?
3
Finally, if you're looking to detect, rather than unconditionally remove, Zalgo text you could perform character frequency analysis. The program below does that for each line of the input file. The function is_zalgo calculates a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters). It then checks whether the third quartile of the word scores is greater than THRESHOLD. With THRESHOLD at 0.5, this amounts to detecting whether roughly one word in four is more than 50% Zalgo characters. (The THRESHOLD of 0.5 was guessed and may require adjustment for real-world use.) This type of algorithm is probably the best in terms of payoff versus coding effort.
#!/usr/bin/env python
from __future__ import division
import unicodedata
import codecs
import numpy
ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
THRESHOLD = 0.5
DEBUG = True
def is_zalgo(s):
    words = s.split()
    if not words:  # empty or whitespace-only line
        return False
    word_scores = []
    for word in words:
        cats = [unicodedata.category(c) for c in word]
        score = sum([cats.count(banned) for banned in ZALGO_CHAR_CATEGORIES]) / len(word)
        word_scores.append(score)
    total_score = numpy.percentile(word_scores, 75)
    if DEBUG:
        print total_score
    return total_score > THRESHOLD

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print is_zalgo(unicodedata.normalize('NFD', line)), "\t", line
Sample output:
0.911483990148
True Señor, could you or your fiancé explain, H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
0.333333333333
False Příliš žluťoučký kůň úpěl ďábelské ódy.
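The same frequency-analysis idea ports readily to JavaScript. Here is a rough sketch of mine (not part of the original program), assuming a runtime with Unicode property escapes (ES2018+); the threshold is just as much of a guess as THRESHOLD above:

function zalgoScore(word) {
    // fraction of the word's code units that are nonspacing or enclosing marks
    const marks = (word.match(/[\p{Mn}\p{Me}]/gu) || []).length;
    return marks / word.length;
}

function isZalgo(text, threshold = 0.5) {
    const scores = text.normalize("NFD").split(/\s+/).filter(Boolean).map(zalgoScore);
    if (scores.length === 0) return false;
    scores.sort((a, b) => a - b);
    const q3 = scores[Math.floor(scores.length * 0.75)]; // rough third quartile
    return q3 > threshold;
}

console.log(isZalgo("Příliš žluťoučký kůň úpěl ďábelské ódy.")); // false, same verdict as the Python version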
Make the comment box overflow: hidden. It doesn't actually get rid of the Zalgo text, but it prevents it from damaging other comments.
.comment {
    /* the overflow: hidden is what prevents one comment's combining marks from affecting its siblings */
    overflow: hidden;
    /* the padding gives space for any legitimate combining marks */
    padding: 0.5em;
    /* the rest are just to visually divide the three comments */
    border: solid 1px #ccc;
    margin-top: -1px;
    margin-bottom: -1px;
}
<div class=comment>The below comment looks awful.</div>
<div class=comment>H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡</div>
<div class=comment>The above comment looks awful.</div>
A related question was asked before (https://stackoverflow.com/questions/5073191/how-is-zalgo-text-implemented), but it's worth going into prevention here.
In terms of preventing this you can choose several strategies:
prevent combining diacritics entirely (and piss off many international users),
filter out combining characters using whitelisting or blacklisting (and piss off a smaller percentage of international users),
allow only a limited number of combining characters (and piss off an even smaller percentage of users; see the sketch after this list),
have a healthy moderator community (with all the downsides that has, see your question as an example here)
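For the third strategy, a minimal sketch in JavaScript (my own illustration, assuming an engine with Unicode property escapes, ES2018+) could cap each run of combining marks instead of stripping them, with 8 as the limit because that is the maximum the question mentions:

const MAX_MARKS = 8; // roughly the most marks stacked in real orthographies

function limitCombiningMarks(text, max = MAX_MARKS) {
    // \p{Mn} = nonspacing marks, \p{Me} = enclosing marks
    const excess = new RegExp("([\\p{Mn}\\p{Me}]{" + max + "})[\\p{Mn}\\p{Me}]+", "gu");
    return text.normalize("NFD").replace(excess, "$1").normalize("NFC");
}

console.log(limitCombiningMarks("fiancé")); // unchanged; only runs longer than 8 marks are trimmed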
You can get rid of Zalgo text in your application using strip-combining-marks by Mathias Bynens.
The module strip-combining-marks is available for browsers (via Bower) and Node.js applications (via npm).
Here is an example on how to use it with npm:
var stripCombiningMarks = require("strip-combining-marks");
var zalgoText = 'U̼̥̻̮͍͖n͠i͏c̯̮o̬̝̠͉̤d͖͟e̫̟̗͟ͅ';
var strippedText = stripCombiningMarks(zalgoText); // "Unicode"
Using PHP and the mindset of a demolition worker, you can get rid of the Zalgo with the iconv function. Of course, that will also kill every other character that doesn't fit into ISO-8859-1.
$unZalgoText = iconv("UTF-8", "ISO-8859-1//IGNORE", $zalgoText);
If you double-click English text in Chrome, the whitespace-delimited word you clicked on is highlighted. This is not surprising. However, the other day I was clicking while reading some text in Japanese and noticed that some words were highlighted at word boundaries, even though Japanese doesn't have spaces. Here's some example text:
どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。
For example, if you click on 薄暗い, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.
How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem active in my version of Chrome.
So it turns out v8 has a non-standard multi-language word segmenter and it handles Japanese.
function tokenizeJA(text) {
    var it = Intl.v8BreakIterator(['ja-JP'], {type:'word'})
    it.adoptText(text)
    var words = []
    var cur = 0, prev = 0
    while (cur < text.length) {
        prev = cur
        cur = it.next()
        words.push(text.substring(prev, cur))
    }
    return words
}
console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
// ["どこ", "で", "生れ", "たか", "とんと", "見当", "が", "つ", "か", "ぬ", "。", "何でも", "薄暗い", "じめじめ", "した", "所", "で", "ニャーニャー", "泣", "い", "て", "いた事", "だけ", "は", "記憶", "し", "て", "いる", "。"]
I also made a jsfiddle that shows this.
The quality is not amazing but I'm surprised this is supported at all.
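As an aside (my addition, not part of the original answer): the same ICU word segmentation is now exposed through the standardized Intl.Segmenter API (ES2022, Chrome 87+), so a sketch that avoids the non-standard v8BreakIterator could look like this:

function tokenizeJA(text) {
    const segmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
    return Array.from(segmenter.segment(text), s => s.segment);
}

console.log(tokenizeJA("どこで生れたかとんと見当がつかぬ。"));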
Based on links posted by JonathonW, the answer basically boils down to: "There's a big list of Japanese words and Chrome checks to see if you double-clicked in a word."
Specifically, v8 uses ICU to do a bunch of Unicode-related text processing things, including breaking text up into words. The ICU boundary-detection code includes a "Dictionary-Based BreakIterator" for languages that don't have spaces, including Japanese, Chinese, Thai, etc.
And for your specific example of "薄暗い", you can find that word in the combined Chinese-Japanese dictionary shipped by ICU (line 255431). There are currently 315,671 total Chinese/Japanese words in the list. Presumably if you find a word that Chrome doesn't split properly, you could send ICU a patch to add that word.
It is still rudimentary (as of 2022-11-27), but Google is progressing very fast in the various fields of language parsing.
As of today's state of the code, Google Chrome breaks the text into |生れ|たか| and |泣|い|て|いた事|. Both 'たか' and 'いた事' are lexically odd, since 'たか' and 'いた' (a) are used agglutinated with the preceding string 99.9% of the time and (b) have very little meaning on their own (usage frequency beyond the 10,000th rank).
For Chinese and Japanese, anyone can get better results with a vocabulary list of just 100,000 items (you add to the list as you read), organized from the longest strings down to single characters. For Chinese I cap the length at 5 characters (anything longer is the name of an organization or the like); for Japanese I set the maximum at 9 characters. Tonal languages have shorter words (by about 65%) compared to non-tonal ones.
To parse a paragraph, you run a "do while" loop that starts from the first character and first tries to find the longest possible string in the vocabulary list; if that fails, the search proceeds down the list to shorter, less meaningful fragments, until it bottoms out at plain letters or rare single characters (you need all of these single items in the list, e.g. the ~6,000 kanji/hanzi needed for daily reading). A minimal sketch of this greedy longest-match loop is shown below.
You set a separator when you encounter punctuation or digits, and you skip to the next word.
It would be easier if I showed this at work but I don't know if people are interested and if I can post video links here.
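For illustration (my own toy sketch, not the author's actual tool), the greedy longest-match loop described above might look like this in JavaScript, with a tiny hypothetical vocabulary and the 9-character cap mentioned for Japanese (punctuation handling omitted):

const VOCAB = new Set(["見当", "薄暗い", "記憶", "ニャーニャー"]); // hypothetical word list
const MAX_LEN = 9;

function greedySegment(text, vocab = VOCAB, maxLen = MAX_LEN) {
    const words = [];
    let i = 0;
    while (i < text.length) {
        let len = Math.min(maxLen, text.length - i);
        // try the longest candidate first, then fall back to shorter ones,
        // bottoming out at a single character
        while (len > 1 && !vocab.has(text.slice(i, i + len))) len--;
        words.push(text.slice(i, i + len));
        i += len;
    }
    return words;
}

console.log(greedySegment("薄暗い所で記憶している"));
// ["薄暗い", "所", "で", "記憶", "し", "て", "い", "る"]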
I want the user of my node.js application to write down ideas, which then get stored in a database.
So far so good, but I don't want redundant entries in that table, so I decided to check for similarity, using this package:
https://www.npmjs.com/package/string-similarity-js
Do you know a way in which I can compare two strings by meaning? For example, getting a high similarity score for "using public transport" vs. "driving by train", which scores very poorly with the package above.
To compare two strings by meaning, the strings first need to be converted to tensors, and then the distance or similarity between those tensors can be evaluated. Many algorithms can be used to convert strings to tensors, usually tied to the domain of interest, but the Universal Sentence Encoder is a broad, general-purpose sentence encoder that projects words and sentences into a single embedding space. Cosine similarity can then be used to see how close two pieces of text are in meaning.
Example
Though "king" and "kind" are close in Hamming distance (they differ by only one character), they are very different in meaning. Conversely, "queen" and "king", though they seem unrelated (all characters are different), are close in meaning. Therefore the distance (in meaning) between "king" and "queen" should be smaller than between "king" and "kind", as demonstrated in the following snippet.
<script src="https://cdn.jsdelivr.net/npm/#tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/#tensorflow-models/universal-sentence-encoder"></script>
<script>
(async () => {
    const model = await use.load();
    const embeddings = (await model.embed(['queen', 'king', 'kind'])).unstack();
    tf.losses.cosineDistance(embeddings[0], embeddings[1], 0).print(); // 0.39812755584716797
    tf.losses.cosineDistance(embeddings[1], embeddings[2], 0).print(); // 0.5585797429084778
})();
</script>
Comparing the meaning of two strings is still an area of ongoing research. If you really want to solve the problem (or to get really good performance out of your language model), you should consider getting a PhD.
For an out-of-the-box solution at the time of writing: I found this GitHub repo that implements Google's BERT model and uses it to get the embeddings of two sentences. In theory, two sentences share the same meaning if their embeddings are similar.
https://github.com/UKPLab/sentence-transformers
# the following is simplified from their README.md
from sentence_transformers import SentenceTransformer
import scipy.spatial.distance

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
# Corpus with example sentences
S1 = ['A man is eating a food.']
S2 = ['A man is eating pasta.']
s1_embedding = embedder.encode(S1)
s2_embedding = embedder.encode(S2)

# encode() returns one embedding per input sentence, so the results can be passed to cdist directly
dist = scipy.spatial.distance.cdist(s1_embedding, s2_embedding, "cosine")[0]
Example output (copied from their README.md)
Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating a piece of bread. (Score: 0.8518)
A man is eating a food. (Score: 0.8020)
A monkey is playing drums. (Score: 0.4167)
A man is riding a horse. (Score: 0.2621)
A man is riding a white horse on an enclosed ground. (Score: 0.2379)
Since this question is not about one specific regex but rather about its design/approach, it might take a while to understand the requirements and their dependencies. I have done everything I can to make it as easy as possible with this fully working yet not elegant solution (dead link).
I need to optimize text in a messaging platform; the text is created/edited by others and may have to be sanitized with a regex. All optimizations need to be done with one single regex, since they happen often and are quite expensive (or am I wrong on this?). Furthermore, the regex needs to be language-agnostic (at least compatible with JavaScript and PHP). Last but not least, the optimized text must not contain (additional) HTML, as it is used in a text-only environment.
Requirements
Optimize lines
Remove single lines
Do not remove single lines that end with two|no spaces (thus allow editors to force a newline)
Do not remove empty lines (double line breaks)
Do not remove single lines that start with symbol|char|digit|entity+space (raw lists)
Condense multiple consecutive empty lines (double line breaks) to one double line break
Optimize spaces
Remove excess spaces
Do not remove spaces at the end of a sentence
Optimize comments
Remove single line comments
Do not remove trailing comments
Overall
Preserve Html and do not add Html
Intermediate solution
So far, my solution is to combine 4 regexes which 'match' my requirements and get replaced by a single space:
Matches single lines while leaving empty lines intact and preserving raw lists: \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][\.|\)|\:|\*]) (the length is due to several list-style-types I want to support)
Matches excess empty lines: (\n+)(?=\n\n)
Matches excess spaces: " + " (the leading and trailing spaces are part of the pattern)
Matches single line comments (while ignoring trailing comments): ^\n?\/\/ .+\n
To make the optimization rather inexpensive, I concatenate them with | into one single regex, which I can use in JavaScript (as well as PHP).
r = new RegExp(" \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][.):*] )|(\n+)(?=\n\n)| + |^\n?\/\/ .+\n", "gm");
i = document.getElementById("input").innerHTML;
p = " ";
o = i.replace(r, p);
document.getElementById("output").innerHTML = o;
#input, #output { width: 100%; height: 88vh; }
#input { display: none; } #output { border: none; }
<textarea id="input">
MAKE PARAGRAPHS
This is the first paragraph.
Some sentences end with newlines.
Some don't. We need to cope with that.
This is the second paragraph.
It contains some unnecessary spaces.
Even at the end of a line.
This is the third paragraph.
Some sentences end with question- and exclamation-marks.
I hope that is ok for you. Is it? That's great! Really.
KEEP LISTS
This is an unordered list, starting with a minus+space:
- This is the first item.
- This is the second item.
- This is the third item.
Here is an unordered list, starting with entity|symbol+space:
• This is the second item.
> This is the third item. // Works in php only
* This is the fifth item.
This is a (manually) ordered lists, starting with char|digit+entity+space:
1. This is the first item.
b) This is the second item.
3: This is the third item.
Here is a mathematical list, starting with operators:
+ Plus
- Minus
% Percentage
/ Division
* Multiply
~ Like
= Equal
These are (manually) ordered lists, which are not summed up because they do not end with a space:
1 This is the first item.
b This is the second item.
I like the third item.
First: This works.
Second: It works great.
Third: That is nice!
KEEP HTML
The input text may contain Html.
The output text must simply keep it for further processing.
The output must not add Html as it is processed in a text-only environment.
I know this sounds stupid, but it isn't.
REMOVE COMMENTS
Single/whole line comments are being removed.
// Sources
// Removing single lines: https://regex101.com/r/qU1eP8/5
// Removing comments: https://www.perlmonks.org/?node_id=996552
// Tests
// Dialog: https://api.sefzig.net/dialog/test/regex/
// Jsbin: https://jsbin.com/goromad/edit?output
// Regex101: https://regex101.com/r/Xz5atA/2
// Regexr: https://regexr.com/45svm
Thank you, regex ♥ // Problem solved
~Fin~
</textarea>
<textarea id="output"><!-- Press "Run" --></textarea>
My request
Since I am not a regex expert and my approach feels cumbersome, I'd like to hear your suggestions. I know regex is expensive and everything can be done better.
You might wonder about a few details I haven't mentioned here for the sake of clarity. You also might want to test my regexes. This is why I have set up a sandbox, isolating the requirements (regexes), containing an example text with all use cases as well as a detailed description:
https://api.sefzig.net/dialog/test/regex/ (dead link)
In case you want to use the features of great tools out there, here you go:
Regexr: https://regexr.com/45svm
Regex101: https://regex101.com/r/Xz5atA/2
Jsbin: https://jsbin.com/goromad/edit?output
Thank you
for helping me straighten out this important feature of my messaging platform! Please feel free to enhance my approach, suggest an alternative, or use the results in your own project ♥
This is my first question on Stack Overflow. I have researched a lot. Please bear with me if I have done anything wrong, and help me fix it.
I am working on a project in which I have to extract text data from a PDF.
I am able to extract text from the PDF, but the extracted text sometimes contains lines which I would like to strip out.
Here's an example of unwanted lines:
ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue
Page 1 / 94
And here's an example of good lines (which I'd like to keep):
Dusk was falling as the boy arrived with his herd at an abandoned church.
I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago
Different PDFs can produce different unwanted lines.
How can I detect them?
Option 1 - Give the computer a rule: If you are able to narrow down what content you would like to keep, the obvious criterion that sticks out to me is the exclusion of special characters; you can then filter your results based on this.
So let's say you agree that all "good lines" will be without special characters ('/', '-', and '=', for example). If a line DOES contain one of these characters, you know you can remove it from the content you are keeping. This could be done in a for loop containing an if-then condition that looks something like this:
var lineArray = extractedText.split("\n"); // assumes extractedText holds the text pulled from the PDF
for (var cnt = 0; cnt < lineArray.length; cnt++)
{
    var line = lineArray[cnt];
    if (line.includes("/") || line.includes("-") || line.includes("="))
        lineArray[cnt] = "";
}
At the end of this code you could simply get all the text remaining in the array, and it would no longer contain the unwanted lines. If, however, there are unwanted lines that are virtually indistinguishable by characters, length, positioning, etc., the previous approach begins to break down on some of the trickier lines.
This is because there is no rule you can give the computer to distinguish between the good and the bad without giving it a brain such as yours that recognizes parts of speech and sentence structure. In which case you might consider option 2, which is just that.
Option 2- Give the computer a brain: Given that the text you want to remove will more or less be incoherent documentation based on what you have shown us, an open source (or purchased) natural language processor may be what you are looking for.
I found a good beginner's intro at http://myreaders.info/10_Natural_Language_Processing.pdf with some information that might be of use to you. From the source,
"Linguistics is the science of language. Its study includes:
sounds (phonology),
word formation (morphology),
sentence structure (syntax),
meaning (semantics), and understanding (pragmatics) etc.
Syntactic Analysis : Here the analysis is of words in a sentence to know the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each others. Some word sequences may be rejected if they violate the rules of the language for how words may be combined. Example: An English syntactic analyzer would reject the sentence say : 'Boy the go the to store.' "
Using some sort of NLP, you can discover whether a given section of text contains a sentence or some incoherent rambling. This test could then be used as a filter in your program for what you would like to keep or remove.
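Short of wiring up a full NLP toolkit, even a crude heuristic can act as such a filter. The sketch below is my own illustration (the 20-character minimum and the 0.6 ratio are guesses, not tested values); it separates the two samples from the question:

// Treat a line as a "good" sentence if it is reasonably long, consists mostly
// of lowercase letters, avoids special characters along the lines of Option 1
// (minus '-', since prose legitimately contains hyphens), and ends like prose.
function looksLikeSentence(line) {
    var trimmed = line.trim();
    if (trimmed.length < 20) return false;    // page numbers, short headings
    if (/[=\/]/.test(trimmed)) return false;  // special characters
    var lower = (trimmed.match(/[a-z]/g) || []).length;
    return lower / trimmed.length > 0.6 && /[.!?"']$/.test(trimmed);
}

console.log(looksLikeSentence("Page 1 / 94")); // false
console.log(looksLikeSentence("Dusk was falling as the boy arrived with his herd at an abandoned church.")); // true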
Side note: as your sample text appears to be not just plain sentences but literature, characters will sometimes speak in sentence fragments as part of the nature given to them by the author. In this case, you could add a separate condition: if the text is contained within two quotation marks and has no special characters, you keep it regardless.
In the end NLP may be more work than you require or that you want to do, in which case Option 1 is likely going to be your best bet. On the other hand, it may be just the thing you are looking for. Whatever the case or if you decide you need some combination of the two, best of luck! I hope this answer helps.
I'm trying to parse an incoming string to determine whether it contains any non-emojis.
I've gone through this great article by Mathias and am leveraging both native punycode for the encoding / decoding and regenerate for the regex generation. I'm also using EmojiData to get my dictionary of emojis.
With that all said, certain emojis continue to be pesky little buggers and refuse to match. For certain emoji, I continue to get a pair of code points.
// Example of a single code point:
console.log(punycode.ucs2.decode('💩'));
>> [ 128169 ]
// Example of a paired code point:
console.log(punycode.ucs2.decode('⌛️'));
>> [ 8987, 65039 ]
Mathias touches on this in his article (and gives an example of punycode working around this) but even using his example I get an incorrect response:
function countSymbols(string) {
    return punycode.ucs2.decode(string).length;
}
console.log(countSymbols('💩'));
>> 1
console.log(countSymbols('⌛️'));
>> 2
What is the best way to detect whether a string contains all emojis or not? This is for a proof of concept so the solution can be as brute force as need be.
--- UPDATE ---
A little more context on my pesky emoji above.
These are visually identical but in fact different unicode values (the second one is from the example above):
⌛ // \u231b
⌛️ // \u231b\ufe0f
The first one works great, the second does not. Unfortunately, the second version is what iOS seems to use (if you copy and paste from iMessage you get the second one, and when receiving a text from Twilio, same thing).
U+FE0F is not a combining mark; it's a variation selector that, together with the base character, forms a variation sequence controlling how the glyph is rendered (see this answer). Removing or changing such selectors may change the appearance of the character, for example U+231B+U+FE0E (⌛︎), which selects the text-style glyph.
Also, emoji sequences can be made from multiple code points. For example, U+0032 (2) is not an emoji by itself, but U+0032+U+20E3 (2⃣) or U+0032+U+20E3+U+FE0F (2⃣️) is, while U+0041+U+20E3 (A⃣) isn't. A complete list of emoji sequences is maintained in the emoji-data.txt file by the Unicode Consortium (the emoji-data-js library appears to have this information).
To check if a string contains emoji characters, you will need to test whether any single character is in emoji-data.txt, or starts a sequence listed in it.
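As a rough modern-JavaScript approximation (my own sketch, not derived from emoji-data.txt): Unicode property escapes (ES2018+) expose Extended_Pictographic, which catches most emoji, including the variation-selector forms above, but not keycap sequences built on plain digits:

const EMOJI_ISH = /\p{Extended_Pictographic}/u;

function containsEmoji(str) {
    return EMOJI_ISH.test(str);
}

function isOnlyEmoji(str) {
    // split into user-perceived characters so "⌛️" (U+231B U+FE0F) stays one unit
    const graphemes = Array.from(
        new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(str),
        s => s.segment
    );
    return graphemes.length > 0 && graphemes.every(g => EMOJI_ISH.test(g));
}

console.log(containsEmoji("hello 💩")); // true
console.log(isOnlyEmoji("💩⌛️"));       // true
console.log(isOnlyEmoji("hello"));      // false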
If, hypothetically, you know what non-emoji characters you expect to run into, you can use a little lodash magic via their toArray or split modules, which are emoji aware. For example, if you want to see if a string contains alphanumeric characters, you could write a function like so:
function containsAlphaNumeric(string){
    return _(string).toArray().filter(function(char){
        return char.match(/[a-zA-Z0-9]/);
    }).value().length > 0 ? true : false;
}
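Usage (assuming lodash is loaded as _):

console.log(containsAlphaNumeric("💩👍"));     // false, emoji only
console.log(containsAlphaNumeric("💩 ok 👍")); // true, contains letters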