Node.js Emoji Parsing

I'm trying to parse an incoming string to determine whether it contains any non-emojis.
I've gone through this great article by Mathias and am leveraging both native punycode for the encoding / decoding and regenerate for the regex generation. I'm also using EmojiData to get my dictionary of emojis.
With that all said, certain emojis continue to be pesky little buggers and refuse to match. For certain emoji, I continue to get a pair of code points.
// Example of a single code point:
console.log(punycode.ucs2.decode('💩'));
>> [ 128169 ]
// Example of a paired code point:
console.log(punycode.ucs2.decode('⌛️'));
>> [ 8987, 65039 ]
Mathias touches on this in his article (and gives an example of punycode working around this) but even using his example I get an incorrect response:
function countSymbols(string) {
return punycode.ucs2.decode(string).length;
}
console.log(countSymbols('💩'));
>> 1
console.log(countSymbols('⌛️'));
>> 2
What is the best way to detect whether a string contains all emojis or not? This is for a proof of concept so the solution can be as brute force as need be.
--- UPDATE ---
A little more context on my pesky emoji above.
These are visually identical but in fact different unicode values (the second one is from the example above):
⌛ // \u231b
⌛️ // \u231b\ufe0f
The first one works great, the second does not. Unfortunately, the second version is what iOS seems to use (if you copy and paste from iMessage you get the second one, and when receiving a text from Twilio, same thing).

U+FE0F is not a combining mark; it's a variation selector that controls how the preceding character is rendered (see this answer). Removing such selectors may change the appearance of the character. For example, U+231B followed by U+FE0E (⌛︎) requests the text-style rendering.
Also, emoji sequences can be made from multiple code points. For example, U+0032 (2) is not an emoji by itself, but U+0032+U+20E3 (2⃣) or U+0032+U+20E3+U+FE0F (2⃣️) is, whereas U+0041+U+20E3 (A⃣) isn't. A complete list of emoji sequences is maintained in the emoji-data.txt file by the Unicode Consortium (the emoji-data-js library appears to have this information).
To check if a string contains emoji characters, you will need to test if any single character is in emoji-data.txt, or starts a substring for a sequence in it.
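The check described above could be sketched roughly as follows. This is a minimal illustration, not a complete detector: `EMOJI_SEQUENCES` is a hypothetical, hand-picked stand-in for the full list you would build from the Unicode data files (or from a library such as emoji-data-js), and `isAllEmoji` is an illustrative name. It ignores U+FE0F for matching purposes only, then greedily matches the longest known sequence at each position.

```javascript
// Hypothetical stand-in for the real emoji list; a real one would be
// generated from the Unicode emoji data files.
const EMOJI_SEQUENCES = new Set([
  '\u{1F4A9}', // pile of poo
  '\u231B',    // hourglass
  '2\u20E3',   // keycap two (digit + combining enclosing keycap)
]);

// Longest entry, measured in UTF-16 code units, to bound the lookahead.
const LONGEST = Math.max(...Array.from(EMOJI_SEQUENCES, (s) => s.length));

function isAllEmoji(input) {
  // Ignore the emoji variation selector (U+FE0F) for matching only.
  const text = input.replace(/\uFE0F/g, '');
  if (text.length === 0) return false;

  let i = 0;
  while (i < text.length) {
    let matched = 0;
    // Try the longest candidate first so sequences win over single symbols.
    for (let len = Math.min(LONGEST, text.length - i); len > 0; len--) {
      if (EMOJI_SEQUENCES.has(text.slice(i, i + len))) {
        matched = len;
        break;
      }
    }
    if (matched === 0) return false; // unknown symbol: not all emoji
    i += matched;
  }
  return true;
}

console.log(isAllEmoji('\u231B\uFE0F')); // iOS-style hourglass
console.log(isAllEmoji('A\u20E3'));      // not an emoji sequence
```

Because the variation selector is dropped only inside the check, the original string (and its rendering) is left untouched.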

If, hypothetically, you know what non-emoji characters you expect to run into, you can use a little lodash magic via their toArray or split modules, which are emoji aware. For example, if you want to see if a string contains alphanumeric characters, you could write a function like so:
const _ = require('lodash');

function containsAlphaNumeric(string) {
  // _.toArray splits the string into whole symbols rather than UTF-16 code units.
  return _(string).toArray().filter(function (char) {
    return /[a-zA-Z0-9]/.test(char);
  }).value().length > 0;
}

Related

chars by unicode not appear on html

I tried to get the Unicode value of the "🤔" emoji with JavaScript, but the emoji does not appear when I add that value as HTML. A lot of emojis do not appear.
var x = "A".charCodeAt(0);
document.write(x); // gives 65
&#65; gave A
var y = "🤔".charCodeAt(0);
document.write(y); // gives 55358
&#55358; didn't give 🤔
What is the reason?

Characters (or code points) in JavaScript are made up of code units. Sometimes, a code point is made up of one code unit. For example, the code point for "a" is made up of one code unit (97 or 0x61). Other code points however are made up of two code units (known as a surrogate pair) that work together to form a single character. For example, the Thinking Face emoji (🤔) you shared in your question is made up of two code units which when paired together make the one code point to form the thinking face emoji. You can see this by taking the .length of it:
console.log("🤔".length); // 2 (as it's made up of two code units)
console.log("🤔".charCodeAt(0)); // 55358 or 0xD83E
console.log("🤔".charCodeAt(1)); // 56596 or 0xDD14
When you use .charCodeAt() you're accessing the code units of your string. As seen above, "🤔" consists of two code units, 0xD83E (55358) and 0xDD14 (56596). The first code unit, returned by .charCodeAt(0), only makes sense when paired with the second, so it doesn't work on its own, which is why you're getting the replacement character symbol (�) in your result:
console.log("🤔"[0], "🤔"[1]); // � �
Rather than using the individual code units obtained by .charCodeAt(), you can get the entire code point (i.e. the character) using .codePointAt():
console.log("🤔".codePointAt(0)); // 129300 or 0x1F914
Then, you can use this number in an HTML numeric character reference:
<p>&#129300;</p>
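To apply the same idea to a whole string, something like the following sketch could work. `toNumericReferences` is an illustrative helper name, not a built-in; it relies on `Array.from` iterating a string by code point rather than by code unit.

```javascript
// Illustrative helper (not a built-in): converts each code point of a string
// into a decimal HTML numeric character reference.
function toNumericReferences(text) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(text, (symbol) => `&#${symbol.codePointAt(0)};`).join('');
}

console.log(toNumericReferences('A'));         // &#65;
console.log(toNumericReferences('\u{1F914}')); // &#129300;
```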

This emoji: 🤔 has a decimal (dec) reference of 129300. If you want it to show in your HTML try this:
<p>&#129300;</p>
For a full list of dec references visit w3schools' website.

How to remove all of string up to and including hyphen

I am using javascript in a Mirth transformer. I apologize for my ignorance but I have no javascript training and have a hard time successfully utilizing info from similar threads here or elsewhere. I am trying to trim a string from 'Room-Bed' to be just 'Bed'. The "Room" and "Bed" values will fluctuate. All associated data is coming in from an ADT interface where our client is sending both the room and bed values, separated by a hyphen, in the bed field creating unnecessary redundancy. Please help me with the code needed to produce the result of 'Bed' from the received 'Room-Bed'.

There are many ways to reduce the string you have to the string you want. Which one you choose will depend on your inputs and the output you want. For your simple example, all of them will work. But if strings come in with multiple hyphens, they'll produce different results. They'll also have different performance characteristics. Weigh performance against how often the code will be called, and pick whichever you find most readable.
// Turns the string into an array, then pops the last element off: 'Bed'!
'Room-Bed'.split('-').pop() === 'Bed'
// Looks for the given pattern,
// replacing the match with everything after the hyphen.
'Room-Bed'.replace(/.+-(.+)/, '$1') === 'Bed'
// Finds the first index of -,
// and creates a substring based on it.
'Room-Bed'.slice('Room-Bed'.indexOf('-') + 1) === 'Bed'
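To make the multiple-hyphen difference concrete, here is a small sketch. The helper names and the sample value 'ICU-12-B' are made up for illustration and are not from the original question.

```javascript
// Illustrative helper names; neither is part of the original answer.
function afterLastHyphen(value) {
  return value.split('-').pop();
}

function afterFirstHyphen(value) {
  // indexOf returns -1 when there is no hyphen, so slice(0) keeps the whole string.
  return value.slice(value.indexOf('-') + 1);
}

console.log(afterLastHyphen('ICU-12-B'));  // B
console.log(afterFirstHyphen('ICU-12-B')); // 12-B
console.log(afterFirstHyphen('Bed'));      // Bed
```

Which behavior is right depends on whether a room or a bed identifier can itself contain a hyphen in your feed.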

My regex that should only accept latin-based characters is acting strangely

I've got a regex written to the best of my ability that allows the latin character set only with the option of a '-' that, if included MUST be followed by at least one other latin character.
My RegEx:
[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:[-]?[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)
I came to this after reading a few posts and rereading the manual to figure out the best way to approach this. This check is attached to a text field where a user types only their first name and then submits.
It works okay but there is certainly room for improvement.
Examples:
Tom // passes
Éve // passes
John-Paul // passes
2pac // passes and removes numbers (not really what I want)
John316 // passes and removes numbers (not really what I want)
What I would REALLY want to happen is a fail on those last two checks.
How would I revise it to get the outcome I'd like?

You need to anchor the regex by adding ^ at the start and $ at the end. That way, no other symbols are allowed anywhere in the input string.
I also suggest enhancing the pattern by moving ? from after hyphen to the end (that will make regex execution linear as the hyphen has no quantifier and is required, thus, limiting backtracking):
^[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:-[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)?$
See regex demo.
JS snippet:
console.log(/^[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:-[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)?$/.test('Éve')); //=> true
console.log(/^[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:-[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)?$/.test('John-Paul')); // => true
console.log(/^[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+(?:-[\u00BF-\u1FFF\u2C00-\uD7FFA-Za-z]+)?$/.test('John316')); // => false
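If the intent is strictly Latin letters, one alternative is a Unicode property escape. Note that this is a different technique from the character ranges above, which also cover Greek, Cyrillic, and other scripts. The sketch below assumes an engine that supports `\p{...}` with the `u` flag (ES2018 and later); `LATIN_NAME` and `isLatinName` are illustrative names.

```javascript
// \p{Script=Latin} matches any character in the Latin script, including accented letters.
const LATIN_NAME = /^\p{Script=Latin}+(?:-\p{Script=Latin}+)?$/u;

function isLatinName(name) {
  // Normalize so that "E" + combining accent becomes a single precomposed letter.
  return LATIN_NAME.test(name.normalize('NFC'));
}

console.log(isLatinName('\u00C9ve'));  // true
console.log(isLatinName('John-Paul')); // true
console.log(isLatinName('John316'));   // false
```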

Capturing optional part of URL with RegExp

While writing an API service for my site, I realized that String.split() won't do it much longer, and decided to try my luck with regular expressions. I have almost done it but I can't find the last bit. Here is what I want to do:
The URL represents a function call:
/api/SECTION/FUNCTION/[PARAMS]
This last part, including the slash, is optional. Some functions display a JSON reply without having to receive any arguments. Example: /api/sounds/getAllSoundpacks prints a list of available sound packs. Though, /api/sounds/getPack/8Bit prints the detailed information.
Here is the expression I have tried:
req.url.match(/\/(.*)\/(.*)\/?(.*)/);
What am I missing to make the last part optional - or capture it in whole?

This will capture everything after FUNCTION/ in your URL, independent of the appearance of any further / after FUNCTION/:
FUNCTION\/(.+)$
The RegExp will not match if there is no part after FUNCTION.

This regex should work by making last slash and part after optional:
/^\/[^/]*\/[^/]*(?:\/.*)?$/
This matches all of these strings:
/api/SECTION/FUNCTION/abc
/api/SECTION
/api/SECTION/
/api/SECTION/FUNCTION

Your pattern /(.*)/(.*)/?(.*) was almost correct, it's just a bit too short - it allows 2 or 3 slashes, but you want to accept anything with 3 or 4 slashes. And if you want to capture the last (optional) slash AND any text behind it as a whole, you simply need to create a group around that section and make it optional:
/.*/.*/.*(?:/.+)?
should do the trick.
Demo. (The pattern looks different because multiline mode is enabled, but it still works. It's also a little "better" because it won't match garbage like "///".)
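Putting the answers above together, a route parser with an optional, captured last part might look like the sketch below. It assumes the URL shape /api/SECTION/FUNCTION[/PARAMS] from the question; `API_ROUTE` and `parseApiUrl` are illustrative names.

```javascript
// Assumed URL shape: /api/SECTION/FUNCTION[/PARAMS]
const API_ROUTE = /^\/api\/([^/]+)\/([^/]+)(?:\/(.*))?$/;

function parseApiUrl(url) {
  const match = API_ROUTE.exec(url);
  if (!match) return null;
  const [, section, fn, params] = match;
  // params is undefined when the optional trailing part is absent.
  return { section, fn, params };
}

console.log(parseApiUrl('/api/sounds/getAllSoundpacks'));
console.log(parseApiUrl('/api/sounds/getPack/8Bit'));
```

Using `[^/]+` instead of `.*` for the first two groups keeps each group from swallowing slashes, so the optional group is the only place the rest of the path can end up.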

Printing emojis with JavaScript and HTML

Why does this work:
<p id="emoji">&#x1f604;</p>
And this doesn't:
document.getElementById("emoji").innerHTML = String.fromCharCode(parseInt('1f604', 16));

A 'char' in JS terms is actually a UTF-16 code unit, not a full Unicode character. (This sad state of affairs stems from ancient times when there wasn't a difference*.) To use a character outside of the Basic Multilingual Plane you have to write it in the UTF-16-encoded form of a surrogate pair of two 16-bit code units:
String.fromCharCode(0xD83D, 0xDE04)
In ECMAScript 6 we will get some interfaces that let us deal with strings as if they were full Unicode code points, though they are incomplete and are only a façade over the String type which is still stored as a code unit sequence. Then we'll be able to do:
String.fromCodePoint(0x1F604)
See this question for some polyfill code to let you use this feature in today's browsers.
(*: When I get access to a time machine I'm leaving Hitler alone and going back to invent UTF-8 earlier. UTF-16 must never have been!)
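For reference, the surrogate pair used above can be derived from the code point with a little arithmetic. The following is a minimal sketch; `toSurrogatePair` is an illustrative helper name, and it assumes its input is above U+FFFF.

```javascript
// Illustrative helper: splits a code point above U+FFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F604);
console.log(high.toString(16), low.toString(16)); // d83d de04
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1F604)); // true
```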

You can also use a hacky method if you don't want to include String.fromCodePoint() in your code. It consists of creating a detached element...
var elem = document.createElement('p');
... filling it with the working HTML...
elem.innerHTML = "&#x1f604;";
... and finally reading its value back:
var value = elem.innerHTML;
In short, this works because as soon as you set the HTML content of an element, the character reference is converted into the corresponding character.
Hope I could help.
