What is the difference between String.prototype.codePointAt() and String.prototype.charCodeAt() in JavaScript?
'A'.codePointAt(); // 65
'A'.charCodeAt(); // 65
From the MDN page on charCodeAt:
The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The UTF-16 code unit matches the Unicode code point for code points which can be represented in a single UTF-16 code unit. If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point. If you want the entire code point value, use codePointAt().
TLDR;
charCodeAt() is UTF-16
codePointAt() is Unicode.
To add to ToxicTeacakes's answer, here is another example to help you see the difference:
"𠮷".charCodeAt(0).toString(16);//d842
"𠮷".charCodeAt(1).toString(16);//dfb7
"𠮷".codePointAt(0);//20bb7
"𠮷".codePointAt(1);//dfb7
console.log("\ud842\udfb7");//𠮷, an example of hexadecimal digits
console.log("\u20bb7\udfb7");//₻7�
console.log("\u{20bb7}");//𠮷 an unicode code point escapes the "\ud842\udfb7"
The following is some info about escape sequences in JavaScript string literals:
"\uXXXX"
The Unicode character specified by the four hexadecimal digits
XXXX. For example, \u00A9 is the Unicode sequence for the copyright
symbol.
"\u{XXXXX}"
Unicode code point
escapes. For example, \u{2F804} is the same as the simple Unicode
escapes \uD87E\uDC04.
see also msdn
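A quick console check of both escape forms, as a small sketch based on the examples above:
console.log("\u00A9");                        // "©" - four-digit escape
console.log("\u{2F804}" === "\uD87E\uDC04");  // true - code point escape vs. surrogate pair escapes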
Example in JS
In the example with strings and emojis below, I am going to illustrate how things can go wrong when you do not know that some characters consist of two code units. Consider using codePointAt() over charCodeAt(), or use charCodeAt() only if you are sure your characters lie between 0 and 65535 (that is, below 2^16).
more about code units here
// charCodeAt() is UTF-16
// codePointAt() is Unicode
/* UTF-16 is generally considered a bad idea today */
const strings = ["o", "four", "to"];
const emojis = ["🐎", "👟"];
function printItemsLength(arr) {
for (const item of arr) {
console.log(item, item.length);
}
}
printItemsLength(strings);
console.log('================================');
printItemsLength(emojis);
console.log('================================');
console.log("i.charCodeAt(0)", "i".charCodeAt(0)); // 105
console.log("i.charCodeAt(1)", "i".charCodeAt(1)); // 105
console.log("i.codePointAt(0)", "i".codePointAt(0)); // 105
console.log('=============EMOJIS=============');
// getting the decimal (dec) by which you can find them
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[0] + '.charCodeAt(0)', emojis[0].charCodeAt(0)); // only half-character - 55357
console.log(emojis[0] + '.charCodeAt(1)', emojis[0].charCodeAt(1)); // only half-character - 56334 (the low surrogate)
console.log('===========codePointAt===========');
console.log(emojis[0] + '.codePointAt(0)', emojis[0].codePointAt(0)); // 128014
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[1] + '.charCodeAt(0)', emojis[1].charCodeAt(0)); // only half-character - 55357
console.log(emojis[1] + '.charCodeAt(1)', emojis[1].charCodeAt(1)); // only half-character - 56415 (the low surrogate)
console.log('===========codePointAt===========');
// full-character
console.log(emojis[1] + '.codePointAt(0)', emojis[1].codePointAt(0)); // 128095
console.log(emojis[1] + '.codePointAt(1)', emojis[1].codePointAt(1)); // returns the low surrogate 56415 (not a displayable character)
// to find these emojis, have a look here: https://www.w3schools.com/charsets/ref_emoji.asp
As you may have noticed, I tried to convert back from the char code to the emoji, and it did not work for one of the symbols (that is because a single code unit cannot represent a code point above 65535).
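A minimal sketch of that round trip, using the horse emoji from the list above:
const horse = "🐎"; // U+1F40E, stored as the surrogate pair 0xD83D 0xDC0E
console.log(String.fromCharCode(horse.charCodeAt(0)));   // "\ud83d" - a lone surrogate, not a displayable character
console.log(String.fromCodePoint(horse.codePointAt(0))); // "🐎" - rebuilding from the full code point works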
Introduction to Unicode and UTF-16
Please skip this section if you are already familiar with it.
Unicode is a set of characters used around the world. UTF-16 is an encoding that uses one or two 16-bit code units per character: "$" is encoded as one 16-bit unit (00000000 00100100), while "𤭢" is encoded as two 16-bit units (11011000 01010010 11011111 01100010).
read more
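To see those code units from JavaScript (a small sketch, not part of the quoted material):
console.log("$".charCodeAt(0).toString(2).padStart(16, "0"));
// "0000000000100100" - one 16-bit code unit
console.log("𤭢".charCodeAt(0).toString(2).padStart(16, "0"),
            "𤭢".charCodeAt(1).toString(2).padStart(16, "0"));
// "1101100001010010" "1101111101100010" - two 16-bit code units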
"surrogate pair" characters are emoji and some letters that consist of more than 1 character as it is explained here
The term "surrogate pair" refers to a means of encoding Unicode
characters with high code-points in the UTF-16 encoding scheme. In the
Unicode character encoding, characters are mapped to values between
0x0 and 0x10FFFF.
read more
Unicode assigns every character a unique number called a code point.
Differentiating charCodeAt() from codePointAt()
charCodeAt(pos) returns a code unit (not necessarily a full character).
If you need a character (that could be either one or two code units), you can use codePointAt(pos) to get its code.
charCodeAt() - returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index link
codePointAt() - returns a non-negative integer that is the Unicode code point value at the given position link
where pos is the index of the character you want to check.
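As a quick illustration that pos is a UTF-16 code unit index (a minimal sketch):
const sample = "a🐎b";
for (let i = 0; i < sample.length; i++) {
  console.log(sample.charCodeAt(i)); // 97, 55357, 56334, 98 - the emoji is split into two code units
}
for (const ch of sample) {
  console.log(ch.codePointAt(0)); // 97, 128014, 98 - for...of walks whole code points
}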
Quote from the book:
UTF-16 is generally considered a bad idea today. It seems almost
intentionally designed to invite mistakes. It’s easy to write programs
that pretend code units and characters are the same things.
read more
jsfiddle sandbox
Sources:
What is Unicode, UTF-8, UTF-16?
Marijn Haverbeke, Eloquent JavaScript, 3rd Edition: A Modern Introduction to Programming. No Starch Press, 2018. 447 p. (can be found here)
What is "surrogate pair"
To find these emojis, have a look at w3schools.com/charsets/ref_emoji
Chapter 5, p. 91 => Strings and character codes
Related
I'm getting too confused. Why do code points from U+D800 to U+DBFF encode as a single (2 bytes) String element, when using the ECMAScript 6 native Unicode helpers?
I'm not asking how JavaScript/ECMAScript encodes Strings natively, I'm asking about an extra functionality to encode UTF-16 that makes use of UCS-2.
var str1 = '\u{D800}';
var str2 = String.fromCodePoint(0xD800);
console.log(
str1.length, str1.charCodeAt(0), str1.charCodeAt(1)
);
console.log(
str2.length, str2.charCodeAt(0), str2.charCodeAt(1)
);
Re-TL;DR: I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string, since my browser's ES6 implementation incorporates UCS-2 encoding in strings, which uses 2 bytes for each character code?
Both of these approaches return a one-element String for the U+D800 code point (char code: 55296, same as 0xD800). But for code points bigger than U+FFFF each one returns a two-element String, the lead and the trail. The lead would be a number between U+D800 and U+DBFF; the trail I'm not sure about, I only know it helps change the resulting code point. For me the return value doesn't make sense: it represents a lead without a trail. Am I misunderstanding something?
I think your confusion is about how Unicode encodings work in general, so let me try to explain.
Unicode itself just specifies a list of characters, called "code points", in a particular order. It doesn't tell you how to convert those to bits, it just gives them all a number between 0 and 1114111 (in hexadecimal, 0x10FFFF). There are several different ways these numbers from U+0 to U+10FFFF can be represented as bits.
In an earlier version, it was expected that a range of 0 to 65535 (0xFFFF) would be enough. This can be naturally represented in 16 bits, using the same convention as an unsigned integer. This was the original way of storing Unicode, and is now known as UCS-2. To store a single code point, you reserve 16 bits of memory.
Later, it was decided that this range was not large enough; this meant that there were code points higher than 65535, which you can't represent in a 16-bit piece of memory. UTF-16 was invented as a clever way of storing these higher code points. It works by saying "if you look at a 16-bit piece of memory, and it's a number between 0xD800 and 0xDBFF (a "high surrogate"), then you need to look at the next 16 bits of memory as well". Any piece of code which is performing this extra check is processing its data as UTF-16, and not UCS-2.
It's important to understand that the memory itself doesn't "know" which encoding it's in, the difference between UCS-2 and UTF-16 is how you interpret that memory. When you write a piece of software, you have to choose which interpretation you're going to use.
Now, onto Javascript...
Javascript handles input and output of strings by interpreting its internal representation as UTF-16. That's great, it means that you can type in and display the famous 💩 character, which can't be stored in one 16-bit piece of memory.
The problem is that most of the built-in string functions actually handle the data as UCS-2 - that is, they look at 16 bits at a time, and don't care if what they see is a special "surrogate". The function you used, charCodeAt(), is an example of this: it reads 16 bits out of memory, and gives them to you as a number between 0 and 65535. If you feed it 💩, it will just give you back the first 16 bits; ask it for the next "character" after, and it will give you the second 16 bits (which will be a "low surrogate", between 0xDC00 and 0xDFFF).
In ECMAScript 6 (2015), a new function was added: codePointAt(). Instead of just looking at 16 bits and giving them to you, this function checks if they represent one of the UTF-16 surrogate code units, and if so, looks for the "other half" - so it gives you a number between 0 and 1114111. If you feed it 💩, it will correctly give you 128169.
var poop = '💩';
console.log('Treat it as UCS-2, two 16-bit numbers: ' + poop.charCodeAt(0) + ' and ' + poop.charCodeAt(1));
console.log('Treat it as UTF-16, one value cleverly encoded in 32 bits: ' + poop.codePointAt(0));
// The surrogates are 55357 and 56489, which encode 128169 as follows:
// 0x010000 + ((55357 - 0xD800) << 10) + (56489 - 0xDC00) = 128169
Your edited question now asks this:
I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string?
The hexadecimal value D800 is 55296 in decimal, which is less than 65536, so given everything I've said above, this fits fine in 16 bits of memory. So if we ask charCodeAt to read 16 bits of memory, and it finds that number there, it's not going to have a problem.
Similarly, the .length property measures how many sets of 16 bits there are in the string. Since this string is stored in 16 bits of memory, there is no reason to expect any length other than 1.
The only unusual thing about this number is that in Unicode, that value is reserved - there isn't, and never will be, a character U+D800. That's because it's one of the magic numbers that tells a UTF-16 algorithm "this is only half a character". So a possible behaviour would be for any attempt to create this string to simply be an error - like opening a pair of brackets that you never close, it's unbalanced, incomplete.
The only way you could end up with a string of length 2 is if the engine somehow guessed what the second half should be; but how would it know? There are 1024 possibilities, from 0xDC00 to 0xDFFF, which could be plugged into the formula I show above. So it doesn't guess, and since it doesn't error, the string you get is 16 bits long.
Of course, you can supply the matching halves, and codePointAt will interpret them for you.
// Set up two 16-bit pieces of memory
var high=String.fromCharCode(55357), low=String.fromCharCode(56489);
// Note: String.fromCodePoint will give the same answer
// Glue them together (this + is string concatenation, not number addition)
var poop = high + low;
// Read out the memory as UTF-16
console.log(poop);
console.log(poop.codePointAt(0));
Well, it does this because the specification says it has to:
http://www.ecma-international.org/ecma-262/6.0/#sec-string.fromcodepoint
http://www.ecma-international.org/ecma-262/6.0/#sec-utf16encoding
Together these two say that if an argument is < 0 or > 0x10FFFF, a RangeError is thrown, but otherwise any codepoint <= 65535 is incorporated into the result string as-is.
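A short sketch of that specified behaviour:
try {
  String.fromCodePoint(0x110000); // greater than 0x10FFFF
} catch (e) {
  console.log(e instanceof RangeError); // true
}
console.log(String.fromCodePoint(0xD800).length);   // 1 - the lone surrogate is kept as-is
console.log(String.fromCodePoint(0x10FFFF).length); // 2 - encoded as a surrogate pair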
As for why things are specified this way, I don't know. It seems like JavaScript doesn't really support Unicode, only UCS-2.
Unicode.org has the following to say on the matter:
http://www.unicode.org/faq/utf_bom.html#utf16-2
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
http://www.unicode.org/faq/utf_bom.html#utf16-7
Q: Are there any 16-bit values that are invalid?
A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.
Therefore the result of String.fromCodePoint is not always valid UTF-16 because it can emit unpaired surrogates.
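If you need to detect such strings, a hand-rolled check could look like this (a sketch; newer engines also provide String.prototype.isWellFormed()):
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const cu = str.charCodeAt(i);
    if (cu >= 0xD800 && cu <= 0xDBFF) {           // high surrogate...
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {     // ...properly paired with a low surrogate
        i++;
        continue;
      }
      return true;                                // high surrogate without its pair
    }
    if (cu >= 0xDC00 && cu <= 0xDFFF) return true; // low surrogate with no lead
  }
  return false;
}
console.log(hasLoneSurrogate(String.fromCodePoint(0xD800)));  // true
console.log(hasLoneSurrogate(String.fromCodePoint(0x1F4A9))); // false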
We have to write a small program for our teacher that gets the ASCII code of any character in JavaScript.
I have searched and researched, but it seems that there is no method to do so. I have only found:
charCodeAt()
http://www.hacksparrow.com/get-ascii-value-of-character-convert-ascii-to-character-in-javascript.html
That returns the Unicode value, but not ASCII.
I have read in this forum that the ASCII value is the same as the Unicode value for the ASCII characters that already have an ASCII value:
Are Unicode and Ascii characters the same?
But it seems that is not always the case, as for example with the extended ASCII characters. So for example:
var myCaracter = "├";
var n = myCaracter.charCodeAt(0);
document.write (n);
The ASCII value of that character is 195, but the program returns 226 (Unicode value).
I can't find a pattern to follow to convert from one to another, so:
Can we obtain the ASCII value from the Unicode value, or should I look for another way?
Thanks!
ASCII characters only use 7 bits, with values from 0 to 127 (00 to 7F hex). They include:
control characters (0 to 31, as well as 127)
digits (0 to 9, encoded 48 to 57)
uppercase letters (65 to 90)
lowercase letters (97 to 122)
a limited number of punctuation and other symbols.
ASCII characters are a subset of Unicode (the "C0 Controls and Basic Latin Block"), and they are encoded exactly the same in UTF-8. The ASCII code of "A" (65 or 0x41) is the same as the Unicode code point for "A" (U+0041).
The character (├) you're considering is not ASCII. It's part of many different character sets / code pages, where it may have different numerical values / encodings, but it's definitely not ASCII.
That character is not even defined in the most common 8-bit ASCII extensions, known as ISO-8859-*. It is part of code page 437 (used on MS-DOS), where its numerical code is 0xC3 (195). But that's definitely not ASCII.
The Unicode code point for that character is U+251C (9500 decimal), which is the return value of charCodeAt for this character, not 226.
You're probably getting 226 because you're interpreting a UTF-8 string that has not been recognised as such: U+251C is encoded in UTF-8 as the three bytes 0xE2 0x94 0x9C, and 0xE2 is 226.
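You can see those bytes directly (a sketch using TextEncoder, which is available in modern browsers and Node):
const bytes = new TextEncoder().encode("├"); // UTF-8 bytes of U+251C
console.log([...bytes]);                     // [226, 148, 156] - 0xE2 0x94 0x9C
console.log("├".charCodeAt(0));              // 9500 - the actual UTF-16 code unit / code point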
Today my teacher apologized, because it may have been her mistake to tell us that charCodeAt() is wrong for obtaining the ASCII code; she wanted us to use that method, as @Rad Lexus suggested.
So it is not necessary for my exercise, but as practice, and to help anyone who might need it, I added a small validation to the code so that the user cannot enter extended ASCII characters (code 128 or higher), which is where the problems with charCodeAt() seem to start.
Maybe it is not a smart solution, and it certainly was not necessary in my exercise; it also means that some characters needed in other languages (ö in German or ñ in Spanish, for example) are forbidden. But I think it is good to post the code and let anyone who uses it decide whether to apply this validation or not.
Thanks to everyone who helped me.
Defining function:
function validate(text)
{
    var isValid = false;
    var i = 0;
    if (text != null && text.length > 0 && text != '')
    {
        isValid = true;
        for (i = 0; i < text.length; ++i) /* this is not necessary, but I did it */
        {
            if (text.charCodeAt(i) >= 128)
            {
                isValid = false;
            }
        }
    }
    return isValid;
}
Using function
var isValid = false;
var position = 0;
while (isValid == false)
{
    text = prompt("Enter your text");
    isValid = validate(text);
}
I have been trying the following code to get the ASCII equivalent character
String.fromCharCode("149")
but it only seems to work up to 126 passed as the parameter. For 149, the symbol generated should be
•
128 and beyond is not standard ASCII. The value 149 (0x95) maps to the bullet only in Windows-1252 and similar code pages, while String.fromCharCode() works with UTF-16 code units, where the bullet is U+2022:
var s = "•";
alert(s.charCodeAt(0))
gives 8226
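If the 149 really comes from a Windows-1252 byte (where 0x95 is the bullet), a decoding sketch like this maps it to the Unicode value:
const ch = new TextDecoder("windows-1252").decode(new Uint8Array([149]));
console.log(ch);                         // "•"
console.log(ch.codePointAt(0));          // 8226 (U+2022)
console.log(String.fromCharCode(8226));  // "•" - use the Unicode value with fromCharCode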
https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode
Getting it to work with higher values Although most common Unicode
values can be represented with one 16-bit number (as expected early on
during JavaScript standardization) and fromCharCode() can be used to
return a single character for the most common values (i.e., UCS-2
values which are the subset of UTF-16 with the most common
characters), in order to deal with ALL legal Unicode values (up to 21
bits), fromCharCode() alone is inadequate. Since the higher code point
characters use two (lower value) "surrogate" numbers to form a single
character, String.fromCodePoint() (part of the ES6 draft) can be used
to return such a pair and thus adequately represent these higher
valued characters.
The fromCharCode() method converts Unicode values into characters.
To look up Unicode values, see the Unicode table:
http://unicode-table.com/en/
I got String.fromCodePoint(149) to show something inside an alert in Firefox, but not in IE and Chrome. It may be because of browser language settings.
But this looks correct according to the ASCII table.
http://www.asciitable.com/
This is the code I used
alert(String.fromCodePoint(149));
The following doesn't seem correct:
"🚀".charCodeAt(0); // returns 55357 in both Firefox and Chrome
That's the Unicode character named ROCKET (U+1F680); the decimal value should be 128640.
This is for a Unicode app I am writing. It seems most, but not all, characters from Unicode 6 are stuck at 55357.
How can I fix it? Thanks.
JavaScript is using UTF-16 encoding; see this article for details:
Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834
0xDF06. This is called a surrogate pair. Note that a surrogate pair
only represents a single character.
The first code unit of a surrogate pair is always in the range from
0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.
The second code unit of a surrogate pair is always in the range from
0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.
You can decode the surrogate pair like this:
codePoint = (text.charCodeAt(0) - 0xD800) * 0x400 + text.charCodeAt(1) - 0xDC00 + 0x10000
Complete code can be found in the Mozilla documentation for charCodeAt.
Tried this out:
> "🚀".charCodeAt(0);
55357
> "🚀".charCodeAt(1);
56960
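Plugging those two code units into the formula above (a small sketch):
const text = "🚀";
const codePoint = (text.charCodeAt(0) - 0xD800) * 0x400 + text.charCodeAt(1) - 0xDC00 + 0x10000;
console.log(codePoint); // 128640 (0x1F680)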
Related questions on SO:
Expressing UTF-16 unicode characters in JavaScript
Unicode characters from charcode in javascript for charcodes > 0xFFFF
You might want to take a look at this too:
Getting it to work with higher values
I think it's because they're returning you the first code unit of the UTF-16 encoding of that character. I'm not sure there's much you can do, because they're returning a 16-bit value; I would probably try manually decoding the character from the first two code units and then encoding it in UTF-32, which seems to be what you want.
I need to get a string / char from a unicode charcode and finally put it into a DOM TextNode to add into an HTML page using client side JavaScript.
Currently, I am doing:
String.fromCharCode(parseInt(charcode, 16));
where charcode is a hex string containing the charcode, e.g. "1D400". The unicode character which should be returned is 𝐀, but a 퐀 is returned! Characters in the 16 bit range (0000 ... FFFF) are returned as expected.
Any explanation and / or proposals for correction?
Thanks in advance!
String.fromCharCode can only handle code points in the BMP (i.e. up to U+FFFF). To handle higher code points, this function from Mozilla Developer Network may be used to return the surrogate pair representation:
function fixedFromCharCode (codePt) {
if (codePt > 0xFFFF) {
codePt -= 0x10000;
return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
} else {
return String.fromCharCode(codePt);
}
}
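For example, with the hex string from the question:
console.log(fixedFromCharCode(parseInt("1D400", 16))); // "𝐀"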
The problem is that JavaScript strings are (mostly) treated as UCS-2, but a character outside the Basic Multilingual Plane can be represented in JavaScript as a UTF-16 surrogate pair.
The following function is adapted from Converting punycode with dash character to Unicode:
function utf16Encode(input) {
    var output = [], i = 0, len = input.length, value;
    while (i < len) {
        value = input[i++];
        if (value >= 0xD800 && value <= 0xDFFF) {
            // lone surrogate code points are not legal characters
            throw new RangeError("UTF-16(encode): Illegal UTF-16 value");
        }
        if (value > 0xFFFF) {
            value -= 0x10000;
            output.push(String.fromCharCode(((value >>> 10) & 0x3FF) | 0xD800));
            value = 0xDC00 | (value & 0x3FF);
        }
        output.push(String.fromCharCode(value));
    }
    return output.join("");
}
alert( utf16Encode([0x1D400]) );
Section 8.4 of the EcmaScript language spec says
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
So you need to encode supplemental code-points as pairs of UTF-16 code units.
The article "Supplementary Characters in the Java Platform" gives a good description of how to do this.
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
The following table shows the different representations of a few characters in comparison:
code points / UTF-16 code units
U+0041 / 0041
U+00DF / 00DF
U+6771 / 6771
U+10400 / D801 DC00
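For the last row, the surrogate pair can be computed directly (a small sketch, not from the quoted article):
const cp = 0x10400 - 0x10000;
const high = 0xD800 + (cp >> 10);    // 0xD801
const low  = 0xDC00 + (cp & 0x3FF);  // 0xDC00
console.log(high.toString(16), low.toString(16)); // "d801" "dc00"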
Once you know the UTF-16 code units, you can create a string using the javascript function String.fromCharCode:
String.fromCharCode(0xd801, 0xdc00) === '𐐀'
String.fromCodePoint() seems to do the trick as well. See here.
console.log(String.fromCodePoint(0x1D622, 0x1D623, 0x1D624, 0x1D400));
Output:
𝘢𝘣𝘤𝐀