JavaScript provides charAt and charCodeAt methods on strings.
What is the difference between these two methods?
When would one use one over the other?
From the MDN page on charAt:
The String object's charAt() method returns a new string consisting of the single UTF-16 code unit located at the specified offset into the string.
From the MDN page on charCodeAt:
The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The UTF-16 code unit matches the Unicode code point for code points which can be represented in a single UTF-16 code unit. If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point. If you want the entire code point value, use codePointAt().
If you need the character as a string, call charAt.
If, for some reason, you need its UTF-16 code unit as a number, call charCodeAt.
(You could use it to increment the character, for example, as in the sketch after the snippet below.)
var a = 'ABC.................Z';
a.charCodeAt(0); // will return 65
a.charAt(0); // will return 'A'
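The "increment the character" idea from above could look like this (a minimal sketch; nextChar is just an illustrative helper name):
function nextChar(ch) {
  // take the UTF-16 code unit, add one, and turn it back into a string
  return String.fromCharCode(ch.charCodeAt(0) + 1);
}
nextChar('A'); // 'B'
nextChar('Y'); // 'Z'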
From MDN:
JavaScript's String type is used to represent textual data. It is a
set of "elements" of 16-bit unsigned integer values. Each element in
the String occupies a position in the String. The first element is at
index 0, the next at index 1, and so on. The length of a String is the
number of elements in it. You can create strings using string literals
or string objects.
What does it mean when it says the JavaScript String type is a set of "elements" of 16-bit unsigned integer values?
Please explain why it is a series of integer values.
Each 16-bit unsigned integer value represents a specific character, and since a string is a sequence of such elements, you can grab individual characters with [] notation, just as you would with an array. Ex:
const string = 'john doe';
console.log(string[3]) // Will print 'n', the character at index 3 (indexing starts at 0)
It just means that a string is an "array-like" object, with each character available in a similar manner to an array element. Each of those characters is stored as a UTF-16 code unit.
// The following is one string literal:
let s = "ABCDEFG";
console.log(s);
// But it's also an array-like object in that it has a length and can be indexed
console.log("The length of the string is: ", s.length);
console.log("The 3rd character is: ", s[2]);
// And we can see that the characters are stored as separate UTF-16 values:
console.log(s.charCodeAt(2)); // 67, the code unit for 'C'
As I understood it:
unsigned means there is no + or - sign, so no negative values.
16-bit means each element is one of 2^16 possible values.
set of integers means a string is represented by one or more such integers.
Therefore, to represent a string, JavaScript uses a sequence of numbers, each of which is one of 2^16 possible values (no floating-point numbers and no negative values).
Note: to understand more, read about UTF-16.
Reference: UTF-16 (IBM)
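To see this in practice, here is a small sketch (the string is arbitrary) showing that every element reported by charCodeAt is an integer in the range 0 to 65535:
const text = 'Hi!';
for (let i = 0; i < text.length; i++) {
  const unit = text.charCodeAt(i);
  console.log(unit, unit >= 0 && unit <= 65535); // 72 true, 105 true, 33 true
}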
In Unicode, each symbol has an associated number. For example, "A" is 65, "a" is 97, etc. These numbers are called code points. Depending on the encoding we’re using (UTF-32, UTF-16, UTF-8, ASCII, etc.), we represent/encode these code points in different ways. The things we use to encode these code point numbers are called "code units", or as MDN calls them, "elements".
As we're using JavaScript, we're interested in the UTF-16 encoding of characters. This means that to represent a single code unit/"element", we use 16 bits (two bytes). For "A", the "element" representation is:
0000000001000001 // (16 bits, hence 0 padding)
There are a lot of characters that we need to represent (think emojis, Chinese, Japanese, Korean scripts, etc. that each have their own code points), so 16 bits to represent and encode all of these characters alone isn't enough. That's why sometimes some code points are encoded using two code units/elements. For example, 😂 has a code point of 128514 and in UTF-16 is encoded by two elements/code units:
1101100000111101 1101111000000010
So these two code units/elements 1101100000111101 (decimal 55357) and 1101111000000010 (decimal 56834) encode the code point/"character" 128514, which represents 😂. Notice that both code units are positive (unsigned) whole numbers (integers). UTF-16 defines the algorithm for converting these elements to their code point form and vice versa (see here for examples).
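Here is a minimal sketch of that surrogate-pair arithmetic (the constants are the standard UTF-16 offsets; this is only an illustration, not a full encoder/decoder):
// combine a high and a low surrogate into a code point
const high = 0xD83D; // 55357
const low  = 0xDE02; // 56834
console.log((high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000); // 128514 (😂)
// and split a code point back into its two code units
const offset = 128514 - 0x10000;
console.log(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)); // 55357 56834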
What are the implications of all this? Well it means that strings like "😂" will have a length of 2:
console.log("😂".length); // 2
And that when you access the indexes of the string, you will access the code units/"elements" of that string:
// "😂" in UTF16 is "1101100000111101 1101111000000010"
// So "😂"[0] gives 1101100000111101 (in decimal 55357)
// So "😂"[1] gives 1101111000000010 (in decimal 56834)
console.log("😂"[0], "😂".charCodeAt(0)); // 1101100000111101
console.log("😂"[1], "😂".charCodeAt(1)); // 1101111000000010
According to this article:
Internally, JavaScript source code is treated as a sequence of UTF-16 code units.
And this IBM doc says that:
UTF-16 is based on 16-bit code units. Therefore, each character can be 16 bits (2 bytes) or 32 bits (4 bytes).
But I tested in Chrome's console and English letters take only 1 byte, not 2 or 4.
new Blob(['a']).size === 1
I wonder why that is the case? Am I missing something here?
Internally, JavaScript source code is treated as a sequence of UTF-16 code units.
Note that this is referring to source code, not String values. However, the article later states that String values are also UTF-16:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
The discrepancy here is actually in the Blob constructor. From MDN:
Note that strings here are encoded as UTF-8, unlike the usual JavaScript UTF-16 strings.
UTF-8 (which Blob uses) is a variable-width encoding.
'a' has a size of 1 byte, but 'ą', for example, has 2:
console.log('a', new Blob(['a']).size)
console.log('ą', new Blob(['ą']).size)
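A few more comparisons of the UTF-8 byte count reported by Blob against the UTF-16 .length of the same string (a small sketch; the particular characters are arbitrary):
console.log('a',  new Blob(['a']).size,  'a'.length);   // 1 1
console.log('ą',  new Blob(['ą']).size,  'ą'.length);   // 2 1
console.log('€',  new Blob(['€']).size,  '€'.length);   // 3 1
console.log('😂', new Blob(['😂']).size, '😂'.length);  // 4 2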
Is there an upper limit to the possible character length of strings in JavaScript, and ES6+ in particular?
Could you do this?
const wowThisIsALongString = `${collectedWorksOfWilliamShakespeare}`
[I'd write the collected works out by hand but am feeling lazy.]
If I understand correctly (and odds are that I don't), a JavaScript string is just a special kind of JavaScript Object, so there's technically no limit?
But maybe things are different in practice?
EDIT / UPDATE: As people have noted, a string primitive isn't an Object. I'd never thought of it as such until I checked the ECMAScript 2015 specs.
4.3.17 String value
primitive value that is a finite ordered sequence of zero or more
16-bit unsigned integer values
NOTE A String value is a member of the String type. Each integer value
in the sequence usually represents a single 16-bit unit of UTF-16
text. However, ECMAScript does not place any restrictions or
requirements on the values except that they must be 16-bit unsigned
integers.
4.3.18 String type
set of all possible String values
4.3.19 String object
member of the Object type that is an instance of the standard built-in
String constructor
NOTE A String object is created by using the String constructor in a
new expression, supplying a String value as an argument. The resulting
object has an internal slot whose value is the String value. A String
object can be coerced to a String value by calling the String
constructor as a function (21.1.1.1).
So, when they write that, is the meaning that String objects are objects which contain strings, or ... something else?
Another Update: I think that Ryan has answered this below.
There is a specified length of 2^53 − 1 in Section 6.1.4:
The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values ("elements") up to a maximum length of 2^53 − 1 elements.
This is the highest integer with unambiguous representation as a JavaScript number:
> 2**53 === 2**53 - 1
false
> 2**53 === 2**53 + 1
true
Individual engines can have smaller limits. V8, for example, limits its strings to 2^28 − 14 characters.
Side note: primitive strings aren’t objects, but that doesn’t have much to do with length limits. JavaScript has a “primitive wrapper” misfeature allowing strings, numbers, and booleans to be wrapped by objects, and that’s what the section you linked refers to, but there’s no reason to ever use it.
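You can see an engine's limit in action: exceeding it throws a RangeError (a hedged sketch; 2 ** 30 is simply a length larger than V8's limit, and the exact message varies by engine):
try {
  const huge = 'x'.repeat(2 ** 30); // longer than V8 allows
  console.log(huge.length);
} catch (e) {
  console.log(e instanceof RangeError, e.message); // true, e.g. "Invalid string length"
}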
What is the difference between String.prototype.codePointAt() and String.prototype.charCodeAt() in JavaScript?
'A'.codePointAt(); // 65
'A'.charCodeAt(); // 65
From the MDN page on charCodeAt:
The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The UTF-16 code unit matches the Unicode code point for code points which can be represented in a single UTF-16 code unit. If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point. If you want the entire code point value, use codePointAt().
TLDR;
charCodeAt() is UTF-16
codePointAt() is Unicode.
To add to ToxicTeacakes's answer, here is another example to help you see the difference:
"𠮷".charCodeAt(0).toString(16);//d842
"𠮷".charCodeAt(1).toString(16);//dfb7
"𠮷".codePointAt(0);//20bb7
"𠮷".codePointAt(1);//dfb7
console.log("\ud842\udfb7");//𠮷, an example of hexadecimal digits
console.log("\u20bb7\udfb7");//₻7�
console.log("\u{20bb7}");//𠮷 an unicode code point escapes the "\ud842\udfb7"
The following is some info about JavaScript string literals:
"\uXXXX"
The Unicode character specified by the four hexadecimal digits
XXXX. For example, \u00A9 is the Unicode sequence for the copyright
symbol.
"\u{XXXXX}"
Unicode code point
escapes. For example, \u{2F804} is the same as the simple Unicode
escapes \uD87E\uDC04.
See also MSDN.
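You can verify that the two escape forms produce the same string (a short check):
console.log("\u{2F804}" === "\uD87E\uDC04"); // true
console.log("\u00A9");                       // ©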
Example in JS
In the example with strings and emojis below, I illustrate how things can go wrong when you do not know that some characters consist of 2 code units. Some characters take up more than one code unit. Consider using codePointAt() rather than charCodeAt(), unless you are sure that all of your characters lie between 0 and 65535 (i.e. fit in a single 16-bit code unit).
More about code units here.
// charCodeAt() is UTF-16
// codePointAt() is Unicode
/* UTF-16 is generally considered a bad idea today */
const strings = ["o", "four", "to"];
const emojis = ["🐎", "👟"];
function printItemsLength(arr) {
for (const item of arr) {
console.log(item, item.length);
}
}
printItemsLength(strings);
console.log('================================');
printItemsLength(emojis);
console.log('================================');
console.log("i.charCodeAt(0)", "i".charCodeAt(0)); // 105
console.log("i.charCodeAt(1)", "i".charCodeAt(1)); // 105
console.log("i.codePointAt(0)", "i".codePointAt(0)); // 105
console.log('=============EMOJIS=============');
// getting the decimal (dec) by which you can find them
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[0] + '.charCodeAt(0)', emojis[0].charCodeAt(0)); // only half-character - 55357
console.log(emojis[0] + '.charCodeAt(1)', emojis[0].charCodeAt(1)); // only half-character - 56334
console.log('===========codePointAt===========');
console.log(emojis[0] + '.codePointAt(0)', emojis[0].codePointAt(0)); // 128014
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[1] + '.charCodeAt(0)', emojis[1].charCodeAt(0)); // only half-character - 55357
console.log(emojis[1] + '.charCodeAt(1)', emojis[1].charCodeAt(1)); // only half-character - 56415
console.log('===========codePointAt===========');
// full-character
console.log(emojis[1] + '.codePointAt(0)', emojis[1].codePointAt(0)); // 128095
console.log(emojis[1] + '.codePointAt(1)', emojis[1].codePointAt(1)); // 56415, the lone low surrogate (a non-displayable character)
// to find this emojis have a look here: https://www.w3schools.com/charsets/ref_emoji.asp
As someone may have noticed, I tried to convert back from the char code to the emoji, and it did not work for one of the symbols (that is because a single code unit of a surrogate pair is not a complete character on its own).
Introduction to Unicode and UTF-16
Please skip this section if you are already familiar with it.
Unicode is a set of characters used around the world; UTF-16 is one way of encoding them:
00000000 00100100 for "$" (one 16-bit unit); 11011000 01010010 11011111
01100010 for "𤭢" (two 16-bit units)
read more
"surrogate pair" characters are emoji and some letters that consist of more than 1 character as it is explained here
The term "surrogate pair" refers to a means of encoding Unicode
characters with high code-points in the UTF-16 encoding scheme. In the
Unicode character encoding, characters are mapped to values between
0x0 and 0x10FFFF.
read more
Unicode assigns every character a unique number called a code point.
Differentiating charCodeAt() from codePointAt()
charCodeAt(pos) returns a single code unit (not necessarily a full character).
If you need a character (that could be either one or two code units), you can use codePointAt(pos) to get its code.
charCodeAt() - returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index link
codePointAt() - returns a non-negative integer that is the Unicode code point value at the given position link
where pos is the index of the character you want to check.
Quote from the book:
UTF-16 is generally considered a bad idea today. It seems almost
intentionally designed to invite mistakes. It’s easy to write programs
that pretend code units and characters are the same things.
read more
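A concrete example of that mistake (a small sketch): slicing by code-unit index can cut a character in half.
const s = "🐎 horse";
console.log(s.slice(0, 1)); // a lone high surrogate, not the horse emoji
console.log(s.slice(0, 2)); // "🐎", both code units, the full character
console.log(s.length);      // 8 code units, even though there are only 7 code points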
jsfiddle sandbox
Sources:
What is Unicode, UTF-8, UTF-16?
Marijn Haverbeke, Eloquent JavaScript, 3rd Edition: A Modern Introduction to Programming. No Starch Press, 2018. 447 pp. Can be found here.
What is "surrogate pair"
To find these emojis, have a look at w3schools.com/charsets/ref_emoji
Chapter 5, p. 91 => Strings and character codes
I have just noticed that the parseInt function doesn't take care of the decimal part of numbers written in exponential notation (numbers containing the e character).
Let's take an example: -3.67394039744206e-15
> parseInt(-3.67394039744206e-15)
-3
> -3.67394039744206e-15.toFixed(19)
-3.6739e-15
> -3.67394039744206e-15.toFixed(2)
-0
> Math.round(-3.67394039744206e-15)
0
I expected that parseInt would also return 0. What's going on at a lower level? Why does parseInt return -3 in this case (some snippets from the source code would be appreciated)?
In this example I'm using node v0.12.1, but I expect same to happen in browser and other JavaScript engines.
I think the reason is that parseInt converts the passed value to a string by calling ToString, which returns "-3.67394039744206e-15". It then parses that string, stops at the first character that is not a decimal digit, and therefore returns -3.
The MDN documentation:
The parseInt function converts its first argument to a string, parses
it, and returns an integer or NaN
parseInt(-3.67394039744206e-15) === -3
The parseInt function expects a string as its first argument. JavaScript calls the toString method behind the scenes if the argument is not a string. So the expression is evaluated as follows:
(-3.67394039744206e-15).toString()
// "-3.67394039744206e-15"
parseInt("-3.67394039744206e-15")
// -3
-3.67394039744206e-15.toFixed(19) === -3.6739e-15
This expression is parsed as:
Unary - operator
The number literal 3.67394039744206e-15
.toFixed() -- property accessor, property name and function invocation
The way number literals are parsed is described here. Interestingly, +/- are not part of the number literal. So we have:
// property accessor has higher precedence than unary - operator
3.67394039744206e-15.toFixed(19)
// "0.0000000000000036739"
-"0.0000000000000036739"
// -3.6739e-15
Likewise for -3.67394039744206e-15.toFixed(2):
3.67394039744206e-15.toFixed(2)
// "0.00"
-"0.00"
// -0
If the parsed string (stripped of its +/- sign) contains any character that is not a digit of the radix (10 in your case), a substring is created containing only the characters before that character; everything from it onward is discarded.
In the case of -3.67394039744206e-15, the conversion starts and the radix is determined to be base 10. The conversion proceeds until it encounters '.', which is not a valid character in base 10. Thus, effectively, only the 3 is converted, which gives the value 3, and then the sign is applied, giving -3.
For implementation logic - http://www.ecma-international.org/ecma-262/5.1/#sec-15.1.2.2
More examples:
alert(parseInt("2711e2", 16)); // 2560482, since every character is a valid hexadecimal digit
alert(parseInt("2711e2", 10)); // 2711, since parsing stops at the 'e'
To note:
When no radix is given, the radix starts out at base 10.
If the string starts with '0x' (or '0X'), it switches to base 16.
(Older implementations also treated a leading '0' as base 8, but that behavior was removed in ES5, so don't rely on it.)
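A rough sketch of that character-by-character behavior (simpleParseInt is a made-up name, and this is not the real implementation; it only illustrates why parsing stops at the '.'):
// Walks the string left to right and stops at the first character
// that is not a valid base-10 digit.
function simpleParseInt(value) {
  const str = String(value).trim(); // parseInt converts its argument to a string first
  let i = 0;
  let sign = 1;
  if (str[i] === '+' || str[i] === '-') { // optional sign
    sign = str[i] === '-' ? -1 : 1;
    i++;
  }
  let digits = '';
  while (i < str.length && str[i] >= '0' && str[i] <= '9') {
    digits += str[i++]; // stops at '.', 'e', or anything else
  }
  return digits === '' ? NaN : sign * Number(digits);
}
console.log(simpleParseInt(-3.67394039744206e-15)); // -3
console.log(simpleParseInt("42px"));                // 42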
It tries to parse strings to integers. My suspicion is that your floats are first being cast to strings. Then, rather than parsing the whole value and rounding, it parses character by character and stops when it reaches the first decimal point, ignoring any decimal places or exponents.
Some examples here http://www.w3schools.com/jsref/jsref_parseint.asp
parseInt has the purpose of parsing a string and not a number:
The parseInt() function parses a string argument and returns an
integer of the specified radix (the base in mathematical numeral
systems).
And parseInt first applies ToString to its argument, after which everything from the first non-numeric character onward is ignored.
You can use Math.round, which also accepts strings (they are converted to numbers) and rounds to the nearest integer:
Math.round("12.2e-2") === 0 //true
Math.round("12.2e-2") may round up or down based on the value. Hence may cause issues.
new Number("3.2343e-10").toFixed(0) may solve the issue.
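For comparison (a short sketch):
console.log(Math.round("12.2e-2"));           // 0
console.log(Math.round("0.5"));               // 1, rounds .5 up, which may surprise
console.log(Number("3.2343e-10").toFixed(0)); // "0", a string rather than a number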
It looks like you are trying to do a calculation; using parseFloat will give you the correct answer.
parseInt, as its name says, returns an integer, whereas parseFloat returns a floating-point number (possibly in exponential notation):
parseInt(-3.67394039744206e-15) = -3
parseFloat(-3.67394039744206e-15) = -3.67394039744206e-15
console.log('parseInt(-3.67394039744206e-15) = ' , parseInt(-3.67394039744206e-15));
console.log('parseFloat(-3.67394039744206e-15) = ',parseFloat(-3.67394039744206e-15));