How is a JavaScript string a set of elements of integer values?

How is a JavaScript string a set of elements of integer values? - javascript

From MDN:
JavaScript's String type is used to represent textual data. It is a
set of "elements" of 16-bit unsigned integer values. Each element in
the String occupies a position in the String. The first element is at
index 0, the next at index 1, and so on. The length of a String is the
number of elements in it. You can create strings using string literals
or string objects.
What does it mean when you say the JavaScript String type is a set of "elements" of 16-bit unsigned integer values?
Please explain why it is a series of integer values.

The 16-bit unsigned integer values is a representation of specific characters and since it is a set of elements, you are able to grab specific characters within a string with [] notation as you would a list. Ex:
const string = 'john doe';
console.log(string[3]) // Will print 'n' as it is the 3rd index characters (starts at 0)

It just means that a string is an "array-like" object with each character available in a similar manner to an array element. Each of those characters are stored as a UTF-16 value.
// The following is one string literal:
let s = "ABCDEFG";
console.log(s);
// But it's also an array-like object in that it has a length and can be indexed
console.log("The length of the string is: ", s.length);
console.log("The 3rd character is: ", s[2]);
// And we can see that the characters are stored as separate UTF-16 values:
console.log(s.charCodeAt(2));

As I understood:
unsigned means not + or -.
16 bit means 2^16 number of elements/characters can represent.
set of Integers mean to represent a String use multiple integers (1 or more).
Therefore this means to represents a string, JavaScript uses a set of numbers (each number is one of 2^16 numbers, because no float numbers and no positive/negative representation).
Note: to understand more read about UTF-16
Reference: UTF-16 (IBM)

In Unicode, each symbol has an associated number. For example, "A" is 65, "a" is 97, etc. These numbers are called code points. Depending on the encoding we’re using (UTF-32, UTF-16, UTF-8, ASCII, etc.), we represent/encode these code points in different ways. The things we use to encode these code point numbers are called "code units", or as MDN calls them, "elements".
As we're using JavaScript, we're interested in the UTF-16 encoding of characters. This means that to represent a single code unit/"element", we use 16 bits (two bytes). For "A", the "element" representation is:
0000000001000001 // (16 bits, hence 0 padding)
There are a lot of characters that we need to represent (think emojis, Chinese, Japanese, Korean scripts, etc. that each have their own code points), so 16 bits to represent and encode all of these characters alone isn't enough. That's why sometimes some code points are encoded using two code units/elements. For example, 😂 has a code point of 128514 and in UTF-16 is encoded by two elements/code units:
1101100000111101 1101111000000010
So these two code units/elements 1101100000111101 (decimal 55357) and 1101111000000010 (decimal 56834) encode the code point/"character" of 128514 which represents 😂. Notice how both code units are both positive (unsigned), and are whole numbers (integers). UTF16 outlines the algorithm to take these elements from the element form to their code point form and vice-versa (see here for examples).
What are the implications of all this? Well it means that strings like "😂" will have a length of 2:
console.log("😂".length); // 2
And that when you access the indexes of the string, you will access the code units/"elements" of that string:
// "😂" in UTF16 is "1101100000111101 1101111000000010"
// So "😂"[0] gives 1101100000111101 (in decimal 55357)
// So "😂"[1] gives 1101111000000010 (in decimal 56834)
console.log("😂"[0], "😂".charCodeAt(0)); // 1101100000111101
console.log("😂"[1], "😂".charCodeAt(1)); // 1101111000000010

Related

charAt vs charCodeAt?

JavaScript provides charAt and charCodeAt methods on strings.
What is the difference between these two methods?
When would one use on over the other?

From the MDN page on charAt
The String object's charAt() method returns a new string consisting of the single UTF-16 code unit located at the specified offset into the string.
From the MDN page on charCodeAt:
The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The UTF-16 code unit matches the Unicode code point for code points which can be represented in a single UTF-16 code unit. If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point. If you want the entire code point value, use codePointAt().
If you need the char as a string, call charAt.
If, for some reason, you need the UTF char code, call charCodeAt
(You could use it to increment the character for example.)

var a = 'ABC.................Z';
a.charCodeAt(0); // will return 65
a.charAt(0); // will return 'A'

Is there a maximum length for an ES6 multi-line string?

Is there an upper limit to the possible character length of strings in JavaScript, and ES6+ in particular?
Could you do this?
const wowThisIsALongString = `${collectedWorksOfWilliamShakespeare}`
[I'd write the collected works out by hand but am feeling lazy.]
If I understand correctly (and odds are that I don't), a JavaScript string is just a special kind of JavaScript Object, so there's technically no limit?
But maybe things are different in practice?
EDIT / UPDATE: As people have noted, a string primitive isn't an Object. I'd never thought of it as such until I checked the ECMAScript 2015 specs.
4.3.17 String value
primitive value that is a finite ordered sequence of zero or more
16-bit unsigned integer
NOTE A String value is a member of the String type. Each integer value
in the sequence usually represents a single 16-bit unit of UTF-16
text. However, ECMAScript does not place any restrictions or
requirements on the values except that they must be 16-bit unsigned
integers.
4.3.18 String type
set of all possible String values
4.3.19 String object
member of the Object type that is an instance of the standard built-in
String constructor
NOTE A String object is created by using the String constructor in a
new expression, supplying a String value as an argument. The resulting
object has an internal slot whose value is the String value. A String
object can be coerced to a String value by calling the String
constructor as a function (21.1.1.1).
So, when they write that, is the meaning that String objects are objects which contain strings, or ... something else?
Another Update: I think that Ryan has answered this below.

There is a specified length of 253 − 1 in Section 6.1.4:
The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 253-1 elements.
This is the highest integer with unambiguous representation as a JavaScript number:
> 2**53 === 2**53 - 1
false
> 2**53 === 2**53 + 1
true
Individual engines can have smaller limits. V8, for example, limits its strings to 228 − 14 characters.
Side note: primitive strings aren’t objects, but that doesn’t have much to do with length limits. JavaScript has a “primitive wrapper” misfeature allowing strings, numbers, and booleans to be wrapped by objects, and that’s what the section you linked refers to, but there’s no reason to ever use it.

Why does code points between U+D800 and U+DBFF generate one-length string in ECMAScript 6?

I'm getting too confused. Why do code points from U+D800 to U+DBFF encode as a single (2 bytes) String element, when using the ECMAScript 6 native Unicode helpers?
I'm not asking how JavaScript/ECMAScript encodes Strings natively, I'm asking about an extra functionality to encode UTF-16 that makes use of UCS-2.
var str1 = '\u{D800}';
var str2 = String.fromCodePoint(0xD800);
console.log(
str1.length, str1.charCodeAt(0), str1.charCodeAt(1)
);
console.log(
str2.length, str2.charCodeAt(0), str2.charCodeAt(1)
);
Re-TL;DR: I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string, since my browser's ES6 implementation incorporates UCS-2 encoding in strings, which uses 2 bytes for each character code?
Both of these approaches return a one-element String for the U+D800 code point (char code: 55296, same as 0xD800). But for code points bigger than U+FFFF each one returns a two-element String, the lead and trail. lead would be a number between U+D800 and U+DBFF, and trail I'm not sure about, I only know it helps changing the result code point. For me the return value doesn't make sense, it represents a lead without trail. Am I understanding something wrong?

I think your confusion is about how Unicode encodings work in general, so let me try to explain.
Unicode itself just specifies a list of characters, called "code points", in a particular order. It doesn't tell you how to convert those to bits, it just gives them all a number between 0 and 1114111 (in hexadecimal, 0x10FFFF). There are several different ways these numbers from U+0 to U+10FFFF can be represented as bits.
In an earlier version, it was expected that a range of 0 to 65535 (0xFFFF) would be enough. This can be naturally represented in 16 bits, using the same convention as an unsigned integer. This was the original way of storing Unicode, and is now known as UCS-2. To store a single code point, you reserve 16 bits of memory.
Later, it was decided that this range was not large enough; this meant that there were code points higher than 65535, which you can't represent in a 16-bit piece of memory. UTF-16 was invented as a clever way of storing these higher code points. It works by saying "if you look at a 16-bit piece of memory, and it's a number between 0xD800 and 0xDBF (a "low surrogate"), then you need to look at the next 16 bits of memory as well". Any piece of code which is performing this extra check is processing its data as UTF-16, and not UCS-2.
It's important to understand that the memory itself doesn't "know" which encoding it's in, the difference between UCS-2 and UTF-16 is how you interpret that memory. When you write a piece of software, you have to choose which interpretation you're going to use.
Now, onto Javascript...
Javascript handles input and output of strings by interpreting its internal representation as UTF-16. That's great, it means that you can type in and display the famous 💩 character, which can't be stored in one 16-bit piece of memory.
The problem is that most of the built in string functions actually handle the data as UCS-2 - that is, they look at 16 bits at a time, and don't care if what they see is a special "surrogate". The function you used, charCodeAt(), is an example of this: it reads 16 bits out of memory, and gives them to you as a number between 0 and 65535. If you feed it 💩, it will just give you back the first 16 bits; ask it for the next "character" after, and it will give you the second 16 bits (which will be a "high surrogate", between 0xDC00 and 0xDFFF).
In ECMAScript 6 (2015), a new function was added: codePointAt(). Instead of just looking at 16 bits and giving them to you, this function checks if they represent one of the UTF-16 surrogate code units, and if so, looks for the "other half" - so it gives you a number between 0 and 1114111. If you feed it 💩, it will correctly give you 128169.
var poop = '💩';
console.log('Treat it as UCS-2, two 16-bit numbers: ' + poop.charCodeAt(0) + ' and ' + poop.charCodeAt(1));
console.log('Treat it as UTF-16, one value cleverly encoded in 32 bits: ' + poop.codePointAt(0));
// The surrogates are 55357 and 56489, which encode 128169 as follows:
// 0x010000 + ((55357 - 0xD800) << 10) + (56489 - 0xDC00) = 128169
Your edited question now asks this:
I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string?
The hexadecimal value D800 is 55296 in decimal, which is less than 65536, so given everything I've said above, this fits fine in 16 bits of memory. So if we ask charCodeAt to read 16 bits of memory, and it finds that number there, it's not going to have a problem.
Similarly, the .length property measures how many sets of 16 bits there are in the string. Since this string is stored in 16 bits of memory, there is no reason to expect any length other than 1.
The only unusual thing about this number is that in Unicode, that value is reserved - there isn't, and never will be, a character U+D800. That's because it's one of the magic numbers that tells a UTF-16 algorithm "this is only half a character". So a possible behaviour would be for any attempt to create this string to simply be an error - like opening a pair of brackets that you never close, it's unbalanced, incomplete.
The only way you could end up with a string of length 2 is if the engine somehow guessed what the second half should be; but how would it know? There are 1024 possibilities, from 0xDC00 to 0xDFFF, which could be plugged into the formula I show above. So it doesn't guess, and since it doesn't error, the string you get is 16 bits long.
Of course, you can supply the matching halves, and codePointAt will interpret them for you.
// Set up two 16-bit pieces of memory
var high=String.fromCharCode(55357), low=String.fromCharCode(56489);
// Note: String.fromCodePoint will give the same answer
// Glue them together (this + is string concatenation, not number addition)
var poop = high + low;
// Read out the memory as UTF-16
console.log(poop);
console.log(poop.codePointAt(0));

Well, it does this because the specification says it has to:
http://www.ecma-international.org/ecma-262/6.0/#sec-string.fromcodepoint
http://www.ecma-international.org/ecma-262/6.0/#sec-utf16encoding
Together these two say that if an argument is < 0 or > 0x10FFFF, a RangeError is thrown, but otherwise any codepoint <= 65535 is incorporated into the result string as-is.
As for why things are specified this way, I don't know. It seems like JavaScript doesn't really support Unicode, only UCS-2.
Unicode.org has the following to say on the matter:
http://www.unicode.org/faq/utf_bom.html#utf16-2
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D80016 to DBFF16, and trailing, or low, surrogates are from DC0016 to DFFF16. They are called surrogates, since they do not represent characters directly, but only as a pair.
http://www.unicode.org/faq/utf_bom.html#utf16-7
Q: Are there any 16-bit values that are invalid?
A: Unpaired surrogates are invalid in UTFs. These include any value in the range D80016 to DBFF16 not followed by a value in the range DC0016 to DFFF16, or any value in the range DC0016 to DFFF16 not preceded by a value in the range D80016 to DBFF16.
Therefore the result of String.fromCodePoint is not always valid UTF-16 because it can emit unpaired surrogates.

Understanding String heap size in Javascript / V8

Does anyone have a good understanding/explanation of how the heap size of strings are determined in Javascript with Chrome(V8)?
Some examples of what I see in a heap dump:
1) Multiple copies of an identical 2 character strings (ie. "dt") with different # object Ids all designated as OneByteStrings. The heapdump says each copy has a shallow & retained size of 32 bytes. It isn't clear how a two byte string has a retained size of 32 and why the strings don't appear to be interned.
2) Long object path string which is 78 characters long. All characters would be a single byte in utf8. It is classified as a InternalizedString. It has a 184 byte retained size. Even with a 2 byte character encoding that would still not account for the remaining 28 bytes. Why are these path strings taking up so much space? I could imagine another 4 bytes (maybe 8) being used for address and another 4 for storing the string length, but that still leaves 16 bytes even with a 2 byte character encoding.

Internally, V8 has a number of different representations for strings:
SeqOneByteString: The simplest, contains a few header fields and then the string's bytes (not UTF-8 encoded, can only contain characters in the first 256 unicode code points)
SeqTwoByteString: Same, but uses two bytes for each character (using surrogate pairs to represent unicode characters that can't be represented in two bytes).
SlicedString: A substring of some other string. Contains a pointer to the "parent" string and an offset and length.
ConsString: The result of adding two strings (if over a certain size). Contains pointers to both strings (which may themselves be any of these types of strings).
ExternalString: Used for strings that have been passed in from outside of V8.
"Internalized" is just a flag, the actual string representation could be any of the above.
All of these have a common parent class String, whose parent is Name, whose parent is HeapObject (which is the root of the V8 class hierarchy for objects allocated on the V8 heap).
HeapObject has one field: the pointer to its Map (there's a good explanation of these here).
Name adds one additional field: a hash value.
String adds another field: the length.
On a 32-bit system, each of these is 4 bytes. On a 64-bit system, each one is 8 bytes.
If you're on a 64-bit system then the minimum size of a SeqOneByteString will be 32 bytes: 24 bytes for the header fields described above plus at least one byte for the string data, rounded up to a multiple of 8.
Regarding your second question, it's difficult to say exactly what's going on. It could be that the string is using a 2-byte representation and its header fields are pushing up the size above what you are expecting, or it could be that it's a ConsString or a SlicedString (whose retained sizes would include the strings that it points to).
V8 doesn't internalize strings most of the time - it internalizes string constants and identifier names that it finds during parsing, and strings that are used as object property keys, and probably a few other cases.

JSON.parse parses / converts big numbers incorrectly

My problem is really simple but I'm not sure if there's a "native" solution using JSON.parse.
I receive this string from an API :
{ "key" : -922271061845347495 }
When I'm using JSON.parse on this string, it turns into this object:
{ "key" : -922271061845347500 }
As you can see, the parsing stops when the number is too long (you can check this behavior here). It has only 15 exact digits, the last one is rounded and those after are set to 0. Is there a "native" solution to keep the exact value ? (it's an ID so I can't round it)
I know I can use regex to solve this problem but I'd prefer to use a "native" method if it exists.

Your assumption that the parsing stops after certain digits is incorrect.
It says here:
In JavaScript all numbers are floating-point numbers. JavaScript uses
the standard 8 byte IEEE floating-point numeric format, which means
the range is from:
±1.7976931348623157 x 10308 - very large, and ±5 x 10-324 - very small.
As JavaScript uses floating-point numbers the accuracy is only assured
for integers between: -9007199254740992 (-253) and 9007199254740992
(253)
You number lies outside the "accurate" range hence it is converted to the nearest representation of the JavaScript number. Any attempt to evaluate this number (using JSON.parse, eval, parseInt) will cause data loss. I therefore recommend that you pass the key as a string. If you do not control the API, file a feature request.

The number is too big to be parsed correctly.
One solution is:
Preprocessing your string from API to convert it into string before parsing.
Preform normal parsing
Optionally, you could convert it back into number for your own purpose.
Here is the RegExp to convert all numbers in your string (proceeded with :) into strings:
// convert all number fields into strings to maintain precision
// : 922271061845347495, => : "922271061845347495",
stringFromApi = stringFromApi.replace(/:\s*(-?\d+),/g, ': "$1",');
Regex explanation:
\s* any number of spaces
-? one or zero '-' symbols (negative number support)
\d+ one or more digits
(...) will be put in the $1 variable

We Keep Coding

JavaScript is the programming language of the Web.