Understanding String heap size in Javascript / V8


Does anyone have a good understanding/explanation of how the heap size of strings is determined in Javascript with Chrome (V8)?
Some examples of what I see in a heap dump:
1) Multiple copies of an identical 2-character string (i.e. "dt") with different object IDs, all designated as OneByteStrings. The heap dump says each copy has a shallow and retained size of 32 bytes. It isn't clear how a two-character string has a retained size of 32 bytes, or why the strings don't appear to be interned.
2) A long object path string which is 78 characters long. All characters would be a single byte in UTF-8. It is classified as an InternalizedString. It has a 184 byte retained size. Even with a 2-byte character encoding (156 bytes), that would still not account for the remaining 28 bytes. Why are these path strings taking up so much space? I could imagine another 4 bytes (maybe 8) being used for an address and another 4 for storing the string length, but that still leaves 16 bytes even with a 2-byte character encoding.

Internally, V8 has a number of different representations for strings:
SeqOneByteString: The simplest, contains a few header fields and then the string's bytes (not UTF-8 encoded, can only contain characters in the first 256 Unicode code points)
SeqTwoByteString: Same, but uses two bytes for each character (using surrogate pairs to represent unicode characters that can't be represented in two bytes).
SlicedString: A substring of some other string. Contains a pointer to the "parent" string and an offset and length.
ConsString: The result of adding two strings (if over a certain size). Contains pointers to both strings (which may themselves be any of these types of strings).
ExternalString: Used for strings that have been passed in from outside of V8.
"Internalized" is just a flag, the actual string representation could be any of the above.
All of these have a common parent class String, whose parent is Name, whose parent is HeapObject (which is the root of the V8 class hierarchy for objects allocated on the V8 heap).
HeapObject has one field: the pointer to its Map (there's a good explanation of these here).
Name adds one additional field: a hash value.
String adds another field: the length.
On a 32-bit system, each of these is 4 bytes. On a 64-bit system, each one is 8 bytes.
If you're on a 64-bit system then the minimum size of a SeqOneByteString will be 32 bytes: 24 bytes for the header fields described above plus at least one byte for the string data, rounded up to a multiple of 8.
Regarding your second question, it's difficult to say exactly what's going on. It could be that the string is using a 2-byte representation and its header fields are pushing up the size above what you are expecting, or it could be that it's a ConsString or a SlicedString (whose retained sizes would include the strings that it points to).
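As a rough check of the arithmetic above, here is a minimal sketch that assumes the header layout described in this answer: three word-sized fields (map, hash, length), with the total rounded up to the word size. The helper name estimateSeqStringSize is made up for illustration, and real V8 builds may lay strings out differently depending on version and configuration.

```javascript
// Hypothetical helper: estimates the size of a sequential string under the
// header model described in this answer. This is NOT a V8 API.
function estimateSeqStringSize(length, bytesPerChar, wordSize = 8) {
  const header = 3 * wordSize;                  // map pointer + hash + length
  const raw = header + length * bytesPerChar;   // header plus character data
  return Math.ceil(raw / wordSize) * wordSize;  // round up to a multiple of wordSize
}

console.log(estimateSeqStringSize(2, 1));   // 32  ("dt" as a one-byte string)
console.log(estimateSeqStringSize(78, 1));  // 104 (78 chars, one byte each)
console.log(estimateSeqStringSize(78, 2));  // 184 (78 chars, two bytes each)
```

Under this model the two-character "dt" comes out at 32 bytes, and a 78-character string stored with two bytes per character comes out at 184 bytes, which happens to match both figures from the heap dump in the question.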
V8 doesn't internalize strings most of the time - it internalizes string constants and identifier names that it finds during parsing, and strings that are used as object property keys, and probably a few other cases.

Related

How is a JavaScript string a set of elements of integer values?

From MDN:
JavaScript's String type is used to represent textual data. It is a
set of "elements" of 16-bit unsigned integer values. Each element in
the String occupies a position in the String. The first element is at
index 0, the next at index 1, and so on. The length of a String is the
number of elements in it. You can create strings using string literals
or string objects.
What does it mean when you say the JavaScript String type is a set of "elements" of 16-bit unsigned integer values?
Please explain why it is a series of integer values.

The 16-bit unsigned integer values represent specific characters, and since a string is a sequence of such elements, you can grab specific characters within a string with [] notation, as you would with a list. Ex:
const string = 'john doe';
console.log(string[3]) // Will print 'n', the character at index 3 (indexes start at 0)

It just means that a string is an "array-like" object, with each character available in a similar manner to an array element. Each of those characters is stored as a UTF-16 code unit.
// The following is one string literal:
let s = "ABCDEFG";
console.log(s);
// But it's also an array-like object in that it has a length and can be indexed
console.log("The length of the string is: ", s.length);
console.log("The 3rd character is: ", s[2]);
// And we can see that the characters are stored as separate UTF-16 values:
console.log(s.charCodeAt(2));

As I understand it:
Unsigned means the value carries no sign, so it is never negative.
16-bit means each element can take one of 2^16 (65,536) different values.
A set of integers means that a string is represented by one or more such integers in sequence.
Therefore, to represent a string, JavaScript uses a sequence of numbers, each being one of 2^16 possible values (whole numbers only, with no fractional part and no sign).
Note: to understand more read about UTF-16
Reference: UTF-16 (IBM)
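To make the "sequence of 16-bit unsigned integers" idea concrete, here is a small sketch that copies a string's elements into a Uint16Array and rebuilds the string from them. The helper names toCodeUnits and fromCodeUnits are made up for this illustration.

```javascript
// Hypothetical helpers, for illustration only.
// Copies each 16-bit element of the string into a typed array.
function toCodeUnits(str) {
  const units = new Uint16Array(str.length);
  for (let i = 0; i < str.length; i++) {
    units[i] = str.charCodeAt(i); // always an integer in 0..65535
  }
  return units;
}

// Rebuilds a string from a sequence of 16-bit values.
function fromCodeUnits(units) {
  return String.fromCharCode(...units);
}

const units = toCodeUnits("john doe");
console.log(Array.from(units));    // [106, 111, 104, 110, 32, 100, 111, 101]
console.log(fromCodeUnits(units)); // john doe
```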

In Unicode, each symbol has an associated number. For example, "A" is 65, "a" is 97, etc. These numbers are called code points. Depending on the encoding we’re using (UTF-32, UTF-16, UTF-8, ASCII, etc.), we represent/encode these code points in different ways. The things we use to encode these code point numbers are called "code units", or as MDN calls them, "elements".
As we're using JavaScript, we're interested in the UTF-16 encoding of characters. This means that to represent a single code unit/"element", we use 16 bits (two bytes). For "A", the "element" representation is:
0000000001000001 // (16 bits, hence 0 padding)
There are a lot of characters that we need to represent (think emojis, Chinese, Japanese, Korean scripts, etc. that each have their own code points), so 16 bits to represent and encode all of these characters alone isn't enough. That's why sometimes some code points are encoded using two code units/elements. For example, 😂 has a code point of 128514 and in UTF-16 is encoded by two elements/code units:
1101100000111101 1101111000000010
So these two code units/elements 1101100000111101 (decimal 55357) and 1101111000000010 (decimal 56834) encode the code point/"character" 128514, which represents 😂. Notice that both code units are non-negative (unsigned) whole numbers (integers). UTF-16 defines the algorithm for converting these elements to their code point form and vice versa (see here for examples).
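The conversion between a code point and its two code units can be sketched as follows. The function names toSurrogatePair and fromSurrogatePair are made up for this example; in real code, String.fromCodePoint and codePointAt do the same work.

```javascript
// Hypothetical helper: splits a code point above U+FFFF into two code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("only code points U+10000..U+10FFFF need two code units");
  }
  const offset = codePoint - 0x10000;     // a 20-bit number
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: the reverse direction.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

console.log(toSurrogatePair(128514));         // [55357, 56834]
console.log(fromSurrogatePair(55357, 56834)); // 128514
```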
What are the implications of all this? Well it means that strings like "😂" will have a length of 2:
console.log("😂".length); // 2
And that when you access the indexes of the string, you will access the code units/"elements" of that string:
// "😂" in UTF-16 is "1101100000111101 1101111000000010"
// So "😂"[0] gives 1101100000111101 (in decimal 55357)
// So "😂"[1] gives 1101111000000010 (in decimal 56834)
console.log("😂"[0], "😂".charCodeAt(0)); // an unpaired surrogate, then 55357 (binary 1101100000111101)
console.log("😂"[1], "😂".charCodeAt(1)); // an unpaired surrogate, then 56834 (binary 1101111000000010)
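If you want to work with whole code points rather than individual code units, string iteration (for...of, spread) and codePointAt walk the string by code point. A minimal sketch, where countCodePoints is a made-up helper name:

```javascript
// Hypothetical helper: counts code points instead of 16-bit code units.
function countCodePoints(str) {
  return [...str].length; // the string iterator yields one item per code point
}

const sample = "a😂b";
console.log(sample.length);           // 4 (code units)
console.log(countCodePoints(sample)); // 3 (code points)
console.log(sample.codePointAt(1));   // 128514
```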

Why is that JavaScript's strings are using UTF-16 but one character's actual size can be just one byte?

According to this article:
Internally, JavaScript source code is treated as a sequence of UTF-16 code units.
And this IBM doc says that:
UTF-16 is based on 16-bit code units. Therefore, each character can be 16 bits (2 bytes) or 32 bits (4 bytes).
But I tested in Chrome's console that English letters are only taking 1 byte, not 2 or 4.
new Blob(['a']).size === 1
I wonder why that is the case? Am I missing something here?

Internally, JavaScript source code is treated as a sequence of UTF-16 code units.
Note that this is referring to source code, not String values. The article later states that String values are UTF-16 as well:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
The discrepancy here is actually in the Blob constructor. From MDN:
Note that strings here are encoded as UTF-8, unlike the usual JavaScript UTF-16 strings.

UTF-8 is a variable-width encoding.
a takes 1 byte, but ą, for example, takes 2:
console.log('a', new Blob(['a']).size)
console.log('ą', new Blob(['ą']).size)
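The Blob sizes above reflect UTF-8 encoding, not the UTF-16 code units that string operations expose. The following sketch compares the two, assuming an environment where TextEncoder is available as a global (modern browsers and recent Node.js). The helper names are made up, and utf16ByteLength is only the conceptual size of the code units, not what an engine actually allocates.

```javascript
// Hypothetical helpers for comparing encoded sizes.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length; // TextEncoder always produces UTF-8
}

function utf16ByteLength(str) {
  return str.length * 2; // two bytes per 16-bit code unit
}

for (const ch of ["a", "ą", "😂"]) {
  console.log(ch, utf8ByteLength(ch), utf16ByteLength(ch));
}
// a 1 2
// ą 2 2
// 😂 4 4
```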

Keep trailing or leading zeroes on number

Is it possible to keep trailing or leading zeroes on a number in javascript, without using e.g. a string instead?
const leading = 003; // literal, leading
const trailing = 0.10; // literal, trailing
const parsed = parseFloat('0.100'); // parsed or somehow converted
console.log(leading, trailing, parsed); // desired: 003 0.10 0.100
This question has been regularly asked (and still is), yet I don't have a place I'd feel comfortable linking to (did I miss it?).
Fully analogous is keeping any other aspect of the representation a number literal was entered in, although this is asked nowhere near as often:
console.log(0x10); // 16 instead of potentially desired 0x10
console.log(1e1); // 10 instead of potentially desired 1e1
For disambiguation, this is not about the following topics, for some of which I'll add links, as they might be of interest as well:
Padding to a set amount of digits, formatting to some specific string representation, e.g. How can i pad a value with leading zeroes?, How to output numbers with leading zeros in JavaScript?, How to add a trailing zero to a price
Why a certain string representation will be produced for some number by default, e.g. How does JavaScript determine the number of digits to produce when formatting floating-point values?
Floating point precision/accuracy problems, e.g. console.log(0.1 + 0.2) producing 0.30000000000000004, see Is floating point math broken?, and How to deal with floating point number precision in JavaScript?

No. A number stores no information about the representation it was entered as, or parsed from. It only relates to its mathematical value. Perhaps reconsider using a string after all.
If I had to guess, much of the confusion comes from the idea that numbers and their textual representations are the same thing, or at least tightly coupled, with some kind of bidirectional binding between them. This is not the case.
The representations like 0.1 and 0.10, which you enter in code, are only used to generate a number. They are convenient names for what you intend to produce, not the resulting value. In this case, they are names for the same number. It has a lot of other aliases, like 0.100, 1e-1, or 10e-2. The actual value contains no information about how it was written or where it came from. The conversion is a one-way street.
When displaying a number as text, by default (Number.prototype.toString), javascript uses an algorithm to construct one of the possible representations from the number. This can only use what's available, the number value, which also means it will produce the same result for two equal numbers. This implies that 0.1 and 0.10 will produce the same result.
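A quick way to see this for yourself is to compare the aliases mentioned above and print them. This is only an illustration of the point, not a workaround:

```javascript
// All of these literals produce the very same number value.
const aliases = [0.1, 0.10, 0.100, 1e-1, 10e-2];

console.log(aliases.every((n) => n === 0.1)); // true
console.log(aliases.map(String));             // five times "0.1"

// Trailing zeroes only exist in text that you format explicitly:
console.log((0.1).toFixed(3));                // 0.100 (a string)
```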
Concerning the number[1] value, javascript uses IEEE754-2019 float64[2]. When source code is being evaluated[3], and a number literal is encountered, the engine will convert the mathematical value the literal represents to a 64bit value, according to IEEE754-2019. This means any information about the original representation in code is lost[4].
There is another problem, which is somewhat unrelated to the main topic. Javascript has a legacy octal notation with a prefix of "0". This means that 003 is parsed as an octal literal, and is a SyntaxError in strict mode. Similarly, 010 === 8 (or an error in strict mode), see Why JavaScript treats a number as octal if it has a leading zero
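The octal behaviour can be observed without putting a legacy literal directly into your own (possibly strict) code. The sketch below relies on the fact that function bodies created with the Function constructor are non-strict unless they opt in themselves:

```javascript
// Sloppy-mode body: legacy octal literals are accepted.
const sloppy = new Function("return [003, 010];");
console.log(sloppy()); // [3, 8]

// Strict-mode body: the same literal is rejected when the code is parsed.
let strictError = null;
try {
  new Function('"use strict"; return 010;');
} catch (e) {
  strictError = e;
}
console.log(strictError instanceof SyntaxError); // true

// The explicit octal prefix and string parsing are unambiguous:
console.log(0o10);                // 8
console.log(Number("010"));       // 10
console.log(parseInt("010", 10)); // 10
```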
In conclusion, when trying to keep information about some representation for a number (including leading or trailing zeroes, whether it was written as decimal, hexadecimal, and so on), a number is not a good choice. For how to achieve some specific representation other than the default, which doesn't need access to the originally entered text (e.g. pad to some amount of digits), there are many other questions/articles, some of which were already linked.
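If the original text really matters, one option is to carry it along next to the number. The sketch below is only one possible shape for that; the function name parseKeepingText and the object layout are made up for illustration.

```javascript
// Hypothetical helper: keeps the text the value was parsed from.
function parseKeepingText(text) {
  return { text, value: Number(text) };
}

const entry = parseKeepingText("0.100");
console.log(entry.value * 2); // 0.2   (arithmetic uses the number)
console.log(entry.text);      // 0.100 (display uses the original string)
```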
[1]: Javascript also has BigInt, but while it uses a different format, the reasoning is completely analogous.
[2]: This is a simplification. Engines are allowed to use other formats internally (and do, e.g. to save space/time), as long as they are guaranteed to behave like an IEEE754-2019 float64 in any regard, when observed from javascript.
[3]: E.g. V8 would convert to bytecode earlier than evaluation, already exchanging the literal. The only relevant thing is, that the information is lost, before we could do anything with it.
[4]: Javascript gives the ability to operate on code itself (e.g. Function.prototype.toString), which I will not discuss here much. Parsing the code yourself, and storing the representation, is an option, but has nothing to do with how number works (you would be operating on code, a string). Also, I don't immediately see any sane reason to do so, over alternatives.

How JavaScript decides what size of memory to allocate for a numeric value?

Programming languages like Java and C have types such as int, long and byte that tell the compiler or runtime exactly how much memory to allocate for a number. This saves a lot of memory if you are dealing with a large number of variables.
I'm wondering how programming languages that don't have these primitive type declarations (JavaScript, Ruby) decide how much memory to allocate for, say, var a = 1. If it allocates, say, 1 byte, and on the next line I do a = 99999999999, it will have to throw that variable away and reallocate. Won't that be an expensive operation?
Or do they allocate a very large memory space for every variable, so that one size fits all?

Here is a good explanation.
JavaScript values
The type JS::Value represents a JavaScript value.
The representation is 64 bits and uses NaN-boxing on all platforms,
although the exact NaN-boxing format depends on the platform.
NaN-boxing is a technique based on the fact that in IEEE-754 there are
2**53-2 different bit patterns that all represent NaN. Hence, we can
encode any floating-point value as a C++ double (noting that
JavaScript NaN must be represented as one canonical NaN format). Other
values are encoded as a value and a type tag:
On x86, ARM, and similar 32-bit platforms, we use what we call
"nunboxing", in which non-double values are a 32-bit type tag and a
32-bit payload, which is normally either a pointer or a signed 32-bit
integer. There are a few special values: NullValue(),
UndefinedValue(), TrueValue() and FalseValue(). On x64 and similar
64-bit platforms, pointers are longer than 32 bits, so we can't use
the nunboxing format. Instead, we use "punboxing", which has 17 bits
of tag and 47 bits of payload. Only JIT code really depends on the
layout--everything else in the engine interacts with values through
functions like val.isDouble(). Most parts of the JIT also avoid
depending directly on the layout: the files PunboxAssembler.h and
NunboxAssembler.h are used to generate native code that depends on the
value layout.
Objects consist of a possibly shared structural description, called
the map or scope; and unshared property values in a vector, called the
slots. Each property has an id, either a nonnegative integer or an
atom (unique string), with the same tagged-pointer encoding as a
jsval.
The atom manager consists of a hash table associating strings uniquely
with scanner/parser information such as keyword type, index in script
or function literal pool, etc. Atoms play three roles: as literals
referred to by unaligned 16-bit immediate bytecode operands, as unique
string descriptors for efficient property name hashing, and as members
of the root GC set for exact GC.
According to W3Schools:
This format stores numbers in 64 bits, where the number (the fraction)
is stored in bits 0 to 51, the exponent in bits 52 to 62, and the sign
in bit 63:
Value (aka Fraction/Mantissa): 52 bits (0 - 51)
Exponent: 11 bits (52 - 62)
Sign: 1 bit (63)
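The bit layout listed above can be inspected from JavaScript itself. The following is a minimal sketch using DataView and BigInt (both assumed to be available, as in current browsers and Node.js); the function name decomposeDouble is made up for this example.

```javascript
// Hypothetical helper: splits a number into the three IEEE 754 fields.
function decomposeDouble(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);                       // write the 64-bit pattern
  const bits = view.getBigUint64(0);           // read it back as one integer
  return {
    sign: Number(bits >> 63n),                 // bit 63
    exponent: Number((bits >> 52n) & 0x7ffn),  // bits 52..62 (biased by 1023)
    fraction: bits & 0xfffffffffffffn,         // bits 0..51
  };
}

console.log(decomposeDouble(1));  // sign 0, exponent 1023, fraction 0n
console.log(decomposeDouble(-2)); // sign 1, exponent 1024, fraction 0n
console.log(decomposeDouble(0.1).fraction.toString(16)); // 999999999999a
```

Whatever the literal looked like in source code, the value always occupies these same 64 bits as far as a script can observe.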
Also read this article here.

Why do code points between U+D800 and U+DBFF generate a one-length string in ECMAScript 6?

I'm getting too confused. Why do code points from U+D800 to U+DBFF encode as a single (2 bytes) String element, when using the ECMAScript 6 native Unicode helpers?
I'm not asking how JavaScript/ECMAScript encodes Strings natively, I'm asking about an extra functionality to encode UTF-16 that makes use of UCS-2.
var str1 = '\u{D800}';
var str2 = String.fromCodePoint(0xD800);
console.log(
str1.length, str1.charCodeAt(0), str1.charCodeAt(1)
);
console.log(
str2.length, str2.charCodeAt(0), str2.charCodeAt(1)
);
Re-TL;DR: I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string, since my browser's ES6 implementation incorporates UCS-2 encoding in strings, which uses 2 bytes for each character code?
Both of these approaches return a one-element String for the U+D800 code point (char code: 55296, same as 0xD800). But for code points bigger than U+FFFF each one returns a two-element String, the lead and trail. lead would be a number between U+D800 and U+DBFF, and trail I'm not sure about, I only know it helps changing the result code point. For me the return value doesn't make sense, it represents a lead without trail. Am I understanding something wrong?

I think your confusion is about how Unicode encodings work in general, so let me try to explain.
Unicode itself just specifies a list of characters, called "code points", in a particular order. It doesn't tell you how to convert those to bits, it just gives them all a number between 0 and 1114111 (in hexadecimal, 0x10FFFF). There are several different ways these numbers from U+0 to U+10FFFF can be represented as bits.
In an earlier version, it was expected that a range of 0 to 65535 (0xFFFF) would be enough. This can be naturally represented in 16 bits, using the same convention as an unsigned integer. This was the original way of storing Unicode, and is now known as UCS-2. To store a single code point, you reserve 16 bits of memory.
Later, it was decided that this range was not large enough; this meant that there were code points higher than 65535, which you can't represent in a 16-bit piece of memory. UTF-16 was invented as a clever way of storing these higher code points. It works by saying "if you look at a 16-bit piece of memory, and it's a number between 0xD800 and 0xDBFF (a "high surrogate"), then you need to look at the next 16 bits of memory as well". Any piece of code which is performing this extra check is processing its data as UTF-16, and not UCS-2.
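The "extra check" can be sketched as a small decoder that walks an array of 16-bit values. This is an illustration with a made-up function name, decodeUtf16, not how any engine implements it. A UCS-2 reader would simply return the input values unchanged.

```javascript
// Hypothetical helper: interprets 16-bit values as UTF-16 and returns code points.
function decodeUtf16(units) {
  const codePoints = [];
  for (let i = 0; i < units.length; i++) {
    const unit = units[i];
    const next = units[i + 1];
    const isLead = unit >= 0xd800 && unit <= 0xdbff;       // first half of a pair
    const nextIsTrail = next >= 0xdc00 && next <= 0xdfff;  // second half of a pair
    if (isLead && nextIsTrail) {
      codePoints.push(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00));
      i++; // the pair consumed two units
    } else {
      codePoints.push(unit); // ordinary unit, or an unpaired half passed through
    }
  }
  return codePoints;
}

console.log(decodeUtf16([0x0041, 0xd83d, 0xdca9])); // [65, 128169]
console.log(decodeUtf16([0xd800]));                 // [55296]
```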
It's important to understand that the memory itself doesn't "know" which encoding it's in, the difference between UCS-2 and UTF-16 is how you interpret that memory. When you write a piece of software, you have to choose which interpretation you're going to use.
Now, onto Javascript...
Javascript handles input and output of strings by interpreting its internal representation as UTF-16. That's great, it means that you can type in and display the famous 💩 character, which can't be stored in one 16-bit piece of memory.
The problem is that most of the built-in string functions actually handle the data as UCS-2 - that is, they look at 16 bits at a time, and don't care if what they see is a special "surrogate". The function you used, charCodeAt(), is an example of this: it reads 16 bits out of memory, and gives them to you as a number between 0 and 65535. If you feed it 💩, it will just give you back the first 16 bits; ask it for the next "character" after, and it will give you the second 16 bits (which will be a "low surrogate", between 0xDC00 and 0xDFFF).
In ECMAScript 6 (2015), a new function was added: codePointAt(). Instead of just looking at 16 bits and giving them to you, this function checks if they represent one of the UTF-16 surrogate code units, and if so, looks for the "other half" - so it gives you a number between 0 and 1114111. If you feed it 💩, it will correctly give you 128169.
var poop = '💩';
console.log('Treat it as UCS-2, two 16-bit numbers: ' + poop.charCodeAt(0) + ' and ' + poop.charCodeAt(1));
console.log('Treat it as UTF-16, one value cleverly encoded in 32 bits: ' + poop.codePointAt(0));
// The surrogates are 55357 and 56489, which encode 128169 as follows:
// 0x010000 + ((55357 - 0xD800) << 10) + (56489 - 0xDC00) = 128169
Your edited question now asks this:
I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string?
The hexadecimal value D800 is 55296 in decimal, which is less than 65536, so given everything I've said above, this fits fine in 16 bits of memory. So if we ask charCodeAt to read 16 bits of memory, and it finds that number there, it's not going to have a problem.
Similarly, the .length property measures how many sets of 16 bits there are in the string. Since this string is stored in 16 bits of memory, there is no reason to expect any length other than 1.
The only unusual thing about this number is that in Unicode, that value is reserved - there isn't, and never will be, a character U+D800. That's because it's one of the magic numbers that tells a UTF-16 algorithm "this is only half a character". So a possible behaviour would be for any attempt to create this string to simply be an error - like opening a pair of brackets that you never close, it's unbalanced, incomplete.
The only way you could end up with a string of length 2 is if the engine somehow guessed what the second half should be; but how would it know? There are 1024 possibilities, from 0xDC00 to 0xDFFF, which could be plugged into the formula I show above. So it doesn't guess, and since it doesn't error, the string you get is 16 bits long.
Of course, you can supply the matching halves, and codePointAt will interpret them for you.
// Set up two 16-bit pieces of memory
var high=String.fromCharCode(55357), low=String.fromCharCode(56489);
// Note: String.fromCodePoint will give the same answer
// Glue them together (this + is string concatenation, not number addition)
var poop = high + low;
// Read out the memory as UTF-16
console.log(poop);
console.log(poop.codePointAt(0));

Well, it does this because the specification says it has to:
http://www.ecma-international.org/ecma-262/6.0/#sec-string.fromcodepoint
http://www.ecma-international.org/ecma-262/6.0/#sec-utf16encoding
Together these two say that if an argument is < 0 or > 0x10FFFF, a RangeError is thrown, but otherwise any codepoint <= 65535 is incorporated into the result string as-is.
As for why things are specified this way, I don't know. It seems like JavaScript doesn't really support Unicode, only UCS-2.
Unicode.org has the following to say on the matter:
http://www.unicode.org/faq/utf_bom.html#utf16-2
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
http://www.unicode.org/faq/utf_bom.html#utf16-7
Q: Are there any 16-bit values that are invalid?
A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.
Therefore the result of String.fromCodePoint is not always valid UTF-16 because it can emit unpaired surrogates.
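If you need to detect such strings, a check for unpaired surrogates can be sketched with a regular expression that works on code units (no u flag). The function name hasLoneSurrogate is made up for this example, and the sketch assumes regex lookbehind support, which current engines provide.

```javascript
// Hypothetical helper: true if the string contains an unpaired surrogate.
// Without the "u" flag the pattern is matched against 16-bit code units.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

function hasLoneSurrogate(str) {
  return LONE_SURROGATE.test(str);
}

console.log(hasLoneSurrogate(String.fromCodePoint(0xd800))); // true
console.log(hasLoneSurrogate("💩"));                          // false
console.log(hasLoneSurrogate("abc"));                         // false

// Values outside 0..0x10FFFF are rejected outright:
try {
  String.fromCodePoint(0x110000);
} catch (e) {
  console.log(e instanceof RangeError); // true
}
```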
