I have a JavaScript string which is about 500K when being sent from the server in UTF-8. How can I tell its size in JavaScript?
I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? Or does it depend on the JavaScript implementation, the page encoding, or maybe the content type?
You can use the Blob to get the string size in bytes.
Examples:
console.info(
new Blob(['😂']).size, // 4
new Blob(['👍']).size, // 4
new Blob(['😂👍']).size, // 8
new Blob(['👍😂']).size, // 8
new Blob(['I\'m a string']).size, // 12
// from Premasagar correction of Lauri's answer for
// strings containing lone characters in the surrogate pair range:
// https://stackoverflow.com/a/39488643/6225838
new Blob([String.fromCharCode(55555)]).size, // 3
new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);
This function will return the byte size of any UTF-8 string you pass to it.
function byteCount(s) {
return encodeURI(s).split(/%..|./).length - 1;
}
Source
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it's just an implementation detail that won't affect the language's characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source
If you're using Node.js, there is a simpler solution using buffers:
function getBinarySize(string) {
return Buffer.byteLength(string, 'utf8');
}
There is an npm library for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours truly)
String values are not implementation dependent. According to the ECMA-262 3rd Edition Specification, each character represents a single 16-bit unit of UTF-16 text:
4.3.16 String Value
A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.
NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.
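The NOTE in the quote can be seen directly in a console. The following is a minimal sketch (the variable names are only illustrative): String.fromCharCode truncates its argument to 16 bits, and a lone surrogate is accepted as a string element even though it is not valid UTF-16 text.

```javascript
// Each string element is a 16-bit unsigned integer (0..65535).
// String.fromCharCode keeps only the low 16 bits of its argument.
const wrapped = String.fromCharCode(0x10041); // 0x10041 & 0xFFFF === 0x41
console.log(wrapped === "A"); // true

// A lone surrogate is a legal element even though it is not valid UTF-16 text.
const lone = String.fromCharCode(0xD800);
console.log(lone.length);                     // 1
console.log(lone.charCodeAt(0).toString(16)); // "d800"
```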
These are 3 ways I use:
TextEncoder
new TextEncoder().encode("myString").length
Blob
new Blob(["myString"]).size
Buffer
Buffer.byteLength("myString", 'utf8')
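If the same code has to run in both browsers and Node.js, the three calls above could be wrapped in one function. This is only a sketch: utf8ByteLength is a hypothetical name, and the order in which the APIs are tried is my own assumption, not something the answer prescribes.

```javascript
// Hypothetical helper: returns the UTF-8 byte length of a string using
// whichever of the three APIs exists in the current environment.
function utf8ByteLength(str) {
  if (typeof TextEncoder !== "undefined") {
    return new TextEncoder().encode(str).length;
  }
  if (typeof Buffer !== "undefined") {
    return Buffer.byteLength(str, "utf8");
  }
  if (typeof Blob !== "undefined") {
    return new Blob([str]).size;
  }
  throw new Error("No UTF-8 encoding API available");
}

console.log(utf8ByteLength("myString")); // 8
console.log(utf8ByteLength("\u2620"));   // 3
```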
Try this combination, which uses the unescape function:
const byteAmount = unescape(encodeURIComponent(yourString)).length
Full encode process example:
const s = "1 a ф № # ®"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1,a-1,ф-2,№-3,#-1,®-2]
const s4 = escape(s3); // length is 41
const s5 = decodeURIComponent(s4); // length is 11
Note that if you're targeting node.js you can use Buffer.from(string).length:
var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)
The size of a JavaScript string is
Pre-ES6: 2 bytes per character
ES6 and later: 2 bytes per character,
or 5 or more bytes per character
Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 can use 4 bytes (a surrogate pair) for a single character, it would violate the 2-byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two-byte characters used are valid UTF-16 characters. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.
ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using a unicode escape looks like this: \u{1D306}
Practical notes
This doesn't relate to the internal implementation of a particular engine. For
example, some engines use data structures and libraries with full
UTF-16 support, but what they provide externally doesn't have to be
full UTF-16 support. Also an engine may provide external UTF-16
support as well but is not mandated to do so.
For ES6, practically speaking characters will never be more than 5
bytes long (2 bytes for the escape point + 3 bytes for the Unicode
code point) because the latest version of Unicode only has 136,755
possible characters, which fits easily into 3 bytes. However this is
technically not limited by the standard so in principle a single
character could use say, 4 bytes for the code point and 6 bytes
total.
Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
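One way to check whether code point escapes change a byte count is to compare a string written with an escape against the same character built from its code point. This is a minimal sketch, assuming an ES6 engine where TextEncoder is available; the variable names are illustrative. The escape is source syntax that is resolved when the code is parsed, so the two values compare equal and are measured identically.

```javascript
// The same character, written as an escape and built from its code point.
const escaped = "\u{1D306}";
const built = String.fromCodePoint(0x1D306);

console.log(escaped === built);                        // true
console.log(escaped.length);                           // 2 (UTF-16 code units)
console.log(new TextEncoder().encode(escaped).length); // 4 (UTF-8 bytes)
```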
UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:
var getStringMemorySize = function( _string ) {
"use strict";
var codePoint
, accum = 0
;
for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
codePoint = _string.charCodeAt( stringIndex );
if( codePoint < 0x100 ) {
accum += 1;
continue;
}
if( codePoint < 0x10000 ) {
accum += 2;
continue;
}
if( codePoint < 0x1000000 ) {
accum += 3;
} else {
accum += 4;
}
}
return accum * 2;
}
Examples:
getStringMemorySize( 'I' ); // 2
getStringMemorySize( '❤' ); // 4
getStringMemorySize( '𠀰' ); // 8
getStringMemorySize( 'I❤𠀰' ); // 14
The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.
byteCount(String.fromCharCode(55555))
// URIError: URI malformed
This longer function should handle all strings:
function bytes (str) {
var bytes=0, len=str.length, codePoint, next, i;
for (i=0; i < len; i++) {
codePoint = str.charCodeAt(i);
// Lone surrogates cannot be passed to encodeURI
if (codePoint >= 0xD800 && codePoint < 0xE000) {
if (codePoint < 0xDC00 && i + 1 < len) {
next = str.charCodeAt(i + 1);
if (next >= 0xDC00 && next < 0xE000) {
bytes += 4;
i++;
continue;
}
}
}
bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
}
return bytes;
}
E.g.
bytes(String.fromCharCode(55555))
// 3
It will correctly calculate the size for strings containing surrogate pairs:
bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)
The results can be compared with Node's built-in function Buffer.byteLength:
Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3
Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
A single element in a JavaScript String is considered to be a single UTF-16 code unit. That is to say, each String element is stored in 16 bits (1 code unit), and 16 bits equal 2 bytes (8 bits = 1 byte).
The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The codePointAt() method returns the entire code point value, including for characters outside the 16-bit range that occupy two code units.
When a character can't be represented in a single 16-bit code unit, it is encoded as a surrogate pair and therefore uses two code units (2 x 16 bits = 4 bytes).
See Unicode encodings for different encodings and their code ranges.
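The difference between the two methods shows up with a character outside the 16-bit range. A small sketch, using U+1F600 purely as an example code point:

```javascript
const s = String.fromCodePoint(0x1F600); // a character outside the BMP

console.log(s.length);                      // 2 code units, so 4 bytes as UTF-16
console.log(s.charCodeAt(0).toString(16));  // "d83d" (high surrogate)
console.log(s.charCodeAt(1).toString(16));  // "de00" (low surrogate)
console.log(s.codePointAt(0).toString(16)); // "1f600" (full code point)
```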
The Blob interface's size property returns the size of the Blob or File in bytes.
const getStringSize = (s) => new Blob([s]).size;
I'm working with an embedded version of the V8 engine.
I tested a single string, appending 1000 characters at each step, encoded as UTF-8.
The first test used the single-byte (8-bit, ANSI) character "A" (hex: 41), the second the two-byte (16-bit) character "Ω" (hex: CE A9), and the third the three-byte (24-bit) character "☺" (hex: E2 98 BA).
In all three cases the device reported out of memory at 888 000 characters, using about 26 348 KB of RAM.
Result: the characters are not stored with a dynamic width, and not with only 16 bits each. Perhaps this holds only for my case (embedded device with 128 MB RAM, V8 engine, C++/Qt), but the character encoding has nothing to do with the in-RAM size inside the JavaScript engine. Functions such as encodeURI are only useful for high-level data transmission and storage.
Embedded or not, the fact is that the characters are not stored in just 16 bits.
Unfortunately I have no definitive answer as to what JavaScript does at the low level.
By the way, I ran the same test (the first one above) with an array of the character "A", pushing 1000 items at each step (exactly the same test, with the string replaced by an array). The system ran out of memory (as intended) after using 10 416 KB, at an array length of 1 337 000.
So the JavaScript engine is not simply restricted; it is somewhat more complex.
You can try this:
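function getByteLength(str) {
  // Count every character above U+00FF as 2 bytes and everything else as 1.
  var b = str.match(/[^\x00-\xff]/g);
  return str.length + (!b ? 0 : b.length);
}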
It worked for me.
I have a string "はい" and I'm trying to understand how it's represented as bytes.
Number.prototype.toBits = function () {
let str = this.toString(2);
return str.padStart(8, "0");
}
let ja = "はい";
console.log(ja);
let buf = Buffer.from(ja);
for (const c of buf) {
console.log(c + "=" + c.toBits());
}
produces:
はい
227=11100011
129=10000001
175=10101111
227=11100011
129=10000001
132=10000100
In the Unicode table, the character "は" is 306F and the character "い" is 3044.
I understand that the leading "1" bit says this is Unicode and that the number of 1s until the next 0 is the number of bytes in Unicode. I don't understand how 306F becomes 11100011 10000001 10101111
The fact that the most-significant bit (MSB) is 1 indicates that it's a UTF-8 multibyte sequence. If the first two bits are 11 then it's the start of a sequence; if 10 it's the continuation of a sequence. Bits of the actual code point are stored in the "unused" portion of both start bytes and continuation bytes, using as many bytes as are necessary to store the value (the count is indicated by the start byte).
Notice how it is possible to "drop in anywhere" in the byte-sequence and align yourself to the start of a character: if MSB=0 then it's a single-byte character (ASCII-compatible). If MSBs=10 it's a continuation byte and you should walk-backwards to find the start byte. The start-byte should always be followed by exactly the number of continuation-bytes that it promises. UTF encodings use exactly the number of bytes needed to represent any given Unicode code-point.
According to UTF-8, code points between U+0800 and U+FFFF (which U+306F meets) will be encoded as 3 bytes, spreading their bits across the pattern
1110.... 10...... 10......
The binary representation of 0x306F is 0b11000001101111, which fits in the gaps:
| ....0011 ..000001 ..101111
Together, they form what you are observing:
= 11100011 10000001 10101111
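The bit shuffling above can be sketched in code. encodeThreeByte is a hypothetical helper written for this illustration; it only covers the 3-byte range and does not reject surrogate values.

```javascript
// Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
function encodeThreeByte(codePoint) {
  if (codePoint < 0x0800 || codePoint > 0xFFFF) {
    throw new RangeError("expected a code point in U+0800..U+FFFF");
  }
  return [
    0xE0 | (codePoint >> 12),         // 1110xxxx: top 4 bits
    0x80 | ((codePoint >> 6) & 0x3F), // 10xxxxxx: middle 6 bits
    0x80 | (codePoint & 0x3F),        // 10xxxxxx: low 6 bits
  ];
}

const encoded = encodeThreeByte(0x306F);
console.log(encoded.map((b) => b.toString(2).padStart(8, "0")).join(" "));
// 11100011 10000001 10101111
```

The three values are 227, 129 and 175, the same bytes that Buffer.from printed in the question.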
Overview:
I'm building a Javascript tool inside a web page. Except for loading that page, the tool will run without server communication. A user will select a local file containing multiple binary records, each with a x'F0 start byte and x'F0 end byte. The data in between is constrained to x'00 - x'7F and consists of:
bit maps
1-byte numbers
2-byte numbers, low order byte first
a smattering of ASCII characters
The records vary in lengths and use different formats.
[It's a set of MIDI Sysex messages, probably not relevant].
The local file is read via reader.readAsArrayBuffer and then processed thus:
var contents = event.target.result;
var bytes = new Uint8Array(contents);
var rawAccum = '';
for (x = 0; x < bytes.length; x++) {
rawAccum += bytes[x];
}
var records = rawAccum.split(/\xF0/g);
I expect this to split the string into an array of its constituent records, deleting the x'F0 start byte in the process.
It actually does very little. records.length is 1 and records[0] contains the entire input stream.
[The actual split code is: var records = rawAccum.split(/\xF0\x00\x00\x26\x02/g); which should remove several identical bytes from the start of each record. When this failed I tried the abbreviated version above, with identical (non)results.]
I've looked at the doc on split( and at several explanations of \xXX among regex references. Clearly something does not work as I have deduced. My experience with JavaScript is minimal and sporadic.
How can I split a string of binary data at the occurrence of a specific binary byte?
The splitting appears to work correctly:
var rawAccum = "\xf0a\xf0b\xf0c\xf0"
console.log( rawAccum.length); // 7
var records = rawAccum.split(/\xF0/g);
console.log(records); // "", "a", "b", "c", ""
but the conversion of the array buffer to a string looks suspicious. Try converting the unsigned byte value to a string before appending it to rawAccum:
for (x = 0; x < bytes.length; x++) {
rawAccum += String.fromCharCode( bytes[x]);
}
Data conversions (update after comment)
The FileReader reads the file into an array buffer in memory, but JavaScript does not provide access to array buffers directly. You can either create and initialize a typed array from the buffer (e.g. using the Uint8Array constructor as in the post), or access bytes in the buffer using a DataView object. Methods of DataView objects can convert sequences of bytes at specified positions to integers of varying types, such as the 16-bit integers in the MIDI sysex records.
JavaScript strings use sequences of 16 bit values to hold characters, where each character uses one or two 16 bit values encoded using UTF-16 character encoding. 8 bit characters use only the lower 8 bits of a single 16 bit value to store their Unicode code point.
It is possible to convert an array buffer of octet values into a "binary string", by storing each byte value from the buffer in the low order bits of a 16 bit character and appending it to an existing string. This is what the post attempts to do. But in JavaScript strings (and individual characters which have a string length of 1) are not a subset of integer numbers and have their own data type, "string".
So to convert an unsigned 8 bit number to a JavaScript 16 bit character of type "string", use the fromCharCode static method of the global String object, as in
rawAccum += String.fromCharCode( bytes[x]);
Calling String.fromCharCode is also how to convert an ASCII character code located within MIDI data to a character in JavaScript.
To convert a binary string character derived from an 8 bit value back into a number, use the String instance method charCodeAt on a string value and provide the character position:
var byteValue = "\xf0".charCodeAt(0);
returns the number 0xf0 or 240 decimal.
If you append a number to a string, as in the question, the number is implicitly converted to a decimal string representation of its value first:
"" + 0xf0 + 66 // becomes the string "24066"
Note that an array buffer can be inspected using a Uint8Array created from it, sliced into pieces using the buffer's slice method, and have integers of various types extracted from it using data views. Please consider whether creating a binary string remains the best way to extract and interpret MIDI record contents.
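As a rough illustration of working on the bytes directly, the records could be split without building a string at all. This is only a sketch: splitOnByte is a hypothetical helper, the sample bytes are made up, and unlike String.prototype.split it drops empty records.

```javascript
// Hypothetical helper: split a Uint8Array into records at every delimiter byte.
// subarray() returns views onto the same buffer, so no bytes are copied.
function splitOnByte(bytes, delimiter) {
  const records = [];
  let start = 0;
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] === delimiter) {
      if (i > start) records.push(bytes.subarray(start, i));
      start = i + 1;
    }
  }
  if (start < bytes.length) records.push(bytes.subarray(start));
  return records;
}

// Made-up sample data: two records delimited by 0xF0.
const sample = new Uint8Array([0xF0, 0x01, 0x34, 0x12, 0xF0, 0xF0, 0x41, 0x42, 0xF0]);
const records = splitOnByte(sample, 0xF0);
console.log(records.length); // 2

// Read a 2-byte number, low-order byte first, at offset 1 of the first record.
const first = records[0];
const view = new DataView(first.buffer, first.byteOffset, first.byteLength);
console.log(view.getUint16(1, true).toString(16)); // "1234"
```

Passing true as the second argument of getUint16 selects little-endian order, which matches the "low order byte first" layout described in the question.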
I'm trying to make my database as efficient as possible. It's a really simple database in which the collections only have documents that store strings, numbers, or booleans: no arrays, mixed data types, etc. Suppose I want to store 24 as the value of one of the fields (which will only contain natural numbers). Which would take less space:
{
field: 24
}
or
{
field: "24"
}
I'm using mongoose, and what I'm basically asking is whether I should set Number or String as the type in my Schema for that particular field.
Store numbers as Numbers.
MongoDB uses BSON (spec). Number in this context usually means a 64-bit IEEE-754 floating-point number (what BSON calls a double), so that's going to take...64 bits. :-) Add to that the overhead (according to the spec) of saying it's a number (one byte) and the field name (the length of the name plus one byte), but as those will be the same for Number and String we can disregard them. So 64 bits for Number.
The spec says String is stored as a 32-bit length followed by the string in UTF-8 followed by a terminator byte (plus the overhead in common with Number). That's 32 + (8 x number_of_bytes_in_utf_8) + 8 bits for a string.
Each of the characters used to represent numbers in strings (-, +, 0-9, e/E [for scientific notation], and .) are represented with a single byte in UTF-8, so for our purposes in this question, # of chars = # of bytes.
So:
For "24" it's 32 + (8 x 2) + 8 giving us 56 bits.
For "254" it's 32 + (8 x 3) + 8 giving us 64 bits.
For "2254" it's 32 + (8 x 4) + 8 giving us 72 bits.
For "1.334" it's 32 + (8 x 5) + 8 giving us 80 bits.
See where I'm going with this? :-)
Add to that the fact that if it's a number, then storing it as a string:
...imposes a runtime penalty (converting to and from string)
...means you can't do range comparisons like Ali Dehghani's {$gt: {age: "25"}} example
...and I'd say Number is your clear choice.
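The size arithmetic in this answer can be reproduced with a few lines of code. This is a sketch of the formula as stated here, not of any MongoDB API: bsonStringBits is a hypothetical name, and the type byte and field name are left out because they are the same for both types.

```javascript
// Bits used by a BSON string value per the formula in the answer:
// 32-bit length + UTF-8 bytes + 1 terminator byte (shared overhead excluded).
function bsonStringBits(str) {
  const utf8Bytes = new TextEncoder().encode(str).length;
  return 32 + 8 * utf8Bytes + 8;
}

const DOUBLE_BITS = 64; // a BSON double, as described in the answer

for (const s of ["24", "254", "2254", "1.334"]) {
  console.log(s, bsonStringBits(s), "vs", DOUBLE_BITS);
}
// 24 56 vs 64
// 254 64 vs 64
// 2254 72 vs 64
// 1.334 80 vs 64
```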
I need to get a string / char from a unicode charcode and finally put it into a DOM TextNode to add into an HTML page using client side JavaScript.
Currently, I am doing:
String.fromCharCode(parseInt(charcode, 16));
where charcode is a hex string containing the charcode, e.g. "1D400". The Unicode character which should be returned is 𝐀, but a 퐀 is returned! Characters in the 16 bit range (0000 ... FFFF) are returned as expected.
Any explanation and / or proposals for correction?
Thanks in advance!
String.fromCharCode can only handle code points in the BMP (i.e. up to U+FFFF). To handle higher code points, this function from Mozilla Developer Network may be used to return the surrogate pair representation:
function fixedFromCharCode (codePt) {
if (codePt > 0xFFFF) {
codePt -= 0x10000;
return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
} else {
return String.fromCharCode(codePt);
}
}
The problem is that JavaScript strings are (mostly) treated as UCS-2, so a character outside the Basic Multilingual Plane has to be represented as a UTF-16 surrogate pair.
The following function is adapted from Converting punycode with dash character to Unicode:
function utf16Encode(input) {
var output = [], i = 0, len = input.length, value;
while (i < len) {
value = input[i++];
if ( (value & 0xF800) === 0xD800 ) {
throw new RangeError("UTF-16(encode): Illegal UTF-16 value");
}
if (value > 0xFFFF) {
value -= 0x10000;
output.push(String.fromCharCode(((value >>>10) & 0x3FF) | 0xD800));
value = 0xDC00 | (value & 0x3FF);
}
output.push(String.fromCharCode(value));
}
return output.join("");
}
alert( utf16Encode([0x1D400]) );
Section 8.4 of the ECMAScript language spec says
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
So you need to encode supplemental code-points as pairs of UTF-16 code units.
The article "Supplementary Characters in the Java Platform" gives a good description of how to do this.
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
The following table shows the different representations of a few characters in comparison:
code points / UTF-16 code units
U+0041 / 0041
U+00DF / 00DF
U+6771 / 6771
U+10400 / D801 DC00
Once you know the UTF-16 code units, you can create a string using the javascript function String.fromCharCode:
String.fromCharCode(0xd801, 0xdc00) === '𐐀'
String.fromCodePoint() seems to do the trick as well. See here.
console.log(String.fromCodePoint(0x1D622, 0x1D623, 0x1D624, 0x1D400));
Output:
𝘢𝘣𝘤𝐀