I need to get a string / char from a unicode charcode and finally put it into a DOM TextNode to add into an HTML page using client side JavaScript.
Currently, I am doing:
String.fromCharCode(parseInt(charcode, 16));
where charcode is a hex string containing the charcode, e.g. "1D400". The Unicode character which should be returned is 𝐀 (U+1D400), but 퐀 (U+D400) is returned instead! Characters in the 16-bit range (0000 ... FFFF) are returned as expected.
Any explanation and / or proposals for correction?
Thanks in advance!
String.fromCharCode can only handle code points in the BMP (i.e. up to U+FFFF). To handle higher code points, this function from Mozilla Developer Network may be used to return the surrogate pair representation:
function fixedFromCharCode(codePt) {
    if (codePt > 0xFFFF) {
        codePt -= 0x10000;
        return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
    } else {
        return String.fromCharCode(codePt);
    }
}
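With that helper, the question's hex input can be handled like this (document.createTextNode matches the asker's DOM use case):
var charcode = "1D400";
var node = document.createTextNode(fixedFromCharCode(parseInt(charcode, 16))); // a text node containing "𝐀"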
The problem is that JavaScript strings are (mostly) UCS-2 encoded, but a character outside the Basic Multilingual Plane can be represented as a UTF-16 surrogate pair.
The following function is adapted from Converting punycode with dash character to Unicode:
function utf16Encode(input) {
    var output = [], i = 0, len = input.length, value;
    while (i < len) {
        value = input[i++];
        if ((value & 0xF800) === 0xD800) {
            throw new RangeError("UTF-16(encode): Illegal UTF-16 value");
        }
        if (value > 0xFFFF) {
            value -= 0x10000;
            output.push(String.fromCharCode(((value >>> 10) & 0x3FF) | 0xD800));
            value = 0xDC00 | (value & 0x3FF);
        }
        output.push(String.fromCharCode(value));
    }
    return output.join("");
}
alert( utf16Encode([0x1D400]) );
Section 8.4 of the ECMAScript language spec says:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
So you need to encode supplemental code-points as pairs of UTF-16 code units.
The article "Supplementary Characters in the Java Platform" gives a good description of how to do this.
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
The following table shows the different representations of a few characters in comparison:
code points / UTF-16 code units
U+0041 / 0041
U+00DF / 00DF
U+6771 / 6771
U+10400 / D801 DC00
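The last row can be computed directly; here is a small sketch of the arithmetic (my illustration, not from the article):
// Compute the surrogate pair for a supplementary code point (here U+10400):
var cp = 0x10400;
var offset = cp - 0x10000;           // 0x00400
var high = 0xD800 + (offset >> 10);  // 0xD801
var low = 0xDC00 + (offset & 0x3FF); // 0xDC00
console.log(high.toString(16), low.toString(16)); // "d801 dc00"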
Once you know the UTF-16 code units, you can create a string using the JavaScript function String.fromCharCode:
String.fromCharCode(0xd801, 0xdc00) === '𐐀'
String.fromCodePoint() seems to do the trick as well. See here.
console.log(String.fromCodePoint(0x1D622, 0x1D623, 0x1D624, 0x1D400));
Output:
𝘢𝘣𝘤𝐀
Related
I have a javascript string which is about 500K when being sent from the server in UTF-8. How can I tell its size in JavaScript?
I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? However, does it depend on the JavaScript implementation? Or on the page encoding, or maybe the content-type?
You can use a Blob to get the string size in bytes.
Examples:
console.info(
new Blob(['😂']).size, // 4
new Blob(['👍']).size, // 4
new Blob(['😂👍']).size, // 8
new Blob(['👍😂']).size, // 8
new Blob(['I\'m a string']).size, // 12
// from Premasagar correction of Lauri's answer for
// strings containing lone characters in the surrogate pair range:
// https://stackoverflow.com/a/39488643/6225838
new Blob([String.fromCharCode(55555)]).size, // 3
new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);
This function will return the byte size of any UTF-8 string you pass to it.
function byteCount(s) {
    return encodeURI(s).split(/%..|./).length - 1;
}
Source
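For example (assuming well-formed input; encodeURI throws on lone surrogates, as a later answer notes):
byteCount("I'm a string"); // 12
byteCount("☠"); // 3 (U+2620 takes three bytes in UTF-8)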
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, itโs just an implementation detail that wonโt affect the languageโs characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source
If you're using Node.js, there is a simpler solution using buffers:
function getBinarySize(string) {
    return Buffer.byteLength(string, 'utf8');
}
There is an npm lib for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours truly)
String values are not implementation-dependent; according to the ECMA-262 3rd Edition specification, each character represents a single 16-bit unit of UTF-16 text:
4.3.16 String Value
A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.
NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.
Here are 3 ways I use:
TextEncoder
new TextEncoder().encode("myString").length
Blob
new Blob(["myString"]).size
Buffer
Buffer.byteLength("myString", 'utf8')
Try this combination, using the unescape JS function:
const byteAmount = unescape(encodeURIComponent(yourString)).length
Full encoding process example:
const s = "1 a ф → # ®"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1, a-1, ф-2, →-3, #-1, ®-2]
const s4 = escape(s3); // length is 39
const s5 = decodeURIComponent(s4); // length is 11
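As a sanity check, the trick agrees with TextEncoder for well-formed strings (this comparison is mine, not part of the original answer):
const check = "☠"; // U+2620, three bytes in UTF-8
console.log(unescape(encodeURIComponent(check)).length); // 3
console.log(new TextEncoder().encode(check).length); // 3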
Note that if you're targeting Node.js you can use Buffer.from(string).length:
var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)
The size of a JavaScript string is
Pre-ES6: 2 bytes per character
ES6 and later: 2 bytes per character, or 5 or more bytes per character
Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 characters can occupy four bytes (two code units), this would violate the 2-byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two-byte characters used be valid UTF-16 code units. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.
ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using such an escape looks like this: \u{1D306}
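A quick check of how such an escape behaves (my illustration; the escape still produces two UTF-16 code units in the string):
const tetragram = "\u{1D306}"; // same as "\uD834\uDF06"
console.log(tetragram.length); // 2 (two UTF-16 code units)
console.log(tetragram.codePointAt(0).toString(16)); // "1d306"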
Practical notes
This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. An engine may also provide external UTF-16 support, but is not mandated to do so.
For ES6, practically speaking, characters will never be more than 5 bytes long (2 bytes for the escape prefix + 3 bytes for the Unicode code point) because the latest version of Unicode only has 136,755 possible characters, which fits easily into 3 bytes. However, this is technically not limited by the standard, so in principle a single character could use, say, 4 bytes for the code point and 6 bytes total.
Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:
function getStringMemorySize(_string) {
    "use strict";
    var codePoint, accum = 0;
    for (var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++) {
        // charCodeAt() returns a UTF-16 code unit (0..0xFFFF), so each half
        // of a surrogate pair is counted separately.
        codePoint = _string.charCodeAt(stringIndex);
        if (codePoint < 0x100) {
            accum += 1;
            continue;
        }
        if (codePoint < 0x10000) {
            accum += 2;
            continue;
        }
        // Note: these branches are effectively unreachable, since
        // charCodeAt() never returns values above 0xFFFF.
        if (codePoint < 0x1000000) {
            accum += 3;
        } else {
            accum += 4;
        }
    }
    return accum * 2;
}
Examples:
getStringMemorySize( 'I' ); // 2
getStringMemorySize( '❤' ); // 4
getStringMemorySize( '𠀰' ); // 8
getStringMemorySize( 'I❤𠀰' ); // 14
The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.
byteCount(String.fromCharCode(55555))
// URIError: URI malformed
This longer function should handle all strings:
function bytes(str) {
    var bytes = 0, len = str.length, codePoint, next, i;
    for (i = 0; i < len; i++) {
        codePoint = str.charCodeAt(i);
        // Lone surrogates cannot be passed to encodeURI
        if (codePoint >= 0xD800 && codePoint < 0xE000) {
            if (codePoint < 0xDC00 && i + 1 < len) {
                next = str.charCodeAt(i + 1);
                if (next >= 0xDC00 && next < 0xE000) {
                    bytes += 4;
                    i++;
                    continue;
                }
            }
        }
        bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
    }
    return bytes;
}
E.g.
bytes(String.fromCharCode(55555))
// 3
It will correctly calculate the size for strings containing surrogate pairs:
bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)
The results can be compared with Node's built-in function Buffer.byteLength:
Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3
Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
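In browsers, TextEncoder gives the same numbers, since it also replaces lone surrogates with U+FFFD (three bytes in UTF-8):
new TextEncoder().encode(String.fromCharCode(55555)).length; // 3
new TextEncoder().encode(String.fromCharCode(55555, 57000)).length; // 4 (not 6)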
A single element in a JavaScript string is a single UTF-16 code unit. That is to say, string characters are stored as 16-bit code units, and 16 bits equal 2 bytes (8 bits = 1 byte).
The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
codePointAt() can be used to return the entire code point value for a character, the way UTF-32 would represent it.
When a character can't be represented in a single 16-bit code unit, it is stored as a surrogate pair and therefore uses two code units (2 x 16 bits = 4 bytes).
See Unicode encodings for different encodings and their code ranges.
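A quick illustration of the above (assuming a non-BMP character, here U+1F680 ROCKET):
const s = "\u{1F680}";
console.log(s.length); // 2 code units (4 bytes in UTF-16)
console.log(s.charCodeAt(0)); // 55357, the high surrogate only
console.log(s.codePointAt(0)); // 128640, the full code point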
The Blob interface's size property returns the size of the Blob or File in bytes.
const getStringSize = (s) => new Blob([s]).size;
I'm working with an embedded version of the V8 Engine.
I've tested with a single string, growing it by 1000 characters per step, in UTF-8.
The first test used the single-byte (8-bit, ANSI) character "A" (hex: 41), the second a two-byte (16-bit) character "Ω" (hex: CE A9), and the third a three-byte (24-bit) character "☺" (hex: E2 98 BA).
In all three cases the device ran out of memory at 888,000 characters, using ca. 26,348 KB of RAM.
Result: the characters are not dynamically sized in storage, and not with only 16 bits either. OK, perhaps that holds only for my case (embedded 128 MB RAM device, V8 engine, C++/Qt). The character encoding has nothing to do with the RAM used by the JavaScript engine; e.g. encodeURI etc. is only useful for high-level data transmission and storage.
Embedded or not, the fact is that the characters are not stored in only 16 bits.
Unfortunately I have no 100% answer as to what JavaScript does at the low level.
Btw. I've tested the same (the first test above) with an array of the character "A", pushing 1000 items per step (exactly the same test, just with the string replaced by an array). The system ran out of memory (as intended) after using 10,416 KB, at an array length of 1,337,000.
So the JavaScript engine is not simply restricted; it's somewhat more complex.
You can try this:
function getByteLen(str) {
    var b = str.match(/[^\x00-\xff]/g);
    return str.length + (!b ? 0 : b.length);
}
It worked for me.
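For example (the function name above is mine, added for readability); note that it counts two bytes for every character above U+00FF, which matches UTF-16 storage rather than UTF-8:
getByteLen("abc"); // 3
getByteLen("a€"); // 3 (€ counts as 2 here, though it is 3 bytes in UTF-8)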
We have to write a small program for our teacher that gets the ASCII code of any value in JavaScript.
I have searched and researched, but it seems that there is no method to do so. I have only found:
charCodeAt()
http://www.hacksparrow.com/get-ascii-value-of-character-convert-ascii-to-character-in-javascript.html
That returns the Unicode value, but not ASCII.
I have read in this forum that the Unicode value is the same as the ASCII value for characters that exist in ASCII:
Are Unicode and Ascii characters the same?
But it seems that is not always the case, for example with extended ASCII characters:
var myCaracter = "├";
var n = myCaracter.charCodeAt(0);
document.write (n);
The ASCII value of that character is 195, but the program returns 226 (Unicode value).
I can't find a pattern to follow to convert from one to another, so:
Can we obtain the ASCII value from the Unicode value, or should I look for another way?
Thanks!
ASCII characters only use 7 bits, with values from 0 to 127 (00 to 7F hex). They include:
control characters (0 to 31, as well as 127)
digits (0 to 9, encoded 48 to 57)
uppercase letters (65 to 90)
lowercase letters (97 to 122)
a limited number of punctuation and other symbols.
ASCII characters are a subset of Unicode (the "C0 Controls and Basic Latin Block"), and they are encoded exactly the same in UTF-8. The ASCII code of "A" (65 or 0x41) is the same as the Unicode code point for "A" (U+0041).
The character (├) you're considering is not ASCII. It's part of many different character sets / code pages, where it may have different numerical values / encodings, but it's definitely not ASCII.
That character is not even defined in the most common 8-bit ASCII extensions, known as ISO-8859-*. It is part of code page 437 (used on MS-DOS), where its numerical code is 0xC3 (195). But that's definitely not ASCII.
The Unicode code point for that character is U+251C (9500 decimal), which is the return value of charCodeAt for this character, not 226.
You're probably getting 226 because you're interpreting a UTF-8 string that has not been recognised as such.
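A quick way to see where the 226 comes from (an illustrative check using TextEncoder):
new TextEncoder().encode("├"); // Uint8Array [226, 148, 156], the UTF-8 bytes of U+251C
// 226 (0xE2) is only the first byte of the three-byte UTF-8 sequence.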
Today my teacher apologized, because maybe it was her fault to tell us that charCodeAt() is wrong for obtaining the ASCII code; she wanted us to use that method, as @Rad Lexus suggested.
So, it is not necessary in my exercise, but as practice, and to help anyone who might need it, I added a small validation to the code to prevent the user from entering extended ASCII characters with codes greater than or equal to 128, which is where the problems with charCodeAt() seem to start.
Maybe it is not a smart solution, and it certainly wasn't necessary in my exercise; it also forbids some characters needed in other languages (ö for German or ñ for Spanish, for example). But I think it is good to post the code and let everyone who uses it choose whether to apply this validation or not.
Thanks to everyone who helped me.
Defining function:
function validate(text)
{
    var isValid = false;
    var i = 0;
    if (text != null && text.length > 0 && text != '')
    {
        isValid = true;
        for (i = 0; i < text.length; ++i) /* this loop is not strictly necessary, but I added it */
        {
            if (text.charCodeAt(i) >= 128)
            {
                isValid = false;
            }
        }
    }
    return isValid;
}
Using function
var isValid = false;
var text;
while (isValid == false)
{
    text = prompt("Enter your text");
    isValid = validate(text);
}
What is the difference between String.prototype.codePointAt() and String.prototype.charCodeAt() in JavaScript?
'A'.codePointAt(); // 65
'A'.charCodeAt(); // 65
From the MDN page on charCodeAt:
The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The UTF-16 code unit matches the Unicode code point for code points which can be represented in a single UTF-16 code unit. If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point. If you want the entire code point value, use codePointAt().
TLDR;
charCodeAt() is UTF-16
codePointAt() is Unicode.
To add a few examples to ToxicTeacakes's answer, here is another one to help you see the difference:
"๐ ฎท".charCodeAt(0).toString(16);//d842
"๐ ฎท".charCodeAt(1).toString(16);//dfb7
"๐ ฎท".codePointAt(0);//20bb7
"๐ ฎท".codePointAt(1);//dfb7
console.log("\ud842\udfb7");//๐ ฎท, an example of hexadecimal digits
console.log("\u20bb7\udfb7");//โป7๏ฟฝ
console.log("\u{20bb7}");//๐ ฎท an unicode code point escapes the "\ud842\udfb7"
The following is the info about javascript string literals:
"\uXXXX"
The Unicode character specified by the four hexadecimal digits
XXXX. For example, \u00A9 is the Unicode sequence for the copyright
symbol.
"\u{XXXXX}"
Unicode code point
escapes. For example, \u{2F804} is the same as the simple Unicode
escapes \uD87E\uDC04.
see also msdn
Example in JS
In the example with strings and emojis below, I am going to illustrate how things can go wrong when you do not know that some characters consist of two code units. Some characters take up more than one code unit: consider using codePointAt() over charCodeAt(), or use the latter only if you are sure that your characters lie between 0 and 65535 (below 2^16).
more about code units here
// charCodeAt() is UTF-16
// codePointAt() is Unicode
/* UTF-16 is generally considered a bad idea today */
const strings = ["o", "four", "to"];
const emojis = ["🐎", "👟"];
function printItemsLength(arr) {
    for (const item of arr) {
        console.log(item, item.length);
    }
}
printItemsLength(strings);
console.log('================================');
printItemsLength(emojis);
console.log('================================');
console.log("i.charCodeAt(0)", "i".charCodeAt(0)); // 105
console.log("i.charCodeAt(1)", "i".charCodeAt(1)); // 105
console.log("i.codePointAt(0)", "i".codePointAt(0)); // 105
console.log('=============EMOJIS=============');
// getting the decimal (dec) by which you can find them
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[0] + '.charCodeAt(0)', emojis[0].charCodeAt(0)); // only half a character - 55357 (high surrogate)
console.log(emojis[0] + '.charCodeAt(1)', emojis[0].charCodeAt(1)); // the other half - 56334 (low surrogate)
console.log('===========codePointAt===========');
console.log(emojis[0] + '.codePointAt(0)', emojis[0].codePointAt(0)); // 128014
console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[1] + '.charCodeAt(0)', emojis[1].charCodeAt(0)); // only half a character - 55357 (high surrogate)
console.log(emojis[1] + '.charCodeAt(1)', emojis[1].charCodeAt(1)); // the other half - 56415 (low surrogate)
console.log('===========codePointAt===========');
// full-character
console.log(emojis[1] + '.codePointAt(0)', emojis[1].codePointAt(0)); // 128095
console.log(emojis[1] + '.codePointAt(1)', emojis[1].codePointAt(1)); // will return the lower surrogate (a non-displayable character)
// to find these emojis, have a look here: https://www.w3schools.com/charsets/ref_emoji.asp
As someone may have noticed, I tried to convert back from the char codes to the emoji, and it did not work for one of the values (that is because a lone surrogate code unit is not a valid character on its own).
Introduction to Unicode and UTF-16
Please skip this section if you are already familiar with it.
Unicode is a set of characters used around the world. UTF-16 encodes each of them as one or two 16-bit code units: 00000000 00100100 for "$" (one 16-bit unit); 11011000 01010010 11011111 01100010 for "𤭢" (two 16-bit units).
read more
"surrogate pair" characters are emoji and some letters that consist of more than 1 character as it is explained here
The term "surrogate pair" refers to a means of encoding Unicode
characters with high code-points in the UTF-16 encoding scheme. In the
Unicode character encoding, characters are mapped to values between
0x0 and 0x10FFFF.
read more
Unicode - It assigns every character a unique number called a code point.
Differentiating charCodeAt() from codePointAt()
charCodeAt(pos) returns a single code unit (not necessarily a full character).
If you need a character (that could be either one or two code units), you can use codePointAt(pos) to get its code.
charCodeAt() - returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index link
codePointAt() - returns a non-negative integer that is the Unicode code point value at the given position link
where pos is the index of the character you want to check.
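As an illustration of the difference, for...of iterates by code point, while indexing and charCodeAt() work in code units:
const text = "a\u{1F40E}"; // "a" plus the horse emoji from above
console.log(text.length); // 3 (one + two code units)
for (const ch of text) {
    console.log(ch.codePointAt(0)); // 97, then 128014 - one full code point per step
}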
Quote from the book:
UTF-16 is generally considered a bad idea today. It seems almost
intentionally designed to invite mistakes. It's easy to write programs
that pretend code units and characters are the same things.
read more
jsfiddle sandbox
Sources:
What is Unicode, UTF-8, UTF-16?
Marijn Haverbeke, Eloquent JavaScript, 3rd Edition: A Modern Introduction to Programming. No Starch Press, 2018. 447 p. Can be found here
What is "surrogate pair"
to find these emojis, have a look at w3schools.com/charsets/ref_emoji
Chapter 5, p. 91 => Strings and character codes
the following doesn't seem correct
"๐".charCodeAt(0); // returns 55357 in both Firefox and Chrome
that's a Unicode character named ROCKET (U+1F680), the decimal should be 128640.
This is for a Unicode app I am writing. It seems most, but not ALL, chars from Unicode 6 are stuck at 55357.
how can I fix it? Thanks.
JavaScript is using UTF-16 encoding; see this article for details:
Characters outside the BMP, e.g. U+1D306 TETRAGRAM FOR CENTRE (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.
The first code unit of a surrogate pair is always in the range from
0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.
The second code unit of a surrogate pair is always in the range from
0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.
You can decode the surrogate pair like this:
codePoint = (text.charCodeAt(0) - 0xD800) * 0x400 + text.charCodeAt(1) - 0xDC00 + 0x10000
Complete code can be found in the Mozilla documentation for charCodeAt.
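Applying that formula to the ROCKET character from the question:
var text = "\uD83D\uDE80"; // 🚀
(text.charCodeAt(0) - 0xD800) * 0x400 + text.charCodeAt(1) - 0xDC00 + 0x10000; // 128640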
Tried this out:
> "๐".charCodeAt(0);
55357
> "๐".charCodeAt(1);
56960
Related questions on SO:
Expressing UTF-16 unicode characters in JavaScript
Unicode characters from charcode in javascript for charcodes > 0xFFFF
You might want to take a look at this too:
Getting it to work with higher values
I think it's because they're returning the first code unit of the UTF-16 encoding of that character. I'm not sure there's much you can do with charCodeAt alone, because it returns a 16-bit value; I would probably try manually decoding the character from the first two code units and then encoding it in UTF-32, which seems to be what you want.
How to count bits of the string in JavaScript?
For example, how many bits long is the string 0000xfe-kemZlF4IlEgljDF_4df:1102pwrq7?
The string provided ("0000xfe-kemZlF4IlEgljDF_4df:1102pwrq7") would be:
length * 2 * 8
bits long, or 592 bits.
This is because each char in a string is treated as a 16-bit unsigned value, at least in the most common mainstream implementations. The details of this can be debated, but you mention in the comments that it is for security purposes, so:
Assuming you provide ASCII characters (0-127) or single-byte UTF-8 characters (0-255), you can use the TextEncoder object to make sure you provide enough chars to produce 128 bits. Just be careful with Latin-1 chars in the 128-255 range, as UTF-8 encodes them as two bytes instead of one.
If you use a plain JavaScript string to hold ASCII characters, you will have half the positions filled with zero bytes, which reduces the security significantly, so an encoding from UTF-16/UCS-2 to ASCII or UTF-8 is required.
To use TextEncoder, you simply provide a string of 16 characters, which at this point is 256 bits (16 x 16), where each char is within the single-byte range. After encoding, unless some special chars were used, the resulting binary buffer (a typed array) should represent 128 bits (16 x 8).
Example
if (!("TextEncoder" in window)) alert("Sorry, no TextEncoder in this browser...");
else {
btn.onclick = function() {
var s = txt.value;
if (s.length !== 16) {
alert("Need 16 chars. " + (16 - s.length) + " to go...");
return
}
var encoder = new TextEncoder("ASCII"); // or use UTF-8
var bytes = encoder.encode(s);
console.log(bytes);
if (bytes.byteLength === 16) alert("OK, got 128 bits");
else alert("Oops, got " + (bytes.byteLength * 8) + " bits.");
};
}
<label>Enter 16 ASCII chars: <input id=txt maxlength=16></label>
<button id=btn>Convert</button>
An alternative to TextEncoder if using older browsers is to manually iterate over the string and extract and mask each char to build a binary array from that.
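A minimal sketch of such a manual fallback (the function name is illustrative; it assumes the input is already pure ASCII):
function asciiBytes(s) {
    var bytes = new Uint8Array(s.length);
    for (var i = 0; i < s.length; i++) {
        bytes[i] = s.charCodeAt(i) & 0x7F; // mask each code unit down to 7 bits
    }
    return bytes;
}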
Can you copy the string into a buffer and then check the length of the buffer?
var str = ' ... ';
var buf = Buffer.from(str);
console.log(buf.length);
If, as you say, you just need to make sure the given value is at least 128 bit, then you're probably passing this string to something that will be converting the string to some byte representation. How the string is converted to bytes depends on how it's encoded.
The sample string you gave us contains ASCII-range characters. If the string is encoded as ASCII, then it's 8 bits per character. If the string is encoded as UTF-8, then it's also 8 bits per character for this sample, but it could be more than 8 bits per character if the string contained larger character values. If it's encoded as UTF-16, then each character is a minimum of 16 bits, but could be more depending on the character. If it's encoded as UCS-2, then it's always 16 bits per character.
We don't know where this requirement is coming from or how the system requiring this string uses it. If the system uses a fixed number of bits per character, then this is as straightforward as taking the length of the string and multiplying by the appropriate number. If it's not that straightforward, then you would need to encode the string using the proper encoding, most likely to a byte array, then multiply the number of bytes by 8 to get the number of bits.
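For example, under a UTF-8 interpretation, the count could be computed like this (a sketch assuming TextEncoder is available):
const bits = new TextEncoder().encode("0000xfe-kemZlF4IlEgljDF_4df:1102pwrq7").length * 8;
console.log(bits); // 296 bits for this 37-character ASCII string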