I'm not sure what 0xFF does here...
Is it there just to make sure the binary code is 8 bits long, or does it have something to do with the signed/unsigned encoding? Ty.
var nBytes = data.length, ui8Data = new Uint8Array(nBytes);
for (var nIdx = 0; nIdx < nBytes; nIdx++) {
ui8Data[nIdx] = data.charCodeAt(nIdx) & 0xff;
}
XHR.send(ui8Data);
You're right with your first guess. It takes only the least significant 8 bits of what's returned by data.charCodeAt.
charCodeAt will return a value in the range of 0..65535. This code truncates that range to 0..255. Effectively, it's taking each 16-bit character in the string, assuming it can fit into 8 bits, and throwing out the upper byte.
[6 years later edit] In the comments, we discovered a few things: you're questioning the code for the MDN polyfill for sendAsBinary. As you came to understand, the least significant byte does come first in little-endian systems, while the most significant byte comes first in big-endian systems.
Given that this is code from MDN, the code certainly does what was intended: by using FileReader.readAsBinaryString, it stores 8-bit values in a 16-bit holder, so the upper byte is zero anyway. If you're worried about data loss, you can tweak the polyfill to extract the other byte as well, using (sData.charCodeAt(nIdx) & 0xff00) >> 8.
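A rough sketch of that tweak (not the actual MDN code; it keeps both bytes of each 16-bit code unit, which only makes sense if the string really holds arbitrary 16-bit data rather than readAsBinaryString output):
var nBytes = sData.length, ui8Data = new Uint8Array(nBytes * 2);
for (var nIdx = 0; nIdx < nBytes; nIdx++) {
  var nCode = sData.charCodeAt(nIdx);
  ui8Data[nIdx * 2] = nCode & 0xff;               // low byte first
  ui8Data[nIdx * 2 + 1] = (nCode & 0xff00) >> 8;  // then the high byte
}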
I have a JavaScript string which is about 500K when being sent from the server in UTF-8. How can I tell its size in JavaScript?
I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? Or does it depend on the JavaScript implementation? Or on the page encoding, or maybe the content-type?
You can use a Blob to get the string size in bytes.
Examples:
console.info(
  new Blob(['😂']).size,               // 4
  new Blob(['👍']).size,               // 4
  new Blob(['😂👍']).size,             // 8
  new Blob(['👍😂']).size,             // 8
  new Blob(['I\'m a string']).size,    // 12
  // from Premasagar correction of Lauri's answer for
  // strings containing lone characters in the surrogate pair range:
  // https://stackoverflow.com/a/39488643/6225838
  new Blob([String.fromCharCode(55555)]).size,       // 3
  new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);
This function will return the UTF-8 byte size of any string you pass to it.
function byteCount(s) {
return encodeURI(s).split(/%..|./).length - 1;
}
Source
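For example, the byte counts it gives for a few illustrative strings (the characters are arbitrary; what matters is their UTF-8 width):
byteCount('hello'); // 5
byteCount('δ');     // 2 (2-byte UTF-8 character)
byteCount('€');     // 3 (3-byte UTF-8 character)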
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it's just an implementation detail that won't affect the language's characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source
If you're using Node.js, there is a simpler solution using buffers:
function getBinarySize(string) {
return Buffer.byteLength(string, 'utf8');
}
There is an npm lib for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours truly).
String values are not implementation dependent; according to the ECMA-262 3rd Edition Specification, each character represents a single 16-bit unit of UTF-16 text:
4.3.16 String Value
A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.
NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.
These are 3 ways I use:
TextEncoder
new TextEncoder().encode("myString").length
Blob
new Blob(["myString"]).size
Buffer
Buffer.byteLength("myString", 'utf8')
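For instance, all three should agree on the UTF-8 byte count (illustrative string; Buffer exists only in Node.js, while TextEncoder and Blob are available in browsers and recent Node versions):
const s = "aé漢𝒳"; // 1 + 2 + 3 + 4 UTF-8 bytes
new TextEncoder().encode(s).length; // 10
new Blob([s]).size;                 // 10
Buffer.byteLength(s, 'utf8');       // 10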
Try this combination using the (deprecated but still available) unescape JS function:
const byteAmount = unescape(encodeURIComponent(yourString)).length
Full encoding process example:
const s = "1 a я € # ®"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1, a-1, я-2, €-3, #-1, ®-2 UTF-8 bytes]
const s4 = escape(s3); // length is 41
const s5 = decodeURIComponent(s4); // length is 11
Note that if you're targeting Node.js you can use Buffer.from(string).length:
var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)
The size of a JavaScript string is:
Pre-ES6: 2 bytes per character
ES6 and later: 2 bytes per character, or 5 or more bytes per character
Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 can need 4 bytes (a surrogate pair) for some characters, that would violate the 2-byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two-byte characters used are valid UTF-16 characters. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.
ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using a unicode escape looks like this: \u{1D306}
Practical notes
This doesn't relate to the internal implementation of a particular engine. For
example, some engines use data structures and libraries with full
UTF-16 support, but what they provide externally doesn't have to be
full UTF-16 support. Also an engine may provide external UTF-16
support as well but is not mandated to do so.
For ES6, practically speaking characters will never be more than 5
bytes long (2 bytes for the escape point + 3 bytes for the Unicode
code point) because the latest version of Unicode only has 136,755
possible characters, which fits easily into 3 bytes. However this is
technically not limited by the standard so in principle a single
character could use say, 4 bytes for the code point and 6 bytes
total.
Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:
getStringMemorySize = function( _string ) {
"use strict";
var codePoint
, accum = 0
;
for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
codePoint = _string.charCodeAt( stringIndex ); // a UTF-16 code unit (0..0xFFFF), so astral characters show up as two surrogate halves
if( codePoint < 0x100 ) {
accum += 1;
continue;
}
if( codePoint < 0x10000 ) {
accum += 2;
continue;
}
if( codePoint < 0x1000000 ) {
accum += 3;
} else {
accum += 4;
}
}
return accum * 2;
}
Examples:
getStringMemorySize( 'I' ); // 2
getStringMemorySize( '❤' ); // 4
getStringMemorySize( '𠀰' ); // 8
getStringMemorySize( 'I❤𠀰' ); // 14
The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.
byteCount(String.fromCharCode(55555))
// URIError: URI malformed
This longer function should handle all strings:
function bytes (str) {
var bytes=0, len=str.length, codePoint, next, i;
for (i=0; i < len; i++) {
codePoint = str.charCodeAt(i);
// Lone surrogates cannot be passed to encodeURI
if (codePoint >= 0xD800 && codePoint < 0xE000) {
if (codePoint < 0xDC00 && i + 1 < len) {
next = str.charCodeAt(i + 1);
if (next >= 0xDC00 && next < 0xE000) {
bytes += 4;
i++;
continue;
}
}
}
bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
}
return bytes;
}
E.g.
bytes(String.fromCharCode(55555))
// 3
It will correctly calculate the size for strings containing surrogate pairs:
bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)
The results can be compared with Node's built-in function Buffer.byteLength:
Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3
Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
A single element in a JavaScript String is considered to be a single UTF-16 code unit. That is to say, string characters are stored as 16-bit values (1 code unit each), and 16 bits equal 2 bytes (8 bits = 1 byte).
The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The codePointAt() method can be used to return the entire code point value for Unicode characters outside the BMP, as in UTF-32.
When a UTF-16 character can't be represented in a single 16-bit code unit, it will have a surrogate pair and therefore use two code units (2 x 16-bit = 4 bytes).
See Unicode encodings for different encodings and their code ranges.
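For example (any character above U+FFFF behaves the same way; '𝒳' is just an illustrative choice):
const s = '𝒳';    // U+1D4B3, outside the BMP
s.length;          // 2 (two 16-bit code units = 4 bytes)
s.charCodeAt(0);   // 55349 (0xD835, the high surrogate)
s.charCodeAt(1);   // 56499 (0xDCB3, the low surrogate)
s.codePointAt(0);  // 119987 (0x1D4B3, the full code point)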
The Blob interface's size property returns the size of the Blob or File in bytes.
const getStringSize = (s) => new Blob([s]).size;
I'm working with an embedded version of the V8 engine.
I've tested a single string, growing it by 1000 characters at each step. UTF-8.
First test with a single-byte (8-bit, ANSI) character "A" (hex: 41).
Second test with a two-byte character (16-bit) "Ω" (hex: CE A9), and the
third test with a three-byte character (24-bit) "☺" (hex: E2 98 BA).
In all three cases the device reports out of memory at
888 000 characters, using ca. 26 348 KB of RAM.
Result: the characters are not stored dynamically, and not with only 16 bits. OK, perhaps that only holds for my case (embedded 128 MB RAM device, V8 engine, C++/Qt). The character encoding has nothing to do with the size in RAM used by the JavaScript engine; e.g. encodeURI etc. is only useful for high-level data transmission and storage.
Embedded or not, the fact is that the characters are not stored in only 16 bits.
Unfortunately I have no definitive answer as to what JavaScript does at the low level.
Btw. I've tested the same (the first test above) with an array of the character "A",
pushing 1000 items at each step (exactly the same test, just with the string replaced by an array). The system ran out of memory (as wanted) after using 10 416 KB, at an array length of 1 337 000.
So, the JavaScript engine is not simply restricted; it's somewhat more complex.
You can try this:
function getByteSize(str) { // wrapper name is arbitrary
  // Counts characters outside the Latin-1 range as 2 bytes, everything else as 1.
  var b = str.match(/[^\x00-\xff]/g);
  return (str.length + (!b ? 0 : b.length));
}
It worked for me.
I'm getting too confused. Why do code points from U+D800 to U+DBFF encode as a single (2 bytes) String element, when using the ECMAScript 6 native Unicode helpers?
I'm not asking how JavaScript/ECMAScript encodes Strings natively, I'm asking about an extra functionality to encode UTF-16 that makes use of UCS-2.
var str1 = '\u{D800}';
var str2 = String.fromCodePoint(0xD800);
console.log(
str1.length, str1.charCodeAt(0), str1.charCodeAt(1)
);
console.log(
str2.length, str2.charCodeAt(0), str2.charCodeAt(1)
);
Re-TL;DR: I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string, since my browser's ES6 implementation incorporates UCS-2 encoding in strings, which uses 2 bytes for each character code?
Both of these approaches return a one-element String for the U+D800 code point (char code: 55296, same as 0xD800). But for code points bigger than U+FFFF each one returns a two-element String, the lead and trail. lead would be a number between U+D800 and U+DBFF, and trail I'm not sure about, I only know it helps changing the result code point. For me the return value doesn't make sense, it represents a lead without trail. Am I understanding something wrong?
I think your confusion is about how Unicode encodings work in general, so let me try to explain.
Unicode itself just specifies a list of characters, called "code points", in a particular order. It doesn't tell you how to convert those to bits, it just gives them all a number between 0 and 1114111 (in hexadecimal, 0x10FFFF). There are several different ways these numbers from U+0 to U+10FFFF can be represented as bits.
In an earlier version, it was expected that a range of 0 to 65535 (0xFFFF) would be enough. This can be naturally represented in 16 bits, using the same convention as an unsigned integer. This was the original way of storing Unicode, and is now known as UCS-2. To store a single code point, you reserve 16 bits of memory.
Later, it was decided that this range was not large enough; this meant that there were code points higher than 65535, which you can't represent in a 16-bit piece of memory. UTF-16 was invented as a clever way of storing these higher code points. It works by saying "if you look at a 16-bit piece of memory, and it's a number between 0xD800 and 0xDBFF (a "high surrogate"), then you need to look at the next 16 bits of memory as well". Any piece of code which is performing this extra check is processing its data as UTF-16, and not UCS-2.
It's important to understand that the memory itself doesn't "know" which encoding it's in, the difference between UCS-2 and UTF-16 is how you interpret that memory. When you write a piece of software, you have to choose which interpretation you're going to use.
Now, onto Javascript...
JavaScript handles input and output of strings by interpreting its internal representation as UTF-16. That's great, it means that you can type in and display the famous 💩 character, which can't be stored in one 16-bit piece of memory.
The problem is that most of the built-in string functions actually handle the data as UCS-2 - that is, they look at 16 bits at a time, and don't care if what they see is a special "surrogate". The function you used, charCodeAt(), is an example of this: it reads 16 bits out of memory, and gives them to you as a number between 0 and 65535. If you feed it 💩, it will just give you back the first 16 bits; ask it for the next "character" after, and it will give you the second 16 bits (which will be a "low surrogate", between 0xDC00 and 0xDFFF).
In ECMAScript 6 (2015), a new function was added: codePointAt(). Instead of just looking at 16 bits and giving them to you, this function checks if they represent one of the UTF-16 surrogate code units, and if so, looks for the "other half" - so it gives you a number between 0 and 1114111. If you feed it 💩, it will correctly give you 128169.
var poop = '💩';
console.log('Treat it as UCS-2, two 16-bit numbers: ' + poop.charCodeAt(0) + ' and ' + poop.charCodeAt(1));
console.log('Treat it as UTF-16, one value cleverly encoded in 32 bits: ' + poop.codePointAt(0));
// The surrogates are 55357 and 56489, which encode 128169 as follows:
// 0x010000 + ((55357 - 0xD800) << 10) + (56489 - 0xDC00) = 128169
Your edited question now asks this:
I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string?
The hexadecimal value D800 is 55296 in decimal, which is less than 65536, so given everything I've said above, this fits fine in 16 bits of memory. So if we ask charCodeAt to read 16 bits of memory, and it finds that number there, it's not going to have a problem.
Similarly, the .length property measures how many sets of 16 bits there are in the string. Since this string is stored in 16 bits of memory, there is no reason to expect any length other than 1.
The only unusual thing about this number is that in Unicode, that value is reserved - there isn't, and never will be, a character U+D800. That's because it's one of the magic numbers that tells a UTF-16 algorithm "this is only half a character". So a possible behaviour would be for any attempt to create this string to simply be an error - like opening a pair of brackets that you never close, it's unbalanced, incomplete.
The only way you could end up with a string of length 2 is if the engine somehow guessed what the second half should be; but how would it know? There are 1024 possibilities, from 0xDC00 to 0xDFFF, which could be plugged into the formula I show above. So it doesn't guess, and since it doesn't error, the string you get is 16 bits long.
Of course, you can supply the matching halves, and codePointAt will interpret them for you.
// Set up two 16-bit pieces of memory
var high=String.fromCharCode(55357), low=String.fromCharCode(56489);
// Note: String.fromCodePoint will give the same answer
// Glue them together (this + is string concatenation, not number addition)
var poop = high + low;
// Read out the memory as UTF-16
console.log(poop);
console.log(poop.codePointAt(0));
Well, it does this because the specification says it has to:
http://www.ecma-international.org/ecma-262/6.0/#sec-string.fromcodepoint
http://www.ecma-international.org/ecma-262/6.0/#sec-utf16encoding
Together these two say that if an argument is < 0 or > 0x10FFFF, a RangeError is thrown, but otherwise any codepoint <= 65535 is incorporated into the result string as-is.
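A quick illustration of that behaviour:
String.fromCodePoint(0xD800).length;  // 1 (the lone surrogate is kept as-is)
String.fromCodePoint(0x1F4A9).length; // 2 (astral code points become a surrogate pair)
String.fromCodePoint(0x110000);       // throws RangeError (> 0x10FFFF)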
As for why things are specified this way, I don't know. It seems like JavaScript doesn't really support Unicode, only UCS-2.
Unicode.org has the following to say on the matter:
http://www.unicode.org/faq/utf_bom.html#utf16-2
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800 to DBFF, and trailing, or low, surrogates are from DC00 to DFFF.
http://www.unicode.org/faq/utf_bom.html#utf16-7
Q: Are there any 16-bit values that are invalid?
A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF.
Therefore the result of String.fromCodePoint is not always valid UTF-16 because it can emit unpaired surrogates.
How do you generate cryptographically secure floats in Javascript?
This should be a plug-in for Math.random, with range (0, 1), but cryptographically secure. Example usage
cryptoFloat.random();
0.8083966837153522
Secure random numbers in javascript? shows how to create a cryptographically secure Uint32Array. Maybe this could be converted to a float somehow?
The Mozilla Uint32Array documentation was not totally clear on how to convert from an int.
Google was not to the point, either.
Float32Array.from(someUintBuf); always gave a whole number.
Since the following code is quite simple and functionally equivalent to the division method, here is the alternate method of altering the bits. (This code is copied and modified from @T.J. Crowder's very helpful answer.)
// A buffer with just the right size to convert to Float64
let buffer = new ArrayBuffer(8);
// View it as an Int8Array and fill it with 8 random ints
let ints = new Int8Array(buffer);
window.crypto.getRandomValues(ints);
// Set the sign (ints[7][7]) to 0 and the
// exponent (ints[7][6]-[6][4]) to just the right size
// (all ones except for the highest bit)
ints[7] = 63;
ints[6] |= 0xf0;
// Now view it as a Float64Array, and read the one float from it
let float = new DataView(buffer).getFloat64(0, true) - 1;
document.body.innerHTML = "The number is " + float;
Explanation:
The format of an IEEE 754 double is 1 sign bit (ints[7][7]), 11 exponent bits (ints[7][6] to ints[6][4]), and the rest as mantissa (which holds the values). The value is computed as (-1)^sign * 2^(exponent - 1023) * 1.mantissa.
To set the 2^(exponent - 1023) factor to 1, the exponent needs to be 1023. The exponent field has 11 bits, so its highest-order bit is worth 1024; that bit needs to be set to 0 and all the other bits to 1.
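For comparison, the division method mentioned at the top of this answer can be sketched like this (illustrative function name): take a cryptographically random 32-bit unsigned integer and divide by 2^32. Note it only yields 32 bits of randomness, whereas the mantissa approach above fills 52 bits.
function cryptoRandomFloat() {
  var buf = new Uint32Array(1);
  window.crypto.getRandomValues(buf);
  return buf[0] / 4294967296; // 2^32, so the result is in [0, 1)
}
console.log(cryptoRandomFloat()); // e.g. 0.8083966837153522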
How to count bits of the string in JavaScript?
For example how many bits long is the string 0000xfe-kemZlF4IlEgljDF_4df:1102pwrq7?
The string provided ("0000xfe-kemZlF4IlEgljDF_4df:1102pwrq7") would be:
length * 2 * 8
bits long, or 592 bits.
This is because each char in a string is treated as a 16-bit unsigned value, at least in the most common mainstream implementations. The details of this can be debated, but you mention in the comments that it is for security purposes, so:
Assuming you are using ASCII characters (0-127), you can use the TextEncoder object to make sure you provide enough chars to produce 128 bits. Just be careful with Latin-1 chars above 0x7F: in UTF-8 they encode to 2 bytes instead of just one.
If you use a plain JavaScript string to hold ASCII characters, half of each 16-bit position will be zeros, which reduces the security significantly, so an encoding from UTF-16/UCS-2 to ASCII or UTF-8 is required.
To use TextEncoder you simply provide a string of 16 characters, at this point 256 bits (16x16), where each char is within the ASCII range. After encoding, unless some special chars were used, the binary buffer as a typed array should represent 128 bits (16x8).
Example
if (!("TextEncoder" in window)) alert("Sorry, no TextEncoder in this browser...");
else {
btn.onclick = function() {
var s = txt.value;
if (s.length !== 16) {
alert("Need 16 chars. " + (16 - s.length) + " to go...");
return
}
var encoder = new TextEncoder(); // always UTF-8; the label argument is ignored by modern browsers
var bytes = encoder.encode(s);
console.log(bytes);
if (bytes.byteLength === 16) alert("OK, got 128 bits");
else alert("Oops, got " + (bytes.byteLength * 8) + " bits.");
};
}
<label>Enter 16 ASCII chars: <input id=txt maxlength=16></label>
<button id=btn>Convert</button>
An alternative to TextEncoder if using older browsers is to manually iterate over the string and extract and mask each char to build a binary array from that.
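A rough sketch of that manual fallback (hypothetical helper name; it assumes plain ASCII input, as discussed above):
function asciiToBytes(s) {
  var bytes = new Uint8Array(s.length);
  for (var i = 0; i < s.length; i++) {
    bytes[i] = s.charCodeAt(i) & 0xff; // mask each char code down to 8 bits
  }
  return bytes;
}
asciiToBytes('0123456789abcdef').byteLength * 8; // 128 bits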
Can you copy the string into a buffer and then check the length of the buffer?
var str = ' ... ';
var buf = Buffer.from(str); // new Buffer(str) is deprecated
console.log(buf.length);
If, as you say, you just need to make sure the given value is at least 128 bit, then you're probably passing this string to something that will be converting the string to some byte representation. How the string is converted to bytes depends on how it's encoded.
The sample string you gave us contains ASCII-range characters. If the string is encoded as ASCII, then it's 8 bits per character. If the string was encoded as UTF-8, then it would be 8 bits per character, but if the string could contain larger character values than the sample you provided, then it may be more than 8 bits per character depending on the character. If it's encoded as UTF-16, then each character is a minimum of 16 bits, but could be more depending on the character. If it's encoded as UCS-2, then it would always be 16 bits per character.
We don't know where this requirement is coming from and how the system requiring this string uses it. If the system uses a fixed number of bits per character, then this is as straightforward as taking the length of the string and multiplying by the appropriate number. If it's not that straightforward, then you would need to encode the string using the proper encoding, most likely to a byte array, then multiply 8 * the number of bytes to get the number of bits.
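For example, a sketch for the UTF-8 case (illustrative helper; TextEncoder always produces UTF-8):
function bitLengthUtf8(str) {
  return new TextEncoder().encode(str).length * 8;
}
bitLengthUtf8('0000xfe-kemZlF4IlEgljDF_4df:1102pwrq7'); // 296 (37 ASCII chars = 37 bytes)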
I'm currently generating UUIDs in Javascript with this function (Create GUID / UUID in JavaScript?):
lucid.uuid = function() {
return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
var r = Math.random()*16|0, v = c == 'x' ? r : (r&0x3|0x8);
return v.toString(16);
});
}
I understand that all the randomness is only coming from Javascript's Math.random() function, and I don't care if it meets an RFC for a UUID. What I want is to pack as much randomness into as few bytes as possible in a Javascript string. The above function gives about 128 bits of randomness. How small of a string (as measured in UTF8 bytes sent over the wire in an HTTP POST) could I fit 128 bits into in Javascript? And how would I generate such a string?
Edit: This string will be part of a JSON object when sent to the server, so characters that need to be escaped in a string are not very helpful.
Here is one potential function I came up with. The seed string is the set of unreserved URL characters (66 of them). I prefix the randomness with about a year's worth of 1-second-resolution timestamp data, which is helpful since the collision space for my particular application is only filled up reasonably slowly over time (only at MOST a few hundred of these generated per second in an extreme case).
uuidDense = function() {
var seed = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.~';
//Start the UUID with 4 digits of seed from the current date/time in seconds
//(which is almost a year worth of second data).
var seconds = Math.floor((new Date().getTime())/1000);
var ret = seed[seconds % seed.length];
ret += seed[Math.floor(seconds/=seed.length) % seed.length];
ret += seed[Math.floor(seconds/=seed.length) % seed.length];
ret += seed[Math.floor(seconds/=seed.length) % seed.length];
for(var i = 0; i < 8; i++)
ret += seed[Math.random()*seed.length|0];
return ret;
}
Thoughts?
128 bits = 16 bytes -> base64 -> ceil(16/3)*4 = 24, which will give you a string of 24 characters (versus the 36 chars that you have)
You can also use base85 for better density, but that will require URL encoding, so you may end up with even worse results than you have now.
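A minimal sketch of the base64 route (browser APIs; btoa pads with '=', so 16 bytes always give exactly 24 characters, all of which are safe inside a JSON string):
var bytes = new Uint8Array(16);         // 128 bits of randomness
window.crypto.getRandomValues(bytes);
var id = btoa(String.fromCharCode.apply(null, bytes));
console.log(id.length);                 // 24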
Your question is somewhat contradictory. JavaScript strings use UCS-2 (fixed 16-bit characters) for their internal representation, while UTF-8 is variable width. For encoding purposes I believe the most compact form would be to use 1-byte UTF-8 characters, which only require that the most significant bit be zero, i.e. you could pack 128 bits into roughly 128 * 8/7 = 147 bits.
Converting to bytes and rounding up, you could do this in 19 characters.