Node.JS performance vs native C++ addon when populating an Int32Array

Node.JS performance vs native C++ addon when populating an Int32Array - javascript

I've been experimenting with Node.JS and C++ addons and found that populating an Int32Array is considerably slower when using the C++ addon rather than directly doing so in Node.JS / JavaScript.
Node.JS: 133 ~ ms
C++: 1103 ~ ms
Does anyone know why this is? My test code consists of a fairly large array and for loops containing if statements.
I suspect I'm populating the array incorrectly in my C++ addon. (?)
JavaScript:
var testArray = new Int32Array(36594368);
var i = 0;
for (var xi = 0; xi < 332; xi++) {
for (var yi = 0; yi < 332; yi++) {
for (var zi = 0; zi < 332; zi++) {
if ((xi + yi + zi) % 62 == 0) testArray[i] = 2;
else if (yi < 16) testArray[i] = 2;
else if (yi == 16) testArray[i] = 1;
else testArray[i] = 0;
i++;
}
}
}
C++ Addon:
Local<Int32Array> testArray = Int32Array::New(ArrayBuffer::New(isolate, 4 * 36594368), 0, 36594368);
int i = 0;
for (int xi = 0; xi < 332; xi++) {
for (int yi = 0; yi < 332; yi++) {
for (int zi = 0; zi < 332; zi++) {
if ((xi + yi + zi) % 62 == 0) testArray->Set(i, Integer::New(isolate, 2));
else if (yi < 16) testArray->Set(i, Integer::New(isolate, 2));
else if (yi == 16) testArray->Set(i, Integer::New(isolate, 1));
else testArray->Set(i, Integer::New(isolate, 0));
i++;
}
}
}
EDIT: Just to add, the functions I'm using in my C++ code are V8 functions and weren't defined by myself. Is there another way to set values in an Int32Array without using these?

Maxing out Typed Array Performance using C++
I'm not surprised this is slow as written, but there is a lot you can do to speed it up. The critical insight is that when dealing with JavaScript typed arrays in node, you can gain access to the memory buffer and operate on that directly.
The main source of slowness
Though when dealing with normal JavaScript arrays/objects, the following are necessary
Integer::New(isolate,value)
and
testArray->Set(value)
So for example the following line
testArray->Set(i, Integer::New(isolate, 0));
creates a new Number object, converts the integer 0 to a double since all JavaScript numbers are double, calls Set with the Number object, then converts the double back to an integer because it's storing the value in a Int32 typed array, and then destructs the Number object. This happens 3 million times.
An improvement
But typed arrays are different, and the call GetIndexedPropertiesExternalArrayData gives one access to the underlying buffer, which for a Int32Array is a buffer of int. This allows the C++ function to be re-written to avoid all of those allocations and casts:
void doMkArray(const FunctionCallbackInfo<Value> &args)
{
v8::Isolate *I = v8::Isolate::GetCurrent();
Local<Int32Array> testArray = Int32Array::New(ArrayBuffer::New(I, 4 * 36594368),0,36594368);
int *dptr = (int*)testArray->GetIndexedPropertiesExternalArrayData();
int i = 0;
for (int xi = 0; xi < 332; xi++)
{
for (int yi = 0; yi < 332; yi++)
{
for (int zi = 0; zi < 332; zi++)
{
if ((xi + yi + zi) % 62 == 0) dptr[i] = 2;
else if (yi < 16) dptr[i] = 2;
else if (yi == 16) dptr[i] = 1;
else dptr[i] = 0;
i++;
}
}
}
args.GetReturnValue().Set(testArray);
}
Some measurements
Replacing with the above makes things faster, but how much faster needs a test. The following package can be cloned and when run (using node 0.12.5) results in the following
Performance Tests
✓ via javascript (169ms)
✓ via c++ (141ms)
So that alone makes using C++ faster, but maybe not all that amazing, but if both the Javascript and the C++ loops (see the src) are commented out, and when only the array allocation is included:
void doMkArray(const FunctionCallbackInfo<Value> &args)
{
v8::Isolate *I = v8::Isolate::GetCurrent();
Local<Int32Array> testArray = Int32Array::New(ArrayBuffer::New(I, 4
/*
...
Then the time changes to
Performance Tests
✓ via javascript (62ms)
✓ via c++ (80ms)
In other words, simply allocating the array takes approximately 60ms in JavaScript, and 80ms in the C++ module. But that means that the rest of the time is the time spent in the loop, which is approximately 60ms in C++ and 110ms in Javascript. And so for actions that are predominately loops and calculations using direct buffer access, C++ is preferred.

Related

Programmatically solving Sam Loyd's Battle of Hastings puzzle - performance issues with BigInt

I'm having performance issues when trying to check whether integer n is a perfect square (sqrt is a whole number) when using BigInt. Using normal numbers below Number.MAX_SAFE_INTEGER gives reasonable performance, but attempting to use BigInt even with the same number range causes a huge performance hit.
The program solves the Battle of Hastings perfect square riddle put forth by Sam Loyd whereby my program iterates over the set of real numbers n (in this example, up to 7,000,000) to find instances where variable y is a whole number (perfect square). I'm interested in the original square root of one of the 13 perfect squares where this condition is satisfied, which is what my code generates (there's more than one).
Assuming y^2 < Number.MAX_SAFE_INTEGER which is 2^53 – 1, this can be done without BigInt and runs in ~60ms on my machine:
const limit = 7_000_000;
var a = [];
console.time('regular int');
for (let n = 1; n < limit; n++) {
if (Math.sqrt(Math.pow(n, 2) * 13 + 1) % 1 === 0)
a.push(n);
}
console.log(a.join(', '));
console.timeEnd('regular int');
Being able to use BigInt would mean I could test for numbers much higher than the inherent number variable limit 2^53 - 1, but BigInt seems inherently slower; unusably so. To test whether a BigInt is a perfect square, I have to use a third party library as Math.sqrt doesn't exist for BigInt such that I can check if the root is perfect, as all sqrt returns a floor value. I adapted functions for this from a NodeJS library, bigint-isqrt and bigint-is-perfect-square.
Thus, using BigInt with the same limit of 7,000,000 runs 35x slower:
var integerSQRT = function(value) {
if (value < 2n)
return value;
if (value < 16n)
return BigInt(Math.sqrt(Number(value)) | 0);
let x0, x1;
if (value < 4503599627370496n)
x1 = BigInt(Math.sqrt(Number(value))|0) - 3n;
else {
let vlen = value.toString().length;
if (!(vlen & 1))
x1 = 10n ** (BigInt(vlen / 2));
else
x1 = 4n * 10n ** (BigInt((vlen / 2) | 0));
}
do {
x0 = x1;
x1 = ((value / x0) + x0) >> 1n;
} while ((x0 !== x1 && x0 !== (x1 - 1n)));
return x0;
}
function perfectSquare(n) {
// Divide n by 4 while divisible
while ((n & 3n) === 0n && n !== 0n) {
n >>= 2n;
}
// So, for now n is not divisible by 2
// The only possible residual modulo 8 for such n is 1
if ((n & 7n) !== 1n)
return false;
return n === integerSQRT(n) ** 2n;
}
const limit = 7_000_000;
var a = [];
console.time('big int');
for (let n = 1n; n < limit; n++) {
if (perfectSquare(((n ** 2n) * 13n) + 1n))
a.push(n);
}
console.log(a.join(', '));
console.timeEnd('big int');
Ideally I'm interested in doing this with a much higher limit than 7 million, but I'm unsure whether I can optimise the BigInt version any further. Any suggestions?

You may be pleased to learn that some recent improvements on V8 have sped up the BigInt version quite a bit; with a recent V8 build I'm seeing your BigInt version being about 12x slower than the Number version.
A remaining challenge is that implementations of BigInt-sqrt are typically based on Newton iteration and hence need an estimate for a starting value, which should be near the final result, so about half as wide as the input, which is given by log2(X) or bitLength(X). Until this proposal gets anywhere, that can best be done by converting the BigInt to a string and taking that string's length, which is fairly expensive.
To get faster right now, #Ouroborus' idea is great. I was curious how fast it would be, so I implemented it:
(function betterAlgorithm() {
const limit = 7_000_000n;
var a = [];
console.time('better algorithm');
let m = 1n;
let m_squared = 1n;
for (let n = 1n; n < limit; n += 1n) {
let y_squared = n * n * 13n + 1n;
while (y_squared > m_squared) {
m += 1n;
m_squared = m * m;
}
if (y_squared === m_squared) {
a.push(n);
}
}
console.log(a.join(', '));
console.timeEnd('better algorithm');
})();
As a particular short-term detail, this uses += 1n instead of ++, because as of today, V8 hasn't yet gotten around to optimizing ++ for BigInts. This difference should disappear eventually (hopefully soon).
On my machine, this version takes only about 4x as much time as your original Number-based implementation.
For larger numbers, I would expect some gains from replacing the multiplications with additions (based on the observation that the delta between consecutive square numbers grows linearly), but for small-ish upper limits that appears to be a bit slower. If you want to toy around with it, this snippet describes the idea:
let m_squared = 1n; // == 1*1
let m_squared_delta = 3n; // == 2*2 - 1*1
let y_squared = 14n; // == 1*1*13+1
let y_squared_delta = 39n; // == 2*2*13+1 - 1*1*13+1
for (let n = 1; n < limit; n++) {
while (y_squared > m_squared) {
m_squared += m_squared_delta;
m_squared_delta += 2n;
}
if (y_squared === m_squared) {
a.push(n);
}
y_squared += y_squared_delta;
y_squared_delta += 26n;
}
The earliest where this could possibly pay off is when the results exceed 2n**64n; I wouldn't be surprised if it wasn't measurable before 2n**256n or so.

Why is webAssembly function almost 300 time slower than same JS function

Find length of line 300* slower
First of I have read the answer to Why is my WebAssembly function slower than the JavaScript equivalent?
But it has shed little light on the problem, and I have invested a lot of time that may well be that yellow stuff against the wall.
I do not use globals, I do not use any memory. I have two simple functions that find the length of a line segment and compare them to the same thing in plain old Javascript. I have 4 params 3 more locals and returns a float or double.
On Chrome the Javascript is 40 times faster than the webAssembly and on firefox the wasm is almost 300 times slower than the Javascript.
jsPref test case.
I have added a test case to jsPref WebAssembly V Javascript math
What am I doing wrong?
Either
I have missed an obvious bug, bad practice, or I am suffering coder stupidity.
WebAssembly is not for 32bit OS (win 10 laptop i7CPU)
WebAssembly is far from a ready technology.
Please please be option 1.
I have read the webAssembly use case
Re-use existing code by targeting WebAssembly, embedded in a larger
JavaScript / HTML application. This could be anything from simple
helper libraries, to compute-oriented task offload.
I was hoping I could replace some geometry libs with webAssembly to get some extra performance. I was hoping that it would be awesome, like 10 or more times faster. BUT 300 times slower WTF.
UPDATE
This is not a JS optimisation issues.
To ensure that optimisation has as little as possible effect I have tested using the following methods to reduce or eliminate any optimisation bias..
counter c += length(... to ensure all code is executed.
bigCount += c to ensure whole function is executed. Not needed
4 lines for each function to reduce a inlining skew. Not Needed
all values are randomly generated doubles
each function call returns a different result.
add slower length calculation in JS using Math.hypot to prove code is being run.
added empty call that return first param JS to see overhead
// setup and associated functions
const setOf = (count, callback) => {var a = [],i = 0; while (i < count) { a.push(callback(i ++)) } return a };
const rand = (min = 1, max = min + (min = 0)) => Math.random() * (max - min) + min;
const a = setOf(100009,i=>rand(-100000,100000));
var bigCount = 0;
function len(x,y,x1,y1){
var nx = x1 - x;
var ny = y1 - y;
return Math.sqrt(nx * nx + ny * ny);
}
function lenSlow(x,y,x1,y1){
var nx = x1 - x;
var ny = y1 - y;
return Math.hypot(nx,ny);
}
function lenEmpty(x,y,x1,y1){
return x;
}
// Test functions in same scope as above. None is in global scope
// Each function is copied 4 time and tests are performed randomly.
// c += length(... to ensure all code is executed.
// bigCount += c to ensure whole function is executed.
// 4 lines for each function to reduce a inlining skew
// all values are randomly generated doubles
// each function call returns a different result.
tests : [{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += length(a1,a2,a3,a4);
c += length(a2,a3,a4,a1);
c += length(a3,a4,a1,a2);
c += length(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "length64",
},{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += lengthF(a1,a2,a3,a4);
c += lengthF(a2,a3,a4,a1);
c += lengthF(a3,a4,a1,a2);
c += lengthF(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "length32",
},{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += len(a1,a2,a3,a4);
c += len(a2,a3,a4,a1);
c += len(a3,a4,a1,a2);
c += len(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "length JS",
},{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += lenSlow(a1,a2,a3,a4);
c += lenSlow(a2,a3,a4,a1);
c += lenSlow(a3,a4,a1,a2);
c += lenSlow(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "Length JS Slow",
},{
func : function (){
var i,c=0,a1,a2,a3,a4;
for (i = 0; i < 10000; i += 1) {
a1 = a[i];
a2 = a[i+1];
a3 = a[i+2];
a4 = a[i+3];
c += lenEmpty(a1,a2,a3,a4);
c += lenEmpty(a2,a3,a4,a1);
c += lenEmpty(a3,a4,a1,a2);
c += lenEmpty(a4,a1,a2,a3);
}
bigCount = (bigCount + c) % 1000;
},
name : "Empty",
}
],
Results from update.
Because there is a lot more overhead in the test the results are closer but the JS code is still two orders of magnitude faster.
Note how slow the function Math.hypot is. If optimisation was in effect that function would be near the faster len function.
WebAssembly 13389µs
Javascript 728µs
/*
=======================================
Performance test. : WebAssm V Javascript
Use strict....... : true
Data view........ : false
Duplicates....... : 4
Cycles........... : 147
Samples per cycle : 100
Tests per Sample. : undefined
---------------------------------------------
Test : 'length64'
Mean : 12736µs ±69µs (*) 3013 samples
---------------------------------------------
Test : 'length32'
Mean : 13389µs ±94µs (*) 2914 samples
---------------------------------------------
Test : 'length JS'
Mean : 728µs ±6µs (*) 2906 samples
---------------------------------------------
Test : 'Length JS Slow'
Mean : 23374µs ±191µs (*) 2939 samples << This function use Math.hypot
rather than Math.sqrt
---------------------------------------------
Test : 'Empty'
Mean : 79µs ±2µs (*) 2928 samples
-All ----------------------------------------
Mean : 10.097ms Totals time : 148431.200ms 14700 samples
(*) Error rate approximation does not represent the variance.
*/
Whats the point of WebAssambly if it does not optimise
End of update
All the stuff related to the problem.
Find length of a line.
Original source in custom language
// declare func the < indicates export name, the param with types and return type
func <lengthF(float x, float y, float x1, float y1) float {
float nx, ny, dist; // declare locals float is f32
nx = x1 - x;
ny = y1 - y;
dist = sqrt(ny * ny + nx * nx);
return dist;
}
// and as double
func <length(double x, double y, double x1, double y1) double {
double nx, ny, dist;
nx = x1 - x;
ny = y1 - y;
dist = sqrt(ny * ny + nx * nx);
return dist;
}
Code compiles to Wat for proof read
(module
(func
(export "lengthF")
(param f32 f32 f32 f32)
(result f32)
(local f32 f32 f32)
get_local 2
get_local 0
f32.sub
set_local 4
get_local 3
get_local 1
f32.sub
tee_local 5
get_local 5
f32.mul
get_local 4
get_local 4
f32.mul
f32.add
f32.sqrt
)
(func
(export "length")
(param f64 f64 f64 f64)
(result f64)
(local f64 f64 f64)
get_local 2
get_local 0
f64.sub
set_local 4
get_local 3
get_local 1
f64.sub
tee_local 5
get_local 5
f64.mul
get_local 4
get_local 4
f64.mul
f64.add
f64.sqrt
)
)
As compiled wasm in hex string (Note does not include name section) and loaded using WebAssembly.compile. Exported functions then run against Javascript function len (in below snippet)
// hex of above without the name section
const asm = `0061736d0100000001110260047d7d7d7d017d60047c7c7c7c017c0303020001071402076c656e677468460000066c656e67746800010a3b021c01037d2002200093210420032001932205200594200420049492910b1c01037c20022000a1210420032001a122052005a220042004a2a09f0b`
const bin = new Uint8Array(asm.length >> 1);
for(var i = 0; i < asm.length; i+= 2){ bin[i>>1] = parseInt(asm.substr(i,2),16) }
var length,lengthF;
WebAssembly.compile(bin).then(module => {
const wasmInstance = new WebAssembly.Instance(module, {});
lengthF = wasmInstance.exports.lengthF;
length = wasmInstance.exports.length;
});
// test values are const (same result if from array or literals)
const a1 = rand(-100000,100000);
const a2 = rand(-100000,100000);
const a3 = rand(-100000,100000);
const a4 = rand(-100000,100000);
// javascript version of function
function len(x,y,x1,y1){
var nx = x1 - x;
var ny = y1 - y;
return Math.sqrt(nx * nx + ny * ny);
}
And the test code is the same for all 3 functions and run in strict mode.
tests : [{
func : function (){
var i;
for (i = 0; i < 100000; i += 1) {
length(a1,a2,a3,a4);
}
},
name : "length64",
},{
func : function (){
var i;
for (i = 0; i < 100000; i += 1) {
lengthF(a1,a2,a3,a4);
}
},
name : "length32",
},{
func : function (){
var i;
for (i = 0; i < 100000; i += 1) {
len(a1,a2,a3,a4);
}
},
name : "lengthNative",
}
]
The test results on FireFox are
/*
=======================================
Performance test. : WebAssm V Javascript
Use strict....... : true
Data view........ : false
Duplicates....... : 4
Cycles........... : 34
Samples per cycle : 100
Tests per Sample. : undefined
---------------------------------------------
Test : 'length64'
Mean : 26359µs ±128µs (*) 1128 samples
---------------------------------------------
Test : 'length32'
Mean : 27456µs ±109µs (*) 1144 samples
---------------------------------------------
Test : 'lengthNative'
Mean : 106µs ±2µs (*) 1128 samples
-All ----------------------------------------
Mean : 18.018ms Totals time : 61262.240ms 3400 samples
(*) Error rate approximation does not represent the variance.
*/

Andreas describes a number of good reasons why the JavaScript implementation was initially observed to be x300 faster. However, there are a number of other issues with your code.
This is a classic 'micro benchmark', i.e. the code that you are testing is so small, that the other overheads within your test loop are a significant factor. For example, there is an overhead in calling WebAssembly from JavaScript, which will factor in your results. What are you trying to measure? raw processing speed? or the overhead of the language boundary?
Your results vary wildly, from x300 to x2, due to small changes in your test code. Again, this is a micro benchmark issue. Others have seen the same when using this approach to measure performance, for example this post claims wasm is x84 faster, which is clearly wrong!
The current WebAssembly VM is very new, and an MVP. It will get faster. Your JavaScript VM has had 20 years to reach its current speed. The performance of the JS <=> wasm boundary is being worked on and optimised right now.
For a more definitive answer, see the joint paper from the WebAssembly team, which outlines an expected runtime performance gain of around 30%
Finally, to answer your point:
Whats the point of WebAssembly if it does not optimise
I think you have misconceptions around what WebAssembly will do for you. Based on the paper above, the runtime performance optimisations are quite modest. However, there are still a number of performance advantages:
Its compact binary format mean and low level nature means the browser can load, parse and compile the code much faster than JavaScript. It is anticipated that WebAssembly can be compiled faster than your browser can download it.
WebAssembly has a predictable runtime performance. With JavaScript the performance generally increases with each iteration as it is further optimised. It can also decrease due to se-optimisation.
There are also a number of non-performance related advantages too.
For a more realistic performance measurement, take a look at:
Its use within Figma
Results from using it with PDFKit
Both are practical, production codebases.

The JS engine can apply a lot of dynamic optimisations to this example:
Perform all calculations with integers and only convert to double for the final call to Math.sqrt.
Inline the call to the len function.
Hoist the computation out of the loop, since it always computes the same thing.
Recognise that the loop is left empty and eliminate it entirely.
Recognise that the result is never returned from the testing function, and hence remove the entire body of the test function.
All but (4) apply even if you add the result of every call. With (5) the end result is an empty function either way.
With Wasm an engine cannot do most of these steps, because it cannot inline across language boundaries (at least no engine does that today, AFAICT). Also, for Wasm it is assumed that the producing (offline) compiler has already performed relevant optimisations, so a Wasm JIT tends to be less aggressive than one for JavaScript, where static optimisation is impossible.

Serious answer
It seemed like
WebAssembly is far from a ready technology.
actually did play a role in this, and performance of calling WASM from JS in Firefox was improved in late 2018.
Running your benchmarks in a current FF/Chromium yields results like "Calling the WASM implementation from JS is 4-10 times slower than calling the JS implementation from JS". Still, it seems like engines don't inline across WASM/JS borders, and the overhead of having to call vs. not having to call is significant (as the other answers already pointed out).
Mocking answer
Your benchmarks are all wrong. It turns out that JS is actually 8-40 times (FF, Chrome) slower than WASM. WTF, JS is soo slooow.
Do I intend to prove that? Of course (not).
First, I re-implement your benchmarking code in C:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
static double lengthC(double x, double y, double x1, double y1) {
double nx = x1 - x;
double ny = y1 - y;
return sqrt(nx * nx + ny * ny);
}
double lengthArrayC(double* a, size_t length) {
double c = 0;
for (size_t i = 0; i < length; i++) {
double a1 = a[i + 0];
double a2 = a[i + 1];
double a3 = a[i + 2];
double a4 = a[i + 3];
c += lengthC(a1,a2,a3,a4);
c += lengthC(a2,a3,a4,a1);
c += lengthC(a3,a4,a1,a2);
c += lengthC(a4,a1,a2,a3);
}
return c;
}
#ifdef __wasm__
__attribute__((import_module("js"), import_name("len")))
double lengthJS(double x, double y, double x1, double y1);
double lengthArrayJS(double* a, size_t length) {
double c = 0;
for (size_t i = 0; i < length; i++) {
double a1 = a[i + 0];
double a2 = a[i + 1];
double a3 = a[i + 2];
double a4 = a[i + 3];
c += lengthJS(a1,a2,a3,a4);
c += lengthJS(a2,a3,a4,a1);
c += lengthJS(a3,a4,a1,a2);
c += lengthJS(a4,a1,a2,a3);
}
return c;
}
__attribute__((import_module("bench"), import_name("now")))
double now();
__attribute__((import_module("bench"), import_name("result")))
void printtime(int benchidx, double ns);
#else
void printtime(int benchidx, double ns) {
if (benchidx == 1) {
printf("C: %f ns\n", ns);
} else if (benchidx == 0) {
printf("avoid the optimizer: %f\n", ns);
} else {
fprintf(stderr, "Unknown benchmark: %d", benchidx);
exit(-1);
}
}
double now() {
struct timespec ts;
if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
} else {
return sqrt(-1);
}
}
#endif
#define iters 1000000
double a[iters+3];
int main() {
int bigCount = 0;
srand(now());
for (size_t i = 0; i < iters + 3; i++)
a[i] = (double)rand()/RAND_MAX*2e5-1e5;
for (int i = 0; i < 10; i++) {
double startTime, endTime;
double c;
startTime = now();
c = lengthArrayC(a, iters);
endTime = now();
bigCount = (bigCount + (int64_t)c) % 1000;
printtime(1, (endTime - startTime) * 1e9 / iters / 4);
#ifdef __wasm__
startTime = now();
c = lengthArrayJS(a, iters);
endTime = now();
bigCount = (bigCount + (int64_t)c) % 1000;
printtime(2, (endTime - startTime) * 1e9 / iters / 4);
#endif
}
printtime(0, bigCount);
return 0;
}
Compile it with clang 12.0.1:
clang -O3 -target wasm32-wasi --sysroot /opt/wasi-sdk/wasi-sysroot/ foo2.c -o foo2.wasm
And provide it with a length function from JS via imports:
"use strict";
(async (wasm) => {
const wasmbytes = new Uint8Array(wasm.length);
for (var i in wasm)
wasmbytes[i] = wasm.charCodeAt(i);
(await WebAssembly.instantiate(wasmbytes, {
js: {
len: function (x,y,x1,y1) {
var nx = x1 - x;
var ny = y1 - y;
return Math.sqrt(nx * nx + ny * ny);
}
},
bench: {
now: () => window.performance.now() / 1e3,
result: (bench, ns) => {
let name;
if (bench == 1) { name = "C" }
else if (bench == 2) { name = "JS" }
else if (bench == 0) { console.log("Optimizer confuser: " + ns); /*not really necessary*/; return; }
else { throw "unknown bench"; }
console.log(name + ": " + ns + " ns");
},
},
})).instance.exports._start();
})(atob('AGFzbQEAAAABFQRgBHx8fHwBfGAAAXxgAn98AGAAAAIlAwJqcwNsZW4AAAViZW5jaANub3cAAQViZW5jaAZyZXN1bHQAAgMCAQMFAwEAfAcTAgZtZW1vcnkCAAZfc3RhcnQAAwr2BAHzBAMIfAJ/An5BmKzoAwJ/EAEiA0QAAAAAAADwQWMgA0QAAAAAAAAAAGZxBEAgA6sMAQtBAAtBAWutNwMAQejbl3whCANAQZis6ANBmKzoAykDAEKt/tXk1IX9qNgAfkIBfCIKNwMAIAhBmKzoA2ogCkIhiKe3RAAAwP///99Bo0QAAAAAAGoIQaJEAAAAAABq+MCgOQMAIAhBCGoiCA0ACwNAEAEhBkGQCCsDACEBQYgIKwMAIQRBgAgrAwAhAEQAAAAAAAAAACECQRghCANAIAQhAyABIgQgAKEiASABoiIHIAMgCEGACGorAwAiAaEiBSAFoiIFoJ8gACAEoSIAIACiIgAgBaCfIAAgASADoSIAIACiIgCgnyACIAcgAKCfoKCgoCECIAMhACAIQQhqIghBmKToA0cNAAtBARABIAahRAAAAABlzc1BokQAAAAAgIQuQaNEAAAAAAAA0D+iEAICfiACmUQAAAAAAADgQ2MEQCACsAwBC0KAgICAgICAgIB/CyALfEQAAAAAAAAAACECQYDcl3whCBABIQMDQCACIAhBgKzoA2orAwAiBSAIQYis6ANqKwMAIgEgCEGQrOgDaisDACIAIAhBmKzoA2orAwAiBBAAoCABIAAgBCAFEACgIAAgBCAFIAEQAKAgBCAFIAEgABAAoCECIAhBCGoiCA0AC0ECEAEgA6FEAAAAAGXNzUGiRAAAAACAhC5Bo0QAAAAAAADQP6IQAkLoB4EhCgJ+IAKZRAAAAAAAAOBDYwRAIAKwDAELQoCAgICAgICAgH8LIAp8QugHgSELIAlBAWoiCUEKRw0AC0EAIAuntxACCwB2CXByb2R1Y2VycwEMcHJvY2Vzc2VkLWJ5AQVjbGFuZ1YxMS4wLjAgKGh0dHBzOi8vZ2l0aHViLmNvbS9sbHZtL2xsdm0tcHJvamVjdCAxNzYyNDliZDY3MzJhODA0NGQ0NTcwOTJlZDkzMjc2ODcyNGE2ZjA2KQ=='))
Now, calling the JS function from WASM is unsurprisingly a lot slower than calling the WASM function from WASM. (In fact, WASM→WASM it isn't calling. You can see the f64.sqrt being inlined into _start.)
(One last interesting datapoint is that WASM→WASM and JS→JS seem to have about the same cost (about 1.5 ns per inlined length(…) on my E3-1280). Disclaimer: It's entirely possible that my benchmark is even more broken than the original question.)
Conclusion
WASM isn't slow, crossing the border is. For now and the foreseeable future, don't put things into WASM unless they're a significant computational task. (And even then, it depends. Sometimes, JS engines are really smart. Sometimes.)

How to rotate (transpose) a 2D array diagonally (switch axes) in Javascript [duplicate]

I have a matrix (relatively big) that I need to transpose. For example assume that my matrix is
a b c d e f
g h i j k l
m n o p q r
I want the result be as follows:
a g m
b h n
c I o
d j p
e k q
f l r
What is the fastest way to do this?

This is a good question. There are many reason you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.
First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)
void transpose(float *src, float *dst, const int N, const int M) {
#pragma omp parallel for
for(int n = 0; n<N*M; n++) {
int i = n/N;
int j = n%N;
dst[n] = src[M*j + i];
}
}
Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.
for(int i=0; i<N; i++) {
for(int j=0; j<K; j++) {
float tmp = 0;
for(int l=0; l<M; l++) {
tmp += A[M*i+l]*B[K*l+j];
}
C[K*i + j] = tmp;
}
}
That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first
transpose(B);
for(int i=0; i<N; i++) {
for(int j=0; j<K; j++) {
float tmp = 0;
for(int l=0; l<M; l++) {
tmp += A[M*i+l]*B[K*j+l];
}
C[K*i + j] = tmp;
}
}
transpose(B);
Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication loop tiling is even more effective than taking the transpose but that's much more complicated.
I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case but I could image gathering a column and writing out a row. Maybe it will make the transpose unnecessary.
For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is
Smear image horizontally
transpose output
Smear output horizontally
transpose output
Here is a paper by Intel explaining that
http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions
Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use
void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
#pragma omp parallel for
for(int n=0; n<M*N; n++) {
int k = vec_size*(n/N/vec_size);
int i = (n/vec_size)%N;
int j = n%vec_size;
B[n] = A[M*i + k + j];
}
}
EDIT:
I tried several function to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).
inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
#pragma omp parallel for
for(int i=0; i<block_size; i++) {
for(int j=0; j<block_size; j++) {
B[j*ldb + i] = A[i*lda +j];
}
}
}
inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
#pragma omp parallel for
for(int i=0; i<n; i+=block_size) {
for(int j=0; j<m; j+=block_size) {
transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
}
}
}
The values lda and ldb are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this
#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);
float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
For 3000x1001 this returns ldb = 3008 and lda = 1008
Edit:
I found an even faster solution using SSE intrinsics:
inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
__m128 row1 = _mm_load_ps(&A[0*lda]);
__m128 row2 = _mm_load_ps(&A[1*lda]);
__m128 row3 = _mm_load_ps(&A[2*lda]);
__m128 row4 = _mm_load_ps(&A[3*lda]);
_MM_TRANSPOSE4_PS(row1, row2, row3, row4);
_mm_store_ps(&B[0*ldb], row1);
_mm_store_ps(&B[1*ldb], row2);
_mm_store_ps(&B[2*ldb], row3);
_mm_store_ps(&B[3*ldb], row4);
}
inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
#pragma omp parallel for
for(int i=0; i<n; i+=block_size) {
for(int j=0; j<m; j+=block_size) {
int max_i2 = i+block_size < n ? i + block_size : n;
int max_j2 = j+block_size < m ? j + block_size : m;
for(int i2=i; i2<max_i2; i2+=4) {
for(int j2=j; j2<max_j2; j2+=4) {
transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
}
}
}
}
}

This is going to depend on your application but in general the fastest way to transpose a matrix would be to invert your coordinates when you do a look up, then you do not have to actually move any data.

Some details about transposing 4x4 square float (I will discuss 32-bit integer later) matrices with x86 hardware. It's helpful to start here in order to transpose larger square matrices such as 8x8 or 16x16.
_MM_TRANSPOSE4_PS(r0, r1, r2, r3) is implemented differently by different compilers. GCC and ICC (I have not checked Clang) use unpcklps, unpckhps, unpcklpd, unpckhpd whereas MSVC uses only shufps. We can actually combine these two approaches together like this.
t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);
r0 = _mm_shuffle_ps(t0,t2, 0x44);
r1 = _mm_shuffle_ps(t0,t2, 0xEE);
r2 = _mm_shuffle_ps(t1,t3, 0x44);
r3 = _mm_shuffle_ps(t1,t3, 0xEE);
One interesting observation is that two shuffles can be converted to one shuffle and two blends (SSE4.1) like this.
t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);
v = _mm_shuffle_ps(t0,t2, 0x4E);
r0 = _mm_blend_ps(t0,v, 0xC);
r1 = _mm_blend_ps(t2,v, 0x3);
v = _mm_shuffle_ps(t1,t3, 0x4E);
r2 = _mm_blend_ps(t1,v, 0xC);
r3 = _mm_blend_ps(t3,v, 0x3);
This effectively converted 4 shuffles into 2 shuffles and 4 blends. This uses 2 more instructions than the implementation of GCC, ICC, and MSVC. The advantage is that it reduces port pressure which may have a benefit in some circumstances.
Currently all the shuffles and unpacks can go only to one particular port whereas the blends can go to either of two different ports.
I tried using 8 shuffles like MSVC and converting that into 4 shuffles + 8 blends but it did not work. I still had to use 4 unpacks.
I used this same technique for a 8x8 float transpose (see towards the end of that answer).
https://stackoverflow.com/a/25627536/2542702. In that answer I still had to use 8 unpacks but I manged to convert the 8 shuffles into 4 shuffles and 8 blends.
For 32-bit integers there is nothing like shufps (except for 128-bit shuffles with AVX512) so it can only be implemented with unpacks which I don't think can be convert to blends (efficiently). With AVX512 vshufi32x4 acts effectively like shufps except for 128-bit lanes of 4 integers instead of 32-bit floats so this same technique might be possibly with vshufi32x4 in some cases. With Knights Landing shuffles are four times slower (throughput) than blends.

If the size of the arrays are known prior then we could use the union to our help. Like this-
#include <bits/stdc++.h>
using namespace std;
union ua{
int arr[2][3];
int brr[3][2];
};
int main() {
union ua uav;
int karr[2][3] = {{1,2,3},{4,5,6}};
memcpy(uav.arr,karr,sizeof(karr));
for (int i=0;i<3;i++)
{
for (int j=0;j<2;j++)
cout<<uav.brr[i][j]<<" ";
cout<<'\n';
}
return 0;
}

Consider each row as a column, and each column as a row .. use j,i instead of i,j
demo: http://ideone.com/lvsxKZ
#include <iostream>
using namespace std;
int main ()
{
char A [3][3] =
{
{ 'a', 'b', 'c' },
{ 'd', 'e', 'f' },
{ 'g', 'h', 'i' }
};
cout << "A = " << endl << endl;
// print matrix A
for (int i=0; i<3; i++)
{
for (int j=0; j<3; j++) cout << A[i][j];
cout << endl;
}
cout << endl << "A transpose = " << endl << endl;
// print A transpose
for (int i=0; i<3; i++)
{
for (int j=0; j<3; j++) cout << A[j][i];
cout << endl;
}
return 0;
}

transposing without any overhead (class not complete):
class Matrix{
double *data; //suppose this will point to data
double _get1(int i, int j){return data[i*M+j];} //used to access normally
double _get2(int i, int j){return data[j*N+i];} //used when transposed
public:
int M, N; //dimensions
double (*get_p)(int, int); //functor to access elements
Matrix(int _M,int _N):M(_M), N(_N){
//allocate data
get_p=&Matrix::_get1; // initialised with normal access
}
double get(int i, int j){
//there should be a way to directly use get_p to call. but i think even this
//doesnt incur overhead because it is inline and the compiler should be intelligent
//enough to remove the extra call
return (this->*get_p)(i,j);
}
void transpose(){ //twice transpose gives the original
if(get_p==&Matrix::get1) get_p=&Matrix::_get2;
else get_p==&Matrix::_get1;
swap(M,N);
}
}
can be used like this:
Matrix M(100,200);
double x=M.get(17,45);
M.transpose();
x=M.get(17,45); // = original M(45,17)
of course I didn't bother with the memory management here, which is crucial but different topic.

template <class T>
void transpose( const std::vector< std::vector<T> > & a,
std::vector< std::vector<T> > & b,
int width, int height)
{
for (int i = 0; i < width; i++)
{
for (int j = 0; j < height; j++)
{
b[j][i] = a[i][j];
}
}
}

Modern linear algebra libraries include optimized versions of the most common operations. Many of them include dynamic CPU dispatch, which chooses the best implementation for the hardware at program execution time (without compromising on portability).
This is commonly a better alternative to performing manual optimization of your functinos via vector extensions intrinsic functions. The latter will tie your implementation to a particular hardware vendor and model: if you decide to swap to a different vendor (e.g. Power, ARM) or to a newer vector extensions (e.g. AVX512), you will need to re-implement it again to get the most of them.
MKL transposition, for example, includes the BLAS extensions function imatcopy. You can find it in other implementations such as OpenBLAS as well:
#include <mkl.h>
void transpose( float* a, int n, int m ) {
const char row_major = 'R';
const char transpose = 'T';
const float alpha = 1.0f;
mkl_simatcopy (row_major, transpose, n, m, alpha, a, n, n);
}
For a C++ project, you can make use of the Armadillo C++:
#include <armadillo>
void transpose( arma::mat &matrix ) {
arma::inplace_trans(matrix);
}

intel mkl suggests in-place and out-of-place transposition/copying matrices. here is the link to the documentation. I would recommend trying out of place implementation as faster ten in-place and into the documentation of the latest version of mkl contains some mistakes.

I think that most fast way should not taking higher than O(n^2) also in this way you can use just O(1) space :
the way to do that is to swap in pairs because when you transpose a matrix then what you do is: M[i][j]=M[j][i] , so store M[i][j] in temp, then M[i][j]=M[j][i],and the last step : M[j][i]=temp. this could be done by one pass so it should take O(n^2)

my answer is transposed of 3x3 matrix
#include<iostream.h>
#include<math.h>
main()
{
int a[3][3];
int b[3];
cout<<"You must give us an array 3x3 and then we will give you Transposed it "<<endl;
for(int i=0;i<3;i++)
{
for(int j=0;j<3;j++)
{
cout<<"Enter a["<<i<<"]["<<j<<"]: ";
cin>>a[i][j];
}
}
cout<<"Matrix you entered is :"<<endl;
for (int e = 0 ; e < 3 ; e++ )
{
for ( int f = 0 ; f < 3 ; f++ )
cout << a[e][f] << "\t";
cout << endl;
}
cout<<"\nTransposed of matrix you entered is :"<<endl;
for (int c = 0 ; c < 3 ; c++ )
{
for ( int d = 0 ; d < 3 ; d++ )
cout << a[d][c] << "\t";
cout << endl;
}
return 0;
}

Different output of binary operations on the same machine

Executing this JavaScript code in Safari
// expected output - array containing 32 bit words
b = "a";
var a = Array((b.length+3) >> 2);
for (var i = 0; i < b.length; i++) a[i>>2] |= (b.charCodeAt(i) << (24-(i & 3)*8));
and this (Objective-)C code in iOS Simulator
int array[((#"a".length + 3) >> 2)];
for (int i = 0; i < #"a".length; i++) {
int c = (int) [#"a" characterAtIndex:i];
array[i>>2] |= (c << (24-((i & 3)*8)));
}
gives me different output - consecutively (JavaScript) 1627389952 and (Objective-C) 1627748484.
Since the first four digits are always the same I think that the error is connected with precision but I cannot spot the issue.
EDIT
Sorry for this lack of attention and thank you very much (#Joni and all of you guys). You were right that the array in C code is fullfilled with some random values. I solved the issue setting all elements in the array to zero:
memset(array, 0, sizeof(array));
If anyone is curious the C code looks like this now:
int array[((#"a".length + 3) >> 2)];
memset(array, 0, sizeof(array));
for (int i = 0; i < #"a".length; i++) {
int c = (int) [#"a" characterAtIndex:i];
array[i>>2] |= (c << (24-((i & 3)*8)));
}

I don't know how Objective-c initializes arrays but in javascript
they are not initialized to anything (in fact, the indices don't even exist), so take care of that at least:
var b = "a";
var a = Array((b.length + 3) >> 2);
for( var i = 0, len = a.length; i < len; ++i ) {
a[i] = 0; //initialize a values to 0
}
for (var i = 0; i < b.length; i++) {
a[i >> 2] |= (b.charCodeAt(i) << (24 - (i & 3) * 8));
}
Secondly, this effectively should calculate 97 << 24, for which the correct
answer is 1627389952, so the Objective-C result is wrong. Probably because
the array values are not initialized to 0?

You are not setting the array to zeros in objective c, so it may have some random garbage to start with.

How do I create bit array in Javascript?

What is the best way of implementing a bit array in JavaScript?

Here's one I whipped up:
UPDATE - something about this class had been bothering me all day - it wasn't size based - creating a BitArray with N slots/bits was a two step operation - instantiate, resize. Updated the class to be size based with an optional second paramter for populating the size based instance with either array values or a base 10 numeric value.
(Fiddle with it here)
/* BitArray DataType */
// Constructor
function BitArray(size, bits) {
// Private field - array for our bits
this.m_bits = new Array();
//.ctor - initialize as a copy of an array of true/false or from a numeric value
if (bits && bits.length) {
for (var i = 0; i < bits.length; i++)
this.m_bits.push(bits[i] ? BitArray._ON : BitArray._OFF);
} else if (!isNaN(bits)) {
this.m_bits = BitArray.shred(bits).m_bits;
}
if (size && this.m_bits.length != size) {
if (this.m_bits.length < size) {
for (var i = this.m_bits.length; i < size; i++) {
this.m_bits.push(BitArray._OFF);
}
} else {
for(var i = size; i > this.m_bits.length; i--){
this.m_bits.pop();
}
}
}
}
/* BitArray PUBLIC INSTANCE METHODS */
// read-only property - number of bits
BitArray.prototype.getLength = function () { return this.m_bits.length; };
// accessor - get bit at index
BitArray.prototype.getAt = function (index) {
if (index < this.m_bits.length) {
return this.m_bits[index];
}
return null;
};
// accessor - set bit at index
BitArray.prototype.setAt = function (index, value) {
if (index < this.m_bits.length) {
this.m_bits[index] = value ? BitArray._ON : BitArray._OFF;
}
};
// resize the bit array (append new false/0 indexes)
BitArray.prototype.resize = function (newSize) {
var tmp = new Array();
for (var i = 0; i < newSize; i++) {
if (i < this.m_bits.length) {
tmp.push(this.m_bits[i]);
} else {
tmp.push(BitArray._OFF);
}
}
this.m_bits = tmp;
};
// Get the complimentary bit array (i.e., 01 compliments 10)
BitArray.prototype.getCompliment = function () {
var result = new BitArray(this.m_bits.length);
for (var i = 0; i < this.m_bits.length; i++) {
result.setAt(i, this.m_bits[i] ? BitArray._OFF : BitArray._ON);
}
return result;
};
// Get the string representation ("101010")
BitArray.prototype.toString = function () {
var s = new String();
for (var i = 0; i < this.m_bits.length; i++) {
s = s.concat(this.m_bits[i] === BitArray._ON ? "1" : "0");
}
return s;
};
// Get the numeric value
BitArray.prototype.toNumber = function () {
var pow = 0;
var n = 0;
for (var i = this.m_bits.length - 1; i >= 0; i--) {
if (this.m_bits[i] === BitArray._ON) {
n += Math.pow(2, pow);
}
pow++;
}
return n;
};
/* STATIC METHODS */
// Get the union of two bit arrays
BitArray.getUnion = function (bitArray1, bitArray2) {
var len = BitArray._getLen(bitArray1, bitArray2, true);
var result = new BitArray(len);
for (var i = 0; i < len; i++) {
result.setAt(i, BitArray._union(bitArray1.getAt(i), bitArray2.getAt(i)));
}
return result;
};
// Get the intersection of two bit arrays
BitArray.getIntersection = function (bitArray1, bitArray2) {
var len = BitArray._getLen(bitArray1, bitArray2, true);
var result = new BitArray(len);
for (var i = 0; i < len; i++) {
result.setAt(i, BitArray._intersect(bitArray1.getAt(i), bitArray2.getAt(i)));
}
return result;
};
// Get the difference between to bit arrays
BitArray.getDifference = function (bitArray1, bitArray2) {
var len = BitArray._getLen(bitArray1, bitArray2, true);
var result = new BitArray(len);
for (var i = 0; i < len; i++) {
result.setAt(i, BitArray._difference(bitArray1.getAt(i), bitArray2.getAt(i)));
}
return result;
};
// Convert a number into a bit array
BitArray.shred = function (number) {
var bits = new Array();
var q = number;
do {
bits.push(q % 2);
q = Math.floor(q / 2);
} while (q > 0);
return new BitArray(bits.length, bits.reverse());
};
/* BitArray PRIVATE STATIC CONSTANTS */
BitArray._ON = 1;
BitArray._OFF = 0;
/* BitArray PRIVATE STATIC METHODS */
// Calculate the intersection of two bits
BitArray._intersect = function (bit1, bit2) {
return bit1 === BitArray._ON && bit2 === BitArray._ON ? BitArray._ON : BitArray._OFF;
};
// Calculate the union of two bits
BitArray._union = function (bit1, bit2) {
return bit1 === BitArray._ON || bit2 === BitArray._ON ? BitArray._ON : BitArray._OFF;
};
// Calculate the difference of two bits
BitArray._difference = function (bit1, bit2) {
return bit1 === BitArray._ON && bit2 !== BitArray._ON ? BitArray._ON : BitArray._OFF;
};
// Get the longest or shortest (smallest) length of the two bit arrays
BitArray._getLen = function (bitArray1, bitArray2, smallest) {
var l1 = bitArray1.getLength();
var l2 = bitArray2.getLength();
return l1 > l2 ? smallest ? l2 : l1 : smallest ? l2 : l1;
};
CREDIT TO #Daniel Baulig for asking for the refactor from quick and dirty to prototype based.

I don't know about bit arrays, but you can make byte arrays easy with new features.
Look up typed arrays. I've used these in both Chrome and Firefox. The important one is Uint8Array.
To make an array of 512 uninitialized bytes:
var arr = new UintArray(512);
And accessing it (the sixth byte):
var byte = arr[5];
For node.js, use Buffer (server-side).
EDIT:
To access individual bits, use bit masks.
To get the bit in the one's position, do num & 0x1

The Stanford Javascript Crypto Library (SJCL) provides a Bit Array implementation and can convert different inputs (Hex Strings, Byte Arrays, etc.) to Bit Arrays.
Their code is public on GitHub: bitwiseshiftleft/sjcl. So if you lookup bitArray.js, you can find their bit array implementation.
A conversion from bytes to bits can be found here.

Something like this is as close as I can think of. Saves bit arrays as 32 bit numbers, and has a standard array backing it to handle larger sets.
class bitArray {
constructor(length) {
this.backingArray = Array.from({length: Math.ceil(length/32)}, ()=>0)
this.length = length
}
get(n) {
return (this.backingArray[n/32|0] & 1 << n % 32) > 0
}
on(n) {
this.backingArray[n/32|0] |= 1 << n % 32
}
off(n) {
this.backingArray[n/32|0] &= ~(1 << n % 32)
}
toggle(n) {
this.backingArray[n/32|0] ^= 1 << n % 32
}
forEach(callback) {
this.backingArray.forEach((number, container)=>{
const max = container == this.backingArray.length-1 ? this.length%32 : 32
for(let x=0; x<max; x++) {
callback((number & 1<<x)>0, 32*container+x)
}
})
}
}
let bits = new bitArray(10)
bits.get(2) //false
bits.on(2)
bits.get(2) //true
bits.forEach(console.log)
/* outputs:
false
false
true
false
false
false
false
false
false
false
*/
bits.toggle(2)
bits.forEach(console.log)
/* outputs:
false
false
false
false
false
false
false
false
false
false
*/
bits.toggle(0)
bits.toggle(1)
bits.toggle(2)
bits.off(2)
bits.off(3)
bits.forEach(console.log)
/* outputs:
true
true
false
false
false
false
false
false
false
false
*/

2022
As can be seen from past answers and comments, the question of "implementing a bit array" can be understood in two different (non-exclusive) ways:
an array that takes 1-bit in memory for each entry
an array on which bitwise operations can be applied
As #beatgammit points out, ecmascript specifies typed arrays, but bit arrays are not part of it. I have just published #bitarray/typedarray, an implementation of typed arrays for bits, that emulates native typed arrays and takes 1 bit in memory for each entry.
Because it reproduces the behaviour of native typed arrays, it does not include any bitwise operations though. So, I have also published #bitarray/es6, which extends the previous with bitwise operations.
I wouldn't debate what is the best way of implementing bit array, as per the asked question, because "best" could be argued at length, but those are certainly some way of implementing bit arrays, with the benefit that they behave like native typed arrays.
import BitArray from "#bitarray/es6"
const bits1 = BitArray.from("11001010");
const bits2 = BitArray.from("10111010");
for (let bit of bits1.or(bits2)) console.log(bit) // 1 1 1 1 1 0 1 0

You can easily do that by using bitwise operators. It's quite simple.
Let's try with the number 75.
Its representation in binary is 100 1011. So, how do we obtain each bit from the number?
You can use an AND "&" operator to select one bit and set the rest of them to 0. Then with a Shift operator, you remove the rest of 0 that doesn't matter at the moment.
Example:
Let's do an AND operation with 4 (000 0010)
0100 1011 & 0000 0010 => 0000 0010
Now we need to filter the selected bit, in this case, was the second-bit reading right to left.
0000 0010 >> 1 => 1
The zeros on the left are no representative. So the output will be the bit we selected, in this case, the second one.
var word=75;
var res=[];
for(var x=7; x>=0; x--){
res.push((word&Math.pow(2,x))>>x);
}
console.log(res);
The output:
Expected:
In case you need more than a simple number, you can apply the same function for a byte. Let's say you have a file with multiple bytes. So, you can decompose that file in a ByteArray, then each byte in the array in a BitArray.
Good luck!

#Commi's implementation is what I ended up using .
I believe there is a bug in this implementation. Bits on every 31st boundary give the wrong result. (ie when index is (32 * index - 1), so 31, 63, 95 etc.
I fixed it in the get() method by replacing > 0 with != 0.
get(n) {
return (this.backingArray[n/32|0] & 1 << n % 32) != 0
}
The reason for the bug is that the ints are 32-bit signed. Shifting 1 left by 31 gets you a negative number. Since the check is for >0, this will be false when it should be true.
I wrote a program to prove the bug before, and the fix after. Will post it running out of space.
for (var i=0; i < 100; i++) {
var ar = new bitArray(1000);
ar.on(i);
for(var j=0;j<1000;j++) {
// we should have TRUE only at one position and that is "i".
// if something is true when it should be false or false when it should be true, then report it.
if(ar.get(j)) {
if (j != i) console.log('we got a bug at ' + i);
}
if (!ar.get(j)) {
if (j == i) console.log('we got a bug at ' + i);
}
}
}

2022
We can implement a BitArray class which behaves similar to TypedArrays by extending DataView. However in order to avoid the cost of trapping the direct accesses to the numerical properties (the indices) by using a Proxy, I believe it's best to stay in DataView domain. DataView is preferable to TypedArrays these days anyway as it's performance is highly improved in recent V8 versions (v7+).
Just like TypedArrays, BitArray will have a predetermined length at construction time. I just include a few methods in the below snippet. The popcnt property very efficiently returns the total number of 1s in BitArray. Unlike normal arrays popcnt is a highly sought after functionality for BitArrays. So much so that Web Assembly and even modern CPU's have a dedicated pop count instruction. Apart from these you can easily add methods like .forEach(), .map() etc. if need be.
class BitArray extends DataView{
constructor(n,ab){
if (n > 1.5e10) throw new Error("BitArray size can not exceed 1.5e10");
super(ab instanceof ArrayBuffer ? ab
: new ArrayBuffer(Number((BigInt(n + 31) & ~31n) >> 3n))); // Sets ArrayBuffer.byteLength to multiples of 4 bytes (32 bits)
}
get length(){
return this.buffer.byteLength << 3;
}
get popcount(){
var m1 = 0x55555555,
m2 = 0x33333333,
m4 = 0x0f0f0f0f,
h01 = 0x01010101,
pc = 0,
x;
for (var i = 0, len = this.buffer.byteLength >> 2; i < len; i++){
x = this.getUint32(i << 2);
x -= (x >> 1) & m1; //put count of each 2 bits into those 2 bits
x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
x = (x + (x >> 4)) & m4; //put count of each 8 bits into those 8 bits
pc += (x * h01) >> 56;
}
return pc;
}
// n >> 3 is Math.floor(n/8)
// n & 7 is n % 8
and(bar){
var len = Math.min(this.buffer.byteLength,bar.buffer.byteLength),
res = new BitArray(len << 3);
for (var i = 0; i < len; i += 4) res.setUint32(i,this.getUint32(i) & bar.getUint32(i));
return res;
}
at(n){
return this.getUint8(n >> 3) & (1 << (n & 7)) ? 1 : 0;
}
or(bar){
var len = Math.min(this.buffer.byteLength,bar.buffer.byteLength),
res = new BitArray(len << 3);
for (var i = 0; i < len; i += 4) res.setUint32(i,this.getUint32(i) | bar.getUint32(i));
return res;
}
not(){
var len = this.buffer.byteLength,
res = new BitArray(len << 3);
for (var i = 0; i < len; i += 4) res.setUint32(i,~(this.getUint32(i) >> 0));
return res;
}
reset(n){
this.setUint8(n >> 3, this.getUint8(n >> 3) & ~(1 << (n & 7)));
}
set(n){
this.setUint8(n >> 3, this.getUint8(n >> 3) | (1 << (n & 7)));
}
slice(a = 0, b = this.length){
return new BitArray(b-a,this.buffer.slice(a >> 3, b >> 3));
}
toggle(n){
this.setUint8(n >> 3, this.getUint8(n >> 3) ^ (1 << (n & 7)));
}
toString(){
return new Uint8Array(this.buffer).reduce((p,c) => p + ((BigInt(c)* 0x0202020202n & 0x010884422010n) % 1023n).toString(2).padStart(8,"0"),"");
}
xor(bar){
var len = Math.min(this.buffer.byteLength,bar.buffer.byteLength),
res = new BitArray(len << 3);
for (var i = 0; i < len; i += 4) res.setUint32(i,this.getUint32(i) ^ bar.getUint32(i));
return res;
}
}
Just do like
var u = new BitArray(12);
I hope it helps.

Probably [definitely] not the most efficient way to do this, but a string of zeros and ones can be parsed as a number as a base 2 number and converted into a hexadecimal number and finally a buffer.
const bufferFromBinaryString = (binaryRepresentation = '01010101') =>
Buffer.from(
parseInt(binaryRepresentation, 2).toString(16), 'hex');
Again, not efficient; but I like this approach because of the relative simplicity.

Thanks for a wonderfully simple class that does just what I need.
I did find a couple of edge-case bugs while testing:
get(n) {
return (this.backingArray[n/32|0] & 1 << n % 32) != 0
// test of > 0 fails for bit 31
}
forEach(callback) {
this.backingArray.forEach((number, container)=>{
const max = container == this.backingArray.length-1 && this.length%32
? this.length%32 : 32;
// tricky edge-case: at length-1 when length%32 == 0,
// need full 32 bits not 0 bits
for(let x=0; x<max; x++) {
callback((number & 1<<x)!=0, 32*container+x) // see fix in get()
}
})
My final implementation fixed the above bugs and changed the backArray to be a Uint8Array instead of Array, which avoids signed int bugs.

We Keep Coding

JavaScript is the programming language of the Web.