Performance in Websockets blob decoding causing memory problems - javascript

I have a performance problem in JavaScript that has lately been causing crashes at work. With the objective of modernising our applications, we are looking into running them as web servers, to which our clients would connect via a browser (Chrome, Firefox, ...), with all our interfaces running as HTML+JS webpages.
To give you an overview of our performance needs: our applications run image processing from camera sources, in some cases at more than 20 fps, but in most cases around 2-3 fps max.
Basically, we have a web server written in C++ which handles HTTP requests and provides the user with the HTML pages of the interface and the corresponding JS scripts of the application.
In order to simplify the communication between the two applications, I then open a WebSocket between the webpage and the C++ server to send formatted messages back and forth. These messages can be pretty big, up to several MB.
It all works pretty well as long as the FPS stays relatively low. When the fps increases, one of the following two things happens.
Either the C++ web server's memory footprint increases pretty fast and it crashes when no more memory is available. After investigation, this happens when the network link is saturated and the WebSocket's send cache fills up. I think this is due to the WebSocket running over TCP: the socket must wait for one message to be sent and acknowledged before sending the next one.
Or the browser crashes after a while, showing the "Aw, Snap!" screen. In that case more or less the same thing seems to happen, but this time apparently because of the garbage collection strategy. A screenshot of the memory usage while the application is running clearly shows a saw-tooth pattern, and seems to indicate that garbage collection runs at intervals that grow further and further apart.
I have tracked the problem down to very big messages (>100 KB) being sent at a fast rate; the bigger the message, the faster it happens. In order to use each message I receive, I start a web worker and pass it the blob I received; the web worker uses a FileReaderSync to convert the message to an ArrayBuffer and passes it back to the main thread. I expect this to involve quite a lot of copies under the hood, but I am not well versed enough in JS yet to be sure of that. Also, I initially did the same thing without the web worker (using a FileReader), but the framerate and CPU usage were really bad...
Here is the code I call to decode the messages:
function OnDataMessage(msg)
{
    // Spawn a fresh worker per message -- please no comments about this,
    // it's actually a bit nicer on the CPU than reusing the same worker :-)
    var webworkerDataMessage = new Worker('/js/EDXLib/MessageDecoderEvent.js');
    webworkerDataMessage.onmessage = MessageFileReaderOnLoadComManagerCBack;
    webworkerDataMessage.onerror = ErrorHandler;
    webworkerDataMessage.postMessage(msg.data);
}

function MessageFileReaderOnLoadComManagerCBack(e)
{
    comManager.OnDataMessageReceived(e.data);
}
and the webworker code:
function DecodeMessage(msg)
{
    var retMsg = new FileReaderSync().readAsArrayBuffer(msg);
    postMessage(retMsg);
}

function receiveDecodingRequest(e)
{
    DecodeMessage(e.data);
}

addEventListener("message", receiveDecodingRequest, true);
My questions are the following:
Is there a way to make the GC not have to collect so much memory, for instance by telling some of the components I use to reuse buffers instead of recreating them, or by keeping the GC work intervals fixed? This is something I know how to do in C++, but in JS?
Is there another method I should use for my big payloads? Keep in mind that the transmission should be as fast as possible.
Is there another method for reading blob data as ArrayBuffers that would be faster than what I did?
I thank you in advance for your help/comments.

As it turns out, the memory problem was due to the new Worker line and the new FileReaderSync line in the web worker.
Removing these greatly improved performance!
Also, it turns out that this decoding operation is not necessary at all if I want to consume the WebSocket data as ArrayBuffers: I just need to set the binaryType attribute of the WebSocket to "arraybuffer"...
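For reference, a minimal sketch of that setup (the URL is a placeholder; comManager is the handler from the code above):
// Ask the browser to deliver binary frames as ArrayBuffers directly,
// which skips the Blob -> worker -> FileReaderSync round trip entirely.
var ws = new WebSocket("ws://localhost:8080/data"); // placeholder URL
ws.binaryType = "arraybuffer";
ws.onmessage = function (msg) {
    // msg.data is already an ArrayBuffer here.
    comManager.OnDataMessageReceived(msg.data);
};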
So all in all, a very simple solution to a pain in the *** problem :-)

Related

Why is non-blocking asynchronous single-threaded faster for IO than blocking multi-threaded for some applications

It helps me to understand things through real-world comparisons, in this case fast food.
In Java, with synchronous blocking I/O, I understand that each request is processed by a thread and can only be completed one at a time. It's like ordering at a drive-through: if I'm tenth in line, I have to wait for the 9 cars ahead of me. But I can open up more threads so that multiple orders are completed simultaneously.
In JavaScript you can have asynchronous, non-blocking, single-threaded execution. As I understand it, multiple requests are made and immediately accepted, but each request is processed by some background process at some later time before returning. I don't understand how this would be faster. If you order 10 burgers at the same time, the 10 orders are put in immediately, but since there is only one cook (single thread) it still takes the same time to make the 10 burgers.
I mean, I understand the reasoning for why non-blocking async single-threaded "should" be faster for some things, but the more questions I ask myself, the less I understand it.
I really don't understand how non-blocking async single-threaded can be faster than synchronous blocking multi-threaded for any type of application, including IO.
Non-blocking async single threaded is sometimes faster
That's unlikely. Where are you getting this from?
In multi-threaded synchronous I/O, this is roughly how it works:
The OS and appserver platform (e.g. a JVM) work together to create 10 threads. These are data structures represented in memory, and a scheduler running at the kernel/OS level will use these data structures to tell one of your CPU cores to 'jump to' some point in the code to run the commands it finds there.
The data structure that represents a thread contains more or less the following items:
The location in memory of the instruction we were running.
The entire 'stack'. If some function invokes a second function, then we need to remember all the local variables and the point we were at in the original function, so that when the second function 'returns', we know how to continue. E.g. your average Java program is probably ~20 methods deep, so that's 20x the local vars and 20 places in code to track. This is all done on stacks; each thread has one, and they tend to be fixed size for the entire app.
Which cache page(s) were spun up in the local cache of the core running this code.
The code in the thread is written as follows: all commands that interact with 'resources' (which are orders of magnitude slower than your CPU; think network packets, disk access, etc.) return the requested data immediately, but only if everything you asked for is already available in memory. If that is impossible, because the data you want just isn't there yet (say, the packet carrying it is still on the wire, heading to your network card), there's only one thing the code powering the 'get me network data' function can do: wait until that packet arrives and makes its way into memory.
Rather than do nothing at all, the OS and CPU work together to take the data structure that represents the thread, freeze it, find another such frozen data structure, unfreeze it, and jump to the 'where did we leave off' point in its code.
That's a 'thread switch': core A was running thread 1; now core A is running thread 2.
The thread switch involves moving a bunch of memory around: all those 'live' cache pages and that stack need to be near the core for the CPU to do its job, so the CPU has to load a bunch of pages in from main memory, which does take some time. Not a lot (nanoseconds), but not zero either. Modern CPUs can only operate on data loaded into a nearby cache page (these are ~64k to 1MB in size, no more than that; a thousand-plus times less than what your RAM sticks can store).
In single-threaded asynchronous I/O, this is roughly how it works:
There's still a thread, of course (everything runs in one), but this time the app in question doesn't multithread at all. Instead it itself creates the data structures required to track multiple incoming connections, and, crucially, the primitives used to ask for data work differently. Remember that in the synchronous case, if the code asks for the next bunch of bytes from the network connection, the thread ends up 'freezing' (telling the kernel to find some other work to do) until the data is there. In asynchronous mode, the data is returned if available; if it's not, the 'give me some data!' function still returns, but it just says: sorry bud, I have 0 new bytes for you.
The app itself then decides to go work on some other connection, and in that way a single thread can manage a bunch of connections: is there data for connection #1? Yes, great, I shall process it. No? Okay. Is there data for connection #2? And so on and so forth.
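In JavaScript that polling loop is hidden inside the runtime, but the shape of the model shows in a minimal Node.js sketch (the port number is arbitrary; message framing is left as a comment):
// One thread handles many sockets: 'data' callbacks fire as bytes
// arrive, and no thread is ever parked waiting on a single connection.
const net = require('net');

const server = net.createServer((socket) => {
    let buffered = Buffer.alloc(0); // small, per-connection state
    socket.on('data', (chunk) => {
        buffered = Buffer.concat([buffered, chunk]);
        // parse complete messages out of `buffered` here; keep the rest
    });
});

server.listen(8080);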
Note that if data arrives on, say, connection #5, then this one thread, to do the job of handling that incoming data, will presumably need to load a bunch of state info from memory, and may need to write it.
For example, let's say you are processing an image, and half of the PNG data has arrived on the wire. There's not a lot you can do with it, so this one thread creates a buffer and stores that partial PNG in it. As it then hops to another connection and later comes back, it needs to load the part of the image it already has and append the chunk that just arrived in a network packet.
This app is also causing a bunch of memory to be moved into and out of cache pages just the same, so in that sense it's not all that different: if you want to handle 100k things at once, you're inevitably going to end up moving stuff into and out of cache pages.
So what is the difference? Can you put it in fry cook terms?
Not really, no. It's all just data structures.
The key difference is in what gets moved into and out of those cache pages.
In the case of async it is exactly what the code you wrote wants to buffer. No more, no less.
In the case of synchronous, it's that 'data structure representing a thread'.
Take Java, for example: that means, at the very least, the entire stack for that thread. Depending on the -Xss parameter, that's about 128k worth of data. So if you have 100k connections to be handled simultaneously, that's 12.8GB of RAM just for those stacks!
If those incoming images really are all only about 4k in size, you could have done it with 4k buffers, for at most 0.4GB of memory, if you handrolled that by going async.
That is where the gain lies for async: by handrolling your buffers, you can't avoid moving memory into and out of cache pages, but you can ensure the chunks are smaller. And that will be faster.
Of course, to really make it faster, the buffer for storing state in the async model needs to be small (there's not much point if you need to save 128k of state before you can operate on it; that's how large those stacks were already), and you need to be handling a great many things at once (10k+ simultaneous connections).
There's a reason we don't write all our code in assembler, and a reason memory-managed languages are popular: handrolling such concerns is tedious and error-prone. You shouldn't do it unless the benefits are clear.
That's why synchronous is usually the better option, and in practice often actually faster (those OS thread schedulers are written by expert coders and tweaked extremely well; you don't stand a chance of replicating their work). The whole 'by handrolling my buffers I can reduce the number of bytes that need to be moved around a ton!' gain needs to outweigh those losses.
In addition, async is complicated as a programming model.
In async mode, you can never block. Want to do a quick DB query? That could block, so you can't do it that way; you have to write your code as: okay, fire off this job, and here's some code to run when it gets back. You can't 'wait for an answer', because in async land, waiting is not allowed.
In async mode, any time you ask for data, you need to be able to deal with getting only half of what you wanted. In synchronous mode, if you ask for 4k, you get 4k. The fact that your thread may freeze until the 4k is available is not something you need to worry about; you write your code as if the data arrives, complete, the moment you ask for it.
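A minimal sketch of that 'fire off this job' style; db.query, handleError and renderProfile are hypothetical stand-ins for whatever async driver and handlers you use:
// Async style: start the query and register code to run when it completes.
// Nothing blocks; the single thread is free to service other connections.
db.query("SELECT name FROM users WHERE id = ?", [42], (err, rows) => {
    if (err) return handleError(err); // hypothetical error handler
    renderProfile(rows[0]);           // hypothetical continuation
});
// Code here runs immediately, before the query has returned.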
Bbbuutt... fry cooks!
Look, CPU design just isn't simple enough to put in terms of a restaurant like this.
You are mentally moving the bottleneck from your process (the burger orderer) to the other process (the burger maker).
This will not make your application faster.
When considering the single-threaded async model, the real benefit is that your process is not blocked while waiting for the other process.
In other words, do not associate async with the word fast but with the word free. Free to do other work.

JavaScript WebSocket Idle time in DevTools timeline

I have an application that does the following:
WebSocket.onmessage: put the incoming bytes into a queue (an Array).
On requestAnimationFrame: flush the queue, rendering the received bytes using canvas/webgl. (A sketch of this pattern follows below.)
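Reconstructed from that description, the pattern looks roughly like this (the URL and the render function are placeholders):
var queue = [];

var ws = new WebSocket("ws://localhost:8080/feed"); // placeholder URL
ws.binaryType = "arraybuffer";
ws.onmessage = function (msg) {
    queue.push(msg.data); // just enqueue; do no work in the handler
};

function frame() {
    // Drain everything that arrived since the last frame, then draw once.
    for (var i = 0; i < queue.length; i++) {
        render(queue[i]); // placeholder canvas/webgl drawing routine
    }
    queue.length = 0;
    requestAnimationFrame(frame);
}
requestAnimationFrame(frame);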
The application receives and plots realtime data. It has a bit of jerk/jank, and while profiling my rendering code I noticed that although the actual rendering seems to execute quickly, there are large chunks of idle time during the WebSocket.onmessage handler.
I tried shrinking the window, as mentioned in Nat Duca's post, to check whether I am "GPU bound". But even with a small window, the timeline gives pretty much the same results.
What I suspect now is garbage collection. The application reads data from the WebSocket, plots it, then discards it, so it seems unsurprising that I have a saw-tooth memory profile.
So my question is now two-fold:
1) In the browser, is it even possible to avoid this memory footprint? In other languages, I know I'd be able to have a buffer that is allocated once and read from the socket straight into it. But with the WebSocket interface this doesn't seem possible; I get a newly allocated buffer of bytes that I use briefly and then no longer need.
Update: per pherris' suggestion, I removed the WebSocket from the equation, and while I see an improvement, the issue still seems to persist.
2) Is this even my main issue? Are there other things in an application like this that I can do to avoid this blocking/idle time?

Is it possible to measure rendering time in webgl using gl.finish()?

I am trying to measure the time it takes for an image in webgl to load.
I was thinking about using gl.finish() to get a timestamp before and after the image has loaded and subtracting the two to get an accurate measurement; however, I couldn't find a good example of this kind of usage.
Is this sort of thing possible, and if so, can someone provide sample code?
It is now possible to time WebGL2 executions with the EXT_disjoint_timer_query_webgl2 extension.
const ext = gl.getExtension('EXT_disjoint_timer_query_webgl2');
const query = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, query);
/* gl.draw*, etc */
gl.endQuery(ext.TIME_ELAPSED_EXT);
Then sometime later, you can get the elapsed time for your query:
const available = gl.getQueryParameter(query, gl.QUERY_RESULT_AVAILABLE);
if (available) {
    const elapsedNanos = gl.getQueryParameter(query, gl.QUERY_RESULT);
}
A couple of things to be aware of:
Only one timing query may be in progress at once.
Results may become available asynchronously. If you have more than one call to time per frame, you may consider using a query pool.
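The extension also exposes a disjoint flag: if it is set, something (a GPU clock change, for instance) interrupted the timer, and the result should be discarded:
// If a disjoint event occurred while the query was active, the
// elapsed-time result is unreliable and should be thrown away.
const disjoint = gl.getParameter(ext.GPU_DISJOINT_EXT);
if (disjoint) {
    // ignore this query's result
}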
No, it is not.
In fact, in Chrome gl.finish is just a gl.flush. See the code and search for "::finish".
Because Chrome is multi-process and actually implements security in depth, the actual GL calls are issued in another process from your JavaScript. So even if Chrome did call gl.finish, it would happen in another process, and from the point of view of JavaScript it would not be accurate for timing in any way, shape, or form. Firefox is apparently in the process of doing something similar, for similar reasons.
Even outside of Chrome, every driver handles gl.finish differently. Using gl.finish for timing is not useful because it's not representative of actual speed: it includes stalling the GPU pipeline. In other words, timing with gl.finish includes lots of overhead that wouldn't happen in real use, and so is not an accurate measurement of how fast something would execute under normal circumstances.
There are GL extensions on some GPUs to get timing info. Unfortunately they (a) are not available in WebGL and (b) are not likely ever to be, as they are not portable: they can't really work on tiled GPUs like those found in many mobile phones.
Instead of asking how to time GL calls, what specifically are you trying to achieve by timing them? Maybe people can suggest a solution for that.
Being client-based, WebGL event timings depend on the current load on the client machine (CPU load), GPU load, and the implementation of the client itself. One way to get a very rough estimate is to measure the round-trip latency from server to client using an XMLHttpRequest (http://en.wikipedia.org/wiki/XMLHttpRequest). By finding the delay between server-measured time and local time, a rough measure of load can be obtained.
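A rough sketch of such a round-trip measurement ('/ping' is a hypothetical endpoint that replies immediately):
// Timestamp before and after a tiny request; the difference is a crude
// estimate of round-trip latency plus current client/server load.
const t0 = performance.now();
const xhr = new XMLHttpRequest();
xhr.open('GET', '/ping'); // hypothetical no-op endpoint
xhr.onload = () => {
    const rtt = performance.now() - t0;
    console.log('round trip ~' + rtt.toFixed(1) + ' ms');
};
xhr.send();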

How to do worker-to-worker communication?

I'm experimenting with web workers and was wondering how well they would deal with embarrassingly parallel problems. I therefore implemented Conway's Game of Life. (To have a bit more fun than doing a blur or something; the problems would be the same in that case, however.)
At the moment I have one web worker performing iterations and posting back new ImageData for the UI thread to place in my canvas. Works nicely.
My experiment doesn't end there, however, because I have several CPUs available and would like to parallelize my application.
So, to start off simply, I split my data in two down the middle and make two workers, each dealing with one half. The problem is of course the split: worker A needs one column of pixels from worker B and vice versa. Now, I can clearly fix this by letting my UI thread hand that column down to the workers, but it would be much better if my workers could pass it to each other directly.
When splitting further, each worker would only have to keep track of its neighbouring workers, and the UI thread would only be responsible for updating the UI (as it should be).
My problem is, I don't see how I can achieve this worker-to-worker communication. I tried handing the neighbours to each other via an initialization postMessage, but that would have copied my workers rather than handed down a reference, which Chrome luckily warned me is impossible:
Uncaught Error: DATA_CLONE_ERR: DOM Exception 25
Finally I see that there's something called a SharedWorker. Is this what I should look into, or is there a way to use the Worker that would solve my problem?
You should be able to use channel messaging:
var channel = new MessageChannel();
// Transfer one end of the channel to each worker.
worker1.postMessage({code: "port"}, [channel.port1]);
worker2.postMessage({code: "port"}, [channel.port2]);
Then in your worker threads:
var xWorkerPort;

onmessage = function (event) {
    if (event.data.code == "port") {
        // Keep the transferred port; messages from the sibling worker
        // will arrive on it rather than on the regular onmessage.
        xWorkerPort = event.ports[0];
        xWorkerPort.onmessage = function (event) { /* do stuff */ };
    }
};
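For the Game of Life split, each worker can then push its boundary column to its neighbour directly over the port (edgeColumn is a hypothetical typed array holding the shared edge pixels):
// After finishing an iteration, send the boundary pixels straight to the
// neighbouring worker -- the UI thread never sees this traffic.
xWorkerPort.postMessage(edgeColumn); // structured clone copies the data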
There's not much documentation around, but you could try this MS summary to get started.

multi-core programming using JavaScript?

So I have this seriously recursive function that I would like to use in my code. The issue is that it doesn't really take advantage of dual-core machines because JS is single-threaded. I have tried using web workers but don't really know much about multi-core programming. Would someone point me to some material explaining how it is done? I googled and found this sample link, but it's not really much help without documentation! =/
I would be glad if someone could show me how this could be done without web workers, though! That would be just awesome! =)
I came across this link on whatwg. It's really weird, because it explains how to do multi-core programming with web workers etc., but executing it in my Chrome browser throws errors. The same goes for other browsers.
Error: Uncaught ReferenceError: Worker is not defined in worker.js
UPDATE (2018-06-21): For people coming here in search of multi-core programming in JavaScript, not necessarily browser JavaScript (for that, the answer still applies as-is): Node.js now supports multi-threading behind a feature flag (--experimental-workers): release info, relevant issue.
Writing this off the top of my head, no guarantees for source code. Please go easy on me.
As far as I know, you cannot really program with threads in JavaScript. Web workers are a form of multi-programming; JavaScript itself is by nature single-threaded (based on an event loop).
A web worker is a separate thread of execution in the sense that it doesn't share anything with the script that started it; there is no reference to the script's global object (typically called "window" in the browser), and no reference to any of your main script's variables other than the data you send to the thread.
Think of the web worker as a little "server" that gets asked a question and provides an answer. You can only send strings to that server, and it can only parse the string and send back what it has computed.
// in the main script, one starts a worker by passing the file name of the
// script containing the worker code to the constructor.
var w = new Worker("myworker.js");

// you want to react to the "message" event if your worker wants to inform
// you of a result. The handler gets the event as an argument.
w.addEventListener("message",
    function (evt) {
        // process evt.data, which is the message from the worker thread
        alert("The answer from the worker is " + evt.data);
    });
You can then send a message (a string) to this thread using its postMessage() method. Since the example worker below expects a JSON string, encode the message first:
w.postMessage(JSON.stringify("Hello, this is my message!"));
A sample worker script (an "echo" server) can be:
// this is another script file, like "myworker.js"
self.addEventListener("message",
    function (evt) {
        var data = JSON.parse(evt.data);
        /* as an echo server, we send this right back */
        self.postMessage(JSON.stringify(data));
    });
Whatever you post to that thread will be decoded, re-encoded, and sent back. Of course, you can do whatever processing you want in between. That worker will stay active; you can call terminate() on it (in your main script, that'd be w.terminate()) to end it, or call self.close() in your worker.
To summarize: you zip up your function parameters into a JSON string, which gets sent via postMessage, decoded, and processed "on the other side" (in the worker). The computation result gets sent back to your "main" script.
To explain why this is not easier: more interaction is not really possible, and that limitation is intentional. Because shared resources (an object visible to both the worker and the main script) would be subject to two threads interfering with them at the same time, you would need to manage access (i.e., locking) to such a resource in order to prevent race conditions.
The message-passing, shared-nothing approach is not that well known, mainly because most other programming languages (C and Java, for example) use threads that operate on the same address space (while others, like Erlang, for instance, don't). Consider this:
It is really hard to code a larger project with mutexes (a mutual exclusion mechanism) because of the associated deadlock/race condition complexities. This is stuff that can make grown men cry!
It is really easy, in comparison, to use message-passing, shared-nothing semantics. The code is isolated; you know exactly what goes into your worker and what comes out of it. Deadlocks and race conditions are impossible by construction!
Just try it out; web workers are capable of doing interesting things, probably all you want. Bear in mind that, as far as I know, it is still implementation-defined whether they actually take advantage of multiple cores.
NB: I have just been informed that at least some implementations will handle JSON encoding of messages for you.
So, to answer your question (it's all above; tl;dr version): no, you cannot do this without web workers. But there is nothing really wrong with web workers aside from browser support, as is the case with HTML5 in general.
As far as I remember this is only possible with the new HTML5 standard. The keyword is "Web-Worker"
See also:
HTML5: JavaScript Web Workers
JavaScript Threading With HTML5 Web Workers
Web workers are the answer on the client side. For Node.js there are many approaches. Most popular: spawn several processes with pm2 or a similar tool, or run a single process and spawn/fork child processes. You can google around these and you will find a lot of samples and tactics.
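A minimal sketch of the fork-per-core approach using Node's built-in cluster module ('./server' is a placeholder for your app's entry point):
const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
    // Fork one worker process per CPU core.
    for (let i = 0; i < os.cpus().length; i++) {
        cluster.fork();
    }
} else {
    // Each forked process runs this branch independently.
    require('./server'); // placeholder entry point
}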
Web workers are already well supported by all browsers. https://caniuse.com/#feat=webworkers
API & samples: https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers
