What determines the cadence at which a readable Node stream emits data?

If I configure a stream to emit the contents of a file, what actually determines the rate at which chunks are read from the source and sent to the destination (which might be another stream, for example)?
Also: is "chunks" the correct terminology for the slices of data that are read and processed sequentially to avoid reading everything in at once?
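For context, here is a small sketch of the two mechanisms that usually govern this, assuming fs.createReadStream and placeholder file names: the highWaterMark option caps how much is read per chunk, and pipe() applies backpressure, pausing the source whenever the destination's internal buffer is full.

const fs = require('fs');

// Chunk size: fs read streams read up to highWaterMark bytes at a time
// (the default for fs streams is 64 KiB).
const source = fs.createReadStream('big-file.bin', { highWaterMark: 16 * 1024 });

// Pace: pipe() pauses the source when destination.write() returns false
// and resumes it on the destination's 'drain' event (backpressure).
const destination = fs.createWriteStream('copy.bin');
source.pipe(destination);

So the emission rate is driven by how quickly the destination drains, not by any timer inside the readable stream itself.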

Related

Is copying a large blob over to a worker expensive?

Using the Fetch API I'm able to make a network request for a large asset of binary data (say more than 500 MB) and then convert the Response to either a Blob or an ArrayBuffer.
Afterwards, I can either do worker.postMessage and let the standard structured clone algorithm copy the Blob over to a Web Worker, or transfer the ArrayBuffer over to the worker context (making it effectively no longer available from the main thread).
At first, it would seem much preferable to fetch the data as an ArrayBuffer, since a Blob is not transferable and thus needs to be copied over. However, Blobs are immutable, so it seems the browser doesn't store them in the JS heap associated with the page, but rather in a dedicated blob storage space; what ends up being copied over to the worker context is just a reference.
I've prepared a demo to try out the difference between the two approaches: https://blobvsab.vercel.app/. I'm fetching 656 MB worth of binary data using both approaches.
Something interesting I've observed in my local tests is that copying the Blob is even faster than transferring the ArrayBuffer:
Blob copy time from main thread to worker: 1.828125 ms
ArrayBuffer transfer time from main thread to worker: 3.393310546875 ms
This is a strong indicator that dealing with Blobs is actually pretty cheap. Since they're immutable, the browser seems to be smart enough to treat them as references rather than attaching the underlying binary data to those references.
Here are the heap memory snapshots I've taken when fetching as a Blob:
The first two snapshots were taken after the Blob resulting from the fetch was copied over to the worker context using postMessage. Notice that neither of those heaps includes the 656 MB.
The last two snapshots were taken after I used a FileReader to actually access the underlying data, and as expected, the heap grew a lot.
Now, this is what happens with fetching directly as an ArrayBuffer:
Here, since the binary data was simply transferred to the worker thread, the heap of the main thread is small, but the worker heap contains the entirety of the 656 MB, even before reading this data.
Now, looking around on SO I see that What is the difference between an ArrayBuffer and a Blob? mentions a lot of underlying differences between the two structures, but I haven't found a good reference on whether one should worry about copying a Blob between execution contexts versus the seemingly inherent advantage of ArrayBuffers being transferable. However, my experiments show that copying the Blob might actually be faster, and thus preferable.
It seems to be up to each browser vendor how they store and handle Blobs. I've found this Chromium documentation describing that all Blobs are transferred from each renderer process (i.e. a page in a tab) to the browser process, and that way Chrome can even offload the Blob to secondary memory if needed.
Does anyone have more insights regarding all of this? If I can choose how I fetch some large binary data over the network and move it to a Web Worker, should I prefer a Blob or an ArrayBuffer?
No, it's not expensive at all to postMessage a Blob.
The cloning steps of a Blob are
Their serialization steps, given value and serialized, are:
Set serialized.[[SnapshotState]] to value’s snapshot state.
Set serialized.[[ByteSequence]] to value’s underlying byte sequence.
Their deserialization steps, given serialized and value, are:
Set value’s snapshot state to serialized.[[SnapshotState]].
Set value’s underlying byte sequence to serialized.[[ByteSequence]].
In other words, nothing is copied: both the snapshot state and the byte sequence are passed by reference (even though the wrapping JS object is not).
However, regarding your full project, I wouldn't advise using Blobs here, for two reasons:
The fetch algorithm first fetches as an ArrayBuffer internally. Requesting a Blob adds an extra step there (which consumes memory).
You'll probably need to read that Blob from the Worker, adding yet another step (which will also consume memory, since there the data actually does get copied).
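For illustration, a minimal sketch of the two options being compared (URLs and file names are placeholders, and the fetches are assumed to run inside an async function):

// main thread; 'worker.js' is a placeholder script name
const worker = new Worker('worker.js');

// Option A: fetch as a Blob and postMessage it.
// The structured clone passes the underlying byte sequence by reference.
const blob = await (await fetch('/large-asset.bin')).blob();
worker.postMessage(blob);

// Option B: fetch as an ArrayBuffer and transfer it.
// No copy either, but the main thread loses access to the buffer afterwards.
const buffer = await (await fetch('/large-asset.bin')).arrayBuffer();
worker.postMessage(buffer, [buffer]);

// worker.js: a Blob still has to be read before its bytes are usable.
self.onmessage = async (event) => {
    const bytes = event.data instanceof Blob
        ? new Uint8Array(await event.data.arrayBuffer())
        : new Uint8Array(event.data);
};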

Can I transfer a Tensorflow.js Tensor efficiently between Node.js processes?

I am developing an AI model with Tensorflow.js and Node.js. As part of this, I need to read and parse my large dataset in a streaming fashion (it's far too big to fit in memory all at once). This process ultimately results in a pair of generator functions (one for the input data, and another for the output data) that iteratively yield Tensorflow.js Tensors:
function* example_parser() {
    while (thereIsData) {
        // do reading & parsing here....
        yield next_tensor;
    }
}
....which are wrapped in a pair of tf.data.generator()s, followed by a tf.data.zip().
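For reference, that wrapping typically looks something like the following sketch, assuming @tensorflow/tfjs-node and two hypothetical generators input_parser and output_parser shaped like the snippet above:

const tf = require('@tensorflow/tfjs-node');

// Each generator becomes a lazily evaluated dataset...
const xs = tf.data.generator(input_parser);
const ys = tf.data.generator(output_parser);

// ...and zip() pairs their elements up one by one.
const dataset = tf.data.zip({ xs, ys });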
This process can be fairly computationally intensive at times, so I would like to refactor it into a separate Node.js worker process / thread, since I'm aware that Node.js executes JavaScript in a single-threaded fashion.
However, I am also aware that if I were to transmit the data normally via e.g. process.send(), the serialisation / deserialisation would slow the process down so much that I'm better off keeping everything inside the same process.
To this end, my question is this:
How can I efficiently transmit (a stream of) Tensorflow.js Tensors between Node.js processes without incurring a heavy serialisation / deserialisation penalty?
How can I efficiently transmit (a stream of) Tensorflow.js Tensors between Node.js processes?
First, a tensor cannot be sent directly: the tensor object itself does not contain the underlying data.
console.log(tensor) // will show info about the tensor but not the data it contains
Rather than transmitting the tensor object, its data can be sent:
// given a tensor t
// first get its data
const data = await t.data()
// and send it
worker.send({data})
In order to be able to reconstruct this tensor in the receiving process, the shape of the tensor needs to be sent as well:
worker.send({data, shape})
By default, sending and receiving messages between processes creates a copy of the data. If there is a lot of data to send and the copy would incur a noticeable penalty, a SharedArrayBuffer can be used (with worker threads) so that no copy is made at all. Alternatively, a plain ArrayBuffer can be transferred instead of copied, but once it has been sent it can no longer be used by the sending thread.
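A minimal sketch of the data + shape approach using worker_threads (the tensor and file layout are illustrative, and @tensorflow/tfjs-node is assumed):

const { Worker, isMainThread, parentPort } = require('worker_threads');
const tf = require('@tensorflow/tfjs-node');

if (isMainThread) {
    const worker = new Worker(__filename);
    const t = tf.tensor([[1, 2], [3, 4]]);
    t.data().then((data) => {
        // Transfer the typed array's buffer instead of copying it;
        // after this, `data` is no longer usable in the main thread.
        worker.postMessage({ data, shape: t.shape }, [data.buffer]);
    });
} else {
    parentPort.on('message', ({ data, shape }) => {
        // Rebuild the tensor from the raw values and the original shape.
        const t = tf.tensor(data, shape);
        t.print();
    });
}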

How to handle massive text-delimited files with NodeJS

We're working with an API-based data provider that allows us to analyze large sets of GIS data in relation to provided GeoJSON areas and specified timestamps. When the data is aggregated by our provider, it can be marked as complete and alert our service via a callback URL. From there, we have a list of the reports we've run with their relevant download links. One of the reports we need to work with is a TSV file with 4 columns that looks like this:
deviceId | timestamp | lat | lng
Sometimes, if the area we're analyzing is large enough, these files can be 60+ GB. The download link points to a zipped version of the file, so we can't read it directly from the download URL. We're trying to get the data in this TSV grouped by deviceId and sorted by timestamp so we can route along road networks using the lat/lng in our routing service. We've used JavaScript for most of our application so far, but this service poses unique problems that may require additional software and/or languages.
Curious how others have approached the problem of handling and processing data of this size.
We've tried downloading the file, piping it into a ReadStream, and allocating all the available cores on the machine to process batches of the data individually. This works, but it's not nearly as fast as we would like (even with 36 cores).
From Wikipedia:
Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because ... only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures.
In other words, if you try to do it without looking at the end of the zip file first, you may end up accidentally including deleted files. So you can't trust streaming unzippers. However, if the zip file hasn't been modified since it was created, perhaps streaming parsers can be trusted. If you don't want to risk it, then don't use a streaming parser. (Which means you were right to download the file to disk first.)
To some extent it depends on the structure of the zip archive: if it consists of many moderately sized files, and if they can all be processed independently, then you don't need to have very much of it in memory at any one time. On the other hand, if you try to process many files in parallel, you may run into the limit on the number of file handles that can be open. But you can get around this using something like a queue.
You say you have to sort the data by device ID and timestamp. That's another part of the process that can't be streamed. If you need to sort a large list of data, I'd recommend you save it to a database first; that way you can make it as big as your disk will allow, but also structured. You'd have a table where the columns are the columns of the TSV. You can stream from the TSV file into the database, and also index the database by deviceId and timestamp. And by this I mean a single index that uses both of those columns, in that order.
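A minimal sketch of that approach, assuming better-sqlite3 and a hypothetical points.tsv that has already been unzipped to disk:

const fs = require('fs');
const readline = require('readline');
const Database = require('better-sqlite3');

const db = new Database('points.db');
db.exec('CREATE TABLE IF NOT EXISTS points (deviceId TEXT, timestamp INTEGER, lat REAL, lng REAL)');

const insert = db.prepare('INSERT INTO points VALUES (?, ?, ?, ?)');
const insertBatch = db.transaction((rows) => {
    for (const row of rows) insert.run(row);
});

async function load() {
    const rl = readline.createInterface({
        input: fs.createReadStream('points.tsv'),
        crlfDelay: Infinity,
    });

    let batch = [];
    for await (const line of rl) {
        const [deviceId, timestamp, lat, lng] = line.split('\t');
        batch.push([deviceId, Number(timestamp), Number(lat), Number(lng)]);
        if (batch.length === 10000) { insertBatch(batch); batch = []; }
    }
    if (batch.length) insertBatch(batch);

    // One composite index covers "group by deviceId, order by timestamp" reads.
    db.exec('CREATE INDEX IF NOT EXISTS idx_device_time ON points (deviceId, timestamp)');
}

load();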
If you want a distributed infrastructure, maybe you could store different device IDs on different disks with different CPUs etc ("sharding" is the word you want to google). But I don't know whether this will be faster. It would speed up the disk access. But it might create a bottleneck in network connections, through either latency or bandwidth, depending on how interconnected the different device IDs are.
Oh, and if you're going to be running multiple instances of this process in parallel, don't forget to create separate databases, or at the very least add another column to the database to distinguish separate instances.

Creating meta-data for binary chunks for sending via WebRTC datachannel

I have a datachannel connection between two browsers, and would like to break a file into chunks and send them to/from the clients.
I can read the file and break it up into chunks just fine. However, I need a way for the receiving client to know:
which file the chunk of data relates to (a unique identifier).
where the chunk belongs during reconstruction (an index number).
When transferring binary data in the browser, it seems the entire payload must be binary. So I can't, for example, create a JSON object with the above properties, and have a data property with the actual binary chunk.
I guess I need to wrap the file chunk in a secondary binary blob which contains the identifier and index. The receiving client would then decode that wrapper chunk first to check the metadata, then handle the actual file chunk based on that information.
How can I do this in the browser? I have done a lot of google searching but can't seem to find any information on this, so wonder if I'm perhaps overlooking something which can help ease this process?
You have to create your own protocol for transferring files.
I assume you have a File/Blob object. You probably also use the slice() method to get chunks.
You can simply use a Uint8Array to transfer data.
Create a protocol that satisfies your needs, for example:
1 byte: Package type (255 possible package types)
2 bytes: Length of data (2^16 bytes ~ 64KB of data per chunk)
n bytes: <Data>
Send an initial package (e.g. type 0x01)
Data contains some information (all or some):
Total length of blob/file
file type
chunk size
number of chunks
filename
...
Send Chunks of data (e.g. type 0x02)
You should use at least two bytes for a sequence number
Data follows afterwards (no length needed because you know the total length)
Note: if transferring multiple files, you should add an id or something similar.
On the receiver side you can wait for the initial package and create a new Uint8Array with the length of the total file. Afterwards you can use set() to add received data at the chunk position (offset = 0-based-chunk-number * chunk-size). When all chunks are received, you can create the Blob.
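A minimal sketch of packing and unpacking a chunk with that kind of header, using a DataView (the field sizes follow the example layout above; the function names are illustrative):

// Build a packet: [type: 1 byte][length: 2 bytes][data: n bytes]
function buildPacket(type, data /* Uint8Array */) {
    const packet = new Uint8Array(3 + data.length);
    const view = new DataView(packet.buffer);
    view.setUint8(0, type);
    view.setUint16(1, data.length); // big-endian length
    packet.set(data, 3);
    return packet;
}

// Parse a packet on the receiving side.
function parsePacket(buffer /* ArrayBuffer */) {
    const view = new DataView(buffer);
    const type = view.getUint8(0);
    const length = view.getUint16(1);
    const data = new Uint8Array(buffer, 3, length);
    return { type, data };
}

A chunk packet (e.g. type 0x02) would additionally carry the sequence number described above before the payload.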
In addition to Robert's very good answer, you can use channel.send(blob) (at least in Firefox<->Firefox). Eventually this should work in Chrome as well.
If it is a simple matter of multiple files, you could just create a new data channel for each new file.
Each channel will take care of its own buffering, sequencing, etc.
Something like:
chan = peerCon.createDataChannel("/somedir/somefile", props);
then break your file into <64k chunks and chan.send() them in sequence.
The receiving side can get the label and use it to save the file appropriately:
peerCon.ondatachannel = function(event) {
    var channel = event.channel;
    console.log("New file " + channel.label);
    channel.onmessage = function(msg) {
        // append msg.data to the buffer for this file
    };
};
P.S.
If you really must use a file system protocol over a single channel (say because you want random access behaviour) don't invent a new one, use one that already exists and is tested - I'm fond of 9p from inferno/plan9

What is optimal way to pipe one stream to two streams in node.js

I'm using a ReadableStream from request.js to pipe() a resource from http to a file system WritableStream.
How do I modify this to be able to process the content in memory? (for streamed parsing or whatever)
Ideally I want a full 'dupe' into two real streams, not just a callback 'tap' (but I'll take any advice).
Ha, I had some lucky timing: the answer to my problem was announced today:
highlandjs, a new stream utility library by @caolan (the author of async).
This is what I needed: http://highlandjs.org/#fork
It has the backpressure feature, lazy evaluation etc.
I would first pipe the content from the HTTP request into memory (a Buffer) through a module named memory-stream. Then I create a readable stream from that Buffer and pipe it to the file system. At the same time, you can create multiple readable streams from the same Buffer and pipe them to any writable.
In one of my projects I needed to process thumbnails in several sizes once an image was uploaded from the browser, and I used the approach above: save the original image into a Buffer, create a readable stream for each size, pipe each through imagemagick, and then pipe into a file system writable stream so they are saved in parallel.
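For reference, a minimal sketch of another common approach using nothing but core Node streams (file names are placeholders): piping the same readable into two PassThrough streams gives each consumer a real stream of its own, and backpressure from either one slows the source.

const fs = require('fs');
const { PassThrough } = require('stream');

const source = fs.createReadStream('resource.bin');

const toDisk = new PassThrough();
const toParser = new PassThrough();

// Each chunk from the source is written to both destinations;
// the source pauses whenever either one falls behind.
source.pipe(toDisk);
source.pipe(toParser);

toDisk.pipe(fs.createWriteStream('resource-copy.bin'));
toParser.on('data', (chunk) => {
    // streamed parsing goes here
});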
