MongoDB Document Size Limitations - javascript

I have a collection of novels that looks as follows:
The Words array contains all words along with additional linguistic information related to each word. When I try to add longer texts (100k words +), I get the error:
RangeError: attempt to write outside buffer bounds
Which, I have gathered, means that the BSON document is larger than 16 mb and therefore above the limit.
I'm assuming this is a relatively common situation. I am now considering how to work around this limitation - for example, I could split the novel into chunks of 10k words each. Or should the document make up a separate collection (i.e. one new collection per uploaded text)? That option makes the least sense to me.
Is there a standard/suggested approach to designing a MongoDB database in this case?
Also, is it possible to check the size of the BSON before inserting a document in JS/Node?

Do you absolutely need to store the contents of the books in MongoDB? If you're simply serving the contents to users or processing them in bulk, I suggest storing them on disk or in an AWS S3 bucket or similar.
If you need the book contents to live in the database, try using the MongoDB GridFS:
GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16 MB.
Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document.
When you query GridFS for a file, the driver will reassemble the chunks as needed. You can perform range queries on files stored through GridFS. You can also access information from arbitrary sections of files, such as to “skip” to the middle of a video or audio file.
Read more here:
https://docs.mongodb.com/manual/core/gridfs/
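For reference, a minimal sketch of streaming a large text into GridFS with the Node.js MongoDB driver; the database name, bucket name and file names are placeholders, not anything from the question:

const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');

async function storeNovel(uri, filePath) {
  const client = await MongoClient.connect(uri);
  const bucket = new GridFSBucket(client.db('library'), { bucketName: 'novels' }); // assumed names
  await new Promise((resolve, reject) => {
    fs.createReadStream(filePath)
      .pipe(bucket.openUploadStream('novel.json'))   // placeholder file name
      .on('error', reject)
      .on('finish', resolve);
  });
  await client.close();
}

As for checking BSON size before insert: the bson package used by the Node driver exposes calculateObjectSize, so something like require('bson').calculateObjectSize(doc) should tell you whether a document will clear the 16 MB limit before you attempt the insert.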

Related

How to handle massive text-delimited files with NodeJS

We're working with an API-based data provider that allows us to analyze large sets of GIS data in relation to provided GeoJSON areas and specified timestamps. When the data is aggregated by our provider, it can be marked as complete and alert our service via a callback URL. From there, we have a list of the reports we've run with their relevant download links. One of the reports we need to work with is a TSV file with 4 columns, and looks like this:
deviceId | timestamp | lat | lng
Sometimes, if the area we're analyzing is large enough, these files can be 60+GB large. The download link links to a zipped version of the files, so we can't read them directly from the download URL. We're trying to get the data in this TSV grouped by deviceId and sorted by timestamp so we can route along road networks using the lat/lng in our routing service. We've used Javascript for most of our application so far, but this service poses unique problems that may require additional software and/or languages.
Curious how others have approached the problem of handling and processing data of this size.
We've tried downloading the file, piping it into a ReadStream, and allocating all the available cores on the machine to process batches of the data individually. This works, but it's not nearly as fast as we would like (even with 36 cores).
From Wikipedia:
Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because ... only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures.
In other words, if you try to do it without looking at the end of the zip file first, you may end up accidentally including deleted files. So you can't trust streaming unzippers. However, if the zip file hasn't been modified since it was created, perhaps streaming parsers can be trusted. If you don't want to risk it, then don't use a streaming parser. (Which means you were right to download the file to disk first.)
To some extent it depends on the structure of the zip archive: If it consists of many moderately sized files, and if they can all be processed independently, then you don't need to have very much of it in memory at any one time. On the other hand, if you try to process many files in parallel then you may run into the limit on the number of filehandles that can be open. But you can get around this using something like a task queue (sketched below).
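A rough sketch of that idea, assuming the async library's queue; the worker function and concurrency limit are placeholders:

const async = require('async');

// Process at most 8 zip entries at a time so we never hold too many
// filehandles open; processEntry is your own per-file worker (placeholder).
const q = async.queue((entryPath, done) => {
  processEntry(entryPath, done);
}, 8);

entries.forEach((e) => q.push(e));
q.drain(() => console.log('all entries processed'));   // async v3 style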
You say you have to sort the data by device ID and timestamp. That's another part of the process that can't be streamed. If you need to sort a large list of data, I'd recommend you save it to a database first; that way you can make it as big as your disk will allow, but also structured. You'd have a table where the columns are the columns of the TSV. You can stream from the TSV file into the database, and also index the database by deviceId and timestamp. And by this I mean a single index that uses both of those columns, in that order.
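A minimal sketch of that approach using better-sqlite3 and readline; the library choice, file names, column types and batch size are assumptions, not something the poster specified:

const fs = require('fs');
const readline = require('readline');
const Database = require('better-sqlite3');

const db = new Database('pings.db');   // placeholder db file
db.exec('CREATE TABLE IF NOT EXISTS pings (deviceId TEXT, timestamp TEXT, lat REAL, lng REAL)');
db.exec('CREATE INDEX IF NOT EXISTS idx_device_ts ON pings (deviceId, timestamp)'); // compound index

const insert = db.prepare('INSERT INTO pings VALUES (?, ?, ?, ?)');
const insertBatch = db.transaction((rows) => rows.forEach((r) => insert.run(r)));

const rl = readline.createInterface({ input: fs.createReadStream('report.tsv'), crlfDelay: Infinity });
let batch = [];
rl.on('line', (line) => {
  batch.push(line.split('\t'));                         // [deviceId, timestamp, lat, lng]
  if (batch.length >= 10000) { insertBatch(batch); batch = []; }
});
rl.on('close', () => { if (batch.length) insertBatch(batch); });

Once the data is in, SELECT ... ORDER BY deviceId, timestamp streams back in exactly the grouping you need, with the sort done on disk by the index rather than in memory.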
If you want a distributed infrastructure, maybe you could store different device IDs on different disks with different CPUs etc ("sharding" is the word you want to google). But I don't know whether this will be faster. It would speed up the disk access. But it might create a bottleneck in network connections, through either latency or bandwidth, depending on how interconnected the different device IDs are.
Oh, and if you're going to be running multiple instances of this process in parallel, don't forget to create separate databases, or at the very least add another column to the database to distinguish separate instances.

Converting a Mongo Collection to GridFS?

Currently, my website is using a Mongo.Collection to hold data submitted from another site. Strings are sent over through HTTP methods and packed into the collection afterward. However, this collection now needs to support storing larger files, but still needs to hold the data already stored, so I've been looking into converting the collection into GridFS. Is there a way to attach the data onto empty files as metadata, or is the conversion more convoluted than this?
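One hedged idea, if you go the GridFS route: the Node driver's GridFSBucket upload stream accepts a metadata option, so each existing document's fields can ride along as metadata on its GridFS file. Bucket name and the 'text' field below are placeholders:

const { GridFSBucket } = require('mongodb');

// Hypothetical migration of one existing document into GridFS.
async function migrateDoc(db, doc) {
  const bucket = new GridFSBucket(db, { bucketName: 'submissions' });   // assumed bucket name
  await new Promise((resolve, reject) => {
    const upload = bucket.openUploadStream(String(doc._id), { metadata: doc });
    upload.on('error', reject).on('finish', resolve);
    upload.end(Buffer.from(doc.text || '', 'utf8'));   // 'text' is a placeholder field
  });
}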

Creating meta-data for binary chunks for sending via WebRTC datachannel

I have a datachannel connection between two browsers, and would like to break a file into chunks and send them to/from the clients.
I can read the file and break it up into chunks just fine. However, I need a way for the receiving client to know:
which file the chunk of data relates to (unique identifier).
which place the chunk applies to in reconstruction (index number).
When transferring binary data in the browser, it seems the entire payload must be binary. So I can't, for example, create a JSON object with the above properties, and have a data property with the actual binary chunk.
I guess I need to wrap the file chunk into a secondary binary blob which contains the identifier and index. The receiving client would then decode the first, wrapper, chunk to check the meta-data, then handle the actual file chunk based on that information.
How can I do this in the browser? I have done a lot of google searching but can't seem to find any information on this, so wonder if I'm perhaps overlooking something which can help ease this process?
You have to create your own protocol for transferring files.
I assume you have a File/Blob object. You probably also use the slice() method to get chunks.
You can simply use a Uint8Array to transfer data.
Create a protocol that satisfies your needs, for example:
1 byte: Package type (255 possible package types)
2 bytes: Length of data (2^16 bytes ~ 64KB of data per chunk)
n bytes: <Data>
Send an initial package (e.g. type 0x01)
Data contains some information (all or some):
Total length of blob/file
file type
chunk size
number of chunks
filename
...
Send Chunks of data (e.g. type 0x02)
You should use at least two bytes for a sequence number
Data follows afterwards (no length needed because you know the total length)
Note: If transferring multiple files, you should add a file id or something similar.
On the receiver side you can wait for the initial package and create a new Uint8Array with the length of the total file. Afterwards you can use set() to add received data at the chunk position (offset = 0-based-chunk-number * chunk-size). When all chunks are received, you can create the Blob.
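A minimal sketch of that packet layout (type byte, 2-byte sequence number, then the chunk bytes); the 0x02 type value and field sizes just follow the example above:

const CHUNK_TYPE = 0x02;

// Sender: prepend the type byte and a 2-byte sequence number to each chunk.
function encodeChunk(seq, chunkBytes) {            // chunkBytes: Uint8Array
  const packet = new Uint8Array(3 + chunkBytes.length);
  const view = new DataView(packet.buffer);
  view.setUint8(0, CHUNK_TYPE);                    // package type
  view.setUint16(1, seq);                          // sequence number (big-endian)
  packet.set(chunkBytes, 3);                       // payload
  return packet;
}

// Receiver: copy each chunk into its slot of a preallocated buffer,
// using the totals announced by the initial package (type 0x01).
function makeReceiver(totalLength, chunkSize, onDone) {
  const file = new Uint8Array(totalLength);
  let received = 0;
  return function onPacket(buffer) {               // buffer: ArrayBuffer from onmessage (binaryType 'arraybuffer')
    const view = new DataView(buffer);
    if (view.getUint8(0) !== CHUNK_TYPE) return;   // handle the initial/other types elsewhere
    const payload = new Uint8Array(buffer, 3);
    file.set(payload, view.getUint16(1) * chunkSize);
    received += payload.length;
    if (received >= totalLength) onDone(new Blob([file]));
  };
}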
In addition to @Robert's very good answer, you can use channel.send(blob) (at least in Firefox<->Firefox). Eventually this should work in Chrome as well.
If it is a simple matter of multiple files, you could just create a new data channel for each new file.
Each channel will take care of its own buffering, sequencing, etc.
Something like:
chan = peerCon.createDataChannel("/somedir/somefile", props);
then break your file into <64k chunks and chan.send() them in sequence.
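Roughly, the sending side might look like this; the 16 KB chunk size and the end-of-file marker are assumptions, and in practice you would also throttle on chan.bufferedAmount:

const CHUNK_SIZE = 16 * 1024;   // comfortably below the ~64k message limit

async function sendFile(chan, file) {                 // file: File or Blob
  for (let offset = 0; offset < file.size; offset += CHUNK_SIZE) {
    const chunk = file.slice(offset, offset + CHUNK_SIZE);
    chan.send(await chunk.arrayBuffer());             // raw bytes, in order (needs a modern browser; else FileReader)
  }
  chan.send('EOF');                                   // hypothetical end-of-file marker
}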
The receiving side can get the label and use it to save the file appropriately
peerCon.ondatachannel = function(event) {
  var channel = event.channel;                 // the event carries the channel
  console.log("New file " + channel.label);
  channel.onmessage = function(msg) {
    // append msg.data (one chunk) to the buffer for this file
  };
};
P.S.
If you really must use a file system protocol over a single channel (say because you want random access behaviour) don't invent a new one, use one that already exists and is tested - I'm fond of 9p from inferno/plan9

Proper Way to Read a Data File in Javascript/Node.js?

I have a flat data file in the form of xml, but there isn't a real Windows viewer for the file, currently. I decided to create a simple application with Node-WebKit, just for basic viewing - the data file won't need to be written to by the application.
My problem is, I don't know the proper way to read a large file. The data file is a backup of phone SMS's and MMS's, and the MMS entries contain Base64 image strings where applicable - so, the file gets pretty big with large amounts of images (generally around 250 MB). I didn't create/format the original data in the file, so I can't modify its structure.
So, the question is - assuming I already have a way to parse the XML into JavaScript objects, should I,
a) Parse the entire file when the application is first run, storing an array of objects in memory for the duration of the applications lifetime, or
b) Read through the entire file each time I want to extract a conversation (all of the messages with a specific outgoing or incoming number), and only store that data in memory, or
c) Employ some alternate, more efficient, solution that I don't know about yet.
Convert your XML data into an SQLite db. SQLite is NOT memory based by default. Query the db when you need the data, problem solved :)
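For example, once the XML has been imported into a messages table, pulling one conversation is just a query; the table and column names here are assumptions based on the question, and the import step itself is omitted:

const Database = require('better-sqlite3');
const db = new Database('sms-backup.db', { readonly: true });   // placeholder file

// Fetch one conversation without loading the whole backup into memory.
const conversation = db
  .prepare('SELECT * FROM messages WHERE address = ? ORDER BY date')
  .all('+15551234567');                                         // placeholder number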

StarDict support for JavaScript and a Firefox OS App

I wrote a dictionary app in the spirit of GoldenDict (www.goldendict.org, also see Google Play Store for more information) for Firefox OS: http://tuxor1337.github.io/firedict and https://marketplace.firefox.com/app/firedict
Since apps for ffos are based on HTML, CSS and JavaScript (WebAPI etc.), I had to write everything from scratch. At first, I wrote a basic library for synchronous and asynchronous access to StarDict dictionaries in JavaScript: https://github.com/tuxor1337/stardict.js
Although the app can be called stable by now, overall performance is still a bit sluggish. For some dictionaries, I have a list of words of almost 1,000,000 entries! That's huge. Indexing takes a really long time (up to several minutes per dictionary), and so does lookup. At the moment, the words are stored in an IndexedDB object store. Is there another alternative? With the current solution (words accessed and inserted using binary search) the overall experience is pretty slow. Maybe it would become faster if there were some locale-aware sort support in IndexedDB... Actually, I'm not even storing the terms themselves in the DB but only their offsets in the *.syn/*.idx file. I hope to save some memory doing that. But of course I'm not able to use any IDB sorting functionality with this configuration...
Maybe it's not the best idea to do the sorting in memory, because now the app is killed by the kernel due to an OOM on some devices (e.g. the ZTE Open). A dictionary with more than 500,000 entries will definitely exceed 100 MB in memory. (That's only 200 bytes per entry, and if you suppose the keyword strings are UTF-8, you'll exceed 100 MB immediately...)
Feel free to contribute directly to the project on GitHub. Otherwise, I would be glad to hear your advice concerning the above issues.
I am working on a pure JavaScript implementation of an MDict parser (https://github.com/fengdh/mdict-js) similar to your StarDict project. MDict is another popular dictionary format with rich formatting (embedded images/audio/CSS etc.), which is widely supported on Windows/Linux/iOS/Android/Windows Phone. I have some ideas to share, and I hope you can apply them to improve stardict.js in the future.
An MDict dictionary file (mdx/mdd) divides keywords and records into (optionally compressed) blocks, each containing around 2,000 entries, and it also provides a keyword block index table and a record block index table to help with quick look-up. Because of this compact data structure, my MDict parser can scan the dictionary file directly with a small pre-loaded index table and no need for IndexedDB.
Each keyword block index looks like:
{num_entries: ..,
first_word: ..,
last_word: ..,
comp_size: .., // size in compression
decomp_size: .., // size after decompression
offset: .., // offset in mdx file
index: ..
}
In a keyword block, each entry is a pair of [keyword, offset].
Each record block index looks like:
{comp_size: .., // size in compression
decomp_size: .., // size after decompression
}
Given a word, use binary search to locate the keyword block that may contain it (sketched after these steps).
Slice the keyword block, load all keys in it, filter for the matching one, and get its record offset.
Use binary search to locate the record block containing the word's record.
Slice the record block and retrieve its record (a definition in text or resource in ArrayBuffer) directly.
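A sketch of the first step, assuming the keyword block index table from above has been loaded as a sorted array of objects with first_word/last_word fields (a naive string comparison stands in for MDict's own collation):

// Binary search over the keyword block index: return the block whose
// [first_word, last_word] range may contain the query word, or null.
function findKeywordBlock(blockIndex, word) {
  let lo = 0, hi = blockIndex.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const block = blockIndex[mid];
    if (word < block.first_word) {
      hi = mid - 1;
    } else if (word > block.last_word) {
      lo = mid + 1;
    } else {
      return block;   // next: slice the mdx file at block.offset for comp_size bytes
    }
  }
  return null;        // word not in this dictionary
}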
Since each block contains only around 2,000 entries, it is fast enough to look up a word among 100K~1M dictionary entries within 100 ms, which is quite a decent value for human interaction. mdict-js parses only the file header, so it is very fast and has low memory usage.
In the same way, it is possible to retrieve a list of neighboring words for a given phrase, even with wildcards.
Please take a look on my online demo here: http://fengdh.github.io/mdict-js/
(You have to choose a local MDict dictionary: a mdx + optional mdd file)
