How can you store and modify large datasets in node.js? - javascript

Basics
So basically, I have written a program in Node which generates test data for MongoDB.
The problem
The program reads a schema file and generates a specified amount of test data out of it. The problem is that this data can become quite big (think of creating 1M users with all the properties they need and 20M chat messages with userFrom and userTo), and the program has to keep all of it in RAM to modify/transform/map it before saving it to a file.
How it works
The program works like that:
Read schema file
Create test data from the schema and store it in a structure (see below for the structure)
Run through the structure and link every object's referenceTo to a random object with a matching referenceKey
Transform the object structure into a string[] of MongoDB insert statements
Store that string[] in a file.
This is the structure of the generated test data:
export interface IGeneratedCollection {
  dbName: string,                 // Name of the database
  collectionName: string,         // Name of the collection
  documents: IGeneratedDocument[] // One collection has many documents
}
export interface IGeneratedDocument {
  documentFields: IGeneratedField[] // One document has many fields (which are recursive, because of nested documents)
}
export interface IGeneratedField {
  fieldName: string,              // Name of the property
  fieldValue: any,                // Value of the property (can also be IGeneratedField, IGeneratedField[], ...)
  fieldNeedsQuotations?: boolean, // If the value needs to be saved with " ... "
  fieldIsObject?: boolean,        // If the value is an object (stored as IGeneratedField[]) (to handle it differently when transforming to MongoDB inserts)
  fieldIsJsonObject?: boolean,    // If the value is a plain JSON object
  fieldIsArray?: boolean,         // If the value is an array of objects (stored as an array of IGeneratedField[])
  referenceKey?: number,          // Field flagged to be a key
  referenceTo?: number            // Value gets set to a random object with a matching referenceKey
}
Actual data
So in the example with 1M Users and 20M messages it would look like this:
1x IGeneratedCollection (collectionName = "users")
  1M x IGeneratedDocument
    10x IGeneratedField (for example, each user has 10 fields)
1x IGeneratedCollection (collectionName = "messages")
  20M x IGeneratedDocument
    3x IGeneratedField (message, userFrom, userTo)
Which would result in 70M instances of IGeneratedField (1 x 1M x 10 + 1 x 20M x 3 = 70M).
Conclusion
This is obviously a lot to handle for the RAM as it needs to store all of that at the same time.
Temporary Solution
It now works like that:
Generate 500 documents (rows in SQL) at a time
JSON.stringify those 500 documents and put them in a SQLite table with the schema (dbName STRING, collectionName STRING, value JSON)
Remove those 500 documents from JS and let the garbage collector do its thing
Repeat until all data is generated and sitting in the SQLite table
Take one row (each containing 500 documents) at a time, apply JSON.parse and search for keys in it
Repeat until all data has been queried and all keys retrieved
Take one row at a time, apply JSON.parse and search for key references in it
Apply JSON.stringify and update the row if necessary (if key references were found and resolved)
Repeat until all data has been queried and all keys are resolved
Take one row at a time, apply JSON.parse and transform the documents into valid SQL/MongoDB inserts
Add each insert (string) to a SQLite table with the schema (singleInsert STRING)
Remove the old and now unused row from the SQLite table
Write all inserts to a file (if run from the command line) or return a dataHandle to query the data in the SQLite table (if run from another Node app)
This solution handles the RAM problem, because SQLite automatically swaps to the hard drive when RAM is full.
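For reference, a minimal sketch of the generate-and-stage steps above, assuming the better-sqlite3 package as the SQLite binding (any binding works similarly) and with the batch contents left abstract:
const Database = require('better-sqlite3');

const db = new Database('staging.db');
db.exec('CREATE TABLE IF NOT EXISTS staging (dbName TEXT, collectionName TEXT, value TEXT)');

const insertRow = db.prepare('INSERT INTO staging (dbName, collectionName, value) VALUES (?, ?, ?)');
const insertBatch = db.transaction((rows) => {
  for (const row of rows) insertRow.run(row.dbName, row.collectionName, row.value);
});

// Called once per batch of 500 generated documents.
function stageBatch(dbName, collectionName, documents) {
  insertBatch(documents.map((doc) => ({
    dbName,
    collectionName,
    value: JSON.stringify(doc) // this is the stringify that later has to be undone with JSON.parse
  })));
  // once this returns, the documents array can be dropped and garbage-collected
}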
BUT
As you can see there are a lot of JSON.parse and JSON.stringify calls involved, which slow the whole process down drastically.
What I have thought:
Maybe I should modify IGeneratedField to use shortened property names (fieldName -> fn, fieldValue -> fv, fieldIsObject -> fio, fieldIsArray -> fia, ...)
This would make the required storage in the SQLite table smaller, BUT it would also make the code harder to read (see the sketch after this list)
Use a document-oriented database to handle the JSON data better (but I have not really found one that fits)
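For the first idea, a minimal sketch of what the key shortening could look like right before JSON.stringify; the short names and the helper are illustrative assumptions, and nested IGeneratedField values would need the same treatment recursively:
// Illustrative key map -- the short names are assumptions, not a fixed convention.
const SHORT_KEYS = {
  fieldName: 'fn',
  fieldValue: 'fv',
  fieldNeedsQuotations: 'fnq',
  fieldIsObject: 'fio',
  fieldIsJsonObject: 'fijo',
  fieldIsArray: 'fia',
  referenceKey: 'rk',
  referenceTo: 'rt'
};

function shortenField(field) {
  const out = {};
  for (const [longKey, shortKey] of Object.entries(SHORT_KEYS)) {
    if (field[longKey] !== undefined) out[shortKey] = field[longKey];
  }
  // fieldValue can itself contain nested IGeneratedField(s); those would need
  // the same shortening applied recursively before stringifying.
  return out;
}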
The Question
Is there any better solution to handle big objects like this in node?
Is my temporary solution OK? What is bad about it? Can it be changed to perform better?

Conceptually, generate items in a stream.
You don't need to hold all 1M users at once; you can add 10k at a time to the db.
For the messages, randomly sample 2n users from the db and have them send messages to each other. Repeat until satisfied.
Example:
// Assume Users and Messages are both db.collections
// Assume functions generateUser() and generateMessage(u1, u2) exist.
const _ = require('lodash');

const desiredUsers = 10000;
const desiredMessages = 5000000;
const blockSize = 1000;

(async () => {
  // Insert users one block at a time instead of building all of them up front.
  for (const i of _.range(desiredUsers / blockSize)) {
    const users = _.range(blockSize).map(generateUser);
    await Users.insertMany(users);
  }
  // For each message block, sample 2 * blockSize users and pair them up.
  for (const i of _.range(desiredMessages / blockSize)) {
    const users = await Users.aggregate([ { $sample: { size: 2 * blockSize } } ]).toArray();
    const messages = _.chunk(users, 2).map((usr) => generateMessage(usr[0], usr[1]));
    await Messages.insertMany(messages);
  }
})();
Depending on how you tweak the stream, you get a different distribution. This is a uniform distribution. You can get a more long-tailed distribution by interleaving the users and messages. For example, you might want to do this for message boards.
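A hedged sketch of what interleaving could look like, reusing the Users/Messages collections, the generateUser()/generateMessage() helpers and lodash from the block above; because every message block samples from whatever users exist so far, earlier users accumulate more messages, giving a longer tail:
(async () => {
  const userBlocks = desiredUsers / blockSize;
  const messagesPerUserBlock = desiredMessages / userBlocks;
  for (const i of _.range(userBlocks)) {
    const users = _.range(blockSize).map(generateUser);
    await Users.insertMany(users);

    // Generate this block's share of the messages from the users inserted so far.
    for (const j of _.range(messagesPerUserBlock / blockSize)) {
      const sampled = await Users.aggregate([ { $sample: { size: 2 * blockSize } } ]).toArray();
      const messages = _.chunk(sampled, 2).map((pair) => generateMessage(pair[0], pair[1]));
      await Messages.insertMany(messages);
    }
  }
})();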
Memory usage went to 200 MB after I switched the blockSize to 1000.

Related

Generate Firestore document's doc Id based on Users' uids

In my chat app, I have private chats between two users. I intend to set the chat document's id using these two users' docId/uid in such a way that it doesn't depend on the order they're combined, so I can determine the chat document's docId from the users' uids irrespective of the order.
I know I can use where clauses to get the chat doc as well. Is there any major flaw with my approach of generating the chat document's docId? Should I let it be generated automatically and use the normal where clauses supported by Firestore with limit(1) to get the chat?
Basically, it seems what I'm looking for is to encrypt uid1 in such a way that it returns a number only, do the same with uid2, and then add them together to create the chatId. That way it won't depend on the order I add them in, and I can maybe convert that number back to a string using Base64 encoding. This way, if I know the users participating in the chat, I can generate the same chatId. Will that work, or is there any flaw to it?
Converting each user ID to a number and then adding them together will likely lead to collisions. As a simple example, think of the many ways you can add up to the number 5: 0+5, 1+4, 2+3.
This answer builds upon #NimnaPerera's answer.
Method 1: <uid>_<uid>
If your app doesn't plan on using large groups, you can make use of the <uid>_<uid> format. To make sure the two user IDs are ordered in the same way, you can sort them first and then combine them together using some delimiter.
A short way to achieve this is to use:
const docId = [uid1, uid2].sort().join("_");
If you wanted to have a three-way group chat, you'd just add the new userID in the array:
const docId = [uid1, uid2, uid3].sort().join("_");
You could also turn this into a method for readability:
function getChatIdForMembers(userIds) {
  return userIds.sort().join("_");
}
Here's an example of it in action:
const uid1 = "apple";
const uid2 = "banana";
const uid3 = "carrot";
[uid1, uid2].sort().join("_"); // returns "apple_banana"
[uid1, uid3].sort().join("_"); // returns "apple_carrot"
[uid2, uid1].sort().join("_"); // returns "apple_banana"
[uid2, uid3].sort().join("_"); // returns "banana_carrot"
[uid3, uid1].sort().join("_"); // returns "apple_carrot"
[uid3, uid2].sort().join("_"); // returns "banana_carrot"
// chats to yourself are permitted
[uid1, uid1].sort().join("_"); // returns "apple_apple"
[uid2, uid2].sort().join("_"); // returns "banana_banana"
[uid3, uid3].sort().join("_"); // returns "carrot_carrot"
// three way chat
[uid1, uid2, uid3].sort().join("_"); // returns "apple_banana_carrot"
[uid1, uid3, uid2].sort().join("_"); // returns "apple_banana_carrot"
[uid2, uid1, uid3].sort().join("_"); // returns "apple_banana_carrot"
[uid2, uid3, uid1].sort().join("_"); // returns "apple_banana_carrot"
[uid3, uid1, uid2].sort().join("_"); // returns "apple_banana_carrot"
[uid3, uid2, uid1].sort().join("_"); // returns "apple_banana_carrot"
Method 2: Member list properties
If you intend on supporting group chats, you should use automatic document IDs (see CollectionReference#add()) and store a list of chat members as one of its fields, as introduced in #NimnaPerera's answer, for better use of queries.
I recommend two fields:
"members" - an array containing each chat member's ID. This allows you to query the /chats collection for chats that contain the given user.
"membersAsString" - a string, built from sorting "members" and joining them using "_". This allows you to query the /chats collection for chats that contain the exact list of members.
"chats/{chatId}": {
"members": string[], // list of users in this chat
"membersAsString": string, // sorted list of users in this chat, delimited using "_"
/* ... */
}
To find all chats that I am a part of:
const myUserId = firebase.auth().currentUser.uid;
const myChatsQuery = firebase.firestore()
  .collection("chats")
  .where("members", "array-contains", myUserId);

myChatsQuery.onSnapshot(querySnapshot => {
  // do something with list of chat documents
});
To find all three-way chats between Apple, Banana and me:
const myUserId = firebase.auth().currentUser.uid;
const members = [myUserId, "banana", "apple"];
const membersAsString = members.sort().join("_");
const groupChatsQuery = firebase.firestore()
  .collection("chats")
  .where("membersAsString", "==", membersAsString);

groupChatsQuery.onSnapshot(querySnapshot => {
  // do something with list of chat documents
  // normally this would return 1 result, but you may get
  // more than one result if a user gets added to/removed from a chat
});
A normal flow would be to:
Get a list of the relevant chats
For each chat, get the most recent message
Based on the most recent message, sort the chats in your UI
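A minimal sketch of that flow using the same v8-style API as above. It assumes each chat document has a "messages" subcollection whose documents carry a createdAt timestamp; both names are assumptions, not something the question defines:
const myUserId = firebase.auth().currentUser.uid;

firebase.firestore()
  .collection("chats")
  .where("members", "array-contains", myUserId)
  .get()
  .then(async (querySnapshot) => {
    const chats = await Promise.all(querySnapshot.docs.map(async (chatDoc) => {
      const latest = await chatDoc.ref
        .collection("messages")       // assumed subcollection name
        .orderBy("createdAt", "desc") // assumed timestamp field
        .limit(1)
        .get();
      return { id: chatDoc.id, lastMessage: latest.docs[0] ? latest.docs[0].data() : null };
    }));

    // newest chats first
    chats.sort((a, b) =>
      (b.lastMessage ? b.lastMessage.createdAt.toMillis() : 0) -
      (a.lastMessage ? a.lastMessage.createdAt.toMillis() : 0));
    // render chats in the UI...
  });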
You can very well use a combination of two users' uids to define your Firestore document IDs, as long as you respect the following constraints:
Must be valid UTF-8 characters
Must be no longer than 1,500 bytes
Cannot contain a forward slash (/)
Cannot solely consist of a single period (.) or double periods (..)
Cannot match the regular expression __.*__
What I'm not sure I understand in your question is "in such a way that it doesn't depend on the order they're combined". If you combine the uids of two users you need to combine them in a certain order. For example, uid1_uid2 is not equal to uid2_uid1.
As you are asking, #lightsaber, you can follow the following methods to achieve your objective. But my personal preference is using a where clause, because Firestore supports compound queries, which cannot be done in the Realtime Database.
Method 1
Create a support function to generate a chatId and check whether a document exists for that id. Then you can create the chat document or retrieve it using that id.
const getChatId = (currentUserId: string, guestUserId: string) => {
  /* Whichever order the two ids are passed in, localeCompare puts them in a
     consistent order, so this always returns the same id for a given pair. */
  const comp = currentUserId.localeCompare(guestUserId);
  if (comp === 0) {
    return null; // same user on both sides
  }
  if (comp < 0) {
    return currentUserId + '_' + guestUserId;
  } else {
    return guestUserId + '_' + currentUserId;
  }
}
Method 2
Use a where clause with an array-contains query for retrieving the chat document. When creating the chat, add the two user ids to an array and store that array under a relevant field name, as sketched below.
Firestore docs for querying arrays
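A short sketch of Method 2 in the same v8-style API used elsewhere on this page; the "members" field name and the createdAt field are assumptions:
// Creating a chat with an auto-generated document ID:
function createChatWith(currentUserId, guestUserId) {
  return firebase.firestore().collection("chats").add({
    members: [currentUserId, guestUserId],
    createdAt: firebase.firestore.FieldValue.serverTimestamp()
  });
}

// Retrieving every chat a given user belongs to:
function chatsFor(userId) {
  return firebase.firestore()
    .collection("chats")
    .where("members", "array-contains", userId)
    .get();
}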

Implementing Keras Model into website with Keras.js

I have been trying to implement a basic Keras model generated in Python into a website using the Keras.js library. Now, I have the model trained and exported into the model.json, model_weights.buf, and model_metadata.json files. Now, I essentially copied and pasted test code from the github page to see if the model would load in browser, but unfortunately I am getting errors. Here is the test code. (EDIT: I fixed some errors, see below for remaining ones.)
var model = new KerasJS.Model({
  filepaths: {
    model: 'dist/model.json',
    weights: 'dist/model_weights.buf',
    metadata: 'dist/model_metadata.json'
  },
  gpu: true
});

model.ready()
  .then(function() {
    console.log("1");
    // input data object keyed by names of the input layers
    // or `input` for Sequential models
    // values are the flattened Float32Array data
    // (input tensor shapes are specified in the model config)
    var inputData = {
      'input_1': new Float32Array(data)
    };
    console.log("2 " + inputData);
    // make predictions
    return model.predict(inputData);
  })
  .then(function(outputData) {
    // outputData is an object keyed by names of the output layers
    // or `output` for Sequential models
    // e.g.,
    // outputData['fc1000']
    console.log("3 " + outputData);
  })
  .catch(function(err) {
    console.log(err);
    // handle error
  });
EDIT: So I changed my program around a little to be compatible with ES5 (that was a stupid mistake on my part), and now I have encountered a different error. This error is caught and then logged. The error I get is: Error: predict() must take an object where the keys are the named inputs of the model: input. I believe this problem arises because my data variable is not in the correct format. I thought that if my model took in a 28x28 array of numbers, then data should also be a 28x28 array so that it could correctly "predict" the right output. However, I believe I am missing something and that is why the error is being thrown. This question is very similar to mine, however it is in Python and not JS. Again, any help would be appreciated.
Ok, so I figured out why this was happening. There were two problems. First, the data array needs to be flattened, so I wrote a quick function to take the 2D input and "flatten" it into a 1D array of length 784. Second, because I used a Sequential model, the key name of the data should not have been 'input_1', but rather just 'input'. This got rid of all the errors.
Now, to get the output information, we can simply store it in an array like this: var out = outputData['output']. Because I used the MNIST data set, out was a 1D array of length 10 containing the probability of each digit being the user-written digit. From there, you can simply find the digit with the highest probability and use that as the model's prediction.
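A minimal sketch of those two fixes (flattening the 28x28 grid and keying the input with 'input'); the grid variable and the logging are illustrative, and model is the KerasJS.Model from the question:
function flatten2D(grid) {
  var flat = new Float32Array(grid.length * grid[0].length);
  for (var row = 0; row < grid.length; row++) {
    for (var col = 0; col < grid[row].length; col++) {
      flat[row * grid[row].length + col] = grid[row][col];
    }
  }
  return flat;
}

var inputData = {
  'input': flatten2D(grid) // 28 x 28 -> Float32Array of length 784
};

model.predict(inputData).then(function(outputData) {
  var out = outputData['output']; // length-10 array of digit probabilities
  var prediction = 0;
  for (var k = 1; k < out.length; k++) {
    if (out[k] > out[prediction]) prediction = k;
  }
  console.log('Predicted digit: ' + prediction);
});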

What are the security considerations for the size of an array that can be passed over HTTP to a JavaScript server?

I'm dealing with the library qs in Node.js, which lets you stringify and parse query strings.
For example, if I want to send a query with an array of items, I would do qs.stringify({ items: [1,2,3] }), which would send this as my query string:
http://example.com/route?items[0]=1&items[1]=2&items[2]=3
(Encoded URI would be items%5B0%5D%3D1%26items%5B1%5D%3D2%26items%5B2%5D%3D3)
When I do qs.parse(url) on the server, I'd get the original object back:
let query = qs.parse(url) // => { items: [1,2,3] }
However, the default size of the array for qs is limited to 20, according to the docs:
qs will also limit specifying indices in an array to a maximum index of 20. Any array members with an index of greater than 20 will instead be converted to an object with the index as the key
This means that if I have more than 20 items in the array, qs.parse will give me an object like this (instead of the array that I expected):
{ items: { '0': 1, '1': 2 ...plus 19 more items } }
I can override this behavior by setting a param, like this: qs.parse(url, { arrayLimit: 1000 }), and this would allow a max array size of 1,000 for example. This would, thus, turn an array of 1,001 items into a plain old JavaScript object.
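A quick way to see both behaviours side by side (a sketch, assuming the qs package is installed):
const qs = require('qs');

// 25 indexed items: items[0]=0&items[1]=1&...&items[24]=24
const query = Array.from({ length: 25 }, (unused, i) => `items[${i}]=${i}`).join('&');

// default arrayLimit (20): the result is a plain object keyed by index
const asObject = qs.parse(query);
console.log(Array.isArray(asObject.items)); // false

// raised arrayLimit: the result stays a real array
const asArray = qs.parse(query, { arrayLimit: 1000 });
console.log(Array.isArray(asArray.items)); // true, length 25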
According to this github issue, the limit might be for "security considerations" (same in this other github issue).
My questions:
If the default limit of 20 is meant to help mitigate a DoS attack, how is turning an array of over 20 items into a plain old JavaScript object supposed to help? (Does the object take less memory or something?)
If the above is true, even if there is an array limit of, say, 20, couldn't the attacker just send more requests and still get the same DoS effect? (The number of requests necessary to be sent would decrease linearly with the size limit of the array, I suppose... so I guess the "impact" or load of a single request would be lower)

MongoDB - Query conundrum - Document refs or subdocument

I've run into a bit of an issue with some data that I'm storing in my MongoDB (Note: I'm using mongoose as an ODM). I have two schemas:
mongoose.model('Buyer', {
  credit: Number,
})
and
mongoose.model('Item', {
  bid: Number,
  location: { type: [Number], index: '2d' }
})
Buyer/Item will have a parent/child association, with a one-to-many relationship. I know that I can set up Items to be embedded subdocs to the Buyer document or I can create two separate documents with object id references to each other.
The problem I am facing is that I need to query Items whose bid is lower than the Buyer's credit but whose location is also near a certain geo coordinate.
To satisfy the first criterion, it seems I should embed Items as a subdoc so that I can compare the two numbers. But in order to compare locations with a geoNear query, it seems it would be better to separate the documents; otherwise, I can't perform geoNear on each subdocument.
Is there any way that I can perform both tasks on this data? If so, how should I structure my data? If not, is there a way that I can perform one query and then a second query on the result from the first query?
Thanks for your help!
There is another option (besides embedding and normalizing) for storing hierarchies in MongoDB, that is, storing them as tree structures. In this case you would store Buyers and Items in separate documents but in the same collection. Each Item document would need a field pointing to its Buyer (parent) document, and each Buyer document's parent field would be set to null. The docs I linked to explain several implementations you could choose from.
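A minimal sketch of that parent-reference pattern with mongoose; the single "Node" model, the kind discriminator and the field layout are illustrative assumptions, not the only way to do it:
const mongoose = require('mongoose');

const nodeSchema = new mongoose.Schema({
  kind: { type: String, enum: ['buyer', 'item'] },                              // what this document represents
  parent: { type: mongoose.Schema.Types.ObjectId, ref: 'Node', default: null }, // the Buyer for items, null for buyers
  credit: Number,                                                               // buyer-only field
  bid: Number,                                                                  // item-only field
  location: { type: [Number], index: '2d' }                                     // item-only field
});

const Node = mongoose.model('Node', nodeSchema);

// All items belonging to a given buyer:
// Node.find({ kind: 'item', parent: buyerId })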
If your items are stored in two separate collections then the best option will be to write your own function and call it using mongoose.connection.db.eval('some code...');. In that case you can execute your advanced logic on the server side.
You can write something like this:
var allNearItems = db.Items.find({
  location: {
    $near: {
      $geometry: {
        type: "Point",
        coordinates: [ <longitude>, <latitude> ]
      },
      $maxDistance: 100
    }
  }
});

var res = [];
allNearItems.forEach(function (item) {
  var buyer = db.Buyers.findOne({ id: item.buyerId });
  if (!buyer) return; // skip items without a matching buyer
  if (item.bid < buyer.credit) {
    res.push(item.id);
  }
});
return res;
After evaluation (place it in a mongoose.connection.db.eval("...") call) you will get the array of item ids.
Use it with caution. If your allNearItems result is too large or you query it very often, you can run into performance problems. The MongoDB team has actually deprecated direct JS code execution, but it is still available in the current stable release.
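Given that deprecation, a hedged alternative is to run the same logic from Node with the official driver instead of eval(); the connection string, database name and field names below just mirror the snippet above and are assumptions (note also that $near with $geometry wants a 2dsphere index):
const { MongoClient } = require('mongodb');

async function nearItemIdsWithinBudget(longitude, latitude) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('test'); // assumed database name

  const cursor = db.collection('Items').find({
    location: {
      $near: {
        $geometry: { type: 'Point', coordinates: [longitude, latitude] },
        $maxDistance: 100
      }
    }
  });

  const res = [];
  for await (const item of cursor) {
    const buyer = await db.collection('Buyers').findOne({ id: item.buyerId });
    if (buyer && item.bid < buyer.credit) {
      res.push(item.id);
    }
  }

  await client.close();
  return res;
}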

Dealing with a JSON object too big to fit into memory

I have a dump of a Firebase database representing our Users table stored in JSON. I want to run some data analysis on it but the issue is that it's too big to load into memory completely and manipulate with pure JavaScript (or _ and similar libraries).
Up until now I've been using the JSONStream package to deal with my data in bite-sized chunks (it calls a callback once for each user in the JSON dump).
I've now hit a roadblock though because I want to filter my user ids based on their value. The "questions" I'm trying to answer are of the form "Which users x" whereas previously I was just asking "How many users x" and didn't need to know who they were.
The data format is like this:
{
  "users": {
    "123": {
      "foo": 4
    },
    "567": {
      "foo": 8
    }
  }
}
What I want to do is essentially get the user ID (123 or 567 in the above) based on the value of foo. Now, if this were a small list it would be trivial to use something like _.each to iterate over the keys and values and extract the keys I want.
Unfortunately, since it doesn't fit into memory that doesn't work. With JSONStream I can iterate over it by using var parser = JSONStream.parse('users.*'); and piping it into a function that deals with it like this:
var stream = fs.createReadStream('my.json');
stream.pipe(parser);
parser.on('data', function(user) {
  // user is equal to { foo: bar } here
  // so it is trivial to do my filter
  // but I don't know which user ID owns the data
});
But the problem is that I don't have access to the key representing the star wildcard that I passed into JSONStream.parse. In other words, I don't know if { foo: bar} represents user 123 or user 567.
The question is twofold:
How can I get the current path from within my callback?
Is there a better way to be dealing with this JSON data that is too big to fit into memory?
I went ahead and edited JSONStream to add this functionality.
If anyone runs across this and wants to patch it similarly, you can replace line 83 which was previously
stream.queue(this.value[this.key])
with this:
var ret = {};
ret[this.key] = this.value[this.key];
stream.queue(ret);
In the code sample from the original question, rather than user being equal to { foo: bar } in the callback it will now be { uid: { foo: bar } }
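A small sketch of what the callback from the question looks like against the patched stream; the foo === 4 filter is just an illustration:
parser.on('data', function(entry) {
  // entry is now { <uid>: { foo: ... } }, so the key is the user ID
  var uid = Object.keys(entry)[0];
  var user = entry[uid];
  if (user.foo === 4) {
    console.log('user ' + uid + ' matches the filter');
  }
});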
Since this is a breaking change I didn't submit a pull request back to the original project but I did leave it in the issues in case they want to add a flag or option for this in the future.
