MongoDB seed script for 10 million entries takes 30 minutes - javascript

I have a project I'm working on where I have to seed a database with 10 million random rows, which I have successfully done. However, it takes about 30 minutes to complete, which is expected, but I know it could be faster. I would like to make it run even faster and figure out a way to seed 10 million random entries in under 10 minutes, preferably while still using MongoDB/Mongoose. This is my current seed file; any tips on making it run faster? First time posting on here, just FYI. Thanks!
I use 'node database/seed.js' to run this file in the terminal.
const db = require("./index.js");
const mongoose = require("mongoose");
const faker = require("faker");

const productSchema = mongoose.Schema({
  product_name: String,
  image: String,
  price: String
});

let Product = mongoose.model("Product", productSchema);

async function seed() {
  for (let i = 0; i < 10000000; i++) {
    let name = faker.commerce.productName();
    let image = faker.image.imageUrl();
    let price = faker.commerce.price();
    let item = new Product({
      product_name: `${name}`,
      image: `${image}`,
      price: `$${price}`
    });
    await item
      .save()
      .then(success => {})
      .catch(err => {});
  }
}

seed();

You can build batches of, say, 1 million records and use the insertMany function to bulk insert them into the database.

Use insertMany.
Inserts/updates always take time in any kind of database, so try to reduce the number of insert operations. Insert a batch every 1,000 loop iterations or so instead of saving one document at a time:
Model.insertMany(arr, function(error, docs) {});
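A minimal sketch of that approach, reusing the Product model and faker calls from the seed file above (the 10,000 batch size is an arbitrary choice, not from the original post):

// Sketch: batch the original loop with insertMany instead of per-document save().
// Assumes the same Product model and faker setup as the seed file above.
const BATCH_SIZE = 10000; // arbitrary; tune for your memory/throughput

async function seedBatched() {
  let batch = [];
  for (let i = 0; i < 10000000; i++) {
    batch.push({
      product_name: faker.commerce.productName(),
      image: faker.image.imageUrl(),
      price: `$${faker.commerce.price()}`
    });
    if (batch.length === BATCH_SIZE) {
      // one bulk write instead of 10,000 individual save() round trips;
      // ordered: false lets the server continue past individual failures
      await Product.insertMany(batch, { ordered: false });
      batch = [];
    }
  }
  if (batch.length > 0) {
    await Product.insertMany(batch, { ordered: false });
  }
}

seedBatched();

Each insertMany call sends one bulk write per batch, which avoids a separate round trip and save() call per document.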

Related

How can I store a large dataset result by chunks to a csv file in nodejs?

I have a MySQL table of about 10 million records, and I would like to send those records to a CSV file using Node.js.
I know I can run a query to get all records, store the result in a JSON-format variable, and send it to a CSV file using a library like fast-csv in conjunction with createWriteStream. Writing the result to the file is doable using a stream. But what I want to avoid is storing 10 million records in memory (suppose the records have a lot of columns).
What I would like to do is query only a subset of the result (for example 20k rows), store that info in the file, then query the next subset (the next 20k rows), append the results to the same file, and continue the process until it finishes. The problem I have right now is that I don't know how to control the execution of the next iteration. According to the debugger, different write operations are being executed at the same time because of the asynchronous nature of Node.js, giving me a file where some lines are mixed (multiple results on the same line) and the records are out of order.
I know the total execution time is affected by this approach, but in this case I prefer a controlled way that avoids RAM consumption.
For the database query I'm using Sequelize with MySQL, but the idea is the same regardless of the query method.
This is my code so far:
// Store file function receives:
// (String) filename
// (Boolean) headers: true on the first iteration to name the columns
// (json document) jsonData is the information to store in the file
// (Boolean) append: disabled on the first iteration to create a new file
const storeFile = (filename, headers, jsonData, append) => {
  const flags = append === true ? 'a' : 'w'
  const ws = fs.createWriteStream(filename, { flags, rowDelimiter: '\r\n' })
  fastcsv
    .write(jsonData, { headers })
    .on('finish', () => {
      logger.info(`file=${filename} created/updated successfully`)
    })
    .pipe(ws)
}
// main
let filename = 'test.csv'
let offset = 0
let append = false
let headers = true
const limit = 20000
const totalIterations = Math.ceil(10000000 / limit)

for (let i = 0; i < totalIterations; i += 1) {
  // eslint-disable-next-line no-await-in-loop
  const records = await Record.findAll({
    offset,
    limit,
    raw: true,
  })
  storeFile(filename, headers, records, append)
  headers = false
  append = true
  offset += limit // offset is incremented to get the next subset
}
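The interleaving happens because storeFile starts an asynchronous stream write and returns immediately, so the next iteration's write can begin before the previous one has finished. A minimal sketch of one way to serialize the chunks, assuming the same fs, fastcsv, logger, and Record setup as above, is to have the store function return a promise that resolves on the destination stream's 'finish' event and await it inside the loop:

// Sketch: make the store function awaitable so each chunk is fully written
// before the next query runs. Assumes the same fs, fastcsv and Record objects
// as in the question.
const storeFileAsync = (filename, headers, jsonData, append) =>
  new Promise((resolve, reject) => {
    const flags = append === true ? 'a' : 'w'
    const ws = fs.createWriteStream(filename, { flags })
    fastcsv
      .write(jsonData, { headers, rowDelimiter: '\r\n' })
      .pipe(ws)
      .on('finish', resolve) // the chunk has been flushed to disk
      .on('error', reject)
  })

// inside the loop, instead of storeFile(...):
// await storeFileAsync(filename, headers, records, append)

With this in place the sequential awaits are exactly what keeps the file ordered and the memory usage bounded to one chunk at a time.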

I can't introduce new elements in JSON from a MongoDB request with JS (Node.js)

Forgive me, but I have been trying to solve this problem for several days and I cannot locate the issue.
I am downloading two arrays with complementary data from MongoDB (with Mongoose). One contains the user data, the other the user's survey records.
Both arrays are related through the user's email:
/***************************************************************
 This is what I download from mongoDB with Node.js
****************************************************************

const user = require('../models/user');
const test = require('../models/test');

let users = await user.find({ email: userEmail });
let tests = await test.find({ email: testEmail });

Result of the request:

let users: [
  { name: 'pp', email: 'pp#gmail.com' },
  { name: 'aa', email: 'aa#gmail.com' }
];
let tests: [
  { email: 'pp#gmail.com', satisfaction: '5' },
  { email: 'aa#gmail.com', satisfaction: '2' }
];
*****************************************************************************/
Now I try to relate both JSON arrays using:
for (let i = 0; i < users.length; i++) {
  for (let z = 0; z < tests.length; z++) {
    if (users[i].email == tests[z].email) {
      users[i]['satisfation'] = tests[z].satisfation;
    }
  }
}
If I do a console.log(users[0]), what I want is:
{name:'pp', email:'pp#gmail.com', satisfation:'5'}
But I receive:
{name:'pp', email:'pp#gmail.com'}
However, if I do a console.log(users[0].satisfation), the result is: 5
Please, can someone help me? Thank you very much.
Note:
If, instead of downloading the arrays from MongoDB, I write them by hand, it works perfectly. Could it be a lock on the models?
MORE INFORMATION
Although I have given a simple example, the user and test arrays are much more complex in my application; however, they are modeled and correctly managed in MongoDB.
The reason for having two documents in MongoDB is that there are more than 20 variables in each of them. In user I save fixed user data, and in test the data I collect over time. Later, I download both arrays to my server where I perform statistical calculations, and for this I need to generate a single array with data from both. I just take the test data and add it to the user to relate all the parameters. I thought that this way I would relieve MongoDB from continuous queries and subsequent writes, since I perform the operations on the server and finally update the test array in MongoDB.
When I query the database I receive the following array of documents:
let registroG = await datosProf.find({ idAdmin: userAdmin });
res.json(registroG);
This is the response received in the front-end client, and if I open the object, document [0] is as shown in the screenshots (not reproduced here).
THE QUESTION:
Why, when I try to add a key/value to the object, is it not included?
You could use Array.map with the ES6 spread operator to merge the two objects:
let users = [{ name: 'pp', email: 'pp#gmail.com' }, { name: 'aa', email: 'aa#gmail.com' }];
let tests = [{ email: 'pp#gmail.com', satisfaction: '5' }, { email: 'aa#gmail.com', satisfaction: '2' }];

let result = users.map(v => ({
  ...v,
  ...tests.find(e => e.email == v.email)
}));

console.log(result);
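For the sample arrays above, console.log(result) prints the merged objects:

// [
//   { name: 'pp', email: 'pp#gmail.com', satisfaction: '5' },
//   { name: 'aa', email: 'aa#gmail.com', satisfaction: '2' }
// ]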

How to count a huge list of items

I have a huge list of items covering almost all crops, and this data is to be plotted using maps and charts. I would like to count the number of each crop, say, how many times cabbage was planted. I use the Firebase database to store the data, and I retrieve it using the function below:
database = firebase.database();
var ref = database.ref('Planting-Calendar-Entries');
ref.on('value', gotData, errData);

function gotData(data) {
  console.log(data.val());
  var veggie = data.val();
  var keys = Object.keys(veggie);
  console.log(keys);
  let counter = 0;
  for (var i = 0; i < keys.length; i++) {
    var k = keys[i];
    var Veg_planted = veggie[k].Veg_planted;
    var coordinates = veggie[k].coordinates;
    if (Veg_planted == 'Cabbage') {
      counter++;
    }
    // vegAll = Veg_planted.count()
    console.log(Veg_planted, coordinates);
  }
  console.log(counter);
}

function errData(err) {
  console.log('Error!');
  console.log(err);
}
I retrieve this data from the database, where it gets updated whenever someone submits their planting information. The code above only works if my list is small, but I have a list of about 170 items, and it would be tedious to write code that counts each crop individually with something like let counter = 0, counter++. Is there a way I could get around this?
I'm assuming data.val() returns an array, not an object, and you're misusing Object.keys() on an array instead of just looping over the array itself. If that's true, then it sounds like you want to group by the Veg_planted key and count the groupings:
const counts = Object.values(veggie).reduce((counts, { Veg_planted }) => ({
  ...counts,
  [Veg_planted]: (counts[Veg_planted] || 0) + 1
}), {});
Usage:
const veggie = [{ Veg_planted: 'Cabbage' }, { Veg_planted: 'Cabbage' }, { Veg_planted: 'Corn' }];
// result of counts:
// {Cabbage: 2, Corn: 1}
Actually: the code to count the items is probably going to be the same, no matter how many items there are. The thing that is going to be a problem as you scale though is the amount of data that you have to retrieve that you're not displaying to the user.
Firebase does not support aggregation queries, and your approach only works for short lists of items. For a more scalable solution, you should store the actual count itself in the database too.
So:
Have a blaCount property for each bla that exists.
Increment/decrement the counter each time you write/remove a bla to/from the database.
Now you can read only the counters, instead of having to read the individual items.
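A minimal sketch of that counter pattern on the Realtime Database, assuming a veggieCounts node next to Planting-Calendar-Entries (the node name and entry shape are illustrative, not from the original code); a transaction keeps the increment safe under concurrent writes:

// Sketch: keep a per-crop counter up to date whenever an entry is written.
// 'veggieCounts' and the entry shape are assumptions, not from the original code.
function addEntry(entry) {
  var db = firebase.database();
  // write the entry itself
  var entryRef = db.ref('Planting-Calendar-Entries').push();
  return entryRef.set(entry).then(function () {
    // atomically bump the counter for this crop
    return db.ref('veggieCounts/' + entry.Veg_planted)
      .transaction(function (current) {
        return (current || 0) + 1;
      });
  });
}

// Reading the counts later is a single small read:
// db.ref('veggieCounts').once('value').then(snap => console.log(snap.val()));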
Firestore would be a better option, since you can query based on a field value:
var plantingRef = db.collection("PlantingCalendarEntries");
var query = plantingRef.where("Veg_planted", "==", "Cabbage");
If you still want to stick with the Realtime Database, either save counters to the database or use Cloud Functions to maintain the counts.
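To actually read the number of matches from that query, a minimal sketch (using the same classic web SDK style as the snippet above) is to fetch the snapshot and use its size:

// Sketch: run the query above and count the matching documents.
query.get().then(function (snapshot) {
  console.log("Cabbage entries:", snapshot.size);
});

Note that this still downloads every matching document, so for large lists the counter approach described above stays the cheaper read.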

Firestore call extremely slow - how to make it faster?

I'm migrating from MongoDB to Firestore and I've been struggling to make this call faster. So, to give you some context, this is how every document in the collection looks:
Document sample: (screenshot not reproduced here)
The "entities" object may have the following fields: "customer", "supplier", "product", "product_group" and/or "origin". What I'm trying to do is build a ranking, e.g., a supplier ranking for docs with origin = "A" and product = "B". That means I have to get from the DB all documents with entities = { origin: "A", product: "B", supplier: exists }.
The collection has more than 300,000 documents, and I make between 4 and 10 calls depending on the "entities", which return between 0 and a few hundred results each. These calls take excessively long to execute in Firestore (between 10 and 30 seconds in total), whereas with MongoDB they took around 2-3 seconds in total. This is how my code looks as of now (following the previous example, the parameters to the function would be entities = { origin: "A", product: "B" } and entity = "supplier"):
async getRanking(entities: { [key: string]: any }, stage: string, metric: string, organisationId: string, entity: string) {
  let indicators: IndicatorInsight[] = [];
  const entitySet = ['product', 'origin', 'product_group', 'supplier', 'customer'];
  const entitiesKeys = [];

  // this way we always keep the values in the same order as in entitySet
  for (const ent of entitySet) {
    if ([...Object.keys(entities), entity].includes(ent)) {
      entitiesKeys.push(ent);
    }
  }

  let rankingRef = OrganisationRef
    .doc(organisationId)
    .collection('insight')
    .where('stage', '==', stage)
    .where('metric', '==', metric)
    .where('insight_type', '==', 'indicator')
    .where('entities_keys', '==', entitiesKeys);

  for (const ent of Object.keys(entities)) {
    rankingRef = rankingRef.where(`entities.${ent}`, '==', entities[ent]);
  }

  (await rankingRef.get()).forEach(snap => indicators.push(snap.data() as IndicatorInsight));

  return indicators;
}
So my question is: do you have any suggestions on how to improve the structure of the documents and the querying in order to improve the performance of this call? As I mentioned, with MongoDB this was quite fast, 2-3 seconds tops.
I've been told Firestore should be extremely fast, so I guess I'm not doing things right here, and I'd really appreciate your help. Please let me know if you need more details.

How can I target a deeply nested object in google realtime database?

I want to make a cron job that deletes deeply nested objects in my realtime database that are older than 24 hours.
I have looped through and reached the deeply nested object, but I can't grab/target the value of "addedTime" in the object. How do I grab that value so I can run .remove on the parent? So far, it comes back as undefined or it throws an error.
.schedule("every 1 hours")
.onRun(context => {
const rootDatabaseRef = admin.database().ref("ghostData/");
return rootDatabaseRef.ref.once("value").then(function(snapshot) {
console.log("snap", snapshot.val());
snapshot.forEach(function(userSnapshot) {
let buckets = userSnapshot.val().buckets;
console.log("buckets", buckets);
buckets.forEach(function(bucket) {
let currentTimeYesterday = new Date(
new Date().getTime() - 24 * 60 * 60 * 1000
).getTime();
let addedTime = bucket.val().addedTime;
console.log("curr time", currentTimeYesterday);
console.log("addedTime", addedTime);
});
});
Here is the data in my realtime database as well as the logs from the serverless cloud functions (screenshots not reproduced here).
I think you're having problems with the looping: when you do buckets.forEach(function(bucket)), bucket is the first element of the list, and every element contains a nested dictionary. So you first have to iterate the dictionary, and for each key in the dictionary you'll get another dictionary, from which you grab only the addedTime value.
I know it's difficult to follow, but I think it's happening because you're not looping correctly.
Try the code below or something similar.
buckets.forEach(function(bucket) {
  let currentTimeYesterday = new Date(
    new Date().getTime() - 24 * 60 * 60 * 1000
  ).getTime();
  bucket.forEach(function(dict) {
    Object.keys(dict).forEach(k => {
      console.log(k, ':', dict[k].addedTime);
      let addedTime = dict[k].addedTime;
      // ...
    });
  });
  // ...
});
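Putting the question and this answer together, a rough sketch of the cleanup pass (placed inside the once("value") callback of the scheduled function) could iterate the snapshots rather than the plain values, so each entry's ref is available for removal; the ghostData/<user>/buckets/<bucket>/<entry>/addedTime layout is assumed from the question, not confirmed:

// Sketch only: the data layout is an assumption, adjust the child paths to the real structure.
const cutoff = Date.now() - 24 * 60 * 60 * 1000;
const removals = [];

snapshot.forEach(function(userSnapshot) {
  userSnapshot.child("buckets").forEach(function(bucketSnapshot) {
    bucketSnapshot.forEach(function(entrySnapshot) {
      const addedTime = entrySnapshot.val().addedTime;
      if (addedTime && addedTime < cutoff) {
        // the snapshot's ref points at this entry, so it can be removed directly
        removals.push(entrySnapshot.ref.remove());
      }
    });
  });
});

return Promise.all(removals);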
