Better performance when saving large JSON file to MySQL - javascript

I have an issue.
So, my story is:
I have a 30 GB JSON file of all reddit posts in a specific timeframe.
I will not insert all values of each post into the table.
I have followed this series, and he coded what I'm trying to do in Python.
I tried to follow along (in NodeJS), but when I'm testing it, it's way too slow. It inserts one row every 5 seconds, and there are 500,000+ reddit posts, so that would literally take years.
So here's an example of what I'm doing:
const fs = require('fs');
const oboe = require('oboe');

var readStream = fs.createReadStream(location);
oboe(readStream)
    .done(async function(data) {
        let { parent_id, body, created_utc, score, subreddit } = data;
        let comment_id = data.name;
        // Checks if there is a comment with the comment id of this post's parent id in the table
        getParent(parent_id, function(parent_data) {
            // Checks if there is a comment with the same parent id, and then checks which one has a higher score
            getExistingCommentScore(parent_id, function(existingScore) {
                // other code above but it isn't relevant for my question
                // this function adds the query I made to a batch of queries
                addToTransaction()
            })
        })
    })
Basically, that starts a read stream and passes it on to a module called oboe, which gives me back parsed JSON.
Then it checks whether this post's parent comment is already saved in the database, and then whether there is an existing comment with the same parent id.
I need both functions in order to get the data that I need (only keeping the "best" comment).
This is roughly what addToTransaction looks like:
var transactions = [];

function addToTransaction(query) {
    // adds the query to an array, then checks if the length of that array is 1000 or more
    transactions.push(query);
    if (transactions.length >= 1000) {
        connection.beginTransaction(function(err) {
            if (err) throw new Error(err);
            for (var n = 0; n < transactions.length; n++) {
                let thisQuery = transactions[n];
                connection.query(thisQuery, function(err) {
                    if (err) throw new Error(err);
                })
            }
            connection.commit();
        })
    }
}
What addToTransaction does is take the queries I made and push them into an array, then check the length of that array; once it reaches 1000 it creates a new transaction, executes all those queries in a for loop, and then commits (to save).
Problem is, it's so slow that the callback function I made doesn't even get called.
My question (finally) is, is there any way I could improve the performance?
(If you're wondering why I am doing this, it is because I'm trying to create a chatbot)
I know I've posted a lot, but I tried to give you as much information as I could so you have a better chance of helping me. I appreciate any answers, and I will answer any questions you have.
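One possible direction, sketched below, is to collapse each batch into a single multi-row INSERT using the mysql package's nested-array expansion instead of running 1000 separate connection.query calls inside the transaction. The table and column names here are placeholders, not the actual schema:
// Sketch only: collect plain row values instead of full query strings,
// then let the mysql driver expand the nested array into one multi-row INSERT.
var rows = []; // e.g. [[comment_id, parent_id, body, score], ...]

function flushBatch(callback) {
    if (rows.length === 0) return callback();
    var batch = rows.splice(0, rows.length);
    connection.query(
        "INSERT INTO comments (comment_id, parent_id, body, score) VALUES ?",
        [batch],
        callback
    );
}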

Related

Electron: Length of an array set to 0 after going outside a query

I'm starting a project, a kind of interactive text game, on Electron. I have a database with 4 tables to use ONLY at the start of the project, in order to get info about characters, places and stuff like that.
The problem is that, after getting outside the query, the array where I put the character data says its length is 0, but it shows the correct info.
In this photo you can see that the array above says it has more than five hundred items (this comes from another version of the project with other techs), but the line under it, the console.log of the length, says it is 0:
https://i.stack.imgur.com/6go8C.png
This is the code of the query
connection.query("SELECT * FROM persona", function (error, rows) {
if (error) {
throw error;
} else {
obtenerPersonas(rows);
}
})
function obtenerPersonas(rows) {
rows.forEach(row => {
/*I try to not introduce the data directly, but using local variables that I previusly create. It doesn't work*/
auxPer = new Persona(
row.NombreClave,
row.NombreMostrar,
row.Apodo,
row.Localizacion,
row.Genero,
row.Sexualidad,
row.ActDep,
row.ComePrimero,
row.ActPreHombres,
row.ActPreMujeres,
row.Prota
);
listaPersonas.push(auxPer);
});
}
/*This are the console.log of the picture, if I put them inside the connection query, they show all perfecty, the array content and length*/
console.log(listaPersonas);
console.log(listaPersonas.length);
If you need more info, tell me, please. I need to solve this; without the length, I can't advance any further.
connection.query is an asynchronous function, so its callback runs after console.log(listaPersonas); console.log(listaPersonas.length); have already executed. listaPersonas only fills up later, which is why console.log(listaPersonas) can still display the contents in the screenshot while the length read at that moment is 0. If you need the length, you have to put your code after the call to obtenerPersonas:
connection.query("SELECT * FROM persona", function (error, rows) {
if (error) {
throw error;
} else {
obtenerPersonas(rows);
// write code here
}
})
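If it helps, here is a small promise-based sketch of the same idea; cargarPersonas is just an illustrative name, and it assumes the connection, obtenerPersonas and listaPersonas from the question:
// Resolve once the personas are loaded, so anything that needs the
// populated array runs in the .then() below.
function cargarPersonas() {
    return new Promise((resolve, reject) => {
        connection.query("SELECT * FROM persona", (error, rows) => {
            if (error) return reject(error);
            obtenerPersonas(rows);
            resolve(listaPersonas);
        });
    });
}

cargarPersonas().then((personas) => {
    console.log(personas);
    console.log(personas.length); // correct length here
});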

Ag-grid: duplicate node id 107 detected from getRowNodeId callback , this could cause issues in your grid

I am doing live data streaming into an ag-grid data table, so I set deltaRowDataMode in gridOptions and added a getRowNodeId method that returns the unique 'id' value.
With that, I get live updates on my grid table at the interval I set, but some rows are duplicated, so I can see the total count increase a bit each time the updated data loads. The question title is the warning message from the browser console; I get a bunch of these messages with different id numbers. According to the docs, this mode is supposed to detect duplicates and only add new rows when they don't already exist. Of course, there are several ways to refresh data live, but I chose this one because it is said to preserve grid state such as selected rows, the current scroll position on the grid, and so on. I am using vanilla JS, not any framework.
How do I make the data update live periodically without changing any current grid state? There is no error in the code, so please don't focus on hunting for a bug. Maybe my current implementation is wrong; either way, I want to know the right idea or hear about any implementation experience with this.
let gridOptions = {
    // ...
    deltaRowDataMode: true,
    getRowNodeId: (data) => {
        return data.id; // return the property you want set as the id.
    }
}
fetch(loadUrl).then((res) => {
    return res.json()
}).then((data) => {
    gridOptions.api.setRowData(data);
})
...
If you get the
duplicate node warning
it means your getRowNodeId() returns the same value for 2 different rows.
Here is the relevant part from the source:
if (this.allNodesMap[node.id]) {
    console.warn("ag-grid: duplicate node id '" + node.id + "' detected from getRowNodeId callback, this could cause issues in your grid.");
}
So try checking your data again.
If you are 100% sure the error is not related to your data, cut out the private data and create a plunker/stackblitz example that reproduces your issue; then it will be simpler to check and help you.
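As a quick way to check, here is a small sketch that scans the fetched rows for duplicate ids before handing them to the grid (it assumes the fetch and gridOptions from the question, and that id is the field your getRowNodeId returns):
fetch(loadUrl)
    .then((res) => res.json())
    .then((data) => {
        // count how many times each id occurs in the incoming payload
        const seen = new Map();
        for (const row of data) {
            seen.set(row.id, (seen.get(row.id) || 0) + 1);
        }
        const duplicates = [...seen.entries()].filter(([, count]) => count > 1);
        if (duplicates.length > 0) {
            console.warn("duplicate ids in incoming data:", duplicates);
        }
        gridOptions.api.setRowData(data);
    });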

Insert multiple entries to SQL Server with Node.js

I am rewriting an old API, and I'm trying to insert multiple values at once into an MSSQL Server (2008) database using the node module mssql. I am capable of doing this somehow, but I want to do it following best practices. I've done my research and tried a lot of things to accomplish this, but I was not able to find a single solution that works just right.
Before
You may wonder:
Well, you are rewriting this API, so there must be a way this has been done before and that was working?
Sure, you're right, it was working before, but... not in a way I'd feel comfortable using in the rewrite. Let me show you how it was done before (with a little bit of abstraction added, of course):
const request = new sql.Request(connection);
let query = "INSERT INTO tbl (col1, col2, col3, col4) VALUES ";
for (/* basic for loop w/ counter variable i */) {
    query += "(1, @col2" + i + ", @col3" + i + ", (SELECT x FROM y WHERE z = @someParam" + i + "))";
    // a check whether to add a comma or not
    request.input("col2" + i, sql.Int(), values[i]);
    // ...
}
request.query(query, function(err, recordset) {
    // ...
});
While this is working, again, I don't quite think this could be called anything like 'best practice'. Also this shows the biggest problem: a subselect is used to insert a value.
What I tried so far
The easy way
At first I tried the probably easiest thing:
// simplified
const sQuery = "INSERT INTO tbl (col1, col2, col3, col4) VALUES (1, @col2, @col3, (SELECT x FROM y WHERE z = @col4));";
oPool.request().then(oRequest => {
    return oRequest
        .input("col2", sql.Int(), aValues.map(oValue => oValue.col2))
        .input("col3", sql.Int(), aValues.map(oValue => oValue.col3))
        .input("col4", sql.Int(), aValues.map(oValue => oValue.col4))
        .query(sQuery);
});
I'd say this was a pretty good guess, and it actually works relatively fine.
Except for the part where it ignores every item after the first one... which makes it pretty useless. So, I tried...
Request.multiple = true
...and I thought it would do the job. But - surprise - it doesn't; still only the first item is inserted.
Using '?' for parameters
At this point I really started searching for a solution, as the second attempt had only required a quick look at the module's documentation.
I stumbled upon this answer and tried it immediately.
Didn't take long for my terminal to spit out a
RequestError: Incorrect syntax near '?'.
So much for that.
Bulk inserting
Some further research led to bulk inserting.
Pretty interesting, cool feature and excellent updating of the question with the solution by the OP!
I had some struggle getting started here, but eventually it looked really good: Multiple records were inserted and the values seemed okay.
Until I added the subquery. Using it as the value for a column didn't cause any error; however, when checking the values in the table, it simply showed a 0 for that column. Not a big surprise at all, but everybody can dream, right?
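For context, a minimal sketch of what bulk inserting with mssql roughly looks like; the table and column names are placeholders, and note that a per-row subquery has no place here, which matches the limitation described above:
const table = new sql.Table("tbl");   // placeholder table name
table.create = false;                 // the table already exists
table.columns.add("col1", sql.Int, { nullable: false });
table.columns.add("col2", sql.Int, { nullable: true });
table.columns.add("col3", sql.Int, { nullable: true });
table.columns.add("col4", sql.Int, { nullable: true });

aValues.forEach(oValue => {
    // only literal values can go into a row - no subselect per row
    table.rows.add(1, oValue.col2, oValue.col3, oValue.col4);
});

oPool.request().then(oRequest => oRequest.bulk(table));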
The lazy way
I don't really know what to think about this:
// simplified
Promise.all(aValues.map(oValue => {
    return oPool.request().then(oRequest =>
        oRequest
            .input("col2", sql.Int, oValue.col2)
            .input("col3", sql.Int, oValue.col3)
            .input("col4", sql.Int, oValue.col4)
            .query(sQuery)
    );
}));
It does the job, but if any of the requests fails for whatever reason, the other, non-failing inserts will still be executed, even though that should not be possible.
Lazy + Transaction
As continuing even if some inserts fail was the major problem with the last method, I tried building a transaction around it. All queries are successful? Good, commit. Any query has an error? Well, just roll back then. So I built a transaction, moved my Promise.all construct into it and tried again.
Aaand the next error pops up in my terminal:
TransactionError: Can't acquire connection for the request. There is another request in progress.
If you came this far, I don't need to tell you what the problem is.
Summary
What I haven't tried yet (and I don't think I will) is using the transaction approach and calling the statements sequentially. I do not believe that this is the way to go.
And I also don't think the lazy way is the one that should be used, as it uses a single request for every record to insert, when this could somehow be done using only one request. It's just that this "somehow" is, I don't know, not in my head right now. So please, if you have anything that could help me, tell me.
Also, if you see anything else that's wrong with my code, feel free to point it out. I am not considering myself as a beginner, but I also don't think that learning will ever end. :)
The way I solved this was using the PQueue library with concurrency 1. It's slow due to the concurrency of one, but it works with thousands of queries:
// assumes the mssql module as sql (with a connection already established),
// an array of query strings in `queries`, and PQueue from the p-queue package
const transaction = new sql.Transaction();
const request = new sql.Request(transaction);
const queue = new PQueue({ concurrency: 1 });
// begin transaction
await transaction.begin();
for (const query of queries) {
    queue.add(async () => {
        try {
            await request.query(query);
        } catch (err) {
            // stop pending queries
            queue.clear();
            await queue.onIdle();
            // rollback transaction
            await transaction.rollback();
            // throw error
            throw err;
        }
    });
}
// await queue
await queue.onIdle();
// commit transaction
await transaction.commit();

Copy a Parse.com class to new class with transformation of values

There is an existing Parse.com class that needs to be copied to a new Parse.com class with some new columns and a transformation of one of the columns. The code currently works and uses the Parse.Query.each method to iterate over all records, as detailed in the Parse.com documentation, but it stops processing at 831 records even though there are 12k+ records in the class. This is odd, given that each should not have a limit, and the other default limits are 100 or 1000 for find. Should another method be used to iterate over all records, or is there something wrong with the code?
var SourceObject = Parse.Object.extend("Log_Old_Class");
var source_query = new Parse.Query(SourceObject);
var TargetObject = Parse.Object.extend("Log_New_Class");
source_query.each(function(record) {
    // saving the record to the new class works fine
    var target_query = new TargetObject();
    target_query.set("col1_new", record.get("col1"));
    target_query.set("col2_new", record.get("col2"));
    // etc...
    target_query.save(null, {
        success: function(obj) {
            // SAVED
        },
        error: function(obj, error) {
            // ERROR
        }
    });
}).then(function() {
    // DONE
},
function(error) {
    // error
});
One thing that comes to my mind immediately is that the function is getting timed out. Parse has time limitations on each function. If I were you, I'd first load all the objects from the source class and then add them separately, with a delay between the API calls (server overload issues can also be present).
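A rough sketch of that idea, paging through the source class in batches with a small pause between them; the batch size and delay are arbitrary, and the column names are the placeholders from the question:
var SourceObject = Parse.Object.extend("Log_Old_Class");
var TargetObject = Parse.Object.extend("Log_New_Class");

function copyBatch(lastCreatedAt) {
    var query = new Parse.Query(SourceObject);
    query.ascending("createdAt");
    query.limit(500); // arbitrary batch size
    if (lastCreatedAt) {
        query.greaterThan("createdAt", lastCreatedAt);
    }
    return query.find().then(function(records) {
        if (records.length === 0) {
            return; // all records copied
        }
        var targets = records.map(function(record) {
            var target = new TargetObject();
            target.set("col1_new", record.get("col1"));
            target.set("col2_new", record.get("col2"));
            // etc...
            return target;
        });
        // save the whole batch, wait a second, then fetch the next batch
        return Parse.Object.saveAll(targets).then(function() {
            return new Promise(function(resolve) { setTimeout(resolve, 1000); });
        }).then(function() {
            return copyBatch(records[records.length - 1].createdAt);
        });
    });
}

copyBatch().then(function() {
    // DONE
});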

Adding object to PFRelation through Cloud Code

I am trying to add an object to a PFRelation in Cloud Code. I'm not too comfortable with JS but after a few hours, I've thrown in the towel.
var relation = user.relation("habits");
relation.add(newHabit);
user.save().then(function(success) {
response.success("success!");
});
I made sure that user and habit are valid objects so that isn't the issue. Also, since I am editing a PFUser, I am using the masterkey:
Parse.Cloud.useMasterKey();
Don't throw in the towel yet. The likely cause is hinted at by the variable name newHabit. If it's really new, that's the problem. Objects being saved to relations have to have once been saved themselves. They cannot be new.
So...
var user = // got the user somehow
var newHabit = // create the new habit
// save it, and use promises to keep the code organized
newHabit.save().then(function() {
    // newHabit is no longer new, so maybe that wasn't a great variable name
    var relation = user.relation("habits");
    relation.add(newHabit);
    return user.save();
}).then(function(success) {
    response.success(success);
}, function(error) {
    // you would have had a good hint if this line was here
    response.error(error);
});
