How to insert 10 million rows into MySQL database with Knex.js? - javascript

I'm trying to insert 10M+ rows into a MySQL database using Knex.js. Is there a way to use a for loop to insert arrays of length 10,000? That seems to be the maximum size I am able to insert; anything larger than that fails with "Error: ER_NET_PACKET_TOO_LARGE: Got a packet bigger than 'max_allowed_packet' bytes".
I tried using a promise chain, but the chain would have to be very long to accommodate 10M records.
exports.seed = (knex) => {
  // Deletes ALL existing entries
  return knex('books').del()
    .then(() => {
      const fakeBooks = [];
      for (let i = 0; i < 10000; i += 1) {
        fakeBooks.push(createFakeBooks());
      }
      return knex('books').insert(fakeBooks)
        .then(() => {
          const fakeBooks1 = [];
          for (let i = 0; i < 10000; i += 1) {
            fakeBooks1.push(createFakeBooks());
          }
          return knex('books').insert(fakeBooks1)
            .then(() => {
              const fakeBooks2 = [];
              for (let i = 0; i < 10000; i += 1) {
                fakeBooks2.push(createFakeBooks());
              }
              ...

It's easier if you use async and await and ditch the thens. It can then be written like this:
exports.seed = async (knex) => {
  await knex('books').del();
  let fakeBooks = [];
  for (let i = 1; i <= 10000000; i += 1) {
    fakeBooks.push(createFakeBooks());
    if (i % 1000 === 0) {
      await knex('books').insert(fakeBooks);
      fakeBooks = [];
    }
  }
};
await pauses the function until the insert promise has settled, without blocking the thread. The loop runs ten million times and inserts into the database after every 1,000 rows. You can change it to 10,000 rows per batch, but you might as well use 1,000 to be safe.
I only tried with one million rows myself, as it took too much time to insert ten million.

You can use https://knexjs.org/#Utility-BatchInsert, which is made for inserting large numbers of rows into the DB.
await knex.batchInsert('books', create10MFakeBooks(), 5000)
However, you might want to actually create those books in smaller batches to avoid using gigabytes of memory, so MikaS's answer is valid: just use async / await and it is trivial to write.
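For example, here is a rough sketch (assuming the createFakeBooks helper from the question) that builds the rows in chunks and hands each chunk to batchInsert, so only one chunk is ever held in memory:
exports.seed = async (knex) => {
  await knex('books').del();
  const total = 10000000;
  const chunkSize = 10000;
  for (let done = 0; done < total; done += chunkSize) {
    // build only one chunk of fake rows at a time
    const chunk = [];
    for (let i = 0; i < chunkSize; i += 1) {
      chunk.push(createFakeBooks());
    }
    // batchInsert splits the chunk into inserts of 1000 rows each
    await knex.batchInsert('books', chunk, 1000);
  }
};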
I would not use knex for this kind of job, but raw SQL.
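If you go that route, a minimal sketch of a multi-row raw insert could look like this (the column names title and author are assumptions on my part; createFakeBooks comes from the question):
const rows = [];
for (let i = 0; i < 1000; i += 1) {
  rows.push(createFakeBooks());
}
// one "(?, ?)" placeholder group per row, with the bindings flattened in the same order
const placeholders = rows.map(() => '(?, ?)').join(', ');
const bindings = rows.flatMap(r => [r.title, r.author]);
await knex.raw(`INSERT INTO books (title, author) VALUES ${placeholders}`, bindings);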

Related

Iterate over a Range fast in Excelscript for web

I want to check whether the cells in a range are empty or have any values in them. I use this for loop:
for (let i = 0; i <= namesRange.getCellCount(); i++) {
  if (namesRange.getCell(i, 0).getText() == "") {
    break;
  }
  bookedCount += 1;
}
However, this iteration is extremely slow (as is the use of Range.getValue; the console warns you that iterating with .getValue is slow, but it does not warn you about getText). It takes several seconds to iterate over a very short list of 10 elements.
Is there any way to check for the values of a cell in a speedy manner using ExcelScripts?
Does this mean that, even if I develop a UDF or a ribbon Add-In with office.js and Node.js it will also be this extremely slow for iterating over cells?
Is there any way to make this faster?
The reason your code is likely performing slowly is that the calls to getCell() and getText() are expensive. Instead of performing these calls every time in the loop you can try a different approach. One approach is to get an array of the cell values and iterate over that. You can use your namesRange variable to get the array of values. And you can also use it to get the row count and the column count for the range. Using this information, you should be able to write nested for loops to iterate over the array. Here's an example of how you might do that:
function main(workbook: ExcelScript.Workbook) {
  let namesRange: ExcelScript.Range = workbook.getActiveWorksheet().getRange("A1");
  let rowCount: number = namesRange.getRowCount();
  let colCount: number = namesRange.getColumnCount();
  let vals: string[][] = namesRange.getValues() as string[][];
  for (let i = 0; i < rowCount; i++) {
    for (let j = 0; j < colCount; j++) {
      if (vals[i][j] == "") {
        // additional code here
      }
    }
  }
}
Another alternative to the first answer is to use a forEach loop over every cell in the range of values.
It can cut down the number of variables you need to achieve the desired result.
function main(workbook: ExcelScript.Workbook) {
  let worksheet = workbook.getActiveWorksheet();
  let usedRange = worksheet.getUsedRange().getValues();
  usedRange.forEach(row => {
    row.forEach(cellValue => {
      console.log(cellValue);
    });
  });
}

Pouchdb pagination

I am looking for a way to paginate in pouchdb by specifying the number of the page that I want.
The closest example I came across is this:
var options = {limit: 5};
function fetchNextPage() {
  pouch.allDocs(options, function (err, response) {
    if (response && response.rows.length > 0) {
      options.startkey = response.rows[response.rows.length - 1].id;
      options.skip = 1;
    }
  });
}
It assumes however that you are paginating one page after the other and calling this consecutively several times.
What I need instead is a way to retrieve page 5 for example, with a single query.
There is no easy answer to the question. Here is a small slice from 3.2.5.6. Jump to Page:
One drawback of the linked list style pagination is that you can’t pre-compute the rows for a particular page from the page number and the rows per page. Jumping to a specific page doesn’t really work. Our gut reaction, if that concern is raised, is, “Not even Google is doing that!” and we tend to get away with it. Google always pretends on the first page to find 10 more pages of results. Only if you click on the second page (something very few people actually do) might Google display a reduced set of pages. If you page through the results, you get links for the previous and next 10 pages, but no more.
Pre-computing the necessary startkey and startkey_docid for 20 pages is a feasible operation and a pragmatic optimization to know the rows for every page in a result set that is potentially tens of thousands of rows long, or more.
If you are lucky and every document has an ordered sequence number, then a view could be constructed to easily navigate pages.
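As a minimal sketch (assuming each document carries a numeric seq field, which is not part of the question), a map view keyed on that number lets you jump straight to a page with skip and limit:
// design doc with a view keyed on the per-document sequence number (assumed field "seq")
await db.put({
  _id: '_design/pages',
  views: {
    by_seq: {
      map: function (doc) { emit(doc.seq); }.toString()
    }
  }
});
// jump straight to page 5: skip the first four pages of keys
const perPage = 5;
const page = 5;
const result = await db.query('pages/by_seq', {
  skip: (page - 1) * perPage,
  limit: perPage,
  include_docs: true
});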
Another strategy is to precompute (preload) a range of keys, which is more reasonable but is complicated. The snippet below creates a trivial database which can be paged through via nav links.
There are 5 documents per page, and each "chapter" has 10 pages. computePages performs the look-ahead:
// look ahead and cache startkey for pages.
async function computePages(startPage, perPage, lookAheadPages, startKey) {
  let options = {
    limit: perPage * lookAheadPages,
    include_docs: false,
    reduce: false
  };
  // adjust. This happens when a requested page has no key cached.
  if (startKey !== undefined) {
    options.startkey = startKey;
    options.skip = perPage; // not ideal, but tolerable probably?
  }
  const result = await db.allDocs(options);
  // use max to prevent result overrun;
  // only the first key of each page is stored
  const max = Math.min(options.limit, result.rows.length);
  for (let i = 0; i < max; i += perPage) {
    page_keys[startPage++] = result.rows[i].id;
  }
}
page_keys provides a key/value store mapping page number to start key. Usually anything other than 1 for skip is a red flag, but it is reasonable here - we won't be skipping, say, 100 documents, right?
I just threw this together so it is imperfect and likely buggy, but it does demonstrate page navigation generally.
function gel(id) {
  return document.getElementById(id);
}

// canned test documents
function getDocsToInstall() {
  let docs = [];
  // doc ids are a silly sequence of characters.
  for (let i = 33; i < 255; i++) {
    docs.push({
      _id: `doc-${String.fromCharCode(i)}`
    });
  }
  return docs;
}

// init db instance
let db;
async function initDb() {
  db = new PouchDB('test', {
    adapter: 'memory'
  });
  await db.bulkDocs(getDocsToInstall());
}

// documents to show per page
const rows_per_page = 5;
// how many pages to preload into the page_keys list.
const look_ahead_pages = 10;
// page key cache: key = page number, value = document key
const page_keys = {};
// the current page being viewed
let page_keys_index = 0;
// track total rows available to prevent rendering links beyond available pages.
let total_rows = undefined;

async function showPage(page) {
  // load the docs for this page
  let options = {
    limit: rows_per_page,
    include_docs: true,
    startkey: page_keys[page] // page index is computed
  };
  let result = await db.allDocs(options);
  // see renderNav. Here, there is NO accounting for live changes to the db.
  total_rows = total_rows || result.total_rows;
  // just display the doc ids.
  const view = gel('view');
  view.innerText = result.rows.map(row => row.id).join("\n");
}

// look ahead and cache startkey for pages.
async function computePages(startPage, perPage, lookAheadPages, startKey) {
  let options = {
    limit: perPage * lookAheadPages,
    include_docs: false,
    reduce: false
  };
  // adjust. This happens when a requested page has no key cached.
  if (startKey !== undefined) {
    options.startkey = startKey;
    options.skip = perPage; // not ideal, but tolerable probably?
  }
  const result = await db.allDocs(options);
  // use max to prevent result overrun;
  // only the first key of each page is stored
  const max = Math.min(options.limit, result.rows.length);
  for (let i = 0; i < max; i += perPage) {
    page_keys[startPage++] = result.rows[i].id;
  }
}

// show page links and optional skip backward/forward links.
let last_chapter;
async function renderNav() {
  // calculate which page to start linking.
  const chapter = Math.floor(page_keys_index / look_ahead_pages);
  if (chapter !== last_chapter) {
    last_chapter = chapter;
    const start = chapter * look_ahead_pages;
    let html = "";
    // don't render more page links than possible.
    let max = Math.min(start + look_ahead_pages, total_rows / rows_per_page);
    // render prev link if nav'ed past 1st chapter.
    if (start > 0) {
      html = `< `;
    }
    for (let i = start; i < max; i++) {
      html += `${i + 1} `;
    }
    // if more pages available, render the 'next' link
    if (max % look_ahead_pages === 0) {
      html += ` > `;
    }
    gel("nav").innerHTML = html;
  }
}

async function navTo(page) {
  if (page_keys[page] === undefined) {
    // page key not cached - compute more page keys.
    await computePages(page, rows_per_page, look_ahead_pages, page_keys[page - 1]);
  }
  page_keys_index = page;
  await showPage(page_keys_index);
  renderNav();
}

initDb().then(async () => {
  await navTo(0);
});
<script src="https://cdn.jsdelivr.net/npm/pouchdb#7.1.1/dist/pouchdb.min.js"></script>
<script src="https://github.com/pouchdb/pouchdb/releases/download/7.1.1/pouchdb.memory.min.js"></script>
<pre id="view"></pre>
<hr/>
<div id="nav">
</nav>

Javascript a synchronous loop inside a synchronous loop using Vue

I have a JavaScript loop
for (var index = 0; index < this.excelData.results.length; index++) {
  let pmidList = this.excelData.results[index]["PMIDList"];
  if (pmidList.length == 0) {
    continue;
  }
  let count = 0;
  let pmidsList = pmidList.split(',');
  if (pmidsList.length > 200) {
    pmidList = pmidsList.slice(count, count + 200).join(',');
  } else {
    pmidList = pmidsList.join(",");
  }
  // Create some type of mini loop
  // pmidList is a comma separated string so I need to first put it into an array
  // then slice the array into 200 item segments
  let getJSONLink = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?';
  getJSONLink += 'db=pubmed&retmode=json&id=' + pmidList;
  await axios.get(getJSONLink)
    .then(res => {
      let jsonObj = res.data.result;
      // Do Stuff with the data
    }).catch(function (error) {
      console.log(error);
    });
  // Loop
}
The entire process works fine EXCEPT when the PMIDList has more than 200 comma-separated items. The web service will only accept 200 at a time, so I need to add an inner loop that parses out the first 200 and loops back around to do the rest before moving on to the next index. It would be nice to do this synchronously, since the web service only allows 3 requests a second. And since I'm using Vue, wait is another issue.
The easiest option would be to use async await within a while loop, mutating the original array. Something like:
async function doHttpStuff(params) {
  let getJSONLink = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?';
  getJSONLink += 'db=pubmed&retmode=json&id=' + params.join(','); // join with commas, as the API expects
  return axios.get(getJSONLink);
}
and use it inside your inner while loop:
...
let pmidsList = pmidList.split(',');
// I'll clone this, because mutating might have some side effects?
const clone = [...pmidsList];
while (clone.length) {
  const tmp = clone.splice(0, 199); // we are mutating the array, so length is reduced
  const data = await doHttpStuff(tmp);
  // do things with data
}
...
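The question also mentions that the web service only allows 3 requests per second; one simple way to respect that (an addition on my part, not part of the original answer) is to pause between batches with a small helper:
// resolve after the given number of milliseconds
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

while (clone.length) {
  const tmp = clone.splice(0, 199);
  const data = await doHttpStuff(tmp);
  // do things with data
  await sleep(350); // stays at roughly 3 requests per second or fewer
}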

How to filter results in mongodb based on some other collection field?

I have 2 schemas (let's say t1 and t2)
I need to fetch n documents from t1 with certain criteria, such that some field of t1 also exists in t2.
The following works, but if t1 has millions of documents and I only need to get the first 5 matching documents, then it would be stupid to fetch them all.
const responses = [];
const data = await t1.find({});
for (let i = 0; i < data.length; i++) {
  await t2.findOne({
    field: data[i].field,   // t2's field must equal t1's field
    field2: someOtherInfo
  }).then((obj) => {
    if (obj) responses.push(data[i]);
  });
  if (responses.length > n) break;
}
return responses;
I have tried to use limit, but I cannot figure out the filtering part. Hence, this crude solution.
What is the better way to do this?
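For reference, one commonly suggested direction (an assumption on my part, not an answer from this thread) is to push the join and the limit into a single aggregation with $lookup, so the pipeline can stop once n matches are found:
// the field names ("field", "field2") and the collection name 't2' are placeholders from the question
const responses = await t1.aggregate([
  { $lookup: {
      from: 't2',
      localField: 'field',
      foreignField: 'field',
      as: 'matches'
  } },
  // keep only t1 docs whose matching t2 doc also has field2 = someOtherInfo
  { $match: { 'matches.field2': someOtherInfo } },
  { $limit: n }
]);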

Trying to understand how to use Promise in js

I'm using the native driver for MongoDB. In the db I have about 7 collections, and I want to create a variable that stores the combined number of entries in every collection except the last one. Afterwards I want to create another variable that stores the entries of the last collection, then pass both variables through the res.render() command and show them on the webpage.
The problem I'm having here is that I'm so used to synchronous execution of functions which in this case goes straight out the window.
The code below is the way I'm thinking, if everything is executed in sync.
var count = 0;
db.listCollections().toArray(function (err, collection) {
  for (i = 1; i < collection.length; i++) {
    db.collection(collection[i].name).count(function (err, value) {
      count = count + value;
    });
  }
  var count2 = db.collection(collection[i].name).count(function (err, value) {
    return value;
  });
  res.render('index.html', {data1: count, data2: count2});
});
Obviously this doesn't do what I want it to do, so I tried playing around with promises, but I ended up even more confused.
You could do something like this with Promises:
Get collection names, iterate over them, and return either count, or entries (if it's the last collection). Then sum up individual counts and send everything to the client.
db.listCollections().toArray()
  .then(collections => {
    let len = collections.length - 1;
    return Promise.all(collections.map(({name}, i) => {
      let curr = db.collection(name);
      // count every collection except the last; fetch the last one's documents
      return i < len ? curr.count() : curr.find().toArray();
    }));
  })
  .then(results => {
    // named `results` rather than `res` so it doesn't shadow the Express response
    let last = results.pop(),
        count = results.reduce((p, c) => p + c);
    res.render('index.html', {count, last});
  });
