Pg-promise inserts/transactions not working within async queue - javascript

I have found a lot of things related to the use of pg-promise and await/async but nothing that quite answers my issue with async (the node/npm package) and in particular the interaction between async.queue and pg-promise queries.
My issue: I need to perform a few million computations (matching scores) asynchronously and commit their results in the same async process to a Postgres db. My main process is a promise that first computes all of the possible distinct combinations of two records from a table and segments them into chunks of a thousand pairs at a time.
These chunks of a thousand pairs (e.g. [[0,1], [0,2], ... , [0,1000]] is the content of the first chunk) are fed to an instance of async.queue, which first performs the computation of the matching scores and then records them in the db.
The part that has had me scratching my head for hours is that the db commit doesn't work, whether it uses insert statements or transactions. I know for sure the functions I use for the db part work, since I've written manual tests using them.
My main code is as follows:
'use strict';

const promise = require('bluebird');
const initOptions = {
    promiseLib: promise
};
const pgp = require('pg-promise')(initOptions);
const cn = {connexion parameters...};
const db = pgp(cn);
const async = require('async');

var mainPromise = (db, pgp, param) => {
    return new Promise(resolve => {
        //some code computing the chunksArray using param
        ...
        var q = async.queue((chunk, done) => {
            var scores = performScoresCalculations(chunk);
            //scores is an array containing the 1000 scores for any chunk of a 1000 pairs
            performDbCommitting(db, pgp, scores);
            //commit those scores to the db using pg-promise
            done();
        }, 10);
        q.drain = () => {
            resolve(arr);
            //admittedly not quite sure about that part, haven't used async.queue much so far
        };
        q.push(chunksArray);
    }).catch(err => {
        console.error(err);
    });
};
Now my scores array looks like this:
[{column1: 'value1_0', column2: 'value2_0', ..., columnN: 'valueN_0'}, ... , {column1: 'value1_999', column2: 'value2_999', ..., columnN: 'valueN_999'}] with a thousand records in it.
My performDbCommitting function is as follows:
var performDbCommitting = (db, pgp, scores) => {
    console.log('test1');
    //displays 'test1', as expected
    var query = pgp.helpers.insert(scores, ['column1', 'column2', 'column3'], 'myScoreTable');
    console.log(query);
    //displays the full content of the query, as expected
    db.any(query).then(data => {
        console.log('test2');
        //nothing is displayed
        console.log(data);
        //nothing is displayed
        return;
    }).catch(err => {
        console.error(err);
    });
}
So here is my problem:
- when testing "manually", performDbCommitting works perfectly; I've even tried a version with transactions, which works flawlessly,
- when used within async.queue, everything in performDbCommitting seems to work until the db.any(query) call, as evidenced by the console.log displaying the info correctly up to that point,
- no error is thrown, and the computations over chunksArray keep going in groups of 1000 as expected,
- if I inspect any of the arrays (chunk, chunksArray, scores, etc.) everything is as it should be; the lengths are correct and so are their contents.
pg-promise just doesn't seem to want to push my 1000 records at a time into the database when used with async.queue, and that's where I'm stuck.
I have no trouble imagining the fault lies with me; it's about the first time I'm using async.queue, especially mixed with bluebird promises and pg-promise.
Thank you very much in advance for taking the time to read this and shed any light on this issue if you can.

I was experiencing this same issue on one of my machines in particular but none of the others.
What worked for me was updating pg-promise from version 10.5.0 to version 10.5.6 (via npm update pg-promise).

Your mainPromise doesn't wait for performDbCommitting to finish:
should be like:

performDbCommitting(db, pgp, scores).then(() => { done(); });
//commit those scores to the db using pg-promise

and performDbCommitting needs to return the promise too:

return db.any(query).then(data => {
    console.log('test2');
    console.log(data);
    return null;
}).catch(err => {
    console.error(err);
    return null;
});
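Putting both pieces together, a minimal sketch of the corrected worker could look like this (same helpers as above; passing the error to done is one reasonable way to surface failures):

var q = async.queue((chunk, done) => {
    var scores = performScoresCalculations(chunk);
    // Only signal completion once the insert has actually settled.
    performDbCommitting(db, pgp, scores)
        .then(() => done())
        .catch(err => done(err));
}, 10);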

Related

Firebase arrayRemove takes very long time in modular version

I'm refactoring my Firebase code to support the modular Firebase version.
After I refactored the arrayRemove call, I noticed that it started to take a very long time to execute, about 1 minute. On v8, it was about 1 second.
Here is my code:
const userRef = doc(db, 'users', userId)
try {
    await updateDoc(userRef, {
        items: arrayRemove(itemId),
    })
    console.log('REMOVED')
} catch (error) {
    console.log(error)
}
Execution of this code takes around one minute. The same action with the same data in the same environment on the older version takes about one second. The rest of the functions work fast.
Firebase version: 9.6.1
Any ideas why it takes so long and how to make it faster?
After reading the article "Best Practices: Arrays in Firebase"
I found this practice:
to remove keys, we save the entire array instead of using .remove()
So... I removed the item in my code, then updated the array, and it took less than a second!
// removing the item I need to remove from my local array
const updArr = arr.filter(
    (item) => item !== itemId
)
const userRef = doc(db, 'collection', 'id')
// updating the array
await updateDoc(userRef, {
    gifts: updArr,
})
// P.S.
Obviously there is some bug in the arrayRemove function, but we don't actually have to use that function 🤫
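Putting the workaround together as a self-contained helper (a sketch only - removeItem is a hypothetical name, and the users/items shapes mirror the question):

import { doc, getDoc, updateDoc } from 'firebase/firestore'

// Read the document, filter the item out locally, then write the
// whole array back - the same read-filter-write pattern as above.
async function removeItem(db, userId, itemId) {
    const userRef = doc(db, 'users', userId)
    const snap = await getDoc(userRef)
    if (!snap.exists()) return
    const updArr = (snap.data().items || []).filter((item) => item !== itemId)
    await updateDoc(userRef, { items: updArr })
}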

Cannot get data needed due to Mongoose Queries/Promises

I am trying to create a Jeopardy like game. I have the topics for the game in a database. When the page is opened I want it to get 6 topics randomly from the database. I am putting them in an array which I will then pass to my ejs file. The problem is that when I go to pass the array to the ejs file, it is always empty. I understand that it is because of promises(actually queries) from Mongoose. My problem is that I cannot figure out how to handle this. I have read the Mongoose docs and searched everywhere but I can't figure it out.
I have tried using callbacks, but they usually just make the program hang and do nothing. I have tried using .then, but I must be using it wrong because it doesn't do what I want.
app.get("/", function(request, response){
//My array to put the topics into
var questionArr = [];
//I need to know the number of items in db for random
var getQuestions = Questions.countDocuments({}, function(err, count){
for(var i = 0; i < 6; i++){
!function(i){
var rand = math.randomInt(count);
//Here I get a random topic and add to array
//Seems to work and actually get it
Questions.findOne().skip(rand).exec(function(err, result){
questionArr.push(result);
});
}(i);
}
});
//I thought this would make it wait for the function to finish so
//that I could have my full array, but apparently not
getQuestions.then(() => {
//this runs before the other functions and give me a length of 0
console.log(questionArr.length);
response.render("jeopardy.ejs", {questions: questionArr});
});
});
I simply need to have the render run after it gets the information from the database. However, it still just runs with an empty array. Thanks for any help, I'm pretty new to async.
I see a few issues with your code:
1) You're mixing promises and callbacks, which makes things more complicated. The code doesn't work mainly because you're not waiting for the Questions.findOne() results.
2) There is no Math.randomInt
In order to make it work it has to be similar to below:
Questions.countDocuments()
    .then((count) => {
        return Promise.all([...new Array(6)].map(() => {
            const rand = Math.floor(Math.random() * Math.floor(count));
            return Questions.findOne().skip(rand).exec();
        }));
    })
    .then((questionArr) => {
        response.render("jeopardy.ejs", {questions: questionArr});
    });
Best is to use async/await, which makes it even more readable:
app.get("/", async function(request, response){
const count = await Questions.countDocuments();
const questionArr = await Promise.all([...new Array(6)].map(() => {
const rand = Math.floor(Math.random() * Math.floor(count));
return Questions.findOne().skip(rand).exec();
}));
response.render("jeopardy.ejs", {questions: questionArr});
});
Also keep in mind that you should do proper error handling, but that's a subject for a separate post :)
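For completeness, a minimal sketch of that error handling, assuming a plain 500 response is acceptable for your app:

app.get("/", async function(request, response){
    try {
        const count = await Questions.countDocuments();
        const questionArr = await Promise.all([...new Array(6)].map(() => {
            const rand = Math.floor(Math.random() * Math.floor(count));
            return Questions.findOne().skip(rand).exec();
        }));
        response.render("jeopardy.ejs", {questions: questionArr});
    } catch (err) {
        // Log the failure and end the request instead of leaving it hanging.
        console.error(err);
        response.status(500).send("Something went wrong");
    }
});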

Duplicate Array Data Web Scraping

I can't seem to get the article duplicates out of my web scraper results. This is my code:
app.get("/scrape", function (req, res) {
request("https://www.nytimes.com/", function (error, response, html) {
// Load the HTML into cheerio and save it to a variable
// '$' becomes a shorthand for cheerio's selector commands, much like jQuery's '$'
var $ = cheerio.load(html);
var uniqueResults = [];
// With cheerio, find each p-tag with the "title" class
// (i: iterator. element: the current element)
$("div.collection").each(function (i, element) {
// An empty array to save the data that we'll scrape
var results = [];
// store scraped data in appropriate variables
results.link = $(element).find("a").attr("href");
results.title = $(element).find("a").text();
results.summary = $(element).find("p.summary").text().trim();
// Log the results once you've looped through each of the elements found with cheerio
db.Article.create(results)
.then(function (dbArticle) {
res.json(dbArticle);
}).catch(function (err) {
return res.json(err);
});
});
res.send("You scraped the data successfully.");
});
});
// Route for getting all Articles from the db
app.get("/articles", function (req, res) {
// Grab every document in the Articles collection
db.Article.find()
.then(function (dbArticle) {
res.json(dbArticle);
})
.catch(function (err) {
res.json(err);
});
});
Right now I am getting five copies of each article sent to the user. I have tried db.Article.distinct and various versions of this to filter the results down to only unique articles. Any tips?
In Short:
Switching var results = [] from an Array to an Object, var results = {}, did the trick for me. I still haven't figured out the exact reason for the duplicate insertion of documents in the database; I will update as soon as I find out.
Long Story:
You have multiple mistakes and points of improvement in your code. I will try to point them out:
Let's address the mistakes first to make your code error-free.
Mistakes
1. Although mongoose's model.create does seem to work fine with an Array that has properties attached to it, I haven't seen such a use before and it does not look appropriate.
If you intend to create documents one after another, then represent your documents using an object instead of an Array. Using an array is more mainstream when you intend to create multiple documents at once.
So switch -
var results = [];
to
var results = {};
2. Sending response headers after they have already been sent will throw an error. I don't know if you have already noticed it or not, but it's pretty clear upfront: once the error pops up, the remaining documents won't get stored, because of a PromiseRejection error if you haven't set up a try/catch block.
The block inside $("div.collection").each(function (i, element) runs asynchronously, so your process control won't wait for each document to get processed; instead it immediately executes res.send("You scraped the data successfully.");.
This effectively terminates the Http connection between the client and the server, and any further response termination calls like res.json(dbArticle) or res.json(err) will throw an error.
So, just comment out the res.json statements inside the .create's then and catch methods. Although this terminates the response before all the articles are saved in the DB, you need not worry, as your code will still work behind the scenes, saving articles in the database for you (asynchronously).
If you want your response to be terminated only after you have successfully saved the data then change your middleware implementation to -
request('https://www.nytimes.com', (err, response, html) => {
    var $ = cheerio.load(html);
    var results = [];
    $("div.collection").each(function (i, element) {
        var ob = {};
        ob.link = $(element).find("a").attr("href");
        ob.title = $(element).find("a").text();
        ob.summary = $(element).find("p.summary").text().trim();
        results.push(ob);
    });
    db.Article.create(results)
        .then(function (dbArticles) {
            res.json(dbArticles);
        }).catch(function (err) {
            return res.json(err);
        });
});
After making the above changes (and even after just the first one), my version of your code ran fine. So if you want you can continue with your current version, or you may try reading some points of improvement.
Points of Improvements
1. The era of callbacks is long gone:
Convert your implementation to utilise Promises, as they are more maintainable and easier to reason about. Here are the things you can do -
Change the request library from request to axios, or any one which supports Promises by default.
2. Make effective use of mongoose methods for insertion. You can perform bulk inserts of multiple documents in just one query (see the insertMany sketch after this list). You may find the docs on creating documents in mongodb quite helpful.
3. Start using a frontend task automation library such as puppeteer or nightmare.js for data-scraping tasks. Trust me, they make life a hell of a lot easier than using cheerio or any other library for the same. Their docs are really good and well maintained, so you won't have a hard time picking these up.
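As a sketch of the second point above, the results array from the improved middleware could be written with insertMany; ordered: false lets the insert continue past individual document failures (e.g. duplicate keys, if you add a unique index):

db.Article.insertMany(results, { ordered: false })
    .then(function (dbArticles) {
        console.log("Inserted " + dbArticles.length + " articles");
    })
    .catch(function (err) {
        console.error(err);
    });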

Best way to calling API inside for loop using Promises

I have 500 million objects, each of which has n contacts, like below
var groupsArray = [
    {'G1': ['C1','C2','C3'....]},
    {'G2': ['D1','D2','D3'....]}
    ...
    {'G2000': ['D2001','D2002','D2003'....]}
    ...
]
I have two ways of implementing this in Node.js: one based on regular promises and another using bluebird, as shown below
Regular promises
...
var groupsArray = [
    {'G1': ['C1','C2','C3']},
    {'G2': ['D1','D2','D3']}
]

function ajax(url) {
    return new Promise(function(resolve, reject) {
        request.get(url, {json: true}, function(error, data) {
            if (error) {
                reject(error);
            } else {
                resolve(data);
            }
        });
    });
}

_.each(groupsArray, function(groupData){
    _.each(groupData, function(contactlists, groupIndex){
        // console.log(groupIndex)
        _.each(contactlists, function(contactData){
            ajax('http://localhost:3001/api/getcontactdata/' + groupIndex + '/' + contactData).then(function(result) {
                console.log(result.body);
                // Code depending on result
            }).catch(function() {
                // An error occurred
            });
        })
    })
})
...
In the bluebird version I have used the concurrency option to see how to control the queue of promises
...
var groups = []; // declaration assumed - it's missing from the snippet as posted
_.each(groupsArray, function(groupData){
    _.each(groupData, function(contactlists, groupIndex){
        var contacts = [];
        // console.log(groupIndex)
        _.each(contactlists, function(contactData){
            contacts.push({
                contact_name: 'Contact ' + contactData
            });
        })
        groups.push({
            task_name: 'Group ' + groupIndex,
            contacts: contacts
        });
    })
})

Promise.each(groups, group =>
    Promise.map(group.contacts,
        contact => new Promise((resolve, reject) => {
            /*setTimeout(() =>
                resolve(group.task_name + ' ' + contact.contact_name), 1000);*/
            request.get('http://localhost:3001/api/getcontactdata/' + group.task_name + '/' + contact.contact_name, {json: true}, function(error, data) {
                if (error) {
                    reject(error);
                } else {
                    resolve(data);
                }
            });
        }).then(log => console.log(log.body)),
        {
            concurrency: 50
        }
    ).then(() => console.log())
).then(() => {
    console.log('All Done!!');
});
...
I want to know, when dealing with 100 million API calls inside a loop using promises, the best way to call the API asynchronously and deal with the responses later.
My answer using regular Node.js promises (this can probably easily be adapted to Bluebird or another library).
You could fire off all Promises at once using Promise.all:
var groupsArray = [
    {'G1': ['C1','C2','C3']},
    {'G2': ['D1','D2','D3']}
];

function ajax(url) {
    return new Promise(function(resolve, reject) {
        request.get(url, {json: true}, function(error, data) {
            if (error) {
                reject(error);
            } else {
                resolve(data);
            }
        });
    });
}

Promise.all(groupsArray.map(group => ajax("your-url-here")))
    .then(results => {
        // Code that depends on all results.
    })
    .catch(err => {
        // Handle the error.
    });
Using Promise.all attempts to run all your requests in parallel. This probably won't work well when you have 500 million requests to make all being attempted at the same time!
A more effective way to do it is to use the JavaScript reduce function to sequence your requests one after the other:
// ... Setup as before ...

const results = [];
groupsArray.reduce((prevPromise, group) => {
    return prevPromise.then(() => {
        return ajax("your-url-here")
            .then(result => {
                // Process a single result if necessary.
                results.push(result); // Collect your results.
            });
    });
}, Promise.resolve()) // Seed promise.
.then(() => {
    // Code that depends on all results.
})
.catch(err => {
    // Handle the error.
});
This example chains together the promises so that the next one only starts once the previous one completes.
Unfortunately the sequencing approach will be very slow because it has to wait until each request has completed before starting a new one. Whilst each request is in progress (it takes time to make an API request) your CPU is sitting idle whereas it could be working on another request!
A more efficient, but complicated approach to this problem is to use a combination of the above approaches. You should batch your requests so that the requests in each batch (of say 10) are executed in parallel and then the batches are sequenced one after the other.
It's tricky to implement this yourself using a combination of Promise.all and the reduce function (although it's a great learning exercise), so I'd suggest using the library async-await-parallel. There's a bunch of such libraries, but I use this one; it works well and easily does the job you want.
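For reference, a hand-rolled version of the batch-then-sequence idea might look roughly like this (a sketch only, reusing the ajax helper from above; the library below handles the edge cases for you):

// Split the work into chunks, run each chunk in parallel with
// Promise.all, and sequence the chunks one after another with reduce.
function inBatches(items, batchSize, worker) {
    const batches = [];
    for (let i = 0; i < items.length; i += batchSize) {
        batches.push(items.slice(i, i + batchSize));
    }
    return batches.reduce(
        (prev, batch) => prev.then(results =>
            Promise.all(batch.map(worker))
                .then(batchResults => results.concat(batchResults))),
        Promise.resolve([])
    );
}

// Usage:
// inBatches(groupsArray, 10, group => ajax("your-url-here"))
//     .then(results => { /* Code that depends on all results. */ });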
You can install the library like this:
npm install --save async-await-parallel
Here's how you would use it:
const parallel = require("async-await-parallel");

// ... Setup as before ...

const batchSize = 10;

parallel(groupsArray.map(group => {
    return () => { // We need to return a 'thunk' function, so that the jobs can be started when they are needed, rather than all at once.
        return ajax("your-url-here");
    };
}), batchSize)
.then(() => {
    // Code that depends on all results.
})
.catch(err => {
    // Handle the error.
});
This is better, but it's still a clunky way to make such a large number of requests! Maybe you need to up the ante and consider investing time in proper asynchronous job management.
I've been using Kue lately for managing a cluster of worker processes. Using Kue with the Node.js cluster library allows you to get proper parallelism happening on a multi-core PC, and you can then easily extend it to multiple cloud-based VMs if you need even more grunt.
See my answer here for some Kue example code.
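For flavor, a minimal Kue setup looks roughly like this (a sketch; it assumes a local Redis instance and reuses the ajax helper and groupsArray from above):

const kue = require('kue');
const queue = kue.createQueue();

// Producer: enqueue one job per group.
groupsArray.forEach(group => {
    queue.create('fetch-contacts', { group: group }).save();
});

// Worker: process up to 10 jobs concurrently.
queue.process('fetch-contacts', 10, (job, done) => {
    ajax('your-url-here')
        .then(() => done())
        .catch(done);
});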
In my opinion you have two problems coupled in one question - I'd decouple them.
#1 Loading of a large dataset
Operation on such a large dataset (500m records) will surely cause some memory limit issues sooner or later - node.js runs in a single thread and that is limited to use approx 1.5GB of memory - after that your process will crash.
In order to avoid that you could be reading your data as a stream from a CSV - I'll use scramjet as it'll help us with the second problem, but JSONStream or papaparse would do pretty well too:
$ npm install --save scramjet
Then let's read the data - I'd assume from a CSV:
const {StringStream} = require("scramjet");

const stream = require("fs")
    .createReadStream(pathToFile)
    .pipe(new StringStream('utf-8'))
    .csvParse()
Now we have a stream of objects that will return the data line by line, but only if we read it. Solved problem #1, now to "augment" the stream:
#2 Stream data asynchronous augmentation
No worries - that's just what you do - for every line of data you want to fetch some additional info (i.e. augment it) from some API, which by default is asynchronous.
That's where scramjet kicks in with just a couple of additional lines:

stream
    .flatMap(groupData => Object.entries(groupData))
    .flatMap(([groupIndex, contactList]) => contactList.map(contactData => [contactData, groupIndex]))
    // now you have a simple stream of entries for your call
    .map(([contactData, groupIndex]) => ajax('http://localhost:3001/api/getcontactdata/' + groupIndex + '/' + contactData))
    // and here you can print or do anything you like with your data stream
    .each(console.log)
After this you'd need to accumulate the data or output it to a stream - there are a number of options - for example: .toJSONArray().pipe(fileStream).
Using scramjet you are able to separate the process into multiple lines without much impact on performance. Using setOptions({maxParallel: 32}) you can control concurrency and, best of all, all this will run with a minimal memory footprint - much, much faster than if you were to load the whole data into memory.
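If I read the scramjet docs right, that option is set directly on the stream, before the parallel steps - roughly like this (a sketch, not verified against a specific scramjet version):

stream
    .setOptions({maxParallel: 32})
    // ...followed by the flatMap/map/each chain shown above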
Let me know if this is helpful - your question is quite complex, so let me know if you run into any problems - I'll be happy to help. :)

How to input data in the correct order in Javascript?

I am using Node.js to send some data to a SQL Server table. The code has a for loop which calls a function, which runs an insert statement (it will eventually be a stored procedure). The for loop runs through a series of objects, in order by date.
Is there some way to ensure that the order of dates in the database is the same order as I call them (chronologically)? I know about SORT in T-SQL, but I think the program should do it in the correct order. I have always heard to avoid thread sleeping and timers whenever possible, so I don't like the idea of setTimeout. Is there another way, perhaps with event handling?
Example code below. An element of the jsonArray is a single object that has a date key as well as several other numbers. The SQL Server table has an autoincrementing primary key integer, and I want the dates to be entered chronologically (like I call them), so that the primary key as well as the dates are in order.
I am also using the MSSQL import for the SQL manipulation.
// Main class
for (let j = 0; j < jsonArray.length; j++) {
    let myObj = jsonArray[j]
    WriteToDB.myWriteFunction(myObj)
}

// Different class
const sql = require('mssql')
let request = new sql.Request()
let query = (...)

function myWriteFunction () {
    request.query(query) // Just the actual insert statement, no issue here
    request.on('error', function (err) {
        console.log('REQ ERROR')
        console.dir(err)
    })
    // These dates are out of order, due to when the done event is triggered
    request.on('done', function () {
        console.log('QUERY RAN FOR ' + e1)
    })
}
I am assuming it's some sort of multithreading that is causing this
Those database operations are done asynchronously, and they all happen on the same thread. If D1 (the first database operation) takes longer than D2 (the second database operation), then D1 is inserted after D2, resulting in your problem. It's not a problem that involves multi-threading - maybe one operation just has more data than another, etc.
I would totally recommend you NOT to use the below code sample unless you really believe the order in SQL is indeed important. req2 is done only after req1 is finished and so on... => wasted time
let jsonArray = [1, 2, 3, 4];
if (checkArray(jsonArray)) {
    createReq(jsonArray[0])
}

function createReq(req) {
    new Promise((resolve, reject) => {
        setTimeout(() => {
            resolve();
            jsonArray.shift();
            console.log(req);
            if (checkArray(jsonArray)) {
                createReq(jsonArray[0])
            }
        }, 100)
    })
}

function checkArray(arr) {
    return (arr && arr[0]) // use the parameter - the original read the outer jsonArray and ignored arr
}
