I have a fairly complex Express route in my Node.js server that makes many modifications to the database via Mongoose.
My goal is to implement a mechanism that reverts all changes when any error occurs. My idea was to implement undo functions in the catch block.
But this is quite ugly, as I have to know what the previous values were, and what if an error occurs in the catch block? It is especially difficult to revert the changes when the error occurred during a Promise.all(array.map( /* ... */ )).
My route looks akin to this:
module.exports = async (req, res) => {
var arr1, var2, var3
try {
const body = req.body
arr1 = await fetchVar1(body)
const _data = await Promise.all([
Promise.all(
arr1.map(async x => {
const y = await fetchSometing(x)
return doSometing(y)
})
),
doSomething3(arr1),
])
var2 = _data[0]
var3 = _data[1]
return res.json({ arr1, var2, var3 })
} catch (err) {
/**
* If an error occurs I have to undo
* everything that has been done
* in the try block
*/
}
}
Preferably I would like to implement something that "batches" all changes and "commits" the changes if no errors occurred.
What you are looking for is transactions: https://mongoosejs.com/docs/transactions.html
Manually undoing changes after making them won't protect you from every issue, so you should not rely on that. For example, exactly as you wrote: what happens if there is a crash after a partial write (some data is written, some is not), then another crash during your "rollback" code, which then does not clean up everything? If your code depends on your data being absolutely clean, then you have a problem. Your code should either be able to handle partial data correctly, or you must have some way to guarantee that your data is perfectly good at all times.
Transactions are the way to go, because they commit everything at once, and only if everything works.
What you’re looking for is called Transactions.
Transactions are new in MongoDB 4.0 and Mongoose 5.2.0. Transactions let you execute multiple operations in isolation and potentially undo all the operations if one of them fails. This guide will get you started using transactions with Mongoose.
For more information check the link below:
https://mongoosejs.com/docs/transactions.html
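As a rough sketch of how this could be wired into a route like the one in the question: the helper below opens a session, runs all writes inside one transaction, and commits only if nothing throws. `runInTransaction` and its `startSession` parameter are hypothetical names; with Mongoose you would presumably pass `() => mongoose.startSession()` and hand the `session` to every query (for example through the `{ session }` option). Note that MongoDB only supports transactions on replica sets or sharded clusters, not on a standalone server.

```javascript
// Sketch: run `work(session)` inside a single transaction.
// Assumption: `startSession` resolves to a session exposing startTransaction,
// commitTransaction, abortTransaction and endSession, as MongoDB driver
// sessions (and therefore Mongoose sessions) do.
async function runInTransaction(startSession, work) {
  const session = await startSession();
  try {
    session.startTransaction();
    let result;
    try {
      result = await work(session); // pass `session` to every query in here
    } catch (err) {
      await session.abortTransaction(); // discard every write made so far
      throw err;
    }
    await session.commitTransaction(); // writes become visible only now
    return result;
  } finally {
    await session.endSession();
  }
}
```

Inside the route, the body of the `try` block would move into the `work` callback, and the `catch` block would only need to send an error response, since there is nothing left to undo by hand.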
I am using node module pg in my application and I want to make sure it can properly handle connection and query errors.
The first problem I have is I want to make sure it can properly recover when postgres is unavailable.
I found there is an error event so I can detect if there is a connection error.
import pg from 'pg'
let pgClient = null
async function postgresConnect() {
pgClient = new pg.Client(process.env.CONNECTION_STRING)
pgClient.connect()
pgClient.on('error', async (e) => {
console.log('Reconnecting')
await sleep(5000)
await postgresConnect()
})
}
I don't like using a global here, and I want to turn the sleep delay into a small exponential backoff. I noticed "Reconnecting" fires twice immediately, then waits five seconds, and I am not sure why it fired the first time without any waiting.
I also have to make sure the queries execute. I have something like this I was trying out.
async function getTimestamp() {
try {
const res = await pgClient.query(
'select current_timestamp from current_timestamp;'
)
return res.rows[0].current_timestamp
} catch (error) {
console.log('Retrying Query')
await sleep(1000)
return getTimestamp()
}
}
This seems to work, but I haven't tested it enough to make sure it will guarantee the query is executed or keep trying. I should look for specific errors and only loop forever on certain errors and fail on others. I need to do more research to find what errors are thrown. I also need to do a backoff on the delay here too.
It all "seems" to work, but I don't want to fall victim to the Dunning-Kruger effect. I need to ensure this process can handle all sorts of situations and recover.
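One possible shape for the retry logic described above, offered as a sketch rather than a vetted solution: a generic helper that retries an async function with exponentially growing delays, gives up after a fixed number of attempts, and lets the caller decide which errors are worth retrying. `retryWithBackoff`, its option names, and the `isRetryable` predicate are all hypothetical; which `pg` errors should count as retryable still needs the research mentioned above.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries `fn` until it succeeds, the error is not retryable,
// or `retries` additional attempts have been used up.
async function retryWithBackoff(fn, options = {}) {
  const {
    retries = 5,              // extra attempts after the first one
    baseDelayMs = 100,        // delay before the first retry
    maxDelayMs = 5000,        // upper bound for any single delay
    isRetryable = () => true, // caller decides which errors to retry
  } = options;

  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt >= retries || !isRetryable(err)) throw err;
      // Delay doubles on each attempt: base, 2*base, 4*base, ... capped at max.
      const delayMs = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      await sleep(delayMs);
    }
  }
}
```

The query from the question could then be wrapped as `retryWithBackoff(() => pgClient.query(...), { isRetryable: ... })`, which also removes the unbounded recursion in `getTimestamp`.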
I've just discovered that my API is doing weird things when 2 requests are triggered at almost the same time.
I figured out that the issue was me missing the "var" declaration before my "user" variable below, but I'm really curious about the root issue that caused the bug described below:
I have two API endpoints that call the same function as follows:
router.get('/refresh_session_token', function (req, res) {
let user_id = req.body.user_id // The value sent is 8
findUserWithId(user_id)
.then(user_data => {
user = user_data // I forgot 'var' here
})
.then(() => {
console.log(user) // This should always show user data from user_id = 8
})
})
router.get('/resend_invite', function (req, res) {
let user_id = req.body.user_id // The value sent is 18
findUserWithId(user_id)
.then(user_data => {
user = user_data // I forgot 'var' here
})
.then(() => {
console.log(user) // This should always show user data from user_id = 18
})
})
const findUserWithId = (id) => {
return knex.raw(`SELECT * FROM users WHERE id = ?`, [id]).then((data) => data.rows[0])
}
All this code is in the same file that I export through module.exports = router;
What I discovered is that if I trigger the endpoints /refresh_session_token and /resend_invite at almost the same time, each with a different user_id, my console.log sometimes returns the same result for both, as if I were using the same user_id.
Adding var to user fixed the issue, but I'm very surprised by what is actually happening in the background.
Do you have any idea?
When you don't declare your variable and you aren't running your module in JavaScript's strict mode, then the first assignment to that variable with:
user = user_data
creates an automatic global variable named user. This means that your two routes are then sharing that same variable.
And, since your two routes both have asynchronous operations in them, even with the single-threadedness of things, your two routes can still be in-flight at the same time and both trying to use the same global variable. One route will overwrite the value from the other. This is a disaster in server-based code because usually, the bug won't show until you get into production and it will be really, really hard to find a reproducible case.
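The overwrite can be reproduced in isolation. In the sketch below, `fakeFindUser` is a hypothetical stand-in for `findUserWithId`, `sharedUser` plays the role of the accidental global, and the extra `delay(workMs)` between the write and the read is an assumption added to make the interleaving deterministic; none of these names come from the original code.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for the database lookup: resolves after `lookupMs`.
const fakeFindUser = async (id, lookupMs) => {
  await delay(lookupMs);
  return { id };
};

let sharedUser; // plays the role of the accidental global `user`

// Buggy: every in-flight call writes to and reads from the same variable.
async function sharedHandler(id, lookupMs, workMs) {
  sharedUser = await fakeFindUser(id, lookupMs);
  await delay(workMs); // assumed further async work between write and read
  return sharedUser.id;
}

// Fixed: each call gets its own `user` binding.
async function localHandler(id, lookupMs, workMs) {
  const user = await fakeFindUser(id, lookupMs);
  await delay(workMs);
  return user.id;
}
```

Running `sharedHandler(8, 10, 60)` and `sharedHandler(18, 30, 0)` concurrently resolves to 18 for both, because the second call overwrites `sharedUser` while the first is still waiting; the same two calls to `localHandler` resolve to 8 and 18.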
The best answer here is to always run your code in strict mode and then the JS interpreter will make this an error and you will never be allowed to run your code this way in the first place. The error will be found very quickly and easily.
Then obviously, always declare variables with let or const. There are very, very few reasons to ever use var any more as let and const give you more control over the scope of your variable.
To run your module in strict mode, insert this:
'use strict';
before any other Javascript statements.
Or, use something like TypeScript that doesn't let you do sloppy things like not declare your variables.
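A small sketch of the difference strict mode makes. The directive is placed inside the function here only to keep the example self-contained; `assignInStrictMode` and `undeclaredUser` are made-up names.

```javascript
// In sloppy mode, assigning to an undeclared name silently creates a global.
// In strict mode (enabled for this function only), the same assignment throws.
function assignInStrictMode() {
  'use strict';
  try {
    undeclaredUser = { id: 8 }; // no let/const/var declaration anywhere
    return 'assigned';
  } catch (err) {
    return err.name; // 'ReferenceError'
  }
}
```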
I am having a problem where I make a bulk insert of multiple elements into a table and then immediately fetch the last X elements that were just inserted, but when I do that it seems that the elements have not yet been fully inserted, even though I am using async/await to wait for the async operations.
I am making a bulk insert like
const createElements = elementsArray => {
return knex
.insert(elementsArray)
.into('elements');
};
Then I have a method to immediately access those X elements that were inserted:
const getLastXInsertedElements = (userId, length, columns=['*']) => {
return knex.select(...columns)
.from('elements').where('userId', userId)
.orderBy('createdAt', 'desc')
.limit(length);
}
And finally after getting those elements I get their ids and save them into another table that makes use of element_id of those recently added elements.
so I have something like:
// A simple helper function that handles promises easily
const handleResponse = (promise, message) => {
return promise
.then(data => ([data, undefined]))
.catch(error => {
if (message) {
throw new Error(`${message}: ${error}`);
} else {
return Promise.resolve([undefined, `${message}: ${error}`])
}
}
);
};
async function service() {
await handleResponse(createElements(list), 'error text'); // insert x elements from the list
const [elements] = await handleResponse(getLastXInsertedElements(userId, list.length), 'error text') // get last x elements that were recently added
await handleResponse(useElementsIdAsForeignKey(listMakingUseOfElementsIds), 'error text'); // Here we use the ids of the elements we got from the last query, but we are not getting them properly for some reason
}
So the problem:
Sometimes when I execute getLastXInsertedElements it seems that the elements have not yet finished inserting, even though I am waiting for the insert with async/await. Any ideas why this is? Maybe something related to bulk inserts that I don't know of? An important note: all the elements are always properly inserted into the table at some point; it just seems that this point is not respected by the promise (the async operation that reports success for the knex.insert).
Update 1:
I have tried putting the select after the insert inside a setTimeout of 5 seconds for testing purposes, but the problem persists. That is really weird; one would think 5 seconds is enough time between the insert and the select to get all the data.
I would like to have all X elements that were just inserted accessible in the select query from getLastXInsertedElements consistently.
Which DB are you using, and how big is the list of data you are inserting? You could also test whether running the insert and getLastXInsertedElements in a transaction hides your problem.
Doing those operations in a transaction also forces knex to use the same connection for both queries, so it might help track down where this is coming from.
Another trick to force queries to use the same connection would be to set the pool's min and max configuration to 1 (just to test whether parallelism is indeed the problem here).
Also, since you have not provided complete reproduction code for this, I suspect there is something else in the mix that causes this odd behavior. Usually (but not always) weird cases like this that shouldn't happen are caused by user error elsewhere in the code using the library.
I'll update the answer if there is more information provided. Complete reproduction code would be the most important piece of information.
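For the single-connection experiment, the relevant part of the knex configuration might look like the sketch below. The `client` value and the `DATABASE_URL` environment variable are assumptions, since the question does not say which database is in use; `pool.min` and `pool.max` are the pool settings referred to above.

```javascript
// Test-only configuration: with a single pooled connection, every query
// runs on that one connection, one after another.
const singleConnectionConfig = {
  client: 'pg',                         // assumption: PostgreSQL
  connection: process.env.DATABASE_URL, // assumption: connection string in env
  pool: { min: 1, max: 1 },
};

// It would be passed to knex as: require('knex')(singleConnectionConfig)
```

If the problem disappears with this configuration, that points toward the insert and the select running on different connections.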
I am not 100% sure, but I believe the knex functions do not return a promise by default (they return a builder object for the query). That builder has a function called then that transforms the builder into a promise. So you may try adding a call to that:
...
limit(length)
.then(x => x); // required to transform to promise
Maybe try debugging the actual type of the returned value. It might happen that this still is not a promise. In that case you may need to use the then syntax instead of async/await, because these might not be real JS promises but knex's own implementation.
Also see this issue about standard js promise in knex https://github.com/knex/knex/issues/1588
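As a side note, `await` is not limited to native promises: it works with any object that has a `then` method (a "thenable"), so a builder that only starts executing when `then` is called can still be awaited directly. The toy class below is a hypothetical stand-in for such a builder, not knex's actual implementation.

```javascript
// Toy lazy query: nothing "runs" until something calls then().
class LazyQuery {
  constructor(rows) {
    this.rows = rows;
    this.executed = false;
  }

  // Called implicitly by `await`; the query is "executed" only at this point.
  then(onFulfilled, onRejected) {
    this.executed = true;
    return Promise.resolve(this.rows).then(onFulfilled, onRejected);
  }
}

async function runQuery(query) {
  const rows = await query; // awaits the thenable, no explicit .then() needed
  return rows;
}
```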
In theory, it should work.
You say "it seems"... a clearer explanation of the problem would be helpful.
My guess is that the problem is that elements.length = list.length - n where n > 0. Your code gives no details about the userId property in your list; a possible source of the problem could be that some elements in your list do not have the userId property set properly.
I can't seem to get the article duplicates out of my web scraper results, this is my code:
app.get("/scrape", function (req, res) {
request("https://www.nytimes.com/", function (error, response, html) {
// Load the HTML into cheerio and save it to a variable
// '$' becomes a shorthand for cheerio's selector commands, much like jQuery's '$'
var $ = cheerio.load(html);
var uniqueResults = [];
// With cheerio, find each p-tag with the "title" class
// (i: iterator. element: the current element)
$("div.collection").each(function (i, element) {
// An empty array to save the data that we'll scrape
var results = [];
// store scraped data in appropriate variables
results.link = $(element).find("a").attr("href");
results.title = $(element).find("a").text();
results.summary = $(element).find("p.summary").text().trim();
// Log the results once you've looped through each of the elements found with cheerio
db.Article.create(results)
.then(function (dbArticle) {
res.json(dbArticle);
}).catch(function (err) {
return res.json(err);
});
});
res.send("You scraped the data successfully.");
});
});
// Route for getting all Articles from the db
app.get("/articles", function (req, res) {
// Grab every document in the Articles collection
db.Article.find()
.then(function (dbArticle) {
res.json(dbArticle);
})
.catch(function (err) {
res.json(err);
});
});
Right now I am getting five copies of each article sent to the user. I have tried db.Article.distinct and various versions of this to filter the results down to only unique articles. Any tips?
In Short:
Switching var results = [] from an Array to an Object, var results = {}, did the trick for me. I still haven't figured out the exact reason for the duplicate insertion of documents in the database; will update as soon as I find out.
Long Story:
You have multiple mistakes and points of improvement in your code. I will try pointing them out.
Let's fix the mistakes first to make your code error free.
Mistakes
1. Mongoose's model.create and new mongoose() do seem to work fine with arrays, but I haven't seen such a use before and it does not look appropriate.
If you intend to create documents one after another, then represent each document as an object instead of an array. Using an array is more common when you intend to create multiple documents at once.
So switch -
var results = [];
to
var results = {};
2. Sending a response after one has already been sent causes an error. I don't know whether you have already noticed it, but once that error is raised the remaining documents won't get stored, because of an unhandled promise rejection, unless you have set up a try/catch block.
The db.Article.create calls inside $("div.collection").each(function (i, element) are asynchronous, so control does not wait for each document to be saved; instead it immediately goes on to execute res.send("You scraped the data successfully.");.
This effectively ends the HTTP response to the client, and any further response calls like res.json(dbArticle) or res.json(err) will throw an error.
So, just comment out the res.json statements inside the .create's then and catch methods. The response will then be sent before all the articles are saved in the DB, but you need not worry, as your code will still save the articles to the database in the background (asynchronously).
If you want your response to be terminated only after you have successfully saved the data then change your middleware implementation to -
request('https://www.nytimes.com', (err, response, html) => {
var $ = cheerio.load(html);
var results = [];
$("div.collection").each(function (i, element) {
var ob = {};
ob.link = $(element).find("a").attr("href");
ob.title = $(element).find("a").text();
ob.summary = $(element).find("p.summary").text().trim();
results.push(ob);
});
db.Article.create(results)
.then(function (dbArticles) {
res.json(dbArticles);
}).catch(function (err) {
return res.json(err);
});
});
After making the above changes, and even after just the first one, my version of your code ran fine. So if you want, you can continue with your current version, or you may read on for some points of improvement.
Points of Improvement
1. Era of callbacks is long gone:
Convert your implementation to use Promises, as they are more maintainable and easier to reason about. Here are the things you can do -
Change the request library from request to axios or any other that supports Promises by default.
2. Make effective use of mongoose methods for insertion. You can perform bulk inserts of multiple documents in just one query. You may find the docs on creating documents in mongodb quite helpful.
3. Start using some frontend task automation library such as puppeteer or nightmare.js for data scraping tasks. Trust me, they make life a whole lot easier than using cheerio or any other library for the same job. Their docs are really good and well maintained, so you won't have a hard time picking them up.
I have a tool whose basic idea is as follows:
//get a bunch of couchdb databases. this is an array
const jsonFile = require('jsonfile');
let dbList = getDbList();
const filePath = 'some/path/to/file';
const changesObject = {};
//iterate the db list. do asynchronous stuff on each iteration
dbList.forEach(function(db){
let merchantDb = nano.use(db);
//get some changes from the database. validate inside callback
merchantDb.get("_changes", function(err,changes){
validateChanges(changes);
changesObject['db'] = changes.someAttribute;
//write changes to file
jsonFile.writeFile(filePath, changesObject, function (err) {
if (err) {
logger.error("Unable to write to file: ");
}
});
})
});
const validateChanges = function(changes) {
if (!validateLogic(changes)) sendAlertMail();
}
For performance improvements the iteration is not done synchronously. Therefore there can be multiple iterations running in 'parallel'. My question is can this cause any data inconsistencies and/or any issues with the file writing process?
Edit:
The same file gets written to on each iteration.
Edit:2
The changes are stored as a JSON object with key value pairs. The key being the db name.
If you're really writing to a single file, which you appear to be (though it's hard to be sure), then no; you have a race condition in which multiple callbacks will try to write to the same file, possibly at the same time (remember, I/O isn't done on the JavaScript thread in Node unless you use the *Sync functions), which will at best mean the last one wins and will at worst mean I/O errors because of overlap.
If you're writing to separate files for each db, then provided there's no cross-talk (shared state) amongst validateChanges, validateLogic, sendAlertMail, etc., that should be fine.
Just for detail: It will start tasks (jobs) getting the changes and then writing them out; the callbacks of the calls to get won't be run until later, when all of those jobs are queued.
You are creating closures in loops, but the way you're doing it is okay, both because you're doing it within the forEach callback and because you're not using db in the get callback (which would be fine with the forEach callback but not with some other ways you might loop arrays). Details on that aspect in this question's answers if you're interested.
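To illustrate the closure point with a sketch unrelated to CouchDB: callbacks created in a `for` loop with `var` all share one loop variable, whereas each invocation of a `forEach` callback gets its own parameter. The function names are made up for the example.

```javascript
// All three callbacks close over the same `i`, which is 3 once the loop ends.
function collectWithVarLoop() {
  const callbacks = [];
  for (var i = 0; i < 3; i++) {
    callbacks.push(() => i);
  }
  return callbacks.map((cb) => cb()); // [3, 3, 3]
}

// Each forEach invocation has its own `i`, so every callback keeps its value.
function collectWithForEach() {
  const callbacks = [];
  [0, 1, 2].forEach((i) => {
    callbacks.push(() => i);
  });
  return callbacks.map((cb) => cb()); // [0, 1, 2]
}
```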
This line is suspect, though:
let merchantDb = nano.use('db');
I suspect you meant (no quotes):
let merchantDb = nano.use(db);
For what it's worth, it sounds from the updates to the question and your various comments like the better solution would be not to write out the file separately each time. Instead, you want to gather up the changes and then write them out.
You can do that with the classic Node-callback APIs you're using like this:
let completed = 0;
//iterate the db list. do asynchronous stuff on each iteration
dbList.forEach(function(db) {
let merchantDb = nano.use(db);
//get some changes from the database. validate inside callback
merchantDb.get("_changes", function(err, changes) {
if (err) {
// Deal with the fact there was an error (don't return)
} else {
validateChanges(changes);
changesObject[db] = changes.someAttribute; // <=== NOTE: This line had 'db' rather than db, I assume that was meant to be just db
}
if (++completed === dbList.length) {
// All done, write changes to file
jsonFile.writeFile(filePath, changesObject, function(err) {
if (err) {
logger.error("Unable to write to file: ");
}
});
}
})
});
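If promise-returning wrappers are available (for instance via Node's `util.promisify`), the same gather-then-write-once idea can be sketched without a manual counter. In the sketch below, `gatherAndWrite`, `fetchChanges` and `writeFile` are hypothetical names: `fetchChanges(db)` is assumed to return a promise for that database's changes, and `writeFile(obj)` a promise that resolves once the file is written. Validation is left out for brevity.

```javascript
// Collects the changes from every database, then writes the file exactly once.
async function gatherAndWrite(dbList, fetchChanges, writeFile) {
  const changesObject = {};

  // Start all lookups at once and wait until each has succeeded or failed.
  const results = await Promise.allSettled(dbList.map((db) => fetchChanges(db)));

  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      changesObject[dbList[i]] = result.value.someAttribute;
    }
    // Failed lookups are skipped here; real code should log or handle them.
  });

  await writeFile(changesObject); // single write after everything has settled
  return changesObject;
}
```

Because only one write is ever issued, the race between concurrent writes to the same file cannot occur.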