Node.js fs.writeFile() empties the file

I have an update method which gets called about every 16-40ms, and inside I have this code:
this.fs.writeFile("./data.json", JSON.stringify({
    totalPlayersOnline: this.totalPlayersOnline,
    previousDay: this.previousDay,
    gamesToday: this.gamesToday
}), function (err) {
    if (err) {
        return console.log(err);
    }
});
If the server throws an error, the "data.json" file sometimes becomes empty. How do I prevent that?

Problem
fs.writeFile is not an atomic operation. Here is an example program which I will run strace on:
#!/usr/bin/env node
const { writeFile, } = require('fs');
// nodejs won’t exit until the Promise completes.
new Promise(function (resolve, reject) {
    writeFile('file.txt', 'content\n', function (err) {
        if (err) {
            reject(err);
        } else {
            resolve();
        }
    });
});
When I run that under strace -f and tidy up the output to show just the syscalls from the writeFile operation (which actually spans multiple IO threads), I get:
open("file.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 9
pwrite(9, "content\n", 8, 0) = 8
close(9) = 0
As you can see, writeFile completes in three steps.
The file is open()ed. This is an atomic operation that, with the provided flags, either creates an empty file on disk or, if the file exists, truncates it. Truncating is an easy way to make sure that only the content you write ends up in the file: without it, if the existing data were longer than the data you subsequently write, the extra bytes at the end would remain.
The content is written. Because I wrote such a short string, this is done with a single pwrite() call, but for larger amounts of data I assume it is possible nodejs would only write a chunk at a time.
The handle is closed.
My strace had each of these steps occurring on a different node IO thread. This suggests to me that fs.writeFile() might actually be implemented in terms of fs.open(), fs.write(), and fs.close(). Thus, nodejs does not treat this complex operation like it is atomic at any level—because it isn’t. Therefore, if your node process terminates, even gracefully, without waiting for the operation to complete, the operation could be at any of the steps above. In your case, you are seeing your process exit after writeFile() finishes step 1 but before it completes step 2.
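To make those three steps concrete, here is a rough, simplified sketch of what fs.writeFile() does internally. This is illustrative only (the real implementation also handles Buffers, options, and partial writes), and writeFileSketch is just a made-up name:
const fs = require('fs');

// Simplified illustration: open (create/truncate), write, close.
function writeFileSketch(path, data, cb) {
    fs.open(path, 'w', function (err, fd) {            // step 1: creates or truncates the file
        if (err) return cb(err);
        fs.write(fd, data, function (err) {            // step 2: write the content
            if (err) return fs.close(fd, function () { cb(err); });
            fs.close(fd, cb);                          // step 3: close the handle
        });
    });
}
A crash between step 1 and step 2 leaves exactly the empty file described in the question.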
Solution
The common pattern for transactionally replacing a file's contents, when you have a POSIX layer available, is to use these steps:
Write the data to a differently named file, fsync() the file (See “When should you fsync?” in “Ensuring data reaches disk”), and then close() it.
rename() (or, on Windows, MoveFileEx() with MOVEFILE_REPLACE_EXISTING) the differently-named file over the one you want to replace.
Using this algorithm, the destination file is either updated or not regardless of when your program terminates. And, even better, journalled (modern) filesystems will ensure that, as long as you fsync() the file in step 1 before proceeding to step 2, the two operations will occur in order. I.e., if your program performs step 1 and then step 2 but you pull the plug, when you boot up you will find the filesystem in one of the following states:
Neither step completed. The original file is intact (or, if it never existed before, it still doesn't exist). The replacement file is either nonexistent (step 1 of the writeFile() algorithm, the open(), effectively never succeeded), existent but empty (step 1 of the writeFile() algorithm completed), or existent with some data (step 2 of the writeFile() algorithm partially completed).
The first step completed. The original file is intact (or if it didn’t exist before it still doesn’t exist). The replacement file exists with all of the data you want.
Both steps completed. At the path of the original file, you can now access your replacement data—all of it, not a blank file. The path you wrote the replacement data to in the first step no longer exists.
The code to use this pattern might look like the following:
const { writeFile, rename, } = require('fs');

function writeFileTransactional (path, content, cb) {
    // The replacement file must be in the same directory as the
    // destination because rename() does not work across device
    // boundaries.
    //
    // This simple choice of replacement filename means that this
    // function must never be called concurrently with itself for the
    // same path value. Also, properly guarding against other
    // processes trying to use the same temporary path would make this
    // function more complicated. If that is a concern, a proper
    // temporary file strategy should be used. However, this
    // implementation ensures that any files left behind during an
    // unclean termination will be cleaned up on a future run.
    let temporaryPath = `${path}.new`;
    writeFile(temporaryPath, content, function (err) {
        if (err) {
            return cb(err);
        }

        rename(temporaryPath, path, cb);
    });
}
This is basically the same solution you'd use for the same problem in any language/framework.
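The sketch above skips the fsync() from step 1 of the pattern. A variant that adds it, using the lower-level fs calls (a minimal sketch with deliberately simple error handling; writeFileTransactionalDurable is just an illustrative name):
const fs = require('fs');

// Like writeFileTransactional, but fsync()s the temporary file before the
// rename so the data is on disk before the old file is replaced.
function writeFileTransactionalDurable(path, content, cb) {
    const temporaryPath = `${path}.new`;
    fs.open(temporaryPath, 'w', function (err, fd) {
        if (err) return cb(err);
        fs.write(fd, content, function (err) {
            if (err) return fs.close(fd, function () { cb(err); });
            fs.fsync(fd, function (err) {
                if (err) return fs.close(fd, function () { cb(err); });
                fs.close(fd, function (err) {
                    if (err) return cb(err);
                    fs.rename(temporaryPath, path, cb);
                });
            });
        });
    });
}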

If the error is caused by bad input (the data you want to write), then make sure the data is what it should be and then do the writeFile.
If the error is caused by a failure of writeFile even though the input is OK, you could check that the function keeps executing until the file is written. One way is to use the async doWhilst function.
async.doWhilst(
    writeFile,         // your write function here, but on failure call back with success so the loop runs again
    checkIfFileEmpty,  // a test that returns true while the file is still null/empty, so doWhilst retries
    function (err) {
        // here the file is not null
    }
);

I didn't run any real tests on this; I just noticed, by manually reloading my IDE, that the file was sometimes empty.
What I tried first was the rename method, and I noticed the same problem there, but recreating a new file was less desirable (considering file watchers etc.).
My suggestion, or what I'm doing now, is this: in my own readFileSync I check whether the file is missing or the returned data is empty, sleep for 100 milliseconds, and give it another try. I suppose a third try with more delay would really push the reliability up a notch, but currently I'm not going to do it, as the added delay is hopefully an unnecessary negative (I would consider a promise at that point). There are other recovery options you can add, relative to your own code, just in case. "File not found or empty?" is basically a retry by another route.
My custom writeFileSync has an added flag to toggle between using the rename method (with creation of a '._new' write sub-dir) or the normal direct method, as your code's needs may vary. Deciding based on file size is my recommendation.
In this use case the files are small and only updated by one node instance/server at a time. I can see adding a random file name with rename as another option, to allow multiple machines to write, for later if needed. Maybe a retry-limit argument as well?
I was also thinking that you could write to a local temp file and then copy to the shared target by some means (maybe also rename on the target for a speed increase), and of course clean up (unlink the local temp file) afterwards. I guess that idea is kind of pushing it toward shell commands, so it's not better.
Anyway, the main idea here is still to read twice if the file is found empty. I'm sure it's safe from being partially written, via Node.js 8+ onto a shared Ubuntu-type NFS mount, right?

Related

How to delay function call until its callback will be finished in other places

I'm using child_process to write commands to the console, and then I subscribe to the 'data' event to get the output. The problem is that sometimes the outputs are merged with each other.
let command = spawn('vlc', { shell: true });

writeCommand(cmd, callback) {
    process.stdin.write(`${cmd}\n`);
    this.isBusy = true;
    this.process.stdout.on('data', (d) => {
        callback(d);
    });
}
The writeCommand function is used in several places; how can I delay it from executing until the output from the previous command has finished?
My output can look like (for status command for example):
( audio volume: 230 ) ( state stopped ) >
data events on a stream have zero guarantees that a whole "unit" of output will come together in a single data event. It could easily be broken up into multiple data events. So this, combined with the fact that you are providing multiple inputs which generate multiple outputs, means that you need a way to parse two things: when you have a complete set of output (and thus should call the callback with it), and where the boundaries between sets of output lie.
You don't show us exactly what your output looks like, so we can't offer any concrete suggestions on how to parse it in that way, but common delimiters are double line feeds or things like that. It entirely depends upon what your output naturally does at the end or, if you control the content the child process creates, what you can insert at the end of the output.
Another work-around for the merged output would be to not send the 2nd command until the 1st one is done (perhaps by using some sort of pending queue). But, you will still need a way to parse the output to know when you actually have the completion of the previous output.
Another problem:
In the code you show, every time you call writeCommand(), you will add yet another listener for the data event. So, when you call it twice to send different commands, you will now have two listeners both listening for the same data and you will be processing the same response twice instead of just once.
let command = spawn('vlc', { shell: true });

writeCommand(cmd, callback) {
    process.stdin.write(`${cmd}\n`);
    this.isBusy = true;
    // every time writeCommand is called, it adds yet another listener
    this.process.stdout.on('data', (d) => {
        callback(d);
    });
}
If you really intend to call this multiple times, and multiple commands could be "in flight" at the same time, then you really can't use this coding structure. You will probably need one permanent listener for the data event, outside this function, because you don't want more than one listener at the same time. And since you've already found that the data from two commands can be merged, even if you separate the listeners, you can't use this structure to capture the data correctly for the second part of the merged output.
You can use a queuing mechanism to execute the next command after the first one is finished. You can also use a library like https://www.npmjs.com/package/p-limit to do it for you.
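A minimal sketch of such a queue, assuming (as the sample output above suggests) that the prompt character > marks the end of one command's output; the class name, the delimiter check, and the VLC invocation are illustrative, not a definitive implementation:
const { spawn } = require('child_process');

// One permanent 'data' listener; commands are queued and run one at a time.
class CommandRunner {
    constructor() {
        this.process = spawn('vlc', { shell: true });
        this.queue = [];        // { cmd, callback } pairs waiting to run
        this.current = null;    // the command currently in flight
        this.buffer = '';
        this.process.stdout.on('data', (d) => {
            this.buffer += d.toString();
            // Assumed delimiter: the '>' prompt ends one command's output.
            if (this.current && this.buffer.includes('>')) {
                const finished = this.current;
                const output = this.buffer;
                this.current = null;
                this.buffer = '';
                finished.callback(output);
                this.dequeue();   // start the next queued command, if any
            }
        });
    }

    writeCommand(cmd, callback) {
        this.queue.push({ cmd, callback });
        this.dequeue();
    }

    dequeue() {
        if (this.current || this.queue.length === 0) return;
        this.current = this.queue.shift();
        this.process.stdin.write(`${this.current.cmd}\n`);
    }
}
With this structure there is only ever one listener, and a second writeCommand() call simply waits until the first command's output (up to its > prompt) has been delivered.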

Node.js - Does iterating with a for loop ensure my callbacks are called one after another in order?

Apologies if the title is a little undescriptive - however I often come across this problem and wonder what the correct way to handle this situation is:
I have an array/some list I want to iterate through, and I want to call some methods that have callbacks to subsequent steps. Would all the callbacks be processed? And would they be done in order?
To be more specific, here is an example:
1 - I've created this array called files containing the paths of some dmg files in a folder:
var files = [];

walker.on('file', function(root, stat, next) {
    if (stat.name.indexOf(".dmg") > -1) {
        files.push(root + '/' + stat.name);
    }
    next();
});
2 - I then want to iterate through, upload something, then after the upload send a message to a RabbitMQ queue:
for (var bk = 0; bk < files.length; bk++) {
    var uploader = client.uploadFile(params);

    uploader.on('error', function (err) {
        console.error("unable to upload:", err.stack);
    });

    uploader.on('progress', function () {
        console.log("progress", uploader.progressMd5Amount,
            uploader.progressAmount, uploader.progressTotal);
    });

    uploader.on('end', function () {
        console.log("done uploading");
        // Now send the message to RabbitMQ
        myRabbitMQObject.then(function (conn) {
            return conn.createChannel();
        }).then(function (ch) {
            return ch.assertQueue(q).then(function (ok) {
                return ch.sendToQueue(q, new Buffer("Some message with path from the files array"));
            });
        }).catch(console.warn);
    });
}
Now the bit I'm always unsure of is if I placed the block of code under 2 around a for loop - as that has callbacks inside, are these guaranteed to get called?
In this example I don't really care about the order - however if I did care about the order, will having it in a for loop ensure the uploading and rabbitmq messages are sent one after another?
I hope the question makes sense.
Any advice appreciated.
Thanks.
Your for loop runs synchronously. In your code, what that means is that it will execute the line:
var uploader = client.uploadFile(params);
one after another starting all the uploads. It won't wait for the first one to finish before it starts all of them. So, think of your for loop as initiating a whole bunch of asynchronous operations.
Then, sometime later, one by one, in no guaranteed order, each of your uploads will finish. They will essentially all be "in flight" at the same time. Each of your rabbitMQ operations will happen whenever their corresponding upload finishes. The for loop will long since be over at that point and the MQ operations will be in no particular order.
Your current code has no way of telling when everything is done.
I have an array/some list I want to iterate through, and I want to call some methods that have callbacks to subsequent steps. Would all the callbacks be processed? And would they be done in order?
All the events will get triggered and your event handler callbacks will get called. They will not be done in any guaranteed order.
Now the bit I'm always unsure of is if I placed the block of code under 2 around a for loop - as that has callbacks inside, are these guaranteed to get called?
Yes. Your callbacks will get called. The for loop launches each upload and, at some point, they will all trigger their events, which will call your event handler callbacks.
In this example I don't really care about the order - however if I did care about the order, will having it in a for loop ensure the uploading and rabbitmq messages are sent one after another?
The for loop will ensure that the uploads are started in sequence. But, the finish order is not guaranteed so therefore the rabbitmq messages that you send upon finish may be in any order. If you want the rabbitmq messages to be sent in a particular order or want the uploads to be sequenced, then you need more/different code to make that happen.
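A minimal sketch of one way to sequence them, for a Node version with async/await: wrap each upload's events in a Promise and await them one at a time. uploadOne and uploadAllInOrder are illustrative names; client, params, myRabbitMQObject, and q are assumed to be the same objects as in the question:
function uploadOne(path) {
    return new Promise(function (resolve, reject) {
        var uploader = client.uploadFile(params);   // params built for this path, as in the original loop
        uploader.on('error', reject);
        uploader.on('end', resolve);
    });
}

async function uploadAllInOrder(files) {
    for (var bk = 0; bk < files.length; bk++) {
        await uploadOne(files[bk]);                 // wait for this upload to finish...
        var conn = await myRabbitMQObject;          // ...then send its RabbitMQ message
        var ch = await conn.createChannel();
        await ch.assertQueue(q);
        ch.sendToQueue(q, Buffer.from("Some message with path " + files[bk]));
    }
}
This guarantees the order of both the uploads and the messages, at the cost of losing the parallelism of the original loop.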
This is my understanding; it might help you:
Iterating over the list of files and calling the upload function is synchronous, so it will maintain order.
Updating RabbitMQ on successful completion of an upload does not guarantee order, because it is asynchronous and the completion of each upload depends on the size of the file and the network latency during the upload.

Stopping synchronous function after 2 seconds

I'm using the npm library jsdiff, which has a function that determines the difference between two strings. This is a synchronous function, but given two large, very different strings, it will take extremely long periods of time to compute.
diff = jsdiff.diffWords(article[revision_comparison.field], content[revision_comparison.comparison]);
This function is called in a stack that handles a request through Express. How can I, for the sake of the user, make the experience more bearable? I think my two options are:
Cancelling the synchronous function somehow.
Cancelling the user request somehow. (But would this keep the function still running?)
Edit: I should note that, given two very large and different strings, I want different logic to take place in the code. Therefore, simply waiting for the process to finish is unnecessary and a needless load - I definitely don't want it to run for any long period of time.
Fork a child process for that specific task; you can even create a queue to limit the number of child processes that can be running at a given moment.
Here you have a basic example of a worker that receives work from the original Express request, performs the heavy synchronous operation without blocking the main (master) thread, and once it has finished sends the outcome back to the master.
Worker (Fork Example) :
// worker.js
process.on('message', function (msg) {
    /* > Your jsdiff logic goes here */
    // change this for your heavy synchronous work:
    var input = msg.input;
    var outcome = false;
    if (input == 'testlongerstring') { outcome = true; }
    // Pass the result back to the parent process:
    process.send({ id: msg.id, outcome: outcome });
});
And from your Master :
var cp = require('child_process');
var child = cp.fork(__dirname + '/worker.js');
var pending = {};   // Express responses waiting on the worker, keyed by request id
var nextId = 0;

child.on('message', function (msg) {
    // Receive results from the child process
    console.log('received: ' + msg.outcome);
    var res = pending[msg.id];
    delete pending[msg.id];
    res.send(msg.outcome); // end the response with the data
});
You can send the work for a given request to the child like this (from the master), remembering which response to answer when the result comes back (imagine app = express):
app.get('/stringCheck/:input', function (req, res) {
    var id = nextId++;
    pending[id] = res;   // remember which response to answer
    child.send({ id: id, input: req.params.input });
});
I found this on jsdiff's repository:
All methods above which accept the optional callback method will run in sync mode when that parameter is omitted and in async mode when supplied. This allows for larger diffs without blocking the event loop. This may be passed either directly as the final parameter or as the callback field in the options object.
This means that you should be able to add a callback as the last parameter, making the function asynchronous. It will look something like this:
jsdiff.diffWords(article[x], content[y], function (err, diff) {
    // add whatever you need
});
Now, you have several choices:
Return directly to the user and keep the function running in the background.
Set a 2 second timeout (or whatever limit fits your application) using setTimeout, as outlined in this answer.
If you go with option 2, your code should look something like this
var timer = setTimeout(function () {
    // if this fires, the diff has taken more than 2000ms (2 seconds)
    timer = null;
    return callback();
}, 2000);

jsdiff.diffWords(article[x], content[y], function (err, diff) {
    if (timer === null) return;   // the timeout has already answered
    clearTimeout(timer);
    // add whatever you need
    return callback(err, diff);
});

How to avoid multiple node processes doing repetitive things?

I have a module in Node.js which repeatedly picks a document from MongoDB and processes it. One document should be processed only once. I also want to use a multi-process approach: I want to run the same module (process) on different processors, running independently.
The problem is that there might be a scenario where the same document is picked and processed by two different workers. How can multiple processes know that a particular document is being processed by some other worker and should not be touched? And there is no way for my independent processes to communicate. I cannot use a parent which forks multiple processes and acts as a bridge between them. How can I avoid this kind of problem in Node.js?
One way to do it is to assign a unique numeric ID to each of your MongoDB documents, and a unique numeric identifier to each of your node.js workers.
For example, have an env var called NUM_WORKERS, and then in your node.js module:
var NumWorkers = process.env.NUM_WORKERS || 1;
You then need to assign a unique, contiguous instance number (in the range 0 to NumWorkers-1) to each of your workers (e.g. via a command-line parameter read by your node.js process when it initializes). You can store that in a variable called MyWorkerInstanceNum.
When you pick a document from MongoDB, call the following function (passing the document's unique documentId as a parameter):
function isMine(documentId) {
    //
    // Example: documentId = 10
    //          NumWorkers = 4
    //          (10 % 4)   = 2
    // If MyWorkerInstanceNum is 2, return true, else return false.
    return ((documentId % NumWorkers) === MyWorkerInstanceNum);
}
Only continue to actually process the document if isMine() returns true.
So, multiple workers may "pick" a document, but only one worker will actually process it.
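For example, the instance number could be read from a command-line argument, as the answer suggests (the file name and launch commands below are only illustrative):
// worker.js, launched e.g. as:
//   NUM_WORKERS=4 node worker.js 0
//   NUM_WORKERS=4 node worker.js 1   ... and so on up to 3
var NumWorkers = parseInt(process.env.NUM_WORKERS, 10) || 1;
var MyWorkerInstanceNum = parseInt(process.argv[2], 10) || 0;

function isMine(documentId) {
    return ((documentId % NumWorkers) === MyWorkerInstanceNum);
}

// Later, when a document is picked from MongoDB:
// if (isMine(doc.documentId)) { /* process it */ }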
Simply keep a transaction log of the document being processed by its unique ID. In the transaction log table for the processed documents, write the status as one of the following (for example):
requested
initiated
processed
failed
You may also want a column in that table for stderr/stdout in case you want to know why something failed or succeeded, and timestamps - that sort of thing.
When you initialize the processing of the document in your Node app, look up the document by ID and check its status. If it doesn't exist, then you're free to process it.
Pseudo-code (sorry, I'm not a Mongo guy!):
db.collection.list('collectionName', function(err, doc) {
    db.collection.find(doc.id, 'transactions', function(err, trx) {
        if (trx === undefined || trx.status === 'failed') {
            DocProcessor.child.process(doc)
        } else {
            // don't need to process it, it's already been done
        }
    })
})
You'll also want to enable concurrency locking on the transactions log collection to ensure that a row (and the subsequent job) can't be duplicated. If it becomes a challenge to ensure docs are being queued properly, consider adding an AMQP service to handle queuing of the docs. Set up a handler to manage distribution of the child processes and transaction logging. The flow would be something like:
MQ ⇢ Log ⇢ Handler ⇢ Doc processor children
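One way to implement that lock is to let a unique index on the transaction log do the work: whichever worker manages to insert the claim row first wins. A minimal sketch with the MongoDB Node.js driver (the collection and field names are illustrative):
// Run once at setup time so duplicate claims are rejected by the database:
//   db.collection('transactions').createIndex({ docId: 1 }, { unique: true });

async function claimDocument(db, docId, workerId) {
    try {
        await db.collection('transactions').insertOne({
            docId: docId,
            status: 'initiated',
            worker: workerId,
            startedAt: new Date()
        });
        return true;                  // this worker owns the document
    } catch (err) {
        if (err.code === 11000) {     // duplicate key: another worker already claimed it
            return false;
        }
        throw err;
    }
}

// Usage sketch: only process the document if the claim succeeded.
// if (await claimDocument(db, doc._id, myWorkerId)) { processDocument(doc); }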

How to run an async function for each line of a very large (> 1GB) file in Node.js

Say you have a huge (> 1GB) CSV of record ids:
655453
4930285
493029
4930301
493031
...
And for each id you want to make a REST API call to fetch the record data, transform it locally, and insert it into a local database.
How do you do that with Node.js' Readable Stream?
My question is basically this: How do you read a very large file, line-by-line, run an async function for each line, and [optionally] be able to start reading the file from a specific line?
From the following Quora question I'm starting to learn to use fs.createReadStream:
http://www.quora.com/What-is-the-best-way-to-read-a-file-line-by-line-in-node-js
var fs = require('fs');
var lazy = require('lazy');

var stream = fs.createReadStream(path, {
    flags: 'r',
    encoding: 'utf-8'
});

new lazy(stream).lines.forEach(function(line) {
    var id = line.toString();
    // pause stream
    stream.pause();
    // make async API call...
    makeAPICall(id, function() {
        // then resume to process next id
        stream.resume();
    });
});
But, that pseudocode doesn't work, because the lazy module forces you to read the whole file (as a stream, but there's no pausing). So that approach doesn't seem like it will work.
Another thing is, I would like to be able to start processing this file from a specific line. The reason for this is, processing each id (making the api call, cleaning the data, etc.) can take up to a half a second per record so I don't want to have to start from the beginning of the file each time. The naive approach I'm thinking about using is to just capture the line number of the last id processed, and save that. Then when you parse the file again, you stream through all the ids, line by line, until you find the line number you left off at, and then you do the makeAPICall business. Another naive approach is to write small files (say of 100 ids) and process each file one at a time (small enough dataset to do everything in memory without an IO stream). Is there a better way to do this?
I can see how this gets tricky (and where node-lazy comes in) because the chunk in stream.on('data', function(chunk) {}); may contain only part of a line (if the bufferSize is small, each chunk may be 10 lines but because the id is variable length, it may only be 9.5 lines or whatever). This is why I'm wondering what the best approach is to the above question.
Related to Andrew Андрей Листочкин's answer:
You can use a module like byline to get a separate data event for each line. It's a transform stream around the original filestream, which produces a data event for each chunk. This lets you pause after each line.
byline won't read the entire file into memory like lazy apparently does.
var fs = require('fs');
var byline = require('byline');

var stream = fs.createReadStream('bigFile.txt');
stream.setEncoding('utf8');

// Comment out this line to see what the transform stream changes.
stream = byline.createStream(stream);

// Write each line to the console with a delay.
stream.on('data', function(line) {
    // Pause until we're done processing this line.
    stream.pause();
    setTimeout(() => {
        console.log(line);
        // Resume processing.
        stream.resume();
    }, 200);
});
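For comparison, newer Node versions can do the same thing with the built-in readline module and async iteration, which also makes it easy to skip ahead to a specific line. This is a sketch under the assumption that the per-line work (processId here) returns a Promise; the names are illustrative:
const fs = require('fs');
const readline = require('readline');

async function processFile(path, startLine) {
    const rl = readline.createInterface({
        input: fs.createReadStream(path),
        crlfDelay: Infinity            // treat \r\n as a single line break
    });

    let lineNumber = 0;
    for await (const line of rl) {
        lineNumber++;
        if (lineNumber < startLine) continue;   // skip lines processed on a previous run
        await processId(line.trim());           // your async API call / transform / insert
    }
    return lineNumber;                          // last line processed, useful for resuming
}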
I guess you don't need to use node-lazy. Here's what I found in Node docs:
Event: data
function (data) { }
The data event emits either a Buffer (by default) or a string if
setEncoding() was used.
So that means that if you call setEncoding() on your stream, then your data event callback will receive a string parameter. Inside this callback you can then use the .pause() and .resume() methods.
The pseudo code should look like this:
stream.setEncoding('utf8');

stream.addListener('data', function (line) {
    // pause stream
    stream.pause();
    // make async API call...
    makeAPICall(line, function() {
        // then resume to process next line
        stream.resume();
    });
})
Although the docs don't explicitly specify that the stream is read line by line, I assume that that's the case for file streams. At least in other languages and platforms text streams work that way, and I see no reason for Node streams to differ.
