Using Node.js to read a live file line by line - javascript

I am trying to determine what is the best way to read a live file line by line.
The line will sent for consumption and then discarded.
The file is live, meaning it is being written to by another application (its a log file).
The file could be large, and therefore I dont want to ready the whole thing in to memory and then process it.
Read line
Process it
Keep required data
Read next line
etc..
It seems there are many plugins aka modules. Not sure what the best (fast and efficient) way is.
I am using node.js version 0.10.33
Thanks

Use tail. It's just like the unix tail command, but in node.
npm install tail
a usage example from the npm page:
Tail = require('tail').Tail;
tail = new Tail("fileToTail", "\n", {}, true);
tail.on("line", function(data) {
console.log(data);
});
tail.on("error", function(error) {
console.log('ERROR: ', error);
});

you can create read stream http://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
and read before next line. Something like this
var lines = [],
line,
rs = require('fs').createReadStream('/etc/passwd');
rs.on('data', function(chunk){
var indx = chunk.indexOf("\n")'
if( indx !== -1 ) {
line = line + chunk;
} else {
line = line + chunk.chunk(0, indx); //we cut the "\n" symbol.
lines.push(line); //we add line to array of lines of file
line = ''; //we clear buffer
}
});
rs.on('end', function(){
console.log(lines);
});

A linux only solution would be to spawn a child_process of tail -f /path/to/your/log and do something with stdout - Not very elegant, but it would work.

Related

NodeJS exec stop after some time without error

I'm using exec from child_process.
The function runs fine but after 4-5minutes, it just stops, without any errors reported, but the script should run for at least 24hours...
Here is the code :
import { exec } from 'child_process';
function searchDirectory(dirPath) {
let lineBuffer = '';
const cmd = `find ${dirPath} -type f -name "*.txt" | pv -l -L 10 -q`;
const findData = exec(cmd);
findData.on('error', err => log.error(err));
findData.stdout.on('data', data => {
lineBuffer += data;
let lines = lineBuffer.split('\n');
for (var i = 0; i < lines.length - 1; i++) {
let filepath = lines[i];
processfile(filepath);
}
lineBuffer = lines[lines.length - 1];
});
findData.stdout.on('end', () => console.log('finished finding...'));
}
The pv command slows down the output, I need this since the path where I'm finding is over the network and pretty slow (60mb/s).
When I run the command directly in the terminal it works fine (I didn't wait 24hours but I let it for half hour and it was still running).
The processfile function actually makes an async call with axios to send some data to a server :
let data = readFileSync(file);
...
axios.post(API_URL, { obj: data }, { proxy: false })
.then(res => {
log.info('Successfully saved object : ' + res.data._id);
})
.catch(err => {
log.error(err.response ? err.response.data : err);
});
What could cause the script to stop? Any ideas?
Thanks
I found the issue, using exec is not recommended for huge outputs since it's using a limited size buffer. Use spawn instead :
The most significant difference between child_process.spawn and
child_process.exec is in what they return - spawn returns a stream and
exec returns a buffer.
child_process.spawn returns an object with stdout and stderr streams.
You can tap on the stdout stream to read data that the child process
sends back to Node. stdout being a stream has the "data", "end", and
other events that streams have. spawn is best used to when you want
the child process to return a large amount of data to Node - image
processing, reading binary data etc.
child_process.exec returns the whole buffer output from the child
process. By default the buffer size is set at 200k. If the child
process returns anything more than that, you program will crash with
the error message "Error: maxBuffer exceeded". You can fix that
problem by setting a bigger buffer size in the exec options. But you
should not do it because exec is not meant for processes that return
HUGE buffers to Node. You should use spawn for that. So what do you
use exec for? Use it to run programs that return result statuses,
instead of data.
from : https://www.hacksparrow.com/difference-between-spawn-and-exec-of-node-js-child_process.html

Random comma inserted at character 8192 in python "json" result called from node.js

I'm a JS developer just learning python. This is my first time trying to use node (v6.7.0) and python (v2.7.1) together. I'm using restify with python-runner as a bridge to my python virtualenv. My python script uses a RAKE NLP keyword-extraction package.
I can't figure out for the life of me why my return data in server.js inserts a random comma at character 8192 and roughly multiples of. There's no pattern except the location; Sometimes it's in the middle of the object key string other times in the value, othertimes after the comma separating the object pairs. This completely breaks the JSON.parse() on the return data. Example outputs below. When I run the script from a python shell, this doesn't happen.
I seriously can't figure out why this is happening, any experienced devs have any ideas?
Sample output in browser
[..., {...ate': 1.0, 'intended recipient': 4.,0, 'correc...}, ...]
Sample output in python shell
[..., {...ate': 1.0, 'intended recipient': 4.0, 'correc...}, ...]
DISREGARD ANY DISCREPANCIES REGARDING OBJECT CONVERSION AND HANDLING IN THE FILES BELOW. THE CODE HAS BEEN SIMPLIFIED TO SHOWCASE THE ISSUE
server.js
var restify = require('restify');
var py = require('python-runner');
var server = restify.createServer({...});
server.get('/keyword-extraction', function( req, res, next ) {
py.execScript(__dirname + '/keyword-extraction.py', {
bin: '.py/bin/python'
})
.then( function( data ) {
fData = JSON.parse(data); <---- ERROR
res.json(fData);
})
.catch( function( err ) {...});
return next();
});
server.listen(8001, 'localhost', function() {...});
keyword-extraction.py
import csv
import json
import RAKE
f = open( 'emails.csv', 'rb' )
f.readline() # skip line containing col names
outputData = []
try:
reader = csv.reader(f)
for row in reader:
email = {}
emailBody = row[7]
Rake = RAKE.Rake('SmartStoplist.txt')
rakeOutput = Rake.run(emailBody)
for tuple in rakeOutput:
email[tuple[0]] = tuple[1]
outputData.append(email)
finally:
file.close()
print( json.dumps(outputData))
This looks suspiciously like a bug related to size of some buffer, since 8192 is a power of two.
The main thing here is to isolate exactly where the failure is occurring. If I were debugging this, I would
Take a closer look at the output from json.dumps, by printing several characters on either side of position 8191, ideally the integer character code (unicode, ASCII, or whatever).
If that looks OK, I would try capturing the output from the python script as a file and read that directly in the node server (i.e. don't run a python script).
If that works, then create a python script that takes that file and outputs it without manipulation and have your node server execute that python script instead of the one it is using now.
That should help you figure out where the problem is occurring. From comments, I suspect that this is essentially a bug that you cannot control, unless you can increase the python buffer size enough to guarantee your data will never blow the buffer. 8K is pretty small, so that might be a realistic solution.
If that is inadequate, then you might consider processing the data on the the node server, to remove every character at n * 8192, if you can consistently rely on that. Good luck.

node.js call external exe and wait for output

I just want to call an external exe from a nodejs-App. This external exe makes some calculations and returns an output the nodejs-App needs. But I have no idea how to make the connection between nodejs and an external exe. So my questions:
How do I call an external exe-file with specific arguments from within nodejs properly?
And how do I have to transmit the output of the exe to nodejs efficiently?
Nodejs shall wait for the output of the external exe. But how does nodejs know when the exe has finished its processing? And then how do I have to deliver the result of the exe? I don't want to create a temporary text-file where I write the output to and nodejs simply reads this text-file. Is there any way I can directly return the output of the exe to nodejs? I don't know how an external exe can directly deliver its output to nodejs. BTW: The exe is my own program. So I have full access to that app and can make any necessary changes. Any help is welcome...
With child_process module.
With stdout.
Code will look like this
var exec = require('child_process').exec;
var result = '';
var child = exec('ping google.com');
child.stdout.on('data', function(data) {
result += data;
});
child.on('close', function() {
console.log('done');
console.log(result);
});
You want to use child_process, you can use exec or spawn, depending on your needs. Exec will return a buffer (it's not live), spawn will return a stream (it is live). There are also some occasional quirks between the two, which is why I do the funny thing I do to start npm.
Here's a modified example from a tool I wrote that was trying to run npm install for you:
var spawn = require('child_process').spawn;
var isWin = /^win/.test(process.platform);
var child = spawn(isWin ? 'cmd' : 'sh', [isWin?'/c':'-c', 'npm', 'install']);
child.stdout.pipe(process.stdout); // I'm logging the output to stdout, but you can pipe it into a text file or an in-memory variable
child.stderr.pipe(process.stderr);
child.on('error', function(err) {
logger.error('run-install', err);
process.exit(1); //Or whatever you do on error, such as calling your callback or resolving a promise with an error
});
child.on('exit', function(code) {
if(code != 0) return throw new Error('npm install failed, see npm-debug.log for more details')
process.exit(0); //Or whatever you do on completion, such as calling your callback or resolving a promise with the data
});

node.js fs - stream file "backwards" - from bottom to top

Using Node.js, what is the best way to stream a file from a filesystem into Node.js, but reading it backwards, from bottom to top? I have a large file, and there doesn't seem to be much sense in reading from the top if I only want the last 10 lines. Is this possible?
Right now I have this horrible code, where we do a GET request with a browser to view the server logs, and pass a query string parameter to tell the server how many lines at the end of the log file we want to read:
function get(req, res, next) {
var numOfLinesToRespondWith = req.query.num_lines || 10;
var fileStream = fs.createReadStream(stderr_path, {encoding: 'utf8'});
var jsonData = []; //where jsonData gets populated
var ret = [];
fileStream.on('data', function processLineOfFileData(chunk) {
jsonData.push(String(chunk));
})
.on('end', function handleEndOfFileData(err) {
if (err) {
log.error(colors.bgRed(err));
res.status(500).send({"error reading from smartconnect_stdout_log": err.toString()});
}
else {
for(var i = 0; i < numOfLinesToRespondWith; i++){
ret.push(jsonData.pop());
}
res.status(200).send({"smartconnect_stdout_log": ret});
}
});
}
the code above reads the whole file and then adds the number of lines requested to the response after reading the whole file. This is bad, is there a better way to do this? Any recommendations will be met gladly.
(one problem with the code above is that it's writing out the last lines of the log but the lines are in reverse order...)
One potential way to do this is:
process.exec('tail -r ' + file_path).pipe(process.stdout);
but that syntax is incorrect - so my question there would be - how do I pipe the result of that command into an array in Node.js and eventually into a JSON HTTP response?
I created a module called fs-backwards-stream that could may meet your needs. https://www.npmjs.com/package/fs-backwards-stream
If you need the result parsed by lines rather than byte chunks you should use the module fs-reverse https://www.npmjs.com/package/fs-reverse or
both of these modules stream you could simply read the last n bytes of a file.
here is an example using plain node fs apis and no dependencies.
https://gist.github.com/soldair/f250fb497ce592c3694a
hope that helps.
One easy way if you're on a linux computer would be to execute the tac command in node as process.exec("tac yourfile.dat") and pipe it to your write stream
You could also use slice-file and then reverse the order yourself.
Also, look at what #alexmills said in the comments
this is the best answer I got, for now
the tail command on Mac/UNIX reads files from the end and pipes to stdout (correct me if this is loose language)
var cp = require('child_process');
module.exports = function get(req, res, next) {
var numOfLinesToRespondWith = req.query.num_lines || 100;
cp.exec('tail -n 5 ' + stderr_path, function(err,stdout,stderr){
if(err){
log.error(colors.bgRed(err));
res.status(500).send({"error reading from smartconnect_stderr_log": err.toString()});
}
else{
var data = String(stdout).split('\n');
res.status(200).send({"stderr_log": data});
}
});
}
this seems to work really well - it does, however, run on separate process which is expensive in it's own way, but probably better than reading an entire 10,000 line log file.

How NOT to stop reading file when meeting EOF?

I'm trying to implement a routine for Node.js that would allow one to open a file, that is being appended to by some other process at this very time, and then return chunks of data immediately as they are appended to file. It can be thought as similar to tail -f UNIX command, however acting immediately as chunks are available, instead of polling for changes over time. Alternatively, one can think of it as of working with a file as you do with socket — expecting on('data') to trigger from time to time until a file is closed explicitly.
In C land, if I were to implement this, I would just open the file, feed its file descriptor to select() (or any alternative function with similar designation), and then just read chunks as file descriptor is marked "readable". So, when there is nothing to be read, it won't be readable, and when something is appended to file, it's readable again.
I somewhat expected this kind of behavior for following code sample in Javascript:
function readThatFile(filename) {
const stream = fs.createReadStream(filename, {
flags: 'r',
encoding: 'utf8',
autoClose: false // I thought this would prevent file closing on EOF too
});
stream.on('error', function(err) {
// handle error
});
stream.on('open', function(fd) {
// save fd, so I can close it later
});
stream.on('data', function(chunk) {
// process chunk
// fs.close() if I no longer need this file
});
}
However, this code sample just bails out when EOF is encountered, so I can't wait for new chunk to arrive. Of course, I could reimplement this using fs.open and fs.read, but that somewhat defeats Node.js purpose. Alternatively, I could fs.watch() file for changes, but it won't work over network, and I don't like an idea of reopening file all the time instead of just keeping it open.
I've tried to do this:
const fd = fs.openSync(filename, 'r'); // sync for readability' sake
const stream = net.Socket({ fd: fd, readable: true, writable: false });
But had no luck — net.Socket isn't happy and throws TypeError: Unsupported fd type: FILE.
So, any solutions?
UPD: this isn't possible, my answer explains why.
I haven't looked into the internals of the read streams for files, but it's possible that they don't support waiting for a file to have more data written to it. However, the fs package definitely supports this with its most basic functionality.
To explain how tailing would work, I've written a somewhat hacky tail function which will read an entire file and invoke a callback for every line (separated by \n only) and then wait for the file to have more lines written to it. Note that a more efficient way of doing this would be to have a fixed size line buffer and just shuffle bytes into it (with a special case for extremely long lines), rather than modifying JavaScript strings.
var fs = require('fs');
function tail(path, callback) {
var descriptor, bytes = 0, buffer = new Buffer(256), line = '';
function parse(err, bytesRead, buffer) {
if (err) {
callback(err, null);
return;
}
// Keep track of the bytes we have consumed already.
bytes += bytesRead;
// Combine the buffered line with the new string data.
line += buffer.toString('utf-8', 0, bytesRead);
var i = 0, j;
while ((j = line.indexOf('\n', i)) != -1) {
// Callback with a single line at a time.
callback(null, line.substring(i, j));
// Skip the newline character.
i = j + 1;
}
// Only keep the unparsed string contents for next iteration.
line = line.substr(i);
// Keep reading in the next tick (avoids CPU hogging).
process.nextTick(read);
}
function read() {
var stat = fs.fstatSync(descriptor);
if (stat.size <= bytes) {
// We're currently at the end of the file. Check again in 500 ms.
setTimeout(read, 500);
return;
}
fs.read(descriptor, buffer, 0, buffer.length, bytes, parse);
}
fs.open(path, 'r', function (err, fd) {
if (err) {
callback(err, null);
} else {
descriptor = fd;
read();
}
});
return {close: function close(callback) {
fs.close(descriptor, callback);
}};
}
// This will tail the system log on a Mac.
var t = tail('/var/log/system.log', function (err, line) {
console.log(err, line);
});
// Unceremoniously close the file handle after one minute.
setTimeout(t.close, 60000);
All that said, you should also try to leverage the NPM community. With some searching, I found the tail-stream package which might do what you want, with streams.
Previous answers have mentioned tail-stream's approach which uses fs.watch, fs.read and fs.stat together to create the effect of streaming the contents of the file. You can see that code in action here.
Another, perhaps hackier, approach might be to just use tail by spawning a child process with it. This of course comes with the limitation that tail must exist on the target platform, but one of node's strengths is using it to do asynchronous systems development via spawn and even on windows, you can execute node in an alternate shell like msysgit or cygwin to get access to the tail utility.
The code for this:
var spawn = require('child_process').spawn;
var child = spawn('tail',
['-f', 'my.log']);
child.stdout.on('data',
function (data) {
console.log('tail output: ' + data);
}
);
child.stderr.on('data',
function (data) {
console.log('err data: ' + data);
}
);
So, it seems people are still looking for an answer to this question for five years already, and there is yet no answer on topic.
In short: you can't. Not in Node.js particularly, you can't at all.
Long answer: there are few reasons for this.
First, POSIX standard clarifies select() behavior in this regard as follows:
File descriptors associated with regular files shall always select true for ready to read, ready to write, and error conditions.
So, select() can't help with detecting a write beyond the file end.
With poll() it's similar:
Regular files shall always poll TRUE for reading and writing.
I can't tell for sure with epoll(), since it's not standartized and you have to read quite lengthy implementation, but I would assume it's similar.
Since libuv, which is in core of Node.js implementation, uses read(), pread() and preadv() in its uv__fs_read(), neither of which would block when invoked at the end of file, it would always return empty buffer when EOF is encountered. So, no luck here too.
So, summarizing, if such functionality is desired, something must be wrong with your design, and you should revise it.
What you're trying to do is a FIFO file (acronym for First In First Out), which as you said works like a socket.
There's a node.js module that allows you to work with fifo files.
I don't know what do you want that for, but there are better ways to work with sockets on node.js. Try socket.io instead.
You could also have a look at this previous question:
Reading a file in real-time using Node.js
Update 1
I'm not familiar with any module that would do what you want with a regular file, instead of with a socket type one. But as you said, you could use tail -f to do the trick:
// filename must exist at the time of running the script
var filename = 'somefile.txt';
var spawn = require('child_process').spawn;
var tail = spawn('tail', ['-f', filename]);
tail.stdout.on('data', function (data) {
data = data.toString().replace(/^[\s]+/i,'').replace(/[\s]+$/i,'');
console.log(data);
});
Then from the command line try echo someline > somefile.txt and watch at the console.
You might also would like to have a look at this: https://github.com/layerssss/node-tailer

Categories