Problem: I am dealing with large files (>10GB).
Sometimes I want to process the complete files and sometimes I just want to sample a few lines.
The processing setup is a pipeline:
pipeline(
  inStream,
  split2(),
  sc,
  err => {
    ...
  }
);
sc is a transform that essentially counts some flags in the file.
The code works fine when processing the complete file, but it never produces the output in the pipeline's final callback (the ... above) if I want to exit from the transform before inStream has finished.
_transform(chunk, encoding, done) {
  let strChunk = decoder.write(chunk);
  if (strChunk === '\u0003') {
    this.push('\u0003');
    process.exit(0);
  }
  if (strChunk.startsWith("#")) {
    done();
  } else {
    if (this.sampleMax === 0) {
      this.push('\u0003');
      //process.exit(0);
    } else if (this.sampleMax > 0) {
      this.sampleMax--;
    }
    let dta = strChunk.split("\t");
    let flag = dta[1].trim();
    this.flagCount[flag]++;
    done();
  }
}
If I re-enable process.exit(0), the pipeline code following sc is never reached.
If I only use this.push('\u0003'), the complete inStream is still read to the end.
The question is how to properly terminate the transform and continue with the downstream pipeline without completely reading inStream.
One solution would be to throw an error or create an error implicitly by destroying the stream. Both options are shown in the code below.
_transform(chunk, encoding, done) {
  let strChunk = decoder.write(chunk);
  if (strChunk === '\u0003') {
    this.push('\u0003');
    process.exit(0);
  }
  if (strChunk.startsWith("#")) {
    done();
  } else {
    if (this.sampleMax === 0) {
      //Either of the following lines can solve the problem (use one, not both):
      throw("PLANNED_PREMATURE_TERMINATION");
      //this.destroy(); //=> Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    } else if (this.sampleMax > 0) {
      this.sampleMax--;
    }
    let dta = strChunk.split("\t");
    let flag = dta[1].trim();
    this.flagCount[flag]++;
    done();
  }
}
The next step is to react, in the pipeline, to the errors generated in the above code.
pipeline(
  inStream,
  split2(),
  sc,
  err => {
    if (err && (err === "PLANNED_PREMATURE_TERMINATION" ||
                err.code === "ERR_STREAM_PREMATURE_CLOSE")) {
      //do whatever should happen in this case
    } else {
      //stream was completely processed
      //do whatever should happen in this case
    }
  }
);
Since err contains the value that was thrown during premature termination, we can react to it specifically and present the partially aggregated data in sc. This solves the initial problem and seems to be the most obvious route in this situation, so I am posting it as a solution. It would be great to see other (more elegant) solutions.
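A more idiomatic alternative (a sketch, untested against the original setup; decoder, sampleMax and flagCount are assumed from the snippets above): instead of throwing, pass an Error to the _transform callback. Calling the callback with an error is the documented way for a transform to fail, and pipeline receives it in its final callback without reading the rest of inStream.

_transform(chunk, encoding, done) {
  let strChunk = decoder.write(chunk);
  if (strChunk.startsWith("#")) return done();
  if (this.sampleMax === 0) {
    // pipeline() will destroy the remaining streams and invoke its
    // final callback with this error, so inStream is not read further.
    return done(new Error("PLANNED_PREMATURE_TERMINATION"));
  }
  this.sampleMax--;
  let flag = strChunk.split("\t")[1].trim();
  this.flagCount[flag]++;
  done();
}

With this variant, the check in the pipeline callback becomes err.message === "PLANNED_PREMATURE_TERMINATION".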
Related
I am working with Express, and this is my code in the controller:
Inside the file read stream I have this condition check; whenever it is met, I no longer want to read the file, because it means the data is not in the correct format:
if (!(waveType in expectedWave && Number(startTime) >= 0 && Number(endTime) > 0)) {
return res.status(500).send(" file format not supported");
}
Even though I am returning with this return statement, I am getting the error "Cannot set headers after they are sent to the client".
SOLUTION AND EXPLANATION
I found the solution myself, because all over the internet the only suggestion people make is to use return.
I am writing this so that anyone who comes here with the same problem can understand it.
That error basically means that a status and body have already been sent on the response, e.g. with res.status(500).send(error.message).
If your code does not end there, but traverses further and encounters another res.status().send(), an error occurs, because a second status cannot be assigned to a response that has already been sent.
In most cases, if we return, code execution stops. But with an event handler, another event can be triggered after you return from the function, and if that event handler sends a response, return can't save you, as in this code:
const rl = readLine.createInterface({
  input: fs.createReadStream(req.file.path)
});
rl.on("line", line => {
  const cols = line.split(",");
  const waveType = cols[0];
  const startTime = cols[1];
  const endTime = cols[2];
  const tags = cols.slice(3);
  //just to make sure csv has valid data list
  if (!(waveType in expectedWave && Number(startTime) >= 0 && Number(endTime) > 0)) {
    return res.status(500).send("File not supported");
  }
  // ...
});
and further down the code there is:
rl.on("close", () => {
if (notSupportedFlag) {
notSupportedFlag = false
return res.status(500).send("File data is not supported format");
} else {
let heartRate = Math.floor(frequencyCollector.sumOfFrequency / frequencyCollector.cycleCount);
results.meanFrequency = heartRate;
results.maxFrequency.time += Number(req.body.time);
results.minFrequency.time += Number(req.body.time);
res.status(200).json(results);
return;
}
});
return cannot save you here, because after you return the response, the close event is still emitted; the only thing we can do is make sure exactly one response send happens.
This is like the answer posted by @tryanner, but I will add one more thing:
// set this variable before reading, with false as the default/start value
let notSupportedFlag = false;
Then, inside the rl.on("line", ...) handler:
if (!(waveType in expectedWave && Number(startTime) >= 0 && Number(endTime) > 0)) {
  notSupportedFlag = true;
  rl.close();
  rl.removeAllListeners();
}
Then, inside the close handler, we do this:
rl.on("close", () => {
if (notSupportedFlag) {
notSupportedFlag = false
return res.status(500).send("File data is not supported format");
} else {
//do whatever you do in case of successful read
return;
}
});
The problem is that the stream keeps sending a response from the line event, once for every line it reads.
So, you need to end the stream.
However, it doesn't seem to work like that, because:
"Calling rl.close() does not immediately stop other events (including 'line') from being emitted by the InterfaceConstructor instance."
https://nodejs.org/api/readline.html#rlclose
This means line will be emitted again and cause the same error, so you cannot send the response from the line event; it has to be done somewhere else.
You need to rewrite the code accordingly.
For example, add a variable to track the error manually, close the stream in line, and then check for the error in close and send the error response if there is one, or the success response if not.
// track error
let error;
// set up the reader so it can be closed later
const reader = readLine.createInterface({
  //...
}).on("line", line => {
  //...
  if (!(waveType in expectedWave && Number(startTime) >= 0 && Number(endTime) > 0)) {
    error = 'File data entry is not in supported format';
    // close the reader; 'line' may still fire a few more times
    reader.close();
    return;
  }
  //...
}).on("close", () => {
  if (error) {
    res.status(500).send(error);
    error = '';
    return;
  }
  //...
});
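Pulling the pieces together, a minimal self-contained sketch of the whole route (hedged: expectedWave, the aggregation step, and the handler name are assumptions, not taken from the question):

const fs = require("fs");
const readLine = require("readline");

function handleUpload(req, res) {
  let error;
  const reader = readLine.createInterface({
    input: fs.createReadStream(req.file.path)
  });
  reader.on("line", line => {
    if (error) return; // 'line' can still fire after close()
    const [waveType, startTime, endTime] = line.split(",");
    if (!(waveType in expectedWave && Number(startTime) >= 0 && Number(endTime) > 0)) {
      error = "File data is not in a supported format";
      reader.close();
      return;
    }
    // ...aggregate valid rows here...
  }).on("close", () => {
    // exactly one response, sent only after reading has stopped
    if (error) return res.status(500).send(error);
    res.status(200).json({ /* aggregated results */ });
  });
}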
I think that when you return inside your readLine handlers, it only returns from those handler functions. Try putting a return after the readLine.createInterface call instead.
I am trying to create a terminal app that will run indefinitely and will have the ability to read from the terminal.
I tried to use the "readline" API, but the app terminates without waiting for any input.
I added a "while(true)" loop, but it seems the thread gets stuck in the loop and does not respond to my input.
I need a series of random numbers.
To accomplish this I added an interval of 1000 ms, and the result was the same as with the while loop.
To summarize: I need to create an app that reads from the terminal and creates random numbers on a given interval.
Any guidance will be appreciated.
Edit 1
Some additional information I just thought of.
I tried putting either the readline call or the interval in a separate forked process, but nothing changed.
I also tried using recursion for the readline call.
Edit 2
Although I accepted @amangpt777's answer, I would like to point out another problem you might encounter.
I was calling my script like this: 'clear | node ./script.js' on Windows PowerShell.
I believe it was the pipe that was blocking my input.
I don't know if this can happen on Linux; I haven't tested it.
I just add it here so you keep it in mind.
I am not sure what you are trying to accomplish here, but the following code will take input from the user using readline and keep storing it in an array. Note that there is some commented-out code which can be uncommented if you want a publish/subscribe model. You will also need to add more code to sanitize and validate your input. I hope this gives you some pointers to achieve what you want:
var readline = require('readline');
//var redis = require('redis');
//let subscriber = redis.createClient();
//let publisher = redis.createClient();

let numEntered = [];
var r1 = readline.createInterface({
  "input": process.stdin,
  "output": process.stdout
});

// subscriber.subscribe('myFunc');
// subscriber.on('message', (channel, msg) => {
//   //Your logic
// });

function printMyArr() {
  console.log("Numbers entered till now: ", numEntered);
}

function askNumber() {
  askQuestion('Next Number?\n')
    .then(ans => {
      handleAnswer(ans);
    })
    .catch(err => {
      console.log(err);
    });
}

function handleAnswer(inputNumber) {
  if (inputNumber === 'e') {
    console.log('Exiting!');
    r1.close();
    process.exit();
  } else {
    numEntered.push(parseInt(inputNumber));
    //publisher.publish('myFunc', parseInt(inputNumber));
    // OR
    printMyArr();
    askNumber();
  }
}

function askQuestion(q) {
  return new Promise((resolve, reject) => {
    r1.question(q, (ans) => {
      return resolve(ans);
    });
  });
}

function init() {
  askQuestion('Enter Stream. Press e and enter to end input stream!\n')
    .then(ans => {
      handleAnswer(ans);
    })
    .catch(err => {
      console.log(err);
    });
}

init();
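Regarding the interval part of the question: readline and timers coexist fine on the event loop, so nothing blocks as long as you avoid a busy while(true) loop. A minimal hedged sketch (the names are illustrative, not from the answer above):

const readline = require('readline');

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout
});

// Emit a random number every second; the timer and readline share the event loop.
const timer = setInterval(() => {
  console.log('random:', Math.random());
}, 1000);

// Read commands concurrently; 'e' stops both the timer and the interface.
rl.on('line', line => {
  if (line.trim() === 'e') {
    clearInterval(timer);
    rl.close();
  }
});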
I've got an rxjs observer (really a Subject) that tails a file forever, just like tail -f. It's awesome for monitoring logfiles, for example.
This "forever" behavior is great for my application, but terrible for testing. Currently my application works but my tests hang forever.
I'd like to force an observer change to complete early, because my test code knows how many lines should be in the file. How do I do this?
I tried calling onCompleted on the Subject handle I returned, but at that point it's basically cast as an observer and you can't force it to close; the error is:
Object #<Object> has no method 'onCompleted'
Here's the source code:
function ObserveTail(filename) {
  var source = new Rx.Subject();
  if (fs.existsSync(filename) == false) {
    console.error("file doesn't exist: " + filename);
  }
  var lineSep = /[\r]{0,1}\n/;
  var tail = new Tail(filename, lineSep, {}, true);
  tail.on("line", function(line) {
    source.onNext(line);
  });
  tail.on('close', function(data) {
    console.log("tail closed");
    source.onCompleted();
  });
  tail.on('error', function(error) {
    console.error(error);
  });
  this.source = source;
}
And here's the test code that can't figure out how to force forever to end (tape style test). Note the "ILLEGAL" line:
test('tailing a file works correctly', function(tid) {
var lines = 8;
var i = 0;
var filename = 'tape/tail.json';
var handle = new ObserveTail(filename);
touch(filename);
handle.source
.filter(function (x) {
try {
JSON.parse(x);
return true;
} catch (error) {
tid.pass("correctly caught illegal JSON");
return false;
}
})
.map(function(x) { return JSON.parse(x) })
.map(function(j) { return j.name })
.timeout(10000, "observer timed out")
.subscribe (
function(name) {
tid.equal(name, "AssetMgr", "verified name field is AssetMgr");
i++;
if (i >= lines) {
handle.onCompleted(); // XXX ILLEGAL
}
},
function(err) {
console.error(err)
tid.fail("err leaked through to subscriber");
},
function() {
tid.end();
console.log("Completed");
}
);
})
It sounds like you solved your problem, but to your original question
I'd like to force an observer change to complete early, because my test code knows how many lines should be in the file. How do I do this?
In general the use of Subjects is discouraged when you have better alternatives, since they tend to be a crutch for people to use programming styles they are familiar with. Instead of trying to use a Subject, I would suggest that you think about what each event would mean in an Observable's life cycle.
Wrap Event Emitters
There already exists a wrapper for the EventEmitter#on/off pattern in the form of Observable.fromEvent. It handles cleanup and keeps the subscription alive only while there are listeners. Thus ObserveTail can be refactored into:
function ObserveTail(filename) {
  return Rx.Observable.create(function(observer) {
    var lineSep = /[\r]{0,1}\n/;
    var tail = new Tail(filename, lineSep, {}, true);
    var line = Rx.Observable.fromEvent(tail, "line");
    var close = Rx.Observable.fromEvent(tail, "close");
    var error = Rx.Observable.fromEvent(tail, "error")
      .flatMap(function(err) { return Rx.Observable.throw(err); });
    //Only take events until close occurs and wrap in the error for good measure
    //The latter two are terminal events in this case.
    return line.takeUntil(close).merge(error).subscribe(observer);
  });
}
This has several benefits over the vanilla use of Subjects: first, you will now actually see the error downstream, and second, this handles clean-up of your event listeners when you are done with them.
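For instance, a hedged usage sketch (the filename is illustrative): errors from the tail now reach the subscriber's error handler instead of only being logged inside ObserveTail:

var tailed = ObserveTail("app.log");
var sub = tailed.subscribe(
  function(line) { console.log("line:", line); },
  function(err)  { console.error("tail failed:", err); },
  function()     { console.log("tail closed"); }
);
// Disposing the subscription tears down the underlying event listeners.
// sub.dispose();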
Avoid *Sync Methods
Then this can be rolled into your file-existence check without the use of existsSync:
//If it doesn't exist then we are done here
//You could also throw from the filter if you want an error tracked
//Note: fs.exists does not use an error-first callback, so fromCallback
//fits here rather than fromNodeCallback.
var source = Rx.Observable.fromCallback(fs.exists)(filename)
  .filter(function(exists) { return exists; })
  .flatMap(ObserveTail(filename));
Next you can simplify your filter/map/map sequence down by using flatMap instead.
var result = source.flatMap(function(x) {
    try {
      return Rx.Observable.just(JSON.parse(x));
    } catch (e) {
      return Rx.Observable.empty();
    }
  },
  //This allows you to map the result of the parsed value
  function(x, json) {
    return json.name;
  })
  .timeout(10000, "observer timed out");
Don't signal, unsubscribe
How do you "signal" a stop when streams only travel in one direction? We rarely actually want an Observer to communicate directly with an Observable, so a better pattern is not to "signal" a stop at all, but simply to unsubscribe from the Observable and leave it to the Observable's behavior to determine what it should do from there.
Essentially your Observer really shouldn't care about your Observable beyond saying "I'm done here".
To do that you need to declare the condition under which you want to stop.
In this case, since you are simply stopping after a set number of lines in your test case, you can use take to unsubscribe. Thus the final subscribe block would look like:
result
  //After lines is reached this will complete.
  .take(lines)
  .subscribe(
    function(name) {
      tid.equal(name, "AssetMgr", "verified name field is AssetMgr");
    },
    function(err) {
      console.error(err);
      tid.fail("err leaked through to subscriber");
    },
    function() {
      tid.end();
      console.log("Completed");
    }
  );
Edit 1
As pointed out in the comments, in the case of this particular API there isn't a real "close" event, since Tail is essentially an infinite operation. In this sense it is no different from a mouse event handler: we will stop sending events when people stop listening. So your block would probably end up looking like:
function ObserveTail(filename) {
  return Rx.Observable.create(function(observer) {
    var lineSep = /[\r]{0,1}\n/;
    var tail = new Tail(filename, lineSep, {}, true);
    var line = Rx.Observable.fromEvent(tail, "line");
    var error = Rx.Observable.fromEvent(tail, "error")
      .flatMap(function(err) { return Rx.Observable.throw(err); });
    //There is no close event here; error is the only terminal event,
    //and finally() releases the watcher once the last subscriber leaves.
    return line
      .finally(function() { tail.unwatch(); })
      .merge(error).subscribe(observer);
  }).share();
}
The addition of the finally and the share operators creates an object which will attach to the tail when a new subscriber arrives and will remain attached as long as there is at least one subscriber still listening. Once all the subscribers are done however we can safely unwatch the tail.
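A quick hedged illustration of those share() semantics (the subscriber names are illustrative):

var log = ObserveTail("app.log");

// Both subscriptions share one underlying Tail watcher thanks to share().
var a = log.subscribe(function(l) { console.log("a:", l); });
var b = log.subscribe(function(l) { console.log("b:", l); });

a.dispose();
// The file is still being watched here, because b is listening.
b.dispose();
// Now the refCount drops to zero, finally() runs, and tail.unwatch()
// releases the file.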
I have only recently started developing for node.js, so forgive me if this is a stupid question - I come from Javaland, where objects still live happy, sequential, synchronous lives. ;)
I have a key generator object that issues keys for database inserts using a variant of the high-low algorithm. Here's my code:
function KeyGenerator() {
  var nextKey;
  var upperBound;

  this.generateKey = function(table, done) {
    if (nextKey > upperBound) {
      require("../sync/key-series-request").requestKeys(function(err, nextKey, upperBound) {
        if (err) { return done(err); }
        this.nextKey = nextKey;
        this.upperBound = upperBound;
        done(nextKey++);
      });
    } else {
      done(nextKey++);
    }
  }
}
Obviously, when I ask it for a key, I must ensure that it never, ever issues the same key twice. In Java, if I wanted to enable concurrent access, I would make this method synchronized.
In node.js, is there any similar concept, or is it unnecessary? I intend to ask the generator for a bunch of keys for a bulk insert using async.parallel. My expectation is that since node is single-threaded, I need not worry about the same key ever being issued more than once; can someone please confirm this is correct?
Obtaining a new series involves an asynchronous database operation, so if I do 20 simultaneous key requests, but the series has only two keys left, won't I end up with 18 requests for a new series? What can I do to avoid that?
UPDATE
This is the code for requestKeys:
exports.requestKeys = function(done) {
  var db = require("../storage/db");
  db.query("select next_key, upper_bound from key_generation where type='issue'", function(err, results) {
    if (err) { done(err); } else {
      if (results.length === 0) {
        // Somehow we lost the "issue" row - this should never have happened
        done(new Error("Could not find 'issue' row in key generation table"));
      } else {
        var nextKey = results[0].next_key;
        var upperBound = results[0].upper_bound;
        db.query("update key_generation set next_key=?, upper_bound=? where type='issue'",
          [nextKey + KEY_SERIES_WIDTH, upperBound + KEY_SERIES_WIDTH],
          function(err, results) {
            if (err) { done(err); } else {
              done(null, nextKey, upperBound);
            }
          });
      }
    }
  });
}
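As an aside: if more than one process ever shares this table, the separate select and update leave a window where two callers can read the same next_key. A hedged sketch of one way to close it, assuming MySQL and that db.query stays on a single connection (the LAST_INSERT_ID(expr) trick makes the claim atomic; this is not from the question's code):

exports.requestKeys = function(done) {
  var db = require("../storage/db");
  // Claim the next block and remember its end in one atomic statement.
  db.query(
    "update key_generation set next_key = last_insert_id(next_key + ?) where type='issue'",
    [KEY_SERIES_WIDTH],
    function(err) {
      if (err) { return done(err); }
      // last_insert_id() now returns the end of the claimed block,
      // scoped to this connection only.
      db.query("select last_insert_id() as blockEnd", function(err, results) {
        if (err) { return done(err); }
        var blockEnd = results[0].blockEnd;
        done(null, blockEnd - KEY_SERIES_WIDTH, blockEnd - 1);
      });
    }
  );
}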
UPDATE 2
I should probably mention that consuming a key requires db access even if a new series doesn't have to be requested, because the consumed key will have to be marked as used in the database. The code doesn't reflect this because I ran into trouble before I got around to implementing that part.
UPDATE 3
I think I got it using event emitting:
function KeyGenerator() {
  var nextKey;
  var upperBound;
  var emitter = new events.EventEmitter();
  var requesting = true;

  // Initialize the generator with the stored values
  db.query("select * from key_generation where type='use'", function(err, results) {
    if (err) { throw err; }
    if (results.length === 0) {
      throw new Error("Could not get key generation parameters: Row is missing");
    }
    nextKey = results[0].next_key;
    upperBound = results[0].upper_bound;
    console.log("Setting requesting = false, emitting event");
    requesting = false;
    emitter.emit("KeysAvailable");
  });

  this.generateKey = function(table, done) {
    console.log("generateKey, state is:\n  nextKey: " + nextKey + "\n  upperBound: " + upperBound + "\n  requesting: " + requesting);
    if (nextKey > upperBound) {
      if (!requesting) {
        requesting = true;
        console.log("Requesting new series");
        require("../sync/key-series-request").requestSeries(function(err, newNextKey, newUpperBound) {
          if (err) { return done(err); }
          console.log("New series available:\n  nextKey: " + newNextKey + "\n  upperBound: " + newUpperBound);
          nextKey = newNextKey;
          upperBound = newUpperBound;
          requesting = false;
          emitter.emit("KeysAvailable");
          done(null, nextKey++);
        });
      } else {
        console.log("Key request is already underway, deferring");
        var that = this;
        emitter.once("KeysAvailable", function() {
          console.log("Executing deferred call");
          that.generateKey(table, done);
        });
      }
    } else {
      done(null, nextKey++);
    }
  }
}
I've peppered it with logging outputs, and it does do what I want it to.
As another answer mentions, you will potentially end up with results different from what you want. Taking things in order:
function KeyGenerator() {
  // at first I was thinking you wanted these as 'class' properties
  // and thus would want to precede them with this. rather than declare them as vars,
  // but I think you want them as 'private' member variables of the
  // class instance. That's dandy, you'll just want to do things differently
  // down below
  var nextKey;
  var upperBound;

  this.generateKey = function(table, done) {
    if (nextKey > upperBound) {
      // truncated the require path below for readability.
      // more importantly, renamed the parameters to the function
      require("key-series-request").requestKeys(function(err, nKey, uBound) {
        if (err) { return done(err); }
        // note that thanks to the miracle of closures, you have access to
        // the nextKey and upperBound variables from the enclosing scope,
        // but I needed to rename the parameters or else they would shadow/
        // obscure the variables with the same name.
        nextKey = nKey;
        upperBound = uBound;
        done(nextKey++);
      });
    } else {
      done(nextKey++);
    }
  }
}
Regarding the requestKeys function, you will need to introduce some kind of synchronization. In one way this isn't terrible: with only one thread of execution you don't need to sweat setting your semaphore in a single atomic operation. But it is challenging to deal with multiple callers, because you want the other callers to effectively (but not really) block while the first call to requestKeys() is off at the DB.
I need to think about this part a bit more. I had a basic solution in mind which involved setting a simple semaphore and queuing the callbacks, but when I was typing it up I realized I was actually introducing a more subtle potential synchronization bug when processing the queued callbacks.
UPDATE:
I was just finishing up one approach as you were writing about your EventEmitter approach, which seems reasonable. See this gist, which illustrates the approach I took. Just run it and you'll see the behavior: it has some console logging to show which calls get deferred for a new key block and which can be handled immediately. The primary moving part of the solution is below (note that the keyManager provides the stubbed-out implementation of your require('key-series-request')):
function KeyGenerator(km) {
  this.nextKey = undefined;
  this.upperBound = undefined;
  this.imWorkingOnIt = false;
  this.queuedCallbacks = [];
  this.keyManager = km;

  this.generateKey = function(table, done) {
    if (this.imWorkingOnIt) {
      this.queuedCallbacks.push(done);
      console.log('KG deferred call. Pending CBs: ' + this.queuedCallbacks.length);
      return;
    }
    var self = this;
    if ((typeof(this.nextKey) === 'undefined') || (this.nextKey > this.upperBound)) {
      // set a semaphore & add the callback to the queued callback list
      this.imWorkingOnIt = true;
      this.queuedCallbacks.push(done);
      this.keyManager.requestKeys(function(err, nKey, uBound) {
        if (err) {
          // fail everyone who was waiting and clear the semaphore,
          // otherwise imWorkingOnIt would stay true forever
          var failed = self.queuedCallbacks;
          self.queuedCallbacks = [];
          self.imWorkingOnIt = false;
          return failed.forEach(function(f) { f(err); });
        }
        self.nextKey = nKey;
        self.upperBound = uBound;
        var theCallbackList = self.queuedCallbacks;
        self.queuedCallbacks = [];
        self.imWorkingOnIt = false;
        theCallbackList.forEach(function(f) {
          // rather than making the final callback directly,
          // call KeyGenerator.generateKey() with the original
          // callback
          setImmediate(function() { self.generateKey(table, f); });
        });
      });
    } else {
      console.log('KG immediate call', self.nextKey);
      var z = self.nextKey++;
      setImmediate(function() { done(z); });
    }
  }
};
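For completeness, a hedged usage sketch (stubKeyManager is illustrative, standing in for the DB-backed key-series-request module): firing many generateKey calls at once should produce a single key request and no duplicate keys:

// Hypothetical stub standing in for the real key-series-request module.
var stubKeyManager = {
  requestKeys: function(cb) {
    // async, like a DB call; hands out the block [100, 149]
    setTimeout(function() { cb(null, 100, 149); }, 50);
  }
};

var kg = new KeyGenerator(stubKeyManager);

// 20 simultaneous requests: only the first triggers requestKeys,
// the other 19 are queued and then served from the same block.
for (var i = 0; i < 20; i++) {
  kg.generateKey('my_table', function(key) {
    console.log('issued key', key);
  });
}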
If your Node.js code to calculate the next key didn't need to execute an async operation then you wouldn't run into synchronization issues because there is only one JavaScript thread executing code. Access to the nextKey/upperBound variables will be done in sequence by only one thread (i.e. request 1 will access first, then request 2, then request 3 et cetera.) In the Java-world you will always need synchronization because multiple threads will be executing even if you didn't make a DB call.
However, in your Node.js code, since you are making an async call to get the nextKey, you could get strange results. There is still only one JavaScript thread executing your code, but it would be possible for request 1 to make the call to the DB, then Node.js might accept request 2 (while request 1 is getting data from the DB), and this second request will also make a request to the DB to get keys. Let's say that request 2 gets data from the DB quicker than request 1 and updates the nextKey/upperBound variables with values 100/150. Once request 1 gets its data (say values 50/100), it will overwrite nextKey/upperBound. This scenario wouldn't result in duplicate keys, but you might see gaps in your keys (for example, not all keys 100 to 150 will be used, because request 1 eventually reset the values to 50/100).
This makes me think that you will need a way to sync access, but I am not exactly sure what will be the best way to achieve this.
Short version
Trying to write a debug command that returns the call stack, minus the current position. I thought I'd use:
try {
  throw new Error(options["msg"])
} catch (e) {
  e.stack.shift;
  throw (e);
}
but I don't know how to do it exactly. Apparently I can't just e.stack.shift like that (stack is a string, not an array). Also, that always makes it an Uncaught Error, but these should just be debug messages.
Long version
I decided I needed a debug library for my content scripts. Here it is:
debug.js
var debugKeys = {
  "level": ["off", "event", "function", "timeouts"],
  "detail": ["minimal", "detailed"]
};
var debugState = { "level": "off", "detail": "minimal" };

function debug(options) {
  if ("level" in options) {
    if (verifyDebugValue("level", options["level"]) == false)
      return;
  }
  if ("detail" in options) {
    if (verifyDebugValue("detail", options["detail"]) == false)
      return;
  }
  console.log(options["msg"]);
}
function verifyDebugValue(lval, rval) {
  for (var k in debugKeys[lval]) {
    if (debugKeys[lval][k] == rval) {
      return true;
    }
    if (debugKeys[lval][k] == debugState[lval]) { // rval was greater than the debug key
      return false;
    }
  }
}
When using it, you can change the debugState in the code to suit your needs. It is still a work in progress, but it works just fine.
To use it from another content script, just load it in the manifest like:
manifest.json
"content_scripts": [
{
"js": ["debug.js", "foobar.js"],
}
],
and then call it like:
debug({"level": "timeouts", "msg": "foobar.js waitOnElement() timeout"});
which generates:
foobar.js waitOnElement() timeout debug.js:17
And there is my problem. At the moment it uses console.log, so all the debug statements are attributed to the same line in debug.js. I'd rather report the calling context. I imagine I need something like:
try {
  throw new Error(options["msg"])
} catch (e) {
  e.stack.shift;
  throw (e);
}
but I don't know how to do it exactly. Apparently I can't just e.stack.shift like that (stack is a string, not an array). Also, that always makes it an Uncaught Error, but these should just be debug messages.
You can't avoid the reported line pointing into your debug.js, because whether you use throw(...) or console.log/error(...), it is debug.js that issues the command.
What you can do is have some try-catch blocks in your code, then in the catch block pass the error object to your debug function, which handles it according to its debugState.
In any case, it is not quite clear how you are using your debug library (or why you need to remove the last call from the stack-trace), but you could try something like this:
1. Split the stack-trace (which is actually a multiline string) into lines.
2. Isolate the first line (corresponding to the last call) that is not part of the error's message.
3. Put the rest back together into a new stack-trace, with that line removed.
E.g.:
function removeLastFromStack(stack, errMsg) {
  var firstLines = 'Error: ' + errMsg + '\n';
  var restOfStack = stack
    .substring(firstLines.length) // <-- skip the error's message
    .split('\n')                  // <-- split into lines
    .slice(1)                     // <-- "slice out" the first line
    .join('\n');                  // <-- put the rest back together
  return firstLines + restOfStack;
}

function myDebug(err) {
  /* Based on my `debugState` I should decide what to do with this error.
   * E.g. I could ignore it, or print the message only,
   * or print the full stack-trace, or alert the user, or whatever */
  var oldStack = err.stack;
  var newStack = removeLastFromStack(oldStack, err.message);
  console.log(newStack);
  //or: console.error(newStack);
}
/* Somewhere in your code */
function someFuncThatMayThrowAnErr(errMsg) {
  throw new Error(errMsg);
}

try {
  someFuncThatMayThrowAnErr('test');
} catch (err) {
  myDebug(err);
}
...but I still don't see how removing the last call from the trace would be helpful
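As an aside (an assumption on my part, not something asked in the question): if the underlying goal is just that DevTools attributes the log line to the caller rather than to debug.js, binding console.log sidesteps the stack manipulation entirely:

// Sketch: a bound console.log keeps the caller's file:line in DevTools,
// because the call site is in the caller's code, not in a wrapper function.
// Note the filtering decision is made once, at bind time, not per call.
var debugLog = debugState.level === "off"
  ? function() {}
  : console.log.bind(console, "[debug]");

// In foobar.js:
debugLog("waitOnElement() timeout"); // reported as foobar.js:<line>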