I have two different Node.js programs.
One is an Express.js Server (PROGRAM1), which is to provide the user interface and RESTful APIs.
The other is a crawler (PROGRAM2), which keeps on reading item, download it from web and store everything into the database. By the way, I am using the Array.prototype.reduce() and Promise to iterate the files and handle I/Os one by one orderly.
One thing I would like to do here, is to monitor and control the progress of the crawler(PROGRAM2) from the PROGRAM1.
But I found it very complicate.
// Control the loop by this `flag`, the value can be assigned from outside
var flag = "IDLE";
// The outside can read this `index`, and monitor the progress
var current_index = -1;
var PAGE_SIZE = 100;
function handleBatch(index){
var defer = q.defer();
// Mongoose statement to find documents...
Book.find()
.skip(index*PAGE_SIZE).limit(PAGE_SIZE).then(function(books){
var finished = 0;
for(var i=0; i<books.length; i++){
var book = books[i];
downloadInfo(book).then(function(bookInfo){
if(flag === "STOP")
defer.reject(new Error("The loop should stop!"));
//store the info...
finished ++;
if(finished === PAGE_SIZE)
defer.resolve();
});
}
});
return defer.promise;
}
var promiseHandler;
function main(){
while(true){
if(flag === "IDLE")
continue;
else if(flag === "START"){
var [0,1,2,3,4,5...,2500].reduce(function(lastPromise, nextIndex){
promiseHandler = lastPromise.then(function(){
currentIndex = nextIndex;
});
}, q());
}else if(flag === "STOP"){
promiseHandler.then(null, function(err){
flag = "IDLE";
});
}
}
}
main() is just an example(e.g. Actually it is a server, and the state can be changed by the requests from PROGRAM1). By setting the flag as STOP, the loop in handleBatch() will discover the change and throw an Exception, then the program will be paused.
However, I just don't like this way, because it looks too ugly and control the process by throwing Errors. So I am searching for a better way to control and monitor the loop. Any one any idea?
You should look in the documentation for node.js' process.
And to answer your question about stopping the execution, here ->
process.exit(0);
Just as a tip: don't control your program using a loop. It's bad.
By the sound of it you are after a way to implement inter-process communication in node js. This is a broad programming topic that goes way beyond node js. There's many patterns and means to achieve this, but one of my favourites is to use a message queue to loosely couple the two processes.
On node js we have things like Redis and node-redis which can be used to implement a publish-subscribe pattern. Of course there's many messaging libraries that would work too.
In your case the Express API might publish a "pause" event, and the crawler could subscribe to that event and take some action. Your node apps then remain asynchronous (no while(true) nonsense!).
Related
I have a set of API endpoints in Express. One of them receives a request and starts a long running process that blocks other incoming Express requests.
My goal to make this process non-blocking. To understand better inner logic of Node Event Loop and how I can do it properly, I want to replace this long running function with my dummy long running blocking function that would start when I send a request to its endpoint.
I suppose, that different ways of making the dummy function blocking could cause Node manage these blockings differently.
So, my question is - how can I make a basic blocking process as a function that would run infinitely?
You can use node-webworker-threads.
var Worker, i$, x$, spin;
Worker = require('webworker-threads').Worker;
for (i$ = 0; i$ < 5; ++i$) {
x$ = new Worker(fn$);
x$.onmessage = fn1$;
x$.postMessage(Math.ceil(Math.random() * 30));
}
(spin = function(){
return setImmediate(spin);
})();
function fn$(){
var fibo;
fibo = function(n){
if (n > 1) {
return fibo(n - 1) + fibo(n - 2);
} else {
return 1;
}
};
return this.onmessage = function(arg$){
var data;
data = arg$.data;
return postMessage(fibo(data));
};
}
function fn1$(arg$){
var data;
data = arg$.data;
console.log("[" + this.thread.id + "] " + data);
return this.postMessage(Math.ceil(Math.random() * 30));
}
https://github.com/audreyt/node-webworker-threads
So, my question is - how can I make a basic blocking process as a function that would run infinitely?
function block() {
// not sure why you need that though
while(true);
}
I suppose, that different ways of making the dummy function blocking could cause Node manage these blockings differently.
Not really. I can't think of a "special way" to block the engine differently.
My goal to make this process non-blocking.
If it is really that long running you should really offload it to another thread.
There are short cut ways to do a quick fix if its like a one time thing, you can do it using a npm module that would do the job.
But the right way to do it is setting up a common design pattern called 'Work Queues'. You will need to set up a queuing mechanism, like rabbitMq, zeroMq, etc. How it works is, whenever you get a computation heavy task, instead of doing it in the same thread, you send it to the queue with relevant id values. Then a separate node process commonly called a 'worker' process will be listening for new actions on the queue and will process them as they arrive. This is a worker queue pattern and you can read up on it here:
https://www.rabbitmq.com/tutorials/tutorial-one-javascript.html
I would strongly advise you to learn this pattern as you would come across many tasks that would require this kind of mechanism. Also with this in place you can scale both your node servers and your workers independently.
I am not sure what exactly your 'long processing' is, but in general you can approach this kind of problem in two different ways.
Option 1:
Use the webworker-threads module as #serkan pointed out. The usual 'thread' limitations apply in this scenario. You will need to communicate with the Worker in messages.
This method should be preferable only when the logic is too complicated to be broken down into smaller independent problems (explained in option 2). Depending on complexity you should also consider if native code would better serve the purpose.
Option 2:
Break down the problem into smaller problems. Solve a part of the problem, schedule the next part to be executed later, and yield to let NodeJS process other events.
For example, consider the following example for calculating the factorial of a number.
Sync way:
function factorial(inputNum) {
let result = 1;
while(inputNum) {
result = result * inputNum;
inputNum--;
}
return result;
}
Async way:
function factorial(inputNum) {
return new Promise(resolve => {
let result = 1;
const calcFactOneLevel = () => {
result = result * inputNum;
inputNum--;
if(inputNum) {
return process.nextTick(calcFactOneLevel);
}
resolve(result);
}
calcFactOneLevel();
}
}
The code in second example will not block the node process. You can send the response when returned promise resolves.
in my node server I have a variable,
var clicks = 0;
each time a user clicks in the webapp, a websocket event sends a message. on the server,
clicks++;
if (clicks % 10 == 0) {
saveClicks();
}
function saveClicks() {
var placementData = JSON.stringify({'clicks' : clicks});
fs.writeFile( __dirname + '/clicks.json', placementData, function(err) {
});
}
At what rate do I have to start worrying about overwrites? How would I calculate this math?
(I'm looking at creating a MongoDB json object for each click but I'm curious what a native solution can offer).
From the node.js doc for fs.writeFile():
Note that it is unsafe to use fs.writeFile() multiple times on the
same file without waiting for the callback. For this scenario,
fs.createWriteStream() is strongly recommended.
This isn't a math problem to figure out when this might cause a problem - it's just bad code that gives you the chance of a conflict in circumstances that cannot be predicted. The node.js doc clearly states that this can cause a conflict.
To make sure you don't have a conflict, write the code in a different way so a conflict cannot happen.
If you want to make sure that all writes happen in the proper order of incoming requests so the last request to arrive is always the one who ends up in the file, then you make need to queue your data as it arrives (so order is preserved) and write to the file in a way that opens the file for exclusive access so no other request can write while that prior request is still writing and handle contention errors appropriately.
This is an issue that databases mostly do for you automatically so it may be one reason to use a database.
Assuming you weren't using clustering and thus do not have multiple processes trying to write to this file and that you just want to make sure the last value sent is the one written to the file by this process, you could do something like this:
var saveClicks = (function() {
var isWriting = false;
var lastData;
return function() {
// always save most recent data here
lastData = JSON.stringify({'clicks' : clicks});
if (!isWriting) {
writeData(lastData);
}
function writeData(data) {
isWriting = true;
lastData = null;
fs.writeFile(__dirname + '/clicks.json', data, function(err) {
isWriting = false;
if (err) {
// decide what to do if an error occurs
}
// if more data arrived while we were writing this, then write it now
if (lastData) {
writeData(lastData);
}
});
}
}
})();
#jfriend00 is definitely right about createWriteStream and already made a point about the database, and everything's pretty much said, but I would like to emphasize on the point about databases because basically the file-saving approach seems weird to me.
So, use databases.
Not only would this save you from the headache of tracking such things, but would significantly speed up things (remember that the way stuff is done in node, the numerous file reading-writing processes would be parallelized in a single thread, so basically if one of them lasts for ages, it might slightly affect the overall performance).
Redis is a perfect solution to store key-value data, so you can store data like clicks per user in a Redis database which you'll have to get running alongside anyway when your get enough traffic :)
If you're not convinced yet, take a look at this simple benchmark:
Redis:
var async = require('async');
var redis = require("redis"),
client = redis.createClient();
console.time("To Redis");
async.mapLimit(new Array(100000).fill(0), 1, (el, cb) => client.set("./test", 777, cb), () => {
console.timeEnd("To Redis");
});
To Redis: 5410.383ms
fs:
var async = require('async');
var fs = require('fs');
console.time("To file");
async.mapLimit(new Array(100000).fill(0), 1, (el, cb) => fs.writeFile("./test", 777, cb), () => {
console.timeEnd("To file");
});
To file: 20344.749ms
And, by the way, you can significantly increase the number of clicks after which the progress would be stored (now it's 10) by simply adding this "click-saver" to the socket socket.on('disconnect', ....
When I run the following code 9999999+ times, Node returns with:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory
Aborted (core dumped)
Whats the best solution to get around this issue, other than increasing the max alloc size or any command line arguments?
I'd like to improve the code quality rather than hack a solution.
The following is the main bulk of recursion within the application.
The application is a load testing tool.
a.prototype.createClients = function(){
for(var i = 0; i < 999999999; i++){
this.recursiveRequest();
}
}
a.prototype.recursiveRequest = function(){
var self = this;
self.hrtime = process.hrtime();
if(!this.halt){
self.reqMade++;
this.http.get(this.options, function(resp){
resp.on('data', function(){})
.on("connection", function(){
})
.on("end", function(){
self.onSuccess();
});
})
.on("error", function(e){
self.onError();
});
}
}
a.prototype.onSuccess = function(){
var elapsed = process.hrtime(this.hrtime),
ms = elapsed[0] * 1000000 + elapsed[1] / 1000
this.times.push(ms);
this.successful++;
this.recursiveRequest();
}
Looks like you should really be using a queue instead of recursive calls. async.queue offers a fantastic mechanism for processing asynchronous queues. You should also consider using the request module to make your http client connections simpler.
var async = require('async');
var request = require('request');
var load_test_url = 'http://www.testdomain.com/';
var parallel_requests = 1000;
function requestOne(task, callback) {
request.get(task.url, function(err, connection, body) {
if(err) return callback(err);
q.push({url:load_test_url});
callback();
});
}
var q = async.queue(requestOne, parallel_requests);
for(var i = 0; i < parallel_requests; i++){
q.push({url:load_test_url});
}
You can set the parallel_requests variable according to how many simultaneous requests you want to hit the test server with.
You are launching 1 billion "clients" in parallel, and having each of them perform an http get request recursively in an endless recursion.
Few remarks:
while your question mentions 10 million clients, your code creates 1 billion clients.
You should replace the for loop by a recursive function, to get rid of the out-of-memory error.
Something in these lines:
a.prototype.createClients = function(i){
if (i < 999999999) {
this.recursiveRequest();
this.createClients(i+1);
}
}
Then, you probably want to include some delay between the clients creations, or between the calls to recursiveRequest. Use setTimeout.
You should have a way to get the recursions stopping (onSuccess and recursiveRequest keep calling each other)
A flow control library like async node.js module may help.
10 million is very large... Assuming that the stack supports any number of calls, it should work, but you are likely asking the JavaScript interpreter to load 10 million x quite a bit of memory... and the result is Out of Memory.
Also I personally do no see why you'd want to have so many requests at the same time (testing a heavy load on a server?) one way to optimize is to NOT create "floating functions" which you are doing a lot. "Floating functions" use their own set of memory on each instantiation.
this.http.get(this.options, function(resp){ ... });
^^^^
++++--- allocates memory x 10 million
Here the function(resp)... declaration allocates more memory on each call. What you want to do is:
# either global scope:
function r(resp) {...}
this.http.get(this.options, r ...);
# or as a static member:
a.r = function(resp) {...};
this.http.get(this.options, a.r ...);
At least you'll save on all that function memory. That goes for all the functions you declare within the r function, of course. Especially if their are quite large.
If you want to use the this pointer (make r a prototype function) then you can do that:
a.prototype.r = function(resp) {...};
// note that we have to have a small function to use 'that'... probably not a good idea
var that = this;
this.http.get(this.options, function(){that.r();});
To avoid the that reference, you may use an instance saved in a global. That defeats the use of an object as such though:
a.instance = new a;
// r() is static, but can access the object as follow:
a.r = function(resp) { a.instance.<func>(); }
Using the instance you can access the object's functions from the static r function. That could be the actual implementation which could make full use of the this reference:
a.r = function(resp) { a.instance.r_impl(); }
According to a comment by Daniel, your problem is that you misuse a for() to count the total number of requests you want to send. This means you can apply a very simple fix to your code as follow:
a.prototype.createClients = function(){
this.recursiveRequest();
};
a.prototype.recursiveRequest = function(){
var self = this;
self.hrtime = process.hrtime();
if(!this.halt && this.successful < 10000000){
...
Your recursivity is enough to run the test any number of times.
What you do is never quit, though. You have a halt variable, but it does not look like you ever set that to true. However, to test 10 million times, you want to check the number of requests you already sent.
My "fix" supposes that onError() fails (is no recursive). You could also change the code to make use of the halt flag as in:
a.prototype.onSuccess = function(){
var elapsed = process.hrtime(this.hrtime),
ms = elapsed[0] * 1000000 + elapsed[1] / 1000
this.times.push(ms);
this.successful++;
if(this.successful >= 10000000)
{
this.halt = true;
}
this.recursiveRequest();
}
Note here that you will be pushing ms in the times buffer 10 million times. That's a big table! You may want to have a total instead and compute an average at the end:
this.time += ms;
// at the end:
this.average_time = this.time / this.successful;
I just came to this awful situation where I have an array of strings each representing a possibly existing file (e.g. var files = ['file1', 'file2', 'file3']. I need to loop through these file names and try to see if it exists in the current directory, and if it does, stop looping and forget the rest of the remaining files. So basically I want to find the first existing file of those, and fallback to a hard-coded message if nothing was found.
This is what I currently have:
var found = false;
files.forEach(function(file) {
if (found) return false;
fs.readFileSync(path + file, function(err, data) {
if (err) return;
found = true;
continueWithStuff();
});
});
if (found === false) {
// Handle this scenario.
}
This is bad. It's blocking (readFileSync) thus it's slow.
I can't just supply callback methods for fs.readFile, it's not that simple because I need to take the first found item... and the callbacks may be called at any random order. I think one way would be to have a callback that increases a counter and keeps a list of found/not found information and when it reaches the files.length count, then it checks through the found/not found info and decides what to do next.
This is painful. I do see the performance greatness in evented IO, but this is unacceptable. What choices do I have?
Don't use sync stuff in a normal server environment -- things are single threaded and this will completely lock things up while it waits for the results of this io bound loop. CLI utility = probably fine, server = only okay on startup.
A common library for asynchronous flow control is
https://github.com/caolan/async
async.filter(['file1','file2','file3'], path.exists, function(results){
// results now equals an array of the existing files
});
And if you want to say, avoid the extra calls to path.exists, then you could pretty easily write a function 'first' that did the operations until some test succeeded. Similar to https://github.com/caolan/async#until - but you're interested in the output.
The async library is absolutely what you are looking for. It provides pretty much all the types of iteration that you'd want in a nice asynchronous way. You don't have to write your own 'first' function though. Async already provides a 'some' function that does exactly that.
https://github.com/caolan/async#some
async.some(files, path.exists, function(result) {
if (result) {
continueWithStuff();
}
else {
// Handle this scenario
}
});
If you or someone reading this in the future doesn't want to use Async, you can also do your own basic version of 'some.'
function some(arr, func, cb) {
var count = arr.length-1;
(function loop() {
if (count == -1) {
return cb(false);
}
func(arr[count--], function(result) {
if (result) cb(true);
else loop();
});
})();
}
some(files, path.exists, function(found) {
if (found) {
continueWithStuff();
}
else {
// Handle this scenario
}
});
You can do this without third-party libraries by using a recursive function. Pass it the array of filenames and a pointer, initially set to zero. The function should check for the existence of the indicated (by the pointer) file name in the array, and in its callback it should either do the other stuff (if the file exists) or increment the pointer and call itself (if the file doesn't exist).
Use async.waterfall for controlling the async call in node.js for example:
by including async-library and use waterfall call in async:
var async = require('async');
async.waterfall(
[function(callback)
{
callback(null, taskFirst(rootRequest,rootRequestFrom,rootRequestTo, callback, res));
},
function(arg1, callback)
{
if(arg1!==undefined )
{
callback(null, taskSecond(arg1,rootRequest,rootRequestFrom,rootRequestTo,callback, res));
}
}
])
(Edit: removed sync suggestion because it's not a good idea, and we wouldn't want anyone to copy/paste it and use it in production code, would we?)
If you insist on using async stuff, I think a simpler way to implement this than what you described is to do the following:
var path = require('path'), fileCounter = 0;
function existCB(fileExists) {
if (fileExists) {
global.fileExists = fileCounter;
continueWithStuff();
return;
}
fileCounter++;
if (fileCounter >= files.length) {
// none of the files exist, handle stuff
return;
}
path.exists(files[fileCounter], existCB);
}
path.exists(files[0], existCB);
If I have a var on my main page, and have a worker thread trying to set this var, is there a way the page can access it? Assuming everything is synchronized?
var routeWorker = new Worker('getroute.js');
var checkPatrolRouteFoundTimer;
var rw_resultRoute;
var routeFound = false;
routeWorker.onmessage = function(e) {
rw_resultRoute = e.data.route;
routeFound = true;
}
function checkPatrolReady() {
if(!routeFound)
checkPatrolRouteFoundTimer = setTimeout("checkPatrolReady()", 1000);
}
function ForcePatrol(index) {
routeWorker.postMessage(index);
checkPatrolReady();
...
//do work on route
...
}
in this case, the var I'm talking about is rw_resultRoute, and I can see it get set correctly when debugging. But the only thing is that it's set in the worker thread, not in the page thread.
I flow through the ForcePatrol() method the way i'm expecting to, and it looks like the rw_resultRoute is being set, since routeFound evaluates to true after the worker finishes.
Technically, it doesn't make sense, since routeFound can be set by the worker and read by the page thread, but rw_resultRoute can only be accessed by the worker.
I truly hope this is possible, otherwise I don't see a purpose for worker threads other than showing alert() messages and updating page HTML.
I truly hope this is possible, otherwise I don't see a purpose for worker threads other than showing alert() messages and updating page HTML.
It is meant to handle processing that would normally lock up the browser. Great for crunching numbers for canvas and running hashing.
in this case, the var I'm talking about is rw_resultRoute, and I can see it get set correctly when debugging. But the only thing is that it's set in the worker thread, not in the page thread.
The worker is separate from the page that spawns it. Only way to pass data is through messaging. You need to send the data with postMessage and have the onMessage handle the result. If you are handling different things, set up a switch statement to handle the different message types.
I solved the problem. There was some synchronization I wasn't doing correctly. I was using the setTimeout in the wrong way.
var routeWorker = new Worker('getroute.js');
var checkPatrolRouteFoundTimer;
var rw_resultRoute;
var routeFound = false;
routeWorker.onmessage = function(e) {
rw_resultRoute = e.data.route;
routeFound = true;
}
function checkPatrolReady() {
if(routeFound) {
...
//do work on route
...
clearInterval(checkPatrolRouteFoundTimer);
} else {
// do any maint here?
}
}
function ForcePatrol(index) {
routeWorker.postMessage(index);
checkPatrolRouteFoundTimer = setInterval("checkPatrolReady()", 1000);
}
Any call to setTimeout/setInterval will flow through, and in the first example i was using setTimeout instead of setInterval.
In the new way, calling ForcePatrol will setup the timer, and checkPatrolReady() will evaluate the flag, doing the work and clearing the timer if it is true.
So there is indeed nothing fancy in getting the results from web workers, but I was essentially creating a race condition with the worker results.