Node.js design: multiple async functions writing to database using function passed as a closure - javascript

I am writing a standalone web scraper in Node, run from the command line, which looks for specific data on a set of pages, fetches page view data from Google Analytics and saves it all in a MySQL database. Almost everything is ready, but today I found a problem with the way I write data to the db.
To make things easier, let's assume I have an index.js file and two controllers - db and web. Db reads/writes data to the db; web scrapes the pages using a configurable number of PhantomJS instances.
Web exposes one function, checkTargetUrls(urls, writer), where urls is an array of urls to be checked and writer is an optional parameter, called only if it is a function and there is data to be written.
Now the way I pass the writer is obviously wrong, but it looks as follows (in index.js):
// some code here ...
let pageId = 0;
// ... some promise code which checks the validity of the urls,
// creates a new execution in the database, etc.
.then(urls => {
    return web.checkTargetUrls(urls, function(singleUrl, pageData) {
        // ... a chain of promisified functions from the db controller,
        // which first looks up the page id in the db, puts it into the
        // pageId variable and then continues with the write to the db ...
    });
}).then(() => {
    logger.info('All done captain!');
}).catch(err => { logger.error(err); });
As a result, pageId randomly gets overwritten by the id of a preceding/succeeding page, and invalid data is saved. Inside web there are up to 10 concurrent PhantomJS instances running, each of which calls the writer function after it has analyzed a page. Excuse the analogy, but for me the situation is as if I had, say, 10 instances of some object which all rely on a singleton for writing, and that singleton causes the pageId overwriting problem (I don't know how to express this properly in JS/Node.js terms).
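To illustrate the overwrite, here is a stripped-down sketch of what I believe happens (db.lookupPageId and db.writePage are hypothetical stand-ins for my db controller functions):
let pageId = 0; // shared by every concurrent writer invocation

function writer(singleUrl, pageData) {
    return db.lookupPageId(singleUrl)                // async step 1
        .then(id => { pageId = id; })                // every invocation assigns the same variable
        .then(() => db.writePage(pageId, pageData)); // step 2 may read another invocation's id
}

// Keeping the id local to the chain would avoid the clobbering:
function saferWriter(singleUrl, pageData) {
    return db.lookupPageId(singleUrl)
        .then(id => db.writePage(id, pageData));     // id never leaves this chain
}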
So far I have found one fix for the problem, but it is ugly, as it introduces tight coupling. If I put the writer code in a separate module and then load it directly from inside the web controller, everything works great. But to me that is a bad design pattern, and I would rather do it otherwise.
var writer = require('./writer');

function checkTargetUrls(urls, executionId) {
    return new Promise(function(resolve, reject) {
        let poolSize = config.phantomJs.concurrentInstances;
        let running = 0;
        // ... a bit of code goes here ...
        if (slots != null && slots.data.length > 0) {
            return writer.write(executionId, singleUrl, slots);
        }
        // ... more code follows ...
    });
}
I am having a hard time finding a nicer solution where I could still pass writer as an argument to the checkTargetUrls(urls, writer) function. Can anyone point me in the right direction or suggest where to look for the answer?

The exact problem around your global pageId is not entirely clear to me, but you could reduce coupling by exposing a setWriter function from your 'web' controller:
// web.js
var writer;
module.exports.setWriter = function(_writer) { writer = _writer; };
Then near the top of your index.js, something like:
var web = require('./web');
web.setWriter(require('./writer'));

Related

Attempting to Import Module in Child Process (Javascript) and Failing

I'm currently running a heavy computation (generating a Monte Carlo tree), which is an expensive operation. I only have a few seconds to build as big a tree as I can, so I am using subprocesses in Node.js in order to build multiple trees, and then aggregate their data together to make a more informed decision.
I understand that subprocesses do not share information/memory, and I need to use modules within these subprocesses that are located in a file, called "Epilog.js" on my machine.
When I run functions that are in epilog.js from the main file, it works just fine. But all of my functions that are in my worker threads return absolutely nothing.
I have tested to make sure that the parameters of the functions I am trying to use in "epilog.js" aren't empty, and they're not; the problem isn't in the parameters.
I have also tested what happens if I simply don't import at all: instead of just outputting an empty array, I get an error saying that there is no function called "findroles".
// My main thread.
var fs = require('fs');
var { fork } = require('child_process'); // import needed for fork()
eval(fs.readFileSync('epilog.js') + '');
var process = fork('./buildGraph.js');
process.send({library});
// My worker thread: buildGraph.js
var fs = require('fs');
eval(fs.readFileSync('epilog.js') + '');

// receive message from master process
process.on('message', async (message) => {
    library = message["library"];
    console.log(findroles(library));
    // findroles(library) is a function defined in epilog.js which
    // outputs an array of "roles" given a parameter, library.
    // For some reason this outputs [] here rather than giving me
    // all of the roles. If I run this exact line from my main thread,
    // it doesn't give any errors and outputs the right array:
    // e.g. ['red', 'white'].
});
I expect to get not the empty array but ['red', 'white'], as I do when I run the same line in the main thread. Does anyone have an idea as to why the function behaves inconsistently? I'm very new to Node.js, and this isn't a class focused much on software engineering in JavaScript, so I'd appreciate it if someone could dumb down what is going on, as this is all very new to me.
If your script does not find the function called findroles, then there is a problem with the importing method. Using eval for importing is not the normal way of importing modules. Try something like this:
// buildGraph.js
const epilog = require("./epilog.js");
......
console.log(epilog.findroles(library));
and then in epilog.js:
exports.findroles = function (library) {
// function content
}
You can find more info here:
https://www.w3schools.com/nodejs/nodejs_modules.asp
Based on the documentation and example here, everything seems correct, but I think the problem comes from this line:
var process = fork('./buildGraph.js');
you are overriding the original process object. Try changing it to:
const n = fork('./buildGraph.js');
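Putting both fixes together, the two files could look roughly like this (a sketch; it assumes epilog.js exports its functions as shown in the other answer, and myLibrary is a placeholder for your data):
// main.js
const { fork } = require('child_process');
const epilog = require('./epilog.js');  // normal import instead of eval()

const child = fork('./buildGraph.js');  // a name that doesn't shadow the global `process`
child.send({ library: myLibrary });     // myLibrary: placeholder for your data

// buildGraph.js
const epilog = require('./epilog.js');
process.on('message', (message) => {
    console.log(epilog.findroles(message.library));
});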

Safe way to let users register Handlebars helpers in Node.js

I have a Node.js web app that uses Handlebars. Users are asking me to let them register their own Handlebars helpers.
I'm quite hesitant about letting them do it... but I'll give it a go if there is a secure way of doing so.
var Handlebars = require("handlebars");
var fs = require("fs");

var content = fs.readFileSync("template.html", "utf8");

// This helper will be posted by the user
var userHandlebarsHelpers = "Handlebars.registerHelper('foo', function(value) { return 'Foo' + value; });"

// eval(userHandlebarsHelpers); This I do not like! Eval is evil

// Compile handlebars with the user-submitted helpers
var template = Handlebars.compile(content);
var handleBarContent = template({ foo: bar });

// Save the compiled template and some extra code.
Thank you in advance!
Because helpers are just Javascript code, the only way you could safely run arbitrary Javascript from the outside world on your server is if you either ran it in an isolated sandbox process or you somehow sanitized the code before you ran it.
The former can be done with isolated VMs and external control over the process, but it makes things quite painful to have helper code in some external process, as you now have to develop ways to even call it and pass data back and forth.
Sanitizing Javascript so it is safe from running exploits on your server is a pretty much impossible task when your API set is as large as Node.js's. The browser has a very tightly controlled set of things that Javascript can do, precisely to keep the underlying system safe from what browser Javascript can do. Node.js has none of those safeguards. You could put code in one of these helpers to erase the entire hard drive of the server, or install multiple viruses, or pretty much whatever evil exploit you wanted to code. So, running arbitrary Javascript will simply not be safe.
Depending upon the exact problems that need to be solved, one can sometimes develop a data-driven approach where, instead of code, the user provides some higher-level set of instructions (map this to that, substitute this with that, replace this with that, display from this set of data, etc...) that is not actually Javascript, but rather some non-executable metadata. That is much more feasible to make safe, because you control all the code that acts on this metadata; you just have to make sure that the code that processes the metadata can't be tricked into doing something evil.
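For example, here is a minimal sketch of that idea (the operation names are made up for illustration):
var Handlebars = require('handlebars');

// A fixed whitelist of operations the server implements itself.
var safeOps = {
    stripDigits: function(v) { return String(v).replace(/[0-9]/g, ''); },
    upper: function(v) { return String(v).toUpperCase(); }
};

// The user submits only metadata, e.g. { name: 'foo', op: 'stripDigits' },
// never code, so nothing of theirs is ever executed.
function registerUserHelper(spec) {
    var op = safeOps[spec.op];
    if (!op) throw new Error('Unknown operation: ' + spec.op);
    Handlebars.registerHelper(spec.name, function(value) { return op(value); });
}

registerUserHelper({ name: 'foo', op: 'stripDigits' });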
Following @jfriend00's input, and after some serious testing, I found a way to do it using the Node.js vm module.
Users will input their helpers in this format:
[[HBHELPER 'customHelper' value]]
value.replace(/[0-9]/g, "");
[[/HBHELPER]]
[[HBHELPER 'modulus' index mod result block]]
if(parseInt(index) % mod === parseInt(result))
block.fn(this);
[[/HBHELPER]]
//This will throw an error when executed "Script execution timed out."
[[HBHELPER 'infiniteLoop' value]]
while(1){}
[[/HBHELPER]]
I translate that block into this and execute it:
Handlebars.registerHelper('customHelper', function(value) {
//All the code is executed inside the VM
return vm.runInNewContext('value.replace(/[0-9]/g, "");', {
value: value
}, {
timeout: 1000
});
});
Handlebars.registerHelper('modulus', function(index, mod, result, block) {
return vm.runInNewContext('if(parseInt(index) % mod === parseInt(result)) block.fn(this);', {
index: index,
mod: mod,
result: result,
block: block
}, {
timeout: 1000
});
});
Handlebars.registerHelper('infiniteLoop', function(value) {
//Error
return vm.runInNewContext('while(1){}', {
value: value
}, {
timeout: 1000
});
});
I have run multiple tests so far - trying to delete files, require modules, run infinite loops. Everything went perfectly: all of those operations failed.
Running just the helper callback's body in a VM is what made this work for me; my main problem with using VMs and running the whole code inside one was getting those helpers registered on my global Handlebars object.
I'll update if I find a way to exploit it.
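For the curious, the translation step can be sketched roughly like this (a simplified reconstruction, not my exact code):
const vm = require('vm');

// Matches [[HBHELPER 'name' arg1 arg2]] body [[/HBHELPER]]
const HELPER_RE = /\[\[HBHELPER '(\w+)'([^\]]*)\]\]([\s\S]*?)\[\[\/HBHELPER\]\]/g;

function registerUserHelpers(Handlebars, source) {
    let match;
    while ((match = HELPER_RE.exec(source)) !== null) {
        const name = match[1];
        const argNames = match[2].trim().split(/\s+/).filter(Boolean);
        const body = match[3].trim();
        Handlebars.registerHelper(name, function (...args) {
            // Build the sandbox from the declared argument names.
            const sandbox = {};
            argNames.forEach((argName, i) => { sandbox[argName] = args[i]; });
            return vm.runInNewContext(body, sandbox, { timeout: 1000 });
        });
    }
}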

NodeJS, SocketIO and Express logic context build

I have read a lot about Express / Socket.IO, and it's crazy how rarely you get an example that is more than a "Hello" transmitted directly from app.js. The problem is that it doesn't work like that in the real world... I'm actually desperate over a logic problem that seems far from anything the web gives me, which is why I wanted to point it out; I'm sure asking will be the solution! :)
I'm refactoring my app (because there were many mistakes, like putting libs in the global scope, etc.). Let's say I've got a huge system based on Socket.IO and Node.js. There's a loader in app.js which starts the socket system.
When someone joins the app, it require()s another module: it initializes many socket.on() listeners which are loaded dynamically from some /*_socket.js files in a folder. Each function in those modules represents a socket listener, which makes it way easier to call from the front-end. It might look like this:
// Will call `user_socket.js` and method `try_to_signin(some params)`
Queries.emit_socket('user.try_to_signin', {some params});
The system itself works really well. But there's a catch: the module that loads all those files which understand what the front-end has sent also transmits libraries linked with req/res (sessions, cookies, others...), and it must do so, because the called methods are the core of the app and very often need those libraries.
In the previous example, we obviously need to check whether the user is already logged in.
// The *_socket.js file looks like this:
var $h = require(__ROOT__ + '/api/helpers');

module.exports = function($s, $w) {
    var user_process = require(__ROOT__ + '/api/processes/user_process')($s, $w);
    return {
        my_method_called: function(reference, params, callback) {
            // Stuff using $s, $w, etc.
        }
    };
};

// And it's called this way:
// $s = services (a big object)
// $w = workers (a big object depending on $s)
// They are linked with the req/res from the page when they are instantiated
controller_instance = require('../sockets/' + controller_name + '_socket')($s, $w);

// After some processing ...
socket_io.on(socket_listener, function (datas, callback) {
    // Will call the correct function, etc.
    $w.queries.handle_socket($w, controller_name, method_name, datas);
});
The good news: basically, it works.
The bad news: every time I refresh the page, the listeners are duplicated, because they are registered in a loop that runs on every page load - so a log line that should have been printed once shows up several times.
So I should register all the socket.on('connection'...) stuff outside the page load, i.e. when the server starts... Yes, but I also need the req/res data to be able to load the libraries, and I only get that data when the page is loaded!
It's a programming logic problem. I know I did something wrong, but I don't know where to go from here; I've got this big system which "basically" works, but there's a paradox in the way I built it and I can't figure out how to resolve it. I've been stuck for a couple of hours.
How can I refactor it so that I can still get the current libraries, which depend on req/res, within a socket.on() call? Is there a trick? Should I completely change the way I did it?
Also, is there another way to do what I want to do ?
Thank you everyone !
NOTE: If I didn't explain it well or if you want more code, just tell me :)
EDIT - SOLUTION: As seen in the answer below, we can use sockets.once() instead of sockets.on(), or there's also the sockets.removeAllListeners() solution, which is less clean.
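For reference, the removeAllListeners() variant looks something like this (reusing the listener name from my example above):
socket_io.on('connection', function(socket) {
    // Remove handlers left over from a previous page load
    // before registering them again.
    socket.removeAllListeners('user.try_to_signin');
    socket.on('user.try_to_signin', function(datas, callback) {
        // dispatch to the controller as before
    });
});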
Try as below:
io.sockets.once('connection', function(socket) {
    io.sockets.emit('new-data', {
        channel: 'stdout',
        value: data
    });
});
Use once instead of on.
This problem is similar to the one described in the following link:
https://stackoverflow.com/questions/25601064/multiple-socket-io-connections-on-page-refresh/25601075#25601075

Writing JS code to mimic api design

We're planning on rebuilding our service at my workplace, creating a RESTful API and such, and I happened to stumble on an interesting question: can I write my JS code in a way that mimics my API design?
Here's an example to illustrate what I mean:
We have dogs, and you can access those dogs by doing a GET /dogs, and get info on a specific one with GET /dogs/{id}.
My Javascript code would then be something like
var api = {
    dogs: function(dogId) {
        if (dogId === undefined) {
            // request /dogs from the server
        } else {
            // request /dogs/dogId from the server
        }
    }
};
All is fine and dandy with that code; I just have to call api.dogs() or api.dogs(123) and I'll get the info I want.
Now, let's say those dogs have a list of diseases (or whatever, really) which you can fetch via GET /dogs/{id}/diseases. Is there a way to modify my Javascript so that the previous calls remain the same - api.dogs() returns all dogs and api.dogs(123) returns dog 123's info - while allowing me to do something like api.dogs(123).diseases() to list dog 123's diseases?
The simplest way I thought of doing it is by having my methods actually build queries instead of retrieving the data, plus a get or run method to actually run those queries and fetch the data.
The only way I can think of building something like this is if I could somehow detect, when executing a function, whether some other function is chained onto it, but I don't know if that's possible.
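Something like this rough sketch of the builder idea (hypothetical and untested):
var api = {
    dogs: function(dogId) {
        var path = dogId === undefined ? '/dogs' : '/dogs/' + dogId;
        return {
            // Chain a sub-resource onto the path being built.
            diseases: function() {
                path += '/diseases';
                return this;
            },
            // Actually run the request for whatever path was built.
            get: function(callback) {
                // e.g. issue the AJAX request for `path` here
                console.log('GET ' + path);
            }
        };
    }
};

api.dogs().get();                // GET /dogs
api.dogs(123).get();             // GET /dogs/123
api.dogs(123).diseases().get();  // GET /dogs/123/diseases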
What are your thoughts on this?
I cannot give you a concrete implementation, but a few hints for how you could accomplish what you want. It would be interesting to know what kind of server and framework you are using.
Generate (write yourself or auto-generate from code) a WADL describing your service and then try to generate the code from it, for example with XSLT.
In my REST projects I use Swagger, which analyzes some common Java REST implementations and generates JSON descriptions that you could use as a base for your JavaScript API.
This can be easy for simple REST APIs but gets complicated once the API divides into complex hierarchies or has a tree structure. Then everything will depend on exact documentation of your service.
Assuming that your JS application knows of the services provided by your REST API (i.e. you send it a JSON or XML file describing the services), you could do the following:
var API = (function() {
    // Private members: here you hide the API's functionality from the outside.
    var sendRequest = function(url) { return {}; }; // send GET request

    return {
        // Public members: methods that will be exposed to the outside.
        getDog: function(id, criteria) {
            // Check that criteria isn't an invalid request. Remember the JSON file?
            // Generate the url from the id and criteria.
            var url = "/dogs/" + id + (criteria ? "/" + criteria : "");
            var response = sendRequest(url);
            return response;
        }
    };
}());
var diseases = API.getDog("123", "diseases");
var breed = API.getDog("123", "breed");
The code above isn't 100% correct, since you still have to deal with the AJAX call, but it is more or less what you want.
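For instance, sendRequest could be filled in with a plain XMLHttpRequest; note that the API then has to become callback-based (or promise-based), since the request is asynchronous:
var sendRequest = function(url, callback) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', url, true);
    xhr.onload = function() {
        if (xhr.status >= 200 && xhr.status < 300) {
            callback(null, JSON.parse(xhr.responseText));
        } else {
            callback(new Error('Request failed with status ' + xhr.status));
        }
    };
    xhr.onerror = function() { callback(new Error('Network error')); };
    xhr.send();
};

// Usage then becomes callback-based as well, e.g.:
// API.getDog("123", "diseases", function(err, diseases) { ... });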
I hope this helps!

Simplest approach to Node.js request serialisation

I've got the classic asynchronous/concurrency problem that folks writing a service in Node.js stumble into at some point. I have an object that fetches some data from an RDBMS in response to a request, and emits a fin event (using an EventEmitter) when the row fetching is complete.
As you might expect, when the caller of the service makes several near-simultaneous calls to it, the rows are returned in an unpredictable order. The fin event is fired for rows that do not correspond to the calling function's understanding of the request that produced them.
Here's what I've got going on (simplified for relevance):
var mdl = require('model.js');

dispatchGet: function(req, res, sec, params) {
    var guid = umc.genGUID(36);
    mdl.init(this.modelMap[sec], guid);
    // mdl.load() creates and returns a 'new events.EventEmitter()'
    mdl.load(...).once('fin', function() {
        res.write(...);
        res.end();
    });
}
A simple test shows that the mdl.guid often does not correspond to the guid.
I would have thought that creating a new events.EventEmitter() inside the mdl.load() function would fix this problem by creating a discrete EventEmitter for every request, but evidently that is not the case; I suppose the same rules of object persistence apply to it as to any other object, irrespective of new.
I'm a C programmer by background: I can certainly come up with my own scheme for associating these replies with their requests, using some circular queue or hashing scheme. However, I am guessing this problem has already been solved many times over. My research has revealed many opinions on how to best handle this--various kinds of queuing implementations, Futures, etc.
What I'm wondering is, what's the simplest possible approach to good asynchronous flow control here? I don't want to get knee-deep in some dependency's massive paradigm shift if I don't have to. Is there a relatively simple, canonical, definitive solution, and/or widespread consensus on which third-party module is best?
Could it be that your model.js looks something like this?
module.exports = {
    init: function(model, guid) {
        this.guid = guid;
        ...
    }
};
You have to be aware that the object you're passing to module.exports there is a shared object, in the sense that every other module that runs require("model.js") will receive a reference to the same object.
So every time you run mdl.init(), the guid property of that object is changed, which would explain your comment that "...a simple test shows that the mdl.guid often does not correspond to the guid".
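You can see the sharing directly with a minimal sketch:
// require() caches a module's exports, so every require of model.js
// returns the very same object.
var m1 = require('./model.js');
var m2 = require('./model.js');
console.log(m1 === m2); // true

m1.init(null, 'guid-A');
console.log(m2.guid);   // 'guid-A' - the "other" reference sees the change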
It really depends on your exact implementation, but I think you'd want to use a class instead:
// model.js
var Mdl = function(model, guid) {
this.guid = guid;
};
Mdl.prototype.load = function() {
// instantiate and return a new EventEmitter.
};
module.exports = Mdl;
// app.js
var Mdl = require('model.js');
...
var mdl = new Mdl(this.modelMap[sec], guid);
mdl.load(...)
