I've heard that node.js is fast at I/O tasks like querying a database, and that because JavaScript is single-threaded and uses an event loop, it uses far fewer resources than spawning a separate thread for every concurrent database query. But doesn't that mean the maximum number of concurrent queries it can make is still limited by how many concurrent connections a particular database can handle? If that's the case, what are the advantages of node.js over something like Vert.x, which can use multiple event loops instead of just one?
Since Node.js 12 (available since Node.js 10 as an experimental feature) you can use worker threads, much like the worker verticles you may already be using with Vert.x. You also have the child_process module and cluster mode in Node.
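For illustration, here's a minimal worker_threads sketch; the file name and the CPU-bound loop are just stand-ins for whatever work you'd offload:

// worker-demo.js - run with: node worker-demo.js (Node >= 12)
const { Worker, isMainThread, parentPort } = require('worker_threads');

if (isMainThread) {
  // Main thread: spawn a worker so the event loop stays free
  const worker = new Worker(__filename);
  worker.on('message', (result) => console.log('result:', result));
  worker.postMessage(40);
} else {
  // Worker thread: CPU-bound work runs here, off the event loop
  parentPort.on('message', (n) => {
    let sum = 0;
    for (let i = 0; i < n * 1e6; i++) sum += i; // stand-in for heavy work
    parentPort.postMessage(sum);
  });
}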
Either way (Vert.x or Node) you need to keep in mind the golden rule, "do not block the event loop or the worker pool", and size your thread pool according to your capacity.
I think it's often better to keep your process single-threaded and to scale it as new, completely isolated instances that can be hosted in multiple places in your infrastructure, or even in the same place (using Docker/Kubernetes/whatever, or even with systemd, for example), and to perform the scaling dynamically according to the increase in requests, your infrastructure's ability to scale, and of course the size of the connection pool your database can handle.
When you design your application as a reactive or event-driven one (the right way to use Node or Vert.x) and a stateless one, it's easier to handle the scaling that way.
I learned that Node.js is single-threaded and non-blocking. I saw a nice explanation in How, in general, does Node.js handle 10,000 concurrent requests?
But the first answer says
The seemingly mysterious thing is how both the approaches above manage to run workload in "parallel"? The answer is that the database is threaded. So our single-threaded app is actually leveraging the multi-threaded behaviour of another process: the database.
(1) This confuses me. Take a simple Express application as an example. When I write:
var express = require('express');
var monk = require('monk');

var router = express.Router();
var db = monk('localhost:27017/databaseName');

router.get('/QueryVideo', function(req, res) {
  var collection = db.get('videos');
  // find() is asynchronous: the callback fires once MongoDB responds
  collection.find({}, function(err, videos) {
    if (err) throw err;
    res.render('index', { videos: videos });
  });
});

module.exports = router;
and my router responds to multiple requests by doing this simple MongoDB query, are those queries handled by different threads? I know there is only one thread in Node to route client requests, though.
(2) My second question is: how does such a single-threaded Node application ensure security? I don't know much about security, but it looks like multiple requests should be isolated (at least in different memory spaces?). Even multi-threaded applications can't ensure that kind of isolation, because threads still share many things. I know this may not be a reasonable question, but in today's cloud services, isolation seems to be an important topic, and I am lost in topics such as serverless, Wasm, and Wasm-based serverless using the Node.js environment.
Thank you for your help!!
Since you asked about my answer I guess I can help clarify.
1
For the specific case of handling multiple parallel client requests, each of which triggers a parallel MongoDB query, you asked:
Are those queries handled by different threads?
On node.js, since MongoDB connects via the network stack (TCP/IP), all parallel requests are handled in a single thread. The magic is a system API that allows your program to wait in parallel. Node.js uses libuv to select which API to use depending on the OS at compile time. But which API does not matter: it is enough to know that all modern OSes have APIs that allow you to wait on multiple sockets in parallel (instead of the usual waiting on a single socket in multiple threads/processes). These APIs are collectively called asynchronous I/O APIs.
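You can see that single-threaded parallel waiting directly. In this sketch (the URLs are just placeholders), both requests are in flight at once while the one thread waits on both sockets:

const https = require('https');

function get(url) {
  // Resolve with the HTTP status code once the response headers arrive
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      res.resume(); // drain the body; we only care about the status
      resolve(res.statusCode);
    }).on('error', reject);
  });
}

// Both requests are opened before either completes: the single thread
// registers both sockets with the OS and waits on them in parallel.
Promise.all([get('https://example.com'), get('https://example.org')])
  .then((codes) => console.log(codes));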
On MongoDB... I don't know much about MongoDB. Mongo may be implemented with multiple threads or it may be single-threaded like node.js. Disk I/O is itself handled in parallel by the OS without using threads, instead using I/O channels (e.g. PCI lanes) and DMA channels. Basically, both threads/processes and asynchronous I/O are generally implemented by the OS (at least on Linux and Mac) using the same underlying system: OS events. And OS events are just functions that handle interrupts. Anyway, this is straying quite far from the discussion about databases.
I do know that MySQL (using threads) and PostgreSQL (using processes) both execute query processing in parallel: query processing in SQL is basically a set of operations that loop through rows and filter the result, which requires both I/O and CPU, and that is why they parallelize it.
If you are still curious how computers can do things (like wait for I/O) without the CPU executing a single instruction you can check out my answers to the following related questions:
Is there any other way to implement a "listening" function without an infinite while loop?
What is the mechanism that allows the scheduler to switch which threads are executing?
2
Security is ensured by the language being interpreted and by making sure the interpreter does not have any stack overflow or underflow bugs. For the most part this is true of all modern JavaScript engines. The main mechanism for injecting and executing foreign code via program input is a buffer overflow or underflow; being able to execute foreign code is what then allows an attacker to access memory. If you cannot execute foreign code, being able to access memory is kind of moot.
There is a second mechanism for injecting foreign code which is prevalent in some programming-language cultures: code eval (I'm looking at you, PHP!). Using string substitution to construct database queries in any language opens you up to a code-eval attack (more commonly called SQL injection) regardless of your program's memory model. JavaScript itself has an eval() function. To protect against this, JavaScript programmers simply consider eval evil. Protection against eval basically comes down to good programming practices, and Node.js being open source allows anyone to look at the code and report any cases where a code-evaluation attack is possible. Historically Node.js has been quite good in this regard, so your main security guarantee against code eval is Node's reputation.
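As a sketch of the difference (assuming the mysql npm package; the table and variable names are made up):

const mysql = require('mysql');
const connection = mysql.createConnection({ host: 'localhost', user: 'app', database: 'test' });

const userInput = "x' OR '1'='1"; // hostile input

// Vulnerable: string substitution lets the input become SQL code
connection.query("SELECT * FROM users WHERE name = '" + userInput + "'", console.log);

// Safer: a parameterized query keeps the input as data, never code
connection.query('SELECT * FROM users WHERE name = ?', [userInput], console.log);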
(1) The big picture goes like this: in node.js there are two types of threads, the event loop (single) and the workers (a pool). As long as you don't block the event loop, node.js hands a blocking I/O call (file I/O, DNS lookups, and the like) to a worker thread and goes on to service the next request. The worker then places the completed I/O back on the event loop for the next course of action.
In short, the main thread does something else whenever it would need to wait, comes back and continues when the wait is over, and it does this one thing at a time.
This reactive mechanism has nothing to do with threads running in another process (i.e. the database). The database may deploy some other kind of thread-management scheme.
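A tiny demonstration of that hand-off; crypto.pbkdf2 is one of the calls Node runs on the libuv worker pool:

const crypto = require('crypto');

// The hashing runs on a worker thread; the event loop is free meanwhile
crypto.pbkdf2('secret', 'salt', 100000, 64, 'sha512', (err, key) => {
  if (err) throw err;
  console.log('worker finished; result delivered back on the event loop');
});
console.log('the event loop already moved on to the next request');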
(2) The 'memory space' in your question is the same process space. A thread belongs to a process (e.g. Express app A) and never runs in another process's (e.g. Fastify app B) address space.
I'm working on a small side project and would like to grow it out, but I'm not too sure how. My question is: how should I design my NodeJs worker application to be able to execute multiple long-running jobs at the same time? (i.e. should I be using multiprocessing libraries, a load balancer, etc.?)
My current situation is that I have a NodeJs app running purely to serve web requests and put jobs on a queue, while another NodeJs app reading off that queue carries out those jobs (on a Heroku worker dyno). Each job may take anywhere from 1 hour to 1 week of purely writing to a database. Due to the nature of the job, and it requiring a specific npm package, I feel like I should be using Node, but at the same time I'm not sure it's the best option considering I would like to scale it so that hundreds of jobs can be executed at the same time.
Any advice/suggestions as to how I should architect this design would be appreciated. Thank you.
First off, a single node.js app can handle lots of jobs that are just reading/writing from a database, because those activities are mostly asynchronous, which means node.js spends most of its time doing nothing while waiting for the database to respond to the last request. So, a single node.js app could probably handle at least hundreds of jobs, perhaps even thousands (depending upon exactly what the jobs are doing). In fact, I wouldn't be surprised if a single node.js app could throw more work at your database than the database can keep up with.
Then, if you want to scale how many worker node.js apps are running these jobs, you can simply fire up as many worker apps as you want (and as many as your hardware can handle) using the child_process module. You create one central work queue in your main node.js app. Then, create a bunch of child_processes whose job it is to grab N items from the work queue and process them. Note, I suggest you grab N items at once because a single node.js process can probably work on many separate jobs at once because of asynchronous I/O to your database.
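A minimal sketch of that layout (the file names, the queue contents, and doJob are all made up for the example):

// main.js - hand out queued jobs to forked worker processes
const { fork } = require('child_process');

const queue = ['job1', 'job2', 'job3', 'job4', 'job5', 'job6'];
const NUM_WORKERS = 2; // as many as your hardware can handle
const BATCH_SIZE = 3;  // N jobs per worker, since each job is mostly async I/O

for (let i = 0; i < NUM_WORKERS; i++) {
  const worker = fork('./job-worker.js');
  worker.send(queue.splice(0, BATCH_SIZE)); // grab N items off the queue
  worker.on('message', (msg) => console.log('worker finished:', msg));
}

// job-worker.js - work on all N jobs concurrently
const doJob = (job) => new Promise((resolve) => setTimeout(resolve, 100)); // stand-in for DB work

process.on('message', (jobs) => {
  Promise.all(jobs.map(doJob)).then(() => process.send({ completed: jobs }));
});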
You may also want to explore the cluster module, which doesn't even need a work queue. You can just fire up as many clustered instances of your main app as you want, and they can all share the workload (both serving web pages and working on the long-running jobs). The usual guideline is to set up one clustered instance per CPU core. So, if you have 4 cores, you would set up a cluster with a total of four servers in it.
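As a rough sketch (assuming an HTTP app listening on port 3000):

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  // Fork one worker per CPU core; connections are spread across them
  os.cpus().forEach(() => cluster.fork());
} else {
  http.createServer((req, res) => res.end('handled by ' + process.pid))
      .listen(3000);
}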
I'm new to Node and trying to understand its non-blocking nature.
In the image below I've created a high-level diagram of the request.
As I understand, all processes from a single user for a single app run on a single thread.
What I would like to understand is how the logic of the event loop fits into this diagram. Is the event loop the same as the processor pipeline, where instructions are queued?
Imagine that we load an app page into RAM and it creates a stream for the program to read from:

var readstream = require('fs').createReadStream('./page.html');
readstream.on('data', function(data) { /* handle each chunk */ });

The instructions that create the readstream and wait for data: do they "hang" in a register in the processor (waiting for the I/O to finish), whereas in a multithreaded environment the processor simply doesn't take new instructions from RAM until the result of the previous I/O request has been returned to RAM?
Or is the way I see this entirely/partially wrong?
Just a supplementary (related, perhaps stupid) question: do different users run on different threads on the server, and is the single-threaded benefit therefore only for a single user?
I'm new to this type of detail, so excuse me if this question doesn't entirely make sense to you. But understanding this seems essential for me before moving forward.
Event-driven non-blocking I/O relies on the fact that modern operating systems provide a 'select' method that waits on many I/O sources at once at the OS level (without busy-waiting, so no CPU cycles are wasted). The select method lets a single thread be notified when any of its registered file descriptors is ready for I/O. This tends to be much more efficient than the 'thread-per-connection' model commonly used in thread-enabled languages. For more info, do a 'man select' on a Unix/Linux OS.
Threads and I/O have to do with operating system implementation and services, not CPU architecture.
Operations that involve input/output devices of any kind — mass storage, networks, serial ports, etc. — are structured as requests from the CPU to an external device that, by one of several possible mechanisms, are later satisfied.
On top of that reality, operating systems provide alternative programming models. In one model, the factual nature of input/output operations is essentially disguised, such that executing programs are given an API that appears synchronous. In a C program, a call to the write() system call will cause the entire process to block until the operation has completed.
Another programming model more closely exposes the asynchronous reality of the system. That's what Node uses. Operating systems provide ways to initiate long-duration asynchronous operations, along with ways for a process to either check for results or to block and wait for results. In Node, the runtime system can juggle lots of separate operations because the entire model is based on code running in response to events. An event can be a synthetic thing (such as the "event" of a Node module being loaded and run initially), or it can be something that's a result of actual asynchronous external events. In the case of input/output operations, the Node runtime waits for operating system notification and translates that into an event that causes some JavaScript code to run.
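As a small illustration of the two models within Node itself (the file name is a placeholder):

const fs = require('fs');

// Synchronous model: the process waits here until the read completes
const page = fs.readFileSync('./page.html');

// Asynchronous model: initiate the read now; the callback runs when the
// OS notifies the runtime that the result is ready
fs.readFile('./page.html', (err, contents) => {
  if (err) throw err;
  console.log('read', contents.length, 'bytes');
});
console.log('this line runs before the asynchronous read finishes');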
In order to make a web app responsive you use asynchronous, non-blocking requests. I can envision two ways of accomplishing this. One is to use deferreds/promises. The other is Web Workers. With Web Workers we end up introducing another process, and we incur the overhead of having to marshal data back and forth. I was looking for some kind of performance metrics to help understand when to choose simple non-blocking callbacks over Web Workers.
Is there some way of formulating which to use without having to prototype both approaches? I see lots of tutorials online about Web Workers, but I don't see many success/failure stories. All I know is I want a responsive app. I'm thinking of using a Web Worker as the interface to an in-memory data structure that may be anywhere from 0.5-15MB (essentially a DB) that the user can query and update.
As I understand JavaScript processing, it is possible to take a single long-running task and slice it up so that it periodically yields control, allowing other tasks a slice of processing time. Would that be a sign to use Web Workers?
Deferreds/Promises and Web Workers address different needs:
Deferreds/promises are constructs to assign a reference to a result that is not yet available, and to organize the code that runs once the result becomes available or a failure is returned.
Web Workers perform actual work asynchronously (using operating system threads, not processes, so they are relatively lightweight).
In other words, JavaScript being single-threaded, you cannot use deferreds/promises to run code in parallel: while the code that fulfills the promise runs, no other code runs (you may change the order of execution, e.g. using setTimeout(), but that does not make your web app more responsive per se). Still, you might be able to create the illusion of an asynchronous query by, say, iterating over an array of values and incrementing the index every few milliseconds (e.g. using setInterval), but that's hardly practical.
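For completeness, here is roughly what that slicing trick looks like (processItem is a hypothetical stand-in for your per-item work):

function processItem(item) { /* stand-in for your per-item work */ }

// Process a large array in chunks, yielding to the event loop in between
function processInSlices(items, chunkSize, done) {
  var i = 0;
  function slice() {
    var end = Math.min(i + chunkSize, items.length);
    for (; i < end; i++) {
      processItem(items[i]);
    }
    if (i < items.length) {
      setTimeout(slice, 0); // yield so the UI stays responsive
    } else {
      done();
    }
  }
  slice();
}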
In order to perform work such as your query asynchronously, and thus offload this work from your app's UI, you need something that actually works asynchronously. I see several options:
use an IndexedDB which provides an asynchronous API,
run your own in-memory data structure and use Web Workers, as you indicated, to perform the actual query (see the sketch after this list),
use a server-side script engine such as NodeJS to run your code, then use client-side ajax to start the query (plus a promise to process the results),
use a database accessible over HTTP (e.g. Redis, CouchDB), and from the client issue an asynchronous GET (i.e. ajax) to query the DB (plus a promise to process the results),
develop a hybrid web app using e.g. Parse.
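Here is a minimal sketch of the Web Worker option; the file name, message shape, and sample records are all made up:

// main.js - query an in-memory data structure living in a worker
var worker = new Worker('query-worker.js');
worker.onmessage = function (e) {
  console.log('query results:', e.data);
};
worker.postMessage({ minAge: 21 }); // send the query to the worker

// query-worker.js - owns the data; filtering happens off the UI thread
var records = [{ name: 'Ada', age: 36 }, { name: 'Tim', age: 14 }];
onmessage = function (e) {
  postMessage(records.filter(function (r) {
    return r.age >= e.data.minAge;
  }));
};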
Which approach is the best in your case? Hard to tell without exact requirements, but here are the dimensions I would look at:
Code complexity — if you already have code for your data structure, probably Web Workers are a good fit, otherwise IndexedDB looks more sensible.
Performance — if you need consistent performance, a server-side implementation or DB seems more appropriate.
Architecture/complexity — do you want all processing to be done client side or can you afford the effort (cost) to manage a server side implementation?
I found this book a useful read.
I'm new to JavaScript so forgive me for being a n00b.
When there's intensive calculation required, it more than likely involves loops, recursive or otherwise. Sometimes this may mean having a recursive loop that runs four functions, where maybe each of those functions walks the entire DOM tree, reads positions, and does some math for collision detection or whatever.
While the first function is walking the DOM tree, the next one has to wait for the first one to finish, and so forth. Instead of doing this, why not launch those loops-within-loops separately, outside the program, and act on their calculations in another loop that runs slower because it isn't doing those calculations itself?
Foolish or clever?
Thanks in advance!
Long-running computations are exactly what Web Workers are for. What you describe is the common pattern of producer and/or consumer threads. While you could do this using Web Workers, the synchronization overhead would likely trump any gains, even on highly parallel systems.
JavaScript is not the ideal language for computationally demanding applications. Also, the processing power of web-browsing machines can vary wildly (think a low-end smartphone vs. a 16-core workstation). Therefore, consider calculating complex stuff on the server and sending the result to the client to display.
For your everyday web application, you should take a single-threaded approach and analyze performance once it becomes a problem. Heck, why not ask for help about your performance problem here?
JavaScript was never meant to perform such computationally intensive tasks, and even though this is changing, the fact remains that JavaScript is inherently single-threaded. The recent Web Workers technology provides a limited form of multi-threading, but worker threads can't access the DOM directly; they can only send/receive messages to the main thread, which can then access it on their behalf.
Currently, the only way to get real parallel processing in JS is to use Web Workers, but they are only supported by very recent browsers. And if your program requires such a thing, it could mean that you are not using the right tools (for example, walking the DOM tree is generally done using DOM selector APIs like querySelectorAll).
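For instance, instead of recursively walking the tree yourself, you can let the engine do the traversal (the class name here is hypothetical):

// Let the engine find the nodes, then read positions for the collision math
var boxes = document.querySelectorAll('.collision-candidate');
Array.prototype.forEach.call(boxes, function (el) {
  var rect = el.getBoundingClientRect();
  // ...collision detection math using rect.left, rect.top, etc.
});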