I am creating a program in Node.js that extracts PDF text using the command-line utility pdftotext, spawning a child_process for each file. I would like to know whether this process is CPU heavy and whether thousands of people could use it without breaking anything.
Is creating a child_process heavy? If pdftotext is not multithreaded, how can I scale? Do I need load balancing?
Thanks.
Let's break this down a bit:
I would like to know if this process is CPU heavy
I am not sure how CPU intensive pdftotext is for a single file. That would also depend on how big each file is, but generally speaking, since extracting text from a PDF involves no asynchronous work and is CPU bound, I would expect the process to be CPU heavy, especially under a lot of load.
and whether thousands of people could use it without breaking anything.
Spawning a new process for every single file or on every single request is generally not a good idea. Spawning a process is an expensive operation that requires a lot of memory. Having thousands of people use your service at the same time would require thousands of processes to be open simultaneously on your server, which would exhaust memory; your server would max out at a certain limit and fail beyond that.
Is creating a child_process heavy? If pdftotext is not multithreaded, how can I scale? Do I need load balancing?
As mentioned, spawning a new process is never a cheap operation. It requires memory and resources.
Every file will run in a separate process. Whether pdftotext is implemented to open a single thread or multiple threads in a process is irrelevant here; either way, the process with all its threads will be competing for machine resources with other processes. Of course it is beneficial if the work is divided among different threads and can execute in parallel, as this makes it faster, but what you should be more concerned about is how long it takes to extract text from a single file, i.e. how long the process spends executing.
If you are going to run this as a service, you will need to benchmark and optimize, and, depending on the load you want to support and your benchmark results, almost certainly load balance across a few high-end machines.
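To make the point concrete, here is a minimal sketch (not a production implementation) of keeping the number of pdftotext processes bounded instead of spawning one per request; the concurrency limit and file paths are assumptions you would replace after benchmarking:

```js
// Minimal sketch: cap the number of pdftotext processes running at once.
const { execFile } = require('child_process');

const MAX_CONCURRENT = 4;      // assumption: roughly one process per core, tune from benchmarks
let running = 0;
const queue = [];

function extractText(pdfPath) {
  return new Promise((resolve, reject) => {
    queue.push({ pdfPath, resolve, reject });
    drain();
  });
}

function drain() {
  while (running < MAX_CONCURRENT && queue.length > 0) {
    const { pdfPath, resolve, reject } = queue.shift();
    running++;
    // "pdftotext <file> -" writes the extracted text to stdout
    execFile('pdftotext', [pdfPath, '-'], { maxBuffer: 50 * 1024 * 1024 },
      (err, stdout) => {
        running--;
        drain();               // start the next queued job, if any
        if (err) reject(err); else resolve(stdout);
      });
  }
}

// Usage: extractText('/tmp/example.pdf').then(console.log).catch(console.error);
```

The idea is simply that requests queue up in memory while at most a handful of pdftotext processes run at once, so a load spike does not translate directly into thousands of simultaneous processes.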
I hope I managed to answer some of your questions.
Related
Ok, so the thing is, we have multiple Node.js web servers which need to be online all the time, but they won't be receiving many requests, approx. 100-200 requests a day. The tasks aren't CPU intensive either. We are provisioning EC2 instances for it. So, the question is: can we run multiple Node.js processes on a single core? If not, is it possible to run more low-intensity Node.js processes than the number of cores present? What are the pros and cons? Any benchmarks available?
Yes, it is possible. The OS (or VM on top of the OS) will simply share the single CPU among the processes allocated to it. If, as you say, you don't have a lot of requests and those requests aren't very CPU hungry, then everything should work just fine and you probably won't even notice that you're sharing a single CPU among a couple server processes. The OS/VM will time slice among the processes using that CPU, but most of the time you won't even have more than one process asking to use the CPU anyway.
Pros/Cons - Really only that performance might momentarily slow down if both servers get CPU-busy at the same time.
Benchmarks - This is highly dependent upon how much CPU your servers are using and when they try to use it. With the small number of requests you're talking about and the fact that they aren't CPU intensive, it's unlikely a user of either server would even notice. Your CPU is going to be idle most of the time.
If you happen to run a request for each server at the exact same moment and that request would normally take 500ms to complete and most of that was not even CPU time, then perhaps each of these two requests might then take 750ms instead (slightly overlapping CPU time that must be shared). But, most of the time, you're not even going to encounter a request from each of your two servers running at the same time because there are so few requests anyway.
I'm developing an app that should receive a .CSV file, save it, scan it, and insert data of every record into DB and at the end delete the file.
With a file of about 10,000 records there are no problems, but with a larger file the PHP script runs correctly and all the data is saved into the DB, yet the page prints ERROR 504: The server didn't respond in time.
I'm scanning the .CSV file with the PHP function fgetcsv().
I've already edited settings in the php.ini file (max execution time = 120, etc.), but nothing changes; after 1 minute the error is shown.
I've also tried using a JavaScript function to show an alert every 10 seconds, but in that case too the error is shown.
Is there a solution to avoid this problem? Is it possible to pass some data from the server to the client every few seconds to avoid the error?
Thanks.
It's typically when scaling issues pop up that you need to start evolving your system architecture, and your application will need to work asynchronously. The problem you are having is very common (some of my team are dealing with one as I write this), but everyone needs to deal with it eventually.
Solution 1: Cron Job
The most common solution is to create a cron job that periodically scans a queue for new work to do. I won't go into the exact nature of the queue since everyone has their own; some are alright and others are really bad. Typically it involves a DB table with the relevant information and a job status (one of the bad solutions), or a solution involving Memcached; MongoDB is also quite popular.
The "problem" with this solution is ultimately again "scaling". Cron jobs run periodically at fixed intervals, so if a task takes a particularly long time jobs are likely to overlap. This means you need to work in some kind of locking or utilize a scheduler that supports running the job sequentially.
In the end, you won't run into the timeout problem, and you can typically dedicate an entire machine to running these tasks so memory isn't as much of an issue either.
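The question is about PHP, but the polling pattern itself is language-agnostic. As a rough sketch (shown in Node.js here, with fetchNextPendingJob, processCsv, markDone and markFailed as hypothetical helpers wrapping your own queue table, not any real library API), each cron run simply claims and processes whatever pending jobs it finds:

```js
// Sketch of a cron-driven worker. The helpers below are hypothetical stand-ins
// for your own queue-table access; they are not part of any real library.
async function runOnce() {
  let job;
  // Claim jobs one at a time until the queue is empty.
  while ((job = await fetchNextPendingJob()) !== null) {
    try {
      await processCsv(job.filePath);          // hypothetical: the slow CSV import
      await markDone(job.id);
    } catch (err) {
      await markFailed(job.id, err.message);
    }
  }
}

runOnce().catch(console.error);
```

The locking mentioned above usually lives inside the "claim a job" step, e.g. an UPDATE that atomically flips a row from pending to running, so two overlapping runs never pick up the same job.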
Solution 2: Worker Delegation
I'll use Gearman as an example for this solution, but other tools implement standards like AMQP, such as RabbitMQ. I prefer Gearman because it's simpler to set up and it's designed more for work processing than for messaging.
This kind of delegation has the advantage of running immediately after you call it. The server is basically waiting for stuff to do (not unlike an Apache server); when it gets a request, it shifts the workload from the client onto one of your "workers". These are scripts you've written which run indefinitely, listening to the server for work.
You can have as many of these workers as you like, each running the same or different types of tasks. This means scaling is determined by the number of workers you have, and this scales horizontally very cleanly.
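This is not Gearman-specific code, but the shape of the delegation is easy to see with Node's built-in child_process IPC: the front end hands the job off and returns immediately, and a long-lived worker script does the heavy lifting (the file names and job payload here are illustrative):

```js
// worker.js -- runs indefinitely, waiting for jobs from the dispatcher
process.on('message', async (job) => {
  // ... do the slow work here (CSV import, text extraction, etc.) ...
  process.send({ id: job.id, status: 'done' });
});
```

```js
// dispatcher.js -- hands work to the worker and returns to the client right away
const { fork } = require('child_process');

const worker = fork('./worker.js');       // assumed path to the worker script

worker.on('message', (result) => {
  console.log('job finished:', result);   // e.g. update the job status in the DB
});

function submit(job) {
  worker.send(job);                       // non-blocking; the worker picks it up
}

submit({ id: 1, filePath: '/tmp/upload.csv' });
```

With Gearman (or RabbitMQ) the dispatcher and the workers can live on different machines and the broker handles queuing and distribution, which is what makes the horizontal scaling mentioned above so clean.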
Conclusion:
Crons are fine, in my opinion, for automated maintenance, but they run into problems when they need to work concurrently, which makes running workers the ideal choice.
Either way, you are going to need to change the way users receive feedback on their requests. They will need to be informed that their request is processing and to check back later to get the result; alternatively, you can periodically poll the status of the running task to provide real-time feedback to the user via AJAX. That's a little tricky with cron jobs, since you will need to persist the state of the task during its execution, but Gearman has a nice built-in solution for doing just that.
http://php.net/manual/en/book.gearman.php
I have to make a million http calls from my nodejs app.
Apart from doing it using the async library and callbacks, is there any other way to make this many requests in parallel so they are processed much faster?
Any suggestions would be appreciated.
As the title of your question seems to ask, it's a bit of a folly to actually make millions of parallel requests. Having that many requests in flight at the same time will not help you get the job done any quicker and it will likely exhaust many system resources (memory, sockets, bandwidth, etc...).
Instead, if the goal is to just process millions of requests as fast as possible, then you want to do the following:
Start up enough parallel node.js processes so that you are using all the CPU you have available for processing the request responses. If you have 8 cores in each server involved in the process, then start up 8 node.js processes per server.
Install as much networking bandwidth capability as possible (high throughput connection, multiple network cards, etc...) so you can do the networking as fast as possible.
Use asynchronous I/O processing for all I/O so you are using the system resources as efficiently as possible. Be careful about disk I/O because async disk I/O in node.js actually uses a limited thread pool internal to the node implementation so you can't have an indefinite number of async disk I/O requests actually in flight at the same time. You won't get an error if you try to do this (the excess requests will just be queued), but it won't help you with performance either. Networking in node.js is truly async so it doesn't have this issue.
Open only as many simultaneous requests per node.js process as actually benefit you. How many this is (likely somewhere between 2 and 20) depends upon how much of the total time to process a request is networking vs. CPU and how slow the responses are. If all the requests are going to the same remote server, then saturating it with requests likely won't help you either because you're already asking it to do as much as it can do.
Create a coordination mechanism among your multiple node.js processes to feed each one work and possibly collect results (something like a work queue is often used).
Test like crazy and discover where your bottlenecks are and investigate how to tune or change code to reduce the bottlenecks.
If your requests are all to the same remote server then you will have to figure out how it behaves with multiple requests. A larger server farm will probably not behave much differently if you fire 10 requests at it at once vs. 100 requests at once. But, a single smaller remote server might actually behave worse if you fire 100 requests at it at once. If your requests are all to different hosts, then you don't have this issue at all. If your requests are to a mixture of different hosts and same hosts, then it may pay to spread them around to different hosts so that you aren't making 100 at once of the same host.
The basic ideas behind this are:
You want to maximize your use of the CPU so each CPU is always doing as much as it can.
Since your node.js code is single threaded, you need one node.js process per core in order to maximize your use of the CPU cycles available. Adding additional node.js processes beyond the number of cores will just incur unnecessary OS context switching costs and probably not help performance.
You only need enough parallel requests in flight at the same time to keep the CPU fed with work. Having lots of excess requests in flight beyond what is needed to feed the CPU just increases memory usage beyond what is helpful. If you have enough memory to hold the excess requests, it isn't harmful to have more, but it isn't helpful either. So, ideally you'd set things to have a few more requests in flight at a time than are needed to keep the CPU busy.
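As a rough sketch of the "open only as many simultaneous requests as actually benefit you" point (using only the built-in https module; the concurrency value and the URL list are placeholders you would tune and fill in), each node.js process can run a small fixed number of fetch loops so there are never more requests in flight than that number:

```js
const https = require('https');

const CONCURRENCY = 10;                 // assumed starting point; benchmark and adjust
const urls = [];                        // fill with this process's share of the URLs

function get(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve({ url, status: res.statusCode, body }));
    }).on('error', reject);
  });
}

async function run() {
  let next = 0;
  async function fetchLoop() {
    while (next < urls.length) {
      const url = urls[next++];         // no race here: node.js code is single threaded
      try {
        const result = await get(url);
        // ... process the response here ...
      } catch (err) {
        // ... record the failure and continue ...
      }
    }
  }
  // Start CONCURRENCY loops; each keeps exactly one request in flight at a time.
  await Promise.all(Array.from({ length: CONCURRENCY }, fetchLoop));
}

run();
```

Raise or lower CONCURRENCY based on your benchmarking; the right number is the smallest one that keeps the CPU and the network pipe busy.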
After reading about event loops and how async works in node.js, this is my understanding of node.js:
Node actually runs processes one at a time and not simultaneously.
Node really shines when multiple database I/O tasks are called.
It runs faster (than blocking I/O) because it doesn't wait for the response of one call before dealing with the next call. And while dealing with the other call, when the result of the first call arrives, it "gets back to it", basically going back and forth crossing calls and callbacks, without leaving the OS process idle, as opposed to what blocking I/O does. Please correct me if I'm wrong.
But here's my question:
Non-blocking I/O seems to be faster than blocking I/O only if the entity (server/process/thread?) that handles the request sent by node is not the node server itself.
What would be the cases when the server handling the request is the same server making the request? If my first bullet is correct, would blocking I/O in this case work faster than non-blocking if it used different threads for the task?
Would file compression be an example to such I/O task that works faster on multithreaded blocking I/O?
The main benefit of non-blocking operations is that a relatively heavyweight CPU thread is not kept busy while the server is waiting for something to happen elsewhere (networking, disk I/O, etc...). This means that many different requests can be "in-flight" with only the single CPU thread and no thread is stuck waiting for I/O. A burden is placed back on the developer to write async-friendly code and to use async I/O operations, but in a heavy I/O bound operation, there can be a real benefit to server scalability. The single thread model also really simplifies access to shared resources since there is far, far less opportunity for threading conflicts, deadlocks, etc... This can result in fewer hard-to-find thread synchronization bugs that tend to only nail your server at the worst time (e.g. when it's busy).
Yes, non-blocking I/O only really helps if the agent handling the I/O operation is not node.js itself because the whole point of non-blocking I/O in node is that node is free to use its single thread to go do other things while the I/O operation is running and if it's node that is serving the I/O operation then that wouldn't be true.
Sorry, but I don't understand the part of your question about file compression. File compression takes a certain amount of CPU, no matter who handles it and there are a bunch of different considerations if you were trying to decide whether to handle it inside of node itself or in an outside process (running a different thread). That isn't a simple question. I'd probably start with using whatever code I already had for the compression (e.g. use node code if that's what you had or an external library/process if that's what you had) and only investigate a different option if you actually ran into a performance or scalability issue or knew you had an issue.
FYI, a simple mechanism for handling compression would be to spool the uncompressed data to files in a temporary directory from your node.js app and then have another process (which could be written in any system, even include node) that just looks for files in the temporary directory to which it applies the compression and then does something more permanent with the resulting compressed data.
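Here is a minimal sketch of that spool-directory approach, using only the built-in fs, zlib and stream modules (the directory paths and the 5-second polling interval are assumptions):

```js
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');
const { pipeline } = require('stream');

const SPOOL_DIR = '/tmp/spool';       // where the node.js app writes uncompressed files
const OUT_DIR = '/tmp/compressed';    // where the compressed results end up

function compressOne(file) {
  return new Promise((resolve, reject) => {
    const src = path.join(SPOOL_DIR, file);
    const dest = path.join(OUT_DIR, file + '.gz');
    pipeline(
      fs.createReadStream(src),
      zlib.createGzip(),
      fs.createWriteStream(dest),
      (err) => {
        if (err) return reject(err);
        // Remove the original only after the compressed copy is safely written.
        fs.unlink(src, (unlinkErr) => (unlinkErr ? reject(unlinkErr) : resolve(dest)));
      }
    );
  });
}

async function scanLoop() {
  for (const file of fs.readdirSync(SPOOL_DIR)) {
    await compressOne(file);
  }
  setTimeout(scanLoop, 5000);          // poll again in 5 seconds; fs.watch is an alternative
}

scanLoop();
```

Because the compression runs in its own process, it never blocks the event loop of the web-facing node.js app, which is the whole point of spooling to disk first.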
A Node.js server works on an event-based model where callback functions are supported. But I am not able to understand how it is better than a traditional thread-based server where threads wait for system IO. In the thread-based model, when a thread needs to wait for IO, it gets preempted, so it doesn't consume CPU cycles and hence doesn't contribute to wait time.
How does Node.js improve wait time?
when a thread needs to wait for IO, it gets preempted
Actually, it's not preempted. Preemption is something completely different. What happens is that the thread is blocked.
For an event based model something similar happens. Event based interpreters are basically state machines. Only, the state machine is abstracted away and is not visible to the user. When something is waiting for an event it passes the control back to the interpreter. When the interpreter has nothing else to process it blocks itself waiting for I/O. Only, unlike traditional threading code the interpreter waits for multiple I/O.
What's happening at the C level is that the interpreter is using something like select(), poll(), epoll() and friends (depends on the OS and library installed) to do the blocking and waiting for I/O.
Now, why does a select()/poll() based mechanism generally perform better? Actually, 'generally' here depends on what you mean. A select() based server executes all code in a single process/thread. The biggest performance gain from this is that it avoids context switching - every time the OS transfers control over from one thread to another it has to save all the relevant registers, memory map, stack pointers, FPU context etc. so that the other thread can resume execution where it left off. The overhead of doing this can be quite significant.
In fact, there is a historical example of how extreme the overhead can be. Back in the early 2000s someone started benchmarking web servers. To the surprise of everyone, tclhttpd outperformed Apache for serving static files. Now, Tcl is not only an interpreted language, but back in 2000 it was a very slow interpreted language because it didn't have a separate compilation phase (it sort of does now). Tcl scripts are interpreted directly in string form, making them around 400x slower than C. Apache is obviously written in C, so what was making tclhttpd faster?
It turned out that tclhttpd is event based, running on only a single thread, while Apache was multithreaded. The overhead of constant thread switching turned out to give tclhttpd enough of an advantage to perform better than Apache.
Of course, there is always a compromise. A single threaded server like tclhttpd or node.js cannot take advantage of multiple CPUs. Back in the early 2000s multiple CPUs were uncommon. These days they are almost default. Not to mention that most CPUs are also hyperthreaded (hyperthreading adds hardware to the CPU to make context switching cheap).
The best servers these days have learned from history and are a combination of both. Apache2 and Nginx use thread pools: they are multithreaded, but each thread serves more than a single connection. This is a hybrid of the two approaches, but it is more complex to manage.
Read the following article for a more in-depth discussion on this topic: The C10K problem
Threads are relatively heavy-weight objects that have a resource footprint extending all the way into the kernel. When you park a thread in a blocking syscall or on a mutex or condition variable, you are tying up all those resources but doing nothing. Now the OS has to find more resources so your program can create another thread... Then you idle them too. It doesn't take long before the OS is struggling to scavenge more resources for your program to waste.
CPU time is just one small part of the bigger picture. :-)
Simply put:
In a threaded server, no matter how many threads you have, you can always have that many threads waiting for IO.
In node, no matter how many IO operations are pending, you always have your event loop ready to do the next thing.
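A tiny sketch makes the node side of that visible: with the I/O waits simulated by setTimeout, a single process can have a thousand of them pending at once and still finish in roughly the time of one, because no thread is parked per wait (the count and delay are arbitrary):

```js
// 1000 pending "I/O" waits tie up no threads; the one event loop stays free.
function fakeIo(id) {
  return new Promise((resolve) => setTimeout(() => resolve(id), 1000));
}

console.time('1000 concurrent waits');
Promise.all(Array.from({ length: 1000 }, (_, i) => fakeIo(i))).then(() => {
  // All 1000 waits overlap, so this finishes in about 1 second, not 1000 seconds.
  console.timeEnd('1000 concurrent waits');
});
```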
When you have a lot of threads, you are going to have a lot of context switching, which is expensive. You won't have this overhead when using node.js's event loop.
Context switch
A context switch is the computing process of storing and restoring the state (context) of a CPU so that execution can be resumed from the same point at a later time.
Event loop
In computer science, the event loop, message dispatcher, message loop or message pump is a programming construct that waits for and dispatches events or messages in a program.
I think you are full of myths regarding threads and the cost of context switching.
Discover the truth for yourself.