I am using my Node.js- and jsdom-based getratings.js script from the getratings GitHub project to scrape user reviews from sites like NewEgg, BestBuy, etc.
The script is hosted on an EC2 micro instance. It works fine until more than around 12 simultaneous requests are sent to the service. Beyond that, resource and memory utilization on the host is very high and responses to the client take forever.
I've tried to take care of memory leaks. Once it's done processing requests, the memory usage does eventually go down, but the usage peaks are very high.
I was wondering if there is something I can do to make the processing of HTML through jsdom more efficient in terms of resource utilization.
Paul's answer is very useful for scaling up with node, but I think Amazon EC2 is to blame here. The micro instance gets throttled after a certain (small) amount of CPU burst. See: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html
You might want to do manual garbage collection if you look at this site. He gives a general way of doing it.
My suggestion would be to clean up after every request, but maybe that might be too much. You might also need to increase the number of sockets that can be created by your machine. Again, the code is on the site listed below. I'll post the basics that I took from his article.
http://dev.caustik.com/websvn/filedetails.php?repname=sprites&path=%2Fsprites-server%2Ftrunk%2Freadme.txt
You need to increase ulimit to handle more than ~1024 connections:
ulimit -n 1048576
You also need to run node using this command line:
node --trace-gc --expose-gc --nouse-idle-notification sprites.js
You will need to call the gc() function to garbage collect thereafter.
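For what it's worth, the call itself is a one-liner once --expose-gc is on. Here is a minimal sketch (the request handler body is a stand-in, not your actual scraping code) that collects garbage as soon as each request is finished instead of waiting for an idle notification:

var http = require('http');

http.createServer(function (req, res) {
  // ... the jsdom scraping work for this request would happen here ...
  res.end('done');

  // global.gc only exists when node is started with --expose-gc;
  // calling it right after a request frees jsdom's garbage immediately.
  if (global.gc) {
    global.gc();
  }
}).listen(8080);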
I'm developing an app that should receive a .CSV file, save it, scan it, insert the data of every record into the DB, and at the end delete the file.
With a file of about 10,000 records there are no problems, but with a larger file the PHP script runs correctly and all the data is saved into the DB, yet "ERROR 504 The server didn't respond in time." is printed.
I'm scanning the .CSV file with the PHP function fgetcsv().
I've already edited the settings in the php.ini file (max execution time (120), etc.) but nothing changes; after 1 minute the error is shown.
I've also tried using a JavaScript function to show an alert every 10 seconds, but in this case too the error is shown.
Is there a solution to avoid this problem? Is it possible to pass some data from server to client every few seconds to avoid the error?
Thanks
It's typically when scaling issues pop up that you need to start evolving your system architecture, and your application will need to work asynchronously. The problem you are having is very common (some of my team are dealing with one as I write), but everyone needs to deal with it eventually.
Solution 1: Cron Job
The most common solution is to create a cron job that periodically scans a queue for new work to do. I won't explain the nature of the queue since everyone has their own; some are alright and others are really bad, but typically it involves a DB table with relevant information and a job status (<-- one of the bad solutions), or a solution involving Memcached; MongoDB is also quite popular.
The "problem" with this solution is ultimately, again, "scaling". Cron jobs run periodically at fixed intervals, so if a task takes a particularly long time, jobs are likely to overlap. This means you need to work in some kind of locking or utilize a scheduler that supports running the jobs sequentially.
In the end, you won't run into the timeout problem, and you can typically dedicate an entire machine to running these tasks so memory isn't as much of an issue either.
Solution 2: Worker Delegation
I'll use Gearman as an example for this solution, but other tools cover standards like AMQP, such as RabbitMQ. I prefer Gearman because it's simpler to set up, and it's designed more for work processing than for messaging.
This kind of delegation has the advantage of running immediately after you call it. The server is basically waiting for stuff to do (not unlike an Apache server); when it gets a request, it shifts the workload from the client onto one of your "workers", which are scripts you've written that run indefinitely, listening to the server for work.
You can have as many of these workers as you like, each running the same or different types of tasks. This means scaling is determined by the number of workers you have, and this scales horizontally very cleanly.
Conclusion:
Crons are fine, in my opinion, for automated maintenance, but they run into problems when they need to work concurrently, which makes running workers the ideal choice.
Either way, you are going to need to change the way users receive feedback on their requests. They will need to be informed that their request is processing and to check back later to get the result; alternatively, you can periodically track the status of the running task to provide real-time feedback to the user via AJAX. That's a little tricky with cron jobs, since you will need to persist the state of the task during its execution, but Gearman has a nice built-in solution for doing just that.
http://php.net/manual/en/book.gearman.php
I am creating a web application using the Node.js, socket.io & express modules, and I would like to find out about peak loads on the server. Every user makes 5-7 requests per second.
Server:
RAM 2GB
CPU 2x2 GHz
How many connections can this server process?
Is it necessary to use Web Workers?
What recommendations can you give concerning high load? Or maybe you know some statistical information.
This is an almost impossible question to answer, since no one knows what/how you are going to develop. In case you need a general number that you can compare to your system and case, you can check this link for single node process, cluster and multithreaded benchmarks for Node and the jxcore distro.
The basics
Right now a few of my friends and I are trying to develop a browser game made in Node.js. It's a multiplayer top-down shooter, and most of both the client-side and server-side code is in JavaScript. We have a good general direction that we'd like to go in, and we're having a lot of fun developing the game. One of our goals when making this game was to make it as hard as possible to cheat. To do that, we have all of the game logic handled server-side. The client only sends their input to the server via web socket, and the server updates the client (also via web socket) with what is happening in the game. Here's the start of our problem.
All of the server-side math is getting pretty hefty, and we're finding that we need to scale in some way to handle anything more than 10 players (we want to be able to host many more). At first we figured that we could just scale vertically as we needed to, but since Node.js is single threaded, it can only take advantage of one core. This means that getting a beefier server won't help that problem. Our only solution is to scale horizontally.
Why we're asking here
We haven't been able to find any good examples of how to scale out a Node.js game. Our use case is pretty particular, and while we've done our best to do this by ourselves, we could really benefit from outside opinions and advice.
Details
We've already put a LOT of thought into how to solve this problem. We've been working on it for over a week. Here's what we have put together so far:
Four types of servers
We're splitting tasks into 4 different 'types' of servers. Each one will have a specific task it completes.
The proxy server
The proxy server would sit at the front of the entire stack, and be the only server directly accessible from the internet (there could potentially be more of these). It would have haproxy on it, and it would route all connections to the web servers. We chose haproxy because of its rich feature set, reliability, and nearly unbeatable speed.
The web server
The web servers would receive the web requests and serve all web pages. They would also handle lobby creation/management and game creation/management. To do this, they would tell the game servers what lobbies they have, what users are in each lobby, and info about the game they're going to play. The web servers would then update the game servers about user input, and the game servers would update the web servers (who would then update the clients) about what's happening in the game. The web servers would use TCP sockets to communicate with the game servers about any type of management, and they would use UDP sockets when communicating about game updates. This would all be done with Node.js.
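To make that concrete, here is a minimal sketch of the kind of plumbing described, using Node's built-in net and dgram modules. The ports, message shapes and roles are made up for illustration, not our actual protocol:

var net = require('net');
var dgram = require('dgram');

// Game server side: a TCP socket for lobby/game management messages
// (assumes each management message arrives in a single chunk; real code would frame them) ...
net.createServer(function (socket) {
  socket.on('data', function (chunk) {
    var msg = JSON.parse(chunk.toString()); // e.g. { type: 'createLobby', lobbyId: 1 }
    console.log('management message', msg);
  });
}).listen(7000);

// ... and a UDP socket for the high-frequency game updates.
var udp = dgram.createSocket('udp4');
udp.on('message', function (buf, rinfo) {
  console.log('player input from web server at ' + rinfo.address, buf.toString());
});
udp.bind(7001);

// Web server side: forward a player's input to the game server over UDP.
var client = dgram.createSocket('udp4');
var input = Buffer.from(JSON.stringify({ playerId: 42, keys: ['w', 'space'] }));
client.send(input, 0, input.length, 7001, '127.0.0.1');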
The game server
The game server would handle all the game math and variable updates about the game. The game servers also communicate with the db servers to record cool stats about players in game. This would be done with nodejs.
The db server
The db server would host the database. This part actually turned out to be the easiest since we found rethinkdb, the coolest db ever. This scales easily, and oddly enough, turned out to be the easiest part of scaling our application.
Some other details
If you're having trouble getting your head around our whole getup, look at this, it's a semi-accurate chart of how we think we'll scale.
If you're just curious, or think it might be helpful to look at our game, it's currently hosted in its un-scaled state here.
Some things we don't want
We don't want to use the cluster module of nodejs. It isn't stable (said here), and it doesn't scale to other servers, only other processors. We'd like to just take the leap to horizontal scaling.
Our question, summed up
We hope we're going in the right direction, and we've done our homework, but we're not certain. We could certainly take a few tips on how to do this the right way.
Thanks
I realize that this is a pretty long question, and making a well thought out answer will not be easy, but I would really appreciate it.
Thanks!!
Following are my spontaneous thoughts on your case:
Multicore usage
Node.js can scale with multiple cores as well. How, you can read, for example, here (or just think about it: you have one thread/process running on one core; what do you need to use multiple cores? Multiple threads or multiple processes. Push work from the main thread to other threads or processes and you are done).
I personally would say it is childish to develop an application which does not make use of multiple cores. If you make use of some background processes, ok, but if until now you only do work in the Node.js main event loop, you should definitely invest some time to make the app scalable over cores.
Implementing something like IPC is not that easy, by the way. You can do it, but if your case is complicated, maybe you are better off with the cluster module. This is obviously not your favorite, but just because something is called "experimental" does not mean it's trashy. Just give it a try; maybe you can even fix some bugs of the module along the way. It's most likely better to use broadly used software for complex problems than to reinvent the wheel.
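For reference, a minimal cluster setup really is only a few lines; this sketch just forks one worker per core and restarts any worker that dies:

var cluster = require('cluster');
var http = require('http');
var os = require('os');

if (cluster.isMaster) {
  // One worker per core; the master distributes incoming connections among them.
  os.cpus().forEach(function () {
    cluster.fork();
  });
  cluster.on('exit', function (worker) {
    console.log('worker ' + worker.process.pid + ' died, forking a replacement');
    cluster.fork();
  });
} else {
  http.createServer(function (req, res) {
    res.end('handled by worker ' + process.pid);
  }).listen(8000);
}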
You should also (if you do not already) think about (wise) usage of the nextTick functionality. This allows the main event loop to pause some CPU-intensive task and perform other work in the meanwhile. You can read about it, for example, here.
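To make the idea concrete, here is a small sketch that chunks a long computation so the event loop can handle I/O between slices. (In current Node versions setImmediate is the safer primitive for this than process.nextTick, which runs before pending I/O; the per-entity math here is a stand-in.)

// Update a big array of game entities in slices instead of blocking
// the event loop for one long pass.
function updateEntitiesInSlices(entities, sliceSize, done) {
  var i = 0;
  function nextSlice() {
    var end = Math.min(i + sliceSize, entities.length);
    for (; i < end; i++) {
      entities[i].hp += 1; // stand-in for the real per-entity computation
    }
    if (i < entities.length) {
      setImmediate(nextSlice); // yield to the event loop, then continue
    } else {
      done();
    }
  }
  nextSlice();
}

var entities = [];
for (var n = 0; n < 100000; n++) entities.push({ hp: 0 });
updateEntitiesInSlices(entities, 1000, function () {
  console.log('all entities updated');
});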
General thoughts on computations
You should definitely take a very close look at the algorithms of your game engine. You already noticed that this is your bottleneck right now, and computations really are the most critical part of almost every game. Scaling solves this problem in one way, but it introduces other problems. Also, you cannot throw "scaling" as a problem solver at everything and expect every problem to disappear.
Your best bet is to make your game code elegant and fast. Think about how to solve problems efficiently. If you cannot solve something efficiently in JavaScript, but the problem can easily be extracted, why not write a little C component instead? This counts as a separate process as well, which reduces the load on your main Node.js event loop.
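If you go that route, wiring an external native program into the event loop is straightforward with child_process; the physics-solver binary here is purely hypothetical:

var spawn = require('child_process').spawn;

// Hand a heavy computation to an external program and read the result back
// asynchronously, keeping the main event loop free in the meantime.
function solveCollisions(worldState, callback) {
  var child = spawn('./physics-solver'); // hypothetical C binary
  var output = '';

  child.stdout.on('data', function (chunk) { output += chunk; });
  child.on('close', function (code) {
    if (code !== 0) return callback(new Error('solver exited with code ' + code));
    callback(null, JSON.parse(output));
  });

  // Stream the current world state to the solver over stdin.
  child.stdin.end(JSON.stringify(worldState));
}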
Proxy?
Personally, I do not see the advantage of the proxy level right now. You do not seem to expect a large number of users, so you won't need to solve the problems a CDN solves, or whatever... it's okay to think about it, but I would not invest much time there right now.
Technically, there is a high chance your web server software provides proxy functionality anyway. So it is ok to have it on paper, but I would not plan for dedicated hardware right now.
Epilogue
The rest seems more or less fine to me.
Little late to the game, but take a look here: http://goldfirestudios.com/blog/136/Horizontally-Scaling-Node.js-and-WebSockets-with-Redis
You did not mention anything to do with memory management. As you know, Node.js doesn't share its memory with other processes, so an in-memory database is a must if you want to scale (Redis, Memcached, etc.). You need to set up a publisher & subscriber event on each node to accept incoming requests from Redis. This way, you can scale up to any number of servers (in front of your HAProxy) and utilize the data piped from Redis.
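As a rough illustration (assuming the node_redis client with its v3-style API), each instance subscribes to a shared channel and pushes whatever arrives to its own connected clients, while publishing the events it generates itself:

var redis = require('redis');

var sub = redis.createClient(); // dedicated connection for subscribing
var pub = redis.createClient(); // separate connection for publishing

sub.subscribe('game-events');
sub.on('message', function (channel, message) {
  var event = JSON.parse(message);
  // ... push the event to the socket.io clients connected to THIS instance ...
  console.log('received on ' + channel, event);
});

// Any instance can publish; every subscribed instance receives it.
pub.publish('game-events', JSON.stringify({ type: 'playerJoined', playerId: 7 }));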
There is also this Node addon: http://blog.varunajayasiri.com/shared-memory-with-nodejs It lets you share memory between processes, but it only works under Linux. This will help if you don't want to send data across local processes all the time or deal with Node's IPC API.
You can also fork child processes within Node for a new V8 isolate to help with expensive CPU-bound tasks. For example, players can kill monsters and obtain quite a bit of loot within my action RPG game. I have a child process called LootGenerater, and basically whenever a player kills a monster it sends the game id, mob_id, and user_id to the process via the default IPC API's .send. Once the child process receives it, it iterates over the large loot table, manages the items (stores them to Redis, or whatever) and pipes them back.
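The shape of that setup looks roughly like this; the file names and message fields are illustrative, not the game's actual code:

// parent.js -- main game process
var fork = require('child_process').fork;
var lootGenerator = fork('./loot-generator.js');

lootGenerator.on('message', function (msg) {
  // The child pipes the generated loot back; store it / notify the player here.
  console.log('loot for user ' + msg.user_id, msg.items);
});

function onMonsterKilled(gameId, mobId, userId) {
  lootGenerator.send({ game_id: gameId, mob_id: mobId, user_id: userId });
}

// loot-generator.js -- runs in its own V8 isolate, off the main event loop
process.on('message', function (msg) {
  var items = [{ name: 'rusty sword', mob_id: msg.mob_id }]; // stand-in for the big loot table
  process.send({ user_id: msg.user_id, items: items });
});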
This helps free up the event loop greatly, and is just one idea I can think of to help you scale. But most importantly, you will want to use an in-memory database system and make sure your game code architecture is designed around whatever database system you use. Don't make the mistake I did; I'm now having to re-write everything :)
Hope this helps!
Note: If you do decide to go with Memcache, you will need to utilize another pub/sub system.
I'm getting some traffic to a server of mine and I'm not sure how to deal with this problem.
I've added the nodetime to my app, and here's the result of a heap snapshot. Retainers > Other is up to 88% from 78% (in a matter of a couple of minutes)
Overall, the system's free memory decreases:
It's slow, but definitely happens. The jump up around 21:20 is when I restarted the server.
The server itself is basically collecting logs: it saves incoming requests to MongoDB, reads from MongoDB once and occasionally sets Redis key. In other words, it's a pretty simple set-up.
How do I track down what this buffer is? In addition, is there a list somewhere of basic don'ts that can cause this type of issue?
I should also mention that running stress tests with ab causes the server to consume proportionately more memory, so it's definitely a node.js issue and likely not another process that's eating up the memory.
Would it be helpful to dig through the code and rename as many anonymous functions as possible?
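In the meantime, a check as simple as the following (illustrative only) should at least tell me whether the growth is inside the V8 heap or outside it, since rss climbing while heapUsed stays flat would point at Buffers or native memory:

// Log memory every 10 seconds while hitting the server with ab.
setInterval(function () {
  var m = process.memoryUsage();
  console.log(
    'rss=' + (m.rss / 1048576).toFixed(1) + 'MB',
    'heapTotal=' + (m.heapTotal / 1048576).toFixed(1) + 'MB',
    'heapUsed=' + (m.heapUsed / 1048576).toFixed(1) + 'MB'
  );
}, 10000);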
After having learnt node, javascript and all the rest the hard way, I am finally about to release my first web app.
So I subscribed to Amazon Web Services and created a micro instance, planning on the first year free tier to allow me to make the app available to the world.
My concern is more about hidden costs. I know that with the free tier comes 1 million I/O requests per month for the Amazon EC2 EBS.
Thing is, I started testing my app on the EC2 instance to check that everything was running fine, and I am already at more than 100,000 I/O requests. And I have basically been the only one using it so far (the instance has been running for 37 hours).
So I am quite afraid of what could happen if my app gets some traffic, and I don't want to end up with a huge unexpected bill at the end of the month.
I find it quite surprising, because I mainly serve static stuff, and my server-side code consists of:
Receiving a search request from a client
1 HTTP request to a website
1 HTTPS request to the YouTube API
Saving the data to MongoDB
Sending the results to the client
Do you have any advice on how to dramatically reduce my IO?
I do not use any other Amazon services so far; maybe I am missing something?
Or maybe the Amazon free tier is not enough in my case, but then what can it be enough for? I mean, my app is really simple after all.
I'd be really glad for any help you could provide me
Thanks!
You did not mention the total number of visits to your app, so I am assuming you have fairly few visits.
What are I/O requests?
A single I/O request is a read/write instruction that reaches the EBS volume. Beware! Large reads/writes are broken into multiple smaller pieces, each the block size of the volume.
Possible reasons of high I/O:
Your app uses a lot of RAM. After you hit the limit, the OS starts constantly swapping memory to and from the swap area on your disk.
This is most likely the problem: the MongoDB search. MongoDB searches can internally be long, complex queries. In one of the answers to this question, the person was using MySQL and it caused him 1 billion I/O requests in 24 days. So 1 database search can mean many I/O requests.
Cache is disabled, or you write/modify a lot of files. You mentioned you were testing. The free tier is just not suitable for developing stuff.
You should read this, in case you want to know what happens after free-tier expires.
I recently ran into a similar situation, recording very high I/O request rates for a website with little to no traffic. The culprit seems to be a variation of what @prajwalkman discovered testing Chef deployments on a micro instance.
I am not using Chef, but I have been using boto3, Docker and Git to automatically 'build' test images inside a micro instance. Each time I go through my test script, a new image is built, and I was not careful to read the fine print regarding the default settings of the VolumeType argument of the boto3 run_instances command. Each test image was being built with the 'standard' volume type which, according to current EBS pricing, bills at the rate of $0.05/million I/Os, whereas the 'gp2' general-purpose SSD type has a flat cost of $0.10 per GB per month with no extra charge for I/O.
With a handful of lean Docker containers taking up a total of 2GB, on top of the 1.3GB for the amazon-ecs-optimized-ami, my storage is well within free-tier usage. So, once I fixed the VolumeType attribute in the BlockDeviceMappings settings in my scripts to 'gp2', I no longer had an I/O problem on my server.
Prior to that point, the constant downloading of docker images and git repos produced nearly 10m I/Os in less than a week.
The micro instance and the free tier is meant for testing their offerings, not a free way for you to host your site/web application.
You may have to pay money at the end of the month, but I really doubt if you can get away with paying less by using some other company for hosting. AFAIK AWS really is the rock bottom of the price charts.
As for the IO requests themselves, it's hard to give generic advice. I once was in a situation where my micro instance racked up ridiculous number of IO requests. Turns out testing Chef deployments on EC2 is a bad idea.
I/O Requests have to do with reading and writing blocks to EBS volumes. You can reduce this by using as much in memory caching as possible. Micro instances only have about 613 MB of memory available, so you may not be able to do much here.
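For example, a tiny in-process cache in front of the database lookups (a sketch only; the names are made up) keeps repeated requests from touching the EBS volume at all:

var cache = {};             // term -> { expires, results }
var TTL_MS = 5 * 60 * 1000; // keep entries for five minutes

function cachedSearch(term, lookupFromDb, callback) {
  var hit = cache[term];
  if (hit && hit.expires > Date.now()) {
    return callback(null, hit.results); // served from RAM, no disk I/O
  }
  lookupFromDb(term, function (err, results) {
    if (err) return callback(err);
    cache[term] = { expires: Date.now() + TTL_MS, results: results };
    callback(null, results);
  });
}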
Ok, so it seems like I/O requests are related to the EBS volume, and that caching may reduce it.
Something I had not considered, though, is all the operations I performed to get my app running.
I updated the linux image, installed node and npm, several modules, mongodb, ....
This is likely to be the main cause of the I/O.
The number of requests hasn't grown much in the last few days, while the server stayed mostly idle.