My service returns responses containing very large JSON objects, around 60MB. After some profiling I found that almost all of the time is spent in the JSON.stringify() call used to convert the object to a string and send it as the response. I have tried custom implementations of stringify, and they are even slower.
This is quite a bottleneck for my service. I want to handle as many requests per second as possible; currently one request takes 700ms.
My questions are:
1) Can I optimize the response-sending part? Is there a more effective way than stringifying the object and sending it as the response?
2) Will using the async module and performing the JSON.stringify() in a separate thread improve the overall number of requests per second (given that over 90% of the time is spent in that call)?
You've got two options:
1) Find a JSON module that will allow you to stream the stringify operation and process it in chunks (see the sketch after this list). I don't know if such a module is out there; if it's not, you'd have to build it. EDIT: Thanks to Reinard Mavronicolas for pointing out JSONStream in the comments. I've actually had it on my back burner to look for something like this, for a different use case.
2) async does not use threads. You'd need to use cluster or some other actual threading module to drop the processing into a separate thread. The caveat here is that you're still processing a large amount of data; you gain some headroom by using threads, but depending on your traffic you may still hit a limit.
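For option 1, here is a minimal sketch using JSONStream, assuming the payload is (or can be restructured as) a large array of records. items, app and the route name are illustrative placeholders, and Readable.from needs Node 12+:

const JSONStream = require('JSONStream');
const { Readable } = require('stream');

app.get('/big-report', (req, res) => {
  res.setHeader('Content-Type', 'application/json');
  Readable.from(items)               // object-mode stream over the big array (Node 12+)
    .pipe(JSONStream.stringify())    // wraps the elements in '[', ',' and ']' as JSON text
    .pipe(res);                      // backpressure keeps memory flat; no single 60MB string
});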
Some years later, this question has a new answer for the first part: the yieldable-json lib.
As described in this talk by Gireesh Punathil (IBM India), this lib can process a 60MB JSON payload without blocking the Node.js event loop, letting you accept new requests and so improve your throughput.
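A hedged sketch of using it in a request handler; the callback signature follows the yieldable-json README as I recall it, and hugeObject, res and next are Express-style placeholders:

const yj = require('yieldable-json');

// Stringify cooperatively across event-loop turns instead of in one long blocking call.
yj.stringifyAsync(hugeObject, (err, jsonText) => {
  if (err) return next(err);
  res.setHeader('Content-Type', 'application/json');
  res.end(jsonText);
});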
For the second question, with Node.js 11 you can use worker threads (still experimental at that point) to increase your web server's throughput.
I'm building a tool whose core structure is: make an AJAX request to a Cloudflare Worker, which fetches HTML data and then returns it.
So the steps are:
Send request from client
Worker receives the request and makes another one, which returns a response as a typical HTML document.
Aaaand on the third step I have two options:
to return the obtained HTML back via AJAX response and then parse it on client
to parse HTML first, and then return processed data via AJAX response
The first one is straightforward: I receive the response from my worker, insert the returned HTML somewhere in a hidden <div> and then parse it.
The reason I would prefer to go with the second one, though, is to avoid wasting bandwidth when delivering the HTML from the Cloudflare Worker back to the client, because the original page has a lot of irrelevant bloat. For example, the original page looks something like this:
<div class="very-much-bloat" id="some-other-bloat" useful_parameter="value">
  <div class="some-other-irrelevant-info" id="really-great-id">
    something that I need
  </div>
</div>
And all that I need from this is, for example
{
"really-great-id" : "something that I need",
"useful_parameter" : "value"
}
If I go with the first option, it would be pretty straightforward to parse it in-browser; however, I'll waste bandwidth delivering a lot of information that is later thrown away.
However, if the second one involves using complex libraries, it probably isn't the way to go, since the max execution time per request is 10ms (that's the free plan on Cloudflare, which is otherwise plenty: 100,000 requests per day is more than I'll probably ever need for this app).
The question is: is there any efficient way to parse HTML on a Cloudflare Worker without breaking the 10ms time limit? The page size obtained by the worker is around 10-100KB, and the parsed data is around 1-10KB (roughly 10 times less than the original). While I understand that 100KB may not sound like a lot, it's still mostly garbage that's better to filter out as soon as possible.
Cloudflare Workers currently does not support the DOM API. However, it supports an alternative HTML parsing API that might work for you: HTMLRewriter
https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/
This API is different from DOM in that it operates in a streaming fashion: JavaScript callbacks are invoked as the HTML data streams in from the server, without ever holding the entire document in memory at one time. If it fits your use case, it may allow you to respond faster and use fewer resources than a DOM-based solution would. The CPU time used by HTMLRewriter itself does not even count against the 10ms limit -- only the time spent by your callbacks counts. So if you design your callbacks carefully, you should have no problem staying within the limit.
Note that HTMLRewriter is primarily designed to support modifying an HTML document as it streams through. However, it should not be too hard to have it consume the document and generate a completely different kind of data, like JSON. Essentially, you would set up the rewriter so that the "rewritten" HTML is discarded, and you'd have your callbacks separately populate some other data structure or write to some other stream that represents the final result.
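A hedged sketch of that pattern, keyed to the example markup in the question; the upstream URL and the exact selectors are assumptions:

addEventListener('fetch', event => {
  event.respondWith(handle(event.request));
});

async function handle(request) {
  const upstream = await fetch('https://example.com/page.html'); // assumed source page
  const result = { 'really-great-id': '', 'useful_parameter': '' };

  const rewriter = new HTMLRewriter()
    .on('div[useful_parameter]', {
      element(el) {
        result['useful_parameter'] = el.getAttribute('useful_parameter');
      }
    })
    .on('div#really-great-id', {
      text(chunk) {
        result['really-great-id'] += chunk.text; // text arrives in chunks
      }
    });

  // Consume the transformed body only to drive the parser; the rewritten HTML is discarded.
  await rewriter.transform(upstream).text();

  result['really-great-id'] = result['really-great-id'].trim();
  return new Response(JSON.stringify(result), {
    headers: { 'Content-Type': 'application/json' }
  });
}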
Scenario: I have a Node.js REST server which accepts a JSON file, parses it, and then adds it to a DB. I expect hundreds of hits per second.
Requirement: only insertions are done after parsing the JSON from the request. Since Node.js is single-threaded and JSON.parse is synchronous, how can I increase the performance? What would be the correct design pattern for maximum performance in Node.js?
Before designing a more complex server (maybe with worker threads), you need to profile the actual performance. The bottleneck might not be the JSON parsing.
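For example, timing the parse and the insert separately is usually enough to see which one dominates. A minimal sketch, assuming Express 4.17+ for express.text; db.insert stands in for whatever driver you actually use:

const express = require('express');
const app = express();

// Read the body as a raw string so JSON.parse can be timed on its own (Express 4.17+).
app.use(express.text({ type: 'application/json', limit: '10mb' }));

app.post('/ingest', async (req, res) => {
  console.time('parse');
  const doc = JSON.parse(req.body);
  console.timeEnd('parse');

  console.time('insert');
  await db.insert(doc);              // placeholder for your actual DB driver call
  console.timeEnd('insert');

  res.sendStatus(200);
});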
This is largely an 'am I doing it right / how can I do this better' kind of topic, with some concrete questions at the end. If you have other advice / remarks on the text below, even if I didn't specifically ask those questions, feel free to comment.
I have a MySQL table for users of my app that, along with a set of fixed columns, also has a text column containing a JSON config object. This is to store variable configuration data that cannot be stored in separate columns because it has different properties per user. There doesn't need to be any lookup / ordering / anything on the configuration data, so we decided this would be the best way to go.
When querying the database from my Node.JS app (running on Node 0.12.4), I assign the JSON text to an object and then use Object.defineProperty to create a getter property that parses the JSON string data when it is needed and adds it to the object.
The code looks like this:
user =
  uid: results[0].uid
  _c: results[0].user_config # JSON config data as string
Object.defineProperty user, 'config',
  get: ->
    @c = JSON.parse(@_c) if not @c?
    return @c
Edit: the above code is CoffeeScript; here's the (approximate) JavaScript equivalent for those of you who don't use CoffeeScript:
var user = {
uid: results[0].uid,
_c: results[0].user_config // JSON config data as string
};
Object.defineProperty(user, 'config', {
get: function() {
if(this.c === undefined){
this.c = JSON.parse(this._c);
}
return this.c;
}
});
I implemented it this way because parsing JSON blocks the Node event loop, and the config property is only needed about half the time (this is in a middleware function for an Express server), so this way the JSON is only parsed when it is actually needed. The config data itself ranges from 5 to around 50 different properties organised in a couple of nested objects; not a huge amount of data, but still more than just a few lines of JSON.
Additionally, there are three of these JSON objects (I only showed one since they're all basically the same, just with different data in them). Each one is needed in different scenarios but all of the scenarios depend on variables (some of which come from external sources) so at the point of this function it's impossible to know which ones will be necessary.
So I had a couple of questions about this approach that I hope you guys can answer.
Is there a negative performance impact when using Object.defineProperty, and if yes, is it possible that it could negate the benefit from not parsing the JSON data right away?
Am I correct in assuming that not parsing the JSON right away will actually improve performance? We're looking at a continuously high number of requests and we need to process these quickly and efficiently.
Right now the three JSON data sets come from two different tables JOINed in one SQL query. This is so we only have to do one query per request instead of up to four. Keeping in mind that there are scenarios where none of the JSON data is needed, but also scenarios where all three data sets are needed (and of course scenarios in between), could it be an improvement to only fetch the required JSON data from its table at the point when one of the data sets is actually needed? I avoided this because I feel like waiting for four separate SELECT queries would take longer than waiting for one query with two JOINed tables.
Are there other ways to approach this that would improve the general performance even more? (I know, this one's a bit of a subjective question, but ideas / suggestions of things I should check out are welcome). I'm not looking to spin off parsing the JSON data into a separate thread though, because as our service runs on a cluster of virtualised single-core servers, creating a child process would only increase overall CPU usage, which at high loads would have even more negative impact on performance.
Note: when I say performance it mainly means fast and efficient throughput rates. We prefer a somewhat larger memory footprint over heavier CPU usage.
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
- Donald Knuth
What do I get from that article? Too much time is spent in optimizing with dubious results instead of focusing on design and clarity.
It's true that JSON.parse blocks the event loop, but every synchronous call does - this is just code execution and is not a bad thing.
The root concern is not that it is blocking, but how long it is blocking. I remember a Strongloop instructor saying 10ms was a good rule of thumb for max execution time for a call in an app at cloud scale. >10ms is time to start optimizing - for apps at huge scale. Each app has to define that threshold.
So, how much execution time will your lazy init save? This article says it takes 1.5s to parse a 15MB json string - about 10,000 B/ms. 3 configs, 50 properties each, 30 bytes/k-v pair = 4500 bytes - about half a millisecond.
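The same arithmetic in code, using the throughput figure from the cited article rather than a measurement of your own:

// Back-of-envelope version of the numbers above (throughput figure from the cited article).
const bytesPerMs = (15 * 1024 * 1024) / 1500;   // ~10,000 B/ms JSON.parse throughput
const configBytes = 3 * 50 * 30;                // 3 configs x 50 props x ~30 bytes per pair
console.log((configBytes / bytesPerMs).toFixed(2) + ' ms'); // ~0.43 ms saved per request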
When the time comes to optimize, I would look at having your lazy init do the MySQL call. A config is needed only 50% of the time, it won't block the event loop, and an external call to a DB absolutely dwarfs a JSON.parse().
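A hedged sketch of that idea; db.query, the table and column names are placeholders, and because the getter now hides an async call it returns a Promise that callers have to await:

Object.defineProperty(user, 'config', {
  get: function () {
    if (this._configPromise === undefined) {
      // Defer the expensive part: the DB round-trip itself, not just JSON.parse.
      this._configPromise = db
        .query('SELECT user_config FROM users WHERE uid = ?', [this.uid])
        .then(function (rows) { return JSON.parse(rows[0].user_config); });
    }
    return this._configPromise;
  }
});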
All of this to say: What you are doing is not necessarily bad or wrong, but if the whole app is littered with these types of dubious optimizations, how does that impact feature addition and maintenance? The biggest problems I see revolve around time to market, not speed. Code complexity increases time to market.
Q1: Is there a negative performance impact when using Object.defineProperty...
Check out this site for a hint.
Q2: ...not parsing the JSON right away will actually improve performance...
IMHO: inconsequentially
Q3: Right now the three JSON data sets come from two different tables...
The majority of DB query cost is usually the out-of-process call and the network data transport (unless you have a really bad schema or config). All data in one call is the right move.
Q4: Are there other ways to approach this that would improve the general performance
Impossible to tell. The place to start is with an observed behavior, then profiler tools to identify the culprit, then code optimization.
For maximum load speed and page efficiency, is it better to have:
An 18MB JSON file, containing an array of dictionaries, that I can load and start using as a native JavaScript object (e.g. var myname = jsonobj[1]['name']).
A 4MB CSV file, that I need to read using the jquery.csv plugin, and then use lookups to refer to the data (e.g. var nameidx = titles.getPos('name'); var myname = jsonobj[1][nameidx]).
I'm not really expecting anyone to give me a definitive answer, but a general suspicion would be very useful. Or tips for how to measure - perhaps I can check the trade-off between load speed and efficiency using Developer Tools.
My suspicion is that any extra efficiency from using a native JavaScript object in (1) will be outweighed by the much smaller size of the CSV file, but I would like to know if others think the same.
Did you consider delivering the JSON content using gzip? Here are some benchmarks on gzip: http://www.cowtowncoder.com/blog/archives/2009/05/entry_263.html
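A minimal sketch of serving the JSON gzipped, assuming an Express app and the standard compression middleware; any server-side gzip (nginx, Apache, etc.) achieves the same thing:

const express = require('express');
const compression = require('compression');

const app = express();
app.use(compression());   // gzips responses when the client sends Accept-Encoding: gzip

app.get('/data.json', (req, res) => {
  res.sendFile(__dirname + '/data.json');   // 18MB on disk, far less on the wire
});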
What is your situation? Are you writing some intranet site where you know what browser users are using and have something like a reasonable expectation of bandwidth, or is this a public-facing site?
If you have control of what browsers people use, for example because they're your employees, consider taking advantage of client-side caching. If you're trying to convince people to use this data you should probably consider breaking the data up into chunks and serving it via XHR.
If you really need to serve it all at once then:
Use gzip
Are you doing heavy processing of the data on the client side? How many of the items are you actually likely to go through? If you're only likely to access fewer than 1,000 of them in any given session then I would imagine that the 14MB savings would be worth it. If on the other hand you're comparing all kinds of things against each other all the time (because you're doing some sort of visualization or... anything) then I imagine that the JSON would pay off.
In other words: it depends. Benchmark it.
4MB vs 18MB? Where is the problem? JSON is just a standard format now; CSV may be just as good, and it's fine if you're using it. My opinion.
14MB of data is a HUGE difference, but I would first try to serve both contents with GZIP/Deflate server-side compression and then compare the two requests (the CSV request will probably still win on content length).
Then I would also create some data-manipulation tests on jsPerf, with both CSV and JSON data, using a real test case / common usage.
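Even before jsPerf, a rough console.time comparison gives a first number for the parse cost. In this sketch, jsonText and csvText are assumed to be the two payloads already loaded as strings, and $.csv.toArrays is the jquery-csv call (treat the exact name as an assumption):

console.time('parse json');
var objData = JSON.parse(jsonText);        // the 18MB payload, already fetched as a string
console.timeEnd('parse json');

console.time('parse csv');
var csvData = $.csv.toArrays(csvText);     // the 4MB payload, already fetched as a string
console.timeEnd('parse csv');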
That depends a lot on the bandwidth of the connection to the user.
Unless this is only going to be used by people who have a super fast connection to the server, I would say that the best option would be an even smaller file that only contains the actual information that you need to display right away, and then load more data as needed.
It's a simple case of JavaScript that continuously asks "are we there yet?", like a four-year-old on a car ride. But, much like parents, if you do this too often, or with too many kids at once, the server will buckle under the pressure.
How do you solve the issue of having a webpage that looks for new content on the order of every 5 seconds, while still allowing for a larger number of visitors?
Stack Overflow does it somehow; I don't know how, though.
The more standard way would indeed be the javascript that looks for new content every few seconds.
A more advanced way would use a push-like technique, by using Comet techniques (long-polling and such). There's a lot of interesting stuff under that link.
I'm still waiting for a good opportunity to use it myself...
Oh, and here's a link from stackoverflow about it:
Is there some way to PUSH data from web server to browser?
In Java I used an Ajax library (DWR) that uses Comet technology; I think you should look for a PHP library that supports it.
The idea is that the server sends one very long HTTP response, and when it has something to send to the client, it ends that response and sends a new one with the updated data.
This way the client doesn't have to ping the server every x seconds to get new data; I think it could help you.
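For what it's worth, the client side of that long-polling loop is small. A hedged sketch, where '/wait-for-update' and render() are placeholders and fetch stands in for whatever XHR wrapper you use:

function longPoll() {
  fetch('/wait-for-update')                // the server holds this request open until it has news
    .then(function (r) { return r.json(); })
    .then(function (data) {
      render(data);                        // placeholder for whatever updates the page
      longPoll();                          // reconnect immediately and wait again
    })
    .catch(function () {
      setTimeout(longPoll, 5000);          // back off briefly if the connection drops
    });
}
longPoll();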
You could make the poll time variable depending on the number of clients. Using your metaphor, the kid asks "Are we there yet?" and the driver responds "No, but maybe in an hour". Thankfully, Javascript isn't a stubborn kid so you can be sure he won't bug you until then.
You could consider polling every 5 seconds to start with, but after a while start to increase the poll interval time, perhaps up to some upper limit (1 minute, 5 minutes, whatever seems optimal for your usage). The increase doesn't have to be linear.
A more sophisticated spin (which could incorporate monzee's suggestion to vary by number of clients) would be to allow the server to dictate the interval before the next poll, as sketched below. The server could then increase the interval over time, and you can even change the algorithm on the fly, or in response to network load.
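A hedged sketch of such a poll loop with backoff and an optional server-dictated interval; the '/updates' endpoint, data.changed, data.nextPollMs and render() are all illustrative assumptions:

var delay = 5000;                          // start by polling every 5 seconds
var maxDelay = 5 * 60 * 1000;              // cap at 5 minutes
var lastTimestamp = 0;                     // whatever change marker your backend understands

function poll() {
  fetch('/updates?since=' + lastTimestamp)
    .then(function (r) { return r.json(); })
    .then(function (data) {
      if (data.changed) {
        render(data);                      // placeholder for the page-update logic
        lastTimestamp = data.timestamp;
        delay = 5000;                      // something happened: speed back up
      } else {
        delay = Math.min(delay * 1.5, maxDelay); // nothing new: back off
      }
      if (data.nextPollMs) delay = data.nextPollMs; // let the server dictate the interval
      setTimeout(poll, delay);
    })
    .catch(function () { setTimeout(poll, delay); });
}

setTimeout(poll, delay);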
You could take a look at the 'Twisted' framework in Python. It's an event-driven network programming framework that might satisfy what you are looking for. It can be used to push messages from the server.
Perhaps you can send a query to a really simple script that doesn't need to make a real DB query, but only uses a simple timestamp to tell whether there is anything new.
And then, if the answer is true, you can do the real query, where the server has to do real work!
I would have a single instance calling the DB and, if a newer timestamp exists, put that new timestamp in an application variable. Then let all sessions check against that application variable, or something like that. That way only one instance is calling the SQL server and the number of clients doesn't matter.
I haven't tried this and it's just the first idea off the top of my head, but I think caching the timestamp and letting the clients check the cache is the way to do it. As for how to implement the cache (SQL Server cache, application variable and so on), I don't know what's best.
Regarding how SO does it, note that it doesn't check for new answers continuously, only when you're typing into the "Your Answer" box.
The key, then, is to first do a computationally cheap operation to weed out the common "no update needed" cases (e.g., checking a timestamp, or only checking while the user is entering a new answer) before initiating a more expensive process to actually retrieve any changes.
Alternately, depending on your application, you may be able to resolve this by optimizing your change-publishing mechanism. For example, perhaps it might be feasible for changes (or summaries of them) to be put onto an RSS feed and have clients watch the feed instead of the real application. We can assume that this would be fairly efficient, as it's exactly the sort of thing RSS is designed and optimized for, plus it would have the additional benefit of making your application much more interoperable with the rest of the world at little or no cost to you.
I believe the approach should be based on a combination of server-side sockets and client-side AJAX/Comet. Like this:
Assume a chat application with several logged-on users, each of them listening via a slow-load AJAX call to a server-side listener script.
Whichever browser receives the just-entered data submits it to the server with an AJAX call to a writer script. The server updates the database (or storage system) and posts a socket write to the aforementioned listener script. The latter then gets the fresh data and posts it back to the client browser.
Now I haven't yet written this, and right now I dunno whether/how the browser limit of two concurrent connections screws up the above logic.
Will appreciate hearing from anyone with thoughts here.
AS