So we have a design challenge: an absolutely clean slate to develop a system which presents the processing results of various social networking feeds like Twitter and Facebook on the web and via a REST-style API. The processing part has already been completed; however, we now need somewhere to store the results.
The result format looks something like: a message ID, the date of the message, the processed timestamp, and then a collection of various processing scores. There will be around 200 million messages in this database. So the first thing we need is something to store this data. We are thinking a NoSQL document database might be interesting to try, given that we need to be able to select over a range of dates, which discounts column-family-style databases (as I believe key-range scanning in HBase is slow). Or the better option may be simply to store this data in good old MySQL or VoltDB. Does anyone have example use cases or stories about their implementation of such a system?
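For concreteness, the kind of query we need would look something like this in a document store (purely a sketch, assuming MongoDB and its official Node.js driver; database, collection, and field names are placeholders):

```js
// Sketch: date-range query over processed messages in a document store.
const { MongoClient } = require('mongodb');

async function messagesInRange(from, to) {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  try {
    const messages = client.db('feeds').collection('messages');
    // An index on the date field keeps range scans cheap even at ~200M documents.
    await messages.createIndex({ messageDate: 1 });
    return await messages
      .find({ messageDate: { $gte: from, $lt: to } })
      .project({ messageId: 1, messageDate: 1, scores: 1 })
      .toArray();
  } finally {
    await client.close();
  }
}
```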
The next thing will be to develop a web application. We need a charting service which can take data in real-time and update the interface. We are thinking of using HighCharts for this purpose. Is there anything better?
Finally, we need some sort of API service which can act like a Comet application and stream data, something like Twitter's streaming API. I was thinking the best option for this would be Node.js.
So I guess the questions are: are the technologies we have selected the best for the job, are there any good example use cases out there, and is there anything anyone would recommend?
Cheers!
About storage: there are four types of NoSQL storage: key/value, column database, document database, and graph database. Each one is slower than the previous one but also gives you more features. If you only need to store data, a key/value or column database is your choice; with that type of storage, data processing is done by hand and you may need some kind of MapReduce implementation, maybe Hadoop. Document and graph databases give you some kind of query capability, so you can move part of the data processing into the database (e.g. date filters). If I had to choose a NoSQL store, I would run tests with a graph database (e.g. Neo4j) and, if I hit performance issues, switch to a column database (e.g. Cassandra) plus MapReduce.
About charts: HighCharts seems like a good option. I don't know about SVG browser support or whether there are performance issues, but on my machine it looks very nice.
About data streaming: I have only a little experience, and only with Node.js, so it would be my first choice. There are a few other implementations, like Tornado for Python and Misultin, Mochiweb, and Cowboy for Erlang. I found a link with a benchmark of these servers, and it seems the Erlang servers are faster than Node.js. You could also look at them.
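As a rough illustration of the streaming side, a Comet-style endpoint in plain Node.js (no framework) could look like the sketch below; the names and the fake producer are only for demonstration:

```js
// Sketch: clients keep the HTTP connection open and receive one JSON line per
// result, similar in spirit to Twitter's streaming API.
const http = require('http');

const clients = new Set();

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  clients.add(res);
  req.on('close', () => clients.delete(res));
}).listen(8080);

// Wherever new processing results arrive, push them to every open connection.
function broadcast(result) {
  const line = JSON.stringify(result) + '\n';
  for (const res of clients) res.write(line);
}

// Hypothetical producer: emit a fake score every second for demonstration.
setInterval(() => broadcast({ messageId: Date.now(), score: Math.random() }), 1000);
```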
You can also use Solr/Lucene with sharding. Throughput can be increased by having a master/slave Solr setup.
Update: I am getting the impression that this is not even the right website to post this. If someone can point me in the right direction, I'd be appreciative...
I have an existing PHP+MySQL application that wasn't built to render "real-time" or similarly live-style data. But now I need to build in a way to pull nearly real-time data into the application and keep the data on the page fresh. This live data is only for 1 page in the application.
I've looked at things like socket.io and PHP-based WebSocket libraries, but they seemed like overkill because the data is basically coming from one source and being delivered to one person (the client). Multiple other users could have this process running, but each one would bring their own data endpoint. That's... like a year down the road, but good to think about. Ideally there would be hundreds, or thousands, of users on the system pulling their live-ish data, so I want this to be as streamlined and low-impact as possible.
Users must be authenticated and authorized to consume the data. This is already baked into the current system.
The API to get the data (which has already been built by another vendor) is also NOT streaming. It's set on a 20-second cron, so the new data is available every 20 seconds, which satisfies the client's needs.
My current plan is to do something like this...
1. Data is pulled on a cron every 20 seconds, organized, and stored in the database (complete).
2. Adjust #1 so it also does any additional proprietary calculations on the data AND compiles and writes a JSON file on the server (unique to the user) containing exactly the data needed for the front end (the DB data is needed for other pages).
3. Create a small PHP-based service which validates a client-provided JWT and reads the JSON file out.
4. Write an AJAX front end to poll the endpoint from #3 every X seconds, using the JWT for authorization (a rough front-end sketch follows this list).
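For step 4, the browser side could be as small as the sketch below (the endpoint path, interval, `renderLiveData`, and `currentJwt` are placeholders, not part of the existing app):

```js
// Sketch: poll the JSON endpoint every 15 seconds, sending the JWT in the
// Authorization header.
const POLL_MS = 15000;

async function refreshLiveData(token) {
  const resp = await fetch('/live-data.php', {
    headers: { Authorization: 'Bearer ' + token },
  });
  if (!resp.ok) throw new Error('Polling failed: ' + resp.status);
  const data = await resp.json();
  renderLiveData(data); // hypothetical function that updates the page
}

setInterval(() => refreshLiveData(currentJwt).catch(console.error), POLL_MS);
```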
This all seems sort of like I might be reinventing the wheel, or missing something. The fact that this is an existing PHP-based application (LAMP) does impose some limiting factors, but I feel like there's got to be a more efficient way to handle this... It's pretty new to me. Also, I'm open to other technologies that'll run on the LAMP stack, if that'll make things better.
I would say go for the API solution in the beginning :) since it fits the architecture better and is for sure the least amount of work. Also, if there is a problem with the "live" feeling of the data, you can fix it by polling more often or introducing long polling, assuming you also change the cron job interval.
I mean, in the end it is all about impact for the time spent; don't start implementing features that customers don't care about :)
The biggest problem to solve is implementing it in a way that fits your requirements and is somewhat extendable for the future. You still have to deal with issues like resolution, timeouts, reducing server processing when requesting data, and so on!
For me, if you need to maintain global service state because a single client's request could affect all other connected clients' requests, then most server-side scripting languages are not the best choice! To add to that, if you plan on implementing something like this with PHP, you will be setting yourself up for a living nightmare. Why? Because, simply put, PHP's socket implementation is that bad!
Currently I'm trying to learn NativeScript, and for this I thought about building an app like 'Anki'.
But while thinking about the data storage, I stumbled upon the problem of how to save my flash cards locally so the app works offline (for example with SQLite), save the user's review schedule for each card (e.g. to show it again in 10 minutes or 1 day), AND have update functionality to add new cards to the database without deleting the user's data.
What's the best way to solve that problem, especially if I want to ship the new cards with an app update rather than fetching everything from an external database?
I don't have any code yet, so a recommendation on how to solve this would be nice.
There are several methods in NativeScript you can use:
NativeScript-Sqlite (disclaimer: I'm the author)
This allows full access to Sqlite for saving and loading items; you can have as big of databases as you need and Sqlite is very fast. Sqlite's biggest drawback is speed of writes; if you have a LOT of writing it can be slower than just writing to a file yourself.
NativeScript-LocalStorage (disclaimer again: I'm the author)
This is more geared to smaller data sizes, as when the app starts and saves it has to load the entire JSON-backed data store into memory. It is really fast overall, but not something you want to use for tens of thousands of records.
NativeScript-Couchbase
This uses Sqlite for local storage and can use Couchbase for the remote storage; very nice for having syncable storage - Couchbase can be your own server or a leased or rented server.
NativeScript-Firebase
This is also very useful for having syncable storage; however, Google charges for Firebase past a certain point.
Built-in AppSettings.
This is really designed for a few application settings, not for lots of data, but it is useful for smaller amounts of data.
Roll your own on the file system.
I have done this in a couple of my projects; basically a hybrid between my localstorage plugin and a mini-SQL type system. One project was very write-dependent, so it made more sense to generate the 20 or so separate files on the phone (one per table), because I could save them much more quickly than inserting/replacing > 100,000 records into Sqlite each time the app started up. It had minimal searching needs.
Your storage choice really depends on what you are doing; it is a balancing act. Lots of searchable data: Sqlite wins in almost all cases. Lots of frequent writing: something you create yourself might be a lot faster.
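If you go the Sqlite route, one way to handle the "update the deck without deleting user data" requirement is to keep the shipped cards and the user's review state in separate tables keyed by card id. A sketch, assuming the nativescript-sqlite promise API and made-up table and column names:

```js
// Sketch: the deck shipped with the app lives in `cards`, the user's review
// state lives in `reviews`, so an app update can re-import cards without
// touching review data. API usage assumed from the nativescript-sqlite plugin.
const Sqlite = require('nativescript-sqlite');

async function openDeckDb() {
  const db = await new Sqlite('deck.db');
  await db.execSQL(
    'CREATE TABLE IF NOT EXISTS cards (id TEXT PRIMARY KEY, front TEXT, back TEXT, deck_version INTEGER)');
  await db.execSQL(
    'CREATE TABLE IF NOT EXISTS reviews (card_id TEXT PRIMARY KEY, due_at INTEGER, interval_days REAL)');
  return db;
}

// On app update: upsert the new cards only; the reviews table is left alone.
async function importCards(db, cards, version) {
  for (const c of cards) {
    await db.execSQL(
      'INSERT OR REPLACE INTO cards (id, front, back, deck_version) VALUES (?, ?, ?, ?)',
      [c.id, c.front, c.back, version]);
  }
}
```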
I have created an application using the JavaScript library D3. Users will constantly click and drag to change graphical elements, and I currently save the data in 3-4 local JavaScript objects and arrays. I want to save the data to the server periodically rather than after each change, and I also want users to be able to work when they are not connected. Drawing on how I would have done this twenty years ago, I imagine doing it manually: on the client side, records are flagged as "new", "revised", or "deleted"; every 10 seconds the client data is saved via AJAX, and either an object is updated or a SQL statement is executed; an id is returned from the database and saved on the client side to track each record for future modifications.
Note that the data must be organized in a database to make it easy to separate elements for reuse. When the user is connected, updates every 5-10 seconds are fine, so I can use an inexpensive, slow server. Of course, a tool that deals with records that might not fully update would be good, perhaps with some transactional functionality.
There will be no separate mobile application. I can modify my JavaScript objects to be JSON-compliant if need be. I see there are "offline-first" frameworks and JavaScript "state containers". Redux caught my eye, especially when I saw its use climbing over the years according to Google Trends. I've read about so many options and am thoroughly confused by all of them. Here is a mish-mash of tools I looked at: Store.js, now.js, IndexedDB, CouchDB, PouchDB, Cloudant, localForage, WebSQL, Polymer App Toolbox, the Hoodie framework, Ionic and Angular, and LoopBack. Not to mention XHR and WebSockets.
I have used MVC frameworks like Laravel and Zend, both with PHP and MySQL. I wonder if I could integrate the suggested solution with them. Thanks.
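To make the manual approach above concrete, I picture something like the following sketch (the endpoint, record shape, and response format are just assumptions):

```js
// Sketch: flag each local record and push the dirty ones to the server every
// 10 seconds; the server returns ids so the client can track records.
const records = []; // e.g. { id: null, status: 'new' | 'revised' | 'deleted' | 'clean', data: {...} }

async function pushChanges() {
  const dirty = records.filter(r => r.status !== 'clean');
  if (dirty.length === 0 || !navigator.onLine) return; // nothing to do, or offline

  const resp = await fetch('/api/save-batch', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(dirty),
  });
  const saved = await resp.json(); // assumed response shape: [{ index, id }]
  for (const s of saved) {
    records[s.index].id = s.id;        // keep the server-assigned id
    records[s.index].status = 'clean'; // mark as synced
  }
}

setInterval(() => pushChanges().catch(console.error), 10000);
```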
Saving the data locally using PouchDB and then syncing it with a CouchDB database (or IBM's Cloudant service) when a network connection is available is a well-trodden path for this sort of requirement. But your question is asking for an opinion, so there will be many other perfectly valid solutions to this.
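A minimal sketch of what that could look like (the remote URL and document shape are placeholders):

```js
// Sketch: write locally, and let live sync reconcile with a remote
// CouchDB/Cloudant database whenever a connection is available.
const PouchDB = require('pouchdb');

const local = new PouchDB('drawings');
const remote = new PouchDB('https://example.com/couchdb/drawings');

// Two-way, continuous replication that retries after network failures.
local.sync(remote, { live: true, retry: true }).on('error', console.error);

// Saving an edited element: just write to the local database; sync does the rest.
async function saveElement(el) {
  const id = 'element:' + el.id;
  let doc;
  try {
    doc = await local.get(id); // existing revision, if any
  } catch (e) {
    doc = { _id: id };         // first save of this element
  }
  await local.put({ ...doc, ...el, updatedAt: Date.now() });
}
```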
Is Elasticsearch a database itself? Is it safe to use it as my primary database? Is it secure enough to use as my primary database for storing sensitive data?
Elasticsearch is a standalone database. Its main use case is searching text, and text- and number-related queries such as aggregations. Generally, it's not recommended to use Elasticsearch as the main database, as some operations, such as indexing (inserting values), are more expensive compared to other databases.
You can use Elasticsearch along with any other database such as MongoDB or MySQL, where the other databases can act as the primary database, and you can sync Elasticsearch with your primary database for the "searchable" parts of the data.
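As a sketch of that pattern, assuming the official @elastic/elasticsearch client with its v7-style API (index and field names are placeholders):

```js
// Sketch: the primary database stays the source of truth; only the searchable
// fields are copied into Elasticsearch after each successful write.
const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: 'http://localhost:9200' });

// Called after a successful write to the primary database.
async function indexProduct(product) {
  await es.index({
    index: 'products',
    id: String(product.id),
    body: { name: product.name, description: product.description },
  });
}

// Full-text search goes to Elasticsearch; the hit ids point back to the primary DB.
async function searchProducts(text) {
  const { body } = await es.search({
    index: 'products',
    body: { query: { match: { description: text } } },
  });
  return body.hits.hits.map(h => h._id);
}
```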
Elasticsearch works well with a number of other products from Elastic such as Logstash for logging purposes and Kibana for visualization purposes.
The Elasticsearch homepage has a well-written description of it and its main use cases.
You might want to look at the link below to understand Elasticsearch's usefulness as a database and what trade-offs you have to make in order to use it as a primary database:
https://www.elastic.co/blog/found-elasticsearch-as-nosql
In general, Elasticsearch has primarily been used as an index store for retrieving/searching data really fast. Elasticsearch is powered by Lucene, a high-performance text search engine library, which makes it a very powerful tool for providing a full-text search platform on top of applications. But it is usually recommended that your "source of truth" database be kept separate from the Elasticsearch index data itself, because, given the nature of its primary operation (full-text search), it has not focused on other aspects of a database such as durability, security, write consistency, etc. Hope this helps.
This question is from 2018, and at that time I would probably have answered with a definitive no - don't use Elasticsearch as your only, primary database.
But since ~2015 a lot of resiliency issues have been found and addressed, and in recent years many features, specifically stability and resiliency features, have been added, so it's definitely something to consider given the right use cases, leveraging the right features in the right way.
I have used the ELK stack only once, to monitor the log file from an application. The 'database' used is Elasticsearch, and it does seem as though it is positioned to be a primary database that is also "open source and free to use."
Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.
My understanding is that the entire ELK stack comprises three tools, namely Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine; Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch; and Kibana lets users visualize data in Elasticsearch with charts and graphs.
That's from their site at:
https://www.elastic.co/elk-stack
I am curious what others have to say and will have to follow your question. I currently work with Oracle and SQL Server for our application and would like to see how we could leverage additional database software in the future. Open source is always intriguing.
Is it really helpful to load data from a local database created using PouchDB?
Please share your experience if you have used PouchDB: pros and cons.
We have a website which loads 100,000 records on page load and then performs many queries on this data.
What I did: created a database using their getting-started guide: http://pouchdb.com/getting-started.html
Is something like a wildcard query possible on this?
For 100,000 documents that the user is simply querying, syncing all of them to the client first sounds like it might be overkill. That's a huge amount of data for your application to wait for at page start.
What you may be interested in trying, though, is storing your data on CouchDB, querying the remote CouchDB, and then selectively syncing documents as needed to the client using filtered replication. It really depends on how badly you need sync, though, and if the user is ever going to modify those documents and need the changes to be synced back.
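A sketch of what filtered replication could look like (the design-document filter and the `owner` parameter are assumptions about your setup):

```js
// Sketch: only pull the documents a given user needs instead of syncing the
// whole database up front.
const PouchDB = require('pouchdb');

const local = new PouchDB('my-docs');
const remote = new PouchDB('https://example.com/couchdb/all-docs');

local.replicate.from(remote, {
  filter: 'app/by_owner',                 // filter function assumed to be defined in a design doc on CouchDB
  query_params: { owner: currentUserId }, // currentUserId assumed to be available
  live: true,
  retry: true,
}).on('error', console.error);
```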
Well, since it's JSON, you can do the queries in JavaScript. You could start by using localStorage and move on to PouchDB if you need more space or the other functions PouchDB provides. But if you just want to be able to filter/search records that you've already retrieved on page load, then you can write your filtering logic in JavaScript.