Synchronize data across multiple occasionally-connected clients using Event Sourcing (NodeJS, MongoDB, JSON) - javascript

I'm facing a problem implementing data-synchronization between a server and multiple clients.
I read about Event Sourcing and I would like to use it to accomplish the syncing-part.
I know that this is not a technical question but more of a conceptual one.
I would just send all events live to the server, but the clients are designed to be used offline from time to time.
This is the basic concept:
The server stores all events that every client should know about. It does not replay those events to serve the data, because the main purpose is to sync the events between the clients, enabling them to replay all events locally.
Each client has its own JSON store, also keeping all events and rebuilding all the different collections from the stored/synced events.
As clients can modify data offline, it is not that important to have consistent syncing cycles. With this in mind, the server should handle conflicts when merging the different events and ask the specific user in the case of a conflict.
So, the main problem for me is to determine the diffs between the client and the server to avoid sending all events to the server. I'm also having trouble with the order of the synchronization process: push changes first, or pull changes first?
What I've currently built is a default MongoDB implementation on the server side, which isolates all documents of a specific user group in all my queries (currently it only handles authentication and server-side database work).
On the client, I've built a wrapper around a NeDB store, enabling me to intercept all query operations to create and manage events per query, while keeping the default query behaviour intact. I've also compensated for the different ID systems of NeDB and MongoDB by implementing custom IDs that are generated by the clients and are part of the document data, so that recreating a database won't mess up the IDs (when syncing, these IDs should be consistent across all clients).
The event format will look something like this:
{
  type: 'create/update/remove',
  collection: 'CollectionIdentifier',
  target: ?ID, // the global custom ID of the document updated
  data: {}, // the inserted/updated data
  timestamp: '',
  creator: // some way to identify the author of the change
}
To save some memory on the clients, I will create snapshots after a certain number of events, so that a full replay becomes more efficient.
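For illustration, a minimal sketch of replaying from such a snapshot, assuming the event format described below and that a snapshot stores the document map plus the timestamp of the last event it covers:

function applyEvent(state, event) {
  // state maps the global custom ID to the current document data
  if (event.type === 'create') state.set(event.target, event.data);
  else if (event.type === 'update') state.set(event.target, { ...state.get(event.target), ...event.data });
  else if (event.type === 'remove') state.delete(event.target);
}

function rebuildCollection(snapshot, events) {
  // Start from the snapshot's state and replay only the newer events.
  // snapshot.documents is assumed to be an array of [id, data] pairs.
  const state = new Map(snapshot ? snapshot.documents : []);
  for (const event of events) {
    if (snapshot && event.timestamp <= snapshot.timestamp) continue;
    applyEvent(state, event);
  }
  return state;
}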
So, to narrow down the problem: I'm able to replay events on the client side, and I can create and maintain the events on both the client and the server side. Merging the events on the server side should also not be a problem. Replicating a whole database with existing tools is not an option, as I'm only syncing certain parts of the database (not even entire collections, since documents are assigned to different groups in which they should sync).
But what I am having trouble with is:
The process of determining what events to send from the client when syncing (Avoid sending duplicate events, or even all events)
Determining what events to send back to the client (Avoid sending duplicate events, or even all events)
The right order of syncing the events (Push/Pull changes)
Another question I would like to ask is whether storing the updates directly on the documents, in a revision-like style, would be more efficient.
If my question is unclear or duplicate (I found some questions, but they didn't help me in my scenario), or if something is missing, please leave a comment. I will maintain the question as best as I can to keep it simple, as I've just written down everything that could help you understand the concept.
Thanks in advance!

This is a very complex subject, but I'll attempt some form of answer.
My first reflex upon seeing your diagram is to think of how distributed databases replicate data between themselves and recover in the event that one node goes down. This is most often accomplished via gossiping.
Gossip rounds make sure that data stays in sync. Time-stamped revisions are kept on both ends and merged on demand, say when a node reconnects, or simply at a given interval (publishing bulk updates via a socket or the like).
Database engines like Cassandra or Scylla use 3 messages per merge round.
Demonstration:
Data in Node A
{ id: 1, timestamp: 10, data: { foo: '84' } }
{ id: 2, timestamp: 12, data: { foo: '23' } }
{ id: 3, timestamp: 12, data: { foo: '22' } }
Data in Node B
{ id: 1, timestamp: 11, data: { foo: '50' } }
{ id: 2, timestamp: 11, data: { foo: '31' } }
{ id: 3, timestamp: 8, data: { foo: '32' } }
Step 1: SYN
Node A lists the ids and last-upsert timestamps of all its documents (feel free to change the structure of these data packets; here I'm using verbose JSON to better illustrate the process).
Node A -> Node B
[ { id: 1, timestamp: 10 }, { id: 2, timestamp: 12 }, { id: 3, timestamp: 12 } ]
Step 2: ACK
Upon receiving this packet, Node B compares the received timestamps with its own. For each document, if its timestamp is older, it just places the id and its own timestamp in the ACK payload; if it's newer, it places it along with its data. And if the timestamps are the same, it does nothing, obviously.
Node B -> Node A
[ { id: 1, timestamp: 11, data: { foo: '50' } }, { id: 2, timestamp: 11 }, { id: 3, timestamp: 8 } ]
Step 3: ACK2
Node A updates its documents where ACK data was provided, then sends back the latest data to Node B for those where no ACK data was provided.
Node A -> Node B
[ { id: 2, timestamp: 12, data: { foo: '23' } }, { id: 3, timestamp: 12, data: { foo: '22' } } ]
That way, both nodes now have the latest data merged both ways (in case the client did offline work), without having to send all your documents.
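A minimal sketch of this three-message round in Node.js, assuming each node keeps its documents in a Map keyed by id (the transport layer is left out):

function buildSyn(docs) {
  // Step 1 (SYN): send only ids and last-upsert timestamps.
  return [...docs.values()].map(({ id, timestamp }) => ({ id, timestamp }));
}

function buildAck(docs, syn) {
  // Step 2 (ACK): attach data where our copy is newer, send just our
  // own timestamp where it's older, skip documents already in sync.
  const ack = [];
  for (const { id, timestamp } of syn) {
    const mine = docs.get(id);
    if (!mine || mine.timestamp === timestamp) continue;
    ack.push(mine.timestamp > timestamp ? mine : { id, timestamp: mine.timestamp });
  }
  return ack;
}

function buildAck2(docs, ack) {
  // Step 3 (ACK2): apply the newer documents we received, and reply
  // with our data for the documents the peer flagged as stale.
  const ack2 = [];
  for (const doc of ack) {
    if (doc.data) docs.set(doc.id, doc);
    else ack2.push(docs.get(doc.id));
  }
  return ack2;
}

// Node B finally applies ACK2: ack2.forEach((doc) => docs.set(doc.id, doc));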
In your case, your source of truth is your server, but you could easily implement peer-to-peer gossiping between your clients with WebRTC, for example.
Hope this helps in some way.
Cassandra training video
Scylla explanation

I think the best solution to avoid all the event-ordering and duplication issues is to use the pull method. This way every client maintains its last imported event state (with a tracker, for example) and asks the server for the events generated after that last one.
An interesting problem will be detecting broken business invariants. For that you could also store the log of applied commands on the client, and in case of a conflict (events were generated by other clients) retry the execution of commands from the command log. You need to do that because some commands will not succeed after re-execution; for example, a client saves a document while another user deletes that same document.
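A minimal sketch of that pull loop, where fetchEventsSince and applyEvent are placeholders for your transport and local replay, and events are assumed to carry a server-assigned sequence number:

let lastSeenSeq = 0; // in practice, persisted per client across restarts

async function pull(fetchEventsSince, applyEvent) {
  // Ask only for events generated after the last one we imported.
  const events = await fetchEventsSince(lastSeenSeq);
  for (const event of events) {
    applyEvent(event);       // replay into the local store
    lastSeenSeq = event.seq; // advance the tracker only after applying
  }
}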

Related

How to find out when an Azure snapshot is finished creating

I'm taking a snapshot of a disk on Azure. I'm using the Node SDK. When I issue the command to take a snapshot, I get the response back within a few seconds. I'll paste the output below.
The thing is, the provisioning state always shows Succeeded even though the snapshot is obviously not finished being created yet. And it does not yet show in the dashboard.
If I use the snapshot.list method, it also says Succeeded for this snapshot.
How can I query to find out when the snapshot is actually finished being created?
{ id: '/subscriptions/1a6c4c11-6729-48fb-8e76-06c6281bb6f1/resourceGroups/RGOUP1/providers/Microsoft.Compute/snapshots/snapCostTest',
  name: 'snapCostTest',
  type: 'Microsoft.Compute/snapshots',
  location: 'westus',
  sku: { name: 'Standard_LRS', tier: 'Standard' },
  timeCreated: 2019-08-16T00:51:04.099Z,
  osType: 'Windows',
  hyperVGeneration: 'V1',
  creationData:
   { createOption: 'Copy',
     sourceResourceId: '/subscriptions/1a6c4c11-6729-48fb-8e76-06c6281bb6f1/resourceGroups/RGOUP1/providers/Microsoft.Compute/disks/vm1_OsDisk_1_502b5534fe4b4f288d19e127c457d652' },
  diskSizeGB: 127,
  provisioningState: 'Succeeded' }
I would have thought the provisioningState would show something like Creating while the snapshot is being created.
You can use the REST API below to get detailed information about your disk:
GET https://management.azure.com/<YOUR DISK ID>?api-version=2018-06-01
Only once properties.provisioningState in the response turns "Succeeded" will the snapshot show up on the portal dashboard.
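A minimal polling sketch in Node.js; the resource id and the Azure AD bearer token are assumed to come from elsewhere (e.g. your existing SDK credentials):

const https = require('https');

// `resourceId` is the snapshot's full resource id (the `id` field from
// the SDK output above); `token` is a bearer token obtained elsewhere.
function getProvisioningState(resourceId, token) {
  const url = 'https://management.azure.com' + resourceId + '?api-version=2018-06-01';
  return new Promise((resolve, reject) => {
    https.get(url, { headers: { Authorization: 'Bearer ' + token } }, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(JSON.parse(body).properties.provisioningState));
    }).on('error', reject);
  });
}

async function waitUntilSucceeded(resourceId, token, intervalMs = 5000) {
  // Poll the management API until the snapshot reports Succeeded.
  while (await getProvisioningState(resourceId, token) !== 'Succeeded') {
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}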

Client-side generated document sequences with Node.js

We want to build a Node.js API that stores schema-less documents in a MongoDB collection. Every document should have a key "no" which orders them in a sequence:
[
  { "no": 1, ... },
  { "no": 2, ... },
  { "no": 3, ... },
  { "no": 4, ... },
  ...
]
We have the following constraints:
The sequencing, including other parameters, needs to be cryptographically signed. Therefore, the server cannot set a sequence number that the client does not know before signing and sending the data.
no must be unique (Not allowed: 1 -> 1 -> 2 -> 3)
There must not be any gaps in the sequencing (Not allowed: 1 -> 2 -> 4 -> 5)
The API is replicated, so there will be a lot of concurrent requests against MongoDB.
The API client is not a browser application; it is actually a Node.js application as well. There will be only one API client.
Our starting point is to have an API that on every storage request returns the next sequence number.
POST /collection { "no": 1, ...}
returns {"next": 2}
Will this work?
On the client side, it could be something like this pseudocode:
let next

module.exports.create = (document, cb) => {
  // Here it is probably better to sync the initial no with the db
  // instead of always starting with 1.
  if (!next) next = 1
  document.no = next
  return post('/collection', document, (err, res) => {
    if (err) ...
    next = res.next
    return cb(...)
  })
}
If create on the client side is called by many concurrent callers, can there be a case where two or more create requests have duplicate no's?
When you have concurrent API calls and the client has the right to determine the sequencing, it's pretty much impossible to achieve what you're trying to do.
However, if the sequence no cannot have any gaps and must be sequential, why must the client be the one providing the sequence no? You could easily agree on a sequencing pattern (e.g. 1, 2, 3, 4 or AB1, AB2, AB3, etc.) and let the server side insert the sequence no depending on which request comes in first. Make a collection that generates a running number using findAndModify and let the server write the sequence no into the database instead, as sketched below.
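A minimal sketch of such a counter collection with the official MongoDB Node.js driver (the collection and counter names are made up; findOneAndUpdate is the modern, equally atomic form of findAndModify):

const { MongoClient } = require('mongodb');

async function nextSequence(db, counterName) {
  // Atomically increment the counter and read the new value in one step.
  const result = await db.collection('counters').findOneAndUpdate(
    { _id: counterName },
    { $inc: { seq: 1 } },
    { upsert: true, returnDocument: 'after' }
  );
  return result.value.seq; // note: driver v6+ returns the document directly
}

async function create(db, document) {
  // The server, not the client, assigns the gap-free sequence number.
  document.no = await nextSequence(db, 'documentNo');
  await db.collection('documents').insertOne(document);
  return document;
}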

How to deal with user permissions in single page application

I'm working on a single-page enterprise application with pretty complex logic around user permissions. The huge part of it works entirely on the client, communicating with the backend server using AJAX, sending JSON back and forth. The tricky part is that I need to implement the permission mechanism on a per-entity basis, and I don't know the right way to do it.
To explain myself clearly, here is the example code; I have two entity classes on the backend, User and Node:
class User {
    Long id;
}

class Node {
    Long id;
    String name;
    Status status;
    Node parent;
    List<User> admins;
}

enum Status {
    STATUS_1, STATUS_2
}
I send the JSON of the parent node to the server:
{id: 1, name: "Node name 1", status: 'STATUS_1'}
And receive JSON with a bunch of child nodes:
[
  {id: 11, name: "Node name 1.1", status: 'STATUS_1'},
  {id: 12, name: "Node name 1.2", status: 'STATUS_1'}
]
On the client they are displayed in a tree-like structure.
Now the tricky part:
A simple user that works with the application can see the tree, but can't change anything.
A user can change a node's name if they are among the admins of that node or any of its parent nodes.
Admins can also change the status of a node from STATUS_1 to STATUS_2, but only if all child nodes have the STATUS_2 status.
There is a list of super administrators who can do whatever they want: change the properties of any node, change statuses as they please.
So somehow, during rendering of the tree on the client, I need to know what the user can or cannot do with each node on the page. I can't just assign the user a role within the whole application, because user rights vary from one node to another. I also can't see the whole picture on the client side, because child nodes may not be loaded yet. How can I manage user permissions in a situation like this? What's the proper way or pattern to use?
Should I attach some role object to each node, or maybe a bunch of flags representing what the user can or cannot do, like this:
{
  id: 12,
  name: "Node name 1.2",
  status: "STATUS_1",
  canChangeName: true,
  canChangeStatus: false
}
That looks pretty silly to me.
I usually solve complex (and not so complex) permission-based tasks in the application using ACL classes.
I have simple, lightweight classes that take the model whose permissions are being checked and a user object in the constructor. They have a bunch of methods named canXXXX(). These methods can optionally take some parameters as well, if needed.
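For illustration, a minimal sketch of such an ACL class in JavaScript, assuming the Node/User shapes from the question plus a loaded children array and a superAdmin flag (both are assumptions):

class NodeAcl {
  constructor(node, user) {
    this.node = node;
    this.user = user;
  }

  isAdmin() {
    if (this.user.superAdmin) return true;
    // Admin rights on this node or any of its ancestors count.
    for (let n = this.node; n; n = n.parent) {
      if (n.admins.some((admin) => admin.id === this.user.id)) return true;
    }
    return false;
  }

  canChangeName() {
    return this.isAdmin();
  }

  canChangeStatus() {
    // Super admins may change status freely; regular admins only when
    // every child node already has STATUS_2.
    if (this.user.superAdmin) return true;
    return this.isAdmin() &&
      this.node.children.every((child) => child.status === 'STATUS_2');
  }
}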
If you have the same model classes on the front and back ends, you might even be able to reuse the ACLs in both cases.
Can you use this approach?

How to manage large JavaScript data files?

I'm developing a Cordova project. To store my data I'm using JavaScript files like this:
var groups = [
  {
    id: 1,
    parent_id: 0,
    name: "Group 1"
  },
  {
    id: 2,
    parent_id: 0,
    name: "Group 2"
  }
];
The first problem is that I don't know whether this is a good approach or whether there are better ways.
To use this data I simply loop through the variable, but that becomes a problem with large data volumes, for example thousands of records. It's hard to handle this amount of data in a .js file. What should I do?
A possible solution is to use a database such as IndexedDB (if your app is completely offline) or Firebase (if your app uses the internet); then you can query and get just the data you require.
DOM Storage (localStorage) is also an option, but you would still have to loop over an array yourself, and it cannot store more than about 5 MB of data.
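For example, a minimal IndexedDB sketch (the database and store names are made up) that fetches just the groups you need instead of looping over one huge array:

var request = indexedDB.open('appDb', 1);

request.onupgradeneeded = function (event) {
  // Runs once: create the object store and an index for parent lookups.
  var db = event.target.result;
  var store = db.createObjectStore('groups', { keyPath: 'id' });
  store.createIndex('parent_id', 'parent_id');
};

request.onsuccess = function (event) {
  var db = event.target.result;
  var tx = db.transaction('groups', 'readonly');
  // Fetch only the top-level groups (parent_id === 0).
  var query = tx.objectStore('groups').index('parent_id').getAll(0);
  query.onsuccess = function () {
    console.log(query.result);
  };
};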

Duplicated rows when sorting dgrid 0.3.6

I've been using dgrid 0.3.2 along with JsonRest to display tables of data. Recently, I've been looking at upgrading to dgrid 0.3.6 or 0.3.7. Things work mostly the same, but it seems with the newer versions of dgrid that, if the user clicks a column header to sort fast enough, the grid will start displaying duplicate rows. I’ve verified that the response JSON and range are correct, and this didn’t seem to happen when we used dgrid 0.3.2.
Here’s a simple test case that reproduces the problem, and mimics how we set up our grid and store. Am I doing something wrong? If I don’t wrap the JsonRest in a Cache, I don’t get this issue, so I’m sure the problem is there, but I’m unsure about the performance ramifications of not caching the JSON response.
<!doctype html>
<html>
<head>
<%
    String dgridVer = request.getParameter("dgridVer");
    if (dgridVer == null) { dgridVer = "0.3.6"; }
%>
<script type="text/javascript">
    var dojoConfig = {
        isDebug: true,
        baseUrl: 'dojo',
        packages: [
            { name: 'dojo', location: 'dojo' },
            { name: 'dijit', location: 'dijit' },
            { name: 'dojox', location: 'dojox' },
            { name: 'dgrid', location: 'dgrid-<%=dgridVer%>' },
            { name: 'put-selector', location: 'put-selector' },
            { name: 'xstyle', location: 'xstyle' },
            { name: 'datagrid', location: '../datagrid' }
        ]
    };
</script>
<script src="dojo/dojo/dojo.js"></script>
</head>
<body>
Try sorting a column as fast as you can. Look for duplicated rows.<br>
Using dgrid version: <%=dgridVer %><p>
<div id='gridDiv'></div>
<script>
    require(['dgrid/Grid', 'dgrid/extensions/Pagination', 'dojo/store/JsonRest',
             'dojo/store/Cache', 'dojo/store/Memory', 'dojo/_base/declare', 'dojo/domReady!'],
        function(Grid, Pagination, JsonRest, Cache, Memory, declare) {
            var columns = [
                { field: "first", label: "First Name" },
                { field: "last", label: "Last Name" },
                { field: "age", label: "Age" }
            ];
            var store = new JsonRest({
                target: 'testData.jsp',
                sortParam: "sortBy"
            });
            store = Cache(store, Memory());
            var grid = new (declare([Grid, Pagination]))({
                columns: columns,
                store: store,
                loadingMessage: 'Loading...',
                rowsPerPage: 4,
                firstLastArrows: true
            }, 'gridDiv');
        });
</script>
</body>
</html>
Check the default implementation of Cache.js, especially the query and queryEngine functions. By default they always go first to the master store, which in your case is the JsonRest store. Only after the data has been loaded is the caching store updated (in your case the Memory store).
Now, if you check the _setSort function in dgrid's List.js file and the refresh function in dgrid's OnDemandList.js, you'll see that by default dgrid calls the query method of the current store to obtain the new list of items sorted differently. In your case that store is the dojo/store/Cache.
So, summing up: when the user clicks a column to sort, dgrid queries the Cache, which in turn queries JsonRest, which in turn queries the server, which then returns new data, which the Cache then stores in the Memory store.
You can actually confirm this, for example with Firebug (a Firefox extension). In my case, whenever I tried to sort, Firebug showed a new request to the server to obtain new data.
This makes sense when there are lots of rows, because dgrid is designed to load only the first batch of rows and then update the grid when the user scrolls down. When the sort changes, the first visible batch of rows may be different and may not be loaded yet, so dgrid must load them first.
But in my case the JSON request was returning all the data in a single response. I didn't like the default implementation, so I implemented my own caching store which doesn't require a trip to the server when changing the sorting. I can't share the implementation now, but I will try to once I have some time to tidy up the code.
For now you shouldn't notice any performance problems if you switch to the JsonRest store only (considering that when changing the sorting there is a trip to the server anyway).
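Based on the test case above, that simply means dropping the Cache/Memory wrapper:

var store = new JsonRest({
    target: 'testData.jsp',
    sortParam: 'sortBy'
});
// No Cache(store, Memory()) wrapper: every query, including re-sorts,
// goes straight to the server, which the Cache triggers anyway.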
I am not sure what causes the specific problem of duplicated rows, but I remember seeing it too when my caching store wasn't implemented properly (it had something to do with deferred requests when loading data, if I recall correctly). You can try to debug it by adding breakpoints (again with Firebug) in the get and query functions of the Cache store. My bet is that dgrid tries to load particular rows with the get method (which hits the cache) while the query request is still loading data from the server after the user changed the sorting. But I may be wrong, so please try to confirm it first if you can.
