Distributed Data Management

Monolithic applications are typically backed by a large relational database, which defines a single data model common to all application components. In a microservices approach, such a central database would defeat the goal of building decentralized and independent components. Each microservice component should have its own data persistence layer.

Distributed data management, however, raises new challenges. As explained by the CAP Theorem, distributed microservices architectures inherently trade off strong consistency for availability and need to embrace eventual consistency.

It’s very common for state changes to affect more than a single microservice. In those cases, event sourcing has proven to be a useful pattern. The core idea behind event sourcing is to represent and persist every application change as an event record. Instead of persisting application state, data is stored as a stream of events. Database transaction logs and version control systems are two well-known examples of event sourcing. Event sourcing has a couple of benefits; for example, state can be determined and reconstructed for any point in time.
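A minimal sketch of the idea in Python (the `Event` record, `apply_event` fold, and bank-account domain are illustrative assumptions, not any particular library's API): state is never stored directly; it is recomputed by replaying the stream, optionally only up to a chosen timestamp.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    kind: str          # e.g. "Deposited", "Withdrawn"
    amount: int
    at: datetime


def apply_event(balance: int, event: Event) -> int:
    """Fold a single event into the current state."""
    if event.kind == "Deposited":
        return balance + event.amount
    if event.kind == "Withdrawn":
        return balance - event.amount
    return balance


def replay(events: list[Event], up_to: datetime | None = None) -> int:
    """Reconstruct state at any point in time by replaying the stream."""
    balance = 0
    for e in events:
        if up_to is not None and e.at > up_to:
            break
        balance = apply_event(balance, e)
    return balance


events = [
    Event("Deposited", 100, datetime(2023, 1, 1)),
    Event("Withdrawn", 30, datetime(2023, 2, 1)),
]
print(replay(events))                          # 70  (current state)
print(replay(events, datetime(2023, 1, 15)))   # 100 (state as of mid-January)
```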

In the context of microservices architectures, event sourcing decouples different parts of an application by using a publish/subscribe pattern, and it feeds the same event data into different data models for separate microservices.
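A toy sketch of that idea, assuming a simple in-process publish/subscribe bus (in a real system this would be a message broker such as Kafka; the `subscribe`/`publish` helpers here are hypothetical): the same event feeds two independent read models owned by separate services.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical in-process event bus: kind -> list of handlers.
subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)


def subscribe(kind: str, handler: Callable[[dict], None]) -> None:
    subscribers[kind].append(handler)


def publish(event: dict) -> None:
    for handler in subscribers[event["kind"]]:
        handler(event)


# Two separate services each maintain their own data model from the same stream.
balances: dict[str, int] = defaultdict(int)   # accounting view
audit_log: list[dict] = []                    # compliance view


def update_balance(e: dict) -> None:
    balances[e["account"]] += e["amount"]


subscribe("Deposited", update_balance)
subscribe("Deposited", audit_log.append)

publish({"kind": "Deposited", "account": "a-1", "amount": 100})
print(balances["a-1"])   # 100
print(len(audit_log))    # 1
```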

Command Query Responsibility Segregation (CQRS) -

CQRS simply means that query and update operations are separated: updates do not run against the same model that serves queries, as they do in traditional database management systems.

Event sourcing is frequently used in conjunction with CQRS to decouple read from write workloads and optimize each for performance, scalability, and security. In traditional data management systems, commands and queries are run against the same data repository.
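A minimal sketch of CQRS under simple assumptions (in-memory stores, and a synchronous projection standing in for what would normally be an asynchronous, eventually consistent update; all names are illustrative):

```python
write_store: list[dict] = []          # source of truth (e.g. an event log)
read_store: dict[str, int] = {}       # denormalized view optimized for reads


def handle_command(account: str, amount: int) -> None:
    """Command side: validate, then record the change."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    write_store.append({"account": account, "amount": amount})
    project(account, amount)          # in practice this update is async/eventual


def project(account: str, amount: int) -> None:
    """Projection: keep the read model up to date from the write side."""
    read_store[account] = read_store.get(account, 0) + amount


def handle_query(account: str) -> int:
    """Query side: read-only access to the optimized view."""
    return read_store.get(account, 0)


handle_command("a-1", 50)
print(handle_query("a-1"))   # 50
```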

Distributed Systems - https://www.oreilly.com/library/view/distributed-systems-in/9781491924914/

Provides an overall view of distributed systems.

A walkthrough of the Dynamo paper - https://www.dynamodbguide.com/the-dynamo-paper

Read Replication - (More Reads). If you want to scale your database and there are many more reads than writes, you can add more databases (read replicas). There is always a possibility that the replicas are not in sync with the master in real time, so at any given moment you could get stale data. For example, if we have master A with replicas B and C, and an update hits the master at the same time a read request is served from B, the information might not be up to date. You just have to know that consistency might take a few minutes to catch up. Another important characteristic of read replication is that it really only solves the problem of there being lots more reads than writes.
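A sketch of that trade-off, with a stand-in `Database` class rather than a real driver: writes go to the primary, reads are spread across replicas, and until `replicate()` runs a read can return stale data.

```python
import random


class Database:
    def __init__(self, name: str):
        self.name = name
        self.data: dict[str, str] = {}


primary = Database("A")
replicas = [Database("B"), Database("C")]


def write(key: str, value: str) -> None:
    primary.data[key] = value          # replicas catch up asynchronously


def replicate() -> None:
    for r in replicas:                 # in reality this lags behind writes
        r.data.update(primary.data)


def read(key: str) -> str | None:
    # Reads are spread across replicas; before replicate() runs,
    # this can return an older value than the primary holds.
    return random.choice(replicas).data.get(key)


write("user:1", "new-profile")
print(read("user:1"))    # None -- replicas haven't caught up yet
replicate()
print(read("user:1"))    # "new-profile"
```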

If you are committed to distributed storage and need a distributed database, then you probably are going to be willing to put up with eventual consistency. MongoDB works the same way.

Sharding - (More Writes). Writes can be split across multiple shards by key. Say keys from A-F go to database X, keys from F-N go to database Y, and keys from N-Z go to database Z. This way we are distributing our write load across different databases based on keys, and each of the write databases can have its own read replicas. The problem with this model is that if you want to run a query that spans all the databases, you need to query every one of them and gather the results back together.
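A sketch of key-range sharding matching the A-F / F-N / N-Z split above (the routing table and helpers are illustrative): single-key writes touch one shard, while a cross-shard query must fan out to all of them.

```python
shards = {
    ("a", "f"): {},   # database X: keys A-F
    ("f", "n"): {},   # database Y: keys F-N
    ("n", "{"): {},   # database Z: keys N-Z ("{" is the char after "z")
}


def shard_for(key: str) -> dict:
    k = key[0].lower()
    for (lo, hi), db in shards.items():
        if lo <= k < hi:
            return db
    raise KeyError(key)


def put(key: str, value: str) -> None:
    shard_for(key)[key] = value                      # single-shard write


def scan_all(predicate) -> list:
    # Cross-shard query: hit every database and gather the results.
    return [v for db in shards.values()
            for v in db.values() if predicate(v)]


put("alice", "row-1")
put("frank", "row-2")
put("nancy", "row-3")
print(scan_all(lambda v: v.startswith("row")))   # fans out to X, Y, and Z
```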

When planning for sharding, you will need to make sure you understand your data access patterns well. If most of the queries are going to span multiple databases, you will need to rethink the shard key or design around that.

Consistent Hashing -

In consistent hashing, the servers form a ring of hash values. Each server owns a range of hashes, and a request is served by the nearest server clockwise on the ring. Server IDs are also hashed and form part of the ring, and we can place multiple virtual nodes for the same server ID so the distribution becomes more even. With this approach, when a server is added or removed, only the keys from the adjacent servers are redistributed to or from it, which allows dynamic scaling. (DynamoDB and Cassandra use consistent hashing.)
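A minimal consistent-hash ring in Python, assuming MD5 as the hash function and virtual nodes for smoother distribution (a sketch of the general scheme, not DynamoDB's or Cassandra's actual implementation):

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, servers: list[str], vnodes: int = 8):
        self.points: list[tuple[int, str]] = []   # sorted (hash, server) pairs
        for server in servers:
            self.add(server, vnodes)

    def add(self, server: str, vnodes: int = 8) -> None:
        # Adding a server only takes over keys from its clockwise neighbours;
        # virtual nodes spread that server around the ring.
        for i in range(vnodes):
            self.points.append((_hash(f"{server}#{i}"), server))
        self.points.sort()

    def lookup(self, key: str) -> str:
        # Walk clockwise: the first point at or after the key's hash.
        i = bisect.bisect(self.points, (_hash(key), ""))
        return self.points[i % len(self.points)][1]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))   # always routes to the same server
```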

Dynamo Paper - https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

CAP Theorem -

Eric Brewer - CAP theorem

C - Consistency - Consistency is the property of a distributed database that, when I read from it, I will always see what I have most recently written for a given key.

A - Availability - This just means that when I ask, I get an answer. When I try to write, I can write. When I try to read, I can read. The database doesn't tell me, "sorry, come back later."

P - Partition Tolerance - A partition could be something simple like a broken network cable. It could be, again, the JVM example of a smoke break, a major GC pause; that can sort of be a partition. It could be power going down to a rack, or multiple data centers with the WAN link cut because of a misconfigured route. Whatever it is, some part of this database can't talk to some other part of this database.

We cannot have all three together; only two can co-exist.

For distributed systems, partitions are a fact of life, so partition tolerance is effectively required. The real trade-off is between Availability and Consistency.