Computing systems today do not run on a single super-powerful piece of hardware; instead, they network a large pool of hardware and distribute execution across it. Hence, we need to know how to build, test, and measure such systems in order to scale.
When a system is distributed, it is very hard to tell whether it is "robust" or "redundant". Hence, we need some metric units to measure system performance in order to understand it.
Availability is measured in terms of uptime, usually expressed as a percentage, such as "uptime of 99.999%". Each additional 9 carries qualitative meaning, like accuracy, apart from the quantitative difference.
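To make the "nines" concrete, here is a small sketch that converts an uptime percentage into allowed downtime per year (assuming a plain 365-day year with no maintenance windows; the function name is ours, not a standard API):

```python
# Rough downtime-per-year figures for each "nine" of availability.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def downtime_seconds_per_year(availability_percent: float) -> float:
    """Seconds of allowed downtime per year at the given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)

for nines in ["99", "99.9", "99.99", "99.999"]:
    secs = downtime_seconds_per_year(float(nines))
    print(f"{nines}% uptime -> about {secs / 60:.1f} minutes of downtime per year")
```

Each extra 9 shrinks the allowed downtime by a factor of ten, which is why "five nines" (roughly five minutes per year) is considered a demanding target.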
Scalability is the ability to handle growth and shrinkage in load.
Critical components like databases can be scaled horizontally, although they are usually preferred to be scaled vertically. Much practical advice from experienced developers boils down to: only shard when the database has grown far too large (e.g. the scale of Google's Gmail). If there is still room for vertical scaling or caching, use those first.
Before starting horizontal sharding, as a designer, you need to determine your design principles. These can be driven by various leading factors:
One basic way to shard is to use a dynamic database "lookup" table that points each query to the correct database holding the data. Remember, the goal of sharding is to re-partition a single overly large database into multiple databases.
Hence, the conventional database query now becomes a two-step (or more) procedure: first consult the lookup table to identify the shard, then query that shard for the actual data.
A simple graphical representation is shown in the following diagram.
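The two-step lookup procedure can be sketched as follows. The shard names, keys, and in-memory dicts here are hypothetical stand-ins; in production, the lookup table would itself live in a replicated store or cache, and the shards would be real database connections:

```python
# Step 1 data: the lookup table maps a partition key to the shard
# that holds the data. (Hypothetical keys and shard names.)
shard_lookup = {
    "user:alice": "db-shard-1",
    "user:bob": "db-shard-2",
}

# Hypothetical shard contents, keyed by shard name.
shards = {
    "db-shard-1": {"user:alice": {"email": "alice@example.com"}},
    "db-shard-2": {"user:bob": {"email": "bob@example.com"}},
}

def query(key: str) -> dict:
    # Step 1: consult the lookup table to find the right shard.
    shard_name = shard_lookup[key]
    # Step 2: run the real query against that shard only.
    return shards[shard_name][key]

print(query("user:alice"))
```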
The problem with this design is that every query must first go through the lookup table, adding an extra round trip and making the lookup table itself a potential bottleneck or single point of failure.
An alternative is to use some sort of algorithm to determine the database ID directly.
The easiest example is to hash the query key and take the remainder modulo the number of databases. That remainder is the database lookup ID, and the back-end immediately queries the identified database. Here is a design diagram:
The advantage of this design is that you do not rely on a lookup table, hence faster performance.
The problem is that the data inside each shard can be distributed quite randomly (depending on the algorithm), and re-sharding for database optimization can be very expensive.
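A minimal sketch of this hash-and-remainder approach, with a hypothetical shard count of four:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_id(key: str) -> int:
    # Use a stable hash (not Python's built-in hash(), which is
    # randomized per process) so every server picks the same shard
    # for the same key.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Keys land on shards pseudo-randomly. Note that changing NUM_SHARDS
# would move most keys to different shards -- this is exactly why
# re-sharding is so expensive with this scheme.
for key in ["user:alice", "user:bob", "user:carol"]:
    print(key, "-> db-shard-%d" % shard_id(key))
```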
Regardless of which sharding technique you apply to the database, at some point you will need a different query technique to cope with the massively decentralized databases. Here, the map-reduce technique is well known across the industry for reducing queries over very large data sets to their simplest form.
MapReduce contains 2 critical steps: map and reduce.
The idea is for every shard database to operate as an independent worker that dumps a list of data. The queries, in their split form, are mapped into elements based on the query and then reduced to their simplest form. Finally, all the reduced forms are combined together to yield the resulting data.
MapReduce can be applied iteratively as long as elements are not shuffled out of position relative to one another. Normally, you only apply MapReduce to large-scale databases, especially decentralized ones.
Here is an example, flowing from left-to-right:
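The two phases can be sketched with a word-count example, a common MapReduce illustration. Each "shard" below is just an in-memory list standing in for an independent database worker, and the shard contents are made up:

```python
from collections import Counter
from functools import reduce

# Hypothetical shard contents: each shard holds some rows of text.
shards = [
    ["apple banana apple"],
    ["banana cherry"],
    ["apple cherry cherry"],
]

def map_phase(rows):
    # Map: each shard independently emits partial (word -> count) results.
    counts = Counter()
    for row in rows:
        counts.update(row.split())
    return counts

def reduce_phase(a, b):
    # Reduce: combine two partial results into one.
    return a + b

partials = [map_phase(rows) for rows in shards]     # runs on each shard
result = reduce(reduce_phase, partials, Counter())  # combine reduced forms
print(dict(result))
```

Because each shard's map step is independent, the work parallelizes naturally across the decentralized databases, and only the small reduced forms need to be combined at the end.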
That's all about distributed system design.