Computing systems today do not run on a single super-powerful piece of hardware; instead, they network a large pool of hardware and distribute execution across it. Hence, we need to know how to build, test, and measure such systems in order to scale.
When a system is distributed, it is very hard to tell whether it is "robust" or "redundant". Hence, we need some metric units to measure system performance in order to understand it.
Availability is measured in terms of uptime, usually expressed as a percentage, such as "uptime of 99.999%". Each additional 9 carries qualitative meaning, like accuracy, apart from the quantitative difference.
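To make the "nines" concrete, here is a small sketch that converts an uptime percentage into allowed downtime per year (assuming a plain 365-day year with no maintenance windows; the function name is ours, not a standard API):

```python
# Rough downtime-per-year figures for each "nine" of availability.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def downtime_seconds_per_year(availability_percent: float) -> float:
    """Seconds of allowed downtime per year at the given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)

for nines in ["99", "99.9", "99.99", "99.999"]:
    secs = downtime_seconds_per_year(float(nines))
    print(f"{nines}% uptime -> about {secs / 60:.1f} minutes of downtime per year")
```

Each extra 9 shrinks the allowed downtime by a factor of ten, which is why "five nines" (roughly five minutes per year) is considered a demanding target.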
Scalability is the ability to handle growth and shrinkage in load.
Critical components like databases can be scaled horizontally, although they are usually preferred to be scaled vertically. Much practical advice from experienced developers boils down to: only shard when the database has grown far too large (e.g. the scale of Google's Gmail). If there is still room for vertical scaling or caching, use those first.
Before starting horizontal sharding, as a designer, you need to determine your design principles. These can be driven by various leading factors:
One basic way to shard is to use a dynamic database "lookup" table that points each query to the correct database holding the data. Remember, the goal of sharding is to re-partition a single overly large database into multiple databases.
Hence, the conventional database query now becomes a two-step (or more) procedure: first consult the lookup table to identify the shard, then query that shard for the actual data.
A simple graphical representation is shown in the following diagram.
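The two-step lookup procedure can be sketched as follows. The shard names, keys, and in-memory dicts here are hypothetical stand-ins; in production, the lookup table would itself live in a replicated store or cache, and the shards would be real database connections:

```python
# Step 1 data: the lookup table maps a partition key to the shard
# that holds the data. (Hypothetical keys and shard names.)
shard_lookup = {
    "user:alice": "db-shard-1",
    "user:bob": "db-shard-2",
}

# Hypothetical shard contents, keyed by shard name.
shards = {
    "db-shard-1": {"user:alice": {"email": "alice@example.com"}},
    "db-shard-2": {"user:bob": {"email": "bob@example.com"}},
}

def query(key: str) -> dict:
    # Step 1: consult the lookup table to find the right shard.
    shard_name = shard_lookup[key]
    # Step 2: run the real query against that shard only.
    return shards[shard_name][key]

print(query("user:alice"))
```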
The problem with this design is that every query must first go through the lookup table, adding an extra round trip and making the lookup table itself a potential bottleneck or single point of failure.
An alternative is to use some sort of algorithm to determine the database ID directly.
The easiest example is to hash the query key and take the remainder modulo the number of databases. That remainder is the database lookup ID, and the back-end immediately queries the identified database. Here is a design diagram:
The advantage of this design is that you do not rely on a lookup table, hence faster performance.
The problem is that the data inside each shard can be distributed quite randomly (depending on the algorithm), and re-sharding for database optimization can be very expensive.
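A minimal sketch of this hash-and-remainder approach, with a hypothetical shard count of four:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_id(key: str) -> int:
    # Use a stable hash (not Python's built-in hash(), which is
    # randomized per process) so every server picks the same shard
    # for the same key.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Keys land on shards pseudo-randomly. Note that changing NUM_SHARDS
# would move most keys to different shards -- this is exactly why
# re-sharding is so expensive with this scheme.
for key in ["user:alice", "user:bob", "user:carol"]:
    print(key, "-> db-shard-%d" % shard_id(key))
```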
Regardless of which sharding technique you apply to the database, at some point you will need a different query technique to cope with the massively decentralized databases. Here, the map-reduce technique is well known across the industry for reducing queries over very large data sets to their simplest form.
MapReduce contains 2 critical steps: map and reduce.
The idea is for every shard database to operate as an independent worker that dumps a list of data. The queries, in their split form, are mapped into elements based on the query and then reduced to their simplest form. Finally, all the reduced forms are combined together to yield the resulting data.
MapReduce can be applied iteratively as long as elements are not shuffled out of position relative to one another. Normally, you only apply MapReduce to large-scale databases, especially decentralized ones.
Here is an example, flowing from left-to-right:
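The two phases can be sketched with a word-count example, a common MapReduce illustration. Each "shard" below is just an in-memory list standing in for an independent database worker, and the shard contents are made up:

```python
from collections import Counter
from functools import reduce

# Hypothetical shard contents: each shard holds some rows of text.
shards = [
    ["apple banana apple"],
    ["banana cherry"],
    ["apple cherry cherry"],
]

def map_phase(rows):
    # Map: each shard independently emits partial (word -> count) results.
    counts = Counter()
    for row in rows:
        counts.update(row.split())
    return counts

def reduce_phase(a, b):
    # Reduce: combine two partial results into one.
    return a + b

partials = [map_phase(rows) for rows in shards]     # runs on each shard
result = reduce(reduce_phase, partials, Counter())  # combine reduced forms
print(dict(result))
```

Because each shard's map step is independent, the work parallelizes naturally across the decentralized databases, and only the small reduced forms need to be combined at the end.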
That's all about distributed system design.