This page contains my notes on system design concepts and best practices. It can be used as study material for system design. Feedback and corrections are very welcome at contact@mayankprajapati.in
A system can be defined as a set of software running on a set of computers, using the resources of those computers.
That software can be any database, web application, desktop application, utility service, etc. that can run on a computer.
Now, what is this software in broad terms?
This software is the "technical implementation of a solution" to a problem.
Simple examples of problems are saving images, playing videos, reading books, interacting with people around the world, etc.
Programming languages are used to build that software solution, often together with ready-made software or frameworks.
You must be wondering what a framework is. A framework is nothing but a set of ready-made libraries/code that can be used to develop software.
The next question that comes to mind is: what are the resources of a computer, and how does software use those resources?
Every computer has resources like CPUs, hard disks, RAM, network throughput, etc. As mentioned above, programming languages are used to build software; with these languages, services or programs (code) are written to use those resources.
These computers can be a standalone computer in your study room, a set of computers in a big data center, or a set of virtual computers running in the cloud.
All the entities defined above are joined together to make a system, and the process of defining these elements to solve a problem is known as system design. In other words, it is the application of theoretical knowledge to give practical solutions.
How does the internet work? link
Limited Resources
CPU
The CPU (central processing unit) is considered the brain of a computer. Every instruction, no matter how complex or simple, must pass through the CPU. People also refer to the CPU as the processor. You must have seen a CPU specification like "64-bit quad-core Intel i7 3.5 GHz processor".
Let's break down this specification to understand the CPU better. It is all about speed: whatever we want to do, we want it done as fast as possible.
For example, to run image-editing software or modern games, we need to pass complex instructions through the CPU, so to get faster results we need a better CPU.
64 bit vs 32 bit link
Registers link
Type
Cores
Hyper Threading
Memory
The relation between bytes, KB, MB, GB, and TB might come in handy when making calculations, so one must remember these numbers with respect to both base 2 and base 10. [ link ]
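As a quick reference, the base-2 and base-10 unit definitions can be compared in a few lines of Python:

```python
# Powers of 10 (decimal units) vs powers of 2 (binary units).
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12        # base 10
KiB, MiB, GiB, TiB = 2**10, 2**20, 2**30, 2**40     # base 2

# The gap between the two grows with each unit:
# ~2.4% at the KB level, ~10% at the TB level.
print(round((KiB - KB) / KB * 100, 1))  # 2.4
print(round((TiB - TB) / TB * 100, 1))  # 10.0
```

This is why a "1 TB" disk shows up as roughly 931 GiB in the operating system.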
Hard Disk
Random Read/Write
Sequential Read/Write
RAID
Network Bandwidth
Consistent Hashing
Networking
TCP/IP Model link1
Application
Transport
Network
Packets
Low Level Implementations
Routers [WIP]
IPV4
IPV6
Network Address Translation - NAT link1
Physical
Frames
Low Level Implementations
Switches [WIP]
Bit-stream
Low Level Implementations
Interfaces, cables, Optical Stuff (DWDM, SONET) [WIP]
DNS Lookup
DNS Record Types
Virtual Private Cloud (VPC)
Physical Layer
Data-Link Layer
Network Layer
Transport Layer
Session Layer
Presentation Layer
Application Layer
Virtualization
Virtual Machine
Containers
Database
Why does an application developer need to know the internals?
An application developer is probably not going to implement their own storage engine from scratch, but they do need to select a storage engine that is appropriate for their application from the many that are available. In order to tune a storage engine to perform well on their kind of workload, they need a rough idea of what the storage engine is doing under the hood.
Storage Engines & Indexes
Log-Structured[1] & Log-Structured Merge Tree (LSM Trees)[1]
Sorted String Table (SSTable) and Memtable link
Pros
LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (although this depends on the storage engine configuration and workload), and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree.
LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees.
Cons
A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes, resulting in high response times. The disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background.
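The memtable/SSTable idea above can be sketched as a toy engine. This is a deliberate simplification (no WAL, no compaction, no sparse index; `TinyLSM` and its flush threshold are invented for illustration): writes land in an in-memory sorted structure, a full memtable is flushed as an immutable sorted segment, and reads check the memtable first, then segments newest-to-oldest.

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []          # list of sorted (key, value) lists
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: sorting the entries makes the segment an SSTable
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment first
            for k, v in segment:
                if k == key:
                    return v
        return None

db = TinyLSM(memtable_limit=2)
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 3)                   # newer value shadows the flushed one
print(db.get("a"))               # 3
print(db.get("b"))               # 2
```

Compaction would periodically merge segments and drop shadowed values, which is exactly the background work the paragraph above describes.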
Page-Oriented [1]
B-Tree link
The B-tree balancing algorithm ensures that the depth of the tree remains O(log n), where 'n' is the number of keys in the tree.
Most databases can fit into a B-tree that is three or four levels deep, so we don’t need to follow many page references to find the page we are looking for.
A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.
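The capacity claim above checks out with quick arithmetic:

```python
# Back-of-envelope check: branching factor 500, four levels, 4 KB pages.
branching, levels, page_size = 500, 4, 4 * 1024
leaf_pages = branching ** levels          # pages addressable at the leaf level
capacity = leaf_pages * page_size         # bytes
print(capacity)                           # 256,000,000,000,000 = 256 TB (decimal)
```

That is 256 TB in decimal units (about 233 TiB in binary units), matching the figure quoted above.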
Pros
An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This aspect makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree.
Cons
B-tree storage engines leave some disk space unused due to fragmentation: when a page is split, or when a row cannot fit into an existing page, some space in a page remains unused.
Secondary Index [1]
Clustered Index [1]
Covering index or Index with included columns [1]
Multi-column index [1]
Fuzzy Index
Points to consider while choosing any storage engine [1]
Concurrency Control & Consistency
Crash Recovery
Write Ahead Log (WAL) or Redo Log
In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state[1].
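The write-ahead rule can be sketched minimally. This is an assumed simplification: a dict stands in for the tree pages, each log record is a `key=value` line, and recovery is a full replay of the log.

```python
import os
import tempfile

class WALStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}              # stands in for the B-tree pages
        self._recover()

    def _recover(self):
        # replay the append-only log to rebuild state after a crash
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = line.rstrip("\n").split("=", 1)
                    self.data[key] = value

    def put(self, key, value):
        with open(self.log_path, "a") as f:
            f.write(f"{key}={value}\n")
            f.flush()
            os.fsync(f.fileno())    # durable BEFORE touching the "pages"
        self.data[key] = value      # only now apply the modification

log_path = os.path.join(tempfile.mkdtemp(), "demo.wal")
store = WALStore(log_path)
store.put("balance", "100")
# a fresh instance (simulating a restart) replays the log and sees the same state
assert WALStore(log_path).data == {"balance": "100"}
```

Real WALs also handle partially written records (e.g. with checksums) and are truncated via checkpoints, which this sketch omits.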
Copy-On-Write Schema [1]
A modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This approach is also useful for concurrency control.
Write Amplifications
Reclaiming disk space & Space optimizations
Gracefully Handling errors
Handling partially written records
Types of Databases
Based on Processing
Transactional Workload or Online Transaction Processing (OLTP)
Analytics Workload or Online Analytic Processing (OLAP)
Data Warehouse
Based on Storage & Query Engine
Relational
NoSql
Key Value
Wide Column
Document
Graph
Cloud Storage
ACID [1] new
Atomicity
Isolation link
Read Uncommitted
prevent dirty writes (two transactions updating same object)
Acquire an exclusive row-level lock before the write and release it after the write. Other transactions that try to write the same object are blocked.
Read Committed
prevent dirty writes
prevent dirty reads (two transactions reading same object)
Acquire shared lock before read and release after read.
An alternative is to record the value of each object before it is written, and return that old value to all other transactions until the writing transaction completes.
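That old-value alternative can be sketched roughly as follows (the `ReadCommittedObject` class and its API are invented for illustration, and commit/abort handling is reduced to the bare minimum):

```python
class ReadCommittedObject:
    def __init__(self, value):
        self.committed = value
        self.uncommitted = None
        self.writer = None          # txn id currently holding the write, if any

    def write(self, txn_id, value):
        self.writer, self.uncommitted = txn_id, value

    def read(self, txn_id):
        # only the writing transaction sees its own uncommitted value;
        # everyone else gets the last committed value -- no dirty read,
        # and no blocking on a read lock
        if self.writer == txn_id:
            return self.uncommitted
        return self.committed

    def commit(self, txn_id):
        if self.writer == txn_id:
            self.committed = self.uncommitted
            self.uncommitted, self.writer = None, None

obj = ReadCommittedObject(10)
obj.write(txn_id=1, value=99)
print(obj.read(txn_id=2))   # 10 -- old value, not the dirty write
obj.commit(txn_id=1)
print(obj.read(txn_id=2))   # 99
```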
Snapshot Isolation and Repeatable Read
useful for read-only transactions
prevent non-repeatable reads (read skew)
A transaction sees different parts of the database at different points in time (due to write activity from other users).
An example: a user reads the balance of two bank accounts while, separately, a transfer between those accounts is happening. It’s okay if the user sees both before values or both after values, but read skew is when one read sees the before value and the other sees the after value.
multi-version concurrency control (MVCC)
How snapshot isolation work?
Tag each transaction with a unique ID and apply visibility rules to construct a consistent snapshot.
How snapshot isolation work in indexes?
Have the index point to all versions of the object and, at query time, filter out the versions that are invisible to the transaction.
append-only B-Trees: an append-only/copy-on-write variant that does not overwrite pages of the tree when they are updated, but instead creates a new copy of each modified page. Parent pages, up to the root of the tree, are copied and updated to point to the new versions of their child pages. Any pages that are not affected by a write do not need to be copied, and remain immutable
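The visibility-rule idea behind MVCC can be sketched as follows. This is a rough simplification (`MVCCStore` is invented for illustration, and commit tracking is omitted: a version is treated as visible if it was created by this transaction or by one that began earlier):

```python
class MVCCStore:
    def __init__(self):
        self.versions = {}     # key -> list of (txn_id, value), append-only
        self.next_txn = 1

    def begin(self):
        txn = self.next_txn    # monotonically increasing transaction IDs
        self.next_txn += 1
        return txn

    def write(self, txn, key, value):
        # never overwrite: append a new tagged version
        self.versions.setdefault(key, []).append((txn, value))

    def read(self, txn, key):
        # snapshot rule: see own writes plus versions from earlier transactions
        visible = [(t, v) for t, v in self.versions.get(key, []) if t <= txn]
        return visible[-1][1] if visible else None

db = MVCCStore()
t1 = db.begin(); db.write(t1, "x", "old")
t2 = db.begin()                 # t2's snapshot is fixed here
t3 = db.begin(); db.write(t3, "x", "new")
print(db.read(t2, "x"))         # 'old' -- t3's later write is invisible to t2
print(db.read(t3, "x"))         # 'new' -- t3 sees its own write
```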
Preventing Lost Updates (two transactions updating same object)
Two transactions concurrently perform a read-modify-write cycle on the same object. One transaction overwrites the other’s write without incorporating the other’s changes, so data is lost.
Can be reduced with
Atomic write operations***
E.g.: UPDATE counters SET value = value + 1 WHERE key = 'foo';
Explicit Locking
Automatic lost update detection**
Compare and set*
Conflict resolution
Full Serializability
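Compare-and-set from the list above can be sketched with a toy in-memory cell (not a real database primitive; the lock here only makes the compare+set step itself atomic, while the read-modify-write cycle relies on retries):

```python
import threading

class CASCell:
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        # succeed only if nobody wrote since we read `expected`
        with self._lock:
            if self.value != expected:
                return False        # lost the race: caller must retry
            self.value = new
            return True

    def increment(self):
        while True:                 # read-modify-write with retry
            current = self.value
            if self.compare_and_set(current, current + 1):
                return

cell = CASCell(0)
threads = [threading.Thread(target=lambda: [cell.increment() for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.value)   # 4000 -- no update lost despite concurrent increments
```

With a naive `cell.value += 1` under the same four threads, some increments could be lost, which is exactly the lost-update anomaly.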
Write Skew (two transactions updating different object)
A transaction reads something, makes a decision based on what it read, and writes a value dependent upon the decision. While the transaction was running, other writes to different objects have made the decision no longer correct.
Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks.
Phantom reads
A transaction reads objects that match some criteria. Another transaction makes a write that creates a new record that affects the search criteria.
Can be reduced by
Explicit Locking **
Materialising Conflicts*
Serializable Isolation
Actual Serial Execution
Two Phase Locking/ Pessimistic Technique
Read Shared Mode Lock
Write Exclusive Mode Lock
Deadlock
Predicate Locks*
Index Range Locks**
Serializable Snapshot Isolation/ Optimistic Technique
Snapshot isolation with an algorithm to detect serializable conflicts amongst writes.
Consistency
Eventual Consistency
Strong Consistency
Durability
BASE
Basically Available
Soft State
Eventual Consistency
Replication link [1] new
Types
Master Slave Replication link
Process of adding a new slave.
Take consistent snapshot of master.
Copy snapshot to slave node.
Catch up on the data changes that happened on the master while the new slave was being added.
How to ensure the slave is up to date with the master?
What if slave fails?
Slave can catch up with the master using transaction logs as soon as it restarts.
What if master fails?
Choose a new master from among the slaves. The slave with the most up-to-date data is preferred.
All writes are routed to the new master.
Usually the old master is either demoted to a slave when it recovers or is destroyed.
Replication Logs
Statement-based replication
Issues with non-deterministic functions like NOW(): each replica can get a different value, causing inconsistency.
Write-ahead Log
Data is described at a very low level, hence tightly coupled to the storage engine; master and slave cannot run different versions.
Logical (Row-based) replication
Uses a log format for replication that is decoupled from the storage engine's internals.
Easier for external systems to parse, hence can be used with external systems like a data warehouse.
Trigger Based Replication
Replication Lags
Why can lag arise?
Read after write consistency
How to monitor lag on a slave and prevent queries on that slave when lag goes beyond a certain threshold?
Monotonic Reads
How to practically route queries to a certain slave based on certain conditions?
Consistent Prefix Reads
This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order
Async master-slave replication poses a problem for caches in data centers that use replicas. Writes are forwarded to the master, but some time will elapse before they are reflected in the local replica.
Master Master Replication
Conflict Resolution
Avoiding Conflict
Convergent Conflict Resolution
Custom Conflict Resolution
How to detect/define a conflict?
Multi Master in Linux link
Leaderless Replication
Dynamo, Cassandra
Quorum Consistency (w + r > n)
Quorum Reads (r)
Quorum Writes (w)
Sloppy Quorum
Hinted Handoff
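The quorum rule w + r > n guarantees that every read set overlaps the most recent write set in at least one replica. For small n this can be checked exhaustively (a sketch for intuition, not how real systems verify it):

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True if every possible write set of size w intersects
    every possible read set of size r among n replicas."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

print(quorums_overlap(n=3, w=2, r=2))   # True  (2 + 2 > 3)
print(quorums_overlap(n=3, w=1, r=2))   # False (1 + 2 = 3: read can miss the write)
```

Sloppy quorums weaken this guarantee, since the w acks may come from nodes outside the key's "home" set, which is why hinted handoff is needed afterwards.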
Partitioning new
Key range based partitions
Range boundaries
Key hash based partitions
Consistent Hashing
Splitting Hot Keys
Partitioning with secondary indexes
Document based / Local Index
Term based / Global Index
Rebalancing Partitions
Partition Splitting
Fixed Number of Partitions
Dynamic Partitions
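The key-hash approach with consistent hashing can be sketched as a minimal ring (no virtual nodes, which systems like Cassandra layer on top; `HashRing` is invented for illustration). A key maps to the first node clockwise from its hash, so adding or removing a node only moves the keys in one arc of the ring rather than rehashing everything.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes):
        # place each node on the ring at its hash position
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h,))   # first node clockwise from h
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # deterministically one of the three nodes
```

The same idea underlies distributed caches: clients can compute the owner of a key locally, with no central lookup table.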
Sharding
Example distribution technique for sharding:
64-bit ID = (shard ID << 46) | (type ID (table ID) << 36) | (local ID (row) << 0)
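The bit layout above can be packed and unpacked like this (a sketch; the helper names are made up, and the implied widths are 18 bits for the shard ID, 10 for the type ID, and 36 for the local ID):

```python
def make_id(shard_id, type_id, local_id):
    # shard in the top bits, type (table) in the middle, row id at the bottom
    return (shard_id << 46) | (type_id << 36) | local_id

def parse_id(packed):
    shard_id = packed >> 46
    type_id = (packed >> 36) & ((1 << 10) - 1)   # 10-bit type field
    local_id = packed & ((1 << 36) - 1)          # 36-bit local id field
    return shard_id, type_id, local_id

packed = make_id(shard_id=3, type_id=5, local_id=12345)
print(parse_id(packed))   # (3, 5, 12345)
```

The appeal of this scheme is that the ID itself tells you which shard and table to query, with no lookup table.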
Connection Management
Why is an ALTER on an RDBMS table with millions of rows costly, and how can it be done with minimal downtime? link
How to secure databases?
Data Warehouse [1]
Schema Architecture
Star Schema
Snowflake Schema
Column-oriented storage
Column Compression
Bitmap Encoding
Vectorized Processing: bitwise operations on compressed column data inside the CPU's L1 cache.
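Bitmap encoding can be sketched in a few lines (Python integers stand in for bitmaps here; real column stores add run-length encoding on top for sparse bitmaps). One bitmap is kept per distinct value, with bit i set when row i holds that value, so a predicate like `country IN ('IN', 'US')` becomes a bitwise OR:

```python
def bitmap_encode(column):
    # one integer bitmap per distinct value; bit i = row i has that value
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps

country = ["IN", "US", "IN", "DE", "US", "IN"]
bitmaps = bitmap_encode(country)

matches = bitmaps["IN"] | bitmaps["US"]          # country IN ('IN', 'US')
rows = [i for i in range(len(country)) if matches >> i & 1]
print(rows)   # [0, 1, 2, 4, 5]
```

This works well precisely when the column has few distinct values relative to the number of rows, which is typical for warehouse dimension columns.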
Row-oriented storage
Massively Parallel Processing
Scalability
Vertical Scaling
Horizontal Scaling
Latency link
Throughput
Parallelism
Concurrency link
Consistency
Strong Consistency
Eventual Consistency
Availability
Partition Tolerance
CAP Theorem link (can be related to a bank example)
SAS Patterns
Caching link
Consistent Hashing
Write through cache
Write around cache
Write back cache
Load Balancing link1
Points to consider while providing solutions involving Load balancers
Content Delivery Network (CDN)
Edge
Task Queue
Messaging System
Publisher Subscriber (Pub - Sub)
Cloud Computing
Security
Data Center
Racks
Hosts
Advanced Data Structures
Bloom filters
Count Min Sketch
Long Polling
Server Sent Events
Distributed Configuration Service
Web Server link
Search Framework
Cloud Infrastructure
Build Systems
Logs link
You can't fully understand databases, NoSQL stores, key value stores, replication, paxos, hadoop, version control, or almost any software system without understanding logs.
streaming logs.
Distributed System link
Distributed Transaction link
Two Phase Commit
Three Phase Commit
SAGA
Serialization/Encoding Format [1]
Textual format
JSON
XML
Binary Format
MessagePack
Apache Thrift
Protocol Buffers
Apache Avro
Parquet
Virtualization
Docker
Kubernetes
Mesos
Database
Relational
MySQL
Postgres
Microsoft SQL
Oracle 12c
NoSql
Cassandra
Second generation of graph databases
Pros
Scale out well to distributed databases.
Cons
Built on NoSQL storage systems; lacking a native graph storage design, they perform slowly on graph updates and multi-hop queries over large datasets.
MongoDB
Couch DB
Redis link
Membase
RethinkDB
LevelDB
LSM Tree
RocksDB
LSM Tree
Dynamo DB
Neo4J
First generation of graph databases.
Pros
Cons
Not designed for big data, hence cannot scale out to a distributed database.
Not designed for parallelism.
Performs slowly at both loading and querying large datasets and/or multi-hop queries.
Titan
InfiniteGraph
Datomic
Allegrograph
Sharding
Cloud Storage
Amazon S3
Caching
EhCache
Memcache
Redis
Nginx
Terracotta
Guava
Caffeine
Load Balancing
Nginx
HaProxy
Task Queue
Celery
Messaging System
Kafka
RabbitMQ
ActiveMQ
ZeroMQ
Sqs
Redis
Cloud Computing
Hadoop
Map Reduce
Spark
Amazon EC2
Google Compute Engine
Distributed Configuration Service
Zookeeper
Server
Nginx
Apache
Apache Tomcat
Search
Lucene
Lucene uses an SSTable-like structure for its term dictionary. This structure requires a small in-memory index that tells queries at which offset in the sorted file they need to look for a key. The in-memory index is a finite state automaton over the characters in the keys, similar to a trie. This automaton can be transformed into a Levenshtein automaton, which supports efficient search for words within a given edit distance. [1]
Solr
Elastic Search [ link ]
Indexing Libraries
Faiss
NMSLIB
Annoy
Cloud Platform
Google Cloud
Amazon AWS
Logs streaming services
Flume
2.5 million seconds in a month.
2.5 million requests per month ≈ 1 request per second
100 million requests per month ≈ 40 requests per second
1 billion requests per month ≈ 400 requests per second
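These shortcuts can be sanity-checked with quick arithmetic:

```python
# A 30-day month has about 2.6 million seconds, so dividing monthly
# requests by ~2.5 million gives requests per second.
seconds_per_month = 30 * 24 * 3600
print(seconds_per_month)                        # 2592000, i.e. ~2.5 million
print(round(100_000_000 / seconds_per_month))   # 39, call it ~40 rps
print(round(1_000_000_000 / seconds_per_month)) # 386, call it ~400 rps
```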
Reading 1 MB sequentially from memory takes about 250 microseconds
Read sequentially from disk at 30 MB/s
Read sequentially from 1 Gbps Ethernet at 100 MB/s
Read sequentially from SSD at 1 GB/s
Read sequentially from main memory at 4 GB/s
How Pinterest went viral? link
How to design a cache system like Memcached or Redis link
Mysql vs Postgresql link
Celery with Redis link
Software disenchantment - Mocking Engineering Industry link
How Swiggy improved Hybrid App Experience by using Caching link
Flickr doing billions of query per day. link
Consistent Hashing at Uber Engineering. link
Load balancing with Consistent Hashing at Vimeo. link
Consistent Hashing in Cassandra. link
Design a rate limiter link
Design a consistent hashing link
Design a key-value store link
Design a unique-id generator in distributed systems link
Design a url shortener link
Design a web crawler link
Design a notification system link
Design a news feed system link
Design a chat system link
Design YouTube, design Google Drive link
Designing Data Intensive Application by Martin Kleppmann. link