This page contains my notes on system design concepts and best practices. It can be used as study material for system design. Feedback and corrections are very welcome at contact@mayankprajapati.in
A system can be defined as a set of software running on a set of computers, using the resources of those computers.
That software can be any database, web application, desktop application, utility service, etc. that can run on a computer.
Now, what is this software in broad terms?
This software is the "technical implementation of a solution" to a problem.
Simple examples of problems are saving images, playing videos, reading books, interacting with people around the world, etc.
Programming languages are used to build that software solution, often together with ready-made software or frameworks.
You must be wondering what a framework is. A framework is nothing but a set of ready-made libraries/code that can be used to develop software.
The next question that comes to mind is: what are the resources of a computer, and how does software use those resources?
Every computer has resources like CPUs, hard disks, RAM, network throughput, etc. As mentioned above, programming languages are used to build software; with these languages, services or programs (code) are written to use those resources.
These computers can be a standalone computer in your study room, a set of computers in a big data center, or a set of virtual computers running in the cloud.
All the entities defined above are joined together to make a system, and the process of defining these elements to solve a problem is known as system design. In other words, it is the application of theoretical knowledge to give practical solutions.
How does the internet work? link
Limited Resources
CPU
The CPU (central processing unit) is considered the brain of a computer. Every instruction, no matter how complex or simple, must pass through the CPU. People also refer to the CPU as the processor. You must have seen a CPU specification like "64-bit quad-core Intel i7 3.5 GHz processor".
Let's break down this specification to understand the CPU better. It is all about speed: whatever we want to do, we want it done as fast as possible.
For example, to run image-editing software or modern games, we need to pass complex instructions through the CPU, so to get faster results we need a better CPU.
64 bit vs 32 bit link
Registers link
Type
Cores
Hyper Threading
Memory
The relation between bytes, KB, MB, GB, and TB might come in handy when making calculations, so one must remember these numbers with respect to both base 2 and base 10. [ link ]
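As a quick reference, the base-2 and base-10 unit definitions can be compared in a few lines of Python:

```python
# Powers of 10 (decimal units) vs powers of 2 (binary units).
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12        # base 10
KiB, MiB, GiB, TiB = 2**10, 2**20, 2**30, 2**40     # base 2

# The gap between the two grows with each unit:
# ~2.4% at the KB level, ~10% at the TB level.
print(round((KiB - KB) / KB * 100, 1))  # 2.4
print(round((TiB - TB) / TB * 100, 1))  # 10.0
```

This is why a "1 TB" disk shows up as roughly 931 GiB in the operating system.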
Hard Disk
Random Read/Write
Sequential Read/Write
RAID
Network Bandwidth
Consistent Hashing
Networking
TCP/IP Model link1
Application
Transport
Network
Packets
Low Level Implementations
Routers [WIP]
IPV4
IPV6
Network Address Translation - NAT link1
Physical
Frames
Low Level Implementations
Switches [WIP]
Bit-stream
Low Level Implementations
Interfaces, cables, Optical Stuff (DWDM, SONET) [WIP]
DNS Lookup
DNS Record Types
Virtual Private Cloud (VPC)
Physical Layer
Data-Link Layer
Network Layer
Transport Layer
Session Layer
Presentation Layer
Application Layer
Virtualization
Virtual Machine
Containers
Database
Why does an application developer need to know the internals?
An application developer is probably not going to implement their own storage engine from scratch, but they do need to select a storage engine that is appropriate for their application from the many that are available. In order to tune a storage engine to perform well on their kind of workload, they need a rough idea of what the storage engine is doing under the hood.
Storage Engines & Indexes
Log-Structured[1] & Log-Structured Merge Tree (LSM Trees)[1]
Sorted String Table (SSTable) and Memtable link
Pros
LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (although this depends on the storage engine configuration and workload), and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree.
LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees.
Cons
A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes, resulting in high response times. The disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background.
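The memtable/SSTable idea above can be sketched as a toy engine. This is a deliberate simplification (no WAL, no compaction, no sparse index; `TinyLSM` and its flush threshold are invented for illustration): writes land in an in-memory sorted structure, a full memtable is flushed as an immutable sorted segment, and reads check the memtable first, then segments newest-to-oldest.

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []          # list of sorted (key, value) lists
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: sorting the entries makes the segment an SSTable
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment first
            for k, v in segment:
                if k == key:
                    return v
        return None

db = TinyLSM(memtable_limit=2)
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 3)                   # newer value shadows the flushed one
print(db.get("a"))               # 3
print(db.get("b"))               # 2
```

Compaction would periodically merge segments and drop shadowed values, which is exactly the background work the paragraph above describes.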
Page-Oriented [1]
B-Tree link
The B-tree balancing algorithm ensures that the depth of the tree remains O(log n), where 'n' is the number of keys in the tree.
Most databases can fit into a B-tree that is three or four levels deep, so we don’t need to follow many page references to find the page we are looking for.
A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.
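The capacity claim above checks out with quick arithmetic:

```python
# Back-of-envelope check: branching factor 500, four levels, 4 KB pages.
branching, levels, page_size = 500, 4, 4 * 1024
leaf_pages = branching ** levels          # pages addressable at the leaf level
capacity = leaf_pages * page_size         # bytes
print(capacity)                           # 256,000,000,000,000 = 256 TB (decimal)
```

That is 256 TB in decimal units (about 233 TiB in binary units), matching the figure quoted above.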
Pros
An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This aspect makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree.
Cons
B-tree storage engines leave some disk space unused due to fragmentation: when a page is split, or when a row cannot fit into an existing page, some space in a page remains unused.
Secondary Index [1]
Clustered Index [1]
Covering index or Index with included columns [1]
Multi-column index [1]
Fuzzy Index
Points to consider while choosing any storage engine [1]
Concurrency Control & Consistency
Crash Recovery
Write Ahead Log (WAL) or Redo Log
In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state[1].
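The write-ahead rule can be sketched minimally. This is an assumed simplification: a dict stands in for the tree pages, each log record is a `key=value` line, and recovery is a full replay of the log.

```python
import os
import tempfile

class WALStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}              # stands in for the B-tree pages
        self._recover()

    def _recover(self):
        # replay the append-only log to rebuild state after a crash
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = line.rstrip("\n").split("=", 1)
                    self.data[key] = value

    def put(self, key, value):
        with open(self.log_path, "a") as f:
            f.write(f"{key}={value}\n")
            f.flush()
            os.fsync(f.fileno())    # durable BEFORE touching the "pages"
        self.data[key] = value      # only now apply the modification

log_path = os.path.join(tempfile.mkdtemp(), "demo.wal")
store = WALStore(log_path)
store.put("balance", "100")
# a fresh instance (simulating a restart) replays the log and sees the same state
assert WALStore(log_path).data == {"balance": "100"}
```

Real WALs also handle partially written records (e.g. with checksums) and are truncated via checkpoints, which this sketch omits.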
Copy-On-Write Schema [1]
A modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This approach is also useful for concurrency control.
Write Amplifications
Reclaiming disk space & Space optimizations
Gracefully Handling errors
Handling partially written records
Types of Databases
Based on Processing
Transactional Workload or Online Transaction Processing (OLTP)
Analytics Workload or Online Analytic Processing (OLAP)
Data Warehouse
Based on Storage & Query Engine
Relational
NoSql
Key Value
Wide Column
Document
Graph
Cloud Storage
ACID [1] new
Atomicity
Isolation link
Read Uncommitted
prevent dirty writes (two transactions updating same object)
Acquire an exclusive row-level lock before the write and release it after the write. Other transactions that try to write the same object are blocked.
Read Committed
prevent dirty writes
prevent dirty reads (two transactions reading same object)
Acquire shared lock before read and release after read.
An alternative is to record the value of each object before it is written, and return that old value to all other transactions until the writing transaction completes.
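That old-value alternative can be sketched roughly as follows (the `ReadCommittedObject` class and its API are invented for illustration, and commit/abort handling is reduced to the bare minimum):

```python
class ReadCommittedObject:
    def __init__(self, value):
        self.committed = value
        self.uncommitted = None
        self.writer = None          # txn id currently holding the write, if any

    def write(self, txn_id, value):
        self.writer, self.uncommitted = txn_id, value

    def read(self, txn_id):
        # only the writing transaction sees its own uncommitted value;
        # everyone else gets the last committed value -- no dirty read,
        # and no blocking on a read lock
        if self.writer == txn_id:
            return self.uncommitted
        return self.committed

    def commit(self, txn_id):
        if self.writer == txn_id:
            self.committed = self.uncommitted
            self.uncommitted, self.writer = None, None

obj = ReadCommittedObject(10)
obj.write(txn_id=1, value=99)
print(obj.read(txn_id=2))   # 10 -- old value, not the dirty write
obj.commit(txn_id=1)
print(obj.read(txn_id=2))   # 99
```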
Snapshot Isolation and Repeatable Read
useful for read-only transactions
prevent non-repeatable reads (read skew)
A transaction sees different parts of the database at different points in time (due to write activity from other users).
An example: a user reads the balance of two bank accounts while, separately, a transfer between those accounts is happening. It’s okay if the user sees both before values or both after values, but read skew is when one read sees the before value and the other sees the after value.
multi-version concurrency control (MVCC)
How snapshot isolation work?
Tag each transaction with a unique ID and apply visibility rules to construct a consistent snapshot.
How snapshot isolation work in indexes?
Have the index point to all versions of the object and, at query time, filter out the versions that are invisible to the transaction.
append-only B-Trees: an append-only/copy-on-write variant that does not overwrite pages of the tree when they are updated, but instead creates a new copy of each modified page. Parent pages, up to the root of the tree, are copied and updated to point to the new versions of their child pages. Any pages that are not affected by a write do not need to be copied, and remain immutable
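The visibility-rule idea behind MVCC can be sketched as follows. This is a rough simplification (`MVCCStore` is invented for illustration, and commit tracking is omitted: a version is treated as visible if it was created by this transaction or by one that began earlier):

```python
class MVCCStore:
    def __init__(self):
        self.versions = {}     # key -> list of (txn_id, value), append-only
        self.next_txn = 1

    def begin(self):
        txn = self.next_txn    # monotonically increasing transaction IDs
        self.next_txn += 1
        return txn

    def write(self, txn, key, value):
        # never overwrite: append a new tagged version
        self.versions.setdefault(key, []).append((txn, value))

    def read(self, txn, key):
        # snapshot rule: see own writes plus versions from earlier transactions
        visible = [(t, v) for t, v in self.versions.get(key, []) if t <= txn]
        return visible[-1][1] if visible else None

db = MVCCStore()
t1 = db.begin(); db.write(t1, "x", "old")
t2 = db.begin()                 # t2's snapshot is fixed here
t3 = db.begin(); db.write(t3, "x", "new")
print(db.read(t2, "x"))         # 'old' -- t3's later write is invisible to t2
print(db.read(t3, "x"))         # 'new' -- t3 sees its own write
```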
Preventing Lost Updates (two transactions updating same object)
Two transactions concurrently perform a read-modify-write cycle on the same object. One transaction overwrites the other’s write without incorporating the other’s changes, so data is lost.
Can be reduced with
Atomic write operations***
E.g.: UPDATE counters SET value = value + 1 WHERE key = 'foo';
Explicit Locking
Automatic lost update detection**
Compare and set*
Conflict resolution
Full Serializability
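Compare-and-set from the list above can be sketched with a toy in-memory cell (not a real database primitive; the lock here only makes the compare+set step itself atomic, while the read-modify-write cycle relies on retries):

```python
import threading

class CASCell:
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        # succeed only if nobody wrote since we read `expected`
        with self._lock:
            if self.value != expected:
                return False        # lost the race: caller must retry
            self.value = new
            return True

    def increment(self):
        while True:                 # read-modify-write with retry
            current = self.value
            if self.compare_and_set(current, current + 1):
                return

cell = CASCell(0)
threads = [threading.Thread(target=lambda: [cell.increment() for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.value)   # 4000 -- no update lost despite concurrent increments
```

With a naive `cell.value += 1` under the same four threads, some increments could be lost, which is exactly the lost-update anomaly.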
Write Skew (two transactions updating different object)
A transaction reads something, makes a decision based on what it read, and writes a value dependent upon the decision. While the transaction was running, other writes to different objects have made the decision no longer correct.
Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks.
Phantom reads
A transaction reads objects that match some criteria. Another transaction makes a write that creates a new record that affects the search criteria.
Can be reduced by
Explicit Locking **
Materialising Conflicts*
Serializable Isolation
Actual Serial Execution
Two Phase Locking/ Pessimistic Technique
Read Shared Mode Lock
Write Exclusive Mode Lock
Deadlock
Predicate Locks*
Index Range Locks**
Serializable Snapshot Isolation/ Optimistic Technique
Snapshot isolation with an algorithm to detect serializable conflicts amongst writes.
Consistency
Eventual Consistency
Strong Consistency
Durability
BASE
Basically Available
Soft State
Eventual Consistency
Replication link [1] new
Types
Master Slave Replication link
Process of adding a new slave.
Take consistent snapshot of master.
Copy snapshot to slave node.
Catch up on the data changes that happened on the master while the new slave was being added.
How to ensure the slave is up to date with the master?
What if slave fails?
Slave can catch up with the master using transaction logs as soon as it restarts.
What if master fails?
Choose a new master from among the slaves. The slave with the most up-to-date data is preferred.
All writes are routed to the new master.
Usually the old master is either demoted to a slave when it recovers or is destroyed.
Replication Logs
Statement-based replication
Issues with non-deterministic functions like NOW(): each replica can get a different value, causing inconsistency.
Write-ahead Log
Data is described at a very low level, hence tightly coupled to the storage engine; master and slave cannot run different versions.
Logical (Row-based) replication
Uses a log format for replication that is decoupled from the storage engine's internals.
Easier for external systems to parse, hence can be used with external systems like a data warehouse.
Trigger Based Replication
Replication Lags
Why can lag arise?
Read after write consistency
How to monitor lag on a slave and prevent queries on that slave when lag goes beyond a certain threshold?
Monotonic Reads
How to practically route queries to a certain slave based on certain conditions?
Consistent Prefix Reads
This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order
Async master-slave replication poses a problem for caches in data centers that use replicas. Writes are forwarded to the master, but some time will elapse before they are reflected in the local replica.
Master Master Replication
Conflict Resolution
Avoiding Conflict
Convergent Conflict Resolution
Custom Conflict Resolution
How to detect/define a conflict?
Multi Master in Linux link
Leaderless Replication
Dynamo, Cassandra
Quorum Consistency (w + r > n)
Quorum Reads (r)
Quorum Writes (w)
Sloppy Quorum
Hinted Handoff
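The quorum rule w + r > n guarantees that every read set overlaps the most recent write set in at least one replica. For small n this can be checked exhaustively (a sketch for intuition, not how real systems verify it):

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True if every possible write set of size w intersects
    every possible read set of size r among n replicas."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

print(quorums_overlap(n=3, w=2, r=2))   # True  (2 + 2 > 3)
print(quorums_overlap(n=3, w=1, r=2))   # False (1 + 2 = 3: read can miss the write)
```

Sloppy quorums weaken this guarantee, since the w acks may come from nodes outside the key's "home" set, which is why hinted handoff is needed afterwards.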
Partitioning new
Key range based partitions
Range boundaries
Key hash based partitions
Consistent Hashing
Splitting Hot Keys
Partitioning with secondary indexes
Document based / Local Index
Term based / Global Index
Rebalancing Partitions
Partition Splitting
Fixed Number of Partitions
Dynamic Partitions
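The key-hash approach with consistent hashing can be sketched as a minimal ring (no virtual nodes, which systems like Cassandra layer on top; `HashRing` is invented for illustration). A key maps to the first node clockwise from its hash, so adding or removing a node only moves the keys in one arc of the ring rather than rehashing everything.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes):
        # place each node on the ring at its hash position
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h,))   # first node clockwise from h
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # deterministically one of the three nodes
```

The same idea underlies distributed caches: clients can compute the owner of a key locally, with no central lookup table.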
Sharding
Example distribution technique for sharding:
64-bit ID = (shard ID << 46) | (type ID (table ID) << 36) | (local ID (row) << 0)
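The bit layout above can be packed and unpacked like this (a sketch; the helper names are made up, and the implied widths are 18 bits for the shard ID, 10 for the type ID, and 36 for the local ID):

```python
def make_id(shard_id, type_id, local_id):
    # shard in the top bits, type (table) in the middle, row id at the bottom
    return (shard_id << 46) | (type_id << 36) | local_id

def parse_id(packed):
    shard_id = packed >> 46
    type_id = (packed >> 36) & ((1 << 10) - 1)   # 10-bit type field
    local_id = packed & ((1 << 36) - 1)          # 36-bit local id field
    return shard_id, type_id, local_id

packed = make_id(shard_id=3, type_id=5, local_id=12345)
print(parse_id(packed))   # (3, 5, 12345)
```

The appeal of this scheme is that the ID itself tells you which shard and table to query, with no lookup table.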
Connection Management
Why is an ALTER on an RDBMS table with millions of rows costly, and how can it be done with minimal downtime? link
How to secure databases?
Data Warehouse [1]
Schema Architecture
Star Schema
Snowflake Schema
Column-oriented storage
Column Compression
Bitmap Encoding
Vectorized Processing: bitwise operations on compressed column data inside the CPU's L1 cache.
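Bitmap encoding can be sketched in a few lines (Python integers stand in for bitmaps here; real column stores add run-length encoding on top for sparse bitmaps). One bitmap is kept per distinct value, with bit i set when row i holds that value, so a predicate like `country IN ('IN', 'US')` becomes a bitwise OR:

```python
def bitmap_encode(column):
    # one integer bitmap per distinct value; bit i = row i has that value
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps

country = ["IN", "US", "IN", "DE", "US", "IN"]
bitmaps = bitmap_encode(country)

matches = bitmaps["IN"] | bitmaps["US"]          # country IN ('IN', 'US')
rows = [i for i in range(len(country)) if matches >> i & 1]
print(rows)   # [0, 1, 2, 4, 5]
```

This works well precisely when the column has few distinct values relative to the number of rows, which is typical for warehouse dimension columns.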
Row-oriented storage
Massively Parallel Processing
Scalability
Vertical Scaling
Horizontal Scaling
Latency link
Throughput
Parallelism
Concurrency link
Consistency
Strong Consistency
Eventual Consistency
Availability
Partition Tolerance
CAP Theorem link (can be related to a bank example)
SAS Patterns
Caching link
Consistent Hashing
Write through cache
Write around cache
Write back cache
Load Balancing link1
Points to consider while providing solutions involving Load balancers
Content Delivery Network (CDN)
Edge
Task Queue
Messaging System
Publisher Subscriber (Pub - Sub)
Cloud Computing
Security
Data Center
Racks
Hosts
Advanced Data Structures
Bloom filters
Count Min Sketch
Long Polling
Server Sent Events
Distributed Configuration Service
Web Server link
Search Framework
Cloud Infrastructure
Build Systems
Logs link
You can't fully understand databases, NoSQL stores, key value stores, replication, paxos, hadoop, version control, or almost any software system without understanding logs.
streaming logs.
Distributed System link
Distributed Transaction link
Two Phase Commit
Three Phase Commit
SAGA
Serialization/Encoding Format [1]
Textual format
JSON
XML
Binary Format
MessagePack
Apache Thrift
Protocol Buffers
Apache Avro
Parquet
Virtualization
Docker
Kubernetes
Mesos
Database
Relational
MySQL
Postgres
Microsoft SQL
Oracle 12c
NoSql
Cassandra
Second generation of graph databases
Pros
Scale out well to distributed databases.
Cons
Built on NoSQL storage systems; lacking a native graph storage design, they perform slowly on graph updates and multi-hop queries over large datasets.
MongoDB
Couch DB
Redis link
Membase
RethinkDB
LevelDB
LSM Tree
RocksDB
LSM Tree
Dynamo DB
Neo4J
First generation of graph databases.
Pros
Cons
Not designed for big data, hence cannot scale out to a distributed database.
Not designed for parallelism.
Performs slowly at both loading and querying large datasets and/or multi-hop queries.
Titan
InfiniteGraph
Datomic
Allegrograph
Sharding
Cloud Storage
Amazon S3
Caching
EhCache
Memcache
Redis
Nginx
Terracotta
Guava
Caffeine
Load Balancing
Nginx
HaProxy
Task Queue
Celery
Messaging System
Kafka
RabbitMQ
ActiveMQ
ZeroMQ
Sqs
Redis
Cloud Computing
Hadoop
Map Reduce
Spark
Amazon EC2
Google Compute Engine
Distributed Configuration Service
Zookeeper
Server
Nginx
Apache
Apache Tomcat
Search
Lucene
Lucene uses an SSTable-like structure for its term dictionary. This structure requires a small in-memory index that tells queries at which offset in the sorted file they need to look for a key. The in-memory index is a finite state automaton over the characters in the keys, similar to a trie. This automaton can be transformed into a Levenshtein automaton, which supports efficient search for words within a given edit distance. [1]
Solr
Elastic Search [ link ]
Indexing Libraries
Faiss
NMSLIB
Annoy
Cloud Platform
Google Cloud
Amazon AWS
Logs streaming services
Flume
2.5 million seconds in a month.
2.5 million requests per month ≈ 1 request per second
100 million requests per month ≈ 40 requests per second
1 billion requests per month ≈ 400 requests per second
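These shortcuts can be sanity-checked with quick arithmetic:

```python
# A 30-day month has about 2.6 million seconds, so dividing monthly
# requests by ~2.5 million gives requests per second.
seconds_per_month = 30 * 24 * 3600
print(seconds_per_month)                        # 2592000, i.e. ~2.5 million
print(round(100_000_000 / seconds_per_month))   # 39, call it ~40 rps
print(round(1_000_000_000 / seconds_per_month)) # 386, call it ~400 rps
```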
Reading 1 MB sequentially from memory takes about 250 microseconds
Read sequentially from disk at 30 MB/s
Read sequentially from 1 Gbps Ethernet at 100 MB/s
Read sequentially from SSD at 1 GB/s
Read sequentially from main memory at 4 GB/s
How Pinterest went viral? link
How to design a cache system like Memcached or Redis link
Mysql vs Postgresql link
Celery with Redis link
Software disenchantment - Mocking Engineering Industry link
How Swiggy improved Hybrid App Experience by using Caching link
Flickr doing billions of query per day. link
Consistent Hashing at Uber Engineering. link
Load balancing with Consistent Hashing at Vimeo. link
Consistent Hashing in Cassandra. link
Design a rate limiter link
Design a consistent hashing link
Design a key-value store link
Design a unique-id generator in distributed systems link
Design a url shortener link
Design a web crawler link
Design a notification system link
Design a news feed system link
Design a chat system link
Design YouTube, design Google Drive link
Designing Data Intensive Application by Martin Kleppmann. link