Syllabus

The course is divided into seven modules, each of which covers a different aspect of distributed systems:

M1: Models, Abstractions, and Basic Concepts

In this module, students will learn about the core models and abstractions used in distributed systems, including processes, communication, traces and execution, and I/O Automata. Students will also study different types of failures, such as crash-stop and Byzantine failures, and learn about event-based programming and its abstraction, specification, and implementation. Finally, students will explore different types of links, including fair-lossy, stubborn, and point-to-point links.

M2: Time in Distributed Systems

This module focuses on time in distributed systems, covering synchronous, asynchronous, and eventually synchronous systems. Students will learn about different clock synchronization techniques, including Christian, Berkeley, and NTP, and study tree-based versus fully distributed algorithms. The module also covers logical clocks and vector clocks and the happened-before relationship. Students will also learn about encapsulating time in failure detectors, leader election, eventual leader election, and distributed system models, such as fail-stop, fail-noisy, and fail-silent systems.

M3: Basic Broadcast Primitives

In this module, students will explore different types of broadcast primitives, including best-effort broadcast, reliable broadcast, uniform reliable broadcast, and causal broadcast.

M4: Shared Memories

This module focuses on shared memories and explores different types of consistency, including regular, sequential consistency, and linearizability. Students will also learn about different types of registers, such as (1, N) register and atomic (1, 1), (1, N), and (N, N) registers. The module also covers the non-composability of sequential consistency and the relationship between consistency properties.

M5: Consensus

This module covers consensus and its specification. Students will explore hierarchical consensus, including uniform and non-uniform consensus. The module also covers FLP, including the proof that is not required, only the statement, and Paxos.

M6: Total Ordering and Replicated State Machine

In this module, students will learn about total order broadcast and replicated state machines, including Raft. The module also covers active replication versus primary backup.

M7: Byzantine Fault Tolerance (BFT)

This module covers the basics of BFT, including Byzantine failures and authenticated channels and crypto assumptions. Students will learn about Byzantine broadcast, including its consistency and reliability. The module also covers synchronous systems, including an impossibility result and the necessity of 3f+1 and King Algorithm.


Learning objectives:

Course Abstract: The course is about distributed systems, with a specific focus on fault-tolerance. Students will learn to appreciate the difficulties introduced by the uncertainty given by the unavoidable coexistence of local knowledge, asynchrony, and failures.

They will understand how and when, in spite of these difficulties, it is possible to build powerful distributed algorithms. Such algorithms are essential to construct distributed systems in which geographical distant entities cooperate to solve disparate tasks.

The term ``failure" has to be intended in its more general meaning: it indicates any deviation from normal behavior. Therefore, it subsumes the concept of malignant intrusion. Algorithms will be presented in a formal, abstract, and modular way. At the end of the course, the student is expected to be able to design fault-tolerant distributed algorithms, and to provide formal and convincing arguments on their correctness.

At the end of the entire course the acquired skills should be: 

Knowledge and understanding:: The student knows and understands the fundamental concepts and principles of distributed systems, such as processes, communication, failures, time, synchronization, consistency, consensus, broadcast, replication, and fault tolerance.

Applying knowledge and understanding: The student can apply the knowledge and understanding of distributed systems to design and implement distributed algorithms that can solve various problems in different settings and scenarios, such as leader election, shared memory, total order broadcast, replicated state machine, etc.

Making judgements: The student can critically analyze and compare different distributed algorithms in terms of correctness, efficiency, scalability, robustness, and complexity. The student can also identify the assumptions and limitations of each algorithm and evaluate their suitability for different applications and requirements.

Communication skills: The student can communicate effectively the main concepts and results of distributed systems using appropriate terminology and notation. The student can also present and discuss distributed algorithms using oral presentations and written reports.

Learning skills: The student can independently learn new distributed systems concepts and methods by reading scientific literature and following online courses. The student can also update their knowledge and skills by following the latest developments in the field.

Detailed Learning Objectives: (the objectives are given for each module of the program – see program section above).

For M1: Models, Abstractions, and Basic Concepts

Knowledge and understanding: Students will be able to explain basic concepts such as processes, communication, traces and execution, I/O automata, failures, event-based programming, links.

Applying knowledge and understanding: Students will be able to use models and abstractions to model distributed systems that can cope with different types of failures and communication channels.

Making judgements: Students will be able to evaluate the advantages and disadvantages of different models and abstractions for distributed systems.

Communication skills: Students will be able to present their designs for distributed models using appropriate notation and terminology.

Learning skills: Students will be able to learn new models from relevant sources for further autonomous study on distributed systems.

For M2: Time in Distributed Systems

Knowledge and understanding: Students will be able to explain the concepts of synchronous, asynchronous, and eventually synchronous systems; clock synchronization algorithms; logical clocks and vector clocks; happened-before relationship; failure detectors.

Applying knowledge and understanding: Students will be able to implement clock synchronization algorithms using message passing; use logical clocks and vector clocks to order events in distributed systems; use failure detectors to handle failures and elect leaders.

Making judgments: Students will be able to compare different clock synchronization algorithms in terms of accuracy, efficiency, and scalability; analyze the limits of ordering with timestamps; choose appropriate failure detectors for different system models.

Communication skills: Students will be able to use diagrams and pseudocode to illustrate clock synchronization algorithms, logical clocks and vector clocks, failure detectors.

Learning skills: Students will be able to use the acquired notion to autonomously study new clock synchronization algorithms and new abstractions of time in distributed systems.

For M3: Basic Broadcast Primitives

Knowledge and understanding: The student knows and understands the basic concepts and properties of different types of broadcast primitives in distributed systems, such as best effort, reliable, uniform reliable, and causal broadcast.

Applying knowledge and understanding: The student can apply the knowledge and understanding of broadcast primitives to design and implement distributed algorithms that use them as building blocks, such as consensus or total order broadcast.

Making judgements: The student can critically evaluate the advantages and disadvantages of different broadcast primitives in terms of correctness, efficiency, scalability, and fault-tolerance.

Communication skills: The student can communicate effectively with peers and instructors about the concepts and algorithms related to broadcast primitives using appropriate terminology and notation.

Learning skills: The student can independently learn new topics related to broadcast primitives by reading scientific papers or textbooks.

For M4: Shared Memories

Knowledge and understanding: The student knows and understands the basic concepts and properties of shared memory in distributed systems, such as consistency models, registers, and non-composability.

Applying knowledge and understanding: The student can apply the knowledge and understanding of shared memory to design and implement distributed algorithms that use it as an abstraction layer, such as atomic broadcast or mutual exclusion.

Making judgements: The student can critically compare and contrast different consistency models and register implementations in terms of correctness, efficiency, scalability, and fault-tolerance.

Communication skills: The student can communicate effectively with peers and instructors about the concepts and algorithms related to shared memory using appropriate terminology and notation.

Learning skills: The student can independently learn new topics related to shared memory by reading scientific papers or textbooks.

For M5: Consensus

Knowledge and understanding: The student knows and understands the basic concepts and properties of consensus in distributed systems, such as the consensus specification, the impossibility results (FLP), and the main algorithms (hierarchical consensus, Paxos).

Applying knowledge and understanding: The student can apply the knowledge and understanding of consensus to design and implement distributed algorithms that use it as a fundamental problem, such as state machine replication or atomic commit.

Making judgements: The student can critically analyze and compare different consensus algorithms in terms of correctness, efficiency, scalability, fault-tolerance, and complexity.

Communication skills: The student can communicate effectively with peers and instructors about the concepts and algorithms related to consensus using appropriate terminology and notation.

Learning skills: The student can independently learn new topics related to consensus by reading scientific papers or textbooks.

For M6: Replicated State Machines

Knowledge and understanding: The student knows the principles and techniques of total order broadcast and replicated state machine in distributed systems. The student understands how these concepts can be used to implement fault-tolerant services.

Applying knowledge and understanding: The student can design and implement total order broadcast algorithms using different communication primitives. The student can use replicated state machine to build consistent applications that tolerate failures.

Making judgements: The student can evaluate the trade-offs between different total order broadcast algorithms in terms of complexity, performance, and reliability. The student can compare active replication and primary backup approaches for replicated state machine.

Communication skills: The student can explain the concepts of total order broadcast and replicated state machine using appropriate terminology. The student can present their design choices and implementation details clearly and concisely.

Learning skills: The student can independently study advanced topics related to total order broadcast and replicated state machine. The student can apply their knowledge to new scenarios or problems in distributed systems.

For M7: BFT

Knowledge and understanding: The student knows and understands the main concepts and challenges of Byzantine fault tolerance (BFT) in distributed systems, such as authenticated channels, Byzantine broadcast, consensus, and replication.

Applying knowledge and understanding: The student can apply BFT techniques to design and implement distributed algorithms that can tolerate Byzantine failures, such as King algorithm.

Making judgements: The student can critically evaluate the trade-offs and limitations of BFT solutions in terms of performance, security, scalability, and complexity.

Communication skills: The student can communicate effectively the main ideas and results of BFT research papers using oral presentations and written reports.

Learning skills: The student can independently learn new BFT concepts and methods by reading scientific literature and following online courses.