This course mainly focuses on the distributed-memory parallel framework, in which each processor (core, CPU, GPU, IoT device) has its own memory space that cannot be shared with others. This framework is popular because it corresponds to many real application scenarios, such as clusters, sensor networks, and the IoT.
The goal of this course is to explore the most popular strategies for parallelizing machine learning (ML) tasks without data sharing, across various real-world application scenarios. Students will also learn to maintain the privacy and security of such distributed systems.
Skills learned -- Theoretical part
- Developed foundations for analyzing the time complexity of distributed algorithms
- Understood fault-tolerant design principles and their guarantees
- Understood the analysis of collective communication strategies for parallel computing (see the sketch after this list)
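Collective communication is easiest to grasp through its core primitive, all-reduce. The following is a minimal runnable sketch, not course material: the `gloo` backend, the loopback rendezvous address, and a world size of 4 are all assumptions for a single-machine demo. It shows every process ending up with the global sum of the per-process tensors:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Rendezvous settings for a single-machine run (assumed values).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each process holds one local tensor; all_reduce sums them in place,
    # so every rank ends up with the same global result.
    x = torch.ones(4) * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {x.tolist()}")  # with 4 ranks: [10.0, 10.0, 10.0, 10.0]

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Analyses of such collectives typically count latency and bandwidth terms as a function of the number of processes and the message size, which is the kind of trade-off the complexity analyses in this course make precise.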
Skills learned -- Practical part
- Gained hands-on experience with high-performance computing environments, including CPU and GPU clusters on **Grid'5000**
- Implemented distributed training of machine learning models using multiple CPUs or GPUs with **PyTorch**, as sketched below
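As a taste of the practical part, here is a minimal sketch of data-parallel training with PyTorch's `DistributedDataParallel`. The single machine, `gloo` backend, two CPU processes, and toy linear model are all assumptions for illustration; on a GPU cluster one would typically switch to the `nccl` backend and move the model and data to GPUs:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Rendezvous settings for a single-machine run (assumed values).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Wrapping the model in DDP makes backward() all-reduce gradients
    # across ranks, so every replica takes the same optimizer step.
    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(3):
        # In real training, each rank would load its own shard of the data.
        x, y = torch.randn(8, 10), torch.randn(8, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()  # gradient synchronization happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(train, args=(world_size,), nprocs=world_size)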
More up-to-date details are available here.
Content:
- Distributed Algorithms
  - Basic distributed models
  - Complexity analyses
  - Consensus
- Distributed Learning
  - Learning principles
  - Collective Communication
  - PyTorch package for distributed learning
  - Robust learning