This course mainly focuses on the distributed-memory parallel framework, in which each processor (core, CPU, GPU, IoT device) has its own memory space that cannot be shared with others. This framework is popular because it corresponds to many real application scenarios, such as clusters, sensor networks, and the IoT.
The goal of this course is to explore the most popular strategies for parallelizing machine learning (ML) tasks without data sharing, across various real-world application scenarios. Students will also learn to maintain the privacy and security of such distributed systems.
Skills learned -- Theoretical part
- Developed foundations for analyzing the time complexity of distributed algorithms
- Understood fault-tolerant design principles and their guarantees
- Understood the analysis of collective communication strategies for parallel computing (see the sketch after this list)
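Collective communication is easiest to grasp through its core primitive, all-reduce. The following is a minimal runnable sketch, not course material: the `gloo` backend, the loopback rendezvous address, and a world size of 4 are all assumptions for a single-machine demo. It shows every process ending up with the global sum of the per-process tensors:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Rendezvous settings for a single-machine run (assumed values).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each process holds one local tensor; all_reduce sums them in place,
    # so every rank ends up with the same global result.
    x = torch.ones(4) * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {x.tolist()}")  # with 4 ranks: [10.0, 10.0, 10.0, 10.0]

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Analyses of such collectives typically count latency and bandwidth terms as a function of the number of processes and the message size, which is the kind of trade-off the complexity analyses in this course make precise.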
Skills learned -- Practical part
- Gained hands-on experience with high-performance computing environments, including CPU and GPU clusters on **Grid'5000**
- Implemented distributed training of machine learning models using multiple CPUs or GPUs with **PyTorch**, as sketched below
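As a taste of the practical part, here is a minimal sketch of data-parallel training with PyTorch's `DistributedDataParallel`. The single machine, `gloo` backend, two CPU processes, and toy linear model are all assumptions for illustration; on a GPU cluster one would typically switch to the `nccl` backend and move the model and data to GPUs:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Rendezvous settings for a single-machine run (assumed values).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Wrapping the model in DDP makes backward() all-reduce gradients
    # across ranks, so every replica takes the same optimizer step.
    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(3):
        # In real training, each rank would load its own shard of the data.
        x, y = torch.randn(8, 10), torch.randn(8, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()  # gradient synchronization happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(train, args=(world_size,), nprocs=world_size)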
More up-to-date details are available here.
Content:
- Distributed Algorithms
  - Basic distributed models
  - Complexity analyses
  - Consensus
- Distributed Learning
  - Learning principles
  - Collective Communication
  - PyTorch package for distributed learning
  - Robust learning