The 1st Workshop on Parallel and Distributed Machine Learning 2019 (PDML'19)

Program

[09:00-09:05] Opening

[09:05-10:00] Keynote

[Slides] Distributed training of Neural Networks
- Yaroslav Bulatov

[10:00-10:30] Session 1: Technical paper presentation (25 mins + 5 mins(Q&A))

[Slides] Accelerating Hyperparameter Optimisation with PyCOMPSs
- Albert Njoroge Kahira, Leonardo Arturo Bautista Gomez, Javier Conejero and Rosa M. Badia

[10:30-11:00] Break

[11:00-12:00] Session 2: Technical paper presentation(25 mins + 5 mins(Q&A))

[Slides] Performance Optimizations and Analysis of Distributed Deep Learning with Approximated Second-Order Optimization Method
- Yohei Tsuji, Kazuki Osawa, Yuichiro Ueno, Akira Naruse, Rio Yokota and Satoshi Matsuoka
[Canceled] ~~Reducing global reductions in large-scale distributed training~~
- ~~Guojing Cong, Fan Zhou and Jamie Yang~~

[12:00-12:30] Session 3: Invited short presentations (13 mins + 2 mins (Q&A))

[Slides] State Preservation for Deep Learning Applications
- Bogdan Nicolae
[Slides] Toward Training a Large 3D Cosmological CNN with Hybrid Parallelization
- Yosuke Oyama

[12:30-12:35] Closing

Keynote

Speaker: Yaroslav Bulatov

Abstract: Growth in AI compute requirement has been outpacing Moore's Law which means that we are increasingly seeing neural network training utilizing multiple devices. In this talk I will go over current applications that require distributed training, strategies used to distribute computation, theoretical and engineering factors that limit applicability of these strategies.

Bio: In the last 8 years, Yaroslav worked at Google Brain and OpenAI, training large models such as Google's StreetView house number recognition. Currently he is at South Park Commons working on improving open-source ecosystem for distributed training of neural networks. His team had a top entry in DAWN ImageNet training competition and more recently released code for distributed training of language models on public cloud (https://github.com/cybertronai/transformer-xl)

Invited short presentation 1

Speaker: Bogdan Nicolae

Title: State Preservation for Deep Learning Applications

Abstract: Deep learning applications are rapidly gaining adoption both in industry and scientific computing (fusion energy science, computational fluid dynamics, lattice quantum chromodynamics, virtual drug response prediction, etc.) [1]. Such applications need to train and explore an enormous number of learning models, each of which needs to access large amounts of training data. Even training a single model can take a long time, therefore significant progress can be lost due to machine failures. However, state-of-art lacks adequate support for resilience. Furthermore, the growing generation rate of learning models leads to high data rates (e.g., 650 GB/s on the Summit pre-Exascale machine). State-of-art approaches cannot afford the I/O overhead of flushing the models to stable storage, which means a majority will be discarded after job completion, leading to potential duplication of effort in subsequent jobs [2]. Therefore, there is a need to efficiently preserve the state of learning models in a durable and resilient fashion enables restart in case of failures and/or later reuse in other jobs without compromising the application performance.

Caching approaches have been explored in the context of deep learning to preserve states in an ephemeral fashion. However, this is does not provide the desired durability and resilience properties. Checkpointing techniques have been extensively explored in the context of traditional, bulk synchronous HPC applications to capture global application states in a durable and resilient fashion. However, such techniques cannot be directly used for deep learning, both because of different goals (i.e., additional use cases beyond resilience) and different application behavior (i.e., looser coupling), which presents a new set of unique challenges and opportunities for state preservation. Currently there is little research being done in general within the checkpointing community to address such capabilities despite being requested by the deep learning community.

This talk is aims to explore the use of state preservation for deep learning applications, focusing both on the need to enable resilience to failures and the need to revisit previous learning models. First, it will discuss the key requirements and constraints that deep learning applications place on state preservation. Then, it introduces DataStates, a versioning-based data management system that is capable to capture data snapshots and applications states efficiently for later re-use (e.g. novel training strategies). Finally, it presents early results obtained with DataStates in the context of CANDLE (Cancer Deep Learning Environment), a large scale deep learning application developed at Argonne National Laboratory that combines the power of exascale computing with neural network-based machine learning to address a range of loosely connected problems in cancer research.

References

[1] https://www.alcf.anl.gov/projects/aurora-esp

[2] Scaling Deep Learning for Cancer with Advanced Workflow Storage Integration. Justin M. Wozniak, Philip E. Davis, Tong Shu, Jonathan Ozik, Nicholson Collier, Manish Parashar, Ian Foster, Thomas Brettin, and Rick Stevens. Proc. MLHPC @ Supercomputing, 2018

Invited short presentation 2

Speaker: Yosuke Oyama

Title: Toward Training a Large 3D Cosmological CNN with Hybrid Parallelization

Abstract: We report our preliminary work on large-scale training of a 3D convolutional neural network model for cosmological analyses of dark matter distributions. Previous work showed promising results for predicting cosmological parameters using CNNs trained on a large-scale parallel computing platform. However, due to its weak scaling nature, there exists a trade-off of training performance and prediction accuracy. This paper extends the existing work for better prediction accuracy and performance by exploiting finer-grained parallelism in distributed convolutions. We show significant improvements using the latest complex cosmological dataset with a huge model that was previously unfeasible due to its memory pressure. We achieve 1.42 PFlop/s on a single training task with a mini-batch size of 128 by using 512 Tesla V100 GPUs. Our results imply that the state-of-the-art deep learning case study can be further advanced with HPC-based algorithms.

Google Sites

Report abuse