TP: Distributed Optimization and Games

Time: 9-12h, 13/02/2020, 20/02/2020
Location: rez-de-chaussée n°281, campus Lucioles
Teachers: Chuan Xu, Othmane Marfoq

Tutorial for using the NEF cluster (please finish all the following steps before 13/02/2020)

  1. Create a private account for the Inria nef cluster here [Please do it as soon as possible, as it may take time to create the account]:

      • for company : please choose "other"

      • for mail address : please use UCA email (no gmail, etc.)

      • expiration date : 2020-02-28

      • ssh public key : generate ssh key (Linux, Windows, macOS) and attach the public one

reminder - openssh format is needed for the key (windows/putty users will need to convert the key to openssh format - tutorial)

      • usecase description : please indicate "UCA master of Data Science hands on"

      • account creation may take several days; a confirmation email will be sent once the account is created

2. Log in to the cluster once the account is created:

3. Reserve computing resources in the cluster and log into the reserved node:

3. Get PyTorch prepared: there are multiple ways; Method 3 is recommended.

      • to check that PyTorch is correctly installed, type the following in the terminal:

          • python

          • import torch

          • print(torch.__version__)

4. Python editor on the cluster: Vim is recommended; please learn its basic operations in advance.

5. If you have any questions concerning the previous points, please contact me: chuan.xu@inria.fr.

6. Please install PyTorch on your own computer as well: "conda install pytorch torchvision cudatoolkit=10.1 -c pytorch"
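The interactive check in step 3 can also be run as a small script, on the cluster or on your own computer. The following is a minimal sketch, not part of the course material: the function name `pytorch_status` is hypothetical, and the script only reports what it finds without installing anything.

```python
import importlib.util


def pytorch_status():
    """Return a short, human-readable report on the local PyTorch install."""
    # find_spec looks the package up without importing it, so the script
    # still runs (and says so) on a machine where PyTorch is missing.
    if importlib.util.find_spec("torch") is None:
        return "torch: not installed"

    import torch  # imported lazily, only once we know it is present

    lines = ["torch: " + torch.__version__]
    # torchvision is checked separately because it is a distinct package.
    has_torchvision = importlib.util.find_spec("torchvision") is not None
    lines.append("torchvision: " + ("installed" if has_torchvision else "not installed"))
    # True only if a GPU and a matching CUDA build of PyTorch are available.
    lines.append("CUDA available: " + str(torch.cuda.is_available()))
    return "\n".join(lines)


if __name__ == "__main__":
    print(pytorch_status())
```

If the first line reads "torch: not installed", redo the installation step before going further.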

TP1: PyTorch for Distributed Machine Learning

Assignment

Go to https://gitlab.inria.fr/chxu/distributed-machine-learning-course for the code.

TP1 Answers

1) Run the Basic_Machine_Learning code with the command: python Basic_Machine_Learning.py

2) If the torchvision package is not installed, please follow these steps:

  • Check which Python is used with "which python": the output should be "/misc/opt/conda3-5.0.1/bin/python". If not, redo Method 3

  • Check which pip is used with "which pip": the output should be "/misc/opt/conda3-5.0.1/bin/pip"

  • Install torchvision by: pip install --user torchvision

  • It should then work; if not, please contact me by email.

3) The paper on the Parameter Server Algorithm is available here (see Algorithm 1 on page 586).

4) The answer to Ex I (5) has been uploaded here. The answer to Ex II will be uploaded to the same place on 17/02.

5) The answer to Ex II has been uploaded here; you can access it directly by updating the git directory on the cluster with: git pull origin master.

Then go to the /answers/ directory and run the ./PS.sh script to launch the program.
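To fix ideas about the parameter server scheme mentioned in 3), here is a minimal single-process sketch of a synchronous variant: workers compute gradients on their own data shard, and the server averages them and updates the shared parameters. It is an illustration under simplifying assumptions (toy linear model, plain Python, no real communication) and is not the uploaded answer; all function names are hypothetical.

```python
import random


def make_shards(num_workers, samples_per_worker, seed=0):
    """Toy noise-free data y = 2x + 1, split into one shard per worker."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(samples_per_worker)]
        shards.append([(x, 2.0 * x + 1.0) for x in xs])
    return shards


def worker_gradient(params, shard):
    """Gradient of the mean squared error of y ~ w*x + b on one shard."""
    w, b = params
    grad_w = 0.0
    grad_b = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        grad_w += 2.0 * err * x
        grad_b += 2.0 * err
    n = len(shard)
    return [grad_w / n, grad_b / n]


def server_update(params, gradients, lr):
    """Average the workers' gradients and take one gradient step."""
    k = len(gradients)
    avg = [sum(g[i] for g in gradients) / k for i in range(len(params))]
    return [p - lr * a for p, a in zip(params, avg)]


def train(num_workers=4, rounds=200, lr=0.3):
    shards = make_shards(num_workers, samples_per_worker=50)
    params = [0.0, 0.0]  # [w, b], held by the server
    for _ in range(rounds):
        # "Pull": every worker reads the same current parameters.
        # "Push": every worker sends back the gradient on its shard.
        grads = [worker_gradient(params, shard) for shard in shards]
        params = server_update(params, grads, lr)
    return params


if __name__ == "__main__":
    w, b = train()
    print("w =", round(w, 4), "b =", round(b, 4))
```

In a real deployment the loop over shards is replaced by separate processes that exchange tensors with the server; the sequential loop here only mimics the order of operations within one synchronous round.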

TP2 (20/02/2020)

  1. Run the answer to Ex I (5) and understand it

  2. Run the answer to Ex II and understand it

  3. GPU usage: oarsub -p "gpu='YES'" -l /gpunum=1,walltime=4 -I

  4. Run the prepared GPU version of the code: ./PS_gpu.sh. Try different datasets and models. Available ones: "Mnist", "MnistHeter" and "NLP".

  5. Based on the code of "PS_gpu.py", develop the code for federated learning (see Algorithm 1).

  6. Create the consensus version.
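As a starting point for steps 5 and 6, the sketch below contrasts the two communication patterns on a toy one-parameter model. It assumes that the federated algorithm is of the federated-averaging kind (clients run several local steps, the server averages the resulting models weighted by sample count) and that "consensus" means a serverless scheme where each node averages its model with its neighbours before a local step. Both are assumptions; the function names, the data and the graph are hypothetical and are not taken from "PS_gpu.py".

```python
# Toy noise-free data y = 3x, one shard per client/node.
SHARDS = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(-1.0, -3.0)],
    [(0.5, 1.5), (1.5, 4.5), (-2.0, -6.0)],
]


def local_sgd(w, shard, lr, local_steps):
    """Run a few full-batch gradient steps on one shard (model y ~ w*x)."""
    for _ in range(local_steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)
        w = w - lr * grad
    return w


def fedavg_round(w_global, shards, lr=0.1, local_steps=5):
    """Server-based round: clients train locally, the server averages models."""
    local_models = [local_sgd(w_global, s, lr, local_steps) for s in shards]
    total = sum(len(s) for s in shards)
    # Each client's model is weighted by its number of samples.
    return sum(len(s) * w for s, w in zip(shards, local_models)) / total


def consensus_round(models, shards, neighbours, lr=0.1):
    """Serverless round: average with neighbours, then one local step."""
    mixed = []
    for i in range(len(models)):
        group = [i] + neighbours[i]  # the node itself plus its neighbours
        mixed.append(sum(models[j] for j in group) / len(group))
    return [local_sgd(w, s, lr, 1) for w, s in zip(mixed, shards)]


if __name__ == "__main__":
    w = 0.0
    for _ in range(20):
        w = fedavg_round(w, SHARDS)
    print("federated averaging:", round(w, 6))

    line_graph = [[1], [0, 2], [1]]  # node 1 is linked to nodes 0 and 2
    models = [0.0, 0.0, 0.0]
    for _ in range(100):
        models = consensus_round(models, SHARDS, line_graph)
    print("consensus:", [round(m, 6) for m in models])
```

The main design difference is where the averaging happens: in `fedavg_round` a single global model is redistributed at every round, whereas in `consensus_round` each node keeps its own model and only agreement between neighbours pulls the models together.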