TP: Distributed Optimization and Games

Time: 9:00-12:00, 13/02/2020 and 20/02/2020
Location: ground floor (rez-de-chaussée), n°281, campus Lucioles
Teachers: Chuan Xu, Othmane Marfoq

Tutorial for using the NEF cluster (please finish all of the following steps before 13/02/2020)

1. Create a private account for the Inria NEF cluster here (please do this as soon as possible, as it may take time to create the account):

  - for company: please choose "other"
  - for mail address: please use your UCA email (no Gmail, etc.)
  - expiration date: 2020-02-28
  - ssh public key: generate an ssh key (Linux, Windows, macOS) and attach the public one. Reminder: the key must be in OpenSSH format (Windows/PuTTY users will need to convert the key to OpenSSH format - tutorial)
  - usecase description: please indicate "UCA master of Data Science hands on"
  - account creation may require several days; a confirmation email will be sent once the account is created

2. Log in to the cluster once the account is created:

  - ssh username@nef-frontal.inria.fr
  - ssh username@nef-devel.inria.fr
  - to transfer a local file from your computer to your home directory on the cluster: scp localfile username@nef-frontal.inria.fr:~/

3. Reserve computing resources in the cluster and log into the reserved node:

  - oarsub -l /nodes=1/core=2,walltime=1 -I (reserves 2 CPU cores on one node for one hour)
  - more variants for the commands
  - hardware details of the cluster

4. Get PyTorch prepared: there are multiple ways; Method 3 is recommended.

  - to check that PyTorch is correctly installed, type the following in the terminal:
      - python
      - import torch
      - print(torch.__version__)

5. Python editor on the cluster: Vim is recommended; please learn its basic operations in advance.

6. If you have any questions concerning the previous points, please contact me: chuan.xu@inria.fr.

7. Please install PyTorch on your own computer as well: "conda install pytorch torchvision cudatoolkit=10.1 -c pytorch"

1) Run the Basic_Machine_Learning code with the command: python Basic_Machine_Learning.py

2) If the torchvision package is not installed, please follow these steps:

Check which Python interpreter is used with "which python": it should return "/misc/opt/conda3-5.0.1/bin/python". If not, redo Method 3.
Check which pip is used with "which pip": it should return "/misc/opt/conda3-5.0.1/bin/pip".
Install torchvision with: pip install --user torchvision
It should then work; if not, please contact me by email.
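
The checks above can also be done from inside Python. The following is a small sketch, not part of the course material: the helper names `package_version` and `environment_report` are made up for illustration, and the script only uses the standard library, so it also runs when PyTorch is missing.

```python
import importlib
import importlib.util
import sys


def package_version(name):
    """Return the version of a top-level package, or None if it is not installed."""
    if importlib.util.find_spec(name) is None:
        return None
    module = importlib.import_module(name)
    # Not every package defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


def environment_report(packages=("torch", "torchvision")):
    """Collect the interpreter path and the version of each requested package."""
    report = {"python": sys.executable}
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value if value is not None else 'NOT INSTALLED'}")
```

On the cluster, the `python` entry should normally be the same interpreter that "which python" reports; a `NOT INSTALLED` line for torchvision means the pip step above is still needed.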

3) The paper on the Parameter Server algorithm is available here (see Algorithm 1 on page 586).
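
To get a feeling for the idea before reading the paper, here is a minimal single-process sketch of a synchronous parameter-server loop: workers pull the model, compute a gradient on their own data shard, and push it back; the server averages the gradients and updates the model. This is an illustration under simplifying assumptions (toy linear regression, plain Python lists, simulated workers), not a transcription of Algorithm 1 and not the code used in the exercises. All names (`ParameterServer`, `make_shards`, `train`) are hypothetical.

```python
import random


def make_shards(num_workers=4, n=200, seed=0):
    """Toy noiseless data y = 2*x0 - x1, split evenly across simulated workers."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 2.0 * x[0] - 1.0 * x[1]))
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(w, shard):
    """Gradient of 0.5 * mean((w.x - y)^2) over one worker's shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(shard)
    return grad


class ParameterServer:
    """Holds the shared model and applies the averaged worker gradients."""

    def __init__(self, dim, lr):
        self.w = [0.0] * dim
        self.lr = lr
        self.pending = []

    def pull(self):
        return list(self.w)          # workers receive a copy of the model

    def push(self, grad):
        self.pending.append(grad)    # store gradients until all workers reported

    def step(self):
        n = len(self.pending)
        for j in range(len(self.w)):
            self.w[j] -= self.lr * sum(g[j] for g in self.pending) / n
        self.pending = []


def train(shards, dim, lr=0.5, rounds=200):
    server = ParameterServer(dim, lr)
    for _ in range(rounds):
        for shard in shards:         # each iteration plays the role of one worker
            server.push(local_gradient(server.pull(), shard))
        server.step()                # synchronous update once per round
    return server.w


if __name__ == "__main__":
    print(train(make_shards(), dim=2))
```

In the exercises the workers are separate processes and the model is a PyTorch network, but the pull / compute / push / update cycle is the same.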

4) The answer to Ex I (5) has been uploaded here. The answer to Ex II will be uploaded to the same place on 17/02.

5) The answer to Ex II has been uploaded here; you can access it directly by updating the git directory on the cluster with: git pull origin master.

Then go to the /answers/ directory and run the ./PS.sh script to launch the program.

Run the answer to Ex I (5) and understand it.
Run the answer to Ex II and understand it.
GPU usage: oarsub -p "gpu='YES'" -l /gpunum=1,walltime=4 -I
Run the prepared GPU version of the code: ./PS_gpu.sh. Try different datasets and models. Available ones: "Mnist", "MnistHeter" and "NLP".
Based on the code of "PS_gpu.py", develop the code for federated learning (see Algorithm 1).
Create the consensus version.