Python
What is Python?
Python is a programming language that is easy to learn, easy to use, and easy to integrate with other software. It is widely used by researchers because it is expressive and high-level: you can write programs without worrying about memory usage or other low-level details. Python trades performance for ease of development.
This page is concerned with how Python is used on OIT-RC's systems, not with how to use Python itself. For a great Python tutorial, visit learnpython.org.
Getting Started
What's the difference between Python2 and Python3? What is Miniconda?
Python2 is an older version of Python. In general, using Python3 is highly recommended, since it includes major changes and improvements and is the most widely supported version.
Miniconda is a separate package management system for Python, and it also ships its own version of Python. For more on Conda, refer to here.
Loading the Python environment module
Python2 comes preinstalled on all of OIT-RC's Linux systems. Again, it is highly recommended not to use this version of Python unless there is a very specific and compelling reason to avoid Python3. Upon logging in to a system, you can verify which version is the default:
$ python --version
Python 2.7.5
To use Python3, it is necessary to load an environment module (refer to here for more).
Be advised: to use the preferred Intel Python 3, you need to load its specific module:
$ module add Python/intel/3.9.18/intel-24.0.0
Even after the module is loaded, /usr/bin/python will still refer to Python2.
$ module load Python/intel/3.9.18/intel-24.0.0
$ python --version
Python 3.9.18 :: Intel Corporation
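The same check can be made from inside Python, which is handy at the top of a script or in a job's log. The following is a minimal sketch; the function name describe_interpreter is only illustrative and is not part of any OIT-RC tooling.

```python
import sys


def describe_interpreter():
    """Return the running interpreter's version and location."""
    return {
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        "major": sys.version_info.major,
        "executable": sys.executable,  # path of the python binary in use
    }


info = describe_interpreter()
print("Python", info["version"], "at", info["executable"])
```

If the reported major version is 2, the environment module has most likely not been loaded in the current shell.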
Which version of Python3 should be used?
OIT-RC's systems offer many different versions of Python3. Here is a breakdown of when to use which.
Python/intel/3.9.18/intel-24.0.0 is recommended: it has the best performance and solid support, and almost all math, science, and data packages have already been added.
If the software or package being used requires a certain version of Python3, use that version. In general, later versions of Python are better supported and faster than previous versions.
To use Python with MPI (refer to the end of this document for more), do not choose Python/gcc/3.8.0/gcc-6.3.0; it does not support the MPI package described there.
If there is a package that is available on Conda but not Pip, use General/miniconda3/4.8.1.
For example, if your research needs Intel-built Python to take advantage of enhanced MKL or data manipulation, use Python/intel/3.9.18/intel-24.0.0 or an equivalent module.
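When a script depends on a particular Python3 version, it can check for it at startup instead of failing later with a confusing error. This is a minimal sketch; the minimum of 3.9 and the name meets_minimum are assumptions chosen for illustration.

```python
import sys

MINIMUM = (3, 9)  # hypothetical requirement; set this to what your package needs


def meets_minimum(minimum, current=None):
    """Return True if the interpreter version is at least `minimum`."""
    if current is None:
        current = sys.version_info[:2]
    return tuple(current) >= tuple(minimum)


# A real script might call sys.exit() here; this sketch only warns.
if not meets_minimum(MINIMUM):
    print("Warning: this script expects Python %d.%d or newer" % MINIMUM)
```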
Virtual Environments
Virtual environments let individual users install and load different sets of packages as needed into a small, clean, self-contained environment that is easy for you and others to use.
Virtual Environment using venv (RECOMMENDED)
The preferred way of using virtual environments in Python is with the venv package.
First, create a virtual environment. The -m venv option tells Python to run its venv module, and the environment can be given any name; here, it is named myPythonEnv.
$ module add Python/intel/3.9.18/intel-24.0.0
$ python -m venv myPythonEnv
Next, activate the venv. An sbatch script that uses a virtual environment must include the following line as well. An active venv shows its name in parentheses before the shell prompt symbol ($).
$ source myPythonEnv/bin/activate
(myPythonEnv) $
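A script can also confirm for itself that it is running on a venv's interpreter. The sketch below relies on sys.prefix and sys.base_prefix from the standard library; the function name in_virtualenv is illustrative only.

```python
import sys


def in_virtualenv():
    """Return True when running on an interpreter that belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the Python installation it was built from.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("environment root:", sys.prefix)
```

Note that this reports which interpreter is running the script, so it also returns True when the venv's python is called by its full path without sourcing activate.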
Once a virtual environment is active, use pip to install the libraries you need. Generally you will want pandas (which pulls in numpy), scipy, scikit-learn, dask, and ray.
(myPythonEnv) $ pip install --upgrade pip # get rid of the upgrade messages
(myPythonEnv) $ pip install pyperformance pandas dask ray scipy scikit-learn
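To confirm that the installation landed in the environment you expect, you can ask Python which modules it is able to find. This is a minimal sketch using importlib from the standard library; the list WANTED and the name missing_modules are illustrative, so adjust them to the packages you actually installed.

```python
import importlib.util

# Import names, which can differ from the pip package name
# (scikit-learn is imported as sklearn).
WANTED = ["pandas", "numpy", "scipy", "sklearn", "dask", "ray"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(WANTED)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All requested packages are importable.")
```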
Finally, when you are done with your session, deactivate the venv with the deactivate command. The virtual environment is also deactivated when the shell or the sbatch script ends. Deactivation is complete once the venv name no longer appears before the shell prompt.
(myPythonEnv) $ deactivate
$
Virtual Environment using Conda
An alternative way to create a virtual environment is Conda. Both the Intel Python and the Miniconda modules provide this capability; however, installing packages with Conda is very slow.
The following example will generally work for both; here we use Intel Python.
$ module add Python/intel/3.9.18/intel-24.0.0
$ conda create -p $HOME/myConda -c conda-forge
$ conda init bash
$ source $HOME/.bashrc
$ conda activate $HOME/myConda
$ conda install pyperformance pandas dask ray scipy scikit-learn
Be sure not to mix conda and pip install commands, which can leave the environment in an inconsistent state. The one exception is a package that is only available through pip.
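Before installing anything, it helps to know which Conda environment, if any, is active. The sketch below reads the CONDA_PREFIX variable that conda activate exports; the function name active_conda_env is illustrative only.

```python
import os


def active_conda_env(environ=None):
    """Return the path of the active Conda environment, or None."""
    if environ is None:
        environ = os.environ
    # `conda activate` exports CONDA_PREFIX with the environment's path.
    return environ.get("CONDA_PREFIX") or None


env = active_conda_env()
print("active Conda environment:", env if env else "none")
```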
Virtual Environment using Mamba (ADVANCED)
$ git clone https://github.com/pyenv/pyenv.git ~/.pyenv
$ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
$ echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
$ echo 'eval "$(pyenv init -)"' >> ~/.bashrc
Log out of the shell and log in again. This activates the paths to pyenv, from which you can install mamba.
$ pyenv install mambaforge-22.9.0-3
$ pyenv global mambaforge-22.9.0-3
$ mamba create -n myMamba -c bioconda -c conda-forge
$ mamba init bash
$ mamba activate myMamba
$ mamba install pyperformance pandas dask scipy scikit-learn
Using Python on a Cluster with SLURM
There are two ways of using Python on a cluster: running several copies of the same Python script (as with a job array), or having a Python script use MPI (Message Passing Interface) to communicate between several processes.
If any of the sbatch directives used are unclear or unfamiliar, please visit here for more.
Python with a SLURM Job Array
A job array runs a number of copies of the same program. This is useful for running the same test several times to compute an average, or for running the same script several times with an argument chosen according to which copy is running.
For more on job arrays, refer here.
Here is an example Python script:
$ cat mySquarePrinter.py
import sys
value = int(sys.argv[1])
print(sys.argv[1], "squared =", value * value)
This is the corresponding sbatch script:
$ cat submit.sh
#!/bin/bash
#SBATCH --job-name job_array_python # Specify that the sbatch job's name is job_array_python.
#SBATCH --partition short # Use a short partition since this is not a long running job.
#SBATCH --ntasks 1 # Allocate one task per subtask.
#SBATCH --output out-%a.txt # Specify where standard output should go (%a is replaced by the array task ID).
#SBATCH --error err-%a.txt # Specify where error output should go.
#SBATCH --array=0-3 # Refer to the link for the job arrays page just above the Python script.
module purge # Loaded modules can get carried into the sbatch script, so clean them out.
module load Python/intel/3.9.18/intel-24.0.0
PARAMS=(1 2 3 4)
python mySquarePrinter.py ${PARAMS[$SLURM_ARRAY_TASK_ID]}
$ chmod +x submit.sh
The Python script mySquarePrinter.py takes the first command-line argument (after the file name) and converts it from a string to an int. It then prints the value, followed by "squared =", and then the result of the value times itself.
The sbatch script submit.sh then uses a job array to create 4 copies (subtasks) of this script, where each copy requests a single task. Each subtask then runs mySquarePrinter.py with the corresponding data value from $PARAMS.
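As an alternative to indexing a bash array in the sbatch script, the Python script can read the array index itself from the SLURM_ARRAY_TASK_ID environment variable that SLURM sets for each subtask. This is a minimal sketch, not a replacement for the script above; the function name value_for_task and the fallback to index 0 outside of SLURM are assumptions made for illustration.

```python
import os

PARAMS = [1, 2, 3, 4]  # same values as the PARAMS array in submit.sh


def value_for_task(params, environ=None):
    """Pick this subtask's value using the SLURM_ARRAY_TASK_ID variable."""
    if environ is None:
        environ = os.environ
    # Default to 0 so the script can also be tried outside of SLURM.
    task_id = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    return params[task_id]


value = value_for_task(PARAMS)
print(value, "squared =", value * value)
```

With this approach the sbatch script would simply run the Python file with no argument, and the subtask whose index is 2 would work on the value 3.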
Python with MPI: What is mpi4py?
mpi4py is a Python package that provides MPI bindings, so that Python programs can use MPI.
The site for mpi4py is here, which contains more usage documentation.
Getting Started - Create a Virtual Environment and Install mpi4py
$ module load mvapich2-2.2-psm/gcc-6.3.0
$ # Refer to the section of this document about creating a virtual environment.
(myPythonEnv) $ pip3 install mpi4py
Note that mpich/gcc-6.3.0 can be used instead of mvapich2-2.2-psm/gcc-6.3.0. The OpenMPI environment modules will not work with mpi4py.
Using a Virtual Environment in Python and sbatch Scripts
Here is an example of using a Python virtual environment (venv) with mpi4py installed to run an MPI Python program on SLURM. This assumes the previous subsection has been completed.
Here is the Python script:
$ cat hello_mpi_python.py
#!/usr/bin/env python3
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank() # Rank is the process number, starting at 0.
size = comm.Get_size() # Size is the number of processes.
print("Hello, World! from process", rank, "of", size)
Here is the sbatch submission script:
$ cat submit.sh
#!/bin/bash
#SBATCH --job-name mpi4py_hello_world # Specify that the sbatch job's name is mpi4py_hello_world.
#SBATCH --partition short # Use a short partition since this is not a long running job.
#SBATCH --ntasks 4 # Allocate four tasks.
#SBATCH --nodes 2 # Spread the four tasks across two nodes.
#SBATCH --output out.txt # Specify that standard output should go to ./out.txt (and create the file if needed).
#SBATCH --error err.txt # Specify that error output should go to ./err.txt (and create the file if needed).
module purge # Loaded modules can get carried into the sbatch script, so clean them out.
module load mvapich2-2.2-psm/gcc-6.3.0
module load Python/gcc/3.7.5/gcc-6.3.0
source ./myPythonEnv/bin/activate # Activate the virtual environment.
mpiexec -n 4 python3 hello_mpi_python.py # Create four tasks, each of which runs its own hello_mpi_python.py
Submit the sbatch script:
$ sbatch submit.sh
Here is err.txt (this file is empty):
$ cat err.txt
$
Here is out.txt (the order of the lines does not matter, as long as every process number appears exactly once):
$ cat out.txt
Hello, World! from process 3 of 4
Hello, World! from process 1 of 4
Hello, World! from process 0 of 4
Hello, World! from process 2 of 4
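Beyond printing a greeting, the rank and size values are typically used to decide which part of a larger job each process handles. The sketch below shows one common way to do that, a round-robin split, in plain Python so it can be tried without MPI. The function name indices_for_rank and the figure of 10 work items are assumptions for illustration.

```python
def indices_for_rank(n_items, rank, size):
    """Return the item indices handled by `rank` out of `size` processes."""
    # Round-robin split: rank r takes items r, r + size, r + 2*size, ...
    return list(range(rank, n_items, size))


# In an MPI program these two values would come from
# MPI.COMM_WORLD.Get_rank() and MPI.COMM_WORLD.Get_size().
size = 4
for rank in range(size):
    print("rank", rank, "handles", indices_for_rank(10, rank, size))
```

Every index is assigned to exactly one rank, so no coordination between processes is needed to divide the work.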
Common Issues
Several lines of "Hello, World! from process 0 of 1"
This means the module mvapich2-2.2-psm/gcc-6.3.0 was not loaded when mpi4py was installed, when the sbatch script ran, or both. Using a different MPI module has the same effect. Try adding the module to the sbatch script; if that does not work, delete the virtual environment (rm -rf ./myPythonEnv) and rebuild it with mvapich2-2.2-psm/gcc-6.3.0 loaded.
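For larger runs, checking out.txt by eye becomes tedious. The following is a minimal sketch of a checker for the greeting lines produced by hello_mpi_python.py; the function name check_hello_output and its messages are illustrative, and the sample lines are built in the script rather than read from a real job.

```python
import re

LINE = re.compile(r"Hello, World! from process (\d+) of (\d+)")


def check_hello_output(lines):
    """Return a short diagnosis for the lines of out.txt."""
    pairs = [(int(m.group(1)), int(m.group(2)))
             for m in (LINE.search(line) for line in lines) if m]
    if not pairs:
        return "no greeting lines found"
    ranks = sorted(rank for rank, _ in pairs)
    sizes = {size for _, size in pairs}
    # Several "0 of 1" lines mean each task ran as its own one-process job.
    if sizes == {1} and len(pairs) > 1:
        return "independent copies: MPI module mismatch suspected"
    # A healthy run reports one size and every rank from 0 to size - 1 once.
    if len(sizes) == 1 and ranks == list(range(sizes.pop())):
        return "ok"
    return "unexpected ranks"


good = ["Hello, World! from process %d of 4" % r for r in (3, 1, 0, 2)]
bad = ["Hello, World! from process 0 of 1"] * 4
print(check_hello_output(good))
print(check_hello_output(bad))
```

To use it on a real job, pass the lines of out.txt, for example open("out.txt").read().splitlines().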