Applications Guide
AlphaFold 3
AlphaFold 3 differs from earlier releases in that it primarily uses a JSON input file (fold_input.json) to specify the sequences (and, if desired, advanced parameters or special configurations). The script run_alphafold.py in AlphaFold 3 reads this JSON file to produce protein structure predictions.
NB: An example submission script and input files can be found on Amarel in /projects/community/alphafold/vs3.0.0/pgarias/examples.
Submission script explanation
An example submission script for use with AlphaFold 3 on Amarel is provided below (see "Submission script example"). The SLURM flags can be modified to suit your resource requirements (--time, --mem, etc.).
Below is an explanation of the SLURM directives in the submission script:
#SBATCH --partition=gpu # Partition name
#SBATCH --job-name=job_name # Your alphafold3 job name
#SBATCH --gres=gpu:1 # Number of gpus needed, keep at 1
#SBATCH --ntasks=1 # Number of tasks, keep at 1
#SBATCH --cpus-per-task=1 # Number of CPU cores per task
#SBATCH --mem=100G # This may need to change based on the number of tokens
#SBATCH --time=03:00:00 # This may need to change according to the requirements of the job
#SBATCH --constraint=ampere|adalovelace # Required: limits the job to GPU architectures that AlphaFold 3 has been tested on
You will also need to load the required modules:
module purge
module use /projects/community/modulefiles
module load apptainer/1.2.5
module load alphafold/vs3.0.0-pgarias
The AlphaFold 3 environment variables ($ALPHAFOLD_MODELWEIGHTS, $ALPHAFOLD_DATA_PATH, $CONTAINERDIR) are set when the alphafold module is loaded. The script then executes the container image with Apptainer:
apptainer exec \
-B ./af_input:/root/af_input \
-B ./af_output:/root/af_output \
-B $ALPHAFOLD_MODELWEIGHTS:/root/models \
-B $ALPHAFOLD_DATA_PATH:/root/public_databases \
--pwd /app/alphafold \
--nv $CONTAINERDIR/alphafold3.sif \
python run_alphafold.py \
--json_path=/root/af_input/fold_input.json \
--db_dir=/root/public_databases \
--model_dir=/root/models \
--output_dir=/root/af_output
You will need to create the following directories at the same level as the submission script (see the example commands after this list):
Input Directory: Contains your fold_input.json file (and any additional support files, if needed). For example:
./af_input/fold_input.json
Output Directory: This will store generated prediction files, logs, and model outputs. For example:
./af_output/
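For example, from the directory containing your submission script (the source path for fold_input.json below is only illustrative):
mkdir -p af_input af_output
cp /path/to/your/fold_input.json af_input/fold_input.json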
Submission script example
#!/bin/bash
#SBATCH --partition=gpu # Partition (job queue)
#SBATCH --requeue # Return job to the queue if preempted
#SBATCH --job-name=my_alphafold3 # Job name
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Total # of tasks
#SBATCH --cpus-per-task=1 # Cores per task
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --mem=100G # Amount of system RAM
#SBATCH --constraint=ampere|adalovelace
#SBATCH --time=24:00:00 # Max run time (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT file
#SBATCH --error=slurm.%N.%j.err # STDERR file
#SBATCH --export=ALL # Export current env to job
module purge
module use /projects/community/modulefiles
module load apptainer/1.2.5
module load alphafold/vs3.0.0-pgarias
# Run AlphaFold 3 inside Apptainer
apptainer exec \
-B ./af_input:/root/af_input \
-B ./af_output:/root/af_output \
-B $ALPHAFOLD_MODELWEIGHTS:/root/models \
-B $ALPHAFOLD_DATA_PATH:/root/public_databases \
--pwd /app/alphafold \
--nv $CONTAINERDIR/alphafold3.sif \
python run_alphafold.py \
--json_path=/root/af_input/fold_input.json \
--db_dir=/root/public_databases \
--model_dir=/root/models \
--output_dir=/root/af_output
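Assuming the script above is saved as run_alphafold3.sh (the filename is arbitrary), submit and monitor it as usual:
sbatch run_alphafold3.sh
squeue -u $USER        # check job status
ls af_output/          # prediction files appear here when the job completes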
JSON File
AlphaFold 3 uses a JSON file to define sequences and any advanced parameters. Below is a minimal single-protein example in the format expected by run_alphafold.py, though your use case may vary:
{
  "name": "MyProtein",
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "YOUR_PROTEIN_SEQUENCE_HERE"
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}
where:
name: A descriptive name for the run; AlphaFold 3 uses it to label the output.
sequences: List of entities to model; each protein entry includes a chain id and the raw sequence.
modelSeeds: List of random seeds; one set of predictions is produced per seed.
dialect and version: Identify the input format; use "alphafold3" and 1, respectively.
For a full list of available parameters, see the AlphaFold 3 input docs.
Multiple sequences JSON file
Below is an example of how to organize multiple protein sequences into a single AlphaFold 3 JSON input file (e.g., fold_input.json):
{
"name": "Tyrosine-protein phosphatase",
"sequences": [
{
"protein": {
"id": "A",
"sequence": "MVDATRVPMDERFRTLKKKLEEGMVFTEYEQIPKKKANGIFSTAALPENAERSRIREVVPYEENRVELIPTKENNTGYINASHIKVVVGGAEWHYIATQGPLPHTCHDFWQMVWEQGVNVIAMVTAEEEGGRTKSHRYWPKLGSKHSSATYGKFKVTTKFRTDSVCYATTGLKVKHLLSGQERTVWHLQYTDWPDHGCPEDVQGFLSYLEEIQSVRRHTNSMLEGTKNRHPPIVVHCSAGVGRTGVLILSELMIYCLEHNEKVEVPMMLRLLREQRMFMIQTIAQYKFVYQVLIQFLQNSRLI"
}
},
{
"protein": {
"id": "B",
"sequence": "GHMAEPQRHTMLCMCCKCEARIELVVESSADDLRAFQQLFLNTLSFVCPWCASQQ"
}
}
],
"modelSeeds": [1,2],
"dialect": "alphafold3",
"version": 1
}
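Before submitting, it can save time to confirm that the JSON file is syntactically valid (load a Python module first if python3 is not already on your PATH); for example:
python3 -m json.tool af_input/fold_input.json > /dev/null && echo "JSON OK"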
AlphaFold 2 (with Multimers)
A recent update of AlphaFold 2 supports predictions for both monomer and multimer inputs, and either model type can be run with the full databases (full_dbs) or the reduced databases (reduced_dbs).
Submission scripts
Example submission scripts for the four cases are provided below. In each case, the user needs to modify the following flags:
--max_template_date=YYYY-MM-DD
--fasta_paths=<FASTA FILE PATH>
--output_dir=<OUTPUT FILE DIRECTORY PATH>
For further explanations and additional options please see run_alphafold.py.
SLURM flags can be modified according to the suitable resource requirements (--gres=gpu:#, --nodes, --ntasks,--time, --mem, etc.)
Below is an explanation of the Singularity flags used in the submission scripts:
The databases and models are stored in $ALPHAFOLD_DATA_PATH, which is set when you run module load alphafold (after module use /projects/community/modulefiles).
A cache file ld.so.cache will be written to /etc, which is not allowed on Amarel. The workaround is to bind-mount e.g. the current working directory to /etc inside the container. [-B .:/etc]
You must launch AlphaFold from /app/alphafold inside the container (a known issue with the container build). [--pwd /app/alphafold]
The --nv flag enables GPU support.
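Before submitting, you can confirm that the alphafold module has set these paths; for example:
module use /projects/community/modulefiles
module load singularity/3.6.4
module load alphafold
echo $ALPHAFOLD_DATA_PATH    # location of the sequence databases
echo $CONTAINERDIR           # location of the alphafold.sif image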
Alphafold multimer, reduced database (reduced_dbs):
#!/bin/bash
#SBATCH --partition=gpu # Partition (job queue)
#SBATCH --requeue # Return job to the queue if preempted
#SBATCH --job-name=alphafold # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=8 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=1 # Cores per task (>1 if multithread tasks)
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --mem=64G # Real memory (RAM) required (MB), 0 is the whole-node memory
#SBATCH --time=03:00:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT output file
#SBATCH --error=slurm.%N.%j.err # STDERR output file (optional)
#SBATCH --export=ALL # Export your current env to the job env
module purge
module use /projects/community/modulefiles
module load singularity/3.6.4
module load alphafold
singularity run -B $ALPHAFOLD_DATA_PATH:/data -B .:/etc --pwd /app/alphafold --nv $CONTAINERDIR/alphafold.sif \
--data_dir=/data \
--uniref90_database_path=/data/uniref90/uniref90.fasta \
--mgnify_database_path=/data/mgnify/mgy_clusters_2018_12.fa \
--template_mmcif_dir=/data/pdb_mmcif/mmcif_files/ \
--obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
--fasta_paths=<FASTA FILE PATH> \
--output_dir=<OUTPUT FILE DIRECTORY PATH> \
--model_preset=multimer \
--db_preset=reduced_dbs \
--small_bfd_database_path=/data/small_bfd/bfd-first_non_consensus_sequences.fasta \
--pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
--uniprot_database_path=/data/uniprot/uniprot.fasta \
--max_template_date=YYYY-MM-DD \
--use_gpu_relax=True
Alphafold monomer, reduced database (reduced_dbs):
#!/bin/bash
#SBATCH --partition=gpu # Partition (job queue)
#SBATCH --requeue # Return job to the queue if preempted
#SBATCH --job-name=alphafold # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=8 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=1 # Cores per task (>1 if multithread tasks)
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --mem=64G # Real memory (RAM) required (MB), 0 is the whole-node memory
#SBATCH --time=03:00:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT output file
#SBATCH --error=slurm.%N.%j.err # STDERR output file (optional)
#SBATCH --export=ALL # Export your current env to the job env
module purge
module use /projects/community/modulefiles
module load singularity/3.6.4
module load alphafold
singularity run -B $ALPHAFOLD_DATA_PATH:/data -B .:/etc --pwd /app/alphafold --nv $CONTAINERDIR/alphafold.sif \
--data_dir=/data \
--uniref90_database_path=/data/uniref90/uniref90.fasta \
--mgnify_database_path=/data/mgnify/mgy_clusters_2018_12.fa \
--template_mmcif_dir=/data/pdb_mmcif/mmcif_files/ \
--obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
--fasta_paths=<FASTA FILE PATH> \
--output_dir=<OUTPUT FILE DIRECTORY PATH> \
--model_preset=monomer \
--db_preset=reduced_dbs \
--small_bfd_database_path=/data/small_bfd/bfd-first_non_consensus_sequences.fasta \
--pdb70_database_path=/data/pdb70/pdb70 \
--max_template_date=YYYY-MM-DD \
--use_gpu_relax=True
Alphafold multimer, full database (full_dbs):
#!/bin/bash
#SBATCH --partition=gpu # Partition (job queue)
#SBATCH --requeue # Return job to the queue if preempted
#SBATCH --job-name=alphafold # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=8 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=1 # Cores per task (>1 if multithread tasks)
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --mem=64G # Real memory (RAM) required (MB), 0 is the whole-node memory
#SBATCH --time=03:00:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT output file
#SBATCH --error=slurm.%N.%j.err # STDERR output file (optional)
#SBATCH --export=ALL # Export your current env to the job env
module purge
module use /projects/community/modulefiles
module load singularity/3.6.4
module load alphafold
singularity run -B $ALPHAFOLD_DATA_PATH:/data -B .:/etc --pwd /app/alphafold --nv $CONTAINERDIR/alphafold.sif \
--data_dir=/data \
--uniref90_database_path=/data/uniref90/uniref90.fasta \
--mgnify_database_path=/data/mgnify/mgy_clusters_2018_12.fa \
--template_mmcif_dir=/data/pdb_mmcif/mmcif_files/ \
--obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
--fasta_paths=/scratch/pgarias/fastafiles/A2B2heterodimer.fasta \
--output_dir=/scratch/pgarias/outputdir_mm/full_dbs_multimer \
--model_preset=multimer \
--db_preset=full_dbs \
--bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
--uniprot_database_path=/data/uniprot/uniprot.fasta \
--uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--max_template_date=YYYY-MM-DD \
--use_gpu_relax=True
Alphafold monomer, full database (full_dbs):
#!/bin/bash
#SBATCH --partition=gpu # Partition (job queue)
#SBATCH --requeue # Return job to the queue if preempted
#SBATCH --job-name=alphafold # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=8 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=1 # Cores per task (>1 if multithread tasks)
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --mem=64G # Real memory (RAM) required (MB), 0 is the whole-node memory
#SBATCH --time=03:00:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT output file
#SBATCH --error=slurm.%N.%j.err # STDERR output file (optional)
#SBATCH --export=ALL # Export your current env to the job env
module purge
module use /projects/community/modulefiles
module load singularity/3.6.4
module load alphafold
singularity run -B $ALPHAFOLD_DATA_PATH:/data -B .:/etc --pwd /app/alphafold --nv $CONTAINERDIR/alphafold.sif \
--data_dir=/data \
--uniref90_database_path=/data/uniref90/uniref90.fasta \
--mgnify_database_path=/data/mgnify/mgy_clusters_2018_12.fa \
--template_mmcif_dir=/data/pdb_mmcif/mmcif_files/ \
--obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
--fasta_paths=/scratch/pgarias/fastafiles/sarscovid2.fasta \
--output_dir=/scratch/pgarias/outputdir_mm/full_dbs_monomer \
--model_preset=monomer \
--db_preset=full_dbs \
--bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--pdb70_database_path=/data/pdb70/pdb70 \
--uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--max_template_date=YYYY-MM-DD \
--use_gpu_relax=True
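After filling in the placeholder flags, submit whichever script fits your case and, once it completes, review the actual run time and memory usage to refine future resource requests (the script filename below is illustrative):
sbatch alphafold2_full_dbs_monomer.sh
sacct -j <jobID> --format=JobID,Elapsed,MaxRSS,State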
Conda
Conda is an open-source, cross-platform, language-agnostic package manager and environment management system. Conda packages are binaries, so no compilers are needed to install them. The Conda ecosystem includes the following components:
Anaconda is a Conda-based distribution that includes many Python packages and extensions, such as conda, numpy, scipy, IPython notebook, etc.
Miniconda is a minimal alternative to Anaconda that includes only a small set of core packages along with Conda.
While Conda packages are binary distributions that allow fast installation, other forms of installation are also supported inside Conda environments, including pip and installation from source. By default, each package is installed along with its dependencies.
Anaconda
You can either install your own version of Anaconda in your home directory, or you can use a version from the community-contributed modules. Here's how to search for packages with 'anaconda' in their description:
module use /projects/community/modulefiles
module keyword anaconda
anaconda: anaconda/2020.07-gc563
Sets up Anaconda 2020.07 for your environment
py-data-science-stack: py-data-science-stack/5.1.0-kp807
Sets up anaconda 5.1.0 in your environment
py-image: py-image/2020-bd387
Sets up anaconda in your environment for tensorflow and keras
To use the above anaconda (2020.07) module and set up a conda environment:
$ module use /projects/community/modulefiles/
$ module load anaconda/2020.07-gc563
$ conda init bash ##configure your bash shell for conda, auto update your .bashrc file
$ cd
$ source .bashrc
$ mkdir -p .conda/pkgs/cache .conda/envs ##These folders will store the environments you are going to build
To see the environments already installed with the module (you'll see that tensorflow-2.3.0 and pytorch-1.7.0 are installed; you may activate and use them):
conda env list
To create your own conda environment, for example one with TensorFlow 2.3 and Python 3.8, named "tf2":
conda create --name tf2 tensorflow==2.3 python=3.8
This env will be located at: /home/<netID>/.conda/envs/tf2
# To activate your above TensorFlow environment:
$ conda activate tf2
# After you are done working in that environment, deactivate it:
$ conda deactivate
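As a quick sanity check that the new environment works (the import test below is just an example):
$ conda activate tf2
$ python -c "import tensorflow as tf; print(tf.__version__)"
$ conda deactivate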
Miniconda
There is a copy of miniconda installed in the community folder: /projects/community/miniconda3/condabin/conda
The first time you use it, you need to initialize it:
/projects/community/miniconda3/bin/conda init ##This will set up the conda environment via your .bashrc file
/projects/community/miniconda3/bin/conda config --set auto_activate_base false ##so it won't automatically activate
source ~/.bashrc
To create your own environment and save it in your home directory, using the above community miniconda:
$ cd
$ mkdir -p .conda/pkgs/cache .conda/envs ##These folders will store your env and related files
$ conda create --name mytest python=3.8 numpy ##create an env called mytest with python 3.8 and numpy
You can verify that your new env "mytest" is located at: /home/<netID>/.conda/envs/mytest
To use the env, you need to activate it:
$ conda activate mytest
When you are done, deactivate it:
$ conda deactivate
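A minimal sketch of using such an environment inside a batch job (the resource requests are illustrative; conda.sh is the activation script that ships in the standard location inside the community Miniconda install):
#!/bin/bash
#SBATCH --job-name=mytest         # Job name
#SBATCH --ntasks=1                # Number of tasks
#SBATCH --cpus-per-task=1         # Cores per task
#SBATCH --mem=2G                  # Real memory (RAM) required
#SBATCH --time=00:10:00           # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out  # STDOUT output file

source /projects/community/miniconda3/etc/profile.d/conda.sh   # make the conda command available in the batch shell
conda activate mytest
python -c "import numpy; print(numpy.__version__)"
conda deactivate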
MATLAB
There are a few example MATLAB jobs located here: /projects/oarc/users/examples/matlab
Running in parallel (CPU cores only)
To run just about anything in parallel with MATLAB, you must have MATLAB code that's designed to run in parallel. That's usually accomplished by using parpool and parfor in your code. Before scaling-up to large numbers of cores, be sure your MATLAB script can efficiently utilize many cores (e.g., run some benchmarking jobs to see how the performance of your calculations scales when increasing the number of cores) because more cores doesn't always mean things will run faster.
Here's an example SLURM job script for running a parallel-enabled MATLAB script across 4 cores (i.e., 4 MATLAB worker processes) on a single compute node:
#!/bin/bash
#SBATCH --job-name=mparfor # Assign a short name to your job
#SBATCH --cpus-per-task=4 # Cores per task (>1 if multithread tasks)
#SBATCH --mem=16000 # Real memory (RAM) required (MB)
#SBATCH --time=00:05:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT output file
mkdir -p /scratch/$USER/$SLURM_JOB_ID
cd /scratch/$USER/$SLURM_JOB_ID
module purge
rm -rf ~/.matlab
module load MATLAB/R2020a
srun matlab -nodisplay < MonteCarloPi.m
cd ; rm -rf /scratch/$USER/$SLURM_JOB_ID
Running on GPUs
For MATLAB to take advantage of GPU hardware, the gpuArray function must be used in your MATLAB script.
Here's an example SLURM job script for running a GPU-enabled MATLAB script using a single CPU core and one GPU:
#!/bin/bash
#SBATCH --partition=gpu # Partition (job queue)
#SBATCH --job-name=mtrxmlt # Assign a short name to your job
#SBATCH --cpus-per-task=1 # Cores per task (>1 if multithread tasks)
#SBATCH --gres=gpu:1 # Request number of GPUs
#SBATCH --constraint=pascal # Specify hardware models
#SBATCH --mem=16000 # Real memory (RAM) required (MB)
#SBATCH --time=00:05:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT output file
mkdir -p /scratch/$USER/$SLURM_JOB_ID
cd /scratch/$USER/$SLURM_JOB_ID
module purge
rm -rf ~/.matlab
module load MATLAB/R2020a
srun matlab -nodisplay -singleCompThread -batch "N=10000;MatrixMultGPU(rand(N),rand(N))"
cd ; rm -rf /scratch/$USER/$SLURM_JOB_ID
Using the Parallel Server
The MATLAB Parallel Server is required for parallel execution across more than 1 compute node. Note: before scaling-up to multiple nodes, be sure your MATLAB script can efficiently utilize cores across multiple nodes (e.g., run some benchmarking jobs to see how the performance of your calculations scale when increasing the number of cores) because more cores doesn't always mean things will run faster.
Documentation for using Parallel Server is available here: https://www.mathworks.com/help/matlab-parallel-server/index.html
There is an example Cluster Configuration for Amarel here: /projects/oarc/users/examples/matlab/Amarel_SLURM_Parallel_Server.mlsettings
That file can be imported into the Cluster Configuration Manager in MATLAB. When using that example configuration, be sure to edit the JobStorageLocation and the number of workers you wish to use. By default, MATLAB will leave the selection of the number of compute nodes up to SLURM, but you can customize this and many other features of your Parallel Server managed SLURM jobs by adding appropriate options in the SubmitArguments field.
Parallel Server can be utilized exclusively from the command-line (a GUI is not required) and that is the most efficient way to use it. However, if you prefer to use the MATLAB IDE for submitting your job(s), connecting to Amarel using the FastX system and launching MATLAB from a compute node is recommended. Note: the resources you request for running the MATLAB IDE are completely separate from those requested when you submit a job using the Parallel Server. For example, you may only need 1 core and 1 GB RAM for running the MATLAB IDE, regardless of how big your Parallel Server jobs are, because, when using the Parallel Server, the MATLAB IDE only submits your job; it doesn't actually participate in the computation.
MOE
Molecular Operating Environment (MOE) by Chemical Computing Group (CCG) is a drug discovery platform that integrates visualization, modeling, simulations, and methodology development, in one package.
Download and install:
The MOE installation package for Windows, Linux, or OS X can be downloaded from the Rutgers Software Portal: https://software.rutgers.edu/product/3656
Configure the license file:
Once the installation is complete, edit the license.dat file in the main "moe" directory. For example, on a Windows system, look for the license.dat in C:\moe or C:\Program Files\moe or similar default locations.
When you locate the license.dat file, open it with a text editor (like Notepad) and replace the content with the license details provided on the Rutgers Software Portal.
Save the license.dat file and try to start MOE again.
Remember that running MOE will only work if you are connected to the Rutgers campus network (i.e., actually on campus or connected to the campus network via the Rutgers VPN service).
Python
Generally, there are 2 approaches for using Python and its associated tools:
(1) use one of the pre-installed Python modules (version 2.x.x or 3.x.x) that's already available on Amarel (you can add or update packages if needed) or
(2) install your own custom build of Python in your /home directory or in a shared directory (e.g., /projects/<group> or /projects/community).
Quickstart
Load the Intel MKL libraries (the underlying C/C++/Fortran code needed for numpy), then the Python module itself:
module load intel_mkl/17.0.2 python/3.5.2
Or, load a version of Python, Anaconda, or the Python Data Science Stack:
module use /projects/community/modulefiles
module load py-data-science-stack/5.1.0-kp807
Using pre-installed Python modules
With the pre-installed Python modules, you can add or update Python modules/packages as needed if you do it using the '--user' option for pip. This option will instruct pip to install new software or upgrades in your ~/.local directory. Here's an example where I'm installing the Django package:
module load python/3.5.2
pip install --user Django
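To confirm where pip placed the package and that Python can import it, you can run, for example:
pip show Django | grep Location   # should point to ~/.local/lib/python3.5/site-packages
python3 -c "import django; print(django.__version__)"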
Note: if necessary, pip can also be upgraded when using a system-installed build of Python, but be aware that the upgraded version of pip will be installed in ~/.local/bin. Whenever a system-installed Python module is loaded, the PATH location of that module's executables (like pip) will precede your ~/.local/bin directory. To run the upgraded version of pip, you'll need to specify its location because the previous version of pip will no longer work properly:
$ which pip
/opt/sw/packages/gcc-4.8/python/3.5.2/bin/pip
$ pip --version
pip 9.0.3 from /opt/sw/packages/gcc-4.8/python/3.5.2/lib/python3.5/site-packages (python 3.5)
$ pip install -U --user pip
Successfully installed pip-10.0.1
$ which pip
/opt/sw/packages/gcc-4.8/python/3.5.2/bin/pip
$ pip --version
Traceback (most recent call last):
File "/opt/sw/packages/gcc-4.8/python/3.5.2/bin/pip", line 7, in
from pip import main
ImportError: cannot import name 'main'
$ .local/bin/pip --version
pip 10.0.1 from /home/gc563/.local/lib/python3.5/site-packages/pip (python 3.5)
$ .local/bin/pip install --user Django
Building your own Python installation
Using this approach, I must specify that I want Python to be installed in my /home directory. This is done using the '--prefix=' option. Also, I prefer to use a [package]/[version] naming scheme because that enables easy organization of multiple versions of Python (optional, it's just a personal preference).
Note: Newer versions of Python require the Foreign Function Interface library (libffi) to avoid errors about missing _ctypes. You can install your own libffi like this,
wget https://github.com/libffi/libffi/releases/download/v3.4.2/libffi-3.4.2.tar.gz
tar -zxf libffi-3.4.2.tar.gz
cd libffi-3.4.2
./configure --prefix=/home/gc563/libffi/3.4.2
make -j 4
make install
or you can use the one available in Amarel's Community-Contributed Software Repository:
module use /projects/community/modulefiles/
module load libffi/3.4.2-gc563
With libffi ready in my shell environment, I can proceed with my Python installation.
Note #1: In the example here, I'm using the libffi available on Amarel, not the one installed in my /home directory (to do that, you must change the path for the LDFLAGS environment variable and you'll need to set the PKG_CONFIG_PATH variable to the lib/pkgconfig directory in your libffi installation).
Note #2: Using the --enable-optimizations option requires that you build Python with GCC 8.1.0+ or Intel compilers. I'm just using the system default GCC.
At the end of my install procedure, I remove the downloaded install package and tarball, just to tidy-up.
wget https://www.python.org/ftp/python/3.9.6/Python-3.9.6.tgz
tar -zxf Python-3.9.6.tgz
cd Python-3.9.6
./configure --prefix=/home/gc563/python/3.9.6 CXX=g++ --with-ensurepip=install LDFLAGS=-L/projects/community/libffi/3.4.2/gc563/lib64
make -j 8
make install
cd ..
rm -rf Python-3.9.6*
Before using my new Python installation, I'll need to set or edit some environment variables. This can be done from the command line (but the settings won't persist after you log-out) or by adding these commands to the bottom of your ~/.bashrc file (so the settings will persist):
export PATH=/home/gc563/python/3.9.6/bin:$PATH
export LD_LIBRARY_PATH=/home/gc563/python/3.9.6/lib
export MANPATH=/home/gc563/python/3.9.6/share/man
If you're adding these lines to the bottom of your ~/.bashrc file, log-out and log-in again, then verify that the settings are working:
which python3
~/python/3.9.6/bin/python3
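With the new interpreter first on your PATH, you can, for example, create isolated virtual environments from it for individual projects (the paths below are illustrative):
python3 -m venv ~/envs/myproject
source ~/envs/myproject/bin/activate
pip install --upgrade pip
deactivate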
Q-Chem
Q-Chem is a comprehensive ab initio quantum chemistry software package that can provide accurate predictions of molecular structures, reactivities, and vibrational, electronic and NMR spectra.
How to load the Q-Chem module:
module load Q-Chem/5.4
Example Q-Chem job:
There is a simple example job in /projects/community/users/training/intro.amarel/qchem.example that runs Q-Chem in parallel using OpenMP (in other words, this parallel job can use the cores on a single compute node, but it cannot span multiple compute nodes). That example includes input for an ADC(2)-s calculation of singlet excited states of methane with D2 symmetry. In total, six excited states are requested corresponding to four (two) electronic transitions with irreducible representation B1 (B2).
sbatch run.qchem.openmp
Notes:
Q-Chem can run in parallel using OpenMP (threaded mode), MPI, or a combination of both (hybrid mode):
OpenMP (threaded):
qchem -nt nthreads infile outfile
MPI (be sure to add --mpi=pmi2 after your srun command):
qchem -np n infile outfile
Hybrid OpenMP+MPI:
qchem -np n -nt nthreads infile outfile
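For example, a job script for an OpenMP run on several cores of one node might include the following (the resource requests and input/output file names are illustrative):
#!/bin/bash
#SBATCH --partition=main          # Partition (job queue)
#SBATCH --job-name=qchem_omp      # Assign a short name to your job
#SBATCH --ntasks=1                # Total # of tasks
#SBATCH --cpus-per-task=4         # Cores for the OpenMP threads
#SBATCH --mem=8G                  # Real memory (RAM) required
#SBATCH --time=01:00:00           # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out  # STDOUT output file

module purge
module load Q-Chem/5.4
qchem -nt $SLURM_CPUS_PER_TASK methane.in methane.out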
Troubleshooting:
If you encounter a "bus error," you may have run out of memory. In that case, assigning more memory (RAM) for your job may help you avoid that situation.
If you see "License server machine is down or not responding" in your output file, the Q-Chem license server may be busy and simply trying again is appropriate.
R
Running R in an interactive session:
Here's a simple example of loading and running R in an interactive job. First, start an interactive session with 2 CPU cores that will run for 1 hr 40 min and provide you with a Bash shell on the allocated compute node:
$ srun --ntasks=2 --time=01:40:00 --pty bash
Once your interactive session has started, load the desired R module and launch an R shell:
$ module load intel/17.0.4
$ module load R-Project/3.4.1
$ R
Running R in a batch job (using a job script):
#!/bin/bash
#SBATCH --partition=main # Partition (job queue)
#SBATCH --no-requeue # Do not re-run job if preempted
#SBATCH --job-name=Rbatch # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=1 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=2 # Cores per task (>1 if multithread tasks)
#SBATCH --mem=16000 # Real memory (RAM) required (MB)
#SBATCH --time=00:30:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT output file
#SBATCH --error=slurm.%N.%j.err # STDERR output file (optional)
module purge
module load intel/17.0.4
module load R-Project/3.4.1
Rscript myRprogram.r
Using packages from BioConductor
If the Bioconductor packages you need are not already installed, you can install them yourself. On a login node, start R and try the following commands (note: for newer versions of R and Bioconductor, use BiocManager::install() instead, as shown in the "Installing your own build of R" section below):
source("https://bioconductor.org/biocLite.R")
biocLite("ape")
biocLite("MKmisc")
biocLite("Heatplus")
biocLite("affycoretools")
biocLite("flashClust")
biocLite("affy")
Example: Calculate gene length
Get some data from Ensembl:
wget ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.gtf.gz
In an R shell, you can execute these commands to compute gene lengths:
library(GenomicFeatures)
gtfdb <- makeTxDbFromGFF("Homo_sapiens.GRCh38.91.gtf.gz",format="gtf")
exons.list.per.gene <- exonsBy(gtfdb,by="gene")
exonic.gene.sizes <- lapply(exons.list.per.gene,function(x){sum(width(reduce(x)))})
class(exonic.gene.sizes)
Hg20_geneLength <-do.call(rbind, exonic.gene.sizes)
colnames(Hg20_geneLength) <- paste('geneLength')
Installing your own build of R
Here are the steps some have used to install their own customizable build of R in their /home directory. After the installation of R, the procedures here also show how the Bioconductor package can be installed using the newly-installed build of R. NOTE: be sure to use your NetID instead of the <NetID> placeholder.
Start by downloading and installing PCRE:
wget https://ftp.pcre.org/pub/pcre/pcre2-10.35.tar.gz
tar -zxf pcre2-10.35.tar.gz
cd pcre2-10.35
./configure --prefix=/home/<NetID>/pcre/10.35
make -j 8
make install
Add these lines to the end of your ~/.bashrc file (NOTE: the commands presented here assume you don't already have those environment variables set. If you do have them set already, adjust these lines accordingly):
export PATH=/home/<NetID>/pcre/10.35/bin:$PATH
export C_INCLUDE_PATH=/home/<NetID>/pcre/10.35/include
export CPLUS_INCLUDE_PATH=/home/<NetID>/pcre/10.35/include
export LIBRARY_PATH=/home/<NetID>/pcre/10.35/lib
export LD_LIBRARY_PATH=/home/<NetID>/pcre/10.35/lib
export MANPATH=/home/<NetID>/pcre/10.35/share/man:$MANPATH
Then log-out and log-in again so those settings will take effect.
Next, we can download and install R:
wget https://cran.r-project.org/src/base/R-4/R-4.1.0.tar.gz
tar -zxf R-4.1.0.tar.gz
cd R-4.1.0
./configure --prefix=/home/<NetID>/R/4.1.0
make -j 8
make install
Add these lines to the end of your ~/.bashrc file:
export PATH=/home/<NetID>/R/4.1.0/bin:$PATH
export C_INCLUDE_PATH=/home/<NetID>/R/4.1.0/include:$C_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/home/<NetID>/R/4.1.0/include:$CPLUS_INCLUDE_PATH
export LIBRARY_PATH=/home/<NetID>/R/4.1.0/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=/home/<NetID>/R/4.1.0/lib:$LD_LIBRARY_PATH
export MANPATH=/home/<NetID>/R/4.1.0/share/man:$MANPATH
Then log-out and log-in again so those settings will take effect.
Now, we're ready to install additional packages for our new R installation. Here, we'll walk through the process of installing the Bioconductor package.
Start R, then enter these commands:
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
> BiocManager::install()
> BiocManager::install("GenomicFeatures")
This installation can take a long time. You may be asked if you'd like to update the installed packages:
Update all/some/none? [a/s/n]: a
Doing so is probably going to be a good idea for most users.
RStudio
RStudio is available as a pre-built app in Amarel's Open OnDemand interface (see here for details: https://sites.google.com/view/cluster-user-guide/amarel#h.z6biscu53ldl). But some users prefer more flexibility or customizability with their RStudio and underlying R builds. For those users, launching RStudio from a Singularity image (running as a container) may be a good option.
Here's how to run a Singularity image of RStudio on Amarel:
(1) Launch a FastX desktop session (see here for details: https://sites.google.com/view/cluster-user-guide/amarel#h.jsnqsekyy1u6)
Direct your web browser to https://amarel.hpc.rutgers.edu:3443
(2) Access a compute node with the job duration and memory you require and launch a Bash shell:
srun --time=1:00:00 --mem=2G --pty bash
(3) Load a recent version of Singularity (you may need to check module --show-hidden avail to find the latest version):
module load singularity/3.6.4
(4) Run our Singularity Image File in /projects/community as a container:
singularity run --app rstudio /projects/community/singularity.images/rstudio/r402rstudio11.sif-gc563
If a newer version of R or RStudio is required, the user can build and customize their own Singularity image as needed using the free, web-based image builder: https://cloud.sylabs.io/builder
An example Singularity definition file for creating a new image can be found here: /projects/community/singularity.images/rstudio/r402rstudio11.def-gc563 (that's the definition file used to build the working image stored in that same directory). The exact commands for building an image with the latest versions of R and RStudio may have changed a bit since this definition file was created: the optimal procedures change all the time. So, a user doing this for themselves may need to do a little research to find the best Docker image or preconfigured OS image with which to start.
Sage
SageMath is a free, open-source mathematics software system licensed under the GPL. It builds on top of many existing open-source packages like NumPy, SciPy, matplotlib, Sympy, Maxima, GAP, FLINT, R and many more. Sage enables access to their combined power through a common, Python-based language or directly via interfaces or wrappers.
On Amarel, Sage 9.2 can be loaded via the Miniconda3 environment in the Community-Contributed Software Repository.
First, initialize the Miniconda3 environment on Amarel as described in the Miniconda section above.
(e.g., /projects/community/miniconda3/bin/conda init and then log-out, log-in for that change to take effect)
Then, simply activate the Sage environment within Miniconda3:
conda activate sage
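Once the environment is active, you can quickly confirm that Sage works; the calculation below is just an example:
sage -c 'print(factor(2021))'    # should print 43 * 47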
Singularity
Singularity is a Linux containerization tool suitable for HPC environments. It uses its own container format and also has features that enable importing Docker containers.
Docker is a platform that employs features of the Linux kernel to run software in a container. The software housed in a Docker container is not a standalone program but an entire OS distribution, or at least enough of the OS to enable the program to work. Docker can be thought of as somewhat like a software distribution mechanism, such as yum or apt. It can also be thought of as an expanded version of a chroot jail, or a reduced version of a virtual machine.
Important differences between Docker and Singularity:
Docker and Singularity have their own container formats.
Docker containers can be imported and run using Singularity.
Docker containers usually run as root, which means you cannot run Docker on a shared computing system (cluster).
Singularity allows for containers that can be run as a regular user. How? When importing a Docker container, Singularity removes any elements which can only run as root. The resulting containers can be run using a regular user account.
Converting to a Singularity image
You will need to have Singularity installed on your local workstation/laptop to prepare your image. The 'create' and 'import' operations of Singularity require root privileges, which you do not have on Amarel.
Create an empty singularity image, and then import the exported docker image into it,
$ sudo singularity create ubuntu.img
Creating a sparse image with a maximum size of 1024MiB...
Using given image size of 1024
Formatting image (/sbin/mkfs.ext3)
Done. Image can be found at: ubuntu.img
$ sudo singularity import ubuntu.img ubuntu.tar
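Note: with the newer Singularity 3.x module available on Amarel (see the module load examples above), a Docker image can also be pulled directly into a SIF file without root privileges; a minimal sketch:
module load singularity/3.6.4
singularity pull ubuntu_18.04.sif docker://ubuntu:18.04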
Building new Singularity images
Singularity 3.0 introduced the ability to build a container in the cloud. When doing this, a Singularity user does not need to prepare an environment or assign permissions. The Remote Builder at https://cloud.sylabs.io/builder can build a container using a provided build definition file and can also be used to edit or develop a definition file, then build the desired image.
Here's an example Singularity 3.5 definition file that tells the Singularity bootstrap agent to pull Ubuntu 18.04 from the Container Library, then it installs basic development tools, Python-3.7, Firefox, R-3.6, and the libraries needed for GUI-based applications:
BootStrap: library
From: ubuntu:18.04
%post
apt -y update
apt -y install build-essential
apt -y install python3.7
DEBIAN_FRONTEND=noninteractive apt -y install xorg
apt -y install firefox
apt -y install apt-transport-https software-properties-common
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/'
apt -y update
apt -y install r-recommended=3.6.0-2bionic r-base=3.6.0-2bionic r-base-core=3.6.0-2bionic r-base-dev=3.6.0-2bionic
apt -y install libpcre2-dev
apt -y install r-base
Once the image has been successfully built, you can download the image and transfer it to Amarel.
Using Singularity containers inside a SLURM job
Transfer your new Singularity image to Amarel. The following steps are performed while logged-in to Amarel.
You can run any task/program inside the container by prefacing it with
singularity exec <your image name>
Here is a simple example job script that executes commands inside a container,
#!/bin/bash
#SBATCH --partition=main # Partition (job queue)
#SBATCH --job-name=sing2me # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=1 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=1 # Cores per task (>1 if multithread tasks)
#SBATCH --mem=4000 # Real memory (RAM) required (MB)
#SBATCH --time=00:30:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT output file
module purge
module load singularity/.2.5.1
## Where am I running?
srun singularity exec ubuntu.img hostname
## What is the current time and date?
srun singularity exec ubuntu.img date
If you created mount-point directories for any Amarel filesystems inside your image, you should find those filesystems mounted inside your container,
mount | grep gpfs
/dev/scratch/gc563 on /scratch/gc563 type gpfs (rw,relatime)
/dev/projects/oarc on /projects/oarc type gpfs (rw,relatime)
NOTE: If your container mounts Amarel directories, software inside the container may be able to destroy data on these filesystems for which you have write permissions. Proceed with caution.
TensorFlow with a GPU
TensorFlow's Python package comes in 2 versions: tensorflow and tensorflow-gpu. The command to use TensorFlow is the same in both cases: import tensorflow as tf (and not import tensorflow-gpu as tf in the case of the GPU version). This means that you must be careful about which package is set up in your environment. You can control your environment using a Singularity image, but that can present a problem if you need a package not included in the prebuilt Singularity image. If you encounter that problem, you likely need to build the image yourself. Alternatively, you can control your environment using Conda environments or virtualenv.
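A quick way to check which TensorFlow build is active and whether it can see a GPU (run this inside your container, environment, or job; it is only a sanity check and assumes a TensorFlow 1.x build like the image used below):
python -c "import tensorflow as tf; print(tf.__version__); print(tf.test.is_gpu_available())"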
TensorFlow with a GPU using Singularity
To do this, you can use the Singularity container manager and a Docker image containing the TensorFlow software.
Running Singularity can be done in batch mode using a job script. Below is an example job script for this purpose. In this example, we'll name the script TF_gpu.sh:
#!/bin/bash
#SBATCH --partition=main # Partition (job queue)
#SBATCH --job-name=TF_gpu # Assign a short name to your job
#SBATCH --nodes=1 # Number of nodes you require
#SBATCH --ntasks=1 # Total # of tasks across all nodes
#SBATCH --cpus-per-task=2 # Cores per task (>1 if multithread tasks)
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --mem=16000 # Real memory (RAM) required (MB)
#SBATCH --time=00:30:00 # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out # STDOUT output file
#SBATCH --error=slurm.%N.%j.err # STDERR output file (optional)
module purge
module load singularity/.2.5.1
srun singularity exec --nv docker://tensorflow/tensorflow:1.4.1-gpu python
Once your job script is ready, submit it using the sbatch command:
$ sbatch TF_gpu.sh
Alternatively, you can run Singularity interactively:
$ srun --pty -p main --gres=gpu:1 --time=15:00 --mem=6G singularity shell --nv docker://tensorflow/tensorflow:1.4.1-gpu
Docker image path: index.docker.io/tensorflow/tensorflow:1.4.1-gpu
Cache folder set to /home/user/.singularity/docker
Creating container runtime...
Importing: /home/user/.singularity/docker/sha256:054be6183d067af1af06196d7123f7dd0b67f7157a0959bd857ad73046c3be9a.tar.gz
Importing: /home/user/.singularity/docker/sha256:779578d7ea6e8cc3934791724d28c56bbfc8b1a99e26236e7bf53350ed839b98.tar.gz
Importing: /home/user/.singularity/docker/sha256:82315138c8bd2f784643520005a8974552aaeaaf9ce365faea4e50554cf1bb44.tar.gz
Importing: /home/user/.singularity/docker/sha256:88dc0000f5c4a5feee72bae2c1998412a4b06a36099da354f4f97bdc8f48d0ed.tar.gz
Importing: /home/user/.singularity/docker/sha256:79f59e52a355a539af4e15ec0241dffaee6400ce5de828b372d06c625285fd77.tar.gz
Importing: /home/user/.singularity/docker/sha256:ecc723991ca554289282618d4e422a29fa96bd2c57d8d9ef16508a549f108316.tar.gz
Importing: /home/user/.singularity/docker/sha256:d0e0931cb377863a3dbadd0328a1f637387057321adecce2c47c2d54affc30f2.tar.gz
Importing: /home/user/.singularity/docker/sha256:f7899094c6d8f09b5ac7735b109d7538f5214f1f98d7ded5756ee1cff6aa23dd.tar.gz
Importing: /home/user/.singularity/docker/sha256:ecba77e23ded968b9b2bed496185bfa29f46c6d85b5ea68e23a54a505acb81a3.tar.gz
Importing: /home/user/.singularity/docker/sha256:037240df6b3d47778a353e74703c6ecddbcca4d4d7198eda77f2024f97fc8c3d.tar.gz
Importing: /home/user/.singularity/docker/sha256:b1330cb3fb4a5fe93317aa70df2d6b98ac3ec1d143d20030c32f56fc49b013a8.tar.gz
Importing: /home/user/.singularity/metadata/sha256:b71a53c1f358230f98f25b41ec62ad5c4ba0b9d986bbb4fb15211f24c386780f.tar.gz
Singularity: Invoking an interactive shell within container...
Singularity tensorflow:latest-gpu:~>
Now, you're ready to execute commands:
Singularity tensorflow:latest-gpu:~> python -V
Python 2.7.12
Singularity tensorflow:latest-gpu:~> python3 -V
Python 3.5.2
Remember to exit from your interactive job after you are finished with your calculations.
There are several Docker images available on Amarel for use with Singularity. The one used in the example above, tensorflow:1.4.1-gpu, is intended for Python 2.7.12. If you want to use Python 3, you'll need a different image, docker://tensorflow/tensorflow:1.4.1-gpu-py3, and the Python command will be python3 instead of python in your script.
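For example, a batch run of a Python 3 TensorFlow script (the script name below is illustrative) would then look like:
srun singularity exec --nv docker://tensorflow/tensorflow:1.4.1-gpu-py3 python3 my_tf_script.py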