AlphaFold2

AlphaFold is intended to provide accurate protein structure predictions [1].

Among CWRU HPC resources, AlphaFold2 is installed only on the Pioneer cluster (pioneer.case.edu). It is installed natively, so containers are not required to run AlphaFold. An example Slurm batch script for requesting resources and running a job with T1050 as input is shown below.




Quickstart on Running AlphaFold2

Access the Pioneer cluster (pioneer.case.edu); refer to the Quickstart Guide if needed.

Submit the job using one of the options below (see the comments for how they differ):

sbatch /usr/local/software/AlphaFold/2.2.2/run_alphafold_monomer.slurm <your_seq_file>  # monomer version

sbatch /usr/local/software/AlphaFold/2.2.2/run_alphafold_multimer.slurm <your_seq_file>  # multimer, full database search using default GPUs

sbatch /usr/local/software/AlphaFold/2.2.2/multimer_reduced.slurm <your_seq_file>  # multimer, reduced database search using default GPUs

sbatch -C gpu4v100 --time=320:00:00 --gres=gpu:4 /usr/local/software/AlphaFold/2.2.2/multimer_reduced.slurm <your_seq_file>  # multimer, reduced database search with customized resources: four V100 GPUs and a 320-hour wall time
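As a minimal end-to-end sketch (the file name query.fasta, the sequence header, and the sequence itself are placeholders; substitute your own), preparing an input and submitting the monomer pipeline might look like:

cat > query.fasta <<'EOF'
>my_query_protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ
EOF

sbatch /usr/local/software/AlphaFold/2.2.2/run_alphafold_monomer.slurm query.fasta   # submit the job
squeue -u $USER   # check that the job is queued or running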


Code and Database Organization

The code is installed in the optimized software tree: /usr/local/easybuild_avx2/software/AlphaFold/2.1.1-fosscuda-2020b.

Running the code requires setting MODULEPATH so that it includes only the 'easybuild_avx2' optimized module tree:

export MODULEPATH=/usr/local/easybuild_avx2/modules/all

The database files are stored at /mnt/pan/AlphaFold.
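To verify this setup interactively before submitting a batch job, a quick sketch (ordinary shell commands; the module name matches the installed version noted above):

export MODULEPATH=/usr/local/easybuild_avx2/modules/all   # use only the avx2-optimized module tree
module avail AlphaFold                                    # confirm the AlphaFold module is visible
module load AlphaFold/2.1.1-fosscuda-2020b                # load the optimized build
ls /mnt/pan/AlphaFold                                     # confirm the reference databases are readable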

Refer to the AlphaFold documentation appropriate to your own data and objectives to determine which runtime flags to set when launching AlphaFold. A simple example is offered here to illustrate the format of a job script used to allocate resources and run the AlphaFold code. The T1050 single-sequence data was obtained as a FASTA file and copied to $WORKDIR/fastas.
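That staging step might look like the following sketch (the source location of the T1050 FASTA file is a placeholder; use wherever you obtained it):

WORKDIR=$HOME/hpcdemo/alphafold        # working directory used in the example script below; adjust to your own
mkdir -p $WORKDIR/fastas
cp /path/to/T1050.fasta $WORKDIR/fastas/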


Batch Job Script

#!/bin/bash

#SBATCH -p gpu

#SBATCH --gres=gpu:2

#SBATCH --mem=120gb

#SBATCH -c 20

#SBATCH -o quick_af.o%j


module purge

export MODULEPATH=/usr/local/easybuild_avx2/modules/all

module load AlphaFold/2.1.1-fosscuda-2020b

export ALPHAFOLD_DATA_DIR="/mnt/pan/AlphaFold"


hostname

echo "$PFSDIR"

echo "$pwd"

lsADD=$(ls $ALPHAFOLD_DATA_DIR)

echo "$lsADD"


WORKDIR=/home/mrd20/hpcdemo/alphafold

cp -r $WORKDIR/fastas $PFSDIR

cd $PFSDIR

mkdir runs


echo Running AlphaFold from $WORKDIR

/home/mrd20/hpcdemo/alphafold/run_alphafold.py \
          --fasta_paths=$PFSDIR/fastas/T1050.fasta \
          --data_dir=/mnt/pan/AlphaFold \
          --pdb70_database_path=/mnt/pan/AlphaFold/pdb70/pdb70 \
          --uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
          --mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters_2018_12.fa \
          --uniclust30_database_path=$ALPHAFOLD_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
          --max_template_date=2020-05-14 \
          --db_preset=full_dbs \
          --output_dir=$PFSDIR/runs \
          --model_preset=monomer_ptm
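Assuming the script above is saved as, say, quick_af.slurm (the name is arbitrary), it can be submitted and monitored as follows; the output file name comes from the -o directive in the script:

sbatch quick_af.slurm        # submit the batch job; Slurm prints the assigned job ID
squeue -u $USER              # check the job's state in the queue
tail -f quick_af.o<jobid>    # follow the job's log as it runs (substitute the actual job ID)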



Output

Directory Structure

$ ls -RF /scratch/pbsjobs/job.1184.hpc/

/scratch/pbsjobs/job.1184.hpc/:

fastas/  runs/


/scratch/pbsjobs/job.1184.hpc/fastas:

T1050.fasta


/scratch/pbsjobs/job.1184.hpc/runs:

T1050/


/scratch/pbsjobs/job.1184.hpc/runs/T1050:

features.pkl  msas/  result_model_1_ptm.pkl  result_model_2_ptm.pkl  unrelaxed_model_1_ptm.pdb  unrelaxed_model_2_ptm.pdb


/scratch/pbsjobs/job.1184.hpc/runs/T1050/msas:

bfd_uniclust_hits.a3m  mgnify_hits.sto  pdb_hits.hhr  uniref90_hits.sto
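Since $PFSDIR is per-job scratch space (as the /scratch/pbsjobs path above suggests), results should be copied back to permanent storage before the job ends; a minimal sketch, which could be added as the last line of the batch script:

cp -r $PFSDIR/runs $WORKDIR/   # copy predictions back from scratch to the working directory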

Log Output


The job's log reports the successive stages of the AlphaFold workflow. Judging from the output files listed above, typical sections include:

the MSA searches (jackhmmer against UniRef90 and MGnify, HHblits against BFD/Uniclust30), whose hits appear under msas/

the template search against PDB70 (pdb_hits.hhr)

feature construction (features.pkl)

inference with each model (result_model_*_ptm.pkl and the unrelaxed_model_*_ptm.pdb structures)

Full sample output for T1050:  alphafold_T1050.out

References

[1] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://www.nature.com/articles/s41586-021-03819-2