AlphaFold2
AlphaFold is intended to provide accurate protein structure predictions [1].
Among CWRU HPC resources, AlphaFold2 is installed only on the Pioneer (pioneer.case.edu) cluster. It is installed locally, meaning that containers are not required to run AlphaFold. An example Slurm file for requesting resources, using T1050 as input, is shown below.
Important Notes:
Check bug reports on the AlphaFold GitHub repository: https://github.com/deepmind/alphafold
If you encounter memory issues, try the reduced database search, or use a GPU with more memory (check HPC resources, High Memory GPU job). For example, on the Pioneer cluster the GPUs in the aisc partition have up to 80 GB of GPU memory.
Quickstart on Running AlphaFold2
Access pioneer cluster (pioneer.case.edu) - check Quickstart Guide.
Submit the job (one from the options below; check the comments):
sbatch /usr/local/software/AlphaFold/2.2.2/run_alphafold_monomer.slurm <your_seq_file> # monomer version
sbatch /usr/local/software/AlphaFold/2.2.2/run_alphafold_multimer.slurm <your_seq_file> # multimer, full database search, default GPUs
sbatch /usr/local/software/AlphaFold/2.2.2/multimer_reduced.slurm <your_seq_file> # multimer, reduced database search, default GPUs
sbatch -C gpu4v100 --time=320:00:00 --gres=gpu:4 /usr/local/software/AlphaFold/2.2.2/multimer_reduced.slurm <your_seq_file> # reduced database search with customized resources: 4 Volta GPUs and 320 hr wall time
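The `<your_seq_file>` argument is a plain FASTA file containing the query sequence. A minimal sketch of preparing one (the header name and sequence below are short placeholders for illustration, not a real target):

```shell
# Write a single-sequence FASTA input; replace the header and sequence
# with your own protein of interest.
cat > my_target.fasta <<'EOF'
>my_target
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ
EOF

# The file is then passed to one of the slurm wrappers, e.g.:
#   sbatch /usr/local/software/AlphaFold/2.2.2/run_alphafold_monomer.slurm my_target.fasta
```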
Code and Database organization
The code is installed in the optimized software tree: /usr/local/easybuild_avx2/software/AlphaFold/2.1.1-fosscuda-2020b.
Running the code requires setting the MODULEPATH to include only the 'easybuild_avx2' optimization:
export MODULEPATH=/usr/local/easybuild_avx2/modules/all
The database files are stored at /mnt/pan/AlphaFold.
Refer to documentation appropriate to your own data and objectives to determine which runtime flags to set when launching AlphaFold. A simple example is offered here to illustrate the format of a job script used to allocate resources and run the AlphaFold code. The T1050 single-sequence data was obtained as a FASTA file and copied to $WORKDIR/fastas.
Batch Job Script
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=120gb
#SBATCH -c 20
#SBATCH -o quick_af.o%j
module purge
export MODULEPATH=/usr/local/easybuild_avx2/modules/all
module load AlphaFold/2.1.1-fosscuda-2020b
export ALPHAFOLD_DATA_DIR="/mnt/pan/AlphaFold"
hostname
echo "$PFSDIR"             # per-job scratch directory
pwd
ls "$ALPHAFOLD_DATA_DIR"   # confirm the databases are visible
WORKDIR=/home/mrd20/hpcdemo/alphafold
cp -r $WORKDIR/fastas $PFSDIR
cd $PFSDIR
mkdir runs
echo "Running AlphaFold from $WORKDIR"
/home/mrd20/hpcdemo/alphafold/run_alphafold.py \
--fasta_paths=$PFSDIR/fastas/T1050.fasta \
--data_dir=/mnt/pan/AlphaFold \
--pdb70_database_path=/mnt/pan/AlphaFold/pdb70/pdb70 \
--uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters_2018_12.fa \
--uniclust30_database_path=$ALPHAFOLD_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--max_template_date=2020-05-14 \
--db_preset=full_dbs \
--output_dir=$PFSDIR/runs \
--model_preset=monomer_ptm
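Since $PFSDIR is per-job scratch space that may be cleaned up after the job finishes, a copy-back step at the end of the script is advisable. A sketch, demonstrated here with stand-in directories in place of the real $PFSDIR and $WORKDIR:

```shell
# Stand-in directories for illustration; in the batch script above,
# PFSDIR and WORKDIR are already defined.
PFSDIR=/tmp/af_demo_scratch
WORKDIR=/tmp/af_demo_work
mkdir -p "$PFSDIR/runs/T1050" "$WORKDIR"
touch "$PFSDIR/runs/T1050/unrelaxed_model_1_ptm.pdb"   # stand-in for real output

# The actual copy-back step to append to the batch script:
cp -r "$PFSDIR/runs" "$WORKDIR/"
```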
Output
Directory Structure
$ ls -RF /scratch/pbsjobs/job.1184.hpc/
/scratch/pbsjobs/job.1184.hpc/:
fastas/ runs/
/scratch/pbsjobs/job.1184.hpc/fastas:
T1050.fasta
/scratch/pbsjobs/job.1184.hpc/runs:
T1050/
/scratch/pbsjobs/job.1184.hpc/runs/T1050:
features.pkl msas/ result_model_1_ptm.pkl result_model_2_ptm.pkl unrelaxed_model_1_ptm.pdb unrelaxed_model_2_ptm.pdb
/scratch/pbsjobs/job.1184.hpc/runs/T1050/msas:
bfd_uniclust_hits.a3m mgnify_hits.sto pdb_hits.hhr uniref90_hits.sto
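The files under msas/ hold the raw alignment hits from each database search. MSA depth strongly influences prediction quality, so a quick sanity check is to count the sequences in the a3m file (one '>' header per aligned sequence). Sketched here on a tiny stand-in file; the real path for the run above would be the bfd_uniclust_hits.a3m file in the run's msas/ directory:

```shell
# Tiny stand-in a3m file; substitute the real file from your run.
cat > /tmp/demo_hits.a3m <<'EOF'
>query
MKTAYIAK
>hit_1
MKSAYLAK
>hit_2
-KTAYIGK
EOF

# Count alignment depth: one FASTA header per aligned sequence
n_seqs=$(grep -c '^>' /tmp/demo_hits.a3m)
echo "MSA depth: $n_seqs"
```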
Log Output
The stages of the AlphaFold workflow are reported in the job output. Typical sections include:
Jackhmmer (2 stages)
Finished Jackhmmer (uniref90.fasta) query in 455.763 seconds
Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 472.964 seconds
HHsearch
Finished HHsearch query in 146.808 seconds
HHblits
Finished HHblits query in 7292.667 seconds
Running Models, Invoking JAX
run_alphafold.py:225] Running model model_1_ptm on T1050
model.py:165] Running predict with shape(feat) = {'aatype': (4, 779), 'residue_index': (4, 779), ....... ]
run_alphafold.py:237] Total JAX model model_1_ptm on T1050 predict time (includes compilation time, see --benchmark): 574.1s
run_alphafold.py:225] Running model model_2_ptm on T1050
model.py:165] Running predict with shape(feat) = {'aatype': (4, 779), 'residue_index': (4, 779), ........ ]
run_alphafold.py:237] Total JAX model model_2_ptm on T1050 predict time (includes compilation time, see --benchmark): 489.1s
run_alphafold.py:225] Running model model_3_ptm on T1050
run_alphafold.py:225] Running model model_4_ptm on T1050
run_alphafold.py:225] Running model model_5_ptm on T1050
run_alphafold.py:306] Final timings for T1050: {'features': 8557.561393976212, 'process_features_model_1_ptm': 38.50683522224426, 'predict_and_compile_model_1_ptm': 574.1196734905243, 'process_features_model_2_ptm': 8.790717363357544, 'predict_and_compile_model_2_ptm': 489.09383630752563, 'process_features_model_3_ptm': 8.145345687866211, 'predict_and_compile_model_3_ptm': 465.3740735054016, 'process_features_model_4_ptm': 7.742178916931152, 'predict_and_compile_model_4_ptm': 461.7440278530121, 'process_features_model_5_ptm': 7.70080304145813, 'predict_and_compile_model_5_ptm': 424.1679847240448}
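Summing the 'Final timings' values shows where the time goes; for this run the feature generation stage (the database searches) dominates the total. A quick check with the values copied from the log line above:

```shell
# Total wall time across stages (seconds -> hours); values copied from
# the 'Final timings' line above. Feature generation alone is ~8558 s.
total_h=$(awk 'BEGIN {
  t = 8557.561393976212 + 38.50683522224426 + 574.1196734905243 \
    + 8.790717363357544 + 489.09383630752563 + 8.145345687866211 \
    + 465.3740735054016 + 7.742178916931152 + 461.7440278530121 \
    + 7.70080304145813 + 424.1679847240448
  printf "%.2f", t / 3600
}')
echo "total: ${total_h} hours"
```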
Full sample output for T1050: alphafold_T1050.out
References
[1] https://www.nature.com/articles/s41586-021-03819-2