AlphaFold2
AlphaFold is intended to provide accurate protein structure predictions [1].
Among CWRU HPC resources, AlphaFold2 is installed only on the Pioneer (pioneer.case.edu) cluster. It is installed locally, meaning that containers are not required to run AlphaFold. An example Slurm file for requesting resources, using T1050 as input, is shown below.
Important Notes:
Check bug reports on the AlphaFold GitHub repository: https://github.com/deepmind/alphafold
If you encounter memory issues, try the reduced database search, or use a GPU with more memory (check HPC resources, High Memory GPU job). For example, on the Pioneer cluster the GPUs in the aisc partition have up to 80 GB of GPU memory.
Quickstart on Running AlphaFold2
Access pioneer cluster (pioneer.case.edu) - check Quickstart Guide.
Submit the job (one from the options below; check the comments):
sbatch /usr/local/software/AlphaFold/2.2.2/run_alphafold_monomer.slurm <your_seq_file> # monomer version
sbatch /usr/local/software/AlphaFold/2.2.2/run_alphafold_multimer.slurm <your_seq_file> # multimer, full database search, default GPUs
sbatch /usr/local/software/AlphaFold/2.2.2/multimer_reduced.slurm <your_seq_file> # multimer, reduced database search, default GPUs
sbatch -C gpu4v100 --time=320:00:00 --gres=gpu:4 /usr/local/software/AlphaFold/2.2.2/multimer_reduced.slurm <your_seq_file> # reduced database search with customized resources: 4 Volta GPUs and 320 hr wall time
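The `<your_seq_file>` argument is a plain FASTA file containing the query sequence. A minimal sketch of preparing one (the header name and sequence below are short placeholders for illustration, not a real target):

```shell
# Write a single-sequence FASTA input; replace the header and sequence
# with your own protein of interest.
cat > my_target.fasta <<'EOF'
>my_target
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ
EOF

# The file is then passed to one of the slurm wrappers, e.g.:
#   sbatch /usr/local/software/AlphaFold/2.2.2/run_alphafold_monomer.slurm my_target.fasta
```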
Code and Database organization
The code is installed in the optimized software tree: /usr/local/easybuild_avx2/software/AlphaFold/2.1.1-fosscuda-2020b.
Running the code requires setting the MODULEPATH to include only the 'easybuild_avx2' optimization:
export MODULEPATH=/usr/local/easybuild_avx2/modules/all
The database files are stored at /mnt/pan/AlphaFold.
Refer to documentation appropriate to your own data and objectives to determine which runtime flags to set when launching AlphaFold. A simple example is offered here to illustrate the format of a job script used to allocate resources and run the AlphaFold code. The T1050 single-sequence data was obtained as a FASTA file and copied to $WORKDIR/fastas.
Batch Job Script
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=120gb
#SBATCH -c 20
#SBATCH -o quick_af.o%j
module purge
export MODULEPATH=/usr/local/easybuild_avx2/modules/all
module load AlphaFold/2.1.1-fosscuda-2020b
export ALPHAFOLD_DATA_DIR="/mnt/pan/AlphaFold"
hostname
echo "$PFSDIR"             # per-job scratch directory
pwd
ls "$ALPHAFOLD_DATA_DIR"   # confirm the databases are visible
WORKDIR=/home/mrd20/hpcdemo/alphafold
cp -r $WORKDIR/fastas $PFSDIR
cd $PFSDIR
mkdir runs
echo "Running AlphaFold from $WORKDIR"
/home/mrd20/hpcdemo/alphafold/run_alphafold.py \
--fasta_paths=$PFSDIR/fastas/T1050.fasta \
--data_dir=/mnt/pan/AlphaFold \
--pdb70_database_path=/mnt/pan/AlphaFold/pdb70/pdb70 \
--uniref90_database_path=$ALPHAFOLD_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$ALPHAFOLD_DATA_DIR/mgnify/mgy_clusters_2018_12.fa \
--uniclust30_database_path=$ALPHAFOLD_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--max_template_date=2020-05-14 \
--db_preset=full_dbs \
--output_dir=$PFSDIR/runs \
--model_preset=monomer_ptm
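Since $PFSDIR is per-job scratch space that may be cleaned up after the job finishes, a copy-back step at the end of the script is advisable. A sketch, demonstrated here with stand-in directories in place of the real $PFSDIR and $WORKDIR:

```shell
# Stand-in directories for illustration; in the batch script above,
# PFSDIR and WORKDIR are already defined.
PFSDIR=/tmp/af_demo_scratch
WORKDIR=/tmp/af_demo_work
mkdir -p "$PFSDIR/runs/T1050" "$WORKDIR"
touch "$PFSDIR/runs/T1050/unrelaxed_model_1_ptm.pdb"   # stand-in for real output

# The actual copy-back step to append to the batch script:
cp -r "$PFSDIR/runs" "$WORKDIR/"
```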
Output
Directory Structure
$ ls -RF /scratch/pbsjobs/job.1184.hpc/
/scratch/pbsjobs/job.1184.hpc/:
fastas/ runs/
/scratch/pbsjobs/job.1184.hpc/fastas:
T1050.fasta
/scratch/pbsjobs/job.1184.hpc/runs:
T1050/
/scratch/pbsjobs/job.1184.hpc/runs/T1050:
features.pkl msas/ result_model_1_ptm.pkl result_model_2_ptm.pkl unrelaxed_model_1_ptm.pdb unrelaxed_model_2_ptm.pdb
/scratch/pbsjobs/job.1184.hpc/runs/T1050/msas:
bfd_uniclust_hits.a3m mgnify_hits.sto pdb_hits.hhr uniref90_hits.sto
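The files under msas/ hold the raw alignment hits from each database search. MSA depth strongly influences prediction quality, so a quick sanity check is to count the sequences in the a3m file (one '>' header per aligned sequence). Sketched here on a tiny stand-in file; the real path for the run above would be the bfd_uniclust_hits.a3m file in the run's msas/ directory:

```shell
# Tiny stand-in a3m file; substitute the real file from your run.
cat > /tmp/demo_hits.a3m <<'EOF'
>query
MKTAYIAK
>hit_1
MKSAYLAK
>hit_2
-KTAYIGK
EOF

# Count alignment depth: one FASTA header per aligned sequence
n_seqs=$(grep -c '^>' /tmp/demo_hits.a3m)
echo "MSA depth: $n_seqs"
```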
Log Output
The stages of the AlphaFold workflow are reported in the job output. Typical sections include:
Jackhmmer (2 stages)
Finished Jackhmmer (uniref90.fasta) query in 455.763 seconds
Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 472.964 seconds
HHsearch
Finished HHsearch query in 146.808 seconds
HHblits
Finished HHblits query in 7292.667 seconds
Running Models, Invoking JAX
run_alphafold.py:225] Running model model_1_ptm on T1050
model.py:165] Running predict with shape(feat) = {'aatype': (4, 779), 'residue_index': (4, 779), ....... ]
run_alphafold.py:237] Total JAX model model_1_ptm on T1050 predict time (includes compilation time, see --benchmark): 574.1s
run_alphafold.py:225] Running model model_2_ptm on T1050
model.py:165] Running predict with shape(feat) = {'aatype': (4, 779), 'residue_index': (4, 779), ........ ]
run_alphafold.py:237] Total JAX model model_2_ptm on T1050 predict time (includes compilation time, see --benchmark): 489.1s
run_alphafold.py:225] Running model model_3_ptm on T1050
run_alphafold.py:225] Running model model_4_ptm on T1050
run_alphafold.py:225] Running model model_5_ptm on T1050
run_alphafold.py:306] Final timings for T1050: {'features': 8557.561393976212, 'process_features_model_1_ptm': 38.50683522224426, 'predict_and_compile_model_1_ptm': 574.1196734905243, 'process_features_model_2_ptm': 8.790717363357544, 'predict_and_compile_model_2_ptm': 489.09383630752563, 'process_features_model_3_ptm': 8.145345687866211, 'predict_and_compile_model_3_ptm': 465.3740735054016, 'process_features_model_4_ptm': 7.742178916931152, 'predict_and_compile_model_4_ptm': 461.7440278530121, 'process_features_model_5_ptm': 7.70080304145813, 'predict_and_compile_model_5_ptm': 424.1679847240448}
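Summing the 'Final timings' values shows where the time goes; for this run the feature generation stage (the database searches) dominates the total. A quick check with the values copied from the log line above:

```shell
# Total wall time across stages (seconds -> hours); values copied from
# the 'Final timings' line above. Feature generation alone is ~8558 s.
total_h=$(awk 'BEGIN {
  t = 8557.561393976212 + 38.50683522224426 + 574.1196734905243 \
    + 8.790717363357544 + 489.09383630752563 + 8.145345687866211 \
    + 465.3740735054016 + 7.742178916931152 + 461.7440278530121 \
    + 7.70080304145813 + 424.1679847240448
  printf "%.2f", t / 3600
}')
echo "total: ${total_h} hours"
```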
Full sample output for T1050: alphafold_T1050.out
References
[1] https://www.nature.com/articles/s41586-021-03819-2