AlphaFold

AlphaFold can predict protein structures with atomic accuracy even where no similar structure is known.

Policy

AlphaFold is freely available to users at HPC2N.

Citations

If you use the code or data in the AlphaFold package, please cite their work as described in "Citing this work" on the AlphaFold GitHub page.

Overview

AlphaFold provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP14 and published in Nature.

You can find a more detailed description on the AlphaFold GitHub page.

AlphaFold at HPC2N

On HPC2N we have AlphaFold available as a module. Binaries are compiled both for CPU-only use and for GPUs.
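
To see which AlphaFold versions are currently installed, and which prerequisite modules they need, you can query the module system (the specific version string below is just an example):

ml spider AlphaFold

# Then, for a specific version:
ml spider AlphaFold/2.3.2-CUDA-12.1.1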

Usage at HPC2N

Important

Please read this entire section before experimenting, because there are a couple of important things to be aware of.

Note

Two different modules are available for AlphaFold: a CPU-only version, and a GPU version.

Notice that the versions installed on the Intel nodes and those on the AMD nodes may differ. Thus, if you are targeting a version that is only installed on the Intel or AMD nodes, you will need to add the instruction #SBATCH -C skylake (Intel) or #SBATCH -C amd_cpu (AMD) to your batch script; otherwise the job could land on a node that lacks that version of the installation.
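
To check which features (constraints) the compute nodes advertise, you can ask Slurm directly (a quick sketch; the format flag is standard Slurm, but the feature names will vary per cluster):

# List node names and their available features
sinfo -o "%20N %f"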

Databases

Local databases for AlphaFold are automatically available when loading the AlphaFold module; no manual intervention is needed. The module defines an environment variable, ALPHAFOLD_DATA_DIR, that AlphaFold uses to find the databases. The last part of the directory name specifies when the databases were downloaded. The databases are only updated when new versions of AlphaFold are built; previously installed versions are not redirected to the updated databases.
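
After loading the module you can check which database snapshot a given installation uses (a quick sketch; the module names follow the first submit file example below):

ml fosscuda/2020b
ml AlphaFold/2.0.0

# Print the database location; the last part of the name is the download date
echo $ALPHAFOLD_DATA_DIR

# List the individual databases
ls $ALPHAFOLD_DATA_DIR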

The AlphaFold installations we provide have been enhanced a bit to facilitate usage:

  • The location of the AlphaFold data is specified via the $ALPHAFOLD_DATA_DIR environment variable; users can change this if they need to use a newer version of the database with an older version of AlphaFold, or vice versa.
  • A symbolic link named ‘alphafold’ that points to the run_alphafold.py script is included, so you can just use “alphafold” instead of “run_alphafold.py” or “python run_alphafold.py” after loading the AlphaFold module.
  • The run_alphafold.py script has been slightly modified such that defining the $ALPHAFOLD_DATA_DIR is sufficient to pick up all the data provided in that location, so you don’t need to use options like --data_dir to specify the location of the data.
  • Similarly, the run_alphafold.py script was tweaked such that the locations of commands like hhblits/hhsearch/jackhmmer/kalign are already correctly set, so options like --hhblits_binary_path are not required.
  • The Python scripts that are used to run hhblits and jackhmmer have been tweaked so you can control how many cores are used for these tools (rather than hardcoding them to 4 and 8 cores, respectively).
    • Using the $ALPHAFOLD_HHBLITS_N_CPU environment variable, you can specify how many cores should be used for running hhblits (the default of 4 cores will be used if $ALPHAFOLD_HHBLITS_N_CPU is not defined); likewise for jackhmmer and $ALPHAFOLD_JACKHMMER_N_CPU.
    • Tweaking this may or may not be worth it, though; we have noticed that these tools sometimes run slower on more than 4/8 cores (but this may be workload-dependent).
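
Putting these enhancements together, a minimal interactive sketch (the module names are taken from the examples below; the override path is hypothetical):

ml GCC/12.3.0 OpenMPI/4.1.5
ml AlphaFold/2.3.2-CUDA-12.1.1

# $ALPHAFOLD_DATA_DIR is already set by the module; override it only if you
# need a different database snapshot (hypothetical path):
# export ALPHAFOLD_DATA_DIR=/path/to/other/snapshot

# Optionally control the core counts for the MSA tools (defaults: 4 and 8)
export ALPHAFOLD_HHBLITS_N_CPU=4
export ALPHAFOLD_JACKHMMER_N_CPU=8

# The 'alphafold' symlink replaces 'python run_alphafold.py'; no --data_dir
# or --*_binary_path options are needed
alphafold --helpshort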

Submit file examples

T1050 on a V100 GPU card

To run the T1050 AlphaFold example on one V100 card, use this as an example:

#!/bin/bash
#SBATCH -A <your-project-id>
#SBATCH -J AF-T1050-full_dbs
#SBATCH -t 05:00:00
#SBATCH -C v100
#SBATCH --gpus-per-node=1

# Clean the environment from loaded modules
ml purge > /dev/null 2>&1

# Load AlphaFold
ml fosscuda/2020b
ml AlphaFold/2.0.0

export ALPHAFOLD_HHBLITS_N_CPU=$SLURM_CPUS_ON_NODE

alphafold --fasta_paths=T1050.fasta --max_template_date=2020-05-14 --preset=full_dbs --output_dir=$PWD --model_names=model_1,model_2,model_3,model_4,model_5
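
Assuming the script above is saved as, say, T1050.sh (the file name is arbitrary), submit it and check its status with:

sbatch T1050.sh
squeue -u $USER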

A submit file example for running a single job

In this example we are using AlphaFold 2.3.2. We are going to use the fasta sequence for a monomer and the corresponding batch file singlejob.sh to run the simulation on GPUs.

This is the job script that you could also download from the link above:

#!/bin/bash
#SBATCH -A Project_ID      # Your project ID
#SBATCH -J monomer         # Job name in the queue 
#SBATCH -t 02:00:00        # Wall time 
# lines starting with double ## before SBATCH are ignored by the batch system
# Specify which type of GPU to use, or comment out to let the
# system pick the first available one.
#SBATCH -C GPU-Type
# Use 2 GPU cards if the length of the sequence and number of nmers are large
##SBATCH --gpus-per-node=2
# For short sequences and monomers 1 GPU card is enough
#SBATCH --gpus-per-node=1

# Clean the environment from loaded modules
ml purge > /dev/null 2>&1

# Load AlphaFold
ml GCC/12.3.0  OpenMPI/4.1.5
ml AlphaFold/2.3.2-CUDA-12.1.1

export ALPHAFOLD_HHBLITS_N_CPU=$SLURM_CPUS_ON_NODE

alphafold --fasta_paths=my_fasta_sequence.fasta --max_template_date=2024-01-10 --model_preset=monomer  --output_dir=$PWD 

Notice that the simulation will take ~1 hour.

Comments:

  1. You need to change GPU-Type in the above to the type of GPU you want to use (l40s, v100, a100, etc.)
  2. To change the simulation to use CPUs only, you would (see the sketch after this list):
    • Load a version of AlphaFold that does not use GPUs (check with ml spider AlphaFold for versions, and then ml spider AlphaFold/<version> for how to load it)
    • Only ask for CPUs, not GPUs
    • Use #SBATCH -c to specify the number of CPUs you want to use
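
A minimal CPU-only sketch along those lines (the module name without a CUDA suffix and the core count are assumptions; check ml spider AlphaFold for what is actually installed):

#!/bin/bash
#SBATCH -A Project_ID      # Your project ID
#SBATCH -J monomer-cpu     # Job name in the queue
#SBATCH -t 15:00:00        # Wall time; CPU-only runs are much slower (see timings below)
#SBATCH -c 14              # Number of CPU cores; more is not always faster (see below)

# Clean the environment from loaded modules
ml purge > /dev/null 2>&1

# Load a CPU-only AlphaFold module (version and toolchain are placeholders)
ml GCC/12.3.0  OpenMPI/4.1.5
ml AlphaFold/2.3.2

export ALPHAFOLD_HHBLITS_N_CPU=$SLURM_CPUS_ON_NODE

alphafold --fasta_paths=my_fasta_sequence.fasta --max_template_date=2024-01-10 --model_preset=monomer --output_dir=$PWD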

A submit file example for running job arrays

You will find 8 fasta sequences, seq[1-8].fasta, in the zip file. The zip file also contains a file with these sequences listed, list_sequences.txt.

The job script jobarray.sh allows you to run several simulations in a compact manner, where each fasta sequence is used as input for AlphaFold. The script includes suggestions for using 1 or 2 GPU cards:

#!/bin/bash
#SBATCH -A Project_ID      # Your project ID
#SBATCH -J multimer        # Job name in the queue
#SBATCH -t 02:00:00        # Wall time (each array task takes ~1 hour)
#SBATCH --array=0-7        # Array range
# lines starting with double ## before SBATCH are ignored by the batch system
# Specify which type of GPU to use, or comment out to let the
# system pick the first available one.
#SBATCH -C l40s
# Use 2 GPU cards if the length of the sequence and number of nmers are large
##SBATCH --gpus-per-node=2
# For short sequences and monomers, half a node (1 GPU card) would work
#SBATCH --gpus-per-node=1

# Clean the environment from loaded modules                            
ml purge > /dev/null 2>&1

# Load AlphaFold                                                                 
ml GCC/12.3.0  OpenMPI/4.1.5
ml AlphaFold/2.3.2-CUDA-12.1.1

# Check that you have a GPU to use (or which one if unspecified)
nvidia-smi

export ALPHAFOLD_HHBLITS_N_CPU=$SLURM_CPUS_ON_NODE

# Pick the fasta file for this array task (task IDs start at 0, sed line numbers at 1)
file=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" list_sequences.txt)
alphafold --fasta_paths="$file" --max_template_date=2024-01-10 --model_preset=multimer  --output_dir=$PWD
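
If you need to regenerate list_sequences.txt (one fasta file name per line, in the order the array indices should pick them up), and then submit the array:

ls seq[1-8].fasta > list_sequences.txt
sbatch jobarray.sh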

Note

  1. Each simulation will take ~1 hour.
  2. You can monitor the resources by using the job-usage tool.

Additional info

Note

AlphaFold is not an MPI code and can only run on a single node.

Important

Absolutely do not set ALPHAFOLD_HHBLITS_N_CPU to a larger value than $SLURM_CPUS_ON_NODE!
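
If you set the variable by hand instead of deriving it from Slurm, a small defensive sketch that caps it at the allocation (the requested core count is a hypothetical value):

# Cap the hhblits core count at what Slurm actually allocated on the node
WANTED=16   # hypothetical value
export ALPHAFOLD_HHBLITS_N_CPU=$(( WANTED < SLURM_CPUS_ON_NODE ? WANTED : SLURM_CPUS_ON_NODE ))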

We have run some basic tests on the installation, and it seems to work as expected using the T1050.fasta example mentioned in the AlphaFold GitHub README.

Using "--preset=full_dbs", we got the following runtimes:

  • CPU-only, on Kebnekaise, using 14 cores (on a skylake node): 11h 50min
  • CPU-only, on Kebnekaise, using 28 cores (1 full skylake node): 12h 17min
  • GPU, on Kebnekaise, using 1 V100 GPU: 2h 29min
  • GPU, on Kebnekaise, using 2 V100 GPUs: 2h 44min

This highlights a couple of important points:

  • Running AlphaFold on GPU is significantly faster than CPU-only (about 5x faster for this particular example).
  • Using more CPU cores may lead to longer runtimes, so be careful with using full nodes when running AlphaFold CPU-only.