AI/ML Models, Molecular Biology

AlphaFold

AlphaFold is DeepMind’s open-source implementation of the AlphaFold v2 inference pipeline, which transformed computational structural biology by delivering near‑experimental accuracy for many protein targets. The package implements both the monomer models used in CASP14 and AlphaFold‑Multimer for predicting complexes, and comes with downloadable model weights and utilities to build multiple sequence alignments (MSAs), search for templates and run end‑to‑end inference.

AlphaFold accepts a primary amino‑acid sequence (FASTA) and leverages evolutionary information (MSAs) plus structural templates, when available, to predict all heavy‑atom coordinates. Its predictions are accompanied by per‑residue confidence scores (pLDDT), predicted aligned error (PAE) maps and, for pTM models, pTM estimates that help users judge reliability and identify regions suitable for experimental follow‑up.

Inference produces several artifacts useful for analysis and pipelines: unrelaxed and Amber‑relaxed PDB files, ranked_*.pdb (models ordered by confidence, which for the monomer presets is the mean pLDDT, with ranked_0.pdb the most confident), features.pkl (input features), result_model_*.pkl (raw model outputs including distograms, pLDDT arrays and, for pTM and multimer models, predicted_aligned_error matrices), an msas/ directory, and JSON metadata and timing logs (ranking_debug.json, relax_metrics.json, timings.json). Confidence values are written into the B‑factor field of PDB outputs; unlike a crystallographic B‑factor, a higher pLDDT is better.

The repo exposes model presets (monomer, monomer_casp14, monomer_ptm, multimer), database presets (reduced_dbs and full_dbs) and flags to control relaxation (GPU vs CPU), the number of multimer seeds, reuse of precomputed MSAs and the template cutoff date.

Running AlphaFold at full scale requires substantial compute and disk resources. The code targets Linux and is most easily run with the provided Docker scripts (third‑party Singularity setups are commonly used on HPC).
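
As a rough illustration of how the presets and flags described above fit together, the following Python sketch assembles a command line for the repository's docker/run_docker.py launcher. The helper name build_run_command and the example paths are hypothetical. The flag names follow the repository's documentation as best understood here and should be checked against the launcher's --help output before use.

```python
import shlex
from datetime import date

# Preset names as listed in the text above.
MODEL_PRESETS = {"monomer", "monomer_casp14", "monomer_ptm", "multimer"}
DB_PRESETS = {"reduced_dbs", "full_dbs"}


def build_run_command(fasta_path, data_dir, output_dir, max_template_date,
                      model_preset="monomer", db_preset="full_dbs",
                      use_precomputed_msas=False):
    """Return an argv list for docker/run_docker.py (hypothetical helper).

    max_template_date is a datetime.date; templates released after it
    are meant to be excluded from the template search.
    """
    if model_preset not in MODEL_PRESETS:
        raise ValueError(f"unknown model preset: {model_preset!r}")
    if db_preset not in DB_PRESETS:
        raise ValueError(f"unknown database preset: {db_preset!r}")
    return [
        "python3", "docker/run_docker.py",
        f"--fasta_paths={fasta_path}",
        f"--max_template_date={max_template_date.isoformat()}",
        f"--model_preset={model_preset}",
        f"--db_preset={db_preset}",
        f"--data_dir={data_dir}",
        f"--output_dir={output_dir}",
        f"--use_precomputed_msas={'true' if use_precomputed_msas else 'false'}",
    ]


if __name__ == "__main__":
    # Example paths are placeholders, not defaults from the repository.
    cmd = build_run_command("target.fasta", "/data/alphafold_dbs",
                            "/data/af_out", date(2021, 11, 1),
                            model_preset="multimer", db_preset="reduced_dbs")
    print(shlex.join(cmd))
```

Building the argument list in one place makes it easy to validate presets before a long job starts and to pass the list to subprocess.run without shell quoting problems.
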
A modern NVIDIA GPU is recommended; DeepMind tested runs on machines with A100 GPUs (example cloud configuration: 12 vCPUs, 85 GB RAM, a 3 TB data disk and an A100). The full set of genetic and structural databases used by AlphaFold (BFD, MGnify, UniRef30, UniRef90, PDB70 and PDB mmCIF, plus UniProt and PDB seqres for multimer) totals roughly 556 GB of compressed downloads and expands to about 2.6 TB unzipped. A reduced database option (reduced_dbs) cuts resource needs to a minimum of approximately 600 GB of disk, 8 vCPUs and 8 GB of RAM. The repository provides scripts (scripts/download_all_data.sh and scripts/download_alphafold_params.sh) to download the databases and model parameters; the download directory should not be a subdirectory of the repo, because that slows Docker builds.

AlphaFold’s workflow and knobs make it suitable for many research use cases. Typical lab workflows use it to generate structural hypotheses for molecular replacement, to interpret cryo‑EM density, to suggest mutational probes or to prioritize biochemical experiments. Multimer mode predicts the structure and interfaces of protein complexes from the chains supplied in the input FASTA; the stoichiometry is specified by the user rather than predicted. By default AlphaFold‑Multimer runs 5 seeds for each of its five models, producing 25 predictions in total, and the number of seeds per model can be tuned with --num_multimer_predictions_per_model. For high‑throughput projects, users can compute MSAs once and reuse them (--use_precomputed_msas=true) while running inference repeatedly with different model settings.

AlphaFold integrates existing MSA and template search tools (jackhmmer and hmmsearch from HMMER, HHblits and HHsearch from HH‑suite) and uses OpenMM with an Amber force field for restrained relaxation. Many users have integrated AlphaFold into proteome‑scale projects and pipelines, and DeepMind’s public AlphaFold Protein Structure Database hosts millions of precomputed predictions for lookup and download.

There are important caveats and reproducibility notes to consider.
AlphaFold outputs are predictive models, not experimentally validated structures, and are explicitly not intended for clinical decision making. Prediction accuracy depends strongly on MSA depth (performance drops when the median alignment depth is low; a threshold of around 30 effective sequences is frequently cited) and on whether a target’s fold is stabilized by heterotypic contacts or cofactors. AlphaFold shows some inter‑run variance for a small fraction of targets where MSAs change significantly; ensembling and the provided model‑selection strategy (five models ranked by pLDDT) mitigate this but do not eliminate it.

The code is released under the Apache 2.0 license, while the model parameters and CASP prediction data are available under CC BY 4.0; the repository includes citation guidance (Jumper et al., Nature 2021 for monomers and Evans et al. for multimer). For production use, follow the repo’s installation and update instructions carefully (database and parameter updates, permission settings) and ensure adequate GPU, CPU and SSD resources for reliable performance.
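
Because pLDDT is stored in the B‑factor field of the PDB outputs, a quick reliability check needs nothing beyond the standard library. The sketch below reads per‑residue pLDDT from the CA atoms of a PDB file and groups consecutive low‑confidence residues into segments. The function names are hypothetical. The default threshold of 70 is an assumption, based on the confidence bands commonly shown for AlphaFold predictions, and can be adjusted.

```python
def plddt_per_residue(pdb_text):
    """Map (chain, residue number) -> pLDDT, read from CA atom B-factors.

    Uses the fixed-column PDB layout: atom name in columns 13-16,
    chain ID in column 22, residue number in 23-26, B-factor in 61-66.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            scores[key] = float(line[60:66])
    return scores


def low_confidence_segments(scores, threshold=70.0):
    """Return (chain, start, end) runs of consecutive residues below threshold."""
    segments = []
    current = None  # the run being extended, as [chain, start, end]
    for (chain, resseq), plddt in sorted(scores.items()):
        if plddt < threshold:
            if current and current[0] == chain and resseq == current[2] + 1:
                current[2] = resseq  # extend the open run
            else:
                current = [chain, resseq, resseq]  # start a new run
                segments.append(current)
        else:
            current = None  # a confident residue closes any open run
    return [tuple(segment) for segment in segments]
```

For example, calling plddt_per_residue on the contents of ranked_0.pdb and passing the result to low_confidence_segments gives a short list of regions to treat with caution, such as likely disordered loops or termini, before planning experiments around the model.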