AlphaFold DB

AlphaFold DB is a large, freely available collection of protein structure predictions produced by the AlphaFold family of models developed by DeepMind; the database itself is hosted and curated by EMBL‑EBI. It gives researchers programmatic and web access to predicted 3D coordinates for the vast majority of sequences catalogued in UniProt, including full proteomes and the hand‑curated Swiss‑Prot subset. The public resource contains over 200 million entries (the 2024 release reports ~214.7M predictions) and is designed to accelerate structural biology, functional annotation, and downstream computational analyses without requiring each user to run the full AlphaFold inference pipeline locally.

What’s provided for each entry and how to interpret it: every structure page includes coordinate files (PDB and mmCIF), a per‑residue confidence metric called pLDDT (stored in the B‑factor field of the coordinate files), and a Predicted Aligned Error (PAE) matrix that quantifies how well relative positions between residue pairs are predicted. Rough rules of thumb used on the site help interpretation: pLDDT > 90 indicates high model accuracy suitable for detailed analysis (e.g., binding‑site characterization), 70–90 is generally good for backbone placement, and values below 70 should be treated with caution. The PAE is displayed as an interactive 2D plot and is particularly important for assessing domain‑packing confidence: low PAE between two domains indicates a well‑resolved relative orientation, whereas high PAE implies uncertainty in how the domains pack together.

Search, analysis and bulk access: AlphaFold DB provides sequence and text search, integrated structure search via Foldseek (which can find similar experimental and predicted structures across the PDB and AFDB50 clusters), and downloads for individual entries or whole proteomes. Foldseek integration enables fast structure‑based similarity searches and structural alignments with reported RMSD and E‑values.
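As a concrete illustration of the pLDDT guidance above, the following minimal Python sketch reads per‑residue pLDDT from the B‑factor column of PDB‑format text and groups residues into the three rough bands mentioned in the text. The function names, the band labels, and the `make_atom_line` demo helper are illustrative assumptions, not part of any AlphaFold DB tooling. The parser assumes standard fixed‑width PDB ATOM records and takes one value per residue from the CA atom.

```python
from collections import Counter


def plddt_per_residue(pdb_text):
    """Return (chain, residue_number, pLDDT) per residue, read from CA atoms."""
    residues = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns (1-based): atom name 13-16, chain 22,
        # residue number 23-26, B-factor 61-66 (holds pLDDT in AlphaFold DB files).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residues.append((line[21], int(line[22:26]), float(line[60:66])))
    return residues


def confidence_band(plddt):
    """Map a pLDDT value to the rough bands described in the text."""
    if plddt > 90:
        return "high"       # suitable for detailed analysis
    if plddt >= 70:
        return "backbone"   # generally good backbone placement
    return "caution"        # treat with caution


def band_counts(pdb_text):
    """Count residues falling into each confidence band."""
    return Counter(confidence_band(p) for _, _, p in plddt_per_residue(pdb_text))


def make_atom_line(serial, atom, resname, chain, resseq, bfactor):
    """Hypothetical demo helper: build a synthetic fixed-width ATOM record."""
    return (
        "ATOM  {:5d} {:<4s} {:>3s} {}{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
    ).format(serial, atom, resname, chain, resseq, 0.0, 0.0, 0.0, 1.0, bfactor)


if __name__ == "__main__":
    # Synthetic three-residue example; real input would be a downloaded file.
    demo = "\n".join([
        make_atom_line(1, "N", "MET", "A", 1, 95.5),
        make_atom_line(2, "CA", "MET", "A", 1, 95.5),
        make_atom_line(3, "CA", "GLY", "A", 2, 80.0),
        make_atom_line(4, "CA", "SER", "A", 3, 42.1),
    ])
    for chain, resseq, plddt in plddt_per_residue(demo):
        print(chain, resseq, plddt, confidence_band(plddt))
    print(dict(band_counts(demo)))
```

In practice `pdb_text` would be the contents of a PDB file obtained from an entry page; mmCIF files use a different layout and would need a different parser.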
For large‑scale workflows, EMBL‑EBI exposes bulk downloads (proteome tar archives and a full dataset on Google Cloud Public Datasets). The complete UniProt‑mapped dataset is available (~23 TiB), but most users are encouraged to download only relevant subsets. The site also exposes additional derived data: TED (The Encyclopedia of Domains) domain assignments and links to other EMBL‑EBI resources to support evolutionary and functional interpretation.

Running AlphaFold yourself and technical requirements: if a sequence of interest is not present in the database or you need bespoke prediction settings (for example multimer complexes), DeepMind’s open‑source AlphaFold repository and inference pipeline can be run locally or on cloud HPC. The official code supports monomer and AlphaFold‑Multimer variants and provides Docker scripts to simplify setup. Full installation and database setup require substantial disk space: the full set of genetic databases totals multiple terabytes (example download sizes in the docs include >2 TB unzipped), and DeepMind’s download script fetches models plus databases such as BFD, MGnify, UniRef and PDB mmCIF. AlphaFold requires Linux and a modern NVIDIA GPU for practical runtimes; a smaller reduced‑database preset (reduced_dbs) exists to lower resource needs and can run with ~8 CPU cores, ~8 GB RAM and ~600 GB disk. Example cloud configurations tested include A100 GPU machines with dozens of vCPUs, tens of GB of RAM and separate multi‑TB disks for the databases.

Outputs, model presets and reproducibility: the inference pipeline produces multiple artifacts that are useful for downstream analysis, including raw features and MSAs (features.pkl, msas/), unrelaxed and Amber‑relaxed PDB files, ranked models reordered by predicted confidence, per‑residue pLDDT arrays, and pTM/PAE outputs when using pTM models.
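The PAE outputs mentioned above can be summarized per domain pair to judge domain‑packing confidence. The sketch below, in plain Python, averages PAE over all residue pairs between two residue sets. It assumes the PAE matrix has already been loaded as a square list of lists indexed from 0 and that the domain boundaries are known (for example from a domain annotation); the function name and the toy matrix are hypothetical. Because PAE is not symmetric, the sketch averages both directions for each pair.

```python
def mean_pae_between(pae, residues_a, residues_b):
    """Mean PAE over all pairs (i in A, j in B), counting both pae[i][j] and pae[j][i]."""
    values = []
    for i in residues_a:
        for j in residues_b:
            values.append(pae[i][j])
            values.append(pae[j][i])
    if not values:
        raise ValueError("both residue sets must be non-empty")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy 4-residue matrix: residues 0-1 form one domain, residues 2-3 another.
    toy_pae = [
        [0.5, 1.0, 20.0, 22.0],
        [1.0, 0.5, 21.0, 23.0],
        [19.0, 20.0, 0.5, 1.5],
        [21.0, 24.0, 1.5, 0.5],
    ]
    domain_1 = range(0, 2)
    domain_2 = range(2, 4)
    print(mean_pae_between(toy_pae, domain_1, domain_1))  # 0.75 (within domain 1)
    print(mean_pae_between(toy_pae, domain_1, domain_2))  # 21.25 (between domains)
```

In this toy case the within‑domain value is low and the between‑domain value is high, which is the pattern the text describes for confidently folded domains whose relative orientation is uncertain. Any numeric cutoff for "low" or "high" would be a user choice and is deliberately left out.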
The open‑source code ships multiple model presets (monomer, monomer_casp14, monomer_ptm, multimer) and supports options to trade off MSA/database time versus accuracy (full_dbs vs reduced_dbs), reuse precomputed MSAs, and control how many seeds/predictions are produced for multimers. The AlphaFold team documents sources of variability (inter‑run variance, MSA differences) and provides guidance on reproducing CASP14‑style runs where needed.

Typical use‑cases and limitations: AlphaFold DB is immediately useful for inspecting likely protein folds, locating high‑confidence structural segments, suggesting candidate binding pockets, guiding mutagenesis experiments, annotating domain boundaries, and seeding molecular docking or comparative analyses. Complementary tools showcased on the site include AlphaMissense (variant pathogenicity scores built on AlphaFold‑style models) and TED domain annotations for comparative domain classification. Important limitations are also emphasized: AlphaFold predictions are theoretical models, not experimentally validated structures, and the resource is not intended for clinical decision making. The system is primarily validated for predicting single‑chain structures of natural sequences; it does not reliably model small‑molecule binding, post‑translational modifications, bound cofactors/ions, or conformational ensembles, and it is not validated for predicting the structural consequences of specific destabilizing mutations.

Licensing and citation: prediction coordinates and associated data on AlphaFold DB are provided for academic and commercial use under Creative Commons Attribution 4.0 (CC BY 4.0); the open‑source code is available under Apache‑2.0. Users are asked to cite the AlphaFold papers and the AlphaFold DB publication when using the resource. For questions, bulk download help, or feedback, the site points to EMBL‑EBI and AlphaFold team contacts and to the documentation pages.

Links