ColabFold

ColabFold is an open-source project that makes modern protein structure prediction workflows accessible without requiring deep ML or HPC expertise. Built around interactive Google Colab notebooks and companion command‑line tools, ColabFold wraps high‑accuracy predictors (AlphaFold, AlphaFold‑multimer, RoseTTAFold, ESMFold and experimental variants) together with fast MMseqs2 multiple‑sequence alignment (MSA) searches and convenient downstream outputs (PDB, JSON, PNG). The project was designed to let bench biologists, computational scientists and educators generate reliable structural models quickly, either on free Colab GPUs or on local/cloud GPU resources.

Core capabilities include multiple ready‑to‑run Colab notebooks for single sequences, complexes and batch predictions; a public MSA server and local utilities (colabfold_batch, colabfold_search) to fetch MSAs; and options for multimer/complex prediction via AlphaFold‑multimer or residue‑index jumps. ColabFold streamlines the MSA step by using curated MMseqs2 databases (UniRef30, BFD/MGnify and the ColabFold DB) to produce diverse MSAs that feed the prediction networks. For reproducibility and downstream use, ColabFold saves predicted structures plus confidence metrics (pLDDT in the B-factor column) and can export MSAs to an AlphaFold3‑compatible JSON format (--af3-json). A dedicated amber‑relax notebook is provided for users who want to post‑process models without re-running the full prediction.

ColabFold supports a range of deployment modes. The simplest is the set of graphical Colab notebooks that run on Google Colab GPUs, which is ideal for exploratory or low‑throughput work. For heavier workloads, ColabFold provides command‑line tools: colabfold_batch (queries the public MSA server and runs predictions) and colabfold_search / setup_databases.sh to create and search local MMseqs2 databases.
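Because pLDDT is stored in the B-factor column of the output PDB files, per‑residue confidence can be read back with a few lines of plain Python. The following is a minimal sketch, assuming standard fixed‑width PDB ATOM records (B-factor in columns 61 to 66); the function names and the demo atoms are hypothetical and not part of ColabFold.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, b_factor):
    """Build a fixed-width PDB ATOM record (demo helper only, not a ColabFold API)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain:1s}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{b_factor:6.2f}")


def plddt_per_residue(pdb_text):
    """Return {(chain, residue_number): pLDDT}, read from the B-factor of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":      # atom name, columns 13-16
            continue
        chain = line[21]                     # chain ID, column 22
        res_seq = int(line[22:26])           # residue number, columns 23-26
        scores[(chain, res_seq)] = float(line[60:66])  # B-factor, columns 61-66
    return scores


def mean_plddt(scores):
    """Average pLDDT over all residues found (raises on an empty model)."""
    if not scores:
        raise ValueError("no CA atoms found")
    return sum(scores.values()) / len(scores)


# Made-up three-atom model, purely for illustration.
demo = "\n".join([
    format_atom_line(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0, 91.25),
    format_atom_line(2, "CA", "MET", "A", 1, 1.458, 0.0, 0.0, 91.25),
    format_atom_line(3, "CA", "GLY", "A", 2, 3.8, 1.2, 0.0, 62.5),
])
scores = plddt_per_residue(demo)
print(scores, mean_plddt(scores))
```

In practice the text would come from a predicted model file, for example `plddt_per_residue(open(path).read())`, where `path` points at one of the PDB outputs.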
Large‑scale or low‑latency setups can host the ColabFold databases locally (note: the full ColabFold DBs require very large disk space, about 940 GB, and significant RAM) or run a GPU‑resident MMseqs2 gpuserver to keep indices in GPU memory. Typical resource notes from the project: batch searches commonly need ~128 GB RAM, whereas single‑query low‑latency searches with precomputed indices can demand on the order of 768 GB–1 TB of RAM. GPU‑accelerated search is supported (MMseqs2‑GPU) and invoked with --gpu 1; specific GPUs can be selected via CUDA_VISIBLE_DEVICES.

Typical use cases: (1) rapid single‑sequence structure prediction in a Colab notebook to inspect fold and per‑residue confidence, (2) predicting protein‑protein complexes using either AlphaFold‑multimer or the AlphaFold2 residue‑index jump method (different Colab notebooks expose each approach), (3) high‑throughput proteome or family‑scale prediction by running colabfold_search against local databases and batching predictions, and (4) generating AlphaFold3‑compatible MSA JSON inputs including non‑protein components (DNA, RNA, ligands via CCD or SMILES) for advanced modeling.

ColabFold also integrates with common structural tools: models can be colored in PyMOL by pLDDT (examples provided), and outputs are usable in crystallographic pipelines (e.g., molecular replacement), with the caveat that pLDDT values occupy the B‑factor column and may need conversion for tools that expect lower B‑factors to indicate higher confidence.

Installation and integration points are straightforward for most users: interactive Colab notebooks require no installation; local users can install via pip (pip install "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold") and follow the project wiki for Docker or localcolabfold installer options. The project documents MMseqs2 version requirements for database generation and encourages use of the provided database downloads for reproducible MSAs.

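The pLDDT caveat for crystallographic tools can be made concrete with a small conversion sketch. The mapping below is an assumption, not something stated in the ColabFold description: it estimates a coordinate error of 1.5·exp(4·(0.7 − pLDDT/100)) Å and converts it to B = (8π²/3)·error², an empirical relation of the kind used in some crystallographic tooling. The function names and the 999.0 cap are hypothetical; check which convention your molecular‑replacement software expects before relying on it.

```python
import math


def plddt_to_bfactor(plddt, max_b=999.0):
    """Map pLDDT (0-100, higher is better) to a pseudo B-factor in A^2 (lower is better).

    ASSUMPTION: empirical error model, error = 1.5 * exp(4 * (0.7 - pLDDT/100)).
    The result is capped at max_b so it still fits the 6-character PDB field.
    """
    lddt = min(max(plddt, 0.0), 100.0) / 100.0
    error = 1.5 * math.exp(4.0 * (0.7 - lddt))          # estimated error in Angstrom
    b_factor = (8.0 * math.pi ** 2 / 3.0) * error ** 2  # isotropic B from mean-square error
    return min(b_factor, max_b)


def convert_pdb_bfactors(pdb_text):
    """Rewrite columns 61-66 of every ATOM/HETATM record with the converted value."""
    converted = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            new_b = plddt_to_bfactor(float(line[60:66]))
            line = f"{line[:60]}{new_b:6.2f}{line[66:]}"
        converted.append(line)
    return "\n".join(converted)


# Confident residues end up with small B-factors, uncertain ones with large values.
for value in (95.0, 70.0, 40.0):
    print(value, round(plddt_to_bfactor(value), 2))
```

Keeping the conversion as a separate step leaves the original pLDDT‑bearing file untouched, which is convenient for tools (such as pLDDT coloring in PyMOL) that expect the raw values.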
ColabFold is actively maintained and citable (Mirdita et al., ColabFold: Making protein folding accessible to all), and its architecture is intentionally modular so labs can combine it with their pipelines, relax/post‑processing tools, or visualization suites. If you plan scalable production runs, review the database size and RAM recommendations, consider running the optional MMseqs2 gpuserver for minimal latency, and test notebook versus local CLI modes on representative sequences to choose the best balance of convenience and throughput. ColabFold lowers the barrier to entry for modern structure prediction while preserving the control and reproducibility needed for serious research and batch processing.
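As a starting point for the resource review suggested above, the following sketch compares a host against the approximate figures quoted in this description (about 940 GB of disk for the full databases and about 128 GB of RAM for batch searches). The function names are hypothetical, the thresholds are rough guides rather than official requirements, and the RAM probe assumes a POSIX system such as Linux.

```python
import os
import shutil

GB = 1000 ** 3  # decimal gigabytes, matching the "~940 GB" style of the figures above

# Rough thresholds taken from the figures quoted in this description.
DB_DISK_GB = 940
BATCH_RAM_GB = 128


def check_resources(free_disk_bytes, total_ram_bytes,
                    disk_gb=DB_DISK_GB, ram_gb=BATCH_RAM_GB):
    """Return a list of warning strings; an empty list means both thresholds are met."""
    warnings = []
    if free_disk_bytes < disk_gb * GB:
        warnings.append(
            f"free disk {free_disk_bytes / GB:.0f} GB is below the ~{disk_gb} GB "
            "needed for the full databases")
    if total_ram_bytes < ram_gb * GB:
        warnings.append(
            f"RAM {total_ram_bytes / GB:.0f} GB is below the ~{ram_gb} GB "
            "suggested for batch searches")
    return warnings


def probe_host(db_dir="."):
    """Measure free disk under db_dir and physical RAM (POSIX sysconf, e.g. Linux)."""
    free_disk = shutil.disk_usage(db_dir).free
    total_ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return free_disk, total_ram
```

A typical call would be `check_resources(*probe_host("/path/to/databases"))`, printing any returned warnings before starting a database download. For single‑query low‑latency searches, pass a larger `ram_gb` (the description cites roughly 768 GB to 1 TB).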