ESMFold

Background

ESMFold is a structure-prediction system developed by the Meta FAIR protein team that leverages large protein language models (ESM-2) to predict atomic-level 3D coordinates directly from a single amino-acid sequence. By learning evolutionary and coevolutionary patterns from millions of protein sequences, ESMFold bypasses the expensive MSA search step used by earlier methods and achieves dramatic speedups (up to ~60× faster in reported experiments) while maintaining high accuracy. This efficiency made it possible to predict structures for hundreds of millions of metagenomic proteins and to build the ESM Metagenomic Atlas (updated to ~772 million predicted structures).

Core capabilities

ESMFold produces full-atom structure predictions together with standard confidence metrics: per-residue pLDDT (local confidence), pTM (predicted TM-score, a global similarity measure), and PAE (predicted aligned error). Outputs are written in PDB format for downstream visualization and analysis. The system supports multimer prediction by joining chains with a ':' separator, and exposes both fast single-sequence folding and higher-accuracy settings (via recycles, batching, and structure-module options). Pretrained model checkpoints (esmfold_v1 is recommended) and smaller variants are published in the FAIR-ESM GitHub repository and distributed via model URLs. The codebase integrates an OpenFold structure module; CUDA (nvcc) and a PyTorch GPU environment are required for efficient inference at scale.

How to use and integrate

There are multiple entry points for using ESMFold. Developers can install the fair-esm package (pip install fair-esm) or clone the GitHub repository to run the native PyTorch implementation and command-line tools (esm-fold for bulk PDB generation, esm-extract for embeddings). Hugging Face and ColabFold also provide simplified bindings and notebooks for running ESMFold interactively. For programmatic access, the ESM Metagenomic Atlas provides a rate-limited API that accepts a sequence via an HTTP POST and returns a PDB file, for example: curl -X POST --data "<amino_acid_sequence>" https://api.esmatlas.com/foldSequence/v1/pdb/. The Atlas also serves endpoints that return pLDDT/pTM/PAE metadata and precomputed ESM2 embedding vectors (averaged final-layer activations, 2560-dimensional for some models).
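
As a minimal sketch of local use, following the loading pattern published in the fair-esm repository (a working ESMFold environment and a CUDA GPU are assumed; the amino-acid sequence and output filename below are placeholders):

    import torch
    import esm

    # Load the recommended checkpoint; weights are downloaded on first use.
    model = esm.pretrained.esmfold_v1()
    model = model.eval().cuda()

    # Optional: lower the attention chunk size to reduce peak GPU memory on
    # long sequences, at some cost in speed.
    # model.set_chunk_size(128)

    # Placeholder monomer sequence; for a multimer, join chains with ':'.
    sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

    with torch.no_grad():
        pdb_string = model.infer_pdb(sequence)

    with open("prediction.pdb", "w") as handle:
        handle.write(pdb_string)

The returned value is plain PDB text, so it can be written to disk and passed directly to viewers or downstream analysis tools.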
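
The hosted fold-on-demand endpoint can also be called from Python. The sketch below mirrors the curl example using the third-party requests library (an assumption, not part of ESMFold) and sets the same form-encoded content type that curl sends by default; the rate limit still applies.

    import requests

    # Placeholder sequence; replace with your own amino-acid string.
    sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

    # POST the raw sequence; the endpoint responds with the predicted structure as PDB text.
    response = requests.post(
        "https://api.esmatlas.com/foldSequence/v1/pdb/",
        data=sequence,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        timeout=300,
    )
    response.raise_for_status()

    with open("api_prediction.pdb", "w") as handle:
        handle.write(response.text)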
Workflows and example use-cases

ESMFold is designed for both single-protein studies and large-scale, high-throughput projects. Typical use cases include: (1) folding novel metagenomic sequences to discover previously unknown folds or catalytic motifs; (2) screening enzyme candidates and microbial proteins for functional annotation; (3) feeding predicted structures into structure-search pipelines such as Foldseek to find distant structural homologs; (4) coupling ESM embeddings with supervised models, or using ESM-1v, for variant-effect prediction; and (5) combining with inverse-folding models (ESM-IF1) or generative design workflows to design or score sequences for a target backbone. The ESM repository includes CLI tools, examples, and notebooks for bulk embedding extraction, contact prediction, supervised variant-effect training, inverse folding, and sequence design.

ESM Metagenomic Atlas, search & downloads

The large collection of ESMFold predictions is published as the ESM Metagenomic Atlas and is available under CC BY 4.0. The Atlas provides web-based exploration (a 2D embedding map colored by sequence similarity and confidence) and search APIs for both sequence and structure. A high-confidence clustered subset was created for fast search and is distributed as tarballs of PDB files, Foldseek databases, and ESM2 embedding vectors. Cited download sizes include roughly 1 TB for the high-confidence subset, on the order of 20 TB for the full dataset, and a roughly 25 GB metadata parquet file that serves as an entry point.

Limitations, best practices & licensing

ESMFold is optimized for speed and scalability, but bulk or long-sequence workloads require appropriate compute (PyTorch with a CUDA GPU is recommended); the codebase documents CPU offload, chunking strategies, and batching flags for managing memory. The public API is rate-limited and intended for single sequences or small batches; for large projects, download the model weights and run locally. Predictions should be interpreted alongside their confidence metrics (pLDDT/pTM/PAE), for example by checking per-residue pLDDT as sketched below, and validated experimentally when possible. The code is open source (MIT license) and the Atlas data is available under CC BY 4.0; users are asked to cite the ESMFold/ESM-2 paper (Lin et al., "Evolutionary-scale prediction of atomic-level protein structure with a language model") when publishing results derived from these resources.
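
A minimal sketch of the pLDDT check mentioned above: ESMFold-style PDB output stores per-residue pLDDT in the B-factor column, so averaging that column gives a quick overall confidence estimate. The filename is a placeholder, and the column positions follow the fixed-width PDB format.

    def mean_plddt(pdb_path: str) -> float:
        """Average the B-factor column of ATOM records, which ESMFold uses for pLDDT."""
        values = []
        with open(pdb_path) as handle:
            for line in handle:
                if line.startswith("ATOM"):
                    # Fixed-width PDB format: the B-factor field spans columns 61-66.
                    values.append(float(line[60:66]))
        return sum(values) / len(values)

    print(f"mean pLDDT: {mean_plddt('prediction.pdb'):.2f}")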