AI/ML Models, Public Databases

ESM Atlas

Background: The ESM Metagenomic Atlas (ESM Atlas) is an open resource from Meta AI that pairs state-of-the-art protein language models (the ESM family) with a massive collection of metagenomic structure predictions. At its core, ESMFold, a structure-prediction head built on representations from the ESM-2 language model, generates atomic-level 3D structure predictions directly from protein sequences. The Atlas provides access to hundreds of millions of predicted structures (the public site visualizes a sample of 1 million structures; the dataset has been updated to 772 million proteins), precomputed ESM-2 embeddings, and programmatic search and folding endpoints designed for exploration, analysis, and downstream model development.

Capabilities and models: The project releases code, pretrained weights, and tools for multiple ESM-family models: ESM-2 (large transformer protein language models), ESMFold (fast structure prediction), ESM-1v (zero-shot variant-effect prediction), ESM-IF1 (inverse folding / sequence design), and MSA-aware ESM-MSA variants. Pretrained checkpoints range from lightweight models to multi-billion-parameter ESM-2 variants. The Atlas offers (1) a foldSequence API that returns PDB-formatted structure predictions and confidence metrics (pLDDT, pTM, PAE), (2) fetchEmbedding endpoints that return precomputed ESM-2 embedding vectors per MGnify ID, and (3) sequence and structure search APIs backed by reduced, clustered subsets for fast querying. Tools for running predictions locally include the fair-esm Python package, command-line interfaces (esm-fold, esm-extract), HuggingFace transformers compatibility, and ColabFold/Colab notebooks for browser-based runs.

Data scale, subsets, and downloads: Because the Atlas operates at metagenomic scale, Meta provides both the full dataset and curated subsets for practical use. The high-confidence clustered subset (used for search) is provided as tarballs, Foldseek databases, and ESM-2 embedding vectors; this subset is about 1 TB, whereas the full Atlas is roughly 20 TB. Metadata for the whole Atlas is available as a smaller file (~25 GB) suitable for indexing and large-scale analysis. Precomputed embeddings for the Atlas are available for bulk download. Search infrastructure (sequence search and Foldseek-enabled structure search) is optimized for the clustered, high-confidence subset to enable rapid exploration of millions of structures.

Example use cases: Researchers can fold a single protein sequence with the public foldSequence API to get a PDB file and confidence scores, or run ESMFold locally (via the GitHub repo or HuggingFace) for higher-throughput tasks. ESM-2 embeddings can be used to cluster metagenomic sequences, to train supervised predictors of variant effect or function, and alongside contact prediction from model attention maps. ESM-IF1 supports inverse folding: sampling designed sequences for a fixed backbone, or scoring conditional log-likelihoods of candidate sequences. ESM-1v enables zero-shot prediction of mutation effects. Typical workflows include extracting embeddings in bulk from FASTA files with esm-extract, running esm-fold in batch mode for bulk structure prediction (with options such as CPU offload and chunked attention to manage memory), and using the Atlas search APIs or Foldseek to find structural neighbors across millions of entries. Minimal sketches of several of these workflows follow.
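Folding a single sequence is one HTTP call. A minimal sketch in Python, assuming the v1 endpoint path documented for the Atlas API (the input sequence is an arbitrary example; the returned PDB stores per-residue pLDDT in the B-factor column):

```python
# Sketch: fold one sequence through the public ESM Atlas API.
# Endpoint path follows the Atlas docs at the time of writing; verify it
# (and the rate limits) before depending on it.
import requests

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

resp = requests.post(
    "https://api.esmatlas.com/foldSequence/v1/pdb/",
    data=sequence,
    timeout=300,
)
resp.raise_for_status()

with open("prediction.pdb", "w") as fh:
    fh.write(resp.text)  # PDB text; pLDDT is encoded in the B-factor column
```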
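For local, higher-throughput folding, ESMFold loads directly from fair-esm. A sketch following the pattern documented in the repo (chunked attention trades speed for lower peak memory on long sequences):

```python
# Sketch: local ESMFold inference via fair-esm (requires the esmfold extras).
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval()  # add .cuda() if a GPU is available

# Smaller chunk sizes reduce peak memory at some speed cost.
model.set_chunk_size(128)

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
with torch.no_grad():
    pdb_str = model.infer_pdb(sequence)

with open("local_prediction.pdb", "w") as fh:
    fh.write(pdb_str)
```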
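Embedding extraction and attention-based contact prediction share one forward pass in the fair-esm package. A sketch using the 650M-parameter ESM-2 checkpoint (the sequence is again arbitrary):

```python
# Sketch: per-residue ESM-2 embeddings and attention-derived contact maps
# via the fair-esm package (pip install fair-esm).
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()  # downloads weights on first use
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=True)

embeddings = out["representations"][33]  # (batch, tokens, 1280) per-residue vectors
contacts = out["contacts"]               # predicted residue-residue contact probabilities
```

The per-residue vectors can be mean-pooled into fixed-length sequence embeddings for clustering or as features for supervised predictors.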
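Inverse folding with ESM-IF1 follows a load-coordinates, then sample-or-score pattern. A sketch assuming a local file backbone.pdb with a chain A (helper names follow the fair-esm inverse_folding examples):

```python
# Sketch: sequence design and scoring with ESM-IF1 (inverse folding).
# Requires the extra dependencies listed in the fair-esm repo
# (e.g. torch-geometric, biotite).
import esm
import esm.inverse_folding

model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
model.eval()

# Backbone coordinates (N, CA, C) for chain A of a local structure file.
coords, native_seq = esm.inverse_folding.util.load_coords("backbone.pdb", "A")

# Sample a designed sequence conditioned on the fixed backbone.
designed = model.sample(coords, temperature=1.0)

# Score a candidate: average log-likelihood conditioned on the backbone.
ll_fullseq, _ = esm.inverse_folding.util.score_sequence(
    model, alphabet, coords, designed
)
print(designed, ll_fullseq)
```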
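Zero-shot variant scoring with ESM-1v can use a wild-type-marginal heuristic: compare the model's log-probabilities of the mutant and wild-type residues at the mutated position. A sketch (the sequence and mutation are hypothetical; the ESM-1v paper also describes masked-marginal scoring and ensembling the five released checkpoints):

```python
# Sketch: zero-shot variant-effect scoring with ESM-1v (wt-marginal heuristic).
import torch
import esm

model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()  # one of five checkpoints
batch_converter = alphabet.get_batch_converter()
model.eval()

wt_seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT"
pos, wt, mt = 5, "R", "K"  # hypothetical mutation R5K, 1-indexed
assert wt_seq[pos - 1] == wt

_, _, tokens = batch_converter([("wt", wt_seq)])
with torch.no_grad():
    log_probs = torch.log_softmax(model(tokens)["logits"], dim=-1)

# Token index is pos (not pos - 1) because the tokenizer prepends a BOS token.
score = (log_probs[0, pos, alphabet.get_idx(mt)]
         - log_probs[0, pos, alphabet.get_idx(wt)]).item()
print(score)  # higher means the mutation is predicted to be better tolerated
```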
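Bulk workflows go through the command-line entry points. Driven from Python here for self-containment (flag names follow the fair-esm README; check them against your installed version; the file and directory names are placeholders):

```python
# Sketch: bulk embedding extraction and batch folding via the fair-esm CLIs.
import subprocess

# Mean per-sequence ESM-2 embeddings from a FASTA file (final layer, 33).
subprocess.run(
    ["esm-extract", "esm2_t33_650M_UR50D", "proteins.fasta", "embeddings/",
     "--repr_layers", "33", "--include", "mean"],
    check=True,
)

# Batch structure prediction; CPU offload and chunked attention manage GPU memory.
subprocess.run(
    ["esm-fold", "-i", "proteins.fasta", "-o", "pdbs/",
     "--cpu-offload", "--chunk-size", "64"],
    check=True,
)
```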
Integrations and run options: The repository integrates with common ML and bioinformatics ecosystems: PyTorch (with optional FSDP CPU offload via Fairscale for large-model inference), OpenFold for structure modules (nvcc and CUDA are required for some installs), HuggingFace transformers for simplified model loading, ColabFold for browser-based folding, MMseqs2 for clustering, and Foldseek for structure search (see the sketches at the end of this entry). The Atlas web UI is intended for exploration and returns structure visualizations colored by prediction confidence; programmatic users can access rate-limited endpoints to fetch PDBs, embeddings, and metadata. The code and model weights are available on GitHub (fair-esm), installable via pip (fair-esm) or torch.hub; example scripts and Jupyter notebooks demonstrate variant prediction, contact prediction, inverse folding, and bulk embedding extraction.

Licensing and reproducibility: Atlas data is made available under a CC BY 4.0 license for academic and commercial use, and the source code is released under the MIT license. The team provides bulk download instructions, model checkpoints, and notebooks to reproduce the examples. Because the APIs are rate-limited and the models are large, the Atlas supports both lightweight cloud-based queries for single sequences and local batch processing for large-scale studies. If you use the models or Atlas structures, the repository includes citation details for the relevant papers (ESM, ESM-1v, ESM-IF1, ESM-2, and ESMFold) to credit the underlying methods.
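To illustrate the HuggingFace integration mentioned above, ESM-2 checkpoints load through the standard transformers API (a sketch; facebook/esm2_t33_650M_UR50D is the published Hub name for the 650M checkpoint):

```python
# Sketch: ESM-2 embeddings through HuggingFace transformers instead of fair-esm.
# Equivalently: torch.hub.load("facebookresearch/esm:main", "esm2_t33_650M_UR50D")
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

inputs = tokenizer(
    "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT",
    return_tensors="pt",
)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, tokens, 1280) embeddings
print(hidden.shape)
```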
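And for structure search against the downloadable clustered subset, Foldseek's easy-search workflow applies. A sketch (the database path is a placeholder for wherever the Atlas Foldseek database was unpacked):

```python
# Sketch: find structural neighbors in the Atlas's Foldseek database.
# Assumes foldseek is installed and the high-confidence clustered database
# has been downloaded and unpacked to the placeholder path below.
import subprocess

subprocess.run(
    ["foldseek", "easy-search",
     "query.pdb",                 # a predicted or experimental structure
     "esmatlas_foldseek_db/db",   # placeholder path to the Atlas database
     "hits.m8",                   # tabular results (query, target, scores)
     "tmp/"],                     # scratch directory required by foldseek
    check=True,
)
```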