
Enformer

Background and purpose:
Enformer addresses a central challenge in regulatory genomics: predicting cell- and tissue-specific regulatory activity and gene expression directly from DNA sequence, including the influence of distal enhancers located tens of kilobases away. Unlike prior convolutional-only architectures, which are effectively limited to ~20 kb receptive fields, Enformer combines convolutional embedding with multi-head self-attention to integrate information across much longer genomic distances (up to ~100 kb of regulatory context). This wider information flow yields more accurate expression and chromatin-state predictions and produces more informative variant effect scores for downstream interpretation.

Architecture and inputs/outputs:
The model processes very long one-hot DNA sequences (196,608 bp) through an initial convolutional tower with pooling to embed 128-bp bins, followed by a stack of transformer (attention) blocks that capture long-range interactions. A cropping step and organism-specific output heads produce dense predictions across genomic bins: for the human model, 5,313 genomic tracks (CAGE, DNase, ChIP, etc.) at 128-bp resolution, covering ~114 kb per prediction window. Enformer uses attention pooling, custom relative positional basis functions to encode distance and directionality, and multi-head attention that lets each position aggregate information from any other position in the input. These design choices (convolutional embedding + attention + relative positional encodings) explicitly target enhancer–promoter and insulator interactions that require distal context.

Capabilities and validated performance:
Enformer substantially improves expression and regulatory-track prediction over prior state-of-the-art models (e.g., Basenji2 and ExPecto). Reported gains include higher Pearson/Spearman correlations for CAGE-based expression across genes and tissues and stronger tissue specificity.
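The input/output geometry described above can be sanity-checked with simple arithmetic. This sketch assumes the ~114 kb prediction window corresponds to a centered crop of 896 bins of 128 bp each; the variable names are ours, the numbers come from the model description:

```python
# Enformer's published input/output geometry (variable names are illustrative).
SEQ_LEN = 196_608        # one-hot input length in bp
BIN_SIZE = 128           # resolution of each output bin
N_TRACKS_HUMAN = 5_313   # human output tracks (CAGE, DNase, ChIP, ...)

# Bins spanned by the full input, before the cropping step.
total_bins = SEQ_LEN // BIN_SIZE          # 1536 bins

# The output heads predict on a centered crop of 896 bins,
# i.e. 896 * 128 = 114,688 bp (~114 kb) per prediction window.
TARGET_BINS = 896
target_span_bp = TARGET_BINS * BIN_SIZE   # 114,688 bp

print(total_bins, target_span_bp)
```

So each forward pass consumes ~196 kb of sequence but emits predictions only for the central ~114 kb, leaving ~41 kb of flanking context on each side.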
Importantly for genetics, Enformer improves noncoding-variant effect predictions: it increases concordance with eQTL summary statistics (measured by SLDP regression), boosts accuracy in classifying fine-mapped causal eQTLs across GTEx tissues, and outperforms competing methods on saturation mutagenesis MPRA benchmarks (CAGI5). The model also produces contribution scores and attention patterns that highlight biologically meaningful elements, such as cell-type-specific enhancers, insulator/TAD boundaries (CTCF-associated), and promoter motifs, making it useful for enhancer prioritization and mechanistic interpretation. Unlike many annotation methods, Enformer provides signed predictions (activating or repressive) and does not rely on sequence conservation, enabling predictions for nonconserved regulatory sequences and synthetic designs.

Typical use cases and integrations:
Enformer is used to:
(1) score the regulatory impact of genetic variants (reference vs. alternative allele) across many cell types and assays;
(2) prioritize candidate enhancers for a given gene using contribution scores;
(3) assist fine-mapping and interpretation of GWAS/eQTL loci by supplying variant-level functional annotations; and
(4) guide the design of synthetic regulatory elements.
The model’s outputs have been used as features in downstream classifiers (random forests, lasso regression) and genome-wide concordance tests (SLDP), and can be combined with experimental contact-data approaches (Hi-C, Activity-by-Contact) to refine enhancer–promoter links. The Enformer team provides a pretrained model and precomputed effect predictions for frequent variants (1000 Genomes), along with code examples for applying the model to variant sets and genomic intervals. Because Enformer predicts many epigenomic readouts, its feature vectors can be fed into custom predictors or fine-tuning workflows for specific tasks.
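The reference-vs-alternative scoring idea in use case (1) reduces to: encode both alleles, run the model on each, and take the signed difference per track. The sketch below is a minimal illustration with a toy stand-in model; `one_hot`, `variant_effect`, and `toy_model` are our own hypothetical names, and with the real Enformer the `model` callable would be the published pretrained model with L = 196,608:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """One-hot encode a DNA string to shape (len(seq), 4)."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4, dtype=np.float32)[idx]

def variant_effect(model, seq: str, pos: int, alt: str) -> np.ndarray:
    """Signed variant effect: prediction(alt allele) - prediction(ref allele).

    `model` is any callable mapping a (1, L, 4) one-hot array to
    per-track predictions; the sign indicates activating vs. repressive.
    """
    ref_enc = one_hot(seq)[None]                  # (1, L, 4)
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    alt_enc = one_hot(alt_seq)[None]
    return model(alt_enc) - model(ref_enc)        # effect per output track

# Toy stand-in for the real network: a linear map onto 3 pretend tracks.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
toy_model = lambda x: x.sum(axis=1) @ W           # (1, 3)

delta = variant_effect(toy_model, "ACGTACGT", pos=3, alt="A")
print(delta.shape)  # (1, 3)
```

With the real model the resulting per-track delta vector is what gets fed into the downstream classifiers and concordance tests mentioned above.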
Practical considerations and limitations:
Enformer’s long inputs and attention layers make it computationally demanding: practical use typically requires GPUs and careful batching, and because self-attention cost grows quadratically with sequence length, hardware and memory limits must be considered. The model is trained on the genomic assays and cell types present in its training set, so predictions are most reliable for similar tissues and assays; generalization to entirely novel cell types or modalities may require additional training or transfer learning. Enformer is presented and distributed for research use; predictions should be treated as computational annotations for hypothesis generation and prioritization, not as clinical diagnoses. Where available, combining Enformer predictions with orthogonal experimental data (e.g., Hi-C, H3K27ac, CRISPR perturbation results) increases confidence in inferred regulatory mechanisms.
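One concrete consequence of the batching concern: because each forward pass predicts only a centered ~114 kb of its ~196 kb input, tiling a larger region requires stepping windows by the predicted span, not the input length. A minimal sketch, assuming a 114,688-bp predicted crop; the `windows` helper and step choice are illustrative, not part of the Enformer release:

```python
# Tile a genomic region into Enformer-sized input windows whose
# predicted centers cover the region without gaps.
SEQ_LEN = 196_608      # model input length (bp)
TARGET_SPAN = 114_688  # centered span actually predicted (~114 kb)

def windows(start: int, end: int, step: int = TARGET_SPAN):
    """Yield (input_start, input_end) intervals; each input is SEQ_LEN bp,
    and consecutive predicted crops abut when step == TARGET_SPAN."""
    flank = (SEQ_LEN - TARGET_SPAN) // 2   # unpredicted context per side
    center = start
    while center < end:
        yield center - flank, center - flank + SEQ_LEN
        center += step

# Cover a 300 kb region starting at position 1,000,000.
batch = list(windows(1_000_000, 1_300_000))
print(len(batch))  # 3 windows
```

Batches of such intervals can then be run through the model a few at a time, sized to fit GPU memory.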