ENA (EMBL-EBI)

Background: The European Nucleotide Archive (ENA) is EMBL-EBI’s primary repository for nucleotide sequencing data and a core node in the global sequence data ecosystem. ENA collects, preserves and distributes a comprehensive record of the world’s nucleotide sequences together with associated metadata: raw sequencing reads, sequence assemblies, annotations and links to biological and bibliographic context. As an ELIXIR Core Data Resource and a Global Core Biodata Resource, ENA underpins research across genomics, biodiversity, epidemiology and molecular biology by making high-quality sequence data openly available and interoperable.

Capabilities and access: ENA supports interactive and large-scale workflows through multiple access modes. Users can explore records via a web browser and free-text search tools, retrieve individual INSDC accession records, or perform complex queries against metadata. For programmatic and automated workflows, ENA provides scalable bulk-download services and an API that enables retrieval of raw reads, assemblies and metadata in machine-readable formats. Data is curated to ensure consistency of identifiers and metadata fields, and preserved with long-term service continuity measures so data remains findable and usable over time.

Data Hubs and pathogen surveillance: ENA’s Data Hubs Portal lets groups create and manage pre-release or public data collections, which is particularly useful for collaborative projects and outbreak response. A notable application has been the SARS-CoV-2 Data Hubs: an integrated set of submission, analysis, presentation and visualisation tools that streamline the ingestion of raw read data, run standard analyses, and expose results for downstream use. Data Hubs make it easier for consortia, public-health labs and surveillance networks to share sequence data while retaining control over access and release policies until publication or public release.

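As an illustration of the programmatic access described under Capabilities and access, the following is a minimal sketch of how a file report for a set of sequencing runs might be requested and parsed. It assumes the ENA Portal API is served at the base URL shown and accepts the `accession`, `result`, `fields` and `format` parameters; these details, and the accession used in the usage note, are assumptions for illustration and should be checked against the current ENA documentation.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the ENA Portal API; verify against current ENA docs.
PORTAL_API = "https://www.ebi.ac.uk/ena/portal/api"


def build_filereport_url(accession,
                         fields=("run_accession", "fastq_ftp", "fastq_bytes")):
    """Build a file-report query URL for a study, sample or run accession."""
    params = {
        "accession": accession,
        "result": "read_run",        # assumed result type for raw read runs
        "fields": ",".join(fields),  # columns requested in the report
        "format": "tsv",
    }
    return f"{PORTAL_API}/filereport?{urlencode(params)}"


def parse_report(tsv_text):
    """Parse a tab-separated report into a list of dicts, one per run."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def fetch_report(accession):
    """Download and parse the report (requires network access)."""
    with urlopen(build_filereport_url(accession)) as response:
        return parse_report(response.read().decode("utf-8"))
```

Calling `fetch_report("PRJEB00000")`, where `PRJEB00000` is a placeholder rather than a real study accession, would return one dictionary per run. Keeping URL construction and parsing separate from the network call makes the sketch easy to test offline.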
Use cases and integrations: Researchers reusing public reads for comparative genomics, metagenomics and method development can download raw datasets at scale for reanalysis; genome curators and annotation teams can submit assemblies and link them to functional annotation and literature; public-health agencies can centralise pathogen sequencing workstreams via Data Hubs for rapid situational awareness. ENA integrates into the broader EMBL-EBI ecosystem and global resources: sequence records are cross-referenced through the international INSDC accession system, and metadata practices support interoperability with other EMBL-EBI data resources and external databases, improving discoverability and downstream annotation. For researchers new to the system, EMBL-EBI offers training materials and documentation covering submission, retrieval and best practices for metadata.

Practical considerations: Data in ENA is freely available subject to the data owner’s release policy and international data-sharing agreements, with an institutional emphasis on open, machine-readable metadata where possible. Because ENA stores very large volumes of raw and assembled sequence data, users should plan for bandwidth and storage needs before starting large downloads and should consider programmatic access for high-throughput workflows. For collaborative projects or outbreak responses, Data Hubs provide a managed route to pre-publication data sharing and integrated analysis. ENA’s long-term preservation and curation practices mean submitted data can support reproducible research, diagnostic development and large-scale meta-analyses for years to come.
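The bandwidth and storage planning mentioned under Practical considerations can be sketched as a quick pre-download check. The example below assumes each run record carries a `fastq_bytes` field holding one or more file sizes separated by semicolons (one value per file of a paired-end run); the field name, the 20% headroom and the helper names are illustrative assumptions, not part of any ENA tooling.

```python
import shutil


def total_download_bytes(rows, size_field="fastq_bytes"):
    """Sum file sizes over all runs; empty or missing values count as zero."""
    total = 0
    for row in rows:
        for part in (row.get(size_field) or "").split(";"):
            if part.strip():
                total += int(part)
    return total


def fits_on_disk(required_bytes, path=".", headroom=1.2):
    """Check that free space at `path` covers the download plus headroom."""
    free = shutil.disk_usage(path).free
    return required_bytes * headroom <= free


def transfer_hours(required_bytes, megabits_per_second):
    """Estimate transfer time in hours at a sustained link speed."""
    seconds = required_bytes * 8 / (megabits_per_second * 1_000_000)
    return seconds / 3600


# Hypothetical records, shaped like rows of a parsed file report.
runs = [
    {"run_accession": "RUN1", "fastq_bytes": "100;200"},
    {"run_accession": "RUN2", "fastq_bytes": "50"},
]
needed = total_download_bytes(runs)
print(needed, fits_on_disk(needed))
```

By the same arithmetic, a 450 GB dataset over a sustained 1 Gbit/s link takes about one hour, so it is worth running this kind of estimate before committing to a multi-terabyte retrieval.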

Links