anndata

Background

anndata provides a lightweight, expressive container for annotated data matrices that sits conceptually between pandas and xarray. It was originally developed for single-cell analysis workflows (Scanpy) and is now a core component of the scverse ecosystem. The package defines the AnnData object, a standardized in-memory representation that couples a primary data matrix with per-observation and per-variable annotations and arbitrary unstructured metadata. This representation supports reproducible, shareable downstream analyses and data exchange across tools.

Core capabilities

AnnData focuses on computational efficiency and interoperability. The core object stores a main data matrix (dense or sparse) together with obs (row/cell annotations), var (column/feature annotations), uns (unstructured metadata) and layers (alternative matrices). It supports sparse arrays natively, lazy operations for deferred computation, and efficient concatenation and subsetting of many datasets. AnnData can operate both in memory and on disk: Dask array support and integration with Zarr enable out-of-core computation and chunked storage, and the documentation includes patterns for lazily accessing remotely stored data. There are dedicated helpers for lazily concatenating multiple AnnData objects and for using AnnData collections with machine learning workflows.

Typical use cases

AnnData is widely used for single-cell omics: storing count matrices, metadata, embeddings, clustering labels and analysis results in a single object that can be passed between preprocessing, visualization and differential expression tools. For large cohorts or cloud datasets, the combination of Dask and Zarr backends lets users scale analyses without requiring that an entire matrix fit into RAM. The PyTorch interface and example workflows in the docs make it straightforward to feed AnnData-backed datasets into deep learning models for tasks such as cell type classification, data integration, or latent-space modeling.
Multimodal frameworks built on top of AnnData (for example muon) use the same conventions to represent and analyze multi-omics experiments.

Integrations, distribution and ecosystem

AnnData integrates tightly with Scanpy and other scverse projects, and is intended as a lingua franca for single-cell toolchains. Installation is simple (pip install anndata or conda install -c conda-forge anndata) and the public API is documented online; the codebase is maintained on GitHub, where development and community discussion take place. Several large initiatives and institutes distribute datasets via AnnData representations, and the project is fiscally sponsored by NumFOCUS as part of the scverse umbrella. The maintainers caution that internal APIs (underscored modules) are not guaranteed to be stable and encourage users to rely on the documented public API or to open issues if functionality is missing.

How to get started and resources

To begin, install the package and load an AnnData object from common file formats or create one from numpy/scipy arrays and pandas DataFrames. Explore annotations via adata.obs, adata.var and adata.uns; store alternative matrices in adata.layers. If you work with very large or cloud-hosted matrices, consult the Dask/Zarr examples and the sections on lazily concatenating and lazily accessing remotely stored data to avoid loading everything into memory. The project provides examples for interfacing PyTorch models with AnnData and collections of AnnData objects to support machine learning pipelines. For full details, refer to the official API docs and the GitHub repository for issue tracking, contribution guidelines and the latest release notes.

Links