AI/ML Models, Molecular Biology

Basenji2

Basenji is an open-source research toolkit for modeling regulatory genomics with deep convolutional neural networks. It was developed to make quantitative predictions of regulatory signal across very long, chromosome-scale DNA sequences rather than performing binary peak calls. Unlike earlier models that framed the problem as classification, Basenji predicts continuous signal in genomic bins using regression loss functions, enabling finer-grained estimates of activity across cell types and assays.

The codebase is written in Python 3 and built on TensorFlow. It generalizes and extends concepts from predecessor tools (for example, Basset) while remaining flexible enough to replicate those older models when desired.

Core capabilities center on three linked tasks: learning to predict regulatory activity across long sequences, scoring genetic variants for their predicted regulatory impact, and mapping the regulatory elements and specific nucleotides that drive signal. Basenji makes predictions in contiguous bins across the input sequences, producing quantitative profiles that can represent ChIP-seq, DNase/ATAC, CAGE, or other genomic signal tracks. From trained models you can compute variant effect scores such as SNP Activity Difference (SAD) and SNP Expression Difference (SED), run in silico saturated mutagenesis to measure the contribution of individual nucleotides, and identify distal elements that influence a gene's predicted activity.

The repository includes preprocessing and training utilities plus tutorial notebooks to guide common workflows: preprocessing new datasets for training, running train/test cycles, executing in silico saturated mutagenesis, and computing SAD/SED scores.

Installation and dependency management are standard for scientific Python projects: environment.yml is provided for conda users and requirements.txt for pip users, with optional fully prespecified environments for reproducibility.
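
To make the idea of a SAD-style variant score more concrete, here is a minimal sketch in plain Python. It assumes the score is taken as the summed difference between alternate-allele and reference-allele predictions across bins. The `toy_predict` function is a hypothetical stand-in for a trained model (it simply counts G/C bases per bin), and neither it nor `sad_score` is part of Basenji's actual API; the repository's own scripts and tutorials remain the reference for real use.

```python
# Toy illustration of a SAD-style score (summed alt-minus-ref prediction).
# NOTE: toy_predict and sad_score are hypothetical names for illustration only;
# they are not Basenji functions.

def toy_predict(seq, bin_size=4):
    """Return one 'signal' value per contiguous bin (here: the G/C count)."""
    return [
        float(sum(base in "GC" for base in seq[start:start + bin_size]))
        for start in range(0, len(seq), bin_size)
    ]


def sad_score(seq, pos, alt, predict=toy_predict):
    """Sum over bins of (alt prediction - ref prediction) for a substitution."""
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single base in ACGT")
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    ref_pred = predict(seq)
    alt_pred = predict(alt_seq)
    return sum(a - r for a, r in zip(alt_pred, ref_pred))


ref = "ACGTACGTAAAATTTT"
ref_profile = toy_predict(ref)           # binned profile for the reference
score = sad_score(ref, pos=8, alt="G")   # A -> G substitution in the third bin
print(ref_profile, score)
```

With a real model, the predictor would return a profile per target track, so the same subtraction would yield one score per variant per track rather than a single number.
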
Basenji is compatible with TensorFlow 2 (and should also work with 1.15) and can be run with tensorflow or tensorflow-gpu to leverage GPUs. Because it uses the TensorFlow ecosystem, Basenji can exploit distributed computing and other TensorFlow tools when training very large models on long sequences.

Typical use cases where Basenji adds value include:
variant prioritization and interpretation in noncoding regions (for example, scoring candidate GWAS SNPs by predicted regulatory effect);
discovering distal regulatory architecture for genes of interest;
interpreting the nucleotide basis of regulatory elements via saturation mutagenesis maps;
generating quantitative predictions to compare across assays or cell types.

Researchers can train models on custom signal tracks or reproduce models and analyses from manuscripts; the codebase ships with links to models and data used in prior publications in a manuscripts directory. Basenji also sits alongside related projects in the same code family: Akita (for 3D genome folding prediction and variant scoring on contact maps) and Saluki (for mRNA half-life prediction using hybrid convolutional/recurrent nets). These let investigators apply complementary architectures to related functional genomics problems.

Practical notes: the authors describe Basenji as an actively developed, research-oriented codebase that sits somewhere between personal research code and polished, widely used software. Training models on chromosome-scale inputs can be computationally demanding, so access to GPUs and enough memory and disk for preprocessing and training is important. The project encourages users to consult the provided tutorials, open issues for missing features, and contact maintainers with questions. Models, tutorial notebooks, and recipes for installing dependencies are included in the repository so users can reproduce published analyses or adapt Basenji to new datasets and experimental designs.
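
The saturation mutagenesis maps mentioned among the use cases can be illustrated with a small, self-contained sketch. It substitutes every position with every alternative base and records the change in a scalar prediction. The `toy_activity` scorer (which just counts occurrences of the motif `GATA`) and the `saturation_mutagenesis` helper are hypothetical names chosen for illustration; they are assumptions, not Basenji's API, and a real analysis would use a trained model and the repository's tutorials.

```python
# Toy in silico saturation mutagenesis: mutate each position to each base and
# record the change in a scalar prediction relative to the reference.
# NOTE: toy_activity is a hypothetical stand-in for a trained model.

BASES = "ACGT"


def toy_activity(seq):
    """Hypothetical 'activity': number of (overlapping) GATA occurrences."""
    return float(sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3)))


def saturation_mutagenesis(seq, predict=toy_activity):
    """Return, per position, a dict mapping each base to (mutant - reference)."""
    ref_value = predict(seq)
    scores = []
    for pos, ref_base in enumerate(seq):
        row = {}
        for base in BASES:
            if base == ref_base:
                row[base] = 0.0  # the reference base changes nothing
            else:
                mutant = seq[:pos] + base + seq[pos + 1:]
                row[base] = predict(mutant) - ref_value
        scores.append(row)
    return scores


seq = "CCGATACC"
scores = saturation_mutagenesis(seq)
# The most disruptive substitution at each position highlights the motif.
min_delta = [min(row.values()) for row in scores]
print(min_delta)
```

In this toy case only the four motif positions show a negative change, which is the kind of pattern a saturation mutagenesis map is meant to reveal: the nucleotides the prediction depends on.
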