AI/ML Models, Molecular Biology

BPNet

BPNet is a toolkit for training and interpreting base-resolution convolutional neural networks on functional genomics profiles. Designed to answer core regulatory genomics questions (what sequence motifs drive signal, where they occur in the genome, and how they interact), BPNet trains models that predict both read-profile shapes and total counts at single-base resolution. The approach and biological applications are described in the BPNet manuscript “Deep learning at base-resolution reveals motif syntax of the cis-regulatory code.” BPNet is provided as a Python package with a command-line interface and an accompanying Colab notebook that demonstrates an end-to-end example workflow.

Core capabilities include model training, model interpretation via contribution scores, motif discovery with TF-MoDISco, motif instance detection with CWM scanning, and export of predictions and contribution scores for visualization. The package exposes a Keras model container (bpnet.seqmodel.SeqModel) and a higher-level wrapper (bpnet.BPNet.BPNetSeqModel) that consolidate profile and total count outputs per task. BPNet computes per-base contribution scores (for example via DeepLIFT), stores those scores in HDF5 files, and can export model predictions and contribution maps to BigWig tracks for genome-browser visualization. It also provides utilities to compute dataset statistics (to guide hyperparameter choices such as the profile vs total-count loss weight), simulate spacing between motifs, and generate reports tailored to ChIP-nexus or ChIP-seq data.

BPNet is packaged with several ready-made configurations and example commands to speed adoption. Typical CLI steps are: prepare a dataspec.yml describing BigWig tracks and regions, train a model (for example: bpnet train --premade=bpnet9 dataspec.yml --override='seq_width=200;n_dil_layers=6'), compute contribution scores (bpnet contrib . --method=deeplift contrib.scores.h5), run TF-MoDISco on the stored contributions to discover motif patterns (bpnet modisco-run contrib.scores.h5 --premade=modisco-50k --override='TfModiscoWorkflow.max_seqlets_per_metacluster=20000' modisco/), and detect individual motif instances with CWM scanning (bpnet cwm-scan modisco/ --contrib-file=contrib.scores.h5 modisco/motif-instances.tsv.gz). Outputs produced during these steps include HDF5 files for contributions and MoDISco results, motif PFMs/CWMs and seqlets, motif-instances TSV files, and BigWig tracks of model predictions and contribution scores.

BPNet integrates with several community tools and file formats used in genomics. It exports and reads BigWig files (pyBigWig), works with BED intervals and pybedtools/bedtools, and uses HDF5 stores for contribution and TF-MoDISco artifacts. The TF-MoDISco workflow is supported via provided premade configurations. Internally the package is implemented with Keras (a Keras model container is provided) and requires TensorFlow (the documentation references installing tensorflow~=1.0 or tensorflow-gpu if using a GPU).

Recommended runtime setup is a dedicated conda environment with dependencies such as pybedtools, bedtools, pybigwig, pysam and genomelake. The docs also recommend disabling HDF5 file locking to avoid Keras/HDF5 issues and optionally installing vmtouch to preload BigWig files into memory to speed multi-process data loading.

Common use cases for BPNet include de novo discovery of transcription factor motifs from high-resolution ChIP experiments, mapping motif instances across genomic regions, analyzing motif syntax (relative spacing and interactions between motifs), and producing interpretable genome-wide prediction tracks for visualization or downstream analysis.

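The CLI steps described above can be chained from a small Python driver. The following is a minimal sketch, not part of BPNet itself: the helper names (build_pipeline, run_pipeline) are hypothetical, the arguments are copied from the example commands quoted above, and HDF5_USE_FILE_LOCKING is the standard HDF5 environment variable, assumed here to be the setting the docs mean by disabling file locking. It defaults to a dry run that only renders the commands.

```python
import os
import shlex
import subprocess

# File names mirror the example commands quoted in the text.
CONTRIB = "contrib.scores.h5"
MODISCO_DIR = "modisco/"


def build_pipeline(dataspec="dataspec.yml", model_dir="."):
    """Return the four CLI steps as argument lists, in execution order.

    Hypothetical helper. The train command is reproduced as quoted in the
    text; consult the BPNet documentation for its full argument list.
    """
    return [
        ["bpnet", "train", "--premade=bpnet9", dataspec,
         "--override=seq_width=200;n_dil_layers=6"],
        ["bpnet", "contrib", model_dir, "--method=deeplift", CONTRIB],
        ["bpnet", "modisco-run", CONTRIB, "--premade=modisco-50k",
         "--override=TfModiscoWorkflow.max_seqlets_per_metacluster=20000",
         MODISCO_DIR],
        ["bpnet", "cwm-scan", MODISCO_DIR, f"--contrib-file={CONTRIB}",
         MODISCO_DIR + "motif-instances.tsv.gz"],
    ]


def run_pipeline(commands, dry_run=True):
    """Render each command as a shell line; execute it unless dry_run is set."""
    # Assumption: file locking is disabled through the HDF5 environment variable.
    env = dict(os.environ, HDF5_USE_FILE_LOCKING="FALSE")
    rendered = []
    for cmd in commands:
        rendered.append(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True, env=env)
    return rendered


if __name__ == "__main__":
    for line in run_pipeline(build_pipeline(), dry_run=True):
        print(line)
```

Passing argument lists rather than a single shell string avoids quoting problems with the semicolon inside the --override value; setting dry_run=False would execute the steps in order, assuming the bpnet CLI is installed in the active environment.
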
Because BPNet models predict base-resolution profiles rather than only aggregated signals, they are especially useful for assays that capture fine positional structure (ChIP‑nexus, ChIP‑exo, or high-quality ChIP‑seq). The project exposes both CLI tools for reproducible pipelines and Python APIs for programmatic access, making it suitable for exploratory notebooks, batch training on clusters, and integration into downstream motif analysis pipelines.
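
To make the profile vs total-count loss weight concrete, here is a minimal sketch of a two-part loss for a single region and task. It assumes the formulation described in the BPNet manuscript (a multinomial negative log-likelihood for the profile shape plus a squared error on log total counts); the function names and the count_weight parameter are hypothetical and do not correspond to BPNet's actual implementation.

```python
import math


def softmax(logits):
    """Convert per-base profile logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multinomial_nll(observed, probs):
    """Negative log-likelihood of per-base read counts under a multinomial."""
    n = sum(observed)
    # Log of the multinomial coefficient n! / (k_1! * ... * k_L!).
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in observed)
    log_lik = log_coef + sum(
        k * math.log(p) for k, p in zip(observed, probs) if k > 0
    )
    return -log_lik


def two_part_loss(observed, profile_logits, pred_log_count, count_weight):
    """Profile-shape term plus a weighted total-count term (illustrative only)."""
    profile_term = multinomial_nll(observed, softmax(profile_logits))
    # Squared error between observed and predicted log(1 + total counts).
    count_term = (math.log1p(sum(observed)) - pred_log_count) ** 2
    return profile_term + count_weight * count_term


if __name__ == "__main__":
    reads = [0, 3, 1, 0]  # toy per-base read counts
    loss = two_part_loss(reads, [0.0, 0.0, 0.0, 0.0], math.log1p(4), 10.0)
    print(f"loss with a uniform profile prediction: {loss:.4f}")
```

A larger count_weight pushes training toward matching total signal, while a smaller one emphasizes profile shape; the dataset statistics mentioned earlier are intended to guide that choice.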