BCFtools

Background
BCFtools is a mature, open-source command-line toolkit for working with variant call data in VCF and its binary counterpart, BCF. Originating from the samtools/HTSlib ecosystem, BCFtools provides a broad collection of utilities, ranging from low-level format conversion and indexing to full variant calling, consequence prediction and copy-number analysis. The tools are designed to operate efficiently in UNIX pipelines (stdin/stdout), auto-detect VCF, BCF and bgzip-compressed inputs, and handle both compressed and indexed files to support scalable workflows on local machines, HPC and cloud environments.

Core capabilities
BCFtools covers the full life cycle of short-variant analysis. It provides the mpileup and call commands for genotype-likelihood computation and variant calling (including gVCF support), view and filter for region- or tag-based subsetting, norm for normalizing variants and splitting multiallelic sites, and utilities to merge, concatenate and ligate VCF/BCF files. The suite includes annotate, which adds annotations from tabix-indexed files, and plugins invoked with a + prefix (for example +fill-tags, which computes tags such as allele frequency). It also includes a haplotype-aware consequence predictor (csq) that uses an Ensembl GFF3 file and a reference FASTA to add functional consequence annotations. There are also specialized tools: a copy-number variation caller (cnv) that uses B-allele frequency (BAF) and log R ratio (LRR), consensus for applying VCF variants to a reference sequence, and a variety of converters to and from common formats (IMPUTE/SHAPEIT, 23andMe, .gen/.hap, etc.).

Filtering, QC and benchmarking
BCFtools offers a flexible expression language and many INFO/FORMAT metrics to drive both pre- and post-call filtering. Common workflows use QUAL, DP and per-site metrics such as MQBZ, RPBZ, SCBZ, IDV and IMF to construct tailored inclusion/exclusion rules; the documentation contains example filters tuned for SNPs and indels and shows how to combine metrics by depth.
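As a rough illustration of such rules, the sketch below wraps two filtering steps in shell functions. The function names, the filter labels (LowQualSNP, LowQualIndel) and every threshold are assumptions made for this example, not values taken from the BCFtools documentation; the tags are assumed to be present as INFO fields written by bcftools mpileup.

```shell
# Illustrative exclusion expressions. The thresholds are placeholders chosen
# for this sketch, not recommended values; tune them against your own data.
SNP_FAIL='TYPE="snp" && (QUAL<20 || INFO/DP>200 || MQBZ<-3 || RPBZ<-3 || RPBZ>3 || SCBZ>3)'
INDEL_FAIL='TYPE="indel" && (QUAL<20 || INFO/DP>200 || IDV<2 || IMF<0.1)'

# soft_filter IN OUT
# Label failing records in the FILTER column but keep them in the output.
# -Ou streams uncompressed BCF between the two bcftools processes.
soft_filter() {
    bcftools filter -s LowQualSNP -e "$SNP_FAIL" -Ou "$1" |
        bcftools filter -s LowQualIndel -e "$INDEL_FAIL" -Oz -o "$2"
}

# hard_filter IN OUT
# Keep only records whose FILTER column is PASS or unset (.).
hard_filter() {
    bcftools view -f PASS,. -Oz -o "$2" "$1"
}

# Example (hypothetical file names):
#   soft_filter calls.bcf calls.flagged.vcf.gz
#   hard_filter calls.flagged.vcf.gz calls.pass.vcf.gz
```

Soft filtering first keeps the failing records available for inspection, which helps when thresholds are still being tuned; the hard filter can then be applied once the labels look sensible.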
For benchmarking and concordance analysis, BCFtools provides the isec and query commands to compare callsets against truth sets, separate true and false positives, and generate distributions of QUAL or other statistics to optimize filters. The documentation also describes bcftools norm (-m), bgzip/tabix and index usage in detail, so that callsets are normalized consistently before they are compared.

Typical use cases and examples
- Variant calling pipeline: stream mpileup into call for efficient single-machine or cluster workflows (example: bcftools mpileup -Ou -f ref.fa samples.bam | bcftools call …); use -Ou (uncompressed BCF) when piping between bcftools subcommands to avoid compression and format-conversion overhead.
- Annotation: augment raw calls with population allele frequencies or custom tabix-indexed annotation tracks (create AFs.tab.gz, index it with tabix, then run bcftools annotate -a AFs.tab.gz -c CHROM,POS,REF,ALT,REF_AN,REF_AC …) to incorporate prior allele counts into calling or downstream filtering.
- Filtering and QC: apply post-call filters using bcftools view/filter expressions (for example, exclude sites with low QUAL or extreme depth), then use bcftools stats and the plotting helper scripts to inspect callset quality.
- Truth-set benchmarking: normalize calls with bcftools norm -m -both -f $ref, then run bcftools isec to separate false positives and false negatives and build histograms of QUAL and other metrics to guide filter selection.
- Format conversion and interoperability: convert consumer genotype files (e.g., 23andMe) into VCF, generate gVCF blocks, or export to hap/sample formats for imputation tools. Create consensus FASTA sequences by applying VCF variants with bcftools consensus for downstream assembly or visualization.

Integrations, deployment and reproducibility
BCFtools is implemented in C on top of HTSlib and is tightly integrated with samtools, bgzip/tabix and standard genomics reference files (FASTA with a .fai index, tabix indexes, GFF3).
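A minimal sketch of how these companion tools fit together is shown below: it prepares the index files that BCFtools relies on and then uses them for a region query. The function names and file names are hypothetical and exist only for this example.

```shell
# prepare_reference REF.fa
# Create the REF.fa.fai index used by commands that take a reference (-f).
prepare_reference() {
    samtools faidx "$1"
}

# compress_and_index IN.vcf
# bgzip-compress a plain-text VCF to IN.vcf.gz and index the result.
# -t asks for a tabix (.tbi) index; without it bcftools index writes CSI.
compress_and_index() {
    bgzip -c "$1" > "$1.gz" &&
        bcftools index -t "$1.gz"
}

# fetch_region IN.vcf.gz REGION
# Random access to a region such as chr1:1000-2000; requires the index.
fetch_region() {
    bcftools view -r "$2" "$1"
}

# Example (hypothetical file names):
#   prepare_reference ref.fa
#   compress_and_index calls.vcf
#   fetch_region calls.vcf.gz chr1:1000-2000
```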
The project is actively developed on GitHub and distributed under a permissive license; binaries and source are available alongside documentation and how-tos. For reproducible deployments, it is packaged on Bioconda and provided as community BioContainers/Docker images, so pipelines can run in consistent environments on local workstations, HPC clusters or cloud platforms.

Performance tips in the docs include using indexed VCF/BCF files for random access, streaming with -Ou between bcftools subcommands to avoid unnecessary compression, and using multiple worker threads (--threads) for bgzip compression where supported.

Whether you are building an end-to-end WGS/WES workflow, doing targeted variant QC and filtering, annotating consequences with transcript-aware logic, or converting genotype formats for downstream tools, BCFtools provides a compact, scriptable and well-documented toolset that integrates into established sequencing analysis pipelines.
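The calling and truth-set benchmarking use cases described earlier can be sketched as a handful of shell functions. The function names, argument order and file names are assumptions made for this example, and the call options shown (-mv) are one common choice, not the only one.

```shell
# call_variants REF.fa IN.bam OUT.vcf.gz [THREADS]
# Stream uncompressed BCF (-Ou) from mpileup into call; extra threads are
# used only for compressing the final output.
call_variants() {
    bcftools mpileup -Ou -f "$1" "$2" |
        bcftools call -mv -Oz --threads "${4:-0}" -o "$3"
}

# normalize REF.fa IN OUT.bcf
# Split multiallelic sites, left-align indels against the reference, and
# index the result so that isec can read it.
normalize() {
    bcftools norm -m -both -f "$1" -Ob -o "$3" "$2" &&
        bcftools index "$3"
}

# compare_to_truth CALLS.bcf TRUTH.bcf OUTDIR
# isec -p writes OUTDIR/0000.vcf (records private to CALLS, i.e. candidate
# false positives), 0001.vcf (private to TRUTH, candidate false negatives),
# and 0002.vcf / 0003.vcf (records shared by both inputs).
compare_to_truth() {
    bcftools isec -p "$3" "$1" "$2"
}

# qual_values FILE
# Print one QUAL value per line, ready for building a histogram.
qual_values() {
    bcftools query -f '%QUAL\n' "$1"
}

# Example (hypothetical file names):
#   call_variants ref.fa sample.bam calls.vcf.gz 4
#   normalize ref.fa calls.vcf.gz calls.norm.bcf
#   normalize ref.fa truth.vcf.gz truth.norm.bcf
#   compare_to_truth calls.norm.bcf truth.norm.bcf isec_out
#   qual_values isec_out/0000.vcf | sort -n | uniq -c
```

Normalizing both the callset and the truth set with the same reference before running isec matters because the comparison is by position and alleles, so differently represented indels would otherwise be counted as mismatches.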