BUSCO

Background and purpose

BUSCO (Benchmarking Universal Single-Copy Orthologs) measures the completeness of genomic data by searching for sets of near-universal, single-copy ortholog marker genes derived from OrthoDB. Unlike purely technical metrics such as N50, BUSCO reports biological completeness: the percentage of expected conserved genes found as complete (single or duplicated), fragmented, or missing. The tool and its lineage datasets (notably the odb12 release) are widely used for assembly QC, annotation benchmarking and extracting phylogenomic markers.

Core capabilities and pipelines

BUSCO runs from the command line and supports three primary modes: genome (nucleotide assemblies), transcriptome (assembled transcripts) and proteins (predicted protein sets). It ships with a large collection of lineage datasets covering bacteria, archaea, eukaryotes and viruses; datasets are lineage-specific and chosen to match the taxon being assessed. BUSCO classifies each marker as Complete-Single, Complete-Duplicated, Fragmented or Missing, and produces tabular summaries, per-marker FASTA/GFF outputs and HMMER/BLAST intermediate files so you can inspect individual cases.

There are multiple pipelines and gene-prediction options depending on data type and domain. For eukaryotic genome assessments BUSCO now defaults to Miniprot (a fast protein-to-genome mapper) but can use MetaEuk or Augustus; MetaEuk is optimized for eukaryotic metagenomes and is faster/more memory efficient than Augustus, while Augustus offers self-training gene prediction useful for downstream annotation. For prokaryotic genomes BUSCO uses Prodigal by default; HMMER is used to score candidate genes against BUSCO profiles. BUSCO also provides an auto-lineage pipeline that runs generic domain datasets, places the input on a phylogenetic tree (using SEPP), and selects the most appropriate specific lineage dataset automatically.
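
The four-way marker classification described earlier in this section can be turned into summary percentages with a few lines of Python. The sketch below is illustrative only: it assumes a tab-separated per-marker table whose first two columns are a marker ID and a status (Complete, Duplicated, Fragmented or Missing), with "#" comment lines, and that a duplicated marker appears once per copy. The marker IDs, contig names and function names are hypothetical; check the column layout against the files your BUSCO version actually writes.

```python
from collections import Counter

# Hypothetical sample in the ASSUMED layout: '#' comment lines, then
# tab-separated rows starting with marker ID and status.
SAMPLE_TABLE = """\
# Busco id\tStatus\tSequence
1001at2\tComplete\tcontig_1
1002at2\tDuplicated\tcontig_2
1002at2\tDuplicated\tcontig_7
1003at2\tFragmented\tcontig_3
1004at2\tMissing
"""


def tally_statuses(table_text):
    """Count markers per status, counting each marker ID only once."""
    status_by_marker = {}
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        marker_id, status = fields[0], fields[1]
        # Duplicated markers repeat (one row per copy); keep a single entry.
        status_by_marker[marker_id] = status
    return Counter(status_by_marker.values())


def summarize(counts):
    """Convert status counts into percentages of all markers searched."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no markers found in table")

    def pct(n):
        return round(100.0 * n / total, 1)

    single = counts.get("Complete", 0)
    duplicated = counts.get("Duplicated", 0)
    return {
        "complete": pct(single + duplicated),  # single-copy plus duplicated
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(counts.get("Fragmented", 0)),
        "missing": pct(counts.get("Missing", 0)),
        "n": total,
    }


if __name__ == "__main__":
    print(summarize(tally_statuses(SAMPLE_TABLE)))
```

Deduplicating by marker ID matters here: counting rows instead of markers would inflate the duplicated fraction and the total.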

Installation, integrations and running

BUSCO is distributed as a Conda package and as a Docker image (recommended for reproducible environments). A typical Conda install is:

    conda install -c conda-forge -c bioconda busco=6.0.0

(or use mamba to speed up dependency resolution). The Docker usage examples in the BUSCO documentation show how to mount your working directory and avoid running containers as root. BUSCO downloads lineage datasets automatically unless you run it in --offline mode, and it supports a config file that points to specific third-party tool locations. Third-party dependencies mentioned in the workflows include HMMER, BLAST/tBLASTn (note that known bugs were fixed in NCBI BLAST ≥2.10.1), Miniprot, MetaEuk, Augustus, Prodigal, BBTools and SEPP. Plotting of results is integrated (via Matplotlib) and produces publication-ready summaries.

Example use cases and outputs

- Assembly QC: run BUSCO on a draft genome to quantify how much of the expected gene content is present and compare the result against related public assemblies. A common command pattern is: busco -i genome.fna -m genome -l <lineage_odb12> -c 8 -o busco_out. BUSCO reports overall completeness (%) and writes per-marker FASTA/GFF files so you can inspect or re-use sequences.
- Annotation benchmarking: assess predicted gene sets before and after manual curation to quantify improvements in completeness and duplication.
- Phylogenomics and marker extraction: BUSCO writes the sequences of the markers it finds, which can then be aligned across species to build species trees and used for synteny/marker analyses across taxa. The auto-lineage pipeline is useful for metagenomes or poorly classified samples because BUSCO will attempt to place the input on a phylogenetic tree and select the best dataset.
- Batch processing: point BUSCO at a directory of FASTA files to run many assessments at once; outputs are kept per sample, with a top-level log for the batch.
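
When BUSCO is driven from a script, the command pattern from the assembly QC use case can be assembled programmatically. The following is a minimal sketch that uses only the options named in this section (-i, -m, -l, -c, -o and --offline); the function name and its defaults are hypothetical, and the lineage stays a placeholder that you must replace with a real dataset name. Nothing is executed here, so the sketch works without BUSCO installed.

```python
import shlex

# The three modes named in this document.
VALID_MODES = {"genome", "transcriptome", "proteins"}


def build_busco_command(input_path, mode, lineage, out_name, cpus=8, offline=False):
    """Return the argument list for a BUSCO run (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    cmd = [
        "busco",
        "-i", str(input_path),  # input FASTA (or a directory for batch runs)
        "-m", mode,             # assessment mode
        "-l", lineage,          # lineage dataset name
        "-c", str(cpus),        # number of CPU threads
        "-o", out_name,         # output folder name
    ]
    if offline:
        cmd.append("--offline")  # do not download datasets
    return cmd


if __name__ == "__main__":
    # "<lineage_odb12>" is a placeholder, not a real dataset name.
    cmd = build_busco_command("genome.fna", "genome", "<lineage_odb12>", "busco_out")
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building an argument list rather than a single shell string avoids quoting problems when file names contain spaces.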

Best practices, caveats and reproducibility

Choose the most specific lineage dataset available for your organism; coarser lineages speed up runs but reduce resolution. Interpret BUSCO scores in the context of the organism's biology: gene loss, parasitic reduction or extreme divergence can legitimately lower scores. Fragmented and missing calls can arise from assembly fragmentation, divergent genes or failures of the gene-prediction/search steps; in some pipelines BUSCO performs a second-pass search with ancestral variants to recover hard cases. BUSCO can also help flag contamination if high scores appear across multiple domains.

For reproducibility, report the BUSCO version, the dataset name and creation date, third-party tool versions, and all BUSCO options used. The BUSCO software is MIT licensed; the datasets are provided under CC BY-ND 4.0 and must be cited in publications (see the BUSCO v6 / OrthoDB v12 citations). For help, the BUSCO documentation, the protocol paper and the community issue tracker are the recommended resources.
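
One way to keep the reproducibility details listed above together is to store them next to the results as a small JSON record. The sketch below is only a suggestion, not something BUSCO produces: the class and field names are hypothetical, and the placeholder values must be replaced with what your own run and dataset report.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BuscoRunRecord:
    """Hypothetical record of what should be reported for a BUSCO run."""
    busco_version: str
    dataset_name: str
    dataset_creation_date: str
    tool_versions: dict = field(default_factory=dict)  # third-party tools
    options: list = field(default_factory=list)        # BUSCO options used

    def to_json(self):
        # Sorted keys keep the record stable across runs and easy to diff.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values: replace every "<...>" entry with real information.
record = BuscoRunRecord(
    busco_version="6.0.0",
    dataset_name="<lineage_odb12>",
    dataset_creation_date="<creation date of the dataset>",
    tool_versions={"hmmer": "<version>", "miniprot": "<version>"},
    options=["-m", "genome", "-c", "8"],
)

if __name__ == "__main__":
    print(record.to_json())
```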

Links