Canu

Background and design

Canu is an open‑source, hierarchical assembler derived from the Celera Assembler and developed for single‑molecule long reads with high error rates (for example, legacy PacBio RS II/Sequel and Oxford Nanopore MinION datasets). It implements adaptive k‑mer weighting and repeat separation strategies to improve accuracy when reads are noisy and repeats are abundant. The pipeline was published and widely used during the initial wave of long‑read sequencing, and it remains useful for assembling older long‑read datasets and for learning assembly workflows.

Core capabilities and pipeline

Canu implements a four‑step hierarchical workflow: sensitive overlap detection (using MHAP), generation of corrected read consensus, trimming of corrected reads, and final assembly of the trimmed, corrected sequences. The overlap stage was tuned to tolerate the high per‑base error rates typical of early PacBio and Nanopore runs. Error correction and consensus improve per‑read accuracy before assembly, and repeat separation logic helps resolve complex regions. The tool also supports trio‑binning workflows for haplotype‑resolved assemblies and has a HiCanu variant that targets high‑fidelity (PacBio HiFi) reads and better handles segmental duplications, satellites and allelic variants.

Typical use cases

Canu is suited to de novo assembly of genomes ranging from microbial to large eukaryotic when the input consists of noisy long reads. Common use cases include assembling legacy PacBio RS II/Sequel or early Nanopore datasets, producing draft assemblies for downstream polishing and scaffolding, and running trio‑binned assemblies to separate parental haplotypes. HiCanu (the HiFi‑tuned mode) is appropriate when using PacBio HiFi reads. Because Canu performs error correction and trimming before assembly, it is often used in pipelines where those preprocessing steps are needed before running additional polishers or scaffolding tools (Hi‑C, optical maps, etc.).
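To make the workflow above more concrete, the following is a minimal sketch of how a single Canu invocation might be composed for the read types mentioned (noisy PacBio, Nanopore, or HiFi). It assumes the Canu 2.x command-line conventions (`-p`, `-d`, `genomeSize=`, and the `-pacbio`, `-nanopore` and `-pacbio-hifi` read flags); older 1.x releases used different read flags, so check the upstream documentation for the installed version. The helper name `canu_command`, the prefix, the output directory and the file names are hypothetical and serve only as illustration.

```python
import shlex

# Read-type flags as used by Canu 2.x (an assumption to verify against
# the installed release; 1.x releases used e.g. -pacbio-raw instead).
READ_FLAGS = {
    "pacbio": "-pacbio",        # noisy PacBio CLR reads
    "nanopore": "-nanopore",    # Oxford Nanopore reads
    "hifi": "-pacbio-hifi",     # PacBio HiFi reads (HiCanu mode)
}


def canu_command(prefix, out_dir, genome_size, tech, reads):
    """Build the argument list for one Canu run.

    Without a stage selector, one run covers correction, trimming and
    assembly. This hypothetical helper only builds the command; it does
    not execute it.
    """
    if tech not in READ_FLAGS:
        raise ValueError(f"unknown read technology: {tech!r}")
    if not reads:
        raise ValueError("at least one read file is required")
    return [
        "canu",
        "-p", prefix,                  # prefix for output file names
        "-d", out_dir,                 # assembly working directory
        f"genomeSize={genome_size}",   # estimated genome size, e.g. 4.8m
        READ_FLAGS[tech],
        *reads,
    ]


if __name__ == "__main__":
    # Hypothetical inputs for a bacterial Nanopore dataset.
    cmd = canu_command("ecoli", "ecoli-asm", "4.8m", "nanopore",
                       ["reads.fastq.gz"])
    print(shlex.join(cmd))
    # prints: canu -p ecoli -d ecoli-asm genomeSize=4.8m -nanopore reads.fastq.gz
```

Building the command as a list rather than a string keeps file names with spaces intact if the list is later passed to `subprocess.run`. Trio-binning runs take additional parental-read arguments that this sketch deliberately leaves out.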
Installation, dependencies and integrations

The recommended way to get started is to download an official binary release; building from source is possible but requires development libraries and a supported compiler. Conda (bioconda) and Homebrew formulas are available (conda install -c conda-forge -c bioconda -c defaults canu; brew install brewsci/bio/canu), and an unsupported Docker image has been published by community members. If compiling, you will typically run make -j <number_of_threads> and install development packages such as libboost, zlib, libcurl, OpenSSL, lzma and bzip2; platform notes in the upstream docs describe specifics for Linux, FreeBSD and macOS (Apple Silicon and Intel), as well as Java runtimes. The project uses MHAP for overlap detection and includes submodules such as seqrequester for histogram/metadata utilities. Important: do NOT download the GitHub .zip source snapshot. It is known to omit files and will not compile; use the release tarball or clone the repository instead.

Limitations and recommendations

Canu has been a foundational long‑read assembler, but active development waned around 2021 and the project is considered to have reached end‑of‑life for new sequencing technologies. It has not been tuned or validated on many more recent instrument chemistries and protocols; for contemporary HiFi or the latest Nanopore chemistries, use modern assemblers or the HiCanu mode where applicable. Users working with recent datasets should evaluate current alternatives and follow the upstream documentation and publications for best‑practice parameters and recommendations. Key references for Canu and HiCanu are provided in the upstream repository and give details on the algorithms and benchmarking used during development.
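As a small companion to the installation notes above, here is a hedged sketch of a pre-flight check that looks for a `canu` launcher on the PATH before a pipeline starts. The function names `locate_canu` and `report_canu` are hypothetical, and the use of `canu -version` to print the installed release is an assumption to confirm against the upstream documentation. The conda command in the hint is the one quoted in the text.

```python
import shutil
import subprocess

# Install command quoted from the installation notes.
CONDA_INSTALL = "conda install -c conda-forge -c bioconda -c defaults canu"


def locate_canu(executable="canu"):
    """Return the full path of the launcher if it is on PATH, else None."""
    return shutil.which(executable)


def report_canu():
    """Describe the Canu installation, or suggest one way to install it."""
    path = locate_canu()
    if path is None:
        return f"canu not found on PATH; one option is: {CONDA_INSTALL}"
    # Assumption: the launcher accepts -version and prints its release.
    result = subprocess.run([path, "-version"],
                            capture_output=True, text=True)
    version = (result.stdout or result.stderr).strip()
    return f"found {path}: {version}"


if __name__ == "__main__":
    print(report_canu())
```

Running the check first means a missing or misconfigured installation is reported before any long job begins.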

Links