Biopython

Biopython is a mature, community-driven collection of Python libraries and scripts for computational molecular biology. Developed and maintained by an international group of volunteers, it bundles parsers, data models and convenience wrappers that make it straightforward to read, write and process common bioinformatics formats and workflows. Documentation (a tutorial, cookbook and API docs) is published and generated from the source tree, and users who rely on Biopython in publications are asked to cite the project’s application note and module-specific references.

The project provides a broad set of capabilities spanning sequence handling, alignments, phylogenetics, structural bioinformatics and data visualization. Core modules include sequence I/O and manipulation (e.g. SeqIO), alignment tools (Bio.Align), phylogenetics (Bio.Phylo), structural parsers (Bio.PDB and support for mmCIF/PDB), clustering and expression-analysis utilities (Bio.Cluster), and graphics (GenomeDiagram / Bio.Graphics). Biopython implements robust parsers for widely used file formats such as FASTA, GenBank, FASTQ (including Sanger and Illumina variants), BLAST output, PDB/mmCIF structural files and many others. It also includes wrapper code to call and parse results from common command-line bioinformatics tools like BLAST, ClustalW and EMBOSS, enabling those tools to be integrated into Python pipelines.

Installing and using Biopython is designed to be straightforward. The project publishes pre-compiled wheel packages on PyPI so pip install typically requires no compilation; NumPy is a required dependency and will be pulled in automatically. For users compiling from source, a compatible C compiler and Python development headers are required (GCC on Linux, MSVC on Windows, Apple command line tools on macOS).
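
After installation, it can be useful to confirm that Biopython and its required NumPy dependency are visible to the interpreter. The following is a minimal sketch that uses only the Python standard library; the helper names check_modules and installed_version are hypothetical and not part of Biopython. It assumes the top-level import name is Bio and the PyPI distribution name is biopython.

```python
from importlib import metadata
from importlib.util import find_spec

# Import names to look for: "Bio" is Biopython's top-level package,
# "numpy" is its required dependency.
MODULES = ("Bio", "numpy")


def check_modules(names=MODULES):
    """Return {module_name: bool} saying whether each module can be found."""
    return {name: find_spec(name) is not None for name in names}


def installed_version(distribution="biopython"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


status = check_modules()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
print("Biopython version:", installed_version())
```

The check only reports whether the modules can be located; it does not import them, so it is safe to run in an environment where the installation is incomplete.
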
Optional dependencies enable extra features: ReportLab (graphics), matplotlib and networkx/pygraphviz/pydot (tree plotting and graph visualization), rdflib (CDAO parsing), and database drivers such as psycopg2, mysqlclient or MySQL Connector/Python for BioSQL support. Biopython is tested on recent Python implementations and on PyPy, and the source includes a regression test suite to help validate installations.

Typical use cases for Biopython range from small-scale scripting to large batch pipelines. Common tasks include parsing and filtering sequence files (including quality-aware FASTQ processing), converting between formats, programmatically running and parsing BLAST jobs, computing and manipulating alignments, building and plotting phylogenetic trees, parsing and analysing protein structures, and producing publication-ready genome schematics. BioSQL integration provides a way to store and query sequence records in relational databases for larger projects, while the wrappers for external tools let teams incorporate best-of-breed command-line software into automated analyses. The library is widely used for teaching, for rapid prototyping of bioinformatics methods, and in research where reproducible, scriptable data processing is required.

Biopython is open source under the permissive Biopython License Agreement; some parts are dual licensed, with the 3-clause BSD license as an option. Development happens on GitHub; contributors are encouraged to read the contributing guidelines, run the test suite and submit changes as pull requests. The project maintains mailing lists, including a low-volume announcements list, and encourages question-and-answer activity on public forums such as Stack Overflow (tagged "biopython"). For users getting started, the Biopython Tutorial & Cookbook, API documentation and the project's website provide practical examples, installation instructions and details of optional dependencies and integrations.
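
As a starting point, the quality-aware FASTQ filtering and format conversion mentioned among the common tasks might look like the sketch below. It assumes Biopython is installed and uses Bio.SeqIO with the "fastq" (Sanger, Phred+33) and "fasta" format names. The two reads are made-up data, and the helper filter_by_mean_quality and its threshold of 20 are illustrative choices, not project recommendations.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up reads in Sanger FASTQ encoding (Phred+33):
# 'I' encodes quality 40, '#' encodes quality 2.
FASTQ_TEXT = """@read_good
ACGTACGT
+
IIIIIIII
@read_poor
ACGTACGT
+
########
"""


def filter_by_mean_quality(handle, min_mean_quality=20.0):
    """Yield FASTQ records whose mean Phred quality meets the threshold."""
    for record in SeqIO.parse(handle, "fastq"):
        qualities = record.letter_annotations["phred_quality"]
        if not qualities:
            continue  # skip empty reads to avoid dividing by zero
        if sum(qualities) / len(qualities) >= min_mean_quality:
            yield record


# Keep the high-quality reads, then convert them to FASTA in memory.
kept = list(filter_by_mean_quality(StringIO(FASTQ_TEXT)))
fasta_out = StringIO()
written = SeqIO.write(kept, fasta_out, "fasta")

print("records written:", written)
print(fasta_out.getvalue(), end="")
```

For real data, the StringIO objects would be replaced by file paths or open file handles. Because the filter is a generator, records can be streamed straight into SeqIO.write without holding the whole file in memory.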