ProtTrans (ProtBERT/ProtT5)
ProtTrans is an open-source collection of large Transformer protein language models pretrained on massive protein sequence corpora using clusters of GPUs and TPUs. The project adapts several architectures from NLP, including ProtBERT, ProtT5 (ProtT5-XL-U50 and larger variants), ProtAlbert, ProtXLNet and ProtElectra, and releases pretrained weights for feature extraction, logits extraction, fine-tuning and protein sequence generation. ProtTrans was developed to "crack the language of life" by applying self-supervised learning to amino-acid sequences, so that downstream tasks (secondary structure, subcellular localization, binding prediction, variant effects, etc.) can be solved faster, and often more accurately, than with hand-crafted features or small supervised models.

The primary capability of ProtTrans models is embedding extraction: they produce rich, contextual per-residue (sequence-length × embedding-dimension) and per-protein (pooled) representations that serve as inputs to downstream classifiers or regressors. A typical workflow tokenizes protein sequences (replacing rare residues with X and inserting whitespace between residues), feeds them to the encoder (for T5 variants) or a BERT-style model, and extracts the last hidden state for residue-level features or an averaged/pooled vector per protein; a minimal sketch of this workflow appears below. The repository provides scripts and Colab notebooks to simplify batch embedding generation from FASTA files, along with examples for deriving logits and for fine-tuning on per-residue or per-protein predictions. ProtT5 is commonly run in half precision on GPUs to speed up embedding generation, with no observed loss in prediction performance.

ProtTrans models have been applied successfully across a range of bioinformatics tasks. In published benchmarks, ProtT5-XL-UniRef50 (also called ProtT5-XL-U50) achieved notable results for secondary structure prediction (Q3 around 81 on CASP12, with strong performance on TS115 and CB513), membrane versus water-soluble classification (about 91 on DeepLoc), and subcellular localization. Compared with other protein language models, ProtT5 embeddings frequently improved downstream accuracy, for example outperforming ProtBERT and many baseline models on subcellular localization and on conservation and variant-effect tasks. Use cases include rapid alignment-free structure prediction, residue-level binding-site prediction, variant-effect prediction, and transfer learning for bespoke tasks. The project also documents end-to-end examples (Colab notebooks) for secondary structure and localization prediction, and provides guidance for building lightweight downstream layers on top of frozen embeddings or for fine-tuning (see the sketches below).

Integrations and practical notes: all ProtTrans pretrained weights are published on Hugging Face and Zenodo and are straightforward to load via the transformers library. Many example notebooks demonstrate feature extraction with ProtT5-XL-U50, and scripts are provided for writing per-residue and per-protein embeddings to HDF5. For fine-tuning, ProtTrans provides examples and supports parameter-efficient adapters such as LoRA; a community-led evo-tuning workflow shows how to continue pretraining ProtT5 on homologous sequences. Downstream services and integrations include LambdaPP, a web service exposing ProtT5-based predictions, and UniProt, which offers precomputed ProtT5 embeddings for selected organisms. Typical dependencies are PyTorch and Hugging Face transformers; GPU usage with half precision is recommended for performance.
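As a concrete illustration of the embedding workflow described above, here is a minimal sketch using the Hugging Face transformers API and the Rostlab/prot_t5_xl_half_uniref50-enc checkpoint. The toy sequences and the mean-pooling choice are illustrative; the repository's own notebooks handle batching and precision more carefully.

```python
# Minimal sketch: per-residue and per-protein ProtT5 embeddings via transformers.
# Requires: torch, transformers, sentencepiece.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "Rostlab/prot_t5_xl_half_uniref50-enc"  # encoder-only, half-precision checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_id, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_id).to(device)
if device.type == "cpu":
    model = model.float()  # half precision is only recommended on GPU
model.eval()

sequences = ["PRTEINO", "SEQWENCE"]  # toy sequences
# Replace rare/ambiguous residues (U, Z, O, B) with X and insert whitespace between residues.
sequences = [" ".join(re.sub(r"[UZOB]", "X", s)) for s in sequences]

batch = tokenizer(sequences, add_special_tokens=True, padding="longest", return_tensors="pt")
with torch.no_grad():
    out = model(input_ids=batch.input_ids.to(device),
                attention_mask=batch.attention_mask.to(device))

for i in range(len(sequences)):
    seq_len = int(batch.attention_mask[i].sum()) - 1      # drop the trailing </s> token
    per_residue = out.last_hidden_state[i, :seq_len]      # (L, 1024) residue embeddings
    per_protein = per_residue.mean(dim=0)                 # (1024,) pooled protein embedding
    print(per_residue.shape, per_protein.shape)
```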
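The "lightweight downstream layers on top of frozen embeddings" approach can be as simple as a small feed-forward head trained on precomputed per-protein vectors. The sketch below is illustrative: the layer sizes, two-class output and random stand-in data are assumptions, not the architecture used in the ProtTrans examples.

```python
# Sketch: training a small classifier on frozen (precomputed) per-protein embeddings.
import torch
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    """Lightweight head on top of 1024-d ProtT5 per-protein embeddings."""
    def __init__(self, embed_dim: int = 1024, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# The language model stays frozen: only its precomputed embeddings are used as features.
embeddings = torch.randn(32, 1024)      # stand-in for precomputed ProtT5 embeddings
labels = torch.randint(0, 2, (32,))     # stand-in for e.g. membrane / soluble labels

clf = EmbeddingClassifier()
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(clf(embeddings), labels)
    loss.backward()
    optimizer.step()
```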
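For parameter-efficient fine-tuning with LoRA, one common route is the PEFT library on top of transformers. The following sketch wraps the ProtT5 encoder with LoRA adapters; the adapter rank, dropout and target-module names are illustrative assumptions and may differ from the configuration used in the ProtTrans fine-tuning examples.

```python
# Sketch: adding LoRA adapters to the ProtT5 encoder with the PEFT library.
from transformers import T5EncoderModel
from peft import LoraConfig, get_peft_model

model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")

lora_config = LoraConfig(
    r=8,                        # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q", "v"],  # T5 attention query/value projection layers
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices remain trainable
```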
ProtTrans is distributed under an academic-friendly open license and maintained as an open-source project with a community of contributors. The authors encourage citing the ProtTrans paper when the models are used in publications and welcome issues and pull requests through the project's GitHub. For practitioners: start with the provided Colab examples to generate embeddings for a few sequences, then scale up to batch embedding pipelines using the supplied scripts and the Hugging Face model hub entries (a sketch of such a pipeline follows below); consider LoRA fine-tuning or continued pretraining (evo-tuning) when adapting the models to specialized sequence families or new tasks.
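A batch embedding pipeline of the kind mentioned above can be approximated with a short script that reads a FASTA file and writes per-protein embeddings to HDF5. This is a sketch rather than the repository's own script: the file names are placeholders, and sequences are embedded one at a time for simplicity instead of being length-sorted and batched.

```python
# Sketch: FASTA -> per-protein ProtT5 embeddings stored in an HDF5 file.
# Requires: torch, transformers, sentencepiece, h5py.
import re
import h5py
import torch
from transformers import T5Tokenizer, T5EncoderModel

def read_fasta(path):
    """Yield (identifier, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(model_id, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_id).to(device).eval()
if device.type == "cpu":
    model = model.float()  # half precision is only recommended on GPU

with h5py.File("per_protein_embeddings.h5", "w") as h5:        # placeholder output path
    for identifier, sequence in read_fasta("proteins.fasta"):  # placeholder input path
        spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
        ids = tokenizer(spaced, add_special_tokens=True, return_tensors="pt").to(device)
        with torch.no_grad():
            hidden = model(**ids).last_hidden_state[0, :len(sequence)]  # drop trailing </s>
        h5.create_dataset(identifier, data=hidden.mean(dim=0).float().cpu().numpy())
```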