CodonTransformer

A Multispecies Codon Optimizer
Using Context-aware Neural Networks

1 University of Toronto    2 INSERM    3 Vector Institute

* Equal Contribution.    Equal Advising.

Introducing CodonTransformer

CodonTransformer is a multispecies deep learning model for state-of-the-art codon optimization. Trained on over 1 million gene-protein pairs from 164 organisms spanning all kingdoms of life, CodonTransformer leverages advanced neural network architectures to generate host-specific DNA sequences with natural-like codon usage patterns and minimized negative cis-regulatory elements for any protein sequence.

Try CodonTransformer Now!

Overview

The genetic code's degeneracy allows multiple DNA sequences to encode the same protein, but not all codons are equal in the eyes of the host organism. Codon usage bias can significantly impact the efficiency of heterologous protein production due to differences in tRNA abundance, effects on co-translational protein folding, and evolutionary constraints.
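To make the degeneracy concrete, here is a small, self-contained Python illustration (not part of the CodonTransformer package): even a three-residue peptide admits dozens of valid encodings, and a typical protein admits astronomically many.

# Standard genetic code: leucine and serine each have six synonymous codons.
SYNONYMOUS_CODONS = {
    "M": ["ATG"],                                     # methionine (unique)
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],  # leucine
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],  # serine
}

def count_encodings(protein: str) -> int:
    """Number of distinct DNA sequences encoding the given peptide."""
    n = 1
    for aa in protein:
        n *= len(SYNONYMOUS_CODONS[aa])
    return n

print(count_encodings("MLS"))  # 1 * 6 * 6 = 36 possible DNA sequences

Codon optimization is the task of picking, from this combinatorial space, the encoding best suited to the host.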

CodonTransformer uses the Transformer architecture and a novel training strategy named STREAM (Shared Token Representation and Encoding with Aligned Multi-masking) to learn and replicate the intricate codon usage patterns across a diverse array of organisms. By doing so, it provides a powerful tool for optimizing DNA sequences for expression in various host species.

Model

CodonTransformer addresses the challenge of codon optimization by translating protein sequences into optimized codon sequences using the encoder-only BigBird Transformer architecture. We frame this task as a Masked Language Modeling (MLM) problem, where the model predicts codons by unmasking tokens from [aminoacid_UNK] to [aminoacid_codon]. Our STREAM training strategy lets the model learn codon usage patterns by unmasking multiple mask tokens at once, while organism-specific embeddings are added to the sequence to contextualize predictions.
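As a rough sketch of this framing (with illustrative literal tokens; the real tokenizer ships with the package), the input pairs every residue with an unknown codon, and the model resolves each pair conditioned on the organism:

# Simplified sketch of the masked language modeling setup. The token
# strings mirror the package's processed-input format shown later on
# this page.
protein = "MKV"
organism = "Escherichia coli general"  # supplies the organism embedding

# Input: every residue is a combined token with its codon masked.
masked_tokens = [f"{aa}_UNK" for aa in protein]
# -> ['M_UNK', 'K_UNK', 'V_UNK']

# Training target: the same residues paired with their genomic codons.
target_tokens = ["M_ATG", "K_AAA", "V_GTG"]

# Conceptually, the model learns
#   P(codon_i | amino acid sequence, visible codons, organism)
# so at inference time, unmasking every position for a chosen organism
# yields a host-specific DNA sequence.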

The training process involves two stages: pretraining on over one million DNA-protein pairs from 164 diverse organisms to capture universal codon usage patterns, followed by fine-tuning on a curated subset of highly optimized genes specific to target organisms. This dual training strategy enables CodonTransformer to generate DNA sequences with natural-like codon distributions tailored to each host, effectively optimizing gene expression across multiple species.
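The fine-tuning stage can be approximated with standard tooling. Below is a generic Hugging Face masked-LM loop, shown only to illustrate the idea: it uses a plain random-masking collator rather than the STREAM strategy described above, and my_tokenized_dataset is a placeholder for your own tokenized set of highly optimized genes.

from transformers import (
    AutoTokenizer,
    BigBirdForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the released checkpoint and continue training on a curated set.
tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer")

# NOTE: generic random masking for illustration only; the actual training
# uses the STREAM strategy, which this sketch does not reproduce.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3),
    train_dataset=my_tokenized_dataset,  # placeholder: tokenized high-CSI genes
    data_collator=collator,
)
trainer.train()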

Fig. 1: CodonTransformer multispecies model with combined organism-amino acid-codon embedding.

a. An encoder-only BigBird Transformer model trained on combined amino acid-codon tokens along with an organism encoding for host-specific codon usage representation. b. CodonTransformer was trained with ~1 million genes from 164 organisms across all kingdoms of life and fine-tuned with highly expressed genes (top 10% codon similarity index, CSI) of 13 organisms and two chloroplast genomes.

CodonTransformer Outperforms Existing Tools

CodonTransformer demonstrates superior performance in generating natural-like codon distributions and minimizing negative cis-regulatory elements compared to existing codon optimization tools.
Here are the benchmarking and evaluation results of CodonTransformer:

Learning Codon Patterns Across Organisms

CodonTransformer effectively learned codon usage patterns across multiple species, as shown by the high codon similarity indices (CSI) of the DNA sequences it generates for various organisms. The model adapts to the specific codon preferences of each host, supporting efficient expression.

Fig. 2: CodonTransformer learned codon patterns across organisms.

Codon similarity index (CSI) for original genes (all genes and the top 10% by CSI) and for DNA sequences generated by CodonTransformer (base and fine-tuned models) from all original proteins, shown for 9 of the 15 genomes used for fine-tuning in this study. See Supplementary Figs. 2-16 for all 15 genomes and the additional metrics of GC content and codon distribution frequency (CDF). Source data for Fig. 2 and Supplementary Figs. 2-16 is available at https://zenodo.org/records/13262517.

Generating Natural-Like Codon Distributions

The model produces DNA sequences with codon usage patterns closely resembling those found in nature, avoiding clusters of rare or highly frequent codons that can negatively affect protein folding and expression. This is visualized using %MinMax profiles and Dynamic Time Warping (DTW) metrics.
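For the mechanics, here is a minimal sketch of both metrics. It assumes `usage` maps each codon to its usage frequency in the host and `synonyms` maps each codon to the list of frequencies of all its synonymous codons; the formulas are simplified from the published %MinMax definition, and the package's CodonEvaluation module provides its own implementations.

import numpy as np

def percent_minmax(codons, usage, synonyms, window=18):
    # Signed balance of common vs. rare codons in a sliding window.
    profile = []
    for i in range(len(codons) - window + 1):
        win = codons[i : i + window]
        x_act = np.mean([usage[c] for c in win])            # actual frequencies
        x_max = np.mean([max(synonyms[c]) for c in win])    # most common choices
        x_min = np.mean([min(synonyms[c]) for c in win])    # rarest choices
        x_avg = np.mean([np.mean(synonyms[c]) for c in win])
        if x_act >= x_avg:
            profile.append(100 * (x_act - x_avg) / (x_max - x_avg))
        else:
            profile.append(-100 * (x_avg - x_act) / (x_avg - x_min))
    return np.array(profile)

def dtw_distance(a, b):
    # Classic O(n*m) dynamic time warping between two 1-D profiles.
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

A lower DTW distance between the %MinMax profile of a generated sequence and that of its natural counterpart means the generated sequence reproduces the natural rhythm of common and rare codons.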

Fig. 3: CodonTransformer generates natural-like codon distributions.

a. Schematic representation of %MinMax and dynamic time warping (DTW). %MinMax represents the proportion of common and rare codons in a sliding window of 18 codons. The DTW algorithm computes the minimal distance between two %MinMax profiles by finding the matching positions (Methods). b. %MinMax profiles for sequences generated by different models for genes yahG (E. coli), SER33 (S. cerevisiae), AT4G12540 (A. thaliana), Csad (M. musculus), and ZBTB7C (H. sapiens). c. DTW distances between %MinMax profiles of model-generated sequences and their genomic counterparts for 50 random genes selected among the top 10% codon similarity index (CSI). For each organism, the gene whose %MinMax profiles are shown in (b) is highlighted in grey. d. Mean and standard deviation of DTW distances normalized by sequence length across the 5 organisms (for organism-specific DTW distances, see Supplementary Fig. 17). Data underlying this figure is provided in Supplementary Data 1.

Benchmarking with Real World Proteins

When benchmarked against proteins of biotechnological interest, CodonTransformer consistently generates sequences with minimized negative cis-regulatory elements, outperforming other tools. This enhances the potential for successful heterologous expression.

Fig. 4: Model benchmark with proteins of biotechnological interest.

Mean and standard deviation of Jaccard index (a), sequence similarity (b), and dynamic time warping (DTW) distance (c) between corresponding sequences for the 52 benchmark proteins across the 5 organisms (for organism-specific results, see Supplementary Figs. 19, 20, and 21, respectively). (d) Number of negative cis-elements in the sequences generated by different tools (✕ shows the mean). Data underlying this figure is provided in Supplementary Data 2.

Getting Started

Along with open-sourcing the data and model, we provide a comprehensive Python package for codon optimization. The CodonTransformer package has 5 modules:

  • CodonData
    facilitates processing of genetic information by cleaning and translating DNA and protein sequences, FASTA files, and managing codon frequencies from databases like NCBI and Kazusa.
  • CodonPrediction
    enables preprocessing of sequences, prediction of optimized DNA sequences using the CodonTransformer model, and supports various other optimization strategies.
  • CodonEvaluation
    provides tools to compute evaluation metrics such as Codon Similarity Index (CSI), GC content, and Codon Frequency Distribution, allowing for detailed assessment of optimized sequences.
  • CodonUtils
    offers essential constants and helper functions for genetic sequence analysis, including amino acid mappings, codon tables, taxonomy ID management, and sequence validation.
  • CodonJupyter
    enhances Jupyter notebook workflows with interactive widgets for selecting organisms and inputting protein sequences, formatting and displaying optimized DNA sequence outputs.

Installation

Install CodonTransformer via pip:

pip install CodonTransformer

Or clone the repository:

git clone https://github.com/adibvafa/CodonTransformer.git
cd CodonTransformer
pip install -r requirements.txt

The package requires python>=3.9. The full list of dependencies is in requirements.txt.

Use Case

After installing CodonTransformer, you can use:

import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from CodonTransformer.CodonPrediction import predict_dna_sequence
from CodonTransformer.CodonJupyter import format_model_output

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer").to(DEVICE)

# Set your input data
protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
organism = "Escherichia coli general"

# Predict with CodonTransformer
output = predict_dna_sequence(
    protein=protein,
    organism=organism,
    device=DEVICE,
    tokenizer=tokenizer,
    model=model,
    attention_type="original_full",
)
print(format_model_output(output))
-----------------------------
|          Organism         |
-----------------------------
Escherichia coli general

-----------------------------
|       Input Protein       |
-----------------------------
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG

-----------------------------
|      Processed Input      |
-----------------------------
M_UNK A_UNK L_UNK W_UNK M_UNK R_UNK L_UNK L_UNK P_UNK L_UNK L_UNK A_UNK L_UNK L_UNK A_UNK L_UNK W_UNK G_UNK P_UNK D_UNK P_UNK A_UNK A_UNK A_UNK F_UNK V_UNK N_UNK Q_UNK H_UNK L_UNK C_UNK G_UNK S_UNK H_UNK L_UNK V_UNK E_UNK A_UNK L_UNK Y_UNK L_UNK V_UNK C_UNK G_UNK E_UNK R_UNK G_UNK F_UNK F_UNK Y_UNK T_UNK P_UNK K_UNK T_UNK R_UNK R_UNK E_UNK A_UNK E_UNK D_UNK L_UNK Q_UNK V_UNK G_UNK Q_UNK V_UNK E_UNK L_UNK G_UNK G_UNK __UNK

-----------------------------
|       Predicted DNA       |
-----------------------------
ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCGTTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTCTTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA

You can use the inference template for batch inference in Google Colab.
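For small local batches, a plain loop over the same call also works. This sketch reuses the model, tokenizer, and device loaded above:

# Minimal local batching sketch; for large jobs, prefer the Colab template.
proteins = {
    "insulin_fragment": "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG",
    # ... add more name -> sequence entries
}

results = {
    name: predict_dna_sequence(
        protein=seq,
        organism="Escherichia coli general",
        device=DEVICE,
        tokenizer=tokenizer,
        model=model,
        attention_type="original_full",
    )
    for name, seq in proteins.items()
}

for name, output in results.items():
    print(name)
    print(format_model_output(output))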

Why Choose CodonTransformer?

  • Multispecies Support. Trained on 164 organisms, CodonTransformer can optimize codon usage for a wide range of host species, including prokaryotes and eukaryotes.
  • Context-Aware Optimization. The model considers both global codon usage biases and local sequence patterns, ensuring optimal DNA sequence design.
  • Natural-Like Codon Distribution. Generates sequences with codon distributions similar to natural genes, aiding in proper protein folding and function.
  • Custom Fine-Tuning. Users can fine-tune the model on any custom dataset to match specific design requirements or to optimize for unique organisms.
  • Open-Access and Flexible. The base and fine-tuned models are openly available, along with a comprehensive Python package and a user-friendly Google Colab notebook!

Conclusion

CodonTransformer represents a significant advancement in codon optimization by leveraging a multispecies, context-aware deep learning approach trained on 164 diverse organisms. Its ability to generate natural-like codon distributions and minimize negative cis-regulatory elements ensures optimized gene expression while preserving protein structure and function.

The model's flexibility is further enhanced through customizable fine-tuning, allowing users to tailor optimizations to specific gene sets or unique organisms. As an open-access tool, CodonTransformer provides comprehensive resources, including a Python package and an interactive Google Colab notebook, facilitating widespread adoption and adaptation for various biotechnological applications.

By integrating evolutionary insights and advanced neural network architectures, CodonTransformer sets a new standard for efficient and accurate gene design, with potential extensions to protein engineering and therapeutic development.

BibTeX

@article{Fallahpour2024.09.13.612903,
  author = {Fallahpour, Adibvafa and Gureghian, Vincent and Filion, Guillaume J. and Lindner, Ariel B. and Pandi, Amir},
  title = {CodonTransformer: a multispecies codon optimizer using context-aware neural networks},
  elocation-id = {2024.09.13.612903},
  year = {2024},
  doi = {10.1101/2024.09.13.612903},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/09/13/2024.09.13.612903},
  eprint = {https://www.biorxiv.org/content/early/2024/09/13/2024.09.13.612903.full.pdf},
  journal = {bioRxiv}
}