CodonTransformer is a multispecies deep learning model for state-of-the-art codon optimization. Trained on over 1 million gene-protein pairs from 164 organisms spanning all kingdoms of life, it leverages advanced neural network architectures to generate host-specific DNA sequences with natural-like codon usage patterns and minimized negative cis-regulatory elements for any protein sequence.
The genetic code's degeneracy allows multiple DNA sequences to encode the same protein, but not all codons are equal in the eyes of the host organism. Codon usage bias can significantly affect the efficiency of heterologous protein production because of differences in tRNA abundance, co-translational protein folding, and evolutionary constraints.
CodonTransformer uses the Transformer architecture and a novel training strategy named STREAM (Shared Token Representation and Encoding with Aligned Multi-masking) to learn and replicate the intricate codon usage patterns across a diverse array of organisms. By doing so, it provides a powerful tool for optimizing DNA sequences for expression in various host species.
CodonTransformer addresses the challenge of codon optimization by translating protein sequences into optimized codon sequences using the encoder-only BigBird Transformer architecture. We frame this task as a Masked Language Modeling (MLM) problem, where the model predicts codons by unmasking tokens from [aminoacid_UNK] to [aminoacid_codon]. Our STREAM training strategy lets the model learn codon usage patterns by unmasking multiple mask tokens at once, while organism-specific embeddings are added to the sequence to contextualize predictions.
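As a minimal sketch (not the package's internal implementation), the masked input can be thought of as one <AA>_UNK token per residue plus a trailing __UNK for the stop-codon position, matching the "Processed Input" shown in the example further below:

def build_masked_input(protein: str) -> str:
    """Illustrative sketch of a STREAM-style masked input: one '<AA>_UNK'
    token per residue, plus a trailing '__UNK' that the model unmasks
    into a stop codon. The package constructs this internally."""
    tokens = [f"{aa}_UNK" for aa in protein]
    tokens.append("__UNK")  # stop-codon position
    return " ".join(tokens)

print(build_masked_input("MALW"))
# M_UNK A_UNK L_UNK W_UNK __UNK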
The training process involves two stages: pretraining on over one million DNA-protein pairs from 164 diverse organisms to capture universal codon usage patterns, followed by fine-tuning on a curated subset of highly optimized genes specific to target organisms. This dual training strategy enables CodonTransformer to generate DNA sequences with natural-like codon distributions tailored to each host, effectively optimizing gene expression across multiple species.
CodonTransformer demonstrates superior performance in generating natural-like codon distributions and minimizing negative cis-regulatory elements compared to existing codon optimization tools.
Here are the benchmarking and evaluation results of CodonTransformer:
CodonTransformer effectively learned codon usage patterns across multiple species, as shown by high codon similarity indices (CSI) when generating DNA sequences for various organisms. The model adapts to the specific codon preferences of each host, ensuring optimal expression.
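As an illustrative proxy only (the paper's exact CSI computation may differ), how closely a generated sequence tracks a host's codon usage can be quantified by comparing codon-frequency vectors, e.g. with cosine similarity:

import math
from collections import Counter

def codon_frequencies(dna: str) -> Counter:
    """Relative frequency of each codon in a DNA sequence."""
    usable = len(dna) - len(dna) % 3
    counts = Counter(dna[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return Counter({codon: n / total for codon, n in counts.items()})

def codon_usage_similarity(freq_a: Counter, freq_b: Counter) -> float:
    """Cosine similarity between two codon-frequency vectors;
    a generic similarity check, not the paper's CSI formula."""
    shared = set(freq_a) | set(freq_b)
    dot = sum(freq_a[c] * freq_b[c] for c in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0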
The model produces DNA sequences with codon usage patterns closely resembling those found in nature, avoiding clusters of rare or highly frequent codons that can negatively affect protein folding and expression. This is visualized using %MinMax profiles and Dynamic Time Warping (DTW) metrics.
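For intuition, the distance between two 1-D profiles (e.g., the %MinMax profile of a generated sequence versus a natural one) can be computed with plain dynamic time warping; the sketch below is generic DTW, not the paper's evaluation code:

import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance
    between two 1-D profiles, allowing local stretching along the
    sequence axis when aligning them."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

print(dtw_distance([10, 40, 60, 40], [10, 35, 62, 45]))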
When benchmarked against proteins of biotechnological interest, CodonTransformer consistently generates sequences with minimized negative cis-regulatory elements, outperforming other tools. This enhances the potential for successful heterologous expression.
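As a rough illustration of what screening for negative cis-regulatory elements can look like, the sketch below counts occurrences of user-supplied motifs in a candidate sequence; the motif set here is a placeholder, not the benchmark's actual element list:

import re

# Placeholder motifs for illustration only (e.g., an unwanted
# restriction site and a Shine-Dalgarno-like element).
NEGATIVE_MOTIFS = {
    "EcoRI_site": "GAATTC",
    "shine_dalgarno_like": "AGGAGG",
}

def count_negative_elements(dna: str) -> dict:
    """Count non-overlapping occurrences of each motif in the sequence."""
    dna = dna.upper()
    return {name: len(re.findall(motif, dna))
            for name, motif in NEGATIVE_MOTIFS.items()}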
Along with open-sourcing the data and model, we also provide a comprehensive Python package for codon optimization. The CodonTransformer package is organized into five modules; two of them, CodonPrediction and CodonJupyter, are used in the example below.
Install CodonTransformer via pip:
pip install CodonTransformer
Or clone the repository:
git clone https://github.com/adibvafa/CodonTransformer.git
cd CodonTransformer
pip install -r requirements.txt
The package requires Python >= 3.9. The full list of dependencies is available in requirements.txt.
After installing CodonTransformer, you can use:
import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from CodonTransformer.CodonPrediction import predict_dna_sequence
from CodonTransformer.CodonJupyter import format_model_output
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer").to(DEVICE)
# Set your input data
protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
organism = "Escherichia coli general"
# Predict with CodonTransformer
output = predict_dna_sequence(
protein=protein,
organism=organism,
device=DEVICE,
tokenizer=tokenizer,
model=model,
attention_type="original_full",
)
print(format_model_output(output))
-----------------------------
| Organism |
-----------------------------
Escherichia coli general
-----------------------------
| Input Protein |
-----------------------------
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG
-----------------------------
| Processed Input |
-----------------------------
M_UNK A_UNK L_UNK W_UNK M_UNK R_UNK L_UNK L_UNK P_UNK L_UNK L_UNK A_UNK L_UNK L_UNK A_UNK L_UNK W_UNK G_UNK P_UNK D_UNK P_UNK A_UNK A_UNK A_UNK F_UNK V_UNK N_UNK Q_UNK H_UNK L_UNK C_UNK G_UNK S_UNK H_UNK L_UNK V_UNK E_UNK A_UNK L_UNK Y_UNK L_UNK V_UNK C_UNK G_UNK E_UNK R_UNK G_UNK F_UNK F_UNK Y_UNK T_UNK P_UNK K_UNK T_UNK R_UNK R_UNK E_UNK A_UNK E_UNK D_UNK L_UNK Q_UNK V_UNK G_UNK Q_UNK V_UNK E_UNK L_UNK G_UNK G_UNK __UNK
-----------------------------
| Predicted DNA |
-----------------------------
ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCGTTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTCTTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA
You can use the inference template for batch inference in Google Colab.
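If you prefer to stay in a local script, a simple loop over the same predict_dna_sequence call works as a minimal batch sketch, reusing the tokenizer, model, DEVICE, and organism from the example above (the Colab template may use a more efficient batched implementation):

# Hypothetical input set for illustration; replace with your own proteins.
proteins = {
    "insulin_fragment": "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG",
}

results = {}
for name, seq in proteins.items():
    results[name] = predict_dna_sequence(
        protein=seq,
        organism=organism,
        device=DEVICE,
        tokenizer=tokenizer,
        model=model,
        attention_type="original_full",
    )
    print(format_model_output(results[name]))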
CodonTransformer represents a significant advancement in codon optimization by leveraging a multispecies, context-aware deep learning approach trained on 164 diverse organisms. Its ability to generate natural-like codon distributions and minimize negative cis-regulatory elements ensures optimized gene expression while preserving protein structure and function.
The model's flexibility is further enhanced through customizable fine-tuning, allowing users to tailor optimizations to specific gene sets or unique organisms. As an open-access tool, CodonTransformer provides comprehensive resources, including a Python package and an interactive Google Colab notebook, facilitating widespread adoption and adaptation for various biotechnological applications.
By integrating evolutionary insights and advanced neural network architectures, CodonTransformer sets a new standard for efficient and accurate gene design, with potential extensions to protein engineering and therapeutic development.
@article{Fallahpour2024.09.13.612903,
  author = {Fallahpour, Adibvafa and Gureghian, Vincent and Filion, Guillaume J. and Lindner, Ariel B. and Pandi, Amir},
  title = {CodonTransformer: a multispecies codon optimizer using context-aware neural networks},
  elocation-id = {2024.09.13.612903},
  year = {2024},
  doi = {10.1101/2024.09.13.612903},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/09/13/2024.09.13.612903},
  eprint = {https://www.biorxiv.org/content/early/2024/09/13/2024.09.13.612903.full.pdf},
  journal = {bioRxiv}
}