September 15, 2023
cdsBERT: Why Codons Still Matter for Protein AI
cdsBERT extends protein language modeling from amino acids to codons, capturing information about coding sequence, organism bias, and the hidden biology behind supposedly silent mutations.
- Codons
- Protein Language Models
- Foundation Models
By Logan Hallee
The Problem: Amino Acids Are Not the Whole Story
Most protein AI systems read proteins as amino acid strings. That makes sense: amino acids are the direct building blocks of proteins and determine much of structure and function.
But biology does not start at the amino acid. It starts with coding sequence. Three DNA letters form a codon, and multiple codons can encode the same amino acid. A mutation that swaps one codon for a synonymous one leaves the protein sequence unchanged, which is why such mutations are called silent.
"Silent" can be misleading. Codon usage affects translation speed, organism-specific expression, mRNA behavior, and sometimes protein folding. Two genes can encode the same amino acid sequence while carrying different regulatory and evolutionary signals.
cdsBERT asks whether protein language models leave useful information behind when they ignore that layer.
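To make the synonymous-codon point concrete, here is a minimal Python sketch that translates two different coding sequences with the standard genetic code. The two toy genes and the helper name translate are invented for illustration and do not come from the paper.

```python
# Standard genetic code, laid out in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a DNA coding sequence into one-letter amino acids."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


# Two toy genes (invented for this example) with different codon choices.
gene_a = "ATGCTGAGCCGT"
gene_b = "ATGTTATCTAGA"

print(translate(gene_a))  # MLSR
print(translate(gene_b))  # MLSR
```

The two sequences differ at three of their four codons, yet an amino-acid-only model would see the same four-letter input for both.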
The Core Idea
cdsBERT extends a protein model's vocabulary from amino acids to codons. Instead of reading only the protein sequence, it reads the coding sequence that produced it.
That gives the model access to synonymous codon choices, organism-level usage patterns, and other signals that disappear when DNA is translated into amino acids.
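The difference between the two vocabularies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not cdsBERT's actual tokenizer: the vocabularies AA_VOCAB and CODON_VOCAB, the token ids, and the function names are all hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(p) for p in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))

# Hypothetical vocabularies: 20 amino acids plus stop, versus 64 codons.
AA_VOCAB = {aa: i for i, aa in enumerate(sorted(set(AMINO_ACIDS)))}
CODON_VOCAB = {codon: i for i, codon in enumerate(CODONS)}


def split_codons(cds):
    """Split a coding sequence into consecutive three-letter codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def encode_amino_acids(cds):
    """Token ids an amino-acid-level model would see for this gene."""
    return [AA_VOCAB[CODON_TO_AA[c]] for c in split_codons(cds)]


def encode_codons(cds):
    """Token ids a codon-level model would see for this gene."""
    return [CODON_VOCAB[c] for c in split_codons(cds)]


# Two toy genes (invented) that encode the same peptide.
gene_a = "ATGCTGAGCCGT"
gene_b = "ATGTTATCTAGA"

print(encode_amino_acids(gene_a) == encode_amino_acids(gene_b))  # True
print(encode_codons(gene_a) == encode_codons(gene_b))            # False
```

At the amino acid level the two genes collapse to identical token ids. At the codon level they stay distinct, which is the extra signal a codon vocabulary preserves.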
The paper introduced MELD, a training strategy for adapting an existing protein model into a codon-aware model. The underlying idea is simple: codons can carry useful information that amino acid strings compress away.
What The Paper Showed
cdsBERT learned codon-specific structure in its internal representation. Synonymous codons, which encode the same amino acid, did not remain interchangeable. They moved differently in the learned space.
That movement correlated with codon usage bias across broad phylogeny. In other words, the model appeared to learn a biological pattern rather than just memorize a larger alphabet.
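Codon usage bias itself is easy to quantify. The sketch below computes, for each codon, its share among the synonymous codons observed for the same amino acid. It is one common way to summarize usage and is not the analysis from the paper. The function name codon_usage and the toy input are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(p): aa for p, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_usage(sequences):
    """Return each observed codon's fraction among its synonymous codons."""
    counts = Counter()
    for cds in sequences:
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[cds[i:i + 3]] += 1

    # Total observations per amino acid, summed over its synonymous codons.
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TO_AA[codon]] += n

    return {codon: n / totals[CODON_TO_AA[codon]] for codon, n in counts.items()}


# Toy input (invented): leucine appears as CTG three times and TTA once.
usage = codon_usage(["CTGCTGCTGTTA"])
print(usage["CTG"])  # 0.75
print(usage["TTA"])  # 0.25
```

Computed over the coding sequences of different organisms, tables like this differ from species to species. That organism-level pattern is the kind of usage bias the learned codon representations were compared against.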
The model also improved enzyme function prediction relative to its amino-acid-only starting point. The exact numbers are less important here than the direction of the result: codon awareness can make a protein representation more complete for some tasks.
Why This Matters
Codon-aware modeling is especially relevant for production and design.
If a team wants to express a therapeutic protein, enzyme, or designed binder in a particular organism, the amino acid sequence is only part of the story. The coding sequence can influence expression yield, folding efficiency, and manufacturing behavior.
A codon-aware model gives researchers a way to reason about those effects earlier in the design process.
Limitations
cdsBERT does not prove that codon-aware models improve every protein task. Many protein properties are still dominated by amino acid sequence, and codon data is harder to curate cleanly at scale.
The value is context-dependent. The strongest use cases are likely organism-specific expression, production optimization, and tasks where coding sequence carries biological signal beyond the translated protein.
This blog post summarizes work in the following paper:
cdsBERT - Extending Protein Language Models with Codon Awareness
Logan Hallee, Nikolaos Rafailidis, Jason P. Gleghorn
bioRxiv 2023.09.15.558027; doi: https://doi.org/10.1101/2023.09.15.558027