cdsBERT: Why Codons Still Matter for Protein AI

May 31st, 2026

Research

cdsBERT extends protein language modeling from amino acids to codons, capturing information about coding sequence, organism bias, and the hidden biology behind supposedly silent mutations.

Codons
Protein Language Models
Foundation Models

The Problem: Amino Acids Are Not the Whole Story

Most protein AI systems read proteins as strings of amino acids. That makes sense: amino acids are the direct building blocks of proteins, and they determine much of a protein's structure and function.

But biology does not start at the amino acid. It starts with coding sequence. Three DNA letters form a codon, and each codon maps to an amino acid or a stop signal. Because several codons can encode the same amino acid, a mutation that swaps one codon for a synonymous one leaves the protein sequence unchanged. Such mutations are called "silent."

The word silent can be misleading. Codon usage affects translation speed, organism-specific expression, mRNA behavior, and sometimes protein folding. Two genes can encode the same amino acid sequence while carrying different regulatory and evolutionary signals.
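To make the degeneracy described above concrete, here is a minimal Python sketch that translates two coding sequences with the standard genetic code (NCBI translation table 1). The two sequences and the `translate` helper are made up for illustration and do not come from the paper.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G order at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon; '*' marks a stop."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


# Two illustrative (made-up) coding sequences that differ only at
# synonymous positions: GCT vs GCC (alanine) and AAA vs AAG (lysine).
cds_a = "ATGGCTAAA"
cds_b = "ATGGCCAAG"

print(translate(cds_a), translate(cds_b))
print(cds_a == cds_b)
```

Running this prints `MAK MAK` followed by `False`: the two DNA strings differ, yet an amino-acid-only model would receive identical input for both.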

The cdsBERT work asked whether protein language models leave information on the table by ignoring that layer.

The Core Idea

cdsBERT extends a protein model's vocabulary from amino acids to codons. Instead of reading only the protein sequence, it reads the coding sequence that produced it.

That gives the model access to synonymous codon choices, organism-level usage patterns, and other signals that disappear when a coding sequence is translated into amino acids.
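As a rough illustration of what a codon-level vocabulary looks like, the sketch below builds 64 codon tokens plus two special tokens and encodes a coding sequence as a list of token IDs. The special tokens, ID ordering, and the `encode_codons` helper are assumptions made for this example; they are not the actual cdsBERT tokenizer.

```python
from itertools import product
from typing import List

# Hypothetical special tokens; the real cdsBERT vocabulary may differ.
SPECIAL_TOKENS = ["<pad>", "<unk>"]

# All 64 possible DNA triplets, one token per codon.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

# Token -> integer ID, special tokens first.
CODON_VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + CODONS)}


def encode_codons(cds: str) -> List[int]:
    """Split a coding sequence into codons and map each one to a token ID.

    Triplets containing anything other than A, C, G, T map to <unk>.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    unk = CODON_VOCAB["<unk>"]
    return [CODON_VOCAB.get(cds[i:i + 3], unk) for i in range(0, len(cds), 3)]


# Two made-up sequences that encode the same peptide with different codons.
ids_a = encode_codons("ATGGCTAAA")
ids_b = encode_codons("ATGGCCAAG")
print(ids_a)
print(ids_b)
```

The two ID lists share their first token (ATG) but differ afterwards, so a model reading these tokens can tell synonymous coding sequences apart, which an amino acid vocabulary cannot do.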

The paper introduced MELD, a training strategy for adapting an existing protein model into a codon-aware model. The important public-facing idea is not the training recipe. It is the biological claim: codons can carry useful information that amino acid strings compress away.

What the Paper Showed

cdsBERT learned codon-specific structure in its internal representation. Synonymous codons, which encode the same amino acid, did not remain interchangeable. They moved differently in the model's learned space.

That movement correlated with codon usage bias across broad phylogeny. In other words, the model appeared to learn a real biological pattern rather than just memorizing a larger alphabet.
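For readers unfamiliar with codon usage bias, one common way to quantify it is relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below computes RSCU on a toy input. It is offered only as an illustration of the concept; this post does not state which bias metric the paper used, and the `rscu` helper and toy sequence are made up.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid (or stop signal) they encode.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)


def rscu(cds_list):
    """Return {codon: RSCU} for every codon whose amino acid was observed."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        counts.update(cds[i:i + 3] for i in range(0, usable, 3))

    scores = {}
    for aa, codons in SYNONYMS.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # this amino acid never appears in the input
        expected = total / len(codons)  # count under uniform synonymous usage
        for c in codons:
            scores[c] = counts[c] / expected
    return scores


# Toy input: four alanine codons, three GCT and one GCC.
toy_scores = rscu(["GCTGCTGCTGCC"])
print({c: toy_scores[c] for c in ("GCT", "GCC", "GCA", "GCG")})
```

An RSCU above 1 means a codon is used more often than uniform synonymous usage would predict, and below 1 means less often. Organisms differ in these profiles, which is the kind of pattern the learned codon representations were reported to track.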

The model also improved enzyme function prediction relative to its amino-acid-only starting point. The exact benchmark numbers are less important than the direction of the result: adding codon awareness can make a protein representation more biologically complete.

Why This Matters

Codon-aware modeling is especially relevant for protein production and design.

If a team wants to express a therapeutic protein, enzyme, or designed binder in a particular organism, the amino acid sequence is only part of the story. The coding sequence can influence expression yield, folding efficiency, and manufacturing behavior.

A codon-aware model gives researchers a way to reason about those effects earlier in the design process. It opens a path toward models that understand both the protein product and the genetic instructions used to make it.

The Broader Impact

cdsBERT is a foundation-model idea, not just a one-off model. It suggests that the best biological representation may depend on the question being asked.

For structure, amino acids may be enough in many cases. For expression, organism-specific behavior, and production optimization, codons may matter. For regulation, DNA and RNA context may matter even more.

Synthyra's product direction follows that principle: extend and improve open research models for real workflows, rather than simply repackaging them. Codon-aware modeling points toward more complete protein design systems that can evaluate not only what a protein is, but how it will be made.

Limitations

cdsBERT does not prove that codon-aware models will improve every protein task. Many protein properties are still dominated by amino acid sequence. Codon data is also harder to curate cleanly at scale than protein sequence data.

The value is context-dependent. The strongest use cases are likely organism-specific expression, production optimization, and tasks where coding sequence carries biological signal beyond the translated protein.

This blog post summarizes work in the following paper:

cdsBERT - Extending Protein Language Models with Codon Awareness
Logan Hallee, Nikolaos Rafailidis, Jason P. Gleghorn
bioRxiv 2023.09.15.558027; doi: https://doi.org/10.1101/2023.09.15.558027
