cdsBERT: Why Codons Still Matter for Protein AI

cdsBERT extends protein language modeling from amino acids to codons, capturing information about coding sequence, organism bias, and the hidden biology behind supposedly silent mutations.

Codons
Protein Language Models
Foundation Models

By Logan Hallee

The Problem: Amino Acids Are Not the Whole Story

Most protein AI systems read proteins as amino acid strings. That makes sense: amino acids are the direct building blocks of proteins and determine much of structure and function.

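But biology does not start at the amino acid. It starts with the coding sequence. Three DNA letters form a codon, and because 64 codons encode only 20 amino acids, most amino acids can be written with more than one codon. A mutation that swaps a codon for a synonymous one leaves the protein sequence unchanged, which is why such mutations are called silent.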

Silent can be misleading. Codon usage affects translation speed, organism-specific expression, mRNA behavior, and sometimes protein folding. Two genes can encode the same amino acid sequence while carrying different regulatory and evolutionary signals.
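
To make the point concrete, here is a minimal Python sketch of two different coding sequences that translate to the same protein. It uses the standard genetic code; the two toy genes and the helper name `translate` are made up for illustration and do not come from the paper.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a DNA coding sequence into one-letter amino acids."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


# Two hypothetical genes that differ only in synonymous codon choices.
gene_a = "ATGGCTCTGAAA"
gene_b = "ATGGCGTTAAAG"

print(translate(gene_a))  # MALK
print(translate(gene_b))  # MALK
print(gene_a == gene_b)   # False
```

An amino-acid-only model sees both genes as the identical string MALK. Whatever distinguishes them is invisible at that level.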

cdsBERT asks whether protein language models leave useful information behind when they ignore that layer.

The Core Idea

cdsBERT extends a protein model's vocabulary from amino acids to codons. Instead of reading only the protein sequence, it reads the coding sequence that produced it.

That gives the model access to synonymous codon choices, organism-level usage patterns, and other signals that disappear when DNA is translated into amino acids.
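
As a rough illustration of what a codon-level vocabulary looks like, the sketch below splits a coding sequence into codon tokens and maps them to integer ids. The special tokens, the id ordering, and the function name `tokenize_cds` are assumptions made for this example; this is not cdsBERT's actual tokenizer.

```python
from itertools import product

# Hypothetical vocabulary: a few special tokens plus all 64 codons.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>"]
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + CODONS)}


def tokenize_cds(cds: str) -> list[int]:
    """Split a coding sequence into codons and map each to a token id."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Codons with ambiguous bases (for example "NNN") fall back to <unk>.
    ids = [VOCAB.get(codon, VOCAB["<unk>"]) for codon in codons]
    return [VOCAB["<cls>"]] + ids + [VOCAB["<eos>"]]


# Synonymous codons for alanine (GCT vs GCG) get different token ids.
ids_a = tokenize_cds("ATGGCT")
ids_b = tokenize_cds("ATGGCG")
print(ids_a != ids_b)  # True
```

With a 20-letter amino acid alphabet both inputs would be tokenized as M, A. A codon vocabulary keeps them distinct, so the model can learn a separate embedding for each synonymous codon.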

The paper introduced MELD, a training strategy for adapting an existing protein model into a codon-aware model. The underlying premise is simple: codons can carry useful information that amino acid strings compress away.

What the Paper Showed

cdsBERT learned codon-specific structure in its internal representation. Synonymous codons, which encode the same amino acid, did not remain interchangeable. They moved differently in the learned space.

That movement correlated with codon usage bias across broad phylogeny. In other words, the model appeared to learn a biological pattern rather than just memorize a larger alphabet.
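
For readers unfamiliar with codon usage bias, one simple way to describe it is the fraction of times each codon is used among the synonymous codons for its amino acid. The sketch below computes that statistic for a toy sequence. It is a generic descriptive measure chosen for illustration, not necessarily the one used in the paper, and the function name `codon_usage` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_usage(cds: str) -> dict[str, float]:
    """Return each codon's share among synonymous codons in the sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # Skip codons with ambiguous bases that are not in the table.
    counts = Counter(codon for codon in codons if codon in CODON_TABLE)

    # Total observations per amino acid, summed over its codons.
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n

    return {
        codon: n / per_amino_acid[CODON_TABLE[codon]]
        for codon, n in counts.items()
    }


# Toy sequence: four leucines, three written as CTG and one as TTA.
usage = codon_usage("CTGCTGCTGTTA")
print(usage)  # {'CTG': 0.75, 'TTA': 0.25}
```

Computed over the genes of a whole organism, these fractions differ from species to species. That organism-level preference is the kind of signal the learned codon embeddings were found to track.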

The model also improved enzyme function prediction relative to its amino-acid-only starting point. The exact numbers are less important here than the direction of the result: codon awareness can make a protein representation more complete for some tasks.

Why This Matters

Codon-aware modeling is especially relevant for production and design.

If a team wants to express a therapeutic protein, enzyme, or designed binder in a particular organism, the amino acid sequence is only part of the story. The coding sequence can influence expression yield, folding efficiency, and manufacturing behavior.

A codon-aware model gives researchers a way to reason about those effects earlier in the design process.

Limitations

cdsBERT does not prove that codon-aware models improve every protein task. Many protein properties are still dominated by amino acid sequence, and codon data is harder to curate cleanly at scale.

The value is context-dependent. The strongest use cases are likely organism-specific expression, production optimization, and tasks where coding sequence carries biological signal beyond the translated protein.

This blog post summarizes work in the following paper:

cdsBERT - Extending Protein Language Models with Codon Awareness
Logan Hallee, Nikolaos Rafailidis, Jason P. Gleghorn
bioRxiv 2023.09.15.558027; doi: https://doi.org/10.1101/2023.09.15.558027