DSM: Protein Generation with Masked Diffusion

Diffusion Sequence Models use masked diffusion to generate biomimetic proteins and support protein binder design, including wet-lab validation of EGFR and PD-L1 binders.

By Logan Hallee

The Problem: Understanding Proteins Is Not the Same as Creating Them

Protein language models are excellent at learning from existing sequences. They can produce embeddings that help with structure, function, classification, and interaction prediction.

Generation is harder. A useful protein generator must produce sequences that look biologically plausible, respect long-range constraints, and ideally connect to a desired function or binding target.

Many language generators write sequences left to right. Proteins do not behave that way. A residue near the beginning of a sequence can change what matters near the end because the final molecule folds into three dimensions.

DSM, or Diffusion Sequence Model, takes a different route.

The Core Idea

DSM uses masked diffusion for protein sequences. Instead of writing a sequence one token at a time, the model learns to recover protein sequences from heavily masked inputs.
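To make "heavily masked inputs" concrete, here is a minimal sketch of the corruption step, assuming a simple scheme where a fixed fraction of positions is replaced by a mask symbol. The function name `mask_sequence`, the `_` mask character, and the example sequence are illustrative assumptions, not details taken from the paper.

```python
import random

MASK = "_"  # assumed mask symbol, chosen only for this illustration


def mask_sequence(sequence, mask_rate, seed=0):
    """Replace a fraction of residues with MASK.

    Returns the corrupted sequence and the sorted masked positions.
    A model trained this way would be asked to recover the original
    residues at those positions.
    """
    rng = random.Random(seed)
    n_mask = round(len(sequence) * mask_rate)
    masked_positions = set(rng.sample(range(len(sequence)), n_mask))
    corrupted = "".join(
        MASK if i in masked_positions else residue
        for i, residue in enumerate(sequence)
    )
    return corrupted, sorted(masked_positions)


# Made-up 20-residue sequence, masked at 90 percent.
original = "MKTAYIAKQRQISFVKSHFS"
corrupted, positions = mask_sequence(original, mask_rate=0.9)
```

At a 90 percent mask rate only two residues of this toy sequence stay visible, which gives a sense of how little the model may have to work with.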

That gives the model global context while generating. It can consider the whole sequence as it fills in missing pieces, which is a natural fit for proteins because residues far apart in sequence may be close in structure.
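The generation side can be sketched as a loop that starts fully masked and fills in positions over several passes. This is only a toy under stated assumptions: `toy_denoiser` is a hypothetical stand-in that guesses at random, and the "reveal the most confident guesses first" schedule is a common choice for masked generation, not necessarily the one DSM uses.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "_"  # assumed mask symbol, chosen only for this illustration


def toy_denoiser(tokens, rng):
    """Hypothetical stand-in for a trained model.

    Returns {position: (residue, confidence)} for every masked position.
    A real model would condition on the whole partially masked sequence;
    this one guesses at random.
    """
    return {
        i: (rng.choice(AMINO_ACIDS), rng.random())
        for i, token in enumerate(tokens)
        if token == MASK
    }


def generate(length, steps, seed=0):
    """Start from an all-mask sequence and unmask it over `steps` passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        guesses = toy_denoiser(tokens, rng)
        if not guesses:
            break
        # Spread the remaining masks evenly over the remaining passes.
        remaining_steps = steps - step
        n_reveal = max(1, -(-len(guesses) // remaining_steps))  # ceiling division
        # Commit the most confident guesses; the rest stay masked for now.
        ranked = sorted(guesses.items(), key=lambda item: item[1][1], reverse=True)
        for position, (residue, _confidence) in ranked[:n_reveal]:
            tokens[position] = residue
    return "".join(tokens)


sequence = generate(length=30, steps=5)
```

The point of the loop is that every pass sees the whole sequence, including residues committed in earlier passes, rather than only the prefix to the left.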

The research goal was to build a model that could both understand proteins and create them.

What The Paper Showed

DSM generated biomimetic sequences whose amino acid patterns, predicted secondary structures, and predicted functional annotations resembled natural proteins, even when starting from high levels of masking.

The same model family also produced useful protein representations. That matters because generation and representation are often treated as separate capabilities. DSM points toward a single framework that can support both.

The paper then extended the system toward binder design. In one workflow, DSM generated candidate binders around known templates and used interaction prediction and structure checks to prioritize designs.
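A rough sketch of that kind of workflow follows: resample part of a template, score the candidates, and keep a shortlist. Everything here is an assumption for illustration. `propose_variants` uses random residues where a model would make predictions, and `placeholder_score` has no biological meaning; it only stands in for the interaction-prediction and structure checks mentioned above.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def propose_variants(template, n_variants, mask_rate, seed=0):
    """Resample a random subset of template positions for each variant.

    Random residues stand in for model predictions at the masked positions.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(template) * mask_rate))
    variants = []
    for _ in range(n_variants):
        positions = rng.sample(range(len(template)), n_mask)
        residues = list(template)
        for position in positions:
            residues[position] = rng.choice(AMINO_ACIDS)
        variants.append("".join(residues))
    return variants


def placeholder_score(candidate, template):
    """Hypothetical score: fraction of positions identical to the template.

    A real campaign would call an interaction predictor or structure check here.
    """
    matches = sum(a == b for a, b in zip(candidate, template))
    return matches / len(template)


def prioritize(candidates, score_fn, top_k):
    """Return the top_k candidates by descending score."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]


template = "MKTAYIAKQRQISFVKSHFS"  # made-up sequence, not a real binder
candidates = propose_variants(template, n_variants=10, mask_rate=0.3)
shortlist = prioritize(candidates, lambda s: placeholder_score(s, template), top_k=3)
```

Passing the scorer in as a function keeps the ranking step independent of whichever predictor or filter is actually used.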

Wet-Lab Signal

The most important piece of evidence is that the work moved beyond purely in silico ranking.

Forty designs, 20 for EGFR and 20 for PD-L1, were sent for expression, purification, and binding measurement. According to the reported results, many of the designs expressed and many bound their targets. The top EGFR design reached sub-nanomolar affinity and improved on the template used in the design campaign.

That does not mean DSM solves protein design. It means the system produced experimentally testable designs with real binding signal, which is the threshold that matters.

Why This Matters

Protein design often fails because the search space is too large. A generator can produce millions of sequences, but only a small number can be synthesized and tested.

DSM helps make that search more directed. It can generate plausible sequences, then connect to Synthyra capabilities such as Atlas, annotation, developability oracles, and structure checks to prioritize what should move forward.

The output is not experimental truth. It is a way to spend wet-lab effort where it is more likely to matter.

Limitations

DSM does not prove that a design will express, fold, bind, or function in a cell. It prioritizes candidates. Consequential designs still require responsible review, orthogonal computational checks, and experimental validation.

This blog post summarizes work in the following paper:

Diffusion Sequence Models for Enhanced Protein Representation and Generation
Logan Hallee, Nikolaos Rafailidis, David B. Bichara, Jason P. Gleghorn
arXiv preprint, June 2025
