Synthyra

May 31st, 2026

Research

DSM: Protein Generation with Masked Diffusion

Diffusion Sequence Models use masked diffusion to generate biomimetic proteins and support protein binder design, including wet-lab validation of EGFR and PD-L1 binders.

  • Protein Generation
  • DSM
  • Protein Design

The Problem: Understanding Proteins Is Not the Same as Creating Them

Protein language models are excellent at learning from existing sequences. They can produce embeddings that help with structure, function, classification, and interaction prediction.

Generation is harder. A useful protein generator must produce sequences that look biologically plausible, respect long-range constraints, and ideally connect to a desired function or binding target.

Many sequence generators write autoregressively, left to right, so each residue is chosen without seeing what comes after it. Protein constraints do not follow that order. A mutation near the beginning of a sequence can change what matters near the end because the final molecule folds into three dimensions.

DSM, or Diffusion Sequence Model, takes a different approach.

The Core Idea

DSM uses masked diffusion for protein sequences. Instead of writing a sequence one token at a time from left to right, the model learns to recover protein sequences from heavily masked inputs.
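To make the idea of "recovering sequences from heavily masked inputs" concrete, here is a minimal sketch of how a masked training example could be built. It is an illustration under stated assumptions, not the paper's implementation: the mask symbol `#`, the uniform sampling of the mask rate, and the function names are all hypothetical choices made for this example.

```python
import random

MASK = "#"  # hypothetical mask symbol, chosen only for this illustration


def mask_sequence(seq, mask_rate, rng):
    """Replace each residue with MASK independently with probability mask_rate."""
    return "".join(MASK if rng.random() < mask_rate else aa for aa in seq)


def training_example(seq, rng):
    """Build one (corrupted input, targets) pair.

    Assumption: the mask rate is drawn uniformly from [0, 1), so some examples
    are lightly masked and others are almost entirely masked.
    """
    mask_rate = rng.random()
    corrupted = mask_sequence(seq, mask_rate, rng)
    # The model would be trained to predict the original residue at each masked position.
    targets = {i: aa for i, (aa, c) in enumerate(zip(seq, corrupted)) if c == MASK}
    return corrupted, targets


if __name__ == "__main__":
    rng = random.Random(0)
    corrupted, targets = training_example("MKTAYIAKQRQISFVKSHFSRQ", rng)  # made-up sequence
    print(corrupted)
    print(targets)
```

A denoising model trained on pairs like these sees every unmasked residue at once, on both sides of each gap, rather than only the residues to its left.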

That means the model can use global context while generating. It can consider the whole sequence as it fills in missing pieces, which is a natural fit for proteins because residues far apart in sequence may be close in structure.
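One common way to sample from a masked model is to unmask in rounds: propose residues for every masked position using the full current sequence, commit the most confident proposals, and repeat. The sketch below shows that loop. The sampling schedule DSM actually uses is not described in this post, so the schedule here is an assumption, and `toy_predictor` is a random stand-in for a trained model, not DSM itself.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical mask symbol


def toy_predictor(tokens, rng):
    """Stand-in for a trained denoiser (assumption, not DSM).

    Returns {position: (residue, confidence)} for every masked position.
    A real model would condition these proposals on the whole sequence.
    """
    return {
        i: (rng.choice(AMINO_ACIDS), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def generate(tokens, steps, rng, predictor=toy_predictor):
    """Fill masked positions over at most `steps` rounds, most confident first."""
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Spread the remaining masked positions evenly over the remaining rounds.
        n_commit = math.ceil(len(masked) / (steps - step))
        proposals = predictor(tokens, rng)
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = proposals[i][0]
    return "".join(tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    print(generate(MASK * 30, steps=5, rng=rng))    # generation from a fully masked input
    print(generate("MK####LV", steps=2, rng=rng))   # infilling around fixed residues
```

Because positions are committed by confidence rather than by index, the order of generation is free to follow whatever the model finds easiest to decide, not the order of the sequence.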

The research goal was to build a model that could both understand proteins and create them.

What the Paper Showed

DSM generated biomimetic sequences whose amino acid patterns, predicted secondary structures, and predicted functional annotations resembled natural proteins, even when starting from very high levels of masking.

The same model family also produced useful protein representations. That is important because generation and representation are often treated as separate capabilities. DSM points toward a single framework that can support both.

The paper then extended the system toward binder design. In one workflow, DSM generated candidate binders around known templates and used interaction prediction and structure checks to prioritize designs.
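The shape of that workflow (vary a template, filter, rank) can be sketched in a few lines. Everything below is hypothetical: `propose_variant` re-samples positions at random where DSM would use a trained model, the template is a made-up sequence rather than a real binder, and both scoring functions are arbitrary placeholders with no biological meaning, standing in for an interaction predictor and a structure check.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def propose_variant(template, mask_rate, rng):
    """Toy stand-in for the generator: re-sample a random subset of template positions."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < mask_rate else aa for aa in template
    )


def prioritize(candidates, interaction_score, structure_ok, top_k):
    """Keep candidates that pass the structure check, ranked by predicted interaction."""
    passing = [c for c in candidates if structure_ok(c)]
    return sorted(passing, key=interaction_score, reverse=True)[:top_k]


# Placeholder scorers (assumptions for illustration only).
def toy_interaction_score(seq):
    return seq.count("W") + seq.count("Y")


def toy_structure_ok(seq):
    return seq.count("P") <= 2


rng = random.Random(0)
template = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence, not a real binder
# Deduplicate, then sort so the run is reproducible.
candidates = sorted({propose_variant(template, 0.3, rng) for _ in range(200)})
shortlist = prioritize(candidates, toy_interaction_score, toy_structure_ok, top_k=20)

for seq in shortlist[:5]:
    print(seq, toy_interaction_score(seq))
```

The design choice worth noting is the separation of concerns: generation, filtering, and ranking are independent steps, so any one of them can be swapped for a stronger model without changing the rest of the pipeline.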

Wet-Lab Signal

The most important part is that the work moved beyond purely in silico ranking.

Forty designs, 20 for EGFR and 20 for PD-L1, were sent for expression, purification, and binding measurement. Many expressed, and many bound their targets. The top EGFR design reached sub-nanomolar affinity and improved on the template used in the design campaign.

That does not mean DSM "solves" protein design. It means the system produced experimentally testable designs with real binding signal, which is the threshold that matters.

Why This Matters

Protein design often fails because the search space is too large. A generator can produce millions of sequences, but only a tiny number can be synthesized and tested.

DSM helps make that search more directed. It can generate plausible sequences, then connect to other Synthyra capabilities such as Atlas PPI and structural checks to prioritize what should go to the lab.

This is the direction we care about: fast computational exploration, followed by focused experimental validation.

What This Enables

DSM supports a future where researchers can move from a target or functional concept to a ranked set of protein designs more quickly.

For therapeutics, that can mean binder discovery. For enzymes, it can mean exploring variants around a functional scaffold. For materials and sustainability, it can mean searching protein space for new behaviors while keeping experimental campaigns manageable.

The model does not prove that a design will express, fold, bind, or function in a cell. It prioritizes candidates so teams can spend wet-lab effort where it is most likely to matter.


This blog post summarizes work in the following paper:

Diffusion Sequence Models for Enhanced Protein Representation and Generation
Logan Hallee, Nikolaos Rafailidis, David B. Bichara, Jason P. Gleghorn
Manuscript draft, 2026
