May 31st, 2026
DSM: Protein Generation with Masked Diffusion
Diffusion Sequence Models use masked diffusion to generate biomimetic proteins and support protein binder design, including wet-lab validation of EGFR and PD-L1 binders.
- Protein Generation
- DSM
- Protein Design
The Problem: Understanding Proteins Is Not the Same as Creating Them
Protein language models are excellent at learning from existing sequences. They can produce embeddings that help with structure, function, classification, and interaction prediction.
Generation is harder. A useful protein generator must produce sequences that look biologically plausible, respect long-range constraints, and ideally connect to a desired function or binding target.
Many language models generate sequences autoregressively, one token at a time from left to right. Proteins do not work that way: a mutation near the beginning of a sequence can change what matters near the end, because the chain folds into a three-dimensional structure.
DSM, or Diffusion Sequence Model, takes a different approach.
The Core Idea
DSM uses masked diffusion for protein sequences. Instead of writing a sequence one token at a time from left to right, the model learns to recover protein sequences from heavily masked inputs.
That means the model can use global context while generating. It can consider the whole sequence as it fills in missing pieces, which is a natural fit for proteins because residues far apart in sequence may be close in structure.
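To make the idea concrete, here is a minimal sketch of masking and iterative unmasking on a toy sequence. It is not DSM's implementation: `toy_denoiser` is a hypothetical stand-in that guesses residues at random, the `_` mask symbol is arbitrary, and the confidence-ordered reveal schedule is an assumption chosen for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "_"  # arbitrary mask symbol for this sketch


def mask_sequence(seq, mask_rate, rng):
    """Corrupting step: hide a random fraction of residues behind MASK."""
    n_mask = round(len(seq) * mask_rate)
    positions = set(rng.sample(range(len(seq)), n_mask))
    return [MASK if i in positions else aa for i, aa in enumerate(seq)]


def toy_denoiser(tokens, rng):
    """Hypothetical stand-in for a trained model.

    Returns {position: (proposed_residue, confidence)} for every masked
    position. A real model would condition on the whole of `tokens`;
    this one guesses at random.
    """
    return {
        i: (rng.choice(AMINO_ACIDS), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def iterative_unmask(tokens, steps, rng):
    """Reveal masked positions over several passes, most confident first."""
    tokens = list(tokens)
    for step in range(steps):
        proposals = toy_denoiser(tokens, rng)
        if not proposals:
            break
        remaining_steps = steps - step
        # Ceiling division: spread the remaining masks over the remaining passes.
        k = -(-len(proposals) // remaining_steps)
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (aa, _conf) in ranked[:k]:
            tokens[i] = aa
    return "".join(tokens)


rng = random.Random(0)
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence, not from the paper
masked = mask_sequence(seq, mask_rate=0.9, rng=rng)
generated = iterative_unmask(masked, steps=4, rng=rng)
```

Every pass sees the full partially filled sequence, so a position near the end can be committed before one near the start. That is the property that separates this style of generation from left-to-right decoding.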
The research goal was to build a model that could both understand proteins and create them.
What The Paper Showed
DSM generated biomimetic sequences whose amino acid patterns, predicted secondary structures, and predicted functional annotations resembled natural proteins, even when starting from very high levels of masking.
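One simple way to check whether generated sequences have natural-looking amino acid patterns is to compare residue frequencies between a generated set and a natural reference set. The sketch below uses total variation distance, which is an assumption made for illustration; this summary does not state which metrics the paper used, and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seqs):
    """Fraction of each of the 20 standard amino acids across a set of sequences."""
    counts = Counter(aa for s in seqs for aa in s if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def total_variation(p, q):
    """Total variation distance: 0 for identical compositions, 1 for disjoint ones."""
    return 0.5 * sum(abs(p[aa] - q[aa]) for aa in AMINO_ACIDS)
```

A value near 0 means the two sets use amino acids in similar proportions. It says nothing about residue order, so it is a sanity check rather than evidence that a sequence folds or functions.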
The same model family also produced useful protein representations. That is important because generation and representation are often treated as separate capabilities. DSM points toward a single framework that can support both.
The paper then extended the system toward binder design. In one workflow, DSM generated candidate binders around known templates and used interaction prediction and structure checks to prioritize designs.
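The shape of that workflow (propose variants around a template, score them, filter, rank) can be sketched as follows. Everything here is a placeholder: `propose_variant` resamples positions at random instead of calling a generative model, `toy_interaction_score` and `toy_structure_check` are hypothetical stand-ins for an interaction predictor and a structure check, and the example sequences are made up.

```python
import random
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


@dataclass
class Candidate:
    sequence: str
    interaction_score: float
    passes_structure_check: bool


def propose_variant(template, mask_rate, rng):
    """Hypothetical generator stand-in: resample a random fraction of positions."""
    n = round(len(template) * mask_rate)
    positions = set(rng.sample(range(len(template)), n))
    return "".join(
        rng.choice(AMINO_ACIDS) if i in positions else aa
        for i, aa in enumerate(template)
    )


def toy_interaction_score(binder, target):
    """Placeholder score: fraction of binder residues that also occur in the target."""
    target_residues = set(target)
    return sum(aa in target_residues for aa in binder) / len(binder)


def toy_structure_check(binder):
    """Placeholder filter: reject sequences with four identical residues in a row."""
    return all(aa * 4 not in binder for aa in AMINO_ACIDS)


def prioritize(template, target, n_candidates, top_k, rng, mask_rate=0.3):
    """Generate variants, drop those failing the check, return the top_k by score."""
    pool = {}  # keyed by sequence to remove duplicates
    for _ in range(n_candidates):
        seq = propose_variant(template, mask_rate, rng)
        pool[seq] = Candidate(
            seq, toy_interaction_score(seq, target), toy_structure_check(seq)
        )
    kept = [c for c in pool.values() if c.passes_structure_check]
    kept.sort(key=lambda c: c.interaction_score, reverse=True)
    return kept[:top_k]


rng = random.Random(1)
shortlist = prioritize(
    template="MKTAYIAKQRQISFVKSHFS",  # made-up template
    target="ACDEFGHIKL",              # made-up target
    n_candidates=50,
    top_k=5,
    rng=rng,
)
```

The `top_k` cutoff stands in for the experimental budget: the generator can propose far more sequences than can be synthesized, so only the highest-ranked designs that pass the filter move forward.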
Wet-Lab Signal
The most important part is that the work moved beyond purely in silico ranking.
Forty designs, 20 for EGFR and 20 for PD-L1, were sent for expression, purification, and binding measurement. Many expressed, and many bound their targets. The top EGFR design reached sub-nanomolar affinity and improved on the template used in the design campaign.
That does not mean DSM "solves" protein design. It means the system produced experimentally testable designs with real binding signal, which is the threshold that matters.
Why This Matters
Protein design often fails because the search space is too large. A generator can produce millions of sequences, but only a tiny number can be synthesized and tested.
DSM helps make that search more directed. It can generate plausible sequences, then connect to other Synthyra capabilities such as Atlas PPI and structural checks to prioritize what should go to the lab.
This is the direction we care about: fast computational exploration, followed by focused experimental validation.
What This Enables
DSM supports a future where researchers can move from a target or functional concept to a ranked set of protein designs more quickly.
For therapeutics, that can mean binder discovery. For enzymes, it can mean exploring variants around a functional scaffold. For materials and sustainability, it can mean searching protein space for new behaviors while keeping experimental campaigns manageable.
The model does not prove that a design will express, fold, bind, or function in a cell. It prioritizes candidates so teams can spend wet-lab effort where it is most likely to matter.
This blog post summarizes work in the following paper:
Diffusion Sequence Models for Enhanced Protein Representation and Generation
Logan Hallee, Nikolaos Rafailidis, David B. Bichara, Jason P. Gleghorn
Manuscript draft, 2026