Synteract-4: Mapping Protein Interactions from Sequence Alone

May 31st, 2026

Synteract-4 is a sequence-only protein interaction model that maps likely protein-protein interactions at proteome scale, using carefully controlled training to avoid shortcuts and prioritize biologically useful hypotheses.

The Problem: Proteins Do Not Work Alone

Most biology happens through relationships. A protein binds, signals to, modifies, recruits, blocks, or stabilizes other proteins. These protein-protein interactions form the working circuitry of a cell.

The hard part is scale. The human proteome contains roughly 20,000 proteins, counting one per protein-coding gene. That gives about 200 million possible pairs, and only a small fraction has been tested directly. Wet-lab methods are still essential, but they are too slow and expensive to use as the first pass over every possible interaction.
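To make the scale concrete, the number of unordered pairs among n proteins is n(n-1)/2. The short calculation below uses the rough 20,000 figure from the text. It counts each pair once and leaves out self-pairs, which is a simplifying assumption since proteins can also interact with copies of themselves.

```python
import math

# Rough proteome size quoted in the text (one protein per gene).
n_proteins = 20_000

# Unordered pairs of distinct proteins: n * (n - 1) / 2.
unordered_pairs = math.comb(n_proteins, 2)

print(f"{unordered_pairs:,} possible pairs")  # prints "199,990,000 possible pairs"
```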

Synteract-4 was built for that first pass. It asks a simple question: can we look at protein sequences alone and produce a useful map of which proteins are most likely to interact?

Synteract-4 Reframes Interaction Prediction

Earlier protein interaction models often treated each pair as a separate classification problem. Take protein A, take protein B, run a large model, return a yes-or-no prediction.

Synteract-4 takes a different view. It learns an interaction-aware space where proteins that are likely to bind or participate in the same molecular relationship are placed closer together. Once proteins are embedded into that space, scoring a whole proteome becomes a fast comparison between vectors.
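As a rough illustration of what "a fast comparison between vectors" can look like, the sketch below scores every pair in a tiny set of made-up embeddings and ranks the pairs. Cosine similarity, the three-dimensional vectors, and the names `normalize` and `score_all_pairs` are all assumptions for this example. The actual scoring function and embedding size used by Synteract-4 are not described here.

```python
import math
from itertools import combinations

def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def score_all_pairs(embeddings):
    """Return a cosine similarity score for every unordered pair of proteins."""
    unit = {name: normalize(vec) for name, vec in embeddings.items()}
    scores = {}
    for a, b in combinations(sorted(unit), 2):
        scores[(a, b)] = sum(x * y for x, y in zip(unit[a], unit[b]))
    return scores

# Toy embeddings (hypothetical values, not model output).
toy_embeddings = {
    "P1": [1.0, 0.0, 0.0],
    "P2": [0.9, 0.1, 0.0],
    "P3": [0.0, 1.0, 0.0],
}

scores = score_all_pairs(toy_embeddings)
# Highest-scoring pairs first: these would be the top candidates to inspect.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for pair, score in ranked:
    print(pair, round(score, 3))
```

Each protein is embedded once, and every pair score after that is a single dot product. At real proteome scale the same idea would normally be run as a matrix multiplication rather than a Python loop.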

That sounds technical, but the practical effect is simple: Synteract-4 can move from "score this one pair" to "map this whole interaction landscape."

This is the core innovation behind Atlas PPI, the production-facing Synthyra model powered by this line of work. Atlas PPI uses the Synteract-4 framing to turn protein sequences into large-scale interaction maps that researchers can inspect, query, and use to prioritize experiments.

Why Sequence-Only Matters

Many modern protein interaction systems use extra information, such as structure, genomic context, or organism-specific training data. Those signals are valuable when they are available. But they are not always available.

Designed proteins may not have natural genomic context. New pathogens may have sparse annotations. De novo sequences may not resemble anything in existing databases. In those settings, a model that depends heavily on context from natural biology can lose coverage.

Synteract-4 is intentionally lighter. It starts with raw amino acid sequence, passes each protein through a frozen protein encoder, and trains smaller interaction-specific towers on top. The large sequence model supplies general protein knowledge. The Synteract-4 training teaches the system how to use that knowledge for interaction mapping.
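The division of labor between a frozen encoder and a small trainable tower can be sketched in a few lines. Everything below is a stand-in. `frozen_encoder` uses amino acid composition in place of a real protein language model, `Tower` is a single random linear layer, one shared tower is used for both proteins, and cosine similarity is an assumed scoring function. The sketch shows only the forward pass and where the trainable parameters sit, not the Synteract-4 architecture or its training procedure.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frozen_encoder(sequence):
    """Hypothetical stand-in for a frozen protein encoder.

    Returns amino acid composition; it has no trainable parameters,
    mirroring the fact that the large encoder is never updated.
    """
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]

class Tower:
    """Small projection on top of frozen features; only these weights would train."""

    def __init__(self, in_dim, out_dim, seed):
        rng = random.Random(seed)
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)
        ]

    def __call__(self, features):
        return [sum(w * x for w, x in zip(row, features)) for row in self.weights]

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def interaction_score(seq_a, seq_b, tower):
    """Embed both sequences with the frozen encoder, project them, and compare."""
    return cosine(tower(frozen_encoder(seq_a)), tower(frozen_encoder(seq_b)))

tower = Tower(in_dim=len(AMINO_ACIDS), out_dim=8, seed=0)
# Two short made-up sequences, used only to exercise the code path.
score = interaction_score("MKTAYIAKQR", "MSDNELLKKA", tower)
print(round(score, 3))
```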

That design makes the model practical for natural proteins, engineered proteins, and early-stage design campaigns where researchers may only have sequences.

Avoiding the Easy Shortcut

Protein interaction datasets contain a subtle trap. Real interactions usually come from proteins within the same species. Random negative examples often pair proteins from different species. A model can exploit that pattern by learning species difference instead of interaction biology.

This is the "accidental taxonomist" problem: a protein language model can appear strong because it learned organism identity, not because it learned which proteins interact.

Synteract-4 is built with that lesson in mind. Its training uses same-species negative controls and cluster-aware safeguards so the model cannot win by taking the easy taxonomic shortcut. This matters because a useful interaction model should help researchers discover biology, not rediscover dataset artifacts.
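A minimal sketch of same-species negative sampling, under stated assumptions, might look like the following. The function name `sample_same_species_negatives`, the toy protein identifiers, and the species labels are all hypothetical. The cluster-aware safeguards mentioned above are not shown. Pairs without a recorded interaction are treated as negatives here, which is a common working assumption rather than proof that they do not interact.

```python
import random

def sample_same_species_negatives(positives, species_of, n_negatives, seed=0):
    """Sample negative pairs whose two proteins share a species.

    Known positive pairs are excluded, so species identity alone
    cannot separate positives from negatives.
    """
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positives}

    # Group proteins by species.
    by_species = {}
    for protein, species in species_of.items():
        by_species.setdefault(species, []).append(protein)

    # Enumerate within-species pairs that are not known positives.
    candidates = []
    for proteins in by_species.values():
        proteins = sorted(proteins)
        for i, a in enumerate(proteins):
            for b in proteins[i + 1:]:
                if frozenset((a, b)) not in known:
                    candidates.append((a, b))

    rng.shuffle(candidates)
    return candidates[:n_negatives]

# Toy data (hypothetical identifiers).
species_of = {"A1": "human", "A2": "human", "A3": "human", "B1": "yeast", "B2": "yeast"}
positives = [("A1", "A2"), ("B1", "B2")]

negatives = sample_same_species_negatives(positives, species_of, n_negatives=2)
print(negatives)
```

Enumerating every candidate pair is fine for a toy example. A real dataset would need a sampling strategy that avoids building the full list.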

What It Achieves

On a gold-standard protein interaction benchmark designed to reduce leakage and shortcuts, Synteract-4 beats the best published comparator by a 13% margin. More importantly, it does this while staying sequence-only.

At proteome scale, Synteract-4 is competitive with richer-input systems that use organism-specific supervision. It can screen whole proteomes, compare likely interaction neighborhoods, and support network-level analysis without requiring a new model for every species.

The work also connects back to the lab. In a cardiac HSP90 study, Synteract-4 recovered many newly measured HSP90 partners that were absent from prior interaction databases. That is the kind of signal we care about most: not just repeating known biology, but helping prioritize plausible new biology for follow-up.

What This Enables

For researchers, Synteract-4 makes protein interaction prediction feel less like a single-pair assay and more like a searchable map.

You can start from a protein of interest and ask which partners are most likely to matter. You can compare designed proteins against a proteome before synthesis. You can inspect whether a candidate therapeutic might touch pathways outside its intended target. You can use interaction neighborhoods as a first layer of functional context for proteins that are poorly annotated.

None of this replaces wet-lab validation. The model does not prove binding, measure affinity, or capture every cellular condition. It narrows the search space. It helps decide which experiments are worth doing first.

That is the point of Synthyra's Atlas platform: move from isolated predictions to interpretable, proteome-scale hypothesis generation.

The Field Impact

The broader message of Synteract-4 is that protein interaction prediction can be treated as representation learning. Once the model learns a useful interaction space, the expensive part is embedding the proteins. After that, interaction scoring is fast enough to support whole-proteome maps.

This changes the shape of the problem. Instead of asking whether computational PPI prediction can score one pair at a time, we can ask what becomes possible when interaction maps are cheap enough to generate routinely.

For drug discovery, that means faster target and off-target exploration. For protein design, it means earlier safety and specificity checks. For basic biology, it means a new way to explore the hidden wiring of the cell.

This blog post summarizes work in the following paper:

Sequence-Only Interactome-Scale Prediction of Protein-Protein Interactions
Logan Hallee, Richard Roberts, Sujoita Sen, Nikolaos Rafailidis, Tamar Peleg, Halley Echols, Chi Keung Lam, Jason P. Gleghorn
Manuscript draft, 2026
