Synteract: Predicting Protein Interactions from Sequence

May 31st, 2026

Synteract was the first step in our protein interaction research line, showing that sequence-based AI could prioritize likely protein-protein interactions before expensive wet-lab validation.

The Problem: Proteins Rarely Act Alone

Proteins are often described as the workhorses of biology, but the better metaphor is a network. They bind, recruit, block, modify, stabilize, and regulate one another. Those protein-protein interactions, or PPIs, are central to cell signaling, disease biology, immune response, and drug discovery.

The challenge is that interaction maps are still incomplete. Wet-lab and in vitro assays remain the gold standard, but testing every possible pair is too slow and expensive. A single organism can contain thousands of proteins, which means millions of possible pairs.

Synteract asked whether a protein language model could help narrow that search. Given two amino acid sequences, could the model estimate whether the proteins are likely to interact?
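
To make the scale of that search concrete, here is a minimal sketch of the pair-count arithmetic: n proteins give n(n-1)/2 distinct unordered pairs. The proteome sizes below are illustrative assumptions, not figures from the Synteract work, and the function name is hypothetical.

```python
def count_unordered_pairs(n_proteins: int, include_self: bool = False) -> int:
    """Number of distinct unordered protein pairs among n_proteins proteins."""
    if n_proteins < 0:
        raise ValueError("n_proteins must be non-negative")
    pairs = n_proteins * (n_proteins - 1) // 2
    if include_self:
        # Optionally count each protein paired with itself (homomeric pairs).
        pairs += n_proteins
    return pairs


if __name__ == "__main__":
    # Illustrative proteome sizes only (assumed for this example).
    for n in (1_000, 5_000, 20_000):
        print(f"{n:>6} proteins -> {count_unordered_pairs(n):,} candidate pairs")
    # 1,000 proteins  -> 499,500 pairs
    # 5,000 proteins  -> 12,497,500 pairs
    # 20,000 proteins -> 199,990,000 pairs
```

Because the count grows quadratically, doubling the number of proteins roughly quadruples the number of pairs to test, which is why exhaustive experimental screening does not scale.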

The First Synteract Idea

The first Synteract model treated PPI prediction as a sequence understanding problem. It used a large protein language model that had already learned broad patterns from protein sequences, then adapted that knowledge to distinguish likely interactors from likely non-interactors.

That framing mattered. It showed that a model did not need a solved structure, a hand-built feature set, or organism-specific pathway information to produce useful interaction signals. The amino acid sequence itself contained enough information to make the problem tractable.

For researchers, the practical value was straightforward: use in silico prediction to prioritize which interactions deserve experimental follow-up.

The Negative Example Problem

PPI modeling has a quiet data problem. Databases contain many examples of proteins that interact, but very few experimentally verified examples of proteins that do not interact. You can observe a binding event, but proving that two proteins never interact under any condition is much harder.

Synteract tackled that imbalance by building synthetic negative examples and testing whether those examples could help the model learn a useful boundary between interacting and non-interacting pairs.

The result was encouraging, but it also exposed a broader lesson: how you choose negative examples can define what the model learns.
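
As a rough illustration of what a synthetic negative can look like, the sketch below uses the simplest common strategy: randomly pair proteins and keep any pair that is not in the known-positive set. This is an assumption made for illustration. This post does not spell out the exact procedure Synteract used, and an unlabeled pair is not a verified non-interactor. The function name and protein IDs are hypothetical.

```python
import random


def sample_synthetic_negatives(positives, n_negatives, seed=0):
    """Draw random protein pairs that do not appear among the known positives.

    positives: iterable of (protein_a, protein_b) tuples known to interact.
    Returns a sorted list of (protein_a, protein_b) tuples.
    """
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positives}
    proteins = sorted({protein for pair in positives for protein in pair})

    # Check that enough non-positive pairs exist, so the loop below terminates.
    total_pairs = len(proteins) * (len(proteins) - 1) // 2
    available = total_pairs - sum(1 for pair in known if len(pair) == 2)
    if n_negatives > available:
        raise ValueError(f"only {available} unlabeled pairs are available")

    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        candidate = frozenset((a, b))
        if candidate not in known:
            negatives.add(candidate)
    return sorted(tuple(sorted(pair)) for pair in negatives)


if __name__ == "__main__":
    known_positives = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
    print(sample_synthetic_negatives(known_positives, n_negatives=3))
    # [('P1', 'P3'), ('P1', 'P4'), ('P2', 'P4')]
```

Even this small sketch makes the design choice visible: which proteins are eligible to be paired determines what distinguishes negatives from positives, and that difference is exactly what a model will try to exploit.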

A Warning About Shortcuts

One of the most important findings from the early Synteract work was that some existing PPI datasets could reward the wrong behavior. If negative examples are created by pairing proteins from different cellular compartments, a model can look impressive while learning localization patterns instead of interaction biology.

That insight became a recurring theme in the Synteract research line. A useful model should not succeed because it found an artifact in the dataset. It should succeed because it learned something that helps researchers reason about real biological interactions.

This concern later grew into the Accidental Taxonomists work and the same-species controls used in Synteract-4.
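
One way to probe a dataset for the localization shortcut is to score a trivial baseline that never looks at sequence: predict "interacts" only when both proteins share a compartment label. The sketch below assumes each protein has a single compartment annotation. The function name, labels, and toy pairs are hypothetical and are not taken from the Synteract datasets.

```python
def localization_baseline_accuracy(pairs, compartment):
    """Accuracy of predicting 'interacts' iff both proteins share a compartment.

    pairs: list of (protein_a, protein_b, label), with label 1 for interacting.
    compartment: dict mapping protein ID to a compartment name.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = 0
    for a, b, label in pairs:
        predicted = int(compartment[a] == compartment[b])
        correct += int(predicted == label)
    return correct / len(pairs)


if __name__ == "__main__":
    compartment = {
        "A": "nucleus", "B": "nucleus", "E": "nucleus",
        "C": "membrane", "D": "membrane", "F": "membrane",
    }
    # Negatives built by pairing proteins across compartments.
    cross_compartment = [("A", "B", 1), ("C", "D", 1), ("A", "C", 0), ("B", "D", 0)]
    # Negatives built by pairing proteins within the same compartment.
    same_compartment = [("A", "B", 1), ("C", "D", 1), ("A", "E", 0), ("C", "F", 0)]

    print(localization_baseline_accuracy(cross_compartment, compartment))  # 1.0
    print(localization_baseline_accuracy(same_compartment, compartment))   # 0.5
```

If a baseline like this scores far above chance, high accuracy from a sequence model on the same data says little about whether it learned interaction biology.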

Why It Mattered

Synteract was not the final answer. It was the proof that sequence-only interaction prediction was worth pursuing.

It showed that modern protein language models could support PPI prediction across biologically diverse data. It also showed that these systems must be evaluated carefully, because apparent performance can hide dataset shortcuts.

That combination, possibility plus skepticism, shaped the later Synthyra approach. Atlas PPI is not a simple port of Synteract. It is the production-facing continuation of this research direction, extended with improved modeling, safeguards, calibration, API infrastructure, and network-scale analysis.

What It Enables

The long-term goal is not to replace experiments. It is to make experiments more efficient.

A sequence-based PPI model can help a team choose which candidate binders to synthesize, which off-targets to inspect, which disease pathway partners to prioritize, or which newly sequenced proteins deserve closer study. In a world where generative models can create enormous numbers of candidate proteins, this kind of triage becomes essential.
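
In practice, triage of this kind can be as simple as ranking candidate pairs by predicted score and keeping as many as the experimental budget allows. The sketch below assumes the scores have already been produced by some model. `ScoredPair`, `prioritize`, and the example values are hypothetical and do not represent the Synteract or Atlas API.

```python
from typing import NamedTuple


class ScoredPair(NamedTuple):
    protein_a: str
    protein_b: str
    score: float  # assumed: higher means more likely to interact


def prioritize(scored, budget, min_score=0.0):
    """Return up to `budget` pairs with score >= min_score, highest score first."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    kept = [pair for pair in scored if pair.score >= min_score]
    kept.sort(key=lambda pair: pair.score, reverse=True)
    return kept[:budget]


if __name__ == "__main__":
    candidates = [
        ScoredPair("A", "B", 0.91),
        ScoredPair("A", "C", 0.12),
        ScoredPair("B", "D", 0.77),
        ScoredPair("C", "D", 0.55),
    ]
    for pair in prioritize(candidates, budget=2, min_score=0.5):
        print(pair.protein_a, pair.protein_b, pair.score)
    # A B 0.91
    # B D 0.77
```

The budget is the point: the model does not decide what is true, it decides what gets tested first.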

Synteract began that path by showing that protein interaction prediction could move from slow pairwise biology toward fast, scalable hypothesis generation.

This blog post summarizes work in the following paper:

Logan Hallee, Jason P. Gleghorn. Protein-Protein Interaction Prediction is Achievable with Large Language Models. bioRxiv 2023.06.07.544109. doi: https://doi.org/10.1101/2023.06.07.544109
