Atlas: Making Protein Screens Searchable

Atlas connects sequence-only interaction prediction, ligand prioritization, and functional annotation so researchers can move from a protein sequence to a ranked set of hypotheses.

Atlas
Protein Protein Interaction
Drug Discovery
Protein Function

By Logan Hallee

Proteins are difficult to study one at a time because they rarely behave one at a time. They bind, recruit, block, localize, stabilize, and modify one another. A disease phenotype is often not the consequence of a single molecule, but of a set of relationships that has shifted.

That is the scale Atlas is built for. A proteome of 20,000 proteins contains roughly 200 million possible protein pairs, and that is before adding pathogen proteins, designed proteins, small molecules, or functional annotations. The useful object is not a single score. It is a map that helps decide which part of the search space deserves attention.
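The pair count follows from counting unordered pairs, n(n-1)/2. A quick check, assuming self-pairs are excluded and that a cross-proteome screen simply multiplies the two proteome sizes (the function names and the pathogen proteome size of 30 are illustrative only):

```python
import math

def unordered_pairs(n_proteins: int) -> int:
    """Distinct unordered pairs among n proteins, excluding self-pairs."""
    return math.comb(n_proteins, 2)

def cross_pairs(n_host: int, n_other: int) -> int:
    """Host-vs-other pairs, e.g. host proteins against a pathogen proteome."""
    return n_host * n_other

human_pairs = unordered_pairs(20_000)
print(f"{human_pairs:,}")           # 199,990,000
print(f"{cross_pairs(20_000, 30):,}")  # 600,000 for a hypothetical 30-protein pathogen
```

Every additional proteome, ligand library, or set of designed sequences adds to this total, which is why ranking matters more than any individual score.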

Figure: Atlas human intra-actome map

What Atlas Tries To Organize

Atlas has three related surfaces.

The first is protein-protein interaction prediction. Given amino acid sequences, Atlas scores likely relationships between proteins. It can be used for a single pair, a neighborhood around a query, or an all-vs-all screen across a reference proteome.

The second is protein-ligand prioritization. This is a triage layer for early target exploration, repurposing, and chemical search. A ligand score is not efficacy, selectivity, or safety. It is a way to narrow a much larger chemical space before spending more expensive effort.

The third is functional annotation. CAMP-style annotation retrieval connects a sequence to structured biological language, so an interaction or ligand hypothesis does not stand alone. The model output comes with functional context around it.

These pieces matter together. Interaction predictions give network context. Ligand predictions suggest possible interventions. Functional annotations help explain what a protein is likely doing in the first place.

The Interaction Core

The protein-protein interaction (PPI) engine behind Atlas is tied to the Synteract-4 manuscript, "Sequence-Only Interactome-Scale Prediction of Protein-Protein Interactions." The main modeling move is simple to state: treat interaction prediction as representation learning.

Instead of running a large pairwise model from scratch for every possible pair, Synteract-4 embeds proteins into an interaction-aware space. A scaled dot product between two embeddings becomes the interaction score. Once proteins are embedded, a proteome-scale screen becomes a vector retrieval problem.
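A minimal sketch of that retrieval step, under stated assumptions: the embeddings here are random toy vectors standing in for encoder output, the protein IDs are made up, and dividing by the square root of the dimension is one common reading of "scaled dot product" rather than a confirmed detail of Synteract-4.

```python
import heapq
import math
import random

def scaled_dot(a, b):
    """Dot product divided by sqrt(dimension); the scaling choice is an assumption."""
    return sum(x * y for x, y in zip(a, b)) / math.sqrt(len(a))

def top_partners(query_id, embeddings, k=3):
    """Score the query against every other protein and keep the k best."""
    query = embeddings[query_id]
    scored = ((scaled_dot(query, vec), pid)
              for pid, vec in embeddings.items() if pid != query_id)
    return [(pid, score) for score, pid in heapq.nlargest(k, scored)]

# Toy embeddings: 50 hypothetical proteins, 16 dimensions each.
rng = random.Random(0)
embeddings = {f"P{i:03d}": [rng.gauss(0, 1) for _ in range(16)] for i in range(50)}

hits = top_partners("P000", embeddings, k=3)
for pid, score in hits:
    print(pid, round(score, 3))
```

The encoder runs once per protein, and everything after that is arithmetic on stored vectors. At proteome scale the loop would typically be replaced by a matrix multiplication or a nearest-neighbor index, but the logic is the same.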

That is a very different operating mode. Pairwise classifiers ask one question at a time. Atlas asks many questions after paying the embedding cost once.

The Synteract-4 draft reports several useful anchors for this framing: a Matthews correlation coefficient (MCC) of 0.34 on the Bernett leakage-controlled benchmark, a human intra-actome ROC AUC of about 0.968 with a strict zero-shot subset at 0.953, and a mean ROC AUC of about 0.899 across a 19-pathogen intra-actome panel. Those numbers are manuscript evidence, not a claim that every biological edge is real. They show that the retrieval framing is strong enough to support large screens.
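For readers less familiar with the two metrics quoted above, here is a small sketch of how each is computed. The labels and scores are toy values invented for illustration, not data from the manuscript, and the 0.5 decision threshold is an assumption.

```python
import math

def confusion(labels, scores, threshold=0.5):
    """Count TP, TN, FP, FN after thresholding the scores."""
    tp = tn = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; ranges from -1 to 1, 0 is chance."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]

toy_mcc = mcc(*confusion(labels, scores))
toy_auc = roc_auc(labels, scores)
print(round(toy_mcc, 3), round(toy_auc, 3))  # 0.333 0.889
```

MCC depends on a decision threshold and penalizes both kinds of error, while ROC AUC only measures ranking quality. The two sit on different scales, so an MCC of 0.34 on a hard benchmark and an AUC near 0.97 on another dataset are not directly comparable.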

Why The Controls Are Part Of The Product

Protein interaction datasets are full of tempting shortcuts. A model can look good by learning taxonomy, homology, localization, or dataset construction instead of interaction biology.
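The taxonomy shortcut is easy to reproduce on toy data. In the sketch below, a "model" that knows nothing about biology and only checks whether two proteins share a species looks perfect when negatives are drawn across species, and falls to chance when negatives come from the same species. The species list, dataset sizes, and the assumption that all positives are within-species are illustrative choices, not a description of any real dataset.

```python
import random

def species_rule_accuracy(pairs):
    """Accuracy of predicting 'interacts' iff both proteins share a species."""
    correct = sum((sa == sb) == bool(label) for sa, sb, label in pairs)
    return correct / len(pairs)

rng = random.Random(0)
species = ["human", "yeast", "ecoli"]

# Positives: recorded interactions within a single species (toy assumption).
positives = [(s, s, 1) for s in rng.choices(species, k=300)]

# Naive negatives: random pairs whose two proteins come from different species.
cross_negatives = [(*rng.sample(species, 2), 0) for _ in range(300)]

# Controlled negatives: both proteins drawn from the same species.
same_negatives = [(s, s, 0) for s in rng.choices(species, k=300)]

leaky = species_rule_accuracy(positives + cross_negatives)
controlled = species_rule_accuracy(positives + same_negatives)
print(leaky, controlled)  # 1.0 0.5
```

The benchmark with cross-species negatives rewards a rule that has learned nothing about interaction. The controlled version removes that reward.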

That is why Atlas inherits controls from the Accidental Taxonomists and Synteract-4 work: same-species negatives where needed, cluster-aware splits, homology checks, and explicit attention to interspecies evaluation. These controls usually make benchmarks harder. That is the point.
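One of these controls, the cluster-aware split, can be sketched in a few lines. This is an illustration of the general technique, not the pipeline used in either paper: the function name, the policy of dropping pairs that straddle the split, and the toy cluster assignments are all assumptions. In practice the cluster IDs would come from a sequence-clustering step.

```python
import random

def cluster_aware_split(pairs, cluster_of, test_fraction=0.2, seed=0):
    """Split labeled pairs so that no sequence cluster appears on both sides.

    pairs: iterable of (protein_a, protein_b, label)
    cluster_of: dict mapping protein ID -> cluster ID
    Pairs with one protein on each side are dropped rather than leaked.
    """
    clusters = sorted(set(cluster_of.values()))
    random.Random(seed).shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])

    train, test, dropped = [], [], []
    for a, b, label in pairs:
        in_test = (cluster_of[a] in test_clusters, cluster_of[b] in test_clusters)
        if all(in_test):
            test.append((a, b, label))
        elif not any(in_test):
            train.append((a, b, label))
        else:
            dropped.append((a, b, label))
    return train, test, dropped

# Hypothetical proteins and clusters for illustration.
cluster_of = {"A1": "c1", "A2": "c1", "B1": "c2", "B2": "c2",
              "C1": "c3", "D1": "c4", "E1": "c5"}
pairs = [("A1", "B1", 1), ("A2", "B2", 0), ("A1", "C1", 1), ("C1", "D1", 0),
         ("D1", "E1", 1), ("B1", "E1", 0), ("A1", "A2", 1)]

train, test, dropped = cluster_aware_split(pairs, cluster_of, test_fraction=0.4)
print(len(train), len(test), len(dropped))
```

Dropping straddling pairs costs data, which is one reason these benchmarks end up both smaller and harder than random splits.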

The output is not experimental truth. It is a ranked search surface that is less likely to be dominated by the easiest artifact in the data.

What A Researcher Gets Back

A useful Atlas result is not just "yes" or "no." It is a set of relationships that can be inspected.

A query protein can return likely partners, enriched neighborhoods, pathway context, and possible ligand hypotheses. A designed sequence can be compared against a reference proteome before synthesis. A pathogen proteome can be screened against host proteins to prioritize host-pathogen contacts worth deeper review.

The host-pathogen case is where the map framing becomes especially important. Infection is partly a sequence of physical contacts: pathogen proteins enter cells, perturb host machinery, evade the immune response, and change tissue behavior through interactions with host proteins. Atlas cannot prove those events happened, but it can produce a shorter list of plausible contacts to investigate.
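As a rough sketch of what "a shorter list" means in practice, the example below scores every pathogen-host pair and keeps only the highest-scoring few. The embeddings are random toy vectors, the protein names and proteome sizes are invented, and the scoring function is an assumed stand-in for the real model.

```python
import heapq
import math
import random

def score(a, b):
    """Assumed stand-in for the model score: dot product over sqrt(dimension)."""
    return sum(x * y for x, y in zip(a, b)) / math.sqrt(len(a))

def shortlist_contacts(pathogen, host, k=5):
    """Score all pathogen-host pairs and return the k highest as (score, pathogen_id, host_id)."""
    candidates = ((score(pv, hv), pid, hid)
                  for pid, pv in pathogen.items()
                  for hid, hv in host.items())
    return heapq.nlargest(k, candidates)

rng = random.Random(1)

def toy_vector():
    return [rng.gauss(0, 1) for _ in range(16)]

pathogen = {f"virus_{i}": toy_vector() for i in range(4)}
host = {f"host_{i:03d}": toy_vector() for i in range(200)}

shortlist = shortlist_contacts(pathogen, host, k=5)
print(f"{len(pathogen) * len(host)} candidate pairs -> {len(shortlist)} to review")
for s, pid, hid in shortlist:
    print(pid, hid, round(s, 3))
```

The shortlist is a set of hypotheses ordered by model score. Which of them reflect real contacts during infection remains an experimental question.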

Where This Leaves The User

Atlas is best understood as a prioritization system. It does not replace binding assays, structural biology, perturbation experiments, toxicology, or clinical reasoning.

Expression, localization, cofactors, post-translational modification, conformational state, tissue context, and assay conditions still decide whether a predicted relationship is biologically active.

The value is narrower and more practical: use sequence-first models to make large biological search spaces searchable, then spend experimental attention where the map gives you a reason to look.

This blog post summarizes Atlas product direction alongside ideas from the Synteract-4 manuscript draft, including sequence-only interactome-scale PPI prediction and host-pathogen interaction analysis.

Related Research

October 24, 2025

Blog

Accidental Taxonomists: When Protein Models Learn the Wrong Shortcut

Protein interaction models can look strong by learning species differences instead of interaction biology.

Protein Protein Interaction
Dataset Curation
Atlas

September 18, 2025

Blog

Synteract-4: Interaction Prediction as Retrieval

Synteract-4 reframes protein-protein interaction prediction as sequence-only representation learning at proteome scale.

Protein Protein Interaction
Synteract
Atlas