Annotation Vocabulary: Teaching Protein Models the Language of Function

Annotation Vocabulary replaces loose natural-language protein descriptions with structured biological terms, helping AI systems connect amino acid sequences to function, annotation, and generation.

By Logan Hallee

The Problem: Protein Function Is Hard to Say Cleanly

Protein sequences are abundant. Clean function labels are not.

Most protein AI systems learn from amino acid strings because sequences are easy to collect at scale. Researchers usually care about richer questions: what does this protein do, where does it act, which domains does it contain, what reaction might it catalyze, and which biological process does it support?

Natural language can describe those things, but it is noisy. Two curators can describe the same protein differently. Some descriptions are long and redundant. Others are sparse. Many include boilerplate, literature context, or missing experimental detail.

Annotation Vocabulary was built around a different premise: give models a structured language of biological properties instead of asking them to infer function from messy prose.

The Core Idea

Annotation Vocabulary converts established biological annotations into transformer-readable terms. Gene Ontology, Enzyme Commission numbers, InterPro, Gene3D, cofactors, and UniProt keywords become a specialized property language.

That gives the model a cleaner target. A catalytic activity, domain, cofactor, or cellular component becomes a precise term instead of a phrase buried in a paragraph.

The result is a bridge. On one side is the amino acid sequence. On the other is a structured description of what biology already knows about the protein.
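The conversion described above can be sketched as a simple lookup from annotation identifiers to integer token IDs. This is a minimal illustration, not the paper's tokenizer: the `build_vocab` and `encode` helpers, the `<pad>` and `<unk>` special tokens, the `EC:` and `KW:` prefixes, and the two toy proteins are all assumptions made for the example.

```python
from typing import Dict, Iterable, List

# Hypothetical special tokens; the paper's actual token layout may differ.
SPECIAL_TOKENS = ["<pad>", "<unk>"]


def build_vocab(annotation_sets: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign one integer ID to every distinct annotation term."""
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    # Sort so the ID assignment is deterministic across runs.
    distinct_terms = sorted({term for terms in annotation_sets for term in terms})
    for term in distinct_terms:
        vocab[term] = len(vocab)
    return vocab


def encode(terms: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Map annotation terms to sorted, de-duplicated token IDs; unknown terms become <unk>."""
    unk = vocab["<unk>"]
    return sorted({vocab.get(term, unk) for term in terms})


# Toy annotation sets for two made-up proteins (illustrative pairing only).
proteins = {
    "protein_a": ["GO:0005524", "EC:2.7.11.1", "IPR000719", "KW:Kinase"],
    "protein_b": ["GO:0005524", "GO:0005737", "IPR000719"],
}

vocab = build_vocab(proteins.values())
encoded = {name: encode(terms, vocab) for name, terms in proteins.items()}
print(encoded)
```

Each protein becomes a short list of discrete IDs, so a shared property such as an ATP-binding term maps to the same token wherever it appears, regardless of how a curator might have phrased it in prose.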

Why This Is Different

Sequence-only models are powerful for structure, homology, and broad biochemical patterns. Function is more abstract and harder to read off sequence alone: similar folds can support different activities, and similar activities can appear across different sequence families.

Annotation Vocabulary shifts the supervision toward biological meaning. It supports models that represent annotations alone, align sequences with annotations, and generate protein sequences from functional prompts.

The architecture is not the whole claim. The value comes from making the labels cleaner and easier for a model to learn from.
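One common way to align two views of the same object is a symmetric contrastive loss, where each sequence embedding should score highest against its own annotation embedding and vice versa. The sketch below shows that generic objective in plain Python. It is an assumption for illustration, not the training objective confirmed by the paper; the function names, the temperature of 0.1, and the toy two-dimensional embeddings are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(seq_emb: Sequence[Vector], ann_emb: Sequence[Vector],
                     temperature: float = 0.1) -> float:
    """Symmetric contrastive loss over paired sequence and annotation embeddings."""
    seq = [normalize(v) for v in seq_emb]
    ann = [normalize(v) for v in ann_emb]
    # logits[i][j] = similarity of sequence i to annotation set j
    logits = [[sum(a * b for a, b in zip(s, t)) / temperature for t in ann] for s in seq]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy embeddings: row i of each list describes the same (made-up) protein.
seq_embeddings = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [matched[1], matched[0]]  # deliberately mispaired

loss_matched = contrastive_loss(seq_embeddings, matched)
loss_swapped = contrastive_loss(seq_embeddings, swapped)
print(loss_matched < loss_swapped)
```

Correctly paired embeddings give a lower loss than mispaired ones, which is the pressure that pulls a sequence representation toward its functional description.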

What The Paper Showed

The strongest representation model in the paper, CAMP, used Annotation Vocabulary to make sequence embeddings more function-aware. The paper reports state-of-the-art performance on a meaningful subset of common protein benchmarks, at a training cost of about $3 in commercial compute.

That result suggests better descriptions can matter as much as bigger models. If the language of the labels is cleaner, a smaller system can learn a more useful representation.

The generative direction was also important. GSM generated realistic protein sequences from annotation-only prompts. Some generated sequences returned significant BLAST hits and showed enrichment consistent with the requested annotations, even when the ground-truth sequence was far from the training set.

In plain terms: the model could start from a functional description and produce sequences that looked biologically plausible.

Why It Matters for Synthyra

Annotation Vocabulary is a foundation for several Synthyra directions. Translator maps amino acid sequences into structured functional annotations. Atlas uses CAMP-style annotation context to make interaction and ligand predictions more interpretable. DSM connects generation back to function-aware and interaction-aware design.

These are not simple ports of the original research. Synthyra extends the idea with improved models, broader vocabularies, serving infrastructure, validation workflows, and user-facing analysis layers.

The strategic point is larger than any one model. Protein AI needs a better interface between what models see and what scientists ask. Annotation Vocabulary is one route toward that interface.

What This Enables

A researcher can start with a sequence and receive structured functional hypotheses. A designer can start with a desired function and search for compatible sequence space. A platform can connect sequence, annotation, retrieval, and generation without relying entirely on free-text descriptions.

This does not replace curation or experimental validation. It makes the first pass more organized, searchable, and biologically legible.
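Because annotations are discrete terms, the design-side search mentioned above can be illustrated with ordinary set operations. The sketch below ranks a toy library by Jaccard overlap with a requested set of terms. It is a deliberately simple stand-in, not how any Synthyra product works; the `rank_by_annotation` helper, the candidate names, and their annotation sets are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_annotation(query: Set[str], library: Dict[str, Set[str]],
                       top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k library entries whose annotation sets best overlap the query."""
    scored = [(name, jaccard(query, terms)) for name, terms in library.items()]
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Made-up library of annotated candidates.
library = {
    "candidate_1": {"GO:0005524", "IPR000719", "EC:2.7.11.1"},
    "candidate_2": {"GO:0005524", "GO:0005737"},
    "candidate_3": {"GO:0003677"},
}
query = {"GO:0005524", "IPR000719"}

ranking = rank_by_annotation(query, library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A learned model would replace exact term overlap with embedding similarity, but the interface is the same: a structured functional request goes in and ranked candidates come out.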

This blog post summarizes work in the following paper:

Annotation Vocabulary (Might Be) All You Need
Logan Hallee, Niko Rafailidis, Colin Horger, David Hong, Jason P. Gleghorn
bioRxiv 2024.07.30.605924; doi: https://doi.org/10.1101/2024.07.30.605924
