Annotation Vocabulary: Teaching Protein Models the Language of Function

May 31st, 2026

Annotation Vocabulary replaces loose natural-language protein descriptions with structured biological terms, helping AI systems connect amino acid sequences to function, annotation, and generation.

The Problem: Protein Function Is Hard to Say Cleanly

Protein sequences are everywhere. Protein function labels are not.

Most protein AI systems learn from amino acid sequences because sequences are abundant. But researchers usually care about what a protein does: what reaction it catalyzes, where it lives in the cell, what domains it contains, what pathways it touches, and what biological process it supports.

Natural language can describe those things, but it is noisy. Two curators may describe the same protein differently. Some descriptions are long and redundant. Others are sparse. Many include literature references or boilerplate, or leave out important context.

Annotation Vocabulary was built around a simple idea: instead of asking models to learn function from messy prose, give them a structured language of biological properties.

The Core Idea

Annotation Vocabulary converts established biological annotations into a transformer-readable vocabulary. Gene Ontology terms, Enzyme Commission numbers, InterPro and Gene3D domain entries, cofactor annotations, and UniProt keywords become tokens in a specialized protein property language.

That makes protein descriptions more consistent. A model can read the presence of a catalytic activity, a domain, or a cellular component as a precise term rather than as a sentence fragment buried in a paragraph.

The result is a bridge between sequence and function. One side is the amino acid string. The other side is a structured description of what biology already knows about that protein.
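To make the idea concrete, here is a minimal sketch of turning annotation terms into token ids. It is an illustration under stated assumptions, not the paper's actual tokenizer: the function names, the special tokens, the `EC:` prefix convention, and the toy protein records are all hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical special tokens; the paper's actual vocabulary layout may differ.
SPECIAL_TOKENS = ["<pad>", "<unk>"]


def build_vocab(annotation_sets: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign one integer id to every distinct annotation term."""
    terms = sorted({term for ann in annotation_sets for term in ann})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + terms)}


def encode(annotations: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Map a protein's annotation terms to sorted token ids.

    Terms that are not in the vocabulary map to the <unk> id.
    """
    unk = vocab["<unk>"]
    return sorted(vocab.get(term, unk) for term in annotations)


# Toy records for illustration only (not taken from the paper's data).
proteins = {
    "protein_a": ["GO:0005524", "EC:2.7.11.1", "IPR000719", "KW-0418"],
    "protein_b": ["GO:0005524", "GO:0005737", "IPR000719"],
}

vocab = build_vocab(proteins.values())
ids_a = encode(proteins["protein_a"], vocab)
ids_b = encode(proteins["protein_b"], vocab)
```

Because every term maps to exactly one id, two proteins that share a property share a token, no matter how a curator might have phrased it in prose.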

Why This Is Different

Most protein language models learn from sequence alone. That is powerful for structure, homology, and broad biochemical patterns, but function is often more abstract. The same fold can support different activities. The same function can appear in different sequence families. A purely sequence-trained model may not naturally organize its space around the functional concepts a biologist wants to ask about.

Annotation Vocabulary shifts the target. It gives models a compact, consistent way to learn from function itself.

That led to several research directions, including models that represent annotations alone, models that align sequence and annotation spaces, and models that generate protein sequences from annotation prompts.
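This post does not spell out how the sequence and annotation spaces are aligned. One common way to align two embedding spaces is a symmetric contrastive loss, which pulls each sequence embedding toward its own annotation embedding and pushes it away from the others in the batch. The sketch below shows that general technique in plain Python. The function names, the temperature value, and the toy vectors are assumptions for illustration, not the paper's training code.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(
    seq_embs: List[Vector], ann_embs: List[Vector], temperature: float = 0.07
) -> float:
    """Symmetric contrastive loss over matched (sequence, annotation) pairs."""
    seq = [normalize(v) for v in seq_embs]
    ann = [normalize(v) for v in ann_embs]
    # Cosine similarity of every sequence with every annotation, scaled.
    logits = [[dot(s, a) / temperature for a in ann] for s in seq]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed)
    )


# Toy 2-D embeddings: matched pairs point the same way in `aligned`,
# and are deliberately mismatched in `swapped`.
seqs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(seqs, [[0.9, 0.1], [0.1, 0.9]])
swapped = contrastive_loss(seqs, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is near zero when each sequence sits closest to its own annotation embedding and grows when the pairs are mismatched, which is the signal a model would be trained to minimize under this assumed objective.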

What The Paper Showed

The strongest representation model in the paper, CAMP, used Annotation Vocabulary to make sequence embeddings more function-aware. It reached state-of-the-art performance on a subset of common protein benchmarks while costing only about $3 in commercial compute to train.

That result is important because it suggests that better descriptions can matter as much as bigger models. If the language of the labels is cleaner, a smaller system can learn a more useful representation.

The generative side was equally interesting. GSM generated realistic protein sequences from annotation-only prompts. Some generated sequences returned significant BLAST hits and showed enrichment consistent with the requested annotations, even when the ground-truth sequence was far from the training set.

In plain terms: the model could start from a functional description and produce sequences that looked biologically plausible.

Why It Matters for Synthyra

Annotation Vocabulary now sits underneath several Synthyra ideas. Translator uses the same philosophy to map amino acid sequences into structured functional annotations. Atlas CAMP extends the idea into production-facing retrieval and annotation workflows. DSM connects sequence generation back to functional and interaction-aware design.

These are not simple ports of the original open research. Synthyra extends the work with improved models, broader vocabularies, serving infrastructure, validation workflows, and user-facing analysis layers.

The strategic point is bigger than any single model. Protein AI needs a better interface between what models see and what scientists ask. Annotation Vocabulary is one route toward that interface.

What This Enables

For a nontechnical user, Annotation Vocabulary makes protein AI feel less like searching a black-box embedding and more like asking for biological meaning.

A researcher can start with a sequence and receive structured functional hypotheses. A designer can start with a desired function and search for compatible sequence space. A platform can connect sequence, annotation, retrieval, and generation without relying entirely on free-text descriptions.

This does not replace curation or experimental validation. It makes the first pass more organized, more searchable, and more biologically legible.

This blog post summarizes work in the following paper:

Annotation Vocabulary (Might Be) All You Need
Logan Hallee, Niko Rafailidis, Colin Horger, David Hong, Jason P. Gleghorn
bioRxiv 2024.07.30.605924; doi: https://doi.org/10.1101/2024.07.30.605924
