Dual Triangle Attention: Position Sense for Bidirectional Models

Dual Triangle Attention splits each attention head into complementary triangular masks, letting bidirectional transformers retain order information with or without explicit positional embeddings.

By Logan Hallee

Bidirectional transformers are useful because every token can read every other token. That is exactly what you want for protein encoders, retrieval models, masked language models, and many annotation tasks where the best evidence may come from either side of a sequence.

The cost is that standard bidirectional attention does not know order on its own. Without positional embeddings, self-attention is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, with nothing else changed. The model can see relationships between tokens, but the mechanism itself does not tell it which token came first.
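
To make the order-blindness concrete, here is a minimal sketch in plain Python. It is not code from the paper: the function name `attention` is illustrative, and the query, key, and value projections are assumed to be the identity to keep the example short. It checks that shuffling the input tokens only shuffles the outputs.

```python
import math
import random

def attention(x):
    """Single-head self-attention with no mask and no positional embeddings.

    Assumption of this sketch: query, key, and value projections are the identity.
    """
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) for k in x]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, x)) / total
            for d in range(len(q))
        ])
    return out

random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
perm = [3, 0, 4, 1, 2]
shuffled = [tokens[p] for p in perm]

original_out = attention(tokens)
shuffled_out = attention(shuffled)

# Each token gets the same output vector no matter where it sits in the sequence,
# so the shuffled run is just a reordering of the original run.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for p, row in zip(perm, shuffled_out)
    for a, b in zip(original_out[p], row)
)
print(equivariant)
```

Because each token's output is identical before and after the shuffle, nothing downstream of this layer can recover the original order from the outputs alone.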

Causal models have a built-in cue because the mask is triangular. Each token sees only the past, so direction is baked into the computation. Dual Triangle Attention asks whether bidirectional models can get a similar structural cue without giving up full context.

The Mechanism

Dual Triangle Attention splits each attention head's query-key space into two halves.

One half attends over the lower triangle: past and self. The other half attends over the upper triangle: future and self. Together they cover the full bidirectional context, but each half sees a directional mask.
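
The two masks are easy to write down. The sketch below uses hypothetical helper names (`lower_triangle`, `upper_triangle`) and plain nested lists instead of tensors. It builds both masks for a short sequence and checks that together they cover every query-key pair, overlapping only on the diagonal (self).

```python
def lower_triangle(n):
    """True where query i may attend to key j with j <= i (past and self)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def upper_triangle(n):
    """True where query i may attend to key j with j >= i (future and self)."""
    return [[j >= i for j in range(n)] for i in range(n)]

n = 4
lower = lower_triangle(n)
upper = upper_triangle(n)

# Union: every (query, key) pair is visible to at least one half.
union = [[lower[i][j] or upper[i][j] for j in range(n)] for i in range(n)]
# Overlap: only the diagonal is visible to both halves.
overlap = [[lower[i][j] and upper[i][j] for j in range(n)] for i in range(n)]

for name, mask in (("lower", lower), ("upper", upper)):
    print(name)
    for row in mask:
        print(" ".join("1" if allowed else "." for allowed in row))
```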

A useful way to think about it is that the model still reads both directions, but it reads them through two ordered channels instead of one symmetric surface.

The manuscript implements this with PyTorch flex_attention as a single compiled kernel call and adds no parameters beyond standard multi-head attention. The change is in the geometry of the attention mask, not in the size of the model.
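
This post does not spell out how the two halves are wired together, so what follows is only one plausible reading, written in plain Python instead of a flex_attention kernel. The names `masked_attention` and `dual_triangle_head` are hypothetical. Splitting the value vectors along with the queries and keys, then concatenating the two outputs, is an assumption of this sketch and may differ from the paper. The point it illustrates is that the split reuses the existing head dimensions, so no weights are added.

```python
import math

def masked_attention(q, k, v, allowed):
    """Softmax attention where allowed(i, j) says whether query i may see key j."""
    out = []
    for i, qi in enumerate(q):
        visible = [j for j in range(len(k)) if allowed(i, j)]
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(len(qi))
            for j in visible
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for w, j in zip(weights, visible)) / total
            for d in range(len(v[0]))
        ])
    return out

def dual_triangle_head(q, k, v):
    """One head: first half of the dims looks at past+self, second half at future+self.

    Assumption: values are split the same way as queries and keys.
    """
    half = len(q[0]) // 2
    first = lambda rows: [r[:half] for r in rows]
    second = lambda rows: [r[half:] for r in rows]
    past = masked_attention(first(q), first(k), first(v), lambda i, j: j <= i)
    future = masked_attention(second(q), second(k), second(v), lambda i, j: j >= i)
    # Concatenate the two halves back to the original head width.
    return [p + f for p, f in zip(past, future)]

q = k = v = [
    [1.0, 0.0, 0.5, 2.0],
    [0.0, 1.0, 1.5, 0.5],
    [1.0, 1.0, 0.0, 1.0],
]
out = dual_triangle_head(q, k, v)
```

In this toy run the first token's past-facing half can only see itself, and the last token's future-facing half can only see itself, so those halves simply copy the token's own values. Every other position mixes information from its own side of the sequence.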

[Figure: Dual Triangle Attention argmax probe]

The Clean Test

The paper starts with an argmax position probe. A model sees a sequence of random token IDs and must predict the position of the largest token. It cannot solve this by only recognizing token identity; it has to bind token value to token position.
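
A rough sketch of what one example in such a probe could look like is shown below. The function name, sequence length, and vocabulary size are made up for illustration, and drawing distinct token IDs (so the maximum is unique) is an assumption, not a detail confirmed by the paper.

```python
import random

def make_argmax_example(seq_len, vocab_size, rng):
    """Return (token_ids, label): random token IDs and the index of the largest one."""
    # Sampling without replacement keeps IDs distinct, so the label is unambiguous.
    token_ids = rng.sample(range(vocab_size), seq_len)
    label = max(range(seq_len), key=lambda i: token_ids[i])
    return token_ids, label

rng = random.Random(0)
tokens, label = make_argmax_example(seq_len=8, vocab_size=100, rng=rng)

# Shuffling keeps the same bag of tokens but moves the answer,
# so token identity alone cannot determine the label.
shuffled = tokens[:]
rng.shuffle(shuffled)
shuffled_label = shuffled.index(max(shuffled))

print(tokens, label)
print(shuffled, shuffled_label)
```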

That probe exposes the core distinction. Standard bidirectional attention without positional embeddings fails. Causal attention succeeds because the triangular mask carries order. Dual Triangle Attention also succeeds, which supports the mechanism: complementary triangular masks provide enough positional signal for a bidirectional model to learn order.
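
One intuition for why a triangular mask carries order, offered as an illustration and not as the paper's own argument: the number of visible keys depends on the query's position. If all visible scores are equal, the softmax weight a token puts on itself is one over the number of keys it can see. The helper name `uniform_self_weights` is hypothetical.

```python
def uniform_self_weights(n, allowed):
    """Weight each position puts on itself when every visible key has the same score."""
    return [1.0 / sum(1 for j in range(n) if allowed(i, j)) for i in range(n)]

n = 4
bidirectional = uniform_self_weights(n, lambda i, j: True)
lower = uniform_self_weights(n, lambda i, j: j <= i)
upper = uniform_self_weights(n, lambda i, j: j >= i)

print("bidirectional:", bidirectional)  # same value at every position
print("lower:", lower)                  # shrinks as more of the past becomes visible
print("upper:", upper)                  # grows as less of the future remains
```

Under the full mask every position gets the same weight, so nothing distinguishes the first token from the last. Under either triangle the weight changes monotonically with position, and the two triangles change in opposite directions.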

Masked Language Modeling

The next question is whether that toy result survives contact with real sequence modeling. The paper tests masked language modeling on natural language from FineWeb-Edu and protein sequences from OMG-Prot50.

Across both domains, Dual Triangle Attention with RoPE performs strongly, especially in longer-context evaluation. Without positional embeddings, DTA remains functional where ordinary bidirectional attention largely collapses. On the protein MLM experiments, the same pattern appears: RoPE is still useful, but the triangular structure gives DTA a fallback positional bias that standard bidirectional attention does not have.

[Figure: Dual Triangle Attention protein MLM results]

What Did Not Work

The paper also tests DroPE-style position dropping: train with RoPE, then remove positional embeddings late in training. That idea works surprisingly well in some autoregressive settings because causal masks can carry position after RoPE is removed.
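
To show what "removing positional embeddings" does to an attention score, here is a small sketch of rotary embeddings with an on/off switch. This is not the DroPE recipe or the paper's training code. `apply_rope` follows the usual pairwise-rotation form of RoPE, while `rope_enabled` and its `drop_fraction` default are hypothetical stand-ins for "late in training".

```python
import math

def apply_rope(vec, pos, enabled=True, base=10000.0):
    """Rotate consecutive pairs of vec by angles that grow with position (RoPE)."""
    if not enabled:
        return list(vec)
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = pos * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_enabled(step, total_steps, drop_fraction=0.9):
    """Hypothetical schedule: use RoPE for the first drop_fraction of training."""
    return step < drop_fraction * total_steps

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# With RoPE, the score depends only on the relative offset between positions.
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))
score_b = dot(apply_rope(q, 7), apply_rope(k, 5))

# With RoPE dropped, the score ignores position entirely.
score_dropped = dot(apply_rope(q, 5, enabled=False), apply_rope(k, 3, enabled=False))
```

Once the rotation is switched off, any remaining sense of position has to come from somewhere else in the architecture, such as the shape of the attention mask.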

In masked language modeling, it did not reliably transfer. Dropping positions degraded test performance across natural language and protein runs. DTA was more resilient than standard bidirectional attention, but the result is still a warning: positional mechanisms that work for causal models should not be assumed to carry over to bidirectional objectives.

Why It Matters For Protein Models

Protein models need global context, but they also need order. Residues far apart in sequence can become neighbors in structure, yet the sequence direction still matters for domains, motifs, and local grammar.

Dual Triangle Attention targets that tension directly. It keeps bidirectional context while introducing a native directional inductive bias. The practical question now is where this matters most: longer proteins, protein complexes, genomic sequences, retrieval encoders, or hybrid models that need both local order and global comparison.

This is an architecture result, not a biological discovery by itself. Its value depends on the model, data, scale, task, and how it is integrated. The paper's contribution is showing that bidirectional attention does not have to be position-blind when explicit positional embeddings are weak, removed, or out of distribution.

This blog post summarizes work in the following paper:

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Logan Hallee, Jason P. Gleghorn
arXiv preprint, April 2026
