Researchers have developed Tauformer, a transformer architecture that replaces dot-product attention with a Laplacian-derived topological attention mechanism, cutting KV-cache size by roughly 50% and showing promising initial results in 30M-parameter tests.
Rethinking Attention with Topology
Tauformer introduces a fundamental shift in transformer design by replacing standard dot-product attention with a topological mechanism. Instead of computing attention through vector similarity (Q·K), it compresses each token's per-head vector into a single Laplacian-derived scalar λ (the "taumode"). Attention logits then become negative distances between these scalars: att_ij = -|λ_Q,i - λ_K,j| / temperature.
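A minimal PyTorch sketch of this logit step, assuming λ has already been computed per token for one head (the function name `taumode_attention` and the single shared temperature are illustrative; how λ is derived from the Laplacian is not shown here):

```python
import torch

def taumode_attention(lam_q, lam_k, v, temperature=1.0, causal=True):
    """Taumode attention for one head (illustrative sketch).

    lam_q, lam_k: (T,) per-token taumode scalars for queries / keys.
    v:            (T, d_v) value vectors.
    """
    T = lam_q.shape[0]
    # Logits are negative pairwise scalar distances instead of Q·K^T dot products.
    logits = -torch.abs(lam_q[:, None] - lam_k[None, :]) / temperature  # (T, T)
    if causal:
        # Standard causal mask: token i may only attend to positions j <= i.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        logits = logits.masked_fill(mask, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v  # (T, d_v)

# Example: 8 tokens, 64-dim values.
lam_q, lam_k = torch.randn(8), torch.randn(8)
out = taumode_attention(lam_q, lam_k, torch.randn(8, 64))
```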
Key Implementation Advantages
- Efficiency: Stores only value vectors and λ scalars in the KV-cache instead of full K and V tensors, cutting cache size by ~50%, since each per-head key vector collapses to a single scalar (see the arithmetic sketch after this list)
- Sparsity Potential: Designed to leverage precomputed sparse Laplacians from domain manifolds, avoiding costly dense matrix multiplications
- Stable Processing: Maintains standard transformer components (Q/K/V projections, RoPE, causal masking) while altering only logit computation
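The ~50% figure follows directly from what is cached per token. A back-of-the-envelope sketch of the arithmetic (the helper is illustrative; a 64-dim head is assumed from the 384-dim, 6-head configuration reported below, with an fp16 cache):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, dtype_bytes=2, taumode=False):
    """Rough per-sequence KV-cache size (illustrative arithmetic only).

    Standard attention caches K and V per head:   2 * head_dim floats per token.
    Taumode caches V plus one λ scalar per head:  head_dim + 1 floats per token.
    """
    per_token_per_head = (head_dim + 1) if taumode else (2 * head_dim)
    return n_layers * n_heads * seq_len * per_token_per_head * dtype_bytes

# 30M config below: 6 layers, 6 heads, 384-dim embedding -> 64-dim heads, 1024-token context.
std_cache = kv_cache_bytes(6, 6, 64, 1024)                # ~9.0 MiB in fp16
tau_cache = kv_cache_bytes(6, 6, 64, 1024, taumode=True)  # ~4.6 MiB, roughly half
```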
Training Results: 30M Parameter Model
- Configuration (reproduced as a config/optimizer sketch after this list):
  - Layers: 6, Heads: 6, Embedding dim: 384
  - Sequence length: 1024 tokens, Vocabulary: 30,522
  - Optimizer: AdamW (LR 5e-4), 100-step warmup
- Performance:
  - Best validation loss: 1.9146 at step 4,500
  - Validation perplexity dropped from 107.47 (step 100) to 6.59 (step 2,000)
  - Trained for 5,000 steps in 2 hours (~60K tokens/sec)
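For reference, the configuration above can be written out as a short training-setup sketch (the config dataclass and the stand-in module are illustrative, not the actual Tauformer code; batch size and the schedule after warmup are not reported and are assumptions):

```python
from dataclasses import dataclass

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

@dataclass
class TauformerConfig:  # field names are illustrative
    n_layers: int = 6
    n_heads: int = 6
    d_model: int = 384
    seq_len: int = 1024
    vocab_size: int = 30_522

cfg = TauformerConfig()
# Stand-in module so the sketch runs; the real Tauformer model is not reproduced here.
model = torch.nn.Linear(cfg.d_model, cfg.vocab_size)

optimizer = AdamW(model.parameters(), lr=5e-4)
# 100-step linear warmup to the peak LR; behaviour after warmup is assumed constant.
warmup_steps = 100
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))
```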
Critical Observations
- Taumode Convergence: λ scalars decreased as loss improved, suggesting learned representations become "smoother" under the Laplacian
- Training Dynamics: Later stages showed volatility, indicating a potential collapse risk if the contrast between λ values diminishes (a simple diagnostic sketch follows this list)
- Adaptive Strategies Planned: Future versions will recalibrate taumode during training to counter drift
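The recalibration mechanism itself is not described; one simple diagnostic consistent with the observations above would be to log the relative spread of λ across tokens per head, since attention flattens toward uniform weights when all pairwise |λ_i - λ_j| distances shrink to zero. A sketch (not part of Tauformer):

```python
import torch

def taumode_contrast(lam, eps=1e-8):
    """Relative spread of taumode scalars for one head (illustrative collapse diagnostic).

    lam: (T,) per-token λ values. A value near zero means the scalars have nearly
    collapsed, so every pairwise distance |λ_i - λ_j| is tiny and the attention
    weights flatten toward uniform.
    """
    return lam.std() / (lam.abs().mean() + eps)

# Example: log this per layer/head during training to watch for drift toward collapse.
contrast = taumode_contrast(torch.randn(1024))
```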
Significance and Next Steps
The architecture demonstrates a viable alternative to dot-product attention, particularly valuable for:
- Memory-constrained applications (reduced KV-cache)
- Domain-specific models leveraging precomputed topological structures

Researchers plan scaling tests to 100M parameters and exploration of epiplexity principles for efficient structure learning.
Acknowledgment: Experiments were run on Enverge Labs' H100 GPU cluster.
