Attention, Visualized
Attention is the core mechanism behind nearly every modern language model. Each token asks a question (its query), compares it against what every token in the sequence, itself included, offers (its key), and reads back a blend of their values weighted by how well each key matches. Click a token, or watch the sweep, to see where attention flows.
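The query, key, and value roles can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the visualization: the vectors are hypothetical two-dimensional values chosen by hand, the helper names (`softmax`, `dot`, `attend`) are made up for this example, and the dot product is left unscaled for simplicity.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Return (weights, blended value) for a single query token."""
    # One score per token: how well its key matches the query (unscaled here).
    scores = [dot(query, k) for k in keys]
    weights = softmax(scores)
    # Weighted blend of the value vectors, one output component at a time.
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Hypothetical vectors for a three-token sentence (illustrative, not learned).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]  # the query of the active token

weights, blended = attend(query, keys, values)
print(weights)
print(blended)
```

The first and third keys match the query equally well, so they receive equal weight, and the second token contributes least to the blend.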
Overview
The brightness of each beam shows how strongly the active query token attends to another token. Switch heads to see the same sentence read through different relationships: attention is many parallel lenses, not one.
Methodology
Scaled dot-product attention: weights = softmax(QKᵀ/√dₖ), applied per head, with sinusoidal positional encodings. The projection weights here are illustrative (untrained) — this is the shape of the mechanism, not learned linguistics.
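A rough sketch of that computation, using only the standard library, might look like the following. It assumes random (untrained) projection matrices and random token embeddings, as hypothetical stand-ins for whatever the visualization actually uses; the sizes (`n_tokens`, `d_model`, `n_heads`) and all function names are likewise assumptions made for illustration.

```python
import math
import random

def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding as defined in Vaswani et al. (2017)."""
    enc = []
    for j in range(d_model):
        # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
        angle = position / (10000 ** (2 * (j // 2) / d_model))
        enc.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return enc

def matvec(x, W):
    """Project vector x (length n) through matrix W (n rows, m columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Untrained projection weights drawn from a standard normal."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def attention_weights(X, W_q, W_k):
    """softmax(Q K^T / sqrt(d_k)) for one head; one row of weights per query token."""
    Q = [matvec(x, W_q) for x in X]
    K = [matvec(x, W_k) for x in X]
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        for q in Q
    ]

rng = random.Random(0)
n_tokens, d_model, n_heads = 5, 8, 2   # assumed sizes, for illustration only
d_k = d_model // n_heads

# Hypothetical inputs: random token embeddings plus positional encodings.
X = [
    [rng.gauss(0.0, 1.0) + pe for pe in sinusoidal_encoding(pos, d_model)]
    for pos in range(n_tokens)
]

# Each head gets its own query and key projections, so each produces
# a different pattern of weights over the same tokens.
heads = [
    attention_weights(X, random_matrix(d_model, d_k, rng), random_matrix(d_model, d_k, rng))
    for _ in range(n_heads)
]

for h, weights in enumerate(heads):
    print("head", h, [round(w, 3) for w in weights[0]])  # weights for the first token's query
```

Each row of a head's weight matrix sums to 1, and because the projections differ per head, the same sentence yields a different attention pattern in each one.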
Applications
The Transformer (Vaswani et al., 2017) is the architecture underlying large language models, machine translation, protein structure prediction, and modern computer vision.