Attention is a mechanism that allows neural networks to focus on specific parts of the input when processing each element. Think of it like reading a sentence: when you come across the word "it," you automatically look back to find what "it" refers to.
In transformers, attention helps the model understand relationships between words. For example, in "The cat sat on the mat," attention helps the model learn that "sat" is most related to "cat" (who did the sitting) and "mat" (where the sitting happened).
Key Benefits:
🎯 Selective Focus: Pay more attention to relevant words
🔗 Long-range Dependencies: Connect words far apart in a sentence
🧠 Context Understanding: Capture meaning based on relationships
Each input token is transformed into three vectors:
Query (Q): "What am I looking for?" - represents what information this token needs
Key (K): "What do I offer?" - represents what information this token contains
Value (V): "Here's my information" - the actual content to pass along
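For the math-inclined, here is a minimal NumPy sketch of this step. The embedding size, the number of tokens, and the projection matrices W_Q, W_K, W_V are illustrative placeholders, not the visualizer's internals (the visualizer simulates attention rather than running a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4      # toy embedding size and Q/K/V size
n_tokens = 6             # e.g. "the cat sat on the mat"

X = rng.normal(size=(n_tokens, d_model))   # one embedding row per token

# Learned projection matrices (random here, just to show the shapes)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # "What am I looking for?"
K = X @ W_K   # "What do I offer?"
V = X @ W_V   # "Here's my information"

print(Q.shape, K.shape, V.shape)   # (6, 4) each
```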
For each Query token, compute similarity with all Key tokens using the dot product:
Score(Q, K) = Q · K / √d_k
where d_k is the dimension of the key vectors. Dividing by √d_k keeps the scores from growing too large, which would otherwise push the softmax toward extreme values and make training unstable.
Convert scores into probabilities that sum to 1:
Attention_Weights = softmax(Scores)
This ensures higher scores get more weight and all weights are between 0 and 1.
Multiply each Value vector by its attention weight and sum them up:
Output = Σ (Attention_Weight_i × Value_i)
The complete attention formula:
Attention(Q, K, V) = softmax(Q·K^T / √d_k) × V
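Putting the score, softmax, and weighted-sum steps together, here is a minimal NumPy sketch of scaled dot-product attention. The random Q, K, V arrays stand in for real projections; this mirrors the formula above, not the visualizer's internal code:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V, weights                         # weighted sum of values + attention matrix

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))   # 6 tokens, d_k = 4 (toy values)
output, attn_weights = attention(Q, K, V)
print(output.shape)                                     # (6, 4): one context vector per token
print(attn_weights.sum(axis=-1))                        # every row sums to 1
```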
Enter text in the input box at the top
Click the "Visualize" button to process your input
Try the example buttons for quick demos
Switch attention heads to see different patterns
Toggle the matrix on/off using the checkbox
The visualizer uses BPE-style tokenization similar to GPT-2
Automatically handles:
Punctuation: "Hello, world!" → ["hello", ",", "world", "!"]
Contractions: "don't" → ["don", "'t"]
Special characters and mixed case
Tokens limited to 10 for optimal visualization
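As a rough illustration only, here is a tiny regex-based tokenizer that reproduces the splits shown above. It is not the visualizer's actual BPE implementation; real BPE merges subword units learned from data:

```python
import re

def toy_tokenize(text, max_tokens=10):
    """Very rough stand-in for a BPE-style tokenizer:
    lowercase, split off punctuation and simple contractions."""
    tokens = re.findall(r"[a-z]+(?='[a-z])|'[a-z]+|[a-z]+|[^\sa-z']", text.lower())
    return tokens[:max_tokens]   # cap at 10 tokens, like the visualizer

print(toy_tokenize("Hello, world!"))   # ['hello', ',', 'world', '!']
print(toy_tokenize("don't"))           # ['don', "'t"]
```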
Shows your input broken into individual tokens
Each token appears as a colored chip
Tokens are what the model actually processes
The Sankey diagram shows the complete data flow through the attention mechanism:
Flow stages:
Input (Gray): Original token embeddings
Query/Key/Value (Blue/Pink/Green): Each input splits into three pathways
Attention (Yellow): Where Q and K compute attention scores
Output (Teal): Final weighted representation
Line thickness represents connection strength - thicker lines = stronger data flow.
The heatmap shows attention weights between all token pairs:
Rows (Queries): Each token asking "what should I attend to?"
Columns (Keys): Each token that can be attended to
Color intensity: Red darkness = attention strength
Light pink = weak attention (~5-15%)
Dark red = strong attention (25%+)
Numbers in cells: Show exact attention percentages
Diagonal: Usually shows self-attention (token attending to itself)
How to read it:
Look at a row to see what that token attends to
High values = strong relationships
The row sums to 100% (softmax normalization)
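A small sketch of how to read one row, using made-up weights for the query "sat" (the numbers are chosen only so the row sums to 1):

```python
import numpy as np

tokens  = ["the", "cat", "sat", "on", "the", "mat"]
sat_row = np.array([0.05, 0.40, 0.10, 0.12, 0.05, 0.28])   # hypothetical weights for query "sat"

print(f"row sum = {sat_row.sum():.2f}")                     # 1.00 -- guaranteed by the softmax
for key, w in sorted(zip(tokens, sat_row), key=lambda p: -p[1]):
    print(f"sat -> {key:<4} {w:4.0%}")                      # strongest relationships first
```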
Based on the attention-weighted context, the visualizer shows:
Top 10 most likely next tokens
Probability bars - longer bars = higher confidence
Percentages showing exact prediction confidence
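Here is a hedged sketch of how such a ranking could be produced from raw scores. The vocabulary and logits below are made up, and the visualizer's predictions are simulated rather than coming from a trained model:

```python
import numpy as np

# Toy vocabulary and raw next-token scores (logits) -- purely illustrative
vocab  = ["on", "down", "there", "quietly", "today"]
logits = np.array([2.1, 1.7, 1.2, 0.3, 0.1])

probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax -> probabilities that sum to 1

for tok, p in sorted(zip(vocab, probs), key=lambda x: -x[1]):
    print(f"{tok:<8} {'█' * int(p * 40)} {p:5.1%}")   # bar length ~ confidence
```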
Switch between 4 different attention heads:
Head 1 (Local Context): Focuses on nearby words
Head 2 (Previous Words): Emphasizes earlier tokens in sequence
Head 3 (Self-Attention): Strong diagonal pattern, tokens attend to themselves
Head 4 (Diverse Patterns): Captures various relationships
Each head learns different linguistic patterns!
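Here is a minimal sketch of the multi-head idea: the model runs several attention computations in parallel, each with its own projections, and concatenates their outputs. The random weights and sizes are placeholders, and real heads learn their patterns during training rather than being assigned roles like "local context":

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads=4):
    """Toy multi-head attention: each head gets its own Q/K/V projections,
    attends independently, and the head outputs are concatenated."""
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(n_heads):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_head))   # each head has its own attention pattern
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1)              # (n_tokens, d_model)

X = np.random.default_rng(1).normal(size=(6, 8))       # 6 tokens, d_model = 8
print(multi_head_attention(X).shape)                   # (6, 8)
```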
Toggle the attention matrix visualization on/off with the checkbox to focus on the Sankey flow diagram.
Input: "The cat sat on the mat"
What happens:
Tokenization: ["the", "cat", "sat", "on", "the", "mat"]
Each token creates Q, K, V vectors
When processing "sat":
Its Query asks: "Who is doing the action? Where?"
Compares with all Keys
Attention weights might be:
sat → cat: 45% (subject doing the action)
sat → mat: 30% (location)
sat → on: 15% (preposition)
sat → the: 5% each (articles, less important)
These weights multiply the Values to create the output
In the visualizer you'll see:
Strong connections from "sat" to "cat" and "mat" in the Sankey diagram
Dark red cells at (sat, cat) and (sat, mat) in the matrix
Likely next tokens: "on", "down", "there"
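To connect the example back to the formula, here is a small sketch that plugs the percentages above into the weighted-sum step. The value vectors are random placeholders; only the weights come from the example (self-attention for "sat" is left at 0 so the listed percentages sum to 100%):

```python
import numpy as np

tokens  = ["the", "cat", "sat", "on", "the", "mat"]
weights = np.array([0.05, 0.45, 0.00, 0.15, 0.05, 0.30])   # the example's weights for the query "sat"

V = np.random.default_rng(0).normal(size=(6, 4))           # toy value vectors, one per token

sat_output = weights @ V        # weighted sum of values = the new representation of "sat"
print(sat_output)               # a 4-dim vector drawn mostly from "cat" and "mat"
```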
Before attention, models processed words sequentially and struggled with long-range dependencies. Attention enables:
✅ Parallel Processing: All words processed simultaneously, not one-by-one
✅ Direct Connections: Any word can directly attend to any other word, regardless of distance
✅ Multiple Perspectives: Different attention heads capture different relationships
✅ Context-Aware Understanding: Meaning emerges from relationships, not just word order
This is why transformers (which use attention) power modern AI:
GPT (text generation)
BERT (language understanding)
Claude (conversational AI)
Vision Transformers (image understanding)
And many more!
The color scheme helps you track data flow:
Gray → Input → Original embeddings
Blue → Query → "What am I looking for?"
Pink → Key → "What do I offer?"
Green → Value → "Here's my content"
Yellow → Attention → Weighted computation
Teal → Output → Final representation
Start simple: Try "The cat sat on the mat" to see basic patterns
Try questions: See how attention handles "Hello, how are you?"
Compare heads: Switch between heads to see different learned patterns
Longer sentences: Use 6-8 words for rich attention patterns
Watch the flow: Follow a single token through the Sankey diagram
Built with: D3.js v7.9.0 + D3-Sankey v0.12.3
Tokenization: BPE-style (similar to GPT-2)
Attention simulation: Demonstrates typical patterns (not a real model)
Performance: Limited to 10 tokens for smooth visualization
Browser-based: Runs entirely in your browser, no server needed!
Want to dive deeper into transformers?
Read "Attention Is All You Need" (Vaswani et al., 2017)
Explore the Transformer Explainer project
Check out Jay Alammar's "The Illustrated Transformer"
Visit Hugging Face for pre-trained models
Happy visualizing! 🎉