Attention is a mechanism that allows neural networks to focus on specific parts of the input when processing each element. Think of it like reading a sentence: when you come across the word "it," you automatically look back to find what "it" refers to.
In transformers, attention helps the model understand relationships between words. For example, in "The cat sat on the mat," attention helps the model learn that "sat" is most related to "cat" (who did the sitting) and "mat" (where the sitting happened).
Key Benefits:
🎯 Selective Focus: Pay more attention to relevant words
🔗 Long-range Dependencies: Connect words far apart in a sentence
🧠 Context Understanding: Capture meaning based on relationships
Each input token is transformed into three vectors:
Query (Q): "What am I looking for?" - represents what information this token needs
Key (K): "What do I offer?" - represents what information this token contains
Value (V): "Here's my information" - the actual content to pass along
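For the math-inclined, here is a minimal NumPy sketch of this step. The embedding size, the number of tokens, and the projection matrices W_Q, W_K, W_V are illustrative placeholders, not the visualizer's internals (the visualizer simulates attention rather than running a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4      # toy embedding size and Q/K/V size
n_tokens = 6             # e.g. "the cat sat on the mat"

X = rng.normal(size=(n_tokens, d_model))   # one embedding row per token

# Learned projection matrices (random here, just to show the shapes)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # "What am I looking for?"
K = X @ W_K   # "What do I offer?"
V = X @ W_V   # "Here's my information"

print(Q.shape, K.shape, V.shape)   # (6, 4) each
```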
For each Query token, compute similarity with all Key tokens using the dot product:
Score(Q, K) = Q · K / √d_k
where d_k is the dimension of the key vectors. Dividing by √d_k keeps the scores from growing too large, which would otherwise push the softmax toward extreme values and make training unstable.
Convert scores into probabilities that sum to 1:
Attention_Weights = softmax(Scores)
This ensures higher scores get more weight and all weights are between 0 and 1.
Multiply each Value vector by its attention weight and sum them up:
Output = Σ (Attention_Weight_i × Value_i)
The complete attention formula:
Attention(Q, K, V) = softmax(Q·K^T / √d_k) × V
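Putting the score, softmax, and weighted-sum steps together, here is a minimal NumPy sketch of scaled dot-product attention. The random Q, K, V arrays stand in for real projections; this mirrors the formula above, not the visualizer's internal code:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V, weights                         # weighted sum of values + attention matrix

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))   # 6 tokens, d_k = 4 (toy values)
output, attn_weights = attention(Q, K, V)
print(output.shape)                                     # (6, 4): one context vector per token
print(attn_weights.sum(axis=-1))                        # every row sums to 1
```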
Enter text in the input box at the top
Click the "Visualize" button to process your input
Try the example buttons for quick demos
Switch attention heads to see different patterns
Toggle the matrix on/off using the checkbox
The visualizer uses BPE-style tokenization similar to GPT-2
Automatically handles:
Punctuation: "Hello, world!" → ["hello", ",", "world", "!"]
Contractions: "don't" → ["don", "'t"]
Special characters and mixed case
Tokens limited to 10 for optimal visualization
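As a rough illustration only, here is a tiny regex-based tokenizer that reproduces the splits shown above. It is not the visualizer's actual BPE implementation; real BPE merges subword units learned from data:

```python
import re

def toy_tokenize(text, max_tokens=10):
    """Very rough stand-in for a BPE-style tokenizer:
    lowercase, split off punctuation and simple contractions."""
    tokens = re.findall(r"[a-z]+(?='[a-z])|'[a-z]+|[a-z]+|[^\sa-z']", text.lower())
    return tokens[:max_tokens]   # cap at 10 tokens, like the visualizer

print(toy_tokenize("Hello, world!"))   # ['hello', ',', 'world', '!']
print(toy_tokenize("don't"))           # ['don', "'t"]
```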
Shows your input broken into individual tokens
Each token appears as a colored chip
Tokens are what the model actually processes
The Sankey diagram shows the complete data flow through the attention mechanism:
Flow stages:
Input (Gray): Original token embeddings
Query/Key/Value (Blue/Pink/Green): Each input splits into three pathways
Attention (Yellow): Where Q and K compute attention scores
Output (Teal): Final weighted representation
Line thickness represents connection strength - thicker lines = stronger data flow.
The heatmap shows attention weights between all token pairs:
Rows (Queries): Each token asking "what should I attend to?"
Columns (Keys): Each token that can be attended to
Color intensity: Red darkness = attention strength
Light pink = weak attention (~5-15%)
Dark red = strong attention (25%+)
Numbers in cells: Show exact attention percentages
Diagonal: Usually shows self-attention (token attending to itself)
How to read it:
Look at a row to see what that token attends to
High values = strong relationships
The row sums to 100% (softmax normalization)
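A small sketch of how to read one row, using made-up weights for the query "sat" (the numbers are chosen only so the row sums to 1):

```python
import numpy as np

tokens  = ["the", "cat", "sat", "on", "the", "mat"]
sat_row = np.array([0.05, 0.40, 0.10, 0.12, 0.05, 0.28])   # hypothetical weights for query "sat"

print(f"row sum = {sat_row.sum():.2f}")                     # 1.00 -- guaranteed by the softmax
for key, w in sorted(zip(tokens, sat_row), key=lambda p: -p[1]):
    print(f"sat -> {key:<4} {w:4.0%}")                      # strongest relationships first
```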
Based on the attention-weighted context, the visualizer shows:
Top 10 most likely next tokens
Probability bars - longer bars = higher confidence
Percentages showing exact prediction confidence
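Here is a hedged sketch of how such a ranking could be produced from raw scores. The vocabulary and logits below are made up, and the visualizer's predictions are simulated rather than coming from a trained model:

```python
import numpy as np

# Toy vocabulary and raw next-token scores (logits) -- purely illustrative
vocab  = ["on", "down", "there", "quietly", "today"]
logits = np.array([2.1, 1.7, 1.2, 0.3, 0.1])

probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax -> probabilities that sum to 1

for tok, p in sorted(zip(vocab, probs), key=lambda x: -x[1]):
    print(f"{tok:<8} {'█' * int(p * 40)} {p:5.1%}")   # bar length ~ confidence
```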
Switch between 4 different attention heads:
Head 1 (Local Context): Focuses on nearby words
Head 2 (Previous Words): Emphasizes earlier tokens in sequence
Head 3 (Self-Attention): Strong diagonal pattern, tokens attend to themselves
Head 4 (Diverse Patterns): Captures various relationships
Each head learns different linguistic patterns!
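Here is a minimal sketch of the multi-head idea: the model runs several attention computations in parallel, each with its own projections, and concatenates their outputs. The random weights and sizes are placeholders, and real heads learn their patterns during training rather than being assigned roles like "local context":

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads=4):
    """Toy multi-head attention: each head gets its own Q/K/V projections,
    attends independently, and the head outputs are concatenated."""
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(n_heads):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_head))   # each head has its own attention pattern
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1)              # (n_tokens, d_model)

X = np.random.default_rng(1).normal(size=(6, 8))       # 6 tokens, d_model = 8
print(multi_head_attention(X).shape)                   # (6, 8)
```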
Toggle the attention matrix visualization on/off with the checkbox to focus on the Sankey flow diagram.
Input: "The cat sat on the mat"
What happens:
Tokenization: ["the", "cat", "sat", "on", "the", "mat"]
Each token creates Q, K, V vectors
When processing "sat":
Its Query asks: "Who is doing the action? Where?"
Compares with all Keys
Attention weights might be:
sat → cat: 45% (subject doing the action)
sat → mat: 30% (location)
sat → on: 15% (preposition)
sat → the: 5% each (articles, less important)
These weights multiply the Values to create the output
In the visualizer you'll see:
Strong connections from "sat" to "cat" and "mat" in the Sankey diagram
Dark red cells at (sat, cat) and (sat, mat) in the matrix
Likely next tokens: "on", "down", "there"
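To connect the example back to the formula, here is a small sketch that plugs the percentages above into the weighted-sum step. The value vectors are random placeholders; only the weights come from the example (self-attention for "sat" is left at 0 so the listed percentages sum to 100%):

```python
import numpy as np

tokens  = ["the", "cat", "sat", "on", "the", "mat"]
weights = np.array([0.05, 0.45, 0.00, 0.15, 0.05, 0.30])   # the example's weights for the query "sat"

V = np.random.default_rng(0).normal(size=(6, 4))           # toy value vectors, one per token

sat_output = weights @ V        # weighted sum of values = the new representation of "sat"
print(sat_output)               # a 4-dim vector drawn mostly from "cat" and "mat"
```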
Before attention, models processed words sequentially and struggled with long-range dependencies. Attention enables:
✅ Parallel Processing: All words processed simultaneously, not one-by-one
✅ Direct Connections: Any word can directly attend to any other word, regardless of distance
✅ Multiple Perspectives: Different attention heads capture different relationships
✅ Context-Aware Understanding: Meaning emerges from relationships, not just word order
This is why transformers (which use attention) power modern AI:
GPT (text generation)
BERT (language understanding)
Claude (conversational AI)
Vision Transformers (image understanding)
And many more!
The color scheme helps you track data flow:
Gray → Input → Original embeddings
Blue → Query → "What am I looking for?"
Pink → Key → "What do I offer?"
Green → Value → "Here's my content"
Yellow → Attention → Weighted computation
Teal → Output → Final representation
Start simple: Try "The cat sat on the mat" to see basic patterns
Try questions: See how attention handles "Hello, how are you?"
Compare heads: Switch between heads to see different learned patterns
Longer sentences: Use 6-8 words for rich attention patterns
Watch the flow: Follow a single token through the Sankey diagram
Built with: D3.js v7.9.0 + D3-Sankey v0.12.3
Tokenization: BPE-style (similar to GPT-2)
Attention simulation: Demonstrates typical patterns (not a real model)
Performance: Limited to 10 tokens for smooth visualization
Browser-based: Runs entirely in your browser, no server needed!
Want to dive deeper into transformers?
Read "Attention Is All You Need" (Vaswani et al., 2017)
Explore the Transformer Explainer project
Check out Jay Alammar's "The Illustrated Transformer"
Visit Hugging Face for pre-trained models
Happy visualizing! 🎉