Overview
Sparse attention mechanisms reduce the computational complexity of Transformer self-attention from O(n^2) in sequence length to O(n sqrt(n)), O(n log n), or even O(n), enabling processing of much longer sequences. They trade some attention coverage for large efficiency gains.
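For a rough sense of scale, the sketch below counts pairwise attention scores for a 16K-token sequence under full attention versus a sliding window; the sequence length and window width of 512 are illustrative assumptions, not values from any particular model.

```python
# Back-of-the-envelope score counts; n and w are illustrative choices,
# not values from any particular model.
n = 16_384                       # sequence length
w = 512                          # sliding-window width
full_scores = n * n              # O(n^2) full attention
window_scores = n * w            # O(n * w) sliding-window attention
print(f"full attention: {full_scores:,} scores")
print(f"sliding window: {window_scores:,} scores "
      f"({full_scores // window_scores}x fewer)")
```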
Key characteristics
- Reduced Complexity: Quadratic to linear or sub-quadratic complexity in sequence length
- Long Sequences: Enables processing of sequences with thousands to millions of tokens
- Pattern Sparsity: Uses predefined or learned patterns to limit attention scope
- Efficiency Trade-offs: Balance between computational efficiency and attention coverage
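To make the pattern-sparsity point above concrete, here is a minimal NumPy sketch of a sliding-window mask; the sequence length of 6 and half-window of 1 are toy values chosen for readability.

```python
import numpy as np

# Toy sliding-window mask: each of 6 positions may attend to itself and
# one neighbour on each side (half_window = 1). True = attention allowed.
n, half_window = 6, 1
idx = np.arange(n)
mask = np.abs(idx[:, None] - idx[None, :]) <= half_window
print(mask.astype(int))
print(f"density: {mask.mean():.2f} (full attention would be 1.00)")
```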
Popular models
- Longformer: Combines local and global attention patterns for long documents (see the usage sketch after this list)
- BigBird: Uses random, window, and global attention for even longer sequences
- Performer: Approximates softmax attention with kernel feature maps for linear-time attention
- Reformer: Uses locality-sensitive hashing for efficient attention
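As a usage sketch of the local-plus-global idea behind Longformer, the snippet below assumes the Hugging Face transformers library and the allenai/longformer-base-4096 checkpoint; placing global attention only on the [CLS] token is a common but not mandatory choice.

```python
import torch
from transformers import AutoTokenizer, LongformerModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document ... " * 200, return_tensors="pt",
                   truncation=True, max_length=4096)

# Local sliding-window attention everywhere; global attention only on [CLS].
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)   # (1, seq_len, hidden_size)
```

Positions marked 1 in global_attention_mask attend to, and are attended by, every other position, while the rest use only the local window.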
Core steps
- Query-Key Projection: Project inputs to queries and keys
- Sparsity Pattern: Apply pattern-based or approximate attention
- Value Aggregation: Aggregate values based on sparse attention patterns
- Output Projection: Project to final output dimension
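A minimal single-head NumPy sketch of these four steps, assuming a sliding-window sparsity pattern; the dimensions and weight initialisation are illustrative, not taken from any specific implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head, half_window = 8, 16, 8, 2
x = rng.normal(size=(n, d_model))

# 1. Query-Key (and value) projection
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
W_o = rng.normal(size=(d_head, d_model))
q, k, v = x @ W_q, x @ W_k, x @ W_v

# 2. Sparsity pattern: mask scores outside a sliding window
scores = q @ k.T / np.sqrt(d_head)
idx = np.arange(n)
window = np.abs(idx[:, None] - idx[None, :]) <= half_window
scores = np.where(window, scores, -np.inf)

# 3. Value aggregation with masked softmax weights
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attended = weights @ v

# 4. Output projection back to the model dimension
out = attended @ W_o
print(out.shape)   # (8, 16)
```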
Sparsity patterns
- Sliding window: O(n) - local dependencies
- Strided: O(n sqrt(n)) - hierarchical patterns
- Global+local: O(n) - long document tasks
- Random: O(n) - global context
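The sketch below builds boolean masks for each of these patterns in NumPy; the toy sequence length, window width, stride, and counts of random and global tokens are illustrative assumptions.

```python
import numpy as np

n, w, stride, n_rand, n_global = 16, 2, 4, 2, 1
idx = np.arange(n)
rng = np.random.default_rng(0)

# Sliding window: attend within distance w of each position -> O(n * w)
sliding = np.abs(idx[:, None] - idx[None, :]) <= w

# Strided: local window plus every stride-th position; with stride ~ sqrt(n)
# this gives the O(n * sqrt(n)) hierarchical pattern
strided = sliding | ((idx[None, :] % stride) == 0)

# Global + local: a few tokens attend to, and are attended by, everyone
global_local = sliding.copy()
global_local[:n_global, :] = True
global_local[:, :n_global] = True

# Random: each query also attends to a fixed number of random keys -> O(n)
random_mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    random_mask[i, rng.choice(n, size=n_rand, replace=False)] = True

for name, m in [("sliding", sliding), ("strided", strided),
                ("global+local", global_local), ("random", random_mask)]:
    print(f"{name:>12}: density {m.mean():.2f}")
```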
Memory vs quality trade-offs
- Full attention: High memory, best quality
- Local + global: Medium memory, good quality
- Sparse projection: Low memory, moderate quality
- Kernel approximation: Lowest memory, quality varies by task
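A back-of-the-envelope estimate of attention-score memory under some of these regimes, assuming float32 scores and a single head; the sequence length, window width, global-token count, and projection rank are illustrative, and "sparse projection" is interpreted here as a low-rank projection of keys and values.

```python
n = 32_768               # tokens
bytes_per = 4            # float32 score entries
w, g, r = 512, 64, 256   # window width, global tokens, projection rank

def gib(num_scores):
    return num_scores * bytes_per / 2**30

print(f"full attention       : {gib(n * n):7.2f} GiB")
print(f"local + global       : {gib(n * (w + 2 * g)):7.2f} GiB")
print(f"low-rank projection  : {gib(n * r):7.2f} GiB")
print("kernel approximation : ~0 (the n x n score matrix is never built)")
```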
Architecture overview
Sparse attention mechanisms replace the full attention matrix multiplication with more efficient operations, either through pattern-based sparsity, low-rank approximation, or kernel methods.
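As one sketch of the kernel-method branch, the function below computes linear attention with the simple feature map elu(x) + 1; this follows the general linear-attention recipe rather than any specific model's exact formulation.

```python
import numpy as np

def linear_attention(q, k, v):
    """Kernel-style attention in O(n * d^2) instead of O(n^2 * d)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1 > 0
    q, k = phi(q), phi(k)
    kv = k.T @ v                    # (d, d_v) summary, built once
    z = q @ k.sum(axis=0)           # (n,) per-query normalisers
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 1024, 64
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(q, k, v).shape)   # (1024, 64)
```

Because the key-value summary kv is built once and reused for every query, the n x n score matrix is never materialised.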
Applications
Long context tasks
- Document Understanding: Processing long documents with thousands of tokens
- Book Summarization: Understanding and summarizing entire books
- Code Understanding: Analyzing large codebases and repositories
Efficiency-critical applications
- Edge Devices: Running Transformers on resource-constrained devices
- Real-time Processing: Low-latency inference for streaming inputs
- Long-horizon Planning: Tasks requiring reasoning over long sequences
Industry applications
- Legal: Analyzing long legal documents and contracts
- Finance: Processing long financial reports and market data
- Healthcare: Analyzing lengthy medical records and research papers