Overview
Encoder-decoder models combine both encoder and decoder components of the Transformer architecture. These models, exemplified by T5 (Text-to-Text Transfer Transformer), BART, and others, excel at sequence-to-sequence tasks like translation and summarization.
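To make this concrete, here is a minimal sketch of running a pretrained encoder-decoder model for summarization. It assumes the Hugging Face `transformers` library and the `t5-small` checkpoint; the checkpoint name and generation settings are illustrative rather than prescriptive.

```python
# Minimal sketch: summarization with a pretrained encoder-decoder model.
# Assumes the Hugging Face `transformers` library and the `t5-small` checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

article = "The Transformer architecture relies on attention to model sequences ..."
# T5 frames every task as text-to-text, so the task is named in the input prefix.
inputs = tokenizer("summarize: " + article, return_tensors="pt")

# The encoder reads the full input once; the decoder generates the summary token by token.
summary_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```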
Key characteristics
- Sequence-to-Sequence: Process an input sequence and generate an output sequence of potentially different length
- Cross-Attention: Decoder attends to encoder outputs to condition on the input sequence (see the sketch after this list)
- Versatile Tasks: Can handle various tasks by framing them as text-to-text problems
- Bidirectional Encoder: Encoder processes input in both directions for full context understanding
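The last two points can be illustrated in isolation: cross-attention is simply attention in which the decoder's hidden states act as queries and the encoder's outputs act as keys and values, with no causal mask. A minimal PyTorch sketch, with illustrative dimensions:

```python
# Sketch of cross-attention: decoder positions (queries) attend to encoder
# outputs (keys/values). Dimensions are illustrative.
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_out = torch.randn(1, 10, d_model)    # 10 source tokens, encoded bidirectionally
decoder_states = torch.randn(1, 4, d_model)  # 4 target tokens generated so far

# No causal mask here: every decoder position may look at every encoder position.
context, attn_weights = cross_attn(query=decoder_states, key=encoder_out, value=encoder_out)
print(context.shape)       # (1, 4, 64): one context vector per decoder position
print(attn_weights.shape)  # (1, 4, 10): attention over the 10 source positions
```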
Popular models
- T5 (Text-to-Text Transfer Transformer): Frames all NLP tasks as text generation
- BART (Bidirectional and Auto-Regressive Transformers): Combines bidirectional encoding with autoregressive decoding
- mBART: Multilingual variant of BART for machine translation
- Pegasus: Pre-trained for abstractive summarization
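Despite differing pretraining objectives, these families share the same sequence-to-sequence interface. The sketch below assumes the Hugging Face `transformers` library; the Hub checkpoint names are the commonly published ones and may differ in your environment.

```python
# Sketch: the model families above share one seq2seq interface in Hugging Face
# `transformers`. Checkpoint names are the commonly published Hub identifiers.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoints = {
    "T5": "t5-small",
    "BART": "facebook/bart-base",
    "mBART": "facebook/mbart-large-50-many-to-many-mmt",
    "Pegasus": "google/pegasus-xsum",
}

for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
    print(f"{name}: type={model.config.model_type}, parameters={model.num_parameters():,}")
```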
Core steps
- Encoder Processing: Input sequence is processed bidirectionally through encoder layers
- Cross-Attention: Decoder layers attend to encoder outputs to condition on input
- Autoregressive Generation: Decoder generates output tokens one at a time
- Output Projection: Final representations are projected to vocabulary logits
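The steps above can be made explicit with a hand-rolled greedy decoding loop. This is a sketch assuming a Hugging Face seq2seq checkpoint (`t5-small`); in practice `model.generate` wraps the same logic with caching, beam search, and sampling.

```python
# Sketch of the core loop: encode once, then decode autoregressively.
# Assumes a Hugging Face seq2seq checkpoint; `model.generate` wraps this logic.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

enc = tokenizer("translate English to German: The house is small.", return_tensors="pt")

# 1. Encoder processing: the input is encoded once, with bidirectional attention.
encoder_outputs = model.get_encoder()(**enc)

# 2-4. The decoder extends the output one token at a time, cross-attending to the
# encoder outputs; its final hidden states are projected to vocabulary logits.
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):
    out = model(encoder_outputs=encoder_outputs,
                attention_mask=enc["attention_mask"],
                decoder_input_ids=decoder_ids)
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
    decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:
        break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```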
Encoder-decoder trade-offs
- Input understanding: The encoder sees the full input bidirectionally, whereas decoder-only models are limited to causal (left-to-right) context
- Generation: Autoregressive in both architectures
- Pretraining efficiency: Typically lower than decoder-only models, which compute a loss at every position of a single stream, whereas the encoder-decoder loss covers only target tokens
- Fine-tuning flexibility: Traditionally fine-tuned per task, whereas decoder-only models are more often instruction-tuned across many tasks at once
Training objectives
- Prefix LM: Predict the continuation of a given prefix
- Infilling: Reconstruct masked spans of the input, as in T5's span-corruption objective (see the sketch after this list)
- Seq2seq LM: Map a complete input sequence to a target output sequence, the standard encoder-decoder training setup
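As an illustration of the infilling objective, here is a sketch of a T5-style span-corruption example: masked spans are replaced by sentinel tokens in the input, and the target reconstructs the spans. Passing `labels` yields the standard seq2seq cross-entropy loss; the sentinel conventions follow the T5 tokenizer, and the sentence is purely illustrative.

```python
# Sketch: T5-style span corruption (infilling) trained with the seq2seq LM loss.
# Sentinel tokens (<extra_id_0>, <extra_id_1>, ...) mark the masked spans.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Original sentence: "Thank you for inviting me to your party last week."
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week."
target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"

inputs = tokenizer(corrupted_input, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Teacher forcing: passing `labels` shifts them for the decoder and computes the
# cross-entropy loss over the vocabulary logits at each target position.
loss = model(**inputs, labels=labels).loss
print(loss.item())
```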
Architecture overview
Encoder-decoder models consist of an encoder that processes the input sequence bidirectionally, and a decoder that generates the output autoregressively while attending to the encoder's representations.
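A minimal PyTorch sketch of this wiring, using `torch.nn.Transformer` with illustrative sizes (positional encodings and the training loop are omitted for brevity):

```python
# Minimal PyTorch sketch of the wiring: bidirectional encoder, causally masked
# decoder with cross-attention, and a linear projection to vocabulary logits.
# Sizes are illustrative; positional encodings and training are omitted.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (1, 10))  # input sequence (10 tokens)
tgt = torch.randint(0, vocab_size, (1, 6))   # target prefix (6 tokens)

# Encoder: no mask, so every input position attends to every other (bidirectional).
# Decoder: a causal mask restricts each target position to earlier positions,
# while cross-attention lets every decoder position see the full encoder memory.
causal_mask = transformer.generate_square_subsequent_mask(tgt.size(1))
hidden = transformer(embed(src), embed(tgt), tgt_mask=causal_mask)

logits = lm_head(hidden)  # (1, 6, vocab_size): next-token scores per target position
print(logits.shape)
```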
Applications
Sequence-to-Sequence Tasks
- Machine Translation: Translating text between languages
- Summarization: Generating concise summaries of documents
- Question Generation: Generating questions from given contexts
- Text Simplification: Rewriting text in simpler terms
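Many of these tasks reduce to choosing how the input text is framed. The sketch below shows T5-style task prefixes with the `t5-small` checkpoint; the translation and summarization prefixes come from T5's original multitask setup, and tasks such as question generation or simplification would follow the same pattern after task-specific fine-tuning.

```python
# Sketch: framing different tasks as text-to-text with input prefixes.
# The translation and summarization prefixes are the ones T5 was trained with;
# other tasks would follow the same pattern after task-specific fine-tuning.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompts = [
    "translate English to French: The weather is nice today.",
    "summarize: Encoder-decoder models pair a bidirectional encoder with an "
    "autoregressive decoder and are widely used for translation and summarization.",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```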
Generation Tasks
- Data-to-Text: Generating text from structured data or tables
- Code Generation: Generating code from natural language descriptions
- Dialogue Generation: Generating conversational responses
Industry applications
- Translation Services: Real-time and document translation across languages
- Content Summarization: News summarization, document condensation
- Data Extraction: Extracting structured information from unstructured text