Overview
Mixture of Experts (MoE) models replace dense computation with sparse activation, dramatically increasing model capacity while keeping per-token compute close to that of a much smaller dense model. Models such as Mixtral, Switch Transformer, and GLaM have shown that MoE architectures deliver strong performance at scales ranging from tens of billions to over a trillion total parameters.
Key characteristics
- Sparse Activation: Only a small subset of experts (typically 1-4) is active per token, reducing computation
- Gating Network: Learns to route tokens to the most appropriate experts based on input content
- Massive Parameters: Can have trillions of parameters while keeping inference compute manageable
- Expert Specialization: Different experts can develop distinct strengths for different types of inputs
Popular models
- Mixtral 8x7B: 8 experts per layer, 2 active per token, 46.7B total parameters and ~12.9B active per token, open weights (see the rough estimate after this list)
- Switch Transformer: Google Research (2021), simplified top-1 routing, scaled to 1.6T parameters
- GLaM: Google (2021), 64 experts per layer, 1.2T total parameters
- DeepSeek-V2: DeepSeek (2024), 236B total parameters, ~21B active per token
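To make the sparse-activation arithmetic concrete, here is a rough back-of-the-envelope estimate for a Mixtral-style model. The 46.7B total and ~12.9B active figures are the reported ones; the split into shared and expert parameters is an assumption chosen purely for illustration.

```python
# Rough estimate of active vs. total parameters for a Mixtral-8x7B-style model.
# The shared-parameter figure (embeddings, attention, norms) is an assumption
# for illustration, not an exact breakdown from the model card.
total_params = 46.7e9     # reported total parameters
num_experts = 8           # experts per MoE layer
active_experts = 2        # experts used per token
shared_params = 1.6e9     # assumed non-expert parameters, shared by every token

expert_params = total_params - shared_params
active_params = shared_params + expert_params * active_experts / num_experts
print(f"~{active_params / 1e9:.1f}B parameters active per token")  # ~12.9B
```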
Core steps
- Input Processing: Token embeddings are passed through the network
- Gating Computation: Gating network computes expert selection scores
- Expert Routing: Top-k experts are selected based on gating scores
- Parallel Processing: Selected experts process the token in parallel
- Weighted Combination: Expert outputs are weighted by gating scores and combined (these steps are sketched in code below)
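A minimal sketch of these five steps in PyTorch, assuming token-choice top-k routing and plain feed-forward experts. The class and variable names are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One expert: an ordinary position-wise feed-forward block."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Sparse MoE layer: gate -> top-k routing -> expert FFNs -> weighted combination."""
    def __init__(self, d_model, d_ff, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts)   # gating network
        self.top_k = top_k

    def forward(self, x):
        tokens = x.reshape(-1, x.size(-1))            # (num_tokens, d_model)

        # Gating computation: one score per expert for every token.
        logits = self.gate(tokens)                               # (T, E)
        # Expert routing: keep the top-k experts, renormalise their scores.
        weights, expert_idx = logits.topk(self.top_k, dim=-1)    # (T, k)
        weights = F.softmax(weights, dim=-1)

        # Parallel processing + weighted combination: each selected expert
        # processes its tokens, and outputs are mixed by the gate weights.
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_pos, slot = (expert_idx == e).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue
            out[token_pos] += weights[token_pos, slot].unsqueeze(-1) * expert(tokens[token_pos])

        return out.reshape_as(x)

# Example: route a small batch through the layer.
layer = MoELayer(d_model=64, d_ff=256)
y = layer(torch.randn(2, 10, 64))   # output has the same shape as the input
```

Production systems replace the per-expert Python loop with batched dispatch (scatter/gather or grouped GEMM kernels) and usually impose capacity limits, but the routing logic is the same.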
Routing strategies
- Top-k (token choice): Each token selects its k highest-scoring experts
- Expert choice: Each expert selects the tokens it will process, fixing per-expert load (sketched below)
- Hash-based: Fixed routing determined by hashing the token, with no learned router
- Load-balanced: Routing under explicit per-expert capacity constraints
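A small sketch contrasting expert-choice routing with the token-choice router above: instead of each token picking experts, each expert picks a fixed number of tokens, so per-expert load is balanced by construction. The function and argument names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def expert_choice_routing(gate_logits, capacity):
    """Each expert selects its `capacity` highest-affinity tokens.

    gate_logits: (num_tokens, num_experts) router scores.
    Returns token indices and weights, each of shape (capacity, num_experts).
    Load per expert is fixed, but a token may be chosen by several experts or by none.
    """
    affinity = F.softmax(gate_logits, dim=-1)              # token-to-expert affinities
    weights, token_idx = affinity.topk(capacity, dim=0)    # top tokens for each expert
    return token_idx, weights

# Example: 16 tokens, 4 experts, each expert takes 4 tokens.
idx, w = expert_choice_routing(torch.randn(16, 4), capacity=4)
```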
Training considerations
- Load imbalance: An auxiliary load-balancing loss spreads tokens across experts (see the sketch after this list)
- Expert collapse: Noisy top-k routing keeps under-used experts receiving gradient signal
- Training stability: Gradient clipping and learning-rate warmup
- Inference efficiency: Expert caching and batching of tokens routed to the same expert
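As an example of the first point, a sketch of the Switch-Transformer-style auxiliary load-balancing loss, which penalises routers that concentrate tokens on a few experts; exact coefficients and formulations vary across papers.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits, expert_idx, num_experts):
    """Switch-Transformer-style auxiliary loss (sketch, top-1 routing assumed).

    gate_logits: (num_tokens, num_experts) raw router scores.
    expert_idx:  (num_tokens,) expert chosen for each token.
    Loss = num_experts * sum_e f_e * P_e, where f_e is the fraction of tokens
    routed to expert e and P_e is the mean router probability for expert e.
    It is minimised when both are uniform across experts.
    """
    probs = F.softmax(gate_logits, dim=-1)
    f = F.one_hot(expert_idx, num_experts).float().mean(dim=0)   # dispatch fractions
    p = probs.mean(dim=0)                                        # mean router probabilities
    return num_experts * torch.sum(f * p)

# Typically scaled by a small coefficient (e.g. 0.01) and added to the language-model loss.
```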
Architecture overview
MoE models replace the dense feed-forward layers in Transformers with sparse expert layers. A gating network routes each token to the top-k experts, and their outputs are weighted and combined.
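To show where the sparse layer sits, a minimal sketch of a pre-norm Transformer block whose feed-forward sublayer is replaced by the MoELayer sketched earlier; the attention setup is deliberately simplified and the names are illustrative.

```python
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    """Transformer block with the dense FFN sublayer swapped for a sparse MoE layer."""
    def __init__(self, d_model, n_heads, d_ff, num_experts=8, top_k=2):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.moe = MoELayer(d_model, d_ff, num_experts=num_experts, top_k=top_k)  # from the sketch above

    def forward(self, x):
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                      # attention sublayer with residual connection
        x = x + self.moe(self.ffn_norm(x))    # MoE layer in place of the dense FFN
        return x
```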
Applications
Large Language Models
- Mixtral: Mistral AI's open-weights MoE model, competitive with much larger dense models
- Grok-1: xAI's open-weights MoE model with 314B total parameters (8 experts, 2 active per token)
- DeepSeek-V2: MoE LLM activating only ~21B of its 236B parameters per token
Benefits and Challenges
- Benefits: Scalability, specialization, efficiency, parallelism
- Challenges: Complex training, load balancing, memory requirements
Industry applications
- AI Research: Pushing the boundaries of model scale and capability
- Enterprise AI: Large-scale deployments with efficient inference
- Multilingual Models: Experts specialized for different languages