NVIDIA NVL72: Revolutionizing MoE Model Scaling with Expert Parallelism



Joerg Hiller
Oct 20, 2025 15:21

NVIDIA’s NVL72 systems are transforming large-scale MoE model deployment by introducing Wide Expert Parallelism, optimizing performance and reducing costs.



NVIDIA is advancing the deployment of large-scale Mixture of Experts (MoE) models with its NVL72 rack-scale systems, leveraging Wide Expert Parallelism (Wide-EP) to optimize performance and reduce costs, according to NVIDIA’s blog. The approach tackles the challenges of scaling MoE architectures, which are more efficient than dense models because they activate only a subset of their trained parameters for each token.
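
To make the parameter-activation point concrete, here is a minimal, framework-free sketch of top-k MoE routing. The expert count, hidden size, and random gating weights are illustrative assumptions, not details of NVIDIA’s models.

```python
# Minimal sketch of top-k MoE routing (toy sizes, random gating weights):
# only top_k of num_experts experts are activated for each token.
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d_model = 8, 2, 16    # illustrative values

tokens = rng.standard_normal((4, d_model))             # 4 example tokens
gate_w = rng.standard_normal((d_model, num_experts))   # toy router weights

logits = tokens @ gate_w                               # (tokens, experts)
top_experts = np.argsort(logits, axis=-1)[:, -top_k:]  # k experts per token

for t, experts in enumerate(top_experts):
    print(f"token {t} -> experts {sorted(experts.tolist())} "
          f"({top_k}/{num_experts} experts active)")
```

Because each token touches only its selected experts, the bulk of the model’s parameters sit idle for any given token, which is what makes distributing experts across GPUs attractive in the first place.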

Expert Parallelism and Its Impact

Expert Parallelism (EP) distributes a MoE model’s experts across multiple GPUs, improving both compute and memory-bandwidth utilization. As models like DeepSeek-R1 grow to hundreds of billions of parameters, EP becomes crucial for sustaining high performance and relieving memory pressure on each GPU.
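
The sketch below illustrates the basic idea of expert placement under EP: each rank owns a slice of the experts, so no single GPU has to hold them all. The expert count and EP group size are arbitrary examples, and the contiguous block mapping is just one possible placement scheme.

```python
# Hypothetical expert placement under expert parallelism (EP):
# each rank owns a contiguous block of experts, so the full expert set
# never has to fit on a single GPU.
num_experts, ep_size = 256, 32          # illustrative counts
experts_per_rank = num_experts // ep_size

def expert_to_rank(expert_id: int) -> int:
    """Map a global expert index to the EP rank that owns its weights."""
    return expert_id // experts_per_rank

placement: dict[int, list[int]] = {}
for e in range(num_experts):
    placement.setdefault(expert_to_rank(e), []).append(e)

print(f"{experts_per_rank} experts per rank; rank 0 owns {placement[0]}")
```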

Large-scale EP, which spreads experts across many more GPUs, increases the aggregate memory bandwidth available for expert weights and supports larger batch sizes, improving GPU utilization. However, it also introduces new system-level constraints, which NVIDIA’s TensorRT-LLM Wide-EP addresses through algorithmic optimizations targeting compute and memory bottlenecks.

System Design and Architecture

The effectiveness of scaling EP relies heavily on system design and architecture, particularly the interconnect bandwidth and topology, which facilitate efficient memory movement and communication. NVIDIA’s NVL72 systems use optimized software and kernels to manage expert-to-expert traffic, ensuring practical and efficient large-scale EP deployment.

Addressing Communication Overhead

Communication overhead is a significant challenge in large-scale EP, particularly during the inference decode phase, when tokens must be dispatched to remote experts and their outputs gathered back. NVIDIA’s NVLink technology, with its 130 TB/s of aggregate bandwidth per NVL72 rack, plays a crucial role in keeping these overheads manageable and making large-scale EP feasible.
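
The following toy example shows why decode-phase communication is bandwidth-bound: every routed token copy must be bucketed by the rank that owns its expert before an all-to-all style dispatch. The routing here is random and all sizes are assumptions; it illustrates the traffic pattern, not NVIDIA’s kernels.

```python
# Illustrative, framework-agnostic sketch of the decode-phase exchange:
# token copies are grouped by the EP rank that owns the chosen expert,
# then exchanged in an all-to-all style dispatch.
import numpy as np

rng = np.random.default_rng(1)
num_experts, ep_size, batch_tokens, top_k = 256, 32, 64, 8  # assumed sizes
experts_per_rank = num_experts // ep_size

# Routing decision for each token: top_k expert ids (random stand-in here).
routed = rng.integers(0, num_experts, size=(batch_tokens, top_k))

# Bucket token copies by destination rank.
send_counts = np.zeros(ep_size, dtype=int)
for expert_ids in routed:
    for e in expert_ids:
        send_counts[e // experts_per_rank] += 1

print("token copies sent to each EP rank:", send_counts.tolist())
print("send sizes change with routing every step, so the exchange is irregular")
```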

Kernel Optimization and Load Balancing

To optimize expert routing, custom communication kernels manage the non-static message sizes that arise as token-to-expert assignments change from step to step. NVIDIA’s Expert Parallel Load Balancer (EPLB) further improves utilization by redistributing experts so that no GPU is over- or under-loaded, which is crucial for maintaining efficiency in real-time production systems.
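
As a hedged illustration of the load-balancing idea (not NVIDIA’s EPLB implementation), the sketch below greedily reassigns experts to ranks based on observed per-expert load, packing the hottest experts onto the lightest ranks. The skewed load distribution and greedy policy are assumptions chosen for clarity.

```python
# Toy expert load balancer: given observed per-expert token counts,
# greedily place the hottest experts onto the least-loaded rank so that
# no GPU ends up over- or under-utilized.
import numpy as np

rng = np.random.default_rng(2)
num_experts, ep_size = 64, 8
load = rng.zipf(1.5, size=num_experts).astype(float)  # skewed expert popularity

assignment = {r: [] for r in range(ep_size)}
rank_load = np.zeros(ep_size)
for e in np.argsort(load)[::-1]:          # hottest experts first
    r = int(np.argmin(rank_load))         # lightest rank so far
    assignment[r].append(int(e))
    rank_load[r] += load[e]

print("per-rank load after rebalancing:", np.round(rank_load, 1).tolist())
print("imbalance (max/mean):", round(float(rank_load.max() / rank_load.mean()), 2))
```

A production balancer would also account for the cost of moving expert weights between GPUs and rebalance only when the expected gain outweighs that migration cost.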

Implications for AI Inference

Wide-EP on NVIDIA’s NVL72 systems provides a scalable solution for MoE models, reducing weight-loading pressure and improving GroupGEMM efficiency. In testing, large EP configurations demonstrated up to 1.8x higher per-GPU throughput compared to smaller setups, highlighting the potential for significant performance gains.
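
The GroupGEMM point can be pictured as one matrix multiply per (expert, token-group) pair on each rank; larger per-expert groups amortize the cost of loading expert weights, which is the effect Wide-EP targets. The shapes and group sizes below are assumptions for illustration.

```python
# Toy grouped GEMM over a rank's local experts: each expert multiplies only
# the tokens routed to it, so bigger groups amortize each weight load.
import numpy as np

rng = np.random.default_rng(3)
d_model, d_ff, num_local_experts = 16, 32, 4  # illustrative shapes

expert_weights = [rng.standard_normal((d_model, d_ff))
                  for _ in range(num_local_experts)]

# Tokens already bucketed per local expert (variable group sizes).
groups = [rng.standard_normal((n, d_model)) for n in (5, 1, 9, 3)]

# Grouped GEMM: one matmul per (expert, token-group) pair.
outputs = [x @ w for x, w in zip(groups, expert_weights)]
for i, y in enumerate(outputs):
    print(f"expert {i}: {groups[i].shape[0]} tokens -> output {y.shape}")
```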

The advancements in Wide-EP not only improve throughput and latency but also enhance system economics by increasing concurrency and GPU efficiency. This positions NVIDIA’s NVL72 as a pivotal platform for the cost-effective deployment of trillion-parameter models, giving developers, researchers, and infrastructure teams new opportunities to optimize AI workloads.



Source: https://blockchain.news/news/nvidia-nvl72-revolutionizing-moe-model-scaling