AWS Announces Amazon SageMaker HyperPod Support for P6e-GB200 UltraServers, Enabling Trillion-Parameter-Scale AI
Amazon Web Services has unveiled significant new AI infrastructure capabilities, bringing NVIDIA's most advanced GPU technology to its cloud platform, according to a recent announcement.
Contextualize
Today, AWS announced that Amazon SageMaker HyperPod now supports P6e-GB200 UltraServers, accelerated by NVIDIA GB200 NVL72 technology. The new offering provides access to configurations of up to 72 NVIDIA Blackwell GPUs in a single system, delivering 360 petaflops of dense 8-bit floating point (FP8) compute and 1.4 exaflops of sparse 4-bit floating point (FP4) compute. According to AWS, this marks a pivotal shift in the industry's ability to efficiently train and deploy trillion-parameter scale AI models with unprecedented performance and cost efficiency.
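For a rough sense of scale, dividing those headline totals by the GPU count yields the implied per-accelerator figures. The back-of-the-envelope check below is a minimal sketch that assumes the quoted system totals simply aggregate linearly across all 72 GPUs:

```python
# Implied per-GPU throughput from the announcement's system totals,
# assuming the totals aggregate linearly across all 72 GPUs.
gpus = 72
fp8_dense_pflops = 360       # system-wide dense FP8
fp4_sparse_pflops = 1400     # system-wide sparse FP4 (1.4 exaflops)

print(f"Dense FP8 per GPU:  {fp8_dense_pflops / gpus:.1f} PFLOPS")   # ~5.0
print(f"Sparse FP4 per GPU: {fp4_sparse_pflops / gpus:.1f} PFLOPS")  # ~19.4
```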
Key Takeaways
- AWS now offers UltraServers in two configurations, ml.u-p6e-gb200x36 (36 Blackwell GPUs) and ml.u-p6e-gb200x72 (72 Blackwell GPUs), each within a single NVLink domain (see the cluster-creation sketch after this list)
- The system provides up to 13.4 TB of high-bandwidth memory (HBM3e) and 130 TBps of low-latency NVLink bandwidth between GPUs, enabling efficient training of trillion-parameter models
- UltraServers deliver up to 28.8 Tbps of total Elastic Fabric Adapter (EFA) v4 networking and support up to 405 TB of local NVMe SSD storage
- AWS states the technology enables 30x faster inference on trillion-parameter LLMs compared to prior platforms, particularly when paired with NVIDIA Dynamo
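As a concrete illustration of how the new configurations might be provisioned, here is a hedged sketch using SageMaker HyperPod's existing CreateCluster API via boto3. The instance type string comes from the announcement; the cluster name, lifecycle script location, and IAM role ARN are placeholders, and the exact required fields for UltraServer capacity may differ:

```python
import boto3

# Hedged sketch: provisioning a HyperPod cluster with an UltraServer
# instance group via the existing SageMaker CreateCluster API.
# Every name, S3 path, and ARN below is a placeholder.
sagemaker = boto3.client("sagemaker", region_name="us-east-1")

response = sagemaker.create_cluster(
    ClusterName="gb200-ultraserver-cluster",        # placeholder name
    InstanceGroups=[
        {
            "InstanceGroupName": "ultraserver-group",
            "InstanceType": "ml.u-p6e-gb200x72",    # 72-GPU UltraServer config
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle/",  # placeholder
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
        }
    ],
)
print(response["ClusterArn"])
```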
Deepen
A key technical innovation in the UltraServers is the NVLink domain, which AWS describes as critical for large-scale AI training. In this architecture, each compute node within an UltraServer uses fifth-generation NVIDIA NVLink to provide up to 1.8 TBps of bidirectional, direct GPU-to-GPU interconnect bandwidth. This unified memory domain allows massive AI models to be efficiently partitioned across multiple GPUs while maintaining high-speed communication, effectively eliminating the bottlenecks typically seen when training across multiple disconnected systems.
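A quick worked example shows why a single 13.4 TB NVLink domain matters at this scale. The sketch below assumes BF16 weights (2 bytes per parameter) and deliberately ignores optimizer state and activations:

```python
# Feasibility check: do a trillion-parameter model's weights fit in
# one UltraServer's NVLink domain? Assumes BF16 weights (2 bytes per
# parameter); optimizer state and activations are ignored.
params = 1.0e12            # 1 trillion parameters
bytes_per_param = 2        # BF16
hbm_total_tb = 13.4        # from the announcement
gpus = 72

weights_tb = params * bytes_per_param / 1e12
per_gpu_gb = hbm_total_tb * 1000 / gpus
print(f"Weights: {weights_tb:.1f} TB vs {hbm_total_tb} TB of HBM "
      f"(~{per_gpu_gb:.0f} GB per GPU)")
# -> 2.0 TB of weights against 13.4 TB of HBM (~186 GB per GPU),
#    leaving headroom for activations, KV cache, and optimizer shards.
```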
The company also highlights that SageMaker HyperPod automatically implements topology-aware scheduling, applying labels to UltraServer compute nodes based on their Region, Availability Zone, Network Node Layers, and UltraServer ID, ensuring optimal placement of workloads across the infrastructure.
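In a Kubernetes-orchestrated HyperPod cluster, such labels would let a pod spec pin workloads to a particular UltraServer. The fragment below (expressed as a Python dict) is illustrative only: the first two keys are the standard Kubernetes topology labels, while the UltraServer ID key is a hypothetical stand-in for whatever label AWS actually applies.

```python
# Illustrative pod-spec fragment pinning a training pod to one
# UltraServer. The label *categories* (Region, AZ, UltraServer ID)
# come from the announcement; the UltraServer ID key below is a
# hypothetical stand-in, not a confirmed AWS label name.
pod_spec_fragment = {
    "nodeSelector": {
        "topology.kubernetes.io/region": "us-east-1",
        "topology.kubernetes.io/zone": "us-east-1-dfw-2a",
        "sagemaker.amazonaws.com/ultraserver-id": "ultraserver-0",  # hypothetical key
    }
}
print(pod_spec_fragment)
```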
Why It Matters
For AI researchers and organizations working on frontier models, AWS's introduction of UltraServers represents a significant advancement in accessible computing power. According to the announcement, the combined capabilities enable faster iteration cycles for developing and fine-tuning large language models (LLMs) and Mixture-of-Experts (MoE) architectures.
For enterprises deploying AI in production, the platform addresses key challenges in serving trillion-parameter models at scale. The company states that P6e-GB200 UltraServers can efficiently handle high-concurrency applications with long context windows, particularly when using NVIDIA Dynamo to disaggregate the compute-heavy prefill phase and memory-heavy decode phase onto different GPUs within the large NVLink domain.
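To make that prefill/decode split concrete, here is a toy, self-contained sketch of the disaggregation idea. It is not NVIDIA Dynamo's actual API; the pool classes are stand-ins for real GPU worker groups, and the "KV cache" is a trivial placeholder:

```python
from dataclasses import dataclass

# Toy illustration of disaggregated serving -- not Dynamo's API.
# Prefill (compute-bound) and decode (memory-bound) run on separate
# GPU pools; the prompt's KV cache is handed off between them, which
# a large NVLink domain makes cheap.

@dataclass
class Request:
    prompt: str
    max_tokens: int

class PrefillPool:
    def run_prefill(self, prompt: str) -> list[str]:
        # Stand-in for one compute-heavy forward pass over the prompt.
        return prompt.split()          # "KV cache" stub

class DecodePool:
    def generate(self, kv_cache: list[str], max_new_tokens: int) -> str:
        # Stand-in for the memory-bound, token-by-token decode loop.
        return " ".join(kv_cache[-1:] * max_new_tokens)

def serve(req: Request, prefill: PrefillPool, decode: DecodePool) -> str:
    kv_cache = prefill.run_prefill(req.prompt)        # phase 1: prefill
    return decode.generate(kv_cache, req.max_tokens)  # phase 2: decode

print(serve(Request("hello world", 4), PrefillPool(), DecodePool()))
```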
For infrastructure teams, SageMaker HyperPod's flexible training plans for UltraServer capacity allow organizations to reserve and manage these high-performance resources efficiently, while the platform's automated failover capabilities ensure resilience for critical AI workloads; AWS recommends provisioning spare nodes to support that failover.
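The reservation flow would presumably mirror SageMaker's existing flexible training plans. The boto3 sketch below assumes that API accepts the new UltraServer instance types; all names are placeholders and the response shape may differ:

```python
import boto3

# Sketch: reserving UltraServer capacity via SageMaker flexible
# training plans, assuming the existing API accepts the new types.
sm = boto3.client("sagemaker", region_name="us-east-1")

offerings = sm.search_training_plan_offerings(
    InstanceType="ml.u-p6e-gb200x36",     # 36-GPU UltraServer config
    InstanceCount=1,
    DurationHours=72,
    TargetResources=["hyperpod-cluster"],
)

plan = sm.create_training_plan(
    TrainingPlanName="gb200-reservation",  # placeholder
    TrainingPlanOfferingId=offerings["TrainingPlanOfferings"][0]["TrainingPlanOfferingId"],
)
print(plan["TrainingPlanArn"])
```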
Analyst's Note
AWS's addition of P6e-GB200 UltraServers to SageMaker HyperPod represents an important inflection point in the availability of enterprise-grade infrastructure for trillion-parameter-scale AI. While currently available only in the Dallas AWS Local Zone (us-east-1-dfw-2a), this deployment signals AWS's commitment to providing the highest tier of AI infrastructure to compete with specialized AI cloud providers.
The integration with SageMaker's existing orchestration, monitoring, and management capabilities gives AWS a compelling offering for organizations looking to develop frontier AI models without building custom infrastructure. However, the true impact will depend on pricing and availability, which will determine whether these resources democratize access to frontier model development or remain accessible primarily to well-funded AI labs and enterprises.
Source: Amazon Web Services Blog