AWS outlined its roadmap for building AI-optimized infrastructure during a keynote at the AI Infrastructure Summit in Santa Clara. Barry Cooks, VP of Compute Services at AWS, emphasized the need for a strong foundation in networking, compute, and resilience to support the accelerating shift toward agent-driven software development and large-scale AI training.
At the networking layer, AWS showcased its 10p10u AI network fabric, now supporting more than 20,000 GPUs with sub-10-microsecond latency. The company described a hybrid control-plane architecture that combines centralized planning with distributed decision-making in the switches to deliver faster rerouting and higher reliability. AWS said its Elastic Fabric Adapter (EFA), Scalable Reliable Datagram (SRD) protocol, and specialized congestion-control stack now handle millions of connections and serve as the backbone of its fastest-growing network technology.
On the compute side, AWS presented new EC2 P6 instances powered by NVIDIA Blackwell GPUs and introduced water-cooled UltraServers for improved thermals and density. Each P6-B200 instance delivers 1.4 Tbps of interconnect bandwidth and significant generation-over-generation reductions in training time. AWS also detailed its custom Trainium2 UltraServers, which deliver 83.2 PFLOPS of FP8 compute, 12.8 Tbps of EFA bandwidth, and 6 TB of high-bandwidth memory, positioned as the best price-performance option for GenAI training on AWS. UltraCluster deployments can now scale to more than 1.6 million accelerators in a single cluster.
To address resilience at scale, Cooks announced new checkpointing capabilities in SageMaker HyperPod, which leverage CPU memory and local storage to speed recovery after failures. AWS is also expanding observability and automation features in SageMaker to improve developer productivity and reduce training times. “Security, latency, performance, and resilience are the pillars of our infrastructure,” said Cooks. “Our goal is to ensure that AI builders can rely on AWS to deliver the scale, economics, and reliability they need.”
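The general technique behind that announcement is multi-tier checkpointing: snapshot training state frequently to fast CPU memory, flush to local storage less often, and on failure restore from the freshest tier available. The sketch below illustrates that idea in plain Python; the `TieredCheckpointer` class, its parameters, and the toy training loop are illustrative assumptions, not AWS's HyperPod implementation.

```python
import os
import pickle
import tempfile

class TieredCheckpointer:
    """Illustrative two-tier checkpointer (not AWS's implementation).

    Tier 1: an in-memory (CPU RAM) snapshot taken on every save, cheap
    and fresh but lost if the process dies.
    Tier 2: a periodic flush to local storage that survives process loss.
    """

    def __init__(self, disk_dir, disk_every=5):
        self.disk_path = os.path.join(disk_dir, "ckpt.pkl")
        self.disk_every = disk_every  # flush to disk every N steps
        self.memory_ckpt = None       # (step, state) held in CPU memory

    def save(self, step, state):
        # Tier 1: cheap in-memory snapshot on every call.
        self.memory_ckpt = (step, dict(state))
        # Tier 2: periodic flush to local storage.
        if step % self.disk_every == 0:
            with open(self.disk_path, "wb") as f:
                pickle.dump((step, state), f)

    def restore(self):
        # Prefer the in-memory copy (fastest, freshest); fall back to disk.
        if self.memory_ckpt is not None:
            return self.memory_ckpt
        if os.path.exists(self.disk_path):
            with open(self.disk_path, "rb") as f:
                return pickle.load(f)
        return None

# Usage: checkpoint a toy training loop, then recover the latest state.
with tempfile.TemporaryDirectory() as d:
    ckpt = TieredCheckpointer(d, disk_every=5)
    for step in range(1, 8):
        state = {"step": step, "loss": 1.0 / step}
        ckpt.save(step, state)
    step, state = ckpt.restore()
    print(step)  # 7 — the in-memory tier is the freshest
```

The trade-off is the one the announcement implies: the memory tier makes recovery nearly free when the process survives, while the local-storage tier bounds how much work is lost when it does not.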

🌐 Analysis: AWS is positioning itself not just as a supplier of raw compute, but as an integrated infrastructure provider across networking, compute, and software resilience. Its emphasis on water-cooled GPUs, large-scale AI fabrics, and SageMaker checkpointing directly addresses hyperscaler-scale challenges. This mirrors moves by rivals Microsoft Azure and Google Cloud, which are also investing in custom silicon and high-density AI clusters, but AWS is leaning on both Blackwell GPUs and its own Trainium line to balance peak performance with economics.