Converge Digest
Saturday, April 18, 2026

AI Infrastructure Summit: AWS Pushes 20,000+ GPU Fabrics with <10µs Latency

September 9, 2025

AWS outlined its roadmap for building AI-optimized infrastructure during a keynote at the AI Infrastructure Summit in Santa Clara. Barry Cooks, VP of Compute Services at AWS, emphasized the need for a strong foundation in networking, compute, and resilience to support the accelerating shift toward agent-driven software development and large-scale AI training.

At the networking layer, AWS showcased its 10p10u AI network fabric, which now supports more than 20,000 GPUs with sub-10-microsecond latency. The company described a hybrid control-plane architecture that combines centralized planning with distributed decision-making in the switches to deliver faster rerouting and greater reliability. AWS said its Elastic Fabric Adapter (EFA), Scalable Reliable Datagram (SRD) protocol, and specialized congestion-control stack now handle millions of connections and serve as the backbone of its fastest-growing network technology.

On the compute side, AWS presented new EC2 P6 instances powered by NVIDIA Blackwell GPUs and introduced water-cooled UltraServers for improved thermals and density. Each P6-B200 instance delivers 1.4 Tbps of interconnect bandwidth and significant generation-over-generation improvements in training time. AWS also detailed its custom Trainium2 UltraServers, which deliver 83.2 PFLOPS of FP8 compute, 12.8 Tbps of EFA bandwidth, and 6 TB of high-bandwidth memory, and which AWS positions as the best price-performance option for GenAI training on its platform. UltraCluster deployments can now scale to more than 1.6 million accelerators in a single cluster image.

To address resilience at scale, Cooks announced new checkpointing capabilities in SageMaker HyperPod that leverage CPU memory and local storage to speed recovery after failures. AWS is also expanding observability and automation features in SageMaker to improve developer productivity and reduce training times. “Security, latency, performance, and resilience are the pillars of our infrastructure,” said Cooks. “Our goal is to ensure that AI builders can rely on AWS to deliver the scale, economics, and reliability they need.”
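The idea behind tiered checkpointing is simple: write the checkpoint to fast node-local storage first so the training loop can resume immediately, then replicate it to durable storage off the critical path. AWS has not published HyperPod's internal API, so the sketch below is purely illustrative; the class and method names (`TieredCheckpointer`, `save_checkpoint`) are hypothetical and the "durable tier" is just a second directory standing in for remote object storage.

```python
import os
import pickle
import threading

class TieredCheckpointer:
    """Illustrative tiered checkpointing: block only on a fast local
    write, replicate to durable storage in a background thread."""

    def __init__(self, local_dir, durable_dir):
        self.local_dir = local_dir
        self.durable_dir = durable_dir  # stand-in for remote storage
        os.makedirs(local_dir, exist_ok=True)
        os.makedirs(durable_dir, exist_ok=True)

    def save_checkpoint(self, step, state):
        # 1. Fast, blocking write to node-local storage.
        local_path = os.path.join(self.local_dir, f"ckpt-{step:08d}.pkl")
        with open(local_path, "wb") as f:
            pickle.dump(state, f)
        # 2. Slower replication runs off the training loop's critical path.
        t = threading.Thread(target=self._replicate, args=(local_path, step))
        t.start()
        return local_path, t

    def _replicate(self, local_path, step):
        durable_path = os.path.join(self.durable_dir, f"ckpt-{step:08d}.pkl")
        with open(local_path, "rb") as src, open(durable_path, "wb") as dst:
            dst.write(src.read())

    def load_latest(self):
        # Prefer the local copy for fast recovery; fall back to durable.
        for d in (self.local_dir, self.durable_dir):
            ckpts = sorted(os.listdir(d))
            if ckpts:
                with open(os.path.join(d, ckpts[-1]), "rb") as f:
                    return pickle.load(f)
        return None
```

The point of the design is that the training loop only pays for step 1; if the node survives, recovery reads the local copy, and if it does not, the durable replica is still available.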

🌐 Analysis: AWS is positioning itself not just as a supplier of raw compute, but as an integrated infrastructure provider across networking, compute, and software resilience. Its emphasis on water-cooled GPUs, large-scale AI fabrics, and SageMaker checkpointing directly addresses hyperscaler-scale challenges. This mirrors moves by rivals Microsoft Azure and Google Cloud, which are also investing in custom silicon and high-density AI clusters, but AWS is leaning on both Blackwell GPUs and its own Trainium line to balance peak performance with economics.


Jim Carroll
Editor and Publisher, Converge! Network Digest, Optical Networks Daily - Covering the full stack of network convergence from Silicon Valley

© 2025 Converge Digest - A private dossier for networking and telecoms.
