At Google Cloud Next ’25 in Las Vegas, CEO Thomas Kurian showcased the sweeping momentum of Google’s AI transformation across its cloud, infrastructure, and product ecosystem. With over 3,000 new features introduced in 2024, more than 4 million developers building on Gemini, and a 20x spike in usage of Vertex AI, Kurian emphasized that Google Cloud is delivering AI at planetary scale. New global regions in Sweden, South Africa, and Mexico—alongside a vast, resilient backbone of over 2 million miles of fiber—underscore how Google is laying the foundation for real-time, AI-powered enterprise services.
At the heart of these innovations is Google's AI Hypercomputer, a supercomputing system designed to simplify AI deployment while maximizing performance and cost efficiency. Anchored by Ironwood TPUs delivering 42.5 exaflops per pod and supported by NVIDIA Blackwell GPUs, the AI Hypercomputer integrates compute, storage, and software. Enhancements such as Hyperdisk Exapools, Anywhere Cache, and new GKE inference capabilities help customers achieve up to 24x more intelligence per dollar than leading alternatives. These advances are accessible both in the cloud and on-premises, with Google Distributed Cloud (GDC) extending Gemini to sovereign and air-gapped environments, including deployments authorized for U.S. government use.
Highlights:
• 4M+ developers building with Gemini models; Vertex AI usage up 20x year-over-year.
• More than 2B AI assists per month in Google Workspace are reshaping productivity across businesses.
• Cloud WAN launched: Enterprises can now run on Google's global network, improving performance by up to 40% while reducing total cost of ownership by up to 40%.
• AI Hypercomputer:
  • Ironwood TPUs: 9,216 chips per pod, 42.5 exaflops, a 10x improvement over previous-generation TPUs (see addendum below).
  • NVIDIA Blackwell GPUs (B200, GB200) now available in Google Cloud, with next-generation Vera Rubin GPUs to follow.
  • Hyperdisk Exapools & Anywhere Cache cut storage latency by up to 70%.
  • GKE inference optimizations: up to 30% lower serving costs and up to 60% lower tail latency.
  • Pathways & vLLM: Google's distributed ML runtime plus PyTorch-friendly serving on TPUs (see the sketch after this list).
• Gemini on-premises: Google Distributed Cloud now delivers Gemini locally, including in air-gapped environments built with NVIDIA and Dell, authorized for U.S. Secret and Top Secret workloads.
• Customer momentum: 500+ real-world success stories from global brands like Airbus, Honeywell, Intuit, Samsung, Reddit, and the Government of Singapore.
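To ground the Pathways & vLLM item above, here is a minimal sketch of serving a model through vLLM's offline Python API; on a Cloud TPU VM with vLLM's TPU backend installed, the same code targets TPU chips. The model name and prompt are illustrative choices, not from the keynote.

from vllm import LLM, SamplingParams

# Sketch only: assumes a vLLM install with TPU support on a Cloud TPU VM.
# The checkpoint below is an arbitrary example, not a Google recommendation.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize Google Cloud Next '25 in one sentence."], params)
print(outputs[0].outputs[0].text)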
Watch the full keynote and dive deeper into Google Cloud’s AI vision at Google Cloud Next.

Addendum: Ironwood — Google’s Inference-First TPU with Breakthrough ICI Performance
Announced at Google Cloud Next ’25
At Google Cloud Next ’25, Google introduced Ironwood, its most advanced and scalable TPU to date, built specifically for inference workloads. Representing a 10x leap in performance over previous TPU generations, Ironwood is optimized for today’s most computationally demanding AI models: LLMs, mixture-of-experts (MoE) models, and next-generation reasoning systems. It is offered in two configurations: a 256-chip pod and a 9,216-chip pod capable of delivering 42.5 exaFLOPS of compute, more than 24x the compute power of the world’s top-ranked traditional supercomputer, El Capitan.
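The pod-level number checks out against the per-chip figure listed in the highlights below:

9,216 chips × 4,614 TFLOPS per chip ≈ 42,522,624 TFLOPS ≈ 42.5 exaFLOPS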
A cornerstone of Ironwood’s performance is its Inter-Chip Interconnect (ICI), a custom high-speed fabric that links the thousands of TPUs in a pod. The new ICI delivers up to 1.2 terabits per second (Tbps) of bidirectional bandwidth per chip, a 1.5x improvement over the previous-generation Trillium TPU. This low-latency, high-throughput network is critical for massive model parallelism, enabling fast, efficient communication between TPU chips across the pod. By keeping data where it is needed, the ICI reduces inter-chip latency and improves training and inference throughput at hyperscale. Spanning nearly 10 megawatts of interconnected compute, Ironwood’s ICI allows synchronized communication across thousands of chips, unlocking new levels of distributed AI performance.
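To illustrate how software exercises this interconnect, the sketch below uses JAX's standard sharding API to spread a matrix multiply over a small TPU mesh; XLA compiles the cross-chip collectives, which travel over ICI on TPU hardware. The 4x2 mesh and the tensor shapes are assumptions for a single-host example, not an Ironwood configuration.

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Sketch: assumes 8 TPU chips visible to this host.
devices = mesh_utils.create_device_mesh((4, 2))
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the contraction dimension over the "model" axis so the matmul
# needs a cross-chip reduction; XLA lowers it to collectives over ICI.
x = jax.device_put(jnp.ones((4096, 8192)), NamedSharding(mesh, P("data", "model")))
w = jax.device_put(jnp.ones((8192, 16384)), NamedSharding(mesh, P("model", None)))

@jax.jit
def forward(x, w):
    return x @ w

y = forward(x, w)
print(y.sharding)  # The result stays sharded across the mesh.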
Key Ironwood Highlights:
• Inference-first TPU architecture: Designed to power proactive “thinking” models like Gemini 2.5 with high-performance serving at scale.
• Compute scale: Up to 42.5 exaFLOPS per pod across 9,216 liquid-cooled chips (4,614 TFLOPS peak per chip).
• Memory capacity & bandwidth:
  • 192 GB of HBM per chip (6x Trillium), supporting vast models and reducing off-chip data movement.
  • 7.2 TBps of memory bandwidth per chip for rapid tensor processing.
• Breakthrough ICI (Inter-Chip Interconnect):
  • 1.2 Tbps bidirectional bandwidth per chip (1.5x Trillium).
  • High-efficiency 3D torus topology enables low-latency, high-volume chip-to-chip communication.
  • Engineered for pod-wide synchronous computation and minimal data-movement bottlenecks.
• Enhanced SparseCore: Expanded support for ranking, recommendation, and scientific workloads with ultra-large embeddings.
• Power efficiency: 2x the performance per watt of Trillium and a 30x improvement over TPU v2, enabled by advanced liquid cooling and architectural refinement.
• Pathways software integration: Enables efficient scaling of AI across hundreds of thousands of Ironwood TPUs using Google’s distributed ML runtime (see the sketch after this list).
• Available later this year via Google Cloud, with native support for PyTorch, JAX, and Google’s full AI infrastructure stack.
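On the Pathways point above: the Pathways runtime itself is Google-internal, but the public pattern for scaling JAX across a multi-host TPU slice is similar in spirit. A minimal sketch, assuming a multi-host Cloud TPU slice where every host launches the same program:

import jax

# Sketch: every host in the TPU slice runs this identical program.
# On Cloud TPU, initialize() auto-discovers the coordinator and peers.
jax.distributed.initialize()

print(f"process {jax.process_index()} of {jax.process_count()}; "
      f"{jax.local_device_count()} local chips, "
      f"{jax.device_count()} chips in the slice")

# From here, the Mesh/NamedSharding pattern from the earlier sketch can
# span every chip in the slice, since jax.devices() is now global.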
Ironwood is a foundational pillar of Google Cloud’s AI Hypercomputer, built to deliver scalable, cost-effective, and energy-efficient AI at planetary scale—ushering in the true age of inference for enterprise and research workloads alike.