Google outlined its vision for scaling AI compute at the AI Infrastructure Summit in Santa Clara, where Mark Lohmeyer, VP & GM of Compute and AI Infrastructure, delivered the keynote “What’s Next for the Foundations of AI.” Lohmeyer compared the speed of today’s AI breakthroughs to the early internet, pointing to surging demand for compute and power efficiency as the defining challenges of this era.
He revealed that AI token processing across Google products hit 980 trillion tokens per month in June 2025, doubling in just two months. At this scale, power availability—not chips or datacenter space—has become the primary constraint. Google is addressing this by driving efficiency across the stack, claiming a 33x reduction in energy per prompt for Gemini over the past year, with each prompt consuming just 0.25 watt-hours, equivalent to nine seconds of video playback.
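For a sense of scale, here is a quick back-of-envelope check of those figures in Python; the population, novel-length, and playback-power numbers are our own assumptions, not from the keynote:

```python
# Sanity-check the keynote's scale claims (assumptions are ours, not Google's).
monthly_tokens = 980e12        # tokens processed per month (June 2025, per keynote)
world_population = 8.2e9       # assumed
tokens_per_novel = 120_000     # assumed: ~90k words at ~1.3 tokens per word

novels_per_person = monthly_tokens / world_population / tokens_per_novel
print(f"~{novels_per_person:.1f} novel(s) per person per month")   # ~1.0

prompt_wh = 0.25               # watt-hours per Gemini prompt (per keynote)
playback_watts = 100           # assumed power draw of video playback
seconds_of_video = prompt_wh / playback_watts * 3600
print(f"~{seconds_of_video:.0f} seconds of video playback")        # ~9
```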
Google also spotlighted its TPU Ironwood platform, which scales to 9,216 chips per pod delivering 42.5 exaflops of peak compute and 1.77 PB of shared HBM, as well as partnerships with NVIDIA on Blackwell GPUs. New services like Inference Gateway, Dynamic Workload Scheduler, and AI-optimized storage aim to cut costs, reduce latency, and simplify deployment of large-scale inference workloads.
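The pod-level figures follow directly from Google's published per-chip Ironwood specifications, as this quick check shows:

```python
# Ironwood pod math from Google's published per-chip specs.
chips_per_pod = 9_216
tflops_per_chip = 4_614        # peak per-chip compute (FP8), in teraflops
hbm_gb_per_chip = 192          # HBM capacity per chip, in GB

pod_exaflops = chips_per_pod * tflops_per_chip / 1e6
pod_hbm_pb = chips_per_pod * hbm_gb_per_chip / 1e6
print(f"{pod_exaflops:.1f} exaflops, {pod_hbm_pb:.2f} PB HBM")  # 42.5, 1.77
```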
• AI token traffic surged to 980 trillion per month by June 2025 (2x growth in 2 months)
• Equivalent to every person on Earth reading a novel monthly
• Power availability is the new limiting factor for AI infrastructure buildouts
• Gemini prompt uses ~0.25 Wh — equal to 9 seconds of video streaming
• Google cut energy per Gemini prompt 33x in one year
• Efficiency gains driven by speculative decoding, disaggregated serving, and mixture-of-experts models (a speculative-decoding sketch follows this list)
• TPU Ironwood delivers 5x compute and 6x memory vs. the prior-generation Trillium
• 9,216 Ironwood chips scale into a single pod with 42.5 exaflops of peak compute
• Pods linked by optical fabric with dynamic reconfiguration for resilience
• 1.77 PB of shared HBM accessible across the pod’s 9,216 chips, easing memory bottlenecks
• Fifth-generation liquid cooling deployed across Ironwood systems
• Google TPU platform now in its 7th generation, >10 years of iteration
• Native PyTorch support coming to TPUs alongside JAX/TF
• Partnership with NVIDIA: Blackwell B200 and GB200 GPUs integrated in Google Cloud via the A4 and A4X VM series
• Three major NVIDIA-backed services launched this year for inference/training
• Inference Gateway GA: AI-aware routing balances workloads across servers
• Features include prefix-aware routing and disaggregated serving (routing sketch after this list)
• Inference Optimizer delivers best-practice configs and continuous tuning
• Dynamic Workload Scheduler: new consumption model with flex-start and calendar reservations
• Custom compute classes: workload profiles that automatically shift between TPUs and GPUs across pricing tiers (fallback sketch after this list)
• AI-optimized storage caches model weights near accelerators, cutting load times by 96% (caching sketch after this list)
• Removes the need for customers to build their own caching layers; adopters include Palantir and Toyota
• Long-context workloads supported via high-performance managed storage
• Cloud WAN interconnect delivers up to 40% faster network performance globally
• Example: Toyota reduced AI model creation time by 20% using Google infrastructure
• Small Toyota team built full AI platform in half the expected time
• Google’s approach integrates compute, storage, networking, frameworks, and deployment into a full AI stack
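On speculative decoding, one of the efficiency drivers named above: a small draft model proposes a few tokens and the large target model verifies them in one pass, so most steps cost only draft-model compute. Below is a minimal greedy sketch of the control flow; the two model callables are hypothetical stand-ins, not Google APIs:

```python
from typing import Callable, List

def speculative_decode(
    draft_next: Callable[[List[int]], int],    # cheap draft model: next-token guess
    target_next: Callable[[List[int]], int],   # large target model: next token
    prompt: List[int],
    max_new: int = 64,
    k: int = 4,                                # tokens drafted per verification step
) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the prefix the
    target agrees with, then take one token from the target itself."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft k candidate tokens with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify the drafts against the target model. A real serving
        #    stack scores all k positions in a single batched forward pass.
        accepted = 0
        for i, t in enumerate(draft):
            if target_next(tokens + draft[:i]) == t:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3. Append the target's own token: the correction on a mismatch,
        #    or a free bonus token when every draft was accepted.
        tokens.append(target_next(tokens))
    return tokens[: len(prompt) + max_new]
```

The speed-up comes from step 2: when most drafts are accepted, the expensive model runs far fewer sequential steps.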
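Prefix-aware routing, one of the Inference Gateway features above, steers requests that share a prompt prefix to the same model server so its KV cache can be reused. Google hasn't published the internals at this level of detail; the sketch below just illustrates the idea with a hash over the first N characters:

```python
import hashlib

class PrefixAwareRouter:
    """Illustrative only: route requests that share a prompt prefix to the
    same replica to maximize KV-cache hits."""

    def __init__(self, replicas: list[str], prefix_chars: int = 256):
        self.replicas = replicas
        self.prefix_chars = prefix_chars   # how much of the prompt defines affinity

    def pick(self, prompt: str) -> str:
        prefix = prompt[: self.prefix_chars]
        digest = hashlib.sha256(prefix.encode()).digest()
        return self.replicas[int.from_bytes(digest[:8], "big") % len(self.replicas)]

router = PrefixAwareRouter(["replica-a", "replica-b", "replica-c"])
system_prompt = "You are a helpful assistant for Acme Corp. " * 8   # shared prefix
# Same prefix -> same replica -> warm KV cache for the second request:
print(router.pick(system_prompt + "Summarize this document."))
print(router.pick(system_prompt + "Translate this document."))
```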
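The custom compute classes above behave like a priority list of accelerator-and-pricing tiers that the scheduler walks until it finds capacity. In GKE this is expressed declaratively; the Python below is only an illustrative model, with hypothetical tier names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    accelerator: str   # a TPU or GPU machine type (names here are hypothetical)
    pricing: str       # e.g. "reserved", "flex-start", "spot"

# Hypothetical workload profile: preferred option first, fallbacks after.
PROFILE = [
    Tier("tpu-ironwood", "reserved"),
    Tier("tpu-ironwood", "flex-start"),
    Tier("gpu-b200", "flex-start"),
    Tier("gpu-b200", "spot"),
]

def has_capacity(tier: Tier, available: set[Tier]) -> bool:
    """Stand-in for a real capacity probe against the cloud scheduler."""
    return tier in available

def schedule(profile: list[Tier], available: set[Tier]) -> Tier:
    for tier in profile:               # walk tiers in priority order
        if has_capacity(tier, available):
            return tier
    raise RuntimeError("no capacity in any tier; queue or retry later")

# Example: reserved TPUs are full, but flex-start GPUs have room.
print(schedule(PROFILE, {Tier("gpu-b200", "flex-start")}))
```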
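Finally, the AI-optimized storage bullet describes a read-through cache for model weights: fetch once from object storage, then serve every later load from fast storage sitting beside the accelerator. A generic sketch; the paths and download stub are illustrative, not a Google Cloud API:

```python
import pathlib

CACHE_DIR = pathlib.Path("/mnt/local-ssd/model-cache")   # assumed fast local disk

def fetch_from_object_store(model_id: str, dest: pathlib.Path) -> None:
    """Placeholder for the slow bulk download (e.g. from an object-storage bucket)."""
    raise NotImplementedError("wire up your storage client here")

def load_weights(model_id: str) -> pathlib.Path:
    """Read-through cache: hit local storage first, fall back to object storage."""
    cached = CACHE_DIR / model_id
    if cached.exists():
        return cached                  # warm start: no network transfer needed
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    tmp = cached.with_suffix(".tmp")
    fetch_from_object_store(model_id, tmp)
    tmp.rename(cached)                 # atomic publish so readers never see partials
    return cached
```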
“Power has become one of the most precious commodities we have, and the only way forward is relentless efficiency across the entire stack,” Lohmeyer said.
🌐 Analysis: This keynote reinforced Google’s strategy to pair NVIDIA GPUs with its custom TPU roadmap while attacking the power efficiency challenge head-on. The Ironwood superpod demonstrates Google’s ability to scale custom silicon rivaling AWS Trainium/Inferentia and Microsoft Maia/Cobalt. The focus on inference infrastructure—storage, scheduling, and latency optimization—shows that serving models efficiently is emerging as the next competitive battleground.
🌐 We’re tracking the latest developments in AI infrastructure. Follow our ongoing coverage at: https://convergedigest.com/category/ai-infrastructure/