At the Photonic Enabled Cloud Computing (PECC) Summit in Silicon Valley, Ryohei Urata of Google outlined what he called three potential “brick walls” for the AI infrastructure era: reliability, power, and bandwidth. His keynote focused on the first—reliability and availability—arguing that it now represents the single most difficult engineering challenge at hyperscale.
“AI workloads are synchronous. If one piece fails, the whole job stops,” Urata said. “That makes reliability a day-to-day concern for us at Google.”

AI-Scale Availability
Urata described how Google’s AI data centers, built around large-scale TPU systems, face a dramatically different reliability profile from that of conventional compute clusters. Traditional web and storage services can tolerate isolated server failures. In contrast, AI training workloads require synchronized operation across thousands of accelerators.
A typical TPUv4 cluster includes roughly one million optical links. Even with a daily failure rate as low as 0.004 percent, this translates to about forty link failures every day—enough to disrupt training jobs. “At that scale, what seems like a small failure rate becomes a major availability problem,” Urata said.
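The arithmetic behind that forty-per-day figure is worth making explicit. A quick back-of-the-envelope check, using only the numbers Urata quoted:

```python
# Expected daily link failures at hyperscale, using the figures from the talk.
links = 1_000_000                  # optical links in a TPUv4-class cluster
daily_failure_rate = 0.004 / 100   # 0.004 percent per link per day

expected_failures_per_day = links * daily_failure_rate
print(expected_failures_per_day)   # roughly 40 failures per day
```

Because a synchronous training job can be stalled by any one of those failures, the per-link rate matters far less than the cluster-wide expectation.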
Google tracks reliability using both Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR). “Availability is a function of both,” he noted. “You have to design for fewer failures—and faster recovery.”
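The standard steady-state availability formula makes his point concrete: availability is MTBF / (MTBF + MTTR), so halving repair time buys as much as doubling the time between failures. A minimal sketch (the example hours are illustrative, not Google figures):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative numbers: halving repair time improves availability
# exactly as much as doubling time between failures.
print(availability(1000, 1.0))  # ~0.9990
print(availability(1000, 0.5))  # ~0.9995
print(availability(2000, 1.0))  # ~0.9995
```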
Hardening the Optics
Urata outlined three approaches to improving reliability: strengthening component design, improving system architecture, and accelerating repair.
For component design, Google emphasizes thorough fault-mode analysis and testing well beyond formal specifications. “Testing to the spec is just the beginning,” he said. “You need large sample sizes—hundreds are not enough if you’re chasing a 0.01 percent failure rate.”
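His sample-size point follows from simple probability: at a 0.01 percent per-unit failure rate, a test lot of a few hundred will almost always show zero failures, telling you nothing. A quick illustration (the sample sizes here are assumed for the example; only the 0.01 percent rate comes from the talk):

```python
# Probability of observing zero failures in a sample of n units,
# given a true per-unit failure probability of 0.01 percent.
p = 0.0001  # 0.01 percent

for n in (100, 1_000, 10_000, 100_000):
    p_zero = (1 - p) ** n  # chance the sample shows no failures at all
    print(f"n={n:>6}: P(zero failures) = {p_zero:.3f}")
```

Even at n = 10,000 there is still about a one-in-three chance of seeing no failures from a population that genuinely fails at the target rate, which is why "hundreds are not enough."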
Because AI systems often deploy new optical and DSP technologies early in their lifecycle, Urata said there is limited time for aging studies and reliability modeling. “There’s no magic—just solid design, sweat, and iteration.”
He also described how manufacturing scale adds new challenges. Ramp rates for AI optics are an order of magnitude higher than for traditional data center deployments. Google’s approach is to perform intensive reliability testing early in production, then taper as yield data stabilizes, while maintaining rapid feedback loops when field issues occur.

Manufacturing and Quality Control
Field data from Google’s TPUv6 optical systems reveal a long tail of failure modes. “We’ve eliminated laser failures, but what remains are dozens of smaller contributors—firmware issues, wire bonds, PCB quality. It’s the death of a thousand cuts,” Urata said. He emphasized that suppliers who can improve reliability fastest will win future orders: “People who solve this problem are the ones getting the POs.”
System Design and Optical Switching
Urata pointed to system-level design as a critical reliability lever. Google uses optical circuit switching (OCS) to interconnect TPU racks, enabling flexible topology and isolation of faulty nodes. “With OCS, we can reshape the system and select out healthy racks dynamically,” he said. “It can mean the difference between a stalled training job and one that keeps running.”
This optical flexibility also yields performance benefits—some workloads complete training up to three times faster.
Faster Repair and Predictive Monitoring
Beyond optics and system architecture, Google focuses heavily on fast diagnosis and repair. Automated monitoring and telemetry provide real-time visibility across the optical fabric. “When something breaks, you need to know what failed and fix it quickly,” Urata said.
He highlighted the need for more predictive and self-correcting systems powered by machine learning. “We’re actively using AI to drive repair prioritization and predict failures before they happen,” he said. “These systems are very complex—automation is the only way to keep up.”
Designing Thoughtfully for Change
Urata cautioned against making design changes solely to reduce cost or power. “Don’t tinker for the sake of tinkering,” he warned. “A ten percent power savings isn’t worth it if it raises your failure rate and leaves a thousand TPUs idle. That’s far more expensive.”
Key Takeaways
- AI workloads are highly sensitive to optical failures; even small failure rates scale to major system outages.
- Availability depends on both raising MTBF (fewer failures) and lowering MTTR (faster recovery); component quality and repair speed must improve together.
- Google’s optical circuit-switching fabric enables reconfigurable topologies for higher availability.
- Manufacturing quality, firmware robustness, and rapid feedback loops are now critical reliability factors.
- Predictive monitoring and automated repair are central to Google’s next-generation infrastructure strategy.