Converge Digest
Saturday, April 11, 2026

PECC Summit: Meta’s Drew Alduino on AI Networking Reliability Walls

October 23, 2025

At the Photonic Enabled Cloud Computing (PECC) Summit in Silicon Valley, Drew Alduino, Director of Optical Infrastructure at Meta, addressed the extraordinary pace and complexity of AI infrastructure growth—and the emerging limits of today’s data-center architectures.

Meta, he said, brought online more than a gigawatt of new capacity this year and is investing tens of billions of dollars annually to expand AI training infrastructure. “It’s hard to call this a bubble when every major company is scaling this aggressively,” he observed. “The challenge is how to sustain it.”

Alduino cited Meta’s recently announced 5-gigawatt data-center campus in Louisiana, large enough to cover a significant portion of Manhattan, as an example of the physical scale now required for AI clusters measured in millions of nodes across multiple regions.

From Scale-Out to Scale-Up

AI clusters have grown from the 24,000-node systems Meta deployed in 2023 to designs exceeding 129,000 GPUs today, and will soon extend to multimillion-node, multi-regional fabrics. The architectural question has shifted from how to scale out across racks to how to scale up within and between racks as power and cooling densities rise.

Meta’s earlier rack generations, built around A100, H100, and H200 GPUs, fit within a single physical rack. The newest systems require multi-rack configurations to accommodate cooling and power infrastructure. “We now have compute that spans two racks and needs a six-rack physical solution when you add the cooling,” Alduino said. “Even maintaining a single copper-reach backplane between them becomes a challenge.”

The Limits of Copper

Alduino described how passive electrical backplanes, once considered the simplest and most reliable connection, are becoming bottlenecks as cluster scale and thermal load increase. “How are you going to get more scalable and more reliable than a copper wire?” he asked rhetorically. “That’s the problem we’re facing.”

As scale-up connectivity pushes beyond the reach of passive copper, alternatives such as co-packaged optics (CPO), near-packaged optics (NPO), active electrical cables (AEC), and active optical cables (AOC) are being evaluated for their performance, reliability, and serviceability trade-offs.

Reliability, Availability, and Serviceability

Alduino framed Meta’s design philosophy around three interdependent goals:

  • Reliability – the physical robustness of the components themselves.
  • Availability – the system-level resilience and amount of spare capacity needed to absorb inevitable failures.
  • Serviceability – the practical ability to detect, access, and replace failed parts without excessive downtime.

“Reliability is what breaks and when; availability is how much capacity you lose when it does; and serviceability is how quickly you can fix it,” he explained.
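
One common way to connect these three terms, offered here as an illustrative sketch and not as Meta's own model, is the textbook steady-state availability formula: availability = MTBF / (MTBF + MTTR), where MTBF (mean time between failures) stands in for reliability and MTTR (mean time to repair) stands in for serviceability. The function name and all numbers below are hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable component is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures chosen only to show the effect of repair time.
# Same reliability (MTBF), different serviceability (MTTR).
fast_repair = steady_state_availability(mtbf_hours=1_000_000, mttr_hours=1)
slow_repair = steady_state_availability(mtbf_hours=1_000_000, mttr_hours=48)

print(f"unavailability with 1 h repair:  {1 - fast_repair:.2e}")
print(f"unavailability with 48 h repair: {1 - slow_repair:.2e}")
```

With reliability held fixed, a longer repair time alone lowers availability, which is why the three goals have to be considered together.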

Different technologies affect these factors in different ways.  A pluggable transceiver failure might take down a single link, easily rerouted in the scale-out domain.  A CPO failure, by contrast, could disable multiple ports and a larger section of the fabric, raising questions about repair time and spare capacity.  “If a CPO port fails, do I lose a switch node?” Alduino asked. “That’s the question we’re trying to answer.”
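
The trade-off between failure rate and failure scope can be sketched with simple arithmetic: expected port-hours lost per year is the number of failures per year, times the ports taken down per failure, times the hours needed to repair. The example below is a rough illustration under assumed numbers. The fleet sizes, MTBF values, ports per failure, and repair times are all hypothetical and are not taken from the talk.

```python
HOURS_PER_YEAR = 8760


def port_hours_lost_per_year(devices: int, mtbf_hours: float,
                             ports_per_failure: int, repair_hours: float) -> float:
    """Expected port-hours of lost connectivity per year for a fleet.

    Assumes a constant failure rate of 1 / mtbf_hours per device.
    """
    failures_per_year = devices * HOURS_PER_YEAR / mtbf_hours
    return failures_per_year * ports_per_failure * repair_hours


# Hypothetical fleet of 8,000 ports built two ways.
# Pluggables: one module per port, each failure drops one port, quick swap.
pluggable = port_hours_lost_per_year(
    devices=8_000, mtbf_hours=1_000_000, ports_per_failure=1, repair_hours=1)

# CPO: 1,000 optical engines of 8 ports each (assumed), fewer failures,
# but each failure drops 8 ports and takes longer to repair (assumed).
cpo = port_hours_lost_per_year(
    devices=1_000, mtbf_hours=5_000_000, ports_per_failure=8, repair_hours=8)

print(f"pluggable: {pluggable:.1f} port-hours/year")
print(f"CPO:       {cpo:.1f} port-hours/year")
```

Under these assumed inputs the more reliable part still loses more port-hours, because each failure is wider and slower to fix. Different inputs reverse the result, which is why measured field data matters.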

Meta’s CPO Evaluation

Meta is now testing CPO and pluggable optical systems side by side at scale, with roughly 15 million CPO device-hours and 2 million pluggable-module hours logged so far, to establish statistically significant reliability data. The results are promising: CPO modules show about a 5× improvement in MTBF (mean time between failures) over comparable pluggables, with roughly 65 percent lower power consumption and stable operation across temperature.

Still, Alduino cautioned that the most relevant metric is not component failure but link interruption.  Firmware, control logic, and transient link resets can have outsized impact on AI training workloads.  “After fifteen million device-hours we haven’t seen unserviceable CPO failures,” he said, “but what really matters is whether the link stays up.”
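
To see what a zero-failure observation can and cannot establish, a standard reliability calculation helps. If failures arrive at a constant rate (an exponential model, which is an assumption here and not something stated in the talk), then observing no failures in T device-hours gives a one-sided lower confidence bound on MTBF of T / (-ln(1 - confidence)). The sketch below applies this to the 15 million device-hours quoted above; the function name is hypothetical, and what counts as a "failure" is left to the reader.

```python
import math


def mtbf_lower_bound_zero_failures(device_hours: float,
                                   confidence: float = 0.95) -> float:
    """One-sided lower bound on MTBF after observing zero failures.

    Assumes a constant failure rate, so P(0 failures) = exp(-rate * T).
    Solving exp(-rate * T) = 1 - confidence gives the largest rate that is
    still consistent with the observation at the chosen confidence level.
    """
    if device_hours <= 0 or not 0 < confidence < 1:
        raise ValueError("device_hours must be > 0 and 0 < confidence < 1")
    max_rate = -math.log(1 - confidence) / device_hours
    return 1 / max_rate


bound = mtbf_lower_bound_zero_failures(15_000_000, confidence=0.95)
print(f"MTBF is at least about {bound / 1e6:.1f} million hours at 95% confidence")
```

The bound describes only the failure class that was counted. It says nothing about firmware resets or transient link flaps, which is the distinction Alduino draws between component failure and link interruption.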

Toward a Data-Driven Decision

Meta’s goal is not simply to validate CPO components but to quantify how failures propagate at the system level—what he called “the blast radius.”  The company is building the statistical base needed to understand whether the benefits of integrated optics outweigh the complexity and repair challenges.  “The question for us isn’t ‘can the industry build it?’ It’s ‘should we deploy it at scale?’ ”

Alduino closed by emphasizing that reliability and serviceability will ultimately determine how far AI infrastructure can scale.  “Power savings help, but the unanswered questions are still reliability, availability, and serviceability,” he said. “If we can make integrated optics truly reliable at data-center scale, that’s how we move forward.”


Key Takeaways

  • Meta brought online more than a gigawatt of new AI capacity this year and is moving toward multimillion-node, multi-regional clusters.
  • Traditional copper backplanes are reaching physical and thermal limits for scale-up connectivity.
  • Co-packaged optics (CPO) show promise—roughly 5× higher MTBF and 65 percent lower power—but raise new serviceability questions.
  • Reliability, availability, and serviceability (RAS) must be co-optimized as AI fabrics grow.
  • Meta is collecting large-scale field data to evaluate CPO versus pluggables before committing to broad deployment.

Tags: Meta, PECC25

Jim Carroll

Editor and Publisher, Converge! Network Digest, Optical Networks Daily - Covering the full stack of network convergence from Silicon Valley

© 2025 Converge Digest - A private dossier for networking and telecoms.
