Yee Jiun (YJ) Song, VP of Engineering for Infra Foundation at Meta, told the AI Infrastructure Summit in Santa Clara that AI has forced the company to reinvent the way it designs and operates data centers. Song reflected on his 15 years at Meta, beginning with Facebook’s early days when all application logic ran on simple web servers tied directly to databases, hosted in just a few leased facilities. As the company expanded to include Instagram, WhatsApp, and Messenger, Meta had to break monolithic applications into microservices, build custom data centers, and roll out a global backbone and edge network. Each step introduced new distributed-systems challenges, including consistency bugs, failures across multiple data centers, and the need to mask unreliable hardware at massive scale.
The inflection point came in the late 2010s with the rise of short-form video. Unlike earlier newsfeed ranking systems that focused on friends’ posts, recommendations for Reels demanded a deep understanding of the content itself. GPUs and other accelerators became indispensable for generating embeddings of every uploaded video and matching them against user preferences. This marked a shift from social-based ranking to interest-based recommendation, powered by models that required orders of magnitude more computation than CPUs could provide. Meta’s first GPU clusters held 4,000 GPUs; by 2023 the company had expanded to 24,000-GPU H100 clusters connected with custom backplane fabrics for low-latency synchronization.
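At a high level, the retrieval pattern Song described works like this: embed each video once, embed the user’s inferred interests, and rank candidates by vector similarity. The sketch below is a minimal illustration of that idea, not Meta’s system; the dimensions, names, and brute-force scoring are all assumptions made for clarity.

```python
import numpy as np

def cosine_top_k(user_vec: np.ndarray, video_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Rank candidate videos by cosine similarity to a user-interest embedding.

    user_vec:   (d,) embedding of the user's inferred interests
    video_embs: (n, d) matrix of content embeddings, one row per video
    Returns the indices of the k most similar videos.
    """
    # Normalize both sides so dot products become cosine similarities.
    u = user_vec / np.linalg.norm(user_vec)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ u
    return np.argsort(scores)[::-1][:k]

# Illustrative toy data: 10,000 candidate videos with 128-dim embeddings.
rng = np.random.default_rng(0)
videos = rng.standard_normal((10_000, 128)).astype(np.float32)
user = rng.standard_normal(128).astype(np.float32)
print(cosine_top_k(user, videos))
```

At production scale the exhaustive dot product gives way to approximate nearest-neighbor indexes, and the embeddings themselves come from large GPU-trained models, which is where the accelerator demand Song described originates.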
Scaling further introduced new technical and operational hurdles. Training large language models (LLMs) required thousands of GPUs to work in lockstep, meaning a single node failure could stall an entire job. Song noted that Meta had to develop new checkpointing and recovery mechanisms, cutting job interruptions from hundreds per day to a rate under 2%, while also redesigning its scheduling and networking systems to maximize efficiency. At one point, Meta executed a drastic operation: emptying five production data centers located side by side, rerouting workloads, digging new fiber trenches, quadrupling network capacity, and even designing custom robots to move racks, all to consolidate sufficient power and cooling for a new 129,000-GPU H100 cluster. At its launch, it was the largest AI supercluster in the world.
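The checkpoint-and-resume mechanics Song alluded to can be sketched in a few lines. The snippet below is an illustration only: the file path, interval, and serialization format are assumptions, not Meta’s internal tooling. The essential properties are that checkpoints are written atomically and that a restarted job resumes from the last saved step rather than from zero.

```python
import os
import pickle

CKPT_PATH = "train_state.ckpt"   # hypothetical path, for illustration
CKPT_EVERY = 100                 # steps between checkpoints (illustrative)

def save_checkpoint(step: int, model_state: dict) -> None:
    # Write atomically: a crash mid-write must not corrupt the last good checkpoint.
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "model": model_state}, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint() -> tuple[int, dict]:
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["model"]
    return 0, {}

start_step, state = load_checkpoint()
for step in range(start_step, 1_000):
    state["loss"] = 1.0 / (step + 1)   # stand-in for a real training step
    if (step + 1) % CKPT_EVERY == 0:
        save_checkpoint(step + 1, state)
```

The reason this matters at cluster scale is that per-node failure probabilities compound: with thousands of hosts in lockstep, some component is failing almost continuously, so checkpoints must be frequent and cheap enough that recovery loses only minutes of work.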
Looking forward, Meta is planning even larger builds. The “Prometheus” cluster is scheduled to go online in 2026, while “Hyperion” represents an even more ambitious multi-building complex on a scale comparable to Manhattan. These next-generation builds respond to the reality that single-building clusters can no longer meet the compute and power requirements of frontier AI training. Song emphasized that the industry is entering uncharted territory where workloads are heterogeneous and evolving rapidly, from recommendation models to LLM pretraining, reinforcement learning, and beyond. Many of these workloads are not yet well understood, meaning infrastructure must be designed for flexibility as well as scale.
Hardware diversity is both an asset and a burden. Meta continues to deploy NVIDIA’s latest GPUs but has also tested systems from other vendors, such as AMD’s MI300X, to encourage a more competitive ecosystem. At the same time, the company is investing in its own custom silicon optimized for recommendation and ranking workloads, which remain mission-critical across Meta’s platforms. Song noted that these chips have already delivered efficiency gains and are being scaled across production data centers, with additional designs now in development. He argued that open standards, abstractions, and open-source software are necessary to tame the growing complexity, enabling developers to focus on applications while operators manage heterogeneous clusters more effectively.
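One concrete payoff of the abstraction layers Song argues for is framework-level portability: application code written against a common API can run on NVIDIA, AMD, or other backends without change. Here is a minimal PyTorch illustration of the idea, a generic example rather than anything from Meta’s stack:

```python
import torch

# PyTorch's ROCm builds expose AMD GPUs through the same torch.cuda API
# used for NVIDIA GPUs, so model code stays vendor-neutral.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(128, 1).to(device)   # toy model for illustration
batch = torch.randn(32, 128, device=device)  # synthetic input batch
print(model(batch).shape, "on", device)
```

This device-selection pattern is what lets operators mix hardware generations and vendors within one fleet while developers keep targeting a single API.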
“AI is teaching us that we know nothing,” Song concluded. “Despite all the build-out that has already happened, we are still very early in the evolution of AI infrastructure. The next few years will be even more intense.”
• Meta scaled from 4K-GPU clusters to 24K H100 GPUs in 2023, and later to a 129K-GPU H100 supercluster
• Five production data centers were emptied and repurposed to consolidate power, cooling, and network capacity for AI workloads
• Prometheus cluster scheduled for 2026, with Hyperion planned as an even larger complex
• Open standards, abstractions, and multi-vendor GPU strategies are key priorities
• Meta continues to invest in custom silicon for recommendation and ranking workloads
“AI is teaching us that we know nothing,” said Yee Jiun Song. “The next few years will be even more intense.”


🌐 Analysis: Meta’s keynote illustrates the transition from web-scale to AI-scale infrastructure, where challenges shift from user demand to compute density, synchronization, and power availability. The Prometheus and Hyperion projects point to the emergence of city-scale AI clusters, rivaling similar efforts at Microsoft, Google, and Amazon. Meta’s differentiated approach combines heterogeneous GPU fleets, custom accelerators, and open standards to manage complexity. This mirrors broader industry momentum around interoperability and abstraction layers, such as OCP’s AI hardware standards and MLCommons benchmarks, which are vital to keep pace with rapid hardware turnover and spiraling AI compute needs.
🌐 We’re tracking the latest developments in AI infrastructure. Follow our ongoing coverage at: https://convergedigest.com/category/ai-infrastructure/