At the Open Compute Project (OCP) Summit 2024 in San Jose, Meta’s Director of Engineering, Omar Baldonado, delivered a keynote packed with significant announcements on Meta’s evolving AI infrastructure. With AI driving profound transformations across industries, Meta is doubling down on open, scalable systems to support its AI ambitions. Here are the key highlights:
AMD Joins Meta’s AI Ecosystem
Meta welcomed AMD into its AI infrastructure, announcing that the AMD Instinct MI300X accelerator has been integrated into Meta’s Grand Teton platform, which powers large-scale inference operations. The MI300X’s large HBM capacity and bandwidth (192 GB of HBM3 per accelerator) enable more efficient AI inference, particularly for Meta’s latest Llama 3 models. Baldonado noted that many users of Meta’s family of apps (including Instagram, WhatsApp, and Facebook) are already being served by this infrastructure.
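For developers, part of the appeal is that ROCm builds of PyTorch expose AMD GPUs through the same device interface as Nvidia hardware, so serving a Llama model on an MI300X looks like ordinary PyTorch code. The sketch below is illustrative rather than a description of Meta’s serving stack; it assumes a ROCm build of PyTorch, the Hugging Face transformers library (plus accelerate for device placement), and access to a Llama checkpoint:

    # Minimal sketch: Llama inference on an MI300X via PyTorch.
    # On ROCm builds of PyTorch, AMD GPUs are reached through the
    # familiar "cuda" device type, so no AMD-specific code is needed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative checkpoint

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # fits easily in the MI300X's 192 GB of HBM3
        device_map="auto",           # places weights on the available GPU(s)
    )

    prompt = "Open hardware matters because"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))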
Introducing Catalina: Next-Generation Rack Scale Infrastructure
Meta unveiled “Catalina,” its new AI rack-scale infrastructure built around Nvidia GB200 GPUs. Designed for the growing computational demands of AI models, Catalina accommodates 72 GPUs across two racks and supports a power envelope of up to 140 kW. This density and power-management headroom is critical for handling the immense workloads of large-scale AI training, and the system also incorporates advanced liquid cooling and modular management capabilities to meet Meta’s evolving AI needs.
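Some back-of-the-envelope math puts those figures in perspective (assuming, as the keynote numbers imply, that the 140 kW envelope covers the full two-rack, 72-GPU system including CPUs, networking, and cooling overhead):

    # Density math for Catalina, using the figures above.
    TOTAL_POWER_KW = 140
    GPU_COUNT = 72
    RACKS = 2

    print(f"Power per rack:     {TOTAL_POWER_KW / RACKS:.0f} kW")            # 70 kW
    print(f"Power per GPU slot: {TOTAL_POWER_KW * 1000 / GPU_COUNT:.0f} W")  # ~1944 W

At nearly 2 kW per GPU slot, it is easy to see why liquid cooling and rack-level power management are first-class design concerns rather than afterthoughts.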
Mount Diablo: Meta and Microsoft Collaborate on Power Innovation
To further enhance the scalability of its AI clusters, Meta is partnering with Microsoft on “Mount Diablo,” a project aimed at tackling the disaggregated power needs of massive AI infrastructures. Baldonado stressed the importance of industry collaboration to address these power challenges, highlighting Meta and Microsoft’s long-standing partnership within OCP.
Meta’s First Networking Silicon and OCP Contributions
Another significant announcement was Meta’s development of its first custom networking silicon, an ASIC-based NIC built in collaboration with Marvell. The NIC, designed for Meta’s AI compute platforms, will be contributed to OCP under an open-source license, and its driver (fbnic) has already been upstreamed into the mainline Linux kernel.
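In practice, an upstreamed driver means the hardware works with stock Linux tooling out of the box. As a minimal, Meta-agnostic illustration, the standard sysfs layout lets you check which kernel driver is bound to any NIC (the interface name below is a placeholder):

    # Print the kernel driver bound to a network interface via sysfs.
    import os

    IFACE = "eth0"  # placeholder interface name
    driver_link = f"/sys/class/net/{IFACE}/device/driver"

    if os.path.islink(driver_link):
        # The symlink's basename is the driver name, e.g. "fbnic".
        print(os.path.basename(os.readlink(driver_link)))
    else:
        print(f"no driver symlink found for {IFACE}")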

Switching to 51.2 Tbps: Upgraded AI Data Center Networks
To meet the immense bandwidth demands of its AI clusters, Meta announced two new 51.2 Tbps switches: Minipack3, built with Broadcom’s Tomahawk 5 ASIC, and the Cisco 8501, based on Cisco’s Silicon One G200 ASIC. Both switches are being contributed to OCP, further bolstering Meta’s open-source commitments.
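To put 51.2 Tbps in port terms, a quick calculation (illustrative, not a published port map) shows the radix such an ASIC can support:

    # Port counts achievable from a 51.2 Tbps switch ASIC.
    ASIC_TBPS = 51.2

    for port_gbps in (400, 800):
        ports = ASIC_TBPS * 1000 / port_gbps
        print(f"{port_gbps} GbE: up to {ports:.0f} ports")
    # 400 GbE: up to 128 ports
    # 800 GbE: up to 64 ports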
Rethinking Data Center Fabrics
Meta also revealed a complete rethinking of its data center fabrics to support its AI infrastructure. With large-scale chassis switches now insufficient for AI’s growing demands, Meta has redesigned its fabric to ensure greater control, reliability, and scalability for future AI workloads.
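As a rough illustration of why fleets of fixed-radix boxes can displace chassis switches, textbook fat-tree math shows how far identical switches scale when composed into a multi-stage fabric (a generic k-ary fat tree, not Meta’s actual topology):

    # Host capacity of a 3-stage k-ary fat tree built from identical
    # switches of radix k (standard result: k**3 / 4 hosts).
    def fat_tree_hosts(radix: int) -> int:
        return radix ** 3 // 4

    for radix in (64, 128):
        print(f"radix {radix}: up to {fat_tree_hosts(radix):,} hosts")
    # radix 64:  up to 65,536 hosts
    # radix 128: up to 524,288 hosts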

Baldonado underscored Meta’s fundamental belief in openness as a driving force behind its AI strategy, highlighting that transparency and collaboration run through every layer of its AI stack. From the open-source Llama models and the PyTorch framework to Meta’s OCP contributions of custom silicon, switches, and rack-scale infrastructure, the company is leading the charge in making AI development accessible to the broader community. Baldonado likened Llama’s role in the AI ecosystem to that of Linux in the software world, emphasizing that openness fosters innovation at scale. Meta’s unwavering commitment to open AI systems aims to propel industry-wide advancements and ensure that the future of AI is shared, scalable, and impactful for all.