• Home
  • Events Calendar
  • Blueprint Guidelines
  • Privacy Policy
  • Subscribe to Daily Newsletter
  • NextGenInfra.io
Converge Digest
Saturday, April 11, 2026


Hot Interconnects: Google Unveils Falcon, Swift, and Firefly for AI Data Centers

August 20, 2025
in AI Infrastructure, Clouds and Carriers, Data Centers

At the Hot Interconnects 2025 conference, Google Fellow and Vice President of Engineering Amin Vahdat unveiled Falcon, a next-generation NIC transport designed to address the unique challenges of AI and high-performance computing. His keynote framed the network not as a background utility but as the critical enabler—or bottleneck—for AI job completion.

Vahdat traced the evolution of host networking from the Internet era, through distributed systems in data centers, to today’s AI era. Whereas TCP and best-effort delivery sufficed for wide-area connectivity, and RDMA supported storage and analytics clusters, AI workloads now demand ultra-low latency (tens of microseconds), synchronized bursts at line rate, and near-perfect predictability. In large-scale training runs, the 100th percentile tail latency—not the median—dictates job completion. If even a single node or link lags, the entire training run can stall.
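The arithmetic behind this point is simple but worth seeing: in a synchronized step, every worker must finish before compute resumes, so step time is the maximum (100th-percentile) worker latency, not the median. A minimal sketch, with illustrative numbers (not Google's data):

```python
import random

def step_time(worker_latencies_us):
    """A synchronized all-reduce completes only when the slowest worker does."""
    return max(worker_latencies_us)

# Illustrative: 4,095 workers clustered around 30 us, plus one 5 ms straggler.
random.seed(42)
latencies = [random.gauss(30, 3) for _ in range(4095)] + [5000.0]

median = sorted(latencies)[len(latencies) // 2]
print(f"median worker latency: {median:.0f} us")
print(f"step time (p100):      {step_time(latencies):.0f} us")
```

One straggler out of thousands inflates the step time by two orders of magnitude over the median, which is why the tail, not the average, governs job completion.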

To meet these challenges, Google introduced Falcon, a hardware NIC transport delivering 200 Gbps bandwidth, 150 Mpps packet rates, and ~3 µs one-way latency, with scalability to 400/800 Gbps. Falcon offloads flow control, reliability, congestion management, and multipath routing into silicon, complementing techniques such as Swift (delay-based congestion control with near-zero loss) and Firefly (a NIC-to-NIC clock synchronization framework). Falcon is already deployed in Intel’s E2100 IPU, supporting RDMA, NVMe, live VM migration, and bulk data compression.
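Swift's core idea, per Google's published description, is to steer the congestion window off measured delay rather than packet loss: grow while delay is under a target, back off in proportion to the overshoot. The sketch below illustrates that shape only; the constants and update rule are assumptions for illustration, not Google's production values.

```python
# Hedged sketch of a delay-based congestion-window update in the spirit of
# Swift. All constants below are illustrative assumptions.
TARGET_DELAY_US = 50.0   # assumed fabric target delay
AI = 1.0                 # additive increase per update
BETA = 0.8               # multiplicative-decrease scaling
MAX_MD = 0.5             # cap on any single decrease

def update_cwnd(cwnd: float, measured_delay_us: float) -> float:
    """One congestion-window update from a single delay sample."""
    if measured_delay_us < TARGET_DELAY_US:
        # Below target: probe for more bandwidth additively.
        return cwnd + AI
    # Above target: back off in proportion to how far delay exceeds target,
    # but never cut more than MAX_MD in one step.
    overshoot = (measured_delay_us - TARGET_DELAY_US) / measured_delay_us
    return cwnd * max(1.0 - BETA * overshoot, 1.0 - MAX_MD)
```

Because the controller reacts before queues overflow, a delay-based scheme like this can hold utilization high while keeping loss near zero, which is the property the keynote emphasized.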

Vahdat emphasized that AI workloads break the statistical multiplexing model that underpinned internet and data center design for decades. Instead of many applications sharing the same fabric, AI training often monopolizes thousands of nodes in synchronized all-to-all communication, where bursts must complete before computation can continue. This shift, he said, requires rethinking NIC transport, congestion control, and time synchronization from the ground up, with hardware acceleration and automated recovery from failures or stragglers.

  • AI/ML workloads generate synchronized, high-bandwidth bursts that invalidate prior assumptions of multiplexing.
  • The network now dictates accelerator efficiency: computation waits for communication.
  • Falcon hardware transport enables scalable, predictable performance at 200–800 Gbps.
  • Swift congestion control and Firefly synchronization extend Google’s stack for reliable, low-jitter fabrics.
  • Recovery from stragglers and rapid fault localization are now core to workload-level reliability.

“Our goal is to unlearn networking assumptions from the past,” Vahdat said. “We’re moving from best-effort delivery to proactive scheduling, from software-centric to hardware transports, and from tolerating degraded performance to aiming for near-perfect predictability. The network is what determines when an AI job finishes.”

Falcon's transport is programmable, with congestion control, multipathing, and flow management all executed in hardware rather than in the host software stack. Complementing Falcon, Google highlighted Swift, a delay-based congestion control method that achieves high utilization with near-zero packet loss, and Firefly, a scalable NIC-to-NIC and NIC-to-time-server synchronization framework that keeps distributed systems tightly aligned.
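Firefly's protocol details are not spelled out in the keynote, but NIC-to-NIC synchronization schemes generally build on two-way time transfer, the same exchange PTP uses: four timestamps yield both the clock offset and the path delay. A generic sketch of that calculation, not Firefly's actual algorithm:

```python
def estimate_offset_and_delay(t1, t2, t3, t4):
    """Classic two-way time transfer (PTP-style).
    t1: request sent (local clock),  t2: request received (peer clock),
    t3: reply sent (peer clock),     t4: reply received (local clock)."""
    offset = ((t2 - t1) + (t3 - t4)) / 2.0   # peer clock minus local clock
    delay = ((t4 - t1) - (t3 - t2)) / 2.0    # symmetric one-way path delay
    return offset, delay

# Example: peer clock runs 7 us ahead, one-way delay is 3 us.
offset, delay = estimate_offset_and_delay(100.0, 110.0, 112.0, 108.0)
```

Doing this exchange directly between NICs, with timestamps taken in hardware, removes host software jitter from the estimate, which is what makes microsecond-scale alignment across thousands of nodes plausible.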

The keynote laid out a broader case for rethinking NIC transport:

  • New workloads: AI/ML training and HPC require high operation rates, massive burst bandwidth, and ultra-low jitter beyond what software stacks can deliver.
  • Reliability: ML workloads demand synchronized performance, where recovery depends on rapid detection, localization, and checkpoint restoration.
  • Hybrid networks: Specialized links like NVLink and ICI cannot scale across the data center; general-purpose transport must deliver predictable performance at scale.
  • AI networking quadfecta: To succeed, infrastructure must simultaneously deliver burst bandwidth, low latency, ultra-low jitter, and flawless reliability.

https://www.hoti.org

🌐 Analysis: Google is reframing the role of the network as the determinant of AI job completion, moving beyond compute-centric scaling to network-aware infrastructure. This mirrors efforts at Microsoft, Meta, and Amazon, all of whom are building custom NICs, congestion-control protocols, and time sync systems to sustain multi-gigawatt AI campuses. With Falcon, Swift, and Firefly, Google is making the case that AI infrastructure innovation depends as much on networking as on GPUs.

Tags: Google, HOTI

Jim Carroll

Editor and Publisher, Converge! Network Digest, Optical Networks Daily - Covering the full stack of network convergence from Silicon Valley


© 2025 Converge Digest - A private dossier for networking and telecoms.
