Converge Digest

Hot Interconnects: Google Unveils Falcon, Swift, and Firefly for AI Data Centers

At the Hot Interconnects 2025 conference, Google Fellow and Vice President of Engineering Amin Vahdat unveiled Falcon, a next-generation NIC transport designed to address the unique challenges of AI and high-performance computing. His keynote framed the network not as a background utility but as the critical enabler—or bottleneck—for AI job completion.

Vahdat traced the evolution of host networking from the Internet era, through distributed systems in data centers, to today’s AI era. Whereas TCP and best-effort delivery sufficed for wide-area connectivity, and RDMA supported storage and analytics clusters, AI workloads now demand ultra-low latency (tens of microseconds), synchronized bursts at line rate, and near-perfect predictability. In large-scale training runs, the 100th percentile tail latency—not the median—dictates job completion. If even a single node or link lags, the entire training run can stall.
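The tail-latency point can be made concrete with a toy simulation: when every step ends in a synchronized barrier (as in an all-reduce), step time is the *maximum* across all workers, so even rare per-node stragglers dominate at scale. The node count, step count, and straggler probability below are illustrative, not figures from the keynote.

```python
import random

random.seed(0)

NODES = 4096   # hypothetical cluster size
STEPS = 100    # hypothetical number of training steps

def node_step_time() -> float:
    # Most nodes finish in ~10 ms, but each node has a small chance
    # of a 100 ms straggler event on any given step.
    return 0.100 if random.random() < 0.001 else 0.010

# A synchronized step finishes only when the slowest node does,
# so per-step time is the max over nodes, not the mean.
total = sum(max(node_step_time() for _ in range(NODES)) for _ in range(STEPS))

ideal = STEPS * 0.010  # if every node always hit the median
print(f"ideal: {ideal:.2f}s  actual: {total:.2f}s  slowdown: {total/ideal:.1f}x")
```

With a one-in-a-thousand straggler rate, almost every step across 4,096 nodes hits at least one straggler, so the run tracks the 100th percentile rather than the median.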

To meet these challenges, Google introduced Falcon, a hardware NIC transport delivering 200 Gbps bandwidth, 150 Mpps packet rates, and ~3 µs one-way latency, with scalability to 400/800 Gbps. Falcon offloads flow control, reliability, congestion management, and multipath routing into silicon, complementing techniques such as Swift (delay-based congestion control with near-zero loss) and Firefly (a NIC-to-NIC clock synchronization framework). Falcon is already deployed in Intel’s E2100 IPU, supporting RDMA, NVMe, live VM migration, and bulk data compression.
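Swift's core idea, as published, is to steer the congestion window off measured delay rather than packet loss. The sketch below captures only that idea; the constants and the toy queue model are illustrative assumptions, not Google's parameters.

```python
# Delay-based congestion control in the spirit of Swift: compare each
# measured RTT against a target delay, probe upward when under target,
# back off in proportion to the overshoot otherwise.

TARGET_US = 50.0   # illustrative target delay
AI = 1.0           # additive increase, packets per RTT
MD_MAX = 0.5       # cap on multiplicative decrease

def update_cwnd(cwnd: float, rtt_us: float) -> float:
    if rtt_us < TARGET_US:
        return cwnd + AI  # below target: probe for more bandwidth
    overshoot = (rtt_us - TARGET_US) / rtt_us
    return max(1.0, cwnd * (1.0 - min(MD_MAX, overshoot)))

# Toy link model: RTT grows once the window exceeds the
# bandwidth-delay product (queueing delay builds up).
BDP = 100.0
def model_rtt(cwnd: float) -> float:
    queue = max(0.0, cwnd - BDP)
    return 30.0 + queue * 0.5  # 30 µs base RTT plus queueing

cwnd = 10.0
for _ in range(500):
    cwnd = update_cwnd(cwnd, model_rtt(cwnd))
print(f"steady-state cwnd ≈ {cwnd:.0f} packets")
```

The loop settles where measured RTT meets the target, which is how a delay-based scheme keeps queues short (and loss near zero) while staying at high utilization.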

Vahdat emphasized that AI workloads break the statistical multiplexing model that underpinned internet and data center design for decades. Instead of many applications sharing the same fabric, AI training often monopolizes thousands of nodes in synchronized all-to-all communication, where bursts must complete before computation can continue. This shift, he said, requires rethinking NIC transport, congestion control, and time synchronization from the ground up, with hardware acceleration and automated recovery from failures or stragglers.

“Our goal is to unlearn networking assumptions from the past,” Vahdat said. “We’re moving from best-effort delivery to proactive scheduling, from software-centric to hardware transports, and from tolerating degraded performance to aiming for near-perfect predictability. The network is what determines when an AI job finishes.”

Beyond raw throughput, Falcon's hardware transport is programmable: congestion control, multipathing, and flow management can all be tuned in silicon. Firefly, for its part, synchronizes NICs both with one another and with time servers, keeping distributed systems tightly aligned.
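Firefly's wire protocol is not detailed in the keynote, but NIC-level clock synchronization generally builds on two-way timestamp exchange, as in PTP. A minimal sketch of the offset estimate from four timestamps, under the standard assumption of a symmetric path:

```python
# Two-way time transfer (PTP-style) offset estimation.
#   t1: sender transmits        (sender clock)
#   t2: receiver receives       (receiver clock)
#   t3: receiver replies        (receiver clock)
#   t4: sender receives reply   (sender clock)

def estimate_offset_us(t1: float, t2: float, t3: float, t4: float) -> float:
    """Receiver-clock offset relative to sender, assuming symmetric delay."""
    return ((t2 - t1) - (t4 - t3)) / 2.0

# Toy example: receiver clock runs 7 µs ahead; one-way delay is 3 µs.
true_offset, delay = 7.0, 3.0
t1 = 100.0
t2 = t1 + delay + true_offset   # timestamped on the receiver clock
t3 = t2 + 1.0                   # receiver replies 1 µs later
t4 = t3 - true_offset + delay   # back on the sender clock

print(estimate_offset_us(t1, t2, t3, t4))  # → 7.0
```

Hardware timestamping on the NIC removes software jitter from t1–t4, which is what makes microsecond-level (or tighter) alignment feasible at fleet scale.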

The keynote laid out a broader case for rethinking NIC transport:

https://www.hoti.org

🌐 Analysis: Google is reframing the network as the determinant of AI job completion, moving beyond compute-centric scaling to network-aware infrastructure. This mirrors efforts at Microsoft, Meta, and Amazon, all of which are building custom NICs, congestion-control protocols, and time-sync systems to sustain multi-gigawatt AI campuses. With Falcon, Swift, and Firefly, Google is making the case that AI infrastructure innovation depends as much on networking as on GPUs.
