Converge Digest

Hot Interconnects: Google Unveils Falcon, Swift, and Firefly for AI Data Centers

At the Hot Interconnects 2025 conference, Google Fellow and Vice President of Engineering Amin Vahdat unveiled Falcon, a next-generation NIC transport designed to address the unique challenges of AI and high-performance computing. His keynote framed the network not as a background utility but as the critical enabler—or bottleneck—for AI job completion.

Vahdat traced the evolution of host networking from the Internet era, through distributed systems in data centers, to today’s AI era. Whereas TCP and best-effort delivery sufficed for wide-area connectivity, and RDMA supported storage and analytics clusters, AI workloads now demand ultra-low latency (tens of microseconds), synchronized bursts at line rate, and near-perfect predictability. In large-scale training runs, the 100th percentile tail latency—not the median—dictates job completion. If even a single node or link lags, the entire training run can stall.
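The tail-latency point can be made concrete with a toy simulation: when every step ends in a synchronized barrier (as in an all-reduce), step time is the *maximum* across all workers, so even rare per-node stragglers dominate at scale. The node count, step count, and straggler probability below are illustrative, not figures from the keynote.

```python
import random

random.seed(0)

NODES = 4096   # hypothetical cluster size
STEPS = 100    # hypothetical number of training steps

def node_step_time() -> float:
    # Most nodes finish in ~10 ms, but each node has a small chance
    # of a 100 ms straggler event on any given step.
    return 0.100 if random.random() < 0.001 else 0.010

# A synchronized step finishes only when the slowest node does,
# so per-step time is the max over nodes, not the mean.
total = sum(max(node_step_time() for _ in range(NODES)) for _ in range(STEPS))

ideal = STEPS * 0.010  # if every node always hit the median
print(f"ideal: {ideal:.2f}s  actual: {total:.2f}s  slowdown: {total/ideal:.1f}x")
```

With a one-in-a-thousand straggler rate, almost every step across 4,096 nodes hits at least one straggler, so the run tracks the 100th percentile rather than the median.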

To meet these challenges, Google introduced Falcon, a hardware NIC transport delivering 200 Gbps bandwidth, 150 Mpps packet rates, and ~3 µs one-way latency, with scalability to 400/800 Gbps. Falcon offloads flow control, reliability, congestion management, and multipath routing into silicon, complementing techniques such as Swift (delay-based congestion control with near-zero loss) and Firefly (a NIC-to-NIC clock synchronization framework). Falcon is already deployed in Intel’s E2100 IPU, supporting RDMA, NVMe, live VM migration, and bulk data compression.
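Swift's core idea, as published, is to steer the congestion window off measured delay rather than packet loss. The sketch below captures only that idea; the constants and the toy queue model are illustrative assumptions, not Google's parameters.

```python
# Delay-based congestion control in the spirit of Swift: compare each
# measured RTT against a target delay, probe upward when under target,
# back off in proportion to the overshoot otherwise.

TARGET_US = 50.0   # illustrative target delay
AI = 1.0           # additive increase, packets per RTT
MD_MAX = 0.5       # cap on multiplicative decrease

def update_cwnd(cwnd: float, rtt_us: float) -> float:
    if rtt_us < TARGET_US:
        return cwnd + AI  # below target: probe for more bandwidth
    overshoot = (rtt_us - TARGET_US) / rtt_us
    return max(1.0, cwnd * (1.0 - min(MD_MAX, overshoot)))

# Toy link model: RTT grows once the window exceeds the
# bandwidth-delay product (queueing delay builds up).
BDP = 100.0
def model_rtt(cwnd: float) -> float:
    queue = max(0.0, cwnd - BDP)
    return 30.0 + queue * 0.5  # 30 µs base RTT plus queueing

cwnd = 10.0
for _ in range(500):
    cwnd = update_cwnd(cwnd, model_rtt(cwnd))
print(f"steady-state cwnd ≈ {cwnd:.0f} packets")
```

The loop settles where measured RTT meets the target, which is how a delay-based scheme keeps queues short (and loss near zero) while staying at high utilization.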

Vahdat emphasized that AI workloads break the statistical multiplexing model that underpinned internet and data center design for decades. Instead of many applications sharing the same fabric, AI training often monopolizes thousands of nodes in synchronized all-to-all communication, where bursts must complete before computation can continue. This shift, he said, requires rethinking NIC transport, congestion control, and time synchronization from the ground up, with hardware acceleration and automated recovery from failures or stragglers.

“Our goal is to unlearn networking assumptions from the past,” Vahdat said. “We’re moving from best-effort delivery to proactive scheduling, from software-centric to hardware transports, and from tolerating degraded performance to aiming for near-perfect predictability. The network is what determines when an AI job finishes.”

Beyond raw throughput, Falcon's hardware transport is programmable: congestion control, multipathing, and flow management can all be tuned in silicon. Firefly, for its part, synchronizes NICs both with one another and with time servers, keeping distributed systems tightly aligned.
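Firefly's wire protocol is not detailed in the keynote, but NIC-level clock synchronization generally builds on two-way timestamp exchange, as in PTP. A minimal sketch of the offset estimate from four timestamps, under the standard assumption of a symmetric path:

```python
# Two-way time transfer (PTP-style) offset estimation.
#   t1: sender transmits        (sender clock)
#   t2: receiver receives       (receiver clock)
#   t3: receiver replies        (receiver clock)
#   t4: sender receives reply   (sender clock)

def estimate_offset_us(t1: float, t2: float, t3: float, t4: float) -> float:
    """Receiver-clock offset relative to sender, assuming symmetric delay."""
    return ((t2 - t1) - (t4 - t3)) / 2.0

# Toy example: receiver clock runs 7 µs ahead; one-way delay is 3 µs.
true_offset, delay = 7.0, 3.0
t1 = 100.0
t2 = t1 + delay + true_offset   # timestamped on the receiver clock
t3 = t2 + 1.0                   # receiver replies 1 µs later
t4 = t3 - true_offset + delay   # back on the sender clock

print(estimate_offset_us(t1, t2, t3, t4))  # → 7.0
```

Hardware timestamping on the NIC removes software jitter from t1–t4, which is what makes microsecond-level (or tighter) alignment feasible at fleet scale.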

The keynote laid out a broader case for rethinking NIC transport:

https://www.hoti.org

🌐 Analysis: Google is reframing the network as the determinant of AI job completion, moving beyond compute-centric scaling to network-aware infrastructure. This mirrors efforts at Microsoft, Meta, and Amazon, all of which are building custom NICs, congestion-control protocols, and time-sync systems to sustain multi-gigawatt AI campuses. With Falcon, Swift, and Firefly, Google is making the case that AI infrastructure innovation depends as much on networking as on GPUs.
