Rafay Systems and Aviz Networks announced a strategic partnership aimed at simplifying how enterprises and GPU cloud providers deploy, orchestrate, and monetize GPU-based infrastructure. The collaboration combines Rafay’s Kubernetes and GPU lifecycle management platform with Aviz’s multi-vendor AI fabric orchestration and observability stack, delivering an integrated compute-to-network automation layer for large-scale AI workloads.
The joint solution provides end-to-end self-service access to GPU and CPU resources, tenant-aware automation across compute and network layers, and real-time observability to reduce troubleshooting time and improve utilization. Aviz’s ONES platform automates AI fabric configuration for NVIDIA Spectrum-X switches and GPU NICs, while Rafay’s platform manages GPU resource binding, cluster provisioning, and policy enforcement. This integration enables GPU cloud providers to deploy production-ready environments in weeks rather than months, with cloud-like consumption models and secure, multi-tenant isolation.
“Cloud providers and enterprises need a simple way to consume GPU infrastructure without reinventing orchestration stacks,” said Haseeb Budhani, CEO and Co-Founder of Rafay Systems. “Our partnership with Aviz gives customers not just self-service compute, but the tools and visibility they need to run AI workloads at scale.”
Key capabilities of the joint solution include:
• End-to-end orchestration for GPU/CPU workloads and AI fabrics
• Integrated self-service workflows spanning compute and networking
• Multi-tenant binding and network segmentation for secure isolation
• Real-time observability across GPU, NIC, and fabric layers
• Rapid deployment through unified APIs and automation frameworks
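To make the network-segmentation idea above concrete: in Kubernetes-based GPU clouds, tenant isolation at the network layer is commonly expressed with a NetworkPolicy that restricts a tenant's pods to same-namespace traffic. The sketch below is a generic illustration of that pattern, not Rafay's or Aviz's actual implementation; the `tenant-a` namespace name is hypothetical.

```yaml
# Hypothetical example: confine all pods in the tenant-a namespace to
# ingress and egress within that namespace only (default-deny elsewhere).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-a-isolation
  namespace: tenant-a          # hypothetical tenant namespace
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}      # allow traffic only from pods in tenant-a
  egress:
    - to:
        - podSelector: {}      # allow traffic only to pods in tenant-a
```

Vendors layering fabric automation on top of this typically extend the same per-tenant boundary down into switch and NIC configuration, which is the compute-to-network pairing the partnership describes.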
🌐 Analysis: Rafay and Aviz are aligning around the emerging need for GPU cloud operators to manage both compute and AI networking as one stack. Rafay’s Kubernetes-centric GPU PaaS approach complements Aviz’s ONES orchestration across NVIDIA Spectrum-X fabrics and SONiC-based networks, addressing a key challenge in multi-tenant GPU clusters: visibility and efficiency. This positions the two firms against larger ecosystem players such as Run:ai, VMware, and NVIDIA’s DGX Cloud management stack, each racing to define standards for GPU lifecycle orchestration and AI fabric automation.
