RDMA

Overview

RDMA (Remote Direct Memory Access) is a high-performance networking technique. It lets one machine read or write another machine's memory directly, without involving the remote CPU, OS interrupts, or the kernel on the data path. The result is a sharp drop in network latency and CPU load, which suits low-latency, high-throughput workloads.

Key properties

Low latency: bypasses the OS kernel, shortening the data path.
High bandwidth: can make full use of the bandwidth the underlying hardware provides.
Low CPU usage: the network adapter performs the transfer, freeing the CPU for other tasks.
Zero copy: data goes directly from one application's buffer to another's, with no intermediate copies.

Implementations

There are three mainstream RDMA implementations:

InfiniBand (IB)
- A network protocol designed for HPC.
- Extremely low latency and high bandwidth.
- Requires dedicated hardware (switches, NICs).

RoCE (RDMA over Converged Ethernet)
- RDMA over standard Ethernet.
- Reuses existing Ethernet infrastructure, but requires switches that support DCB (Data Center Bridging).
- Comes in two versions: RoCEv1, which is limited to a single L2 segment, and RoCEv2, which supports L3 routing.

iWARP (Internet Wide Area RDMA Protocol)
- RDMA over TCP/IP.
- Runs on standard IP networks, but its performance can lag behind InfiniBand and RoCE.

Using RDMA in VKS

VKS supports cross-node RDMA. To use it, add the RDMA device resources to the requests and limits in your container's resource definition. Currently supported resources:

rdma/rdma_shared_device_a: 1
rdma/rdma_shared_device_b: 1

Example

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay
spec:
  rayVersion: '2.40.0'
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: registry.hd-01.alayanew.com:8443/vc-app_market/ray-ml-vllm:0.7.1
            resources:
              requests:
                memory: "1600G"
                cpu: "144"
                nvidia.com/gpu-h800: 8
                rdma/rdma_shared_device_a: 1
                rdma/rdma_shared_device_b: 1
              limits:
                memory: "1600G"
                cpu: "144"
                nvidia.com/gpu-h800: 8
                rdma/rdma_shared_device_a: 1
                rdma/rdma_shared_device_b: 1
  workerGroupSpecs:
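    # Helm template placeholder: Helm renders this value. Use a plain integer if you apply the manifest directly.
    - replicas: {{ .Values.raycluster.workerGroupSpecs.replicas }}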
      groupName: workergroup
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: registry.hd-01.alayanew.com:8443/vc-app_market/ray-ml-vllm:0.7.1
              resources:
                requests:
                  memory: "1600G"
                  cpu: "144"
                  nvidia.com/gpu-h800: 8
                  rdma/rdma_shared_device_a: 1
                  rdma/rdma_shared_device_b: 1
                limits:
                  memory: "1600G"
                  cpu: "144"
                  nvidia.com/gpu-h800: 8
                  rdma/rdma_shared_device_a: 1
                  rdma/rdma_shared_device_b: 1
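
Before launching a full RayCluster, it can help to confirm that a container actually receives the RDMA devices. The manifest below is a minimal sketch of such a check, not an official VKS procedure. The Pod name `rdma-smoke-test`, the container name, and the command are hypothetical, and the check assumes the cluster's RDMA device plugin exposes the devices under `/dev/infiniband`, which this page does not confirm. The image is reused from the example above only for convenience.

```yaml
# Hypothetical smoke-test Pod: names and command are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-smoke-test          # hypothetical name
spec:
  restartPolicy: Never           # run the check once, then stop
  containers:
    - name: rdma-check           # hypothetical name
      # Same image as the RayCluster example; any image with a shell should work.
      image: registry.hd-01.alayanew.com:8443/vc-app_market/ray-ml-vllm:0.7.1
      # Assumption: the device plugin mounts RDMA character devices under /dev/infiniband.
      # If the directory is missing, ls fails and the Pod ends in an Error state.
      command: ["sh", "-c", "ls -l /dev/infiniband"]
      resources:
        # Requests and limits for device resources are kept equal, as in the example above.
        requests:
          rdma/rdma_shared_device_a: 1
          rdma/rdma_shared_device_b: 1
        limits:
          rdma/rdma_shared_device_a: 1
          rdma/rdma_shared_device_b: 1
```

If the manifest is saved as `rdma-smoke-test.yaml` (a hypothetical file name), `kubectl apply -f rdma-smoke-test.yaml` followed by `kubectl logs rdma-smoke-test` should show the device listing once the Pod has completed. An empty listing or an error suggests the RDMA resources were not attached to the container.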
