KubeRay operator – RayCluster, RayJob, RayService, GPU scheduling, autoscaling, auth tokens, Label Selector API, GCS fault tolerance, TLS, observability, and Kueue/Volcano integration. Use when deploying Ray on Kubernetes. NOT for Ray Core programming (see ray-core).
KubeRay
Kubernetes operator for Ray. Provides CRDs for running distributed Ray workloads natively on K8s.
Docs: https://docs.ray.io/en/latest/cluster/kubernetes/index.html
GitHub: https://github.com/ray-project/kuberay
Operator: v1.5.1 | Ray: 2.54.0 | API: ray.io/v1
CRDs
| CRD | Purpose | Lifecycle |
|---|---|---|
| RayCluster | Long-running Ray cluster (head + worker groups) | Manual or autoscaled |
| RayJob | One-shot: creates cluster, submits job, optionally cleans up | Ephemeral |
| RayService | Ray Serve with zero-downtime upgrades | Long-running serving |
When to use which:
- RayJob for batch/training – new cluster per job, auto-cleanup, cost-efficient
- RayCluster for interactive/dev – persistent, no startup latency per job
- RayService for model serving – managed upgrades, HA, traffic routing
Installation
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 \
  --namespace kuberay-system --create-namespace
Key Helm Values
image:
  repository: quay.io/kuberay/operator
  tag: v1.5.1

# Namespace scoping
watchNamespace: []            # empty = all namespaces
singleNamespaceInstall: false # true = Role instead of ClusterRole

# RBAC
rbacEnable: true
crNamespacedRbacEnable: true  # false for GitOps tools like ArgoCD

# Feature gates
featureGates:
  - name: RayClusterStatusConditions
    enabled: true

# Operator tuning
reconcileConcurrency: 1       # increase for many CRs
batchScheduler: ""            # "volcano" or "yunikorn"

# Leader election (for HA)
leaderElection:
  enabled: true
Verify: kubectl get pods -n kuberay-system
RayCluster
GPU Cluster Example
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: gpu-cluster
spec:
  rayVersion: "2.54.0"
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default # Default | Conservative | Aggressive
    idleTimeoutSeconds: 60
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: "0.0.0.0"
      num-cpus: "0" # prevent workloads on head
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.54.0-gpu
            resources:
              limits: { cpu: "4", memory: 16Gi }
              requests: { cpu: "4", memory: 16Gi }
            env:
              - name: NVIDIA_VISIBLE_DEVICES
                value: void # head doesn't need GPU
  workerGroupSpecs:
    - groupName: gpu-a100
      replicas: 2
      minReplicas: 0
      maxReplicas: 8
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.54.0-gpu
              resources:
                limits: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
                requests: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          nodeSelector:
            nvidia.com/gpu.product: A100
Add multiple workerGroupSpecs entries for heterogeneous hardware (GPU types, spot vs on-demand, CPU-only groups).
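As a sketch of that pattern (group names and the spot node selector below are illustrative, not prescribed by KubeRay; the selector key is cloud-specific), a spot plus on-demand split might look like:

```yaml
workerGroupSpecs:
  - groupName: on-demand-gpu      # baseline capacity that is never preempted
    replicas: 1
    minReplicas: 1
    maxReplicas: 4
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray-ml:2.54.0-gpu
            resources:
              limits: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
  - groupName: spot-gpu           # cheap burst capacity; may be reclaimed
    replicas: 0
    minReplicas: 0
    maxReplicas: 8
    rayStartParams: {}
    template:
      spec:
        nodeSelector:
          cloud.google.com/gke-spot: "true"   # example selector; varies by cloud
        containers:
          - name: ray-worker
            image: rayproject/ray-ml:2.54.0-gpu
            resources:
              limits: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
```

Keeping minReplicas: 0 on the spot group lets the autoscaler add spot workers only under load while the on-demand group guarantees a floor.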
Label Selector API (v1.5+, Preferred Over rayStartParams)
KubeRay v1.5 introduces a top-level resources and labels API for each head/worker group – replacing the previous practice of embedding labels and resources in rayStartParams strings. These top-level values are mirrored into pod labels, enabling combined Ray + K8s label selector queries, and are consumed by the Ray autoscaler for improved decisions.
headGroupSpec:
  rayStartParams: {} # no longer need label/resource hacks here
  resources:
    custom_accelerator: "4" # Ray logical resource (replaces JSON string in rayStartParams)
  labels:
    ray.io/zone: us-west-2a # also mirrors into pod labels
    ray.io/region: us-west-2
workerGroupSpecs:
  - groupName: gpu-workers
    rayStartParams: {}
    resources:
      custom_accelerator: "4"
    labels:
      ray.io/zone: us-west-2b
    template:
      # ...
Before (v1.4 style – still works, but deprecated):
rayStartParams:
  resources: '"{\"custom_accelerator\": 4}"'
Configuration Best Practices
Pod sizing:
- Size each Ray pod to fill one K8s node (fewer large pods > many small)
- Set memory and GPU requests = limits (KubeRay ignores memory/GPU requests, uses limits)
- CPU: set requests only (no limits) to avoid throttling; KubeRay uses requests if limits absent
- KubeRay rounds CPU to nearest integer for Ray resource accounting
Head pod:
- Set num-cpus: "0" to prevent workloads on the head
- Set dashboard-host: "0.0.0.0" to expose the dashboard
- Set NVIDIA_VISIBLE_DEVICES: void if the head runs on a GPU node but shouldn't use GPUs
Worker groups:
- All rayStartParams values must be strings
- Use the same Ray image and version across the head and all workers (same Python version too)
- Use multiple worker groups for heterogeneous hardware (GPU types, spot vs on-demand)
- Use nodeSelector and tolerations to target specific node pools
Custom Ray resources:
rayStartParams:
  resources: '"{\"TPU\": 4, \"custom_resource\": 1}"' # JSON string of custom resources
Head Service
KubeRay auto-creates <cluster>-head-svc with ports:
| Port | Name | Purpose |
|---|---|---|
| 6379 | gcs | Global Control Store |
| 8265 | dashboard | Ray Dashboard + Jobs API |
| 10001 | client | Ray client connections |
| 8000 | serve | Ray Serve HTTP endpoint |
Override serviceType in headGroupSpec: ClusterIP (default), NodePort, LoadBalancer.
RayJob
RayJob manages two things: a RayCluster and a submitter that calls ray job submit to run your code on that cluster. The submitter is NOT your workload – it's a lightweight pod that submits and monitors.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: training-job
spec:
  entrypoint: python /home/ray/train.py --epochs 10
  runtimeEnvYAML: |
    pip:
      - torch==2.5.0
      - transformers
    env_vars:
      WANDB_API_KEY: "secret"
    working_dir: "https://github.com/org/repo/archive/main.zip"
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 300
  activeDeadlineSeconds: 7200 # max total runtime
  backoffLimit: 0 # ⚠️ each retry = NEW full cluster (see warning below)
  submissionMode: K8sJobMode # see submission modes below
  suspend: false # true for Kueue integration
  rayClusterSpec:
    rayVersion: "2.54.0"
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.54.0-gpu
              resources:
                limits:
                  cpu: "4"
                  memory: 16Gi
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 4
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.54.0-gpu
                resources:
                  limits:
                    cpu: "8"
                    memory: 64Gi
                    nvidia.com/gpu: "1"
⚠️ backoffLimit Creates Full New Clusters Per Retry
Cost trap: spec.backoffLimit on a RayJob creates a completely new RayCluster per retry – not a cheap pod restart. backoffLimit: 3 means up to 4× full GPU clusters provisioned sequentially: 4× node spin-up, 4× image pulls, 4× Kueue quota consumed. On 8×A100 nodes with hour-long attempts, that's up to 32 GPU-hours spent on failed runs alone.
Three different retry mechanisms exist – don't confuse them:
| Field | Scope | What happens on retry | Cost |
|---|---|---|---|
| spec.backoffLimit | Entire RayJob | Deletes the cluster, creates a brand new one | Full cluster cost per retry |
| submitterConfig.backoffLimit | Submitter pod only | Restarts the lightweight ray job submit pod | Near zero |
| ray.train.FailureConfig(max_failures=N) | Ray Train workers | Restarts failed workers on the same cluster | Near zero |
Production recommendation:
spec:
  backoffLimit: 0 # never recreate the entire cluster on failure
  submitterConfig:
    backoffLimit: 3 # retry submission if dashboard temporarily unreachable
Use FailureConfig(max_failures=N) in your Ray Train script for worker-level recovery with checkpoint restore – this retries on the same cluster without reprovisioning:
from ray.train import RunConfig, FailureConfig

run_config = RunConfig(
    failure_config=FailureConfig(max_failures=3),  # retry workers, keep cluster
)
Set backoffLimit: 1 only if you experience transient node provisioning failures and want one automatic retry of the full cluster.
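To see why backoffLimit: 0 is the safe default, a back-of-envelope cost model (the function and the assumed one-hour-per-attempt figure are illustrative, mirroring the 8×A100 example above):

```python
def retry_gpu_hours(gpus_per_cluster: int, hours_per_attempt: float, backoff_limit: int) -> float:
    """GPU-hours consumed when every attempt fails: each spec.backoffLimit
    retry provisions a brand-new cluster, so cost scales with attempt count."""
    attempts = backoff_limit + 1  # the original attempt plus each retry
    return gpus_per_cluster * hours_per_attempt * attempts

print(retry_gpu_hours(8, 1.0, 3))  # 32.0 -- four full 8-GPU cluster provisions
print(retry_gpu_hours(8, 1.0, 0))  # 8.0  -- backoffLimit: 0 caps the damage
```

Worker-level retries via FailureConfig add no terms to this model, since they reuse the already-provisioned cluster.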
Submission Modes
| Mode | How It Works | When to Use |
|---|---|---|
| K8sJobMode (default) | Creates a K8s Job pod that runs ray job submit | Most reliable. Works with Kueue. |
| HTTPMode | Operator sends an HTTP POST to the Ray Dashboard directly | No extra pod. Dashboard must be reachable from the operator. |
| SidecarMode | Injects a submitter container into the head pod | No extra pod. Cannot use clusterSelector. Head restartPolicy must be Never. |
| InteractiveMode (alpha) | Waits for the user to submit via the kubectl ray plugin | Jupyter/notebook workflows. |
In K8sJobMode, the submitter pod gets two injected env vars: RAY_DASHBOARD_ADDRESS and RAY_JOB_SUBMISSION_ID.
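Opting into a non-default mode is a one-field change; for example, a SidecarMode sketch (per the table above, it requires restartPolicy: Never on the head pod template and no clusterSelector):

```yaml
spec:
  submissionMode: SidecarMode
  rayClusterSpec:
    headGroupSpec:
      template:
        spec:
          restartPolicy: Never   # required: the injected submitter must not rerun on head restart
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.54.0-gpu
```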
Key Fields
| Field | Purpose |
|---|---|
| entrypoint | Command passed to ray job submit |
| runtimeEnvYAML | pip packages, env vars, working_dir, py_modules |
| shutdownAfterJobFinishes | Delete the RayCluster on completion (simple boolean – prefer deletionPolicy for fine-grained control) |
| ttlSecondsAfterFinished | Delay before cleanup (applies when shutdownAfterJobFinishes: true) |
| deletionPolicy | Advanced deletion control (v1.5+, see below) |
| activeDeadlineSeconds | Max runtime before DeadlineExceeded failure |
| backoffLimit | Full retries (each = new cluster). Different from submitterConfig.backoffLimit (submitter pod retries). |
| submissionMode | See table above |
| suspend | Set true for Kueue (Kueue controls unsuspension) |
| clusterSelector | Use an existing RayCluster instead of creating one |
| entrypointNumCpus/Gpus | Reserve head resources for the driver script |
Advanced Deletion Policies (v1.5+)
Replaces the shutdownAfterJobFinishes boolean with per-status, per-action TTLs (DeleteCluster, DeleteWorkers, DeleteSelf). For the full spec and action table, see references/kuberay-v1.5.md.
For full RayJob details (lifecycle, deletion strategies, submitter customization, troubleshooting), see references/rayjob.md.
Using Existing Clusters
Skip cluster creation β submit to a running RayCluster:
spec:
  clusterSelector:
    ray.io/cluster: my-existing-cluster
  # Do NOT include rayClusterSpec
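A complete minimal sketch (job and cluster names are illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: adhoc-job
spec:
  entrypoint: python /home/ray/script.py
  clusterSelector:
    ray.io/cluster: my-existing-cluster  # must match the target RayCluster's name
  # No rayClusterSpec: the job is submitted to the selected cluster instead
```

Since the operator didn't create this cluster, the job's cleanup settings don't tear it down; the cluster stays available for subsequent jobs.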
Autoscaling
Three levels of autoscaling work together:
- Ray Serve auto-scales replicas (actors) based on request load
- Ray Autoscaler scales Ray worker pods based on logical resource demand
- K8s Cluster Autoscaler provisions new nodes for pending pods
Configuration
spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default # Default | Aggressive | Conservative
    idleTimeoutSeconds: 60 # seconds before removing idle workers
    resources:
      limits:
        cpu: "500m"
        memory: 512Mi
| Mode | Behavior |
|---|---|
| Default | Scale up to meet demand, conservative bin-packing |
| Aggressive | Scale up faster, less bin-packing |
| Conservative | Scale up more slowly |
Key behavior: The autoscaler monitors logical resource demands (from @ray.remote decorators), not physical utilization. If a task requests more resources than any single worker type provides, the autoscaler won't scale up because no worker it could add would satisfy the request.
Autoscaler V2 (alpha, Ray ≥ 2.10): Improved observability and stability. Enable via KubeRay feature gate.
GCS Fault Tolerance
Without GCS FT, head pod failure kills the entire cluster. Enable with external Redis:
spec:
  headGroupSpec:
    rayStartParams:
      redis-password: "${REDIS_PASSWORD}"
    template:
      metadata:
        annotations:
          ray.io/ft-enabled: "true"
      spec:
        containers:
          - name: ray-head
            env:
              - name: RAY_REDIS_ADDRESS
                value: "redis:6379"
              - name: RAY_gcs_rpc_server_reconnect_timeout_s
                value: "120" # worker reconnect timeout (default 60s)
With GCS FT: workers continue serving during head recovery, cluster state persists in Redis.
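Newer KubeRay releases (v1.1+) also expose a structured spec.gcsFaultToleranceOptions field that replaces the annotation and env-var wiring above. A sketch, assuming a Secret named redis-password-secret already exists:

```yaml
spec:
  gcsFaultToleranceOptions:
    redisAddress: "redis:6379"
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: redis-password-secret  # assumed Secret name
          key: password
  headGroupSpec:
    # ... no ray.io/ft-enabled annotation or RAY_REDIS_ADDRESS env needed
```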
Authentication (v1.5+)
For token auth (v1.5.1+, Ray ≥ 2.52.0) and TLS gRPC encryption, see references/kuberay-v1.5.md.
Kueue Integration
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: queued-training
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true # Kueue controls unsuspension
  # ... rest of spec
Also works with RayCluster (set spec.suspend: true).
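The RayCluster variant follows the same pattern (a sketch; the queue name is illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: queued-cluster
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true # Kueue flips this to false once quota admits the workload
  # ... headGroupSpec / workerGroupSpecs
```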
Observability
Ray Dashboard
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl port-forward $HEAD_POD 8265:8265
# Open http://localhost:8265
Status Conditions (feature gate: RayClusterStatusConditions)
| Condition | True When |
|---|---|
| RayClusterProvisioned | All pods reached ready at least once |
| HeadPodReady | Head pod is currently ready |
| RayClusterReplicaFailure | Reconciliation error (failed to create/delete a pod) |
RayService conditions: Ready (serving traffic), UpgradeInProgress (pending cluster exists).
Prometheus Metrics
Head pod exposes metrics on port 8080. Configure ServiceMonitor to scrape.
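With the Prometheus Operator installed, a ServiceMonitor sketch (the namespace, selector labels, and metrics port name are assumptions; verify them against your actual head service before applying):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-head-monitor
  namespace: monitoring        # assumed: wherever your Prometheus watches
spec:
  selector:
    matchLabels:
      ray.io/node-type: head   # assumed label on the head service
  namespaceSelector:
    any: true
  endpoints:
    - port: metrics            # assumed port name for 8080; check with kubectl describe svc
      interval: 30s
```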
kubectl exec -it $HEAD_POD -- ray status # cluster resources
kubectl exec -it $HEAD_POD -- ray list actors # actor states
kubectl exec -it $HEAD_POD -- ray summary actors
Kubernetes Events
kubectl describe raycluster <name> # events: CreatedService, CreatedHeadPod, CreatedWorkerPod
kubectl describe rayjob <name> # events: job submission, completion, failure
kubectl ray Plugin
The KubeRay kubectl plugin (beta, v1.3.0+) simplifies cluster creation, log collection, sessions, and job submission without YAML. For installation, commands, and a comparison with raw kubectl, see references/kubectl-ray-plugin.md.
Key kubectl Commands
# List all Ray resources
kubectl get rayclusters,rayjobs,rayservices -A
# Ray pods
kubectl get pods -l ray.io/is-ray-node=yes
kubectl get pods -l ray.io/node-type=head
kubectl get pods -l ray.io/node-type=worker
# Head pod logs
kubectl logs $HEAD_POD -c ray-head
# Autoscaler logs (sidecar)
kubectl logs $HEAD_POD -c autoscaler
# Worker init container (if stuck)
kubectl logs <worker-pod> -c wait-gcs-ready
# Operator logs
kubectl logs -n kuberay-system deploy/kuberay-operator
# Ray internal logs
kubectl exec -it $HEAD_POD -- ls /tmp/ray/session_latest/logs/
kubectl exec -it $HEAD_POD -- cat /tmp/ray/session_latest/logs/gcs_server.out
RayService
For Ray Serve deployments with zero-downtime upgrades, see references/rayservice.md.
Incremental Upgrade Strategy (v1.5+)
Replaces blue-green (100% resource surge) with rolling traffic migration using maxSurgePercent/stepSizePercent/intervalSeconds. For the full YAML spec and resource comparison, see references/kuberay-v1.5.md.
Disaggregated Prefill-Decode via RayService
For deploying vLLM with disaggregated prefill-decode using build_pd_openai_app in a RayService, see the Disaggregated PD section in references/rayservice.md.
Troubleshooting
For debugging common KubeRay issues, see references/troubleshooting.md.
References
- kubectl-ray-plugin.md – kubectl ray plugin for cluster management shortcuts
- kuberay-v1.5.md – v1.5 features: token auth, TLS, deletion policies, incremental upgrade
- rayjob.md – RayJob lifecycle, submission modes, and batch workload patterns
- rayservice.md – RayService for serving Ray Serve apps with zero-downtime upgrades
- security.md – Dashboard authentication, NetworkPolicies, GCS port security, RBAC scoping
- troubleshooting.md – Common KubeRay issues: pod scheduling, GCS failures, and autoscaler problems
Cross-References
- kueue – Queue and gang-schedule Ray workloads
- ray-core – Ray programming model
- ray-train – Distributed training on Ray clusters
- ray-serve – Model serving on Ray clusters
- ray-data – Data processing on Ray clusters
- gpu-operator – GPU driver and device plugin for Ray GPU workers
- volcano – Alternative scheduler for Ray workloads
- prometheus-grafana – Scrape Ray cluster Prometheus metrics
- nccl – NCCL tuning for Ray Train multi-node GPU communication
- flyte-kuberay – Run Flyte tasks on KubeRay clusters