KubeRay operator – RayCluster, RayJob, RayService, GPU scheduling, autoscaling, auth tokens, Label Selector API, GCS fault tolerance, TLS, observability, and Kueue/Volcano integration. Use when deploying Ray on Kubernetes. NOT for Ray Core programming (see ray-core).
KubeRay
Kubernetes operator for Ray. Provides CRDs for running distributed Ray workloads natively on K8s.
Docs: https://docs.ray.io/en/latest/cluster/kubernetes/index.html
GitHub: https://github.com/ray-project/kuberay
Operator: v1.5.1 | Ray: 2.54.0 | API: ray.io/v1
CRDs
| CRD | Purpose | Lifecycle |
|---|---|---|
| RayCluster | Long-running Ray cluster (head + worker groups) | Manual or autoscaled |
| RayJob | One-shot: creates cluster, submits job, optionally cleans up | Ephemeral |
| RayService | Ray Serve with zero-downtime upgrades | Long-running serving |
When to use which:
- RayJob for batch/training – new cluster per job, auto-cleanup, cost-efficient
- RayCluster for interactive/dev – persistent, no startup latency per job
- RayService for model serving – managed upgrades, HA, traffic routing
Installation
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 \
  --namespace kuberay-system --create-namespace
Key Helm Values
image:
  repository: quay.io/kuberay/operator
  tag: v1.5.1

# Namespace scoping
watchNamespace: []            # empty = all namespaces
singleNamespaceInstall: false # true = Role instead of ClusterRole

# RBAC
rbacEnable: true
crNamespacedRbacEnable: true  # false for GitOps tools like ArgoCD

# Feature gates
featureGates:
  - name: RayClusterStatusConditions
    enabled: true

# Operator tuning
reconcileConcurrency: 1       # increase for many CRs
batchScheduler: ""            # "volcano" or "yunikorn"

# Leader election (for HA)
leaderElection:
  enabled: true
Verify: kubectl get pods -n kuberay-system
RayCluster
GPU Cluster Example
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: gpu-cluster
spec:
  rayVersion: "2.54.0"
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default # Default | Conservative | Aggressive
    idleTimeoutSeconds: 60
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: "0.0.0.0"
      num-cpus: "0" # prevent workloads on head
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.54.0-gpu
            resources:
              limits: { cpu: "4", memory: 16Gi }
              requests: { cpu: "4", memory: 16Gi }
            env:
              - name: NVIDIA_VISIBLE_DEVICES
                value: void # head doesn't need GPU
  workerGroupSpecs:
    - groupName: gpu-a100
      replicas: 2
      minReplicas: 0
      maxReplicas: 8
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.54.0-gpu
              resources:
                limits: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
                requests: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          nodeSelector:
            nvidia.com/gpu.product: A100
Add multiple workerGroupSpecs entries for heterogeneous hardware (GPU types, spot vs on-demand, CPU-only groups).
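As a sketch of that pattern (group names and the spot node selector below are illustrative, not prescribed by KubeRay; the selector key is cloud-specific), a spot plus on-demand split might look like:

```yaml
workerGroupSpecs:
  - groupName: on-demand-gpu      # baseline capacity that is never preempted
    replicas: 1
    minReplicas: 1
    maxReplicas: 4
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray-ml:2.54.0-gpu
            resources:
              limits: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
  - groupName: spot-gpu           # cheap burst capacity; may be reclaimed
    replicas: 0
    minReplicas: 0
    maxReplicas: 8
    rayStartParams: {}
    template:
      spec:
        nodeSelector:
          cloud.google.com/gke-spot: "true"   # example selector; varies by cloud
        containers:
          - name: ray-worker
            image: rayproject/ray-ml:2.54.0-gpu
            resources:
              limits: { cpu: "8", memory: 64Gi, nvidia.com/gpu: "1" }
```

Keeping minReplicas: 0 on the spot group lets the autoscaler add spot workers only under load while the on-demand group guarantees a floor.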
Label Selector API (v1.5+, Preferred Over rayStartParams)
KubeRay v1.5 introduces a top-level resources and labels API for each head/worker group – replacing the previous practice of embedding labels and resources in rayStartParams strings. These top-level values are mirrored into pod labels, enabling combined Ray + K8s label selector queries, and are consumed by the Ray autoscaler for improved decisions.
headGroupSpec:
  rayStartParams: {} # no longer need label/resource hacks here
  resources:
    custom_accelerator: "4" # Ray logical resource (replaces JSON string in rayStartParams)
  labels:
    ray.io/zone: us-west-2a # also mirrors into pod labels
    ray.io/region: us-west-2
workerGroupSpecs:
  - groupName: gpu-workers
    rayStartParams: {}
    resources:
      custom_accelerator: "4"
    labels:
      ray.io/zone: us-west-2b
    template:
      # ...
Before (v1.4 style – still works, but deprecated):
rayStartParams:
  resources: '"{\"custom_accelerator\": 4}"'
Configuration Best Practices
Pod sizing:
- Size each Ray pod to fill one K8s node (fewer large pods > many small)
- Set memory and GPU requests = limits (KubeRay ignores memory/GPU requests, uses limits)
- CPU: set requests only (no limits) to avoid throttling; KubeRay uses requests if limits absent
- KubeRay rounds CPU to nearest integer for Ray resource accounting
Head pod:
- Set num-cpus: "0" to prevent workloads on the head
- Set dashboard-host: "0.0.0.0" to expose the dashboard
- Set NVIDIA_VISIBLE_DEVICES: void if the head runs on a GPU node but shouldn't use GPUs
Worker groups:
- All rayStartParams values must be strings
- Use the same Ray image and version across the head and all workers (same Python version too)
- Use multiple worker groups for heterogeneous hardware (GPU types, spot vs on-demand)
- Use nodeSelector and tolerations to target specific node pools
Custom Ray resources:
rayStartParams:
  resources: '"{\"TPU\": 4, \"custom_resource\": 1}"' # JSON string of custom resources
Head Service
KubeRay auto-creates <cluster>-head-svc with ports:
| Port | Name | Purpose |
|---|---|---|
| 6379 | gcs | Global Control Store |
| 8265 | dashboard | Ray Dashboard + Jobs API |
| 10001 | client | Ray client connections |
| 8000 | serve | Ray Serve HTTP endpoint |
Override serviceType in headGroupSpec: ClusterIP (default), NodePort, LoadBalancer.
RayJob
RayJob manages two things: a RayCluster and a submitter that calls ray job submit to run your code on that cluster. The submitter is NOT your workload – it's a lightweight pod that submits and monitors.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: training-job
spec:
  entrypoint: python /home/ray/train.py --epochs 10
  runtimeEnvYAML: |
    pip:
      - torch==2.5.0
      - transformers
    env_vars:
      WANDB_API_KEY: "secret"
    working_dir: "https://github.com/org/repo/archive/main.zip"
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 300
  activeDeadlineSeconds: 7200 # max total runtime
  backoffLimit: 0 # ⚠️ each retry = NEW full cluster (see warning below)
  submissionMode: K8sJobMode # see submission modes below
  suspend: false # true for Kueue integration
  rayClusterSpec:
    rayVersion: "2.54.0"
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.54.0-gpu
              resources:
                limits:
                  cpu: "4"
                  memory: 16Gi
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 4
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.54.0-gpu
                resources:
                  limits:
                    cpu: "8"
                    memory: 64Gi
                    nvidia.com/gpu: "1"
⚠️ backoffLimit Creates Full New Clusters Per Retry
Cost trap: spec.backoffLimit on a RayJob creates a completely new RayCluster per retry – not a cheap pod restart. backoffLimit: 3 means up to 4× full GPU clusters provisioned sequentially: 4× node spin-up, 4× image pulls, 4× Kueue quota consumed. On 8×A100 nodes with hour-long attempts, that's up to 32 GPU-hours spent on failed runs alone.
Three different retry mechanisms exist – don't confuse them:
| Field | Scope | What happens on retry | Cost |
|---|---|---|---|
| spec.backoffLimit | Entire RayJob | Deletes the cluster, creates a brand new one | Full cluster cost per retry |
| submitterConfig.backoffLimit | Submitter pod only | Restarts the lightweight ray job submit pod | Near zero |
| ray.train.FailureConfig(max_failures=N) | Ray Train workers | Restarts failed workers on the same cluster | Near zero |
Production recommendation:
spec:
  backoffLimit: 0 # never recreate the entire cluster on failure
  submitterConfig:
    backoffLimit: 3 # retry submission if dashboard temporarily unreachable
Use FailureConfig(max_failures=N) in your Ray Train script for worker-level recovery with checkpoint restore – this retries on the same cluster without reprovisioning:
from ray.train import RunConfig, FailureConfig

run_config = RunConfig(
    failure_config=FailureConfig(max_failures=3),  # retry workers, keep cluster
)
Set backoffLimit: 1 only if you experience transient node provisioning failures and want one automatic retry of the full cluster.
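To see why backoffLimit: 0 is the safe default, a back-of-envelope cost model (the function and the assumed one-hour-per-attempt figure are illustrative, mirroring the 8×A100 example above):

```python
def retry_gpu_hours(gpus_per_cluster: int, hours_per_attempt: float, backoff_limit: int) -> float:
    """GPU-hours consumed when every attempt fails: each spec.backoffLimit
    retry provisions a brand-new cluster, so cost scales with attempt count."""
    attempts = backoff_limit + 1  # the original attempt plus each retry
    return gpus_per_cluster * hours_per_attempt * attempts

print(retry_gpu_hours(8, 1.0, 3))  # 32.0 -- four full 8-GPU cluster provisions
print(retry_gpu_hours(8, 1.0, 0))  # 8.0  -- backoffLimit: 0 caps the damage
```

Worker-level retries via FailureConfig add no terms to this model, since they reuse the already-provisioned cluster.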
Submission Modes
| Mode | How It Works | When to Use |
|---|---|---|
| K8sJobMode (default) | Creates a K8s Job pod that runs ray job submit | Most reliable. Works with Kueue. |
| HTTPMode | Operator sends an HTTP POST to the Ray Dashboard directly | No extra pod. Dashboard must be reachable from the operator. |
| SidecarMode | Injects a submitter container into the head pod | No extra pod. Cannot use clusterSelector. Head restartPolicy must be Never. |
| InteractiveMode (alpha) | Waits for the user to submit via the kubectl ray plugin | Jupyter/notebook workflows. |
In K8sJobMode, the submitter pod gets two injected env vars: RAY_DASHBOARD_ADDRESS and RAY_JOB_SUBMISSION_ID.
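Opting into a non-default mode is a one-field change; for example, a SidecarMode sketch (per the table above, it requires restartPolicy: Never on the head pod template and no clusterSelector):

```yaml
spec:
  submissionMode: SidecarMode
  rayClusterSpec:
    headGroupSpec:
      template:
        spec:
          restartPolicy: Never   # required: the injected submitter must not rerun on head restart
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.54.0-gpu
```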
Key Fields
| Field | Purpose |
|---|---|
| entrypoint | Command passed to ray job submit |
| runtimeEnvYAML | pip packages, env vars, working_dir, py_modules |
| shutdownAfterJobFinishes | Delete the RayCluster on completion (simple boolean – prefer deletionPolicy for fine-grained control) |
| ttlSecondsAfterFinished | Delay before cleanup (applies when shutdownAfterJobFinishes: true) |
| deletionPolicy | Advanced deletion control (v1.5+, see below) |
| activeDeadlineSeconds | Max runtime before DeadlineExceeded failure |
| backoffLimit | Full retries (each = new cluster). Different from submitterConfig.backoffLimit (submitter pod retries). |
| submissionMode | See table above |
| suspend | Set true for Kueue (Kueue controls unsuspension) |
| clusterSelector | Use an existing RayCluster instead of creating one |
| entrypointNumCpus/Gpus | Reserve head resources for the driver script |
Advanced Deletion Policies (v1.5+)
Replaces the shutdownAfterJobFinishes boolean with per-status, per-action TTLs (DeleteCluster, DeleteWorkers, DeleteSelf). For the full spec and action table, see references/kuberay-v1.5.md.
For full RayJob details (lifecycle, deletion strategies, submitter customization, troubleshooting), see references/rayjob.md.
Using Existing Clusters
Skip cluster creation β submit to a running RayCluster:
spec:
  clusterSelector:
    ray.io/cluster: my-existing-cluster
  # Do NOT include rayClusterSpec
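A complete minimal sketch (job and cluster names are illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: adhoc-job
spec:
  entrypoint: python /home/ray/script.py
  clusterSelector:
    ray.io/cluster: my-existing-cluster  # must match the target RayCluster's name
  # No rayClusterSpec: the job is submitted to the selected cluster instead
```

Since the operator didn't create this cluster, the job's cleanup settings don't tear it down; the cluster stays available for subsequent jobs.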
Autoscaling
Three levels of autoscaling work together:
- Ray Serve auto-scales replicas (actors) based on request load
- Ray Autoscaler scales Ray worker pods based on logical resource demand
- K8s Cluster Autoscaler provisions new nodes for pending pods
Configuration
spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default # Default | Aggressive | Conservative
    idleTimeoutSeconds: 60 # seconds before removing idle workers
    resources:
      limits:
        cpu: "500m"
        memory: 512Mi
| Mode | Behavior |
|---|---|
| Default | Scale up to meet demand, conservative bin-packing |
| Aggressive | Scale up faster, less bin-packing |
| Conservative | Scale up more slowly |
Key behavior: The autoscaler monitors logical resource demands (from @ray.remote decorators), not physical utilization. If a task requests more resources than any single worker type provides, the autoscaler won't scale up because no worker it could add would satisfy the request.
Autoscaler V2 (alpha, Ray ≥ 2.10): Improved observability and stability. Enable via KubeRay feature gate.
GCS Fault Tolerance
Without GCS FT, head pod failure kills the entire cluster. Enable with external Redis:
spec:
  headGroupSpec:
    rayStartParams:
      redis-password: "${REDIS_PASSWORD}"
    template:
      metadata:
        annotations:
          ray.io/ft-enabled: "true"
      spec:
        containers:
          - name: ray-head
            env:
              - name: RAY_REDIS_ADDRESS
                value: "redis:6379"
              - name: RAY_gcs_rpc_server_reconnect_timeout_s
                value: "120" # worker reconnect timeout (default 60s)
With GCS FT: workers continue serving during head recovery, cluster state persists in Redis.
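Newer KubeRay releases (v1.1+) also expose a structured spec.gcsFaultToleranceOptions field that replaces the annotation and env-var wiring above. A sketch, assuming a Secret named redis-password-secret already exists:

```yaml
spec:
  gcsFaultToleranceOptions:
    redisAddress: "redis:6379"
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: redis-password-secret  # assumed Secret name
          key: password
  headGroupSpec:
    # ... no ray.io/ft-enabled annotation or RAY_REDIS_ADDRESS env needed
```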
Authentication (v1.5+)
For token auth (v1.5.1+, Ray ≥ 2.52.0) and TLS gRPC encryption, see references/kuberay-v1.5.md.
Kueue Integration
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: queued-training
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true # Kueue controls unsuspension
  # ... rest of spec
Also works with RayCluster (set spec.suspend: true).
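The RayCluster variant follows the same pattern (a sketch; the queue name is illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: queued-cluster
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true # Kueue flips this to false once quota admits the workload
  # ... headGroupSpec / workerGroupSpecs
```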
Observability
Ray Dashboard
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl port-forward $HEAD_POD 8265:8265
# Open http://localhost:8265
Status Conditions (feature gate: RayClusterStatusConditions)
| Condition | True When |
|---|---|
| RayClusterProvisioned | All pods reached ready at least once |
| HeadPodReady | Head pod is currently ready |
| RayClusterReplicaFailure | Reconciliation error (failed to create/delete a pod) |
RayService conditions: Ready (serving traffic), UpgradeInProgress (pending cluster exists).
Prometheus Metrics
Head pod exposes metrics on port 8080. Configure ServiceMonitor to scrape.
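With the Prometheus Operator installed, a ServiceMonitor sketch (the namespace, selector labels, and metrics port name are assumptions; verify them against your actual head service before applying):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-head-monitor
  namespace: monitoring        # assumed: wherever your Prometheus watches
spec:
  selector:
    matchLabels:
      ray.io/node-type: head   # assumed label on the head service
  namespaceSelector:
    any: true
  endpoints:
    - port: metrics            # assumed port name for 8080; check with kubectl describe svc
      interval: 30s
```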
kubectl exec -it $HEAD_POD -- ray status # cluster resources
kubectl exec -it $HEAD_POD -- ray list actors # actor states
kubectl exec -it $HEAD_POD -- ray summary actors
Kubernetes Events
kubectl describe raycluster <name> # events: CreatedService, CreatedHeadPod, CreatedWorkerPod
kubectl describe rayjob <name> # events: job submission, completion, failure
kubectl ray Plugin
The KubeRay kubectl plugin (beta, v1.3.0+) simplifies cluster creation, log collection, sessions, and job submission without YAML. For installation, commands, and a comparison with raw kubectl, see references/kubectl-ray-plugin.md.
Key kubectl Commands
# List all Ray resources
kubectl get rayclusters,rayjobs,rayservices -A
# Ray pods
kubectl get pods -l ray.io/is-ray-node=yes
kubectl get pods -l ray.io/node-type=head
kubectl get pods -l ray.io/node-type=worker
# Head pod logs
kubectl logs $HEAD_POD -c ray-head
# Autoscaler logs (sidecar)
kubectl logs $HEAD_POD -c autoscaler
# Worker init container (if stuck)
kubectl logs <worker-pod> -c wait-gcs-ready
# Operator logs
kubectl logs -n kuberay-system deploy/kuberay-operator
# Ray internal logs
kubectl exec -it $HEAD_POD -- ls /tmp/ray/session_latest/logs/
kubectl exec -it $HEAD_POD -- cat /tmp/ray/session_latest/logs/gcs_server.out
RayService
For Ray Serve deployments with zero-downtime upgrades, see references/rayservice.md.
Incremental Upgrade Strategy (v1.5+)
Replaces blue-green (100% resource surge) with rolling traffic migration using maxSurgePercent/stepSizePercent/intervalSeconds. For the full YAML spec and resource comparison, see references/kuberay-v1.5.md.
Disaggregated Prefill-Decode via RayService
For deploying vLLM with disaggregated prefill-decode using build_pd_openai_app in a RayService, see the Disaggregated PD section in references/rayservice.md.
Troubleshooting
For debugging common KubeRay issues, see references/troubleshooting.md.
References
- kubectl-ray-plugin.md – kubectl ray plugin for cluster management shortcuts
- kuberay-v1.5.md – v1.5 features: token auth, TLS, deletion policies, incremental upgrade
- rayjob.md – RayJob lifecycle, submission modes, and batch workload patterns
- rayservice.md – RayService for serving Ray Serve apps with zero-downtime upgrades
- security.md – Dashboard authentication, NetworkPolicies, GCS port security, RBAC scoping
- troubleshooting.md – Common KubeRay issues: pod scheduling, GCS failures, and autoscaler problems
Cross-References
- kueue – Queue and gang-schedule Ray workloads
- ray-core – Ray programming model
- ray-train – Distributed training on Ray clusters
- ray-serve – Model serving on Ray clusters
- ray-data – Data processing on Ray clusters
- gpu-operator – GPU driver and device plugin for Ray GPU workers
- volcano – Alternative scheduler for Ray workloads
- prometheus-grafana – Scrape Ray cluster Prometheus metrics
- nccl – NCCL tuning for Ray Train multi-node GPU communication
- flyte-kuberay – Run Flyte tasks on KubeRay clusters