---
name: verl
description: "verl (Volcano Engine RL) – PPO, GRPO, DAPO, GSPO, RLOO, TIS (token/sequence importance sampling), rollout server mode, reward models, rule-based rewards, vLLM/SGLang rollout, and multi-GPU FSDP/Megatron training. Use when doing RLHF or RL post-training on LLMs."
---
verl (Volcano Engine Reinforcement Learning)
verl is a production-ready RL training framework for LLMs. Supports PPO, GRPO, DAPO, RLOO, REINFORCE++, ReMax, and more. Uses FSDP/Megatron-LM for training and vLLM/SGLang for rollout generation.
GitHub: verl-project/verl | Docs: verl.readthedocs.io
Requirements: PyTorch 2.4+, FSDP or Megatron-LM, vLLM or SGLang for rollout generation. Container image: verl.
Version pinning is strict. verl pins exact rollout engine versions. See references/compatibility.md for the full compatibility matrix and upgrade notes – mixing verl v0.7 with vLLM >0.12.0 breaks the RLHF engine interface.
Core Architecture
verl uses a 3D-HybridEngine where the same GPU pool switches between:
- Rollout (generation) – uses vLLM/SGLang for fast batched inference
- Training (policy update) – uses FSDP/Megatron-LM for gradient computation
- Reference model – computes reference log-probs for KL penalty
The hybrid engine eliminates memory redundancy by resharding model weights between training (FSDP sharded) and inference (vLLM tensor-parallel) phases on the same GPUs.
Resource allocation lifecycle per training step:
- Load model weights into vLLM, generate rollouts for the batch
- Offload vLLM, load FSDP-sharded weights for training
- Compute advantages, run PPO/GRPO updates
- Save checkpoint, repeat
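The four phases above can be sketched as a toy loop. Class and function names here are illustrative stand-ins, not verl's actual API (the real orchestration is Ray-based); the sketch only shows the phase ordering on a shared GPU pool:

```python
# Toy sketch of one hybrid-engine training step.
# ToyRolloutEngine and training_step are hypothetical, for illustration only.
class ToyRolloutEngine:
    def __init__(self):
        self.weights = None

    def load_weights(self, w):    # weight sync into the inference engine
        self.weights = w

    def generate(self, prompts):  # stand-in for vLLM/SGLang batched decoding
        return [f"response to {p}" for p in prompts]

    def offload(self):            # free KV cache + weights before training
        self.weights = None


def training_step(prompts, engine, policy_weights):
    engine.load_weights(policy_weights)       # phase 1: rollout generation
    responses = engine.generate(prompts)
    engine.offload()                          # phase 2: reshard for training
    advantages = [len(r) for r in responses]  # phase 3: placeholder scoring
    return responses, advantages              # phase 4: PPO/GRPO update would go here


engine = ToyRolloutEngine()
responses, adv = training_step(["2+2=?"], engine, {"w": 0})
print(responses[0])  # response to 2+2=?
```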
SFT (Supervised Fine-Tuning)
verl includes an SFT trainer as a pre-RL step. Data format: Parquet with prompt and response columns.
```shell
python3 -m verl.trainer.fsdp_sft_trainer \
  data.train_files=/data/sft_train.parquet \
  model.path=Qwen/Qwen2.5-7B-Instruct \
  optim.lr=2e-5 trainer.total_epochs=3 \
  trainer.n_gpus_per_node=4 trainer.save_freq=500
```
PPO Configuration
PPO is the full RLHF algorithm with actor, critic, reference model, and reward model:
```yaml
# ppo_config.yaml
data:
  train_files: /data/gsm8k/train.parquet
  val_files: /data/gsm8k/test.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 512
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    optim:
      lr: 1e-6
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 4
    ppo_epochs: 1
    clip_ratio: 0.2
    grad_clip: 1.0
    entropy_coeff: 0.0
    use_torch_compile: true
  rollout:
    name: vllm
    tensor_model_parallel_size: 1
    gpu_memory_utilization: 0.4
    temperature: 1.0
    top_p: 1.0
    n: 1  # 1 response per prompt for PPO
  ref:
    log_prob_micro_batch_size_per_gpu: 4
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # same or separate critic model
  optim:
    lr: 1e-5
  ppo_micro_batch_size_per_gpu: 4
algorithm:
  adv_estimator: gae  # Generalized Advantage Estimation
  kl_ctrl:
    kl_coef: 0.001
trainer:
  n_gpus_per_node: 4
  nnodes: 1
  total_epochs: 15
  save_freq: 10
  logger: ["console", "wandb"]
  project_name: ppo-gsm8k
```
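Runs are launched through verl's main PPO trainer entry point; the same keys shown in the YAML can be passed as Hydra-style dotted overrides. File paths and the model name below are placeholders:

```shell
python3 -m verl.trainer.main_ppo \
  data.train_files=/data/gsm8k/train.parquet \
  data.val_files=/data/gsm8k/test.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
  critic.model.path=Qwen/Qwen2.5-7B-Instruct \
  algorithm.adv_estimator=gae \
  trainer.n_gpus_per_node=4 trainer.nnodes=1
```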
GRPO Configuration
GRPO is simpler than PPO – no critic model. It samples multiple responses per prompt and uses group-relative rewards:
```yaml
# grpo_config.yaml
data:
  train_files: /data/gsm8k/train.parquet
  val_files: /data/gsm8k/test.parquet
  train_batch_size: 128
  max_prompt_length: 512
  max_response_length: 1024
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    optim:
      lr: 1e-6
    use_kl_loss: true  # KL penalty in loss function
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 2
  rollout:
    name: vllm
    n: 8  # 8 responses per prompt (key for GRPO)
    temperature: 1.0
    tensor_model_parallel_size: 2
    gpu_memory_utilization: 0.5
algorithm:
  adv_estimator: grpo  # group-relative advantage
  # No critic section needed – GRPO is critic-free
trainer:
  n_gpus_per_node: 4
  total_epochs: 20
```
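GRPO uses the same main_ppo entry point as PPO; the estimator and rollout fan-out are just overrides. Paths and the model name below are placeholders:

```shell
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  actor_rollout_ref.actor.use_kl_loss=True \
  actor_rollout_ref.rollout.n=8 \
  data.train_files=/data/gsm8k/train.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
  trainer.n_gpus_per_node=4
```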
Reward Functions
Rule-Based Rewards
```python
# verl/utils/reward_score/my_reward.py
import re

def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    match = re.search(r"####\s*(-?\d+)", solution_str)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == str(ground_truth).strip() else 0.0
```
Register in config:
```shell
reward_model.reward_fn.path=verl/utils/reward_score/my_reward.py \
reward_model.reward_fn.name=compute_reward
```
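A quick local sanity check of the GSM8K-style "#### <answer>" rule (the function body repeats the snippet above so this runs standalone):

```python
import re

def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    match = re.search(r"####\s*(-?\d+)", solution_str)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == str(ground_truth).strip() else 0.0

print(compute_reward("gsm8k", "... so the answer is #### 42", 42))  # 1.0
print(compute_reward("gsm8k", "... so the answer is #### 41", 42))  # 0.0
print(compute_reward("gsm8k", "no final answer marker", 42))        # 0.0
```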
Reward Model (Learned)
Use a trained reward model instead of rules:
```shell
reward_model.enable=True \
reward_model.model.path=my-org/reward-model-7b \
reward_model.micro_batch_size_per_gpu=4
```
Multi-Reward Composition
Combine multiple signals in a single reward function:
```python
def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    # check_answer and check_format are user-defined helpers
    correctness = check_answer(solution_str, ground_truth)  # 0 or 1
    format_score = check_format(solution_str)               # 0 to 0.5
    length_penalty = -max(0, len(solution_str) - 2000) / 10000
    return correctness + format_score + length_penalty
```
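check_answer and check_format above are user-supplied. A minimal runnable version might look like this (the helper logic is illustrative, not verl's):

```python
import re

def check_answer(solution_str, ground_truth):
    # 1.0 if the final '#### <number>' matches the ground truth, else 0.0
    match = re.search(r"####\s*(-?\d+)", solution_str)
    return 1.0 if match and match.group(1) == str(ground_truth) else 0.0

def check_format(solution_str):
    # Partial credit for showing work inside <think>...</think> tags
    return 0.5 if re.search(r"<think>.*</think>", solution_str, re.S) else 0.0

def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    correctness = check_answer(solution_str, ground_truth)
    format_score = check_format(solution_str)
    length_penalty = -max(0, len(solution_str) - 2000) / 10000
    return correctness + format_score + length_penalty

print(compute_reward("gsm8k", "<think>2+2</think> #### 4", 4))  # 1.5
```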
Hybrid Reward Manager (v0.7)
v0.7 supports combining generative (LLM-as-judge), discriminative (classifier), and rule-based rewards. Reward models run in server mode – either colocated (shared GPUs) or standalone (dedicated nodes). Manager modes: limited (rate-controlled) or remote (offloaded CPU-intensive evaluation).
```yaml
reward_model:
  enable: true
  model:
    path: my-org/reward-model-7b
  deployment: colocated  # or standalone
  manager_mode: limited  # or remote
  reward_fn:
    path: verl/utils/reward_score/my_reward.py
    name: compute_reward
```
Configuration Reference
Data Config
| Parameter | Purpose | Default |
|---|---|---|
| data.train_batch_size | Global batch size per step | 1024 |
| data.max_prompt_length | Max input tokens | 512 |
| data.max_response_length | Max generated tokens | 512 |
| data.prompt_key | Column name for prompts | prompt |
Actor Config
| Parameter | Purpose | Default |
|---|---|---|
| actor_rollout_ref.model.path | HuggingFace model | required |
| actor_rollout_ref.actor.optim.lr | Actor learning rate | 1e-6 |
| actor_rollout_ref.actor.ppo_mini_batch_size | PPO mini-batch (global) | 256 |
| actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu | Micro-batch per GPU | 8 |
| actor_rollout_ref.actor.grad_clip | Gradient clipping | 1.0 |
| actor_rollout_ref.actor.clip_ratio | PPO clip ratio | 0.2 |
| actor_rollout_ref.actor.ppo_epochs | PPO epochs per step | 1 |
| actor_rollout_ref.actor.entropy_coeff | Entropy bonus | 0.0 |
| actor_rollout_ref.actor.use_kl_loss | KL loss (for GRPO) | False |
| actor_rollout_ref.actor.kl_loss_coef | KL loss coefficient | 0.001 |
| actor_rollout_ref.actor.use_torch_compile | torch.compile | True |
Rollout Config (vLLM)
| Parameter | Purpose | Default |
|---|---|---|
| rollout.name | Engine (vllm, sglang, hf) | vllm |
| rollout.temperature | Sampling temperature | 1.0 |
| rollout.top_p | Nucleus sampling | 1.0 |
| rollout.n | Responses per prompt | 1 (8+ for GRPO) |
| rollout.tensor_model_parallel_size | TP for rollout | 1 |
| rollout.gpu_memory_utilization | vLLM GPU memory fraction | 0.5 |
| rollout.enforce_eager | Disable CUDA graphs | True |
| rollout.online_quant | Online quantization backend (no pre-quantized checkpoint needed) | None (torchao) |
Algorithm Config
| Parameter | Purpose | Options |
|---|---|---|
| algorithm.adv_estimator | Advantage estimation | gae, grpo, rloo, reinforce_plus_plus |
| algorithm.kl_ctrl.kl_coef | KL penalty coefficient | 0.001 |
| algorithm.use_kl_in_reward | KL in reward vs loss | True/False |
Trainer Config
| Parameter | Purpose | Default |
|---|---|---|
| trainer.n_gpus_per_node | GPUs per node | 8 |
| trainer.nnodes | Number of nodes | 1 |
| trainer.total_epochs | Training epochs | 1 |
| trainer.save_freq | Checkpoint frequency (steps) | -1 |
| trainer.test_freq | Validation frequency | -1 |
| trainer.logger | Logging backends | console |
| trainer.project_name | wandb project | verl |
| trainer.experiment_name | wandb run name | default |
Supported Algorithms
| Algorithm | Advantage Estimator | Critic? | Key Feature |
|---|---|---|---|
| PPO | gae | Yes | Full RLHF with value function |
| GRPO | grpo | No | Group-relative rewards, simpler |
| DAPO | grpo + DAPO recipe | No | SOTA reasoning (AIME 50pts) |
| RLOO | rloo | No | Leave-one-out baseline |
| REINFORCE++ | reinforce_plus_plus | No | Improved REINFORCE |
| ReMax | remax | No | Max-reward baseline |
| CISPO | cispo | No | Clipped IS-weight Policy Optimization (paper) |
| SAPO | sapo | No | Soft Adaptive Policy Optimization (paper) |
| GSPO | gspo | No | Group-level optimization variant (v0.6+) |
Rollout Correction (TIS) (v0.6+)
In async or off-policy RL, the policy used for rollout generation may differ from the current training policy. This mismatch degrades training signal quality. verl's Rollout Correction framework addresses this via importance sampling (IS) weights and rejection sampling (RS), configured under algorithm.rollout_correction.
Token-Level TIS
Applies per-token importance weights to mitigate the gap between rollout and training policy distributions. Automatically reweights loss contributions so that tokens generated by a stale policy receive proportionally less influence.
```yaml
algorithm:
  rollout_correction:
    rollout_is: token  # per-token IS weights (TIS)
    rollout_is_threshold: 2.0  # upper-clamp threshold
actor_rollout_ref:
  rollout:
    calculate_log_probs: true  # required
```
Sequence-Level TIS (v0.6+)
Monitors distribution mismatch and applies sequence-level IS correction. Useful for detecting and mitigating RL collapse in high-throughput async settings.
```yaml
algorithm:
  rollout_correction:
    rollout_is: sequence  # per-sequence IS weights
    rollout_is_threshold: 5.0
actor_rollout_ref:
  rollout:
    calculate_log_probs: true
```
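Conceptually, both modes weight the loss by the probability ratio between the training policy and the (possibly stale) rollout policy, clamped at the configured threshold. A minimal sketch of the weighting math only, not verl's implementation:

```python
import math

def token_is_weights(train_logprobs, rollout_logprobs, threshold=2.0):
    # Per-token TIS: w_t = clamp(exp(logp_train - logp_rollout), max=threshold)
    return [min(math.exp(a - b), threshold)
            for a, b in zip(train_logprobs, rollout_logprobs)]

def sequence_is_weight(train_logprobs, rollout_logprobs, threshold=5.0):
    # Sequence-level: one weight from the summed log-ratio over the sequence
    log_ratio = sum(train_logprobs) - sum(rollout_logprobs)
    return min(math.exp(log_ratio), threshold)

train = [-1.0, -0.5, -2.0]    # log-probs under the current training policy
rollout = [-1.2, -0.5, -1.8]  # log-probs recorded at generation time
print(token_is_weights(train, rollout))    # per-token weights near 1.0
print(sequence_is_weight(train, rollout))  # single weight for the sequence
```

Tokens generated by a badly stale policy get large log-prob gaps, so the clamp (rollout_is_threshold) bounds their influence on the update.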
Presets: Use RolloutCorrectionConfig presets for validated configurations: RolloutCorrectionConfig.decoupled_token_is(), .decoupled_seq_is(), .bypass_ppo_clip(), etc. See Rollout Correction docs.
Monitoring: When Rollout Correction is enabled, verl logs rollout_corr/kl, rollout_corr/log_ppl_abs_diff, and rollout_corr/chi2_token. Use these to detect training instability from off-policy drift.
Rollout Server Mode (v0.6+)
v0.6 transitions rollout engines from SPMD mode (embedded in trainer process) to native server mode – the inference engine runs as a standalone server with an adapter for weight synchronization. This enables:
- DP+EP support for large MoE models (e.g., expert parallelism across decode workers)
- Full-fledged serving optimizations (continuous batching, prefix caching) during rollout
- Cleaner separation between training and inference engines
Both vLLM and SGLang now use native server mode by default. The rollout config remains the same – the change is architectural. The BaseRollout interface was refactored to make the training engine agnostic of the inference engine during weight synchronization, simplifying integration of new inference backends (e.g., TensorRT-LLM).
Breaking change (v0.6): ShardingManager is deprecated and will be removed in the next release. Use the new nD dispatch API with device meshes and @register(dispatch_mode=...) decorators instead. See verl docs for the migration guide.
Multi-GPU Scaling
FSDP Backend (Default)
```shell
# 4 GPUs on 1 node
trainer.n_gpus_per_node=4 trainer.nnodes=1

# 8 GPUs on 2 nodes
trainer.n_gpus_per_node=4 trainer.nnodes=2

# With tensor parallelism for large models (rollout)
actor_rollout_ref.rollout.tensor_model_parallel_size=4
```
Megatron-LM Backend
For very large models (70B+):
```shell
actor_rollout_ref.actor.strategy=megatron \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2
```
Checkpointing
```shell
# Save checkpoints every 10 steps
trainer.save_freq=10

# Checkpoints saved to:
# checkpoints/{project_name}/{experiment_name}/global_step_{N}/actor/

# Merge to HuggingFace format
python scripts/model_merger.py merge \
  --backend fsdp \
  --local_dir checkpoints/my_project/my_run/global_step_100/actor \
  --target_dir ./merged_model/huggingface
```
Data Preparation
verl expects Parquet files with at minimum a prompt column. Apply chat templates via tokenizer.apply_chat_template(), include answer (for reward) and data_source columns, then save with df.to_parquet().
Advanced Algorithms
See verl recipes for DAPO, SPIN, SPPO, OPO, GPG, and more.
DAPO Configuration
DAPO uses asymmetric clipping for SOTA reasoning (50% on AIME 2024 with Qwen2.5-32B):
```yaml
actor_rollout_ref:
  actor:
    clip_ratio_low: 0.2    # ε_low
    clip_ratio_high: 0.28  # ε_high (more exploration)
```
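The asymmetric range clips the importance ratio to [1 - ε_low, 1 + ε_high], giving positive-advantage updates more headroom than PPO's symmetric clip. A one-token sketch of the surrogate (illustrative, not verl's code):

```python
def dapo_clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # PPO-style surrogate with asymmetric clip range [1 - eps_low, 1 + eps_high]
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: ratio capped at 1 + eps_high (1.28) instead of PPO's 1.2
print(dapo_clipped_objective(1.5, advantage=1.0))
# Negative advantage: lower clip at 1 - eps_low (0.8), same as symmetric PPO
print(dapo_clipped_objective(0.5, advantage=-1.0))
```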
Online DPO
Generate N responses, score, form pairwise preferences, train with DPO loss. See docs/advance/dpo_extension.md.
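A sketch of the pairwise step: score N sampled responses, pair the best against the worst, and apply the standard DPO loss. The log-prob numbers below are made up for illustration:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # -log sigmoid(beta * ((logpi_w - logref_w) - (logpi_l - logref_l)))
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# N responses scored by the reward fn; best vs worst form the preference pair
scores = [0.1, 0.9, 0.4, 0.0]
chosen = max(range(len(scores)), key=scores.__getitem__)
rejected = min(range(len(scores)), key=scores.__getitem__)
print(chosen, rejected)  # 1 3

# Policy and reference log-probs for the chosen/rejected responses (made up)
print(round(dpo_loss(-5.0, -7.0, -5.5, -6.5), 4))  # 0.6444
```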
LoRA Training
```yaml
actor_rollout_ref:
  actor:
    peft:
      peft_type: lora
      lora_rank: 16
      lora_alpha: 32
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
```
Megatron-Bridge LoRA (v0.7): See references/advanced-features.md for Megatron-Bridge LoRA config with module name mapping.
Multi-Turn Rollout
Enable multi-turn conversation training with tool use:
```yaml
rollout:
  multi_turn:
    enable: true
    max_turns: 5
    tool_server: "http://tool-server:8080"
```
VLM and Multimodal RL
verl v0.7 adds VLM RL training and SFT with multimodal inputs (images, video). See references/advanced-features.md for config examples.
FP8 Rollout
Enable FP8 for faster generation:
```yaml
rollout:
  dtype: fp8
```
Async Training
verl supports async variants for higher throughput:
- One-step off-policy: Generate with previous policy, train with current
- Fully async (v0.7): Decouples rollout and training onto separate GPU pools with NCCL-based weight sync
See references/advanced-features.md for full async config and the TrainingWorker API.
Monitoring
verl logs to WandB by default. Enable Prometheus + Grafana for rollout monitoring:
```yaml
trainer:
  prometheus:
    enable: true
    port: 9090
```
Debugging
See references/troubleshooting.md for:
- Reward hacking and training instability
- KL divergence explosion
- OOM during rollout or training
- vLLM rollout failures
- Checkpoint and model merging issues
Cross-References
- hydra – Hydra configuration framework (verl uses OmegaConf/Hydra for config composition)
- wandb – Experiment tracking and monitoring
- vllm – vLLM rollout engine for inference during RL training
- fsdp – FSDP training backend
- megatron-lm – Megatron-LM training backend for large models
- pytorch – PyTorch distributed training fundamentals
- gpu-operator – GPU driver and device plugin prerequisites
- nccl – NCCL tuning for multi-node RL training
- kubeflow-trainer – Orchestrate verl training on Kubernetes
- kueue – Queue verl training workloads
- ray-train – verl uses Ray for distributed orchestration
Reference
- references/compatibility.md – version compatibility matrix (verl / vLLM / SGLang / PyTorch); RLHF interface changes between vLLM 0.12 and 0.16
- references/advanced-features.md – Megatron-Bridge LoRA, VLM/multimodal RL, fully async trainer, TrainingWorker API
- references/troubleshooting.md – common errors and fixes
- verl docs
- verl GitHub
- verl recipes
- HybridFlow paper (EuroSys 2025)
- assets/grpo_config.yaml – complete GRPO training config with vLLM rollout, math reward, W&B logging, and DAPO variant comments
- assets/architecture.md – Mermaid architecture diagrams