---
name: verl
description: "verl (Volcano Engine RL) – PPO, GRPO, DAPO, GSPO, RLOO, TIS (token/sequence importance sampling), rollout server mode, reward models, rule-based rewards, vLLM/SGLang rollout, and multi-GPU FSDP/Megatron training. Use when doing RLHF or RL post-training on LLMs."
---
verl (Volcano Engine Reinforcement Learning)
verl is a production-ready RL training framework for LLMs. Supports PPO, GRPO, DAPO, RLOO, REINFORCE++, ReMax, and more. Uses FSDP/Megatron-LM for training and vLLM/SGLang for rollout generation.
GitHub: verl-project/verl | Docs: verl.readthedocs.io
Requirements: PyTorch 2.4+, FSDP or Megatron-LM, vLLM or SGLang for rollout generation. Container image: verl.
Version pinning is strict. verl pins exact rollout engine versions. See references/compatibility.md for the full compatibility matrix and upgrade notes – mixing verl v0.7 with vLLM >0.12.0 breaks the RLHF engine interface.
Core Architecture
verl uses a 3D-HybridEngine where the same GPU pool switches between:
- Rollout (generation) – uses vLLM/SGLang for fast batched inference
- Training (policy update) – uses FSDP/Megatron-LM for gradient computation
- Reference model – computes reference log-probs for KL penalty
The hybrid engine eliminates memory redundancy by resharding model weights between training (FSDP sharded) and inference (vLLM tensor-parallel) phases on the same GPUs.
Resource allocation lifecycle per training step:
- Load model weights into vLLM, generate rollouts for the batch
- Offload vLLM, load FSDP-sharded weights for training
- Compute advantages, run PPO/GRPO updates
- Save checkpoint, repeat
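The four phases above can be sketched as a toy loop. Class and function names here are illustrative stand-ins, not verl's actual API (the real orchestration is Ray-based); the sketch only shows the phase ordering on a shared GPU pool:

```python
# Toy sketch of one hybrid-engine training step.
# ToyRolloutEngine and training_step are hypothetical, for illustration only.
class ToyRolloutEngine:
    def __init__(self):
        self.weights = None

    def load_weights(self, w):    # weight sync into the inference engine
        self.weights = w

    def generate(self, prompts):  # stand-in for vLLM/SGLang batched decoding
        return [f"response to {p}" for p in prompts]

    def offload(self):            # free KV cache + weights before training
        self.weights = None


def training_step(prompts, engine, policy_weights):
    engine.load_weights(policy_weights)       # phase 1: rollout generation
    responses = engine.generate(prompts)
    engine.offload()                          # phase 2: reshard for training
    advantages = [len(r) for r in responses]  # phase 3: placeholder scoring
    return responses, advantages              # phase 4: PPO/GRPO update would go here


engine = ToyRolloutEngine()
responses, adv = training_step(["2+2=?"], engine, {"w": 0})
print(responses[0])  # response to 2+2=?
```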
SFT (Supervised Fine-Tuning)
verl includes an SFT trainer as a pre-RL step. Data format: Parquet with prompt and response columns.
```shell
python3 -m verl.trainer.fsdp_sft_trainer \
  data.train_files=/data/sft_train.parquet \
  model.path=Qwen/Qwen2.5-7B-Instruct \
  optim.lr=2e-5 trainer.total_epochs=3 \
  trainer.n_gpus_per_node=4 trainer.save_freq=500
```
PPO Configuration
PPO is the full RLHF algorithm with actor, critic, reference model, and reward model:
```yaml
# ppo_config.yaml
data:
  train_files: /data/gsm8k/train.parquet
  val_files: /data/gsm8k/test.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 512
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    optim:
      lr: 1e-6
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 4
    ppo_epochs: 1
    clip_ratio: 0.2
    grad_clip: 1.0
    entropy_coeff: 0.0
    use_torch_compile: true
  rollout:
    name: vllm
    tensor_model_parallel_size: 1
    gpu_memory_utilization: 0.4
    temperature: 1.0
    top_p: 1.0
    n: 1  # 1 response per prompt for PPO
  ref:
    log_prob_micro_batch_size_per_gpu: 4
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # same or separate critic model
  optim:
    lr: 1e-5
  ppo_micro_batch_size_per_gpu: 4
algorithm:
  adv_estimator: gae  # Generalized Advantage Estimation
  kl_ctrl:
    kl_coef: 0.001
trainer:
  n_gpus_per_node: 4
  nnodes: 1
  total_epochs: 15
  save_freq: 10
  logger: ["console", "wandb"]
  project_name: ppo-gsm8k
```
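Runs are launched through verl's main PPO trainer entry point; the same keys shown in the YAML can be passed as Hydra-style dotted overrides. File paths and the model name below are placeholders:

```shell
python3 -m verl.trainer.main_ppo \
  data.train_files=/data/gsm8k/train.parquet \
  data.val_files=/data/gsm8k/test.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
  critic.model.path=Qwen/Qwen2.5-7B-Instruct \
  algorithm.adv_estimator=gae \
  trainer.n_gpus_per_node=4 trainer.nnodes=1
```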
GRPO Configuration
GRPO is simpler than PPO – no critic model. It samples multiple responses per prompt and uses group-relative rewards:
```yaml
# grpo_config.yaml
data:
  train_files: /data/gsm8k/train.parquet
  val_files: /data/gsm8k/test.parquet
  train_batch_size: 128
  max_prompt_length: 512
  max_response_length: 1024
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    optim:
      lr: 1e-6
    use_kl_loss: true  # KL penalty in loss function
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 2
  rollout:
    name: vllm
    n: 8  # 8 responses per prompt (key for GRPO)
    temperature: 1.0
    tensor_model_parallel_size: 2
    gpu_memory_utilization: 0.5
algorithm:
  adv_estimator: grpo  # group-relative advantage
  # No critic section needed – GRPO is critic-free
trainer:
  n_gpus_per_node: 4
  total_epochs: 20
```
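GRPO uses the same main_ppo entry point as PPO; the estimator and rollout fan-out are just overrides. Paths and the model name below are placeholders:

```shell
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  actor_rollout_ref.actor.use_kl_loss=True \
  actor_rollout_ref.rollout.n=8 \
  data.train_files=/data/gsm8k/train.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
  trainer.n_gpus_per_node=4
```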
Reward Functions
Rule-Based Rewards
```python
# verl/utils/reward_score/my_reward.py
import re

def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    match = re.search(r"####\s*(-?\d+)", solution_str)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == str(ground_truth).strip() else 0.0
```
Register in config:
```shell
reward_model.reward_fn.path=verl/utils/reward_score/my_reward.py \
reward_model.reward_fn.name=compute_reward
```
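A quick local sanity check of the GSM8K-style "#### <answer>" rule (the function body repeats the snippet above so this runs standalone):

```python
import re

def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    match = re.search(r"####\s*(-?\d+)", solution_str)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == str(ground_truth).strip() else 0.0

print(compute_reward("gsm8k", "... so the answer is #### 42", 42))  # 1.0
print(compute_reward("gsm8k", "... so the answer is #### 41", 42))  # 0.0
print(compute_reward("gsm8k", "no final answer marker", 42))        # 0.0
```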
Reward Model (Learned)
Use a trained reward model instead of rules:
```shell
reward_model.enable=True \
reward_model.model.path=my-org/reward-model-7b \
reward_model.micro_batch_size_per_gpu=4
```
Multi-Reward Composition
Combine multiple signals in a single reward function:
```python
def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    # check_answer and check_format are user-defined helpers
    correctness = check_answer(solution_str, ground_truth)  # 0 or 1
    format_score = check_format(solution_str)               # 0 to 0.5
    length_penalty = -max(0, len(solution_str) - 2000) / 10000
    return correctness + format_score + length_penalty
```
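check_answer and check_format above are user-supplied. A minimal runnable version might look like this (the helper logic is illustrative, not verl's):

```python
import re

def check_answer(solution_str, ground_truth):
    # 1.0 if the final '#### <number>' matches the ground truth, else 0.0
    match = re.search(r"####\s*(-?\d+)", solution_str)
    return 1.0 if match and match.group(1) == str(ground_truth) else 0.0

def check_format(solution_str):
    # Partial credit for showing work inside <think>...</think> tags
    return 0.5 if re.search(r"<think>.*</think>", solution_str, re.S) else 0.0

def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    correctness = check_answer(solution_str, ground_truth)
    format_score = check_format(solution_str)
    length_penalty = -max(0, len(solution_str) - 2000) / 10000
    return correctness + format_score + length_penalty

print(compute_reward("gsm8k", "<think>2+2</think> #### 4", 4))  # 1.5
```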
Hybrid Reward Manager (v0.7)
v0.7 supports combining generative (LLM-as-judge), discriminative (classifier), and rule-based rewards. Reward models run in server mode – either colocated (shared GPUs) or standalone (dedicated nodes). Manager modes: limited (rate-controlled) or remote (offloaded CPU-intensive evaluation).
```yaml
reward_model:
  enable: true
  model:
    path: my-org/reward-model-7b
  deployment: colocated  # or standalone
  manager_mode: limited  # or remote
  reward_fn:
    path: verl/utils/reward_score/my_reward.py
    name: compute_reward
```
Configuration Reference
Data Config
| Parameter | Purpose | Default |
|---|---|---|
| data.train_batch_size | Global batch size per step | 1024 |
| data.max_prompt_length | Max input tokens | 512 |
| data.max_response_length | Max generated tokens | 512 |
| data.prompt_key | Column name for prompts | prompt |
Actor Config
| Parameter | Purpose | Default |
|---|---|---|
| actor_rollout_ref.model.path | HuggingFace model | required |
| actor_rollout_ref.actor.optim.lr | Actor learning rate | 1e-6 |
| actor_rollout_ref.actor.ppo_mini_batch_size | PPO mini-batch (global) | 256 |
| actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu | Micro-batch per GPU | 8 |
| actor_rollout_ref.actor.grad_clip | Gradient clipping | 1.0 |
| actor_rollout_ref.actor.clip_ratio | PPO clip ratio | 0.2 |
| actor_rollout_ref.actor.ppo_epochs | PPO epochs per step | 1 |
| actor_rollout_ref.actor.entropy_coeff | Entropy bonus | 0.0 |
| actor_rollout_ref.actor.use_kl_loss | KL loss (for GRPO) | False |
| actor_rollout_ref.actor.kl_loss_coef | KL loss coefficient | 0.001 |
| actor_rollout_ref.actor.use_torch_compile | torch.compile | True |
Rollout Config (vLLM)
| Parameter | Purpose | Default |
|---|---|---|
| rollout.name | Engine (vllm, sglang, hf) | vllm |
| rollout.temperature | Sampling temperature | 1.0 |
| rollout.top_p | Nucleus sampling | 1.0 |
| rollout.n | Responses per prompt | 1 (8+ for GRPO) |
| rollout.tensor_model_parallel_size | TP for rollout | 1 |
| rollout.gpu_memory_utilization | vLLM GPU memory fraction | 0.5 |
| rollout.enforce_eager | Disable CUDA graphs | True |
| rollout.online_quant | Online quantization backend (no pre-quantized checkpoint needed) | None (torchao) |
Algorithm Config
| Parameter | Purpose | Options |
|---|---|---|
| algorithm.adv_estimator | Advantage estimation | gae, grpo, rloo, reinforce_plus_plus |
| algorithm.kl_ctrl.kl_coef | KL penalty coefficient | 0.001 |
| algorithm.use_kl_in_reward | KL in reward vs loss | True/False |
Trainer Config
| Parameter | Purpose | Default |
|---|---|---|
| trainer.n_gpus_per_node | GPUs per node | 8 |
| trainer.nnodes | Number of nodes | 1 |
| trainer.total_epochs | Training epochs | 1 |
| trainer.save_freq | Checkpoint frequency (steps) | -1 |
| trainer.test_freq | Validation frequency | -1 |
| trainer.logger | Logging backends | console |
| trainer.project_name | wandb project | verl |
| trainer.experiment_name | wandb run name | default |
Supported Algorithms
| Algorithm | Advantage Estimator | Critic? | Key Feature |
|---|---|---|---|
| PPO | gae | Yes | Full RLHF with value function |
| GRPO | grpo | No | Group-relative rewards, simpler |
| DAPO | grpo + DAPO recipe | No | SOTA reasoning (AIME 50pts) |
| RLOO | rloo | No | Leave-one-out baseline |
| REINFORCE++ | reinforce_plus_plus | No | Improved REINFORCE |
| ReMax | remax | No | Max-reward baseline |
| CISPO | cispo | No | Clipped IS-weight Policy Optimization (paper) |
| SAPO | sapo | No | Soft Adaptive Policy Optimization (paper) |
| GSPO | gspo | No | Group-level optimization variant (v0.6+) |
Rollout Correction (TIS) (v0.6+)
In async or off-policy RL, the policy used for rollout generation may differ from the current training policy. This mismatch degrades training signal quality. verl's Rollout Correction framework addresses this via importance sampling (IS) weights and rejection sampling (RS), configured under algorithm.rollout_correction.
Token-Level TIS
Applies per-token importance weights to mitigate the gap between rollout and training policy distributions. Automatically reweights loss contributions so that tokens generated by a stale policy receive proportionally less influence.
```yaml
algorithm:
  rollout_correction:
    rollout_is: token  # per-token IS weights (TIS)
    rollout_is_threshold: 2.0  # upper-clamp threshold
actor_rollout_ref:
  rollout:
    calculate_log_probs: true  # required
```
Sequence-Level TIS (v0.6+)
Monitors distribution mismatch and applies sequence-level IS correction. Useful for detecting and mitigating RL collapse in high-throughput async settings.
```yaml
algorithm:
  rollout_correction:
    rollout_is: sequence  # per-sequence IS weights
    rollout_is_threshold: 5.0
actor_rollout_ref:
  rollout:
    calculate_log_probs: true
```
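Conceptually, both modes weight the loss by the probability ratio between the training policy and the (possibly stale) rollout policy, clamped at the configured threshold. A minimal sketch of the weighting math only, not verl's implementation:

```python
import math

def token_is_weights(train_logprobs, rollout_logprobs, threshold=2.0):
    # Per-token TIS: w_t = clamp(exp(logp_train - logp_rollout), max=threshold)
    return [min(math.exp(a - b), threshold)
            for a, b in zip(train_logprobs, rollout_logprobs)]

def sequence_is_weight(train_logprobs, rollout_logprobs, threshold=5.0):
    # Sequence-level: one weight from the summed log-ratio over the sequence
    log_ratio = sum(train_logprobs) - sum(rollout_logprobs)
    return min(math.exp(log_ratio), threshold)

train = [-1.0, -0.5, -2.0]    # log-probs under the current training policy
rollout = [-1.2, -0.5, -1.8]  # log-probs recorded at generation time
print(token_is_weights(train, rollout))    # per-token weights near 1.0
print(sequence_is_weight(train, rollout))  # single weight for the sequence
```

Tokens generated by a badly stale policy get large log-prob gaps, so the clamp (rollout_is_threshold) bounds their influence on the update.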
Presets: Use RolloutCorrectionConfig presets for validated configurations: RolloutCorrectionConfig.decoupled_token_is(), .decoupled_seq_is(), .bypass_ppo_clip(), etc. See Rollout Correction docs.
Monitoring: When Rollout Correction is enabled, verl logs rollout_corr/kl, rollout_corr/log_ppl_abs_diff, and rollout_corr/chi2_token. Use these to detect training instability from off-policy drift.
Rollout Server Mode (v0.6+)
v0.6 transitions rollout engines from SPMD mode (embedded in trainer process) to native server mode – the inference engine runs as a standalone server with an adapter for weight synchronization. This enables:
- DP+EP support for large MoE models (e.g., expert parallelism across decode workers)
- Full-fledged serving optimizations (continuous batching, prefix caching) during rollout
- Cleaner separation between training and inference engines
Both vLLM and SGLang now use native server mode by default. The rollout config remains the same – the change is architectural. The BaseRollout interface was refactored to make the training engine agnostic of the inference engine during weight synchronization, simplifying integration of new inference backends (e.g., TensorRT-LLM).
Breaking change (v0.6): ShardingManager is deprecated and will be removed in the next release. Use the new nD dispatch API with device meshes and @register(dispatch_mode=...) decorators instead. See verl docs for the migration guide.
Multi-GPU Scaling
FSDP Backend (Default)
```shell
# 4 GPUs on 1 node
trainer.n_gpus_per_node=4 trainer.nnodes=1

# 8 GPUs on 2 nodes
trainer.n_gpus_per_node=4 trainer.nnodes=2

# With tensor parallelism for large models (rollout)
actor_rollout_ref.rollout.tensor_model_parallel_size=4
```
Megatron-LM Backend
For very large models (70B+):
```shell
actor_rollout_ref.actor.strategy=megatron \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2
```
Checkpointing
```shell
# Save checkpoints every 10 steps
trainer.save_freq=10

# Checkpoints saved to:
# checkpoints/{project_name}/{experiment_name}/global_step_{N}/actor/

# Merge to HuggingFace format
python scripts/model_merger.py merge \
  --backend fsdp \
  --local_dir checkpoints/my_project/my_run/global_step_100/actor \
  --target_dir ./merged_model/huggingface
```
Data Preparation
verl expects Parquet files with at minimum a prompt column. Apply chat templates via tokenizer.apply_chat_template(), include answer (for reward) and data_source columns, then save with df.to_parquet().
Advanced Algorithms
See verl recipes for DAPO, SPIN, SPPO, OPO, GPG, and more.
DAPO Configuration
DAPO uses asymmetric clipping for SOTA reasoning (50% on AIME 2024 with Qwen2.5-32B):
```yaml
actor_rollout_ref:
  actor:
    clip_ratio_low: 0.2    # ε_low
    clip_ratio_high: 0.28  # ε_high (more exploration)
```
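The asymmetric range clips the importance ratio to [1 - ε_low, 1 + ε_high], giving positive-advantage updates more headroom than PPO's symmetric clip. A one-token sketch of the surrogate (illustrative, not verl's code):

```python
def dapo_clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # PPO-style surrogate with asymmetric clip range [1 - eps_low, 1 + eps_high]
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: ratio capped at 1 + eps_high (1.28) instead of PPO's 1.2
print(dapo_clipped_objective(1.5, advantage=1.0))
# Negative advantage: lower clip at 1 - eps_low (0.8), same as symmetric PPO
print(dapo_clipped_objective(0.5, advantage=-1.0))
```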
Online DPO
Generate N responses, score, form pairwise preferences, train with DPO loss. See docs/advance/dpo_extension.md.
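A sketch of the pairwise step: score N sampled responses, pair the best against the worst, and apply the standard DPO loss. The log-prob numbers below are made up for illustration:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # -log sigmoid(beta * ((logpi_w - logref_w) - (logpi_l - logref_l)))
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# N responses scored by the reward fn; best vs worst form the preference pair
scores = [0.1, 0.9, 0.4, 0.0]
chosen = max(range(len(scores)), key=scores.__getitem__)
rejected = min(range(len(scores)), key=scores.__getitem__)
print(chosen, rejected)  # 1 3

# Policy and reference log-probs for the chosen/rejected responses (made up)
print(round(dpo_loss(-5.0, -7.0, -5.5, -6.5), 4))  # 0.6444
```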
LoRA Training
```yaml
actor_rollout_ref:
  actor:
    peft:
      peft_type: lora
      lora_rank: 16
      lora_alpha: 32
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
```
Megatron-Bridge LoRA (v0.7): See references/advanced-features.md for Megatron-Bridge LoRA config with module name mapping.
Multi-Turn Rollout
Enable multi-turn conversation training with tool use:
```yaml
rollout:
  multi_turn:
    enable: true
    max_turns: 5
    tool_server: "http://tool-server:8080"
```
VLM and Multimodal RL
verl v0.7 adds VLM RL training and SFT with multimodal inputs (images, video). See references/advanced-features.md for config examples.
FP8 Rollout
Enable FP8 for faster generation:
```yaml
rollout:
  dtype: fp8
```
Async Training
verl supports async variants for higher throughput:
- One-step off-policy: Generate with previous policy, train with current
- Fully async (v0.7): Decouples rollout and training onto separate GPU pools with NCCL-based weight sync
See references/advanced-features.md for full async config and the TrainingWorker API.
Monitoring
verl logs to WandB by default. Enable Prometheus + Grafana for rollout monitoring:
```yaml
trainer:
  prometheus:
    enable: true
    port: 9090
```
Debugging
See references/troubleshooting.md for:
- Reward hacking and training instability
- KL divergence explosion
- OOM during rollout or training
- vLLM rollout failures
- Checkpoint and model merging issues
Cross-References
- hydra – Hydra configuration framework (verl uses OmegaConf/Hydra for config composition)
- wandb – Experiment tracking and monitoring
- vllm – vLLM rollout engine for inference during RL training
- fsdp – FSDP training backend
- megatron-lm – Megatron-LM training backend for large models
- pytorch – PyTorch distributed training fundamentals
- gpu-operator – GPU driver and device plugin prerequisites
- nccl – NCCL tuning for multi-node RL training
- kubeflow-trainer – Orchestrate verl training on Kubernetes
- kueue – Queue verl training workloads
- ray-train – verl uses Ray for distributed orchestration
Reference
- references/compatibility.md – version compatibility matrix (verl / vLLM / SGLang / PyTorch); RLHF interface changes between vLLM 0.12 and 0.16
- references/advanced-features.md – Megatron-Bridge LoRA, VLM/multimodal RL, fully async trainer, TrainingWorker API
- references/troubleshooting.md – common errors and fixes
- verl docs
- verl GitHub
- verl recipes
- HybridFlow paper (EuroSys 2025)
- assets/grpo_config.yaml – complete GRPO training config with vLLM rollout, math reward, W&B logging, and DAPO variant comments
- assets/architecture.md – Mermaid architecture diagrams