Agent Skills

verl

@tylertitsworth/verl
by tylertitsworth · 0 forks · Updated 4/7/2026 · View on GitHub

verl (Volcano Engine RL) — PPO, GRPO, DAPO, GSPO, RLOO, TIS (token/sequence importance sampling), rollout server mode, reward models, rule-based rewards, vLLM/SGLang rollout, and multi-GPU FSDP/Megatron training. Use when doing RLHF or RL post-training on LLMs.

Installation

$ npx agent-skills-cli install @tylertitsworth/verl
Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: verl/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/verl

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: verl
description: "verl (Volcano Engine RL) — PPO, GRPO, DAPO, GSPO, RLOO, TIS (token/sequence importance sampling), rollout server mode, reward models, rule-based rewards, vLLM/SGLang rollout, and multi-GPU FSDP/Megatron training. Use when doing RLHF or RL post-training on LLMs."

verl (Volcano Engine Reinforcement Learning)

verl is a production-ready RL training framework for LLMs. It supports PPO, GRPO, DAPO, RLOO, REINFORCE++, ReMax, and more, using FSDP/Megatron-LM for training and vLLM/SGLang for rollout generation.

GitHub: verl-project/verl | Docs: verl.readthedocs.io

Requirements: PyTorch 2.4+, FSDP or Megatron-LM, vLLM or SGLang for rollout generation. Container image: verl.

Version pinning is strict. verl pins exact rollout engine versions. See references/compatibility.md for the full compatibility matrix and upgrade notes — mixing verl v0.7 with vLLM >0.12.0 breaks the RLHF engine interface.

Core Architecture

verl uses a 3D-HybridEngine where the same GPU pool switches between:

  1. Rollout (generation) — uses vLLM/SGLang for fast batched inference
  2. Training (policy update) — uses FSDP/Megatron-LM for gradient computation
  3. Reference model — computes reference log-probs for KL penalty

The hybrid engine eliminates memory redundancy by resharding model weights between training (FSDP sharded) and inference (vLLM tensor-parallel) phases on the same GPUs.

Resource allocation lifecycle per training step:

  1. Load model weights into vLLM, generate rollouts for the batch
  2. Offload vLLM, load FSDP-sharded weights for training
  3. Compute advantages, run PPO/GRPO updates
  4. Save checkpoint, repeat

SFT (Supervised Fine-Tuning)

verl includes an SFT trainer as a pre-RL step. Data format: Parquet with prompt and response columns.

python3 -m verl.trainer.fsdp_sft_trainer \
  data.train_files=/data/sft_train.parquet \
  model.path=Qwen/Qwen2.5-7B-Instruct \
  optim.lr=2e-5 trainer.total_epochs=3 \
  trainer.n_gpus_per_node=4 trainer.save_freq=500

PPO Configuration

PPO is the full RLHF algorithm with actor, critic, reference model, and reward model:

# ppo_config.yaml
data:
  train_files: /data/gsm8k/train.parquet
  val_files: /data/gsm8k/test.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 512

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    optim:
      lr: 1e-6
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 4
    ppo_epochs: 1
    clip_ratio: 0.2
    grad_clip: 1.0
    entropy_coeff: 0.0
    use_torch_compile: true
  rollout:
    name: vllm
    tensor_model_parallel_size: 1
    gpu_memory_utilization: 0.4
    temperature: 1.0
    top_p: 1.0
    n: 1                              # 1 response per prompt for PPO
  ref:
    log_prob_micro_batch_size_per_gpu: 4

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct    # same or separate critic model
  optim:
    lr: 1e-5
  ppo_micro_batch_size_per_gpu: 4

algorithm:
  adv_estimator: gae                   # Generalized Advantage Estimation
  kl_ctrl:
    kl_coef: 0.001

trainer:
  n_gpus_per_node: 4
  nnodes: 1
  total_epochs: 15
  save_freq: 10
  logger: ["console", "wandb"]
  project_name: ppo-gsm8k
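
The gae estimator in the config above is Generalized Advantage Estimation. A minimal sketch of the recursion, illustrative only and not verl's internal implementation:

```python
# GAE sketch: A_t = delta_t + gamma * lam * A_{t+1},
# where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
# Illustrative only -- verl computes this internally for adv_estimator=gae.

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """rewards: per-step rewards; values: V(s_0..s_T) incl. a bootstrap value."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With gamma=1 and lam=1 this reduces to returns minus values; lam=0 gives one-step TD errors.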

GRPO Configuration

GRPO is simpler than PPO — no critic model. It samples multiple responses per prompt and uses group-relative rewards:

# grpo_config.yaml
data:
  train_files: /data/gsm8k/train.parquet
  val_files: /data/gsm8k/test.parquet
  train_batch_size: 128
  max_prompt_length: 512
  max_response_length: 1024

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    optim:
      lr: 1e-6
    use_kl_loss: true                  # KL penalty in loss function
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 2
  rollout:
    name: vllm
    n: 8                               # 8 responses per prompt (key for GRPO)
    temperature: 1.0
    tensor_model_parallel_size: 2
    gpu_memory_utilization: 0.5

algorithm:
  adv_estimator: grpo                  # group-relative advantage

# No critic section needed — GRPO is critic-free

trainer:
  n_gpus_per_node: 4
  total_epochs: 20
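
Conceptually, the grpo estimator normalizes each response's reward within its group of n samples per prompt. A hedged sketch of the group-relative computation (illustrative, not verl's code):

```python
import statistics

# Group-relative advantage sketch: for n responses to the same prompt,
# advantage_i = (r_i - mean(group)) / (std(group) + eps).
# Illustrative only -- verl implements this for adv_estimator=grpo.

def grpo_advantages(group_rewards, eps=1e-6):
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

This is why rollout.n matters: with n=1 every group has zero variance and the advantage signal vanishes.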

Reward Functions

Rule-Based Rewards

# verl/utils/reward_score/my_reward.py
import re

def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    match = re.search(r"####\s*(-?\d+)", solution_str)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == str(ground_truth).strip() else 0.0

Register in config:

reward_model.reward_fn.path=verl/utils/reward_score/my_reward.py \
reward_model.reward_fn.name=compute_reward
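
A quick sanity check of the reward function above (repeated here so the snippet runs standalone):

```python
import re

# Same rule-based reward as above: extract the final "#### <number>" answer.
def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    match = re.search(r"####\s*(-?\d+)", solution_str)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == str(ground_truth).strip() else 0.0

# GSM8K-style completions end with "#### <number>":
print(compute_reward("gsm8k", "Reasoning... #### 42", 42))  # 1.0
print(compute_reward("gsm8k", "Reasoning... #### 41", 42))  # 0.0
print(compute_reward("gsm8k", "no final answer", 42))       # 0.0
```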

Reward Model (Learned)

Use a trained reward model instead of rules:

reward_model.enable=True \
reward_model.model.path=my-org/reward-model-7b \
reward_model.micro_batch_size_per_gpu=4

Multi-Reward Composition

Combine multiple signals in a single reward function:

def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    correctness = check_answer(solution_str, ground_truth)  # 0 or 1
    format_score = check_format(solution_str)               # 0 to 0.5
    length_penalty = -max(0, len(solution_str) - 2000) / 10000
    return correctness + format_score + length_penalty
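
check_answer and check_format are left abstract above; one possible self-contained version, with hypothetical helper implementations (these are illustrative, not verl APIs):

```python
import re

# Hypothetical helpers for the composite reward (illustrative assumptions).
def check_answer(solution_str, ground_truth):
    match = re.search(r"####\s*(-?\d+)", solution_str)
    return 1.0 if match and match.group(1) == str(ground_truth) else 0.0

def check_format(solution_str):
    # Partial credit (0.5) for showing work inside <think>...</think> tags.
    return 0.5 if re.search(r"<think>.*</think>", solution_str, re.S) else 0.0

def compute_reward(data_source, solution_str, ground_truth, extra_info=None):
    correctness = check_answer(solution_str, ground_truth)  # 0 or 1
    format_score = check_format(solution_str)               # 0 to 0.5
    length_penalty = -max(0, len(solution_str) - 2000) / 10000
    return correctness + format_score + length_penalty
```

Keeping correctness dominant over shaping terms helps avoid reward hacking on format alone.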

Hybrid Reward Manager (v0.7)

v0.7 supports combining generative (LLM-as-judge), discriminative (classifier), and rule-based rewards. Reward models run in server mode — either colocated (shared GPUs) or standalone (dedicated nodes). Manager modes: limited (rate-controlled) or remote (offloaded CPU-intensive evaluation).

reward_model:
  enable: true
  model:
    path: my-org/reward-model-7b
  deployment: colocated                # or standalone
  manager_mode: limited                # or remote
  reward_fn:
    path: verl/utils/reward_score/my_reward.py
    name: compute_reward

Configuration Reference

Data Config

| Parameter | Purpose | Default |
|---|---|---|
| data.train_batch_size | Global batch size per step | 1024 |
| data.max_prompt_length | Max input tokens | 512 |
| data.max_response_length | Max generated tokens | 512 |
| data.prompt_key | Column name for prompts | prompt |

Actor Config

| Parameter | Purpose | Default |
|---|---|---|
| actor_rollout_ref.model.path | HuggingFace model | required |
| actor_rollout_ref.actor.optim.lr | Actor learning rate | 1e-6 |
| actor_rollout_ref.actor.ppo_mini_batch_size | PPO mini-batch (global) | 256 |
| actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu | Micro-batch per GPU | 8 |
| actor_rollout_ref.actor.grad_clip | Gradient clipping | 1.0 |
| actor_rollout_ref.actor.clip_ratio | PPO clip ratio | 0.2 |
| actor_rollout_ref.actor.ppo_epochs | PPO epochs per step | 1 |
| actor_rollout_ref.actor.entropy_coeff | Entropy bonus | 0.0 |
| actor_rollout_ref.actor.use_kl_loss | KL loss (for GRPO) | False |
| actor_rollout_ref.actor.kl_loss_coef | KL loss coefficient | 0.001 |
| actor_rollout_ref.actor.use_torch_compile | torch.compile | True |

Rollout Config (vLLM)

| Parameter | Purpose | Default |
|---|---|---|
| rollout.name | Engine (vllm, sglang, hf) | vllm |
| rollout.temperature | Sampling temperature | 1.0 |
| rollout.top_p | Nucleus sampling | 1.0 |
| rollout.n | Responses per prompt | 1 (8+ for GRPO) |
| rollout.tensor_model_parallel_size | TP for rollout | 1 |
| rollout.gpu_memory_utilization | vLLM GPU memory fraction | 0.5 |
| rollout.enforce_eager | Disable CUDA graphs | True |
| rollout.online_quant | Online quantization backend (no pre-quantized checkpoint needed) | None (torchao) |

Algorithm Config

| Parameter | Purpose | Options |
|---|---|---|
| algorithm.adv_estimator | Advantage estimation | gae, grpo, rloo, reinforce_plus_plus |
| algorithm.kl_ctrl.kl_coef | KL penalty coefficient | 0.001 |
| algorithm.use_kl_in_reward | KL in reward vs loss | True/False |

Trainer Config

| Parameter | Purpose | Default |
|---|---|---|
| trainer.n_gpus_per_node | GPUs per node | 8 |
| trainer.nnodes | Number of nodes | 1 |
| trainer.total_epochs | Training epochs | 1 |
| trainer.save_freq | Checkpoint frequency (steps) | -1 |
| trainer.test_freq | Validation frequency | -1 |
| trainer.logger | Logging backends | console |
| trainer.project_name | wandb project | verl |
| trainer.experiment_name | wandb run name | default |

Supported Algorithms

| Algorithm | Advantage Estimator | Critic? | Key Feature |
|---|---|---|---|
| PPO | gae | Yes | Full RLHF with value function |
| GRPO | grpo | No | Group-relative rewards, simpler |
| DAPO | grpo + DAPO recipe | No | SOTA reasoning (AIME 50 pts) |
| RLOO | rloo | No | Leave-one-out baseline |
| REINFORCE++ | reinforce_plus_plus | No | Improved REINFORCE |
| ReMax | remax | No | Max-reward baseline |
| CISPO | cispo | No | Clipped IS-weight Policy Optimization (paper) |
| SAPO | sapo | No | Soft Adaptive Policy Optimization (paper) |
| GSPO | gspo | No | Group-level optimization variant (v0.6+) |

Rollout Correction (TIS) (v0.6+)

In async or off-policy RL, the policy used for rollout generation may differ from the current training policy. This mismatch degrades training signal quality. verl's Rollout Correction framework addresses this via importance sampling (IS) weights and rejection sampling (RS) — configured under algorithm.rollout_correction.

Token-Level TIS

Applies per-token importance weights to mitigate the gap between rollout and training policy distributions. Automatically reweights loss contributions so that tokens generated by a stale policy receive proportionally less influence.

algorithm:
  rollout_correction:
    rollout_is: token        # per-token IS weights (TIS)
    rollout_is_threshold: 2.0  # upper-clamp threshold
actor_rollout_ref:
  rollout:
    calculate_log_probs: true  # required
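
The per-token correction can be sketched as a clamped importance weight on the loss (illustrative math, not verl internals):

```python
import math

# Token-level IS sketch: w_t = min(exp(logp_train - logp_rollout), threshold).
# Tokens the current policy finds less likely than the rollout policy did
# get weight < 1; the upper clamp (rollout_is_threshold) bounds variance.

def token_is_weights(train_logprobs, rollout_logprobs, threshold=2.0):
    return [
        min(math.exp(lt - lr), threshold)
        for lt, lr in zip(train_logprobs, rollout_logprobs)
    ]
```

On-policy data yields weights of exactly 1, so the correction is a no-op when rollout and training policies match.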

Sequence-Level TIS (v0.6+)

Monitors distribution mismatch and applies sequence-level IS correction. Useful for detecting and mitigating RL collapse in high-throughput async settings.

algorithm:
  rollout_correction:
    rollout_is: sequence     # per-sequence IS weights
    rollout_is_threshold: 5.0
actor_rollout_ref:
  rollout:
    calculate_log_probs: true

Presets: Use RolloutCorrectionConfig presets for validated configurations: RolloutCorrectionConfig.decoupled_token_is(), .decoupled_seq_is(), .bypass_ppo_clip(), etc. See Rollout Correction docs.

Monitoring: When Rollout Correction is enabled, verl logs rollout_corr/kl, rollout_corr/log_ppl_abs_diff, and rollout_corr/chi2_token. Use these to detect training instability from off-policy drift.

Rollout Server Mode (v0.6+)

v0.6 transitions rollout engines from SPMD mode (embedded in trainer process) to native server mode — the inference engine runs as a standalone server with an adapter for weight synchronization. This enables:

  • DP+EP support for large MoE models (e.g., expert parallelism across decode workers)
  • Full-fledged serving optimizations (continuous batching, prefix caching) during rollout
  • Cleaner separation between training and inference engines

Both vLLM and SGLang now use native server mode by default. The rollout config remains the same — the change is architectural. The BaseRollout interface was refactored to make the training engine agnostic of the inference engine during weight synchronization, simplifying integration of new inference backends (e.g., TensorRT-LLM).

Breaking change (v0.6): ShardingManager is deprecated and will be removed in the next release. Use the new nD dispatch API with device meshes and @register(dispatch_mode=...) decorators instead. See verl docs for the migration guide.

Multi-GPU Scaling

FSDP Backend (Default)

# 4 GPUs on 1 node
trainer.n_gpus_per_node=4 trainer.nnodes=1

# 8 GPUs on 2 nodes
trainer.n_gpus_per_node=4 trainer.nnodes=2

# With tensor parallelism for large models (rollout)
actor_rollout_ref.rollout.tensor_model_parallel_size=4

Megatron-LM Backend

For very large models (70B+):

actor_rollout_ref.actor.strategy=megatron \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2

Checkpointing

# Save checkpoints every 10 steps
trainer.save_freq=10

# Checkpoints saved to:
# checkpoints/{project_name}/{experiment_name}/global_step_{N}/actor/

# Merge to HuggingFace format
python scripts/model_merger.py merge \
  --backend fsdp \
  --local_dir checkpoints/my_project/my_run/global_step_100/actor \
  --target_dir ./merged_model/huggingface

Data Preparation

verl expects Parquet files with at minimum a prompt column. Apply chat templates via tokenizer.apply_chat_template(), include answer (for reward) and data_source columns, then save with df.to_parquet().
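
A hedged sketch of that preparation step. Column names follow the text above; the template function stands in for tokenizer.apply_chat_template, and the pandas write is shown only in a comment so the sketch stays dependency-free:

```python
# Build verl-style training records; write with pandas, e.g.:
#   pd.DataFrame(records).to_parquet("/data/gsm8k/train.parquet")

def build_records(examples, apply_chat_template):
    records = []
    for ex in examples:
        prompt = apply_chat_template(
            [{"role": "user", "content": ex["question"]}]
        )
        records.append({
            "prompt": prompt,          # required column
            "answer": ex["answer"],    # consumed by the reward function
            "data_source": "gsm8k",    # routes to the right reward function
        })
    return records

# Stand-in for tokenizer.apply_chat_template(..., tokenize=False):
def fake_template(msgs):
    return f"<|user|>{msgs[0]['content']}<|assistant|>"

records = build_records([{"question": "2+2?", "answer": "4"}], fake_template)
```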

Advanced Algorithms

See verl recipes for DAPO, SPIN, SPPO, OPO, GPG, and more.

DAPO Configuration

DAPO uses asymmetric clipping for SOTA reasoning (50% on AIME 2024 with Qwen2.5-32B):

actor_rollout_ref:
  actor:
    clip_ratio_low: 0.2               # ε_low
    clip_ratio_high: 0.28             # ε_high (more exploration)
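
The asymmetric clip keeps the standard PPO lower bound but loosens the upper bound, so low-probability tokens can be upweighted more aggressively. A sketch of the clipped objective for a single token (illustrative, not verl's loss code):

```python
# DAPO-style asymmetric clipping sketch: the probability ratio is clipped
# to [1 - eps_low, 1 + eps_high] instead of PPO's symmetric interval.

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```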

Online DPO

Generate N responses, score, form pairwise preferences, train with DPO loss. See docs/advance/dpo_extension.md.
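
The pairwise step uses the standard DPO loss. A sketch of the per-pair loss (illustrative; beta and the log-prob inputs are assumptions, not verl parameter names):

```python
import math

# DPO loss sketch: -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))),
# where *_w is the chosen response and *_l the rejected one.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A zero margin gives log 2; the loss shrinks as the policy prefers the chosen response more than the reference does.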

LoRA Training

actor_rollout_ref:
  actor:
    peft:
      peft_type: lora
      lora_rank: 16
      lora_alpha: 32
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]

Megatron-Bridge LoRA (v0.7): See references/advanced-features.md for Megatron-Bridge LoRA config with module name mapping.

Multi-Turn Rollout

Enable multi-turn conversation training with tool use:

rollout:
  multi_turn:
    enable: true
    max_turns: 5
    tool_server: "http://tool-server:8080"

VLM and Multimodal RL

verl v0.7 adds VLM RL training and SFT with multimodal inputs (images, video). See references/advanced-features.md for config examples.

FP8 Rollout

Enable FP8 for faster generation:

rollout:
  dtype: fp8

Async Training

verl supports async variants for higher throughput:

  • One-step off-policy: Generate with previous policy, train with current
  • Fully async (v0.7): Decouples rollout and training onto separate GPU pools with NCCL-based weight sync

See references/advanced-features.md for full async config and the TrainingWorker API.

Monitoring

verl logs to the console by default; add wandb to trainer.logger for experiment tracking. Enable Prometheus + Grafana for rollout monitoring:

trainer:
  prometheus:
    enable: true
    port: 9090

Debugging

See references/troubleshooting.md for:

  • Reward hacking and training instability
  • KL divergence explosion
  • OOM during rollout or training
  • vLLM rollout failures
  • Checkpoint and model merging issues

Cross-References

  • hydra — Hydra configuration framework (verl uses OmegaConf/Hydra for config composition)
  • wandb — Experiment tracking and monitoring
  • vllm — vLLM rollout engine for inference during RL training
  • fsdp — FSDP training backend
  • megatron-lm — Megatron-LM training backend for large models
  • pytorch — PyTorch distributed training fundamentals
  • gpu-operator — GPU driver and device plugin prerequisites
  • nccl — NCCL tuning for multi-node RL training
  • kubeflow-trainer — Orchestrate verl training on Kubernetes
  • kueue — Queue verl training workloads
  • ray-train — verl uses Ray for distributed orchestration

Reference