Agent Skills
tylertitsworth

tensorrt-llm

@tylertitsworth/tensorrt-llm
0 forks
Updated 4/1/2026

TensorRT-LLM — engine building, quantization (FP8/FP4/INT4/AWQ), Python LLM API, AutoDeploy, KV cache tuning, in-flight batching, disaggregated serving with HTTP cluster management, Ray orchestrator, sparse attention (RocketKV), Triton backend. Use when optimizing directly with TRT-LLM. NOT for NIM deployment or vLLM/SGLang setup.

Installation

$ npx agent-skills-cli install @tylertitsworth/tensorrt-llm
Claude Code
Cursor
Copilot
Codex
Antigravity

Details

Path: tensorrt-llm/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/tensorrt-llm

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: tensorrt-llm
description: "TensorRT-LLM — engine building, quantization (FP8/FP4/INT4/AWQ), Python LLM API, AutoDeploy, KV cache tuning, in-flight batching, disaggregated serving with HTTP cluster management, Ray orchestrator, sparse attention (RocketKV), Triton backend. Use when optimizing directly with TRT-LLM. NOT for NIM deployment or vLLM/SGLang setup."

TensorRT-LLM

TensorRT-LLM is NVIDIA's open-source library for optimizing LLM inference on NVIDIA GPUs. It compiles model weights into GPU-specific execution plans with fused kernels, quantization, and hardware-specific optimizations that generic runtimes (vLLM, SGLang) cannot replicate.

Key differentiator from vLLM/SGLang: TRT-LLM compiles models into optimized engine plans with operator fusion, architecture-specific CUDA kernels (e.g., wgmma and TMA on Hopper), and calibrated quantization baked in at build time. NIM wraps TRT-LLM with pre-built profiles — this skill covers using TRT-LLM directly.

Version: 1.3.0 (PyTorch backend is stable and default since v1.0; LLM API is stable. v1.2 added AutoDeploy, Ray orchestrator, HTTP disagg cluster management. v1.3 adds KVCacheManager v2, Python KV transceiver, GPU energy benchmarking).

Breaking Changes & Version History

v1.0: PyTorch backend is default; LLM API is stable; cuda_graph_config.padding_enabled → enable_padding; mixed_sampler → enable_mixed_sampler; LLM.autotuner_enabled → enable_autotuner. Removed: batch_manager::KvCacheConfig, TrtGptModelOptionalParams, deprecated LoRA args (use lora_config).

v1.1: C++ TRTLLM sampler is default (use sampler_type in SamplingConfig to override). KV Cache Connector API for disaggregated serving state transfer.

v1.2: Sampling strategy selection refined (BREAKING — review SamplingConfig defaults). Added AutoDeploy, Ray orchestrator, HTTP disagg cluster management, etcd storage, factory TP sharding of quantized models, block-sparse attention (RocketKV), cache_salt in LLM.generate, adaptive speculative decoding threshold.

v1.3: KVCacheManager v2 with dynamic quota resize, Python KV transceiver for disaggregated serving, GPU energy monitoring in trtllm-bench, PEFT safetensors loading, FLUX text-to-image pipeline support.

Python LLM API

The high-level LLM class is the primary interface — mirrors vLLM's API shape intentionally:

from tensorrt_llm import LLM, SamplingParams

# Basic: pass HuggingFace model name or local path
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Pre-quantized NVIDIA checkpoints (FP8, FP4) work directly
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")

prompts = ["The capital of France is", "Explain transformers in one sentence:"]
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

for output in llm.generate(prompts, sampling):
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")

Key LLM Constructor Arguments

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    backend="pytorch",                 # "pytorch" (default since v1.0) or "tensorrt"
    tensor_parallel_size=4,            # shard across GPUs
    pipeline_parallel_size=1,          # pipeline stages (multi-node)
    dtype="bfloat16",                  # auto, float16, bfloat16
    enable_autotuner=False,            # renamed from autotuner_enabled in v1.0
    max_batch_size=256,                # max concurrent requests
    max_num_tokens=8192,               # max tokens per batch iteration
    max_seq_len=4096,                  # max sequence length (KV cache sized to this)
    kv_cache_free_gpu_memory_fraction=0.90,  # GPU memory fraction for KV cache
)

Context manager pattern recommended — ensures clean shutdown of MPI processes:

with LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8") as llm:
    outputs = list(llm.generate(["Hello"], SamplingParams()))

Multi-GPU (Single Node)

No mpirun prefix needed — LLM handles process spawning internally:

# Tensor parallel across 4 GPUs — just set the kwarg
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
)

Async / Streaming

import asyncio
from tensorrt_llm import LLM, SamplingParams

async def stream():
    llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")
    async for output in llm.generate_async(
        "Write a haiku about GPUs:", SamplingParams(max_tokens=64),
        streaming=True,
    ):
        print(output.outputs[0].text, end="", flush=True)

asyncio.run(stream())

AutoDeploy (v1.2+, Beta)

AutoDeploy is a beta backend (added v1.2) that automatically converts HuggingFace models to optimized TRT-LLM engines — auto-selects quantization, applies factory TP sharding, and enables CUDA graphs for VLM subgraphs. It is not the default; the default since v1.0 is backend="pytorch".

Explicitly opt in by passing backend="autodeploy":

from tensorrt_llm import LLM

# AutoDeploy — explicitly opt in (beta backend; NOT the default)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    backend="autodeploy",
)

# For VLMs, AutoDeploy handles subgraph compilation and CUDA graph capture
llm = LLM(model="liuhaotian/llava-v1.5-7b", backend="autodeploy")

Supports 100+ text-to-text LLMs and early VLM support. For full manual control, use the standard PyTorch backend (the default, backend="pytorch") or the legacy TRT engine path (backend="tensorrt" with manual engine build).

Quantization

TRT-LLM supports quantization via pre-quantized checkpoints (preferred) or runtime calibration with NVIDIA ModelOpt.

Pre-Quantized Checkpoints (Recommended)

NVIDIA publishes optimized checkpoints on HuggingFace — pass directly to LLM():

# FP8 — Hopper+ (H100/H200/B200). Minimal quality loss, ~2× throughput vs FP16
llm = LLM(model="nvidia/Llama-3.1-70B-Instruct-FP8")

# FP4 (NVFP4) — Blackwell (B200/GB200). ~2.5-3× throughput vs FP16
llm = LLM(model="nvidia/Llama-3.3-70B-Instruct-FP4")

# INT4 AWQ — Ampere+ (A100/H100). Good quality, small memory footprint
llm = LLM(model="nvidia/Llama-3.1-70B-Instruct-AWQ-INT4")

Browse all optimized checkpoints: NVIDIA Model Optimizer Collection

Quantization with ModelOpt (Calibration)

For custom models not in NVIDIA's collection, quantize with TensorRT Model Optimizer:

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("my-org/custom-llama-70b", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("my-org/custom-llama-70b")

# Calibration dataset — use representative prompts from your workload
def calibration_data():
    for text in ["Example prompt 1...", "Example prompt 2...", ...]:
        yield tokenizer(text, return_tensors="pt").input_ids.cuda()

# FP8 quantization with calibration
mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibration_data)

# Export to TRT-LLM checkpoint format
export_tensorrt_llm_checkpoint(model, "float16", export_dir="/tmp/fp8_ckpt")

Then load the calibrated checkpoint:

llm = LLM(model="/tmp/fp8_ckpt")

Quantization Methods Comparison

| Method | Bits | Quality | Throughput Gain | GPU Requirement | Calibration? |
|---|---|---|---|---|---|
| FP8 | 8 | Excellent | ~1.5-2× | Hopper+ (H100) | Yes (baked in) |
| NVFP4 | 4 | Good | ~2.5-3× | Blackwell (B200) | Yes |
| INT8 SmoothQuant (W8A8) | 8 | Good | ~1.3-1.5× | Ampere+ | Yes |
| INT8 Weight-Only (W8A16) | 8 | Very Good | ~1.2× | Any | No |
| INT4 Weight-Only (W4A16) | 4 | Good | ~1.5× | Any | No |
| AWQ (W4A16) | 4 | Good | ~1.5-2× | Ampere+ | Yes (group) |
| GPTQ (W4A16) | 4 | Good | ~1.5-2× | Ampere+ | Yes (group) |
| FP8 KV Cache | 8 | Excellent | Memory saving | Hopper+ | Yes |

Key difference from vLLM: TRT-LLM bakes calibration data into the engine at compile time, producing tighter quantization ranges. vLLM applies quantization at load time with less optimization opportunity.
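The throughput column above pairs with a simple memory story: weight footprint scales linearly with bit width. A back-of-the-envelope sketch (weights only; real engines add quantization scales, activations, and KV cache on top):

```python
def weight_memory_gib(num_params: float, bits: int) -> float:
    """Rough weight-only memory footprint: params * bits / 8, in GiB."""
    return num_params * bits / 8 / 1024**3

# A 70B-parameter model at the precisions from the table above
fp16 = weight_memory_gib(70e9, 16)   # ~130 GiB: needs multi-GPU
fp8  = weight_memory_gib(70e9, 8)    # ~65 GiB: fits one H200
int4 = weight_memory_gib(70e9, 4)    # ~33 GiB: fits one A100-40GB with headroom
```

This is why dropping from FP16 to FP8 or INT4 often matters as much for batch capacity (memory left over for KV cache) as for raw kernel throughput.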

Online Serving (trtllm-serve)

OpenAI-compatible server — supports /v1/completions, /v1/chat/completions, /v1/responses:

# Programmatic equivalent of the CLI server
from openai import OpenAI

# Start server: trtllm-serve "nvidia/Llama-3.1-8B-Instruct-FP8" --tp_size 2
client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain KV cache in 2 sentences."}],
    max_tokens=128,
)

Server Configuration via YAML

Create a config.yml for advanced tuning:

# config.yml — pass to trtllm-serve with --config
kv_cache_config:
  enable_block_reuse: true              # prefix caching (shared system prompts)
  free_gpu_memory_fraction: 0.90        # GPU memory for KV cache

pytorch_backend_config:
  enable_overlap_scheduler: true        # overlap compute and communication

cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
Then launch the server with this config:

trtllm-serve "nvidia/Llama-3.1-70B-Instruct-FP8" \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096 \
  --config config.yml

Performance Tuning

KV Cache Configuration

The KV cache is the primary memory consumer after model weights. Tuning it directly controls batch capacity:

llm = LLM(
    model="nvidia/Llama-3.1-70B-Instruct-FP8",
    tensor_parallel_size=2,
    max_seq_len=4096,                          # lower = more batch capacity
    kv_cache_free_gpu_memory_fraction=0.90,    # 90% of remaining GPU memory for KV
)
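The arithmetic behind "lower max_seq_len = more batch capacity": per-token KV footprint is fixed by the model architecture, so sequence length and KV dtype directly bound how many sequences fit. A rough sizing sketch (the architecture numbers are illustrative values for Llama-3.1-70B with GQA; verify against the model config):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3.1-70B (GQA): 80 layers, 8 KV heads, head_dim 128, fp16 KV cache
per_token = kv_bytes_per_token(80, 8, 128, 2)     # 327,680 B ≈ 320 KiB/token
per_seq_gib = per_token * 4096 / 1024**3          # 1.25 GiB per full-length sequence

# With, say, 60 GiB left for KV after weights, rough full-length batch capacity:
capacity = int(60 / per_seq_gib)                  # 48 sequences; fp8 KV doubles this
```

Halving max_seq_len or switching the KV cache to fp8 each roughly doubles the number of concurrent full-length sequences.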

Block reuse (prefix caching): reuses KV cache blocks for shared prefixes (system prompts, few-shot examples). Enable via config YAML:

kv_cache_config:
  enable_block_reuse: true

FP8 KV cache: halves KV cache memory on Hopper GPUs, doubling effective batch capacity:

kv_cache_config:
  kv_cache_dtype: fp8

Scheduler and Batching

TRT-LLM uses in-flight batching (continuous batching) by default — new requests join mid-batch without waiting for the full batch to complete.

Key tuning knobs:

  • max_batch_size: max requests per iteration. Higher = more throughput, more latency variance
  • max_num_tokens: max tokens processed per iteration across all requests. Controls prefill chunking
  • Overlap scheduler: overlaps GPU compute with CPU scheduling — enable for throughput:
    pytorch_backend_config:
      enable_overlap_scheduler: true
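The in-flight batching behavior can be sketched with a toy scheduler (a conceptual model only, not the real TRT-LLM scheduler, which also handles prefill chunking and token budgets):

```python
from collections import deque

def simulate_inflight(requests, max_batch_size):
    """Toy in-flight batching: each iteration decodes one token per active
    request and admits waiting requests as soon as a slot frees up."""
    waiting = deque(requests)          # (name, tokens_to_generate)
    active, start_step, done = {}, {}, {}
    step = 0
    while waiting or active:
        # Requests join mid-flight: no waiting for the whole batch to drain
        while waiting and len(active) < max_batch_size:
            name, n = waiting.popleft()
            active[name] = n
            start_step[name] = step
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                done[name] = step
                del active[name]
        step += 1
    return start_step, done

start, done = simulate_inflight([("A", 5), ("B", 2), ("C", 3)], max_batch_size=2)
# "C" is admitted at step 2, the moment "B" finishes; it does not wait for
# "A" (the longest request) to drain, unlike static batching.
```

This is the property that makes latency variance grow with max_batch_size: short requests finish fast, but a long request can share its whole lifetime with a rotating cast of neighbors.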
    

CUDA Graphs

Reduce kernel launch overhead for small batches (decode phase):

cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]

Pre-captures CUDA graphs at specified batch sizes. Padding ensures batch sizes snap to the next captured size. Critical for latency-sensitive deployments.
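The snapping behavior can be illustrated directly (a sketch of the padding rule, not TRT-LLM internals):

```python
import bisect

def snap_batch_size(batch_size, captured):
    """With enable_padding, a runtime batch is padded up to the next
    captured size so a pre-captured CUDA graph can be replayed."""
    captured = sorted(captured)
    i = bisect.bisect_left(captured, batch_size)
    if i == len(captured):
        return None  # larger than any captured size: falls back to eager mode
    return captured[i]

sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
snap_batch_size(3, sizes)    # → 4 (batch of 3 padded up to the captured 4-graph)
snap_batch_size(64, sizes)   # → 64 (exact hit, no padding)
snap_batch_size(300, sizes)  # → None (no graph captured; eager execution)
```

The padding wastes a few slots of compute but avoids kernel launch overhead, which dominates at small decode batches.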

Speculative Decoding

Use a smaller draft model to generate candidates, verified by the main model:

from tensorrt_llm.llmapi import DraftTargetDecodingConfig

speculative_config = DraftTargetDecodingConfig(
    max_draft_len=5,
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_config=speculative_config,
    disable_overlap_scheduler=True,  # required when speculative decoding is enabled
)

For EAGLE-3 speculative decoding, use EagleDecodingConfig from tensorrt_llm.llmapi instead. Also supports guided decoding integration with speculative decoding, draft model chunked prefill (v1.1), and adaptive speculative decoding (v1.2+) — automatically disables speculative decoding when acceptance-length falls below a threshold, avoiding overhead for prompts where drafts are rarely accepted. See model support matrix in references/model-support.md.
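The payoff and the motivation for the adaptive threshold both fall out of simple expected-value arithmetic. Under an idealized model with i.i.d. per-token acceptance probability p and draft length k, one target-model verification pass commits on average (1 - p^(k+1)) / (1 - p) tokens (real acceptance varies per prompt; this is a sketch, not TRT-LLM's estimator):

```python
def expected_tokens_per_step(accept_prob: float, max_draft_len: int) -> float:
    """Expected tokens committed per target verification pass, assuming
    i.i.d. acceptance: E = (1 - p**(k+1)) / (1 - p). One pass always
    yields at least one token (the bonus/corrected token)."""
    p, k = accept_prob, max_draft_len
    if p == 1.0:
        return k + 1
    return (1 - p ** (k + 1)) / (1 - p)

expected_tokens_per_step(0.8, 5)   # ≈ 3.69 tokens per pass: drafting pays off
expected_tokens_per_step(0.3, 5)   # ≈ 1.43: barely better than no drafting,
                                   # which is when adaptive mode disables it
```

When the expected gain no longer covers the draft model's cost, speculative decoding is pure overhead, which is exactly what the v1.2+ adaptive acceptance-length threshold guards against.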

Sparse Attention (v1.2+)

Block-sparse attention framework with RocketKV — dynamically selects which KV cache blocks are relevant per query, reducing memory and compute for long-context workloads:

# Enable sparse attention in config — field names are illustrative;
# verify exact keys against upstream config docs (feature added in PR #8086)
sparse_attention:
  enabled: true
  framework: "rocketkv"    # RocketKV sparse selection

Cache Salt (Multi-Tenant Isolation)

Use cache_salt in LLM.generate() to isolate prefix cache entries between tenants sharing the same engine:

# Different tenants get isolated cache namespaces
out1 = llm.generate(prompts, sampling, cache_salt="tenant-a")
out2 = llm.generate(prompts, sampling, cache_salt="tenant-b")
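Conceptually, the salt is mixed into the prefix-cache block key, so identical token prefixes from different tenants never match each other's cached blocks. A sketch of the idea (illustrative only; this is not TRT-LLM's actual hashing scheme):

```python
import hashlib

def block_cache_key(salt: str, token_block: list[int]) -> str:
    """Hashing the salt together with the token block means two tenants
    sending identical prefixes map to different cache entries."""
    h = hashlib.sha256(salt.encode())
    h.update(str(token_block).encode())
    return h.hexdigest()

blk = [101, 2023, 2003, 1037]  # the same token prefix for both tenants
key_a = block_cache_key("tenant-a", blk)
key_b = block_cache_key("tenant-b", blk)
assert key_a != key_b  # isolated cache namespaces despite identical content
```

The cost is that salted tenants cannot share cache hits with each other, so reserve it for cases where cross-tenant timing side channels or prompt leakage actually matter.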

Attention Data Parallelism

For MoE models (DeepSeek, Mixtral, Qwen3-MoE), enable attention DP to separate expert parallelism from attention:

enable_attention_dp: true

This allows expert parallelism (EP) across GPUs while replicating attention heads — critical for efficient MoE serving.
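As a toy illustration of the split (expert placement only; the real sharding, load balancing, and all-to-all routing are handled by the runtime), assuming a Mixtral-style 8-expert layer:

```python
def moe_placement(num_experts: int, num_gpus: int) -> dict[int, list[int]]:
    """With enable_attention_dp, expert weights shard across GPUs (EP)
    while every GPU keeps a full replica of the attention weights (DP).
    Contiguous placement here is illustrative."""
    assert num_experts % num_gpus == 0
    per_gpu = num_experts // num_gpus
    return {g: list(range(g * per_gpu, (g + 1) * per_gpu))
            for g in range(num_gpus)}

moe_placement(8, 4)  # 8 experts over 4 GPUs → 2 experts each, attention replicated
```

The win is that the small attention weights are cheap to replicate, while the large expert weights, which dominate MoE parameter count, are never duplicated.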

TRT-LLM vs vLLM vs NIM

| Aspect | TRT-LLM (Direct) | vLLM | NIM |
|---|---|---|---|
| Kernel optimization | GPU-arch-specific fused kernels | Generic CUDA kernels | TRT-LLM engines (pre-built) |
| Quantization | Calibrated, baked at compile time | Runtime, less optimized | Auto-selected per GPU |
| Configuration | Full control (Python API + YAML) | Full control (Python kwargs) | Env vars only |
| Model support | NVIDIA-validated architectures | Any HuggingFace | NGC catalog only |
| Setup complexity | Medium (need NVIDIA container) | Low (pip install) | Lowest (docker run) |
| Engine build time | Minutes-hours (first run) | None | Pre-built or JIT |
| Performance ceiling | Highest (with tuning) | Good | High (pre-tuned) |

Use TRT-LLM directly when:

  • You need maximum control over quantization, batching, and memory
  • Deploying custom/fine-tuned models that aren't in the NIM catalog
  • Building custom serving infrastructure (Triton, custom gRPC)
  • You want GPU-arch-specific kernel optimization without NIM's packaging

Use NIM instead when:

  • Deploying NGC catalog models with zero tuning effort
  • Need pre-built engines (skip compilation time)
  • Want automatic hardware-aware profile selection

Triton Inference Server Backend

TRT-LLM integrates with Triton for production serving with multi-model support, request queuing, and ensemble pipelines. See references/triton-backend.md for configuration.

Disaggregated Serving (v1.1+)

Separate context (prefill) and generation (decode) phases across different workers for better resource utilization.

v1.1: KV Cache Connector API for state transfer between workers. v1.2: HTTP-based disagg cluster management with etcd coordination — manage worker registration, health, and routing via REST API:

# config.yml for disaggregated cluster
kv_cache_connector:
  enabled: true

disagg_cluster:
  management: http              # HTTP API for cluster management (v1.2+)
  etcd_endpoints: "etcd:2379"   # etcd for worker discovery and coordination

v1.3: Python KV transceiver for zero-copy KV cache transfer between context/generation workers, KVCacheManager v2 with dynamic quota resize (adapts KV memory allocation at runtime based on load).

Ray orchestrator (v1.2+): Use Ray as the distributed backend for disaggregated serving — manages worker pools, scaling, and fault tolerance:

# orchestrator_type kwarg added in v1.2 (PR #7520); verify param name against LLM API docs
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    orchestrator_type="ray",    # "ray" | "mpi" (default)
)

Supports guided decoding in disaggregated mode, optimized KV cache transfer for uneven pipeline parallelism, and dedicated benchmarking.

Hardware Support

  • Blackwell (B200/B300/GB200/GB300): Full support including NVFP4 quantization, MLA chunked prefill, DeepEP FP4 kernels, CuteDSL NVFP4 grouped GEMM. B300/GB300 added in v1.1 (multi-node beta).
  • Hopper (H100/H200): Primary target. FP8, spec dec XQA multi-block mode, W4A8 MoE kernels.
  • Ampere (A100): Supported with INT8/INT4 quantization.
  • DGX Spark: Beta support (single-node) added in v1.2 for validated models.

Supported Models

The PyTorch backend supports: Llama 3.x/4, DeepSeek V3/R1, Qwen 2/3/3-MoE, Mistral 3.1, Mixtral, Gemma 2/3, Phi-4 (multimodal), Nemotron, GPT-OSS, EXAONE 4.0, Seed-OSS, Hunyuan (Dense/MoE), and more. See references/model-support.md for the full matrix including feature support (CUDA graphs, disaggregated serving, speculative decoding) per architecture.

Cross-References

  • nvidia-nim — NIM wraps TRT-LLM with pre-built profiles; use for zero-config deployment
  • nvidia-dynamo — Dynamo orchestration layer with TRT-LLM as a backend engine
  • triton-inference-server — Triton backend for production serving with request queuing and ensembles
  • vllm — Alternative inference engine; shares K8s deployment patterns
  • sglang — Alternative engine with RadixAttention; NIM can use SGLang backend
  • model-formats — HuggingFace, SafeTensors, engine format details
  • gpu-operator — GPU drivers and device plugin prerequisites

References