tensorrt-llm

@davila7/tensorrt-llm

davila7

17,041

1504 forks

Updated 1/18/2026

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Installation

$ skills install @davila7/tensorrt-llm

Details

Repository: davila7/claude-code-templates

Path: cli-tool/components/skills/ai-research/inference-serving-tensorrt-llm/SKILL.md

Branch: main

Scoped Name: @davila7/tensorrt-llm

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

skills list

Skill Instructions

name: tensorrt-llm
description: Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU]
dependencies: [tensorrt-llm, torch]

TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

When to use TensorRT-LLM

Use TensorRT-LLM when:

Deploying on NVIDIA GPUs (A100, H100, GB200)
Need maximum throughput (24,000+ tokens/sec reported for Llama 3-8B on an H100)
Require low latency for real-time applications
Working with quantized models (FP8, INT4, FP4)
Scaling across multiple GPUs or nodes

Use vLLM instead when:

Need simpler setup and Python-first API
Want PagedAttention without TensorRT compilation
Working with AMD GPUs or other non-NVIDIA hardware

Use llama.cpp instead when:

Deploying on CPU or Apple Silicon
Need edge deployment without NVIDIA GPUs
Want simpler GGUF quantization format

Quick start

Installation

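# Docker (recommended): release images are published on NVIDIA NGC;
# check the NGC catalog for the tag that matches your target version
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12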

Basic inference

from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

# Each result holds one or more completions; print the first one
for output in outputs:
    print(output.outputs[0].text)
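
The `temperature` and `top_p` settings above control how the next token is chosen. As a rough illustration of the idea (not TensorRT-LLM's own implementation, which runs on the GPU), the following pure-Python sketch applies temperature scaling and nucleus (top-p) filtering to a toy distribution. The helper names and the example logits are made up for this illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

# Hypothetical logits for a 4-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits, temperature=0.7)
candidates = top_p_filter(probs, top_p=0.9)
print(sorted(candidates))  # token ids that remain eligible for sampling
```

With these toy values only the two most likely tokens survive the filter; the sampler would then draw from that renormalized pair.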

Serving with trtllm-serve

# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
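
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names are hypothetical, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request

def build_chat_request(prompt, model="meta-llama/Meta-Llama-3-8B",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 100,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_chat_request(request, timeout=60):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: OpenAI-style response layout
    return body["choices"][0]["message"]["content"]

request = build_chat_request("Hello!")
print(request.full_url)
```

Once the server is up, `send_chat_request(request)` should return the assistant's reply as a string.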

Key features

Performance optimizations

In-flight batching: Dynamic batching during generation
Paged KV cache: Efficient memory management
Flash Attention: Optimized attention kernels
Quantization: FP8, INT4, FP4 for 2-4× faster inference
CUDA graphs: Reduced kernel launch overhead
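
To see why in-flight batching helps, here is a toy scheduler simulation in pure Python. It is only a sketch of the scheduling idea, not how TensorRT-LLM implements it: each "step" stands for one decoding iteration, and the request lengths are invented. Static batching holds a batch until its longest request finishes, while in-flight batching refills a slot as soon as a request completes.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps

def in_flight_batching_steps(lengths, batch_size):
    """Finished requests free their slot at once; waiting requests join on the next step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Fill free slots from the waiting queue
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decoding step produces one token for every active request
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

# Hypothetical output lengths (in tokens) for six requests sharing two slots
lengths = [8, 2, 2, 2, 8, 2]
print(static_batching_steps(lengths, batch_size=2))     # 18
print(in_flight_batching_steps(lengths, batch_size=2))  # 14
```

In this made-up workload the short requests no longer wait for the long ones, so the same work finishes in 14 steps instead of 18.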

Parallelism

Tensor parallelism (TP): Split model across GPUs
Pipeline parallelism (PP): Layer-wise distribution
Expert parallelism: For Mixture-of-Experts models
Multi-node: Scale beyond single machine
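
As an illustration of what "split model across GPUs" means for tensor parallelism, the sketch below shards a linear layer's weight matrix by output columns and checks that the concatenated partial results equal the unsharded product. It uses plain Python lists in place of GPU tensors, and the function names are hypothetical; on real hardware the concatenation corresponds to a collective communication step.

```python
def matmul(a, b):
    """Plain matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(weight, num_shards):
    """Give each shard a contiguous block of output columns."""
    cols = len(weight[0])
    assert cols % num_shards == 0, "output dimension must divide evenly"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in weight] for s in range(num_shards)]

def column_parallel_linear(x, weight, num_shards):
    """Each shard multiplies by its own slice; the results are concatenated per row."""
    partials = [matmul(x, shard) for shard in split_columns(weight, num_shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]                   # 2 tokens, hidden size 2
weight = [[1, 0, 2, 1], [0, 1, 1, 3]]  # 2 x 4 weight matrix
sharded = column_parallel_linear(x, weight, num_shards=2)
print(sharded == matmul(x, weight))    # True
```

Each shard stores only its slice of the weights, which is why tensor parallelism lets a model that is too large for one GPU fit across several.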

Advanced features

Speculative decoding: Faster generation with draft models
LoRA serving: Efficient multi-adapter deployment
Disaggregated serving: Separate prefill and generation
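
Speculative decoding can be sketched in a few lines for the greedy case. In this toy version a cheap draft function proposes several tokens, the target function checks them, and the longest agreeing prefix is accepted plus one token from the target. The two "models" are stand-in functions invented for the example, and `target_calls` counts verification rounds (a single batched forward pass on real hardware). This shows the general technique, not TensorRT-LLM's implementation.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, draft_len=4):
    """Greedy speculative decoding; the output matches greedy decoding with the target alone."""
    tokens = list(prompt)
    target_calls = 0
    while len(tokens) - len(prompt) < num_tokens:
        # 1. The draft model cheaply proposes a short continuation
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model verifies every proposed position in one round
        target_calls += 1
        accepted = []
        for guess in proposal:
            expected = target_next(tokens + accepted)
            if guess != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(guess)
        else:
            accepted.append(target_next(tokens + accepted))  # bonus token if all guesses match
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + num_tokens], target_calls

def target_next(tokens):
    """Toy target model: count upward modulo 10."""
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    """Toy draft model: agrees with the target except right after a 6."""
    return 0 if tokens[-1] == 6 else (tokens[-1] + 1) % 10

output, calls = speculative_decode(target_next, draft_next, prompt=[0], num_tokens=12)
print(output, calls)
```

Here 12 tokens are produced in 3 verification rounds rather than 12 sequential target steps, and the output is identical to what the target would generate on its own.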

Common patterns

Quantized model (FP8)

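from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# Load the model with FP8 quantization (about 2× faster, 50% of the FP16 memory);
# FP8 needs an FP8-capable GPU such as H100
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    max_num_tokens=8192
)

# Inference works the same as before
outputs = llm.generate(["Summarize this article..."])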

Multi-GPU deployment

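from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# Tensor parallelism across 8 GPUs, with FP8 quantization
llm = LLM(
    model="meta-llama/Llama-3.1-405B",
    tensor_parallel_size=8,
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8)
)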

Batch inference

from tensorrt_llm import SamplingParams

# Process 100 prompts in one call, reusing an `llm` object created as shown earlier;
# in-flight batching schedules them automatically for maximum throughput
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

Performance benchmarks

Llama 3-8B (single H100 GPU):

Throughput: 24,000 tokens/sec
Latency: ~10 ms per token
vs. PyTorch: up to 100× faster
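
The throughput and latency figures describe different things: latency is per sequence, throughput is summed over all sequences in the batch. The arithmetic below relates the two, under the assumption (not stated in the source) that both numbers come from the same run. The function names are illustrative.

```python
def per_stream_rate(latency_ms_per_token):
    """Tokens per second produced by a single sequence."""
    return 1000.0 / latency_ms_per_token

def implied_concurrency(aggregate_tokens_per_sec, latency_ms_per_token):
    """Number of sequences that must decode in parallel to reach the aggregate rate."""
    return aggregate_tokens_per_sec / per_stream_rate(latency_ms_per_token)

print(per_stream_rate(10))              # 100.0 tokens/sec per sequence
print(implied_concurrency(24_000, 10))  # 240.0 concurrent sequences
```

In other words, a single request at ~10 ms per token sees about 100 tokens/sec; the 24,000 tokens/sec figure is only reachable with a few hundred requests in flight at once.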

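Llama 3-70B (8× 80 GB GPUs; FP8 requires Hopper-class or newer hardware such as H100, since A100 has no FP8 Tensor Cores):

FP8 quantization: 2× faster than FP16
Memory: 50% reduction in weight memory with FP8 vs. FP16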
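
The 50% figure follows directly from bytes per parameter. The sketch below estimates weight memory only, using a nominal 70 billion parameters and ignoring KV cache, activations, and quantization scale factors, so real footprints will be somewhat higher. The table and function name are illustrative.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

params_70b = 70e9  # nominal parameter count
fp16_gb = weight_memory_gb(params_70b, "fp16")
fp8_gb = weight_memory_gb(params_70b, "fp8")
print(fp16_gb, fp8_gb, 1 - fp8_gb / fp16_gb)
```

Under these assumptions the weights take about 140 GB in FP16 and about 70 GB in FP8, which is the 50% reduction quoted above.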

Supported models

LLaMA family: Llama 2, Llama 3, CodeLlama
GPT family: GPT-2, GPT-J, GPT-NeoX
Qwen: Qwen, Qwen2, QwQ
DeepSeek: DeepSeek-V2, DeepSeek-V3
Mixtral: Mixtral-8x7B, Mixtral-8x22B
Vision: LLaVA, Phi-3-vision
100+ models on HuggingFace

References

Optimization Guide - Quantization, batching, KV cache tuning
Multi-GPU Setup - Tensor/pipeline parallelism, multi-node
Serving Guide - Production deployment, monitoring, autoscaling

Resources

Docs: https://nvidia.github.io/TensorRT-LLM/
GitHub: https://github.com/NVIDIA/TensorRT-LLM
Models: https://huggingface.co/models?library=tensorrt_llm
