6 min read·Updated March 27, 2026

TensorRT-LLM

By NVIDIA

TensorRT-LLM is NVIDIA's open-source library for optimizing large language model inference. It targets high throughput on NVIDIA GPUs through quantization, in-flight batching, KV cache optimization, and multi-GPU parallelism.

Learning Objectives

  • Understand what TensorRT-LLM is and how it optimizes LLM inference performance
  • Identify the key optimization techniques (quantization, batching, KV cache) and why they matter
  • Evaluate when TensorRT-LLM is the right choice versus alternatives like vLLM or Ollama

What Is TensorRT-LLM?

TensorRT-LLM is NVIDIA's open-source library for optimizing large language model inference on NVIDIA GPUs. It is the engine behind much of NVIDIA NIM's performance: the layer that takes a standard model (Llama, Mistral, Nemotron, GPT) and compiles it into an inference engine tuned for high throughput and low latency.

TensorRT-LLM builds on TensorRT, NVIDIA's general-purpose deep learning inference optimizer, and adds LLM-specific optimizations for attention, autoregressive decoding, long-context memory management, and multi-GPU model parallelism.

The practical impact: TensorRT-LLM can deliver roughly 2-4x higher throughput than running the same model with standard PyTorch inference. The actual gain depends on the model, precision, batch size, and hardware configuration.

Tip

Get TensorRT-LLM: github.com/NVIDIA/TensorRT-LLM (open source, Apache 2.0 license). The repository includes pre-built examples for popular models.

How TensorRT-LLM Optimizes Inference

Quantization — Smaller, Faster Numbers

LLMs are typically trained in FP16 or BF16 precision (16-bit floating point). TensorRT-LLM can convert models to lower precision formats:

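  • FP8: 8-bit floating point (supported on Hopper and Blackwell GPUs). Roughly 2x throughput improvement over 16-bit inference with minimal quality loss.
  • INT8: 8-bit integer quantization. Similar memory savings to FP8 and available on older GPUs such as Ampere, with slightly more quality degradation unless the model is carefully calibrated.
  • INT4: 4-bit integer. Maximum compression for memory-constrained deployments. Usually applied as weight-only quantization (GPTQ, AWQ), where weights are stored in 4 bits and activations stay in 16-bit precision.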

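The key insight: most of an LLM's computation is matrix multiplication, and token generation is often limited by how fast weights can be read from GPU memory. Lower precision cuts the bytes moved per weight and lets the GPU use its faster low-precision math units, so throughput rises substantially, while modern quantization techniques preserve nearly all of the model's quality.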
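To make the idea of "smaller numbers" concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration of the general technique only, not TensorRT-LLM's implementation; the function names and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floating-point values from the integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error of any value is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs one byte instead of two, and the worst-case error is bounded by half of `scale`. Production schemes typically use per-channel or per-group scales rather than one scale for the whole tensor, which keeps the error smaller.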

In-Flight Batching — Maximum GPU Utilization

Standard batching waits for a fixed batch of requests and processes them together until the slowest one finishes. In-flight batching (also called continuous batching) schedules work at each decoding step instead:

  • New requests are inserted into the batch as soon as GPU capacity is available
  • Completed requests are removed immediately, so they don't wait for the entire batch to finish
  • The GPU stays busy even when requests have different input lengths and generation lengths

This technique alone can improve throughput by roughly 2-10x compared to naive static batching. The gain is largest when generation lengths vary widely between requests.
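The difference between the two scheduling policies can be sketched with a toy simulation that counts decoding steps. This is a simplified model under stated assumptions (one token per request per step, a fixed number of batch slots, no prefill cost), and the function names are hypothetical rather than part of any TensorRT-LLM API.

```python
from collections import deque


def static_batching_steps(gen_lengths, batch_size):
    """Each fixed batch runs until its longest request has finished."""
    steps = 0
    for start in range(0, len(gen_lengths), batch_size):
        steps += max(gen_lengths[start:start + batch_size])
    return steps


def inflight_batching_steps(gen_lengths, batch_size):
    """Free slots are refilled from the queue before every decoding step."""
    queue = deque(gen_lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


# Hypothetical workload: a mix of short and long generations (all lengths >= 1).
lengths = [2, 8, 2, 8, 2, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
inflight_steps = inflight_batching_steps(lengths, batch_size=2)
print(static_steps, inflight_steps)
```

With this workload the static scheduler needs 20 steps and the in-flight scheduler needs 14, because short requests no longer hold a slot idle while a long neighbor finishes.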

Paged KV Cache — Memory Efficiency

During text generation, LLMs store the attention key and value tensors for every previous token in a key-value (KV) cache: one entry per token, per layer. For long-context models (128K+ tokens), this cache can consume tens of gigabytes of GPU memory.
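A back-of-the-envelope calculation shows where that memory goes. The model dimensions below (32 layers, 8 KV heads, head dimension 128, FP16 cache) are assumptions picked for illustration and do not describe any specific model from this lesson.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence.

    The leading factor 2 accounts for storing one key vector and one
    value vector per token, per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration with a 128K-token context and a 16-bit cache.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=128 * 1024)
size_gib = size / 2**30
print(size_gib)  # 16.0
```

Under these assumptions a single 128K-token sequence needs 16 GiB of cache, and serving several such sequences at once multiplies that figure. Storing the cache in 8 bits (`bytes_per_value=1`) would halve it.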

TensorRT-LLM implements paged attention (inspired by operating system virtual memory):

  • KV cache is stored in non-contiguous memory pages rather than a single large block
  • Pages are allocated on demand and freed when generation completes
  • Memory fragmentation is greatly reduced, so more requests can fit in GPU memory simultaneously
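The paging scheme in the list above can be sketched as a small block allocator. This is a conceptual illustration only: the class name, the page size of 16 tokens, and the request IDs are all hypothetical, and a real implementation manages GPU memory rather than Python lists.

```python
class PagedKVCache:
    """Toy allocator that hands out fixed-size pages to requests on demand."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # request id -> list of physical page ids
        self.lengths = {}      # request id -> number of tokens stored

    def append_token(self, request_id):
        table = self.page_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # Allocate a new page only when the pages already held are full.
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV cache pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return all pages of a finished request to the free pool."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_pages=4, page_size=16)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 pages
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 page

pages_used = {rid: len(table) for rid, table in cache.page_tables.items()}
free_before = len(cache.free_pages)
cache.release("a")
free_after = len(cache.free_pages)
print(pages_used, free_before, free_after)
```

Because a request only ever wastes the unused tail of its last page, the allocator does not need to reserve a contiguous block sized for the maximum possible sequence length.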

Multi-GPU Parallelism

For models too large to fit on a single GPU:

  • Tensor parallelism: splits individual layers across multiple GPUs (low latency, but requires a high-bandwidth interconnect such as NVLink)
  • Pipeline parallelism: assigns different layers to different GPUs (works across slower interconnects)
  • Partitioning is largely automatic: developers specify the tensor-parallel and pipeline-parallel sizes, and TensorRT-LLM shards the model accordingly
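The two partitioning strategies can be illustrated with plain Python lists standing in for GPU tensors. This is a sketch of the general idea, not of how TensorRT-LLM shards weights internally, and all function names are hypothetical.

```python
def matmul(x, w):
    """Multiply x ([rows][k]) by w ([k][cols])."""
    return [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
            for row in x]


def split_columns(w, num_shards):
    """Tensor parallelism: give each GPU a contiguous slice of the weight columns."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def tensor_parallel_matmul(x, w, num_shards):
    # Each shard's product would run on its own GPU in a real system.
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Concatenating the partial outputs plays the role of an all-gather.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


def pipeline_stages(num_layers, num_stages):
    """Pipeline parallelism: assign consecutive layers to each GPU."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, num_shards=2)
stages = pipeline_stages(num_layers=6, num_stages=2)
print(full == sharded, stages)
```

The sharded product matches the single-device result, which is why tensor parallelism needs a fast interconnect: every layer ends with a collective exchange. Pipeline stages only pass activations at stage boundaries, so they tolerate slower links.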

Speculative Decoding

Speculative decoding uses a smaller "draft" model to predict several tokens ahead. The large model then verifies all of the predictions in a single forward pass. When the draft model is right (which is often), this produces multiple tokens for the cost of one large-model step, significantly reducing per-token latency for interactive applications.
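The draft-then-verify loop can be sketched with two toy "models" that are just deterministic functions. This assumes greedy decoding, and both stand-in functions are invented for the example; real systems compare probability distributions and run the verification as one batched forward pass.

```python
def target_next(context):
    """Stand-in for the large model: the next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def draft_next(context):
    """Stand-in for the draft model: agrees with the target except after a 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


def speculative_step(context, num_draft):
    # 1. The draft model proposes num_draft tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(num_draft):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)
    # 2. The target checks each position (conceptually one forward pass).
    accepted, ctx = [], list(context)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # all drafts accepted: one bonus token
    return accepted


def generate(prompt, num_tokens, num_draft=3):
    out, target_passes = list(prompt), 0
    while len(out) - len(prompt) < num_tokens:
        out.extend(speculative_step(out, num_draft))
        target_passes += 1
    return out[:len(prompt) + num_tokens], target_passes


tokens, passes = generate([0], num_tokens=8)
print(tokens, passes)
```

In this toy run, 8 tokens are produced with 3 target passes instead of 8, and the output is identical to what the target alone would generate greedily. The speedup in practice depends on how often the draft model's guesses are accepted.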

TensorRT-LLM vs. Alternatives

| Tool | Best For | GPU Support | Open Source |
|---|---|---|---|
| TensorRT-LLM | Maximum throughput on NVIDIA GPUs | NVIDIA only | Yes (Apache 2.0) |
| vLLM | Flexible serving with good performance | NVIDIA and AMD | Yes (Apache 2.0) |
| Ollama | Local model running (consumer hardware) | NVIDIA, AMD, Apple | Yes (MIT) |
| llama.cpp | CPU and edge inference; maximum portability | Any (CPU, GPU, Metal) | Yes (MIT) |
| NVIDIA NIM | Production deployment without optimization work | NVIDIA only | Containers (free dev; paid prod) |

When to use TensorRT-LLM directly: You want maximum performance on NVIDIA GPUs and are comfortable building a custom serving stack. Typical users: ML engineering teams at companies running high-throughput inference.

When to use NIM instead: You want TensorRT-LLM's performance without the engineering effort. NIM packages TensorRT-LLM into ready-to-deploy containers.

When to use vLLM: You need GPU flexibility (NVIDIA + AMD) or prefer vLLM's serving features (OpenAI-compatible API server built in).

Access

| Detail | Info |
|---|---|
| Price | Free (open source) |
| License | Apache 2.0 |
| Source Code | github.com/NVIDIA/TensorRT-LLM |
| GPU Required | NVIDIA (Ampere A100+ recommended; Hopper/Blackwell for FP8) |
| Supported Models | Llama, Mistral, GPT, Falcon, Bloom, ChatGLM, Qwen, Nemotron, and more |
| Integration | Powers NVIDIA NIM containers; can be used standalone |

Strengths

  • Leading NVIDIA inference performance: typically at or near the top of throughput benchmarks on NVIDIA GPUs
  • Open source (Apache 2.0) — full source code, no licensing restrictions
  • Comprehensive optimization stack — quantization, batching, KV cache, parallelism, speculative decoding all included
  • Broad model support — pre-built configurations for most popular open-weight models
  • Powers NIM — the same engine behind NVIDIA's production deployment platform
  • Active development — frequent updates aligned with new GPU architectures and model architectures

Limitations & Considerations

  • NVIDIA GPUs only — no AMD, no CPU, no Apple Silicon support
  • Build complexity — compiling and configuring TensorRT-LLM engines requires more expertise than using higher-level tools (Ollama, vLLM)
  • Model compilation step — models must be compiled into TensorRT-LLM engines (a one-time process that can take minutes to hours depending on model size)
  • Less flexible than vLLM — vLLM supports more hardware and has a simpler API for serving; TensorRT-LLM trades flexibility for raw performance
  • Rapid API changes — the library is evolving quickly, which can break existing configurations between versions

Key Takeaways

  • TensorRT-LLM is NVIDIA's open-source LLM inference optimization library. It can deliver roughly 2-4x higher throughput than standard PyTorch inference through quantization (FP8/INT8/INT4), in-flight batching, paged KV cache, multi-GPU parallelism, and speculative decoding
  • It powers NVIDIA NIM containers under the hood — teams can use TensorRT-LLM directly for maximum control or NIM for deployment simplicity
  • Best suited for high-throughput production inference on NVIDIA GPUs; for local experimentation use Ollama, for GPU flexibility use vLLM
  • Open source (Apache 2.0) and actively developed, with support for most popular open-weight models
