TensorRT-LLM and TGI
TensorRT-LLM and TGI (Text Generation Inference) are two widely used LLM inference engines. TensorRT-LLM, developed by NVIDIA, pursues maximum inference performance; TGI, developed by Hugging Face, emphasizes ease of use and ecosystem integration.
For the vLLM inference engine, see vLLM.
TensorRT-LLM
Overview
TensorRT-LLM is an LLM-specific inference framework built by NVIDIA on top of the TensorRT deep learning inference optimizer. It compiles LLMs into highly optimized TensorRT engines that fully exploit the hardware capabilities of NVIDIA GPUs.
Core Optimization Techniques
Graph Optimization and Kernel Fusion
TensorRT-LLM performs deep optimizations on the computation graph during the compilation stage:
- Layer Fusion: Merges multiple consecutive operations (e.g., LayerNorm + Linear + Activation) into a single CUDA kernel, reducing memory access overhead and kernel launch costs
- Constant Folding: Pre-computes static expressions at compile time
- Mixed Precision: Automatically determines which layers use FP16 and which use FP8/INT8, balancing accuracy and speed
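The payoff of kernel fusion is fewer round trips through GPU memory. A rough sketch of the idea in plain Python (illustrative only; real fusion happens inside compiled CUDA kernels, and the memory-op counts in the comments are the point, not the code):

```python
# Illustrative sketch: why fusing elementwise ops cuts memory traffic.
# Each "kernel" in the unfused version reads and writes the whole tensor;
# the fused version touches memory once.

def unfused(xs, scale, bias):
    t1 = [x * scale for x in xs]          # kernel 1: 1 read + 1 write per element
    t2 = [x + bias for x in t1]           # kernel 2: 1 read + 1 write per element
    return [max(x, 0.0) for x in t2]      # kernel 3: 1 read + 1 write per element
                                          # total: 6 memory ops per element

def fused(xs, scale, bias):
    # one kernel: all three ops happen while the value sits in a register
    return [max(x * scale + bias, 0.0) for x in xs]   # 2 memory ops per element

xs = [-1.0, 0.5, 2.0]
assert unfused(xs, 2.0, 1.0) == fused(xs, 2.0, 1.0)
```

Constant folding is the compile-time analogue: any subexpression whose inputs are static (e.g. `scale * bias`) is evaluated once during engine build instead of on every forward pass.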
Custom CUDA Kernels
NVIDIA has written highly optimized CUDA kernels for critical Transformer operations:
- Customized Flash Attention implementations (tailored to different GPU architectures)
- Fused MHA/GQA/MQA kernels
- Optimized GEMM (matrix multiplication) kernels with FP8 Tensor Core support
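The fused MHA/GQA/MQA kernels differ mainly in how query heads are mapped onto key/value heads: GQA and MQA shrink the KV cache by letting several query heads share one KV head. The mapping itself is simple arithmetic (an illustrative sketch, not NVIDIA's kernel code):

```python
# Head-to-KV-head mapping behind MHA / GQA / MQA kernels (illustrative):
# query head q reads KV head q // (n_q_heads // n_kv_heads).

def kv_head_for(q_head, n_q_heads, n_kv_heads):
    group_size = n_q_heads // n_kv_heads   # query heads per KV head
    return q_head // group_size

n_q = 8
mha = [kv_head_for(q, n_q, n_kv_heads=8) for q in range(n_q)]  # [0, 1, ..., 7]
gqa = [kv_head_for(q, n_q, n_kv_heads=2) for q in range(n_q)]  # [0, 0, 0, 0, 1, 1, 1, 1]
mqa = [kv_head_for(q, n_q, n_kv_heads=1) for q in range(n_q)]  # [0, 0, 0, 0, 0, 0, 0, 0]
```

With 8 query heads, MQA stores one eighth of the KV cache that MHA needs at the same sequence length, which is why these variants matter for long-context serving.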
Tensor Parallelism and Pipeline Parallelism
Native support for multiple parallelism strategies:
- Tensor Parallelism (TP): Partitions the model's weight matrices across multiple GPUs
- Pipeline Parallelism (PP): Distributes different layers of the model across different GPUs
- Supports hybrid TP + PP parallelism
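Column-wise tensor parallelism can be sketched in a few lines: each "GPU" holds a vertical slice of the weight matrix, computes a partial result, and the slices are gathered at the end. This is a toy single-process stand-in for the real multi-GPU implementation:

```python
# Toy tensor parallelism: split a weight matrix column-wise across "devices",
# run partial matmuls, then concatenate the partial outputs.

def matmul(x, w):
    # x: length-k vector, w: k x n matrix (list of rows) -> length-n vector
    k, n = len(w), len(w[0])
    return [sum(x[i] * w[i][j] for i in range(k)) for j in range(n)]

def column_split(w, parts):
    # slice every row into `parts` contiguous column blocks, one per device
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = column_split(w, parts=2)           # one shard per "GPU"
partials = [matmul(x, s) for s in shards]   # each device computes its slice
y = [v for p in partials for v in p]        # the gather step, here a concat

assert y == matmul(x, w)                    # same result as the full matmul
```

In a real deployment the final concatenation is an all-gather collective across GPUs, which is why fast interconnects such as NVLink matter for tensor parallelism.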
In-flight Batching
Similar to vLLM's Continuous Batching, TensorRT-LLM implements In-flight Batching:
- Dynamically adds/removes requests at each iteration
- Supports different sampling parameters for different requests
- Tightly integrated with KV Cache management
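The scheduling idea behind the points above can be sketched as a simple loop (illustrative only, not TensorRT-LLM's actual scheduler): at every iteration, finished sequences leave the batch and queued requests take their place immediately, so short requests never wait for long ones to finish:

```python
# Minimal in-flight batching sketch: per-iteration admission and eviction.
from collections import deque

def run(requests, max_batch):
    # requests: list of (request_id, tokens_to_generate)
    queue = deque(requests)
    active = {}          # request_id -> tokens still to generate
    timeline = []        # batch composition at each decode iteration
    while queue or active:
        while queue and len(active) < max_batch:   # admit waiting requests
            rid, n = queue.popleft()
            active[rid] = n
        timeline.append(sorted(active))
        for rid in list(active):                   # one decode step per request
            active[rid] -= 1
            if active[rid] == 0:                   # evict finished requests
                del active[rid]
    return timeline

# "b" finishes after one step, so "c" joins the very next iteration
# instead of waiting for "a" to complete.
print(run([("a", 3), ("b", 1), ("c", 2)], max_batch=2))
```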
Usage Workflow
TensorRT-LLM usage consists of two stages: compilation and deployment.
# 1. Model conversion: Convert a HuggingFace model to a TensorRT-LLM checkpoint
python convert_checkpoint.py \
    --model_dir meta-llama/Llama-2-7b-hf \
    --output_dir ./checkpoint \
    --dtype float16
# 2. Engine compilation: Compile the checkpoint into a TensorRT engine
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --output_dir ./engine \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_seq_len 4096
# 3. Deployment: Serve via Triton Server
python launch_triton_server.py --model_repo ./triton_repo
Note: The maximum batch size and sequence length must be specified at compile time and cannot be changed afterward. Separate compilation is required for different hardware.
Use Cases
- Peak performance in production: Online services with strict latency requirements
- NVIDIA GPU clusters: Full utilization of hardware features such as Tensor Cores and NVLink
- Large-scale deployment: Elastic scaling through Triton Inference Server
TGI (Text Generation Inference)
Overview
TGI is an LLM inference and serving tool developed by Hugging Face, written in Rust and Python. It provides an out-of-the-box deployment experience via Docker containers and is deeply integrated with the Hugging Face ecosystem.
Core Features
Inference Optimizations:
- Continuous Batching: Dynamic request batching
- Flash Attention 2: Efficient attention computation
- Tensor Parallelism: Multi-GPU inference
- Quantization support: GPTQ, AWQ, BitsAndBytes, EETQ, FP8
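To illustrate what weight quantization buys, here is a heavily simplified sketch of symmetric INT8 quantization — the core idea underneath the formats listed above, minus all calibration, grouping, and packing details:

```python
# Symmetric INT8 quantization sketch: store 8-bit integers plus one FP scale,
# and dequantize on the fly. Memory per weight drops from 16 bits to ~8.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0   # map largest weight to 127
    q = [round(w / scale) for w in weights]        # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.42, -1.27, 0.0, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# Reconstruction error is bounded by half a quantization step.
assert all(abs(a - b) <= s / 2 for a, b in zip(w, w_hat))
```

Real schemes (GPTQ, AWQ) choose scales per group of weights and adjust the remaining weights to compensate for rounding error, but the store-low-precision/rescale-at-use pattern is the same.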
Serving Features:
- Streaming output (Server-Sent Events)
- Token-level streaming
- Concurrent request management and queuing
- Structured output (JSON mode / Grammar)
- OpenAI-compatible API endpoints
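Structured output works by constraining decoding: at each step, tokens the grammar forbids are masked out before sampling. A minimal sketch of that idea (not TGI's implementation, which compiles grammars into efficient token-level masks):

```python
# Grammar-constrained decoding sketch: mask disallowed tokens, renormalize.

def constrained_step(probs, allowed):
    # probs: token -> probability; allowed: set of grammar-legal next tokens
    masked = {t: p for t, p in probs.items() if t in allowed}
    total = sum(masked.values())
    return {t: p / total for t, p in masked.items()}

probs = {"{": 0.2, "hello": 0.7, "[": 0.1}
allowed = {"{", "["}                 # a JSON document must start with { or [
out = constrained_step(probs, allowed)

# "hello" had the highest raw probability, but the grammar rules it out,
# so the constrained distribution picks "{" instead.
best = max(out, key=out.get)
assert best == "{"
```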
Operations-Friendly:
- One-click Docker deployment
- Built-in Prometheus monitoring metrics
- Distributed tracing support (OpenTelemetry)
- Direct model pulling from HuggingFace Hub
Usage
Docker Deployment (Recommended):
# Simplest deployment
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --max-input-tokens 2048 \
    --max-total-tokens 4096
# Multi-GPU tensor parallelism
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-70b-chat-hf \
    --num-shard 4 \
    --quantize gptq
Client Usage:
# Using the huggingface_hub client
from huggingface_hub import InferenceClient
client = InferenceClient("http://localhost:8080")
# Streaming generation
for token in client.text_generation(
    "What is deep learning?",
    max_new_tokens=100,
    stream=True
):
    print(token, end="")
# Or using the OpenAI-compatible API
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")
response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Hello!"}]
)
Use Cases
- Rapid prototyping and validation: One-click Docker startup, no compilation needed
- HuggingFace ecosystem integration: Direct use of models from the Hub
- Medium-scale services: Meets the performance needs of most online services
Comparison of the Three Major Inference Engines
| Dimension | TensorRT-LLM | TGI | vLLM |
|---|---|---|---|
| Developer | NVIDIA | Hugging Face | UC Berkeley |
| Language | C++ / Python | Rust / Python | Python |
| Core Strength | Peak performance (kernel-level optimization) | Ease of use & ecosystem | High throughput (PagedAttention) |
| Deployment Complexity | High (engine compilation required) | Low (Docker) | Low (pip install) |
| Quantization Support | FP8, INT8, INT4 (native) | GPTQ, AWQ, BnB, FP8 | GPTQ, AWQ, FP8 |
| Parallelism Strategy | TP + PP | TP | TP + PP |
| Latency | Lowest | Moderate | Moderate |
| Throughput | Highest | High | Very high |
| Hardware Requirements | NVIDIA GPU (specific architectures) | NVIDIA GPU | NVIDIA / AMD GPU |
| Model Support | Requires per-model adaptation | Broad HF Hub support | Broad HF Hub support |
| Learning Curve | Steep | Gentle | Gentle |
Selection Guide
Need extreme latency/throughput?
├── Yes → Have engineering resources for compilation optimization?
│ ├── Yes → TensorRT-LLM
│ └── No → vLLM
└── No → Need deep HuggingFace integration?
├── Yes → TGI
└── No → vLLM (general recommendation)