
TensorRT-LLM and TGI

TensorRT-LLM and TGI (Text Generation Inference) are two major LLM inference engines. TensorRT-LLM, developed by NVIDIA, pursues maximum inference performance, while TGI, developed by Hugging Face, emphasizes ease of use and ecosystem integration.

For the vLLM inference engine, see vLLM.


TensorRT-LLM

Overview

TensorRT-LLM is an LLM-specific inference framework built by NVIDIA on top of the TensorRT deep learning inference optimizer. It compiles LLMs into highly optimized TensorRT engines that fully exploit the hardware capabilities of NVIDIA GPUs.

Core Optimization Techniques

Graph Optimization and Kernel Fusion

TensorRT-LLM performs deep optimizations on the computation graph during the compilation stage:

  • Layer Fusion: Merges multiple consecutive operations (e.g., LayerNorm + Linear + Activation) into a single CUDA kernel, reducing memory access overhead and kernel launch costs (see the conceptual sketch after this list)
  • Constant Folding: Pre-computes static expressions at compile time
  • Mixed Precision: Automatically determines which layers use FP16 and which use FP8/INT8, balancing accuracy and speed
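
To make the idea concrete, here is a toy sketch in plain PyTorch rather than TensorRT-LLM internals; torch.compile stands in only as an analogy for a graph compiler that captures the three logical operations and emits fewer, fused kernels.

# Conceptual analogy only: LayerNorm -> Linear -> GELU are three logical ops;
# a graph compiler can capture the sequence and emit fewer, fused kernels.
import torch

class Block(torch.nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d)
        self.proj = torch.nn.Linear(d, d)

    def forward(self, x):
        return torch.nn.functional.gelu(self.proj(self.norm(x)))

block = Block().eval()
fused = torch.compile(block)    # graph capture + fusion, analogous in spirit
x = torch.randn(8, 1024)
with torch.no_grad():
    # same math, fewer kernel launches after compilation
    print(torch.allclose(block(x), fused(x), atol=1e-5))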

Custom CUDA Kernels

NVIDIA has written highly optimized CUDA kernels for critical Transformer operations:

  • Customized Flash Attention implementations tailored to different GPU architectures (a conceptual comparison with naive attention follows this list)
  • Fused MHA/GQA/MQA kernels
  • Optimized GEMM (matrix multiplication) kernels with FP8 Tensor Core support
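
For intuition only, the sketch below contrasts a naive attention computation with PyTorch's fused scaled_dot_product_attention; it is an analogy for what a fused attention kernel computes, not NVIDIA's kernels.

# Conceptual analogy, not NVIDIA's kernels: a fused attention kernel produces the
# same result as the naive formulation without materializing the full score matrix.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Naive attention: explicit (seq_len x seq_len) score matrix per head
scores = (q @ k.transpose(-2, -1)) / (64 ** 0.5)
naive = torch.softmax(scores, dim=-1) @ v

# Fused path (dispatches to Flash / memory-efficient kernels where available)
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-5))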

Tensor Parallelism and Pipeline Parallelism

Native support for multiple parallelism strategies:

  • Tensor Parallelism (TP): Partitions the model's weight matrices across multiple GPUs (a toy illustration follows this list)
  • Pipeline Parallelism (PP): Distributes different layers of the model across different GPUs
  • Supports hybrid TP + PP parallelism
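
A toy, single-process illustration of the tensor-parallel idea; this is not TensorRT-LLM's implementation, and in a real deployment each weight shard lives on its own GPU and the final concatenation is a collective (all-gather).

# Column-parallel Linear layer in miniature: each "rank" holds half the columns of W,
# computes a partial output, and the shards are gathered to rebuild the full result.
import torch

x = torch.randn(4, 8)            # activations, replicated on every rank
W = torch.randn(8, 16)           # full Linear weight
W0, W1 = W.chunk(2, dim=1)       # column split across two ranks

y0 = x @ W0                      # computed on "GPU 0"
y1 = x @ W1                      # computed on "GPU 1"
y = torch.cat([y0, y1], dim=1)   # all-gather in a real multi-GPU setup

print(torch.allclose(y, x @ W))  # True: identical output, half the weights per rank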

In-flight Batching

Similar to vLLM's Continuous Batching, TensorRT-LLM implements In-flight Batching:

  • Dynamically adds/removes requests at each iteration (sketched after this list)
  • Supports different sampling parameters for different requests
  • Tightly integrated with KV Cache management
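
The toy loop below sketches the scheduling idea under the simplifying assumption that each request needs a known number of decode steps; the actual TensorRT-LLM batch manager is considerably more involved.

# Toy in-flight batching loop: the batch is rebuilt every iteration, so finished
# requests leave and queued requests join without waiting for a full-batch barrier.
from collections import deque

queue = deque({"id": i, "remaining": n} for i, n in enumerate([3, 5, 2, 4]))
active, max_batch = [], 2

step = 0
while queue or active:
    while queue and len(active) < max_batch:   # admit requests as slots free up
        active.append(queue.popleft())
    for req in active:                         # one decode step for every active request
        req["remaining"] -= 1
    done = [r["id"] for r in active if r["remaining"] == 0]
    active = [r for r in active if r["remaining"] > 0]
    step += 1
    print(f"iteration {step}: finished {done}")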

Usage Workflow

TensorRT-LLM usage consists of two stages: ahead-of-time compilation (checkpoint conversion plus engine build) and deployment.

# 1. Model conversion: Convert a HuggingFace model to a TensorRT-LLM checkpoint
python convert_checkpoint.py \
    --model_dir meta-llama/Llama-2-7b-hf \
    --output_dir ./checkpoint \
    --dtype float16

# 2. Engine compilation: Compile the checkpoint into a TensorRT engine
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --output_dir ./engine \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_seq_len 4096

# 3. Deployment: Serve via Triton Server
python launch_triton_server.py --model_repo ./triton_repo

Note: The maximum batch size, input length, and sequence length are fixed at compile time and cannot be exceeded at runtime. A separate engine must be built for each GPU architecture.
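
Once the Triton server from step 3 is up, it can be queried over HTTP. The sketch below follows the tensorrtllm_backend examples and assumes the default ensemble model name, port 8000, and the text_input/text_output fields; all of these are deployment-specific.

# Minimal HTTP client for the Triton TensorRT-LLM backend's generate endpoint
# (model name, port, and field names are deployment-specific assumptions).
import requests

payload = {
    "text_input": "What is deep learning?",
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": "",
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text_output"])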

Use Cases

  • Peak performance in production: Online services with strict latency requirements
  • NVIDIA GPU clusters: Full utilization of hardware features such as Tensor Cores and NVLink
  • Large-scale deployment: Elastic scaling through Triton Inference Server

TGI (Text Generation Inference)

Overview

TGI is an LLM inference and serving tool developed by Hugging Face, written in Rust and Python. It provides an out-of-the-box deployment experience via Docker containers and is deeply integrated with the Hugging Face ecosystem.

Core Features

Inference Optimizations:

  • Continuous Batching: Dynamic request batching
  • Flash Attention 2: Efficient attention computation
  • Tensor Parallelism: Multi-GPU inference
  • Quantization support: GPTQ, AWQ, BitsAndBytes, EETQ, FP8

Serving Features:

  • Streaming output (Server-Sent Events); see the streaming sketch after this list
  • Token-level streaming
  • Concurrent request management and queuing
  • Structured output (JSON mode / Grammar)
  • OpenAI-compatible API endpoints
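
As one illustration of token-level streaming over the OpenAI-compatible endpoint (assuming a TGI instance on localhost:8080, as in the Docker examples below):

# Streaming chat completions from TGI's OpenAI-compatible endpoint; tokens arrive
# incrementally as Server-Sent Events.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)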

Operations-Friendly:

  • One-click Docker deployment
  • Built-in Prometheus monitoring metrics (a quick check is sketched after this list)
  • Distributed tracing support (OpenTelemetry)
  • Direct model pulling from HuggingFace Hub
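
A quick operational check against a running instance; /health and /metrics are part of TGI's HTTP API, while the tgi_ metric-name prefix used to filter below is an assumption that may vary across versions.

# Probe readiness and scrape Prometheus metrics from a local TGI container.
import requests

print(requests.get("http://localhost:8080/health").status_code)  # 200 once the model is ready

metrics = requests.get("http://localhost:8080/metrics").text     # Prometheus text format
tgi_lines = [line for line in metrics.splitlines() if line.startswith("tgi_")]
print("\n".join(tgi_lines[:10]))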

Usage

Docker Deployment (Recommended):

# Simplest deployment (gated models such as Llama-2 also require -e HF_TOKEN=<your token>;
# mounting a volume at /data, e.g. -v $PWD/data:/data, caches downloaded weights)
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --max-input-tokens 2048 \
    --max-total-tokens 4096

# Multi-GPU tensor parallelism (note: --quantize gptq expects a GPTQ-quantized checkpoint)
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-70b-chat-hf \
    --num-shard 4 \
    --quantize gptq

Client Usage:

# Using the huggingface_hub client
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Streaming generation
for token in client.text_generation(
    "What is deep learning?",
    max_new_tokens=100,
    stream=True
):
    print(token, end="")

# Or using the OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")
response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Use Cases

  • Rapid prototyping and validation: One-click Docker startup, no compilation needed
  • HuggingFace ecosystem integration: Direct use of models from the Hub
  • Medium-scale services: Meets the performance needs of most online services

Comparison of the Three Major Inference Engines

| Dimension             | TensorRT-LLM                                 | TGI                     | vLLM                             |
|-----------------------|----------------------------------------------|-------------------------|----------------------------------|
| Developer             | NVIDIA                                       | Hugging Face            | UC Berkeley                      |
| Language              | C++ / Python                                 | Rust / Python           | Python                           |
| Core Strength         | Peak performance (kernel-level optimization) | Ease of use & ecosystem | High throughput (PagedAttention) |
| Deployment Complexity | High (engine compilation required)           | Low (Docker)            | Low (pip install)                |
| Quantization Support  | FP8, INT8, INT4 (native)                     | GPTQ, AWQ, BnB, FP8     | GPTQ, AWQ, FP8                   |
| Parallelism Strategy  | TP + PP                                      | TP                      | TP + PP                          |
| Latency               | Lowest                                       | Moderate                | Moderate                         |
| Throughput            | Highest                                      | High                    | Very high                        |
| Hardware Requirements | NVIDIA GPU (specific architectures)          | NVIDIA GPU              | NVIDIA / AMD GPU                 |
| Model Support         | Requires per-model adaptation                | Broad HF Hub support    | Broad HF Hub support             |
| Learning Curve        | Steep                                        | Gentle                  | Gentle                           |

Selection Guide

Need extreme latency/throughput?
├── Yes → Have engineering resources for compilation optimization?
│   ├── Yes → TensorRT-LLM
│   └── No  → vLLM
└── No  → Need deep HuggingFace integration?
    ├── Yes → TGI
    └── No  → vLLM (general recommendation)
