
TensorRT-LLM and TGI

TensorRT-LLM and TGI (Text Generation Inference) are two major LLM inference engines. TensorRT-LLM, developed by NVIDIA, pursues maximum inference performance, while TGI, developed by Hugging Face, emphasizes ease of use and ecosystem integration.

For the vLLM inference engine, see vLLM.


TensorRT-LLM

Overview

TensorRT-LLM is an LLM-specific inference framework built by NVIDIA on top of the TensorRT deep learning inference optimizer. It compiles LLMs into highly optimized TensorRT engines that fully exploit the hardware capabilities of NVIDIA GPUs.

Core Optimization Techniques

Graph Optimization and Kernel Fusion

TensorRT-LLM performs deep optimizations on the computation graph during the compilation stage:

  • Layer Fusion: Merges multiple consecutive operations (e.g., LayerNorm + Linear + Activation) into a single CUDA kernel, reducing memory access overhead and kernel launch costs (see the conceptual sketch after this list)
  • Constant Folding: Pre-computes static expressions at compile time
  • Mixed Precision: Automatically determines which layers use FP16 and which use FP8/INT8, balancing accuracy and speed
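
To make the idea concrete, here is a toy sketch in plain PyTorch rather than TensorRT-LLM internals; torch.compile stands in only as an analogy for a graph compiler that captures the three logical operations and emits fewer, fused kernels.

# Conceptual analogy only: LayerNorm -> Linear -> GELU are three logical ops;
# a graph compiler can capture the sequence and emit fewer, fused kernels.
import torch

class Block(torch.nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d)
        self.proj = torch.nn.Linear(d, d)

    def forward(self, x):
        return torch.nn.functional.gelu(self.proj(self.norm(x)))

block = Block().eval()
fused = torch.compile(block)    # graph capture + fusion, analogous in spirit
x = torch.randn(8, 1024)
with torch.no_grad():
    # same math, fewer kernel launches after compilation
    print(torch.allclose(block(x), fused(x), atol=1e-5))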

Custom CUDA Kernels

NVIDIA has written highly optimized CUDA kernels for critical Transformer operations:

  • Customized Flash Attention implementations tailored to different GPU architectures (a conceptual comparison with naive attention follows this list)
  • Fused MHA/GQA/MQA kernels
  • Optimized GEMM (matrix multiplication) kernels with FP8 Tensor Core support
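
For intuition only, the sketch below contrasts a naive attention computation with PyTorch's fused scaled_dot_product_attention; it is an analogy for what a fused attention kernel computes, not NVIDIA's kernels.

# Conceptual analogy, not NVIDIA's kernels: a fused attention kernel produces the
# same result as the naive formulation without materializing the full score matrix.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Naive attention: explicit (seq_len x seq_len) score matrix per head
scores = (q @ k.transpose(-2, -1)) / (64 ** 0.5)
naive = torch.softmax(scores, dim=-1) @ v

# Fused path (dispatches to Flash / memory-efficient kernels where available)
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-5))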

Tensor Parallelism and Pipeline Parallelism

Native support for multiple parallelism strategies:

  • Tensor Parallelism (TP): Partitions the model's weight matrices across multiple GPUs (a toy illustration follows this list)
  • Pipeline Parallelism (PP): Distributes different layers of the model across different GPUs
  • Supports hybrid TP + PP parallelism
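
A toy, single-process illustration of the tensor-parallel idea; this is not TensorRT-LLM's implementation, and in a real deployment each weight shard lives on its own GPU and the final concatenation is a collective (all-gather).

# Column-parallel Linear layer in miniature: each "rank" holds half the columns of W,
# computes a partial output, and the shards are gathered to rebuild the full result.
import torch

x = torch.randn(4, 8)            # activations, replicated on every rank
W = torch.randn(8, 16)           # full Linear weight
W0, W1 = W.chunk(2, dim=1)       # column split across two ranks

y0 = x @ W0                      # computed on "GPU 0"
y1 = x @ W1                      # computed on "GPU 1"
y = torch.cat([y0, y1], dim=1)   # all-gather in a real multi-GPU setup

print(torch.allclose(y, x @ W))  # True: identical output, half the weights per rank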

In-flight Batching

Similar to vLLM's Continuous Batching, TensorRT-LLM implements In-flight Batching:

  • Dynamically adds/removes requests at each iteration (sketched after this list)
  • Supports different sampling parameters for different requests
  • Tightly integrated with KV Cache management
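
The toy loop below sketches the scheduling idea under the simplifying assumption that each request needs a known number of decode steps; the actual TensorRT-LLM batch manager is considerably more involved.

# Toy in-flight batching loop: the batch is rebuilt every iteration, so finished
# requests leave and queued requests join without waiting for a full-batch barrier.
from collections import deque

queue = deque({"id": i, "remaining": n} for i, n in enumerate([3, 5, 2, 4]))
active, max_batch = [], 2

step = 0
while queue or active:
    while queue and len(active) < max_batch:   # admit requests as slots free up
        active.append(queue.popleft())
    for req in active:                         # one decode step for every active request
        req["remaining"] -= 1
    done = [r["id"] for r in active if r["remaining"] == 0]
    active = [r for r in active if r["remaining"] > 0]
    step += 1
    print(f"iteration {step}: finished {done}")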

Usage Workflow

TensorRT-LLM usage consists of two stages: ahead-of-time compilation (checkpoint conversion plus engine build) and deployment.

# 1. Model conversion: Convert a HuggingFace model to a TensorRT-LLM checkpoint
python convert_checkpoint.py \
    --model_dir meta-llama/Llama-2-7b-hf \
    --output_dir ./checkpoint \
    --dtype float16

# 2. Engine compilation: Compile the checkpoint into a TensorRT engine
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --output_dir ./engine \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_seq_len 4096

# 3. Deployment: Serve via Triton Server
python launch_triton_server.py --model_repo ./triton_repo

Note: The maximum batch size, input length, and sequence length are fixed at compile time and cannot be exceeded at runtime. A separate engine must be built for each GPU architecture.
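
Once the Triton server from step 3 is up, it can be queried over HTTP. The sketch below follows the tensorrtllm_backend examples and assumes the default ensemble model name, port 8000, and the text_input/text_output fields; all of these are deployment-specific.

# Minimal HTTP client for the Triton TensorRT-LLM backend's generate endpoint
# (model name, port, and field names are deployment-specific assumptions).
import requests

payload = {
    "text_input": "What is deep learning?",
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": "",
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text_output"])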

Use Cases

  • Peak performance in production: Online services with strict latency requirements
  • NVIDIA GPU clusters: Full utilization of hardware features such as Tensor Cores and NVLink
  • Large-scale deployment: Elastic scaling through Triton Inference Server

TGI (Text Generation Inference)

Overview

TGI is an LLM inference and serving tool developed by Hugging Face, written in Rust and Python. It provides an out-of-the-box deployment experience via Docker containers and is deeply integrated with the Hugging Face ecosystem.

Core Features

Inference Optimizations:

  • Continuous Batching: Dynamic request batching
  • Flash Attention 2: Efficient attention computation
  • Tensor Parallelism: Multi-GPU inference
  • Quantization support: GPTQ, AWQ, BitsAndBytes, EETQ, FP8

Serving Features:

  • Streaming output (Server-Sent Events); see the streaming sketch after this list
  • Token-level streaming
  • Concurrent request management and queuing
  • Structured output (JSON mode / Grammar)
  • OpenAI-compatible API endpoints
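
As one illustration of token-level streaming over the OpenAI-compatible endpoint (assuming a TGI instance on localhost:8080, as in the Docker examples below):

# Streaming chat completions from TGI's OpenAI-compatible endpoint; tokens arrive
# incrementally as Server-Sent Events.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)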

Operations-Friendly:

  • One-click Docker deployment
  • Built-in Prometheus monitoring metrics (a quick check is sketched after this list)
  • Distributed tracing support (OpenTelemetry)
  • Direct model pulling from HuggingFace Hub
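
A quick operational check against a running instance; /health and /metrics are part of TGI's HTTP API, while the tgi_ metric-name prefix used to filter below is an assumption that may vary across versions.

# Probe readiness and scrape Prometheus metrics from a local TGI container.
import requests

print(requests.get("http://localhost:8080/health").status_code)  # 200 once the model is ready

metrics = requests.get("http://localhost:8080/metrics").text     # Prometheus text format
tgi_lines = [line for line in metrics.splitlines() if line.startswith("tgi_")]
print("\n".join(tgi_lines[:10]))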

Usage

Docker Deployment (Recommended):

# Simplest deployment (gated models such as Llama-2 also require -e HF_TOKEN=<your token>;
# mounting a volume at /data, e.g. -v $PWD/data:/data, caches downloaded weights)
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --max-input-tokens 2048 \
    --max-total-tokens 4096

# Multi-GPU tensor parallelism (note: --quantize gptq expects a GPTQ-quantized checkpoint)
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-70b-chat-hf \
    --num-shard 4 \
    --quantize gptq

Client Usage:

# Using the huggingface_hub client
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Streaming generation
for token in client.text_generation(
    "What is deep learning?",
    max_new_tokens=100,
    stream=True
):
    print(token, end="")

# Or using the OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")
response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Use Cases

  • Rapid prototyping and validation: One-click Docker startup, no compilation needed
  • HuggingFace ecosystem integration: Direct use of models from the Hub
  • Medium-scale services: Meets the performance needs of most online services

Comparison of the Three Major Inference Engines

| Dimension             | TensorRT-LLM                                 | TGI                     | vLLM                             |
|-----------------------|----------------------------------------------|-------------------------|----------------------------------|
| Developer             | NVIDIA                                       | Hugging Face            | UC Berkeley                      |
| Language              | C++ / Python                                 | Rust / Python           | Python                           |
| Core Strength         | Peak performance (kernel-level optimization) | Ease of use & ecosystem | High throughput (PagedAttention) |
| Deployment Complexity | High (engine compilation required)           | Low (Docker)            | Low (pip install)                |
| Quantization Support  | FP8, INT8, INT4 (native)                     | GPTQ, AWQ, BnB, FP8     | GPTQ, AWQ, FP8                   |
| Parallelism Strategy  | TP + PP                                      | TP                      | TP + PP                          |
| Latency               | Lowest                                       | Moderate                | Moderate                         |
| Throughput            | Highest                                      | High                    | Very high                        |
| Hardware Requirements | NVIDIA GPU (specific architectures)          | NVIDIA GPU              | NVIDIA / AMD GPU                 |
| Model Support         | Requires per-model adaptation                | Broad HF Hub support    | Broad HF Hub support             |
| Learning Curve        | Steep                                        | Gentle                  | Gentle                           |

Selection Guide

Need extreme latency/throughput?
├── Yes → Have engineering resources for compilation optimization?
│   ├── Yes → TensorRT-LLM
│   └── No  → vLLM
└── No  → Need deep HuggingFace integration?
    ├── Yes → TGI
    └── No  → vLLM (general recommendation)
