Model Serving
1. Serving Patterns
1.1 REST API
The most common model serving interface:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict]
    model: str = "gpt-4"
    temperature: float = 0.7
    max_tokens: int = 1024

class ChatResponse(BaseModel):
    id: str
    choices: list[dict]
    usage: dict

@app.post("/v1/chat/completions")
async def chat_completion(request: ChatRequest):
    # Model inference
    response = await model.generate(
        messages=request.messages,
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )
    return ChatResponse(
        id=generate_id(),
        choices=[{"message": {"role": "assistant", "content": response}}],
        usage={"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...},
    )
Features:
- Simple, universal, easy to debug
- Based on HTTP/HTTPS, widely compatible
- Suitable for most application scenarios
- OpenAI-compatible API format has become the de facto standard
1.2 gRPC
High-performance RPC framework, suitable for internal service communication:
// inference.proto
service InferenceService {
  rpc Predict (PredictRequest) returns (PredictResponse);
  rpc StreamPredict (PredictRequest) returns (stream PredictResponse);
}

message PredictRequest {
  string model_name = 1;
  repeated Message messages = 2;
  float temperature = 3;
  int32 max_tokens = 4;
}

message PredictResponse {
  string text = 1;
  TokenUsage usage = 2;
}
Features:
- Higher performance than REST (Protocol Buffers serialization)
- Native streaming support
- Strongly typed interface definitions
- Suitable for inter-microservice communication
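A minimal Python client sketch for the service above. It assumes that the stubs inference_pb2 and inference_pb2_grpc have been generated from inference.proto with grpcio-tools, that Message carries role/content fields, and that the server listens on port 50051; adjust these to the real definitions.

import grpc

import inference_pb2
import inference_pb2_grpc

def predict(prompt: str) -> str:
    # An insecure channel is fine for in-cluster traffic; use TLS credentials otherwise
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = inference_pb2_grpc.InferenceServiceStub(channel)
        request = inference_pb2.PredictRequest(
            model_name="llama-3-8b",
            messages=[inference_pb2.Message(role="user", content=prompt)],
            temperature=0.7,
            max_tokens=512,
        )
        response = stub.Predict(request)
        return response.text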
1.3 Streaming
LLM-generated text is incrementally returned via SSE (Server-Sent Events):
import json

from fastapi.responses import StreamingResponse

@app.post("/v1/chat/completions/stream")
async def chat_completion_stream(request: ChatRequest):
    async def generate():
        async for token in model.generate_stream(
            messages=request.messages,
            temperature=request.temperature,
        ):
            chunk = {
                "choices": [{"delta": {"content": token}}]
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
    )
Advantages of streaming:
- User experience: Lower Time To First Token (TTFT)
- Memory efficiency: No need to wait for the complete response
- Timeout handling: Reduces timeout risk for long text generation
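On the client side, the SSE stream can be consumed line by line. A minimal sketch using the requests library, with the URL and payload mirroring the server example above:

import json
import requests

payload = {"messages": [{"role": "user", "content": "What is AI?"}]}

with requests.post(
    "http://localhost:8000/v1/chat/completions/stream",
    json=payload,
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"]["content"], end="", flush=True)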
2. Inference Serving Frameworks
2.1 Triton Inference Server
High-performance inference server from NVIDIA:
Features:
- Supports multiple frameworks (PyTorch, TensorFlow, ONNX, TensorRT)
- Dynamic batching
- Concurrent model execution
- GPU sharing and management
- Built-in performance monitoring
Use cases:
- Multi-model serving
- High-throughput requirements
- GPU resource optimization
- Enterprise deployment
Model repository structure:
model_repository/
├── llm_model/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.plan
│   └── 2/
│       └── model.plan
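Each model directory carries a config.pbtxt describing the backend and scheduling policy. A minimal sketch that enables dynamic batching; the batch sizes, queue delay, and platform are illustrative values, not a recommended configuration:

name: "llm_model"
platform: "tensorrt_plan"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100
}
instance_group [
  { count: 1, kind: KIND_GPU }
]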
2.2 vLLM
High-performance serving framework optimized for LLM inference:
from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,       # Number of GPUs for parallelism
    gpu_memory_utilization=0.9,   # GPU memory utilization
    max_model_len=8192,           # Maximum sequence length
)

# Inference
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)
outputs = llm.generate(["What is AI?"], sampling_params)
Key vLLM technologies:
- PagedAttention: Manages KV-Cache in pages, reducing memory waste
- Continuous Batching: Dynamic batching to improve GPU utilization
- Tensor Parallelism: Multi-GPU parallel inference
- Speculative Decoding: A small draft model proposes tokens that the target model verifies in parallel, accelerating generation
Starting an OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2
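Once the server is up, any OpenAI-compatible client can call it. For example, with the official openai Python package pointed at the local endpoint (the api_key value is a placeholder; vLLM only enforces a key if one is configured on the server):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    temperature=0.7,
    max_tokens=256,
)
print(completion.choices[0].message.content)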
2.3 TGI (Text Generation Inference)
Text generation inference framework from Hugging Face:
# Docker deployment
docker run --gpus all \
-p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3-8B-Instruct \
--quantize gptq \
--max-input-length 4096 \
--max-total-tokens 8192
Features:
- Flash Attention support
- Quantized inference (GPTQ, AWQ, EETQ)
- Watermarking
- Token streaming
- Production-grade Rust implementation
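A minimal sketch of calling the container above from Python with huggingface_hub's InferenceClient, streaming tokens as they are generated (the host and port match the docker run mapping; the prompt and sampling values are illustrative):

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# stream=True yields tokens incrementally instead of waiting for the full text
for token in client.text_generation(
    "What is AI?",
    max_new_tokens=256,
    temperature=0.7,
    stream=True,
):
    print(token, end="", flush=True)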
2.4 Framework Comparison
| Feature | vLLM | TGI | Triton |
|---|---|---|---|
| Focus | LLM inference | Text generation | General inference |
| Performance | Very high | High | High |
| Ease of use | High | High | Medium |
| Multi-model | Limited | Limited | Excellent |
| Quantization | AWQ, GPTQ, FP8 | GPTQ, AWQ | TensorRT |
| Community | Active | Active | NVIDIA-maintained |
3. API Gateway
3.1 Rate Limiting
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("60/minute")  # 60 requests per minute per client
async def chat_completion(request: Request, body: ChatRequest):
    # slowapi needs the raw Request object to identify the caller (here by remote address)
    ...
Rate limiting strategies:
- Fixed window: N requests per minute
- Sliding window: Smoother rate limiting
- Token bucket: Allows burst traffic (see the sketch after this list)
- Per user/API Key: Different quotas for different users
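As an illustration, a token bucket can be sketched in a few lines of Python; the capacity and refill rate below are illustrative values:

import time

class TokenBucket:
    """Allows short bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. 60 requests/minute on average, with bursts of up to 10
bucket = TokenBucket(capacity=10, rate=1.0)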
3.2 Authentication and Authorization
from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

@app.post("/v1/chat/completions")
async def chat_completion(
    request: ChatRequest,
    credentials: HTTPAuthorizationCredentials = Depends(security),
):
    # Verify API Key
    if not verify_api_key(credentials.credentials):
        raise HTTPException(status_code=401, detail="Invalid API key")

    # Check permissions
    user = get_user_by_key(credentials.credentials)
    if not user.has_permission("chat"):
        raise HTTPException(status_code=403, detail="Insufficient permissions")
    ...
3.3 Request Logging and Monitoring
import time
import logging

from fastapi import Request

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    duration = time.time() - start_time

    logging.info(f"{request.method} {request.url.path} "
                 f"status={response.status_code} "
                 f"duration={duration:.3f}s")

    # Send to monitoring system (see the metrics sketch below)
    metrics.record_request(
        path=request.url.path,
        status=response.status_code,
        duration=duration,
    )
    return response
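The metrics object used above is an assumed application-level helper; one possible implementation with prometheus_client might look like this (metric names are illustrative):

from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter(
    "http_requests_total", "Total HTTP requests", ["path", "status"]
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["path"]
)

class Metrics:
    def record_request(self, path: str, status: int, duration: float) -> None:
        REQUEST_COUNT.labels(path=path, status=str(status)).inc()
        REQUEST_LATENCY.labels(path=path).observe(duration)

metrics = Metrics()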
4. Load Balancing
4.1 Strategy Selection
| Strategy | Description | Use Cases |
|---|---|---|
| Round Robin | Distribute requests sequentially | Uniform load |
| Weighted Round Robin | Distribute by GPU performance | Heterogeneous GPU clusters |
| Least Connections | Route to the instance with fewest connections | High variance in request duration |
| Consistent Hashing | Route same user to same instance | Session affinity needed |
| Fastest Response | Route to the fastest responding instance | Latency-sensitive |
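As a simple illustration of one of these strategies, least-connections routing only needs a per-instance counter of in-flight requests; the instance addresses below are placeholders:

class LeastConnectionsRouter:
    def __init__(self, instances: list[str]):
        # Track in-flight requests per inference instance
        self.active = {name: 0 for name in instances}

    def acquire(self) -> str:
        # Pick the instance with the fewest in-flight requests
        instance = min(self.active, key=self.active.get)
        self.active[instance] += 1
        return instance

    def release(self, instance: str) -> None:
        self.active[instance] -= 1

router = LeastConnectionsRouter(["gpu-node-1:8000", "gpu-node-2:8000"])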
4.2 Health Checks
@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "gpu_memory_used": get_gpu_memory_usage(),
        "active_requests": get_active_request_count(),
        "model_loaded": model.is_loaded(),
    }

@app.get("/ready")
async def readiness_check():
    if not model.is_loaded():
        raise HTTPException(status_code=503, detail="Model not ready")
    return {"status": "ready"}
5. Auto-scaling
5.1 Scaling Metrics
- QPS: Queries per second
- GPU utilization: GPU compute resource usage
- Queue length: Number of requests waiting to be processed
- Response latency: P95/P99 latency
5.2 Kubernetes HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"
  - type: Pods
    pods:
      metric:
        name: request_queue_length
      target:
        type: AverageValue
        averageValue: "10"
5.3 Cold Start Optimization
- Model pre-loading: Pre-load models to GPU at startup
- Minimum replicas: Keep at least 2 instances running
- Warm-up requests: Send warm-up requests after startup (see the sketch after this list)
- Model caching: Use shared storage to cache model weights
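A minimal warm-up sketch in FastAPI; the prompt and the generate call are placeholders for the real model interface used by the service:

@app.on_event("startup")
async def warm_up():
    # A short dummy generation forces weight loading, kernel compilation, etc.
    await model.generate(
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )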
6. Batching Strategies
6.1 Static Batching
Collect a fixed number of requests → Process together → Return together
Drawback: Must wait for all requests to arrive before processing
6.2 Dynamic Batching
Set maximum wait time → Collect requests within the time window → Batch process
Advantage: Balances latency and throughput
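A minimal asyncio sketch of a dynamic batcher; max_batch_size, max_wait, the request dict layout, and the batched model call are all illustrative assumptions:

import asyncio

async def dynamic_batcher(queue: asyncio.Queue, max_batch_size: int = 8, max_wait: float = 0.01):
    loop = asyncio.get_running_loop()
    while True:
        # Block until at least one request arrives, then open a short collection window
        batch = [await queue.get()]
        deadline = loop.time() + max_wait
        while len(batch) < max_batch_size:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # Run one batched forward pass and resolve each caller's future
        prompts = [req["prompt"] for req in batch]
        outputs = await model.generate_batch(prompts)  # assumed batched model call
        for req, out in zip(batch, outputs):
            req["future"].set_result(out)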
6.3 Continuous Batching
Request A: [generating...] [done] →
Request B: [generating...........] [done] →
Request C: [generating......] [done] →
Request D: [generating...] [done] →
New requests can immediately join the batch as earlier requests complete
vLLM's continuous batching:
- As soon as a request completes, a new request joins
- Maximizes GPU utilization
- 2-4x throughput improvement over static batching
7. Serving Architecture
graph TD
Client[Client] --> LB[Load Balancer]
LB --> GW[API Gateway]
GW --> Auth[Authentication]
GW --> RL[Rate Limiting]
GW --> Log[Logging & Monitoring]
GW --> Router[Request Router]
Router --> S1[Inference Instance 1<br/>GPU: A100]
Router --> S2[Inference Instance 2<br/>GPU: A100]
Router --> S3[Inference Instance 3<br/>GPU: H100]
S1 --> Cache[KV-Cache<br/>Redis]
S2 --> Cache
S3 --> Cache
Router --> Queue[Request Queue<br/>Kafka/Redis]
Queue --> S1
Queue --> S2
Queue --> S3
Monitor[Monitoring<br/>Prometheus + Grafana] --> S1
Monitor --> S2
Monitor --> S3
Monitor --> GW
HPA[Auto-scaling<br/>K8s HPA] --> S1
HPA --> S2
HPA --> S3
8. Practical Checklist
- [ ] Choose an inference framework (vLLM/TGI/Triton)
- [ ] Implement OpenAI-compatible API
- [ ] Configure streaming
- [ ] Deploy API gateway (authentication, rate limiting)
- [ ] Set up load balancing and health checks
- [ ] Configure auto-scaling
- [ ] Implement continuous batching
- [ ] Establish monitoring and alerting
References
- vLLM Documentation
- NVIDIA Triton Inference Server Documentation
- Hugging Face Text Generation Inference Documentation
- LLMOps Overview — Full lifecycle management for LLM applications