
Model Serving

1. Serving Patterns

1.1 REST API

The most common model serving interface:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict]
    model: str = "gpt-4"
    temperature: float = 0.7
    max_tokens: int = 1024

class ChatResponse(BaseModel):
    id: str
    choices: list[dict]
    usage: dict

@app.post("/v1/chat/completions")
async def chat_completion(request: ChatRequest):
    # Model inference (`model` is a placeholder for the application's serving backend)
    response = await model.generate(
        messages=request.messages,
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )
    return ChatResponse(
        id=generate_id(),
        choices=[{"message": {"role": "assistant", "content": response}}],
        usage={"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...}
    )

Features:

  • Simple, universal, easy to debug
  • Based on HTTP/HTTPS, widely compatible
  • Suitable for most application scenarios
  • OpenAI-compatible API format has become the de facto standard

1.2 gRPC

High-performance RPC framework, suitable for internal service communication:

// inference.proto
syntax = "proto3";

service InferenceService {
    rpc Predict (PredictRequest) returns (PredictResponse);
    rpc StreamPredict (PredictRequest) returns (stream PredictResponse);
}

message PredictRequest {
    string model_name = 1;
    repeated Message messages = 2;
    float temperature = 3;
    int32 max_tokens = 4;
}

message PredictResponse {
    string text = 1;
    TokenUsage usage = 2;
}

Features:

  • Higher performance than REST (Protocol Buffers serialization)
  • Native streaming support
  • Strongly typed interface definitions
  • Suitable for inter-microservice communication
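
A minimal Python client sketch against the proto above, assuming it has been compiled with grpcio-tools into inference_pb2 / inference_pb2_grpc and that a server listens on port 50051 (both assumptions; the messages field is omitted here because the Message type is not shown in the snippet):

import grpc
import inference_pb2
import inference_pb2_grpc

# Connect to the (assumed) gRPC server and build a stub from the generated code
channel = grpc.insecure_channel("localhost:50051")
stub = inference_pb2_grpc.InferenceServiceStub(channel)

request = inference_pb2.PredictRequest(
    model_name="llm_model",
    temperature=0.7,
    max_tokens=256,
)

# Unary call
print(stub.Predict(request).text)

# Server-side streaming call: chunks arrive as they are generated
for chunk in stub.StreamPredict(request):
    print(chunk.text, end="", flush=True)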

1.3 Streaming

LLM-generated text is incrementally returned via SSE (Server-Sent Events):

import json

from fastapi.responses import StreamingResponse

@app.post("/v1/chat/completions/stream")
async def chat_completion_stream(request: ChatRequest):
    async def generate():
        async for token in model.generate_stream(
            messages=request.messages,
            temperature=request.temperature,
        ):
            chunk = {
                "choices": [{"delta": {"content": token}}]
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

Advantages of streaming:

  • User experience: Lower Time To First Token (TTFT)
  • Memory efficiency: Tokens are flushed as they are generated instead of buffering the complete response
  • Timeout handling: Reduces timeout risk for long text generation
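
For completeness, a small client-side sketch that consumes the SSE stream above (httpx is used here as an assumption; any HTTP client with streaming support works, and the URL and payload shape follow the handler defined earlier):

import json
import httpx

payload = {"messages": [{"role": "user", "content": "What is AI?"}]}

# Stream the response body and decode each SSE "data:" line as it arrives
with httpx.stream("POST", "http://localhost:8000/v1/chat/completions/stream",
                  json=payload, timeout=None) as response:
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)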

2. Inference Serving Frameworks

2.1 Triton Inference Server

High-performance inference server from NVIDIA:

Features:
- Supports multiple frameworks (PyTorch, TensorFlow, ONNX, TensorRT)
- Dynamic batching
- Concurrent model execution
- GPU sharing and management
- Built-in performance monitoring

Use cases:
- Multi-model serving
- High-throughput requirements
- GPU resource optimization
- Enterprise deployment

Model repository structure:

model_repository/
├── llm_model/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.plan
│   └── 2/
│       └── model.plan

2.2 vLLM

High-performance serving framework optimized for LLM inference:

from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,     # Number of GPUs for parallelism
    gpu_memory_utilization=0.9, # GPU memory utilization
    max_model_len=8192,         # Maximum sequence length
)

# Inference
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

outputs = llm.generate(["What is AI?"], sampling_params)

Key vLLM technologies:

  • PagedAttention: Manages KV-Cache in pages, reducing memory waste
  • Continuous Batching: Dynamic batching to improve GPU utilization
  • Tensor Parallelism: Multi-GPU parallel inference
  • Speculative Decoding: A small draft model proposes tokens that the target model verifies in parallel, accelerating generation

Starting an OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2
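
Once the server is running, it can be called with the official openai client by pointing base_url at it (the api_key value is a placeholder; vLLM does not validate it unless configured to):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)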

2.3 TGI (Text Generation Inference)

Text generation inference framework from Hugging Face:

# Docker deployment
docker run --gpus all \
    -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3-8B-Instruct \
    --quantize gptq \
    --max-input-length 4096 \
    --max-total-tokens 8192

Features:

  • Flash Attention support
  • Quantized inference (GPTQ, AWQ, EETQ)
  • Watermarking
  • Token streaming
  • Production-grade Rust implementation

2.4 Framework Comparison

Feature        vLLM             TGI               Triton
Focus          LLM inference    Text generation   General inference
Performance    Very high        High              High
Ease of use    High             High              Medium
Multi-model    Limited          Limited           Excellent
Quantization   AWQ, GPTQ, FP8   GPTQ, AWQ         TensorRT
Community      Active           Active            NVIDIA-maintained

3. API Gateway

3.1 Rate Limiting

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("60/minute")  # 60 requests per minute per client IP
async def chat_completion(request: Request, body: ChatRequest):  # slowapi needs the raw Request
    ...

Rate limiting strategies:

  • Fixed window: N requests per minute
  • Sliding window: Smoother rate limiting
  • Token bucket: Allows burst traffic (see the sketch after this list)
  • Per user/API Key: Different quotas for different users
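
As an illustration of the token-bucket idea, a minimal in-process sketch (a single bucket with no locking or per-key state, so not production-ready):

import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 60 requests/minute on average, bursts of up to 10 allowed
bucket = TokenBucket(rate=1.0, capacity=10)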

3.2 Authentication and Authorization

from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

@app.post("/v1/chat/completions")
async def chat_completion(
    request: ChatRequest,
    credentials: HTTPAuthorizationCredentials = Depends(security)
):
    # Verify API Key
    if not verify_api_key(credentials.credentials):
        raise HTTPException(status_code=401, detail="Invalid API key")

    # Check permissions
    user = get_user_by_key(credentials.credentials)
    if not user.has_permission("chat"):
        raise HTTPException(status_code=403, detail="Insufficient permissions")

    ...

3.3 Request Logging and Monitoring

import time
import logging

@app.middleware("http")
async def log_requests(request, call_next):
    start_time = time.time()

    response = await call_next(request)

    duration = time.time() - start_time
    logging.info(f"{request.method} {request.url.path} "
                 f"status={response.status_code} "
                 f"duration={duration:.3f}s")

    # Send to monitoring system
    metrics.record_request(
        path=request.url.path,
        status=response.status_code,
        duration=duration,
    )

    return response
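
One possible backing for the metrics.record_request placeholder above, assuming Prometheus scraping via prometheus_client (metric names are illustrative; if this code lives in a metrics module imported by the app, the middleware's call resolves to the function below):

from prometheus_client import Counter, Histogram, make_asgi_app

REQUEST_COUNT = Counter(
    "llm_requests_total", "Total requests", ["path", "status"])
REQUEST_LATENCY = Histogram(
    "llm_request_duration_seconds", "Request latency in seconds", ["path"])

# Expose a /metrics endpoint for Prometheus to scrape
app.mount("/metrics", make_asgi_app())

def record_request(path: str, status: int, duration: float) -> None:
    REQUEST_COUNT.labels(path=path, status=str(status)).inc()
    REQUEST_LATENCY.labels(path=path).observe(duration)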

4. Load Balancing

4.1 Strategy Selection

Strategy               Description                                     Use Cases
Round Robin            Distribute requests sequentially                Uniform load
Weighted Round Robin   Distribute by GPU performance                   Heterogeneous GPU clusters
Least Connections      Route to the instance with fewest connections   High variance in request duration
Consistent Hashing     Route the same user to the same instance        Session affinity needed
Fastest Response       Route to the fastest-responding instance        Latency-sensitive
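
A simplified sketch of the hashing idea from the Consistent Hashing row (plain modulo hashing over hypothetical instance names; a real setup would use a hash ring so that adding or removing instances remaps as few users as possible):

import hashlib

INSTANCES = ["inference-1", "inference-2", "inference-3"]  # hypothetical backends

def route(user_id: str) -> str:
    # The same user_id always hashes to the same instance, which keeps
    # any per-user state (e.g. prefix caches) warm on a single node
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return INSTANCES[h % len(INSTANCES)]

print(route("user-42"))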

4.2 Health Checks

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "gpu_memory_used": get_gpu_memory_usage(),
        "active_requests": get_active_request_count(),
        "model_loaded": model.is_loaded(),
    }

@app.get("/ready")
async def readiness_check():
    if not model.is_loaded():
        raise HTTPException(status_code=503, detail="Model not ready")
    return {"status": "ready"}

5. Auto-scaling

5.1 Scaling Metrics

  • QPS: Queries per second
  • GPU utilization: GPU compute resource usage
  • Queue length: Number of requests waiting to be processed
  • Response latency: P95/P99 latency

5.2 Kubernetes HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    - type: Pods
      pods:
        metric:
          name: request_queue_length
        target:
          type: AverageValue
          averageValue: "10"

5.3 Cold Start Optimization

  • Model pre-loading: Pre-load models to GPU at startup
  • Minimum replicas: Keep at least 2 instances running
  • Warm-up requests: Send warm-up requests after startup (see the sketch after this list)
  • Model caching: Use shared storage to cache model weights
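
A minimal sketch of the warm-up idea, reusing the placeholder model object from the REST example (FastAPI's newer lifespan handler can serve the same purpose):

@app.on_event("startup")
async def warm_up():
    # Run a short generation so CUDA kernels, caches, and weights
    # are initialized before real traffic arrives
    await model.generate(
        messages=[{"role": "user", "content": "ping"}],
        temperature=0.0,
        max_tokens=8,
    )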

6. Batching Strategies

6.1 Static Batching

Collect a fixed number of requests → Process together → Return together
Drawback: Must wait for all requests to arrive before processing

6.2 Dynamic Batching

Set maximum wait time → Collect requests within the time window → Batch process
Advantage: Balances latency and throughput
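
A minimal asyncio sketch of the time-window idea (run_model_batch is a hypothetical batched forward pass; production frameworks implement this inside the inference engine):

import asyncio

async def batch_worker(queue: asyncio.Queue, max_batch: int = 8, max_wait: float = 0.01):
    """Each queue item is a (request, future) pair enqueued by the API handler."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]            # block until the first request arrives
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:                               # collect more requests within the window
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        outputs = run_model_batch([req for req, _ in batch])  # hypothetical batched inference
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)                # hand results back to the waiting handlers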

6.3 Continuous Batching

Request A: [generating...] [done] →
Request B:    [generating...........]  [done] →
Request C:        [generating......] [done] →
Request D:              [generating...] [done] →

New requests can immediately join the batch as earlier requests complete

vLLM's continuous batching:

  • As soon as a request completes, a new request joins
  • Maximizes GPU utilization
  • 2-4x throughput improvement over static batching

7. Serving Architecture

graph TD
    Client[Client] --> LB[Load Balancer]
    LB --> GW[API Gateway]

    GW --> Auth[Authentication]
    GW --> RL[Rate Limiting]
    GW --> Log[Logging & Monitoring]

    GW --> Router[Request Router]

    Router --> S1[Inference Instance 1<br/>GPU: A100]
    Router --> S2[Inference Instance 2<br/>GPU: A100]
    Router --> S3[Inference Instance 3<br/>GPU: H100]

    S1 --> Cache[KV-Cache<br/>Redis]
    S2 --> Cache
    S3 --> Cache

    Router --> Queue[Request Queue<br/>Kafka/Redis]
    Queue --> S1
    Queue --> S2
    Queue --> S3

    Monitor[Monitoring<br/>Prometheus + Grafana] --> S1
    Monitor --> S2
    Monitor --> S3
    Monitor --> GW

    HPA[Auto-scaling<br/>K8s HPA] --> S1
    HPA --> S2
    HPA --> S3

8. Practical Checklist

  • [ ] Choose an inference framework (vLLM/TGI/Triton)
  • [ ] Implement OpenAI-compatible API
  • [ ] Configure streaming
  • [ ] Deploy API gateway (authentication, rate limiting)
  • [ ] Set up load balancing and health checks
  • [ ] Configure auto-scaling
  • [ ] Implement continuous batching
  • [ ] Establish monitoring and alerting

References

  • vLLM Documentation
  • NVIDIA Triton Inference Server Documentation
  • Hugging Face Text Generation Inference Documentation
  • LLMOps Overview — Full lifecycle management for LLM applications
