Model Serving
1. Serving Patterns
1.1 REST API
The most common model serving interface:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict]
    model: str = "gpt-4"
    temperature: float = 0.7
    max_tokens: int = 1024

class ChatResponse(BaseModel):
    id: str
    choices: list[dict]
    usage: dict

@app.post("/v1/chat/completions")
async def chat_completion(request: ChatRequest):
    # Model inference
    response = await model.generate(
        messages=request.messages,
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )
    return ChatResponse(
        id=generate_id(),
        choices=[{"message": {"role": "assistant", "content": response}}],
        usage={"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...},
    )
Features:
- Simple, universal, easy to debug
- Based on HTTP/HTTPS, widely compatible
- Suitable for most application scenarios
- OpenAI-compatible API format has become the de facto standard
1.2 gRPC
High-performance RPC framework, suitable for internal service communication:
// inference.proto
service InferenceService {
  rpc Predict (PredictRequest) returns (PredictResponse);
  rpc StreamPredict (PredictRequest) returns (stream PredictResponse);
}

message PredictRequest {
  string model_name = 1;
  repeated Message messages = 2;
  float temperature = 3;
  int32 max_tokens = 4;
}

message PredictResponse {
  string text = 1;
  TokenUsage usage = 2;
}
Features:
- Higher performance than REST (Protocol Buffers serialization)
- Native streaming support
- Strongly typed interface definitions
- Suitable for inter-microservice communication
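A minimal Python client sketch for the service above. It assumes that the stubs inference_pb2 and inference_pb2_grpc have been generated from inference.proto with grpcio-tools, that Message carries role/content fields, and that the server listens on port 50051; adjust these to the real definitions.

import grpc

import inference_pb2
import inference_pb2_grpc

def predict(prompt: str) -> str:
    # An insecure channel is fine for in-cluster traffic; use TLS credentials otherwise
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = inference_pb2_grpc.InferenceServiceStub(channel)
        request = inference_pb2.PredictRequest(
            model_name="llama-3-8b",
            messages=[inference_pb2.Message(role="user", content=prompt)],
            temperature=0.7,
            max_tokens=512,
        )
        response = stub.Predict(request)
        return response.text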
1.3 Streaming
LLM-generated text is incrementally returned via SSE (Server-Sent Events):
import json

from fastapi.responses import StreamingResponse

@app.post("/v1/chat/completions/stream")
async def chat_completion_stream(request: ChatRequest):
    async def generate():
        async for token in model.generate_stream(
            messages=request.messages,
            temperature=request.temperature,
        ):
            chunk = {
                "choices": [{"delta": {"content": token}}]
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
    )
Advantages of streaming:
- User experience: Lower Time To First Token (TTFT)
- Memory efficiency: No need to wait for the complete response
- Timeout handling: Reduces timeout risk for long text generation
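On the client side, the SSE stream can be consumed line by line. A minimal sketch using the requests library, with the URL and payload mirroring the server example above:

import json
import requests

payload = {"messages": [{"role": "user", "content": "What is AI?"}]}

with requests.post(
    "http://localhost:8000/v1/chat/completions/stream",
    json=payload,
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"]["content"], end="", flush=True)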
2. Inference Serving Frameworks
2.1 Triton Inference Server
High-performance inference server from NVIDIA:
Features:
- Supports multiple frameworks (PyTorch, TensorFlow, ONNX, TensorRT)
- Dynamic batching
- Concurrent model execution
- GPU sharing and management
- Built-in performance monitoring
Use cases:
- Multi-model serving
- High-throughput requirements
- GPU resource optimization
- Enterprise deployment
Model repository structure:
model_repository/
├── llm_model/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.plan
│   └── 2/
│       └── model.plan
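Each model directory carries a config.pbtxt describing the backend and scheduling policy. A minimal sketch that enables dynamic batching; the batch sizes, queue delay, and platform are illustrative values, not a recommended configuration:

name: "llm_model"
platform: "tensorrt_plan"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100
}
instance_group [
  { count: 1, kind: KIND_GPU }
]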
2.2 vLLM
High-performance serving framework optimized for LLM inference:
from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,       # Number of GPUs for parallelism
    gpu_memory_utilization=0.9,   # GPU memory utilization
    max_model_len=8192,           # Maximum sequence length
)

# Inference
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)
outputs = llm.generate(["What is AI?"], sampling_params)
Key vLLM technologies:
- PagedAttention: Manages KV-Cache in pages, reducing memory waste
- Continuous Batching: Dynamic batching to improve GPU utilization
- Tensor Parallelism: Multi-GPU parallel inference
- Speculative Decoding: A small draft model proposes tokens that the target model verifies in parallel, accelerating generation
Starting an OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2
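Once the server is up, any OpenAI-compatible client can call it. For example, with the official openai Python package pointed at the local endpoint (the api_key value is a placeholder; vLLM only enforces a key if one is configured on the server):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    temperature=0.7,
    max_tokens=256,
)
print(completion.choices[0].message.content)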
2.3 TGI (Text Generation Inference)
Text generation inference framework from Hugging Face:
# Docker deployment
docker run --gpus all \
-p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3-8B-Instruct \
--quantize gptq \
--max-input-length 4096 \
--max-total-tokens 8192
Features:
- Flash Attention support
- Quantized inference (GPTQ, AWQ, EETQ)
- Watermarking
- Token streaming
- Production-grade Rust implementation
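A minimal sketch of calling the container above from Python with huggingface_hub's InferenceClient, streaming tokens as they are generated (the host and port match the docker run mapping; the prompt and sampling values are illustrative):

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# stream=True yields tokens incrementally instead of waiting for the full text
for token in client.text_generation(
    "What is AI?",
    max_new_tokens=256,
    temperature=0.7,
    stream=True,
):
    print(token, end="", flush=True)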
2.4 Framework Comparison
| Feature | vLLM | TGI | Triton |
|---|---|---|---|
| Focus | LLM inference | Text generation | General inference |
| Performance | Very high | High | High |
| Ease of use | High | High | Medium |
| Multi-model | Limited | Limited | Excellent |
| Quantization | AWQ, GPTQ, FP8 | GPTQ, AWQ | TensorRT |
| Community | Active | Active | NVIDIA-maintained |
3. API Gateway
3.1 Rate Limiting
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("60/minute")  # 60 requests per minute per client
async def chat_completion(request: Request, body: ChatRequest):
    # slowapi needs the raw Request object to identify the caller (here by remote address)
    ...
Rate limiting strategies:
- Fixed window: N requests per minute
- Sliding window: Smoother rate limiting
- Token bucket: Allows burst traffic (see the sketch after this list)
- Per user/API Key: Different quotas for different users
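As an illustration, a token bucket can be sketched in a few lines of Python; the capacity and refill rate below are illustrative values:

import time

class TokenBucket:
    """Allows short bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. 60 requests/minute on average, with bursts of up to 10
bucket = TokenBucket(capacity=10, rate=1.0)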
3.2 Authentication and Authorization
from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

@app.post("/v1/chat/completions")
async def chat_completion(
    request: ChatRequest,
    credentials: HTTPAuthorizationCredentials = Depends(security),
):
    # Verify API Key
    if not verify_api_key(credentials.credentials):
        raise HTTPException(status_code=401, detail="Invalid API key")

    # Check permissions
    user = get_user_by_key(credentials.credentials)
    if not user.has_permission("chat"):
        raise HTTPException(status_code=403, detail="Insufficient permissions")
    ...
3.3 Request Logging and Monitoring
import time
import logging

from fastapi import Request

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    duration = time.time() - start_time

    logging.info(f"{request.method} {request.url.path} "
                 f"status={response.status_code} "
                 f"duration={duration:.3f}s")

    # Send to monitoring system (see the metrics sketch below)
    metrics.record_request(
        path=request.url.path,
        status=response.status_code,
        duration=duration,
    )
    return response
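The metrics object used above is an assumed application-level helper; one possible implementation with prometheus_client might look like this (metric names are illustrative):

from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter(
    "http_requests_total", "Total HTTP requests", ["path", "status"]
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["path"]
)

class Metrics:
    def record_request(self, path: str, status: int, duration: float) -> None:
        REQUEST_COUNT.labels(path=path, status=str(status)).inc()
        REQUEST_LATENCY.labels(path=path).observe(duration)

metrics = Metrics()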
4. Load Balancing
4.1 Strategy Selection
| Strategy | Description | Use Cases |
|---|---|---|
| Round Robin | Distribute requests sequentially | Uniform load |
| Weighted Round Robin | Distribute by GPU performance | Heterogeneous GPU clusters |
| Least Connections | Route to the instance with fewest connections | High variance in request duration |
| Consistent Hashing | Route same user to same instance | Session affinity needed |
| Fastest Response | Route to the fastest responding instance | Latency-sensitive |
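As a simple illustration of one of these strategies, least-connections routing only needs a per-instance counter of in-flight requests; the instance addresses below are placeholders:

class LeastConnectionsRouter:
    def __init__(self, instances: list[str]):
        # Track in-flight requests per inference instance
        self.active = {name: 0 for name in instances}

    def acquire(self) -> str:
        # Pick the instance with the fewest in-flight requests
        instance = min(self.active, key=self.active.get)
        self.active[instance] += 1
        return instance

    def release(self, instance: str) -> None:
        self.active[instance] -= 1

router = LeastConnectionsRouter(["gpu-node-1:8000", "gpu-node-2:8000"])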
4.2 Health Checks
@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "gpu_memory_used": get_gpu_memory_usage(),
        "active_requests": get_active_request_count(),
        "model_loaded": model.is_loaded(),
    }

@app.get("/ready")
async def readiness_check():
    if not model.is_loaded():
        raise HTTPException(status_code=503, detail="Model not ready")
    return {"status": "ready"}
5. Auto-scaling
5.1 Scaling Metrics
- QPS: Queries per second
- GPU utilization: GPU compute resource usage
- Queue length: Number of requests waiting to be processed
- Response latency: P95/P99 latency
5.2 Kubernetes HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"
  - type: Pods
    pods:
      metric:
        name: request_queue_length
      target:
        type: AverageValue
        averageValue: "10"
5.3 Cold Start Optimization
- Model pre-loading: Pre-load models to GPU at startup
- Minimum replicas: Keep at least 2 instances running
- Warm-up requests: Send warm-up requests after startup (see the sketch after this list)
- Model caching: Use shared storage to cache model weights
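A minimal warm-up sketch in FastAPI; the prompt and the generate call are placeholders for the real model interface used by the service:

@app.on_event("startup")
async def warm_up():
    # A short dummy generation forces weight loading, kernel compilation, etc.
    await model.generate(
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )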
6. Batching Strategies
6.1 Static Batching
Collect a fixed number of requests → Process together → Return together
Drawback: Must wait for all requests to arrive before processing
6.2 Dynamic Batching
Set maximum wait time → Collect requests within the time window → Batch process
Advantage: Balances latency and throughput
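A minimal asyncio sketch of a dynamic batcher; max_batch_size, max_wait, the request dict layout, and the batched model call are all illustrative assumptions:

import asyncio

async def dynamic_batcher(queue: asyncio.Queue, max_batch_size: int = 8, max_wait: float = 0.01):
    loop = asyncio.get_running_loop()
    while True:
        # Block until at least one request arrives, then open a short collection window
        batch = [await queue.get()]
        deadline = loop.time() + max_wait
        while len(batch) < max_batch_size:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # Run one batched forward pass and resolve each caller's future
        prompts = [req["prompt"] for req in batch]
        outputs = await model.generate_batch(prompts)  # assumed batched model call
        for req, out in zip(batch, outputs):
            req["future"].set_result(out)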
6.3 Continuous Batching
Request A: [generating...] [done] →
Request B: [generating...........] [done] →
Request C: [generating......] [done] →
Request D: [generating...] [done] →
New requests can immediately join the batch as earlier requests complete
vLLM's continuous batching:
- As soon as a request completes, a new request joins
- Maximizes GPU utilization
- 2-4x throughput improvement over static batching
7. Serving Architecture
graph TD
Client[Client] --> LB[Load Balancer]
LB --> GW[API Gateway]
GW --> Auth[Authentication]
GW --> RL[Rate Limiting]
GW --> Log[Logging & Monitoring]
GW --> Router[Request Router]
Router --> S1[Inference Instance 1<br/>GPU: A100]
Router --> S2[Inference Instance 2<br/>GPU: A100]
Router --> S3[Inference Instance 3<br/>GPU: H100]
S1 --> Cache[KV-Cache<br/>Redis]
S2 --> Cache
S3 --> Cache
Router --> Queue[Request Queue<br/>Kafka/Redis]
Queue --> S1
Queue --> S2
Queue --> S3
Monitor[Monitoring<br/>Prometheus + Grafana] --> S1
Monitor --> S2
Monitor --> S3
Monitor --> GW
HPA[Auto-scaling<br/>K8s HPA] --> S1
HPA --> S2
HPA --> S3
8. Practical Checklist
- [ ] Choose an inference framework (vLLM/TGI/Triton)
- [ ] Implement OpenAI-compatible API
- [ ] Configure streaming
- [ ] Deploy API gateway (authentication, rate limiting)
- [ ] Set up load balancing and health checks
- [ ] Configure auto-scaling
- [ ] Implement continuous batching
- [ ] Establish monitoring and alerting
References
- vLLM Documentation
- NVIDIA Triton Inference Server Documentation
- Hugging Face Text Generation Inference Documentation
- LLMOps Overview — Full lifecycle management for LLM applications