
Deployment Architecture Overview

Overview

The deployment architecture of AI Agents determines the system's availability, scalability, and cost efficiency. Unlike traditional web services, agent systems involve unique challenges such as long-running tasks, external tool calls, and state management. This section discusses the major deployment architecture patterns and their applicable scenarios.

Deployment Architecture Patterns

graph TD
    A[Agent Deployment Architecture] --> B[Serverless]
    A --> C[Containerized]
    A --> D[Long-running Service]

    B --> B1[AWS Lambda]
    B --> B2[Cloud Functions]
    B --> B3[Vercel Functions]

    C --> C1[Docker]
    C --> C2[Kubernetes]
    C --> C3[ECS/Cloud Run]

    D --> D1[WebSocket Service]
    D --> D2[Worker Processes]
    D --> D3[Queue Consumers]

    style B fill:#e3f2fd
    style C fill:#fff3e0
    style D fill:#e8f5e9

Cloud vs. Edge Deployment

Dimension     | Cloud Deployment                  | Edge Deployment
Compute power | Strong (GPUs available)           | Limited
Latency       | Higher (network round trips)      | Low
Model size    | Large models                      | Small / distilled models
Privacy       | Data leaves the local environment | Data processed locally
Cost model    | Pay-per-use billing               | Fixed hardware cost
Availability  | Depends on the network            | Works offline

Serverless Agent

Architecture Features

Serverless architecture is suitable for short-duration, event-driven agent tasks.

graph LR
    A[Trigger Event] --> B[API Gateway]
    B --> C[Lambda/Function]
    C --> D[LLM API]
    C --> E[Tool Calls]
    D --> F[Return Result]
    E --> F

Advantages:

  • Zero ops, automatic scaling
  • Pay per invocation, zero cost when idle
  • Fast deployment and iteration

Disadvantages:

  • Cold start latency (1-5 seconds)
  • Execution time limits (typically 15 minutes)
  • Stateless; session state must be persisted in external storage
  • Not suitable for long-running agent tasks

Applicable Scenarios:

  • Simple single-step agent calls
  • Webhook-triggered automation
  • Lightweight API proxies

State Management Solutions

# Serverless Agent state management (AWS Lambda + DynamoDB)
import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('agent_sessions')

def lambda_handler(event, context):
    session_id = event['session_id']

    # Restore state (assumes the session item was created beforehand)
    session = table.get_item(Key={'id': session_id})['Item']

    # Execute one agent step (`agent` is provided by the application)
    result = agent.run_step(session['state'], event['input'])

    # Save state with a TTL so abandoned sessions expire automatically
    table.put_item(Item={
        'id': session_id,
        'state': result['new_state'],
        'ttl': int(time.time()) + 3600  # 1 hour expiry
    })

    return result['output']

Containerized Agent

Docker Deployment

# Agent Dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Security: run as non-root user
RUN useradd -m agent
USER agent

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Kubernetes Deployment

# Agent Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent
  template:
    metadata:
      labels:
        app: agent
    spec:
      containers:
      - name: agent
        image: agent-service:latest
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        env:
        - name: LLM_API_KEY
          valueFrom:
            secretKeyRef:
              name: agent-secrets
              key: api-key

Scaling Strategies

Strategy          | Trigger Condition                 | Applicable Scenario
HPA (horizontal)  | CPU/memory utilization            | General workloads
KEDA              | Queue length                      | Async task processing
VPA (vertical)    | Resource pressure per instance    | Single instance needs more resources
Scheduled scaling | Known traffic patterns            | Predictable peaks
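
The HPA and KEDA rows in the table share the same underlying control loop: compare an observed metric against a target and scale proportionally. A minimal sketch of that calculation, following the Kubernetes HPA formula `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)` (the function name and replica bounds are illustrative):

```python
# Replica calculation behind HPA/KEDA-style autoscalers (sketch).
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Replica count the autoscaler converges toward, clamped to bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# CPU-based (HPA): 3 replicas at 80% utilization, target 50% -> scale out to 5
print(desired_replicas(3, 80, 50))
# Queue-based (KEDA-style): 2 workers, 50 queued tasks each, target 10 -> cap at 10
print(desired_replicas(2, 50, 10))
```

The same function works for any metric (CPU, queue depth, in-flight LLM calls) as long as the target is expressed per replica.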

Webhook-triggered vs. Long-running

Webhook-triggered Pattern

sequenceDiagram
    participant U as User/System
    participant W as Webhook Endpoint
    participant Q as Task Queue
    participant A as Agent Worker
    participant L as LLM API

    U->>W: POST /webhook
    W->>Q: Enqueue task
    W-->>U: 202 Accepted
    Q->>A: Consume task
    A->>L: LLM call
    L-->>A: Response
    A->>U: Callback/notify result

Characteristics:

  • Asynchronous processing, does not block the caller
  • Suitable for time-consuming agent tasks
  • Requires implementing callback or polling mechanisms
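
The sequence above can be sketched with an in-process queue and a background worker; this is a minimal stand-in for a real broker (SQS, RabbitMQ), and `notify` here appends to a list where production code would POST to the caller's callback URL:

```python
# Webhook pattern sketch: accept immediately, process asynchronously,
# deliver the result via a callback. `handle` stands in for the agent.
import queue
import threading

task_queue: "queue.Queue[dict]" = queue.Queue()
results = []

def notify(task_id: str, output: str) -> None:
    # Production: POST the result to the caller's callback URL.
    results.append((task_id, output))

def worker() -> None:
    while True:
        task = task_queue.get()
        output = f"processed:{task['input']}"  # placeholder for agent.run(...)
        notify(task['id'], output)
        task_queue.task_done()

def webhook_handler(payload: dict) -> dict:
    """Enqueue the task and respond 202 Accepted without waiting."""
    task_queue.put(payload)
    return {"status": 202, "task_id": payload["id"]}

threading.Thread(target=worker, daemon=True).start()
resp = webhook_handler({"id": "t1", "input": "summarize report"})
task_queue.join()  # demo only: wait for the background worker to finish
print(resp["status"], results[0][1])
```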

Long-running Pattern

Suitable for scenarios requiring persistent connections:

  • WebSocket real-time interaction
  • Long-running background agents
  • Scenarios requiring in-memory state maintenance

# WebSocket Agent service
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/agent/{session_id}")
async def agent_endpoint(websocket: WebSocket, session_id: str):
    await websocket.accept()
    agent = AgentSession(session_id)  # application-defined session object

    try:
        while True:
            # Receive user message
            data = await websocket.receive_text()

            # Stream the agent's intermediate steps back to the client
            async for chunk in agent.stream_run(data):
                await websocket.send_json({
                    "type": chunk.type,  # "thinking", "action", "result"
                    "content": chunk.content
                })
    except WebSocketDisconnect:
        pass  # client closed the connection; release per-session resources here

Hybrid Architecture

Production deployments typically adopt a hybrid architecture:

graph TD
    subgraph Frontend Layer
        A[Web UI]
        B[API Client]
        C[Webhook]
    end

    subgraph Gateway Layer
        D[API Gateway]
        E[Load Balancer]
    end

    subgraph Service Layer
        F[Synchronous Agent Service]
        G[Async Agent Worker]
    end

    subgraph Infrastructure
        H[Message Queue]
        I[State Store Redis]
        J[Persistent Store DB]
        K[Vector Database]
    end

    A --> D
    B --> D
    C --> D
    D --> E
    E --> F
    E --> H
    H --> G
    F --> I
    G --> I
    F --> J
    G --> J
    F --> K
    G --> K
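
The gateway's central decision in this architecture is routing: short interactive requests go to the synchronous service, long-running jobs to the queue for async workers. A minimal sketch of that decision (the duration threshold and names are illustrative, and a real system would estimate duration from the task type):

```python
# Hybrid routing sketch: fast path inline, slow path enqueued.
import queue

async_queue: "queue.Queue[dict]" = queue.Queue()

def run_sync(request: dict) -> dict:
    # Placeholder for the synchronous agent service call.
    return {"status": 200, "output": f"done:{request['task']}"}

def dispatch(request: dict, est_seconds: float, sync_limit: float = 10.0) -> dict:
    """Route by estimated duration: answer inline or return 202 and enqueue."""
    if est_seconds <= sync_limit:
        return run_sync(request)
    async_queue.put(request)
    return {"status": 202, "queued": True}

print(dispatch({"task": "short QA"}, est_seconds=2))
print(dispatch({"task": "batch analysis"}, est_seconds=300))
```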

Deployment Checklist

Pre-launch Checks

  • [ ] API keys managed through Secrets, not hardcoded
  • [ ] Request rate limiting and concurrency controls configured
  • [ ] Health check endpoints configured
  • [ ] Logging and monitoring systems ready
  • [ ] Error handling and graceful degradation in place
  • [ ] Timeout configurations are reasonable
  • [ ] Security scan passed
  • [ ] Load testing completed
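
Two of the checklist items (reasonable timeouts, graceful degradation) can be combined in one wrapper: bound every LLM call with a hard timeout and return a fallback instead of raising. A sketch with the standard library only, where `call_llm` stands in for the real client:

```python
# Timeout + graceful degradation sketch for outbound LLM calls.
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

FALLBACK = "The service is busy; please try again shortly."
_pool = ThreadPoolExecutor(max_workers=4)

def with_timeout(fn, *args, timeout: float = 30.0, fallback: str = FALLBACK):
    """Bound a call with a hard timeout; degrade instead of raising."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return fallback        # call took too long
    except Exception:
        return fallback        # call failed outright

def call_llm(prompt: str) -> str:  # stand-in for the real client
    time.sleep(0.01)
    return f"answer:{prompt}"

print(with_timeout(call_llm, "hi", timeout=1.0))                    # normal path
print(with_timeout(lambda p: time.sleep(0.3) or "x", "hi",
                   timeout=0.05))                                   # falls back
```

Note that the timed-out worker thread keeps running to completion; a production version would also cancel or abandon the underlying request.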

Operations Essentials

Item                | Description
Key rotation        | Rotate API keys regularly
Backup and recovery | Back up state data regularly
Version management  | Blue-green deployments or canary releases
Disaster recovery   | Multi-region or multi-cloud deployment
Cost alerts         | Set spending limits and alerts

References

  1. AWS. "Building Serverless AI Agents." 2024.
  2. Google Cloud. "Agent Builder Architecture." 2024.
  3. Kubernetes. "Best Practices for AI Workloads." 2024.

Cross-references:

  • Secure deployment → Security and Sandboxing
  • Monitoring → Observability and Monitoring
  • Cost optimization → Cost Optimization and Caching

