# Deployment Architecture

## Overview
The deployment architecture of AI Agents determines the system's availability, scalability, and cost efficiency. Unlike traditional web services, agent systems involve unique challenges such as long-running tasks, external tool calls, and state management. This section discusses the major deployment architecture patterns and their applicable scenarios.
## Deployment Architecture Patterns

```mermaid
graph TD
    A[Agent Deployment Architecture] --> B[Serverless]
    A --> C[Containerized]
    A --> D[Long-running Service]
    B --> B1[AWS Lambda]
    B --> B2[Cloud Functions]
    B --> B3[Vercel Functions]
    C --> C1[Docker]
    C --> C2[Kubernetes]
    C --> C3[ECS/Cloud Run]
    D --> D1[WebSocket Service]
    D --> D2[Worker Processes]
    D --> D3[Queue Consumers]
    style B fill:#e3f2fd
    style C fill:#fff3e0
    style D fill:#e8f5e9
```
## Cloud vs. Edge Deployment
| Dimension | Cloud Deployment | Edge Deployment |
|---|---|---|
| Compute power | Strong (GPU available) | Limited |
| Latency | Higher (network transfer) | Low |
| Model size | Large models | Small models / distilled models |
| Privacy | Data leaves local environment | Data processed locally |
| Cost model | Pay-per-use billing | Fixed hardware cost |
| Availability | Depends on network | Available offline |
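The trade-offs in the table can be encoded as a simple routing policy. The sketch below is purely illustrative; the field names (`privacy_sensitive`, `offline`, `needs_large_model`, `latency_sensitive`) are hypothetical, not part of any real API:

```python
def choose_deployment(task: dict) -> str:
    """Toy policy reflecting the cloud-vs-edge trade-offs above."""
    # Privacy and offline requirements force local (edge) processing
    if task.get("privacy_sensitive") or task.get("offline"):
        return "edge"
    # Large models only fit in the cloud
    if task.get("needs_large_model"):
        return "cloud"
    # Latency-sensitive work favors edge; everything else defaults to cloud
    return "edge" if task.get("latency_sensitive") else "cloud"
```

In a real system this decision is usually made per request class at the gateway, not per task at runtime.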
## Serverless Agent

### Architecture Features
Serverless architecture is suitable for short-duration, event-driven agent tasks.
```mermaid
graph LR
    A[Trigger Event] --> B[API Gateway]
    B --> C[Lambda/Function]
    C --> D[LLM API]
    C --> E[Tool Calls]
    D --> F[Return Result]
    E --> F
```
**Advantages:**
- Zero ops, automatic scaling
- Pay per invocation, zero cost when idle
- Fast deployment and iteration
**Disadvantages:**
- Cold start latency (1-5 seconds)
- Execution time limits (typically 15 minutes)
- Stateless, state management requires external storage
- Not suitable for long-running agent tasks
**Applicable Scenarios:**
- Simple single-step agent calls
- Webhook-triggered automation
- Lightweight API proxies
### State Management Solutions
```python
# Serverless agent state management (AWS Lambda + DynamoDB)
import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('agent_sessions')

def lambda_handler(event, context):
    session_id = event['session_id']
    # Restore state from DynamoDB
    session = table.get_item(Key={'id': session_id})['Item']
    # Execute one agent step (agent is assumed to be initialized at module load)
    result = agent.run_step(session['state'], event['input'])
    # Persist the new state with a TTL
    table.put_item(Item={
        'id': session_id,
        'state': result['new_state'],
        'ttl': int(time.time()) + 3600  # expire after 1 hour
    })
    return result['output']
```
## Containerized Agent

### Docker Deployment
```dockerfile
# Agent Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Security: run as a non-root user
RUN useradd -m agent
USER agent
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Kubernetes Deployment
```yaml
# Agent Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent
  template:
    metadata:
      labels:
        app: agent        # must match the selector above
    spec:
      containers:
        - name: agent
          image: agent-service:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          env:
            - name: LLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: api-key
```
### Scaling Strategies
| Strategy | Trigger Condition | Applicable Scenario |
|---|---|---|
| HPA (Horizontal) | CPU/memory utilization | General scenarios |
| KEDA | Queue length | Async tasks |
| VPA (Vertical) | Resource insufficiency | Single instance needs more resources |
| Scheduled scaling | Known traffic patterns | Predictable peaks |
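For queue-driven agent workers, KEDA scales on queue depth rather than CPU. A sketch of a `ScaledObject`, assuming a RabbitMQ-backed task queue; the Deployment name `agent-worker`, queue name `agent-tasks`, and env var `RABBITMQ_URL` are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker-scaler
spec:
  scaleTargetRef:
    name: agent-worker          # illustrative Deployment name
  minReplicaCount: 0            # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq
      metadata:
        queueName: agent-tasks  # illustrative queue name
        mode: QueueLength
        value: "10"             # target tasks per replica
        hostFromEnv: RABBITMQ_URL
```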
## Webhook-triggered vs. Long-running

### Webhook-triggered Pattern
```mermaid
sequenceDiagram
    participant U as User/System
    participant W as Webhook Endpoint
    participant Q as Task Queue
    participant A as Agent Worker
    participant L as LLM API
    U->>W: POST /webhook
    W->>Q: Enqueue task
    W-->>U: 202 Accepted
    Q->>A: Consume task
    A->>L: LLM call
    L-->>A: Response
    A->>U: Callback/notify result
```
**Characteristics:**
- Asynchronous processing, does not block the caller
- Suitable for time-consuming agent tasks
- Requires implementing callback or polling mechanisms
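The flow above can be sketched with standard-library primitives: a queue decouples the webhook endpoint from the worker, and the handler returns 202 without waiting. The "LLM call" is a placeholder, and the callback is simulated by writing into a `results` dict:

```python
import queue
import threading

# In-memory stand-ins for the real message queue and callback channel
task_queue: queue.Queue = queue.Queue()
results: dict = {}

def webhook_handler(payload: dict) -> tuple:
    """Webhook endpoint: enqueue the task and return 202 immediately."""
    task_queue.put(payload)
    return 202, "Accepted"

def agent_worker() -> None:
    """Worker loop: consume tasks and deliver results via the callback channel."""
    while True:
        task = task_queue.get()
        if task is None:          # poison pill shuts the worker down
            task_queue.task_done()
            break
        # Placeholder for the real agent/LLM call
        results[task["task_id"]] = f"processed:{task['input']}"
        task_queue.task_done()
```

In production the queue would be SQS, RabbitMQ, or similar, and the result would be delivered via an HTTP callback or fetched by polling.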
### Long-running Pattern
Suitable for scenarios requiring persistent connections:
- WebSocket real-time interaction
- Long-running background agents
- Scenarios requiring in-memory state maintenance
```python
# WebSocket agent service
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/agent/{session_id}")
async def agent_endpoint(websocket: WebSocket, session_id: str):
    await websocket.accept()
    # AgentSession is assumed to encapsulate per-connection agent state
    agent = AgentSession(session_id)
    try:
        while True:
            # Receive a user message
            data = await websocket.receive_text()
            # Stream the agent's execution back to the client
            async for chunk in agent.stream_run(data):
                await websocket.send_json({
                    "type": chunk.type,  # "thinking", "action", "result"
                    "content": chunk.content
                })
    except WebSocketDisconnect:
        pass  # client closed the connection; per-session state is released
```
## Hybrid Architecture
Production deployments typically adopt a hybrid architecture:
```mermaid
graph TD
    subgraph Frontend Layer
        A[Web UI]
        B[API Client]
        C[Webhook]
    end
    subgraph Gateway Layer
        D[API Gateway]
        E[Load Balancer]
    end
    subgraph Service Layer
        F[Synchronous Agent Service]
        G[Async Agent Worker]
    end
    subgraph Infrastructure
        H[Message Queue]
        I[State Store Redis]
        J[Persistent Store DB]
        K[Vector Database]
    end
    A --> D
    B --> D
    C --> D
    D --> E
    E --> F
    E --> H
    H --> G
    F --> I
    G --> I
    F --> J
    G --> J
    F --> K
    G --> K
```
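The key decision in this architecture happens at the gateway layer: which requests go to the synchronous service and which are queued for async workers. A routing sketch; the request fields (`stream`, `expected_seconds`) and target names are illustrative:

```python
def route_request(req: dict) -> str:
    """Gateway-layer routing sketch for the hybrid architecture."""
    # Streaming sessions stay on the synchronous agent service (WebSocket)
    if req.get("stream"):
        return "sync-service"
    # Long tasks go to the message queue for async workers
    if req.get("expected_seconds", 0) > 30:
        return "queue"
    # Short request/response calls are served synchronously
    return "sync-service"
```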
## Deployment Checklist

### Pre-launch Checks
- [ ] API keys managed through Secrets, not hardcoded
- [ ] Request rate limiting and concurrency controls configured
- [ ] Health check endpoints configured
- [ ] Logging and monitoring systems ready
- [ ] Error handling and graceful degradation in place
- [ ] Timeout configurations are reasonable
- [ ] Security scan passed
- [ ] Load testing completed
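The health-check item above usually amounts to an endpoint that probes each dependency (LLM API, Redis, database) and reports per-check status. A framework-agnostic sketch, where each probe is a zero-argument callable that raises on failure (the check names are illustrative):

```python
import time

def health_check(checks: dict) -> dict:
    """Probe each dependency and build a /healthz-style report."""
    report = {}
    healthy = True
    for name, probe in checks.items():
        start = time.monotonic()
        try:
            probe()
            report[name] = {"ok": True}
        except Exception as exc:
            healthy = False
            report[name] = {"ok": False, "error": str(exc)}
        # Record probe latency so slow dependencies are visible too
        report[name]["latency_ms"] = round((time.monotonic() - start) * 1000, 1)
    return {"status": "ok" if healthy else "degraded", "checks": report}
```

Wired into the service's HTTP layer, this backs the load balancer's readiness probe; a "degraded" status can take an instance out of rotation without killing in-flight sessions.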
### Operations Essentials
| Item | Description |
|---|---|
| Key rotation | Regularly rotate API keys |
| Backup and recovery | Regular state data backups |
| Version management | Blue-green deployment or canary releases |
| Disaster recovery | Multi-region or multi-cloud deployment |
| Cost alerts | Set spending limits and alerts |
Cross-references:

- Secure deployment → Security and Sandboxing
- Monitoring → Observability and Monitoring
- Cost optimization → Cost Optimization and Caching