Cost Optimization and Caching
Overview
The operational cost of AI Agents comes primarily from LLM API calls, making cost optimization critical for large-scale deployment. This section covers practical techniques, including prompt caching, model routing, token optimization, and batch processing, that can substantially reduce operating costs while maintaining quality.
Prompt Caching
Anthropic Prompt Caching
Anthropic provides native prompt caching functionality:
- Cache hit: cached input tokens cost 90% less than the base input price
- Cache write: writing a prefix to the cache costs 25% more than the base input price
- Cache TTL: 5 minutes, refreshed on every hit
Cost Comparison:
| Operation | Claude Sonnet Base Price | Cache Hit Price | Savings |
|---|---|---|---|
| Input | $3.00 / 1M tokens | $0.30 / 1M tokens | 90% |
| Output | $15.00 / 1M tokens | $15.00 / 1M tokens | 0% |
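As a worked example (the 10,000-token prompt size and 100-request volume are hypothetical), consider a system prompt of 10,000 tokens reused across 100 requests. Without caching, the prefix costs \(100 \times 10{,}000 \times \$3.00/10^6 = \$3.00\). With caching, there is one cache write at the 25% surcharge, \(10{,}000 \times \$3.75/10^6 \approx \$0.04\), plus 99 cache hits at \(99 \times 10{,}000 \times \$0.30/10^6 \approx \$0.30\), for a total of roughly \$0.34, about an 89% reduction on the repeated prefix.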
Usage Scenario:
import anthropic

client = anthropic.Anthropic()

# Mark the long, stable system prompt (and, similarly, tool definitions) as cacheable
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # Extensive system instructions (defined elsewhere)
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_query}
    ]
)
# Subsequent requests with the same system prompt will hit the cache
OpenAI Automatic Caching
OpenAI automatically caches exact-match prompt prefixes:
- Auto-triggered: no additional configuration needed; caching applies to prompts of roughly 1,024 tokens or longer
- Discount: cached input tokens are billed at 50% of the normal input price
- Applicable models: GPT-4o, GPT-4o mini
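Because caching is automatic, the main thing to do in code is confirm it is happening. A minimal sketch, assuming a recent openai Python SDK that exposes usage.prompt_tokens_details (long_system_prompt and user_query are placeholders):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": long_system_prompt},  # stable prefix
        {"role": "user", "content": user_query},            # varying suffix
    ],
)

# On a cache hit, part of the prompt is reported as cached
usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens
print(f"prompt tokens: {usage.prompt_tokens}, cached: {cached}")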
Caching Strategy Design
graph TD
A[Agent Request] --> B{System prompt changed?}
B -->|No| C[Use cached prefix]
B -->|Yes| D[Update cache]
C --> E[Only send new user message]
D --> E
E --> F[LLM Inference]
F --> G[Cache new prefix]
Maximizing Cache Hit Rate:
- Place unchanging content at the beginning of the prompt (see the sketch after this list)
- Recommended ordering: system prompt → tool definitions → few-shot examples → conversation history → current query
- Keep system prompts and tool definitions byte-for-byte stable; any change invalidates the cached prefix
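A minimal sketch of this ordering (the file names and the build_request helper are illustrative assumptions, not tied to any specific SDK): the static prefix is loaded once and reused verbatim, while only the dynamic tail changes per request.

import json
from pathlib import Path

# Static prefix: loaded once at startup and reused verbatim so the provider can cache it
STATIC_SYSTEM = Path("system_prompt.txt").read_text()
STATIC_TOOLS = json.loads(Path("tool_definitions.json").read_text())
STATIC_EXAMPLES = json.loads(Path("few_shot_examples.json").read_text())

def build_request(history: list[dict], user_query: str) -> dict:
    """Assemble a request in the cache-friendly order:
    stable content first, per-request content last."""
    return {
        "system": STATIC_SYSTEM,
        "tools": STATIC_TOOLS,
        "messages": [
            *STATIC_EXAMPLES,                         # few-shot examples (stable)
            *history,                                 # conversation history (grows over time)
            {"role": "user", "content": user_query},  # current query (changes every call)
        ],
    }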
Model Routing
Intelligent Routing Strategy
Automatically select a model based on estimated task complexity, routing simple requests to cheap models and reserving top-tier models for genuinely hard tasks.
Routing Implementation
class ModelRouter:
    def __init__(self):
        # Approximate input prices in $ per 1M tokens
        self.models = {
            "simple": {"name": "gpt-4o-mini", "cost": 0.15},
            "medium": {"name": "claude-sonnet", "cost": 3.00},
            "complex": {"name": "claude-opus", "cost": 15.00},
        }

    def route(self, task: str) -> dict:
        # Select a model tier based on estimated task complexity
        complexity = self.estimate_complexity(task)
        if complexity < 0.3:
            return self.models["simple"]
        elif complexity < 0.7:
            return self.models["medium"]
        else:
            return self.models["complex"]

    def estimate_complexity(self, task: str) -> float:
        """Estimate task complexity in [0, 1].

        Useful signals include:
        - Number of reasoning steps needed
        - Whether tool use is required
        - Whether code generation is involved
        - Historical failure rate for similar tasks

        The keyword/length heuristic below is only a placeholder.
        """
        score = min(len(task) / 2000, 0.5)  # longer tasks tend to be harder
        hard_keywords = ("refactor", "prove", "debug", "architecture", "multi-step")
        if any(keyword in task.lower() for keyword in hard_keywords):
            score += 0.4
        return min(score, 1.0)
Cascade Strategy
Try with a cheap model first, escalate on failure:
graph LR
A[Task] --> B[Small Model Attempt]
B --> C{Success?}
C -->|Yes| D[Return Result]
C -->|No| E[Medium Model Attempt]
E --> F{Success?}
F -->|Yes| D
F -->|No| G[Large Model Processing]
G --> D
Expected Cost:
With success probabilities \(s_1, s_2\) for the small and medium models and per-attempt costs \(c_1, c_2, c_3\), the expected cost of the cascade is
\[
E[C] = c_1 + (1 - s_1)\,c_2 + (1 - s_1)(1 - s_2)\,c_3
\]
Numerical Example:
Assuming \(s_1 = 0.7, s_2 = 0.2, c_1 = \$0.01, c_2 = \$0.10, c_3 = \$0.50\):
\[
E[C] = 0.01 + 0.3 \times 0.10 + 0.3 \times 0.8 \times 0.50 = \$0.16
\]
Direct use of the large model costs $0.50, so the cascade saves 68%.
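A minimal implementation sketch of the cascade (call_model and looks_successful are assumed callables for illustration; any LLM client and any verification check, such as a rule, a unit test, or an LLM judge, can be substituted):

from typing import Callable

def cascade(task: str,
            call_model: Callable[[str, str], str],
            looks_successful: Callable[[str], bool],
            tiers: tuple[str, ...] = ("gpt-4o-mini", "claude-sonnet", "claude-opus")) -> str:
    """Try cheaper models first and escalate only when the answer fails verification."""
    answer = ""
    for model in tiers:
        answer = call_model(model, task)
        if looks_successful(answer):
            return answer
    # All tiers failed verification; return the last (most capable) attempt
    return answer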
Token Optimization
Prompt Compression
Methods for reducing input token count:
| Method | Compression Rate | Quality Loss | Applicable Scenario |
|---|---|---|---|
| History summarization | 50-80% | Medium | Long conversations |
| Tool output truncation | 30-70% | Low-Medium | Large tool outputs |
| Selective context | 40-60% | Low | Multi-file references |
| Prompt pruning | 10-30% | Very low | All scenarios |
Conversation History Compression
class ConversationCompressor:
    def __init__(self, llm, count_tokens):
        self.llm = llm                    # summarization client (assumed to expose .summarize())
        self.count_tokens = count_tokens  # token-counting helper, e.g. built on tiktoken

    def compress(self, messages, max_tokens=4000):
        total_tokens = self.count_tokens(messages)
        if total_tokens <= max_tokens:
            return messages
        # Strategy: keep the most recent N turns and summarize the older conversation
        recent = messages[-6:]   # most recent 3 user/assistant turns
        old = messages[:-6]
        summary = self.llm.summarize(old)
        return [
            {"role": "system", "content": f"Conversation history summary: {summary}"},
            *recent,
        ]
Output Truncation
Cap the output length of each agent step so that a single verbose step cannot dominate the cost budget:
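A minimal sketch, assuming the Anthropic Python SDK; the per-step limit of 1,000 tokens is an arbitrary example value:

import anthropic

client = anthropic.Anthropic()

# Each agent step gets a hard cap on output tokens
STEP_OUTPUT_LIMIT = 1000

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=STEP_OUTPUT_LIMIT,  # output is truncated once the limit is reached
    system="Answer concisely. Do not repeat the question or restate tool output.",
    messages=[{"role": "user", "content": "Summarize the latest tool result."}],
)

# stop_reason == "max_tokens" signals that the cap truncated the output
if response.stop_reason == "max_tokens":
    print("Step output was truncated at the configured limit")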
Batch Processing
Asynchronous Batch Processing
Both OpenAI and Anthropic offer batch processing APIs at half price:
| Provider | Batch Discount | Completion Time | Applicable Scenario |
|---|---|---|---|
| OpenAI Batch API | 50% | Within 24 hours | Large-scale offline processing |
| Anthropic Batch | 50% | Within 24 hours | Batch analysis |
# OpenAI batch processing example
from openai import OpenAI

client = OpenAI()

# input_file_id refers to a previously uploaded JSONL file (purpose="batch")
batch = client.batches.create(
    input_file_id="file-xxx",
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
Applicable Scenarios:
- Large-scale data labeling
- Batch document analysis
- Offline evaluation and testing
- Tasks not requiring real-time responses
Cost Monitoring
Monitoring Metrics
cost_metrics = {
    "per_task_cost": {
        "description": "Average cost per task",
        "alert_threshold": "$1.00",
    },
    "daily_spend": {
        "description": "Daily total spend",
        "alert_threshold": "$100.00",
    },
    "cost_per_success": {
        "description": "Cost per successful task",
        "formula": "total_cost / successful_tasks",
    },
    "cache_hit_rate": {
        "description": "Cache hit rate",
        "target": "> 60%",
    },
    "model_distribution": {
        "description": "Usage proportion by model",
        "target": "cheap model > 70%",
    },
}
Cost Alerts
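The metrics above can drive simple threshold alerts. A minimal sketch (the record_cost helper, the notify hook, and the thresholds are illustrative assumptions mirroring the alert thresholds listed above):

from collections import defaultdict
from datetime import date

DAILY_BUDGET = 100.00     # mirrors the daily_spend alert threshold
PER_TASK_BUDGET = 1.00    # mirrors the per_task_cost alert threshold

daily_spend = defaultdict(float)

def record_cost(task_id: str, cost: float, notify=print) -> None:
    """Accumulate spend and fire alerts when thresholds are crossed."""
    today = date.today().isoformat()
    daily_spend[today] += cost

    if cost > PER_TASK_BUDGET:
        notify(f"[cost alert] task {task_id} cost ${cost:.2f} (> ${PER_TASK_BUDGET:.2f})")
    if daily_spend[today] > DAILY_BUDGET:
        notify(f"[cost alert] daily spend ${daily_spend[today]:.2f} exceeded ${DAILY_BUDGET:.2f}")

In production, notify would typically post to a pager or chat channel rather than printing.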
Practical Recommendations
- Measure before optimizing: Establish a cost baseline first, then optimize strategically
- Caching first: Prompt caching is the simplest and most effective optimization
- Routing second: Model routing can significantly reduce average costs
- Continuous monitoring: Cost monitoring should run continuously to detect anomalies promptly
- Quality-cost balance: Do not sacrifice quality excessively to reduce costs
References
- Anthropic. "Prompt Caching." 2024.
- OpenAI. "Batch API." 2024.
- Chen, L., et al. "FrugalGPT: How to Use Large Language Models While Reducing Cost." arXiv:2305.05176, 2023.
- Madaan, A., et al. "Automix: Automatically Mixing Language Models." arXiv:2310.12963, 2023.
Cross-references:
- Cost analysis → Cost-Benefit Analysis
- Deployment architecture → Deployment Architecture Overview