Cost Optimization and Caching
Overview
The operational cost of AI Agents comes primarily from LLM API calls, making cost optimization critical for large-scale deployment. This section covers practical techniques, including prompt caching, model routing, token optimization, and batch processing, that can substantially reduce operating costs while maintaining quality.
Prompt Caching
Anthropic Prompt Caching
Anthropic provides native prompt caching functionality:
- Cache hit: cached input tokens cost 90% less than the base input price
- Cache write: writing a prefix to the cache costs 25% more than the base input price
- Cache TTL: 5 minutes, refreshed on every hit
Cost Comparison:
| Operation | Claude Sonnet Base Price | Cache Hit Price | Savings |
|---|---|---|---|
| Input | $3.00 / 1M tokens | $0.30 / 1M tokens | 90% |
| Output | $15.00 / 1M tokens | $15.00 / 1M tokens | 0% |
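As a worked example (the 10,000-token prompt size and 100-request volume are hypothetical), consider a system prompt of 10,000 tokens reused across 100 requests. Without caching, the prefix costs \(100 \times 10{,}000 \times \$3.00/10^6 = \$3.00\). With caching, there is one cache write at the 25% surcharge, \(10{,}000 \times \$3.75/10^6 \approx \$0.04\), plus 99 cache hits at \(99 \times 10{,}000 \times \$0.30/10^6 \approx \$0.30\), for a total of roughly \$0.34, about an 89% reduction on the repeated prefix.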
Usage Scenario:
import anthropic

client = anthropic.Anthropic()

# Mark the long, stable system prompt (and, similarly, tool definitions) as cacheable
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # Extensive system instructions (defined elsewhere)
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_query}
    ]
)
# Subsequent requests with the same system prompt will hit the cache
OpenAI Automatic Caching
OpenAI automatically caches exact-match prompt prefixes:
- Auto-triggered: no additional configuration needed; caching applies to prompts of roughly 1,024 tokens or longer
- Discount: cached input tokens are billed at 50% of the normal input price
- Applicable models: GPT-4o, GPT-4o mini
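Because caching is automatic, the main thing to do in code is confirm it is happening. A minimal sketch, assuming a recent openai Python SDK that exposes usage.prompt_tokens_details (long_system_prompt and user_query are placeholders):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": long_system_prompt},  # stable prefix
        {"role": "user", "content": user_query},            # varying suffix
    ],
)

# On a cache hit, part of the prompt is reported as cached
usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens
print(f"prompt tokens: {usage.prompt_tokens}, cached: {cached}")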
Caching Strategy Design
graph TD
A[Agent Request] --> B{System prompt changed?}
B -->|No| C[Use cached prefix]
B -->|Yes| D[Update cache]
C --> E[Only send new user message]
D --> E
E --> F[LLM Inference]
F --> G[Cache new prefix]
Maximizing Cache Hit Rate:
- Place unchanging content at the beginning of the prompt (see the sketch after this list)
- Recommended ordering: system prompt → tool definitions → few-shot examples → conversation history → current query
- Keep system prompts and tool definitions byte-for-byte stable; any change invalidates the cached prefix
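A minimal sketch of this ordering (the file names and the build_request helper are illustrative assumptions, not tied to any specific SDK): the static prefix is loaded once and reused verbatim, while only the dynamic tail changes per request.

import json
from pathlib import Path

# Static prefix: loaded once at startup and reused verbatim so the provider can cache it
STATIC_SYSTEM = Path("system_prompt.txt").read_text()
STATIC_TOOLS = json.loads(Path("tool_definitions.json").read_text())
STATIC_EXAMPLES = json.loads(Path("few_shot_examples.json").read_text())

def build_request(history: list[dict], user_query: str) -> dict:
    """Assemble a request in the cache-friendly order:
    stable content first, per-request content last."""
    return {
        "system": STATIC_SYSTEM,
        "tools": STATIC_TOOLS,
        "messages": [
            *STATIC_EXAMPLES,                         # few-shot examples (stable)
            *history,                                 # conversation history (grows over time)
            {"role": "user", "content": user_query},  # current query (changes every call)
        ],
    }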
Model Routing
Intelligent Routing Strategy
Automatically select a model based on estimated task complexity, routing simple requests to cheap models and reserving top-tier models for genuinely hard tasks.
Routing Implementation
class ModelRouter:
    def __init__(self):
        # Approximate input prices in $ per 1M tokens
        self.models = {
            "simple": {"name": "gpt-4o-mini", "cost": 0.15},
            "medium": {"name": "claude-sonnet", "cost": 3.00},
            "complex": {"name": "claude-opus", "cost": 15.00},
        }

    def route(self, task: str) -> dict:
        # Select a model tier based on estimated task complexity
        complexity = self.estimate_complexity(task)
        if complexity < 0.3:
            return self.models["simple"]
        elif complexity < 0.7:
            return self.models["medium"]
        else:
            return self.models["complex"]

    def estimate_complexity(self, task: str) -> float:
        """Estimate task complexity in [0, 1].

        Useful signals include:
        - Number of reasoning steps needed
        - Whether tool use is required
        - Whether code generation is involved
        - Historical failure rate for similar tasks

        The keyword/length heuristic below is only a placeholder.
        """
        score = min(len(task) / 2000, 0.5)  # longer tasks tend to be harder
        hard_keywords = ("refactor", "prove", "debug", "architecture", "multi-step")
        if any(keyword in task.lower() for keyword in hard_keywords):
            score += 0.4
        return min(score, 1.0)
Cascade Strategy
Try with a cheap model first, escalate on failure:
graph LR
A[Task] --> B[Small Model Attempt]
B --> C{Success?}
C -->|Yes| D[Return Result]
C -->|No| E[Medium Model Attempt]
E --> F{Success?}
F -->|Yes| D
F -->|No| G[Large Model Processing]
G --> D
Expected Cost:
With success probabilities \(s_1, s_2\) for the small and medium models and per-attempt costs \(c_1, c_2, c_3\), the expected cost of the cascade is
\[
E[C] = c_1 + (1 - s_1)\,c_2 + (1 - s_1)(1 - s_2)\,c_3
\]
Numerical Example:
Assuming \(s_1 = 0.7, s_2 = 0.2, c_1 = \$0.01, c_2 = \$0.10, c_3 = \$0.50\):
\[
E[C] = 0.01 + 0.3 \times 0.10 + 0.3 \times 0.8 \times 0.50 = \$0.16
\]
Direct use of the large model costs $0.50, so the cascade saves 68%.
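A minimal implementation sketch of the cascade (call_model and looks_successful are assumed callables for illustration; any LLM client and any verification check, such as a rule, a unit test, or an LLM judge, can be substituted):

from typing import Callable

def cascade(task: str,
            call_model: Callable[[str, str], str],
            looks_successful: Callable[[str], bool],
            tiers: tuple[str, ...] = ("gpt-4o-mini", "claude-sonnet", "claude-opus")) -> str:
    """Try cheaper models first and escalate only when the answer fails verification."""
    answer = ""
    for model in tiers:
        answer = call_model(model, task)
        if looks_successful(answer):
            return answer
    # All tiers failed verification; return the last (most capable) attempt
    return answer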
Token Optimization
Prompt Compression
Methods for reducing input token count:
| Method | Compression Rate | Quality Loss | Applicable Scenario |
|---|---|---|---|
| History summarization | 50-80% | Medium | Long conversations |
| Tool output truncation | 30-70% | Low-Medium | Large tool outputs |
| Selective context | 40-60% | Low | Multi-file references |
| Prompt pruning | 10-30% | Very low | All scenarios |
Conversation History Compression
class ConversationCompressor:
    def __init__(self, llm, count_tokens):
        self.llm = llm                    # summarization client (assumed to expose .summarize())
        self.count_tokens = count_tokens  # token-counting helper, e.g. built on tiktoken

    def compress(self, messages, max_tokens=4000):
        total_tokens = self.count_tokens(messages)
        if total_tokens <= max_tokens:
            return messages
        # Strategy: keep the most recent N turns and summarize the older conversation
        recent = messages[-6:]   # most recent 3 user/assistant turns
        old = messages[:-6]
        summary = self.llm.summarize(old)
        return [
            {"role": "system", "content": f"Conversation history summary: {summary}"},
            *recent,
        ]
Output Truncation
Cap the output length of each agent step so that a single verbose step cannot dominate the cost budget:
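A minimal sketch, assuming the Anthropic Python SDK; the per-step limit of 1,000 tokens is an arbitrary example value:

import anthropic

client = anthropic.Anthropic()

# Each agent step gets a hard cap on output tokens
STEP_OUTPUT_LIMIT = 1000

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=STEP_OUTPUT_LIMIT,  # output is truncated once the limit is reached
    system="Answer concisely. Do not repeat the question or restate tool output.",
    messages=[{"role": "user", "content": "Summarize the latest tool result."}],
)

# stop_reason == "max_tokens" signals that the cap truncated the output
if response.stop_reason == "max_tokens":
    print("Step output was truncated at the configured limit")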
Batch Processing
Asynchronous Batch Processing
Both OpenAI and Anthropic offer batch processing APIs at half price:
| Provider | Batch Discount | Completion Time | Applicable Scenario |
|---|---|---|---|
| OpenAI Batch API | 50% | Within 24 hours | Large-scale offline processing |
| Anthropic Batch | 50% | Within 24 hours | Batch analysis |
# OpenAI batch processing example
from openai import OpenAI

client = OpenAI()

# input_file_id refers to a previously uploaded JSONL file (purpose="batch")
batch = client.batches.create(
    input_file_id="file-xxx",
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
Applicable Scenarios:
- Large-scale data labeling
- Batch document analysis
- Offline evaluation and testing
- Tasks not requiring real-time responses
Cost Monitoring
Monitoring Metrics
cost_metrics = {
    "per_task_cost": {
        "description": "Average cost per task",
        "alert_threshold": "$1.00",
    },
    "daily_spend": {
        "description": "Daily total spend",
        "alert_threshold": "$100.00",
    },
    "cost_per_success": {
        "description": "Cost per successful task",
        "formula": "total_cost / successful_tasks",
    },
    "cache_hit_rate": {
        "description": "Cache hit rate",
        "target": "> 60%",
    },
    "model_distribution": {
        "description": "Usage proportion by model",
        "target": "cheap model > 70%",
    },
}
Cost Alerts
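The metrics above can drive simple threshold alerts. A minimal sketch (the record_cost helper, the notify hook, and the thresholds are illustrative assumptions mirroring the alert thresholds listed above):

from collections import defaultdict
from datetime import date

DAILY_BUDGET = 100.00     # mirrors the daily_spend alert threshold
PER_TASK_BUDGET = 1.00    # mirrors the per_task_cost alert threshold

daily_spend = defaultdict(float)

def record_cost(task_id: str, cost: float, notify=print) -> None:
    """Accumulate spend and fire alerts when thresholds are crossed."""
    today = date.today().isoformat()
    daily_spend[today] += cost

    if cost > PER_TASK_BUDGET:
        notify(f"[cost alert] task {task_id} cost ${cost:.2f} (> ${PER_TASK_BUDGET:.2f})")
    if daily_spend[today] > DAILY_BUDGET:
        notify(f"[cost alert] daily spend ${daily_spend[today]:.2f} exceeded ${DAILY_BUDGET:.2f}")

In production, notify would typically post to a pager or chat channel rather than printing.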
Practical Recommendations
- Measure before optimizing: Establish a cost baseline first, then optimize strategically
- Caching first: Prompt caching is the simplest and most effective optimization
- Routing second: Model routing can significantly reduce average costs
- Continuous monitoring: Cost monitoring should run continuously to detect anomalies promptly
- Quality-cost balance: Do not sacrifice quality excessively to reduce costs
References
- Anthropic. "Prompt Caching." 2024.
- OpenAI. "Batch API." 2024.
- Chen, L., et al. "FrugalGPT: How to Use Large Language Models While Reducing Cost." arXiv:2305.05176, 2023.
- Madaan, A., et al. "Automix: Automatically Mixing Language Models." arXiv:2310.12963, 2023.
Cross-references:
- Cost analysis → Cost-Benefit Analysis
- Deployment architecture → Deployment Architecture Overview