
Cost Optimization and Caching

Overview

The operational cost of AI Agents primarily comes from LLM API calls, making cost optimization critical for large-scale deployment. This section introduces practical techniques including prompt caching, model routing, and token optimization to help significantly reduce operational costs while maintaining quality.

Prompt Caching

Anthropic Prompt Caching

Anthropic provides native prompt caching functionality:

  • Cache hit: Input price reduced by 90%
  • Cache write: 25% surcharge on the input price when the prefix is first cached
  • Cache TTL: 5 minutes (refreshed on each hit)

Cost Comparison:

| Operation | Claude Sonnet Normal Price | Cache Hit Price | Savings |
|---|---|---|---|
| Input | $3.00 / 1M tokens | $0.30 / 1M tokens | 90% |
| Output | $15.00 / 1M tokens | $15.00 / 1M tokens | 0% |
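
Worked example (illustrative numbers): suppose a 10,000-token system prompt is reused across 100 requests, all within the cache TTL. Without caching, the prefix alone costs \(100 \times 10{,}000 \times \$3.00/10^6 = \$3.00\). With caching, you pay one cache write at the 25% surcharge ($3.75/1M) plus 99 discounted hits:

\[ 10{,}000 \times \frac{3.75}{10^6} + 99 \times 10{,}000 \times \frac{0.30}{10^6} \approx \$0.33 \]

roughly an 89% saving on the cached prefix; output tokens are billed at the normal rate either way.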

Usage Scenario:

# Mark the system prompt (and any tool definitions) as cacheable
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # extensive, stable system instructions
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_query}
    ]
)
# Subsequent requests with the same system prompt will hit the cache

OpenAI Automatic Caching

OpenAI automatically caches prompt prefixes that exactly match a recent request (prompts of roughly 1,024 tokens or more):

  • Auto-triggered: no additional configuration needed; cache hits are reported in the response usage (see the sketch below)
  • Discount: cached input tokens are billed at half price
  • Applicable models: GPT-4o, GPT-4o mini
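
A minimal sketch of verifying automatic cache hits, assuming the openai Python SDK and that cached prefix tokens are reported under usage.prompt_tokens_details (long_system_prompt and user_query are placeholders, as in the earlier example):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": long_system_prompt},  # stable prefix
        {"role": "user", "content": user_query},            # volatile tail
    ],
)

# cached_tokens is 0 (or the details object is absent) on a cache miss
details = response.usage.prompt_tokens_details
cached = details.cached_tokens if details else 0
print(f"{cached}/{response.usage.prompt_tokens} prompt tokens served from cache")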

Caching Strategy Design

graph TD
    A[Agent Request] --> B{System prompt changed?}
    B -->|No| C[Use cached prefix]
    B -->|Yes| D[Update cache]
    C --> E[Only send new user message]
    D --> E
    E --> F[LLM Inference]
    F --> G[Cache new prefix]

Maximizing Cache Hit Rate:

  1. Place unchanging content at the beginning of the prompt
  2. Order content as: system prompt → tool definitions → few-shot examples → conversation history → current query (see the sketch after this list)
  3. Keep system prompts and tool definitions stable and unchanged
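
A minimal sketch of that ordering, provider-agnostic and with illustrative helper names (for simplicity everything is treated as plain text; in practice tool definitions go in the SDK's tools parameter):

def build_prompt(system_prompt, tool_definitions, few_shot_examples,
                 conversation_history, current_query):
    """Order content from most stable to most volatile so the longest
    possible prefix stays byte-identical across requests."""
    messages = [
        # 1. Stable prefix: keep byte-identical between requests
        #    (no timestamps, no per-user data), or the cache is invalidated.
        {"role": "system",
         "content": f"{system_prompt}\n\n{tool_definitions}\n\n{few_shot_examples}"},
        # 2. Semi-stable middle: history grows append-only, so earlier
        #    turns still match the cached prefix.
        *conversation_history,
        # 3. Volatile tail: only the current query changes per request.
        {"role": "user", "content": current_query},
    ]
    return messages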

Model Routing

Intelligent Routing Strategy

Automatically select models based on task complexity:

\[ \text{Model} = \begin{cases} M_{\text{cheap}} & \text{if } \text{difficulty}(\text{task}) \leq \theta_1 \\ M_{\text{medium}} & \text{if } \theta_1 < \text{difficulty}(\text{task}) \leq \theta_2 \\ M_{\text{expensive}} & \text{if } \text{difficulty}(\text{task}) > \theta_2 \end{cases} \]

Routing Implementation

class ModelRouter:
    def __init__(self):
        # Approximate input price in USD per 1M tokens
        self.models = {
            "simple": {"name": "gpt-4o-mini", "cost": 0.15},
            "medium": {"name": "claude-sonnet", "cost": 3.00},
            "complex": {"name": "claude-opus", "cost": 15.00},
        }

    def route(self, task):
        # Select a model tier based on estimated task complexity
        complexity = self.estimate_complexity(task)

        if complexity < 0.3:
            return self.models["simple"]
        elif complexity < 0.7:
            return self.models["medium"]
        else:
            return self.models["complex"]

    def estimate_complexity(self, task):
        """
        Estimate task complexity in [0, 1]. Factors worth scoring:
        - Number of reasoning steps needed
        - Whether tool use is required
        - Whether code generation is involved
        - Historical failure rate for similar tasks

        The keyword heuristic below is illustrative only.
        """
        text = task.lower()
        score = 0.0
        if len(text) > 500:                                    # long tasks tend to be harder
            score += 0.3
        if any(k in text for k in ("implement", "debug", "refactor")):
            score += 0.4                                       # code generation involved
        if any(k in text for k in ("search", "browse", "api")):
            score += 0.3                                       # likely needs tool use
        return min(score, 1.0)
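
A quick usage sketch (made-up queries; with the illustrative heuristic above, the first routes to the cheap tier and the second to the expensive one):

router = ModelRouter()

print(router.route("Translate 'hello' into French.")["name"])
# -> gpt-4o-mini: short task, no code or tool-use keywords

print(router.route(
    "Implement and debug a rate limiter; search the API docs for backoff rules."
)["name"])
# -> claude-opus: code generation (0.4) + tool use (0.3) pushes complexity to 0.7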

Cascade Strategy

Try with a cheap model first, escalate on failure:

graph LR
    A[Task] --> B[Small Model Attempt]
    B --> C{Success?}
    C -->|Yes| D[Return Result]
    C -->|No| E[Medium Model Attempt]
    E --> F{Success?}
    F -->|Yes| D
    F -->|No| G[Large Model Processing]
    G --> D

Expected Cost:

\[ E[C] = c_1 + (1-s_1) \cdot c_2 + (1-s_1)(1-s_2) \cdot c_3 \]

where \(c_i\) is the cost of the \(i\)-th model attempt and \(s_i\) its probability of succeeding.

Numerical Example:

Assuming \(s_1 = 0.7, s_2 = 0.2, c_1 = \$0.01, c_2 = \$0.10, c_3 = \$0.50\):

\[ E[C] = 0.01 + 0.3 \times 0.10 + 0.3 \times 0.8 \times 0.50 = 0.01 + 0.03 + 0.12 = \$0.16 \]

Using the large model directly would cost $0.50 per task, so the cascade saves 68% on average.
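
A minimal sketch of the cascade, assuming two caller-supplied hooks: call_model wraps whatever SDK is in use, and is_good_enough is a cheap verifier (a rule check, a self-consistency vote, or a grader prompt):

# Ordered from cheapest to most expensive; names are illustrative.
CASCADE = ["gpt-4o-mini", "claude-sonnet", "claude-opus"]

def cascade_answer(task, call_model, is_good_enough):
    for model_name in CASCADE[:-1]:
        answer = call_model(model_name, task)
        if is_good_enough(task, answer):  # stop at the first acceptable answer
            return answer
    # Final tier: accept the expensive model's answer unconditionally.
    return call_model(CASCADE[-1], task)

The verifier is what makes the expected-cost math above hold: one that accepts bad answers erodes quality, while one that rejects good answers erodes the savings.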

Token Optimization

Prompt Compression

Methods for reducing input token count:

| Method | Compression Rate | Quality Loss | Applicable Scenario |
|---|---|---|---|
| History summarization | 50-80% | Medium | Long conversations |
| Tool output truncation | 30-70% | Low-Medium | Large tool outputs |
| Selective context | 40-60% | Low | Multi-file references |
| Prompt pruning | 10-30% | Very low | All scenarios |
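
Of the methods in the table, tool output truncation is the easiest to sketch: keep the head and tail of a large tool result and elide the middle (a character-based cut for simplicity; a real implementation would count tokens):

def truncate_tool_output(output: str, max_chars: int = 4000) -> str:
    """Keep the start and end of a long tool result, elide the middle."""
    if len(output) <= max_chars:
        return output
    head = output[: max_chars // 2]
    tail = output[-(max_chars // 2):]
    omitted = len(output) - len(head) - len(tail)
    return f"{head}\n...[{omitted} characters omitted]...\n{tail}"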

Conversation History Compression

class ConversationCompressor:
    def __init__(self, llm, count_tokens):
        self.llm = llm                    # any client exposing a summarize() helper
        self.count_tokens = count_tokens  # e.g. a tiktoken-based counter

    def compress(self, messages, max_tokens=4000):
        total_tokens = self.count_tokens(messages)

        if total_tokens <= max_tokens:
            return messages

        # Strategy 1: keep the most recent N turns and summarize the older conversation
        recent = messages[-6:]   # most recent 3 turns (user + assistant pairs)
        old = messages[:-6]

        summary = self.llm.summarize(old)

        return [
            {"role": "system", "content": f"Conversation history summary: {summary}"},
            *recent,
        ]

Output Truncation

Controlling the output length per agent step:

\[ \text{Output Budget} = \min(t_{\text{max}}, \frac{C_{\text{budget}} - C_{\text{current}}}{p_{\text{output}}}) \]
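
A direct translation of the formula into code (variable names are illustrative):

def output_token_budget(t_max, cost_budget, cost_spent, output_price_per_token):
    """Cap max_tokens so the remaining monetary budget cannot be exceeded."""
    affordable = (cost_budget - cost_spent) / output_price_per_token
    return max(0, min(t_max, int(affordable)))

# $0.50 of budget left at $15 per 1M output tokens buys ~33k tokens,
# so the per-step cap of 2048 is the binding constraint here.
print(output_token_budget(2048, 1.00, 0.50, 15.00 / 1_000_000))  # -> 2048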

Batch Processing

Asynchronous Batch Processing

Both OpenAI and Anthropic offer batch processing APIs at half price:

| Provider | Batch Discount | Completion Time | Applicable Scenario |
|---|---|---|---|
| OpenAI Batch API | 50% | Within 24 hours | Large-scale offline processing |
| Anthropic Batch | 50% | Within 24 hours | Batch analysis |

# OpenAI batch processing example
from openai import OpenAI

client = OpenAI()

# input_file_id refers to a JSONL file of requests previously uploaded
# with client.files.create(..., purpose="batch")
batch = client.batches.create(
    input_file_id="file-xxx",
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
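
The batch then runs asynchronously; a sketch of polling for completion and reading the results, assuming the same openai client:

import time

while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)  # poll infrequently; completion can take hours

if batch.status == "completed":
    # One JSONL line per request, in the batch's output file
    results = client.files.content(batch.output_file_id)
    print(results.text[:500])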

Applicable Scenarios:

  • Large-scale data labeling
  • Batch document analysis
  • Offline evaluation and testing
  • Tasks not requiring real-time responses

Cost Monitoring

Monitoring Metrics

cost_metrics = {
    "per_task_cost": {
        "description": "Average cost per task",
        "alert_threshold": "$1.00",
    },
    "daily_spend": {
        "description": "Daily total spend",
        "alert_threshold": "$100.00",
    },
    "cost_per_success": {
        "description": "Cost per successful task",
        "formula": "total_cost / successful_tasks",
    },
    "cache_hit_rate": {
        "description": "Cache hit rate",
        "target": "> 60%",
    },
    "model_distribution": {
        "description": "Usage proportion by model",
        "target": "cheap model > 70%",
    },
}
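
A minimal tracker sketch that computes three of the metrics above from per-task records (field names are illustrative):

class CostTracker:
    def __init__(self):
        self.total_cost = 0.0
        self.tasks = 0
        self.successes = 0
        self.cached_tokens = 0
        self.prompt_tokens = 0

    def record(self, cost, success, prompt_tokens, cached_tokens):
        self.total_cost += cost
        self.tasks += 1
        self.successes += int(success)
        self.prompt_tokens += prompt_tokens
        self.cached_tokens += cached_tokens

    def report(self):
        return {
            "per_task_cost": self.total_cost / max(self.tasks, 1),
            "cost_per_success": self.total_cost / max(self.successes, 1),
            "cache_hit_rate": self.cached_tokens / max(self.prompt_tokens, 1),
        }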

Cost Alerts

\[ \text{Alert} = \begin{cases} \text{Warning} & \text{if daily\_cost} > 0.8 \times \text{budget} \\ \text{Critical} & \text{if daily\_cost} > \text{budget} \\ \text{Shutdown} & \text{if monthly\_cost} > \text{hard\_limit} \end{cases} \]
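
The same rules expressed as code (budget values are whatever the deployment sets):

def alert_level(daily_cost, monthly_cost, daily_budget, monthly_hard_limit):
    """Map current spend to an alert level per the rules above."""
    if monthly_cost > monthly_hard_limit:
        return "shutdown"
    if daily_cost > daily_budget:
        return "critical"
    if daily_cost > 0.8 * daily_budget:
        return "warning"
    return "ok"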

Practical Recommendations

  1. Measure before optimizing: Establish a cost baseline first, then optimize strategically
  2. Caching first: Prompt caching is the simplest and most effective optimization
  3. Routing second: Model routing can significantly reduce average costs
  4. Continuous monitoring: Cost monitoring should run continuously to detect anomalies promptly
  5. Quality-cost balance: Do not sacrifice quality excessively to reduce costs

References

  1. Anthropic. "Prompt Caching." 2024.
  2. OpenAI. "Batch API." 2024.
  3. Chen, L., et al. "FrugalGPT: How to Use Large Language Models While Reducing Cost." arXiv:2305.05176, 2023.
  4. Madaan, A., et al. "Automix: Automatically Mixing Language Models." arXiv:2310.12963, 2023.

Cross-references:

  • Cost analysis → Cost-Benefit Analysis
  • Deployment architecture → Deployment Architecture Overview

