Observability and Monitoring

Overview

Observability for AI Agent systems is more challenging than for traditional software. The non-deterministic behavior, multi-step execution, and external tool calls characteristic of agents make debugging and monitoring complex. This section introduces agent-specific observability tools and best practices.

The Three Pillars of Observability

graph TD
    A[Agent Observability] --> B[Tracing]
    A --> C[Metrics]
    A --> D[Logging]

    B --> B1[End-to-end execution traces]
    B --> B2[Per-step LLM call details]
    B --> B3[Tool call tracing]

    C --> C1[Latency distribution]
    C --> C2[Token usage]
    C --> C3[Success rate / Error rate]

    D --> D1[Agent reasoning process]
    D --> D2[Tool inputs and outputs]
    D --> D3[Error stack traces]

    style A fill:#e3f2fd

Specialized Tracing Tools

LangSmith

The official tracing and evaluation platform from LangChain:

Core Features:

| Feature | Description |
| --- | --- |
| Tracing | Complete agent execution trace visualization |
| Evaluation | Custom evaluators, batch evaluation |
| Datasets | Test dataset management |
| Monitoring | Real-time monitoring in production environments |
| Playground | Online debugging of prompts and agents |

Trace Data Structure:

# LangSmith automatically traces agent execution
from langsmith import traceable

@traceable(name="agent_task")
def run_agent(task):
    # Each LLM call and tool call is recorded as a child run
    plan = llm.plan(task)               # → trace: planning
    feedback = None
    for step in plan:
        result = tool.execute(step)     # → trace: tool_call
        feedback = llm.reflect(result)  # → trace: reflection
    return feedback

Trace Information Includes:

  • Complete input/output content
  • Token usage (input/output/total)
  • Latency (time to first token, total time)
  • Model and parameter information
  • Parent-child relationships (call chain)

Langfuse

An open-source LLM observability platform:

Features:

  • Fully open-source, supports self-hosting
  • Compatible with OpenAI, Anthropic, LangChain, etc.
  • Provides Web UI for trace analysis
  • Supports custom scoring and annotations

from langfuse import Langfuse

langfuse = Langfuse()

# Create a trace
trace = langfuse.trace(name="agent_run", user_id="user_123")

# Record LLM call
generation = trace.generation(
    name="planning",
    model="claude-sonnet-4-20250514",
    input=messages,
    output=response,
    usage={"input": 1500, "output": 300}
)

# Record tool call
span = trace.span(
    name="web_search",
    input={"query": "AI agents 2024"},
    output={"results": [...]}
)

Other Tools

| Tool | Type | Features |
| --- | --- | --- |
| Arize Phoenix | Open-source | Strong visualization, embedding analysis support |
| Helicone | SaaS | Proxy layer, zero-code integration |
| Portkey | SaaS | Multi-model gateway + monitoring |
| Braintrust | SaaS | Evaluation + monitoring |

OpenTelemetry for Agents

Agent Tracing Standards

Extending OpenTelemetry standards to agent systems:

from opentelemetry import trace

tracer = trace.get_tracer("agent-service")

def agent_step(task, step_num, model_name, tool_name, params):
    with tracer.start_as_current_span("agent_step") as span:
        span.set_attribute("agent.step_number", step_num)
        span.set_attribute("agent.model", model_name)

        # LLM call: record model and token usage on a child span
        with tracer.start_as_current_span("llm_call") as llm_span:
            llm_span.set_attribute("llm.model", model_name)
            response = call_llm(task)
            llm_span.set_attribute("llm.tokens.input", response.input_tokens)
            llm_span.set_attribute("llm.tokens.output", response.output_tokens)

        # Tool call: record the tool name and outcome on a child span
        with tracer.start_as_current_span("tool_call") as tool_span:
            tool_span.set_attribute("tool.name", tool_name)
            result = call_tool(tool_name, params)
            tool_span.set_attribute("tool.success", True)

        return result

Semantic Conventions

OpenTelemetry semantic conventions for agent systems (proposed):

| Attribute | Description |
| --- | --- |
| gen_ai.system | AI system (openai, anthropic) |
| gen_ai.request.model | Requested model name |
| gen_ai.usage.input_tokens | Input token count |
| gen_ai.usage.output_tokens | Output token count |
| gen_ai.agent.step | Agent step number |
| gen_ai.tool.name | Tool name used |
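
Applying these conventions can be as simple as building an attribute dictionary with the standardized keys and passing each entry to `span.set_attribute`. A minimal sketch, assuming an Anthropic backend (the `gen_ai_attributes` helper and its parameters are illustrative, not part of any library):

```python
def gen_ai_attributes(step, model, input_tokens, output_tokens, tool_name):
    # Keys follow the proposed gen_ai semantic conventions in the table above.
    return {
        "gen_ai.system": "anthropic",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.agent.step": step,
        "gen_ai.tool.name": tool_name,
    }

attrs = gen_ai_attributes(
    step=1,
    model="claude-sonnet-4-20250514",
    input_tokens=1500,
    output_tokens=300,
    tool_name="web_search",
)
```

Each key/value pair would then be set on the active span, as in the earlier OpenTelemetry example.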

Key Monitoring Metrics

Core Metrics

| Metric | Calculation | Alert Threshold (example) |
| --- | --- | --- |
| P50/P95/P99 latency | Response time percentiles | P95 > 30s |
| Token usage | Total tokens per task | > 100K tokens |
| Error rate | Failures / Total | > 5% |
| Task completion rate | Successes / Total | < 90% |
| Cost per task | Total cost / Task count | > $1.00 |
| Tool call count | Average calls per task | > 20 calls |
| Cache hit rate | Hits / Total requests | < 50% |
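
Most of these metrics reduce to simple aggregates over per-task records. A minimal sketch, assuming each record carries latency, token, cost, and success fields (the record schema and nearest-rank percentile are illustrative choices; production systems typically use a metrics backend instead):

```python
def percentile(values, p):
    """Nearest-rank percentile over a non-empty sample."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
    return ordered[idx]

def core_metrics(records):
    """Aggregate per-task records into the core metrics from the table."""
    total = len(records)
    successes = sum(1 for r in records if r["success"])
    return {
        "p95_latency_s": percentile([r["latency_s"] for r in records], 95),
        "avg_tokens": sum(r["tokens"] for r in records) / total,
        "error_rate": 1 - successes / total,
        "cost_per_task": sum(r["cost"] for r in records) / total,
    }

records = [
    {"latency_s": 2.1, "tokens": 1800, "cost": 0.04, "success": True},
    {"latency_s": 35.0, "tokens": 9200, "cost": 0.31, "success": False},
    {"latency_s": 4.7, "tokens": 2600, "cost": 0.06, "success": True},
    {"latency_s": 3.3, "tokens": 2100, "cost": 0.05, "success": True},
]
metrics = core_metrics(records)
```

With the sample above, the P95 latency breaches the example 30s threshold while the error rate (25%) trips the 5% alert.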

Business Metrics

| Metric | Description |
| --- | --- |
| User satisfaction | CSAT score |
| Task avoidance rate | Proportion of tasks the agent refuses to handle |
| Human takeover rate | Proportion of tasks requiring human transfer |
| Repeat task rate | Proportion of identical tasks resubmitted |

Alerting Strategy

Tiered Alerts

graph TD
    A[Monitoring Metrics] --> B{Anomaly Detection}
    B -->|Minor| C[Info Log]
    B -->|Moderate| D[Warning Notification]
    B -->|Severe| E[Critical Alert]
    B -->|Emergency| F[Emergency Handling]

    C --> G[Record]
    D --> H[Slack Notification]
    E --> I[PagerDuty Alert]
    F --> J[Auto Circuit-break + Emergency Notice]
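
The tiers above can be sketched as a severity-to-channel routing table. The handler names here are illustrative stand-ins for the real integrations (Slack webhook, PagerDuty API, circuit breaker):

```python
# Routing table mirroring the tiers in the diagram above.
SEVERITY_ROUTES = {
    "minor": ["log"],
    "moderate": ["log", "slack"],
    "severe": ["log", "slack", "pagerduty"],
    "emergency": ["log", "slack", "pagerduty", "circuit_break"],
}

def route_alert(severity, message):
    """Return the channels this alert fans out to (unknown severity → log only)."""
    actions = SEVERITY_ROUTES.get(severity, ["log"])
    for action in actions:
        # Each action would invoke its real integration here;
        # this sketch only decides the routing.
        _ = (action, message)
    return actions

channels = route_alert("severe", "P95 latency > 30s")
```

Unknown severities fall back to logging only, so a misclassified alert is never silently dropped.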

Anomaly Detection

\[ \text{Anomaly Score}(x) = \frac{|x - \mu|}{\sigma} \]

Trigger alerts when the score exceeds a threshold (e.g., \(3\sigma\)).
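
The z-score above is straightforward to compute against a window of historical samples (the sample data and 3-sigma threshold here are illustrative):

```python
import statistics

def anomaly_score(x, history):
    """Z-score of x against a historical sample, per the formula above."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)  # sample standard deviation
    return abs(x - mu) / sigma

# Example: latencies hover around 10s, then a 30s outlier arrives
history = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
score = anomaly_score(30.0, history)
alert = score > 3  # trigger at the 3-sigma threshold
```

In production, `history` would typically be a rolling window, and `sigma` needs a floor to avoid division by zero on constant metrics.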

Intelligent Alerting

  • Trend alerts: Early warning when metrics continuously deteriorate
  • Correlated alerts: Merge alerts when multiple metrics are simultaneously anomalous
  • Adaptive thresholds: Automatically adjust thresholds based on historical data
  • Alert convergence: Prevent alert storms

Log Aggregation

Structured Logging

import structlog

logger = structlog.get_logger()

def agent_step(task_id, step_num, action):
    logger.info(
        "agent_step_executed",
        task_id=task_id,
        step_num=step_num,
        action=action.type,
        tool=action.tool_name,
        input_tokens=action.input_tokens,
        output_tokens=action.output_tokens,
        latency_ms=action.latency_ms,
        success=action.success,
    )

Log Level Design

| Level | Content | When Enabled |
| --- | --- | --- |
| DEBUG | Complete LLM inputs and outputs | Development only |
| INFO | Step summaries and key decisions | Production |
| WARNING | Retries, degradation, anomalies | Always |
| ERROR | Failures, abnormal termination | Always |
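
The table above maps naturally onto standard log-level configuration. A minimal sketch using the stdlib `logging` levels (the environment names and `APP_ENV` variable are illustrative):

```python
import logging
import os

# Per-environment floor, following the table above. WARNING and ERROR
# records pass in every environment because they sit above both floors.
LEVEL_BY_ENV = {
    "development": logging.DEBUG,  # full LLM inputs and outputs
    "production": logging.INFO,    # step summaries and key decisions
}

def log_level_for(env):
    """Default unknown environments to the stricter production floor."""
    return LEVEL_BY_ENV.get(env, logging.INFO)

logging.basicConfig(level=log_level_for(os.getenv("APP_ENV", "production")))
```

The same mapping works for structlog via its level-filtering bound logger.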

Practical Recommendations

  1. Start with tracing: Implement complete execution trace tracking first
  2. Focus on cost: Make token spend and per-task cost first-class metrics from day one
  3. Differentiate environments: More detailed logging in development, moderate in production
  4. Protect privacy: Redact sensitive user information in logs
  5. Regular review: Periodically review monitoring data to discover optimization opportunities
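
For recommendation 4, redaction fits naturally as a log processor that scrubs the event before emission. A sketch compatible with structlog's processor signature (the sensitive key names and email pattern are illustrative):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SENSITIVE_KEYS = {"api_key", "password", "ssn"}

def redact(logger, method_name, event_dict):
    """Scrub sensitive keys and email addresses from a log event dict."""
    for key, value in event_dict.items():
        if key in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
        elif isinstance(value, str):
            event_dict[key] = EMAIL_RE.sub("[EMAIL]", value)
    return event_dict

event = redact(None, "info", {"api_key": "sk-123", "note": "contact alice@example.com"})
```

Registered in the structlog processor chain, this runs on every event, so raw secrets never reach the log sink.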

References

  1. LangChain. "LangSmith Documentation." 2024.
  2. Langfuse. "Open Source LLM Engineering Platform." 2024.
  3. OpenTelemetry. "Semantic Conventions for GenAI." 2024.

Cross-references: - Cost monitoring → Cost Optimization and Caching - Reliability metrics → Reliability and Robustness
