Observability and Monitoring
Overview
Observability for AI Agent systems is more challenging than for traditional software. The non-deterministic behavior, multi-step execution, and external tool calls characteristic of agents make debugging and monitoring complex. This section introduces agent-specific observability tools and best practices.
The Three Pillars of Observability
```mermaid
graph TD
    A[Agent Observability] --> B[Tracing]
    A --> C[Metrics]
    A --> D[Logging]
    B --> B1[End-to-end execution traces]
    B --> B2[Per-step LLM call details]
    B --> B3[Tool call tracing]
    C --> C1[Latency distribution]
    C --> C2[Token usage]
    C --> C3[Success rate / Error rate]
    D --> D1[Agent reasoning process]
    D --> D2[Tool inputs and outputs]
    D --> D3[Error stack traces]
    style A fill:#e3f2fd
```
Specialized Tracing Tools
LangSmith
The official tracing and evaluation platform from LangChain:
Core Features:
| Feature | Description |
|---|---|
| Tracing | Complete agent execution trace visualization |
| Evaluation | Custom evaluators, batch evaluation |
| Datasets | Test dataset management |
| Monitoring | Production environment real-time monitoring |
| Playground | Online debugging of prompts and agents |
Trace Data Structure:
```python
# LangSmith auto-traces agent execution (llm and tool are assumed
# to be the application's own model and tool wrappers)
from langsmith import traceable

@traceable(name="agent_task")
def run_agent(task):
    # Each LLM call and tool call is automatically recorded
    plan = llm.plan(task)               # → trace: planning
    results = []
    for step in plan:
        result = tool.execute(step)     # → trace: tool_call
        feedback = llm.reflect(result)  # → trace: reflection
        results.append((result, feedback))
    return results
```
Trace Information Includes:
- Complete input/output content
- Token usage (input/output/total)
- Latency (time to first token, total time)
- Model and parameter information
- Parent-child relationships (call chain)
Langfuse
An open-source LLM observability platform:
Features:
- Fully open-source, supports self-hosting
- Compatible with OpenAI, Anthropic, LangChain, etc.
- Provides Web UI for trace analysis
- Supports custom scoring and annotations
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Create a trace
trace = langfuse.trace(name="agent_run", user_id="user_123")

# Record an LLM call
generation = trace.generation(
    name="planning",
    model="claude-sonnet-4-20250514",
    input=messages,
    output=response,
    usage={"input": 1500, "output": 300},
)

# Record a tool call
span = trace.span(
    name="web_search",
    input={"query": "AI agents 2024"},
    output={"results": [...]},
)
```
Other Tools
| Tool | Type | Features |
|---|---|---|
| Arize Phoenix | Open-source | Strong visualization, embedding analysis support |
| Helicone | SaaS | Proxy layer, zero-code integration |
| Portkey | SaaS | Multi-model gateway + monitoring |
| Braintrust | SaaS | Evaluation + monitoring |
OpenTelemetry for Agents
Agent Tracing Standards
Extending OpenTelemetry standards to agent systems:
```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-service")

def agent_step(task, step_num, model_name, tool_name, params):
    with tracer.start_as_current_span("agent_step") as span:
        span.set_attribute("agent.step_number", step_num)
        span.set_attribute("agent.model", model_name)

        # LLM call: token counts are only known after the response arrives
        with tracer.start_as_current_span("llm_call") as llm_span:
            llm_span.set_attribute("llm.model", model_name)
            response = call_llm(task)
            llm_span.set_attribute("llm.tokens.input", response.input_tokens)
            llm_span.set_attribute("llm.tokens.output", response.output_tokens)

        # Tool call
        with tracer.start_as_current_span("tool_call") as tool_span:
            tool_span.set_attribute("tool.name", tool_name)
            result = call_tool(tool_name, params)
            tool_span.set_attribute("tool.success", True)
        return result
```
Semantic Conventions
OpenTelemetry semantic conventions for agent systems (proposed):
| Attribute | Description |
|---|---|
| `gen_ai.system` | AI system (openai, anthropic) |
| `gen_ai.request.model` | Requested model name |
| `gen_ai.usage.input_tokens` | Input token count |
| `gen_ai.usage.output_tokens` | Output token count |
| `gen_ai.agent.step` | Agent step number |
| `gen_ai.tool.name` | Tool name used |
Key Monitoring Metrics
Core Metrics
| Metric | Calculation | Alert Threshold (example) |
|---|---|---|
| P50/P95/P99 latency | Response time percentiles | P95 > 30s |
| Token usage | Total tokens per task | > 100K tokens |
| Error rate | Failures / Total | > 5% |
| Task completion rate | Successes / Total | < 90% |
| Cost per task | Total cost / Task count | > $1.00 |
| Tool call count | Average calls per task | > 20 calls |
| Cache hit rate | Hits / Total requests | < 50% |
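As an illustration of how the core metrics above can be derived from raw task records, the sketch below computes latency percentiles, error rate, and cost per task (the `TaskRecord` fields are hypothetical, not from any specific platform):

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class TaskRecord:
    latency_s: float   # end-to-end task latency in seconds
    tokens: int        # total tokens consumed by the task
    cost_usd: float    # total cost of the task
    success: bool

def core_metrics(records: list[TaskRecord]) -> dict:
    latencies = sorted(r.latency_s for r in records)
    # quantiles(n=100) yields 99 cut points; indices 49/94/98 → P50/P95/P99
    cuts = quantiles(latencies, n=100)
    failures = sum(1 for r in records if not r.success)
    return {
        "p50_latency_s": cuts[49],
        "p95_latency_s": cuts[94],
        "p99_latency_s": cuts[98],
        "error_rate": failures / len(records),
        "avg_tokens": sum(r.tokens for r in records) / len(records),
        "cost_per_task": sum(r.cost_usd for r in records) / len(records),
    }
```

In practice these aggregates would be computed by the metrics backend (e.g. Prometheus histograms) rather than in application code, but the definitions are the same.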
Business Metrics
| Metric | Description |
|---|---|
| User satisfaction | CSAT score |
| Task avoidance rate | Proportion of tasks agent refuses to handle |
| Human takeover rate | Proportion requiring human transfer |
| Repeat task rate | Proportion of same tasks resubmitted |
Alerting Strategy
Tiered Alerts
```mermaid
graph TD
    A[Monitoring Metrics] --> B{Anomaly Detection}
    B -->|Minor| C[Info Log]
    B -->|Moderate| D[Warning Notification]
    B -->|Severe| E[Critical Alert]
    B -->|Emergency| F[Emergency Handling]
    C --> G[Record]
    D --> H[Slack Notification]
    E --> I[PagerDuty Alert]
    F --> J[Auto Circuit-break + Emergency Notice]
```
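A minimal routing sketch for the tiers above; the channel names are illustrative and the dispatch itself is stubbed out:

```python
from enum import Enum

class Severity(Enum):
    MINOR = 1
    MODERATE = 2
    SEVERE = 3
    EMERGENCY = 4

# Illustrative channel mapping mirroring the diagram above
ROUTES = {
    Severity.MINOR: ["log"],
    Severity.MODERATE: ["slack"],
    Severity.SEVERE: ["pagerduty"],
    Severity.EMERGENCY: ["circuit_breaker", "pagerduty"],
}

def route_alert(metric: str, value: float, severity: Severity) -> list[str]:
    """Return the channels an alert should be dispatched to."""
    channels = ROUTES[severity]
    for channel in channels:
        # In a real system, dispatch here (e.g. a webhook call per channel)
        print(f"[{severity.name}] {metric}={value} -> {channel}")
    return channels
```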
Anomaly Detection
Compute an anomaly score for each metric, such as its deviation from a rolling baseline, and trigger alerts when the score exceeds a threshold (e.g., \(3\sigma\)).
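One simple way to apply the \(3\sigma\) rule, sketched here against a rolling window of recent observations:

```python
from collections import deque
from statistics import mean, stdev

class SigmaDetector:
    """Flags a new observation that lies more than k standard deviations
    from the mean of the last `window` observations."""

    def __init__(self, window: int = 100, k: float = 3.0):
        self.history: deque = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        is_anomaly = False
        if len(self.history) >= 2:
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) > self.k * sigma:
                is_anomaly = True
        self.history.append(value)
        return is_anomaly
```

Note that the anomalous value is still appended to the window, so a sustained level shift stops alerting once the baseline adapts; whether that is desirable depends on the metric.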
Intelligent Alerting
- Trend alerts: Early warning when metrics continuously deteriorate
- Correlated alerts: Merge alerts when multiple metrics are simultaneously anomalous
- Adaptive thresholds: Automatically adjust thresholds based on historical data
- Alert convergence: Prevent alert storms
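Of the techniques above, alert convergence is the simplest to sketch: suppress repeated alerts for the same key within a cooldown window (in-memory and single-process here, purely for illustration):

```python
import time

class AlertDeduplicator:
    """Suppresses repeated alerts with the same key within a cooldown
    window, preventing alert storms from a single flapping metric."""

    def __init__(self, cooldown_s: float = 300.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable for testing
        self.last_fired: dict = {}

    def should_fire(self, key: str) -> bool:
        now = self.clock()
        last = self.last_fired.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # converged: the same alert fired recently
        self.last_fired[key] = now
        return True
```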
Log Aggregation
Structured Logging
```python
import structlog

logger = structlog.get_logger()

def agent_step(task_id, step_num, action):
    logger.info(
        "agent_step_executed",
        task_id=task_id,
        step_num=step_num,
        action=action.type,
        tool=action.tool_name,
        input_tokens=action.input_tokens,
        output_tokens=action.output_tokens,
        latency_ms=action.latency_ms,
        success=action.success,
    )
```
Log Level Design
| Level | Content | Frequency |
|---|---|---|
| DEBUG | Complete LLM inputs and outputs | Development environment |
| INFO | Step summaries and key decisions | Production environment |
| WARNING | Retries, degradation, anomalies | Always enabled |
| ERROR | Failures, abnormal termination | Always enabled |
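The table above can be enforced with structlog's level filtering; a configuration sketch, assuming the environment is read from an `APP_ENV` variable (that variable name is an assumption, not a structlog convention):

```python
import logging
import os

import structlog

# DEBUG in development (full LLM inputs/outputs), INFO and above in production
env = os.getenv("APP_ENV", "production")
level = logging.DEBUG if env == "development" else logging.INFO

structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(level),
)

logger = structlog.get_logger()
logger.debug("llm_io", prompt="...", completion="...")  # dropped in production
logger.info("agent_step_executed", step_num=1)
```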
Practical Recommendations
- Start with tracing: Implement complete execution trace tracking first
- Focus on cost: Cost monitoring should be the primary concern
- Differentiate environments: More detailed logging in development, moderate in production
- Protect privacy: Redact sensitive user information in logs
- Regular review: Periodically review monitoring data to discover optimization opportunities
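For the privacy recommendation above, one lightweight option is a structlog processor that redacts sensitive fields before events are emitted. The key names and regex below are illustrative, and the function itself needs only the standard library:

```python
import re

# Illustrative: keys to drop and patterns to mask in log events
SENSITIVE_KEYS = {"email", "phone", "api_key", "user_message"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_processor(logger, method_name, event_dict):
    """structlog processor: redact sensitive keys and mask emails in values."""
    for key, value in list(event_dict.items()):
        if key in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
        elif isinstance(value, str):
            event_dict[key] = EMAIL_RE.sub("[EMAIL]", value)
    return event_dict
```

Registered via `structlog.configure(processors=[redact_processor, ...])`, this runs on every event, so logs never contain the raw values in the first place.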
References
- LangChain. "LangSmith Documentation." 2024.
- Langfuse. "Open Source LLM Engineering Platform." 2024.
- OpenTelemetry. "Semantic Conventions for GenAI." 2024.
Cross-references:
- Cost monitoring → Cost Optimization and Caching
- Reliability metrics → Reliability and Robustness