Observability and Monitoring
Overview
Observability for AI Agent systems is more challenging than for traditional software. The non-deterministic behavior, multi-step execution, and external tool calls characteristic of agents make debugging and monitoring complex. This section introduces agent-specific observability tools and best practices.
The Three Pillars of Observability
```mermaid
graph TD
    A[Agent Observability] --> B[Tracing]
    A --> C[Metrics]
    A --> D[Logging]
    B --> B1[End-to-end execution traces]
    B --> B2[Per-step LLM call details]
    B --> B3[Tool call tracing]
    C --> C1[Latency distribution]
    C --> C2[Token usage]
    C --> C3[Success rate / Error rate]
    D --> D1[Agent reasoning process]
    D --> D2[Tool inputs and outputs]
    D --> D3[Error stack traces]
    style A fill:#e3f2fd
```
Specialized Tracing Tools
LangSmith
The official tracing and evaluation platform from LangChain:
Core Features:
| Feature | Description |
|---|---|
| Tracing | Complete agent execution trace visualization |
| Evaluation | Custom evaluators, batch evaluation |
| Datasets | Test dataset management |
| Monitoring | Production environment real-time monitoring |
| Playground | Online debugging of prompts and agents |
Trace Data Structure:
```python
# LangSmith auto-traces agent execution (llm and tool are assumed
# to be the application's own model and tool wrappers)
from langsmith import traceable

@traceable(name="agent_task")
def run_agent(task):
    # Each LLM call and tool call is automatically recorded
    plan = llm.plan(task)               # → trace: planning
    results = []
    for step in plan:
        result = tool.execute(step)     # → trace: tool_call
        feedback = llm.reflect(result)  # → trace: reflection
        results.append((result, feedback))
    return results
```
Trace Information Includes:
- Complete input/output content
- Token usage (input/output/total)
- Latency (time to first token, total time)
- Model and parameter information
- Parent-child relationships (call chain)
Langfuse
An open-source LLM observability platform:
Features:
- Fully open-source, supports self-hosting
- Compatible with OpenAI, Anthropic, LangChain, etc.
- Provides Web UI for trace analysis
- Supports custom scoring and annotations
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Create a trace
trace = langfuse.trace(name="agent_run", user_id="user_123")

# Record an LLM call
generation = trace.generation(
    name="planning",
    model="claude-sonnet-4-20250514",
    input=messages,
    output=response,
    usage={"input": 1500, "output": 300},
)

# Record a tool call
span = trace.span(
    name="web_search",
    input={"query": "AI agents 2024"},
    output={"results": [...]},
)
```
Other Tools
| Tool | Type | Features |
|---|---|---|
| Arize Phoenix | Open-source | Strong visualization, embedding analysis support |
| Helicone | SaaS | Proxy layer, zero-code integration |
| Portkey | SaaS | Multi-model gateway + monitoring |
| Braintrust | SaaS | Evaluation + monitoring |
OpenTelemetry for Agents
Agent Tracing Standards
Extending OpenTelemetry standards to agent systems:
```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-service")

def agent_step(task, step_num, model_name, tool_name, params):
    with tracer.start_as_current_span("agent_step") as span:
        span.set_attribute("agent.step_number", step_num)
        span.set_attribute("agent.model", model_name)

        # LLM call: token counts are only known after the response arrives
        with tracer.start_as_current_span("llm_call") as llm_span:
            llm_span.set_attribute("llm.model", model_name)
            response = call_llm(task)
            llm_span.set_attribute("llm.tokens.input", response.input_tokens)
            llm_span.set_attribute("llm.tokens.output", response.output_tokens)

        # Tool call
        with tracer.start_as_current_span("tool_call") as tool_span:
            tool_span.set_attribute("tool.name", tool_name)
            result = call_tool(tool_name, params)
            tool_span.set_attribute("tool.success", True)
        return result
```
Semantic Conventions
OpenTelemetry semantic conventions for agent systems (proposed):
| Attribute | Description |
|---|---|
| `gen_ai.system` | AI system (openai, anthropic) |
| `gen_ai.request.model` | Requested model name |
| `gen_ai.usage.input_tokens` | Input token count |
| `gen_ai.usage.output_tokens` | Output token count |
| `gen_ai.agent.step` | Agent step number |
| `gen_ai.tool.name` | Tool name used |
Key Monitoring Metrics
Core Metrics
| Metric | Calculation | Alert Threshold (example) |
|---|---|---|
| P50/P95/P99 latency | Response time percentiles | P95 > 30s |
| Token usage | Total tokens per task | > 100K tokens |
| Error rate | Failures / Total | > 5% |
| Task completion rate | Successes / Total | < 90% |
| Cost per task | Total cost / Task count | > $1.00 |
| Tool call count | Average calls per task | > 20 calls |
| Cache hit rate | Hits / Total requests | < 50% |
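As an illustration of how the core metrics above can be derived from raw task records, the sketch below computes latency percentiles, error rate, and cost per task (the `TaskRecord` fields are hypothetical, not from any specific platform):

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class TaskRecord:
    latency_s: float   # end-to-end task latency in seconds
    tokens: int        # total tokens consumed by the task
    cost_usd: float    # total cost of the task
    success: bool

def core_metrics(records: list[TaskRecord]) -> dict:
    latencies = sorted(r.latency_s for r in records)
    # quantiles(n=100) yields 99 cut points; indices 49/94/98 → P50/P95/P99
    cuts = quantiles(latencies, n=100)
    failures = sum(1 for r in records if not r.success)
    return {
        "p50_latency_s": cuts[49],
        "p95_latency_s": cuts[94],
        "p99_latency_s": cuts[98],
        "error_rate": failures / len(records),
        "avg_tokens": sum(r.tokens for r in records) / len(records),
        "cost_per_task": sum(r.cost_usd for r in records) / len(records),
    }
```

In practice these aggregates would be computed by the metrics backend (e.g. Prometheus histograms) rather than in application code, but the definitions are the same.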
Business Metrics
| Metric | Description |
|---|---|
| User satisfaction | CSAT score |
| Task avoidance rate | Proportion of tasks agent refuses to handle |
| Human takeover rate | Proportion requiring human transfer |
| Repeat task rate | Proportion of same tasks resubmitted |
Alerting Strategy
Tiered Alerts
```mermaid
graph TD
    A[Monitoring Metrics] --> B{Anomaly Detection}
    B -->|Minor| C[Info Log]
    B -->|Moderate| D[Warning Notification]
    B -->|Severe| E[Critical Alert]
    B -->|Emergency| F[Emergency Handling]
    C --> G[Record]
    D --> H[Slack Notification]
    E --> I[PagerDuty Alert]
    F --> J[Auto Circuit-break + Emergency Notice]
```
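A minimal routing sketch for the tiers above; the channel names are illustrative and the dispatch itself is stubbed out:

```python
from enum import Enum

class Severity(Enum):
    MINOR = 1
    MODERATE = 2
    SEVERE = 3
    EMERGENCY = 4

# Illustrative channel mapping mirroring the diagram above
ROUTES = {
    Severity.MINOR: ["log"],
    Severity.MODERATE: ["slack"],
    Severity.SEVERE: ["pagerduty"],
    Severity.EMERGENCY: ["circuit_breaker", "pagerduty"],
}

def route_alert(metric: str, value: float, severity: Severity) -> list[str]:
    """Return the channels an alert should be dispatched to."""
    channels = ROUTES[severity]
    for channel in channels:
        # In a real system, dispatch here (e.g. a webhook call per channel)
        print(f"[{severity.name}] {metric}={value} -> {channel}")
    return channels
```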
Anomaly Detection
Compute an anomaly score for each metric, such as its deviation from a rolling baseline, and trigger alerts when the score exceeds a threshold (e.g., \(3\sigma\)).
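One simple way to apply the \(3\sigma\) rule, sketched here against a rolling window of recent observations:

```python
from collections import deque
from statistics import mean, stdev

class SigmaDetector:
    """Flags a new observation that lies more than k standard deviations
    from the mean of the last `window` observations."""

    def __init__(self, window: int = 100, k: float = 3.0):
        self.history: deque = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        is_anomaly = False
        if len(self.history) >= 2:
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) > self.k * sigma:
                is_anomaly = True
        self.history.append(value)
        return is_anomaly
```

Note that the anomalous value is still appended to the window, so a sustained level shift stops alerting once the baseline adapts; whether that is desirable depends on the metric.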
Intelligent Alerting
- Trend alerts: Early warning when metrics continuously deteriorate
- Correlated alerts: Merge alerts when multiple metrics are simultaneously anomalous
- Adaptive thresholds: Automatically adjust thresholds based on historical data
- Alert convergence: Prevent alert storms
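Of the techniques above, alert convergence is the simplest to sketch: suppress repeated alerts for the same key within a cooldown window (in-memory and single-process here, purely for illustration):

```python
import time

class AlertDeduplicator:
    """Suppresses repeated alerts with the same key within a cooldown
    window, preventing alert storms from a single flapping metric."""

    def __init__(self, cooldown_s: float = 300.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable for testing
        self.last_fired: dict = {}

    def should_fire(self, key: str) -> bool:
        now = self.clock()
        last = self.last_fired.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # converged: the same alert fired recently
        self.last_fired[key] = now
        return True
```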
Log Aggregation
Structured Logging
```python
import structlog

logger = structlog.get_logger()

def agent_step(task_id, step_num, action):
    logger.info(
        "agent_step_executed",
        task_id=task_id,
        step_num=step_num,
        action=action.type,
        tool=action.tool_name,
        input_tokens=action.input_tokens,
        output_tokens=action.output_tokens,
        latency_ms=action.latency_ms,
        success=action.success,
    )
```
Log Level Design
| Level | Content | Frequency |
|---|---|---|
| DEBUG | Complete LLM inputs and outputs | Development environment |
| INFO | Step summaries and key decisions | Production environment |
| WARNING | Retries, degradation, anomalies | Always enabled |
| ERROR | Failures, abnormal termination | Always enabled |
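The table above can be enforced with structlog's level filtering; a configuration sketch, assuming the environment is read from an `APP_ENV` variable (that variable name is an assumption, not a structlog convention):

```python
import logging
import os

import structlog

# DEBUG in development (full LLM inputs/outputs), INFO and above in production
env = os.getenv("APP_ENV", "production")
level = logging.DEBUG if env == "development" else logging.INFO

structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(level),
)

logger = structlog.get_logger()
logger.debug("llm_io", prompt="...", completion="...")  # dropped in production
logger.info("agent_step_executed", step_num=1)
```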
Practical Recommendations
- Start with tracing: Implement complete execution trace tracking first
- Focus on cost: Cost monitoring should be the primary concern
- Differentiate environments: More detailed logging in development, moderate in production
- Protect privacy: Redact sensitive user information in logs
- Regular review: Periodically review monitoring data to discover optimization opportunities
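For the privacy recommendation above, one lightweight option is a structlog processor that redacts sensitive fields before events are emitted. The key names and regex below are illustrative, and the function itself needs only the standard library:

```python
import re

# Illustrative: keys to drop and patterns to mask in log events
SENSITIVE_KEYS = {"email", "phone", "api_key", "user_message"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_processor(logger, method_name, event_dict):
    """structlog processor: redact sensitive keys and mask emails in values."""
    for key, value in list(event_dict.items()):
        if key in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
        elif isinstance(value, str):
            event_dict[key] = EMAIL_RE.sub("[EMAIL]", value)
    return event_dict
```

Registered via `structlog.configure(processors=[redact_processor, ...])`, this runs on every event, so logs never contain the raw values in the first place.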
References
- LangChain. "LangSmith Documentation." 2024.
- Langfuse. "Open Source LLM Engineering Platform." 2024.
- OpenTelemetry. "Semantic Conventions for GenAI." 2024.
Cross-references:
- Cost monitoring → Cost Optimization and Caching
- Reliability metrics → Reliability and Robustness