A/B Testing and Deployment
1. A/B Testing for LLM Applications
1.1 Unique Aspects of A/B Testing
A/B testing for LLM applications differs from A/B testing for traditional web applications in several ways:
| Dimension | Traditional A/B Test | LLM A/B Test |
|---|---|---|
| Metrics | Click rate, conversion rate | Answer quality, user satisfaction |
| Evaluation | Binary/continuous metrics | Multi-dimensional subjective evaluation |
| Sample size | Usually requires large volumes | Sample efficiency matters more |
| Cost | Low | High (each request has API cost) |
| Latency | Millisecond-level differences | Second-level differences may impact experience |
1.2 Metrics Design
Core metrics:
```python
from dataclasses import dataclass

@dataclass
class ABTestMetrics:
    # Quality metrics
    user_satisfaction: float      # User satisfaction score (1-5)
    thumbs_up_rate: float         # Thumbs-up rate
    task_completion_rate: float   # Task completion rate

    # Safety metrics
    hallucination_rate: float     # Hallucination rate
    safety_violation_rate: float  # Safety violation rate

    # Performance metrics
    ttft_ms: float                # Time to first token (ms)
    total_latency_ms: float       # Total latency (ms)

    # Cost metrics
    tokens_per_query: float       # Tokens per query
    cost_per_query: float         # Cost per query

    # Business metrics
    retention_rate: float         # User retention rate
    queries_per_session: float    # Queries per session
```
Guardrail metrics:

These are constraints rather than goals: they must not degrade, even if they never improve. Typical thresholds:
- Safety violation rate <= 0.1%
- Hallucination rate <= 5%
- P95 latency <= 3000ms
- Error rate <= 1%
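A minimal guardrail check might look like the following sketch, assuming metrics arrive as a dict of rates and latencies (the `GUARDRAILS` table and `violated_guardrails` helper are illustrative, not part of any library):

```python
# Illustrative guardrail thresholds (rates as fractions, latency in ms).
GUARDRAILS = {
    "safety_violation_rate": 0.001,  # <= 0.1%
    "hallucination_rate": 0.05,      # <= 5%
    "p95_latency_ms": 3000,          # <= 3000 ms
    "error_rate": 0.01,              # <= 1%
}

def violated_guardrails(metrics: dict) -> list[str]:
    """Return the names of all guardrail metrics that exceed their threshold."""
    return [
        name for name, limit in GUARDRAILS.items()
        if metrics.get(name, 0.0) > limit
    ]
```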
1.3 Statistical Significance
```python
from scipy import stats

def ab_test_significance(control_scores, treatment_scores, alpha=0.05):
    """Test the statistical significance of an A/B test.

    Both inputs are 1-D numpy arrays of per-query scores.
    """
    # Welch's t-test (does not assume equal variance)
    t_stat, p_value = stats.ttest_ind(
        control_scores, treatment_scores, equal_var=False
    )
    # Effect size (Cohen's d), using sample standard deviations (ddof=1)
    pooled_std = ((control_scores.std(ddof=1)**2 + treatment_scores.std(ddof=1)**2) / 2)**0.5
    cohens_d = (treatment_scores.mean() - control_scores.mean()) / pooled_std
    return {
        "control_mean": control_scores.mean(),
        "treatment_mean": treatment_scores.mean(),
        "p_value": p_value,
        "significant": p_value < alpha,
        "cohens_d": cohens_d,
        "lift": (treatment_scores.mean() - control_scores.mean()) / control_scores.mean(),
    }
```
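A quick usage sketch with synthetic satisfaction scores (the data here is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
control = rng.normal(loc=3.8, scale=0.6, size=500)    # simulated 1-5 satisfaction scores
treatment = rng.normal(loc=3.9, scale=0.6, size=500)

result = ab_test_significance(control, treatment)
print(f"lift={result['lift']:.1%}, p={result['p_value']:.4f}, "
      f"significant={result['significant']}")
```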
1.4 Traffic Splitting
```python
import hashlib

class ABTestRouter:
    def __init__(self, experiment_id: str, treatment_ratio: float = 0.5):
        self.experiment_id = experiment_id
        self.treatment_ratio = treatment_ratio

    def get_variant(self, user_id: str) -> str:
        """Determine the group from the user ID (the same user always lands in the same group)."""
        hash_input = f"{self.experiment_id}:{user_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        # 1% buckets; hashing experiment_id together with user_id decorrelates experiments
        if (hash_value % 100) / 100 < self.treatment_ratio:
            return "treatment"
        return "control"

    def route_request(self, user_id: str, query: str):
        variant = self.get_variant(user_id)
        if variant == "treatment":
            return self.treatment_handler(query)  # New prompt/model
        else:
            return self.control_handler(query)    # Current version
```
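Because the bucket is derived from a hash of the experiment ID and user ID, assignment is deterministic and survives restarts; a quick check (IDs are illustrative):

```python
router = ABTestRouter(experiment_id="prompt-v2-rollout", treatment_ratio=0.1)
# The same user always maps to the same variant across requests and restarts.
assert router.get_variant("user-123") == router.get_variant("user-123")
```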
2. Canary Deployment
2.1 Concept
Canary deployment gradually rolls out the new version to a small percentage of users, expanding only after confirming it is safe.
2.2 Progressive Rollout
```text
Phase 1:   1% traffic → new version (internal users)
           Observe for 24 hours, check core metrics
Phase 2:   5% traffic → new version
           Observe for 24 hours, compare A/B metrics
Phase 3:  25% traffic → new version
           Observe for 48 hours, confirm no degradation
Phase 4:  50% traffic → new version
           Observe for 24 hours
Phase 5: 100% traffic → new version
           Full rollout complete
```
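The schedule above is easy to encode as data and drive from an automated pipeline; a minimal sketch (the `ROLLOUT_PHASES` structure and its field names are assumptions, not a standard):

```python
# Each phase: traffic share, minimum observation window, and audience restriction.
ROLLOUT_PHASES = [
    {"traffic_pct": 1,   "observe_hours": 24, "audience": "internal"},
    {"traffic_pct": 5,   "observe_hours": 24, "audience": "all"},
    {"traffic_pct": 25,  "observe_hours": 48, "audience": "all"},
    {"traffic_pct": 50,  "observe_hours": 24, "audience": "all"},
    {"traffic_pct": 100, "observe_hours": 0,  "audience": "all"},
]

def advance(phase_index: int, health_ok: bool) -> int | None:
    """Return the next phase index, or None to signal a rollback."""
    if not health_ok:
        return None  # caller rolls traffic on the new version back to 0%
    return min(phase_index + 1, len(ROLLOUT_PHASES) - 1)
```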
2.3 Automatic Rollback Conditions
```python
class CanaryMonitor:
    def __init__(self):
        # Rollback thresholds, each evaluated over a sliding time window
        self.rollback_conditions = {
            "error_rate": {"threshold": 0.05, "window": "5m"},
            "p95_latency_ms": {"threshold": 5000, "window": "10m"},
            "hallucination_rate": {"threshold": 0.10, "window": "30m"},
            "safety_violation": {"threshold": 0.001, "window": "5m"},
        }

    def check_health(self, metrics: dict):
        """Check canary health and decide whether to roll back."""
        for metric_name, condition in self.rollback_conditions.items():
            current_value = metrics.get(metric_name)
            # Skip metrics that have not been reported yet
            if current_value is not None and current_value > condition["threshold"]:
                return {
                    "action": "rollback",
                    "reason": f"{metric_name} ({current_value}) exceeds threshold ({condition['threshold']})",
                }
        return {"action": "continue"}
```
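Usage is a single call per evaluation tick (the metric values below are invented for illustration):

```python
monitor = CanaryMonitor()
decision = monitor.check_health({"error_rate": 0.08, "p95_latency_ms": 2100})
# -> {"action": "rollback", "reason": "error_rate (0.08) exceeds threshold (0.05)"}
```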
3. Blue-Green Deployment
3.1 Concept
Maintains two complete production environments simultaneously, achieving zero-downtime deployment through traffic switching.
```text
Blue environment (current version)  ←── 100% traffic
Green environment (new version)     ←──   0% traffic

After switch:

Blue environment (old version)      ←──   0% traffic (retained for rollback)
Green environment (new version)     ←── 100% traffic
```
3.2 Kubernetes Implementation
```yaml
# Blue deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-service
      version: blue
  template:
    metadata:
      labels:
        app: llm-service
        version: blue
    spec:
      containers:
        - name: llm-inference
          image: llm-service:v1.2
          env:
            - name: PROMPT_VERSION
              value: "v1.3"
---
# Green deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service-green
  labels:
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-service
      version: green
  template:
    metadata:
      labels:
        app: llm-service
        version: green
    spec:
      containers:
        - name: llm-inference
          image: llm-service:v1.3
          env:
            - name: PROMPT_VERSION
              value: "v2.0"
---
# Service (switch traffic via selector)
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-service
    version: blue  # Change to green to switch traffic
  ports:
    - port: 80
      targetPort: 8000
```
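In practice the switch is a single update to the Service selector, e.g. `kubectl patch service llm-service -p '{"spec":{"selector":{"version":"green"}}}'`; rolling back is the same command with `blue`.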
4. Feature Flags
4.1 Using Feature Flags to Control LLM Behavior
```python
from feature_flags import FeatureFlagClient  # illustrative flag client; any flag service works

ff = FeatureFlagClient()

async def handle_chat(user_id: str, query: str):
    # Model selection
    if ff.is_enabled("use_gpt4_turbo", user_id=user_id):
        model = "gpt-4-turbo"
    else:
        model = "gpt-4"

    # Prompt version
    if ff.is_enabled("new_prompt_v2", user_id=user_id):
        prompt = load_prompt("v2.0")
    else:
        prompt = load_prompt("v1.3")

    # RAG configuration
    if ff.is_enabled("enable_reranker", user_id=user_id):
        rag_config = {"reranker": True, "top_k": 10}
    else:
        rag_config = {"reranker": False, "top_k": 5}

    return await generate(model, prompt, query, rag_config)
```
4.2 Progressive Feature Rollout
Feature Flag: "new_rag_pipeline"
- Phase 1: Internal test users only
- Phase 2: 5% random users
- Phase 3: Paid users
- Phase 4: All users
- Phase 5: Remove flag, becomes default behavior
5. Rollback Strategies
5.1 Quick Rollback
```python
class RollbackManager:
    def __init__(self):
        self.version_history = []

    def deploy(self, new_version):
        """Deploy a new version, saving the current one for rollback."""
        current = self.get_current_version()
        self.version_history.append(current)
        self.switch_to(new_version)

    def rollback(self, reason: str):
        """Roll back to the last stable version."""
        if not self.version_history:
            raise RuntimeError("No previous version available")
        previous = self.version_history.pop()
        self.switch_to(previous)
        # Send alert notification
        self.alert(f"Rollback triggered: {reason}")
        return previous
```
5.2 Rollback Checklist
- [ ] Automatic rollback trigger conditions are configured
- [ ] Rollback completes within 30 seconds
- [ ] Previous version's config/prompt/model are still available
- [ ] Rollback does not lose user data
- [ ] Rollback operations have complete logs
6. Cost Monitoring
6.1 Cost Tracking
```python
class CostMonitor:
    def track(self, request):
        # calculate_cost is a placeholder for a per-model token pricing table
        cost = calculate_cost(
            model=request.model,
            prompt_tokens=request.prompt_tokens,
            completion_tokens=request.completion_tokens,
        )
        self.metrics.record(
            cost=cost,
            model=request.model,
            feature=request.feature,
            user_tier=request.user_tier,
        )
        # Cost anomaly alerting: warn at 80% of the daily budget
        daily_cost = self.get_daily_cost()
        if daily_cost > self.daily_budget * 0.8:
            self.alert(f"Daily cost approaching budget: ${daily_cost:.2f}")
```
6.2 Cost Optimization Strategies
- Model routing: use small models for simple queries, large models for complex queries (see the sketch below)
- Caching: Cache responses for similar queries
- Prompt compression: Reduce unnecessary tokens
- Batching: Combine requests to reduce overhead
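Model routing is often just a classifier in front of the model pool; a minimal sketch, assuming a cheap heuristic over query length and keywords (the model names, markers, and thresholds are illustrative):

```python
# Route cheap/simple queries to a small model, everything else to a large one.
COMPLEX_MARKERS = ("compare", "analyze", "step by step", "prove", "design")

def pick_model(query: str) -> str:
    q = query.lower()
    if len(q.split()) < 20 and not any(m in q for m in COMPLEX_MARKERS):
        return "small-model"   # e.g. a distilled or mini model
    return "large-model"       # full-size model for complex queries
```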
7. User Feedback Loops
7.1 Collecting Feedback
```python
from datetime import datetime

class FeedbackCollector:
    def collect_explicit(self, conversation_id, feedback_type, details=None):
        """Collect explicit feedback (thumbs up/down)."""
        self.store({
            "conversation_id": conversation_id,
            "type": feedback_type,  # "thumbs_up" or "thumbs_down"
            "details": details,
            "timestamp": datetime.now(),
        })

    def collect_implicit(self, conversation_id, signals):
        """Collect implicit feedback derived from user behavior."""
        self.store({
            "conversation_id": conversation_id,
            "regenerated": signals.get("regenerated", False),     # user asked to regenerate
            "follow_up_questions": signals.get("follow_ups", 0),  # number of follow-ups
            "session_duration": signals.get("duration"),          # seconds
            "copied_response": signals.get("copied", False),      # user copied the answer
        })
```
7.2 Feedback-Driven Optimization
1. Collect negative feedback cases
2. Classify failure reasons (hallucination/irrelevant/formatting/safety), as sketched below
3. Targeted optimization (modify prompt/update knowledge base/adjust parameters)
4. A/B test to verify optimization effects
5. Deploy optimized version
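Step 2 can start as simple counting over labeled feedback before any tooling exists; a self-contained sketch (category names follow the list above, the sample data is invented):

```python
from collections import Counter

FAILURE_CATEGORIES = {"hallucination", "irrelevant", "formatting", "safety"}

def triage(feedback_cases: list[dict]) -> Counter:
    """Count negative-feedback cases by failure category to prioritize fixes."""
    return Counter(
        case["category"] for case in feedback_cases
        if case.get("category") in FAILURE_CATEGORIES
    )

print(triage([
    {"category": "hallucination"},
    {"category": "irrelevant"},
    {"category": "hallucination"},
]))  # Counter({'hallucination': 2, 'irrelevant': 1})
```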
8. Summary
Deployment Strategy Selection
| Strategy | Risk | Complexity | Use Cases |
|---|---|---|---|
| Direct deployment | High | Low | Non-critical applications |
| A/B testing | Low | Medium | Data-driven decisions needed |
| Canary deployment | Low | Medium | Progressive validation |
| Blue-green deployment | Very low | High | Zero-downtime and fast rollback needed |
| Feature flags | Very low | Medium | Fine-grained control needed |
Recommended Process
Development → Auto Eval → Human Review → Shadow Test → Canary (1%) → A/B Test (50%) → Full Rollout
References
- LLMOps Overview
- Experiment Management and Version Control — experiment tracking and version management
- LLM Evaluation — detailed evaluation methods