A/B Testing and Deployment
1. A/B Testing for LLM Applications
1.1 Unique Aspects of A/B Testing
A/B testing for LLM applications differs from A/B testing for traditional web applications in several ways:
| Dimension | Traditional A/B Test | LLM A/B Test |
|---|---|---|
| Metrics | Click rate, conversion rate | Answer quality, user satisfaction |
| Evaluation | Binary/continuous metrics | Multi-dimensional subjective evaluation |
| Sample size | Usually requires large volumes | Sample efficiency matters more |
| Cost | Low | High (each request has API cost) |
| Latency | Millisecond-level differences | Second-level differences may impact experience |
1.2 Metrics Design
Core metrics:
```python
from dataclasses import dataclass

@dataclass
class ABTestMetrics:
    # Quality metrics
    user_satisfaction: float      # User satisfaction score (1-5)
    thumbs_up_rate: float         # Thumbs-up rate
    task_completion_rate: float   # Task completion rate

    # Safety metrics
    hallucination_rate: float     # Hallucination rate
    safety_violation_rate: float  # Safety violation rate

    # Performance metrics
    ttft_ms: float                # Time to first token (ms)
    total_latency_ms: float       # Total latency (ms)

    # Cost metrics
    tokens_per_query: float       # Tokens per query
    cost_per_query: float         # Cost per query

    # Business metrics
    retention_rate: float         # User retention rate
    queries_per_session: float    # Queries per session
```
Guardrail metrics:

These are constraints rather than goals: they must not degrade, even if they never improve. Typical thresholds:
- Safety violation rate <= 0.1%
- Hallucination rate <= 5%
- P95 latency <= 3000ms
- Error rate <= 1%
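A minimal guardrail check might look like the following sketch, assuming metrics arrive as a dict of rates and latencies (the `GUARDRAILS` table and `violated_guardrails` helper are illustrative, not part of any library):

```python
# Illustrative guardrail thresholds (rates as fractions, latency in ms).
GUARDRAILS = {
    "safety_violation_rate": 0.001,  # <= 0.1%
    "hallucination_rate": 0.05,      # <= 5%
    "p95_latency_ms": 3000,          # <= 3000 ms
    "error_rate": 0.01,              # <= 1%
}

def violated_guardrails(metrics: dict) -> list[str]:
    """Return the names of all guardrail metrics that exceed their threshold."""
    return [
        name for name, limit in GUARDRAILS.items()
        if metrics.get(name, 0.0) > limit
    ]
```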
1.3 Statistical Significance
```python
from scipy import stats

def ab_test_significance(control_scores, treatment_scores, alpha=0.05):
    """Test the statistical significance of an A/B test.

    Both inputs are 1-D numpy arrays of per-query scores.
    """
    # Welch's t-test (does not assume equal variance)
    t_stat, p_value = stats.ttest_ind(
        control_scores, treatment_scores, equal_var=False
    )
    # Effect size (Cohen's d), using sample standard deviations (ddof=1)
    pooled_std = ((control_scores.std(ddof=1)**2 + treatment_scores.std(ddof=1)**2) / 2)**0.5
    cohens_d = (treatment_scores.mean() - control_scores.mean()) / pooled_std
    return {
        "control_mean": control_scores.mean(),
        "treatment_mean": treatment_scores.mean(),
        "p_value": p_value,
        "significant": p_value < alpha,
        "cohens_d": cohens_d,
        "lift": (treatment_scores.mean() - control_scores.mean()) / control_scores.mean(),
    }
```
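A quick usage sketch with synthetic satisfaction scores (the data here is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
control = rng.normal(loc=3.8, scale=0.6, size=500)    # simulated 1-5 satisfaction scores
treatment = rng.normal(loc=3.9, scale=0.6, size=500)

result = ab_test_significance(control, treatment)
print(f"lift={result['lift']:.1%}, p={result['p_value']:.4f}, "
      f"significant={result['significant']}")
```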
1.4 Traffic Splitting
```python
import hashlib

class ABTestRouter:
    def __init__(self, experiment_id: str, treatment_ratio: float = 0.5):
        self.experiment_id = experiment_id
        self.treatment_ratio = treatment_ratio

    def get_variant(self, user_id: str) -> str:
        """Determine the group from the user ID (the same user always lands in the same group)."""
        hash_input = f"{self.experiment_id}:{user_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        # 1% buckets; hashing experiment_id together with user_id decorrelates experiments
        if (hash_value % 100) / 100 < self.treatment_ratio:
            return "treatment"
        return "control"

    def route_request(self, user_id: str, query: str):
        variant = self.get_variant(user_id)
        if variant == "treatment":
            return self.treatment_handler(query)  # New prompt/model
        else:
            return self.control_handler(query)    # Current version
```
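Because the bucket is derived from a hash of the experiment ID and user ID, assignment is deterministic and survives restarts; a quick check (IDs are illustrative):

```python
router = ABTestRouter(experiment_id="prompt-v2-rollout", treatment_ratio=0.1)
# The same user always maps to the same variant across requests and restarts.
assert router.get_variant("user-123") == router.get_variant("user-123")
```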
2. Canary Deployment
2.1 Concept
Canary deployment gradually rolls out the new version to a small percentage of users, expanding only after confirming it is safe.
2.2 Progressive Rollout
```text
Phase 1:   1% traffic → new version (internal users)
           Observe for 24 hours, check core metrics
Phase 2:   5% traffic → new version
           Observe for 24 hours, compare A/B metrics
Phase 3:  25% traffic → new version
           Observe for 48 hours, confirm no degradation
Phase 4:  50% traffic → new version
           Observe for 24 hours
Phase 5: 100% traffic → new version
           Full rollout complete
```
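The schedule above is easy to encode as data and drive from an automated pipeline; a minimal sketch (the `ROLLOUT_PHASES` structure and its field names are assumptions, not a standard):

```python
# Each phase: traffic share, minimum observation window, and audience restriction.
ROLLOUT_PHASES = [
    {"traffic_pct": 1,   "observe_hours": 24, "audience": "internal"},
    {"traffic_pct": 5,   "observe_hours": 24, "audience": "all"},
    {"traffic_pct": 25,  "observe_hours": 48, "audience": "all"},
    {"traffic_pct": 50,  "observe_hours": 24, "audience": "all"},
    {"traffic_pct": 100, "observe_hours": 0,  "audience": "all"},
]

def advance(phase_index: int, health_ok: bool) -> int | None:
    """Return the next phase index, or None to signal a rollback."""
    if not health_ok:
        return None  # caller rolls traffic on the new version back to 0%
    return min(phase_index + 1, len(ROLLOUT_PHASES) - 1)
```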
2.3 Automatic Rollback Conditions
```python
class CanaryMonitor:
    def __init__(self):
        # Rollback thresholds, each evaluated over a sliding time window
        self.rollback_conditions = {
            "error_rate": {"threshold": 0.05, "window": "5m"},
            "p95_latency_ms": {"threshold": 5000, "window": "10m"},
            "hallucination_rate": {"threshold": 0.10, "window": "30m"},
            "safety_violation": {"threshold": 0.001, "window": "5m"},
        }

    def check_health(self, metrics: dict):
        """Check canary health and decide whether to roll back."""
        for metric_name, condition in self.rollback_conditions.items():
            current_value = metrics.get(metric_name)
            # Skip metrics that have not been reported yet
            if current_value is not None and current_value > condition["threshold"]:
                return {
                    "action": "rollback",
                    "reason": f"{metric_name} ({current_value}) exceeds threshold ({condition['threshold']})",
                }
        return {"action": "continue"}
```
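Usage is a single call per evaluation tick (the metric values below are invented for illustration):

```python
monitor = CanaryMonitor()
decision = monitor.check_health({"error_rate": 0.08, "p95_latency_ms": 2100})
# -> {"action": "rollback", "reason": "error_rate (0.08) exceeds threshold (0.05)"}
```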
3. Blue-Green Deployment
3.1 Concept
Maintains two complete production environments simultaneously, achieving zero-downtime deployment through traffic switching.
```text
Blue environment (current version)  ←── 100% traffic
Green environment (new version)     ←──   0% traffic

After switch:

Blue environment (old version)      ←──   0% traffic (retained for rollback)
Green environment (new version)     ←── 100% traffic
```
3.2 Kubernetes Implementation
```yaml
# Blue deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-service
      version: blue
  template:
    metadata:
      labels:
        app: llm-service
        version: blue
    spec:
      containers:
        - name: llm-inference
          image: llm-service:v1.2
          env:
            - name: PROMPT_VERSION
              value: "v1.3"
---
# Green deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service-green
  labels:
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-service
      version: green
  template:
    metadata:
      labels:
        app: llm-service
        version: green
    spec:
      containers:
        - name: llm-inference
          image: llm-service:v1.3
          env:
            - name: PROMPT_VERSION
              value: "v2.0"
---
# Service (switch traffic via selector)
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-service
    version: blue  # Change to green to switch traffic
  ports:
    - port: 80
      targetPort: 8000
```
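In practice the switch is a single update to the Service selector, e.g. `kubectl patch service llm-service -p '{"spec":{"selector":{"version":"green"}}}'`; rolling back is the same command with `blue`.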
4. Feature Flags
4.1 Using Feature Flags to Control LLM Behavior
```python
from feature_flags import FeatureFlagClient  # illustrative flag client; any flag service works

ff = FeatureFlagClient()

async def handle_chat(user_id: str, query: str):
    # Model selection
    if ff.is_enabled("use_gpt4_turbo", user_id=user_id):
        model = "gpt-4-turbo"
    else:
        model = "gpt-4"

    # Prompt version
    if ff.is_enabled("new_prompt_v2", user_id=user_id):
        prompt = load_prompt("v2.0")
    else:
        prompt = load_prompt("v1.3")

    # RAG configuration
    if ff.is_enabled("enable_reranker", user_id=user_id):
        rag_config = {"reranker": True, "top_k": 10}
    else:
        rag_config = {"reranker": False, "top_k": 5}

    return await generate(model, prompt, query, rag_config)
```
4.2 Progressive Feature Rollout
Feature Flag: "new_rag_pipeline"
- Phase 1: Internal test users only
- Phase 2: 5% random users
- Phase 3: Paid users
- Phase 4: All users
- Phase 5: Remove flag, becomes default behavior
5. Rollback Strategies
5.1 Quick Rollback
```python
class RollbackManager:
    def __init__(self):
        self.version_history = []

    def deploy(self, new_version):
        """Deploy a new version, saving the current one for rollback."""
        current = self.get_current_version()
        self.version_history.append(current)
        self.switch_to(new_version)

    def rollback(self, reason: str):
        """Roll back to the last stable version."""
        if not self.version_history:
            raise RuntimeError("No previous version available")
        previous = self.version_history.pop()
        self.switch_to(previous)
        # Send alert notification
        self.alert(f"Rollback triggered: {reason}")
        return previous
```
5.2 Rollback Checklist
- [ ] Automatic rollback trigger conditions are configured
- [ ] Rollback completes within 30 seconds
- [ ] Previous version's config/prompt/model are still available
- [ ] Rollback does not lose user data
- [ ] Rollback operations have complete logs
6. Cost Monitoring
6.1 Cost Tracking
```python
class CostMonitor:
    def track(self, request):
        # calculate_cost is a placeholder for a per-model token pricing table
        cost = calculate_cost(
            model=request.model,
            prompt_tokens=request.prompt_tokens,
            completion_tokens=request.completion_tokens,
        )
        self.metrics.record(
            cost=cost,
            model=request.model,
            feature=request.feature,
            user_tier=request.user_tier,
        )
        # Cost anomaly alerting: warn at 80% of the daily budget
        daily_cost = self.get_daily_cost()
        if daily_cost > self.daily_budget * 0.8:
            self.alert(f"Daily cost approaching budget: ${daily_cost:.2f}")
```
6.2 Cost Optimization Strategies
- Model routing: use small models for simple queries, large models for complex queries (see the sketch below)
- Caching: Cache responses for similar queries
- Prompt compression: Reduce unnecessary tokens
- Batching: Combine requests to reduce overhead
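Model routing is often just a classifier in front of the model pool; a minimal sketch, assuming a cheap heuristic over query length and keywords (the model names, markers, and thresholds are illustrative):

```python
# Route cheap/simple queries to a small model, everything else to a large one.
COMPLEX_MARKERS = ("compare", "analyze", "step by step", "prove", "design")

def pick_model(query: str) -> str:
    q = query.lower()
    if len(q.split()) < 20 and not any(m in q for m in COMPLEX_MARKERS):
        return "small-model"   # e.g. a distilled or mini model
    return "large-model"       # full-size model for complex queries
```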
7. User Feedback Loops
7.1 Collecting Feedback
```python
from datetime import datetime

class FeedbackCollector:
    def collect_explicit(self, conversation_id, feedback_type, details=None):
        """Collect explicit feedback (thumbs up/down)."""
        self.store({
            "conversation_id": conversation_id,
            "type": feedback_type,  # "thumbs_up" or "thumbs_down"
            "details": details,
            "timestamp": datetime.now(),
        })

    def collect_implicit(self, conversation_id, signals):
        """Collect implicit feedback derived from user behavior."""
        self.store({
            "conversation_id": conversation_id,
            "regenerated": signals.get("regenerated", False),     # user asked to regenerate
            "follow_up_questions": signals.get("follow_ups", 0),  # number of follow-ups
            "session_duration": signals.get("duration"),          # seconds
            "copied_response": signals.get("copied", False),      # user copied the answer
        })
```
7.2 Feedback-Driven Optimization
1. Collect negative feedback cases
2. Classify failure reasons (hallucination/irrelevant/formatting/safety), as sketched below
3. Targeted optimization (modify prompt/update knowledge base/adjust parameters)
4. A/B test to verify optimization effects
5. Deploy optimized version
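Step 2 can start as simple counting over labeled feedback before any tooling exists; a self-contained sketch (category names follow the list above, the sample data is invented):

```python
from collections import Counter

FAILURE_CATEGORIES = {"hallucination", "irrelevant", "formatting", "safety"}

def triage(feedback_cases: list[dict]) -> Counter:
    """Count negative-feedback cases by failure category to prioritize fixes."""
    return Counter(
        case["category"] for case in feedback_cases
        if case.get("category") in FAILURE_CATEGORIES
    )

print(triage([
    {"category": "hallucination"},
    {"category": "irrelevant"},
    {"category": "hallucination"},
]))  # Counter({'hallucination': 2, 'irrelevant': 1})
```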
8. Summary
Deployment Strategy Selection
| Strategy | Risk | Complexity | Use Cases |
|---|---|---|---|
| Direct deployment | High | Low | Non-critical applications |
| A/B testing | Low | Medium | Data-driven decisions needed |
| Canary deployment | Low | Medium | Progressive validation |
| Blue-green deployment | Very low | High | Zero-downtime and fast rollback needed |
| Feature flags | Very low | Medium | Fine-grained control needed |
Recommended Process
Development → Auto Eval → Human Review → Shadow Test → Canary (1%) → A/B Test (50%) → Full Rollout
References
- LLMOps Overview
- Experiment Management and Version Control — experiment tracking and version management
- LLM Evaluation — detailed evaluation methods