Experiment Management and Version Control
1. Prompt Versioning
1.1 Why Prompt Versioning Is Needed
- Minor prompt modifications can cause dramatic output quality changes
- Need to track the reason and effect of each modification
- Support A/B testing and rollback
- Team collaboration requires unified management
1.2 Version Management Approaches
Approach 1: Git-based management
prompts/
├── chat/
│   ├── system_prompt.txt
│   ├── few_shot_examples.json
│   └── CHANGELOG.md
├── classification/
│   ├── system_prompt.txt
│   └── CHANGELOG.md
└── prompt_config.yaml
# prompt_config.yaml
chat:
  version: "1.3"
  model: "gpt-4"
  temperature: 0.7
  system_prompt: "prompts/chat/system_prompt.txt"
  few_shot: "prompts/chat/few_shot_examples.json"
classification:
  version: "2.1"
  model: "gpt-3.5-turbo"
  temperature: 0.0
  system_prompt: "prompts/classification/system_prompt.txt"
Approach 2: Database management
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PromptVersion:
    id: str
    name: str
    version: str             # semantic version, e.g. "1.3.0"
    content: str             # the prompt text itself
    model: str               # target model, e.g. "gpt-4"
    parameters: dict         # temperature, max_tokens, ...
    created_at: datetime
    created_by: str
    evaluation_score: float  # score from the associated evaluation run
    is_active: bool          # whether this version is currently served
    changelog: str           # reason for the change
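On top of such a model, a small repository layer can activate a version and roll back to an earlier one. The sketch below is hypothetical and keeps everything in memory; a real deployment would back it with a database table:

class PromptRepository:
    def __init__(self):
        # name -> all stored versions of that prompt
        self.versions: dict[str, list[PromptVersion]] = {}

    def get_active(self, name: str) -> PromptVersion:
        """Return the version currently marked as active."""
        return next(v for v in self.versions.get(name, []) if v.is_active)

    def activate(self, name: str, version: str) -> None:
        """Activate one version and deactivate the rest (also used for rollback)."""
        for v in self.versions.get(name, []):
            v.is_active = (v.version == version)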
Approach 3: Dedicated tools
- LangSmith: Prompt management in the LangChain ecosystem
- PromptLayer: Dedicated prompt version management platform
- Humanloop: Prompt optimization and version management
1.3 Best Practices
- Record the reason for each change
- Associate evaluation scores with version numbers
- Support quick rollback to the previous version
- Use semantic versioning (major.minor.patch)
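To illustrate the last point, a version bump for a prompt change could be computed as follows (a minimal sketch; the bump levels map major/minor/patch to the kind of change made):

def bump_version(version: str, level: str = "patch") -> str:
    major, minor, patch = (int(x) for x in version.split("."))
    if level == "major":   # incompatible behavior change
        return f"{major + 1}.0.0"
    if level == "minor":   # new capability, backward compatible
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # wording tweak or fix

# bump_version("1.3.0", "minor") -> "1.4.0"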
2. Model Registry
2.1 Concept
A model registry is a central repository for managing model lifecycles, tracking model versions, metadata, and deployment status.
2.2 MLflow Model Registry
import mlflow

# Register model
with mlflow.start_run():
    mlflow.log_params({
        "model_name": "llama-3-8b",
        "quantization": "AWQ-4bit",
        "fine_tune_dataset": "custom_v2",
    })
    mlflow.log_metrics({
        "eval_accuracy": 0.85,
        "eval_latency_ms": 120,
        "eval_cost_per_1k": 0.002,
    })
    mlflow.log_artifact("model_config.yaml")

    # Register to Registry
    mlflow.register_model(
        model_uri=f"runs:/{mlflow.active_run().info.run_id}/model",
        name="chat-assistant",
    )

# Model stage management
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="chat-assistant",
    version=3,
    stage="Production",  # Staging / Production / Archived
)
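Downstream services can then resolve the current Production version by name instead of hard-coding a run ID. A sketch (assuming a model was logged under the "model" artifact path when the run was registered):

import mlflow

# Load whatever version is currently in the Production stage
prod_model = mlflow.pyfunc.load_model("models:/chat-assistant/Production")

# Inspect which version that is
client = mlflow.tracking.MlflowClient()
latest = client.get_latest_versions("chat-assistant", stages=["Production"])
print(latest[0].version, latest[0].current_stage)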
2.3 Model Metadata
model:
  name: chat-assistant
  version: 3
  base_model: meta-llama/Llama-3-8B-Instruct
  quantization: AWQ-4bit
training:
  dataset: custom_instructions_v2
  epochs: 3
  learning_rate: 2e-5
evaluation:
  accuracy: 0.85
  latency_p95_ms: 150
  hallucination_rate: 0.03
deployment:
  stage: production
  serving_framework: vllm
  gpu: A100-40GB
  replicas: 2
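Such metadata is only useful if it stays complete, so a lightweight check before promotion can help. A sketch (assuming PyYAML; REQUIRED mirrors the fields above and is not exhaustive):

import yaml

REQUIRED = {
    "model": ["name", "version", "base_model"],
    "evaluation": ["accuracy", "latency_p95_ms"],
    "deployment": ["stage", "serving_framework"],
}

def validate_metadata(path: str) -> None:
    """Fail fast if a required metadata field is missing."""
    with open(path) as f:
        meta = yaml.safe_load(f)
    for section, fields in REQUIRED.items():
        missing = [name for name in fields if name not in meta.get(section, {})]
        if missing:
            raise ValueError(f"{section}: missing fields {missing}")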
3. Data Versioning
3.1 DVC (Data Version Control)
# Initialize DVC
dvc init
# Track data files
dvc add data/training_set.jsonl
dvc add data/knowledge_base/
# Push to remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push
# Switch data versions
git checkout v1.0
dvc checkout
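DVC also exposes a Python API for reading a specific data version directly, pinned to a Git revision. A sketch (the repository URL and tag are placeholders):

import dvc.api

with dvc.api.open(
    "data/training_set.jsonl",
    repo="https://github.com/org/llm-project",  # placeholder repository
    rev="v1.0",                                 # Git tag of the data version
) as f:
    first_example = f.readline()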
3.2 Knowledge Base Versioning
from datetime import datetime

class KnowledgeBaseVersion:
    """Version management for RAG knowledge bases"""

    def __init__(self, version_id: str):
        self.version_id = version_id
        self.documents: set[str] = set()  # document IDs contained in this version
        self.metadata = {
            "version": version_id,
            "created_at": datetime.now(),
            "document_count": 0,
            "embedding_model": "text-embedding-3-small",
            "chunk_size": 512,
            "chunk_overlap": 100,
        }

    def add_documents(self, documents: list):
        """Add documents and record the version"""
        # Assumes each document is a dict with an "id" field
        self.documents.update(doc["id"] for doc in documents)
        self.metadata["document_count"] += len(documents)
        # Index documents...

    def diff(self, other_version: "KnowledgeBaseVersion"):
        """Compare document IDs between two versions"""
        added = self.documents - other_version.documents
        removed = other_version.documents - self.documents
        return {"added": added, "removed": removed}
3.3 Conversation Data Management
from datetime import datetime

# Log user conversations for evaluation and improvement
class ConversationLogger:
    def __init__(self, store):
        self.store = store  # any backend that exposes an insert() method

    def log(self, conversation):
        record = {
            "conversation_id": conversation.id,
            "messages": conversation.messages,
            "model": conversation.model,
            "prompt_version": conversation.prompt_version,
            "user_feedback": conversation.feedback,  # thumbs up/down
            "tokens_used": conversation.token_usage,
            "latency_ms": conversation.latency,
            "timestamp": datetime.now(),
        }
        self.store.insert(record)
4. Experiment Tracking
4.1 W&B (Weights & Biases) for LLMs
import wandb

wandb.init(project="llm-prompt-optimization")

# Log prompt experiment
wandb.log({
    "prompt_version": "v1.3",
    "model": "gpt-4",
    "temperature": 0.7,
    "eval_accuracy": 0.85,
    "eval_faithfulness": 0.92,
    "eval_relevancy": 0.88,
    "avg_latency_ms": 1200,
    "avg_cost_per_query": 0.05,
})

# Log prompt content
wandb.log({"system_prompt": wandb.Html(system_prompt_content)})

# Log evaluation samples
table = wandb.Table(columns=["query", "response", "score", "feedback"])
for sample in eval_samples:
    table.add_data(sample.query, sample.response, sample.score, sample.feedback)
wandb.log({"eval_samples": table})
4.2 MLflow for LLMs
import mlflow

with mlflow.start_run(run_name="prompt_v1.3_gpt4"):
    # Log parameters
    mlflow.log_params({
        "prompt_version": "v1.3",
        "model": "gpt-4",
        "temperature": 0.7,
        "chunk_size": 512,
        "top_k": 5,
    })
    # Log metrics
    mlflow.log_metrics({
        "accuracy": 0.85,
        "faithfulness": 0.92,
        "relevancy": 0.88,
        "avg_latency_ms": 1200,
        "cost_per_query": 0.05,
    })
    # Log prompt files
    mlflow.log_artifact("prompts/system_prompt_v1.3.txt")
    mlflow.log_artifact("eval_results.json")
4.3 Experiment Comparison
Experiment 1: prompt_v1.2 + gpt-4 + temp=0.5
Accuracy: 0.80 | Faithfulness: 0.88 | Latency: 1100ms | Cost: $0.04
Experiment 2: prompt_v1.3 + gpt-4 + temp=0.7
Accuracy: 0.85 | Faithfulness: 0.92 | Latency: 1200ms | Cost: $0.05
Experiment 3: prompt_v1.3 + gpt-3.5-turbo + temp=0.7
Accuracy: 0.78 | Faithfulness: 0.85 | Latency: 400ms | Cost: $0.002
Conclusion: Experiment 2 delivers the best quality (if budget allows); Experiment 3 offers the best cost-performance ratio
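If the runs are logged as in section 4.2, this comparison can be pulled straight out of MLflow rather than assembled by hand. A sketch (the experiment name is a placeholder; search_runs returns a pandas DataFrame whose columns are prefixed with params. and metrics.):

import mlflow

runs = mlflow.search_runs(experiment_names=["llm-prompt-optimization"])
cols = [
    "params.prompt_version", "params.model", "params.temperature",
    "metrics.accuracy", "metrics.faithfulness",
    "metrics.avg_latency_ms", "metrics.cost_per_query",
]
print(runs[cols].sort_values("metrics.accuracy", ascending=False))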
5. Configuration Management
5.1 Environment Configuration
# config/production.yaml
llm:
  provider: openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 1024
  timeout: 30
rag:
  embedding_model: text-embedding-3-small
  chunk_size: 512
  chunk_overlap: 100
  top_k: 5
  reranker: cohere-rerank-v3
prompt:
  version: v1.3
  path: prompts/production/
monitoring:
  log_level: INFO
  trace_sampling: 0.1
  alert_threshold:
    latency_p95: 3000
    error_rate: 0.01
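A minimal loader sketch for such per-environment files (assuming PyYAML; the APP_ENV variable and config/ layout follow the example above):

import os
import yaml

def load_config() -> dict:
    env = os.getenv("APP_ENV", "production")  # e.g. development / staging / production
    with open(f"config/{env}.yaml") as f:
        return yaml.safe_load(f)

config = load_config()
model = config["llm"]["model"]  # "gpt-4" in production
top_k = config["rag"]["top_k"]  # 5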
5.2 Feature Flags
from featureflags import FeatureFlags

flags = FeatureFlags()

# Select behavior based on feature flags
if flags.is_enabled("use_new_prompt_v2"):
    prompt = load_prompt("v2.0")
else:
    prompt = load_prompt("v1.3")

if flags.is_enabled("enable_reranker"):
    results = rerank(retrieval_results)
6. Reproducibility Challenges
6.1 Reproducibility Issues with LLMs
- Temperature sampling: Non-zero temperature introduces output randomness
- API version changes: Model providers may update models
- Context differences: RAG retrieval results may differ
- System prompt: Minor modifications cause behavioral changes
6.2 Methods to Improve Reproducibility
# 1. Fix the random seed (if the API supports it)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    temperature=0,  # Minimize sampling randomness (not fully deterministic)
    seed=42,        # Fixed seed for best-effort reproducibility
)
# 2. Log complete requests and responses
experiment_log = {
    "request": {
        "model": "gpt-4",
        "messages": messages,
        "temperature": 0,
        "seed": 42,
    },
    "response": response.model_dump(),
    "metadata": {
        "api_version": "2024-01",
        "prompt_version": "v1.3",
        "rag_index_version": "20240301",
    },
}
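With the request and metadata logged, a run can later be replayed and compared against the stored response. The sketch below treats a mismatch as a drift signal rather than a hard failure, since exact matches are not guaranteed even with a fixed seed (replay_and_compare is an illustrative helper):

def replay_and_compare(client, experiment_log: dict) -> bool:
    """Re-issue the logged request and check whether the output text still matches."""
    new_response = client.chat.completions.create(**experiment_log["request"])
    old_text = experiment_log["response"]["choices"][0]["message"]["content"]
    new_text = new_response.choices[0].message.content
    return old_text == new_text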
7. A/B Testing Setup
See A/B Testing and Deployment for details.
References
- MLflow Documentation
- Weights & Biases Documentation
- DVC Documentation
- LLMOps Overview — high-level introduction to LLMOps
- A/B Testing and Deployment — Detailed deployment strategies