Experiment Management and Version Control
1. Prompt Versioning
1.1 Why Prompt Versioning Is Needed
- Minor prompt modifications can cause dramatic output quality changes
- Need to track the reason and effect of each modification
- Support A/B testing and rollback
- Team collaboration requires unified management
1.2 Version Management Approaches
Approach 1: Git-based management
prompts/
├── chat/
│   ├── system_prompt.txt
│   ├── few_shot_examples.json
│   └── CHANGELOG.md
├── classification/
│   ├── system_prompt.txt
│   └── CHANGELOG.md
└── prompt_config.yaml
# prompt_config.yaml
chat:
  version: "1.3"
  model: "gpt-4"
  temperature: 0.7
  system_prompt: "prompts/chat/system_prompt.txt"
  few_shot: "prompts/chat/few_shot_examples.json"
classification:
  version: "2.1"
  model: "gpt-3.5-turbo"
  temperature: 0.0
  system_prompt: "prompts/classification/system_prompt.txt"
Approach 2: Database management
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PromptVersion:
    id: str
    name: str
    version: str             # semantic version, e.g. "1.3.0"
    content: str             # the prompt text itself
    model: str               # target model, e.g. "gpt-4"
    parameters: dict         # temperature, max_tokens, ...
    created_at: datetime
    created_by: str
    evaluation_score: float  # score from the associated evaluation run
    is_active: bool          # whether this version is currently served
    changelog: str           # reason for the change
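On top of such a model, a small repository layer can activate a version and roll back to an earlier one. The sketch below is hypothetical and keeps everything in memory; a real deployment would back it with a database table:

class PromptRepository:
    def __init__(self):
        # name -> all stored versions of that prompt
        self.versions: dict[str, list[PromptVersion]] = {}

    def get_active(self, name: str) -> PromptVersion:
        """Return the version currently marked as active."""
        return next(v for v in self.versions.get(name, []) if v.is_active)

    def activate(self, name: str, version: str) -> None:
        """Activate one version and deactivate the rest (also used for rollback)."""
        for v in self.versions.get(name, []):
            v.is_active = (v.version == version)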
Approach 3: Dedicated tools
- LangSmith: Prompt management in the LangChain ecosystem
- PromptLayer: Dedicated prompt version management platform
- Humanloop: Prompt optimization and version management
1.3 Best Practices
- Record the reason for each change
- Associate evaluation scores with version numbers
- Support quick rollback to the previous version
- Use semantic versioning (major.minor.patch)
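To illustrate the last point, a version bump for a prompt change could be computed as follows (a minimal sketch; the bump levels map major/minor/patch to the kind of change made):

def bump_version(version: str, level: str = "patch") -> str:
    major, minor, patch = (int(x) for x in version.split("."))
    if level == "major":   # incompatible behavior change
        return f"{major + 1}.0.0"
    if level == "minor":   # new capability, backward compatible
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # wording tweak or fix

# bump_version("1.3.0", "minor") -> "1.4.0"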
2. Model Registry
2.1 Concept
A model registry is a central repository for managing model lifecycles, tracking model versions, metadata, and deployment status.
2.2 MLflow Model Registry
import mlflow

# Register model
with mlflow.start_run():
    mlflow.log_params({
        "model_name": "llama-3-8b",
        "quantization": "AWQ-4bit",
        "fine_tune_dataset": "custom_v2",
    })
    mlflow.log_metrics({
        "eval_accuracy": 0.85,
        "eval_latency_ms": 120,
        "eval_cost_per_1k": 0.002,
    })
    mlflow.log_artifact("model_config.yaml")

    # Register to Registry
    mlflow.register_model(
        model_uri=f"runs:/{mlflow.active_run().info.run_id}/model",
        name="chat-assistant",
    )

# Model stage management
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="chat-assistant",
    version=3,
    stage="Production",  # Staging / Production / Archived
)
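Downstream services can then resolve the current Production version by name instead of hard-coding a run ID. A sketch (assuming a model was logged under the "model" artifact path when the run was registered):

import mlflow

# Load whatever version is currently in the Production stage
prod_model = mlflow.pyfunc.load_model("models:/chat-assistant/Production")

# Inspect which version that is
client = mlflow.tracking.MlflowClient()
latest = client.get_latest_versions("chat-assistant", stages=["Production"])
print(latest[0].version, latest[0].current_stage)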
2.3 Model Metadata
model:
  name: chat-assistant
  version: 3
  base_model: meta-llama/Llama-3-8B-Instruct
  quantization: AWQ-4bit
training:
  dataset: custom_instructions_v2
  epochs: 3
  learning_rate: 2e-5
evaluation:
  accuracy: 0.85
  latency_p95_ms: 150
  hallucination_rate: 0.03
deployment:
  stage: production
  serving_framework: vllm
  gpu: A100-40GB
  replicas: 2
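Such metadata is only useful if it stays complete, so a lightweight check before promotion can help. A sketch (assuming PyYAML; REQUIRED mirrors the fields above and is not exhaustive):

import yaml

REQUIRED = {
    "model": ["name", "version", "base_model"],
    "evaluation": ["accuracy", "latency_p95_ms"],
    "deployment": ["stage", "serving_framework"],
}

def validate_metadata(path: str) -> None:
    """Fail fast if a required metadata field is missing."""
    with open(path) as f:
        meta = yaml.safe_load(f)
    for section, fields in REQUIRED.items():
        missing = [name for name in fields if name not in meta.get(section, {})]
        if missing:
            raise ValueError(f"{section}: missing fields {missing}")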
3. Data Versioning
3.1 DVC (Data Version Control)
# Initialize DVC
dvc init
# Track data files
dvc add data/training_set.jsonl
dvc add data/knowledge_base/
# Push to remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push
# Switch data versions
git checkout v1.0
dvc checkout
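DVC also exposes a Python API for reading a specific data version directly, pinned to a Git revision. A sketch (the repository URL and tag are placeholders):

import dvc.api

with dvc.api.open(
    "data/training_set.jsonl",
    repo="https://github.com/org/llm-project",  # placeholder repository
    rev="v1.0",                                 # Git tag of the data version
) as f:
    first_example = f.readline()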
3.2 Knowledge Base Versioning
from datetime import datetime

class KnowledgeBaseVersion:
    """Version management for RAG knowledge bases"""

    def __init__(self, version_id: str):
        self.version_id = version_id
        self.documents: set[str] = set()  # document IDs contained in this version
        self.metadata = {
            "version": version_id,
            "created_at": datetime.now(),
            "document_count": 0,
            "embedding_model": "text-embedding-3-small",
            "chunk_size": 512,
            "chunk_overlap": 100,
        }

    def add_documents(self, documents: list):
        """Add documents and record the version"""
        # Assumes each document is a dict with an "id" field
        self.documents.update(doc["id"] for doc in documents)
        self.metadata["document_count"] += len(documents)
        # Index documents...

    def diff(self, other_version: "KnowledgeBaseVersion"):
        """Compare document IDs between two versions"""
        added = self.documents - other_version.documents
        removed = other_version.documents - self.documents
        return {"added": added, "removed": removed}
3.3 Conversation Data Management
from datetime import datetime

# Log user conversations for evaluation and improvement
class ConversationLogger:
    def __init__(self, store):
        self.store = store  # any backend that exposes an insert() method

    def log(self, conversation):
        record = {
            "conversation_id": conversation.id,
            "messages": conversation.messages,
            "model": conversation.model,
            "prompt_version": conversation.prompt_version,
            "user_feedback": conversation.feedback,  # thumbs up/down
            "tokens_used": conversation.token_usage,
            "latency_ms": conversation.latency,
            "timestamp": datetime.now(),
        }
        self.store.insert(record)
4. Experiment Tracking
4.1 W&B (Weights & Biases) for LLMs
import wandb

wandb.init(project="llm-prompt-optimization")

# Log prompt experiment
wandb.log({
    "prompt_version": "v1.3",
    "model": "gpt-4",
    "temperature": 0.7,
    "eval_accuracy": 0.85,
    "eval_faithfulness": 0.92,
    "eval_relevancy": 0.88,
    "avg_latency_ms": 1200,
    "avg_cost_per_query": 0.05,
})

# Log prompt content
wandb.log({"system_prompt": wandb.Html(system_prompt_content)})

# Log evaluation samples
table = wandb.Table(columns=["query", "response", "score", "feedback"])
for sample in eval_samples:
    table.add_data(sample.query, sample.response, sample.score, sample.feedback)
wandb.log({"eval_samples": table})
4.2 MLflow for LLMs
import mlflow

with mlflow.start_run(run_name="prompt_v1.3_gpt4"):
    # Log parameters
    mlflow.log_params({
        "prompt_version": "v1.3",
        "model": "gpt-4",
        "temperature": 0.7,
        "chunk_size": 512,
        "top_k": 5,
    })
    # Log metrics
    mlflow.log_metrics({
        "accuracy": 0.85,
        "faithfulness": 0.92,
        "relevancy": 0.88,
        "avg_latency_ms": 1200,
        "cost_per_query": 0.05,
    })
    # Log prompt files
    mlflow.log_artifact("prompts/system_prompt_v1.3.txt")
    mlflow.log_artifact("eval_results.json")
4.3 Experiment Comparison
Experiment 1: prompt_v1.2 + gpt-4 + temp=0.5
Accuracy: 0.80 | Faithfulness: 0.88 | Latency: 1100ms | Cost: $0.04
Experiment 2: prompt_v1.3 + gpt-4 + temp=0.7
Accuracy: 0.85 | Faithfulness: 0.92 | Latency: 1200ms | Cost: $0.05
Experiment 3: prompt_v1.3 + gpt-3.5-turbo + temp=0.7
Accuracy: 0.78 | Faithfulness: 0.85 | Latency: 400ms | Cost: $0.002
Conclusion: Experiment 2 delivers the best quality (if budget allows); Experiment 3 offers the best cost-performance ratio
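If the runs are logged as in section 4.2, this comparison can be pulled straight out of MLflow rather than assembled by hand. A sketch (the experiment name is a placeholder; search_runs returns a pandas DataFrame whose columns are prefixed with params. and metrics.):

import mlflow

runs = mlflow.search_runs(experiment_names=["llm-prompt-optimization"])
cols = [
    "params.prompt_version", "params.model", "params.temperature",
    "metrics.accuracy", "metrics.faithfulness",
    "metrics.avg_latency_ms", "metrics.cost_per_query",
]
print(runs[cols].sort_values("metrics.accuracy", ascending=False))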
5. Configuration Management
5.1 Environment Configuration
# config/production.yaml
llm:
  provider: openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 1024
  timeout: 30
rag:
  embedding_model: text-embedding-3-small
  chunk_size: 512
  chunk_overlap: 100
  top_k: 5
  reranker: cohere-rerank-v3
prompt:
  version: v1.3
  path: prompts/production/
monitoring:
  log_level: INFO
  trace_sampling: 0.1
  alert_threshold:
    latency_p95: 3000
    error_rate: 0.01
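A minimal loader sketch for such per-environment files (assuming PyYAML; the APP_ENV variable and config/ layout follow the example above):

import os
import yaml

def load_config() -> dict:
    env = os.getenv("APP_ENV", "production")  # e.g. development / staging / production
    with open(f"config/{env}.yaml") as f:
        return yaml.safe_load(f)

config = load_config()
model = config["llm"]["model"]  # "gpt-4" in production
top_k = config["rag"]["top_k"]  # 5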
5.2 Feature Flags
from featureflags import FeatureFlags

flags = FeatureFlags()

# Select behavior based on feature flags
if flags.is_enabled("use_new_prompt_v2"):
    prompt = load_prompt("v2.0")
else:
    prompt = load_prompt("v1.3")

if flags.is_enabled("enable_reranker"):
    results = rerank(retrieval_results)
6. Reproducibility Challenges
6.1 Reproducibility Issues with LLMs
- Temperature sampling: Non-zero temperature introduces output randomness
- API version changes: Model providers may update models
- Context differences: RAG retrieval results may differ
- System prompt: Minor modifications cause behavioral changes
6.2 Methods to Improve Reproducibility
# 1. Fix the random seed (if the API supports it)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    temperature=0,  # Minimize sampling randomness (not fully deterministic)
    seed=42,        # Fixed seed for best-effort reproducibility
)
# 2. Log complete requests and responses
experiment_log = {
    "request": {
        "model": "gpt-4",
        "messages": messages,
        "temperature": 0,
        "seed": 42,
    },
    "response": response.model_dump(),
    "metadata": {
        "api_version": "2024-01",
        "prompt_version": "v1.3",
        "rag_index_version": "20240301",
    },
}
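With the request and metadata logged, a run can later be replayed and compared against the stored response. The sketch below treats a mismatch as a drift signal rather than a hard failure, since exact matches are not guaranteed even with a fixed seed (replay_and_compare is an illustrative helper):

def replay_and_compare(client, experiment_log: dict) -> bool:
    """Re-issue the logged request and check whether the output text still matches."""
    new_response = client.chat.completions.create(**experiment_log["request"])
    old_text = experiment_log["response"]["choices"][0]["message"]["content"]
    new_text = new_response.choices[0].message.content
    return old_text == new_text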
7. A/B Testing Setup
See A/B Testing and Deployment for details.
References
- MLflow Documentation
- Weights & Biases Documentation
- DVC Documentation
- LLMOps Overview — high-level introduction to LLMOps
- A/B Testing and Deployment — Detailed deployment strategies