Experiment Management and Version Control

1. Prompt Versioning

1.1 Why Prompt Versioning Is Needed

  • Minor prompt modifications can cause dramatic output quality changes
  • Need to track the reason and effect of each modification
  • Support A/B testing and rollback
  • Team collaboration requires unified management

1.2 Version Management Approaches

Approach 1: Git-based management

prompts/
├── chat/
│   ├── system_prompt.txt
│   ├── few_shot_examples.json
│   └── CHANGELOG.md
├── classification/
│   ├── system_prompt.txt
│   └── CHANGELOG.md
└── prompt_config.yaml

# prompt_config.yaml
chat:
  version: "1.3"
  model: "gpt-4"
  temperature: 0.7
  system_prompt: "prompts/chat/system_prompt.txt"
  few_shot: "prompts/chat/few_shot_examples.json"

classification:
  version: "2.1"
  model: "gpt-3.5-turbo"
  temperature: 0.0
  system_prompt: "prompts/classification/system_prompt.txt"

Approach 2: Database management

from dataclasses import dataclass
from datetime import datetime

@dataclass
class PromptVersion:
    id: str
    name: str
    version: str
    content: str                # the prompt text itself
    model: str                  # target model, e.g. "gpt-4"
    parameters: dict            # temperature, max_tokens, ...
    created_at: datetime
    created_by: str
    evaluation_score: float     # score from offline evaluation
    is_active: bool             # whether this version serves production traffic
    changelog: str              # reason for the change
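
With records stored this way, the application looks up the currently active version at startup. A minimal in-memory sketch (in a real system this would be a database query):

# Sketch: select the active version of a named prompt from a list of PromptVersion records
def get_active_prompt(versions: list[PromptVersion], name: str) -> PromptVersion:
    candidates = [v for v in versions if v.name == name and v.is_active]
    if len(candidates) != 1:
        raise ValueError(f"Expected exactly one active version of '{name}', found {len(candidates)}")
    return candidates[0]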

Approach 3: Dedicated tools

  • LangSmith: Prompt management in the LangChain ecosystem
  • PromptLayer: Dedicated prompt version management platform
  • Humanloop: Prompt optimization and version management

1.3 Best Practices

  • Record the reason for each change
  • Associate evaluation scores with version numbers
  • Support quick rollback to the previous version (see the sketch after this list)
  • Use semantic versioning (major.minor.patch)
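
With the PromptVersion records from section 1.2, a rollback can be as simple as flipping the is_active flag back to the previous version. An illustrative, in-memory sketch:

# Sketch: roll back to the version created immediately before the active one (illustrative only)
def rollback(versions: list[PromptVersion], name: str) -> PromptVersion:
    history = sorted((v for v in versions if v.name == name), key=lambda v: v.created_at)
    current_idx = next(i for i, v in enumerate(history) if v.is_active)
    if current_idx == 0:
        raise ValueError("No earlier version to roll back to")
    history[current_idx].is_active = False
    history[current_idx - 1].is_active = True
    return history[current_idx - 1]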

2. Model Registry

2.1 Concept

A model registry is a central repository for managing model lifecycles, tracking model versions, metadata, and deployment status.

2.2 MLflow Model Registry

import mlflow

# Register model
with mlflow.start_run():
    mlflow.log_params({
        "model_name": "llama-3-8b",
        "quantization": "AWQ-4bit",
        "fine_tune_dataset": "custom_v2",
    })
    mlflow.log_metrics({
        "eval_accuracy": 0.85,
        "eval_latency_ms": 120,
        "eval_cost_per_1k": 0.002,
    })
    mlflow.log_artifact("model_config.yaml")

    # Register to Registry
    mlflow.register_model(
        model_uri=f"runs:/{mlflow.active_run().info.run_id}/model",
        name="chat-assistant"
    )

# Model stage management
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="chat-assistant",
    version=3,
    stage="Production"  # Staging / Production / Archived
)
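
Serving code can then resolve the registered model by stage rather than by a hard-coded version number, which decouples deployment from specific versions. A minimal sketch, assuming a pyfunc-compatible model artifact was logged under the run's model/ path:

# Sketch: load whatever version is currently in the Production stage
import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/chat-assistant/Production")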

2.3 Model Metadata

model:
  name: chat-assistant
  version: 3
  base_model: meta-llama/Llama-3-8B-Instruct
  quantization: AWQ-4bit

training:
  dataset: custom_instructions_v2
  epochs: 3
  learning_rate: 2e-5

evaluation:
  accuracy: 0.85
  latency_p95_ms: 150
  hallucination_rate: 0.03

deployment:
  stage: production
  serving_framework: vllm
  gpu: A100-40GB
  replicas: 2
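
One way to keep this metadata attached to the registered version, rather than in a loose file, is to store it as model-version tags. A sketch (the file name model_metadata.yaml is a hypothetical container for the YAML above):

# Sketch: attach metadata from the YAML above as tags on registered model version 3
import yaml
from mlflow.tracking import MlflowClient

client = MlflowClient()
with open("model_metadata.yaml") as f:          # hypothetical file holding the YAML above
    metadata = yaml.safe_load(f)
for section, values in metadata.items():
    for key, value in values.items():
        client.set_model_version_tag("chat-assistant", "3", f"{section}.{key}", str(value))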

3. Data Versioning

3.1 DVC (Data Version Control)

# Initialize DVC
dvc init

# Track data files
dvc add data/training_set.jsonl
dvc add data/knowledge_base/

# Push to remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push

# Switch data versions
git checkout v1.0
dvc checkout
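
To tie a specific data version to an experiment record, you can read the content hash DVC writes into the corresponding .dvc file. The sketch below assumes the layout created by `dvc add` above and the md5-based .dvc format:

# Sketch: read the content hash DVC recorded for a tracked file, to log as the data version
import yaml

def dvc_data_hash(dvc_file: str) -> str:
    with open(dvc_file) as f:
        meta = yaml.safe_load(f)
    return meta["outs"][0]["md5"]

data_version = dvc_data_hash("data/training_set.jsonl.dvc")
# e.g. mlflow.log_param("data_version", data_version)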

3.2 Knowledge Base Versioning

from datetime import datetime

class KnowledgeBaseVersion:
    """Version management for RAG knowledge bases"""

    def __init__(self, version_id: str):
        self.version_id = version_id
        self.documents: set[str] = set()   # document identifiers in this version
        self.metadata = {
            "version": version_id,
            "created_at": datetime.now(),
            "document_count": 0,
            "embedding_model": "text-embedding-3-small",
            "chunk_size": 512,
            "chunk_overlap": 100,
        }

    def add_documents(self, documents: list):
        """Add documents and record them in this version"""
        self.documents.update(documents)
        self.metadata["document_count"] = len(self.documents)
        # Index documents...

    def diff(self, other_version: "KnowledgeBaseVersion"):
        """Compare document sets between two versions"""
        added = self.documents - other_version.documents
        removed = other_version.documents - self.documents
        return {"added": added, "removed": removed}

3.3 Conversation Data Management

from datetime import datetime

# Log user conversations for evaluation and improvement
class ConversationLogger:
    def __init__(self, store):
        self.store = store   # any backend exposing insert(), e.g. a database collection

    def log(self, conversation):
        record = {
            "conversation_id": conversation.id,
            "messages": conversation.messages,
            "model": conversation.model,
            "prompt_version": conversation.prompt_version,
            "user_feedback": conversation.feedback,  # thumbs up/down
            "tokens_used": conversation.token_usage,
            "latency_ms": conversation.latency,
            "timestamp": datetime.now(),
        }
        self.store.insert(record)
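
Logged conversations can later be mined for evaluation data, for example by pulling negatively rated turns into a regression set. The query below is a sketch against a hypothetical store interface:

# Sketch: collect conversations the user rated negatively (the store's find() API is hypothetical)
def export_negative_feedback(store, prompt_version: str) -> list[dict]:
    return [
        r for r in store.find({"prompt_version": prompt_version})
        if r["user_feedback"] == "down"
    ]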

4. Experiment Tracking

4.1 W&B (Weights & Biases) for LLMs

import wandb

wandb.init(project="llm-prompt-optimization")

# Log prompt experiment
wandb.log({
    "prompt_version": "v1.3",
    "model": "gpt-4",
    "temperature": 0.7,
    "eval_accuracy": 0.85,
    "eval_faithfulness": 0.92,
    "eval_relevancy": 0.88,
    "avg_latency_ms": 1200,
    "avg_cost_per_query": 0.05,
})

# Log prompt content
wandb.log({"system_prompt": wandb.Html(system_prompt_content)})

# Log evaluation samples
table = wandb.Table(columns=["query", "response", "score", "feedback"])
for sample in eval_samples:
    table.add_data(sample.query, sample.response, sample.score, sample.feedback)
wandb.log({"eval_samples": table})

4.2 MLflow for LLMs

import mlflow

with mlflow.start_run(run_name="prompt_v1.3_gpt4"):
    # Log parameters
    mlflow.log_params({
        "prompt_version": "v1.3",
        "model": "gpt-4",
        "temperature": 0.7,
        "chunk_size": 512,
        "top_k": 5,
    })

    # Log metrics
    mlflow.log_metrics({
        "accuracy": 0.85,
        "faithfulness": 0.92,
        "relevancy": 0.88,
        "avg_latency_ms": 1200,
        "cost_per_query": 0.05,
    })

    # Log prompt files
    mlflow.log_artifact("prompts/system_prompt_v1.3.txt")
    mlflow.log_artifact("eval_results.json")

4.3 Experiment Comparison

Experiment 1: prompt_v1.2 + gpt-4 + temp=0.5
  Accuracy: 0.80 | Faithfulness: 0.88 | Latency: 1100ms | Cost: $0.04

Experiment 2: prompt_v1.3 + gpt-4 + temp=0.7
  Accuracy: 0.85 | Faithfulness: 0.92 | Latency: 1200ms | Cost: $0.05

Experiment 3: prompt_v1.3 + gpt-3.5-turbo + temp=0.7
  Accuracy: 0.78 | Faithfulness: 0.85 | Latency: 400ms | Cost: $0.002

Conclusion: Experiment 2 is optimal if budget allows; Experiment 3 offers the best cost-performance ratio.
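
This kind of comparison can also be pulled programmatically from the tracking backend instead of assembled by hand. A sketch using MLflow's run search (the experiment name is illustrative):

# Sketch: compare logged runs side by side (search_runs returns a pandas DataFrame)
import mlflow

runs = mlflow.search_runs(experiment_names=["llm-prompt-optimization"])
print(runs[["run_id", "params.prompt_version", "params.model",
            "metrics.accuracy", "metrics.avg_latency_ms", "metrics.cost_per_query"]])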

5. Configuration Management

5.1 Environment Configuration

# config/production.yaml
llm:
  provider: openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 1024
  timeout: 30

rag:
  embedding_model: text-embedding-3-small
  chunk_size: 512
  chunk_overlap: 100
  top_k: 5
  reranker: cohere-rerank-v3

prompt:
  version: v1.3
  path: prompts/production/

monitoring:
  log_level: INFO
  trace_sampling: 0.1
  alert_threshold:
    latency_p95: 3000
    error_rate: 0.01
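
At startup the application picks the config file for its environment. A minimal sketch, assuming PyYAML and the config/ layout above:

# Sketch: load environment-specific configuration (assumes PyYAML)
import yaml

def load_config(env: str) -> dict:
    with open(f"config/{env}.yaml") as f:
        return yaml.safe_load(f)

config = load_config("production")
print(config["llm"]["model"], config["prompt"]["version"])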

5.2 Feature Flags

# Illustrative feature-flag client; substitute your provider's SDK (e.g. LaunchDarkly, Unleash)
from featureflags import FeatureFlags

flags = FeatureFlags()

# Select behavior based on feature flags
if flags.is_enabled("use_new_prompt_v2"):
    prompt = load_prompt("v2.0")
else:
    prompt = load_prompt("v1.3")

if flags.is_enabled("enable_reranker"):
    results = rerank(retrieval_results)

6. Reproducibility Challenges

6.1 Reproducibility Issues with LLMs

  • Temperature sampling: Non-zero temperature introduces output randomness
  • API version changes: Model providers may update models
  • Context differences: RAG retrieval results may differ
  • System prompt: Minor modifications cause behavioral changes

6.2 Methods to Improve Reproducibility

# 1. Fix the random seed (if the API supports it)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    temperature=0,  # Minimize sampling randomness
    seed=42,        # Fixed seed for best-effort reproducibility
)

# 2. Log complete requests and responses
experiment_log = {
    "request": {
        "model": "gpt-4",
        "messages": messages,
        "temperature": 0,
        "seed": 42,
    },
    "response": response.model_dump(),
    "metadata": {
        "api_version": "2024-01",
        "prompt_version": "v1.3",
        "rag_index_version": "20240301",
    }
}
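
To check whether a run is reproducible, the logged request can be replayed and the new response diffed against the stored one. A sketch, comparing only the text of the first choice:

# Sketch: replay a logged request and compare the new output with the stored one
def replay(client, experiment_log: dict) -> bool:
    new_response = client.chat.completions.create(**experiment_log["request"])
    old_text = experiment_log["response"]["choices"][0]["message"]["content"]
    new_text = new_response.choices[0].message.content
    return old_text == new_text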

7. A/B Testing Setup

See A/B Testing and Deployment for details.
