RAG Evaluation and Optimization

1. RAG Evaluation Metrics

1.1 Core Evaluation Dimensions

A RAG system should be evaluated along three dimensions:

Dimension         | Evaluation Target | Key Question
Context Relevance | Retrieval results | Are the retrieved documents relevant to the query?
Faithfulness      | Generated results | Is the generated answer faithful to the retrieved context?
Answer Relevance  | Final output      | Does the answer truly address the user's question?

1.2 Context Relevance

Measures how well the retrieved documents match the user query (a computation sketch follows the list):

  • Precision@K: Proportion of relevant documents in the top K retrieval results
  • Recall@K: Proportion of all relevant documents that were retrieved
  • MRR (Mean Reciprocal Rank): Average of the reciprocal rank of the first relevant document
  • NDCG (Normalized Discounted Cumulative Gain): Relevance score that accounts for ranking position
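
A minimal sketch of how these rank metrics can be computed, assuming each query has a known set of relevant document IDs and the retrieved lists are ranked best-first:

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(relevant)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Mean over queries of 1 / rank of the first relevant document (0 if none is found)."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)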

1.3 Faithfulness

Measures whether the generated answer is faithful to the retrieved context (no fabrication):

Evaluation process (sketched in code below):
1. Decompose the answer into independent claims
2. For each claim, check if supporting evidence exists in the context
3. Faithfulness = number of supported claims / total claims
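
A minimal sketch of this process; decompose_into_claims and claim_is_supported are hypothetical helpers, typically implemented by prompting an LLM to split the answer into claims and to judge whether a claim is entailed by the context:

def faithfulness_score(answer, context, decompose_into_claims, claim_is_supported):
    """Faithfulness = number of supported claims / total claims."""
    claims = decompose_into_claims(answer)        # step 1: split the answer into claims
    if not claims:
        return 0.0
    supported = sum(
        1 for claim in claims
        if claim_is_supported(claim, context)     # step 2: look for supporting evidence
    )
    return supported / len(claims)                # step 3: supported / total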

1.4 Answer Relevance

Measures whether the generated answer truly addresses the user's question:

Evaluation method (sketched in code below):
1. Generate multiple possible questions from the answer
2. Calculate similarity between generated questions and the original question
3. Answer relevance = average similarity
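
A minimal sketch of steps 2-3, assuming the candidate questions have already been generated from the answer by an LLM; it uses sentence-transformers embeddings and cosine similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def answer_relevance(original_question, generated_questions):
    """Mean cosine similarity between LLM-generated questions and the original question."""
    q_emb = model.encode(original_question, convert_to_tensor=True)
    gen_embs = model.encode(generated_questions, convert_to_tensor=True)
    similarities = util.cos_sim(q_emb, gen_embs)  # shape: (1, num_generated)
    return similarities.mean().item()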

2. RAGAS Framework

2.1 RAGAS Overview

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework specifically designed for evaluating RAG systems.

2.2 Installation and Usage

# Installation: pip install ragas datasets
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["What is a Transformer?", "What are the advantages of RAG?"],
    "answer": ["A Transformer is a...", "The main advantages of RAG include..."],
    "contexts": [
        ["The Transformer was proposed by Google in 2017..."],
        ["RAG enhances LLMs by retrieving external knowledge..."],
    ],
    "ground_truth": ["A Transformer is a self-attention based...", "The advantages of RAG are..."],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation (RAGAS uses an LLM as judge; by default this requires OPENAI_API_KEY to be set)
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

print(results)
# {'context_precision': 0.85, 'context_recall': 0.78, 
#  'faithfulness': 0.92, 'answer_relevancy': 0.88}

2.3 RAGAS Metrics in Detail

Metric             | Range | Meaning                                                  | Needs ground truth?
context_precision  | 0-1   | Relevant documents rank higher in the retrieval results | Yes
context_recall     | 0-1   | Information in the ground truth is covered by the retrieved context | Yes
faithfulness       | 0-1   | Answer is faithful to the retrieved context              | No
answer_relevancy   | 0-1   | Answer is relevant to the question                        | No
answer_correctness | 0-1   | Answer is consistent with the ground truth                | Yes

3. Chunk Size Optimization

3.1 Impact of Chunk Size on Retrieval

Small Chunks (128-256 tokens):
+ High retrieval precision (more precise matching)
- Incomplete context information
- Higher retrieval overhead (more vectors)

Large Chunks (1024-2048 tokens):
+ Complete context information
- Lower retrieval precision (more noise)
- May exceed LLM context window

Recommended Chunks (256-512 tokens):
✓ Balance between precision and context

3.2 Optimization Experiments

def evaluate_chunk_size(chunk_sizes, test_queries, ground_truths):
    """Evaluate retrieval quality under different chunk sizes.

    split_documents, build_index, and calculate_precision are project-specific
    helpers; `documents` is the loaded corpus from the enclosing scope.
    """
    results = {}

    for size in chunk_sizes:
        # Re-chunk and re-index with 20% overlap
        chunks = split_documents(documents, chunk_size=size, overlap=size // 5)
        vectorstore = build_index(chunks)
        retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

        # Evaluate retrieval quality
        precision_scores = []
        for query, truth in zip(test_queries, ground_truths):
            docs = retriever.get_relevant_documents(query)
            precision = calculate_precision(docs, truth)
            precision_scores.append(precision)

        results[size] = {
            "avg_precision": sum(precision_scores) / len(precision_scores),
            "num_chunks": len(chunks),
        }

    return results

# Experiment with different chunk sizes
sizes = [128, 256, 512, 1024, 2048]
results = evaluate_chunk_size(sizes, test_queries, ground_truths)

3.3 Parent-Child Strategy

Retrieve with small chunks (high precision)
Return large chunks (complete context)

Parent Chunk (1024 tokens): [Full passage]
  ├── Child Chunk 1 (256 tokens): [Part 1]  ← Used for retrieval
  ├── Child Chunk 2 (256 tokens): [Part 2]  ← Used for retrieval
  └── Child Chunk 3 (256 tokens): [Part 3]  ← Used for retrieval

Match on Child Chunk 2 → Return the entire Parent Chunk
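
One way to implement this pattern is LangChain's ParentDocumentRetriever. A sketch, assuming a local HuggingFace embedding model, an in-memory Chroma collection, and that `documents` is the loaded document list (import paths vary slightly across LangChain versions):

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# Child chunks are embedded and searched; parent chunks are what gets returned
# (the splitter counts characters, so sizes only roughly correspond to token counts)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1024)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256)

vectorstore = Chroma(
    collection_name="child_chunks",
    embedding_function=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
)
docstore = InMemoryStore()  # holds the full parent chunks, keyed by id

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)  # splits into parents/children and indexes the children

# Matching happens on child chunks, but the enclosing parent chunk is returned
parent_docs = retriever.get_relevant_documents("What is the Transformer architecture?")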

4. Reranking

4.1 Why Reranking Is Needed

Vector retrieval (bi-encoder) is fast but limited in precision; reranking (cross-encoder) is more precise but slower:

Stage 1 (Recall): Vector retrieval → Recall Top 100 from millions of documents
Stage 2 (Precision): Reranking → Precisely rank Top 5 from 100 results

4.2 Cohere Rerank

import cohere

co = cohere.Client("your-api-key")

results = co.rerank(
    model="rerank-english-v3.0",
    query="What is the Transformer architecture?",
    documents=[doc.page_content for doc in retrieved_docs],
    top_n=5
)

# Each result carries the index of the original document plus a relevance score
for result in results.results:
    doc = retrieved_docs[result.index]
    print(f"Score: {result.relevance_score:.4f} | {doc.page_content[:100]}")

4.3 Cross-Encoder Reranking

from sentence_transformers import CrossEncoder

# Load cross-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Prepare query-document pairs
pairs = [(query, doc.page_content) for doc in retrieved_docs]

# Compute relevance scores
scores = reranker.predict(pairs)

# Sort by score (descending) and keep the top 5
ranked_results = sorted(
    zip(retrieved_docs, scores), 
    key=lambda x: x[1], 
    reverse=True
)[:5]

4.4 Reranking Model Comparison

Model                  | Type  | Speed  | Accuracy    | Use Cases
Cohere Rerank v3       | API   | Fast   | High        | Production, zero deployment
cross-encoder/ms-marco | Local | Medium | Medium-High | English scenarios
bge-reranker-large     | Local | Medium | High        | Chinese scenarios
ColBERT                | Local | Fast   | Medium-High | Low-latency requirements

5. Hybrid Search

5.1 Dense + Sparse Retrieval

Combines the strengths of semantic search (Dense) and keyword search (Sparse):

from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, vectorstore, documents, alpha=0.5):
        self.vectorstore = vectorstore
        self.alpha = alpha  # dense weight

        # Build BM25 index
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        self.documents = documents

    def search(self, query, k=5):
        # Dense retrieval (scores may be distances depending on the vector store; normalize before fusing)
        dense_results = self.vectorstore.similarity_search_with_score(query, k=k*2)

        # Sparse retrieval (BM25)
        tokenized_query = query.split()
        bm25_scores = self.bm25.get_scores(tokenized_query)

        # Weighted score fusion of dense and BM25 scores (for rank-based fusion see RRF in 5.3)
        combined_scores = {}
        for doc, score in dense_results:
            combined_scores[doc.page_content] = self.alpha * score

        for idx, score in enumerate(bm25_scores):
            content = self.documents[idx]
            if content in combined_scores:
                combined_scores[content] += (1 - self.alpha) * score
            else:
                combined_scores[content] = (1 - self.alpha) * score

        # Sort and return Top K
        sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        return sorted_results[:k]
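
Example usage, assuming `vectorstore` is an existing LangChain vector store and `raw_texts` is the list of raw document strings used to build the BM25 index:

retriever = HybridRetriever(vectorstore, raw_texts, alpha=0.6)  # weight dense slightly higher
top_results = retriever.search("transformer attention mechanism", k=5)
for content, score in top_results:
    print(f"{score:.4f} | {content[:80]}")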

5.2 Retrieval Method Comparison

Scenario                     | Dense Search | Sparse Search | Hybrid Search
Semantically similar queries | Excellent    | Poor          | Excellent
Exact keyword matching       | Poor         | Excellent     | Excellent
Technical terminology        | Medium       | Excellent     | Excellent
Long-tail queries            | Medium       | Medium        | Excellent

5.3 RRF (Reciprocal Rank Fusion)

def reciprocal_rank_fusion(results_lists, k=60):
    """Fuse multiple ranked result lists; k=60 is the smoothing constant from the original RRF paper."""
    fused_scores = {}

    for results in results_lists:
        for rank, (doc_id, _) in enumerate(results):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
            fused_scores[doc_id] += 1 / (rank + k)

    sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results
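
Example usage with toy ranked lists (the scores are illustrative; only rank positions matter for RRF):

dense_ranked = [("doc_3", 0.91), ("doc_7", 0.85), ("doc_1", 0.72)]
bm25_ranked = [("doc_7", 12.4), ("doc_2", 9.8), ("doc_3", 7.1)]

fused = reciprocal_rank_fusion([dense_ranked, bm25_ranked])
# doc_7 and doc_3 rise to the top because both lists rank them highly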

6. Evaluation Pipeline

6.1 End-to-End Evaluation Process

1. Prepare evaluation dataset
   - Question set (covering different types and difficulty levels)
   - Ground truth answers
   - Relevant document annotations

2. Module-level evaluation
   - Retrieval evaluation: Recall@K, Precision@K, MRR
   - Generation evaluation: Faithfulness, Answer Relevancy
   - End-to-end evaluation: Answer Correctness

3. Failure analysis
   - Retrieval miss: Relevant documents not retrieved
   - Ranking error: Relevant documents ranked too low
   - Generation error: Context available but answer is wrong
   - Hallucination: Generated information not in the context

4. Targeted optimization
   - Retrieval miss → Adjust chunking strategy / embedding model
   - Ranking error → Add reranking
   - Generation error → Optimize prompt / switch model
   - Hallucination → Add faithfulness constraints

6.2 Failure Analysis Framework

def analyze_failures(eval_results):
    """Analyze failure modes of the RAG system"""
    failures = {
        "retrieval_miss": [],      # Relevant documents not retrieved
        "ranking_error": [],       # Relevant documents ranked too low
        "generation_error": [],    # Context available but answer wrong
        "hallucination": [],       # Hallucinated content
    }

    for item in eval_results:
        if item["context_recall"] < 0.3:
            failures["retrieval_miss"].append(item)
        elif item["context_precision"] < 0.3:
            failures["ranking_error"].append(item)
        elif item["faithfulness"] < 0.5:
            failures["hallucination"].append(item)
        elif item["answer_correctness"] < 0.5:
            failures["generation_error"].append(item)

    return failures

7. Optimization Checklist

7.1 Retrieval Optimization

  • [ ] Experiment with different chunk sizes (256/512/1024)
  • [ ] Try Parent-Child chunking strategy
  • [ ] Add hybrid search (Dense + BM25)
  • [ ] Implement query rewriting (HyDE/Multi-Query)
  • [ ] Add reranking module

7.2 Generation Optimization

  • [ ] Optimize RAG prompt templates
  • [ ] Implement context compression
  • [ ] Add source citation requirements
  • [ ] Implement Self-RAG verification

7.3 System Optimization

  • [ ] Build an automated evaluation pipeline
  • [ ] Set up continuous monitoring
  • [ ] Implement incremental index updates
  • [ ] Optimize query latency
