RAG Evaluation and Optimization
1. RAG Evaluation Metrics
1.1 Core Evaluation Dimensions
RAG system evaluation covers three core dimensions:
| Dimension | Evaluation Target | Key Question |
|---|---|---|
| Context Relevance | Retrieval results | Are the retrieved documents relevant to the query? |
| Faithfulness | Generated results | Is the generated answer faithful to the retrieved context? |
| Answer Relevance | Final output | Does the answer truly address the user's question? |
1.2 Context Relevance
Measures how well the retrieved documents match the user query (a minimal sketch of these metrics follows the list):
- Precision@K: Proportion of relevant documents in the top K retrieval results
- Recall@K: Proportion of all relevant documents that were retrieved
- MRR (Mean Reciprocal Rank): Mean, over all queries, of the reciprocal rank of the first relevant document
- NDCG (Normalized Discounted Cumulative Gain): Relevance score that accounts for ranking position
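A minimal sketch of the first three metrics, assuming `retrieved` is a ranked list of document IDs and `relevant` is the set of IDs annotated as relevant for the query:

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, 0 if none was retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

# MRR is the mean of reciprocal_rank over all queries in the evaluation set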
1.3 Faithfulness
Measures whether the generated answer is faithful to the retrieved context (no fabrication):
Evaluation process (see the sketch after this list):
1. Decompose the answer into independent claims
2. For each claim, check if supporting evidence exists in the context
3. Faithfulness = number of supported claims / total claims
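A sketch of this process using an LLM as the judge. Here `llm` is a hypothetical completion function (a thin wrapper around whatever chat API you use), and the prompts are illustrative:

def faithfulness_score(answer, context, llm):
    # Step 1: decompose the answer into atomic claims (one per line)
    claims = llm(
        f"Break the following answer into independent factual claims, "
        f"one per line:\n\n{answer}"
    ).strip().split("\n")
    # Step 2: check each claim against the retrieved context
    supported = 0
    for claim in claims:
        verdict = llm(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            f"Is the claim supported by the context? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            supported += 1
    # Step 3: faithfulness = supported claims / total claims
    return supported / len(claims) if claims else 0.0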
1.4 Answer Relevance
Measures whether the generated answer truly addresses the user's question:
Evaluation method (sketched below):
1. Generate multiple possible questions from the answer
2. Calculate similarity between generated questions and the original question
3. Answer relevance = average similarity
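A sketch under the same LLM-judge assumption: generate candidate questions from the answer with the hypothetical `llm` function, then compare embeddings (here with sentence-transformers; the model choice is illustrative):

from sentence_transformers import SentenceTransformer, util

def answer_relevance(question, answer, llm, n=3):
    # Step 1: generate n questions that the answer would plausibly answer
    generated = [
        llm(f"Write one question that this answer responds to:\n\n{answer}")
        for _ in range(n)
    ]
    # Step 2: embed the generated questions and the original question
    model = SentenceTransformer("all-MiniLM-L6-v2")
    q_emb = model.encode(question, convert_to_tensor=True)
    g_embs = model.encode(generated, convert_to_tensor=True)
    # Step 3: answer relevance = mean cosine similarity
    return util.cos_sim(q_emb, g_embs).mean().item()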
2. RAGAS Framework
2.1 RAGAS Overview
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework specifically designed for evaluating RAG systems.
2.2 Installation and Usage
# pip install ragas datasets
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["What is a Transformer?", "What are the advantages of RAG?"],
    "answer": ["A Transformer is a...", "The main advantages of RAG include..."],
    "contexts": [
        ["The Transformer was proposed by Google in 2017..."],
        ["RAG enhances LLMs by retrieving external knowledge..."],
    ],
    "ground_truth": ["A Transformer is a self-attention based...", "The advantages of RAG are..."],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation (RAGAS uses an LLM judge under the hood;
# by default it expects an OpenAI API key unless another LLM is configured)
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)
print(results)
# Example output:
# {'context_precision': 0.85, 'context_recall': 0.78,
#  'faithfulness': 0.92, 'answer_relevancy': 0.88}
2.3 RAGAS Metrics in Detail
| Metric | Range | Meaning | Requires ground truth |
|---|---|---|---|
| context_precision | 0-1 | Relevant documents are ranked near the top of the retrieval results | Yes |
| context_recall | 0-1 | The information in the ground truth is covered by the retrieved context | Yes |
| faithfulness | 0-1 | The answer is faithful to the retrieved context | No |
| answer_relevancy | 0-1 | The answer is relevant to the question | No |
| answer_correctness | 0-1 | The answer is consistent with the ground truth | Yes |
3. Chunk Size Optimization
3.1 Impact of Chunk Size on Retrieval
Small Chunks (128-256 tokens):
+ High retrieval precision (more precise matching)
- Incomplete context information
- Higher retrieval overhead (more vectors)
Large Chunks (1024-2048 tokens):
+ Complete context information
- Lower retrieval precision (more noise)
- May exceed LLM context window
Medium Chunks (256-512 tokens, recommended starting point):
✓ Balances retrieval precision with context completeness
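A minimal splitter sketch using LangChain's RecursiveCharacterTextSplitter. Note that its chunk_size counts characters, not tokens; pass a token-based length_function if you need exact token budgets:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# ~512-unit chunks with ~20% overlap (character-based approximation)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=100)
chunks = splitter.split_documents(documents)  # documents: list of Documents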
3.2 Optimization Experiments
def evaluate_chunk_size(chunk_sizes, test_queries, ground_truths):
    """Evaluate the effect of different chunk sizes on retrieval precision.

    Assumes `documents` plus the helpers `split_documents`, `build_index`,
    and `calculate_precision` are defined elsewhere in your pipeline.
    """
    results = {}
    for size in chunk_sizes:
        # Re-chunk and re-index with 20% overlap
        chunks = split_documents(documents, chunk_size=size, overlap=size // 5)
        vectorstore = build_index(chunks)
        retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
        # Evaluate retrieval quality over the test queries
        precision_scores = []
        for query, truth in zip(test_queries, ground_truths):
            docs = retriever.get_relevant_documents(query)
            precision = calculate_precision(docs, truth)
            precision_scores.append(precision)
        results[size] = {
            "avg_precision": sum(precision_scores) / len(precision_scores),
            "num_chunks": len(chunks),
        }
    return results

# Experiment with different chunk sizes
sizes = [128, 256, 512, 1024, 2048]
results = evaluate_chunk_size(sizes, test_queries, ground_truths)
3.3 Parent-Child Strategy
Retrieve with small chunks (high precision), then return their larger parent chunks (complete context):
Parent Chunk (1024 tokens): [Full passage]
├── Child Chunk 1 (256 tokens): [Part 1] ← Used for retrieval
├── Child Chunk 2 (256 tokens): [Part 2] ← Used for retrieval
└── Child Chunk 3 (256 tokens): [Part 3] ← Used for retrieval
Match on Child Chunk 2 → Return the entire Parent Chunk
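One way to implement this is LangChain's ParentDocumentRetriever. A sketch, assuming an OpenAI embedding backend and import paths current as of recent LangChain versions (swap in whatever embeddings and vectorstore you use):

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Child chunks are embedded and searched; parent chunks are what gets returned
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1024)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256)

vectorstore = Chroma(collection_name="children", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # maps parent IDs to full parent chunks

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)  # documents: list of LangChain Documents
docs = retriever.get_relevant_documents("What is a Transformer?")  # returns parent chunks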
4. Reranking
4.1 Why Reranking Is Needed
Vector retrieval (bi-encoder) is fast but limited in precision; reranking (cross-encoder) is more precise but slower:
Stage 1 (Recall): Vector retrieval → Recall Top 100 from millions of documents
Stage 2 (Precision): Reranking → Precisely rank Top 5 from 100 results
4.2 Cohere Rerank
import cohere

co = cohere.Client("your-api-key")

results = co.rerank(
    model="rerank-english-v3.0",
    query="What is the Transformer architecture?",
    documents=[doc.page_content for doc in retrieved_docs],
    top_n=5,
)

# Each result carries the index of the document in the input list
for result in results.results:
    doc = retrieved_docs[result.index]
    print(f"Score: {result.relevance_score:.4f} | {doc.page_content[:100]}")
4.3 Cross-Encoder Reranking
from sentence_transformers import CrossEncoder

# Load the cross-encoder reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Prepare (query, document) pairs
pairs = [(query, doc.page_content) for doc in retrieved_docs]

# Compute relevance scores (one forward pass per pair)
scores = reranker.predict(pairs)

# Sort by score and keep the top 5
ranked_results = sorted(
    zip(retrieved_docs, scores),
    key=lambda x: x[1],
    reverse=True,
)[:5]
4.4 Reranking Model Comparison
| Model | Type | Speed | Accuracy | Use Cases |
|---|---|---|---|---|
| Cohere Rerank v3 | API | Fast | High | Production, zero-deployment |
| cross-encoder/ms-marco | Local | Medium | Medium-High | English scenarios |
| bge-reranker-large | Local | Medium | High | Chinese scenarios |
| ColBERT | Local | Fast | Medium-High | Low-latency requirements |
5. Hybrid Search
5.1 Dense + Sparse Retrieval
Combines the strengths of semantic search (Dense) and keyword search (Sparse):
from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, vectorstore, documents, alpha=0.5):
        self.vectorstore = vectorstore
        self.alpha = alpha  # weight of the dense score; (1 - alpha) goes to BM25
        # Build a BM25 index over whitespace-tokenized documents
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        self.documents = documents

    def search(self, query, k=5):
        # Dense retrieval (fetch extra candidates for fusion).
        # Note: some vectorstores return distances (lower = better);
        # this sketch assumes higher = more similar.
        dense_results = self.vectorstore.similarity_search_with_score(query, k=k * 2)
        # Sparse retrieval (BM25)
        tokenized_query = query.split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        # Weighted score fusion. Dense and BM25 scores live on different
        # scales; normalize both to [0, 1] before mixing in production.
        combined_scores = {}
        for doc, score in dense_results:
            combined_scores[doc.page_content] = self.alpha * score
        for idx, score in enumerate(bm25_scores):
            content = self.documents[idx]
            combined_scores[content] = combined_scores.get(content, 0.0) + (1 - self.alpha) * score
        # Sort and return the top k
        sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        return sorted_results[:k]
5.2 Advantages of Hybrid Search
| Scenario | Dense Search | Sparse Search | Hybrid Search |
|---|---|---|---|
| Semantically similar queries | Excellent | Poor | Excellent |
| Exact keyword matching | Poor | Excellent | Excellent |
| Technical terminology | Medium | Excellent | Excellent |
| Long-tail queries | Medium | Medium | Excellent |
5.3 RRF (Reciprocal Rank Fusion)
def reciprocal_rank_fusion(results_lists, k=60):
    """Fuse multiple ranked result lists (k=60 is the constant from the original RRF paper)."""
    fused_scores = {}
    for results in results_lists:
        for rank, (doc_id, _) in enumerate(results, start=1):  # ranks are 1-based
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0.0
            fused_scores[doc_id] += 1 / (k + rank)
    sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results
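For example, fusing a hypothetical dense ranking with a BM25 ranking (the raw scores are ignored; only rank positions matter):

# (doc_id, score) lists from two retrievers, already sorted by score
dense_ranking = [("d1", 0.92), ("d3", 0.85), ("d2", 0.71)]
bm25_ranking = [("d2", 11.4), ("d1", 9.8), ("d4", 7.2)]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
# d1 and d2 rank highest: they appear near the top of both lists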
6. Evaluation Pipeline
6.1 End-to-End Evaluation Process
1. Prepare evaluation dataset
- Question set (covering different types and difficulty levels)
- Ground truth answers
- Relevant document annotations
2. Module-level evaluation
- Retrieval evaluation: Recall@K, Precision@K, MRR
- Generation evaluation: Faithfulness, Answer Relevancy
- End-to-end evaluation: Answer Correctness
3. Failure analysis
- Retrieval miss: Relevant documents not retrieved
- Ranking error: Relevant documents ranked too low
- Generation error: Context available but answer is wrong
- Hallucination: Generated information not in the context
4. Targeted optimization
- Retrieval miss → Adjust chunking strategy / embedding model
- Ranking error → Add reranking
- Generation error → Optimize prompt / switch model
- Hallucination → Add faithfulness constraints
6.2 Failure Analysis Framework
def analyze_failures(eval_results):
    """Bucket RAG failures by mode (thresholds below are illustrative; tune per system)."""
    failures = {
        "retrieval_miss": [],    # Relevant documents not retrieved
        "ranking_error": [],     # Relevant documents ranked too low
        "generation_error": [],  # Context available but answer wrong
        "hallucination": [],     # Generated information not in the context
    }
    for item in eval_results:
        if item["context_recall"] < 0.3:
            failures["retrieval_miss"].append(item)
        elif item["context_precision"] < 0.3:
            failures["ranking_error"].append(item)
        elif item["faithfulness"] < 0.5:
            failures["hallucination"].append(item)
        elif item["answer_correctness"] < 0.5:
            failures["generation_error"].append(item)
    return failures
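A typical first pass is to count the cases per failure mode to decide where to optimize first:

failures = analyze_failures(eval_results)
for mode, items in failures.items():
    print(f"{mode}: {len(items)} cases")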
7. Optimization Checklist
7.1 Retrieval Optimization
- [ ] Experiment with different chunk sizes (256/512/1024)
- [ ] Try Parent-Child chunking strategy
- [ ] Add hybrid search (Dense + BM25)
- [ ] Implement query rewriting (HyDE/Multi-Query)
- [ ] Add reranking module
7.2 Generation Optimization
- [ ] Optimize RAG prompt templates
- [ ] Implement context compression
- [ ] Add source citation requirements
- [ ] Implement Self-RAG verification
7.3 System Optimization
- [ ] Build an automated evaluation pipeline
- [ ] Set up continuous monitoring
- [ ] Implement incremental index updates
- [ ] Optimize query latency
References
- Es et al., "RAGAS: Automated Evaluation of Retrieval Augmented Generation", 2023
- RAG Architecture Design — Overall RAG pipeline architecture
- Vector Database in Practice — Embedding models and vector database selection