RAG Evaluation and Optimization

1. RAG Evaluation Metrics

1.1 Core Evaluation Dimensions

A RAG system should be evaluated along three dimensions:

Dimension         | Evaluation Target | Key Question
Context Relevance | Retrieval results | Are the retrieved documents relevant to the query?
Faithfulness      | Generated results | Is the generated answer faithful to the retrieved context?
Answer Relevance  | Final output      | Does the answer truly address the user's question?

1.2 Context Relevance

Measures how well the retrieved documents match the user query (a computation sketch follows the list):

  • Precision@K: Proportion of relevant documents in the top K retrieval results
  • Recall@K: Proportion of all relevant documents that were retrieved
  • MRR (Mean Reciprocal Rank): Average of the reciprocal rank of the first relevant document
  • NDCG (Normalized Discounted Cumulative Gain): Relevance score that accounts for ranking position
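
A minimal sketch of how these rank metrics can be computed, assuming each query has a known set of relevant document IDs and the retrieved lists are ranked best-first:

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(relevant)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Mean over queries of 1 / rank of the first relevant document (0 if none is found)."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)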

1.3 Faithfulness

Measures whether the generated answer is faithful to the retrieved context (no fabrication):

Evaluation process (sketched in code below):
1. Decompose the answer into independent claims
2. For each claim, check if supporting evidence exists in the context
3. Faithfulness = number of supported claims / total claims
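
A minimal sketch of this process; decompose_into_claims and claim_is_supported are hypothetical helpers, typically implemented by prompting an LLM to split the answer into claims and to judge whether a claim is entailed by the context:

def faithfulness_score(answer, context, decompose_into_claims, claim_is_supported):
    """Faithfulness = number of supported claims / total claims."""
    claims = decompose_into_claims(answer)        # step 1: split the answer into claims
    if not claims:
        return 0.0
    supported = sum(
        1 for claim in claims
        if claim_is_supported(claim, context)     # step 2: look for supporting evidence
    )
    return supported / len(claims)                # step 3: supported / total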

1.4 Answer Relevance

Measures whether the generated answer truly addresses the user's question:

Evaluation method (sketched in code below):
1. Generate multiple possible questions from the answer
2. Calculate similarity between generated questions and the original question
3. Answer relevance = average similarity
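
A minimal sketch of steps 2-3, assuming the candidate questions have already been generated from the answer by an LLM; it uses sentence-transformers embeddings and cosine similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def answer_relevance(original_question, generated_questions):
    """Mean cosine similarity between LLM-generated questions and the original question."""
    q_emb = model.encode(original_question, convert_to_tensor=True)
    gen_embs = model.encode(generated_questions, convert_to_tensor=True)
    similarities = util.cos_sim(q_emb, gen_embs)  # shape: (1, num_generated)
    return similarities.mean().item()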

2. RAGAS Framework

2.1 RAGAS Overview

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework specifically designed for evaluating RAG systems.

2.2 Installation and Usage

# Installation: pip install ragas datasets
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["What is a Transformer?", "What are the advantages of RAG?"],
    "answer": ["A Transformer is a...", "The main advantages of RAG include..."],
    "contexts": [
        ["The Transformer was proposed by Google in 2017..."],
        ["RAG enhances LLMs by retrieving external knowledge..."],
    ],
    "ground_truth": ["A Transformer is a self-attention based...", "The advantages of RAG are..."],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation (RAGAS uses an LLM as judge; by default this requires OPENAI_API_KEY to be set)
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

print(results)
# {'context_precision': 0.85, 'context_recall': 0.78, 
#  'faithfulness': 0.92, 'answer_relevancy': 0.88}

2.3 RAGAS Metrics in Detail

Metric             | Range | Meaning                                                  | Needs ground truth?
context_precision  | 0-1   | Relevant documents rank higher in the retrieval results | Yes
context_recall     | 0-1   | Information in the ground truth is covered by the retrieved context | Yes
faithfulness       | 0-1   | Answer is faithful to the retrieved context              | No
answer_relevancy   | 0-1   | Answer is relevant to the question                        | No
answer_correctness | 0-1   | Answer is consistent with the ground truth                | Yes

3. Chunk Size Optimization

3.1 Impact of Chunk Size on Retrieval

Small Chunks (128-256 tokens):
+ High retrieval precision (more precise matching)
- Incomplete context information
- Higher retrieval overhead (more vectors)

Large Chunks (1024-2048 tokens):
+ Complete context information
- Lower retrieval precision (more noise)
- May exceed LLM context window

Recommended Chunks (256-512 tokens):
✓ Balance between precision and context

3.2 Optimization Experiments

def evaluate_chunk_size(chunk_sizes, test_queries, ground_truths):
    """Evaluate retrieval quality under different chunk sizes.

    split_documents, build_index, and calculate_precision are project-specific
    helpers; `documents` is the loaded corpus from the enclosing scope.
    """
    results = {}

    for size in chunk_sizes:
        # Re-chunk and re-index with 20% overlap
        chunks = split_documents(documents, chunk_size=size, overlap=size // 5)
        vectorstore = build_index(chunks)
        retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

        # Evaluate retrieval quality
        precision_scores = []
        for query, truth in zip(test_queries, ground_truths):
            docs = retriever.get_relevant_documents(query)
            precision = calculate_precision(docs, truth)
            precision_scores.append(precision)

        results[size] = {
            "avg_precision": sum(precision_scores) / len(precision_scores),
            "num_chunks": len(chunks),
        }

    return results

# Experiment with different chunk sizes
sizes = [128, 256, 512, 1024, 2048]
results = evaluate_chunk_size(sizes, test_queries, ground_truths)

3.3 Parent-Child Strategy

Retrieve with small chunks (high precision)
Return large chunks (complete context)

Parent Chunk (1024 tokens): [Full passage]
  ├── Child Chunk 1 (256 tokens): [Part 1]  ← Used for retrieval
  ├── Child Chunk 2 (256 tokens): [Part 2]  ← Used for retrieval
  └── Child Chunk 3 (256 tokens): [Part 3]  ← Used for retrieval

Match on Child Chunk 2 → Return the entire Parent Chunk
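
One way to implement this pattern is LangChain's ParentDocumentRetriever. A sketch, assuming a local HuggingFace embedding model, an in-memory Chroma collection, and that `documents` is the loaded document list (import paths vary slightly across LangChain versions):

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# Child chunks are embedded and searched; parent chunks are what gets returned
# (the splitter counts characters, so sizes only roughly correspond to token counts)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1024)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256)

vectorstore = Chroma(
    collection_name="child_chunks",
    embedding_function=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
)
docstore = InMemoryStore()  # holds the full parent chunks, keyed by id

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)  # splits into parents/children and indexes the children

# Matching happens on child chunks, but the enclosing parent chunk is returned
parent_docs = retriever.get_relevant_documents("What is the Transformer architecture?")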

4. Reranking

4.1 Why Reranking Is Needed

Vector retrieval (bi-encoder) is fast but limited in precision; reranking (cross-encoder) is more precise but slower:

Stage 1 (Recall): Vector retrieval → Recall Top 100 from millions of documents
Stage 2 (Precision): Reranking → Precisely rank Top 5 from 100 results

4.2 Cohere Rerank

import cohere

co = cohere.Client("your-api-key")

results = co.rerank(
    model="rerank-english-v3.0",
    query="What is the Transformer architecture?",
    documents=[doc.page_content for doc in retrieved_docs],
    top_n=5
)

# Each result carries the index of the original document plus a relevance score
for result in results.results:
    doc = retrieved_docs[result.index]
    print(f"Score: {result.relevance_score:.4f} | {doc.page_content[:100]}")

4.3 Cross-Encoder Reranking

from sentence_transformers import CrossEncoder

# Load cross-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Prepare query-document pairs
pairs = [(query, doc.page_content) for doc in retrieved_docs]

# Compute relevance scores
scores = reranker.predict(pairs)

# Sort by score (descending) and keep the top 5
ranked_results = sorted(
    zip(retrieved_docs, scores), 
    key=lambda x: x[1], 
    reverse=True
)[:5]

4.4 Reranking Model Comparison

Model                  | Type  | Speed  | Accuracy    | Use Cases
Cohere Rerank v3       | API   | Fast   | High        | Production, zero deployment
cross-encoder/ms-marco | Local | Medium | Medium-High | English scenarios
bge-reranker-large     | Local | Medium | High        | Chinese scenarios
ColBERT                | Local | Fast   | Medium-High | Low-latency requirements

5. Hybrid Search

5.1 Dense + Sparse Retrieval

Combines the strengths of semantic search (Dense) and keyword search (Sparse):

from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, vectorstore, documents, alpha=0.5):
        self.vectorstore = vectorstore
        self.alpha = alpha  # dense weight

        # Build BM25 index
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        self.documents = documents

    def search(self, query, k=5):
        # Dense retrieval (scores may be distances depending on the vector store; normalize before fusing)
        dense_results = self.vectorstore.similarity_search_with_score(query, k=k*2)

        # Sparse retrieval (BM25)
        tokenized_query = query.split()
        bm25_scores = self.bm25.get_scores(tokenized_query)

        # Weighted score fusion of dense and BM25 scores (for rank-based fusion see RRF in 5.3)
        combined_scores = {}
        for doc, score in dense_results:
            combined_scores[doc.page_content] = self.alpha * score

        for idx, score in enumerate(bm25_scores):
            content = self.documents[idx]
            if content in combined_scores:
                combined_scores[content] += (1 - self.alpha) * score
            else:
                combined_scores[content] = (1 - self.alpha) * score

        # Sort and return Top K
        sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        return sorted_results[:k]
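
Example usage, assuming `vectorstore` is an existing LangChain vector store and `raw_texts` is the list of raw document strings used to build the BM25 index:

retriever = HybridRetriever(vectorstore, raw_texts, alpha=0.6)  # weight dense slightly higher
top_results = retriever.search("transformer attention mechanism", k=5)
for content, score in top_results:
    print(f"{score:.4f} | {content[:80]}")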

5.2 Retrieval Method Comparison

Scenario                     | Dense Search | Sparse Search | Hybrid Search
Semantically similar queries | Excellent    | Poor          | Excellent
Exact keyword matching       | Poor         | Excellent     | Excellent
Technical terminology        | Medium       | Excellent     | Excellent
Long-tail queries            | Medium       | Medium        | Excellent

5.3 RRF (Reciprocal Rank Fusion)

def reciprocal_rank_fusion(results_lists, k=60):
    """Fuse multiple ranked result lists; k=60 is the smoothing constant from the original RRF paper."""
    fused_scores = {}

    for results in results_lists:
        for rank, (doc_id, _) in enumerate(results):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
            fused_scores[doc_id] += 1 / (rank + k)

    sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results
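
Example usage with toy ranked lists (the scores are illustrative; only rank positions matter for RRF):

dense_ranked = [("doc_3", 0.91), ("doc_7", 0.85), ("doc_1", 0.72)]
bm25_ranked = [("doc_7", 12.4), ("doc_2", 9.8), ("doc_3", 7.1)]

fused = reciprocal_rank_fusion([dense_ranked, bm25_ranked])
# doc_7 and doc_3 rise to the top because both lists rank them highly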

6. Evaluation Pipeline

6.1 End-to-End Evaluation Process

1. Prepare evaluation dataset
   - Question set (covering different types and difficulty levels)
   - Ground truth answers
   - Relevant document annotations

2. Module-level evaluation
   - Retrieval evaluation: Recall@K, Precision@K, MRR
   - Generation evaluation: Faithfulness, Answer Relevancy
   - End-to-end evaluation: Answer Correctness

3. Failure analysis
   - Retrieval miss: Relevant documents not retrieved
   - Ranking error: Relevant documents ranked too low
   - Generation error: Context available but answer is wrong
   - Hallucination: Generated information not in the context

4. Targeted optimization
   - Retrieval miss → Adjust chunking strategy / embedding model
   - Ranking error → Add reranking
   - Generation error → Optimize prompt / switch model
   - Hallucination → Add faithfulness constraints

6.2 Failure Analysis Framework

def analyze_failures(eval_results):
    """Analyze failure modes of the RAG system"""
    failures = {
        "retrieval_miss": [],      # Relevant documents not retrieved
        "ranking_error": [],       # Relevant documents ranked too low
        "generation_error": [],    # Context available but answer wrong
        "hallucination": [],       # Hallucinated content
    }

    for item in eval_results:
        if item["context_recall"] < 0.3:
            failures["retrieval_miss"].append(item)
        elif item["context_precision"] < 0.3:
            failures["ranking_error"].append(item)
        elif item["faithfulness"] < 0.5:
            failures["hallucination"].append(item)
        elif item["answer_correctness"] < 0.5:
            failures["generation_error"].append(item)

    return failures

7. Optimization Checklist

7.1 Retrieval Optimization

  • [ ] Experiment with different chunk sizes (256/512/1024)
  • [ ] Try Parent-Child chunking strategy
  • [ ] Add hybrid search (Dense + BM25)
  • [ ] Implement query rewriting (HyDE/Multi-Query)
  • [ ] Add reranking module

7.2 Generation Optimization

  • [ ] Optimize RAG prompt templates
  • [ ] Implement context compression
  • [ ] Add source citation requirements
  • [ ] Implement Self-RAG verification

7.3 System Optimization

  • [ ] Build an automated evaluation pipeline
  • [ ] Set up continuous monitoring
  • [ ] Implement incremental index updates
  • [ ] Optimize query latency
