RAG Architecture Design
1. RAG Overview
Retrieval-Augmented Generation (RAG) provides contextual information to LLMs by retrieving relevant documents before generation, addressing issues of knowledge cutoff, hallucination, and insufficient domain knowledge.
1.1 Why RAG Is Needed
| Problem | Pure LLM | RAG |
|---|---|---|
| Knowledge timeliness | Limited to training data cutoff | Can access latest information |
| Hallucination | Prone to fabricating facts | Answers based on retrieved facts |
| Domain knowledge | Generalized but not deep | Can connect to specialized knowledge bases |
| Traceability | Cannot trace information sources | Can cite specific documents |
| Cost | High fine-tuning cost | No training needed, just update knowledge base |
1.2 RAG vs Fine-tuning
- RAG: Suitable for knowledge-intensive tasks, requiring up-to-date information, frequently updated knowledge
- Fine-tuning: Suitable for adjusting model behavior/style, domain-specific expression, tasks requiring deep understanding
- RAG + Fine-tuning: In practice the two are often combined, with RAG supplying up-to-date knowledge and fine-tuning shaping behavior and style
2. RAG Evolution
2.1 Naive RAG
The most basic RAG implementation:
User Query → Vector Retrieval → Concatenate Context → LLM Generation
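A minimal sketch of this loop, assuming a vector store and LLM configured as in Section 3 (LangChain-style objects; the prompt wording is illustrative):
# Minimal Naive RAG loop (assumes `vectorstore` and `llm` are set up as in Section 3)
def naive_rag(query, vectorstore, llm, k=3):
    # 1. Vector retrieval: fetch the top-k most similar chunks
    docs = vectorstore.similarity_search(query, k=k)
    # 2. Concatenate the retrieved chunks into a single context string
    context = "\n\n".join(doc.page_content for doc in docs)
    # 3. Generate an answer grounded in the retrieved context
    prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm.predict(prompt)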
Limitations:
- Retrieval quality depends on query quality
- Simple vector similarity may be inaccurate
- Cannot handle questions requiring multi-hop reasoning
- Limited context window
2.2 Advanced RAG
Adds pre-processing and post-processing on top of Naive RAG:
User Query → Query Rewriting → Vector Retrieval → Reranking → Context Compression → LLM Generation
Improvements:
- Query optimization (rewriting, expansion, decomposition)
- Retrieval optimization (hybrid search, reranking)
- Generation optimization (context compression, self-reflection)
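As an example of the retrieval optimizations above, hybrid search can be sketched by fusing a keyword (BM25) retriever with the vector retriever. This sketch uses LangChain's BM25Retriever (which requires the rank_bm25 package) and EnsembleRetriever; the weights are illustrative, and `chunks` and `vectorstore` are assumed from Section 3:
# Hybrid search sketch: fuse BM25 (keyword) and vector retrieval results
from langchain.retrievers import BM25Retriever, EnsembleRetriever

bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Weighted fusion of the two ranked result lists
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)
docs = hybrid_retriever.get_relevant_documents("What is a Transformer?")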
2.3 Modular RAG
Decomposes RAG into composable modules with flexible orchestration:
[Router Module] → Decides processing strategy
[Retrieval Module] → Multi-source retrieval + fusion
[Reranking Module] → Relevance ordering
[Compression Module] → Context refinement
[Generation Module] → Generation based on optimized context
[Verification Module] → Checks generation quality
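One way such modules might be orchestrated is sketched below; every module object and method name here is a hypothetical interface chosen for illustration, not a standard API:
# Illustrative orchestration of composable RAG modules (all interfaces are hypothetical)
def modular_rag(query, router, retrievers, reranker, compressor, generator, verifier):
    strategy = router.route(query)                    # Router: decide processing strategy
    docs = []
    for source in strategy.sources:                   # Retrieval: multi-source + fusion
        docs.extend(retrievers[source].retrieve(query))
    docs = reranker.rerank(query, docs)               # Reranking: relevance ordering
    context = compressor.compress(query, docs)        # Compression: context refinement
    answer = generator.generate(query, context)       # Generation on the optimized context
    if not verifier.check(query, context, answer):    # Verification: quality check
        answer = generator.generate(query, context)   # e.g. retry or escalate on failure
    return answer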
3. RAG Pipeline in Detail
3.1 Document Loading (Load)
from langchain.document_loaders import (
    PyPDFLoader, TextLoader, UnstructuredHTMLLoader,
    CSVLoader, DirectoryLoader
)
# Load documents of different formats
pdf_loader = PyPDFLoader("document.pdf")
html_loader = UnstructuredHTMLLoader("page.html")
csv_loader = CSVLoader("data.csv")
# Batch load a directory
dir_loader = DirectoryLoader(
    "./docs/",
    glob="**/*.md",
    show_progress=True
)
documents = dir_loader.load()
Supported data sources: PDF, HTML, Markdown, CSV, JSON, databases, APIs, web scraping
3.2 Document Chunking (Chunk)
Fixed-Size Chunking
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
    chunk_size=1000,    # Size per chunk
    chunk_overlap=200,  # Overlap region
    separator="\n\n"    # Separator
)
chunks = splitter.split_documents(documents)
Recursive Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]  # Separators tried from coarsest to finest
)
Chunking Strategy Comparison
| Strategy | Pros | Cons | Use Cases |
|---|---|---|---|
| Fixed-size | Simple, fast | May break semantics | General purpose |
| Semantic | Preserves semantic integrity | Higher compute overhead | Irregular document structure |
| Recursive | Flexible, good results | Requires tuning | Most scenarios (recommended) |
| Document structure | Leverages document structure | Depends on format | Structured docs (HTML, Markdown) |
| Sentence-level | Most semantically complete | Chunks too small, low retrieval efficiency | Precision matching |
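For the document-structure strategy, a sketch using LangChain's MarkdownHeaderTextSplitter follows; the header-to-metadata mapping is an assumption about your documents' heading levels:
# Structure-aware chunking for Markdown: split on headings, keeping them as metadata
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "section"),
    ("##", "subsection"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_chunks = md_splitter.split_text(markdown_text)  # markdown_text: raw Markdown string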
Choosing chunk size:
- Too small: Loses contextual information
- Too large: Reduces retrieval precision, may exceed context window
- Rule of thumb: 256-1024 tokens
- Recommendation: Determine optimal size through experimentation
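A rough way to run that experiment is to rebuild the index at several chunk sizes and measure retrieval hit rate on a small question set. The sketch below assumes the `documents` and `embeddings` objects from the surrounding sections and a hypothetical `eval_set` of question/answer-snippet pairs built from your own corpus:
# Rough chunk-size sweep: rebuild the index per size and check whether a chunk
# containing the known answer snippet is retrieved.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

def hit_rate_for_chunk_size(documents, embeddings, eval_set, chunk_size):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_size // 5)
    chunks = splitter.split_documents(documents)
    store = Chroma.from_documents(chunks, embedding=embeddings)
    hits = 0
    for item in eval_set:  # item: {"question": ..., "answer_snippet": ...}
        retrieved = store.similarity_search(item["question"], k=5)
        if any(item["answer_snippet"] in doc.page_content for doc in retrieved):
            hits += 1
    return hits / len(eval_set)

for size in (256, 512, 1024):
    print(size, hit_rate_for_chunk_size(documents, embeddings, eval_set, size))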
3.3 Embedding (Embed)
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])
See Vector Database in Practice for embedding model comparisons.
3.4 Indexing (Index)
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
3.5 Retrieval (Retrieve)
# Basic semantic retrieval
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)
# MMR retrieval (balances relevance and diversity)
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.5}
)
docs = retriever.get_relevant_documents("What is a Transformer?")
3.6 Reranking (Rerank)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Rerank retrieved results
pairs = [(query, doc.page_content) for doc in retrieved_docs]
scores = reranker.predict(pairs)
# Sort by score
reranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)
3.7 Generation (Generate)
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",  # stuff / map_reduce / refine
    retriever=retriever,
    return_source_documents=True
)
result = qa_chain({"query": "What is the self-attention mechanism in Transformers?"})
3.8 RAG Pipeline Overview
graph TD
A[Document Loading] --> B[Document Chunking]
B --> C[Vector Embedding]
C --> D[Vector Indexing]
E[User Query] --> F[Query Transformation]
F --> G[Vector Retrieval]
D --> G
G --> H[Reranking]
H --> I[Context Construction]
I --> J[LLM Generation]
J --> K[Output]
F --> |HyDE| F1[Generate Hypothetical Doc]
F1 --> G
F --> |Multi-Query| F2[Generate Multiple Queries]
F2 --> G
style A fill:#e1f5fe
style B fill:#e1f5fe
style C fill:#e1f5fe
style D fill:#e1f5fe
style E fill:#fff3e0
style J fill:#e8f5e9
4. Query Transformation
4.1 HyDE (Hypothetical Document Embeddings)
Have the LLM generate a hypothetical answer document first, then use that document's embedding for retrieval:
def hyde_retrieval(query, llm, embeddings, vectorstore):
    # 1. Generate a hypothetical answer document
    hypothesis_prompt = f"Write a passage that might answer the following question:\n{query}"
    hypothesis_doc = llm.predict(hypothesis_prompt)
    # 2. Retrieve using the hypothetical document's embedding
    hypothesis_embedding = embeddings.embed_query(hypothesis_doc)
    docs = vectorstore.similarity_search_by_vector(hypothesis_embedding, k=5)
    return docs
Rationale: The hypothetical document's embedding is closer to the real answer in embedding space, yielding better retrieval than using the question directly.
4.2 Multi-Query
Rewrite the original query into multiple queries from different angles, then merge retrieval results:
def multi_query_retrieval(query, llm, retriever):
    # Generate multiple query variants
    prompt = f"""Rewrite the following question from different perspectives, generating 3 related but different queries:
Original question: {query}
Query 1:
Query 2:
Query 3:"""
    queries = [q.strip() for q in llm.predict(prompt).split("\n") if q.strip()]
    # Merge retrieval results from all queries, deduplicating by content
    # (Document objects are not hashable, so dedupe on page_content)
    seen, all_docs = set(), []
    for q in queries:
        for doc in retriever.get_relevant_documents(q):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                all_docs.append(doc)
    return all_docs
4.3 Query Decomposition
Decompose complex questions into sub-questions, retrieve separately, then merge:
Original question: What are the differences between Transformer and RNN in handling long sequences?
Sub-question 1: How does Transformer handle long sequences?
Sub-question 2: How does RNN handle long sequences?
Sub-question 3: What are the main differences between Transformer and RNN?
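A sketch of this pattern, retrieving per sub-question and pooling the results; the prompt wording and the line-by-line parsing of sub-questions are assumptions:
# Query decomposition sketch: split a complex question into sub-questions,
# retrieve for each, and pool the deduplicated context
def decomposition_retrieval(query, llm, retriever):
    prompt = (
        "Break the following question into 2-4 simpler sub-questions, one per line:\n"
        f"{query}"
    )
    sub_questions = [q.strip() for q in llm.predict(prompt).split("\n") if q.strip()]
    # Retrieve for each sub-question and deduplicate by content
    seen, docs = set(), []
    for sub_q in sub_questions:
        for doc in retriever.get_relevant_documents(sub_q):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                docs.append(doc)
    return docs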
5. Advanced RAG Patterns
5.1 Self-RAG
Dynamically decides whether retrieval is needed during generation:
1. Receive query
2. Determine: Is retrieval needed?
- Yes → Retrieve → Evaluate retrieval relevance → Generate → Verify
- No → Generate directly
3. Self-verify the generated result
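The original Self-RAG method trains the model to emit reflection tokens; a simplified, prompt-based approximation of the same control flow might look like the sketch below. The yes/no decision prompts stand in for the trained reflection mechanism and are assumptions:
# Simplified, prompt-based approximation of the Self-RAG control flow
def self_rag(query, llm, retriever):
    # 1. Decide whether retrieval is needed
    need = llm.predict(f"Does answering this require looking up external facts? Answer yes or no.\n{query}")
    if "yes" in need.lower():
        docs = retriever.get_relevant_documents(query)
        # 2. Keep only documents the model judges relevant
        relevant = [d for d in docs
                    if "yes" in llm.predict(f"Is this passage relevant to '{query}'? Answer yes or no:\n{d.page_content}").lower()]
        context = "\n\n".join(d.page_content for d in relevant)
        answer = llm.predict(f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context.")
    else:
        answer = llm.predict(query)
    # 3. Self-verify the generated answer and report the verdict alongside it
    verdict = llm.predict(
        f"Is the following answer supported and on-topic? Answer yes or no.\nQuestion: {query}\nAnswer: {answer}"
    )
    return answer, "yes" in verdict.lower()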
5.2 CRAG (Corrective RAG)
Evaluates and corrects retrieved results:
1. Retrieve documents
2. Evaluate document relevance
- Relevant → Use
- Irrelevant → Discard + supplement with web search
- Partially relevant → Extract relevant parts
3. Generate based on corrected context
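A sketch of the corrective step follows; the relevance grader is a plain LLM prompt here, and `web_search` is a placeholder for whatever external search tool you use (both are assumptions, not the original CRAG components):
# Corrective RAG sketch: grade each retrieved document and patch the context
# `web_search(query)` is a placeholder returning a text snippet from an external search tool
def corrective_rag(query, llm, retriever, web_search):
    docs = retriever.get_relevant_documents(query)
    kept = []
    for doc in docs:
        grade = llm.predict(
            f"Grade this passage's relevance to the question as one of: relevant, partial, irrelevant.\n"
            f"Question: {query}\nPassage: {doc.page_content}"
        ).lower()
        if "irrelevant" in grade:
            continue  # discard irrelevant documents
        if "partial" in grade:
            # Extract only the relevant part of a partially relevant passage
            doc.page_content = llm.predict(
                f"Extract only the sentences relevant to '{query}':\n{doc.page_content}"
            )
        kept.append(doc.page_content)
    if not kept:
        kept = [web_search(query)]  # supplement with web search when nothing survives
    context = "\n\n".join(kept)
    return llm.predict(f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context.")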
5.3 Graph RAG
Enhances retrieval with knowledge graphs:
1. Build knowledge graph (entities + relationships)
2. Vector retrieval + graph traversal
3. Obtain multi-hop relationship information
4. Merge vector retrieval and graph results
5. LLM generation
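A sketch of combining vector retrieval with graph traversal is shown below; the graph interface (`get_neighbors` returning triples) and the entity-extraction prompt are illustrative rather than any specific library's API:
# Graph RAG sketch: merge vector hits with multi-hop facts from a knowledge graph
# `graph.get_neighbors(entity, hops)` is a hypothetical interface returning (head, relation, tail) triples
def graph_rag(query, llm, retriever, graph):
    # 1. Standard vector retrieval
    docs = retriever.get_relevant_documents(query)
    # 2. Extract entities from the query and traverse the graph around them
    entities = [e.strip() for e in llm.predict(
        f"List the named entities in this question, comma-separated:\n{query}"
    ).split(",") if e.strip()]
    triples = []
    for entity in entities:
        triples.extend(graph.get_neighbors(entity, hops=2))
    # 3. Merge both sources into one context
    graph_facts = "\n".join(f"{h} -[{r}]-> {t}" for h, r, t in triples)
    context = "\n\n".join(d.page_content for d in docs) + "\n\nGraph facts:\n" + graph_facts
    return llm.predict(f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context.")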
6. Summary
The choice of RAG architecture depends on specific requirements:
- Simple Q&A: Naive RAG is sufficient
- Enterprise applications: Advanced RAG + reranking + hybrid search
- Complex reasoning: Modular RAG + Self-RAG / CRAG
- Knowledge graph scenarios: Graph RAG
Key principles:
- Evaluation-driven: Establish evaluation baselines before optimizing
- Progressive enhancement: Start simple, add complexity as needed
- End-to-end optimization: Jointly optimize all stages rather than tuning in isolation
References
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", 2020
- Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey", 2024
- RAG-Augmented Memory — RAG applications in Agents
- Vector Database in Practice — Embedding models and vector database selection
- RAG Evaluation and Optimization — RAG system evaluation methods