
RAG Architecture Design

1. RAG Overview

Retrieval-Augmented Generation (RAG) provides contextual information to LLMs by retrieving relevant documents before generation, addressing issues of knowledge cutoff, hallucination, and insufficient domain knowledge.

1.1 Why RAG Is Needed

| Problem | Pure LLM | RAG |
| --- | --- | --- |
| Knowledge timeliness | Limited to training data cutoff | Can access latest information |
| Hallucination | Prone to fabricating facts | Answers grounded in retrieved facts |
| Domain knowledge | Broad but not deep | Can connect to specialized knowledge bases |
| Traceability | Cannot trace information sources | Can cite specific documents |
| Cost | High fine-tuning cost | No training needed, just update the knowledge base |

1.2 RAG vs Fine-tuning

  • RAG: Suitable for knowledge-intensive tasks, requiring up-to-date information, frequently updated knowledge
  • Fine-tuning: Suitable for adjusting model behavior/style, domain-specific expression, tasks requiring deep understanding
  • RAG + Fine-tuning: In practice the best results often come from combining both, with fine-tuning shaping behavior and style while RAG supplies up-to-date knowledge

2. RAG Evolution

2.1 Naive RAG

The most basic RAG implementation:

User Query → Vector Retrieval → Concatenate Context → LLM Generation
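
A minimal sketch of this flow, assuming the vectorstore and llm objects built in Section 3 (the prompt wording is illustrative):

# Minimal Naive RAG: retrieve top-k chunks, stuff them into the prompt, generate
def naive_rag(query, vectorstore, llm, k=3):
    docs = vectorstore.similarity_search(query, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.predict(prompt)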

Limitations:

  • Retrieval quality depends on query quality
  • Simple vector similarity may be inaccurate
  • Cannot handle questions requiring multi-hop reasoning
  • Limited context window

2.2 Advanced RAG

Adds pre-processing and post-processing on top of Naive RAG:

User Query → Query Rewriting → Vector Retrieval → Reranking → Context Compression → LLM Generation

Improvements:

  • Query optimization (rewriting, expansion, decomposition)
  • Retrieval optimization (hybrid search, reranking; see the hybrid search sketch after this list)
  • Generation optimization (context compression, self-reflection)
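
As an example of hybrid search, LangChain's EnsembleRetriever can fuse BM25 keyword retrieval with the vector retriever built in Section 3. A sketch assuming the chunks and vectorstore objects from Section 3 exist (BM25Retriever requires the rank_bm25 package):

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Keyword retriever over the same chunks indexed in the vector store
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Fuse keyword and semantic results with equal weight
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vectorstore.as_retriever(search_kwargs={"k": 5})],
    weights=[0.5, 0.5],
)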

2.3 Modular RAG

Decomposes RAG into composable modules with flexible orchestration:

[Router Module] → Decides processing strategy
[Retrieval Module] → Multi-source retrieval + fusion
[Reranking Module] → Relevance ordering
[Compression Module] → Context refinement
[Generation Module] → Generation based on optimized context
[Verification Module] → Checks generation quality
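
One way to read this is as a pipeline of interchangeable functions. A hypothetical orchestration sketch (the module names mirror the diagram above; this is not a specific framework's API):

def modular_rag(query, modules):
    strategy = modules["router"](query)           # decide processing strategy
    docs = modules["retrieve"](query, strategy)   # multi-source retrieval + fusion
    docs = modules["rerank"](query, docs)         # relevance ordering
    context = modules["compress"](query, docs)    # context refinement
    answer = modules["generate"](query, context)  # generation on optimized context
    return modules["verify"](query, answer, context)  # check generation quality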

3. RAG Pipeline in Detail

3.1 Document Loading (Load)

from langchain.document_loaders import (
    PyPDFLoader, TextLoader, UnstructuredHTMLLoader,
    CSVLoader, DirectoryLoader
)

# Load documents of different formats
pdf_loader = PyPDFLoader("document.pdf")
html_loader = UnstructuredHTMLLoader("page.html")
csv_loader = CSVLoader("data.csv")

# Batch load a directory
dir_loader = DirectoryLoader(
    "./docs/", 
    glob="**/*.md",
    show_progress=True
)
documents = dir_loader.load()

Supported data sources: PDF, HTML, Markdown, CSV, JSON, databases, APIs, web scraping

3.2 Document Chunking (Chunk)

Fixed-Size Chunking

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,      # Size per chunk
    chunk_overlap=200,    # Overlap region
    separator="\n\n"      # Separator
)
chunks = splitter.split_documents(documents)

Recursive Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]  # Tried in order, from coarse to fine
)
chunks = splitter.split_documents(documents)

Chunking Strategy Comparison

| Strategy | Pros | Cons | Use Cases |
| --- | --- | --- | --- |
| Fixed-size | Simple, fast | May break semantics | General purpose |
| Semantic | Preserves semantic integrity | Higher compute overhead | Irregular document structure |
| Recursive | Flexible, good results | Requires tuning | Most scenarios (recommended) |
| Document structure | Leverages document structure | Depends on format | Structured docs (HTML, Markdown) |
| Sentence-level | Most semantically complete | Chunks too small, low retrieval efficiency | Precision matching |

Choosing chunk size:

  • Too small: Loses contextual information
  • Too large: Reduces retrieval precision, may exceed context window
  • Rule of thumb: 256-1024 tokens
  • Recommendation: Determine optimal size through experimentation

3.3 Embedding (Embed)

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])

See Vector Database in Practice for embedding model comparisons.

3.4 Indexing (Index)

from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

3.5 Retrieval (Retrieve)

# Basic semantic retrieval
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# MMR retrieval (balances relevance and diversity)
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.5}
)

docs = retriever.get_relevant_documents("What is a Transformer?")

3.6 Reranking (Rerank)

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Rerank retrieved results (retrieved_docs: the documents returned in Section 3.5)
pairs = [(query, doc.page_content) for doc in retrieved_docs]
scores = reranker.predict(pairs)

# Sort by score and keep only the top results for generation
reranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)
top_docs = [doc for doc, score in reranked[:3]]

3.7 Generation (Generate)

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",  # stuff / map_reduce / refine
    retriever=retriever,
    return_source_documents=True
)

result = qa_chain({"query": "What is the self-attention mechanism in Transformers?"})

3.8 RAG Pipeline Overview

graph TD
    A[Document Loading] --> B[Document Chunking]
    B --> C[Vector Embedding]
    C --> D[Vector Indexing]

    E[User Query] --> F[Query Transformation]
    F --> G[Vector Retrieval]
    D --> G
    G --> H[Reranking]
    H --> I[Context Construction]
    I --> J[LLM Generation]
    J --> K[Output]

    F --> |HyDE| F1[Generate Hypothetical Doc]
    F1 --> G

    F --> |Multi-Query| F2[Generate Multiple Queries]
    F2 --> G

    style A fill:#e1f5fe
    style B fill:#e1f5fe
    style C fill:#e1f5fe
    style D fill:#e1f5fe
    style E fill:#fff3e0
    style J fill:#e8f5e9

4. Query Transformation

4.1 HyDE (Hypothetical Document Embeddings)

Have the LLM generate a hypothetical answer document first, then use that document's embedding for retrieval:

def hyde_retrieval(query, llm, embeddings, vectorstore):
    # 1. Generate hypothetical document
    hypothesis_prompt = f"Write a passage that might answer the following question:\n{query}"
    hypothesis_doc = llm.predict(hypothesis_prompt)

    # 2. Retrieve using the hypothetical document's embedding
    hypothesis_embedding = embeddings.embed_query(hypothesis_doc)
    docs = vectorstore.similarity_search_by_vector(hypothesis_embedding, k=5)

    return docs

Rationale: An answer-shaped passage sits closer to the relevant documents in embedding space than the short question itself, so retrieving with it is often more accurate than retrieving with the question directly.

4.2 Multi-Query

Rewrite the original query into multiple queries from different angles, then merge retrieval results:

def multi_query_retrieval(query, llm, retriever):
    # Generate multiple query variants
    prompt = f"""Rewrite the following question from different perspectives, generating 3 related but different queries (one per line):
    Original question: {query}"""

    queries = [q.strip() for q in llm.predict(prompt).split("\n") if q.strip()]

    # Merge retrieval results from all queries, deduplicating by content
    # (Document objects are not hashable, so a set of documents would fail)
    unique_docs = {}
    for q in queries:
        for doc in retriever.get_relevant_documents(q):
            unique_docs[doc.page_content] = doc

    return list(unique_docs.values())

4.3 Query Decomposition

Decompose complex questions into sub-questions, retrieve separately, then merge:

Original question: What are the differences between Transformer and RNN in handling long sequences?

Sub-question 1: How does Transformer handle long sequences?
Sub-question 2: How does RNN handle long sequences?
Sub-question 3: What are the main differences between Transformer and RNN?
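
A hedged sketch of this pattern (the decomposition prompt and the deduplication strategy are illustrative):

def decompose_and_retrieve(query, llm, retriever):
    # 1. Ask the LLM to break the question into simpler sub-questions
    prompt = (
        "Decompose the following complex question into 2-4 simpler "
        f"sub-questions, one per line:\n{query}"
    )
    sub_questions = [q.strip() for q in llm.predict(prompt).split("\n") if q.strip()]

    # 2. Retrieve for each sub-question and merge, deduplicating by content
    merged = {}
    for sq in sub_questions:
        for doc in retriever.get_relevant_documents(sq):
            merged[doc.page_content] = doc
    return list(merged.values())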

5. Advanced RAG Patterns

5.1 Self-RAG

Dynamically decides whether retrieval is needed during generation:

1. Receive query
2. Determine: Is retrieval needed?
   - Yes → Retrieve → Evaluate retrieval relevance → Generate → Verify
   - No → Generate directly
3. Self-verify the generated result
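
A simplified control-flow sketch of these steps (the yes/no judgment prompts here are stand-ins for the reflection tokens the Self-RAG paper trains into the model):

def self_rag(query, llm, retriever):
    # Steps 1-2: decide whether retrieval is needed
    needs = llm.predict(
        f"Does answering this question require external knowledge? Answer yes or no.\n{query}")
    if "yes" in needs.lower():
        docs = retriever.get_relevant_documents(query)
        # Keep only documents the model judges relevant
        relevant = [d for d in docs if "yes" in llm.predict(
            f"Is this passage relevant to the question? Answer yes or no.\n"
            f"Question: {query}\nPassage: {d.page_content}").lower()]
        context = "\n\n".join(d.page_content for d in relevant)
        answer = llm.predict(f"Context:\n{context}\n\nQuestion: {query}")
    else:
        answer = llm.predict(query)
    # Step 3: self-verify the generated result
    verdict = llm.predict(
        f"Is this answer well supported and factual? Answer yes or no.\n"
        f"Question: {query}\nAnswer: {answer}")
    return answer if "yes" in verdict.lower() else None  # caller may retry or fall back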

5.2 CRAG (Corrective RAG)

Evaluates and corrects retrieved results:

1. Retrieve documents
2. Evaluate document relevance
   - Relevant → Use
   - Irrelevant → Discard + supplement with web search
   - Partially relevant → Extract relevant parts
3. Generate based on corrected context
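
A hedged sketch of the correction loop; web_search is a hypothetical fallback you would implement with a search API, and the grading prompt is illustrative:

def crag(query, llm, retriever, web_search):
    docs = retriever.get_relevant_documents(query)
    corrected = []
    for doc in docs:
        grade = llm.predict(
            "Grade this passage for the question as 'relevant', 'partial', or 'irrelevant'.\n"
            f"Question: {query}\nPassage: {doc.page_content}").lower()
        if "irrelevant" in grade:
            continue  # discard
        elif "partial" in grade:
            # Extract only the relevant parts
            corrected.append(llm.predict(
                "Extract only the sentences relevant to the question.\n"
                f"Question: {query}\nPassage: {doc.page_content}"))
        else:
            corrected.append(doc.page_content)
    if not corrected:
        corrected = web_search(query)  # supplement with web search
    context = "\n\n".join(corrected)
    return llm.predict(f"Context:\n{context}\n\nQuestion: {query}")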

5.3 Graph RAG

Enhances retrieval with knowledge graphs:

1. Build knowledge graph (entities + relationships)
2. Vector retrieval + graph traversal
3. Obtain multi-hop relationship information
4. Merge vector retrieval and graph results
5. LLM generation
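
An illustrative sketch of steps 2-4, assuming a pre-built networkx knowledge graph kg whose nodes are entities, and a hypothetical extract_entities NER helper:

import networkx as nx

def graph_rag(query, llm, retriever, kg, extract_entities):
    # Step 2a: vector retrieval
    vector_docs = retriever.get_relevant_documents(query)

    # Steps 2b-3: graph traversal, expanding each query entity to its 2-hop neighborhood
    graph_facts = []
    for entity in extract_entities(query):
        if entity in kg:
            subgraph = nx.ego_graph(kg, entity, radius=2)
            for u, v, data in subgraph.edges(data=True):
                graph_facts.append(f"{u} -[{data.get('relation', 'related_to')}]-> {v}")

    # Steps 4-5: merge both sources and generate
    context = "\n\n".join(d.page_content for d in vector_docs)
    facts = "\n".join(graph_facts)
    return llm.predict(
        f"Documents:\n{context}\n\nGraph facts:\n{facts}\n\nQuestion: {query}")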

6. Summary

The choice of RAG architecture depends on specific requirements:

  • Simple Q&A: Naive RAG is sufficient
  • Enterprise applications: Advanced RAG + reranking + hybrid search
  • Complex reasoning: Modular RAG + Self-RAG / CRAG
  • Knowledge graph scenarios: Graph RAG

Key principles:

  1. Evaluation-driven: Establish evaluation baselines before optimizing
  2. Progressive enhancement: Start simple, add complexity as needed
  3. End-to-end optimization: Jointly optimize all stages rather than tuning in isolation
