RAG Architecture Design
1. RAG Overview
Retrieval-Augmented Generation (RAG) provides contextual information to LLMs by retrieving relevant documents before generation, addressing issues of knowledge cutoff, hallucination, and insufficient domain knowledge.
1.1 Why RAG Is Needed
| Problem | Pure LLM | RAG |
|---|---|---|
| Knowledge timeliness | Limited to training data cutoff | Can access latest information |
| Hallucination | Prone to fabricating facts | Answers based on retrieved facts |
| Domain knowledge | Generalized but not deep | Can connect to specialized knowledge bases |
| Traceability | Cannot trace information sources | Can cite specific documents |
| Cost | High fine-tuning cost | No training needed, just update knowledge base |
1.2 RAG vs Fine-tuning
- RAG: Suitable for knowledge-intensive tasks, requiring up-to-date information, frequently updated knowledge
- Fine-tuning: Suitable for adjusting model behavior/style, domain-specific expression, tasks requiring deep understanding
- RAG + Fine-tuning: In practice the two are often combined, with RAG supplying up-to-date knowledge and fine-tuning shaping behavior and style
2. RAG Evolution
2.1 Naive RAG
The most basic RAG implementation:
User Query → Vector Retrieval → Concatenate Context → LLM Generation
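A minimal sketch of this loop, assuming a vector store and LLM configured as in Section 3 (LangChain-style objects; the prompt wording is illustrative):
# Minimal Naive RAG loop (assumes `vectorstore` and `llm` are set up as in Section 3)
def naive_rag(query, vectorstore, llm, k=3):
    # 1. Vector retrieval: fetch the top-k most similar chunks
    docs = vectorstore.similarity_search(query, k=k)
    # 2. Concatenate the retrieved chunks into a single context string
    context = "\n\n".join(doc.page_content for doc in docs)
    # 3. Generate an answer grounded in the retrieved context
    prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm.predict(prompt)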
Limitations:
- Retrieval quality depends on query quality
- Simple vector similarity may be inaccurate
- Cannot handle questions requiring multi-hop reasoning
- Limited context window
2.2 Advanced RAG
Adds pre-processing and post-processing on top of Naive RAG:
User Query → Query Rewriting → Vector Retrieval → Reranking → Context Compression → LLM Generation
Improvements:
- Query optimization (rewriting, expansion, decomposition)
- Retrieval optimization (hybrid search, reranking)
- Generation optimization (context compression, self-reflection)
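As an example of the retrieval optimizations above, hybrid search can be sketched by fusing a keyword (BM25) retriever with the vector retriever. This sketch uses LangChain's BM25Retriever (which requires the rank_bm25 package) and EnsembleRetriever; the weights are illustrative, and `chunks` and `vectorstore` are assumed from Section 3:
# Hybrid search sketch: fuse BM25 (keyword) and vector retrieval results
from langchain.retrievers import BM25Retriever, EnsembleRetriever

bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Weighted fusion of the two ranked result lists
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)
docs = hybrid_retriever.get_relevant_documents("What is a Transformer?")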
2.3 Modular RAG
Decomposes RAG into composable modules with flexible orchestration:
[Router Module] → Decides processing strategy
[Retrieval Module] → Multi-source retrieval + fusion
[Reranking Module] → Relevance ordering
[Compression Module] → Context refinement
[Generation Module] → Generation based on optimized context
[Verification Module] → Checks generation quality
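One way such modules might be orchestrated is sketched below; every module object and method name here is a hypothetical interface chosen for illustration, not a standard API:
# Illustrative orchestration of composable RAG modules (all interfaces are hypothetical)
def modular_rag(query, router, retrievers, reranker, compressor, generator, verifier):
    strategy = router.route(query)                    # Router: decide processing strategy
    docs = []
    for source in strategy.sources:                   # Retrieval: multi-source + fusion
        docs.extend(retrievers[source].retrieve(query))
    docs = reranker.rerank(query, docs)               # Reranking: relevance ordering
    context = compressor.compress(query, docs)        # Compression: context refinement
    answer = generator.generate(query, context)       # Generation on the optimized context
    if not verifier.check(query, context, answer):    # Verification: quality check
        answer = generator.generate(query, context)   # e.g. retry or escalate on failure
    return answer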
3. RAG Pipeline in Detail
3.1 Document Loading (Load)
from langchain.document_loaders import (
    PyPDFLoader, TextLoader, UnstructuredHTMLLoader,
    CSVLoader, DirectoryLoader
)
# Load documents of different formats
pdf_loader = PyPDFLoader("document.pdf")
html_loader = UnstructuredHTMLLoader("page.html")
csv_loader = CSVLoader("data.csv")
# Batch load a directory
dir_loader = DirectoryLoader(
    "./docs/",
    glob="**/*.md",
    show_progress=True
)
documents = dir_loader.load()
Supported data sources: PDF, HTML, Markdown, CSV, JSON, databases, APIs, web scraping
3.2 Document Chunking (Chunk)
Fixed-Size Chunking
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
    chunk_size=1000,    # Size per chunk
    chunk_overlap=200,  # Overlap region
    separator="\n\n"    # Separator
)
chunks = splitter.split_documents(documents)
Recursive Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]  # Separators tried from coarsest to finest
)
Chunking Strategy Comparison
| Strategy | Pros | Cons | Use Cases |
|---|---|---|---|
| Fixed-size | Simple, fast | May break semantics | General purpose |
| Semantic | Preserves semantic integrity | Higher compute overhead | Irregular document structure |
| Recursive | Flexible, good results | Requires tuning | Most scenarios (recommended) |
| Document structure | Leverages document structure | Depends on format | Structured docs (HTML, Markdown) |
| Sentence-level | Most semantically complete | Chunks too small, low retrieval efficiency | Precision matching |
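For the document-structure strategy, a sketch using LangChain's MarkdownHeaderTextSplitter follows; the header-to-metadata mapping is an assumption about your documents' heading levels:
# Structure-aware chunking for Markdown: split on headings, keeping them as metadata
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "section"),
    ("##", "subsection"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_chunks = md_splitter.split_text(markdown_text)  # markdown_text: raw Markdown string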
Choosing chunk size:
- Too small: Loses contextual information
- Too large: Reduces retrieval precision, may exceed context window
- Rule of thumb: 256-1024 tokens
- Recommendation: Determine optimal size through experimentation
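A rough way to run that experiment is to rebuild the index at several chunk sizes and measure retrieval hit rate on a small question set. The sketch below assumes the `documents` and `embeddings` objects from the surrounding sections and a hypothetical `eval_set` of question/answer-snippet pairs built from your own corpus:
# Rough chunk-size sweep: rebuild the index per size and check whether a chunk
# containing the known answer snippet is retrieved.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

def hit_rate_for_chunk_size(documents, embeddings, eval_set, chunk_size):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_size // 5)
    chunks = splitter.split_documents(documents)
    store = Chroma.from_documents(chunks, embedding=embeddings)
    hits = 0
    for item in eval_set:  # item: {"question": ..., "answer_snippet": ...}
        retrieved = store.similarity_search(item["question"], k=5)
        if any(item["answer_snippet"] in doc.page_content for doc in retrieved):
            hits += 1
    return hits / len(eval_set)

for size in (256, 512, 1024):
    print(size, hit_rate_for_chunk_size(documents, embeddings, eval_set, size))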
3.3 Embedding (Embed)
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])
See Vector Database in Practice for embedding model comparisons.
3.4 Indexing (Index)
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
3.5 Retrieval (Retrieve)
# Basic semantic retrieval
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)
# MMR retrieval (balances relevance and diversity)
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.5}
)
docs = retriever.get_relevant_documents("What is a Transformer?")
3.6 Reranking (Rerank)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Rerank retrieved results
pairs = [(query, doc.page_content) for doc in retrieved_docs]
scores = reranker.predict(pairs)
# Sort by score
reranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)
3.7 Generation (Generate)
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",  # stuff / map_reduce / refine
    retriever=retriever,
    return_source_documents=True
)
result = qa_chain({"query": "What is the self-attention mechanism in Transformers?"})
3.8 RAG Pipeline Overview
graph TD
A[Document Loading] --> B[Document Chunking]
B --> C[Vector Embedding]
C --> D[Vector Indexing]
E[User Query] --> F[Query Transformation]
F --> G[Vector Retrieval]
D --> G
G --> H[Reranking]
H --> I[Context Construction]
I --> J[LLM Generation]
J --> K[Output]
F --> |HyDE| F1[Generate Hypothetical Doc]
F1 --> G
F --> |Multi-Query| F2[Generate Multiple Queries]
F2 --> G
style A fill:#e1f5fe
style B fill:#e1f5fe
style C fill:#e1f5fe
style D fill:#e1f5fe
style E fill:#fff3e0
style J fill:#e8f5e9
4. Query Transformation
4.1 HyDE (Hypothetical Document Embeddings)
Have the LLM generate a hypothetical answer document first, then use that document's embedding for retrieval:
def hyde_retrieval(query, llm, embeddings, vectorstore):
    # 1. Generate a hypothetical answer document
    hypothesis_prompt = f"Write a passage that might answer the following question:\n{query}"
    hypothesis_doc = llm.predict(hypothesis_prompt)
    # 2. Retrieve using the hypothetical document's embedding
    hypothesis_embedding = embeddings.embed_query(hypothesis_doc)
    docs = vectorstore.similarity_search_by_vector(hypothesis_embedding, k=5)
    return docs
Rationale: The hypothetical document's embedding is closer to the real answer in embedding space, yielding better retrieval than using the question directly.
4.2 Multi-Query
Rewrite the original query into multiple queries from different angles, then merge retrieval results:
def multi_query_retrieval(query, llm, retriever):
    # Generate multiple query variants
    prompt = f"""Rewrite the following question from different perspectives, generating 3 related but different queries:
Original question: {query}
Query 1:
Query 2:
Query 3:"""
    queries = [q.strip() for q in llm.predict(prompt).split("\n") if q.strip()]
    # Merge retrieval results from all queries, deduplicating by content
    # (Document objects are not hashable, so dedupe on page_content)
    seen, all_docs = set(), []
    for q in queries:
        for doc in retriever.get_relevant_documents(q):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                all_docs.append(doc)
    return all_docs
4.3 Query Decomposition
Decompose complex questions into sub-questions, retrieve separately, then merge:
Original question: What are the differences between Transformer and RNN in handling long sequences?
Sub-question 1: How does Transformer handle long sequences?
Sub-question 2: How does RNN handle long sequences?
Sub-question 3: What are the main differences between Transformer and RNN?
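A sketch of this pattern, retrieving per sub-question and pooling the results; the prompt wording and the line-by-line parsing of sub-questions are assumptions:
# Query decomposition sketch: split a complex question into sub-questions,
# retrieve for each, and pool the deduplicated context
def decomposition_retrieval(query, llm, retriever):
    prompt = (
        "Break the following question into 2-4 simpler sub-questions, one per line:\n"
        f"{query}"
    )
    sub_questions = [q.strip() for q in llm.predict(prompt).split("\n") if q.strip()]
    # Retrieve for each sub-question and deduplicate by content
    seen, docs = set(), []
    for sub_q in sub_questions:
        for doc in retriever.get_relevant_documents(sub_q):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                docs.append(doc)
    return docs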
5. Advanced RAG Patterns
5.1 Self-RAG
Dynamically decides whether retrieval is needed during generation:
1. Receive query
2. Determine: Is retrieval needed?
- Yes → Retrieve → Evaluate retrieval relevance → Generate → Verify
- No → Generate directly
3. Self-verify the generated result
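The original Self-RAG method trains the model to emit reflection tokens; a simplified, prompt-based approximation of the same control flow might look like the sketch below. The yes/no decision prompts stand in for the trained reflection mechanism and are assumptions:
# Simplified, prompt-based approximation of the Self-RAG control flow
def self_rag(query, llm, retriever):
    # 1. Decide whether retrieval is needed
    need = llm.predict(f"Does answering this require looking up external facts? Answer yes or no.\n{query}")
    if "yes" in need.lower():
        docs = retriever.get_relevant_documents(query)
        # 2. Keep only documents the model judges relevant
        relevant = [d for d in docs
                    if "yes" in llm.predict(f"Is this passage relevant to '{query}'? Answer yes or no:\n{d.page_content}").lower()]
        context = "\n\n".join(d.page_content for d in relevant)
        answer = llm.predict(f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context.")
    else:
        answer = llm.predict(query)
    # 3. Self-verify the generated answer and report the verdict alongside it
    verdict = llm.predict(
        f"Is the following answer supported and on-topic? Answer yes or no.\nQuestion: {query}\nAnswer: {answer}"
    )
    return answer, "yes" in verdict.lower()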
5.2 CRAG (Corrective RAG)
Evaluates and corrects retrieved results:
1. Retrieve documents
2. Evaluate document relevance
- Relevant → Use
- Irrelevant → Discard + supplement with web search
- Partially relevant → Extract relevant parts
3. Generate based on corrected context
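A sketch of the corrective step follows; the relevance grader is a plain LLM prompt here, and `web_search` is a placeholder for whatever external search tool you use (both are assumptions, not the original CRAG components):
# Corrective RAG sketch: grade each retrieved document and patch the context
# `web_search(query)` is a placeholder returning a text snippet from an external search tool
def corrective_rag(query, llm, retriever, web_search):
    docs = retriever.get_relevant_documents(query)
    kept = []
    for doc in docs:
        grade = llm.predict(
            f"Grade this passage's relevance to the question as one of: relevant, partial, irrelevant.\n"
            f"Question: {query}\nPassage: {doc.page_content}"
        ).lower()
        if "irrelevant" in grade:
            continue  # discard irrelevant documents
        if "partial" in grade:
            # Extract only the relevant part of a partially relevant passage
            doc.page_content = llm.predict(
                f"Extract only the sentences relevant to '{query}':\n{doc.page_content}"
            )
        kept.append(doc.page_content)
    if not kept:
        kept = [web_search(query)]  # supplement with web search when nothing survives
    context = "\n\n".join(kept)
    return llm.predict(f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context.")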
5.3 Graph RAG
Enhances retrieval with knowledge graphs:
1. Build knowledge graph (entities + relationships)
2. Vector retrieval + graph traversal
3. Obtain multi-hop relationship information
4. Merge vector retrieval and graph results
5. LLM generation
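A sketch of combining vector retrieval with graph traversal is shown below; the graph interface (`get_neighbors` returning triples) and the entity-extraction prompt are illustrative rather than any specific library's API:
# Graph RAG sketch: merge vector hits with multi-hop facts from a knowledge graph
# `graph.get_neighbors(entity, hops)` is a hypothetical interface returning (head, relation, tail) triples
def graph_rag(query, llm, retriever, graph):
    # 1. Standard vector retrieval
    docs = retriever.get_relevant_documents(query)
    # 2. Extract entities from the query and traverse the graph around them
    entities = [e.strip() for e in llm.predict(
        f"List the named entities in this question, comma-separated:\n{query}"
    ).split(",") if e.strip()]
    triples = []
    for entity in entities:
        triples.extend(graph.get_neighbors(entity, hops=2))
    # 3. Merge both sources into one context
    graph_facts = "\n".join(f"{h} -[{r}]-> {t}" for h, r, t in triples)
    context = "\n\n".join(d.page_content for d in docs) + "\n\nGraph facts:\n" + graph_facts
    return llm.predict(f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context.")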
6. Summary
The choice of RAG architecture depends on specific requirements:
- Simple Q&A: Naive RAG is sufficient
- Enterprise applications: Advanced RAG + reranking + hybrid search
- Complex reasoning: Modular RAG + Self-RAG / CRAG
- Knowledge graph scenarios: Graph RAG
Key principles:
- Evaluation-driven: Establish evaluation baselines before optimizing
- Progressive enhancement: Start simple, add complexity as needed
- End-to-end optimization: Jointly optimize all stages rather than tuning in isolation
References
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", 2020
- Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey", 2024
- RAG-Augmented Memory — RAG applications in Agents
- Vector Database in Practice — Embedding models and vector database selection
- RAG Evaluation and Optimization — RAG system evaluation methods