RAG-Augmented Memory
Introduction
Retrieval-Augmented Generation (RAG) is one of the most important technical paradigms for agent memory systems. By retrieving relevant external knowledge before generation, RAG lets LLMs access information beyond their training data, mitigates hallucination, and supports dynamic knowledge updates.
Motivation for RAG
Inherent limitations of LLMs:
- Knowledge cutoff: Training data ends at a fixed date
- Hallucination: May fabricate answers to questions it cannot reliably answer
- Missing domain knowledge: Lacks private knowledge of specific organizations and domains
- Untraceable: Cannot point to the sources behind its answers
RAG's approach: Retrieve first, then generate.
Naive RAG Pipeline
graph TB
subgraph Offline Indexing Phase
D[Document Collection] --> L[Document Loading]
L --> S[Text Chunking]
S --> E[Vector Embedding]
E --> I[(Vector Database<br/>Vector Store)]
end
subgraph Online Query Phase
Q[User Query] --> QE[Query Embedding]
QE --> R[Retrieval<br/>Top-K]
I --> R
R --> C[Context Assembly]
Q --> C
C --> G[LLM Generation]
G --> A[Answer]
end
Step 1: Document Loading
from langchain.document_loaders import (
PyPDFLoader, TextLoader, WebBaseLoader,
UnstructuredMarkdownLoader
)
# PDF
loader = PyPDFLoader("paper.pdf")
docs = loader.load()
# Web page
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()
# Multiple formats unified processing
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader("./knowledge_base/", glob="**/*.md")
docs = loader.load()
Step 2: Text Chunking
Chunking strategy directly affects retrieval quality:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # Max chunk length (len-based, so characters, not tokens)
    chunk_overlap=50,     # Overlap between adjacent chunks
    separators=["\n\n", "\n", ".", " ", ""],  # Split priority, coarse to fine
    length_function=len,
)
chunks = splitter.split_documents(docs)
Chunking Strategy Comparison:
| Strategy | Pros | Cons |
|---|---|---|
| Fixed size | Simple and consistent | May break semantic units |
| Recursive splitting | Respects document structure | Uneven chunk sizes |
| Semantic chunking | Semantically complete | High computational cost |
| Document structure | Preserves headings/sections | Depends on document format |
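The fixed-size strategy in the table can be sketched in a few lines of plain Python (a toy illustration, not the LangChain splitter; chunk boundaries here ignore semantics entirely):

```python
def chunk_text(text, chunk_size=512, overlap=50):
    """Fixed-size chunking with overlap (toy illustration)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk repeats the last `overlap` characters of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.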
Step 3: Embedding and Indexing
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
Step 4: Retrieval
retriever = vectorstore.as_retriever(
    search_type="similarity",  # or "mmr" for diversity
    search_kwargs={"k": 5}
)
relevant_docs = retriever.get_relevant_documents("What is the agent's memory system?")
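Under the hood, `similarity` search scores the query embedding against every stored vector by cosine similarity and returns the top k. A minimal in-memory version (toy 2-D vectors stand in for real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=5):
    """index: list of (doc_id, vector) pairs; returns top-k (doc_id, score)."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

Production vector databases add approximate nearest-neighbor indexes (HNSW, IVF) so this scan does not have to touch every vector.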
Step 5: Generation
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=retriever,
    return_source_documents=True,
)
result = qa_chain.invoke({"query": "What is the agent's memory system?"})
print(result["result"])
print(result["source_documents"])  # Source documents
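Internally, the chain's context-assembly step amounts to stuffing the retrieved chunks into a prompt template. A hand-rolled sketch (the template wording here is illustrative, not what LangChain uses verbatim):

```python
def build_prompt(query, docs):
    """Assemble retrieved chunks and the user query into a grounded prompt."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    return (
        "Answer the question using only the context below. "
        "Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Numbering the chunks is what makes source attribution possible in the final answer.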
Advanced RAG Techniques
Query Optimization
Query Rewriting
def rewrite_query(original_query, llm):
    prompt = f"""Please rewrite the following user query into forms
more suitable for semantic search.
Generate 3 query variants from different angles.
Original query: {original_query}
Rewritten queries:"""
    return llm.generate(prompt)

# "How to make an agent remember things" ->
# ["Agent memory system implementation methods",
#  "LLM long-term memory storage techniques",
#  "Conversation history persistence solutions"]
HyDE (Hypothetical Document Embeddings)
Have the LLM generate a hypothetical answer first, then use that answer for retrieval:
def hyde_retrieval(query, llm, retriever):
    # 1. Generate a hypothetical answer
    hypothetical_answer = llm.generate(
        f"Please answer the following question (even if uncertain): {query}"
    )
    # 2. Retrieve using the hypothetical answer's embedding
    return retriever.get_relevant_documents(hypothetical_answer)
Principle: The hypothetical answer is usually closer to the relevant documents in embedding space than the short original query is, so retrieving with it often yields better results.
Retrieval Optimization
Ensemble Retrieval
from langchain.retrievers import EnsembleRetriever, BM25Retriever
# Keyword retrieval
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Vector retrieval
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Ensemble
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]
)
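The ensemble step merges the two ranked lists; LangChain's `EnsembleRetriever` does this with weighted Reciprocal Rank Fusion, whose core can be sketched as follows (60 is the conventional RRF smoothing constant):

```python
def rrf_fuse(ranked_lists, weights, k=60):
    """Fuse several ranked lists of doc ids via weighted Reciprocal Rank Fusion."""
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking):
            # Documents ranked higher (smaller rank) contribute more
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.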
Maximal Marginal Relevance (MMR)
Balances relevance against diversity when selecting the result set:

\[
\text{MMR} = \arg\max_{d_i \in D \setminus S} \Big[ \lambda \cdot \text{sim}(d_i, q) - (1 - \lambda) \cdot \max_{d_j \in S} \text{sim}(d_i, d_j) \Big]
\]

where \(q\) is the query, \(D\) the candidate set, \(S\) the already-selected documents, and \(\lambda \in [0, 1]\) controls the balance between relevance and diversity.
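A greedy implementation of MMR over precomputed similarity scores (a plain-Python sketch; `query_sim` and `doc_sims` stand in for cosine similarities from real embeddings):

```python
def mmr_select(query_sim, doc_sims, lam=0.7, k=3):
    """Greedy MMR. query_sim[i] = sim(d_i, q); doc_sims[i][j] = sim(d_i, d_j)."""
    selected, remaining = [], list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize candidates similar to anything already selected
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a low lambda, a near-duplicate of an already-selected document loses to a less relevant but novel one.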
Reranking
After initial retrieval, use a cross-encoder to rerank candidate results:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
def rerank(query, documents, top_k=3):
    pairs = [(query, doc.page_content) for doc in documents]
    scores = reranker.predict(pairs)
    # Sort by descending score and keep the top_k
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
Cohere Rerank API:
import cohere

co = cohere.Client("your-api-key")
results = co.rerank(
    query="Agent memory system",
    documents=[doc.page_content for doc in candidates],
    top_n=3,
    model="rerank-english-v3.0"
)
Advanced RAG Patterns
Self-RAG
graph TD
Q[Query] --> D{Need retrieval?}
D -->|Yes| R[Retrieve Documents]
D -->|No| G1[Generate Directly]
R --> E{Documents relevant?}
E -->|Yes| G2[Generate Based on Documents]
E -->|No| R2[Re-retrieve/Rewrite Query]
G2 --> V{Answer supported?}
V -->|Yes| O[Output]
V -->|No| G2
Self-RAG (Asai et al., 2023) trains models to autonomously decide when to retrieve, evaluate retrieval quality, and verify generation results.
CRAG (Corrective RAG)
def corrective_rag(query, retriever, llm):
    docs = retriever.get_relevant_documents(query)
    # Evaluate retrieval quality
    relevance = llm.evaluate(
        f"Are the following documents relevant to the query?\n"
        f"Query: {query}\nDocuments: {docs}")
    if relevance == "correct":
        # Use the retrieved results directly
        return generate(query, docs)
    elif relevance == "ambiguous":
        # Combine retrieved results with web search
        web_results = web_search(query)
        return generate(query, docs + web_results)
    else:  # incorrect
        # Fall back entirely to web search
        web_results = web_search(query)
        return generate(query, web_results)
Multi-Hop RAG
For complex questions requiring multi-step reasoning:
def multi_hop_rag(query, retriever, llm, max_hops=3):
    context = []
    current_query = query
    for hop in range(max_hops):
        # Retrieve for the current (possibly rewritten) query
        docs = retriever.get_relevant_documents(current_query)
        context.extend(docs)
        # Check whether the accumulated context suffices
        can_answer = llm.evaluate(
            f"Can the question be answered based on this information?\n"
            f"Question: {query}\nInformation: {context}")
        if can_answer:
            break
        # Generate a follow-up query for the next hop
        current_query = llm.generate(
            f"To answer '{query}', what additional information is needed? Known: {context}")
    return llm.generate(f"Answer the question based on the following information.\n"
                        f"Question: {query}\nInformation: {context}")
Graph RAG
Structured retrieval combined with knowledge graphs:
# Use LLM to extract entities and relations from text
def build_knowledge_graph(documents, llm):
triples = []
for doc in documents:
extracted = llm.extract(
f"Extract (subject, relation, object) triples from the following text:\n{doc}")
triples.extend(extracted)
return triples
# Combine graph retrieval and vector retrieval
def graph_rag(query, graph_store, vector_store, llm):
# Extract entities from query
entities = llm.extract_entities(query)
# Graph retrieval: get entity neighbor relationships
graph_context = graph_store.get_neighbors(entities, depth=2)
# Vector retrieval: get semantically related documents
vector_context = vector_store.similarity_search(query, k=5)
# Combine contexts for answer generation
return llm.generate(query, graph_context + vector_context)
RAG Evaluation
Core Metrics
Retrieval Quality
- Recall@K: Proportion of all relevant documents that appear in the top K results
- MRR (Mean Reciprocal Rank): Average, over queries, of the reciprocal rank of the first relevant result
- NDCG (Normalized DCG): Ranking quality metric that discounts relevant documents appearing lower in the list
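Recall@K and MRR are simple enough to compute directly; a minimal sketch over (ranked result ids, relevant id set) pairs:

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(queries):
    """queries: list of (ranked_ids, relevant_id_set); returns Mean Reciprocal Rank."""
    total = 0.0
    for ranked, relevant in queries:
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(queries)
```

Both metrics need labeled relevance judgments, which is usually the expensive part of a retrieval evaluation set.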
Generation Quality
- Faithfulness: Whether the generated answer is faithful to retrieved documents
- Answer Relevance: Whether the answer is relevant to the question
- Context Relevance: Whether retrieved context is relevant to the question
RAGAS Evaluation Framework
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
Production-Grade RAG Architecture
graph TB
subgraph Data Pipeline
DS[Data Sources] --> ETL[ETL Pipeline]
ETL --> CP[Chunking]
CP --> EM[Embedding Generation]
EM --> VS[(Vector Database)]
ETL --> MD[(Metadata Store)]
end
subgraph Query Pipeline
UQ[User Query] --> QP[Query Processing<br/>Rewriting/Expansion]
QP --> HR[Hybrid Retrieval<br/>Vector + Keyword]
VS --> HR
HR --> RR[Reranking]
RR --> CTX[Context Assembly]
CTX --> LLM[LLM Generation]
LLM --> PP[Post-Processing<br/>Citation/Verification]
PP --> OUT[Final Answer]
end
subgraph Feedback Loop
OUT --> FB[User Feedback]
FB --> AN[Analysis & Optimization]
AN --> DS
end
Practical Recommendations
RAG Optimization Checklist
- [ ] Choose appropriate chunk size (typically 256-512 tokens)
- [ ] Use recursive splitting to preserve document structure
- [ ] Implement hybrid search (vector + BM25)
- [ ] Add reranking step for improved precision
- [ ] Apply query rewriting/expansion
- [ ] Add metadata filtering to narrow search scope
- [ ] Implement evaluation pipeline for continuous quality monitoring
- [ ] Handle multimodal content (tables, images)
Common Pitfalls
- Chunks too large or too small: Too large dilutes signal; too small loses context
- Ignoring metadata: Timestamps, sources, types are important filtering dimensions
- Skipping reranking: Initial retrieval results typically need precision ranking
- Not evaluating: Without evaluation, improvement is impossible
Further Reading
- API Orchestration and Tool Selection - using RAG as a tool within agent orchestration
- Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"
- Asai, A., et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection"
- Gao, Y., et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey"