RAG-Augmented Memory

Introduction

Retrieval-Augmented Generation (RAG) is currently the dominant technical paradigm for agent memory systems. RAG retrieves external knowledge before generation, letting LLMs access information beyond their training data; this mitigates hallucination and allows knowledge to be updated without retraining.

Motivation for RAG

Inherent limitations of LLMs:

  1. Knowledge cutoff: Training data has a cutoff date
  2. Hallucination: May fabricate answers for uncertain questions
  3. Missing domain knowledge: Lacks private knowledge of specific organizations/domains
  4. Untraceable: Cannot point to information sources

RAG's approach: Retrieve first, then generate.

Naive RAG Pipeline

graph TB
    subgraph Offline Indexing Phase
        D[Document Collection] --> L[Document Loading]
        L --> S[Text Chunking]
        S --> E[Vector Embedding]
        E --> I[(Vector Database)]
    end

    subgraph Online Query Phase
        Q[User Query] --> QE[Query Embedding]
        QE --> R[Retrieval<br/>Top-K]
        I --> R
        R --> C[Context Assembly]
        Q --> C
        C --> G[LLM Generation]
        G --> A[Answer]
    end

Step 1: Document Loading

from langchain.document_loaders import (
    PyPDFLoader, TextLoader, WebBaseLoader, 
    UnstructuredMarkdownLoader
)

# PDF
loader = PyPDFLoader("paper.pdf")
docs = loader.load()

# Web page
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()

# Multiple formats unified processing
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader("./knowledge_base/", glob="**/*.md")
docs = loader.load()

Step 2: Text Chunking

Chunking strategy directly affects retrieval quality:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # Max characters per chunk (length_function=len counts characters, not tokens)
    chunk_overlap=50,     # Overlap between chunks
    separators=["\n\n", "\n", ".", " ", ""],  # Split priority
    length_function=len,
)

chunks = splitter.split_documents(docs)

Chunking Strategy Comparison:

| Strategy | Pros | Cons |
| --- | --- | --- |
| Fixed size | Simple and consistent | May break semantic units |
| Recursive splitting | Respects document structure | Uneven chunk sizes |
| Semantic chunking | Semantically complete | High computational cost |
| Document structure | Preserves headings/sections | Depends on document format |
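
Semantic chunking from the table above can be approximated without extra libraries: embed neighboring sentences and start a new chunk when similarity drops. A rough sketch, assuming an embeddings object exposing embed_documents (such as the OpenAIEmbeddings instance created in Step 3 below); the sentence splitting and the 0.8 threshold are illustrative choices:

import numpy as np

def semantic_chunks(text, embeddings, threshold=0.8):
    # Naive sentence split; a production splitter would handle abbreviations etc.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if not sentences:
        return []
    vectors = np.array(embeddings.embed_documents(sentences))

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity between consecutive sentences
        a, b = vectors[i - 1], vectors[i]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if sim >= threshold:
            current.append(sentences[i])        # same topic: grow the chunk
        else:
            chunks.append(". ".join(current))   # topic shift: close the chunk
            current = [sentences[i]]
    chunks.append(". ".join(current))
    return chunks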

Step 3: Embedding and Indexing

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

Step 4: Retrieval

retriever = vectorstore.as_retriever(
    search_type="similarity",  # or "mmr" for diversity
    search_kwargs={"k": 5}
)

relevant_docs = retriever.get_relevant_documents("What is the agent's memory system?")

Step 5: Generation

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=retriever,
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is the agent's memory system?"})
print(result["result"])
print(result["source_documents"])  # Source documents

Advanced RAG Techniques

Query Optimization

Query Rewriting

def rewrite_query(original_query, llm):
    prompt = f"""Please rewrite the following user query into forms 
    more suitable for semantic search.
    Generate 3 query variants from different angles.

    Original query: {original_query}

    Rewritten queries:"""
    return llm.generate(prompt)

# "How to make an agent remember things" -> 
# ["Agent memory system implementation methods", 
#  "LLM long-term memory storage techniques", 
#  "Conversation history persistence solutions"]

HyDE (Hypothetical Document Embeddings)

Have the LLM generate a hypothetical answer first, then use that answer for retrieval:

def hyde_retrieval(query, llm, retriever):
    # 1. Generate hypothetical answer
    hypothetical_answer = llm.generate(
        f"Please answer the following question (even if uncertain): {query}"
    )

    # 2. Use the hypothetical answer's embedding for retrieval
    results = retriever.get_relevant_documents(hypothetical_answer)
    return results

Principle: The hypothetical answer is semantically closer to actual documents in embedding space, yielding better retrieval than the original query.

Retrieval Optimization

Ensemble Retrieval

from langchain.retrievers import EnsembleRetriever, BM25Retriever

# Keyword retrieval
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Vector retrieval
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Ensemble
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]
)

Maximal Marginal Relevance (MMR)

Balances between relevance and diversity:

\[\text{MMR} = \arg\max_{d_i \in R \setminus S} \left[ \lambda \cdot \text{sim}(d_i, q) - (1-\lambda) \cdot \max_{d_j \in S} \text{sim}(d_i, d_j) \right]\]

where \(\lambda\) controls the balance between relevance and diversity.
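
The formula translates directly into a greedy selection loop. In the sketch below, sim_qd and sim_dd are placeholder similarity functions (query-document and document-document), not part of any library API:

def mmr_select(candidates, sim_qd, sim_dd, k=5, lam=0.7):
    # Greedy MMR: repeatedly pick the document that is most relevant to the query
    # while being least similar to what has already been selected
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(d):
            relevance = sim_qd(d)
            redundancy = max((sim_dd(d, s) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

In LangChain, the same behavior is available by passing search_type="mmr" to as_retriever, as noted in Step 4.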

Reranking

After initial retrieval, use a cross-encoder to rerank candidate results:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query, documents, top_k=3):
    pairs = [(query, doc.page_content) for doc in documents]
    scores = reranker.predict(pairs)

    # Sort by score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

Cohere Rerank API:

import cohere

co = cohere.Client("your-api-key")
results = co.rerank(
    query="Agent memory system",
    documents=[doc.page_content for doc in candidates],
    top_n=3,
    model="rerank-english-v3.0"
)

Advanced RAG Patterns

Self-RAG

graph TD
    Q[Query] --> D{Need retrieval?}
    D -->|Yes| R[Retrieve Documents]
    D -->|No| G1[Generate Directly]
    R --> E{Documents relevant?}
    E -->|Yes| G2[Generate Based on Documents]
    E -->|No| R2[Re-retrieve/Rewrite Query]
    G2 --> V{Answer supported?}
    V -->|Yes| O[Output]
    V -->|No| G2

Self-RAG (Asai et al., 2023) trains models to autonomously decide when to retrieve, evaluate retrieval quality, and verify generation results.
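
The published method trains the model to emit special reflection tokens; a minimal prompt-based approximation of the same control flow, written in the same placeholder style (llm.generate) as the other sketches on this page:

def self_rag_like(query, retriever, llm, max_retries=2):
    # 1. Decide whether retrieval is needed at all
    if "no" in llm.generate(
            f"Does answering this question require external documents? yes/no\n{query}").lower():
        return llm.generate(query)

    answer = None
    for _ in range(max_retries):
        docs = retriever.get_relevant_documents(query)
        # 2. Keep only passages the model judges relevant
        relevant = [d for d in docs if "yes" in llm.generate(
            f"Is this passage relevant to '{query}'? yes/no\n{d.page_content}").lower()]
        if not relevant:
            # Nothing usable: rewrite the query and retrieve again
            query = llm.generate(f"Rewrite this query for better retrieval: {query}")
            continue
        context = "\n\n".join(d.page_content for d in relevant)
        answer = llm.generate(
            f"Answer using only these passages:\n{context}\n\nQuestion: {query}")
        # 3. Check that the answer is actually supported by the passages
        if "yes" in llm.generate(
                f"Is the answer fully supported by the passages? yes/no\nAnswer: {answer}").lower():
            return answer
    # Fall back to the last draft, or direct generation if retrieval never succeeded
    return answer or llm.generate(query)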

CRAG (Corrective RAG)

def corrective_rag(query, retriever, llm):
    docs = retriever.get_relevant_documents(query)

    # Evaluate retrieval quality
    relevance = llm.evaluate(
        f"Are the following documents relevant to the query?\nQuery: {query}\nDocuments: {docs}")

    if relevance == "correct":
        # Directly use retrieved results
        return generate(query, docs)
    elif relevance == "ambiguous":
        # Combine retrieved results with web search
        web_results = web_search(query)
        return generate(query, docs + web_results)
    else:  # incorrect
        # Rely entirely on web search
        web_results = web_search(query)
        return generate(query, web_results)

Multi-Hop RAG

For complex questions requiring multi-step reasoning:

def multi_hop_rag(query, retriever, llm, max_hops=3):
    context = []
    current_query = query

    for hop in range(max_hops):
        # Retrieve
        docs = retriever.get_relevant_documents(current_query)
        context.extend(docs)

        # Check if sufficient information to answer
        can_answer = llm.evaluate(
            f"Can the question be answered based on this information?\n"
            f"Question: {query}\nInformation: {context}")

        if can_answer:
            break

        # Generate follow-up query
        current_query = llm.generate(
            f"To answer '{query}', what additional information is needed? Known: {context}")

    return llm.generate(f"Answer the question based on the following information.\n"
                        f"Question: {query}\nInformation: {context}")

Graph RAG

Structured retrieval combined with knowledge graphs:

# Use LLM to extract entities and relations from text
def build_knowledge_graph(documents, llm):
    triples = []
    for doc in documents:
        extracted = llm.extract(
            f"Extract (subject, relation, object) triples from the following text:\n{doc}")
        triples.extend(extracted)
    return triples

# Combine graph retrieval and vector retrieval
def graph_rag(query, graph_store, vector_store, llm):
    # Extract entities from query
    entities = llm.extract_entities(query)

    # Graph retrieval: get entity neighbor relationships
    graph_context = graph_store.get_neighbors(entities, depth=2)

    # Vector retrieval: get semantically related documents
    vector_context = vector_store.similarity_search(query, k=5)

    # Combine contexts for answer generation
    return llm.generate(query, graph_context + vector_context)

RAG Evaluation

Core Metrics

Retrieval Quality

  • Recall@K: Proportion of relevant documents in the top K results
\[\text{Recall@K} = \frac{|\text{Retrieved relevant docs} \cap \text{Top-K results}|}{|\text{All relevant docs}|}\]
  • MRR (Mean Reciprocal Rank): Average of the reciprocal rank of the first relevant result across queries (computed in the sketch after this list)
\[\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}\]
  • NDCG (Normalized DCG): Retrieval quality metric accounting for ranking position
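
A minimal sketch computing Recall@K and MRR, assuming each evaluated query records its ranked retrieved IDs and its set of relevant IDs (the field names are illustrative):

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of all relevant documents that appear in the top-k results
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mean_reciprocal_rank(queries):
    # queries: list of dicts with "retrieved_ids" (ranked) and "relevant_ids"
    total = 0.0
    for q in queries:
        rank = next((i + 1 for i, doc_id in enumerate(q["retrieved_ids"])
                     if doc_id in q["relevant_ids"]), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(queries)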

Generation Quality

  • Faithfulness: Whether the generated answer is faithful to retrieved documents
  • Answer Relevance: Whether the answer is relevant to the question
  • Context Relevance: Whether retrieved context is relevant to the question

RAGAS Evaluation Framework

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# eval_dataset is assumed to contain question, answer, contexts
# (and ground_truth) fields gathered from the RAG pipeline
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)

Production-Grade RAG Architecture

graph TB
    subgraph Data Pipeline
        DS[Data Sources] --> ETL[ETL Pipeline]
        ETL --> CP[Chunking]
        CP --> EM[Embedding Generation]
        EM --> VS[(Vector Database)]
        ETL --> MD[(Metadata Store)]
    end

    subgraph Query Pipeline
        UQ[User Query] --> QP[Query Processing<br/>Rewriting/Expansion]
        QP --> HR[Hybrid Retrieval<br/>Vector + Keyword]
        VS --> HR
        HR --> RR[Reranking]
        RR --> CTX[Context Assembly]
        CTX --> LLM[LLM Generation]
        LLM --> PP[Post-Processing<br/>Citation/Verification]
        PP --> OUT[Final Answer]
    end

    subgraph Feedback Loop
        OUT --> FB[User Feedback]
        FB --> AN[Analysis & Optimization]
        AN --> DS
    end
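
A compressed sketch of the query pipeline above, composed from pieces defined earlier on this page (ensemble_retriever and rerank), and assuming rewrite_query is adapted to return a list of variant strings:

def answer_query(query, llm):
    # 1. Query processing: expand the user query into several variants
    variants = [query] + rewrite_query(query, llm)

    # 2. Hybrid retrieval: collect candidates for every variant
    candidates = []
    for q in variants:
        candidates.extend(ensemble_retriever.get_relevant_documents(q))

    # 3. Reranking: keep only the most precise passages
    top_docs = rerank(query, candidates, top_k=5)

    # 4. Context assembly with numbered sources so the answer can cite them
    context = "\n\n".join(f"[{i + 1}] {d.page_content}" for i, d in enumerate(top_docs))
    return llm.generate(
        f"Answer the question using the numbered sources and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {query}")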

Practical Recommendations

RAG Optimization Checklist

  • [ ] Choose appropriate chunk size (typically 256-512 tokens)
  • [ ] Use recursive splitting to preserve document structure
  • [ ] Implement hybrid search (vector + BM25)
  • [ ] Add reranking step for improved precision
  • [ ] Apply query rewriting/expansion
  • [ ] Add metadata filtering to narrow search scope (see the sketch after this checklist)
  • [ ] Implement evaluation pipeline for continuous quality monitoring
  • [ ] Handle multimodal content (tables, images)
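
As an example of the metadata-filtering item above, a sketch against the Chroma store from Step 3; the "source" field is assumed to have been attached to each chunk's metadata at ingestion time:

# Restrict retrieval to chunks whose metadata matches the filter
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"source": "internal_wiki"},  # hypothetical metadata field
    }
)

docs = filtered_retriever.get_relevant_documents("agent memory system")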

Common Pitfalls

  1. Chunks too large or too small: Too large dilutes signal; too small loses context
  2. Ignoring metadata: Timestamps, sources, types are important filtering dimensions
  3. Skipping reranking: Initial retrieval results typically need precision ranking
  4. Not evaluating: Without evaluation, improvement is impossible

Further Reading

  • API Orchestration and Tool Selection - using RAG as a tool within agent orchestration
  • Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"
  • Asai, A., et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection"
  • Gao, Y., et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey"
