Long-Term Memory and Vector Databases
Introduction
When information exceeds the context window capacity, agents need a persistent long-term memory mechanism. Vector databases, by converting text into high-dimensional vectors and performing similarity search, have become the core infrastructure for agent long-term memory.
Tribe-level perspective: vector retrieval (embedding similarity + ANN) is the LLM-era renaissance of the Analogizers tribe in Domingos's taxonomy — RAG = modern CBR. For a tribe-level view of recommender systems, collaborative filtering, matrix factorization, HNSW/IVF/LSH comparisons, and RAG vs classical CBR, see The Master Algorithm — Recommender Systems & Modern Retrieval.
Vector Embeddings
What Are Embeddings
Embeddings map discrete text (words, sentences, paragraphs) into a continuous high-dimensional vector space, such that semantically similar texts lie close together:

\[ f: \text{Text} \rightarrow \mathbb{R}^d \]

where \(d\) is the embedding dimension (typically 384-3072).
Popular Embedding Models
| Model | Dimensions | Provider | Features |
|---|---|---|---|
| text-embedding-3-small | 1536 | OpenAI | Best cost-performance |
| text-embedding-3-large | 3072 | OpenAI | Highest accuracy |
| text-embedding-ada-002 | 1536 | OpenAI | Classic model |
| BGE-large-en-v1.5 | 1024 | BAAI | Among best open-source |
| BGE-M3 | 1024 | BAAI | Multi-language, multi-granularity |
| E5-large-v2 | 1024 | Microsoft | High-quality open-source |
| voyage-3 | 1024 | Voyage AI | Code/document specialized |
| Cohere embed-v3 | 1024 | Cohere | Multi-language optimized |
Embedding Code Example
from openai import OpenAI
client = OpenAI()
def get_embedding(text, model="text-embedding-3-small"):
response = client.embeddings.create(input=text, model=model)
return response.data[0].embedding
# Example
embedding = get_embedding("Agent memory systems are very important")
print(f"Dimensions: {len(embedding)}") # 1536
# Using open-source models (sentence-transformers)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = model.encode(["memory systems for agents", "vector databases"])
Similarity Metrics
Cosine Similarity
The most commonly used similarity metric:

\[ \text{sim}(a, b) = \cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|} \]

- Range: \([-1, 1]\), where 1 means identical direction, 0 means orthogonal, and -1 means opposite
- Advantage: insensitive to vector magnitude, which suits text similarity
Other Metrics
| Metric | Formula | When to Use |
|---|---|---|
| Euclidean distance | \(d = \sqrt{\sum_i (a_i - b_i)^2}\) | When absolute magnitudes matter |
| Dot product | \(s = \sum_i a_i b_i\) | Already-normalized vectors (then equivalent to cosine) |
| Manhattan distance | \(d = \sum_i \lvert a_i - b_i \rvert\) | High-dimensional sparse data |
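A minimal numpy sketch of these metrics (the example vectors are illustrative); it shows concretely why cosine similarity ignores magnitude while Euclidean distance does not:

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the normalized vectors: magnitude cancels out
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude
print(cosine_similarity(a, b))   # 1.0   -> identical for cosine
print(euclidean_distance(a, b))  # ~3.74 -> magnitude difference still counts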
Approximate Nearest Neighbor Search (ANN)
Exact nearest neighbor search costs \(O(nd)\) per query (\(n\) vectors of dimension \(d\)), which is infeasible at scale. ANN algorithms trade a small loss in recall for orders-of-magnitude speedups.
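For reference, a minimal numpy sketch of the exact (brute-force) baseline, assuming all vectors are L2-normalized so the dot product equals cosine similarity; this is fine for thousands of vectors but prohibitive for billions:

import numpy as np

def exact_search(index_vectors, query, k=5):
    # One (n x d) @ (d,) matrix-vector product: O(nd) per query
    scores = index_vectors @ query
    top_k = np.argsort(-scores)[:k]  # indices of the k highest scores
    return top_k, scores[top_k]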
HNSW (Hierarchical Navigable Small World)
The most popular ANN algorithm currently:
graph TB
subgraph "HNSW Multi-Layer Graph Structure"
subgraph "Layer 2 (Sparse)"
A2((A)) --- B2((B))
end
subgraph "Layer 1 (Medium)"
A1((A)) --- B1((B))
A1 --- C1((C))
B1 --- D1((D))
end
subgraph "Layer 0 (Dense)"
A0((A)) --- B0((B))
A0 --- C0((C))
B0 --- D0((D))
C0 --- E0((E))
D0 --- F0((F))
E0 --- F0
end
end
Q[Query Vector q] -.->|Start from top layer| A2
A2 -.->|Descend to lower layer| A1
A1 -.->|Refine search| E0
Principle:
- Build a multi-layer graph structure: upper layers sparse, lower layers dense
- Search starts from the top layer entry point, greedily searching for nearest neighbors
- Descend layer by layer, refining search results at each layer
- Search complexity: \(O(\log n)\) (see the sketch below)
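A minimal sketch using the hnswlib library (a widely used HNSW implementation); the parameter values here are typical starting points, not tuned recommendations:

import hnswlib
import numpy as np

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# M: edges per node; ef_construction: candidate list size during build
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)  # search-time recall/speed trade-off (keep ef >= k)

labels, distances = index.knn_query(data[:1], k=5)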
IVF (Inverted File Index)
- First partition the vector space by clustering (e.g., K-Means into \(K\) cells)
- At search time, probe only the nprobe partitions nearest to the query
- Search complexity: \(O(n \cdot \text{nprobe} / K)\) vector comparisons (see the sketch below)
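A sketch of IVF using the FAISS library; the nlist and nprobe values are illustrative:

import faiss
import numpy as np

d, n = 128, 100_000
xb = np.random.rand(n, d).astype(np.float32)

nlist = 1024                          # number of K-Means partitions
quantizer = faiss.IndexFlatL2(d)      # stores the partition centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                       # learn centroids from the data
index.add(xb)
index.nprobe = 8                      # probe only the 8 nearest partitions
D, I = index.search(xb[:1], 5)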
Product Quantization (PQ)
Splits high-dimensional vectors into sub-vectors and quantizes each against a learned codebook, compressing storage 4-64x while still supporting approximate distance computation, as in the sketch below.
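A FAISS sketch, assuming 128-dim vectors split into 16 sub-vectors with 8-bit codes (32x compression versus raw float32):

import faiss
import numpy as np

d = 128
xb = np.random.rand(100_000, d).astype(np.float32)

index = faiss.IndexPQ(d, 16, 8)  # 16 sub-vectors, 8-bit codes each
index.train(xb)                  # learns a 256-entry codebook per sub-vector
index.add(xb)
D, I = index.search(xb[:1], 5)
# Storage: 16 bytes/vector vs 512 bytes raw float32 -> 32x compression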
Popular Vector Databases
Comparison Table
| Database | Type | ANN Algorithm | Features | Use Case |
|---|---|---|---|---|
| Pinecone | Cloud-hosted | Proprietary | Fully managed, easy to use | Quick start, production-grade |
| Weaviate | Open-source/Cloud | HNSW | Multi-modal, GraphQL | Complex queries, hybrid search |
| ChromaDB | Open-source | HNSW | Embedded, Python-first | Prototyping, local |
| Milvus | Open-source/Cloud | IVF/HNSW | Large-scale, high-performance | Billion-scale data |
| Qdrant | Open-source/Cloud | HNSW | Rust implementation, great filtering | Complex filtering needs |
| pgvector | Extension | IVF/HNSW | PostgreSQL extension | Existing PG infrastructure |
| FAISS | Library | Multiple | Meta open-source library | Research, customization |
ChromaDB Example (Local Development)
import chromadb
# Create an in-memory client (use chromadb.PersistentClient(path="./db") to persist)
client = chromadb.Client()
# Create collection
collection = client.create_collection(
name="agent_memory",
metadata={"hnsw:space": "cosine"}
)
# Add memories
collection.add(
documents=[
"User preference: likes concise answers",
"Task record: completed data analysis report",
"Learning note: mastered Python decorators",
],
ids=["mem_1", "mem_2", "mem_3"],
metadatas=[
{"type": "preference", "timestamp": "2024-01-01"},
{"type": "task", "timestamp": "2024-01-02"},
{"type": "knowledge", "timestamp": "2024-01-03"},
]
)
# Retrieve relevant memories
results = collection.query(
query_texts=["What kind of answer style does the user prefer?"],
n_results=2,
where={"type": "preference"} # Optional metadata filter
)
print(results["documents"])
Pinecone Example (Production Environment)
from pinecone import Pinecone
pc = Pinecone(api_key="your-api-key")
index = pc.Index("agent-memory")
# Write (upsert)
index.upsert(vectors=[
{
"id": "mem_1",
"values": embedding_vector, # 1536-dim vector
"metadata": {
"text": "User preference: concise answers",
"type": "preference",
"timestamp": 1704067200
}
}
])
# Query
results = index.query(
vector=query_embedding,
top_k=5,
include_metadata=True,
filter={"type": {"$eq": "preference"}}
)
pgvector Example (PostgreSQL)
-- Enable extension
CREATE EXTENSION vector;
-- Create table
CREATE TABLE agent_memory (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536),
memory_type VARCHAR(50),
created_at TIMESTAMP DEFAULT NOW()
);
-- Create HNSW index
CREATE INDEX ON agent_memory
USING hnsw (embedding vector_cosine_ops);
-- Similarity search (<=> is cosine distance, so 1 - distance = similarity;
-- query_embedding stands for a vector parameter supplied by the application,
-- e.g. '[0.1, 0.2, ...]'::vector)
SELECT content, 1 - (embedding <=> query_embedding) AS similarity
FROM agent_memory
WHERE memory_type = 'preference'
ORDER BY embedding <=> query_embedding
LIMIT 5;
Vector Search Pipeline
graph LR
A[Raw Text] --> B[Text Chunking]
B --> C[Embedding Model]
C --> D[(Vector Database<br/>Index)]
E[Query] --> F[Query Embedding]
F --> G[ANN Search]
D --> G
G --> H[Top-K Results]
H --> I[Reranking]
I --> J[Final Results]
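The reranking stage typically rescores the top-K candidates with a cross-encoder, which reads the query and each candidate jointly and is more accurate than embedding similarity alone. A sketch with sentence-transformers; the model name is one commonly used public checkpoint:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What answer style does the user prefer?"
candidates = [
    "User preference: likes concise answers",
    "Task record: completed data analysis report",
]

# Score each (query, candidate) pair jointly, then sort by relevance
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]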
Hybrid Search
Pure vector search can miss results that hinge on exact keywords (names, IDs, rare terms). Hybrid search combines vector search with traditional keyword search (e.g., BM25):
# Weaviate hybrid search example (v3 Python client)
import weaviate

client = weaviate.Client("http://localhost:8080")
result = client.query.get("Memory", ["content", "type"]).with_hybrid(
    query="Python decorators",
    alpha=0.7  # 0 = pure keyword (BM25), 1 = pure vector
).with_limit(5).do()
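If the database has no native hybrid mode, the two result lists can be merged client-side, for example with reciprocal rank fusion (RRF). A minimal sketch; k=60 is the constant conventionally used in the RRF literature:

def rrf_merge(keyword_ids, vector_ids, k=60):
    # Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)
    scores = {}
    for ids in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf_merge(["mem_2", "mem_1"], ["mem_1", "mem_3"]))
# ['mem_1', 'mem_2', 'mem_3'] -- mem_1 appears in both lists, so it wins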
Practical Recommendations
Embedding Model Selection
- General use: OpenAI text-embedding-3-small (best cost-performance)
- High accuracy: OpenAI text-embedding-3-large or Cohere embed-v3
- Private deployment: BGE-M3 or E5-large-v2
- Code scenarios: Voyage Code or CodeBERT
Chunking Strategy
- Fixed size: Simple but may break semantic units
- Semantic chunking: Split based on paragraphs/sections
- Recursive chunking: Split by large structure first, then recursively subdivide
- Recommended chunk size: 256-512 tokens, with 50-100 tokens of overlap (see the sketch below)
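A dependency-free sketch of fixed-size chunking with overlap, approximating tokens by whitespace-separated words; a production pipeline would count tokens with the embedding model's tokenizer:

def chunk_text(text, chunk_size=384, overlap=64):
    # Whitespace words as a rough token proxy
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # step back by `overlap` words each time
    return chunks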
Performance Optimization
- Use batch operations to reduce API calls
- Build a cache layer for hot data
- Choose appropriate ANN parameters (e.g., HNSW's M and efConstruction)
- Consider reducing vector dimensions (e.g., Matryoshka embeddings, sketched below)
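As one example of dimension reduction, OpenAI's text-embedding-3 models are Matryoshka-trained and accept a dimensions parameter that truncates vectors server-side:

from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    input="Agent memory systems",
    model="text-embedding-3-small",
    dimensions=512,  # truncated from 1536; leading dims carry most signal
)
print(len(response.data[0].embedding))  # 512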
Further Reading
- RAG-Augmented Memory - Complete RAG pipeline based on vector retrieval
- Johnson, J., Douze, M., & Jégou, H. (2019). "Billion-scale similarity search with GPUs" (FAISS)
- Malkov, Y. A., & Yashunin, D. A. (2020). "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs" (HNSW, IEEE TPAMI)