Long-Term Memory and Vector Databases
Introduction
When information exceeds the context window capacity, agents need a persistent long-term memory mechanism. Vector databases, by converting text into high-dimensional vectors and performing similarity search, have become the core infrastructure for agent long-term memory.
Tribe-level perspective: vector retrieval (embedding similarity + ANN) is the LLM-era renaissance of the Analogizers tribe in Domingos's taxonomy — RAG = modern CBR. For a tribe-level view of recommender systems, collaborative filtering, matrix factorization, HNSW/IVF/LSH comparisons, and RAG vs classical CBR, see The Master Algorithm — Recommender Systems & Modern Retrieval.
Vector Embeddings
What Are Embeddings
Embeddings map discrete text (words, sentences, paragraphs) into a continuous high-dimensional vector space, such that semantically similar texts lie close together:

\[ f: \text{Text} \rightarrow \mathbb{R}^d \]

where \(d\) is the embedding dimension (typically 384-3072).
Popular Embedding Models
| Model | Dimensions | Provider | Features |
|---|---|---|---|
| text-embedding-3-small | 1536 | OpenAI | Best cost-performance |
| text-embedding-3-large | 3072 | OpenAI | Highest accuracy |
| text-embedding-ada-002 | 1536 | OpenAI | Classic model |
| BGE-large-en-v1.5 | 1024 | BAAI | Among best open-source |
| BGE-M3 | 1024 | BAAI | Multi-language, multi-granularity |
| E5-large-v2 | 1024 | Microsoft | High-quality open-source |
| voyage-3 | 1024 | Voyage AI | Code/document specialized |
| Cohere embed-v3 | 1024 | Cohere | Multi-language optimized |
Embedding Code Example
from openai import OpenAI
client = OpenAI()
def get_embedding(text, model="text-embedding-3-small"):
response = client.embeddings.create(input=text, model=model)
return response.data[0].embedding
# Example
embedding = get_embedding("Agent memory systems are very important")
print(f"Dimensions: {len(embedding)}") # 1536
# Using open-source models (sentence-transformers)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = model.encode(["memory systems for agents", "vector databases"])
Similarity Metrics
Cosine Similarity
The most commonly used similarity metric:

\[ \text{sim}(a, b) = \cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|} \]

- Range: \([-1, 1]\), where 1 means identical direction, 0 means orthogonal, and -1 means opposite
- Advantage: insensitive to vector magnitude, which suits text similarity
Other Metrics
| Metric | Formula | When to Use |
|---|---|---|
| Euclidean distance | \(d = \sqrt{\sum_i (a_i - b_i)^2}\) | When absolute magnitudes matter |
| Dot product | \(s = \sum_i a_i b_i\) | Already-normalized vectors (then equivalent to cosine) |
| Manhattan distance | \(d = \sum_i \lvert a_i - b_i \rvert\) | High-dimensional sparse data |
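A minimal numpy sketch of these metrics (the example vectors are illustrative); it shows concretely why cosine similarity ignores magnitude while Euclidean distance does not:

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the normalized vectors: magnitude cancels out
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude
print(cosine_similarity(a, b))   # 1.0   -> identical for cosine
print(euclidean_distance(a, b))  # ~3.74 -> magnitude difference still counts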
Approximate Nearest Neighbor Search (ANN)
Exact nearest neighbor search costs \(O(nd)\) per query (\(n\) vectors of dimension \(d\)), which is infeasible at scale. ANN algorithms trade a small loss in recall for orders-of-magnitude speedups.
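For reference, a minimal numpy sketch of the exact (brute-force) baseline, assuming all vectors are L2-normalized so the dot product equals cosine similarity; this is fine for thousands of vectors but prohibitive for billions:

import numpy as np

def exact_search(index_vectors, query, k=5):
    # One (n x d) @ (d,) matrix-vector product: O(nd) per query
    scores = index_vectors @ query
    top_k = np.argsort(-scores)[:k]  # indices of the k highest scores
    return top_k, scores[top_k]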
HNSW (Hierarchical Navigable Small World)
The most popular ANN algorithm currently:
graph TB
subgraph "HNSW Multi-Layer Graph Structure"
subgraph "Layer 2 (Sparse)"
A2((A)) --- B2((B))
end
subgraph "Layer 1 (Medium)"
A1((A)) --- B1((B))
A1 --- C1((C))
B1 --- D1((D))
end
subgraph "Layer 0 (Dense)"
A0((A)) --- B0((B))
A0 --- C0((C))
B0 --- D0((D))
C0 --- E0((E))
D0 --- F0((F))
E0 --- F0
end
end
Q[Query Vector q] -.->|Start from top layer| A2
A2 -.->|Descend to lower layer| A1
A1 -.->|Refine search| E0
Principle:
- Build a multi-layer graph structure: upper layers sparse, lower layers dense
- Search starts from the top layer entry point, greedily searching for nearest neighbors
- Descend layer by layer, refining search results at each layer
- Search complexity: \(O(\log n)\) (see the sketch below)
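A minimal sketch using the hnswlib library (a widely used HNSW implementation); the parameter values here are typical starting points, not tuned recommendations:

import hnswlib
import numpy as np

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# M: edges per node; ef_construction: candidate list size during build
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)  # search-time recall/speed trade-off (keep ef >= k)

labels, distances = index.knn_query(data[:1], k=5)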
IVF (Inverted File Index)
- First partition the vector space by clustering (e.g., K-Means into \(K\) cells)
- At search time, probe only the nprobe partitions nearest to the query
- Search complexity: \(O(n \cdot \text{nprobe} / K)\) vector comparisons (see the sketch below)
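A sketch of IVF using the FAISS library; the nlist and nprobe values are illustrative:

import faiss
import numpy as np

d, n = 128, 100_000
xb = np.random.rand(n, d).astype(np.float32)

nlist = 1024                          # number of K-Means partitions
quantizer = faiss.IndexFlatL2(d)      # stores the partition centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                       # learn centroids from the data
index.add(xb)
index.nprobe = 8                      # probe only the 8 nearest partitions
D, I = index.search(xb[:1], 5)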
Product Quantization (PQ)
Splits high-dimensional vectors into sub-vectors and quantizes each against a learned codebook, compressing storage 4-64x while still supporting approximate distance computation, as in the sketch below.
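A FAISS sketch, assuming 128-dim vectors split into 16 sub-vectors with 8-bit codes (32x compression versus raw float32):

import faiss
import numpy as np

d = 128
xb = np.random.rand(100_000, d).astype(np.float32)

index = faiss.IndexPQ(d, 16, 8)  # 16 sub-vectors, 8-bit codes each
index.train(xb)                  # learns a 256-entry codebook per sub-vector
index.add(xb)
D, I = index.search(xb[:1], 5)
# Storage: 16 bytes/vector vs 512 bytes raw float32 -> 32x compression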
Popular Vector Databases
Comparison Table
| Database | Type | ANN Algorithm | Features | Use Case |
|---|---|---|---|---|
| Pinecone | Cloud-hosted | Proprietary | Fully managed, easy to use | Quick start, production-grade |
| Weaviate | Open-source/Cloud | HNSW | Multi-modal, GraphQL | Complex queries, hybrid search |
| ChromaDB | Open-source | HNSW | Embedded, Python-first | Prototyping, local |
| Milvus | Open-source/Cloud | IVF/HNSW | Large-scale, high-performance | Billion-scale data |
| Qdrant | Open-source/Cloud | HNSW | Rust implementation, great filtering | Complex filtering needs |
| pgvector | Extension | IVF/HNSW | PostgreSQL extension | Existing PG infrastructure |
| FAISS | Library | Multiple | Meta open-source library | Research, customization |
ChromaDB Example (Local Development)
import chromadb
# Create an in-memory client (use chromadb.PersistentClient(path="./db") to persist)
client = chromadb.Client()
# Create collection
collection = client.create_collection(
name="agent_memory",
metadata={"hnsw:space": "cosine"}
)
# Add memories
collection.add(
documents=[
"User preference: likes concise answers",
"Task record: completed data analysis report",
"Learning note: mastered Python decorators",
],
ids=["mem_1", "mem_2", "mem_3"],
metadatas=[
{"type": "preference", "timestamp": "2024-01-01"},
{"type": "task", "timestamp": "2024-01-02"},
{"type": "knowledge", "timestamp": "2024-01-03"},
]
)
# Retrieve relevant memories
results = collection.query(
query_texts=["What kind of answer style does the user prefer?"],
n_results=2,
where={"type": "preference"} # Optional metadata filter
)
print(results["documents"])
Pinecone Example (Production Environment)
from pinecone import Pinecone
pc = Pinecone(api_key="your-api-key")
index = pc.Index("agent-memory")
# Write (upsert)
index.upsert(vectors=[
{
"id": "mem_1",
"values": embedding_vector, # 1536-dim vector
"metadata": {
"text": "User preference: concise answers",
"type": "preference",
"timestamp": 1704067200
}
}
])
# Query
results = index.query(
vector=query_embedding,
top_k=5,
include_metadata=True,
filter={"type": {"$eq": "preference"}}
)
pgvector Example (PostgreSQL)
-- Enable extension
CREATE EXTENSION vector;
-- Create table
CREATE TABLE agent_memory (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536),
memory_type VARCHAR(50),
created_at TIMESTAMP DEFAULT NOW()
);
-- Create HNSW index
CREATE INDEX ON agent_memory
USING hnsw (embedding vector_cosine_ops);
-- Similarity search (<=> is cosine distance, so 1 - distance = similarity;
-- query_embedding stands for a vector parameter supplied by the application,
-- e.g. '[0.1, 0.2, ...]'::vector)
SELECT content, 1 - (embedding <=> query_embedding) AS similarity
FROM agent_memory
WHERE memory_type = 'preference'
ORDER BY embedding <=> query_embedding
LIMIT 5;
Vector Search Pipeline
graph LR
A[Raw Text] --> B[Text Chunking]
B --> C[Embedding Model]
C --> D[(Vector Database<br/>Index)]
E[Query] --> F[Query Embedding]
F --> G[ANN Search]
D --> G
G --> H[Top-K Results]
H --> I[Reranking]
I --> J[Final Results]
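The reranking stage typically rescores the top-K candidates with a cross-encoder, which reads the query and each candidate jointly and is more accurate than embedding similarity alone. A sketch with sentence-transformers; the model name is one commonly used public checkpoint:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What answer style does the user prefer?"
candidates = [
    "User preference: likes concise answers",
    "Task record: completed data analysis report",
]

# Score each (query, candidate) pair jointly, then sort by relevance
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]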
Hybrid Search
Pure vector search can miss results that hinge on exact keywords (names, IDs, rare terms). Hybrid search combines vector search with traditional keyword search (e.g., BM25):
# Weaviate hybrid search example (v3 Python client)
import weaviate

client = weaviate.Client("http://localhost:8080")
result = client.query.get("Memory", ["content", "type"]).with_hybrid(
    query="Python decorators",
    alpha=0.7  # 0 = pure keyword (BM25), 1 = pure vector
).with_limit(5).do()
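If the database has no native hybrid mode, the two result lists can be merged client-side, for example with reciprocal rank fusion (RRF). A minimal sketch; k=60 is the constant conventionally used in the RRF literature:

def rrf_merge(keyword_ids, vector_ids, k=60):
    # Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)
    scores = {}
    for ids in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf_merge(["mem_2", "mem_1"], ["mem_1", "mem_3"]))
# ['mem_1', 'mem_2', 'mem_3'] -- mem_1 appears in both lists, so it wins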
Practical Recommendations
Embedding Model Selection
- General use: OpenAI text-embedding-3-small (best cost-performance)
- High accuracy: OpenAI text-embedding-3-large or Cohere embed-v3
- Private deployment: BGE-M3 or E5-large-v2
- Code scenarios: Voyage Code or CodeBERT
Chunking Strategy
- Fixed size: Simple but may break semantic units
- Semantic chunking: Split based on paragraphs/sections
- Recursive chunking: Split by large structure first, then recursively subdivide
- Recommended chunk size: 256-512 tokens, with 50-100 tokens of overlap (see the sketch below)
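A dependency-free sketch of fixed-size chunking with overlap, approximating tokens by whitespace-separated words; a production pipeline would count tokens with the embedding model's tokenizer:

def chunk_text(text, chunk_size=384, overlap=64):
    # Whitespace words as a rough token proxy
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # step back by `overlap` words each time
    return chunks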
Performance Optimization
- Use batch operations to reduce API calls
- Build a cache layer for hot data
- Choose appropriate ANN parameters (e.g., HNSW's M and efConstruction)
- Consider reducing vector dimensions (e.g., Matryoshka embeddings, sketched below)
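As one example of dimension reduction, OpenAI's text-embedding-3 models are Matryoshka-trained and accept a dimensions parameter that truncates vectors server-side:

from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    input="Agent memory systems",
    model="text-embedding-3-small",
    dimensions=512,  # truncated from 1536; leading dims carry most signal
)
print(len(response.data[0].embedding))  # 512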
Further Reading
- RAG-Augmented Memory - Complete RAG pipeline based on vector retrieval
- Johnson, J., Douze, M., & Jégou, H. (2019). "Billion-scale similarity search with GPUs" (FAISS)
- Malkov, Y. A., & Yashunin, D. A. (2020). "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs" (HNSW, IEEE TPAMI)