Vector Database in Practice

1. Embedding Models

1.1 Embedding Models Overview

Embedding models convert text into high-dimensional vectors so that semantically similar texts are closer in vector space. Choosing the right embedding model is key to RAG system performance.
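
As a quick illustration, a minimal sketch (the model name all-MiniLM-L6-v2 is an arbitrary choice, not one endorsed by the comparison below) showing that related sentences score higher than unrelated ones:

from sentence_transformers import SentenceTransformer

# Minimal sketch: semantically related sentences land closer in vector space.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(
    ["What is a vector database?", "Vector database introduction", "How to bake bread"],
    normalize_embeddings=True,  # unit length, so dot product == cosine similarity
)
print(emb @ emb[0])  # the first two (related) sentences score higher than the third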

1.2 Major Embedding Model Comparison

| Model | Dimensions | Max Length | Chinese Support | Features |
|-------|------------|------------|-----------------|----------|
| OpenAI text-embedding-ada-002 | 1536 | 8191 tokens | Good | Versatile, API-based |
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | Good | Cost-effective |
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | Good | Highest accuracy |
| BGE-large-zh | 1024 | 512 tokens | Excellent | Top open-source Chinese model |
| BGE-M3 | 1024 | 8192 tokens | Excellent | Multilingual, multi-granularity, multi-functional |
| E5-large-v2 | 1024 | 512 tokens | Good | From Microsoft, strong performance |
| E5-mistral-7b | 4096 | 32768 tokens | Good | LLM-based embedding model |
| Cohere embed-v3 | 1024 | 512 tokens | Good | Supports retrieval/classification/clustering |
| Jina-embeddings-v2 | 768 | 8192 tokens | Good | Good long-text support |

1.3 Selection Recommendations

  • Chinese scenarios: BGE-M3 or BGE-large-zh
  • Multilingual: BGE-M3 or Cohere embed-v3
  • Quick prototyping: OpenAI text-embedding-3-small
  • Highest accuracy: OpenAI text-embedding-3-large
  • Long text: E5-mistral-7b or Jina-embeddings-v2
  • Private deployment: BGE series or E5 series

1.4 Usage Examples

# OpenAI Embeddings
from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What is a vector database?", "Vector database introduction"]
)
embeddings = [item.embedding for item in response.data]

# BGE Embeddings (local deployment)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
texts = ["What is a vector database?", "Principles of vector retrieval"]
embeddings = model.encode(texts, normalize_embeddings=True)

2. Vector Database Comparison

2.1 Major Vector Databases

ChromaDB — Lightweight Starter

import chromadb

# Create client
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

# Add documents
collection.add(
    documents=["Document 1 content", "Document 2 content"],
    metadatas=[{"source": "file1"}, {"source": "file2"}],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_texts=["search keyword"],
    n_results=5
)

Features: Embedded, Python-native, zero configuration, ideal for prototyping

Pinecone — Fully Managed Service

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="documents",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("documents")

# Upsert
index.upsert(
    vectors=[
        {"id": "id1", "values": embedding1, "metadata": {"text": "..."}},
        {"id": "id2", "values": embedding2, "metadata": {"text": "..."}},
    ]
)

# Query
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)

Features: Fully managed, zero-ops, auto-scaling, enterprise SLA

Milvus — Scalable Open-Source Solution

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

# Connect
connections.connect("default", host="localhost", port="19530")

# Define Schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
]
schema = CollectionSchema(fields)

# Create collection
collection = Collection("documents", schema)

# Create index
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "HNSW", "metric_type": "COSINE", "params": {"M": 16, "efConstruction": 256}}
)

# Search
collection.load()
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=5,
    output_fields=["text"]
)

Features: Distributed architecture, billion-scale vectors, GPU acceleration, production-grade reliability
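
The snippet above defines the schema and searches but never inserts data. Continuing it, a column-based insert (a minimal sketch; the vectors and texts are placeholders) looks like:

# Continuing the snippet above: one column per non-auto field, in schema order.
embeddings = [[0.0] * 1536, [0.0] * 1536]  # placeholder 1536-dim vectors
texts = ["Document 1 content", "Document 2 content"]

collection.insert([embeddings, texts])
collection.flush()  # persist the new rows and make them searchable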

pgvector — PostgreSQL Extension

-- Install extension
CREATE EXTENSION vector;

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

-- Create index
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Insert data
INSERT INTO documents (content, embedding) 
VALUES ('Document content', '[0.1, 0.2, ...]');

-- Query (bind the query vector as a parameter, e.g. $1)
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;

Features: Leverages existing PostgreSQL infrastructure, SQL queries, transaction support, joint queries with relational data
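
From Python, the pgvector-python package (an extra dependency, not shown above) integrates with psycopg; a minimal sketch, assuming a local database named mydb:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=mydb", autocommit=True)
register_vector(conn)  # adapt numpy arrays to and from the vector column type

query_embedding = np.random.rand(1536).astype(np.float32)  # placeholder query vector
rows = conn.execute(
    "SELECT content FROM documents ORDER BY embedding <=> %s LIMIT 5",
    (query_embedding,),
).fetchall()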

Weaviate — Semantic Search Engine

import weaviate

client = weaviate.Client("http://localhost:8080")

# Create Schema
client.schema.create_class({
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "source", "dataType": ["string"]},
    ]
})

# Add data (auto-vectorization)
client.data_object.create(
    {"content": "Document content", "source": "file1"},
    class_name="Document"
)

# Semantic search
result = client.query.get("Document", ["content", "source"]) \
    .with_near_text({"concepts": ["search keyword"]}) \
    .with_limit(5) \
    .do()

Features: Built-in vectorization, GraphQL API, hybrid search, modular architecture

2.2 Comparison Summary

| Database | Deployment | Scalability | Ease of Use | Use Cases |
|----------|------------|-------------|-------------|-----------|
| ChromaDB | Embedded/local | Low | Very high | Prototyping, small-scale apps |
| Pinecone | Fully managed | High | High | Enterprise, zero-ops |
| Milvus | Self-hosted/cloud | Very high | Medium | Large-scale, high-performance |
| pgvector | PostgreSQL extension | Medium | High | Existing PG infrastructure |
| Weaviate | Self-hosted/cloud | High | High | Semantic search, multimodal |

3. Distance Metrics

3.1 Common Metrics

Cosine Similarity

\[\text{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||}\]
  • Range: [-1, 1], where 1 means identical
  • Use case: Text semantic similarity (most common)
  • Insensitive to vector magnitude

Dot Product

\[\text{dot}(\mathbf{a}, \mathbf{b}) = \mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i \cdot b_i\]
  • Range: Unbounded
  • Use case: Normalized vectors (equivalent to cosine similarity)
  • Fastest computation

Euclidean Distance

\[L_2(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i}(a_i - b_i)^2}\]
  • Range: [0, +infinity), where 0 means identical
  • Use case: Scenarios where vector magnitude matters
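
A quick numpy check (a minimal sketch with toy vectors) makes these definitions concrete and confirms that on normalized vectors the dot product reproduces cosine similarity:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
l2 = np.linalg.norm(a - b)

# Normalizing first makes the dot product equal to cosine similarity.
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(a_n @ b_n, cosine)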

3.2 Selection Recommendations

  • Text retrieval: Cosine similarity (recommended)
  • Normalized embeddings: Dot product (faster)
  • Image/multimodal: Euclidean distance

4. Index Types

4.1 HNSW (Hierarchical Navigable Small World)

Hierarchical structure:
Level 2: *-----------------*
Level 1: *---*---*---*---*
Level 0: ****************
  • Principle: Builds a multi-layer graph structure; search starts from the top layer and descends
  • Pros: Fast query speed, high accuracy
  • Cons: High memory usage (graph structure must be stored)
  • Parameters (see the sketch below):
      - M: Number of connections per node (16-64)
      - efConstruction: Search width during construction (higher = more accurate but slower)
      - ef: Search width during query
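
A minimal sketch using the hnswlib library (the library choice is an assumption; any HNSW implementation exposes similar knobs) ties these parameters to code:

import hnswlib
import numpy as np

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=256)
index.add_items(data, np.arange(n))

index.set_ef(64)  # query-time search width; must be >= k
labels, distances = index.knn_query(data[:1], k=5)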

4.2 IVF (Inverted File Index)

Cluster centers: C1, C2, C3, ...
C1 → [v1, v5, v12, ...]
C2 → [v2, v7, v15, ...]
C3 → [v3, v8, v20, ...]
  • Principle: Clusters vectors first; during query, only searches the nearest clusters
  • Pros: Memory efficient
  • Cons: Lower accuracy than HNSW
  • Parameters (see the sketch below):
      - nlist: Number of clusters
      - nprobe: Number of clusters to search during query
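
A minimal faiss sketch (faiss is an assumed library choice; the nlist and nprobe values are illustrative) showing the train-then-probe flow:

import faiss
import numpy as np

d, nb = 128, 100_000
xb = np.random.rand(nb, d).astype(np.float32)

nlist = 1024                      # number of k-means clusters
quantizer = faiss.IndexFlatL2(d)  # coarse quantizer holding the centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)  # k-means clustering to learn the nlist centers
index.add(xb)

index.nprobe = 16  # clusters visited per query
D, I = index.search(xb[:1], 5)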

4.3 Index Selection Recommendations

| Data Scale | Recommended Index | Notes |
|------------|-------------------|-------|
| < 100K | Flat (brute force) | Small data, exact search |
| 100K-1M | HNSW | Best balance of accuracy and speed |
| 1M-10M | IVF + HNSW | Layered indexing |
| > 10M | IVF + PQ | Compressed vectors, saves memory |
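
For the > 10M row, product quantization (PQ) compresses each stored vector to a few bytes; swapping IndexIVFFlat for IndexIVFPQ in the faiss sketch above is the only change needed (parameter values are illustrative):

# Same train/add/search flow as the IVF sketch above; d must be divisible by m.
m, nbits = 16, 8  # 16 sub-quantizers x 8 bits = 16 bytes stored per vector
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)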

5. Practical Recommendations

5.1 Performance Optimization

  • Batch insertion: Use batch APIs instead of inserting one by one (see the sketch after this list)
  • Index warm-up: Load the index into memory before querying
  • Async queries: Use async APIs for higher concurrency
  • Cache hot spots: Cache results of frequently queried terms
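
A hedged sketch of the batch-insertion tip, reusing the ChromaDB collection from section 2.1 (add_in_batches and batch_size are assumed names, not a library API):

def add_in_batches(collection, documents, metadatas, ids, batch_size=256):
    """Insert in fixed-size chunks instead of one call per document."""
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )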

5.2 Data Management

  • Metadata filtering: Use metadata to narrow the search scope (see the sketch after this list)
  • Incremental updates: Support adding, updating, and deleting documents
  • Data backup: Regularly back up vector data
  • Version management: Track embedding model versions; re-embed when the model changes
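
Continuing the ChromaDB example from section 2.1, a metadata filter narrows the candidate set before vector scoring (the where syntax is ChromaDB's):

results = collection.query(
    query_texts=["search keyword"],
    n_results=5,
    where={"source": "file1"},  # only vectors whose metadata matches are scored
)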

5.3 Monitoring Metrics

  • Query latency (P50, P95, P99)
  • Recall@K
  • Index size and memory usage
  • QPS (queries per second)
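
A minimal sketch for computing these metrics offline (the latency values are toy placeholders; Recall@K is implemented with one common definition):

import numpy as np

latencies_ms = np.array([12.0, 15.5, 11.2, 40.3, 13.9, 18.1, 95.0, 14.4])  # toy values
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant items that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / min(k, len(relevant_ids))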
