# AI Engineer's Guide to Vector Databases

*Deep dive into vector databases, embeddings, and similarity search for production AI systems*
## Why Vector Databases Matter for AI Engineers
If you're building anything with LLMs, recommender systems, or semantic search, you will almost certainly need a vector database. Traditional databases index rows by exact keys (B-trees, hash indexes); vector databases index high-dimensional vectors by similarity, a fundamentally different paradigm that powers modern AI applications.
A vector database stores embeddings: dense numerical representations of data (text, images, audio) produced by neural networks. These embeddings capture semantic meaning, allowing you to find items that are conceptually similar rather than just lexically matching.
## Core Concepts

### Embeddings
An embedding is a fixed-length vector (typically 384 to 3072 dimensions) that encodes semantic meaning. Two pieces of text with similar meaning will have embeddings that are close together in vector space.
```python
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Vector databases enable similarity search",
)

# Returns a 3072-dimensional vector
embedding = response.data[0].embedding
```
Key embedding models to know:
| Model | Dimensions | Use Case |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | General purpose, high quality |
| OpenAI text-embedding-3-small | 1536 | Cost-effective general purpose |
| Cohere embed-v3 | 1024 | Multilingual, search-optimized |
| BGE-M3 | 1024 | Open-source multilingual |
| GTE-Qwen2 | Variable | Open-source, strong performance |
### Distance Metrics
How you measure “closeness” between vectors matters:
- Cosine Similarity: Measures the angle between vectors. Normalized, so magnitude doesn’t matter. Most common choice for text embeddings.
- Euclidean Distance (L2): Straight-line distance. Sensitive to magnitude. Good when absolute values matter.
- Dot Product: Faster to compute than cosine. Equivalent to cosine when vectors are normalized.
- Manhattan Distance (L1): Sum of absolute differences. More robust to outliers in sparse data.
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def dot_product(a, b):  # equals cosine similarity when vectors are unit-normalized
    return np.dot(a, b)

def manhattan_distance(a, b):  # L1: sum of absolute differences
    return np.sum(np.abs(a - b))
```
## Approximate Nearest Neighbor (ANN) Algorithms
Exact nearest neighbor search is O(n) per query—too slow for millions of vectors. ANN algorithms trade a small amount of accuracy for massive speed gains.
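To make the baseline concrete, here is what exact search looks like as a brute-force NumPy scan (an illustrative sketch, not from any particular library), scoring the query against every stored vector:

```python
import numpy as np

def exact_nn(query, vectors, k=5):
    # Brute force: compute cosine similarity against every vector -> O(n) per query
    sims = (vectors @ query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    return np.argsort(-sims)[:k]  # indices of the k most similar vectors

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 128)).astype(np.float32)
top = exact_nn(vectors[42], vectors, k=5)
```

At 10K vectors this runs in milliseconds; at hundreds of millions, the full linear scan per query is exactly the cost that ANN indexes avoid.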
### HNSW (Hierarchical Navigable Small World)
- Most popular algorithm in production
- Builds a multi-layer graph structure
- O(log n) query time
- High memory usage (stores full graph in memory)
- Used by: Qdrant, Weaviate, pgvector
### IVF (Inverted File Index)
- Clusters vectors into partitions using k-means
- Searches only the closest partitions
- Lower memory than HNSW
- Used by: Milvus, Faiss
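The clustering idea is compact enough to sketch in plain NumPy (illustrative only; production IVF implementations such as Faiss's add quantization and heavily optimized kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2_000, 32)).astype(np.float32)

# Train coarse centroids with a few rounds of k-means (nlist partitions)
nlist = 16
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
for _ in range(10):
    assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(nlist):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

# Inverted lists: partition id -> ids of the vectors assigned to that partition
inverted = {c: np.where(assign == c)[0] for c in range(nlist)}

def ivf_search(query, k=5, nprobe=4):
    # Scan only the nprobe partitions whose centroids are closest to the query
    nearest = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([inverted[c] for c in nearest])
    dists = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]
```

Raising `nprobe` trades speed for recall; setting `nprobe = nlist` degenerates back to exact search.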
### ScaNN (Scalable Nearest Neighbors)
- Google’s algorithm using learned quantization
- Strong on high-dimensional data
- Used by: Google’s Vertex AI Vector Search
### DiskANN
- Microsoft’s disk-based algorithm
- Handles billion-scale datasets without fitting in memory
- Used by: Azure AI Search, Milvus (hybrid)
## Major Vector Databases Compared

### Pinecone
- Fully managed, serverless option available
- Simple API, great DX
- Sparse-dense hybrid search built-in
- Namespaces for multi-tenancy
- Best for: Teams wanting zero-ops vector search
```python
from pinecone import Pinecone

pc = Pinecone(api_key="your-key")
index = pc.Index("my-index")

# Upsert vectors with metadata
index.upsert(vectors=[
    {"id": "doc1", "values": embedding, "metadata": {"source": "wiki", "topic": "ai"}},
])

# Query with metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"topic": {"$eq": "ai"}},
)
```
### Weaviate
- Open-source with cloud offering
- Built-in vectorization modules (bring your own model or use an integrated one)
- GraphQL query interface
- Multi-modal support (text, image, etc.)
- Best for: Teams wanting integrated ML pipeline
### Qdrant
- Open-source, written in Rust (fast)
- Rich filtering with payload indexes
- Quantization support (reduce memory 4-8x)
- gRPC and REST APIs
- Best for: Performance-critical applications
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=1, vector=embedding, payload={"text": "document content"}),
    ],
)
```
### Milvus / Zilliz
- Open-source (Milvus) with managed cloud (Zilliz)
- Scales to billions of vectors
- Multiple index types (HNSW, IVF, DiskANN)
- GPU-accelerated indexing
- Best for: Large-scale enterprise deployments
### pgvector
- PostgreSQL extension
- Use your existing Postgres infrastructure
- Supports HNSW and IVF-Flat indexes
- Best for: Teams already on Postgres who want to avoid new infrastructure
```sql
CREATE EXTENSION vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- <=> is cosine distance; supply the query embedding as a parameter
SELECT content, 1 - (embedding <=> query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 10;
```
## Production Patterns

### Hybrid Search
Pure vector search misses exact matches. Pure keyword search misses semantic matches. Combine both:
```python
# Pseudo-code for hybrid search
def hybrid_search(query, alpha=0.7):
    # Vector search (semantic)
    vector_results = vector_db.search(embed(query), top_k=20)
    # Keyword search (lexical, BM25)
    keyword_results = search_engine.search(query, top_k=20)
    # Reciprocal Rank Fusion, weighted toward the semantic side
    combined = reciprocal_rank_fusion(
        [vector_results, keyword_results],
        weights=[alpha, 1 - alpha],
    )
    return combined[:10]
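The fusion step itself is simple to implement. Here is a minimal weighted reciprocal rank fusion (my sketch; the constant `k = 60` is the value commonly used in the RRF literature):

```python
def reciprocal_rank_fusion(result_lists, weights=None, k=60):
    # Each list contributes weight / (k + rank) for every doc id it returns;
    # ids that rank well in several lists accumulate the highest scores.
    weights = weights or [1.0] * len(result_lists)
    scores = {}
    for results, w in zip(result_lists, weights):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF only looks at ranks, never raw scores, so the vector and keyword results do not need to be normalized against each other.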
### Metadata Filtering
Always store metadata alongside vectors. Filter before or during search, not after:
```python
# Bad: fetch 1000 results then filter in application
results = index.query(vector=query_vec, top_k=1000)
filtered = [r for r in results if r.metadata["category"] == "tech"]

# Good: filter at the database level
results = index.query(
    vector=query_vec,
    top_k=10,
    filter={"category": {"$eq": "tech"}},
)
```
### Chunking Strategies
How you split documents before embedding dramatically affects retrieval quality:
- Fixed-size chunks (e.g., 512 tokens): Simple and works reasonably well, but risks splitting mid-sentence.
- Semantic chunking: Split at topic boundaries. Better quality, more complex.
- Recursive character splitting: Try large chunks first, recursively split if too big. Good default.
- Parent-child chunking: Embed small chunks, retrieve the larger parent. Best of both worlds.
```python
# Parent-child chunking pattern (embed_fn is your embedding function)
def split_into_chunks(text, size):
    # Naive word-count splitter; production systems split on tokens or sentences
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def create_parent_child_chunks(document, embed_fn, child_size=200, parent_size=1000):
    results = []
    for parent in split_into_chunks(document, parent_size):
        for child in split_into_chunks(parent, child_size):
            results.append({
                "child_text": child,
                "child_embedding": embed_fn(child),
                "parent_text": parent,  # stored as metadata
            })
    return results
```
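The recursive character splitting strategy mentioned above can be sketched as well (an illustrative implementation with hypothetical defaults, not tied to any particular library):

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", " ")):
    # Fits already: nothing to do
    if len(text) <= max_len:
        return [text]
    # Out of separators: hard-cut as a last resort
    if not separators:
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    # Split on the coarsest separator; recurse into pieces that are still too big
    chunks = []
    for piece in text.split(sep):
        if len(piece) > max_len:
            chunks.extend(recursive_split(piece, max_len, rest))
        else:
            chunks.append(piece)
    # Greedily merge adjacent small chunks back up toward max_len
    merged = [chunks[0]]
    for c in chunks[1:]:
        if len(merged[-1]) + len(sep) + len(c) <= max_len:
            merged[-1] += sep + c
        else:
            merged.append(c)
    return merged
```

The merge pass is what makes this a good default: paragraphs stay intact when they fit, and only oversized pieces get broken at progressively finer boundaries.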
### Quantization for Cost Reduction
Production tip: quantize vectors to reduce memory and cost by 4-8x with minimal quality loss:
- Scalar Quantization: float32 → int8 (4x reduction, ~1-2% quality loss)
- Binary Quantization: float32 → 1-bit (32x reduction, use as first-pass filter)
- Product Quantization: Splits vectors into sub-vectors, quantizes each (configurable tradeoff)
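Scalar quantization in particular is easy to demonstrate (a NumPy sketch assuming a single global min/max; real engines typically calibrate per dimension or per segment):

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1_000, 256)).astype(np.float32)

# Scalar quantization: map float32 values onto 256 int8 buckets
lo, hi = float(vecs.min()), float(vecs.max())
scale = (hi - lo) / 255.0
quantized = np.round((vecs - lo) / scale - 128).astype(np.int8)  # 4x smaller

# Dequantize to approximate the originals for distance computations
restored = (quantized.astype(np.float32) + 128.0) * scale + lo
mean_err = float(np.abs(restored - vecs).mean())
```

Binary quantization goes further: keep only the sign bit per dimension (e.g., `np.packbits(vecs > 0, axis=1)` for a 32x reduction), then rescore the shortlist with full-precision vectors.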
## When NOT to Use a Vector Database
- Exact matching: Use traditional databases or search engines
- Small datasets (< 10K vectors): NumPy/Faiss in-memory is simpler
- Structured queries only: If you never need similarity search, don’t add complexity
- Real-time updates with consistency: Most vector DBs are eventually consistent
## Key Metrics to Monitor
- Recall@K: What fraction of true nearest neighbors are in your top-K results?
- Latency (p50/p95/p99): Query response time under load
- Throughput: Queries per second your system handles
- Index build time: How long after upsert before vectors are searchable?
- Memory usage: Vectors + index overhead per million vectors
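Recall@K is the metric teams most often skip, and it is trivial to compute offline. A hypothetical helper, where `true_neighbors` comes from an exact brute-force scan and `retrieved` from your ANN index:

```python
def recall_at_k(true_neighbors, retrieved, k):
    # Fraction of the exact top-k neighbors that the ANN index actually returned
    return len(set(true_neighbors[:k]) & set(retrieved[:k])) / k
```

For example, `recall_at_k([1, 2, 3, 4], [1, 2, 9, 4], k=3)` returns 2/3: the index found neighbors 1 and 2 but missed 3.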
## Takeaways
- Start with pgvector if you’re already on Postgres and have < 10M vectors
- Choose Qdrant or Weaviate for open-source, self-hosted deployments
- Choose Pinecone if you want zero operational overhead
- Always implement hybrid search for production RAG systems
- Chunk wisely—retrieval quality depends more on chunking than on which vector DB you choose
- Monitor recall, not just latency—a fast wrong answer is worse than a slow right one