A vector database stores high-dimensional embeddings, the numeric representations produced by models, and answers nearest-neighbor queries: given a query vector, return the stored vectors closest to it by cosine or Euclidean distance. This is the retrieval engine behind semantic search, recommendations, and retrieval-augmented generation, where you embed a query and fetch the most similar documents.
Exact nearest-neighbor search is a brute-force scan, $O(nd)$ per query, which is fine for thousands of vectors and hopeless for millions. Vector databases therefore use approximate nearest neighbor (ANN) indexes that trade a little recall for orders-of-magnitude speed.
The two index families
HNSW (Malkov & Yashunin 2016) builds a multi-layer navigable graph and greedily hops toward the query, giving roughly $O(\log n)$ search, 95%-plus recall out of the box, and the ability to absorb inserts without rebuilds, at the cost of 2 to 5 times more memory. IVF partitions vectors into clusters with k-means and only searches the nearest few cells, so it builds faster and uses less memory but its recall drifts as data changes. The rule of thumb: HNSW for most workloads with active writes, IVF for very large, mostly-static sets where memory or build time dominates.
The landscape
The same HNSW core appears everywhere, so the choice is mostly operational. pgvector adds vector columns and HNSW indexes to Postgres and, at the roughly one-million-vector scale, matches dedicated systems while keeping your data in one database. Qdrant is a SIMD-optimized standalone engine with excellent latency; Milvus offers the richest set of indexes for very large scale; Pinecone is fully managed for teams that want zero operational overhead; and FAISS (Johnson et al. 2017) is a library, not a server, for building your own.
How to use it
With pgvector it is a column type and an index, and queries use a distance operator:
CREATE EXTENSION vector;
CREATE TABLE docs (id bigserial PRIMARY KEY, embedding vector(768));
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
-- the five nearest documents to a query embedding ($1) by cosine distance
SELECT id FROM docs ORDER BY embedding <=> $1 LIMIT 5;
FAISS in depth
FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings, not a server, for building
approximate-nearest-neighbor indexes over dense vectors on CPU or GPU. You choose an index type that trades
exactness, speed, and memory, optionally train it, add your vectors, and search. The workflow is always the
same three calls: train (for index types that learn structure), add, and
search(queries, k), which returns the distances and ids of the k nearest neighbors.
The index types that matter
- IndexFlatL2 / IndexFlatIP: exact brute force under Euclidean or inner-product distance. It is the recall baseline and needs no training, but it is O(nd) per query, so it is for small sets or for measuring the recall the approximate indexes lose.
- IndexIVFFlat: k-means partitions the vectors into
nlistVoronoi cells, and a query searches only thenprobenearest cells. It must betrain-ed to learn the cells;nprobeis the recall/speed dial (more cells, higher recall, slower). - IndexIVFPQ: adds product quantization on top of IVF to shrink memory. Each vector is split into
msubvectors, and each subvector is quantized to one of 256 centroids (nbits=8), so a vector is stored as justmbytes instead of 4d, and distances are computed from small precomputed lookup tables. This is what makes billion-scale search fit in RAM, at some accuracy cost. - IndexHNSWFlat: the graph index from the section above, very fast and high recall, no training, but the most memory.
Composite indexes are built with the index factory, a string that names the pipeline, for
example index_factory(d, "IVF4096,PQ16") or "OPQ16,IVF65536,PQ16" where OPQ adds a
learned rotation that makes product quantization more accurate. For cosine similarity, L2-normalize the
vectors and use the inner-product index. GPU support (index_cpu_to_gpu) gives a large speedup,
and indexes persist with write_index / read_index.
Choosing, per the FAISS guidelines: Flat for exact or tiny data; IVFFlat for
millions of vectors when memory is fine; IVFPQ or an OPQ variant for hundreds of millions to
billions where memory dominates; and HNSW when you have ample RAM and want the best
speed-recall trade. The single most important runtime knob on IVF indexes is nprobe.
import faiss
import numpy as np
d = 768 # embedding dimension
xb = np.random.rand(100_000, d).astype('float32') # database vectors
xq = np.random.rand(5, d).astype('float32') # query vectors
# 1) exact baseline (brute force, L2) -- the recall ground truth
index = faiss.IndexFlatL2(d)
index.add(xb)
D, I = index.search(xq, k=10) # distances and ids of the 10 nearest
# 2) IVF: cluster into nlist cells, then search nprobe of them
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 4096) # nlist = 4096 Voronoi cells
ivf.train(xb) # k-means on the data to learn the cells
ivf.add(xb)
ivf.nprobe = 16 # search 16 cells: more = higher recall, slower
D, I = ivf.search(xq, 10)
# 3) IVF + product quantization: compress each vector to m bytes
ivfpq = faiss.IndexIVFPQ(quantizer, d, 4096, 16, 8) # m=16 subvectors, 8 bits (256 centroids) each
ivfpq.train(xb); ivfpq.add(xb)
ivfpq.nprobe = 16
# composite index via the factory string; OPQ improves PQ accuracy
index2 = faiss.index_factory(d, "OPQ16,IVF4096,PQ16")
# cosine similarity: L2-normalize and use inner product
faiss.normalize_L2(xb)
cos = faiss.IndexFlatIP(d); cos.add(xb)
# move to GPU, and persist to disk
# gpu = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, ivfpq)
faiss.write_index(ivfpq, "vectors.faiss")
loaded = faiss.read_index("vectors.faiss")
Worked example
The index only approximates this exact cosine ranking, which a brute-force search makes concrete; a query near "cat" retrieves the animal embeddings before the vehicle ones:
import numpy as np
docs = {"cat": [0.9, 0.1, 0.0], "kitten": [0.8, 0.2, 0.0],
"sedan": [0.1, 0.0, 0.9], "truck": [0.0, 0.1, 0.95]}
q = [0.82, 0.18, 0.0]
def cosine(a, b):
a, b = np.array(a), np.array(b)
return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(sorted(docs, key=lambda k: cosine(q, docs[k]), reverse=True))
# ['kitten', 'cat', 'sedan', 'truck']
Complexity (time and space)
Exact search is $O(nd)$ time per query for $n$ vectors of dimension $d$. HNSW answers in about $O(\log n)$ with $O(nd + nM)$ memory ($M$ graph edges per node), and IVF in roughly $O(\text{probes} \times \text{cell size})$ with less memory. The dimension $d$ also matters: embeddings are often reduced or quantized (product quantization) to cut both memory and distance-computation cost.
In production
Vector search is the retrieval half of retrieval-augmented generation, so it sits inside most LLM applications: embed a corpus once, then at query time embed the question and fetch the nearest chunks to ground the model's answer. In-house systems frequently build on FAISS directly, while managed and standalone databases (pgvector, Qdrant, Pinecone, Milvus, Weaviate) wrap the same approximate-nearest- neighbor cores with persistence, metadata filtering, and operations.
Follow-up questions
- HNSW vs IVF, when each? HNSW for most workloads with active writes (graph, ~log n, high recall, more memory); IVF for huge, mostly-static data where memory or build time dominates.
- Why approximate instead of exact search? Exact search is $O(nd)$ per query, infeasible at millions of vectors; ANN trades a little recall for orders-of-magnitude speed.
- When is pgvector enough? Up to around a million vectors with HNSW it matches dedicated databases while keeping vectors next to your relational data.
- What does product quantization do? It compresses vectors into short codes so distances are computed cheaply and memory drops, at some accuracy cost.
- What is recall in this context? The fraction of true nearest neighbors the approximate index actually returns; the knob you trade against speed and memory.
References
- Malkov & Yashunin, Efficient and Robust Approximate Nearest Neighbor Search using HNSW Graphs (2016).
- Johnson, Douze & Jégou, Billion-scale Similarity Search with GPUs (FAISS, 2017).
- pgvector (open-source Postgres extension).
- Kleppmann, Designing Data-Intensive Applications (2017).
- Pinecone, Introduction to Faiss (tutorial series).
- facebookresearch, FAISS Wiki: Guidelines to Choose an Index.