Vector databases · Stanley Jacob

A vector database stores high-dimensional embeddings, the numeric representations produced by models, and answers nearest-neighbor queries: given a query vector, return the stored vectors closest to it by cosine or Euclidean distance. This is the retrieval engine behind semantic search, recommendations, and retrieval-augmented generation, where you embed a query and fetch the most similar documents.

Exact nearest-neighbor search is a brute-force scan, $O(nd)$ per query, which is fine for thousands of vectors and hopeless for millions. Vector databases therefore use approximate nearest neighbor (ANN) indexes that trade a little recall for orders-of-magnitude speed.

The two index families

HNSW (Malkov & Yashunin 2016) builds a multi-layer navigable graph and greedily hops toward the query, giving roughly $O(\log n)$ search, 95%-plus recall out of the box, and the ability to absorb inserts without rebuilds, at the cost of 2 to 5 times more memory. IVF partitions vectors into clusters with k-means and only searches the nearest few cells, so it builds faster and uses less memory but its recall drifts as data changes. The rule of thumb: HNSW for most workloads with active writes, IVF for very large, mostly-static sets where memory or build time dominates.

The landscape

The same HNSW core appears everywhere, so the choice is mostly operational. pgvector adds vector columns and HNSW indexes to Postgres and, at the roughly one-million-vector scale, matches dedicated systems while keeping your data in one database. Qdrant is a SIMD-optimized standalone engine with excellent latency; Milvus offers the richest set of indexes for very large scale; Pinecone is fully managed for teams that want zero operational overhead; and FAISS (Johnson et al. 2017) is a library, not a server, for building your own.

How to use it

With pgvector it is a column type and an index, and queries use a distance operator:

CREATE EXTENSION vector;
CREATE TABLE docs (id bigserial PRIMARY KEY, embedding vector(768));
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

-- the five nearest documents to a query embedding ($1) by cosine distance
SELECT id FROM docs ORDER BY embedding <=> $1 LIMIT 5;

FAISS in depth

FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings, not a server, for building approximate-nearest-neighbor indexes over dense vectors on CPU or GPU. You choose an index type that trades exactness, speed, and memory, optionally train it, add your vectors, and search. The workflow is always the same three calls: train (for index types that learn structure), add, and search(queries, k), which returns the distances and ids of the k nearest neighbors.

The index types that matter

IndexFlatL2 / IndexFlatIP: exact brute force under Euclidean or inner-product distance. It is the recall baseline and needs no training, but it is O(nd) per query, so it is for small sets or for measuring the recall the approximate indexes lose.
IndexIVFFlat: k-means partitions the vectors into nlist Voronoi cells, and a query searches only the nprobe nearest cells. It must be train-ed to learn the cells; nprobe is the recall/speed dial (more cells, higher recall, slower).
IndexIVFPQ: adds product quantization on top of IVF to shrink memory. Each vector is split into m subvectors, and each subvector is quantized to one of 256 centroids (nbits=8), so a vector is stored as just m bytes instead of 4d, and distances are computed from small precomputed lookup tables. This is what makes billion-scale search fit in RAM, at some accuracy cost.
IndexHNSWFlat: the graph index from the section above, very fast and high recall, no training, but the most memory.

Composite indexes are built with the index factory, a string that names the pipeline, for example index_factory(d, "IVF4096,PQ16") or "OPQ16,IVF65536,PQ16" where OPQ adds a learned rotation that makes product quantization more accurate. For cosine similarity, L2-normalize the vectors and use the inner-product index. GPU support (index_cpu_to_gpu) gives a large speedup, and indexes persist with write_index / read_index.

Choosing, per the FAISS guidelines: Flat for exact or tiny data; IVFFlat for millions of vectors when memory is fine; IVFPQ or an OPQ variant for hundreds of millions to billions where memory dominates; and HNSW when you have ample RAM and want the best speed-recall trade. The single most important runtime knob on IVF indexes is nprobe.

import faiss
import numpy as np

d = 768                                          # embedding dimension
xb = np.random.rand(100_000, d).astype('float32')  # database vectors
xq = np.random.rand(5, d).astype('float32')        # query vectors

# 1) exact baseline (brute force, L2) -- the recall ground truth
index = faiss.IndexFlatL2(d)
index.add(xb)
D, I = index.search(xq, k=10)                    # distances and ids of the 10 nearest

# 2) IVF: cluster into nlist cells, then search nprobe of them
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 4096)     # nlist = 4096 Voronoi cells
ivf.train(xb)                                    # k-means on the data to learn the cells
ivf.add(xb)
ivf.nprobe = 16                                  # search 16 cells: more = higher recall, slower
D, I = ivf.search(xq, 10)

# 3) IVF + product quantization: compress each vector to m bytes
ivfpq = faiss.IndexIVFPQ(quantizer, d, 4096, 16, 8)  # m=16 subvectors, 8 bits (256 centroids) each
ivfpq.train(xb); ivfpq.add(xb)
ivfpq.nprobe = 16

# composite index via the factory string; OPQ improves PQ accuracy
index2 = faiss.index_factory(d, "OPQ16,IVF4096,PQ16")

# cosine similarity: L2-normalize and use inner product
faiss.normalize_L2(xb)
cos = faiss.IndexFlatIP(d); cos.add(xb)

# move to GPU, and persist to disk
# gpu = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, ivfpq)
faiss.write_index(ivfpq, "vectors.faiss")
loaded = faiss.read_index("vectors.faiss")

Worked example

The index only approximates this exact cosine ranking, which a brute-force search makes concrete; a query near "cat" retrieves the animal embeddings before the vehicle ones:

import numpy as np

docs = {"cat": [0.9, 0.1, 0.0], "kitten": [0.8, 0.2, 0.0],
        "sedan": [0.1, 0.0, 0.9], "truck": [0.0, 0.1, 0.95]}
q = [0.82, 0.18, 0.0]

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(sorted(docs, key=lambda k: cosine(q, docs[k]), reverse=True))
# ['kitten', 'cat', 'sedan', 'truck']

Complexity (time and space)

Exact search is $O(nd)$ time per query for $n$ vectors of dimension $d$. HNSW answers in about $O(\log n)$ with $O(nd + nM)$ memory ($M$ graph edges per node), and IVF in roughly $O(\text{probes} \times \text{cell size})$ with less memory. The dimension $d$ also matters: embeddings are often reduced or quantized (product quantization) to cut both memory and distance-computation cost.

In production

Vector search is the retrieval half of retrieval-augmented generation, so it sits inside most LLM applications: embed a corpus once, then at query time embed the question and fetch the nearest chunks to ground the model's answer. In-house systems frequently build on FAISS directly, while managed and standalone databases (pgvector, Qdrant, Pinecone, Milvus, Weaviate) wrap the same approximate-nearest- neighbor cores with persistence, metadata filtering, and operations.

Follow-up questions

HNSW vs IVF, when each? HNSW for most workloads with active writes (graph, ~log n, high recall, more memory); IVF for huge, mostly-static data where memory or build time dominates.
Why approximate instead of exact search? Exact search is $O(nd)$ per query, infeasible at millions of vectors; ANN trades a little recall for orders-of-magnitude speed.
When is pgvector enough? Up to around a million vectors with HNSW it matches dedicated databases while keeping vectors next to your relational data.
What does product quantization do? It compresses vectors into short codes so distances are computed cheaply and memory drops, at some accuracy cost.
What is recall in this context? The fraction of true nearest neighbors the approximate index actually returns; the knob you trade against speed and memory.

References

Malkov & Yashunin, Efficient and Robust Approximate Nearest Neighbor Search using HNSW Graphs (2016).
Johnson, Douze & Jégou, Billion-scale Similarity Search with GPUs (FAISS, 2017).
pgvector (open-source Postgres extension).
Kleppmann, Designing Data-Intensive Applications (2017).
Pinecone, Introduction to Faiss (tutorial series).
facebookresearch, FAISS Wiki: Guidelines to Choose an Index.