Database indexing and B-trees

Software & architecture · architecture · Feb 2024

An index is a secondary structure that lets a database find rows without scanning the whole table, turning an $O(n)$ full scan into an $O(\log n)$ lookup. Most relational databases use B-trees (specifically B+ trees), which are wide, shallow, balanced trees tuned for the realities of disk.

The key idea is high fan-out. A B-tree node is sized to a disk page and holds hundreds of keys with hundreds of children, so the tree height is $\log_B n$ for branching factor $B$. With $B$ in the hundreds, a billion-row index is only about four levels deep, meaning a lookup touches roughly four pages, a handful of disk reads, instead of millions. That alignment of node size with the unit of I/O is the whole reason B-trees beat binary trees on disk.

B+ trees and range scans

In a B+ tree all values live in the leaves and the leaves are linked in order, so a range query finds the start in $O(\log n)$ and then walks the linked leaves sequentially, which is exactly the access pattern disks are fastest at. Internal nodes hold only keys for navigation, which raises fan-out further and keeps the tree shallow.

Composite indexes and selectivity

A composite index on $(a, b)$ is sorted by $a$ then $b$, so it can serve queries that filter on $a$, or on $a$ and $b$, but not on $b$ alone, the leftmost-prefix rule, which makes column order a real design decision. A covering index includes every column a query needs so the query is answered from the index without touching the table. And the planner uses selectivity, how few rows a condition matches, to decide whether an index is even worth using over a scan.

B-tree versus LSM-tree

B-trees update in place and are read-optimized, but every write may rewrite a page. LSM-trees (O'Neil et al. 1996), used by Cassandra, RocksDB, and LevelDB, instead buffer writes in memory and flush sorted, immutable files that background compaction merges, which makes them write-optimized at the cost of read amplification and compaction work. Choosing between them is really choosing whether your workload is read-heavy or write-heavy.

Step by step

  1. Index the columns you filter, join, or sort on, and unique constraints.
  2. A B-tree keeps keys sorted with high fan-out, so lookups touch few pages.
  3. Composite indexes follow the leftmost-prefix rule; order the columns to match queries.
  4. Remember each index costs storage and slows writes that must maintain it.
import bisect

class Index:                       # illustrates index behavior (a real DB uses a B-tree)
    def __init__(self):
        self.keys = []
        self.rows = {}

    def insert(self, key, row):
        bisect.insort(self.keys, key)
        self.rows[key] = row

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[key]
        return None

    def range(self, lo, hi):
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [self.rows[k] for k in self.keys[i:j]]

Complexity (time and space)

A B-tree gives $O(\log n)$ search, insert, and delete plus efficient range scans, and its height $\log_B n$ is what bounds disk reads. The simplified sorted-array index above has $O(\log n)$ search but $O(n)$ insert because of shifting, which is exactly what a B-tree's node splits avoid. Every index also costs extra storage and slows writes.

Worked example

Point lookups and an ordered range query against the index:

idx = Index()
for k, v in [(5,"e"), (2,"b"), (8,"h"), (1,"a"), (6,"f")]:
    idx.insert(k, v)
print(idx.get(6))        # 'f'              (point lookup)
print(idx.get(3))        # None             (missing key)
print(idx.range(2, 6))   # ['b', 'e', 'f']  (range scan)

Follow-up questions

  • Why a B-tree instead of a binary search tree? High fan-out sizes nodes to disk pages, so the tree is shallow and a lookup touches only a few page reads instead of many.
  • What is the leftmost-prefix rule? A composite index on (a, b) is sorted by a then b, so it serves filters on a or a-and-b, but not b alone.
  • What is a covering index? An index that contains every column a query needs, so the query is answered from the index without reading the table.
  • When does an index hurt? On write-heavy tables: each insert must update every index, and unused indexes waste space and slow writes.
  • B-tree vs LSM-tree? B-trees update in place and are read-optimized; LSM-trees buffer and merge sorted runs, trading read/compaction cost for high write throughput.

References

  1. Bayer & McCreight, Organization and Maintenance of Large Ordered Indexes (B-trees, 1972).
  2. O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (1996).
  3. Kleppmann, Designing Data-Intensive Applications (2017).