PostgreSQL

Software & architecture · databases · Jun 2024

PostgreSQL is the relational database to reach for by default. It speaks SQL, enforces a schema, supports rich types and indexes, and gives you full ACID transactions, so it is the safe home for the data your application cannot afford to corrupt. The right instinct is to start here and only move data elsewhere when a concrete requirement forces it.

Its model is rows in tables connected by foreign keys, queried with joins. Because the planner can combine indexes, joins, and aggregates arbitrarily, you do not have to know your queries in advance the way you do with most NoSQL stores, which is exactly why it is so flexible for evolving products.

How to use it

Define tables and indexes, then read and write with SQL inside transactions. The classic example, a money transfer, shows why transactions matter: both updates commit together or not at all.

CREATE TABLE orders (
    id          bigserial PRIMARY KEY,
    user_id     bigint NOT NULL,
    total_cents integer NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX idx_orders_user ON orders (user_id);   -- speeds lookups by user

BEGIN;                                               -- all-or-nothing
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

Key mechanics, and why

Postgres uses MVCC (multi-version concurrency control): a writer creates a new row version rather than overwriting, so readers never block writers and see a consistent snapshot. The why is concurrency without coarse locks, at the cost of needing periodic vacuuming to reclaim dead versions. Indexes are B-trees by default (logarithmic lookups), but Postgres also offers GIN indexes for full-text and JSONB, letting one database serve relational and semi-structured data.

Trade-offs

A single primary handles writes, so scaling writes beyond one machine needs read replicas, partitioning, or sharding, which is where horizontally-scalable stores pull ahead. For the vast majority of applications you never hit that wall, and the cost of leaving relational guarantees, hand-rolling joins and consistency in the application, is far higher than the cost of a bigger Postgres box.

Internals worth knowing

Under the hood Postgres is a multi-version, write-ahead-logged row store. Every change is first appended to the WAL (write-ahead log) and only later written to the heap, which is what makes commits durable and crash recovery possible; streaming replication is simply shipping that WAL to replicas. MVCC tags each row version with hidden xmin and xmax transaction ids, so a query sees only versions committed before its snapshot, and obsolete versions are reclaimed by VACUUM (run for you by autovacuum). A long-open transaction blocks that cleanup and bloats tables, which is the single most common Postgres performance footgun.

Isolation is configurable: the default is read committed, with repeatable read (snapshot isolation) and serializable (SSI, which aborts transactions that would break a serial order) available when you need them. The planner is cost-based and driven by table statistics, so EXPLAIN (ANALYZE) and fresh statistics are how you diagnose a slow query.

  • Index types: B-tree (default), GIN (full-text, JSONB, arrays), GiST (geometric and nearest-neighbor), BRIN (huge append-only tables), plus partial and expression indexes.
  • Scaling reads: streaming or logical replication to read replicas, and connection pooling (PgBouncer) because every connection is a separate process.
  • Scaling data: native partitioning by range, list, or hash; oversized values are transparently offloaded to TOAST.

Using it from Python

import psycopg   # psycopg 3

with psycopg.connect("postgresql://user:pass@host:5432/db") as conn:
    with conn.cursor() as cur:
        # parameterized query (never f-string SQL: that is how injection happens)
        cur.execute("SELECT id, total_cents FROM orders WHERE user_id = %s", (42,))
        rows = cur.fetchall()
    conn.commit()                  # transactions are explicit; commit or rollback

Worked example

Aggregations are where SQL shines; this ranks the top spenders in one query (illustrative result):

SELECT user_id, count(*) AS n, sum(total_cents) AS cents
FROM orders
GROUP BY user_id
ORDER BY cents DESC
LIMIT 3;

-- user_id |  n | cents
--      42 | 17 | 92310
--       7 |  9 | 41050
--       3 | 12 | 38800

In production

PostgreSQL is the default operational database at a huge number of companies, and it scales much further than people assume before you must leave it. Instagram ran its core data on heavily sharded Postgres: thousands of logical shards mapped onto a smaller set of physical servers, with Snowflake-style IDs that encode the shard so logical shards can be moved between machines as the fleet grows (Instagram Engineering). The practical lesson is that you usually outgrow a single primary long before you outgrow Postgres itself.

Follow-up questions

  • What is MVCC and why does it matter? Multi-version concurrency control keeps old row versions so readers and writers do not block each other; the cost is vacuuming dead versions.
  • When do you outgrow a single Postgres? When write throughput exceeds one primary; then you add read replicas, partition tables, or shard.
  • Why is JSONB useful in a relational database? It stores and indexes semi-structured documents alongside relational tables, so you avoid a second database for flexible fields.
  • What does a transaction guarantee? Atomicity, consistency, isolation, durability (ACID): a group of statements commits entirely or not at all.
  • B-tree vs GIN index? B-tree for ordered/equality lookups on scalar columns; GIN for containment queries over full-text, arrays, and JSONB.

References

  1. PostgreSQL Documentation.
  2. Kleppmann, Designing Data-Intensive Applications (2017).
  3. Instagram Engineering, Sharding & IDs at Instagram.