Cassandra · Stanley Jacob

Apache Cassandra is a wide-column, masterless distributed database built for very high write throughput and availability across many nodes and data centers. Descended from Amazon's Dynamo and Google's Bigtable, it has no single leader: every node is equal, data is spread by consistent hashing, and the system stays writable even when nodes fail.

The defining rule is query-first modeling: you design tables around the exact queries you will run, not around normalized entities. Because Cassandra cannot do joins or ad-hoc filtering efficiently, you often store the same data several times, once per access pattern, and accept that denormalization as the price of scale.

How to use it

A table's primary key has two parts: the partition key decides which node stores the row, and the clustering key orders rows within that partition. Queries that hit a single partition are fast; queries that do not are to be avoided.

CREATE TABLE events_by_user (
    user_id uuid,
    ts      timestamp,
    event   text,
    PRIMARY KEY ((user_id), ts)          -- partition by user_id, cluster by ts
) WITH CLUSTERING ORDER BY (ts DESC);

-- fast: one partition, already ordered newest-first
SELECT * FROM events_by_user WHERE user_id = ? LIMIT 50;

Key mechanics, and why

Writes are cheap because Cassandra is built on a log-structured merge tree: every write is an append to a commit log and an in-memory table that later flushes to immutable files, with background compaction merging them. That append-only design is why write throughput is so high. Consistency is tunable per query via replication factor and consistency level: requiring acknowledgements from a quorum on both reads and writes gives strong consistency, while lower levels trade consistency for latency and availability, a direct expression of the CAP trade-off.

Trade-offs

You get linear write scalability and multi-data-center, always-on availability, and you give up joins, ad-hoc queries, and easy strong consistency. Cassandra fits time-series, event logs, and write-heavy workloads with known query patterns; it is the wrong tool when you need flexible querying or rich transactions.

Internals worth knowing

Cassandra is a log-structured store. A write lands in the commit log (for durability) and an in-memory memtable, which is later flushed to an immutable SSTable; a read may merge several SSTables, and compaction continually rewrites them. The compaction strategy is a real tuning knob: size-tiered (STCS) for write-heavy, leveled (LCS) for read-heavy, and time-window (TWCS) for time-series. Deletes write tombstones rather than removing data, and too many tombstones, or unbounded wide partitions, are the classic Cassandra pitfalls.

The ring is coordinated by gossip, with a Murmur3 partitioner placing each row by token. Consistency is per query: ONE, QUORUM, LOCAL_QUORUM, ALL, and so on, combined with the replication factor, let you choose where you sit on the CAP curve. Hinted handoff and read repair heal replicas that missed writes, Paxos-based lightweight transactions add compare-and-set when you truly need it (at a latency cost), and a per-SSTable bloom filter cheaply skips files that cannot hold a key.

Using it from Python

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("myks")

# prepared statements are cached server-side and avoid re-parsing
insert = session.prepare(
    "INSERT INTO events_by_user (user_id, ts, event) VALUES (?, ?, ?)")
session.execute(insert, (user_id, ts, "click"))

rows = session.execute(
    "SELECT * FROM events_by_user WHERE user_id = %s LIMIT 50", (user_id,))

Worked example

Because the table is partitioned by user and clustered by time, fetching a user's recent events is a single-partition, pre-sorted read, no scatter-gather and no sort at query time.

In production

Cassandra's archetypal use is write-heavy storage at massive scale. Discord stored first billions and then trillions of messages on Cassandra, modeling a table per query and partitioning by channel and time bucket. They eventually migrated to ScyllaDB, a C++ reimplementation, to escape JVM garbage-collection pauses, shrinking the largest cluster from 177 to 72 nodes and cutting p99 read latency from roughly 40 to 125 ms down to about 15 ms (Discord Engineering). The story is a clean illustration of both Cassandra's strengths (write scale, query-first modeling) and its operational pain points.

Follow-up questions

Partition key vs clustering key? The partition key chooses the node and groups rows; the clustering key orders rows within a partition.
Why model query-first and denormalize? Cassandra has no efficient joins, so you store data per access pattern to keep every read a single-partition lookup.
How is consistency tuned? By replication factor and per-query consistency level; quorum reads plus quorum writes give strong consistency, lower levels favor availability and latency.
Why are writes so fast? An LSM-tree design appends writes to a log and memtable and merges in the background, avoiding in-place updates.
When is Cassandra the wrong choice? When you need ad-hoc queries, joins, or strong multi-row transactions rather than known, write-heavy access patterns.

References

DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (2007).
Chang et al., Bigtable (2006).
Apache Cassandra Documentation.
Discord Engineering, How Discord Stores Trillions of Messages.