Consensus with Raft · Stanley Jacob

Consensus is how a cluster of machines agrees on a single sequence of operations even though any of them can crash or fall behind. It is the foundation under replicated databases, configuration stores, and the kind of durable job queue that has to survive a node dying mid-task. Raft was designed to provide the same guarantees as Paxos while being genuinely understandable.

Each node runs a copy of the same state machine, and to stay consistent they must apply the same commands in the same order. Raft reduces that to three subproblems: electing a leader, replicating its log, and the safety rules that keep the log consistent.

Leader election

Raft elects one leader that handles all client requests. Time is divided into terms, and the leader sends periodic heartbeats; if a follower stops hearing them it becomes a candidate and starts an election for a new term. The key trick is randomized election timeouts: by waiting a random interval before campaigning, nodes rarely start elections at the same instant, so split votes are uncommon and a leader emerges quickly.

Log replication and majority quorums

Clients send commands to the leader, which appends them to its log and replicates them to followers. Once a majority have stored an entry, the leader marks it committed and applies it. Majority quorums are what make this fault-tolerant: a cluster of $N$ nodes tolerates $\lfloor (N-1)/2 \rfloor$ failures, and because any two majorities of $N$ nodes must overlap in at least one node, no two conflicting decisions can both be committed.

Safety: the two core rules

The subtle part is guaranteeing nothing committed is ever lost across leader changes. Two rules do it. The election restriction: a follower grants its vote only to a candidate whose log is at least as up to date as its own, so a node missing committed entries can never win. The commit rule: a leader only advances the commit index to an entry replicated on a majority and from its current term, which avoids a subtle case where an older entry could otherwise be overwritten. Real systems like etcd and Consul are built on exactly these rules.

def should_grant_vote(state, cand_term, cand_id, cand_last_idx, cand_last_term):
    if cand_term < state.current_term:           # reject stale terms
        return False
    if state.voted_for not in (None, cand_id):   # already voted this term
        return False
    last_term = state.log[-1].term if state.log else 0
    last_idx = len(state.log)
    # candidate log must be at least as up to date as ours (election restriction)
    return (cand_last_term > last_term or
            (cand_last_term == last_term and cand_last_idx >= last_idx))

def advance_commit_index(match_index, current_term, log, num_servers):
    """Leader commits entry n once a majority have replicated it, from this term."""
    for n in range(len(log), 0, -1):
        replicated = 1 + sum(m >= n for m in match_index.values())
        if replicated > num_servers // 2 and log[n - 1].term == current_term:
            return n
    return 0

Complexity (time and space)

Committing one entry costs a round of messages to the followers, $O(N)$ communication for $N$ nodes, and one network round-trip of latency to reach a majority. The log grows without bound unless snapshotted, which production implementations do to truncate old entries. Availability requires a majority to be reachable, so a 5-node cluster keeps working through 2 failures.

Worked example

A follower at term 2 grants a vote to an up-to-date candidate but rejects a stale-term one, and the leader commits the entry a majority has stored:

class Entry:                       # a log entry carries the term it was created in
    def __init__(self, term): self.term = term
class State:
    def __init__(self, current_term, voted_for, log):
        self.current_term, self.voted_for, self.log = current_term, voted_for, log

state = State(2, None, [Entry(1), Entry(2)])
print(should_grant_vote(state, 2, "A", 2, 2))               # True
print(should_grant_vote(state, 1, "A", 2, 2))               # False  (stale term)
print(advance_commit_index({"B": 2, "C": 2}, 2, state.log, 3))   # 2  (majority)

Follow-up questions

Why randomized election timeouts? They make it unlikely two followers campaign at once, so split votes are rare and a leader is elected quickly.
How many failures does Raft tolerate? Floor of (N-1)/2: a 3-node cluster survives 1 failure, a 5-node cluster survives 2, since progress needs a majority.
Why must any two majorities overlap? Two subsets of more than N/2 nodes must share at least one node, so conflicting decisions cannot both gain a majority.
What is the election restriction? A vote is granted only to a candidate whose log is at least as up to date, so a node missing committed entries cannot become leader.
Why only commit current-term entries by majority? Committing an older-term entry purely by replication count can be unsafe; requiring a current-term entry on a majority closes that gap.

References

Ongaro & Ousterhout, In Search of an Understandable Consensus Algorithm (Raft, 2014).
Lamport, Paxos Made Simple (2001).
Kleppmann, Designing Data-Intensive Applications (2017).