A stock exchange accepts buy and sell orders for listed securities, matches them into trades, publishes the resulting prices to the world, and hands the trades off for clearing and settlement. It is the rare interview question whose answer is not a web architecture at all, because the core of a modern equities venue is a single-threaded program walking an in-memory data structure in microseconds, surrounded by machinery whose entire job is to feed that program a perfectly ordered stream and to broadcast what it did. Interviewers ask the question to see whether a candidate can recognize when the standard playbook of caches, queues, and eventual consistency is the wrong playbook, and the recognition has to come early, since every reflexive move toward distributed defaults digs the answer deeper into the wrong shape.
Exchanges are special because their requirements are absolute rather than statistical, and each absolute closes a door that ordinary systems leave open. Matching must complete in microseconds, so anything that can pause, allocate, or wait gets banished from the hot path. Fairness rules are regulatory obligations rather than aspirations, so the order in which messages are processed must be defined, defensible, and physically enforced rather than left to network luck. The same inputs must always produce the same trades, so nondeterminism of any kind, whether in threads, clocks, or data structures, counts as a correctness defect rather than a performance quirk. And every message ever received must be reconstructible years later, because disputes and investigations arrive on regulatory timescales rather than engineering ones. The design below works from those constraints to the architecture they force, with the order book, the sequencer, and the market data fabric as the load-bearing pieces.
Scope and requirements
Functionally the venue handles limit orders, which are instructions to buy or sell at a stated price or better, and market orders, which execute immediately at the best price currently available, plus cancels and modifications. Cancels matter more than intuition suggests, since electronic market makers cancel and replace their quotes constantly as prices move, and cancellations outnumber actual trades many times over on every modern venue. Orders match under price-time priority, the fairness rule stating that a better price always trades first and that at the same price the order that arrived earlier trades first, which is the rule that makes queue position valuable and arrival order worth fighting over. The venue publishes market data covering both the trades and the changing shape of the book, because the prices an exchange discovers are a public good it is obliged to disseminate, and executed trades flow onward to clearing, the post-trade process that nets obligations and settles money against shares, which this design acknowledges and leaves to its batch world.
The non-functional requirements are the unusual part and deserve to be read as a set. Matching latency must sit in the tens of microseconds, because members build strategies around it and venues compete on it. Determinism must be total, in the sense that replaying the same input stream yields bit-identical trades, since determinism is the property that makes recovery, replication, and regulatory replay all possible with a single mechanism. The audit trail must capture every inbound message in its exact processed order, and availability failures must never manifest as lost or reordered orders, because a member can tolerate a venue that pauses for a moment but not one that quietly dropped a cancel while prices were moving against them. Throughput is high but, as the sizing shows, never the truly hard part.
Sizing the problem
A large national exchange processes billions of order messages per day, so take 10 billion as the round number and divide it out. The trading day is six and a half hours, or 23,400 seconds, so the average is about 427,000 messages per second, but exchange traffic concentrates brutally at the open, at the close, and around news, so the system must absorb peaks of ten times the average, on the order of 4 million messages per second across all symbols. The per-symbol picture is what actually shapes the design, because the flow spreads over thousands of listed symbols, and even the hottest symbol on its hottest day peaks at a few hundred thousand messages per second, which becomes the number the matching engine has to beat.
Now compare that demand with what one core supplies. A simple in-memory operation, an array index, a compare, a pointer chase that stays in cache, costs on the order of 100 nanoseconds or less, so one core executes several million such operations per second even with bookkeeping, and matching a single order takes only a handful of them. One thread can therefore match every message the busiest symbol produces with an order of magnitude to spare, and that conclusion is worth stating as the hinge of the whole interview. Memory tells the same story, since a million resting orders at roughly 100 bytes each is 100 MB, so the books for the entire market fit comfortably in one machine's RAM with room beside them for the day's bookkeeping. Those two facts together, that one thread suffices per symbol and that everything fits in memory, decide the whole architecture, and any distributed-systems instinct that survives contact with them should be treated with suspicion for the rest of the hour.
The order interface
Members connect through gateways speaking a session protocol such as FIX, the financial industry's standard order-entry protocol, or a terser binary equivalent where every microsecond is contested, but the logical message set is small and worth writing out. Every acknowledgment carries the sequence number the venue assigned, which is the thread that ties the entire system together, since members, engines, and auditors all refer to the same numbering when they discuss what happened and in what order:
NEW_ORDER { "client_order_id": "c-118", "symbol": "ACME", "side": "buy",
"type": "limit", "qty": 300, "price": 10.02 }
→ ACK { "order_id": 884213, "seq": 90417723 }
CANCEL { "order_id": 884213 }
→ CANCELED { "order_id": 884213, "seq": 90417951 }
EXECUTION { "order_id": 884213, "fill_qty": 200, "price": 10.02,
"trade_id": 7741180, "seq": 90418002 } // pushed to both counterparties
The high-level architecture
Client gateways terminate member sessions and run pre-trade risk checks before anything reaches the core. Buying power, which is the funds or credit a member has available to cover what it is ordering, gets checked against the order's exposure, price collars reject orders absurdly far from the prevailing market before a fat-fingered decimal can print a nonsense trade, and per-member rate limits keep one misfiring algorithm from monopolizing the venue's front door. Everything that passes flows into the sequencer, and from that point the sequenced stream fans out to the matching engines, the audit log, and onward to market data, so every downstream component is a consumer of one authoritative stream rather than a participant in a negotiation about ordering.
Risk-checked orders pass through the sequencer, which stamps a single global order on every message; the engines consume the sequenced stream by symbol range, the audit log records it verbatim, and every book change flows to the market data publisher. The dashed replay path is how a recovering engine catches up.
The order book, worked through
The order book is the central data structure of the venue. For each symbol it keeps two sides of resting limit orders, the bids waiting to buy and the asks waiting to sell, and each side is a sorted collection of price levels where every level holds a first-in-first-out queue of orders. That structure is price-time priority made physical, since the sort across levels delivers price priority and the queue within a level delivers time priority, so the fairness rule is enforced by the shape of the data rather than by checks sprinkled through the code. Near the touch, which is the pair of best prices where all the action concentrates, flat arrays indexed by price tick beat fancier balanced trees, because the handful of levels being hammered stays resident in the core's cache while a tree's pointer chasing would wander across memory and stall the pipeline:
# one matching thread per symbol; no locks anywhere in this code
def on_market_buy(book, qty):
while qty > 0 and book.asks:
level = book.asks.best() # lowest ask price first
resting = level.queue[0] # earliest order at that price first
fill = min(qty, resting.qty)
emit_trade(price=level.price, qty=fill, maker=resting.order_id)
resting.qty -= fill; qty -= fill
if resting.qty == 0: level.queue.popleft()
if not level.queue: book.asks.drop_best()
A worked match makes the mechanics concrete. Suppose ACME's ask side rests at 200 shares at 10.02, 250 at 10.03, and 500 at 10.04, with bids at 10.01 and 10.00 below, so the spread, the gap between the best bid and the best ask, is one cent. A market buy for 300 shares walks the ladder from the cheapest ask upward. It takes all 200 at 10.02, clearing that level entirely, then takes 100 of the 250 resting at 10.03 and stops with its quantity exhausted. Two trades print, 200 at 10.02 and 100 at 10.03, the partially filled resting order keeps its queue position with 150 shares remaining, because a partial fill never costs an order its place in line, and the book's best ask is now 10.03, so the spread has widened from one cent to two. A limit buy at 10.03 for the same 300 shares would have produced identical fills, since both levels it consumed sat at or better than its limit, while a limit buy at 10.02 would have taken the first 200 and then rested at 10.02 as the new best bid, narrowing the very market it just traded with. Every one of those outcomes is fully determined by the book and the incoming order alone, which is the determinism requirement showing up at the smallest scale.
A market buy for 300 walks the ask side in price-time order, taking all 200 at 10.02 and clearing that level in step 1, then 100 of the 250 at 10.03 in step 2, and printing two trades in step 3. The book is left with 150 shares at a new best ask of 10.03.
One thread per symbol, and the sequencer that makes it possible
The design move interviewers wait for concerns the shape of the matching engine, which runs single-threaded per symbol and entirely in memory. The sizing already showed one core has throughput to spare, so parallelism would buy nothing, and it would cost everything that matters here. Locks and cross-core coordination would dominate a microsecond budget, since even an uncontended lock costs tens of nanoseconds and a contended one costs microseconds, and worse, parallel matching introduces scheduling nondeterminism, where the same two orders might match differently depending on which thread won a race, destroying the requirement that a replay produces identical trades. Sequential processing of an already-ordered input stream eliminates locks, keeps the book hot in one core's cache, and makes the engine a pure function from input sequence to trades. The lineage is the LMAX architecture, which demonstrated millions of transactions per second on a single thread fed through pre-allocated ring buffers, and the discipline that comes with the style is strict, allowing no garbage-collection pauses on the hot path, no system calls, and no wall-clock reads inside matching, with even time arriving as sequenced tick messages so that a replay sees byte-identical inputs.
The component that makes already-ordered true is the sequencer. It receives every risk-checked inbound message and stamps it with a monotonically increasing sequence number, and from that point onward the sequence is the definition of what happened first, regardless of network jitter upstream, so two members' arguments about who arrived earlier are settled by comparing numbers rather than by forensics. This single decision turns the cluster into a replicated state machine, the arrangement where deterministic replicas consuming the same ordered commands necessarily hold identical state. Any replica of the matching engine, live or warm standby, that has consumed sequence number N holds exactly the same book as every other replica at N, so failover is less a recovery procedure than a promotion, with the standby already correct by construction. If a replica must start cold, it loads the latest snapshot and replays the sequenced log forward, arriving at provably the same state including the same trades. The sequenced log doubles as the complete audit trail the regulator requires, one stream serving consensus, recovery, and compliance at once, and the sequencer itself stays almost logic-free, stamping and forwarding and nothing else, precisely so it can be made redundant with a hot standby and fast failover without dragging any business state along.
Market data and fairness
Every change to the book, whether an add, a cancel, or an execution, becomes an incremental market data message, and periodic snapshots of the full book let late joiners and gap-recovering consumers resynchronize without replaying the entire day. Inside the facility the feed travels over UDP multicast, a network mode where the switch delivers one transmitted packet to all subscribers simultaneously, so no recipient hears the news first because the fabric itself does not play favorites, and fairness arrives by physics rather than by policy. Lost packets are detected through gaps in the feed's own sequence numbers and repaired from a retransmission service or the next snapshot, which is the standard answer to UDP's unreliability and far cheaper than maintaining per-receiver retransmission state at these fan-outs. The public internet receives conflated feeds, which coalesce rapid updates and publish the latest state at intervals, trading completeness for bandwidth that retail consumers can actually afford, and the conflation is published policy rather than quiet degradation, because a venue's credibility rests on members knowing exactly what each feed promises.
The same fairness logic explains why exchanges sell rack space. Colocation puts member servers inside the exchange's own data center with deliberately equalized cable lengths, so that no participant buys an advantage the venue did not publish a price for, and latency from the matching engine to every colocated rack is engineered to be identical within nanoseconds. Selling proximity openly while equalizing it within the room sounds paradoxical until it is read as regulation by architecture, since the alternative is participants buying buildings next door and racing each other over distances the venue can neither see nor police. Control-plane actions ride the same machinery as orders rather than a separate path. A circuit breaker, the rule that halts trading in a symbol or the whole market when prices move too far too fast, is implemented as a halt command flowing through the sequencer, so every replica and every feed sees the halt at the same sequence number, no order can match ambiguously around it, and the state stays deterministic even while the market is being stopped.
A latency budget from wire to trade
Tens of microseconds is a strange budget by web standards, and itemizing it shows where the nanoseconds actually go. A packet arrives at the gateway's network card, and kernel-bypass networking, a technique where the application reads packets directly from the card's memory instead of waiting on the operating system's network stack, delivers it to user space in about a microsecond where the kernel path would cost ten or more. Risk checks are arithmetic against member state already sitting in cache, another microsecond or so. The hop to the sequencer and the stamped hop onward to the engine each cross a switch, and a cut-through switch, one that begins forwarding a frame before fully receiving it, costs one to two microseconds per traversal. Matching itself, the part everyone fixates on, runs in tens to hundreds of nanoseconds of array and queue manipulation, nearly a rounding error inside its own budget. The acknowledgment then retraces the path, and the wire-to-wire total lands somewhere between five and twenty microseconds, with the matching thread responsible for almost none of it, which is why serious venues obsess over network gear as much as code.
That budget also explains the engineering taboos. A single system call costs about a microsecond, a page fault or a garbage-collection pause costs milliseconds, which is a thousand budgets gone at once, and a lock contended at the wrong moment costs an unbounded amount, so the hot path pre-allocates every object it will ever need, pins its threads to dedicated cores, and leaves nothing to the operating system's discretion at runtime. Jitter matters more than the average, because members co-design their strategies around the venue's published latency distribution, and a venue whose 99th percentile, the time that only the slowest one percent of messages exceed, wanders by tens of microseconds is a venue whose members are effectively trading against noise. This is why exchanges publish latency histograms and treat a drifting tail as an incident even on days when the median sits exactly where it always does.
Scaling, failures, and operations
Scaling proceeds by symbol partitioning. Symbols are independent of one another, since an order for ACME can never match an order for any other listing, so the market shards cleanly across engine instances, the busiest symbols earn the most isolated placement, and only control messages fan out everywhere. Gateways scale horizontally with member count, and the market data tier scales by adding multicast groups and feed handlers. The sequencer is the one component that cannot be parallelized within a partition, because parallel stamping of a single total order is a contradiction in terms, which is why it stays minimal, runs on the fastest path available, and is measured in nanoseconds. Venues that outgrow a single global sequencer give each engine partition its own, accepting that cross-symbol ordering becomes only approximate, which trading semantics permit because no matching rule ever ties the fate of one symbol's order to another symbol's arrival time.
Failure handling falls out of the replicated state machine rather than being bolted on afterward. An engine crash promotes a standby that is already at the same sequence number, so the gap, if any exists at all, is replay rather than reconstruction, and members experience elevated latency for some milliseconds rather than lost orders. A sequencer failure fails over to its standby under one strict rule, namely that no message is delivered to engines until its sequence stamp is durable on the standby, because a stamp that could be forgotten is no ordering at all, and the rule's cost is a replication round trip the budget already carries. Gateways shed sessions to their peers, and members reconnect and resynchronize from their last acknowledged sequence number, which the protocol was designed around from the start. After the close the system changes character entirely. Executed trades feed the clearinghouse, obligations are netted so each member settles only its net position per symbol, and settlement completes at T+1, meaning shares and cash legally change hands one business day after the trade. The microsecond world and the batch world meet in a file, and both sides reconcile it line by line, because an exchange's final invariant is that nobody, whether member, clearinghouse, or regulator, ever disagrees about what traded.
Follow-up questions
- Why is the matching engine single-threaded? One core already exceeds the busiest symbol's peak message rate by an order of magnitude, so parallelism would add lock overhead and scheduling nondeterminism while solving no problem that exists. It would break both the latency budget and the requirement that identical inputs produce identical trades, and replication and recovery both depend on that second property.
- How do replicas stay exactly identical? They consume the same sequenced input through deterministic code, with no wall-clock reads, no randomness, and no iteration over unordered structures, and time itself arrives as sequenced tick messages. A replica at sequence N is therefore bit-identical to any other replica at N, which is the property that failover and regulatory replay both lean on.
- What happens if an engine dies mid-burst? A standby consuming the same stream is already in the same state and is simply promoted, while a cold start loads a snapshot and replays the sequenced log forward. Nothing is lost in either case, because the log rather than the engine's memory is the source of truth, and the engine is just a cached computation over it.
- How is fairness physically enforced? The sequencer defines arrival order at a single point so ordering disputes reduce to comparing numbers, colocation equalizes cable lengths so no member sits physically closer than another, and multicast delivers market data to every subscriber in the same instant. Each mechanism removes one place where favoritism could hide.
- What data structure holds the book? Each side keeps price levels in a structure sorted by price with a first-in-first-out queue per level, which encodes price-time priority directly in the shape of the data. Near the touch, flat arrays indexed by price tick outperform trees, because the few hot levels stay in cache while a tree's pointers scatter across memory.
- Where does clearing fit? It sits downstream in the batch world, where trades flow to the clearinghouse, obligations are netted per member, and settlement completes at T+1. The exchange's responsibility ends at an immutable, sequenced record of what traded, and the line-by-line reconciliation of that record is the contract between the two worlds.
References
- Fowler, The LMAX Architecture (2011), on single-threaded matching and event sourcing at millions of transactions per second.
- LMAX Exchange, Disruptor, the inter-thread messaging library behind that architecture.
- Xu and Lam, System Design Interview, Volume 2 (2022), chapter on stock exchanges.
- Kleppmann, Designing Data-Intensive Applications (2017), on total order broadcast and replicated state machines.