Design a payment system · Stanley Jacob

A payment system for an e-commerce platform moves money from a buyer through the platform to sellers. The checkout page collects a card, an external processor charges it, the platform keeps its fee, and the seller eventually receives the rest. We are explicitly not building a card network or a bank, since both already exist and are regulated accordingly. We are building the backend that orchestrates an external processor, records what happened, and never loses track of a cent along the way. Interviewers like the question because the data is money and the failure modes show up on real bank statements. That inverts the usual instinct to optimize for latency and availability, and it exposes how a candidate actually weighs correctness against everything else.

That inversion is the defining property of the whole design and is worth dwelling on for a moment. A checkout that takes three seconds is fine, because buyers wait for money operations in a way they never wait for a page load, and a payment service that pauses for a minute during an incident is survivable, because carts can be retried once it returns. A double charge is different in kind rather than degree, since it becomes a support ticket, a chargeback with its fee, and a dent in trust that no latency win ever buys back, and a missing payment is worse still because the platform shipped goods it was never paid for. Correctness therefore beats latency here, and it even beats availability, which is a ranking very few other systems share. Everything that follows, the idempotency keys, the append-only ledger, the strict state machine, and the nightly reconciliation, exists to make the system's view of money provably match the bank's view of money, and each section below names what goes wrong without it.

Scope and requirements

The functional core is the pay-in flow, where checkout creates a payment, the platform charges the buyer through a payment service provider, and the amount splits between the seller's balance and the platform's fee. Around that core sit refunds, payouts that move accumulated seller balances out to their bank accounts, ingestion of asynchronous notifications from the processor, and reconciliation against the processor's records. Fraud scoring and tax calculation are real systems that no production platform skips, and I would acknowledge them and exclude them, because each is an interview in itself and neither changes the shape of the money path being designed here.

A payment service provider, or PSP, is a company such as Stripe or Adyen that accepts card payments on a merchant's behalf, and choosing to use one is itself a scope decision worth defending out loud. Raw card numbers never touch the platform, because the PSP returns a token, an opaque reference that stands in for the card and is useless to a thief without the PSP's cooperation. That single property shrinks the burden of PCI DSS, the card industry's security standard whose full audit scope is an enormous ongoing cost, from the entire platform down to one integration boundary. Building direct card-network connectivity instead would save a processing fee measured in fractions of a percent and would cost a compliance program, years of integration work, and the company's focus, which is why almost nobody outside the payments industry does it. The non-functional requirements then read in strict priority order. Money must move effectively once, every movement must remain auditable years later because disputes and regulators arrive on a timescale of years, and the system must tolerate a slow or unreachable PSP without ever guessing what happened to an in-flight charge.

Sizing the problem

Suppose the platform processes 1 million orders per day, the scale of a healthy mid-sized marketplace. A day holds 86,400 seconds, so the average is 1,000,000 divided by 86,400, which comes to about 12 payments per second, and a sale-day peak of thirty times the average is still only around 360 per second. Each payment makes a handful of calls to the PSP that take 200 milliseconds to 2 seconds apiece, so by Little's law, the rule that the number of requests in flight equals the arrival rate multiplied by how long each one lingers, a few hundred requests may be in progress at any moment, which a small stateless service tier absorbs without drama. The ledger grows faster than the traffic suggests, because money paths multiply rows. At roughly five ledger rows per payment, covering the charge legs, the fee split, and the PSP's own cut, the system writes 5 million rows per day, and at 200 bytes per row that is about 1 GB per day, or 365 GB per year, which calls for monthly partitions and an archive tier but strains nothing modern. The arithmetic makes the real point on the candidate's behalf, namely that queries per second are never the story in a payment system, and the engineering budget goes to correctness machinery instead. Anyone proposing a horizontally sharded ledger on day one has solved a problem the numbers say does not exist.
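The arithmetic above is easy to re-run under different assumptions. The following is a minimal Python sketch that uses only the figures quoted in this section. The constant names are illustrative, and the one-second time in system is an assumed midpoint of the quoted PSP latency range.

```python
# Back-of-envelope sizing. Every input is a figure quoted in the text;
# the constant names are illustrative.
ORDERS_PER_DAY = 1_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 30
ROWS_PER_PAYMENT = 5
BYTES_PER_ROW = 200


def in_flight(arrival_rate: float, seconds_in_system: float) -> float:
    """Little's law: requests in flight = arrival rate * time in system."""
    return arrival_rate * seconds_in_system


avg_per_second = ORDERS_PER_DAY / SECONDS_PER_DAY    # about 11.6, rounded to 12 in the text
peak_per_second = avg_per_second * PEAK_MULTIPLIER   # about 347; the text multiplies the rounded 12
peak_in_flight = in_flight(peak_per_second, 1.0)     # assumes roughly 1 s per payment

rows_per_day = ORDERS_PER_DAY * ROWS_PER_PAYMENT     # 5,000,000
gb_per_day = rows_per_day * BYTES_PER_ROW / 1e9      # 1.0
gb_per_year = gb_per_day * 365                       # 365.0

print(f"average {avg_per_second:.1f}/s, peak {peak_per_second:.0f}/s")
print(f"in flight at peak: {peak_in_flight:.0f}")
print(f"ledger growth: {gb_per_day:.1f} GB/day, {gb_per_year:.0f} GB/year")
```

Changing any single input by an order of magnitude still leaves the totals within reach of one well-provisioned database, which is the point the section makes.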

The API

Amounts travel as integers in minor units, cents for dollars, because binary floating point cannot represent most decimal fractions exactly, and a representation that computes 0.1 plus 0.2 and produces 0.30000000000000004 has no business near money. Splits are expressed explicitly in the request so the ledger can record where every cent of the total went rather than inferring it later, and every mutating endpoint accepts an idempotency key, which the deep dive below defines and defends:

POST /api/v1/payments
Idempotency-Key: 9c41e8d2-7b13-4f60-b2aa-51f0c3e8d977
{ "order_id": "ord_312", "amount": 10000, "currency": "usd",
  "payment_method": "pm_card_visa_tok",
  "split": [ { "account": "seller_881",   "amount": 8500 },
             { "account": "platform_fees", "amount": 1500 } ] }
→ 201 { "payment_id": "pay_91", "status": "pending" }

GET  /api/v1/payments/pay_91
→ 200 { "payment_id": "pay_91", "status": "succeeded", ... }

POST /api/v1/payments/pay_91/refunds
Idempotency-Key: 4ab7f1c9-08d2-44ce-9e31-77b4a2d6f188
{ "amount": 10000 }
→ 201 { "refund_id": "ref_17", "status": "pending" }
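Two rules from this section can be checked mechanically: amounts are integers in minor units, and the split must account for the whole total. The sketch below is one way to express both in Python. The helpers to_minor_units and validate_split are hypothetical names, not part of the API above, and the default of two decimal places is an assumption that holds for dollars but not for every currency.

```python
from decimal import Decimal

# Binary floating point drifts on ordinary decimal fractions.
assert 0.1 + 0.2 != 0.3


def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Convert a decimal string such as "100.00" to integer minor units.

    exponent=2 is an assumption of this sketch; some currencies use zero
    or three decimal places.
    """
    scaled = Decimal(amount) * (10 ** exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has more than {exponent} decimal places")
    return int(scaled)


def validate_split(total: int, split: list[dict]) -> None:
    """Reject a request whose split legs do not account for every cent."""
    if any(leg["amount"] <= 0 for leg in split):
        raise ValueError("split amounts must be positive")
    allocated = sum(leg["amount"] for leg in split)
    if allocated != total:
        raise ValueError(f"split allocates {allocated}, expected {total}")


# The request from the example above: 8500 + 1500 accounts for all 10000.
validate_split(to_minor_units("100.00"), [
    {"account": "seller_881", "amount": 8500},
    {"account": "platform_fees", "amount": 1500},
])
```

Rejecting an incomplete split at the API boundary means the ledger never has to guess where a missing cent was meant to go.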

The ledger data model

The record of money is a double-entry ledger, an idea borrowed from five centuries of accounting practice precisely because it has survived five centuries of people trying to lose money creatively. Every movement of money is written as at least two entries, a debit in one account and a credit in another, and the entries of any single transfer must sum to zero when debits are taken as positive and credits as negative. Money is therefore never created or destroyed in the books but only moved between accounts, and an imbalanced write is rejected at the door by the writing code itself, which converts a whole family of application defects into loud immediate failures instead of quiet drift discovered at month close. The table is append-only, meaning no row is ever updated or deleted, and a correction is a new reversing entry rather than an edit, so the full history of any disputed payment reads straight off the table in the order it happened. The alternative worth naming is a mutable balance column per account, which is smaller and faster to read but destroys history every time it is written, and a balance that cannot explain how it came to be is a liability in any audit and a dead end in any dispute.

CREATE TABLE ledger_entries (
  entry_id     BIGSERIAL PRIMARY KEY,
  transfer_id  UUID    NOT NULL,        -- groups the legs of one movement
  account      TEXT    NOT NULL,
  debit        BIGINT  NOT NULL DEFAULT 0,   -- minor units
  credit       BIGINT  NOT NULL DEFAULT 0,
  currency     CHAR(3) NOT NULL,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- invariant, enforced at write time:
--   per transfer_id, SUM(debit) = SUM(credit)
-- append only: the application never issues UPDATE or DELETE here

A worked example makes the shape concrete. A buyer pays 100 dollars, the platform's quoted fee is 15 dollars, and the PSP takes 3 dollars of that fee for processing. The transfer writes four rows, a debit of buyer_clearing for 10000, a credit of seller_payable for 8500, a credit of platform_revenue for 1200, and a credit of psp_fees for 300. Debits total 10000 and credits total 8500 plus 1200 plus 300, which is also 10000, so the books balance, and any account's balance at any moment is simply the sum of its entries up to that moment. Balances are derived rather than stored as the truth, although a cached balance is a perfectly fine optimization so long as everyone remembers which number is the authority on the day the two disagree.
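The zero-sum rule and the derived balance can both be shown in a few lines. This is a minimal in-memory sketch in Python, standing in for the ledger_entries table rather than reproducing any production implementation. The Entry and Ledger names are illustrative, and the sign convention follows the text, with debits counted as positive and credits as negative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    transfer_id: str
    account: str
    debit: int = 0    # minor units
    credit: int = 0   # minor units


class Ledger:
    """Append-only, in-memory stand-in for the ledger_entries table."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def post(self, legs: list[Entry]) -> None:
        """Append all legs of one transfer, or none if they do not balance."""
        if len({leg.transfer_id for leg in legs}) != 1:
            raise ValueError("all legs must share one transfer_id")
        if any(leg.debit < 0 or leg.credit < 0 for leg in legs):
            raise ValueError("amounts must be non-negative")
        debits = sum(leg.debit for leg in legs)
        credits = sum(leg.credit for leg in legs)
        if debits != credits:
            raise ValueError(f"unbalanced transfer: {debits} != {credits}")
        self._entries.extend(legs)

    def balance(self, account: str) -> int:
        """Derived balance: debits minus credits over the account's history."""
        return sum(e.debit - e.credit for e in self._entries if e.account == account)


# The worked example: a 100 dollar payment split three ways.
ledger = Ledger()
ledger.post([
    Entry("t1", "buyer_clearing", debit=10000),
    Entry("t1", "seller_payable", credit=8500),
    Entry("t1", "platform_revenue", credit=1200),
    Entry("t1", "psp_fees", credit=300),
])
```

Under this convention seller_payable shows a balance of -8500, a credit balance representing money owed to the seller, and the balances of all four accounts sum to zero.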

The high-level architecture

The payment service is a thin orchestrator that owns the payment state machine, calls the PSP, and writes the ledger, and the thinness is deliberate, because every additional line of logic in the money path is another line that can disagree with the books. Webhooks from the PSP arrive on a separate handler, so a flood of notifications after a PSP incident can never crowd out live checkout traffic, and a nightly reconciliation job compares the ledger against the PSP's settlement files, routing any disagreement into operations queues sorted by which kind of wrongness each one represents. The buyer sees none of this machinery, just a spinner of a second or two followed by a confirmation page, and the gap between that simple surface and the bookkeeping underneath is exactly where the design earns its keep.

The payment service drives the PSP and records every movement in the ledger; webhooks update payment state asynchronously, and reconciliation compares the ledger against the PSP's settlement files, routing mismatches to ops queues. Dashed arrows are asynchronous.

Idempotency end to end

An operation is idempotent when performing it twice has the same effect as performing it once, and in payments this is the core mechanism rather than a nicety, because retries are how distributed systems cope with uncertainty, and a payment retried without protection is a payment that might happen twice. The implementation rests on a client-supplied key. The caller generates a unique token per logical operation, the server stores the key alongside the outcome in the same database transaction as the side effect itself, and any later request bearing the same key gets the stored response replayed instead of a fresh execution. Storing the key and the effect in one transaction is the load-bearing detail, since a key recorded before the effect can claim work that never happened, while a key recorded after the effect can miss work that did, and either gap reopens the door the mechanism was meant to close. A hash of the request body is stored as well, so the same key arriving with a different payload is rejected as a client error rather than silently honored, which catches a real class of client-side mistakes where a key is accidentally reused across two different orders.

The retry storm scenario shows the mechanism earning its keep. The payment service calls the PSP to charge 100 dollars and the connection times out after ten seconds. At that moment the truth is genuinely unknown, because the charge may have failed to arrive, or it may have succeeded just after the service stopped waiting, and no amount of local cleverness can distinguish the two cases from the caller's side of the wire. A naive retry risks a second charge, while never retrying risks an order the buyer paid for being treated as unpaid, and both outcomes surface as angry support tickets with the platform plainly at fault. With idempotency key K attached to both attempts, the PSP looks up K on the retry, finds the charge it already created, and returns that original result, so the retry is safe no matter which side of the timeout the first attempt landed on. The platform offers the identical contract to its own clients, which is why the API above takes a key, and the visible effect is that a mobile checkout retrying over a flaky connection, or a buyer tapping the button twice in frustration, produces exactly one payment and one consistent answer.

Steps 1 and 2 carry the charge with idempotency key K, and in step 3 the PSP creates the charge but the response is lost to a timeout. Step 4 retries with the same key, step 5 shows the PSP recognizing K and returning the original charge instead of creating a second one, and step 6 delivers one definitive result to the client while exactly one charge exists.
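The server side of the contract, where the key and the effect are stored in one transaction and a body hash guards against key reuse, can be sketched with an in-memory SQLite database. The table and column names here are assumptions made for illustration, and the sketch ignores concurrent requests racing on the same key, which a real service would handle with the primary-key constraint or row locking.

```python
import hashlib
import json
import sqlite3


class IdempotencyConflict(Exception):
    """The key was already used with a different request body."""


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (
        payment_id INTEGER PRIMARY KEY AUTOINCREMENT,
        amount     INTEGER NOT NULL,
        status     TEXT    NOT NULL
    );
    CREATE TABLE idempotency_keys (
        idem_key     TEXT PRIMARY KEY,
        request_hash TEXT NOT NULL,
        response     TEXT NOT NULL
    );
""")


def create_payment(key: str, body: dict) -> dict:
    request_hash = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    with conn:  # one transaction: commit on success, roll back on any exception
        row = conn.execute(
            "SELECT request_hash, response FROM idempotency_keys WHERE idem_key = ?",
            (key,),
        ).fetchone()
        if row is not None:
            if row[0] != request_hash:
                raise IdempotencyConflict(key)   # same key, different payload
            return json.loads(row[1])            # replay the stored outcome
        cur = conn.execute(
            "INSERT INTO payments (amount, status) VALUES (?, 'pending')",
            (body["amount"],),
        )
        response = {"payment_id": f"pay_{cur.lastrowid}", "status": "pending"}
        conn.execute(
            "INSERT INTO idempotency_keys (idem_key, request_hash, response) "
            "VALUES (?, ?, ?)",
            (key, request_hash, json.dumps(response)),
        )
        return response
```

Calling create_payment twice with the same key and body returns the same response and leaves exactly one row in payments, while the same key with a different body raises IdempotencyConflict.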

The state machine and webhooks

Every payment lives in exactly one state drawn from a small set, beginning at created, moving to pending once submitted to the PSP, resolving to succeeded or failed, and from succeeded possibly continuing to refunded. The discipline that makes the machine trustworthy is that transitions fire only on verified events and never on assumption. A timeout does not mean failed but unknown, and unknown is resolved by a webhook or by querying the PSP rather than by guessing, because a guessed failure that was actually a success ships nothing to a paying customer, while a guessed success that was actually a failure ships goods for free, and both mistakes are silent until reconciliation or a complaint finds them. The PSP signs each webhook payload with a shared secret, and the handler verifies the signature before believing a word of the contents, since an unauthenticated endpoint that flips payments to succeeded is an open invitation to anyone who discovers the URL.
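Signature verification with a shared secret is commonly built on an HMAC over the raw request body. The sketch below shows that generic technique in Python. It does not reproduce any particular PSP's scheme, since real providers define their own header formats and usually also sign a timestamp to limit replay, and the secret and payload shown are made-up values.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> str:
    """HMAC-SHA256 of the raw request body, hex encoded."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, received_hex: str) -> bool:
    """Compare in constant time so the check leaks nothing through timing."""
    expected = sign(secret, payload)
    return hmac.compare_digest(expected.encode(), received_hex.encode())


secret = b"whsec_example"                      # hypothetical shared secret
body = b'{"id": "evt_1", "target": "succeeded"}'
signature = sign(secret, body)

print(verify_signature(secret, body, signature))          # True: genuine payload
print(verify_signature(secret, body + b" ", signature))   # False: tampered payload
```

The check must run over the exact bytes received, before any JSON parsing, because re-serializing the body can change whitespace or key order and invalidate a genuine signature.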

A webhook is a callback over HTTP, where instead of the platform polling for news, the PSP sends a request to the platform when something happens, such as a charge succeeding or a dispute opening. Webhooks arrive with two awkward properties that the handler must absorb rather than wish away. They are delivered at least once, so duplicates are routine rather than exceptional, and they are not ordered, so a succeeded event can land before the pending event it logically follows, especially while the PSP's own retry queues drain after an incident on either side. The handler therefore deduplicates on the PSP's event ID and consults the state machine before applying anything, refusing transitions that are illegal from the current state rather than letting a stale event drag a succeeded payment backward. A refused event is parked for review instead of dropped, because an event that fits nowhere is evidence of a gap somewhere else in the pipeline, and throwing the evidence away helps nobody.

LEGAL = {("created", "pending"), ("created", "failed"),
         ("pending", "succeeded"), ("pending", "failed"),
         ("succeeded", "refunded")}

def on_webhook(event):
    if not verify_signature(event):
        return reject()                       # unauthenticated: believe nothing
    if seen_before(event.id):
        return ack()                          # duplicate delivery
    payment = load(event.payment_id)
    if (payment.state, event.target) not in LEGAL:
        return park_for_review(event)         # out of order or impossible
    apply_transition(payment, event)          # ledger entries + state + event ID, one transaction
    return ack()                              # tell the PSP to stop redelivering
A latency budget for checkout

Walking the checkout call through its budget shows where the seconds live and what they justify. Validating the request and recording the payment intent with its idempotency key is one indexed insert costing a few milliseconds, and writing the pending ledger entries costs a few more in the same transaction. The PSP call then takes everything else, typically 300 milliseconds to 2 seconds for a card charge, because behind the PSP sit the card network and the issuing bank, the buyer's own bank, running whatever risk checks the issuer feels like running that day. A checkout that completes in under three seconds end to end is therefore normal and acceptable, and the budget states plainly that shaving milliseconds off the database work is pointless while two entire seconds sit at the issuer beyond anyone's reach.

The same budget dictates the timeout policy. The service waits perhaps ten seconds on the PSP before giving up, which is long enough that slow issuers usually answer and short enough that checkout threads do not pile up behind a dead network path. On a timeout the payment parks in pending, the buyer sees a message saying the payment is being confirmed rather than a false failure, and resolution arrives by webhook or by an active status query a few seconds later. The one experience the design must never create is the false negative, where the page declares the payment failed, the buyer tries again in a fresh session with a fresh cart and therefore a fresh idempotency key, and the original charge then succeeds quietly in the background, producing two real charges that no key could ever have linked. Telling the buyer the truth about uncertainty, with a pending screen instead of a guessed failure, is the cheap mechanism that prevents the expensive outcome.
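The rule that a timeout maps to pending and never to failed fits in one small function. The sketch below is a Python illustration in which submit_charge and the shape of the PSP result are assumptions. The psp_charge argument stands for any callable that either returns an approval flag or raises TimeoutError.

```python
def submit_charge(psp_charge, payment: dict) -> dict:
    """Call the PSP once and map the outcome to a state without guessing.

    psp_charge is any callable taking (idempotency_key, amount) that returns
    {"approved": bool} or raises TimeoutError. Both shapes are assumptions
    of this sketch.
    """
    try:
        result = psp_charge(payment["idempotency_key"], payment["amount"])
    except TimeoutError:
        payment["state"] = "pending"   # unknown, not failed
        return {"status": "pending",
                "message": "Your payment is being confirmed."}
    payment["state"] = "succeeded" if result["approved"] else "failed"
    return {"status": payment["state"]}


def slow_issuer(key, amount):
    """Stand-in for a PSP call that exceeds the timeout."""
    raise TimeoutError("no answer within the timeout")


payment = {"idempotency_key": "K", "amount": 10000, "state": "created"}
reply = submit_charge(slow_issuer, payment)
```

Only a definite answer from the PSP produces succeeded or failed, and the pending reply is what the checkout page renders while a webhook or status query settles the question.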

Reconciliation

Even with idempotency and a careful state machine, two independent systems recording the same money will drift apart, whether through a webhook provider outage, a dropped message, a fee model change, or a currency rounding rule applied differently on each side. Reconciliation is the process that finds the drift, and the party that notices a discrepancy first resolves it cheaply, while the party that hears about it from a customer resolves it expensively and publicly. Each night the PSP publishes a settlement file, a batch listing of every transaction it processed and every fund movement it made for the account, and a job compares that file line by line against the internal ledger, matching on the PSP's charge identifier. The nightly cadence is alignment rather than laziness, since the PSP's own books close in daily batches, and comparing against an unclosed day produces noise instead of signal.

Discrepancies fall into three classes, each with its own queue and its own runbook. A payment recorded internally but missing from the file usually means the charge is still settling, so it ages a day or two before anyone investigates, and if it persists past that window the service trusted something it should not have, which points the investigation at the ingestion path. A line in the file with no internal record is the severe direction, because money moved that the platform never accounted for, typically a lost webhook compounded by a gap in polling, and it triggers both a repair entry in the ledger and a review of how the event went missing. An amount mismatch is most often fees, partial captures, or foreign exchange, and resolves to a correcting entry once classified. At 1 million payments a day, a discrepancy rate of just 0.01 percent still produces a hundred cases daily, which is exactly why the classes are queued, triaged by severity, and mostly healed by rule rather than handled as individual surprises. An operator's morning starts with the queue counts, and a good day looks like a few dozen aged-out settling cases closing themselves before coffee.
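The three classes map directly onto a set comparison keyed by the PSP's charge identifier. This is a minimal sketch in Python, assuming both sides have already been reduced to a mapping from charge ID to amount in minor units. The queue names and the sample charge IDs are illustrative, and the aging window for settling charges is left out.

```python
def reconcile(ledger: dict[str, int], settlement: dict[str, int]) -> dict[str, list[str]]:
    """Classify discrepancies between the ledger and a settlement file.

    Both arguments map the PSP's charge ID to an amount in minor units.
    """
    queues: dict[str, list[str]] = {
        "missing_from_file": [],     # recorded internally, not settled yet
        "missing_from_ledger": [],   # severe: money moved without a record
        "amount_mismatch": [],       # fees, partial captures, foreign exchange
    }
    for charge_id in sorted(ledger.keys() | settlement.keys()):
        if charge_id not in settlement:
            queues["missing_from_file"].append(charge_id)
        elif charge_id not in ledger:
            queues["missing_from_ledger"].append(charge_id)
        elif ledger[charge_id] != settlement[charge_id]:
            queues["amount_mismatch"].append(charge_id)
    return queues


internal = {"ch_1": 10000, "ch_2": 2500, "ch_3": 4000}
psp_file = {"ch_1": 10000, "ch_3": 3900, "ch_4": 700}
queues = reconcile(internal, psp_file)
```

In this sample ch_2 lands in the settling queue, ch_4 in the severe queue, and ch_3 in the mismatch queue, while ch_1 matches and appears nowhere.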

Scaling, failures, and operations

The service tier is stateless and scales horizontally to numbers far beyond what payments require, and the ledger partitions by month with old partitions archived to cheaper storage, so the operational interest concentrates in failure handling rather than capacity. Retries use exponential backoff, where the wait doubles after each failure, one second, then two, then four, with random jitter added so that a thousand clients whose requests failed together do not retry together and arrive as a synchronized thundering herd. Retries are issued only from safe states, because re-submitting a charge that carries its idempotency key is safe by construction, while re-attempting an operation whose previous outcome is unknown and unkeyed is precisely how double charges are born, and the distance between those two sentences is the distance between a resilient system and a dangerous one.
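Backoff with jitter and the rule about retrying only keyed operations can be combined in one helper. The following is a Python sketch under stated assumptions: the 30-second cap, the jitter of up to half the step, and the names backoff_delay and retry_keyed are all choices made for illustration, and the operation is assumed to signal a timeout by raising TimeoutError.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  jitter: float = 0.5,
                  rng: Optional[random.Random] = None) -> float:
    """Wait before retry number `attempt` (0-based): 1 s, 2 s, 4 s, plus jitter."""
    source = rng if rng is not None else random
    step = min(cap, base * 2 ** attempt)
    return step + source.uniform(0.0, jitter * step)


def retry_keyed(operation: Callable[[str], dict], idempotency_key: str,
                max_attempts: int = 4,
                sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry `operation` on timeout, but only when it carries a key."""
    if not idempotency_key:
        raise ValueError("refusing to retry an unkeyed operation")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation(idempotency_key)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: outcome stays unknown
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the helper testable without real waiting, and the refusal to run without a key turns the safe-state rule into something the code enforces rather than something callers must remember.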

When a PSP degrades, a circuit breaker, a component that stops calling a failing dependency and fails fast until a periodic probe succeeds again, takes it out of rotation, and new payments route to a secondary PSP that the platform keeps integrated and warm for exactly this day. The dangerous moment is the handoff rather than the outage itself. Any charge in an unknown state at the degraded PSP must be resolved by query, webhook, or reconciliation before the platform re-attempts that payment anywhere else, because idempotency keys do not span providers, and no single PSP's deduplication can see a duplicate created at its competitor. Failing over the new traffic is the easy half, while the in-flight unknowns are the half that needs a written procedure, rehearsed before the incident rather than improvised during one. A buyer caught in the window sees a pending order that confirms a few minutes late, which is the designed experience, since the alternative of fast false certainty would be paid for in occasional double charges.
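A circuit breaker and the handoff rule can be sketched together. This Python illustration is deliberately simplified, and its threshold, cool-down, and names are assumptions. In particular, it lets all calls through once the cool-down has elapsed instead of admitting a single probe, which a production breaker would restrict.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows calls again after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: fail fast until the cool-down has elapsed, then let a probe through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()   # open, or re-open after a failed probe


def route(primary: CircuitBreaker, unresolved_at_primary: bool) -> str:
    """Pick where a payment attempt may go during a PSP incident."""
    if unresolved_at_primary:
        # Keys do not span providers: settle the unknown charge first.
        return "resolve_first"
    return "primary" if primary.allow() else "secondary"
```

The unresolved check comes before the breaker check on purpose, so a payment with an unknown charge at the degraded provider is never sent to the secondary regardless of the breaker's state.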

Payouts to sellers are the mirrored flow. Sales accumulate as credits in seller_payable, and on a schedule, daily or weekly by seller preference, a payout job moves balances out to sellers' bank accounts through the PSP's transfer rails, carrying the same style of idempotency keys, writing the same style of ledger entries that debit seller_payable, and reconciling against payout reports exactly as charges reconcile against settlement files. Gating payouts on clean reconciliation of the days they cover is cheap insurance, since clawing money back out of a seller's bank account is far harder than delaying it by a day, and sellers forgive a slow payout far more readily than a corrected one. Monitoring watches two signals. The first is the success rate per PSP, where a drop signals a processor incident long before any status page admits one. The second is the distribution of decline codes, where a spike in a code such as "do not honor", the catch-all refusal, flags a fraud wave or an issuer problem. Both are leading indicators that a plain error-rate dashboard misses entirely.

Follow-up questions

Why is exactly-once delivery impossible, and what do you do instead? Networks lose acknowledgments, so a sender can never distinguish a lost request from a lost reply, and any protocol promising true exactly-once delivery is hiding a retry somewhere inside. The achievable contract is at-least-once delivery combined with idempotent processing, which yields effectively-once outcomes, and the keys in this design are that combination made concrete.
A charge call timed out. What is the payment's state? The state is unknown, and the system must be willing to record it as such rather than guess. The payment stays pending until a webhook or an active PSP query resolves it, and any retry carries the original idempotency key, so resolution and retry can never disagree about which charge they describe.
Why an append-only ledger instead of updating a balance column? Balances are always derivable from entries, but history is not re-derivable from a mutated balance, so the asymmetry favors keeping the entries. Append-only rows provide audit evidence, dispute reconstruction, and corrections as visible reversals, and the zero-sum invariant catches whole classes of defects at write time instead of at month close.
Webhooks arrive twice and out of order. What protects you? Deduplication on the PSP's event ID makes duplicates harmless no-ops, and the state machine refuses illegal transitions, so a stale event cannot regress a succeeded payment. Whatever fits neither category parks in a review queue, because an event with no legal home is evidence that something else misbehaved.
How do you switch PSPs during an incident? Circuit-break the degraded provider, route new payments to the secondary, and resolve every unknown-state charge at the old provider by query or reconciliation before re-attempting it anywhere, because no idempotency mechanism spans two processors. The fresh traffic is the easy part, and the in-flight unknowns are the reason the procedure is written down in advance.
Where do seller payouts fit? They run as the mirrored flow on a schedule, with payable balances accumulating in the ledger, payout transfers carrying their own keys and entries, and payout reports reconciling exactly like settlement files. Holding a payout until its days reconcile cleanly costs a day of patience and saves ever asking a bank to reverse one.

References

Stripe, Designing robust and predictable APIs with idempotency (2017).
Stripe documentation, Receive Stripe events in your webhook endpoint, on signatures, retries, and ordering.
Xu and Lam, System Design Interview, Volume 2 (2022), chapter on payment systems.
Kleppmann, Designing Data-Intensive Applications (2017), on exactly-once semantics, retries, and fault tolerance.