A digital wallet gives users a stored-value account, meaning a balance held by the platform that they top up from a card or bank account and then spend or send instantly, the way a PayPal balance, a Venmo account, or an Alipay wallet works. Transfers between two wallet users never touch a bank, since both balances live inside the same system, which is exactly why they can be instant and also why the system holding them must be incorruptible, because no bank's books exist downstream to catch its mistakes. The interview phrasing usually attaches an aggressive number, a million transfers per second, and the question is really asking two things at once. It asks whether you can scale a money store past a single database, and it asks whether you understand that scaling one multiplies the ways money can quietly go missing.
The hard parts follow directly from that pairing. A transfer that debits one account and credits another is trivial inside one database and genuinely subtle across two, so the design hinges on how accounts are sharded and on which protocol moves money between shards when, as the arithmetic will show, nearly every transfer crosses them. And because the data is money, every mechanism has to leave an audit trail good enough to reconstruct any balance at any past moment, which is the pressure that pushes the design away from a mutable table of numbers and toward an event-sourced core where the history itself is the data.
Scope and requirements
Functionally the wallet supports balance accounts, peer-to-peer transfers, top-ups from external payment methods, withdrawals back out to bank accounts, and a complete transaction history per account, since a money product that cannot produce a statement is not a product. Know-your-customer and anti-money-laundering obligations, the identity verification and transaction monitoring that regulators require of anyone holding customer funds, are acknowledged here as gates on what an account may do rather than designed in depth, and the card-charging side of top-ups is delegated to a payment service provider for the same reasons any payment system delegates it, namely tokenized card data and a vastly smaller compliance surface.
The invariants deserve to be stated plainly, because they are the actual requirements and everything else is implementation detail in their service. No balance may ever go negative. Money is never created or destroyed by a transfer but only moved, so the sum across all accounts changes only at top-up and withdrawal boundaries, and a continuous checker can hold the system to exactly that. Every movement must be attributable to a command from an identified actor, and history must be replayable for audit, so that given the record an examiner can re-derive any balance at any past moment and get the same answer every time. Latency should stay sub-second so the product feels instant in a chat thread, and the headline throughput is 1 million transfers per second, which the sizing below treats as a serious number rather than waving at it.
Sizing the problem
Start with what one machine can actually do. A well-tuned relational database executing small transactions with synchronous durability, meaning each commit is physically on disk before the database acknowledges it, manages somewhere between 5,000 and 10,000 such transactions per second, with the ceiling set by the round trip to storage rather than by computation. Taking the generous end, 1,000,000 divided by 10,000 gives 100 databases running at full saturation, and nobody runs money systems at full saturation, so a realistic deployment needs 200 or more shards once headroom, replication, and uneven load are budgeted. That gap of two full orders of magnitude between the requirement and a single node is the entire reason this design exists, and doing the division out loud is how the interview earns the complexity that follows instead of assuming it.
Sharding then creates the next problem by arithmetic alone. If accounts are spread across 200 shards and transfer pairs are roughly random, the chance that both parties land on the same shard is about 1 in 200, which means 99.5 percent of transfers cross shards. The cross-shard path is therefore the common case rather than the corner case, and whatever protocol handles it has to be fast and safe at full volume rather than an exception handler invoked occasionally. Event volume is equally sobering once multiplied out. One million transfers per second is 86.4 billion events per day, and at around 120 bytes per event that is roughly 10 TB of log per day, so the log must tier into cheap object storage while staying replayable. Even a skeptic who reads the million as a burst peak with a tenth of that sustained still faces tens of shards and terabytes of history, so no plausible reading of the numbers leads back to one database.
The API
Amounts are integers in minor units, and every mutating call carries a client-generated operation ID, which serves as the wallet's idempotency key so that a retried request applies exactly once. Transfers are accepted first and then settle through the machinery described below, which is why the response is a 202 with a pending status rather than an instant final answer, and clients poll or subscribe for the completion that typically arrives within a second:
POST /api/v1/transfers
Operation-Id: 0d9f3b71-2c44-4f8e-9a01-8be2f6c0d5a3
{ "from_account": "acc_211", "to_account": "acc_870",
"amount": 2500, "currency": "usd" }
→ 202 { "transfer_id": "tr_5512", "status": "pending" }
GET /api/v1/transfers/tr_5512 → 200 { "status": "completed" }
GET /api/v1/accounts/acc_211/balance → 200 { "available": 1175, "reserved": 0 }
POST /api/v1/topups
Operation-Id: 77c1a9e4-5b02-4d33-8f1c-09d4e2b6a7f0
{ "account": "acc_211", "amount": 5000, "payment_method": "pm_card_tok" }
The data model, starting naive
The baseline everything else gets measured against is one database holding an accounts table and a transactions table. A transfer is a single ACID transaction, the database term for a group of changes that is atomic, consistent, isolated, and durable, and it debits one row, credits another, and records what it did. On one node this is simply and completely correct, because the balance check and the debit happen atomically, the constraint stops overdrafts no matter how requests interleave, and crash recovery is the database vendor's problem rather than ours:
CREATE TABLE accounts (
account_id BIGINT PRIMARY KEY,
balance BIGINT NOT NULL CHECK (balance >= 0), -- minor units
updated_at TIMESTAMPTZ NOT NULL
);
BEGIN;
UPDATE accounts SET balance = balance - 2500
WHERE account_id = 211 AND balance >= 2500; -- 0 rows = insufficient funds
UPDATE accounts SET balance = balance + 2500
WHERE account_id = 870;
INSERT INTO transactions (op_id, from_acct, to_acct, amount)
VALUES ('0d9f3b71...', 211, 870, 2500); -- unique op_id dedupes retries
COMMIT;
Everything that follows is an attempt to keep these exact guarantees while the two UPDATE statements move to different machines. Naming that explicitly in the interview frames every later choice as a recovery of something the single node gave for free, and it also supplies the test each mechanism must pass, because any design that can drop a credit or permit an overdraft during a crash is not a scaled version of the baseline but a regression from it dressed in distributed clothing.
The high-level architecture
Accounts are partitioned by account ID across shards, and each shard is not one database but a small replicated group whose nodes agree on an ordered command log, so that no single machine's death can lose an acknowledged write. Balances are derived from per-shard event logs rather than stored as primary truth, with read models, which are queryable views built off the logs, serving balance and history queries, and snapshots keeping recovery from ever meaning a replay of years. The user's experience stays plain through all of it, a transfer accepted in milliseconds and confirmed within about a second, with the distributed machinery invisible until an auditor asks precisely how the number on the screen came to be, at which point the machinery is the answer.
Commands route by account ID to a shard, each shard is a Raft-replicated state machine appending to its own event log, and projections off the logs build read models and snapshots. Dashed arrows are asynchronous.
Cross-shard transfers
A transfer between accounts on different shards is a distributed transaction, an operation that must appear atomic even though it spans machines that can fail independently of each other, and the protocol choices are best compared through their failure stories rather than their happy paths, because the happy paths all look identical.
Two-phase commit, or 2PC, is the classical database-level answer. A coordinator asks every participant to prepare, meaning each one durably promises it can commit and holds locks on the affected rows while it waits, and once all participants vote yes the coordinator tells them to commit. The correctness is real, and so is the famous hazard, because if the coordinator crashes between the phases, the participants are left in doubt, still holding locks on account rows, unable either to commit or to roll back until the coordinator returns, since acting alone could contradict what the others did. At hundreds of thousands of transfers per second, a blocked participant freezes a slice of every account on its shard, and the freeze spreads as new transactions queue behind the held locks, which is why high-volume money systems tend to keep 2PC off the hot path entirely rather than merely discouraged.
Try-confirm-cancel, or TC/C, moves the transaction up into the application as two or three ordinary local transactions. In the try step the coordinator reserves funds on the debit side, moving 25.00 from available balance into a reserved bucket on shard A as a real committed row visible to audit. The confirm step then applies the transfer, finalizing the debit on A and crediting the account on B in its own local transaction. If anything fails after a successful try, the cancel step releases the reservation, and the money never moved at all. Each step is idempotent and journaled, so a crashed coordinator is recovered by a scanner process that finds in-flight transfers in the journal and drives each one forward to confirm or backward to cancel according to policy. Nothing ever blocks, because every intermediate state is a committed, legible business state rather than a held lock, and an operator reading the journal during an incident can see exactly which transfers are mid-flight and where each one stopped. The debit side always goes first, and the ordering is structural rather than stylistic, because the failure mode of credit-first ordering is a window where the recipient can spend money that was never successfully taken from the sender, and a crash inside that window has manufactured money out of nothing.
A saga generalizes the same idea into a sequence of local transactions where each step has a compensating action that undoes it if a later step fails. TC/C is essentially a two-step saga specialized for funds, and the specialization matters, because reserving before spending preserves the no-negative-balance invariant even mid-flight, where a generic saga would only restore it after compensation ran. The cost relative to 2PC is that the application owns the protocol, the recovery scanner, and the reserved-funds bookkeeping, and that users can briefly see a reserved amount reducing their available balance, which the support team must be able to explain in one sentence. The gain is that no machine failure anywhere can leave locks pinned across the fleet, and for a system whose entire purpose is to keep money moving at volume, trading some bookkeeping effort for the absence of fleet-wide freezes is a bargain worth taking twice.
In step 1 the client requests a transfer, and in step 2 the coordinator runs the try by reserving the amount on the debit shard. Step 3 is the confirm, applying the credit on shard B and finalizing the debit, while step 4 shows the dashed cancel branch, where any failure after the try releases the reservation and no money has moved. Step 5 returns the outcome to the client.
Event sourcing as the backbone
Event sourcing means the balance is not a stored number but the fold of an append-only event log, where a fold is the result of applying entries one at a time to an accumulating state. The log records facts such as transfer tr_5512 reserving 2500 from acc_211 and the same transfer later confirming, and the current state of any account is whatever applying its events in order produces. Because replaying 86 billion events to answer a balance query would be absurd, each shard takes periodic snapshots, which are materialized states as of a known log position, and a read becomes the latest snapshot plus the short tail of events after it, typically milliseconds of work rather than a history lesson.
What this buys is exactly what the invariants demanded. Audit comes free, because the log is the audit trail rather than a reconstruction of one after the fact. Recovery is deterministic replay, where a node that crashes reloads its snapshot, replays the tail, and lands in provably the same state it would have reached without the crash. Disputes get time travel, since asking what a balance was last March 3rd at noon becomes a fold up to a log position rather than a forensic project across backups. What it costs is equally real and worth conceding before being pressed. Every queryable view, including balances, histories, and statements, is a read model that must be built from the log and kept in sync, and queries against those models are eventually consistent, lagging the log by an interval that has to be measured and watched rather than assumed away. The log also lives forever, which at 10 TB per day means aggressive tiering into object storage, and the event schema must be able to evolve without breaking replay of records written years earlier, a discipline that takes versioned event types and genuine review rather than good intentions.
Consensus, idempotency, and risk controls
Each shard must itself survive machine failure without losing an acknowledged cent, and the standard tool is a replicated state machine, an arrangement where commands are sequenced through a consensus log and applied by deterministic logic on several replicas at once. Consensus here means a protocol such as Raft, which gets a small cluster of replicas to agree on a single ordered log even while a minority of them are down or unreachable. Every replica applies the same commands in the same order through the same logic, so every replica holds the same balances, and when a leader dies an election promotes a follower that already has the full sequence, making failover a role change rather than a data recovery. The shard's event log and its Raft log are naturally the same ordered stream, and the economy is pleasing, since consensus, durability, and audit all share one structure instead of three structures that can quietly disagree.
Idempotency rides inside the state machine rather than beside it. Every command carries the client's operation ID, and the dedupe table mapping operation IDs to outcomes is itself part of the replicated state, so a retried transfer applies once even if the retry lands on a freshly elected leader after a failover, because the new leader holds the same dedupe table for the unanswerable reason that it holds the same log. Risk controls sit on the command path before anything is appended. Velocity checks, which are limits on rate and volume per account per time window, such as at most 50 transfers or 10,000.00 in a rolling hour, reject or hold commands synchronously, and KYC status gates capabilities, so an unverified account might receive money but not send above a small threshold. Heavier fraud models run asynchronously off the event stream and freeze accounts after the fact, which is a deliberate trade that keeps the command path fast while every intervention remains an auditable command in the same log as the transfers it polices.
A latency budget for one transfer
The sub-second promise holds comfortably once the path is written down step by step, and writing it down also shows which failures can hurt it. The gateway authenticates and routes by account ID in a millisecond or two. The try command on the debit shard costs one Raft round, in which the leader appends the command, replicates it to a majority of replicas, and acknowledges, and inside one region that runs one to three milliseconds, dominated by a network round trip plus a disk flush. The confirm on the credit shard costs the same again, the coordinator's own journal writes add a few more, and so a cross-shard transfer completes its actual money movement in well under 20 milliseconds of system time, with the user's perceived latency dominated by their phone's network hop rather than by anything the platform does.
The budget shifts when replicas span data centers for disaster tolerance, since a majority spread across a metro area adds single-digit milliseconds per round and a majority spread across continents adds tens, which is why shards usually keep their voting replicas within one region and ship the log asynchronously to a distant copy for catastrophe recovery, accepting a small bounded loss window for continent-scale disasters in exchange for a fast hot path every day. Under failure the budget degrades visibly but tolerably. A leader election pauses one shard's writes for a few hundred milliseconds, which a user experiences as one slow transfer rather than an error, and a stuck coordinator surfaces as a transfer pinned at pending until the scanner, sweeping every few seconds, drives it to completion. The product can defensibly show money as sent the moment the try commits, because from that point the protocol guarantees the transfer resolves to exactly one of confirmed or cancelled, and never to a limbo that strands the funds.
Scaling, failures, and operations
Shards scale by splitting account ranges, and event sourcing makes the migration mechanical, since moving an account means replaying its events to the destination shard, pausing it briefly to drain in-flight operations, and cutting the routing table over, after which the old copy is dead weight to delete at leisure. Hot accounts are the asymmetry worth watching, because a major merchant receiving thousands of credits per second can saturate its shard while neighboring shards idle. The standard treatments are spreading such a merchant across sub-accounts that aggregate in a read model, so each sub-account's command rate stays inside one shard's budget, and batching micro-credits before they reach the state machine, which trades a moment of credit latency for an order of magnitude less log traffic. The gateway and the coordinators are stateless and scale flat, and the recovery scanner that drives stuck TC/C transfers runs continuously, with its queue depth watched as a first-class health metric, because a growing queue means transfers are entering trouble faster than they are leaving it.
Failures map onto short, rehearsable stories. A shard leader dies, Raft elects a new one in hundreds of milliseconds, and no acknowledged command is lost, because acknowledgment always required majority replication in the first place. A coordinator dies mid-transfer, the scanner finds the journaled try without a confirm, and it either confirms or cancels by policy and timeout, with the user seeing a transfer that took seconds instead of milliseconds rather than a transfer that vanished. Read models lag, the balance endpoint serves slightly stale answers with the lag surfaced in monitoring, and commands keep validating against the authoritative shard state, so staleness never enables an overdraft. The monitor that matters above all others is the conservation check, a continuous job folding per-shard sums plus in-flight reserves and comparing the total against expected global funds, because a nonzero drift means an invariant broke somewhere, and finding that within minutes instead of at quarter close is the difference between a quiet fix and an incident report with a regulator's name on the distribution list.
Follow-up questions
- Why not one big database on bigger hardware? The arithmetic gap is two orders of magnitude, since a single node manages perhaps 10,000 durable transactions per second against a requirement of 1,000,000. Vertical scaling buys at most one order of magnitude at rapidly worsening prices, and a single node is also a single blast radius holding everyone's funds, which no risk review would sign off on.
- Why TC/C over two-phase commit? 2PC holds locks while participants sit prepared and blocks outright if the coordinator dies, which at this rate freezes slices of the fleet behind unresolved locks. TC/C keeps every intermediate state as a committed, auditable row, so recovery is a scanner reading a journal rather than a lock-resolution outage, and no single crash can pin the system.
- What if the confirm on the credit shard keeps failing? The coordinator retries it, because a credit to a valid account cannot legitimately be refused once the try has committed, and idempotent confirms make the repetition safe. If the destination account itself is gone, the transfer cancels and the reservation releases, and the journal makes whichever path is taken resumable from any crash.
- How does a retried transfer apply exactly once across failover? The operation-ID dedupe table lives inside the replicated state machine, so any leader capable of processing the retry necessarily holds the record of the first application, having reached its position by applying the very same log.
- What does event sourcing actually cost? It costs read models that must be built and monitored, eventual consistency on every query path, a log growing around 10 TB per day at the headline rate, and schema evolution discipline so that records written years ago still replay correctly. Those are real ongoing expenses, paid willingly because audit and recovery fall out of the same structure.
- Where do fraud and compliance checks run? Velocity limits run synchronously on the command path before anything is appended, so the cheap checks gate every movement. Heavier scoring runs asynchronously off the event stream and issues freeze commands through the same log as everything else, which keeps the hot path fast while leaving every intervention auditable.
References
- Xu and Lam, System Design Interview, Volume 2 (2022), chapter on digital wallets.
- Fowler, Event Sourcing (2005).
- Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (2014), the Raft paper.
- Kleppmann, Designing Data-Intensive Applications (2017), on distributed transactions, consensus, and total order broadcast.