Design a hotel reservation system

A hotel reservation system sells nights in rooms, which sounds almost too plain to be worth designing until the details begin to press in. A traveler searches a city for a three-night stay, compares a handful of hotels, books a king room, pays with a card, and later perhaps cancels the trip or shifts it by a day. Booking.com, Expedia, and every chain's own website run some version of this machine, and interviewers reach for the question because it looks like ordinary form-and-database work right up until two guests grab the last room in the same instant. The traffic numbers turn out to be small by internet standards, and that smallness carries the real lesson of the problem, because the difficulty here was never going to be throughput but correctness. Every count in the inventory table corresponds to a physical bed that a real person will expect on arrival, so an off-by-one is not a rounding error in a dashboard but a family standing in a lobby at midnight with nowhere to sleep.

The hard parts live in a few specific places, and naming them up front gives the interview its spine. The inventory model decides whether a booking is a clean transaction or a tangle of interval arithmetic. Concurrency control decides what happens when two requests race for the final room, which is the moment the interviewer is quietly waiting to hear handled. Payment sits in the middle of the booking flow, takes seconds, and can fail or hang at the worst possible moment, so its uncertainty has to be contained rather than wished away. Overbooking, which sounds like a defect, turns out to be a deliberate business rule that the system has to enforce precisely rather than stumble into. The walkthrough below takes those in order, the way I would run it in a thirty-minute interview, with the arithmetic done out loud so that each decision is visibly earned rather than asserted.

Scope and requirements

I would start by confirming the features, since five minutes spent agreeing on scope saves twenty spent designing the wrong thing. The system must let a guest search hotels by location and date range, show availability and price per room type, book with payment, and cancel or modify a reservation, and it must give hotel staff a way to manage their inventory and rates. Loyalty programs, dynamic pricing engines, and the hotel's on-premise property management system are all real adjacent systems, but they sit outside the core, and I would say so explicitly so the interviewer knows the omission is a choice rather than an oversight. The most valuable agreement to extract early concerns the unit of sale. Guests book a room type for a date range, such as one king room for three nights, and never a specific room number, because the alternative of selling numbered rooms would commit every booking to a particular door weeks in advance and would forbid the front desk from shuffling assignments around a broken air conditioner. That single product decision shapes the entire data model, which is why it deserves to be pinned down before anyone draws a box.

The non-functional requirements are dominated by one rule, namely that the system must never sell beyond the configured capacity policy under any interleaving of requests. Bookings must be durable once confirmed, since losing one strands a traveler who did everything right. Search should be fast and may tolerate slightly stale availability, while the booking path may not, and that asymmetry is a gift the design should accept rather than fight. Reads vastly outnumber writes, because people browse far more rooms than they ever book, which means the read path and the write path deserve different machinery rather than one compromise serving both. Money also flows through the booking path, so every mutation needs to be safe to retry, a property the payment section earns with a specific mechanism rather than a promise.

Sizing the problem

A modest but realistic framing is 5,000 hotels and 1 million rooms, which works out to 200 rooms per hotel on average, about the size of a mid-range city property. Assume 70 percent occupancy, a typical industry figure, and an average stay of three nights. Occupied room-nights per day are then 1,000,000 rooms times 0.7, or 700,000, and since a three-night stay contributes three of those room-nights, dividing 700,000 by 3 gives roughly 233,000 new bookings per day. A day holds 86,400 seconds, so 233,000 divided by 86,400 comes to about 2.7 bookings per second on average, and planning for a peak of ten times the average means the write path must absorb roughly 30 bookings per second. That number deserves a beat of silence in the interview, because it is tiny. Any single healthy relational database absorbs 30 writes per second without noticing, and saying so out loud reframes the problem correctly. The design effort goes into making every one of those writes right rather than fast, and an answer that begins by sharding the database has misread the question before starting it.

Reads tell a different story, and the contrast is the first architectural fact worth recording. With a sensible 100 to 1 ratio of availability searches to bookings, search runs near 300 queries per second on average and spikes into the thousands during promotions or fare-sale news, which is exactly the load the cache tier exists to soak up. Storage stays small everywhere you look. The availability table holds one row per hotel, room type, and night, and 5,000 hotels times 5 room types times 730 nights of forward inventory is about 18 million rows, which at a bit over a hundred bytes each is around 2 GB, small enough to sit in memory on a modest server. Reservations accumulate at 233,000 per day, roughly 85 million per year, and at 500 bytes each that is about 43 GB per year, which a single instance carries for a decade before anyone needs to discuss archive tiers. Nothing here strains a database, so the constraint that shapes the architecture is correctness under concurrency rather than scale, and the rest of the walkthrough spends its minutes accordingly.

The API

The external surface is compact, and keeping it compact is itself a defense, because every additional mutating endpoint is another place where idempotency and authorization have to be reasoned about. Search is a read that needs no login. Booking is an authenticated write that carries an idempotency key, a client-generated token defined properly in the payment section, and the key sits in the contract from day one because retrofitting idempotency onto a public API after clients exist is far harder than shipping it first. Cancellation and modification operate on an existing reservation and carry the same retry-safety obligations. The shapes look like this:

GET /api/v1/hotels/search?city=lisbon&check_in=2025-10-10&check_out=2025-10-13&guests=2
→ 200 { "hotels": [ { "hotel_id": 42, "room_types": [
        { "room_type_id": 7, "name": "King", "nightly_rate": 180, "available": true } ] } ] }

POST /api/v1/reservations
Idempotency-Key: 7f3c9d2a-91b4-4c0e-a2f1-6d8e0b5c4a17
{ "hotel_id": 42, "room_type_id": 7, "check_in": "2025-10-10",
  "check_out": "2025-10-13", "guest_id": 9001, "payment_token": "tok_x91k" }
→ 201 { "reservation_id": "res_55812", "status": "confirmed" }

DELETE /api/v1/reservations/res_55812        → 200 { "status": "cancelled" }
PATCH  /api/v1/reservations/res_55812        → 200   (date change: cancel plus rebook, one transaction)

The inventory data model

The central decision is to track room types rather than physical rooms, and it deserves a full defense because everything downstream leans on it. A room type is a sellable category such as a king room with a city view, and the hotel holds some count of them. Guests buy a count, and the front desk assigns an actual room number at check-in, which lets the hotel shuffle assignments freely for maintenance, deep cleaning, or upgrades without touching the reservation system at all. The alternative of modeling physical rooms would force the booking write to pick and lock a specific room across a date range, multiply the rows every transaction touches, and invent a new class of conflicts in which two guests contend for room 412 while room 414 sits empty beside it. It would buy nothing the product needs, since no guest books a numbered door, and that mismatch between model and product is exactly the kind of self-inflicted complexity an interviewer hopes to see avoided.

With counts in hand, availability becomes a simple table keyed by hotel, room type, and night, holding the total number of rooms and the number already reserved. A stay spanning three nights touches exactly three rows. Booking the king room from October 10 to October 13 means the nights of October 10, 11, and 12, since the checkout day is not slept in, so the write increments reserved on those three rows inside one database transaction, a transaction being the database's guarantee that a group of changes applies entirely or not at all. If any night is full, the whole booking rolls back and the data is exactly as it was before the attempt, which is the property that keeps a half-booked stay from ever existing.

CREATE TABLE room_type_inventory (
  hotel_id      BIGINT  NOT NULL,
  room_type_id  BIGINT  NOT NULL,
  night         DATE    NOT NULL,
  total_rooms   INT     NOT NULL,
  reserved      INT     NOT NULL DEFAULT 0,
  version       INT     NOT NULL DEFAULT 0,
  PRIMARY KEY (hotel_id, room_type_id, night)
);

CREATE TABLE reservations (
  reservation_id   UUID PRIMARY KEY,
  hotel_id         BIGINT NOT NULL,
  room_type_id     BIGINT NOT NULL,
  guest_id         BIGINT NOT NULL,
  check_in         DATE   NOT NULL,
  check_out        DATE   NOT NULL,
  status           TEXT   NOT NULL,        -- pending, confirmed, cancelled
  idempotency_key  TEXT   NOT NULL UNIQUE,
  expires_at       TIMESTAMPTZ             -- set while a hold is pending
);

One row per night looks wasteful next to storing each reservation as a date range, and the range model is the alternative worth naming. Under ranges, answering whether a room type is free for October 10 through 13 means scanning for any reservation whose interval overlaps the query interval, and overlap logic is notoriously easy to get subtly wrong at the boundaries, while counting how many rooms remain on a given night requires aggregating over every range that covers it. The per-night model makes each question the system asks trivial instead. Availability for a range is a read of exactly N rows, partial overlap between stays needs no interval arithmetic at all because each night stands alone, and the booking write is a handful of integer updates with an obvious correctness condition on each row. Eighteen million small rows is a cheap price for never reasoning about intervals again, and the version column in the schema is the hook the concurrency section will hang its safety on.

The high-level architecture

The shape of the system follows from the read and write split established in the sizing. An API gateway, the single front door that authenticates requests and routes them onward, fronts two services. A search service answers availability queries from a cache and never writes anything, while a reservation service owns the booking transaction against the inventory database, calls the external payment provider, and emits events for everything downstream such as confirmation emails, analytics, and pricing. Splitting the two means a promotion-day flood of searches cannot starve the booking path of connections, and a slow payment provider cannot make browsing sluggish, because the workloads share nothing but the gateway. A single combined service would be simpler to deploy and would work fine at this scale, but it couples the failure domains of the two paths for no saving worth having, and the interview is the right place to say that trade out loud.

Search reads cached availability with bounded staleness, while the booking path goes through the reservation service to the inventory database, which stays the source of truth. Dashed arrows are asynchronous or taken only sometimes.

Concurrency control for the last room

Suppose one king room remains for the night of October 10 and two booking requests arrive within a millisecond of each other. Both read the availability row, both see 99 of 100 reserved, both conclude a room is free, and without protection both increment the count, leaving the hotel having sold 101 rooms it does not have. Neither request did anything wrong individually, which is what makes races so unpleasant to reason about, since the defect lives in the interleaving rather than in either code path. Two standard defenses exist, and both are worth defining carefully, because the interviewer will ask for the pair and for the reasoning that picks between them.

Pessimistic locking takes the locks up front. The transaction issues SELECT ... FOR UPDATE on the three night rows, which tells the database to hold exclusive row locks, and any competing transaction touching those rows simply waits in line until the first commits, so the race disappears by construction. The approach is correct and easy to explain, but its costs are structural rather than incidental. Every booking now serializes on its rows whether or not a conflict exists, so the common uncontended case pays for the rare contended one. Lock wait time grows with anything slow inside the transaction, which is why the cardinal discipline of never holding row locks across a network call such as the payment charge has to be maintained forever by every engineer who ever touches the code. Worse still, two stays locking overlapping night ranges in different orders can deadlock, a standoff in which each transaction waits on a lock the other holds, which the database resolves by killing one of them, and the application then has to retry anyway, surrendering the simplicity that was the approach's main selling point.

Optimistic concurrency control inverts the bet. Each row carries a version column, the transaction reads the rows while remembering their versions, and it then attempts a conditional update, often called compare-and-set, that applies only if the version is still the value it read. If another writer got there first, the version has moved on, the update matches zero rows, and the application re-reads and retries from fresh state. Nothing locks during the read or during the user's think time, and the cost of a conflict is paid only when a conflict actually happens, which the arithmetic below shows is almost never.

UPDATE room_type_inventory
   SET reserved = reserved + 1, version = version + 1
 WHERE hotel_id = 42 AND room_type_id = 7 AND night = '2025-10-10'
   AND version = 18
   AND reserved + 1 <= CEIL(total_rooms * 1.05);   -- the overbooking cap
-- repeated for each night; if any statement touches 0 rows, ROLLBACK and retry

Optimistic wins here because contention is rare by construction, and the claim is checkable rather than rhetorical. Thirty bookings per second spread across 25,000 hotel and room type combinations and 730 distinct nights gives an enormous space of rows for writes to land on, so two simultaneous writes to the same row are a few-times-a-day event rather than a steady state, and locking every transaction to guard against a collision that rare is the wrong trade. When a collision does occur, the loser re-reads and sees one of two things, either a new version with rooms still free, in which case the retry succeeds and the guest never learns anything happened, or a full count, in which case the service returns a clean sold-out answer instead of an error. A bound of three attempts keeps the loop from spinning during a genuine rush, after which the request fails politely rather than burning the database. The cap predicate inside the UPDATE adds a second and independent layer of safety, because even a confused retry can never push reserved past policy, so correctness never depends on the retry logic being clever. A third option worth naming is running the database at serializable isolation, the strictest setting, in which the database itself detects dangerous interleavings and aborts one of the transactions involved. It reaches the same end state, but it makes the abort opaque, charges a tax on every transaction in the system, and still requires retry handling in the application, so it spends more to deliver less legibility than the explicit version column.

Holds, payment, and idempotency

Payment turns the booking into a two-act story, because the charge happens at an external provider, takes anywhere from a few hundred milliseconds to several seconds, sometimes fails, and sometimes times out in a state where the platform genuinely does not know whether money moved. Ordering the two acts is the real design question. Charging first risks taking a traveler's money for a room that another request grabs mid-payment, after which the refund path becomes part of the happy path, which nobody wants to operate. Reserving first without a time limit parks inventory behind every abandoned shopping cart, and a popular hotel could sit fully booked for hours by guests who simply closed the tab. The hold pattern threads between those failure modes. The reservation service first writes a reservation in a pending state and increments the inventory rows in one transaction, stamping the reservation with an expiry ten minutes ahead, a time-to-live after which the hold stops counting against inventory. The room is now off the market while the guest's card is charged. On success the status flips to confirmed and the expiry is cleared, while on failure or abandonment the expiry releases the hold by decrementing the same rows, so the worst an abandoned cart can ever do is sideline one room for ten minutes.

The second mechanism is the idempotency key, a client-generated unique token attached to a logical operation so that performing the operation twice has the same effect as performing it once. The double-submit scenario shows why the key cannot be optional. A guest taps Book, the request reaches the server and succeeds, but the response is lost on a flaky connection, so the client retries the only way it can, by sending the same request again. Without a key, the retry creates a second reservation and a second charge, and the guest discovers the duplicate on a card statement weeks later. With the key, the server finds it already stored on the first reservation, since the unique index on the column makes a second insert impossible, and it simply returns the original result, so the guest sees one booking no matter how many times the button was pressed. The reservation service applies the same idea downstream by deriving an idempotency key from the reservation ID for the payment provider call, so a retried charge also lands exactly once at the provider. The remaining dark corner is a charge that times out, where the provider may or may not have taken the money, and the service must not guess. It leaves the reservation pending and asks the provider for the fate of that key, or waits for the provider's asynchronous notification, and only then confirms or releases, because uncertainty resolved by a query is recoverable while uncertainty resolved by an assumption is not.

In step 1 the guest submits a booking carrying an idempotency key, and in step 2 the service reads the three availability rows along with their versions. Step 3 applies the conditional update that creates a pending hold with an expiry, step 4 shows the re-read and retry taken only on a version conflict, step 5 charges the card through the payment provider, and step 6 flips the hold to confirmed and returns the answer to the guest.

Overbooking, search, and the cache

Overbooking is the practice of deliberately selling more than physical capacity because some fraction of guests reliably fails to arrive. A hotel with 100 king rooms and a measured 5 percent no-show rate can sell 105 and expect to fill the building on a typical night rather than leave paid-for rooms empty, and the recovered revenue funds the occasional bad night. The crucial design point is that this is a configurable business rule rather than a side effect of weak locking. The cap check sits explicitly inside the conditional update as reserved + 1 <= CEIL(total_rooms * 1.05), with the factor stored per hotel and room type, so a property with reliable corporate guests can run 1.02 while an airport hotel full of missed connections runs 1.08. If 102 guests actually arrive, the hotel walks two of them to a nearby property at its own expense, an outcome the revenue team priced when it chose the factor. Selling room 106 would be a defect, while selling room 105 is the plan working as intended, and keeping that line bright is the entire difference between a policy and an incident.

Search occupies the other side of the read and write asymmetry. Availability queries outnumber bookings a hundred to one, and they tolerate staleness in a way bookings never can, because the worst consequence of a stale search result is a guest who clicks through and finds the room just sold, a mild disappointment rather than a broken promise. The search service therefore answers from a cache of availability counts keyed by hotel, room type, and night, refreshed on a short expiry of around 60 seconds and additionally invalidated by booking events, so popular hotels update faster than the worst case. Bounded staleness, meaning a known upper limit on how out of date an answer can be, is the explicit contract, and sixty seconds is a defensible bound because inventory moves slowly at 2.7 bookings per second across the whole estate. The reservation service never trusts the cache for a write decision, since its conditional update against the inventory database is the recheck, so the failure mode of a stale cache is a polite message at checkout saying the room just went, never an oversell. The alternative of strongly consistent search, where every query reads the database directly, would multiply the database load many times over to buy a freshness no browsing guest can perceive, which is why it loses.

A latency budget for the booking path

A budget for where the time goes earns its own section, because it exposes which parts of the design are allowed to be slow and which are not. On the search path, a cache hit costs a millisecond or two inside the data center, so the guest's experience is dominated by the 50 to 100 milliseconds of network between phone and server, and the results page feels instant. A cache miss adds a single indexed read of a few milliseconds against the inventory database and warms the entry for the next visitor, which is why even a completely cold cache after a restart degrades latency only briefly and availability not at all, an outcome the operator watches as a short-lived blip rather than an emergency.

The booking write reads very differently. Authentication and routing at the gateway cost a millisecond or two, reading the three night rows with their versions costs about a millisecond on indexed primary keys, and the conditional update transaction commits in single-digit milliseconds, including the synchronous write of the pending reservation, so everything up to this point fits comfortably inside 20 milliseconds. The payment call then dominates utterly, taking 200 milliseconds on a good day and two or three seconds on an ordinary one, which is two orders of magnitude beyond everything else combined. Three design consequences follow directly from that imbalance. Database row locks must never be held across the payment call, which is half the argument against pessimistic locking restated as arithmetic. The user interface should show progress at around the one-second mark, because a guest cannot distinguish slow from broken and will refresh, which is exactly the double submit the idempotency key was built to absorb. And the service's timeout on the provider, perhaps ten seconds, must sit far below the ten-minute hold expiry, so a timed-out charge leaves minutes of slack in which to resolve the truth by query before the hold lapses. A confirmed booking lands in two to four seconds end to end, nearly all of it payment, and no amount of database tuning would move that number, which is precisely the kind of conclusion a budget exists to make visible.

Scaling, failures, and operations

The services are stateless, holding no data between requests, so they scale horizontally behind the gateway by adding instances, and at 30 writes per second the inventory database runs comfortably as a single primary with replicas for years. When one database is no longer enough, whether for write headroom or for fault isolation, the natural shard key is the hotel ID, where sharding means splitting rows across databases by some key. All inventory rows and reservations for a hotel then live on one shard, so the multi-row booking transaction stays local to a single database and nothing in the correctness story changes, and since a stay never spans hotels, no booking ever needs a distributed transaction. Hashing by something finer such as reservation ID would scatter a hotel's nights across machines and turn every booking into a cross-shard coordination problem, which is the alternative that loses on every axis. Cross-hotel search never touches the shards directly anyway, since it reads the cache and a search index fed by events.

Events deserve a precise mechanism rather than a hand wave. The reservation service writes an event row into an outbox table inside the same transaction as the booking, and a relay process reads that table and publishes the rows to the stream. This transactional outbox pattern guarantees that analytics, dynamic pricing, and confirmation emails see exactly the bookings that committed, with no phantom events from rolled-back transactions and no committed bookings missing from the feed. The tempting shortcut of publishing to the stream directly from application code after the commit fails in both directions, because a crash between commit and publish silently drops an event, while a publish ahead of a failed commit invents one, and either way the downstream consumers drift from the truth.

Cancellation runs the booking in reverse, with one transaction setting the reservation status to cancelled and decrementing reserved on the same night rows, while the refund flows through the payment provider under its own idempotency key according to the rate's cancellation policy. Modification is cancel-plus-rebook inside a single transaction, releasing the old nights and attempting the new nights through the same cap-checked conditional updates, then committing both legs or neither. If the new dates are full, the rollback leaves the original booking untouched, so a guest who asks to move a stay and cannot is left exactly where they started rather than stranded with nothing, which is the outcome any traveler would choose.

The failure stories stay small because the source of truth is one transactional store. Losing a cache node means a temporary surge of cheap count reads against the database while the cache rewarms, visible to the operator as a latency blip on the search dashboard and to guests as nothing at all. A failed primary promotes a replica, and if the business cannot tolerate losing the last few seconds of bookings, replication runs synchronously, so a commit is acknowledged only once a replica also has it. A stalled hold-expiry sweeper just means pending holds linger slightly past their time until the lazy check on the read path releases them, degrading nothing but tidiness. A payment provider outage stops new bookings cold while search continues untouched, and the operator's job during it is communication rather than data repair, because every interrupted booking is either pending with a hold that will expire on its own or confirmed with money taken, and never half of each. The one metric worth paging a human for is the invariant itself, since any row where reserved exceeds the cap means the concurrency control has been bypassed somewhere, and that is a wake-someone-up signal rather than a statistic to trend on a weekly chart.

Follow-up questions

Why track room types instead of physical rooms? Guests buy a category and the front desk assigns rooms at check-in, so inventory reduces to a few integer columns per night. Modeling physical rooms would force the booking write to lock a specific room across dates, invent contention between guests who could both have been satisfied, and buy nothing the product needs, since nobody books a numbered door.
Pessimistic or optimistic locking? At 30 bookings per second across 25,000 sellable combinations and 730 nights, conflicts are a few-times-a-day event, so optimistic concurrency pays the retry cost only when collisions actually happen, and the uncontended majority runs without any waiting at all. Pessimistic locks earn their keep only under sustained contention on the same rows, such as a flash sale on a single property, where retry storms would waste more work than the locks cost.
What stops a retried request from charging twice? An idempotency key stored uniquely alongside the reservation does, because the retry finds the existing key and receives the original result instead of a fresh execution, while the payment provider gets its own key derived from the reservation ID, so the charge deduplicates at that layer as well.
How does overbooking not become overselling? The cap is an explicit predicate inside the conditional update, with the factor stored per hotel and room type, so the limit is enforced at write time by the database itself. Anything past the cap is impossible by construction rather than discouraged by convention, which is the only kind of guarantee that survives staff turnover and 2 a.m. deploys.
How would you shard the database? By hotel ID, so that a stay's night rows and its reservation share a shard and the booking transaction stays local to one database. Search scales separately through the cache and the index, which are fed by events rather than by querying shards, so the sharding choice never leaks into the read path.
Where is staleness acceptable? Search can run a minute behind, because the booking path rechecks the source of truth and the worst case is a polite sold-out message at checkout. The inventory database itself is never read through the cache for a write decision, which keeps the staleness contract one-sided and harmless.

References

Xu and Lam, System Design Interview, Volume 2 (2022), chapter on hotel reservation systems.
Kleppmann, Designing Data-Intensive Applications (2017), on transactions, isolation, and weak isolation anomalies.
PostgreSQL documentation, Transaction Isolation, on what each isolation level actually guarantees.
Stripe, Designing robust and predictable APIs with idempotency (2017).