Design a rideshare service

A rideshare service connects a rider who wants to go somewhere with a driver who is nearby and willing, then shepherds the two of them through pickup, the ride itself, drop-off, payment, and the receipt at the end. Interviewers like the question because it stacks three distinct hard problems on top of each other, beginning with a write-heavy firehose of driver locations that no ordinary database survives, continuing with a matching decision that must be fast and fair under concurrency, and finishing with a trip lifecycle that has to stay correct across two flaky mobile connections at once. A candidate who treats it as "nearby search plus a button" misses where the design actually lives, which is in dispatch correctness and in keeping the location torrent away from durable storage, and an interviewer is listening for exactly that recognition in the first few minutes.

The walkthrough below scopes the product, runs the numbers that shape it, and then spends its depth on the geo index, the matching and dispatch protocol, and the trip state machine, pausing at each stage to name the alternative that was on the table and to say why it loses.

Scope and requirements

Functionally, riders request a ride from a pickup point to a destination and see a fare estimate and a live ETA, while drivers go online, stream their position, receive ride offers, and accept or decline. The system matches each request to a suitable driver, tracks the trip through its stages, and shows the rider the driver approaching in real time. Pricing and payments deserve acknowledgment and a deliberately thin treatment, so surge pricing gets one design paragraph below, and payment processing happens after trip completion in a separate service, since nothing about charging a card belongs on the matching path or should be allowed to slow it down. Out of scope are pooled rides, food delivery, and driver onboarding, each of which is a real system in its own right that the interviewer will gladly park once it has been named.

Non-functionally, matching should complete within a few seconds because a rider staring at a spinner churns to the competing app on the next home-screen page, driver positions shown to riders should be fresh within a handful of seconds so the approaching car moves believably on the map, and the trip record must be durable and auditable, because money, taxes, and disputes all hang off it. The contrast between those last two requirements is worth stating plainly, because locations are high-volume and disposable while trips are low-volume and sacred, and the architecture should treat the two kinds of data completely differently rather than forcing both through one storage tier sized for the wrong one. Availability matters per city, since the product is useless if a whole metro cannot match, and a degraded match beats no match every time.

Sizing the problem

Drivers dominate the write load. Assume 5 million drivers online at peak, each phone reporting GPS every 4 seconds, which gives 5 million divided by 4, or 1.25 million location writes per second, sustained through every rush hour on the planet. The 4-second cadence is itself a design choice, frequent enough that the car icon a rider watches glides rather than teleports, yet sparse enough to spare driver batteries and data plans, and halving it to 2 seconds would double the hardest number in the system for a smoothness gain riders barely notice. Each report is tiny, carrying a driver ID, coordinates, heading, and status in well under 100 bytes, so the total live state is 5 million times 100 bytes, only 500 MB, and the entire fleet's position fits in the memory of one large machine. The challenge is therefore never capacity but purely the write rate and the cell queries running against it concurrently. Riders add a smaller stream, since only riders actively watching a trip need fine-grained updates.

Matching volume is small by comparison. At 20 million trips per day, the average is about 230 requests per second over 86,400 seconds, and even a peak factor of four only reaches roughly 1,000 matches per second, each of which triggers one candidate query, a dozen ETA computations, and a dispatch conversation. Trip records at 20 million per day and roughly 2 KB each, counting the event log, accumulate 40 GB per day, around 15 TB per year, which a sharded relational store handles without drama. The three orders of magnitude between location writes and match requests explain the whole architecture, because the firehose needs an exotic in-memory path while everything else can afford to be conventional, and conventional components are the ones operators already know how to run at 3 a.m.

The API

Riders and drivers speak mostly ordinary HTTPS, with offers and live positions pushed over a persistent connection or mobile push. The idempotency key on ride creation matters enough to return to later, and for now it is enough to say it is a client-generated token that makes retries safe on flaky networks, so a request that times out can be repeated without summoning a second car.

POST /v1/rides
Idempotency-Key: 7f3a9c41-...
{ "pickup":  { "lat": 37.7749, "lng": -122.4194 },
  "dropoff": { "lat": 37.8044, "lng": -122.2712 } }
→ 201 { "ride_id": 99182, "state": "requested", "quote": { "fare": 23.50, "surge": 1.2 } }

POST /v1/drivers/me/locations          (batched, every 4 s)
{ "points": [ { "lat": ..., "lng": ..., "ts": ..., "heading": 80 } ] }
→ 204

server → driver (push):  { "type": "offer", "ride_id": 99182,
                           "pickup_eta_min": 4, "expires_in_s": 12 }
POST /v1/offers/{offer_id}/accept      → 200 { "ride_id": 99182, "state": "matched" }
POST /v1/rides/{ride_id}/cancel        → 200

The data model

Trips live in a relational store as a current-state row plus an append-only event log, meaning a table that is only ever inserted into, never updated, so the full history of every trip survives for receipts, disputes, and support tooling. The relational choice is easy to defend here, since trips need transactions for state transitions, the volume is modest, and an auditor's questions map naturally onto SQL, whereas a document store would trade away transactional transitions for a schema flexibility this fixed lifecycle never uses. Driver live state, by deliberate design, appears nowhere in this schema, and that absence is the single most important line in the data model, because the moment driver positions acquire a durable table the firehose arithmetic from the sizing section lands on it and crushes it.

CREATE TABLE trips (
  trip_id      BIGINT PRIMARY KEY,
  rider_id     BIGINT NOT NULL,
  driver_id    BIGINT,
  state        TEXT NOT NULL,           -- requested | matched | arriving |
                                        -- in_progress | completed | canceled
  pickup_lat   DOUBLE PRECISION NOT NULL,
  pickup_lng   DOUBLE PRECISION NOT NULL,
  dropoff_lat  DOUBLE PRECISION NOT NULL,
  dropoff_lng  DOUBLE PRECISION NOT NULL,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE trip_events (
  trip_id     BIGINT NOT NULL,
  seq         INT NOT NULL,
  event       TEXT NOT NULL,            -- offer_sent, accepted, pickup, ...
  payload     JSONB,
  recorded_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (trip_id, seq)
);

The high-level architecture

Driver position reports flow through a location ingestion tier into an in-memory geosharded index, never touching a durable database on the hot path. The matching service queries that index for candidates, ranks them with the routing and ETA service, and hands the agreed match to the trip service, which owns the state machine and the durable record, so the disposable half of the system and the sacred half meet only at the moment a match is struck. Periodic snapshots of the geo index feed analytics offline, which gives the data science teams their supply heatmaps and demand forecasts without ever taxing the path a rider is waiting on, and the same snapshots double as a warm-start source when an index shard restarts.

Driver GPS flows into the in-memory geo index and never touches durable storage on the hot path, while matching queries cells, ranks by ETA, and the trip service records the durable outcome. Dashed paths are pushed offers and periodic snapshots.

The location firehose and the geo index

At 1.25 million writes per second, the design question nearly answers itself, because writing each report to a durable database would mean a write-ahead log, replication, and disk amplification for data that is overwritten 4 seconds later and worthless within a minute, so driver positions live in memory. Each report upserts a single entry keyed by driver ID, holding coordinates, heading, status, and timestamp, and the index exists to answer one question quickly, namely which available drivers currently sit inside a given set of cells.

Cells come from a hierarchical spatial grid such as geohash, which interleaves the bits of latitude and longitude into a sortable string so that a shared prefix names a rectangle of the map, or H3, Uber's hexagon-based equivalent whose cells have the convenient property that all neighbors sit at similar distances, removing the corner-versus-edge distortion squares suffer. The index maintains, per cell at a chosen resolution around a kilometer across, the set of driver IDs currently inside it, and a driver crossing a cell boundary costs a cheap remove-and-add. Sharding follows geography rather than a hash, so each shard owns the cells of a city or a contiguous group of cells, which keeps a matching query, whose cells are always close together, on a single shard, and the worst skew stays bounded because even the busiest metro's drivers, say 100,000 of them producing 25,000 writes per second, fit comfortably on one shard with replicas for reads. Durability is deliberately absent from this path. What analytics and recovery genuinely need is provided by snapshots, a periodic dump of each shard's state to object storage every few seconds to minutes, and the recovery story stays pleasingly short, since a restarted shard begins empty or from a recent snapshot and refills to full accuracy within one 4-second reporting cycle, which is faster than any database restore could run.

Matching and the dispatch protocol

When a request arrives, the matching service covers the pickup point's neighborhood, meaning the cell containing it plus the ring of neighbors around it, queries the geo index shard for available drivers there, and widens the ring when the result is thin. Ranking the candidates is where a tempting shortcut produces visibly worse matches, because straight-line distance ignores the road network, so the service instead asks the routing and ETA layer for a predicted pickup time per candidate. The worked example to give in an interview is a river city. A driver 600 meters away across the water is 9 minutes from the rider via the nearest bridge, while a driver 1.4 kilometers away on the rider's own bank is 4 minutes out, and ranking by meters dispatches the wrong car every single time the river is in the picture. A dozen ETA lookups per match at 1,000 matches per second is 12,000 ETA queries per second, which a routing layer built on precomputed road-network hierarchies serves without strain, so the better metric costs nothing the system cannot afford.

Dispatch is a conversation rather than an assignment, because drivers decline rides and the protocol has to expect that from the first message. The service offers the ride to the best-ranked driver with a hard timeout of 10 to 15 seconds, and a decline or an expiry moves the offer down the list to the next candidate. Concurrency creates the hazard worth naming out loud. Two riders request at once, both matchers see the same nearest driver, both send offers, and the driver's phone shows one while the system believes whichever acceptance lands first, leaving the other rider matched to nobody and watching a car that never comes. The fix is an exclusive lease, a short-lived lock on the driver's dispatch state that exactly one matcher can hold, implemented as an atomic set-if-absent with expiry in a store like Redis, so a matcher acquires the lease before sending the offer, the lease lifetime slightly exceeds the offer timeout, and a competing matcher that fails the acquire simply skips to its next candidate without waiting. If the driver accepts, the accept validates against the held lease and the trip transitions to matched, and if the offer dies, the lease expires on its own and the driver returns to the pool with no cleanup required, which is the property that makes leases safer than locks, because a lock that must be explicitly released by a process that may have crashed eventually strands a driver, while an expiring lease cannot.

The request arrives (1) and candidates come back from the nearby cells (2), each ranked by predicted pickup ETA (3). The matcher takes the exclusive lease on driver A and sends the offer with a timeout (4), the accept validates against the held lease (5), and on a decline or expiry the lease lapses and the offer moves to driver B (6).

The trip state machine and idempotency

A trip moves through requested, matched, en route to pickup, in progress, completed, or canceled, and the trip service is the single owner of those transitions, validating every one server-side because mobile clients send duplicates, deliver events out of order, and carry clocks that cannot be trusted. An accept arriving for an already-canceled trip returns a clean conflict rather than corrupting state, a second start-trip event is ignored because the state already moved past it, and every transition appends to the event log, which lets receipts, fare calculation, dispute resolution, and support tooling all read from one authoritative history instead of four competing reconstructions of it. Idempotency keys complete the correctness story at the edges, since the rider's create-ride call carries a client-generated token, and when the network drops the response and the app retries, the service recognizes the token and returns the existing ride instead of summoning two cars to the same curb, with the same pattern protecting accepts and cancels.

Surge pricing fits in one deliberately compact paragraph of design. Per cell and per sliding window of a few minutes, the system compares open requests against available drivers, and when demand outruns supply it raises a price multiplier for that cell, smoothed over time and capped so prices move in steps rather than jumps, with the multiplier shown to the rider before they commit. The mechanism rides entirely on data the geo index and matcher already produce, which is why it stays thin here, while the fairness and incentive questions around it belong to product and policy more than to architecture.

A latency budget from request to offer

The few-seconds matching promise deserves the same itemization as any other budget. The rider's request crosses the network and the gateway in perhaps 60 milliseconds, the matching service covers the pickup cells and queries the geo index in a couple of milliseconds because the lookup is a set read against one in-memory shard, and the ETA ranking of a dozen candidates runs as parallel calls that return inside 30 milliseconds or so from the precomputed routing hierarchy. Lease acquisition adds one round trip to the lease store at a millisecond or two, pushing the offer down the driver's persistent connection takes tens of milliseconds, and everything the machines do therefore fits inside a quarter of a second. The remaining seconds belong to the human, since a driver glancing at a phone mounted on the dash needs 5 to 15 seconds to decide, which is why the offer timeout, not any computational stage, is the unit this budget is really made of.

Stacking those units tells the operator what the percentiles will look like before any traffic arrives. A first-offer accept lands the rider a match well under 20 seconds including the human deliberation, one decline followed by an accept stretches the wait toward 30 or 40, and the product smooths the difference by showing contacting-drivers progress rather than a silent spinner, because perceived latency is part of the budget too. When supply runs thin the matcher widens its search ring between offers, trading a longer pickup for a faster match, and the point worth defending in an interview is that each widening is a step in a policy the operators chose in advance, so the system degrades along a curve rather than improvising under pressure.

Scaling, failures, and operations

The system scales and fails along city lines, which is its best structural property. Geo index shards, matchers, and surge computation are all scoped to metro-area cells, so a shard failure degrades one city while the rest of the world matches on, and capacity planning follows commute peaks per region rather than one global curve. Location ingestion is stateless and scales horizontally, the geo index scales by splitting hot cities across more shards, the trip store shards by trip ID and replicates, with its modest write rate making that routine, and the routing layer is read-only at serving time and replicates per region, so every tier grows along the axis its own load actually arrives on.

Degraded modes deserve explicit design rather than improvisation, and each one carries a specific user experience. When candidates are scarce the matcher widens the search radius and lengthens offer timeouts, so the rider waits a little longer and may walk a little further to meet the car. When the ETA service slows or fails, ranking falls back to cached ETAs or to straight-line distance with a penalty for water and highway crossings, accepting worse matches over no matches, and the river-city rider may get the bridge driver for a few minutes until the service recovers. When the lease store is unavailable, matching serializes per driver through the index shard as a slower but still correct fallback. A restarted geo index shard refills from driver reports within seconds, a lost offer resolves itself through lease expiry, and a crashed matcher's in-flight requests are retried idempotently by the riders' clients without anyone noticing. Payments run after completion in a separate service fed by the trip event log, so a payment provider outage delays charging without ever blocking a single match. The dashboards that matter are match latency, offer acceptance rate, time-to-pickup error against prediction, geo index write lag, and lease contention, and that last one earns its place because rising contention is the earliest sign that several matchers are chasing the same thin supply.

Follow-up questions

Why not store driver locations in a durable database? The data is overwritten every 4 seconds and worthless within a minute, so durability buys nothing, and 1.25 million durable writes per second is enormous cost for that nothing. Memory plus periodic snapshots serves both the hot path and analytics, and a restarted shard refills from live reports faster than any restore would run.
Why rank by ETA instead of distance? Roads, rivers, and one-way systems make straight-line distance a poor proxy for arrival time, since the driver 600 meters away across a river can be 9 minutes out while one 1.4 kilometers away on the right bank is 4 minutes out. Pickup ETA is the quantity riders actually experience, and the precomputed routing layer makes it affordable at matching volume.
How exactly does the lease prevent double dispatch? A matcher must atomically acquire the set-if-absent lease on a driver before offering, so a concurrent matcher fails the acquire and moves to its next candidate instead of sending a second offer. Because the lease expires on its own slightly after the offer timeout, a crashed matcher never strands a driver, which is the advantage leases hold over locks that need explicit release.
What happens when a driver's phone goes offline mid-trip? The trip state machine holds the last durable state, location updates stop, and the rider sees a stale-position indicator rather than a car frozen mid-street with no explanation. On reconnect the driver app resyncs state and uploads its buffered events, which the server applies idempotently and in order, so the trip history ends up complete even though it arrived late.
How would you handle a city-wide event like a stadium emptying? Surge incentives concentrate supply in the affected cells, and the matcher widens radii and tolerates longer pickups while the imbalance lasts. Because cells are the unit of computation, the hot spot stays contained to its shard and neighboring cells absorb the spillover rather than the whole region slowing down.
What breaks first at 10x? The geo index write path goes first, and it scales by splitting cells across more shards, since location writes partition perfectly by geography. Matching, trips, and payments grow linearly and much later, because their per-event rates sit orders of magnitude below the location firehose.

References

Uber Engineering, H3: Uber's Hexagonal Hierarchical Spatial Index (2018).
Redis, SET command documentation, on NX and expiry options used for leases.
Amazon Builders' Library, Leader election in distributed systems, on leases and expiry-based safety.
Kleppmann, Designing Data-Intensive Applications (2017), on partitioning, event logs, and idempotence.