Design S3-like object storage

Object storage is the system that holds bytes for everyone else. A client names a bucket and a key, uploads a blob of any size from a few bytes to a few terabytes, and gets the same bytes back later with a GET, and that small contract turns out to be the foundation under data lakes, photo services, backup products, machine learning training sets, and most of the modern web's static content, because nearly every other system would rather delegate the problem of keeping bytes alive for a decade than solve it again from scratch. Interviewers like the question because it forces a candidate to treat durability as a number rather than a slogan, and because the design splits naturally into a metadata problem and a data problem that scale in completely different ways, so a candidate who blends the two together reveals the confusion within the first ten minutes.

The hard parts are not where they first appear. Serving a GET is the easy half of the product, while promising that one hundred billion objects will still be readable in ten years, even as disks fail every day and the fleet is being repaired, rebalanced, and upgraded underneath, is the half that customers are actually paying for. The walkthrough below builds the system in the order I would present it across a thirty-minute interview, starting from requirements and sizing, moving through the two-plane architecture and the write and read paths, and finishing in the deep ends of redundancy, small objects, and metadata at scale.

Scope and requirements

Functionally, the service stores objects in buckets under a flat keyspace, which means there are no real directories anywhere in the system. A key like logs/2025/08/14/host7.gz is just a string, and the directory illusion comes entirely from listing keys by prefix. The core operations are PUT, GET, and DELETE on a single object, listing a bucket by prefix, and multipart upload, a protocol that lets a client push a 100 GB file as independent parallel parts and have the whole object appear atomically at the end. Versioning keeps older copies of a key retrievable after overwrites, lifecycle policies move aging data to cheaper storage automatically, and per-object metadata such as content type rounds out the surface. Just as important is what the service refuses to offer, since rename, append, partial overwrite, and locking are all deliberately absent. Each of those operations couples keys to one another and would drag coordination into paths that are otherwise independent, and the systems that have tried to be both an object store and a filesystem have historically ended up being a mediocre version of each.

The non-functional requirements carry the design. Durability comes first and deserves to be worked out as arithmetic rather than recited as a marketing line. The standard claim is eleven nines, 99.999999999 percent per object per year, and a loss probability of 10^-11 per object-year across a fleet of 100 billion objects gives an expected loss of about one object per year over everything everyone has ever stored, which is the scale of promise the rest of the design has to support. Availability matters but sits a tier lower, around four nines, and that asymmetry is deliberate, because a retryable 500 is an inconvenience the client SDK hides behind an automatic retry, while a lost object is a support ticket that can never be closed. Throughput must scale to many gigabytes per second of aggregate ingest and egress without per-customer interference, and the cost per stored terabyte ultimately decides whether anyone uses the service at all, which is the pressure that pushes the design toward erasure coding and dense disks rather than triple replication on fast media.

Sizing the problem

Assume the service holds 100 PB of logical data with an average object size of 1 MB. Dividing 10¹⁷ bytes by 10⁶ bytes per object gives 10¹¹ objects, one hundred billion of them, and that single division explains most of the architecture. A metadata record carrying a key of up to a kilobyte, a version, a size, a checksum, and a chunk list encodes to roughly 256 bytes, so the metadata alone is 100 billion times 256 bytes, which is about 25 TB before replication and about 75 TB once three copies are kept for safety. No single machine holds 75 TB of hot, constantly mutating state, so the metadata layer is itself a distributed storage problem with its own sharding, replication, and failure handling, and treating it as a footnote to the data path is the classic way to fail this interview.

Traffic submits to the same discipline. Suppose clients PUT 5,000 objects per second on average, which at 1 MB each is 5 GB/s of ingest and about 430 TB of new data per day, and suppose reads run ten to one over writes, which gives 50,000 GETs per second and around 20 GB/s of egress at peak. With the 1.5x redundancy overhead justified below, the fleet stores 150 PB raw, and at 20 TB per disk that is 7,500 disks, or about 250 storage nodes at 30 disks each. Those 7,500 spindles can stream more than 1 TB/s in aggregate when load spreads evenly, so raw sequential bandwidth is comfortable by a wide margin, and the constraints that actually pinch turn out to be per-key hotspots, repair traffic competing with foreground reads, and a metadata tier that must absorb on the order of 55,000 lookups and commits every second without becoming the bottleneck for either plane.

The API

The external interface is deliberately small, and it speaks HTTP because every language, proxy, cache, and command-line tool already does. A custom binary protocol would shave some framing overhead and lose the entire ecosystem of existing clients, which is a terrible trade for a service whose value grows with the number of things that can talk to it. Every write returns an etag, a server-computed fingerprint of the stored bytes, so a client can confirm an upload landed intact and a later conditional request can skip refetching unchanged data. Range reads matter just as much as whole-object reads, since video players and database engines routinely want a narrow slice of an enormous object. Multipart upload is the one piece with real protocol shape, in which the client initiates a session, uploads parts independently and in parallel, and then commits the ordered list of parts in a single call, so a 100 GB object becomes visible atomically or not at all:

PUT /photos/cats/whiskers.jpg          ← body is the object bytes
→ 200 { "etag": "9b2cf535..." }

GET /photos/cats/whiskers.jpg          → 200 + bytes (Range supported)
DELETE /photos/cats/whiskers.jpg       → 204

GET /photos?prefix=cats/&max-keys=1000 → 200 { "keys": [...], "next": "..." }

POST /videos/launch.mp4?uploads        → 200 { "uploadId": "Vqx91" }
PUT  /videos/launch.mp4?partNumber=3&uploadId=Vqx91   ← one 100 MB part
POST /videos/launch.mp4?uploadId=Vqx91 ← body lists part etags, commit

The data model

Inside the metadata service, two logical tables do the work. The object table maps a fully qualified key to the object's current version, size, checksum, and the ordered layout of its data, while the chunk table maps each chunk to the physical nodes holding it. Writing the sketch as SQL keeps it concrete even though the real store is a purpose-built sharded system:

CREATE TABLE objects (
  bucket      VARCHAR(64)  NOT NULL,
  key         VARCHAR(1024) NOT NULL,
  version     BIGINT       NOT NULL,   -- newest wins; old rows kept if versioned
  size        BIGINT       NOT NULL,
  checksum    BYTEA        NOT NULL,   -- whole-object CRC
  chunk_ids   BIGINT[]     NOT NULL,   -- ordered layout of the bytes
  deleted     BOOLEAN      NOT NULL DEFAULT FALSE,  -- delete marker
  PRIMARY KEY (bucket, key, version)
);

CREATE TABLE chunks (
  chunk_id    BIGINT PRIMARY KEY,
  stripe_id   BIGINT,                  -- erasure-coding group, if coded
  nodes       INT[]  NOT NULL,         -- data nodes holding this chunk
  checksum    BYTEA  NOT NULL
);

The separation matters because the two tables answer different questions at different rates. Every user request touches the object table, while the chunk table is also walked continuously by repair and garbage collection, and keeping physical placement out of the object row means the system can shuffle bytes between nodes during rebalancing without ever rewriting user-visible metadata. The alternative of one wide row holding both logical and physical truth sounds simpler, and it loses because every background data movement would then contend with foreground traffic on the same records, which is precisely the coupling that turns routine maintenance into customer-visible latency.

The high-level architecture

The system splits into a control plane and a data plane. A stateless API tier authenticates requests and orchestrates each operation, a metadata service owns the mapping from keys to chunk locations, a placement service decides which data nodes should receive new bytes while balancing capacity, load, and failure domains, and a large fleet of data nodes does nothing but store, serve, and verify chunks. The crucial property is that object bytes never flow through the metadata service and metadata never lives on the data path's disks, so each plane scales on its own terms and a surge in one cannot starve the other. The natural alternative is to let storage nodes own their own metadata the way a filesystem does, and it loses for reasons that compound, since every metadata operation would fan out across data nodes, a bucket listing would have to consult the whole fleet, and the small, hot, consistency-critical state would be welded to the large, cool, throughput-critical state it most needs to be independent of.

On the PUT path the API tier asks placement where new bytes belong, streams chunks straight to data nodes, and commits the key with its chunk locations in the metadata service, while a GET runs the same path in reverse, resolving the key first and then reading chunks directly from the fleet.

The write path and durability

A PUT begins at the API tier, which computes a checksum, a small fingerprint such as a CRC derived from the bytes that later reveals any corruption, while the body is still streaming in, so the data never needs a second pass. The placement service returns a set of target nodes chosen across failure domains, meaning groups of hardware that tend to fail together, so that no two copies of the same data share a rack, a power feed, or ideally a building. The API tier then streams the chunks to those nodes in parallel, each node persists its chunk and verifies the checksum before answering, and the write is acknowledged to the client once a quorum, a minimum count of successful copies such as two of three replicas, has confirmed, with the remaining copy completed in the background within seconds. The two alternatives bracket this choice and both lose. Acknowledging after a single copy would shave a little latency and leave a window in which one disk failure destroys the only copy of data the client was told is safe, which violates the entire premise of the product, while waiting for all three copies hands every customer's latency to the slowest node in the set, and in any set of three there is almost always one node having a bad moment, so the quorum buys nearly all the durability at a fraction of the tail latency.

The latency budget for a 1 MB PUT makes the path feel concrete. The placement decision comes from cached cluster state in well under a millisecond, streaming 1 MB to three nodes in parallel over 10 gigabit links costs about a millisecond of wire time, persisting each chunk to a flash write journal, a small fast log each node uses to make writes durable before they reach their final location on disk, takes another millisecond or two, and the metadata commit through a small consensus group adds perhaps two to five milliseconds, so the client sees an acknowledgment in roughly ten milliseconds at the median, with the quorum protecting the 99th percentile from the one slow disk. Only after the bytes are safe does the API tier commit the metadata row, and that ordering is the whole consistency story in miniature, because the metadata commit is the moment the object exists. If the writer dies before the commit, orphaned chunks sit on data nodes referenced by nothing, and a garbage collector reclaims them later, which is a far better failure mode than metadata pointing at bytes that were never fully written, since the first quietly wastes some disk for a while and the second serves a customer an error about data they were told was stored. What the client experiences in that failure is a timed-out PUT that the SDK retries, and because each retry writes fresh chunks and commits a fresh version, a retry can never half-overwrite anything. Checksums travel with the data forever, stored beside each chunk, verified on every read, and verified again by background scrubbing, so corruption is detected wherever it creeps in rather than at the one desperate moment a repair depends on the bytes being right.

The read path and a latency budget

A GET resolves the key against the metadata service, which returns the chunk list and the nodes holding each chunk, and then the API tier reads the bytes directly from the data fleet. The metadata lookup is a single read against one partition and costs one to two milliseconds, often less when the partition's recent keys sit in cache. For replicated data the API tier picks one replica, preferring the closest and least loaded node, and a 1 MB read costs a few milliseconds from flash or ten to fifteen from a spinning disk once the seek is counted, so the median whole-request latency lands near twenty milliseconds. The tail is managed with hedged requests, a technique where the API tier fires the same read at a second replica if the first has not answered within roughly the 95th percentile latency and takes whichever answer arrives first, which costs a few percent of duplicate reads and removes the worst of the tail caused by one busy disk or a node mid-deploy.

Erasure-coded data adds one wrinkle. The codes used in practice are systematic, meaning the original data chunks are stored as-is alongside the parity, so a normal read fetches the data chunks directly and pays no decoding cost at all. When a node is down, a read of an affected stripe degrades rather than fails, with the API tier fetching any eight of the surviving eleven chunks and reconstructing the missing piece, and since eight parallel 4 MB reads overlap and Reed-Solomon decoding runs at gigabytes per second on one core, the user experiences a read that is tens of milliseconds slower instead of an error page. A Range GET maps the requested byte interval onto chunk indexes and touches only the chunks that overlap it, so pulling the last megabyte of a terabyte object reads one stripe rather than a thousand. Operators watch the degraded-read rate closely, because it climbing is the earliest sign that repair is falling behind the failure rate, well before any durability number is actually threatened.

Replication against erasure coding

The simplest redundancy scheme is replication, where the system keeps three full copies of everything, survives any two losses, and pays three times the logical data in disks. At 100 PB logical that means 300 PB raw and 15,000 disks instead of 7,500, and since disks dominate the cost of the whole service, the pressure to do better is enormous. The better scheme is erasure coding, a technique that splits a block of data into k equal data chunks and computes m additional parity chunks using Reed-Solomon arithmetic, a kind of polynomial math over the bytes, with the property that any k chunks out of the k plus m total are enough to reconstruct the original block. With k equal to 8 and m equal to 4, a 32 MB block becomes eight 4 MB data chunks plus four 4 MB parity chunks spread across twelve different nodes, the storage overhead is 12 over 8, which is 1.5x, and the block survives the loss of any four chunks at once. Against replication that is both less storage, 1.5x instead of 3x, and more failure tolerance, four losses instead of two, and a reasonable interviewer will ask why anyone still replicates at all. Wider stripes sharpen the same question, since a 16 plus 4 scheme would cut overhead to 1.25x, and the reason to stop near 8 plus 4 is that every widening step demands more independent failure domains per stripe, makes every repair read more chunks, and ties each read's tail latency to more machines, so the overhead curve flattens while the operational costs keep climbing. The lunch starts looking expensive as soon as repair enters the picture.

When node n3 dies, the repair worker first reads any 8 of the 11 surviving chunks, two of those reads are drawn, and then recomputes the missing chunk and writes it to a spare node, restoring the stripe to full strength.

The price of erasure coding is repair amplification. When a replica is lost, replication heals by copying one chunk, but when a coded chunk is lost the repair worker must read k chunks, eight in this scheme, to reconstruct a single one, so healing a dead 20 TB disk means reading roughly eight times that, around 160 TB, from peer disks across the fleet. The saving grace is declustered placement, an arrangement where every stripe lives on a different set of twelve nodes, so the chunks of a dead disk have stripe partners spread across thousands of other disks and the rebuilt chunks can also be written to thousands of destinations rather than one. If a thousand disks each contribute 50 MB/s of spare bandwidth, the fleet repairs at 50 GB/s in aggregate, and 160 TB of reads divided by 50 GB/s is about 3,200 seconds, under an hour from failure back to full protection. The replication-era alternative of copying 20 TB into a single replacement disk at 150 MB/s would take about 133,000 seconds, a day and a half during which the affected data sits one failure closer to loss, which is exactly the exposure window the declustered approach removes. Repair traffic is throttled so it never starves customer reads, and the trade has a second face on the read path, since reads of degraded stripes pay the k-way reconstruction cost described above. That combination is why real deployments keep hot and small data on replication, where reads are cheap and repairs are simple, while the bulk of the corpus, large and cool, lives erasure coded where the storage savings compound quietly for years.

Small objects, big objects, and metadata at scale

Erasure coding a 10 KB object directly would be absurd, since splitting it into twelve chunks of under a kilobyte each multiplies per-chunk fixed costs and seek overhead past the size of the data itself, and billions of small objects also threaten the filesystem beneath each data node, where every individual file costs an inode lookup and a directory entry before a single payload byte is read. The fix, made famous by Facebook's Haystack photo store, is to pack small objects by appending them into large block files of around 1 GB while the data node keeps an in-memory index from object ID to offset and length within the block, so reading a small object becomes one memory lookup plus exactly one disk read. The index is cheap to hold, since an entry of roughly thirty bytes times ten million small objects on a node is about 300 MB of RAM, a modest price for never paying a filesystem traversal. The big blocks then become the unit of replication or erasure coding, which means the redundancy machinery never has to know that small objects exist. Deletes write a tombstone in the index rather than disturbing the block, and a compaction process later rewrites blocks whose dead space has crossed a threshold, copying the live objects into a fresh block and reclaiming the rest in bulk, which costs sequential bandwidth at a scheduled time instead of random updates all day.

Large objects flow in the opposite direction. A 100 GB upload arrives as multipart parts of perhaps 100 MB, each part is striped and stored independently the moment it lands, and the final commit writes a manifest, a small record listing the parts in order, which becomes the object's layout. Parts upload in parallel from the client, a part that fails over a flaky connection retries alone without restarting the other 99.9 GB, and a Range GET for bytes near the end of the object touches only the stripes that actually hold them. From the customer's side a multi-hour upload becomes restartable and parallel for free, which is most of why the multipart protocol exists at all.

The metadata service shards the object table by a hash of bucket and key, which spreads load evenly and makes single-key operations cheap, but hashing destroys order, and listing a bucket by prefix is fundamentally an ordered scan. Listing therefore needs an ordered secondary structure, namely a per-bucket key index in a range-partitioned store, one that splits data by sorted key ranges rather than by hashes, so that prefix=cats/ becomes a seek to cats/ followed by a sequential scan. Range-sharding the primary table instead would make listing local and free, and it loses because object keys arrive with highly skewed prefixes, timestamps and sequential names pile onto one partition while others idle, so the design gives each structure the job it is naturally good at and accepts the cost of maintaining two. That index is updated asynchronously from the object table, which is why listings in real systems may trail writes by a moment even when single-key reads do not. For single keys the service promises read-after-write consistency for new objects, meaning a GET issued after a successful PUT sees the object, and the mechanism is straightforward, because both the PUT's commit and the subsequent GET resolve through the same metadata partition, itself a small consensus group, a handful of replicas that agree on an ordered log of changes, so once the commit is durable every reader of that partition observes it.

Scaling, failures, and operations

Each tier scales by its own rule. The API tier is stateless and grows behind load balancers, so its scaling story is the shortest in the design. The metadata service grows by splitting hot partitions, and when a consensus leader dies its group elects a successor in a second or two, during which clients of that one partition see briefly elevated latency and every other partition notices nothing. Data nodes are the easy tier to grow and the interesting tier to operate, because at 7,500 disks with an annualized failure rate around 2 percent, the fleet loses roughly 150 disks per year, about three per week, and disk death has to be a routine the system metabolizes without a human in the loop. When a node misses heartbeats, the placement service marks its chunks under-replicated and feeds a prioritized repair queue in which stripes that have lost the most redundancy are healed first, since a stripe down to eight survivors out of twelve is one failure away from data loss while a stripe at eleven can wait hours without concern. The operator's view of all this is a re-protection dashboard, and the number that matters is the time from a failure to full redundancy, because durability math silently assumes repairs finish faster than the next correlated failure arrives.

Two background processes guard against quieter enemies. Scrubbing is the practice of continuously re-reading stored chunks and verifying their checksums, because disks suffer silent corruption, bit flips and misdirected writes for which no error is ever raised, and a corrupt chunk that is never read is a loss waiting to be discovered during a repair at the worst possible moment. A scrub pass over every byte each month costs a 20 TB disk about 8 MB/s of background reads, which is forgivably small next to its foreground work. Garbage collection walks the chunk table against the object table and reclaims chunks no live object version references, the debris of failed uploads, overwrites, and completed deletes, and it works as a mark-and-sweep pass over immutable snapshots rather than as reference counting, because distributed counters decremented from many places under failure are exactly the kind of mechanism that drifts, while a sweep recomputed from scratch is self-correcting. Deletes themselves are soft in a versioned bucket, with a DELETE writing a delete marker, a metadata row recording that the key now resolves to nothing, while older versions remain retrievable until lifecycle policy expires them, and lifecycle policy also drives tiering, rewriting objects untouched for months into colder, cheaper, more heavily coded storage. The eleven nines arithmetic deserves a closing dose of humility, because at these redundancy levels the dominant remaining risks are correlated failures and defects in the storage software itself, a placement error that stacks a stripe in one rack or a release that corrupts on write, which is why stripes span independent failure domains, why scrubbing never stops, and why careful operators roll storage code out more slowly than nearly anything else they run.

Follow-up questions

Why not build this on a distributed filesystem? The flat keyspace deliberately gives up hierarchical semantics, which means rename, locking, and partial overwrite never have to work across a hundred billion objects. Each of those operations couples keys together, and rename alone turns a one-key write into a multi-key transaction, so dropping them is what lets hash-sharded metadata, immutable chunks, and erasure coding stay simple at 10¹¹ objects. A filesystem keeps promises this workload never asked for and pays for them on every path.
When does replication beat erasure coding? Replication wins for hot data and for small data, because replicas serve reads from any copy with no reconstruction cost, repair by copying one chunk instead of reading eight, and add no coding latency on the write path. The economics flip as objects grow large and cool, where the 1.5x against 3x storage difference dominates the bill and reconstruction almost never runs because the data is rarely read at all.
How does listing work if metadata is hashed? It cannot be served from the hash shards, since a hash scatters adjacent keys by design. Each bucket therefore maintains an ordered key index in a range-partitioned store, where a prefix listing becomes a seek followed by a sequential scan, and because that index is maintained asynchronously from the object table, listings can briefly trail writes even while single-key reads stay read-after-write.
What happens when an entire datacenter is lost? Placement never puts more chunks of one stripe in a single facility than the stripe can afford to lose, so with 8 plus 4 coding spread across at least three sites, a full site loss costs at most four chunks per stripe. Every object stays readable through reconstruction, repair rebuilds from the survivors, and customers experience slower reads on affected stripes rather than missing data.
Why acknowledge a PUT on a quorum instead of all copies? The slowest of three nodes sets the latency when all must answer, and in practice one of any three is almost always slow, so waiting buys little. With two durable copies already sitting in separate failure domains and the third completing in the background within seconds, the window of reduced redundancy is brief enough that the durability cost is negligible against the tail latency saved.
What actually limits durability at eleven nines? Independent disk failures are not the limit, because the coding math absorbs those with room to spare. The remaining risk concentrates in correlated events and in the software itself, a placement defect that stacks a stripe in one rack or a release that corrupts data on write, which is why failure-domain-aware placement and continuous scrubbing are load-bearing parts of the design and why storage code ships behind the most cautious deploy process in the company.

References

Beaver et al., Finding a Needle in Haystack: Facebook's Photo Storage (OSDI 2010), on packing small objects into large append-only files.
Amazon Web Services, Amazon S3 data durability documentation, on the eleven nines design target.
Kleppmann, Designing Data-Intensive Applications (2017), on replication, partitioning, and storage engines.
Xu, System Design Interview, Volume 2 (2022), chapter on S3-like object storage.