Design Google Drive · Stanley Jacob

Google Drive, like Dropbox and OneDrive, is the product where a folder on your laptop, the app on your phone, and a web page all show the same files, and a change made on any of them appears on the rest within seconds. The design question turns out to hinge on a single decision made early and exploited everywhere, which is that files are not stored as files. Content is split into fixed-size blocks named by their own hashes, metadata describes how blocks compose into files and versions, and once that split exists, deduplication, delta sync, versioning, and fast restores all fall out of it almost for free. Interviewers ask the question to see whether a candidate finds that move in the first ten minutes or spends the hour gold-plating an FTP server. The hard parts live in the sync protocol that keeps devices convergent over unreliable networks, and in the conflict story when two offline devices edit the same file, because a sync product that ever loses an edit has failed at the only thing it exists to do. I will walk the design the way I would narrate it in a thirty minute interview, establishing the block decision early and then showing how each feature cashes it in.

Scope and requirements

The scope I would agree on covers upload and download of arbitrary files, automatic sync across a user's devices, folder sharing with permissions, link sharing, version history, and a trash with restore. Real-time collaborative editing of document content, which is the Google Docs problem, is a different system built on operational transforms or CRDTs, the data structures that merge concurrent edits, and I would scope it out explicitly while noting that the file layer below it is exactly what we are designing. Non-functionally, sync should propagate a small edit in a few seconds, and no acknowledged write may ever be lost, since this product is where people keep the only copy of their dissertation, which makes durability the requirement everything else bends around. The client must also behave well on bad networks, resuming interrupted transfers, tolerating offline periods gracefully, and never saturating a home uplink without permission, because a sync client that makes video calls stutter gets uninstalled. Consistency between devices can relax for a few seconds, but the server's record of a file version must be atomic, so a new version either exists in full or does not exist at all, and no reader ever observes half a commit.

Sizing the problem

Take 500 million registered users averaging 10 GB stored each, which multiplies to 5 exabytes of logical data, the kind of number that makes naive replication tripling unaffordable before any optimization is discussed. Two levers control the bill. Deduplication exploits the fact that humanity stores the same bytes over and over, in the same installers, the same forwarded PDFs, and the same memes, and block-level content addressing collapses every identical block to one stored copy, with measured ratios on consumer corpora commonly saving several tens of percent of raw capacity. Cold tiering exploits access skew, since most files are written once and rarely opened again, so blocks untouched for months migrate to cheaper erasure-coded storage, the scheme that splits data into fragments with parity so it survives disk failures at lower overhead than full copies, and a later read pays a small extra latency that a rarely opened file can afford. Traffic, by contrast, is gentle. If 10 percent of users are active daily and each syncs 100 MB, the fleet moves 5 PB a day, an average near 60 GB per second, which a horizontally scaled service tier absorbs without drama, so the hard byte problem in this design lives in storage economics rather than request throughput, and that conclusion shapes where the engineering effort goes.

The block interface

The client and server speak in blocks, not files. Content is chunked into fixed 4 MB blocks, and each block is named by its hash, the fixed-length fingerprint a cryptographic hash function such as SHA-256 computes from the block's bytes, with the property that identical content always yields the identical name and different content effectively never collides. Content addressing makes the protocol idempotent and the dedup automatic at the same time, because the client never asks the server to store something again, it asks which of these blocks the server does not have, and any block the server already holds, whether from this user or a stranger, costs nothing to reference. The choice of 4 MB is a balance, since smaller blocks find more duplication and produce more per-block overhead in metadata and round trips, while larger blocks amortize overhead and miss savings, and 4 MB sits where consumer file sizes make the trade comfortable.

POST /api/v1/blocks/diff
{ "hashes": ["9f31c2...", "a07be4...", "d2188a..."] }
→ 200 { "missing": ["d2188a..."] }       // server already holds the others

PUT /api/v1/blocks/d2188a...
Content-Length: 4194304
→ 201

POST /api/v1/files/f_8812/commit
{ "base_version": 6,
  "blocks": ["9f31c2...", "a07be4...", "d2188a..."], "size": 12582912 }
→ 201 { "version": 7 }
→ 409 { "error": "conflict", "head": 8 }  // someone committed version 7 first

The commit is the atomic step of the protocol, submitting the ordered block list that constitutes the new version, conditional on the version the client last saw, so two racing commits cannot silently interleave their blocks into a file neither device ever held. The loser of the race receives a 409 and handles the conflict deliberately, which is the subject of the sync section, and the deliberateness is the point, because the protocol is shaped so that nothing destructive can ever happen by default.

The data model

Metadata is small, relational in shape, and strongly consistent, so it lives in a sharded SQL database partitioned by namespace, meaning a user's tree or a shared folder's tree, and the partition choice matters because it keeps every ordinary operation on one shard where transactions are cheap. Blocks are immutable bytes in object storage keyed by hash, with reference counts so that shared files and historical versions can pin them against deletion.

CREATE TABLE files (
  file_id      BIGINT PRIMARY KEY,
  namespace_id BIGINT NOT NULL,        -- shard key: user or shared folder
  parent_id    BIGINT,                 -- folder tree
  name         TEXT   NOT NULL,
  head_version INT    NOT NULL,
  trashed_at   TIMESTAMPTZ
);

CREATE TABLE file_versions (
  file_id     BIGINT,
  version     INT,
  size_bytes  BIGINT,
  block_list  CHAR(64)[],              -- ordered block hashes
  created_at  TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (file_id, version)
);

CREATE TABLE acl_entries (
  namespace_id BIGINT,
  principal_id BIGINT,                 -- user or group
  role         TEXT,                   -- viewer / editor / owner
  PRIMARY KEY (namespace_id, principal_id)
);

It is worth pausing on what a version actually is in this model, namely a row holding an ordered list of block hashes and nothing more. Successive versions share every block they did not change, so version history costs only the deltas rather than full copies, and restoring an old version writes one metadata row pointing at blocks that never went anywhere. A feature that sounds expensive, unlimited version history on a 1 GB file, turns out to cost a few kilobytes of hashes per save plus only the genuinely new blocks, which is the first big dividend of the block decision and far from the last.

The high-level architecture

Four services carry the product. The block service fronts object storage, handling diff queries and block transfer, and holds no state of its own. The metadata service owns trees, versions, and permissions, and is deliberately the only writer to the metadata database, because funneling every mutation through one service is what makes the atomic commit enforceable. The notification service keeps a lightweight channel to every online device, using long polling, where the client parks a request that the server answers when something changes, or a platform push channel on mobile, and it is used purely to say that something changed in a namespace, with the actual changes always pulled from the metadata service. Keeping notifications content-free is a deliberate simplification, since a dropped notification then costs staleness rather than correctness, and the puller always learns the full truth from the source. Clients run a local watcher and a local database of synced state, which is what lets them work offline and reconcile later.

Device A uploads only the blocks the server lacks and then commits a version through the metadata service, after which the notification service wakes the user's other devices along the dashed path, and they pull the delta and fetch just the new blocks.

Dedup and delta sync, with the arithmetic

Content addressing pays twice, and each payment deserves its own numbers. Across users, when a popular 4 MB installer block sits in 10,000 accounts, naive storage holds 40 GB of it while the block store holds 4 MB plus 10,000 small references, and that saving applies to every commonly shared file on the service without anyone coordinating anything, which is what makes it scale. Within a single user, dedup becomes delta sync, defined as transferring only the parts of a file that changed rather than the whole file. Consider a 1 GB design file, which is 256 blocks of 4 MB. The user nudges one layer and saves, the client watcher notices the file changed, rehashes it, and finds 255 hashes identical with one new, so the diff request comes back naming a single missing block and the sync uploads 4 MB instead of 1 GB, a 256-fold reduction. On a 10 Mbps home uplink that is the difference between roughly 3 seconds and 14 minutes, which is to say the difference between sync the user never thinks about and sync the user turns off. Every other device then repeats the saving in the download direction, fetching one block and splicing it into its local copy in place.

The real limit of fixed-size blocks shows up on insertion, because adding ten bytes at the front of a file shifts every subsequent byte, so all 256 block boundaries land on different content, every hash changes, and the delta degenerates into a full upload. Content-defined chunking is the known fix, where block boundaries are chosen by a rolling hash of the content itself in the style of rsync, so boundaries travel with the data and an insertion disturbs only its neighborhood, and the cost is variable block sizes and a meaningfully more complex protocol. A consumer service can reasonably ship fixed blocks first and say so out loud, because the dominant consumer workloads, photos and videos that never change after creation and documents that editors rewrite in place, rarely hit the bad case, and the protocol can adopt content-defined chunking later without changing the architecture around it.

The sync protocol, step by step

Sync is a loop that converges three replicas of the truth, namely the local database, the local files on disk, and the server's metadata, and it must survive interruption at any step, since laptops sleep and trains enter tunnels without consulting the protocol. Every step is therefore idempotent, so re-running any of them after an interruption never makes things worse, and the client can simply restart the loop from its local record without negotiating about where it died.

The watcher on device A detects the edit, rehashes the file, and sends the block hashes to the diff endpoint (1), uploads only the blocks the server reported missing (2), and commits version N+1 with the ordered block list, conditional on N (3). The metadata service publishes the change (4), the notification channel wakes device B (5), which fetches the new version's metadata (6) and downloads just the changed blocks, splicing them into its local copy (7).

The conditional commit in step 3 is where conflicts surface, so the conflict story belongs here. Suppose devices A and B both edited the file while offline, both starting from version 6. A reconnects first and commits version 7, and when B's commit conditional on 6 arrives, the server refuses it with a 409. Last-writer-wins, where B's commit would simply replace A's, is the tempting answer and the wrong one for a consumer product, because it silently destroys A's work, and silent data loss is the one unforgivable failure in a product holding people's only copies. The safe resolution keeps both, with the client renaming its local version into a new file, the familiar conflicted copy with the device name in parentheses, committing it as a sibling, and letting the human merge, because the server cannot meaningfully merge a spreadsheet and should not pretend to. Both versions are preserved, every device converges on both files, and the worst outcome is mild clutter, which is the correct price for never having to apologize for a vanished edit.

What a save feels like, with a budget

Putting milliseconds on the loop shows why the product feels instant. The watcher learns of the save from the operating system within tens of milliseconds, and rehashing a 1 GB file costs a second or two of background CPU, though for the common small-document case it is a few milliseconds. The diff round trip is one small request, perhaps 50 milliseconds, the single 4 MB block uploads in about 3 seconds on a 10 Mbps uplink, and the commit is one more small round trip against the metadata shard. The notification reaches the other device within a few hundred milliseconds of the commit, and that device's metadata pull and single-block download mirror the sender's costs on its own connection. End to end, a small edit appears on the second device in two to five seconds, dominated entirely by the slower of the two home connections rather than by anything the service does, which is the right place for the time to go and matches the propagation requirement set at the start.

The budget also explains the failure experience. If the uplink dies after the block upload but before the commit, the server holds an orphaned block and no new version, so the world is unchanged, the client retries the commit when the network returns, and the only cost is invisible. If the notification is lost, device B learns of the change on its lazy backstop poll a few minutes later, so the failure converts speed into staleness without ever risking divergence. Every failure in the loop lands in one of those two buckets, either retry-and-nothing-happened or learn-it-a-little-late, and arranging for only those two buckets to exist is most of what the protocol design was for.

Sharing, versioning, and trash

Sharing is governed by ACLs, access control lists, which are the per-resource tables mapping each user or group to a role such as viewer or editor. ACLs attach at the folder level and apply to the subtree, and a shared folder becomes its own namespace, which matters for sharding because all the members' operations route to one partition and stay transactionally simple, so two editors in the same shared folder get the same conditional-commit protection as two devices of one user. The notification fanout generalizes without new machinery, since a commit in a shared namespace wakes every member's devices rather than just the committer's. Link sharing mints a scoped token, an unguessable random string embedded in a URL that grants exactly one capability, view or edit, on one subtree, and revocation is server-side at any time by deleting the token row, which beats trying to claw back a password-like secret that has already been forwarded.

Versioning and trash are nearly free given immutable blocks, and spelling out why is worth the breath, because old versions are just rows whose blocks still exist, so restoring a version or rescuing a file from trash writes metadata and moves no bytes at all, even for a 50 GB folder. The genuine cost surfaces as retention policy, since pinned blocks cannot be reclaimed, so a consumer tier might keep 30 days of versions and trash, after which a garbage collector decrements reference counts and deletes blocks that reach zero. That collector is the most dangerous process in the system, being the only thing that destroys data on purpose, so it runs with delays, soft-delete grace periods, and rate limits, on the theory that reclaiming space a week late is invisible while reclaiming a live block even once is catastrophic and undetectable until a user opens the file.

The client realities round out the design, and they matter because the client is half the system. Small files dominate counts while large files dominate bytes, so the client batches tiny files into grouped commits to amortize round trips. It caps its bandwidth by default and backs off when the network is busy, because being a polite houseguest on the home connection is a survival requirement for software that runs forever. It verifies every block end to end by rehashing after download, so a corrupt byte introduced anywhere in transit or storage is detected and the block refetched, and it keeps hashing and chunking off the foreground threads so a laptop on battery is not visibly paying for a sync the user never asked to watch.

Scaling, failures, and operations

The block service and notification tier are stateless and scale horizontally, and object storage scales by hash space, which makes the exabyte the easy part of this system. The metadata database is the tier that needs care, sharded by namespace so a user's operations stay single-shard, with shared folders placed by their own namespace ID and read replicas absorbing the listing-heavy read mix that file browsers generate. The notification service holds millions of parked long-poll connections, which is a connection-count problem of exactly the shape a chat system's socket tier has, and it scales the same way, by adding nodes behind a registry that maps users to the node holding their channel.

The failure stories all resolve into the two safe buckets the protocol allows. A client that dies mid-upload retries its idempotent block PUTs and recommits, while the orphaned blocks from abandoned uploads wait for the garbage collector's grace period. A metadata shard failover pauses commits for its namespaces for the few seconds a replica takes to promote, and clients retry transparently, so the user sees a sync icon spin slightly longer. A lost notification wakes nobody, which is exactly why clients also poll lazily every few minutes as a backstop, trading bounded staleness for certainty of convergence. Object storage durability rests on replication or erasure coding underneath, and the end-to-end block hashes mean even a silently corrupted replica is caught at read time rather than trusted. The operations dashboard watches sync convergence time end to end, commit conflict rate, dedup ratio, garbage collector backlog, and notification delivery lag, because each of those is one of the product's promises restated as a number, and a drift in any of them is a promise quietly weakening before any user has noticed.

Follow-up questions

Why split metadata from content at all? The two have opposite shapes, with metadata being small, hot, and in need of transactions, while content is huge, immutable, and cold. Splitting them lets a SQL database and object storage each do exactly what they are good at, and the block hashes serve as the join key between the worlds.
Why 4 MB blocks instead of whole-file storage? Blocks are what make dedup and delta sync possible, since identical blocks across files and users store once, and editing one block of a 1 GB file transfers 4 MB instead of the whole gigabyte. Whole-file storage re-uploads and re-stores everything on every save, which users on home uplinks would feel immediately.
How does the system avoid losing data when two devices edit offline? Commits are conditional on the base version, so the second device's commit is refused rather than applied, and the client keeps both copies by committing its version as a conflicted-copy sibling, surfacing the disagreement for the human to merge instead of letting either side silently overwrite the other.
What makes restore from trash instant for a 50 GB folder? Nothing has to move, because blocks are immutable and were never deleted during the retention window, so the restore is a metadata operation that re-points rows at blocks still sitting exactly where they were.
When do fixed-size blocks perform badly? They degrade on insertions near the start of a file, which shift every later boundary, change all the block hashes, and force a full transfer. Content-defined chunking with a rolling hash repairs that case at the cost of variable block sizes and protocol complexity, and it can be adopted later without rearchitecting.
Why is the garbage collector treated so carefully? It is the only component that destroys bytes on purpose, and a reference-counting error there turns shared dedup into shared data loss across every account pinning the block, so it runs delayed, rate-limited, and behind soft-delete grace periods that make its mistakes recoverable.

References

Dropbox Engineering, Streaming File Synchronization (2014), on block-level sync in practice.
Dropbox Engineering, Inside the Magic Pocket (2016), on exabyte-scale block storage.
Tridgell and Mackerras, The rsync algorithm (1996), the original rolling-hash delta transfer paper.
Xu, System Design Interview, Volume 1 (2020), chapter on designing Google Drive.
Kleppmann, Designing Data-Intensive Applications (2017), on replication, transactions, and consistency.