Design a unique ID generator

Systems design · Distributed building blocks · Feb 2025

Every record a system creates needs a name, so every order, message, photo, and event gets an identifier that nothing else will ever share. On one machine this is a solved problem, because a single integer column that increments can never collide with itself. Across a fleet of machines creating millions of records, with no machine allowed to be a single point of failure and with downstream systems wanting IDs that sort roughly by creation time, it becomes a genuinely interesting design, which is why Twitter, Instagram, and Flickr each wrote engineering posts about how they did it and why interviewers keep asking. The question rewards exact arithmetic, since the whole design lives inside 64 bits, and it surfaces one of the classic distributed-systems puzzles, namely what to do when a machine's clock moves backwards.

Three parts carry most of the difficulty. The first is choosing among very different families of solutions, from random UUIDs to coordinated counters, each of which fails a different requirement, and the failure modes deserve to be named precisely. The second is budgeting bits so the ID fits in a native integer while still encoding time, place, and sequence. The third is the operational edge work around clock skew, worker identity, and burst rates beyond what one worker can mint, which is where the design either survives production or quietly issues a duplicate.

Scope and requirements

The functional requirement sounds like a single line, namely produce identifiers that are globally unique, but the qualifiers carry the design. IDs should be 64 bits, because they will be primary keys and foreign keys in databases and fields in network protocols, and a native integer is half the storage and a fraction of the comparison cost of a 128-bit value; on a table with billions of rows the difference is tens of gigabytes of index alone. They should be roughly sortable by creation time, so that ordering by ID approximates ordering by time, which lets feeds, pagination cursors, and B-tree indexes, the sorted tree structures databases use to find rows, all benefit without a separate timestamp column. And minting should require no coordination on the hot path, because a generator that must talk to a central service per ID has that service's availability and latency stapled to every write in the company, and an outage of the ID service would become an outage of everything.

Non-functionally, the system must mint at least 10,000 IDs per second across the fleet with room for bursts of ten times that, must keep working through the loss of any node, and should degrade predictably rather than ever issuing a duplicate. The asymmetry deserves emphasis, because a duplicate primary key is silent data corruption that surfaces weeks later as two customers sharing an order, while a brief pause in issuance is merely latency that retries absorb. Whenever the design faces a choice between pausing and guessing, it should pause.

Sizing the problem

Suppose the platform creates 10,000 records per second on average, about 860 million per day, with peaks of 100,000 per second during events. The design below gives every worker process the ability to mint 4,096 IDs per millisecond, which is 4 million per second per worker, so even the peak needs only a single worker in principle and a handful in practice for availability. The fleet-level ceiling with the full bit budget, 1,024 workers each minting 4,096 per millisecond, is about 4 billion IDs per second, so raw capacity is never the constraint. The constraint that actually binds is the timestamp's lifetime.

That lifetime is worth computing in front of the interviewer. With 41 bits of milliseconds, the counter spans 2^41 milliseconds, which is about 2.2 trillion milliseconds. Dividing by 1,000 gives 2.2 billion seconds, and dividing again by 31.5 million seconds per year gives roughly 69 years. Counting from the Unix epoch of 1970 would spend 55 of those years before the system ever launched, leaving barely 14, which is why generators define a custom epoch at deployment time, say January 1, 2025, and count milliseconds from there, buying the full 69 years forward. The epoch choice costs nothing at design time and cannot be revisited afterward, which makes it exactly the kind of decision to make deliberately and write down.

The interface

The generator is best shipped as a library embedded in each service, so minting an ID is a function call with no network hop, but a thin HTTP form is worth defining for clients that cannot embed it, scripts and third-party integrations among them. The batch endpoint matters because a caller that needs thousands of IDs, a bulk importer for instance, should not pay a round trip for each one when a single response can carry the whole allocation.

GET /v1/id
→ 200 { "id": 7269841635818741761 }

GET /v1/ids?count=500
→ 200 { "ids": [7269841635818741762, ..., 7269841635818742261] }

GET /v1/id/decode?id=7269841635818741761
→ 200 { "timestamp": "2025-02-11T08:14:02.113Z",
        "datacenter": 3, "worker": 17, "sequence": 1 }

The decode endpoint is nearly free to build and pays for itself in troubleshooting, since every ID carries its own creation time and birthplace, and an engineer staring at a misbehaving order can learn when and where it was created from the ID alone. It also serves as a reminder that these IDs leak timing information to anyone who can read them, a point the operations section returns to.

The design space

Four families of solutions come up, and walking each one against the requirements shows why the layered bit layout wins.

UUID version 4 is 128 bits of randomness, mintable anywhere with no coordination, and with so much entropy that collisions are not a practical concern, since the birthday arithmetic says a fleet would need to mint billions per second for decades before worrying. It fails two requirements rather than one, being twice the agreed width and completely unordered, and the lack of order hurts more than it sounds. A B-tree index clusters nearby keys on the same disk pages, so time-ordered keys append neatly to the rightmost page while random keys land on a different page every insert, splitting pages and evicting cache constantly, and teams that key large tables by UUIDv4 routinely measure several-fold slowdowns in insert throughput compared with sequential keys. Random IDs are the right answer when unguessability is the requirement, for password-reset tokens and similar secrets, and the wrong one here.

Database auto-increment delegates the problem to one database's counter. It is perfectly ordered and trivially correct, and it is also a single point of failure whose every failover risks either repeated or skipped ranges, while every ID minted in the company queues behind one machine's write throughput. The multi-master refinement gives each of k database servers an offset, so server i issues i, then i + k, then i + 2k and so on, which with two servers means one issues odd numbers and the other even, exactly how Flickr's ticket servers worked. That removes the single point of failure but hard-wires k into the arithmetic, and changing it later is painful, because growing from 2 servers to 3 means the existing odd and even sequences collide with any new modulus scheme, so the new server must start beyond every ID ever issued and the old servers must be re-stepped, a delicate migration for what should be an operational non-event. Time ordering across servers is also approximate at best, since the counters advance at independent rates and a busy server races ahead of a quiet one.

A ticket server centralizes the same pattern into one small service that hands out ranges or single values from a counter it persists. It is simple, and Flickr ran it for years, but the hot-path coordination requirement comes straight back, and making the ticket server highly available reintroduces the failover-duplication problem it was meant to hide. Handing out large ranges amortizes the round trips, since a worker that leases a block of 10,000 IDs touches the server once per block, and a range-leasing ticket server remains a respectable fallback wherever clocks cannot be trusted at all. There is, even so, a cleaner shape.

The Snowflake layout

Twitter's Snowflake, published in 2010, packs time, place, and a counter into one 64-bit integer, and the recommended design here is that layout essentially unchanged. The top bit stays zero so the number is positive in signed-integer systems, a small courtesy that avoids a whole class of integration surprises. The next 41 bits hold milliseconds since the custom epoch. Five bits name the datacenter and five the worker, allowing 32 datacenters of 32 workers each, 1,024 generator instances in total. The final 12 bits are a sequence number that counts IDs minted within the same millisecond on the same worker, wrapping at 2^12 = 4,096. Every field earns its position, because placing time in the high bits is what makes numeric order match creation order, and placing the worker bits above the sequence keeps each worker's output disjoint from every other worker's.

1 bit41 bits5 bits5 bits12 bits0timestamp (ms since custom epoch)datacenterworkersequence2^41 ms = about 69 years4,096 IDs per ms

The 64-bit budget allocates a zero sign bit, 41 bits of milliseconds from a custom epoch, 10 bits naming one of 1,024 workers, and a 12-bit per-millisecond sequence.

Because the timestamp occupies the high bits, numeric order is creation order down to the millisecond, with ties broken arbitrarily by worker and sequence, which satisfies the rough-sortability requirement. Because the worker bits make every generator's output disjoint by construction, no two workers can collide even when minting in the same millisecond, which removes coordination from the hot path entirely, and uniqueness becomes a property of the layout rather than of any agreement protocol. The packing itself is shifts and ORs, and the sequence logic fits in a dozen lines.

EPOCH = 1735689600000          # 2025-01-01 in ms

def next_id(self):
    now = clock_ms()
    if now < self.last_ts:
        raise ClockMovedBackwards(self.last_ts - now)
    if now == self.last_ts:
        self.seq = (self.seq + 1) & 0xFFF      # 12-bit wrap
        if self.seq == 0:                      # 4,096 used: wait
            now = spin_until_next_ms(self.last_ts)
    else:
        self.seq = 0
    self.last_ts = now
    return ((now - EPOCH) << 22) | (self.dc << 17) \
           | (self.worker << 12) | self.seq

The sequence-exhaustion branch is the built-in flow control, because a worker asked for its 4,097th ID in one millisecond spins until the next millisecond begins, capping each worker at 4.096 million IDs per second, far above any realistic single-process demand. From the caller's side that worst case is a stall of less than a millisecond, invisible inside any request budget. The other branch, the clock check, is the part interviewers most want to hear about.

When the clock moves backwards

The layout's correctness rests on the timestamp never repeating with a smaller sequence available, and real clocks violate that assumption. NTP, the network time protocol that keeps machine clocks aligned with reference servers, normally adjusts a fast clock by slewing, slowing it slightly until it is correct, but a badly drifted clock can be stepped backwards in one jump, and virtual machine migrations, leap-second handling, and operator mistakes do the same. A worker whose clock steps back two seconds and keeps minting will re-issue the same timestamp values with fresh sequences and produce duplicates of its own earlier IDs, and nothing downstream will notice until two rows claim the same key.

The defenses stack. The generator remembers the last timestamp it used and compares on every call, which makes backwards motion detectable in a single branch. For a small regression of tens of milliseconds, the worker simply waits it out, sleeping until the wall clock passes its remembered high-water mark, and callers see nothing but a slightly longer call. For a large one it must refuse to issue and alert, because waiting two hours is not viable and guessing is how duplicates happen; the operator who gets paged finds a worker that is safe but idle, repairs the clock, and watches the worker resume from its high-water mark with no cleanup required. Hosts should run NTP configured to slew rather than step wherever possible, and the generator can additionally lease its worker ID with the lease invalidated on restart, so a worker that crashes and comes back with a confused clock cannot resume its old identity until the coordination layer confirms the timestamp high-water mark. The plain summary to give in an interview is that the design trades a rare, bounded availability loss, a worker pausing during clock repair, for a guarantee of no duplicates, which is the right side of the trade for primary keys.

Worker identity and deployment

The ten place bits are an assumption that needs operational backing, because two workers configured with the same datacenter and worker number collide freely and silently. Static configuration works for small, stable fleets, with the datacenter baked into the host image and worker numbers assigned by deployment slot, and its failure mode is human, a copy-pasted configuration file, so a startup check that pings peers or registers in a shared store is cheap insurance. The more robust pattern uses a coordination service such as ZooKeeper or etcd, a small strongly consistent store designed for configuration and locks. On startup each generator instance acquires an ephemeral lease on an unused worker number, holds it with heartbeats for as long as it runs, and lets the lease expire when the instance dies, which returns the number to the pool automatically. The coordination store sits entirely off the hot path, touched once at startup and on lease renewal, so its availability gates only the startup of new workers and never the minting of IDs, and that dependency direction is what keeps the central service from becoming the very bottleneck the design set out to avoid.

Order serviceembeds generatorPayment serviceembeds generatorChat serviceembeds generatorZooKeeperleases worker IDsNTP time sourcedisciplines host clockslease worker IDclock sync, every host

Each service embeds the generator library and mints IDs locally, while ZooKeeper hands each instance a leased worker number at startup, off the hot path, and NTP keeps every host's clock disciplined.

A latency budget for minting

Pricing the mint itself is worthwhile, because the choice between the embedded library and the HTTP service is really a latency choice. In the embedded form, a call to next_id is a clock read, a comparison, a few shifts and ORs, and a return, costing well under a microsecond, and even the unlucky caller who hits sequence exhaustion waits at most one millisecond for the next tick. The HTTP form pays connection handling and serialization on top, putting a single mint at roughly a millisecond inside a datacenter, about a thousand times the embedded cost, which is why the library is the default and the service exists only for callers that cannot link it. Batching changes the arithmetic again, since a request for 500 IDs costs one round trip plus 500 sub-microsecond mints, around a millisecond in total, so the per-ID cost collapses to a couple of microseconds. A bulk importer that needs a million IDs should therefore lease them in blocks of thousands rather than calling one at a time, and publishing that guidance alongside the API saves the support conversation later. The fleet-level sanity check is that minting never touches disk, database, or lock, so the generator should never appear in a latency profile, and if it does, something is wrong with a clock.

Scaling, failures, and operations

Scaling is the pleasant part of this design because the hot path has no shared state, so adding capacity means starting more instances, each minting up to 4,096 IDs per millisecond independently, and the only finite resource is the pool of 1,024 worker numbers, which a fleet would have to grow very large to exhaust. If it ever did, the bit budget can be re-cut at design time, trading sequence bits for worker bits, though never after launch, because the meaning of existing IDs must stay fixed forever.

The failure analysis goes tier by tier. A generator instance dying takes nothing with it, since its worker number returns to the pool when the lease expires, and callers retry against another instance or, in the embedded-library form, are themselves the instance, so the generator dies only when its service does. ZooKeeper being briefly unavailable stops new instances from starting but does not affect running ones, which is the right dependency direction. The layout's one true global dependency is wall-clock time, which is why clock monitoring, alerting when any host's offset exceeds a few milliseconds, belongs in the runbook next to disk-full alerts. The subtle operational items concern meaning rather than uptime. The custom epoch and bit layout are forever-contracts that belong in a constants file with a warning comment. IDs expose creation time and rough volume to anyone who can read them, which may matter for competitive or privacy reasons and can be mitigated by encrypting IDs at the API boundary. And JavaScript numbers cannot represent 64-bit integers exactly, because the language stores all numbers as floating point with 53 bits of precision, so APIs serving web clients should transmit IDs as strings, a detail that has produced production incidents at companies that learned it late.

Follow-up questions

  • Why not UUIDv4 and call it done? It is 128 bits against a 64-bit requirement, and its randomness scatters B-tree inserts across pages, measurably slowing large tables. It remains the right tool when IDs must be unguessable or minted by untrusted clients, but not when they are primary keys at scale.
  • What if 4,096 IDs per millisecond is not enough for one worker? That is 4 million per second from one process, so the realistic answer is to run more workers rather than tune one. Within the layout, bits can shift from worker to sequence at design time, and after launch the layout is frozen because existing IDs must keep their meaning.
  • What exactly happens when the clock jumps back ten minutes? The generator detects it against its last-used timestamp, refuses to issue, and alerts, while small regressions of a few milliseconds it simply waits out. Issuing through a backwards jump is the one path to duplicates, so availability is sacrificed there deliberately.
  • How do two workers avoid getting the same worker ID? Either static assignment from deployment slots, verified at startup, or ephemeral leases in ZooKeeper or etcd whose expiry recycles dead workers' numbers. The lease path also gives a place to persist timestamp high-water marks across restarts, which closes the restart-with-a-bad-clock hole.
  • Are these IDs safe to expose publicly? They leak creation time and let observers estimate volume from the sequence and density of IDs, which is exactly how outside analysts have estimated companies' order volumes. If that matters, keep the internal ID and expose an encrypted or randomized public alias at the boundary.
  • How does this compare with Instagram's approach? Instagram packed a similar layout, 41 bits of time, 13 bits of logical shard, and 10 bits of per-shard sequence, inside Postgres itself, minting IDs in a stored procedure per shard, which trades the standalone library for leaning on a database they already ran. The arithmetic is the same and only the host differs.

References

  1. Twitter Engineering, Announcing Snowflake (2010), the original 64-bit layout.
  2. Flickr Engineering, Ticket Servers: Distributed Unique Primary Keys on the Cheap (2010), the multi-master offset design.
  3. Instagram Engineering, Sharding & IDs at Instagram (2012), a Snowflake-style layout inside Postgres.
  4. IETF, RFC 4122: A Universally Unique IDentifier (UUID) URN Namespace (2005).
  5. Xu, System Design Interview, Volume 1 (2020), chapter on unique ID generators.