Suppose there is a settlement on Mars with a few thousand people, and they need the boring things, by which I mean software updates, a package registry, messaging with Earth, shared documents, and an inventory system that both planets can read. The premise is playful, but the engineering deserves to be done straight, because every familiar assumption of distributed systems fails at once. There is no request-response, no health check, no retrying in five seconds, no quorum spanning the two sites, and for two weeks out of every twenty-six months there is no link at all. Interviewers who ask this are not testing space trivia, since none of the answer depends on knowing anything about rockets. They are testing whether a candidate can rebuild familiar designs from first principles when the speed of light becomes the dominant line item in every latency budget, and whether the candidate notices which of their habits were assumptions all along.
This question also stress-tests intellectual hygiene around consistency, because the usual hedge, the comfortable promise that strong consistency will be used where it is needed, stops being available when a single coordination round costs up to 44 minutes. What follows works through the physics, the networking layer that replaces TCP, what consistency can still mean across the gap, and how software actually gets shipped and operated when the other half of the system is between three and twenty-two light minutes away.
Scope and requirements
Functionally, the system must give the Mars settlement working local infrastructure, covering compute, storage, software distribution, identity, and the everyday applications that run on them, and it must synchronize with Earth along three streams, with content and software artifacts flowing outward, telemetry and science data flowing back, and shared application state, the messages, documents, and inventories, flowing both ways. Mars must remain fully operational with the link down, and that requirement deserves emphasis because it is not an edge case to be handled gracefully but a scheduled fact of orbital mechanics that arrives on the calendar like a holiday.
Non-functionally, the design goals invert the usual ranking, and saying so explicitly is half the answer. Availability of local services dominates everything, since a Mars outage cannot wait on Earth for a fix and a settlement's life-support tooling does not get to have a hard dependency on another planet. Interplanetary freshness is explicitly sacrificed, with minutes to hours of staleness accepted as the normal condition rather than a degraded one. Bandwidth efficiency matters the way it did in 1985, because every byte crossing deep space is contended, scheduled, and expensive, and the sync streams must tolerate the link vanishing mid-transfer without losing or duplicating anything, since a transfer that cannot survive an interruption will never finish at all on a link that is interrupted by design.
Sizing the problem, where the physics is the workload
Earth and Mars range between about 55 and 400 million kilometers apart as the two orbits wind around each other, and dividing those distances by the speed of light, 300,000 km/s, gives a one-way light time between roughly 183 and 1,333 seconds, which is 3 minutes at closest approach and 22 minutes near maximum separation, so a round trip costs anywhere from 6 to 44 minutes. No protocol cleverness reduces this number, and treating it as the floor rather than as a latency to optimize is the first sign a design has internalized the problem. Every 26 months the Sun sits between the planets at solar conjunction and blacks out the link for about two weeks, so the system must budget for a planned fortnight of zero connectivity rather than hope around it. Bandwidth forms the third constraint, since radio links through the Deep Space Network deliver megabit-class rates at best from Mars distance, historically a few Mbps down and far less up, while newer optical links have demonstrated hundreds of megabits near closest approach falling to tens of megabits when far. One worked example disciplines every later decision, because shipping 1 TB over a 10 Mbps link means moving 8 times 1012 bits at 107 bits per second, which is 800,000 seconds, about nine and a quarter days of continuous transmission, and the transmission is not even continuous, since ground stations and relay orbiters see each other only during scheduled contact windows lasting minutes to hours. A system that needs to move a terabyte had better start moving it weeks before anyone needs it, which is why prefetching becomes an architectural principle here rather than an optimization.
The interface is bundles, not requests
Application code cannot be written against a request-response interface when the response is 40 minutes away at best, so the programming model becomes message passing with delivery handled by the network over hours. The unit of communication is a bundle, and the API makes the queueing explicit, accepting a destination, a lifetime, and a custody flag, returning immediately, and reporting delivery asynchronously, possibly days later.
from dtn import Node
node = Node("ipn:earth.registry")
# enqueue for the next contact window; returns immediately
receipt = node.send(
destination="ipn:mars.registry",
payload=artifact_block, # content-addressed chunk
lifetime=timedelta(days=7), # discard if undeliverable by then
custody=True, # each hop takes responsibility
)
@node.on_delivery(receipt)
def acked(report): # may fire hours or days later
mark_replicated(report.bundle_id)
@node.on_bundle("artifact/*")
def receive(bundle): # idempotent: duplicates possible
store_if_new(bundle.content_hash, bundle.payload)
The data model is state built to merge
Because both planets keep writing while disconnected, shared state is stored in forms designed for later merging rather than for locking, and the choice of form is made per dataset rather than once for everything. The two workhorses are append-only logs, where each site only ever adds records so that synchronization reduces to an exchange of missing suffixes, and CRDT state, which the consistency section covers in depth. A sketch of the replicated layer follows.
CREATE TABLE replicated_log (
site CHAR(1) NOT NULL, -- 'e' or 'm', who wrote it
seq BIGINT NOT NULL, -- per-site, gapless
hlc BIGINT NOT NULL, -- hybrid logical clock timestamp
entry JSONB NOT NULL,
PRIMARY KEY (site, seq) -- sync = ship suffixes both ways
);
CREATE TABLE crdt_state (
object_id UUID NOT NULL,
site CHAR(1) NOT NULL,
payload BYTEA NOT NULL, -- per-site component, merged by type
PRIMARY KEY (object_id, site)
);
The high-level architecture
The shape is two autonomous regions joined by a store-and-forward relay chain. Each planet runs a complete and deliberately ordinary stack, with services, databases offering normal strong consistency inside the planet, a container registry, identity, and monitoring, all answering in local milliseconds, because consensus among machines in the same settlement is as fast there as it is here, and nothing about Mars makes a local Raft group exotic. Nothing interactive ever spans the gap, and holding that line is the single most important architectural decision in the design. Between the regions sits the delay-tolerant network, made of bundle gateways at each edge, ground stations, and relay orbiters, each holding bundles in persistent storage until the next scheduled contact window lets them move one hop closer to their destination. The alternative shape worth naming is a primary-on-Earth design where Mars runs caching replicas of Earth services, and it loses catastrophically, because every cache miss would be a 44-minute stall, every write would queue against Earth's authority, and conjunction would turn the settlement's infrastructure off for two weeks, so autonomy per planet is not a preference but the only shape that survives the physics.
Each planet is a complete, internally consistent region, and only bundles cross between them, hopping through gateways, ground stations, and relay orbiters that each store everything until the next contact window.
Why TCP cannot make the trip
It is worth being precise about why the ordinary Internet stack fails, because every piece of the replacement answers a specific failure. TCP opens with a three-way handshake, so before a single payload byte moves, one full round trip is spent, up to 44 minutes of pure silence. Its reliability comes from retransmission timers tuned in milliseconds, and at interplanetary delay those timers either fire constantly and flood the link with spurious retransmits or get inflated until loss recovery takes hours. Its congestion control discovers a safe sending rate from a feedback loop of acknowledgments, which is useless when every piece of feedback describes the network as it existed 40 minutes ago, and the end-to-end connection itself assumes the path exists for the conversation's duration, while this path exists only during contact windows and dissolves between them. Delay-tolerant networking, standardized as the Bundle Protocol in RFC 9171, replaces the conversation with custody of messages. A bundle is a self-contained unit carrying its payload, its metadata, and a lifetime after which it may be discarded, and each DTN node persists every bundle it accepts to durable storage, forwarding it when a contact window opens toward the destination, which is store-and-forward in the old email sense applied to all traffic. Custody transfer sharpens the guarantee, because a hop that accepts custody takes responsibility for the bundle and acknowledges hop-by-hop rather than end-to-end, so a bundle never needs to survive the whole path in one attempt, only the next hop, and a lost transmission is retried across one link instead of across the solar system. Routing, finally, is computed from the contact plan, the published schedule of future windows derived from orbital mechanics, rather than discovered by probing, because the network's future topology is one of the few things this system knows with certainty.
The life of a bundle across the gap
Walking one message through the chain makes the machinery concrete. A Mars technician finishes a maintenance report at nine in the morning local time and hits send, and the messaging service hands the bundle to the Mars gateway, which writes it to persistent storage and immediately acknowledges custody, so from the application's point of view the message is sent and the technician moves on with their morning. The gateway consults the contact plan and finds the next orbiter pass scheduled for twenty minutes before noon, so the bundle waits on disk for two hours and forty minutes, and when the window opens the gateway transmits it during the pass along with everything else queued at its priority level, taking custody acknowledgment from the orbiter before marking its own copy releasable. The orbiter's pass over the deep-space antenna comes at a quarter past two in the afternoon, the bundle crosses the long leg at light speed for perhaps 15 minutes given the current geometry, and the Earth ground station accepts custody and forwards it through the terrestrial network, where the remaining journey is ordinary and instant by comparison. The recipient on Earth sees the message about seven hours after it was written, of which fifteen minutes was physics and the rest was waiting for scheduled windows, and the delivery report retraces the same path in reverse so that two Mars mornings later the technician's client quietly marks the report delivered. Every stage of that journey survives a crash, because every stage begins with a write to durable storage and ends with a custody acknowledgment, and the same trip during conjunction simply stretches, with the bundle resting at the gateway for two weeks and the lifetime field deciding which traffic is still worth carrying when the sky clears.
Consistency at 44 minutes through regions and CRDTs
Strong consistency means all replicas agree on a single order of updates, and the protocols that provide it, Paxos and Raft, require a majority round trip per decision. A consensus group spanning Earth and Mars would therefore take at least one interplanetary round trip per write, 6 to 44 minutes to commit a single change, and during conjunction it would commit nothing at all for two weeks. The design rule that falls out is absolute, in that no synchronous coordination crosses the gap, ever, for anything, and any feature whose requirements seem to demand it gets redesigned until it does not. Each planet runs strongly consistent infrastructure internally, and interplanetary state is reconciled asynchronously, which forces an answer to the question most designs get to dodge, namely what happens when both sides changed the same thing during the hours or days between syncs.
CRDTs, conflict-free replicated data types, are data structures whose merge operation is commutative, associative, and idempotent, three algebraic properties that together guarantee replicas applying each other's updates in any order, any number of times, converge to the same state without coordination. The simplest member of the family is the grow-only counter, in which each site keeps its own component, only ever increments it, and computes the merged value as the per-site maximum, summed. Concretely, if Earth's replica of a "supply requests filed" counter holds earth 41, mars 7 while Mars holds earth 38, mars 9, with the difference being writes the other side has not yet seen, then after a sync both sides compute the maximum per component, earth 41 and mars 9, and both read 50, with nothing lost and no coordinator consulted, and the merge would have produced the identical result if the bundles had crossed in the opposite order or arrived twice. Sets, maps, and sequence types follow the same discipline with more bookkeeping. The hazard in the family lives in last-writer-wins registers, which resolve concurrent writes by keeping the one with the larger timestamp, and even with hybrid logical clocks, timestamps that combine physical time with a logical counter so causality is never inverted by clock skew, the resolution is merely deterministic rather than safe, because one side's concurrent write is silently discarded, which is unacceptable for anything resembling an inventory adjustment. The engineering posture that results is to use CRDTs where merge semantics are natural, for counters, sets, presence, and document edits, to use append-only logs where every event must survive, and to route genuinely conflicting business decisions, such as two sides allocating the same scarce resource during a blackout, into an explicit conflict queue for humans, because pretending a merge function can settle a resource dispute is how a settlement ends up with two owners for one rover.
Each site increments only its own component while disconnected, the full states cross as bundles at the next contact in either order, and both sides take the maximum per component, converging on 50 without any coordination.
Shipping software, identity, and trust across the gap
Most interplanetary traffic should be content rather than conversation, and content has the great virtue that it can be shipped ahead of demand. The Mars region runs full mirrors of a container and package registry, documentation, and media libraries, with popularity-weighted prefetch deciding what crosses the link before anyone asks for it, which is the CDN insight applied at planetary scale, and the difference between a developer on Mars waiting milliseconds for a cached package and waiting days for an uncached one is the difference between a working settlement and a stalled one. Everything shipped is content-addressed, named by the hash of its own bytes, so Mars verifies integrity locally without consulting Earth, a re-sent chunk is recognized rather than re-stored, and large artifacts cross as resumable chunked bundles that survive window closures mid-transfer with no wasted retransmission. Deployments are blue-green per planet, meaning the new version stands up alongside the old and traffic flips locally with an instant local rollback path, and one release rule is non-negotiable, namely that no deploy may ever require both planets to change together, so every protocol version must interoperate with its predecessor, because the two sides will run different versions for days at minimum and for a month around conjunction.
Identity and trust need the same rethink, because certificate validation that phones home fails on Mars by construction. An OCSP check, the online query that asks whether a certificate has been revoked, would take 44 minutes to answer, and short-lived certificates renewed hourly would all expire mid-blackout, so the trust architecture becomes hierarchical and offline-capable instead. A Mars intermediate certificate authority, holding its own hardware-protected keys, issues all local certificates with lifetimes stretched from hours to months, and only the intermediate's relationship with the Earth root ever crosses the link. Revocation travels as signed lists carried in the regular sync stream rather than as on-demand lookups, and the settlement holds enough sealed root material, split among officers so no single person can use it, to survive an extended blackout without trust expiring anywhere in the stack. Operations obey the same autonomy rule, since a human on Earth learns of a Mars incident at least 3 and up to 22 minutes after it begins, plus however long until a contact window, so remediation must be local and automatic, with runbooks executed by machines rather than typed by distant hands, while Earth receives lagged telemetry summaries, sampled and compressed to respect the link budget, for forensics and trend analysis rather than firefighting.
Scaling, failures, and operations
Scaling within each planet is ordinary and needs no special treatment, so the real scaling axis is the link, which grows by adding relay orbiters, since each one widens total contact time and adds path diversity, and by upgrading to optical terminals as they mature. The DTN layer scales by storage, because every node must buffer days of traffic, and a back-of-envelope check shows how undemanding that is, given that a 25 Mbps average link fully utilized moves about 270 GB per day, so even a full two-week conjunction backlog stays under 4 TB, a trivial amount of disk for a gateway. The queue is therefore never the constraint, and the link always is, which simplifies capacity planning to a single question about transmission scheduling.
The failure drills write themselves from the constraints, and rehearsing them shows the design holding. A lost contact window, whether from weather at a ground station or an orbiter dropping into safe mode, means bundles wait under custody for the next window, and applications notice nothing except staleness counters ticking upward on their dashboards. Conjunction is rehearsed rather than feared, with prefetch filling Mars caches in the weeks before, the settlement running self-sufficiently through the blackout, and sync queues draining afterward in priority order, operational telemetry first and bulk content last. A corrupted artifact cannot install anywhere, because its content address fails to match its bytes, and the receiving registry simply requests the chunk again. The deepest risk is divergence that merging cannot express, which is contained by keeping the set of truly shared mutable state deliberately tiny, auditing it, and treating every addition to that set as an architectural decision requiring sign-off rather than a schema change. The second deepest is operator drift between two stacks run by two teams 200 million kilometers apart, and the countermeasure is shared declarative configuration synced like any other content, so the planets converge on intent even during the weeks when they cannot converse.
Follow-up questions
- What changes for the Moon instead of Mars? The regime flips entirely, because the Moon sits 1.3 light seconds away, about 2.6 seconds round trip, with continuous links rather than windows. Strong consistency across the gap becomes possible though unpleasant, interactive use works with care, and the right design is a high-latency terrestrial region rather than a DTN island, so almost none of this article's machinery is needed.
- Could you run Paxos across planets for rare, critical decisions? Mechanically it works, with a commit costing one to several round trips, which is tens of minutes to hours, and conjunction halting it entirely for two weeks. For treaty-grade decisions made monthly that cost may be acceptable, and the discipline is keeping such a group off every request path so its slowness can never spread.
- Why bundles instead of just TCP with huge timeouts? Inflating the timeouts still leaves end-to-end retransmission across the whole path, congestion control steering by feedback that is 40 minutes stale, and connections that die at every window closure. Custody transfer moves responsibility hop by hop, so each bundle needs to survive only one link at a time, and that single change is what makes a scheduled, intermittent path usable at all.
- How do users on Earth see Mars data? They see a lagged replica that carries its own provenance, since every cross-planet object records its origin site and sync timestamp, and interfaces surface an "as of" stamp in Mars time rather than pretending a freshness that physics forbids. Stale-and-labeled turns out to be far more useful than the alternative of hiding the lag and letting users discover it the hard way.
- What stops CRDTs from quietly losing writes? Discipline in type selection does, by choosing types whose merge keeps everything, such as grow-only structures, observed-remove sets, and logs, for anything irreversible, and confining last-writer-wins to fields that are genuinely safe to overwrite. Semantic conflicts that no merge can settle, like double-allocating one resource, are routed to a human review queue instead of being resolved silently.
- Where does clock synchronization come from? Each planet maintains its own disciplined time from local atomic references, and hybrid logical clocks order cross-planet events causally without requiring agreement on wall-clock time. NTP across a 44 minute round trip with asymmetric paths cannot deliver tight synchronization, and the design is arranged so that nothing in it ever needs that.
References
- Burleigh et al., RFC 9171: Bundle Protocol Version 7 (2022), the delay-tolerant networking protocol.
- Cerf et al., RFC 4838: Delay-Tolerant Networking Architecture (2007), the architectural rationale.
- Shapiro et al., Conflict-free Replicated Data Types (INRIA, 2011), the formal CRDT foundations.
- NASA JPL, Deep Space Network, on ground stations, scheduling, and link capabilities.
- Kleppmann, Designing Data-Intensive Applications (2017), on consistency, consensus, and convergence.