Design a web crawler · Stanley Jacob

A web crawler starts from a set of seed URLs, downloads those pages, extracts the links they contain, and follows those links to new pages, repeating until it has visited some meaningful slice of the web. Search engines run crawlers to build their indexes, archives run them to preserve pages before they disappear, and research and monitoring systems run them to mine content at scale. The loop sounds like a weekend project, and a single-machine crawler genuinely is one, which is exactly why interviewers ask the question. The distance between the toy and the real system is made of problems that only appear at scale, and most of those problems turn out to be about restraint rather than speed, which surprises candidates who arrive expecting a pure throughput conversation.

Restraint is the social contract of crawling. A crawler is a guest on every server it visits, and from the host's point of view an aggressive crawler is indistinguishable from a denial-of-service attack, so politeness, meaning honoring each site's stated rules and limiting the request rate per host, is a first-class design constraint rather than a feature bolted on later. The hard parts of this design concentrate in three places, namely the URL frontier, which is the data structure that decides what to fetch next while enforcing politeness, the deduplication of both URLs and content at a scale of billions of items, and the traps that a hostile or merely strange web lays for any program that follows links without judgment.

Scope and requirements

Functionally, the crawler should fetch HTML pages starting from the seeds, extract and follow links, store raw page content for downstream consumers such as an indexer, and revisit pages to pick up changes over time. Scope excludes rendering JavaScript-heavy pages, which is a real consideration in modern crawling but belongs to a separate fleet of headless browsers with its own budget, and it excludes downloading non-HTML media beyond recording that the links exist. Non-functionally, the target is one billion pages per month under the politeness rules above, resilience to individual machine failures without losing the frontier, and extensibility, because every crawl operator eventually wants to attach new content processors and the architecture should accept them without surgery.

Coverage and freshness deserve a sentence of requirements philosophy, because they fight each other for the same fetch budget. Every request spent revisiting a known page is a request not spent discovering a new one, so the requirement cannot be to crawl everything and keep all of it fresh at once. The workable framing is to spend a fixed budget where it earns the most, which quietly reframes the scheduler, rather than the downloader, as the most interesting component in the system, and the rest of the design follows from that reframing.

Sizing the problem

One billion pages per month spread over roughly 2.6 million seconds comes to about 385 pages per second, which rounds to 400 for planning. At an average page size of 500 KB, the fleet downloads 400 times 500 KB each second, or 200 MB/s, about 1.6 Gbps sustained, and a small fleet of ordinary machines handles that without exotic networking. Storage is the bigger number. One billion pages times 500 KB is 500 TB per month of raw HTML before compression, and HTML compresses well, typically four or five to one, so 100 to 150 TB per month actually lands in the blob store, with the retention policy deciding how many months of that accumulate before old crawls age out.

The politeness constraint converts throughput into concurrency, and this is the calculation that explains the rest of the architecture. If the crawler limits itself to one request per host every two seconds, then a single host yields at most half a page per second, and sustaining 400 pages per second requires fetches in flight against at least 800 distinct hosts at any given moment. A fetch including DNS resolution, connection setup, and transfer takes about a second on average, so the fleet needs on the order of a thousand concurrent connections spread across that many hosts. The scheduler's real job is therefore keeping hundreds of hosts in polite rotation rather than keeping one queue full, and the frontier described below exists to do exactly that.

The crawler's interface to the web

The crawler speaks plain HTTP, and two exchanges define its manners. Before fetching anything from a host, it fetches and caches that host's robots.txt, the file in which sites declare which paths crawlers may touch and how quickly they may come back, and on revisits it uses conditional requests, which ask the server to send the body only if the page changed since the last visit, so an unchanged page costs a status line instead of a transfer.

GET /robots.txt HTTP/1.1
Host: example.com
User-Agent: research-crawler/1.0 (+https://crawler.example.org/about)

→ 200
User-agent: *
Disallow: /private/
Crawl-delay: 2

GET /products/blue-shoes HTTP/1.1
Host: example.com
If-Modified-Since: Tue, 01 Apr 2025 08:00:00 GMT

→ 304 Not Modified              (unchanged: no body transferred)

The descriptive User-Agent string with a contact URL is part of the same contract. Site operators who notice the crawler in their logs can find out who is fetching, read about its purpose, and ask for changes, while crawlers that hide behind browser agent strings get blocked as abusers because that is precisely what they resemble. The conditional fetch matters economically as well as socially, since a mature crawl revisits far more pages than it discovers, and a 304 response, the not-modified status, costs a few hundred bytes against the half megabyte of a full transfer, which adds up to most of the bandwidth saved on the revisit side of the budget.

The data model

Raw page bodies go to a blob store keyed by content hash, which makes the storage layer naturally deduplicating, and a metadata database carries the crawl state for every URL. The per-URL row is what the scheduler reads and writes on every decision, so it stays deliberately narrow.

CREATE TABLE urls (
  url_hash      BYTEA PRIMARY KEY,     -- 16-byte hash of normalized URL
  url           TEXT NOT NULL,
  host          TEXT NOT NULL,
  last_fetched  TIMESTAMPTZ,
  content_hash  BYTEA,                 -- last seen body checksum
  change_score  REAL DEFAULT 0.5,      -- estimated change frequency
  depth         SMALLINT DEFAULT 0,
  failures      SMALLINT DEFAULT 0
);

One billion URLs at roughly 200 bytes per row is about 200 GB, which shards comfortably by url_hash across a handful of database nodes. The table's job is bookkeeping, holding the change estimate, the failure count, and the last fetch time that the scheduler consults, while the much hotter question of whether a URL has ever been seen before goes through an in-memory structure described later, because answering it with a database round trip per discovered link would multiply the database load by the average number of links on a page, easily a factor of fifty.

The high-level architecture

The system is a loop drawn as a pipeline. The URL frontier holds URLs waiting to be fetched and releases them according to priority and politeness. Fetcher workers take URLs from the frontier, resolve hostnames through a local DNS cache, download the page, and hand the body to a content-dedup check that discards pages whose content was already stored under a different URL. New content goes to the blob store, and a parser extracts the outgoing links, which pass through URL normalization, filtering, and the seen-set test before the survivors re-enter the frontier and close the loop. Traversal is breadth-first in flavor, because links are discovered level by level outward from the seeds, but the frontier's priorities mean it is never strictly breadth-first, since important or fast-changing pages jump the line, and that bending of the order is the behavior a crawl operator actually wants.

The crawl loop in one picture, where the frontier dispenses URLs under politeness rules, fetchers download through a DNS cache, new content is stored and parsed for links, and normalized unseen URLs flow back into the frontier to await their turn.

DNS earns its own box in that picture because name resolution quietly becomes a bottleneck at exactly this scale. Four hundred fetches per second against fresh hostnames means 400 resolver queries per second, each costing tens of milliseconds through a typical recursive resolver chain, and a stalled resolver stalls every fetcher thread waiting behind it. Running a local caching resolver on each crawler node, with the crawler's own cache keyed by host layered on top, turns the common case into a memory lookup, and it concentrates per-host knowledge in one place, which is also where connection reuse against frequently visited hosts naturally lives.

The frontier: front queues for priority, back queues for politeness

A naive frontier is one big first-in-first-out queue, and it fails politeness immediately, because a burst of links discovered on one site, say ten thousand product pages from example.com, would be fetched back to back, hammering that host with the full speed of the fleet. The classic fix, introduced by the Mercator crawler and still the standard answer, splits the frontier into two stages. A set of front queues implements priority, where a prioritizer scores each incoming URL on signals such as site importance, depth from the seeds, and expected change rate, then places the URL in the queue for its priority band. A larger set of back queues implements politeness, where each back queue holds URLs for exactly one host, a table maps hosts to their queues, and a delay heap, which is a priority queue ordered by time, records the earliest moment each host may be contacted again based on its crawl delay.

Incoming URLs are scored into front queues, and the router moves them into per-host back queues without ever mixing hosts. A fetcher pops the next ready host from the delay heap, dequeues one URL from that host's back queue along the dashed path, fetches it, and pushes the host back with a ready time of now plus its crawl delay.

The flow rewards a slow walk-through. A fetcher wanting work pops the delay heap, which yields a host whose ready time has passed, takes one URL from that host's back queue, and fetches it. When the fetch completes, the host goes back on the heap with a ready time of now plus its delay, two seconds by default or whatever the site's Crawl-delay directive requested. If a back queue runs empty, the router refills it by pulling from the front queues, biased toward the high-priority ones, and when a pulled URL belongs to a brand-new host, the router assigns that host to a free back queue and updates the host table. The division of labor stays clean throughout, because the front queues decide what deserves fetching while the back queues and the heap decide when each host may be touched, and no burst of discoveries can ever concentrate requests on one server. Keeping roughly three times as many back queues as fetcher threads is the standard provisioning rule, since with fewer queues the heap runs dry and threads sit idle while politeness timers count down, which wastes the fleet without helping any host.

Deduplication, traps, and recrawl

The seen-set answers whether a URL has ever entered the frontier, and it answers that question billions of times, so it cannot afford a database round trip per link. URLs are first normalized, which lowercases the scheme and host, resolves relative paths, sorts or strips query parameters known to be irrelevant, and removes fragments, so that trivially different spellings of the same page collapse into one canonical form. A Bloom filter then screens the candidates in memory. A Bloom filter is a compact bit array that answers definitely-no or probably-yes using a few hash probes per key, and at 10 bits per key, 10 billion URLs cost about 12.5 GB of memory for a false-positive rate near 1 percent. The trade hiding inside that 1 percent is that a false positive silently skips a never-seen URL, which forfeits about 1 percent of discovered links, an acceptable loss for most crawls because important pages are linked from many places and get other chances to be found. A crawl that cannot accept the loss pairs the filter with a sharded key-value store for confirmation, paying the network lookup only on the rare probably-yes path, which preserves nearly all of the filter's savings while removing the silent skips.

Content dedup is the same idea one level up, because the web is full of identical bodies reachable through different URLs, mirrors, session-ID variants, and www versus bare domains, and storing every copy wastes the blob store while skewing whatever index gets built downstream. An exact checksum over the body catches identical copies almost for free, since the blob store is already keyed by content hash. Near-duplicate detection handles the harder case of the same article wrapped in different boilerplate, and the standard tool is simhash, which condenses a document into a 64-bit fingerprint with the property that similar documents differ in only a few bits, so flagging near-copies means checking each new fingerprint against the corpus within a small Hamming distance, the count of differing bit positions, and indexing tricks make that check feasible at billions of documents.

Spider traps are link structures that generate unbounded URLs, often by accident rather than malice. An infinite calendar whose next-month link never ends, session IDs that mint a fresh URL for every visit, and faceted product listings that multiply filter parameters forever are the classics, and a crawler that follows links blindly will pour its entire budget into one of them. The defenses stack rather than compete. A depth limit per crawl path caps how far any chain can run, URL pattern heuristics reject URLs beyond a length cap or with many repeated path segments, and the most robust layer is a per-domain page budget, which guarantees that even an undetected trap costs only that domain's allotment before the crawler moves on. The same budget bounds the damage from genuinely enormous legitimate sites, which would otherwise absorb a disproportionate share of the crawl simply by being large and densely linked.

Recrawl scheduling closes the loop on freshness. Each fetch that returns a 304 or an unchanged checksum nudges the page's change-rate estimate down, each observed change nudges it up, and the scheduler converts the estimate into a next-visit time, so a news front page earns a visit every few minutes while a stable reference page waits weeks between visits. The change_score column in the metadata table is exactly this estimate, and the frontier's prioritizer reads it when scoring URLs, which is how freshness competes with discovery inside one fetch budget instead of in two separate systems arguing over the same bandwidth.

The life of one URL

Following a single URL through the machine ties the pieces together and shows where the time actually goes. A parser on node 7 extracts a link to a product page on example.com from a page it has just processed. Normalization canonicalizes the URL in microseconds, and the Bloom filter answers definitely-no in well under a microsecond, so the URL is genuinely new. Node 7 does not own example.com, so the URL is forwarded to the owning node through the partitioned exchange queue, arriving within milliseconds, where the prioritizer scores it into a mid-priority front queue. Some minutes later the router pulls it into example.com's back queue, and there it waits behind the host's politeness timer, which under a two-second delay and a handful of queued URLs typically means seconds to minutes of queue time. That wait is the dominant latency in the whole pipeline, and it is deliberate, since the crawler is pacing itself rather than struggling. When the delay heap finally surfaces example.com, a fetcher checks the cached robots.txt verdict in memory, resolves the host through the warm DNS cache in under a millisecond, reuses an open connection where one exists, and spends 200 to 800 ms downloading the half-megabyte body. The checksum misses the content-dedup check, the page lands in the blob store, the URL's row gains a fetch timestamp and a content hash, and the parser starts the cycle again on the new page's links.

The failure branches deserve the same concreteness. If the fetch times out, the fetcher records a failure in the URL's row, pushes the host back on the heap with a lengthened delay, and moves on to another host, with the URL retried later under exponential backoff, meaning each successive failure doubles the wait, and parked entirely after a handful of attempts. If the node owning example.com dies mid-crawl, the contents of its back queues are recovered from the last checkpoint in durable storage, the few fetches that were in flight are simply refetched, and the total cost of the crash rounds to a few duplicate downloads. If robots.txt suddenly disallows a path the crawler has been visiting, the cached copy expires within a day, the new rules load on the next fetch from that host, and the now-forbidden URLs are dropped at dispatch time rather than scrubbed from the frontier, which is cheaper and looks identical from the site's side. An operator watching all of this sees counters moving on a dashboard rather than incidents demanding action, and that calmness is the test of a well-shaped crawler, because at 400 pages per second something is always failing somewhere.

Scaling, failures, and operations

The crawl distributes by partitioning the URL space by host across crawler nodes, so every URL for example.com lives on the same node. That single decision keeps all of the politeness state, namely the back queue, the delay timing, and the robots.txt cache for each host, local to one machine with no cross-node coordination on the hot path, and it localizes DNS cache hits and connection reuse as a free side effect. Links a node extracts for hosts it does not own are forwarded to the owning node, conveniently through a queue partitioned the same way. Adding nodes means re-partitioning hosts, and consistent hashing, a placement scheme where each node owns a stable slice of the hash space, keeps the movement proportional to the capacity added instead of reshuffling everything. A node's frontier survives the node itself because back-queue contents are checkpointed to durable storage, so losing a machine costs at worst the few pages that were in flight when it died.

The failure stories inside the fleet are mostly graceful, as the walk-through above showed, and the blob store and the metadata database scale independently along their own axes, by raw capacity and by url_hash shards respectively. The two real operational hazards are external rather than internal. The first is hammering someone's site through a politeness defect, which is why the per-host rate limiting deserves actual tests rather than trust, and why the User-Agent carries contact information so that the people experiencing the damage can report it directly. The second is being blocked or rate limited by the large content delivery networks that front much of the web, and the durable answer there is spreading requests in time and honoring the signals those networks send, rather than evasion, because evasion converts a technical problem into a reputational one that follows the crawler's name around. A daily report of the top hosts by request count is the cheapest early warning for both hazards, since each one begins as a single host receiving more attention than anyone intended.

Follow-up questions

Why partition the crawl by host rather than by URL hash? Politeness is enforced per host, so host-partitioning keeps the delay state, the robots cache, and the connection pools on one node where they can be consulted in memory. URL-hash partitioning would scatter one host's URLs across the fleet and force every fetch to coordinate against a shared rate limit over the network, which plants a distributed-systems problem in the middle of the hottest loop in the system.
What happens when robots.txt itself is unavailable? A 404 is read as the site declaring no restrictions, but a timeout or a 5xx is treated conservatively, with the crawler retrying later and skipping that host's pages in the meantime, because fetching while the rules are unknown risks violating them. Caching robots.txt for about a day with revalidation balances respect for rule changes against the overhead of refetching a file that rarely changes.
How does the design avoid recrawling the same page forever via different URLs? Three layers cooperate on that problem. URL normalization collapses spelling variants before any lookup happens, the seen-set drops URLs that have already entered the frontier, and content dedup catches the remainder, where genuinely different URLs return the same body, by recording an alias instead of storing another copy.
Is this BFS or DFS, and does it matter? The traversal is breadth-first in flavor, since discovery spreads outward from the seeds, and that order finds important, well-linked pages early. Strict ordering is abandoned in favor of priorities, because politeness already reorders everything and pages differ enormously in value. Depth-first is avoided on purpose, since diving deep into one site is simultaneously rude to that site and low-yield for the crawl.
What breaks first at 10x scale? Memory does, in the seen-set and the frontier, since a 100 GB Bloom filter wants sharding across nodes, which the host partitioning conveniently already provides. Bandwidth grows to 16 Gbps, which pushes the fetcher fleet across regions for capacity and geographic spread, and the metadata table's write rate forces updates to become batched and asynchronous rather than one per fetch.
How would you crawl JavaScript-rendered pages? With a separate rendering fleet of headless browsers, fed from the same frontier but budgeted independently, because a rendered fetch costs 10 to 100 times what a plain one does in CPU and memory. Most operators render selectively, triggering the expensive path only when signals suggest the static HTML was empty of real content, which keeps the rendering fleet a small fraction of the main one.

References

Manning, Raghavan, and Schütze, Introduction to Information Retrieval, chapter 20: Web crawling and indexes (2008), including the Mercator-style frontier.
RFC 9309, Robots Exclusion Protocol (2022).
Google, How Google Search works: crawling and indexing.
Xu, System Design Interview, Volume 1 (2020), chapter on web crawlers.