On every listing page of a vacation rental platform there sits a carousel near the bottom offering similar listings the traveler might like. A traveler lands on a two-bedroom apartment in Lisbon, scrolls past the photos and the reviews, and the carousel offers a handful of alternatives in the same neighborhood at a comparable price. The widget looks small, but at Airbnb's published numbers item-to-item recommendation of this kind drives a meaningful share of all bookings, because it matches how people actually shop for a place to stay. Few travelers refine their search query until the perfect listing appears at the top. Most hop from one plausible option to the next, comparing as they go, until something fits the trip well enough to commit to, and the carousel is the surface that serves exactly that hopping. The design question is how to decide, given the listing being viewed, which other listings deserve those few slots.
The tempting first move is to define similarity by attributes, matching the same city, the same number of bedrooms, and a price within 20 percent. That baseline is worth building and worth beating, because what it misses is precisely what sessions reveal. Two listings can differ completely on paper, a houseboat and a loft say, and still be weighed against each other by the same kind of traveler within the same shopping session, while two near-identical condos can serve entirely different trips. The stronger definition is behavioral, treating two listings as similar when the same kinds of sessions consider them both, and the centerpiece of this design is a way to learn that similarity directly from click logs without anyone ever writing down what similarity means.
Scope and requirements
The product surface is the carousel on the listing page. Given an anchor listing, which is the one currently being viewed, the user's search context of dates and guest count, and a market of perhaps tens of thousands of listings, the system returns around a dozen ranked alternatives within roughly 100 milliseconds of server time so the page never waits on it. Assume a platform with about 5 million active listings worldwide and hundreds of millions of listing-page views per day, which makes the carousel one of the highest-volume surfaces the company runs even though it reads like an afterthought. Search ranking itself, pricing suggestions, and the booking flow all stay out of scope. One requirement separates this problem from recommending movies and deserves stating early, because most travelers arrive with dates, and a recommendation that is unavailable for those dates is worthless no matter how similar it is, so availability enters the design as a hard constraint in candidate generation rather than as a feature the ranker is allowed to trade away.
Framing the problem as a machine learning task
The task splits cleanly in two, and the split is itself a design decision worth defending. The first half learns a representation of listings such that behaviorally similar listings sit close together, which turns the vague instruction to find similar listings into the concrete operation of finding nearest neighbors in a vector space. The second half ranks the retrieved neighbors for this particular traveler and trip, using a small supervised model that predicts the probability of a click and ultimately a booking. The alternative would be a single end-to-end model scoring every possible (anchor, candidate) pair, which looks cleaner on paper and loses in practice, because pair scoring cannot be run against a whole market of tens of thousands of listings on each of hundreds of millions of daily page views, while a learned embedding space hands over the retrieval step almost for free, since nearest-neighbor search over precomputed vectors is among the cheapest operations in the serving toolbox.
Embeddings from click sessions
The centerpiece technique is borrowed from language modeling. The skip-gram objective, which is the idea behind word2vec, learns a vector for every word by training a tiny model to predict which words appear near each other in real sentences, and words used in similar surroundings end up with nearby vectors, which is why hotel and motel land close together without anyone defining either one. The transfer to listings is direct. A user's click session, meaning the sequence of listings they viewed in one sitting, plays the role of a sentence, and each listing plays the role of a word. If thousands of sessions containing the Lisbon apartment also contain a particular loft three streets away, training pushes their vectors together, and no human ever specified what made them alike. Each listing ends up with a vector of perhaps 32 numbers, and the geometry of the space encodes neighborhood, style, and price band simultaneously, because those are exactly the things that co-occur in real shopping behavior. Training uses negative sampling, which contrasts the listings that truly co-occurred in sessions against randomly drawn ones so the model learns what makes the true neighbors special. Airbnb's published refinement adds further negatives drawn from the same market as the anchor, because purely random negatives usually come from other cities and teach the model little beyond geography it already knows, while a hard negative from the same market forces it to learn the finer distinctions of style and price that the carousel actually trades in.
The second refinement from that work treats the booked listing as global context. Sessions that end in a booking are the ones that encode satisfied intent, since the traveler searched, compared, and finally committed, so during training the eventually booked listing is treated as a neighbor of every listing in that session rather than only of the listings within a few clicks of it. Intuitively, every click in a converted session gets pulled slightly toward the thing the traveler finally chose, which tilts the whole space away from what people merely browse together and toward what leads to bookings. That tilt is the difference between a curiosity engine and a revenue engine, and it costs nothing at serving time, because it changes only how the vectors were trained and not how they are used.
New listings have no sessions yet, and the cold-start answer works because of how strongly location dominates similarity in this domain. A new listing is assigned the average of the vectors of a few geographically nearest listings of the same property type and similar price band, for example the three closest comparable apartments, and that borrowed vector is close enough to place it among sensible neighbors from its first day. The approximation only needs to bootstrap impressions, because once travelers begin encountering the listing in sessions it earns its own trained vector at the next refresh and the borrowed one is discarded.
Candidate generation
With every listing embedded, candidate generation becomes a nearest-neighbor lookup, and the engineering question is how to make that lookup fast enough to run on every page view. Exact search over 5 million vectors is unnecessary, because an approximate nearest neighbor index, a data structure that organizes vectors into a navigable graph or partitioned cells and returns almost-nearest neighbors in well under a millisecond, retrieves the top 200 neighbors of the anchor with recall in the high nineties, meaning it finds nearly all of the true nearest vectors nearly all of the time. The index is pre-filtered by market, since a Lisbon anchor should pull Lisbon candidates and the embedding space already concentrates neighbors locally. The hard constraints apply next. Candidates must be available for the traveler's dates, which is checked against the availability calendar index, must fit the guest count, and must pass status filters such as not being the anchor itself and not being suspended. The date filter bites hard in high season, sometimes cutting 200 neighbors down to 80, which is exactly why retrieval starts wide and why availability runs here rather than in the ranker. A soft penalty inside the ranker would still let an unavailable listing claim a slot some fraction of the time, and the traveler who clicks through to a calendar with no open dates loses more trust in that one moment than ten good suggestions can rebuild.
Ranking the candidates
The surviving candidates are scored by a small supervised model, either a GBDT, which is an ensemble of small decision trees trained in sequence so that each tree corrects the errors of the ones before it, or a compact neural network, predicting the probability of a click from the carousel. The features are concrete and mostly relational to the anchor. The price ratio between candidate and anchor matters more than either absolute price, because a candidate at 1.1 times the anchor's price reads as a fair alternative while one at 2.5 times reads as an upsell the traveler did not ask for, and the model learns where that tolerance bends for different markets. Embedding cosine similarity, an angle-based measure of how aligned two vectors are, summarizes the behavioral closeness in one number. Distance to the anchor in meters, review score and review count, capacity fit against the guest count, and instant-book availability fill out the set. Anchoring the features to the listing being viewed is the quiet design decision in this section, because the anchor is the best single statement of what this traveler wants right now, stronger in this moment than their entire long-term history, which may describe past trips taken with different companions on different budgets.
Data, labels, and offline evaluation
Labels come from the carousel's own logs. An impression of a candidate that the traveler clicks becomes a positive, a booking that follows the click counts as a stronger positive, either weighted higher or modeled as a second prediction task, and impressed-but-not-clicked candidates become the negatives. Every impression is logged with its slot position, because position bias, the tendency of earlier slots to attract clicks regardless of fit, would otherwise teach the model that whatever the old system put first was good. The standard treatment trains with position as a feature and fixes it to a constant at serving time, so placement effects and fit effects can be separated.
Offline evaluation uses held-out sessions split by time, training the embeddings and the ranker on weeks 1 through 8 and evaluating on week 9, because a random split would let the model see the future of the very sessions it is tested on and flatter every number it produces. The retrieval stage is measured by recall@K. For each held-out case where a traveler viewing anchor A went on to click listing L, the test checks whether L appears among the top K neighbors of A, and the results are averaged over cases. If the embedding index places the clicked listing in its top 50 for 62 percent of cases while the same-neighborhood-and-price heuristic manages 31 percent, the embeddings have roughly doubled the ceiling on what the ranker can ever show, and the word ceiling is the right framing, because a relevant listing missing from retrieval is unrecoverable by anything downstream. The ranking stage is measured by the average position of the clicked or booked listing among the scored candidates, where moving the eventual booking from average rank 9 to rank 4 moves it from outside the visible carousel to inside it, which is the kind of offline result that usually survives the online test precisely because it maps onto something the traveler actually sees.
The serving architecture
The serving path is short, which is part of the design's appeal, and the budget deserves walking stage by stage. The listing page request arrives carrying the anchor ID, the dates, and the guest count. Fetching the anchor's vector and querying the ANN index for 200 neighbors within the market costs about a millisecond. The availability checks run as one batched calendar lookup costing 10 to 20 milliseconds, batched because 200 individual calendar reads would dominate the entire budget on their own. Feature assembly from the feature store, the shared system that serves feature values consistently to training and serving, takes roughly 10 more milliseconds, and scoring the 80 or so survivors through a small model adds only a few, so the whole path lands comfortably under the 100 millisecond envelope with room left for the diversity pass. Diversity earns its place at the end because nearest neighbors are sometimes too near, and the failure is easy to picture, since ten units in the same condo building are each an excellent candidate and together make a terrible carousel. A simple greedy rule repairs it by selecting candidates in score order while applying a penalty proportional to each candidate's maximum embedding similarity against the already-selected set, so the second unit from the same building drops below a strong option in the next neighborhood, and the traveler sees twelve genuine alternatives rather than one choice repeated twelve times.
The serving path. The ANN index retrieves 200 behavioral neighbors of the anchor, hard availability and capacity filters cut them to around 80, the ranker orders the survivors for this trip, and a diversity pass keeps the carousel from filling with near-duplicates.
The embedding training pipeline
The offline side runs nightly. Raw click logs are sessionized, which means each user's views are grouped into sequences split wherever 30 minutes of inactivity suggests the sitting ended, and trivial single-click sessions are dropped because a sequence of one teaches the skip-gram objective nothing about co-occurrence. The trainer then runs over the resulting corpus, on the order of hundreds of millions of sequences for a mature platform, with the booked-listing global context applied to the converted sessions. The output is one 32-dimensional vector per listing, which for 5 million listings at four bytes per number comes to around 640 megabytes, small enough to hold in memory on every machine that needs it. The ANN index is rebuilt from the fresh vectors and swapped atomically together with the exact vector set it was built from, and that versioning discipline matters far more than it looks, because vectors from different training runs live in different coordinate systems even when trained on nearly identical data, so yesterday's index queried with today's vectors returns neighbors that are silently wrong rather than visibly broken. The traveler inside that failure sees a Lisbon apartment whose suggested alternatives sit in the wrong district at the wrong price, click-through sags within hours, and nothing in the serving path logs a single error, which is why the version mismatch is prevented structurally by the atomic swap rather than caught operationally by someone staring at dashboards. The cold-start path runs inside the same nightly pipeline, assigning neighbor-average vectors to listings created since the last run so they enter the index immediately rather than waiting to be browsed.
The nightly training pipeline. Click logs become session sequences, the skip-gram trainer with booked-listing context produces versioned listing vectors, and the ANN index is rebuilt and swapped atomically. The dashed path bootstraps new listings with the average of nearby comparable listings.
Online evaluation and operations
The online metrics are carousel click-through rate and, more importantly, the downstream booking rate attributed to carousel clicks, read through A/B tests in which a random slice of traffic receives the new system while the rest keeps the old one. Booking is the metric that decides launches, because click-through alone rewards eye-catching but unbookable suggestions, and a carousel that wins clicks while losing bookings is optimizing the wrong end of the funnel. The offline-online gap shows up here in an instructive way, because an embedding change can lift retrieval recall substantially while carousel click-through barely moves, which usually means the ranker rather than retrieval has become the binding constraint at that moment. Reaching that diagnosis out loud is worth doing in an interview, because it treats the stages as one system whose bottleneck moves around rather than as isolated components to be polished independently.
Freshness on the supply side needs deliberate help beyond the borrowed vectors. A new listing still has no reviews and no track record, so the ranker scores it timidly, and without intervention it stays unseen and never accumulates the sessions that would earn it a trained vector of its own, a loop that quietly starves new supply. A small exploration boost for young listings, capped in impressions so its cost stays bounded, breaks the loop, and the marketplace genuinely depends on it, because hosts who receive no visibility in their first month tend to leave the platform. On the monitoring side, the quantities that rot quietly are the inputs rather than the model, and each gets its own watch. Session volume per market is tracked because a tracking change can thin the training corpus for one country without touching another. The fraction of listings whose vectors moved sharply between refreshes is tracked because a spike means a training problem rather than a genuine market shift, and that alarm is usually the first thing the on-call engineer sees when a run goes wrong. ANN recall is measured against exact search on a sample so index degradation has a number attached, and the share of carousels where fewer than 12 candidates survived the availability filter is the early signal that retrieval needs to widen in tight markets before travelers start meeting half-empty carousels. Retraining runs nightly for the embeddings and weekly for the ranker, and the ranker trains on features logged at serving time so that offline and online disagree as little as possible.
Follow-up questions
- Why learn similarity from sessions instead of attributes? Attributes describe what a listing is, while sessions reveal what travelers treat as substitutes, and substitution is the question the carousel actually answers. In the worked example above the attribute baseline reaches roughly half the retrieval recall of the embeddings, and that gap is exactly the behavior the attributes cannot see.
- Why is the booked listing used as global context? It tilts the embedding space from co-browsing toward conversion, because every click in a session that ended in a booking gets pulled toward the booked listing during training. Neighborhoods in the space then organize around what travelers choose rather than only what they look at, which is the distinction the business cares about.
- How do you handle a brand-new listing? The pipeline assigns it the average vector of a few nearest listings of the same type and price band so it enters the index on day one, and a capped exploration boost buys it the impressions it needs to start appearing in sessions. Once those sessions exist, the next nightly run trains it a vector of its own.
- Why filter availability in candidate generation rather than rank with it? For a traveler with dates, availability is a hard constraint rather than a preference, because a perfect listing that is booked for those dates carries exactly zero value. A soft penalty in the ranker would still let unavailable listings claim slots some fraction of the time, and every such slot teaches the traveler not to trust the carousel.
- Why version the embeddings and rebuild the index together? Different training runs place their vectors in different coordinate systems even when the data barely changed, so the index and the vectors must always come from the same run. Mixing versions degrades the neighbors silently, with sagging click-through as the only symptom and no error anywhere to find.
- What changes for a viewer with no dates selected? The availability filter relaxes to general near-term availability, and the ranker leans harder on the anchor-relative features. Degradation is graceful because the anchor rather than the user profile carries most of the intent signal, and the anchor is present on every listing page by definition.
References
- Grbovic and Cheng, Real-time Personalization using Embeddings for Search Ranking at Airbnb (KDD 2018), the listing-embeddings paper with booked-listing global context.
- Grbovic, Listing Embeddings in Search Ranking (Airbnb Engineering blog, 2018), the engineering account including cold start.
- Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality (NeurIPS 2013), the skip-gram objective with negative sampling.
- Aminian and Xu, Machine Learning System Design Interview (2023), chapter on similar listings.