Design an event recommendation system

An event recommendation system powers the home feed of a product like Meetup or Eventbrite. A person opens the app on a Tuesday evening and sees a ranked list of upcoming events nearby, perhaps a pottery workshop on Saturday afternoon, a machine learning meetup on Thursday, and a 10k race three weekends out. When the ranking works, the person taps one of the first few cards, registers, shows up, and comes back the following week. When it fails, the feed reads like a public bulletin board full of events that are sold out, far away, or aimed at someone else entirely, and the person quietly stops opening the app. Organizers depend on the same loop from the other side, because an event with no attendees is a failed event, and a platform that cannot route the right people to a new listing loses its supply before it loses its demand. The interview question sounds like any other recommender at first, which is exactly the trap.

The trap is that events are perishable inventory. A movie sits in the catalog for decades accumulating ratings, but an event is created a few weeks before it happens and dies the moment it starts, so there is never enough interaction history on any single event for classic collaborative filtering to work. Collaborative filtering is the family of methods that recommends items purely from patterns of who interacted with what, so that people with overlapping histories receive each other's items. Matrix factorization, the standard implementation that learns a vector for every user and every item from the interaction matrix, needs dozens or hundreds of interactions per item before those vectors mean anything, and by the time an event has accumulated them, the event is over. The design therefore has to lean on content features, meaning what the event is about, where it is held, and who runs it, and on context features such as where the user is and what time it is, with behavioral signals layered on top at the level of users, topics, and organizers rather than individual events, because those entities live long enough to learn. Most of the interesting decisions below follow from that one constraint.

Scope and requirements

The first thing to agree on with the interviewer is the product behavior itself, which is a personalized feed of upcoming events, refreshed each time the user opens the app, drawing on the user's location, history, and social graph. Search, ticketing and payments, organizer tooling, and notification timing all deserve designs of their own, so I would name them and place them out of scope, keeping the feed itself plus the machinery that ranks it. Scoping out loud converts a vague product prompt into a contract the rest of the conversation can be graded against.

For scale, assume 50 million monthly users, around 5 million upcoming events live at any moment, and roughly a million new events created every week. Feed requests peak at a few thousand per second, so the hard problems here are statistical rather than a matter of raw throughput. The latency budget for producing a ranked feed is about 200 milliseconds at the 99th percentile, of which the ML portion should consume well under half so that network, assembly, and rendering have room. Two requirements are unusual enough that I would call them out before moving on. A newly created event must become recommendable within minutes rather than after a nightly batch job, because a popular workshop posted Friday morning can sell out by Friday night, and an organizer whose event receives no early exposure has every reason to leave the platform. Recommendations must also respect hard feasibility, since a user forgives a mediocre suggestion long before they forgive an impossible one, and a wonderful event 400 miles away or one that started an hour ago erodes trust either way.

Framing the problem as a machine learning task

I would frame the core task in one sentence. Given a user, a candidate event, and the request context, predict the probability that the user registers for the event if it is shown. The input is the (user, event, context) triple, the output is a single probability, and the feed is produced by scoring candidates and sorting. This is pointwise ranking, which means each candidate is scored independently and the ordering falls out of the scores, as opposed to pairwise ranking, where the model learns to order pairs of items relative to each other, or listwise ranking, which optimizes the whole list jointly. Pairwise and listwise methods, LambdaMART being the best-known example, tend to win benchmarks where fine ordering among near-duplicates is the whole game, but they surrender the absolute meaning of the score, and this system wants that meaning. A calibrated probability composes cleanly with business rules, so boosting new organizers or demoting sold-out events becomes arithmetic on the score rather than a retraining project, and the infrastructure for pointwise scoring stays simpler at every stage. Pointwise is a starting point rather than a law, and the upgrade path to pairwise exists if the top of the feed later proves poorly ordered even when the right events are present.
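
A minimal sketch of what arithmetic on the score could look like, assuming the ranker emits a calibrated registration probability per candidate. The `Candidate` fields and both multipliers are hypothetical values picked for illustration, not tuned numbers.

```python
from dataclasses import dataclass

# Hypothetical multipliers, chosen only for illustration.
NEW_ORGANIZER_BOOST = 1.2
SOLD_OUT_DEMOTION = 0.1


@dataclass
class Candidate:
    event_id: str
    p_register: float          # calibrated model output in [0, 1]
    sold_out: bool = False
    new_organizer: bool = False


def adjusted_score(c: Candidate) -> float:
    """Apply business rules as plain multiplication on the probability."""
    score = c.p_register
    if c.new_organizer:
        score *= NEW_ORGANIZER_BOOST
    if c.sold_out:
        score *= SOLD_OUT_DEMOTION
    return score


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Pointwise ranking: score each candidate independently, then sort."""
    return sorted(candidates, key=adjusted_score, reverse=True)
```

Changing a rule here means editing a constant, which only works because the underlying score is a probability with a stable meaning.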

Registration is the target rather than click because clicks are cheap and noisy, while a registration is a commitment that correlates with the outcomes the business cares about. The truest signal would be attendance, since a platform full of registrations nobody honors is failing both sides of its market, but attendance arrives days or weeks later, is missing or unreliable for many events, and would starve a nightly training job of fresh labels. The practical compromise is to train on registrations and keep attendance as an evaluation signal and a possible second task later.

Data and labels

Training data comes from the serving logs. Every feed impression is logged with the user, the event, the position in the feed, the feature values used at scoring time, and a timestamp. A positive label is a registration attributed to an impression, say within three days of the user seeing the event, with the window chosen by measuring how long registrations actually take to arrive and cutting where the curve flattens. Negatives are where the design gets subtle, because there are two natural choices and they teach the model different things.
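
The attribution rule for positives can be sketched as a join between impression logs and registration logs. This is an illustration under assumptions: the tuple layouts are hypothetical, and the three-day window is the example value from the text rather than a measured one.

```python
from datetime import datetime, timedelta

# Example window from the text; in practice it is read off the registration-delay curve.
ATTRIBUTION_WINDOW = timedelta(days=3)


def label_impressions(impressions, registrations, window=ATTRIBUTION_WINDOW):
    """Label each impression 1 if a registration followed within the window.

    impressions:   list of (user_id, event_id, shown_at)
    registrations: list of (user_id, event_id, registered_at)
    Returns a list of (user_id, event_id, shown_at, label).
    """
    reg_times = {}
    for user_id, event_id, registered_at in registrations:
        reg_times.setdefault((user_id, event_id), []).append(registered_at)

    labeled = []
    for user_id, event_id, shown_at in impressions:
        times = reg_times.get((user_id, event_id), [])
        positive = any(shown_at <= t <= shown_at + window for t in times)
        labeled.append((user_id, event_id, shown_at, int(positive)))
    return labeled
```

One simplification to note is that a single registration can credit several impressions of the same event here, so a production job would likely pick one impression per registration.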

Impressed-but-not-registered negatives are events the user actually saw and passed over. They are informative because the user made a real choice between visible alternatives, but they inherit exposure bias, which means the old system only showed events it already liked, so the model never learns anything about the events that were never shown, and training purely on impressions narrows the system's worldview toward what the previous model believed. Random negatives, sampled from the live event pool, teach broad discrimination but are mostly too easy, since a knitting circle 3,000 miles away teaches the model little beyond what the distance feature already says. The working answer is a blend, mostly impression negatives with a fraction of random ones, and explaining why each ingredient is present is worth more in the interview than defending an exact ratio. Position bias also lives in these logs, because items shown at the top get registered more simply because they were seen, regardless of fit. The standard mitigations are to log the position and include it as a training-time feature that is fixed to a constant at serving time, so the model can attribute some of the lift to placement rather than content, or to randomize the order for a small slice of traffic and prefer that slice for evaluation. Labels carry their own noise as well. No-shows make registration an optimistic proxy, and free events make it a cheap one, since registering costs nothing, so a model trained only on raw registrations drifts toward overvaluing free inventory, and weighting paid registrations higher is a simple counter.
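
The negative blend described above could be assembled roughly as follows. The function name, the `random_fraction` default of 0.2, and the source tags are assumptions made for the sketch, since the text deliberately does not commit to an exact ratio.

```python
import random


def blend_negatives(impression_negatives, live_event_ids, positives,
                    random_fraction=0.2, rng=None):
    """Mix impression negatives with random negatives from the live pool.

    random_fraction is the share of random negatives in the final blend.
    Returns (event_id, source) pairs so each source can be analyzed separately.
    """
    if not 0.0 <= random_fraction < 1.0:
        raise ValueError("random_fraction must be in [0, 1)")
    rng = rng or random.Random(0)

    # Solve n_random / (n_impression + n_random) = random_fraction for n_random.
    n_random = round(len(impression_negatives) * random_fraction / (1.0 - random_fraction))

    # Never sample an event the user registered for or already saw.
    excluded = set(positives) | set(impression_negatives)
    pool = [e for e in live_event_ids if e not in excluded]
    sampled = rng.sample(pool, min(n_random, len(pool)))

    return ([(e, "impression") for e in impression_negatives]
            + [(e, "random") for e in sampled])
```

Keeping the source tag on each row makes it possible to check later whether the random slice is actually adding information or only restating the distance feature.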

Features and representation

The feature families do most of the work in this system, precisely because per-event history cannot. Each family below substitutes for a signal that a long-lived catalog would have provided for free.

Geography comes first because it gates everything else. Distance from home and distance from work are kept as separate features because they interact with time. Consider a user whose home is 9 miles from a candidate event but whose office is half a mile from it. For a Thursday 6:30pm talk the work distance is the one that matters, while for a Sunday morning run the home distance decides whether they will get out of bed for it. Feeding both distances along with day-of-week and start-hour lets a tree model learn that interplay on its own, where a single generic distance feature would average the two cases into something wrong for both.
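
As a sketch of the feature row this implies, the snippet below computes both distances with the haversine great-circle formula, which is my choice for illustration since the text does not say how distance is measured. The feature names are hypothetical.

```python
import math
from datetime import datetime

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def geo_time_features(home, work, event_loc, start):
    """Keep home and work distance separate and let the model cross them with time."""
    return {
        "dist_home_miles": haversine_miles(*home, *event_loc),
        "dist_work_miles": haversine_miles(*work, *event_loc),
        "day_of_week": start.weekday(),   # Monday = 0
        "start_hour": start.hour,
    }
```

The function does no crossing itself, because the point of feeding the raw pieces to a tree model is that the splits discover the distance-by-hour interaction.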

Temporal fit comes from the user's own history, summarized as the fraction of their past registrations falling in each day-of-week and time-of-day bucket. A user whose history is 70 percent weekend-daytime should see a Tuesday 10pm concert ranked skeptically even if the topic matches perfectly, because the schedule mismatch predicts a no-show better than any topic feature can.
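
A small sketch of that summary, assuming a coarse bucketing into weekday or weekend crossed with morning, afternoon, and evening. The bucket boundaries and function names are hypothetical, and a real system would tune the granularity.

```python
from collections import Counter
from datetime import datetime


def time_bucket(start):
    """Map a start time to a coarse day-type and time-of-day bucket."""
    day_type = "weekend" if start.weekday() >= 5 else "weekday"
    if start.hour < 12:
        part = "morning"
    elif start.hour < 17:
        part = "afternoon"
    else:
        part = "evening"
    return f"{day_type}_{part}"


def schedule_profile(past_starts):
    """Fraction of the user's past registrations falling in each bucket."""
    counts = Counter(time_bucket(s) for s in past_starts)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()} if total else {}


def temporal_fit(profile, candidate_start):
    """Feature value: share of the user's history in the candidate's bucket."""
    return profile.get(time_bucket(candidate_start), 0.0)
```

A user with no history gets an empty profile and a fit of 0.0 for everything, which the model should learn to read alongside a history-length feature rather than as a strong negative.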

Topic match uses text embeddings. An embedding is a learned list of numbers representing a piece of text so that texts with similar meaning end up near each other in the vector space. Each event's title and description are embedded once at creation, and the user's interest profile is the average of the embeddings of events they previously registered for. The cosine similarity between the two, an angle-based measure of how aligned two vectors are, becomes a single powerful feature. A hiking enthusiast might score 0.71 against a trail-running meetup and 0.18 against a wine tasting, and the model learns how much weight that similarity deserves relative to distance and time. Compressing two rich vectors down to one number is a deliberate simplification, and revisiting it later is the main argument for a neural ranker.
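
The profile-and-similarity computation is short enough to show directly. This sketch uses plain Python lists in place of real embedding vectors, and the helper names are hypothetical.

```python
import math


def mean_vector(vectors):
    """User interest profile: element-wise average of past event embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real text embeddings.
past_events = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
profile = mean_vector(past_events)
topic_match = cosine_similarity(profile, [1.0, 0.1, 0.0])
```

Averaging is the simplest possible profile, and it blurs users with several distinct interests, which is part of what the later neural ranker discussion is reacting to.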

Organizer track record substitutes for the item history that events never live long enough to collect. The organizer's past attendance rate, event count, and average rating stand in for the reviews the event itself will never have, smoothed toward the global average when the organizer has run only one or two events so that a single lucky event does not look like a flawless record. Smoothing here means blending the observed rate with a prior in proportion to how much evidence exists, so an organizer with two events sits near the global mean while one with two hundred earns their own number. Social proof enters as the count of the user's friends already registered plus a binary any-friend flag, because the jump from zero friends attending to one changes behavior more than the jump from three to four. Price enters relative to the user's own paid history rather than as a raw number. All of these are computed and served through a feature store, the shared system that materializes feature values for both training and serving so the model sees the same definitions in both places.
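
The smoothing described above can be written as a weighted blend. In this sketch `prior_strength` is a hypothetical parameter meaning roughly how many events' worth of evidence the global average counts for, and the value 10 is an assumption.

```python
def smoothed_rate(observed_rate, n_events, global_rate, prior_strength=10.0):
    """Blend an organizer's observed rate with the global rate.

    The observed rate gets weight n / (n + prior_strength), so little
    evidence stays near the global mean and lots of evidence dominates.
    """
    weight = n_events / (n_events + prior_strength)
    return weight * observed_rate + (1.0 - weight) * global_rate


# An organizer with two perfect events barely moves off a 0.60 global rate,
# while one with two hundred events at 0.90 mostly keeps their own number.
newcomer = smoothed_rate(1.0, 2, 0.60)
veteran = smoothed_rate(0.90, 200, 0.60)
```

With these example inputs the newcomer lands near 0.67 and the veteran near 0.89, which matches the intent that one lucky event should not read as a flawless record.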

Model architecture

The baseline that earns its place is a gradient-boosted decision tree model, or GBDT, which is an ensemble of many small decision trees trained in sequence, where each new tree corrects the residual errors of the trees before it. GBDTs are strong on exactly this kind of tabular feature set, with mixed numeric and categorical features at wildly different scales, and they need little feature engineering beyond what is already described. Tree splits handle the nonlinearity in distance, where the difference between 1 and 5 miles matters far more than the difference between 21 and 25, without anyone designing buckets by hand. Training takes minutes on tens of millions of rows, scoring a candidate costs microseconds, and trained with a logistic loss the model outputs usable probabilities. A logistic regression would be simpler still, but it loses here because the interactions described above, distance crossed with hour and price crossed with history, carry most of the signal, and a linear model only sees them if every cross is built by hand.

A neural ranker earns its keep later, under pressure that can be named precisely. It wins when raw embeddings should interact with other features inside the model instead of being compressed to a single cosine number, when the team wants multi-task heads predicting click, registration, and attendance together, or when traffic has grown to where a 1 percent relative gain pays for the serving complexity. Against those gains it costs accelerated hardware in the serving path, slower experiment turnaround, and a calibration story that has to be re-earned. The mature answer in the interview is to start with the GBDT, instrument everything, and move to a neural model when error analysis shows interaction effects the trees cannot reach, rather than opening with the most fashionable architecture and building infrastructure it does not yet need.

Offline evaluation

Evaluation discipline matters more here than the metric menu. The split must be time-based, training on weeks 1 through 8, validating on week 9, and testing on week 10. A random split leaks the future in two ways, because the same event lands in both train and test carrying its eventual popularity with it, and because the model effectively peeks at registration patterns that had not happened yet. The leak is flattering, so the offline numbers come out beautiful and the online test then lands flat. With perishable inventory the time split also mirrors reality, since at serving time the model only ever scores events whose outcomes are unknown.
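
A time-based split is mostly a matter of cutting on the impression timestamp. The sketch below assumes each logged row is a dict with a hypothetical `shown_at` field and uses the 8, 1, 1 week layout from the text as defaults.

```python
from datetime import datetime, timedelta


def time_split(rows, start, train_weeks=8, valid_weeks=1, test_weeks=1):
    """Split rows by impression time into consecutive train/valid/test windows."""
    valid_start = start + timedelta(weeks=train_weeks)
    test_start = valid_start + timedelta(weeks=valid_weeks)
    test_end = test_start + timedelta(weeks=test_weeks)

    train, valid, test = [], [], []
    for row in rows:
        ts = row["shown_at"]
        if start <= ts < valid_start:
            train.append(row)
        elif valid_start <= ts < test_start:
            valid.append(row)
        elif test_start <= ts < test_end:
            test.append(row)
        # Rows outside the ten-week span are dropped.
    return train, valid, test
```

No shuffling happens anywhere, so every training row strictly precedes every validation row, which is the property a random split throws away.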

Recall@K measures whether the things the user actually chose appear in the top K recommendations. If a validation-week user registered for two events and the system's top 10 for them contains one of the two, their recall@10 is 0.5, and the reported number is the average over users, perhaps 0.38 for a decent system. Measuring recall separately at the candidate-generation stage and at the final ranking tells you which stage to fix, since a relevant event missing from the 800 candidates is a retrieval problem that no amount of ranker tuning can repair. AUC measures discrimination, defined as the probability that a randomly chosen positive is scored above a randomly chosen negative. With one positive scored 0.7 against negatives scored 0.8, 0.6, and 0.4, the positive outranks two of the three, so AUC is 2/3, about 0.67. Alongside both I would check calibration by bucketing predictions and comparing each bucket's average prediction to its realized rate, because if events predicted at 8 percent register at 3 percent, every downstream business rule built on the scores inherits the error.
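
The three checks above are small enough to write out, and doing so reproduces the worked numbers in the text. This is a plain-Python sketch with hypothetical function names, and the quadratic AUC loop is for clarity rather than something to run on full logs.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the user's chosen events that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return None  # users with no positives are skipped in the average
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def auc(pos_scores, neg_scores):
    """Probability a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def calibration_table(preds, labels, n_buckets=10):
    """Per-bucket (bucket, count, mean prediction, realized rate)."""
    buckets = [[] for _ in range(n_buckets)]
    for p, y in zip(preds, labels):
        buckets[min(int(p * n_buckets), n_buckets - 1)].append((p, y))
    table = []
    for idx, items in enumerate(buckets):
        if items:
            mean_pred = sum(p for p, _ in items) / len(items)
            rate = sum(y for _, y in items) / len(items)
            table.append((idx, len(items), mean_pred, rate))
    return table
```

Calling `auc([0.7], [0.8, 0.6, 0.4])` gives 2/3, matching the example, and a bucket whose mean prediction is 0.08 against a realized rate of 0.03 is the miscalibration case the paragraph warns about.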

The serving architecture

Scoring 5 million live events per request is out of the question, so serving is shaped as a funnel in which cheap candidate generation narrows millions down to hundreds and the ranker spends its budget only on those. The first and biggest cut is feasibility. Events are indexed by location cell and start date, where a location cell is a fixed geographic tile a few miles across, so a query for events within 10 miles and the next 30 days touches a handful of cells crossed with 30 date buckets. That single index lookup takes 5 million live events worldwide down to perhaps 2,000 in the user's metro area and around 600 once the 30-day window applies, a reduction of nearly four orders of magnitude bought with no machine learning at all. Two smaller sources widen the pool beyond pure geography, namely events the user's friends registered for and events under topics the user follows, each contributing a few hundred candidates. They exist because the geo index is intent-blind and would never surface the slightly-farther event that all of the user's climbing partners just joined. After merging and deduplication, roughly 800 candidates flow to feature assembly and the GBDT, and a final re-rank pass enforces diversity before the top 20 reach the client, so the feed does not become five yoga classes in a row.
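
A toy version of the geo-time index makes the feasibility cut concrete. Everything here is an assumption for illustration: cells are a fixed latitude-longitude grid of 0.05 degrees (a few miles of latitude per tile), the class and method names are hypothetical, and a production system would more likely use a purpose-built spatial indexing scheme.

```python
import math
from collections import defaultdict
from datetime import date, timedelta

CELL_DEGREES = 0.05  # hypothetical tile size, about 3.5 miles of latitude


def cell_of(lat, lon):
    """Fixed grid tile containing the point."""
    return (math.floor(lat / CELL_DEGREES), math.floor(lon / CELL_DEGREES))


class GeoTimeIndex:
    """Postings keyed by (location cell, start date)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, event_id, lat, lon, start_date):
        self.postings[(cell_of(lat, lon), start_date)].add(event_id)

    def query(self, lat, lon, today, cell_radius=3, days=30):
        """Events in nearby cells starting within the next `days` days."""
        cx, cy = cell_of(lat, lon)
        found = set()
        for dx in range(-cell_radius, cell_radius + 1):
            for dy in range(-cell_radius, cell_radius + 1):
                for offset in range(days):
                    key = ((cx + dx, cy + dy), today + timedelta(days=offset))
                    found.update(self.postings.get(key, ()))
        return found
```

The query touches a bounded number of keys regardless of how many events exist worldwide, which is where the four orders of magnitude come from. The square of cells only approximates a 10-mile radius, so an exact distance filter would still run on the survivors.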

The latency budget holds comfortably, and walking it stage by stage is worth doing aloud. The geo-time index lookup costs about 10 milliseconds, and the social and topic fetches run in parallel inside that same window, so the candidate stage closes at roughly 10 milliseconds. Feature assembly takes the next 25 milliseconds, dominated by batched feature-store reads, which is why the store is keyed for bulk lookup rather than per-candidate round trips, since 800 sequential reads at even a millisecond each would blow the budget on their own. Scoring 800 candidates through a few hundred shallow trees costs under 30 milliseconds on one core, because each candidate is only a few thousand comparisons, and the diversity pass adds about 5 more. The ML path therefore lands near 70 milliseconds against the 200 millisecond envelope, and the slack is deliberate, since the budget is written against tail latency rather than the average and the headroom lets a heavier model ship later without re-architecting the path.
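
The budget arithmetic can be checked in a few lines. The stage costs are the figures from the walk-through, except the social and topic fetch times, which are assumed values below the 10 millisecond geo lookup.

```python
TOTAL_BUDGET_MS = 200


def ml_path_ms(geo, social, topic, features, scoring, diversity):
    """Critical-path latency: parallel candidate fetches cost their maximum."""
    candidate_stage = max(geo, social, topic)
    return candidate_stage + features + scoring + diversity


# Social and topic timings (8 and 9 ms) are assumptions for the example.
path = ml_path_ms(geo=10, social=8, topic=9, features=25, scoring=30, diversity=5)
headroom = TOTAL_BUDGET_MS - path

# The anti-pattern from the text: 800 sequential reads at 1 ms each.
sequential_reads_ms = 800 * 1
```

Here `path` comes to 70 and `sequential_reads_ms` to 800, so per-candidate round trips alone would cost four times the entire envelope.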

Figure: The serving funnel. The geo-time index does the heavy elimination, social and topic sources widen the pool, and the GBDT spends the latency budget on roughly 800 feasible candidates before a diversity pass picks the final 20.

Freshness and the event lifecycle

The second hard part is the lifecycle of an event. Because items are born and die continuously, the candidate index cannot be rebuilt nightly the way a movie catalog can, and it has to absorb new events incrementally instead. When an organizer publishes an event, it passes an automated validation step for policy and quality, an embedding worker converts the title and description into a vector, and the event is appended to the postings for its location cell and date bucket. The whole path is a stream pipeline, meaning each event flows through as an individual message rather than waiting for a scheduled batch, and the target is minutes from publish to recommendable. Batching new events hourly would look close enough on paper, but it concedes the first and most valuable hours of a hot event's life, when early registrations build the social proof that later ranking feeds on. Expiry is the mirror image of insertion, since each entry carries the start time and the index drops events the moment they begin, so feasibility never depends on a cleanup job that might lag behind and recommend something already underway.
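
The publish and expiry halves of the lifecycle can be sketched together. This is an in-memory illustration only: `LiveEventIndex`, `on_event_published`, and the `validate` and `embed` callables are hypothetical stand-ins for the real validation service, embedding worker, and index, and timestamps are any comparable values such as epoch seconds.

```python
import heapq


class LiveEventIndex:
    """Holds live events and drops each one as soon as its start time passes."""

    def __init__(self):
        self._by_id = {}
        self._expiry_heap = []  # (start_time, event_id), earliest first

    def publish(self, event_id, start_time, payload):
        self._by_id[event_id] = (start_time, payload)
        heapq.heappush(self._expiry_heap, (start_time, event_id))

    def _expire(self, now):
        while self._expiry_heap and self._expiry_heap[0][0] <= now:
            start_time, event_id = heapq.heappop(self._expiry_heap)
            entry = self._by_id.get(event_id)
            # Skip stale heap entries left behind by a rescheduled event.
            if entry is not None and entry[0] == start_time:
                del self._by_id[event_id]

    def live_ids(self, now):
        """Expiry runs on every read, so no lagging cleanup job is involved."""
        self._expire(now)
        return set(self._by_id)


def on_event_published(event, index, validate, embed):
    """Handle one publish message: validate, embed, then append to the index."""
    if not validate(event):
        return False
    event["embedding"] = embed(event["title"] + " " + event["description"])
    index.publish(event["id"], event["start_time"], event)
    return True
```

Checking expiry at read time is one way to honor the rule that feasibility never waits on a background job, at the cost of a little work on each query.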

New events score sensibly on day one because nothing in the feature set requires per-event history, which is the payoff of the feature design chosen earlier. Distance, time fit, topic similarity, price, and organizer track record all exist at creation. The ranking model itself retrains nightly on the rolling registration logs, and nightly is frequent enough because what shifts day to day is the inventory rather than the relationship between features and registration. The embedding model is retrained far less often, on the order of months, because changing it silently moves every event's vector and every user profile along with them, and a re-embedding migration has to be planned, backfilled, and verified rather than allowed to drift in overnight.

Figure: The freshness pipeline. A published event is validated, embedded, and appended to the geo-time candidate index within minutes, while the dashed loop retrains the ranker nightly from registration logs.

Online evaluation and operations

Launch decisions come from A/B tests, where a random slice of users gets the new model and the rest keep the old one. The immediate metric is registration rate per session, but the truer goals are attendance rate and 28-day repeat usage, because a system that drives registrations for events people skip is optimizing the proxy rather than the product. Those deeper metrics move slowly and need weeks of data before a test reads cleanly, so the working practice is to gate launches on registration rate with attendance and retention as guardrails reviewed over longer windows, and to keep a long-running holdout on the old experience so slow retention drifts stay measurable.

Cold start splits into two problems with different stakes. New users get a feed assembled from location plus popular-and-diverse local events, sharpened by onboarding topic picks if the product asks for them, and the system simply behaves like a good non-personalized ranker until a few registrations accumulate. New organizers face the colder start, because the track-record features that protect users also bury newcomers, and a marketplace where established organizers absorb all the exposure eventually ossifies into a place new supply avoids. A small deliberate boost for events from new organizers, or a reserved exploration slot in the feed, spends a little short-term registration rate to keep the supply side healthy, and the same machinery doubles as the exploration the model needs to gather labels beyond its own past choices.
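
The reserved exploration slot could work roughly like this. The function name, the slot position, and the rule of skipping the reservation when a new-organizer event already ranks high are all assumptions made for the sketch.

```python
def feed_with_exploration(ranked_ids, new_organizer_ids, feed_size=20, slot=4):
    """Reserve one feed position for the best-ranked new-organizer event.

    ranked_ids:        all candidate event ids, best first
    new_organizer_ids: ids of events run by new organizers
    slot:              0-indexed feed position reserved for exploration
    """
    feed = list(ranked_ids[:feed_size])
    new_set = set(new_organizer_ids)

    # Nothing to do if a new-organizer event already earned the slot or better.
    if any(e in new_set for e in feed[:slot + 1]):
        return feed

    pick = next((e for e in ranked_ids if e in new_set), None)
    if pick is None:
        return feed

    if pick in feed:
        feed.remove(pick)
    feed.insert(slot, pick)
    return feed[:feed_size]
```

Logging which impressions came from the reserved slot matters as much as the slot itself, since those rows are the ones that give the model labels outside its own past choices.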

Operationally, the model retrains nightly, and the things worth monitoring continuously are the inputs and outputs rather than the weights. Feature distributions are watched for drift, the gradual change in input data over time, because a silent geocoding regression that shifts the distance distribution will degrade rankings days before any business metric reacts. The shape of that incident from the on-call seat is worth describing. Registration rate looks normal for a day or two because users still register for whatever is shown, the median distance-to-event feature has quietly doubled, and the only alarm that fires early is the one comparing the live feature distribution against a training-time snapshot. The user, meanwhile, sees a feed that feels subtly off, with events from the wrong side of the city, and almost never reports it; they simply open the app less. The average predicted registration probability is compared against the realized rate as a standing calibration check, and a sustained gap is treated as an incident even while engagement still looks fine. Training-serving skew, where the features computed offline for training differ from those computed online at serving time, is contained by logging the exact serving-time feature values and training on those logs rather than recomputing features after the fact.
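
One common way to compare a live feature distribution against a training-time snapshot is the population stability index, which is my choice here for illustration rather than something the design prescribes. The bin count and the 0.2 alert threshold are assumptions, although thresholds in the 0.1 to 0.25 range are widely used rules of thumb.

```python
import bisect
import math

PSI_ALERT_THRESHOLD = 0.2  # assumed alerting level


def quantile_edges(values, n_bins):
    """Bin edges at the training snapshot's quantiles."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]


def bin_fractions(values, edges):
    """Share of values falling in each bin defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(train_values, live_values, n_bins=10, eps=1e-6):
    """Population stability index of live values against the training snapshot."""
    if not train_values or not live_values:
        raise ValueError("both samples must be non-empty")
    edges = quantile_edges(train_values, n_bins)
    p = bin_fractions(train_values, edges)
    q = bin_fractions(live_values, edges)
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))


def distance_drift_alarm(train_distances, live_distances):
    return psi(train_distances, live_distances) > PSI_ALERT_THRESHOLD
```

In the geocoding incident described above, a doubled distance distribution pushes most live values into the top training bin, so the index jumps well past the threshold while registration rate still looks healthy.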

Follow-up questions

Why does classic collaborative filtering fail here? It needs many interactions per item before its estimates mean anything, and an event dies at its start time before accumulating them, so the per-item factors never converge. The fix is to move the behavioral signal up a level, to topics, organizers, and users, which live long enough to learn, and to let content and context features describe the individual event.
Impression negatives or random negatives? A blend of both works best. Impression negatives teach fine distinctions among plausible events but inherit the old system's exposure bias, while random negatives teach broad feasibility but are mostly too easy to be informative on their own, so mostly impressions with a fraction of random samples covers both failure modes at once.
How does the system handle a user who just moved cities? It degrades gracefully because the features generalize rather than memorize. Distance, time fit, and topic similarity carry over to the new metro with no per-city learning, the geo index simply starts returning candidates around the new location, and only the social-proof features go quiet until new connections form there.
When would you replace the GBDT with a neural ranker? The move makes sense when error analysis shows interactions the trees cannot express, typically between raw text embeddings and other features, or when multi-task prediction of click, registration, and attendance together is wanted. The measured gain has to pay for accelerated serving hardware, slower iteration, and a calibration story that must be re-earned.
Why a strict time-based evaluation split? A random split puts the same event in train and test with its eventual popularity attached, which leaks the future and flatters every offline number. Training on weeks 1 to 8 and validating on week 9 matches what serving actually faces, namely events whose outcomes are still unknown at scoring time.
Registration or attendance as the label? Training uses registration because it is plentiful and arrives within days, which keeps nightly retraining fed, while attendance stays in the system as a guardrail metric, a sample weight, or a second task. Attendance alone is truer to the product but too sparse and too delayed to drive daily training by itself.

References

Aminian and Xu, Machine Learning System Design Interview (2023), chapter on event recommendation.
Covington, Adams, and Sargin, Deep Neural Networks for YouTube Recommendations (RecSys 2016), on the candidate generation and ranking funnel.
Chen and Guestrin, XGBoost: A Scalable Tree Boosting System (KDD 2016), on gradient-boosted trees for tabular prediction.
Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality (NeurIPS 2013), the embedding objective behind the text features.