From news feeds and short-video apps to the “For You” rows on your TV, almost everything you see online is sorted by ranking code nobody outside the company ever sees. These systems are blamed for echo chambers, misinformation and “hijacking our attention”. Yet inside large consumer platforms, recommender teams are actually juggling a messy mix of KPIs: watch time, satisfaction, safety, diversity, ad revenue and more.
To cut through the hype, we spoke with Lev Fedorov, an engineer who has spent the last several years building large-scale recommenders for video feeds and image search in consumer products. In this conversation, he talks about what modern recommenders really optimize for, what happens when you rewire ranking in production, and how teams now treat disinformation and “attention addiction” as first-class product constraints rather than afterthoughts.
“The algorithm maximizes outrage”: how close is that to reality?
Q: If you had to explain your recommender’s goal to a CEO in one slide, what would it say – and how different is that from the meme “the algorithm maximizes outrage”? What are the real objectives you’ve seen teams actually optimize for day to day?
Lev Fedorov: On a single slide, I’d write:
“Allocate limited user attention to the content most likely to create long-term value for the user, creators and the platform.”
At scale, the hard problem isn’t generating content – it’s selecting from a massive surplus. There are far more posts than any individual could ever consume, and time is the scarcest resource. The recommender exists to automate that choice and surface content that is genuinely useful or interesting for a specific person.
In practice, teams approximate this with proxies like long-term retention, repeat visits, meaningful engagement and session quality – what we often shorthand as user happiness.
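As a rough illustration of what such a proxy can look like when it is actually computed, here is a minimal sketch that blends those signals into a single offline score. The signal names and weights are assumptions for illustration, not a production formula.

```python
# Hypothetical sketch: combining long-term proxies into a single "user happiness"
# score for offline evaluation. Signal names and weights are illustrative only.

def user_happiness_proxy(retention_30d: float,          # fraction of weeks active in the last 30 days
                         repeat_visits: float,          # normalized visit frequency
                         meaningful_engagement: float,  # likes/saves/shares per session, normalized
                         session_quality: float) -> float:  # e.g. completion rate, low skip rate
    weights = {"retention": 0.4, "visits": 0.2, "engagement": 0.2, "quality": 0.2}
    return (weights["retention"] * retention_30d
            + weights["visits"] * repeat_visits
            + weights["engagement"] * meaningful_engagement
            + weights["quality"] * session_quality)

# Example: a user who comes back often but rarely engages deeply.
print(user_happiness_proxy(0.9, 0.8, 0.2, 0.5))
```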
What doesn’t fit on that first slide, but matters just as much, is creator happiness. For many creators this is a profession, and without a strong recommender they may never find an audience. A healthy system helps good creators get discovered, grow organically and monetize sustainably. That creator–user balance isn’t altruism; it’s the foundation of long-term growth, retention and content quality.
This is also where the “the algorithm maximizes outrage” meme comes from. There’s a kernel of truth: emotionally charged content often performs well on short-term engagement metrics. But outrage isn’t the objective – it’s a side effect of optimizing imperfect proxies in isolation. In real systems, teams spend a lot of time constraining those proxies, adding safety signals, and trading off short-term engagement against long-term trust, retention and ecosystem health. The goal isn’t to maximize emotion, but to build a recommender that people and creators still want to use tomorrow.
The dashboards everyone watches, and the metrics that quietly hurt you
Q: When you open the live dashboards for a big consumer recommender, what’s at the top, and which of those metrics are secretly dangerous if you stare at them in isolation? Any example where “good” numbers hid a bad user experience?
When you open live dashboards, the top of the screen is usually very pragmatic. On the product side you’ll see impressions, clicks, CTR, and some form of watch time or dwell time. On the system side you’ll see latency, error rates and pipeline health. These are the numbers people look at multiple times a day.
The most dangerous metrics, if you stare at them in isolation, are clicks and CTR. They’re extremely sensitive and very easy to improve – often for the wrong reasons. Clickbait, sensational framing or emotionally provocative content can push CTR up quickly while quietly degrading trust and perceived feed quality.
We’ve seen this failure mode firsthand. In one iteration, aggressively optimizing for clicks led the system to promote content that was very attention-grabbing at first glance, but users often abandoned it quickly or felt misled. Dashboards looked great in the short term; qualitative feedback and longer-term signals told a different story.
That’s why mature teams rarely make decisions on a single metric. Changes are evaluated across a combination of clicks, time spent, explicit positive and negative signals (likes, hides, dislikes) and, critically, long-term retention. Many bad user experiences look like wins on day one and only show up weeks later if you bother to look.
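A hedged sketch of that kind of multi-metric launch rule is below; the metric names and thresholds are made up for the example, not any team’s actual launch criteria.

```python
# Illustrative sketch of a multi-metric launch decision with guardrails.
# Metric names and thresholds are hypothetical.

from dataclasses import dataclass

@dataclass
class ExperimentReadout:
    ctr_delta: float          # relative change in click-through rate
    time_spent_delta: float   # relative change in time spent
    hides_delta: float        # relative change in hides/dislikes (negative signal)
    retention_delta: float    # relative change in e.g. day-28 retention

def launch_decision(r: ExperimentReadout) -> str:
    # Guardrails: negative signals and long-term retention veto a launch
    # even when short-term engagement looks like a win.
    if r.hides_delta > 0.02 or r.retention_delta < -0.001:
        return "no-launch: guardrail violated"
    if r.ctr_delta > 0 and r.time_spent_delta >= 0:
        return "launch candidate: review long-term holdback"
    return "neutral: keep iterating"

# "Good" CTR numbers hiding a bad experience: clicks up, hides up, retention down.
print(launch_decision(ExperimentReadout(ctr_delta=0.05, time_spent_delta=0.01,
                                        hides_delta=0.04, retention_delta=-0.003)))
```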
The “no duplicates” rule that backfired
Q: In one of the short-video feeds you worked on, you mentioned a long-standing rule to “never show duplicates”. Can you walk us through that experiment where a small-looking change in the ranking stack, around candidate selection and video targets, ended up changing user behavior in a way that product managers did not predict from the spec?
In our system there was a long-standing product rule: never show duplicates. If a post had already appeared in a user’s feed – even if they hadn’t interacted with it – it was filtered out. The assumption was straightforward: repeats would feel like low-quality recommendations and hurt satisfaction.
During a ranking improvement, a seemingly small change in candidate selection had an unintended side effect: previously shown posts started slipping back into the candidate pool. When we noticed this, the expectation from the product side was clear – the A/B test would fail.
The opposite happened. A significant segment of users actually responded better to repeated content. The explanation became obvious in hindsight: when a post first appears, a user may not have the time, attention or context to engage. Seeing the same content again later – at a better moment – led to more meaningful interactions.
From a PM perspective, this behavior was completely non-obvious from the original spec. “No duplicates” sounded like a quality guarantee, but in practice it was suppressing useful second chances. After the experiment, we turned this from an accidental behavior into an explicit feature and started deliberately re-recommending previously seen content under controlled conditions.
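Here is a minimal sketch of what “controlled re-recommendation” can look like as a candidate-eligibility rule; the field names, time gap and per-session cap are illustrative assumptions, not the actual production logic.

```python
# Hypothetical sketch of "controlled re-recommendation": previously shown posts stay
# eligible under explicit conditions instead of being filtered out forever.

from dataclasses import dataclass
import time

@dataclass
class Impression:
    post_id: str
    shown_at: float          # unix timestamp of the last impression
    engaged: bool            # did the user interact at all?
    negative_feedback: bool  # hide / "not interested" / dislike

MIN_RESHOW_GAP_S = 3 * 24 * 3600   # wait at least a few days before re-showing
MAX_RESHOWS_PER_SESSION = 2        # keep repeats a small fraction of the feed

def eligible_for_reshow(imp: Impression, now: float) -> bool:
    if imp.negative_feedback:        # explicit rejection: never bring it back
        return False
    if imp.engaged:                  # already consumed: treat as a true duplicate
        return False
    return now - imp.shown_at >= MIN_RESHOW_GAP_S   # give it a "second chance" later

def reshow_candidates(history: list[Impression], now: float) -> list[str]:
    picks = [i.post_id for i in history if eligible_for_reshow(i, now)]
    return picks[:MAX_RESHOWS_PER_SESSION]

now = time.time()
history = [Impression("a", now - 5 * 24 * 3600, engaged=False, negative_feedback=False),
           Impression("b", now - 1 * 24 * 3600, engaged=False, negative_feedback=False),
           Impression("c", now - 9 * 24 * 3600, engaged=True, negative_feedback=False)]
print(reshow_candidates(history, now))  # only "a" qualifies
```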
The broader lesson was that small, local changes in the ranking stack can surface very human behaviors that product intuition often underestimates – especially around timing and attention.
Echo chambers: structural risk or convenient narrative?
Q: Based on the data you’ve seen, do echo chambers look more like a real structural problem or more like a convenient narrative? What are the concrete signals that tell you “this feed is turning into a bubble” versus “this is just healthy personalization”?
Echo chambers are a real structural risk – but they’re often oversimplified in public narratives. For an individual user, a narrow feed can look perfectly fine: “I like cats and dogs, so I’m happy seeing more cats and dogs.”
Internally, teams call this exploitation: the recommender focuses on what it already knows works for the user and keeps reinforcing those signals. In the short term, that maximizes relevance and engagement.
The problem shows up at the product level and over time. People are more multidimensional than their recent clicks, and their interests change. A feed that only exploits past behavior eventually stops learning. Users discover less new content and end up with a narrower experience, even if individual recommendations still look “relevant”.
That’s where exploration comes in – deliberately introducing new topics, creators or formats to test whether the user’s interests are broader or evolving. Exploration usually comes with a small short-term cost in relevance, but it’s essential for keeping the system adaptive and preventing it from collapsing into a single theme.
The difference between healthy personalization and a real bubble shows up in concrete signals. In a bubble you typically see shrinking content coverage, fewer distinct topics or creators per user over time, and weak responses when exploratory content is occasionally introduced. In a healthy system, even highly personalized feeds still show topic churn, successful discovery of new interests, and recovery when user behavior shifts.
In practice, echo chambers aren’t the inevitable result of personalization; they’re the result of imbalance. Exploitation without exploration turns personalization into a closed loop. The right balance between the two keeps feeds both relevant and diverse.
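As a toy illustration of those signals, a bubble check might compare a user’s topic coverage over time with their response to exploratory content. The thresholds below are invented for the example.

```python
# Illustrative "bubble vs. healthy personalization" check: shrinking topic coverage
# plus weak response to exploratory content. Thresholds are made-up assumptions.

def distinct_topics(week_of_impressions: list[str]) -> int:
    return len(set(week_of_impressions))

def looks_like_bubble(weekly_topics: list[list[str]],
                      exploration_engagement_rate: float) -> bool:
    coverage = [distinct_topics(week) for week in weekly_topics]
    shrinking = len(coverage) >= 3 and coverage[-1] < coverage[0] * 0.6
    ignores_exploration = exploration_engagement_rate < 0.01
    return shrinking and ignores_exploration

weeks = [["cats", "dogs", "cooking", "travel"], ["cats", "dogs", "cooking"], ["cats", "dogs"]]
print(looks_like_bubble(weeks, exploration_engagement_rate=0.005))  # True
```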
What actually works to “break the bubble”
Q: Based on your own experiments with feeds and discovery surfaces, what are the most promising tactics you’ve seen to “break the bubble” – and which ones sounded great in a deck but flopped in A/B tests?
The tactics that work best are usually subtle rather than dramatic.
What tends to work:
- Exploration slots. Reserving a small fraction of the feed – for example, every N-th slot – specifically for exploratory content. This gives the model structured opportunities to probe for new interests without overwhelming the user.
- Soft diversity constraints. Instead of hard rules like “never show two posts from the same topic”, we use penalties in the scoring function. That allows the model to balance relevance against diversity in a nuanced way.
- Creator-level diversity. Ensuring users regularly see new or smaller creators alongside established ones. This is good both for discovery and for ecosystem health.
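To make the first two tactics concrete, here is a small sketch of a feed builder with a reserved exploration slot and a soft same-topic penalty folded into the score. The scoring formula, penalty weight and slot period are assumptions for illustration, not the system described above.

```python
# Minimal sketch: exploration slots every N-th position, plus a soft same-topic
# penalty in the score instead of a hard "no repeats" rule. Parameters are illustrative.

from dataclasses import dataclass

@dataclass
class Candidate:
    post_id: str
    relevance: float      # model-estimated relevance for this user
    topic: str
    exploratory: bool     # outside the user's usual interests

TOPIC_PENALTY = 0.15      # soft penalty per already-shown post from the same topic
EXPLORATION_PERIOD = 5    # reserve every 5th slot for exploratory content

def build_feed(candidates: list[Candidate], feed_len: int) -> list[Candidate]:
    remaining = list(candidates)
    feed: list[Candidate] = []
    topic_counts: dict[str, int] = {}

    for position in range(1, feed_len + 1):
        explore_slot = position % EXPLORATION_PERIOD == 0
        pool = [c for c in remaining if c.exploratory] if explore_slot else remaining
        if not pool:
            pool = remaining
        if not pool:
            break
        # Soft diversity: discount, don't forbid, repeats of an over-represented topic.
        best = max(pool, key=lambda c: c.relevance - TOPIC_PENALTY * topic_counts.get(c.topic, 0))
        feed.append(best)
        remaining.remove(best)
        topic_counts[best.topic] = topic_counts.get(best.topic, 0) + 1
    return feed

cands = [Candidate("p1", 0.9, "cats", False), Candidate("p2", 0.85, "cats", False),
         Candidate("p3", 0.8, "cats", False), Candidate("p4", 0.6, "cooking", False),
         Candidate("p5", 0.4, "woodworking", True), Candidate("p6", 0.7, "dogs", False)]
print([c.post_id for c in build_feed(cands, feed_len=6)])
```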
What often sounds good but underperforms:
- Hard topic caps. Rules like “no more than X posts about topic Y per session” look attractive in a slide deck, but in practice they can feel arbitrary. Users sometimes perceive them as the feed “ignoring” their interests.
- Blunt cross-domain injections. Forcing content from completely unrelated domains – say, long-form news into a pure entertainment feed – often leads to quick skips and hides, without any real learning.
Across experiments, the pattern is that users tolerate – and even appreciate – diversity as long as it still feels plausibly relevant. The more a tactic looks like you’re fighting the user’s intentions, the less likely it is to work.
Disinformation moves from policy docs into APIs
Q: Disinformation and political content used to be treated as a policy problem; now they’re a ranking problem. In practical terms, how do those constraints show up in the code and APIs that engineers and product teams actually work with?
The shift shows up not in abstract policy documents, but in very concrete pipelines, flags and constraints that engineers work with every day.
In our system, all new content first goes through a separate ingestion and moderation pipeline before it’s even eligible for recommendation. Nothing enters the recommendation pool by default. One of the key stages in that pipeline is content moderation, implemented as a combination of ML classifiers and human review.
Most content is evaluated automatically: multiple classifiers assess different risk dimensions and produce scores and confidence levels. Based on those signals, the system decides whether the content is allowed, restricted, or blocked from recommendation entirely. When model confidence is low or the risk is high, items are escalated to human moderators, who see both the content and the model outputs and can confirm or override the decision.
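A hedged sketch of that decision logic follows, with invented thresholds and label names; the real pipeline is more involved, but the shape is similar: per-dimension scores and confidences drive an automatic decision plus an escalation flag.

```python
# Hypothetical moderation decision: risk scores and confidences per dimension produce
# an automatic outcome, with low-confidence or high-risk items escalated to humans.
# Thresholds and labels are illustrative assumptions.

def moderation_decision(risk_scores: dict[str, float],
                        confidences: dict[str, float]) -> tuple[str, bool]:
    max_risk = max(risk_scores.values())
    min_conf = min(confidences.values())

    if max_risk >= 0.9:
        decision = "blocked"      # never becomes eligible for recommendation
    elif max_risk >= 0.4:
        decision = "restricted"   # eligible, but flagged for distribution limits
    else:
        decision = "allowed"

    # Low confidence or high risk: a human moderator confirms or overrides the call.
    escalate_to_human = min_conf < 0.6 or max_risk >= 0.7
    return decision, escalate_to_human

print(moderation_decision({"misinfo": 0.2, "violence": 0.1},
                          {"misinfo": 0.95, "violence": 0.90}))   # ('allowed', False)
print(moderation_decision({"misinfo": 0.8, "violence": 0.1},
                          {"misinfo": 0.55, "violence": 0.90}))   # ('restricted', True)
```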
Importantly, this doesn’t translate into the ranking model “picking sides” or encoding political opinions. Policy decisions are converted into structured signals and eligibility constraints. Content that fails moderation simply never appears in the recommendation candidate set. Borderline content may be allowed but carry distribution limits or downranking flags.
As a result, ranking engineers don’t write bespoke logic for political topics or disinformation in the core model. They consume a filtered and annotated content pool through well-defined APIs. Disinformation becomes a ranking problem not because the ranker decides what is true, but because policy constraints are enforced upstream and expressed as signals that shape what the ranking system is allowed to optimize over.
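What the ranking side sees can then be sketched roughly like this: an annotated candidate pool where blocked content is simply absent and borderline content carries flags. The field names and downranking multipliers are illustrative assumptions, not the actual API.

```python
# Illustrative sketch of the ranker consuming an annotated pool: moderation outcomes
# arrive as eligibility and downranking flags, not as topic-specific model logic.

from dataclasses import dataclass, field

@dataclass
class AnnotatedItem:
    post_id: str
    model_score: float                 # output of the core ranking model
    eligible: bool                     # passed the upstream moderation pipeline
    distribution_flags: set[str] = field(default_factory=set)  # e.g. {"borderline"}

DOWNRANK_MULTIPLIERS = {"borderline": 0.3, "limited_reach": 0.5}

def final_score(item: AnnotatedItem) -> float | None:
    if not item.eligible:
        return None                    # blocked content never reaches ranking at all
    score = item.model_score
    for flag in item.distribution_flags:
        score *= DOWNRANK_MULTIPLIERS.get(flag, 1.0)
    return score

pool = [AnnotatedItem("a", 0.8, True),
        AnnotatedItem("b", 0.9, True, {"borderline"}),
        AnnotatedItem("c", 0.95, False)]
ranked = sorted((i for i in pool if final_score(i) is not None),
                key=lambda i: final_score(i), reverse=True)
print([i.post_id for i in ranked])   # ['a', 'b']; 'c' never enters the candidate set
```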
When safety collides with growth
Q: What does it look like from the inside when safety or policy constraints collide with growth? Can you share a situation where the team chose to sacrifice short-term engagement for safety – and how that decision was made?
When safety or policy constraints collide with growth, the tension is immediate and visible in metrics.
For example, when we experimented with loosening some moderation thresholds, engagement metrics like clicks and watch time went up – a tempting short-term win. At the same time, we saw more borderline and outright inappropriate content in feeds, including material unsuitable for younger users and items that triggered elevated complaint and report rates.
The product and trust & safety teams reviewed the signals together – complaints, hides, anomalous reporting patterns, qualitative feedback – and concluded that the short-term gains weren’t worth the long-term harm. The experiment was rolled back, and stricter content constraints were restored.
From an engineering perspective, it’s technically trivial to disable moderation and let the system chase engagement. In the real world, that would quickly make the platform unsafe, increase regulatory risk and erode user trust. None of those outcomes are compatible with sustainable growth, which is why mature organizations treat safety metrics as hard constraints, not optional add-ons.
Are we optimizing people into addiction?
Q: There’s a strong narrative that teams “optimize people into addiction”. From your experience shipping real recommenders, what would a genuinely well-being-first loss function and KPI set look like – and what would product leaders need to give up to adopt it?
A genuinely well-being-first approach would change both what we optimize and over what horizon.
On the modeling side, the loss function would put less weight on raw engagement events and more on signals that correlate with healthy, voluntary use over time. That might include:
- Long-term retention adjusted for intensity. Coming back regularly over months is a better sign than a single binge session.
- Satisfaction signals. Explicit ratings, post-session surveys, or lightweight prompts like “Was this session helpful or a waste of time?”
- Negative friction signals. Sudden uninstalls, aggressive use of “not interested” or “block”, or patterns associated with regretful use.
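Put very roughly in code, a well-being-weighted objective built from signals like these might look as follows; the weights and signal names are hypothetical, not a real training loss.

```python
# Hypothetical well-being-weighted objective: raw engagement is discounted, retention
# is adjusted for binge-like intensity, and regret-like signals subtract value.

def wellbeing_objective(engagement_events: float,      # clicks, watches, likes (normalized)
                        days_active_per_month: float,  # spread of usage over time
                        max_daily_hours: float,        # binge intensity
                        satisfaction_score: float,     # surveys / explicit ratings, 0..1
                        regret_signals: float) -> float:  # uninstalls, mass "not interested"
    retention_adjusted = days_active_per_month / 30.0 * min(1.0, 3.0 / max(max_daily_hours, 0.1))
    return (0.15 * engagement_events
            + 0.40 * retention_adjusted
            + 0.35 * satisfaction_score
            - 0.40 * regret_signals)

# Regular, self-paced use scores higher than a heavy binge with low satisfaction.
print(wellbeing_objective(0.4, 24, 1.5, 0.8, 0.0))
print(wellbeing_objective(0.9, 10, 6.0, 0.3, 0.2))
```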
On the KPI side, you’d still track time spent and engagement – but you’d treat them as diagnostics, not goals. The primary success criteria would be something like “percentage of users with stable, self-paced usage patterns and high satisfaction”, even if that means fewer hours per user.
For product leaders, the trade-offs are real. You may have to accept:
- Slower headline growth in time-spent metrics.
- Saying no to certain dark-pattern optimizations that spike engagement but correlate with regret.
- More investment in measurement – surveys, qualitative research, longitudinal studies – because well-being is harder to proxy than clicks.
The good news is that in the long run, systems that respect users’ limits tend to preserve trust and retention. But you have to be willing to optimize for the long run.
Three non-negotiable principles for a new recommender
Q: If you were joining a new social or media startup as the first ranking engineer, what three non-negotiable principles would you set on day one so the recommender doesn’t turn into a black box that nobody trusts – users, PMs or regulators?
I’d start with three principles:
- Explainable recommendations by design. Users should be able to see why a particular post is shown to them. Sometimes it’s simple – “you follow this author” – sometimes it’s “similar to a post you liked”, possibly with a reference. Even basic explanations go a long way toward building trust and give you a natural place to plug in user controls.
- Strong, easy-to-use negative feedback. “Not interested” buttons, the ability to mute topics or block authors, and clear ways to report problematic content are critical. Many platforms underutilize these signals. They’re some of the most valuable data you can get, because users are literally telling you how to fix their feed.
- Independent moderation with transparency. The platform needs clear rules on what’s allowed, mechanisms to appeal decisions, and public reporting on enforcement. Internally, there should be a clean separation between the moderation pipeline and ranking optimization, with auditable logs of what was blocked or downranked and why. That’s how you help PMs, regulators and users trust that the system is operating fairly.
If you get these three right early, you avoid a lot of the “mysterious black box” perception later.
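As a small, hypothetical sketch of how the first two principles show up in code (an explanation attached to every recommendation, and negative-feedback controls recorded as first-class signals); the structures and names are assumptions for illustration:

```python
# Hypothetical shapes: a recommendation carries a human-readable reason, and negative
# feedback is stored explicitly so the ranker can honor it immediately.

from dataclasses import dataclass, field

@dataclass
class Recommendation:
    post_id: str
    reason: str                        # e.g. "you follow this author"

@dataclass
class UserControls:
    muted_topics: set[str] = field(default_factory=set)
    blocked_authors: set[str] = field(default_factory=set)
    not_interested: set[str] = field(default_factory=set)

def record_feedback(controls: UserControls, action: str, target: str) -> None:
    if action == "mute_topic":
        controls.muted_topics.add(target)
    elif action == "block_author":
        controls.blocked_authors.add(target)
    elif action == "not_interested":
        controls.not_interested.add(target)

controls = UserControls()
record_feedback(controls, "mute_topic", "celebrity_gossip")
print(controls.muted_topics)
```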
The dashboards that actually change the conversation
Q: Everyone says “we’ll just fix the algorithm”, but very few people have actually debugged one at scale. What kinds of transparency or explainability tools have you seen actually help PMs and leadership understand what’s going on – and if you could show regulators or journalists just one internal dashboard, what would it be?
I’ve seen two types of tools make a real difference.
- Author-facing traceability dashboards. These answer questions like “Why was this post downranked?” They collect the full ingestion and moderation status for a piece of content in one place: which checks it passed or failed, whether it hit rate limits, whether it was subject to distribution caps, and so on. Some internal details can’t be fully exposed, but the goal is to give authors as much visibility as possible into how their content is treated.
- User-level funnel or “recommendation trace” dashboards. These are internal tools that trace every step of the recommendation process for a given user or session: what positive and negative signals were considered, which candidates were generated and filtered out at each stage, and how the ranking formula arrived at the final list. This has been invaluable for debugging errors and spotting weak spots in the algorithm.
A simplified version of the same idea powers user-facing explanations: the system can point to the top few factors that led to a recommendation and present them in human language.
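For a sense of what such a trace contains, here is a hypothetical, simplified shape of the record; stage names and fields are illustrative, not the actual internal schema.

```python
# Hypothetical per-request "recommendation trace": each stage records what went in,
# what was filtered and why, so the final list can be audited end to end.

from dataclasses import dataclass, field

@dataclass
class StageRecord:
    stage: str                         # e.g. "candidate_generation", "moderation_filter"
    candidates_in: int
    candidates_out: int
    dropped_reasons: dict[str, int] = field(default_factory=dict)

@dataclass
class RecommendationTrace:
    user_id: str
    signals_used: list[str]            # positive and negative signals considered
    stages: list[StageRecord]
    final_ranking: list[tuple[str, float, list[str]]]  # (post_id, score, top factors)

trace = RecommendationTrace(
    user_id="u123",
    signals_used=["follows:author_42", "liked:topic=cooking", "hidden:topic=crypto"],
    stages=[StageRecord("candidate_generation", 0, 1200),
            StageRecord("moderation_filter", 1200, 1150, {"blocked": 30, "restricted": 20}),
            StageRecord("ranking", 1150, 50)],
    final_ranking=[("post_9", 0.87, ["similar to a post you liked", "popular with people you follow"])],
)

# A user-facing explanation is just the top factor from the final ranking entry.
print(trace.final_ranking[0][2][0])
```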
If I could show regulators or journalists just one internal dashboard, it would be the user-level recommendation trace. It provides an end-to-end view of how recommendations are generated and makes it clear that there is a structured, rule-based and auditable process, not a mysterious black box.