Teacher models and the demo-to-production gap

Notes on LLM RecSys Product – Edition 2 of a newsletter focused on building LLM-powered products.


The central thesis of this newsletter is that we are moving from deterministic workflows to model-centered products – products built around LLM-powered recommender systems as the core primitive.

We’re not fully in that future yet. Most AI products today are still tools that tackle narrow use cases better – generate text, summarize content, answer questions. They are improvements, but they are not yet systems that help people get things done.

Still, the direction of change is clear. As model capabilities continue to improve, user expectations will shift across domains – writing, presentations, job search, customer support – from “give me the tool” to “help me make progress.”

That is the real promise of agents and applied AI products. Not magical autonomy, but systems that understand context, infer intent, and assist meaningfully. In other words, they will need a recommender system at the core.

But not the recommender systems we’re used to. The open question this creates is how to build predictable, trustworthy systems when the output is inherently probabilistic.

Why traditional recommender systems break

For years, recommender systems were constrained in two fundamental ways.

First, they relied on structured data and fixed taxonomies. They required predefined taxonomies and candidate sets, even though real product data – messages, resumes, notes, goals, intent – is overwhelmingly unstructured.

Second, they were black boxes steered by blunt reward functions. Behavior was optimized indirectly through metrics like clicks, dwell time, or engagement. Improving the system meant tweaking the reward, not improving understanding or reasoning.

That architecture was sufficient for optimizing engagement. It is poorly suited for helping users achieve goals.

LLM-powered recommender systems are semantic and teacher-supervised

The breakthrough in LLM-powered recommender systems isn’t simply replacing parts of the old stack with LLMs. It’s the combination of semantic understanding with teacher supervision.

LLMs can ingest and produce semantic input and output, reason over unstructured data, and operate across large context windows. That dramatically expands what recommender systems can do. While this dramatically improves capability, it still leaves blind spots: when does the system work, when does it fail, and why?
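As a rough illustration of what “semantic” means here, the sketch below ranks unstructured candidates against unstructured user context. The bag-of-words embedding is a deliberate stand-in, and all function names are hypothetical; a real system would call an actual embedding model rather than count tokens.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in for a semantic embedding model: bag-of-words token counts.
    A production system would use an LLM embedding endpoint instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(user_context: str, candidates: list[str]) -> list[str]:
    """Score unstructured candidates against unstructured user context."""
    ctx = embed(user_context)
    return sorted(candidates, key=lambda c: cosine(ctx, embed(c)), reverse=True)
```

The point isn’t the scoring function – it’s that both sides of the match (context and candidates) are free text, with no predefined taxonomy or fixed candidate schema in between.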

The breakthrough is our ability to pair production models with a high-quality “teacher model” – a large model used to evaluate outputs at scale. This teacher is too expensive to run in the critical path, but ideal for judging behavior, surfacing errors, and identifying quality gaps.


This structure creates two powerful improvement levers:

  • Improve the teacher → the system’s judgment gets sharper.
  • Improve the training data → production models inherit this better judgment.

The introduction of the teacher makes the system self-aware and coachable. Instead of guessing whether the system is improving, you can now see it. In sum, LLMs with semantic understanding give your stack capability, but the teacher model enables governance.
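One way to picture the split between the cheap production model and the expensive teacher is the sketch below. Both model calls are stubbed with hypothetical placeholders – a real system would hit actual model endpoints, prompt the teacher with a rubric, and sample only a fraction of traffic offline.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    request: str
    output: str
    score: float       # teacher's quality score in [0, 1]
    failure_mode: str  # teacher's label, e.g. "ok", "irrelevant_output"

def production_model(request: str) -> str:
    """Stand-in for the small, cheap model that serves live traffic."""
    return f"recommendation for: {request}"

def teacher_judge(request: str, output: str) -> Judgment:
    """Stand-in for the large model that judges outputs offline.
    Here a toy rubric: does the output mention the request at all?"""
    relevant = request.split()[-1] in output
    return Judgment(request, output,
                    score=1.0 if relevant else 0.0,
                    failure_mode="ok" if relevant else "irrelevant_output")

def offline_eval(requests: list[str], every_nth: int = 10) -> list[Judgment]:
    """Judge a deterministic sample of traffic; never in the critical path."""
    sampled = requests[::every_nth]
    return [teacher_judge(r, production_model(r)) for r in sampled]
```

The design choice worth noticing: the teacher never blocks a live request. It runs asynchronously on samples, which is what makes a frontier-sized judge affordable.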

The demo-to-production gap: cost makes your system imperfect but self-aware and coachable

This is also the point where demos and real products diverge.

In a demo, you can run a large, frontier model on every request and get something that looks magical. At scale, that approach collapses quickly under latency and GPU cost. The moment you ship to production, you’re forced to use smaller, cheaper models to keep the system fast and scalable.

And that’s when imperfection shows up. As you push costs down, different parts of the stack start making mistakes that weren’t visible in the demo. Your production-grade AI-powered system is noticeably imperfect.

That’s the bad news.

From stumbling in the dark to painful self-awareness

The good news is that teacher supervision turns imperfection into something you can work with.

Before teacher-supervised systems, building AI products felt like stumbling in the dark. Teams relied on coarse engagement metrics and small-scale human review to infer how models behaved.

With a teacher model, teams live in painful self-awareness.

You can now evaluate millions of outputs every day. You will know what’s wrong at scale, where it’s wrong, and why. The reason this self-awareness is painful is because you won’t be able to fix everything immediately – cost, latency, and model size constraints are very real – but you can work deliberately, one acute failure mode at a time.
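The “one acute failure mode at a time” workflow can be sketched as a simple aggregation over teacher judgments. The field names and failure-mode labels below are illustrative, not from any particular system.

```python
from collections import Counter

def acute_failure_modes(judgments: list[dict], top_k: int = 3) -> list[tuple[str, int]]:
    """Rank failure modes by how often the teacher flagged them
    on low-scoring outputs; fix the most frequent one first."""
    counts = Counter(j["failure_mode"] for j in judgments if j["score"] < 0.5)
    return counts.most_common(top_k)

judgments = [
    {"score": 0.2, "failure_mode": "hallucinated_attribute"},
    {"score": 0.9, "failure_mode": "ok"},
    {"score": 0.1, "failure_mode": "hallucinated_attribute"},
    {"score": 0.3, "failure_mode": "stale_candidate"},
]
print(acute_failure_modes(judgments))
# → [('hallucinated_attribute', 2), ('stale_candidate', 1)]
```

This is the mechanical core of the shift from chasing metrics to diagnosing behavior: the teacher turns millions of outputs into a ranked backlog of failure modes.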

That is the shift: from chasing metrics to diagnosing behavior, from intuition to evaluation, from shiny demos to systems that can actually improve. To be clear: teacher models aren’t ground truth. They’re probabilistic judges that must be anchored in human judgment and long-term outcomes. Their value isn’t correctness, but making evaluation continuous, scalable, and unavoidable.

Evaluation is the mechanism that makes outcome-oriented, probabilistic systems tractable under real-world constraints – and the foundation for how AI-native teams will operate going forward.


All of this brings us to two counter-intuitive takeaways –

(1) Existing recommender systems are deeply limited in many ways. But the magic moment doesn’t arrive when you replace them with LLM-powered recommender systems. As soon as you confront scaling costs, you’re forced to use cheaper, imperfect models. This is where reality diverges from the shiny demo.

(2) Semantic models unlock capability, but teacher supervision unlocks governance. LLMs make systems powerful, evaluation makes them understandable and improvable. And while the system might be imperfect, a teacher model enables painful self-awareness and coachability – the kind that reveals problems at scale and lets teams improve deliberately, one acute failure mode at a time.

That sets us up for the next edition – building painful self-awareness and coachability via the evaluation loop.

Half-hearted engagement

If you can’t engage with patience and focus, don’t engage at all.

A half-hearted engagement often causes more damage than no engagement. It is nearly always better to pause, reset, and return with intention than to show up distracted and reactive.

In most situations, the quality of our presence matters more than the speed of our response.

The graduate program and meritocracy

There’s an insightful story Jack Welch often told about a formative moment early in his career at GE. He was part of a graduate program with a cohort of new hires – a mix of people who worked hard and carried their weight, and others who clearly did not.

During performance review time, he learned that everyone in the program was being given the same raise.

He was furious because he knew who in the group was doing exceptional work and who wasn’t. Everyone did. As he put it, “Kids in any class know who the best students are.”

Effort and performance aren’t as subjective as we sometimes pretend.

That moment shaped him. It became the seed for his later, legendary emphasis on meritocracy at GE – including the (controversial) forced ranking system where the bottom 10% were consistently shown the door.

Agree or disagree with the method, the underlying principle he lived by was clear: When you reward strong performance, people rise to the standard. And when you tolerate weak performance – or worse, reward it – you destroy accountability.

And when accountability disappears, the people who thrive on it leave.

In many ways, that is the beginning of the end for any high-performing team or organization.

Don’t skip that reference check

Over the years, I’ve made many hiring decisions. And while I’ve learned a lot about assessing fit, potential, and craft, there’s one lesson that has been etched in with scar tissue: don’t skip that reference check.

Most hirers are diligent about reference checks when hiring senior talent. But the place where corners get cut is with younger talent.

I’ve learned the hard way that this is exactly where discipline matters most.

Good reference checks – and backchannels where appropriate and possible – are invaluable. They reveal patterns, motivations, and (crucially) red flags that you simply cannot infer from a few conversations.

A little bit of work up front can save a lot of pain later.

Occam’s razor x biology retrospective

Dr. Peter Attia’s team published a thoughtful retrospective on an explanation they had proposed previously.

Two years ago, they shared results from a trial showing that GLP-1 drugs (Ozempic, Wegovy) weren’t just remarkably effective against obesity and type II diabetes – they were also effective in reducing major adverse cardiovascular events (heart attacks, strokes).

Taking a simplistic Occam’s razor inspired approach, they’d suggested that the reduction in cardiovascular risk was likely due to the weight loss involved.

However, additional data released recently showed that the drugs had a separate positive impact on cardiovascular risk: there was no correlation between a patient’s weight loss and their reduction in cardiovascular risk, so weight loss alone couldn’t explain the benefit.

Their post ended with a reflection –

Occam’s razor is a useful heuristic for problem-solving across many disciplines, but biology and medicine are rife with instances where this principle has failed. (Indeed, there’s a principle that specifically opposes Occam’s razor in the context of medicine—Hickam’s dictum, which is typically presented as the observation that “a patient can have as many diseases as they damn well please.”).

So although defaulting to the simplest explanation may make sense when all else is equal, biology is often far more complex than we predict, and therefore, we need to be ready to abandon or revise the simpler theory when presented with new information. Such is the case, it would seem, with GLP-1 drugs and cardiovascular benefits.

There were 3 things I loved about this –

  1. The reflection resonated. Topics involving biology are complex and often require a lot of nuance. Occam’s Razor isn’t always the best tool as a result.
  2. I appreciated the retrospective post. We need more of these.
  3. Hickam’s dictum – “a patient can have as many diseases as they damn well please” – is hilarious.

Just because you can

We were halfway into our Sunday morning soccer game on a local middle school turf field when a dad walked up and told us we had to leave. He was the coach of a kids’ team, he’d booked the field for 8 a.m., and we needed to move immediately.

It was inconvenient – the grass nearby was wet, uneven, and far from ideal – but we moved and finished our game.

But here’s the kicker:

We were only using a third of the turf.

And for the entire next hour, the kids he was coaching also used only a third of the turf… on the opposite end.

They didn’t come near the space we had been occupying even once.

On our way out, we all chuckled. The whole thing had been unnecessary. It wasn’t about need – it was about “because I can.”

Two lessons.

Next time someone asks us to move, I’ll make sure to ask if they actually plan to use the space. That might avoid the whole dance. Sometimes people enforce rules by default because no one asks.

The second lesson is that I’m sure I’ve done the same thing at some point – enforced something simply because I could.

It’s a good reminder to catch myself the next time I’m in that position.

Just because you can doesn’t mean you should.

Ascribing Intent

Ascribing good intent or bad intent is one of the most quietly powerful choices we make about the people around us. The intent we assume becomes the lens through which we interpret every action. If we don’t trust someone’s intent, it almost doesn’t matter what they do – we’ll find a way to see it negatively. And the reverse is also true.

The most powerful experience I’ve had on intent comes from George R.R. Martin’s A Song of Ice and Fire. The books are a work of art because the entire plot is advanced from the point of view of various characters.

In the first two books, the story is told largely from the Stark family’s point of view. They despise the Lannisters, and through their eyes, Jaime Lannister becomes the embodiment of arrogance, selfishness, and everything that is wrong with human nature. I remember reading those books and feeling the same way – hating Jaime without ever hearing from him directly.

Then book three arrives, and suddenly there’s a chapter from Jaime’s point of view.

One chapter.

And by the end of it, I found myself thinking he was one of the most misunderstood, complex, even heroic characters in the story. Nothing about his circumstances changed – only the intent I had been unconsciously ascribing to him.

It was a revelation.

Everyone else had been projecting bad intent onto him, and I absorbed their view without questioning it.

Ever since, I’ve held onto that lesson.

If someone in your life ascribes bad intent to everything you do, there’s almost no path out. Your actions won’t matter if the lens is fixed.

And whenever I find myself frustrated with someone, or forming a story about why they did something, I think back to Jaime Lannister. It forces me to pause, reflect, and ask: Am I making the same mistake?

Because sometimes, all it takes is a single different point of view to completely change the story.

Giving a damn about the customer experience

I had two wildly different customer service experiences recently.

The first was at a nearby AT&T store as we tried to upgrade our phones – and it had everything you hope never to experience.

Attempts to sell you products based on incomplete information.

Multiple hours of waiting and back and forths.

Staff walking away mid-conversation to serve someone else and not taking accountability for the job being done.

Managers hovering but doing nothing.

The crazy part was being sent to the Apple Store because “it’ll be easier there.”

It was apathy all the way down.

We then interacted with four different people at the Apple Store – each one exceptional.

They cared.

They took accountability.

They stayed with us until things were resolved.

They kept humor and energy throughout.

It was genuinely impressive.

The previous day, we had stopped by Gordon Ramsay’s Burger for a quick meal, mostly because of our fondness for MasterChef. Again, the service was incredible – fast, attentive, and thoughtful.

I tried to unpack the difference.

Was the AT&T issue about monopolies? Probably not – Apple has an even stronger market position, and it hasn’t eroded their service.

Was it about training? Maybe partially, but that didn’t feel like the root cause.

The conclusion I came to was simple: somebody at the top gives a damn.

At Gordon Ramsay’s place, I’m sure everyone knows that Gordon cares deeply about the experience – and he hires people who care too.

At the Apple Store, the culture of caring goes all the way back to Steve Jobs and has been kept alive by the leadership team. You can feel it in every interaction.

And at that AT&T store? My guess is no one up top truly cares. Or at least, not enough to hire the right people, set the right expectations, or hold anyone accountable for the experience customers actually have.

It’s a reminder to anyone in a leadership role – what you consistently care about, measure, hire for, and reinforce eventually becomes the lived experience of your customers.

If you want customers to feel cared for, you need leaders who give a damn first – and then who hire people who do too.