I book a lot of travel through Chase Travel, and many of those flights end up being on United. Every once in a while, I get an email titled "Urgent – Your itinerary has been updated."
Except the message never actually tells me what changed.
So I end up comparing the old itinerary with the new one – flight numbers, timings, connections – trying to spot a difference that isn’t always obvious. In some cases, I never figure it out. Maybe it was a terms-and-conditions tweak. Maybe a backend update. Who knows.
As someone who receives a lot of harsh criticism about the products I’ve built over the years, I always remind myself it’s far easier to criticize than it is to build. I’m sure there’s some operational or technical reason things work this way.
But it’s still a good reminder for anyone building products – and especially for myself: If you notify a user that something changed, make it dead easy to find the change.
A simple visual cue or one clear line of copy can transform an experience.
I recently found myself reflecting on the repercussions of something I'd signed without giving it enough thought.
I can think of a couple such examples over the years that have cost me a fair bit of money in retrospect.
One of them involved lessons about trust – the kind you only learn by living through them.
But another was far simpler: I just didn’t weigh what I signed carefully enough.
Take the time to think it through and ask your questions. In the best case, it cost you a bit of extra time. And in the worst case, you just saved yourself a lot of money and hassle.
Notes on LLM RecSys Product – Edition 2 of a newsletter focused on building LLM-powered products.
The central thesis of this newsletter is that we are moving from deterministic workflows to model-centered products – products built around LLM-powered recommender systems as the core primitive.
We’re not fully in that future yet. Most AI products today are still tools that tackle narrow use cases better – generate text, summarize content, answer questions. They are improvements, but they are not yet systems that help people get things done.
Still, the direction of change is clear. As model capabilities continue to improve, user expectations will shift across domains – writing, presentations, job search, customer support – from "give me the tool" to "help me make progress."
That is the real promise of agents and applied AI products. Not magical autonomy, but systems that understand context, infer intent, and assist meaningfully. In other words, they will need a recommender system at the core.
But not the recommender systems we're used to – because the open question this shift creates is how to build predictable, trustworthy systems when the output is inherently probabilistic.
Why traditional recommender systems break
For years, recommender systems were constrained in two fundamental ways.
First, they relied on structured data and fixed taxonomies. They needed predefined categories and candidate sets, even though real product data – messages, resumes, notes, goals, intent – is overwhelmingly unstructured.
Second, they were black boxes steered by blunt reward functions. Behavior was optimized indirectly through metrics like clicks, dwell time, or engagement. Improving the system meant tweaking the reward, not improving understanding or reasoning.
That architecture was sufficient for optimizing engagement. It is poorly suited for helping users achieve goals.
LLM-powered recommender systems are semantic and teacher-supervised
The breakthrough in LLM-powered recommender systems isn’t simply replacing parts of the old stack with LLMs. It’s the combination of semantic understanding with teacher supervision.
LLMs can ingest and produce semantic input and output, reason over unstructured data, and operate across large context windows. That dramatically expands what recommender systems can do. While this dramatically improves capability, it still leaves blind spots: when does the system work, when does it fail, and why?
The other half of the breakthrough is our ability to pair production models with a high-quality "teacher model" – a large model used to evaluate outputs at scale. The teacher is too expensive to run in the critical path, but ideal for judging behavior, surfacing errors, and identifying quality gaps.
This structure creates two powerful improvement levers:
Improve the teacher → the system's judgment gets sharper.
Improve the training data → production models inherit this better judgment.
The introduction of the teacher makes the system self-aware and coachable. Instead of guessing whether the system is improving, you can now see it. In sum, LLMs with semantic understanding give your stack capability, but the teacher model enables governance.
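To make that loop concrete, here is a minimal sketch of teacher supervision – an illustration, not the specific architecture described here. The model names, prompt, and failure-mode labels are hypothetical, and `call_model` is a stand-in for whatever LLM client you actually use.

```python
# A minimal sketch of teacher supervision (illustrative; model names, prompts,
# and failure-mode labels are hypothetical, not from this article).
import json
from dataclasses import dataclass

@dataclass
class Judgment:
    score: int          # 1-5 quality rating from the teacher
    failure_mode: str   # e.g. "irrelevant", "stale_context", "hallucinated_detail"
    rationale: str

def call_model(model: str, prompt: str) -> str:
    """Stand-in for your real LLM API call; returns a canned response here."""
    return json.dumps({"score": 2, "failure_mode": "stale_context",
                       "rationale": "Recommends a role the user already declined."})

def judge_recommendation(user_context: str, recommendation: str) -> Judgment:
    """Ask an expensive teacher model to grade a cheap production model's output.
    Runs offline or async over sampled traffic, never in the request path."""
    prompt = (
        "You are evaluating a recommendation shown to a user.\n"
        f"User context:\n{user_context}\n\n"
        f"Recommendation:\n{recommendation}\n\n"
        "Return JSON with keys: score (1-5), failure_mode, rationale."
    )
    data = json.loads(call_model(model="teacher-large", prompt=prompt))
    return Judgment(int(data["score"]), data["failure_mode"], data["rationale"])
```

The two levers map directly onto this loop: sharpening the teacher's prompt or model improves the judgments themselves, and the (context, recommendation, judgment) triples it produces become the training data the production models inherit.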
The demo-to-production gap: cost makes your system imperfect but self-aware and coachable
This is also the point where demos and real products diverge.
In a demo, you can run a large, frontier model on every request and get something that looks magical. At scale, that approach collapses quickly under latency and GPU cost. The moment you ship to production, you’re forced to use smaller, cheaper models to keep the system fast and scalable.
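A rough back-of-envelope shows why. Every number below is a made-up assumption purely for illustration – the point is the order-of-magnitude gap, not the specific figures.

```python
# Why "frontier model on every request" collapses at scale.
# All figures are invented for illustration only.
requests_per_day = 5_000_000
tokens_per_request = 3_000            # prompt + completion, combined

frontier_cost_per_1k_tokens = 0.01    # hypothetical $/1K tokens, large model
small_cost_per_1k_tokens = 0.0005     # hypothetical $/1K tokens, distilled model

def daily_cost(cost_per_1k_tokens: float) -> float:
    return requests_per_day * tokens_per_request / 1_000 * cost_per_1k_tokens

print(f"Frontier model: ${daily_cost(frontier_cost_per_1k_tokens):,.0f}/day")  # $150,000/day
print(f"Small model:    ${daily_cost(small_cost_per_1k_tokens):,.0f}/day")     # $7,500/day
```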
And that's when imperfection shows up. As you bring costs down, different parts of the stack start making mistakes that weren't visible in the demo. Your production-grade AI-powered system is noticeably imperfect.
That’s the bad news.
From stumbling in the dark to painful self-awareness
The good news is that teacher supervision turns imperfection into something you can work with.
Before teacher-supervised systems, building AI products felt like stumbling in the dark. Teams relied on coarse engagement metrics and small-scale human review to infer how models behaved.
With a teacher model, teams live in painful self-awareness.
You can now evaluate millions of outputs every day. You will know what's wrong at scale, where it's wrong, and why. This self-awareness is painful because you won't be able to fix everything immediately – cost, latency, and model size constraints are very real – but you can work deliberately, one acute failure mode at a time.
That is the shift: from chasing metrics to diagnosing behavior, from intuition to evaluation, from shiny demos to systems that can actually improve. To be clear: teacher models aren’t ground truth. They’re probabilistic judges that must be anchored in human judgment and long-term outcomes. Their value isn’t correctness, but making evaluation continuous, scalable, and unavoidable.
Evaluation is the mechanism that makes outcome-oriented, probabilistic systems tractable under real-world constraints – and the foundation for how AI-native teams will operate going forward.
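In practice, that evaluation loop can be as simple as aggregating teacher judgments into a ranked list of failure modes. A sketch, reusing the hypothetical `Judgment` type from the earlier snippet:

```python
# Turn a day's worth of teacher judgments into a triage list: which failure
# mode is hurting users most, and how big is it? (Builds on Judgment above.)
from collections import Counter
from typing import Iterable

def triage(judgments: Iterable[Judgment], score_threshold: int = 3) -> list[tuple[str, int]]:
    """Count failure modes among low-scoring outputs, worst offender first."""
    counts = Counter(j.failure_mode for j in judgments if j.score < score_threshold)
    return counts.most_common()

# Usage (hypothetical): pick the single most acute failure mode to fix this week.
# worst_mode, count = triage(yesterdays_judgments)[0]
```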
All of this brings us to two counter-intuitive takeaways –
(1) Existing recommender systems are deeply limited in many ways. But the magic moment doesn’t arrive when you replace them with LLM-powered recommender systems. As soon as you confront scaling costs, you’re forced to use cheaper, imperfect models. This is where reality cuts differently from the shiny demo.
(2) Semantic models unlock capability, but teacher supervision unlocks governance. LLMs make systems powerful; evaluation makes them understandable and improvable. And while the system might be imperfect, a teacher model enables painful self-awareness and coachability – the kind that reveals problems at scale and lets teams improve deliberately, one acute failure mode at a time.
That sets us up for the next edition – building painful self-awareness and coachability via the evaluation loop.
If you can’t engage with patience and focus, don’t engage at all.
A half-hearted engagement often causes more damage than no engagement. It is nearly always better to pause, reset, and return with intention than to show up distracted and reactive.
In most situations, the quality of our presence matters more than the speed of our response.
There’s an insightful story Jack Welch often told about a formative moment early in his career at GE. He was part of a graduate program with a cohort of new hires – a mix of people who worked hard and carried their weight, and others who clearly did not.
During performance review time, he learned that everyone in the program was being given the same raise.
He was furious because he knew who in the group was doing exceptional work and who wasn’t. Everyone did. As he put it, “Kids in any class know who the best students are.”
Effort and performance aren’t as subjective as we sometimes pretend.
That moment shaped him. It became the seed for his later, legendary emphasis on meritocracy at GE – including the (controversial) forced ranking system where the bottom 10% were consistently shown the door.
Agree or disagree with the method, the underlying principle he lived by was clear: When you reward strong performance, people rise to the standard. And when you tolerate weak performance – or worse, reward it – you destroy accountability.
And when accountability disappears, the people who thrive on it leave.
In many ways, that is the beginning of the end for any high-performing team or organization.
Over the years, I’ve made many hiring decisions. And while I’ve learned a lot about assessing fit, potential, and craft, there’s one lesson that has been etched in with scar tissue: don’t skip that reference check.
Most hirers are diligent about reference checks when hiring senior talent. But the place where corners get cut is with younger talent.
I’ve learned the hard way that this is exactly where discipline matters most.
Good reference checks – and backchannels where appropriate and possible – are invaluable. They reveal patterns, motivations, and (crucially) red flags that you simply cannot infer from a few conversations.
A little bit of work up front can save a lot of pain later.
“My job is to be calm and collected when they’re frantic. My job is to create intensity when they’re not intense. My job is to always be opposite the moment.” | Texas A&M football coach Mike Elko
I thought this was a fascinating framing of the role of a coach / leader – it resonated.