Moving from Evals to Eval Loops

Notes on LLM RecSys Product – Edition 3 of a newsletter focused on building LLM-powered products.


Eval-first product building is having a moment. We all know we should run evals, ship evals, and just do more evals.

But evals are just snapshots – moments in time when you check if something works. The real breakthrough isn’t running evaluations. It’s building the evaluation loop – the end-to-end system that diagnoses continuously, improves deliberately, and makes probabilistic systems governable.

From Self-Awareness to Action

In the last post, I wrote about teacher models giving us painful self-awareness – you can evaluate millions of outputs and know what’s broken at scale. The eval loop is what you DO with that awareness: a continuous diagnostic and improvement system that makes LLM-powered products actually improvable.

In deterministic products, the process used to be: Ship → Monitor engagement → Small-batch human reviews when something breaks. Feedback was slow and sparse. In LLM recsys products, the eval loop runs continuously.

The Eval Loop

Here’s how it works:

  • The Product Policy defines what “good” looks like – the criteria, constraints, and guardrails for how the system should behave – including a “Golden Set” of high-quality examples.
  • This trains the Teacher Model, your evaluator.
  • The production Stack generates outputs in response to real user interactions. The Teacher continuously evaluates these outputs at a scale of millions per day.
  • We also fold in user feedback and investigations of anecdotal issues to surface more defects.

This surfacing of defects at scale leads to Diagnosis, where each failure points to one of three places:

  1. Policy problem – We didn’t define “good” clearly enough. The rubric is incomplete or wrong.
  2. Teacher problem – Our evaluator itself is broken. It’s judging incorrectly.
  3. Distillation problem – The production model hasn’t learned what the teacher knows. Training data or model capacity is the bottleneck.

The fix flows back into the system – refine policy, improve the teacher, update training data – and the loop continues.
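
To make the loop's shape concrete, here is a minimal Python sketch. Everything in it is hypothetical – the judge callable, the golden-set labels, the routing heuristics – and it illustrates only how teacher judgments map to the three diagnosis buckets, not any particular implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Diagnosis(Enum):
    POLICY = "policy problem"              # "good" was never defined clearly enough
    TEACHER = "teacher problem"            # the evaluator itself judges incorrectly
    DISTILLATION = "distillation problem"  # the production model hasn't learned it

@dataclass
class Judgment:
    output_id: str
    passed: bool
    failed_criterion: str | None = None

def run_loop_step(
    outputs: list[str],
    judge: Callable[[str], Judgment],  # the teacher model
    rubric: set[str],                  # criteria defined by the product policy
    golden_labels: dict[str, bool],    # human labels on the golden set
) -> dict[str, Diagnosis]:
    """Judge outputs at scale, then route each defect to a failure point."""
    diagnoses: dict[str, Diagnosis] = {}
    for output in outputs:
        j = judge(output)
        if j.passed:
            continue
        if j.failed_criterion not in rubric:
            # Flagged on a criterion the rubric never named: policy gap.
            diagnoses[j.output_id] = Diagnosis.POLICY
        elif golden_labels.get(j.output_id) is True:
            # Teacher failed an example humans marked good: teacher gap.
            diagnoses[j.output_id] = Diagnosis.TEACHER
        else:
            # Rubric and teacher agree the output is bad: distillation gap.
            diagnoses[j.output_id] = Diagnosis.DISTILLATION
    return diagnoses
```

Each bucket then triggers its own fix: refine the rubric and golden set, retrain or re-prompt the evaluator, or generate new training data for the production model.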

Unlocking Velocity and Shifting Measurement

The loop unlocks velocity by operating on two planes simultaneously:

  • Online: The teacher judges live production outputs continuously. You know what’s breaking in real time, not weeks later through support tickets or engagement dips.
  • Offline: Before shipping any change, you can test it against the teacher. Run it through thousands of test cases. See how eval metrics move. Fail fast without running live experiments on real users.
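
Here is what the offline plane can look like, as a hedged sketch: it assumes nothing beyond callables for the current stack, the candidate change, and the teacher, and all names are made up for illustration.

```python
from typing import Callable

def offline_gate(
    test_cases: list[str],
    baseline: Callable[[str], str],     # current production stack
    candidate: Callable[[str], str],    # the change you want to ship
    judge: Callable[[str, str], bool],  # teacher: (case, output) -> passed?
    min_lift: float = 0.0,
) -> bool:
    """Ship only if the candidate's teacher-judged pass rate doesn't regress."""
    base_rate = sum(judge(c, baseline(c)) for c in test_cases) / len(test_cases)
    cand_rate = sum(judge(c, candidate(c)) for c in test_cases) / len(test_cases)
    print(f"baseline pass rate: {base_rate:.3f}, candidate: {cand_rate:.3f}")
    return cand_rate - base_rate >= min_lift
```

Gating on teacher-judged pass rate is what lets you fail fast: a regression shows up in minutes on a fixed test set rather than weeks into a live experiment.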

In deterministic products, we measured outcome metrics: clicks, retention, revenue. In LLM recsys products, these are lagging indicators. The eval loop runs on quality metrics – leading indicators judged by your teacher.

Quality metrics thus guide daily improvement. Outcome metrics tell you if that improvement matters.

The Loop Is The Mechanism

In deterministic products, you improved by changing code. You wrote a new feature, shipped it, measured engagement.

In LLM recsys products, you improve by running the evaluation loop first. You can’t fully specify the system’s behavior upfront – it’s probabilistic. The loop is how you systematically improve something you can’t completely control.

Once again, adding an LLM to your system isn’t a panacea. You’ll be constrained in production by latency and cost, so the eval loop is the only way to build products around models that are powerful but imperfect, capable but costly, impressive in demos but messy at scale.


Next: The eval loop only works if your teacher knows what “good” looks like. That’s where we’re going next.

PS: A quick note on eval suites: As your product matures, you’ll likely move from one teacher to multiple – separate evaluators for various dimensions of the product. The same principle scales. Each evaluator judges a specific dimension, all feeding into the same diagnostic loop.
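
A sketch of what such a suite might look like, with made-up dimensions and trivial stand-in judges (real evaluators would be teacher-model calls, not string checks):

```python
from typing import Callable

Evaluator = Callable[[str], bool]  # output -> passed?

def evaluate_all(output: str, suite: dict[str, Evaluator]) -> list[str]:
    """Return the dimensions this output fails, feeding one diagnostic loop."""
    return [dim for dim, judge in suite.items() if not judge(output)]

# Illustrative wiring only; the dimension names are hypothetical.
suite: dict[str, Evaluator] = {
    "relevance": lambda out: "topic" in out,
    "safety": lambda out: "forbidden" not in out,
    "tone": lambda out: not out.isupper(),
}
print(evaluate_all("SHOUTY OFFTOPIC", suite))  # -> ['relevance', 'tone']
```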

From AI doomsday to IA, Orwell and Social Support

Was the invention of the axe a good thing or a bad thing? The axe was among the first simple machines — a breakthrough in technology that propelled humanity forward. It helped our ancestors chop wood and hunt. But, it was also used as a weapon in war.

Every incredible advance has had a dark side. We have cut infant mortality thanks to advances in ultrasound technology. And, yet, the same technology has enabled female infanticide. Industrial farming has helped us feed billions of humans with fewer people involved in agriculture than ever before. However, it has also resulted in the routine, horrible treatment of farm animals.

Given this context, it is often amusing to watch the discussion around artificial intelligence. We see talk of doomsday one day (“all the jobs are going away”) and techno-optimism the next (“AI is going to help us by freeing us from repetitive tasks”). Of late, I’ve been seeing more media devoted to the latter. It is worth examining both sides of the conversation.

Not doomsday. The central hypothesis of the no-doomsday camp is that we’re moving into a world of IA, or “Intelligence Augmentation.” The idea here is that AI is great at finding answers, but it is on us to find questions. We’ll find new and interesting questions to keep us occupied while AI eliminates repetitive tasks and makes us more efficient. And we’ll use our ingenuity to create new jobs that don’t exist today – just as we created “Yoga instructor” or “Zumba instructor” jobs after the industrial revolution.

One example of this is a painting robot featured on Wired (see video, 4 mins) that increased the productivity of human laborers by 4x while taking over all the repetitive tasks. You’ve probably come across similar stories.

The recent surge in positivity is also thanks to an OECD research report that classified ~10% of American jobs as high risk. This is much lower than previous forecasts that labelled ~50% of jobs as high risk.

Maybe doomsday. From The Atlantic on Walmart’s future workforce —

Walmart executives have sketched a picture of the company’s future that features more self-checkouts and a grocery-delivery business — soon escalating to 100 cities from a pilot program in six cities. Personal shoppers will fill plastic totes with avocados and paper towels from Walmart store shelves, and hand off packages to crowdsourced drivers idling in the parking lot. Assembly will be outsourced, too: Workers on Handy, an online marketplace for home services, will mount televisions and assemble furniture.

Such examples are also a dime a dozen these days. More automation promises higher returns to shareholders, which means happier executives and boards.

Of course, it is also easy to counter the examples of optimism. The same painting robot (featured above) that increased the productivity of human laborers by 4x is a great place to start. At some point, assuming other painting firms invest in robots, we will have 4x the painting capacity at hand. Will there be as many jobs to go around?

And the same OECD report that called risks of “massive technological unemployment” overblown also cautioned that we face risks of a “further polarisation of the labour market” between highly paid workers and other jobs that may be “relatively low paid and not particularly interesting.”

This graph from The Economist summarizing some of the findings was particularly interesting.

Notice how the percentage of jobs at risk of automation decreases as a country gets richer?

The polarization the report warns about may not be limited to high-skill and low-skill jobs, then. There is reason to believe we might see a growing schism between richer and poorer countries.

The truth likely lies somewhere in the middle. All this brings us back to the story of the axe. Every technology breakthrough has a dark side. The challenge, then, is to not get caught up in the techno-optimism that accompanies a breakthrough technology’s emergence, and to make the effort to think through the second- and third-order consequences.

As we’ve seen in the revelations about the effects of social media in the past 2 years, the absence of such thought can have serious long-term consequences.

So, how do we proceed?

My recommendation would be to stop any debate about whether we’re heading toward an AI-induced doomsday and instead ask the following three questions:

1. Are we clear on what we’re talking about when it comes to AI? There are three major domains of AI that we discuss –

  • AGI or Artificial General Intelligence. This is when robots become capable of being human (a.k.a. Westworld). Scientists like Alan Turing and John McCarthy envisioned this 70–80 years ago and we’re no closer to it now than we were then.
  • IA or Intelligence Augmentation. A classic current example of this is a search engine, as it augments our memory and factual knowledge. Many of the machine learning applications today are in this domain.
  • II or Intelligence Infrastructure. An example of this would be machine learning powered security systems that make use of a web of devices (infrastructure) to make human environments safer or more supportive. While we’re still in the early days, there’s plenty of investment in start-ups and fledgling companies directed here.

It is important to be clear about these domains because a lot of mainstream discussion bandwidth is devoted to the dangers of Artificial General Intelligence. That is a waste of time.

Instead, our discussions should center on IA and II. We’ve made plenty of progress here using techniques like Deep Learning. And, while both extend human capabilities, in the near term they also automate tasks that currently employ large groups of humans.

2. Are we conscious of the possible dark side of AI – specifically, the use of artificial intelligence for surveillance?
The Economist outlined this in a piece about the Workplace of the future —

And surveillance may feel Orwellian — a sensitive matter now that people have begun to question how much Facebook and other tech giants know about their private lives. Companies are starting to monitor how much time employees spend on breaks. Veriato, a software firm, goes so far as to track and log every keystroke employees make on their computers in order to gauge how committed they are to their company. Firms can use AI to sift through not just employees’ professional communications but their social-media profiles, too. The clue is in Slack’s name, which stands for “searchable log of all conversation and knowledge”.

The good news is that most of the preceding portions of the article talked about the benefits of algorithms in the workplace: fairer pay rises and promotions, improved productivity, and so on.

It will be on us to strike a good balance.

3. Are we designing the right social support systems to prepare us?
In a great New York Times piece titled “The Robots Are Coming, and Sweden Is Fine,” I found three notes fascinating –

  • “In Sweden, if you ask a union leader, ‘Are you afraid of new technology?’ they will answer, ‘No, I’m afraid of old technology,’” says the Swedish minister for employment and integration, Ylva Johansson. “The jobs disappear, and then we train people for new jobs. We won’t protect jobs. But we will protect workers.”
  • 80% of Swedes express positive views about robots and artificial intelligence, versus the 72% of Americans who declared themselves “worried” in a Pew Research survey.
  • The challenge, of course, is taxation. Taxes are ~60% in Sweden and are a key part of the social contract.

While the thought of ~60% taxes would be repulsive to many in the US, it is unclear how long we’ll be able to sustain the current reality.

German economist Heiner Flassbeck had a powerful graph showing the declining share of public wealth in rich countries (except Norway).

Public wealth in the US and UK is now negative. Low public wealth limits the government’s ability to regulate the economy, redistribute income, and mitigate rising inequality.

Regardless of artificial intelligence, income inequality has been rising everywhere.

If AI is expected to further increase the level of inequality, we’ll need to double down on the discussion on social support systems.

For the record, I’m not optimistic that this will happen. Our ability to prepare for changes before they hurt us is poor (see: climate change).

But, I’m hopeful that we can begin by changing how we approach conversations around AI. Maybe next time we hear a conversation about sentient machines, we’ll put a stop to it and refocus on the actual issues, like Orwellian uses of data and investing in social support systems to counter inequality. Maybe that, in turn, will mean thoughtful uses of AI in the organizations we’re part of.

And, maybe, just maybe, we’ll succeed in making the transition to a world with Intelligence Augmentation and Intelligence Infrastructure in the coming decades a lot less painful…


Links for additional reading

  • Shor’s algorithm to solve factorization with quantum computers — on Wikipedia
  • How to Become a Centaur — on MITpress
  • The Painting Robot that didn’t take away anyone’s job — on Wired
  • A respite from the robots (but a retraining emergency) — on Axios
  • Machines will take fewer jobs but low-skilled workers will still be badly hit — on The Financial Times
  • OECD research visual — on The Economist
  • The Artificial Intelligence revolution hasn’t happened yet — on Medium
  • The origins of Artificial Intelligence — on Rodney Brooks’ blog
  • The workplace of the future — on The Economist
  • AI State of the Union — on YouTube
  • The Exponential View — a curated newsletter that is the source of many of these links — Thanks Azeem
  • The robots are coming, and Sweden is fine — on The New York Times (a must read)
  • How inequality is evolving and why — on Flassbeck Economics (another must read)