Notes on LLM RecSys Product – Edition 4 of a newsletter focused on building LLM-powered products.
Quick recap: We’ve covered the central thesis that LLM recsys is the core primitive of AI-native products, how teacher models enable painful self-awareness, and how the eval loop (not just evals) drives systematic improvement.
The eval loop only works if your teacher model knows what “good” looks like. That requires rigorous product policy.
What Product Policy Is
Product policy is the crystallization of the product team’s intuition. It’s the best understanding of what a great user experience looks like.
PRDs describe deterministic features we want to build. Product Policy, on the other hand, defines the behavior the product should exhibit – and gets encoded into the teacher model, the production models, and the eval loop.
You won’t know if your definition of quality matches what users actually value until you test. If your true north metrics – typically laddering into user retention or end outcomes – improve, your intuition was right. If not, you refine. The Product Policy evolves with user signal.
Policy Encodes Judgment
For small products, the Policy could be written entirely by one author.
For large products, however, you'll typically need multiple contributors debating every gray area (and there will be many). These debates matter because policy decisions cascade through the entire system. They define what you measure, what counts as “good,” and what users see.
This is where product leaders must be hands-on. You can’t delegate the constitution. Your understanding of the user and your judgment must show.
One note – even for complex policies, I'd recommend a single author so the Policy reads coherently. It forces the entire team to align on one point of view vs. “you take that section and I’ll write this.”
The Rubric
Policy lives in a rubric. This could be a binary 0/1 or a more sophisticated graded rubric. Here’s an example – let’s imagine you’re an e-commerce product team and are building out the policy for a product query.
You look at the most common product queries and pick a popular one – “toaster”.
Let’s explore what the rubric might look like – say, a 0-4 scale, with a written definition and example products for each score.
Next, assuming the rubric feels right, we might make a policy decision – show only results rated 3-4. Filter out 0s and 1s, and show 2/Fair with a different UX (e.g., “Related”).
This in turn means toaster ovens are out – even though they toast. This is judgment made operational. And it cascades through millions of queries.
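As a sketch, the rubric and the policy decision on top of it can be encoded as data plus a filter. The score labels, thresholds, and example products below are illustrative assumptions for the “toaster” query, not a real production policy:

```python
# Illustrative 0-4 rubric for the "toaster" query (labels are assumptions).
RUBRIC = {
    4: "Great - a pop-up toaster that directly satisfies the query",
    3: "Good - a toaster with minor mismatches (e.g., unusual size)",
    2: "Fair - adjacent products that toast, like toaster ovens",
    1: "Poor - loosely related items, like toaster pastries",
    0: "Irrelevant - no meaningful connection to the query",
}

def apply_policy(scored_results):
    """Policy decision: show 3-4 as primary results, surface 2s under a
    'Related' UX, and drop 0s and 1s entirely."""
    primary = [r for r, score in scored_results if score >= 3]
    related = [r for r, score in scored_results if score == 2]
    return primary, related

results = [
    ("2-slice pop-up toaster", 4),
    ("toaster oven", 2),
    ("toaster pastry", 1),
]
primary, related = apply_policy(results)
# primary keeps only the pop-up toaster; the toaster oven moves to "Related"
```

The interesting design choice is that the rubric (what a score means) and the policy (what to do with each score) are separate: you can tighten the display threshold without re-labeling anything.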
The Golden Dataset
The golden dataset brings the rubric to life with examples. A complex policy might need a golden set of roughly 500 examples to bring the various judgments to life.
Not just “this is a 3” but “this is a 3 because…” – high-quality examples with detailed chain of thought.
The golden dataset evolves over time as you learn from user signal. What you thought was a 3 might become a 2. What you filtered out might need to be included.
The author of the policy should drive the golden dataset process – ensuring consistency, adjudicating disagreements, maintaining the chain of thought quality.
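A minimal sketch of what one golden-dataset entry might look like, assuming a simple record shape (the fields and wording here are hypothetical): the label alone isn't enough – each example carries the chain of thought behind the score.

```python
# Hypothetical shape of a single golden-dataset entry. The point is the
# "rationale" field: "this is a 2 because...", not just "this is a 2".
golden_example = {
    "query": "toaster",
    "result": "4-slice stainless steel toaster oven",
    "score": 2,
    "rationale": (
        "It toasts, so it's adjacent to the query intent, but the policy "
        "treats toaster ovens as a different category - Fair, not Good."
    ),
}
```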
The Two Gates
Policy quality is measured through two gates:
Gate 1: Is the policy clear?
Have multiple raters score the same examples using your rubric. Measure inter-rater agreement using Cohen’s kappa.
Cohen’s kappa measures agreement beyond chance – because even random guessing produces some agreement. (See the Wikipedia article and scikit-learn’s implementation, `sklearn.metrics.cohen_kappa_score`.)
The interpretation:
- κ > 0.8 = strong agreement (policy is clear)
- κ 0.6-0.8 = moderate (policy needs refinement)
- κ < 0.6 = weak (policy is ambiguous)
If your raters can’t agree whether a result is a 2 or a 3, the policy isn’t clear enough. Sharpen definitions, add examples, debate more.
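Gate 1 can be measured in a few lines with scikit-learn. The ratings below are made-up toy data for illustration:

```python
# Two raters score the same 8 examples on the 0-4 rubric; Cohen's kappa
# tells us how much they agree beyond what chance alone would produce.
from sklearn.metrics import cohen_kappa_score

rater_a = [4, 3, 2, 0, 3, 1, 4, 2]
rater_b = [4, 3, 2, 1, 3, 1, 4, 2]  # disagrees on one example (0 vs 1)

kappa = cohen_kappa_score(rater_a, rater_b)
# ~0.84 on this toy data - above the 0.8 bar, so Gate 1 would pass
```

Gate 2 uses the exact same call, substituting the model's scores for one rater and the golden-dataset labels for the other.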
Gate 2: Does the model encode it?
Once Gate 1 passes, measure how well the model’s judgments match your golden dataset. Same metric: Cohen’s kappa between model output and human-labeled examples.
Gate 1 tells you if humans understand the policy. Gate 2 tells you if the model does.
One Policy, Not Many
A common mistake: architecting your system with multiple policies for different stages – e.g., one for ranking, another for filtering, another for personalization.
The problem: every model is imperfect. If each stage operates at 80% quality, your end-to-end experience is 0.8 × 0.8 × 0.8 ≈ 51%.
Three “pretty good” stages compound into a mediocre experience.
The better approach is to build one unified policy. Define what great looks like end-to-end. Train one teacher on that complete experience. Let the production stack learn from that unified signal.
This is harder to build. But it’s the only way to avoid compounding errors through your system.
Next up: The eval loop in practice. How teams actually run it, what metrics matter, and when to invest in policy versus when to keep it simple.
