
Hands-on AI #2: Understanding evals: LLM as a Judge
Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company shows you how to create an eval for an AI product using an LLM as a judge: that is, using a large language model to evaluate the output of another large language model. We’ll look at how that works, but also dig into why it works at all. Are we creating problems for ourselves when we let an LLM judge itself? The talk is hands-on, with plenty of time for questions. You will come away understanding when and how to use an LLM as a judge, and with some product sense about how the best AI products today are built and how that can help you use them more effectively yourself.

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than rating scales, because LLMs lack the internal memory and consistency needed to apply a multi-point scale the same way each time.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.

  • Writing a constitution of principles helps make AI behavior goals concrete and guides prompting and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
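The LLM-as-a-judge pattern with binary scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: `call_llm` is a placeholder for whatever model client you use, and the prompt template and criterion are assumptions for the example.

```python
# Minimal LLM-as-a-judge sketch with binary (yes/no) scoring.
# Assumption: `call_llm` is a placeholder for your model client
# (OpenAI, Anthropic, a local model, etc.).

JUDGE_PROMPT = """You are evaluating the output of another AI system.

Criterion: {criterion}

Output to evaluate:
---
{output}
---

Answer with exactly one word, YES or NO: does the output meet the criterion?"""


def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real model call here.
    raise NotImplementedError("plug in your model client")


def parse_verdict(raw: str) -> bool:
    # Binary scoring: a single YES/NO is easier to parse and more
    # repeatable than asking the judge for a 1-5 rating.
    answer = raw.strip().split()[0].upper().rstrip(".,!")
    if answer == "YES":
        return True
    if answer == "NO":
        return False
    raise ValueError(f"unparseable judge verdict: {raw!r}")


def judge(output: str, criterion: str, model=call_llm) -> bool:
    prompt = JUDGE_PROMPT.format(criterion=criterion, output=output)
    return parse_verdict(model(prompt))
```

In practice you would run `judge` over a fixed set of test outputs, one binary criterion at a time, and track the pass rate as you iterate on your prompts.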

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."

