Rosenverse


Hands-on AI #2: Understanding evals: LLM as a Judge
Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals, and a great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company will show you how to create an eval for an AI product using an LLM as a judge — that is, using one Large Language Model to evaluate the output of another. We’ll look at how that works, and also dig into why it works at all: are we creating problems for ourselves when we let an LLM judge itself? The talk is hands-on, with plenty of time for questions. You will come away understanding when and how to use LLM as a judge, and with some product sense about how the best AI products today are built and how that can help you use them more effectively yourself.

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than numeric rating scales, because LLMs lack the internal memory and consistency needed to apply a scale the same way across calls.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.

  • Creating a written constitution of principles helps concretize AI behavior goals and guides prompt and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
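The LLM-as-judge pattern with binary scoring described above can be sketched in a few lines of Python. Everything here is hypothetical scaffolding, not code from the talk: `call_llm` is a stub standing in for a real model API call, and `JUDGE_PROMPT` is an illustrative judge prompt. The key idea it demonstrates is asking the judge a yes/no question rather than for a 1–5 rating.

```python
# Minimal sketch of an LLM-as-judge eval with binary (yes/no) scoring.
# `call_llm` is a stand-in for a real model API; it is stubbed here so
# the example runs without network access.

JUDGE_PROMPT = (
    "You are evaluating an AI assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Does the answer directly address the question? Reply YES or NO."
)

def call_llm(prompt: str) -> str:
    # Stub judge: a real implementation would send `prompt` to a model.
    # This toy heuristic replies YES whenever the answer is non-empty.
    answer = prompt.split("Answer: ")[1].split("\n")[0]
    return "YES" if answer.strip() else "NO"

def judge(question: str, answer: str) -> bool:
    """Ask the judge a binary question; True means the output passed."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return reply.strip().upper().startswith("YES")

def run_eval(cases: list[tuple[str, str]]) -> float:
    """Score a batch of (question, answer) pairs; return the pass rate."""
    passes = sum(judge(q, a) for q, a in cases)
    return passes / len(cases)

cases = [
    ("What is an eval?", "A test that measures AI output quality."),
    ("What is an eval?", ""),
]
print(run_eval(cases))  # 0.5 with the stub judge
```

A real setup would swap the stub for an API call and run `run_eval` over a fixed test set after every prompt change, turning "is this better?" into a number you can track.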

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."

