
Hands-on AI #2: Understanding evals: LLM as a Judge
Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company will show you how to create an eval for an AI product using an LLM as a judge (that is, using one Large Language Model to evaluate the output of another). We’ll look at how that works, but also dig into why it works at all: are we creating problems for ourselves when we let an LLM judge itself? This talk is hands-on, and there will be plenty of time for questions. You will come away understanding when and how to use an LLM as a judge, and you’ll build some product sense around how the best AI products are built today and how that can help you use them more effectively yourself.
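A minimal sketch of the LLM-as-a-judge pattern the talk describes. Everything here is illustrative: `call_llm` is a hypothetical stand-in for whatever model API you use, and the judge prompt is an example, not the speaker's actual prompt. Note the binary (yes/no) verdict, which the talk recommends over rating scales.

```python
# Hypothetical stand-in for a real model API call (OpenAI, Anthropic, a
# local model, ...). Stubbed so the example runs without network access.
def call_llm(prompt: str) -> str:
    return "yes"

# Example judge prompt: ask for an exact binary verdict, not a 1-5 score.
JUDGE_PROMPT = """You are evaluating another model's answer.
Question: {question}
Answer: {answer}
Does the answer directly address the question? Reply with exactly "yes" or "no"."""

def judge(question: str, answer: str) -> bool:
    """Ask a judge model for a binary pass/fail verdict on an output."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return reply.strip().lower().startswith("yes")

print(judge("What is an eval?", "A test that defines 'good' for an AI product."))
```

In practice the judge is usually a different (often stronger) model than the one being evaluated, which is one answer to the "letting an LLM judge itself" worry raised above.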

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than numeric rating scales because LLMs lack internal memory and consistency.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.

  • Creating a written constitution of principles helps concretize AI behavior goals and guides prompt and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
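The "written constitution" insight above can be made concrete: each principle becomes a binary check, and running outputs through all checks yields a per-principle pass rate. The principles, checks, and sample outputs below are illustrative assumptions, not taken from the talk.

```python
# Sketch: turn a written "constitution" of principles into binary eval
# checks and report a pass rate per principle. All names and thresholds
# here are hypothetical examples.
PRINCIPLES = {
    "answers are concise": lambda out: len(out.split()) <= 50,
    "no hedging filler": lambda out: "as an ai" not in out.lower(),
}

def run_evals(outputs):
    """Score each output against every principle (yes/no) and aggregate."""
    passed = {name: 0 for name in PRINCIPLES}
    for out in outputs:
        for name, check in PRINCIPLES.items():
            if check(out):
                passed[name] += 1
    n = len(outputs)
    return {name: count / n for name, count in passed.items()}

outputs = [
    "Evals define what good means for your product.",
    "As an AI, I cannot say, but here is a very long answer " * 10,
]
print(run_evals(outputs))  # each principle passes on 1 of 2 outputs
```

Simple rule-based checks like these cover the black-and-white cases; the fuzzy, subjective principles are where an LLM judge (as sketched earlier in the talk) takes over.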

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."

