
Hands-on AI #2: Understanding evals: LLM as a Judge
Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company will show you how to create an eval for an AI product using an LLM as a judge: using one Large Language Model to evaluate the output of another. We’ll look at how that works, but also dig into why it works at all. Are we creating problems for ourselves when we let an LLM judge itself? This talk is hands-on, and there will be plenty of time for questions. You will come away understanding when and how to use LLM as a judge, and build some product sense around how the best AI products are built today, and how that can help you use them more effectively yourself.

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than numeric rating scales, because LLMs lack internal memory and apply ranges inconsistently.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback loops and improves quality.

  • Creating a written constitution of principles helps concretize AI behavior goals and guides prompt and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
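
The LLM-as-judge pattern with binary scoring described above can be sketched in a few lines. This is an illustrative sketch, not code from the talk; `call_llm` is a placeholder stub standing in for a real model API call, and the prompt template and example cases are invented for demonstration:

```python
# Minimal sketch of an "LLM as a judge" eval harness with binary (yes/no)
# scoring. In practice, call_llm would call a real model API; here a crude
# keyword check stands in so the sketch runs on its own.

JUDGE_TEMPLATE = """You are evaluating an AI assistant's answer.
Criterion: {criterion}
Question: {question}
Answer: {answer}
Respond with exactly one word: yes or no."""


def call_llm(prompt: str) -> str:
    # Placeholder judge: a real implementation would send the prompt to a
    # model. This stand-in only "passes" answers that mention the refund flow.
    return "yes" if "request refund" in prompt.lower() else "no"


def judge(question: str, answer: str, criterion: str) -> bool:
    prompt = JUDGE_TEMPLATE.format(
        criterion=criterion, question=question, answer=answer
    )
    # Binary scoring: a single pass/fail verdict, not a 1-5 rating.
    return call_llm(prompt).strip().lower() == "yes"


def run_eval(cases: list[tuple[str, str]], criterion: str) -> float:
    # Score every (question, answer) pair and report the pass rate.
    results = [judge(q, a, criterion) for q, a in cases]
    return sum(results) / len(results)


cases = [
    ("How do I get a refund?", "Go to Orders and click 'Request refund'."),
    ("How do I get a refund?", "Our company was founded in 1998."),
]
pass_rate = run_eval(cases, "Does the answer address the user's question?")
```

A pass rate over a fixed set of cases gives the fast, repeatable feedback loop the talk describes: change a prompt, rerun the eval, compare numbers.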

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."

