Rosenverse


Hands-on AI #2: Understanding evals: LLM as a Judge
Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals, and a great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company shows how to create an eval for an AI product using an LLM as a judge — that is, using one Large Language Model to evaluate the output of another. We’ll look at how that works, and also dig into why it works at all: are we creating problems for ourselves when we let an LLM judge itself? The talk is hands-on, with plenty of time for questions. You will come away understanding when and how to use LLM as a judge, and with some product sense about how the best AI products today are built — and how that can help you use them more effectively yourself.
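As a rough sketch of what "LLM as a judge" means in practice: you build a prompt that asks a judge model a binary question about another model's output, then parse its reply into pass/fail. Everything here is illustrative — `call_llm` is a hypothetical stand-in for whichever model client you actually use, not an API from the talk.

```python
# Minimal LLM-as-a-judge sketch. `call_llm(prompt) -> str` is a
# hypothetical client for the judge model; swap in your own.

def build_judge_prompt(question: str, answer: str, criterion: str) -> str:
    """Assemble a binary (yes/no) judge prompt for a single criterion."""
    return (
        "You are evaluating the output of another AI system.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Criterion: {criterion}\n"
        "Does the answer meet this criterion? Reply with exactly YES or NO."
    )

def parse_verdict(raw: str) -> bool:
    """Map the judge model's free-text reply onto pass/fail."""
    return raw.strip().upper().startswith("YES")

def judge(call_llm, question: str, answer: str, criterion: str) -> bool:
    """Ask the judge model one binary question and parse the result."""
    prompt = build_judge_prompt(question, answer, criterion)
    return parse_verdict(call_llm(prompt))
```

Note the deliberate choice of a yes/no question per criterion rather than a 1–5 score — the talk's point that binary verdicts are more reliable than scales.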

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than numeric rating scales, because LLMs lack internal memory and apply a scale inconsistently from one call to the next.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.

  • Creating a written constitution of principles helps concretize AI behavior goals and guides prompt and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
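As a rough illustration of the feedback loop these insights describe, an eval can be as small as a labeled set of cases plus binary checks, tracked as a pass rate per criterion. This is a sketch under stated assumptions, not tooling from the talk; all names are illustrative.

```python
from typing import Callable, Dict, List, Tuple

# One case = (input given to the product, output it produced).
Case = Tuple[str, str]

def run_eval(
    cases: List[Case],
    criteria: Dict[str, Callable[[str, str], bool]],
) -> Dict[str, float]:
    """Score every case against every binary criterion and return the
    pass rate per criterion -- a shared 'definition of good' as numbers."""
    return {
        name: sum(check(inp, out) for inp, out in cases) / len(cases)
        for name, check in criteria.items()
    }
```

In practice the criteria dict could mix cheap deterministic checks (non-empty, correct format) with LLM-as-judge calls for the fuzzy, subjective qualities — and re-running it after every prompt change is the fast feedback loop the talk argues for.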

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."

