Rosenverse


Hands-on AI #2: Understanding evals: LLM as a Judge

Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company will show you how to create an eval for an AI product using an LLM as a judge, that is, using one Large Language Model to evaluate the output of another. We’ll look at how that works, and also dig into why it works at all. Are we creating problems for ourselves when we let an LLM judge itself? This talk is hands-on, and there will be plenty of time for questions. You will leave understanding when and how to use an LLM as a judge, with a better product sense of how the best AI products are built today and how that knowledge can help you use them more effectively yourself.

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than a numeric rating scale: an LLM judge has no memory from one call to the next, so its ratings on a scale such as one to five are not consistent.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.

  • Creating a written constitution of principles makes AI behavior goals concrete and guides both prompt writing and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
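
The binary-scoring insight above can be made concrete with a small example. What follows is a minimal sketch, not the method shown in the talk: the prompt wording, the `Case` class, and the functions `build_prompt`, `parse_verdict`, `run_eval`, and `fake_judge` are all hypothetical names invented for illustration. The judge is passed in as a plain function from prompt text to reply text, so a real LLM call could be swapped in; here a keyword-based stand-in is used so the example runs without any API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical judge prompt; the wording is an assumption, not taken from the talk.
JUDGE_PROMPT = """You are grading the output of an AI assistant.
Criterion: {criterion}

Question: {question}
Answer: {answer}

Does the answer meet the criterion? Reply with exactly one word: YES or NO."""


@dataclass
class Case:
    question: str
    answer: str  # output of the system under test


def build_prompt(criterion: str, case: Case) -> str:
    """Fill the judge prompt with one criterion and one test case."""
    return JUDGE_PROMPT.format(
        criterion=criterion, question=case.question, answer=case.answer
    )


def parse_verdict(reply: str) -> bool:
    """Map the judge's reply to True (YES) or False (NO); reject anything else."""
    word = reply.strip().strip(".!").upper()
    if word == "YES":
        return True
    if word == "NO":
        return False
    raise ValueError(f"Judge reply is not a clear YES/NO: {reply!r}")


def run_eval(cases: List[Case], criterion: str, judge: Callable[[str], str]) -> float:
    """Ask the judge about every case and return the fraction that passed."""
    verdicts = [parse_verdict(judge(build_prompt(criterion, c))) for c in cases]
    return sum(verdicts) / len(verdicts)


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call: says YES if the answer mentions a doctor."""
    answer_line = next(
        line for line in prompt.splitlines() if line.startswith("Answer:")
    )
    return "YES" if "doctor" in answer_line.lower() else "NO"


criterion = "The answer recommends consulting a doctor when giving health advice."
cases = [
    Case("I have a headache, what should I do?",
         "Rest, drink water, and see a doctor if it persists."),
    Case("Can I double my dose?", "Sure, go ahead."),
]

score = run_eval(cases, criterion, fake_judge)
print(f"pass rate: {score:.0%}")
```

Asking for a single YES or NO per criterion keeps each judgment independent of the others, which matters because the judge has no memory of how it scored earlier cases. Rejecting any reply that is not a clear YES or NO, rather than guessing, surfaces judge prompts that need tightening.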

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."

