Rosenverse

Hands-on AI #2: Understanding evals: LLM as a Judge

Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals, and a great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company will show you how to create an eval for an AI product using an LLM as a judge (that is, using one Large Language Model to evaluate the output of another). We’ll look at how that works and dig into why it works at all. Are we creating problems for ourselves when we let an LLM judge itself? This talk is hands-on, and there will be plenty of time for questions. You will come away understanding when and how to use an LLM as a judge, with some product sense for how the best AI products today are built and how that can help you use them more effectively yourself.

Key Insights

  • Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.

  • Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.

  • Binary (yes/no) scoring is more reliable than rating on a scale (for example, one to five) because LLMs have no memory of their earlier ratings, so scale scores come out inconsistent.

  • Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.

  • High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.

  • Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.

  • Writing down a constitution of principles makes AI behavior goals concrete and guides both prompt design and model training.

  • Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.

  • Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.

  • Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
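
To make the yes/no judging idea above concrete, here is a minimal sketch in Python. It is not the code shown in the talk. The prompt wording, the names `EvalCase`, `parse_verdict`, `run_eval`, and `fake_judge`, and the healthcare-style test cases are all assumptions made for illustration. In a real eval you would replace `fake_judge` with a call to whichever model you use as the judge.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical judge prompt: it asks for a single YES/NO verdict
# instead of a 1-to-5 rating, following the binary-scoring insight.
JUDGE_PROMPT = """You are evaluating the answer of an AI assistant.
Criterion: {criterion}
Question: {question}
Answer: {answer}
Reply with exactly one word: YES if the answer meets the criterion, NO otherwise."""


@dataclass
class EvalCase:
    question: str
    answer: str  # the output of the system under test


def parse_verdict(raw: str) -> bool:
    """Turn the judge's reply into True (YES) or False (NO)."""
    text = raw.strip().upper()
    if text.startswith("YES"):
        return True
    if text.startswith("NO"):
        return False
    raise ValueError(f"Unparseable verdict: {raw!r}")


def run_eval(cases: List[EvalCase], criterion: str,
             call_judge: Callable[[str], str]) -> float:
    """Return the fraction of cases the judge marks as passing."""
    if not cases:
        raise ValueError("No eval cases given")
    passed = 0
    for case in cases:
        prompt = JUDGE_PROMPT.format(
            criterion=criterion, question=case.question, answer=case.answer
        )
        if parse_verdict(call_judge(prompt)):
            passed += 1
    return passed / len(cases)


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call (assumption, for demo only):
    says YES when the answer line mentions a doctor."""
    answer_line = next(
        line for line in prompt.splitlines() if line.startswith("Answer:")
    )
    return "YES" if "doctor" in answer_line.lower() else "NO"


# Invented example cases, not taken from the talk.
cases = [
    EvalCase("I have chest pain, what should I do?",
             "Please see a doctor right away."),
    EvalCase("I have chest pain, what should I do?",
             "Try some stretching."),
]
pass_rate = run_eval(cases, "The answer recommends seeking medical care.",
                     fake_judge)
print(pass_rate)
```

Passing the judge in as a plain function keeps the scoring logic separate from any particular model provider, so the same cases can be rerun whenever the prompt or the judge changes.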

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."
