Rosenverse

Building impactful AI products for design and product leaders, Part 2: Evals are your moat
Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to tell whether you are creating real value for your users, and when something goes wrong, it’s tricky to figure out why. Simply Put’s Peter van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals that use an LLM as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
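
As a rough illustration of the LLM-as-a-judge and three-option points above, here is a minimal Python sketch. It is not taken from the talk: the prompt wording, the function names (build_judge_prompt, parse_verdict, run_eval), the sample cases, and the politeness criterion are all assumptions made for illustration, and stub_judge is a keyword-based stand-in for a real model call so the example runs offline.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Three coarse labels instead of a 1-10 scale, as the insights above suggest.
LABELS = ("yes", "no", "maybe")


def build_judge_prompt(criterion: str, question: str, answer: str) -> str:
    """Hypothetical prompt format; a real project would tune this wording."""
    return (
        f"Criterion: {criterion}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly one word: yes, no, or maybe."
    )


def parse_verdict(raw: str) -> str:
    """Map a free-text judge reply to one label; unclear replies become 'maybe'."""
    words = raw.strip().lower().split()
    first = words[0].strip(".,:;!\"'") if words else ""
    return first if first in LABELS else "maybe"


def run_eval(
    cases: Iterable[Tuple[str, str]],
    judge: Callable[[str], str],
    criterion: str,
) -> Counter:
    """Ask the judge about each (question, answer) pair and tally the labels."""
    tally = Counter({label: 0 for label in LABELS})
    for question, answer in cases:
        raw = judge(build_judge_prompt(criterion, question, answer))
        tally[parse_verdict(raw)] += 1
    return tally


def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): a keyword check on the answer line."""
    answer = prompt.split("Answer:", 1)[1].splitlines()[0].lower()
    if "please" in answer or "thank" in answer:
        return "Yes."
    if "whatever" in answer:
        return "No"
    return "Unsure, hard to tell"


# Made-up sample data for the sketch.
CASES = [
    ("Where is my order?", "Thanks for asking, please allow two days."),
    ("Can I get a refund?", "Whatever, read the policy."),
    ("Do you ship abroad?", "We ship to 12 countries."),
]

if __name__ == "__main__":
    print(run_eval(CASES, stub_judge, "The answer is polite."))
```

In practice the judge argument would wrap a call to whichever model the team uses, and the criterion text is where the detailed definition of "good" would live. Treating any unparseable reply as "maybe" is one possible design choice: it keeps the tally consistent across calls without guessing at the judge's intent.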

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
