Rosenverse

Building impactful AI products for design and product leaders, Part 2: Evals are your moat
Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to tell whether you are creating real value for your users, and when something goes wrong, it’s hard to figure out why. Simply Put’s Peter van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals using LLMs as a judge.
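
To make the idea of an eval concrete, here is a minimal sketch of an eval harness in Python. It is not taken from the talk: `EvalCase`, `run_evals`, `toy_system`, and the refund examples are all hypothetical names and data chosen for illustration. The sketch assumes the product can be called as a function from prompt to text and that each case can be checked with a simple rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One test input plus a check that says whether the output is acceptable."""
    prompt: str
    check: Callable[[str], bool]


def run_evals(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the system and return the fraction that pass."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.check(system(case.prompt)))
    return passed / len(cases)


def toy_system(prompt: str) -> str:
    # Hypothetical stand-in for the AI product under test.
    if "refund" in prompt.lower():
        return "Our refund window is 30 days."
    return "I am not sure."


# Illustrative cases only; real cases would come from research and domain experts.
cases = [
    EvalCase("What is the refund policy?", lambda out: "30 days" in out),
    EvalCase("Can I get a refund after a year?", lambda out: "30 days" in out),
    EvalCase("What are your opening hours?", lambda out: "not sure" not in out),
]

pass_rate = run_evals(toy_system, cases)
print(f"pass rate: {pass_rate:.2f}")
```

Tracking a number like `pass_rate` across releases is one way to answer "is the output good?" and "are we done?" without rereading every response by hand.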

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
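
The LLM-as-judge idea with a three-option verdict could be sketched as follows. This is an illustration under stated assumptions, not the speaker's implementation: `Verdict`, `JUDGE_TEMPLATE`, `parse_verdict`, and `judge` are hypothetical names, and `call_llm` is an injected function standing in for whichever model API a team uses. The stub `fake_llm` exists only so the example runs without a network call.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


# Hypothetical judge prompt; a real one would encode a detailed definition of "good".
JUDGE_TEMPLATE = (
    "You are grading an AI answer.\n"
    "Criterion: {criterion}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Does the answer meet the criterion? "
    "Reply with exactly one word: yes, no, or maybe."
)


def parse_verdict(raw: str) -> Verdict:
    """Map the judge's raw reply onto three labels; anything unclear becomes MAYBE."""
    word = raw.strip().lower().strip(".!\"'")
    for verdict in Verdict:
        if word == verdict.value:
            return verdict
    return Verdict.MAYBE


def judge(question: str, answer: str, criterion: str,
          call_llm: Callable[[str], str]) -> Verdict:
    """Ask the judge model to classify one answer against one criterion."""
    prompt = JUDGE_TEMPLATE.format(
        criterion=criterion, question=question, answer=answer
    )
    return parse_verdict(call_llm(prompt))


def fake_llm(prompt: str) -> str:
    # Stub for illustration only: says yes when the graded answer names the window.
    return "Yes." if "30 days" in prompt else "no"


criterion = "The answer states the refund window."
good = judge("What is the refund policy?",
             "Refunds are accepted within 30 days.", criterion, fake_llm)
bad = judge("What is the refund policy?",
            "Please contact support.", criterion, fake_llm)
print(good, bad)
```

Falling back to `MAYBE` on an unparseable reply keeps the label set closed, so downstream scoring only ever sees one of three values.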

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."

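The concern about consistency between different LLM calls can be checked directly by sending the same item to the judge several times and measuring how often the labels agree. The sketch below assumes the repeated verdicts have already been collected as strings; `agreement` is a hypothetical helper, and the `runs` list is made-up sample data, not a result from the talk.

```python
from collections import Counter


def agreement(labels: list[str]) -> tuple[str, float]:
    """Return the most common label and the fraction of runs that produced it."""
    if not labels:
        raise ValueError("need at least one label")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Made-up verdicts from five repeated judge calls on the same answer.
runs = ["yes", "yes", "maybe", "yes", "yes"]
label, rate = agreement(runs)
print(label, rate)
```

Running the same measurement on a yes/no/maybe judge and on a one-to-ten judge is one way a team could test for itself whether the coarser scale is more stable on its own data.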
