Rosenverse

Building impactful AI products for design and product leaders, Part 2: Evals are your moat

Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know whether your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it is hard to tell whether you are creating real value for your users, and when something goes wrong, it is hard to figure out why. Simply Put’s Peter van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals that use LLMs as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
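
The LLM-as-judge idea with a three-option scale, as described in the insights above, might look roughly like the following. This is a minimal sketch, not the speaker's implementation: the prompt wording, the names `JUDGE_PROMPT`, `parse_verdict`, `run_eval`, and `fake_judge`, and the refund example are all hypothetical. The judge is passed in as a plain function so that a real model call could replace the stand-in used here.

```python
from collections import Counter
from typing import Callable

# The three allowed verdicts; a coarse scale is assumed to be easier
# for a judge model to apply consistently than a 1-10 rating.
VERDICTS = ("yes", "no", "maybe")

# Hypothetical judge prompt; real criteria would come from research
# and domain experts.
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Criterion: {criterion}
Question: {question}
Answer: {answer}
Reply with exactly one word: yes, no, or maybe."""


def parse_verdict(raw: str) -> str:
    """Map the judge's raw reply to a verdict; unclear replies become 'maybe'."""
    word = raw.strip().lower().rstrip(".!")
    return word if word in VERDICTS else "maybe"


def run_eval(cases: list[dict], criterion: str,
             judge: Callable[[str], str]) -> Counter:
    """Ask the judge about every case and count the verdicts."""
    counts = Counter({v: 0 for v in VERDICTS})
    for case in cases:
        prompt = JUDGE_PROMPT.format(
            criterion=criterion,
            question=case["question"],
            answer=case["answer"],
        )
        counts[parse_verdict(judge(prompt))] += 1
    return counts


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: swap in your model client)."""
    if "30 days" in prompt:
        return "Yes."
    if "not possible" in prompt:
        return "no"
    return "It depends on the order."


CRITERION = "The answer states the correct refund window."
CASES = [
    {"question": "How long do I have to return an item?",
     "answer": "You can get a refund within 30 days."},
    {"question": "How long do I have to return an item?",
     "answer": "Returns are not possible."},
    {"question": "How long do I have to return an item?",
     "answer": "Please contact support."},
]

if __name__ == "__main__":
    print(run_eval(CASES, CRITERION, fake_judge))
```

Treating any unparseable reply as "maybe" is a design choice made for this sketch; a team could just as reasonably log such replies and retry.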

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
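
One way to act on the quotes about domain experts and judge consistency is to compare an automated judge's verdicts with labels that experts assigned to the same examples. The sketch below is an illustration under assumptions, not something shown in the talk: the function name `agreement_report` and the sample labels are hypothetical, and plain percentage agreement is used only because it is the simplest measure.

```python
def agreement_report(expert_labels: list[str], judge_labels: list[str]) -> dict:
    """Compare judge verdicts with expert labels for the same examples.

    Returns the share of matching labels and the indices where they differ,
    so a person can review the disagreements by hand.
    """
    if not expert_labels or len(expert_labels) != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(expert_labels, judge_labels))
    matches = sum(expert == judge for expert, judge in pairs)
    disagreements = [i for i, (expert, judge) in enumerate(pairs)
                     if expert != judge]
    return {
        "agreement": matches / len(pairs),
        "disagreements": disagreements,
    }


if __name__ == "__main__":
    # Hypothetical labels on the yes/no/maybe scale.
    expert = ["yes", "no", "no", "maybe"]
    judge = ["yes", "no", "yes", "maybe"]
    print(agreement_report(expert, judge))
```

The list of disagreeing indices is arguably the more useful output: reviewing those cases is a chance to refine either the judge prompt or the written definition of "good".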
