Rosenverse

Building impactful AI products for design and product leaders, Part 2: Evals are your moat

Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to tell whether you are creating real value for your users, and when something goes wrong, it’s hard to figure out why. Simply Put’s Peter van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals using LLMs as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
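The LLM-as-judge idea with a three-option scale can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: the names `Verdict`, `EvalCase`, `build_judge_prompt`, `parse_verdict`, `run_eval`, and `fake_judge` are all hypothetical, and the judge is passed in as a plain callable so that any model client could be plugged in. The stub `fake_judge` only stands in for a real model call.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Verdict(Enum):
    """Three-option scale, as suggested in the talk (yes / no / maybe)."""
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


@dataclass
class EvalCase:
    """One input/output pair from the product under test (hypothetical shape)."""
    question: str
    answer: str


def build_judge_prompt(criterion: str, case: EvalCase) -> str:
    """Ask the judge to classify the answer against one written criterion."""
    return (
        f"Criterion: {criterion}\n"
        f"Question: {case.question}\n"
        f"Answer: {case.answer}\n"
        "Does the answer meet the criterion? "
        "Reply with exactly one word: yes, no, or maybe."
    )


def parse_verdict(raw: str) -> Verdict:
    """Map the judge's raw text to a Verdict; unparseable replies become MAYBE."""
    token = raw.strip().lower().rstrip(".")
    for verdict in Verdict:
        if token == verdict.value:
            return verdict
    return Verdict.MAYBE  # assumption: route unclear replies to human review


def run_eval(
    criterion: str,
    cases: Iterable[EvalCase],
    judge: Callable[[str], str],
) -> Counter:
    """Run the judge over every case and count verdicts."""
    counts = Counter({verdict: 0 for verdict in Verdict})
    for case in cases:
        counts[parse_verdict(judge(build_judge_prompt(criterion, case)))] += 1
    return counts


def fake_judge(prompt: str) -> str:
    """Stub standing in for a real LLM call, for demonstration only."""
    return "yes" if "30 days" in prompt else "no"


if __name__ == "__main__":
    cases = [
        EvalCase("What is the refund window?", "Refunds are accepted within 30 days."),
        EvalCase("What is the refund window?", "I am not sure."),
    ]
    results = run_eval("The answer states the refund window.", cases, fake_judge)
    for verdict, count in results.items():
        print(verdict.value, count)
```

Keeping the judge behind a callable is a design choice made here for testability: the counting and parsing logic can be exercised with a stub before any model is involved. The written criterion is where the detailed definition of "good" would live.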

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
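One way to make the point about consistency between LLM calls concrete is to run the same judge prompt several times and measure how often the calls agree. The sketch below is an assumption about how that could be measured, not something described in the talk: `agreement_rate` is a hypothetical helper, and the label lists in the usage section are invented purely to show the arithmetic, not measurements from any model.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(labels: Sequence[str]) -> float:
    """Fraction of repeated judge calls that returned the most common label."""
    if not labels:
        raise ValueError("need at least one label")
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)


if __name__ == "__main__":
    # Invented example labels for the same answer judged five times.
    three_option_labels = ["yes", "yes", "yes", "maybe", "yes"]
    ten_point_labels = ["7", "8", "6", "7", "9"]
    print(agreement_rate(three_option_labels))
    print(agreement_rate(ten_point_labels))
```

A low agreement rate on a given criterion would suggest that the scale or the wording of the criterion needs tightening before the eval's scores can be trusted.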
