Rosenverse

Building impactful AI products for design and product leaders, Part 2: Evals are your moat
Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”—an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good. You don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to figure out if you are creating real value for your users, and when something goes wrong, it’s really tricky to figure out why. Simply Put’s Peter van Dijck will demystify evals, and share a simple framework for planning for and building useful evals, from qualitative user research to automated evals using LLMs as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals takes substantial effort: expect to devote 20-40% of project resources to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
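
As a rough illustration of the three-option, LLM-as-judge idea in the insights above, the Python sketch below labels each output yes, no, or maybe and tallies the results. Every name in it (`Verdict`, `keyword_judge`, `run_eval`, the sample cases) is hypothetical and not taken from the talk. The judge is a plain keyword stub standing in for a real LLM call, so the example runs offline.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    """Three coarse labels instead of a fine-grained 1-10 scale."""
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


def keyword_judge(question: str, answer: str) -> Verdict:
    """Hypothetical stand-in for an LLM judge call.

    A real judge would send the question, the answer, and a written
    definition of "good" to a model and parse one of the three labels.
    This stub only looks for fixed phrases (an assumption for the demo).
    """
    text = answer.lower()
    if "consult a doctor" in text:
        return Verdict.YES
    if "not sure" in text:
        return Verdict.MAYBE
    return Verdict.NO


def run_eval(cases, judge) -> Counter:
    """Apply the judge to every (question, answer) pair and count labels."""
    return Counter(judge(question, answer) for question, answer in cases)


# Made-up sample cases, for illustration only.
cases = [
    ("Can I double my dose?", "Please consult a doctor before changing a dose."),
    ("Can I double my dose?", "Sure, go ahead."),
    ("Is this rash serious?", "I'm not sure; it could be many things."),
]

counts = run_eval(cases, keyword_judge)
pass_rate = counts[Verdict.YES] / len(cases)
print({v.value: counts[v] for v in Verdict}, round(pass_rate, 2))
```

In a real system the stub would be replaced by a model call, and the written definition of "good" passed to the judge would carry most of the value.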

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."

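One way to act on the point about domain experts is to check how often an automated judge agrees with expert-tagged labels before trusting it. The Python sketch below is an assumption about how that check could look, not a method described in the talk. The function names and the label lists are invented for illustration.

```python
def agreement_rate(expert_labels, judge_labels):
    """Fraction of items where the judge's label matches the expert's."""
    if len(expert_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not expert_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(e == j for e, j in zip(expert_labels, judge_labels))
    return matches / len(expert_labels)


def disagreements(expert_labels, judge_labels):
    """Indices of items to review by hand because the labels differ."""
    return [
        i
        for i, (e, j) in enumerate(zip(expert_labels, judge_labels))
        if e != j
    ]


# Made-up labels using the yes/no/maybe options.
expert = ["yes", "no", "no", "maybe", "yes"]
judge = ["yes", "no", "yes", "maybe", "yes"]

rate = agreement_rate(expert, judge)
to_review = disagreements(expert, judge)
print(rate, to_review)
```

Items where the two disagree are natural candidates for refining either the judge's instructions or the definition of "good".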