Rosenverse


Building impactful AI products for design and product leaders, Part 2: Evals are your moat
Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speaker: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for the ongoing evaluation of quality. Without evals, you don’t know if your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it is very hard to tell whether you are creating real value for your users, and when something goes wrong, it is tricky to figure out why. Simply Put’s Peter Van Dijck demystifies evals and shares a simple framework for planning and building useful evals, from qualitative user research to automated evals that use an LLM as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
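The “LLM as a judge” pattern with a three-option scale described above can be sketched as follows. This is a minimal illustration, not the speaker’s implementation; `call_llm` is a hypothetical placeholder (stubbed here so the sketch runs) that you would replace with a real model call.

```python
# Sketch of an automated eval using an LLM as a judge with a
# coarse three-option scale (yes/no/maybe), which tends to score
# more consistently across calls than a fine-grained 1-10 scale.

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Criterion: {criterion}
Reply with exactly one word: yes, no, or maybe."""


def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in your provider's SDK call.
    # Stubbed with a fixed response so this sketch is runnable.
    return "yes"


def judge(question: str, answer: str, criterion: str) -> str:
    raw = call_llm(
        JUDGE_PROMPT.format(question=question, answer=answer, criterion=criterion)
    )
    verdict = raw.strip().lower()
    # Fall back to "maybe" if the model strays from the three options,
    # so malformed judge output never silently counts as a pass or fail.
    return verdict if verdict in {"yes", "no", "maybe"} else "maybe"
```

Restricting the judge to three options trades granularity for repeatability: classification is easier than generation, and a smaller label set leaves less room for drift between calls.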

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
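The synthetic-data idea quoted above — generating more examples of something you already have — can be sketched like this. All names here are illustrative, and `call_llm` is again a hypothetical placeholder stubbed with a fixed response so the sketch runs without a model.

```python
import json


def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real model call, stubbed with a
    # fixed JSON response for this runnable sketch.
    return json.dumps([
        "Can I reschedule my appointment?",
        "How do I move my booking to another day?",
    ])


def expand_examples(seed: str, n: int = 2) -> list[str]:
    # Ask the model for n rewordings of one hand-written seed example,
    # expanding a small manually curated dataset.
    prompt = (
        "Here is one example user message:\n"
        f"{seed}\n"
        f"Write {n} new messages with the same intent but different wording. "
        "Return only a JSON list of strings."
    )
    variants = json.loads(call_llm(prompt))
    # Keep only well-formed string variants, capped at n.
    return [v for v in variants if isinstance(v, str)][:n]
```

The seed examples and quality criteria still come from humans (ideally domain experts); the model only multiplies wording variety around intents you have already vetted.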
