Rosenverse

Building impactful AI products for design and product leaders, Part 2: Evals are your moat
Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good. You don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to figure out if you are creating real value for your users, and when something goes wrong, it’s really tricky to figure out why. Simply Put’s Peter van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals using LLMs as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
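The LLM-as-judge and three-option insights above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the names `Verdict`, `parse_verdict`, `run_eval`, and `stub_judge` are hypothetical, and `stub_judge` is a keyword-based stand-in for a real LLM call that would carry a detailed definition of "good" in its prompt.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Verdict(Enum):
    """Three coarse labels, rather than a 1-10 scale, to keep judge output consistent."""
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


def parse_verdict(raw: str) -> Verdict:
    """Map the judge's raw reply onto a label; anything unexpected counts as MAYBE."""
    cleaned = raw.strip().lower()
    for verdict in Verdict:
        if cleaned == verdict.value:
            return verdict
    return Verdict.MAYBE


def run_eval(
    cases: List[Tuple[str, str]],
    judge_fn: Callable[[str, str], str],
) -> Dict[str, float]:
    """Judge every (question, answer) pair and return the share of each label."""
    counts = Counter(parse_verdict(judge_fn(q, a)) for q, a in cases)
    total = sum(counts.values())
    if total == 0:
        return {v.value: 0.0 for v in Verdict}
    return {v.value: counts[v] / total for v in Verdict}


def stub_judge(question: str, answer: str) -> str:
    """Hypothetical stand-in for an LLM judge; a real one would call a model."""
    if not answer.strip():
        return "no"
    if "not sure" in answer.lower():
        return "maybe"
    return "yes"


if __name__ == "__main__":
    cases = [
        ("Capital of France?", "Paris"),
        ("Capital of Peru?", ""),
        ("Capital of Chad?", "I'm not sure"),
        ("Capital of Japan?", "Tokyo"),
    ]
    print(run_eval(cases, stub_judge))
```

Swapping `stub_judge` for a function that calls a model keeps the rest of the harness unchanged, which is one way to rerun the same cases after each prompt or model change.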

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
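One way to put expert-tagged data to work is to check how often an automated judge agrees with the experts. The talk summary does not describe this step, so treat the sketch below as an assumption: the function names `agreement_rate` and `disagreements` are hypothetical, and the labels are made-up examples in the yes/no/maybe scheme.

```python
from typing import List


def agreement_rate(expert_labels: List[str], judge_labels: List[str]) -> float:
    """Fraction of cases where the automated judge matches the expert tag."""
    if len(expert_labels) != len(judge_labels):
        raise ValueError("expert and judge label lists must be the same length")
    if not expert_labels:
        return 0.0
    matches = sum(e == j for e, j in zip(expert_labels, judge_labels))
    return matches / len(expert_labels)


def disagreements(expert_labels: List[str], judge_labels: List[str]) -> List[int]:
    """Indices of cases to review by hand because judge and expert differ."""
    return [
        i
        for i, (e, j) in enumerate(zip(expert_labels, judge_labels))
        if e != j
    ]


if __name__ == "__main__":
    # Made-up labels for illustration only.
    expert = ["yes", "no", "maybe", "yes"]
    judge = ["yes", "no", "yes", "yes"]
    print(agreement_rate(expert, judge))
    print(disagreements(expert, judge))
```

The disagreement list is often more useful than the single rate, since each mismatch points at a case where the written definition of "good" may need more detail.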
