Building impactful AI products for design and product leaders, Part 2: Evals are your moat
Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to tell whether you are creating real value for your users, and when something goes wrong, it’s hard to work out why. Simply Put’s Peter van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals using an LLM as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort: expect 20-40% of project resources to be devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
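
The LLM-as-judge and three-option ideas above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the prompt wording, the function names (build_judge_prompt, parse_verdict, run_eval), the sample cases, and the fake_judge stand-in are all assumptions made for the example. A real setup would replace fake_judge with a call to an actual model.

```python
from collections import Counter
from typing import Callable

# Three coarse labels, following the "yes/no/maybe" idea from the talk.
LABELS = ("yes", "no", "maybe")

def build_judge_prompt(criterion: str, question: str, answer: str) -> str:
    """Assemble a classification prompt (hypothetical wording)."""
    return (
        "You are grading an answer.\n"
        f"Criterion: {criterion}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly one word: yes, no, or maybe."
    )

def parse_verdict(raw: str) -> str:
    """Map raw judge text to a label; anything unexpected becomes 'maybe'."""
    words = raw.strip().lower().split()
    word = words[0].strip(".,:;!\"'") if words else ""
    return word if word in LABELS else "maybe"

def run_eval(cases, judge: Callable[[str], str], criterion: str) -> Counter:
    """Run the judge over every case and count the verdicts."""
    counts = Counter({label: 0 for label in LABELS})
    for case in cases:
        prompt = build_judge_prompt(criterion, case["question"], case["answer"])
        counts[parse_verdict(judge(prompt))] += 1
    return counts

def fake_judge(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): keyword rules only."""
    if "30 days" in prompt:
        return "Yes."
    if "might" in prompt:
        return "Unsure, possibly"
    return "No"

CRITERION = "The answer states the refund window in days."
CASES = [
    {"question": "How long do I have to return an item?",
     "answer": "You can get a refund within 30 days."},
    {"question": "How long do I have to return an item?",
     "answer": "Please contact support."},
    {"question": "How long do I have to return an item?",
     "answer": "It might be a few weeks."},
]

if __name__ == "__main__":
    print(run_eval(CASES, fake_judge, CRITERION))
```

Mapping any unexpected judge output to "maybe" keeps the score set small and makes unclear cases easy to route to a human reviewer.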

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
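
One way to act on the point about domain experts is to compare an automated judge's labels against expert-tagged labels for the same examples. The sketch below is an assumption for illustration only: the function name agreement_report and the sample labels are hypothetical, and the talk does not prescribe this particular check.

```python
from typing import List, Sequence, Tuple

def agreement_report(
    expert_labels: Sequence[str], judge_labels: Sequence[str]
) -> Tuple[float, List[int]]:
    """Return the share of matching labels and the indexes that disagree."""
    if len(expert_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not expert_labels:
        raise ValueError("need at least one labelled example")
    disagreements = [
        i for i, (expert, judge) in enumerate(zip(expert_labels, judge_labels))
        if expert != judge
    ]
    rate = 1 - len(disagreements) / len(expert_labels)
    return rate, disagreements

if __name__ == "__main__":
    # Hypothetical labels for four examples.
    expert = ["yes", "no", "no", "maybe"]
    judge = ["yes", "no", "yes", "maybe"]
    rate, disagreements = agreement_report(expert, judge)
    print(rate, disagreements)
```

The disagreeing indexes are the examples worth reviewing with the expert, since they show where the definition of "good" given to the judge may need more detail.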
