Rosenverse


Building impactful AI products for design and product leaders, Part 2: Evals are your moat

Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know whether your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it is very hard to tell whether you are creating real value for your users, and when something goes wrong, it is hard to work out why. Simply Put’s Peter van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals that use an LLM as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
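The LLM-as-judge and three-option insights above can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the prompt wording, the function names (`build_judge_prompt`, `parse_label`, `run_eval`), the sample cases, and the `stub_judge` stand-in for a real model call are all assumptions made for illustration. The sketch asks a judge to classify each output as yes, no, or maybe against one criterion, then compares the judge's labels with labels supplied by a domain expert.

```python
from collections import Counter
from typing import Callable

# Three coarse options rather than a 1-10 scale (labels are illustrative).
LABELS = ("yes", "no", "maybe")


def build_judge_prompt(criterion: str, user_input: str, output: str) -> str:
    """Assemble a classification prompt for the judge (hypothetical wording)."""
    return (
        f"Criterion: {criterion}\n"
        f"User input: {user_input}\n"
        f"System output: {output}\n"
        "Does the output meet the criterion? Reply with one word: yes, no, or maybe."
    )


def parse_label(raw: str) -> str:
    """Normalize the judge's reply; anything unexpected is treated as 'maybe'."""
    words = raw.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    return first if first in LABELS else "maybe"


def run_eval(cases: list, criterion: str, judge: Callable[[str], str]) -> dict:
    """Judge every case; report label counts and agreement with expert labels."""
    if not cases:
        raise ValueError("need at least one case")
    counts = Counter()
    agree = 0
    for case in cases:
        prompt = build_judge_prompt(criterion, case["input"], case["output"])
        label = parse_label(judge(prompt))
        counts[label] += 1
        agree += label == case["expert_label"]
    return {"counts": dict(counts), "agreement": agree / len(cases)}


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call, so the sketch runs offline."""
    return "Yes." if "please" in prompt.lower() else "no"


# Hypothetical cases; expert_label is what a domain expert would assign.
cases = [
    {"input": "Reset my password", "output": "Please click the reset link.", "expert_label": "yes"},
    {"input": "Cancel my order", "output": "Do it yourself.", "expert_label": "no"},
    {"input": "Where is my refund?", "output": "Refunds take 5 days.", "expert_label": "maybe"},
]

report = run_eval(cases, "The reply is polite.", stub_judge)
print(report)
```

In practice `stub_judge` would be replaced by a call to whichever model serves as the judge, and a low agreement score against expert labels would suggest that the criterion or the prompt needs a more detailed definition of "good".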

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
