Rosenverse

Building impactful AI products for design and product leaders, Part 2: Evals are your moat

Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to figure out if you are creating real value for your users, and when something goes wrong, it’s hard to figure out why. Simply Put’s Peter Van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals using LLMs as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
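
The LLM-as-judge and three-option insights above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the label set (`yes`/`no`/`maybe`), the `parse_verdict` and `summarize` helpers, and the `fake_judge` stand-in for a real model call are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable

# Hypothetical three-option label set; the talk only gives "yes/no/maybe" as an example.
LABELS = ("yes", "no", "maybe")


def parse_verdict(raw: str) -> str:
    """Map a judge's free-text reply onto one of the three labels.

    Anything that is not one of the labels falls back to "maybe",
    so a malformed reply is never counted as a pass or a fail.
    """
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else "maybe"


def summarize(outputs: list[str], judge: Callable[[str], str]) -> dict[str, float]:
    """Run the judge over every output and return the share of each label."""
    counts = Counter(parse_verdict(judge(o)) for o in outputs)
    total = max(len(outputs), 1)  # avoid dividing by zero on an empty batch
    return {label: counts[label] / total for label in LABELS}


def fake_judge(output: str) -> str:
    """Stand-in for an LLM call (assumption). A real judge would send the
    output plus a written definition of 'good' to a model."""
    if "sorry" in output.lower():
        return "No."
    if len(output.split()) < 3:
        return "unsure"
    return "Yes"


if __name__ == "__main__":
    sample = [
        "Your refund was issued on Monday.",
        "Sorry, I cannot help.",
        "Ok",
        "The appointment is confirmed for 3pm.",
    ]
    print(summarize(sample, fake_judge))
```

Restricting the judge to three labels keeps the parsing step trivial and makes repeated runs easier to compare than a one-to-ten score would.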

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
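
One way to act on the quotes about domain experts is to compare an automated judge against expert-tagged examples before relying on it. The sketch below is an assumption about how that check might look, not a method described in the talk; the function names, example IDs, and labels are all hypothetical.

```python
def agreement_rate(expert: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge's label matches the expert's."""
    if len(expert) != len(judge):
        raise ValueError("expert and judge label lists must be the same length")
    if not expert:
        return 0.0
    matches = sum(e == j for e, j in zip(expert, judge))
    return matches / len(expert)


def disagreements(
    ids: list[str], expert: list[str], judge: list[str]
) -> list[tuple[str, str, str]]:
    """List (example id, expert label, judge label) for every mismatch,
    so a domain expert can review where the judge goes wrong."""
    return [(i, e, j) for i, e, j in zip(ids, expert, judge) if e != j]


if __name__ == "__main__":
    # Hypothetical expert tags and judge verdicts for five examples.
    ids = ["ex-1", "ex-2", "ex-3", "ex-4", "ex-5"]
    expert = ["yes", "no", "no", "maybe", "yes"]
    judge = ["yes", "yes", "no", "maybe", "no"]
    print(agreement_rate(expert, judge))
    print(disagreements(ids, expert, judge))
```

The list of mismatches is often more useful than the rate itself, because each one points at a gap in the written definition of "good" that the judge was given.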
