Rosenverse


Building impactful AI products for design and product leaders, Part 2: Evals are your moat
Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speaker: Peter Van Dijck

Summary

The secret ingredient of impactful AI products is "evals": an architecture for the ongoing evaluation of output quality. Without evals, you don't know whether your output is good, and you don't know when you're done. Because outputs are non-deterministic, it is hard to tell whether you are creating real value for your users, and when something goes wrong, it is tricky to work out why. Simply Put's Peter Van Dijck demystifies evals and shares a simple framework for planning and building useful ones, from qualitative user research to automated evals that use an LLM as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
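The LLM-as-judge pattern with a coarse yes/no/maybe verdict, as described in the insights above, can be sketched as a minimal eval loop. This is an illustrative sketch, not code from the talk: `call_llm` is a stub standing in for a real model API call, and the prompt, criterion, and dataset are invented examples.

```python
# Minimal sketch of an "LLM as judge" eval loop with a three-option
# (yes/no/maybe) verdict. All names here are illustrative assumptions.

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Criterion: {criterion}
Reply with exactly one word: yes, no, or maybe."""

def call_llm(prompt: str) -> str:
    # Stub: replace with a real model call (e.g. an API client) in practice.
    return "yes"

def judge(question: str, answer: str, criterion: str) -> str:
    """Ask the judge model for a verdict. A three-option scale keeps
    verdicts consistent across calls, unlike a 1-10 rating; anything
    unparseable falls back to 'maybe'."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, answer=answer, criterion=criterion))
    verdict = raw.strip().lower()
    return verdict if verdict in {"yes", "no", "maybe"} else "maybe"

def run_evals(dataset, criterion):
    """Score (question, answer) pairs; report the pass rate, counting
    only unambiguous 'yes' verdicts as passes."""
    verdicts = [judge(q, a, criterion) for q, a in dataset]
    return verdicts.count("yes") / len(verdicts), verdicts

rate, verdicts = run_evals(
    [("What is 2+2?", "4")],
    criterion="The answer is factually correct.")
print(rate)
```

In a real pipeline the dataset would mix manually tagged examples from domain experts with synthetic examples expanded from them, and the pass rate would be tracked across iterations and in production.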

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
