
Building impactful AI products for design and product leaders, Part 2: Evals are your moat
Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speaker: Peter van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for the ongoing evaluation of quality. Without evals, you don’t know whether your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it is hard to tell whether you are creating real value for your users, and when something goes wrong, it is tricky to figure out why. Simply Put’s Peter van Dijck demystifies evals and shares a simple framework for planning and building useful evals, from qualitative user research to automated evals that use an LLM as a judge.

Key Insights

  • AI product development involves three layers (model capabilities, context management, and user experience), with evals central to assuring quality at the experience layer.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
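The LLM-as-a-judge pattern and the coarse three-option verdict from the insights above can be sketched as a small eval loop. This is a minimal, hedged illustration, not the speaker's implementation: `call_llm_judge` is a hypothetical stand-in for a real LLM API call (here it is stubbed so the sketch runs), and the prompt wording is an assumption.

```python
# Sketch of an automated eval using an LLM as a judge with a
# three-option (yes/no/maybe) verdict, which stays more consistent
# across separate LLM calls than a fine-grained 1-10 scale.

JUDGE_PROMPT = (
    "You are grading an AI assistant's answer.\n"
    "Reply with exactly one word: yes (good), no (bad), or maybe (unsure).\n\n"
    "Question: {question}\nAnswer: {answer}\n"
)

def call_llm_judge(prompt: str) -> str:
    # Hypothetical stub: a real implementation would send `prompt`
    # to your LLM provider and return its completion. Here the judge
    # always answers "maybe" so the sketch is runnable offline.
    return "maybe"

def judge(question: str, answer: str) -> str:
    """Classify one (question, answer) pair as yes/no/maybe."""
    raw = call_llm_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    verdict = raw.strip().lower()
    # Coerce anything unexpected to "maybe" so the scorer never crashes
    # on a malformed judge response.
    return verdict if verdict in {"yes", "no", "maybe"} else "maybe"

def score(dataset: list[tuple[str, str]]) -> float:
    """Fraction of eval examples the judge marks as 'yes'."""
    verdicts = [judge(q, a) for q, a in dataset]
    return verdicts.count("yes") / len(verdicts)
```

Classification being easier than generation is exactly why this works: the judge never has to produce a good answer, only to recognize one.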

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
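The synthetic-data idea quoted above (expanding a few hand-written, expert-tagged seed examples into a broader eval set) can be sketched as follows. `generate_variants` is a hypothetical placeholder: a real system would prompt an LLM to paraphrase each seed, while here simple templates stand in so the sketch runs.

```python
# Sketch of synthetic-data expansion for an eval dataset: start from
# manually created seed examples and generate variations of each.

def generate_variants(seed: str, n: int = 3) -> list[str]:
    # Placeholder for an LLM call such as:
    #   "Write {n} realistic variations of this user message: {seed}"
    templates = ["{s}", "Hi, {s}", "{s} Thanks!"]
    return [templates[i % len(templates)].format(s=seed) for i in range(n)]

def expand_dataset(seeds: list[str], n_per_seed: int = 3) -> list[str]:
    """Expand expert-tagged seeds into a larger synthetic eval set."""
    expanded = []
    for seed in seeds:
        # Each variant inherits its seed's tags and quality criteria,
        # which is why expanding existing examples is cheaper than
        # authoring entirely new data.
        expanded.extend(generate_variants(seed, n_per_seed))
    return expanded
```

Domain experts still need to spot-check the expanded set; synthetic data broadens coverage, but the definition of “good” comes from the seeds they tagged.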

