Video: Building impactful AI products for design and product leaders, Part 2: Evals are your moat

Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Peter Van Dijck

Founding Partner and CEO, The Helpful Intelligence Company

Summary

The secret ingredient for impactful AI products is “evals”—an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good. You don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to figure out if you are creating real value for your users, and when something goes wrong, it’s really tricky to figure out why. Simply Put’s Peter van Dijck will demystify evals, and share a simple framework for planning for and building useful evals, from qualitative user research to automated evals using LLMs as a judge.

Key Insights

• AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.
• Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.
• LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.
• Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.
• A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.
• Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.
• Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.
• Building effective evals requires substantial effort; expect 20-40% of project resources devoted to this work.
• Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.
• AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
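
To make the LLM-as-a-judge and three-option ideas above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the speaker's implementation: the prompt wording, the function names (`parse_verdict`, `run_eval`) and the `fake_judge` stand-in are all hypothetical, and the judge is passed in as a plain callable so that any real model client could be substituted.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# The three allowed verdicts: a coarse scale rather than a 1-10 rating.
VERDICTS = ("yes", "no", "maybe")

# Hypothetical prompt template; a real one would encode your own definition of "good".
JUDGE_PROMPT = (
    "You are grading an AI assistant's answer.\n"
    "Criterion: {criterion}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Does the answer meet the criterion? Reply with exactly one word: yes, no, or maybe."
)


def parse_verdict(raw: str) -> str:
    """Map the judge's raw reply onto one of the three verdicts.

    Anything that is not clearly 'yes' or 'no' becomes 'maybe', so an
    off-format reply is flagged for review instead of counted as a pass.
    """
    words = raw.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    return first if first in VERDICTS else "maybe"


def run_eval(
    cases: Iterable[Tuple[str, str]],
    criterion: str,
    judge: Callable[[str], str],
) -> Counter:
    """Judge every (question, answer) pair and tally the verdicts."""
    tally = Counter({v: 0 for v in VERDICTS})
    for question, answer in cases:
        prompt = JUDGE_PROMPT.format(
            criterion=criterion, question=question, answer=answer
        )
        tally[parse_verdict(judge(prompt))] += 1
    return tally


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline (assumption)."""
    return "Yes." if "30 days" in prompt else "no"


if __name__ == "__main__":
    cases = [
        ("Can I return this?", "You can get a refund within 30 days."),
        ("Can I return this?", "Please contact support."),
    ]
    print(run_eval(cases, "The answer states the refund window.", fake_judge))
```

Treating unparseable replies as "maybe" is a design choice of this sketch: it routes ambiguous cases to a human reviewer rather than silently passing or failing them.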

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."

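Two of the quotes above touch on checks that can be written down directly: whether an automated judge agrees with expert-tagged labels, and whether it stays consistent across repeated calls. The Python sketch below shows one simple way to measure both. The talk summary does not specify these metrics, so the function names and the choice of plain agreement rate are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def agreement_rate(expert_labels: Sequence[str], judge_labels: Sequence[str]) -> float:
    """Fraction of cases where the judge's verdict matches the expert's tag."""
    if len(expert_labels) != len(judge_labels):
        raise ValueError("expert and judge label lists must be the same length")
    if not expert_labels:
        return 0.0
    matches = sum(e == j for e, j in zip(expert_labels, judge_labels))
    return matches / len(expert_labels)


def disagreements(
    case_ids: Sequence[str],
    expert_labels: Sequence[str],
    judge_labels: Sequence[str],
) -> List[Tuple[str, str, str]]:
    """List (case_id, expert_label, judge_label) for every mismatch, for manual review."""
    return [
        (cid, e, j)
        for cid, e, j in zip(case_ids, expert_labels, judge_labels)
        if e != j
    ]


def self_consistency(repeated_verdicts: Sequence[str]) -> float:
    """Share of repeated judge calls on one case that match the most common verdict."""
    if not repeated_verdicts:
        return 0.0
    most_common_count = Counter(repeated_verdicts).most_common(1)[0][1]
    return most_common_count / len(repeated_verdicts)


if __name__ == "__main__":
    ids = ["c1", "c2", "c3", "c4"]
    expert = ["yes", "no", "no", "maybe"]   # tags from a domain expert (made-up data)
    judge = ["yes", "no", "yes", "maybe"]   # verdicts from the automated judge (made-up data)
    print(agreement_rate(expert, judge))
    print(disagreements(ids, expert, judge))
    print(self_consistency(["yes", "yes", "maybe", "yes"]))
```

The list of disagreements is usually the more useful output: each mismatch is either a gap in the judge prompt or a sign that the definition of "good" needs refining.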