This video is featured in the Evals + Claude playlist.
Summary
If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this talk, Peter Van Dijck of Helpful Intelligence walks you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. The talk is hands-on: you can follow along, and there is plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.
Key Insights
• Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness (see the sketch after this list).
• Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.
• UX and product teams can and should learn evals as a practical, non-technical skill.
• Creating your own golden dataset is essential and cannot be outsourced or fully automated.
• Models are fixed once trained; improvements happen by refining prompts and context design, not by retraining the model.
• Evaluations measure task performance, not the underlying model itself, allowing comparison across models.
• Asking a model to output a confidence score is unreliable because the model has no internal memory and interprets the scale inconsistently.
• Biases are baked into models during training via the evals used in post-training refinement.
• LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.
• Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
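The first insight above names the three building blocks the talk walks through: a task, a golden dataset, and an evaluator. As a minimal sketch of how those pieces fit together (not code from the talk; the classification task, the example data, and the exact-match scoring are all illustrative assumptions):

```python
# Minimal eval sketch: a task, a golden dataset, and an exact-match evaluator.
# The golden examples and the run_task stub are illustrative placeholders.

golden_dataset = [
    {"input": "Refund for order #123, item arrived broken", "expected": "refund"},
    {"input": "Where is my package? It shipped last week", "expected": "shipping_status"},
    {"input": "How do I change my billing address?", "expected": "account_update"},
]

def run_task(text: str) -> str:
    """The task under test: classify a support message.

    In a real eval this would call your model with your prompt;
    here it is a stub so the example stays self-contained.
    """
    lowered = text.lower()
    if "refund" in lowered:
        return "refund"
    if "package" in lowered or "shipped" in lowered:
        return "shipping_status"
    return "account_update"

def evaluate(dataset) -> float:
    """Exact-match evaluator: fraction of outputs that equal the golden answer."""
    correct = sum(run_task(row["input"]) == row["expected"] for row in dataset)
    return correct / len(dataset)

if __name__ == "__main__":
    print(f"Accuracy: {evaluate(golden_dataset):.0%}")
```

Swapping the run_task stub for a real model call, and exact match for a more forgiving evaluator, is where the actual eval work the talk describes begins.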
Notable Quotes
"Evals are like a way to define what good looks like."
"The model was baked and once it’s baked, it does not learn again until they bake a new one."
"You need to be looking at the data. Nobody wants to, but that’s core work."
"Without a golden dataset, you have to build the golden dataset yourself."
"We’re not teaching the model anything; we’re improving our prompts and context."
"Confidence scores from the model are not a good idea because the model has no memory."
"Biases are baked in through the evals used during model training and post-training."
"LLMs judging other LLMs might sound crazy, but if you do it right, it works."
"Evals are a product and UX skill; learning them lets you make these systems do what you want."
"There is a large and growing capability overhang in these models we haven’t discovered yet."