Rosenverse

Hands-on AI #1: Let’s write your first AI eval

Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals, and a great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company walks you through writing your first eval. You will learn the basic concepts and tools, then write an eval together. The talk is hands-on: you can follow along, and there will be plenty of time for questions. You will come away understanding the basic building blocks of AI evals and confident that you know how to write one. More importantly, you’ll build some intuition, some product sense, for how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable: the model has no internal memory and interprets the scale inconsistently.

  • Biases are baked into models through the evals used during training and post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
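
The three building blocks named in the first insight (a task, a golden dataset, and an evaluator) can be sketched in a few lines of Python. This is a minimal illustration, not the code written in the talk: the sentiment task, the four hand-labeled reviews, the two prompts, and the `fake_model` stand-in are all assumptions made up for this example. In practice you would replace `fake_model` with a call to a real model and `exact_match` with whatever evaluator suits the task, including an LLM acting as judge for non-binary answers.

```python
from typing import Callable

# Golden dataset: inputs paired with known-correct outputs.
# These four reviews are hypothetical and hand-labeled for illustration.
GOLDEN_DATASET = [
    {"input": "I love this app, it saves me hours.", "expected": "positive"},
    {"input": "The checkout keeps crashing.", "expected": "negative"},
    {"input": "Great support team, very helpful.", "expected": "positive"},
    {"input": "I want a refund, this is terrible.", "expected": "negative"},
]

# Two candidate prompts for the same task (both hypothetical).
PROMPT_A = "What is the sentiment of this review?\n\n{review}"
PROMPT_B = (
    "Classify the sentiment of this review. "
    "Answer with one word, positive or negative.\n\n{review}"
)


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: swap in your own client).

    It answers in a full sentence unless the prompt asks for one word,
    which mimics a model that is right but not in the expected format.
    """
    text = prompt.lower()
    negative_words = ("crash", "refund", "terrible")
    label = "negative" if any(w in text for w in negative_words) else "positive"
    if "one word" in text:
        return label
    return f"I think the sentiment is {label}."


def exact_match(output: str, expected: str) -> bool:
    """Evaluator: correct only if the output is exactly the expected label."""
    return output.strip().lower() == expected


def run_eval(
    prompt_template: str,
    model: Callable[[str], str],
    evaluator: Callable[[str, str], bool],
) -> float:
    """Run the task over the golden dataset and return the fraction correct."""
    correct = 0
    for example in GOLDEN_DATASET:
        output = model(prompt_template.format(review=example["input"]))
        if evaluator(output, example["expected"]):
            correct += 1
    return correct / len(GOLDEN_DATASET)


if __name__ == "__main__":
    for name, prompt in [("A", PROMPT_A), ("B", PROMPT_B)]:
        score = run_eval(prompt, fake_model, exact_match)
        print(f"Prompt {name}: {score:.0%} correct")
```

The model stays the same across both runs; only the prompt changes. With this stand-in, prompt A scores 0% because the answers arrive as full sentences, while prompt B scores 100%. That mirrors the point that the eval measures task performance, and that improvement comes from refining prompts and context, not from retraining the model.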

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."
