Hands-on AI #1: Let’s write your first AI eval
Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company will walk you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. The talk is hands-on: you can follow along, and there will be plenty of time for questions. You will come away with an understanding of the basic building blocks of AI evals and the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable: the model has no memory across calls and does not interpret the scale consistently.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
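
The three building blocks named in the first insight (a task, a golden dataset, and an evaluator) can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the talk. The dataset rows, the function names, and the keyword-based "task" are all hypothetical stand-ins. In a real eval the task would wrap a prompt plus a model call, and the golden dataset would be built by hand from your own product's data.

```python
from typing import Callable

# Hypothetical golden dataset: inputs paired with known-correct outputs.
GOLDEN_DATASET = [
    {"input": "I love this product, it works great!", "expected": "positive"},
    {"input": "Terrible experience, it broke after a day.", "expected": "negative"},
    {"input": "Great support team and great value.", "expected": "positive"},
    {"input": "I hate how slow it is.", "expected": "negative"},
]


def keyword_task(text: str) -> str:
    """Stand-in for 'prompt + model call' (assumption: no real model is used here)."""
    lowered = text.lower()
    if any(word in lowered for word in ("love", "great")):
        return "positive"
    return "negative"


def always_positive_task(text: str) -> str:
    """A deliberately weak second variant, used only for comparison."""
    return "positive"


def exact_match(output: str, expected: str) -> bool:
    """Evaluator: correct only if the output equals the golden label."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    task: Callable[[str], str],
    dataset: list[dict],
    evaluator: Callable[[str, str], bool],
) -> float:
    """Run the task on every golden row and return the fraction judged correct."""
    passed = sum(evaluator(task(row["input"]), row["expected"]) for row in dataset)
    return passed / len(dataset)


if __name__ == "__main__":
    for name, task in [("keyword", keyword_task), ("always_positive", always_positive_task)]:
        print(name, run_eval(task, GOLDEN_DATASET, exact_match))
```

Because the eval scores the task rather than the model, swapping `keyword_task` for a function that calls a different model or uses a revised prompt gives a directly comparable number on the same golden dataset. For tasks without a single correct answer, the `exact_match` evaluator could be replaced by a function that asks a second LLM to judge the output.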

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."
