Rosenverse

Hands-on AI #1: Let’s write your first AI eval

Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company will walk you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. This talk is hands-on: you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable: the model has no memory across calls and does not interpret the scale consistently.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
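
To make the three building blocks from the first insight concrete, here is a minimal sketch in Python. It is not the code written in the talk. The golden dataset, the sentiment-labeling task, and every function name are hypothetical, and `stub_task` is a keyword rule standing in for a real prompt sent to a model so that the example runs on its own.

```python
from typing import Callable

# Golden dataset (hypothetical): inputs paired with known-correct outputs.
GOLDEN_DATASET = [
    {"input": "I love this product, it works great.", "expected": "positive"},
    {"input": "Terrible experience, it broke in a day.", "expected": "negative"},
    {"input": "Great support and a great price.", "expected": "positive"},
    {"input": "The update is awful and slow.", "expected": "negative"},
    {"input": "Not great, would not recommend.", "expected": "negative"},
]


def stub_task(text: str) -> str:
    """Task: stand-in for a prompt sent to a model (assumption, not a real API)."""
    lowered = text.lower()
    return "positive" if ("love" in lowered or "great" in lowered) else "negative"


def exact_match(output: str, expected: str) -> bool:
    """Evaluator: the output is correct if it matches the known answer."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    task: Callable[[str], str],
    dataset: list[dict[str, str]],
    evaluator: Callable[[str, str], bool],
) -> float:
    """Run the task on every golden example and return the fraction judged correct."""
    passed = sum(evaluator(task(row["input"]), row["expected"]) for row in dataset)
    return passed / len(dataset)


score = run_eval(stub_task, GOLDEN_DATASET, exact_match)
print(f"accuracy: {score:.2f}")
```

The stub misses the last example ("Not great..."), so the score comes out below 1.0. In a real setup, that failing row is the signal to revise the prompt and rerun the same eval. Because `run_eval` takes the task and the evaluator as arguments, the same dataset could be reused to compare prompts or models, or to swap the exact-match check for an LLM-based judge when answers are not binary.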

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."
