Rosenverse

Hands-on AI #1: Let’s write your first AI eval
Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck
Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company will walk you through writing your first eval. You will learn the basic concepts and tools, and write an eval together. This talk is hands-on: you can follow along, and there will be plenty of time for questions. You will come away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, for how the best AI products today are built, and for how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable: the model has no internal memory and interprets numeric scales inconsistently.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
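The first insight above describes the three building blocks of an eval: a task, a golden dataset with known-correct outputs, and an evaluator that measures correctness. A minimal sketch of that structure in Python, assuming an exact-match evaluator and a stub model (all names and the example questions here are illustrative, not from the talk):

```python
# Minimal eval sketch: task + golden dataset + evaluator.
# A real setup would call an actual LLM; here a stub stands in.

def task(model, question: str) -> str:
    """The task: send the question to the model (any callable prompt -> answer)."""
    return model(f"Answer concisely: {question}")

# Golden dataset: inputs paired with known-correct outputs.
GOLDEN = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

def evaluator(expected: str, actual: str) -> bool:
    """Exact-match check; non-binary answers would need an LLM-as-judge instead."""
    return expected.strip().lower() == actual.strip().lower()

def run_eval(model) -> float:
    """Run every golden example through the task and return the fraction correct."""
    correct = sum(
        evaluator(expected, task(model, question))
        for question, expected in GOLDEN
    )
    return correct / len(GOLDEN)

# Stub "model" so the sketch runs without an API key.
def stub_model(prompt: str) -> str:
    return "Paris" if "France" in prompt else "4"

print(run_eval(stub_model))
```

Swapping `stub_model` for a real model call is the only change needed to compare models on the same task, which is why (per the insights above) evals measure task performance rather than the model itself.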

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."

