Rosenverse

Hands-on AI #1: Let’s write your first AI eval
Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company will walk you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. This talk is hands-on; you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable: the model has no memory across calls and does not interpret the scale consistently.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
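
The three building blocks named in the first insight (a task, a golden dataset, and an evaluator) can be sketched in a few lines of Python. This is a minimal illustration, not the code written in the talk: the sentiment task, the dataset rows, the prompt wording, and every function name here are assumptions made for the example, and `stub_model` stands in for a real model call so the sketch runs without any API.

```python
from typing import Callable

# Hypothetical golden dataset: inputs paired with known correct outputs.
GOLDEN_DATASET = [
    {"input": "I love this product, it works great.", "expected": "positive"},
    {"input": "Terrible experience, I want a refund.", "expected": "negative"},
    {"input": "The package arrived on Tuesday.", "expected": "neutral"},
]

# Hypothetical prompt; in practice this is the part you iterate on.
PROMPT_TEMPLATE = (
    "Classify the sentiment of this message as positive, negative, or neutral:\n"
    "{input}"
)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: keyword rules, no API)."""
    text = prompt.lower()
    if "love" in text or "great" in text:
        return "positive"
    if "terrible" in text or "refund" in text:
        return "negative"
    return "neutral"


def run_task(model: Callable[[str], str], template: str, text: str) -> str:
    """The task: fill the prompt with one input and normalize the model output."""
    return model(template.format(input=text)).strip().lower()


def exact_match(output: str, expected: str) -> bool:
    """The evaluator: here, a simple check against the known correct answer."""
    return output == expected


def run_eval(
    model: Callable[[str], str],
    template: str,
    dataset: list[dict[str, str]],
    evaluator: Callable[[str, str], bool],
) -> float:
    """Run the task on every golden row and return the fraction that pass."""
    passed = sum(
        evaluator(run_task(model, template, row["input"]), row["expected"])
        for row in dataset
    )
    return passed / len(dataset)


score = run_eval(stub_model, PROMPT_TEMPLATE, GOLDEN_DATASET, exact_match)
print(f"accuracy: {score:.2f}")  # prints "accuracy: 1.00" for the stub model
```

Because the score is attached to the task rather than to one model, the same `run_eval` call can be repeated with a different model function or a revised prompt template to compare them. For tasks without a single correct answer, the `evaluator` argument could be swapped for a function that asks a second model to grade the output, which is the LLM-as-judge idea mentioned above.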

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."
