Rosenverse

Hands-on AI #1: Let’s write your first AI eval
Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals, and a great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company walks you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. The talk is hands-on: you can follow along, and there is plenty of time for questions. You will come away understanding the basic building blocks of AI evals and confident that you know how to write one. More importantly, you’ll build some intuition, some product sense, for how the best AI products today are built, and for how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable: the model has no internal memory across calls, so it interprets the scale inconsistently.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
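
The three building blocks named in the first insight (a task, a golden dataset, and an evaluator) can be sketched in a few lines of Python. This is a minimal illustration, not the code written in the talk: the sentiment task, the dataset rows, and every function name here are assumptions made up for the example, and `make_task` is a keyword-matching stand-in for what would really be a prompt plus a model call. Running two "prompt versions" against the same golden dataset shows how an eval measures task performance rather than the model itself.

```python
from typing import Callable

# Hypothetical golden dataset: inputs paired with known-correct outputs.
# In practice you build and label this yourself by looking at real data.
GOLDEN_DATASET = [
    {"input": "I love this product, it works great!", "expected": "positive"},
    {"input": "Terrible. It broke after one day.", "expected": "negative"},
    {"input": "Great value and fast shipping.", "expected": "positive"},
    {"input": "I want a refund, this is awful.", "expected": "negative"},
]


def make_task(positive_words: set[str]) -> Callable[[str], str]:
    """Return a stand-in task. A real task would send a prompt to an LLM."""
    def task(text: str) -> str:
        words = {w.strip(".,!?").lower() for w in text.split()}
        return "positive" if words & positive_words else "negative"
    return task


def exact_match(output: str, expected: str) -> bool:
    """Evaluator: correct only if the output equals the golden label."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(task, dataset, evaluator) -> float:
    """Run the task on every golden row and return the fraction judged correct."""
    passed = sum(evaluator(task(row["input"]), row["expected"]) for row in dataset)
    return passed / len(dataset)


# Two hypothetical "prompt versions" compared on the same golden dataset.
task_v1 = make_task({"love"})
task_v2 = make_task({"love", "great"})

for name, task in [("v1", task_v1), ("v2", task_v2)]:
    print(f"{name} accuracy: {run_eval(task, GOLDEN_DATASET, exact_match):.0%}")
```

Because the dataset and evaluator stay fixed, the only thing that changes between runs is the task, which is what makes the two scores comparable. The same loop could compare two models instead of two prompts.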

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."
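
The idea of one LLM judging another's output, mentioned in the quotes above, can be sketched roughly as follows. This is an assumption-laden illustration rather than the approach shown in the talk: the judge prompt, the refund-policy criterion, and `stub_judge_model` (a keyword check standing in for a second model call) are all hypothetical.

```python
from typing import Callable

# Hypothetical judge prompt: one narrow question with a PASS/FAIL answer.
JUDGE_PROMPT = """You are grading a summary.
Question: does the summary mention the refund policy? Answer PASS or FAIL.

Summary:
{output}
"""


def stub_judge_model(prompt: str) -> str:
    """Stand-in for a call to a judge LLM; a real judge would read the prompt."""
    graded_text = prompt.split("Summary:", 1)[1]
    return "PASS" if "refund" in graded_text.lower() else "FAIL"


def judge(output: str, judge_model: Callable[[str], str] = stub_judge_model) -> bool:
    """Evaluator: ask the judge model for a verdict and turn it into a boolean."""
    verdict = judge_model(JUDGE_PROMPT.format(output=output))
    return verdict.strip().upper() == "PASS"


# Hypothetical outputs from the task being evaluated.
outputs = [
    "Customers can get a refund within 30 days.",
    "The store opens at 9am.",
]
pass_rate = sum(judge(o) for o in outputs) / len(outputs)
print(f"pass rate: {pass_rate:.0%}")
```

The sketch asks the judge for a PASS/FAIL verdict on one specific question instead of a numeric score, which seems consistent with the caution about confidence scores above. A judge like this would itself need checking against human-labeled examples before its verdicts are trusted.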
