Hands-on AI #1: Let’s write your first AI eval
Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company will walk you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. This talk is hands-on; you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Outputting a confidence score from models is unreliable due to lack of internal memory and inconsistent scale interpretation.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
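
The three building blocks named in the first insight (a task, a golden dataset, and an evaluator) can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the talk: the sentiment task, the dataset rows, and the names `keyword_model`, `always_positive_model`, `exact_match`, and `run_eval` are all hypothetical, and the "models" are plain stub functions standing in for real model calls. It also shows why an eval measures task performance rather than the model itself: the same dataset and evaluator can score any function that attempts the task.

```python
from typing import Callable

# Hypothetical golden dataset: inputs paired with known correct outputs.
# In practice you would build this yourself by looking at real data.
GOLDEN_DATASET = [
    {"input": "I love this app, it saves me hours.", "expected": "positive"},
    {"input": "The checkout keeps crashing.", "expected": "negative"},
    {"input": "Great support, fast replies.", "expected": "positive"},
    {"input": "I want a refund, this is terrible.", "expected": "negative"},
]


def keyword_model(text: str) -> str:
    """Stub standing in for a real model call (assumption: no API is used)."""
    negative_words = ("crash", "refund", "terrible")
    lowered = text.lower()
    return "negative" if any(word in lowered for word in negative_words) else "positive"


def always_positive_model(text: str) -> str:
    """A second stub, to show the same eval comparing two candidates."""
    return "positive"


def exact_match(output: str, expected: str) -> bool:
    """Evaluator: the output is correct only if it equals the known answer."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    task: Callable[[str], str],
    dataset: list[dict[str, str]],
    evaluator: Callable[[str, str], bool],
) -> float:
    """Run the task on every golden example and return the fraction passed."""
    passed = sum(evaluator(task(row["input"]), row["expected"]) for row in dataset)
    return passed / len(dataset)


if __name__ == "__main__":
    for name, model in [("keyword", keyword_model), ("always_positive", always_positive_model)]:
        score = run_eval(model, GOLDEN_DATASET, exact_match)
        print(f"{name}: {score:.0%} correct")
```

Swapping a stub for a function that sends a prompt to a real model would leave the dataset and evaluator untouched, which is what makes it possible to compare prompts or models against the same definition of "good".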

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."
