Rosenverse

Hands-on AI #1: Let’s write your first AI eval
Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company walks you through writing your first eval. You will learn the basic concepts and tools, then write an eval together. The talk is hands-on: you can follow along, and there will be plenty of time for questions. You will come away understanding the basic building blocks of AI evals and confident that you can write one. More importantly, you will build some intuition, some product sense, for how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable: the model has no memory across calls, so it interprets the scale inconsistently.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
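
The three building blocks named in the first insight (a task, a golden dataset, and an evaluator) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the talk: the sentiment task, the dataset rows, and the names `task_v1`, `task_v2`, `exact_match`, and `run_eval` are all hypothetical. The two task functions are simple keyword rules standing in for "prompt version + model"; a real eval would call an actual model there.

```python
# Golden dataset: inputs paired with the outputs we have decided are correct.
# (Hypothetical rows, written by hand for illustration.)
GOLDEN_DATASET = [
    {"input": "I love this app", "expected": "positive"},
    {"input": "This is terrible", "expected": "negative"},
    {"input": "Great support, thanks!", "expected": "positive"},
    {"input": "I hate waiting", "expected": "negative"},
]


def task_v1(text: str) -> str:
    """Stand-in for 'prompt v1 + model': a crude rule that knows one word."""
    return "positive" if "love" in text.lower() else "negative"


def task_v2(text: str) -> str:
    """Stand-in for 'prompt v2 + model': a rule that knows a few more words."""
    positive_words = ("love", "great", "thanks")
    lowered = text.lower()
    return "positive" if any(w in lowered for w in positive_words) else "negative"


def exact_match(output: str, expected: str) -> bool:
    """Evaluator: decides whether a single output counts as correct."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(task, dataset, evaluator) -> float:
    """Run the task on every golden example; return the fraction judged correct."""
    passed = sum(evaluator(task(row["input"]), row["expected"]) for row in dataset)
    return passed / len(dataset)


if __name__ == "__main__":
    print("v1 accuracy:", run_eval(task_v1, GOLDEN_DATASET, exact_match))
    print("v2 accuracy:", run_eval(task_v2, GOLDEN_DATASET, exact_match))
```

Because the dataset and the evaluator stay fixed while only the task changes, the two scores are directly comparable. The same holds when the thing being varied is the prompt or the underlying model.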

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."
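
The idea of one LLM judging another's output can be sketched as follows. This is an illustrative assumption about how such a judge might be wired up, not the method shown in the talk: the prompt wording, the function names, and `fake_judge_model` are hypothetical, and the fake model is a keyword rule standing in for a real LLM call.

```python
from typing import Callable

# Hypothetical judge prompt; real rubrics are usually more detailed.
JUDGE_TEMPLATE = """You are grading an answer.
Question: {question}
Answer: {answer}
Rubric: {rubric}
Reply with exactly one word: PASS or FAIL."""


def build_judge_prompt(question: str, answer: str, rubric: str) -> str:
    """Fill the template with the item to be graded."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer, rubric=rubric)


def parse_verdict(reply: str) -> bool:
    """Map the judge's reply to True/False; reject anything unexpected."""
    verdict = reply.strip().upper()
    if verdict not in ("PASS", "FAIL"):
        raise ValueError(f"Unexpected judge reply: {reply!r}")
    return verdict == "PASS"


def judge(question: str, answer: str, rubric: str,
          call_model: Callable[[str], str]) -> bool:
    """Ask the judge model for a verdict on one answer."""
    return parse_verdict(call_model(build_judge_prompt(question, answer, rubric)))


def fake_judge_model(prompt: str) -> str:
    """Stand-in for a real judge LLM: passes answers that mention a refund."""
    answer_line = next(
        line for line in prompt.splitlines() if line.startswith("Answer:")
    )
    return "PASS" if "refund" in answer_line.lower() else "FAIL"


if __name__ == "__main__":
    question = "Can I get my money back?"
    rubric = "The answer must mention the refund policy."
    print(judge(question, "Yes, we offer a full refund within 30 days.",
                rubric, fake_judge_model))
    print(judge(question, "Please contact support.", rubric, fake_judge_model))
```

The sketch asks for a binary PASS or FAIL rather than a numeric score. That is a design choice made here, in line with the caution above about confidence scores being interpreted inconsistently.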
