Rosenverse


Hands-on AI #1: Let’s write your first AI eval
Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck
Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the Helpful Intelligence company will walk you through writing your first eval. You will learn the basic concepts and tools, and write an eval together. This talk is hands-on; you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable, because models lack internal memory and interpret numeric scales inconsistently.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
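
As a minimal sketch of the building blocks named above (all names and data here are illustrative, not from the talk), an eval pairs a task with a golden dataset of known-correct outputs and an evaluator that scores each result:

```python
# Minimal eval sketch. The task function stands in for a call to an LLM;
# here it is a trivial canned-answer stub so the example runs on its own.

def task(question: str) -> str:
    """Hypothetical stand-in for a model call."""
    canned = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "4",
    }
    return canned.get(question, "I don't know")

# Golden dataset: inputs paired with known-correct outputs.
# Building one yourself is the part that cannot be outsourced.
golden_dataset = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
    ("Who wrote Hamlet?", "Shakespeare"),
]

def evaluator(expected: str, actual: str) -> bool:
    """Exact-match scoring; the simplest possible evaluator."""
    return expected.strip().lower() == actual.strip().lower()

def run_eval() -> float:
    """Run the task over the golden dataset and return the accuracy."""
    results = [evaluator(expected, task(q)) for q, expected in golden_dataset]
    return sum(results) / len(results)

if __name__ == "__main__":
    print(f"Accuracy: {run_eval():.0%}")
```

Exact matching only works for tasks with a single correct answer; for open-ended outputs, the LLM-as-judge approach mentioned above replaces the `evaluator` function with a second model that scores the output.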

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."

