Building impactful AI products for design and product leaders, Part 2: Evals are your moat
This video is featured in the AI and UX playlist.
Summary
The secret ingredient for impactful AI products is “evals”—an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good. You don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to figure out if you are creating real value for your users, and when something goes wrong, it’s really tricky to figure out why. Simply Put’s Peter van Dijck will demystify evals, and share a simple framework for planning for and building useful evals, from qualitative user research to automated evals using LLMs as a judge.
Key Insights
• AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.
• Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.
• LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.
• Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.
• A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.
• Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.
• Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.
• Building effective evals requires substantial effort; expect 20-40% of project resources devoted to this work.
• Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.
• AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
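The LLM-as-a-judge idea with a three-option verdict can be sketched roughly as follows. This is a minimal illustration, not the framework presented in the talk: the prompt wording, the function names, and the politeness criterion are all assumptions, and `stub_judge` is a hypothetical stand-in for a real model call so that the example runs without any external service.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Three allowed verdicts; a small label set is easier to score consistently
# than a fine-grained scale such as 1-10.
VERDICTS = ("yes", "no", "maybe")
ANSWER_PREFIX = "Assistant answer: "


def build_judge_prompt(criterion: str, question: str, answer: str) -> str:
    """Assemble a classification prompt for the judge (hypothetical wording)."""
    return (
        f"Criterion: {criterion}\n"
        f"User question: {question}\n"
        f"{ANSWER_PREFIX}{answer}\n"
        "Does the answer meet the criterion? "
        "Reply with exactly one word: yes, no, or maybe."
    )


def parse_verdict(raw: str) -> str:
    """Normalize the judge's reply; anything unexpected counts as 'maybe'."""
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in VERDICTS else "maybe"


def run_eval(
    cases: Iterable[Tuple[str, str]],
    criterion: str,
    judge: Callable[[str], str],
) -> Counter:
    """Judge every (question, answer) pair and tally the verdicts."""
    counts = Counter({verdict: 0 for verdict in VERDICTS})
    for question, answer in cases:
        reply = judge(build_judge_prompt(criterion, question, answer))
        counts[parse_verdict(reply)] += 1
    return counts


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, using a crude keyword rule."""
    line = next(l for l in prompt.splitlines() if l.startswith(ANSWER_PREFIX))
    answer = line[len(ANSWER_PREFIX):].lower()
    if not answer.strip():
        return "I am not sure"  # unparseable reply, mapped to 'maybe'
    return "Yes" if ("please" in answer or "thank" in answer) else "No."


cases = [
    ("Where is my order?", "Thanks for asking! It ships tomorrow."),
    ("Can I get a refund?", "No."),
    ("Help?", ""),
]
counts = run_eval(cases, "The answer is polite.", stub_judge)
print(dict(counts))
```

In a real setup, `stub_judge` would be replaced by a call to whichever model acts as the judge, and the criterion would come from the team's own detailed definition of what good means for their system.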
Notable Quotes
"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."
"You have to build a detailed definition of what is good for my system to do meaningful automated evals."
"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."
"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."
"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."
"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."
"Evals are really your intellectual property—they define what good looks like in your domain."
"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."
"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."
"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
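One way to connect the expert-tagging point to automated evals is to check how often a judge's verdicts match labels assigned by domain experts. The sketch below assumes both label lists use the same yes/no/maybe scheme and cover the same examples in the same order; the function names and sample labels are hypothetical and are not taken from the talk.

```python
from typing import List, Sequence


def agreement_rate(expert_labels: Sequence[str], judge_labels: Sequence[str]) -> float:
    """Fraction of examples where the judge's verdict equals the expert's label."""
    if len(expert_labels) != len(judge_labels):
        raise ValueError("Both label lists must cover the same examples.")
    if not expert_labels:
        raise ValueError("At least one labeled example is required.")
    matches = sum(e == j for e, j in zip(expert_labels, judge_labels))
    return matches / len(expert_labels)


def disagreements(expert_labels: Sequence[str], judge_labels: Sequence[str]) -> List[int]:
    """Indices of examples worth reviewing because the two labels differ."""
    return [i for i, (e, j) in enumerate(zip(expert_labels, judge_labels)) if e != j]


# Hypothetical labels for four examples.
expert = ["yes", "no", "no", "maybe"]
judge = ["yes", "yes", "no", "maybe"]

rate = agreement_rate(expert, judge)
to_review = disagreements(expert, judge)
print(f"agreement: {rate:.2f}, review examples: {to_review}")
```

The disagreement list is usually the more useful output: each mismatch is either a flaw in the judge prompt or a gap in the written definition of good.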