Summary
If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the Helpful Intelligence company will show you how to create an eval for an AI product using an LLM as a judge — that is, using one Large Language Model to evaluate the output of another. We’ll look at how that works, but also dig into why it works at all. Are we creating problems for ourselves when we let an LLM judge itself? This talk is hands-on, and there will be plenty of time for questions. You will come away understanding when and how to use LLM-as-judge, and build some product sense around how the best AI products today are built — and how that can help you use them more effectively yourself.
Key Insights
- Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.
- Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.
- Binary (yes/no) scoring is more reliable than rating scales with ranges because LLMs lack internal memory and consistency.
- Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.
- High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.
- Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.
- Creating a written constitution of principles helps concretize AI behavior goals and guides prompt and model training.
- Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.
- Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.
- Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
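The LLM-as-judge pattern with binary scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: `call_llm` is a hypothetical stand-in for whatever model API you use, and the judge prompt asks for a strict yes/no answer rather than a 1–5 rating, reflecting the insight that binary judgments are more reliable.

```python
# Minimal sketch of LLM-as-judge with binary (yes/no) scoring.
# `call_llm` is a hypothetical callable (prompt -> reply string);
# swap in your real model client.

JUDGE_PROMPT = """You are evaluating an AI assistant's answer.
Question: {question}
Answer: {answer}
Does the answer fully and correctly address the question?
Reply with exactly one word: yes or no."""

def judge(question: str, answer: str, call_llm) -> bool:
    """Score one output with a binary pass/fail judgment."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return reply.strip().lower().startswith("yes")

def run_eval(cases, call_llm) -> float:
    """Return the pass rate over a list of (question, answer) cases."""
    passes = sum(judge(q, a, call_llm) for q, a in cases)
    return passes / len(cases)
```

In practice you would run `run_eval` over a fixed set of test cases every time you change a prompt or model, turning "is this better?" into a number you can track — the fast feedback loop the talk emphasizes.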
Notable Quotes
"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."
"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."
"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."
"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."
"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."
"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."
"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."
"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."
"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."
"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."