upgrade-evals
Community
Find and prioritize AI failures.
Product & Management
#llm #prioritization #evaluations #traces #labeling #error-analysis #failure-categories
Author: breethomas
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
It helps product managers and engineers discover, quantify, and prioritize real failure modes in AI features by analyzing actual pipeline traces instead of relying on intuition or predefined categories.
Core Features & Use Cases
- Trace collection guidance: sampling strategies for diverse and representative traces from production or staged systems.
- Pass/fail labeling and note-taking: structured instructions for binary judgments and concise observations that surface root causes.
- Grouping, labeling, and prioritization: iterative clustering into actionable failure categories, computing failure rates, and recommending fixes or evaluators.
- Use Case: A PM uses ~100 real user traces to surface top failure categories, then directs engineers to fix prompt issues, add validators, or build evaluators.
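The failure-rate step in the features above can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: the `LabeledTrace` record, its field names, the `failure_rates` helper, and the sample categories are all hypothetical, and the rate is assumed to mean the share of all reviewed traces that failed in a given category.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LabeledTrace:
    """Hypothetical record for one reviewed trace."""
    trace_id: str
    passed: bool                    # binary pass/fail judgment
    note: str = ""                  # short reviewer observation
    category: Optional[str] = None  # failure category assigned during grouping


def failure_rates(traces: List[LabeledTrace]) -> List[Tuple[str, int, float]]:
    """Return (category, count, rate) tuples, most frequent category first.

    The rate is the share of ALL reviewed traces that failed in that category.
    """
    total = len(traces)
    if total == 0:
        return []
    # Count only failed traces; failures without a category are bucketed together.
    counts = Counter(
        t.category or "uncategorized" for t in traces if not t.passed
    )
    return [(cat, n, n / total) for cat, n in counts.most_common()]


# Illustrative data only; a real review would use ~100 sampled traces.
traces = [
    LabeledTrace("t1", True),
    LabeledTrace("t2", False, "ignored the date range", "missed constraint"),
    LabeledTrace("t3", False, "cited a nonexistent doc", "hallucinated source"),
    LabeledTrace("t4", False, "answered for wrong region", "missed constraint"),
    LabeledTrace("t5", True),
]

for category, count, rate in failure_rates(traces):
    print(f"{category}: {count} failed ({rate:.0%} of reviewed traces)")
```

Ranking by raw frequency is only a starting point; a team would likely also weigh severity and fix cost before deciding which category to address first.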
Quick Start
Ask the skill to analyze recent production traces, label pass/fail, surface emergent failure categories, and recommend the highest-impact fixes.
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: upgrade-evals Download link: https://github.com/breethomas/bette-think/archive/main.zip#upgrade-evals Please download this .zip file, extract it, and install it in the .claude/skills/ directory.