upgrade-evals

Community

Find and prioritize AI failures.

Author: breethomas
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Helps product managers and engineers discover, quantify, and prioritize real failure modes in AI features by analyzing actual pipeline traces instead of relying on intuition or predefined categories.

Core Features & Use Cases

  • Trace collection guidance: sampling strategies for diverse and representative traces from production or staged systems.
  • Pass/fail labeling and note-taking: structured instructions for binary judgments and concise observations that surface root causes.
  • Grouping, labeling, and prioritization: iterative clustering into actionable failure categories, computing failure rates, and recommending fixes or evaluators.
  • Use Case: A PM uses ~100 real user traces to surface top failure categories, then directs engineers to fix prompt issues, add validators, or build evaluators.
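The failure-rate step above can be sketched in a few lines of Python. This is only an illustration, not the skill's own implementation: the `LabeledTrace` record, its field names, the `failure_rates` helper, and the category names in the sample data are all hypothetical. It assumes each trace already carries a binary pass/fail label and, for failures, a category assigned during grouping.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LabeledTrace:
    """Hypothetical record for one reviewed trace (not the skill's schema)."""
    trace_id: str
    passed: bool
    category: Optional[str] = None  # failure category from grouping; None for passes
    note: str = ""                  # short observation written during labeling


def failure_rates(traces: List[LabeledTrace]) -> List[Tuple[str, int, float]]:
    """Return (category, count, share of all traces), most frequent first."""
    total = len(traces)
    if total == 0:
        return []
    # Count only failing traces; failures without a category are kept visible.
    counts = Counter(t.category or "uncategorized" for t in traces if not t.passed)
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(category, n, n / total) for category, n in ranked]


# Made-up sample data, for illustration only.
sample = [
    LabeledTrace("t1", True),
    LabeledTrace("t2", False, "wrong_tool_call", "picked search instead of lookup"),
    LabeledTrace("t3", False, "hallucinated_fact", "cited a policy that does not exist"),
    LabeledTrace("t4", False, "wrong_tool_call", "called the API with a stale id"),
    LabeledTrace("t5", True),
]

for category, count, rate in failure_rates(sample):
    print(f"{category}: {count} traces ({rate:.0%} of sample)")
```

Rates are computed against all reviewed traces rather than against failures only, so each figure reads as "how often a user hits this problem". Frequency is just one input to prioritization; severity and cost of the fix still need a human judgment.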

Quick Start

Ask the skill to analyze recent production traces, label pass/fail, surface emergent failure categories, and recommend the highest-impact fixes.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: let Claude install the skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: upgrade-evals
Download link: https://github.com/breethomas/bette-think/archive/main.zip#upgrade-evals

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
