evaluation


Build robust evaluation suites and ensure AI quality.

Author: craigtkhill
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill addresses inconsistent, subjective, or incomplete evaluation of AI features, which leads to unreliable performance and makes improvements hard to track. It standardizes the creation of comprehensive evaluation suites, ensuring your AI models are rigorously tested and validated.

Core Features & Use Cases

  • Standardized Eval Structure: Provides templates for spec.md and rubric.md to ensure consistent and clear evaluation design across all features.
  • Mixed Validation Types: Guides you in using both code-based (deterministic checks) and LLM-as-judge (quality assessment) validations for comprehensive coverage; see the sketch after this list.
  • Objective Rubric Creation: Emphasizes writing concrete, objectively verifiable criteria for LLM-based evaluations, reducing subjectivity and improving reliability.
  • Use Case: When developing a new AI feature, use this Skill to create a robust evaluation suite: a detailed specification of what to test, a clear rubric for LLM-as-judge assessments, and a plan covering both code-based and LLM-based validations. The result is higher-quality AI outputs and faster iteration.
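
To make the mixed-validation idea concrete, here is a minimal sketch of how a code-based check and an LLM-as-judge check might sit side by side in an eval suite. Everything in it is a hypothetical illustration, not part of the Skill itself: the function names, the `RUBRIC` list, and the stubbed `judge_with_llm` helper are all assumptions.

```python
# Hypothetical sketch: mixing deterministic and LLM-judged checks.
# None of these names come from the Skill; they are illustrative only.
import ast


def check_is_valid_python(output: str) -> bool:
    """Code-based validation: deterministic, no LLM involved."""
    try:
        ast.parse(output)
        return True
    except SyntaxError:
        return False


# Rubric criteria phrased as concrete, objectively verifiable statements,
# each answerable with a plain yes/no by an LLM judge.
RUBRIC = [
    "Every public function in the output has a docstring.",
    "The output contains no TODO or FIXME comments.",
    "Error cases are handled explicitly rather than silently ignored.",
]


def judge_with_llm(output: str, criterion: str) -> bool:
    """LLM-as-judge validation: send the output plus one criterion to a
    model and parse a yes/no verdict. Stubbed here; wire up the LLM
    client of your choice."""
    prompt = (
        "Does the following output satisfy this criterion?\n"
        f"Criterion: {criterion}\n"
        f"Output:\n{output}\n"
        "Answer strictly YES or NO."
    )
    # verdict = my_llm_client.complete(prompt)  # assumed client, not real
    # return verdict.strip().upper().startswith("YES")
    raise NotImplementedError("Plug in an LLM client here.")


def evaluate(output: str) -> dict:
    """Run both validation types and collect per-check results."""
    results = {"valid_python": check_is_valid_python(output)}
    for criterion in RUBRIC:
        results[criterion] = judge_with_llm(output, criterion)
    return results
```

Keeping each rubric criterion to a single yes/no question is what makes the LLM-judged half of the suite reproducible across runs.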

Quick Start

Help me create a new evaluation suite for an AI feature that generates code, starting with the spec.md template and defining a few code-based and LLM-judged requirements.
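
For orientation, the spec.md produced by that prompt might look roughly like the sketch below. This is a guess at a plausible shape, assuming the conventions described above; the actual template shipped with the Skill may differ.

```markdown
# Eval Spec: Code Generation Feature

## What to test
Generated code for small, self-contained programming tasks.

## Code-based requirements (deterministic)
- Output parses as valid Python (syntax check).
- Output defines at least one function.

## LLM-judged requirements (see rubric.md)
- Every public function has a docstring.
- Variable names are descriptive, not single letters.
```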

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: evaluation
Download link: https://github.com/craigtkhill/stdd-agents/archive/main.zip#evaluation

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.