ai-evalcheck-and-golden-evals

Community

Manage AI evaluation suite and data.

Author: roaming-rockenfels
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill streamlines the management and validation of AI evaluation test cases and golden data, ensuring comprehensive coverage and accuracy of AI agent behaviors.

Core Features & Use Cases

  • Eval Suite Management: Add, update, and maintain a suite of over 46 test cases across 7 critical evaluation dimensions.
  • Golden Data Maintenance: Ensure the accuracy of golden evaluation data, updating it only when intentional behavior changes occur.
  • Eval Validation: Run evalCheck (or its offline equivalent, npm run test) to verify the health of the evaluation suite and catch regressions.
  • Coverage Analysis: Verify that all 7 evaluation dimensions have adequate test case coverage.
  • Use Case: After introducing a new tool for the AI agent, use this Skill to add corresponding test cases, update golden data if necessary, and run evalCheck to confirm the new tool integrates correctly and doesn't break existing functionality.
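
The coverage analysis described above could look roughly like the following sketch. It is not the Skill's actual implementation: the EvalCase shape, the findUnderCoveredDimensions helper, and the dimension names are all hypothetical (only "tool selection accuracy" is named in this listing), and the real suite may store its cases differently.

```typescript
// Hypothetical shape of an eval case; the real suite's schema may differ.
interface EvalCase {
  id: string;
  dimension: string;
}

// Count cases per dimension and return the dimensions that fall below
// the required minimum. Cases tagged with an unknown dimension are ignored.
function findUnderCoveredDimensions(
  cases: EvalCase[],
  dimensions: string[],
  minCases: number = 1
): string[] {
  const counts = new Map<string, number>();
  for (const dimension of dimensions) {
    counts.set(dimension, 0);
  }
  for (const evalCase of cases) {
    const current = counts.get(evalCase.dimension);
    if (current !== undefined) {
      counts.set(evalCase.dimension, current + 1);
    }
  }
  return dimensions.filter((d) => (counts.get(d) ?? 0) < minCases);
}

// Placeholder dimension names, for illustration only.
const exampleDimensions = ["tool-selection-accuracy", "placeholder-dimension-b"];
const exampleCases: EvalCase[] = [
  { id: "case-001", dimension: "tool-selection-accuracy" },
  { id: "case-002", dimension: "tool-selection-accuracy" },
];

const underCovered = findUnderCoveredDimensions(exampleCases, exampleDimensions);
console.log(underCovered); // [ 'placeholder-dimension-b' ]
```

Passing the dimension list explicitly, rather than deriving it from the cases, is what lets the check flag a dimension that has no cases at all.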

Quick Start

Use the ai-evalcheck-and-golden-evals skill to add a new eval case for the tool selection accuracy dimension.
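
As a rough illustration of what a new eval case for the tool selection accuracy dimension might involve, the sketch below compares the tool an agent chose against a golden expectation. Everything here is an assumption made for illustration: the GoldenToolCase shape, the checkToolSelection helper, and the tool names are hypothetical and are not taken from the Skill or its repository.

```typescript
// Hypothetical golden record: the tool the agent is expected to pick for a prompt.
interface GoldenToolCase {
  id: string;
  prompt: string;
  expectedTool: string;
}

interface ToolCheckResult {
  id: string;
  passed: boolean;
  expected: string;
  actual: string;
}

// Compare the agent's chosen tool with the golden expectation.
// The golden record is never modified here; updating it would be a
// separate, deliberate step taken only for intentional behavior changes.
function checkToolSelection(
  golden: GoldenToolCase,
  actualTool: string
): ToolCheckResult {
  return {
    id: golden.id,
    passed: golden.expectedTool === actualTool,
    expected: golden.expectedTool,
    actual: actualTool,
  };
}

// Illustrative data only; the tool names are placeholders.
const goldenCase: GoldenToolCase = {
  id: "tool-selection-001",
  prompt: "Example prompt that should trigger example_tool_a",
  expectedTool: "example_tool_a",
};

const matchResult = checkToolSelection(goldenCase, "example_tool_a");
const regressionResult = checkToolSelection(goldenCase, "example_tool_b");
console.log(matchResult.passed, regressionResult.passed); // true false
```

A failing result like regressionResult is the kind of signal that should prompt investigation first; the golden record would change only if the new tool choice is the intended behavior.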

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: ai-evalcheck-and-golden-evals
Download link: https://github.com/roaming-rockenfels/ghostfolio/archive/main.zip#ai-evalcheck-and-golden-evals

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.