ai-evalcheck-and-golden-evals

Name: ai-evalcheck-and-golden-evals
Author: roaming-rockenfels

Community

Manage the AI evaluation suite and its golden data.

Software Engineering #testing #validation #ai #evaluation #coverage #test cases #golden data

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

This Skill streamlines the management and validation of AI evaluation test cases and golden data, ensuring comprehensive coverage and accuracy of AI agent behaviors.

Core Features & Use Cases

Eval Suite Management: Add, update, and maintain a suite of over 46 test cases across 7 critical evaluation dimensions.
Golden Data Maintenance: Ensure the accuracy of golden evaluation data, updating it only when intentional behavior changes occur.
Eval Validation: Run evalCheck (or its offline equivalent, npm run test) to verify the health of the evaluation suite and identify regressions.
Coverage Analysis: Verify that all 7 evaluation dimensions have adequate test case coverage.
Use Case: After introducing a new tool for the AI agent, use this Skill to add corresponding test cases, update golden data if necessary, and run evalCheck to confirm the new tool integrates correctly and doesn't break existing functionality.
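
The Coverage Analysis feature above can be pictured as a simple per-dimension count. The sketch below is only an illustration under stated assumptions: the EvalCase shape, the findUnderCoveredDimensions helper, and every dimension name except tool selection accuracy are hypothetical, since the Skill's actual schema and its seven dimension names are not listed here.

```typescript
// Hypothetical shape of an eval case; the Skill's real schema is not documented here.
interface EvalCase {
  id: string;
  dimension: string;
}

// Count cases per dimension and return the dimensions with fewer than `minCases` cases.
function findUnderCoveredDimensions(
  cases: EvalCase[],
  dimensions: string[],
  minCases: number = 1,
): string[] {
  const counts = new Map<string, number>();
  for (const d of dimensions) counts.set(d, 0);
  for (const c of cases) {
    // Cases tagged with an unknown dimension are ignored rather than counted.
    if (counts.has(c.dimension)) {
      counts.set(c.dimension, (counts.get(c.dimension) ?? 0) + 1);
    }
  }
  return dimensions.filter((d) => (counts.get(d) ?? 0) < minCases);
}

// Only "tool-selection-accuracy" comes from this page; the other names are placeholders.
const dimensions = [
  "tool-selection-accuracy",
  "placeholder-dimension-b",
  "placeholder-dimension-c",
];

const cases: EvalCase[] = [
  { id: "case-001", dimension: "tool-selection-accuracy" },
  { id: "case-002", dimension: "placeholder-dimension-b" },
];

const gaps = findUnderCoveredDimensions(cases, dimensions);
console.log(gaps); // [ 'placeholder-dimension-c' ]
```

A check like this could run before evalCheck so that a dimension left without cases is reported as a gap instead of passing silently. The minimum count per dimension is a parameter because the page does not say what counts as adequate coverage.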

Quick Start

Use the ai-evalcheck-and-golden-evals skill to add a new eval case for the tool selection accuracy dimension.

