ai-evalcheck-and-golden-evals
Community
Manage AI evaluation suite and data.
Author: roaming-rockenfels
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill streamlines the management and validation of AI evaluation test cases and golden data, ensuring comprehensive coverage and accuracy of AI agent behaviors.
Core Features & Use Cases
- Eval Suite Management: Add, update, and maintain a suite of over 46 test cases across 7 critical evaluation dimensions.
- Golden Data Maintenance: Ensure the accuracy of golden evaluation data, updating it only when intentional behavior changes occur.
- Eval Validation: Run evalCheck (or its offline equivalent, npm run test) to verify the health of the evaluation suite and identify regressions.
- Coverage Analysis: Verify that all 7 evaluation dimensions have adequate test case coverage.
- Use Case: After introducing a new tool for the AI agent, use this Skill to add corresponding test cases, update golden data if necessary, and run evalCheck to confirm the new tool integrates correctly and doesn't break existing functionality.
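The coverage analysis described above can be pictured as a per-dimension count of test cases. The following is a minimal sketch, not the Skill's actual implementation: the `CoverageCase` shape, the `findUnderCoveredDimensions` helper, the minimum threshold, and the placeholder dimension names are all assumptions made for illustration.

```typescript
// Hypothetical shape of an eval case; the real suite's schema may differ.
interface CoverageCase {
  id: string;
  dimension: string;
}

// Count cases per dimension and return the dimensions below the minimum.
function findUnderCoveredDimensions(
  cases: CoverageCase[],
  dimensions: string[],
  minCasesPerDimension: number
): string[] {
  const counts = new Map<string, number>();
  for (const dimension of dimensions) {
    counts.set(dimension, 0);
  }
  for (const evalCase of cases) {
    // Cases tagged with an unknown dimension are ignored in this sketch.
    if (counts.has(evalCase.dimension)) {
      counts.set(evalCase.dimension, (counts.get(evalCase.dimension) ?? 0) + 1);
    }
  }
  return dimensions.filter(
    (dimension) => (counts.get(dimension) ?? 0) < minCasesPerDimension
  );
}

// Placeholder names: the real suite has 7 dimensions that are not listed here.
const coverageDimensions: string[] = [
  "tool-selection-accuracy",
  "dimension-b",
  "dimension-c",
];
const coverageCases: CoverageCase[] = [
  { id: "case-1", dimension: "tool-selection-accuracy" },
  { id: "case-2", dimension: "tool-selection-accuracy" },
  { id: "case-3", dimension: "dimension-b" },
];

// Require at least 2 cases per dimension (an assumed threshold).
const underCovered = findUnderCoveredDimensions(coverageCases, coverageDimensions, 2);
console.log(underCovered); // ["dimension-b", "dimension-c"]
```

A check like this could run alongside the suite so that a newly added dimension without enough cases is flagged early.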
Quick Start
Use the ai-evalcheck-and-golden-evals skill to add a new eval case for the tool selection accuracy dimension.
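To make the idea of a tool selection case and its golden data more concrete, here is a small sketch of how such a comparison might work. Everything in it is assumed for illustration: the `GoldenToolSelection` record shape, the `compareWithGolden` helper, the case IDs, and the tool names are hypothetical and are not taken from the actual suite.

```typescript
// Hypothetical golden record: the tools the agent is expected to select.
interface GoldenToolSelection {
  caseId: string;
  expectedTools: string[];
}

interface GoldenCheckResult {
  caseId: string;
  passed: boolean;
  expected: string[];
  actual: string[];
}

// Compare observed tool selections with the golden records (order-sensitive).
function compareWithGolden(
  golden: GoldenToolSelection[],
  actualByCase: Record<string, string[]>
): GoldenCheckResult[] {
  return golden.map((record) => {
    // A case with no observed output is treated as an empty selection.
    const actual = actualByCase[record.caseId] ?? [];
    const passed =
      actual.length === record.expectedTools.length &&
      actual.every((tool, index) => tool === record.expectedTools[index]);
    return { caseId: record.caseId, passed, expected: record.expectedTools, actual };
  });
}

// Illustrative data only; the tool names are placeholders.
const goldenData: GoldenToolSelection[] = [
  { caseId: "tool-selection-001", expectedTools: ["portfolio_summary"] },
  { caseId: "tool-selection-002", expectedTools: ["market_data", "portfolio_summary"] },
];
const observed: Record<string, string[]> = {
  "tool-selection-001": ["portfolio_summary"],
  "tool-selection-002": ["market_data"],
};

const regressions = compareWithGolden(goldenData, observed)
  .filter((result) => !result.passed)
  .map((result) => result.caseId);
console.log(regressions); // ["tool-selection-002"]
```

In this sketch a mismatch is reported as a regression and the golden record is left unchanged, which mirrors the guidance to update golden data only when a behavior change is intentional.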
Dependency Matrix
Required Modules: None required
Components: references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: ai-evalcheck-and-golden-evals Download link: https://github.com/roaming-rockenfels/ghostfolio/archive/main.zip#ai-evalcheck-and-golden-evals Please download this .zip file, extract it, and install it in the .claude/skills/ directory.