Model Evaluation Benchmark Skill

Community

Automate AI model benchmarking.

Author: rysweet
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill automates the complex and time-consuming process of evaluating and comparing the performance of different AI models against standardized benchmarks.

Core Features & Use Cases

  • End-to-End Benchmarking: Orchestrates setup, execution, analysis, and reporting for model evaluations.
  • Comprehensive Metrics: Measures efficiency (duration, cost), quality (code scores), and workflow adherence; an illustrative result shape is sketched after this list.
  • Use Case: When deciding between deploying GPT-4 or Claude Opus for a new feature, use this skill to run a benchmark suite that objectively measures which model performs better on relevant tasks.
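A minimal sketch of what one such result record and a per-model summary might look like in Python is shown below. The field names (duration_s, cost_usd, quality_score) and the BenchmarkResult/summarize names are illustrative assumptions, not the Skill's actual output schema.

```python
from dataclasses import dataclass

# Hypothetical shape of one benchmark run; field names are illustrative,
# not the Skill's actual output schema.
@dataclass
class BenchmarkResult:
    model: str            # e.g. "claude-opus" or "gpt-4"
    task: str             # benchmark task identifier
    duration_s: float     # wall-clock time for the run
    cost_usd: float       # estimated API cost
    quality_score: float  # e.g. code-quality score in [0, 1]

def summarize(results: list[BenchmarkResult]) -> dict[str, dict[str, float]]:
    """Average each metric per model so two models can be compared side by side."""
    summary: dict[str, dict[str, float]] = {}
    for model in {r.model for r in results}:
        runs = [r for r in results if r.model == model]
        summary[model] = {
            "avg_duration_s": sum(r.duration_s for r in runs) / len(runs),
            "avg_cost_usd": sum(r.cost_usd for r in runs) / len(runs),
            "avg_quality": sum(r.quality_score for r in runs) / len(runs),
        }
    return summary
```

Feeding the collected records for both candidate models into summarize yields the side-by-side averages that the Use Case above calls for.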

Quick Start

Run the model evaluation benchmark suite for Opus and Sonnet models.
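The Skill handles the orchestration itself, but the overall shape of such a run can be sketched as a simple loop over models and tasks. Everything in this sketch is hypothetical: run_task stands in for whatever harness the Skill actually invokes, and the model and task names are placeholders.

```python
import time

MODELS = ["claude-opus", "claude-sonnet"]            # placeholder model identifiers
TASKS = ["fix-failing-test", "implement-endpoint"]   # placeholder benchmark tasks

def run_task(model: str, task: str) -> None:
    """Stand-in for the Skill's real harness call; replace with the actual runner."""
    time.sleep(0.01)  # simulate work

for model in MODELS:
    for task in TASKS:
        start = time.perf_counter()
        run_task(model, task)
        elapsed = time.perf_counter() - start
        print(f"{model} / {task}: {elapsed:.2f}s")
```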

Dependency Matrix

Required Modules

  • python
  • gh-cli

Components

  • scripts
  • references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: Model Evaluation Benchmark Skill
Download link: https://github.com/rysweet/RustyClawd/archive/main.zip#model-evaluation-benchmark-skill

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
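If you prefer to perform those steps manually, a rough Python equivalent is sketched below. The location of the skill inside the archive (a model-evaluation-benchmark-skill folder) is an assumption taken from the download link's fragment, not something the repository layout is guaranteed to match.

```python
import io
import shutil
import urllib.request
import zipfile
from pathlib import Path

ARCHIVE_URL = "https://github.com/rysweet/RustyClawd/archive/main.zip"
SKILL_NAME = "model-evaluation-benchmark-skill"  # assumed folder name inside the repo
SKILLS_DIR = Path(".claude/skills")

# Download the repository archive into memory.
with urllib.request.urlopen(ARCHIVE_URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

# Extract the archive to a scratch directory, then copy the skill folder
# into .claude/skills/. The skill's path inside the repo is an assumption.
scratch = Path("_skill_download")
archive.extractall(scratch)
matches = list(scratch.rglob(SKILL_NAME))
if matches:
    SKILLS_DIR.mkdir(parents=True, exist_ok=True)
    shutil.copytree(matches[0], SKILLS_DIR / SKILL_NAME, dirs_exist_ok=True)
else:
    print(f"Could not find a '{SKILL_NAME}' folder in the archive")
shutil.rmtree(scratch)
```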
