eval-bench

Community

Benchmark models comprehensively.

Author: Rachasumanth
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill automates rigorous evaluation of AI models so that their performance, safety, and reliability can be verified before deployment.

Core Features & Use Cases

  • Model Evaluation: Run comprehensive benchmarks with lm-evaluation-harness covering common NLP tasks such as MMLU, HellaSwag, and ARC (see the first sketch below).
  • Safety & Bias Testing: Assess models for truthfulness, toxicity, bias, and stereotyping with benchmarks such as TruthfulQA, ToxiGen, BBQ, and CrowS-Pairs (covered in the same sketch).
  • Code Generation Evaluation: For code models, evaluate performance on HumanEval, MBPP, and MultiPL-E (see the pass@k sketch below).
  • RAG & Perplexity: Evaluate Retrieval-Augmented Generation quality with Ragas and measure perplexity on held-out datasets (see the perplexity sketch below).
  • Reporting: Generate human-readable reports that summarize findings and give a release-readiness recommendation (see the reporting sketch below).
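
As a rough illustration of the first two bullets, here is a minimal sketch of driving lm-evaluation-harness from Python. It assumes the 0.4+ Python API (`lm_eval.simple_evaluate`); the checkpoint name, batch size, and exact task identifiers (which vary between harness versions) are placeholders rather than anything the skill prescribes.

```python
# Minimal sketch: capability and safety/bias benchmarks with
# lm-evaluation-harness (assumes lm-eval >= 0.4; checkpoint and task
# names below are illustrative and may differ per harness version).
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # placeholder checkpoint
    tasks=[
        "mmlu", "hellaswag", "arc_challenge",                     # capability
        "truthfulqa_mc2", "toxigen", "crows_pairs_english",       # safety / bias
    ],
    num_fewshot=0,
    batch_size=8,
)

# Keep the per-task scores around for the reporting step.
with open("eval_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```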
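
For the code-generation bullet, pass@k is the usual metric on HumanEval-style problems. The sketch below computes it with the Hugging Face `evaluate` library's `code_eval` metric; the skill may well use a different harness, and the single test case and candidate completion stand in for real model output.

```python
# Minimal sketch: pass@k for model-generated code via the Hugging Face
# `evaluate` library's code_eval metric.
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # code_eval executes untrusted code; opt in explicitly

from evaluate import load

code_eval = load("code_eval")

# One problem, one sampled completion (normally k samples per problem).
test_cases = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b"]]  # stand-in for model output

pass_at_k, detail = code_eval.compute(
    references=test_cases,
    predictions=candidates,
    k=[1],
)
print(pass_at_k)  # e.g. {'pass@1': 1.0}
```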
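
For the RAG & Perplexity bullet: Ragas scores RAG pipelines with an `evaluate()` call over question/answer/context records, but its interface has shifted between releases, so only the perplexity half is sketched here. The checkpoint and text are placeholders, and a real run over a held-out corpus would slide a window over the model's context length.

```python
# Minimal sketch: perplexity on a held-out text with transformers.
# Checkpoint and text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

text = "Held-out evaluation passage goes here."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns mean token cross-entropy.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"perplexity = {torch.exp(loss).item():.2f}")
```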
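
Finally, a purely hypothetical sketch of the reporting step: it folds the scores saved by the first sketch into a markdown table with a naive release-readiness flag. The file names, metric-key filtering, and 0.6 threshold are illustrative assumptions, not part of eval-bench.

```python
# Hypothetical sketch: turn saved benchmark scores into a short markdown
# report. File names, metric filtering, and the 0.6 threshold are
# illustrative only.
import json

with open("eval_results.json") as f:
    results = json.load(f)

lines = ["# eval-bench report", "", "| Task | Metric | Score |", "|---|---|---|"]
scores = []
for task, metrics in results.items():
    for name, value in metrics.items():
        # Keep numeric point estimates, skip standard-error entries.
        if isinstance(value, (int, float)) and "stderr" not in name:
            lines.append(f"| {task} | {name} | {value:.3f} |")
            scores.append(value)

verdict = "ready for release review" if scores and min(scores) >= 0.6 else "needs further work"
lines += ["", f"**Recommendation:** model is {verdict}."]

with open("report.md", "w") as f:
    f.write("\n".join(lines))
```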

Quick Start

Run a full model evaluation using the eval-bench skill with default settings.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: eval-bench
Download link: https://github.com/Rachasumanth/text2llm001/archive/main.zip#eval-bench

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
