eval-bench

Community

Benchmark models comprehensively.

Author: Rachasumanth
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill automates rigorous evaluation of AI models so that their performance, safety, and reliability can be verified before deployment.

Core Features & Use Cases

  • Model Evaluation: Run comprehensive benchmarks with lm-evaluation-harness covering common NLP tasks such as MMLU, HellaSwag, and ARC (see the first sketch below).
  • Safety & Bias Testing: Assess models for truthfulness, toxicity, bias, and stereotyping with benchmarks such as TruthfulQA, ToxiGen, BBQ, and CrowS-Pairs (covered in the same sketch).
  • Code Generation Evaluation: For code models, evaluate performance on HumanEval, MBPP, and MultiPL-E (see the pass@k sketch below).
  • RAG & Perplexity: Evaluate Retrieval-Augmented Generation quality with Ragas and measure perplexity on held-out datasets (see the perplexity sketch below).
  • Reporting: Generate human-readable reports that summarize findings and give a release-readiness recommendation (see the reporting sketch below).
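
As a rough illustration of the first two bullets, here is a minimal sketch of driving lm-evaluation-harness from Python. It assumes the 0.4+ Python API (`lm_eval.simple_evaluate`); the checkpoint name, batch size, and exact task identifiers (which vary between harness versions) are placeholders rather than anything the skill prescribes.

```python
# Minimal sketch: capability and safety/bias benchmarks with
# lm-evaluation-harness (assumes lm-eval >= 0.4; checkpoint and task
# names below are illustrative and may differ per harness version).
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # placeholder checkpoint
    tasks=[
        "mmlu", "hellaswag", "arc_challenge",                     # capability
        "truthfulqa_mc2", "toxigen", "crows_pairs_english",       # safety / bias
    ],
    num_fewshot=0,
    batch_size=8,
)

# Keep the per-task scores around for the reporting step.
with open("eval_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```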
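
For the code-generation bullet, pass@k is the usual metric on HumanEval-style problems. The sketch below computes it with the Hugging Face `evaluate` library's `code_eval` metric; the skill may well use a different harness, and the single test case and candidate completion stand in for real model output.

```python
# Minimal sketch: pass@k for model-generated code via the Hugging Face
# `evaluate` library's code_eval metric.
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # code_eval executes untrusted code; opt in explicitly

from evaluate import load

code_eval = load("code_eval")

# One problem, one sampled completion (normally k samples per problem).
test_cases = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b"]]  # stand-in for model output

pass_at_k, detail = code_eval.compute(
    references=test_cases,
    predictions=candidates,
    k=[1],
)
print(pass_at_k)  # e.g. {'pass@1': 1.0}
```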
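
For the RAG & Perplexity bullet: Ragas scores RAG pipelines with an `evaluate()` call over question/answer/context records, but its interface has shifted between releases, so only the perplexity half is sketched here. The checkpoint and text are placeholders, and a real run over a held-out corpus would slide a window over the model's context length.

```python
# Minimal sketch: perplexity on a held-out text with transformers.
# Checkpoint and text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

text = "Held-out evaluation passage goes here."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns mean token cross-entropy.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"perplexity = {torch.exp(loss).item():.2f}")
```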
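
Finally, a purely hypothetical sketch of the reporting step: it folds the scores saved by the first sketch into a markdown table with a naive release-readiness flag. The file names, metric-key filtering, and 0.6 threshold are illustrative assumptions, not part of eval-bench.

```python
# Hypothetical sketch: turn saved benchmark scores into a short markdown
# report. File names, metric filtering, and the 0.6 threshold are
# illustrative only.
import json

with open("eval_results.json") as f:
    results = json.load(f)

lines = ["# eval-bench report", "", "| Task | Metric | Score |", "|---|---|---|"]
scores = []
for task, metrics in results.items():
    for name, value in metrics.items():
        # Keep numeric point estimates, skip standard-error entries.
        if isinstance(value, (int, float)) and "stderr" not in name:
            lines.append(f"| {task} | {name} | {value:.3f} |")
            scores.append(value)

verdict = "ready for release review" if scores and min(scores) >= 0.6 else "needs further work"
lines += ["", f"**Recommendation:** model is {verdict}."]

with open("report.md", "w") as f:
    f.write("\n".join(lines))
```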

Quick Start

Run a full model evaluation using the eval-bench skill with default settings.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: eval-bench
Download link: https://github.com/Rachasumanth/text2llm001/archive/main.zip#eval-bench

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
