evaluating-llms-harness

Community

Benchmark LLMs against academic standards.

Author: AXGZ21
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill automates the evaluation of Large Language Models (LLMs) against a comprehensive suite of academic benchmarks, providing standardized metrics for model quality and performance.

Core Features & Use Cases

  • Comprehensive Benchmarking: Evaluates LLMs across 60+ academic benchmarks including MMLU, HumanEval, GSM8K, and TruthfulQA.
  • Model Comparison: Facilitates direct comparison of different LLMs or different versions of the same LLM.
  • Training Progress Tracking: Enables monitoring of model performance during training cycles.
  • Industry Standard: Builds on EleutherAI's widely adopted lm-evaluation-harness framework, used by major AI labs.
  • Use Case: A researcher wants to compare the reasoning capabilities of two new LLMs. With this Skill, they can run both models through the MMLU and GSM8K benchmarks and get directly comparable accuracy scores (see the sketch after this list).
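
For that comparison use case, a minimal sketch of what the underlying evaluation could look like is shown below. It assumes lm-eval's documented Python entry point (lm_eval.simple_evaluate with the "hf" Hugging Face backend); the model names are placeholders and the batch size is only an example, so adjust both for your hardware and installed lm-eval version.

```python
import lm_eval

# Placeholder identifiers for the two models under comparison.
candidates = ["org/model-a", "org/model-b"]

scores = {}
for name in candidates:
    # Run each model through the same tasks so the scores are directly comparable.
    results = lm_eval.simple_evaluate(
        model="hf",                      # Hugging Face transformers backend
        model_args=f"pretrained={name}",
        tasks=["mmlu", "gsm8k"],
        batch_size=8,                    # example value; tune for your hardware
    )
    scores[name] = results["results"]    # per-task metric dict

for name, per_task in scores.items():
    print(name, per_task)
```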

Quick Start

Use the evaluating-llms-harness skill to evaluate the 'meta-llama/Llama-2-7b-hf' model on the 'mmlu' and 'gsm8k' tasks.
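
Behind that prompt, the Skill drives an lm-evaluation-harness run roughly like the sketch below (again assuming lm-eval's Python API; Llama-2 is a gated model, so Hugging Face access credentials are required, and the output filename is arbitrary):

```python
import json
import lm_eval

# Quick-start run: one model, two tasks, mirroring the prompt above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["mmlu", "gsm8k"],
    batch_size=8,                        # example value
)

# Per-task metrics (accuracy for MMLU, exact match for GSM8K) live under "results".
with open("llama2_7b_eval.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```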

Dependency Matrix

Required Modules

  • lm-eval
  • transformers
  • vllm

Components

  • scripts
  • references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: evaluating-llms-harness
Download link: https://github.com/AXGZ21/hermes-agent-railway/archive/main.zip#evaluating-llms-harness

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Source repository: https://github.com/AXGZ21/hermes-agent-railway
