evaluating-llms-harness
Community | Benchmark LLMs against academic standards.
Category: Education & Research
Tags: benchmarking, llm evaluation, performance metrics, lm-evaluation-harness, model quality, academic benchmarks
Author: AXGZ21
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill automates the evaluation of Large Language Models (LLMs) against a comprehensive suite of academic benchmarks, providing standardized metrics for model quality and performance.
Core Features & Use Cases
- Comprehensive Benchmarking: Evaluates LLMs across 60+ academic benchmarks including MMLU, HumanEval, GSM8K, and TruthfulQA.
- Model Comparison: Facilitates direct comparison of different LLMs or different versions of the same LLM.
- Training Progress Tracking: Enables monitoring of model performance during training cycles.
- Industry Standard: Utilizes the widely adopted lm-evaluation-harness framework used by major AI labs.
- Use Case: A researcher wants to compare the reasoning capabilities of two new LLMs. They can use this Skill to run both models through MMLU and GSM8K benchmarks and get comparable accuracy scores.
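The comparison use case above can be sketched with the harness's Python entry point. The snippet below is illustrative only: it assumes lm-evaluation-harness v0.4+ (which exposes lm_eval.simple_evaluate), a Hugging Face transformers backend, and two placeholder model IDs that are not real recommendations.

```python
import lm_eval

# Hypothetical Hugging Face model IDs to compare (placeholders, replace with real ones).
MODELS = ["org-a/model-7b", "org-b/model-7b"]
TASKS = ["mmlu", "gsm8k"]

scores = {}
for model_id in MODELS:
    # simple_evaluate runs the requested tasks and returns a results dictionary.
    results = lm_eval.simple_evaluate(
        model="hf",                           # Hugging Face transformers backend
        model_args=f"pretrained={model_id}",
        tasks=TASKS,
        batch_size=8,
    )
    # results["results"] maps task name -> metric name -> value.
    scores[model_id] = results["results"]

# Print a side-by-side view of whatever numeric metrics each task reports.
for model_id, task_results in scores.items():
    print(f"\n== {model_id} ==")
    for task, metrics in task_results.items():
        for metric, value in metrics.items():
            if isinstance(value, float):
                print(f"  {task:10s} {metric:30s} {value:.4f}")
```

Because both models are scored by the same harness on the same task versions, the printed metrics are directly comparable.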
Quick Start
Use the evaluating-llms-harness skill to evaluate the 'meta-llama/Llama-2-7b-hf' model on the 'mmlu' and 'gsm8k' tasks.
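Under the hood, a request like the one above maps onto a single harness call. A minimal sketch, again assuming the lm_eval.simple_evaluate API from lm-evaluation-harness v0.4+, a GPU at cuda:0, and access to the gated Llama-2 weights, might look like this; the output file name is a hypothetical choice.

```python
import json
import lm_eval

# Evaluate the model named in the Quick Start prompt on MMLU and GSM8K.
results = lm_eval.simple_evaluate(
    model="hf",                                        # transformers backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # model from the prompt above
    tasks=["mmlu", "gsm8k"],
    batch_size=8,
    device="cuda:0",
)

# Persist the per-task metrics for later comparison or reporting.
with open("llama2_7b_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)

print(json.dumps(results["results"], indent=2, default=str))
```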
Dependency Matrix
Required Modules
lm-eval, transformers, vllm
Components
scripts, references
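Before running evaluations, it can help to confirm the required modules are importable. The sketch below assumes the packages install under their usual import names (lm_eval, transformers, vllm); it is a convenience check, not part of the skill's own scripts.

```python
import importlib

# Import names for the required modules listed above (assumed standard names).
REQUIRED_MODULES = ["lm_eval", "transformers", "vllm"]

def check_dependencies(modules=REQUIRED_MODULES):
    """Report which required modules can be imported in the current environment."""
    missing = []
    for name in modules:
        try:
            mod = importlib.import_module(name)
            version = getattr(mod, "__version__", "unknown")
            print(f"ok      {name} ({version})")
        except ImportError:
            print(f"MISSING {name}")
            missing.append(name)
    return missing

if __name__ == "__main__":
    if check_dependencies():
        print("Install the missing packages before running the harness.")
```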
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: evaluating-llms-harness
Download link: https://github.com/AXGZ21/hermes-agent-railway/archive/main.zip#evaluating-llms-harness
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.