evaluating-llms-harness
Community | Benchmark LLMs against academic standards.
Category: Education & Research
Tags: benchmarking, llm evaluation, performance metrics, lm-evaluation-harness, model quality, academic benchmarks
Author: AXGZ21
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill automates the evaluation of Large Language Models (LLMs) against a comprehensive suite of academic benchmarks, providing standardized metrics for model quality and performance.
Core Features & Use Cases
- Comprehensive Benchmarking: Evaluates LLMs across 60+ academic benchmarks including MMLU, HumanEval, GSM8K, and TruthfulQA.
- Model Comparison: Facilitates direct comparison of different LLMs or different versions of the same LLM.
- Training Progress Tracking: Enables monitoring of model performance during training cycles.
- Industry Standard: Utilizes the widely adopted lm-evaluation-harness framework used by major AI labs.
- Use Case: A researcher wants to compare the reasoning capabilities of two new LLMs. They can use this Skill to run both models through MMLU and GSM8K benchmarks and get comparable accuracy scores.
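The comparison use case above can be sketched with the harness's Python entry point. The snippet below is illustrative only: it assumes lm-evaluation-harness v0.4+ (which exposes lm_eval.simple_evaluate), a Hugging Face transformers backend, and two placeholder model IDs that are not real recommendations.

```python
import lm_eval

# Hypothetical Hugging Face model IDs to compare (placeholders, replace with real ones).
MODELS = ["org-a/model-7b", "org-b/model-7b"]
TASKS = ["mmlu", "gsm8k"]

scores = {}
for model_id in MODELS:
    # simple_evaluate runs the requested tasks and returns a results dictionary.
    results = lm_eval.simple_evaluate(
        model="hf",                           # Hugging Face transformers backend
        model_args=f"pretrained={model_id}",
        tasks=TASKS,
        batch_size=8,
    )
    # results["results"] maps task name -> metric name -> value.
    scores[model_id] = results["results"]

# Print a side-by-side view of whatever numeric metrics each task reports.
for model_id, task_results in scores.items():
    print(f"\n== {model_id} ==")
    for task, metrics in task_results.items():
        for metric, value in metrics.items():
            if isinstance(value, float):
                print(f"  {task:10s} {metric:30s} {value:.4f}")
```

Because both models are scored by the same harness on the same task versions, the printed metrics are directly comparable.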
Quick Start
Use the evaluating-llms-harness skill to evaluate the 'meta-llama/Llama-2-7b-hf' model on the 'mmlu' and 'gsm8k' tasks.
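Under the hood, a request like the one above maps onto a single harness call. A minimal sketch, again assuming the lm_eval.simple_evaluate API from lm-evaluation-harness v0.4+, a GPU at cuda:0, and access to the gated Llama-2 weights, might look like this; the output file name is a hypothetical choice.

```python
import json
import lm_eval

# Evaluate the model named in the Quick Start prompt on MMLU and GSM8K.
results = lm_eval.simple_evaluate(
    model="hf",                                        # transformers backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # model from the prompt above
    tasks=["mmlu", "gsm8k"],
    batch_size=8,
    device="cuda:0",
)

# Persist the per-task metrics for later comparison or reporting.
with open("llama2_7b_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)

print(json.dumps(results["results"], indent=2, default=str))
```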
Dependency Matrix
Required Modules
lm-eval, transformers, vllm
Components
scripts, references
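Before running evaluations, it can help to confirm the required modules are importable. The sketch below assumes the packages install under their usual import names (lm_eval, transformers, vllm); it is a convenience check, not part of the skill's own scripts.

```python
import importlib

# Import names for the required modules listed above (assumed standard names).
REQUIRED_MODULES = ["lm_eval", "transformers", "vllm"]

def check_dependencies(modules=REQUIRED_MODULES):
    """Report which required modules can be imported in the current environment."""
    missing = []
    for name in modules:
        try:
            mod = importlib.import_module(name)
            version = getattr(mod, "__version__", "unknown")
            print(f"ok      {name} ({version})")
        except ImportError:
            print(f"MISSING {name}")
            missing.append(name)
    return missing

if __name__ == "__main__":
    if check_dependencies():
        print("Install the missing packages before running the harness.")
```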
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: evaluating-llms-harness
Download link: https://github.com/AXGZ21/hermes-agent-railway/archive/main.zip#evaluating-llms-harness
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.