nemo-evaluator

Community

Benchmark LLMs with NeMo Evaluator.

Author: eyadsibai
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill streamlines the evaluation of Large Language Models (LLMs) by providing a framework for running industry-standard benchmarks and configuring complex evaluation pipelines.

Core Features & Use Cases

  • Comprehensive Benchmarking: Supports over 100 benchmarks across 18+ harnesses, including MMLU, HumanEval, and GSM8K.
  • Reproducible Evaluation: Utilizes containerization for consistent results across different environments.
  • Flexible Deployment: Enables evaluation on local Docker, Slurm HPC clusters, or cloud platforms.
  • Use Case: You need to compare the performance of two new LLMs on coding tasks and general knowledge. Use this Skill to configure and run HumanEval and MMLU benchmarks for both models, generating a comparative report.

Quick Start

Install the NeMo Evaluator launcher by running pip install nemo-evaluator-launcher.
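As a rough sketch of what a first run might look like: the subcommands, flags, example config name, and output path below are assumptions based on the launcher's Hydra-style configuration, not guaranteed defaults, so check nemo-evaluator-launcher --help and the example configs shipped with your installed version before copying them.

```bash
# Install the launcher CLI.
pip install nemo-evaluator-launcher

# List the benchmarks available in this installation.
# (Subcommand name is an assumption; confirm with --help.)
nemo-evaluator-launcher ls tasks

# Run an evaluation from an example config.
# The config name, config directory, and output directory here are
# placeholders for illustration, not shipped defaults.
nemo-evaluator-launcher run \
  --config-dir examples \
  --config-name local_llama_3_1_8b_instruct \
  -o execution.output_dir=./results
```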

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: nemo-evaluator
Download link: https://github.com/eyadsibai/ltk/archive/main.zip#nemo-evaluator

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
