nemo-evaluator-sdk

Community

Benchmark LLMs at scale.

Author: DoanNgocCuong
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill provides a robust, scalable platform for evaluating Large Language Models (LLMs) across a wide array of benchmarks, producing reproducible, enterprise-grade performance assessments.

Core Features & Use Cases

  • Extensive Benchmarking: Evaluates LLMs on 100+ benchmarks drawn from 18+ harnesses, including MMLU, HumanEval, and GSM8K, as well as safety and vision-language (VLM) tasks.
  • Multi-Backend Execution: Supports evaluation on local Docker, Slurm HPC clusters, and cloud platforms.
  • Reproducible Evaluation: Utilizes a container-first architecture for consistent and reliable benchmarking.
  • Use Case: A research team needs to compare three LLMs on a suite of academic and safety benchmarks. With this Skill they can configure and run the evaluations across their Slurm cluster and generate a standardized report for each model; a sketch of such a run follows this list.
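
The comparison workflow from the use case above might be driven by a small script like the following. This is a minimal sketch: the "nemo-evaluator-launcher" command name and its flags are assumptions for illustration, not the SDK's confirmed CLI, and the second and third model names are placeholders. Consult the SDK documentation for the real invocation.

    import itertools
    import subprocess

    # Hypothetical comparison run: three models evaluated on a small
    # benchmark suite. The command name and flags below are assumptions.
    MODELS = [
        "meta/llama-3.1-8b-instruct",
        "meta/llama-3.1-70b-instruct",   # placeholder model name
        "mistralai/mistral-7b-instruct",  # placeholder model name
    ]
    TASKS = ["mmlu", "gsm8k", "ifeval"]

    for model, task in itertools.product(MODELS, TASKS):
        # One evaluation job per (model, task) pair; on a Slurm cluster the
        # launcher would submit these as batch jobs rather than run inline.
        subprocess.run(
            [
                "nemo-evaluator-launcher", "run",  # assumed command name
                "--model", model,                  # assumed flag
                "--task", task,                    # assumed flag
                "--output-dir", f"results/{model.replace('/', '_')}/{task}",
            ],
            check=True,
        )

Each (model, task) pair becomes an independent job, so the same loop shape works whether the jobs run sequentially in local Docker or are fanned out to a cluster.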

Quick Start

Use the nemo-evaluator-sdk skill to evaluate the 'meta/llama-3.1-8b-instruct' model on the 'ifeval' task using local Docker execution.
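
In practice, that request might reduce to a single SDK call. The import path, function, and parameter names below are assumptions for illustration, not the SDK's confirmed API; check the nemo-evaluator documentation before relying on them.

    # Hypothetical SDK usage -- names are assumptions, not the confirmed API.
    from nemo_evaluator import evaluate  # assumed import path

    results = evaluate(
        model="meta/llama-3.1-8b-instruct",  # model from the Quick Start
        task="ifeval",                       # instruction-following benchmark
        backend="docker",                    # local container execution
        output_dir="results/ifeval",
    )
    print(results.summary())                 # assumed results object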

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: nemo-evaluator-sdk
Download link: https://github.com/DoanNgocCuong/continuous-training-pipeline_T3_2026/archive/main.zip#nemo-evaluator-sdk

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
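
If you prefer to install manually, the steps above amount to roughly the following. This is a minimal sketch using only the Python standard library; it assumes the skill lives in a nemo-evaluator-sdk/ subdirectory of the extracted archive, as the fragment on the download link suggests.

    # Manual install sketch: download the repo archive, extract it, and copy
    # the skill folder into .claude/skills/. The subdirectory name is an
    # assumption based on the download link's fragment.
    import io
    import shutil
    import tempfile
    import urllib.request
    import zipfile
    from pathlib import Path

    URL = ("https://github.com/DoanNgocCuong/"
           "continuous-training-pipeline_T3_2026/archive/main.zip")
    dest = Path.home() / ".claude" / "skills" / "nemo-evaluator-sdk"

    with urllib.request.urlopen(URL) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    tmp = Path(tempfile.mkdtemp())
    archive.extractall(tmp)

    # The archive's top-level folder is "<repo>-main"; find the skill in it.
    src = next(tmp.glob("*/nemo-evaluator-sdk"))
    shutil.copytree(src, dest, dirs_exist_ok=True)
    print(f"Installed to {dest}")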
