nemo-evaluator-sdk
Community
Benchmark LLMs at scale.
Author: DoanNgocCuong
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill provides a scalable platform for evaluating Large Language Models (LLMs) across a wide array of benchmarks, with an emphasis on reproducible, enterprise-grade performance assessment.
Core Features & Use Cases
- Extensive Benchmarking: Evaluates LLMs on 100+ benchmarks from 18+ harnesses, including MMLU, HumanEval, GSM8K, safety, and VLM tasks.
- Multi-Backend Execution: Supports evaluation on local Docker, Slurm HPC clusters, and cloud platforms.
- Reproducible Evaluation: Utilizes a container-first architecture for consistent and reliable benchmarking.
- Use Case: A research team needs to compare the performance of three different LLMs on a suite of academic and safety benchmarks. They can use this Skill to configure and run evaluations across their Slurm cluster, generating standardized reports for each model.
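To make the multi-backend execution feature concrete, here is a minimal sketch of how a benchmark run might be dispatched to either local Docker or a Slurm cluster. This is illustrative only: the class and function names, the container image name, and the command shapes are all assumptions, not the actual nemo-evaluator-sdk API.

```python
from dataclasses import dataclass

# Hypothetical multi-backend dispatcher; the real nemo-evaluator-sdk
# API may differ substantially from this sketch.

@dataclass
class EvalJob:
    model: str    # model identifier, e.g. "meta/llama-3.1-8b-instruct"
    task: str     # benchmark name, e.g. "ifeval" or "mmlu"
    backend: str  # "docker" or "slurm"

def build_command(job: EvalJob) -> list[str]:
    """Translate a job into a backend-specific launch command."""
    if job.backend == "docker":
        # Container-first execution: the benchmark harness runs inside
        # an evaluation image (name here is a placeholder).
        return ["docker", "run", "--rm", "eval-image",
                "--model", job.model, "--task", job.task]
    if job.backend == "slurm":
        # HPC execution: submit a batch script (placeholder name).
        return ["sbatch", "eval.sbatch", job.model, job.task]
    raise ValueError(f"unsupported backend: {job.backend}")

# Comparing three models on one benchmark, as in the use case above.
jobs = [EvalJob(m, "mmlu", "slurm") for m in ("model-a", "model-b", "model-c")]
for job in jobs:
    print(" ".join(build_command(job)))
```

The point of the sketch is the design choice: each backend reduces to a launch command over the same (model, task) pair, which is what makes results comparable across execution environments.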
Quick Start
Use the nemo-evaluator-sdk skill to evaluate the 'meta/llama-3.1-8b-instruct' model on the 'ifeval' task using local Docker execution.
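The Quick Start prompt above boils down to three inputs: a model, a task, and an execution backend. The sketch below shows one way those inputs could be collected and sanity-checked before launching; the field names are hypothetical and are not the SDK's actual configuration schema.

```python
# Hypothetical parameter set for the Quick Start run; field names are
# illustrative and may not match the real nemo-evaluator-sdk schema.
quick_start = {
    "model": "meta/llama-3.1-8b-instruct",
    "task": "ifeval",
    "execution": {
        "backend": "docker",  # local container execution
        "gpus": 1,
    },
    "output_dir": "./results/ifeval",
}

def validate(cfg: dict) -> bool:
    """Minimal sanity check before launching an evaluation."""
    required = {"model", "task", "execution"}
    return required.issubset(cfg)

assert validate(quick_start)
print("config ok:", quick_start["model"], "on", quick_start["task"])
```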
Dependency Matrix
Required Modules: None required
Components: references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: nemo-evaluator-sdk
Download link: https://github.com/DoanNgocCuong/continuous-training-pipeline_T3_2026/archive/main.zip#nemo-evaluator-sdk
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.