nemo-evaluator-sdk

Community

Benchmark LLMs at scale.

Author: DoanNgocCuong
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill provides a robust, scalable platform for evaluating Large Language Models (LLMs) across a wide array of benchmarks, producing reproducible, enterprise-grade performance assessments.

Core Features & Use Cases

  • Extensive Benchmarking: Evaluates LLMs on 100+ benchmarks drawn from 18+ harnesses, including MMLU, HumanEval, and GSM8K, as well as safety and vision-language (VLM) tasks.
  • Multi-Backend Execution: Supports evaluation on local Docker, Slurm HPC clusters, and cloud platforms.
  • Reproducible Evaluation: Utilizes a container-first architecture for consistent and reliable benchmarking.
  • Use Case: A research team needs to compare three LLMs on a suite of academic and safety benchmarks. With this Skill they can configure and run the evaluations across their Slurm cluster and generate a standardized report for each model; a sketch of such a run follows this list.
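
The comparison workflow from the use case above might be driven by a small script like the following. This is a minimal sketch: the "nemo-evaluator-launcher" command name and its flags are assumptions for illustration, not the SDK's confirmed CLI, and the second and third model names are placeholders. Consult the SDK documentation for the real invocation.

    import itertools
    import subprocess

    # Hypothetical comparison run: three models evaluated on a small
    # benchmark suite. The command name and flags below are assumptions.
    MODELS = [
        "meta/llama-3.1-8b-instruct",
        "meta/llama-3.1-70b-instruct",   # placeholder model name
        "mistralai/mistral-7b-instruct",  # placeholder model name
    ]
    TASKS = ["mmlu", "gsm8k", "ifeval"]

    for model, task in itertools.product(MODELS, TASKS):
        # One evaluation job per (model, task) pair; on a Slurm cluster the
        # launcher would submit these as batch jobs rather than run inline.
        subprocess.run(
            [
                "nemo-evaluator-launcher", "run",  # assumed command name
                "--model", model,                  # assumed flag
                "--task", task,                    # assumed flag
                "--output-dir", f"results/{model.replace('/', '_')}/{task}",
            ],
            check=True,
        )

Each (model, task) pair becomes an independent job, so the same loop shape works whether the jobs run sequentially in local Docker or are fanned out to a cluster.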

Quick Start

Use the nemo-evaluator-sdk skill to evaluate the 'meta/llama-3.1-8b-instruct' model on the 'ifeval' task using local Docker execution.
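
In practice, that request might reduce to a single SDK call. The import path, function, and parameter names below are assumptions for illustration, not the SDK's confirmed API; check the nemo-evaluator documentation before relying on them.

    # Hypothetical SDK usage -- names are assumptions, not the confirmed API.
    from nemo_evaluator import evaluate  # assumed import path

    results = evaluate(
        model="meta/llama-3.1-8b-instruct",  # model from the Quick Start
        task="ifeval",                       # instruction-following benchmark
        backend="docker",                    # local container execution
        output_dir="results/ifeval",
    )
    print(results.summary())                 # assumed results object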

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: nemo-evaluator-sdk
Download link: https://github.com/DoanNgocCuong/continuous-training-pipeline_T3_2026/archive/main.zip#nemo-evaluator-sdk

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
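
If you prefer to install manually, the steps above amount to roughly the following. This is a minimal sketch using only the Python standard library; it assumes the skill lives in a nemo-evaluator-sdk/ subdirectory of the extracted archive, as the fragment on the download link suggests.

    # Manual install sketch: download the repo archive, extract it, and copy
    # the skill folder into .claude/skills/. The subdirectory name is an
    # assumption based on the download link's fragment.
    import io
    import shutil
    import tempfile
    import urllib.request
    import zipfile
    from pathlib import Path

    URL = ("https://github.com/DoanNgocCuong/"
           "continuous-training-pipeline_T3_2026/archive/main.zip")
    dest = Path.home() / ".claude" / "skills" / "nemo-evaluator-sdk"

    with urllib.request.urlopen(URL) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    tmp = Path(tempfile.mkdtemp())
    archive.extractall(tmp)

    # The archive's top-level folder is "<repo>-main"; find the skill in it.
    src = next(tmp.glob("*/nemo-evaluator-sdk"))
    shutil.copytree(src, dest, dirs_exist_ok=True)
    print(f"Installed to {dest}")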
