eval-engine

Category: Community

LLM evaluation pipeline

Author: mqzkim
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill automates building and running evaluation pipelines for Large Language Models (LLMs), enabling systematic quality assessment and comparison across models.

Core Features & Use Cases

  • Dataset Management: Load and manage datasets for evaluation.
  • LLM-as-Judge Evaluation: Utilize LLMs to evaluate responses based on various criteria.
  • Customizable Metrics: Supports accuracy, relevance, hallucination, harmfulness, and Korean-specific quality.
  • A/B Testing: Facilitates comparison between different model runs or configurations.
  • Use Case: Evaluate a new chatbot's responses against a benchmark dataset using multiple metrics, identifying areas for improvement.
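The features above can be sketched in a few lines. This is a hedged illustration, not the skill's actual API: `judge` stands in for whatever LLM-as-judge client the skill uses, and the example/run formats are assumptions; only the metric names mirror the list above.

```python
# Minimal sketch of an LLM-as-judge evaluation loop plus A/B comparison.
# All names here (EvalResult, evaluate, ab_compare) are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    example_id: str
    metric: str
    score: float  # 0.0 (worst) to 1.0 (best)

def evaluate(dataset, metrics, judge: Callable[[str, str, str], float]):
    """Score every (prompt, response) pair on every metric via the judge."""
    results = []
    for ex in dataset:
        for metric in metrics:
            score = judge(metric, ex["prompt"], ex["response"])
            results.append(EvalResult(ex["id"], metric, score))
    return results

def mean_by_metric(results):
    """Aggregate per-example scores into one mean per metric."""
    totals = {}
    for r in results:
        s, n = totals.get(r.metric, (0.0, 0))
        totals[r.metric] = (s + r.score, n + 1)
    return {m: s / n for m, (s, n) in totals.items()}

def ab_compare(run_a, run_b):
    """Mean paired score difference (A minus B) over shared example ids."""
    shared = run_a.keys() & run_b.keys()
    diffs = [run_a[k] - run_b[k] for k in shared]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A positive `ab_compare` result favors run A; running the same dataset through two model configurations and comparing per-metric means is the A/B-testing use case from the list above.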

Quick Start

Use the eval-engine skill to run an evaluation on the dataset named 'dataset_abc' using accuracy and hallucination evaluators with the Codex-haiku judge model.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: eval-engine
Download link: https://github.com/mqzkim/trading/archive/main.zip#eval-engine

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
