eval-engine

Category: Community

LLM evaluation pipeline

Author: mqzkim
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill automates building and running evaluation pipelines for Large Language Models (LLMs), enabling systematic quality assessment and comparison across models.

Core Features & Use Cases

  • Dataset Management: Load and manage datasets for evaluation.
  • LLM-as-Judge Evaluation: Utilize LLMs to evaluate responses based on various criteria.
  • Customizable Metrics: Supports accuracy, relevance, hallucination, harmfulness, and Korean-specific quality.
  • A/B Testing: Facilitates comparison between different model runs or configurations.
  • Use Case: Evaluate a new chatbot's responses against a benchmark dataset using multiple metrics, identifying areas for improvement.
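The features above can be sketched in a few lines. This is a hedged illustration, not the skill's actual API: `judge` stands in for whatever LLM-as-judge client the skill uses, and the example/run formats are assumptions; only the metric names mirror the list above.

```python
# Minimal sketch of an LLM-as-judge evaluation loop plus A/B comparison.
# All names here (EvalResult, evaluate, ab_compare) are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    example_id: str
    metric: str
    score: float  # 0.0 (worst) to 1.0 (best)

def evaluate(dataset, metrics, judge: Callable[[str, str, str], float]):
    """Score every (prompt, response) pair on every metric via the judge."""
    results = []
    for ex in dataset:
        for metric in metrics:
            score = judge(metric, ex["prompt"], ex["response"])
            results.append(EvalResult(ex["id"], metric, score))
    return results

def mean_by_metric(results):
    """Aggregate per-example scores into one mean per metric."""
    totals = {}
    for r in results:
        s, n = totals.get(r.metric, (0.0, 0))
        totals[r.metric] = (s + r.score, n + 1)
    return {m: s / n for m, (s, n) in totals.items()}

def ab_compare(run_a, run_b):
    """Mean paired score difference (A minus B) over shared example ids."""
    shared = run_a.keys() & run_b.keys()
    diffs = [run_a[k] - run_b[k] for k in shared]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A positive `ab_compare` result favors run A; running the same dataset through two model configurations and comparing per-metric means is the A/B-testing use case from the list above.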

Quick Start

Use the eval-engine skill to run an evaluation on the dataset named 'dataset_abc' using accuracy and hallucination evaluators with the Codex-haiku judge model.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: eval-engine
Download link: https://github.com/mqzkim/trading/archive/main.zip#eval-engine

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
