eval-engine
CommunityLLM evaluation pipeline
Author: mqzkim
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill automates building and running evaluation pipelines for Large Language Models (LLMs), enabling systematic quality assessment and model comparison.
Core Features & Use Cases
- Dataset Management: Load and manage datasets for evaluation.
- LLM-as-Judge Evaluation: Utilize LLMs to evaluate responses based on various criteria.
- Customizable Metrics: Supports accuracy, relevance, hallucination, harmfulness, and Korean-specific quality.
- A/B Testing: Facilitates comparison between different model runs or configurations.
- Use Case: Evaluate a new chatbot's responses against a benchmark dataset using multiple metrics, identifying areas for improvement.
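The LLM-as-judge flow above can be sketched as a small loop: score every dataset example on each metric, then aggregate per-metric averages. Note this is a minimal illustrative sketch, not the skill's actual API; the function names, dataset shape, and metric strings below are all assumptions.

```python
# Hedged sketch of an LLM-as-judge evaluation loop.
# evaluate(), judge_fn, and the dataset shape are illustrative
# assumptions, not the real eval-engine interface.
from statistics import mean

def evaluate(dataset, metrics, judge_fn):
    """Score each (prompt, response) pair on every metric via the judge."""
    results = {m: [] for m in metrics}
    for example in dataset:
        for metric in metrics:
            # judge_fn stands in for an LLM call returning a score in [0, 1]
            score = judge_fn(example["prompt"], example["response"], metric)
            results[metric].append(score)
    # Aggregate per-metric averages for the final report
    return {m: mean(scores) for m, scores in results.items()}

# Toy judge used in place of a real LLM-as-judge call
def toy_judge(prompt, response, metric):
    return 1.0 if metric == "accuracy" and response else 0.5

dataset = [{"prompt": "2+2?", "response": "4"}]
report = evaluate(dataset, ["accuracy", "hallucination"], toy_judge)
```

An A/B comparison would run `evaluate` twice, once per model run, and diff the resulting per-metric averages.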
Quick Start
Use the eval-engine skill to run an evaluation on the dataset named 'dataset_abc' using accuracy and hallucination evaluators with the Codex-haiku judge model.
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill: Name: eval-engine Download link: https://github.com/mqzkim/trading/archive/main.zip#eval-engine Please download this .zip file, extract it, and install it in the .claude/skills/ directory.