ai-evaluation-suite

Community

Ensure AI quality, performance, and safety.

Author: doctorduke
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Building reliable AI systems is hard. This Skill provides a comprehensive toolkit for rigorously evaluating LLM outputs, RAG systems, and AI agents, helping you confirm that your AI performs as expected, avoids hallucinations, and is free from bias before failures reach production.

Core Features & Use Cases

  • LLM Quality Assessment: Use LLM-as-judge to score outputs on coherence, relevance, factuality, and more.
  • Hallucination & Bias Detection: Automatically identify factual inconsistencies and demographic biases in AI-generated content.
  • Cost & Performance Optimization: Track token usage, latency, and cost to optimize your AI's efficiency.
  • Use Case: You're deploying a new summarization model. Use this Skill to automatically evaluate its output quality against a test set, detect any hallucinations, and compare its cost-effectiveness against a cheaper model before going live.
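The cost-comparison step in the use case above reduces to arithmetic over token counts and latencies. A minimal sketch of such a tracker follows; the per-million-token prices are placeholder assumptions, not real model rates, and `CostTracker` is an illustrative name, not part of this Skill's actual API:

```python
# Hypothetical per-million-token prices; substitute your model's real rates.
PRICE_IN_PER_MTOK = 3.00
PRICE_OUT_PER_MTOK = 15.00

class CostTracker:
    """Accumulate token usage, latency, and cost across evaluation runs."""

    def __init__(self):
        self.calls = []

    def record(self, input_tokens: int, output_tokens: int, latency_s: float):
        # Cost = tokens * price-per-token, with prices quoted per million tokens.
        cost = (input_tokens * PRICE_IN_PER_MTOK
                + output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000
        self.calls.append({"in": input_tokens, "out": output_tokens,
                           "latency_s": latency_s, "cost_usd": cost})

    def summary(self) -> dict:
        n = len(self.calls)
        return {
            "calls": n,
            "total_cost_usd": round(sum(c["cost_usd"] for c in self.calls), 6),
            "mean_latency_s": sum(c["latency_s"] for c in self.calls) / n,
        }

# Two recorded evaluation calls, e.g. the candidate model vs. a cheaper one.
tracker = CostTracker()
tracker.record(input_tokens=1200, output_tokens=300, latency_s=0.8)
tracker.record(input_tokens=900, output_tokens=450, latency_s=1.2)
print(tracker.summary())
```

Running the same tracker against two candidate models on an identical test set gives directly comparable cost and latency summaries.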

Quick Start

Evaluate the quality of the LLM's response "Quantum entanglement is a phenomenon..." to the query "Explain quantum entanglement" using the LLMQualityEvaluator.
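This page doesn't show `LLMQualityEvaluator`'s actual interface, so the sketch below imagines a minimal shape for it: a judge callable scores the response on the rubric dimensions listed above (coherence, relevance, factuality). In practice the judge would wrap an Anthropic API call; here it is a stub so the example runs offline.

```python
from typing import Callable

class LLMQualityEvaluator:
    """Hypothetical minimal evaluator, assumed for illustration only;
    the Skill's real interface may differ."""

    DIMENSIONS = ("coherence", "relevance", "factuality")

    def __init__(self, judge: Callable[[str], int]):
        # `judge` takes a grading prompt and returns an integer 1-5 score.
        self.judge = judge

    def evaluate(self, query: str, response: str) -> dict:
        scores = {}
        for dim in self.DIMENSIONS:
            prompt = (f"Rate the {dim} of the following answer to "
                      f"'{query}' on a 1-5 scale:\n{response}\n"
                      f"Reply with a single digit.")
            # Clamp to the rubric range in case the judge misbehaves.
            scores[dim] = max(1, min(5, self.judge(prompt)))
        scores["overall"] = (sum(scores[d] for d in self.DIMENSIONS)
                             / len(self.DIMENSIONS))
        return scores

# Stub judge for demonstration; replace with a real model call.
evaluator = LLMQualityEvaluator(judge=lambda prompt: 5)
report = evaluator.evaluate(
    "Explain quantum entanglement",
    "Quantum entanglement is a phenomenon...",
)
print(report["overall"])  # → 5.0
```

Injecting the judge as a callable keeps the evaluator testable with pytest (swap in a deterministic stub) while letting production code pass a wrapper around the anthropic client.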

Dependency Matrix

Required Modules

anthropic, numpy, pytest

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: ai-evaluation-suite
Download link: https://github.com/doctorduke/claude-config/archive/main.zip#ai-evaluation-suite

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper in your Agent to search and equip skills on demand from a library of 223,000+ vetted skills.