evaluation-metrics

Measure and improve LLM quality.

Author: pluginagentmarketplace
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill provides frameworks, benchmarks, and metrics for systematically measuring and improving the quality of Large Language Models (LLMs), particularly in production environments.

Core Features & Use Cases

  • Comprehensive Metrics: Offers a suite of text generation and RAG-specific metrics (BLEU, ROUGE, BERTScore, Faithfulness, Relevancy, etc.).
  • Benchmark Suites: Integrates with standard benchmarks like MMLU and HumanEval for robust model assessment.
  • Evaluation Frameworks: Provides tools for structured evaluation, A/B testing, and hallucination detection.
  • Use Case: An AI engineer needs to compare two LLM models for a customer support chatbot. They can use this Skill to run an A/B test, evaluating metrics like faithfulness and answer relevancy on a set of test questions to determine which model performs better.
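The A/B-testing use case above can be sketched as a simple comparison harness. This is an illustrative sketch only: the model names and per-question faithfulness scores below are hypothetical, and in practice metrics such as faithfulness and answer relevancy would be produced by a library like ragas rather than hard-coded.

```python
from statistics import mean

def compare_models(scores_a: list[float], scores_b: list[float]) -> dict:
    """Compare two models' per-question metric scores and report a winner.

    scores_a/scores_b: one metric score per test question (hypothetical data).
    """
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    return {
        "model_a_mean": round(mean_a, 3),
        "model_b_mean": round(mean_b, 3),
        "winner": "model_a" if mean_a > mean_b else "model_b",
    }

# Hypothetical faithfulness scores for five test questions, per model.
faithfulness_a = [0.91, 0.85, 0.78, 0.95, 0.88]
faithfulness_b = [0.82, 0.80, 0.75, 0.90, 0.84]

result = compare_models(faithfulness_a, faithfulness_b)
print(result)  # model_a has the higher mean on these example scores
```

A real A/B test would also apply a significance test (e.g. via the scipy dependency) before declaring a winner on a small sample.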

Quick Start

Use the evaluation-metrics skill to evaluate a list of model predictions against their ground truth references using BLEU and ROUGE scores.
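To illustrate what such an evaluation computes, here is a minimal, self-contained sketch of a ROUGE-1-style unigram-recall score. The predictions and references are made up, and the function is a simplification; the skill itself would rely on the `evaluate` and `ragas` libraries listed in the dependency matrix for production-grade BLEU/ROUGE scoring.

```python
from collections import Counter

def rouge1_recall(prediction: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams that appear in the prediction."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each token's count to the smaller of the two.
    overlap = sum((pred_counts & ref_counts).values())
    return overlap / sum(ref_counts.values())

# Hypothetical model predictions and their ground-truth references.
predictions = ["the cat sat on the mat", "hello world"]
references = ["the cat is on the mat", "hello there world"]

scores = [rouge1_recall(p, r) for p, r in zip(predictions, references)]
print([round(s, 2) for s in scores])  # [0.83, 0.67]
```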

Dependency Matrix

Required Modules

  • ragas
  • datasets
  • evaluate
  • langchain
  • tenacity
  • scipy

Components

  • scripts
  • references
  • assets

💻 Claude Code Installation

Recommended: let Claude install it automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: evaluation-metrics
Download link: https://github.com/pluginagentmarketplace/custom-plugin-ai-engineer/archive/main.zip#evaluation-metrics

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
