evaluation-metrics

Measure and improve LLM quality.

Author: pluginagentmarketplace
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill provides frameworks, benchmarks, and metrics for systematically measuring and improving the quality of Large Language Models (LLMs), particularly in production environments.

Core Features & Use Cases

  • Comprehensive Metrics: Offers a suite of text generation and RAG-specific metrics (BLEU, ROUGE, BERTScore, Faithfulness, Relevancy, etc.).
  • Benchmark Suites: Integrates with standard benchmarks like MMLU and HumanEval for robust model assessment.
  • Evaluation Frameworks: Provides tools for structured evaluation, A/B testing, and hallucination detection.
  • Use Case: An AI engineer needs to compare two LLM models for a customer support chatbot. They can use this Skill to run an A/B test, evaluating metrics like faithfulness and answer relevancy on a set of test questions to determine which model performs better.
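The A/B-testing use case above can be sketched as a simple comparison harness. This is an illustrative sketch only: the model names and per-question faithfulness scores below are hypothetical, and in practice metrics such as faithfulness and answer relevancy would be produced by a library like ragas rather than hard-coded.

```python
from statistics import mean

def compare_models(scores_a: list[float], scores_b: list[float]) -> dict:
    """Compare two models' per-question metric scores and report a winner.

    scores_a/scores_b: one metric score per test question (hypothetical data).
    """
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    return {
        "model_a_mean": round(mean_a, 3),
        "model_b_mean": round(mean_b, 3),
        "winner": "model_a" if mean_a > mean_b else "model_b",
    }

# Hypothetical faithfulness scores for five test questions, per model.
faithfulness_a = [0.91, 0.85, 0.78, 0.95, 0.88]
faithfulness_b = [0.82, 0.80, 0.75, 0.90, 0.84]

result = compare_models(faithfulness_a, faithfulness_b)
print(result)  # model_a has the higher mean on these example scores
```

A real A/B test would also apply a significance test (e.g. via the scipy dependency) before declaring a winner on a small sample.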

Quick Start

Use the evaluation-metrics skill to evaluate a list of model predictions against their ground truth references using BLEU and ROUGE scores.
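To illustrate what such an evaluation computes, here is a minimal, self-contained sketch of a ROUGE-1-style unigram-recall score. The predictions and references are made up, and the function is a simplification; the skill itself would rely on the `evaluate` and `ragas` libraries listed in the dependency matrix for production-grade BLEU/ROUGE scoring.

```python
from collections import Counter

def rouge1_recall(prediction: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams that appear in the prediction."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each token's count to the smaller of the two.
    overlap = sum((pred_counts & ref_counts).values())
    return overlap / sum(ref_counts.values())

# Hypothetical model predictions and their ground-truth references.
predictions = ["the cat sat on the mat", "hello world"]
references = ["the cat is on the mat", "hello there world"]

scores = [rouge1_recall(p, r) for p, r in zip(predictions, references)]
print([round(s, 2) for s in scores])  # [0.83, 0.67]
```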

Dependency Matrix

Required Modules

  • ragas
  • datasets
  • evaluate
  • langchain
  • tenacity
  • scipy

Components

  • scripts
  • references
  • assets

💻 Claude Code Installation

Recommended: let Claude install it automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: evaluation-metrics
Download link: https://github.com/pluginagentmarketplace/custom-plugin-ai-engineer/archive/main.zip#evaluation-metrics

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
