evaluation-metrics
Community
Measure and improve LLM quality.
Author: pluginagentmarketplace
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill provides frameworks, benchmarks, and metrics for systematically measuring and improving the quality of Large Language Model (LLM) outputs, especially in production environments.
Core Features & Use Cases
- Comprehensive Metrics: Offers a suite of text generation and RAG-specific metrics (BLEU, ROUGE, BERTScore, Faithfulness, Relevancy, etc.).
- Benchmark Suites: Integrates with standard benchmarks like MMLU and HumanEval for robust model assessment.
- Evaluation Frameworks: Provides tools for structured evaluation, A/B testing, and hallucination detection.
- Use Case: An AI engineer needs to compare two LLMs for a customer support chatbot. They can use this Skill to run an A/B test, evaluating metrics like faithfulness and answer relevancy on a set of test questions to determine which model performs better (see the sketch after this list).
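A minimal sketch of that A/B comparison, assuming the ragas evaluation API (`evaluate`, `faithfulness`, `answer_relevancy`, which requires an LLM/embeddings backend such as an OpenAI API key) and a hypothetical helper `ask_model` that returns each candidate model's answer and retrieved contexts:

```python
# Hedged sketch: compare two candidate models on faithfulness and answer
# relevancy with ragas. `ask_model` is a hypothetical helper that returns
# (answer, retrieved_contexts) for a question; replace it with your own stack.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

test_questions = ["How do I reset my password?", "What is your refund policy?"]
ground_truths = [
    "Click the 'Forgot password' link on the login page and follow the email instructions.",
    "Refunds are issued within 14 days of purchase for unused licenses.",
]

def score_model(model_name):
    answers, contexts = [], []
    for question in test_questions:
        answer, retrieved = ask_model(model_name, question)  # hypothetical helper
        answers.append(answer)
        contexts.append(retrieved)
    dataset = Dataset.from_dict({
        "question": test_questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })
    # ragas computes per-metric scores over the whole dataset.
    return evaluate(dataset, metrics=[faithfulness, answer_relevancy])

print("Model A:", score_model("model-a"))
print("Model B:", score_model("model-b"))
```

Comparing the two result objects side by side shows which model stays more faithful to its retrieved context and which gives more relevant answers.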
Quick Start
Use the evaluation-metrics skill to evaluate a list of model predictions against their ground truth references using BLEU and ROUGE scores.
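A minimal sketch of that workflow, assuming the Hugging Face `evaluate` library (listed under Required Modules below); the prediction and reference strings are illustrative only:

```python
# Hedged sketch: score a list of predictions against ground-truth references
# with BLEU and ROUGE via the Hugging Face `evaluate` library.
import evaluate

predictions = ["The cat sat on the mat.", "Paris is the capital of France."]
references = [["The cat is sitting on the mat."], ["Paris is the capital city of France."]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

bleu_scores = bleu.compute(predictions=predictions, references=references)
rouge_scores = rouge.compute(predictions=predictions, references=references)

print("BLEU:", bleu_scores["bleu"])        # corpus-level BLEU score
print("ROUGE-L:", rouge_scores["rougeL"])  # longest-common-subsequence ROUGE
```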
Dependency Matrix
Required Modules
ragas, datasets, evaluate, langchain, tenacity, scipy
Components
scripts, references, assets
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: evaluation-metrics
Download link: https://github.com/pluginagentmarketplace/custom-plugin-ai-engineer/archive/main.zip#evaluation-metrics
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.