llm-eval

Community

Measure and improve LLM performance.

Author: Luis Sambrano
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill addresses the critical need for robust evaluation of Large Language Model (LLM) applications, ensuring quality, performance, and reliability.

Core Features & Use Cases

  • Automated Metrics: Utilize metrics like BLEU, ROUGE, BERTScore, Accuracy, Precision, Recall, F1, MRR, and NDCG for quantitative assessment.
  • Human Evaluation: Incorporate human judgment on dimensions such as accuracy, coherence, relevance, fluency, safety, and helpfulness.
  • LLM-as-Judge: Leverage powerful LLMs to evaluate outputs, enabling scalable qualitative assessment (see the sketch after this list).
  • A/B Testing & Regression: Facilitate controlled experiments and continuous monitoring for performance regressions.
  • Use Case: When deploying a new chatbot, use this Skill to systematically measure its response quality against a benchmark dataset using a combination of automated metrics and human review, ensuring it meets performance targets before going live.
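The LLM-as-Judge workflow can be sketched with the openai package listed under Required Modules. This is a minimal illustration, assuming the OpenAI Python client (v1+) and an API key in the environment; the model name, rubric, and `judge_response` helper are illustrative assumptions, not the skill's built-in prompt or API.

```python
# Minimal LLM-as-judge sketch. Assumes OPENAI_API_KEY is set; the judge model
# and 1-5 helpfulness rubric are hypothetical choices, not prescribed by the skill.
from openai import OpenAI

client = OpenAI()

def judge_response(question: str, answer: str) -> str:
    """Ask a judge model to rate an answer from 1 to 5 for helpfulness and accuracy."""
    prompt = (
        "Rate the following answer on a 1-5 scale for helpfulness and accuracy. "
        "Reply with the number only.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(judge_response("What is BLEU?", "BLEU is an n-gram overlap metric."))
```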

Quick Start

Use the llm-eval skill to evaluate your model's accuracy and BLEU score against a set of test cases.
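A minimal sketch of what such an evaluation could look like, using nltk and scikit-learn from the Required Modules list. The test cases, labels, and texts below are hypothetical placeholders, not data shipped with the skill.

```python
# Accuracy over hypothetical classification-style test cases, plus corpus BLEU
# over hypothetical generated text vs. reference text.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from sklearn.metrics import accuracy_score

expected_labels = ["positive", "negative", "positive"]   # placeholder gold labels
predicted_labels = ["positive", "negative", "negative"]  # placeholder model outputs

references = [["the cat sat on the mat".split()]]        # placeholder reference text
hypotheses = ["the cat is on the mat".split()]           # placeholder model text

accuracy = accuracy_score(expected_labels, predicted_labels)
bleu = corpus_bleu(references, hypotheses,
                   smoothing_function=SmoothingFunction().method1)

print(f"Accuracy: {accuracy:.2f}")
print(f"BLEU: {bleu:.3f}")
```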

Dependency Matrix

Required Modules

nltk, rouge_score, bert_score, transformers, detoxify, openai, scikit-learn, scipy, numpy

Components

scripts, references, assets

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: llm-eval
Download link: https://github.com/LuisSambrano/antigravity-config/archive/main.zip#llm-eval

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
