llm-eval
Community · Measure and improve LLM performance.
Author: LuisSambrano
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill provides a structured way to evaluate Large Language Model (LLM) applications, so you can measure their quality, performance, and reliability rather than rely on spot checks.
Core Features & Use Cases
- Automated Metrics: Utilize metrics like BLEU, ROUGE, BERTScore, Accuracy, Precision, Recall, F1, MRR, and NDCG for quantitative assessment.
- Human Evaluation: Incorporate human judgment on dimensions such as accuracy, coherence, relevance, fluency, safety, and helpfulness.
- LLM-as-Judge: Leverage powerful LLMs to evaluate outputs, enabling scalable qualitative assessment.
- A/B Testing & Regression: Facilitate controlled experiments and continuous monitoring for performance regressions.
- Use Case: When deploying a new chatbot, use this Skill to systematically measure its response quality against a benchmark dataset using a combination of automated metrics and human review, ensuring it meets performance targets before going live.
Quick Start
Use the llm-eval skill to evaluate your model's accuracy and BLEU score against a set of test cases.
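The Quick Start flow can be approximated in pure Python. This sketch computes exact-match accuracy and a clipped unigram precision (the 1-gram core of BLEU, without smoothing or brevity penalty); the skill itself presumably uses nltk for full BLEU, and the helper names here are illustrative, not the skill's API.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the 1-gram component of BLEU."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matches = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    return matches / len(cand)

def evaluate_cases(cases):
    """cases: list of (model_output, reference) pairs; returns averaged metrics."""
    acc = sum(out.strip() == ref.strip() for out, ref in cases) / len(cases)
    prec = sum(unigram_precision(o, r) for o, r in cases) / len(cases)
    return {"accuracy": acc, "unigram_precision": prec}

cases = [
    ("Paris is the capital of France", "Paris is the capital of France"),
    ("The answer is 42", "42 is the answer"),
]
print(evaluate_cases(cases))  # second case misses exact match but keeps full overlap
```

Exact match penalizes any deviation from the reference, while n-gram overlap credits partial matches; reporting both, as the skill's test-case evaluation suggests, gives a fuller picture.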
Dependency Matrix
Required Modules
nltk, rouge_score, bert_score, transformers, detoxify, openai, scikit-learn, scipy, numpy
Components
scripts, references, assets
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: llm-eval
Download link: https://github.com/LuisSambrano/antigravity-config/archive/main.zip#llm-eval
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.