evaluating-llms
Elevate LLM quality with robust evaluation.
Category: Software Engineering
Tags: llm-as-judge, llm evaluation, faithfulness, ragas, benchmark testing, safety testing
Author: ancoleman
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill provides a comprehensive framework for evaluating LLM performance, helping you verify that your AI applications are accurate, reliable, and safe.
Core Features & Use Cases
- Automated Metrics: Score generation tasks with reference-based metrics such as BLEU, ROUGE, and BERTScore (see the first sketch after this list).
- LLM-as-Judge: Employ a strong LLM to grade nuanced quality criteria against a custom rubric (second sketch below).
- RAG Evaluation: Measure faithfulness, answer relevance, and context quality using the RAGAS framework (see the Quick Start example).
- Safety Testing: Detect hallucinations, bias, and toxicity in LLM outputs (third sketch below).
- Benchmark Testing: Assess models against standard suites such as MMLU and HumanEval (fourth sketch below).
- Use Case: You've built a RAG system and need to ensure its answers are factual and relevant. Use this Skill's faithfulness and relevance metrics to validate its performance before deployment.
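As a taste of the automated-metrics path, here is a minimal sketch of reference-based BLEU scoring. It assumes NLTK is installed (it is not in the dependency matrix below), and the example strings and smoothing choice are illustrative only:

```python
# Minimal reference-based BLEU sketch; assumes `pip install nltk`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The Eiffel Tower is located in Paris, France.".split()
candidate = "The Eiffel Tower is in Paris.".split()

# Smoothing keeps short candidates from scoring zero when a
# higher-order n-gram never matches.
score = sentence_bleu(
    [reference], candidate,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```

ROUGE and BERTScore follow the same pattern: a candidate string scored against one or more references, with the library handling tokenization and matching.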
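For LLM-as-Judge, here is a hedged sketch using the openai client from the dependency matrix. The judge model name and rubric wording are assumptions; adapt both to your deployment:

```python
# Hedged LLM-as-judge sketch with a custom rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Rate the ANSWER to the QUESTION on a 1-5 scale for factual
accuracy and completeness. Reply with only the integer score."""

def judge(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap in your own
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"},
        ],
        temperature=0,  # deterministic grading
    )
    return int(resp.choices[0].message.content.strip())

print(judge("What year did Apollo 11 land on the Moon?", "1969."))
```

Pinning temperature to 0 and constraining the reply format makes scores reproducible enough to track across regression runs.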
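For safety testing, a sketch using deepeval from the dependency matrix. The metric name and threshold follow deepeval's documented pattern but may differ across versions, and deepeval's built-in metrics are themselves LLM-graded (an OPENAI_API_KEY is assumed by default):

```python
# Hedged toxicity-check sketch with deepeval; metric name and threshold
# semantics are assumptions to verify against your installed version.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ToxicityMetric

test_case = LLMTestCase(
    input="Summarize the customer complaint.",
    actual_output="The customer reported a late delivery and asked for a refund.",
)
metric = ToxicityMetric(threshold=0.5)  # lower scores = less toxic
metric.measure(test_case)
print(metric.score, metric.reason)
```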
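For benchmark runs, lm-eval (the EleutherAI evaluation harness in the dependency matrix) exposes a Python entry point. The model, task choice, and limit below are placeholders, and the exact signature varies by harness version:

```python
# Hedged benchmark-run sketch (lm-eval v0.4-style API). Full MMLU is
# large, so `limit` caps examples per task for a quick smoke test.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                       # HuggingFace backend
    model_args="pretrained=gpt2",     # placeholder model
    tasks=["mmlu_abstract_algebra"],  # one MMLU subtask as a smoke test
    limit=10,
)
print(results["results"])
```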
Quick Start
Use the evaluating-llms skill to run a RAGAS faithfulness check on your system's output.
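A minimal sketch of that check, following the pre-1.0 ragas evaluate() interface (newer releases rework the dataset schema). Faithfulness is LLM-graded, so an OPENAI_API_KEY is assumed in the environment; the sample data is illustrative:

```python
# Hedged RAGAS faithfulness sketch (older evaluate() interface).
# `datasets` ships as a ragas dependency.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

data = {
    "question": ["Where is the Eiffel Tower?"],
    "answer": ["The Eiffel Tower is in Paris, France."],
    "contexts": [["The Eiffel Tower stands on the Champ de Mars in Paris."]],
}
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness])
print(result)  # a faithfulness of 1.0 means fully grounded in the contexts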
Dependency Matrix
Required Modules
ragas, deepeval, lm-eval, openai, anthropic, scikit-learn
Components
scripts, references
💻 Claude Code Installation
Recommended: let Claude install it automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: evaluating-llms
Download link: https://github.com/ancoleman/ai-design-components/archive/main.zip#evaluating-llms
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.