evaluating-llms
Elevate LLM quality with robust evaluation.
Category: Software Engineering
Tags: llm-as-judge, llm evaluation, faithfulness, ragas, benchmark testing, safety testing
Author: ancoleman
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill provides a comprehensive framework for evaluating LLM performance, helping you verify that your AI applications are accurate, reliable, and safe.
Core Features & Use Cases
- Automated Metrics: Score generation tasks with reference-based metrics such as BLEU, ROUGE, and BERTScore (see the first sketch after this list).
- LLM-as-Judge: Employ a strong LLM to grade nuanced quality criteria against a custom rubric (second sketch below).
- RAG Evaluation: Measure faithfulness, answer relevance, and context quality using the RAGAS framework (see the Quick Start example).
- Safety Testing: Detect hallucinations, bias, and toxicity in LLM outputs (third sketch below).
- Benchmark Testing: Assess models against standard suites such as MMLU and HumanEval (fourth sketch below).
- Use Case: You've built a RAG system and need to ensure its answers are factual and relevant. Use this Skill's faithfulness and relevance metrics to validate its performance before deployment.
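As a taste of the automated-metrics path, here is a minimal sketch of reference-based BLEU scoring. It assumes NLTK is installed (it is not in the dependency matrix below), and the example strings and smoothing choice are illustrative only:

```python
# Minimal reference-based BLEU sketch; assumes `pip install nltk`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The Eiffel Tower is located in Paris, France.".split()
candidate = "The Eiffel Tower is in Paris.".split()

# Smoothing keeps short candidates from scoring zero when a
# higher-order n-gram never matches.
score = sentence_bleu(
    [reference], candidate,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```

ROUGE and BERTScore follow the same pattern: a candidate string scored against one or more references, with the library handling tokenization and matching.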
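For LLM-as-Judge, here is a hedged sketch using the openai client from the dependency matrix. The judge model name and rubric wording are assumptions; adapt both to your deployment:

```python
# Hedged LLM-as-judge sketch with a custom rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Rate the ANSWER to the QUESTION on a 1-5 scale for factual
accuracy and completeness. Reply with only the integer score."""

def judge(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap in your own
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"},
        ],
        temperature=0,  # deterministic grading
    )
    return int(resp.choices[0].message.content.strip())

print(judge("What year did Apollo 11 land on the Moon?", "1969."))
```

Pinning temperature to 0 and constraining the reply format makes scores reproducible enough to track across regression runs.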
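For safety testing, a sketch using deepeval from the dependency matrix. The metric name and threshold follow deepeval's documented pattern but may differ across versions, and deepeval's built-in metrics are themselves LLM-graded (an OPENAI_API_KEY is assumed by default):

```python
# Hedged toxicity-check sketch with deepeval; metric name and threshold
# semantics are assumptions to verify against your installed version.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ToxicityMetric

test_case = LLMTestCase(
    input="Summarize the customer complaint.",
    actual_output="The customer reported a late delivery and asked for a refund.",
)
metric = ToxicityMetric(threshold=0.5)  # lower scores = less toxic
metric.measure(test_case)
print(metric.score, metric.reason)
```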
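For benchmark runs, lm-eval (the EleutherAI evaluation harness in the dependency matrix) exposes a Python entry point. The model, task choice, and limit below are placeholders, and the exact signature varies by harness version:

```python
# Hedged benchmark-run sketch (lm-eval v0.4-style API). Full MMLU is
# large, so `limit` caps examples per task for a quick smoke test.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                       # HuggingFace backend
    model_args="pretrained=gpt2",     # placeholder model
    tasks=["mmlu_abstract_algebra"],  # one MMLU subtask as a smoke test
    limit=10,
)
print(results["results"])
```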
Quick Start
Use the evaluating-llms skill to run a RAGAS faithfulness check on your system's output.
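A minimal sketch of that check, following the pre-1.0 ragas evaluate() interface (newer releases rework the dataset schema). Faithfulness is LLM-graded, so an OPENAI_API_KEY is assumed in the environment; the sample data is illustrative:

```python
# Hedged RAGAS faithfulness sketch (older evaluate() interface).
# `datasets` ships as a ragas dependency.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

data = {
    "question": ["Where is the Eiffel Tower?"],
    "answer": ["The Eiffel Tower is in Paris, France."],
    "contexts": [["The Eiffel Tower stands on the Champ de Mars in Paris."]],
}
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness])
print(result)  # a faithfulness of 1.0 means fully grounded in the contexts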
Dependency Matrix
Required Modules
ragas, deepeval, lm-eval, openai, anthropic, scikit-learn
Components
scripts, references
💻 Claude Code Installation
Recommended: let Claude install it automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: evaluating-llms
Download link: https://github.com/ancoleman/ai-design-components/archive/main.zip#evaluating-llms
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.