Skill Explorer

Searching protocol for "benchmark tasks"

benchmark-audit

Official

Audit benchmark quality and validity.

Few Config

by sourcegraph

llm-evaluation

Community

Benchmark LLMs with standard tasks and backends.

Advanced

by tylertitsworth

sync-metadata

Official

Keep task metadata in sync.

No Config

by sourcegraph

prompt-benchmark

Community

Benchmark prompts with measurable results.

Advanced

by manutej

score-tasks

Official

Evaluate benchmark task quality.

Few Config

by sourcegraph

tbench

Community

Benchmark AI agents with Terminal-Bench.

Advanced

by neilmovva

lm-evaluation-harness

Community

Benchmark LLMs with 60+ standardized tasks.

Advanced

by ovachiever

benchmark-driven-improvement

Official

Systematic benchmark fixes for Serf.

Advanced

by prime-radiant-inc

tbench

Community

Benchmark Unix agents with Terminal-Bench.

Advanced

by onchainengineer

quick-rerun

Official

Verify fixes locally and fast.

Few Config

by sourcegraph

eval-recipes Runner Skill

Community

Benchmark amplihack improvements with eval-recipes.

Advanced

by rysweet

benchmark-logging

Community

Consistent benchmarks with auditable logs.

Few Config

by drpedapati